# Dataset Context and Exploration

**age**  
- Demographic: Age  
- Description: Age of the individual, in years  
- Data Type: Integer  
- Missing Values: No  

**workclass**  
- Demographic: Income  
- Description: Employment classification  
- Possible values: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked  
- Data Type: Categorical  
- Missing Values: Yes  

**fnlwgt**  
- Demographic:  
- Description: Final weight representing the number of people the census believes the entry represents  
- Data Type: Integer  
- Missing Values: No  

**education**  
- Demographic: Education Level  
- Description: Highest level of education achieved  
- Possible values: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool  
- Data Type: Categorical  
- Missing Values: No  

**education-num**  
- Demographic: Education Level  
- Description: Numeric representation of education level  
- Data Type: Integer  
- Missing Values: No  

**marital-status**  
- Demographic: Other  
- Description: Marital status of the individual  
- Possible values: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse  
- Data Type: Categorical  
- Missing Values: No  

**occupation**  
- Demographic: Other  
- Description: Type of job  
- Possible values: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces  
- Data Type: Categorical  
- Missing Values: Yes  

**relationship**  
- Demographic: Other  
- Description: Relationship within household  
- Possible values: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried  
- Data Type: Categorical  
- Missing Values: No  

**race**  
- Demographic: Race  
- Description: Race of the individual  
- Possible values: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black  
- Data Type: Categorical  
- Missing Values: No  

**sex**  
- Demographic: Sex  
- Description: Gender of the individual  
- Possible values: Female, Male  
- Data Type: Binary  
- Missing Values: No  

**capital-gain**  
- Demographic:  
- Description: Income from capital gains  
- Data Type: Integer  
- Missing Values: No  

**capital-loss**  
- Demographic:  
- Description: Capital losses  
- Data Type: Integer  
- Missing Values: No  

**hours-per-week**  
- Demographic:  
- Description: Number of hours worked per week  
- Data Type: Integer  
- Missing Values: No  

**native-country**  
- Demographic: Other  
- Description: Country of origin  
- Possible values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands  
- Data Type: Categorical  
- Missing Values: Yes  

**income**  
- Demographic: Income  
- Description: Income class  
- Possible values: >50K, <=50K  
- Data Type: Binary  
- Missing Values: No  



In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Missing Values

In [3]:
data = pd.read_csv('base_data/adultdataset.csv', na_values='NaN', skipinitialspace=True)

# Identify and count the missing values in each column; missing values are represented by "NaN"
missing_values = data.isna().sum()
print(missing_values)

age                  0
workclass         2799
fnlwgt               0
education            0
education-num        0
marital-status       0
occupation        2809
relationship         0
race                 0
sex                  0
capital-gain         0
capital-loss         0
hours-per-week       0
native-country     857
income               0
dtype: int64


# Total fnlwgt

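`fnlwgt` represents the weight a given row carries within the whole dataset. Dividing a row's `fnlwgt` by the total `fnlwgt` gives its share of the represented population, which works like a percentage.
From this point forward, when we describe something as weighted, we mean it is weighted relative to the total `fnlwgt`.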

In [4]:
total_fnlwgt = data['fnlwgt'].sum()
print(total_fnlwgt)


9263575662


# Weighted Age Distribution

In [5]:
age_bins = [(17, 26), (27, 36), (37, 46), (47, 56), (57, 66), (67, 76), (77, 86), (87, 99)]

age_distribution = (data.groupby('age')['fnlwgt'].sum() / total_fnlwgt).to_dict()

# Aggregate into custom age ranges
age_ranges = {}
for start, end in age_bins:
    label = f'{start}-{end}'
    age_ranges[label] = sum(age_distribution.get(age, 0) for age in range(start, end + 1))

print(age_ranges)

{'17-26': 0.23254280804728858, '27-36': 0.27141756733459493, '37-46': 0.23722836582580933, '47-56': 0.15117445207951713, '57-66': 0.07870687525021912, '67-76': 0.02373170685071477, '77-86': 0.004034587978086513, '87-99': 0.001163636633769635}


# Weighted Education Distribution

In [6]:

education_distribution = (data.groupby('education')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(education_distribution)

{'10th': 0.029468504815025498, '11th': 0.03816294386736559, '12th': 0.01400998833877717, '1st-4th': 0.0062664863027056044, '5th-6th': 0.012629909796056028, '7th-8th': 0.019338862285637366, '9th': 0.016240940376528228, 'Assoc-acdm': 0.03347668679083759, 'Assoc-voc': 0.0399183750953639, 'Bachelors': 0.163176883759985, 'Doctorate': 0.011804269538010278, 'HS-grad': 0.32139634430936437, 'Masters': 0.0520525309657691, 'Preschool': 0.0021408804465594703, 'Prof-school': 0.016798332164364632, 'Some-college': 0.22311806114765018}


# Weighted Race Distribution

In [7]:

race_distribution = (data.groupby('race')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(race_distribution)

{'Amer-Indian-Eskimo': 0.00609399275827926, 'Asian-Pac-Islander': 0.026202793700461278, 'Black': 0.11656649596219383, 'Other': 0.008577742860778288, 'White': 0.8425589747182873}


# Weighted Native Country Distribution

In [8]:
ncountry_distribution = (data.groupby('native-country')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(ncountry_distribution)

{'Cambodia': 0.0006054133095717787, 'Canada': 0.0035612337183499113, 'China': 0.002275493585750992, 'Columbia': 0.001998964619672796, 'Cuba': 0.003584282917470276, 'Dominican-Republic': 0.002264667852399304, 'Ecuador': 0.000867478746135166, 'El-Salvador': 0.004194289701695103, 'England': 0.0025167153430435273, 'France': 0.0007650541495625774, 'Germany': 0.004291807337752816, 'Greece': 0.000795958306925307, 'Guatemala': 0.0024352646130521775, 'Haiti': 0.0017626972128033124, 'Holand-Netherlands': 3.009852892374422e-06, 'Honduras': 0.0005169305217253592, 'Hong': 0.0006895137723332388, 'Hungary': 0.00040688543360871404, 'India': 0.0026994449996861265, 'Iran': 0.0012345983254516651, 'Ireland': 0.000583518308397231, 'Italy': 0.0020298072457196714, 'Jamaica': 0.002418630971181892, 'Japan': 0.0019346626674100781, 'Laos': 0.0005085181113512882, 'Mexico': 0.02920982867446892, 'Nicaragua': 0.0015055085108452585, 'Outlying-US(Guam-USVI-etc)': 0.00046019119997985937, 'Peru': 0.0013488914492551592, 

# Weighted Workclass Distribution

In [9]:
workclass_distribution = (data.groupby('workclass')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(workclass_distribution)

{'Federal-gov': 0.028380069488549937, 'Local-gov': 0.06437528452930555, 'Never-worked': 0.00023212775265828018, 'Private': 0.7051966278850047, 'Self-emp-inc': 0.03275067868712141, 'Self-emp-not-inc': 0.07319917748192728, 'State-gov': 0.03890616400732109, 'Without-pay': 0.00038062581109622506}


# Weighted Income Distribution

In [10]:
income_distribution = (data.groupby('income')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(income_distribution)

{'<=50K': 0.7622240390354357, '>50K': 0.2377759609645643}


# Weighted Sex Distribution

In [28]:
sex_distribution = (data.groupby('sex')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(sex_distribution)

{'Female': 0.32424719304894256, 'Male': 0.6757528069510574}


# Weighted Relationship Distribution

In [12]:
relationship_distribution = (data.groupby('relationship')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(relationship_distribution)

{'Husband': 0.39841025891757864, 'Not-in-family': 0.25856131437726904, 'Other-relative': 0.03308744508422627, 'Own-child': 0.15858494253209673, 'Unmarried': 0.10587409719372355, 'Wife': 0.04548194189510577}


# Weighted Average Hours per Week Distribution

In [18]:
hours_per_week_distribution = (data.groupby('hours-per-week')['fnlwgt'].sum() / total_fnlwgt).to_dict()
print(hours_per_week_distribution)
sorted_hours_per_week = dict(sorted(hours_per_week_distribution.items(), key=lambda item: item[1], reverse=True))
print(sorted_hours_per_week)

hours_per_week_distribution_grouped = {"<40":0, "40":0, ">40":0}
for key, value in hours_per_week_distribution.items():
    if key < 40:
        hours_per_week_distribution_grouped["<40"] += value
    elif key == 40:
        hours_per_week_distribution_grouped["40"] += value
    else:
        hours_per_week_distribution_grouped[">40"] += value
print(hours_per_week_distribution_grouped)


average_hours_per_week = 0
for key, value in hours_per_week_distribution.items():
    average_hours_per_week += key * value
print("Average Hours:" , average_hours_per_week)

{1: 0.0005355909187801483, 2: 0.001110879143807656, 3: 0.0011775734768106653, 4: 0.001502640719727734, 5: 0.001782997041601753, 6: 0.0017477536310751623, 7: 0.0008876964252294569, 8: 0.004306831773788993, 9: 0.0005425860578457053, 10: 0.00822093884464027, 11: 0.00029105478255659766, 12: 0.00464480234954009, 13: 0.0005934932903449739, 14: 0.001068289757765461, 15: 0.012683411491090814, 16: 0.005952873923857416, 17: 0.0009196336610004795, 18: 0.002499484415610746, 19: 0.00037384246929681955, 20: 0.03775265175785208, 21: 0.0009668757860683624, 22: 0.0012101099412383685, 23: 0.000869856013920686, 24: 0.007465276748769972, 25: 0.020016737139702544, 26: 0.0008324194977574308, 27: 0.0008371259957222337, 28: 0.0029465861775135357, 29: 0.0003484678182358651, 30: 0.035563565627408374, 31: 0.00024778768844258833, 32: 0.008798474473991942, 33: 0.0013127072572940571, 34: 0.0009296414596500113, 35: 0.039750254268501276, 36: 0.007376472162936601, 37: 0.004874414119077986, 38: 0.014070435839875153, 39

# Class Balance of the Dataset

Check the distribution of the target variable (the class you’re trying to predict).

In [26]:
data['income'].value_counts(normalize=True)


income
<=50K    0.760718
>50K     0.239282
Name: proportion, dtype: float64

## **Preprocessing the Dataset**

In this section, we handle the preprocessing of the Adult Income dataset to prepare it for machine learning.

### **Handling Missing Values**

The dataset contains some missing values, represented as `"NaN"`, particularly in categorical features like `workclass`, `occupation`, and `native-country`. We will address them using two strategies:

- **Removing rows with missing values** to eliminate uncertainty.
- **Replacing missing values** with the most frequent value (mode) in the respective column.
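
The two strategies above can be sketched as follows. This is a minimal illustration on a small made-up frame (the `toy` data and the `cat_cols` list are assumptions, not the real dataset), and it is not necessarily how the final pipeline will do it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["Private", np.nan, "Private", "State-gov", np.nan],
    "occupation": ["Sales", np.nan, "Sales", "Adm-clerical", "Tech-support"],
})

# Strategy 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Strategy 2: fill each categorical column with its most frequent value.
cat_cols = ["workclass", "occupation"]
modes = toy[cat_cols].mode().iloc[0]  # first mode of each column
imputed = toy.copy()
imputed[cat_cols] = imputed[cat_cols].fillna(modes)

print(len(toy), len(dropped))
print(imputed)
```

When a train/test split is used, it is usually safer to compute the modes on the training portion only and reuse them for the test portion, so that no information leaks from the test set.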

### **Encoding Categorical Features**

Machine learning models typically require numerical input. Therefore, we will encode categorical features such as `education`, `marital-status`, and `occupation` using:

- **One-hot encoding** for features with no ordinal relationship.
- **Label encoding** for features with inherent order (if applicable).
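
As a rough sketch of both encodings, the example below uses a made-up frame. The `education_order` list is a hypothetical, shortened ordering chosen only for illustration; in the real dataset the `education-num` column already provides a numeric code for education.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
    "occupation": ["Sales", "Tech-support", "Sales", "Craft-repair"],
})

# One-hot encoding for a feature with no natural order.
one_hot = pd.get_dummies(toy["occupation"], prefix="occupation", dtype=int)

# Ordinal (label) encoding with an explicit order.
# Assumption: this three-level order is hypothetical and only for the example.
education_order = ["HS-grad", "Bachelors", "Masters"]
education_cat = pd.Categorical(
    toy["education"], categories=education_order, ordered=True
)
toy["education_code"] = education_cat.codes

encoded = pd.concat([toy[["education_code"]], one_hot], axis=1)
print(encoded)
```

Spelling out the category order keeps the integer codes meaningful; an encoder that orders labels alphabetically would place `Bachelors` before `HS-grad`.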

### **Transforming Numerical Features**

Some continuous numerical features contain extreme or highly granular values that could introduce noise:

- `hours-per-week` includes values as low as 1 and as high as 99 — we may **clip or bin these values** into broader categories (e.g., part-time, full-time, overtime).
- `capital-gain` and `capital-loss` have **many rare and highly specific values**, most of which are zero — we may **bucketize or simplify** these into categories like "none", "low", and "high".
- We will also consider removing these features completely to see what results we achieve.
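
One possible way to bin and clip these features is sketched below on made-up values. The cut-offs (`HIGH_GAIN_THRESHOLD`, the clipping bounds of 10 and 60) are hypothetical placeholders, not values derived from the data; the hours grouping reuses the below 40, exactly 40, above 40 split from the exploration above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "hours-per-week": [1, 20, 40, 45, 99],
    "capital-gain": [0, 0, 2174, 0, 99999],
})

# Bin hours into three groups: below 40, exactly 40, above 40.
toy["hours_group"] = pd.cut(
    toy["hours-per-week"],
    bins=[0, 39, 40, np.inf],  # intervals (0, 39], (39, 40], (40, inf)
    labels=["part-time", "full-time", "overtime"],
)

# Alternative: clip extreme values instead of binning.
# Assumption: the bounds 10 and 60 are hypothetical.
toy["hours_clipped"] = toy["hours-per-week"].clip(lower=10, upper=60)

# Bucketize capital gains into "none", "low" and "high".
HIGH_GAIN_THRESHOLD = 5000  # hypothetical cut-off, to be chosen from the data
toy["capital_gain_group"] = pd.cut(
    toy["capital-gain"],
    bins=[-np.inf, 0, HIGH_GAIN_THRESHOLD, np.inf],
    labels=["none", "low", "high"],
)

print(toy)
```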

### **Splitting the Data**

We will explore two approaches to splitting the dataset for training and testing:

- Use the **original train/test split** provided with the dataset.
- Perform **random train/test splits** (e.g., 80/20 or 70/30) using `train_test_split` for experimentation and validation.
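
A random split could look like the sketch below, which uses a small synthetic frame rather than the real data. Passing `stratify=y` is an optional choice made here so that both parts keep roughly the same income class proportions; `random_state=42` is an arbitrary seed for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset: 15 rows of "<=50K" and 5 rows of ">50K".
toy = pd.DataFrame({
    "age": list(range(20, 40)),
    "income": ["<=50K"] * 15 + [">50K"] * 5,
})
X = toy.drop(columns="income")
y = toy["income"]

# 80/20 split that preserves the class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
print(y_test.value_counts())
```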

### **Optional: Addressing Class Imbalance**

If we find that the dataset is imbalanced (e.g., far more samples with income ≤50K than >50K), we may apply techniques to balance the classes:

- **Oversampling** the minority class.
- **Undersampling** the majority class.
- Using synthetic data generation methods like **SMOTE (Synthetic Minority Over-sampling Technique)**.
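
Random oversampling and undersampling can be sketched with pandas alone, as below, on a made-up training frame. SMOTE itself is provided by the separate imbalanced-learn package and is not shown here.

```python
import pandas as pd

# Toy training frame: 8 majority rows and 2 minority rows (illustrative only).
train = pd.DataFrame({
    "age": range(20, 30),
    "income": ["<=50K"] * 8 + [">50K"] * 2,
})
majority = train[train["income"] == "<=50K"]
minority = train[train["income"] == ">50K"]

# Random oversampling: draw minority rows with replacement until both
# classes have the same size, then shuffle the result.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
oversampled = (
    pd.concat([majority, minority_up])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

# Random undersampling: keep only as many majority rows as minority rows.
majority_down = majority.sample(n=len(minority), replace=False, random_state=42)
undersampled = pd.concat([majority_down, minority]).reset_index(drop=True)

print(oversampled["income"].value_counts())
print(undersampled["income"].value_counts())
```

Resampling is normally applied to the training split only, so that the test set keeps the original class proportions.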


In [None]:
# TODO: implement the preprocessing steps described above

## **Model Implementation**

In this section, we will implement and evaluate various machine learning models to predict whether an individual's income exceeds $50K based on the features in the dataset. The objective is to confirm and apply machine learning theory learned in class, deepen our understanding of classification models, and identify the most effective approach for this problem.

### **Implemented Models**

We will start by implementing the following classification models:

- **Logistic Regression**  
  A simple and interpretable linear model; it's a good baseline for binary classification tasks.

- **Random Forest Classifier**  
  A robust ensemble method that reduces overfitting; chosen for its ability to handle non-linear relationships and mixed feature types.

- **Gradient Boosting (XGBoost or LightGBM)**  
  A highly accurate boosting algorithm; selected for its strong performance in structured/tabular data and ability to capture complex patterns.
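
A minimal sketch of fitting the three models side by side is shown below. It runs on synthetic data from `make_classification` instead of the preprocessed dataset, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost or LightGBM so that the example needs only one library. All hyperparameter values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem with roughly a 76/24 class split (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5,
    weights=[0.76, 0.24], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    # Scaling helps the logistic regression solver converge.
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # Stand-in for XGBoost / LightGBM.
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit every model and record its accuracy on the held-out part.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)

print(test_accuracy)
```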

### **Optional Models (for Extended Work)**

For further exploration and comparison, the following models may be implemented:

- **Support Vector Machine (SVM)**  
  Effective in high-dimensional spaces; useful for classification with clear margins of separation.

- **K-Nearest Neighbors (KNN)**  
  A simple, instance-based method that can perform well with normalized data and lower dimensions.

- **Neural Network (MLPClassifier)**  
  A basic feed-forward neural network; useful for modeling non-linear relationships in the data.

- **Naive Bayes**  
  Fast and efficient, especially on categorical data; good as a quick baseline for comparison.

### **Hyperparameter Tuning and Model Selection (Optional)**

To improve model performance, we may apply:

- **Cross-validation**  
  Using k-fold cross-validation to evaluate model reliability across different data splits.

- **Grid Search or Randomized Search**  
  For systematic tuning of hyperparameters, especially in Random Forest and Gradient Boosting models.
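
The sketch below combines both ideas on synthetic data: k-fold cross-validation of a baseline Random Forest, followed by a small grid search. The parameter grid and the choice of F1 as the scoring metric are assumptions made for the example, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic binary problem (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# 5-fold stratified cross-validation, shuffled with a fixed seed.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: one F1 score per fold for a fixed configuration.
baseline_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv, scoring="f1",
)

# Grid search over a deliberately tiny, hypothetical grid.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, cv=cv, scoring="f1"
)
search.fit(X, y)

print(baseline_scores.mean())
print(search.best_params_, search.best_score_)
```

`RandomizedSearchCV` accepts the same estimator and cross-validation arguments and samples a fixed number of combinations, which tends to scale better when the grid is large.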

### **Evaluation Metrics**

All models will be evaluated using key classification metrics to gain a comprehensive view of performance:

- **Accuracy**  
  The proportion of total predictions that are correct. Good for balanced datasets.

- **Precision**  
  The proportion of positive predictions that were actually correct. Important when false positives are costly.

- **Recall**  
  The proportion of actual positives that were correctly predicted. Important when false negatives are costly.

- **F1-Score**  
  The harmonic mean of precision and recall. Useful when you need a balance between precision and recall.

- **ROC AUC Score**  
  Measures the ability of the model to distinguish between classes across different thresholds. A higher score indicates better overall classification performance.
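
To make the definitions concrete, the sketch below computes each metric with `sklearn.metrics` on eight hand-made labels, where `1` stands for `>50K`. The labels and scores are invented for illustration: they give 3 true positives, 1 false positive, 1 false negative, and 3 true negatives.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hand-made labels (1 = ">50K", 0 = "<=50K"); illustrative only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# Predicted probability of the positive class, needed for ROC AUC.
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
roc_auc = roc_auc_score(y_true, y_score)     # 15 of 16 pairs ranked correctly

print(accuracy, precision, recall, f1, roc_auc)
```

Note that ROC AUC is computed from scores or probabilities rather than from hard predictions, which is why it takes `y_score` instead of `y_pred`.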

---

Through this modeling process, we aim to validate theoretical concepts covered in class, enhance practical skills with real-world data, and gain insights into which algorithms are most suitable for income prediction tasks.


In [None]:
# TODO: implement the models and evaluation described above