In [55]:
import pandas as pd 
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler 
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report , confusion_matrix

## Machine Learning Phase

In this section, a machine learning pipeline is developed to predict whether an individual
earns more than 50K per year based on census data. The workflow includes preprocessing,
model training, and evaluation on an unseen test dataset.

In [56]:
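import os
from sqlalchemy import create_engine

# Read the database password from the environment instead of hard-coding it.
# PGPASSWORD is an assumed variable name; set it before running this cell.
engine = create_engine(
    f"postgresql://postgres:{os.environ['PGPASSWORD']}@localhost:5432/Adult"
)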

In [57]:
query = """
SELECT
    age,
    workclass,
    education_num,
    occupation,
    relationship,
    sex,
    capital_gain,
    capital_loss,
    hours_per_week,
    income
FROM adult;
"""

adult = pd.read_sql(query, engine)

The feature `education` was removed because it represents the same information as `education_num` in categorical form. 
Since `education_num` provides an ordinal and numerical representation, it was preferred to avoid redundancy and multicollinearity in the model.
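
The redundancy claim can be checked directly. The sketch below uses a few toy rows (not the real table) and assumes a DataFrame with both columns; it tests whether each `education` label maps to exactly one `education_num` code and vice versa.

```python
import pandas as pd

# Toy rows for illustration only; in the notebook these columns
# would be read from the `adult` table.
sample = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Some-college", "Bachelors", "HS-grad"],
    "education_num": [13, 9, 10, 13, 9],
})

# Distinct numeric codes per label, and distinct labels per numeric code.
codes_per_label = sample.groupby("education")["education_num"].nunique()
labels_per_code = sample.groupby("education_num")["education"].nunique()

# A one-to-one mapping means the two columns carry the same information.
is_one_to_one = bool((codes_per_label == 1).all() and (labels_per_code == 1).all())
print(is_one_to_one)
```

If the check holds on the full data, keeping only one of the two columns loses nothing.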

The feature `marital_status` was removed due to its strong overlap with the `relationship` feature. 
Exploratory analysis showed that `relationship` provides a more direct and fine-grained representation of household roles, which better differentiates income groups.

Although the feature `race` shows some association with income, it was removed due to its relatively low discriminative power compared to other features, severe class imbalance across categories, and its sensitive nature.
Excluding it simplifies the model, and given its low discriminative power the expected cost in predictive performance is small.

The feature `native_country` was removed because it contains a large number of categories with very small sample sizes, leading to high sparsity after one-hot encoding and limited predictive value for income classification.

The feature `fnlwgt` was removed because it represents sampling weights rather than individual characteristics and is not directly useful for predicting personal income.

In [58]:
adult.head()

| | age | workclass | education_num | occupation | relationship | sex | capital_gain | capital_loss | hours_per_week | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45 | Private | 13 | Exec-managerial | Own-child | Male | 0 | 1408 | 40 | <=50K |
| 1 | 30 | Federal-gov | 10 | Adm-clerical | Own-child | Male | 0 | 0 | 40 | <=50K |
| 2 | 22 | State-gov | 10 | Other-service | Husband | Male | 0 | 0 | 15 | <=50K |
| 3 | 48 | Private | 7 | Machine-op-inspct | Unmarried | Male | 0 | 0 | 40 | <=50K |
| 4 | 21 | Private | 10 | Machine-op-inspct | Own-child | Male | 0 | 0 | 40 | <=50K |


In [59]:
adult.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   education_num   32561 non-null  int64 
 3   occupation      32561 non-null  object
 4   relationship    32561 non-null  object
 5   sex             32561 non-null  object
 6   capital_gain    32561 non-null  int64 
 7   capital_loss    32561 non-null  int64 
 8   hours_per_week  32561 non-null  int64 
 9   income          32561 non-null  object
dtypes: int64(5), object(5)
memory usage: 2.5+ MB


In [60]:
adult['income'].value_counts()

income
<=50K    24720
>50K      7841
Name: count, dtype: int64

In [61]:
X = adult.drop(columns=['income']).copy()
y = adult['income'].copy()

y = y.map(
    {'<=50K': 0 , '>50K' : 1}
)
print(y.isna().sum()  , y.value_counts())

0 income
0    24720
1     7841
Name: count, dtype: int64


In [62]:
num_cols = ['age', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']
cat_cols = ['workclass', 'occupation', 'relationship', 'sex']

### Model Selection

Logistic Regression was chosen as the baseline model due to its simplicity,
interpretability, and suitability for binary classification tasks.
Class weights were balanced to address the class imbalance in the dataset.
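
As a rough illustration of what `class_weight='balanced'` does, the sketch below applies scikit-learn's documented heuristic, `n_samples / (n_classes * class_count)`, to the class counts printed above. The weights used in the notebook are computed from whatever data is passed to `fit`, so the exact values there differ slightly.

```python
# Class counts taken from the `income` value counts printed earlier.
counts = {0: 24720, 1: 7841}

n_samples = sum(counts.values())
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Each >50K example therefore counts roughly three times as much as a ≤50K example in the loss, so both classes contribute equal total weight.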

In [63]:
# Numeric features are standardized; categorical features are one-hot encoded.
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
# handle_unknown='ignore' encodes categories unseen during fit as all zeros.
categorical_transformer = Pipeline(steps=[
    ('oneHot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
    ]
)

# Stratified 80/20 split keeps the class ratio approximately equal in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Standalone transform, used only for the optional shape check below;
# the pipeline refits the preprocessor itself during clf.fit.
X_train_p = preprocessor.fit_transform(X_train)
X_valid_p = preprocessor.transform(X_valid)

# print(X_train.shape, X_valid.shape)
# print(X_train_p.shape, X_valid_p.shape)

clf = Pipeline(steps=[
    ('preproccessor', preprocessor),
    ('class', LogisticRegression(
        max_iter=1000,
        class_weight='balanced'
    ))
])
clf.fit(X_train, y_train)
y_pred = clf.predict(X_valid)

print(confusion_matrix(y_valid, y_pred))
print(classification_report(y_valid, y_pred, digits=3))

[[3998  947]
 [ 261 1307]]
              precision    recall  f1-score   support

           0      0.939     0.808     0.869      4945
           1      0.580     0.834     0.684      1568

    accuracy                          0.815      6513
   macro avg      0.759     0.821     0.776      6513
weighted avg      0.852     0.815     0.824      6513



### Validation Results

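On the 20% validation split, the model reached a recall of 0.834 for the high-income class (>50K),
so it identifies most high-income individuals.
This comes at the cost of precision (0.580): about 42% of the individuals predicted as >50K (947 of 2,254) actually earn ≤50K, which is the expected effect of balanced class weights.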

In [64]:
clf.fit(X, y)

Pipeline(steps=[('preproccessor', ...), ('class', ...)], verbose=False)
  ColumnTransformer(transformers=[('num', ...), ('cat', ...)], remainder='drop',
                    sparse_threshold=0.3, verbose=False,
                    verbose_feature_names_out=True,
                    force_int_remainder_cols='deprecated')
    StandardScaler(copy=True, with_mean=True, with_std=True)
    OneHotEncoder(categories='auto', sparse_output=True,
                  dtype=<class 'numpy.float64'>, handle_unknown='ignore',
                  feature_name_combiner='concat')
  LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0,
                     fit_intercept=True, intercept_scaling=1,
                     class_weight='balanced', solver='lbfgs', max_iter=1000)

An internal 80/20 split was used for initial validation and sanity checking.
After finalizing the modeling choice, the pipeline was retrained on the full
training dataset (adult.data) and evaluated on the official unseen test set (adult.test).
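
The validate-then-refit workflow can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the census table; the names `template`, `val_model`, and `final_model` are hypothetical. It uses `sklearn.base.clone` so that the validation model and the final model are independent, unfitted copies of the same configuration.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the census features (not the real data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.76, 0.24], random_state=42
)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

template = LogisticRegression(max_iter=1000, class_weight="balanced")

# Step 1: fit on the 80% split and check recall on the held-out 20%.
val_model = clone(template).fit(X_tr, y_tr)
val_recall = recall_score(y_va, val_model.predict(X_va))

# Step 2: refit a fresh copy on all rows before scoring a separate test set.
final_model = clone(template).fit(X_demo, y_demo)

print(f"validation recall: {val_recall:.3f}")
```

Refitting on all rows lets the final model use the 20% that was held out, while the separate test set remains untouched until the final evaluation.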

In [66]:
query = """
SELECT
    age,
    workclass,
    education_num,
    occupation,
    relationship,
    sex,
    capital_gain,
    capital_loss,
    hours_per_week,
    income
FROM adult_test;
"""

adult_test = pd.read_sql(query, engine)
adult_test.head()

| | age | workclass | education_num | occupation | relationship | sex | capital_gain | capital_loss | hours_per_week | income |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | Private | 9 | Handlers-cleaners | Own-child | Male | 0 | 0 | 40 | <=50K |
| 1 | 28 | unknown | 10 | unknown | Other-relative | Male | 0 | 0 | 40 | <=50K |
| 2 | 33 | unknown | 7 | unknown | Unmarried | Female | 0 | 0 | 53 | <=50K |
| 3 | 48 | unknown | 7 | unknown | Unmarried | Female | 0 | 0 | 10 | <=50K |
| 4 | 18 | unknown | 6 | unknown | Own-child | Male | 0 | 0 | 40 | <=50K |


In [None]:
X_test = adult_test.drop(columns=['income']).copy()
y_test = adult_test['income'].map(
    {'<=50K' : 0 , '>50K' : 1}
    )
print(y_test.isna().sum())
print(y_test.value_counts())

0
income
0    12435
1     3846
Name: count, dtype: int64


In [70]:
y_test_pred = clf.predict(X_test)
print(confusion_matrix(y_test , y_test_pred))
print(classification_report(y_test , y_test_pred , digits=3))


[[9912 2523]
 [ 613 3233]]
              precision    recall  f1-score   support

           0      0.942     0.797     0.863     12435
           1      0.562     0.841     0.673      3846

    accuracy                          0.807     16281
   macro avg      0.752     0.819     0.768     16281
weighted avg      0.852     0.807     0.819     16281



## Model Evaluation & Performance Analysis

The performance of the Logistic Regression model was evaluated on the test dataset using a confusion matrix and classification metrics including precision, recall, and F1-score.

### Confusion Matrix Analysis
The confusion matrix shows the following results:

- True Negatives (Class 0 correctly predicted): 9,912  
- False Positives (Class 0 predicted as Class 1): 2,523  
- False Negatives (Class 1 predicted as Class 0): 613  
- True Positives (Class 1 correctly predicted): 3,233  

This indicates that the model detects the high-income class (Class 1) well, missing only 613 of 3,846 such individuals, but at the cost of 2,523 false positives: roughly one in five individuals earning ≤50K (Class 0) is predicted as >50K.
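
The per-class precision and recall in the classification report follow directly from these four counts. A minimal sketch in plain Python, using the test-set values listed above:

```python
# Test-set confusion matrix from the notebook output
# (rows = true class, columns = predicted class).
tn, fp = 9912, 2523
fn, tp = 613, 3233

# Class 1 (>50K): how many predicted positives are correct, and how many positives are found.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

# Class 0 (<=50K): the same quantities with the roles of the classes swapped.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"class 0: precision={precision_0:.3f} recall={recall_0:.3f}")
print(f"class 1: precision={precision_1:.3f} recall={recall_1:.3f}")
print(f"accuracy={accuracy:.3f}")
```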

### Classification Report Analysis

- **Class 0 (Income ≤50K):**
  - Precision: 0.942  
  - Recall: 0.797  
  - F1-score: 0.863  

  The high precision indicates that when the model predicts an individual belongs to the ≤50K income group, it is usually correct. The recall of 0.797 shows that about 80% of low-income individuals are identified; the remaining 20% are misclassified as high-income.

- **Class 1 (Income >50K):**
  - Precision: 0.562  
  - Recall: 0.841  
  - F1-score: 0.673  

  The recall for the high-income class is relatively high, meaning the model is effective at detecting individuals earning more than $50K. However, the precision of 0.562 means that about 44% of the individuals predicted as high-income actually belong to the lower-income group.

### Overall Model Performance

- Accuracy: 0.807  
- Macro Average F1-score: 0.768  
- Weighted Average F1-score: 0.819  

The accuracy of approximately 81% should be read against the majority-class baseline: always predicting ≤50K would already reach about 76% (12,435 of 16,281). The difference between macro and weighted averages reflects the class imbalance in the dataset, where the majority of samples belong to the ≤50K income group.
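
The two averages differ only in how the per-class F1-scores are weighted. The sketch below recomputes them from the test-set confusion matrix counts reported earlier, together with the accuracy of a trivial majority-class predictor for comparison.

```python
# Test-set confusion matrix counts from the notebook output.
tn, fp, fn, tp = 9912, 2523, 613, 3233

support_0 = tn + fp  # true <=50K samples
support_1 = fn + tp  # true >50K samples
total = support_0 + support_1

# F1 = 2*TP / (2*TP + FP + FN), taking each class in turn as the "positive" one.
f1_0 = 2 * tn / (2 * tn + fn + fp)
f1_1 = 2 * tp / (2 * tp + fp + fn)

# Macro average: every class counts equally.
macro_f1 = (f1_0 + f1_1) / 2

# Weighted average: every class counts in proportion to its support.
weighted_f1 = (f1_0 * support_0 + f1_1 * support_1) / total

# Accuracy of always predicting the majority class (<=50K).
baseline_accuracy = support_0 / total

print(f"macro F1={macro_f1:.3f} weighted F1={weighted_f1:.3f}")
print(f"majority-class baseline accuracy={baseline_accuracy:.3f}")
```

Because the ≤50K class has the higher F1-score and about three quarters of the support, the weighted average sits above the macro average.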

### Conclusion
Overall, the Logistic Regression model provides an interpretable baseline for income prediction on the Adult dataset. With balanced class weights it favors recall over precision for the >50K class (0.841 vs. 0.562), which suits applications where identifying higher-income individuals is the priority; where false positives are costly, the precision would need to be improved.
