# Detailed Working on Income Dataset

Hello everyone, I'm Rishabh, a pre-final year student at IIIT Gwalior. This notebook works through the income-classification dataset in detail. It covers **Exploratory Data Analysis, Feature Engineering, Feature Scaling, Feature Selection, Modeling and Hyperparameter Optimization** techniques. I hope you find this notebook informative. Thanks!!

## 1) Importing Libraries

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Importing other Necessary Libraries

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
SEED = 42
np.random.seed(SEED)
%matplotlib inline

## 2) Importing Dataset

In [None]:
df = pd.read_csv("/kaggle/input/income-classification/income_evaluation.csv")
df_copy = df.copy()

In [None]:
df.info()

**Interpretation: There are 6 columns with int64 dtype and 9 columns with object dtype.**

In [None]:
# Renaming Columns

col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
             'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'income']

df.columns = col_names

df.columns

## 3) Exploratory Data Analysis

In [None]:
# Shape 

df.shape

**Interpretation: Rows = 32,561 | Cols = 15**

In [None]:
# Statistical distribution of the data

df.describe().T

In [None]:
# Checking for missing values

df.isnull().sum()

In [None]:
df.columns

In [None]:
categorical = []
numerical = []

for i in df.columns:
    if df[i].dtype=='O':
        categorical.append(i)
    else:
        numerical.append(i)

In [None]:
# Categorical columns

print(categorical)

* **We have nine categorical columns including the target variable.**

In [None]:
# Numerical Columns

print(numerical)

* **We have six numerical columns.**

* **We will now perform separate analysis on categorical as well as numerical features.**
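
The same split into categorical and numerical columns can be sketched with `select_dtypes`. This is a minimal example on a made-up three-column frame (the toy data and the names `toy`, `categorical_cols` and `numerical_cols` are assumptions, not part of the notebook):

```python
import pandas as pd

# Toy frame standing in for the income data (values are made up)
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
    "hours_per_week": [40, 13, 40],
})

# Numeric dtypes are treated as numerical, everything else as categorical
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(categorical_cols)
print(numerical_cols)
```

Selecting on `"number"` avoids depending on how a given pandas version stores string columns.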

## 3.1) Exploring Categorical Columns

In [None]:
# Categorical columns

print(categorical)

In [None]:
df[categorical].isnull().sum()

* **`isnull()` reports no missing values in the categorical columns.**

In [None]:
for i in categorical:
    print(df[i].value_counts())

* **A few columns contain the value '?', which stands for a missing (NULL) entry:**

    1. **workclass**
    2. **occupation**
    3. **native_country**

* **Now, let us analyze each feature and build some intuition about it.**

In [None]:
# Let us explore our target variable first - 'income'

df['income'].value_counts()

* **We have two unique values in target variable with title '<=50K' and '>50K'.**
* **Count of '<=50K' and '>50K' is 24720 and 7841 respectively.**
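
To put the class counts above into proportion, here is a small sketch that converts them into shares and an imbalance ratio. Only the two counts come from the notebook; the variable names are illustrative:

```python
import pandas as pd

# Class counts quoted in the text above
counts = pd.Series({"<=50K": 24720, ">50K": 7841})

# Share of each class and the majority-to-minority ratio
shares = counts / counts.sum()
ratio = counts["<=50K"] / counts[">50K"]

print(shares.round(3))
print(f"imbalance ratio: {ratio:.2f}")
```

Roughly three out of four records fall in the '<=50K' class, which is why the dataset is treated as imbalanced later on.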

In [None]:
# Let us visualise it through pie chart and countplot

fig, ax=plt.subplots(1,2,figsize=(18,8))

ax[0].pie(df['income'].value_counts(),labels=['<=50K','>50K'],explode=[0,0.2],shadow=True)
sns.countplot(x='income',data=df,palette='coolwarm',ax=ax[1])
plt.tight_layout()
plt.show()

* **'<=50K' accounts for a much larger share of the records than '>50K'.**

In [None]:
# Visualise 'income' wrt 'sex' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.countplot(x=df['income'],hue=df['sex'])
plt.tight_layout()
plt.show()

In [None]:
# Visualise 'income' wrt 'workclass' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.countplot(x=df['income'],hue=df['workclass'])
plt.tight_layout()
plt.show()

* **The 'Private' workclass has the highest count in both income groups, since most records belong to it.**

In [None]:
# Visualise 'income' wrt 'occupation' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.countplot(x=df['income'],hue=df['occupation'])
plt.tight_layout()
plt.show()

In [None]:
# Visualise 'income' wrt 'race' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.countplot(x=df['income'],hue=df['race'])
plt.tight_layout()
plt.show()

* **White individuals have the highest count in both income groups, which largely reflects their larger share of the dataset.**

In [None]:
# Let us explore 'workclass' categorical column.

df['workclass'].value_counts()

In [None]:
# Let us visualise it through pie chart and countplot

fig, ax=plt.subplots(1,2,figsize=(18,8))

ax[0].pie(df['workclass'].value_counts(),labels=df['workclass'].value_counts().index,shadow=True)
sns.countplot(x='workclass',data=df,palette='coolwarm',ax=ax[1])
plt.tight_layout()
plt.show()

In [None]:
# Visualise 'workclass' wrt 'race' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.countplot(x=df['workclass'],hue=df['race'])
plt.tight_layout()
plt.show()

* **White individuals form the largest group in every workclass.**

In [None]:
# Let us explore 'occupation' categorical column.

df['occupation'].value_counts()

In [None]:
# Let us visualise it through pie chart and countplot

fig, ax=plt.subplots(1,2,figsize=(23,10))

ax[0].pie(df['occupation'].value_counts(),labels=df['occupation'].value_counts().index,shadow=True)
sns.countplot(x='occupation',data=df,palette='coolwarm',ax=ax[1])
plt.tight_layout()
plt.show()

In [None]:
# Visualise 'occupation' wrt 'sex' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.countplot(x=df['occupation'],hue=df['sex'])
plt.tight_layout()
plt.show()

* **Females outnumber males in "Adm-clerical", "Other-service" and "Priv-house-serv", whereas males outnumber females in the remaining occupations.**

In [None]:
# Let us explore 'native_country' categorical column.

df['native_country'].value_counts()

In [None]:
fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.countplot(x=df['native_country'])
plt.tight_layout()
plt.show()

* **United-States is by far the most common native country in the dataset.**

* **We need to replace '?' with np.nan values.**

In [None]:
# Replacing ' ?' (note the leading space in the raw values) with np.nan values.

df.replace(' ?',np.nan,inplace=True)

In [None]:
# Checking for missing values.

df.isnull().sum()

* **Three columns now contain missing values. We will deal with them later in the Feature Engineering section.**
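
Because the placeholder carries a leading space, a slightly more defensive variant strips whitespace before replacing it. The sketch below uses a made-up column to illustrate the idea and is not the notebook's own code:

```python
import numpy as np
import pandas as pd

# Toy column mimicking the raw values, which carry a leading space
toy = pd.DataFrame({"workclass": [" Private", " ?", " State-gov", " ?"]})

# Strip surrounding whitespace first, then map the '?' placeholder to NaN
toy["workclass"] = toy["workclass"].str.strip().replace("?", np.nan)

print(toy["workclass"].isnull().sum())
```

An alternative worth considering is to handle this at load time with the `na_values` and `skipinitialspace` arguments of `pd.read_csv`.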

## 3.2) Exploring Numerical Columns

In [None]:
# Numerical columns

print(numerical)

In [None]:
df[numerical].isnull().sum()

* **There are no missing values in numerical columns.**

In [None]:
# Pairwise plot of all the numerical features.

sns.pairplot(df[numerical])
plt.tight_layout()
plt.show()

* **Let us visualise the numerical features one by one.**

In [None]:
# We will first explore 'age' feature 

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.histplot(df['age'],kde=True)  # distplot is deprecated in recent seaborn versions
plt.tight_layout()
plt.show()

* **The distribution of 'age' variable is positively skewed.**
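
The visual impression of skewness can be backed up with a number. Here is a minimal sketch on a made-up sample (the values are illustrative and are not taken from the dataset):

```python
import pandas as pd

# Made-up right-skewed sample: most values are low, a few are high
ages = pd.Series([22, 25, 27, 30, 31, 35, 38, 45, 60, 85])

# Positive skewness means a longer tail on the right;
# a mean above the median is a quick cross-check
print(ages.skew())
print(ages.mean(), ages.median())
```

On the real data the same check would be `df['age'].skew()`.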

In [None]:
fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.boxplot(y=df['age'])
plt.tight_layout()
plt.show()

* **In boxplot, outlier values are denoted by black dots.**
* **Clearly, we can see many outliers in the 'age' feature.**
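
The points drawn beyond the whiskers follow the usual 1.5 × IQR rule, which is the default in seaborn and matplotlib boxplots. A small sketch of that rule, with a hypothetical helper `iqr_outliers` and made-up ages:

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Made-up ages; 90 sits far above the rest
ages = pd.Series([20, 25, 30, 35, 40, 45, 50, 90])
print(iqr_outliers(ages).tolist())
```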

In [None]:
# Visualise 'age' wrt to 'sex' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.boxplot(x=df["sex"],y=df['age'])
plt.tight_layout()
plt.show()

* **The median age of males is higher than that of females.**

In [None]:
# Visualise 'age' wrt to 'sex' and 'race' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.boxplot(x=df["sex"],y=df['age'],hue=df["race"])
plt.tight_layout()
plt.show()

In [None]:
# Visualise 'age' wrt to 'income' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.boxplot(x=df["income"],y=df['age'])
plt.tight_layout()
plt.show()

* **People in the '>50K' group tend to be older than those in the '<=50K' group.**

In [None]:
# Visualise 'age' wrt to 'income' and 'sex' feature

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.boxplot(x=df["income"],y=df['age'],hue=df["sex"])
plt.tight_layout()
plt.show()

In [None]:
# Checking for correlation between numerical features

fig, ax=plt.subplots(1,1,figsize=(18,8))
ax = sns.heatmap(df[numerical].corr(),annot=True)
plt.tight_layout()
plt.show()

* **Fortunately, none of the numerical features are strongly correlated with each other, so we do not run into multicollinearity, which can be an obstacle when building a linear model.**
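
Reading a heatmap by eye gets harder as the number of features grows, so the same check can be done programmatically. The sketch below lists column pairs whose absolute correlation exceeds a cut-off. The helper `high_corr_pairs`, the 0.8 threshold and the toy data are all assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy data: 'b' is an exact multiple of 'a', 'c' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=200)})
print(high_corr_pairs(toy))
```

An empty list from `high_corr_pairs(df[numerical])` would support the conclusion drawn from the heatmap.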

## Feature Engineering

In [None]:
X = df.drop('income',axis=1)
y = df["income"]

In [None]:
# Splitting the data

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42)

In [None]:
X_train.shape, X_test.shape

In [None]:
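# Filling missing values in train data with the most frequent value (mode)

X_train = X_train.copy()  # work on an explicit copy to avoid chained-assignment issues
X_train["workclass"] = X_train["workclass"].fillna(X_train["workclass"].mode()[0])
X_train["occupation"] = X_train["occupation"].fillna(X_train["occupation"].mode()[0])
X_train["native_country"] = X_train["native_country"].fillna(X_train["native_country"].mode()[0])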

In [None]:
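# Filling missing values in test data with the modes learned from the train data

X_test = X_test.copy()  # work on an explicit copy to avoid chained-assignment issues
X_test["workclass"] = X_test["workclass"].fillna(X_train["workclass"].mode()[0])
X_test["occupation"] = X_test["occupation"].fillna(X_train["occupation"].mode()[0])
X_test["native_country"] = X_test["native_country"].fillna(X_train["native_country"].mode()[0])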

In [None]:
# checking for missing values in train data

X_train.isnull().sum()

In [None]:
# checking for missing values in test data

X_test.isnull().sum()
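
A common convention is to learn fill values from the training split only and reuse them on the test split, so that no information from the test data leaks into preprocessing. A minimal sketch on made-up frames (`train`, `test` and `fill_value` are illustrative names):

```python
import numpy as np
import pandas as pd

# Toy train/test columns with missing entries (values are made up)
train = pd.DataFrame({"workclass": ["Private", "Private", "State-gov", np.nan]})
test = pd.DataFrame({"workclass": ["State-gov", np.nan, "State-gov"]})

# Learn the fill value from the training split only
fill_value = train["workclass"].mode()[0]

# Apply the same value to both splits
train["workclass"] = train["workclass"].fillna(fill_value)
test["workclass"] = test["workclass"].fillna(fill_value)

print(fill_value)
print(test["workclass"].tolist())
```

Note that the test split is filled with "Private" even though its own most frequent value is "State-gov".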

In [None]:
# Encoding Categorical features using One-Hot Encoding

import category_encoders as ce

In [None]:
encoder = ce.OneHotEncoder(cols=['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_test.head()
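
For readers without `category_encoders`, a rough equivalent can be sketched with plain pandas. This swaps in `pd.get_dummies` plus `reindex`, which is a different technique from the encoder used above, and the toy data is made up:

```python
import pandas as pd

# Toy splits; 'Cuba' appears only in the test split
train = pd.DataFrame({"native_country": ["United-States", "India", "United-States"]})
test = pd.DataFrame({"native_country": ["India", "Cuba"]})

# Encode the training split and remember its columns
train_enc = pd.get_dummies(train, columns=["native_country"], dtype=int)

# Encode the test split, then align it to the training columns:
# unseen categories are dropped and missing ones are filled with 0
test_enc = pd.get_dummies(test, columns=["native_country"], dtype=int)
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

The key point is the same as with the encoder: the set of output columns is fixed by the training data, and the test data is mapped onto it.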

* **Next, we are going to apply scaling and transformation techniques to our features.**

## Feature Scaling / Feature Transformation

In [None]:
from sklearn.preprocessing import RobustScaler

* **We are using RobustScaler for scaling purpose as it is robust to outliers.**
* **We have previously seen that there are outliers in our data.**
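
With its default settings, RobustScaler centres each feature on its median and divides by its interquartile range, which is why a few extreme values barely affect the scaling. Below is a NumPy sketch of that idea for a single feature; the helpers `fit_robust` and `apply_robust` are hypothetical, not scikit-learn functions:

```python
import numpy as np

def fit_robust(x: np.ndarray):
    """Learn the median and interquartile range of a 1-D training array."""
    median = np.median(x)
    q1, q3 = np.percentile(x, [25, 75])
    return median, q3 - q1

def apply_robust(x: np.ndarray, median: float, iqr: float) -> np.ndarray:
    """Centre on the training median and scale by the training IQR."""
    return (x - median) / iqr

# Made-up values; 1000 is an extreme outlier in the training data
train = np.array([10.0, 20.0, 30.0, 40.0, 1000.0])
test = np.array([20.0, 30.0, 50.0])

median, iqr = fit_robust(train)          # statistics come from train only
print(apply_robust(train, median, iqr))
print(apply_robust(test, median, iqr))   # test reuses the same statistics
```

The outlier stays large after scaling, but it does not distort the scale applied to the remaining values.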

In [None]:
cols = X_train.columns
scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the statistics learned from the train data

In [None]:
X_train = pd.DataFrame(X_train,columns=cols)
X_test = pd.DataFrame(X_test,columns=cols)

In [None]:
X_train.head()

In [None]:
X_test.head()

* **Next, we are going to select the best features using an embedded method.**

## Feature Selection

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rc = RandomForestClassifier(n_estimators=100,random_state=42)

In [None]:
rc.fit(X_train,y_train)

In [None]:
feature_score = pd.Series(rc.feature_importances_,X_train.columns).sort_values(ascending=False)

In [None]:
feature_score

* **Most Important Feature - 'fnlwgt'**
* **Least Important Feature - 'native_country_37'**
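
Once importance scores are available, selecting features comes down to applying a cut-off. The sketch below uses made-up scores and an assumed threshold of 0.01. Neither comes from the notebook's output, and the feature names other than the two mentioned above are hypothetical:

```python
import pandas as pd

# Hypothetical importance scores (made up for illustration, not the notebook's output)
scores = pd.Series({
    "fnlwgt": 0.16,
    "age": 0.15,
    "capital_gain": 0.09,
    "race_3": 0.004,
    "native_country_37": 0.0001,
}).sort_values(ascending=False)

# Keep features whose importance is at least the assumed cut-off
threshold = 0.01
selected = scores[scores >= threshold].index.tolist()

print(selected)
```

With the real scores, the equivalent would be `feature_score[feature_score >= threshold].index`, followed by subsetting `X_train` and `X_test` to those columns.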

## Modelling

* **We are going to work with four models for our classification problem:**

1. Logistic Regression
2. Support Vector Machine
3. Random Forest Classifier
4. XGBoost


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
import xgboost

In [None]:
# Metrics

from sklearn.metrics import classification_report,confusion_matrix, precision_score, recall_score, f1_score

* **We will not use accuracy as a metric since our dataset is imbalanced.**
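
To see why accuracy is misleading here, consider a baseline that always predicts '<=50K'. The sketch below uses the class counts reported earlier in the notebook. The baseline itself is a thought experiment, not one of the models trained here:

```python
# Class counts from the target distribution reported earlier in the notebook
n_negative, n_positive = 24720, 7841   # '<=50K' and '>50K'

# A baseline that always predicts '<=50K'
tp, fp = 0, 0                  # it never predicts the positive class
fn, tn = n_positive, n_negative

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)
# Precision is undefined (0/0) when nothing is predicted positive; treat it as 0 here
precision = tp / (tp + fp) if (tp + fp) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}, F1 = {f1:.3f}")
```

The baseline reaches about 76% accuracy without ever identifying a '>50K' record, while its F1 score for that class is 0.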

In [None]:
y_train = pd.get_dummies(y_train,drop_first=True,dtype=int)  # integer 0/1 labels instead of booleans
y_train.columns= ['income']

In [None]:
y_test = pd.get_dummies(y_test,drop_first=True,dtype=int)  # integer 0/1 labels instead of booleans
y_test.columns= ['income']

* **Encoded >50K - 1**
* **Encoded <=50K - 0**
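
The same encoding can be written as an explicit comparison, which makes the mapping visible in the code. This is a sketch on a made-up target, assuming the labels carry the same leading space as the ' ?' placeholder:

```python
import pandas as pd

# Toy target with leading-space labels (assumed to match the raw file)
y = pd.Series([" <=50K", " >50K", " <=50K"], name="income")

# Strip whitespace, then mark the '>50K' class as 1 and everything else as 0
y_encoded = (y.str.strip() == ">50K").astype(int)

print(y_encoded.tolist())
```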

In [None]:
# Logistic Regression

lr = LogisticRegression(solver='liblinear',random_state=42)
lr.fit(X_train,y_train.values.ravel())

In [None]:
# Train accuracy

y_pred = lr.predict(X_train)
print(classification_report(y_train,y_pred))
print(f'F1 Score for train data is {f1_score(y_train,y_pred)}')

In [None]:
# Test accuracy

y_pred = lr.predict(X_test)
print(classification_report(y_test,y_pred))
print(f'F1 Score for test data is {f1_score(y_test,y_pred)}')

In [None]:
# Support Vector Machine

svc = SVC(random_state=42)
svc.fit(X_train,y_train.values.ravel())

In [None]:
# Train accuracy

y_pred = svc.predict(X_train)
print(classification_report(y_train,y_pred))
print(f'F1 Score for train data is {f1_score(y_train,y_pred)}')

In [None]:
# Test accuracy

y_pred = svc.predict(X_test)
print(classification_report(y_test,y_pred))
print(f'F1 Score for test data is {f1_score(y_test,y_pred)}')

In [None]:
# Random Forest Classifier

rc = RandomForestClassifier(n_estimators=100,random_state=42)
rc.fit(X_train,y_train.values.ravel())

In [None]:
# Train accuracy

y_pred = rc.predict(X_train)
print(classification_report(y_train,y_pred))
print(f'F1 Score for train data is {f1_score(y_train,y_pred)}')

In [None]:
# Test accuracy

y_pred = rc.predict(X_test)
print(classification_report(y_test,y_pred))
print(f'F1 Score for test data is {f1_score(y_test,y_pred)}')

* **The random forest overfits the training data.**

In [None]:
# XGBoost Classifier

from xgboost import XGBClassifier
xgb = XGBClassifier(random_state=42)
xgb.fit(X_train,y_train.values.ravel())

In [None]:
# Train accuracy

y_pred = xgb.predict(X_train)
print(classification_report(y_train,y_pred))
print(f'F1 Score for train data is {f1_score(y_train,y_pred)}')

In [None]:
# Test accuracy

y_pred = xgb.predict(X_test)
print(classification_report(y_test,y_pred))
print(f'F1 Score for test data is {f1_score(y_test,y_pred)}')

* **In terms of training performance, the Random Forest Classifier performs best, but it overfits. We will try to reduce its variance by tuning hyperparameters.**

## Hyperparameter Optimization

* **We will use RandomizedSearchCV for hyperparameter optimization.**

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
rc = RandomForestClassifier(random_state=42)

In [None]:
n_estimators = [500,800,1500,1200,2500,5000,6000]
max_features = ['sqrt','log2',None]  # 'auto' was removed for RandomForestClassifier in scikit-learn 1.3
max_depth = [5, 8, 15, 25, 30,10, 20, 40, 50]
max_depth.append(None)
min_samples_split = [2, 5, 10, 15, 20, 100]
min_samples_leaf = [1, 2, 5, 10, 15, 20] 

params = dict(n_estimators = n_estimators, max_features = max_features, max_depth = max_depth,  
              min_samples_split = min_samples_split, 
             min_samples_leaf = min_samples_leaf)

In [None]:
grid = RandomizedSearchCV(rc,params,scoring='f1',n_jobs=-1,cv=10,verbose=3,random_state=42)  # optimise F1 rather than the default accuracy
grid.fit(X_train,y_train.values.ravel())
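
It helps to know how much of the search space a randomized search actually covers. The sketch below counts the combinations in the space defined above and the number of model fits the search performs, assuming `n_iter` is left at its scikit-learn default of 10:

```python
from math import prod

# Number of candidate values per hyperparameter in the search space above
space_sizes = {
    "n_estimators": 7,
    "max_features": 3,
    "max_depth": 10,       # nine depths plus None
    "min_samples_split": 6,
    "min_samples_leaf": 6,
}

total_combinations = prod(space_sizes.values())

n_iter = 10   # RandomizedSearchCV default when n_iter is not passed
cv = 10       # folds used in the search above
cv_fits = n_iter * cv

print(total_combinations)   # size of the full grid
print(cv_fits)              # model fits during the search (before the final refit)
```

Only 10 of the 7,560 possible combinations are sampled, so the best parameters found may differ between runs unless a `random_state` is fixed.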

In [None]:
grid.best_params_

In [None]:
# Train accuracy

y_pred = grid.predict(X_train)
print(classification_report(y_train,y_pred))
print(f'F1 Score for train data is {f1_score(y_train,y_pred)}')

In [None]:
# Test accuracy

y_pred = grid.predict(X_test)
print(classification_report(y_test,y_pred))
print(f'F1 Score for test data is {f1_score(y_test,y_pred)}')

* **XGBoost still seems to perform better than Random Forest, even after hyperparameter optimization.**
* **We reduced overfitting in Random Forest, but the test F1 score did not increase significantly.**

# Thank You!!

* **If you find this notebook interesting, do upvote it.**