# **Census Income Prediction**

**Problem Statement**

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)).
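The extraction condition above maps directly onto a boolean mask. A minimal sketch on made-up records (the values are hypothetical; only the column names AAGE, AGI, AFNLWGT and HRSWK come from the condition itself):

```python
import pandas as pd

# Hypothetical toy records; only the column names come from the quoted condition.
raw = pd.DataFrame({
    "AAGE":    [15, 25, 40, 60, 33],
    "AGI":     [500, 50, 30000, 45000, 20000],
    "AFNLWGT": [100, 200, 300, 400, 1],
    "HRSWK":   [20, 40, 0, 35, 40],
})

# A record is kept only if all four conditions hold at once.
mask = (
    (raw["AAGE"] > 16)
    & (raw["AGI"] > 100)
    & (raw["AFNLWGT"] > 1)
    & (raw["HRSWK"] > 0)
)
clean = raw[mask]
print(clean)
```

Each of the first, second, third and fifth toy rows fails exactly one condition, so only the fourth row survives.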

In the dataset, Income is the target variable. It has two classes, so this is a classification problem. The prediction task is to determine whether a person makes over $50K a year.

# **Importing necessary Libraries**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# **Loading dataset:**

In [None]:
# Reading the csv file in a dataframe
df = pd.read_csv('/content/drive/MyDrive/census_income.csv')
df

The dataset contains information on 51 state samples of the US population. It has 15 columns, including the features and the target variable. Here, the Income column is the label, which is either <=50K or >50K depending on features such as Age, Workclass, Fnlwgt, Education, Education_num, Marital_status, Occupation, Relationship, Race, Sex, Capital_gain, Capital_loss, Hours_per_week and Native_country. As the Income variable has two classes (<=50K or >50K), this is a classification problem, and we need to predict whether a person's income is over 50K or not.

In [None]:
# Displaying top 20 records of this dataset
df.head(20)

In [None]:
# Displaying bottom 20 records of this dataset
df.tail(20)

Looking at the records, we find some corrupted entries that contain '?'. These need to be filled or dropped.

# **Exploratory Data Analysis (EDA)**

In [None]:
#Checking the dimensions of the dataframe
df.shape

The dataframe contains 32560 rows and 15 columns.

In [None]:
# For getting the overview summary of the dataset
df.info()

This provides information about the dataset, including the range index, column dtypes, non-null counts and memory usage.

In [None]:
# Checking the types of the dataset
df.dtypes

So, out of 15 columns, 6 are numeric and 9 are categorical. We need to transform the categorical variables into numeric format before proceeding further.

In [None]:
# Checking the unique values in each column
df.nunique()

These are the number of unique values present in each column.

# **Checking for missing data**

In [None]:
# Checking null values in the dataset.
df.isnull().sum()

**Visualizing missing data using heatmap**

In [None]:
# Null values visualization using heatmap
sns.heatmap(df.isnull())

The heatmap confirms that there are no null values in the data.

In [None]:
# Checking the columns of the dataset
df.columns.tolist()

# **Value Count on each column**

In [None]:
# Checking the value_counts of each column
for i in df.columns:
    print(df[i].value_counts())
    print("\n")

Using the value_counts method, we found the list of values present in each column. Here we can see that the columns containing '?' are 'Workclass', 'Occupation' and 'Native_country', and they are all categorical variables. So, we can fill them with the most frequently occurring value of the respective column, i.e. the mode.

In [None]:
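# regex=False treats '?' as a literal character rather than a regex quantifier
df['Workclass'] = df.Workclass.str.replace('?','Private',regex=False)
df['Occupation'] = df.Occupation.str.replace('?','Prof-specialty',regex=False)
df['Native_country'] = df.Native_country.str.replace('?','United-States',regex=False)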

**Rechecking for '?' values**

In [None]:
for i in df[['Workclass','Occupation','Native_country']]:
    print(df[i].value_counts())
    print("\n")

Now no '?' entries remain, which means the placeholders have been filled.
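The fill values in this notebook are typed in by hand after reading the value counts. As a sketch of an alternative, the mode can be derived from the data itself. The toy column below is hypothetical, and the snippet assumes the placeholder is exactly `'?'` with no surrounding whitespace (the real file may need `str.strip()` first):

```python
import pandas as pd

# Hypothetical toy column standing in for df['Workclass'].
toy = pd.DataFrame({"Workclass": ["Private", "?", "State-gov", "Private", "?"]})

# Compute the mode over the known values only, so '?' can never win.
known = toy.loc[toy["Workclass"] != "?", "Workclass"]
fill_value = known.mode()[0]

# Replace exact '?' entries with the mode.
toy["Workclass"] = toy["Workclass"].replace("?", fill_value)
print(fill_value, toy["Workclass"].tolist())
```

Excluding the placeholder before taking the mode matters when '?' happens to be the most common entry in a column.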

We can also see that the Capital_gain and Capital_loss columns consist mostly of 0s, which will affect the model's predictions. So let's drop these columns.

In [None]:
# Dropping the columns that consist mostly of 0s
df.drop(["Capital_gain","Capital_loss"],axis=1,inplace=True)

In [None]:
# Checking the dataframe
df.head(20)

In [None]:
# Checking the list of values of Income
df['Income'].value_counts()

These are the two unique values in the target variable, which are <=50K and >50K. The classes are imbalanced, so we need to balance the data using SMOTE before model building.
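To put a number on the imbalance, the counts can be turned into class shares. A small sketch on hypothetical labels (the real column is `df['Income']`, and its actual proportions are not reproduced here):

```python
import pandas as pd

# Hypothetical toy labels, three of one class and one of the other.
income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"])

# normalize=True returns fractions of the total instead of raw counts.
share = income.value_counts(normalize=True)
print(share)
```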

In [None]:
# Checking whether the Income column contains any blank (' ') entries
df.loc[df['Income']==" "]

No rows are returned, which means the Income column has no blank entries.

In [None]:
# Statistical summary of the dataset
df.describe()

This gives the statistical summary of the numeric columns. The summary looks clean, as there are no negative or invalid values. From the above description we can observe the following:

The counts of all columns are the same, which means there are no missing values in any column.

The mean is greater than the median (50%) in some columns, which means they are skewed to the right.

The mean and the median (50%) are almost equal in Education_num and Hours_per_week, which suggests these columns are roughly symmetric.

There is a huge difference between the 75th percentile and the max in some columns, which indicates that large outliers are present.

Summarizing the data, we can observe that the ages in the dataset range from 17 to 90 years.

# **Separating Categorical and numerical columns**

In [None]:
# checking for categorical columns
categorical_col = []
for i in df.dtypes.index:
    if df.dtypes[i] == "object":
        categorical_col.append(i)
print(categorical_col)

These are the categorical columns present in the dataset.

In [None]:
# checking for numerical columns
numerical_col = []
for i in df.dtypes.index:
    if df.dtypes[i] != "object":
        numerical_col.append(i)
print(numerical_col)

These are the columns having numerical data.
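The two loops above can also be written with `DataFrame.select_dtypes`. A sketch on a hypothetical three-column frame (the names mirror columns of this dataset, the values are made up); selecting on `"number"` avoids depending on how a given pandas version labels string columns:

```python
import pandas as pd

# Hypothetical toy frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "Age": [39, 50],
    "Workclass": ["State-gov", "Private"],
    "Hours_per_week": [40, 13],
})

# Numeric columns are selected directly; everything else is treated as categorical.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols, categorical_cols)
```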

# **Data Visualization**

**Univariate Analysis:**

In [None]:
# Visualizing whether the income is above 50K or not.
plt.figure(figsize=(10,10))
sns.countplot(x=df['Income'])
plt.show()

Most people have an income less than or equal to 50K. We can also observe that the classes are imbalanced, so they need to be balanced before model building.

In [None]:
# Visualizing the count of Workclass of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Workclass'])
plt.xticks(rotation=90)
plt.show()

The count of the Private workclass is high compared to the others. This means more people work in the private sector than in any other sector.

In [None]:
# Visualizing the count of Education of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Education'])
plt.xticks(rotation=90)
plt.show()

The count of HS-grad is higher than the others, at more than 10K.

In [None]:
# Visualizing the count of Marital_status of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Marital_status'])
plt.xticks(rotation=90)
plt.show()

Married people have the highest count, followed by those who are single or never married.

In [None]:
# Visualizing the count of Occupation of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Occupation'])
plt.xticks(rotation=90)
plt.show()

People in the Prof-specialty occupation have the highest count, and people in the Armed-Forces have the lowest.

In [None]:
# Visualizing the count of Relationship of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Relationship'])
plt.xticks(rotation=90)
plt.show()

The count is highest for the Husband category, at around 15K, and lowest for Other-relative.

In [None]:
# Visualizing the count of Race of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Race'])
plt.xticks(rotation=90)
plt.show()

The White group has a high count of around 30K, whereas the Other race category has the lowest count.

In [None]:
# Visualizing the count of Sex of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Sex'])
plt.xticks(rotation=90)
plt.show()

The count of males is higher than the count of females.

In [None]:
# Visualizing the count of Native_country of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Native_country'])
plt.xticks(rotation=90)
plt.show()

The United States has the highest count, at around 29K, and the other countries have very low counts.

# **Distribution Of Data(Skewness Visualization)**

**Plotting Numerical Columns**

In [None]:
# Checking the distribution of data in each numeric column

plt.figure(figsize=(20,25),facecolor="white")
plotnumber=1

for col in numerical_col:
    if plotnumber<6:
        ax = plt.subplot(3,2,plotnumber)
        sns.distplot(df[col],color='cyan')
        plt.xlabel(col,fontsize=20)
    plotnumber+=1
plt.tight_layout()

From the above distribution plots, the Age column looks close to normal, but its mean is greater than its median, so it is right-skewed.

The Fnlwgt (final weight) column is not normally distributed and is right-skewed.

The Education_num and Hours_per_week columns are not normally distributed either, and they show some skewness.

# **Bivariate Analysis**

In [None]:
# Visualizing the age of the person who have more income
plt.figure(figsize=(10,10))
sns.catplot(x='Income',y='Age',data=df,kind='strip')
plt.title('Comparison between Income and Age')
plt.show()

We can say that people below the age of 25 have an income less than or equal to 50K.

In [None]:
# Visualizing the Final Weight of the person who have more income
plt.figure(figsize=(10,10))
sns.catplot(x='Income',y='Fnlwgt',data=df,kind='strip',palette="gnuplot")
plt.title('Comparison between Income and Final Weight')
plt.show()

There is no significant relation between final weight and Income.

In [None]:
# Visualizing the number of Education with income
plt.figure(figsize=(10,10))
sns.catplot(x='Income',y='Education_num',data=df,kind='bar',hue="Sex",palette="gist_rainbow")
plt.title('Comparison between Income and Education_num')
plt.show()

Income is more than 50K for people with a high education number, and this holds for both sexes.

In [None]:
# Visualizing the number of Hours per week with income
plt.figure(figsize=(10,10))
sns.catplot(x='Income',y='Hours_per_week',data=df,kind='bar',hue='Sex',palette="coolwarm")
plt.title('Comparison between Income and Hours_per_week')
plt.show()

This shows how income is related to hours worked per week. For both males and females, income is >50K when the number of hours is high.

In [None]:
# Visualizing how the income changes with work class of the people
plt.figure(figsize=(10,10))
sns.catplot(x='Workclass',y='Education_num', data=df,kind='bar',hue="Income");
plt.title('Comparison between Workclass and Education_num')
plt.xticks(rotation=90)
plt.show()

People in government jobs (State-gov, Federal-gov, Local-gov) with a high education number have income >50K. Private-sector workers with an average education number come second among those earning >50K.

In [None]:
# Visualizing the relation between work class and Income of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df["Workclass"],hue=df["Income"])
plt.title("Comparison between Workclass and Income")
plt.show()

Most people working in the private sector have income <=50K, and comparatively few people in that sector have income >50K.

In [None]:
# Visualizing the relation between Education and Income of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df["Education"],hue=df["Income"],palette='ch:.25')
plt.title("Comparison between Education and Income")
plt.xticks(rotation=90)
plt.show()

People who have completed high school mostly have income <=50K, followed by those who completed secondary school. People who have completed their graduation more often earn more than 50K.

In [None]:
# Visualizing the relation between Marital status and Income of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df['Marital_status'],hue=df['Income'],palette="husl")
plt.xticks(rotation=90)
plt.show()

Married people account for most of the incomes >50K, and people who never married mostly earn <=50K.

In [None]:
# Visualizing the relation between Occupation and Income of the people
plt.figure(figsize=(10,10))
sns.countplot(x=df["Occupation"],hue=df["Income"],palette='Set2')
plt.title("Comparison between Occupation and Income")
plt.xticks(rotation=90)
plt.show()

The majority of people with income more than 50K are in the Prof-specialty, Other-service, Adm-clerical, Sales and Craft-repair occupations.

Very few people in the Handlers-cleaners and Farming-fishing occupations have income less than 50K.

In [None]:
# Visualizing the relation between Race and Income of the people
plt.figure(figsize=(10,10))
sns.countplot(x='Race',hue='Income',data=df,palette="husl")
plt.title("Comparison between Race and Income")
plt.xticks(rotation=90)
plt.show()

The White group has a much higher count of people with income <=50K compared to the other racial groups.

In [None]:
# Visualizing the relation between Income and Sex groups of the people
sns.catplot(x='Income',col='Sex',data=df,kind='count',palette="ch:.28")
plt.show()

The majority of people with income more than 50K are male.

In [None]:
# Visualizing the relation between Native country and Income of the people
plt.figure(figsize=(10,6))
sns.countplot(x=df["Native_country"],hue=df["Income"])
plt.title("Comparison between Native_country and Income")
plt.xticks(rotation=90)
plt.show()

People from the United States have far higher counts of high income than people from other countries.

In [None]:
# Visualizing how the income changes for Native country of the people
plt.figure(figsize=(15,15))
plt.title("Income in each Native Country")
sns.pointplot(x='Native_country',y='Education_num',data=df, hue='Income',join=False,palette="Set2",ci="sd")
plt.xticks(rotation=90)
plt.show()

Across countries, a higher education number goes with income above 50K.

# **Multivariate Analysis**

In [None]:
# Checking the pairwise relation in the dataset.
sns.pairplot(df,hue="Income",palette="husl")

This gives the pairwise relation between the columns, plotted on the basis of the target variable "Income". We can see that most of the features are highly correlated with each other. Some of the features have outliers and skewness, which need to be removed before model building.

# **Outliers**

In [None]:
# Visualizing the outliers present in the numerical columns

plt.figure(figsize=(10,8),facecolor="white")
plotnumber=1

for col in numerical_col:
    if plotnumber<=4:
        ax = plt.subplot(2,2,plotnumber)
        sns.boxplot(x=df[col],color='g')
        plt.xlabel(col,fontsize=12)
    plotnumber+=1
plt.tight_layout()

We can see that all columns have outliers present.

# **Removing Outliers**

**Zscore method**

In [None]:
# Removing outliers using zscore
from scipy.stats import zscore
col = df[["Age","Fnlwgt","Education_num","Hours_per_week"]]
z = np.abs(zscore(col))
z

In [None]:
# Creating new dataframe
new_df = df[(z<3).all(axis=1)]
new_df

This is the new dataframe after removing the outliers. We kept only the rows whose absolute z-score is below 3 in every selected column, i.e. we removed the rows with a z-score of 3 or more.
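To make the filter concrete, here is a minimal sketch on a hypothetical one-column frame with a single extreme value. It computes the z-score by hand with the population standard deviation (`ddof=0`), which is also the default used by `scipy.stats.zscore`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: nineteen ordinary values and one extreme value.
toy = pd.DataFrame({"Age": [30.0] * 19 + [300.0]})

# Absolute z-score per cell, using the population standard deviation.
z = np.abs((toy - toy.mean()) / toy.std(ddof=0))

# Keep a row only if every column has |z| < 3.
kept = toy[(z < 3).all(axis=1)]
print(len(toy), len(kept))
```

With 20 rows, the single extreme value has a z-score of about 4.36, so it is the only row dropped.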

In [None]:
# Checking the dimensions of both dataframes
print(df.shape)
print(new_df.shape)

# **Data loss percent**

In [None]:
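# Percentage of rows removed, computed from the two shapes instead of hard-coded counts
loss = (df.shape[0]-new_df.shape[0])/df.shape[0]*100
loss

Here we are losing only about 3% of the data. Let's check with the IQR technique.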

# **IQR**

In [None]:
# Removing outliers using IQR
Q1 = col.quantile(0.25)

Q3 = col.quantile(0.75)

#IQR
IQR = Q3-Q1

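# Compare only the numeric columns in `col`; comparing the full df would also hit the text columns
df1=df[~((col < (Q1 - 1.5 * IQR)) |(col > (Q3 + 1.5 * IQR))).any(axis=1)]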
df1

In [None]:
print(df.shape)
print(df1.shape)

In [None]:
loss = (df.shape[0]-df1.shape[0])/df.shape[0]*100
loss

Using the IQR method, the data loss is about 32%. As this is more than 10%, we will use the z-score method.
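For reference, the IQR rule keeps values inside the fences Q1 - 1.5*IQR and Q3 + 1.5*IQR. A sketch on a hypothetical single column (values made up for illustration):

```python
import pandas as pd

# Hypothetical toy column with one clearly extreme value.
toy = pd.DataFrame({"Hours_per_week": [35, 38, 40, 40, 40, 42, 45, 99]})

q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1

# Fences of the IQR rule, one value per column.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# A row is dropped if any column falls outside its fences.
outside = ((toy < lower) | (toy > upper)).any(axis=1)
kept = toy[~outside]
print(float(lower.iloc[0]), float(upper.iloc[0]), len(kept))
```

Here the fences are 34.625 and 47.625, so only the value 99 is dropped. On tightly clustered columns the fences can be narrow, which is one plausible reason the IQR rule discards far more rows than the z-score rule in this notebook.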

# **Removing Skewness**

In [None]:
# numeric_only=True skips the text columns, which have no skewness
new_df.skew(numeric_only=True)

As we can see, there is skewness in the Fnlwgt column. Let's reduce it using the cube root method.

In [None]:
# Reducing skewness using cube root method
new_df = new_df.copy()  # work on a copy to avoid SettingWithCopyWarning
new_df["Fnlwgt"] = np.cbrt(new_df['Fnlwgt'])
new_df.skew(numeric_only=True)

Now the skewness has been reduced.

In [None]:
# Visualizing the data distribution after removing skewness
sns.distplot(new_df["Fnlwgt"],color="b",kde_kws={"shade": True},hist=False)

And we can see that the distribution is now closer to normal.
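The effect of the cube root on a right-skewed column can be checked numerically. The sketch below uses a hypothetical log-normal sample as a stand-in for Fnlwgt (the real column is not log-normal by assumption; this only illustrates the direction of the change):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical right-skewed sample standing in for the Fnlwgt column.
weights = pd.Series(rng.lognormal(mean=12.0, sigma=0.6, size=5000))

before = weights.skew()          # skewness of the raw values
after = np.cbrt(weights).skew()  # skewness after the cube root transform
print(round(before, 2), round(after, 2))
```

The transform compresses large values more than small ones, which pulls in the long right tail and lowers the skewness without changing the ordering of the rows.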

# **Encoding the categorical columns**

In [None]:
categorical_col = ['Workclass','Education','Marital_status','Occupation','Relationship','Race','Sex','Native_country','Income']
from sklearn.preprocessing import LabelEncoder
lbl = LabelEncoder()
new_df[categorical_col] = new_df[categorical_col].apply(lbl.fit_transform)

In [None]:
new_df[categorical_col]

This dataframe is having the encoded numerical data now.
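`LabelEncoder` assigns integer codes in sorted order of the class names. A sketch on a hypothetical column shows the mapping it produces:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column standing in for one categorical feature.
toy = pd.Series(["Private", "State-gov", "Private", "Local-gov"])

encoder = LabelEncoder()
codes = encoder.fit_transform(toy)

# classes_ holds the original labels; the position of a label is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(list(codes), mapping)
```

Because the cell above reuses one encoder across columns via `apply`, only the last column's mapping stays on the object. Keeping one fitted encoder per column would be one way to decode predictions later with `inverse_transform`.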

# **Correlation**

In [None]:
# Checking the correlation between features and target variable
cor = new_df.corr()
cor

This gives the correlation between dependent and independent variables. We can visualize this by using heatmap.

In [None]:
# Visualizing correlation between dependent and independent variables by using heatmap
plt.figure(figsize=(10,10))
sns.heatmap(new_df.corr(),linewidths=0.1,fmt='.1g',cmap="seismic",annot=True)
#plt.xticks(rotation=90)

This heatmap visualizes the correlation matrix. We can observe the relation between one feature and another.

The heatmap contains both positive and negative correlations.

There is not much correlation between the features and the label.

The columns Education_num, Age, Sex and Hours_per_week have a positive correlation with the label.

The columns Relationship and Sex are highly correlated with each other. The column Fnlwgt has very low correlation with the label, so we can drop this column.

There is no multicollinearity issue in the data.

In [None]:
cor['Income'].sort_values(ascending=False)

Here we can easily find the positive and negative correlation between the features and the labels.

# **Visualizing the correlation between features and labels using bar plot.**

In [None]:
# Visualization using barplot.
plt.figure(figsize=(22,7))
new_df.corr()['Income'].sort_values(ascending=False).drop(['Income']).plot(kind='bar',color='r')
plt.xlabel('Feature',fontsize=18)
plt.ylabel('Target',fontsize=18)
plt.title('Correlation between label and features using bar plot',fontsize=20)
plt.show()

The columns Fnlwgt and Workclass have very little relation with the label, so we can drop these columns if necessary.

# **Separating the features and label variables**

In [None]:
x = new_df.drop("Income",axis=1)
y = new_df["Income"]

In [None]:
x.shape

In [None]:
y.shape

# **Feature Scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
x = pd.DataFrame(scaler.fit_transform(x),columns=x.columns)
x

We have scaled the data using standardization (StandardScaler) so that features on larger scales do not bias the model.
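One caveat worth noting: fitting the scaler on all rows lets statistics from the future test rows influence the training features. A sketch of the alternative, fitting on the training part only, is shown below on hypothetical random data (this is not what the notebook does; the variable names are made up). The same ordering would apply to oversampling, which is normally done on the training part only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features and binary labels.
features = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
labels = rng.integers(0, 2, size=200)

# Split first, so the test rows stay unseen during preprocessing.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.30, random_state=0
)

# The mean and standard deviation come from the training rows only.
scaler = StandardScaler().fit(f_train)
f_train_scaled = scaler.transform(f_train)
f_test_scaled = scaler.transform(f_test)  # reuses the training statistics
print(f_train_scaled.shape, f_test_scaled.shape)
```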

In [None]:
y.value_counts()

Here we can notice the class imbalance issue. Let's use SMOTE to balance the data.

# **Oversampling**

In [None]:
# Oversampling the data
from imblearn.over_sampling import SMOTE
SM = SMOTE()
x, y = SM.fit_resample(x,y)

In [None]:
y.value_counts()

The data is balanced now.

In [None]:
# Checking the dataframe after preprocessing and data cleaning.
new_df.head()

So, data cleaning and preprocessing is done. Now we can build the model.

# **Modeling**

# **Finding the best random state**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split as TTS
from sklearn.metrics import accuracy_score
maxAccu = 0
maxRS = 0
for i in range(1,200):
    x_train,x_test,y_train,y_test = TTS(x,y,random_state=i,test_size=.30)
    DTC = DecisionTreeClassifier()
    DTC.fit(x_train,y_train)
    pred = DTC.predict(x_test)
    acc = accuracy_score(y_test,pred)
    if acc>maxAccu:
        maxAccu = acc
        maxRS = i
print("Best accuracy is",maxAccu,"at random_state",maxRS)

# **Creating train_test_split**

In [None]:
x_train,x_test,y_train,y_test = TTS(x,y,random_state=maxRS,test_size=.30)

The data is split using the best random state.

# **Classification Algorithms**

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.metrics import classification_report,confusion_matrix,roc_curve,accuracy_score

# **DecisionTreeClassifier**

In [None]:
# checking accuracy_score for DecisionTreeClassifier
DTC = DecisionTreeClassifier()
DTC.fit(x_train,y_train)
predDTC = DTC.predict(x_test)
print(accuracy_score(y_test,predDTC))
print(confusion_matrix(y_test,predDTC))
print(classification_report(y_test,predDTC))

The accuracy using DecisionTreeClassifier is 83.88%

In [None]:
# Lets plot confusion matrix for DTC
cm = confusion_matrix(y_test,predDTC)
x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]
f,ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix for Decision Tree Classifier")
plt.show()

This is the confusion matrix for DecisionTreeClassifier, where we can read off the true positives, false positives, true negatives and false negatives. The predicted labels are plotted against the true labels.
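For a binary problem, the four cells can be unpacked directly from the matrix and turned into rates. A sketch on hypothetical labels (not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

# Rows are true classes and columns are predicted classes, so ravel()
# yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

tpr = tp / (tp + fn)  # true positive rate (recall)
fpr = fp / (fp + tn)  # false positive rate
print(tn, fp, fn, tp, tpr, fpr)
```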

# **RandomForestClassifier**

In [None]:
# checking accuracy_score for RandomForestClassifier
RFC = RandomForestClassifier()
RFC.fit(x_train,y_train)
predRFC = RFC.predict(x_test)
print(accuracy_score(y_test,predRFC))
print(confusion_matrix(y_test,predRFC))
print(classification_report(y_test,predRFC))

The accuracy score using RandomForestClassifier is 88.05%.

In [None]:
# Lets plot confusion matrix for RFC
cm = confusion_matrix(y_test,predRFC)
x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]
f,ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix for RandomForestClassifier")
plt.show()

# **LogisticRegression**

In [None]:
# checking accuracy_score for LogisticRegression
LR = LogisticRegression()
LR.fit(x_train,y_train)
predLR = LR.predict(x_test)
print(accuracy_score(y_test,predLR))
print(confusion_matrix(y_test,predLR))
print(classification_report(y_test,predLR))

The accuracy score using LogisticRegression is 75.24%.

In [None]:
# Lets plot confusion matrix for LogisticRegression
cm = confusion_matrix(y_test,predLR)
x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]
f,ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix for LogisticRegression")
plt.show()

# **KNeighborsClassifier**

In [None]:
# checking accuracy_score for KNeighborsClassifier
knn = KNN()
knn.fit(x_train,y_train)
predknn = knn.predict(x_test)
print(accuracy_score(y_test,predknn))
print(confusion_matrix(y_test,predknn))
print(classification_report(y_test,predknn))

The accuracy score using KNeighborsClassifier is 84.36%.

In [None]:
# Lets plot confusion matrix for KNeighborsClassifier
cm = confusion_matrix(y_test,predknn)
x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]
f,ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix for KNeighborsClassifier")
plt.show()

# **Gradient Boosting Classifier**

In [None]:
# checking accuracy_score for GradientBoostingClassifier
GB = GradientBoostingClassifier()
GB.fit(x_train,y_train)
predGB = GB.predict(x_test)
print(accuracy_score(y_test,predGB))
print(confusion_matrix(y_test,predGB))
print(classification_report(y_test,predGB))

The accuracy score using GradientBoostingClassifier is 85.35%.

In [None]:
# Lets plot confusion matrix for GradientBoostingClassifier
cm = confusion_matrix(y_test,predGB)
x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]
f,ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix for GradientBoostingClassifier")
plt.show()

# **XGBClassifier**

In [None]:
# checking accuracy_score for XGBClassifier
XGB = XGBClassifier()
XGB.fit(x_train,y_train)
predXGB = XGB.predict(x_test)
print(accuracy_score(y_test,predXGB))
print(confusion_matrix(y_test,predXGB))
print(classification_report(y_test,predXGB))

The accuracy score using XGBClassifier is 88.75%.

In [None]:
# Lets plot confusion matrix for XGBClassifier
cm = confusion_matrix(y_test,predXGB)
x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]
f,ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix for XGBClassifier")
plt.show()

# **ExtraTrees Classifier**

In [None]:
# checking accuracy_score for ExtraTreesClassifier
from sklearn.ensemble import ExtraTreesClassifier
XT = ExtraTreesClassifier()
XT.fit(x_train,y_train)
predXT = XT.predict(x_test)
print(accuracy_score(y_test,predXT))
print(confusion_matrix(y_test,predXT))
print(classification_report(y_test,predXT))

The accuracy score using ExtraTreesClassifier is 88.89%.

In [None]:
# Lets plot confusion matrix for ExtraTreesClassifier
cm = confusion_matrix(y_test,predXT)
x_axis_labels = ["0","1"]
y_axis_labels = ["0","1"]
f,ax = plt.subplots(figsize=(7,7))
sns.heatmap(cm,annot=True,linewidths=.2,linecolor="black",fmt=".0f",ax=ax,xticklabels=x_axis_labels,yticklabels=y_axis_labels)
plt.xlabel("PREDICTED LABEL")
plt.ylabel("TRUE LABEL")
plt.title("Confusion Matrix for ExtraTreesClassifier")
plt.show()

# **Checking the cross validation score**

In [None]:
from sklearn.model_selection import cross_val_score

In [None]:
# cv score for Decision Tree Classifier
print(cross_val_score(DTC,x,y,cv=5).mean())

In [None]:
# cv score for Random Forest Classifier
print(cross_val_score(RFC,x,y,cv=5).mean())

In [None]:
# cv score for Logistic Regression
print(cross_val_score(LR,x,y,cv=5).mean())

In [None]:
# cv score for KNN Classifier
print(cross_val_score(knn,x,y,cv=5).mean())

In [None]:
# cv score for Gradient Boosting Classifier
print(cross_val_score(GB,x,y,cv=5).mean())

In [None]:
# cv score for XGBClassifier
print(cross_val_score(XGB,x,y,cv=5).mean())

In [None]:
# cv score for ExtraTreesClassifier
print(cross_val_score(XT,x,y,cv=5).mean())

Above are the cross validation scores for the models used.

The difference between the accuracy score and the CV score is smallest for RandomForestClassifier, i.e. 0.41. So, we can conclude that RandomForestClassifier is the best-fitting model.

# **Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# RandomForestClassifier
parameters = {'criterion':["gini","entropy"],
             'max_features':['sqrt','log2'],  # 'auto' meant 'sqrt' for classifiers and was removed in scikit-learn 1.3
             'max_depth':[10,20,30,40,50],
             "min_samples_leaf":[2,3,4,5,6]}

In [None]:
GCV=GridSearchCV(RandomForestClassifier(),parameters,cv=5)

In [None]:
GCV.fit(x_train,y_train)

In [None]:
GCV.best_params_

In [None]:
# max_features='sqrt' is what 'auto' meant for classifiers; 'auto' was removed in scikit-learn 1.3
census = RandomForestClassifier(criterion='entropy', max_depth=40, max_features='sqrt',min_samples_leaf=2)
census.fit(x_train, y_train)
pred = census.predict(x_test)
acc=accuracy_score(y_test,pred)
print(acc*100)

The accuracy of the best model after tuning is 87.67%.
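Instead of copying the best parameters by hand, the fitted search object can hand back the tuned model directly. A sketch on a hypothetical synthetic dataset with a deliberately small grid (the grid and data are made up to keep the run short):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for x_train, y_train.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# A small illustrative grid, not the one used in this notebook.
small_grid = {"max_depth": [3, 5], "min_samples_leaf": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    small_grid,
    cv=3,
)
search.fit(X_toy, y_toy)

# With the default refit=True, best_estimator_ is already refit on all
# the data passed to fit, using the best parameter combination.
tuned = search.best_estimator_
print(search.best_params_)
```

Using `best_estimator_` avoids transcription mistakes when the winning parameters change between runs.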

# **Plotting ROC and compare AUC for all the models used**

In [None]:
# Plotting for all the models used here
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import RocCurveDisplay

disp = RocCurveDisplay.from_estimator(RFC, x_test, y_test)
RocCurveDisplay.from_estimator(DTC, x_test, y_test, ax=disp.ax_)  # ax_ = Axes holding the ROC curves
RocCurveDisplay.from_estimator(LR, x_test, y_test, ax=disp.ax_)
RocCurveDisplay.from_estimator(XT, x_test, y_test, ax=disp.ax_)
RocCurveDisplay.from_estimator(knn, x_test, y_test, ax=disp.ax_)
RocCurveDisplay.from_estimator(GB, x_test, y_test, ax=disp.ax_)
RocCurveDisplay.from_estimator(XGB, x_test, y_test, ax=disp.ax_)

plt.legend(prop={'size':11}, loc='lower right')
plt.show()

# **Plotting ROC and Compare AUC for the best model**

In [None]:
# Let's check the AUC for the best model after hyperparameter tuning
from sklearn.metrics import RocCurveDisplay
RocCurveDisplay.from_estimator(census, x_test, y_test)
plt.title("ROC for the best model")
plt.show()

The AUC for the best model is 0.95.
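The AUC shown in the plot legend can also be computed as a number with `roc_auc_score`, which takes the true labels and a score for the positive class (for a fitted classifier, typically `predict_proba(...)[:, 1]`). A sketch on hypothetical scores:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and positive-class scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (positive, negative) pairs ranked in the right order.
auc = roc_auc_score(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ranked correctly here, giving an AUC of 0.75.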

# **Saving The Model**

In [None]:
# Saving the model using .pkl
import joblib
joblib.dump(census,"Census Income Prediction.pkl")

# **Predicting with the saved model**

In [None]:
# Let's load the saved model and get the prediction

# Loading the saved model
model=joblib.load("Census Income Prediction.pkl")

#Prediction
prediction = model.predict(x_test)
prediction

In [None]:
pd.DataFrame([model.predict(x_test)[:],y_test[:]],index=["Predicted","Original"])