# **Problem Statement**

In this project, you first preprocess the data and then build an understanding
of its features through exploratory analysis and visualizations. With sufficient
knowledge of the attributes, you then perform a classification task: predicting
whether an individual earns more than 50,000 a year, using several machine
learning algorithms.


**Importing The Necessary Libraries**

In [1]:
import pandas as pd #data manipulation
import numpy as np  #numerical python
import matplotlib.pyplot as plt #data visualization
%matplotlib inline
import seaborn as sns #data visualization
import plotly.express as px #data visualization
import plotly.graph_objects as go #data visualization
import plotly.io as pio
import os
from IPython.display import Markdown
from plotly.subplots import make_subplots #to make subplots
import warnings
warnings.filterwarnings('ignore') #to ignore warnings

In [2]:
from sklearn.preprocessing import StandardScaler #for rescaling the data
from sklearn.preprocessing import LabelEncoder #for encoding

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score #evaluation metrics

**Importing The Dataset**

In [3]:
df=pd.read_csv('/content/adult.csv',na_values='?',skipinitialspace=True)

**Interpreting the Dataset**

In [4]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        46043 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       46033 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   47985 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [6]:
null_values=df.isnull().sum()
null_values

Unnamed: 0,0
age,0
workclass,2799
fnlwgt,0
education,0
educational-num,0
marital-status,0
occupation,2809
relationship,0
race,0
gender,0


In [7]:
#null values in percentage
(null_values/len(df))*100

Unnamed: 0,0
age,0.0
workclass,5.730724
fnlwgt,0.0
education,0.0
educational-num,0.0
marital-status,0.0
occupation,5.751198
relationship,0.0
race,0.0
gender,0.0


The columns **workclass**, **occupation**, and **native-country** contain null values.
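
The count and percentage of missing values can also be combined into one summary that lists only the affected columns. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from `adult.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
toy = pd.DataFrame({
    "age": [25, 38, 28, 44],
    "workclass": ["Private", np.nan, "Local-gov", "Private"],
    "occupation": ["Sales", np.nan, np.nan, "Tech-support"],
})

# Count and percentage of missing values per column
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": toy.isnull().mean() * 100,
})

# Keep only the columns that actually contain nulls
summary = summary[summary["missing"] > 0]
print(summary)
```

Filtering on `missing > 0` keeps the output short when most columns are complete, as is the case here.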

**Imputing the Null Values**

In [8]:
#imputing the workclass column with its mode (assign back; inplace=True on a column selection is deprecated in recent pandas)
df['workclass']=df['workclass'].fillna(df['workclass'].mode()[0])

In [9]:
#checking for null values again
df.isnull().sum()

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
educational-num,0
marital-status,0
occupation,2809
relationship,0
race,0
gender,0


In [10]:
#imputing the native-country column with its mode (assign back; inplace=True on a column selection is deprecated in recent pandas)
df['native-country']=df['native-country'].fillna(df['native-country'].mode()[0])

In [11]:
#checking for null values again
df.isnull().sum()

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
educational-num,0
marital-status,0
occupation,2809
relationship,0
race,0
gender,0
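
Mode imputation replaces each missing entry with the column's most frequent value. The following is a minimal sketch on a made-up column (`toy` is hypothetical). It assigns the filled column back to the frame, which avoids the chained-assignment warning that recent pandas versions raise when `inplace=True` is used on a column selection:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries
toy = pd.DataFrame({"workclass": ["Private", np.nan, "Private", "Local-gov", np.nan]})

# mode() returns a Series (ties are possible), so take the first entry
most_common = toy["workclass"].mode()[0]

# Assign the filled column back to the frame
toy["workclass"] = toy["workclass"].fillna(most_common)
print(toy["workclass"].value_counts())
```

Note that mode imputation inflates the share of the most frequent category, which is worth keeping in mind when reading the per-category plots.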


**Dropping the Null Values**

In [12]:
df.dropna(inplace=True)

In [13]:
df.isnull().sum()

Unnamed: 0,0
age,0
workclass,0
fnlwgt,0
education,0
educational-num,0
marital-status,0
occupation,0
relationship,0
race,0
gender,0


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 46033 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              46033 non-null  int64 
 1   workclass        46033 non-null  object
 2   fnlwgt           46033 non-null  int64 
 3   education        46033 non-null  object
 4   educational-num  46033 non-null  int64 
 5   marital-status   46033 non-null  object
 6   occupation       46033 non-null  object
 7   relationship     46033 non-null  object
 8   race             46033 non-null  object
 9   gender           46033 non-null  object
 10  capital-gain     46033 non-null  int64 
 11  capital-loss     46033 non-null  int64 
 12  hours-per-week   46033 non-null  int64 
 13  native-country   46033 non-null  object
 14  income           46033 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [15]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

# **Data Visualization**

In [16]:
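def save_and_display_plot(fig,filename,folder='plots'):
    #save the figure as a static image (plotly's image export relies on the kaleido package) and show it inline
    os.makedirs(folder,exist_ok=True)
    filepath=os.path.join(folder,filename)
    pio.write_image(fig,filepath)
    display(Markdown(f'![{filename}]({filepath})'))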

**Workclass vs Income**

In [17]:
fig=px.histogram(df,x=df['workclass'],color='income',barmode='group',title='workclass wrt income')
save_and_display_plot(fig,'workclass_income.png')

![workclass_income.png](plots/workclass_income.png)

1. **78.21%** of people in the **private sector** have an income of 50K or less, and **21.79%** have an income above 50K.

2. **44.6%** of people who are **self-employed** have an income of 50K or less, and **55.4%** have an income above 50K.
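
Percentages like these can be read off a row-normalized cross-tabulation instead of being estimated from the bars. The following is a minimal sketch on made-up data (`toy` and its values are hypothetical, so the shares do not match the figures above); in the notebook the same call would take `df['workclass']` and `df['income']`:

```python
import pandas as pd

# Hypothetical toy data, not the real dataset
toy = pd.DataFrame({
    "workclass": ["Private"] * 4 + ["Self-emp-inc"] * 2,
    "income": ["<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

# normalize="index" converts each row to shares that sum to 1
shares = pd.crosstab(toy["workclass"], toy["income"], normalize="index") * 100
print(shares.round(2))
```

Each row of `shares` sums to 100, so the two income columns give the split within a single workclass.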

**Marital Status vs Income**

In [18]:
fig=px.histogram(df,x=df['marital-status'],color=df['income'],barmode='group'
,title='marital status vs income')
save_and_display_plot(fig,'marital_status_income.png')

![marital_status_income.png](plots/marital_status_income.png)

**Occupation vs Income**

In [19]:
fig=px.histogram(df,x=df['income'],color='occupation',barmode='group'
,title='occupation wrt income')
save_and_display_plot(fig,'occupation_income.png')

![occupation_income.png](plots/occupation_income.png)

**Education wrt Income**

In [20]:
fig=px.histogram(df,x=df['income'],color='education',barmode='group',title='education wrt income')
save_and_display_plot(fig,'education_income.png')

![education_income.png](plots/education_income.png)

**Relationship vs Income**

In [21]:
fig=px.histogram(df,x=df['income'],color='relationship',barmode='group',title='relationship wrt income')
save_and_display_plot(fig,'relationship_income.png')

![relationship_income.png](plots/relationship_income.png)

**Race vs Income**

In [22]:
fig=px.histogram(df,x=df['income'],color='race',barmode='group',title='race wrt income')
save_and_display_plot(fig,'race_income.png')

![race_income.png](plots/race_income.png)

**Gender wrt Income**

In [23]:
fig=px.histogram(df,x=df['income'],color='gender',barmode='group',title='gender wrt income')
save_and_display_plot(fig,'gender_income.png')

![gender_income.png](plots/gender_income.png)

**Encoding the Categorical Variables**

In [24]:
def object_to_int(dataframe):
    #label-encode every object (string) column; modifies the dataframe it receives
    for i in dataframe.columns:
        if dataframe[i].dtype=='object':
            le=LabelEncoder()
            dataframe[i]=le.fit_transform(dataframe[i])
    return dataframe



In [25]:
df2=df.copy() #work on a copy so the original df keeps its string labels
df2=object_to_int(df2)


In [26]:
df2.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,2,226802,1,7,4,6,3,2,1,0,0,40,38,0
1,38,2,89814,11,9,2,4,0,4,1,0,0,50,38,0
2,28,1,336951,7,12,2,10,0,4,1,0,0,40,38,1
3,44,2,160323,15,10,2,6,0,2,1,7688,0,40,38,1
5,34,2,198693,0,6,4,7,1,4,1,0,0,30,38,0
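
`LabelEncoder` assigns integer codes in the sorted order of the labels, which is consistent with `Male` appearing as 1 in the output above. The following is a minimal sketch on a made-up list (`genders` is hypothetical) showing how to inspect that mapping and reverse it:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for one categorical column
genders = ["Male", "Female", "Female", "Male"]

le = LabelEncoder()
codes = le.fit_transform(genders)

# classes_ is sorted; each label's code is its position in that array
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
restored = [str(label) for label in le.inverse_transform(codes)]
print(restored)
```

Because `object_to_int` creates a new encoder for each column and does not keep it, the mappings cannot be recovered afterwards; storing each fitted encoder in a dictionary would be one way to preserve them.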


**Splitting the data into training and test set**

In [27]:
x=df2.drop('income',axis=1)
y=df2['income'] #1-D target; a one-column DataFrame triggers a shape warning in scikit-learn
x_train,x_test,y_train,y_test=train_test_split(x,y,train_size=0.7,random_state=555)

**Scaling the data**

In [28]:
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
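
The scaler is fitted on the training set only, and the test set is transformed with the training mean and standard deviation, so no information from the test set leaks into preprocessing. The following is a minimal sketch with made-up numbers (`train` and `test` are hypothetical single-feature arrays):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values standing in for one feature column
train = np.array([[20.0], [30.0], [40.0], [50.0]])
test = np.array([[35.0], [60.0]])

sc = StandardScaler()
train_scaled = sc.fit_transform(train)  # learns mean and std from train only
test_scaled = sc.transform(test)        # reuses the training mean and std

print(sc.mean_, sc.scale_)
print(test_scaled.ravel())
```

The scaled training data has mean 0 and unit standard deviation; the scaled test data generally does not, because it is measured against the training statistics.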

**Logistic Regression**

In [29]:
#fitting the model to data
lr=LogisticRegression()
lr.fit(x_train,y_train)

In [30]:
#prediction on x_test
pred1=lr.predict(x_test)

In [31]:
#evaluating the model
print(classification_report(y_test,pred1))

              precision    recall  f1-score   support

           0       0.84      0.94      0.89     10403
           1       0.70      0.46      0.55      3407

    accuracy                           0.82     13810
   macro avg       0.77      0.70      0.72     13810
weighted avg       0.81      0.82      0.80     13810
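
In the report above, class 1 (income above 50K) has a recall of 0.46, so the model misses more than half of the higher earners even though overall accuracy is 0.82. The following is a minimal sketch of how precision, recall, and F1 follow from the confusion matrix, using made-up labels (`y_true` and `y_pred` are hypothetical, not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels, not the notebook's predictions
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

The hand-computed values agree with the scikit-learn functions, which by default report the metrics for the positive class (label 1).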



**Decision Tree Classifier**

In [32]:
#fitting the model to data
dt=DecisionTreeClassifier()
dt.fit(x_train,y_train)

In [33]:
#predicting on x_test
pred2=dt.predict(x_test)

In [34]:
#evaluating the model
print(classification_report(y_test,pred2))

              precision    recall  f1-score   support

           0       0.87      0.87      0.87     10403
           1       0.61      0.61      0.61      3407

    accuracy                           0.81     13810
   macro avg       0.74      0.74      0.74     13810
weighted avg       0.81      0.81      0.81     13810



**Random Forest Classifier**

In [35]:
#fitting the model to data
rf=RandomForestClassifier()
rf.fit(x_train,y_train)

In [36]:
#predicting on x_test
pred3=rf.predict(x_test)

In [37]:
#evaluating the model
print(classification_report(y_test,pred3))

              precision    recall  f1-score   support

           0       0.88      0.93      0.90     10403
           1       0.73      0.62      0.67      3407

    accuracy                           0.85     13810
   macro avg       0.81      0.77      0.79     13810
weighted avg       0.85      0.85      0.85     13810



In [38]:
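#compute the accuracies from the predictions instead of hard-coding them
accuracy_report=pd.DataFrame({'models':['Logistic Regression','Decision Tree','Random Forest'],
                              'accuracy':[round(accuracy_score(y_test,p),2) for p in (pred1,pred2,pred3)]})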
accuracy_report

Unnamed: 0,models,accuracy
0,Logistic Regression,0.82
1,Decision Tree,0.81
2,Random Forest,0.85
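
These accuracies are easier to judge against a majority-class baseline, meaning a model that always predicts class 0. The following is a minimal sketch using the support counts printed in the classification reports above (the variable names are illustrative):

```python
# Support counts taken from the classification reports above
support_class_0 = 10403
support_class_1 = 3407

total = support_class_0 + support_class_1

# Accuracy of always predicting the larger class
baseline_accuracy = max(support_class_0, support_class_1) / total
print(round(baseline_accuracy, 2))
```

The baseline comes out at roughly 0.75, so the reported accuracies of 0.81 to 0.85 are a smaller gain over a trivial model than the raw numbers suggest. This is one reason to look at the per-class recall and F1 as well.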
