# Income prediction using Machine Learning

In this tutorial, you explore developing machine learning models using Python.
The data set used is called UCI: Adult – Predict Income.
It is meant to be used to predict, based on census data, whether an individual's income is at most 50K or more than 50K.

## In this notebook

 - Find the API Docs for the running version of Pandas & scikit-learn
 - Run data exploration
 - Run data visualization
 - Run data preparation
 - Train models
 - Evaluate models
 - Save and load trained models


Let's take a look at what data we have here

In [None]:
%ls

Install Pandas & Scikit-learn libraries

In [None]:
!pip install pandas scikit-learn numpy matplotlib seaborn

# Data Exploration 

In [None]:
import pandas as pd

In [None]:
# Load the data from the CSV file into a pandas DataFrame

filePath = 'adult.csv'
 
df_data_1=pd.read_csv(filePath)

In [None]:
# Using head() method with an argument to display more rows of the dataset
df_data_1.head(n=20)

Note: Row 14 in the dataset contains a question mark in place of an unknown value, and many other rows are also missing values. Later, when preprocessing the data, all rows that contain missing values will be dropped.
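
Before dropping anything, it can help to know how many cells actually hold the placeholder. The sketch below is one way to count them. It uses a small hypothetical DataFrame called `toy` (not the real data set) and assumes the placeholder is the string `' ?'` with a leading space, as it appears in this data.

```python
import pandas as pd

# Hypothetical toy rows that mimic the ' ?' placeholder (note the leading space)
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " ?", " Private", " ?"],
    "occupation": [" Adm-clerical", " ?", " Sales", " Craft-repair"],
})

# isin() flags every cell equal to the placeholder; sum() counts them per column
placeholder_counts = toy.isin([" ?"]).sum()
print(placeholder_counts)

# Number of rows that contain at least one placeholder
rows_with_placeholder = int(toy.isin([" ?"]).any(axis=1).sum())
print(rows_with_placeholder)
```

Running the same two expressions on `df_data_1` should show which columns are affected and how many rows a later `dropna` would remove.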

In [None]:
# Using tail() to display last rows of the dataset
df_data_1.tail()

In [None]:
# Using dtypes to display the datatypes of each column 
df_data_1.dtypes

In [None]:
# Using describe to display the summary statistics of the numeric columns 
df_data_1.describe()

# Data Visualization

Seaborn will be used to create plots in order to visualize the dataset. Using Seaborn many different types of plots can be created. To browse the available visualization types, visit the Seaborn gallery at https://seaborn.pydata.org/examples/index.html. In this exercise two types of plots will be used, a count plot and a violin plot. 

1.	Import the Seaborn library and import pyplot from the matplotlib module.
2.	Assign a value to the style parameter in the Seaborn set method to change the appearance of the plot. The five Seaborn styles are darkgrid, whitegrid, dark, white and ticks. In the example below darkgrid is used.


In [None]:
#import necessary modules
import seaborn as sb
from matplotlib import pyplot as plt

#set the plot theme
sb.set(style = "darkgrid") 

3.	Create a count plot using the countplot method to see the number of males and females in each income category. Have SEX as the x value and SALARY as the hue. 


In [None]:
#create a count plot
# pass the column as a keyword argument; recent Seaborn versions reject it positionally
sb.countplot(x='sex', hue='salary', data=df_data_1)
#display plot
plt.show()

4.	Create a violin plot using the violinplot method to see the age distribution for each income category.  Have SALARY as the x value, AGE as the y value, and the df_data_1 as the data set.

In [None]:
#create and display a violin plot
sb.violinplot(x = "salary", y = "age", data = df_data_1)
plt.show()

5.	Set the hue of the violin plot to SEX and set split to True to see the age distribution based on gender. 

In [None]:
#create and display a violin plot
sb.violinplot(x = "salary", y = "age", hue="sex", data = df_data_1, split=True)
plt.show()

6. Find correlation between columns

In [None]:
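def plot_correlation(df, size=15):
    # correlate the numeric columns only; string columns cannot be correlated
    corr = df.select_dtypes(include='number').corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.show()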

In [None]:
plot_correlation(df_data_1)

# Data Preparation

1.	Now, we will be preparing to build a prediction model using scikit-learn. First, we need to clean up the data.  Recall how some rows contained ‘ ?’ instead of a value. We will begin by removing these rows.

I.	Import numpy

II.	Three different columns, workclass, occupation and country, contain ‘ ?’. Mark these values as missing by replacing ‘ ?’ with NaN.

III.	Drop all rows that contain missing values using the dropna method.


In [None]:
[df_data_1['workclass'].value_counts(), df_data_1['occupation'].value_counts(), df_data_1['country'].value_counts()]

In [None]:
import numpy

# mark ' ?' values as missing or NaN
df_data_1['workclass'] = df_data_1['workclass'].replace(' ?', numpy.nan)
df_data_1['occupation'] = df_data_1['occupation'].replace(' ?', numpy.nan)
df_data_1['country'] = df_data_1['country'].replace(' ?', numpy.nan)

# drop rows with missing values
df_data_1.dropna(inplace=True)

[df_data_1['workclass'].value_counts(), df_data_1['occupation'].value_counts(), df_data_1['country'].value_counts()]
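
The same clean-up can be sketched in a single chained expression. The example below uses a hypothetical three-column DataFrame named `toy`. It assumes that `' ?'` never appears as a legitimate value in any other column, which is why the replacement is applied to the whole frame rather than column by column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; only the first one has no placeholder
toy = pd.DataFrame({
    "workclass": [" Private", " ?", " State-gov", " Private"],
    "occupation": [" Sales", " ?", " ?", " Tech-support"],
    "country": [" United-States", " Cuba", " India", " ?"],
})

# Replace the placeholder everywhere, then drop every row that has a missing value
cleaned = toy.replace(" ?", np.nan).dropna()

print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Because this version returns a new DataFrame instead of modifying `toy` in place, the original rows stay available for comparison.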

2.	Also recall that many of the columns in the dataset contain Object (String) values. We are now going to convert these values to lowercase. For each String column, use the pandas map method to apply the lower method to all records. Use the head method to see the results.

In [None]:
df_data_1.dtypes

In [None]:
#convert all String values to lowercase
string_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship',  'race', 'sex', 'country', 'salary']

for col in string_columns:
    df_data_1[col] = df_data_1[col].map(lambda x: x.lower())

#display the initial records that are now lowercase
df_data_1.head()
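
As an alternative to `map` with a lambda, pandas also offers vectorized string methods through the `.str` accessor. The sketch below uses a hypothetical two-column DataFrame called `toy`. It also shows that lowercasing keeps the leading space in each value, which is why the encoded target column later in this notebook is named `'salary_ >50k'` with a space.

```python
import pandas as pd

# Hypothetical values that keep the leading space found in the data set
toy = pd.DataFrame({
    "sex": [" Male", " Female"],
    "salary": [" <=50K", " >50K"],
})

# .str.lower() lowercases every value in the column
for col in ["sex", "salary"]:
    toy[col] = toy[col].str.lower()

lowered = toy["salary"].tolist()
print(lowered)

# .str.strip() would remove the leading space; it is NOT applied in this notebook
stripped = toy["salary"].str.strip().tolist()
print(stripped)
```

If you do decide to strip the whitespace in your own copy of the notebook, remember that the column names produced by the encoding step change accordingly.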

3.	Convert the columns that have Object/String datatypes into numeric values using dummy encoding. Reminder: to see which columns are Strings you can use the dtypes attribute described earlier.


In [None]:
# Convert String Columns to numeric using one hot encoding
# use pd.concat to join the new columns with your original dataframe
all_numeric_df = pd.concat([df_data_1,pd.get_dummies(df_data_1[string_columns], prefix=string_columns, drop_first=True)],axis=1)

In [None]:
all_numeric_df.columns

In [None]:
all_numeric_df.dtypes

In [None]:
# now drop the original String column (you don't need it anymore)
all_numeric_df.drop(string_columns,axis=1, inplace=True)

In [None]:
all_numeric_df.dtypes
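
To see what `drop_first=True` does, here is a minimal sketch on a hypothetical two-column DataFrame named `toy`. Each column has two categories, so dummy encoding keeps only one indicator column per original column. The first category in sorted order is dropped.

```python
import pandas as pd

# Hypothetical, already lowercased values with the leading space preserved
toy = pd.DataFrame({
    "salary": [" <=50k", " >50k", " <=50k"],
    "sex": [" male", " female", " male"],
})

# One indicator column per category, minus the first category of each column
dummies = pd.get_dummies(
    toy,
    columns=["salary", "sex"],
    prefix=["salary", "sex"],
    drop_first=True,
)

encoded_columns = dummies.columns.tolist()
print(encoded_columns)
print(dummies.astype(int))
```

The surviving `'salary_ >50k'` column is 1 (or True) exactly when the income is above 50K, which makes it usable as a binary target.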

4.	Instead of transforming all attributes to numeric as in the previous step, we can also transform all attributes to Categorical/String types depending on the machine learning algorithm we are using. Try this on the age attribute. Instead of having age values ranging from 17 to 90, break the values into ten bins.  

I.	Instantiate the KBinsDiscretizer Object. Set n_bins to 10, encode to ordinal, and strategy to uniform. 

II.	Call the fit_transform method on the values in the AGE column. Print the results and notice that the values are assigned a bin from 0 to 9 based on how high the number is. 


In [None]:
from sklearn import preprocessing

#create an instance of the KBinsDiscretizer Object
bd = preprocessing.KBinsDiscretizer(n_bins=10, encode='ordinal', strategy='uniform')

#bin continuous data into intervals and print the result
print(bd.fit_transform([[x] for x in df_data_1['age']]))



Note: For the rest of this exercise we will use the original numeric attributes. If we had wanted to update the values in the dataset, we could have assigned the result to df_data_1['age'] instead of printing it.
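
To make the binning concrete, the sketch below applies the same discretizer settings to five hypothetical ages chosen to span 17 to 90. With the uniform strategy the range is cut into ten equal-width intervals of 7.3 years each. The fitted `bin_edges_` attribute exposes the boundaries.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Hypothetical ages covering the same 17..90 range as the data set
ages = np.array([[17], [20], [35], [50], [90]])

binner = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")
bins = binner.fit_transform(ages).ravel()

# bin_edges_ holds one array of edges per input feature (here: one feature)
edges = binner.bin_edges_[0]
print(edges)   # 11 edges delimit 10 bins
print(bins)    # bin index for each age
```

Ages 17 and 20 both fall in bin 0, 35 falls in bin 2, 50 in bin 4, and the maximum age 90 lands in the last bin, 9.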

# Training Set & Test Set
1.	Separate the data into feature and target variables. 

I.	The salary_ >50k column, created by the dummy encoding above, will be the target. 

II.	Select all columns other than salary_ >50k and assign them to a variable called numeric_features_df. 

III.	Assign the target variable equal to the salary_ >50k column.


In [None]:
#assign features columns to a DF variable
numeric_features_df = all_numeric_df.loc[:, all_numeric_df.columns != 'salary_ >50k']

#Set target set equal to prediction column
target = all_numeric_df['salary_ >50k']

numeric_features_df
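
An equivalent way to express this split is `DataFrame.drop`, sketched below on a hypothetical three-row DataFrame named `toy`. The column names mirror the ones produced earlier, but the values are invented.

```python
import pandas as pd

# Hypothetical, already encoded rows
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex_ male": [1, 1, 0],
    "salary_ >50k": [0, 1, 0],
})

target_column = "salary_ >50k"

# Everything except the target becomes the feature matrix
features = toy.drop(columns=[target_column])
target = toy[target_column]

print(features.shape, target.shape)
```

Keeping the column name in one variable avoids typing the string, including its easy-to-miss space, twice.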
    

2.	Split the data set into training and testing sets, with 20% of the data being used as the test data and 80% used as the training data. The training portion is then split once more, so that 20% of it is held out as a validation set.

I.	Import the train_test_split function from sklearn.model_selection.

II.	Split the data using the train_test_split function.



In [None]:
#import the necessary module
from sklearn.model_selection import train_test_split

#split data set into train and test sets
X_train, X_test, y_train, y_test = train_test_split(numeric_features_df, target, test_size = 0.2, random_state = 10)

#Creation of Train and validation dataset
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.2,random_state=5)

print ("Train dataset: {0}{1}".format(X_train.shape, y_train.shape))
print ("Validation dataset: {0}{1}".format(X_val.shape, y_val.shape))
print ("Test dataset: {0}{1}".format(X_test.shape, y_test.shape))
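
Because the second split is taken from the 80% that remains after the first, the final proportions are 64% train, 16% validation and 20% test. The sketch below checks that arithmetic on 1000 hypothetical rows of synthetic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 synthetic rows, used only to check the resulting sizes
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# First split: 20% test, 80% remaining
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

# Second split: 20% of the remaining 80% becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.2, random_state=5)

print(len(X_train), len(X_val), len(X_test))
```

The three sizes are 640, 160 and 200, matching 0.8 × 0.8, 0.8 × 0.2 and 0.2 of the data.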

# Building Models

1. Let's select a few algorithms used for classification

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

In [None]:
models = []
model_names = ['LR','Random Forest','Neural Network','GaussianNB','DecisionTreeClassifier','SVM','KNN']

models.append((LogisticRegression()))
models.append((RandomForestClassifier(n_estimators=10)))
models.append((MLPClassifier()))
models.append((GaussianNB()))
models.append((DecisionTreeClassifier()))
models.append((SVC()))
models.append((KNeighborsClassifier(n_neighbors=3)))

print (models)

2. Run k-fold cross-validation to build the models and find the one with the highest accuracy.

In [None]:
from sklearn import model_selection
from sklearn.metrics import accuracy_score

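# random_state only takes effect when shuffle=True; recent scikit-learn versions raise an error otherwise
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)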

for i in range(0,len(models)):    
    cv_result = model_selection.cross_val_score(models[i],X_train,y_train,cv=kfold,scoring='accuracy')
    print ('-'*40)
    print ('{0}: {1}'.format(model_names[i],cv_result))
    
    trained_model=models[i].fit(X_train,y_train)
    print ('-'*40)
    print ('{0}: {1}'.format(model_names[i],trained_model))
    
    prediction = models[i].predict(X_val)
    acc_score = accuracy_score(y_val,prediction)     
    print ('-'*40)
    print ('{0}: {1}'.format(model_names[i],acc_score))
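
The ten per-fold scores are easier to compare across models once they are reduced to a mean and a standard deviation. The sketch below shows this for a single model on synthetic data generated with `make_classification`. The data, the choice of LogisticRegression and `max_iter=1000` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for X_train / y_train
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold, scoring="accuracy"
)

# One accuracy value per fold, summarized as mean and spread
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

A model with a slightly lower mean but a much smaller spread may be the more dependable choice.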

3. Let's predict our test data and see prediction results

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
randomForestModel = RandomForestClassifier(n_estimators=100)
randomForestModel.fit(X_train,y_train)
prediction = randomForestModel.predict(X_test)

In [None]:
print ('-'*40)
print ('Accuracy score:')
print (accuracy_score(y_test,prediction))
print ('-'*40)
print ('Confusion Matrix:')
print (confusion_matrix(y_test,prediction))
print ('-'*40)
print ('Classification Report:')
print (classification_report(y_test,prediction))
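
The numbers in the classification report can be derived by hand from the confusion matrix. The sketch below does so for eight hypothetical labels and compares the result with scikit-learn's own metric functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hypothetical labels: 0 means income <=50K, 1 means income >50K
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are right
recall = tp / (tp + fn)                      # how many actual positives are found

print(tn, fp, fn, tp)
print(accuracy, precision, recall)
```

Here the matrix holds 4 true negatives, 1 false positive, 1 false negative and 2 true positives, giving an accuracy of 0.75 and a precision and recall of 2/3 each.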

4. Visualize model performance using a library called yellowbrick.

In [None]:
#install yellowbrick
!pip install yellowbrick


In [None]:
#import the necessary module 
from yellowbrick.classifier import ClassificationReport


Create a classification report for the RandomForest algorithm

I.	Instantiate the ClassificationReport visualizer, passing in the RandomForest object and the labels of the target classes

II.	Pass the training sets into the fit method 

III.	Pass the test sets into the score method

IV.	Use the poof method to display the results (Yellowbrick 1.0 renamed this method to show)


In [None]:
#Instantiate the classification model and visualizer
visualizer = ClassificationReport(randomForestModel, classes=['<=50','>50'])

#Fit the training data to the visualizer
visualizer.fit(X_train, y_train)
 
#Evaluate the model on the test data
visualizer.score(X_test, y_test) 

# Draw/show/poof the data
g = visualizer.poof() 


# Saving the trained model

In [None]:
#import the pickle library
import pickle

In [None]:
# save the model to disk
filename = 'finalized_randomForestModel.mdl'
# the with block closes the file even if dump() fails
with open(filename, 'wb') as f:
    pickle.dump(randomForestModel, f)
 

# Load the model later

In [None]:
# load the model from disk
filename = 'finalized_randomForestModel.mdl'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, y_test)
print(result)

# predict with the loaded model to confirm it behaves like the original
prediction = loaded_model.predict(X_test)
print ('-'*40)
print ('Accuracy score:')
print (accuracy_score(y_test,prediction))
print ('-'*40)
print ('Confusion Matrix:')
print (confusion_matrix(y_test,prediction))
print ('-'*40)
print ('Classification Report:')
print (classification_report(y_test,prediction))
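
A quick way to confirm that saving and loading preserved the model is to compare the predictions made before and after the round trip. The sketch below does this with a small forest trained on synthetic data and a temporary directory. The file name `model.pkl` and the data are hypothetical.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Write the model to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions exactly
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

Keep in mind that a pickled model should be loaded with the same scikit-learn version that wrote it, and only from sources you trust, since unpickling can execute arbitrary code.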

# Summary

What we have done in this notebook:

* Find the API Docs for the running version of Pandas & scikit-learn
* Run data exploration
* Run data visualization
* Run data preparation
* Train models
* Evaluate models
* Save and load trained models
 