# Introduction

In this notebook we will analyze the [Census Dataset](https://archive.ics.uci.edu/ml/datasets/census+income) from the UCI Machine Learning Repository.

The dataset contains three files: 


*   [adult.data](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data) - training set
*   [adult.names](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names) - dataset description
*   [adult.test](https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test) - test set 


The data contains anonymized information such as age, occupation, education, and working class. The goal is to train a binary classifier to predict the income, which has two possible values: '>50K' and '<=50K'. There are 48842 instances and 14 attributes in the dataset. The data contains a good blend of categorical and numerical attributes, along with missing values.

We will use **Logistic Regression** to train our model.





# 1. Importing Libraries

In [13]:
import numpy as np
import pandas as pd
import io
import requests
import seaborn as sns
from matplotlib import pyplot as plt
import pickle
from pandas.api.types import CategoricalDtype

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import FeatureUnion
from sklearn.model_selection import cross_val_score

%matplotlib inline

# Loading Data

*   The train and test datasets don't come with column names by default, so we assign the column names manually.
*   In certain rows there are whitespaces before and after the data values. You can pass a regex to the **sep** parameter of the pandas [read_csv](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function. We will use the regex ` *, *` as the separator, which matches a comma together with any spaces around it.
*   The missing values in the dataset are indicated by **'?'**. We will use the **na_values** parameter to mark them as missing.
*   The first row of the test dataset is not a data record, so we will use **skiprows=1** to skip it.




In [17]:
columns = ["age", "workClass", "fnlwgt", "education", "education-num",
           "marital-status", "occupation", "relationship", "race", "sex", 
           "capital-gain", "capital-loss", "hours-per-week", "native-country", "income"]

train_data = pd.read_csv('adult.data', names=columns, sep=r' *, *', na_values='?', engine='python')
test_data = pd.read_csv('adult.test', names=columns, sep=r' *, *', skiprows=1, na_values='?', engine='python')

Let's look at the first 5 rows of the training data

| | age | workClass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


Similarly, we look at the first 5 rows of the test dataset

Both the train and the test datasets contain the target variable '**income**'. It needs to be separated from the features before the data is passed to a machine-learning model.

# Exploratory Data Analysis

## Cleaning the data

Let's look for any missing values in both the train and the test dataset. We need to fill/remove these values 
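
One way to check for missing values is sketched below. It uses a small made-up frame (`toy`, a hypothetical stand-in for `train_data`), so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data; NaN marks a value that was read as '?'
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workClass": ["State-gov", np.nan, "Private"],
    "occupation": [np.nan, np.nan, "Sales"],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Names of the columns that contain at least one missing value
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)
```

The same two calls should work unchanged on `train_data` and `test_data` once they are loaded.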

**Observations on the train dataset**

*  Check the number of samples in the train dataset
*  There are both categorical and numerical attributes in the dataset
*  Which columns have missing values?



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16281 entries, 0 to 16280
Data columns (total 15 columns):
age               16281 non-null int64
workClass         15318 non-null object
fnlwgt            16281 non-null int64
education         16281 non-null object
education-num     16281 non-null int64
marital-status    16281 non-null object
occupation        15315 non-null object
relationship      16281 non-null object
race              16281 non-null object
sex               16281 non-null object
capital-gain      16281 non-null int64
capital-loss      16281 non-null int64
hours-per-week    16281 non-null int64
native-country    16007 non-null object
income            16281 non-null object
dtypes: int64(6), object(9)
memory usage: 1.9+ MB


**Observations on the test dataset**

*  Check the number of samples in the test dataset
*  Are there any columns with missing values?

## Handling Numerical Attributes

We will select all the numerical attributes from the dataset using the [select_dtypes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html) method of the pandas DataFrame.

Index(['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss',
       'hours-per-week'],
      dtype='object')
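
As a rough illustration of how `select_dtypes` separates columns by type, here is a sketch on a made-up frame (`toy` is hypothetical). The string column is created with an explicit object dtype so the selection behaves the same way as in this notebook.

```python
import pandas as pd

# Hypothetical frame with two numerical columns and one string column
toy = pd.DataFrame({
    "age": [39, 50],
    "education": pd.Series(["Bachelors", "HS-grad"], dtype="object"),
    "hours-per-week": [40, 13],
})

# 'number' matches every integer and float dtype
num_columns = toy.select_dtypes(include="number").columns
# String-valued columns are stored with the object dtype here
cat_columns = toy.select_dtypes(include=["object"]).columns

print(num_columns)
print(cat_columns)
```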


The variables **age** and **hours-per-week** are self-explanatory. 

*   **fnlwgt**: sampling weight
*   **education-num**: number of years of education in total
*   **capital-gain/capital-loss**: income from investment sources other than salary/wages

**fnlwgt** is not related to the target variable **income** and will be removed before building the model.



**Data Visualizations**

Plot histograms of the numerical values

Use describe to understand the numerical attributes


**Observations**

* None of the numerical attributes have missing values 
* The values are on different scales. Many machine learning models require the features to be on the same scale. 
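* We can see that there are many outliers present in the data. We will use [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) from the sklearn library to bring the features onto a common scale. Note that standardization rescales the values but does not remove the outliers.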
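
To make the scaling step concrete, here is a minimal sketch of `StandardScaler` on made-up numbers (the array `X` is hypothetical and only mimics two features on very different scales).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two features on very different scales
# (think of age and capital-gain)
X = np.array([[25.0, 0.0],
              [38.0, 2174.0],
              [52.0, 15024.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```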

## Handling Categorical Attributes
We will select all the categorical attributes from the dataset using the [select_dtypes](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.select_dtypes.html) method of the pandas DataFrame. Remember that categorical variables are stored with the object dtype in pandas.


Index(['workClass', 'education', 'marital-status', 'occupation',
       'relationship', 'race', 'sex', 'native-country', 'income'],
      dtype='object')


**Data-visualizations**

Visualize the workclass types with a seaborn countplot using 'income' as hue.

Visualize the occupation types with a seaborn countplot using 'income' as hue.

**Observations**

* All the variables are self-explanatory. 
* The column **education** is just a string representation of the column  **education-num**. We will drop the **education** column 
* The variables **workClass**, **occupation**, and **native-country** have missing values. We will replace the missing values in each column with the most frequently occurring value of that column (the **most_frequent** strategy).







We need to handle the numerical and categorical attributes differently. Numerical attributes need to be scaled, whereas for categorical attributes we need to fill the missing values and then encode the categories as numerical values. To apply this sequence of transformations we will use the sklearn [Pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html). We will also build custom transformers that can be used directly with Pipeline.

# Creating Pipelines

sklearn has many built-in transformers. However, if the built-in ones don't get the job done for you, you can build a custom transformer. All you need to do is inherit from the [BaseEstimator](http://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html) and [TransformerMixin](http://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html) classes. You also need to implement the **fit** and **transform** methods. 
* **fit** - should return an instance of self; add any logic that learns from the data here.
* **transform** - add the transformation logic here.

## ColumnsSelector Pipeline


The sklearn transformers used in this notebook don't manipulate pandas dataframes directly. We will write our own custom transformer, which will select the corresponding attributes (either numerical or categorical).

Create a ColumnsSelector class that does this work.
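
One possible shape for such a class is sketched below. This is only an assumption about the design, not the reference solution: the constructor argument `dtype_include` is a hypothetical name, and the selection simply delegates to `select_dtypes`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnsSelector(BaseEstimator, TransformerMixin):
    """Keep only the DataFrame columns whose dtype matches `dtype_include`."""

    def __init__(self, dtype_include):
        # `dtype_include` is a hypothetical parameter name
        self.dtype_include = dtype_include

    def fit(self, X, y=None):
        # Nothing is learned from the data, so just return self
        return self

    def transform(self, X):
        return X.select_dtypes(include=self.dtype_include)


# Made-up frame with one numerical and one string column
toy = pd.DataFrame({"age": [39, 50], "sex": ["Male", "Female"]})
numeric_part = ColumnsSelector("number").fit_transform(toy)
print(numeric_part.columns.tolist())
```

Because the class inherits from `TransformerMixin`, `fit_transform` is available without writing it by hand.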


## Numerical Data Pipeline

For the numerical data we create a numerical_pipeline. We select the numerical attributes using the **ColumnsSelector** transformer defined above and then scale the values using the StandardScaler included in sklearn.
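
A minimal sketch of what this pipeline could look like follows. The `ColumnsSelector` implementation, its `dtype_include` argument, and the step names `num_selector` and `std_scaler` are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnsSelector(BaseEstimator, TransformerMixin):
    # Assumed implementation: keeps the columns of the requested dtype
    def __init__(self, dtype_include):
        self.dtype_include = dtype_include

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.select_dtypes(include=self.dtype_include)


# Select the numerical columns, then standardize them
numerical_pipeline = Pipeline(steps=[
    ("num_selector", ColumnsSelector("number")),
    ("std_scaler", StandardScaler()),
])

# Made-up data: only 'age' is numerical
toy = pd.DataFrame({"age": [25, 38, 52], "sex": ["Male", "Female", "Male"]})
scaled = numerical_pipeline.fit_transform(toy)
print(scaled.shape)
```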

## Categorical Data Pipeline

### Handling missing values

We need to replace the missing values in the categorical columns. We will replace the missing values with the most frequently occurring value in each column. sklearn comes with [Imputer](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#) to handle missing values. However, **Imputer** works only with numerical values. We will write a custom transformer that accepts a list of columns whose missing values need to be replaced and the strategy used to fill them.

Let's build this CategoricalImputer.
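
The sketch below shows one way such a transformer might be written. The parameter names `columns` and `strategy` and the fitted attribute `fill_` are assumptions, and only the most-frequent strategy is implemented.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CategoricalImputer(BaseEstimator, TransformerMixin):
    def __init__(self, columns=None, strategy="most_frequent"):
        self.columns = columns
        self.strategy = strategy

    def fit(self, X, y=None):
        # Fall back to every column when no list is given
        cols = self.columns if self.columns is not None else X.columns
        if self.strategy != "most_frequent":
            raise ValueError(f"Unsupported strategy: {self.strategy}")
        # value_counts() sorts by frequency, so the first index entry is the mode
        self.fill_ = {c: X[c].value_counts().index[0] for c in cols}
        return self

    def transform(self, X):
        # Work on a copy so the caller's frame is not modified
        X_copy = X.copy()
        for column, value in self.fill_.items():
            X_copy[column] = X_copy[column].fillna(value)
        return X_copy


# Made-up column with one missing value
toy = pd.DataFrame({"workClass": ["Private", "Private", np.nan, "State-gov"]})
imputed = CategoricalImputer(columns=["workClass"]).fit_transform(toy)
print(imputed["workClass"].tolist())
```

Learning the fill values in `fit` and applying them in `transform` means the test data is filled with the values learned from the training data.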


### Encoding categorical values to numerical 

Most machine learning models expect numerical values, so we need to convert the categorical columns to numerical values. We will use [pd.get_dummies](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html). This is similar to using [OneHotEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), except that OneHotEncoder required numerical columns in older sklearn releases.

We need to merge the train and test datasets before using pd.get_dummies, as there might be classes in the test dataset that are not present in the training dataset. However, we need to pass only the encoded train data to the later transformers in the pipeline when we train the model. For this, in the fit method we concatenate the train and test datasets and find all the possible values for each column. In the transform method, when we pass either the train or the test dataset, we convert each column to the [Categorical](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.api.types.CategoricalDtype.html) type and specify the list of categories that the column can take. pd.get_dummies will then create a column of all zeros for any category in that list that does not occur in the data being transformed.
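
The effect of fixing the category list can be seen in a small sketch. The two series below are made up; 'Cuba' occurs only in the hypothetical test column.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical train/test columns: 'Cuba' appears only in the test data
train_col = pd.Series(["United-States", "India", "India"], name="native-country")
test_col = pd.Series(["Cuba", "India"], name="native-country")

# Collect every category seen in either dataset
all_categories = sorted(pd.concat([train_col, test_col]).unique())
dtype = CategoricalDtype(categories=all_categories)

# Encoding the train column alone still yields a column for 'Cuba'
train_dummies = pd.get_dummies(train_col.astype(dtype))
print(train_dummies.columns.tolist())
print(train_dummies["Cuba"].sum())
```

Without the explicit category list, the train data would produce two dummy columns and the test data a different two, so the model would see mismatched feature matrices.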

The transformer also takes an argument **dropFirst** which indicates whether we should drop the first column after creating dummy columns using pd.get_dummies. By default, the value is set to **True**

In [16]:
class CategoricalEncoder(BaseEstimator, TransformerMixin):
    """One-hot encode object columns using categories from train and test data."""

    def __init__(self, dropFirst=True):
        self.categories = dict()
        self.dropFirst = dropFirst

    def fit(self, X, y=None):
        # Relies on the global train_data and test_data so that categories
        # present in only one of the two datasets are still recorded
        join_df = pd.concat([train_data, test_data])
        join_df = join_df.select_dtypes(include=['object'])
        for column in join_df.columns:
            self.categories[column] = join_df[column].value_counts().index.tolist()
        return self

    def transform(self, X):
        X_copy = X.copy()
        X_copy = X_copy.select_dtypes(include=['object'])
        for column in X_copy.columns:
            # Fix the category list so every dataset yields the same dummy columns
            X_copy[column] = X_copy[column].astype(CategoricalDtype(self.categories[column]))
        return pd.get_dummies(X_copy, drop_first=self.dropFirst)
 

We now create our pipeline for handling the categorical attributes. The first transformer selects the categorical attributes from the dataframe, the second replaces the missing values, and the third encodes the categorical features as numerical features.

### Complete Categorical Pipeline

We select the categorical attributes using the **ColumnsSelector** transformer we defined above. The missing values are replaced by the **CategoricalImputer** transformer, and finally we encode the categorical values as numerical values using the **CategoricalEncoder** transformer. 

## Complete Pipeline

We have two transformer pipelines, i.e., **num_pipeline** and **cat_pipeline**. We can merge them using [FeatureUnion](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html).
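
To show how FeatureUnion combines two pipelines, here is a self-contained sketch. The two pipelines are simplified stand-ins built from `FunctionTransformer`, not the notebook's actual `num_pipeline` and `cat_pipeline`, and the data is made up.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

toy = pd.DataFrame({
    "age": [25, 38, 52],
    "sex": ["Male", "Female", "Male"],
})

# Simplified stand-in for the numerical pipeline: select, then scale
num_pipeline = Pipeline(steps=[
    ("select", FunctionTransformer(lambda df: df[["age"]])),
    ("scale", StandardScaler()),
])

# Simplified stand-in for the categorical pipeline: select, then one-hot encode
cat_pipeline = Pipeline(steps=[
    ("select", FunctionTransformer(lambda df: df[["sex"]])),
    ("encode", FunctionTransformer(
        lambda df: pd.get_dummies(df, drop_first=True).astype(float))),
])

# FeatureUnion runs both pipelines on the same input and
# concatenates their outputs column-wise
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

features = full_pipeline.fit_transform(toy)
print(features.shape)
```

The first output column is the scaled age and the second is the dummy column for 'Male'.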

# Building the Model

We now have all the pipelines for preprocessing our data. The next step is to prepare the data to be passed to the model. Let's drop 'fnlwgt' and 'education' from train_data and test_data.

## Preparing the data for training

In [7]:
# copy the data before preprocessing


# convert the income column to 0 or 1 and then drop the column for the feature vectors

# creating the feature vector 


# target values
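
The steps outlined in the comments above might look roughly like the sketch below. It runs on a made-up frame (`toy`), and the variable names `data_copy`, `X_features`, and `y_target` are hypothetical. The prefix match on '>50K' is a design choice that also tolerates label variants such as a trailing period.

```python
import pandas as pd

# Hypothetical stand-in for train_data
toy = pd.DataFrame({
    "age": [39, 50, 28],
    "fnlwgt": [77516, 83311, 338409],
    "education": ["Bachelors", "Bachelors", "Masters"],
    "income": ["<=50K", ">50K", "<=50K"],
})

# Copy first so the original frame is left untouched
data_copy = toy.copy()

# Encode the target: 1 for '>50K', 0 otherwise
data_copy["income"] = data_copy["income"].str.startswith(">50K").astype(int)

# Feature matrix: drop the target plus the two columns excluded earlier
X_features = data_copy.drop(columns=["income", "fnlwgt", "education"])

# Target values as a NumPy array
y_target = data_copy["income"].values

print(X_features.columns.tolist())
print(y_target)
```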



## Training the model

In [21]:
# pass the data through the full_pipeline


(32561, 81)


In [12]:
# Create a LogisticRegression model and train it



In [8]:
#show the coefficients of the model
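
As an illustration of training a logistic regression model and reading its coefficients, here is a sketch on synthetic data. `X_demo`, `y_demo`, and `model_demo` are hypothetical names; in the notebook the inputs would be the processed training features and target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the processed training data
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 3)
y_demo = (X_demo[:, 0] > 0).astype(int)  # target depends only on feature 0

model_demo = LogisticRegression(random_state=0)
model_demo.fit(X_demo, y_demo)

# One coefficient per feature; the shape is (1, n_features) for a binary target
print(model_demo.coef_)
print(model_demo.intercept_)
```

Since only the first feature drives the target in this toy setup, its coefficient should dominate the other two.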


# Testing the model

We need to use the same pipeline for preprocessing the test data set before testing the model

In [9]:
# take a copy of the test data set


# convert the income column to 0 or 1


# separating the feature vectors and the target values


In [25]:
# preprocess the test data using the full pipeline



(16281, 81)

In [26]:
# Predict the classes on the processed data



[0 0 0 ... 1 0 1]


# Model Evaluation

**We will use [accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) from sklearn to find the accuracy of the model** 

**Let's plot the [confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)**

**Interpretation**

* Y-axis represents the actual classes
* X-axis represents the predicted classes
* **_** times when the model correctly predicted 0 when the actual class was 0 (**True Negatives**)
* **_** times the model  predicted 0 when the actual class was 1 (**False Negatives**)
* **_** times the model  predicted 1 when the actual class was 0 (**False Positives**)
* **_** times the model correctly predicted 1 when the actual class was 1 (**True Positives**) 
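
The four quantities in the list above can be read directly from the confusion matrix. The sketch below uses made-up labels (`y_true` and `y_pred` are hypothetical), so the numbers do not reflect the census model.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical actual and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

# Rows are actual classes, columns are predicted classes,
# so ravel() yields TN, FP, FN, TP for a binary problem
tn, fp, fn, tp = cm.ravel()
print(accuracy)        # 0.75
print(tn, fp, fn, tp)  # 4 1 1 2
```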

# Cross Validation

We will use [StratifiedKFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) to divide our dataset into k folds. In each iteration, k-1 folds are used as the training set and the remaining fold is used as the validation set to validate the model. We use StratifiedKFold because it preserves the percentage of samples of each class. 

If we use [KFold](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html), we run the risk of introducing sampling bias, i.e., the train set might contain a large number of samples where income is greater than 50K while the test set contains more samples where income is less than 50K. In this case, the model built from the training data will not generalize well to the test dataset. StratifiedKFold, in contrast, keeps the class proportions roughly the same in both the train and test folds.

We will use the [cross_val_score](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html) function from the sklearn library to compute the score for each cross-validation fold. The parameter **cv** determines the cross-validation folds.

In [10]:
# create a cross validation model with the logistic regressor 
# find the scores from running the model using 5 folds



# Print the scores and the mean of them
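
A sketch of stratified 5-fold cross-validation on synthetic, imbalanced data follows. `X_demo`, `y_demo`, and `cross_val` are hypothetical names, and the shuffling with a fixed seed is an assumption added for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the processed training data
rng = np.random.RandomState(0)
X_demo = rng.randn(300, 4)
y_demo = (X_demo[:, 0] + 0.5 * rng.randn(300) > 0.8).astype(int)

# shuffle=True with a fixed seed keeps the folds reproducible
cross_val = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(random_state=0),
                         X_demo, y_demo, cv=cross_val)

# One accuracy score per fold, followed by their mean
print(scores)
print(scores.mean())
```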


# Fine Tuning the Model

By default, LogisticRegression takes the parameters below:

LogisticRegression(**C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False**)
          
We can fine-tune our model by playing around with the parameters. sklearn comes with [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to do an exhaustive search over specified parameter values for an estimator.
          

Creating the hyperparameter space

In [30]:
penalty = ['l1', 'l2']
C = np.logspace(0, 4, 10)
random_state=[0]

# creating a dictionary of hyperparameters
hyperparameters = dict(C=C, penalty=penalty, random_state=random_state)

**Use [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)  to find the optimal parameters**

In [11]:
#Get the best_model

#print the penalty and c for the best model
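
The grid search could be run along the lines of the sketch below. It uses synthetic data and a smaller grid than the one defined earlier to keep it quick; `X_demo`, `y_demo`, and `clf` are hypothetical names, and `solver='liblinear'` is set explicitly as an assumption because that solver accepts both the 'l1' and 'l2' penalties.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the processed training data
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 3)
y_demo = (X_demo[:, 0] + 0.3 * rng.randn(200) > 0).astype(int)

# Smaller grid than the notebook's, to keep the sketch fast
hyperparameters = dict(C=np.logspace(0, 4, 3), penalty=["l1", "l2"], random_state=[0])

# Exhaustive search over the grid with 5-fold cross-validation
clf = GridSearchCV(LogisticRegression(solver="liblinear"), hyperparameters, cv=5)
best_model = clf.fit(X_demo, y_demo)

# Penalty and C of the best estimator found
print(best_model.best_estimator_.get_params()["penalty"])
print(best_model.best_estimator_.get_params()["C"])
```

After fitting, `best_model.predict(...)` uses the best estimator refit on the whole training set, which is the default behaviour of GridSearchCV.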


**Predict the categories using the best model parameters**

[0 0 0 ... 1 0 1]


**Calculate the accuracy for the new model**

# Saving the model to pickle


We have done all the hard work of creating and testing the model. It would be good if we could save the model for future use rather than retrain it. We will save our model with the [pickle](https://docs.python.org/2/library/pickle.html) module. 

In [35]:
filename = 'final_model.sav'
# use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(model, f)

Loading the model from pickle

In [36]:
# use a context manager so the file is closed after reading
with open(filename, 'rb') as f:
    saved_model = pickle.load(f)
print(saved_model)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=0, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)


Now you can predict using the saved model. 
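
As a quick sanity check of the save/load round trip, the sketch below trains a small model on synthetic data, pickles it to a temporary file, and confirms that the restored model predicts the same classes. All names (`model_demo`, `restored`, `demo_model.sav`) are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the processed data
rng = np.random.RandomState(0)
X_demo = rng.randn(100, 3)
y_demo = (X_demo[:, 0] > 0).astype(int)
model_demo = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Save to a temporary directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.sav")
    with open(path, "wb") as f:
        pickle.dump(model_demo, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model gives the same predictions as the original
same_predictions = np.array_equal(restored.predict(X_demo),
                                  model_demo.predict(X_demo))
print(same_predictions)
```

Keep in mind that new data must go through the same preprocessing pipeline before it is passed to the loaded model, and that pickle files should only be loaded from trusted sources.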