##  The Problem Statement:


Snap-Shot LLC is a startup that specializes in building mirrorless digital cameras. The company plans to introduce a new line of above-full-frame digital cameras. Historical statistics on digital camera sales suggest that a person earning more than a certain income is more likely to buy an expensive camera. The company therefore decides to launch a marketing campaign targeted at potential customers who earn more than $50,000 annually. The task is to identify, from census data, the individuals whose income exceeds that threshold.

### Solution:

In [32]:
# Core data and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import vtreat
import vtreat.util
from pprint import pprint

# scikit-learn: models, model selection, metrics and preprocessing
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, RandomForestRegressor)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.metrics import (auc, confusion_matrix, f1_score, precision_score,
                             r2_score, recall_score, roc_auc_score, roc_curve)
# Note: plot_roc_curve was removed in scikit-learn 1.2; newer versions
# provide RocCurveDisplay.from_estimator instead.
from sklearn.metrics import plot_roc_curve
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     RepeatedStratifiedKFold, cross_val_score,
                                     cross_validate, train_test_split)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

### Step 1: Data Exploration

#### Read raw data in DataFrame 

In [34]:
# List of csv files.
#Training Data.
csv_file = 'census-income.data'

#Simplified column names for training and testing data.
headers_easy = 'headers_easy.txt'

#Test data
csv_file_test = 'census-income.test'

In [35]:
#Read the training and test data into a dataframe.

#Training data.
df = pd.read_csv(csv_file,header=None)

#Testing data.
df_test = pd.read_csv(csv_file_test, header=None)

In [36]:
# Peek at the first few rows of the training data
df.head(3)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,73,Not in universe,0,0,High school graduate,0,Not in universe,Widowed,Not in universe or children,Not in universe,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,0,95,- 50000.
1,58,Self-employed-not incorporated,4,34,Some college but no degree,0,Not in universe,Divorced,Construction,Precision production craft & repair,...,United-States,United-States,United-States,Native- Born in the United States,0,Not in universe,2,52,94,- 50000.
2,18,Not in universe,0,0,10th grade,0,High school,Never married,Not in universe or children,Not in universe,...,Vietnam,Vietnam,Vietnam,Foreign born- Not a citizen of U S,0,Not in universe,2,0,95,- 50000.


#### The raw data has no header row, so read the column names from a separate file

In [37]:
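with open(headers_easy, 'r') as fp:
    # The first line holds tab-separated column names. rstrip('\n') removes only
    # a trailing newline, so no character is lost if the file does not end with one.
    columns_easy = fp.readline().rstrip('\n').lower().split('\t')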

In [38]:
# Set the simplified column names on the training and test datasets.
df.columns = columns_easy
df_test.columns = columns_easy
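
Because the column names come from a separate file, it can help to confirm that they line up with the data before assigning them. The following is a minimal sketch on toy data, not the notebook's actual files: `header_text` and `csv_text` are hypothetical stand-ins for `headers_easy.txt` and the census CSV.

```python
import io
import pandas as pd

# Hypothetical stand-ins for the header file and a headerless CSV.
header_text = "Age\tWorkerClass\tIncome\n"
csv_text = "73, Not in universe, - 50000.\n58, Private, 50000+.\n"

# Parse the header line the same way as the notebook: lowercase, tab-separated.
columns = header_text.rstrip("\n").lower().split("\t")

toy = pd.read_csv(io.StringIO(csv_text), header=None)

# Fail early with a readable message if names and columns do not line up.
assert len(columns) == toy.shape[1], "header count does not match column count"
assert len(set(columns)) == len(columns), "duplicate column names"

toy.columns = columns
print(toy.dtypes)
```

Note that `read_csv` keeps the space after each comma by default, which is why the string values in this data set carry leading whitespace.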

### Step 2: Data Cleanup

#### 2.1 Check Dimensions of data

In [39]:
shape = df.shape
print(shape)
print(df_test.shape)

(199523, 42)
(99762, 42)


#### 2.2 Check data type of each attribute

<p> Most of the fields in the data set are strings whose values represent the categories of that field. </p>

In [40]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199523 entries, 0 to 199522
Data columns (total 42 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   age                     199523 non-null  int64  
 1   workerclass             199523 non-null  object 
 2   industrycode            199523 non-null  int64  
 3   occupationcode          199523 non-null  int64  
 4   education               199523 non-null  object 
 5   wageperhour             199523 non-null  int64  
 6   student                 199523 non-null  object 
 7   maritalstatus           199523 non-null  object 
 8   majorindustrycode       199523 non-null  object 
 9   majoroccode             199523 non-null  object 
 10  race                    199523 non-null  object 
 11  hispanicorigin          199523 non-null  object 
 12  sex                     199523 non-null  object 
 13  memberofl               199523 non-null  object 
 14  unemploymentreason  

#### 2.3 Convert the label datatype to boolean.

<p>The 'income' field of the data set holds the strings " 50000+." and "- 50000." (see the preview above). The comparison relies on the exact string, including the leading space and the trailing period. Convert these values to booleans, where <br>
    
<b>True:</b> Individual has income higher than 50000 <br>
<b>False:</b> Individual has income less than 50000
</p>


In [41]:
def setlabels(df, label):
    """
      This function converts the labels to boolean values.
      input  : dataframe, column name.
      returns: dataframe.
    """
    # Use the column passed in via `label` rather than a hard-coded name.
    # The comparison is vectorized and yields True only for " 50000+.".
    df[label] = df[label] == " 50000+."
    return df


#Set labels for test and training data sets.
df = setlabels(df,'income')
df_test = setlabels(df_test,'income')
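
The label conversion depends on an exact string match, so a quick check of the resulting class balance can catch a mismatch (for example, every row turning out `False`). The sketch below uses toy labels rather than the census data, and strips whitespace before comparing; that stripping is an assumption added here for robustness, not something the notebook does.

```python
import pandas as pd

# Toy labels that mimic the raw strings (leading space, trailing period).
toy = pd.DataFrame({"income": [" - 50000.", " 50000+.", " - 50000.", " - 50000."]})

# Vectorized comparison; str.strip() makes the match independent of padding.
toy["income"] = toy["income"].str.strip() == "50000+."

# Share of each class; a heavily skewed split matters for later model evaluation.
balance = toy["income"].value_counts(normalize=True)
print(balance)
```

If the positive share is zero, the comparison string most likely does not match the raw values.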

#### 2.4 Convert the data types of required columns to categorical variables.

In [42]:
def convert_types(df):
    """
     This function converts the string fields to categorical values.
     A few fields are of type integer but are still categorical
     variables, because their integer values have no order.
     input: dataframe.
     output: modified dataframe.
    """
    obj_cols = df.select_dtypes(include='object').columns
    for col in obj_cols:
        df[col] = df[col].astype('category')
    # Integer-coded fields that are really unordered categories.
    df['industrycode'] = df['industrycode'].astype('category')
    df['occupationcode'] = df['occupationcode'].astype('category')
    # Casting to int drops any fractional part of the instance weight.
    df['instanceweight'] = df['instanceweight'].astype(int)
    return df

In [43]:
#Convert datatypes of the string columns to categorical columns.
df = convert_types(df)
df_test = convert_types(df_test)

In [44]:
# Get the list of categorical column names. This will be used when handling the test and training datasets.
cat_cols = df.select_dtypes(include='category').columns
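
Since the training and test frames are converted to `category` independently, a level that appears in only one of them gives the two frames different category sets and therefore different integer codes. One possible way to keep them consistent, sketched on toy data with a hypothetical `country` column, is to reuse the training categories through `CategoricalDtype`:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Toy frames; 'Vietnam' appears only in the test data.
train = pd.DataFrame({"country": ["United-States", "Mexico", "United-States"]})
test = pd.DataFrame({"country": ["Mexico", "Vietnam"]})

train["country"] = train["country"].astype("category")

# Reuse the training categories so both frames share one code mapping.
shared = CategoricalDtype(categories=train["country"].cat.categories)
test["country"] = test["country"].astype(shared)

# Levels unseen during training become NaN (code -1) and need explicit handling.
print(test["country"].cat.codes.tolist())
```

Whether unseen levels should become missing values or be mapped to a dedicated "other" level is a modelling decision that this sketch leaves open.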

In [45]:
# List the columns and their updated data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199523 entries, 0 to 199522
Data columns (total 42 columns):
 #   Column                  Non-Null Count   Dtype   
---  ------                  --------------   -----   
 0   age                     199523 non-null  int64   
 1   workerclass             199523 non-null  category
 2   industrycode            199523 non-null  category
 3   occupationcode          199523 non-null  category
 4   education               199523 non-null  category
 5   wageperhour             199523 non-null  int64   
 6   student                 199523 non-null  category
 7   maritalstatus           199523 non-null  category
 8   majorindustrycode       199523 non-null  category
 9   majoroccode             199523 non-null  category
 10  race                    199523 non-null  category
 11  hispanicorigin          199523 non-null  category
 12  sex                     199523 non-null  category
 13  memberofl               199523 non-null  category
 14  unem

In [46]:
df.to_pickle('df.pkl')
df_test.to_pickle('df_test.pkl')
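
Pickle is a reasonable choice here because it preserves the `category` and boolean dtypes set above, whereas a CSV round trip would turn the categories back into plain strings. A minimal sketch of reading a frame back and confirming the dtypes survive, using a toy frame and a temporary path rather than the notebook's `df.pkl`:

```python
import os
import tempfile
import pandas as pd

# Toy frame with one categorical and one boolean column.
toy = pd.DataFrame({"sex": pd.Categorical(["Female", "Male"]),
                    "income": [False, True]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Compare the dtypes before and after the round trip.
print(restored.dtypes.equals(toy.dtypes))
```

Pickle files can execute arbitrary code when loaded, so `pd.read_pickle` should only be used on files from a trusted source.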