# STATUS OF PAPER:  
  
  
<font color='blue'>  
80% - Define and prep class variables (10 points)  </font>  
<font color='blue'>90% - Describe the final dataset (5 points)  </font>  
<font color='red'>0% -  Choose and explain your evaluation metrics (10 points)  </font>   
<font color='red'>0% -  Choose the method for training/testing split (10 points)  </font>  
<font color='red'>0% -  Create three classification models (20 points)  </font>   
<font color='red'>0% -  Analyze results (10 points)  </font>  
<font color='red'>0% -  Discuss advantages of each classification model (10 points)  </font>   
<font color='red'>0% -  Which attributes are most important? (10 points)  </font>  
<font color='green'>100% - How useful is your model? (5 points)  </font>  
<font color='red'>0% -  Exceptional work (10 points)  </font>  
  
  


# Lab 2: Classification and/or Regression



## Dataset Selection  

<font color='blue'> Select a dataset identically to the way you selected for the first project work week and mini-project.
You are not required to use the same dataset that you used in the past, but you are encouraged.
You must identify two tasks from the dataset to regress or classify. That is:  
• two classification tasks OR  
• two regression tasks OR  
• one classification task and one regression task  
For example, if your dataset was from the diabetes data you might try to predict two tasks: (1)
classifying if a patient will be readmitted within a 30 day period or not, and (2) regressing what the
total number of days a patient will spend in the hospital, given their history and specifics of the
encounter like tests administered and previous admittance. 
</font>

For this lab assignment we have chosen the "income" dataset that was donated to the UCI Machine Learning Repository.  This dataset includes information on over 32,000 individuals, including their age, marital status, education, and income, among other attributes.  
  
We will perform two classification tasks on this data.  The first will be classifying individuals into two classes based upon whether or not they earned more than 50,000 dollars a year.  This is the classification task for which the data was gathered, and it will help provide insight into which factors best predict an individual's income level.  
  
The second classification task will be to classify individuals by their marital status, such as married, never married, or divorced.  This will provide insight into how well economic factors can predict an individual's marital status, helping to answer questions such as "do married people generally earn more?" or "does earning more increase the odds of being or becoming divorced?".  These questions can be of great interest to social scientists and legislators who may want to put forward new policy to account for, or attempt to influence, any trends found by these classification tasks.

## Data Preparation (15 points total)
### • [10 points] Define and prepare your class variables.  
<font color='blue'>Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for
dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for
the analysis.</font>

In the code below we import the income dataset and prepare the class variables.  All variables in this dataset are either categorical variables or integer variables.  The integer variables are on different scales: for example, the hours-per-week variable is an integer that generally falls below 100, while the capital-gain variable can have values in the thousands.  For this reason we will standardize the integer values as part of our pre-processing.  
  
Furthermore, the categorical variables are represented by strings, which can pose issues for some of the methods we will use in our classification tasks.  Therefore we will also encode the strings as integer labels as part of our pre-processing.

In [2]:
# This code imports the packages we will be using, and sets parameters for the matplotlib.pylab package.

import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.pylab as pylab
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (15, 5),
         'axes.labelsize': 'x-large',
         'axes.titlesize':'x-large',
         'xtick.labelsize':'x-large',
         'ytick.labelsize':'x-large'}
pylab.rcParams.update(params)
%matplotlib inline

In [3]:
# Path to where all of the data set files reside
path = 'C:/Users/ledbeg1/data'

In [13]:
# This code reads in the initial csv file
filename = path + '/income.csv'  # forward slash avoids the invalid '\i' escape sequence
df_income = pd.read_csv(filename) # read in the csv file
df_income.info()
df_income.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
target            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [14]:
# This code turns all of our string entries into integer entries for the purposes of feeding them into the SVM.
# The code in this section is based upon examples found at scikit-learn.org/stable/modules/preprocessing.html#label-encoding

from sklearn import preprocessing

categorical_cols = ['workclass', 'native-country', 'education', 'marital-status',
                    'occupation', 'relationship', 'race', 'sex']
for col in categorical_cols:
    # Fit a fresh encoder per column so each column gets its own 0..n-1 label mapping.
    df_income[col] = preprocessing.LabelEncoder().fit_transform(df_income[col])

df_income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,<=50K
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,<=50K
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,<=50K
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,<=50K
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,<=50K
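One caveat with label encoding is that the integer codes imply an ordering (for example, workclass 7 "greater than" workclass 4) that nominal categories do not really have, which distance-based models may misread. A minimal sketch of the one-hot alternative follows, assuming a small hypothetical `sample` frame built from the first rows shown earlier rather than the full `df_income`:

```python
import pandas as pd

# Hypothetical three-row sample standing in for df_income
# (values copied from the first rows of the raw data shown earlier).
sample = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
})

# One-hot encode only the nominal column; numeric columns pass through unchanged.
one_hot = pd.get_dummies(sample, columns=["workclass"], dtype=int)
print(one_hot.columns.tolist())
```

Each category becomes its own 0/1 column, at the cost of a wider table; for high-cardinality columns such as native-country this trade-off may or may not be worthwhile.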


In [19]:
# This code creates a target array with 0 representing 50,000 USD a year or less and 1 representing more than 50,000 USD a year.
# The original comparison used " <=50K" with a leading space; stripping whitespace handles labels with or without it.

target = (df_income['target'].str.strip() != '<=50K').astype(int)

In [21]:
# This code replaces the existing 'target' variable with our newly created array of integer representations.

df_income['income'] = target
df_income = df_income.drop('target', axis=1)
df_income.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,7,77516,9,13,4,1,1,4,1,2174,0,40,39,0
1,50,6,83311,9,13,2,4,0,4,1,0,0,13,39,0
2,38,4,215646,11,9,0,6,1,4,1,0,0,40,39,0
3,53,4,234721,1,7,2,6,0,2,1,0,0,40,39,0
4,28,4,338409,9,13,2,10,5,2,0,0,0,40,5,0


<font color='red'>
# TO DO  
1. Possibly remove the fnlwgt variable?  
2. standardize the integer values.
</font>
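For the second to-do item, one possible approach is scikit-learn's `StandardScaler`, which rescales each column to zero mean and unit variance. The sketch below is an assumption about how this could be done, not the final pre-processing; `sample` and `numeric_cols` are hypothetical names, and the values are the first five rows shown above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample rows; in the notebook this would be df_income.
sample = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "capital-gain": [2174, 0, 0, 0, 0],
    "hours-per-week": [40, 13, 40, 40, 40],
})
numeric_cols = ["age", "capital-gain", "hours-per-week"]

# Fit the scaler and build a new frame of standardized values.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(sample[numeric_cols]),
    columns=numeric_cols,
    index=sample.index,
)
print(scaled.round(2))
```

When the data is later split for cross validation, the scaler would ideally be fitted on the training portion only and then applied to the test portion, so that no information from the test rows leaks into training.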

### • [5 points] Describe the final dataset that is used for classification/regression  

<font color='blue'>
(include a description of any newly formed variables you created). </font>

The final dataset we have created consists only of integer values: numeric variables, categorical variables encoded as integer labels, and binary variables encoded as 0 or 1.  For one classification task the response variable will be the "income" variable, where 0 represents an income of 50,000 dollars a year or less and 1 represents an income of more than 50,000 dollars a year.  For the other classification task the response variable will be the "marital-status" variable, a categorical variable encoded as an integer label for each marital status (married, never married, divorced, etc.).  
  
Variables:  
age - integer - the individual's age in years.  
workclass - categorical(integer) - the class of the worker's employment (self-employed, public, private, etc.).  
fnlwgt - integer - the sampling weight assigned to the record by the Census Bureau.  
education - categorical(integer) - the level of education obtained (high school, bachelor's degree, master's degree, etc.).  
education-num - integer - the level of education obtained as represented by number of years of education.  
marital-status - categorical(integer) - the current marital status of the individual (married, never married, divorced, etc.).  
occupation - categorical(integer) - the type of work the individual is employed in (executive, janitorial, etc.).  
relationship - categorical(integer) - what part the individual plays in their current relationship (husband, wife, etc.).  
race - categorical(integer) - the individual's race.  
sex - categorical(binary) - the individual's gender where 0 is female and 1 is male.  
capital-gain - integer - the individual's amount of capital gain as measured in US dollars.  
capital-loss - integer - the individual's amount of capital loss as measured in US dollars.  
hours-per-week - integer - the number of hours the individual works per week.  
native-country - categorical(integer) - the native country for that individual (USA, Ecuador, etc.).  
income - categorical(binary) - the individual's general income level where 0 is 50,000 US dollars of income a year or less and 1 is more than 50,000 US dollars of income a year.  



## Modeling and Evaluation (70 points total)  
### • [10 points] Choose and explain your evaluation metrics that you will use 
(i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s)
appropriate for analyzing the results of your modeling? Give a detailed explanation
backing up any assertions.  
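As a hedged illustration of how the metrics named in this prompt relate to one another, the sketch below computes them with `sklearn.metrics` on a small hand-made set of labels. The labels are invented for illustration only and are not results from the income data.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Hand-made labels for illustration only: 1 = income above 50K, 0 = otherwise.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels the confusion matrix flattens to (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(tn, fp, fn, tp, accuracy, precision, recall, f1)
```

In this toy case accuracy is 0.7 while recall on the positive class is only 0.5, which shows why accuracy alone can hide poor performance on a minority class. For the multi-class marital status task, `precision_score`, `recall_score`, and `f1_score` would need an `average` argument such as `"macro"` or `"weighted"`.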
### • [10 points] Choose the method you will use for dividing your data into training and testing splits 
(i.e., are you using Stratified 10-fold cross validation? Why?). Explain why
your chosen method is appropriate or use more than one method as appropriate.  
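One candidate for the split is stratified 10-fold cross validation, which keeps the class proportions of the full dataset in every fold. The sketch below is a minimal illustration on synthetic data; `X` and `y` are stand-ins (an assumption), not the income features and labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data: 200 rows, 25% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.array([0] * 150 + [1] * 50)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Record the share of positive labels in each test fold.
fold_rates = []
for train_idx, test_idx in skf.split(X, y):
    fold_rates.append(y[test_idx].mean())

print([round(float(rate), 2) for rate in fold_rates])
```

Every test fold ends up with roughly the same positive rate as the full dataset, so no fold is evaluated on an unrepresentative mix of classes.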
### • [20 points] Create three different classification/regression models 
(e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or
logistic regression). Adjust parameters as appropriate to increase generalization
performance using your chosen metric.  
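A possible starting point for this step is sketched below: three candidate classifiers compared with the same stratified folds. The choice of models, the hyperparameter values, and the F1 scoring are all assumptions for illustration, and the data is synthetic rather than the income dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and binary labels (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # KNN and logistic regression are scale-sensitive, so scaling is fitted inside each fold.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
}

# Fixed random_state so every model is scored on the same folds.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    mean_scores[name] = scores.mean()
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Wrapping the scaler and the classifier in one pipeline means the scaling parameters are learned from the training folds only, which keeps the cross-validated scores honest.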
### • [10 points] Analyze the results using your chosen method of evaluation. 
Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why
they are interesting to someone that might use this model.  
### • [10 points] Discuss the advantages of each model for each classification task, if any. 
If there are not advantages, explain why. Is any model better than another? Is the
difference significant with 95% confidence? Use proper statistical comparison methods.  
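One way to approach the 95% confidence question is a confidence interval on the per-fold difference in accuracy between two models scored on identical folds. The sketch below assumes a t-based interval with k - 1 degrees of freedom and uses synthetic data; because cross-validation folds share training rows, the interval is only approximate.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption), scored with identical folds for both models.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

acc_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
acc_b = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=cv)

# Paired per-fold differences and their 95% t-interval.
diff = acc_a - acc_b
k = len(diff)
mean_diff = diff.mean()
std_err = diff.std(ddof=1) / np.sqrt(k)
t_crit = stats.t.ppf(0.975, df=k - 1)
ci_low = mean_diff - t_crit * std_err
ci_high = mean_diff + t_crit * std_err

# The difference is called significant only if the interval excludes zero.
significant = not (ci_low <= 0.0 <= ci_high)
print(f"mean difference = {mean_diff:.3f}, 95% CI = [{ci_low:.3f}, {ci_high:.3f}], significant = {significant}")
```

If the interval contains zero, the data does not support claiming that either model is better at the 95% level.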
### • [10 points] Which attributes from your analysis are most important? 
Use proper methods discussed in class to evaluate the importance of different attributes. Discuss
the results and hypothesize about why certain attributes are more important than others
for a given classification task.  
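As one hedged example of an importance method, a random forest exposes impurity-based importances through its `feature_importances_` attribute. The sketch below uses synthetic data, and the column names `f0` to `f5` are hypothetical labels rather than the income attributes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the column names are hypothetical labels.
feature_names = ["f0", "f1", "f2", "f3", "f4", "f5"]
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor features with many distinct values, so it may be worth cross-checking them against `sklearn.inspection.permutation_importance` on held-out data.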

## Deployment (5 points total)  
### • [5 points] How useful is your model for interested parties 

<font color='blue'>

(i.e., the companies or organizations that might want to use it for prediction)? How would you measure the
model's value if it was used by these parties? How would you deploy your model for
interested parties? What other data should be collected? How often would the model
need to be updated, etc.?

</font>

Parties that may be interested in our model include economists, social scientists, and legislators.  Economists will be interested in which factors contribute to an individual's general income level so that they can better understand the economic factors that affect the population (in our case US workers).  Social scientists will be interested in which factors contribute to a person's marital status so that they can better understand how the economic health of a family affects the health of the marriage.  Finally, legislators and other government officials will be interested in both of these effects so that they can draft legislation to support the welfare of their citizens and the economic health of their country as a whole.  
  
We would measure the model's value for these parties by how accurately it predicts both of its classifications.  The more accurately the model predicts the class of each individual, the more useful any action taken by legislators and other interested parties can be expected to be.  For example, if the classification correctly predicts an individual's income level or marital status only 20% of the time, then an economist or social scientist will be unlikely to include the model in their class materials.  However, if the classification is 95% accurate, then not only would these scientists be wise to teach concepts based on the model results, but the model could also be used reliably in their future research.  Likewise, if a legislator drafts a new law based upon the results of an inaccurate model, that law will be ineffective, because the effects it was meant to address were products of modeling error rather than real trends.  
  
For economists and social scientists we would deploy this model by sharing both the results of our model and the code that produced those results, which will allow them to perform similar classification on other populations of interest to their studies and research.  For legislators we would instead show only the results and not the code (which they would likely not be trained to read) in order to provide a proof of concept.  After obtaining funding from those legislators we would then move forward with applying this model to Census data for their country, which would not only help train the model to that country's specific situation but would also provide historical insight into the situations the country's citizens have recently found themselves in.  Finally, we would prepare the code such that when new Census data is received, one could quickly feed that data into the model and receive actionable insights tailored specifically to that country's situation.  
  
Other data that should be collected for this model includes historical Census data specific to the country to which we are applying the model (as the existing data is restricted to a population of US individuals), as well as future Census data collected after the initial model has been built.  In this situation the model would need to be updated every Census cycle as new data is collected, which for the US would be every 10 years.  If the customer for the model is an economist or social scientist who is unaffiliated with any government entity but is interested in doing their own research, then the additional data needed will be survey data from the population of interest for their research.  In this situation the model will need to be updated any time new survey data is collected.  

## Exceptional Work (10 points total)
### • You have free rein to provide additional modeling.  
### • One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes.  
Which parameters are most significant for making a
good model for each classification algorithm?
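A minimal sketch of the parallelized grid search idea follows, using scikit-learn's `GridSearchCV` with `n_jobs=-1`. The KNN pipeline, the parameter grid, and the F1 scoring are assumptions chosen for illustration, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared features and binary labels (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipeline = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    n_jobs=-1,  # evaluate parameter combinations in parallel on all available cores
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The per-combination scores are stored in `search.cv_results_`, which can be loaded into a pandas DataFrame and plotted to see which parameters move the metric the most.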