# Project Overview - Income Classification
We will use a classic dataset in the data science world - the adult census income data. This dataset was extracted from the 1994 Census Bureau database by Ronny Kohavi and Barry Becker and is publicly available on the UCI Machine Learning Repository. The data includes ~49,000 records with census data on U.S. adults. The prediction task is to determine whether a person makes over \$50K a year.

## Machine Learning Project Framework
Based on Aurélien Géron's excellent [*Hands-On Machine Learning with Scikit-Learn & TensorFlow*](https://learning.oreilly.com/library/view/hands-on-machine-learning/9781491962282/):
1. **Frame the problem and look at the big picture.**
2. **Get the data.**
3. **Explore the data to gain insights.**
4. **Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.**
5. **Explore many different models and shortlist the best ones.**
6. **Fine-tune your models and combine them into a great solution.**
7. **Present your solution.**
8. **Launch, monitor, and maintain your system.**

We will cover roughly two of these points in each of our first four sessions.

# Framing the problem
The first stage in a data science project is getting a good understanding of the business objective and context. Such understanding will inform the framing of our problem and our choices of features, techniques, models, and evaluation metrics. Since our focus here is on Python, we will not go deeply into the business context. 

For our purposes, we pretend that this dataset was received from one of our business partners and that it represents demographic data about potential candidates for one of our products. It turns out (in our pretend world) that whether people make over \\$50k a year determines whether they are good candidates for the product. The final deliverable is a model that will be deployed in production, receiving new data on individuals (similar in format to our dataset) and automatically predicting whether they make more than \\$50k. The business will decide whether to target potential individuals based on our prediction.

## Selecting a performance measure
It is good practice to start a machine learning project by selecting a single performance metric to optimize. This gives focus to the modeling efforts and allows the data science team to unambiguously compare different versions of the solution and iterate quickly.

A typical performance measure (and the default in sklearn) for classification problems is accuracy: the percentage of cases that the model classifies correctly. However, before deciding on a performance metric, we should understand the business impact of our model and seek to tailor our metric to that impact. In our case, targeting each potential customer costs the business \\$25. For a prospective customer who makes less than \\$50k, the business makes \\$5 on average; for a prospective customer who makes over \\$50k, the business makes \\$85. Currently, the business targets all clients and breaks even (25% of the potential customers make over \\$50k).
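
As a quick sanity check of the break-even claim, the arithmetic can be written out in a few lines. This is only a sketch: the variable names are ours, and the figures are the ones quoted above.

```python
# Figures quoted in the paragraph above
cost_per_contact = 25   # cost of targeting one prospect
revenue_low = 5         # average revenue from a prospect making less than $50k
revenue_high = 85       # average revenue from a prospect making over $50k
share_high = 0.25       # share of prospects making over $50k

# Expected revenue per targeted prospect when everyone is targeted
expected_revenue = share_high * revenue_high + (1 - share_high) * revenue_low
expected_profit = expected_revenue - cost_per_contact

print(expected_revenue)  # 25.0
print(expected_profit)   # 0.0
```

The expected revenue per prospect equals the targeting cost, which is why targeting everyone breaks even.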

In terms of model prediction outcomes, we can organize this information as follows (treating an avoided loss as a profit and a missed profit as a loss):
* True positive: \\$60 profit
* False positive: \\$20 loss
* True negative: \\$20 profit
* False negative: \\$60 loss

Therefore, to align our performance metric with the business value, we will need to create a custom metric that penalizes false negatives three times as heavily as false positives.
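
One possible way to express such a metric in scikit-learn is sketched below. It is not necessarily the metric we will end up using. The helper name `business_value`, the `PAYOFF` dictionary, and the encoding of the positive class as `1` (makes over 50k) are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer

# Dollar value of each outcome, taken from the list above
PAYOFF = {"tp": 60, "fp": -20, "tn": 20, "fn": -60}

def business_value(y_true, y_pred):
    """Average dollar value per prediction (hypothetical helper)."""
    # Assumes labels are encoded as 0 (at most 50k) and 1 (over 50k)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    total = (tp * PAYOFF["tp"] + fp * PAYOFF["fp"]
             + tn * PAYOFF["tn"] + fn * PAYOFF["fn"])
    return total / len(y_true)

# Wrap the function so that scikit-learn model selection tools can use it
business_scorer = make_scorer(business_value, greater_is_better=True)

# Tiny made-up example: 1 TP, 1 FN, 2 TN, 1 FP
y_true = np.array([1, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0])
print(business_value(y_true, y_pred))  # 4.0
```

A scorer built this way can be passed as the `scoring` argument of tools such as `cross_val_score`, so that model comparison uses the business value rather than accuracy.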

# Get the Data
Time to get our hands dirty!
## Create a workspace
### Build a git repository
I created a [repository for the course](https://devtools.metlife.com/bitbucket/users/tyifat/repos/python-for-dna/browse) in Bitbucket, which will allow us to work in an environment that is similar to real projects. To work with the repository, we will use git, the leading version-control system for tracking changes in source code during software development. Here are useful git learning resources:
* [Git cheatsheet](https://www.atlassian.com/git/tutorials/atlassian-git-cheatsheet)
* [Git tutorial](https://www.codecademy.com/learn/learn-git)

You should clone the repository (aka "repo") to your computer:
1. From the Windows Start Menu, open Git Bash. (If you don't have it installed, install it first from the Software Center).
2. In the Git Bash window, navigate to the folder where you want to clone the repo (using the `cd` command).
3. Open the repo in Bitbucket, click "Clone" in the upper-left part of the screen, and copy the address in the popup box.
4. In the Git Bash window, type `git clone` followed by the copied address, for example `git clone https://tyifat@devtools.metlife.com/bitbucket/scm/~tyifat/python-for-dna.git`.
5. The repo should be cloned to your folder now.

### Build an Anaconda environment
Note that the commands below may be slightly different based on your Anaconda version. 
1. From the Windows Start Menu, open Anaconda Prompt.
2. To create a new environment named 'py4dna' (you can choose a different name), type: `conda create -n py4dna python=3.7`. We are specifying version 3.7, which is recent but more stable than 3.8 (as of January 2020).
3. Activate the environment with `conda activate py4dna`, then install the packages for this project by typing: `conda install -c conda-forge jupyter numpy pandas scikit-learn matplotlib seaborn`. The latest available versions of these packages that fit together will be installed.
4. Once you have installed the packages, you can open Jupyter from the Anaconda Prompt, with your environment active, by typing: `jupyter notebook`.

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
# Which versions are installed?
import sys
print("Python version")
print(sys.version)
print("\nPandas info")
print(pd.__version__)

Python version
3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]

Pandas info
0.25.3


## Read dataset
The original dataset arrives divided into two parts - train and test. We will append those into a single dataset and then split it ourselves.

In [3]:
# Read the training set from web 
df_1 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", 
                   names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                   'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                   'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                   '<=50K'], skipinitialspace=True)
df_1.shape

(32561, 15)

In [4]:
df_1.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,<=50K
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [5]:
# Read the test set
df_2 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", 
                   names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                   'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                   'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                   '<=50K'], skipinitialspace=True, skiprows=1)
df_2.shape

(16281, 15)

In [6]:
df_2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,<=50K
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K.
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K.
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K.
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K.
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K.


In [7]:
# This is what it looks like if we don't skip the first row
pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", 
                   names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                   'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                   'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                   '<=50K'], skipinitialspace=True).head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,<=50K
0,|1x3 Cross validator,,,,,,,,,,,,,,
1,25,Private,226802.0,11th,7.0,Never-married,Machine-op-inspct,Own-child,Black,Male,0.0,0.0,40.0,United-States,<=50K.
2,38,Private,89814.0,HS-grad,9.0,Married-civ-spouse,Farming-fishing,Husband,White,Male,0.0,0.0,50.0,United-States,<=50K.
3,28,Local-gov,336951.0,Assoc-acdm,12.0,Married-civ-spouse,Protective-serv,Husband,White,Male,0.0,0.0,40.0,United-States,>50K.
4,44,Private,160323.0,Some-college,10.0,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688.0,0.0,40.0,United-States,>50K.


In [8]:
# The target column in the test data has a trailing dot. We strip it to ensure consistency with the training data
df_2['<=50K'] = df_2['<=50K'].str.rstrip('.')
df_2.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,<=50K
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
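
The two reads above repeat the same list of column names. One way to avoid the repetition is sketched below, assuming a hypothetical helper `read_adult` and a two-record inline sample that stands in for the downloaded file (so the example runs without network access).

```python
import io
import pandas as pd

# Column names defined once and reused for both files
COLUMNS = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
           '<=50K']

# Stand-in for adult.test, including its non-data first line
sample_test = """|1x3 Cross validator
25, Private, 226802, 11th, 7, Never-married, Machine-op-inspct, Own-child, Black, Male, 0, 0, 40, United-States, <=50K.
28, Local-gov, 336951, Assoc-acdm, 12, Married-civ-spouse, Protective-serv, Husband, White, Male, 0, 0, 40, United-States, >50K.
"""

def read_adult(source, skiprows=0):
    """Read one Adult file and normalize the target labels (hypothetical helper)."""
    df = pd.read_csv(source, names=COLUMNS, skipinitialspace=True, skiprows=skiprows)
    # Remove the trailing dot that appears only in the test file
    df['<=50K'] = df['<=50K'].str.rstrip('.')
    return df

df_sample = read_adult(io.StringIO(sample_test), skiprows=1)
print(df_sample.shape)              # (2, 15)
print(df_sample['<=50K'].tolist())  # ['<=50K', '>50K']
```

The same helper could be called with the two URLs used above, passing `skiprows=1` only for the test file.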


In [9]:
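# pd.concat stacks the two dataframes; ignore_index=True renumbers the rows from 0
# (DataFrame.append was removed in pandas 2.0, while pd.concat works in old and new versions)
df_combined = pd.concat([df_1, df_2], ignore_index=True)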
df_combined.shape

(48842, 15)

In [10]:
df_combined.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,<=50K
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,?,321403,HS-grad,9,Widowed,?,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K
48841,35,Self-emp-inc,182148,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,60,United-States,>50K
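
Since we plan to split the combined data ourselves, here is a minimal sketch of what such a split could look like with scikit-learn. The toy dataframe, the 80/20 split size, and the random seed are assumptions for illustration; we may choose a different scheme on the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_combined: 100 rows, 25% in the '>50K' class
toy = pd.DataFrame({
    'age': range(100),
    '<=50K': ['>50K'] * 25 + ['<=50K'] * 75,
})

train, test = train_test_split(
    toy,
    test_size=0.2,          # assumed split size, not taken from the source
    stratify=toy['<=50K'],  # keep the class ratio the same in both parts
    random_state=42,        # fixed seed so the split is reproducible
)

print(train.shape, test.shape)               # (80, 2) (20, 2)
print(test['<=50K'].value_counts()['>50K'])  # 5
```

Stratifying on the target keeps the share of high earners the same in the training and test parts, which matters when one class is much smaller than the other.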


# Pandas Basics
Check out these [pandas tutorials](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html).

The two main data structures in pandas are:
* **Series**: a one-dimensional labeled array capable of holding any data type.
* **Dataframe**: a two-dimensional labeled data structure with columns of potentially different types. Each column in a dataframe is a series.
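
The relationship between the two structures can be checked directly. This is a small sketch with a made-up dataframe (`df_demo` is not part of the course data):

```python
import pandas as pd

df_demo = pd.DataFrame({'name': ['a', 'b'], 'value': [1.5, 2.5]})

# Selecting one column returns a Series that shares the dataframe's row labels
col = df_demo['value']
print(type(col).__name__)               # Series
print(col.index.equals(df_demo.index))  # True

# Each column keeps its own data type
print(df_demo['value'].dtype)           # float64
```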

## Creating series and dataframes

In [11]:
# A simple way to create a series
s = pd.Series([1, 3, 5, np.nan, 8, 10])
s

0     1.0
1     3.0
2     5.0
3     NaN
4     8.0
5    10.0
dtype: float64

In [12]:
# Create a series with defined labels
color = pd.Series(['red', 'orange', 'yellow', 'green', 'red'], index=['apple', 'orange', 'banana', 'kiwi', 'strawberry'])
color

apple            red
orange        orange
banana        yellow
kiwi           green
strawberry       red
dtype: object

In [13]:
# Create a series using a dictionary
season = pd.Series({'apple':'fall', 'orange':'winter', 'banana':'all year', 'kiwi':'summer, fall, winter', 
                    'strawberry':'spring'})
season

apple                         fall
orange                      winter
banana                    all year
kiwi          summer, fall, winter
strawberry                  spring
dtype: object

In [14]:
# Creating a dataframe from series
import random

fruits = pd.DataFrame({
    'color': color,
    'season': season,
    'quantity': pd.Series([random.randint(1, 10) for _ in range(5)],
                          index=['apple', 'orange', 'banana', 'kiwi', 'strawberry'],
                          dtype='int8')})
fruits

Unnamed: 0,color,season,quantity
apple,red,fall,6
orange,orange,winter,8
banana,yellow,all year,1
kiwi,green,"summer, fall, winter",5
strawberry,red,spring,9


### Creating a time series dataframe

In [15]:
# First we create the index
dates = pd.date_range(start='20130101', periods=12, freq='Y')
dates

DatetimeIndex(['2013-12-31', '2014-12-31', '2015-12-31', '2016-12-31',
               '2017-12-31', '2018-12-31', '2019-12-31', '2020-12-31',
               '2021-12-31', '2022-12-31', '2023-12-31', '2024-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')

In [16]:
# and then the dataframe
df = pd.DataFrame(np.random.randn(12, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2013-12-31,-0.028629,-1.580756,-0.262392,1.238888
2014-12-31,-2.750948,0.1133,1.329606,0.804971
2015-12-31,2.483952,-2.941857,-0.669112,0.606806
2016-12-31,0.296415,1.904526,0.241851,-0.487842
2017-12-31,-1.444228,0.396434,-0.959535,1.610537
2018-12-31,-0.026776,0.676202,-1.169048,0.24191
2019-12-31,-0.894479,-0.555229,0.741542,-0.645525
2020-12-31,1.138164,-0.715849,-0.62685,-2.143731
2021-12-31,-1.242546,-0.63793,-0.886466,0.696174
2022-12-31,0.986463,-1.482744,0.232951,-0.587185


## Attributes and underlying data

In [18]:
# We can see the dimensions of a dataframe
fruits.shape

(5, 3)

In [23]:
# We can view the beginning or end using the .head() and .tail() methods
fruits.head(2)

Unnamed: 0,color,season,quantity
apple,red,fall,6
orange,orange,winter,8


In [27]:
# The index (that is, row labels)
fruits.index

Index(['apple', 'orange', 'banana', 'kiwi', 'strawberry'], dtype='object')

In [28]:
# Column names
fruits.columns

Index(['color', 'season', 'quantity'], dtype='object')

In [29]:
# An index is iterable
for col in fruits.columns:
    print(col, ':', fruits[col].dtype)

color : object
season : object
quantity : int8


In [30]:
# Access the data types of a dataframe
fruits.dtypes

color       object
season      object
quantity      int8
dtype: object

In [33]:
# We can access the contents of a series without the labels 
fruits['color'].values

array(['red', 'orange', 'yellow', 'green', 'red'], dtype=object)

In [34]:
# We can convert dataframes to numpy arrays
fruits.to_numpy()

array([['red', 'fall', 6],
       ['orange', 'winter', 8],
       ['yellow', 'all year', 1],
       ['green', 'summer, fall, winter', 5],
       ['red', 'spring', 9]], dtype=object)

In [35]:
# General dataframe info
fruits.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, apple to strawberry
Data columns (total 3 columns):
color       5 non-null object
season      5 non-null object
quantity    5 non-null int8
dtypes: int8(1), object(2)
memory usage: 285.0+ bytes


In [37]:
# Descriptive statistics for a dataframe
df.describe()

Unnamed: 0,A,B,C,D
count,12.0,12.0,12.0,12.0
mean,0.028337,-0.48203,-0.087813,0.247836
std,1.420279,1.325745,0.797623,1.040478
min,-2.750948,-2.941857,-1.169048,-2.143731
25%,-0.981496,-1.507247,-0.723451,-0.512678
50%,0.134819,-0.59658,-0.047267,0.65149
75%,0.987244,0.466376,0.366774,0.833153
max,2.483952,1.904526,1.329606,1.610537


In [43]:
# Value counts for a series
fruits['color'].value_counts()

red       2
green     1
orange    1
yellow    1
Name: color, dtype: int64

In [47]:
# Transpose a dataframe
fruits.T

Unnamed: 0,apple,orange,banana,kiwi,strawberry
color,red,orange,yellow,green,red
season,fall,winter,all year,"summer, fall, winter",spring
quantity,6,8,1,5,9


## Sorting

In [62]:
# Sort by index
fruits.sort_index(ascending=False)

Unnamed: 0,color,season,quantity
strawberry,red,spring,9
orange,orange,winter,8
kiwi,green,"summer, fall, winter",5
banana,yellow,all year,1
apple,red,fall,6


In [57]:
# The original did not change: sort_index (like most pandas methods) returns a new, sorted dataframe object
fruits

Unnamed: 0,color,season,quantity
apple,red,fall,6
orange,orange,winter,8
banana,yellow,all year,1
kiwi,green,"summer, fall, winter",5
strawberry,red,spring,9


In [63]:
# To sort the original dataframe, we need to either assign the new object to the old variable or use the inplace argument
# fruits = fruits.sort_index(ascending=False)
fruits.sort_index(ascending=False, inplace=True)
fruits

Unnamed: 0,color,season,quantity
strawberry,red,spring,9
orange,orange,winter,8
kiwi,green,"summer, fall, winter",5
banana,yellow,all year,1
apple,red,fall,6


In [64]:
# Sort by values
fruits.sort_values(by=['color', 'quantity'])

Unnamed: 0,color,season,quantity
kiwi,green,"summer, fall, winter",5
orange,orange,winter,8
apple,red,fall,6
strawberry,red,spring,9
banana,yellow,all year,1


## Selecting data

In [65]:
# Select a single column from a dataframe
fruits['season']

strawberry                  spring
orange                      winter
kiwi          summer, fall, winter
banana                    all year
apple                         fall
Name: season, dtype: object

In [66]:
# Slice rows using []
fruits[0:3]

Unnamed: 0,color,season,quantity
strawberry,red,spring,9
orange,orange,winter,8
kiwi,green,"summer, fall, winter",5


In [69]:
# Select a single row by label (returns a series)
fruits.loc['kiwi']

color                      green
season      summer, fall, winter
quantity                       5
Name: kiwi, dtype: object

In [70]:
# Select a single value by row and column labels
fruits.loc['kiwi', 'quantity']

5

In [67]:
# Select rows by label
fruits.loc[['kiwi', 'banana']]

Unnamed: 0,color,season,quantity
kiwi,green,"summer, fall, winter",5
banana,yellow,all year,1


In [72]:
# Select on both axes by label (label slices include both endpoints)
fruits.loc['orange':'kiwi', 'color']

orange    orange
kiwi       green
Name: color, dtype: object

In [None]:
# Getting a scalar value
fruits.loc['strawberry', 'quantity']

In [None]:
# Select by position
fruits.iloc[3]

In [None]:
# Select by integer slices, similar to python lists/numpy arrays
fruits.iloc[0:2, 1:3]
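
One difference between the two selectors is easy to miss: a label slice with `.loc` includes both endpoints, while an integer slice with `.iloc` excludes the stop position, as in Python lists. The sketch below uses a small made-up dataframe (`demo`) with the same row labels as `fruits` in descending order.

```python
import pandas as pd

demo = pd.DataFrame({'quantity': [9, 8, 5, 1, 6]},
                    index=['strawberry', 'orange', 'kiwi', 'banana', 'apple'])

# Label slice: both 'orange' and 'kiwi' are included
by_label = demo.loc['orange':'kiwi', 'quantity']
print(by_label.index.tolist())     # ['orange', 'kiwi']

# Position slice: position 1 is included, position 2 is not
by_position = demo.iloc[1:2, 0]
print(by_position.index.tolist())  # ['orange']
```

Label slices also follow the current row order, so the same slice can return different rows after the dataframe is re-sorted.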

# Homework
1. Follow the steps above to create an environment on your computer and get the data.
2. Review the sessions from Season 1 of the Python training.
3. You are most welcome to start exploring the dataset and to think about how we may want to prepare it for modeling. I'd appreciate hearing your thoughts in the next session!