## Pre-Modeling: Data Preprocessing and Feature Exploration in Python

* Goal:
    * Pre-modeling typically accounts for roughly 80% of the work, and modeling for the remaining 20%
    * Show how data preprocessing, feature exploration, and feature engineering affect model performance
    * Go over a few effective pre-modeling steps
    * These steps are only a small subset of pre-modeling
* Python libraries:
    * NumPy
    * pandas
    * scikit-learn
    * Matplotlib
    * Almost the entire workflow is covered by these four libraries

Source of 'adult' dataset: https://archive.ics.uci.edu/ml/datasets/Adult


In [49]:
import numpy as np
import pandas as pd

In [50]:
df = pd.read_csv("dataset/adult.csv", na_values=['#NAME?'])

In [51]:
df.head()

| Unnamed: 0 | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_cpuntry | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [52]:
df['income'].value_counts()

 <=50K    24720
 >50K      7841
Name: income, dtype: int64
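
The labels printed above carry a leading space (` <=50K`), so an exact comparison against `'<=50K'` never matches. Below is a minimal sketch of one way to normalize such values before encoding them. It uses a small made-up frame rather than the real dataset, and the `strip_string_columns` helper is a hypothetical name introduced for illustration only.

```python
import pandas as pd

def strip_string_columns(frame):
    """Return a copy with surrounding whitespace removed from string columns."""
    out = frame.copy()
    for col in out.columns:
        if pd.api.types.is_string_dtype(out[col]):
            out[col] = out[col].str.strip()
    return out

# Toy data mimicking the padded labels seen in the value counts above.
toy = pd.DataFrame({"age": [39, 50, 52], "income": [" <=50K", " <=50K", " >50K"]})
clean = strip_string_columns(toy)

# Encode the target: 1 for '>50K', 0 otherwise.
labels = (clean["income"] == ">50K").astype(int)
print(labels.tolist())
```

Passing `skipinitialspace=True` to `pd.read_csv` is another option when the padding always follows the delimiter.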

In [53]:
# The raw labels carry a leading space (' <=50K'), so strip before comparing;
# without the strip every row is encoded as 1.
df['income'] = (df['income'].str.strip() == '>50K').astype(int)

In [54]:
x = df.drop(columns='income')
y = df.income

In [55]:
print(x.head(5))

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_cpuntry  
0          2174             0              40   United-States  
1           

In [56]:
print(y.head(5))

0    0
1    0
2    0
3    0
4    0
Name: income, dtype: int64


In [57]:
print(x['education'].head())

0     Bachelors
1     Bachelors
2       HS-grad
3          11th
4     Bachelors
Name: education, dtype: object


In [58]:
print(pd.get_dummies(x.education.head()))

    11th   Bachelors   HS-grad
0      0           1         0
1      0           1         0
2      0           0         1
3      1           0         0
4      0           1         0
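
The call above only creates columns for categories present in the five-row sample. As a hedged sketch (toy series, not the real data), the example below shows how `pd.get_dummies` output might be aligned to a fixed set of columns when a later sample contains unseen categories or lacks some. The `train` and `new` names are illustrative assumptions.

```python
import pandas as pd

train = pd.Series(["Bachelors", "HS-grad", "11th", "Bachelors"], name="education")
new = pd.Series(["HS-grad", "Masters"], name="education")

# dtype=int keeps 0/1 output; recent pandas versions default to booleans.
train_dummies = pd.get_dummies(train, dtype=int)

# Align the new sample to the training columns: the unseen category ("Masters")
# is dropped and categories absent from the sample are filled with 0.
new_dummies = pd.get_dummies(new, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(train_dummies)
print(new_dummies)
```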


In [59]:
for col_name in x.columns:
    if x[col_name].dtypes == 'object':
        unique_cat = len(x[col_name].unique())
        print(f"Feature {col_name} has {unique_cat} unique categories")

Feature workclass has 9 unique categories
Feature education has 16 unique categories
Feature marital_status has 7 unique categories
Feature occupation has 15 unique categories
Feature relationship has 6 unique categories
Feature race has 5 unique categories
Feature sex has 2 unique categories
Feature native_cpuntry has 42 unique categories


In [60]:
print(x['native_cpuntry'].value_counts().sort_values(ascending=False).head(10))

 United-States    29170
 Mexico             643
 ?                  583
 Philippines        198
 Germany            137
 Canada             121
 Puerto-Rico        114
 El-Salvador        106
 India              100
 Cuba                95
Name: native_cpuntry, dtype: int64
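
With one dominant category and a long tail of rare ones, a common pre-modeling step is to bucket the low-frequency values together. The sketch below illustrates that idea on toy data. The `bucket_rare` helper and its `min_count` threshold are hypothetical and not part of the original notebook.

```python
import pandas as pd

def bucket_rare(series, min_count, other_label="Other"):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # Keep frequent categories as-is; everything else becomes other_label.
    return series.where(series.isin(keep), other_label)

toy = pd.Series(["United-States"] * 6 + ["Mexico"] * 3 + ["Cuba", "India"])
bucketed = bucket_rare(toy, min_count=3)
print(bucketed.value_counts())
```

A threshold-based rule like this generalizes the two-bucket split used in this notebook, at the cost of one more parameter to choose.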


In [61]:
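# The raw values carry a leading space (' United-States'), so strip before comparing;
# without the strip every row falls into 'Other'.
x['native_cpuntry'] = np.where(x['native_cpuntry'].str.strip() == 'United-States', 'United-States', 'Other')

print(x['native_cpuntry'].value_counts().sort_values(ascending=False))

United-States    29170
Other             3391
Name: native_cpuntry, dtype: int64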


In [62]:
x['relationship'].dropna()

0         Not-in-family
1               Husband
2         Not-in-family
3               Husband
4                  Wife
              ...      
32556              Wife
32557           Husband
32558         Unmarried
32559         Own-child
32560              Wife
Name: relationship, Length: 32561, dtype: object

In [69]:
todummy_list = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_cpuntry']

In [67]:
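def dummy_df(df, todummy_list):
    """One-hot encode each column in todummy_list and drop the original column."""
    for col in todummy_list:
        dummies = pd.get_dummies(df[col], prefix=col, dummy_na=False)
        # Keyword form: the positional axis argument was removed in pandas 2.0.
        df = df.drop(columns=col)
        df = pd.concat([df, dummies], axis=1)
    return df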
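
The loop in `dummy_df` can likely be expressed as a single call, since `pd.get_dummies` accepts a `columns` argument. The sketch below shows this on a toy frame. The data is made up, and `dtype=int` is an added assumption to keep 0/1 output on recent pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [39, 50, 38],
    "sex": ["Male", "Male", "Female"],
    "race": ["White", "Black", "White"],
})

# columns= encodes the listed columns and uses each column name as the prefix;
# untouched columns (here "age") are kept in front.
encoded = pd.get_dummies(toy, columns=["sex", "race"], dtype=int)
print(encoded.columns.tolist())
```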

In [70]:
x = dummy_df(x, todummy_list)
print(x.head(5))

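KeyError: 'workclass'

This error most likely comes from running the cell a second time: after the first run, `x` no longer contains the original categorical columns, so `df['workclass']` fails. The dummy columns in the next output show that the encoding itself had already succeeded.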

In [72]:
x.isnull().sum().sort_values(ascending=False).head()

age                                      0
occupation_ Machine-op-inspct            0
marital_status_ Married-AF-spouse        0
marital_status_ Married-civ-spouse       0
marital_status_ Married-spouse-absent    0
dtype: int64
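
The zero counts above do not necessarily mean the data is complete: the `native_cpuntry` listing earlier shows 583 rows holding a `?` placeholder, which pandas treats as an ordinary string. Assuming `?` marks a missing value, the sketch below (toy data) shows one way to convert it to a real `NaN` so that `isnull()` and an imputer can see it.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"workclass": [" Private", " ?", " State-gov"], "age": [38, 54, 39]})

# Placeholder strings are not counted as missing.
missing_before = int(toy.isnull().sum().sum())

# Strip the padding, then turn the '?' placeholder into a real missing value.
toy["workclass"] = toy["workclass"].str.strip().replace("?", np.nan)
missing_after = int(toy["workclass"].isnull().sum())

print(missing_before, missing_after)
```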

In [74]:
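# Note: the PyPI distribution is named scikit-learn; the `sklearn` name installed here is a deprecated alias.
#!pip install sklearn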

Collecting sklearn
  Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting scikit-learn
  Downloading scikit_learn-0.24.2-cp38-cp38-manylinux2010_x86_64.whl (24.9 MB)
Collecting joblib>=0.11
  Downloading joblib-1.0.1-py3-none-any.whl (303 kB)
Collecting threadpoolctl>=2.0.0
  Downloading threadpoolctl-2.1.0-py3-none-any.whl (12 kB)
Using legacy 'setup.py install' for sklearn, since package 'wheel' is not installed.
Installing collected packages: threadpoolctl, joblib, scikit-learn, sklearn
    Running setup.py install for sklearn ... done
Successfully installed joblib-1.0.1 scikit-learn-0.24.2 sklearn-0.0 threadpoolctl-2.1.0


In [83]:
from sklearn.impute import SimpleImputer

In [86]:
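# x is a DataFrame, so NumPy-style x[:, 1:3] raises
# "TypeError: '(slice(None, None, None), slice(1, 3, None))' is an invalid key".
# Select the second and third columns by label instead.
num_cols = x.columns[1:3]
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[num_cols] = imputer.fit_transform(x[num_cols])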
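
To make the effect of mean imputation visible, here is a small sketch on a toy frame that actually contains missing values. The frame and the choice of `num_cols` are hypothetical. `SimpleImputer` returns a NumPy array, so the result is assigned back to the selected columns by label.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({
    "age": [39.0, np.nan, 38.0],
    "hours_per_week": [40.0, 13.0, np.nan],
    "sex_Male": [1, 1, 0],
})

num_cols = ["age", "hours_per_week"]  # hypothetical choice of columns to impute
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Each NaN is replaced by the mean of the observed values in its column.
toy[num_cols] = imputer.fit_transform(toy[num_cols])
print(toy)
```

Other strategies such as `"median"` or `"most_frequent"` may suit skewed or categorical columns better.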