In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
# Which versions are installed?
import sys
print("Python version")
print (sys.version)
print("\nPandas info")
print (pd.__version__)

Python version
3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]

Pandas info
0.25.3


## Read dataset
The original dataset arrives divided into two parts - train and test. We will append those into a single dataset and then split it ourselves.

In [3]:
# Read the training set from web 
df_1 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data", 
                   names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                   'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                   'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                   '>=50K'], skipinitialspace=True)
df_1.shape

(32561, 15)

In [4]:
# Read the test set
df_2 = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test", 
                   names=['age', 'workclass', 'fnlwgt', 'education', 'education-num', 
                   'marital-status', 'occupation', 'relationship', 'race', 'sex', 
                   'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                   '>=50K'], skipinitialspace=True, skiprows=1)
df_2.shape

(16281, 15)

In [5]:
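# Note: DataFrame.append was removed in pandas 2.0. On recent versions, use
# pd.concat([df_1, df_2], ignore_index=True) instead (column order may then differ).
df_combined = df_1.append(df_2, ignore_index=True, sort=True)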
df_combined.shape

(48842, 15)

In [6]:
# First, let's shuffle the dataset and divide it into train and test sets. 
df_combined = df_combined.sample(frac=1, random_state=0) 
# We set a random state so that the split is reproducible and we can all work on the same sets. 

df_train = df_combined.iloc[:int(df_combined.shape[0] * .75)].copy()
df_test = df_combined.iloc[int(df_combined.shape[0] * .75):].copy()
df_train.shape, df_test.shape

((36631, 15), (12211, 15))
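
The shuffle-and-slice split above can also be written with sklearn's `train_test_split`. This is only a sketch on a made-up 100-row frame (the `toy` data is hypothetical, not the census dataset), and the rows it selects will not match the manual split above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_combined (hypothetical data, 100 rows).
toy = pd.DataFrame({"age": np.arange(100), ">=50K": np.arange(100) % 2})

# Shuffling is on by default; random_state makes the split reproducible.
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)
print(toy_train.shape, toy_test.shape)
```

Either way, the important part is that the test rows are set aside before any parameter is learned from the data.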

In [7]:
df_train.columns

Index(['>=50K', 'age', 'capital-gain', 'capital-loss', 'education',
       'education-num', 'fnlwgt', 'hours-per-week', 'marital-status',
       'native-country', 'occupation', 'race', 'relationship', 'sex',
       'workclass'],
      dtype='object')

In [8]:
# Separate predictors from the label
X_train = df_train.drop('>=50K', axis=1)
X_test = df_test.drop('>=50K', axis=1)
y_train = df_train['>=50K']
y_test = df_test['>=50K']   

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((36631, 14), (12211, 14), (36631,), (12211,))

In [9]:
df_train.head()

| | >=50K | age | capital-gain | capital-loss | education | education-num | fnlwgt | hours-per-week | marital-status | native-country | occupation | race | relationship | sex | workclass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38113 | <=50K. | 41 | 0 | 0 | HS-grad | 9 | 151856 | 40 | Married-civ-spouse | United-States | Protective-serv | White | Husband | Male | Private |
| 39214 | <=50K. | 57 | 0 | 0 | Doctorate | 16 | 87584 | 25 | Divorced | United-States | Prof-specialty | White | Not-in-family | Female | Self-emp-not-inc |
| 44248 | <=50K. | 31 | 6849 | 0 | Bachelors | 13 | 220669 | 40 | Never-married | United-States | Prof-specialty | White | Own-child | Female | Local-gov |
| 10283 | <=50K | 55 | 0 | 0 | Assoc-voc | 11 | 171355 | 20 | Married-civ-spouse | United-States | Machine-op-inspct | White | Husband | Male | Private |
| 26724 | <=50K | 59 | 0 | 0 | 10th | 6 | 148626 | 40 | Married-civ-spouse | United-States | Farming-fishing | White | Husband | Male | Self-emp-not-inc |


# Intro

## How shall we prepare our data for modeling?
* Cleaning
* Missing data
* Scale numeric features
* Encode categorical features
* Transformations

## Why should we automate data preparation?
* Efficiency
* Controlled Experimentation
* Reproducibility in Production

# Sklearn Data Preparation Pipelines
Learning resources:
* [Sklearn documentation](https://scikit-learn.org/stable/data_transforms.html) - Dataset transformations
* [Hands-On Machine Learning with Scikit-Learn & TensorFlow / Aurelien Geron](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/) - Chapter 2.

## Sklearn interface design principles
Sklearn objects share a consistent interface.
* **Estimators** include any object that can estimate parameters based on a dataset. Estimation is performed by the ```fit()``` method, which receives a dataset as an argument (plus a labels dataset in the case of supervised learning). 
    - A **hyperparameter** is any parameter that guides the estimation. Hyperparameters are set in the constructor and are directly accessible via public instance variables. For example, we can access logistic regression's penalty hyperparameter using ```log_reg.penalty```.
    - An estimator's **learned parameters** can be accessed via public instance variables with an underscore suffix (e.g. ```log_reg.coef_```).
* **Predictors** are estimators that are capable of making predictions, using the ```predict()``` method.
* **Transformers** are estimators that can transform a dataset, typically based on the learned parameters. Transformation is performed by the ```transform()``` method, which receives the dataset to transform as an argument and returns the transformed dataset. All transformers have a convenience ```fit_transform()``` method that is equivalent to calling ```fit()``` and then ```transform()```.
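
Here is a minimal sketch of these three roles on a tiny made-up dataset. The data and variable names are illustrative only, and it uses the `C` hyperparameter rather than `penalty` simply because `C` is easy to set to a recognizable value.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset: one feature, binary label.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Estimator + predictor: hyperparameters are set in the constructor.
log_reg = LogisticRegression(C=0.5)
print(log_reg.C)             # hyperparameter, available before fitting
log_reg.fit(X, y)
print(log_reg.coef_.shape)   # learned parameter, available only after fit()
preds = log_reg.predict(X)

# Transformer: fit_transform() gives the same result as fit() then transform().
scaler_demo = StandardScaler()
two_steps = scaler_demo.fit(X).transform(X)
one_step = StandardScaler().fit_transform(X)
print(np.allclose(two_steps, one_step))
```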

## Transformers

In [10]:
# Sklearn has a rich library of built-in transformers. Here is the standard scaler: 
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
num_features = ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
# During the .fit() method, the scaler learns the mean and the variance of the train data
scaler.fit(X_train[num_features]) 
# The .transform() method standardizes the data using the learned parameters
train_scaled = scaler.transform(X_train[num_features])
print("Beginning of scaled data:\n", train_scaled[:3])

Beginning of scaled data:
 [[ 0.16988957 -0.35849554 -0.41755169 -0.14580568 -0.21438168 -0.03499028]
 [ 1.336995   -0.96551754  2.2946621  -0.14580568 -0.21438168 -1.24844225]
 [-0.55955132  0.29141429  1.13228476  0.75816992 -0.21438168 -0.03499028]]


In [11]:
# Here is how we access the learned parameters:
print(scaler.mean_)
print('\nThe mean age is %.2f' % scaler.mean_[0])

[3.86709618e+01 1.89813810e+05 1.00776665e+01 1.10470137e+03
 8.51810215e+01 4.04325298e+01]

The mean age is 38.67


In [12]:
# The trained transformer can be used to process unseen data
test_scaled = scaler.transform(df_test[num_features])
print("Beginning of scaled test data:\n", test_scaled[:3])

Beginning of scaled test data:
 [[-1.21604812  0.07871292 -0.03009258 -0.14580568 -0.21438168  0.7739777 ]
 [ 1.55582727 -0.65782259  1.13228476 -0.14580568 -0.21438168 -0.03499028]
 [ 0.46166593  0.65732562  1.90720299 13.05271322 -0.21438168  0.7739777 ]]


In [13]:
# An alternative, more concise version:
train_scaled = scaler.fit_transform(X_train[num_features])
print("Beginning of scaled data:\n", train_scaled[:3])

Beginning of scaled data:
 [[ 0.16988957 -0.35849554 -0.41755169 -0.14580568 -0.21438168 -0.03499028]
 [ 1.336995   -0.96551754  2.2946621  -0.14580568 -0.21438168 -1.24844225]
 [-0.55955132  0.29141429  1.13228476  0.75816992 -0.21438168 -0.03499028]]


In [14]:
# We can easily write our own transformers
from sklearn.base import BaseEstimator, TransformerMixin

class MyStandardScaler(BaseEstimator, TransformerMixin):
    def __init__(self):
        # We must have an __init__ method - the constructor of the object. 
        # This is where we would initialize any hyperparameters. 
        pass
 
    def fit(self, X, y=None):
        # This is where we learn our learned parameters - here the mean and std. 
        # We must include a 'y' argument, but we don't have to use it.
        # fit() always returns 'self' - a reference to the transformer.
        # X is expected to be a DataFrame, so we compute column-wise statistics.
        self.mean_ = X.mean() 
        self.std_ = X.std(ddof=0)  # population std, to match StandardScaler
        return self

    def transform(self, X, y=None):
        # Transforms the data and returns the transformed dataset.
        X_transformed = X.copy()
        for column in X:
            X_transformed[column] = (X[column] - self.mean_[column]) / self.std_[column]
        return X_transformed.values

In [15]:
# Hey - it looks just the same!
my_scaler = MyStandardScaler()
my_train_scaled = my_scaler.fit_transform(X_train[num_features])
print("Beginning of custom scaled data:\n", my_train_scaled[:3])
print("\nBeginning of sklearn scaled data:\n", train_scaled[:3])

Beginning of custom scaled data:
 [[ 0.16988957 -0.35849554 -0.41755169 -0.14580568 -0.21438168 -0.03499028]
 [ 1.336995   -0.96551754  2.2946621  -0.14580568 -0.21438168 -1.24844225]
 [-0.55955132  0.29141429  1.13228476  0.75816992 -0.21438168 -0.03499028]]

Beginning of sklearn scaled data:
 [[ 0.16988957 -0.35849554 -0.41755169 -0.14580568 -0.21438168 -0.03499028]
 [ 1.336995   -0.96551754  2.2946621  -0.14580568 -0.21438168 -1.24844225]
 [-0.55955132  0.29141429  1.13228476  0.75816992 -0.21438168 -0.03499028]]
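
The custom scaler above has no hyperparameters. As a rough sketch of how one would look, here is a hypothetical `ClippingScaler` (the name and the `max_abs` hyperparameter are invented for illustration) that standardizes and then clips extreme values. Storing the constructor argument unchanged under the same attribute name is what lets `BaseEstimator` provide `get_params()` and `set_params()` for free.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ClippingScaler(BaseEstimator, TransformerMixin):
    """Hypothetical example: standardize, then clip to +/- max_abs."""
    def __init__(self, max_abs=3.0):
        # Hyperparameters are stored as-is, under the same name as the argument.
        self.max_abs = max_abs

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)   # learned parameters get a trailing underscore
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        scaled = (X - self.mean_) / self.std_
        return np.clip(scaled, -self.max_abs, self.max_abs)

toy = pd.DataFrame({"hours": [10.0, 40.0, 40.0, 40.0, 99.0]})
clipper = ClippingScaler(max_abs=1.0)
clipped = clipper.fit_transform(toy)
print(clipper.get_params())
print(clipped.ravel())
```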


## Pipelines

In [16]:
# Now what if we want to add another transformer, say polynomial features?
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

scaler = StandardScaler()
poly = PolynomialFeatures(degree=2)

# A simple syntax for initializing a pipeline
num_pipeline = Pipeline([('PolynomialFeatures', poly), ('StandardScaler', scaler)])

# The pipeline object is just another transformer
num_train_prepared = num_pipeline.fit_transform(df_train[num_features])
num_train_prepared.shape

(36631, 28)
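
Where do the 28 columns come from? With 6 numeric features and `degree=2`, the polynomial expansion produces 1 bias column, 6 linear terms, 6 squares and 15 pairwise products. The helper below (a hypothetical name, written only to check the arithmetic) counts all monomials of total degree up to `degree`.

```python
from math import factorial

def n_polynomial_features(n_features, degree):
    """Number of monomials of total degree <= degree, bias term included."""
    return factorial(n_features + degree) // (factorial(n_features) * factorial(degree))

print(n_polynomial_features(6, 2))  # 1 + 6 + 6 + 15 = 28
```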

In [17]:
# We can easily see the definition of the pipeline
num_pipeline

Pipeline(memory=None,
         steps=[('PolynomialFeatures',
                 PolynomialFeatures(degree=2, include_bias=True,
                                    interaction_only=False, order='C')),
                ('StandardScaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True))],
         verbose=False)

In [18]:
# We can access a pipeline's learned parameters through the steps attribute:
num_pipeline.steps[1][1].mean_

array([1.00000000e+00, 3.86709618e+01, 1.89813810e+05, 1.00776665e+01,
       1.10470137e+03, 8.51810215e+01, 4.04325298e+01, 1.68338353e+03,
       7.22755188e+06, 3.90800551e+02, 5.09331961e+04, 3.59402132e+03,
       1.57505403e+03, 4.72400351e+10, 1.90363423e+06, 2.08697379e+08,
       1.60906021e+07, 7.66291370e+06, 1.08220496e+02, 1.35913009e+04,
       9.44393328e+02, 4.12171712e+02, 5.86242014e+07, 0.00000000e+00,
       5.25411236e+04, 1.65129703e+05, 3.70308722e+03, 1.78759439e+03])

In [19]:
# And we can read or change the hyperparameters of the transformers in a pipeline
# using the methods get_params/set_params. A parameter is referenced as
# step name + '__' + parameter name.
num_pipeline.get_params()

{'memory': None,
 'steps': [('PolynomialFeatures',
   PolynomialFeatures(degree=2, include_bias=True, interaction_only=False,
                      order='C')),
  ('StandardScaler',
   StandardScaler(copy=True, with_mean=True, with_std=True))],
 'verbose': False,
 'PolynomialFeatures': PolynomialFeatures(degree=2, include_bias=True, interaction_only=False,
                    order='C'),
 'StandardScaler': StandardScaler(copy=True, with_mean=True, with_std=True),
 'PolynomialFeatures__degree': 2,
 'PolynomialFeatures__include_bias': True,
 'PolynomialFeatures__interaction_only': False,
 'PolynomialFeatures__order': 'C',
 'StandardScaler__copy': True,
 'StandardScaler__with_mean': True,
 'StandardScaler__with_std': True}
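
As a small sketch of `set_params` in action, here is the same two-step pipeline rebuilt on toy data (the array `X_toy` is made up). It changes the polynomial degree in place and looks up a step by name through `named_steps` instead of by position.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_toy = np.arange(12, dtype=float).reshape(6, 2)   # toy data: 6 rows, 2 features

pipe = Pipeline([('PolynomialFeatures', PolynomialFeatures(degree=2)),
                 ('StandardScaler', StandardScaler())])
n_cols_deg2 = pipe.fit_transform(X_toy).shape[1]

# Change a hyperparameter of one step, then refit the whole pipeline.
pipe.set_params(PolynomialFeatures__degree=3)
n_cols_deg3 = pipe.fit_transform(X_toy).shape[1]

# named_steps gives access to a step by its name.
means = pipe.named_steps['StandardScaler'].mean_
print(n_cols_deg2, n_cols_deg3, means.shape)
```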

In [20]:
# We can use the LabelEncoder to encode the targets 
from sklearn.preprocessing import LabelEncoder

y_train = y_train.str.strip('.')
y_test = y_test.str.strip('.')

le = LabelEncoder() 
y_train_prepared = le.fit_transform(y_train)
y_train_prepared[:10]

array([0, 0, 0, 0, 0, 1, 0, 0, 1, 0])
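
To see which label each integer stands for, we can look at the encoder's `classes_` attribute and map codes back with `inverse_transform`. A short sketch on made-up labels (not the real target column):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy_labels = pd.Series(['<=50K', '>50K.', '<=50K.', '>50K'])  # toy labels
clean_labels = toy_labels.str.rstrip('.')

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(clean_labels)
print(toy_le.classes_)   # classes are sorted; the position is the integer code
print(encoded)
decoded = toy_le.inverse_transform(encoded)
print(decoded)
```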

## Column transformers

In [21]:
# We can use column transformers to match different features with different pipelines
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

cat_features = X_train.columns.drop(num_features)

encoder = OneHotEncoder()

column_transformer = ColumnTransformer([('Standard Scaler', scaler, num_features), 
                                        ('OneHotEncoder', encoder, cat_features)],
                                      sparse_threshold=0)

X_train_prepared = column_transformer.fit_transform(X_train)
print(X_train_prepared.shape, '\n')
X_train_prepared[:1]

(36631, 107) 



array([[ 0.16988957, -0.35849554, -0.41755169, -0.14580568, -0.21438168,
        -0.03499028,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  1.        ,  0. 

In [22]:
# By combining pipelines and column transformers, we can create any data-preparation 
# configuration we want.
column_transformer = ColumnTransformer([('Numeric pipeline', num_pipeline, num_features), 
                                        ('OneHotEncoder', encoder, cat_features)],
                                      sparse_threshold=0)
X_train_prepared = column_transformer.fit_transform(X_train)
X_train_prepared.shape

(36631, 129)
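
One thing to watch for in production: with default settings, `OneHotEncoder` raises an error when it meets a category it did not see during `fit()`. A possible way around it, sketched here on made-up data, is `handle_unknown='ignore'`, which encodes an unseen category as all zeros. Whether that is the right behavior for this project is a design decision, not something the notebook above settles.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy_train = pd.DataFrame({'age': [25, 35, 45],
                          'workclass': ['Private', 'Local-gov', 'Private']})
toy_new = pd.DataFrame({'age': [35], 'workclass': ['Never-worked']})  # unseen category

ct = ColumnTransformer([('num', StandardScaler(), ['age']),
                        ('cat', OneHotEncoder(handle_unknown='ignore'), ['workclass'])],
                       sparse_threshold=0)
ct.fit(toy_train)
row = ct.transform(toy_new)
print(row)   # scaled age, followed by one column per category seen in training
```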

## Pro's trick: cross-validated pipelines

In [23]:
# Let's use cross validation to compare StandardScaler and MinMaxScaler.
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

std_scaler = StandardScaler()
min_max_scaler = MinMaxScaler()
scaler = None # We will assign the actual scaler to this placeholder variable.
lr = LogisticRegression(random_state=0, max_iter=1000)

column_transformer = ColumnTransformer([('Scaler', scaler, num_features), 
                                 ('OneHotEncoder', encoder, cat_features)])

cv_pipeline = Pipeline([('Transformer', column_transformer), 
                        ('Regressor', lr)])
# Set the parameter to test using cross validation
param_grid = {
    'Transformer__Scaler': [std_scaler, min_max_scaler],
}

grid_search = GridSearchCV(cv_pipeline, param_grid, cv=5, return_train_score=False)

grid_search.fit(X_train, y_train)
print("Standard scaler score: (CV score=%0.3f):" % grid_search.cv_results_['mean_test_score'][0])
print("Min-max scaler score: (CV score=%0.3f):" % grid_search.cv_results_['mean_test_score'][1])

Standard scaler score: (CV score=0.852):
Min-max scaler score: (CV score=0.850):
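
The reason this trick matters: because the scaler lives inside the pipeline, it is re-fitted on the training part of every fold, so nothing leaks from the validation part. The sketch below repeats the idea on synthetic data (generated with `make_classification`, not the census data), uses `'passthrough'` as the placeholder step, and searches over the scaler and the regularization strength `C` together.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the census data (hypothetical).
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([('scaler', 'passthrough'),   # placeholder, replaced by the grid
                 ('clf', LogisticRegression(max_iter=1000))])

param_grid = {'scaler': [StandardScaler(), MinMaxScaler()],
              'clf__C': [0.1, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_syn, y_syn)
print(search.best_params_)
print(len(search.cv_results_['params']))   # 2 scalers x 2 values of C
```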


# Building Pipeline for our Project
## The plan
Here are some decisions on how to process the data:
1. **Data cleaning** - we need to strip the dot from the end of the target.
2. **Missing values** - leave '?' as valid categories.
3. **Feature selection** - we will remove `education` and only use `education-num`.
4. **Scaling** numeric features
5. **Encoding** categorical features
6. **Combining categories**: combining very small categories with bigger ones:
    * For `marital-status`, we will combine `Married-AF-spouse` with `Married-civ-spouse`.
    * For `workclass`, we will combine `Without-pay` and `Never-worked` with `?`.
    * For `occupation`, we will combine `Armed-Forces` with `Prof-specialty`. 
7. **Feature creation and transformation for `native-country`**: to deal with the high cardinality, we will build a target encoder that substitutes each category with its (regularized) positive rate. We will also create categorical variables for continents (to capture interactions).
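
What could a "regularized" positive rate look like? The plan does not fix a formula, so the following is only one common choice, assumed here for illustration: shrink each category's rate toward the global rate, as if every category had `m` extra observations at the global rate. The function name and the `m` parameter are hypothetical. Note that the rates should be learned from the training set only.

```python
import pandas as pd

def smoothed_positive_rate(categories, target, m=10):
    """Per-category positive rate, shrunk toward the global rate.

    m is an assumed smoothing strength: it acts like m pseudo-observations
    at the global positive rate.
    """
    prior = target.mean()
    stats = target.groupby(categories).agg(['sum', 'count'])
    return (stats['sum'] + m * prior) / (stats['count'] + m)

toy = pd.DataFrame({'native-country': ['US', 'US', 'US', 'US', 'Peru'],
                    'target':         [1,    0,    1,    0,    1]})
rates = smoothed_positive_rate(toy['native-country'], toy['target'], m=5)
print(rates)
```

In this toy example the single Peru row has a raw positive rate of 1.0, but the smoothed rate is pulled toward the global rate of 0.6, which is the behavior we want for very small categories.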
   

In [136]:
# Data cleaning: you will develop your own transformer during homework. For now, I am importing a version I created.
from lesson11 import CharacterStripper

character_stripper = CharacterStripper(character_to_strip='.')
y_train_prepared = character_stripper.fit_transform(y_train)
y_train_prepared.value_counts()

<=50K    27824
>50K      8807
Name: >=50K, dtype: int64

In [None]:
# A class that changes category names according to specified rules
class CategoryReviser(BaseEstimator, TransformerMixin):
    ''' This transformer revises categories according to a dictionary with rules '''
    def __init__(self, cat_change_rules=None):
        # Avoid a mutable default argument; None means "no rules".
        self.cat_change_rules = cat_change_rules

    def fit(self, X, y=None):
        # Nothing to learn: the rules are given as a hyperparameter.
        return self

    def transform(self, X, y=None):
        X_transformed = X.copy()
        rules = self.cat_change_rules or {}
        for feature in rules:
            for cat in rules[feature]:
                X_transformed.loc[X[feature]==cat, feature] = rules[feature][cat]
        return X_transformed

In [None]:
cat_change_rules = {'marital-status':{'Married-AF-spouse':'Married', 'Married-civ-spouse':'Married'}, 
                'workclass':{'Without-pay':'?', 'Never-worked':'?'},
                'occupation':{'Armed-Forces':'Prof-specialty'}}

category_reviser = CategoryReviser(cat_change_rules=cat_change_rules)
X_train_prepared = category_reviser.fit_transform(X_train)
X_train_prepared['marital-status'].value_counts()
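
For a quick sanity check outside a pipeline, the same rules dictionary can be passed to pandas' `DataFrame.replace`, which accepts a nested `{column: {old: new}}` mapping. A sketch on a made-up frame (the `toy` data is hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'marital-status': ['Married-AF-spouse', 'Divorced', 'Married-civ-spouse'],
                    'workclass': ['Private', 'Without-pay', 'Never-worked']})

rules = {'marital-status': {'Married-AF-spouse': 'Married', 'Married-civ-spouse': 'Married'},
         'workclass': {'Without-pay': '?', 'Never-worked': '?'}}

# replace() returns a new frame; the original is left untouched.
revised = toy.replace(rules)
print(revised)
```

The transformer class is still the better fit for our project, since it can sit inside a pipeline and be tuned through `set_params`.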

# Homework
1. Earlier, we manually stripped the dot from the end of some values in the target feature. Your task is to create a custom transformer - `CharacterStripper` - which does it automatically and can work on production data. The example below shows the expected behavior of the `CharacterStripper`.

In [137]:
# Let's un-strip the dots from the end of the target feature
y_train = df_train['>=50K']
y_test = df_test['>=50K']   
y_train.value_counts()

<=50K     18490
<=50K.     9334
>50K       5923
>50K.      2884
Name: >=50K, dtype: int64

In [138]:
# Develop a character stripper that works like this:
from lesson11 import CharacterStripper

character_stripper = CharacterStripper(character_to_strip='.')
y_train_prepared = character_stripper.fit_transform(y_train)
y_train_prepared.value_counts()

<=50K    27824
>50K      8807
Name: >=50K, dtype: int64

2. Build a TargetEncoder transformer that replaces a category with the positive rate for that category.

In [141]:
X_train.head()

| | age | capital-gain | capital-loss | education | education-num | fnlwgt | hours-per-week | marital-status | native-country | occupation | race | relationship | sex | workclass | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38113 | 41 | 0 | 0 | HS-grad | 9 | 151856 | 40 | Married | United-States | Protective-serv | White | Husband | Male | Private | 0 |
| 39214 | 57 | 0 | 0 | Doctorate | 16 | 87584 | 25 | Divorced | United-States | Prof-specialty | White | Not-in-family | Female | Self-emp-not-inc | 0 |
| 44248 | 31 | 6849 | 0 | Bachelors | 13 | 220669 | 40 | Never-married | United-States | Prof-specialty | White | Own-child | Female | Local-gov | 0 |
| 10283 | 55 | 0 | 0 | Assoc-voc | 11 | 171355 | 20 | Married | United-States | Machine-op-inspct | White | Husband | Male | Private | 0 |
| 26724 | 59 | 0 | 0 | 10th | 6 | 148626 | 40 | Married | United-States | Farming-fishing | White | Husband | Male | Self-emp-not-inc | 0 |


In [148]:
df_test = X_train.copy()
df_test['fnlwgt'] = df_test['fnlwgt'] * 13.67
df_test['fnlwgt'] = df_test['fnlwgt'].apply(lambda x : '{0:,.0f}'.format(x))
df_test.head().to_csv('test.csv')
df_test['fnlwgt'].head()

38113    2,075,872
39214    1,197,273
44248    3,016,545
10283    2,342,423
26724    2,031,717
Name: fnlwgt, dtype: object

In [147]:
df_test.head().to_csv('test.csv')