# Objective: 

https://courses.dataschool.io/view/courses/building-an-effective-machine-learning-workflow-with-scikit-learn/723182-introduction/2106243-welcome-to-the-course

To help you work efficiently in scikit-learn so that you can solve supervised learning problems with real-world data. The focus is specifically ML-centric and scikit-learn-centric.

Getting the workflow right matters far more than choosing an "algorithm". If you nail the workflow, it becomes very easy to experiment with different algorithms quickly. Understanding algorithms is useful without a doubt, but following a proper workflow is more important.

<b>Top priority</b>: To stay on a single mental track so that nobody gets lost in the course material.

# Outline:

- Review of the basic ML workflow
- Encoding categorical data
- Using ColumnTransformer and Pipeline
- Encoding text data
- Handling missing values
- Switching to full dataset
- Evaluating and tuning a Pipeline

# Library requirements

The only libraries you'll need to install are scikit-learn (version 0.20.2 or later) and pandas (any version).

In [1]:
import sklearn
sklearn.__version__

'1.1.3'

In [2]:
seed = 1998

# Review of basic ML workflow

In [3]:
import pandas as pd

In [4]:
# Loading the Titanic dataset
# Target: Survived
df = pd.read_csv('titanic.csv', nrows=10)
df.sample()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [5]:
# selecting two features: Parch (no. of parents/children aboard with the passenger), Fare
X = df[['Parch', 'Fare']]
y = df['Survived']
X.shape

(10, 2)

#### Note: Conventions of naming X & y

**We use a capital letter for the feature matrix (X) because it is two-dimensional, whereas we use a lowercase letter for the target (y) because it is one-dimensional.**

In [6]:
# Using logistic regression to evaluate the model (good starting choice)
from sklearn.linear_model import LogisticRegression

In [7]:
logreg = LogisticRegression(solver='liblinear', random_state=seed)

#### Note: Model evaluation

The goal of model evaluation is to simulate model performance on future data so that we can choose between models. For that we need two things:
- Evaluation procedure -> <i>Cross validation</i> for this case
- Evaluation metric -> <i>Accuracy</i> for this case

In [8]:
from sklearn.model_selection import cross_val_score

In [9]:
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()

0.6944444444444443

#### Note: cross_val_score

```cross_val_score``` does the dataset splitting, model training, prediction and evaluation all together, so it wraps a lot of functionality in a single line of code. ```cross_val_score``` does not return the fitted model objects. You can use ```cross_validate``` with ```return_estimator=True``` for that.
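
The difference can be sketched as follows. This is a minimal example on a made-up toy dataset (the values below are hypothetical and only stand in for the Parch/Fare features); it shows `cross_validate` returning both the per-fold scores and the fitted estimators.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical toy data standing in for the Parch/Fare features (values are made up)
X_toy = pd.DataFrame({
    'Parch': [0, 0, 1, 2, 0, 1, 0, 2, 0, 1, 0, 2],
    'Fare': [7.3, 71.3, 7.9, 53.1, 8.1, 8.5, 51.9, 21.1, 11.1, 30.1, 9.0, 60.0],
})
y_toy = pd.Series([0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1])

model = LogisticRegression(solver='liblinear', random_state=0)

# return_estimator=True keeps the fitted model from each fold
results = cross_validate(model, X_toy, y_toy, cv=3, scoring='accuracy',
                         return_estimator=True)

print(results['test_score'].mean())  # the same kind of number cross_val_score reports
print(len(results['estimator']))     # one fitted LogisticRegression per fold
```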

In [10]:
# Fitting the model: training
logreg.fit(X, y)

In [11]:
# Reading new dataset
df_new = pd.read_csv('titanic.csv').tail(10).drop(columns=['Survived'])
df_new.shape

(10, 11)

In [12]:
X_new = df_new[['Parch', 'Fare']]
logreg.predict(X_new)

array([0, 0, 0, 0, 1, 0, 1, 1, 1, 0], dtype=int64)

#### Model v/s Algorithm

An algorithm is the general approach you will take. The model is what you get when you run the algorithm over your training data and what you use to make predictions on new data.
You can generate a new model with the same algorithm but with different data, or you can get a new model from the same data but with a different algorithm.

**For example**: LogisticRegression() implements an algorithm that uses solver='liblinear' to solve the optimization problem needed to get the feature coefficients. When we say model, we are really talking about the logreg object, which can be trained and can then predict on unseen data.
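
As a rough illustration of the distinction, the sketch below runs the same algorithm on two made-up datasets (hypothetical values) and ends up with two different models, visible in their learned coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two made-up training sets (hypothetical values): same feature, opposite labels
X_a = np.array([[1.0], [2.0], [3.0], [4.0]])
y_a = np.array([0, 0, 1, 1])
X_b = np.array([[1.0], [2.0], [3.0], [4.0]])
y_b = np.array([1, 1, 0, 0])

# Same algorithm (same class, same settings) ...
model_a = LogisticRegression(solver='liblinear').fit(X_a, y_a)
model_b = LogisticRegression(solver='liblinear').fit(X_b, y_b)

# ... but two different models: the learned coefficients differ
print(model_a.coef_, model_b.coef_)
```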

https://stackoverflow.com/questions/44824153/what-is-the-exact-difference-between-a-model-and-an-algorithm

https://math.stackexchange.com/questions/2269641/whats-the-difference-between-a-model-and-an-algorithm

Additional: <a href="https://stackoverflow.com/questions/2334225/what-is-the-difference-between-a-heuristic-and-an-algorithm?rq=3">Algorithm vs Heuristic</a>

# Encoding categorical data

In [13]:
df.sample(2)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


We want to improve our model by adding two more columns: Embarked and Sex.

In [14]:
from sklearn.preprocessing import OneHotEncoder

In [15]:
ohe = OneHotEncoder()
ohe.fit_transform(df[['Embarked']])

<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

#### Note: Questions from OHE

<ol>
    <li>What is a Sparse matrix?</li>
    <p>If the matrix is mostly 0s, instead of storing everything, you store only the positions of the non-zero values and the values at those positions. This requires less storage and is faster. You can set <b>sparse=False</b> on the ohe object to view the complete (dense) matrix (in scikit-learn 1.2 and later this parameter is named <b>sparse_output</b>).</p><br>
    <li>What values exist in the matrix?</li>
    <p>OHE learns the categories in alphabetical order. So, the first column of the output corresponds to Embarked = 'C'. It created 3 columns for Embarked in this case. We have essentially created 3 features from this one categorical feature, giving the model an opportunity to learn the relationship between each of these 3 columns and the target.</p>
    <p>One could also encode them into a single column but we would not do that GENERALLY speaking because that would imply an inherent ordering of the categories which may not be the case.</p><br>
    <li>What is fit_transform?</li>
    <p>OHE is a transformer, meaning that its role is to transform data.</p>
    <p>fit is when it learns something from the data. transform is when it uses what it learned to do the transformation. For OHE, fit is when it learns the categories, and transform is when it uses those categories to create the output matrix.</p>
    <p>ohe.fit_transform effectively does ohe.fit() and then ohe.transform().</p>
    <li>Why two brackets for column selection with the ohe object?</li>
    <p>Because double brackets ensure that the output is a dataframe. When you pass a dataframe to sklearn, it knows that the passed value is a column (a single feature with some observations) rather than a row (a single observation with many features). Sklearn explicitly needs to know that you want to work with a 2D object.</p>
    <p>A series looks like a column but is a one dimensional object.</p>
</ol>
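
The points above can be sketched on a small made-up column (a hypothetical stand-in for Embarked). The sketch shows the 1D/2D difference, the fit-then-transform equivalence, and uses `toarray()` to view the sparse result densely, which avoids relying on the version-dependent `sparse` parameter.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Embarked column
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

print(toy['Embarked'].shape)    # (4,)   Series: 1D
print(toy[['Embarked']].shape)  # (4, 1) DataFrame: 2D

# Two separate steps ...
ohe_two_step = OneHotEncoder()
ohe_two_step.fit(toy[['Embarked']])                   # learns the categories
two_step = ohe_two_step.transform(toy[['Embarked']])  # builds the sparse matrix

# ... or both at once
one_step = OneHotEncoder().fit_transform(toy[['Embarked']])

# toarray() converts the sparse result into a dense array for inspection
print(two_step.toarray())
```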

In [16]:
# One array per encoded column; the categories inside each array are sorted alphabetically
ohe.categories_

[array(['C', 'Q', 'S'], dtype=object)]

#### Note: The trailing underscore in some object attributes

The trailing underscore in some model attributes (like ohe.categories_) indicates that the attribute was learned during the **fit** (training) step. This is a scikit-learn convention.
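
A quick way to see this convention in action, as a small sketch on made-up values: the attribute does not exist until fit has run.

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
before = hasattr(enc, 'categories_')  # nothing has been learned yet
enc.fit([['S'], ['C'], ['Q']])
after = hasattr(enc, 'categories_')   # exists once fit has run

print(before, after)  # False True
```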

In [17]:
# Encoding both the categorical variables
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['Embarked', 'Sex']])

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])

#### Note: Current workflow issue
Now, we can stack them next to our two numerical features and train our model. But we also need to replicate the same steps to pre-process ```df_new```, in which the mentioned features are not yet one-hot encoded. So, anything we do to the training data must also be done to the testing data.

That brings up two problems:

- **Problem 1**: You repeat the same workflow twice which is inefficient and error prone.

- **Problem 2**: Your workflow complexity continues to increase with every additional pre-processing step.

# Using ColumnTransformer and Pipeline

In [18]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']
X = df[cols]
X.sample()

Unnamed: 0,Parch,Fare,Embarked,Sex
1,0,71.2833,C,female


In [19]:
from sklearn.compose import make_column_transformer

In [20]:
ohe = OneHotEncoder()
# You pass one or more tuples of length 2 which 
# should be: (transformer, list of columns)
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough'
)

In [21]:
# It one-hot encodes the two columns first and then stacks the remaining
# (passthrough) columns next to the encoded columns
ct.fit_transform(X)

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])

To apply the same sets of transformations to both the training and testing sets, we use Pipeline. It chains together steps **sequentially.**

In [22]:
from sklearn.pipeline import make_pipeline

In [23]:
pipe = make_pipeline(ct, logreg) # effectively: logreg.fit(ct.fit_transform(X), y)

In [24]:
# First the X, which has 4 columns. These 4 columns get passed to the first step in the pipeline object
# which is the ColumnTransformer. CT turns 4 cols into 7 columns. These 7 columns are passed to logistic
# regression as feature matrix and then logreg is fit on the transformed feature matrix of 7 columns with the
# y variable
pipe.fit(X, y)

In [25]:
pipe.named_steps

{'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('onehotencoder', OneHotEncoder(),
                                  ['Embarked', 'Sex'])]),
 'logisticregression': LogisticRegression(random_state=1998, solver='liblinear')}

In [26]:
# Getting the coefficients of fitted logistic regressions
pipe.named_steps.logisticregression.coef_

array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])

In [27]:
X_new = pd.read_csv('titanic.csv')[cols].tail(10)
pipe.predict(X_new) # Equivalent to: logreg.predict(ct.transform(X_new)) [notice: transform, not fit_transform]

array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0], dtype=int64)
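
The equivalence mentioned in the comment above can be checked with a small sketch. The rows below are made-up (hypothetical) values, and the `_toy`/`_manual` names are introduced only for this illustration: one path uses a Pipeline, the other repeats the steps by hand.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows (made-up values)
train = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
    'Fare': [7.25, 71.28, 7.93, 8.05, 53.10, 8.46],
})
y_train = pd.Series([0, 1, 1, 0, 1, 0])
new = pd.DataFrame({'Sex': ['female', 'male'], 'Fare': [30.0, 9.5]})

# With a Pipeline: one object handles preprocessing and prediction
pipe_toy = make_pipeline(
    make_column_transformer((OneHotEncoder(), ['Sex']), remainder='passthrough'),
    LogisticRegression(solver='liblinear'),
)
pipe_toy.fit(train, y_train)
pipe_pred = pipe_toy.predict(new)

# By hand: fit_transform on the training data, transform only on the new data
ct_manual = make_column_transformer((OneHotEncoder(), ['Sex']), remainder='passthrough')
logreg_manual = LogisticRegression(solver='liblinear')
logreg_manual.fit(ct_manual.fit_transform(train), y_train)
manual_pred = logreg_manual.predict(ct_manual.transform(new))

print(np.array_equal(pipe_pred, manual_pred))
```

The manual path is exactly the duplication that the Pipeline removes: every preprocessing step has to be written once for training and once more for new data.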

# Encoding text data

We can try improving the model by adding more features. In this case, maybe Name has some information. Maybe if someone's title is Master or Dr. they are rescued first.

There may be predictive information in the Name column.

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

In [29]:
vect = CountVectorizer()
# count vectorizer wants a 1D object
# returns a document term matrix
dtm = vect.fit_transform(df['Name'])
dtm

<10x40 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>

In [30]:
# Getting the features it created
print(vect.get_feature_names_out())

['achem' 'adele' 'allen' 'berg' 'bradley' 'braund' 'briggs' 'cumings'
 'elisabeth' 'florence' 'futrelle' 'gosta' 'harris' 'heath' 'heikkinen'
 'henry' 'jacques' 'james' 'john' 'johnson' 'laina' 'leonard' 'lily'
 'master' 'may' 'mccarthy' 'miss' 'moran' 'mr' 'mrs' 'nasser' 'nicholas'
 'oscar' 'owen' 'palsson' 'peel' 'thayer' 'timothy' 'vilhelmina' 'william']


In [31]:
# the matrix that will get passed to the logistic model
# toarray() converts the sparse matrix to a dense array
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())

Unnamed: 0,achem,adele,allen,berg,bradley,braund,briggs,cumings,elisabeth,florence,...,nasser,nicholas,oscar,owen,palsson,peel,thayer,timothy,vilhelmina,william
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,1,0,1,1,0,1,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,0,0,0,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
9,1,1,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0


#### Note: Is CountVectorizer NLP?

- A bag-of-words representation is an ML technique for turning text into features. It is one way of representing text, and it can also serve as a step towards language understanding, which is the goal of NLP.

- The only words that were not converted to a feature, in this case, were single-letter words. This is because CountVectorizer's default token pattern only keeps tokens of two or more characters.

#### Note: How is CountVec different from one hot encoding?

- OHE would have treated each full string in the Name column as a single category, which would have resulted in 10 columns, one for each name in df.

- CountVectorizer, on the other hand, gives 40 columns after applying its own pre-processing (lowercasing and splitting into words). We would have learned very little by simply one-hot encoding Name.

In [32]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X.sample()

Unnamed: 0,Parch,Fare,Embarked,Sex,Name
5,0,8.4583,Q,male,"Moran, Mr. James"


In [33]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    # CountVectorizer expects a 1D input, so pass the column
    # name as a string, not a list. For multiple text columns,
    # pass one (vect, column) tuple per column.
    (vect, 'Name'),
    remainder='passthrough'
)

In [34]:
pipe = make_pipeline(ct, logreg)

In [35]:
pipe.fit(X, y)

In [36]:
X_new = pd.read_csv('titanic.csv')[cols].tail(10)
pipe.predict(X_new) # Equivalent to: logreg.predict(ct.transform(X_new)) [notice: transform, not fit_transform]

array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0], dtype=int64)

# Handling missing values 

In [66]:
df.sample()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


NaN : Missing value

_This is different from new categories (OHE) or new words that were not seen in the training data._

In [67]:
cols

['Parch', 'Fare', 'Embarked', 'Sex', 'Name']

In [68]:
cols.append('Age')

In [69]:
X = df[cols]
X

Unnamed: 0,Parch,Fare,Embarked,Sex,Name,Age
0,0,7.25,S,male,"Braund, Mr. Owen Harris",22.0
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,0,7.925,S,female,"Heikkinen, Miss. Laina",26.0
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,0,8.05,S,male,"Allen, Mr. William Henry",35.0
5,0,8.4583,Q,male,"Moran, Mr. James",
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J",54.0
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard",2.0
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)",14.0


### Option 1: Drop all missing values

It is okay to do so under the following circumstances:

- If you are losing only a tiny proportion of your training data.

- If you know that the missingness is completely random.

You can drop rows that have a missing value in any column using X.dropna(), or only rows that have a missing value in a subset of columns using X.dropna(subset=['Age']).

Another option is to drop an entire column if it contains any missing values. X.dropna(axis='columns') will drop the entire 'Age' column from our dataset.
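
The three dropping variants can be compared on a tiny made-up frame (hypothetical values, not the Titanic data):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Fare': [7.25, 71.28, np.nan, 8.05],
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Parch': [0, 0, 1, 2],
})

rows_any = toy.dropna()                  # drop rows with any missing value
rows_age = toy.dropna(subset=['Age'])    # drop rows only where Age is missing
no_na_cols = toy.dropna(axis='columns')  # drop every column that has a missing value

print(rows_any.shape, rows_age.shape, list(no_na_cols.columns))
# (2, 3) (3, 3) ['Parch']
```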

### Option 2: Impute missing values

Something to keep in mind is that imputation means inferring/filling values from the available data. Rules of thumb like "drop when >95% empty" are okay, but a feature could be 80% empty and still be useful, especially if the missingness is **not random**.

In [70]:
from sklearn.impute import SimpleImputer

In [72]:
imp = SimpleImputer()
# uses mean by default
imp.fit_transform(X[['Age']])

array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [28.11111111],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ]])

In [73]:
imp.statistics_

array([28.11111111])

In [74]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough'
)
ct.fit_transform(X)

<10x48 sparse matrix of type '<class 'numpy.float64'>'
	with 88 stored elements in Compressed Sparse Row format>

In [75]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
# pipe.named_steps

Conceptually, pipe predicts X_new, one row at a time. It only makes sense to impute the value in the testing set, by using the value that was fit from the training set.
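
A minimal sketch of this idea on made-up ages (hypothetical values): the imputer learns its fill value from the training data only and reuses it for new data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

age_train = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
age_new = pd.DataFrame({'Age': [np.nan, 80.0]})

imp_toy = SimpleImputer()  # mean strategy by default
imp_toy.fit(age_train)     # learns the training mean: (20 + 30 + 40) / 3 = 30
filled_new = imp_toy.transform(age_new)

print(imp_toy.statistics_)  # [30.]
print(filled_new)           # the missing value is filled with 30, not with 80
```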

In [76]:
X_new = df_new[cols]
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 1, 0, 0], dtype=int64)

### Option 3: Imputation + Adding missingness as a feature

In [77]:
imp_ind = SimpleImputer(add_indicator=True)
imp_ind.fit_transform(X[['Age']])

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

# Switching to full dataset

In [78]:
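df = pd.read_csv('titanic.csv')
# Rebuild X and y from the full dataset (until now they held the 10-row sample)
X = df[cols]
y = df['Survived']
df.shape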

(891, 12)

## Missing values: issues

In [79]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [80]:
df_test = pd.read_csv('titanic_test.csv')
df_test.shape

(418, 11)

In [81]:
df_test.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

**Problems:**

- Embarked has missing values in training but not in testing.
- Fare has missing values in testing but not in training.

Impute missing values before OHE.

In [87]:
# Problem 1: Embarked missing

# Fill with the string 'missing' so that a missing value is treated as a 4th Embarked category.
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')

# Transformer only pipeline: imputation + ohe
imp_ohe = make_pipeline(imp_constant, ohe)

# New column transformer
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough'
)

ct.fit_transform(X)

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

In [88]:
# Problem 2: Fare missing (updating previous cell code)

# Unchanged from the previous cell. The only change below is adding 'Fare' to the
# columns handled by imp, so a missing Fare in the test set gets the training mean.
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')

# Transformer only pipeline: imputation + ohe
imp_ohe = make_pipeline(imp_constant, ohe)

# New column transformer
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough'
)

ct.fit_transform(X)

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

#### NOTE: All Pipeline steps other than the final step MUST be transformers. The final step can be a model or a transformer.

In [89]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)

In [90]:
X_new = df_test[cols]
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [91]:
# To view the statistics of transformers
ct.named_transformers_.simpleimputer.statistics_

array([29.69911765, 32.20420797])

In [92]:
ct.named_transformers_.pipeline.named_steps.simpleimputer.statistics_

array(['missing', 'missing'], dtype=object)

#### Note: Q) Why is it necessary to build a pipeline of SI and OHE for Embarked and Sex instead of just adding it as a step in the column transformer?

Hint: Results get stacked, not replaced.
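
The hint can be illustrated with a small sketch. StandardScaler is swapped in for OneHotEncoder here purely for illustration (an assumption, not the notebook's setup), because it passes NaN through and so makes the effect easy to see on made-up ages.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ages = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})

# Two separate entries: each one receives the ORIGINAL column, outputs are stacked
side_by_side = make_column_transformer(
    (SimpleImputer(), ['Age']),
    (StandardScaler(), ['Age']),
).fit_transform(ages)

# One Pipeline entry: the scaler receives the imputer's output
chained = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), ['Age']),
).fit_transform(ages)

print(side_by_side.shape, np.isnan(side_by_side).any())  # (3, 2) True
print(chained.shape, np.isnan(chained).any())            # (3, 1) False
```

With separate entries the second transformer still sees the missing value, which is why imputation and encoding are chained in a transformer-only Pipeline.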

#### Note: Why use Sklearn instead of pandas for pre-processing?

- Pandas has no equivalent of CountVectorizer. Combining sklearn and pandas just for CountVectorizer is too much hassle, for example going back and forth between dense and sparse matrices.

- Performing OHE using get_dummies adds those columns to the pandas dataframe, which is not desirable. Using sklearn leaves the source dataset as is.

- Missing value imputation using pandas can cause data leakage. A model evaluation procedure such as train-test split or cross-validation is supposed to simulate the future so that we can estimate the future performance of the model. If we impute missing values using pandas and then pass the dataset to the model, the evaluation will no longer accurately simulate reality, because the model will learn something that it is not supposed to. For example, your imputation values will be based on the entire dataset instead of just the "training portion". Data leakage in this case therefore boils down to learning something from the testing data that you are not supposed to learn. This is one reason why separate fit and transform methods are so useful.

- A sklearn Pipeline can be cross-validated as a whole, pre-processing included.
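
The leakage point can be made concrete with a small sketch on made-up fares (hypothetical values): the fill value computed from all rows differs from the one computed from the training portion only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

fare = pd.DataFrame({'Fare': [5.0, 10.0, np.nan, 15.0, 200.0, np.nan, 20.0, 300.0]})
# shuffle=False keeps the example deterministic: the last 2 rows are held out
fare_train, fare_test = train_test_split(fare, test_size=0.25, shuffle=False)

# Leaky: the fill value is computed from ALL rows, including the held-out ones
leaky_fill = fare['Fare'].mean()

# Leak-free: the imputer is fit on the training portion only
imp_train_only = SimpleImputer().fit(fare_train)
clean_fill = imp_train_only.statistics_[0]

print(leaky_fill, clean_fill)
```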

# Evaluating and tuning a Pipeline

In [93]:
from sklearn.model_selection import cross_val_score

In [94]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

array([0.79888268, 0.8258427 , 0.80337079, 0.78651685, 0.84269663])

**A neat thing about cross-validation: in this case it divides the data into 5 parts (folds), trains on 4 of them (does a fit), uses what it learnt from those 4 folds to evaluate on the remaining fold, and returns the accuracy. It is very important to know that cv splits the data before any transformations are applied, to simulate the real scenario as closely as possible. This is repeated 4 more times, so that each fold is used for evaluation exactly once.**
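
The splitting logic can be sketched with plain KFold on 10 stand-in row indices. This is a simplification: for classifiers, cross_val_score with an integer cv uses a stratified variant, but the fold mechanics are the same.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(10)  # stand-in for 10 observations
fold_sizes = []
tested = []

for train_idx, test_idx in KFold(n_splits=5).split(rows):
    # Each round: fit on 4 folds (train_idx), score on the remaining fold (test_idx)
    fold_sizes.append((len(train_idx), len(test_idx)))
    tested.extend(test_idx)

print(fold_sizes)      # five (8, 2) pairs
print(sorted(tested))  # every row is used for testing exactly once
```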

## Tuning hyperparameters of the Pipeline

**Parameters**: values learnt by the model <br>
**Hyperparameters**: values set by the modeller.
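
A small sketch of the distinction, on made-up data. StandardScaler is used here only as a convenient first step (an assumption for illustration). It also shows the `<step name>__<parameter name>` convention that a Pipeline uses to address hyperparameters.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 1, 1])

pipe_toy = make_pipeline(StandardScaler(), LogisticRegression(C=10))

# Hyperparameters: set by us, addressable as <step name>__<parameter name>
print(pipe_toy.get_params()['logisticregression__C'])  # 10

# Parameters: learned during fit, stored with a trailing underscore
pipe_toy.fit(X_toy, y_toy)
print(pipe_toy.named_steps['logisticregression'].coef_)
```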

In [95]:
pipe.named_steps.keys()

dict_keys(['columntransformer', 'logisticregression'])

### Tuning LogisticRegression

In [96]:
params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]

In [97]:
from sklearn.model_selection import GridSearchCV

In [98]:
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)

In [101]:
pd.DataFrame(grid.cv_results_).sort_values('rank_test_score')[:2]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
4,0.017427,0.00211,0.006022,0.00084,10,l1,"{'logisticregression__C': 10, 'logisticregress...",0.826816,0.820225,0.820225,0.792135,0.848315,0.821543,0.01796,1
2,0.013607,0.000771,0.0058,0.000281,1,l1,"{'logisticregression__C': 1, 'logisticregressi...",0.815642,0.820225,0.797753,0.792135,0.848315,0.814814,0.019787,2


### Tuning ColumnTransformer

In [104]:
pipe.named_steps.columntransformer.named_transformers_

{'pipeline': Pipeline(steps=[('simpleimputer',
                  SimpleImputer(fill_value='missing', strategy='constant')),
                 ('onehotencoder', OneHotEncoder())]),
 'countvectorizer': CountVectorizer(),
 'simpleimputer': SimpleImputer(),
 'remainder': 'passthrough'}

In [105]:
# Tuning drop (OHE), ngram_range (CountVectorizer) and add_indicator (SimpleImputer for Age/Fare)
params['columntransformer__pipeline__onehotencoder__drop'] = [None, 'first']
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
params['columntransformer__simpleimputer__add_indicator'] = [False, True]

In [106]:
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X, y)

In [107]:
pd.DataFrame(grid.cv_results_).sort_values(by='rank_test_score')[:2]

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__countvectorizer__ngram_range,param_columntransformer__pipeline__onehotencoder__drop,param_columntransformer__simpleimputer__add_indicator,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
28,0.023085,0.001123,0.005985,0.000264,"(1, 2)",,False,10,l1,{'columntransformer__countvectorizer__ngram_ra...,0.843575,0.825843,0.825843,0.786517,0.853933,0.827142,0.022985,1
40,0.027629,0.00298,0.006493,0.000253,"(1, 2)",first,False,10,l1,{'columntransformer__countvectorizer__ngram_ra...,0.849162,0.825843,0.820225,0.780899,0.859551,0.827136,0.027288,2


In [108]:
grid.best_score_, grid.best_params_

(0.8271420500910176,
 {'columntransformer__countvectorizer__ngram_range': (1, 2),
  'columntransformer__pipeline__onehotencoder__drop': None,
  'columntransformer__simpleimputer__add_indicator': False,
  'logisticregression__C': 10,
  'logisticregression__penalty': 'l1'})

#### Note: The grid object refits the model on all of the data using the best set of parameters, so you can use it directly to make predictions.
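
This behaviour can be sketched on synthetic data (the arrays below are made up). With the default refit=True, predicting with the grid object is the same as predicting with its best_estimator_.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Made-up data: 40 rows, the class depends on the sign of the first feature
X_toy = np.column_stack([np.linspace(-2, 2, 40), np.tile([0.5, -0.5], 20)])
y_toy = (X_toy[:, 0] > 0).astype(int)

grid_toy = GridSearchCV(LogisticRegression(solver='liblinear'),
                        {'C': [0.1, 1, 10]}, cv=5, scoring='accuracy')
grid_toy.fit(X_toy, y_toy)

# refit=True is the default: best_estimator_ is already trained on all of X_toy
print(grid_toy.best_params_)
same = np.array_equal(grid_toy.predict(X_toy),
                      grid_toy.best_estimator_.predict(X_toy))
print(same)
```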

In [109]:
grid.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0,

# QnA

### Why is it important to select a Series instead of a Dataframe for the target variable?

Accessing using single brackets to return a series gives you a 1D output whereas using double brackets gives you a 2D output. (see below)

Sklearn wants a 1D array for the target. Notice the output once we convert the column to a numpy array.

Reason: Sklearn supports multilabel classification (different from multiclass: in multilabel, each observation can have multiple labels simultaneously). In the multilabel case, the target is a 2D array. A 2D target therefore signals a multilabel problem to sklearn, which is not the case in this example.

**Series**

In [37]:
df['Survived'], df['Survived'].shape 

(0    0
 1    1
 2    1
 3    1
 4    0
 5    0
 6    0
 7    0
 8    1
 9    1
 Name: Survived, dtype: int64,
 (10,))

In [38]:
df['Survived'].to_numpy()

array([0, 1, 1, 1, 0, 0, 0, 0, 1, 1], dtype=int64)

**Dataframe**

In [39]:
df[['Survived']], df[['Survived']].shape

(   Survived
 0         0
 1         1
 2         1
 3         1
 4         0
 5         0
 6         0
 7         0
 8         1
 9         1,
 (10, 1))

In [40]:
df[['Survived']].to_numpy()

array([[0],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1]], dtype=int64)

### Why use LogisticRegression instead of LogisticRegressionCV, which has built-in cross-validation?

LogisticRegressionCV is a variation of the LogisticRegression class that lets you tune parameters without using GridSearchCV. But it does not fit very well with this workflow: it covers one specific use case and is not very flexible, especially when you want to experiment with multiple models.

### When OHE, what happens if the testing data has a new category that was not in the training data?

**Training OHE**

In [41]:
demo_train = pd.DataFrame({'letter': ['A', 'B', 'C', 'A']})
demo_train

Unnamed: 0,letter
0,A
1,B
2,C
3,A


In [42]:
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(demo_train[['letter']])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

**Testing OHE - I**

In [43]:
demo_test = pd.DataFrame({'letter': ['A', 'C', 'A']})
demo_test

Unnamed: 0,letter
0,A
1,C
2,A


In [44]:
ohe.transform(demo_test[['letter']])

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

**Testing OHE - II**

In [45]:
demo_test_unseen = pd.DataFrame({'letter': ['A', 'C', 'D']})
demo_test_unseen

Unnamed: 0,letter
0,A
1,C
2,D


In [47]:
# ohe.transform(demo_test_unseen[['letter']])
# ValueError: Found unknown categories ['D'] in column 0 during transform

In [48]:
ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
ohe.fit_transform(demo_train[['letter']])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [50]:
ohe.transform(demo_test_unseen[['letter']])

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 0.]])

### Should we drop one of the OHE features, since some models (like linear regression) do not like collinearity between features?

- Contrary to what theory might suggest, multicollinearity is rarely a problem with most sklearn models.

- Dropping a column encodes one of the categories as [0 0 0 ... 0]. But if you encounter an unknown category in production with handle_unknown='ignore', it will also be encoded as [0 0 0 ... 0], so the two become indistinguishable. In short, drop='first' should not be combined with handle_unknown='ignore' (older scikit-learn versions raise an error for this combination).

- If using regularization or Standardization, using OHE with drop='first' does not make a lot of sense, theoretically.  

### What's the difference between OHE, OrdinalEncoder and LabelEncoder?

- OHE is used for **unordered** categorical data.

- Ordinal Encoder is used for **ordered** categorical data. It is for the features.

- LabelEncoder is similar to OrdinalEncoder, except for a few differences: a) OE is used for features, LE is used for labels (the target). b) OE allows you to define the category ordering; LE does not, and always encodes alphabetically. c) OE works with multiple features; LE assumes you have only 1 target and therefore works on a single column.

Sklearn models can now all handle a string-based **target**, so there is no real need for LabelEncoder anymore.
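
The ordering difference between the two encoders can be sketched on a made-up ordered feature (the `level` column and its values are hypothetical):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

levels = pd.DataFrame({'level': ['low', 'high', 'medium', 'low']})

# OrdinalEncoder: 2D input (features), and we choose the order
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
oe_codes = oe.fit_transform(levels[['level']]).ravel()
print(oe_codes)     # [0. 2. 1. 0.]

# LabelEncoder: 1D input (target), and the order is always alphabetical
le = LabelEncoder()
le_codes = le.fit_transform(levels['level'])
print(le_codes)     # [1 0 2 1]
print(le.classes_)  # ['high' 'low' 'medium']
```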

### In a ColumnTransformer, what are the options for "remainder"?

**passthrough lets the unmarked columns go as is.**

In [52]:
ohe = OneHotEncoder()

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough'
)

In [54]:
X.columns

Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')

In [55]:
ct.fit_transform(X)

<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>

**drop removes the unmarked columns**

In [56]:
ohe = OneHotEncoder()

ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='drop'
)

In [57]:
X.columns

Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name'], dtype='object')

In [58]:
ct.fit_transform(X)

<10x45 sparse matrix of type '<class 'numpy.float64'>'
	with 66 stored elements in Compressed Sparse Row format>

**getting the names of columns**

In [59]:
ct.get_feature_names_out()[:25]

array(['onehotencoder__Embarked_C', 'onehotencoder__Embarked_Q',
       'onehotencoder__Embarked_S', 'onehotencoder__Sex_female',
       'onehotencoder__Sex_male', 'countvectorizer__achem',
       'countvectorizer__adele', 'countvectorizer__allen',
       'countvectorizer__berg', 'countvectorizer__bradley',
       'countvectorizer__braund', 'countvectorizer__briggs',
       'countvectorizer__cumings', 'countvectorizer__elisabeth',
       'countvectorizer__florence', 'countvectorizer__futrelle',
       'countvectorizer__gosta', 'countvectorizer__harris',
       'countvectorizer__heath', 'countvectorizer__heikkinen',
       'countvectorizer__henry', 'countvectorizer__jacques',
       'countvectorizer__james', 'countvectorizer__john',
       'countvectorizer__johnson'], dtype=object)

### Is there a more efficient way to specify columns for a ColumnTransformer than listing them one-by-one?

In [61]:
# specify by position
ct = make_column_transformer(
    (ohe, [2, 3]),
    (vect, 4),
    remainder='passthrough'
)

# specify by slice
ct = make_column_transformer(
    (ohe, slice(2, 4)),
    (vect, 4),
    remainder='passthrough'
)

# specify using make_column_selector
from sklearn.compose import make_column_selector
cs = make_column_selector(pattern='E|S')
ct = make_column_transformer(
    (ohe, cs),
    (vect, 4),
    remainder='passthrough'
)
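Besides `pattern`, `make_column_selector` can also select columns by data type. The sketch below assumes a small made-up DataFrame with the same column names as the course data; it one-hot encodes every non-numeric column without naming any of them.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Small illustrative frame (values are made up)
demo = pd.DataFrame({
    'Parch': [0, 1, 0, 2],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'male'],
})

# Select every column that is not numeric
non_numeric = make_column_selector(dtype_exclude='number')

ct_demo = make_column_transformer(
    (OneHotEncoder(), non_numeric),
    remainder='passthrough'
)
Xt = ct_demo.fit_transform(demo)
print(Xt.shape)   # (4, 7): 3 Embarked + 2 Sex + 2 passthrough columns
```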

### Does pipe.fit() modify the underlying objects (ct, logreg) passed to it or use them as a template?

In [62]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)

In [63]:
pipe.named_steps.logisticregression.coef_

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

In [64]:
logreg.coef_

array([[ 0.18828769, -0.14100295, -0.16593861,  0.66504677, -0.78370063,
         0.11596792,  0.11596792, -0.1262833 ,  0.13845919,  0.07231978,
        -0.12539973,  0.07231978,  0.07231978,  0.13845919,  0.07231978,
         0.10454614, -0.18913104, -0.12539973,  0.10454614,  0.23375375,
        -0.1262833 ,  0.10454614, -0.14100295,  0.07231978,  0.13845919,
         0.23375375, -0.18913104,  0.10454614, -0.18913104,  0.10454614,
        -0.20188362,  0.23375375, -0.14100295, -0.5945696 ,  0.43129302,
         0.11596792,  0.11596792,  0.13845919, -0.12539973, -0.18913104,
         0.10454614,  0.07231978, -0.20188362,  0.13845919, -0.1262833 ,
         0.08778734,  0.01334678]])

**They match: Pipeline fits the estimator object you passed in (here `logreg`), not a copy of it.**
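The same behaviour can be checked directly with an identity test. This is a sketch on synthetic data from `make_classification`; the names `clf` and `demo_pipe` are illustrative. It also shows `clone`, which is what you would use if you wanted an unfitted copy instead.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

X_demo, y_demo = make_classification(n_samples=50, n_features=4, random_state=0)

clf = LogisticRegression()
demo_pipe = make_pipeline(MaxAbsScaler(), clf)
demo_pipe.fit(X_demo, y_demo)

# The step inside the pipeline is the very object we passed in
same_object = demo_pipe.named_steps['logisticregression'] is clf
clf_is_fitted = hasattr(clf, 'coef_')

# clone() returns an unfitted copy with the same parameters
fresh_copy = clone(demo_pipe)
copy_is_fitted = hasattr(fresh_copy.named_steps['logisticregression'], 'coef_')

print(same_object, clf_is_fitted, copy_is_fitted)   # True True False
```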

### Why did we create the ```imp_ohe``` pipeline? Why didn't we add imp_constant to the ColumnTransformer?

In [112]:
# Transformer only pipeline: imputation + ohe
imp_ohe = make_pipeline(imp_constant, ohe)

# New column transformer
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough'
)

In [113]:
ct_suggestion = make_column_transformer(
    (imp_constant, ['Embarked', 'Sex']),
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough'
)

In [115]:
df_tiny = df.head(10).copy()
df_tiny.sample()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S


In [116]:
X_tiny = df_tiny[cols]

**Testing if the suggested column transformer works**

In [117]:
# original
make_column_transformer(
    (imp_ohe, ['Embarked']),
    remainder='drop'
).fit_transform(X_tiny)

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [118]:
# suggested
make_column_transformer(
    (imp_constant, ['Embarked']),
    (ohe, ['Embarked']),
    remainder='drop'
).fit_transform(X_tiny)

array([['S', 0.0, 0.0, 1.0],
       ['C', 1.0, 0.0, 0.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['Q', 0.0, 1.0, 0.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['C', 1.0, 0.0, 0.0]], dtype=object)

In [119]:
# modification of suggestion
make_column_transformer(
    (ohe, ['Embarked']),
    (imp_constant, ['Embarked']),
    remainder='drop'
).fit_transform(X_tiny)

array([[0.0, 0.0, 1.0, 'S'],
       [1.0, 0.0, 0.0, 'C'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 1.0, 0.0, 'Q'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [0.0, 0.0, 1.0, 'S'],
       [1.0, 0.0, 0.0, 'C']], dtype=object)

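**Key:** In a Pipeline, the output of step 1 is the input to step 2. In a ColumnTransformer, each transformer receives the original columns independently and the outputs are stacked side-by-side. That is why the suggested version returns the imputed strings next to the one-hot columns instead of imputing and then encoding.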

### What about Stratified-Kfold?

In [121]:

# Fill value with the word: 'missing'. To treat missing value as 4th category.
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')

# Transformer only pipeline: imputation + ohe
imp_ohe = make_pipeline(imp_constant, ohe)

# New column transformer
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough'
)

pipe = make_pipeline(ct, logreg)

In [122]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

array([0.79888268, 0.8258427 , 0.80337079, 0.78651685, 0.84269663])

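**StratifiedKFold ensures that the target distribution is similar in each fold. Here roughly 38% of passengers survived, so roughly 38% of both the training and the testing portion of every split should be survivors. For classifiers, an integer `cv` already uses StratifiedKFold, which is why the scores below are identical to those above.**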

In [123]:
from sklearn.model_selection import StratifiedKFold
kf = StratifiedKFold(5)
cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')

array([0.79888268, 0.8258427 , 0.80337079, 0.78651685, 0.84269663])

In [129]:
# indices used to split: 80% training, 20% testing
list(kf.split(X, y))[0]

(array([168, 169, 170, 171, 173, 174, 175, 176, 177, 178, 179, 180, 181,
        182, 185, 188, 189, 191, 196, 197, 199, 200, 201, 202, 203, 204,
        205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217,
        218, 219, 220, 221, 222, 223, 224, 225, 226, 227, 228, 229, 230,
        231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 242, 243,
        244, 245, 246, 247, 248, 249, 250, 251, 252, 253, 254, 255, 256,
        257, 258, 259, 260, 261, 262, 263, 264, 265, 266, 267, 268, 269,
        270, 271, 272, 273, 274, 275, 276, 277, 278, 279, 280, 281, 282,
        283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294, 295,
        296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308,
        309, 310, 311, 312, 313, 314, 315, 316, 317, 318, 319, 320, 321,
        322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334,
        335, 336, 337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347,
        348, 349, 350, 351, 352, 353, 354, 355, 356

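**Notice that the row indices are not shuffled: by default, the cv splitters do not shuffle. That is why cross_val_score and GridSearchCV have no random_state parameter: there is no randomness in the splitting.**

If the ordering of the rows is arbitrary, there is no need to shuffle. If the rows are ordered in a meaningful way (for example, sorted by the target), shuffle them. With shuffle=True you can set random_state to make the splits reproducible.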

In [130]:
kf = StratifiedKFold(5, shuffle=True, random_state=seed)
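To see what stratification buys you, here is a minimal sketch on a synthetic target that is sorted by class (60 negatives followed by 40 positives; the data is made up). It compares the share of positives in each testing fold for plain `KFold` and for `StratifiedKFold`, both without shuffling.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Synthetic, sorted target: 60 zeros followed by 40 ones
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.zeros((100, 1))   # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5)
kf_plain = KFold(n_splits=5)

# Share of positives in each testing fold
stratified_rates = [float(y_demo[test].mean()) for _, test in skf.split(X_demo, y_demo)]
plain_rates = [float(y_demo[test].mean()) for _, test in kf_plain.split(X_demo)]

print(stratified_rates)   # [0.4, 0.4, 0.4, 0.4, 0.4]
print(plain_rates)        # [0.0, 0.0, 0.0, 1.0, 1.0]
```

With the unstratified, unshuffled split, some testing folds contain a single class, which is exactly the situation that stratifying or shuffling is meant to avoid.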

### Should you use a validation set when tuning the model's hyperparameters?

The terminology is a little fuzzy: people use "testing set" and "validation set" somewhat interchangeably.

There are two jobs we are trying to do:
- Tuning the model's hyperparameters.
- Estimating its future performance.

If you use the same data for both jobs (tuning with cross-validation and then reporting that same score), the estimate is biased towards that specific data. For a more realistic estimate, and if you have enough data to spare, split the entire dataset into two parts called training and validation (also called testing). Do the cross-validation on the training part and evaluate on the validation part.

If you **do not have enough data** to spare for a validation set, you can use nested cross-validation. The inner loop does the parameter tuning using grid search and the outer loop estimates model performance using cross-validation. To get the most realistic estimate of your model's performance, you need to *add complexity*. There is no workaround.
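Nested cross-validation, as described above, can be sketched by passing a `GridSearchCV` object to `cross_val_score`. The data here is synthetic and the parameter grid is an arbitrary illustration, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: tune C with 5-fold cross-validation
inner_search = GridSearchCV(
    LogisticRegression(solver='liblinear'),
    param_grid={'C': [0.1, 1, 10]},   # illustrative grid
    cv=5,
    scoring='accuracy'
)

# Outer loop: each outer training fold runs its own grid search,
# and the tuned model is scored on the held-out outer fold
outer_scores = cross_val_score(inner_search, X_demo, y_demo, cv=5, scoring='accuracy')
print(outer_scores.mean())
```

The outer scores estimate the performance of the whole procedure (tuning included), because no outer testing fold is ever seen by the grid search that produced the model evaluated on it.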

### What is the difference between FeatureUnion and ColumnTransformer?

In [131]:
SimpleImputer(add_indicator=True).fit_transform(X_tiny[['Age']])

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

**Q) How do we replicate the above without using ```add_indicator```?**

In [134]:
from sklearn.impute import MissingIndicator
indic = MissingIndicator()
indic.fit_transform(X_tiny[['Age']]).astype(int)   # MissingIndicator returns booleans; cast to 0/1

array([[0],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [0],
       [0]])

**How to stack the above next to the Age feature?**

In [135]:
from sklearn.pipeline import make_union
imp_indic = make_union(imp, indic)
imp_indic.fit_transform(X_tiny[['Age']])

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

- FeatureUnion applies multiple transformers to the same input columns and stacks the results side-by-side.

- ColumnTransformer applies different transformers to different input columns and stacks the results side-by-side.

In [136]:
make_column_transformer(
    (imp_indic, ['Age']),
    remainder='drop'
).fit_transform(X_tiny)

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

### How would you add feature selection to our Pipeline?

In [137]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8114619295712762

#### Selection from percentile

In [138]:
from sklearn.feature_selection import SelectPercentile, chi2
# You pass a statistical test (chi2 in this case) and percentile of features
# to keep.
selection = SelectPercentile(chi2, percentile=50)

In [139]:
pipe_selection = make_pipeline(ct, selection, logreg)
cross_val_score(pipe_selection, X, y, cv=5, scoring='accuracy').mean()

0.8193019898311469

**Performance goes up!!!**

#### Selection from model

- Uses the coefficients (`coef_`) or the feature importances (`feature_importances_`) of a fitted model as the scores.

- The threshold can be the mean or median of the scores, optionally multiplied by a scaling factor, for example `'0.5*mean'`.
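A small sketch of how the threshold string affects the number of features kept. It uses synthetic data and a default `LogisticRegression` as the scoring model; both are assumptions made only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

kept = {}
for threshold in ['mean', '0.5*mean', 'median']:
    selector = SelectFromModel(LogisticRegression(max_iter=1000), threshold=threshold)
    selector.fit(X_demo, y_demo)
    # get_support() is a boolean mask over the input features
    kept[threshold] = int(selector.get_support().sum())

print(kept)
```

A lower threshold such as `'0.5*mean'` can only keep the same features or more than `'mean'` does, so it is a simple knob to tune, for example with a grid search.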

In [140]:
logreg_selection = LogisticRegression(solver='liblinear', penalty='l1',  random_state=seed)

from sklearn.feature_selection import SelectFromModel
selection = SelectFromModel(logreg_selection, threshold='mean')

In [141]:
pipe_selection = make_pipeline(ct, selection, logreg)
cross_val_score(pipe_selection, X, y, cv=5, scoring='accuracy').mean()

0.8260121775155358

**Performance goes up!!!**

### Scaling all features in a Pipeline.

**Note**: Don't use StandardScaler when the output of a ColumnTransformer is a sparse matrix. Centering would turn the zeros into non-zero values, so the matrix would become dense and its memory use would blow up (sklearn refuses to center sparse input unless you pass `with_mean=False`).

In [142]:
from sklearn.preprocessing import MaxAbsScaler # preserves sparsity
scaler = MaxAbsScaler()

In [144]:
pipe_scaled = make_pipeline(ct, scaler, logreg)
cross_val_score(pipe_scaled, X, y, scoring='accuracy').mean()

0.8114556525014123

***Scaling did very little here. The liblinear solver is very robust to different feature scales, so scaling is usually not required with it.***
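The sparsity note above can be checked on a tiny hand-made sparse matrix (the values are arbitrary): StandardScaler with default settings rejects sparse input, while MaxAbsScaler scales each column and leaves the zeros in place.

```python
import scipy.sparse as sp
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

X_sparse = sp.csr_matrix([[0.0, 2.0],
                          [4.0, 0.0],
                          [0.0, 8.0]])

# Default StandardScaler tries to center the data, which is not allowed for sparse input
try:
    StandardScaler().fit_transform(X_sparse)
    centering_failed = False
except ValueError:
    centering_failed = True

# MaxAbsScaler divides each column by its maximum absolute value; zeros stay zeros
scaled = MaxAbsScaler().fit_transform(X_sparse)

print(centering_failed)     # True
print(scaled.toarray())     # [[0. 0.25] [1. 0.] [0. 1.]]
print(scaled.nnz)           # 3 stored values, same as the input
```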

#### Should you scale all features or the ones that were originally numerical?

_Ans:_ Try both! Experiment.

### Adding Outlier handling to our Pipeline.

- Try Scaling techniques robust to outliers like RobustScaler.
```python
from sklearn.preprocessing import RobustScaler
```

- Identify and remove them from training data (if removal is necessary at all).

**Sklearn currently has no transformers that remove rows.** Transformers may add columns, remove columns or keep the same number of columns, but the number of rows never changes, so any row removal has to happen outside the Pipeline.
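A minimal sketch of the first option, using a made-up column with one extreme value. RobustScaler centers on the median and scales by the interquartile range, so the outlier does not distort how the ordinary values are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up values with one obvious outlier
fares = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Median is 3 and the interquartile range is 4 - 2 = 2
robust = RobustScaler().fit_transform(fares)
print(robust.ravel())   # [-1.  -0.5  0.   0.5 48.5]

# The outlier is still present: the row count is unchanged
print(robust.shape)     # (5, 1)
```

The outlier is kept (as every transformer keeps all rows), but the remaining values land in a sensible range around zero.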

### How to include custom transformations for feature engineering within a Pipeline?

In [146]:
df_tiny.sample()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


**Engineering ideas:**

1. Age and Fare may be better features if we floor them.

2. Cabin name starts with a letter, which may represent the deck of the ship. Maybe the deck of the ship is a predictive feature.

3. We want to get the sum of SibSp and Parch. Maybe the total number of family members aboard is more predictive than SibSp and Parch separately.

#### Idea 1:

In [147]:
import numpy as np
np.floor(df_tiny[['Age', 'Fare']])

Unnamed: 0,Age,Fare
0,22.0,7.0
1,38.0,71.0
2,26.0,7.0
3,35.0,53.0
4,35.0,8.0
5,,8.0
6,54.0,51.0
7,2.0,21.0
8,27.0,11.0
9,14.0,30.0


**To include this in sklearn, we need a transformer i.e. we need np.floor to behave as a transformer.**

In [148]:
from sklearn.preprocessing import FunctionTransformer

In [149]:
get_floor = FunctionTransformer(np.floor)

get_floor.fit_transform(df_tiny[['Age', 'Fare']])

Unnamed: 0,Age,Fare
0,22.0,7.0
1,38.0,71.0
2,26.0,7.0
3,35.0,53.0
4,35.0,8.0
5,,8.0
6,54.0,51.0
7,2.0,21.0
8,27.0,11.0
9,14.0,30.0


In [150]:
make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    remainder='drop'
).fit_transform(df_tiny)

array([[22.,  7.],
       [38., 71.],
       [26.,  7.],
       [35., 53.],
       [35.,  8.],
       [nan,  8.],
       [54., 51.],
       [ 2., 21.],
       [27., 11.],
       [14., 30.]])

#### Idea 2:

In [152]:
df_tiny[['Cabin']].apply(lambda x: x.str.slice(0, 1))

Unnamed: 0,Cabin
0,
1,C
2,
3,C
4,
5,
6,E
7,
8,
9,


In [154]:
def cabin_first_letter(df):
    """
    The input is expected to be a DataFrame. However, the input could be a NumPy array:
        1. Because you can pass a NumPy array to a column transformer.
        2. If this column transformer becomes the second step of a pipeline, the first
        step will return a NumPy array, so there could be problems.
    
    Therefore, we make sure to convert the input into a DataFrame.
    """
    return pd.DataFrame(df).apply(lambda x: x.str.slice(0, 1))

In [155]:
get_first_letter = FunctionTransformer(cabin_first_letter)

In [156]:
make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    (get_first_letter, ['Cabin']),
    remainder='drop'
).fit_transform(df_tiny)

array([[22.0, 7.0, nan],
       [38.0, 71.0, 'C'],
       [26.0, 7.0, nan],
       [35.0, 53.0, 'C'],
       [35.0, 8.0, nan],
       [nan, 8.0, nan],
       [54.0, 51.0, 'E'],
       [2.0, 21.0, nan],
       [27.0, 11.0, nan],
       [14.0, 30.0, nan]], dtype=object)

When you write functions that will be wrapped in a FunctionTransformer and used in a ColumnTransformer, there are two shape considerations:

- The function does not have to accept 2D input, but it is better if it does. This allows you to pass multiple columns at once.

- ColumnTransformer requires every transformer (including your custom function) to return 2D output.

#### Idea 3:

In [157]:
df_tiny[['SibSp', 'Parch']].sum(axis=1)

0    1
1    1
2    0
3    1
4    0
5    0
6    0
7    4
8    2
9    1
dtype: int64

In [158]:
def sum_cols(df):
    """
    Input might be a dataframe or a numpy array.
    
    Summing over the columns axis returns a 1D object (see previous cell).
    Therefore, we convert at the end the object to 2D which is a requirement
    of a columntransformer function.
    
    reshape(-1, 1): First dimension of the object should be inferred. Second dimension
    of the object should be 1.
    """
    return np.array(df).sum(axis=1).reshape(-1, 1)

In [159]:
sum_cols(df_tiny[['SibSp', 'Parch']])

array([[1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [4],
       [2],
       [1]], dtype=int64)

In [160]:
get_sum = FunctionTransformer(sum_cols)

make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    (get_first_letter, ['Cabin']),
    (get_sum, ['SibSp', 'Parch']),
    remainder='drop'
).fit_transform(df_tiny)

array([[22.0, 7.0, nan, 1],
       [38.0, 71.0, 'C', 1],
       [26.0, 7.0, nan, 0],
       [35.0, 53.0, 'C', 1],
       [35.0, 8.0, nan, 0],
       [nan, 8.0, nan, 0],
       [54.0, 51.0, 'E', 0],
       [2.0, 21.0, nan, 4],
       [27.0, 11.0, nan, 2],
       [14.0, 30.0, nan, 1]], dtype=object)

In [161]:
cols

['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

In [162]:
cols.extend(['Cabin', 'SibSp'])

In [163]:
X, X_new = df[cols], df_new[cols]

In [164]:
# Impute Age and Fare, then floor them
imp_floor = make_pipeline(imp, get_floor)

In [166]:
X['Cabin'].str.slice(0, 1).value_counts(dropna=False)

NaN    687
C       59
B       47
D       33
E       32
A       15
F       13
G        4
T        1
Name: Cabin, dtype: int64

Problems:

- get_first_letter leaves NaN in the column, which needs to be imputed.

- The output is a string, which needs to be encoded as numeric.

- Some categories (G, T) are very rare. This is a problem during cross-validation because a rare category might appear only in the testing fold, where the encoder has never seen it.

In [168]:
# handle_unknown='ignore' encodes categories unseen during fit as all zeros instead of raising an error
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
letter_imp_ohe = make_pipeline(get_first_letter, imp_constant, ohe_ignore)
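The rare-category problem can be reproduced on two tiny made-up folds, where deck `'T'` appears only in the testing fold. This sketch compares the default encoder with one that uses `handle_unknown='ignore'`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative folds: 'T' never appears in the training fold
train_fold = np.array([['C'], ['B'], ['C'], ['E']], dtype=object)
test_fold = np.array([['B'], ['T']], dtype=object)

# Default behaviour (handle_unknown='error'): transform fails on the unseen category
strict = OneHotEncoder().fit(train_fold)
try:
    strict.transform(test_fold)
    strict_failed = False
except ValueError:
    strict_failed = True

# handle_unknown='ignore': the unseen category becomes a row of zeros
lenient = OneHotEncoder(handle_unknown='ignore').fit(train_fold)
encoded = lenient.transform(test_fold).toarray()

print(strict_failed)   # True
print(encoded)         # [[1. 0. 0.] [0. 0. 0.]] for categories B, C, E
```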

In [169]:
# setting up final ct
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_floor, ['Age', 'Fare']),
    (letter_imp_ohe, ['Cabin']),
    (get_sum, ['SibSp', 'Parch']),
    remainder='drop'
)

In [170]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 0, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int64)

In [171]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8271420500910175

---