<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Transfer Learning

_Author:_ Tim Book

### Learning Objectives
*After this lesson, students will be able to:*

1. Define transfer learning
1. Carry out transfer learning with and without pipelines and gridsearching
1. Identify scenarios in which transfer learning can be beneficial

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, LogisticRegression

## Part I: Transfer Learning from PCA
#### aka, _principal components regression (PCR)_
Often we have many columns and too few rows. We've seen that PCA can compress a data frame down to a handful of components that capture most of its variance. Why not use those components as the features in a regression? This is the first and most common variety of transfer learning. It is also the most common thing to do with principal components beyond EDA.
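
As a rough illustration of the idea, here is a minimal sketch of a scale, PCA, logistic regression pipeline tuned with `GridSearchCV`. It runs on synthetic data from `make_classification`, not the Santander data, and the names `X_demo`, `pcr_pipe`, and `pcr_grid` are made up for this example. The grid of component counts is an arbitrary assumption.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a wide data set (assumption: NOT the Santander data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=50, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=42
)

# Step 1: scale, Step 2: decompose into PCs, Step 3: logistic regression
pcr_pipe = Pipeline([
    ("ss", StandardScaler()),
    ("pca", PCA(random_state=42)),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Cross-validate over how many principal components to keep
pcr_grid = GridSearchCV(
    pcr_pipe, param_grid={"pca__n_components": [2, 5, 10, 20]}, cv=5
)
pcr_grid.fit(X_tr, y_tr)

print(pcr_grid.best_params_)
print(pcr_grid.best_score_)        # mean cross-validated accuracy
print(pcr_grid.score(X_te, y_te))  # accuracy on the held-out test set
```

Because the scaler and PCA live inside the pipeline, each cross-validation fold refits them on its own training portion only.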

In [None]:
# Santander bank data. All columns are anonymized - we have no idea what we're
# looking at! The target is whether or not a transaction occurs.


In [None]:
# Let's separate into X and y


In [None]:
# How is our class imbalance?

In [None]:
# Train test split


In [None]:
# Logistic regression - overfit?


In [None]:
# Let's make a pipeline that feeds PCA results into a logistic regression.
# In order to do this, we need to:
# Step 1: Scale data
# Step 2: Decompose data into PCs
# Step 3: Perform logistic regression

In [None]:
# What are the parameters?

In [None]:
# Let's make a gridsearcher and fit!

In [None]:
# Best params

In [None]:
# Best score

In [None]:
# How about the test score?

**(THREAD)**: Why did I do a train-test-split AND cross-validation here?

In [None]:
# Plot the tuning parameter curve

## Part II: Transfer Learning with Clusters
Oftentimes, we'd actually like to _add_ dimensionality to our data to give our model more information. In this example, we will use clustering to make effective use of latitude/longitude data.

Although it's not shown in this example, clustering in transfer learning is more often used for dimensionality reduction, much like PCA: the cluster label becomes a single x-variable that replaces several other variables.

In [None]:
# Some data "cleaning"
# NOTE: This shouldn't be considered best, or even good, practice.
# This is merely to get the data into a workable shape so we don't
# spend our lesson cleaning missing and categorical columns.
mel = pd.read_csv("data/melbourne.csv")
keepvars = [
    "Price", "Rooms", "Bedroom2", "Bathroom",
    "Car", "Landsize", "Lattitude", "Longtitude"
]
mel = mel[keepvars].dropna()
mel.columns = ["price", "rooms", "bed", "bath", "car", "land", "lat", "long"]
mel = mel.loc[mel["price"] < np.quantile(mel["price"], 0.99), :]
mel.head()

In [None]:
# Where are the highly priced houses?
mel.plot(kind="scatter", x="long", y="lat", c="price",
         cmap="RdYlGn", figsize=(14, 10), s=2);

In [None]:
# Histogram of house prices - skew?

In [None]:
# Check histogram of log-price

In [None]:
# Let's winnow our data down to only these quantitative variables.

# House prices are skewed, so let's regress on log-price.
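
One way to read the comment above is: fit the linear regression on the log of price, then exponentiate predictions to get back to the original scale. The sketch below uses made-up data; the `rooms` and `price` arrays are hypothetical and are not the Melbourne data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
rooms = rng.integers(1, 6, size=500)
# Hypothetical right-skewed prices: multiplicative noise around a trend
price = 200_000 * np.exp(0.25 * rooms + rng.normal(0, 0.3, size=500))

X_rooms = rooms.reshape(-1, 1)
log_model = LinearRegression().fit(X_rooms, np.log(price))

# Predictions come back on the log scale; exponentiate to return to prices
pred_price = np.exp(log_model.predict(X_rooms))

# On the log scale, the coefficient is an approximate proportional change
# in price per extra room
print(log_model.coef_)
```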

In [None]:
# Let's do a train-test-split
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), random_state=42, test_size=0.5
)

In [None]:
# Carry out a linear regression

In [None]:
# How'd it do?

### The regular log-model performed kinda badly...
But as we saw in our map, home prices are not distributed uniformly about Melbourne. What if we clustered by lat/long, and used those clusters in our model?

In [None]:
# Let's scoop lat/long up in a matrix so we can use them easily

In [None]:
# Let's cluster our observations by lat/long

In [None]:
# What do these clusters look like visually?
plt.figure(figsize=(10, 10))
plt.scatter(mel.long, mel.lat, c=km.labels_, s=1, cmap="tab20");

In [None]:
# Neat! Now let's append these clusters back onto X

In [None]:
# Train-test-split again
X_train, X_test, y_train, y_test = train_test_split(
    X.copy(), y.copy(), random_state=42, test_size=0.5
)

In [None]:
# How'd we do now?

### Issue #1: How do we tune $k$?
Two choices:
1. Bottle all the above into a function and iterate, finding $k$ that gives the best testing error.
2. Force this into a gridsearchable class.
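    * We can't just drop `KMeans` into a pipeline and use `GridSearchCV`, because clusterers don't act as the kind of _transformer_ we need: `KMeans.transform()` returns distances to each cluster center rather than cluster labels, and many other clusterers have no `transform()` at all.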
    * In order to fix this, we'll need some OOP skills outside the scope of today's lesson. In short, you can subclass a scikit-learn mixin and create your own class that acts like a scikit-learn transformer. I wrote a blog post about how to do this [here](https://towardsdatascience.com/building-a-custom-model-in-scikit-learn-b0da965a1299)!
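
For the second choice, a minimal sketch of such a class might look like the following. `ClusterDummies` is a hypothetical name invented here, not a scikit-learn class and not necessarily the approach in the linked post. It assumes the location columns are given by position (`cols`) and that the input is all numeric. The toy data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ClusterDummies(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: replaces the columns in `cols` with
    one-hot KMeans cluster memberships."""

    def __init__(self, n_clusters=8, cols=(0, 1), random_state=None):
        self.n_clusters = n_clusters
        self.cols = cols
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        self.km_ = KMeans(n_clusters=self.n_clusters, n_init=10,
                          random_state=self.random_state)
        self.km_.fit(X[:, list(self.cols)])
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        labels = self.km_.predict(X[:, list(self.cols)])
        # Always n_clusters dummy columns, even if a cluster is empty here
        dummies = np.eye(self.n_clusters)[labels]
        rest = np.delete(X, list(self.cols), axis=1)
        return np.hstack([rest, dummies])


# Made-up data: columns are long, lat, rooms
rng = np.random.default_rng(1)
centers = np.array([[0, 0], [3, 3], [0, 3], [3, 0]], dtype=float)
hood = rng.integers(0, 4, size=600)
coords = centers[hood] + rng.normal(0, 0.3, size=(600, 2))
rooms = rng.integers(1, 6, size=(600, 1))
X_toy = np.hstack([coords, rooms])
y_toy = (np.array([13.0, 13.0, 12.0, 12.0])[hood]
         + 0.2 * rooms[:, 0] + rng.normal(0, 0.1, size=600))

pipe = Pipeline([
    ("clusters", ClusterDummies(random_state=42)),
    ("lr", LinearRegression()),
])
grid = GridSearchCV(pipe, {"clusters__n_clusters": [2, 4, 8]}, cv=3)
grid.fit(X_toy, y_toy)
print(grid.best_params_, grid.best_score_)
```

Because the clusterer is refit inside each cross-validation fold, $k$ is tuned without the clusters ever seeing the validation rows.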

In [None]:
# All this wrapped up!
def transfer_tune(X, y, k):
    # Cluster on location only (relies on the global `mel` data frame)
    location_data = mel[["long", "lat"]]
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(location_data)
    # Work on a copy so repeated calls don't modify the caller's X
    X = X.copy()
    X["cluster"] = km.predict(location_data)
    X_dummy = pd.get_dummies(columns=["cluster"], data=X)
    X_train, X_test, y_train, y_test = train_test_split(
        X_dummy, y, random_state=42, test_size=0.5
    )
    model = LinearRegression()
    model.fit(X_train, y_train)
    r2 = model.score(X_test, y_test)
    print(f"{k} : {r2}")
    return r2

In [None]:
for k in range(2, 103, 5):
    transfer_tune(X, y, k)

### Issue #2: The Train-Test-Split Dilemma
If you have many clusters, it is possible that your test data will have some clusters not represented. Using `pd.get_dummies()` won't help you here - it won't make columns for unrepresented categories.

**Example:**
Suppose you do 10-means clustering on your training data. Your training data now has labels 0 through 9 (scikit-learn numbers clusters from zero). There's no guarantee that every cluster will be represented in your test data. Maybe no testing data points are put into cluster 5. When you use `pd.get_dummies()` on your testing data, it won't make a `cluster_5` column, and scikit-learn will complain about dimension mismatches (since you have one fewer column in `X_test` now).

In a production setting, you might prefer to use the `OneHotEncoder` scikit-learn class:

In [None]:
D_train = pd.DataFrame({"cluster": ['A', 'B', 'C', 'D', 'E']})
D_test = pd.DataFrame({"cluster": ['A', 'B', 'C', 'E']})

In [None]:
pd.get_dummies(columns=["cluster"], data=D_train)

In [None]:
pd.get_dummies(columns=["cluster"], data=D_test)

In [None]:
from sklearn.preprocessing import OneHotEncoder

In [None]:
km.labels_

In [None]:
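# `sparse_output` replaced `sparse` in scikit-learn 1.2 (use `sparse=False` on older versions)
oh = OneHotEncoder(categories="auto", sparse_output=False)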
oh.fit(km.labels_.reshape(-1, 1))

In [None]:
dummy_matrix = oh.transform(km.labels_.reshape(-1, 1))

In [None]:
dummy_matrix

In [None]:
dummy_matrix.shape

In [None]:
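# `sparse_output` replaced `sparse` in scikit-learn 1.2 (use `sparse=False` on older versions)
ohe = OneHotEncoder(categories='auto', sparse_output=False)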
ohe.fit(D_train[['cluster']])
ohe.transform(D_train)

In [None]:
ohe.transform(D_test)
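
The missing-column problem can be handled in at least two ways, sketched below on the same kind of toy labels. The names `train_demo`, `test_demo`, and `enc` are made up for this example. The encoder is left at its default sparse output and converted with `.toarray()`, which sidesteps the `sparse`/`sparse_output` naming difference between scikit-learn versions. `handle_unknown="ignore"` additionally covers the reverse case, where the test data contains a label never seen in training.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_demo = pd.DataFrame({"cluster": ["A", "B", "C", "D", "E"]})
test_demo = pd.DataFrame({"cluster": ["A", "B", "C", "E"]})  # no "D"

# Option 1: OneHotEncoder remembers the categories it saw in fit()
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train_demo[["cluster"]])
# Default output is a SciPy sparse matrix; .toarray() makes it dense
test_ohe = enc.transform(test_demo[["cluster"]]).toarray()
print(test_ohe.shape)  # one column per training category

# Option 2: stay in pandas and reindex to the training columns
train_dum = pd.get_dummies(train_demo, columns=["cluster"])
test_dum = pd.get_dummies(test_demo, columns=["cluster"]).reindex(
    columns=train_dum.columns, fill_value=0
)
print(test_dum.columns.tolist())
```

Either way, the test matrix ends up with a column for `D` that is all zeros, so its shape matches what the model was trained on.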