# Data Pipeline Lab

_Authors: Richard Harris (CHI)_

---

In this lab, we will be using the famous Titanic survivors dataset. While we will be modeling this data shortly, our first goal will be to create two things:

1. Code that transforms our training and test set in the same manner
2. Code that takes predictions and stores them to file

We have taken this data from the well-known [Kaggle Challenge](https://www.kaggle.com/c/titanic).

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# we need to import the template classes to create a class that works like an sklearn class
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline

In [2]:
test = pd.read_csv('../assets/datasets/test.csv')

In [3]:
df = pd.read_csv('../assets/datasets/train.csv')
df.head()

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


### Write a function named `age_extractor` that:
- takes the dataframe as an input
- extracts the `Age` column from the dataframe
- fills missing values with the age `20`
- returns a reshaped NumPy array, like this:

`series.values.reshape(-1, 1)`

In [4]:
def age_extractor(dataframe):
    # fillna returns a new Series, so the caller's dataframe is not modified
    age = dataframe['Age'].fillna(20)
    return age.values.reshape(-1, 1)
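
As a rough illustration of why the reshape is needed: scikit-learn estimators expect a 2-D feature matrix of shape `(n_samples, n_features)`, while a single pandas column yields a 1-D array. The small frame below is made up for the example and is not the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for the 'Age' column
toy = pd.DataFrame({'Age': [22.0, np.nan, 35.0]})

# fillna returns a new Series, so the original frame is left untouched
ages = toy['Age'].fillna(20)

flat = ages.values             # 1-D: one value per row
column = flat.reshape(-1, 1)   # 2-D: one row per sample, one feature column

print(flat.shape)    # (3,)
print(column.shape)  # (3, 1)
```

The `-1` asks NumPy to infer the number of rows, so the same call works for the training and test sets even though they differ in length.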

### Take the "Survived" column and assign it to the variable 'y' to use as an outcome

In [5]:
y = df['Survived']

### Fit a LogisticRegression on y _using_ the output of your `age_extractor` function

In [6]:
logistic_regression = LogisticRegression()
logistic_regression.fit(age_extractor(df), y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Print out predictions for your model. 

You should be able to call `age_extractor()` on the test data without too much trouble.

In [7]:
logistic_regression.predict(age_extractor(test))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0,

### Recreate the age_extractor function as a class with the methods transform and fit

The following template will help:

```Python
class AgeExtractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, *args):
        # The transform method needs to return some type of data that sklearn can understand
        return X

    def fit(self, X, *args):
        # Fit must return self to work within sklearn pipelines
        return self
```
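
To make the template's comments concrete, here is a minimal sketch of what the two base classes contribute. `ColumnPicker` is a hypothetical class invented for this illustration, not part of the lab. `TransformerMixin` supplies `fit_transform()` by chaining `fit()` and `transform()`, which is why `fit()` has to return `self`. `BaseEstimator` supplies `get_params()`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnPicker(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that returns one column as a 2-D array."""

    def __init__(self, column='Age'):
        self.column = column

    def fit(self, X, y=None):
        # Nothing is learned here; returning self lets calls be chained
        return self

    def transform(self, X):
        return X[self.column].values.reshape(-1, 1)

toy = pd.DataFrame({'Age': [22.0, 38.0, 26.0]})  # made-up rows
picker = ColumnPicker()

# fit_transform is inherited from TransformerMixin
out = picker.fit_transform(toy)
print(out.shape)            # (3, 1)
print(picker.get_params())  # {'column': 'Age'}
```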

In [8]:
class AgeExtractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, *args):
        # fillna returns a new Series, so the input dataframe is not modified
        age = X['Age'].fillna(20)
        return age.values.reshape(-1, 1)

    def fit(self, X, *args):
        return self

### Test AgeExtractor using the transform method on our data

In [9]:
age = AgeExtractor()
age.transform(df)

array([[ 22.  ],
       [ 38.  ],
       [ 26.  ],
       [ 35.  ],
       [ 35.  ],
       [ 20.  ],
       [ 54.  ],
       [  2.  ],
       [ 27.  ],
       [ 14.  ],
       [  4.  ],
       [ 58.  ],
       [ 20.  ],
       [ 39.  ],
       [ 14.  ],
       [ 55.  ],
       [  2.  ],
       [ 20.  ],
       [ 31.  ],
       [ 20.  ],
       [ 35.  ],
       [ 34.  ],
       [ 15.  ],
       [ 28.  ],
       [  8.  ],
       [ 38.  ],
       [ 20.  ],
       [ 19.  ],
       [ 20.  ],
       [ 20.  ],
       [ 40.  ],
       [ 20.  ],
       [ 20.  ],
       [ 66.  ],
       [ 28.  ],
       [ 42.  ],
       [ 20.  ],
       [ 21.  ],
       [ 18.  ],
       [ 14.  ],
       [ 40.  ],
       [ 27.  ],
       [ 20.  ],
       [  3.  ],
       [ 19.  ],
       [ 20.  ],
       [ 20.  ],
       [ 20.  ],
       [ 20.  ],
       [ 18.  ],
       [  7.  ],
       [ 21.  ],
       [ 49.  ],
       [ 29.  ],
       [ 65.  ],
       [ 20.  ],
       [ 21.  ],
       [ 28.5 ],
       [  5.  

### Refit logistic regression using AgeExtractor to transform the data

In [10]:
logistic_regression = LogisticRegression()
logistic_regression.fit(age.transform(df), y)
logistic_regression.predict(age.transform(test))

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0,

### Write a function named `output_predictions` that takes an array of predictions and a name of a csv file. 

This function should take your predictions, load them into a pandas Series, name the Series 'Predictions', and save it to a CSV file with the given name.

In [11]:
def output_predictions(predicts, filename):
    csv_series = pd.Series(predicts, name='Predictions')
    csv_series.to_csv(filename, header=True)

### Run output_predictions on the prediction from logistic_regression

In [12]:
output_predictions(logistic_regression.predict(age_extractor(test)), 'test_predictions.csv')
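
One way to check that the file was written as intended is to read it back. The sketch below repeats the function so that it runs on its own, writes made-up predictions to a temporary directory, and reloads them. The file name `check.csv` and the values are assumptions for the illustration only.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def output_predictions(predicts, filename):
    # Same logic as the lab function, repeated so the sketch is self-contained
    pd.Series(predicts, name='Predictions').to_csv(filename, header=True)

fake_predictions = np.array([0, 1, 1, 0])  # made-up values for illustration

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'check.csv')
    output_predictions(fake_predictions, path)
    # index_col=0 restores the row index that to_csv wrote as the first column
    restored = pd.read_csv(path, index_col=0)

print(restored['Predictions'].tolist())  # [0, 1, 1, 0]
```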

We're going to try out what happens when we have two columns we want to model. 

### Write a function named `gender_extractor`. 

It should take in the dataframe, extract the gender (`Sex`) column, turn it into a dummy variable, fill any missing values with `0`, and return the reshaped array.

In [13]:
def gender_extractor(dataframe):
    sex = dataframe['Sex']
    sex = sex.apply(lambda x: 1 if x=='male' else 0)
    sex.fillna(0, inplace=True)
    return sex.values.reshape(-1, 1)

gender_extractor(df)[0:5]

array([[1],
       [0],
       [0],
       [0],
       [1]])
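
As an alternative sketch, the same dummy variable can be built with a vectorized comparison instead of `apply`. The rows below are made up, including one missing value to show how it is handled.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', np.nan, 'male']})  # made-up rows

# Comparing the whole column at once yields booleans; NaN == 'male' is False,
# so missing values end up as 0 without a separate fillna step
is_male = (toy['Sex'] == 'male').astype(int)

print(is_male.tolist())  # [1, 0, 0, 1]
```

The same reasoning applies to the `apply` version: a missing value is not equal to `'male'`, so it is already mapped to `0` before `fillna` runs.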

### Recreate the gender_extractor function as a class with the methods transform and fit

The following template will help:

```Python
class GenderExtractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, *args):
        # The transform method needs to return some type of data that sklearn can understand
        return X

    def fit(self, X, *args):
        # Fit must return self to work within sklearn pipelines
        return self
```

In [14]:
class GenderExtractor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def transform(self, X, *args):
        sex = X['Sex']
        sex = sex.apply(lambda x: 1 if x=='male' else 0)
        sex.fillna(0, inplace=True)
        return sex.values.reshape(-1, 1)

    def fit(self, X, *args):
        return self

### Test GenderExtractor using the transform method on our data

In [15]:
gender = GenderExtractor()
gender.transform(df)

array([[1],
       [0],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [1],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [1],
       [0],
    

Next, we want to join these two columns together. 

### Write a function named `join_arrays()` that takes in a tuple of arrays and returns one array joined together

This would use the concatenate feature of Numpy like this:

`new_array = np.concatenate((array1, array2), axis=1)`

[np.concatenate](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html)

In [16]:
def join_arrays(arrays):
    return np.concatenate(arrays, axis=1)

join_arrays((age_extractor(df), gender_extractor(df)))

array([[ 22.,   1.],
       [ 38.,   0.],
       [ 26.,   0.],
       ..., 
       [ 20.,   0.],
       [ 26.,   1.],
       [ 32.,   1.]])
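
A small sketch of what the `axis` argument does may help here. The arrays are made-up stand-ins for the extractor outputs. It also shows why the integer gender column appears as floats in the joined result: NumPy promotes both inputs to a common dtype.

```python
import numpy as np

ages = np.array([[22.0], [38.0], [26.0]])   # shape (3, 1), float
genders = np.array([[1], [0], [0]])         # shape (3, 1), int

side_by_side = np.concatenate((ages, genders), axis=1)  # columns appended
stacked = np.concatenate((ages, genders), axis=0)       # rows appended

print(side_by_side.shape)  # (3, 2)
print(stacked.shape)       # (6, 1)
print(side_by_side.dtype)  # float64

# 1-D inputs have no second axis, so axis=1 is rejected
try:
    np.concatenate((ages.ravel(), genders.ravel()), axis=1)
except ValueError as err:  # NumPy's AxisError subclasses ValueError
    print('axis=1 fails on 1-D arrays:', type(err).__name__)
```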

We can achieve the same effect using sklearn.

### Use the FeatureUnion class from sklearn to take instances of AgeExtractor and GenderExtractor and combine their output into one array
- You might create a variable called combine and store the FeatureUnion instance in it

In [17]:
combine = FeatureUnion([('age', age), ('gender', gender)])

### Test the FeatureUnion instance using .transform() on our data
- Make sure that you get a 2 dimensional numpy array

In [18]:
combine.transform(df)

array([[ 22.,   1.],
       [ 38.,   0.],
       [ 26.,   0.],
       ..., 
       [ 20.,   0.],
       [ 26.,   1.],
       [ 32.,   1.]])
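
The sketch below checks, on a made-up three-row frame, that a `FeatureUnion` gives the same result as concatenating the transformer outputs by hand. The two classes are compact re-statements of the lab's extractors, written so the block runs on its own. They are an assumption about equivalent behavior, not the lab's exact code.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion

class AgeExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X['Age'].fillna(20).values.reshape(-1, 1)

class GenderExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return (X['Sex'] == 'male').astype(int).values.reshape(-1, 1)

toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                    'Sex': ['male', 'female', 'female']})  # made-up rows

union = FeatureUnion([('age', AgeExtractor()), ('gender', GenderExtractor())])
from_union = union.fit_transform(toy)

by_hand = np.concatenate((AgeExtractor().transform(toy),
                          GenderExtractor().transform(toy)), axis=1)

print(from_union.shape)                     # (3, 2)
print(np.array_equal(from_union, by_hand))  # True
```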

Finally, we want one object that chains all of these steps together.

### Create a sklearn pipeline called lr_pipe. 
The pipeline should:
1. Have a feature union step
    - The FeatureUnion instance should union the AgeExtractor class and GenderExtractor class
    - This gives us a 2 dimensional numpy array
2. Run logistic_regression

We will be able to run .fit() and .predict() on the entire pipeline

In [19]:
lr_pipe = Pipeline([
    ('features', FeatureUnion([
        ('age', age), 
        ('gender', gender)
    ])),
    ('logreg', logistic_regression)])

### Fit your logistic regression using lr_pipe

In [20]:
lr_pipe.fit(df, y)

Pipeline(steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('age', AgeExtractor()), ('gender', GenderExtractor())],
       transformer_weights=None)), ('logreg', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

### Feed the data from `test.csv` through lr_pipe.predict() and predict values using your newly fit model.

In [21]:
lr_pipe.predict(test)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
       0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       0, 0,

### Feed those predictions into the `output_predictions()` function

In [22]:
output_predictions(lr_pipe.predict(test), 'new_predictions.csv')

### Bonus

1. Modify your code further. Iterate over some additional features and decide how you want to transform them. Create new functions to reproducibly transform individual columns, and add them to `preprocessor()`. Refit the logistic model as required.
2. Attempt this with a kNN model. You will need to standardize your features. Just as with LogisticRegression, we can fit our StandardScaler to one set of data and then transform other data according to that standard, like this:

```
from sklearn.preprocessing import StandardScaler

standard = StandardScaler()
# StandardScaler expects 2-D input, so select the column with double brackets
standard.fit(train_data[['column']])

standardized_train_data = standard.transform(train_data[['column']])
standardized_test_data = standard.transform(test_data[['column']])
```
```
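
One possible way to combine the scaler with a kNN model is to place both in a `Pipeline`, so the scaler is fit on the training data only and then reused on the test data. This is a sketch on made-up numbers, not the lab's solution. The step names and the choice of `n_neighbors=1` are assumptions made to keep the toy case easy to follow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numbers: one feature on a large scale, one on a small scale
X_train = np.array([[10.0, 0.1], [20.0, 0.2], [30.0, 0.9], [40.0, 0.8]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[15.0, 0.15], [35.0, 0.85]])

knn_pipe = Pipeline([
    ('scale', StandardScaler()),                   # statistics come from X_train only
    ('knn', KNeighborsClassifier(n_neighbors=1)),  # single neighbor for the toy case
])

knn_pipe.fit(X_train, y_train)
predictions = knn_pipe.predict(X_test)

print(knn_pipe.named_steps['scale'].mean_)  # per-feature means learned from X_train
print(predictions)                          # [0 1]
```

Because kNN relies on distances, an unscaled feature with a large range would dominate the neighbor search, which is the reason for standardizing first.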

Your deliverables are:

1. At least one additional column added to `preprocessor()`
2. A `preprocessor()` function that standardizes at least one input using `StandardScaler` as written above 
3. Apply the standardized feature to a kNN model