# `sklearn` Basics  
#### The model building process with `sklearn`  

---------------------  

### Assignment Contents
- [The Train/Test Split](#The-Train/Test-Split)
- [Model Evaluation](#Model-Evaluation)
- [The Model Fitting Process](#The-Model-Fitting-Process)

#### EXPECTED TIME 1.5 HRS

### Overview
This assignment and the next will review, demonstrate, and test the model building process using `sklearn`. The fundamental process tested in this assignment is the creation of a train/test split. The assignment also reviews classification metrics and the model fitting functions.

### Activities in this Assignment
- train_test_split
- Reviewing classification metrics
- Review model fitting functions

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
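import warnings
# Suppress library warnings for the rest of the notebook.
# (Inside a `with warnings.catch_warnings():` block the filter is reset on exit, so it had no effect.)
warnings.simplefilter("ignore")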
%matplotlib inline

### Reading in office Data
The below cell reads in, cleans and parses the `office` dataset which we have seen throughout the course. 

In [2]:
data_path = "../resource/asnlib/publicdata/office_supply.csv"
df = pd.read_csv(data_path)

# Rename Columns
df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]

# Pull out target and explanatory variables
X = df.drop('campaign_period_sales', axis = 'columns')
y = df['campaign_period_sales']


# Function for preprocessing data
def office_preprocess(X,y):
    # Hard-code lists for dropping and to_bool
    # Dropped variables include dates and features with many missing values
    to_drop = ['date_of_last_transaction', 'date_of_first_purchase',
               'customer_number', 'language',
               'last_transaction_channel', 'number_of_employees']
    to_bool = ['desk', 'executive_chair', 'standard_chair',
               'monitor', 'printer','computer', 'insurance',
               'toner', 'office_supplies']
    # Hard-code values for notice, auto, and prem
    notice = "NOTICE"
    auto = "AUTO RENEW"
    prem = "Premier"
    
    # Function to convert and fill "Y/N" features
    def convert_fill_bool(val):
        if val == 'Y': return True
        else: return False
    
    # Function to encode the service as "premium" : true or false
    def encode_service(val):
        if val == prem: return True
        else: return False
    
    # Function to encode the repurchase feature into two columns: "notice" true/false and "auto_renew" true/false
    # "payment" plan implied by "false" in "notice" and "auto_renew" columns
    def encode_repurchase(series):
        
        def notice_encode(val):
            if val == notice: return True
            else: return False
        
        def auto_renew_encode(val):
            if val == auto: return True
            else: return False

        ser_notice = series.apply(notice_encode)
        ser_notice.name = "repurchase_notice"
        ser_auto = series.apply(auto_renew_encode)
        ser_auto.name = "repurchase_auto"

        return pd.concat([ser_notice, ser_auto], axis = 'columns')
    
    # Function to transform campaign_period_sales to a float
    def transform_target(raw):
        # make sure the value is initially cast as a string
        raw = str(raw)

        # determine if negative or not
        if raw.count("(") > 0: sign = -1
        else: sign = 1

        # remove all spaces, commas, dollar signs, and parentheses
        for to_rem in [" ",",","$", "(",")"]:
            raw = raw.replace(to_rem,"")
        return sign *float(raw)

    y_trans = y.apply(transform_target)
    
    X_trans = X.drop(to_drop, axis = 'columns')
    
    for col in to_bool:
        X_trans[col] = X_trans[col].apply(convert_fill_bool)
        
    X_trans['premier_service'] = X_trans['service_level'].apply(encode_service)
    X_trans.drop('service_level', axis = 'columns', inplace = True)
    
    repurch = encode_repurchase(X_trans['repurchase_method'])
    X_trans = pd.concat([X_trans.drop('repurchase_method', axis = 'columns'), repurch], axis = 'columns')
    
    return X_trans, y_trans

X, y = office_preprocess(X,y)

df = pd.concat([y,X],axis = 'columns')

df.head()

Unnamed: 0,campaign_period_sales,number_of_transactions,do_not_direct_mail_solicit,do_not_email,do_not_telemarket,email_available,desk,executive_chair,standard_chair,monitor,printer,computer,insurance,toner,office_supplies,premier_service,repurchase_notice,repurchase_auto
0,107.16,20,False,False,False,False,False,False,False,False,False,False,False,False,True,False,True,False
1,110.66,2,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False
2,4697.62,12,False,True,False,True,False,True,False,False,False,False,False,True,True,True,True,False
3,103.08,9,False,False,False,True,False,False,False,False,False,False,False,True,True,True,False,True
4,-566.5,1,True,True,True,True,False,False,False,False,False,False,False,False,False,True,True,False


`sklearn` requires that all data, both target and predictor variables, be numeric. In fact, one of the first steps most `sklearn` estimators perform before fitting is converting the inputs to floats.  

At this point, all of the data in our `DataFrame` may be treated as numeric; booleans are treated as 1 for `True` and 0 for `False`.
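
As a small illustration of the boolean-to-number conversion described above, the sketch below casts a toy `DataFrame` to floats. The column names are borrowed from the `office` data, but the values are made up for the example.

```python
import pandas as pd

# Toy frame with made-up values; only the column names come from the office data
flags = pd.DataFrame({"desk": [True, False, True],
                      "toner": [False, False, True]})

# Casting to float maps True -> 1.0 and False -> 0.0
as_float = flags.astype(float)
print(as_float)

# Because booleans behave as 0/1, summing a column counts the True values
n_desk = int(flags["desk"].sum())
print("Rows with a desk purchase:", n_desk)
```

This is why no extra encoding step is needed for the `True`/`False` columns before they are passed to a model.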

### The Train/Test Split

As was demonstrated in lecture 11-4, one of the important parts of evaluating models is the creation of a train/test split.  

Models do a good job of "fitting" to the data they know about; creating a train/test split and/or further "holdout" sets allows us to evaluate models on unseen (out-of-sample) data, offering an idea of how the model will perform on new data. 
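
To make the in-sample versus out-of-sample gap concrete, here is a minimal sketch on synthetic data. The data, the variable names, and the choice of `DecisionTreeRegressor` (a model that can memorize its training data) are all assumptions made for illustration; none of them come from the `office` dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy data (hypothetical): a sine curve plus random noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# An unrestricted tree can fit the training rows almost perfectly
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)

train_r2 = tree.score(X_tr, y_tr)  # R^2 on data the model has seen
test_r2 = tree.score(X_te, y_te)   # R^2 on held-out data
print("Train R^2:", round(train_r2, 3))
print("Test R^2: ", round(test_r2, 3))
```

The training score alone would suggest a far better model than the held-out score does, which is the reason for holding data out in the first place.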

`sklearn` provides the `train_test_split` function in its `model_selection` module. While `train_test_split` has many optional features, at its core the function splits data into two sets, with a random $p\%$ of the data in the training set and the remaining $(100-p)\%$ of the data in the testing set.  

The argument `test_size` allows you to specify the proportion of data you wish to be in the test set. For example, `test_size = .3` would give a training set of $70\%$ of the data and a test set of $30\%$ of the data.  

In [3]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# specify that 30% of the data should be held out as the test set
train, test = train_test_split(df, test_size = .3)

print("Training data shape:", train.shape)
print("Testing data shape:", test.shape)

print("\nFirst few rows/cols of training data:")
print(train.iloc[:5,:2])

print("\nFirst few rows/cols of testing data:")
print(test.iloc[:5,:2])

Training data shape: (32484, 18)
Testing data shape: (13923, 18)

First few rows/cols of training data:
       campaign_period_sales  number_of_transactions
16152                 149.96                      18
40186                 215.37                      10
10194                 939.37                       3
22659                 356.91                      15
45334                2757.83                       8

First few rows/cols of testing data:
       campaign_period_sales  number_of_transactions
9788                  216.39                      19
44482                 280.00                      23
42524                 410.85                       9
12109                 184.23                      20
34773                2279.33                       3


Notice how our target variable `campaign_period_sales` remains attached to the `train` and `test` datasets.  

Below they are separated out.

In [4]:
# Taking all rows and from the 2nd column on for the X matrix
X_train = train.iloc[:,1:]
# Taking just the first column for the y-target
y_train = train.iloc[:,0]

# Doing the same for the test set
X_test = test.iloc[:,1:]
y_test = test.iloc[:,0]

print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

X_test shape: (13923, 17)
y_test shape: (13923,)


It is also possible to separate the X and y before passing to the `train_test_split` function, as shown below:

In [5]:
X = df.drop("campaign_period_sales", axis = 'columns')
y = df.loc[:, 'campaign_period_sales']
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .3)

print("X_train shape:", X_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (32484, 17)
y_test shape: (13923,)


**The below cell provides data for Questions 1/2**

In [6]:
data_path = "../resource/asnlib/publicdata/office_supply.csv"
df = pd.read_csv(data_path)
df.columns = [col.strip().replace(' ', '_').lower() for col in df.columns]
X = df.drop('campaign_period_sales', axis = 'columns')
y = df['campaign_period_sales']

X, y = office_preprocess(X,y)

df = pd.concat([y,X],axis = 'columns')

#### Question 1

Given the technique of your choosing, create a train-test split with $22\%$ of the data in the test set.

This means that the y_test data should have 10,210 rows of `'campaign_period_sales'` data; the X_test should have 10,210 rows and all other 17 columns.  

The X_train and y_train should have the same **_columns_** as the `test` data. But the `train` data should include the other 36,197 rows. 
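
The row counts quoted above can be reproduced with a little arithmetic. The sketch below assumes that `train_test_split` rounds the test-set size up to the next whole row when `test_size` is a fraction; that assumption matches the shapes printed earlier in this notebook. The helper `expected_split` is hypothetical and is not part of `sklearn`.

```python
import math

n_rows = 46407  # 36,197 + 10,210 rows, per the counts above

def expected_split(n_samples, test_size):
    """Hypothetical helper: expected (train, test) row counts,
    assuming the test count is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(expected_split(n_rows, 0.22))
print(expected_split(n_rows, 0.30))
```

With `test_size = .22`, 22% of 46,407 is 10,209.54, which rounds up to 10,210 test rows and leaves 36,197 training rows.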

In [7]:
### GRADED

### Assign data with appropriate shapes, but randomly selected values to `X_train, X_test, y_train, and y_test`  

### The train_test_split may be created from either `X` and `y` or `df` depending on your preference
### NOTE: the `X` in the variable names is capitalized.

### YOUR ANSWER BELOW
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .22)

print("X_train shape:", X_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (36197, 17)
y_test shape: (10210,)


In [8]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Performing the train/test split is a critical step in building any model, so it is tested again below.

#### Question 2

Given the technique of your choosing, create a train-test split with $15\%$ of the data in the test set.

This means that the y_test data should have 6,962 rows of `'campaign_period_sales'` data; the X_test should have 6,962 rows and all other 17 columns.  

Column wise, the X_train and y_train should be identical to the `test` data. But they should include the other 39,445 rows. 

In [9]:
### GRADED

### Assign data with appropriate shapes, but randomly selected values to `X_train, X_test, y_train, and y_test`  

### The train_test_split may be created from either `X` and `y` or `df` depending on your preference
### NOTE: the `X` in the variable names is capitalized.

### YOUR ANSWER BELOW

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = .15)

print("X_train shape:", X_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (39445, 17)
y_test shape: (6962,)


In [10]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 3

In [11]:
### GRADED
### Why is it important to use a train_test_split in model building?
### 'a') To reduce correlation between factors
### 'b') To reduce processing time
### 'c') To simulate out of sample data
### 'd') To simulate errors in our data
### Assign character associated with your choice as string to ans1
### YOUR ANSWER BELOW

ans1 = 'c'

In [12]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


Below, a train/test split is created on our dataframe with two twists. First, the target, `'campaign_period_sales'`, is modified to be binary: `True` for values greater than or equal to the mean, and `False` for values less than the mean.  

Second, the train/test split is made with set indices for reproducibility.  
NOTE: this is only done here for demonstration and should not be done in practice.
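
In practice, a common way to get a reproducible split while keeping the shuffling is the `random_state` argument of `train_test_split`. The sketch below uses a small made-up array to show that two calls with the same seed return the same rows; the array and variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 25 rows, 2 columns
data = np.arange(50).reshape(25, 2)

# Two independent calls with the same seed
a_train, a_test = train_test_split(data, test_size=0.2, random_state=0)
b_train, b_test = train_test_split(data, test_size=0.2, random_state=0)

# The same seed yields the same shuffled split
same_split = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
print("Identical splits:", same_split)
```

Unlike slicing by position, this still draws the test rows at random, so any ordering present in the file does not leak into the split.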

In [13]:
X = df.drop("campaign_period_sales", axis = 'columns')
# Compare against the mean once (vectorized) rather than recomputing it for every row
y = df.loc[:, 'campaign_period_sales'] >= df['campaign_period_sales'].mean()
print(y.head())

split_index = 13922
X_test = X.iloc[:split_index, :]
X_train = X.iloc[split_index:, :]
y_test = y.iloc[:split_index]
y_train = y.iloc[split_index:]

0    False
1    False
2     True
3    False
4    False
Name: campaign_period_sales, dtype: bool


Now we can import the `LinearRegression` model from `sklearn`, fit it on the `X_train` and `y_train` from above, and predict on the `X_test` data.  

Finally, a threshold of .5 will be used for assigning the X_test predictions to True (1) or False (0).

In [14]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

raw_predictions = lr.predict(X_test)
preds = raw_predictions >=.5
preds

array([False,  True, False, ..., False,  True, False])
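
The thresholding step is needed because a linear regression returns unbounded numbers rather than class labels. The toy sketch below (made-up data, with the target written as 0/1 floats) shows raw predictions falling below 0 and above 1 before the .5 cutoff turns them into `True`/`False`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data (hypothetical): one feature, target encoded as 0.0 / 1.0
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.array([0] * 5 + [1] * 5, dtype=float)

model = LinearRegression().fit(X_toy, y_toy)

raw = model.predict(X_toy)  # continuous values, not restricted to [0, 1]
labels = raw >= 0.5         # apply the .5 threshold to get booleans

print("Raw predictions:", raw.round(2))
print("Thresholded:    ", labels)
```

Because the raw values are not probabilities, the .5 cutoff is a convention rather than something the model guarantees to be meaningful.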

### Model Evaluation
Below, the `confusion_matrix` is imported from `sklearn.metrics`

In [15]:
from sklearn.metrics import confusion_matrix
print("Confusion matrix:\n", confusion_matrix(y_test, preds), sep = "")

Confusion matrix:
[[9472  673]
 [ 656 3121]]


In the confusion matrix above, the upper left is True Negatives (tn), the upper right is False Positives (fp), the lower left is False Negatives (fn), and the lower right is True Positives (tp).  
Below are equations for various metrics involving these counts.  


**Precision:** $$\frac{\text{TP}}{\text{TP+FP}}$$  

**Recall (True-Positive Rate):** $$\frac{\text{TP}}{\text{TP+FN}}$$  

**False-Positive Rate:** $$\frac{\text{FP}}{\text{TN+FP}}$$  

These equations are also described in Lecture 11-5  
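
The three formulas above translate directly into code. The sketch below uses a small made-up set of labels (not the model output from this assignment) and unpacks the four cells of the confusion matrix with `.ravel()`, which for a binary problem returns them in the order tn, fp, fn, tp.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 1 = positive, 0 = negative
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1, 1, 0]

# Flatten the 2x2 matrix into its four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
fpr = fp / (tn + fp)        # FP / (TN + FP)

print("precision:", precision, "recall:", recall, "fpr:", fpr)

# sklearn's built-in scorers agree with the hand-computed values
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Computing the values by hand once is a useful check on which cell of the matrix is which before relying on the built-in scoring functions.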

For the purposes of the next few questions, assume the confusion matrix came out as below:

|       |Predicted F | Predicted T|
|------|------|------|
|**Actual F**|9472|673|
|**Actual T**|656|3121|

Given the confusion matrix above:

tn = 9472  
fp = 673  
fn = 656  
tp = 3121   


The Precision would then be:
$\frac{\text{TP}}{\text{TP+FP}} = \frac{3121}{3121+673} = \frac{3121}{3794} \approx .8226$   

#### Question 4

In [None]:
### GRADED

### What is the Recall of our model?
### Assign answer to the variable `recall`

### YOUR ANSWER BELOW
   
recall = .8263


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 5

In [None]:
### GRADED

### What is the False-Positive Rate of our model?
### Assign answer to the variable `fpr`

### YOUR ANSWER BELOW
  

fpr = .0663

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 6

In [None]:
### GRADED
### Assume a classification problem trying to predict 1 or 0;
### with '1' being the (positive) presence of some trait
### A True-Positive is?
### 'a') Predicted: 0, Actual 0
### 'b') Predicted: 0, Actual 1
### 'c') Predicted: 1, Actual 0
### 'd') Predicted: 1, Actual 1
### Assign character associated with your choice as string to ans1
### YOUR ANSWER BELOW

ans1 = 'd'

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 7

In [None]:
### GRADED
### Assume a classification problem trying to predict 1 or 0;
### with '1' being the (positive) presence of some trait
### A True-Negative is?
### 'a') Predicted: 0, Actual 0
### 'b') Predicted: 0, Actual 1
### 'c') Predicted: 1, Actual 0
### 'd') Predicted: 1, Actual 1
### Assign character associated with your choice as string to ans1
### YOUR ANSWER BELOW

ans1 = 'a'

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 8

In [None]:
### GRADED
### A better model will have an AUC score:
### 'a') closer to 0
### 'b') closer to .5
### 'c') closer to 1
### Assign character associated with your choice as string to ans1
### Covered in Lecture 11-6
### YOUR ANSWER BELOW

ans1 = 'c'

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
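
Question 8 refers to the AUC score. As a rough illustration, `roc_auc_score` from `sklearn.metrics` can be applied to made-up labels and scores: scores that rank every positive above every negative give the maximum value, while scores that misorder some pairs give a lower one. All numbers below are invented for the example.

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and two sets of made-up model scores
y_true = [0, 0, 1, 1]
good_scores = [0.1, 0.2, 0.8, 0.9]    # every positive outranks every negative
mixed_scores = [0.1, 0.4, 0.35, 0.8]  # one positive is ranked below a negative

auc_good = roc_auc_score(y_true, good_scores)
auc_mixed = roc_auc_score(y_true, mixed_scores)

print("AUC, perfect ranking:", auc_good)
print("AUC, one misordered pair:", auc_mixed)
```

In the second case three of the four positive/negative pairs are ordered correctly, which is where the lower score comes from.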


### The Model Fitting Process

The most important part of the model creation process not yet focused on, either in this assignment or in lecture, is the actual fitting of the model.  

In fact, if you weren't paying attention you might have missed it.  

Thus far, both in the lecture and in this assignment, we have fit a `LinearRegression` model in `sklearn`. In another lecture a Linear Regression model was fit using `statsmodels`.  

In this and the next assignment, we will focus on model building with `sklearn`. Thankfully, the syntax for creating `sklearn` models is remarkably consistent.  

Below, we again fit a `LinearRegression` model on the `True/False` target, but we also fit a `LogisticRegression` model.

In [16]:
from sklearn.linear_model import LogisticRegression

# Fitting the LinearRegression (taken almost verbatim from above)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

raw_predictions = linreg.predict(X_test)
linpreds = raw_predictions >=.5

# Fitting a LogisticRegression
# (Note similarity in syntax -- just variable name and instantiated model were changed)
logreg = LogisticRegression()
logreg.fit(X_train,y_train)
logpreds = logreg.predict(X_test)

print("Confusion matrix with Linear Regression:\n", confusion_matrix(y_test, linpreds), sep = "")
print("\nConfusion matrix with Logistic Regression:\n", confusion_matrix(y_test, logpreds), sep = "")



Confusion matrix with Linear Regression:
[[9472  673]
 [ 656 3121]]

Confusion matrix with Logistic Regression:
[[9558  587]
 [ 689 3088]]


The syntax even holds for much more complicated models. Below, a more complicated ensemble model is used, yet the syntax remains the same.

In [17]:
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

gbcpreds = gbc.predict(X_test)

print("Confusion matrix with Gradient Boosted Trees:\n", confusion_matrix(y_test, gbcpreds), sep = "")

Confusion matrix with Gradient Boosted Trees:
[[9748  397]
 [ 770 3007]]


At this point, hopefully the similarities are starting to become clear. Each of the `sklearn` models can be fit using the `.fit()` method, and predictions may be made using the `.predict()` method.  

The `.fit()` method takes the explanatory (X) features as its first argument and the target (y) as its second argument.  
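
Because the interface is shared, several models can be fit and evaluated in a single loop. The sketch below does this on synthetic data from `make_classification`; the data, the dictionary of models, and the variable names are assumptions for illustration. It also uses `.score()`, which for classifiers returns accuracy on the data passed to it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (not the office dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "boosting": GradientBoostingClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)                        # X first, y second
    test_preds = model.predict(X_te)             # predictions on held-out rows
    accuracies[name] = model.score(X_te, y_te)   # accuracy for classifiers
    print(name, test_preds[:5], round(accuracies[name], 3))
```

Swapping in a different estimator only requires changing the entry in the dictionary; the body of the loop stays the same.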

Below, you will be asked to provide a model fitted on a provided train/test split. 

#### Question 9
In the below cell, an `X_train` and `y_train` dataset are provided.  

1. Create either a `LinearRegression` or `LogisticRegression` model.
2. Fit that model using `X_train` and `y_train`.
3. Save that model to the variable `my_classifier`.  

NB: In the above examples, models were saved to the variables `lr`, `linreg`, `logreg`, and `gbc`

In [18]:
### GRADED

### Follow the above directions; saving a model to `my_classifier`
### **fit** with the provided `X_train` and `y_train`

### YOUR ANSWER BELOW
X = df.drop("campaign_period_sales", axis = 'columns')
y = df.loc[:, 'campaign_period_sales'].apply(lambda x: x >= df['campaign_period_sales'].mean())
X_train = X.iloc[:100, :]
y_train = y.iloc[:100]

my_classifier = LinearRegression()
my_classifier.fit(X_train, y_train)

raw_predictions = my_classifier.predict(X_test)
preds = raw_predictions >=.5
preds

array([False,  True, False, ..., False,  True, False])

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


#### Question 10

In [None]:
### GRADED
### Which of the following methods would NOT be used in model building and evaluation
### with `sklearn`?
### 'a') `.transform()`
### 'b') `.fit()`
### 'c') `.predict()`
### 'd') `.score()`
### Assign character associated with your choice as string to ans1
### YOUR ANSWER BELOW

ans1 = 'a'

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###
