# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [1]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [None]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

### What are the benefits of using a pipeline?

In [None]:
# call_on_students(1)

Create a pipeline →  it becomes an sklearn estimator →  fit, transform, cross-validate, etc.
- Compartmentalizes each part of the process
- Flexible: each step can be changed separately
- Helps prevent data leakage in cross-validation, because the transformers are fit only on the training folds
- Column transformer: applies different transformation objects to different columns. One-hot encoding, scaling, and imputation run only on the columns that need them, like "mini" pipelines for each group of columns
- Missing values (nulls) can be handled differently on different columns
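The column-transformer idea above can be sketched roughly as follows. The toy DataFrame, its column names (`age`, `city`), and the labels are made up for illustration; the point is only that each transformer touches its own columns and that the whole thing behaves like a single estimator.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a null, one categorical column
toy_X = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 46.0, 29.0],
    "city": ["NYC", "LA", "NYC", "SF", "LA", "SF"],
})
toy_y = np.array([0, 1, 0, 1, 1, 0])

# "Mini" pipeline for the numeric column: impute the null, then scale
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Each transformer is applied only to the columns listed for it
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# The full pipeline is itself an estimator with fit / predict / score
toy_pipeline = Pipeline([
    ("prep", preprocessor),
    ("lr", LogisticRegression()),
])
toy_pipeline.fit(toy_X, toy_y)
toy_preds = toy_pipeline.predict(toy_X)
```

Because preprocessing lives inside the pipeline, passing `toy_pipeline` to a cross-validation routine refits the imputer, scaler, and encoder on each training fold.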


### What does a gridsearch achieve?

In [None]:
# call_on_students(1)

The problem with tuning a model by hand: hyperparameters interact, so an adjustment that helps on its own can behave differently once it is combined with other adjustments
- Really, you need to tune the hyperparameters in tandem

Grid Search:
- Automatically and systematically searches for the optimal combination of hyperparameters. We provide a grid with the values of each hyperparameter that we want to try
- Evaluates all of those combinations with the evaluation metric that we choose, then returns the best combination of hyperparameters = the best model
- In scikit-learn: the `GridSearchCV` class with its `.fit()` method
- Cross-validates each of those combinations
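To make "every combination" concrete, here is a small sketch using scikit-learn's `ParameterGrid`, which enumerates the same candidates a grid search would try. The grid itself is hypothetical (a pipeline step named `lr` with two hyperparameters), and the 5 folds are an assumed setting.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: two hyperparameters of a pipeline step named 'lr'
example_grid = {
    "lr__C": [0.1, 1, 10],
    "lr__penalty": ["l1", "l2"],
}

# ParameterGrid enumerates every combination a grid search would try
combos = list(ParameterGrid(example_grid))
for combo in combos:
    print(combo)

# 3 values of C x 2 penalties = 6 candidates; with 5-fold CV that is 30 fits
n_folds = 5
n_fits = len(combos) * n_folds
print(len(combos), "candidates ->", n_fits, "fits")
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is worth keeping in mind before launching a large search.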


### Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts whether the tumor is malignant. Note that `load_breast_cancer` encodes malignant as target = 0 and benign as target = 1. Don't worry for now about a train-test split.

In [None]:
# call_on_students(1)

**Answer**:

In [4]:
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

In [15]:
# Your code here
pipeline = Pipeline([('scaler', StandardScaler()),
                   ('lr', LogisticRegression(random_state=1, C=1))])

In [None]:
# call_on_students(1)

### Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization) values of 10, 1, and 0.1. Try out the best estimator on the test set.

In [None]:
# call_on_students(1)

**Answer**:

In [12]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [17]:
# Your code here
grid = {
    'lr__C': [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, grid, verbose=2, return_train_score=True)
#default cv=5

grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=0.1 .......................................................
[CV] ........................................ lr__C=0.1, total=   0.0s
[CV] lr__C=1 .........................................................
[CV] .......................................... lr__C=1, total=   0.0s
[CV] lr__C=1 .........................................................
[CV] ............

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s


[CV] ......................................... lr__C=10, total=   0.0s


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:    0.2s finished


GridSearchCV(estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('lr',
                                        LogisticRegression(C=1,
                                                           random_state=1))]),
             param_grid={'lr__C': [0.1, 1, 10]}, return_train_score=True,
             verbose=2)

In [18]:
grid_search.best_params_

{'lr__C': 0.1}

In [19]:
grid_search.best_estimator_

Pipeline(steps=[('scaler', StandardScaler()),
                ('lr', LogisticRegression(C=0.1, random_state=1))])

In [21]:
#testing the best estimator out on test data
grid_search.best_estimator_.score(X_test, y_test)

0.958041958041958

# 2) Ensemble Methods

### What sorts of ensembling methods have we looked at?

In [None]:
# call_on_students(1)

Ensemble: a model consisting of a bunch of other models
- Generalizes well (performs well on unseen test data)
- Decreases variance →  less overfitting
- More complexity: we have to train each model or part of the model
    - BUT that sometimes means →  less interpretability
    - Ensembles consisting of a bunch of trees still have feature importances
    - Ensembles consisting of trees, logistic regressions, and KNNs have feature importances and coefficients for the individual sub-models, but not for the whole model
- Takes up more space: we have to keep each model
    - That also means →  more computational power and time
- We want differently randomized models that are overfit in different ways

Bagging: a bag of marbles, where each marble is a model
- Many model types naturally overfit
- So we can randomly assign data and features to each model
- Each model overfits in a different way
- Then we aggregate all the models
- Bootstrap (sampling with replacement) aggregating:
    - Repeatedly sample from your data with replacement (bootstrap) to create multiple samples
    - Train models on those samples
    - The final model predicts by averaging or voting across those models
    - →  the hope is that this smooths over/cancels out the different ways each model overfits, reducing variance
    - →  low variance, since it averages out quirks the individual models might have learned

Levels of Randomization

1. Simple Bag: BaggingClassifier
    - Train each model on a random sample drawn w/ replacement
    - Bagging algorithm:
        - Take a sample of your X_train and fit a decision tree to it.
        - Replace the first batch of data and repeat.
        - When you've got as many trees as you like, make use of all your individual trees' predictions to come up with some holistic prediction. (Most obviously, we could take the average of our predictions, but there are other methods we might try.)
    - Why it is called bagging:
        - Because we're resampling our data with replacement, we're bootstrapping.
        - Because we're making use of our many samples' predictions, we're aggregating.
        - Because we're bootstrapping and aggregating all in the same algorithm, we're bagging.
2. Random Forest: bootstrap sample AND give each tree all the features, but at each node only allow it to choose from a random subset of those features
    - Each tree in the forest is very different
    - Finds the best split among that subset based on a criterion like Gini or entropy, like normal decision trees
3. Extra Trees: random subset of features at each node, AND the split threshold for each candidate feature is drawn at random instead of being optimized. The criterion (Gini or entropy) is only used to pick the best of those random splits. In scikit-learn, Extra Trees use the whole training set by default rather than a bootstrap sample (`bootstrap=False`)
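The three levels of randomization above map onto three scikit-learn classes. The following is a rough comparison sketch on the breast cancer data, not a benchmark: the variable names, `n_estimators=50`, and the 5-fold setting are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, ExtraTreesClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=1)

tree_ensembles = {
    # Level 1: each tree is fit on a bootstrap sample of the rows
    "bagging": BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=50, random_state=1),
    # Level 2: bootstrap rows + random subset of features at every split
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
    # Level 3: random feature subset + randomly drawn split thresholds
    "extra_trees": ExtraTreesClassifier(n_estimators=50, random_state=1),
}

# Mean cross-validated accuracy on the training set for each ensemble
cv_means = {
    name: cross_val_score(model, X_tr, y_tr, cv=5).mean()
    for name, model in tree_ensembles.items()
}
for name, mean_acc in cv_means.items():
    print(f"{name}: {mean_acc:.3f}")
```

On a small, clean dataset like this one the three scores tend to be close; the differences between the methods matter more on noisier data.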

Ensembles of Different Model Types:
- Maybe you have some linear data, some non-linear data, and a lot of categorical data
- Logistic regression models are better for linear relationships, but trees are better for the last two
- 2 methods:
    - VotingClassifier or VotingRegressor: takes a simple average (or a vote) of the outputs of the sub-models
        - Sometimes uses weighted averaging
    - StackingClassifier or StackingRegressor: feeds the outputted predictions of a bunch of sub-models as features into another final model
        - Also known as a Meta-Classifier/Meta-Regressor
        - Approaching a neural network
        - Estimates the weights in the averaging for us
        - This means we'll have one layer of base estimators and another layer that is "trained to optimally combine the model predictions to form a new set of predictions". See this short blog post for more.
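A minimal sketch of the two approaches, assuming the breast cancer data and an arbitrary trio of sub-models (scaled logistic regression, scaled KNN, and a random forest). The choice of sub-models and of logistic regression as the final estimator is illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=1)

# Sub-models of different types; each is cloned internally before fitting
base_models = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=1)),
]

# Soft voting averages the predicted probabilities of the sub-models
voter = VotingClassifier(estimators=base_models, voting="soft")
voter.fit(X_tr, y_tr)

# Stacking feeds cross-validated sub-model predictions into a final estimator,
# which learns how much weight each sub-model deserves
stacker = StackingClassifier(estimators=base_models,
                             final_estimator=LogisticRegression())
stacker.fit(X_tr, y_tr)

voting_acc = voter.score(X_te, y_te)
stacking_acc = stacker.score(X_te, y_te)
print(f"voting: {voting_acc:.3f}, stacking: {stacking_acc:.3f}")
```

`VotingClassifier` also accepts a `weights` argument for weighted averaging, whereas the stacker estimates its own combination from the data.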


### What is random about a random forest?

In [None]:
# call_on_students(1)

- Still bootstrapping: random samples w/ replacement
- Now give each tree all the features, but at each node we give the tree a random subset of the features to choose from/split on 
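The two sources of randomness can be illustrated with plain NumPy. This is a sketch of the sampling ideas only, not scikit-learn's internals; the sizes (426 rows, 30 features) and the square-root rule for the subset size are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_rows, n_features = 426, 30  # sizes chosen for illustration

# Randomness 1 (bootstrap): draw n_rows row indices WITH replacement,
# so some rows repeat and others are left out of this tree's sample
boot_idx = rng.integers(0, n_rows, size=n_rows)
unique_fraction = len(np.unique(boot_idx)) / n_rows
print(f"fraction of distinct rows in the bootstrap sample: {unique_fraction:.2f}")

# Randomness 2 (feature subsetting): at one node, pick a subset of candidate
# features WITHOUT replacement; a fresh subset would be drawn at every node
n_candidates = int(np.sqrt(n_features))
node_features = rng.choice(n_features, size=n_candidates, replace=False)
print("candidate features at this node:", np.sort(node_features))
```

A bootstrap sample the same size as the data contains roughly 63% of the distinct rows on average, which is why every tree ends up seeing a noticeably different dataset.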

Pros:

- High performance, low variance
- Some transparency: feature importances are inherited from decision trees, although the forest as a whole is harder to read than a single "white box" tree

Cons:

- So many trees to plant
- Computationally expensive
- Memory: every tree has to be stored in memory, so a forest can be almost as memory intensive as KNN


### What hyperparameters of a random forest might it be useful to tune? How so?

In [None]:
# call_on_students(1)

- n_estimators: number of sub-models (trees) to train
- max_features: feature subsetting, the number of features considered at each split
    - A Bagging Classifier has a max_features parameter too, but there it subsets the features once per sub-model rather than at each split, so the result only approximates a Random Forest
- bootstrap: whether to randomly sample the dataset with replacement for each tree. We do want this in order to grow a different tree each time
- max_samples: size of each bootstrap sample
    - As a float, must be between 0 and 1 (a fraction of the rows); an integer is read as a number of rows
    - Default = None, which samples the length of X
    - If it is = 1.0: →  each sample is the same size as the full dataset, which is possible b/c we allow replacement after we take each data point
    - So we can have a bunch of different samples that are all the same size as the original, one for each model
- Also all the same parameters as a decision tree: max_depth, min_samples_leaf, min_samples_split
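As a rough sketch, the hyperparameters listed above can be tuned together with a grid search. The values in the grid are illustrative only, not recommended settings, and the variable names and 3-fold setting are assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_bc, y_bc, random_state=1)

# Illustrative values only, not recommended settings
rf_grid = {
    "n_estimators": [25, 50],       # number of trees
    "max_features": ["sqrt", 0.5],  # features considered at each split
    "max_depth": [None, 5],         # decision-tree parameter
    "max_samples": [None, 0.5],     # size of each bootstrap sample
}

rf_search = GridSearchCV(RandomForestClassifier(random_state=1), rf_grid, cv=3)
rf_search.fit(X_tr, y_tr)

print(rf_search.best_params_)
rf_test_acc = rf_search.best_estimator_.score(X_te, y_te)
print(f"test accuracy of the best forest: {rf_test_acc:.3f}")
```

`max_samples` only has an effect when `bootstrap=True`, which is the default for `RandomForestClassifier`.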

### Build a random forest model on the breast cancer dataset that predicts whether the tumor is malignant (target = 0 in `load_breast_cancer`; benign is target = 1).

In [None]:
# call_on_students(1)

**Answer**:

In [24]:
# Your code here
rfc = RandomForestClassifier(random_state=1)
rfc.fit(X_train, y_train)

# Optional: cross-validate on the training set
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(estimator=rfc, X=X_train, y=y_train, cv=5)
# scores

# Accuracy on the held-out test set
rfc.score(X_test, y_test)


0.9440559440559441

# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [None]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

### NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [None]:
# call_on_students(1)

1. Remove stop words like "is" that carry little semantic value
2. Lowercase all the words
3. Remove punctuation
4. Stem or lemmatize
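The first three steps can be sketched with the standard library plus scikit-learn's built-in English stop word list. The helper name `simple_preprocess` is hypothetical, and stemming/lemmatizing is left out because it would need an extra library such as NLTK.

```python
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS


def simple_preprocess(doc):
    """Lowercase, strip punctuation, and drop stop words from one document."""
    doc = doc.lower()                     # step 2: lowercase
    doc = re.sub(r"[^a-z\s]", " ", doc)   # step 3: punctuation/digits -> spaces
    tokens = doc.split()                  # naive whitespace tokenization
    # step 1: drop stop words
    return [tok for tok in tokens if tok not in ENGLISH_STOP_WORDS]


example_doc = "Um, EXCUSE ME! Ever heard of Earth Sea?"
cleaned_tokens = simple_preprocess(example_doc)
print(cleaned_tokens)
```

In practice a vectorizer such as `CountVectorizer` can perform the lowercasing, tokenizing, and stop word removal itself, so a helper like this is mostly useful for seeing what each step does.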

### Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document term matrix)?

In [None]:
# call_on_students(1)

Represented as a document term matrix:
- Each row is a document
- Each column is a unique vocabulary n-gram (a token). The value is the count of that token in the document (or 1/0 for whether the doc has it, if a binary vectorizer is used)
- So each row is a numerical vector


### What does TF-IDF do?

Also, what does TF-IDF stand for?

In [None]:
# call_on_students(1)

## NLP in Code

### Set Up

In [None]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    '''Make label a binary between Elizabeth Warren
    speeches and speeches from all other candidates'''
    if label == 'warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [None]:
policies.head()

The documents for this activity are in the `policy` column, and the target is the `candidate` column.

### Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [None]:
# call_on_students(1)

In [None]:
# First! Train-test split the dataset


In [None]:
# Import the relevant vectorizer


In [None]:
# Instantiate it


In [None]:
# Fit it


### Vectorize Your Text, Then Model

In [None]:
# call_on_students(1)

In [None]:
# Code here to transform train and test sets with the vectorizer


In [None]:
# Code here to instantiate and fit a Random Forest model


In [None]:
# Code here to evaluate your model on the test set
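One possible way to work through the steps above, sketched on a hypothetical stand-in for the `policies` dataframe, since the CSV is not reproduced here. The text and the 0/1 labels below are made up and arbitrary; only the column names `policy` and `candidate` come from the notebook.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: made-up documents with arbitrary labels
toy_policies = pd.DataFrame({
    "policy": [
        "expand access to public libraries",
        "fund new community gardens",
        "build more bike lanes downtown",
        "repair aging water pipes",
        "plant trees along city streets",
        "extend bus service hours",
        "upgrade school science labs",
        "restore historic town squares",
    ],
    "candidate": [1, 1, 1, 1, 0, 0, 0, 0],
})

# First! Train-test split, so the vectorizer never sees the test documents
docs_train, docs_test, y_train_nlp, y_test_nlp = train_test_split(
    toy_policies["policy"], toy_policies["candidate"],
    test_size=0.25, stratify=toy_policies["candidate"], random_state=1)

# Instantiate the vectorizer and fit it on the training documents only
cv = CountVectorizer(stop_words="english")
cv.fit(docs_train)

# Transform both sets with the already-fit vectorizer
X_train_vec = cv.transform(docs_train)
X_test_vec = cv.transform(docs_test)

# Instantiate and fit a Random Forest, then evaluate on the test set
rf_nlp = RandomForestClassifier(random_state=1)
rf_nlp.fit(X_train_vec, y_train_nlp)
test_accuracy = rf_nlp.score(X_test_vec, y_test_nlp)
print(test_accuracy)
```

With the real data, `toy_policies` would be replaced by `policies`; the accuracy on eight made-up sentences carries no meaning and is only there to show the workflow.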


# 4) Clustering

## Clustering Concepts

### Describe how the K-Means algorithm updates its cluster centers (centroids) after initialization.

In [None]:
# call_on_students(1)
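In the standard K-Means procedure (Lloyd's algorithm), each iteration assigns every point to its nearest centroid and then moves each centroid to the mean of the points assigned to it, repeating until the centers stop moving. Below is a bare-bones NumPy sketch of that loop on hypothetical 2-D points; it does not handle empty clusters and is not how scikit-learn implements it internally.

```python
import numpy as np

# Hypothetical 2-D points forming two loose groups
points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                   [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])  # arbitrary starting centers

for _ in range(10):
    # Assignment step: distance from every point to every centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([points[labels == k].mean(axis=0)
                              for k in range(len(centroids))])

    if np.allclose(new_centroids, centroids):
        break  # converged: the centers stopped moving
    centroids = new_centroids

print(labels)
print(centroids)
```

Where the centers end up depends on where they start, which is why K-Means is usually run from several different initializations.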

### What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

In [None]:
# call_on_students(1)
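Inertia is the sum of squared distances from each point to its closest centroid. scikit-learn's `KMeans` runs the algorithm `n_init` times from different starting centers and keeps the run with the lowest inertia. The elbow method compares inertia across values of `n_clusters`; a sketch on the scaled iris data, where the range of k and the variable names are arbitrary choices:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_iris = StandardScaler().fit_transform(load_iris().data)

# Fit one model per candidate k and record its inertia
inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_iris)
    inertias[k] = km.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

Inertia keeps falling as k grows, so the lowest value is not the goal; the elbow method looks for the k after which the improvement levels off, typically by plotting these values against k.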

### What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

In [None]:
# call_on_students(1)

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [None]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

### Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?


In [None]:
# call_on_students(1)

In [None]:
# Code to preprocess the data

# Name the processed data X_processed


### Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [None]:
# call_on_students(1)

In [None]:
# Import the relevant clustering algorithm


# Instantiate and fit


In [None]:
# Calculate a silhouette score
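One possible way to fill in the preprocessing, fitting, and scoring steps, sketched under the assumption that scaling with `StandardScaler` is the preprocessing wanted here. The variable names `agg` and `sil` are illustrative.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame(load_iris()["data"])

# Clustering is distance-based, so put every feature on the same scale
X_processed = StandardScaler().fit_transform(X)

# Instantiate and fit (n_clusters=2 is also the default)
agg = AgglomerativeClustering(n_clusters=2)
agg.fit(X_processed)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters
sil = silhouette_score(X_processed, agg.labels_)
print(sil)
```

Unlike inertia, the silhouette score also accounts for how far each point is from the neighboring cluster, so it can be compared across different values of `n_clusters`.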


### Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [None]:
# call_on_students(1)
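A sketch of such a function, with the hypothetical name `cluster_and_score`. It repeats the iris loading and scaling so that it runs on its own; in the notebook, `X_processed` from the earlier cell could be passed in instead.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def cluster_and_score(n_clusters, data):
    """Fit agglomerative clustering, print the silhouette score, return labels."""
    model = AgglomerativeClustering(n_clusters=n_clusters)
    model.fit(data)
    score = silhouette_score(data, model.labels_)
    print(f"n_clusters={n_clusters}: silhouette score = {score:.3f}")
    return model.labels_


# Usage: try several values of n_clusters on the scaled iris data
X_scaled = StandardScaler().fit_transform(load_iris().data)
labels_by_k = {k: cluster_and_score(k, X_scaled) for k in range(2, 6)}
```

Returning the labels makes it easy to attach them to the original dataframe afterwards and inspect what each cluster contains.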