# Phase 4 Code Challenge Review

## Overview

- Pipelines and gridsearching
- Ensemble Methods
- Natural Language Processing
- Clustering

In [1]:
# Basic Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

In [2]:
# from src.call import call_on_students

# 1) Pipelines and Gridsearching

1. What are the benefits of using a pipeline?

In [3]:
# call_on_students(1)

A pipeline consolidates code and streamlines the modeling process, so the entire workflow is easy to see in one place.
- flexible: makes grid searching, trying new models, and iterating easier
- reduces the potential for error by reducing repetition and the risk of data leakage (between train and test, and also between folds during cross-validation)
- simplifies workflows and condenses code
- a lot easier to troubleshoot (subjective)
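
To make the data-leakage point concrete, here is a minimal sketch, assuming the same breast cancer dataset used later in this review. The names `X_demo`, `y_demo`, and `leak_free_pipe` are illustrative, and `max_iter=1000` is an assumed setting rather than anything from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# The scaler lives inside the pipeline, so every CV fold re-fits it on that
# fold's training portion only; the held-out portion never influences scaling.
leak_free_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

fold_scores = cross_val_score(leak_free_pipe, X_demo, y_demo, cv=5)
print(fold_scores.mean())
```

Scaling the full dataset before calling `cross_val_score` would let statistics from the held-out folds leak into training; wrapping the scaler in the pipeline avoids that without extra code.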

2. What does a gridsearch achieve?

In [4]:
# call_on_students(1)

- finds the best combination of hyperparameters from a given parameter grid
- this can help expedite the model iteration process
- typically done with cross-validation, and you can monitor any metric you choose (or several)

A grid search lets us try multiple hyperparameter values and determine which combination works best for the metric we're focusing on.
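
As a sketch of monitoring several metrics at once, the example below passes a list to `scoring` and picks the winner by accuracy via `refit`. The variable names and the choice of accuracy and recall are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = load_breast_cancer(return_X_y=True)

demo_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('logreg', LogisticRegression(max_iter=1000)),
])

# Track two metrics per candidate; refit tells the search which one
# decides best_params_ and best_estimator_.
demo_search = GridSearchCV(
    estimator=demo_pipe,
    param_grid={'logreg__C': [10, 1, 0.1]},
    scoring=['accuracy', 'recall'],
    refit='accuracy',
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# cv_results_ holds one row per hyperparameter combination
results = pd.DataFrame(demo_search.cv_results_)
print(results[['param_logreg__C', 'mean_test_accuracy', 'mean_test_recall']])
print(demo_search.best_params_)
```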

3. Set up a pipeline with a scaler and a logistic regression model on the breast cancer dataset that predicts whether the tumor is malignant. (Note: in scikit-learn's copy of this dataset, target = 0 is malignant and target = 1 is benign.) Don't worry for now about a train-test split.

**Answer**:

In [5]:
from sklearn.datasets import load_breast_cancer

In [11]:
# Your code here
data_overall = load_breast_cancer()
data = data_overall.data
target = data_overall.target
log_pipe = Pipeline(steps=[('scale', StandardScaler()),
                          ('logreg', LogisticRegression(random_state=42))])

In [12]:
X, y = load_breast_cancer(return_X_y=True)

4. Split the data into train and test and then gridsearch over pipelines like the one you just built to find the best-performing model. Try C (inverse regularization) values of 10, 1, and 0.1. Try out the best estimator on the test set.

**Answer**:

In [13]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=42)

parameters = {'logreg__C': [10, 1, 0.1]}

clf = GridSearchCV(estimator = log_pipe, param_grid = parameters)
clf.fit(X_train, y_train)
print(clf.best_estimator_)
print(clf.best_score_)

Pipeline(steps=[('scale', StandardScaler()),
                ('logreg', LogisticRegression(C=10, random_state=42))])
0.9764705882352942


In [14]:
clf_test_score = clf.best_estimator_.score(X_test, y_test)
clf_test_score

0.972027972027972

# 2) Ensemble Methods

1. What sorts of ensembling methods have we looked at?

- bagging (bootstrap aggregation): bagged estimators, random forest, extra trees
- random forest (tree based)
- gradient boosting (tree based)
- XGBoost (tree based)
- voting or stacking (any algorithm)

In [None]:
# call_on_students(1)

2. What is random about a random forest?

- each tree is trained on a random sample of the training data, drawn with replacement (bootstrap aggregation -> bagging)
- the above AND each decision node in each tree only considers a random subset of features when choosing the best split (bagging + random feature subsets -> Random Forest)

- Extra Trees -> at each node of each tree, the candidate features are a random subset and the split thresholds are also drawn at random
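
The three sources of randomness can be sketched with plain NumPy. This is an illustration of the sampling ideas only, not how scikit-learn implements them; the sizes and variable names are made up.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows, n_features = 10, 9

# Bagging: each tree trains on rows drawn WITH replacement,
# so some rows repeat and others are left out.
bootstrap_rows = rng.choice(n_rows, size=n_rows, replace=True)

# Random forest: each split only considers a random subset of features
# (sqrt(n_features) is a common choice for classification).
n_candidates = int(np.sqrt(n_features))
split_features = rng.choice(n_features, size=n_candidates, replace=False)

# Extra Trees: the threshold for a candidate feature is also random,
# drawn uniformly between that feature's min and max at the node.
feature_values = rng.normal(size=n_rows)
random_threshold = rng.uniform(feature_values.min(), feature_values.max())

print(bootstrap_rows, split_features, random_threshold)
```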

In [None]:
# call_on_students(1)

3. What hyperparameters of a random forest might it be useful to tune? How so?

- max_depth -> cut the tree off at certain depth (steps) - stop tree from growing, helps prevent overfitting

- n_estimators -> number of base estimators. In this case, number of trees in the forest

- max_features -> how large the subset of features at each node is

- max_samples -> how large each sample of training data for each tree is

- min_samples_split / min_samples_leaf -> controlling when we split

- criterion -> which metric is used to determine the 'best' split (gini, entropy)
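
A short sketch of where these hyperparameters go when instantiating the model. The specific values are arbitrary illustrations, not tuned recommendations, and `shallow_forest` is a hypothetical name.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

shallow_forest = RandomForestClassifier(
    n_estimators=25,       # number of trees in the forest
    max_depth=2,           # cut every tree off after two levels of splits
    max_features='sqrt',   # size of the feature subset tried at each split
    max_samples=0.8,       # each tree sees a bootstrap sample of 80% of rows
    min_samples_leaf=5,    # a split must leave at least 5 rows in each leaf
    criterion='entropy',   # metric used to pick the best split
    random_state=42,
)
shallow_forest.fit(X_demo, y_demo)

# Confirm the depth limit was respected by every tree
tree_depths = [tree.get_depth() for tree in shallow_forest.estimators_]
print(len(shallow_forest.estimators_), max(tree_depths))
```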

In [15]:
# call_on_students(1)

4. Build a random forest model on the breast cancer dataset that predicts whether the tumor is malignant. (Note: in scikit-learn's copy of this dataset, target = 0 is malignant and target = 1 is benign.) Make sure you do a train-test split!

**Answer**:

In [17]:
# Your code here
X_train, X_test, y_train, y_test = train_test_split(data, target)

random_forest = RandomForestClassifier(random_state=42)
random_forest.fit(X_train, y_train)

rf_params = {'n_estimators':[500, 100, 1000], 'min_samples_split':[2, 5, 10] }
rf_grid_search = GridSearchCV(estimator= random_forest, param_grid=rf_params)
rf_grid_search.fit(X_train,y_train)

print(rf_grid_search.best_params_)
print(rf_grid_search.best_estimator_.score(X_test, y_test))

{'min_samples_split': 2, 'n_estimators': 500}
0.972027972027972


# 3) Natural Language Processing

## NLP Concepts

### Some Example Text

In [19]:
# Each sentence is a document
sentence_one = "Harry Potter is the best young adult book about wizards"
sentence_two = "Um, EXCUSE ME! Ever heard of Earth Sea?"
sentence_three = "I only like to read non-fiction.  It makes me a better person."

# The corpus is composed of all of the documents
corpus = [sentence_one, sentence_two, sentence_three]

### 1: NLP Pre-processing

List at least three steps you can take to turn raw text like this into something that would be semantically valuable (aka ready to turn into numbers):

In [20]:
# call_on_students(1)

- remove punctuation
- convert uppercase to lowercase
- remove stop words
- remove unusual symbols (emojis, hashtags, web addresses)

1. standardize case with .lower()
2. tokenize with tokenizer.tokenize()
3. remove stop words
4. stem (reduce each word to its stem) or lemmatize (reduce each word to its dictionary root)
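
The first three steps can be sketched with the standard library alone. The stop-word set below is a tiny hand-written assumption for illustration (real projects usually take one from a library), and stemming or lemmatizing is left out because it needs an external package.

```python
import re

# Tiny illustrative stop-word list (hypothetical, not a standard list)
demo_stop_words = {'is', 'the', 'about', 'of', 'i', 'to', 'it', 'me', 'a'}

def preprocess(document):
    # 1. standardize case
    lowered = document.lower()
    # 2. tokenize: keep runs of letters, which also drops punctuation
    tokens = re.findall(r"[a-z]+", lowered)
    # 3. remove stop words
    return [tok for tok in tokens if tok not in demo_stop_words]

print(preprocess("Um, EXCUSE ME! Ever heard of Earth Sea?"))
# ['um', 'excuse', 'ever', 'heard', 'earth', 'sea']
```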

### 2: Describe what vectorized text would look like as a dataframe.

If you vectorize the above corpus, what would the rows and columns be in the resulting dataframe (aka document-term matrix)?

- each row would be a document (here, a sentence) and each column would be a unique token from the corpus vocabulary (stemmed, if stemming was applied). With a count vectorizer, the values are how many times each token appears in each document

In [21]:
# call_on_students(1)

### 3: What does TF-IDF do?

Also, what does TF-IDF stand for?

Term Frequency-Inverse Document Frequency. 
- Measure of how useful a word is as an indicator of a document. If a term appears often within a single document (high term frequency) but in few documents across the corpus (low document frequency), it is a good indicator for that document
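
To see the arithmetic, here is a sketch using the plain textbook form, tf × log(N / df). This is an assumption made for illustration: scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant, so its numbers will differ. The toy documents are made up.

```python
import math

demo_docs = [
    ['wizards', 'read', 'books'],
    ['wizards', 'cast', 'spells'],
    ['wizards', 'read', 'spells', 'spells'],
]

def tf_idf(term, doc, docs):
    # Term frequency: share of this document's tokens that are the term
    tf = doc.count(term) / len(doc)
    # Document frequency: number of documents containing the term
    # (assumes the term appears in at least one document)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# 'wizards' is in every document, so it carries no signal
print(tf_idf('wizards', demo_docs[0], demo_docs))  # 0.0
# 'books' appears in only one document, so it scores highest there
print(tf_idf('books', demo_docs[0], demo_docs))
```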

In [22]:
# call_on_students(1)

## NLP in Code

### Set Up

In [23]:
# New section, new data
policies = pd.read_csv('data/2020_policies_feb_24.csv')

def warren_not_warren(label):
    
    '''Make label a binary between Elizabeth Warren
    speeches and speeches from all other candidates'''
    
    if label =='warren':
        return 1
    else:
        return 0
    
policies['candidate'] = policies['candidate'].apply(warren_not_warren)

The dataframe loaded above consists of policies of 2020 Democratic presidential hopefuls. The `policy` column holds text describing the policies themselves.  The `candidate` column indicates whether it was or was not an Elizabeth Warren policy.

In [None]:
policies.head()

The documents for this activity are in the `policy` column, and the target is `candidate`.

### 4: Import the Relevant Class, Then Instantiate and Fit a Count Vectorizer Object

In [None]:
# call_on_students(1)

In [24]:
# First! Train-test split the dataset
y = policies.pop('candidate')
X = policies['policy']
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

In [27]:
# Import the relevant vectorizer
from sklearn.feature_extraction.text import CountVectorizer 
# Alternatives: TfidfVectorizer, HashingVectorizer
# HashingVectorizer maps tokens to column indices with a hash function instead of storing a vocabulary.
# We lose interpretability (columns can't be mapped back to words) but save memory

In [26]:
# Instantiate it
cv = CountVectorizer(stop_words='english')

In [28]:
# Fit it
cv.fit(X_train)

CountVectorizer(stop_words='english')

In [29]:
cv.vocabulary_

{'1987': 108,
 'united': 9977,
 'church': 1844,
 'christ': 1837,
 'commission': 2030,
 'racial': 7560,
 'justice': 5356,
 'commissioned': 2031,
 'studies': 9210,
 'hazardous': 4543,
 'waste': 10322,
 'communities': 2052,
 'color': 1996,
 'years': 10554,
 'later': 5477,
 '28': 187,
 'ago': 649,
 'month': 6170,
 'delegates': 2732,
 'national': 6270,
 'people': 6889,
 'environmental': 3539,
 'leadership': 5513,
 'summit': 9301,
 'adopted': 556,
 '17': 61,
 'principles': 7289,
 'federal': 3919,
 'government': 4348,
 'largely': 5471,
 'failed': 3846,
 'live': 5660,
 'vision': 10227,
 'trailblazing': 9712,
 'leaders': 5512,
 'outlined': 6632,
 'responsibilities': 8095,
 'represent': 8010,
 'predominantly': 7202,
 'black': 1313,
 'neighborhoods': 6332,
 'detroit': 2889,
 'navajo': 6287,
 'southwest': 8900,
 'louisiana': 5735,
 'cancer': 1595,
 'alley': 716,
 'industrial': 4963,
 'pollution': 7108,
 'concentrated': 2127,
 'low': 5739,
 'income': 4908,
 'decades': 2625,
 'tacitly': 9420,
 'writ

#### BONUS: Hyperparameters to tweak
- ngram_range: do we want single-word tokens, bigrams, or more
- max_df / min_df: maximum and minimum document frequency for a token to be kept in the vocabulary
- max_features: hard limit on the number of tokens kept
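
A minimal sketch of these settings on four made-up documents; the documents and the values chosen are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_docs = [
    "clean energy jobs",
    "clean energy plan",
    "student debt plan",
    "clean water plan",
]

tuned_cv = CountVectorizer(
    ngram_range=(1, 2),  # single words and bigrams
    min_df=2,            # keep tokens found in at least 2 documents
    max_df=0.9,          # drop tokens found in more than 90% of documents
    max_features=5,      # keep at most 5 tokens overall
)
tuned_cv.fit(demo_docs)

print(sorted(tuned_cv.vocabulary_))
# ['clean', 'clean energy', 'energy', 'plan']
```

Tokens such as `jobs` or `student debt` appear in only one document, so `min_df=2` removes them before `max_features` is applied.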

### 5: Vectorize Your Text, Then Model

In [30]:
# call_on_students(1)

In [31]:
# Code here to transform train and test sets with the vectorizer
X_train_vectorized = cv.transform(X_train)
X_test_vectorized = cv.transform(X_test)

In [33]:
X_train_vectorized.shape, X_test_vectorized.shape

((141, 10585), (48, 10585))

In [34]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(cv.get_feature_names_out())



In [35]:
# Code here to instantiate and fit a Random Forest model
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_vectorized, y_train)

RandomForestClassifier(random_state=42)

In [36]:
# Code here to evaluate your model on the test set
rf.score(X_test_vectorized, y_test)

0.9375

# 4) Clustering

## Clustering Concepts

### 1: Describe how the K-Means algorithm updates its cluster centers after initialization.

In [None]:
# call_on_students(1)

K-Means updates its cluster centers by repeatedly assigning each point to its nearest center and then moving each center to the average of its assigned points

   - initialization is random
   - measure the distance from every point to every center
   - reassign each point to its closest cluster center
   - update each center to the mean of the points newly assigned to it
   - repeat until the assignments (or centers) stop changing
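
One assign-then-update iteration can be sketched in NumPy as follows. This is a bare-bones illustration under the assumption that every cluster keeps at least one point; it is not scikit-learn's implementation, and the data are made up.

```python
import numpy as np

points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[1.0, 1.0], [8.0, 1.0]])

def kmeans_step(points, centers):
    # Distance from every point to every center: shape (n_points, n_centers)
    distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    # Assign each point to its closest center
    labels = distances.argmin(axis=1)
    # Move each center to the mean of its assigned points
    # (assumes no cluster ends up empty)
    new_centers = np.array([points[labels == k].mean(axis=0)
                            for k in range(len(centers))])
    return labels, new_centers

labels, centers = kmeans_step(points, centers)
print(labels)   # [0 0 1 1]
print(centers)  # centers move to [0, 1] and [10, 1]
```

Calling `kmeans_step` again with the new centers leaves the labels unchanged here, which is the stopping condition.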

### 2: What is inertia, and how does K-Means use inertia to determine the best estimator?

Please also describe the method you can use to evaluate clustering using inertia.

Documentation, for reference: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html


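- Inertia is the SSE (sum of squared errors): the sum of squared distances between each data point and its assigned cluster center. The K-Means algorithm tries to minimize this value.
- K-Means runs several random initializations (`n_init`) and keeps the run with the lowest inertia as the best estimator
- Inertia is used in the elbow plot method to find the best k
    - plot inertia against k and look for the elbow, the point where adding more clusters stops reducing inertia by much
- K-Means itself only minimizes within-cluster distance (intra-cluster, measured by inertia). Silhouette score is an evaluation metric that accounts for both within-cluster distance and between-cluster distance (inter-cluster)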
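
A sketch of gathering the numbers for an elbow plot, assuming the scaled iris data and a range of k from 1 to 7 (both choices are illustrative). Plotting is left out; the dictionary can be passed to matplotlib if a figure is wanted.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X_demo = StandardScaler().fit_transform(load_iris().data)

inertias = {}
for k in range(1, 8):
    # n_init=10 runs ten random initializations and keeps the lowest-inertia run
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_demo)
    inertias[k] = km.inertia_

# Look for the k after which the drop in inertia flattens out
for k, value in inertias.items():
    print(k, round(value, 1))
```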

In [None]:
# call_on_students(1)

### 3: What other metric do we have to score the clusters which are formed?

Describe the difference between it and inertia.

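- Silhouette score takes into consideration both the cohesion of each cluster and its separation from other clusters. It ranges from -1 to 1, with values near 1 indicating good clustering

- With silhouette score, closer to 1 is better and scores can be compared directly across values of k, whereas inertia tends to keep decreasing as k grows, so you look for the elbow instead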
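
For a single point, the silhouette value is (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points in the nearest other cluster. A hand-rolled sketch on made-up one-dimensional data, assuming every cluster has at least two points:

```python
import numpy as np

points = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])

def silhouette_of(i, points, labels):
    dists = np.linalg.norm(points - points[i], axis=1)
    # a: cohesion, mean distance to the other members of the same cluster
    same = labels == labels[i]
    same[i] = False  # exclude the point itself
    a = dists[same].mean()
    # b: separation, mean distance to the nearest other cluster
    other_labels = [lab for lab in np.unique(labels) if lab != labels[i]]
    b = min(dists[labels == lab].mean() for lab in other_labels)
    return (b - a) / max(a, b)

print(silhouette_of(0, points, labels))  # about 0.905
```

`silhouette_score` from `sklearn.metrics` reports the average of this quantity over all points.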

In [None]:
# call_on_students(1)

## Clustering in Code with Hierarchical Agglomerative Clustering

After the above conceptual review of KMeans, let's practice coding with agglomerative clustering.

### Set Up

In [39]:
# New dataset for this section!
from sklearn.datasets import load_iris

data = load_iris()
X = pd.DataFrame(data['data'])

### 4: Prepare our Data for Clustering

What steps do we need to take to preprocess our data effectively?

- scaling is vitally important because agglomerative clustering relies solely on distance

In [40]:
# call_on_students(1)

In [41]:
# Code to preprocess the data
scaler = StandardScaler()
# Name the processed data X_processed
X_processed = scaler.fit_transform(X)

### 5: Import the Relevant Class, Then Instantiate and Fit a Hierarchical Agglomerative Clustering Object

Let's use `n_clusters = 2` to start (default)

In [42]:
# call_on_students(1)

In [43]:
# Import the relevant clustering algorithm
from sklearn.cluster import AgglomerativeClustering

# Instantiate and fit
cluster = AgglomerativeClustering(n_clusters=2)
cluster.fit(X_processed)

AgglomerativeClustering()

In [44]:
# Calculate a silhouette score
from sklearn.metrics import silhouette_score
labels = cluster.labels_  # labels from the fit above; no need to refit
cluster_sil_score = silhouette_score(X_processed, labels)
cluster_sil_score

0.5770346019475989

### 6: Write a Function to Test Different Options for `n_clusters`

The function should take in the number for `n_clusters` and the data to cluster, fit a new clustering model using that parameter to the data, print the silhouette score, then return the labels attribute from the fit clustering model.

In [45]:
# call_on_students(1)
def test_clusters(n_clusters, data):
    # Instantiate and fit
    cluster = AgglomerativeClustering(n_clusters=n_clusters)
    cluster.fit(data)

    # Print silhouette score, reusing the labels from the single fit above
    cluster_sil_score = silhouette_score(data, cluster.labels_)
    print(cluster_sil_score)
    return cluster.labels_

In [None]:
test_clusters(2, X_processed)

In [47]:
for k in range(2, 10):
    test_clusters(k, X_processed)

0.5770346019475989
0.446689041028591
0.4006363159855973
0.33058726295230545
0.3148548010051283
0.316969830299128
0.310946529007258
0.31143422475471655
