Lambda School Data Science

*Unit 4, Sprint 1, Module 3*

---

# Document Classification (Prepare)

Today's guided module project will be different. You already know how to do classification. You already know how to extract features from documents. So? That means you're ready to combine and practice those skills in a [Kaggle competition](https://www.kaggle.com/c/whiskey-201911/). We will open with a five-minute sprint explaining the competition, and then give you 25 minutes to work. After those 25 minutes are up, I will give a 5-minute demo of an NLP technique that will help you with document classification (*and **maybe** the competition*).

Today's all about having fun and practicing your skills. The competition will begin

## Learning Objectives
* <a href="#p1">Part 1</a>: Text Feature Extraction & Classification Pipelines
* <a href="#p2">Part 2</a>: Latent Semantic Indexing
* <a href="#p3">Part 3</a>: Word Embeddings with Spacy

## Challenge -- this afternoon's lab module assignment

1. Join Lambda School's [Whisky Classification Kaggle Competition](https://www.kaggle.com/c/whiskey-201911/)
2. Download the data
3. Train a model & try: 
    - Create a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores to specify parameters from each level in the nested pipeline. For example, `lsi__svd__n_components` specifies the parameter `n_components` of the `svd` step, which is nested inside the `lsi` pipeline
    - Try to extract word embeddings with Spacy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 
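
The double-underscore naming from step 3 can be checked without fitting anything. The sketch below assumes a hypothetical nested pipeline whose step names (`lsi`, `vect`, `svd`, `clf`) match the example parameter above; the estimators are chosen only for illustration.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Inner pipeline: vectorize, then reduce dimensionality (step names are assumptions)
lsi = Pipeline([("vect", TfidfVectorizer()), ("svd", TruncatedSVD())])

# Outer pipeline: the inner pipeline is just another named step
pipe = Pipeline([("lsi", lsi), ("clf", RandomForestClassifier())])

# Each "__" descends one level: outer step -> inner step -> parameter
pipe.set_params(lsi__svd__n_components=50)

# Both lookups read the same value back from the nested SVD step
print(pipe.get_params()["lsi__svd__n_components"])
print(pipe.named_steps["lsi"].named_steps["svd"].n_components)
```

The same key format is what a `GridSearchCV` parameter dictionary expects, so `pipe.get_params().keys()` is a quick way to look up valid names.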

# 1. Text Feature Extraction & Classification Pipelines (Learn)
<a id="p1"></a>

## Overview

Sklearn pipelines allow you to stitch together multiple components of a machine learning process. The idea is that you can pass in your raw data and get predictions out of the pipeline. This ability to pass raw input to a single object and receive a prediction makes pipelines well suited for production, because you can pickle a pipeline without worrying about separate data preprocessing steps. 
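
As a minimal sketch of the pickling point, the snippet below fits a small pipeline, serializes it, and predicts from raw text with the restored copy. The corpus, labels, and choice of `LogisticRegression` are made up for illustration and are not part of this module's solution.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, only to keep the sketch self-contained
docs = ["peaty smoky whisky", "sweet vanilla bourbon",
        "smoky peat and tar", "vanilla caramel bourbon"]
labels = [0, 1, 0, 1]

pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])
pipe.fit(docs, labels)

# Serialize the whole pipeline: vectorizer vocabulary and classifier weights together
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)

# The restored object accepts raw text directly, with no separate preprocessing step
print(restored.predict(["smoky peat"]))
```

In practice the pickle should be loaded with the same scikit-learn version that created it.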

*Note:* Each time the grid search fits the pipeline, every component is fit again. The vectorizer (tf-idf) is refit on the training documents of each cross-validation fold, for every parameter combination. That adds significant run time to our grid search. There *might* be interactions between the vectorizer and our classifier, so we estimate their performance together in the code below. However, if your goal is to reduce run time, fit your vectorizer separately (i.e., outside the grid-searched pipeline). 
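
The run-time shortcut in the note could look roughly like this: vectorize once, then search only the classifier's parameters. The toy corpus and the candidate values are assumptions made for this sketch.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

# Made-up corpus: six "smoky" documents (class 0) and six "sweet" documents (class 1)
docs = ["smoky peat and tar", "peat smoke with brine", "tar smoke and iodine",
        "briny peat with ash", "ash and smoke on the finish", "iodine and peat smoke",
        "sweet vanilla and caramel", "caramel with honey and vanilla",
        "honey and toffee sweetness", "vanilla toffee and oak",
        "sweet oak with caramel", "toffee honey and vanilla"]
labels = [0] * 6 + [1] * 6

# Fit the vectorizer a single time, outside the search
vect = TfidfVectorizer()
X_vec = vect.fit_transform(docs)

# Only classifier parameters are searched, so no "step__" prefix is needed
gs_clf = GridSearchCV(RandomForestClassifier(random_state=0),
                      {"n_estimators": [10, 50], "max_depth": [3, None]},
                      cv=3)
gs_clf.fit(X_vec, labels)
print(gs_clf.best_params_)
```

The trade-off is that the vectorizer has already seen the documents that later serve as validation folds, so the cross-validation scores can be slightly optimistic, and vectorizer settings are no longer tuned.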

## 1.1 Prepare Colab notebook

### 1.1.1 Get Spacy

In [66]:
# Locally (or on Colab), download en_core_web_md
!python -m spacy download en_core_web_md # en_core_web_lg also works, but takes a while to download

Collecting en_core_web_md==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-2.2.5/en_core_web_md-2.2.5.tar.gz (96.4 MB)
     |████████████████████████████████| 96.4 MB 1.2 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


### 1.1.2 Restart runtime!

### 1.1.3 Imports

In [1]:
# Import Statements
import os
import re
import numpy as np
import pandas as pd
import seaborn as sns
import spacy

from sklearn.datasets import fetch_20newsgroups
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import MinMaxScaler, StandardScaler
%matplotlib inline

### 1.1.4 Load spacy

In [2]:
# load the pre-trained spaCy model, which includes word vectors
nlp = spacy.load("en_core_web_md")

## 1.2 Example NLP document classification pipeline 
Working with the `20newsgroups` data set available from `sklearn`, <br>we'll build a classifier that can classify news articles into 2 different categories.

### 1.2.1 Get the data set

In [3]:
# Dataset
from sklearn.datasets import fetch_20newsgroups

# 2 categories to classify today
categories = ['alt.atheism',
              'talk.religion.misc']

data = fetch_20newsgroups(subset='all', 
                          categories=categories)

### 1.2.2 Examine and understand the data set!

In [4]:
type(data)

sklearn.utils.Bunch

In [5]:
dir(data)

['DESCR', 'data', 'filenames', 'target', 'target_names']

In [6]:
data.DESCR



How would you classify the first post: Religion or Atheism?

In [7]:
print(type(data.data))
print(len(data.data))
print(data.data[0])

<class 'list'>
1427
From: agr00@ccc.amdahl.com (Anthony G Rose)
Subject: Re: Who's next?  Mormons and Jews?
Reply-To: agr00@JUTS.ccc.amdahl.com (Anthony G Rose)
Organization: Amdahl Corporation, Sunnyvale CA
Lines: 18

In article <1993Apr20.142356.456@ra.royalroads.ca> mlee@post.RoyalRoads.ca (Malcolm Lee) writes:
>
>In article <C5rLps.Fr5@world.std.com>, jhallen@world.std.com (Joseph H Allen) writes:
>|> In article <1qvk8sINN9vo@clem.handheld.com> jmd@cube.handheld.com (Jim De Arras) writes:
>|> 
>|> It was interesting to watch the 700 club today.  Pat Robertson said that the
>|> "Branch Dividians had met the firey end for worshipping their false god." He
>|> also said that this was a terrible tragedy and that the FBI really blew it.
>
>I don't necessarily agree with Pat Robertson.  Every one will be placed before
>the judgement seat eventually and judged on what we have done or failed to do
>on this earth.  God allows people to choose who and what they want to worship.

I'm sorry, bu

In [13]:
print(data['target_names'])

['alt.atheism', 'talk.religion.misc']


In [14]:
print(len(data['target']))

1427


In [15]:
data.target[:10]

array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])

In [16]:
print(len(data.filenames))
data.filenames[0]

1427


'/root/scikit_learn_data/20news_home/20news-bydate-train/talk.religion.misc/84101'

In [17]:
np.unique(data.target)

array([0, 1])

### 1.2.3 Function to clean the data

In [18]:
def clean_data(text):
    """
    Accepts a single text document and performs several regex substitutions in order to clean the document. 
    
    Parameters
    ----------
    text: string or object 
    
    Returns
    -------
    text: string or object
    """
    
    # order of operations - apply the expression from top to bottom
    email_regex = r"From: \S*@\S*\s?"
    non_alpha = '[^a-zA-Z]'
    multi_white_spaces = "[ ]{2,}"
    
    text = re.sub(email_regex, "", text)
    text = re.sub(non_alpha, ' ', text)
    text = re.sub(multi_white_spaces, " ", text)
    
    # apply case normalization 
    return text.lower().strip()
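
To see what the three substitutions do in order, here is a small usage sketch. The function body is repeated in compact form so the snippet runs on its own, and the sample string is adapted from the header of the first post shown earlier.

```python
import re


def clean_data(text):
    """Same substitutions as the function above, repeated so this snippet is self-contained."""
    text = re.sub(r"From: \S*@\S*\s?", "", text)  # drop the sender's email address
    text = re.sub("[^a-zA-Z]", " ", text)         # replace every non-letter with a space
    text = re.sub("[ ]{2,}", " ", text)           # collapse runs of spaces
    return text.lower().strip()


sample = "From: agr00@ccc.amdahl.com (Anthony G Rose)\nSubject: Re: Who's next?"
print(clean_data(sample))
# anthony g rose subject re who s next
```

Note that the apostrophe in "Who's" becomes a space, which leaves a stray single-letter token `s`; a vectorizer's default token pattern typically ignores such one-character tokens.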

### 1.2.4 Create and run a pipeline

In [19]:
# prep data, instantiate a model, create pipeline object, and run a gridsearch 

###BEGIN SOLUTION
# save our model input data to X


# save our targets/labels to Y 


# clean our docs 


# Create Pipeline Components

# create vectorizer
 # data transformer 

# create classifier
 # estimator 

# Instantiate a pipeline object -- which is a list of tuples
#   Each tuple specifies (name of the pipeline component, the pipeline component)
       # data transformer
                  # classifier 



In [20]:

# create a hyper-parameter dictionary for BOTH our vectorizer and our ML model 
# here we will determine which tfidf parameter values lead to the best performing model



# Instantiate a GridSearchCV object

# Note: For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.


###END SOLUTION

Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:    9.7s
[Parallel(n_jobs=-2)]: Done  96 out of  96 | elapsed:   20.7s finished


CPU times: user 2.4 s, sys: 237 ms, total: 2.64 s
Wall time: 21.4 s
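
The solution cells above are blank in this copy. The sketch below is one way to fill them in that is consistent with the log and results shown (a tf-idf vectorizer with English stop words, a random forest, 3 folds, 32 candidates, `n_jobs=-2`). The candidate values in `parameters` are assumptions, not the original grid.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

# Pipeline components: a data transformer followed by an estimator
vect = TfidfVectorizer(stop_words="english")
clf = RandomForestClassifier()

# A pipeline is a list of (name, component) tuples
pipe = Pipeline([("vect", vect), ("clf", clf)])

# Keys are "<step name>__<parameter>"; the candidate values are assumptions
parameters = {
    "vect__max_df": (0.75, 1.0),
    "vect__min_df": (2, 10),
    "vect__max_features": (500, 1000),
    "clf__n_estimators": (10, 100),
    "clf__max_depth": (10, 20),
}
print(len(ParameterGrid(parameters)))  # 2**5 = 32 candidates

gs = GridSearchCV(pipe, parameters, cv=3, n_jobs=-2, verbose=1)

# In the notebook, fitting would look like this (X_clean and y as used in later cells):
#   X_clean = [clean_data(doc) for doc in data.data]
#   y = data.target
#   gs.fit(X_clean, y)
```

Because the vectorizer sits inside the pipeline, its parameters are tuned jointly with the classifier's, at the cost of refitting it in every fold.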


Establishing a baseline accuracy with a naive model

In [21]:
frac_ones = y.sum()/len(y)
frac_ones

0.4400840925017519

Since the majority class is 0, the naive model predicts 0 for every document!

In [22]:
y_naive_pred = np.zeros(len(y))

Naive model error

In [23]:
frac_error = np.abs(y_naive_pred - y).sum()/len(y)
print(frac_error)

0.4400840925017519


Naive model accuracy

In [24]:
baseline_accuracy = 1-frac_error
print(baseline_accuracy)

0.559915907498248


Pipeline results after hyperparameter tuning!

In [25]:
gs.best_score_

0.8836812619784755

In [26]:
gs.best_params_

{'clf__max_depth': 20,
 'clf__n_estimators': 100,
 'vect__max_df': 1.0,
 'vect__max_features': 1000,
 'vect__min_df': 10}

In [27]:
best_model = gs.best_estimator_
best_model

Pipeline(memory=None,
         steps=[('vect',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=1000,
                                 min_df=10, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=20, max_features='auto',
                                        max_leaf_nodes=None, max_samples=None,


Getting your predictions using the pipeline

In [28]:
# because the vectorizer is part of the pipeline object,
# we can pass raw text straight to gs and it returns a classification
y_pred = gs.predict(X_clean)

In [29]:
# this is what you would submit to Kaggle
y_pred

array([1, 1, 0, ..., 1, 1, 0])

# 2. Latent Semantic Analysis (Learn)
a.k.a. Latent Semantic Indexing
<a id="p2"></a>

## Overview

![](https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1538411402/image3_maagmh.png)

**Takeaways:** LSA has two main benefits

1. Dimensionality Reduction 
2. Topic Modeling (feature engineering) - identifies latent (hidden) topics that are present in our doc-term matrix. <br>
This is something that counting vectorizers can't do (e.g., `CountVectorizer`, `TfidfVectorizer`)

In [30]:
from IPython.display import YouTubeVideo
YouTubeVideo('OvzJiur55vo', width=1024, height=576)

## 2.1 An example of Latent Semantic Analysis

Before we apply Latent Semantic Analysis in a pipeline, let's work through a simple example together in order to better understand how LSA works and develop an intuition along the way. 

First, if you haven't already, watch the short video provided above. We will be implementing the example from the video in our notebook. 

In [31]:
# Import

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

svd = TruncatedSVD(n_components=2, # number of topics to generate (also the size of the new feature space)
                   algorithm='randomized',
                   n_iter=10)

# let's use the same data that was used in the video for consistency 

        # topic 1 data 
data = ["pizza", 
        "pizza hamburger cookie",
        "hamburger", 
        # topic 2 data
        "ramen", 
        "sushi", 
        "ramen sushi"]

In [32]:
# CREATE Term-Frequency matrix 

###BEGIN SOLUTION
# use CountVectorizer to create a Term-Frequency matrix (a.k.a. Doc-Term Matrix )
 

# switch integer indices with terms

###END SOLUTION

Unnamed: 0,cookie,hamburger,pizza,ramen,sushi
pizza,0,0,1,0,0
pizza hamburger cookie,1,1,1,0,0
hamburger,0,1,0,0,0
ramen,0,0,0,1,0
sushi,0,0,0,0,1
ramen sushi,0,0,0,1,1


In [33]:
# Use SVD to transform our Term-Frequency matrix into a Topic matrix with reduced dimensionality


###BEGIN SOLUTION
# Use SVD to transform our Term-Frequency matrix into a Topic matrix with reduced dimensionality
 

# this is the output of SVD
# same number of rows 
# number of features has been reduced to 2 
 
###END SOLUTION

array([[ 0.63, -0.  ],
       [ 1.72,  0.  ],
       [ 0.63,  0.  ],
       [ 0.  ,  0.71],
       [-0.  ,  0.71],
       [ 0.  ,  1.41]])

In [34]:
# let's move X_reduced into a dataframe and rename the indices and columns for interpretability  

###BEGIN SOLUTION
# let's move X_reduced into a dataframe and rename the indices and columns for interpretability  

###END SOLUTION

Unnamed: 0,topic_1,topic_2
pizza,0.63,-0.0
pizza hamburger cookie,1.72,0.0
hamburger,0.63,0.0
ramen,0.0,0.71
sushi,-0.0,0.71
ramen sushi,0.0,1.41
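
The three blank solution cells above can be reproduced with a sketch like the following. The variable names (`vect`, `dtm_df`, `X_reduced`, `topics_df`) and the `random_state` are assumptions added for this example; the signs of the SVD components are arbitrary, so individual values may appear negated relative to the tables above.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pizza", "pizza hamburger cookie", "hamburger",  # topic 1
        "ramen", "sushi", "ramen sushi"]                 # topic 2

# Doc-term matrix: one row per document, one column per term
vect = CountVectorizer()
dtm = vect.fit_transform(docs)

# Replace integer column indices with the terms they stand for
terms = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
dtm_df = pd.DataFrame(dtm.toarray(), index=docs, columns=terms)

# Project the 5 term columns onto 2 latent topics
svd = TruncatedSVD(n_components=2, algorithm="randomized",
                   n_iter=10, random_state=42)
X_reduced = svd.fit_transform(dtm)  # same number of rows, only 2 columns

topics_df = pd.DataFrame(X_reduced, index=docs,
                         columns=["topic_1", "topic_2"]).round(2)
print(dtm_df)
print(topics_df)
```

Each document loads on only one of the two topics because the two groups of documents share no terms, which is what makes this toy example easy to read.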


## 2.2 Build a Latent Semantic Analysis (LSA) pipeline
Now that we've gone through an example of applying LSA on a small dataset, let's implement it in a classification pipeline to run on the `20newsgroups` data.  


In [35]:
# build a pipeline, incorporate SVD, and run a gridsearch 

###BEGIN SOLUTION -- ask svd to truncate to the best 100 principal components, i.e. find 100 "topics"
 

# instantiate the nested LSI pipeline object (vectorizer + SVD)
 

# instantiate the full pipeline object (LSI + classifier)
 

# a nice default starter set for hyper-parameter values
# include more parameters and values to try to increase model performance 
 



In [36]:
%%time
 
###END SOLUTION

Fitting 3 folds for each of 36 candidates, totalling 108 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 3 concurrent workers.
[Parallel(n_jobs=-2)]: Done  44 tasks      | elapsed:   33.7s
[Parallel(n_jobs=-2)]: Done 108 out of 108 | elapsed:  1.7min finished


CPU times: user 6.97 s, sys: 3.34 s, total: 10.3 s
Wall time: 1min 43s


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('lsi',
                                        Pipeline(memory=None,
                                                 steps=[('vect',
                                                         TfidfVectorizer(analyzer='word',
                                                                         binary=False,
                                                                         decode_error='strict',
                                                                         dtype=<class 'numpy.float64'>,
                                                                         encoding='utf-8',
                                                                         input='content',
                                                                         lowercase=True,
                                                                         max_df=1.0,
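
The solution cell for this pipeline is blank in this copy. Based on the step names visible in the output (`lsi`, `vect`) and in the best parameters reported further down (`lsi__svd__n_components`, `lsi__vect__max_df`, `clf__...`), a nested pipeline along these lines would match. The candidate values, and the reuse of English stop words, are assumptions chosen so the grid has 36 combinations.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.pipeline import Pipeline

# Ask SVD for 100 components, i.e. 100 "topics"
svd = TruncatedSVD(n_components=100, algorithm="randomized", n_iter=10)

# Inner pipeline: raw text -> tf-idf -> topic space
lsi = Pipeline([("vect", TfidfVectorizer(stop_words="english")), ("svd", svd)])

# Outer pipeline: topic features -> classifier
pipe = Pipeline([("lsi", lsi), ("clf", RandomForestClassifier())])

# Candidate values are assumptions: 3 * 3 * 2 * 2 = 36 combinations
parameters = {
    "lsi__vect__max_df": (0.9, 0.95, 1.0),
    "lsi__svd__n_components": (10, 100, 250),
    "clf__n_estimators": (100, 250),
    "clf__max_depth": (15, 20),
}
print(len(ParameterGrid(parameters)))  # 36

gs = GridSearchCV(pipe, parameters, cv=3, n_jobs=-2, verbose=1)
# gs.fit(X_clean, y)  # X_clean and y as prepared for the pipeline in Part 1
```

Note that `n_components` must be smaller than the number of tf-idf features, so large values only work when the vocabulary is large enough.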
             

What results can we get from the `gs` object?

In [40]:
dir(gs)

['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_impl',
 '_check_is_fitted',
 '_estimator_type',
 '_format_results',
 '_get_param_names',
 '_get_tags',
 '_more_tags',
 '_pairwise',
 '_required_parameters',
 '_run_search',
 'best_estimator_',
 'best_index_',
 'best_params_',
 'best_score_',
 'classes_',
 'cv',
 'cv_results_',
 'decision_function',
 'error_score',
 'estimator',
 'fit',
 'get_params',
 'iid',
 'inverse_transform',
 'multimetric_',
 'n_jobs',
 'n_splits_',
 'param_grid',
 'pre_dispatch',
 'predict',
 'predict_log_proba',
 'predict_proba',
 'refit',
 'refit_time_',
 'return_train_score',
 'score

In [41]:
gs.best_params_

{'clf__max_depth': 15,
 'clf__n_estimators': 250,
 'lsi__svd__n_components': 100,
 'lsi__vect__max_df': 0.9}

In [39]:
gs.best_score_

0.8731505233672415
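
Beyond `best_params_` and `best_score_`, the `cv_results_` attribute listed above holds the score of every candidate. The sketch below fits a deliberately tiny grid search on made-up data (the corpus, the `toy_gs` name, and the `LogisticRegression` classifier are all assumptions) and shows one way to read those results as a table.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up corpus: six "smoky" documents (class 0) and six "sweet" documents (class 1)
docs = ["smoky peat and tar", "peat smoke with brine", "tar smoke and iodine",
        "briny peat with ash", "ash and smoke on the finish", "iodine and peat smoke",
        "sweet vanilla and caramel", "caramel with honey and vanilla",
        "honey and toffee sweetness", "vanilla toffee and oak",
        "sweet oak with caramel", "toffee honey and vanilla"]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LogisticRegression())])
toy_gs = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=3)
toy_gs.fit(docs, labels)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to sort and read
results = pd.DataFrame(toy_gs.cv_results_)
cols = ["params", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Looking at the full table, rather than only the winner, shows whether the best candidate is clearly ahead or within noise of its neighbours.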

## Challenge

Continue to apply Latent Semantic Indexing (LSI) to various datasets. 

# 3. Word Embeddings with Spacy (Learn)
In this section we'll complete our preparation for Lambda School's [Whisky Classification Kaggle Competition](https://www.kaggle.com/c/whiskey-201911/)
<a id="p3"></a>

## Follow Along
1. Join the [Whisky Classification Kaggle Competition](https://www.kaggle.com/c/whiskey-201911/)
2. Download the data to your local machine, then upload the files to your Colab notebook by first clicking the **folder icon** in the left sidebar, then clicking the **folder with the up arrow icon** that appears under "Files" in the left sidebar. The files should now be in the /content folder. To get the path to an object that appears in the left sidebar, hover over it, click the three vertical dots that appear on the right, then select "Copy path".

## 3.1 Get the data
Download the `.csv` files from the [Whisky Classification Kaggle Competition](https://www.kaggle.com/c/whiskey-201911/) to your local machine, <br>
then upload them to this Colab notebook.

In [43]:
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')

In [44]:
train

Unnamed: 0,id,description,category
0,1,A marriage of 13 and 18 year old bourbons. A m...,2
1,2,There have been some legendary Bowmores from t...,1
2,3,This bottling celebrates master distiller Park...,2
3,4,What impresses me most is how this whisky evol...,1
4,9,"A caramel-laden fruit bouquet, followed by une...",2
...,...,...,...
2581,4146,"Earthy, fleshy notes with brooding grape notes...",1
2582,4153,With its overt floral perfume notes and the sc...,4
2583,4154,"An unaged whiskey from Carroll County, Iowa, w...",3
2584,4155,"Fiery peat kiln smoke, tar, and ripe barley on...",1


In [45]:
train['category'].value_counts()

1    1637
2     449
3     300
4     200
Name: category, dtype: int64

## 3.2 Build a classification model that is trained on the word vectors from `spacy`<br>
Question: does `spacy` use `CountVectorizer()`, `TfidfVectorizer()` or `word2vec` to numericalize text?<br>
Run the classification model on the Whisky data set and get a preliminary result<br>


In [51]:
%%time
# build a model that is trained on word vectors 

###BEGIN SOLUTION
def get_word_vectors(docs):
    """
    This serves as both our tokenizer and vectorizer. 
    Returns a list of document vectors, i.e. our doc-term matrix
    """
 
    

# raw text data for train and test sets
 

# transform raw data into doc-term matrices for train and test sets 
 

# save category labels to y vector
 

# create RF model, use out-of-bag (oob) score for a quick estimate of generalization performance
# For best results, however, you want to do hyperparameter tuning using GridSearchCV 
 


###END SOLUTION

CPU times: user 45.4 s, sys: 187 ms, total: 45.6 s
Wall time: 45.5 s
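
The solution cell above is blank in this copy. A minimal sketch of the idea follows: each document is turned into a single vector by spaCy, and those vectors become the feature matrix for a random forest. Passing `nlp` as an argument, the helper `fit_oob_forest`, and its default values are choices made for this sketch, not the original solution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def get_word_vectors(docs, nlp):
    """
    Return a list with one vector per document.

    `nlp` is a loaded spaCy pipeline with word vectors, for example
    spacy.load("en_core_web_md"); passing it in is a choice made for this sketch.
    """
    # nlp.pipe streams the texts; doc.vector defaults to the average of the token vectors
    return [doc.vector for doc in nlp.pipe(docs)]


def fit_oob_forest(X, y, n_estimators=100, random_state=42):
    """Fit a random forest and keep its out-of-bag score (hypothetical helper)."""
    rfc = RandomForestClassifier(n_estimators=n_estimators, oob_score=True,
                                 random_state=random_state)
    rfc.fit(np.vstack(X), y)  # stack the per-document vectors into a 2-D array
    return rfc


# Usage in the notebook (names assumed from the surrounding cells):
#   X_train = get_word_vectors(train["description"], nlp)
#   X_test = get_word_vectors(test["description"], nlp)
#   y_train = train["category"]
#   rfc = fit_oob_forest(X_train, y_train)
#   rfc.oob_score_
```

Because a document vector is an average over its tokens, every document maps to a fixed-length vector regardless of how long the description is.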


Questions: 
What information do the entries of `X_train` contain? 
Why does each element in `X_train` have the following shape?

In [56]:
X_train[0].shape

(300,)

In [53]:
# train set accuracy -- massively overfitted!
rfc.score(X_train, y_train)

1.0

In [54]:
# out-of-bag accuracy score, which can be thought of as a proxy for the test set score 
rfc.oob_score_

0.7300850734725445

## Challenge  -- this afternoon's lab module assignment

1. Join Lambda School's [Whisky Classification Kaggle Competition](https://www.kaggle.com/c/whiskey-201911/)
2. Download the data
3. Train a model & try: 
    - Create a Text Extraction & Classification Pipeline
    - Tune the pipeline with a `GridSearchCV` or `RandomizedSearchCV`
    - Add some Latent Semantic Indexing (LSI) into your pipeline. *Note:* You can grid search a nested pipeline, but you have to use double underscores, i.e., `lsi__svd__n_components`
    - Try to extract word embeddings with Spacy and use those embeddings as your features for a classification model.
4. Make a submission to Kaggle 

Note: You can put together your project from code snippets from the current Colab notebook. <br>
Alternatively, you can adapt and refactor this [Colab notebook](https://drive.google.com/file/d/1ZY-P33tXD5y-VucOjg2TXO5OAQBWuTLf/view?usp=sharing) to work with the Kaggle data for your project.
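
For the final step, predictions have to be written to a CSV file before uploading. The helper below is a hypothetical sketch: the column names `id` and `category` are assumed from the training data shown earlier, so check them against the competition's sample submission file before uploading.

```python
import pandas as pd


def make_submission(test_ids, predictions, path="submission.csv"):
    """
    Write one row per test document to a CSV file.

    The column names `id` and `category` are assumptions based on train.csv.
    """
    submission = pd.DataFrame({"id": list(test_ids),
                               "category": list(predictions)})
    submission.to_csv(path, index=False)  # no index column in the output file
    return submission


# Usage in the notebook (names assumed from earlier cells):
#   make_submission(test["id"], rfc.predict(X_test))
```

Returning the DataFrame makes it easy to inspect the first few rows before submitting.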

# Review

To review this module: 
* Continue working on the Kaggle competition
* Find another text classification task to work on