
<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# NLP II: `CountVectorizer` and `TfidfVectorizer`

_Authors: Dave Yerrington (SF), Justin Pounders (ATL), Riley Dallas (ATX), Matt Brems (DC)_

---

<img src="https://snag.gy/uvESGH.jpg" alt="drawing" width="800"/>

$$
\begin{eqnarray*}
\textbf{Fun Fact:  } \text{Word Clouds} &\neq& \text{Data Science}
\end{eqnarray*}
$$

[If you want to generate a word cloud in the shape of something **for art only**, check here.](https://medium.com/hackernoon/what-real-fake-news-says-about-obamas-presidency-4bf42be71ff1)

## Learning Objectives
---

- Extract features from unstructured text by fitting and transforming with `CountVectorizer` and `TfidfVectorizer`.
- Describe how `CountVectorizer` and `TfidfVectorizer` work.
- Understand `stop_words`, `max_features`, `min_df`, `max_df`, and `ngram_range`.
- Implement `CountVectorizer` and `TfidfVectorizer` in a spam classification model.
- Use `GridSearchCV` and `Pipeline` with `CountVectorizer`.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# Import CountVectorizer and TfidfVectorizer from feature_extraction.text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Introduction to Text Feature Extraction

The models we've learned, like linear regression, logistic regression, and k-nearest neighbors, take in an `X` and a `y` variable.
- `X` is a matrix/dataframe of real numbers.
- `y` is a vector/series of real numbers.

Text data (also called natural language data) is not already organized as a matrix or vector of real numbers. We say that this data is **unstructured**.

> This lesson will focus on how to transform our unstructured text data into a numeric `X` matrix.

## Spam Classification Model

One common application of NLP is predicting "spam" vs. "ham," or "spam" vs. "not spam."

Can we predict real vs. promotional texts just based on what is written?

> This data set was taken from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection).

In [2]:
# Read in data.
df = pd.read_csv('../datasets/SMSSpamCollection',
                 sep='\t',
                 names=['label', 'message'])

# Check out first five rows.
df.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Basic terminology

---

- A collection of text is a **document**. 
    - You can think of a document as a row in your feature matrix.
- A collection of documents is a **corpus**. 
    - You can think of your full dataframe as the corpus.

<details><summary>In this specific example, what is a document?</summary>
    
- Each text message in our data set is one document. 
- There are 5,572 documents in our corpus.
</details>

## Model prep
---

Convert ham/spam into binary labels:
- 0 for ham
- 1 for spam

In [3]:
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()

| | label | message |
|---|---|---|
| 0 | 0 | Go until jurong point, crazy.. Available only ... |
| 1 | 0 | Ok lar... Joking wif u oni... |
| 2 | 1 | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | 0 | U dun say so early hor... U c already then say... |
| 4 | 0 | Nah I don't think he goes to usf, he lives aro... |


Let's set up our data for modeling:
- `X` will be the `message` column. **NOTE**: `CountVectorizer` requires a vector, so make sure you set `X` to be a `pandas` Series, **not** a DataFrame.
- `y` will be the `label` column
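
To see why the Series/DataFrame distinction matters, here is a minimal sketch on a hypothetical two-message dataframe (`toy_df` is an illustration, not part of the lesson's data). Iterating over a DataFrame yields its column names, so the vectorizer would treat the string `'message'` as its only document.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-message dataframe standing in for the SMS data.
toy_df = pd.DataFrame({'message': ['free entry to win', 'ok see you later']})

# Passing a Series: the vectorizer iterates over the two messages.
series_shape = CountVectorizer().fit_transform(toy_df['message']).shape

# Passing a DataFrame: iteration yields column names, so the only
# "document" the vectorizer sees is the string 'message'.
frame_shape = CountVectorizer().fit_transform(toy_df[['message']]).shape

print(series_shape)  # two documents, eight distinct tokens
print(frame_shape)   # one "document", one token
```

No error is raised in the DataFrame case, which is what makes this mistake easy to miss.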

In [4]:
X = df['message']
y = df['label']

In [5]:
# Check the class balance, which we should always do in a classification problem.
y.value_counts()

0    4825
1     747
Name: label, dtype: int64

In [6]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

## `CountVectorizer`
---

The easiest way for us to convert text data into a structured, numeric `X` dataframe is to use `CountVectorizer`.

We'll start by fitting a default CountVectorizer, then look at various hyperparameters.

In [7]:
# Instantiate a CountVectorizer.
cvec = CountVectorizer()

In [8]:
# Fit the vectorizer on our corpus.
cvec.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [9]:
# Transform the corpus.
X_train = cvec.transform(X_train)

<img src="../images/countvectorizer.png" alt="drawing" width="750"/>

[Source](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061).

In [10]:
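# Convert X_train into a DataFrame.
# Note: get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onward.

X_train_df = pd.DataFrame(X_train.toarray(),
                          columns=cvec.get_feature_names_out())
X_train_df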

Unnamed: 0,00,000,000pes,008704050406,0089,0121,01223585334,02,0207,02072069400,...,zed,zeros,zhong,zindgi,zoe,zogtorius,zyada,èn,ú1,〨ud
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3728,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3729,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3730,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3731,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Unstructured text data carries a lot of information.
- When we force unstructured text data to follow a "spreadsheet" or "dataframe" structure, we might lose some of that information.
- For example, `CountVectorizer` creates a vector (column) for each token and counts the occurrences of each token in each document, but it does not record where in the document each token appeared.

Our tokens are now stored as a **bag-of-words**. This is a simplified way of looking at and storing our data. 
- Bag-of-words representations discard grammar, order, and structure in the text but track occurrences.
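
As a small illustration of what "discarding order" means, the sketch below vectorizes two hypothetical sentences (`toy_corpus` is made up for this example) that contain the same words in a different order. Under a bag-of-words representation they should come out as identical rows.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: same words, opposite meaning.
toy_corpus = ['the dog chased the cat', 'the cat chased the dog']

toy_cvec = CountVectorizer()
toy_counts = toy_cvec.fit_transform(toy_corpus)

# vocabulary_ maps each token to its column index; sorting by that
# index recovers the column order of the count matrix.
columns = sorted(toy_cvec.vocabulary_, key=toy_cvec.vocabulary_.get)
toy_bow = pd.DataFrame(toy_counts.toarray(), columns=columns)

print(toy_bow)
```

Both rows hold the same counts, so any model trained on this matrix cannot tell the two sentences apart.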

In [11]:
# Transform test
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().)
X_test = cvec.transform(X_test)
X_test_df = pd.DataFrame(X_test.toarray(),
                         columns=cvec.get_feature_names_out())

X_test_df.head()

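| | 00 | 000 | 000pes | 008704050406 | 0089 | 0121 | 01223585334 | 02 | 0207 | 02072069400 | ... | zed | zeros | zhong | zindgi | zoe | zogtorius | zyada | èn | ú1 | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |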


At this point, we could fit a model (like a logistic regression model or $k$-nearest neighbors model) on `X_train_df` and `y_train`, then evaluate our model with `X_test_df` and `y_test`.



<details><summary>What might be some of the advantages of using this bag-of-words approach when modeling?</summary>

- Efficient to store.
- Efficient to model.
- Keeps a decent amount of information.
</details>

<details><summary>What might be some of the disadvantages of using this bag-of-words approach when modeling?</summary>

- Since bag-of-words models discard grammar, order, structure, and context, we lose a decent amount of information.
- Phrases like "not bad" or "not good" won't be interpreted properly.
</details>

Before modeling, let's examine some of the different hyperparameters of `CountVectorizer`:
- `stop_words`
- `max_features`, `max_df`, `min_df`
- `ngram_range`

<details><summary>Remind me: what is a hyperparameter?</summary>

- A hyperparameter is a built-in option that affects our model, but that our model cannot learn from our data!
- Examples of hyperparameters include:
    - the value of $k$ and the distance metric in $k$-nearest neighbors,
    - our regularization constants $\alpha$ or $C$ in linear and logistic regression.
</details>

## Stopwords

---

Some words are so common that they may not provide legitimate information about the $Y$ variable we're trying to predict.

In [47]:
# Let's look at sklearn's stopwords.
print(CountVectorizer(stop_words = 'english').get_stop_words())

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

`CountVectorizer` gives you the option to eliminate stopwords from your corpus when instantiating your vectorizer.

```python
cvec = CountVectorizer(stop_words='english')
```

You can optionally pass your own list of stopwords that you'd like to remove.
```python
cvec = CountVectorizer(stop_words=['list', 'of', 'words', 'to', 'stop'])
```
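
To make the effect of `stop_words` concrete, here is a minimal sketch on a hypothetical two-document corpus (`toy_corpus` is an assumption for illustration). It compares the learned vocabulary with no stopword removal, with the built-in English list, and with a custom list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the SMS data set.
toy_corpus = ['the prize is free', 'claim the free prize']

# Default: every token of two or more characters is kept.
vocab_all = sorted(CountVectorizer().fit(toy_corpus).vocabulary_)

# Built-in English list: common words such as 'the' and 'is' are dropped.
vocab_english = sorted(
    CountVectorizer(stop_words='english').fit(toy_corpus).vocabulary_
)

# Custom list: only the words we name are dropped.
vocab_custom = sorted(
    CountVectorizer(stop_words=['prize']).fit(toy_corpus).vocabulary_
)

print(vocab_all)
print(vocab_english)
print(vocab_custom)
```

Note that a custom list replaces the built-in list rather than extending it, so `'the'` and `'is'` remain in `vocab_custom`.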

## Vocabulary size

---
One downside to `CountVectorizer` is that its vocabulary (`cvec.get_feature_names_out()`) can get really large. We're creating one column for every unique token in our corpus!

There are three hyperparameters to help you control this.

1. You can set `max_features` to only include the $N$ most frequent vocabulary words in the corpus.

```python
cvec = CountVectorizer(max_features=1_000) # Only the top 1,000 words from the entire corpus will be saved
```

2. You can tell `CountVectorizer` to only consider words that occur in **at least** some number of documents.

```python
cvec = CountVectorizer(min_df=2) # A word must occur in at least two documents from the corpus
```

3. Conversely, you can tell `CountVectorizer` to only consider words that occur in **at most** some percentage of documents.

```python
cvec = CountVectorizer(max_df=.98) # Ignore words that occur in > 98% of the documents from the corpus
```

Both `max_df` and `min_df` can accept either an integer or a float.
- An integer is an absolute number of documents.
- A float between 0.0 and 1.0 is a proportion of documents.

<details><summary>Why might we want to control these vocabulary size hyperparameters?</summary>
    
- If we have too many features, our models may take a **very** long time to fit.
- Control for overfitting/underfitting.
- Words in 99% of documents or words occurring in only one document might not be very informative.
</details>
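
The three vocabulary-size hyperparameters can be compared side by side on a small example. The sketch below uses a hypothetical four-document corpus in which `'now'` appears in every document and `'win'` in two; the helper `vocab` is introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 'now' is in all 4 documents, 'win' in 2,
# and every other token in exactly 1.
toy_corpus = ['win cash now', 'win prize now', 'see you now', 'call me now']

def vocab(**kwargs):
    """Return the sorted vocabulary learned with the given settings."""
    return sorted(CountVectorizer(**kwargs).fit(toy_corpus).vocabulary_)

vocab_default = vocab()               # every unique token
vocab_min_df = vocab(min_df=2)        # tokens in at least 2 documents
vocab_max_df = vocab(max_df=0.75)     # drop tokens in more than 75% of documents
vocab_top1 = vocab(max_features=1)    # keep only the most frequent token

print(len(vocab_default), vocab_min_df, len(vocab_max_df), vocab_top1)
```

Here `min_df=2` keeps only `'now'` and `'win'`, `max_df=0.75` removes only `'now'`, and `max_features=1` keeps only `'now'`.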

## N-Gram Range
---

`CountVectorizer` has the ability to capture $n$-word phrases, also called $n$-grams. Consider the following:

> The quick brown fox jumped over the lazy dog.

In the example sentence, the 2-grams are:
- 'the quick'
- 'quick brown'
- 'brown fox'
- 'fox jumped'
- 'jumped over'
- 'over the'
- 'the lazy'
- 'lazy dog'

The `ngram_range` determines what $n$-grams should be considered as features.

```python
cvec = CountVectorizer(ngram_range=(1,2)) # Captures every 1-gram and every 2-gram
```

<details><summary>How many 3-grams would be generated from the phrase "the quick brown fox jumped over the lazy dog?"</summary>

- Seven 3-grams.
    - 'the quick brown'
    - 'quick brown fox'
    - 'brown fox jumped'
    - 'fox jumped over'
    - 'jumped over the'
    - 'over the lazy'
    - 'the lazy dog'
</details>

<details><summary>Why might we want to change ngram_range to something other than (1,1)?</summary>

- We can work with multi-word phrases like "not good" or "very hot."
</details>
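
The counts above can be checked with `CountVectorizer` itself. This is a minimal sketch; the helper `ngrams` is hypothetical and exists only for this example. Keep in mind that the vocabulary holds *unique*, lowercased $n$-grams, which happens to match the full list here because no 2-gram or 3-gram repeats in the sentence.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ['The quick brown fox jumped over the lazy dog.']

def ngrams(n):
    """Return the sorted unique n-grams that CountVectorizer extracts."""
    cvec_n = CountVectorizer(ngram_range=(n, n))
    cvec_n.fit(sentence)
    return sorted(cvec_n.vocabulary_)

bigrams = ngrams(2)
trigrams = ngrams(3)

print(len(bigrams), bigrams)
print(len(trigrams), trigrams)
```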

## Modeling

---

We may want to test lots of different values of hyperparameters in our CountVectorizer.

<details><summary>Is CountVectorizer an estimator or a transformer?</summary>
    
- A transformer.
</details>

<details><summary>Why do we need a pipeline to GridSearch over our CountVectorizer hyperparameters?</summary>
    
- The CountVectorizer is a transformer.
- Transformers have .fit() and .transform() methods, but cannot do .predict().
- In order to GridSearch over hyperparameters, we need some way to score our model performance.
- A pipeline stacks together one or more transformers with an estimator at the end. The estimator allows us to .predict() and get a score!
</details>
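
The point about `.predict()` can be checked directly. The sketch below uses a hypothetical four-message corpus (`toy_messages` and `toy_labels` are made up) to show that a bare `CountVectorizer` has no `predict` method, while a pipeline ending in an estimator does.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = spam, 0 = ham.
toy_messages = ['win cash now', 'free prize win',
                'see you at lunch', 'call me later']
toy_labels = [1, 1, 0, 0]

toy_pipe = Pipeline([
    ('cvec', CountVectorizer()),   # transformer: text -> counts
    ('lr', LogisticRegression())   # estimator: counts -> prediction
])
toy_pipe.fit(toy_messages, toy_labels)

vectorizer_can_predict = hasattr(CountVectorizer(), 'predict')
pipeline_can_predict = hasattr(toy_pipe, 'predict')
toy_accuracy = toy_pipe.score(toy_messages, toy_labels)

print(vectorizer_can_predict, pipeline_can_predict, toy_accuracy)
```

Because the pipeline can be scored, `GridSearchCV` can compare different vectorizer settings by the accuracy of the model that follows them.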

In [13]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

## Baseline accuracy

We need to calculate baseline accuracy in order to tell whether our model is better than the null model (always predicting the plurality class).

In [14]:
y_test.value_counts(normalize=True)

0    0.865688
1    0.134312
Name: label, dtype: float64
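
One way to make the null model explicit is `DummyClassifier` with `strategy='most_frequent'`. The sketch below is an illustration only: it rebuilds a label vector from the class counts printed earlier (4,825 ham and 747 spam across the full data set), so its result is the full-data baseline and will differ slightly from the test-set proportion shown above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels rebuilt from the class counts reported by y.value_counts().
y_all = np.array([0] * 4825 + [1] * 747)

# The null model ignores features, so a placeholder column is enough.
X_placeholder = np.zeros((len(y_all), 1))

null_model = DummyClassifier(strategy='most_frequent')
null_model.fit(X_placeholder, y_all)

# Accuracy of always predicting the plurality class (ham).
baseline_accuracy = null_model.score(X_placeholder, y_all)
print(baseline_accuracy)
```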

In [15]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. LogisticRegression (estimator)

pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver = 'lbfgs'))
])

## `GridSearchCV`
---

At this point, you could use your `pipe` object as a model:

```python
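# Estimate how your model will perform on unseen data
# (cross_val_score was not imported at the top of the notebook)
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X_train, y_train, cv=3).mean()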

# Fit your model
pipe.fit(X_train, y_train)

# Training score
pipe.score(X_train, y_train)

# Test score
pipe.score(X_test, y_test)
```

Since we want to tune over the `CountVectorizer`, we'll load our `pipe` object into `GridSearchCV`.

In [16]:
# Search over the following values of hyperparameters:
# Maximum number of features fit: 2000, 3000, 4000, 5000
# Minimum number of documents a token must appear in: 2, 3
# Maximum proportion of documents a token may appear in: 90%, 95%
# Check (individual tokens) and also check (individual tokens and 2-grams).

pipe_params = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)]
}

In [17]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # which parameter values are we searching?
                  cv=5) # 5-fold cross-validation.

<details><summary>How many models are we fitting here?</summary>

- 4 max_features
- 2 min_df
- 2 max_df
- 2 ngram_range
- 5-fold CV
- 4 * 2 * 2 * 2 * 5 = 160 models
</details>
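
The same arithmetic can be checked with `ParameterGrid`, which enumerates every combination in a parameter dictionary. This is a minimal sketch; the name `grid` is a stand-in that repeats the values from `pipe_params`.

```python
from sklearn.model_selection import ParameterGrid

# Same values as pipe_params, repeated here so the example is self-contained.
grid = {
    'cvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1, 1), (1, 2)]
}

n_combinations = len(ParameterGrid(grid))  # 4 * 2 * 2 * 2
n_folds = 5
n_cv_fits = n_combinations * n_folds

print(n_combinations, n_cv_fits)
```

With the default `refit=True`, `GridSearchCV` also refits the best combination once on the full training set after the cross-validation fits.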

In [18]:
# Fit GridSearch to training data.
gs.fit(X_train, y_train)

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('cvec',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [19]:
# What's the best score?
print(gs.best_score_)

0.9823167558527228


In [20]:
# Save best model as gs_model.

gs_model = gs.best_estimator_

In [21]:
# Score model on training set.
gs_model.score(X_train, y_train)

0.996249665148674

In [22]:
# Score model on testing set.
gs_model.score(X_test, y_test)

0.9804241435562806

Congratulations! We've used `CountVectorizer` to transform our text data into something we can pass into a model.

But what if we want to do something more than just count up the occurrence of each token?

## Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer

---

When modeling, which kind of word do you think tends to be the most helpful?
- Words that are common across all documents.
- Words that are rare across all documents.
- Words that are common in some documents and rare in others.

<details><summary>Answer:</summary>

- Words that are common in certain documents but rare in other documents tend to be more informative than words that are common in all documents or rare in all documents.
- Example: If we were examining poetry over time, the word "thine" might be common in some documents but rare in most documents. The word "thine" is probably pretty informative in this case.
</details>

TF-IDF is a score that tells us which words are important to one document, relative to all other documents. Words that occur often in one document but don't occur in many documents contain more predictive power.

Variations of the TF-IDF score are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
- If you want to see how it can be calculated, check out [the Wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) and [`sklearn`](https://scikit-learn.org/stable/modules/feature_extraction.html#tfidf-term-weighting) page.
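
As a rough sketch of the calculation, the code below reproduces the default `TfidfVectorizer` output by hand on a hypothetical three-document corpus. It assumes the default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`), under which `sklearn` uses $\text{idf}(t) = \ln\frac{1 + n}{1 + \text{df}(t)} + 1$ and then scales each row to unit length. Other references, including the Wikipedia page, define the score slightly differently.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus; 'now' appears in every document.
toy_corpus = ['win cash now', 'win a prize now', 'see you now']

# Term frequencies: raw counts per document.
counts = CountVectorizer().fit_transform(toy_corpus).toarray()
n_docs = counts.shape[0]

# Document frequency: number of documents containing each token.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (sklearn's default form).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit Euclidean length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_corpus).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

A token that appears in every document, like `'now'`, gets the minimum IDF of 1, so it is weighted less than rarer tokens in the same row.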

<img src="../images/tfidfvectorizer.png" alt="drawing" width="750"/>

[Source](https://towardsdatascience.com/nlp-learning-series-part-2-conventional-methods-for-text-classification-40f2839dd061).

## Practice Using the `TfidfVectorizer`

---

`sklearn` provides a TF-IDF vectorizer that works similarly to the CountVectorizer.
- The arguments `stop_words`, `max_features`, `min_df`, `max_df`, and `ngram_range` also work here.

As you did above, instantiate the default `TfidfVectorizer`, then fit the spam and ham data.

In [23]:
# Instantiate the transformer.
tvec = TfidfVectorizer()

In [24]:
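# Use a new name so we don't overwrite the original `df`.
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().)
tfidf_df = pd.DataFrame(tvec.fit_transform(X_train).toarray(),
                        columns=tvec.get_feature_names_out())
tfidf_df.head()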

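| | 00 | 000 | 000pes | 008704050406 | 0089 | 0121 | 01223585334 | 02 | 0207 | 02072069400 | ... | zed | zeros | zhong | zindgi | zoe | zogtorius | zyada | èn | ú1 | 〨ud |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |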


In [25]:
X_train = tvec.fit_transform(X_train)

X_test = tvec.transform(X_test)

In [26]:
# Instantiate logistic regression.
lr = LogisticRegression(solver = 'lbfgs')

# Fit logistic regression.
lr.fit(X_train, y_train)

# Evaluate logistic regression.
print(f'Training Score: {lr.score(X_train, y_train)}')
print(f'Testing Score: {lr.score(X_test, y_test)}')

Training Score: 0.9726761317974819
Testing Score: 0.9711799891245242


## Interview Question

## (BONUS) How is the information from vectorizers stored efficiently?

When you CountVectorize the training text messages, you get 3,733 rows and 6,935 features... this is 25,888,355 entries. That's a lot of data to store in a dataframe!

<details><summary>How many of these values are zero?</summary>

- Over 99% of all values are zero!
</details>

Instead of storing all those zeroes, `sklearn` automatically stores these as a sparse matrix. It saves **a lot** of space.
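
To see how sparsity can be measured, here is a minimal sketch on a hypothetical four-document corpus. A SciPy sparse matrix exposes `.nnz`, the number of explicitly stored entries, which can be compared against the total number of cells. The same calculation should apply to the full training matrix.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: 4 documents, 8 unique tokens.
toy_corpus = ['win cash now', 'win prize now', 'see you now', 'call me now']

X_sparse = CountVectorizer().fit_transform(toy_corpus)

n_cells = X_sparse.shape[0] * X_sparse.shape[1]  # rows * columns
n_stored = X_sparse.nnz                          # entries actually stored
share_zero = 1 - n_stored / n_cells

print(sparse.issparse(X_sparse), n_cells, n_stored, share_zero)
```

Even in this tiny example most cells are zero; with thousands of short messages and thousands of tokens, the share of zeros grows much higher.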

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)

cvec = CountVectorizer()

X_train = cvec.fit_transform(X_train)

print(type(X_train))
print(X_train[0])

<class 'scipy.sparse.csr.csr_matrix'>
  (0, 6885)	1
  (0, 3407)	1
  (0, 6754)	1
  (0, 3977)	1
  (0, 4462)	1
  (0, 4368)	1
  (0, 853)	1
  (0, 2888)	1
  (0, 3155)	1
