# 2. Natural Language Processing

This notebook preprocesses the post text and builds pipelines that classify posts from the two subreddits (Divorce and weddingplanning) using several classification models.

**Preprocessing and Modeling**
- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** two classification models, **BONUS:** try a Naive Bayes)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?

#### Imports

In [31]:
import pandas as pd
import numpy as np

from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk import WordNetLemmatizer

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
%store -r df stop_words

#### Split the data into train and test sets

In [13]:
X = df['selftext']
y = df['subreddit']

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [15]:
y_test.value_counts()

Divorce            997
weddingplanning    987
Name: subreddit, dtype: int64

In [16]:
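# Majority-class share of the test set; .iloc[0] selects by position (value_counts sorts descending)
print(f'Baseline accuracy is {round(y_test.value_counts().iloc[0]*100/len(y_test),1)}%.')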

Baseline accuracy is 50.3%.
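
The baseline above is the share of the majority class in the test set. A roughly equivalent check can be sketched with `DummyClassifier`, which is already imported. The toy labels and the `*_toy` names below are assumptions for illustration, not the notebook's data; note that the `most_frequent` strategy picks the majority class from the training labels.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels standing in for y_train / y_test (hypothetical, not the real posts)
y_train_toy = pd.Series(['Divorce'] * 6 + ['weddingplanning'] * 4)
y_test_toy = pd.Series(['Divorce'] * 3 + ['weddingplanning'] * 2)

# The 'most_frequent' strategy ignores the features, so a placeholder column is enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train_toy, y_train_toy)

# Accuracy of always predicting the majority training class
baseline_acc = dummy.score(X_test_toy, y_test_toy)
print(f'Baseline accuracy is {baseline_acc:.1%}.')
```

Any model worth keeping should beat this number on the test set.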


#### Encode subreddit

In [17]:
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)
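
To read the encoded predictions later, it helps to know which integer stands for which subreddit. `LabelEncoder` assigns codes in sorted order of the class labels; the small sketch below uses a hypothetical `le_demo` encoder on hand-written labels to show the mapping and how to reverse it.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical encoder fitted on a handful of labels, for illustration only
le_demo = LabelEncoder()
encoded = le_demo.fit_transform(
    ['weddingplanning', 'Divorce', 'Divorce', 'weddingplanning'])

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {str(label): code for code, label in enumerate(le_demo.classes_)}
print(mapping)

# inverse_transform turns integer codes back into subreddit names
decoded = le_demo.inverse_transform([0, 1])
print(list(decoded))
```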

#### Create Lemmatizer

In [24]:
def lemma_tokenizer(doc):  # Adapted from class notes
    # Tokenize the document, then reduce each token to its WordNet lemma
    wnl = WordNetLemmatizer()
    return [wnl.lemmatize(w) for w in word_tokenize(doc)]
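
A tokenizer function like this only affects the model once it is passed to the vectorizer through the `tokenizer` argument. The sketch below shows that wiring with `toy_tokenizer`, a hypothetical stand-in that strips a trailing "s" instead of calling WordNet, so it runs without downloading NLTK data. It is not the lemmatizer itself.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def toy_tokenizer(doc):
    """Hypothetical stand-in for lemma_tokenizer: split into words, strip a trailing 's'."""
    tokens = re.findall(r'[a-z]+', doc.lower())
    return [t[:-1] if t.endswith('s') and len(t) > 3 else t for t in tokens]

# token_pattern=None makes it explicit that the custom tokenizer replaces the default regex
cv_demo = CountVectorizer(tokenizer=toy_tokenizer, token_pattern=None)
counts = cv_demo.fit_transform(['Two venues, one planner',
                                'One venue, two planners'])

# Singular and plural forms collapse into a single vocabulary entry
vocab = sorted(cv_demo.vocabulary_)
print(vocab)
print(counts.toarray())
```

In a pipeline the same argument would go on the `'cv'` step, for example `CountVectorizer(tokenizer=lemma_tokenizer, ...)`. If a stop word list is also supplied, scikit-learn may warn when the tokenizer produces forms that the list does not contain.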

### Pipelines

#### Logistic Regression Pipeline

In [43]:
pipe_lr = Pipeline([
    ('cv', CountVectorizer(stop_words=stop_words,ngram_range=(1,1),min_df=2)),
    ('lr', LogisticRegression())
])
pipe_lr.fit(X_train, y_train_enc)
pipe_lr.score(X_train, y_train_enc), pipe_lr.score(X_test, y_test_enc)

(0.9374684820978316, 0.9122983870967742)

In [54]:
pipe_lr = Pipeline([
    ('cv', CountVectorizer(stop_words=stop_words,ngram_range=(1,3),min_df=4)),
    ('lr', LogisticRegression())
])
pipe_lr.fit(X_train, y_train_enc)
pipe_lr.score(X_train, y_train_enc), pipe_lr.score(X_test, y_test_enc)

(0.9379727685325264, 0.9122983870967742)

Grid search over the vectorizer's `ngram_range` and `min_df`:

In [56]:
params = {
    'cv__ngram_range': [(1, 1), (1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4)],
    'cv__min_df': np.arange(2, 9)
}

In [57]:
gs_lr = GridSearchCV(pipe_lr, param_grid=params, n_jobs=-1)

In [58]:
gs_lr.fit(X_train,y_train_enc);

In [59]:
gs_lr.best_params_

{'cv__min_df': 4, 'cv__ngram_range': (1, 3)}

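I feed these values (`min_df=4`, `ngram_range=(1, 3)`) back into the logistic regression pipeline; they are the settings used in the second pipeline cell above.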
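
Retyping the best parameters by hand is not strictly necessary: with the default `refit=True`, `GridSearchCV` refits the best pipeline on the full training data and exposes it as `best_estimator_`. The sketch below illustrates this on a hypothetical toy corpus (the `docs`, `labels`, `pipe`, `grid`, and `gs` names are assumptions, not the notebook's objects) and also shows how to inspect the cross-validated score of each parameter combination.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: two perfectly separable "subreddits"
docs = (['venue dress flowers guest list'] * 10
        + ['lawyer custody court settlement'] * 10)
labels = [1] * 10 + [0] * 10

pipe = Pipeline([('cv', CountVectorizer()), ('lr', LogisticRegression())])
grid = {'cv__ngram_range': [(1, 1), (1, 2)], 'cv__min_df': [1, 2]}

gs = GridSearchCV(pipe, param_grid=grid, cv=5)
gs.fit(docs, labels)

# One row per parameter combination, ranked by mean cross-validated accuracy
results = pd.DataFrame(gs.cv_results_)[
    ['param_cv__min_df', 'param_cv__ngram_range',
     'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))

# The refitted best pipeline can be scored directly, with no need to rebuild it
best_pipe = gs.best_estimator_
print(gs.best_params_, best_pipe.score(docs, labels))
```

In the notebook, the analogous call would be `gs_lr.best_estimator_.score(X_test, y_test_enc)`.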

#### Decision Tree Pipeline

In [61]:
pipe_dt = Pipeline([
    ('cv', CountVectorizer(stop_words=stop_words,ngram_range=(1,3),min_df=2)),
    ('dt', DecisionTreeClassifier(max_depth=7))
])
pipe_dt.fit(X_train, y_train_enc)
pipe_dt.score(X_train, y_train_enc), pipe_dt.score(X_test, y_test_enc)

(0.8472012102874432, 0.8397177419354839)

#### K Nearest Neighbors Pipeline

In [70]:
pipe_knn = Pipeline([
    ('cv', CountVectorizer(stop_words=stop_words,ngram_range=(1,3),min_df=4)),
    ('knn', KNeighborsClassifier(n_neighbors=5))
])
pipe_knn.fit(X_train, y_train_enc)
pipe_knn.score(X_train, y_train_enc), pipe_knn.score(X_test, y_test_enc)

(0.7925701798621617, 0.6834677419354839)

Grid search over the number of neighbors:

In [71]:
params_knn = {'knn__n_neighbors': np.arange(2, 15)}

In [72]:
gs_knn = GridSearchCV(pipe_knn, param_grid=params_knn, n_jobs=-1)

In [73]:
gs_knn.fit(X_train,y_train_enc);

In [69]:
gs_knn.best_estimator_

Pipeline(steps=[('cv',
                 CountVectorizer(min_df=2, ngram_range=(1, 3),
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourselves', 'you',
                                             "you're", "you've", "you'll",
                                             "you'd", 'your', 'yours',
                                             'yourself', 'yourselves', 'he',
                                             'him', 'his', 'himself', 'she',
                                             "she's", 'her', 'hers', 'herself',
                                             'it', "it's", 'its', 'itself', ...])),
                ('knn', KNeighborsClassifier())])

In [74]:
gs_knn.best_params_

{'knn__n_neighbors': 5}

In [75]:
gs_knn.best_score_

0.6817974288117266