# Modeling
**Purpose**  
Determine the best parameters for the models and test their performance as more training data is used.

*High-Level Approach*
 * Compute baseline accuracy for 3 datasets
 * Build logistic regression model using both cvec and tfidf
 * Build Multinomial Naive Bayes model using cvec
 * Build Gaussian Naive Bayes model using tfidf
 
First we use only text data and grid search to find best params.  
Then we fit a model using these params and all data, both text and numerical.  

*Why these models?*  

Logistic Regression and Naive Bayes are industry standards for binary classification problems. Logistic Regression is best suited to this demonstration because it performs strongly, fits quickly, and can handle float and negative feature values. The Naive Bayes model is limited to the vectorized text data and is also slower to fit. Hence, Logistic Regression is the clear choice for a production model here.
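
As a rough illustration of the point about float and negative values, the sketch below fits both model types on a tiny made-up feature matrix. The data is purely hypothetical and is not drawn from the project datasets.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy feature matrix containing floats and negative values (illustrative only)
X = np.array([[1.5, -0.2], [0.3, 2.0], [2.2, -1.1], [0.1, 1.7]])
y = np.array([0, 1, 0, 1])

# Logistic regression accepts any real-valued features
lr_fitted = LogisticRegression(solver='liblinear').fit(X, y)

# MultinomialNB expects non-negative, count-like features and rejects negatives
try:
    MultinomialNB().fit(X, y)
    mnb_error = None
except ValueError as err:
    mnb_error = str(err)

print('LogisticRegression classes:', lr_fitted.classes_)
print('MultinomialNB error:', mnb_error)
```

This is why the Naive Bayes models in this notebook only ever see the vectorized text.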

**Credit:** This notebook borrows heavily from Noah's 5.02 lab review, particularly the logistic regression grid search.

In [6]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction import text as text_sk
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin # this allows us to create a custom transformer

In [7]:
# Final list of stopwords
web_link_stop_words = ['amp', 'webp', 'www', 'https', 'http', 'png', 'pjpg', 'com']
reddit_stop_words = ['reddit', 'poll', 'redd']
bland_bigrams_stop_words = ['hope', 'enjoy']

custom_stop_words = text_sk.ENGLISH_STOP_WORDS.union(web_link_stop_words).union(reddit_stop_words).union(bland_bigrams_stop_words)

---

# Load and Split Data

In [8]:
# Load the data
small = pd.read_csv('../data/clean/small.csv')
medium = pd.read_csv('../data/clean/medium.csv')
large = pd.read_csv('../data/clean/large.csv')

data = [small, medium, large]
data_dict = {'small':small, 'medium':medium, 'large':large}

In [70]:
# Use text data for gridsearching
X_all = [df['all_text'] for df in data]
y_all = [df['subreddit'] for df in data]

X_dict_all = {size:val['all_text'] for size,val in data_dict.items()}
y_dict_all = {size:val['subreddit'] for size,val in data_dict.items()}

# Keys = ['small', 'medium', 'large'] and values = [X_train, X_test, y_train, y_test]
split_all = {size: train_test_split(X, y_dict_all[size], stratify=y_dict_all[size], random_state=18)
             for size, X in X_dict_all.items()}
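
To show what `stratify` does in the split above, here is a minimal sketch on a hypothetical 60/40 toy frame. The column names mirror the notebook's, but the rows are invented.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a 60/40 class balance
toy = pd.DataFrame({
    'all_text': [f'post {i}' for i in range(100)],
    'subreddit': [1] * 60 + [0] * 40,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy['all_text'], toy['subreddit'], stratify=toy['subreddit'], random_state=18
)

# With stratification, the share of class 1 is preserved in both splits
train_share = y_train.value_counts(normalize=True)[1]
test_share = y_test.value_counts(normalize=True)[1]
print('Train share of class 1:', train_share)
print('Test share of class 1:', test_share)
```

Keeping the class balance identical across splits makes the baseline and model scores comparable between train and test.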

---
# Baseline Model

In [72]:
baseline_scores = {size:max(splits[3].value_counts(normalize=True)) for size,splits in split_all.items()}
baseline_scores

{'small': 0.6, 'medium': 0.5471698113207547, 'large': 0.5461825947677522}

**Interpretation:** The baseline (always predicting the majority class) achieves 60%, 54.7%, and 54.6% accuracy on the small, medium, and large datasets, respectively.
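
An equivalent way to get a majority-class baseline is scikit-learn's `DummyClassifier`. The sketch below uses invented labels. Note one difference from the cell above: `DummyClassifier` learns the majority class from the training labels, whereas the cell above reads the class shares straight from the test labels. With a stratified split the two should agree closely.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Invented labels with a 60/40 balance in both splits
y_train = np.array([1] * 6 + [0] * 4)
y_test = np.array([1] * 3 + [0] * 2)

# Features are ignored by the dummy model, so placeholders are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)
print('Baseline accuracy:', baseline_acc)
```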

# Logistic Regression - CountVectorizer & TFIDF Vectorizer

In [73]:
scores = {}
params = {}

In [74]:
pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('tf', TfidfTransformer()),
    ('lr', LogisticRegression(solver = 'liblinear'))
])

pipe_params = {
    'cv__max_features': [1500, 2500, 3500],
    'cv__ngram_range' : [(1,1), (1,2), (1,3)],
    'cv__stop_words'  : [custom_stop_words],
    'tf__use_idf': [True, False] # if True, applies TF-IDF weighting; if False, keeps term counts but still L2-normalizes each row
}

In [75]:
%%time

for size,split in split_all.items():

    X_train, X_test, y_train, y_test = split

    gs = GridSearchCV(pipe, param_grid= pipe_params, cv = 5,
                      scoring = 'accuracy', n_jobs = 7)

    gs.fit(X_train, y_train)

    # Log scores and params
    scores[size] = [gs.score(X_train, y_train), gs.score(X_test, y_test)]
    params[size] = gs.best_params_
    
    if 'reddit' in params[size]['cv__stop_words']:
        params[size]['cv__stop_words'] = 'custom_stop_words'

    print('Train:', gs.score(X_train, y_train), 'Test:', gs.score(X_test, y_test))

Train: 0.892 Test: 0.832
Train: 0.9134591194968553 Test: 0.8633962264150944
Train: 0.9200783196867213 Test: 0.8814735718099306
CPU times: user 6.09 s, sys: 363 ms, total: 6.45 s
Wall time: 1min 22s


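**Interpretation:** The grid search is fast, and the test scores are on par with or above those of the other models (highest on the large dataset). However, train scores exceed test scores by roughly 4 to 6 percentage points, so the model is somewhat overfit.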

In [76]:
params

{'small': {'cv__max_features': 1500,
  'cv__ngram_range': (1, 1),
  'cv__stop_words': 'custom_stop_words',
  'tf__use_idf': False},
 'medium': {'cv__max_features': 2500,
  'cv__ngram_range': (1, 2),
  'cv__stop_words': 'custom_stop_words',
  'tf__use_idf': True},
 'large': {'cv__max_features': 3500,
  'cv__ngram_range': (1, 3),
  'cv__stop_words': 'custom_stop_words',
  'tf__use_idf': True}}
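
The `tf__use_idf` setting deserves a note: with it switched off, `TfidfTransformer` skips the IDF weighting but, under its default `norm='l2'`, still rescales each document row to unit length. The sketch below checks this on a two-document toy corpus (made up for illustration).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['cats chase mice', 'dogs chase cats cats']  # toy corpus

counts = CountVectorizer().fit_transform(docs)
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts)

raw = counts.toarray()
scaled = tf_only.toarray()

# Each use_idf=False row should equal the raw count row divided by its L2 norm
expected = raw / np.linalg.norm(raw, axis=1, keepdims=True)
rows_match = np.allclose(scaled, expected)
print('Rows are L2-normalized counts:', rows_match)
```

So the `use_idf=False` branch of the grid is close to, but not identical to, a plain `CountVectorizer`.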

---

# Naive Bayes

## CountVectorizer with MultinomialNB

In [87]:
scores_mnnb = {}
params_mnnb = {}

In [89]:
pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('mnb', MultinomialNB())
])

pipe_params = {'vectorizer__max_features': [1500, 2500, 3500, 4500],
               'vectorizer__ngram_range': [(1,1), (1,2), (1,3)],
               'vectorizer__stop_words': ['english', custom_stop_words],
               'mnb__alpha': np.linspace(2.5,4,5)
              }

In [90]:
%%time

for size,split in split_all.items():

    X_train, X_test, y_train, y_test = split

    gs_cv_mnb = GridSearchCV(pipe,
                             param_grid = pipe_params,
                             n_jobs = 7,
                             cv = 5,
                             scoring = 'accuracy')

    gs_cv_mnb.fit(X_train, y_train)

    # Log scores and params
    scores_mnnb[size] = [gs_cv_mnb.score(X_train, y_train), gs_cv_mnb.score(X_test, y_test)]
    params_mnnb[size] = gs_cv_mnb.best_params_
    
    if 'reddit' in params_mnnb[size]['vectorizer__stop_words']:
        params_mnnb[size]['vectorizer__stop_words'] = 'custom_stop_words'

    print('Train:', gs_cv_mnb.score(X_train, y_train), 'Test:', gs_cv_mnb.score(X_test, y_test))

Train: 0.8937777777777778 Test: 0.8333333333333334
Train: 0.8925786163522013 Test: 0.8641509433962264
Train: 0.8948024207903168 Test: 0.8750667378537106
CPU times: user 19.5 s, sys: 1.6 s, total: 21.1 s
Wall time: 7min 28s


**Interpretation:** Training scores stay near 0.89 as the sample size increases, and testing scores approach the training scores. The grid search runs slower than the logistic regression search (7min 28s versus 1min 22s wall time), though this grid also covers more parameter combinations. Test accuracy is approximately equal to that of logistic regression.
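
When comparing wall times, it helps to count how many candidates each grid search evaluates. The sketch below rebuilds the two parameter grids from this notebook and counts them with `ParameterGrid`; the one-word `custom_stop_words` here is only a stand-in for the real set.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

custom_stop_words = frozenset(['reddit'])  # stand-in for the full custom set

lr_grid = {
    'cv__max_features': [1500, 2500, 3500],
    'cv__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'cv__stop_words': [custom_stop_words],
    'tf__use_idf': [True, False],
}

mnb_grid = {
    'vectorizer__max_features': [1500, 2500, 3500, 4500],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vectorizer__stop_words': ['english', custom_stop_words],
    'mnb__alpha': np.linspace(2.5, 4, 5),
}

n_lr = len(ParameterGrid(lr_grid))
n_mnb = len(ParameterGrid(mnb_grid))
cv_folds = 5

print('Logistic regression candidates:', n_lr, 'fits per dataset:', n_lr * cv_folds)
print('MultinomialNB candidates:', n_mnb, 'fits per dataset:', n_mnb * cv_folds)
```

Because the Naive Bayes grid is several times larger, the total search time alone does not show how fast a single model fits.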

In [91]:
params_mnnb

{'small': {'mnb__alpha': 3.625,
  'vectorizer__max_features': 4500,
  'vectorizer__ngram_range': (1, 1),
  'vectorizer__stop_words': 'english'},
 'medium': {'mnb__alpha': 2.5,
  'vectorizer__max_features': 4500,
  'vectorizer__ngram_range': (1, 1),
  'vectorizer__stop_words': 'english'},
 'large': {'mnb__alpha': 2.875,
  'vectorizer__max_features': 4500,
  'vectorizer__ngram_range': (1, 1),
  'vectorizer__stop_words': 'english'}}

## TFIDFVectorizer with GaussianNB

In [98]:
scores_gnb = {}
params_gnb = {}

In [99]:
# adapted from https://stackoverflow.com/questions/28384680/scikit-learns-pipeline-a-sparse-matrix-was-passed-but-dense-data-is-required
class DenseTransformer(TransformerMixin):
    """Convert a sparse matrix to a dense array, since GaussianNB requires dense input."""

    def fit(self, X, y=None, **fit_params):
        # Stateless transformer: there is nothing to learn
        return self

    def transform(self, X, y=None, **fit_params):
        return X.toarray()
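
An alternative to a custom class is scikit-learn's built-in `FunctionTransformer`. The sketch below is only one possible approach, shown on invented documents and labels; `to_dense` is a hypothetical helper name, not something used elsewhere in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def to_dense(X):
    """Convert a SciPy sparse matrix to a dense NumPy array."""
    return X.toarray()


# Invented documents and labels, for illustration only
docs = ['trail hiking boots', 'laptop battery life', 'mountain trail map', 'gaming laptop deals']
labels = [1, 0, 1, 0]

dense_pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    # accept_sparse=True lets the sparse TF-IDF matrix reach to_dense unchanged
    ('to_dense', FunctionTransformer(to_dense, accept_sparse=True)),
    ('gnb', GaussianNB()),
])

dense_pipe.fit(docs, labels)
preds = dense_pipe.predict(docs)
print(preds)
```

Either way, densifying a large vocabulary is memory-hungry, which is one reason to keep `max_features` modest for this model.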

In [100]:
pipe = Pipeline([
    ('vectorizer', TfidfVectorizer()),
    ("_", DenseTransformer()),
    ('gnb', GaussianNB())
])

pipe_params = {'vectorizer__max_features': [1500, 2500, 3500],
               'vectorizer__ngram_range'  : [(1,1), (1,2), (1,3)],
               'vectorizer__stop_words': ['english', custom_stop_words]
              }

In [101]:
%%time

for size,split in split_all.items():

    X_train, X_test, y_train, y_test = split
    
    gs = GridSearchCV(pipe,
                      param_grid = pipe_params,
                      n_jobs = 7,
                      cv = 5,
                      scoring = 'accuracy')

    gs_tfidf = gs.fit(X_train, y_train)
    
    scores_gnb[size] = [gs_tfidf.score(X_train, y_train), gs_tfidf.score(X_test, y_test)]
    params_gnb[size] = gs_tfidf.best_params_
    
    if 'reddit' in params_gnb[size]['vectorizer__stop_words']:
        params_gnb[size]['vectorizer__stop_words'] = 'custom_stop_words'

    print('Train:', gs_tfidf.score(X_train, y_train), 'Test:', gs_tfidf.score(X_test, y_test))

Train: 0.9297777777777778 Test: 0.7653333333333333
Train: 0.8759748427672956 Test: 0.7984905660377358
Train: 0.914560341758633 Test: 0.8152696209289909
CPU times: user 7.58 s, sys: 1.47 s, total: 9.04 s
Wall time: 1min 35s


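**Interpretation:** The TfidfVectorizer and GaussianNB pipeline overfits the training data heavily (train scores exceed test scores by roughly 8 to 16 percentage points) and has the lowest test scores of the three models. Abandon this model.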
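
To put the overfitting comparison in one place, the sketch below tabulates the train and test scores printed by the three grid searches (rounded to four decimals) and computes the train-test gap for each. The dictionary layout and model labels are just a convenience for this illustration.

```python
# (train, test) accuracy as printed by the grid-search cells, rounded to 4 decimals
reported = {
    'LogReg': {'small': (0.8920, 0.8320), 'medium': (0.9135, 0.8634), 'large': (0.9201, 0.8815)},
    'MultinomialNB': {'small': (0.8938, 0.8333), 'medium': (0.8926, 0.8642), 'large': (0.8948, 0.8751)},
    'GaussianNB': {'small': (0.9298, 0.7653), 'medium': (0.8760, 0.7985), 'large': (0.9146, 0.8153)},
}

# Train-test gap per model and dataset size; a larger gap suggests more overfitting
gaps = {
    model: {size: train - test for size, (train, test) in by_size.items()}
    for model, by_size in reported.items()
}

# Model with the largest average gap across the three dataset sizes
mean_gaps = {model: sum(g.values()) / len(g) for model, g in gaps.items()}
most_overfit = max(mean_gaps, key=mean_gaps.get)

for model, g in gaps.items():
    print(model, {size: round(gap, 3) for size, gap in g.items()})
print('Largest average gap:', most_overfit)
```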

In [102]:
params_gnb

{'small': {'vectorizer__max_features': 3500,
  'vectorizer__ngram_range': (1, 3),
  'vectorizer__stop_words': 'custom_stop_words'},
 'medium': {'vectorizer__max_features': 1500,
  'vectorizer__ngram_range': (1, 2),
  'vectorizer__stop_words': 'english'},
 'large': {'vectorizer__max_features': 3500,
  'vectorizer__ngram_range': (1, 3),
  'vectorizer__stop_words': 'english'}}

---
# Fit Final Models
Using text and numeric data

In [None]:
# Split into features and target
X_all = [df.drop(columns=['subreddit']) for df in data]
y_all = [df['subreddit'] for df in data]

# Dictionaries
X_dict_all = {size:val.drop(columns='subreddit') for size,val in data_dict.items()}
y_dict_all = {size:val['subreddit'] for size,val in data_dict.items()}

# Keys = ['small', 'medium', 'large'] and values = [X_train, X_test, y_train, y_test]
split_all = {size: train_test_split(X, y_dict_all[size], stratify=y_dict_all[size], random_state=18)
             for size, X in X_dict_all.items()}

This part was tricky. Here's what I did:  
1. Transformed the text data using the best params found in the grid searches above.
2. Merged the result with the numeric data.
3. Fit the model and scored it using the accuracy metric.  
4. This didn't work with Multinomial Naive Bayes, most likely because `MultinomialNB` expects non-negative, count-like features and the numeric columns don't fit that assumption. Resolved by fitting on text data only.

I will figure out a better way using ColumnTransformer before adding this to my portfolio.
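
For reference, a `ColumnTransformer` version might look roughly like the sketch below. It is only an outline under stated assumptions: the toy frame and the numeric column names `word_count` and `score` are hypothetical, and the `StandardScaler` step is an addition that the manual approach in this notebook does not use.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; 'word_count' and 'score' stand in for the real numeric columns
toy = pd.DataFrame({
    'all_text': ['great hiking trail', 'new laptop review', 'mountain trail map',
                 'laptop battery life', 'trail running shoes', 'gaming laptop deals'],
    'word_count': [3, 5, 4, 6, 3, 7],
    'score': [12.0, -3.0, 8.0, 0.5, 20.0, -1.0],
    'subreddit': [1, 0, 1, 0, 1, 0],
})
numeric_cols = ['word_count', 'score']

preprocess = ColumnTransformer([
    # A text vectorizer needs a single column name (a string, not a list)
    ('text', TfidfVectorizer(max_features=2500, ngram_range=(1, 2)), 'all_text'),
    # Scaling the numeric columns is an assumption, not part of the original workflow
    ('num', StandardScaler(), numeric_cols),
])

model = Pipeline([
    ('prep', preprocess),
    ('lr', LogisticRegression(solver='liblinear')),
])

X = toy.drop(columns='subreddit')
y = toy['subreddit']
model.fit(X, y)
train_acc = model.score(X, y)
print('Train accuracy on toy data:', train_acc)
```

Because the whole thing is a single `Pipeline`, it could in principle be passed to `GridSearchCV` directly, so the text parameters would be tuned with the numeric data present.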

In [120]:
def fit_transform_custom(train, test, transformer):
    """Fit a vectorizer on the training text and return dense train/test DataFrames."""
    vec = transformer
    vec.fit(train['all_text'])

    v_train = vec.transform(train['all_text'])
    v_test = vec.transform(test['all_text'])

    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is available from 1.0 onward
    feature_names = vec.get_feature_names_out()
    v_train = pd.DataFrame(v_train.toarray(), columns=feature_names)
    v_test = pd.DataFrame(v_test.toarray(), columns=feature_names)

    return v_train, v_test

In [124]:
def vec_combine(data_size, tfidf=False, stopwords=custom_stop_words, max_features=1500, ngram_range=(1,1)):
    X_train, X_test, y_train, y_test = split_all[data_size]

    # Pick the vectorizer class without shadowing the boolean `tfidf` argument
    vec_class = TfidfVectorizer if tfidf else CountVectorizer
    vec = vec_class(max_features=max_features, ngram_range=ngram_range, stop_words=stopwords)
    v_train, v_test = fit_transform_custom(X_train, X_test, vec)

    num_data_train = X_train.drop(columns='all_text').reset_index(drop=True)
    num_data_test = X_test.drop(columns='all_text').reset_index(drop=True)

    together_train = pd.concat([v_train,num_data_train], axis=1)
    together_test = pd.concat([v_test,num_data_test], axis=1)
    
    return together_train, together_test, y_train, y_test

In [157]:
def fit_show_log(X_train, X_test, y_train, y_test):
    # Fit a logistic regression and report train/test accuracy
    log_reg = LogisticRegression(solver = 'liblinear')
    log_reg.fit(X_train, y_train)
    return 'Train:', log_reg.score(X_train, y_train), 'Test:', log_reg.score(X_test, y_test)

### Logistic Regression

In [161]:
print('Small dataset\n')
print(fit_show_log(*vec_combine('small')), '\n')

print('Medium dataset\n')
print(fit_show_log(*vec_combine('medium', tfidf=True, max_features=2500, ngram_range=(1,2))), '\n')

print('Large dataset\n')
print(fit_show_log(*vec_combine('large', tfidf=True, max_features=3500, ngram_range=(1,3))))

Small dataset

('Train:', 0.9266666666666666, 'Test:', 0.8373333333333334) 

Medium dataset

('Train:', 0.9116981132075471, 'Test:', 0.8671698113207548) 

Large dataset

('Train:', 0.9156283374866501, 'Test:', 0.8766684463427656)


### Naive Bayes

In [163]:
def transform_fit_nb(data_size, alpha):
    # Text-only features: MultinomialNB is fit on the CountVectorizer output alone
    X_train, X_test, y_train, y_test = split_all[data_size]
    X_train, X_test = fit_transform_custom(X_train, X_test, CountVectorizer(max_features=4500, stop_words='english'))

    nb = MultinomialNB(alpha=alpha)
    nb.fit(X_train, y_train)
    return 'Train:', nb.score(X_train, y_train), 'Test:', nb.score(X_test, y_test)

In [164]:
print('Small dataset\n')
print(transform_fit_nb('small', 3.625), '\n')

print('Medium dataset\n')
print(transform_fit_nb('medium', 2.5), '\n')

print('Large dataset\n')
print(transform_fit_nb('large', 2.875))

Small dataset

('Train:', 0.8937777777777778, 'Test:', 0.8333333333333334) 

Medium dataset

('Train:', 0.8925786163522013, 'Test:', 0.8641509433962264) 

Large dataset

('Train:', 0.8948024207903168, 'Test:', 0.8750667378537106)
