## Model Benchmarks

In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix

In [2]:
df = pd.read_csv('../data/train_combine.csv')

In [3]:
# Check each column of df for null values
df.isnull().sum()

subreddit        0
combine_text     0
clean_combine    2
is_coffee        0
dtype: int64

In [4]:
# clean_combine has two nulls, probably because the text became empty during preprocessing.
# We will drop these two rows.
df.loc[df['clean_combine'].isnull()]

Unnamed: 0,subreddit,combine_text,clean_combine,is_coffee
786,Coffee,It's here!,,1
1658,tea,I was looking to order some tea and I stumbled...,,0


In [5]:
df.dropna(inplace = True)
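
Called without arguments, `dropna()` drops a row if any column is null. Only `clean_combine` has nulls here, so the result is the same, but restricting the check with `subset` makes the intent explicit. A minimal sketch on an invented toy frame (the rows below are made up for illustration, not taken from the real data):

```python
import pandas as pd

# Toy frame mimicking the columns above; the rows are invented for illustration.
toy = pd.DataFrame({
    'combine_text': ["It's here!", 'Best pour over?', 'Green tea tips'],
    'clean_combine': [None, 'best pour', 'green tip'],
    'is_coffee': [1, 1, 0],
})

# Drop only the rows whose cleaned text is missing.
toy_clean = toy.dropna(subset=['clean_combine'])
print(toy_clean.shape)
```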

### Define X, y and check baseline

In [6]:
# Define X, y 
X = df['clean_combine']
y = df['is_coffee']

In [7]:
# The classes are almost balanced: always predicting the majority class (0) gives a baseline accuracy of about 51.5%
y.value_counts(normalize = True)

0    0.515453
1    0.484547
Name: is_coffee, dtype: float64

### Train/test split

In [8]:
# Split into train and test sets, stratified on y to keep the class ratio.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size = 0.25,
                                                    random_state = 42,
                                                    stratify = y)
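
As a quick illustration of what `stratify` does, the sketch below splits synthetic labels (60 zeros and 40 ones, invented for this example) with the same `test_size` and `random_state` as above and compares the class ratio in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 60 zeros and 40 ones (invented for illustration).
y_toy = np.array([0] * 60 + [1] * 40)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

# With stratify, both parts keep the 60/40 class ratio of the full data.
print(y_tr.mean(), y_te.mean())
```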

### Modeling

#### Use CountVectorizer for Logistic Regression and Naive Bayes (default settings)

##### Logistic Regression model 

In [9]:
# Instantiate CountVectorizer
cvec = CountVectorizer()

In [10]:
# fit_transform on X_train, transform only on X_test
train_data_features = cvec.fit_transform(X_train)

test_data_features = cvec.transform(X_test)

In [11]:
# The training set has 1431 posts and 6168 features (vocabulary terms)
train_data_features.shape

(1431, 6168)
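
The vocabulary is learned from the training text only, and `transform` reuses it unchanged, so words that appear only in the test set are ignored. A small sketch with invented toy documents (not the real posts):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration.
train_docs = ['brew coffee cup', 'green tea cup']
test_docs = ['brew oolong tea']

toy_cvec = CountVectorizer()
train_counts = toy_cvec.fit_transform(train_docs)  # learns the vocabulary
test_counts = toy_cvec.transform(test_docs)        # reuses it unchanged

# 'oolong' never appeared in training, so it has no column.
print(sorted(toy_cvec.vocabulary_))
print(train_counts.shape, test_counts.shape)
```

Both matrices have the same number of columns, which is what lets a model fitted on the training features score the test features.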

In [12]:
# Instantiate logistic regression model.
lr = LogisticRegression()

# Fit model to training data.
lr.fit(train_data_features, y_train)

# Evaluate model on training data.
lr.score(train_data_features, y_train)

0.9951083158630328

In [13]:
# Accuracy score for test data
lr.score(test_data_features, y_test)

0.9184100418410042

##### Naive Bayes model 

In [14]:
# Instantiate naive bayes model.
nb = MultinomialNB()

In [15]:
# Fit model to training data.
nb.fit(train_data_features, y_train)

# Evaluate model on training data.
nb.score(train_data_features, y_train)

0.9622641509433962

In [16]:
# Accuracy score for test data
nb.score(test_data_features, y_test)

0.9079497907949791

#### Use TfidfVectorizer for Logistic Regression and Naive Bayes (default settings)

##### Logistic Regression model 

In [17]:
# Instantiate TfidfVectorizer
tvec = TfidfVectorizer()

In [18]:
# fit_transform on X_train, transform only on X_test
train_data_features = tvec.fit_transform(X_train)

test_data_features = tvec.transform(X_test)

In [19]:
# Instantiate logistic regression model.
lr = LogisticRegression()

# Fit model to training data.
lr.fit(train_data_features, y_train)

# Evaluate model on training data.
lr.score(train_data_features, y_train)

0.9713487071977638

In [20]:
# Accuracy score for test data
lr.score(test_data_features, y_test)

0.9163179916317992

##### Naive Bayes model

In [21]:
# Instantiate Naive Bayes model.
nb = MultinomialNB()

In [22]:
# Fit model to training data.
nb.fit(train_data_features, y_train)

# Evaluate model on training data.
nb.score(train_data_features, y_train)

0.9671558350803634

In [23]:
# Accuracy score for test data
nb.score(test_data_features, y_test)

0.9058577405857741

- **Observation (with default setting)**: 

Model Name | Vectorizer |Train Score|Test Score|Train/Test Score gap
-|-|-|-|-
Logistic Regression|CountVectorizer|99.5%|91.8%|7.7%
Logistic Regression|TfidfVectorizer|97.1%|91.6%|5.5%
Naive Bayes|CountVectorizer|96.2%|90.8%|5.4%
Naive Bayes|TfidfVectorizer|96.7%|90.6%|6.1%


Both the Logistic Regression and Naive Bayes models overfit: train accuracy exceeds test accuracy by 5.4% to 7.7%. The training set has 6168 features but only 1431 posts, and with so many features the models can fit the training data very closely, hence the high train scores. Compared with Naive Bayes, Logistic Regression reaches a higher train score, and with CountVectorizer it also shows the largest train/test gap.

For Logistic Regression, CountVectorizer leads to more overfitting than TfidfVectorizer (a gap of 7.7% versus 5.5%). It seems that words with high raw counts have a strong effect on the model when CountVectorizer is used.

For Naive Bayes, there is not much difference between CountVectorizer and TfidfVectorizer.
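
The gap column in the table can be reproduced from the raw scores printed in the cells above. The sketch below only restates that arithmetic; the dictionary layout is an assumption made for this example:

```python
# Train and test accuracy copied from the cell outputs above.
scores = {
    ('Logistic Regression', 'CountVectorizer'): (0.9951083158630328, 0.9184100418410042),
    ('Logistic Regression', 'TfidfVectorizer'): (0.9713487071977638, 0.9163179916317992),
    ('Naive Bayes', 'CountVectorizer'): (0.9622641509433962, 0.9079497907949791),
    ('Naive Bayes', 'TfidfVectorizer'): (0.9671558350803634, 0.9058577405857741),
}

# Gap in percentage points between train and test accuracy.
gaps = {key: round((train - test) * 100, 1)
        for key, (train, test) in scores.items()}

for (model, vectorizer), gap in gaps.items():
    print(f'{model} + {vectorizer}: {gap}%')
```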

**In the next section, we will use Pipeline and GridSearchCV to tune the hyperparameters of several models, both to find the best model and to reduce the overfitting observed with the default settings in 4.3.1 and 4.3.2.** 

#### Use Pipeline for CountVectorizer Logistic Regression

In [24]:
# Instantiate pipeline for CountVectorizer & LogisticRegression
pipe1 = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(max_iter=200)),
])

In [25]:
# Search over the following values of hyperparameters:
pipe_params1 = {
    'cvec__max_features': [1000, 2000, 3000], # Cap the vocabulary size to reduce overfitting
    'cvec__min_df': [2, 3, 4], # Ignore tokens that appear in fewer than 2, 3 or 4 documents
    'cvec__max_df': [0.9, 0.95], # Ignore tokens that appear in more than 90% or 95% of documents
    'cvec__ngram_range': [(1,1), (1,2)], # Unigrams only, or unigrams and bigrams
    'lr__C': [1, 0.1, 0.01], # C is the inverse of regularization strength. We want more regularization
                             # to reduce overfitting, so we do not include large C (small alpha) in the search
}
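
To make the `min_df` and `max_df` thresholds concrete, here is a small sketch on four invented documents. With `min_df=2`, tokens found in only one document are dropped; with `max_df=0.9`, a token found in every document is dropped as well:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration.
docs = ['coffee brew cup', 'coffee brew', 'coffee tea', 'coffee oolong']

# 'coffee' is in 4 of 4 documents (more than 90%), so max_df removes it.
# 'cup', 'tea' and 'oolong' are each in 1 document, so min_df removes them.
toy_vec = CountVectorizer(min_df=2, max_df=0.9)
toy_vec.fit(docs)

kept = sorted(toy_vec.vocabulary_)
print(kept)
```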

In [26]:
# Instantiate GridSearchCV.
gs1 = GridSearchCV(pipe1, # what object are we optimizing?
                  param_grid= pipe_params1, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.
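
The size of this search is worth checking before fitting. The sketch below counts the hyperparameter combinations in the grid above and multiplies by the number of folds; the variable names `grid`, `n_candidates` and `n_fits` are chosen for this example only:

```python
from math import prod

# Same values as pipe_params1 above.
grid = {
    'cvec__max_features': [1000, 2000, 3000],
    'cvec__min_df': [2, 3, 4],
    'cvec__max_df': [0.9, 0.95],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [1, 0.1, 0.01],
}

# Every combination of values is one candidate model.
n_candidates = prod(len(values) for values in grid.values())

# Each candidate is fitted once per fold (cv=5).
n_fits = n_candidates * 5
print(n_candidates, n_fits)
```

With the default `refit=True`, GridSearchCV also fits the best candidate once more on the whole training set.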

In [27]:
# Fit GridSearch to training data.
gs1.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('lr',
                                        LogisticRegression(max_iter=200))]),
             param_grid={'cvec__max_df': [0.9, 0.95],
                         'cvec__max_features': [1000, 2000, 3000],
                         'cvec__min_df': [2, 3, 4],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'lr__C': [1, 0.1, 0.01]})

In [28]:
# What's the best score?
gs1.best_score_

0.9000682244584685
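
Note that `best_score_` is the mean cross-validated accuracy of the best candidate over the validation folds of the training data, not a score on the test set. The following sketch checks this on synthetic data (generated with `make_classification`, unrelated to the posts):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)

toy_gs = GridSearchCV(LogisticRegression(max_iter=200),
                      param_grid={'C': [1, 0.1, 0.01]},
                      cv=5)
toy_gs.fit(X_toy, y_toy)

# Mean validation score of the winning candidate across the 5 folds.
mean_cv = toy_gs.cv_results_['mean_test_score'][toy_gs.best_index_]
print(toy_gs.best_score_, mean_cv)
```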

In [29]:
# Save best model as gs1_model.
gs1_model = gs1.best_estimator_

In [30]:
# Score model on training set.
gs1_model.score(X_train, y_train)

0.9909154437456325

In [31]:
# Score model on testing set.
gs1_model.score(X_test, y_test)

0.9121338912133892

In [32]:
# Get the best parameters
gs1.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 3000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2),
 'lr__C': 1}

#### Use Pipeline for TfidfVectorizer Logistic Regression

In [33]:
# Instantiate pipeline for TfidfVectorizer & LogisticRegression
pipe2 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(max_iter = 200)),
])

In [34]:
# Search over the following values of hyperparameters:
pipe_params2 = {
    'tvec__max_features': [1000, 2000, 3000], # Cap the vocabulary size to reduce overfitting
    'tvec__min_df': [2, 3, 4], # Ignore tokens that appear in fewer than 2, 3 or 4 documents
    'tvec__max_df': [0.9, 0.95], # Ignore tokens that appear in more than 90% or 95% of documents
    'tvec__ngram_range': [(1,1), (1,2)], # Unigrams only, or unigrams and bigrams
    'lr__C': [1, 0.1, 0.01], # C is the inverse of regularization strength. We want more regularization
                             # to reduce overfitting, so we do not include large C (small alpha) in the search
}

In [35]:
# Instantiate GridSearchCV.
gs2 = GridSearchCV(pipe2, # what object are we optimizing?
                  param_grid= pipe_params2, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [36]:
# Fit GridSearch to training data.
gs2.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('lr',
                                        LogisticRegression(max_iter=200))]),
             param_grid={'lr__C': [1, 0.1, 0.01], 'tvec__max_df': [0.9, 0.95],
                         'tvec__max_features': [1000, 2000, 3000],
                         'tvec__min_df': [2, 3, 4],
                         'tvec__ngram_range': [(1, 1), (1, 2)]})

In [37]:
# What's the best score?
gs2.best_score_

0.9098389415462587

In [38]:
# Save best model as gs2_model.
gs2_model = gs2.best_estimator_

In [39]:
# Score model on training set.
gs2_model.score(X_train, y_train)

0.9720475192173306

In [40]:
# Score model on test set.
gs2_model.score(X_test, y_test)

0.9142259414225942

In [41]:
# Get the best parameters
gs2.best_params_

{'lr__C': 1,
 'tvec__max_df': 0.9,
 'tvec__max_features': 3000,
 'tvec__min_df': 4,
 'tvec__ngram_range': (1, 2)}

#### Use Pipeline for CountVectorizer Naive Bayes

In [42]:
# Instantiate pipeline for CountVectorizer & MultinomialNB
pipe3 = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB()),
])

In [43]:
# Search over the following values of hyperparameters:
pipe_params3 = {
    'cvec__max_features': [1000, 2000, 3000], # Cap the vocabulary size to reduce overfitting
    'cvec__min_df': [2, 3, 4], # Ignore tokens that appear in fewer than 2, 3 or 4 documents
    'cvec__max_df': [0.9, 0.95], # Ignore tokens that appear in more than 90% or 95% of documents
    'cvec__ngram_range': [(1,1), (1,2)], # Unigrams only, or unigrams and bigrams
    'nb__alpha': [1, 10, 100] # Larger alpha means more smoothing. We want more regularization
                              # to reduce overfitting, so we do not include small alpha in the search
}

In [44]:
# Instantiate GridSearchCV.
gs3 = GridSearchCV(pipe3, # what object are we optimizing?
                  param_grid= pipe_params3, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [45]:
# Fit GridSearch to training data.
gs3.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('nb', MultinomialNB())]),
             param_grid={'cvec__max_df': [0.9, 0.95],
                         'cvec__max_features': [1000, 2000, 3000],
                         'cvec__min_df': [2, 3, 4],
                         'cvec__ngram_range': [(1, 1), (1, 2)],
                         'nb__alpha': [1, 10, 100]})

In [46]:
# What's the best score?
gs3.best_score_

0.9000438585804439

In [47]:
# Save best model as gs3_model.
gs3_model = gs3.best_estimator_

In [48]:
# Score model on training set.
gs3_model.score(X_train, y_train)

0.9419986023759609

In [49]:
# Score model on test set.
gs3_model.score(X_test, y_test)

0.9037656903765691

In [50]:
# Get the best parameters
gs3.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 3000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2),
 'nb__alpha': 1}

#### Use Pipeline for TfidfVectorizer Naive Bayes

In [96]:
# Instantiate pipeline for TfidfVectorizer & MultinomialNB
pipe4 = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB()),
])

In [97]:
# Search over the following values of hyperparameters:
pipe_params4 = {
    'tvec__max_features': [1000, 2000, 3000], # Cap the vocabulary size to reduce overfitting
    'tvec__min_df': [2, 3, 4], # Ignore tokens that appear in fewer than 2, 3 or 4 documents
    'tvec__max_df': [0.9, 0.95], # Ignore tokens that appear in more than 90% or 95% of documents
    'tvec__ngram_range': [(1,1), (1,2)], # Unigrams only, or unigrams and bigrams
    'nb__alpha': [1, 10, 100] # Larger alpha means more smoothing. We want more regularization
                              # to reduce overfitting, so we do not include small alpha in the search
}

In [98]:
# Instantiate GridSearchCV.
gs4 = GridSearchCV(pipe4, # what object are we optimizing?
                  param_grid= pipe_params4, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [99]:
# Fit GridSearch to training data
gs4.fit(X_train,y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('nb', MultinomialNB())]),
             param_grid={'nb__alpha': [1, 10, 100], 'tvec__max_df': [0.9, 0.95],
                         'tvec__max_features': [1000, 2000, 3000],
                         'tvec__min_df': [2, 3, 4],
                         'tvec__ngram_range': [(1, 1), (1, 2)]})

In [100]:
# What's the best score?
gs4.best_score_

0.9014497697424527

In [101]:
# Save best model as gs4_model.
gs4_model = gs4.best_estimator_

In [102]:
# Score model on training set.
gs4_model.score(X_train, y_train)

0.9664570230607966

In [103]:
# Score model on test set.
gs4_model.score(X_test, y_test)

0.9205020920502092

In [104]:
# Get the best parameters
gs4.best_params_

{'nb__alpha': 1,
 'tvec__max_df': 0.9,
 'tvec__max_features': 3000,
 'tvec__min_df': 3,
 'tvec__ngram_range': (1, 2)}

#### Compare and choose best model 

Model Name | Vectorizer |Train Score|Test Score|Train/Test Score gap
-|-|-|-|-
Logistic Regression|CountVectorizer|99.1%|91.2%|7.9%
Logistic Regression|TfidfVectorizer|97.2%|91.4%|5.8%
Naive Bayes|CountVectorizer|94.2%|90.4%|3.8%
Naive Bayes|TfidfVectorizer|96.6%|92.1%|4.5%

According to the score table, for logistic regression the gap between train and test score is quite large with both CountVectorizer and TfidfVectorizer (5.8% to 7.9%), which indicates substantial overfitting.

For Naive Bayes, the gap between train and test score is smaller with both vectorizers (3.8% to 4.5%). The model still overfits, but less than logistic regression.

Logistic regression is a discriminative model, which estimates the probability P(y|x) directly from the training data by minimizing error. 
Naive Bayes is a generative model, which estimates the joint probability of the features x and the label y from the training data. In other words, a generative model learns how the features are distributed within each class. In this comparison, Naive Bayes with TfidfVectorizer reaches a better test accuracy than logistic regression, while Naive Bayes with CountVectorizer scores slightly lower. 
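
The difference shows up in what each fitted scikit-learn model stores. In the sketch below (four invented documents, not the real posts), `MultinomialNB` keeps a class prior and one word distribution per class, while `LogisticRegression` keeps a single weight vector for the decision boundary:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy documents and labels (1 = coffee, 0 = tea), invented for illustration.
docs = ['brew coffee bean', 'coffee grinder espresso',
        'green tea leaf', 'oolong tea steep']
labels = [1, 1, 0, 0]

X_toy = CountVectorizer().fit_transform(docs)
toy_nb = MultinomialNB().fit(X_toy, labels)
toy_lr = LogisticRegression().fit(X_toy, labels)

# Generative: log P(y) per class and log P(word | y) per class and word.
print(toy_nb.class_log_prior_.shape, toy_nb.feature_log_prob_.shape)
# Each class-conditional word distribution sums to 1.
print(np.exp(toy_nb.feature_log_prob_).sum(axis=1))

# Discriminative: one weight per word for the boundary between classes.
print(toy_lr.coef_.shape)
```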

Among these 4 models, I choose Naive Bayes combined with TfidfVectorizer as the best model because it has the highest test accuracy (92.1%) and overfits less than both logistic regression models.

## Evaluation and Conceptual Understanding

### Confusion Table with best model

In [60]:
# Generate a confusion matrix
from sklearn.metrics import confusion_matrix
y_preds = gs4.predict(X_test)

pd.DataFrame(confusion_matrix(y_test, y_preds),
            columns=['predict tea', 'predict coffee'],
            index=['actual tea', 'actual coffee'])

Unnamed: 0,predict tea,predict coffee
actual tea,226,20
actual coffee,18,214


In [61]:
# Examine some classification metrics 
tn, fp, fn, tp = confusion_matrix(y_test, y_preds).ravel()
print('Accuracy: {}'.format(round((tp+tn)/(tp+fp+tn+fn),4)))
print('Misclassification rate: {}'.format(round((fp+fn)/(tp+fp+tn+fn),4)))
print('Precision: {}'.format(round(tp/(tp+fp),4)))
print('Recall: {}'.format(round(tp/(tp+fn),4)))
print('Specificity: {}'.format(round(tn/(tn+fp),4)))

Accuracy: 0.9205
Misclassification rate: 0.0795
Precision: 0.9145
Recall: 0.9224
Specificity: 0.9187


- Our model correctly classifies 92.05% of all posts (accuracy).
- Among posts that our model predicted to be in /r/Coffee, 91.45% are actually from /r/Coffee (precision).
- Among posts that are in /r/Coffee, our model correctly classifies 92.24% (recall).
- Among posts that are in /r/Tea, our model correctly classifies 91.87% (specificity).
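
As a cross-check, the same figures can be recovered with scikit-learn's metric functions. The sketch below rebuilds label lists from the four confusion-matrix counts above (the order of the rows is arbitrary and only the counts matter), treating specificity as recall for the tea class:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Counts taken from the confusion matrix above.
tn, fp, fn, tp = 226, 20, 18, 214

# Rebuild label lists that produce exactly these counts.
y_true = [0] * (tn + fp) + [1] * (fn + tp)
y_pred = [0] * tn + [1] * fp + [0] * fn + [1] * tp

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# Specificity is recall computed for the negative class (tea).
specificity = recall_score(y_true, y_pred, pos_label=0)

print(round(accuracy, 4), round(precision, 4),
      round(recall, 4), round(specificity, 4))
```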

Our best model still misclassifies 7.95% of posts. In 5.2 we will explore whether any key words cause the misclassification.

### Evaluate misclassified posts

In [62]:
# Predicted class probabilities from the best model (gs4)
preds_prob = gs4.predict_proba(X_test)

In [63]:
# Create a dataframe with X_test, preds_prob, preds and true_y
preds = pd.DataFrame({
    'clean_combine': X_test,
    'preds_prob': preds_prob[:, 1], # predicted probability of class 1 (coffee)
    'preds': y_preds,
    'true_y': y_test
})
preds

Unnamed: 0,clean_combine,preds_prob,preds,true_y
1292,local store haul thank one client,0.193101,0,0
947,hello master fellow lover online world far eng...,0.252903,0,0
1703,look large double seal canister expensive hold...,0.465123,0,0
641,wonder anyone shed light seem like two identic...,0.778849,1,1
105,point wife snob thinking agree background cana...,0.510257,1,1
...,...,...,...,...
1702,today brew organic green stem twig,0.079966,0,0
1150,know anything three month ago bit problem,0.604524,1,0
990,guy day ago ask suggestion name come green bra...,0.132194,0,0
1446,think hard work people deserve recognition,0.710618,1,0


In [64]:
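# Create a new column for the difference between predicted y and true y
# (1 = tea post predicted as coffee, -1 = coffee post predicted as tea, 0 = correct)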
preds['diff'] = preds['preds'] - preds['true_y']
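
Because both columns hold 0 or 1, the difference encodes the type of error. A small sketch on invented predictions (the `outcome` column and its labels are added only for this illustration):

```python
import pandas as pd

# Toy predictions and true labels (1 = coffee, 0 = tea), invented for illustration.
toy = pd.DataFrame({'preds': [1, 0, 1, 0],
                    'true_y': [1, 0, 0, 1]})
toy['diff'] = toy['preds'] - toy['true_y']

# Map each difference to a readable outcome.
outcome_names = {
    0: 'correct',
    1: 'false positive (tea predicted as coffee)',
    -1: 'false negative (coffee predicted as tea)',
}
toy['outcome'] = toy['diff'].map(outcome_names)
print(toy)
```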

In [65]:
# Predicted as a coffee post, but actually a tea post

for i in preds.loc[preds['diff'] == 1].clean_combine:
    print(i)
    print('.'*50)

really thinking try know start hope help expertise brew equipment need able brew wide range different temperature control kettle nothing brew bag main category check buy store want get
..................................................
decide give space
..................................................
make ceramic water temperature control make ceramic
..................................................
design little enamel pin witchesvspatriarchy say sub might appreciate
..................................................
try find decent variable temp kettle easy fancy pay buck ekg small capacity little flat kettle table small capacity normal use aside also gooseneck advantage find linear flow useful kitchen application xiaomi smart kettle window see much water also something like recommend variable temp kettle gooseneck
..................................................
hey anyone bubble recipe taste like kind make boba shop try make several time really taste idk bubble recipe
......

- Among these mispredictions, some words, like 'taste' and 'brew', could be significant terms that caused the posts to be misclassified as Coffee. These words are in the top-20 word list for Coffee. We also observe that some posts are quite short. Short posts may not contain meaningful words, which can cause misclassification. 

In [66]:
preds.loc[preds['diff'] == -1]

Unnamed: 0,clean_combine,preds_prob,preds,true_y,diff
323,ice tray would highly recommend make extra pou...,0.464791,0,1,-1
181,last year seem clear market high quality seem ...,0.357149,0,1,-1
117,howdy love taste however totally caffeine into...,0.437966,0,1,-1
647,really seem know talk probably miss lot great ...,0.359928,0,1,-1
178,big bottle water infuse mint leave lead idea a...,0.264736,0,1,-1
179,recently ecuadorian black really like would li...,0.359504,0,1,-1
477,little bit context like give background two pl...,0.325214,0,1,-1
499,sorta new mocha frappe latte currently drift t...,0.429948,0,1,-1
784,try blind cup sample pack angel cup see lot pe...,0.392595,0,1,-1
566,hello import family farm salvador united state...,0.470261,0,1,-1


In [67]:
# Predicted as a tea post, but actually a coffee post

for i in preds.loc[preds['diff'] == -1].clean_combine:
    print(i)
    print('.'*50)    

ice tray would highly recommend make extra pour excess say ice tray come back later frozen cube perfect add iced without detrement water unique way make iced
..................................................
last year seem clear market high quality seem people want high quality small viable market high scoring quality get price regular quality market top quality third wave
..................................................
howdy love taste however totally caffeine intolerant relegate decaf like know anybody wisdom impart good brewing method decaf blend flavorful decaf blend recommend thank decaffeinate tip recommendation
..................................................
really seem know talk probably miss lot great ignorance favorite ever brazil fazenda rio verde ipanema grand cru apricot get hasbean last year brew use beginner grinder cup one moccamaster without much skill ever truly love taste like pure fruit juice incredible well expensive rare gesha since brew find today natural 

- Among these mispredictions, some words, like 'cup' and 'brew', could be significant terms that caused the posts to be misclassified as Tea. These words are in the top-20 word list for Tea.

## Conclusion and Recommendations

Our Naive Bayes classifier performed well, with a test accuracy of 92.05%. The remaining misclassification is expected because the two subreddit topics we chose are similar: their top-20 word lists share many common words, like 'cup', 'brew', 'taste' and 'water'.  
However, it can still be a good classifier for a customer service team to categorize customer feedback with high accuracy. 

There is still room for improvement for this classifier: 
- Optimize the stop words, for example by identifying the words common to both topics and adding them to the stop word list. 
- Collect more text-heavy posts for the Tea subreddit.
- Consider modeling more than two topics to improve user satisfaction. 
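
The first suggestion could look roughly like the sketch below, which extends scikit-learn's built-in English stop word list with the shared words named above. The list `shared_words` and the toy sentences are assumptions for illustration; in practice the shared words would come from comparing the two top-word lists:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Words common to both subreddits (taken from the examples named above).
shared_words = ['cup', 'brew', 'taste', 'water']

# Combine the built-in English stop words with the shared words.
custom_stop_words = list(ENGLISH_STOP_WORDS.union(shared_words))

toy_tvec = TfidfVectorizer(stop_words=custom_stop_words)
toy_tvec.fit(['brew a cup of coffee', 'taste this green tea with hot water'])

# The shared words no longer appear in the vocabulary.
kept = sorted(toy_tvec.vocabulary_)
print(kept)
```

The same list could be passed through the grid search as `'tvec__stop_words'` to test whether it actually improves the cross-validated score.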
