### Comparing Models and Vectorization Strategies for Text Classification

This try-it focuses on weighing the positives and negatives of different estimators and vectorization strategies for a text classification problem. To consider each of these components, use the `Pipeline` and `GridSearchCV` objects in scikit-learn to try different combinations of vectorizers and estimators. For each combination, also use the `cv_results_` attribute to examine how long the estimator takes to fit the data.
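As a minimal sketch of that workflow, the block below runs a small grid search over a two-step pipeline and reads the timing columns from `cv_results_`. The toy corpus, the step names `vec` and `mod`, and the parameter values are assumptions chosen only to keep the example self-contained; they are not the settings used later in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (hypothetical data, four texts per class)
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes new budget bill",
    "stocks fall after earnings report",
    "city council approves housing plan",
    "storm expected to hit the coast",
]
labels = [True, True, True, True, False, False, False, False]

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("mod", MultinomialNB()),
])

# Parameters are addressed as <step name>__<parameter name>
param_grid = {
    "vec__stop_words": [None, "english"],
    "mod__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=2)
search.fit(texts, labels)

# One row per parameter combination, with fit time and CV score side by side
timing = pd.DataFrame(search.cv_results_)[["params", "mean_fit_time", "mean_test_score"]]
print(timing)
```

Because `cv_results_` holds one entry per candidate, fit time can be compared per combination rather than only as an average over the whole search.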

### The Data

The dataset below is from [kaggle]() and is the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf). Use the text column to classify whether or not the text is humorous. The data is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to use the full dataset if you can. If the original dataset is too large for your computer, use `dataset-minimal.csv`, which has been reduced to 100K rows.

In [29]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from itertools import product

In [77]:
df = pd.read_csv('text_data/dataset-minimal.csv')
df

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False
...,...,...
99994,Did you hear that joke about mosquitoes? it's ...,True
99995,What did van gogh's mother say to him when he ...,True
99996,Inappropriate wedding dresses: say 'i do' to w...,False
99997,Pit bull rescued from high-kill shelter really...,False


In [78]:
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['humor'], test_size=.25)

#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use the options for stop words and max features to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.
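The preprocessing step can be sketched as a custom tokenizer passed to either vectorizer. This example uses only the Porter stemmer, and it makes two simplifying assumptions that differ from the cells below: tokens come from a plain regular expression rather than `word_tokenize`, and stop words are dropped inside the tokenizer (using scikit-learn's built-in English list) before stemming, so the stop-word list never has to be stemmed to match. The helper name `stem_tokens` and the sample documents are hypothetical.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfVectorizer,
)

stemmer = PorterStemmer()

def stem_tokens(text):
    """Lowercase, split on letters, drop stop words, then stem what is left."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stemmer.stem(tok) for tok in tokens if tok not in ENGLISH_STOP_WORDS]

docs = [
    "The runners were running quickly",
    "She runs and he ran",
    "Cats chasing a cat",
]

for Vectorizer in (CountVectorizer, TfidfVectorizer):
    # token_pattern=None signals that the custom tokenizer replaces the default pattern
    vec = Vectorizer(tokenizer=stem_tokens, token_pattern=None, max_features=5)
    X = vec.fit_transform(docs)
    print(Vectorizer.__name__, X.shape, sorted(vec.vocabulary_))
```

Swapping `stemmer.stem` for `WordNetLemmatizer().lemmatize` would give the lemmatizing variant, provided the WordNet corpus has been downloaded for NLTK.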

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.
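One possible way to assemble such a table is to run one grid search per estimator and collect `best_params_`, `best_score_`, and the fit time of the winning candidate into a DataFrame. The toy corpus, the parameter grids, and the column names below are assumptions made for illustration, not results from this dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus (hypothetical data, four texts per class)
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes new budget bill",
    "stocks fall after earnings report",
    "city council approves housing plan",
    "storm expected to hit the coast",
]
labels = [True, True, True, True, False, False, False, False]

# One (estimator, parameter grid) pair per model family
candidates = {
    "LogisticRegression": (LogisticRegression(max_iter=1000), {"mod__C": [0.1, 1.0]}),
    "DecisionTreeClassifier": (DecisionTreeClassifier(random_state=0), {"mod__max_depth": [2, 5]}),
    "MultinomialNB": (MultinomialNB(), {"mod__alpha": [0.1, 1.0]}),
}

rows = []
for name, (model, grid) in candidates.items():
    pipe = Pipeline([("vec", TfidfVectorizer()), ("mod", model)])
    search = GridSearchCV(pipe, param_grid=grid, cv=2).fit(texts, labels)
    rows.append({
        "model": name,
        "best_params": search.best_params_,
        "best_score": search.best_score_,
        # Fit time of the best candidate only, not the average over the whole grid
        "mean_fit_time": search.cv_results_["mean_fit_time"][search.best_index_],
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```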

In [125]:
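vectorizers = {'cv': CountVectorizer, 'tf': TfidfVectorizer}

class tokenize_stem:
    """Callable tokenizer that normalizes each token with the given stemmer or lemmatizer."""
    def __init__(self, stemmer):
        self.stemmer = stemmer
        # Normalize the stop words the same way as the tokens so the two can match
        self.stopwords = [stemmer(word) for word in stopwords.words('english')]
    def __call__(self, sentence):
        return [self.stemmer(word) for word in word_tokenize(sentence)]

# 'po' = Porter stemming, 'wn' = WordNet lemmatizing
analyzers = {'po': tokenize_stem(PorterStemmer().stem), 'wn': tokenize_stem(WordNetLemmatizer().lemmatize)}

models = {'dt': DecisionTreeClassifier, 'lr': LogisticRegression, 'nb': MultinomialNB}

# Every vectorizer-analyzer-model combination, identified by its dictionary keys
permutations = list(product(vectorizers, analyzers, models))

print('We will iterate over %s vectorizer-analyzer-model'%len(permutations))

# max_iter and max_depth must be integers, so the log-spaced values are cast
# to int and de-duplicated before being used as grid candidates
int_logspace = np.unique(np.logspace(0, 2, 10).astype(int))

model_param_grid = {
    'lr': {'mod__max_iter': int_logspace, 'mod__C': np.logspace(-3, 2, 10)},
    'nb': {'mod__alpha': np.logspace(-1, 4, 10)},
    'dt': {'mod__max_depth': int_logspace},
}

# Parameters shared by every pipeline (currently none)
base_param_grid = {
    #'vec__stop_words': ['english', None],
}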

We will iterate over 12 vectorizer-analyzer-model


In [126]:
def run_search(X, y, permutations=permutations):
    """Grid-search every vectorizer-analyzer-model combination on (X, y).

    Note: the train and test scores are computed on the global
    X_train/y_train and X_test/y_test splits, not on the X, y passed in.
    """

    results = {'Permutation': [], 'Best_Estimator': [], 'Best_Train_Score': [], 'Test_Score': [], 'Avg_Fit_Time': []}

    for perm in permutations:

        # Look up the components for this combination
        vec = vectorizers[perm[0]]
        ana = analyzers[perm[1]]
        mod = models[perm[2]]

        # Build the pipeline: vectorizer with custom tokenizer, then the classifier
        pipe = Pipeline([
            ('vec', vec(tokenizer=ana, stop_words=ana.stopwords)),
            ('mod', mod())
        ])

        # Merge the shared grid with the model-specific grid
        search = GridSearchCV(estimator=pipe,
            param_grid={**base_param_grid, **model_param_grid[perm[2]]},
            cv=2, verbose=5)

        search.fit(X, y)

        results['Permutation'].append(perm)
        results['Best_Estimator'].append(search.best_estimator_)
        results['Best_Train_Score'].append(search.best_estimator_.score(X_train, y_train))
        results['Test_Score'].append(search.best_estimator_.score(X_test, y_test))
        # Mean fit time averaged over all candidates in this search
        results['Avg_Fit_Time'].append(search.cv_results_['mean_fit_time'].mean())

    return results

In [105]:
# First run the search over a limited subset of the data and parameter space to determine which models perform best; then fine-tune.

results = run_search(X_train.iloc[0:1000], y_train.iloc[0:1000])

Fitting 3 folds for each of 20 candidates, totalling 60 fits




KeyboardInterrupt: 

In [135]:
res = pd.DataFrame(results).sort_values(by='Test_Score', ascending=False)
res

Unnamed: 0,Permutation,Best_Estimator,Best_Train_Score,Test_Score,Avg_Fit_Time
4,"(cv, wn, lr)",(CountVectorizer(tokenizer=<__main__.tokenize_...,0.935812,0.93276,0.275546
1,"(cv, po, lr)",(CountVectorizer(tokenizer=<__main__.tokenize_...,0.936572,0.93268,0.464106
7,"(tf, po, lr)",(TfidfVectorizer(tokenizer=<__main__.tokenize_...,0.931972,0.932,0.466575
10,"(tf, wn, lr)",(TfidfVectorizer(tokenizer=<__main__.tokenize_...,0.930732,0.93108,0.298508
8,"(tf, po, nb)",(TfidfVectorizer(tokenizer=<__main__.tokenize_...,0.922266,0.92232,0.442534
11,"(tf, wn, nb)",(TfidfVectorizer(tokenizer=<__main__.tokenize_...,0.920239,0.92084,0.264548
3,"(cv, wn, dt)",(CountVectorizer(tokenizer=<__main__.tokenize_...,0.917386,0.91656,0.260944
0,"(cv, po, dt)",(CountVectorizer(tokenizer=<__main__.tokenize_...,0.917079,0.91648,0.442352
5,"(cv, wn, nb)","(CountVectorizer(stop_words='english',\n ...",0.913439,0.91312,0.240427
6,"(tf, po, dt)","(TfidfVectorizer(stop_words='english',\n ...",0.906439,0.90512,0.438838


In [136]:
best_pipe = res.iloc[0]['Best_Estimator'].fit(X_train, y_train)
print('Best training score: %s'%best_pipe.score(X_train,y_train))
print('Test score: %s'%best_pipe.score(X_test,y_test))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Best training score: 0.9621328284377125
Test score: 0.95672
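The warning above comes from the logistic regression solver stopping at its iteration limit before converging. A minimal sketch of the remedy the message suggests, raising `max_iter`, is shown below on a made-up corpus; the value 1000 and the toy data are assumptions, and the number of iterations needed on the real dataset may differ. Converting `ConvergenceWarning` into an error is just one way to check that the fit actually converged.

```python
import warnings

from sklearn.exceptions import ConvergenceWarning
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (hypothetical data)
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes new budget bill",
    "stocks fall after earnings report",
    "city council approves housing plan",
    "storm expected to hit the coast",
]
labels = [True, True, True, True, False, False, False, False]

pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("mod", LogisticRegression(max_iter=1000)),  # generous iteration budget
])

with warnings.catch_warnings():
    # Raise instead of warn if the solver still fails to converge
    warnings.simplefilter("error", ConvergenceWarning)
    pipe.fit(texts, labels)

# Number of iterations the solver actually used
n_iter = pipe.named_steps["mod"].n_iter_
print(n_iter)
```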


In [138]:
max_df = pd.read_csv('text_data/dataset.csv')

max_X_train, max_X_test, max_y_train, max_y_test = train_test_split(max_df['text'], max_df['humor'], test_size=.25)


In [139]:
best_pipe = res.iloc[0]['Best_Estimator'].fit(max_X_train, max_y_train)
print('Best training score: %s'%best_pipe.score(max_X_train,max_y_train))
print('Test score: %s'%best_pipe.score(max_X_test,max_y_test))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Best training score: 0.9603866666666667
Test score: 0.95886


In [141]:
best_pipe.score(max_X_test, max_y_test)

0.95886