# Introduction

This is the third (and final) notebook in the sarcasm detection mentoring project series. In this installment, we will explore ways to represent text data numerically and use those new representations to build a text-based machine learning model that will do a good job identifying sarcasm.

We will be using Scikit-Learn to process the text and build our classifier. We'll be following some of the steps in Scikit-Learn's tutorial on [Working With Text Data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).

Series:
1. [Part 1](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-part-1): Exploring Data and Feature Engineering
2. [Part 2](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-part-2): Splitting Data and Building a Basic Machine Learning Model
3. Part 3: Building a Text-Based Machine Learning Model

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.metrics import accuracy_score, confusion_matrix

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

First, let's load the dataset we split in the previous notebook and take a quick look to remind us of what it looks like.

In [None]:
train_data = pd.read_csv("/kaggle/input/sarcasm-detection-2020-mentoring-proj-part-2/sarcasm_train_split.csv")
test_data = pd.read_csv("/kaggle/input/sarcasm-detection-2020-mentoring-proj-part-2/sarcasm_test_split.csv")
train_data.head()

# Step 6: Train a Text Classifier

We've seen that a basic classifier that ignores the comment text does not work well for this dataset. Let's try building one that *does* use the text. For this approach, we'll be working with only the *comment* column, so let's separate it out.

In [None]:
train_comments = train_data["comment"]
train_comments.head()

The first thing we need to do is to vectorize this text so that instead of working with words, our model can work with numbers. We will do this by creating a **bag of words** representation for each comment. We take a look at all the words in all of the comments and make those the columns of our new feature matrix. Then, we count how many times each word appears in each comment and place those values in the appropriate spots in the matrix. You can read more about what is happening [here](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction).

We can use the ```CountVectorizer``` to accomplish this. There are several different parameters that we can play with, but for now I'll only set ```max_features=20000``` so that we can save some computation time.
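To make the bag-of-words idea concrete before running it on the full dataset, here is a minimal sketch on three made-up comments. The `toy_comments` list and the `toy_*` variable names are illustrative only and are not part of the project data. Each column is one word from the vocabulary, and each cell holds that word's count in the comment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three invented comments, used only for illustration
toy_comments = ["great job", "great great idea", "job done"]

toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(toy_comments)

# vocabulary_ maps each word to its column index; order the words by that index
toy_vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(toy_vocab)
print(toy_bow.toarray())
```

The second row has a 2 in the "great" column because that word appears twice in the second comment.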

In [None]:
vect = CountVectorizer(max_features=20000)
train_bow = vect.fit_transform(train_comments)
train_bow.shape

We have now created a training matrix with 758,079 comment rows and 20,000 feature columns. If we tried to examine a row of this matrix as we normally would, we wouldn't get much out of it. This is because the matrix is very **sparse**, meaning that most of its values are 0. This makes sense, because each comment contains only a few of those 20,000 words.
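One way to inspect a sparse matrix is to look only at its non-zero entries. The sketch below does this on a few invented comments (again, `toy_comments` and the other names are assumptions for illustration). It computes the fraction of non-zero cells and recovers the words present in the first row.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented comments, used only for illustration
toy_comments = ["yeah right", "sure that will work", "nice work"]

toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(toy_comments)

n_rows, n_cols = toy_bow.shape
density = toy_bow.nnz / (n_rows * n_cols)  # fraction of cells that are non-zero

# Column indices of the non-zero entries in the first row
nonzero_cols = toy_bow[0:1].nonzero()[1]

# Invert the word -> column mapping to look words up by column index
index_to_word = {int(idx): word for word, idx in toy_vect.vocabulary_.items()}
words_in_first = sorted(index_to_word[int(i)] for i in nonzero_cols)

print(toy_bow.shape, toy_bow.nnz, words_in_first)
```

On the real training matrix the same `nnz` calculation should give a much smaller density, since each comment uses only a handful of the 20,000 columns.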

Next, we'll start with an untuned ```LogisticRegression``` model, so that we can compare the results to our attempts in the previous notebook. I will only set ```random_state=42```, as usual. Then, I'll cross validate it with 3 folds to get some accuracy values. (Note that this takes a long time to run.)

In [None]:
log_reg_model = LogisticRegression(random_state=42)
cross_validate(log_reg_model, train_bow, train_data["label"], cv=3, scoring="accuracy", n_jobs=-1)

That's a big difference from our previous classifier! Even with a limited and unrefined text representation, our Logistic Regression model can now achieve 69% accuracy. That's the power of text data. But are there things we can try to improve on this further? Of course!

We can try other models (especially ones that are quicker to train). We can try playing with the representation of the data. And we can cross validate to choose the best parameters for everything.

## CountVectorizer

First, let's start by taking a closer look at some of the parameters in our ```CountVectorizer```. The first parameter I want to set is ```strip_accents='unicode'```, which will remove accented letters from the comments and replace them with unaccented ones.

Next, we can see that the vectorizer has built-in functionality to remove **stopwords**, or words that are so common in the English language that they are useless for classification. Though we can set the stopwords list to be the pre-built one, it may remove words that might be useful to us. Therefore, I will create my own.

The next useful feature is ```ngram_range```, which allows us to use not only single words, but also multi-word phrases (called **ngrams**) as features. By default, it's set to ```(1, 1)```, which means we only use ngrams of length 1 (i.e. single words, or **unigrams**). We can set it to ```(1, 2)``` to use unigrams and bigrams, or higher numbers if they are useful. I will save this for cross validation.
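As a small illustration of what ```ngram_range``` changes, the sketch below fits two vectorizers on a single invented comment and compares the features they extract. The comment text and variable names are made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One invented comment, used only for illustration
toy_comment = ["oh great another meeting"]

unigram_vect = CountVectorizer(ngram_range=(1, 1)).fit(toy_comment)
bigram_vect = CountVectorizer(ngram_range=(1, 2)).fit(toy_comment)

unigram_features = sorted(unigram_vect.vocabulary_)
bigram_features = sorted(bigram_vect.vocabulary_)

print(unigram_features)  # single words only
print(bigram_features)   # single words plus adjacent word pairs
```

With four words in the comment, ```(1, 2)``` adds three bigrams on top of the four unigrams, which is why the feature count grows quickly once longer n-grams are allowed.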

We then have two closely related parameters we can play with: ```min_df``` and ```max_df```. These parameters allow us to automatically disregard words that occur in too few or too many comments. For example, words that appear in only a tiny fraction of the comments would not be very representative and may not be useful for classification. Likewise, words that appear in most comments will not be that useful either.

I will set ```max_df=0.70```, which will throw out all words that occur in more than 70% of comments. This makes sense to me because about 50% of the comments are sarcastic and 50% are not, so a word will be useful if it mostly occurs *only in sarcastic comments* or *only in non-sarcastic ones*, hence a maximum percentage not far above 50%. I will set ```min_df=0.0001```, which throws out words that occur in fewer than 76 of our 758,079 comments (0.0001 × 758,079 ≈ 75.8), because such rare words do not have a high chance of being useful.
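The effect of these two thresholds can be seen on a toy example. In this sketch the comments are invented, and ```min_df=2``` is given as an absolute count (an integer) rather than a proportion so that the tiny corpus is not filtered down to nothing. The last lines simply redo the arithmetic for the proportion used on the full training set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Invented comments, used only for illustration.
# "so" is in 3 of 4 comments (75%), "funny" is in 2, every other word is in 1.
toy_comments = ["so funny", "so true", "so what", "funny story"]

# Integer min_df = absolute number of comments; float max_df = proportion of comments
toy_vect = CountVectorizer(min_df=2, max_df=0.70)
toy_vect.fit(toy_comments)

kept_words = sorted(toy_vect.vocabulary_)
print(kept_words)

# Document-count threshold implied by min_df=0.0001 on 758,079 training comments
min_doc_count = 0.0001 * 758_079
print(min_doc_count)
```

Only "funny" survives: "so" is too common for ```max_df```, and the remaining words are too rare for ```min_df```.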

The last useful parameter I will consider is ```max_features```, which controls the maximum number of words we will retain as features in our new feature matrix. By setting ```max_df``` and ```min_df``` we will be able to limit the number of features we have. However, once we start considering bigrams, trigrams, and (maybe) beyond, we will once again have an enormous list. I will limit the number of features we keep to 50,000 or fewer. This is another parameter we can potentially configure with cross validation.

Let's create our list of stopwords and our vectorizer based on those settings.

In [None]:
# make sure stopwords are all lowercase because
# the vectorizer makes everything lowercase by default
stopwords = ["the", "a", "an", "she", "he", "i", "you", "me", "they",
             "her", "his", "your", "their", "my", "we", "our", "ours",
             "hers", "yours", "it", "its", "him", "them", "theirs",
             "this", "that", "is", "was", "are", "were", "am", "or",
             "as", "of", "at", "by", "for", "with", "to", "from", "in",
             "m", "s", "ve", "d", "ll", "o", "re"]

vect = CountVectorizer(strip_accents='unicode', stop_words=stopwords, min_df=0.0001, max_df=0.70)
train_bow = vect.fit_transform(train_comments)
train_bow.shape

## TfidfTransformer

The next step suggested in the [tutorial](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#from-occurrences-to-frequencies) is to use a ```TfidfTransformer``` to transform the matrix of counts into a matrix of scaled term frequencies. This piece of our pipeline also has some parameters we could vary, but I will just use the defaults.

Note that the transformation is performed on the bag of words matrix, not on the original text data.
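To get a feel for what the transformer does with its default settings, here is a small sketch on invented comments (the names are illustrative). It checks two properties: each row is rescaled to unit length, and within a comment a rarer word receives a larger weight than a more common one with the same count.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Invented comments: "so" appears in 3 of them, "funny" in 2
toy_comments = ["so funny", "so true", "so what", "funny story"]

toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(toy_comments)

# The transformer is fit on the count matrix, not on the raw text
toy_tf = TfidfTransformer().fit_transform(toy_bow)

# Default settings normalize every row to unit Euclidean length
row_norms = np.linalg.norm(toy_tf.toarray(), axis=1)

# Weights of "so" and "funny" in the first comment ("so funny")
first_row = toy_tf[0:1].toarray()[0]
so_weight = first_row[toy_vect.vocabulary_["so"]]
funny_weight = first_row[toy_vect.vocabulary_["funny"]]

print(row_norms)
print(so_weight, funny_weight)
```

Both words occur once in the first comment, but "funny" is weighted higher because it appears in fewer comments overall.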

In [None]:
tf_trans = TfidfTransformer()
train_tf = tf_trans.fit_transform(train_bow)
train_tf.shape

## Models

Finally, the last piece of our pipeline is the actual model. There are plenty of different models we can use for classification, but I will focus on three: Logistic Regression, Naive Bayes, and Support Vector Machine. I have chosen Logistic Regression because it is a good starting point and the other two because they are suggested in the Scikit-Learn tutorial and are also good go-to models.

For each model, we'll examine the parameters that we can fix and tune. Then, we'll perform cross validation with ```GridSearchCV```. The only difference here compared to the previous notebook is that we'll be putting our vectorizer, transformer, and model into a pipeline for the grid search to be performed on each of the steps at the same time.
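As a quick illustration of how a pipeline exposes the parameters of each stage to grid search, the sketch below builds a small pipeline and looks up a parameter by its combined name. The stage names mirror the ones used later in this notebook; `toy_pipeline` itself is just an example object.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tftrans", TfidfTransformer()),
    ("model", MultinomialNB()),
])

# Every stage parameter is reachable as "<stage name>__<parameter name>"
all_params = toy_pipeline.get_params()
print(all_params["vect__ngram_range"])
print(all_params["model__alpha"])

# The same naming is used to change a parameter, which is what grid search does
toy_pipeline.set_params(vect__ngram_range=(1, 2))
print(toy_pipeline.named_steps["vect"].ngram_range)
```

This is why the keys of a parameter grid for a pipeline look like ```vect__ngram_range``` rather than just ```ngram_range```.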

### Logistic Regression

Since we discussed Logistic Regression in the previous notebook, I will not bother to repeat it all. I will use the same approach as last time. Note how the pipeline is built and each stage is included. The names of the stages you specify here will be used in the parameter grid for grid search.

In [None]:
# recreate the vectorizer and transformer so they are not fit yet
vect = CountVectorizer(strip_accents='unicode', stop_words=stopwords, min_df=0.0001, max_df=0.70)
tf_trans = TfidfTransformer()

# create the model
log_reg_model = LogisticRegression(random_state=42, penalty="elasticnet", solver="saga")

pipeline = Pipeline([
    ('vect', vect),
    ('tftrans', tf_trans),
    ('model', log_reg_model)
])

Also note that this will take a very long time to run. There are 45 different combinations of parameters to cross-validate for, and there will be 3 folds for each. This means that we will train and test 135 times for Logistic Regression. (This is why I limited myself to only 30,000 max features and only 3 folds. Feel free to play with the parameter grid if you have more or less time/resources.)
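The combination count can be double-checked without training anything. This sketch uses Scikit-Learn's ```ParameterGrid``` on a copy of the grid values used for Logistic Regression in this notebook; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the Logistic Regression grid values, used only for counting
logreg_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__max_features": [5000, 15000, 30000],
    "model__l1_ratio": [0.0, 0.25, 0.50, 0.75, 1.0],
}

n_folds = 3
n_combinations = len(ParameterGrid(logreg_grid))  # 3 * 3 * 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)
```

With the default ```refit=True```, ```GridSearchCV``` also trains the best combination once more on the full training data after the search.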

Note that the input to the grid is the original comment text. This is because the vectorizer and transformer are part of the pipeline. The text comments will go into the pipeline, through the vectorizer, through the transformer, and into the model automatically.

(WARNING: This will take a very long time to run. Please refer to the compiled notebook for results if you are pressed for time.)

In [None]:
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__max_features': (5000, 15000, 30000),
    'model__l1_ratio': (0.0, 0.25, 0.50, 0.75, 1.0)
}

grid_logreg = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3, n_jobs=-1)
grid_logreg.fit(train_comments, train_data["label"])

In [None]:
print(grid_logreg.best_score_)
for param_name in sorted(param_grid.keys()):
    print("%s: %r" % (param_name, grid_logreg.best_params_[param_name]))

#### Logistic Regression Results

After this grid search finally finishes running, we can see that the best model it found had 15,000 max features, an n-gram range of (1, 3), and an ```l1_ratio``` of 0.75. This model achieved an accuracy of 0.70. We could possibly improve upon this by trying more parameter values on a finer grid. However, we also have two other models to attempt.

### Naive Bayes

The Naive Bayes classifier is based on probability, using Bayes' theorem. You can find more details on how it's computed in Scikit-Learn's [User Guide](https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes). The particular type of Naive Bayes we are using is Multinomial Naive Bayes, which is good for working with text datasets like the one we have.

The only parameter to modify here would be ```alpha```, which controls the smoothing of probabilities, but I will leave it at the default value of 1.
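To show what the smoothing does, here is a minimal sketch on a hand-made count matrix with two hypothetical words and one comment per class. With ```alpha=1```, each word probability is estimated as (count + alpha) / (total count in the class + alpha × number of words), so a word that never appears in a class still gets a non-zero probability.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Rows are comments, columns are counts of two hypothetical words
toy_counts = np.array([[2, 0],
                       [0, 3]])
toy_labels = np.array([0, 1])

toy_nb = MultinomialNB(alpha=1.0).fit(toy_counts, toy_labels)

# feature_log_prob_ stores log P(word | class); exponentiate to get probabilities
word_probs = np.exp(toy_nb.feature_log_prob_)
print(word_probs)
```

For class 0 this gives (2 + 1) / (2 + 2) = 0.75 and (0 + 1) / (2 + 2) = 0.25. Without smoothing, the second word would have probability 0 in class 0.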

In [None]:
# recreate the vectorizer and transformer so they are not fit yet
vect = CountVectorizer(strip_accents='unicode', stop_words=stopwords, min_df=0.0001, max_df=0.70)
tf_trans = TfidfTransformer()

# create the NB model
nb_model = MultinomialNB()

pipeline = Pipeline([
    ('vect', vect),
    ('tftrans', tf_trans),
    ('model', nb_model)
])

In this case, our grid search will only have 9 combinations of parameters to try and will only have to train and test 27 times. (But it will still take plenty long to run!)

In [None]:
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__max_features': (5000, 15000, 30000)
}

grid_nb = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3, n_jobs=-1)
grid_nb.fit(train_comments, train_data["label"])

In [None]:
print(grid_nb.best_score_)
for param_name in sorted(param_grid.keys()):
    print("%s: %r" % (param_name, grid_nb.best_params_[param_name]))

#### Naive Bayes Results

The results show that the best classifier achieved a maximum accuracy of 70% with 15,000 features that include uni-, bi-, and trigrams. This is just about the same as the Logistic Regression classifier above. One option we have is to add some sort of feature selection step to reduce the number of features we train on. The Logistic Regression classifier did this as part of the training process (due to the ```l1_ratio``` allowing for some Lasso regularization), but the Naive Bayes classifier would need this to be done explicitly. But first, we can try the Support Vector Machine.

### Support Vector Machine

Support Vector Machine models essentially draw a line (more generally, a hyperplane) to divide one class of data points from the other. The trick is finding the *best* line. You can learn about the benefits of SVMs and how they work in the Scikit-Learn [User Guide](https://scikit-learn.org/stable/modules/svm.html). In particular, we will be using an SVM solved by Stochastic Gradient Descent, which is an approach used to train linear classifiers with a convex cost function. You can learn more about it in the [User Guide](https://scikit-learn.org/stable/modules/sgd.html#sgd) as well.

There are lots of parameters to modify in the ```SGDClassifier```. We can change the type of loss function and the type of penalty term we use. We can change the ```l1_ratio``` that has to do with the regularization penalty as well. We can change the number of iterations to permit and the tolerance for the stopping condition. We can change how the learning rate is determined, and we can change various constants that are relevant to the mathematical formulae that define the SVM.
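As a small sketch of how these settings fit together, the code below trains an ```SGDClassifier``` on four invented 2-D points. It relies on the default ```loss="hinge"```, which is what makes the model a linear SVM. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Four invented points in two well-separated groups
X_toy = np.array([[0.0, 0.0], [0.5, 0.2], [3.0, 3.0], [3.5, 2.8]])
y_toy = np.array([0, 0, 1, 1])

# penalty="elasticnet" mixes L1 and L2 regularization; l1_ratio sets the mix
toy_svm = SGDClassifier(penalty="elasticnet", l1_ratio=0.15, random_state=42)
toy_svm.fit(X_toy, y_toy)

print(toy_svm.loss)             # the default hinge loss gives a linear SVM
print(toy_svm.coef_.shape)      # one weight per feature
print(toy_svm.intercept_.shape)
```

The fitted model is just a weight vector and an intercept, which is the "line" that separates the two classes.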

In [None]:
# recreate the vectorizer and transformer so they are not fit yet
vect = CountVectorizer(strip_accents='unicode', stop_words=stopwords, min_df=0.0001, max_df=0.70)
tf_trans = TfidfTransformer()

# create the SVM model
svm_model = SGDClassifier(penalty="elasticnet", random_state=42, n_jobs=-1)

pipeline = Pipeline([
    ('vect', vect),
    ('tftrans', tf_trans),
    ('model', svm_model)
])

For the sake of simplicity and computation time, I will only vary the parameters for ```ngram_range``` and ```max_features``` in the preprocessing steps as before, as well as the ```l1_ratio``` that controls the regularization. I will set ```random_state``` and ```n_jobs``` as usual, and I will also set ```penalty="elasticnet"``` so that the ```l1_ratio``` matters.

This gives our grid search 3 * 3 * 6 = 54 different combinations of parameters to try, meaning that with cross-validation it will train and test 162 times. As you may imagine, running this code will take a long while. (Refer to the compiled notebook for the results.)

In [None]:
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__max_features': (5000, 15000, 30000),
    'model__l1_ratio': (0.0, 0.15, 0.40, 0.60, 0.85, 1.0)
}

grid_svm = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=3, n_jobs=-1)
grid_svm.fit(train_comments, train_data["label"])

In [None]:
print(grid_svm.best_score_)
for param_name in sorted(param_grid.keys()):
    print("%s: %r" % (param_name, grid_svm.best_params_[param_name]))

#### Support Vector Machine Results

Our results show that the best model as determined by our grid search achieved 0.68 accuracy, with ```l1_ratio=0``` (indicating Ridge regularization) and a list of 15,000 uni-, bi-, and trigrams. This is below the accuracy of the other models, but since the difference is very slight it may not be significant.

All three of the models we tried agreed upon 15,000 features and the inclusion of bi- and trigrams. If we were to continue running grid search to select a model, we could try running it with ```max_features``` values of 10,000, 15,000, 20,000, and 25,000. We may also decide to try 4-grams.

There are also other ways we can modify the SVM model. In particular, the ```SGDClassifier``` is limited to a linear kernel function because it uses Stochastic Gradient Descent. If we use the ```SVC``` model instead, we can try other kernel functions, such as RBF and polynomial, that may fit the data better. In the interest of brevity, I will not try those as part of this notebook.

## Training the Final Model

So which model do we choose? Naive Bayes and Logistic Regression had very similar results. I will choose Naive Bayes because it trains much faster on our dataset. Let's create our final pipeline and train our final model. Don't forget to fill in the chosen best parameters for each step.

In [None]:
# recreate the vectorizer and transformer so they are not fit yet
vect = CountVectorizer(strip_accents='unicode', stop_words=stopwords, min_df=0.0001, max_df=0.70, max_features=15000, ngram_range=(1, 3))
tf_trans = TfidfTransformer()

# create the NB model
nb_model = MultinomialNB()

pipeline = Pipeline([
    ('vect', vect),
    ('tftrans', tf_trans),
    ('model', nb_model)
])

pipeline.fit(train_comments, train_data["label"])

Now let's make predictions on all our test data, and then calculate the final accuracy.

In [None]:
test_comments = test_data["comment"]
preds = pipeline.predict(test_comments)
preds.shape

In [None]:
acc = accuracy_score(test_data["label"], preds)
acc

The final accuracy for this model is 69%, which closely agrees with the 70% accuracy we got during cross-validation. This is good: it means we probably didn't overfit our model. We can also examine our results further with other metrics. For example, let's see if we do equally well in predicting both classes by looking at the confusion matrix.

In [None]:
confusion_matrix(test_data["label"], preds) / test_data.shape[0]

In a confusion matrix, the rows represent true labels and columns represent predicted labels. This shows how many sarcastic and non-sarcastic comments were classified correctly and incorrectly. The labels are in sorted order, so the first row and column correspond to the non-sarcastic label (0) and the second row and column correspond to the sarcastic label (1).
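The row and column convention is easy to verify on a tiny hand-made example. The labels below are invented for illustration and have nothing to do with the real predictions.

```python
from sklearn.metrics import confusion_matrix

# Invented true labels and predictions: three comments per class
toy_true = [0, 0, 0, 1, 1, 1]
toy_pred = [0, 0, 1, 1, 0, 1]

# Rows are true labels, columns are predicted labels, both in sorted order
toy_cm = confusion_matrix(toy_true, toy_pred)
toy_fractions = toy_cm / len(toy_true)

print(toy_cm)
print(toy_fractions)
```

The first row reads: of the three truly non-sarcastic comments, two were predicted as 0 and one was predicted as 1. Dividing by the number of comments turns the counts into fractions that sum to 1.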

From this confusion matrix, we can understand that we are a little better at classifying non-sarcastic comments than sarcastic comments. Since each class makes up 50% of the dataset, we can see that 6 percentage points more comments were correctly classified as non-sarcastic (37% of total comments) than as sarcastic (31% of total comments). We can also see that the non-sarcastic label was predicted more often overall: summing down the first column shows us that about 56% of all test points were classified as non-sarcastic.
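The arithmetic behind these statements can be sketched as follows. Note the assumptions: the diagonal values are the rounded percentages quoted above, and the off-diagonal values are inferred by assuming each class is exactly 50% of the test set. They are not read from the actual notebook output, so treat the results as approximate.

```python
import numpy as np

# Rows: true label (0, 1); columns: predicted label (0, 1).
# Diagonal: rounded values quoted in the text.
# Off-diagonal: inferred as 0.50 minus the diagonal (assumed 50/50 class split).
approx_cm = np.array([[0.37, 0.13],
                      [0.19, 0.31]])

accuracy = np.trace(approx_cm)                # share of correct predictions
predicted_share = approx_cm.sum(axis=0)       # how often each label was predicted
per_class_recall = np.diag(approx_cm) / approx_cm.sum(axis=1)

print(round(accuracy, 2))
print(predicted_share.round(2))
print(per_class_recall.round(2))
```

Under these assumptions, roughly 74% of non-sarcastic comments and 62% of sarcastic comments are recovered. The accuracy from the rounded values comes out one point below the reported 69%, which is the size of gap that rounding alone can produce.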

Rather than settling on this model, we could have continued cross-validating to find better parameters and other model types. For example, we did not try Random Forests or any other ensemble methods. At this point, if we decide to go back and try different models, we should re-split the data with a different random split. This is to ensure that we do not overfit to the test set now that we have seen final results on it.

Remember, we split off this test set to simulate our model's performance on new, never-before-seen data. When a model is put into production, it will receive completely new input data. If we overfit our model during training, it will not perform well in production.

# Conclusion

This concludes the Sarcasm Detection Project, in which we trained a final model that is approximately 70% accurate. What do we do next? If this project had a purpose beyond learning and practice, we may have to implement our model in production. We would also likely have to present our results to the relevant stakeholders, discussing our approach, our model's benefits, and its limitations. We may also have to go back to re-train our model if there are changes to the data.

There are many more things we could try to improve our model, and in an industry project we likely would. However, I will conclude this tutorial here. I hope this notebook and the series are helpful to whoever reads them.

[\[Prev\]](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-part-2) >> Part 2: Splitting Data and Building a Basic Machine Learning Model