# Introduction

This is the second notebook in the sarcasm detection mentoring project series. While in the first one we dealt with data exploration and feature engineering, in this one we will train some models. We'll start by learning to properly split the data and then move on to training a basic model and understanding cross-validation.

Series:
1. [Part 1](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-part-1): Exploring Data and Feature Engineering
2. Part 2: Splitting Data and Building a Basic Machine Learning Model
3. [Part 3](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-part-3): Building a Text-Based Machine Learning Model

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load in

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split, cross_validate

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# Step 4: Split the Data

Before we start training our models, we must first split the data into a training set and a test set. We do this because we want to make sure that our test set is entirely separate from the model as it's being trained. If we allowed the model to see the test set while it was training, it would taint our results. It would be like peeking at the answer key while taking an exam.

We must also be vigilant to prevent any kind of data leakage, which is when information from the test set finds its way into training even if you aren't using the test data itself. For example, if we were to [standardize](https://medium.com/@swethalakshmanan14/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff) a column by subtracting its mean and dividing by its standard deviation, we must make sure to calculate the mean and standard deviation **only from the training data**. Then, we can standardize both the training data and the test data using that same mean and standard deviation, so that the test set never influences those values. As far as we are concerned, the test set does not exist until we have a finished model.
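To make the standardization example concrete, here is a minimal sketch using made-up numbers rather than the real dataset. The arrays `train_col` and `test_col` are hypothetical stand-ins for one numeric column after the split.

```python
import numpy as np

# Toy stand-in for one numeric column (hypothetical values, not the real data).
rng = np.random.default_rng(42)
train_col = rng.normal(loc=10.0, scale=3.0, size=750)
test_col = rng.normal(loc=10.0, scale=3.0, size=250)

# Compute the statistics from the training data only.
train_mean = train_col.mean()
train_std = train_col.std()

# Apply those same training statistics to both sets.
train_scaled = (train_col - train_mean) / train_std
test_scaled = (test_col - train_mean) / train_std

# The training column ends up centred with unit spread;
# the test column is close to that, but not exactly.
print(train_scaled.mean(), train_scaled.std())
print(test_scaled.mean(), test_scaled.std())
```

Scikit-Learn's `StandardScaler` follows the same pattern: call `fit()` on the training data only, then `transform()` on both the training and the test data.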

Before we get started, let's take a look at the first few rows of the dataset, just to refresh our memories about what it looks like. I am loading the data from the output file of my [first notebook](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-parts-1-3), which contained all of the data exploration and feature engineering for this project.

In [None]:
data = pd.read_csv("/kaggle/input/sarcasm-detection-2020-mentoring-proj-part-1/sarcasm_prepped_data.csv")
data.head(10)

One thing to consider when we split the data is the balance of class labels in each split. As we can see from the output of ```value_counts()``` below, the full dataset is split about 50/50 between sarcastic and non-sarcastic comments. We should aim to maintain a similar ratio in the train and test sets.

In [None]:
data["label"].value_counts()

While you could certainly write a quick function to select random data points for your training and test sets, I will be using the ```train_test_split()``` function from Scikit-Learn's ```model_selection``` module. It's nice to have professionally pre-made functions so that we don't have to implement our own.

I set the test size to be 0.25, which means that the resulting test set will consist of 25% of the original data set, and the training set will be the remaining 75%.

In [None]:
train_data, test_data = train_test_split(data, test_size=0.25, random_state=42)
test_data.shape

As we can see from the number of rows in the test set, it contains about 25% of the approximately one million original data points.

Consider this: we know our class labels are balanced, but what if they weren't? Would there be any issues with splitting the data that we would have to guard against?

Indeed, if we had a huge disparity in the classes, randomly sampling data points could leave us with a test set that is entirely made up of one class! This would definitely not be useful for testing. To ensure that doesn't happen, we can perform [stratified sampling](https://datascience.stackexchange.com/questions/16265/is-stratified-sampling-necessary-random-forest-python), which, luckily, is already a feature in the ```train_test_split()``` function through its ```stratify``` parameter!
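As a rough illustration of stratified sampling, the sketch below builds a deliberately imbalanced toy data frame (the `toy` frame and its `feature` column are made up for this example) and passes the label column to the `stratify` argument of `train_test_split()`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 950 rows of class 0, 50 rows of class 1.
toy = pd.DataFrame({
    "feature": np.arange(1000),
    "label": [0] * 950 + [1] * 50,
})

# stratify=... asks for the same class proportions in both splits.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["label"]
)

# Both fractions should be close to the overall 5% of class 1.
print(toy_train["label"].mean())
print(toy_test["label"].mean())
```

With the roughly 50/50 labels in this project, a plain random split is already very likely to be balanced, so stratifying matters most when one class is rare.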

# Step 5: Train a Simple Classifier

Before we get into text representation and NLP techniques, we'll try making a classifier that doesn't need anything like that. Let's take some of the features we have and see if they can predict sarcasm. We'll train a Logistic Regression model, which is a good starting point for a basic classification model.

(A nice brief description of Logistic Regression can be found in [this article](https://towardsdatascience.com/machine-learning-basics-part-1-a36d38c7916) and a more complex description in the [User Guide](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) of the Scikit-Learn library.)

We'll be using the Scikit-Learn library to train all of the models in this project. It has an excellent selection of models, algorithms, and other helpful functions for any situation, as well as the most marvellous documentation ever written.

Let's take a look at our data and determine which columns we want to use in our model.

In [None]:
train_data.head()

As you may recall, during data exploration I decided that the *ups* and *downs* columns were unclear and unreliable. Therefore, I will be omitting them from the model. I will also throw out all of the non-numeric columns, since the model algorithms have no way to deal with them.

As a result, our training data will consist of the following columns: *score, comment_length, date_year, date_month* and *comment_day*. Of course, we will also keep *label* as our target variable.

In [None]:
basic_train_y = train_data["label"]
basic_train_X = train_data[["score", "comment_length", "date_year", "date_month", "comment_day"]]

basic_test_y = test_data["label"]
basic_test_X = test_data[["score", "comment_length", "date_year", "date_month", "comment_day"]]

basic_train_X.head()

Now, let's build a Logistic Regression model.

In [None]:
from sklearn.linear_model import LogisticRegression

log_reg_model = LogisticRegression(random_state=42)
log_reg_model = log_reg_model.fit(basic_train_X, basic_train_y)
log_reg_model

And that's it, we have a trained model! Yes, it **is** that easy. It's easy to train the model, but the hard part is making sure the model actually does a good job. As you can see from the model information printed out above, a Logistic Regression classifier has a lot of parameters. Initially, we have just trained it with defaults for everything, but this may not result in the best classifier.

In order to ensure we get the best possible classifier, we have to try combinations of values for these parameters and test how well the model performs with each combination. Do we test it on the test set? **No!** Remember that the test set is only for testing our final product. While we're still tinkering with parameters, we can't touch the test set yet. What we could do is split the data again into a validation set, which we won't use for training and will only use to test our in-between attempts at models.

However, this would take away a large amount of data that we could have used for training. We want as much data as possible for training because that will allow our models to train more thoroughly. Instead of splitting a separate validation set, we can tune our parameters through **cross-validation**.

## Cross Validation

The process of **K-fold Cross Validation** is discussed in detail in the *Machine Learning - Fundamentals* article linked above. The gist is that we can take our training set and divide it into *k* equal parts called *folds*. We then pick the parameters we want to test, pick one fold to be the validation set (or the "hold-out" fold), and train the model on the remaining folds. After training, we test it on the hold-out fold, record the score, and then do *all of that again* for the same parameter values, except now we pick a different fold to be the hold-out fold. Once every fold has had its turn as the hold-out fold, we aggregate the validation scores to get a result for the parameter values we were testing. We do this all over again for a new set of parameters. After we test all of them, we can see which parameter combinations had the best scores and choose those for our final model.
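The procedure described above can be sketched by hand as two nested loops. This is only an illustration under a few assumptions: the data is a made-up toy set from `make_classification`, the parameter being varied is `C` (the inverse regularization strength of `LogisticRegression`), and the folds come from a plain shuffled `KFold`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Hypothetical toy data standing in for the training set.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Candidate values for the parameter we want to tune.
candidate_C = [0.01, 1.0, 100.0]
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

mean_scores = {}
for C in candidate_C:
    fold_scores = []
    # Each fold takes one turn as the hold-out fold.
    for train_idx, holdout_idx in kfold.split(X):
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[holdout_idx], y[holdout_idx]))
    # Aggregate the hold-out scores for this parameter value.
    mean_scores[C] = np.mean(fold_scores)

best_C = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best C:", best_C)
```

Writing the loops out once makes it easier to see what the Scikit-Learn helper functions automate.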

If you think this sounds time-consuming and tedious, you're right. But it's an essential step of the machine learning process. Luckily, Scikit-Learn has some functions that will make our life easier. The first of these is the ```cross_validate()``` function, which performs one full run of the k-fold cross validation algorithm for one set of parameters. You can see how it works below.

In [None]:
cross_validate(log_reg_model, basic_train_X, basic_train_y, cv=5, scoring="accuracy")

In the results above, you can see 5 values in each array, one for each of the 5 folds we specified when we ran ```cross_validate()``` with ```cv=5```. At a glance, we can see that our model reaches about 51% accuracy on average. This isn't very good, but maybe by varying some of the other parameters we can make it better.
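Rather than eyeballing the arrays, the per-fold scores can be summarized directly. The sketch below runs `cross_validate()` on a made-up toy dataset (not the sarcasm data) and averages the `test_score` entry of the dictionary it returns.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Hypothetical toy data, used only to show the shape of the output.
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=42)

cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5, scoring="accuracy"
)

# The result is a dictionary of arrays with one entry per fold.
print(sorted(cv_results.keys()))
print(cv_results["test_score"].mean(), cv_results["test_score"].std())
```

The standard deviation gives a rough sense of how much the score moves from fold to fold.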

If we use this ```cross_validate()``` function, we'll have to redefine the model, change the parameters, and rerun the validation manually every time, unless we write a loop to do it. But why resort to loops when Scikit-Learn has anticipated your needs yet again?

Let's use the implementation of cross-validation provided by ```GridSearchCV```, which automates the process for us. All we need to do is provide the model, data, and values for parameters we want to vary. I will set ```penalty``` to be ```"elasticnet"``` and vary the ```l1_ratio``` parameter. This will allow us to try different types of regularization (which is also discussed in the *Machine Learning - Fundamentals* article): an ```l1_ratio``` of 0 corresponds to pure L2 regularization, 1 to pure L1, and values in between to a mix of the two.

(This will take several minutes to run.)

In [None]:
from sklearn.model_selection import GridSearchCV

log_reg_model = LogisticRegression(random_state=42, penalty="elasticnet", solver="saga", max_iter=2000, n_jobs=-1)
param_grid = {"l1_ratio": [0.0, 0.25, 0.50, 0.75, 1.0]}

grid = GridSearchCV(log_reg_model, param_grid, scoring="accuracy", cv=3, n_jobs=-1)
grid.fit(basic_train_X, basic_train_y)

In [None]:
print(grid.best_score_)
print(grid.best_params_)

Above, we can see the results of our cross-validation attempts. The best model achieved an accuracy of 51% using an l1_ratio of 0. Now let's test this model on the test set.

In [None]:
log_reg_model = LogisticRegression(random_state=42, penalty="elasticnet", l1_ratio=0.0, solver="saga", max_iter=2000, n_jobs=-1)
log_reg_model = log_reg_model.fit(basic_train_X, basic_train_y)
score = log_reg_model.score(basic_test_X, basic_test_y)
score

## Discussion

After a grueling process of cross validation, we found our best parameters. We trained this best model on all our training data and finally got to test it on our test set. Our final accuracy is 51%. Now we arrive at the question: **is that good?**

This is a good time to talk about baselines and how to determine what our model evaluation metrics mean. Usually, near the beginning of a machine learning project, before any models are selected or any training is done, we must decide how we will evaluate the model and what baseline score we will compare all our models to.

There are plenty of different metrics we can use to evaluate models, such as accuracy, recall, precision, F1 score, and many others. Not all of them are well suited to every problem. In this project, since our classes are balanced, we can get away with using plain accuracy.
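To see why class balance matters for this choice, consider a small made-up example (the labels below are hypothetical, not from our dataset) in which a classifier that never predicts the positive class still scores high accuracy.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# A useless classifier that never predicts the positive class.
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(accuracy)  # 0.95
print(recall)    # 0.0
```

With balanced classes, the same do-nothing classifier would only reach about 50% accuracy, which is why accuracy is a reasonable metric here.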

### Baselines

But what constitutes a "good" model? Is 51% accuracy high enough? In some difficult prediction cases, it might be. In our case, it is not. This is because it does not noticeably beat either of the two simplest baselines. The first baseline is random guessing: if we have to classify a data point, we just flip a coin and pick either 0 or 1. This gives us a 50% chance to be correct on average. The second baseline is even simpler: just guess that everything is sarcastic, classifying everything as 1. Since our data is about half and half, this would also give us about 50% accuracy.
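Both baselines are easy to simulate. The sketch below uses a made-up, perfectly balanced label array in place of the real ```label``` column, so the numbers are only illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical balanced labels, standing in for the real label column.
labels = np.array([0, 1] * 5000)

# Baseline 1: flip a fair coin for every data point.
coin_flips = rng.integers(0, 2, size=labels.size)
random_accuracy = (coin_flips == labels).mean()

# Baseline 2: predict "sarcastic" (1) for everything.
all_ones = np.ones_like(labels)
constant_accuracy = (all_ones == labels).mean()

print(random_accuracy)    # close to 0.5
print(constant_accuracy)  # exactly 0.5 for perfectly balanced labels
```

Scikit-Learn also provides a `DummyClassifier` (with strategies such as `"uniform"` and `"most_frequent"`) that produces this kind of baseline through the usual `fit()` and `score()` interface.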

Neither of those baselines sounds like a very smart or useful classifier. And yet, our trained and tested model performed about as accurately as they would. This means that our model is not very good. What kind of accuracy would it need in order to qualify as "good"? That depends on each individual case. For very difficult problems, it may be that any gain above the random baseline is noteworthy. For very easy problems, it's possible that anything below 90% is junk.

When creating baselines, it may be useful to research whether other people have tackled this problem before and see what their baselines and results were. It may also be useful to create a human baseline, in which you give a small subset of the data to people and have them try to solve the task. For this problem, based on my own experience and intuition, I would say that anything below 70% is not worth it and anything above 85% is probably pretty good.

And remember that these baselines should be set when you **begin** working on the problem, not once you already have results. Moving the finish line after you've already started the race is dishonest.

### Improvements

Now that we know what constitutes a good model, what improvements can we make in our model to get it there? First, we can try using completely different algorithms, such as Random Forests or Support Vector Machines. It's possible that some algorithms simply don't work well for a certain problem. Second, we can try cross-validating more parameters or more values of parameters.

However, in the case of this initial non-text-based classifier, I think the problem runs deeper than that. I believe we would have to go back to the drawing board and engineer some better features to train on. If you recall, it didn't look like months or years were predictive of sarcasm at all. We would have to create some features that do have strong relationships with sarcasm. If we find that we have too many features, we can also try running feature selection algorithms to reduce the noise.

What we can also do is use the actual text, instead of features generated from it like *comment_length*. This is much more likely to give us good results, and this is what we'll focus on in the next notebook.

(I will also now output the training and test data frames, so that I can reuse the same split of the data in the next notebook.)

In [None]:
train_data.to_csv("sarcasm_train_split.csv", index=False)
test_data.to_csv("sarcasm_test_split.csv", index=False)

[\[Prev\]](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-part-1) >> Part 1: Exploring Data and Feature Engineering

[\[Next\]](https://www.kaggle.com/yastapova/sarcasm-detection-2020-mentoring-proj-part-3) >> Part 3: Building a Text-Based Machine Learning Model