# Cross Validation Considerations for Movie Review Sentiment Analysis

This kernel describes and compares stratified k-fold cross validation and group k-fold cross validation for the [Stanford parsed Rotten Tomatoes dataset](https://www.kaggle.com/c/movie-review-sentiment-analysis-kernels-only/data).

In [None]:
import pandas as pd

# None disables column truncation (pandas >= 1.0; older versions used -1 for this)
pd.set_option('display.max_colwidth', None)

data = pd.read_csv('../input/train.tsv', delimiter='\t')

### Class Balance

In [None]:
print("Sentiment Count:", data['Sentiment'].size)
print("Sentiment Distribution:", data['Sentiment'].value_counts(normalize=True), sep='\n')

The data is not evenly distributed between classes, so validation splits should be stratified, i.e. each fold should have roughly the same class distribution as the full data set.

Also, while outside the scope of this kernel, [precision and recall](https://en.wikipedia.org/wiki/Precision_and_recall) might be better metrics than accuracy. It may be worth trying training methods that boost the importance of the underrepresented classes, such as [oversampling](https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis) or class weighting.
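
As a rough illustration of class weighting, the sketch below computes weights as `n_samples / (n_classes * class_count)`, which is the heuristic scikit-learn documents for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy labels are made up for this example and are not taken from the data set.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Hypothetical toy labels, not the real Sentiment column:
# class 2 is three times as frequent as classes 1 and 3.
toy_labels = [2] * 6 + [1] * 2 + [3] * 2
weights = balanced_class_weights(toy_labels)

for cls, weight in weights.items():
    print("class", cls, "weight", round(weight, 3))
```

Rare classes get weights above 1 and the majority class gets a weight below 1. With the pipeline used later in this kernel, passing `class_weight='balanced'` to `LogisticRegression` should have a comparable effect, though I have not tested that here.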

## Cross Validation Methods

### Stratified K-Fold

Preserve the Sentiment distribution in each fold.

http://scikit-learn.org/stable/modules/cross_validation.html#stratified-k-fold

In [None]:
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for f, split in enumerate(skf.split(data, data['Sentiment'])):
    print("Fold", f + 1, "-----")
    test = data.iloc[split[1]]
    # len() gives the row count; DataFrame.size would give rows * columns
    print("Size:", len(test))
    print("Sentiment Distribution:", test['Sentiment'].value_counts(normalize=True), "", sep='\n')

There's a possible issue with stratified k-fold for this data set, however.
Let's look at the distribution of SentenceId 1 across folds.

In [None]:
Id = 1

print("SentenceId", Id, "Sentiment Counts:\n")
for f, split in enumerate(skf.split(data, data['Sentiment'])):
    print("Fold ", f + 1, ":", sep='')
    test = data.iloc[split[1]]
    if Id in test['SentenceId'].values:
        print(test['Sentiment'][test['SentenceId'] == Id].value_counts(sort=False), "\n")
    else:
        print("None\n")
    
    # use this later
    if f == 0:
        split1 = split

This is what the train and test set will look like for fold 1, SentenceId 1.

In [None]:
train = data.iloc[split1[0]]
test = data.iloc[split1[1]]

print("Train -----\nSentenceId =", Id, "Counts:")
print(train['Sentiment'][train['SentenceId'] == Id].value_counts(sort=False))
display(train[(train['SentenceId'] == Id)])

print("Test -----\nSentenceId =", Id, "Counts:")
print(test['Sentiment'][test['SentenceId'] == Id].value_counts(sort=False))
display(test[(test['SentenceId'] == Id)])

Notice how similar the phrases are between the train and test sets: they are overlapping fragments of the same sentence. Since a large proportion of the phrases in both sets are neutral, the model will appear to perform decently on SentenceId 1 if it classifies all phrases as neutral. I think with this data set, a favorable metric from stratified k-fold cross validation may be telling us how well the model has learned to recognize phrases from sentences it has already seen, rather than how well it recognizes sentiment.

When folds get cross-contaminated like this, models get a misleading boost in performance. What we want is for the cross validation metrics to tell us how the model will generalize to unseen data.
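
One way to quantify this contamination is to count, for each fold, how many groups have rows in both the train and test indices. The sketch below does this on made-up toy data (20 hypothetical "sentences" of 5 "phrases" each), not on the real data set. The helper `shared_groups` is a name invented for this example.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Hypothetical toy data: 20 groups ("sentences") with 5 rows ("phrases") each.
groups = np.repeat(np.arange(20), 5)
y = np.tile([0, 1, 2, 1, 1], 20)   # every group contains a mix of labels
X = np.zeros((len(y), 1))          # features do not affect how folds are drawn


def shared_groups(splits, groups):
    """For each (train, test) split, count groups present on both sides."""
    return [
        np.intersect1d(groups[train_idx], groups[test_idx]).size
        for train_idx, test_idx in splits
    ]


skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
gkf_toy = GroupKFold(n_splits=5)

skf_shared = shared_groups(skf_toy.split(X, y), groups)
gkf_shared = shared_groups(gkf_toy.split(X, y, groups=groups), groups)

print("Stratified k-fold, groups shared per fold:", skf_shared)
print("Group k-fold, groups shared per fold:", gkf_shared)
```

A nonzero count means that rows from the same group sit on both sides of the split. On the real data, the same helper could be called with `data['SentenceId'].values` as `groups`.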

### Group K-Fold

Confine each SentenceId to a single fold.

http://scikit-learn.org/stable/modules/cross_validation.html#group-k-fold

In [None]:
from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
print("SentenceId", Id, "Sentiment Counts:\n")
for f, split in enumerate(gkf.split(data, groups=data['SentenceId'])):
    print("Fold ", f + 1, ":", sep='')
    test = data.iloc[split[1]]
    if Id in test['SentenceId'].values:
        print(test['Sentiment'][test['SentenceId'] == Id].value_counts(sort=False), "\n")
    else:
        print("None\n")

Now, with group k-fold, SentenceId 1 is kept in one fold.

Group k-fold doesn't explicitly stratify, though. With this data set, however, each fold still has a sentiment distribution that is close to the overall distribution.

In [None]:
for f, split in enumerate(gkf.split(data, groups=data['SentenceId'])):
    print("Fold", f + 1, "-----")
    test = data.iloc[split[1]]
    # len() gives the row count; DataFrame.size would give rows * columns
    print("Size:", len(test))
    print("Sentiment Distribution:", test['Sentiment'].value_counts(normalize=True), "", sep='\n')

## Model Example

This is a basic logistic regression [pipeline](http://scikit-learn.org/stable/modules/pipeline.html#pipeline) that uses tf-idf for features.
I use a grid search with group k-fold to find the best value of `C`, the inverse of the l2 regularization strength.
Once the best value for `C` is selected, I'll get the pipeline accuracy reported by stratified k-fold and group k-fold cross validation. Then I'll use the same pipeline to predict on the test set.

Hypotheses:
- stratified k-fold will report the highest accuracy even though the pipeline is the same
- the test set accuracy will be closer to the group k-fold accuracy

In [None]:
import nltk
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf',  TfidfVectorizer()),
    ('lr', LogisticRegression())
])

analyzer = TfidfVectorizer().build_analyzer()
stemmer = nltk.stem.SnowballStemmer('english')

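pipeline.set_params(
    # stem each token produced by the default analyzer
    tfidf__analyzer=lambda x: (stemmer.stem(w) for w in analyzer(x)),
    # note: scikit-learn only applies ngram_range when analyzer is not callable,
    # so with the callable above the features are stemmed unigrams
    tfidf__ngram_range=(1,2),
    lr__solver='sag',
    lr__multi_class='multinomial',
    lr__penalty='l2',
    lr__tol=0.001,
    lr__verbose=False)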

param_grid = {
    'lr__C': np.linspace(1, 3, 11) # I think the value is close to 2 based on previous testing, and want to reduce runtime
}

gs = GridSearchCV(
    pipeline, 
    param_grid=param_grid,
    cv=gkf,
    verbose=1,
    return_train_score=False)

gs.fit(data['Phrase'], y=data['Sentiment'], groups=data['SentenceId'])
print("Best C:", gs.best_params_['lr__C'])
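
The scikit-learn documentation states that `ngram_range` only applies when `analyzer` is not callable, so a callable analyzer has to build its own n-grams if bigrams are wanted. Below is a minimal sketch of one way to do that: tokenize, stem each token, then join neighbouring stems. The names `stemmed_ngram_analyzer` and `toy_stem` are invented for this example, and `toy_stem` is a crude stand-in for `stemmer.stem` so that the block runs without nltk.

```python
import re

# Same regular expression as TfidfVectorizer's default token_pattern
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")


def toy_stem(word):
    """Hypothetical stand-in for stemmer.stem: strips one trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def stemmed_ngram_analyzer(text, stem=toy_stem, ngram_range=(1, 2)):
    """Lowercase, tokenize, stem each token, then emit word n-grams."""
    tokens = [stem(t) for t in TOKEN_RE.findall(text.lower())]
    min_n, max_n = ngram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]


features = stemmed_ngram_analyzer("Great movies need good scripts")
print(features)
```

In the pipeline above, this could be plugged in with something like `tfidf__analyzer=lambda x: stemmed_ngram_analyzer(x, stem=stemmer.stem)`. I have not measured whether the added bigrams change the scores.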

Using the best value for `C`, I'll get the pipeline accuracy reported by stratified k-fold and group k-fold.

In [None]:
from sklearn.model_selection import cross_validate

pipeline.set_params(lr__C=gs.best_params_['lr__C'])

print("Running stratified k-fold...", end='')
skf_results = cross_validate(
    pipeline, 
    X=data['Phrase'], 
    y=data['Sentiment'], 
    cv=skf, 
    return_train_score=False, 
    verbose=False)
print(" done.")

print("Running group k-fold...", end='')
gkf_results = cross_validate(
    pipeline, 
    X=data['Phrase'], 
    y=data['Sentiment'], 
    groups=data['SentenceId'], 
    cv=gkf, 
    return_train_score=False, 
    verbose=False)
print(" done.\n")

print("Stratified k-fold average accuracy:", np.mean(skf_results['test_score']))
print("Group k-fold average accuracy:", np.mean(gkf_results['test_score']))
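
Given the class imbalance noted earlier, it may also be useful to report macro-averaged precision and recall next to accuracy. `cross_validate` accepts a list of scorer names for this. The sketch below uses made-up numeric toy data and a plain `LogisticRegression` rather than the tf-idf pipeline, so the numbers it prints say nothing about the movie review data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical imbalanced toy data: 100 / 30 / 20 rows across three classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], [100, 30, 20])
X = rng.normal(size=(len(y), 4)) + y[:, None]  # class-dependent shift

metric_names = ["accuracy", "precision_macro", "recall_macro"]
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    scoring=metric_names,
)

# With a list of scorers, results are keyed as "test_<scorer name>"
for name in metric_names:
    print(name, "mean:", scores["test_" + name].mean())
```

For the kernel's own comparison, the same `scoring` list could be passed to both `cross_validate` calls above, with the results read from `test_accuracy` instead of `test_score`.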

Now I'll fit the same pipeline on the entire train set and then predict on the test set.

In [None]:
test = pd.read_csv('../input/test.tsv', delimiter='\t')
test_pred = pipeline.fit(data['Phrase'], y=data['Sentiment']).predict(test['Phrase'])

submission = pd.concat([test['PhraseId'], pd.Series(test_pred, name='Sentiment')], axis=1)
submission.to_csv('sample_submission.csv', index=False)