In [None]:
import pandas as pd
from cytoolz import identity
import spacy
from spacy.lang.en import English
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.metrics import confusion_matrix

In [None]:
df = pd.read_csv("../input/classification.csv")
test = pd.read_csv("../input/classification_test.csv")

Note that on kaggle.com I cannot use spaCy's `en_core_web_md` model, but the `en` shortcut is available.

In [None]:
nlp = spacy.load('en', disable=['tagger', 'ner', 'parser'])

Here I do not remove stop words or non-alphabetic tokens.

In [None]:
%%time

def tokenize(text):
    # Keep every token, including stop words and punctuation
    # (tok.text is the same string as tok.orth_)
    return [tok.text for tok in nlp.tokenizer(text)]

df['tokens'] = df['text'].apply(tokenize)
test['tokens'] = test['text'].apply(tokenize)

Let us create a matrix with the counts of every word in each document.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
dtm = CountVectorizer(analyzer=identity)
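To see what such a count matrix looks like, here is a minimal sketch on two made-up token lists (they are not from the data set). It assumes that passing a callable which returns its input unchanged as `analyzer` behaves like `identity` above, so the vectorizer counts the tokens it is given without tokenizing again.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already tokenized documents (illustrative only)
docs = [['goal', 'match', 'goal'], ['match', 'election']]

# The analyzer returns each token list unchanged, like cytoolz.identity
vec = CountVectorizer(analyzer=lambda tokens: tokens)
counts = vec.fit_transform(docs)

# One row per document, one column per distinct token
print(sorted(vec.vocabulary_))
print(counts.toarray())
```

Each distinct token becomes one column, which is why the number of columns depends directly on how the text is tokenized.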

In [None]:
X = dtm.fit_transform(df['tokens'])
X_test = dtm.transform(test['tokens'])
X.shape

So we have 105489 columns here. I got 93050 two weeks ago with the same code, so I wonder whether something changed in the Python modules after an update.

Another lesson here: always run your code before a presentation, in the comfort of your home or office.

In [None]:
from sklearn.naive_bayes import BernoulliNB

In [None]:
modelNB = BernoulliNB()
modelNB.fit(X, df['sports'])
predictions = modelNB.predict(X_test)

I would like a function that computes the error metrics, because I will evaluate different models on both the test set and the `df` data to compare the results.

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def error_metrics(model_name, true_values, predictions, errors=None):
    """Add a column of metrics for `model_name` to `errors` and return it."""
    if errors is None:
        # First call: start a new table with one row per metric
        errors = pd.DataFrame({'metrics': ['accuracy', 'precision', 'recall', 'f1']})

    # The sklearn defaults used here assume binary labels
    accuracy = accuracy_score(true_values, predictions)
    precision = precision_score(true_values, predictions)
    recall = recall_score(true_values, predictions)
    f1 = f1_score(true_values, predictions)
    errors[model_name] = [accuracy, precision, recall, f1]
    return errors
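To make the four numbers concrete, here is a minimal pure-Python sketch of how they follow from the counts of true and false positives and negatives. The labels are made up, `binary_metrics` is a hypothetical helper and not part of scikit-learn, and it assumes the positive class is coded as 1.

```python
def binary_metrics(true_values, predictions):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    pairs = list(zip(true_values, predictions))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}

# Made-up labels: 3 true positives, 1 false positive,
# 1 false negative, 3 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```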

Let us save the errors for this model on both sets in a data frame. 

In [None]:
err = error_metrics(model_name='Test set with all columns',
                    true_values=test['sports'], predictions=predictions)
err = error_metrics(model_name='Train set with all columns',
                    true_values=df['sports'], predictions=modelNB.predict(X), errors=err)
err

Now I would like to see what happens with fewer columns. I will reuse the names of my previous objects, so the notebook must be run from top to bottom.

In [None]:
%%time

def tokenize(text):
    # Lowercase, then drop stop words and tokens that are not purely alphabetic
    return [tok.text for tok in nlp.tokenizer(text.lower())
            if tok.text not in STOP_WORDS and tok.text.isalpha()]

df['tokens'] = df['text'].apply(tokenize)
test['tokens'] = test['text'].apply(tokenize)
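The effect of this filter is easy to see on a single sentence. The sketch below uses stand-ins so that it runs without spaCy: `DEMO_STOP_WORDS` is a tiny hypothetical stop list and the regular expression is only a rough substitute for the spaCy tokenizer, so real output may differ in the details.

```python
import re

# Hypothetical stand-ins: a tiny stop list and a regex tokenizer, not spaCy's
DEMO_STOP_WORDS = {'the', 'in', 'a'}

def demo_tokenize(text):
    # Split into runs of word characters and single punctuation marks
    raw = re.findall(r"\w+|[^\w\s]", text.lower())
    # Same filter as above: no stop words, alphabetic tokens only
    return [tok for tok in raw
            if tok not in DEMO_STOP_WORDS and tok.isalpha()]

tokens = demo_tokenize("The team won 3 games in a row!")
print(tokens)
```

Of the nine raw tokens only four survive, so far fewer distinct tokens, and therefore fewer columns, reach the vectorizer.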

We need to create a new vectorizer and a new model object, and fit them again.

In [None]:
dtm = CountVectorizer(analyzer=identity)
X = dtm.fit_transform(df['tokens'])
X_test = dtm.transform(test['tokens'])
X.shape

In [None]:
modelNB = BernoulliNB()
modelNB.fit(X, df['sports'])
predictions = modelNB.predict(X_test)

In [None]:
err = error_metrics(model_name='Test set with fewer columns',
                    true_values=test['sports'], predictions=predictions, errors=err)
err = error_metrics(model_name='Train set with fewer columns',
                    true_values=df['sports'], predictions=modelNB.predict(X), errors=err)

In [None]:
pd.set_option('display.max_columns', None)
err

Usually our train set shows better results than our test set, because our model was trained on it.

There could be two reasons why the test metrics are better than the training metrics:

1. This particular test set happens to have less variance (noise) than our training set.

2. Our test set is smaller than our training set, and thus contains less noise.

Comparing with the metrics of the model trained on fewer columns, we see that the additional columns add noise, interfere with model training, and decrease the metrics.

Note that with fewer columns the second model performs better on the training set. This is so-called **overfitting**: the model is fitted more closely to the training set. In particular, it is fitted to the random noise of the training set, and our test set might have different random noise.
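One way to read a table like `err` is to subtract the test column from the train column for each metric. The sketch below shows this on placeholder numbers that are NOT results from this notebook; `train_test_gap` is a hypothetical helper, and it assumes the same column layout that `error_metrics` produces.

```python
import pandas as pd

# Placeholder numbers for illustration only; NOT results from this notebook
err_demo = pd.DataFrame({
    'metrics': ['accuracy', 'precision', 'recall', 'f1'],
    'Test set with all columns': [0.90, 0.80, 0.70, 0.75],
    'Train set with all columns': [0.95, 0.90, 0.80, 0.85],
})

def train_test_gap(errors, train_col, test_col):
    """Return train minus test for each metric, indexed by metric name."""
    indexed = errors.set_index('metrics')
    return indexed[train_col] - indexed[test_col]

gap = train_test_gap(err_demo,
                     'Train set with all columns',
                     'Test set with all columns')
print(gap.round(2))
```

A large positive gap means the model does much better on the data it was trained on, which is the pattern described above as overfitting.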

************************************************************************************************************************
Here is a sound alert for when your script finishes running, where 500 is the frequency in hertz and 2000 is the duration in milliseconds (`winsound` is available on Windows only). I found it here: https://stackoverflow.com/questions/16573051/sound-alarm-when-code-finishes

In [None]:
import winsound
winsound.Beep(500, 2000)
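Since `winsound` exists only on Windows, here is a hedged sketch of a fallback for other systems. `beep` is a hypothetical helper: off Windows it just writes the terminal bell character, which may be silent in a notebook or in terminals with the bell disabled.

```python
import sys

def beep(frequency_hz=500, duration_ms=2000):
    """Play a beep on Windows, otherwise emit the terminal bell character."""
    if sys.platform.startswith('win'):
        import winsound  # Windows-only standard library module
        winsound.Beep(frequency_hz, duration_ms)
        return 'winsound'
    # Fallback: ASCII bell; whether it is audible depends on the terminal
    sys.stdout.write('\a')
    sys.stdout.flush()
    return 'bell'

backend = beep()
```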