## Introduction

This is an analysis of Wikipedia comments to create models that identify various types of toxic comments. There is a lot of racist content and swear words in the dataset and some of it will pop up in the analysis. 

The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet) are working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).

we need to build a multi-headed model that’s capable of detecting different types of of toxicity like threats, obscenity, insults, and identity-based hate better than Perspective’s current models. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.

## Dataset
The dataset taken from kaggle : https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data

## Dataset Description
You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

1.toxic
2.severe_toxic
3.obscene
4.threat
5.insult
6.identity_hate

we must create a model which predicts a probability of each type of toxicity for each comment.


In [None]:
import pandas as pd
import pickle
import numpy as np
import nltk
from nltk.corpus import stopwords
import keras
import time
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
from collections import namedtuple
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split 
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy import sparse
from sklearn.pipeline import make_union
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
import lightgbm as lgb
from sklearn.metrics import confusion_matrix

In [None]:
# Global random state and k-fold strategy 
seed = 42
k = 5
cv = StratifiedKFold(n_splits=5, random_state=seed,shuffle=True)

In [None]:
def lgb_f1_score(y_hat, data):
    # https://stackoverflow.com/questions/49774825/python-lightgbm-cross-validation-how-to-use-lightgbm-cv-for-regression
    y_true = data.get_label()
    y_hat = np.round(y_hat) 
    return 'f1', f1_score(y_true, y_hat), True

In [None]:
start = time.time()
def print_time(start):
    time_now = time.time() - start 
    minutes = int(time_now / 60)
    seconds = int(time_now % 60)
    if seconds < 10:
        print('Elapsed time was %d:0%d.' % (minutes, seconds))
    else:
        print('Elapsed time was %d:%d.' % (minutes, seconds))

## Feature Engineering 

In [None]:
def feature_engineering(df, sparse=0): 
    
    # Comment length
    df['length'] = df.comment_text.apply(lambda x: len(x))
    

    # Capitalization percentage
    def pct_caps(s):
        return sum([1 for c in s if c.isupper()]) / (sum(([1 for c in s if c.isalpha()])) + 1)
    df['caps'] = df.comment_text.apply(lambda x: pct_caps(x))

    # Mean Word length 
    def word_length(s):
        s = s.split(' ')
        return np.mean([len(w) for w in s if w.isalpha()])
    df['word_length'] = df.comment_text.apply(lambda x: word_length(x))

    # Average number of exclamation points 
    df['exclamation'] = df.comment_text.apply(lambda s: len([c for c in s if c == '!']))

    # Average number of question marks 
    df['question'] = df.comment_text.apply(lambda s: len([c for c in s if c == '?']))
    
    # Normalize
    for label in ['length', 'caps', 'word_length', 'question', 'exclamation']:
        minimum = df[label].min()
        diff = df[label].max() - minimum
        df[label] = df[label].apply(lambda x: (x-minimum) / (diff))

    # Strip IP Addresses
    ip = re.compile('(([2][5][0-5]\.)|([2][0-4][0-9]\.)|([0-1]?[0-9]?[0-9]\.)){3}'
                    +'(([2][5][0-5])|([2][0-4][0-9])|([0-1]?[0-9]?[0-9]))')
    def strip_ip(s, ip):
        try:
            found = ip.search(s)
            return s.replace(found.group(), ' ')
        except:
            return s

    df.comment_text = df.comment_text.apply(lambda x: strip_ip(x, ip))
    
    return df

def merge_features(comment_text, data, engineered_features):
    new_features = sparse.csr_matrix(df[engineered_features].values)
    if np.isnan(new_features.data).all():
        new_features.data = np.nan_to_num(new_features.data)
    return sparse.hstack([comment_text, new_features])

## Loading Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Reset data and create holdout set. 

df = pd.read_csv('/content/drive/MyDrive/UDEMY_PROJECTS/Project_4_Toxic_Comments_Classification/train.csv')
targets = list(df.columns[2:])
df_targets = df[targets].copy()

df_sub = pd.read_csv('/content/drive/MyDrive/UDEMY_PROJECTS/Project_4_Toxic_Comments_Classification/test.csv', dtype={'id': object}, na_filter=False)

submission = pd.DataFrame()
submission['id'] = df_sub.id.copy()

# Feature Engineering
df = feature_engineering(df)
df_sub = feature_engineering(df_sub)

print('Training labels:')
print(list(df_targets.columns))
print(df_targets.shape)

print('\nTraining data')
df.drop(list(df_targets.columns), inplace=True, axis=1)
df.drop('id', inplace=True, axis=1)
print(list(df.columns))
print(df.shape)


print('\nSubmission data')
df_sub.drop('id', inplace=True, axis=1)
print(list(df_sub.columns))
print(df_sub.shape)

toxic_rows = df_targets.sum(axis=1)
toxic_rows = (toxic_rows > 0)
targets.append('any_label')
df_targets['any_label'] = toxic_rows.astype(int)

new_features = list(df.columns[1:])
print(new_features)

from sklearn.model_selection import train_test_split
df, holdout, df_targets, holdout_targets = train_test_split(df, df_targets, test_size=0.2, random_state=seed)

  out=out, **kwargs)


Training labels:
['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
(159571, 6)

Training data
['comment_text', 'length', 'caps', 'word_length', 'exclamation', 'question']
(159571, 6)

Submission data
['comment_text', 'length', 'caps', 'word_length', 'exclamation', 'question']
(153164, 6)
['length', 'caps', 'word_length', 'exclamation', 'question']


In [None]:
new_features

['length', 'caps', 'word_length', 'exclamation', 'question']

## Multilabel Function 

In [None]:
from sklearn.base import clone
#todo 
# Weights for 
def multi_cv(model, data, labels, k=5, nb_features=False):
    cv = StratifiedKFold(n_splits=k, random_state=None)
    # Creating NB features just once from any_label has about the same 
    # performance as individual labels with faster speed. 
    def log_count_ratio(x, y):
        x = sparse.csr_matrix(x)
        # WARNING: Some scipy modules use indexes that start at 1! 
        # You need to add 1 to an index when performing operations on a csr_matrix 

        p = abs(x[np.where(y==1)].sum(axis=0))
        p = p + 1
        p = p / np.sum(p)

        q = abs(x[np.where(y==0)].sum(axis=0))
        q = q + 1
        q = q / np.sum(q)

        return np.log(p/q)
    
    # Labels must be in a dataframe
    scores = []
    r_values = []
    for label in labels.columns:
        if nb_features:
            r = log_count_ratio(data, labels[label])
            r_values.append(r)
            data = data.multiply(r)
            if np.isnan(data.data).any():
                data.data = np.nan_to_num(data.data)
        score = np.mean(cross_val_score(clone(model), data, labels[label], scoring='f1', cv=cv))
        print(label + ' f1 score: %.4f' % score)
        scores.append(score)
    print('Average (excluding any) f1 score: %.4f' % np.mean(scores[:-1]))
    if nb_features:
        return scores, r_values
    else:
        return scores

# training_comments.data = np.nan_to_num(training_comments.data)

# model = LinearSVC()
# _ = multi_cv(model, training_comments, df_targets, nb_features=True)

## NB Feature Transformer 

This is the primary method that I will use for the NB-SVM models, but I've left other code in to use as a reference. 

In [None]:
class NBFeatures:
    def __init__(self, epsilon=1, sparse=True):
        # How much influence NB features have 
        if not epsilon > 0 and epsilon <= 1:
            raise Exception("Invalid Epsilon value. Must be greater than zero and less than or equal to one.")
        self.epsilon = epsilon
        self.r = None
    
    def log_count_ratio(self, x, y):
        x = sparse.csr_matrix(x)
        # WARNING: Some scipy authors fall in the "index starts at 1" camp
        # You need to add 1 to an index when performing operations on a csr_matrix 
        p = abs(x[np.where(y==1)].sum(axis=0))
        p = p + 1
        p = p / np.sum(p)
        q = abs(x[np.where(y==0)].sum(axis=0))
        q = q + 1
        q = q / np.sum(q)
        return np.log(p/q)
    
    def fit(self, x, y):
        self.r = self.log_count_ratio(x, y)
    
    def transform(self, x):
        if self.r == None: 
            raise Exception("Model not fit, can't transform.")
        transformed = x.multiply(self.r)
        return x.multiply(1-self.epsilon) + transformed.multiply(self.epsilon)
        #return np.multiply(x, self.r)
    
    def fit_transform(self, x, y):
        self.r = self.log_count_ratio(x, y)
        return self.transform(x, y)


#nb_trans = NBFeatures(0.5)
#new = nb_trans.fit_transform(training_comments, np.array(df_targets.iloc[:,-1]))
#nb_trans.r.shape

A separate helper function to calculate the log count ratio that can be used for experimentation. 

In [None]:
def log_count_ratio(x, y):
    x = sparse.csr_matrix(x)
    # WARNING: Some scipy authors fall in the "index starts at 1" camp
    # You need to add 1 to an index when performing operations on a csr_matrix 

    p = abs(x[np.where(y==1)].sum(axis=0))
    p = p + 1
    p = p / np.sum(p)

    q = abs(x[np.where(y==0)].sum(axis=0))
    q = q + 1
    q = q / np.sum(q)

    return np.log(p/q)


# Vectorizing text

It's necessary to vectorize text before inputting into machine learning models. This is a process of translating string data into numerical data that the computer can better understand. Vectorized data is usually sparse, with an array where the features contain either word counts or another way of representing the occurance of characters or words in a string. This is done with a vectorizer object, which stores a dictionary of characters or words and their associated integer representation, along with relevant statistics if applicable. 

The strategy I'm going to use here is term frequency - inverse document frequency. This is a statistic that describes the usefulness of a string of characters by looking at the frequency that it occurs in an individual document (here, a single comment) and the inverse of its frequency in all of the documents in the dataset. 

That means that a word that is used frequently in a comment in this dataset, but that few comments in the dataset feature, is probably useful to the model. But a string that occurs in nearly every document is almost useless. 

In [None]:
start = time.time()
comment_vector = TfidfVectorizer(max_features=10000, analyzer='word', #ngram_range=(2, 6), 
                                 stop_words='english')
training_comments = comment_vector.fit_transform(df.comment_text)
holdout_comments = comment_vector.transform(holdout.comment_text)
submission_comments = comment_vector.transform(df_sub.comment_text)
print_time(start)

print(training_comments.shape)

Elapsed time was 0:11.
(127656, 10000)


One of the most important parameters to tune in this problem is the number of features and the n_gram range in the TF-IDF vectorizer, as well as choosing whether to analyze the sequences by characters or words. Analyzing by single words initially gives very poor performance, possibly because slang words and misspellings reduce the frequency of individual bad words.  

This is a simple function to play with reducing class imbalance. 

In [None]:
# This is just experimental to learn about the behavior of models with imbalanced classes. 

from numpy.random import sample

def imbalance_reduction(p, y):
    """
    For multilabel problems, keeps all rows with a 
    positive label and returns p% of data where label is zero. 
    """
    #reduce y
    y = np.sum(y, axis=1)
    p = 1-p
    keep_index = sample(len(y))
    keep_index = keep_index + y
    keep_index[keep_index>=p] = 1
    keep_index[keep_index<p] = 0

    return np.where(keep_index==1)

# Benchmarks

### Logistic Regression

With 10,000 vectorized features, but without engineered features. 

In [None]:
start = time.time()
for target in targets: 
    lr = LogisticRegression(random_state=seed)
    print(target + ' score: %.4f' % np.mean(cross_val_score(lr, training_comments, df_targets[target], scoring='f1', cv=cv)))
print_time(start)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


toxic score: 0.7203
severe_toxic score: 0.3203
obscene score: 0.7464
threat score: 0.1982
insult score: 0.6261
identity_hate score: 0.2785
any_label score: 0.7295
Elapsed time was 0:55.


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


With engineered features added in. 

In [None]:
start = time.time()
for target in targets: 
    lr = LogisticRegression(random_state=seed)
    print(target + ' score: %.4f' % np.mean(cross_val_score(lr, merge_features(training_comments, df, new_features), df_targets[target], scoring='f1', cv=cv)))
print_time(start)

5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=esti

toxic score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=esti

severe_toxic score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=esti

obscene score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=esti

threat score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=esti

insult score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=esti

identity_hate score: nan
any_label score: nan
Elapsed time was 0:02.


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/linear_model/_logistic.py", line 1514, in fit
    accept_large_sparse=solver not in ["liblinear", "sag", "saga"],
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=esti

### Naive Bayes

In [None]:
start = time.time() 

model = MultinomialNB(alpha=1.0)
_ = multi_cv(model, training_comments, df_targets)
# _ = multi_cv(model, training_comments, df_targets,)
print_time(start)

toxic f1 score: 0.6581
severe_toxic f1 score: 0.1044
obscene f1 score: 0.6668
threat f1 score: 0.0000
insult f1 score: 0.5604
identity_hate f1 score: 0.0418
any_label f1 score: 0.6670
Average (excluding any) f1 score: 0.3386
Elapsed time was 0:02.


With engineered features. 

In [None]:
start = time.time() 

model = MultinomialNB(alpha=1.0)
_ = multi_cv(model, merge_features(training_comments, df, new_features), df_targets)
print_time(start)

5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 663, in fit
    X, y = self._check_X_y(X, y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 523, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params

toxic f1 score: nan
severe_toxic f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 663, in fit
    X, y = self._check_X_y(X, y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 523, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params

obscene f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 663, in fit
    X, y = self._check_X_y(X, y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 523, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params

threat f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 663, in fit
    X, y = self._check_X_y(X, y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 523, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params

insult f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 663, in fit
    X, y = self._check_X_y(X, y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 523, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params

identity_hate f1 score: nan
any_label f1 score: nan
Average (excluding any) f1 score: nan
Elapsed time was 0:01.


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 663, in fit
    X, y = self._check_X_y(X, y)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/naive_bayes.py", line 523, in _check_X_y
    return self._validate_data(X, y, accept_sparse="csr", reset=reset)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params

### Support Vector Machine 

In [None]:
start = time.time()
model = LinearSVC(random_state=None)
_ = multi_cv(model, training_comments, df_targets)
print_time(start)

toxic f1 score: 0.7549
severe_toxic f1 score: 0.3402
obscene f1 score: 0.7809
threat f1 score: 0.3648
insult f1 score: 0.6645
identity_hate f1 score: 0.3562
any_label f1 score: 0.7703
Average (excluding any) f1 score: 0.5436
Elapsed time was 0:10.


In [None]:
start = time.time()

model = LinearSVC(random_state=seed)
_ = multi_cv(model, merge_features(training_comments, df, new_features), df_targets)
print_time(start)

5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/svm/_classes.py", line 252, in fit
    accept_large_sparse=False,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-pac

toxic f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/svm/_classes.py", line 252, in fit
    accept_large_sparse=False,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-pac

severe_toxic f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/svm/_classes.py", line 252, in fit
    accept_large_sparse=False,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-pac

obscene f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/svm/_classes.py", line 252, in fit
    accept_large_sparse=False,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-pac

threat f1 score: nan


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/svm/_classes.py", line 252, in fit
    accept_large_sparse=False,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-pac

insult f1 score: nan
identity_hate f1 score: nan
any_label f1 score: nan
Average (excluding any) f1 score: nan
Elapsed time was 0:01.


5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/svm/_classes.py", line 252, in fit
    accept_large_sparse=False,
  File "/usr/local/lib/python3.7/dist-packages/sklearn/base.py", line 581, in _validate_data
    X, y = check_X_y(X, y, **check_params)
  File "/usr/local/lib/python3.7/dist-packages/sklearn/utils/validation.py", line 976, in check_X_y
    estimator=estimator,
  File "/usr/local/lib/python3.7/dist-pac

### Support Vector Machine with Naive Bayes Features

In [None]:
nb = NBFeatures()
nb.fit(training_comments, df_targets.any_label)
nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# b = nb.transform(training_comments)

Mini test: Does feature scaling make a difference? Support vector machines are particularly vulnerable to unbalanced features, and I want to check whether scaling after the added step of the Naive Bayes feature transformation makes a difference. 

### LightGBM 

In [None]:
start = time.time()
train_data = lgb.Dataset(training_comments, label=df_targets.any_label.values)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'verbose': 1,
    'num_leaves': 64,
    'n_estimators': 100, 
    'learning_rate': 0.05, 
    'max_depth': 16,
    'n_jobs': -1,
    'seed': seed
}

cv_results = lgb.cv(
        params,
        train_data,
        num_boost_round=100,
        nfold=5,
        metrics='mae',
        early_stopping_rounds=10,
        feval=lgb_f1_score
        )
print_time(start)

print("Final CV F1 score is %.4f" % cv_results['f1-mean'][-1])



Elapsed time was 3:44.
Final CV F1 score is 0.6599


Elapsed time was 5:05.
Final CV F1 score is 0.7470

With engineered features. 

In [None]:
start = time.time()
train_data = lgb.Dataset(merge_features(training_comments, df, new_features), label=df_targets.any_label.values)
params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'verbose': 1,
    'num_leaves': 64,
    'n_estimators': 500, 
    'learning_rate': 0.05, 
    'max_depth': 16,
    'n_jobs': -1,
    'seed': seed
}

cv_results = lgb.cv(
        params,
        train_data,
        num_boost_round=100,
        nfold=5,
        metrics='mae',
        #early_stopping_rounds=10,
        feval=lgb_f1_score
        )
print_time(start)

print("Final CV F1 score is %.4f" % cv_results['f1-mean'][-1])



Elapsed time was 14:50.
Final CV F1 score is 0.7567


Elapsed time was 5:04.
Final CV F1 score is 0.7573

# Model Refinement

### Step 1:  Optimize tf-idf max features 

In [None]:
start = time.time()
comment_vector = TfidfVectorizer(max_features=10000, analyzer='word', #ngram_range=(3, 7), 
                                 stop_words='english')
training_comments = comment_vector.fit_transform(df.comment_text)
holdout_comments = comment_vector.transform(holdout.comment_text)
submission_comments = comment_vector.transform(df_sub.comment_text)
print_time(start)

print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 0:11.
(127656, 10000)
0.7693156711613183
Elapsed time was 0:02.


0.784174616909
Elapsed time was 0:33.

In [None]:
start = time.time()
comment_vector = TfidfVectorizer(max_features=30000, analyzer='word', # ngram_range=(3, 7), 
                                 stop_words='english')
training_comments = comment_vector.fit_transform(df.comment_text)
#holdout_comments = comment_vector.transform(holdout.comment_text)
#submission_comments = comment_vector.transform(df_sub.comment_text)
print_time(start)

print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 0:05.
(127656, 30000)
0.7738178806497222
Elapsed time was 0:02.


0.792595764561
Elapsed time was 0:21.

In [None]:
start = time.time()
comment_vector = TfidfVectorizer(max_features=30000, analyzer='word', ngram_range=(1,2), 
                                 stop_words='english')
training_comments = comment_vector.fit_transform(df.comment_text)
#holdout_comments = comment_vector.transform(holdout.comment_text)
#submission_comments = comment_vector.transform(df_sub.comment_text)
print_time(start)

print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 0:18.
(127656, 30000)
0.7668741423124089
Elapsed time was 0:02.


0.787710758788
Elapsed time was 0:25.

In [None]:
start = time.time()
comment_vector = TfidfVectorizer(max_features=30000, analyzer='word', ngram_range=(2,6), 
                                 stop_words='english')
training_comments = comment_vector.fit_transform(df.comment_text)
#holdout_comments = comment_vector.transform(holdout.comment_text)
#submission_comments = comment_vector.transform(df_sub.comment_text)
print_time(start)

print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 1:25.
(127656, 30000)
0.32813592780768525
Elapsed time was 0:02.


0.404212938192
Elapsed time was 0:31.

In [None]:
start = time.time()
comment_vector = TfidfVectorizer(max_features=20000, analyzer='char', ngram_range=(3, 7), 
                                 stop_words='english')
training_comments = comment_vector.fit_transform(df.comment_text)
#holdout_comments = comment_vector.transform(holdout.comment_text)
#submission_comments = comment_vector.transform(df_sub.comment_text)
print_time(start)

print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 3:52.
(127656, 20000)
0.7728457692028268
Elapsed time was 0:36.


0.774235174963
Elapsed time was 3:09.

In [None]:
start = time.time()
word_vectorizer = TfidfVectorizer(max_features=5000, analyzer='word',# ngram_range=(1, 2), 
                                 stop_words='english')
char_vectorizer = TfidfVectorizer(max_features=5000, analyzer='char', ngram_range=(3, 7), 
                                 stop_words='english')
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=-1)
training_comments = vectorizer.fit_transform(df.comment_text)
print_time(start)
print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 4:34.
(127656, 10000)
0.7843976300873488
Elapsed time was 0:22.


0.796185423779
Elapsed time was 2:56.

In [None]:
start = time.time()
word_vectorizer = TfidfVectorizer(max_features=15000, analyzer='word', ngram_range=(1, 2), 
                                 stop_words='english')
char_vectorizer = TfidfVectorizer(max_features=5000, analyzer='char', ngram_range=(3, 7), 
                                 stop_words='english')
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=-1)
# training_comments = vectorizer.fit_transform(df.comment_text)
print_time(start)
print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 0:00.
(127656, 10000)
0.7843976300873488
Elapsed time was 0:22.


0.798518003383
Elapsed time was 2:07.

In [None]:
start = time.time()
word_vectorizer = TfidfVectorizer(max_features=20000, analyzer='word', ngram_range=(1, 2), 
                                 stop_words='english')
char_vectorizer = TfidfVectorizer(max_features=10000, analyzer='char', ngram_range=(3, 7), 
                                 stop_words='english')
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=-1)
# training_comments = vectorizer.fit_transform(df.comment_text)
print_time(start)
print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 0:00.
(127656, 10000)
0.7843976300873488
Elapsed time was 0:22.


0.800512138358
Elapsed time was 2:36.

In [None]:
start = time.time()
word_vectorizer = TfidfVectorizer(max_features=20000, analyzer='word', ngram_range=(1, 2), 
                                 stop_words='english')
char_vectorizer = TfidfVectorizer(max_features=10000, analyzer='char', ngram_range=(3, 5), 
                                 stop_words='english')
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=-1)
training_comments = vectorizer.fit_transform(df.comment_text)
print_time(start)
print(training_comments.shape)

start = time.time()

nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# training_comments = nb_eng.transform(merge_features(training_comments, df, new_features))

model = LinearSVC(random_state=seed)

score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))

print(score)

print_time(start)

Elapsed time was 2:15.
(127656, 30000)
0.7907162252541873
Elapsed time was 0:24.


0.80146407813
Elapsed time was 2:48.

### Step 2: Optimize NB Feature Weight

There appears to be an issue with the consistency of scores here. The variation in training time suggests that the support vector machine algorithm is struggling with a too-large range of input values. Using the the last and best training_comments from the cell above. 

In [32]:
start = time.time()
word_vectorizer = TfidfVectorizer(max_features=20000, analyzer='word', ngram_range=(1, 2), 
                                 stop_words='english')
char_vectorizer = TfidfVectorizer(max_features=10000, analyzer='char', ngram_range=(3, 5), 
                                 stop_words='english')
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=-1)
training_comments = vectorizer.fit_transform(df.comment_text)
print_time(start)
print(training_comments.shape)

for i in range(1,10):
    start = time.time()
    sc = StandardScaler(with_mean=False)
    epsilon = i/10
    print('**********************')
    print('For epsilon %f' % epsilon)
    nb_temp = NBFeatures(epsilon=epsilon)
    nb_temp.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
    # input_data = nb_temp.transform(merge_features(training_comments, df, new_features))
    model = LinearSVC(random_state=seed)
    score = np.mean(cross_val_score(model, training_comments, df_targets.any_label, scoring='f1', cv=cv))
    print('Epsilon %f score: %.4f' % (epsilon, score))
    print_time(start)

Elapsed time was 2:14.
(127656, 30000)
**********************
For epsilon 0.100000
Epsilon 0.100000 score: 0.7907
Elapsed time was 0:25.
**********************
For epsilon 0.200000
Epsilon 0.200000 score: 0.7907
Elapsed time was 0:24.
**********************
For epsilon 0.300000
Epsilon 0.300000 score: 0.7907
Elapsed time was 0:24.
**********************
For epsilon 0.400000
Epsilon 0.400000 score: 0.7907
Elapsed time was 0:24.
**********************
For epsilon 0.500000
Epsilon 0.500000 score: 0.7907
Elapsed time was 0:24.
**********************
For epsilon 0.600000
Epsilon 0.600000 score: 0.7907
Elapsed time was 0:24.
**********************
For epsilon 0.700000
Epsilon 0.700000 score: 0.7907
Elapsed time was 0:24.
**********************
For epsilon 0.800000
Epsilon 0.800000 score: 0.7907
Elapsed time was 0:24.
**********************
For epsilon 0.900000
Epsilon 0.900000 score: 0.7907
Elapsed time was 0:24.


### Step 3: SVM Parameter Tuning 

In [33]:
start = time.time()
word_vectorizer = TfidfVectorizer(max_features=20000, analyzer='word', ngram_range=(1, 2), 
                                 stop_words='english')
char_vectorizer = TfidfVectorizer(max_features=10000, analyzer='char', ngram_range=(3, 5), 
                                 stop_words='english')
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=-1)
training_comments = vectorizer.fit_transform(df.comment_text)
print_time(start)
print(training_comments.shape)

# Reset NB feature transformer epsilon value
nb_eng = NBFeatures()
nb_eng.fit(merge_features(training_comments, df, new_features), df_targets.any_label)
# input_data = nb_eng.transform(merge_features(training_comments, df, new_features))

Elapsed time was 2:17.
(127656, 30000)


### Optimimum Model

In [None]:
# TF-IDF Vectorization
start = time.time()
word_vectorizer = TfidfVectorizer(max_features=200, analyzer='word', ngram_range=(1, 2), 
                                 stop_words='english')
char_vectorizer = TfidfVectorizer(max_features=10000, analyzer='char', ngram_range=(3, 5), 
                                 stop_words='english')
vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=-1)

# Fit to and transform input data
X_train = vectorizer.fit_transform(df.comment_text)
X_test = vectorizer.transform(holdout.comment_text)

# Name training target data
y_train = df_targets.any_label
y_test = holdout_targets.any_label

# Create and fit NB Feature extractor 
nb = NBFeatures()
nb.fit(X_train, y_train)

# Tranform input data
X_train = nb.transform(X_train)
X_test = nb.transform(X_test)

print_time(start)

# Define model and fit to data 
start = time.time()
model = LinearSVC(random_state=seed, C=0.5)
model.fit(X_train, y_train)

print_time(start)

In [None]:
y_pred = model.predict(X_test)

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

acc = (tp+tn)/(tn+fn+tp+fp)
print("True Positives: %d" % tp)
print("False Positives: %d" % fp)
print("True Negatives: %d" % tn)
print("False Negatives: %d" % fn)
print("Precision: %.4f" % (tp/(tp+fp)))
print("Recall: %.4f" % (tp/(tp+fn)))
print("F1 Score: %.4f" % f1_score(y_test, y_pred))
print("Total Accuracy: %.2f%%" % acc)

In [None]:
cm = confusion_matrix(y_test, y_pred)
def cm_heatmap(arr, title):
    """Internal. Only called by scoring function."""
    plt.figure('cm_heatmap', figsize=(10,10))
    plt.title(title + ' confusion matrix')
    sns.heatmap(arr, square=True, annot=True, cmap='YlOrRd', fmt='g', cbar=False)
    plt.xlabel("Truth")
    plt.ylabel("Prediction")
    #sns.set(font_scale=3)
    plt.show()
cm_heatmap(cm, "Toxic Comments")