# Dota Dataset Notebook 5 - Word Embeddings (Word2Vec)

### TW: This notebook contains highly offensive language.
To curb this kind of language, we need to analyze the contexts in which it is used and the players who use it. All of the notebooks that work with this dataset contain offensive language, but this notebook specifically analyzes how these words are used.

### Adding word embeddings to the toxicity classifier
   * Purpose 1: make the model more generalizable
       * e.g. being trained on "n_gger" and generalizing to slurs aimed at other groups, such as "ch_nk"
       * e.g. handling typos: "trash" vs. "trsah"
   * Purpose 2: make the model more robust to attempts to evade it
       * e.g. typing "assh01e" instead of "asshole" to avoid being caught

**This notebook covers:**
* Finding the proper word embeddings
* Applying word embeddings to the model, first with a simple average
* Applying word embeddings to the model with a weighted average, using Tfidf scores as the weights

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import langdetect as ld
from textblob import TextBlob
import warnings
warnings.filterwarnings('ignore')

from sklearn.pipeline import make_union
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
from sklearn.linear_model import LogisticRegression
from gensim.models.keyedvectors import KeyedVectors

# Use of Word Embeddings

In [8]:
df = pd.read_csv('labeled_dota.csv')

# Saving the table to CSV turned empty strings into nulls; restore them.
df['text'] = df['text'].fillna('')
df = df.drop('severe_toxic', axis=1)

## Using the Google News Corpus
* The pre-trained Google News vectors are among the most widely used Word2Vec embeddings, so they are the starting point.

In [10]:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)

**Examples**

In [11]:
model.most_similar('nigga')

[('niggas', 0.7684844136238098),
 ('n_*_gga', 0.7659837603569031),
 ('ni_**', 0.7131214141845703),
 ('n_*_ggas', 0.7091651558876038),
 ('homie', 0.6947641372680664),
 ('Niggas', 0.6646634340286255),
 ('kanye', 0.6643739342689514),
 ('motherf_*_cker', 0.6538609266281128),
 ('**_ga', 0.6494430303573608),
 ('sh_*_t', 0.6478126049041748)]

In [13]:
model.most_similar('chink')

[('chinks', 0.7933344841003418),
 ('Achilles_heel', 0.49940359592437744),
 ('flaw', 0.49598392844200134),
 ('soft_underbelly', 0.4720204770565033),
 ('silver_lining', 0.46761447191238403),
 ('weaknesses', 0.464680552482605),
 ('glimmer', 0.4613160490989685),
 ('silver_linings', 0.43982821702957153),
 ('flaws', 0.42893803119659424),
 ('Chinks', 0.4236713945865631)]

In [14]:
# cosine similarity (preferred over euclidean distance)
model.similarity('chink', 'asian')

0.04216005
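
The comment in the cell above prefers cosine similarity to Euclidean distance. A minimal sketch of one reason why, using made-up 3-dimensional vectors rather than real embeddings: cosine similarity depends only on direction, so a vector and a scaled copy of it count as maximally similar even though they are far apart in Euclidean terms. The function names are illustrative and not part of gensim.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy vectors, assumed for illustration only.
v = [1.0, 2.0, 3.0]
v_scaled = [10.0, 20.0, 30.0]   # same direction as v, 10x the magnitude
w = [3.0, -1.0, 0.5]            # different direction, similar magnitude

cos_scaled = cosine_similarity(v, v_scaled)
cos_other = cosine_similarity(v, w)
dist_scaled = euclidean_distance(v, v_scaled)
dist_other = euclidean_distance(v, w)
print(cos_scaled, cos_other, dist_scaled, dist_other)
```

With these inputs, Euclidean distance ranks `w` as closer to `v` than the scaled copy, while cosine similarity treats the scaled copy as pointing in the same direction.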

In [15]:
model.similarity('gay', 'fagot')

0.40189582

* 'faggot' was not in the vocabulary, but 'fagot' was, so the latter spelling is used above.

In [16]:
model.similarity('nigga', 'nigger')

0.5408276

In [17]:
model.similarity('chink', 'nigger')

0.071316406

In [18]:
model.similarity('NIGGER', 'nigger')

0.40419376

***The Google News vectors did not perform well for this purpose. This conclusion was reached after comparing them with the other models below.***

## Using the English Dota Dataset
* Applying Word2Vec on the Dota dataset itself
    * This defeats the purpose of generalizing the model to unseen words, but it lets us observe the relationships between words within the dataset.

In [19]:
large_eng = pd.read_csv('engDfWithSenti.csv')[['match', 'slot', 'time', 'text']]
large_eng.head()

Unnamed: 0,match,slot,time,text
0,0,9,1808.40822,100%
1,1,0,-131.14018,twitch.tv/rage_channel
2,1,0,-121.60481,https://www.twitch.tv/rage_channel
3,1,0,700.72893,https://www.twitch.tv/rage_channel
4,1,0,702.99503,https://www.twitch.tv/rage_channel


In [20]:
large_eng.shape

(6921688, 4)

In [21]:
# Removing links
large_eng = large_eng[~large_eng['text'].str.contains(r"\.tv|\.com")]
large_eng.shape

(6914913, 4)
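
The link filter above drops every message that contains `.tv` or `.com`. The same rule applied to a plain list of strings, as a sketch: the first three messages come from the output shown earlier, the last two are invented for illustration.

```python
import re

# One pattern covering both substrings the notebook filters on.
LINK_PATTERN = re.compile(r"\.tv|\.com")

messages = [
    "100%",
    "twitch.tv/rage_channel",
    "https://www.twitch.tv/rage_channel",
    "gg wp",                 # invented sample
    "check example.com",     # invented sample
]

# Keep only messages in which the pattern does not occur anywhere.
kept = [m for m in messages if not LINK_PATTERN.search(m)]
print(kept)
```

This rule is deliberately coarse: it also removes ordinary messages that merely mention `.tv` or `.com`, and it misses links on other domains.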

In [22]:
from gensim.models import Word2Vec

# Forming and training the Word2Vec model
texts = large_eng['text'].str.split(' ').values
# Note: gensim >= 4.0 renames the `size` argument to `vector_size`.
model = Word2Vec(sg=1, min_count=1, window=3, size=50, workers=4)
model.build_vocab(texts)
model.train(sentences=texts, total_examples=model.corpus_count, epochs=model.epochs)

(79163834, 93020265)

**Examples**

In [23]:
model.wv.most_similar('trash')

[('trahs', 0.9620177745819092),
 ('trsah', 0.9531400203704834),
 ('rubbish', 0.9508822560310364),
 ('thrash', 0.9503445625305176),
 ('garbage', 0.9502580165863037),
 ('trsh', 0.9492931365966797),
 ('retard', 0.9477963447570801),
 ('dogshit', 0.9467427730560303),
 ('tard', 0.9451007843017578),
 ('tash', 0.9420121908187866)]

In [24]:
model.wv.most_similar('noob')

[('nub', 0.9675301313400269),
 ('idiot', 0.9619877934455872),
 ('nob', 0.9607500433921814),
 ('nooob', 0.9558690786361694),
 ('nab', 0.95162034034729),
 ('nobo', 0.9505505561828613),
 ('noooob', 0.9374358654022217),
 ('retard', 0.9371263384819031),
 ('tard', 0.9353867769241333),
 ('trahs', 0.9339444041252136)]

In [25]:
model.wv.most_similar('nigger')

[('faggot', 0.9715926647186279),
 ('cunt', 0.9637824296951294),
 ('fag', 0.9553071856498718),
 ('cuck', 0.9494332671165466),
 ('slut', 0.9432471394538879),
 ('asshole', 0.9413410425186157),
 ('twat', 0.9412334561347961),
 ('dickhead', 0.9386234283447266),
 ('pig', 0.9354640245437622),
 ('fagget', 0.9351155757904053)]

In [26]:
model.wv.similarity('nigger', 'chink')

0.9240302

In [27]:
model.wv.most_similar('faggot')

[('fag', 0.9751840829849243),
 ('cunt', 0.972036600112915),
 ('nigger', 0.9715927243232727),
 ('asshole', 0.9545539617538452),
 ('cuck', 0.9481619596481323),
 ('dickhead', 0.945597231388092),
 ('dumbass', 0.945541501045227),
 ('bastard', 0.9450997710227966),
 ('bitch', 0.9446184039115906),
 ('shitter', 0.9419007897377014)]

* Although racial slurs rank as similar (one even more similar than a closer equivalent of 'faggot'), the term is still similar to other words in the same discriminatory category, and all of its neighbors are offensive language.

In [28]:
len(model.wv.vocab)

576437

***Now that there is a model applied to the Dota dataset, it can be compared to a more generalizable corpus. The GloVe Common Crawl word vectors will be used.***

## Using the GloVe Common Crawl (840B) Word Vectors

GloVe Pre-trained word vectors:
*Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation.*

In [36]:
from gensim.scripts.glove2word2vec import glove2word2vec

## convert GloVe vectors in text format into the word2vec text format
# glove2word2vec(glove_input_file='glove.840B.300d.txt', word2vec_output_file="gensim_glove_vectors.txt")

In [20]:
# turn word2vec txt into model
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

**Examples**

In [4]:
glove_model.most_similar('noob')

[('noobs', 0.8388797044754028),
 ('n00b', 0.8300445675849915),
 ('newb', 0.7779605984687805),
 ('noobie', 0.6746147274971008),
 ('Noob', 0.6667271852493286),
 ('dumbass', 0.6534879803657532),
 ('rofl', 0.6520781517028809),
 ('n00bs', 0.6475838422775269),
 ('noobish', 0.6475257873535156),
 ('stfu', 0.6416695713996887)]

In [33]:
glove_model.similarity('noob', 'newbie')

0.6285359

In [34]:
glove_model.most_similar('newbie')

[('newbies', 0.791803240776062),
 ('newb', 0.7092509269714355),
 ('novice', 0.6995048522949219),
 ('Newbie', 0.6929636597633362),
 ('beginner', 0.6885488033294678),
 ('newby', 0.6439059972763062),
 ('noob', 0.6285358667373657),
 ('noobie', 0.6156291961669922),
 ('newbee', 0.6093085408210754),
 ('n00b', 0.5996484756469727)]

In [6]:
glove_model.similarity('nigger', 'chink')

0.511657

* The Google News model scored their similarity at .07. The Word2Vec model trained on the Dota dataset scored it at .92.
* Given that this corpus has around 1.6 million more unique words than the Dota dataset, a large discrepancy in similarity is to be expected. With this in mind, the GloVe Common Crawl model does reasonably well.

In [29]:
glove_model.most_similar('faggot')

[('fag', 0.9149971008300781),
 ('faggots', 0.834093451499939),
 ('dumbass', 0.8189541101455688),
 ('fagget', 0.8078358173370361),
 ('stfu', 0.7915732860565186),
 ('dipshit', 0.7823353409767151),
 ('fags', 0.7777252197265625),
 ('douchebag', 0.7625726461410522),
 ('fucker', 0.7555729150772095),
 ('nigger', 0.7529041767120361)]

* These neighbors are similar to the most_similar results from the Word2Vec model trained on the Dota dataset.

In [35]:
len(glove_model.wv.vocab)

2196016

***Overall, the GloVe Common Crawl word vectors are more than sufficient for this project and will continue to be used.***

____

# Adding Word Embeddings into the Model

## Updated Jigsaw Classifier - Adding word embeddings with simple averaging

In [74]:
comments = pd.read_csv("jigsaw_train.csv")
comments['comment_text'] = comments['comment_text'].str.replace("\n", " ")
test = pd.read_csv('jigsaw_test.csv')

def num_upper(text):
    """Returns the number of capital letters in a string."""
    num = 0
    for i in text:
        if i.isupper():
            num += 1
    return num

def vector_mean(text):
    """Returns the mean of the GloVe vectors of the words in a text (element-wise average)."""
    words = text.split(" ")
    # Keep only words that have a GloVe vector; `in` works directly on KeyedVectors.
    sentence = [glove_model[word] for word in words if word in glove_model]
    if len(sentence) > 0:
        return sum(sentence)/len(sentence)
    else:
        return np.zeros(300)
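
A minimal sketch of what this averaging does, using a hypothetical 3-dimensional vocabulary in place of the 300-dimensional GloVe vectors. The dictionary, the function name, and the sample messages are all invented for illustration.

```python
# Hypothetical 3-dimensional embeddings standing in for the GloVe vectors.
toy_vectors = {
    "good": [1.0, 0.0, 2.0],
    "game": [3.0, 2.0, 0.0],
}
DIM = 3

def toy_vector_mean(text, vectors=toy_vectors, dim=DIM):
    """Average the vectors of in-vocabulary words; zero vector if none are known."""
    known = [vectors[w] for w in text.split(" ") if w in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together.
    return [sum(component) / len(known) for component in zip(*known)]

mean_known = toy_vector_mean("good game")     # both words are in the vocabulary
mean_partial = toy_vector_mean("good gaem")   # the typo is skipped entirely
mean_unknown = toy_vector_mean("gaem")        # nothing known, so the zero vector
print(mean_known, mean_partial, mean_unknown)
```

The second call shows a limitation worth keeping in mind: a misspelling that is missing from the embedding vocabulary contributes nothing to the sentence vector, so coverage of the vocabulary matters as much as the quality of the vectors.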

In [91]:
# Cleaning and adding features
comments_copy = comments.copy()
comments_copy['comment_text'] = comments_copy['comment_text'].str.replace(r"[(\.),(\|)!:='&(\*)(\")]", "", regex=True)
comments_copy['comment_text'] = comments_copy['comment_text'].str.replace("\n", "")
comments_copy['len'] = comments_copy['comment_text'].apply(len) - comments_copy['comment_text'].str.count(" ")
comments_copy['caps'] = comments_copy['comment_text'].apply(num_upper)
comments_copy['proportion of caps'] = comments_copy['caps'] / comments_copy['len']
len_min = comments_copy['len'].min()
len_max = comments_copy['len'].max()
comments_copy['len'] = (comments_copy['len'].values - len_min) / (len_max - len_min)
comments_copy['proportion of caps'] = comments_copy['proportion of caps'].fillna(0)
comments_copy = comments_copy.drop(['id', 'caps'], axis=1)

# New - adding the 300-dimension vector means to the df
comments_copy['vector mean'] = comments_copy['comment_text'].apply(vector_mean)
tmp = pd.DataFrame(comments_copy['vector mean'].tolist())
comments_copy = comments_copy.join(tmp)
comments_copy = comments_copy.drop('vector mean', axis=1)
comments_copy.head(3)

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,len,proportion of caps,0,...,290,291,292,293,294,295,296,297,298,299
0,Explanation Why the edits made under my userna...,0,0,0,0,0,0,0.042129,0.079812,-0.001525,...,-0.180881,-0.016054,0.094348,-0.038313,0.045927,-0.096478,-0.100205,0.00241,0.001732,0.129015
1,Daww He matches this background colour Im seem...,0,0,0,0,0,0,0.015924,0.096386,-0.026112,...,-0.055595,0.125816,-0.024587,-0.023318,0.007922,0.032877,-0.124093,0.073629,-0.065272,0.164368
2,Hey man Im really not trying to edit war Its j...,0,0,0,0,0,0,0.036686,0.021505,-0.098101,...,-0.217445,0.031593,-0.056462,-0.026338,0.05365,-0.013848,-0.097436,-0.015379,0.001089,0.164947
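
The hand-crafted features in the cell above (length without spaces, proportion of capital letters, min-max scaled length) can be sketched on a few invented messages in plain Python. The sample texts are assumptions for illustration.

```python
def num_upper(text):
    """Number of capital letters in a string."""
    return sum(1 for ch in text if ch.isupper())

texts = ["GG WP", "nice play", "REPORT mid"]   # invented sample messages

# Character count excluding spaces, as in the 'len' column.
lengths = [len(t) - t.count(" ") for t in texts]
caps = [num_upper(t) for t in texts]

# Empty messages would divide by zero; the notebook handles this with fillna(0).
prop_caps = [c / n if n else 0.0 for c, n in zip(caps, lengths)]

# Min-max scaling to [0, 1]; assumes the lengths are not all identical.
lo, hi = min(lengths), max(lengths)
scaled_len = [(n - lo) / (hi - lo) for n in lengths]
print(lengths, prop_caps, scaled_len)
```

Note that the notebook recomputes the minimum and maximum for each dataset, so the scaled lengths of the training, test, and Dota data are not on a shared scale.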


In [92]:
# Cleaning and adding features
testing = test.copy()
testing['comment_text'] = testing['comment_text'].str.replace(r"[(\.),(\|)!:='&(\*)(\")]", "", regex=True)
testing['comment_text'] = testing['comment_text'].str.replace("\n", "")
testing['len'] = testing['comment_text'].apply(len) - testing['comment_text'].str.count(" ")
testing['caps'] = testing['comment_text'].apply(num_upper)
testing['proportion of caps'] = testing['caps'] / testing['len']
len_min = testing['len'].min()
len_max = testing['len'].max()
testing['len'] = (testing['len'].values - len_min) / (len_max - len_min)
testing['proportion of caps'] = testing['proportion of caps'].fillna(0)
testing = testing.drop(['id', 'caps'], axis=1)

# New - adding the 300-dimension vector means to the df
testing['vector mean'] = testing['comment_text'].apply(vector_mean)
tmp = pd.DataFrame(testing['vector mean'].tolist())
testing = testing.join(tmp)
testing = testing.drop('vector mean', axis=1)

# Tfidf
train_text = comments['comment_text']
test_text = test['comment_text']
text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='word', 
                                  token_pattern=r'\w{1,}', ngram_range=(1, 1), max_features=30000)
char_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='char', 
                                  ngram_range=(1, 4), max_features=30000)

vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=2)
vectorizer.fit(text)

train_vector = vectorizer.transform(train_text)
test_vector = vectorizer.transform(test_text)
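
The character-level vectorizer above uses n-grams of length 1 to 4. A rough sketch of why that helps with misspellings such as "trash" vs. "trsah": the two spellings still share some character n-grams, whereas an unrelated word shares none. This is a simplification, since scikit-learn's `char` analyzer works on the whole message (including spaces) and weights n-grams by Tfidf, while the hypothetical helpers below compare single words by set overlap.

```python
def char_ngrams(word, n_min=1, n_max=4):
    """Set of character n-grams of length n_min..n_max."""
    return {
        word[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(word) - n + 1)
    }

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b)

correct = char_ngrams("trash")
typo = char_ngrams("trsah")
unrelated = char_ngrams("noob")

overlap_typo = jaccard(correct, typo)
overlap_unrelated = jaccard(correct, unrelated)
print(overlap_typo, overlap_unrelated)
```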

In [98]:
# Combining all features
final_training = hstack([train_vector, comments_copy.iloc[:,7:]])
final_testing = hstack([test_vector, testing.iloc[:,1:]])

# Logistic Regression
labels = comments.iloc[:,2:]
results = {}
for i in range(len(labels.columns)):
    lr = LogisticRegression(random_state=42, solver='sag').fit(final_training, labels.iloc[:,i])
    results[labels.columns[i]] = lr.predict_proba(final_testing)[:,1]

In [99]:
submission = pd.DataFrame({'id': test['id']})
submission['toxic'] = results['toxic']
submission['severe_toxic'] = results['severe_toxic']
submission['obscene'] = results['obscene']
submission['threat'] = results['threat']
submission['insult'] = results['insult']
submission['identity_hate'] = results['identity_hate']
submission.head(5)

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999823,0.233773,0.999384,0.069227,0.977106,0.318873
1,0000247867823ef7,0.003770,0.000190,0.001786,0.000028,0.002995,0.000651
2,00013b17ad220c46,0.003311,0.003861,0.002719,0.003828,0.001953,0.001420
3,00017563c3f7919a,0.004570,0.000978,0.002590,0.000364,0.002927,0.000386
4,00017695ad8997eb,0.010297,0.001680,0.001770,0.000240,0.001963,0.000189
...,...,...,...,...,...,...,...
153159,fffcd0960ee309b5,0.247398,0.000401,0.031778,0.000061,0.008562,0.000884
153160,fffd7a9a6eb32c16,0.040743,0.004443,0.014421,0.004242,0.012678,0.009167
153161,fffda9e8d6fafa9e,0.002649,0.000250,0.005895,0.000087,0.001127,0.000347
153162,fffe8f1340a79fc2,0.008519,0.000772,0.004358,0.000880,0.005213,0.010218


In [100]:
# submission.to_csv('submission_word_vector.csv', index=False)

**The score decreased from 0.980 to 0.975. Because the embeddings should help the model generalize to our data, this small decrease is an acceptable cost. In addition, the score should not decrease as much with proper weighting.**

## Updated Jigsaw Classifier Applied to Dota Data

In [101]:
eng = pd.read_csv('engDfWithSenti.csv')[['match', 'slot', 'time', 'text']].head(20000)
eng.head(3)

Unnamed: 0,match,slot,time,text
0,0,9,1808.40822,100%
1,1,0,-131.14018,twitch.tv/rage_channel
2,1,0,-121.60481,https://www.twitch.tv/rage_channel


In [108]:
# Original model features
dota_text = eng.copy()
dota_text = dota_text[~dota_text['text'].str.contains(r"\.tv|\.com")]
dota_text['text'] = dota_text['text'].str.replace(r"[(\.),(\|)!:='&(\*)(\")]", "", regex=True)
dota_text['text'] = dota_text['text'].str.replace("\n", "")
dota_text['len'] = dota_text['text'].apply(len) - dota_text['text'].str.count(" ")
dota_text['caps'] = dota_text['text'].apply(num_upper)
dota_text['proportion of caps'] = dota_text['caps'] / dota_text['len']
len_min = dota_text['len'].min()
len_max = dota_text['len'].max()
dota_text['len'] = (dota_text['len'].values - len_min) / (len_max - len_min)
dota_text['proportion of caps'] = dota_text['proportion of caps'].fillna(0)
dota_text = dota_text.drop('caps', axis=1)

# New - adding the 300-dimension vector means to the df
dota_text['vector mean'] = dota_text['text'].apply(vector_mean)
tmp = pd.DataFrame(dota_text['vector mean'].tolist())
dota_text = dota_text.join(tmp)
dota_text = dota_text.drop('vector mean', axis=1)

In [109]:
dota_text[dota_text.iloc[:,6].isna()].head()

Unnamed: 0,match,slot,time,text,len,proportion of caps,0,1,2,3,...,290,291,292,293,294,295,296,297,298,299
19975,3017,1,1436.84913,its funny though,0.12963,0.0,,,,,...,,,,,,,,,,
19976,3017,1,1444.78053,mid and safelane carry on enemy team are lvl 19,0.351852,0.0,,,,,...,,,,,,,,,,
19977,3017,1,1450.51253,on my team not even 15,0.157407,0.0,,,,,...,,,,,,,,,,
19978,3017,1,1458.64383,not both anyway,0.12037,0.0,,,,,...,,,,,,,,,,
19979,3017,5,1466.10863,morph has lag,0.101852,0.0,,,,,...,,,,,,,,,,


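Only match 3017 received NAs for its sentence vectors; the affected rows sit at the very end of the sample (index 19975 onward). This is probably not an anomaly. A likely cause is index misalignment: removing links leaves gaps in the index of `dota_text`, while `tmp` is built with a fresh 0 to n-1 index, so `join` finds no partner for the highest index labels and may pair other rows with a neighboring row's vector. For now, the affected match is dropped.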
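
A small sketch of how `DataFrame.join` can produce NaNs of this kind after rows have been filtered out, and one way to avoid it. The frame and the 2-dimensional "vectors" are invented stand-ins for `dota_text` and the 300-dimensional means; whether this fully explains the rows above has not been verified against the real data.

```python
import pandas as pd

# Toy frame standing in for dota_text; row 1 plays the role of a removed link.
df = pd.DataFrame({"text": ["gg", "twitch.tv/x", "wp", "ez"]})
df = df[~df["text"].str.contains(r"\.tv|\.com")]   # index labels are now 0, 2, 3

# One invented 2-dimensional vector per remaining row, in row order.
vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# A fresh 0..2 index: join aligns on labels, so label 3 finds no partner
# and label 2 receives the vector that belongs to the row labelled 3.
misaligned = df.join(pd.DataFrame(vectors))

# Reusing the filtered frame's index keeps every vector on its own row.
aligned = df.join(pd.DataFrame(vectors, index=df.index))

print(misaligned)
print(aligned)
```

Calling `reset_index(drop=True)` on the filtered frame before building the vectors would be another option, at the cost of losing the original row labels.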

In [111]:
dota_text = dota_text[dota_text['match'] != 3017]

In [112]:
# Tfidf
train_text = comments['comment_text']
test_text = dota_text['text']
text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='word', 
                                  token_pattern=r'\w{1,}', ngram_range=(1, 1), max_features=30000)
char_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='char', 
                                  ngram_range=(1, 4), max_features=30000)

vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=2)
vectorizer.fit(text)

train_vector = vectorizer.transform(train_text)
test_vector = vectorizer.transform(test_text)

# Combining all features
final_training = hstack([train_vector, comments_copy.iloc[:,7:]])
final_testing = hstack([test_vector, dota_text.iloc[:,4:]])

# Logistic Regression
labels = comments.iloc[:,2:]
results = {}
for i in range(len(labels.columns)):
    lr = LogisticRegression(random_state=42, solver='sag').fit(final_training, labels.iloc[:,i])
    results[labels.columns[i]] = lr.predict_proba(final_testing)[:,1]

In [113]:
labeled_dota = pd.DataFrame({'text': dota_text['text']})
labeled_dota['toxic'] = results['toxic']
labeled_dota['severe_toxic'] = results['severe_toxic']
labeled_dota['obscene'] = results['obscene']
labeled_dota['threat'] = results['threat']
labeled_dota['insult'] = results['insult']
labeled_dota['identity_hate'] = results['identity_hate']
labeled_dota

Unnamed: 0,text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,100%,0.034039,0.063953,0.071940,0.061007,0.055936,0.035569
7,carry,0.026550,0.000256,0.002668,0.000012,0.001987,0.000713
8,yes dog,0.140038,0.001870,0.044461,0.971723,0.438611,0.608974
9,yeah,0.076845,0.060020,0.014420,0.002525,0.064132,0.038330
10,fast and furious,0.950317,0.001478,0.483092,0.000049,0.021371,0.007016
...,...,...,...,...,...,...,...
19951,report shadow,0.008418,0.009685,0.001736,0.026290,0.004992,0.008200
19952,ironic,0.010940,0.001451,0.001924,0.000813,0.004775,0.001351
19953,rage boss,0.007376,0.000585,0.001321,0.000036,0.001129,0.000747
19954,mame privet,0.015945,0.004472,0.003854,0.002390,0.024617,0.004485


_____

## Updated Jigsaw Classifier -  Adding word embeddings with weighted averaging
* Each word vector is weighted by that word's Tfidf score in the sentence, and the weighted vectors are then averaged
    * Tfidf score: a measure of word importance
        * e.g. for the sentence "the cat ran," "the" should not have the same weight as "cat."
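
A minimal sketch of the idea behind the "the cat ran" example, computed by hand on an invented three-document corpus. It uses the smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default, and leaves out the normalization step, so the numbers are illustrative rather than what `TfidfVectorizer` would print.

```python
import math

# Invented toy corpus: "the" appears in every document, "cat" in only one.
corpus = [
    "the cat ran",
    "the dog sat",
    "the bird flew",
]
docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

# Each word occurs once in the sentence, so its term frequency is 1 and the
# unnormalized Tfidf weight equals its IDF.
weights = {term: idf(term) for term in docs[0]}
print(weights)
```

Because "the" occurs in every document, its weight is the smallest possible one, while "cat" and "ran" are weighted more heavily.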

In [2]:
# turn txt of word vectors into a Word2Vec model
glove_model = KeyedVectors.load_word2vec_format("gensim_glove_vectors.txt", binary=False)

In [4]:
comments = pd.read_csv("jigsaw_train.csv")
comments['comment_text'] = comments['comment_text'].str.replace("\n", " ")
test = pd.read_csv('jigsaw_test.csv')

In [5]:
# loading Tfidf model
vectorizer = TfidfVectorizer()
vectorizer.fit(comments['comment_text'].values)
feature_names = vectorizer.get_feature_names()

def get_word_weight(text):
    """Returns a dictionary where keys are the words of the text and values are their weights."""
    tfidf_matrix = vectorizer.transform([text]).todense()
    feature_index = tfidf_matrix[0,:].nonzero()[1]
    tfidf_scores = zip([feature_names[i] for i in feature_index], [tfidf_matrix[0, x] for x in feature_index])
    return dict(tfidf_scores)

In [6]:
text = "Hello, I am a girl"
get_word_weight(text)

{'am': 0.3922306315610665,
 'girl': 0.7617353366583565,
 'hello': 0.5156688943025236}

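* The weights don't sum to 1, and they are not supposed to: by default, TfidfVectorizer L2-normalizes each row, so it is the squared weights that sum to 1.
* To use them in a weighted average, divide each weight by the sum of the weights.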

In [7]:
text = "Hello, I am a girl"
text_dict = get_word_weight(text)
total = sum(text_dict.values())

# dividing by sum of weights to have weights sum to 1
text_dict = {key:(val/total) for key,val in text_dict.items()}
text_dict

{'am': 0.2349200057841454,
 'girl': 0.4562286963197283,
 'hello': 0.3088512978961263}

* After dividing by the sum of weights, they now sum to 1.
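
A quick check of both statements, using the weights printed by `get_word_weight` above: the raw weights have an L2 norm of 1 (their squares sum to 1), and dividing by their plain sum yields weights that sum to 1.

```python
import math

# Weights copied from the get_word_weight output above.
weights = {
    "am": 0.3922306315610665,
    "girl": 0.7617353366583565,
    "hello": 0.5156688943025236,
}

plain_sum = sum(weights.values())                           # about 1.67, not 1
l2_norm = math.sqrt(sum(w * w for w in weights.values()))   # about 1

# Rescale so that the weights themselves sum to 1.
rescaled = {word: w / plain_sum for word, w in weights.items()}
print(plain_sum, l2_norm, rescaled)
```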

In [103]:
def num_upper(text):
    """Returns the number of capital letters in a string."""
    num = 0
    for i in text:
        if i.isupper():
            num += 1
    return num

def vector_mean(text):
    """Gets the vector mean of a sentence by averaging the word vectors (each singular dimension)."""
    sentence = []
    words = text.split(" ")
    words = [word for word in words if word in glove_model]
    for word in words:
        sentence.append(glove_model[word])
    if len(sentence) > 0:
        return sum(sentence)/len(sentence)
    else:
        return np.zeros(300)
    
def weighted_vector_mean(text):
    """Gets the weighted vector mean of a sentence by averaging the word vectors according to Tfidf weights."""
    sentence_vects = []
    sentence_weights = []
    words = text.split(" ")
    words = [word for word in words if word in glove_model]
    
    text_dict = get_word_weight(text)
    total = sum(text_dict.values())
    text_dict = {key:(val/total) for key,val in text_dict.items()}
        
    for word in words:
        sentence_vects.append(glove_model[word])               # get word vectors
        if word.lower() in text_dict.keys():
            sentence_weights.append(text_dict[word.lower()])   # get weights of words
        else:
            sentence_weights.append(0)
        
    if len(sentence_vects) > 0:
        return np.transpose(sentence_vects) @ sentence_weights / len(sentence_vects)
    else:
        return np.zeros(300)
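
`weighted_vector_mean` takes its weights from a dictionary that was already rescaled to sum to 1 and then divides the weighted sum by the number of words as well, which shrinks the result for longer messages. Whether that extra division was intended is not stated. For comparison, a sketch of a plain weighted average, sum(w_i * v_i) / sum(w_i), on invented 2-dimensional vectors; the function name and data are hypothetical.

```python
def weighted_mean(vectors, weights):
    """Weighted average of equal-length vectors: sum(w_i * v_i) / sum(w_i)."""
    total = sum(weights)
    dim = len(vectors[0])
    if total == 0:
        return [0.0] * dim
    return [
        sum(w * v[d] for v, w in zip(vectors, weights)) / total
        for d in range(dim)
    ]

# Invented vectors and Tfidf-style weights.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.25, 0.25]          # already sum to 1

result = weighted_mean(vectors, weights)

# Effect of additionally dividing by the number of words, for the same inputs.
shrunk = [x / len(vectors) for x in result]
print(result, shrunk)
```

Dividing by the sum of the weights that were actually used also keeps the average well defined when some words are repeated or have no Tfidf weight.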

In [104]:
# Cleaning and adding features
comments_copy = comments.copy()
comments_copy['comment_text'] = comments_copy['comment_text'].str.replace(r"[(\.),(\|)!:='&(\*)(\")]", "", regex=True)
comments_copy['comment_text'] = comments_copy['comment_text'].str.replace("\n", "")
comments_copy['len'] = comments_copy['comment_text'].apply(len) - comments_copy['comment_text'].str.count(" ")
comments_copy['caps'] = comments_copy['comment_text'].apply(num_upper)
comments_copy['proportion of caps'] = comments_copy['caps'] / comments_copy['len']
len_min = comments_copy['len'].min()
len_max = comments_copy['len'].max()
comments_copy['len'] = (comments_copy['len'].values - len_min) / (len_max - len_min)
comments_copy['proportion of caps'] = comments_copy['proportion of caps'].fillna(0)
comments_copy = comments_copy.drop(['id', 'caps'], axis=1)

# New - adding the 300D vector means, weighted by Tfidf weights
comments_copy['vector mean'] = comments_copy['comment_text'].apply(weighted_vector_mean)
tmp = pd.DataFrame(comments_copy['vector mean'].tolist())
comments_copy = comments_copy.join(tmp)
comments_copy = comments_copy.drop('vector mean', axis=1)
comments_copy.head(3)

Unnamed: 0,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,len,proportion of caps,0,...,290,291,292,293,294,295,296,297,298,299
0,Explanation Why the edits made under my userna...,0,0,0,0,0,0,0.042129,0.079812,0.000552,...,-0.003081,-0.001262,0.002846,-0.001422,0.000968,-0.002976,-0.003383,0.000153,-0.000352,0.002616
1,Daww He matches this background colour Im seem...,0,0,0,0,0,0,0.015924,0.096386,0.002804,...,-0.009804,0.006116,-0.004279,-0.004179,0.000589,0.004729,-0.006293,-0.001872,0.002173,0.006684
2,Hey man Im really not trying to edit war Its j...,0,0,0,0,0,0,0.036686,0.021505,-0.002817,...,-0.005206,0.000549,-0.002285,-0.000499,0.000947,-0.00012,-0.002729,-0.000395,-0.000113,0.003905


In [105]:
# Cleaning and adding features
testing = test.copy()
testing['comment_text'] = testing['comment_text'].str.replace(r"[(\.),(\|)!:='&(\*)(\")]", "", regex=True)
testing['comment_text'] = testing['comment_text'].str.replace("\n", "")
testing['len'] = testing['comment_text'].apply(len) - testing['comment_text'].str.count(" ")
testing['caps'] = testing['comment_text'].apply(num_upper)
testing['proportion of caps'] = testing['caps'] / testing['len']
len_min = testing['len'].min()
len_max = testing['len'].max()
testing['len'] = (testing['len'].values - len_min) / (len_max - len_min)
testing['proportion of caps'] = testing['proportion of caps'].fillna(0)
testing = testing.drop(['id', 'caps'], axis=1)

# New - adding the 300D vector means, weighted by Tfidf weights
testing['vector mean'] = testing['comment_text'].apply(weighted_vector_mean)
tmp = pd.DataFrame(testing['vector mean'].tolist())
testing = testing.join(tmp)
testing = testing.drop('vector mean', axis=1)

# Tfidf
train_text = comments['comment_text']
test_text = test['comment_text']
text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='word', 
                                  token_pattern=r'\w{1,}', ngram_range=(1, 1), max_features=30000)
char_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='char', 
                                  ngram_range=(1, 4), max_features=30000)

vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=2)
vectorizer.fit(text)

train_vector = vectorizer.transform(train_text)
test_vector = vectorizer.transform(test_text)

In [106]:
# Combining all features
final_training = hstack([train_vector, comments_copy.iloc[:,7:]])
final_testing = hstack([test_vector, testing.iloc[:,1:]])

# Logistic Regression - one model per label, applied to the Jigsaw test set
labels = comments.iloc[:,2:]
results = {}
for i in range(len(labels.columns)):
    lr = LogisticRegression(random_state=42, solver='sag').fit(final_training, labels.iloc[:,i])
    results[labels.columns[i]] = lr.predict_proba(final_testing)[:,1]

In [107]:
submission = pd.DataFrame({'id': test['id']})
submission['toxic'] = results['toxic']
submission['severe_toxic'] = results['severe_toxic']
submission['obscene'] = results['obscene']
submission['threat'] = results['threat']
submission['insult'] = results['insult']
submission['identity_hate'] = results['identity_hate']
submission.head(5)

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,0.999786,0.207811,0.999486,0.046438,0.98204,0.280475
1,0000247867823ef7,0.003983,0.001345,0.00212,0.000317,0.003159,0.001652
2,00013b17ad220c46,0.010267,0.004232,0.006372,0.002058,0.003781,0.002105
3,00017563c3f7919a,0.003149,0.001381,0.002399,0.000784,0.003045,0.00055
4,00017695ad8997eb,0.011063,0.001115,0.002186,0.000589,0.002885,0.000488


In [109]:
# submission.to_csv('submission_word_vector.csv', index=False)

**This submission scored .977. The model with the simple averaged vectors scored .975, and the model without any word embeddings scored .980. As mentioned before, this small decrease in score is a worthwhile tradeoff for better generalization to our data and to unseen data. With weighting, the decrease is .003 instead of .005.**

# FINAL Updated Jigsaw Classifier Applied to Dota Data

In [117]:
eng = pd.read_csv('engDfWithSenti.csv')[['match', 'slot', 'time', 'text']].head(20000)
eng.head(3)

Unnamed: 0,match,slot,time,text
0,0,9,1808.40822,100%
1,1,0,-131.14018,twitch.tv/rage_channel
2,1,0,-121.60481,https://www.twitch.tv/rage_channel


In [124]:
# Original model features
dota_text = eng.copy()
dota_text = dota_text[~dota_text['text'].str.contains(r"\.tv|\.com")]
dota_text['text'] = dota_text['text'].str.replace(r"[(\.),(\|)!:='&(\*)(\")]", "", regex=True)
dota_text['text'] = dota_text['text'].str.replace("\n", "")
dota_text['len'] = dota_text['text'].apply(len) - dota_text['text'].str.count(" ")
dota_text['caps'] = dota_text['text'].apply(num_upper)
dota_text['proportion of caps'] = dota_text['caps'] / dota_text['len']
len_min = dota_text['len'].min()
len_max = dota_text['len'].max()
dota_text['len'] = (dota_text['len'].values - len_min) / (len_max - len_min)
dota_text['proportion of caps'] = dota_text['proportion of caps'].fillna(0)
dota_text = dota_text.drop('caps', axis=1)

# New - loading Tfidf model
vectorizer = TfidfVectorizer()
vectorizer.fit(dota_text['text'].values)
feature_names = vectorizer.get_feature_names()

# New - adding the 300-dimension Tfidf-weighted vector means to the df
dota_text['vector mean'] = dota_text['text'].apply(weighted_vector_mean)
tmp = pd.DataFrame(dota_text['vector mean'].tolist())
dota_text = dota_text.join(tmp).dropna()
dota_text = dota_text.drop('vector mean', axis=1)

In [125]:
# Tfidf
train_text = comments['comment_text']
test_text = dota_text['text']
text = pd.concat([train_text, test_text])

word_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='word', 
                                  token_pattern=r'\w{1,}', ngram_range=(1, 1), max_features=30000)
char_vectorizer = TfidfVectorizer(sublinear_tf=True, strip_accents='unicode', analyzer='char', 
                                  ngram_range=(1, 4), max_features=30000)

vectorizer = make_union(word_vectorizer, char_vectorizer, n_jobs=2)
vectorizer.fit(text)

train_vector = vectorizer.transform(train_text)
test_vector = vectorizer.transform(test_text)

# Combining all features
final_training = hstack([train_vector, comments_copy.iloc[:,7:]])
final_testing = hstack([test_vector, dota_text.iloc[:,4:]])

# Logistic Regression
labels = comments.iloc[:,2:]
results = {}
for i in range(len(labels.columns)):
    lr = LogisticRegression(random_state=42, solver='sag').fit(final_training, labels.iloc[:,i])
    results[labels.columns[i]] = lr.predict_proba(final_testing)[:,1]

In [167]:
labeled_dota = pd.DataFrame({'text': dota_text['text']})
labeled_dota['toxic'] = results['toxic']
labeled_dota['severe_toxic'] = results['severe_toxic']
labeled_dota['obscene'] = results['obscene']
labeled_dota['threat'] = results['threat']
labeled_dota['insult'] = results['insult']
labeled_dota['identity_hate'] = results['identity_hate']

In [168]:
# labeled_dota.to_csv('tfidf_weighted_labels.csv', index=False)

Next session:
* Understand and handle the false positives
* Insights on toxic players and players in general