# iLykei Lecture Series
# Advanced Machine Learning and Artificial Intelligence (MScA 32017)

# Project: Detection of Toxic Comments Online

## Notebook 1: NLP Basics

## Yuri Balasanov, Leonid Nazarov, &copy; iLykei 2018

# Word embeddings

Word embedding is the collective name for a set of language modeling techniques in natural language processing in which words or phrases from the vocabulary are mapped to vectors of real numbers. The idea is popular in modern machine learning, and many embedding models have been created in recent years. The most famous are [Google's word2vec](https://code.google.com/archive/p/word2vec/), [GloVe](https://nlp.stanford.edu/projects/glove/), [LexVec](https://github.com/alexandres/lexvec), [sent2vec](https://github.com/epfml/sent2vec), [Facebook's fastText](https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md), [numberbatch](https://github.com/commonsense/conceptnet-numberbatch), [bpe](https://github.com/bheinzerling/bpemb).  

# FastText

Below we use the fastText embeddings [crawl-300d-2M.vec](https://s3-us-west-1.amazonaws.com/fasttext-vectors/crawl-300d-2M.vec.zip): 2 million word vectors trained on Common Crawl (600B tokens). These 300-dimensional vectors were obtained with the model described in [Bojanowski et al. (2016)](https://arxiv.org/abs/1607.04606). The authors proposed a new approach based on the skip-gram model, in which each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and a word is represented as the sum of these representations.   
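
The bag of character n-grams can be illustrated with a minimal sketch. The function name `char_ngrams` is hypothetical, and the boundary markers `<` and `>` and the n-gram range from 3 to 6 are assumptions based on the description in the cited paper, not code taken from fastText itself.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, plus the whole marked word."""
    marked = '<' + word + '>'  # assumed boundary markers
    grams = [marked[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(marked) - n + 1)]
    if marked not in grams:    # the whole word is kept as a special sequence
        grams.append(marked)
    return grams

# Only trigrams, to keep the output short
print(char_ngrams('where', n_min=3, n_max=3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```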
The file *crawl-300d-2M.vec* with the word vectors has the following format.  
The first line contains the number of words in the vocabulary and the size of the vectors. Each subsequent line contains a word followed by the components of its vector, as in the default fastText text format. Values are separated by spaces. Words are ordered by descending frequency.  
Create a dictionary with words as keys and embeddings as values.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import pickle

Create the embedding index from a file in text format. The first line contains the vocabulary size and the embedding dimension. Fields are separated by spaces.

In [2]:

def get_embeddings(file_name):
    embeddings_index = {}
    with open(file_name, encoding="utf8") as f:
        for line in f:
            # strip trailing whitespace and split on single spaces
            values = line.rstrip().split(' ')
            # the header line has only two fields and is skipped here
            if len(values) > 2:
                embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")
    return embeddings_index
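
The file layout can be checked on a tiny made-up sample. This is only a sketch: the three words and their 4-dimensional vectors are invented, and `io.StringIO` stands in for the real file. Unlike `get_embeddings`, which skips the header through the length check, this version reads the header explicitly.

```python
import io
import numpy as np

# Invented sample in the same text layout: header line, then one word per line
sample = io.StringIO(
    "3 4\n"
    "the 0.1 0.2 0.3 0.4\n"
    ", -0.1 0.0 0.5 0.2\n"
    "cat 0.7 -0.3 0.1 0.0\n"
)

# Header: number of words and vector size
n_words, dim = map(int, sample.readline().split())

toy_index = {}
for line in sample:
    values = line.rstrip().split(' ')
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(n_words, dim, list(toy_index.keys()))
```

Comparing `n_words` and `dim` with the parsed dictionary is a cheap sanity check after loading a large file.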

In [3]:
embeddings_path = './Embeddings/'
embeddings_index = get_embeddings(embeddings_path+'crawl-300d-2M.vec')

Print the first 25 keys of the embedding index, then show the first key and its value. The value is a 300-dimensional numeric vector corresponding to the key.

In [4]:
print(list(embeddings_index.keys())[:25])
ke,va = list(embeddings_index.items())[0]
print('\nFirst key: ',ke,'\n\nVector length: ',len(va),'\n\nVector for first key: ',va)

[',', 'the', '.', 'and', 'to', 'of', 'a', 'in', 'is', 'for', 'that', 'I', 'it', 'on', 'with', ')', ':', '"', '(', 'The', 'you', 'was', 'are', 'or', 'this']

First key:  , 

Vector length:  300 

Vector for first key:  [-2.820e-02 -5.570e-02 -4.510e-02 -4.340e-02  7.120e-02 -8.550e-02
 -1.085e-01 -5.610e-02 -4.523e-01 -2.020e-02  9.750e-02  1.047e-01
  1.962e-01 -6.930e-02  2.130e-02 -2.350e-02  1.336e-01 -4.200e-02
 -5.640e-02 -7.980e-02  4.240e-02 -4.090e-02 -5.360e-02 -2.520e-02
  1.350e-02  6.400e-03  1.235e-01  4.610e-02  1.200e-02 -3.720e-02
  6.500e-02  4.100e-03 -1.074e-01 -2.630e-02  1.133e-01 -2.900e-03
  6.710e-02  1.065e-01  2.340e-02 -1.600e-02  7.000e-03  4.355e-01
 -7.520e-02 -4.328e-01  4.570e-02  6.040e-02 -7.400e-02 -5.500e-03
 -8.900e-03 -2.926e-01 -5.450e-02 -1.519e-01  9.900e-02 -1.930e-02
 -5.000e-03  5.110e-02  4.040e-02  1.023e-01 -1.280e-02  4.880e-02
 -1.567e-01 -7.590e-02 -1.900e-02  1.442e-01  4.700e-03 -1.860e-02
  1.400e-02 -3.850e-02 -8.530e-02  1.572e-01 

Text preprocessing before creating fastText embeddings includes removing digits, isolating punctuation, and replacing consecutive spaces with a single one. Most punctuation marks and some contractions (forms like *can't* and *it's*) are in the vocabulary.

In [5]:
for word in [',','!','!!',"it's","I'll",'*','>','¿','£',"'","’"]:
    print(word, word in embeddings_index.keys())

, True
! True
!! False
it's True
I'll True
* True
> True
¿ True
£ True
' True
’ True
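
The three preprocessing steps named above can be sketched with regular expressions. This is an assumption about how such a pipeline might look, not the procedure actually used to build the fastText vocabulary; the function name `normalize_comment` and the exact patterns are hypothetical.

```python
import re

def normalize_comment(text):
    """Illustrative preprocessing: drop digits, isolate punctuation, collapse spaces."""
    text = re.sub(r'\d+', ' ', text)              # remove digits
    text = re.sub(r"([^\w\s'’])", r' \1 ', text)  # isolate punctuation, keep apostrophes in contractions
    text = re.sub(r'\s+', ' ', text)              # collapse consecutive whitespace
    return text.strip()

print(normalize_comment("Wow!!  It's 2018, right?"))
# Wow ! ! It's , right ?
```

With punctuation isolated this way, `!!` becomes two `!` tokens, which is consistent with `!` being in the vocabulary while `!!` is not.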


# Gensim

Embeddings can also be loaded with the [*gensim*](https://radimrehurek.com/gensim/) package, which provides a number of useful functions, such as finding similar words.

### Warning.
**Gensim may take too much memory, depending on the resources available on your computer**.

In [None]:
embeddings_path = './Embeddings/'
import gensim
m = gensim.models.KeyedVectors.load_word2vec_format(embeddings_path+'crawl-300d-2M.vec')

In [None]:
m.most_similar('recommend')

Note that the misspelled words *"reccomend", "reccommend", "recomend"* are among the most similar to *"recommend"*. This feature of fastText embeddings is extremely valuable in the analysis of toxic comments, which contain many misspelled words.
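
Similarity between word vectors is measured here by cosine similarity. The sketch below ranks neighbours on a few invented 3-dimensional vectors. The numbers and the helper `most_similar_toy` are hypothetical and only illustrate the idea, not gensim's implementation.

```python
import numpy as np

# Invented toy vectors; real fastText vectors have 300 dimensions
toy_vectors = {
    'recommend': np.array([0.9, 0.1, 0.0]),
    'reccomend': np.array([0.8, 0.2, 0.1]),
    'suggest':   np.array([0.6, 0.5, 0.1]),
    'banana':    np.array([0.0, 0.1, 0.9]),
}

def most_similar_toy(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word] / np.linalg.norm(vectors[word])
    scores = {}
    for other, vec in vectors.items():
        if other != word:
            scores[other] = float(query @ vec / np.linalg.norm(vec))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:topn]

neighbours = most_similar_toy('recommend', toy_vectors)
print(neighbours)
```

In this toy setting the misspelling `reccomend` ranks first and `suggest` second, because their vectors point in nearly the same direction as the query.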

Gensim also supports arithmetic expressions on word vectors, as in this example: $$woman+king-man \approx queen.$$

In [None]:
m.most_similar(positive=['woman', 'king'], negative=['man'])[0]
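
The idea behind such expressions can be sketched on invented vectors: add the unit-normalized "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. The toy coordinates and the helper `analogy_toy` are hypothetical, and gensim's own implementation may differ in details.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit Euclidean length."""
    return v / np.linalg.norm(v)

# Invented coordinates, loosely (royalty, gender, fruit)
toy_vectors = {
    'king':   np.array([1.0,  1.0, 0.0]),
    'queen':  np.array([1.0, -1.0, 0.0]),
    'man':    np.array([0.0,  1.0, 0.0]),
    'woman':  np.array([0.0, -1.0, 0.0]),
    'prince': np.array([1.0,  0.8, 0.0]),
    'apple':  np.array([0.0,  0.0, 1.0]),
}

def analogy_toy(positive, negative, vectors):
    """Rank words not used in the query by cosine similarity to the combined vector."""
    target = sum(unit(vectors[w]) for w in positive) - sum(unit(vectors[w]) for w in negative)
    target = unit(target)
    used = set(positive) | set(negative)
    scores = {w: float(target @ unit(v)) for w, v in vectors.items() if w not in used}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = analogy_toy(['woman', 'king'], ['man'], toy_vectors)
print(ranking[0])
```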

<font color=blue>
    
**Assignment** <br>
What is the result $x$ of the following expression:  
$$IL-x=CA-LA$$ 
Find three "most similar" solutions.

In [None]:
# Skipped code
# x = IL+LA-CA


# Preparing data for the project

This section shows how to split the multilabel dataset of the project into train and validation sets in a stratified fashion.

Read the train dataset.

In [6]:
import pandas as pd
import numpy as np
dataPath = "./data/"

train = pd.read_csv(dataPath+'tc_train.csv',index_col=0)
types = list(train)[1:]
print(types)
train.head(6)

['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


| id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
| 00025465d4725e87 | "\n\nCongratulations from me as well, use the ... | 0 | 0 | 0 | 0 | 0 | 0 |


Check some of the comments labeled as threats.

In [7]:
list(train[train['threat']==1]['comment_text'][:6])

 "I think that your a Fagget get a oife and burn in Hell I hate you 'm sorry we cant have any more sex i'm running out of conndoms",
 "I'm also a sock puppet of this account...SUPRISE!!\n-sincerely,\n            The man that will track you down from the Internet and kill you",
 'Fuck you, Smith. Please have me notified when you die. I want to dance on your grave.',
 "WOULDN'T BE THE FIRST TIME BITCH. FUCK YOU I'LL FIND OUT WHERE YOU LIVE, SODOMIZE YOUR WIFE AND THEN BURN YOUR HOUSE DOWN. FUCK YOU YOU FUCKING QUEER.",
 'Whoever put a notices on my page. I will kill u']

A comment can generally belong to several classes simultaneously. Here are the counts of combinations of labels.

In [8]:
train.groupby(types).count()

| toxic | severe_toxic | obscene | threat | insult | identity_hate | comment_text |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 201081 |
| 0 | 0 | 0 | 0 | 0 | 1 | 68 |
| 0 | 0 | 0 | 0 | 1 | 0 | 365 |
| 0 | 0 | 0 | 0 | 1 | 1 | 32 |
| 0 | 0 | 0 | 1 | 0 | 0 | 27 |
| 0 | 0 | 0 | 1 | 1 | 0 | 4 |
| 0 | 0 | 1 | 0 | 0 | 0 | 366 |
| 0 | 0 | 1 | 0 | 0 | 1 | 3 |
| 0 | 0 | 1 | 0 | 1 | 0 | 196 |
| 0 | 0 | 1 | 0 | 1 | 1 | 19 |


Stratified splitting of the sample into train and validation sets based on a single label does not preserve the proportions of the other labels.

Split train into new train and validation subsets that keep the original distribution of classes: transform each vector of labels into a string, then stratify by the column of such strings.

In [9]:
import pickle
from sklearn.model_selection import train_test_split

# convert each vector of labels to a string
labels = train[types].astype(str).apply(lambda x: ''.join(x),axis=1)
print('Labels: \n',labels.head())
# aggregate rare combinations if any
count = labels.value_counts()
rare = count.index[count<=2]
labels[np.isin(labels.values,rare)] = 'rare'
print('\nCounts of labels: \n',labels.value_counts())
train_index, val_index = train_test_split(train.index, test_size=0.2, 
                                      stratify = labels, random_state=0)
# save train and validation indices for further calculations
fname = dataPath + 'train_val_split.pkl'
with open(fname, 'wb') as f:
    pickle.dump([train_index, val_index], f, pickle.HIGHEST_PROTOCOL)

Labels: 
 id
0000997932d777bf    000000
000103f0d9cfb60f    000000
000113f07ec002fd    000000
0001b41b1c6bb37e    000000
0001d958c54c6e35    000000
dtype: object

Counts of labels: 
 000000    201081
100000      7376
101010      5732
101000      2612
100010      1754
111010      1165
101011       979
111011       381
001000       366
000010       365
100011       215
100001       203
001010       196
101110       196
111000       186
100100       163
111110        88
101111        81
000001        68
101001        55
111111        45
110000        41
000011        32
000100        27
100110        25
001011        19
101100        17
110010        14
100101        11
110100        11
111100         8
111001         7
110011         7
110101         5
rare           5
000110         4
110001         3
001001         3
100111         3
dtype: int64
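
Whether the split preserves the distribution of label combinations can be checked by comparing their shares in both parts. The sketch below does this on a small synthetic dataset. The three label columns and the combination counts are invented for illustration and are not the project data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented label combinations and their counts (1000 rows in total)
combo_counts = {(0, 0, 0): 800, (1, 0, 0): 100, (1, 1, 0): 60, (1, 0, 1): 40}
rows = [combo for combo, n in combo_counts.items() for _ in range(n)]
toy = pd.DataFrame(rows, columns=['toxic', 'obscene', 'insult'])

# Same trick as above: one string per row encodes the label combination
toy_labels = toy.astype(str).apply(''.join, axis=1)
toy_train_idx, toy_val_idx = train_test_split(
    toy.index, test_size=0.2, stratify=toy_labels, random_state=0)

# Share of each combination in the two parts
comparison = pd.DataFrame({
    'train': toy_labels.loc[toy_train_idx].value_counts(normalize=True),
    'validation': toy_labels.loc[toy_val_idx].value_counts(normalize=True),
})
print(comparison)
```

The two columns should be nearly identical; applying the same comparison to `labels`, `train_index` and `val_index` checks the real split.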


# Tf–idf term weighting

Tf-idf (or TFIDF) stands for term frequency-inverse document frequency.
Tf-idf is often used in information retrieval and text mining. The weight is a statistical measure of how important a word is to a document in a collection of documents (corpus).  

Denote
- $D$ - corpus of documents;
- $n_D$  - total number of documents in the corpus $D$;
- ${tf}(t,d)$ - **term frequency** (number of times term $t$ occurs in document $d$);
- ${df}(t,D)$ - **document frequency** (number of documents in $D$ containing term $t$).  

Then **inverse document frequency** is defined as 
$${idf}(t,D) = \log\frac{1 + n_D}{1+{df}(t,D)} + 1.$$
This formula is implemented in the class  [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) used below and differs slightly from the standard textbook notation.  
The tf–idf term weight is the product of term frequency and inverse document frequency
$$ {tfidf}(t,d,D)= {tf} (t,d) \times {idf} (t,D).$$
The importance of a word increases with its frequency in the document and decreases with the number of documents in the corpus that contain it.
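
A short worked example of the smoothed formula, assuming the natural logarithm (as used by scikit-learn) and a corpus of three documents. The helper `smoothed_idf` is a hypothetical name introduced only for this illustration.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """idf(t, D) = log((1 + n_D) / (1 + df(t, D))) + 1, with natural logarithm."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 3
idf_rare = smoothed_idf(n_docs, doc_freq=1)    # term found in one document
idf_common = smoothed_idf(n_docs, doc_freq=3)  # term found in every document

# tf-idf weight of a term that occurs twice in a document
tfidf_rare = 2 * idf_rare
print(round(idf_rare, 4), round(idf_common, 4), round(tfidf_rare, 4))
# 1.6931 1.0 3.3863
```

A term that appears in every document still receives weight 1 rather than 0, which is the effect of the added constant.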

Calculate tf-idf for a small corpus of sentences from the opening of Shakespeare's "Hamlet": a dialogue between two characters, one of whom is Francisco.

In [10]:
corpus = ['Have you had quiet guard?',
          'Not a mouse stirring.',
          'Well, good night.']
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf = TfidfVectorizer()
X = tf_idf.fit_transform(corpus)
X.toarray()

array([[0.        , 0.4472136 , 0.4472136 , 0.4472136 , 0.        ,
        0.        , 0.        , 0.4472136 , 0.        , 0.        ,
        0.4472136 ],
       [0.        , 0.        , 0.        , 0.        , 0.57735027,
        0.        , 0.57735027, 0.        , 0.57735027, 0.        ,
        0.        ],
       [0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.        , 0.57735027,
        0.        ]])

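The vocabulary contains 11 words: the single-character token "a" is dropped by the default tokenizer, and each remaining word occurs in exactly one sentence. The three rows correspond to the 3 sentences, and within each row all words have equal weights.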

The method `fit_transform` returns a sparse matrix of tf-idf weights with `len(corpus)` rows and one column per word in the corpus vocabulary. The mapping of words to column indices is available in the attribute `vocabulary_`.  
Note that the resulting tf-idf vectors (rows of the matrix) are normalized to unit Euclidean norm.

In [11]:
print(tf_idf.vocabulary_)

{'have': 3, 'you': 10, 'had': 2, 'quiet': 7, 'guard': 1, 'not': 6, 'mouse': 4, 'stirring': 8, 'well': 9, 'good': 0, 'night': 5}
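
The weights above can be reproduced by hand from raw counts, the idf formula, and row normalization. This is a sketch assuming default settings of both vectorizers, so that `CountVectorizer` and `TfidfVectorizer` build the same vocabulary in the same column order.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ['Have you had quiet guard?',
          'Not a mouse stirring.',
          'Well, good night.']

# tf(t, d): raw term counts per document
counts = CountVectorizer().fit_transform(corpus).toarray()
n_docs = counts.shape[0]

# df(t, D): number of documents containing each term
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# tf * idf, then scale each row to unit Euclidean norm
raw = counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(corpus).toarray()
print(np.allclose(manual, reference))
```

Since every word occurs once and in a single sentence, all idf values are equal, and the weights reduce to $1/\sqrt{5}$ for the five-word sentence and $1/\sqrt{3}$ for the two three-word sentences.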
