**Insights about different embedding vectors - GloVe, FastText & Google News**

After our EDA: https://www.kaggle.com/s7anmerk/lean-import-to-save-ram-and-eda

And our Super Fast Preprocessing: https://www.kaggle.com/annekel/fastpreprocess-14min-99-7-coverage-gl-ov

In this kernel we share insights from our project on three pretrained embedding vector sets: GloVe, FastText and Google News. Because it is difficult to find information about what each of these vector files actually contains, we gathered these insights step by step while applying the vectors.

This kernel covers the import of the necessary files and packages, the functions used in the process, and a comparison of the three word vectors regarding their out-of-the-box performance, their coverage of non-ASCII characters, and their performance after applying our preprocessing function (click [here](https://www.kaggle.com/annekel/fastpreprocess-14min-99-7-coverage-gl-ov) for more information regarding our preprocessing).

1. Load files and packages
2. Used Functions
3. Word Vectors
   * Google News
   * FastText
   * GloVe
4. Comparison & Conclusion
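
To make the comparison criterion concrete, here is a minimal sketch of how vocabulary coverage could be measured. It assumes that coverage means the share of tokens found among the keys of an embedding; the names `build_vocab`, `check_coverage` and `toy_embedding` are hypothetical and only serve this illustration, they are not taken from our kernels.

```python
from collections import Counter

def build_vocab(texts):
    """Count how often each whitespace-separated token occurs."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return vocab

def check_coverage(vocab, embedding_keys):
    """Return unique-token coverage, total-text coverage and the OOV tokens."""
    known_count = sum(c for w, c in vocab.items() if w in embedding_keys)
    known_unique = sum(1 for w in vocab if w in embedding_keys)
    # out-of-vocabulary tokens, most frequent first
    oov = sorted(
        ((w, c) for w, c in vocab.items() if w not in embedding_keys),
        key=lambda item: item[1],
        reverse=True,
    )
    unique_cov = known_unique / len(vocab)
    text_cov = known_count / sum(vocab.values())
    return unique_cov, text_cov, oov

# Toy data: a set of known words stands in for the keys of a real embedding
texts = ["I don't like it", "I like it a lot"]
toy_embedding = {"I", "like", "it", "a", "lot"}

vocab = build_vocab(texts)
unique_cov, text_cov, oov = check_coverage(vocab, toy_embedding)
print(f"unique tokens covered: {unique_cov:.2%}")
print(f"all text covered:      {text_cov:.2%}")
print("out of vocabulary:", oov)
```

With a real embedding, the set of known words would be replaced by the keys of the loaded vectors. The list of out-of-vocabulary tokens is usually the most useful output, as it shows which preprocessing step would gain the most.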



**1. Load files and packages**

In [None]:
# Import needed packages
import gensim
import pandas as pd
from nltk.tokenize import word_tokenize
import string
import re
import emoji
import operator 
import tqdm
from keras.preprocessing.text import text_to_word_sequence
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import timeit
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from IPython.display import Image
import nltk

import warnings
warnings.filterwarnings("ignore")

import gc
gc.enable()

In [None]:
# Dictionary for lean import
dtypesDict_tr = {
'id'                            :         'int32',
'target'                        :         'float16',
'severe_toxicity'               :         'float16',
'obscene'                       :         'float16',
'identity_attack'               :         'float16',
'insult'                        :         'float16',
'threat'                        :         'float16',
'asian'                         :         'float16',
'atheist'                       :         'float16',
'bisexual'                      :         'float16',
'black'                         :         'float16',
'buddhist'                      :         'float16',
'christian'                     :         'float16',
'female'                        :         'float16',
'heterosexual'                  :         'float16',
'hindu'                         :         'float16',
'homosexual_gay_or_lesbian'     :         'float16',
'intellectual_or_learning_disability':    'float16',
'jewish'                        :         'float16',
'latino'                        :         'float16',
'male'                          :         'float16',
'muslim'                        :         'float16',
'other_disability'              :         'float16',
'other_gender'                  :         'float16',
'other_race_or_ethnicity'       :         'float16',
'other_religion'                :         'float16',
'other_sexual_orientation'      :         'float16',
'physical_disability'           :         'float16',
'psychiatric_or_mental_illness' :         'float16',
'transgender'                   :         'float16',
'white'                         :         'float16',
'publication_id'                :         'int8',
'parent_id'                     :         'float32',
'article_id'                    :         'int32',
'funny'                         :         'int8',
'wow'                           :         'int8',
'sad'                           :         'int8',
'likes'                         :         'int16',
'disagree'                      :         'int16',
'sexual_explicit'               :         'float16',
'identity_annotator_count'      :         'int16',
'toxicity_annotator_count'      :         'int16'
}

In [None]:
# Load dataset
train_data = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
# kill all other columns except comment text and target
cols_to_keep = ['comment_text','target']
train_data = train_data.drop(train_data.columns.difference(cols_to_keep), axis=1)
gc.collect()
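
As a side note on the lean import: the cell above loads the full file and drops the unused columns afterwards. A possible alternative, sketched here under the assumption that only `comment_text` and `target` are needed, is to restrict the columns and dtypes while reading. The in-memory CSV and the name `sample` are made up for this illustration; `usecols` and `dtype` are regular `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv (hypothetical content, same column names)
csv_text = (
    "id,target,comment_text,funny\n"
    "1,0.0,hello world,0\n"
    "2,0.8,you are awful,1\n"
)

cols_to_keep = ["comment_text", "target"]
# Only the dtypes of the columns that are actually read are needed
dtypes = {"target": "float16"}

# usecols skips all other columns during parsing, so they never reach memory
sample = pd.read_csv(io.StringIO(csv_text), usecols=cols_to_keep, dtype=dtypes)
print(sample.dtypes)
print(sample.memory_usage(deep=True))
```

Whether this saves a noticeable amount of RAM depends on the size of the dropped columns, so it is worth checking with `memory_usage` on the real data.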

In [None]:
"""preprocessing.py

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1OGfaNb3J6wHJbSjAh2qlwbgBe_e4M2z1

# Preprocessing.py
"""

import pandas as pd
import matplotlib.pyplot as plt
import re
import numpy as np
from nltk.stem import WordNetLemmatizer
from textblob import Word
from gensim.test.utils import datapath, get_tmpfile
from gensim.models import KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
import nltk
import gensim
from nltk.corpus import stopwords
import emoji
from nltk.corpus import wordnet
import datetime
import time
import operator
from textblob import TextBlob
from tqdm import tqdm, trange
from nltk.tokenize import TweetTokenizer

def replace_contractions(text):
  
  """
  This functions check's whether a text contains contractions or not. 
  In case a contraction is found, the corrected value from the dictionary is 
  returned.
  Example: "I've" towards "I have"
  """
  
  #replace words with "'ve" to "have"
  matches = re.findall(r'\b\w+[\'`¬¥]ve\b', text)
  if len(matches) != 0:
    text = re.sub(r'[\'`¬¥]ve\b', " have", text)
  
  #replace words with "'re" to "are"
  matches = re.findall(r'\b\w+[\'`¬¥]re\b', text)
  if len(matches) != 0:
    text = re.sub(r'[\'`¬¥]re\b', " are", text)
  
  #replace words with "'ll" to "will"
  matches = re.findall(r'\b\w+[\'`¬¥]ll\b', text)
  if len(matches) != 0:
    text = re.sub(r'[\'`¬¥]ll\b', " will", text)
  
  #replace words with "'m" to "am"
  matches = re.findall(r'\b\w+[\'`¬¥]m\b', text)
  if len(matches) != 0:
    text = re.sub(r'[\'`¬¥]m\b', " am", text)
  
  #replace words with "'d" to "would"
  matches = re.findall(r'\b\w+[\'`¬¥]d\b', text)
  if len(matches) != 0:
    text = re.sub(r'[\'`¬¥]d\b', " would", text)
  
  #replace words with contraction according to the contraction_dict
  matches = re.findall(r'\b\w+[\'`¬¥]\w+\b', text)
  for x in matches:
    if x in contraction_dict.keys():
      text = text.replace(x, contraction_dict.get(x))
  
  # replace all "'s" by space
  matches = re.findall(r'\b\w+[\'`¬¥]s\b', text)
  if len(matches) != 0:
    text = re.sub(r'[\'`¬¥]s\b', " ", text)
  return text

# Dictionary of contractions coming out of the pre-investigation in the other kernel
contraction_dict = {"Can't":"can not", "Didn't":"did not", "Doesn't":"does not", 
                    "Isn't":"is not", "Don't":"do not", "Aren't":"are not", "#":"",
                    "ain't": "is not", "aren't": "are not","can't": "cannot",
                    "'cause": "because", "could've": "could have", "couldn't": "could not",
                    "didn't": "did not",  "doesn't": "does not", "don't": "do not",
                    "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                    "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did",
                    "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                    "I'd": "I would", "I'd've": "I would have", "I'll": "I will",
                    "I'll've": "I will have","I'm": "I am", "I've": "I have",
                    "i'd": "i would", "i'd've": "i would have", "i'll": "i will",
                    "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not",
                    "it'd": "it would", "it'd've": "it would have", "it'll": "it will",
                    "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                    "mayn't": "may not", "might've": "might have","mightn't": "might not",
                    "mightn't've": "might not have", "must've": "must have",
                    "mustn't": "must not", "mustn't've": "must not have",
                    "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                    "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
                    "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would",
                    "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have",
                    "she's": "she is", "should've": "should have", "shouldn't": "should not",
                    "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                    "this's": "this is","that'd": "that would", "that'd've": "that would have",
                    "that's": "that is", "there'd": "there would", "there'd've": "there would have",
                    "there's": "there is", "here's": "here is","they'd": "they would",
                    "they'd've": "they would have", "they'll": "they will",
                    "they'll've": "they will have", "they're": "they are", "they've": "they have",
                    "to've": "to have", "wasn't": "was not", "we'd": "we would",
                    "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have",
                    "we're": "we are", "we've": "we have", "weren't": "were not",
                    "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                    "what's": "what is", "what've": "what have", "when's": "when is",
                    "when've": "when have", "where'd": "where did", "where's": "where is",
                    "where've": "where have", "who'll": "who will", "who'll've": "who will have",
                    "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have",
                    "will've": "will have", "won't": "will not", "won't've": "will not have",
                    "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have",
                    "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                    "y'all're": "you all are","y'all've": "you all have","you'd": "you would",
                    "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                    "you're": "you are", "you've": "you have", "c'mon":"come on",
                    "Don''t":"do not", "Haden't":"had not", "Grab'em":"grab them", "USA''s":"USA",
                    "Pick'em":"pick them", "I'lll":"I will", "Tell'em":"tell them", "Y'all":"you all",
                    "Wouldn't":"would not", "Shouldn't":"should not", "I'DVE":"I would have",
                    "SHOOT'UM":"shoot them", "CANN'T":"can not", "COUD'VE":"could have", "Yo'ure":"you are",
                    "LOCK'EM":"lock them", "G'night":"goodnight", "W'ell":"we will", "IT'D":"it would",
                    "Couldn't":"could not", "LOCK'UM":"lock them", "WOULD'NT":"would not", "Cant't":"can not",
                    "HADN'T":"had not", "It''s":"it is", "Don'ts":"do not", "Arn't":"are not",
                    "We'll":"we will", "G'Night":"goodnight", "THAT'LL":"that will", "Dpn't":"do not",
                    "Idon'tgetitatall":"I do not get it at all", "THEY'VE":"they have", "Le'ts":"let us",
                    "SEND'EM":"send them", "AIN'T":"is not", "WE'D":"we would", "I'vemade":"I have made",
                    "SHE'LL":"she will", "I'llbe":"I will be", "I'mma":"I am a", "Could'nt":"could not",
                    "You'very":"you are very", "Light'em":"light them", "Con't":"can not", "I'Œú":"I am",
                    "Kick'em":"kick them", "Shoudn't":"should not", "That''s":"that is",
                    "Didn't_work":"did not work", "You'rethinking":"you are thinking", "Dn't":"do not",
                    "CON'T":"can not", "DON'T":"do not", "C'Mon":"come on", "You'res":"you are",
                    "Amn't":"is not", "WE'RE":"we are", "Can't":"can not", "Kouldn't":"could not",
                    "SHouldn't":"should not", "Does't":"does not", "COULD'VE":"could have",
                    "TrumpIDin'tCare":"Trump did not care", "Iv'e":"I have", "Dose't":"does not",
                    "DOESEN'T":"does not", "Give'em":"give them", "Won'tdo":"will not do",
                    "They'l":"they will", "He''s":"he is", "I'veve":"I have", "Wern't":"were not",
                    "Pay'um":"pay them", "She''l":"she will", "Y'know":"you know", "DIdn't":"did not",
                    "O'bamacare":"Obamacare", "I'ma":"I am a", "Ma'am":"madam", "WASN'T":"was not",
                    "Dont't":"do not", "Is't":"is not", "OU'RE":"you are", "YOU'RE":"you are",
                    "Ther'es":"there is", "C'mooooooon":"come on", "They_didn't":"they did not",
                    "Som'thin":"something", "Love'em":"love them", "You''re":"you are", "I'D":"I would",
                    "HASN'T":"has not", "WOULD'VE":"would have", "WAsn't":"was not", "ARE'NT":"are not",
                    "Dowsn't":"does not", "It'also":"it is also", "Geev'um":"give them", "Theyv'e":"they have",
                    "Theyr'e":"they are", "Take'em":"take them", "Book'em":"book them", "Havn't":"have not",
                    "DOES'NT":"does not", "Who''s":"who is", "WON't":"will not", "I'Il":"I will",
                    "I'don":"I do not", "AREN'T":"are not", "Ev'rybody":"everybody", "Hold'um":"hold them",
                    "WE'LL":"we will", "Cab't":"can not", "IJustDon'tThink":"I just do not think",
                    "Wouldn'T":"would not", "U'r":"you are", "I''ve":"I have", "DONT'T":"do not",
                    "G'morning":"good morning", "You'ld":"you would", "We''ll":"we will", "YOUR'E":"you are",
                    "TrumpDoesn'tCare":"Trump does not care", "Wasn't":"was not", "You'all":"you all",
                    "Y'ALL":"you all", "G'bye":"goodbye", "YOU'VE":"you have", "Does'nt":"does not",
                    "Don'TCare":"do not care",  "Weren't":"were not", "Y'All":"you all", "They'lll":"they will",
                    "You'reOnYourOwnCare":"you are on your own care", "I'veposted":"I have posted",
                    "Run'em":"run them", "Vote'em":"vote them", "Would't":"would not", "I'l":"I will",
                    "Ddn't":"did not", "I'mm":"I am", "Sshouldn't":"should not", "Your'e":"you are",
                    "I'v":"I have", "We'really":"we are really", "DOESN'T":"does not", "DiDn't":"did not",
                    "Needn't":"need not", "They'er":"they are", "Look'em":"look them", "I'v√à":"I have",
                    "Didn`t":"did not", "I'lll":"I will", "Wouldn't":"would not", "It`s":"it is", "What's":"what is",
                    "ISN`T":"is not", "WE'RE":"we are", "Are'nt":"are not", "DOesn't":"does not", "I'M":"I am",
                    "WON'T":"will not", "WEREN'T":"were not", "TrumpDon'tCareAct":"Trump do not care act",
                    "HAVEN'T":"have not", "That''s":"that is", "Do'nt":"do not"}

def replace_symbol_special(text,check_vocab=False, vocab=None): 

    ''' 
    This method can be used to replace dashes ('-') and apostrophes (') around and within words using regex.
    Inside a word they are only replaced if check_vocab is True and the word is not known to the vocabulary.
    Next to that, common word separators like underscores ('_') and slashes ('/') are replaced by spaces. 
    '''

        
    # replace all dashes and apostrophes at the beginning of a word with a space
    matches = re.findall(r"\s+(?:-|')\w*", text)
    # if there is a match in the text
    if len(matches) != 0:
      # replace the dash or apostrophe inside each match and write it back to the text
      for match in matches:
        text = text.replace(match, re.sub(r"(?:-|')", ' ', match))
    
    # replace all dashes and apostrophes at the end of a word with a space
    # function works as above
    matches = re.findall(r"\w*(?:-|')\s+", text)
    if len(matches) != 0:
      for match in matches:
        text = text.replace(match, re.sub(r"(?:-|')", ' ', match))
    
    if check_vocab:
      # replace dashes and apostrophes in the middle of the word only in case it is not known to a dictionary
      # function works as above
      matches = re.findall(r"\w*(?:-|')\w*", text)
      if len(matches) != 0:
        for match in matches:
          # check if the word with the dash in the middle is in the vocabulary
          if match not in vocab:
            text = text.replace(match, re.sub(r"(?:-|')", ' ', match))
    
    # replace underscores and slashes by spaces
    text = re.sub(r'(?:_|\/)', ' ', text)
    
    # collapse multiple spaces into one
    text = re.sub(r' +', ' ', text)
    return text
  
# Initially we considered removing the dash for words with these prefixes. 
# However, we found that it had almost no impact. Applying it to the total text would destroy correct spellings.
# pre_suffix_dict = {'bi-':'bi', 	'co-':'co','re-':'re',	'de-':'de','pre-':'pre',	'sub-':'sub', 'un-':'un'}

def find_smilies(text):
  
  '''
  For investigation only: Find most common keyboard typed smilies in the text.
  '''
  
  #define a pattern to find typical keyboard smilies
  pattern = r"((?:3|<)?(?::|;|=|B)(?:-|'|'-)?(?:\)|D|P|\*|\(|o|O|\]|\[|\||\\|\/)\s)"
  # Find the matches in the text
  matches = re.findall(pattern, text)
  # If the text contains matches, print the text and the smilies found
  if len(matches) != 0:
    print(text, matches)
    
    

    
def replace_smilies(text):
  
  '''
  Simplified method to replace keyboard smilies with a very simple translation.
  '''
  
  #Find and replace all happy smilies
  matches = re.findall(r"((?:<|O|o|@)?(?::|;|=|B)(?:-|'|'-)?(?:\)|\]))", text)
  if len(matches) != 0:
    text = re.sub(r"((?:<|O|o|@)?(?::|;|=|B)(?:-|'|'-)?(?:\)|\]))", " smile ", text)
  
  #Find and replace all laughing smilies
  matches = re.findall(r"((?:<)?(?::|;|=)(?:-|'|'-)?(?:d|D|P|p)\b)", text)
  if len(matches) != 0:
    text = re.sub(r"((?:<)?(?::|;|=)(?:-|'|'-)?(?:d|D|P|p)\b)", " smile ", text)
  
  #Find and replace all unhappy smilies
  matches = re.findall(r"((?:3|<)?(?::|;|=|8)(?:-|'|'-)?(?:\(|\[|\||\\|\/))", text)
  if len(matches) != 0:
    text = re.sub(r"((?:3|<)?(?::|;|=|8)(?:-|'|'-)?(?:\(|\[|\||\\|\/))", " unhappy ", text)
  
  #Find and replace all kissing smilies
  matches = re.findall(r"((?:<)?(?::|;|=)(?:-|'|'-)?(?:\*))", text)
  if len(matches) != 0:
    text = re.sub(r"((?:<)?(?::|;|=)(?:-|'|'-)?(?:\*))", " kiss ", text)
  
  #Find and replace all surprised smilies
  matches = re.findall(r"((?::|;|=)(?:-|'|'-)?(?:o|O)\b)", text)
  if len(matches) != 0:
    text = re.sub(r"((?::|;|=)(?:-|'|'-)?(?:o|O)\b)", " surprised ", text)
    
  #Find and replace all screaming smilies
  matches = re.findall(r"((?::|;|=)(?:-|'|'-)?(?:@)\b)", text)
  if len(matches) != 0:
    text = re.sub(r"((?::|;|=)(?:-|'|'-)?(?:@)\b)", " screaming ", text)
    
  #Find and replace all hearts
  matches = re.findall(r"‚ô•|‚ù§|<3|‚ù•|‚ô°", text)
  if len(matches) != 0:
    text = re.sub(r"(?:‚ô•|‚ù§|<3|‚ù•|‚ô°)", " love ", text)
  
  text = re.sub(' +', ' ',text)
  return text

def remove_stopwords(text, stop_words):
  
  ''' 
  Remove stopwords and multiple whitespaces around words
  '''
  
  #Compile stopwords separated by | and stopped by word boundary 
  #(each stopword is escaped so that it is matched literally)
  stopword_re = re.compile(r'\b(' + r'|'.join(map(re.escape, stop_words)) + r')\b')
  # Replace the stopwords by space
  text = stopword_re.sub(' ', text)
  #Replace double spaces by a single space
  text = re.sub(' +', ' ',text)
  return text

def clean_text(text, scope='general'):
  
  '''
  This function handles text cleaning from various symbols.
  - it translates special font types into the standard text type of python.
  - it removes all symbols except for dashes and apostrophes, which are handled by 
    "replace_symbol_special".
  - it handles multi letter appearances like "comiiii" > "comi"
  - it reduces frequent unknown word variations containing "Trump" to "Trump"
  '''
  
  
  
  #compile all special symbols from the dictionary to one regex function
  #(keys are escaped so that regex metacharacters are matched literally)
  translate_regex = re.compile(r'(' + r'|'.join(map(re.escape, translate_dictionary.keys())) + r')')
  
  # find all matches of special symbols in the text
  matches = re.findall(translate_regex, text)
  # if there is one or more matches
  if len(matches) != 0:
    for x in matches:
      if x in translate_dictionary.keys():
        #replace the symbol by its replacement item
        text = text.replace(x, translate_dictionary.get(x))
  
  # find and remove all "http" links
  matches = re.findall(r'http\S+', text)
  if len(matches) != 0:
    text = re.sub(r'http\S+', '', text)
  
  #remove all backslashes
  matches = re.findall(r'\\', text)
  if len(matches) != 0:
    text = re.sub(r'\\', ' ', text)
  
  # compile all remaining special characters into one translate line and replace them by space
  # the translate function is really fast thus here our preferred choice
  text = text.translate(str.maketrans(''.join(puncts), len(''.join(puncts))*' '))  
  
  #find runs where a character is repeated 4 or more times in a row and reduce them to a single one
  #we are not correcting words with 2 or three identical letters in a row as this could destroy correct words
  #the backreference \1 stands for the captured character, so each run is replaced by its first character
  text = re.sub(r'(.)\1{3,}', r'\1', text)
          
  # reduce long laughing sequences like "hahaha" to a single "haha"
  text = re.sub(r'\b[ha]{4,}\b', 'haha', text)
  
  # as we found many unknown word variations including 'Trump' we reduce these words just to Trump
  # being represented in most word vectors
  text = re.sub(r'\w*[Tt][Rr][uU][mM][pP]\w*', 'Trump', text)
      
  #remove potential double spaces generated during processing        
  text = re.sub(' +', ' ',text) 
  
  # those symbols are not touched by this function -> see replace_contractions or replace_symbol_special
  #keep = ["'", '-', '´']
  
  
  return text





# The dictionary was generated in the compare and investigation phase in the other notebook
translate_dictionary = {'\t': 't', '0': '0', '1': '1', '2': '2', '3': '3', '5': '5', '6': '6',
                         '8': '8', '9': '9', 'd': 'd', 'e': 'e', 'h': 'h', 'm': 'm', 't': 't',
                         '¬≤': '2', '¬π': '1', 'ƒù': 'g', '≈ì': 'ae', '≈ù': 's', '«ß': 'g', '…ë': '…ë',
                         '…í': 'a', '…î': 'c', '…ô': 'e', '…õ': 'e', '…°': 'g', '…¢': 'g', '…™': 'i',
                         '…¥': 'n', ' Ä': 'r', ' è': 'y', ' ô': 'b', ' ú': 'h', ' ü': 'l', ' ∞': 'h',
                         ' ≥': 'r', ' ∑': 'w', ' ∏': 'y', 'À¢': '5', 'Õû': '-', 'Õü': '_', 'Õ¶': 'o',
                         'Œë': 'a', 'Œí': 'b', 'Œï': 'e', 'Œú': 'm', 'Œù': 'n', 'Œü': 'o', 'Œ§': 't',
                         'Œ≠': 'e', 'ŒØ': 'i', 'Œ±': 'a', 'Œ∫': 'k', 'œá': 'x', '–Ü': 'i', '–ê': 'a',
                         '–ë': 'e', '–ï': 'e', '–ó': '#', '–ò': 'n', '–ö': 'k', '–ú': 'm', '–ù': 'h',
                         '–û': 'o', '–†': 'p', '–°': 'c', '–£': 'y', '–•': 'x',  '–≤': 'b',
                         '–∫': 'k', '–º': 'm', '–Ω': 'h', '—ã': 'bi', '—å': 'b', '—ë': 'e', '—ô': 'jb',
                         '“ì': 'f', '“Ø': 'y', '‘ú': 'w', '’∞': 'h', '◊ê': 'n', '‡Ø¶': '0', '‡±¶': 'o',
                         '‡µ¶': 'o', '‡ªê': 'o', '·é•': 'i', '·é´': 'j', '·èß': 'd', '·ê®': '-', '·ê∏': '<',
                         '·ë≤': 'b', '·ë≥': 'b', '·óû': 'd', '·¥Ä': 'a', '·¥Ñ': 'c', '·¥Ö': 'n', '·¥á': 'e',
                         '·¥ä': 'j', '·¥ã': 'k', '·¥ç': 'm', '·¥è': 'o', '·¥ë': 'o', '·¥ò': 'p', '·¥õ': 't',
                         '·¥ú': 'u', '·¥†': 'v', '·¥°': 'w', '·¥µ': 'i', '·¥∑': 'k', '·¥∫': 'n', '·¥º': 'o',
                         '·µâ': 'e', '·µí': 'o', '·µó': 't', '·µò': 'u', '·∫É': 'w', '·ºÄ': 'a', '·ºà': 'a',
                         '·ºå': 'a', '·Ω∂': 'l', '·Ω∫': 'u', '‚Äí': '-', '‚ÇÅ': '1', '‚ÇÉ': '3', '‚ÇÑ': '4',
                         '‚Ñã': 'h', '‚Ñ†': 'sm', '‚ÑØ': 'e', '‚Ñ¥': 'c', '‚ïå': '--', '‚≤è': 'h', '‚≤£': 'p',
                         '‰∏ã': 'under', '‰∏ç': 'Do not', '‰∫∫': 'people', '‰ºé': 'trick', '‰ºö': 'meeting',
                         '‰Ωú': 'Make', '‰Ω†': 'you', 'ÂÖã': 'Gram', 'ÂÖ≥': 'turn off', 'Âà´': 'do not',
                         'Âä†': 'plus', 'Âçé': 'China', 'Âçñ': 'Sell', 'Âéª': 'go with', 'Âì•': 'brother',
                         'Âõ≠': 'garden', 'ÂõΩ': 'country', 'ÂúÜ': 'circle', 'Âúü': 'soil', 'Âú∞': 'Ground',
                         'Âùè': 'Bad', 'Â§ñ': 'outer', 'Â§ß': 'Big', 'Â§±': 'Lost', 'Â≠ê': 'child', 'Â∞è': 'small',
                         'Êàê': 'to make', 'Êà¶': 'War', 'ÊâÄ': 'Place', 'Êãø': 'take', 'ÊïÖ': 'Therefore',
                         'Êñá': 'Text', 'Êòé': 'Bright', 'ÊòØ': 'Yes', 'Êúâ': 'Have', 'Ê≠å': 'song', 
                         'ÊÆä': 'special', 'Ê≤π': 'oil', 'Ê∏©': 'temperature', 'Áâπ': 'special', 
                         'ÁçÑ': 'prison', 'ÁöÑ': 'of', 'Á®é': 'tax', 'Á≥ª': 'system', 'Áæ§': 'group',
                         'Ëàû': 'dance', 'Ëã±': 'English', 'Ëî°': 'Cai', 'ËÆÆ': 'Discussion', 'Ë∞∑': 'Valley',
                         'Ë±Ü': 'beans', 'ÈÉΩ': 'All', 'Èí±': 'money', 'Èôç': 'drop', 'Èöú': 'barrier',
                         'È™ó': 'cheat', 'ÏÑ∏': 'three', 'Ïïà': 'within', 'ÏòÅ': 'spirit', 'Ïöî': 'Yo',
                          'Õ∫': '', 'Œõ': 'L', 'Œû': 'X', 'Œ¨': 'a', 'ŒÆ': 'or', 'Œπ': 'j',
                         'Œæ': 'X', 'œÇ': 's', 'œà': 't', 'œå': 'The', 'œç': 'gt;', 'œé': 'o',
                         'œñ': 'e.g.', '–ì': 'R', '–î': 'D', '–ñ': 'F', '–õ': 'L', '–ü': 'P', 
                         '–§': 'F', '–®': 'Sh', '–±': 'b', '–ø': 'P', '—Ñ': 'f', '—Ü': 'c', 
                         '—á': 'no', '—à': 'sh', '—â': 'u', '—ç': 'uh', '—é': 'Yu', '—ó': 'her',
                         '—õ': 'ht', '’Å': 'Winter', '’°': 'a', '’§': 'd', '’•': 'e', '’´': 's',
                         '’±': 'h', '’¥': 'm', '’µ': 'y', '’∂': 'h', '’º': 'r', '’Ω': 'c', 
                         '÷Ä': 'p', '÷Ç': '¬≥', '◊ë': 'B', '◊ì': 'D', '◊î': 'God', '◊ï': 'and',
                         '◊ò': 'ninth', '◊ô': 'J', '◊ö': 'D', '◊õ': 'about', '◊ú': 'To', '◊ù': 'From', 
                         '◊û': 'M', '◊ü': 'Estate', '◊†': 'N', '◊°': 'S.', '◊¢': 'P', '◊£': 'Jeff',
                         '◊§': 'F', '◊¶': 'C', '◊ß': 'K.', '◊®': 'R.', '◊©': 'That', '◊™': 'A',
                         'ÿ°': 'Was', 'ÿ¢': 'Ah', 'ÿ£': 'a', 'ÿ•': 'a', 'ÿß': 'a', 'ÿ©': 'e', 
                         'ÿ™': 'T', 'ÿ¨': 'C', 'ÿ≠': 'H', 'ÿÆ': 'Huh', 'ÿØ': 'of the', 'ÿ±': 'T',
                         'ÿ≤': 'Z', 'ÿ≥': 'Q', 'ÿ¥': 'Sh', 'ÿµ': 's', 'ÿ∑': 'I', 'ÿπ': 'AS', 'ÿ∫': 'G',
                         'ŸÅ': 'F', 'ŸÇ': 'S', 'ŸÉ': 'K', 'ŸÑ': 'to', 'ŸÖ': 'M', 'ŸÜ': 'N', 'Ÿá': 'e', 
                         'Ÿà': 'And', 'Ÿâ': 'I', 'Ÿä': 'Y', '⁄Ü': 'What', '⁄©': 'K', '€å': 'Y', 
                         '‡§ï': 'A', '‡§Æ': 'M', '‡§∞': 'And', '‡™ó': 'C', '‡™ú': 'The same', 
                         '‡™§': 'I', '‡™∞': 'I', '‡Æú': 'SAD', '·Éö': 'L', '·πë': 'o', '·ºê': 'e',
                         '·ºî': '√ã', '·º°': 'or', '·º±': 'ƒ±', '·º¥': 'i', '·ΩÄ': 'The', '·ΩÅ': 'The',
                         '·Ωê': '√ø', '·Ω∞': 'a', '·Ω≤': '.', '·Ω∏': 'The', '·Ωª': 'gt;', '·æ∂': 'a', 
                         '·øÜ': 'or', '·øñ': '‡∏Å', '·ø¶': 'I', '„ÅÜ': 'U', '„Åï': 'The', '„Å£': 'What',
                         '„Å§': 'One', '„Å™': 'The', '„Çà': 'The', '„Çâ': 'Et al', '„Ç®': 'The', 
                         '„ÇØ': 'The', '„Çµ': 'The', '„Ç∑': 'The', '„Ç∏': 'The', '„Çπ': 'The',
                         '„ÉÅ': 'The', '„ÉÑ': 'The', '„Éã': 'D', '„Éè': 'Ha', '„Éû': 'Ma', 
                         '„É™': 'The', '„É´': 'Le', '„É¨': 'Les', '„É≠': 'The', '„É≥': 'The',
                         '‰∏Ä': 'One', '‰∏é': 'versus', '‰∏î': 'And', '‰∏∫': 'for', '‰π∞': 'buy',
                         '‰∫Ü': 'Up', '‰∫õ': 'some', '‰ªñ': 'he', '‰ª•': 'Take', '‰ª¨': 'They',
                         '‰ª∂': 'Items', '‰º†': 'pass', '‰º¶': 'Lun', '‰ΩÜ': 'but', '‰ø°': 'letter',
                         'ÂÄô': 'Waiting', 'ÂÅΩ': 'Pseudo', 'ÂÖ®': 'all', 'ÂÖ¨': 'public', 'ÂÖ∂': 'its',
                         'ÂÖª': 'support', 'ÂÜ¨': 'winter', 'Âá∏': 'Convex', 'Âáª': 'hit', 'Âà§': 'Judge',
                         'Âà∞': 'To', 'Âèã': 'Friend', 'ÂèØ': 'can', 'Âêó': 'What?', 'Âíå': 'with',
                         'ÂîØ': 'only', 'Âõ†': 'because', 'Âú£': 'Holy', 'Âú®': 'in', 'Âü∫': 'base',
                         'Â†Ç': 'Hall', 'Â£´': 'Shishi', 'Â§ç': 'complex', 'Â§ö': 'many', 'Â§©': 'day',
                         'Â•Ω': 'it is good', 'Â¶Ç': 'Such as', 'Â©ö': 'marriage', 'Â≠©': 'child', 
                         'ÂÆ†': 'Pet', 'ÂØì': 'Apartment', 'ÂØπ': 'Correct', 'Â±Å': 'fart', 
                         'Â±à': 'Qu', 'Â∑®': 'huge', 'Â∑±': 'already', 'Âºè': 'formula', 'ÂΩì': 'when',
                         'ÂΩº': 'he', 'Âæí': 'only', 'Âæó': 'Got', 'ÊÄí': 'angry', 'ÊÄ™': 'strange',
                         'ÊÅê': 'fear', 'ÊÉß': 'fear', 'ÊÉ≥': 'miss you', 'ÊÑ§': 'anger', 'Êàë': 'I',
                         'Êàò': 'war', 'Êâπ': 'Batch', 'Êää': 'Put', 'Êãâ': 'Pull', 'Êã∑': 'Copy', 
                         'Êé•': 'Connect', 'Êìç': 'Fuck', 'Êî∂': 'Receive', 'Êîø': 'Politics', 
                         'Êïô': 'teach', 'Êñ§': 'jin', 'ÊñØ': 'S', 'Êñ∞': 'new', 'Êó∂': 'Time', 
                         'ÊôÆ': 'general', 'Êõæ': 'Once', 'Êú¨': 'this', 'ÊùÄ': 'kill', 'ÊûÅ': 'pole',
                         'Êü•': 'check', 'Ê†ó': 'chestnut', 'Ê†™': 'stock', 'Ê†∑': 'kind', 'Ê£Ä': 'Check',
                         'Ê¨¢': 'Happy', 'Ê≠ª': 'dead', 'Ê±â': 'Chinese', 'Ê≤°': 'No', 'Ê≤ª': 'rule', 
                         'Ê≥ï': 'law', 'Ê¥ª': 'live', 'ÁÇπ': 'point', 'Ááª': 'Moth', 'Áâ©': 'object',
                         'Áåú': 'guess', 'Áå¥': 'monkey', 'ÁêÜ': 'Rational', 'Áîü': 'Health', 'Áî®': 'use',
                         'ÁôΩ': 'White', 'Áôæ': 'hundred', 'Áõ¥': 'straight', 'Áõ∏': 'phase', 'Áúã': 'Look',
                         'Áù£': 'Supervisor', 'Áü•': 'know', 'Á§æ': 'Society', 'Á•ù': 'wish', 'ÁßØ': 'product',
                         'Á®£': 'Jesus', 'Áªè': 'through', 'Áªì': 'Knot', 'Áªô': 'give', 'Áæé': 'nice', 
                         'ËÄ∂': 'Yay', 'ËÅä': 'chat', 'ËÉú': 'Win', 'Ëá≥': 'to', 'Ëôö': 'Virtual', 'Ë£Ω': 'Made', 
                         'Ë¶Å': 'Want', 'ËÆ§': 'recognize', 'ËÆ®': 'discuss', 'ËÆ©': 'Let', 'ËØÜ': 'knowledge',
                         'ËØù': 'words', 'ËØ≠': 'language', 'ËØ¥': 'Say', 'Ë∞ä': 'friendship', 
                         'Ë∞ì': 'Predicate', 'Ë±°': 'Elephant', 'Ë¥∫': 'He', 'Ëµ¢': 'win', 'Ëøé': 'welcome',
                         'Ëøò': 'also', 'Ëøô': 'This', 'ÈÄö': 'through', 'ÈâÑ': 'iron', 'ÈóÆ': 'ask', 
                         'Èòø': 'A', 'È¢ò': 'question', 'È¢ù': 'amount', 'È¨º': 'ghost', 'È∏°': 'Chicken',
                         'Í∞Ä': 'end', 'Í∞à': 'Go', 'Í≤å': 'to', 'Í≤©': 'case', 'Í≤Ω': 'circa', 'Í¥Ä': 'tube',
                         'Íµ≠': 'soup', 'Í∏à': 'gold', 'ÎÇò': 'I', 'Îäî': 'The', 'Îãà': 'Nee', 'Îã§': 'All',
                         'ÎåÄ': 'versus', 'ÎèÑ': 'Degree', 'Îêú': 'The', 'Îìú': 'De', 'Îì§': 'field', 
                         'Îïå': 'time', 'Îü∞': 'Run', 'Î†µ': 'Hi', 'Î°ù': 'rock', 'Î§º': 'Crown', 
                         'Î¶¨': 'Lee', 'Îßà': 'hemp', 'Îßå': 'just', 'Î∞ò': 'half', 'Î∂Ñ': 'minute', 
                         'ÏÇ¨': 'four', 'ÏÉÅ': 'Prize', 'ÏÑú': 'book', 'ÏÑù': 'three', 'ÏÑ±': 'castle',
                         'Ïä§': 'The', 'Ïãú': 'city', 'Ïïä': 'Not', 'Ïïº': 'Hey', 'ÏïΩ': 'about', 
                         'Ïñ¥': 'uh', 'ÏôÄ': 'Wow', 'Ïö©': 'for', 'Ïú†': 'U', 'ÏùÑ': 'of', 'Ïù¥': 'this',
                         'Ïù∏': 'sign', 'Ïûò': 'well', 'Ï†ú': 'My', 'Ï•ê': 'rat', 'ÏßÄ': 'G', 'Ï¥à': 'second',
                         'Ï∫ê': 'Can', 'ÌÉ±': 'Tang', 'Ìä∏': 'The', 'Ìã∞': 'tea', 'Ìå®': 'tile', 'Ìíà': 'Width', 
                         'Ìïú': 'One', 'Ìï©': 'synthesis', 'Ìï¥': 'year', 'Ìóà': 'Huh', 'Ìôî': 'anger', 'Ìô©': 'sulfur',
                         'Ìïò': 'Ha', 'Ô¨Å': 'be', 'Ôºê': '#', 'Ôºí': '#', 'Ôºò': '#', 'Ôº•': 'e', 'Ôºß': 'g',
                         'Ôº®': 'h', 'Ôº≠': 'm', 'ÔºÆ': 'n', 'ÔºØ': 'O', 'Ôº≥': 's', 'Ôºµ': 'U', 'Ôº∑': 'w',
                         'ÔΩÅ': 'a', 'ÔΩÇ': 'b', 'ÔΩÉ': 'c', 'ÔΩÑ': 'd', 'ÔΩÖ': 'e', 'ÔΩÜ': 'f', 'ÔΩá': 'g',
                         'ÔΩà': 'h', 'ÔΩâ': 'i', 'ÔΩã': 'k', 'ÔΩå': 'l', 'ÔΩç': 'm', 'ÔΩé': 'n', 'ÔΩè': 'o',
                         'ÔΩí': 'r', 'ÔΩì': 's', 'ÔΩî': 't', 'ÔΩï': 'u', 'ÔΩñ': 'v', 'ÔΩó': 'w', 'ÔΩô': 'y',
                         'ùêÄ': 'a', 'ùêÇ': 'c', 'ùêÉ': 'd', 'ùêÖ': 'f', 'ùêá': 'h', 'ùêä': 'k', 'ùêç': 'n', 
                         'ùêé': 'o', 'ùêë': 'r', 'ùêì': 't', 'ùêî': 'u', 'ùêò': 'y', 'ùêô': 'z', 'ùêö': 'a',
                         'ùêõ': 'b', 'ùêú': 'c', 'ùêù': 'd', 'ùêû': 'e', 'ùêü': 'f', 'ùê†': 'g', 'ùê°': 'h', 
                         'ùê¢': 'i', 'ùê£': 'j', 'ùê•': 'i', 'ùê¶': 'm', 'ùêß': 'n', 'ùê®': 'o', 'ùê©': 'p',
                         'ùê™': 'q', 'ùê´': 'r', 'ùê¨': 's', 'ùê≠': 't', 'ùêÆ': 'u', 'ùêØ': 'v', 'ùê∞': 'w',
                         'ùê±': 'x', 'ùê≤': 'y', 'ùê≥': 'z', 'ùë•': 'x', 'ùë¶': 'y', 'ùëß': 'z', 'ùë©': 'b',
                         'ùë™': 'c', 'ùë´': 'd', 'ùë¨': 'e', 'ùë≠': 'f', 'ùëÆ': 'g', 'ùëØ': 'h', 'ùë∞': 'i',
                         'ùë±': 'j', 'ùë≤': 'k', 'ùë≥': 'l', 'ùë¥': 'm', 'ùëµ': 'n', 'ùë∂': '0', 'ùë∑': 'p',
                         'ùëπ': 'r', 'ùë∫': 's', 'ùëª': 't', 'ùëæ': 'w', 'ùíÄ': 'y', 'ùíÅ': 'z', 'ùíÇ': 'a',
                         'ùíÉ': 'b', 'ùíÑ': 'c', 'ùíÖ': 'd', 'ùíÜ': 'e', 'ùíá': 'f', 'ùíà': 'g', 'ùíâ': 'h',
                         'ùíä': 'i', 'ùíã': 'j', 'ùíå': 'k', 'ùíç': 'l', 'ùíé': 'm', 'ùíè': 'n', 'ùíê': 'o', 
                         'ùíë': 'p', 'ùíí': 'q', 'ùíì': 'r', 'ùíî': 's', 'ùíï': 't', 'ùíñ': 'u', 'ùíó': 'v', 
                         'ùíò': 'w', 'ùíô': 'x', 'ùíö': 'y', 'ùíõ': 'z', 'ùí©': 'n', 'ùí∂': 'a', 'ùí∏': 'c',
                         'ùíΩ': 'h', 'ùíæ': 'i', 'ùìÄ': 'k', 'ùìÅ': 'l', 'ùìÉ': 'n', 'ùìÖ': 'p', 'ùìá': 'r',
                         'ùìà': 's', 'ùìâ': 't', 'ùìä': 'u', 'ùìå': 'w', 'ùìé': 'y', 'ùìí': 'c', 'ùì¨': 'c',
                         'ùìÆ': 'e', 'ùì≤': 'i', 'ùì¥': 'k', 'ùìµ': 'l', 'ùìª': 'r', 'ùìº': 's', 'ùìΩ': 't',
                         'ùìø': 'v', 'ùï¥': 'j', 'ùï∏': 'm', 'ùïø': 'i', 'ùñÇ': 'm', 'ùñÜ': 'a', 'ùñá': 'b',
                         'ùñà': 'c', 'ùñâ': 'd', 'ùñä': 'e', 'ùñã': 'f', 'ùñå': 'g', 'ùñç': 'h', 'ùñé': 'i', 
                         'ùñí': 'm', 'ùñì': 'n', 'ùñï': 'p', 'ùñó': 'r', 'ùñò': 's', 'ùñô': 't', 'ùñö': 'u',
                         'ùñõ': 'v', 'ùñú': 'w', 'ùñû': 'n', 'ùñü': 'z', 'ùóï': 'b', 'ùóò': 'e', 'ùóô': 'f',
                         'ùóû': 'k', 'ùóü': 'l', 'ùó†': 'm', 'ùó¢': 'o', 'ùó§': 'q', 'ùó¶': 's', 'ùóß': 't',
                         'ùó™': 'w', 'ùó≠': 'z', 'ùóÆ': 'a', 'ùóØ': 'b', 'ùó∞': 'c', 'ùó±': 'd', 'ùó≤': 'e',
                         'ùó≥': 'f', 'ùó¥': 'g', 'ùóµ': 'h', 'ùó∂': 'i', 'ùó∑': 'j', 'ùó∏': 'k', 'ùóπ': 'i',
                         'ùó∫': 'm', 'ùóª': 'n', 'ùóº': 'o', 'ùóΩ': 'p', 'ùóø': 'r', 'ùòÄ': 's', 'ùòÅ': 't',
                         'ùòÇ': 'u', 'ùòÉ': 'v', 'ùòÑ': 'w', 'ùòÖ': 'x', 'ùòÜ': 'y', 'ùòá': 'z', 'ùòê': 'l',
                         'ùòì': 'l', 'ùòñ': 'o', 'ùò¢': 'a', 'ùò£': 'b', 'ùò§': 'c', 'ùò•': 'd', 'ùò¶': 'e',
                         'ùòß': 'f', 'ùò®': 'g', 'ùò©': 'h', 'ùò™': 'i', 'ùò´': 'j', 'ùò¨': 'k', 'ùòÆ': 'm',
                         'ùòØ': 'n', 'ùò∞': 'o', 'ùò±': 'p', 'ùò≤': 'q', 'ùò≥': 'r', 'ùò¥': 's', 'ùòµ': 't',
                         'ùò∂': 'u', 'ùò∑': 'v', 'ùò∏': 'w', 'ùòπ': 'x', 'ùò∫': 'y', 'ùòº': 'a', 'ùòΩ': 'b',
                         'ùòæ': 'c', 'ùòø': 'd', 'ùôÄ': 'e', 'ùôÉ': 'h', 'ùôÖ': 'j', 'ùôÜ': 'k', 'ùôá': 'l', 
                         'ùôà': 'm', 'ùôä': 'o', 'ùôã': 'p', 'ùôç': 'r', 'ùôè': 't', 'ùôí': 'w', 'ùôî': 'y',
                         'ùôñ': 'a', 'ùôó': 'b', 'ùôò': 'c', 'ùôô': 'd', 'ùôö': 'e', 'ùôõ': 'f', 'ùôú': 'g',
                         'ùôù': 'h', 'ùôû': 'i', 'ùôü': 'j', 'ùô†': 'k', 'ùô¢': 'm', 'ùô£': 'n', 'ùô§': 'o',
                         'ùô•': 'p', 'ùôß': 'r', 'ùô®': 's', 'ùô©': 't', 'ùô™': 'u', 'ùô´': 'v', 'ùô¨': 'w',
                         'ùô≠': 'x', 'ùôÆ': 'y', 'ùüé': '0', 'ùüè': '1', 'ùüê': '2', 'ùüì': '5', 'ùüî': '6',
                         'ùüñ': '8', 'ùü¨': '0', 'ùü≠': '1', 'ùüÆ': '2', 'ùüØ': '3', 'ùü∞': '4', 'ùü±': '5',
                         'ùü≤': '6', 'ùü≥': '7', 'ùüë':'3', 'ùüí':'4', 'ùüï':'7', 'ùüó':'9',
                         'üá¶': 'a', 'üá©': 'd', 'üá™': 'e', 'üá¨': 'g', 'üáÆ': 'i', 
                         'üá≥': 'n', 'üá¥': 'o', 'üá∑': 'r', 'üáπ': 't', 'üáº': 'w', 'üñí': 'thumps up',
                         '‚Ñè':'h', ' ≤':'j', 'Ôº£':'c', 'ƒ∫':'i', 'Ôº™':'j', 'ƒ∏':'k', 'Ôº∞':'p'}






# This list was created in a separate notebook that investigated the word embeddings.
# It is used to remove unwanted characters from the text.
puncts =                 ['_','!', '?','\x08', '\n', '\x0b', '\r', '\x10', '\x13', '\x1f', ' ', ' # ', '"', '#', 
                         '# ', '$', '%', '&',  '(', ')', '*', '+', ',',  '/', '.', ':', ';', '<',
                         '=', '>', '@', '[', '\\', ']', '^', '`', '{', '|', '}', '~', '\x7f', '\x80',
                         '\x81', '\x85', '\x91', '\x92', '\x95', '\x96', '\x9c', '\x9d', '\x9f', '\xa0', 
                         '¬°', '¬¢‡ºº', '¬£', '¬§', '¬•', '¬ß', '¬®', '¬©', '¬´', '¬¨', '\xad', '¬Ø', '¬∞', '¬±', '¬≥',
                         '¬∂', '¬∑', '¬∏', '¬∫', '¬ª', '¬º', '¬Ω', '¬æ', '¬ø', '√ó', '√ò', '√∑', '√∏', '∆Ñ', '∆Ω',
                         '«î', '»ª', '…ú', '…©', ' É', ' å', ' ª', ' º', 'Àà', 'Àå', 'Àê', 'Àô', 'Àö', 'ÃÅ', 'ÃÑ', 'ÃÖ', 
                         'Ãá', 'Ãà', 'Ã£', 'Ã®', 'ÃØ', 'Ã±', 'Ã≤', 'Ã∂', 'Õú', 'Õù', 'Õû', 'Õü', 'Õ°', 'Õ¶', 'ÿü', 'Ÿé', 'Ÿê', '⁄°', 
                         '€û', '€©', '‹Å', '‡§æ', '‡•ç', '‡™æ', '‡´Ä', '‡´Å', '‡πè', '‡πèÃØÕ°', '‡ºº', '‡ºΩ', '·êÉ', '·ê£', '·ê¶', '·êß',
                         '·ëé', '·ë≠', '·ëØ', '·íß', '·ìÄ', '·ìÇ', '·ìÉ', '·ìá', '·î≠', '·¥¶', '·¥®', '·µª', '·º∏', '·ºπ', '·Ωº', 
                         '·æΩ', '·øÉ', '\u2000', '\u2001', '\u2002', '\u2003', '\u2004', '\u2005', '\u2006', 
                         '\u2007', '\u2008', '\u2009', '\u200a', '\u200b', '\u200c', '\u200d', '\u200e',
                         '\u200f', '‚Äê', '‚Äë', '‚Äí', '‚Äì', '‚Äî', '‚Äï', '‚Äñ', '‚Äò', '‚Äô', '‚Äö', '‚Äõ', '‚Äú', '‚Äù', '‚Äû',
                         '‚Ä†', '‚Ä°', '‚Ä¢', '‚Ä£', '‚Ä¶', '\u2028', '\u202a', '\u202c', '\u202d', '\u202f', '‚Ä∞',
                         '‚Ä≤', '‚Ä≥', '‚Äπ', '‚Ä∫', '‚Äø', '‚ÅÑ', '‚ÅçÃ¥Ãõ\u3000', '‚Åé', '‚Å¥', '‚ÇÇ', '‚Ç¨', '‚Çµ', '‚ÇΩ', '‚ÑÉ', '‚ÑÖ',
                         '‚Ñê', '‚Ñ¢', '‚ÑÆ', '‚Öì', '‚Üê', '‚Üë', '‚Üí', '‚Üì', '‚Ü≥', '‚Ü¥', '‚Ü∫', '‚áå', '‚áí', '‚á§', '‚àÜ', '‚àé',
                         '‚àè', '‚àí', '‚àï', '‚àô', '‚àö', '‚àû', '‚à©', '‚à¥', '‚àµ', '‚àº', '‚âà', '‚â†', '‚â§', '‚â•', '‚äÇ', '‚äï',
                         '‚äò', '‚ãÖ', '‚ãÜ', '‚å†', '‚éå', '‚èñ', '‚îÄ', '‚îÅ', '‚îÉ', '‚îà', '‚îä', '‚îó', '‚î£', '‚î´', '‚î≥', '‚ïå', '‚ïê',
                         '‚ïë', '‚ïî', '‚ïó', '‚ïö', '‚ï£', '‚ï¶', '‚ï©', '‚ï™', '‚ï≠', '‚ï≠‚ïÆ', '‚ïÆ', '‚ïØ', '‚ï∞', '‚ï±', '‚ï≤', '‚ñÄ',
                         '‚ñÇ', '‚ñÉ', '‚ñÑ', '‚ñÖ', '‚ñÜ', '‚ñá', '‚ñà', '‚ñä', '‚ñã', '‚ñè', '‚ñë', '‚ñí', '‚ñì', '‚ñî', '‚ñï', 
                         '‚ñô', '‚ñ†', '‚ñ™', '‚ñ¨', '‚ñ∞', '‚ñ±', '‚ñ≤', '‚ñ∑', '‚ñ∏', '‚ñ∫', '‚ñº', '‚ñæ', '‚óÑ', '‚óá', '‚óã',
                         '‚óè', '‚óê', '‚óî', '‚óï', '‚óù', '‚óû', '‚ó°', '‚ó¶', '‚òÖ', '‚òÜ', '‚òè', '‚òê', '‚òí', '‚òô', '‚òõ',
                         '‚òú', '‚òû', '‚ò≠', '‚òª', '‚òº', '‚ô¶', '‚ô©', '‚ô™', '‚ô´', '‚ô¨', '‚ô≠', '‚ô≤', '‚öÜ', '‚ö≠', '‚ö≤', '‚úÄ',
                         '‚úì', '‚úò', '‚úû', '‚úß', '‚ú¨', '‚ú≠', '‚ú∞', '‚úæ', '‚ùÜ', '‚ùß', '‚û§', '‚û•', '‚†Ä', '‚§è', '‚¶Å',
                         '‚©õ', '‚¨≠', '‚¨Ø', '\u3000', '„ÄÅ', '„ÄÇ', '„Ää', '„Äã', '„Äå', '„Äç', '„Äî', '„Éª', '„Ñ∏', '„Öì',
                         'Èîü', 'Íú•', '\ue014', '\ue600', '\ue602', '\ue607', '\ue608', '\ue613', '\ue807',
                         '\uf005', '\uf020', '\uf04a', '\uf04c', '\uf070',  '\uf202\uf099', '\uf203',
                         '\uf071\uf03d\uf031\uf02f\uf032\uf028\uf070\uf02f\uf032\uf02d\uf061\uf029',
                         '\uf099', '\uf09a', '\uf0a7', '\uf0b7', '\uf0e0', '\uf10a', '\uf202', 
                         '\uf203\uf09a', '\uf222', '\uf222\ue608', '\uf410', '\uf410\ue600', '\uf469', 
                         '\uf469\ue607', '\uf818', 'Ô¥æ', 'Ô¥æÕ°', 'Ô¥ø', 'Ô∑ª', '\ufeff', 'ÔºÅ', 'ÔºÖ', 'Ôºá',
                         'Ôºà', 'Ôºâ', 'Ôºå', 'Ôºç', 'Ôºé', 'Ôºè', 'Ôºö', 'Ôºû', 'Ôºü', 'Ôºº', 'ÔΩú', 'Ôø¶', 'Ôøº', 'ÔøΩ',
                         'ùíª', 'ùïæ', 'ùñÑ', 'ùñê', 'ùñë', 'ùñî', 'ùóú', 'ùòä', 'ùò≠', 'ùôÑ', 'ùô°', 'ùùà', 'üñë', 'üñí']

def clean_numbers(x):
  
  """
  Format the numbers in a text.
  First the ordinal suffixes "th", "st", "nd" and "rd" are removed after a number,
  then every number with two or more digits is masked with "#".
  """
  
  # remove ordinal suffixes only when they directly follow a number, e.g. "5th" -> "5 ";
  # the number is captured and kept, so words such as "with" or "third" are left alone
  x = re.sub(r'\b(\d+)\s*(?:th|st|nd|rd)\b', r'\1 ', x)
  
  # mask numbers with two or more digits by one "#" per digit (at most five)
  # note: digits attached to words, e.g. "G-20", are masked as well
  if re.search(r'\d', x):
    x = re.sub('[0-9]{5,}', '#####', x)
    x = re.sub('[0-9]{4}', '####', x)
    x = re.sub('[0-9]{3}', '###', x)
    x = re.sub('[0-9]{2}', '##', x)
    # single digits are kept because all word vectors include them
    #x = re.sub('[0-9]{1}', '#', x)
    
  return x

def year_and_hour(text):
  """
  Replace the abbreviations "yr"/"yrs" by "year" and "hr"/"hrs" by "hour"
  when they follow a number. The number itself is kept, e.g. "5yrs" -> "5 year".
  """
  
  # the optional "s" covers both the singular and the plural abbreviation
  text = re.sub(r'\b(\d+)\s*yrs?\b', r'\1 year', text)
  text = re.sub(r'\b(\d+)\s*hrs?\b', r'\1 hour', text)
  return text

from textblob import Word

def textBlobLemmatize(sentence):
  """
  Lemmatize every word of a sentence with the Word class of the TextBlob package.
  """  
  # lemmatize token by token; str.replace would also alter matching substrings of other words
  return ' '.join(Word(x).lemmatize() for x in sentence.split())

def build_vocab(df):
  
  '''Build a dictionary of words and their number of occurrences from a column of texts'''
  
  #initialize the tokenizer
  tokenizer = TweetTokenizer()
  
  vocab = {}
  for i, row in enumerate(df):
      #tokenize the sentence 
      words = tokenizer.tokenize(row)
      #for each word, check if it is in the dict otherwise add a new entry
      for w in words:
       
        try:
            vocab[w] += 1
        except KeyError:
            vocab[w] = 1
  
  return vocab

#https://www.kaggle.com/christofhenkel/how-to-preprocessing-for-glove-part1-eda
def check_coverage(vocab,embeddings_index, print_oov_num=100):
  '''
  This function checks which part of the vocabulary and of the text is covered by the embedding index.
  It prints the most frequent unknown words and returns a dictionary that maps each unknown word to its frequency.
  '''
  
  a = {}
  oov = {}
  k = 0
  i = 0

  # for every word in vocab
  for word in vocab:
      # check if it can be found in the embedding
      try:
          # store the embedding of the word in a
          a[word] = embeddings_index[word]
          # count up by the number of occurrences in df
          k += vocab[word]
      except KeyError:
          # if there is no embedding for the word, add it to oov
          oov[word] = vocab[word]
          # count up by the number of occurrences in df
          i += vocab[word]
  # share of distinct words for which an embedding was found
  print('Found embeddings for {:.2%} of vocab'.format(len(a) / len(vocab)))
  # divide the number of found word occurrences by the number of all word occurrences in df
  print('Found embeddings for  {:.2%} of all text'.format(k / (k + i)))

  # print the unknown words sorted by number of occurrences, most frequent first
  sorted_x = sorted(oov.items(), key=operator.itemgetter(1))[::-1]
  print('Top unknown words are:', sorted_x[:print_oov_num])

  # return dict of unknown words + occurrences
  return oov

import datetime
import numpy as np

def load_embedding_vocab(path):
  '''
  Load the embeddings from a text file and return them as a dictionary
  that maps each word to its vector.
  '''  
  # Print starting info about the loading
  starttime = datetime.datetime.now().replace(microsecond=0)
  print("Starttime: ", starttime)

  def timediff(time):
    return time - starttime
  
  def get_coefs(word, *arr):
    return word, np.asarray(arr, dtype='float32')

  # read the file line by line; the context manager closes the file afterwards
  with open(path, encoding='utf-8') as f:
    embeddings_index = dict(get_coefs(*o.strip().split(" ")) for o in f)
    
  time = datetime.datetime.now().replace(microsecond=0)
  print("Embedding model loaded and vocab returned. Time since start: ", timediff(time))
  
  # return the dictionary of word vectors
  return embeddings_index

def preprocessing_NN(df, model_vocab, calc_coverage=True, print_oov_num=100):
  """
  Combine the whole pre-processing for neural networks, where less pre-processing is required than for conventional methods.
  This means we do not remove stopwords, lemmatize or remove typical punctuation.

  Only words that are not known to the embedding dictionary out of the box are corrected.
  The function is optimized for the nltk TweetTokenizer.
  """
  
  # Set parameters
  tokenizer = TweetTokenizer()
  
  # Print starting info about the pre-processing
  starttime = datetime.datetime.now().replace(microsecond=0)
  print('Dataset Length: ', len(df), "Starttime: ", starttime)

  def timediff(time):
    return time - starttime
  
  # build a vocabulary from the text 
  vocab = build_vocab(df.comment_text)
  print('Embedding vectors are loaded. \n')
  # check the coverage and receive a dictionary of unknown words
  unknown = check_coverage(vocab,model_vocab, print_oov_num=print_oov_num)
  # extract the list of unknown words
  unknown = unknown.keys()
  
  ## Process the unknown words
  # The replace_contractions function is applied to the unknown words
  corrected = [replace_contractions(x) for x in unknown]
  time = datetime.datetime.now().replace(microsecond=0)
  print("Contractions have been replaced. Time since start: ", timediff(time))

  # Replace emojis with text
  corrected = [emoji.demojize(x) for x in corrected]
  time = datetime.datetime.now().replace(microsecond=0)
  print("Emojis have been converted to text. Time since start: ", timediff(time))

  # Replace keyboard smilies with text
  corrected = [replace_smilies(x) for x in corrected]
  time = datetime.datetime.now().replace(microsecond=0)
  print("Smilies have been converted to text. Time since start: ", timediff(time))

  # The clean_text function is applied to the unknown words
  corrected = [clean_text(x) for x in corrected]
  time = datetime.datetime.now().replace(microsecond=0)
  print("All signs have been removed. Time since start: ", timediff(time))
  
  # The clean_numbers function is applied
  corrected = [clean_numbers(x) for x in corrected]
  time = datetime.datetime.now().replace(microsecond=0)
  print("All numbers have been replaced with ###. Time since start: ", timediff(time))
  
  # Replace or remove special characters like - / _ according to rules
  corrected = [replace_symbol_special(x, check_vocab=True, vocab=model_vocab) for x in corrected]
  time = datetime.datetime.now().replace(microsecond=0)
  print("Special symbols have been processed. Time since start: ", timediff(time))

  # Abbreviations are replaced by year and hour
  corrected = [year_and_hour(x) for x in corrected]
  time = datetime.datetime.now().replace(microsecond=0)
  print("Yr and hr have been replaced by year and hour. Time since start: ", timediff(time))
  
  # *Takes too long
  #Correct spelling mistakes
  #corrected = [TextBlob(x).correct() for x in corrected]
  #time = datetime.datetime.now().replace(microsecond=0)
  #print("Spelling mistakes have been corrected. Time since start: ", timediff(time))
  
  #create a dictionary from word and correction
  dictionary = dict(zip(unknown, corrected))
  
  #keep only the entries where the correction differs from the unknown word
  dict_mispell = {key: value for key, value in dictionary.items() if key != value}
  
  time = datetime.datetime.now().replace(microsecond=0)
  print('Correction dictionary of unknown words prepared. Time since start: ', timediff(time))
  #print(dict_mispell, '\n')
  
  def clean_mispell(text, dict_mispell):
    '''Replaces the unknown words in the text by its corrections.'''
    #tokenize the text with TweetTokenizer
    words = tokenizer.tokenize(text)
    for i, word in enumerate(words):
      # if the word is among the misspellings
      if word in dict_mispell:
        #replace it by the corrected word
        words[i] = dict_mispell[word]
    #merge text by space
    text = ' '.join(words)
    # remove all double spaces potentially appearing after pre-processing.
    text  = re.sub(r' +', ' ', text)
    return text
      
  
  #tqdm.pandas()
  df.comment_text = df.comment_text.apply(lambda x: clean_mispell(x, dict_mispell))
  time = datetime.datetime.now().replace(microsecond=0)
  print('Unknown words replaced excluding coverage check. Time since start: ', timediff(time))
  
  # print the final result
  if calc_coverage:
    vocab = build_vocab(df.comment_text)
    unknown = check_coverage(vocab,model_vocab, print_oov_num=print_oov_num)
    time = datetime.datetime.now().replace(microsecond=0)
    print('Pre-processing done including coverage check. Time since start: ', timediff(time))
  
  return df

**2. Used Functions**

The check_coverage, load_embedding_vocab and build_vocab functions were already covered in the preprocessing notebook and are not explained again here. The get_characters function returns all characters of a given list of words that are not ASCII letters (digits, punctuation, symbols and non-ASCII characters). The extract_used_characters function does the same for all texts of a given dataframe column, and the check_coverage_characters function prints the percentage of those characters that are covered by the character list of a word vector.
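
As a toy illustration of how these three helpers fit together, the sketch below runs the same idea on two hand-made lists. The names `non_letter_characters`, `toy_comments` and `toy_embedding_words` are made up for this example, and sets are used instead of lists, which is an assumption made for brevity and not how the cells below are written.

```python
import string

ASCII_LETTERS = set(string.ascii_letters)

def non_letter_characters(texts):
    """Collect every character that is not an ASCII letter (digits, punctuation, symbols, ...)."""
    chars = set()
    for text in texts:
        chars.update(text)
    return chars - ASCII_LETTERS

# Hypothetical toy data standing in for the comments and the embedding vocabulary
toy_comments = ["Caf\u00e9 at 9!", "na\u00efve \u263a"]
toy_embedding_words = ["caf\u00e9", "!", "9", "hello"]

comment_chars = non_letter_characters(toy_comments)
embedding_chars = non_letter_characters(toy_embedding_words)

# Share of the characters used in the comments that also occur in the embedding words
covered = comment_chars & embedding_chars
share = len(covered) / len(comment_chars)
print('{:.2%} of the characters are covered'.format(share))
```

Note that the space character and digits count as "characters" here as well, because only ASCII letters are filtered out.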

In [None]:
# extract all characters that are not ASCII letters from a list of words
def get_characters(_list):
  character = set()
  for element in _list:
    for letter in element:
      character.add(letter)
  character = [c for c in character if c not in list(string.ascii_letters)]    
  return character

In [None]:
# extract all characters that are not ASCII letters from a dataframe column
def extract_used_characters(df):
    
    used_characters = set()
    for i, row in enumerate(df):
        characters = list(row)
        for x in characters:
            used_characters.add(x)
    used_characters = [c for c in used_characters if c not in list(string.ascii_letters)]
    return used_characters

In [None]:
# Check the coverage of the signs
def check_coverage_characters(df_charac, model_charac):
  
    i = 0
    for c in df_charac:
      if c in model_charac:
        i += 1
    
    print('{:.2%} of the signs in the data frame are covered by the model'.format((i / len(df_charac))))

**3. Word Vectors**

As mentioned above, we will look at three different word vectors. First of all, we check their out-of-the-box performance without any preprocessing.

Beforehand we build a general vocabulary from all comments of the train data. This vocabulary is used for all three word vectors. Then we convert the emojis of the train data to text and extract all characters that are not ASCII letters from the dataframe.
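
The vocabulary is simply a mapping from each token to its number of occurrences. A minimal sketch of that idea, assuming plain whitespace splitting instead of the TweetTokenizer used in `build_vocab` (so punctuation would stay attached to words here) and using made-up example comments; `build_toy_vocab` is a hypothetical name:

```python
from collections import Counter

def build_toy_vocab(texts):
    """Count how often each whitespace-separated token occurs across all texts."""
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return dict(vocab)

# Hypothetical example comments
toy_comments = ["I like cats", "I like dogs too", "Cats like me"]
toy_vocab = build_toy_vocab(toy_comments)
print(toy_vocab)
```

Because no lowercasing is applied, `cats` and `Cats` end up as two separate entries.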

In [None]:
# Create general vocabulary for the train data
general_vocab = build_vocab(train_data.comment_text)

In [None]:
# Convert the emojis in train_data to text
train_data.comment_text = train_data.comment_text.apply(lambda x: emoji.demojize(x))

In [None]:
# get the characters from the df
charac_df = extract_used_characters(train_data.comment_text)

In [None]:
# Load dataset
train_data = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
# kill all other columns except comment text and target
cols_to_keep = ['comment_text','target']
train_data = train_data.drop(train_data.columns.difference(cols_to_keep), axis=1)
gc.collect()

With the extracted signs we will check the percentage of signs covered by each model. Although this notebook will not include any details about the preprocessing, it will apply our preprocessing function for each word vector. Once preprocessed we will use the check_coverage function again to see the effect on the result.
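
The two percentages printed by `check_coverage` answer different questions: the first is the share of distinct tokens with an embedding, the second the share of all token occurrences. Below is a toy sketch of both numbers before and after applying a correction dictionary, in the spirit of `preprocessing_NN`. The function `coverage`, the token counts and the `corrections` mapping are invented for illustration and are not taken from the notebook's results.

```python
def coverage(vocab, known_words):
    """Return (share of distinct tokens found, share of all token occurrences found)."""
    found = [w for w in vocab if w in known_words]
    vocab_share = len(found) / len(vocab)
    text_share = sum(vocab[w] for w in found) / sum(vocab.values())
    return vocab_share, text_share

# Hypothetical token counts and embedding vocabulary
toy_vocab = {"the": 6, "cat": 2, "isn't": 1, "colour": 1}
toy_known = {"the", "cat", "is", "not", "color"}

before = coverage(toy_vocab, toy_known)

# Hypothetical correction dictionary for the unknown tokens
corrections = {"isn't": "is not", "colour": "color"}

# Rebuild the counts after replacing each unknown token by its correction
corrected_vocab = {}
for token, count in toy_vocab.items():
    for piece in corrections.get(token, token).split():
        corrected_vocab[piece] = corrected_vocab.get(piece, 0) + count

after = coverage(corrected_vocab, toy_known)
print(before, after)
```

Frequent tokens dominate the text share, which is why a model can cover most of the text while still missing a large part of the vocabulary.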

**Google News**

In [None]:
# Load model
model_google = KeyedVectors.load_word2vec_format("../input/googlenewsvectorsnegative300/GoogleNews-vectors-negative300.bin", binary=True)
model_google_vocab = model_google.vocab

In [None]:
# Check the out of the box coverage
coverage_google = check_coverage(general_vocab,model_google_vocab)

In [None]:
# Extract all non ascii characters from the model
charac_google = get_characters(model_google_vocab.keys())

In [None]:
# Check character coverage
check_coverage_characters(charac_df, charac_google)

As we noted while comparing the three word vectors, the Google News vectors do not recognize numbers and stopwords. Therefore, in the next step we not only run our preprocessing function but also remove the stopwords.
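
The `remove_stopwords` helper used in the following cell is not defined in this part of the notebook. A minimal sketch of what such a helper could look like, assuming whitespace tokenization and a case-insensitive match; the name `remove_stopwords_sketch` and the small `toy_stopwords` set (a stand-in for `stopwords.words('english')`) are hypothetical:

```python
def remove_stopwords_sketch(text, stop_words):
    """Drop every whitespace-separated token whose lowercase form is a stopword."""
    kept = [token for token in text.split() if token.lower() not in stop_words]
    return ' '.join(kept)

# Hypothetical mini stopword list
toy_stopwords = {"the", "a", "of", "and", "to"}
cleaned = remove_stopwords_sketch("The end of a long and winding road", toy_stopwords)
print(cleaned)
```

Passing the stopwords as a set keeps each membership test cheap, which matters when the function is applied to every comment.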

In [None]:
# Preprocess with google embedding
stopWords = set(stopwords.words('english'))
train_data.comment_text = train_data.comment_text.apply(lambda x: remove_stopwords(x, stopWords))
train_preprocessed_google = preprocessing_NN(train_data, model_google.vocab)

In [None]:
# Load dataset
train_data = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
# kill all other columns except comment text and target
cols_to_keep = ['comment_text','target']
train_data = train_data.drop(train_data.columns.difference(cols_to_keep), axis=1)
gc.collect()

In [None]:
del train_preprocessed_google, model_google_vocab, charac_google, model_google
gc.collect()

**FastText**

In [None]:
# Load model
model_fasttext_vocab = load_embedding_vocab('../input/fasttext-crawl-300d-2m/crawl-300d-2M.vec')

In [None]:
# Check the out of the box coverage
coverage_fasttext = check_coverage(general_vocab,model_fasttext_vocab)

In [None]:
# Extract all non ascii characters from the model
charac_fasttext = get_characters(model_fasttext_vocab.keys())

In [None]:
# Check character coverage
check_coverage_characters(charac_df, charac_fasttext)

In [None]:
# Preprocess with fasttext embedding
train_preprocessed_fasttext = preprocessing_NN(train_data, model_fasttext_vocab)

In [None]:
# Load dataset
train_data = pd.read_csv("../input/jigsaw-unintended-bias-in-toxicity-classification/train.csv")
# kill all other columns except comment text and target
cols_to_keep = ['comment_text','target']
train_data = train_data.drop(train_data.columns.difference(cols_to_keep), axis=1)
gc.collect()

In [None]:
del train_preprocessed_fasttext, model_fasttext_vocab, charac_fasttext
gc.collect()

**GloVe**

In [None]:
# Load model vocab
model_glove_vocab = load_embedding_vocab('../input/glove840b300dtxt/glove.840B.300d.txt')

In [None]:
# Check the out of the box coverage
coverage_glove = check_coverage(general_vocab,model_glove_vocab)

In [None]:
# Extract all non ascii characters from the model
charac_glove = get_characters(model_glove_vocab.keys())

In [None]:
# Check character coverage
check_coverage_characters(charac_df, charac_glove)

In [None]:
# Preprocess with glove embedding
train_preprocessed_glove = preprocessing_NN(train_data, model_glove_vocab)

In [None]:
del train_preprocessed_glove, model_glove_vocab, charac_glove
gc.collect()

**4. Comparison & Conclusion**

In general, all three word vectors require similar preprocessing. Lowercasing should not be applied for any of them. Punctuation should not be removed, as all of them include some punctuation and substantial information might get lost. Contractions should be cleaned, as many are unknown and the cleaning harmonizes the text. Translation and spell checking should also be executed for all of them to retain as much information as possible and reduce the noise. Lemmatization should not be performed, as all vectors can handle the different word forms. Unknown words are ignored by all models and therefore need to be handled separately. Regarding the replacement of emojis, FastText differs from the other two: it includes a huge number of emojis, so the replacement is not necessarily needed. For the other two the replacement should be performed. Google News differs from the other two vectors regarding stopwords: it does not include any, and therefore they need to be removed. The same applies to numbers, which Google News does not include either.
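
The number handling mentioned for Google News is implemented in `clean_numbers` by masking every multi-digit number with `#`. A compact sketch of the same idea, assuming (as in `clean_numbers`) that runs of five or more digits collapse to `#####` and single digits stay untouched; `mask_numbers` is a name made up here:

```python
import re

def mask_numbers(text):
    """Replace each run of two or more digits by one '#' per digit, capped at five '#'."""
    return re.sub(r'[0-9]{2,}', lambda m: '#' * min(len(m.group()), 5), text)

masked = mask_numbers("In 2019 about 15 of 1234567 comments scored 7")
print(masked)
```

A single substitution with a callback gives the same result as the four chained substitutions, because every digit run is handled once according to its length.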

Looking at the results obtained in this notebook, GloVe and FastText seem to be the best out-of-the-box choices, as they already cover around 98% of the text and around 50% of the vocabulary. Google News performs rather poorly, reaching only 77% text coverage and 35% vocabulary coverage. This can be explained by the aspect mentioned above: it does not cover stopwords, so a huge number of word occurrences are unknown.

While FastText and GloVe only improve slightly after the data has been preprocessed, the effect on Google News is a lot bigger, but again this can be explained by the stopwords.

FastText offers an impressive number of non-ASCII characters. It covers 65% of all non-ASCII characters that appear in the dataframe of the competition, whereas Google News and GloVe only cover between 32% and 35%.

In the end, all three vectors reach around 99% text coverage. Therefore all three are good candidates for this competition. Nevertheless, FastText and GloVe are probably the better choices, as Google News requires some additional preprocessing which might result in information getting lost.