https://www.kaggle.com/theoviel/improve-your-score-with-some-text-preprocessing

# Improve your Score with some Text Preprocessing


This kernel is an improved version of @Dieter's work.
> https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings


This is the preprocessing behind my current LB score, and it improved that score slightly. Feel free to use it as well, but please upvote if you do.

It is also how I first noticed the spelling mistakes in the dataset.

#### Any feedback is appreciated!

In [1]:
import pandas as pd
import numpy as np
import operator 
import re

## Loading data

In [2]:
train = pd.read_csv("../input/train.csv").drop('target', axis=1)
test = pd.read_csv("../input/test.csv")
df = pd.concat([train, test])

print("Number of texts: ", df.shape[0])

Number of texts:  1362492


## Loading embeddings

In [3]:
def load_embed(file):
    def get_coefs(word,*arr): 
        return word, np.asarray(arr, dtype='float32')
    
    if file == '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec':
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file) if len(o)>100)
    else:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file, encoding='latin'))
        
    return embeddings_index
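
A note on the `len(o)>100` filter: fastText `.vec` files begin with a short header line holding the vocabulary size and the vector dimension, and the length check presumably exists to skip it. Here is a minimal sketch of the same idea on an in-memory sample; `parse_vec`, `min_len` and the sample vectors are hypothetical and only serve as illustration.

```python
import io

import numpy as np

# Tiny stand-in for a .vec file: a "count dim" header, then one word per line.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")


def parse_vec(handle, min_len=10):
    """Build a word -> vector dict, skipping lines too short to be a real entry."""
    index = {}
    for line in handle:
        if len(line) <= min_len:  # the short header line is dropped here
            continue
        word, *values = line.rstrip().split(" ")
        index[word] = np.asarray(values, dtype="float32")
    return index


index = parse_vec(sample)
print(sorted(index))
```

Without the filter, the header would be stored as a bogus word `"2"` with a one-dimensional vector.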

In [4]:
glove = '../input/embeddings/glove.840B.300d/glove.840B.300d.txt'
paragram =  '../input/embeddings/paragram_300_sl999/paragram_300_sl999.txt'
wiki_news = '../input/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec'

In [5]:
print("Extracting GloVe embedding")
embed_glove = load_embed(glove)
print("Extracting Paragram embedding")
embed_paragram = load_embed(paragram)
print("Extracting FastText embedding")
embed_fasttext = load_embed(wiki_news)

Extracting GloVe embedding
Extracting Paragram embedding
Extracting FastText embedding


## Vocabulary and Coverage functions
> Again, check Dieter's work if you haven't; these functions are his.

In [11]:
def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [12]:
def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    for word in vocab.keys():
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:  # the word has no embedding
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    print('Found embeddings for {:.2%} of vocab'.format(len(known_words) / len(vocab)))
    print('Found embeddings for  {:.2%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words
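
To make the two percentages concrete, here is a toy run on three made-up questions and a hypothetical five-word embedding. It is a compact re-implementation with `collections.Counter` so that it runs on its own, not the functions above; `texts` and `toy_embed` are invented for illustration.

```python
from collections import Counter

import pandas as pd

# Toy data, invented for illustration only.
texts = pd.Series(["How are you?", "how are they", "Why are you here"])
toy_embed = {"how": 0, "are": 0, "you": 0, "they": 0, "here": 0}

# Same counting as build_vocab: whitespace split, case-sensitive.
toy_vocab = Counter(word for text in texts for word in text.split())

known = {w: c for w, c in toy_vocab.items() if w in toy_embed}
vocab_cov = len(known) / len(toy_vocab)                   # share of distinct words
text_cov = sum(known.values()) / sum(toy_vocab.values())  # share of all tokens

print(f"{vocab_cov:.2%} of vocab, {text_cov:.2%} of all text")
print(sorted(set(toy_vocab) - set(known)))  # ['How', 'Why', 'you?']
```

Five of the eight distinct tokens are known (62.50% of vocab), but they account for seven of the ten tokens (70.00% of all text). The misses already show the two problems handled later: casing and punctuation glued to words.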

## Starting point

In [13]:
vocab = build_vocab(df['question_text'])

In [14]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)
print("Paragram : ")
oov_paragram = check_coverage(vocab, embed_paragram)
print("FastText : ")
oov_fasttext = check_coverage(vocab, embed_fasttext)

Glove : 
Found embeddings for 32.77% of vocab
Found embeddings for  88.15% of all text
Paragram : 
Found embeddings for 19.37% of vocab
Found embeddings for  72.21% of all text
FastText : 
Found embeddings for 29.77% of vocab
Found embeddings for  87.66% of all text


#### Paragram seems to have significantly lower coverage.
> That's because it does not handle uppercase letters, so let us lowercase our texts:

In [15]:
df['lowered_question'] = df['question_text'].apply(lambda x: x.lower())

In [16]:
vocab_low = build_vocab(df['lowered_question'])

In [17]:
print("Glove : ")
oov_glove = check_coverage(vocab_low, embed_glove)
print("Paragram : ")
oov_paragram = check_coverage(vocab_low, embed_paragram)
print("FastText : ")
oov_fasttext = check_coverage(vocab_low, embed_fasttext)

Glove : 
Found embeddings for 27.10% of vocab
Found embeddings for  87.88% of all text
Paragram : 
Found embeddings for 31.01% of vocab
Found embeddings for  88.21% of all text
FastText : 
Found embeddings for 21.74% of vocab
Found embeddings for  87.14% of all text


#### Better, but we lost a bit of information on the other embeddings.
> There are words that are known with uppercase letters and unknown without. Let us fix that:
- `word.lower()` takes the embedding of `word` if `word.lower()` doesn't have an embedding

In [18]:
def add_lower(embedding, vocab):
    count = 0
    for word in vocab:
        if word in embedding and word.lower() not in embedding:  
            embedding[word.lower()] = embedding[word]
            count += 1
    print(f"Added {count} words to embedding")
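
As a quick illustration of that rule, here is the same logic on a toy embedding. This variant returns the count instead of printing it, and `toy_embed` / `toy_vocab` are made up for the example.

```python
def add_lower(embedding, vocab):
    """Copy a cased word's vector to its lowercase form when that form is missing."""
    count = 0
    for word in vocab:
        if word in embedding and word.lower() not in embedding:
            embedding[word.lower()] = embedding[word]
            count += 1
    return count


# Hypothetical 2-d vectors, for illustration only.
toy_embed = {"Quora": [0.1, 0.2], "india": [0.3, 0.4], "India": [0.5, 0.6]}
toy_vocab = {"Quora": 3, "India": 2, "kaggle": 1}

added = add_lower(toy_embed, toy_vocab)
print(added, toy_embed["quora"], toy_embed["india"])
```

Only `quora` is added: `india` already has its own vector and is left untouched, and `kaggle` is unknown in any casing.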

In [19]:
print("Glove : ")
add_lower(embed_glove, vocab)
print("Paragram : ")
add_lower(embed_paragram, vocab)
print("FastText : ")
add_lower(embed_fasttext, vocab)

Glove : 
Added 15199 words to embedding
Paragram : 
Added 0 words to embedding
FastText : 
Added 27908 words to embedding


In [20]:
print("Glove : ")
oov_glove = check_coverage(vocab_low, embed_glove)
print("Paragram : ")
oov_paragram = check_coverage(vocab_low, embed_paragram)
print("FastText : ")
oov_fasttext = check_coverage(vocab_low, embed_fasttext)

Glove : 
Found embeddings for 30.39% of vocab
Found embeddings for  88.19% of all text
Paragram : 
Found embeddings for 31.01% of vocab
Found embeddings for  88.21% of all text
FastText : 
Found embeddings for 27.77% of vocab
Found embeddings for  87.73% of all text


### What's wrong ?

In [21]:
oov_glove[:10]

[('india?', 17092),
 ("what's", 13977),
 ('it?', 13702),
 ('do?', 9125),
 ('life?', 8114),
 ('why?', 7674),
 ('you?', 6572),
 ('me?', 6525),
 ('them?', 6423),
 ('time?', 6021)]

#### The first problems to appear are:
- Contractions 
- Words with punctuation in them

> Let us correct that.

## Contractions

In [22]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [23]:
def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known

In [24]:
print("- Known Contractions -")
print("   Glove :")
print(known_contractions(embed_glove))
print("   Paragram :")
print(known_contractions(embed_paragram))
print("   FastText :")
print(known_contractions(embed_fasttext))

- Known Contractions -
   Glove :
["can't", "'cause", "didn't", "doesn't", "don't", "I'd", "I'll", "I'm", "I've", "i'd", "i'll", "i'm", "i've", "it's", "ma'am", "o'clock", "that's", "you'll", "you're"]
   Paragram :
["can't", "'cause", "didn't", "doesn't", "don't", "i'd", "i'll", "i'm", "i've", "it's", "ma'am", "o'clock", "that's", "you'll", "you're"]
   FastText :
[]


#### FastText does not understand contractions
> We use the map to replace them

In [25]:
def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text
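
A small check of what this function does and does not catch, using a two-entry toy mapping (`toy_mapping` is hypothetical; the function body is a condensed copy of the one above).

```python
def clean_contractions(text, mapping):
    # Normalise the apostrophe variants first so they match the mapping keys.
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    return " ".join(mapping.get(t, t) for t in text.split(" "))


toy_mapping = {"what's": "what is", "don't": "do not"}

a = clean_contractions("what’s the point if you don't try", toy_mapping)
b = clean_contractions("i really don't.", toy_mapping)
print(a)
print(b)
```

The first sentence is fully expanded, curly apostrophe included. The second is left unchanged because the token is `don't.` with the full stop attached, so it is not a key of the mapping.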

In [26]:
df['treated_question'] = df['lowered_question'].apply(lambda x: clean_contractions(x, contraction_mapping))

In [27]:
vocab = build_vocab(df['treated_question'])
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)
print("Paragram : ")
oov_paragram = check_coverage(vocab, embed_paragram)
print("FastText : ")
oov_fasttext = check_coverage(vocab, embed_fasttext)

Glove : 
Found embeddings for 30.53% of vocab
Found embeddings for  88.56% of all text
Paragram : 
Found embeddings for 31.16% of vocab
Found embeddings for  88.58% of all text
FastText : 
Found embeddings for 27.91% of vocab
Found embeddings for  88.44% of all text


## Now, let us deal with special characters

In [28]:
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€×™√²—–&'

In [29]:
def unknown_punct(embed, punct):
    unknown = ''
    for p in punct:
        if p not in embed:
            unknown += p
            unknown += ' '
    return unknown

In [30]:
print("Glove :")
print(unknown_punct(embed_glove, punct))
print("Paragram :")
print(unknown_punct(embed_paragram, punct))
print("FastText :")
print(unknown_punct(embed_fasttext, punct))

Glove :
“ ” ’ ∞ θ ÷ α • à − β ∅ ³ π ‘ ₹ ´ ° £ € × ™ √ ² — – 
Paragram :
“ ” ’ ∞ θ ÷ α • à − β ∅ ³ π ‘ ₹ ´ ° £ € × ™ √ ² — – 
FastText :
_ ` 


#### FastText seems to have a better knowledge of special characters 
> We use a map to replace unknown characters with known ones.

> We make sure there are spaces between words and punctuation


In [31]:
punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi'}

In [32]:
def clean_special_chars(text, punct, mapping):
    for p in mapping:
        text = text.replace(p, mapping[p])
    
    for p in punct:
        text = text.replace(p, f' {p} ')
    
    specials = {'\u200b': ' ', '…': ' ... ', '\ufeff': '', 'करना': '', 'है': ''}  # Other special characters, handled last
    for s in specials:
        text = text.replace(s, specials[s])
    
    return text
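
Here is a toy run of the first two steps (mapping, then padding with spaces), assuming a reduced punctuation string and mapping; `toy_punct` and `toy_mapping` are made up, and the `specials` step is left out for brevity.

```python
def clean_special_chars(text, punct, mapping):
    for p in mapping:                     # 1. replace unknown characters
        text = text.replace(p, mapping[p])
    for p in punct:                       # 2. isolate punctuation with spaces
        text = text.replace(p, f" {p} ")
    return text


toy_punct = "?,'"
toy_mapping = {"’": "'", "—": "-"}

tokens_1 = clean_special_chars("why india?", toy_punct, toy_mapping).split()
tokens_2 = clean_special_chars("it’s late, right?", toy_punct, toy_mapping).split()
print(tokens_1)
print(tokens_2)
```

`india?` becomes the two tokens `india` and `?`, which is what fixes most of the out-of-vocabulary entries seen earlier. The second example shows why contractions are expanded before this step: any apostrophe still present gets split into separate tokens (`it`, `'`, `s`).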

In [33]:
df['treated_question'] = df['treated_question'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))

In [35]:
vocab = build_vocab(df['treated_question'])
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)
print("Paragram : ")
oov_paragram = check_coverage(vocab, embed_paragram)
print("FastText : ")
oov_fasttext = check_coverage(vocab, embed_fasttext)

Glove : 
Found embeddings for 69.10% of vocab
Found embeddings for  99.58% of all text
Paragram : 
Found embeddings for 73.58% of vocab
Found embeddings for  99.63% of all text
FastText : 
Found embeddings for 60.75% of vocab
Found embeddings for  99.45% of all text


In [36]:
oov_fasttext[:100]

[('quorans', 885),
 ('bitsat', 583),
 ('kvpy', 369),
 ('comedk', 369),
 ('quoran', 325),
 ('wbjee', 246),
 ('articleship', 218),
 ('viteee', 193),
 ('fortnite', 166),
 ('upes', 164),
 ('marksheet', 151),
 ('afcat', 131),
 ('uceed', 126),
 ('dropshipping', 123),
 ('bhakts', 118),
 ('iitjee', 114),
 ('machedo', 112),
 ('upsee', 111),
 ('bnbr', 105),
 ('alshamsi', 100),
 ('chsl', 100),
 ('iitian', 99),
 ('amcat', 97),
 ('josaa', 96),
 ('unacademy', 89),
 ('zerodha', 85),
 ('qoura', 85),
 ('nmat', 80),
 ('icos', 79),
 ('jiit', 78),
 ('hairfall', 73),
 ('lnmiit', 73),
 ('metoo', 71),
 ('kavalireddi', 71),
 ('doklam', 70),
 ('muoet', 68),
 ('woocommerce', 67),
 ('nicmar', 66),
 ('vajiram', 62),
 ('srmjee', 61),
 ('modiji', 61),
 ('infjs', 60),
 ('adhaar', 60),
 ('zebpay', 58),
 ('elitmus', 58),
 ('pubg', 57),
 ('awdhesh', 55),
 ('hackerrank', 54),
 ('gixxer', 54),
 ('aiq', 53),
 ('sibm', 53),
 ('koinex', 50),
 ('golang', 50),
 ('mahadasha', 49),
 ('mhcet', 47),
 ('byju', 47),
 ('binance', 46

### What's still missing ? 
- Unknown words
- Acronyms
- Spelling mistakes

## We can manually correct the most frequent misspellings

#### For example, here are some mistakes and their frequency
- qoura : 85 times
- mastrubation : 38 times
- demonitisation : 30 times
- …

In [37]:
mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

In [38]:
def correct_spelling(x, dic):
    for word in dic.keys():
        x = x.replace(word, dic[word])
    return x
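
One caveat with `str.replace`: it matches substrings, so a key can fire inside a longer word, and keys containing capitals (such as `'Qoura'`) cannot match text that was lowercased earlier. The sketch below is one possible whole-word alternative using `re`; `correct_spelling_words` and `toy_dict` are hypothetical names, not part of the original kernel.

```python
import re


def correct_spelling_words(text, dic):
    """Replace whole words only; keys are matched case-sensitively."""
    # Longest keys first so that a longer key wins over one of its prefixes.
    keys = sorted(dic, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b")
    return pattern.sub(lambda m: dic[m.group(0)], text)


toy_dict = {"centre": "center", "qoura": "quora"}
sentence = "the centred text on qoura"

substring_result = sentence.replace("centre", "center")
word_result = correct_spelling_words(sentence, toy_dict)
print(substring_result)
print(word_result)
```

The substring version turns `centred` into `centerd`, while the whole-word version leaves it alone and still fixes `qoura`. Keys that rely on a trailing space, such as `'youtu '`, would need to be rewritten without the space for this variant.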

In [39]:
df['treated_question'] = df['treated_question'].apply(lambda x: correct_spelling(x, mispell_dict))

In [40]:
vocab = build_vocab(df['treated_question'])
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)
print("Paragram : ")
oov_paragram = check_coverage(vocab, embed_paragram)
print("FastText : ")
oov_fasttext = check_coverage(vocab, embed_fasttext)

Glove : 
Found embeddings for 69.09% of vocab
Found embeddings for  99.58% of all text
Paragram : 
Found embeddings for 73.58% of vocab
Found embeddings for  99.63% of all text
FastText : 
Found embeddings for 60.74% of vocab
Found embeddings for  99.45% of all text


### That's all for now !

#### Improvement ideas: 
> Replace acronyms with their meaning

> Replace unknown words with a more general term : 
 - ex : fortnite, pubg -> video game
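
A minimal sketch of that second idea, assuming the text has already been lowercased and tokenised with spaces as above. `general_terms` and `generalize` are hypothetical, and the mapping only contains the two examples from the list.

```python
# Hypothetical mapping from out-of-vocabulary words to a more general term.
general_terms = {"fortnite": "video game", "pubg": "video game"}


def generalize(text, mapping):
    """Swap each token found in `mapping` for its general term."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


result = generalize("is pubg better than fortnite ?", general_terms)
print(result)
```

Whether this helps a model is untested here; it only guarantees that the replaced tokens map to words the embeddings are likely to know.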
 
### *Thanks for reading!*

In [43]:
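# Note: despite its name, this frame holds every question (the target column was dropped), sorted by text
insincere_questions = df.sort_values('question_text', ascending=True)
insincere_questions[:100]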

Unnamed: 0,qid,question_text,lowered_question,treated_question
840731,a4c44963530035288b93,I want to blow things up with TNT now what?,i want to blow things up with tnt now what?,i want to blow things up with tnt now what ?
613283,781b002c109bfc37d8e6,!TRIGGER WARNING! Am I a homophobe if I refuse...,!trigger warning! am i a homophobe if i refuse...,! trigger warning ! am i a homophobe if i re...
420816,527aac2ce6f12f789fe5,"""","""",""""
1102919,d827932261abbd74fc45,""" I post a download link of a website in my we...",""" i post a download link of a website in my we...",""" i post a download link of a website in m..."
213438,29c09eb5311f71b0809d,""" I visited the theater"" or ""I enjoyed the the...",""" i visited the theater"" or ""i enjoyed the the...",""" i visited the theater "" or "" i enj..."
496717,614361b1ee8b7d82876e,""" I've been to the doctor many times now to cu...",""" i've been to the doctor many times now to cu...",""" i have been to the doctor many times now..."
646127,7e8c70c6cc622e84160e,""" Is there anybody who had drastic good / bad ...",""" is there anybody who had drastic good / bad ...",""" is there anybody who had drastic good ..."
4692,00e9f7cd3e8d60b309fb,""" So far She has published three chapters of h...",""" so far she has published three chapters of h...",""" so far she has published three chapters ..."
998382,c3a539c771f3b20256c6,""" What does the scientific mean when you dream...",""" what does the scientific mean when you dream...",""" what does the scientific mean when you d..."
387244,4bdd5ebef6ceb1b47651,""" if 25 men working 6 hrs a day, can do a work...",""" if 25 men working 6 hrs a day, can do a work...",""" if 25 men working 6 hrs a day , can do ..."


In [44]:
insincere_questions[100:]

Unnamed: 0,qid,question_text,lowered_question,treated_question
452809,58b2b29b905f87374b0d,"""Go West Young Man"", where and when it came from?","""go west young man"", where and when it came from?",""" go west young man "" , where and when ..."
841088,a4d85947da93238104c6,"""God destroys the upright and the perfect"", if...","""god destroys the upright and the perfect"", if...",""" god destroys the upright and the perfect ..."
78695,0f67b063fca55bb64c41,"""God is at once a Spirit yet can also become i...","""god is at once a spirit yet can also become i...",""" god is at once a spirit yet can also beco..."
500188,61ee3e65947056a8d6d9,"""Grand Theft Auto V"" features a newly drawn ma...","""grand theft auto v"" features a newly drawn ma...",""" grand theft auto v "" features a newly ..."
191134,255e742e6e41a0923f45,"""Guestofaguest"" does anyone subscribe to it? I...","""guestofaguest"" does anyone subscribe to it? i...",""" guestofaguest "" does anyone subscribe ..."
466553,5b5c6fd69d916fde70ab,"""Hard work beats talent when talent doesn't wo...","""hard work beats talent when talent doesn't wo...",""" hard work beats talent when talent does n..."
679550,8516e378d96649bf3beb,"""Have you been practicing any sport these year...","""have you been practicing any sport these year...",""" have you been practicing any sport these ..."
1161271,e38a862416e7cc224aa7,"""Have you ever laughed very badly in your sadn...","""have you ever laughed very badly in your sadn...",""" have you ever laughed very badly in your ..."
46151,d0d9f8d46ed51d7df2a5,"""Have you ever loved someone truly or madly or...","""have you ever loved someone truly or madly or...",""" have you ever loved someone truly or madl..."
309560,3ca315c28b23485e1989,"""He was fooling us all this time.He bought all...","""he was fooling us all this time.he bought all...",""" he was fooling us all this time . he boug..."
