# Introduction

The essays for this challenge contain many spelling mistakes and stray punctuation marks, which cause out-of-vocabulary errors for tokenizers.

In this notebook we will see that only around 80% of the word tokens are initially covered by the vocabulary of the GloVe embeddings, and then we will raise that coverage beyond 99%.

We won't use any lemmatization or stemming to achieve this. You are welcome to stem, as it may help with some words. Lemmatization probably won't help, since it depends on the word being valid in the first place.

---
## TL;DR

* Get the exported CSV file from the notebook output. Wherever a word from `raw_words` appears in your training and testing text, replace it with the matching entry in `clean_words_05`. Common mistakes are corrected this way, which lifts the count-weighted embedding coverage from about 80% to over 99%.

---
## Disclaimer

All ideas are taken from the excellent reference below by @christofhenkel and adapted to our train/test sets. It is amazing work on simple ideas to improve vocabulary coverage.

https://www.kaggle.com/christofhenkel/how-to-preprocessing-when-using-embeddings

---
---
# Warning

If you use anything like the code below, don't train on character positions, because they change after spelling corrections.
I have tried to keep the token counts the same as before, so training on token positions can potentially work.
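
To make the distinction concrete, here is a minimal sketch. The sentence and the two-entry `fixes` dictionary are made up for illustration; they are not taken from the competition data. It shows that token positions survive a spelling correction while character offsets do not.

```python
# Hypothetical two-entry correction table, for illustration only.
fixes = {"enviorment": "environment", "diffrent": "different"}

text = "the enviorment is diffrent here"
tokens = text.split()
cleaned_tokens = [fixes.get(tok, tok) for tok in tokens]
cleaned_text = " ".join(cleaned_tokens)

# Token positions are stable: "here" keeps the same token index.
same_token_count = len(tokens) == len(cleaned_tokens)
token_index_before = tokens.index("here")
token_index_after = cleaned_tokens.index("here")

# Character positions are not: "here" starts later after the corrections.
char_offset_before = text.index("here")
char_offset_after = cleaned_text.index("here")

print(same_token_count, token_index_before, token_index_after)
print(char_offset_before, char_offset_after)
```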

In [None]:
from gensim.models import KeyedVectors
import pandas as pd
import os
from tqdm.auto import tqdm
tqdm.pandas()
import numpy as np
import re
from nltk.corpus import stopwords

########################################
# if stopwords are not downloaded to your environment
# import nltk
# nltk.download('stopwords')
########################################
import gensim.downloader as api
word_vectors = api.load("glove-wiki-gigaword-100")

########################################
# if you want to save / load from local vectors
#word_vectors.save('vectors.kv')
#word_vectors = KeyedVectors.load('vectors.kv')
########################################

In [None]:
%config Completer.use_jedi = False
LOWER_CASE = True

In [None]:
df = pd.read_csv("../input/feedback-prize-2021/train.csv")
df_ss = pd.read_csv("../input/feedback-prize-2021/sample_submission.csv")

In [None]:
def read_train_file(currid = "423A1CA112E2", curr_dir = "../input/feedback-prize-2021/train"):
    with open(os.path.join(curr_dir, "{}.txt".format(currid)), "r") as f:
        filetext = f.read()
        
    return filetext

In [None]:
from collections import defaultdict

aam_misspell_dict = {'colour':'color',
                'centre':'center',
                'favourite':'favorite',
                'travelling':'traveling',
                'counselling':'counseling',
                'theatre':'theater',
                'cancelled':'canceled',
                'labour':'labor',
                'organisation':'organization',
                'wwii':'world war 2',
                'citicise':'criticize',
                "genericname": "someone",
                 "driveless" : "driverless",
                 "canidates" : "candidates",
                 "electorial" : "electoral",
                 "genericschool" : "school",
                 "polution" : "pollution",
                 "enviorment" : "environment",
                 "diffrent" : "different",
                 "benifit" : "benefit",
                 "schoolname" : "school",
                 "artical" : "article",
                 "elctoral" : "electoral",
                 "genericcity" : "city",
                 "recieves" : "receives",
                 "completly" : "completely",
                 "enviornment" : "environment",
                 "somthing" : "something",
                 "everyones" : "everyone",
                 "oppurtunity" : "opportunity",
                 "benifits" : "benefits",
                 "benificial" : "beneficial",
                 "tecnology" : "technology",
                 "paragragh" : "paragraph",
                 "differnt" : "different",
                 "reist" : "resist",
                 "probaly" : "probably",
                 "usuage" : "usage",
                 "activitys" : "activities",
                 "experince" : "experience",
                 "oppertunity" : "opportunity",
                 "collge" : "college",
                 "presedent" : "president",
                 "dosent" : "doesnt",
                 "propername" : "name",
                 "eletoral" : "electoral",
                 "diffcult" : "difficult",
                 "desicision" : "decision"
 }
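
Since the aim is to keep the number of tokens unchanged, it may be worth flagging any replacement that contains whitespace. The sketch below uses a small excerpt of the dictionary above; `multi_token_replacements` is a hypothetical helper name, not part of the original notebook.

```python
def multi_token_replacements(mapping):
    """Return the entries whose replacement splits into more than one token."""
    return {k: v for k, v in mapping.items() if len(v.split()) != 1}

# A small excerpt of the dictionary above, enough to show the check.
sample = {"colour": "color", "wwii": "world war 2", "dosent": "doesnt"}
flagged = multi_token_replacements(sample)
print(flagged)
```

In this excerpt only `wwii` is flagged, because `world war 2` turns one token into three.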

# Vocabulary Flow (Lower Case)

1. Read all text from train files and combine together.

In [None]:
txt = []
for i in tqdm( df["id"].unique() ):
    txt.append( read_train_file(i) )

## Build Initial Vocabulary

In [None]:
from collections import defaultdict

initial_vocab = defaultdict(int)

for i in tqdm(txt, total = len(txt)):
    words = i.split()
    for word in words:
        initial_vocab[word.lower()] += 1
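
As a small illustration of why this raw vocabulary is so large, the sketch below counts whitespace-separated tokens in two made-up sentences using `collections.Counter`. Punctuation stays attached to the words, so `school` and `school,` end up as separate entries.

```python
from collections import Counter

# Two made-up sentences standing in for the essays.
toy_texts = ["School is fun.", "I like school, and school likes me."]

# Same idea as the loop above: lower-case every whitespace-separated token.
toy_vocab = Counter(
    word.lower() for text in toy_texts for word in text.split()
)
print(toy_vocab)
```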

# Total Vocabulary Words

Length of the vocabulary ~ 101K words right now

In [None]:
print("Total vocabulary including stopwords is : ", len(initial_vocab))

# Pandas for all heavy lifting

* We will use pandas so that each result lands in its own column; at the end you can download the data and use whichever columns suit your own tokenizer preferences.


In [None]:
word_df = pd.DataFrame(initial_vocab.items(),
            columns = ["raw_words", "raw_words_counts"])
print("-"*80)
print("Displaying Head of the words dataframe")
display(word_df.head())
print("-"*80)
print("Displaying Tail of the words dataframe")
display(word_df.tail())
print("-"*80)

# Exclude Stopwords

* This means that stop words are not counted when we report text coverage.
* It also keeps very frequent function words from dominating the count-weighted coverage, so the numbers reflect the words we actually need to fix.

In [None]:
stops = stopwords.words("english")

word_df["is_stop_word"] = word_df["raw_words"].apply(lambda x: 0 if x not in stops else 1)

In [None]:
word_df["is_stop_word"].value_counts()

# Analysis 1:
1) After lower-casing all words, 170 stop words are detected in the train set

---
---

# Match Vocab

Let us now see how many words have valid entries in the word_vectors

* apply_coverage only checks if a word exists in glove vectors obtained earlier.

In [None]:
def apply_coverage(x):
    if x in word_vectors:
        return 1
    return 0
        
word_df["raw_in_vectors"] = word_df["raw_words"].apply(apply_coverage)

* get_coverage checks the amount of vocabulary and text coverage when comparing to glove vectors
* It creates a new column named column_word_presence to indicate if the target word exists or not.
* It also displays the coverage statistics and returns a dataframe.

    '''
        column_words : the column containing the words for which we check coverage
        column_word_counts : the column containing pre-computed word counts for the words in question
        column_word_presence : Just an output column name where we will output if word exists in word_vectors
        exc_stop : Should we exclude stopwords from coverage analysis or not
    '''
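
The two statistics can differ a lot, so here is a minimal worked example on a made-up four-word vocabulary. The `known` column stands in for membership in `word_vectors`; the column names and numbers are illustrative assumptions.

```python
import pandas as pd

# Made-up vocabulary: `known` stands in for membership in word_vectors.
toy = pd.DataFrame({
    "word": ["school", "teacher", "enviorment", "genericname"],
    "count": [90, 8, 1, 1],
    "known": [1, 1, 0, 0],
})

# Word coverage: share of distinct words that are known.
word_coverage = 100 * toy["known"].mean()
# Text coverage: share of all occurrences that are known.
text_coverage = 100 * toy.loc[toy["known"] == 1, "count"].sum() / toy["count"].sum()

print(f"word coverage: {word_coverage:.2f}%")
print(f"text coverage: {text_coverage:.2f}%")
```

Half of the distinct words are unknown here, yet they account for only 2 of 100 occurrences, which is why text coverage is the number we try to improve.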

In [None]:
def get_coverage(column_words,
                 column_word_counts,
                 column_word_presence,
                 df, exc_stop = True):
    '''
        column_words : the column containing the words for which we check coverage
        column_word_counts : the column containing pre-computed word counts for the words in question
        column_word_presence : Just an output column name where we will output if word exists in word_vectors
        exc_stop : Should we exclude stopwords from coverage analysis or not
    '''
    word_df = df.copy()
    word_df[column_word_presence] = word_df[column_words].apply(apply_coverage)
    print("-" * 80)
    
    #display(word_df[column_word_presence].value_counts(normalize = True))
    #print("-" * 80)
    # Optionally drop stop words before measuring coverage.
    if exc_stop:
        print("EXCLUDING STOP WORD FROM ANALYSIS...")
        subset = word_df[word_df["is_stop_word"] == 0]
    else:
        subset = word_df

    # Share of distinct words found in word_vectors (presence is 0 or 1).
    word_coverage = 100 * subset[column_word_presence].mean()
    # Share of occurrences found in word_vectors, weighted by the word counts.
    found_counts = subset.loc[subset[column_word_presence] == 1, column_word_counts].sum()
    text_coverage = 100 * found_counts / subset[column_word_counts].sum()

    print("Total words in {} were {} and {:.2f}% words were found in the word_vectors.".format(
        column_words, len(subset), word_coverage))
        
    print("-" * 80)
    print("From text coverage, {:.2f}% text is covered in word_vectors.".format(text_coverage))
    print("-" * 80)
    return word_df
    



In [None]:
word_df = get_coverage( "raw_words", "raw_words_counts", "raw_in_vectors", word_df)

---
---
# Analysis 2

* So around **76%** of the unique words are missing from the vocabulary
* In terms of usage frequency, about 80% of the words are covered once repetitions are taken into account

---
---

# Objective

* Our objective is to improve the **text coverage** so that the tokenizers can improve their performances


**Remember** we DON'T WANT TO INCREASE or DECREASE the number of tokens, as this would make training and building the predictionstring too difficult and cause problems on the submission dataset.

In [None]:
def preprocess(x):
    x = str(x)  # make sure we have a string before calling string methods
    x = x.replace("n't", "nt")  # keep contractions as one token: don't -> dont
    
    if LOWER_CASE:
        x = x.lower()
        
    if len(x.strip()) == 1:
        return x  # special case: a single character (e.g. a lone punctuation mark) is the whole token
    
    for punct in "/-'&":
        x = x.replace(punct, '')
    for punct in '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '“”’':
        x = x.replace(punct, '')
        
    x = re.sub('[0-9]{1,}', '#', x)  # replace every run of digits with #
    if len(x.strip()) < 1:
        # the token was all punctuation (------ or ..... or .;?!!.), so return
        # a single period to keep the token count unchanged
        x = '.'
    return x

word_df["clean_words_01"] = word_df["raw_words"].progress_apply(lambda x: preprocess(x))

In [None]:
word_df.head()

In [None]:
temp = pd.DataFrame(word_df.groupby( ["clean_words_01"] )["raw_words_counts"].sum()).reset_index()
temp.columns = ["clean_words_01", "clean_words_01_counts"]

word_df = word_df.merge(temp, on=["clean_words_01"], how = 'left')
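
The group-then-merge step above can also be written with `groupby(...).transform("sum")`, which returns one value per original row. The sketch below is an alternative formulation on a made-up dataframe, not the code the notebook runs.

```python
import pandas as pd

# Made-up rows: three raw variants collapse to the same cleaned word.
toy = pd.DataFrame({
    "raw_words": ["school", "school.", "school,", "fun."],
    "raw_words_counts": [10, 3, 2, 1],
    "clean_words_01": ["school", "school", "school", "fun"],
})

# Each row receives the total count of its cleaned form.
toy["clean_words_01_counts"] = (
    toy.groupby("clean_words_01")["raw_words_counts"].transform("sum")
)
print(toy)
```

Note that every raw variant row repeats the group total, so any later sum over this column counts a cleaned word once per raw variant.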

In [None]:
word_df = get_coverage( "clean_words_01", "clean_words_01_counts", "clean_01_in_vectors", word_df)

---
# Analysis #3

* **Amazing**, we improved the text coverage from **80**% to **99.76**%
* Let's try to do more
---

# Accented alphabet replacements

* Replace á with a, and so on.

In [None]:
# Reference: 
# https://itqna.net/questions/9818/how-remove-accented-expressions-regular-expressions-python
import re

# char codes: https://unicode-table.com/en/#basic-latin
accent_map = {
    u'\u00c0': u'A',
    u'\u00c1': u'A',
    u'\u00c2': u'A',
    u'\u00c3': u'A',
    u'\u00c4': u'A',
    u'\u00c5': u'A',
    u'\u00c6': u'A',
    u'\u00c7': u'C',
    u'\u00c8': u'E',
    u'\u00c9': u'E',
    u'\u00ca': u'E',
    u'\u00cb': u'E',
    u'\u00cc': u'I',
    u'\u00cd': u'I',
    u'\u00ce': u'I',
    u'\u00cf': u'I',
    u'\u00d0': u'D',
    u'\u00d1': u'N',
    u'\u00d2': u'O',
    u'\u00d3': u'O',
    u'\u00d4': u'O',
    u'\u00d5': u'O',
    u'\u00d6': u'O',
    u'\u00d7': u'x',
    u'\u00d8': u'O',
    u'\u00d9': u'U',
    u'\u00da': u'U',
    u'\u00db': u'U',
    u'\u00dc': u'U',
    u'\u00dd': u'Y',
    u'\u00df': u'B',
    u'\u00e0': u'a',
    u'\u00e1': u'a',
    u'\u00e2': u'a',
    u'\u00e3': u'a',
    u'\u00e4': u'a',
    u'\u00e5': u'a',
    u'\u00e6': u'a',
    u'\u00e7': u'c',
    u'\u00e8': u'e',
    u'\u00e9': u'e',
    u'\u00ea': u'e',
    u'\u00eb': u'e',
    u'\u00ec': u'i',
    u'\u00ed': u'i',
    u'\u00ee': u'i',
    u'\u00ef': u'i',
    u'\u00f1': u'n',
    u'\u00f2': u'o',
    u'\u00f3': u'o',
    u'\u00f4': u'o',
    u'\u00f5': u'o',
    u'\u00f6': u'o',
    u'\u00f8': u'o',
    u'\u00f9': u'u',
    u'\u00fa': u'u',
    u'\u00fb': u'u',
    u'\u00fc': u'u'
}

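def accent_remove (m):
    # Þ, ð and ÷ fall inside the regex range but have no entry in accent_map,
    # so fall back to the original character instead of raising KeyError.
    return accent_map.get(m.group(0), m.group(0))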

string_velha = "Olá você está ????   "
string_nova = re.sub(u'([\u00C0-\u00FC])', accent_remove, string_velha.encode().decode('utf-8'))
string_nova
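
As an alternative to a hand-written map, the standard library's `unicodedata` module can decompose accented letters and drop the combining marks. This is a different technique from the one used in this notebook, sketched here only for comparison; `strip_accents` is a hypothetical helper name.

```python
import unicodedata

def strip_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

result = strip_accents("Olá você está")
print(result)
```

Letters that have no decomposition, such as ø, ß or æ, pass through unchanged, so an explicit map is still needed if you want those replaced too.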

In [None]:
word_df["clean_words_02"] = word_df["clean_words_01"].apply( lambda x: re.sub(u'([\u00C0-\u00FC])', 
                                                  accent_remove, 
                                                  x.encode().decode('utf-8'))
                               )

In [None]:
temp = pd.DataFrame(word_df.groupby( ["clean_words_02"] )["raw_words_counts"].sum()).reset_index()
temp.columns = ["clean_words_02", "clean_words_02_counts"]

word_df = word_df.merge(temp, on=["clean_words_02"], how = 'left')

In [None]:
word_df = get_coverage( "clean_words_02", "clean_words_02_counts", "clean_02_in_vectors", word_df)

# Analysis 4

* It seems the accented-character replacements did not improve the score much. Let's look at some examples where replacements were made

In [None]:
word_df[ word_df["clean_words_01_counts"] != word_df["clean_words_02_counts"]].head(10)

---
---

# Misspellings

* At the beginning of the notebook I created a dictionary, **aam_misspell_dict**, with the common errors I could spot. You can improve upon it
* There are definitely a lot of **misspellings** at work
* There are also anonymization placeholders in play (like **genericschool**, **genericname** etc.)
* Since the essays cover only a **limited number of topics**, we can perform some basic spelling corrections on the topic-specific words
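
If you want to extend the dictionary, one possible way to collect candidates is fuzzy matching with the standard library's `difflib`. The sketch below is an assumption about how that could look, not how the dictionary above was built; `known_words` and `suggest` are hypothetical names, and any suggestion should be reviewed by hand before it goes into the dictionary.

```python
from difflib import get_close_matches

# Hypothetical list of correctly spelled topic words; in practice this could
# be any trusted vocabulary.
known_words = ["environment", "government", "electoral", "opportunity"]

def suggest(word, vocabulary, cutoff=0.8):
    """Return the closest vocabulary entry, or None if nothing is close enough."""
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for typo in ["enviorment", "electorial", "oppurtunity"]:
    print(typo, "->", suggest(typo, known_words))
```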


In [None]:
word_df["clean_words_03"] = word_df["clean_words_02"].apply(lambda x: x if x not in aam_misspell_dict else aam_misspell_dict[x])

In [None]:
temp = pd.DataFrame(word_df.groupby( ["clean_words_03"] )["raw_words_counts"].sum()).reset_index()
temp.columns = ["clean_words_03", "clean_words_03_counts"]

word_df = word_df.merge(temp, on=["clean_words_03"], how = 'left')

In [None]:
word_df = get_coverage( "clean_words_03", "clean_words_03_counts", "clean_03_in_vectors", word_df)

# Voilà

* Another improvement from **99.76%** to **99.84%**
* Can we do better?

In [None]:
word_df[word_df["clean_03_in_vectors"] == 0].sort_values(by = ["clean_words_03_counts"], ascending = False).head(20)

# Analysis #5

* **Shouldnt** is giving us some problems. Well, it shouldn't (pun intended)
* The vocabulary contains should and not separately, but I don't want to increase the number of tokens, so we will replace every shouldnt with **shant**, which has a similar meaning

In [None]:
aam_misspell_dict.update( {"shouldnt" : "shant" })

In [None]:
word_df["clean_words_04"] = word_df["clean_words_03"].apply(lambda x: x if x not in aam_misspell_dict else aam_misspell_dict[x])
temp = pd.DataFrame(word_df.groupby( ["clean_words_04"] )["raw_words_counts"].sum()).reset_index()
temp.columns = ["clean_words_04", "clean_words_04_counts"]

word_df = word_df.merge(temp, on=["clean_words_04"], how = 'left')

In [None]:
word_df = get_coverage( "clean_words_04", "clean_words_04_counts", "clean_04_in_vectors", word_df)

---
---

# Wow

* Another 0.01% improvement
* Let's see the top words still giving us issues

In [None]:
word_df[word_df["clean_04_in_vectors"] == 0].sort_values(by = ["clean_words_04_counts"], ascending = False)[:10]

---
---

# Analysis 6 - Risky Choices Ahead

* Now we can see that the most troublesome vocabulary is problem-specific rather than general
* We can replace **studentdesigned** and **teacherdesigned** with designed. This probably changes the contextual meaning, but it can also improve tokenizer performance.
* I will replace **studentname** with **myself**
* I will replace **teachername** with **teacher**
* I will replace **winnertakeall** with **winner-take-all**

In [None]:
aam_misspell_dict.update( {"teacherdesigned" : "designed",
                      "studentname" : "myself",
                      "studentdesigned" : "designed",
                      "teachername" : "teacher",
                      "winnertakeall" : "winner-take-all"})

In [None]:
word_df["clean_words_05"] = word_df["clean_words_04"].apply(lambda x: x if x not in aam_misspell_dict else aam_misspell_dict[x])
temp = pd.DataFrame(word_df.groupby( ["clean_words_05"] )["raw_words_counts"].sum()).reset_index()
temp.columns = ["clean_words_05", "clean_words_05_counts"]

word_df = word_df.merge(temp, on=["clean_words_05"], how = 'left')

In [None]:
word_df = get_coverage( "clean_words_05", "clean_words_05_counts", "clean_05_in_vectors", word_df)

---
---
# Conclusion

* After applying several transformations, we have created a dictionary which improves the text coverage from **80%** to **99.92%**
* Hopefully this can improve the prediction performance from different models.
* Do post critique/feedback

* You can use the exported dataframe to create / use as dictionary for your tokens.

In [None]:
print("Exporting the created dictionary now. ")
word_df.to_csv("cleaned_word_dict.csv")
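
A minimal sketch of how the exported file might be used follows. A tiny made-up dataframe stands in for `pd.read_csv("cleaned_word_dict.csv")`, and `clean_text` is a hypothetical helper name; only the `raw_words` and `clean_words_05` columns are assumed.

```python
import pandas as pd

# Stand-in for the exported CSV; only two of its columns are needed.
word_dict = pd.DataFrame({
    "raw_words": ["enviorment.", "diffrent", "school,"],
    "clean_words_05": ["environment", "different", "school"],
})
mapping = dict(zip(word_dict["raw_words"], word_dict["clean_words_05"]))

def clean_text(text, mapping):
    """Replace each whitespace token by its cleaned form when one is known."""
    # raw_words were lower-cased when the vocabulary was built, so look up
    # the lower-cased token and keep unknown tokens unchanged.
    # Joining with single spaces collapses any original runs of whitespace.
    return " ".join(mapping.get(tok.lower(), tok) for tok in text.split())

original = "The Enviorment. is diffrent at school, today"
cleaned = clean_text(original, mapping)
print(cleaned)
```

The lookup works token by token, so the number of whitespace tokens stays the same as long as no replacement contains a space.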

Have a nice day fellows. Happy kaggling!