# Spelling correction

Spelling correction can help in the NLP task of tweet classification in the considered example because the tweets are particularly succeptible to incorrect spellings of words, either deliberate or otherwise. 

In [None]:
# !pip install -U symspellpy
# !pip install textblob

In [7]:
import pandas as pd
import pkg_resources

from textblob import TextBlob
from symspellpy import SymSpell, Verbosity


## Dataset
The training part of the [Disaster Tweets Dataset from Kaggle](https://www.kaggle.com/competitions/nlp-getting-started/discussion/134890) is used here as it is most noisy dataset and great one to practice data preprocessing. Spelling correction is performed on the cleaned dataset created in data_preprocessing.

In [12]:
tweets_df = pd.read_csv("disaster_tweets_preprocessed.csv")
tweets_df.head()

Unnamed: 0,id,keyword,location,text,target,tweet
0,1,,,Our Deeds are the Reason of this #earthquake M...,1,deed reason earthquake may allah forgive u
1,4,,,Forest fire near La Ronge Sask. Canada,1,forest fire near la ronge sask canada
2,5,,,All residents asked to 'shelter in place' are ...,1,resident asked shelter place notified officer ...
3,6,,,"13,000 people receive #wildfires evacuation or...",1,people receive wildfire evacuation order calif...
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1,got sent photo ruby alaska smoke wildfire pour...


## Spelling correction using TextBlob

In [46]:
def correct_spelling(text):
    words = [str(TextBlob(word).correct()) for word in text.split()] 
    " ".join(words)
    return text

In [47]:
text = "typhoon satellite spy super typhoon soudelor"
correct_spelling(text)

'typhoon satellite spy super typhoon soudelor'

Sometimes the spell correction can correct the correctly spelled words such as typhoon is corrected to typhoid in above case

In [None]:
tweets_df["tweet"] = tweets_df["tweet"].apply(correct_spelling)
tweets_df["tweet"]

Time required for TextBlob to go through entire dataset is very high making it infeasible for large datasets 

## Spelling correction using pySymSpell

pySymSpell give multiple suggestions to the words. We can select the first suggested word. 

In [23]:
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = pkg_resources.resource_filename(
    "symspellpy", "frequency_dictionary_en_82_765.txt"
)
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)


<symspellpy.suggest_item.SuggestItem at 0x20a04ff86d0>

In [40]:
def correct_spelling_symspell(text):
    words = [
        sym_spell.lookup(
            word, 
            Verbosity.CLOSEST, 
            max_edit_distance=2,
            include_unknown=True
            )[0].term 
        for word in text.split()] 
    text = " ".join(words)
    return text

In [41]:
text = "typhoon satellite spy super typhoon soudelor"  
correct_spelling_symspell(text)

'typhoon satellite spy super typhoon soudelor'

In [44]:
tweets_df["tweet"] = tweets_df["tweet"].apply(correct_spelling_symspell)
tweets_df["tweet"]

0              deed reason earthquake may allah forgive a
1                   forest fire near la range sask canada
2       resident asked shelter place notified officer ...
3       people receive wildfire evacuation order calif...
4       got sent photo ruby alaska smoke wildfire pour...
                              ...                        
7608    two giant crane holding bridge collapse nearby...
7609    control wild fire california even northern par...
7610                                   etc volcano hawaii
7611    police investigating a bike collided car littl...
7612    latest home razed northern california wildfire...
Name: tweet, Length: 7613, dtype: object

it can be observed that in the first tweet the "forgive u" became "forgive a", "utc" became "etc". So may be some acronyms/abbreviations expansion in the pre-processing stage is needed.

pySymSpell is extremely fast and is language agnostic so is way better than textblob for spelling correction