### Import Libraries

In [35]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import nltk
import string
from nltk.stem import WordNetLemmatizer

### Load & Read data

In [20]:
df=pd.read_csv('https://raw.githubusercontent.com/tkeldenich/NLP_Preprocessing/main/train.csv')
df.head()

Unnamed: 0,text
0,Forest f...
1,All resi...
2,"13,000 p..."
3,Just got...
4,#RockyFi...


In [3]:
df.shape

(7603, 1)

#### Set the column width to see the entire content of the "text" column

In [21]:
pd.set_option('display.max_colwidth',9999)
df.head()

Unnamed: 0,text
0,Forest fire near La Ronge Sask. Canada
1,All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
2,"13,000 people receive #wildfires evacuation orders in California"
3,Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school
4,#RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires


### Cleaning the Data
* Once the data is loaded, it needs to be cleaned. This step is called preprocessing.
* In most NLP tasks, preprocessing consists of removing non-letter characters such as “#”, “-”, “!” and digits, as well as words that carry no meaning or are not part of the language being analyzed.
* Keep in mind, however, that for certain types of problems it can be useful to preserve some of these characters.
* For example, when deciding whether an email is spam or not, the “!” characters may be a good indicator, so we would not remove them during cleaning.
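
As a minimal sketch of the last point, the snippet below strips punctuation while keeping a chosen set of characters (here `!`). The helper name `strip_punctuation`, its `keep` parameter and the sample email are hypothetical, introduced only for illustration.

```python
import string

def strip_punctuation(text, keep=""):
    """Remove ASCII punctuation from text, except the characters listed in keep."""
    to_remove = set(string.punctuation) - set(keep)
    return "".join(ch for ch in text if ch not in to_remove)

# Hypothetical spam-like message
email = "WIN a FREE prize now!!! Click here -> #offer"

print(strip_punctuation(email))            # all punctuation removed
print(strip_punctuation(email, keep="!"))  # exclamation marks preserved
```

Keeping the set of preserved characters as a parameter makes it easy to compare models trained with and without them.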

**Initialize:**
1. **stopwords**: words that appear frequently but add little meaning to the sentence.

example: “of”, “the”, “a”

2. **Lemmatizer**: this object reduces each word to its base form (lemma), so that different inflected forms of a word are treated as one and the same word.

example: ‘neighbors’ is changed into ‘neighbor’ (a distinct word such as ‘neighborhood’ is left unchanged)

3. **words**: the set of English words from the NLTK `words` corpus, used in the cleaning function to discard tokens that are not English words.

In [6]:
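# One-time setup: nltk.download() may be needed first for 'stopwords', 'words',
# 'wordnet' and 'punkt' (or 'punkt_tab' on recent NLTK versions)
stopwords = nltk.corpus.stopwords.words('english')
words = set(nltk.corpus.words.words())
lemmatizer = WordNetLemmatizer()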

**Next, we build our preprocessing function, which successively:**
* converts the text to lowercase and removes the punctuation
* removes the numbers
* transforms each sentence into a list of tokens (a list of words)
* removes stopwords (words that add little meaning)
* lemmatizes the remaining words
* discards alphabetic tokens that are not in the English word list
* rebuilds each sentence from the remaining words

In [25]:
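def Preprocess_listofSentence(listofSentence):
    """Return a cleaned copy of each sentence in listofSentence."""
    preprocess_list = []
    for sentence in listofSentence:
        # Lowercase every character and drop punctuation
        sentence_no_punct = "".join(i.lower() for i in sentence if i not in string.punctuation)
        # Drop digits
        sentence_no_num = "".join(i for i in sentence_no_punct if not i.isdigit())
        # Split the sentence into tokens
        tokenize_sentence = nltk.tokenize.word_tokenize(sentence_no_num)
        # Remove stopwords
        words_no_stopwords = [i for i in tokenize_sentence if i not in stopwords]
        # Lemmatize the remaining tokens (lazily, via a generator)
        words_lemmatize = (lemmatizer.lemmatize(w) for w in words_no_stopwords)
        # Keep tokens found in the English word list, plus any non-alphabetic tokens
        sentence_clean = ' '.join(w for w in words_lemmatize if w.lower() in words or not w.isalpha())
        preprocess_list.append(sentence_clean)
    return preprocess_list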
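
To make the first steps of the function easier to follow, here is a rough trace on a single sentence from the dataset. It is only a sketch under stated assumptions: `demo_stopwords` is a tiny hand-made stand-in for NLTK's English stopword list, `str.split()` stands in for `word_tokenize`, and the lemmatization and English-word filtering steps are left out because they need the NLTK corpora.

```python
import string

# Tiny stand-in stopword list (an assumption; the notebook uses NLTK's English list)
demo_stopwords = {"in", "the", "a", "of", "to"}

sentence = "13,000 people receive #wildfires evacuation orders in California"

# 1. Lowercase and drop punctuation
no_punct = "".join(ch.lower() for ch in sentence if ch not in string.punctuation)
# 2. Drop digits
no_digits = "".join(ch for ch in no_punct if not ch.isdigit())
# 3. Tokenize (str.split() is a simple stand-in for nltk's word_tokenize)
tokens = no_digits.split()
# 4. Remove stopwords
kept = [t for t in tokens if t not in demo_stopwords]

print(no_punct)  # 13000 people receive wildfires evacuation orders in california
print(tokens)    # ['people', 'receive', 'wildfires', 'evacuation', 'orders', 'in', 'california']
print(kept)      # ['people', 'receive', 'wildfires', 'evacuation', 'orders', 'california']
```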

#### Apply the above function to the "text" column of the dataset

In [26]:
preprocess_list = Preprocess_listofSentence(df['text'])
preprocess_list

['forest fire near la canada',
 'resident shelter place notified officer evacuation shelter place order',
 'people receive wildfire evacuation order',
 'got sent photo ruby smoke wildfire school',
 'update closed direction due lake county fire wildfire',
 'flood disaster heavy rain cause flash flooding street colorado spring area',
 'top hill see fire wood',
 'there emergency evacuation happening building across street',
 'afraid tornado coming area',
 'three people heat wave far',
 'south getting flooded hah wait second live south gon na gon na flooding',
 'flooding day lost count',
 'flood bago bago',
 'damage school bus car crash breaking',
 'whats man',
 'love fruit',
 'summer lovely',
 'car fast',
 '',
 'ridiculous',
 'cool',
 'love skiing',
 'wonderful day',
 '',
 'cant eat',
 'last week',
 'love',
 '',
 'like',
 'end',
 'wholesale market ablaze',
 'always try bring heavy metal',
 'breaking flag set ablaze aba',
 'cry set ablaze',
 'plus side look sky last night ablaze',
 'built 

In [30]:
# Check whether the function works as expected
print('Original sentence : '+df['text'][4])
print('Cleaned sentence : '+preprocess_list[4])

Original sentence : #RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
Cleaned sentence : update closed direction due lake county fire wildfire
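
Some entries in the cleaned list printed earlier are empty strings, which happens when every token of a sentence is filtered out. A minimal sketch for locating and dropping them is shown below; `sample_cleaned` is a small hypothetical sample standing in for the full `preprocess_list`.

```python
# Hypothetical sample standing in for the notebook's preprocess_list
sample_cleaned = ["car fast", "", "ridiculous", "cool", ""]

# Positions of sentences that became empty after cleaning
empty_positions = [i for i, s in enumerate(sample_cleaned) if not s.strip()]

# Keep only the non-empty sentences
non_empty = [s for s in sample_cleaned if s.strip()]

print(empty_positions)  # [1, 4]
print(non_empty)        # ['car fast', 'ridiculous', 'cool']
```

If labels accompany the sentences, the rows at the same positions would need to be dropped from them as well.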


### Encoding Text
* Encoding is an essential step in machine learning.
* It transforms the text data into numbers that the machine can process.
* There are different types of encoding, including:
    1. One-Hot Encoding
    2. Label Encoding
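
Label Encoding is not used in the rest of this notebook, so here is a small sketch of one way to read it for text: each distinct word is mapped to an integer id. The two sentences come from the cleaned output above; the names `vocabulary`, `word_to_id` and `encoded` are illustrative only.

```python
# Two cleaned sentences taken from the output above
sentences = ["forest fire near la canada", "top hill see fire wood"]

# Sorted unique words, so that the ids are reproducible
vocabulary = sorted({word for s in sentences for word in s.split()})

# Label encoding: each word maps to one integer id
word_to_id = {word: idx for idx, word in enumerate(vocabulary)}

# Each sentence becomes a list of ids, in the original word order
encoded = [[word_to_id[w] for w in s.split()] for s in sentences]

print(word_to_id)
print(encoded)  # [[2, 1, 5, 4, 0], [7, 3, 6, 1, 8]]
```

Note that the integer ids carry no meaning of their own: the fact that one id is larger than another says nothing about the words.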

### One-Hot Encoding
* One-Hot Encoding consists of creating a dictionary containing every word that appears in our cleaned sentences.
* This dictionary is in fact a table where each column represents a word and each row represents a sentence.
* If a given word appears in a given sentence, we put a 1 in the corresponding cell of the table; otherwise we put a 0.
* We thus obtain an array composed only of 0s and 1s.
* A major disadvantage of One-Hot Encoding is that we lose the order of the words. We therefore lose the context and part of the meaning of the sentence, which in theory should degrade the results of our model.
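
The table described above can be sketched in a few lines of plain Python. The three toy sentences are made up for illustration; the first two contain the same words in a different order, which shows how word order is lost.

```python
# Toy sentences (hypothetical); the first two differ only in word order
sentences = ["fire near forest", "forest near fire", "flood near school"]

# One column per distinct word
vocabulary = sorted({w for s in sentences for w in s.split()})

# One row per sentence: 1 if the word is present, 0 otherwise
table = [[1 if word in s.split() else 0 for word in vocabulary] for s in sentences]

print(vocabulary)  # ['fire', 'flood', 'forest', 'near', 'school']
for row in table:
    print(row)
# [1, 0, 1, 1, 0]
# [1, 0, 1, 1, 0]
# [0, 1, 0, 1, 1]
```

The first two rows are identical even though the sentences are not, so a model that only sees this table cannot tell them apart.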

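**To perform One-Hot Encoding in Python, we build the dictionary with the `CountVectorizer` class of the scikit-learn library, then call its `fit_transform()` method on our preprocessed data.** Note that `CountVectorizer` stores word counts by default, so a word repeated within a sentence yields a value greater than 1; passing `binary=True` gives a strict 0/1 encoding.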

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(preprocess_list)
X.toarray()[0]

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)
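
By default, `CountVectorizer` records how many times each word occurs rather than a strict 0/1 flag (its `binary=True` option switches to flags). The difference can be illustrated without scikit-learn, using `collections.Counter` on the second cleaned sentence from the output above; `count_row` and `binary_row` are illustrative names.

```python
from collections import Counter

# Second cleaned sentence from the output above: 'shelter' and 'place' appear twice
sentence = "resident shelter place notified officer evacuation shelter place order"

counts = Counter(sentence.split())
vocabulary = sorted(counts)

# Count encoding: number of occurrences of each word
count_row = [counts[w] for w in vocabulary]
# Binary (one-hot style) encoding: presence or absence only
binary_row = [1 if counts[w] > 0 else 0 for w in vocabulary]

print(vocabulary)  # ['evacuation', 'notified', 'officer', 'order', 'place', 'resident', 'shelter']
print(count_row)   # [1, 1, 1, 1, 2, 1, 2]
print(binary_row)  # [1, 1, 1, 1, 1, 1, 1]
```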

##### See which features or columns we have

In [36]:
vectorizer.get_feature_names_out().tolist()  # get_feature_names() was removed in scikit-learn 1.2

['aa',
 'aal',
 'aba',
 'abandon',
 'abandoned',
 'ability',
 'abject',
 'ablaze',
 'able',
 'aboard',
 'abomination',
 'abortion',
 'abouts',
 'absence',
 'absolute',
 'absolutely',
 'abstract',
 'absurd',
 'absurdly',
 'abuse',
 'accept',
 'access',
 'accident',
 'accidentally',
 'accidently',
 'accidents',
 'according',
 'accordingly',
 'account',
 'accountable',
 'accuracy',
 'accused',
 'accustomed',
 'ace',
 'achieve',
 'achievement',
 'aching',
 'acid',
 'acids',
 'acne',
 'acoustic',
 'acquiesce',
 'acquire',
 'acquired',
 'acquisition',
 'acre',
 'acronym',
 'across',
 'acrylic',
 'act',
 'actin',
 'acting',
 'action',
 'activate',
 'active',
 'actively',
 'activist',
 'activity',
 'actor',
 'actress',
 'actual',
 'actually',
 'acute',
 'ad',
 'adaptation',
 'add',
 'added',
 'addict',
 'addiction',
 'addition',
 'address',
 'adjust',
 'adjustable',
 'adjuster',
 'administration',
 'administrative',
 'admit',
 'adopt',
 'adoption',
 'adoptive',
 'adorable',
 'adult',
 'advance