## Data Preprocessing

Objectives: 

This section applies the following preprocessing tasks to a simple example:

1. Convert everything to lowercase

2. Remove HTML tags

3. Remove URLs

4. Expand contractions using a mapping

5. Remove possessive 's

6. Remove any text inside parentheses ( )

7. Eliminate punctuation and special characters

8. Remove stopwords

9. Remove short words

## Required packages (install them if you have not already)

        pip install bs4
        pip install lxml
        pip install nltk

In [15]:
# import library
import re

In [16]:
# Here is the dictionary that we will use for expanding the contractions:
    
contraction_mapping = {"ain't": "is not", 
                       "aren't": "are not",
                       "can't": "cannot", 
                       "'cause": "because", 
                       "could've": "could have", "couldn't": "could not",
                           "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",
                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",
                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",
                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",
                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",
                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",
                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",
                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",
                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",
                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",
                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",
                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",
                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",
                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",
                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",
                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",
                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",
                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",
                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",
                           "you're": "you are", "you've": "you have"}

In [17]:
text = """ I'm abcd. MY FAVORITE SUBJECT is STEM (Science Tech Eng Math). Please come to vist the website https://www.ai-camp.org. Some HTML ... <img src="subdirectory/MyImage.png" width=60 height=60 />. I make $234234 every year!!!!"""
print(text)

 I'm abcd. MY FAVORITE SUBJECT is STEM (Science Tech Eng Math). Please come to vist the website https://www.ai-camp.org. Some HTML ... <img src="subdirectory/MyImage.png" width=60 height=60 />. I make $234234 every year!!!!


In [18]:
# 1. Convert everything to lowercase
text = text.lower()
print(text)

 i'm abcd. my favorite subject is stem (science tech eng math). please come to vist the website https://www.ai-camp.org. some html ... <img src="subdirectory/myimage.png" width=60 height=60 />. i make $234234 every year!!!!


In [19]:
# 2. Remove HTML tags
from bs4 import BeautifulSoup
text = BeautifulSoup(text, "lxml").text
print(text)

i'm abcd. my favorite subject is stem (science tech eng math). please come to vist the website https://www.ai-camp.org. some html ... . i make $234234 every year!!!!


In [20]:
# 3. Remove URLs (the raw string avoids an invalid-escape warning for \S)
text = re.sub(r'https?://\S+|www\.\S+', '', text)
print(text)

i'm abcd. my favorite subject is stem (science tech eng math). please come to vist the website  some html ... . i make $234234 every year!!!!
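
The pattern above treats every non-whitespace character after the scheme as part of the URL, so the period that ends the sentence is removed along with the address. A minimal sketch of one possible variant that stops before trailing punctuation follows; the name `strip_urls` and the chosen punctuation set are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical pattern: a URL must end in a character that is not
# whitespace and not one of the listed punctuation marks.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S*[^\s.,;:!?)\]]")

def strip_urls(text):
    """Remove URLs while keeping punctuation that follows them."""
    return URL_PATTERN.sub("", text)

sample = "please come to vist the website https://www.ai-camp.org. some html"
print(strip_urls(sample))
```

Whether the trailing period should survive depends on the later steps; here it is dropped by the punctuation cleanup anyway, so the simpler pattern is sufficient for this example.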


In [21]:
# 4. Contraction mapping: expand each space-separated token found in the dictionary
text = ' '.join(contraction_mapping.get(t, t) for t in text.split(" "))
print(text)

i am abcd. my favorite subject is stem (science tech eng math). please come to vist the website  some html ... . i make $234234 every year!!!!
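
Splitting on spaces only expands a contraction when it stands alone as a token, so a form such as `don't,` or `i'm.` would be left unchanged. A possible alternative, sketched here with a regular expression built from the dictionary keys, is shown below. The helper name `expand_contractions` and the small dictionary `small_map` are assumptions made to keep the example self-contained; in the notebook the full `contraction_mapping` would be passed instead.

```python
import re

# Small stand-in for the full contraction_mapping dictionary defined earlier.
small_map = {"i'm": "i am", "don't": "do not", "it's": "it is"}

def expand_contractions(text, mapping):
    """Replace every contraction found in `mapping`, even next to punctuation."""
    # Try longer keys first so a long form is preferred over its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], text)

print(expand_contractions("i'm sure it's fine, don't worry.", small_map))
```

The lookarounds `(?<!\w)` and `(?!\w)` keep the pattern from matching inside a longer word.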


In [22]:
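# 5. Remove non-word characters and extra spaces
# Note: order matters. '\W' is replaced first, so parentheses, quotes and
# apostrophes are already gone when the later patterns run (they match nothing here).
text = re.sub(r'\W', ' ', text)           # non-word characters -> space
text = re.sub(r'\n', '', text)            # newlines
text = re.sub(r' +', ' ', text)           # collapse runs of spaces
text = re.sub(r'^ ', '', text)            # leading space
text = re.sub(r' $', '', text)            # trailing space
text = re.sub(r'\([^)]*\)', '', text)     # text inside parentheses
text = re.sub(r'"', '', text)             # double quotes
text = re.sub(r"'s\b", "", text)          # possessive 's
text = re.sub(r"[^a-zA-Z]", " ", text)    # anything that is not a letter -> space
text = re.sub(r'm{2,}', 'mm', text)       # cap runs of 'm' at two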
print(text)

i am abcd my favorite subject is stem science tech eng math please come to vist the website some html i make        every year
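
The output above still contains `science tech eng math` and a run of spaces where the number used to be. One possible reordering is sketched below, under the assumption that parenthesised text and possessives should actually be removed as the objectives state. The function name `clean_text` is hypothetical, and the sample string is a shortened version of the running example.

```python
import re

def clean_text(text):
    """Apply the cleanup substitutions with structure-dependent patterns first."""
    text = re.sub(r"\([^)]*\)", " ", text)     # drop text inside parentheses
    text = re.sub(r"'s\b", "", text)           # drop possessive 's
    text = re.sub(r"[^a-zA-Z]", " ", text)     # keep letters only
    text = re.sub(r"m{2,}", "mm", text)        # cap runs of 'm' at two
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace last

sample = "my favorite subject is stem (science tech eng math). i make $234234 every year!!!!"
print(clean_text(sample))
```

Collapsing whitespace as the final step means that no gaps are left behind by the earlier substitutions.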


In [23]:
# 6. Remove stopwords using NLTK
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english')) 


[nltk_data] Downloading package stopwords to /projects/ae15c660-30de-4
[nltk_data]     74e-abca-5963358c9eb9/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [24]:
print(stop_words)

{'himself', 'any', "didn't", 't', 'by', 'y', 'same', 'your', 'between', 'very', 'these', 'can', 'has', "don't", 'll', 'off', 'you', 'before', 'if', 'was', 'how', 'her', 'not', "you've", 'ours', 'in', 'each', "doesn't", 'ma', 'our', 'a', "she's", 'that', 'will', "shan't", "should've", "won't", 'mightn', 'but', 'no', "hasn't", 'some', 'during', 'from', 'too', 'theirs', 'until', 'themselves', 'out', 'further', 'being', 'needn', 'does', 'wouldn', 'they', 'myself', 'is', 'while', 'this', "isn't", "shouldn't", 'haven', 'such', 'didn', 'now', 'm', 'whom', "you're", "mightn't", 'have', "haven't", 's', 'for', 'i', 'were', 'hadn', 'ourselves', 'its', 'yours', 'under', 'she', 'hers', 'at', 'ain', 'because', 'been', 'doesn', 'aren', 'which', 'hasn', 'just', "weren't", 'below', 'both', 'it', 'after', 'we', 'herself', 'few', 'do', 'don', 'more', 'd', 'yourself', 'won', 'be', "wouldn't", 'had', 'other', 'an', 'on', 'he', 'up', "hadn't", 'having', 'wasn', "wasn't", 'own', 'me', 'them', "couldn't", 'th

In [25]:
remove_stopwords = True  # set to False to keep stop words
if remove_stopwords:
    tokens = [w for w in text.split() if w not in stop_words]
else:
    tokens = text.split()
    
print(tokens)

['abcd', 'favorite', 'subject', 'stem', 'science', 'tech', 'eng', 'math', 'please', 'come', 'vist', 'website', 'html', 'make', 'every', 'year']


In [26]:
# 7. Remove short words (here: single-character tokens)
long_words = []
for i in tokens:
    if len(i) > 1:  # keep tokens with at least two characters
        long_words.append(i)

print(long_words)

['abcd', 'favorite', 'subject', 'stem', 'science', 'tech', 'eng', 'math', 'please', 'come', 'vist', 'website', 'html', 'make', 'every', 'year']


In [27]:
cleaned_text = (" ".join(long_words)).strip()
print(cleaned_text)

abcd favorite subject stem science tech eng math please come vist website html make every year
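
For reuse on more than one string, the steps of this section could be gathered into a single function. The sketch below is one way to do that using only the standard library. It makes several assumptions: the name `preprocess` and its parameters are hypothetical, a simple tag regex stands in for BeautifulSoup, and the tiny `demo_contractions` and `demo_stop_words` stand in for `contraction_mapping` and the NLTK stop word set.

```python
import re

def preprocess(text, contractions, stop_words, min_len=2):
    """Run the steps from this section in order and return the cleaned string."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)                # rough stand-in for BeautifulSoup
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = " ".join(contractions.get(t, t) for t in text.split())
    text = re.sub(r"\([^)]*\)", " ", text)              # text inside parentheses
    text = re.sub(r"'s\b", "", text)                    # possessive 's
    text = re.sub(r"[^a-z]", " ", text)                 # letters only
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(w for w in tokens if len(w) >= min_len)

# Illustrative stand-ins; the notebook would pass contraction_mapping and
# the NLTK stop word set instead.
demo_contractions = {"i'm": "i am"}
demo_stop_words = {"i", "am", "my", "is", "to", "the"}

print(preprocess("I'm abcd. MY FAVORITE SUBJECT is STEM (Science Tech Eng Math).",
                 demo_contractions, demo_stop_words))
```

Unlike the cells above, this version removes parenthesised text before stripping punctuation, so its result on the full example would differ from the printed output.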
