# Data preprocessing for NLP

When working with text data there is some preprocessing that needs to be done before doing any modeling. The aim of this preprocessing is to remove as much noise as possible, while keeping as much information as possible. 

In the previous section, we observed two things already:

1. There are lots of `&amp`, `&quot` and similar symbols in the tweets. These are escaped HTML markers that don't provide much information or meaning, and could be removed for the purpose of sentiment analysis.
2. By far the most frequent words are of the kind `I, the, a, you, ...`. These words don't carry any sentiment and flood the tweets. If we keep them, they will make it much harder for any algorithm to learn. These words are called [stop words](https://en.wikipedia.org/wiki/Stop_word), and removing them is very common in NLP tasks.

We will do more processing steps, but let's start with these two.

In this directory you will find a file called `twitter.py`. This file contains the definition of a class, `TextProcessor`, that we will be filling out. One of the first functions that we will implement will take care of these HTML markers in the tweets.

In [1]:
# Don't modify this cell. It makes any change to the twitter.py file load immediately in this notebook

%load_ext autoreload
%autoreload 2

from twitter import TextProcessor

In [2]:
# Keep all your other imports in this cell


In [3]:
# 1. Load the train_data.csv file


In [4]:
# 2. Print the text of the first 50 rows; there are many interesting examples to see


Besides what we already observed, there are other things that catch our eye. For example:

* Mentions (e.g. `@106andpark`) are meaningless and can be removed.
* Punctuation signs (comma, period, etc.) don't carry much information. While exclamation or question marks could carry some meaning, we can safely remove them, since the sentiment weight is carried by meaningful words.
* Some tweets contain emoticons like `:)` or `(:`. We could preserve them, since those _do_ mean something.

**NOTE:** Some of the following questions make use of Python's [re](https://docs.python.org/3/library/re.html) module for regular expressions. If you have not worked with regular expressions before, these questions may be a bit tricky. Don't hesitate to ask for help in the classroom :)
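If regular expressions are new to you, here is a minimal sketch of the kind of substitution these questions rely on. It uses `re.sub` to drop mentions. The pattern `@\w+` and the standalone function name `strip_mentions` are illustrative assumptions, not the code in `twitter.py`.

```python
import re

# Assumption: a mention is "@" followed by letters, digits or underscores.
MENTION_RE = re.compile(r'@\w+')

def strip_mentions(text):
    """Remove @mentions and trim the spaces left at both ends."""
    return MENTION_RE.sub('', text).strip()

print(strip_mentions('@you this is @amention in a tweet'))
```

Note that removing a mention from the middle of a tweet leaves a double space behind, because `.strip()` only trims the two ends of the string.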

In [5]:
# 3. Modify the function remove_mentions in the TextProcessor class so that it removes mentions from a tweet. Run this cell
# when you are done to check your solution. The cell should run without errors

tp = TextProcessor()
text = '@you this is @amention in a tweet'
res = tp.remove_mentions(text)

# Use the .strip() method to remove extra spaces created when removing mentions at the beginning or the end
assert res == 'this is  in a tweet', print(f'Your result: {res}')

Before removing punctuation signs, let's unescape the HTML characters in the tweets. These are the `&amp` or `&quot` that we have seen before, and unescaping them means transforming them into the characters they represent, in these cases `&` and `"`. Don't worry, [this](https://docs.python.org/3/library/html.html) Python module can do the work for you.
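As a minimal sketch of what unescaping does, the standard-library function `html.unescape` converts entities back into the characters they stand for. The sample string is made up for illustration.

```python
import html

raw = 'Tom &amp; Jerry said &quot;hi&quot;'

# html.unescape replaces each entity with the character it represents.
clean = html.unescape(raw)
print(clean)  # Tom & Jerry said "hi"
```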

In [6]:
# 4. Modify the function unescape_html in the TextProcessor class so that it unescapes HTML entities like &amp or &quot.
# Run this cell when you are done to check your solution. The cell should run without errors

tp = TextProcessor()
text = '&amp this contains html &quot;markup&quot'
res = tp.unescape_html(text)

assert res == '& this contains html "markup"', print(f'Your result: {res}')

Now that we don't have mentions or escaped HTML characters, we can remove all other punctuation signs. But let's also try to keep the emoticons, as we think they carry sentiment information, right? :)
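One possible approach, sketched under some assumptions: pull the emoticons out first, strip the remaining punctuation with `str.translate`, and append the emoticons at the end. The emoticon pattern below is a rough heuristic that covers only a handful of faces, and the standalone function `strip_punctuation` is hypothetical, not the method in `twitter.py`.

```python
import re
import string

# Assumption: a small heuristic pattern for faces such as :) ;-) :D and (:
EMOTICON_RE = re.compile(r"[:;=][-']?[()DPp]|[()][-']?[:;=]")

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text, keep_emoticons=True):
    """Remove punctuation; optionally append the emoticons found at the end."""
    emoticons = EMOTICON_RE.findall(text)
    text = EMOTICON_RE.sub(' ', text)        # take the emoticons out first
    text = text.translate(PUNCTUATION_TABLE)
    text = ' '.join(text.split())            # collapse repeated spaces
    kept = ' '.join(emoticons) if keep_emoticons else ''
    return f'{text} {kept}'

print(strip_punctuation('This tweet is great! :) right?'))
```

With `keep_emoticons=False` this sketch leaves a trailing space, which mirrors the expected result in the exercise cell.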

In [7]:
# 5. Modify the remove_punctuation function in the TextProcessor class so that it removes any punctuation sign. If the parameter keep_emoticons is True,
# save the emoticons and append them at the end of the resulting string

tp = TextProcessor()
text = 'This tweet is great! :) right?'
res = tp.remove_punctuation(text)

assert res == 'This tweet is great right :)', print(f'Your result: {res}')

res = tp.remove_punctuation(text, keep_emoticons=False)

assert res == 'This tweet is great right ', print(f'Your result: {res}')

Great! What else can we do to reduce noise? We could, for example, lowercase all the words. This way the same word, capitalised or not, will be counted only once.
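To illustrate the effect, this small sketch counts distinct words before and after lowercasing. The sample tweets and the helper `vocabulary` are made up for the example.

```python
tweets = ['Great day', 'great coffee', 'GREAT mood']

def vocabulary(texts):
    """Return the set of distinct whitespace-separated tokens."""
    return {word for text in texts for word in text.split()}

before = vocabulary(tweets)
after = vocabulary(t.lower() for t in tweets)

# 'Great', 'great' and 'GREAT' collapse into a single entry.
print(len(before), len(after))  # 6 4
```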

In [8]:
# 6. Modify the function lower in the TextProcessor class so that it transforms the tweet to lowercase. Run this cell when 
# you are done to check your solution. The cell should run without errors
tp = TextProcessor()
text = 'This is UpperCased'
res = tp.lower(text)

assert res == "this is uppercased", print(f'Your result: {res}')

Nicely done! Let's move on to removing stop words. Of course, the list of stop words is long, and it has already been compiled for us. Python has a great library for NLP called [nltk](https://www.nltk.org/). If you haven't yet, install this library with `pip install nltk`. The class `TextProcessor` already initialises a list of stop words for you to use. They are saved as a class attribute called `stop_words`.
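As a rough sketch of the idea, filtering stop words amounts to splitting the text and keeping only the tokens that are not in the list. The tiny set below is a hand-written stand-in so that the example runs without NLTK, and `drop_stop_words` is a hypothetical helper, not the method in `twitter.py`.

```python
# Assumption: a tiny stand-in for the real list. With NLTK installed and the
# corpus fetched via nltk.download('stopwords'), the full English list is
# available as nltk.corpus.stopwords.words('english').
SAMPLE_STOP_WORDS = {'i', 'am', 'this', 'so', 'the', 'a', 'you'}

def drop_stop_words(text, stop_words):
    """Keep only the whitespace-separated tokens that are not stop words."""
    return ' '.join(word for word in text.split() if word not in stop_words)

text = 'I am really loving this course. I think I am learning so much already!'
print(drop_stop_words(text, SAMPLE_STOP_WORDS))
```

The comparison is case-sensitive and the list is lowercase, so the capitalised `I` survives here. Lowercasing the text first would remove it as well.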

In [9]:
# 7. Using the list of stop words provided by nltk, modify the function remove_stopwords in the TextProcessor class
# so that it removes any stop word from the given text. Run this cell when you are done to check your solution. 
# The cell should run without errors
tp = TextProcessor()
text = 'I am really loving this course. I think I am learning so much already!'
res = tp.remove_stopwords(text)

assert res == "I really loving course. I think I learning much already!", print(f'Your result: {res}')

And we're almost done with the preprocessing! One last thing that is commonly done is to reduce the words to their root. This is called stemming, and the purpose is to reduce the vocabulary by only using the root form of a word. For example, `loving` and `love` would both become `love`.

Again, `nltk` provides us with such a stemmer, called the [Porter Stemmer](https://www.nltk.org/api/nltk.stem.html#module-nltk.stem.porter). Use this module to complete the following question:

**NOTE:** There are more stemming algorithms provided by NLTK; feel free to experiment with them!

In [10]:
# 8. Modify the function stem in the TextProcessor class to convert all the words in the tweet to their root. Run 
# this cell when you are done to check your solution. The cell should run without errors

tp = TextProcessor()
text = 'I am really loving this course. I think I am learning so much already!'
res = tp.stem(text)

assert res == "I am realli love thi course. I think I am learn so much already!", print(f'Your result: {res}')

I know some words look strange and are not even real words anymore (e.g. `thi`), but using only the root of the words reduces the vocabulary drastically, which makes the learning process easier for any algorithm.
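To make the vocabulary reduction concrete, here is a toy sketch. It deliberately uses a naive suffix stripper instead of the Porter algorithm so that it runs without NLTK. Both `naive_stem` and `stem_text` are hypothetical helpers, and their output will differ from what the Porter stemmer produces.

```python
def naive_stem(word):
    """Toy stemmer: strip one common suffix. NOT the Porter algorithm."""
    for suffix in ('ing', 'ed', 's'):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def stem_text(text, stem_word):
    """Apply stem_word to every whitespace-separated token."""
    return ' '.join(stem_word(word) for word in text.split())

text = 'learn learns learned learning'
stemmed = stem_text(text, naive_stem)
print(stemmed)
print(len(set(text.split())), '->', len(set(stemmed.split())))  # 4 -> 1
```

Because `stem_text` accepts any callable, a real stemmer such as `nltk.stem.PorterStemmer().stem` could be passed in place of `naive_stem` when NLTK is installed.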

We are done! If you have successfully completed the previous tasks, the following cell should return a dataframe with all the tweets preprocessed:

In [None]:
tp = TextProcessor()
df = tp.preprocess(df)

df.head()

With this transformed dataframe, let's ask some of the questions we asked in the previous part of the project:

In [7]:
# 9. Before processing, the number of unique words was 148857. How many are there now?


As you can see, stemming and removing noise have drastically reduced the vocabulary to work with. That's great!
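For questions like these, `collections.Counter` from the standard library is one convenient option. The sketch below counts words across any iterable of strings. The sample data is made up, and `word_counts` is a hypothetical helper.

```python
from collections import Counter

def word_counts(texts):
    """Count whitespace-separated tokens across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

sample = ['love thi cours', 'realli love learn', 'learn much']  # made-up data
counts = word_counts(sample)

print(len(counts))            # number of unique words
print(counts.most_common(2))  # the two most frequent words with their counts
```

The same function should work on a dataframe column of tweets, since iterating over a column yields its strings one by one.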

In [8]:
# 10. What are the 50 most common words now? What is their frequency distribution?
