# Text Processing Exercise 

In this exercise, you will learn some building blocks of text processing: how to normalize, tokenize, stem, and lemmatize tweets from Twitter.

### Fetch Data from the online resource

First, we will use the `get_tweets()` function from the `exercise_helper` module to get all the tweets from the Twitter page https://twitter.com/AIForTrading1. This Twitter account was created especially for this course. The page contains 28 tweets, and our goal is to get all 28 of them. The `get_tweets()` function uses the `requests` library and BeautifulSoup to get the tweets from the page. In a later lesson we will learn how to use the `requests` library and BeautifulSoup to get data from websites. For now, we will just use this function to help us get the tweets we want.

In [None]:
import exercise_helper

all_tweets = exercise_helper.get_tweets()

print(all_tweets)

### Normalization
Text normalization is the process of transforming text into a single canonical form.

There are many normalization techniques; in this exercise we focus on two. First, we'll convert the text to lowercase, and second, we'll remove all punctuation characters from the text.

#### TODO: Part 1

Convert text to lowercase.

Use the Python built-in string method `.lower()` to convert each tweet in `all_tweets` to lowercase.
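
As a quick illustration on made-up data (the `sample_tweets` list below is hypothetical and stands in for `all_tweets`), `.lower()` returns a new lowercase string and leaves the original unchanged:

```python
# Hypothetical sample data, not the real tweets from the course account.
sample_tweets = ["Stocks RALLY after Earnings!", "The FED holds rates"]

# str.lower() returns a new string; strings are immutable in Python.
lowered = [tweet.lower() for tweet in sample_tweets]

print(lowered)
# → ['stocks rally after earnings!', 'the fed holds rates']
```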

In [None]:
# your code goes here


#### Part 2 

Here, we use Python's regular expression module, `re`, to remove punctuation characters.

The easiest way to remove specific punctuation characters is with a regular expression. You can replace every match of a pattern with a space:

```python
re.sub(pattern, ' ', text) 
```

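This replaces every match of the pattern in the text with a space.

The pattern we use is `[^a-zA-Z0-9]`, which matches any character that is not an ASCII letter or digit. Besides punctuation, this also matches whitespace and symbols such as `#` and `@`.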

In [None]:
import re

# Replace every non-alphanumeric character in each tweet with a space.
for i, tweet in enumerate(all_tweets):
    all_tweets[i] = re.sub(r'[^a-zA-Z0-9]', ' ', tweet)

print(all_tweets)
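
Because the pattern `[^a-zA-Z0-9]` matches symbols as well as punctuation, the result can contain runs of spaces. The sketch below, on a made-up string, shows the effect and one optional way to collapse the extra spaces. The second step is added here for illustration only and is not required by the exercise.

```python
import re

# Hypothetical example string, not one of the course tweets.
text = "rates up 2.5%, says @fed!"

# Every character that is not an ASCII letter or digit becomes a space.
cleaned = re.sub(r'[^a-zA-Z0-9]', ' ', text)
print(repr(cleaned))

# Optional: splitting on whitespace and re-joining collapses repeated spaces.
collapsed = ' '.join(cleaned.split())
print(collapsed)
# → rates up 2 5 says fed
```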

### NLTK: Natural Language ToolKit

NLTK is a leading platform for building Python programs to work with human language data. It has a suite of tools for classification, tokenization, stemming, tagging, parsing, and semantic reasoning. 

Let's import NLTK. 

In [None]:
import os 
import nltk 
nltk.data.path.append(os.path.join(os.getcwd(), "nltk_data"))

#### TODO: Part 1

NLTK provides the `TweetTokenizer` class, which splits tweets into tokens.

This makes tokenizing tweets much easier and faster.

You can pass the argument `preserve_case=False` to `TweetTokenizer` to make your tokens lowercase. In the cell below, tokenize each tweet in `all_tweets`.
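
A minimal sketch of the tokenizer on a single made-up tweet (the `tweet` string is hypothetical and assumed to be already stripped of punctuation, as in the normalization step):

```python
from nltk.tokenize import TweetTokenizer

# Hypothetical example tweet, already stripped of punctuation.
tweet = "Stocks RALLY after strong earnings"

# preserve_case=False lowercases the tokens as they are produced.
tokenizer = TweetTokenizer(preserve_case=False)
tokens = tokenizer.tokenize(tweet)

print(tokens)
# → ['stocks', 'rally', 'after', 'strong', 'earnings']
```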

In [None]:
from nltk.tokenize import TweetTokenizer

#  your code goes here


#### Part 2

NLTK offers further building blocks that can be combined with tokenization.

For example, stop words are words that carry little meaning for text analysis. They are very common words such as "the", "and", "if", etc. Ideally, we want to remove these words from our tokenized lists.
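
To make the idea concrete, here is a small sketch that filters a tokenized sentence. The `toy_stop_words` set is a tiny hand-written stand-in chosen for this illustration; it is not NLTK's list, and the tokens are made up.

```python
# A tiny hand-written stop list, used here only as a stand-in.
toy_stop_words = {"the", "and", "if", "a", "is"}

# Hypothetical tokenized tweet.
tokens = ["the", "market", "is", "up", "and", "volume", "is", "high"]

# Keep only the tokens that are not in the stop list.
filtered = [token for token in tokens if token not in toy_stop_words]

print(filtered)
# → ['market', 'up', 'volume', 'high']
```

Using a `set` for the stop list keeps each membership check fast, which matters once the list grows to a few hundred words.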

NLTK provides a list of these words in `nltk.corpus.stopwords`, which you first need to download with `nltk.download`.

Let's print out the English stop words to see what these words are.

In [None]:
from nltk.corpus import stopwords
nltk.download("stopwords")

### TODO: 

Print the English stop words.

In [None]:
# your code goes here


#### TODO: Part 3 

In the cell below use the `.split()` method to split each tweet into a list of words and remove the stop words from all the tweets.

In [None]:
# your code goes here

    

### Stemming
Stemming is the process of reducing words to their word stem, base or root form.
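
A short sketch of what a stemmer does, using NLTK's `PorterStemmer` on a few made-up words (the `words` list is hypothetical). Note that a stem is produced by rule-based suffix stripping, so it is not always a dictionary word.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical example words; stems are not always dictionary words.
words = ["stocks", "running", "trading", "companies"]
stems = [stemmer.stem(word) for word in words]

print(stems)
```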

### TODO:

In the cell below, use the `PorterStemmer` class from the NLTK library to perform stemming on all the tweets.

In [None]:
from nltk.stem.porter import PorterStemmer

# your code goes here



### Lemmatizing
#### Part 1

Lemmatization is the process of grouping together the inflected forms of a word so they can be analyzed as a single item.

To reduce words to their dictionary form (lemma), you can use the `lemmatize()` method of NLTK's `WordNetLemmatizer` class.

For more information about lemmatizing in NLTK, please take a look at the NLTK documentation: https://www.nltk.org/api/nltk.stem.html

If you would like to understand more about stemming and lemmatizing, take a look at the following source:
https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html

In [None]:
nltk.download('wordnet')  # download the WordNet data used by the lemmatizer

### TODO:

In the cell below, use the `WordNetLemmatizer` class to lemmatize all the tweets.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

# your code goes here


#### TODO: Part 2

In the cell below, lemmatize the words as verbs by passing `pos='v'` to `WordNetLemmatizer().lemmatize`. Without `pos`, the lemmatizer treats every word as a noun.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

# your code goes here


# Solution

[Solution notebook](process_tweets_solution.ipynb)