# NLP Text Preprocessing 
**Date:** 09/09/2020                              
Version: 1.0

## 1. Introduction
This notebook converts a set of tweets into numerical representations that are suitable as input for recommender-system and information-retrieval algorithms.


## 2. Import libraries

In [2]:
import os # Get file from a particular path
import re # Regular expressions
import langid # Language identification, used to keep the English tweets
import pandas as pd # Handle data frames and import Excel files
import nltk # Text processing
from nltk.tokenize import RegexpTokenizer # Tokenizer
from nltk.tokenize import MWETokenizer # Tokenize multiple words
from nltk.collocations import * # Bigram collocations
from itertools import chain # Chain many lists together
import itertools 
from nltk.stem import PorterStemmer # Stem
from nltk.util import ngrams # Get ngrams out of a list of text
from sklearn.feature_extraction.text import CountVectorizer # Create a count vector

## 3. Text Pre-Processing


### 1. Examining and loading data
The first step is to load the given Excel file (30550971.xlsx), which is stored in a folder called part2 in the same location as this notebook. We read the file with the `ExcelFile` class of the pandas library.

In [3]:
excel_data = pd.ExcelFile(os.path.join(os.getcwd(), 'data/tweets.xlsx')) # Get the information stored in the excel file
excel_data

<pandas.io.excel._base.ExcelFile at 0x19ea3bacd00>

The imported file is not a single data frame: each sheet of the Excel file has to be parsed into its own data frame. The sheet names show how the data is organised:

In [3]:
sheets = excel_data.sheet_names # Get a list with all the names of the sheets of the excel file
print('The Excel File has ' + str(len(sheets)) + ' sheets')
sheets[:5] # Names of the first 5 sheets

The Excel File has 81 sheets


['2020-03-22', '2020-03-23', '2020-03-24', '2020-03-25', '2020-03-26']

Context-independent stop words are words that appear in almost every text regardless of its topic, so we have to get rid of them. Some Python libraries ship lists of these words, but in this case we use the given .txt file and store its contents in a list called `ci_stopwords`.

In [4]:
ci_stopwords = []
with open(os.path.join(os.getcwd(), 'part2/stopwords_en.txt'), 'r', encoding="utf8") as file: # Open the file 
    for line in file.readlines(): # Read each word that is stored in each line of the file
        ci_stopwords.append(line.strip()) # Strip the word in each line and append it to the list
        
ci_stopwords[:5] # First 5 context independent stop words

['a', "a's", 'able', 'about', 'above']

### 2. Parsing the data
After we manage to extract the information of each sheet of the Excel File, we have to parse it to get a clean data frame with the required information. To do this we have to:
- **Drop useless columns**: Columns where all the values are NA
- **Drop useless rows**: Rows where all the values are NA
- **Fix the header**: Make a uniform header for all the data frames

After these corrections, we collect the tweets of every sheet in one place. We do this by extracting the column **text** of each data frame and storing it in a dictionary called `tweets`, keyed by sheet name.

In [5]:
tweets = {} # Dictionary that will store the tweets of each sheet
for sheet in sheets: # For every sheet of the Excel file
    df = excel_data.parse(sheet) # Parse the sheet to get a dataframe with the information
    df.dropna(axis = 0, how='all', inplace=True) # Drop the rows where all the cells are NaN
    df.dropna(axis = 1, how='all', inplace=True) # Drop the columns where all the cells are NaN
    df.index = range(len(df.index)) # As we dropped some rows, we restart the index from 0
    header = [df.columns[0], df.columns[1], df.columns[2]] # Current names of the first three columns

    if header != ['text', 'id', 'created_at']: # If the header does not have this structure
        df.rename(columns=df.iloc[0], inplace = True) # Use the first row as the header of the df
        df.drop(df.index[0], inplace=True) # Drop the first row, as it is now the header
    
    text = df['text'].tolist() # Convert the text column to a list
    tweets[sheet] = text # Store the tweets of this sheet under its sheet name

### 3. Filtering English Tweets
Using the `langid` library we can filter out the non-English tweets, as we are only interested in those written in English. We identify the language of each tweet and keep it only if it is classified as English. The final result is a smaller dictionary with only the English tweets, called `en_tweets`.

In [6]:
en_tweets = {} # Dictionary to store the english tweets

for date in tweets: # For every date in tweets
    text = [] # Empty list of tweets. To store all the english tweets written in each day
    
    for tweet in tweets[date]: # For each tweet in the tweets list
        if langid.classify(str(tweet))[0] == 'en': # If the tweet is written in english
            text.append(tweet)
    
    en_tweets[date] = text

### 4. Word Tokenization
Now we convert each tweet in **en_tweets** to lower case and tokenize it with the following regular expression: **[a-zA-Z]+(?:[-'][a-zA-Z]+)?**. All the tokens (including repeated ones) are then stored in a dictionary called *tokens_tweets*, where the keys are the dates and the values are the lists of tokens of each date.

<img src="./images/tokenization.png" width="500" height="250"></img>

In [7]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?") # Regular expression used to get the tokens       
tokens_tweets = {} # Dictionary to store the tokens per date

for date in en_tweets: # For every date in tweets
    tokens = [] # List to store the tokens
    
    for tweet in en_tweets[date]: # For each tweet in the en_tweets list
        tokens.extend(tokenizer.tokenize(str(tweet).lower())) # Lower-case the tweet, tokenize it and add its tokens to the list
    
    tokens_tweets[date] = tokens # Store all the tokens of a date in the dictionary 
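
To make the effect of this regular expression concrete, here is a minimal sketch that applies the same pattern to a made-up tweet using the standard `re` module. The sample text is hypothetical, and `re.findall` is used as a stand-in for `RegexpTokenizer`, which should return the same matches for a pattern without capturing groups.

```python
import re

# Same pattern as in the cell above: a run of letters, optionally followed by
# one hyphen or apostrophe and another run of letters.
TOKEN_PATTERN = re.compile(r"[a-zA-Z]+(?:[-'][a-zA-Z]+)?")

# Hypothetical tweet, used only for illustration.
sample_tweet = "Don't panic: COVID-19 is a well-known risk https://t.co/abc"

sample_tokens = TOKEN_PATTERN.findall(sample_tweet.lower())
print(sample_tokens)
```

Digits and punctuation are dropped, and a URL is split into pieces such as `https`, `t` and `co`, which is why those tokens show up in the frequency lists further down.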

### 5. Generate list of bigrams
Before removing useless words or stemming, we have to create the bigrams. We use the library **nltk.collocations** to get the top 200 bigram collocations according to the PMI measure, computed over the unigrams of all tweets.

First we concatenate all the daily unigrams into one single list with all the tokens.

In [8]:
unigrams = list(chain.from_iterable(tokens_tweets.values())) # Get all the unigrams stored in tokens_tweets

Then, from the list we just created, we build a new list with the top 200 bigrams ranked by PMI.

In [9]:
bigrams = []
bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = nltk.collocations.BigramCollocationFinder.from_words(unigrams) 
bigram_finder.apply_word_filter(lambda w: len(w) < 3) # Ignore bigrams containing a word with fewer than 3 letters
top_200_bigrams = bigram_finder.nbest(bigram_measures.pmi, 200) # Top-200 bigrams

Finally, since the method **BigramCollocationFinder.nbest(bigram_measures.pmi, n)** returns a list of tuples, one per bigram, we convert each tuple to the following format:

$$\text{tuple[0]_tuple[1]}$$

Each converted value is appended to a list called bigrams, so at the end the list **bigrams** stores the top 200 bigrams in the required format.

In [10]:
for i in top_200_bigrams:
    bigram = str(i[0].strip()) + '_' + str(i[1].strip())
    bigrams.append(bigram)
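
For intuition about what the PMI ranking rewards, the following sketch computes pointwise mutual information by hand on a tiny made-up token list. It is an illustration under stated assumptions, not NLTK's implementation: the helper `pmi_scores` and the sample tokens are hypothetical, and probabilities are estimated by dividing counts by the number of tokens.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Return {bigram: PMI} with PMI = log2(P(w1, w2) / (P(w1) * P(w2)))."""
    n = len(tokens)
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent pairs
    return {
        (w1, w2): math.log2(count * n / (word_counts[w1] * word_counts[w2]))
        for (w1, w2), count in bigram_counts.items()
    }

# Hypothetical token list, for illustration only.
toy_tokens = "new york is big new york is busy the city is big".split()
scores = pmi_scores(toy_tokens)

# Show the three highest-scoring pairs.
for pair in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(pair, round(scores[pair], 3))
```

In this toy data, pairs made of words that each occur only once score higher than the more frequent pair `new york`. This tendency of PMI to favour rare pairs may explain the unusual bigrams that appear in the vocabulary shown later.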

### 6. Removal of useless words
Now that we have created the bigram list, we can process the unigrams list with the following steps:

#### 6.1. Removal of duplicate words
As we are constructing the vocabulary of the tweets, we can get rid of duplicate values. We do this by converting the list unigrams to a set, and then converting the set back to a list to make it easier to handle.

In [11]:
unigrams = list(set(unigrams))

#### 6.2. Remove Context Independent Stopwords
Every language has words that are needed to construct correct sentences but carry little meaning on their own. These are usually function words such as articles, pronouns and particles.
Some Python libraries ship lists of these words, but in this case we use the words in the file **stopwords_en.txt** given for this assignment, which was already imported in the first step. The context-independent stop words are stored in a list called **ci_stopwords**.

In [12]:
ci_stopwords_set = set(ci_stopwords) # Build the set once for fast membership tests
unigrams = [x for x in unigrams if x not in ci_stopwords_set] # Remove context independent stopwords

#### 6.3. Remove unigrams with fewer than 3 letters
To keep only words that carry some meaning, we get rid of words with fewer than 3 letters.

In [13]:
unigrams = [x for x in unigrams if len(x) >= 3] # Keep only the tokens with at least 3 letters

#### 6.4. Remove words based on their frequency
By counting the number of documents (days) in which each word occurs, we can identify two types of words to remove:

- **Rare tokens:** Words that occur in fewer than 5 documents (days)
- **Context dependent stop words:** Words that occur in more than 60 documents (days)

First, we compute the document frequency of the unigrams. We take the unique unigrams of each day and chain them into one list, then count how often each word appears in it. Since each day contributes a word at most once, this count is the number of days on which the word occurs.

In [14]:
daily_unigrams = list(chain.from_iterable([set(token) for token in tokens_tweets.values()])) # Unique tokens of each day, chained together
freq_dist = nltk.FreqDist(daily_unigrams) # Number of days on which each token occurs
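
The difference between an ordinary term frequency and the document frequency used here can be seen on a toy example. The sketch below uses `collections.Counter` (which `nltk.FreqDist` extends) and a made-up dictionary `toy_tokens_by_day`; the data and the variable names are assumptions for illustration.

```python
from collections import Counter
from itertools import chain

# Hypothetical tokens per day, for illustration only.
toy_tokens_by_day = {
    "day1": ["virus", "virus", "mask"],
    "day2": ["virus", "lockdown"],
    "day3": ["mask", "virus", "virus"],
}

# Term frequency: every occurrence counts.
term_freq = Counter(chain.from_iterable(toy_tokens_by_day.values()))

# Document frequency: each day contributes a token at most once.
doc_freq = Counter(chain.from_iterable(set(t) for t in toy_tokens_by_day.values()))

print(term_freq["virus"], doc_freq["virus"])
```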

##### 6.4.1. Removal of Rare Tokens
As stated above, rare tokens are the words that occur in fewer than 5 documents. We first collect these rare tokens and then remove them from the list unigrams.

In [15]:
rare_tokens = {x for x in freq_dist if freq_dist[x] < 5} # Rare tokens, as a set for fast membership tests

unigrams = [x for x in unigrams if x not in rare_tokens] # Remove rare tokens

##### 6.4.2. Removal of Context Dependent Stop Words
As stated above, context dependent stop words are the words that occur in more than 60 documents. We first build the list of these context dependent stop words and then remove them from the list unigrams.

In [16]:
cd_stopwords = [x for x in freq_dist.keys() if freq_dist.get(x) > 60] # Context dependent stop words

unigrams = list(set(unigrams) - set(cd_stopwords))  # Remove context dependent stopwords
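
Both frequency filters can also be expressed as a single pass that keeps the words whose document frequency lies between the two thresholds. This is a sketch only: the helper name `filter_by_doc_freq` and its parameters are hypothetical, with defaults taken from the thresholds stated above (at least 5 and at most 60 days).

```python
def filter_by_doc_freq(words, doc_freq, min_days=5, max_days=60):
    """Keep words that occur in at least min_days and at most max_days documents.

    Words missing from doc_freq are kept, mirroring the cells above, which
    only remove words that are explicitly flagged as rare or too frequent.
    """
    return [
        w for w in words
        if w not in doc_freq or min_days <= doc_freq[w] <= max_days
    ]

# Hypothetical document frequencies, for illustration only.
toy_doc_freq = {"coronavirus": 81, "quarantine": 40, "aaak": 1}
kept = filter_by_doc_freq(["coronavirus", "quarantine", "aaak"], toy_doc_freq)
print(kept)
```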

### 7. Stemming of tokens
Stemming is the process of reducing the different inflected forms of a word to the same 'root'. In English, nouns are inflected in the plural, verbs in the various tenses, and adjectives in the comparative/superlative. The idea is to keep just this 'root' instead of having too many different forms of the same word.
We do this with the `PorterStemmer` class, applying its `stem` method to all the words in the list unigrams.

In [18]:
stemmer = PorterStemmer() # Porter Stemmer
unigrams = [stemmer.stem(w) for w in unigrams] # Convert all the elements in unigrams to their stem form
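
One thing to keep in mind is that several distinct words can map to the same stem, so a list that was free of duplicates before stemming may contain repeated entries afterwards. The sketch below illustrates this with a deliberately crude stand-in stemmer (`toy_stem`, hypothetical, not the Porter algorithm) so that it runs without NLTK. `dict.fromkeys` is one way to drop the repeats while preserving order, should that be desired.

```python
def toy_stem(word):
    """Crude stand-in for a stemmer: strips a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical list of unique words, for illustration only.
unique_words = ["test", "tests", "tested", "testing", "mask"]
stems = [toy_stem(w) for w in unique_words]
print(stems)  # several words collapse to the same stem

deduplicated = list(dict.fromkeys(stems))  # keeps first occurrences, in order
print(deduplicated)
```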

### 8. Create the Vocabulary
Now that we have the stemmed unigrams without the useless words, and the top 200 bigrams, we can create the vocabulary from these two lists.

We concatenate the lists unigrams and bigrams and sort the result in alphabetical order to get a new list called `vocab`.

In [19]:
vocab = unigrams + bigrams
vocab.sort()

Now, we inspect the values of the `vocab` list

In [20]:
vocab[:5] # First 5 words of the vocabulary

['a-chinese_bioweapon-greatawakening',
 'aaa',
 'aaaaaaa_comel',
 'aaaahjhvas_chaitye',
 'aaak_thooooo']

### 9. Daily Frequency Distribution of Unigrams
In this task we create a dictionary called `freq_unigrams`. The keys of the dictionary are the dates and the values are lists with the top 100 unigrams of each day.

We take the original unigrams of each day, stem them, and keep only the words that are in the list **unigrams**, from which the useless words have already been removed.

In [21]:
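freq_unigrams = {}
unigrams_set = set(unigrams) # Build the set once for fast membership tests
for date in tokens_tweets:
    daily_unigrams = [stemmer.stem(x) for x in tokens_tweets[date]] # Stem the tokens of the day
    daily_unigrams = [x for x in daily_unigrams if x in unigrams_set] # Keep only the tokens that are in the list unigrams
    freq_unigrams[date] = nltk.FreqDist(daily_unigrams).most_common(100) # Top 100 unigrams of the day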

Now, we inspect the values of the `freq_unigrams` dictionary

In [23]:
freq_unigrams['2020-03-22'][:5] # First 5 unigrams of the date 2020-03-22 with its frequency

[('co', 927), ('in', 327), ('thi', 209), ('be', 145), ('on', 130)]
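
The per-day ranking relies on `most_common`, which `nltk.FreqDist` inherits from `collections.Counter`. A minimal sketch with made-up data follows; the dictionary `toy_daily_tokens` and the cut-off of 2 are assumptions chosen to keep the output short.

```python
from collections import Counter

# Hypothetical, already stemmed and filtered tokens per day.
toy_daily_tokens = {
    "2020-03-22": ["mask", "test", "mask", "case", "mask", "test"],
    "2020-03-23": ["case", "case", "test"],
}

top_per_day = {
    date: Counter(tokens).most_common(2)  # list of (token, count), highest first
    for date, tokens in toy_daily_tokens.items()
}
print(top_per_day["2020-03-22"])
```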

### 10. Daily Frequency Distribution of Bigrams
In this task we create a dictionary called `freq_bigrams`. The keys of the dictionary are the dates and the values are lists with the top 100 bigrams of each day.

We take the original unigrams of each day, build bigrams out of these lists and keep the 100 most frequent bigrams of each day.

In [22]:
freq_bigrams = {}
for date in tokens_tweets:
    daily_bigrams = ngrams(tokens_tweets[date], n=2)
    freq_bigrams[date] = nltk.FreqDist(daily_bigrams).most_common(100)

Now, we inspect the values of the `freq_bigrams` dictionary

In [24]:
freq_bigrams['2020-03-22'][:5] # First 5 bigrams of the date 2020-03-22 with their frequencies

[(('https', 't'), 925),
 (('t', 'co'), 925),
 (('the', 'coronavirus'), 64),
 (('coronavirus', 'https'), 56),
 (('of', 'the'), 47)]
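
For a list of tokens, `ngrams(tokens, n=2)` yields every pair of adjacent tokens. The same pairs can be produced with `zip`, as in this sketch on made-up tokens. Note that because the tokens of a day are stored as one flat list, adjacent pairs are also formed across the boundary between two tweets.

```python
from collections import Counter

# Hypothetical tokens of two consecutive tweets, flattened into one list.
tweet_a = ["stay", "home"]
tweet_b = ["wear", "mask"]
flat_tokens = tweet_a + tweet_b

pairs = list(zip(flat_tokens, flat_tokens[1:]))  # adjacent pairs
bigram_counts = Counter(pairs)

print(pairs)
print(("home", "wear") in bigram_counts)  # pair spanning the two tweets
```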

### 11. Create the .txt files

#### 11.1. vocabulary.txt
In this file we are going to save the unigrams and bigrams with the format of sample_vocab.txt file. The data is going to be extracted from the list `vocab`

In [25]:
with open('vocabulary.txt', 'w', encoding='utf8') as vocab_txt: # Create a new .txt file; it is closed automatically at the end of the block
    for token in vocab: # For every token in vocab list
        text = str(token) + ':' + str(vocab.index(token)) + '\n' # Format -> token:id
        vocab_txt.write(text) # Write the text in the file vocabulary.txt
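
As an alternative sketch, the `token:index` lines can be produced with `enumerate`, which yields each position directly instead of searching the list with `vocab.index` for every token. The two approaches differ if the list contains repeated tokens: `index` returns the position of the first occurrence, while `enumerate` numbers every entry. The example writes to an in-memory buffer and uses a made-up `toy_vocab`; both are assumptions for illustration.

```python
import io

toy_vocab = sorted(["mask", "stay_home", "case"])  # hypothetical vocabulary

buffer = io.StringIO()  # stands in for the output file
for index, token in enumerate(toy_vocab):
    buffer.write(f"{token}:{index}\n")  # Format -> token:index

print(buffer.getvalue())
```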

#### 11.2. 100unigrams.txt
In this file we are going to save the top 100 unigrams of each date of the file with the format of sample_100uni.txt file. The data is going to be extracted from the dictionary `freq_unigrams`

In [26]:
with open('100unigrams.txt', 'w', encoding='utf8') as uni100_txt: # Create a new .txt file; it is closed automatically at the end of the block
    for date in freq_unigrams: # For every date in freq_unigrams
        text = str(date) + ':' + str(freq_unigrams[date]) + '\n' # Format -> date:list of (unigram, frequency)
        uni100_txt.write(text) # Write the text in the file 100unigrams.txt
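
Each line written by the cell above consists of the date, a colon, and the string form of a Python list of `(token, count)` tuples. Assuming that layout, such a line can be read back with `str.partition` and `ast.literal_eval`, as in this sketch (the sample line is made up):

```python
import ast

# Hypothetical line in the date:list format described above.
line = "2020-03-22:[('mask', 3), ('test', 2)]\n"

date, _, payload = line.partition(":")     # split at the first colon only
pairs = ast.literal_eval(payload.strip())  # safely parse the list literal

print(date, pairs)
```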

#### 11.3. 100bigrams.txt
In this file we are going to save the top 100 bigrams of each date of the file with the format of sample_100bi.txt file. The data is going to be extracted from the dictionary `freq_bigrams`

In [27]:
with open('100bigrams.txt', 'w', encoding='utf8') as bi100_txt: # Create a new .txt file; it is closed automatically at the end of the block
    for date in freq_bigrams: # For every date in freq_bigrams
        text = str(date) + ':' + str(freq_bigrams[date]) + '\n' # Format -> date:list of (bigram, frequency)
        bi100_txt.write(text) # Write the text in the file 100bigrams.txt

## References

[Open files with os library](https://stackoverflow.com/questions/18262293/how-to-open-every-file-in-a-folder)

[Classify english tweets](https://stackoverflow.com/questions/39142778/python-how-to-determine-the-language)

[Column to list](https://stackoverflow.com/questions/22341271/get-list-from-pandas-dataframe-column)