# Chapter 4: Preprocessing

## 0. Loading the data

First we mount our GDrive:

In [0]:
# Import libraries
import pandas as pd
from google.colab import drive

# Mount GDrive
drive.mount('/content/drive/')

After mounting our GDrive, we load the data into a pandas DataFrame:

In [0]:
# Load the data into a Pandas DataFrame
review_df = pd.read_csv("/content/drive/My Drive/TFM/yelp_reviews.csv")

# Show the first 5 rows as an example
review_df.head()

## 1. Case normalization

For each review, we lower-case the text with the pandas str.lower method:

In [0]:
# Lower-case the whole text column with the pandas string method
review_df['text'] = review_df['text'].str.lower()

# Show 5 rows as an example
review_df.head()

## 2. Tokenization

We tokenize each review with NLTK's WordPunctTokenizer, which splits text into runs of word characters and runs of punctuation:

In [0]:
# Import the tokenizer from NLTK
from nltk.tokenize import WordPunctTokenizer

# Instantiate the tokenizer class
tokenizer = WordPunctTokenizer()

# Apply the tokenizer to each row
review_df['text'] = review_df['text'].apply(tokenizer.tokenize)

# Show 5 rows as an example
review_df.head()
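
To see what kind of tokens this step produces, here is a minimal sketch that mimics the tokenizer with the regular expression \w+|[^\w\s]+, which is the pattern WordPunctTokenizer is built on according to the NLTK documentation. The sample sentence is made up for illustration and is not taken from the dataset.

```python
import re

# Pattern behind WordPunctTokenizer: runs of word characters,
# or runs of characters that are neither word characters nor whitespace
WORD_PUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")

# Made-up, already lower-cased review text
sample = "the food wasn't great, but $5 tacos!!"

tokens = WORD_PUNCT_PATTERN.findall(sample)
print(tokens)
# ['the', 'food', 'wasn', "'", 't', 'great', ',', 'but', '$', '5', 'tacos', '!!']
```

Note that contractions are split at the apostrophe, so fragments such as "wasn" and "t" reach the following steps as separate tokens.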

## 3. Stopping

### 3.1. Removal of stop words

If it is your first time loading NLTK's stop words list, you should run the following cell:

In [0]:
# You will have to download the set of stop words the first time
import nltk
nltk.download('stopwords')
nltk.download('punkt')

Now we remove from each review the words that appear in the stop word list:

In [0]:
# Load stopwords from NLTK
from nltk.corpus import stopwords

# Load the English stop words as a set, so that each membership test is fast
stop_words = set(stopwords.words('english'))

# Remove stop words for each row
review_df['text'] = review_df['text'].apply(lambda x: [word for word in x if word not in stop_words])

# Show 5 rows as an example
review_df.head()
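
The filtering step can be tried on a single tokenized review. This is only a sketch: the stop word collection below is a tiny hypothetical stand-in for NLTK's English list, and the tokens are made up. It uses a set because a membership test on a set does not have to scan every entry, which can matter when the test runs once per token over a large corpus.

```python
# Tiny illustrative stop word set (hypothetical; NLTK's English list is much longer)
stop_words = {"the", "was", "but", "a"}

# Made-up tokenized review
tokens = ["the", "food", "was", "great", "but", "a", "bit", "pricey"]

# Keep only the tokens that are not stop words
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['food', 'great', 'bit', 'pricey']
```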

### 3.2. Removal of non-alphabetic tokens

We use Python's built-in str.isalpha method to remove tokens that contain anything other than letters, such as numbers and punctuation:

In [0]:
# Remove non-alphabetic tokens for each row
review_df['text'] = review_df['text'].apply(lambda x: [word for word in x if word.isalpha()])

# Show 5 rows as an example
review_df.head()
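
As a quick illustration of which tokens survive this filter, the sketch below applies the same test to a made-up token list. A token is kept only if every character in it is a letter, so mixed tokens such as "2nd" are dropped as well, while accented letters count as alphabetic.

```python
# Made-up tokens covering words, numbers, punctuation and mixed cases
tokens = ["great", "tacos", "5", "!!", "$", "2nd", "café"]

# str.isalpha() is True only when all characters are letters
kept = [word for word in tokens if word.isalpha()]
print(kept)  # ['great', 'tacos', 'café']
```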

## 4. Spelling normalization

We use the pyspellchecker library for spelling normalization. If this library is not installed in your environment, please run the following cell:

In [0]:
# Run this cell if you do not have the pyspellchecker library
%pip install pyspellchecker

Now we can use the SpellChecker class to correct spelling.

__Warning:__ this process takes a long time.

In [0]:
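# Import the SpellChecker class from the pyspellchecker library
from spellchecker import SpellChecker

# Instantiate the SpellChecker class
spell = SpellChecker()

# Correct spelling for each row; recent pyspellchecker versions return None
# when no correction is found, so we keep the original word in that case
review_df['text'] = review_df['text'].apply(lambda x: [spell.correction(word) or word for word in x])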

# Show 5 rows as an example
review_df.head()
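
Since the same words occur in many reviews, one way to shorten this step is to correct each distinct word only once and reuse the result. The sketch below is not part of the original pipeline: correct_reviews and toy_correct are hypothetical names, and toy_correct is a stand-in for spell.correction so that the example runs without the library.

```python
def correct_reviews(reviews, correct):
    """Correct every token, calling `correct` once per distinct word."""
    cache = {}
    corrected = []
    for review in reviews:
        row = []
        for word in review:
            if word not in cache:
                # Fall back to the original word if no correction is returned
                cache[word] = correct(word) or word
            row.append(cache[word])
        corrected.append(row)
    return corrected


# Stand-in corrector for illustration only; it records every call it receives
calls = []

def toy_correct(word):
    calls.append(word)
    return {"grat": "great", "fod": "food"}.get(word)


reviews = [["grat", "fod"], ["grat", "service"]]
result = correct_reviews(reviews, toy_correct)
print(result)  # [['great', 'food'], ['great', 'service']]
print(calls)   # ['grat', 'fod', 'service']
```

In the notebook, the call would presumably look like correct_reviews(review_df['text'], spell.correction), with the returned list assigned back to the column.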

## 5. Stemming

We use the EnglishStemmer from the snowball module of NLTK's stemmers:

In [0]:
# Import the stemmer from NLTK
from nltk.stem.snowball import EnglishStemmer

# Instantiate the stemmer class
sb = EnglishStemmer()

# Stem for each row
review_df['text'] = review_df['text'].apply(lambda x: [sb.stem(word) for word in x])

# Show 5 rows as an example
review_df.head()

## 6. Word embedding

### 6.1. GloVe

First, we load our pretrained GloVe model, which is a text file that we read into a dictionary mapping each word to its embedding vector:

In [0]:
# Import libraries
import numpy as np

# Initialize dictionary of GloVe word embeddings
glove_embeddings = {}
# Open the pretrained GloVe file and store each word with its vector in the dictionary
with open("/content/drive/My Drive/TFM/glove.twitter.27B.100d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        glove_embeddings[word] = vector
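
The loop above relies on the GloVe text format, where each line holds a token followed by the components of its vector, separated by spaces. The following sketch runs the same parsing logic on two made-up lines with 3 dimensions instead of 100, so the values are purely illustrative.

```python
import io

import numpy as np

# Two made-up lines in the GloVe text format: token, then vector components
fake_file = io.StringIO("good 0.1 -0.2 0.3\nbad -0.4 0.5 -0.6\n")

embeddings = {}
for line in fake_file:
    values = line.split()
    # First field is the word, the remaining fields are the vector
    embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

print(embeddings["good"].shape)  # (3,)
print("ugly" in embeddings)      # False
```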

We discard words that are not in our GloVe pretrained model:

In [0]:
# Discard words that are not in our GloVe pretrained model
review_df['text'] = review_df['text'].apply(lambda x: [word for word in x if word in glove_embeddings])

# Show 5 rows as an example
review_df['text'].head()

### 6.2. TF-IDF

We apply sklearn's TF-IDF method:

In [0]:
# Import the vectorizer class from sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

# TfidfVectorizer takes a list of documents as input, so we join each review's tokens back into one string
corpus = [" ".join(review) for review in review_df['text']]

# Instantiate the TF-IDF class
vectorizer = TfidfVectorizer()

# Compute TF-IDF
tf_idf_reviews = vectorizer.fit_transform(corpus)
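
For reference, and assuming the default parameters are left unchanged as in the cell above (smooth_idf=True and norm='l2', as we understand the scikit-learn documentation), the score of a term t in a review d is computed as follows, where n is the number of reviews, df(t) is the number of reviews containing t, and tf(t, d) is the number of times t occurs in d:

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
```

Each review's row of scores is then divided by its Euclidean norm, so every non-empty row of tf_idf_reviews has length 1.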

### 6.3. GloVe averaged by TF-IDF

We weight each word's GloVe vector by its TF-IDF score, sum the weighted vectors of each review, and store the result in our original review DataFrame:

In [0]:
# Map each word in the TF-IDF model to its column index
# (vocabulary_ is used because get_feature_names() was removed in newer sklearn versions)
tf_idf_vocabulary = vectorizer.vocabulary_

# Initialize list for final output
glove_tf_idf_reviews = []

# Compute GloVe weighted by TF-IDF
for row, review in enumerate(review_df['text']):

  # Temporary list to store the weighted vectors of this review
  tmp = []

  # For each word in a review, if the word is in the TF-IDF model, store embedding * TF-IDF score
  for word in review:
    if word in tf_idf_vocabulary:
      tmp.append(glove_embeddings[word]*tf_idf_reviews[row, tf_idf_vocabulary[word]])

  # Sum the weighted vectors (note: a review with no known words yields the integer 0)
  glove_tf_idf_reviews.append(sum(tmp))

# Save output to original DataFrame
review_df['text'] = glove_tf_idf_reviews

# Show 5 rows as an example
review_df['text'].head()
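
To make the computation concrete, here is a small worked example with made-up 2-dimensional embeddings and made-up TF-IDF scores for a single review. The last step, dividing by the sum of the weights to obtain a weighted average, is shown only as a possible variant and is not what the cell above does.

```python
import numpy as np

# Made-up embeddings and TF-IDF scores for one review (illustrative values only)
embeddings = {"good": np.array([1.0, 0.0]), "food": np.array([0.0, 2.0])}
weights = {"good": 0.6, "food": 0.8}
review = ["good", "food", "good"]

# Each occurrence contributes its embedding scaled by the word's TF-IDF score
weighted_sum = np.zeros(2)
for word in review:
    weighted_sum += embeddings[word] * weights[word]

print(weighted_sum)  # [1.2 1.6]

# Variant (assumption, not in the notebook): normalize by the total weight
weighted_avg = weighted_sum / sum(weights[word] for word in review)
```

As in the notebook cell, a word that appears twice is counted twice, because the loop runs over every token of the review.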

## 7. Export

Finally, we export the preprocessed dataset.

In [0]:
# Export the DataFrame to a CSV file
review_df.to_csv("/content/drive/My Drive/TFM/yelp_final_data.csv", index=False)
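
One thing to keep in mind is that the text column now holds NumPy arrays, and a CSV cell can only hold text, so each array is written as its printed string form. A possible workaround, which is not part of the original notebook, is to serialize each vector explicitly before exporting so that it can be parsed back reliably. The sketch below shows the round trip for one made-up vector using JSON.

```python
import json

# Made-up vector; in the notebook this would be one array converted with .tolist()
vector = [0.25, -1.5, 3.0]

# Serialize to a string that fits in a single CSV cell
cell = json.dumps(vector)
print(cell)  # [0.25, -1.5, 3.0]

# Parse the cell back into a list of floats after reading the CSV
restored = json.loads(cell)
```

Applied to the DataFrame, this would presumably be done with apply on the text column before calling to_csv, and reversed with json.loads after read_csv.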