# Data Downsampling and Preprocessing

This Jupyter Notebook contains the first step of the ML pipeline for this project: data preprocessing and downsampling. Preprocessing cleans up messy text in the reviews and keeps only the words that carry the most information, removing redundancy. This is essential for gaining reliable insights during EDA.

Downsampling creates a dataset that is smaller than the original but still representative of it as a whole. It is required because my computer is not powerful enough to operate on the full dataset.

In [1]:
#General Imports
import numpy as np
import pandas as pd
import pickle 
from os.path import join

#Preprocessing related imports 
from nltk.stem import WordNetLemmatizer
import gensim.parsing.preprocessing as gpp
import gensim.utils as gu


In [3]:
#Load full dataset 
data_dir = "data/"
data = pd.read_csv(join(data_dir, "train.csv"), header=None, names=['Rating', 'Title', 'Review'])
display(data)

Unnamed: 0,Rating,Title,Review
0,3,more like funchuck,Gave this to my dad for a gag gift after direc...
1,5,Inspiring,I hope a lot of people hear this cd. We need m...
2,5,The best soundtrack ever to anything.,I'm reading a lot of reviews saying that this ...
3,4,Chrono Cross OST,The music of Yasunori Misuda is without questi...
4,5,Too good to be true,Probably the greatest soundtrack in history! U...
...,...,...,...
2999995,1,Don't do it!!,The high chair looks great when it first comes...
2999996,2,"Looks nice, low functionality",I have used this highchair for 2 kids now and ...
2999997,2,"compact, but hard to clean","We have a small house, and really wanted two o..."
2999998,3,Hard to clean!,I agree with everyone else who says this chair...


In [4]:
# Check distribution of rating values as this is likely our target variable
data["Rating"].value_counts()

1    600000
2    600000
3    600000
4    600000
5    600000
Name: Rating, dtype: int64

We see that the ratings are evenly distributed, with 600,000 data points for each label value. We want to preserve this property when we downsample the dataset.

### Preprocessing
We first preprocess the entire dataset by lowercasing the text and applying the following transformations:
- Stripping HTML tags (`gpp.strip_tags`)
- Removing all punctuation (`gpp.strip_punctuation`)
- Collapsing repeated whitespace (`gpp.strip_multiple_whitespaces`)
- Removing all digits (`gpp.strip_numeric`)
- Removing stopwords (`gpp.remove_stopwords`)
- Removing words shorter than 3 characters (`gpp.strip_short`)

After this initial preprocessing, we lemmatize the reviews to produce lemmatized strings.


In [2]:
def preprocess_text(text):
    """Preprocesses a given string text input"""
    preprocs = [
        gpp.strip_tags, 
        gpp.strip_punctuation,
        gpp.strip_multiple_whitespaces,
        gpp.strip_numeric,
        gpp.remove_stopwords, 
        gpp.strip_short, 
    ]
    text = gu.to_unicode(text.lower().strip())
    for preproc in preprocs:
        text = preproc(text)
    return text

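def lemmatize(text):
    """Lemmatizes a given string text input.

    Note: WordNetLemmatizer.lemmatize() looks up its argument as a single
    word, so a string of several words is typically returned unchanged.
    """
    wnl = WordNetLemmatizer()
    return wnl.lemmatize(text)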

In [5]:
# Combining both the above functions into a single preprocessing function
def preprocess(text):
    return lemmatize(preprocess_text(str(text)))
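
NLTK's `WordNetLemmatizer.lemmatize` works on one word at a time, so passing it a whole review string generally leaves that string as it was. If word-level lemmatization is the goal, one option is to split the text and lemmatize each token. The sketch below is a minimal illustration of that idea, not the method used in this notebook: `lemmatize_tokens`, `TOY_LEMMAS`, and `toy_lemmatize_word` are hypothetical names, and the toy dictionary stands in for WordNet so the example runs without downloading NLTK data.

```python
def lemmatize_tokens(text, lemmatize_word):
    """Apply a single-word lemmatizer to every whitespace-separated token."""
    return " ".join(lemmatize_word(token) for token in text.split())


# Toy stand-in lemmatizer (hypothetical mapping) so the sketch is runnable
# without the WordNet corpus.
TOY_LEMMAS = {"reviews": "review", "years": "year"}


def toy_lemmatize_word(word):
    """Return the lemma from the toy mapping, or the word itself if unknown."""
    return TOY_LEMMAS.get(word, word)


result = lemmatize_tokens("best soundtrack reading lot reviews", toy_lemmatize_word)
print(result)  # best soundtrack reading lot review

# With NLTK installed and the WordNet corpus available, the real lemmatizer
# could be plugged in instead:
#   from nltk.stem import WordNetLemmatizer
#   lemmatize_tokens(text, WordNetLemmatizer().lemmatize)
```

Switching to a token-level lemmatizer would change the preprocessed text, so any saved outputs would need to be regenerated.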

Before applying the preprocessing, we note that the dataset has two columns of textual data: the title of the review and the review itself. The title also reflects the user's feelings towards the product and is essentially a summary of the review, so it is informative for predicting the rating too. Therefore, we create a new feature "ReviewFull", a concatenation of the review title and the review text, and use it as our primary data for EDA and model training.

In [6]:
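# Create the ReviewFull data column; missing values are treated as empty
# strings so that a NaN in one column does not wipe out the other
data["ReviewFull"] = data["Title"].fillna("") + " " + data["Review"].fillna("")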
data = data.drop(["Title", "Review"], axis=1)

# Apply the preprocessing to the textual data
data["ReviewFull"] = data["ReviewFull"].apply(preprocess)
data.head()

Unnamed: 0,Rating,ReviewFull
0,3,like funchuck gave dad gag gift directing nuns...
1,5,inspiring hope lot people hear need strong pos...
2,5,best soundtrack reading lot reviews saying bes...
3,4,chrono cross ost music yasunori misuda questio...
4,5,good true probably greatest soundtrack history...


In [7]:
# Save the data
data.to_csv(join(data_dir, "preprocessed_train.csv"))
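
A note for later notebooks that read this file back: `to_csv` writes the DataFrame index as an unnamed first column by default, and `read_csv` then loads it as a column called `Unnamed: 0`. The sketch below shows two common ways to handle this, using a tiny illustrative frame and an in-memory buffer in place of the real file.

```python
import io

import pandas as pd

# Toy frame standing in for the preprocessed data (values are illustrative)
df = pd.DataFrame({"Rating": [3, 5], "ReviewFull": ["like funchuck", "inspiring hope"]})

# Default to_csv() writes the index as an unnamed first column
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)  # gains an extra "Unnamed: 0" column

# Option 1: read the saved index back as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

# Option 2: do not write the index in the first place
buffer_no_index = io.StringIO()
df.to_csv(buffer_no_index, index=False)
buffer_no_index.seek(0)
no_index = pd.read_csv(buffer_no_index)

print(list(with_index.columns))
print(list(restored.columns))
print(list(no_index.columns))
```

Option 1 is the better fit when the index carries meaning, for example the original row numbers kept after sampling; option 2 is simpler when it does not.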

### Downsampling

We now create a smaller dataset that my computer can process during EDA and modeling. We downsample to 50,000 data points, preserving the even distribution of ratings by grouping on the "Rating" column and sampling 10,000 reviews from each group.

In [8]:
downsampled = data.groupby("Rating").sample(10000)
display(downsampled)

Unnamed: 0,Rating,ReviewFull
260774,1,diappointed service order thing direct address...
1339843,1,allsop scratch repair kit product according pa...
1099824,1,angry requirements years old ibook running tig...
355715,1,yawn years heard film scary disturbing long ag...
2880294,1,terrible waste time bother switches past prese...
...,...,...
1319052,5,loved book finished reading flying seat pants ...
1040680,5,beautiful marriage photographs text photograph...
819639,5,great came soon days functions compared compet...
1673128,5,perfectly written novel think safe read book g...


In [9]:
# Ensure equal distribution of targets
downsampled["Rating"].value_counts()

1    10000
2    10000
3    10000
4    10000
5    10000
Name: Rating, dtype: int64
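
Grouped sampling draws a different random subset each time it runs unless a seed is given. A minimal sketch of a reproducible variant is shown below, assuming a fixed `random_state`; the seed value and the toy frame are illustrative and not part of the original pipeline.

```python
import pandas as pd

# Toy balanced dataset: 5 ratings x 20 rows each (illustrative data only)
toy = pd.DataFrame({
    "Rating": [rating for rating in range(1, 6) for _ in range(20)],
    "ReviewFull": [f"review {i}" for i in range(100)],
})

# Draw the same number of rows from every rating group; fixing random_state
# makes the draw repeatable across runs (the seed value itself is arbitrary)
sample_a = toy.groupby("Rating").sample(n=4, random_state=42)
sample_b = toy.groupby("Rating").sample(n=4, random_state=42)

# Each rating should appear exactly n times in the sample
counts = sample_a["Rating"].value_counts().sort_index()
print(counts)
print(sample_a.equals(sample_b))
```

A fixed seed means the EDA and modeling notebooks can be rerun against exactly the same 50,000 reviews.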

In [10]:
# Save data
downsampled.to_csv(join(data_dir, "downsampled_preprocessed_train_50000.csv"))