# *Module 1 - Data Preparation*
---

# 1. Project Summary
___

This is part of a project to build a sentiment classifier trained on Yelp review data (https://www.yelp.com/dataset). The project is divided into several modules that handle different parts of the analysis, e.g., data cleaning, data processing, and model training. The goal is to predict the sentiment of a document: the written Yelp review serves as the document text, and its 1-5 star rating serves as a proxy for sentiment. The project is written in Python on Jupyter notebooks and makes use of a range of data science tools like pandas, spaCy, word2vec, and keras. My motivation in starting this project is to build my skillset, learn new tools, and improve as a data scientist. It is an ongoing project and may see many updates/iterations.

# 2. Module Overview
___

## Goal
- The goal of this module is to prepare Yelp review data for eventual training and evaluation of a sentiment classifier
    - Data Exploration:
        - The data are inspected to identify structure, data-types, data classes, distributions, missing values, etc.
    - Data Reduction:
        - Only a representative subsample of the dataset is kept for more efficient data analysis
        - Irrelevant features are removed (e.g., date, business ID)
    - Data Cleaning:
        - Data are cleaned to ensure no missing values, duplicates, etc.
        - Minimal cleaning of the review text is performed, e.g., replacing odd whitespaces, backslashes, etc.
- The above steps are organized for clarity but are not performed in the exact order listed
- Prepared data are saved in an output json file for processing by the next module

## Data
- Input Data:
    - `yelp-dataset`
    - Kaggle: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
    - Yelp: https://www.yelp.com/dataset
- Description from the kaggle dataset page:
> This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
- Specifically, I analyze the `yelp-dataset/yelp_academic_dataset_review.json` file which contains data pertaining to Yelp user reviews and includes the text of the review and an associated 1-5 star rating of the Yelp user's experience (among other data, e.g., date, business ID, etc.)
- Data are indexed per review with no discernible sorting

## Libraries
- Key Libraries:
    - `Pandas` - used to read, load, store, inspect, process, and save the data
        - webpage: https://pandas.pydata.org/

## Output
- Output:
    - `/kaggle/working/cleaned_reduced_data.json`
- This json file contains a reduced dataset of the input data that has been cleaned and prepared for NLP

# 3. Import Libraries
___

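- I will mainly be using `pandas` at this stage, with `seaborn` for plot styling and `subprocess` for counting the lines in the input file
- The `InteractiveShell` class from `IPython` ensures every command in a cell is displayed, which saves me from having to write lots of print statements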

In [None]:
# Libraries for reading, handling, and visualizing data
import pandas as pd
import seaborn as sns

# For quickly getting file lengths
from subprocess import check_output

# Settings for displaying commands in a cell
from IPython.core.interactiveshell import InteractiveShell

- Some settings for the notebook that aid with analysis

In [None]:
# Display output of every command in a cell
InteractiveShell.ast_node_interactivity = 'all'

# Set default seaborn theme for plots
sns.set_theme()

# 4. Data Exploration, Reduction, and Cleaning
___

## Data exploration - file size, structure

- I'll now explore the data, reducing and cleaning in situ
- Start by opening the input file `/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json`
- Read the first line to get a sense for how the file is structured

In [None]:
# Inspect the structure of the Yelp review data json file using a single entry
INPUT_FILE = '/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json'
with open(INPUT_FILE, 'r') as json_file:
    json_file.readline()

- It is apparent that a single line is a self-contained json object corresponding to a single observation (review)
- The keys are data labels and the values are data values
- Let's see how many observations this file contains

In [None]:
# Get the number of lines in the json file, equal to the number of data observations (reviews)
int(check_output(['wc', '-l', INPUT_FILE]).split()[0])

- Almost 7 million reviews! This is far more than I will need for analysis and will slow the processing down considerably
- Before I take a subsample, let's make sure that the reviews are not ordered in any meaningful way
- If they are not, we can simply truncate the data rather than randomly sample it, which saves time: truncation reads only the first m < n lines, whereas random sampling requires a pass through all n lines

In [None]:
# Read and inspect the first few lines of the json file to see if review data are shuffled
# Use a pandas dataframe for easier inspection
df_head_10 = pd.read_json(INPUT_FILE, lines=True, nrows=10)
df_head_10.head(10)

- There does not appear to be sorting in any column in these 10 rows, so we will assume the data are properly shuffled
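
If the data had turned out to be ordered, one alternative to truncation would be a single-pass random sample of lines. The sketch below uses reservoir sampling (Algorithm R) on a small toy JSON-lines file; the function name `reservoir_sample_lines`, the toy file, and its `review_id` field are hypothetical stand-ins and not part of this notebook's pipeline.

```python
import json
import os
import random
import tempfile


def reservoir_sample_lines(path, k, seed=0):
    """Return k lines chosen uniformly at random from a text file in one pass."""
    rng = random.Random(seed)
    sample = []
    with open(path, 'r', encoding='utf-8') as f:
        for i, line in enumerate(f):
            if i < k:
                # Fill the reservoir with the first k lines
                sample.append(line)
            else:
                # Keep the new line with probability k / (i + 1)
                j = rng.randint(0, i)
                if j < k:
                    sample[j] = line
    return sample


# Toy JSON-lines file standing in for the real review file (assumption)
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for review_id in range(100):
        tmp.write(json.dumps({'review_id': review_id, 'stars': 1 + review_id % 5}) + '\n')
    toy_path = tmp.name

sampled_lines = reservoir_sample_lines(toy_path, k=10)
sampled_ids = [json.loads(line)['review_id'] for line in sampled_lines]
os.remove(toy_path)
```

Note that this still reads all n lines, which is exactly the cost that truncation avoids.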

## Data reduction - subsampling

- I will truncate the number of lines read so we only load a representative subsample of the total data
- Read a subsample of the data from the input file `/kaggle/input/yelp-dataset/yelp_academic_dataset_review.json` and load into a pandas dataframe

In [None]:
# Read the Yelp review data
# Dataset is large (~7 million reviews!) 
# Read the first 20,000 lines of the input file and load the data subsample into a pandas dataframe for analysis 
READ_SIZE = 20000
df = pd.read_json(INPUT_FILE, lines=True, nrows=READ_SIZE)

## Data exploration - structure, dtypes, null values, duplicates

- After reading and loading the data subsample, I inspect the contents of the dataframe: identify the data-types, check for null entries and duplicates, and peek at the first few rows of data

In [None]:
# Inspect the size and data-types of the df
df.info()

# Check for null/missing values
df.isnull().values.any()

# Check for any duplicate entries
df.duplicated().values.any()

# Inspect the first few rows
df.head()

- Fortunately there are no null-valued or duplicate entries in our sample
- We can see there are 9 columns with int and str data-types
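
Had the checks above found problems, a cleanup along the following lines might be enough. This is a minimal sketch on a toy dataframe (the rows are made up for illustration), not a step this notebook needed to run.

```python
import pandas as pd

# Toy data with one missing review and one exact duplicate (made-up rows)
toy = pd.DataFrame({
    'stars': [5, 1, 1, 3],
    'text': ['Great food', 'Awful service', 'Awful service', None],
})

# Same checks as in the cell above
has_nulls = toy.isnull().values.any()
has_duplicates = toy.duplicated().values.any()

# Drop rows with missing text, then drop exact duplicate rows
cleaned = toy.dropna(subset=['text']).drop_duplicates().reset_index(drop=True)
```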

## Data reduction - feature selection

- I am only interested in the "text" and "stars" columns for the purposes of sentiment analysis
- Remove all but these two columns from the dataframe

In [None]:
# Remove columns we will not need from the df
# For building a sentiment classifier we will only need to keep:
# 1. Written reviews ("text") for training features
# 2. Star ratings ("stars") for training targets
df = df[['stars','text']]

## Data exploration - class distributions
- Let's explore the star ratings in the 'stars' column, which will serve as our categories or class labels
- Start by inspecting the 'stars' column to understand the values

In [None]:
# Check the values of the class labels ("stars")
sorted(pd.unique(df['stars']))
len(pd.unique(df['stars']))

- We can see the values are integers ranging from 1 to 5
- This might seem obvious, but it's good to check since these will become our class labels
- Now look at how these values are distributed

In [None]:
NUM_CLASSES = 5
MIN_STAR_VAL = 1
MAX_STAR_VAL = 5

# Inspect the distribution of classes in the dataset
df['stars'].value_counts(ascending=True)
df['stars'].plot.hist(bins=NUM_CLASSES, range=(MIN_STAR_VAL-.5,MAX_STAR_VAL+.5), xlabel='stars')

- The data are heavily biased toward 5 star ratings, followed by 4 stars.
- There are slightly more 1 star ratings than 2 star
- This is not very surprising: people are more inclined to leave positive reviews after a great experience with a business, and are more likely to give strong statements like 5 stars or 1 star than neutral statements like 2 or 3 stars
- I am slightly surprised there are not more 1 star reviews, as I imagine many people feel inclined to leave feedback after a very negative experience
- Based on a cursory read of some of the more negative sounding reviews, it seems as though negative reviews are distributed among the 1, 2, and 3 star ratings, i.e., even a very negatively written review can be awarded a 2 or even 3 star rating
- I could reduce the data by taking a subset with a uniform distribution of star ratings, as training a model on data with an uneven class distribution can lead to bias
- There is also a good reason to keep this distribution, as it is a real feature of user sentiment and we want a model that will reflect that
- I will retain this data and make that decision later in the analysis
- This also allows me to group the classes down the road for, say, a binary classifier, with greater data retention
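
To make the two options above concrete, here is a hedged sketch on a toy dataframe. The balancing strategy (downsampling every class to the size of the smallest one) and the binary grouping (dropping 3 star reviews and treating 4-5 stars as positive) are illustrative assumptions; the actual choice is deferred to a later module.

```python
import pandas as pd

# Toy data with an uneven class distribution (made-up rows)
toy = pd.DataFrame({
    'stars': [5, 5, 5, 5, 4, 4, 3, 2, 1, 1],
    'text': [f'review {i}' for i in range(10)],
})

# Option 1: uniform class distribution by downsampling each class
# to the size of the smallest class
smallest = toy['stars'].value_counts().min()
balanced = toy.groupby('stars').sample(n=smallest, random_state=0)

# Option 2: group the classes for a binary classifier
# Assumption: 3 star reviews are dropped, 4-5 stars are positive (1), 1-2 stars are negative (0)
binary = toy[toy['stars'] != 3].copy()
binary['label'] = (binary['stars'] >= 4).astype(int)
```

Downsampling discards most of the majority class, which is one reason to postpone the decision until the modelling stage.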

## Data cleaning - text data

- Now I focus on cleaning the text data, i.e., the written review of the business
- Start by checking for odd whitespaces including:
    - newline characters `\n`
    - tabs `\t`
    - backslashes or escape characters `\`
    - double spaces, triple spaces, etc.

In [None]:
# Now let's clean up the textual data

# This lets us see the entire text of each review - better for analysis
pd.set_option('display.max_colwidth', 10000)

# Search for any odd whitespace or backslashes in the text
df['text'].str.contains('\n', regex=True).any() # newline
df['text'].str.contains('\t', regex=True).any() # tab
df['text'].str.contains('  ', regex=True).any() # double space
df['text'].str.contains(r'\\', regex=True).any() # backslash

- There are newline characters, backslashes, and double spaces present
- Let's take a look at an example of each occurrence

In [None]:
# Look at one example of each
pd.set_option('display.max_colwidth', 10000) # display the entire review
df[df['text'].str.contains('\n', regex=True)]['text'].head(1)
df[df['text'].str.contains(r'\\', regex=True)]['text'].head(1)
df[df['text'].str.contains('  ', regex=True)]['text'].head(1)

- In the first example there are several `\n` chars
- In the second example we see a `\` in "`4\5 stars`" which is likely just a typo that the Yelp user intended as a `/` (also back-to-back `\n` chars)
- In the last example, there is a double space, also likely a typo

- While we could apply different replacements in each case, for the purposes of sentiment analysis it is sufficient to replace all of these whitespaces/backslashes with a single space; it will not drastically change the meaning of the text
- Replace the whitespaces and add the cleaned text as a new column to the dataframe
- Verify the cleaned data look as expected
- N.B. Punctuation and special characters will be dealt with in a separate module, where we use a pretrained English language model to tokenize the text

In [None]:
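# Replace all whitespace and backslashes with a single space
# Backslashes are replaced first so the whitespace pass also collapses any double spaces this creates
# Raw strings (r'...') keep Python from interpreting the regex escapes
df['cleaned_text'] = df['text'].str.replace(r'\\', ' ', regex=True)
df['cleaned_text'] = df['cleaned_text'].str.replace(r'\s+', ' ', regex=True)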

# Verify the changes with the same examples
df[df['text'].str.contains('\n', regex=True)]['cleaned_text'].head(1)
df[df['text'].str.contains(r'\\', regex=True)]['cleaned_text'].head(1)
df[df['text'].str.contains('  ', regex=True)]['cleaned_text'].head(1)
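
As a quick sanity check of the replacement logic, the sketch below applies it to three toy strings, one per case found above. It uses a single combined pattern, `[\s\\]+`, which is a variant I am assuming is equivalent for this purpose rather than the exact two-step replacement in the cell above; the toy strings are made up.

```python
import pandas as pd

# Toy reviews: repeated newlines, a backslash typo, and runs of spaces (made-up text)
toy = pd.Series([
    'Great\n\nfood',
    '4\\5 stars',
    'too  many   spaces',
])

# Collapse any run of whitespace and/or backslashes into one space,
# then trim leading/trailing spaces
cleaned = toy.str.replace(r'[\s\\]+', ' ', regex=True).str.strip()
print(cleaned.tolist())  # ['Great food', '4 5 stars', 'too many spaces']
```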

- At this point, the text data are clean of any odd whitespace and backslashes
- I could clean the text data further, such as removing numbers, punctuation, etc., but I will do that in a separate module focused on processing English text with specialized tools

## Data reduction - feature selection

- Now I will drop any columns I no longer need:
    - `text` - the raw, un-cleaned text
- Double check the data in the remaining columns look as expected

In [None]:
# We can now drop the unprocessed text column
df = df.drop(columns=['text'])
df.head()

# 5. Save data
___

- I can finally save the reduced and cleaned data as an output file for use in other modules
- I am saving the data in the JSON format to remain consistent with the input files

In [None]:
# The data are now prepared!
# The next step will be to process the text data and featurize the text and class labels

# Save the current state of the data so it can be read by other notebooks
# Use one path for both writing and reading so the two cannot drift apart
OUTPUT_FILE = '/kaggle/working/cleaned_reduced_data.json'
df.to_json(OUTPUT_FILE, orient='records', lines=True)

# Read the saved data back in to verify the format
df = pd.read_json(OUTPUT_FILE, orient='records', lines=True)

# Inspect the size and data-types of the df
df.info()

# Inspect the first few rows
df.head()
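
The save-and-reload check above can also be expressed as an explicit comparison. The following is a minimal sketch on a toy dataframe written to a temporary directory; the file name `toy_reviews.json` and the rows are hypothetical, and only the column names mirror the notebook's output.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same two columns as the notebook's output (made-up rows)
toy = pd.DataFrame({
    'stars': [5, 1],
    'cleaned_text': ['Great food', 'Awful service'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_reviews.json')
    # One JSON object per line, matching the format of the input file
    toy.to_json(toy_path, orient='records', lines=True)
    restored = pd.read_json(toy_path, lines=True)

# Compare column names and cell values after the round trip
same_columns = list(restored.columns) == list(toy.columns)
same_values = restored.values.tolist() == toy.values.tolist()
```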