# *Module 2 - NLP*
___

# 1. Project summary
___

This is part of a project to build a sentiment classifier trained on Yelp review data (https://www.yelp.com/dataset). The project is divided into several modules that perform different parts of the analysis, e.g., data cleaning, data processing, and model training. The goal is to predict the sentiment of a document: the written Yelp review serves as the document text, and its 1-5 star rating acts as a proxy for sentiment. The project is written in Python in Jupyter notebooks and makes use of a range of data science tools like pandas, spaCy, word2vec, and Keras. My motivation in starting this project is to build my skillset, learn new tools, and improve as a data scientist. It is an ongoing project and may see many updates/iterations.

# 2. Module overview
___

## Goal
- The goal of this module is to further prepare the Yelp review data by performing natural language processing on the text data
    - Tokenization:
        - Text data are passed through a pretrained English language model that will tokenize the text
    - Filtering:
        - Unnecessary tokens like punctuation will be filtered from the text
    - Lemmatization:
        - The remaining tokens will be lemmatized for sentiment analysis
- The above steps are organized for clarity but are not performed in the exact order listed
- Processed data are saved in an output JSON file for processing by the next module

## Data
- Input Data:
    - `yelp-dataset-reviews-prepared`
    - Kaggle: https://www.kaggle.com/datasets/gabrielmadigan/yelp-dataset-reviews-prepared/data/intermediate
    - Derived from: `yelp-dataset`
        - Kaggle: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset
        - Yelp: https://www.yelp.com/dataset
        - Description from the kaggle dataset page:
        > This dataset is a subset of Yelp's businesses, reviews, and user data. It was originally put together for the Yelp Dataset Challenge which is a chance for students to conduct research or analysis on Yelp's data and share their discoveries. In the most recent dataset you'll find information about businesses across 8 metropolitan areas in the USA and Canada.
- This dataset has been prepared for this module following inspection, cleaning, and reduction of the original Yelp dataset.

## Libraries
- Key Libraries:
    - `pandas` - used to read, load, store, inspect, process, and save the data
        - webpage: https://pandas.pydata.org/
    - `spaCy` - a powerful natural language processing library used for tokenizing the text data
        - webpage: https://spacy.io/

## Output
- Output:
    - `/kaggle/working/data/intermediate/lemmatized-data.json` - contains the lemmatized reviews and corresponding labels
    - `/kaggle/working/data/intermediate/lemmatized-sentences.json` - contains lemmatized sentences from all reviews, needed when we train the word embeddings in the next module

# 3. Import libraries
___

- I will be importing `pandas` to read, load, and handle the data
- The `spaCy` library is used to perform NLP on the text data
- `spaCy` comes with pretrained language models that can be loaded and used to process text, including tokenization, part-of-speech tagging, lemmatization, named entity recognition, etc.
- The `Sentencizer` pipeline component from `spaCy` is used to extract the individual sentences from a document, which we will need for word embeddings in a later module
- The `tqdm` package is useful for displaying progress bars while processing data (https://tqdm.github.io/)
- `InteractiveShell` from `IPython` is used to display the output of every command in a cell, which saves me from having to write lots of print statements

In [1]:
# Libraries for reading and handling data
import os
import pandas as pd

# Libraries for NLP
import spacy
from spacy.tokens import Doc
from spacy.pipeline import Sentencizer

# tqdm allows us to display a progress bar for long loops
from tqdm import tqdm 

# Settings for displaying commands in a cell
from IPython.core.interactiveshell import InteractiveShell

- Some settings for the notebook that aid with analysis

In [2]:
# Display output of every command in a cell
InteractiveShell.ast_node_interactivity = 'all'

# shows a progress bar while manipulating a pandas df or series
tqdm.pandas()

# 4. Read data
___

- Read the input files and load the data into a pandas dataframe
- Inspect the dataframe for data-type, size, contents, etc.

In [3]:
# Read in the cleaned/prepared Yelp review data as a pandas dataframe
INPUT_FILE = '/kaggle/input/yelp-dataset-reviews-prepared/data/intermediate/cleaned-reduced-data.json'
OUT_PATH = '/kaggle/working/'

# Save all intermediate data to this directory (created along with any missing parents)
DATA_PATH = os.path.join(OUT_PATH, 'data')
INT_DATA_PATH = os.path.join(DATA_PATH, 'intermediate')
os.makedirs(INT_DATA_PATH, exist_ok=True)

df = pd.read_json(INPUT_FILE, lines=True)

# Inspect the size and data-types of the df
df.info()

# Inspect the first few rows
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   stars         20000 non-null  int64 
 1   cleaned_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


|   | stars | cleaned_text |
|---|-------|--------------|
| 0 | 3 | If you decide to eat here, just be aware it is... |
| 1 | 5 | I've taken a lot of spin classes over the year... |
| 2 | 3 | Family diner. Had the buffet. Eclectic assortm... |
| 3 | 5 | Wow! Yummy, different, delicious. Our favorite... |
| 4 | 4 | Cute interior and owner (?) gave us tour of up... |
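
The `lines=True` argument in `pd.read_json` above tells pandas that the input is in JSON Lines layout, i.e., one JSON object per line rather than one large JSON array. The snippet below is a minimal standard-library sketch of that layout; the two records are made-up placeholders, not rows from the Yelp data.

```python
import io
import json

# Two made-up records in JSON Lines layout: one JSON object per line
raw = io.StringIO(
    '{"stars": 5, "cleaned_text": "Great food."}\n'
    '{"stars": 2, "cleaned_text": "Slow service."}\n'
)

# Parse each non-empty line on its own; every line is a complete JSON document
records = [json.loads(line) for line in raw if line.strip()]

stars = [record["stars"] for record in records]
print(len(records), stars)
```

Because every line is self-contained, a file like this can be read or written record by record, which is convenient for larger review sets.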


# 5. Data featurization - NLP
___

## Load English language pipeline with spaCy

- To begin processing the text data, I construct a spaCy pipeline by loading in the pretrained English language model: `en_core_web_lg`
- Then I add specific tools to the pipeline:
    - `sentencizer` for identifying sentences (needed for training the word embeddings later on)
    - `merge_entities` for merging each recognized named entity into a single token (we want to treat, e.g., a restaurant's name as a single word)

In [4]:
# load English language pipeline from spacy
# Add sentencizer; we will need sentences for word2vec training
# Add merge_entities to boost performance; e.g., treat "Starbucks Coffee" as one token, not two, i.e., "Starbucks" and "Coffee"
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe('sentencizer')
nlp.add_pipe('merge_entities')

<spacy.pipeline.sentencizer.Sentencizer at 0x7996919ba940>

<function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>
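
To make the effect of `merge_entities` concrete, here is a toy sketch of merging a token span into a single token. The helper `merge_spans` and the hand-written entity span are hypothetical illustrations; this is not how spaCy implements the component, and no named entity recognition is performed.

```python
def merge_spans(tokens: list[str], spans: list[tuple[int, int]]) -> list[str]:
    """Join each (start, end) token span into one token; end is exclusive.

    Spans are assumed to be non-empty and non-overlapping.
    """
    span_ends = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in span_ends:
            end = span_ends[i]
            merged.append(" ".join(tokens[i:end]))  # collapse the span into one token
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "love", "Starbucks", "Coffee", "."]
entity_spans = [(2, 4)]  # hand-written span covering "Starbucks Coffee"
merged = merge_spans(tokens, entity_spans)
print(merged)
```

This kind of merging is presumably why multi-word items such as "the year" show up as single entries in the lemmatized output later in this notebook.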

## Tokenize text

- I can now pass text as a string to the pretrained pipeline that will return a spaCy `Doc` object
- A spaCy `Doc` is a container for all the information gathered in the pipeline, including the original text, word tokens, annotations, etc.
- Tokenize the text data by creating a `Doc` for each Yelp review, and add it to the dataframe in a new column

In [5]:
# Tokenize the cleaned text
# Pass the text into the language pipe and return a list of spaCy Docs
# NB: this can take a long time - streaming the texts through nlp.pipe is faster than calling the language model on each row of the df with the apply method
df['doc'] = list(tqdm(nlp.pipe(df['cleaned_text'], batch_size=64, n_process=4), total=len(df.index)))

100%|██████████| 20000/20000 [02:41<00:00, 123.91it/s]
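
`nlp.pipe` consumes the texts as a stream and works through them in batches of `batch_size` instead of one text at a time. The helper `batched` below is a hypothetical illustration of that chunking idea only; it is not spaCy's implementation and does no NLP.

```python
from itertools import islice
from typing import Iterable, Iterator


def batched(items: Iterable[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive lists of up to batch_size items from an iterable."""
    iterator = iter(items)
    # islice pulls at most batch_size items; an empty list ends the loop
    while batch := list(islice(iterator, batch_size)):
        yield batch


texts = [f"review {i}" for i in range(10)]  # placeholder texts
batches = list(batched(texts, batch_size=4))
print([len(batch) for batch in batches])
```

With 20,000 reviews and `batch_size=64`, chunking like this would give 313 batches: 312 full ones and a final batch of 32.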


- spaCy comes with a great visualization tool called `displacy`
- Let's use this to look at the performance of the named entity recognition

In [6]:
# Display an example doc with named entities highlighted to verify performance of spacy pipe
spacy.displacy.render(df['doc'][1], style='ent', jupyter=True)

- Not too bad; the model correctly identified:
    - "`Body Cycle`" as an `ORG` in both instances - also it merged the two words "`Body`" and "`Cycle`" into a single token
    - "`Russell`" as a `PERSON` in all 3 instances
    - "`the years`" as a `DATE` - not super critical for this analysis but validates the overall performance

## Lemmatize and filter text

- For sentiment analysis, we do not need (or want) high variation in our text, so lemmatizing the text should yield a better model
- The lemma of a word is stored as an attribute (`lemma_`) on each token
- Define two functions that loop through the tokens of a Doc and return a list of the lemmas as strings
- The first function also filters the text, removing stop words, punctuation, and whitespace tokens; like lemmatization, this reduces variation that can adversely affect sentiment classifier training
- The second function processes the text for training the word embeddings later on, which will benefit from a higher density of contextual information like punctuation and stop words, so it applies no filters
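
As a rough illustration of how lemmatization reduces variation, the sketch below maps a few inflected forms to a shared lemma and compares vocabulary sizes. The `LEMMAS` lookup table is hand-written and hypothetical; spaCy's lemmatizer is considerably more sophisticated.

```python
from collections import Counter

# Hypothetical, hand-written lemma table for illustration only
LEMMAS = {
    "took": "take", "taken": "take", "takes": "take",
    "classes": "class",
    "was": "be", "is": "be",
}

words = ["took", "taken", "takes", "class", "classes", "was", "is"]
lemmas = [LEMMAS.get(word, word) for word in words]  # unknown words map to themselves

print(len(set(words)), "distinct surface forms")
print(len(set(lemmas)), "distinct lemmas")
print(Counter(lemmas))
```

Fewer distinct tokens means each one is seen more often during training, which is the motivation for lemmatizing before fitting the classifier.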

In [7]:
# Some functions for processing tokenized text
def lemmatize_filter(doc: Doc) -> list[str]:
    '''Lemmatize tokens and apply filters to doc.'''
    return [token.lemma_ for token in doc if not token.is_stop and not token.is_punct and not token.is_space]

def lemmatize(doc: Doc) -> list[str]:
    '''Lemmatize tokens in doc.'''
    return [token.lemma_ for token in doc]

- Apply the lemmatization and filtering to each review Doc and add to the dataframe as a new column

In [8]:
# Lemmatize the docs and add the list of lemmas to the df
df['lemmatized_text'] = df['doc'].progress_apply(lemmatize_filter)
df.head()

100%|██████████| 20000/20000 [00:01<00:00, 16819.18it/s]


|   | stars | cleaned_text | doc | lemmatized_text |
|---|-------|--------------|-----|-----------------|
| 0 | 3 | If you decide to eat here, just be aware it is... | (If, you, decide, to, eat, here, ,, just, be, ... | [decide, eat, aware, go, about 2 hour, begin, ... |
| 1 | 5 | I've taken a lot of spin classes over the year... | (I, 've, taken, a, lot, of, spin, classes, ove... | [take, lot, spin, class, the year, compare, cl... |
| 2 | 3 | Family diner. Had the buffet. Eclectic assortm... | (Family, diner, ., Had, the, buffet, ., Eclect... | [family, diner, buffet, eclectic, assortment, ... |
| 3 | 5 | Wow! Yummy, different, delicious. Our favorite... | (Wow, !, Yummy, ,, different, ,, delicious, .,... | [wow, yummy, different, delicious, favorite, l... |
| 4 | 4 | Cute interior and owner (?) gave us tour of up... | (Cute, interior, and, owner, (, ?, ), gave, us... | [cute, interior, owner, give, tour, upcoming, ... |
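
To show what the filtered and unfiltered variants keep and discard without loading a language model, here is a small sketch using a stand-in token class. `ToyToken` is hypothetical and only mimics the four spaCy token attributes used in the functions above; the sentence and its annotations are made up by hand.

```python
from dataclasses import dataclass


@dataclass
class ToyToken:
    """Hypothetical stand-in exposing the spaCy token attributes used above."""
    lemma_: str
    is_stop: bool = False
    is_punct: bool = False
    is_space: bool = False


# Hand-annotated tokens for the made-up sentence "The food is good, but slow."
doc = [
    ToyToken("the", is_stop=True),
    ToyToken("food"),
    ToyToken("be", is_stop=True),
    ToyToken("good"),
    ToyToken(",", is_punct=True),
    ToyToken("but", is_stop=True),
    ToyToken("slow"),
    ToyToken(".", is_punct=True),
]

# Same conditions as lemmatize_filter: drop stop words, punctuation, whitespace
filtered = [t.lemma_ for t in doc if not t.is_stop and not t.is_punct and not t.is_space]
# Same as lemmatize: keep every token's lemma
unfiltered = [t.lemma_ for t in doc]

print(filtered)
print(unfiltered)
```

The filtered list is the compact form fed to the classifier, while the unfiltered list keeps the surrounding context that the word embeddings are meant to learn from.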


## Separate and lemmatize sentences
- Separate out each sentence in each review Doc and store the sentences in a new dataframe
- Apply the lemmatization with NO filtering to each sentence in the new dataframe

In [9]:
# Make new df containing all sentences in all docs
# This will be saved and used when creating word embeddings in the future
# Lemmatize but do not filter - we want as much context as possible to create meaningful word embeddings
sents = [sent for doc in df['doc'] for sent in doc.sents]
sents_df = pd.DataFrame({'sents': sents})
sents_df['lemmatized_sents'] = sents_df['sents'].progress_apply(lemmatize)
sents_df.head()

100%|██████████| 155488/155488 [00:02<00:00, 76659.71it/s]


|   | sents | lemmatized_sents |
|---|-------|------------------|
| 0 | (If, you, decide, to, eat, here, ,, just, be, ... | [if, you, decide, to, eat, here, ,, just, be, ... |
| 1 | (We, have, tried, it, multiple, times, ,, beca... | [we, have, try, it, multiple, time, ,, because... |
| 2 | (I, have, been, to, it, 's, other, locations, ... | [I, have, be, to, it, be, other, location, in,... |
| 3 | (The, food, is, good, ,, but, it, takes, a, ve... | [the, food, be, good, ,, but, it, take, a, ver... |
| 4 | (The, waitstaff, is, very, young, ,, but, usua... | [the, waitstaff, be, very, young, ,, but, usua... |
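
The flattening step above turns a column of reviews into one row per sentence. The sketch below mimics that with a simplified punctuation rule; `split_sentences` is a hypothetical stand-in for spaCy's sentence boundaries, and the token lists are short examples modeled on the previews above.

```python
def split_sentences(
    tokens: list[str], punct_chars: tuple[str, ...] = (".", "!", "?")
) -> list[list[str]]:
    """Close the current sentence after each token found in punct_chars."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in punct_chars:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack closing punctuation
        sentences.append(current)
    return sentences


# Two short tokenized reviews
reviews = [
    ["Family", "diner", ".", "Had", "the", "buffet", "."],
    ["Wow", "!", "Yummy"],
]

# Flatten: one entry per sentence, regardless of which review it came from
all_sentences = [sent for review in reviews for sent in split_sentences(review)]
print(len(all_sentences))
print(all_sentences)
```

In this notebook the same flattening turns 20,000 reviews into 155,488 sentences, or about 7.8 sentences per review.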


# 6. Data reduction - feature removal
___

- The original review text and review Docs are now irrelevant features, so we can remove these columns from the "reviews" dataframe
- For the "sentences" dataframe, we no longer need the un-lemmatized sentence Docs, so we drop that column as well

In [10]:
# We can now drop the cleaned_text and doc columns
df = df.drop(columns=['cleaned_text', 'doc'])
df.head()
sents_df = sents_df.drop(columns=['sents'])
sents_df.head()

|   | stars | lemmatized_text |
|---|-------|-----------------|
| 0 | 3 | [decide, eat, aware, go, about 2 hour, begin, ... |
| 1 | 5 | [take, lot, spin, class, the year, compare, cl... |
| 2 | 3 | [family, diner, buffet, eclectic, assortment, ... |
| 3 | 5 | [wow, yummy, different, delicious, favorite, l... |
| 4 | 4 | [cute, interior, owner, give, tour, upcoming, ... |


|   | lemmatized_sents |
|---|------------------|
| 0 | [if, you, decide, to, eat, here, ,, just, be, ... |
| 1 | [we, have, try, it, multiple, time, ,, because... |
| 2 | [I, have, be, to, it, be, other, location, in,... |
| 3 | [the, food, be, good, ,, but, it, take, a, ver... |
| 4 | [the, waitstaff, be, very, young, ,, but, usua... |


# 7. Save data
___
- I can finally save the processed data and sentences as output files for use in other modules
- I am saving the data in the JSON format to remain consistent with the input files

In [11]:
# The reviews are now processed into sequences of lemmas
# Also every sentence in the reviews is now processed into a sequence of lemmas
# The next step will be to further process the text data into word embeddings

# Save the current state of the data so it can be read by other notebooks
df.to_json(os.path.join(INT_DATA_PATH, 'lemmatized-data.json'), orient='records', lines=True)
sents_df.to_json(os.path.join(INT_DATA_PATH, 'lemmatized-sentences.json'), orient='records', lines=True)

# Read the saved data back in to verify the format
df = pd.read_json(os.path.join(INT_DATA_PATH, 'lemmatized-data.json'), orient='records', lines=True)
sents_df = pd.read_json(os.path.join(INT_DATA_PATH, 'lemmatized-sentences.json'), orient='records', lines=True)

# Inspect the size and data-types of the df
df.info()
sents_df.info()

# Inspect the first few rows
df.head()
sents_df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   stars            20000 non-null  int64 
 1   lemmatized_text  20000 non-null  object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155488 entries, 0 to 155487
Data columns (total 1 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   lemmatized_sents  155488 non-null  object
dtypes: object(1)
memory usage: 1.2+ MB


|   | stars | lemmatized_text |
|---|-------|-----------------|
| 0 | 3 | [decide, eat, aware, go, about 2 hour, begin, ... |
| 1 | 5 | [take, lot, spin, class, the year, compare, cl... |
| 2 | 3 | [family, diner, buffet, eclectic, assortment, ... |
| 3 | 5 | [wow, yummy, different, delicious, favorite, l... |
| 4 | 4 | [cute, interior, owner, give, tour, upcoming, ... |


|   | lemmatized_sents |
|---|------------------|
| 0 | [if, you, decide, to, eat, here, ,, just, be, ... |
| 1 | [we, have, try, it, multiple, time, ,, because... |
| 2 | [I, have, be, to, it, be, other, location, in,... |
| 3 | [the, food, be, good, ,, but, it, take, a, ver... |
| 4 | [the, waitstaff, be, very, young, ,, but, usua... |
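
A side benefit of saving in JSON is that list-valued columns such as `lemmatized_text` come back as lists after a round trip. The sketch below checks this with the standard library only; the records and the file name are made-up placeholders, not the notebook's actual output.

```python
import json
import os
import tempfile

# Made-up records shaped like the output: an integer label plus a list of lemmas
records = [
    {"stars": 5, "lemmatized_text": ["wow", "yummy", "delicious"]},
    {"stars": 3, "lemmatized_text": ["family", "diner", "buffet"]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-lemmatized-data.json")  # hypothetical file name

    # Write one JSON object per line
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    # Read the file back and rebuild the records
    with open(path, encoding="utf-8") as f:
        restored = [json.loads(line) for line in f]

print(restored == records)
print(type(restored[0]["lemmatized_text"]).__name__)
```

The next module can therefore iterate over the lemmas directly, without having to parse them out of a string first.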
