# CIND 820 - Deliverable 3 - Data Preparation
## Data Preparation Section

The purpose of this section of the notebook is to set up our data so that it is ready for data analysis and ML model training. Since this project is largely text-based, we must engineer some features before we can begin our analysis. As such, breaking convention, our project starts with the *Data Preparation Section* before moving into the *Data Analysis Section*.

1. First, we need to fix the 'stars' column. It is stored as float64, and we want integers for cleaner grouping later.

2. Second, we need to prepare our 'text' column so that it is in a format acceptable to both the VADER library and scikit-learn's CountVectorizer for feature engineering. We need to create two versions of the 'text' column: one for VADER to read and generate sentiment scores, and another for CountVectorizer to create Bag-of-Words vectors for model training. We require two different columns because each needs a different level of pre-processing.

3. Third, we'll need to employ stratified sampling to balance out our star-ratings. We will also further reduce the size of our dataset for the purposes of model training.

4. Fourth, we'll engineer sentiment scores using the VADER library.

5. Finally, we'll create Bag-of-Words vectors using CountVectorizer, which will serve as additional features for our model.

The completion of these 5 steps will mark the end of the *Data Preparation Section* as the data will be ready for analysis and model training.

### Setting up the Notebook Environment

In [2]:
import os
os.chdir('C:/Users/Sunora/iCloudDrive/Documents/Learning Data Analytics/TMU Certificate copy/CIND 820/Yelp Dataset')

In [3]:
#Importing libraries and setting up environment
import pandas as pd # Used for Data Manipulation
import numpy as np # Used for Numerical Operations
import re # Used for Regular Expressions
import nltk # Natural Language Toolkit
from nltk.sentiment.vader import SentimentIntensityAnalyzer # Used for Sentiment Analysis (VADER)
# nltk.download('vader_lexicon') # Downloading VADER Lexicon
import spacy # Used for Advanced NLP tasks (tokenization, lemmatization, etc.)
nlp = spacy.load('en_core_web_sm') # Loading SpaCy English Model

#Defining working directory and reading data
filepath = r'C:\Users\Sunora\iCloudDrive\Documents\Learning Data Analytics\TMU Certificate copy\CIND 820\Yelp Dataset\yelp_review_sample_data.csv'
originalData = pd.read_csv(filepath)

# Previewing the Datatype Info
originalData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 349514 entries, 0 to 349513
Data columns (total 5 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   review_id    349514 non-null  object 
 1   user_id      349514 non-null  object 
 2   business_id  349514 non-null  object 
 3   stars        349514 non-null  float64
 4   text         349514 non-null  object 
dtypes: float64(1), object(4)
memory usage: 13.3+ MB


### 1. Fixing Stars Column

In [4]:
# Step 1. Converting 'stars' column from Float64 to Integer
originalData['stars'] = originalData['stars'].astype(int)
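
Casting with `astype(int)` truncates any fractional part, so it may be worth confirming that every rating is a whole number before converting. The following is a minimal sketch on a made-up frame; the `toy` DataFrame and the `all_whole` flag are illustrative names, not part of the Yelp data or the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'stars' column
toy = pd.DataFrame({'stars': [1.0, 3.0, 5.0, 4.0]})

# True only if no rating has a fractional part
all_whole = bool((toy['stars'] % 1 == 0).all())

if all_whole:
    # Safe to cast: no information is lost by dropping the decimal part
    toy['stars'] = toy['stars'].astype(int)

print(toy['stars'].tolist())
```

If the check were to fail, rounding or inspecting the offending rows first would probably be safer than casting directly.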

### 2. Text Pre-processing for VADER and Bag-of-Words Libraries

In [None]:
# Step 2. Cleaning the 'text' data column and performing text preprocessing

# General cleaning of 'text' column
originalData['text'] = originalData['text'].fillna('').astype(str) # Fill NaN values with empty strings and ensure all entries are strings

# Creating a function to perform basic text cleaning - applicable to both VADER and Bag-of-Words (BoW)
def text_clean_general(text):
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags (if any)
    text = re.sub(r'\s+', ' ', text)  # Replace multiple spaces with a single space (meaning double-spaces, new lines, tabs, etc.)
    return text.strip() # Remove leading/trailing spaces

# Applying the function defined above to create a new text column that is pre-processed for VADER analysis
originalData['text_VADER'] = originalData['text'].apply(text_clean_general) # Creating a new column 'text_VADER' for VADER analysis

# Loop for cleaning the 'text' column for BoW - to be used on the 'text_VADER' column as BoW needs more aggressive cleaning
corpus = originalData['text_VADER'].str.lower().tolist() # Converting the 'text_VADER' column to a list for processing; using VADER cleaned text as base
outputs = [] # List to store cleaned text and token counts - for easier dataframe conversion 
for doc in nlp.pipe(corpus, batch_size=1000, n_process=8): # Using nlp.pipe for efficient processing (due to size of dataset)
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha] # Lemmatization, removing stop words and non-alphabetic tokens
    cleaned_text = ' '.join(tokens) # Joining tokens back into a single string
    token_count = len(tokens) # Counting number of tokens after cleaning - an additional feature
    outputs.append((cleaned_text, token_count)) # Appending cleaned text and token count as a tuple to outputs list

# Gathering outputs from the loop (above) into the original dataframe as new columns
originalData[['text_BoW', 'num_Tokens']] = pd.DataFrame(outputs, index = originalData.index) # text_BoW for Bag of Words cleaned text, num_Tokens for number of tokens
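
To illustrate what the general cleaning step does on its own, here is the same function applied to a made-up review string. The `sample` text is hypothetical and chosen only to exercise the HTML and whitespace rules.

```python
import re

def text_clean_general(text):
    text = re.sub(r'<.*?>', '', text)   # Remove HTML tags (if any)
    text = re.sub(r'\s+', ' ', text)    # Collapse runs of whitespace into one space
    return text.strip()                 # Remove leading/trailing spaces

# Hypothetical review text with tags, newlines, tabs and extra spaces
sample = "  Great <b>food</b>!\n\nWould   come\tagain.  "
cleaned = text_clean_general(sample)
print(cleaned)  # Great food! Would come again.
```

Note that capitalization and punctuation survive this step. That is intentional for the VADER column, since VADER takes both into account when scoring, whereas the Bag-of-Words column is lowercased and stripped further.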


### 3. Stratified Sampling for Star Rating Balancing

In [6]:
# Using balancedData to store the new balanced dataset based on stratified sampling - 5000 samples per star rating
balancedData = (
    originalData[originalData['num_Tokens'] > 0]  # Filtering out rows with zero tokens after cleaning
    .groupby('stars', group_keys=False)
    .apply(lambda x: x.sample(n = 5000, random_state = 2025))
    .reset_index(drop=True)
)
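
As an alternative sketch of the same idea, pandas also offers `DataFrameGroupBy.sample`, which draws a fixed number of rows per group without going through `apply`. The toy frame below is invented for illustration, and this variant is not guaranteed to pick the same rows as the cell above even with the same `random_state`.

```python
import pandas as pd

# Hypothetical toy data: unbalanced star ratings, one row with zero tokens
toy = pd.DataFrame({
    'stars': [1] * 3 + [2] * 5 + [5] * 10,
    'num_Tokens': [4, 0, 7] + [3] * 5 + [2] * 10,
})

balanced_toy = (
    toy[toy['num_Tokens'] > 0]          # Drop rows that are empty after cleaning
    .groupby('stars')
    .sample(n=2, random_state=2025)     # Same number of rows from every rating
    .reset_index(drop=True)
)

counts = balanced_toy['stars'].value_counts().sort_index()
print(counts.to_dict())
```

Either approach requires every star rating to have at least `n` rows after filtering; otherwise sampling without replacement raises an error.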


In [7]:
balancedData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   review_id    25000 non-null  object
 1   user_id      25000 non-null  object
 2   business_id  25000 non-null  object
 3   stars        25000 non-null  int64 
 4   text         25000 non-null  object
 5   text_VADER   25000 non-null  object
 6   text_TFIDF   25000 non-null  object
 7   num_Tokens   25000 non-null  int64 
dtypes: int64(2), object(6)
memory usage: 1.5+ MB


### 4. Feature Engineering VADER-Sentiment Scores

In [8]:
# Generate VADER sentiment scores
sia = SentimentIntensityAnalyzer() # Initializing VADER Sentiment Intensity Analyzer
balancedData['vader_scores'] = balancedData['text_VADER'].apply(sia.polarity_scores) # Applying VADER to the 'text_VADER' column and storing results in a new column 'vader_scores'
balancedData = pd.concat([balancedData.drop(['vader_scores'], axis=1), balancedData['vader_scores'].apply(pd.Series)], axis=1) # Expanding the 'vader_scores' dictionary into separate columns

balancedData.head() # Previewing the final balanced dataset with VADER scores

| | review_id | user_id | business_id | stars | text | text_VADER | text_TFIDF | num_Tokens | neg | neu | pos | compound |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CoCim4CRm-WCoU-CFfWpLw | McdCFYocB1hFIiDQBRQ7YA | P_nqb7lULOtx3pAJbKfFXA | 1 | Santa Fe used to be my favorite restaurant. I ... | Santa Fe used to be my favorite restaurant. I ... | santa fe favorite restaurant enjoy cancun taco... | 48 | 0.119 | 0.779 | 0.102 | -0.0209 |
| 1 | 8s6Eejmy24XUhgNkR2uIUA | X67DbQdqHeZ-F2UVUOhn1g | WNjrsnJVPPnv_FtHHdjklA | 1 | I have never experienced this level of incompe... | I have never experienced this level of incompe... | experience level incompetence customer service... | 84 | 0.11 | 0.819 | 0.071 | -0.6697 |
| 2 | 2GdPCXF_5fR4_od5DJTD8Q | -VPeYf78MNJAB0iR7d9-zg | QboMIy08NLnBbLXEsmnDHg | 1 | I've noticed a "Rising sun" flag displayed in ... | I've noticed a "Rising sun" flag displayed in ... | notice rise sun flag display restaurant symbol... | 27 | 0.276 | 0.69 | 0.034 | -0.9612 |
| 3 | vYSCzz-jM7ibdoIUCRLysw | I0Vt1g8iK0D_cxXkJyXb0A | INz7vujcHs0AggsV__pXYQ | 1 | Sold me a part that was wrong size and wouldn'... | Sold me a part that was wrong size and wouldn'... | sell wrong size exchange get rob | 6 | 0.222 | 0.778 | 0.0 | -0.6449 |
| 4 | mLokfOcquwIP57pcOkHBZQ | TV3p-bv5yh8RgdJ3WxM7Ug | eh8WfQqPa2ZWtbXe9_wHgQ | 1 | I brought my car to Hyundai Service for a chec... | I brought my car to Hyundai Service for a chec... | bring car hyundai service check engine light c... | 44 | 0.055 | 0.919 | 0.026 | -0.3201 |
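
Expanding a column of dictionaries with `.apply(pd.Series)` works but can be slow on larger frames. One possible alternative, sketched below, builds the score columns directly from the list of dictionaries. The `fake_scores` function is a hypothetical stand-in that returns hard-coded placeholder values, not real VADER output; in the notebook it would be `sia.polarity_scores`.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'text_VADER' column
toy = pd.DataFrame({'text_VADER': ['Great food!', 'Terrible service.']})

def fake_scores(text):
    # Placeholder values only; this does NOT compute sentiment
    return {'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.1}

# Build one column per dictionary key, aligned on the original index
scores = pd.DataFrame(toy['text_VADER'].map(fake_scores).tolist(), index=toy.index)
toy = pd.concat([toy, scores], axis=1)

print(toy.columns.tolist())
```

The result has the same four score columns (`neg`, `neu`, `pos`, `compound`) without the intermediate `vader_scores` column.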


### 5. Creating Bag-of-Words Features

In [None]:
# Bag-of-Words Feature Engineering
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

balancedDataBoW = balancedData.copy() # Creating a copy of balancedData for Bag-of-Words processing

# Creating Bag-of-Words Matrix
bow_vectorizer = CountVectorizer(
    max_features = 5000, # Limiting to top 5000 features to manage dimensionality
    ngram_range = (1, 1), # Using unigrams only
    # min_df = 0.03, # Minimum document frequency of 3%
    # max_df = 0.75 # Maximum document frequency of 75%
    )

# Creating Bag-of-Words matrix
bow_matrix = bow_vectorizer.fit_transform(balancedDataBoW['text_BoW']) # Fitting the vectorizer on the cleaned 'text_BoW' column (created in Step 2) and transforming it into a Bag-of-Words matrix

# Adding Bag-of-Words features to balancedDataBoW DataFrame
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    index=balancedDataBoW.index,
    columns=bow_vectorizer.get_feature_names_out()
)
balancedDataBoW = pd.concat([balancedDataBoW.reset_index(drop=True), bow_df.reset_index(drop=True)], axis=1) # Concatenating the Bag-of-Words features to the balancedDataBoW DataFrame
balancedDataBoW.drop(columns=['text', 'text_VADER', 'text_BoW'], inplace=True) # Dropping text columns to reduce file size
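
For intuition about what the Bag-of-Words matrix contains, the following simplified sketch counts tokens by hand on a two-document toy corpus. It is not how CountVectorizer is implemented, and it splits on whitespace only, whereas CountVectorizer applies its own tokenization rules; the `docs`, `vocab`, and `matrix` names are illustrative.

```python
from collections import Counter

# Hypothetical, already-cleaned documents (like the 'text_BoW' column)
docs = ['good food good service', 'bad food']

# Vocabulary: every distinct token, sorted alphabetically
vocab = sorted({tok for doc in docs for tok in doc.split()})

# One row of counts per document, one column per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)   # ['bad', 'food', 'good', 'service']
print(matrix)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

In the notebook, `max_features = 5000` keeps only the 5000 most frequent terms across the corpus, so each review becomes a row of 5000 counts.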


In [15]:
# Checking the shape of the Bag-of-Words DataFrame
balancedDataBoW.shape

(25000, 5008)

### Saving Cleaned Dataset as separate .CSV file

In [16]:
# Save the balancedDataBoW data to a new CSV file to prevent future re-runs of the data cleaning step.
balancedDataBoW.to_csv('balancedDataBoW.csv', index=False)

## End of Data Preparation Section
This marks the end of the *Data Preparation Section*. In summary, we cleaned our original dataset, balanced our star ratings through stratified sampling, and engineered features from VADER sentiment scores and a Bag-of-Words matrix. We then dropped the 'text', 'text_VADER', and 'text_BoW' columns, which are no longer required because we have already created vectorized features from the text. The final cleaned dataset is saved as balancedDataBoW.csv.

### Next Steps
In the next phase, the *Data Analysis Section*, we will further examine and explore the balancedDataBoW.csv file.