---
title: "Data Cleaning"
format:
    html: 
        code-fold: false
---

<br>
<br>

## Overview

In this section, I clean the data gathered in the [Data Collection](../data-collection/main.ipynb) section. To begin, I load the raw data file from the `data/raw-data/` directory. From there, I drop columns that are not relevant to my [EDA](../eda/main.ipynb) and later modeling. Throughout this section, I steadily reduce the number of NA values until none remain in the final version of the dataset. I only drop rows with NA values where their removal would not seriously reduce the size of the dataset. The columns that met this criterion were `reviewText`, `reviewerName`, and `summary`; missing values in `vote` are replaced with 0 instead. I also assign proper data types to the columns, converting the number of community votes `vote` to an integer and the date of review posting `reviewTime` to datetime. Finally, I run the function `clean_review()` on the `reviewText` and `summary` columns. In the Challenges section, I highlight the computational bottleneck I faced when trying to run `clean_review()` on over 12 million entries, and how I improved its performance. The final result of this section is a clean dataset, ready for further exploration and modeling.

## Code 

### Importing Packages

First, let's import the necessary packages. 

In [1]:
# Packages
import gzip # For unzipping the raw data
import pandas as pd # Using pandas for easier data manipulation
import nltk # Using nltk for its list of stopwords
import string # string.punctuation will be used for text cleaning

import warnings # Turning off warnings for cleaner output

In [2]:
warnings.filterwarnings('ignore') #ignoring warnings

### Loading raw data

Now we can load the raw CSV we constructed in the previous section.

In [3]:
# Pathway to raw data
data_path = "../../data/raw-data/book_reviews.csv.gz"

# Unzip the CSV file
with gzip.open(data_path, 'rb') as f:
    # Read the CSV file into a dataframe
    reviews_raw = pd.read_csv(f)

# Display the first row
reviews_raw.head(1)

Unnamed: 0,overall,vote,verified,reviewTime,reviewerID,asin,style,reviewerName,reviewText,summary,unixReviewTime,image
0,5.0,67,True,"09 18, 1999",AAP7PPBU72QFM,151004714,{'Format:': ' Hardcover'},D. C. Carrad,This is the best novel I have read in 2 or 3 y...,A star is born,937612800,


Here is our first look at the raw dataframe - we can observe 12 columns containing the following variables:

- `overall`: Rating 
- `vote`: Number of community votes given to the review. Users often vote when they find a review helpful
- `verified`: Boolean variable that indicates whether the review comes from a verified purchase
- `reviewTime`: Time of the review (raw)
- `reviewerID`: ID of the reviewer
- `asin`: ID of the product
- `style`: Key-value object. In this case, it describes the format of the book (e.g. Paperback, Hardcover, Kindle, etc.)
- `reviewerName`: Name of the reviewer
- `reviewText`: Raw, unprocessed text contents of the review
- `summary`: The title of the user's review
- `unixReviewTime`: Time of the review in Unix time, which measures time by the number of non-leap seconds that have elapsed since 00:00:00 UTC on 1 January 1970 [@UnixTime]
- `image`: Image path (if any)
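
As a quick sanity check on the two time columns, the sketch below converts the `unixReviewTime` value from the sample row above using only Python's standard library. For this one row, the result matches `reviewTime`, which suggests the two columns carry the same date; this has not been verified for the whole dataset.

```python
from datetime import datetime, timezone

# unixReviewTime of the sample row shown above
unix_review_time = 937612800

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC
review_date = datetime.fromtimestamp(unix_review_time, tz=timezone.utc)

# Format the date the same way the raw reviewTime column is written
print(review_date.strftime("%m %d, %Y"))  # 09 18, 1999
```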

### Dropping Unnecessary Columns

First, let's drop the `unixReviewTime`, `style`, and `image` columns, since they will not be used in our analysis.

In [4]:
# list of columns to drop
drop_cols = ['unixReviewTime', 'image', 'style']

# dropping columns
reviews_raw = reviews_raw.drop(columns=drop_cols)

# ensuring correct columns were dropped
print(reviews_raw.columns)

Index(['overall', 'vote', 'verified', 'reviewTime', 'reviewerID', 'asin',
       'reviewerName', 'reviewText', 'summary'],
      dtype='object')


### Checking for missing values

Now let's check whether our dataset contains any NA values.

In [5]:
print(reviews_raw.isnull().sum())

overall               0
vote            5790916
verified              0
reviewTime            0
reviewerID            0
asin                  0
reviewerName       1472
reviewText         1640
summary             875
dtype: int64


It looks like our most important column, `reviewText`, has 1640 null values, so let's get rid of them.

In [6]:
# Keeping only the rows of reviews_raw where reviewText is not null
reviews_raw = reviews_raw[reviews_raw['reviewText'].notna()]

# Let's make sure this worked
print(f"Number of NULL values in reviewText Column: {reviews_raw['reviewText'].isnull().sum()}")

Number of NULL values in reviewText Column: 0


### Checking data types

Before moving forward, let's ensure that each column's data type is appropriate.

In [7]:
print(reviews_raw.dtypes)

overall         float64
vote             object
verified           bool
reviewTime       object
reviewerID       object
asin             object
reviewerName     object
reviewText       object
summary          object
dtype: object


Let's make the following change:

- Convert `vote` to an integer column, where NA values are replaced by 0

In [8]:
# Converting 'vote' to integer, while dropping commas from strings and replacing NA with 0
reviews_raw['vote'] = reviews_raw['vote'].replace({',': ''}, regex=True).fillna(0).astype(int)

print(f"Vote column data type: {reviews_raw['vote'].dtype}")
print(f"Number of NULL values in vote Column: {reviews_raw['vote'].isnull().sum()}")

Vote column data type: int64
Number of NULL values in vote Column: 0
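
To make the one-liner above easier to follow, here is a plain-Python sketch of what it does to a single value. `parse_vote` is a hypothetical helper written only for illustration; it is not part of the cleaning pipeline, and it assumes that votes arrive either as strings (possibly with thousands separators) or as missing values.

```python
import math

def parse_vote(raw):
    """Hypothetical helper mirroring the vote conversion for one value."""
    # Missing values (None or NaN) are treated as zero votes
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return 0
    # Strip thousands separators such as the comma in "1,234" before casting
    return int(str(raw).replace(",", ""))

print([parse_vote(v) for v in ["67", "1,234", None, float("nan")]])
# [67, 1234, 0, 0]
```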


Let's take another look at our null values.

In [9]:
print(reviews_raw.isnull().sum())

overall            0
vote               0
verified           0
reviewTime         0
reviewerID         0
asin               0
reviewerName    1468
reviewText         0
summary          787
dtype: int64


As a final step toward removing all null values, let's drop the rows where `reviewerName` or `summary` is null.

In [10]:
# Keeping only the rows where reviewerName is not null
reviews_raw = reviews_raw[reviews_raw['reviewerName'].notna()]
# Now doing the same for the summary column
reviews_raw = reviews_raw[reviews_raw['summary'].notna()]

# Finally, let's print our null report
print(reviews_raw.isnull().sum())

overall         0
vote            0
verified        0
reviewTime      0
reviewerID      0
asin            0
reviewerName    0
reviewText      0
summary         0
dtype: int64


Great! Now we are left with no more NA entries in our data.

### Renaming Columns

Before moving on to cleaning the actual text data, let's rename `overall` to `reviewRating` and `asin` to `productID` so that the column names are more intuitive and uniform.


In [11]:
# Renaming cols
reviews_raw.rename(columns={
    'overall': 'reviewRating',
    'asin': 'productID'
}, inplace=True)

# printing result
reviews_raw.head(1)


Unnamed: 0,reviewRating,vote,verified,reviewTime,reviewerID,productID,reviewerName,reviewText,summary
0,5.0,67,True,"09 18, 1999",AAP7PPBU72QFM,151004714,D. C. Carrad,This is the best novel I have read in 2 or 3 y...,A star is born


As a last step before cleaning the text, let's convert `reviewTime` to an actual datetime format.

In [12]:
# Converting reviewTime from object to datetime
reviews_raw['reviewTime'] = pd.to_datetime(reviews_raw['reviewTime'], format='%m %d, %Y', errors = 'coerce')

# Ensuring this worked
print(f"Data type of reviewTime column: {reviews_raw['reviewTime'].dtype}")

Data type of reviewTime column: datetime64[ns]
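
The format string `'%m %d, %Y'` reads a month number, a day number, a comma, and a four-digit year. Because `errors='coerce'` turns unparseable strings into missing values rather than raising an error, it is worth knowing what does and does not match. The sketch below shows the same format with the standard library; `parse_review_time` is a hypothetical helper that imitates the coercing behavior by returning `None`.

```python
from datetime import datetime

def parse_review_time(raw):
    """Hypothetical helper: parse 'MM DD, YYYY' strings, returning None on failure."""
    try:
        return datetime.strptime(raw, "%m %d, %Y")
    except (TypeError, ValueError):
        # Non-string input or a string that does not match the format
        return None

print(parse_review_time("09 18, 1999"))  # 1999-09-18 00:00:00
print(parse_review_time("not a date"))   # None
```

After the pandas conversion, a check such as `reviews_raw['reviewTime'].isna().sum()` would confirm whether any values were coerced to missing.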


### Cleaning and Processing Text Data

In this section, I will begin to process the text columns within this dataset. I will first remove digits from the text using Python's built-in `.isdigit()` method. Next, I will use Python's `string` library to create a list of punctuation marks that will be referenced when removing all punctuation from the text. Step three involves lowercasing the text using Python's built-in `.lower()` method. Finally, I will remove stopwords from the text with the help of `nltk.corpus`. If you are not familiar, stopwords are very common terms that carry little semantic meaning on their own - some examples include 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', etc. After all of this is complete, we will have a fully cleaned dataframe that is ready for analysis and modeling.

In [14]:
# importing stopwords 
from nltk.corpus import stopwords
stop_words = list(stopwords.words('english'))

print("STOP WORDS")
print("=============")
print(stop_words)

STOP WORDS
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so'

In [15]:
# Gathering punctuation marks to be removed
punctuation_marks = list(string.punctuation)

print("PUNCTUATION MARKS")
print("===================")
print(punctuation_marks)

PUNCTUATION MARKS
['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


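Below, I define the function `clean_review()`, which takes a string as input and applies all of the adjustments above in the following order:

1. Removes digits using `.isdigit()`
2. Removes punctuation marks by referencing the list object `punctuation_marks`
3. Lowercases the text using Python's built-in `.lower()`
4. Removes stopwords using the list object `stop_words`

The optimized version shown below performs steps 1 through 3 in a single pass by keeping only lowercase letters and spaces, rather than calling `.isdigit()` and checking `punctuation_marks` separately.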

***Note***: The function[@gpt4o_textfunc] used below was provided to me by OpenAI's GPT-4o model. The full citation can be found at the bottom of the page. Additionally, my original function will be included in the **Challenges** section.

In [40]:
def clean_review(text):
    # Predefine allowed characters (alphabet and space)
    allowed_chars = set('abcdefghijklmnopqrstuvwxyz ')
    
    # Dropping digits and punctuation while lowercasing
    text = ''.join(char.lower() for char in text if char.lower() in allowed_chars)
    
    # Splitting and removing stopwords
    words = text.split()
    cleaned_words = [word for word in words if word not in stop_words]

    # Return the cleaned text
    return ' '.join(cleaned_words)



Now, I will apply the function above to the `reviewText` and `summary` columns.

In [46]:
# Apply clean_review to both text columns (slow on the full dataset)
reviews_raw['reviewTextClean'] = reviews_raw['reviewText'].apply(clean_review)
reviews_raw['summaryClean'] = reviews_raw['summary'].apply(clean_review)

reviews_raw[['reviewTextClean', 'summaryClean']].head()

Unnamed: 0,reviewTextClean,summaryClean
0,best novel read years everything fiction beaut...,star born
1,pages pages introspection style writers like h...,stream consciousness novel
2,kind novel read time lose book days possibly w...,im huge fan author one disappoint
3,gorgeous language incredible writer last life ...,beautiful book ever read
4,taken reviews compared book leopard promised b...,dissenting viewin part


Here are the first few rows of cleaned text. It looks like everything works!
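
For a closer look at what the function does to a single string, here is a small self-contained check. The `stop_words` list in this sketch is a short illustrative subset chosen so the example runs without downloading the NLTK corpus; the real list is the one printed earlier.

```python
# Illustrative subset of the NLTK stopword list (an assumption for this example)
stop_words = ['i', 'a', 'this', 'is', 'the', 'have', 'in', 'or']

def clean_review(text):
    # Same logic as the function defined above
    allowed_chars = set('abcdefghijklmnopqrstuvwxyz ')
    text = ''.join(char.lower() for char in text if char.lower() in allowed_chars)
    return ' '.join(word for word in text.split() if word not in stop_words)

print(clean_review("This is the best novel I have read in 2 or 3 years!"))
# best novel read years

print(clean_review("I'm a huge fan"))
# im huge fan
```

The second call shows one side effect visible in the table above: apostrophes are stripped before the stopword filter runs, so contractions turn into tokens such as `im`, and stopwords that contain an apostrophe (for example `"you're"`) can no longer match.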

### Challenges

- I experienced serious bottlenecks when running my original string-based text cleaning function on the full dataset, so I asked OpenAI's GPT-4o model to optimize it for me. You can find the citation below. Additionally, here is the original function that I fed into the LLM:

```
def clean_review(text):
    #Remove digits
    text = ''.join(char for char in text if not char.isdigit())
    #Remove Punctuation
    text = ''.join(char for char in text if char not in punctuation_marks)
    # Lowercase
    text = text.lower()
    # Remove stopwords
    words = text.split()
    words = [word for word in words if word not in stop_words]

    # Return cleaned text
    return ' '.join(words)
```
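
The original function walks over each review three times and tests every word against a Python list, where each membership test scans the list element by element. One further change that may help, and that has not been benchmarked on this dataset, is to build the lookup structures once and store the stopwords in a `frozenset`, which uses hash-based lookups. The sketch below assumes a small illustrative stopword subset; `STOP_WORD_SET`, `ALLOWED_CHARS`, and `clean_review_set` are hypothetical names, not part of the pipeline above.

```python
# Illustrative subset of the NLTK stopword list (an assumption, so the example
# runs without downloading the NLTK corpus)
stop_words = ['i', 'me', 'my', 'this', 'is', 'the', 'a', 'of', 'in', 'and']

# Hypothetical names: both sets are built once at module level
# instead of on every function call
STOP_WORD_SET = frozenset(stop_words)
ALLOWED_CHARS = frozenset('abcdefghijklmnopqrstuvwxyz ')

def clean_review_set(text):
    """Variant of clean_review() that uses set lookups for stopwords."""
    # Keep only letters and spaces, lowercasing in the same pass
    kept = ''.join(char.lower() for char in text if char.lower() in ALLOWED_CHARS)
    # Drop stopwords using hash-based membership tests
    return ' '.join(word for word in kept.split() if word not in STOP_WORD_SET)

print(clean_review_set("This is the END of my 2nd book, and I loved it!"))
# end nd book loved it
```

The character filtering is unchanged from the optimized function, so the output should match it for the same stopword list; only the stopword lookup differs.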


- Due to the massive size of the data, I could only save a truncated version of the cleaned dataset to the `data/processed-data/` directory. Doing this made pushing the data to GitHub easier, while also giving graders a reference for what the whole dataset looks like.