# IBM Advanced Data Science Capstone Project
## Sentiment Analysis of Amazon Customer Reviews
### Harsh V Singh, Apr 2021

## Extract, Transform, Load (ETL)

This notebook walks through the step-by-step process of preparing the raw data used in the project. The data is available as two CSV files (train.csv and test.csv). We will read these files into memory and then store them as parquet files with the same names. *The Spark CSV reader is not able to handle commas within the quoted text of the reviews. Hence, we will first read the files into Pandas dataframes and then export them to parquet files*.

## Importing required Python libraries and initializing Apache Spark environment

In [1]:
import pandas as pd
import csv
import time
from pathlib import Path

import findspark
findspark.init()

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType, StringType

# Run Spark locally on all available cores with the memory limits below
conf = SparkConf().setMaster("local[*]") \
    .setAll([("spark.driver.memory", "16g"),
             ("spark.executor.memory", "4g"),
             ("spark.driver.maxResultSize", "16g"),
             ("spark.executor.cores", "4")])
sc = SparkContext.getOrCreate(conf=conf)

# The session reuses the SparkContext created above
spark = SparkSession \
    .builder \
    .getOrCreate()

In [None]:
# Uncomment the next line to stop the Spark context when it is no longer needed
#spark.sparkContext.stop()

## Reading data from CSV and storing local copies

The data for this project is available as two CSV files (train.csv and test.csv). We will read these files into memory and then store them as parquet files with the same names.

We will write a function called **readSparkDFFromParquet** that reads the parquet files into memory as Spark dataframes. If the parquet files are not found, this function calls another function, **savePandasDFToParquet**, which reads the original CSV files into a Pandas dataframe and saves them as **parquet** files.

*We read the CSV files into a Pandas dataframe because the Spark CSV reader is not able to handle commas within the quoted text of the reviews. To work around that, we use the Pandas CSV reader to process the data initially and then export the result to parquet files*.
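
As a small illustration of the Pandas step described above, the sketch below parses an in-memory CSV whose review text contains commas inside double quotes. The two sample rows are made up for this example and the column names are written out by hand; only the `pd.read_csv(..., header=None)` call mirrors what this notebook does with the real files.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the raw files: rating, heading, text.
# The review text contains commas inside the double quotes.
sample_csv = io.StringIO(
    '3,"more like funchuck","Gave this to my dad, he loved it, really"\n'
    '5,"Inspiring","Short, sweet, and to the point"\n'
)

# header=None because the sample, like the raw files, has no header row
sample_df = pd.read_csv(sample_csv, header=None)
sample_df.columns = ["rating", "review_heading", "review_text"]

# Each row still has exactly three fields; the quoted commas stay in the text
print(sample_df.shape)
print(sample_df.loc[0, "review_text"])
```

If the quoted commas were treated as separators, the rows would split into more than three fields, which is the failure this notebook avoids by going through Pandas first.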


In [2]:
# Function to print time taken by a particular process, given the start and end times
def printElapsedTime(startTime, endTime):
    elapsedTime = endTime - startTime
    print("Process time = %.2f seconds."%(elapsedTime))
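
As a possible variation on the helper above, the following sketch wraps the same start/end bookkeeping in a context manager, so a block of code can be timed without passing timestamps around. The name `elapsedTimer` is hypothetical and is not used elsewhere in this notebook. It relies on `time.perf_counter`, a monotonic clock intended for measuring intervals, rather than `time.time`.

```python
import time
from contextlib import contextmanager

@contextmanager
def elapsedTimer(label="Process"):
    # Hypothetical helper: measure the elapsed seconds of the enclosed block
    result = {"seconds": None}
    startTime = time.perf_counter()
    try:
        yield result
    finally:
        # Runs even if the block raises, so the time is always reported
        result["seconds"] = time.perf_counter() - startTime
        print("%s time = %.2f seconds." % (label, result["seconds"]))

# Usage: time a short sleep
with elapsedTimer("Sleep") as timing:
    time.sleep(0.05)
```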

In [3]:
# Schema that defines the columns and datatypes of the data in the csv files
rawSchema = StructType([
    StructField("rating", IntegerType(), True),
    StructField("review_heading", StringType(), True),
    StructField("review_text", StringType(), True)
    ])

In [4]:
# Function to read a csv file with Pandas and save it as a parquet file
def savePandasDFToParquet(csvPath, parqPath, rawSchema, printTime=False):
    startTime = time.time()
    # The csv is read without a header row; column names come from the schema
    pandasDF = pd.read_csv(csvPath, header=None)
    pandasDF.columns = rawSchema.names
    pandasDF.to_parquet(parqPath, engine="pyarrow")
    endTime = time.time()
    if printTime:
        printElapsedTime(startTime=startTime, endTime=endTime)

In [5]:
# Function to read a parquet file into a Spark dataframe
# If the parquet file is not found, it will be created from the original csv
def readSparkDFFromParquet(csvPath, parqPath, rawSchema, printTime=False):
    parquetFile = Path(parqPath)
    # Convert the csv only when the parquet file does not exist yet
    if not parquetFile.is_file():
        print("Parquet file not found... converting %s to parquet!"%(csvPath))
        savePandasDFToParquet(csvPath=csvPath, parqPath=parqPath, rawSchema=rawSchema, printTime=printTime)
    sparkDF = spark.read.parquet(parqPath)
    return sparkDF
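
The convert-once-then-reuse logic of the function above can be tried in isolation. The sketch below is a simplified stand-in, not the notebook's code: it swaps the parquet conversion for writing a plain text file, and the names `ensureFile` and `buildFn` are hypothetical. It only demonstrates that the build step runs on the first call and is skipped on the second.

```python
import tempfile
from pathlib import Path

def ensureFile(targetPath, buildFn):
    # Hypothetical helper: call buildFn only when targetPath is missing.
    # Returns True if the file had to be built, False if it already existed.
    target = Path(targetPath)
    if not target.is_file():
        buildFn(target)
        return True
    return False

with tempfile.TemporaryDirectory() as tmpDir:
    target = Path(tmpDir) / "demo.txt"
    # First call: the file is missing, so it is built
    firstCall = ensureFile(target, lambda p: p.write_text("converted"))
    # Second call: the file exists, so the build step is skipped
    secondCall = ensureFile(target, lambda p: p.write_text("converted again"))
    contents = target.read_text()

print(firstCall, secondCall, contents)
```

Note that `is_file()` suits this notebook because Pandas writes a single parquet file; a parquet dataset written by Spark is a directory, for which a check such as `exists()` would be needed instead.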

## Load local data for sanity check

We will load the train and test sets and print a few samples as well as the size of the datasets.

In [6]:
trainRaw = readSparkDFFromParquet(csvPath="data/raw/train.csv", parqPath="data/train.parquet", rawSchema=rawSchema, printTime=True)
testRaw = readSparkDFFromParquet(csvPath="data/raw/test.csv", parqPath="data/test.parquet", rawSchema=rawSchema, printTime=True)
trainRaw.show(5)
print("There are %d/ %d samples in the training/ test data."%(trainRaw.count(), testRaw.count()))
print("Sample review text: %s"%(trainRaw.take(1)[0]["review_text"]))

+------+--------------------+--------------------+
|rating|      review_heading|         review_text|
+------+--------------------+--------------------+
|     3|  more like funchuck|Gave this to my d...|
|     5|           Inspiring|I hope a lot of p...|
|     5|The best soundtra...|I'm reading a lot...|
|     4|    Chrono Cross OST|The music of Yasu...|
|     5| Too good to be true|Probably the grea...|
+------+--------------------+--------------------+
only showing top 5 rows

There are 3000000/ 650000 samples in the training/ test data.
Sample review text: Gave this to my dad for a gag gift after directing "Nunsense," he got a reall kick out of it!
