# Assignment 1

## Problem Statement Explanation:

For this assignment, we have to download any reviews file with at least a million reviews from [here](http://jmcauley.ucsd.edu/data/amazon/links.html).<br>
I chose the "Movies and TV" review dataset, which contains 1,697,533 reviews.

Assigned tasks once the download is done:

- Find the item with the fewest ratings
- Find the item with the most ratings
- Find the item with the longest reviews
- Finally, store the result in a Parquet file

> **To download the dataset, execute the cell below.<br>Note that it downloads the file into the current directory and unzips it there.** The read cell further down expects the file under `./raw_lake/`, so move it there or adjust the path.

Library Versions used: [requirements.txt](requirements.txt)

In [1]:
# !wget "http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Movies_and_TV_5.json.gz"
# !gunzip reviews_Movies_and_TV_5.json.gz

Before running the cell below, it is recommended to create a virtual environment from the terminal (Linux):

```
python3 -m venv .env
source .env/bin/activate
pip install -r requirements.txt
```
<br>

Importing the libraries that will be used to answer the above questions. <br>

> The imports are wrapped in `try`/`except` so that a *"ModuleNotFoundError"* triggers an install from `requirements.txt` into the active environment. The imports are not retried automatically, so re-run the cell after the install finishes.

In [2]:
try:
    import pyspark
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    import logging, datetime
except ModuleNotFoundError:
    !pip install -r requirements.txt
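
As a hedged alternative to relying on the exception, the check can be made explicit before importing. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name and not part of the notebook.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# 'pyspark' is the only third-party package the notebook imports directly.
required = ["pyspark", "logging", "datetime"]
missing = missing_modules(required)
if missing:
    print(f"Missing modules: {missing}. Run: pip install -r requirements.txt")
```

This only reports what is missing; installing and re-running the cell is still left to the user.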

Setting up a logger for a few custom log messages that serve as basic checks.

> The logger level is set to "INFO", so every event of severity INFO or higher is written to the log file.<br>
> The `format` keyword controls the layout of each log line.

In [3]:
capture_log_time = datetime.datetime.now()
log_file_path = f"./movies_{capture_log_time.date()}_at_runtime_{capture_log_time.hour}:{capture_log_time.minute}.log"
logging.basicConfig(filename=log_file_path, level=logging.INFO, format='%(asctime)s - %(message)s')

logging.info(f"INITIALIZING Movies script")
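
The file name above embeds the hour and minute separated by a colon and without zero padding. That works on Linux, but colons are not allowed in file names on Windows, and `20:5` sorts poorly. A minimal sketch of a more portable name follows, assuming a hypothetical helper `build_log_path` that is not part of the notebook:

```python
import datetime

def build_log_path(now, prefix="movies"):
    """Build a log file name with zero-padded time fields and no colons."""
    return f"./{prefix}_{now:%Y-%m-%d}_at_runtime_{now:%H-%M}.log"

# A fixed timestamp is used here so the result is reproducible.
print(build_log_path(datetime.datetime(2023, 12, 25, 20, 5)))
# ./movies_2023-12-25_at_runtime_20-05.log
```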

## Creation of Class

Creating a class `Transformation` helps keep the code modular and maintainable. I've tried to follow the SOLID design principles and added docstrings for better understanding. 

> Because path-related methods can throw many different kinds of exceptions, they are caught in a generalized way with `except Exception`.

In [4]:
class Transformation:
    """
    Provides methods for common data transformation tasks using a SparkSession.
    Args:
        spark (SparkSession): The SparkSession to use for transformations.
    """

    def __init__(self, spark):
        self.spark = spark
        logging.info(f"== Spark session started ==")

    def exit_spark(self):
        logging.info(f"== Spark session stopped ==")
        self.spark.stop()

    def json_reader(self, path:str):
        """
        Reads a JSON file into a Spark DataFrame.
        Args:
            path (str): The path to the JSON file.
        Returns:
            DataFrame: The DataFrame containing the JSON data.
        Raises:
            Exception: If an error occurs during reading.
        """
        try:
            return self.spark.read.json(path)
        except Exception as e:
            # Log the failure, then re-raise so the caller does not silently receive None
            logging.critical(f"ERROR: {e} with {path}")
            raise

    def parquet_writer(self, df, path:str):
        """
        Writes a Spark DataFrame to a Parquet file.
        Args:
            df (DataFrame): The DataFrame to write.
            path (str): The path to the output Parquet file.
        Raises:
            Exception: If an error occurs during writing.
        """
        try:
            return df.repartition(1).write.mode('overwrite').parquet(path)
        except Exception as e:
            # Log the failure, then re-raise so a failed write is not mistaken for success
            logging.critical(f"ERROR: {e} with {path}")
            raise
            
    def transform_time(self, df, col_name:str, format_out:str, format_in=None, unix=False):
        """
        Transforms a time-related column in a DataFrame.
        Args:
            df: The DataFrame containing the column to transform.
            col_name: The name of the column to transform.
            format_out: The desired output format for the transformed column.
            format_in (optional): The input format of the column, if not Unix timestamp. Defaults to None.
            unix (optional): Whether the column is in Unix timestamp format. Defaults to False.
        Returns:
            DataFrame: The DataFrame with the transformed column.
        """
        logging.info(f"Transforming {col_name} to format {format_out} and isUnix is {unix}")
        if not unix and format_in is not None:
            # Parse the string with the given input pattern, then re-format it
            return df.withColumn(col_name , F.date_format(F.to_date(F.col(col_name), format_in), format_out))
        elif unix:
            # Interpret the column as seconds since the epoch, then format it
            return df.withColumn(col_name, F.date_format(F.from_unixtime(col_name), format_out))
        else:
            raise ValueError("For unix=False, 'format_in' is necessary")

## Spark Session Initialization

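- `spark.executor.memory`: Memory allocated to each executor for data processing and computations.
- `spark.driver.memory`: Memory allocated to the driver, which schedules and manages jobs.
- `spark.executor.cores`: Number of cores per executor, which sets how many tasks it can run in parallel.
- `spark.executor.instances`: Number of executors to launch, for more overall processing capacity.
- `spark.sql.legacy.timeParserPolicy`: Set to `LEGACY` to keep the pre-Spark-3.0 datetime parsing behavior and avoid datetime-related errors.

> **Note:** If you want more detailed console logs to be printed, set the log level to `INFO`.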

In [5]:
spark = SparkSession.builder \
    .appName("amz_reviews") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.cores", "5") \
    .config("spark.executor.instances", "5") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()

spark.sparkContext.setLogLevel("WARN") # Use FATAL for cleaner notebook

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/12/25 20:53:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [6]:
processor = Transformation(spark)

In [7]:
movies_df = processor.json_reader("./raw_lake/reviews_Movies_and_TV_5.json")

                                                                                

In [8]:
movies_df = processor.transform_time(movies_df, 'reviewTime', format_in='MM dd, yyyy', format_out='MM-dd-yyyy')
movies_df = processor.transform_time(movies_df, 'unixReviewTime', format_out='MM-dd-yyyy', unix=True)
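
The two conversions applied above can be mirrored in plain Python to check what the output should look like. This is only a sketch: the sample values `"02 26, 2008"` and `1203984000` are assumed for illustration, `reformat_date` and `reformat_unix` are hypothetical helper names, and Python uses `strftime`-style codes (`%m %d, %Y`) where Spark uses Java-style patterns (`MM dd, yyyy`).

```python
from datetime import datetime, timezone

def reformat_date(text, format_in, format_out):
    """Parse a date string with one pattern and render it with another."""
    return datetime.strptime(text, format_in).strftime(format_out)

def reformat_unix(seconds, format_out):
    """Render seconds since the epoch as a date string, interpreted in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(format_out)

print(reformat_date("02 26, 2008", "%m %d, %Y", "%m-%d-%Y"))  # 02-26-2008
print(reformat_unix(1203984000, "%m-%d-%Y"))                  # 02-26-2008
```

Spark's `from_unixtime` renders the timestamp in the session time zone, whereas this sketch pins UTC, so dates close to midnight can differ by a day between the two.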

In [9]:
movies_df.printSchema()

root
 |-- asin: string (nullable = true)
 |-- helpful: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- overall: double (nullable = true)
 |-- reviewText: string (nullable = true)
 |-- reviewTime: string (nullable = true)
 |-- reviewerID: string (nullable = true)
 |-- reviewerName: string (nullable = true)
 |-- summary: string (nullable = true)
 |-- unixReviewTime: string (nullable = true)



## Description of the columns

`asin` - ID of the product, e.g. 0000013714<br>
`helpful` - helpfulness rating of the review, e.g. 2/3<br>
`overall` - rating of the product<br>
`reviewText` - text of the review<br>
`reviewTime` - time of the review (raw)<br>
`reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B<br>
`reviewerName` - name of the reviewer<br>
`summary` - summary of the review<br>
`unixReviewTime` - time of the review (unix time)

## Least and Most reviews

To find the items with the fewest or the most reviews, we need to *group by* `asin`, as it's the ID of the product (unique per product)
<br>Then *count* the number of reviews for each `asin`
<br>Then *sort* the counts in ascending order for the fewest reviews and in descending order for the most reviews.

In [10]:
least_reviews = movies_df.groupBy("asin").count().sort(F.asc("count"))

In [11]:
most_reviews = movies_df.groupBy("asin").count().sort(F.desc("count"))
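
The group, count, and sort steps above can be illustrated on a tiny in-memory sample. This is a plain-Python sketch on made-up ASINs, not the Spark job itself:

```python
from collections import Counter

# Toy reviews: only the product ID matters for counting.
reviews = [
    {"asin": "A1"}, {"asin": "A1"}, {"asin": "A1"},
    {"asin": "B2"},
    {"asin": "C3"}, {"asin": "C3"},
]

# Group by asin and count the reviews per product.
counts = Counter(r["asin"] for r in reviews)

# Equivalent of sorting ascending / descending and taking the first row.
least = min(counts.items(), key=lambda kv: kv[1])
most = max(counts.items(), key=lambda kv: kv[1])

print(least)  # ('B2', 1)
print(most)   # ('A1', 3)
```

When several products share the same count, `min` and `max` return the first one encountered; in Spark, which of the tied rows comes first after the sort is not guaranteed, so the reported ASIN should be read as one of possibly many.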

## Longest reviews:

To find the longest reviews we take the `reviewText` column and get the *length* of each text.
<br>Then, we *group by* `asin` 
<br>Perform one aggregation, the *sum* of the *length* values for each item
<br>And finally, *sort* in descending order of the total length.

In [12]:
movies_df = movies_df.withColumn('length_summary', F.length(F.col('reviewText')))
longest_reviews = movies_df.groupby("asin").agg(F.sum("length_summary").alias("total_len")).sort(F.desc("total_len"))                                    
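
Note that summing lengths per `asin` ranks products by their total review text, which is not the same as finding the single longest review. A plain-Python sketch on made-up data shows the difference (the ASINs and texts are illustrative only):

```python
from collections import defaultdict

# Toy (asin, reviewText) pairs.
reviews = [
    ("A1", "Great film"),
    ("A1", "Loved it"),
    ("A1", "Would watch again"),
    ("B2", "A long and detailed review"),
]

# Sum of review lengths per product, as in the aggregation above.
total_len = defaultdict(int)
for asin, text in reviews:
    total_len[asin] += len(text)
ranked = sorted(total_len.items(), key=lambda kv: kv[1], reverse=True)

# The single longest review, for comparison.
longest_single = max(reviews, key=lambda r: len(r[1]))

print(ranked[0])          # ('A1', 35)
print(longest_single[0])  # B2
```

Here `A1` wins on total length because it has more reviews, while `B2` holds the longest individual review.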

In the above cell we intentionally do not show any output, because Spark evaluates lazily by default: transformations only build a query plan, and nothing runs until an action is called. This is what allows Spark to optimize performance over large datasets.<br> Keeping the evaluation lazy, we get the results out by running 3 computation queries with the `first()` action. 
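
As a rough analogy only (Python generators, not Spark's query planner), the sketch below shows work being deferred until a result is requested. The names `length_of` and `calls` are made up for the illustration.

```python
calls = []

def length_of(text):
    """Compute a length and record that the work actually happened."""
    calls.append(text)
    return len(text)

texts = ["short", "a bit longer", "the longest text here"]

# Building the pipeline does no work yet, much like a Spark transformation.
lazy_lengths = (length_of(t) for t in texts)
print(len(calls))  # 0

# Asking for the first result triggers only as much work as needed.
first = next(lazy_lengths)
print(first, calls)  # 5 ['short']
```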

Once computed, we create a new DataFrame that stores the values along with the ASIN IDs for a better view of the data.

In [13]:
logging.info(f"Executing data computation: {datetime.datetime.now()}")
results = least_reviews.first(), most_reviews.first(), longest_reviews.first()
logging.info(f"Computation end time: {datetime.datetime.now()}")


final_answers = spark.createDataFrame([
    ("asin", results[0][0], results[1][0], results[2][0]),
    ("values", results[0][1], results[1][1], results[2][1])
], "ref string, least_movies_asin string, most_movies_asin string, longest_reviews_asin string")

final_answers.show()

                                                                                

+------+-----------------+----------------+--------------------+
|   ref|least_movies_asin|most_movies_asin|longest_reviews_asin|
+------+-----------------+----------------+--------------------+
|  asin|       0780018648|      B003EYVXV4|          B00003CWT6|
|values|                5|            2213|             1553600|
+------+-----------------+----------------+--------------------+



## A Desired Operation

Here I've used `when` from the functions module.<br>
It serves as an `if-else` for a column, helping us segment the data and draw logical conclusions from it.<br>
`helpful` is an array whose index 0 element is the number of helpful votes and whose index 1 element is the total number of votes. This helps us understand which reviews are actually helpful to customers and whether specific reviews deserve a deeper look. 
We create a column `effective_rating` from the percentage of helpful votes. <br>
- Above 60% is labelled `above 60%`
- From 30% up to 60% is labelled `below 60%`
- Everything else, including reviews with no votes at all, is labelled `below 30%`

We can then better understand the data, or provide ready-made features for ML teams to use. 

In [14]:
movies_df = movies_df.withColumn("total_votes", F.col("helpful")[1])
movies_df = movies_df.withColumn("actual_helpful", F.col("helpful")[0])
movies_df = movies_df.withColumn('effective_rating',
                                F.when((F.col("actual_helpful") / F.col("total_votes") * 100) > 60, "above 60%") 
                                .when((F.col("actual_helpful") / F.col("total_votes") * 100) >= 30, "below 60%")
                                .otherwise("below 30%"))
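
The branching in the cell above can be sketched as a plain-Python function to make the edge cases explicit. `effective_rating` below is a hypothetical stand-in for the `when`/`otherwise` chain, and the `None` branch assumes Spark returns null for a division by zero (its behavior when ANSI mode is off), which then falls through to `otherwise`.

```python
def effective_rating(helpful):
    """Bucket a [helpful_votes, total_votes] pair like the when/otherwise chain."""
    helpful_votes, total_votes = helpful
    # With zero total votes there is no percentage; treat it like a null.
    pct = helpful_votes / total_votes * 100 if total_votes else None
    if pct is not None and pct > 60:
        return "above 60%"
    if pct is not None and pct >= 30:
        return "below 60%"
    return "below 30%"

for sample in ([7, 10], [4, 10], [1, 10], [0, 0]):
    print(sample, effective_rating(sample))
```

Reviews with no votes end up in the same bucket as reviews that were mostly voted unhelpful, which is worth keeping in mind if the column is used as a feature.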

In [15]:
movies_df.show(1)
movies_df.printSchema()

+----------+-------+-------+--------------------+----------+-------------+--------------------+--------------------+--------------+--------------+-----------+--------------+----------------+
|      asin|helpful|overall|          reviewText|reviewTime|   reviewerID|        reviewerName|             summary|unixReviewTime|length_summary|total_votes|actual_helpful|effective_rating|
+----------+-------+-------+--------------------+----------+-------------+--------------------+--------------------+--------------+--------------+-----------+--------------+----------------+
|0005019281| [0, 0]|    4.0|This is a charmin...|02-26-2008|ADZPIG9QOCDG5|Alice L. Larson "...|good version of a...|    02-26-2008|           299|          0|             0|       below 30%|
+----------+-------+-------+--------------------+----------+-------------+--------------------+--------------------+--------------+--------------+-----------+--------------+----------------+
only showing top 1 row

root
 |-- asin: 

The cell below saves the transformed DataFrame as a `.parquet` file. 

Original File Size: `2.0 GB`<br>
Parquet File Size: `1.1 GB`

Even with the new columns added, the data takes up roughly 45% less space than the original dump. 

In [16]:
processor.parquet_writer(movies_df, "./raw_lake/movies_parquet")

                                                                                

In [17]:
movies_df.unpersist()
processor.exit_spark()

# Final Results:

1. Item with the fewest ratings:
   > **ASIN: 0780018648** with `5` ratings.
2. Item with the most ratings:
   > **ASIN: B003EYVXV4** with `2213` ratings.
3. Item with the longest reviews:
   > **ASIN: B00003CWT6** with a cumulative review length of `1553600` characters.