# CW1 - Amazon Bestsellers Analysis with PySpark


In this assignment you will explore a dataset containing the Top 50 best-selling books on Amazon from 2009 to 2019. Complete the exercises presented in the Google Colab notebook below. This assignment will be graded using CodeGrade.

Exercise 1 (5 Marks): Find the authors with the most entries in the bestsellers lists. For each author, find the number of unique titles, the average rating, the total number of reviews, and the highest position in the ranking.

Exercise 2 (5 Marks): For fiction and non-fiction books, find the average and total number of reviews for the top 10, 25, and 50 of the bestsellers lists, in each year.

Exercise 3 (10 Marks): For each year, find the average price of a fiction and non-fiction book in the top 10, 25 and 50 of the bestsellers lists.

Exercise 4 (10 Marks): For free books (where the price is zero), find the number of unique titles and authors. Compare the average rating and number of reviews in each year between free and priced books.


In [None]:
# CodeGrade Tag Init1

from google.colab import drive
drive.mount('/content/drive')

In [None]:
# CodeGrade Tag Init2

# Apache Spark uses Java, so first we must install that
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

# Unpack Spark from Google Drive
!tar xzf /content/drive/MyDrive/spark-3.3.0-bin-hadoop3.tgz

# Set up environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "spark-3.3.0-bin-hadoop3"

# Install findspark, which helps Python locate the pyspark module files
!pip install -q findspark
import findspark
findspark.init()

In [None]:
# Finally, we initialise a "SparkSession", which handles the computations
from pyspark.sql import SparkSession
spark = SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

from pyspark.sql import functions as F

In [None]:
# Load the AmazonBooks.csv file into your notebook as a pyspark dataframe

CsvPath = '/content/AmazonBooks.csv'

# Load .csv with header, ',' separators and inferred schema
BooksDF = spark.read\
                     .option('header', 'True')\
                     .option('sep', ',')\
                     .option('inferSchema', 'True')\
                     .csv(CsvPath)



In [None]:
# CodeGrade Tag Init3

BooksDF.printSchema()
BooksDF.show()

In [None]:
# pyspark.sql.functions contains all the transformations and actions you will
# need
from pyspark.sql import functions as F

# Exercise 1

Find the authors with the most entries in the bestsellers lists. Find the number of unique titles for each author, the average rating, total number of reviews and highest position in the ranking. Create a dataframe where the columns are:

Author, Number of titles, Average Rating, Total Ratings, Highest Position

Sort by the number of titles in descending order.

In [None]:
# CodeGrade Tag Ex1
### Create a dataframe that contains, for each author, the number of unique
### books, the average rating, the number of reviews and the highest rank reached



# Exercise 2

For fiction and non-fiction books, find the average rating, the average number of reviews, the total number of reviews and the average price in the bestsellers list, for each year. Create a dataframe where the columns are:

Year, Genre, Average Rating, Average Number of Reviews, Total Reviews, Average Price

Sort by the year in ascending order.

In [None]:
# CodeGrade Tag Ex2
### Create a dataframe that shows the average user rating, average number of
### reviews, total number of reviews and average price in each year of the
### bestsellers list



# Exercise 3

For each year, find the average price of fiction and non-fiction books in the top 10, 25 and 50 of the bestsellers list. Make a dataframe where the columns are:

Year, Genre, Avg Price in Top 10, Avg Price in Top 25 and Avg Price in Top 50

Sort by the year in ascending order.

In [None]:
# CodeGrade Tag Ex3
### Create a DataFrame that shows the average price for books in the top 10, 25
### and 50 of the bestsellers list, for each year in the dataset



# Exercise 4

For free books, find the total number of unique titles and authors, and store these as variables called ```free_titles``` and ```free_authors```.

Compare the average rating and number of reviews for free and priced books, in each year of the dataset. Create a dataframe where the columns are:

Year, Avg Rating Free, Avg Rating Priced, Total Ratings Free, Total Ratings Priced

Sort by the year in ascending order.

In [None]:
# CodeGrade Tag Ex4a
### Find the number of free books in the dataset and the number of authors
### who wrote them



In [None]:
# CodeGrade Tag Ex4b
### Create a dataframe that has, for each year, the average rating and number of
### user reviews for free books and priced books

