Name: Stamatios Sideris

ID: f2822113

### We first create a SparkSession, the entry point for communicating with Spark. Through its builder we set the application name and the master, that is, where the application will run (here, locally). From these settings the session creates the underlying SparkContext and its configuration (SparkConf).

In [1]:
from pyspark.sql import SparkSession
appName = "task2"  # name of the application
master = "local"  # run Spark locally
spark = SparkSession.builder.appName(appName).master(master).getOrCreate()

### We read the JSON file, print its schema and show the first three rows.

In [2]:
json_path = "books_5000.json"
df = spark.read.json(json_path)
df.printSchema()
df.show(3)

root
 |-- asin: string (nullable = true)
 |-- authors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- author_id: string (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- book_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- edition_information: string (nullable = true)
 |-- format: string (nullable = true)
 |-- image_url: string (nullable = true)
 |-- is_ebook: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- kindle_asin: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- link: string (nullable = true)
 |-- num_pages: string (nullable = true)
 |-- popular_shelves: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- count: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- pub

### We keep the columns "book_id", "title" and "average_rating" and filter the rows so that only titles starting with the letter "S" remain. We then sort in descending order by "average_rating" (read as a string, per the schema) and show the first row, which holds the highest rating.

In [3]:
from pyspark.sql.functions import desc
(
    df.select("book_id", "title", "average_rating")
    .filter(df.title.startswith("S"))  # titles starting with "S"
    .sort(desc("average_rating"))  # highest rating first
    .show(1)
)

+--------+--------------------+--------------+
| book_id|               title|average_rating|
+--------+--------------------+--------------+
|22129858|Superman: The Gol...|          4.75|
+--------+--------------------+--------------+
only showing top 1 row
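
Because the schema lists `average_rating` as a string, `desc("average_rating")` compares text rather than numbers. The plain-Python sketch below does not use Spark, and its values are made up rather than taken from `books_5000.json`. It only illustrates how the two orderings can differ when the strings do not all share the same shape.

```python
# Hypothetical rating strings, not taken from books_5000.json.
ratings = ["4.75", "10.0", "3.9"]

# Text comparison goes character by character, so "4..." sorts above "10...".
as_text = sorted(ratings, reverse=True)

# Converting to float first gives the numeric ordering.
as_numbers = sorted(ratings, key=float, reverse=True)

print(as_text)
print(as_numbers)
```

If every rating is written in the same fixed format, the two orderings coincide. Casting the column before sorting, for example with `df.average_rating.cast("double")`, is one way to avoid depending on that.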



### We keep only the titles that start with "I" and then use the `mean` function to compute the average of the "average_rating" column.

In [4]:
from pyspark.sql.functions import mean
df.filter(df.title.startswith("I")).select(mean("average_rating")).show()

+-------------------+
|avg(average_rating)|
+-------------------+
| 3.9546753246753252|
+-------------------+



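### We cast the "num_pages" column to integer, since page counts are whole numbers but the column was read as a string. This makes the later sort by page count numeric rather than textual.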

In [5]:
from pyspark.sql.types import IntegerType
df = df.withColumn("num_pages", df.num_pages.cast(IntegerType()))
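
A cast can also leave gaps. The sketch below is a plain-Python analogy, not Spark code. It assumes Spark's non-ANSI cast behaviour, where a string that cannot be parsed as an integer becomes null instead of raising an error; with ANSI mode enabled, the cast fails instead. The helper `to_int_or_none` and the sample values are hypothetical.

```python
def to_int_or_none(text):
    """Hypothetical helper: parse an integer, or return None if parsing fails."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None


# Made-up page counts, including an empty entry.
raw_pages = ["576", "", "1200", "98"]
pages = [to_int_or_none(p) for p in raw_pages]

# Sort descending by number, keeping missing values at the end.
ordered = sorted(pages, key=lambda n: (n is None, -(n or 0)))

print(pages)
print(ordered)
```

Placing the missing values last mirrors what a descending sort in Spark does with nulls by default, so a book with no usable page count does not end up as the top row.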

### We select the columns "book_id", "title", "format" and "num_pages", as they are the only ones needed. We keep the rows whose title starts with "D" and whose format is "Paperback". We then sort in descending order by number of pages and show the first row, which is the book with the most pages.

In [6]:
(
    df.select("book_id", "title", "format", "num_pages")
    .filter(df.title.startswith("D") & (df.format == "Paperback"))
    .sort(desc("num_pages"))  # most pages first
    .show(1)
)

+--------+--------------------+---------+---------+
| book_id|               title|   format|num_pages|
+--------+--------------------+---------+---------+
|18143804|Dragon Ball (3-in...|Paperback|      576|
+--------+--------------------+---------+---------+
only showing top 1 row

