Name: Stamatios Sideris

ID: f2822113

**Background**

You have been hired by a small bookstore company that wants to use data science techniques to optimize its sales. You have been assigned to analyse a dataset of book metadata using Apache Spark (PySpark, in particular) to reveal useful insights.

**Task 1**

Your first task is to explore the dataset. You need to use SparkSQL with Dataframes in a Jupyter notebook that delivers the following:
- It uses the json() function to load the dataset.
- It counts and displays the number of books in the database.
- It counts and displays the number of e-books in the database (based on the “is_ebook” field).
- It uses the summary() command to display basic statistics about the “average_rating” field.
- It uses the groupby() and count() commands to display all distinct values in the “format” field and their number of appearances.

##### We first create a SparkSession. It creates the SparkContext object that lets us communicate with Spark. Because the session needs the settings of our application, it also creates a SparkConf object that holds the application name and the master URL, which determines where the application runs.

In [1]:
from pyspark.sql import SparkSession

appName = "task1"  # name of the application
master = "local"   # run Spark locally
spark = SparkSession.builder.appName(appName).master(master).getOrCreate()

##### We read the JSON file, print its schema and show the first three rows.

In [2]:
json_path = "books_5000.json"
df = spark.read.json(json_path)
df.printSchema()
df.show(3)

root
 |-- asin: string (nullable = true)
 |-- authors: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- author_id: string (nullable = true)
 |    |    |-- role: string (nullable = true)
 |-- average_rating: string (nullable = true)
 |-- book_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- description: string (nullable = true)
 |-- edition_information: string (nullable = true)
 |-- format: string (nullable = true)
 |-- image_url: string (nullable = true)
 |-- is_ebook: string (nullable = true)
 |-- isbn: string (nullable = true)
 |-- isbn13: string (nullable = true)
 |-- kindle_asin: string (nullable = true)
 |-- language_code: string (nullable = true)
 |-- link: string (nullable = true)
 |-- num_pages: string (nullable = true)
 |-- popular_shelves: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- count: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- pub

##### Count the number of distinct rows. Each row refers to one book, so this is the number of books. The total number of books is 4999.

In [3]:
df.distinct().count()

4999

##### Group by the "is_ebook" column and count the rows for each value. collect() returns the grouped rows as a list, and here the second element holds the count of the true values. Spark does not guarantee the order of grouped rows, so the index may need adjusting on another run. The number of true values, and so the number of e-books, is 749.

In [4]:
df.groupBy("is_ebook").count().collect()[1]

Row(is_ebook='true', count=749)

##### Use the summary() command to display basic statistics about the "average_rating" field.

In [5]:
df.select("average_rating").summary().show()

+-------+------------------+
|summary|    average_rating|
+-------+------------------+
|  count|              4999|
|   mean| 3.911204240848176|
| stddev|0.4344448952868878|
|    min|              1.00|
|    25%|              3.66|
|    50%|              3.98|
|    75%|              4.23|
|    max|              5.00|
+-------+------------------+



##### Use the groupby() and count() commands to display all distinct values in the "format" field and their number of appearances.

In [6]:
df.groupBy("format").count().show()

+--------------------+-----+
|              format|count|
+--------------------+-----+
|Paperback comic book|    1|
|        Spiral-bound|    1|
|              Broche|    2|
|       Graphic Novel|    2|
|                  FC|    2|
|            Brochura|    1|
|           Paperback| 2629|
|        Single Issue|    1|
|Bolsillo con sobr...|    2|
|          Broschiert|    2|
|             Planeta|    1|
|       Audible Audio|    1|
|            Audio CD|    2|
|           paperback|    2|
| Slipcased Hardcover|    2|
|     Library Binding|    2|
|       Klappbroschur|    1|
|          Board book|   11|
|     Klappenbroschur|    1|
|                Nook|    1|
+--------------------+-----+
only showing top 20 rows

