# Books Dataset

## Reading the DataFrames

We create the SparkSession:

In [1]:
from pyspark.sql import SparkSession 
spark = SparkSession \
    .builder \
    .appName("BooksDataset") \
    .getOrCreate()

We define a function that returns the files in a directory whose names contain a given string:

In [105]:
from os import listdir
from os.path import isdir, isfile, join
def getFiles(directory, match):
    """Return the paths of the files in `directory` whose name contains `match`."""
    files = []
    if isdir(directory):
        for f in listdir(directory):
            if isfile(join(directory, f)) and match in f:
                files.append(join(directory, f))
    return files
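
The same lookup can also be written with `pathlib`. The sketch below is an alternative, not the notebook's own helper: the name `get_files_pathlib` and the sample file names are made up for illustration, and the result is sorted only to make the order deterministic.

```python
from pathlib import Path
import tempfile


def get_files_pathlib(directory, match):
    """Return the paths of regular files in `directory` whose name contains `match`."""
    root = Path(directory)
    if not root.is_dir():
        return []  # like getFiles, a missing directory yields an empty list
    # sorted() gives a stable order; directory listing order is otherwise arbitrary
    return sorted(str(p) for p in root.iterdir() if p.is_file() and match in p.name)


# Usage on a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("book1-100k.csv", "book100k-200k.csv", "user_rating_0_to_1000.csv"):
        (Path(tmp) / name).touch()
    found_names = [Path(p).name for p in get_files_pathlib(tmp, "book")]
    print(found_names)
```

Only the two files with "book" in their name are returned; the user ratings file is skipped.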

We define a function that returns a column if it exists in the DataFrame, or a null column if it does not:

In [55]:
from pyspark.sql.functions import lit
def hasColumn(df, column):
    # Return the existing column, or a null literal when the DataFrame lacks it
    return df[column] if column in df.schema.fieldNames() else lit(None)

We collect the files whose names contain the word "book" from our datasets directory:

In [None]:
dataset_folder = '/home/spark/datasets/books'
booksFiles = getFiles(dataset_folder, "book")
print(booksFiles)

We read each of the book files with the DataFrameReader API:

In [53]:
booksDFList = list(map(lambda file: (spark
     .read
     .format("csv")
     .option("header", "true")
     .option("quote", "\"")
     .option("escape", "\"")
     .option("ignoreLeadingWhiteSpace","true")
     .option("ignoreTrailingWhiteSpace","true")
     .load(file)
), booksFiles))
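
Setting both `quote` and `escape` to `"` tells the reader that a quote character inside a quoted field is written as two consecutive quotes (Spark's default escape character is a backslash). The standard-library `csv` module follows the same doubled-quote convention by default, so the sketch below illustrates the parsing rule without Spark. The sample row is made up.

```python
import csv
import io

# A made-up CSV fragment: the Name field contains both a comma and embedded quotes
raw = 'Id,Name\n1,"The ""Great"" Book, Vol. 1"\n'

# csv.reader uses doublequote=True by default, i.e. "" inside a quoted field means one "
rows = list(csv.reader(io.StringIO(raw)))

header, first = rows
print(header)
print(first)
```

The quoted field is kept as a single value, with the doubled quotes collapsed to single ones and the comma preserved.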

We add the "Description" and "Count of text reviews" columns to every DataFrame that lacks them:

In [57]:
booksDFList2 = list(map(lambda df: (
    df.withColumn("Description", hasColumn(df, "Description")) \
      .withColumn("Count of text reviews", hasColumn(df, "Count of text reviews"))
), booksDFList))

We define a function that combines the list of DataFrames into a single DataFrame:

In [119]:
from functools import reduce
def joinDFList(dfList):
    return reduce(lambda a, b: a.unionByName(b), dfList)
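
`reduce` folds the list pairwise from left to right, and `unionByName` matches columns by name rather than by position. Unless `allowMissingColumns=True` is passed (Spark 3.1+), it expects both sides to have the same column names. The plain-Python model below sketches that behaviour on toy tables. `union_by_name` is a hypothetical stand-in, not Spark's implementation, and the values are made up.

```python
from functools import reduce


def union_by_name(a, b):
    """Append the rows of b to a, matching columns by name rather than by position."""
    columns, rows_a = a
    columns_b, rows_b = b
    if set(columns) != set(columns_b):
        raise ValueError("both tables must have the same column names")
    # Reorder every row of b so that it follows the column order of a
    index_b = {name: i for i, name in enumerate(columns_b)}
    reordered = [tuple(row[index_b[name]] for name in columns) for row in rows_b]
    return columns, rows_a + reordered


# Three toy tables with the same columns in different orders
t1 = (["Id", "Name"], [(1, "A")])
t2 = (["Name", "Id"], [("B", 2)])
t3 = (["Id", "Name"], [(3, "C")])

# reduce evaluates union_by_name(union_by_name(t1, t2), t3)
columns, rows = reduce(union_by_name, [t1, t2, t3])
print(columns, rows)
```

The result keeps the column order of the first table, which is why the missing columns have to exist in every DataFrame before the union.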

We union the DataFrames:

In [62]:
unionDF = joinDFList(booksDFList2)

We clean the DataFrame:
- Cast the integer and double fields to their proper types.
- Strip the text prefix from the RatingDist[5, 4, 3, 2, 1, Total] fields ("5:123" -> "123").
- Swap the month and day columns, since their values are reversed in the source data.

In [144]:
from pyspark.sql.functions import col, regexp_extract
from pyspark.sql.types import DoubleType, IntegerType
df = (unionDF
      # Cast the numeric fields
      .withColumn("Id", col("Id").cast(IntegerType()))
      .withColumn("Rating", col("Rating").cast(DoubleType()))
      .withColumn("PublishYear", col("PublishYear").cast(IntegerType()))
      .withColumn("PublishMonth", col("PublishMonth").cast(IntegerType()))
      .withColumn("PublishDay", col("PublishDay").cast(IntegerType()))
      # Keep only the digits after the "<label>:" prefix (raw strings avoid invalid escape sequences)
      .withColumn("RatingDist5", regexp_extract("RatingDist5", r"5:(\d+)", 1).cast(IntegerType()))
      .withColumn("RatingDist4", regexp_extract("RatingDist4", r"4:(\d+)", 1).cast(IntegerType()))
      .withColumn("RatingDist3", regexp_extract("RatingDist3", r"3:(\d+)", 1).cast(IntegerType()))
      .withColumn("RatingDist2", regexp_extract("RatingDist2", r"2:(\d+)", 1).cast(IntegerType()))
      .withColumn("RatingDist1", regexp_extract("RatingDist1", r"1:(\d+)", 1).cast(IntegerType()))
      .withColumn("RatingDistTotal", regexp_extract("RatingDistTotal", r"total:(\d+)", 1).cast(IntegerType()))
      .withColumn("CountsOfReview", col("CountsOfReview").cast(IntegerType()))
      .withColumn("pagesNumber", col("pagesNumber").cast(IntegerType()))
      .withColumn("Count of text reviews", col("Count of text reviews").cast(IntegerType()))
      .withColumnRenamed("Count of text reviews", "CountOfTextReviews")
      # Swap PublishMonth and PublishDay through a temporary name
      .withColumnRenamed("PublishMonth", "PublishDayFixed")
      .withColumnRenamed("PublishDay", "PublishMonth")
      .withColumnRenamed("PublishDayFixed", "PublishDay")
)
df

DataFrame[PublishYear: int, Rating: double, RatingDistTotal: int, ISBN: string, RatingDist1: int, Publisher: string, PublishDay: int, Id: int, Name: string, Authors: string, RatingDist5: int, RatingDist4: int, PublishMonth: int, RatingDist2: int, pagesNumber: int, RatingDist3: int, CountsOfReview: int, Language: string, Description: string, CountOfTextReviews: int]
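
The prefix stripping can be checked outside Spark with the standard `re` module. The sketch below roughly mimics `regexp_extract(column, pattern, 1)`, which returns capture group 1 or an empty string when the pattern does not match. To my understanding, Spark then turns that empty string into a null when it is cast to an integer under the default (non-ANSI) settings. `extract_count` is a hypothetical helper and the sample values are made up.

```python
import re


def extract_count(value, label):
    """Return the digits that follow '<label>:' in value, or '' when there is no match."""
    match = re.search(rf"{re.escape(label)}:(\d+)", value)
    return match.group(1) if match else ""


five = extract_count("5:123", "5")
total = extract_count("total:4567", "total")
missing = extract_count("n/a", "5")

print(five, total, repr(missing))
```

Here `five` is `"123"`, `total` is `"4567"`, and `missing` is the empty string.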

We save the DataFrame in Parquet format to run the queries against it:

In [87]:
parquet_path = '/home/spark/datasets/parquet'
df.write.format("parquet").mode("overwrite").save(parquet_path)

We read it back as a DataFrame:

In [88]:
booksDF = spark.read.format("parquet").load(parquet_path+'/*.parquet')

We read the files containing the user ratings, the same way we read the books earlier:

In [121]:
ratingFiles = getFiles(dataset_folder, "user")
ratingsDF = joinDFList(list(map(lambda file: (spark
     .read
     .format("csv")
     .option("header", "true")
     .option("quote", "\"")
     .option("escape", "\"")
     .load(file)), ratingFiles)))

## Queries

**1. Average rating of all books**

In [94]:
from pyspark.sql.functions import avg, round
booksDF.select(round(avg("Rating"), 2).alias("Average rating")).show()

+--------------+
|Average rating|
+--------------+
|       3761.15|
+--------------+
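
The per-author and per-publisher averages in the next queries all fall between 0 and 5, so an overall mean in the thousands suggests that a few rows carry out-of-range values in `Rating`. One possible cause is misaligned columns in some CSV rows, but that is an assumption that has not been checked against the files. The plain-Python sketch below, with made-up numbers, shows how a single such value dominates the mean and how a range filter removes it. In Spark the equivalent filter would be `col("Rating").between(0, 5)`.

```python
from statistics import mean

# Made-up ratings: four plausible values and one value from a hypothetical misparsed row
ratings = [4.0, 3.5, 0.0, 3.5, 18800.0]

# Keep only values inside the assumed 0-5 rating scale
valid = [r for r in ratings if 0.0 <= r <= 5.0]

raw_average = mean(ratings)
filtered_average = mean(valid)
print(raw_average, filtered_average)
```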



**2. Average book rating per author**

In [107]:
booksDF \
    .where(col("Rating").isNotNull()) \
    .groupBy("Authors") \
    .agg(round(avg("Rating"), 2).alias("Average rating")) \
.show()

+--------------------+--------------+
|             Authors|Average rating|
+--------------------+--------------+
|    Vera Albuquerque|           4.0|
|       Thierry Lentz|          2.78|
|       Georges Nania|           0.0|
|        Fred Allison|           0.0|
|    Frances Bellerby|           3.0|
|    Nathaniel Harris|          3.36|
|       David   Baird|          3.03|
|      Alison Daniels|          1.53|
|         Ken England|           3.0|
|         Bill Bright|          3.39|
|        Mary O'Brien|          2.11|
|        John Farndon|          3.17|
|   Edgar M. Bronfman|          3.17|
|     Louis Althusser|          3.89|
|Maria Julia Bertomeu|           0.0|
|     Mario Benedetti|          3.95|
|  The New York Times|          2.97|
|    Albert J. Schütz|          2.64|
|      Eloise Jelinek|           0.0|
|      Elizabeth Chan|          3.67|
+--------------------+--------------+
only showing top 20 rows



**3. Average book rating per publisher**

In [108]:
booksDF \
    .where(col("Rating").isNotNull()) \
    .groupBy("Publisher") \
    .agg(round(avg("Rating"), 2).alias("Average rating")) \
.show()

+--------------------+--------------+
|           Publisher|Average rating|
+--------------------+--------------+
|           IVP Books|          3.78|
|    Ycp Publications|          3.95|
|John Benjamins Pu...|          1.53|
|                 DAW|          3.74|
|Regina Press Malh...|          3.05|
| Prospect Books (UK)|          2.94|
|            Capstone|          2.64|
|        Lorenz Books|           3.0|
|       The New Press|          3.77|
|     Militzke Verlag|           0.0|
|         Cleis Press|          3.78|
|Arcadia Publishin...|          3.14|
|      Celestial Arts|          3.33|
|Chicago Review Press|           3.4|
|     Dance Books Ltd|          2.64|
|        Chosen Books|          3.78|
| Research Press (IL)|          2.28|
|Civilized Publica...|          4.15|
| Orange Frazer Press|          2.76|
|   R.W. Secord Press|          3.67|
+--------------------+--------------+
only showing top 20 rows



**4. Average number of pages across all books**

In [109]:
booksDF.select(round(avg("pagesNumber"), 2).alias("Average pages")).show()

+-------------+
|Average pages|
+-------------+
|       534.44|
+-------------+



**5. Average number of pages per author**

In [110]:
booksDF \
    .where(col("pagesNumber").isNotNull()) \
    .groupBy("Authors") \
    .agg(round(avg("pagesNumber"), 2).alias("Average pages")) \
.show()

+--------------------+-------------+
|             Authors|Average pages|
+--------------------+-------------+
|    Vera Albuquerque|        472.0|
|       Thierry Lentz|        332.0|
|       Georges Nania|        916.0|
|        Fred Allison|        452.0|
|    Frances Bellerby|        184.0|
|    Nathaniel Harris|       112.89|
|       David   Baird|       383.92|
|      Alison Daniels|        160.0|
|         Ken England|        409.5|
|         Bill Bright|       271.41|
|        Mary O'Brien|       215.33|
|        John Farndon|       138.07|
|   Edgar M. Bronfman|        226.5|
|     Louis Althusser|       306.33|
|Maria Julia Bertomeu|        199.0|
|     Mario Benedetti|       239.48|
|  The New York Times|       234.08|
|    Albert J. Schütz|       306.67|
|      Eloise Jelinek|        490.0|
|      Elizabeth Chan|         64.0|
+--------------------+-------------+
only showing top 20 rows



**6. Average number of pages per publisher**

In [111]:
booksDF \
    .where(col("pagesNumber").isNotNull()) \
    .groupBy("Publisher") \
    .agg(round(avg("pagesNumber"), 2).alias("Average pages")) \
.show()

+--------------------+-------------+
|           Publisher|Average pages|
+--------------------+-------------+
|           IVP Books|       192.85|
|    Ycp Publications|        280.0|
|John Benjamins Pu...|       325.47|
|                 DAW|       355.01|
|Regina Press Malh...|        75.28|
| Prospect Books (UK)|       281.19|
|            Capstone|       154.04|
|        Lorenz Books|       203.51|
|       The New Press|       290.01|
|     Militzke Verlag|        236.0|
|         Cleis Press|       222.02|
|Arcadia Publishin...|        129.7|
|      Celestial Arts|       194.92|
|Chicago Review Press|       259.42|
|     Dance Books Ltd|        209.0|
|        Chosen Books|       219.05|
| Research Press (IL)|       255.21|
|Civilized Publica...|       200.33|
| Orange Frazer Press|       237.64|
|   R.W. Secord Press|        471.0|
+--------------------+-------------+
only showing top 20 rows



**7. Average number of books published per author**

In [118]:
booksDF \
    .groupBy("Authors") \
    .count() \
    .agg(round(avg("count"), 2).alias("Average number of books per author")) \
.show()

+----------------------------------+
|Average number of books per author|
+----------------------------------+
|                              2.75|
+----------------------------------+
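
This query is a two-step aggregation: first count the books of each author, then average those counts. A minimal plain-Python sketch of the same logic follows. The authors and titles are made up and stand in for rows of `booksDF`.

```python
from collections import Counter
from statistics import mean

# Made-up (author, title) pairs
books = [
    ("Author A", "Title 1"),
    ("Author A", "Title 2"),
    ("Author A", "Title 3"),
    ("Author B", "Title 4"),
    ("Author C", "Title 5"),
    ("Author C", "Title 6"),
]

# Step 1: the equivalent of groupBy("Authors").count()
books_per_author = Counter(author for author, _ in books)

# Step 2: the equivalent of avg("count") over the per-author counts
average_books = mean(books_per_author.values())
print(dict(books_per_author), average_books)
```

Six books spread over three authors give an average of two books per author.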



**8. Top 15 books by number of user ratings, in descending order (excluding entries without a rating)**

In [139]:
from pyspark.sql.functions import count
ratingsDF.groupBy("Name") \
    .agg(count("Name").alias("Ratings")) \
    .where(col("Name") != "Rating") \
    .orderBy(col("Ratings").desc()) \
    .limit(15) \
.show(truncate=False)

+------------------------------------------------------------+-------+
|Name                                                        |Ratings|
+------------------------------------------------------------+-------+
|The Catcher in the Rye                                      |985    |
|The Great Gatsby                                            |885    |
|The Da Vinci Code (Robert Langdon, #2)                      |846    |
|To Kill a Mockingbird                                       |830    |
|1984                                                        |756    |
|The Kite Runner                                             |749    |
|Harry Potter and the Sorcerer's Stone (Harry Potter, #1)    |728    |
|Animal Farm                                                 |717    |
|Harry Potter and the Goblet of Fire (Harry Potter, #4)      |639    |
|Harry Potter and the Prisoner of Azkaban (Harry Potter, #3) |631    |
|Harry Potter and the Order of the Phoenix (Harry Potter, #5)|595    |
|Harry

**9. Top 5 most frequent ratings given by users**

In [142]:
ratingsDF.groupBy("Rating") \
  .count() \
  .orderBy(col("count").desc()) \
  .limit(5) \
.show()

+---------------+------+
|         Rating| count|
+---------------+------+
|really liked it|132808|
|       liked it| 96047|
| it was amazing| 92354|
|      it was ok| 28811|
|did not like it|  7811|
+---------------+------+
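
The group, count, sort, and limit chain above corresponds to a frequency count followed by a top-N selection. A minimal sketch with `collections.Counter`, on a made-up sample of rating labels:

```python
from collections import Counter

# Made-up sample of rating labels
sample = [
    "liked it",
    "really liked it",
    "it was ok",
    "really liked it",
    "liked it",
    "really liked it",
]

# most_common(n) returns the n most frequent labels, ordered by descending count
top_two = Counter(sample).most_common(2)
print(top_two)
```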

