## Books Analysis Using Spark DataFrames

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

In [2]:
spark = SparkSession \
    .builder \
    .appName("Books") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [3]:
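df = spark.read.csv("books.csv", header=True, inferSchema=True, sep=";")
# ratings.csv is read without inferSchema, so its columns are loaded as strings
df1 = spark.read.csv("ratings.csv", header=True, sep=";")
# Drop the cover-image URL columns; they are not needed for the analysis
df = df.drop('Image-URL-S', 'Image-URL-M', 'Image-URL-L')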

In [4]:
df.show()

+----------+--------------------+--------------------+-------------------+--------------------+
|      ISBN|          Book-Title|         Book-Author|Year-Of-Publication|           Publisher|
+----------+--------------------+--------------------+-------------------+--------------------+
|0195153448| Classical Mythology|  Mark P. O. Morford|               2002|Oxford University...|
|0002005018|        Clara Callan|Richard Bruce Wright|               2001|HarperFlamingo Ca...|
|0060973129|Decision in Normandy|        Carlo D'Este|               1991|     HarperPerennial|
|0374157065|Flu: The Story of...|    Gina Bari Kolata|               1999|Farrar Straus Giroux|
|0393045218|The Mummies of Ur...|     E. J. W. Barber|               1999|W. W. Norton &amp...|
|0399135782|The Kitchen God's...|             Amy Tan|               1991|    Putnam Pub Group|
|0425176428|What If?: The Wor...|       Robert Cowley|               2000|Berkley Publishin...|
|0671870432|     PLEADING GUILTY|       

### Number of Ratings for Each Book

In [5]:
df1.groupBy("ISBN").count().join(df, on='ISBN').orderBy(desc("count")).show()

+----------+-----+--------------------+------------------+-------------------+--------------------+
|      ISBN|count|          Book-Title|       Book-Author|Year-Of-Publication|           Publisher|
+----------+-----+--------------------+------------------+-------------------+--------------------+
|0971880107| 2502|         Wild Animus|      Rich Shapero|               2004|             Too Far|
|0316666343| 1295|The Lovely Bones:...|      Alice Sebold|               2002|       Little, Brown|
|0385504209|  883|   The Da Vinci Code|         Dan Brown|               2003|           Doubleday|
|0060928336|  732|Divine Secrets of...|     Rebecca Wells|               1997|           Perennial|
|0312195516|  723|The Red Tent (Bes...|     Anita Diamant|               1998|         Picador USA|
|044023722X|  647|     A Painted House|      John Grisham|               2001|Dell Publishing C...|
|0142001740|  615|The Secret Life o...|     Sue Monk Kidd|               2003|       Penguin Books|


### Books with the Most Versions

A book can have several versions that differ by publisher or edition (hardcover, paperback, revised), and each version has its own ISBN. The query below counts the ISBNs per title to find the books with the most versions.

In [6]:
df.groupBy("Book-Title").count().orderBy(desc("count")).show()

+--------------------+-----+
|          Book-Title|count|
+--------------------+-----+
|      Selected Poems|   27|
|        Little Women|   24|
|   Wuthering Heights|   21|
|             Dracula|   20|
|   The Secret Garden|   20|
|Adventures of Huc...|   20|
|           Jane Eyre|   19|
|The Night Before ...|   18|
| Pride and Prejudice|   18|
|  Great Expectations|   17|
|        Frankenstein|   16|
|          Masquerade|   16|
|        Black Beauty|   16|
|             Beloved|   15|
|            The Gift|   15|
|                Emma|   15|
|             Nemesis|   14|
|          Psychology|   13|
|          The Secret|   13|
|     Robinson Crusoe|   13|
+--------------------+-----+
only showing top 20 rows



### Publishers with the Highest Number of Books Released

In [7]:
df.groupBy("Publisher").count().orderBy(desc("count")).show()

+--------------------+-----+
|           Publisher|count|
+--------------------+-----+
|           Harlequin| 7536|
|          Silhouette| 4220|
|              Pocket| 3905|
|    Ballantine Books| 3783|
|        Bantam Books| 3647|
|          Scholastic| 3160|
|Simon &amp; Schuster| 2971|
|       Penguin Books| 2844|
|Berkley Publishin...| 2771|
|        Warner Books| 2727|
|         Penguin USA| 2717|
|       Harpercollins| 2526|
|       Fawcett Books| 2258|
|         Signet Book| 2070|
|    Random House Inc| 2045|
|       St Martins Pr| 1953|
|  St. Martin's Press| 1783|
|           Tor Books| 1704|
|HarperCollins Pub...| 1701|
|         Zebra Books| 1694|
+--------------------+-----+
only showing top 20 rows



### Publishers with Highest Average Ratings

In [8]:
df.join(df1, df.ISBN == df1.ISBN) \
  .groupBy(df.Publisher).agg({"Book-Rating": "avg", "Publisher": "count"}) \
  .withColumnRenamed("avg(Book-Rating)", "Average_Rating") \
  .orderBy(desc("Average_Rating")).show()

+--------------------+--------------+----------------+
|           Publisher|Average_Rating|count(Publisher)|
+--------------------+--------------+----------------+
|          Glen Adams|          10.0|               1|
|Scribes Valley Pu...|          10.0|               1|
|           Jugglebug|          10.0|               1|
|                XAOX|          10.0|               1|
|         Hermetic Pr|          10.0|               3|
|McGallen &amp; Bo...|          10.0|               1|
|      Dartnell Corp.|          10.0|               1|
|Macdonald and Jane's|          10.0|               2|
|Joshua Odell Edit...|          10.0|               1|
|Madinah Publisher...|          10.0|               1|
|Rocky Mountain Na...|          10.0|               1|
|  Stewart Publishing|          10.0|               1|
|         Veritas Pub|          10.0|               4|
|       Codhill Press|          10.0|               1|
|         Paper Tiger|          10.0|               1|

The results show that most publishers with the highest average rating have only a few ratings, often just one. Average rating alone is therefore a poor criterion for judging the quality of a publisher's books.

### Publishers with Highest Total Ratings

In [9]:
df.join(df1, df.ISBN == df1.ISBN) \
  .groupBy(df.Publisher).agg({"Book-Rating": "sum", "Publisher": "count"}) \
  .withColumnRenamed("sum(Book-Rating)", "Total_Rating") \
  .orderBy(desc("Total_Rating")).show()

+--------------------+------------+----------------+
|           Publisher|Total_Rating|count(Publisher)|
+--------------------+------------+----------------+
|    Ballantine Books|     97269.0|           34724|
|              Pocket|     79930.0|           31989|
|Berkley Publishin...|     69384.0|           28614|
|        Warner Books|     68134.0|           25506|
|              Bantam|     57564.0|           20007|
|        Bantam Books|     54932.0|           23600|
|       Penguin Books|     54572.0|           17033|
|         Signet Book|     51431.0|           19155|
|           Perennial|     45129.0|           13466|
|                Avon|     42469.0|           17352|
|   Vintage Books USA|     36407.0|           11426|
|          Jove Books|     36353.0|           15178|
|           Harlequin|     36037.0|           25029|
|                Dell|     34480.0|           13924|
|          Scholastic|     33493.0|           13662|
|         HarperTorch|     31778.0|           

### Authors with the Highest Number of Books

In [10]:
df.groupBy("Book-Author").count().orderBy(desc("count")).show()

+--------------------+-----+
|         Book-Author|count|
+--------------------+-----+
|     Agatha Christie|  632|
| William Shakespeare|  567|
|        Stephen King|  524|
|       Ann M. Martin|  423|
|     Francine Pascal|  373|
|       Carolyn Keene|  373|
|        Isaac Asimov|  330|
|        Nora Roberts|  315|
|    Barbara Cartland|  307|
|     Charles Dickens|  302|
|Not Applicable (Na )|  286|
|         R. L. Stine|  282|
|          Mark Twain|  231|
|         Jane Austen|  223|
|     Terry Pratchett|  220|
|  Mary Higgins Clark|  218|
|       Piers Anthony|  217|
|Marion Zimmer Bra...|  216|
|        Janet Dailey|  214|
|   Franklin W. Dixon|  204|
+--------------------+-----+
only showing top 20 rows

