**Dataset: dbfs:/FileStore/ctsdatasets/movielens**


- From the movies.csv and ratings.csv datasets, fetch the top 10 movies with the highest average user rating
	- Consider only movies that are rated by at least 50 users
	- Output columns: movieId, title, totalRatings, averageRating
	- Sort the data in descending order of averageRating
	- Round averageRating to 4 decimal places
	- Save the output as a single pipe-separated CSV file with a header
	- Use only DataFrame transformation methods (not SQL)
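
The requirements above can be cross-checked on a small sample without Spark. The following is a minimal pure-Python sketch of the same logic, not the DataFrame solution the exercise asks for. The function name `top_rated` and the sample rows are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def top_rated(ratings, titles, min_ratings=50, top_n=10):
    """Return (movieId, title, totalRatings, averageRating) tuples.

    ratings: iterable of (userId, movieId, rating) tuples
    titles:  dict mapping movieId -> title
    """
    # movieId -> [number of ratings, sum of ratings]
    totals = defaultdict(lambda: [0, 0.0])
    for _user_id, movie_id, rating in ratings:
        totals[movie_id][0] += 1
        totals[movie_id][1] += rating

    # Keep movies with enough ratings and a known title (inner-join semantics)
    rows = [
        (movie_id, titles[movie_id], n, total / n)
        for movie_id, (n, total) in totals.items()
        if n >= min_ratings and movie_id in titles
    ]
    # Sort on the unrounded average, take the top N, then round for display
    rows.sort(key=lambda row: row[3], reverse=True)
    return [(m, t, n, round(a, 4)) for m, t, n, a in rows[:top_n]]


# Hypothetical sample data; the threshold is lowered to 2 to fit the sample
sample_ratings = [
    (1, 10, 4.0), (2, 10, 5.0), (3, 10, 4.5),
    (1, 20, 5.0),
    (1, 30, 3.0), (2, 30, 3.5),
]
sample_titles = {10: "Movie A", 20: "Movie B", 30: "Movie C"}
print(top_rated(sample_ratings, sample_titles, min_ratings=2))
```

Running the same function over a few hundred rows exported from the real files is one way to confirm that the Spark pipeline filters, sorts, and rounds as intended.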

In [2]:
%run "./Setup.ipynb"

In [3]:
movies_file = "E:\\PySpark\\data\\movielens\\movies.csv"
ratings_file = "E:\\PySpark\\data\\movielens\\ratings.csv"

In [4]:
movies_schema = "movieId INT, title STRING, genres STRING"
ratings_schema = "userId INT, movieId INT, rating DOUBLE, timestamp BIGINT"

In [5]:
movies_df = spark.read.csv(movies_file, header=True, schema=movies_schema)
rating_df = spark.read.csv(ratings_file, header=True, schema=ratings_schema)

In [7]:
movies_df.show(5, False)

+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure#Animation#Children#Comedy#Fantasy|
|2      |Jumanji (1995)                    |Adventure#Children#Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy#Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy#Drama#Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
+-------+----------------------------------+-------------------------------------------+
only showing top 5 rows



In [8]:
rating_df.show(5, False)

+------+-------+------+----------+
|userId|movieId|rating|timestamp |
+------+-------+------+----------+
|1     |31     |2.5   |1260759144|
|1     |1029   |3.0   |1260759179|
|1     |1061   |3.0   |1260759182|
|1     |1129   |2.0   |1260759185|
|1     |1172   |4.0   |1260759205|
+------+-------+------+----------+
only showing top 5 rows



In [9]:
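# Import only the functions used below; note that this round() shadows Python's built-in round()
from pyspark.sql.functions import avg, count, desc, explode, round, split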

In [20]:
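average_rating_df = (
    rating_df
    .groupBy("movieId")
    .agg(
        count("rating").alias("totalRatings"),
        avg("rating").alias("averageRating"),
    )
    # Keep only movies with at least 50 ratings
    .where("totalRatings >= 50")
    # Pick the top 10 before joining so the join handles only 10 rows
    .orderBy(desc("averageRating"))
    .limit(10)
    .join(movies_df, "movieId")
    .select("movieId", "title", "totalRatings", "averageRating")
    # The join does not guarantee row order, so sort again
    .orderBy(desc("averageRating"))
    .withColumn("averageRating", round("averageRating", 4))
    # Single partition so that the write produces a single part file
    .coalesce(1)
)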
    
average_rating_df.show(truncate=False)

+-------+--------------------------------+------------+-------------+
|movieId|title                           |totalRatings|averageRating|
+-------+--------------------------------+------------+-------------+
|858    |Godfather, The (1972)           |200         |4.4875       |
|318    |Shawshank Redemption, The (1994)|311         |4.4871       |
|969    |African Queen, The (1951)       |50          |4.42         |
|913    |Maltese Falcon, The (1941)      |62          |4.3871       |
|1221   |Godfather: Part II, The (1974)  |135         |4.3852       |
|50     |Usual Suspects, The (1995)      |201         |4.3706       |
|1228   |Raging Bull (1980)              |50          |4.35         |
|1252   |Chinatown (1974)                |76          |4.3355       |
|904    |Rear Window (1954)              |92          |4.3152       |
|1203   |12 Angry Men (1957)             |74          |4.3041       |
+-------+--------------------------------+------------+-------------+



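In [13]:
average_rating_df.count()

453

Note: this output is from an earlier run (cell 13 precedes cell 20). The DataFrame defined in cell 20 includes `limit(10)`, so it has 10 rows, as the table above shows.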

In [21]:
average_rating_df.rdd.getNumPartitions()

1

In [22]:
outputPath = "E:\\PySpark\\output\\movies"
average_rating_df.write.csv(outputPath, header=True, sep="|")
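
Spark writes `outputPath` as a directory, not as a single file: the data goes into a `part-*.csv` file (one per partition, so one here because of `coalesce(1)`), next to marker files such as `_SUCCESS`. If a plain file with a fixed name is needed, the part file can be copied out afterwards. The sketch below uses only the Python standard library and assumes the output directory is on the local file system; the helper name `collect_single_csv` is hypothetical.

```python
import glob
import os
import shutil


def collect_single_csv(output_dir, target_file):
    """Copy the only part-*.csv file in a Spark output directory to target_file."""
    # Hidden checksum files (.part-*.crc) and _SUCCESS do not match this pattern
    part_files = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(
            f"expected exactly one part file in {output_dir}, found {len(part_files)}"
        )
    shutil.copyfile(part_files[0], target_file)
    return target_file
```

For example, `collect_single_csv(outputPath, "E:\\PySpark\\output\\movies.csv")` would copy the result to a single file; the target path here is only an illustration. The helper raises an error instead of guessing when the directory holds zero or several part files.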

### Example 2

**Number of movies in each genre**

In [24]:
movies_df.show(5, False)

+-------+----------------------------------+-------------------------------------------+
|movieId|title                             |genres                                     |
+-------+----------------------------------+-------------------------------------------+
|1      |Toy Story (1995)                  |Adventure#Animation#Children#Comedy#Fantasy|
|2      |Jumanji (1995)                    |Adventure#Children#Fantasy                 |
|3      |Grumpier Old Men (1995)           |Comedy#Romance                             |
|4      |Waiting to Exhale (1995)          |Comedy#Drama#Romance                       |
|5      |Father of the Bride Part II (1995)|Comedy                                     |
+-------+----------------------------------+-------------------------------------------+
only showing top 5 rows



In [33]:
genres_df = movies_df \
    .select("movieId", explode(split("genres", "#")).alias("genre") ) \
    .groupBy("genre").count() \
    .orderBy(desc("count"))
    
genres_df.show(20, False)

+------------------+-----+
|genre             |count|
+------------------+-----+
|Drama             |4365 |
|Comedy            |3315 |
|Thriller          |1729 |
|Romance           |1545 |
|Action            |1545 |
|Adventure         |1117 |
|Crime             |1100 |
|Horror            |877  |
|Sci-Fi            |792  |
|Fantasy           |654  |
|Children          |583  |
|Mystery           |543  |
|Documentary       |495  |
|Animation         |447  |
|Musical           |394  |
|War               |367  |
|Western           |168  |
|IMAX              |153  |
|Film-Noir         |133  |
|(no genres listed)|18   |
+------------------+-----+
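
The `split` plus `explode` step turns each movie row into one row per genre, and the `groupBy` then counts those rows. The same idea can be sketched in plain Python on the five sample rows shown earlier. This is only an illustration of the logic, not a replacement for the DataFrame code; the function name `count_genres` is hypothetical.

```python
from collections import Counter


def count_genres(movies, sep="#"):
    """Count how many movies fall into each genre.

    movies: iterable of (movieId, title, genres) tuples, where genres is a
    sep-delimited string such as "Comedy#Romance".
    """
    counts = Counter()
    for _movie_id, _title, genres in movies:
        # One increment per genre, like explode(split(...)) followed by groupBy/count
        counts.update(genres.split(sep))
    # Highest count first, like orderBy(desc("count"))
    return counts.most_common()


# The first five rows of movies_df, as displayed above
sample_movies = [
    (1, "Toy Story (1995)", "Adventure#Animation#Children#Comedy#Fantasy"),
    (2, "Jumanji (1995)", "Adventure#Children#Fantasy"),
    (3, "Grumpier Old Men (1995)", "Comedy#Romance"),
    (4, "Waiting to Exhale (1995)", "Comedy#Drama#Romance"),
    (5, "Father of the Bride Part II (1995)", "Comedy"),
]
for genre, n in count_genres(sample_movies):
    print(genre, n)
```

Because a movie with several genres contributes one row per genre, the per-genre counts add up to more than the number of movies.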



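In [30]:
genres_df.count()

20340

Note: this output is from an earlier run (cell 30 precedes cell 33). 20340 equals the sum of the per-genre counts above, that is, the number of (movie, genre) rows after `explode` and before `groupBy`. The grouped `genres_df` defined in cell 33 has 20 rows, one per genre.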