### 6. Consider the movie dataset provided in the previous question. Perform the following operations using PySpark:
### • Find the most active users (users who have rated the most movies).
### • Sort the movie names in alphabetical order.
### • Calculate the average rating per genre.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, explode

In [2]:
spark = SparkSession.builder.appName("Program 6").getOrCreate()

In [3]:
movies_df=spark.read.csv("Datasets/Movies/movies.csv", header=True, inferSchema=True)

In [4]:
ratings_df=spark.read.csv("Datasets/Movies/ratings.csv", header=True, inferSchema=True)

In [5]:
movies_df.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [6]:
ratings_df.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



In [7]:
movies_df.createOrReplaceTempView("movies")

In [8]:
ratings_df.createOrReplaceTempView("ratings")

#### • Find the most active users (users who have rated the most movies).

In [9]:
query1="""
        Select userId, count(rating) as Number_of_Ratings
        from ratings
        group by userId
        order by Number_of_Ratings desc
       """

In [10]:
most_active_users=spark.sql(query1)

In [11]:
print("Most Active Users")
most_active_users.show(5)

Most Active Users
+------+-----------------+
|userId|Number_of_Ratings|
+------+-----------------+
|   414|             2698|
|   599|             2478|
|   474|             2108|
|   448|             1864|
|   274|             1346|
+------+-----------------+
only showing top 5 rows
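As a cross-check on what `query1` computes, the same grouping can be sketched in plain Python with no Spark involved. The rows below are hypothetical sample data, not taken from `ratings.csv`, and the variable names are illustrative. The sketch also shows one assumption worth checking: `count(rating)` counts rating rows, which matches "number of movies rated" only if each user rates a given movie at most once.

```python
from collections import Counter

# Hypothetical (userId, movieId, rating) rows, for illustration only.
ratings = [
    (1, 1, 4.0),
    (1, 3, 4.0),
    (1, 6, 4.0),
    (2, 1, 3.5),
    (2, 1, 4.0),  # user 2 rates movie 1 a second time
    (3, 47, 2.0),
]

# Analogue of: SELECT userId, count(rating) ... GROUP BY userId
ratings_per_user = Counter(user_id for user_id, _movie_id, _rating in ratings)

# Analogue of: ORDER BY Number_of_Ratings DESC, keeping the top 2
most_active = ratings_per_user.most_common(2)

# Counting distinct movies instead of rating rows
movies_per_user = {}
for user_id, movie_id, _rating in ratings:
    movies_per_user.setdefault(user_id, set()).add(movie_id)
distinct_movies_per_user = {u: len(m) for u, m in movies_per_user.items()}

print(most_active)
print(distinct_movies_per_user)
```

In this sample, user 2 has two rating rows but only one distinct movie, so the two counts differ. If that could happen in the real data, `count(DISTINCT movieId)` in the SQL query would be the closer match to the task wording.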



#### • Sort the movie names in alphabetical order

In [12]:
query2= """
        Select * from movies
        order by title
        """

In [13]:
sorted_movies=spark.sql(query2)

In [14]:
print("Movies sorted alphabetically")
sorted_movies.show()

Movies sorted alphabetically
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|   7789|"11'09""01 - Sept...|               Drama|
| 117867|          '71 (2014)|Action|Drama|Thri...|
|  97757|'Hellboy': The Se...|Action|Adventure|...|
|  26564|'Round Midnight (...|       Drama|Musical|
|  27751| 'Salem's Lot (2004)|Drama|Horror|Myst...|
|    779|'Til There Was Yo...|       Drama|Romance|
| 149380|'Tis the Season f...|             Romance|
|   2072|  'burbs, The (1989)|              Comedy|
|   3112|'night Mother (1986)|               Drama|
|  69757|(500) Days of Sum...|Comedy|Drama|Romance|
|   8169|*batteries not in...|Children|Comedy|F...|
|   5706|...All the Marble...|        Comedy|Drama|
|   3420|...And Justice fo...|      Drama|Thriller|
| 157110|00 Schneider - Ja...|        Comedy|Crime|
|    889|   1-900 (06) (1994)|       Drama|Romance|
|   6658|           10 (1979)|     
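
The output above lists titles that begin with quotes, parentheses and digits before any title that begins with a letter. That is consistent with a plain character-code comparison of the strings. The sketch below reproduces the ordering in plain Python using a few untruncated titles from the outputs in this notebook. The `sort_key` helper is a hypothetical alternative that ignores leading punctuation and letter case; it is not part of the original solution.

```python
import re

# Titles copied from the outputs in this notebook (untruncated ones only).
titles = [
    "Toy Story (1995)",
    "10 (1979)",
    "'night Mother (1986)",
    "Jumanji (1995)",
    "'71 (2014)",
    "1-900 (06) (1994)",
    "'burbs, The (1989)",
    "'Salem's Lot (2004)",
]

# Default ordering: compares character codes, so "'" sorts before digits,
# digits before uppercase letters, and uppercase before lowercase.
default_order = sorted(titles)

def sort_key(title):
    """Hypothetical key: drop leading non-alphanumerics and ignore case."""
    return re.sub(r"^[^0-9A-Za-z]+", "", title).casefold()

natural_order = sorted(titles, key=sort_key)

print(default_order)
print(natural_order)
```

With the default ordering, `'Salem's Lot (2004)` comes before `'burbs, The (1989)` because uppercase `S` has a lower character code than lowercase `b`, which matches the Spark output. Whether the second ordering is preferable depends on what "alphabetical" is taken to mean in the exercise.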

#### • Calculate the average rating per genre.

In [15]:
# Split the genres by '|' and explode them to create multiple rows per movie
# The pattern is a regex, so the pipe must be escaped; the raw string keeps the backslash intact
movies_with_genres_df = movies_df.withColumn("genre", explode(split(col("genres"), r"\|")))

In [16]:
movies_with_genres_df.show(5)

+-------+----------------+--------------------+---------+
|movieId|           title|              genres|    genre|
+-------+----------------+--------------------+---------+
|      1|Toy Story (1995)|Adventure|Animati...|Adventure|
|      1|Toy Story (1995)|Adventure|Animati...|Animation|
|      1|Toy Story (1995)|Adventure|Animati...| Children|
|      1|Toy Story (1995)|Adventure|Animati...|   Comedy|
|      1|Toy Story (1995)|Adventure|Animati...|  Fantasy|
+-------+----------------+--------------------+---------+
only showing top 5 rows
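The pattern passed to `split` is a regular expression, which is why the pipe is escaped as `\|`. The sketch below uses Python's `re` module as a stand-in to show the difference. Spark uses Java regular expressions, so the unescaped result is analogous rather than identical. The list comprehension at the end is only a plain-Python analogue of `explode`, not Spark code.

```python
import re

# genres value of movieId 4 as shown in the movies_df output
genres = "Comedy|Drama|Romance"

# Escaped pipe: split on the literal '|' character.
escaped = re.split(r"\|", genres)

# Unescaped pipe: '|' means "empty pattern OR empty pattern",
# so the genre names are not isolated.
unescaped = re.split("|", genres)

# Analogue of explode: one (movieId, genre) row per list element.
exploded_rows = [(4, genre) for genre in escaped]

print(escaped)
print(unescaped == escaped)
print(exploded_rows)
```

Writing the pattern as a raw string (`r"\|"`) avoids Python's invalid-escape-sequence warning for `"\|"` while passing the same two characters to Spark.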



In [17]:
movies_with_genres_df.createOrReplaceTempView("movies_with_genres")

In [18]:
query3 = """
        Select genre, avg(rating) as Average_Rating
        from movies_with_genres join ratings
        on movies_with_genres.movieId = ratings.movieId
        group by genre
        """

In [19]:
genre_ratings=spark.sql(query3)

In [20]:
print("Average Ratings per genre")
genre_ratings.show()

Average Ratings per genre
+------------------+------------------+
|             genre|    Average_Rating|
+------------------+------------------+
|             Crime| 3.658293867274144|
|           Romance|3.5065107040388437|
|          Thriller|3.4937055799183425|
|         Adventure|3.5086089151939075|
|             Drama|3.6561844113718758|
|               War|   3.8082938876312|
|       Documentary| 3.797785069729286|
|           Fantasy|3.4910005070136894|
|           Mystery| 3.632460255407871|
|           Musical|3.5636781053649105|
|         Animation|3.6299370349170004|
|         Film-Noir| 3.920114942528736|
|(no genres listed)|3.4893617021276597|
|              IMAX| 3.618335343787696|
|            Horror| 3.258195034974626|
|           Western| 3.583937823834197|
|            Comedy|3.3847207640898267|
|          Children| 3.412956125108601|
|            Action| 3.447984331646809|
|            Sci-Fi| 3.455721162210752|
+------------------+------------------+
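To make explicit what `query3` averages, here is a minimal plain-Python sketch of the same explode, join and group logic. The two genre strings come from the `movies_df` output shown earlier. The rating rows marked as hypothetical are made up for illustration, and the variable names are not part of the original notebook.

```python
from collections import defaultdict

# movieId -> genres, taken from the movies_df output shown earlier
movies = {3: "Comedy|Romance", 5: "Comedy"}

# (userId, movieId, rating) rows
ratings = [
    (1, 3, 4.0),  # appears in the ratings_df output
    (2, 3, 3.0),  # hypothetical
    (2, 5, 2.0),  # hypothetical
]

# genre -> [sum of ratings, number of ratings]
totals = defaultdict(lambda: [0.0, 0])
for _user_id, movie_id, rating in ratings:
    # str.split takes a literal separator, so no escaping is needed here
    for genre in movies[movie_id].split("|"):
        totals[genre][0] += rating
        totals[genre][1] += 1

average_rating = {genre: s / n for genre, (s, n) in totals.items()}
print(average_rating)
```

Each rating is counted once for every genre of its movie, so a rating for a `Comedy|Romance` movie contributes to both averages. The result is an average over ratings, not an average of per-movie averages. Appending `order by Average_Rating desc` to the query would be one way to rank the genres, since the output above is in no particular order.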



In [21]:
spark.stop()