### **Tasks**:

1. **Load the Dataset**:
   - Read the CSV file into a PySpark DataFrame.

2. **Filter Movies by Genre**:
   - Find all movies in the "Sci-Fi" genre.

3. **Top-Rated Movies**:
   - Find the top 3 highest-rated movies.

4. **Movies Released After 2010**:
   - Select all movies released after the year 2010.

5. **Calculate Average Box Office Collection by Genre**:
   - Group the movies by genre and calculate the average box office collection for each genre.

6. **Add a New Column for Box Office in Billions**:
   - Add a new column that shows the box office collection in billions.

7. **Sort Movies by Box Office Collection**:
   - Sort the movies in descending order based on their box office collection.

8. **Count the Number of Movies per Genre**:
   - Count the number of movies in each genre.

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.3/317.3 MB 3.6 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=d89fc7aba77efaaf49d008b1ec951b996a34a7ed1d83279b556d92e16295d74f
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


Setup

In [2]:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("MovieDataTransformations") \
    .getOrCreate()


1. Load the Dataset

In [3]:
# Load CSV file into DataFrame
df = spark.read.csv("/content/movies.csv", header=True, inferSchema=True)
df.show()


+--------+-----------------+---------+------+----------+----------+
|movie_id|            title|    genre|rating|box_office|      date|
+--------+-----------------+---------+------+----------+----------+
|       1|        Inception|   Sci-Fi|   8.8| 830000000|2010-07-16|
|       2|  The Dark Knight|   Action|   9.0|1004000000|2008-07-18|
|       3|     Interstellar|   Sci-Fi|   8.6| 677000000|2014-11-07|
|       4|Avengers: Endgame|   Action|   8.4|2797000000|2019-04-26|
|       5|    The Lion King|Animation|   8.5|1657000000|1994-06-15|
|       6|      Toy Story 4|Animation|   7.8|1073000000|2019-06-21|
|       7|        Frozen II|Animation|   7.0|1450000000|2019-11-22|
|       8|            Joker|    Drama|   8.5|1074000000|2019-10-04|
|       9|         Parasite|    Drama|   8.6| 258000000|2019-05-30|
|     ```|             NULL|     NULL|  NULL|      NULL|      NULL|
+--------+-----------------+---------+------+----------+----------+
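The last row of the output above has a `movie_id` made of three backticks and NULL everywhere else, which suggests that a Markdown code-fence line was left at the end of `movies.csv`. A minimal sketch of one way to clean the file before Spark reads it, assuming that diagnosis is right; the helper name `strip_fence_lines` and the output path `movies_clean.csv` are hypothetical and not part of the original notebook:

```python
from pathlib import Path

# Three backticks, built programmatically so the marker itself
# never appears literally in this cell.
FENCE = "`" * 3


def strip_fence_lines(src_path, dst_path):
    """Copy a text file, skipping blank lines and Markdown fence lines.

    Returns the number of lines kept (header included).
    """
    kept = []
    for line in Path(src_path).read_text(encoding="utf-8").splitlines():
        stripped = line.strip()
        # Drop empty lines and lines that start with a code fence
        if not stripped or stripped.startswith(FENCE):
            continue
        kept.append(line)
    Path(dst_path).write_text("\n".join(kept) + "\n", encoding="utf-8")
    return len(kept)
```

Calling `strip_fence_lines("/content/movies.csv", "/content/movies_clean.csv")` and then pointing `spark.read.csv` at the cleaned file should remove the stray row. The outputs in this notebook were produced without that step, so the extra row and the NULL genre group still appear in them.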



2. Filter Movies by Genre

In [4]:
sci_fi_movies = df.filter(df.genre == "Sci-Fi")
sci_fi_movies.show()


+--------+------------+------+------+----------+----------+
|movie_id|       title| genre|rating|box_office|      date|
+--------+------------+------+------+----------+----------+
|       1|   Inception|Sci-Fi|   8.8| 830000000|2010-07-16|
|       3|Interstellar|Sci-Fi|   8.6| 677000000|2014-11-07|
+--------+------------+------+------+----------+----------+



3. Top-Rated Movies

In [5]:
top_rated_movies = df.orderBy(df.rating.desc()).limit(3)
top_rated_movies.show()


+--------+---------------+------+------+----------+----------+
|movie_id|          title| genre|rating|box_office|      date|
+--------+---------------+------+------+----------+----------+
|       2|The Dark Knight|Action|   9.0|1004000000|2008-07-18|
|       1|      Inception|Sci-Fi|   8.8| 830000000|2010-07-16|
|       3|   Interstellar|Sci-Fi|   8.6| 677000000|2014-11-07|
+--------+---------------+------+------+----------+----------+
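Interstellar and Parasite both have a rating of 8.6, so ordering by rating alone does not determine which of them takes third place. The plain-Python sketch below (not Spark code) illustrates the idea of a secondary sort key with values copied from the loaded table; choosing `box_office` as the tie-breaker is an assumption made for illustration, not something the task specifies:

```python
# (title, rating, box_office) copied from the df.show() output in step 1
rows = [
    ("Inception", 8.8, 830000000),
    ("The Dark Knight", 9.0, 1004000000),
    ("Interstellar", 8.6, 677000000),
    ("Avengers: Endgame", 8.4, 2797000000),
    ("The Lion King", 8.5, 1657000000),
    ("Toy Story 4", 7.8, 1073000000),
    ("Frozen II", 7.0, 1450000000),
    ("Joker", 8.5, 1074000000),
    ("Parasite", 8.6, 258000000),
]

# Sort by rating (descending), then by box office (descending) to break ties
top3 = sorted(rows, key=lambda r: (-r[1], -r[2]))[:3]
top3_titles = [title for title, _, _ in top3]
```

With this tie-breaker the third title is Interstellar, matching the output above. The same idea in PySpark would be `df.orderBy(df.rating.desc(), df.box_office.desc()).limit(3)`.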



4. Movies Released After 2010

In [6]:
from pyspark.sql.functions import year

# Extract the release year from the date column, then keep years after 2010
df_with_year = df.withColumn("year", year(df.date))
movies_after_2010 = df_with_year.filter(df_with_year.year > 2010)
movies_after_2010.show()


+--------+-----------------+---------+------+----------+----------+----+
|movie_id|            title|    genre|rating|box_office|      date|year|
+--------+-----------------+---------+------+----------+----------+----+
|       3|     Interstellar|   Sci-Fi|   8.6| 677000000|2014-11-07|2014|
|       4|Avengers: Endgame|   Action|   8.4|2797000000|2019-04-26|2019|
|       6|      Toy Story 4|Animation|   7.8|1073000000|2019-06-21|2019|
|       7|        Frozen II|Animation|   7.0|1450000000|2019-11-22|2019|
|       8|            Joker|    Drama|   8.5|1074000000|2019-10-04|2019|
|       9|         Parasite|    Drama|   8.6| 258000000|2019-05-30|2019|
+--------+-----------------+---------+------+----------+----------+----+



5. Calculate Average Box Office Collection by Genre

In [7]:
from pyspark.sql.functions import avg

average_box_office_by_genre = df.groupBy("genre").agg(avg("box_office").alias("average_box_office"))
average_box_office_by_genre.show()


+---------+--------------------+
|    genre|  average_box_office|
+---------+--------------------+
|     NULL|                NULL|
|    Drama|              6.66E8|
|Animation|1.3933333333333333E9|
|   Action|            1.9005E9|
|   Sci-Fi|             7.535E8|
+---------+--------------------+
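The averages above can be cross-checked by hand. The sketch below redoes the arithmetic in plain Python (not Spark) from the `genre` and `box_office` values shown in step 1. It leaves out the stray all-NULL row, which is what produces the NULL group in the Spark output:

```python
from collections import defaultdict

# (genre, box_office) pairs copied from the df.show() output in step 1
rows = [
    ("Sci-Fi", 830000000),
    ("Action", 1004000000),
    ("Sci-Fi", 677000000),
    ("Action", 2797000000),
    ("Animation", 1657000000),
    ("Animation", 1073000000),
    ("Animation", 1450000000),
    ("Drama", 1074000000),
    ("Drama", 258000000),
]

# Collect the box office values per genre
by_genre = defaultdict(list)
for genre, box_office in rows:
    by_genre[genre].append(box_office)

# Arithmetic mean per genre
averages = {genre: sum(values) / len(values) for genre, values in by_genre.items()}
```

For example, Drama is (1074000000 + 258000000) / 2 = 666000000, which Spark prints as `6.66E8`.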



6. Add a New Column for Box Office in Billions

In [8]:
from pyspark.sql.functions import col

# Divide by 1e9 (one billion) to express the collection in billions
df_with_box_office_in_billions = df.withColumn("box_office_billions", col("box_office") / 1e9)
df_with_box_office_in_billions.show()


+--------+-----------------+---------+------+----------+----------+-------------------+
|movie_id|            title|    genre|rating|box_office|      date|box_office_billions|
+--------+-----------------+---------+------+----------+----------+-------------------+
|       1|        Inception|   Sci-Fi|   8.8| 830000000|2010-07-16|               0.83|
|       2|  The Dark Knight|   Action|   9.0|1004000000|2008-07-18|              1.004|
|       3|     Interstellar|   Sci-Fi|   8.6| 677000000|2014-11-07|              0.677|
|       4|Avengers: Endgame|   Action|   8.4|2797000000|2019-04-26|              2.797|
|       5|    The Lion King|Animation|   8.5|1657000000|1994-06-15|              1.657|
|       6|      Toy Story 4|Animation|   7.8|1073000000|2019-06-21|              1.073|
|       7|        Frozen II|Animation|   7.0|1450000000|2019-11-22|               1.45|
|       8|            Joker|    Drama|   8.5|1074000000|2019-10-04|              1.074|
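|       9|         Parasite|    Drama|   8.6| 258000000|2019-05-30|              0.258|
|     ```|             NULL|     NULL|  NULL|      NULL|      NULL|               NULL|
+--------+-----------------+---------+------+----------+----------+-------------------+
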

7. Sort Movies by Box Office Collection

In [9]:
sorted_movies = df.orderBy(df.box_office.desc())
sorted_movies.show()


+--------+-----------------+---------+------+----------+----------+
|movie_id|            title|    genre|rating|box_office|      date|
+--------+-----------------+---------+------+----------+----------+
|       4|Avengers: Endgame|   Action|   8.4|2797000000|2019-04-26|
|       5|    The Lion King|Animation|   8.5|1657000000|1994-06-15|
|       7|        Frozen II|Animation|   7.0|1450000000|2019-11-22|
|       8|            Joker|    Drama|   8.5|1074000000|2019-10-04|
|       6|      Toy Story 4|Animation|   7.8|1073000000|2019-06-21|
|       2|  The Dark Knight|   Action|   9.0|1004000000|2008-07-18|
|       1|        Inception|   Sci-Fi|   8.8| 830000000|2010-07-16|
|       3|     Interstellar|   Sci-Fi|   8.6| 677000000|2014-11-07|
|       9|         Parasite|    Drama|   8.6| 258000000|2019-05-30|
|     ```|             NULL|     NULL|  NULL|      NULL|      NULL|
+--------+-----------------+---------+------+----------+----------+



8. Count the Number of Movies per Genre

In [10]:
movie_count_by_genre = df.groupBy("genre").count()
movie_count_by_genre.show()


+---------+-----+
|    genre|count|
+---------+-----+
|     NULL|    1|
|    Drama|    2|
|Animation|    3|
|   Action|    2|
|   Sci-Fi|    2|
+---------+-----+

