# **IMDb Rating**

## **Problem Statement**
You are given the **IMDb dataset** consisting of two tables:

### **Table: IMDB**
| Column Name | Type  |
|-------------|-------|
| `movie_id`  | int   |
| `title`     | string |
| `rating`    | float  |
| `budget`    | int    |

### **Table: Genre**
| Column Name | Type  |
|-------------|-------|
| `movie_id`  | int   |
| `genre`     | string |

### **Objective**
Retrieve the **title** and **rating** of movies that meet the following conditions:
- Were released in **2014**. The release year is part of the title, so the check is that `title` contains `'2014'`.
- Belong to a **genre starting with 'C'**.
- Have a **budget exceeding 4 crore (40,000,000 INR)**.
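
The three conditions can be sketched as a single predicate in plain Python before involving Spark. The sample rows, the `qualifies` helper, and the `BUDGET_THRESHOLD` name below are hypothetical and only illustrate the logic; they are not taken from the dataset.

```python
# Hypothetical sample rows; the values are illustrative, not taken from the dataset.
movies = [
    {"movie_id": 1, "title": "Sample Caper (2014)", "rating": 7.5, "budget": 50_000_000},
    {"movie_id": 2, "title": "Sample Drama (2014)", "rating": 8.1, "budget": 60_000_000},
    {"movie_id": 3, "title": "Sample Comedy (2013)", "rating": 6.9, "budget": 45_000_000},
    {"movie_id": 4, "title": "Low Budget Crime (2014)", "rating": 7.0, "budget": 40_000_000},
]
# Assumption for this sketch: exactly one genre per movie.
genres = {1: "Crime", 2: "Drama", 3: "Comedy", 4: "Crime"}

BUDGET_THRESHOLD = 4 * 10**7  # 4 crore = 4 x 10,000,000 = 40,000,000


def qualifies(movie, genre):
    """Return True when all three conditions from the objective hold."""
    return (
        "2014" in movie["title"]                 # release year appears in the title
        and genre.startswith("C")                # genre begins with an uppercase 'C'
        and movie["budget"] > BUDGET_THRESHOLD   # strictly greater than 4 crore
    )


selected = [
    (m["title"], m["rating"])
    for m in movies
    if qualifies(m, genres[m["movie_id"]])
]
print(selected)
```

Note that movie 4 is excluded because its budget equals the threshold, and the condition asks for a budget strictly above it.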

---

## **Approach 1: PySpark DataFrame API**
### **Steps**
1. **Initialize Spark Session**
2. **Create DataFrames for `IMDB` and `Genre` Tables**
3. **Join the DataFrames on `movie_id`**
4. **Apply Filters** (`genre LIKE 'C%'`, `title CONTAINS '2014'`, `budget > 40000000`)
5. **Select `title` and `rating` Columns**
6. **Display the Result**

### **Code**

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Step 1: Initialize Spark Session
spark = SparkSession.builder.appName("IMDbMovies").getOrCreate()

# Step 2: Create DataFrames for IMDB and Genre Tables
imdb_df = spark.read.csv("Files/csv/IMDB.csv", header=True, inferSchema=True)
genre_df = spark.read.csv("Files/csv/genre.csv", header=True, inferSchema=True)

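# Step 3: Join the DataFrames on movie_id
# Note: the genre filter in Step 4 discards rows with no matching genre (null),
# so this left join returns the same rows an inner join would.
merged_df = imdb_df.join(genre_df, "movie_id", "left")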

# Step 4: Apply Filters
filtered_df = merged_df.filter(
    (col("genre").startswith("C")) & 
    (col("title").contains("2014")) & 
    (col("budget") > 40000000)
)

# Step 5: Select Required Columns
result_df = filtered_df.select("title", "rating")

# Step 6: Display the Result
result_df.show()


+--------------------+------+
|               title|rating|
+--------------------+------+
|    Gone Girl (2014)|     8|
|Kingsman: The Sec...|     8|
+--------------------+------+
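
One caveat worth checking: a join produces one row per (movie, genre) pair, so a movie tagged with two genres that both start with 'C' would appear twice in the result. The plain-Python sketch below mimics the join and filters on hypothetical rows (not from the dataset) to show the effect and one way to de-duplicate.

```python
# Hypothetical rows: one movie tagged with three genres, two of which start with "C".
imdb_rows = [(1, "Sample Caper (2014)", 7.5, 50_000_000)]
genre_rows = [(1, "Crime"), (1, "Comedy"), (1, "Drama")]

# Equi-join on movie_id, followed by the same three filters as the DataFrame code.
joined = [
    (title, rating)
    for movie_id, title, rating, budget in imdb_rows
    for g_movie_id, genre in genre_rows
    if movie_id == g_movie_id
    and genre.startswith("C")
    and "2014" in title
    and budget > 40_000_000
]

# Remove repeated (title, rating) pairs while keeping the original order.
unique = list(dict.fromkeys(joined))

print(len(joined), len(unique))
```

In PySpark, calling `result_df.distinct()` would have a comparable effect, assuming repeated `(title, rating)` rows are not wanted.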



---

## **Approach 2: SQL Query in PySpark**
### **Steps**
1. **Create Spark Session**
2. **Create DataFrames for `IMDB` and `Genre` Tables**
3. **Register them as SQL Views**
4. **Write and Execute SQL Query**
5. **Display the Output**


### **Code**

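# Steps 1-2: Reuse the Spark session (`spark`) and the DataFrames
# (`imdb_df`, `genre_df`) created in Approach 1.

# Step 3: Register the DataFrames as SQL views
imdb_df.createOrReplaceTempView("IMDB")
genre_df.createOrReplaceTempView("Genre")

# Step 4: Write and execute the SQL query
# Columns are qualified with the table alias to avoid ambiguity after the join.
sql_query = """
SELECT i.title, i.rating
FROM IMDB i
LEFT JOIN Genre g ON i.movie_id = g.movie_id
WHERE g.genre LIKE 'C%'
  AND i.title LIKE '%2014%'
  AND i.budget > 40000000
"""

result_sql = spark.sql(sql_query)

# Step 5: Display the output
result_sql.show()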



+--------------------+------+
|               title|rating|
+--------------------+------+
|    Gone Girl (2014)|     8|
|Kingsman: The Sec...|     8|
+--------------------+------+
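
The query logic can be sanity-checked without a Spark session by running it against an in-memory SQLite database from Python's standard library. This is a stand-in for Spark SQL, not the same engine, and the inserted rows are hypothetical. One known difference: SQLite's `LIKE` is case-insensitive for ASCII letters by default, whereas Spark's `LIKE` is case-sensitive, so the sample genres are all capitalized to keep the two consistent.

```python
import sqlite3

# In-memory database with the same two tables; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE IMDB (movie_id INTEGER, title TEXT, rating REAL, budget INTEGER);
CREATE TABLE Genre (movie_id INTEGER, genre TEXT);
INSERT INTO IMDB VALUES
    (1, 'Sample Caper (2014)', 7.5, 50000000),
    (2, 'Sample Drama (2014)', 8.1, 60000000),
    (3, 'Sample Comedy (2013)', 6.9, 45000000),
    (4, 'No Genre Listed (2014)', 7.2, 90000000);
INSERT INTO Genre VALUES (1, 'Crime'), (2, 'Drama'), (3, 'Comedy');
""")

# Same join and WHERE clause as the Spark SQL query.
query = """
SELECT i.title, i.rating
FROM IMDB i
LEFT JOIN Genre g ON i.movie_id = g.movie_id
WHERE g.genre LIKE 'C%'
  AND i.title LIKE '%2014%'
  AND i.budget > 40000000
"""
rows = conn.execute(query).fetchall()
conn.close()

print(rows)
```

Movie 4 has no row in `Genre`, so the left join gives it a null genre and the `LIKE 'C%'` condition removes it. In this query the `LEFT JOIN` therefore returns the same rows an inner join would.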



---

## **Summary**
| Approach  | Method                      | Key operations |
|-----------|-----------------------------|----------------|
| **Approach 1** | PySpark DataFrame API    | Uses `join()`, `filter()`, `select()` |
| **Approach 2** | SQL Query in PySpark     | Uses `LEFT JOIN` and a `WHERE` clause with `LIKE` and `budget > 40000000` |
