# EDA And Data Profiling with Pandas

Given your movie dataset, perform the following tasks: <br/>
- Load the dataset into a Pandas DataFrame.
- Perform exploratory data analysis (EDA) to understand the structure and content of the dataset.
- Clean the data:
    - Handle missing values
    - Remove any duplicates
    - Correct data types if necessary
    - Determine how to handle any outliers
    - Determine which columns have little to no analytical worth and drop them
<br/>

Answer the following questions based on your cleaned dataframe:
1. How many unique genres exist in the dataset?
    - 23
    - What are the top 5 most common genres?
        1. Drama: 726
        2. Comedy: 289
        3. Thriller: 229
        4. Adventure: 206
        5. Romance: 206
2. What is the newest movie in the dataset? The oldest?
    - Newest: Chak De! India
    - Oldest: The Cabinet of Dr. Caligari
3. What are the top 10 highest-rated movies (by IMDb rating)?

    1. The Godfather Trilogy: 1901-1980           | 9.3
    2. The Shawshank Redemption                    | 9.3
    3. The Godfather                               | 9.2
    4. The Chaos Class                             | 9.2
    5. The Dark Knight                             | 9.1
    6. Schindler's List                            | 9.0
    7. CM101MMXI Fundamentals                      | 9.0
    8. 12 Angry Men                                | 9.0
    9. The Godfather Part II                       | 9.0
    10. The Lord of the Rings: The Return of the King | 9.0

4. What is the distribution of IMDb ratings (mean, median, standard deviation)?
    - Mean: 7.84
    - Median: 7.90
    - Standard Deviation: 0.4678
5. How do average IMDb ratings vary by genre?

- Documentary    8.027273
- Film-Noir      7.988889
- War            7.952500
- Animation      7.940299
- Sport          7.916000
- Biography      7.893069
- Mystery        7.892500
- Crime          7.883417
- History        7.877049
- Drama          7.873416
- Short          7.860000
- Family         7.853409
- Fantasy        7.850000
- Western        7.840000
- Thriller       7.824454
- Adventure      7.821359
- Sci-Fi         7.790833
- Action         7.773370
- Music          7.758824
- Musical        7.758537
- Romance        7.715534
- Comedy         7.704844
- Horror         7.680488

6. What director has the most movies in the dataset?
 - Alfred Hitchcock with 13 movies
7. Is there a relationship between the Meta_score and IMDB_Rating? (if applicable)
- N/A
8. What stars appear most frequently in the dataset? (if applicable)
- N/A
9. Average Runtime per year (if applicable)
- See the output of the corresponding code cell below.
10. Shortest movie for each genre (if applicable)
- See the output of the corresponding code cell below.
11. Longest movie for each genre (if applicable)
- See the output of the corresponding code cell below.

> **Note**  
> You will not be turning in this exercise. <br/>
> However, you will be using this in a group activity tomorrow.

In [None]:
import pandas as pd
import numpy as np

#### Step 1: Load and clean data

In [None]:
# Load the data
fpath = r"C:\Users\stanl\20251124-EY-Azure-Data-Engineering\Python\Data\Movie Data\imdb-top-rated-movies-user-rated-kaggle.csv"
# Skip the first row of the file, which contains bad data
df = pd.read_csv(fpath, skiprows=1)
# Work on a copy so the original values stay available
movies = df.copy()
# movies.head()

In [None]:
# gets sum of all empty values
movies.isna().sum()

In [None]:
movies.info()

In [None]:
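# Rows with a missing director (not displayed, because it is not the last expression in the cell)
movies[movies["Directors"].isna()]
# Replace missing values in Directors with the string "Unknown"
movies["Directors"] = movies["Directors"].fillna("Unknown")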

In [None]:
# Not sure what to fill a missing date with, so rows without a release date are left as-is for now
movies[movies["Release Date"].isna()]

In [None]:
# Original Title, URL and Title Type do not seem to have much analytical use, so we drop them
movies.drop(columns=["Original Title", "URL", "Title Type"], inplace=True)

# What is Position based on? Maybe drop it too?
# movies.drop(columns="Position", inplace=True)


In [None]:
movies.head()
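
The task list also asks for duplicate removal and an outlier check, which the cells above do not cover. A minimal sketch on a small made-up frame follows. The rows and the `toy` name are hypothetical, the column names mirror the ones used in this notebook, and the 1.5 × IQR rule is only one common way to define an outlier:

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the ones used in this notebook
toy = pd.DataFrame({
    "Title": ["A", "A", "B", "C", "D", "E"],
    "Runtime (mins)": [100, 100, 110, 95, 105, 400],
})

# Drop rows that are exact duplicates across all columns
toy = toy.drop_duplicates().reset_index(drop=True)

# Flag runtimes more than 1.5 * IQR outside the middle 50% of values
q1 = toy["Runtime (mins)"].quantile(0.25)
q3 = toy["Runtime (mins)"].quantile(0.75)
iqr = q3 - q1
is_outlier = (toy["Runtime (mins)"] < q1 - 1.5 * iqr) | (toy["Runtime (mins)"] > q3 + 1.5 * iqr)

print(toy[is_outlier])
```

Flagged rows are worth inspecting before dropping anything: a very long runtime may be a genuine value rather than a data error.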

### Dataframe is now cleaned
### Answers to questions

##### 1. How many unique genres exist in the dataset? 
##### Answer: 23
#####    1. What are the top 5 most common genres?
##### Answer: 1. Drama : 726, 2. Comedy : 289, 3. Thriller : 229, 4. Adventure : 206, 5. Romance : 206

In [None]:
# some movies have multiple genres
movies["Genres"].unique()

In [None]:
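# Split each comma-separated genre string into a list, then flatten everything into one long list.
# Rows with no genre are dropped first so the loop never sees NaN.
genres_split = movies["Genres"].dropna().str.split(",")
# strip() removes the space that follows each comma (list comprehension found with help from Google)
all_genres = [g.strip() for sublist in genres_split for g in sublist]
all_genres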

In [None]:
# we find the unique genres
# sets make it so only unique values are considered
unique_genres = set(all_genres)
print(len(unique_genres))

In [None]:
# dictionary that counts how often each genre appears
genre_map = {}

# used AI for this part
for genre in all_genres:
    genre_map[genre] = genre_map.get(genre, 0) + 1

top_5 = sorted(genre_map.items(), key=lambda x: x[1], reverse=True)[:5]

print(f"Top 5 genres: {top_5}")
#movies.head(20)
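
The same counts can probably be obtained without a manual loop by chaining `explode` and `value_counts`. A minimal sketch on made-up data (the `toy` frame and its rows are hypothetical; only the comma-separated `Genres` shape is taken from this notebook):

```python
import pandas as pd

# Hypothetical toy data in the same "comma-separated genres" shape
toy = pd.DataFrame({"Genres": ["Drama, Crime", "Drama", "Comedy, Drama", "Crime"]})

genre_counts = (
    toy["Genres"]
    .dropna()            # skip rows with no genre listed
    .str.split(",")      # "Drama, Crime" -> ["Drama", " Crime"]
    .explode()           # one row per genre
    .str.strip()         # remove the leading space left by the split
    .value_counts()      # counts, most common genre first
)

print(genre_counts.size)     # number of unique genres
print(genre_counts.head(5))  # most common genres
```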

##### 2. What is the newest movie in the dataset? The oldest?
##### Answer: 
##### Newest: Chak De! India
##### Oldest: The Cabinet of Dr. Caligari

In [None]:
movies.head(5)

In [None]:
# Convert the release date strings into datetime values so they can be sorted and compared.
# errors="coerce" turns any value that cannot be parsed into NaT; the one missing date stays NaT.
movies["Release Date"] = pd.to_datetime(movies["Release Date"], errors="coerce")
movies.sort_values("Release Date", inplace=True)
# Dates display as year-month-day; rows with NaT are sorted to the end
movies.head(5)

In [None]:
# finds the smallest and largest values of time, which gives us newest and oldest movies
oldest_date = movies["Release Date"].min()
newest_date = movies["Release Date"].max()


In [None]:
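# Look up the movie(s) released on those dates
oldest_movie = movies[movies["Release Date"] == oldest_date]
newest_movie = movies[movies["Release Date"] == newest_date]

# Single quotes inside the f-string keep this valid on Python versions before 3.12;
# .tolist() prints plain titles instead of a whole Series
print(f"Oldest movie is: {oldest_movie['Title'].tolist()} \nNewest movie is: {newest_movie['Title'].tolist()}")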
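
A shorter route to the same answer is `idxmin()` / `idxmax()`, which return the index label of the earliest and latest date. A minimal sketch on made-up rows (the titles and dates in `toy` are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows, including one missing release date
toy = pd.DataFrame({
    "Title": ["Old Film", "Undated Film", "New Film"],
    "Release Date": ["1920-02-27", None, "2007-08-10"],
})
toy["Release Date"] = pd.to_datetime(toy["Release Date"], errors="coerce")

# idxmin()/idxmax() skip NaT and return the index label of the earliest/latest date
oldest_title = toy.loc[toy["Release Date"].idxmin(), "Title"]
newest_title = toy.loc[toy["Release Date"].idxmax(), "Title"]

print(f"Oldest: {oldest_title}\nNewest: {newest_title}")
```

Note that this returns only the first match when several movies share the same date, whereas filtering on the min/max date returns all of them.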

##### 3. What are the top 10 highest-rated movies (by IMDb rating)?

In [None]:
# Show the 10 highest-rated movies with their title and rating (the row index is shown as well)
movies.sort_values("IMDb Rating", ascending=False)[['Title', 'IMDb Rating']].head(10)
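
Several movies in the answer list share a 9.0 rating, so where the list is cut off at 10 can depend on row order. A minimal sketch, on hypothetical data, of how `nlargest` with `keep="all"` makes such ties visible:

```python
import pandas as pd

# Hypothetical toy ratings with a tie at 9.0
toy = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "IMDb Rating": [9.0, 9.3, 9.0, 8.5],
})

# Default keep="first": exactly 2 rows, the tie is broken by row order
top2 = toy.nlargest(2, "IMDb Rating")[["Title", "IMDb Rating"]]

# keep="all": every row tied with the last place is included, so more than 2 rows may come back
top2_with_ties = toy.nlargest(2, "IMDb Rating", keep="all")[["Title", "IMDb Rating"]]

print(top2)
print(top2_with_ties)
```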

##### 4. What is the distribution of IMDb ratings (mean, median, standard deviation)?

In [None]:
# describe() reports the mean and standard deviation, plus the median as the 50% row
# movies["IMDb Rating"].describe()

# we can also do this
rating_mean = movies["IMDb Rating"].mean()
rating_median = movies["IMDb Rating"].median()
rating_std = movies["IMDb Rating"].std()

print(f"Mean: {rating_mean:.2f}")
print(f"Median: {rating_median:.2f}")
print(f"Standard Deviation: {rating_std:.4f}")

##### 5. How do average IMDb ratings vary by genre?

In [None]:
# done with help from AI and google
# we split the genres into different lists
movies_exploded = movies.copy()
# str.split converts the string into multiple genres (since a movie can have more than one)
movies_exploded['Genres'] = movies_exploded['Genres'].str.split(',')
# we turn the lists into rows
movies_exploded = movies_exploded.explode('Genres')
# cleans whitespace
movies_exploded['Genres'] = movies_exploded['Genres'].str.strip()
# groups/aggregates the data
avg_ratings = movies_exploded.groupby('Genres')['IMDb Rating'].mean()

In [None]:
# our result
avg_ratings_by_genre = movies_exploded.groupby('Genres')['IMDb Rating'].mean().sort_values(ascending=False)
print(avg_ratings_by_genre)

##### 6. What director has the most movies in the dataset?

In [None]:
# similar to the solution for problem 5
# split multiple directors and explode into separate rows
movies_directors = movies.copy()
movies_directors['Directors'] = movies_directors['Directors'].str.split(',')
movies_directors = movies_directors.explode('Directors')
movies_directors['Directors'] = movies_directors['Directors'].str.strip()  # removes whitespaces

In [None]:
director_counts = movies_directors['Directors'].value_counts()
director_counts.head(10)

##### 9. Average Runtime per year (if applicable)

In [None]:
movies.head()

In [None]:
# movies['Runtime (mins)'] = pd.to_numeric(movies["Runtime (mins)"])
# don't need to convert actually

In [None]:
avg_runtime_per_year = movies.groupby('Year')['Runtime (mins)'].mean().sort_index()

# Show every row (this changes the display option for the whole session); Jupyter may still truncate very long output
pd.set_option('display.max_rows', None)
print(avg_runtime_per_year)
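
If changing the display option for the whole session is not wanted, two alternatives are `pd.option_context` and `Series.to_string()`. A minimal sketch on hypothetical rows (the `toy` frame is made up; the column names mirror this notebook):

```python
import pandas as pd

# Hypothetical toy rows
toy = pd.DataFrame({
    "Year": [1994, 1994, 2008],
    "Runtime (mins)": [142, 154, 152],
})
avg_runtime = toy.groupby("Year")["Runtime (mins)"].mean().sort_index()

# Lift the row limit only inside this block; the global setting is restored afterwards
with pd.option_context("display.max_rows", None):
    print(avg_runtime)

# to_string() renders every row regardless of the display options
full_text = avg_runtime.to_string()
print(full_text)
```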

##### 10. Shortest movie for each genre (if applicable)

In [None]:
# Find the shortest movie per genre.
# An earlier attempt (with help from AI) called idxmin() directly on movies_exploded and repeated movies:
# explode() keeps the original row label for every genre of a movie, so .loc with that label
# returned all of the movie's genre rows. Resetting the index gives each row a unique label.
exploded_unique = movies_exploded.reset_index(drop=True)
shortest_idx = exploded_unique.groupby('Genres')['Runtime (mins)'].idxmin()
shortest_per_genre = exploded_unique.loc[shortest_idx, ['Genres', 'Title', 'Runtime (mins)']].sort_values('Runtime (mins)')
print(shortest_per_genre)
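
One likely reason for repeated movies in this kind of lookup: `explode` keeps the original row label for each genre row, so `.loc` with a label from `idxmin()` matches every genre row of that movie. A small sketch on made-up data (the `toy` rows are hypothetical) shows the repeated labels and a unique-index variant:

```python
import pandas as pd

# Hypothetical toy rows; the first movie has two genres
toy = pd.DataFrame({
    "Title": ["Short Film", "Long Film"],
    "Genres": ["Drama, Comedy", "Drama"],
    "Runtime (mins)": [80, 200],
})
exploded = toy.assign(Genres=toy["Genres"].str.split(",")).explode("Genres")
exploded["Genres"] = exploded["Genres"].str.strip()

# Index labels repeat after explode: [0, 0, 1]
print(exploded.index.tolist())

# With a unique index, each idxmin() label points at exactly one row per genre
unique_rows = exploded.reset_index(drop=True)
shortest = unique_rows.loc[
    unique_rows.groupby("Genres")["Runtime (mins)"].idxmin(),
    ["Genres", "Title", "Runtime (mins)"],
]
print(shortest)
```

Passing `ignore_index=True` to `explode` would have a similar effect, at the cost of losing the link back to the original row labels.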



In [None]:
movies.head(5)

##### 11. Longest movie for each genre (if applicable)

In [None]:
# Find the longest movie per genre.
# .idxmax() returns the index label of the row where the max value in a Series occurs.
# explode() repeats the original row label for every genre of a movie, so the index is reset
# first; otherwise .loc would return all genre rows of each matched movie.
exploded_unique = movies_exploded.reset_index(drop=True)
longest_idx = exploded_unique.groupby('Genres')['Runtime (mins)'].idxmax()
longest_per_genre = exploded_unique.loc[longest_idx, ['Genres', 'Title', 'Runtime (mins)']].sort_values('Runtime (mins)', ascending=False)
print(longest_per_genre)