
# Movie Data Analysis Notebook

This notebook analyzes the movie dataset to answer the questions listed below. The dataset consists of four files:
- `movies.csv`
- `ratings.csv`
- `tags.csv`
- `links.csv`

## Questions Answered in this Notebook:
1. How many ".csv" files are available in the dataset?
2. What is the shape of "movies.csv"?
3. What is the shape of "ratings.csv"?
4. How many unique "userId" are available in "ratings.csv"?
5. Which movie has received the maximum number of user ratings?
6. Select all the correct tags submitted by users to "Matrix, The (1999)" movie.
7. What is the average user rating for "Terminator 2: Judgment Day (1991)"?
8. What does the distribution of user ratings for "Fight Club (1999)" look like?
9. Most popular movie based on average user ratings after applying the mandatory operation (aggregate ratings per movie, merge with "movies.csv", keep movies with more than 50 ratings).
10. Top 5 popular movies based on number of user ratings.
11. Third most popular Sci-Fi movie based on the number of user ratings.
12. Highest IMDb rated movie.
13. Highest IMDb rated Sci-Fi movie.

---

### Important Notes:
- Ensure that you have all required files in the same directory as this notebook.
- Install required libraries using `pip install pandas matplotlib requests beautifulsoup4`.
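
As a quick way to confirm the first note before loading anything, the sketch below checks that the four files named above are present. The helper name `find_missing_files` and the constant `REQUIRED_FILES` are illustrative assumptions, not part of the original notebook.

```python
from pathlib import Path

# File names taken from the dataset list at the top of this notebook.
REQUIRED_FILES = ["movies.csv", "ratings.csv", "tags.csv", "links.csv"]


def find_missing_files(directory=".", required=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in required if not (base / name).is_file()]


# Hypothetical usage: check the notebook's working directory.
missing = find_missing_files()
if missing:
    print(f"Missing files: {missing}")
else:
    print("All required files found.")
```

Running this first tends to give a clearer message than the `FileNotFoundError` that `pd.read_csv` would raise on a missing file.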



In [None]:

import pandas as pd

# Load the CSV files
movies_df = pd.read_csv('movies.csv')
ratings_df = pd.read_csv('ratings.csv')
tags_df = pd.read_csv('tags.csv')
links_df = pd.read_csv('links.csv')

# Displaying shapes of dataframes to check correct loading
print(f"Shape of movies.csv: {movies_df.shape}")
print(f"Shape of ratings.csv: {ratings_df.shape}")
print(f"Shape of tags.csv: {tags_df.shape}")
print(f"Shape of links.csv: {links_df.shape}")


In [None]:

# 1. How many ".csv" files are available in the dataset?
csv_files = ['links.csv', 'movies.csv', 'ratings.csv', 'tags.csv']
print(f"Number of CSV files: {len(csv_files)}")

# 2. Shape of movies.csv
print(f"Shape of movies.csv: {movies_df.shape}")

# 3. Shape of ratings.csv
print(f"Shape of ratings.csv: {ratings_df.shape}")

# 4. Unique userId in ratings.csv
unique_user_count = ratings_df['userId'].nunique()
print(f"Unique user IDs in ratings.csv: {unique_user_count}")

# 5. Movie with maximum user ratings
max_rated_movie_id = ratings_df['movieId'].value_counts().idxmax()
max_rated_movie_title = movies_df[movies_df['movieId'] == max_rated_movie_id]['title'].values[0]
print(f"Movie with maximum ratings: {max_rated_movie_title}")

# 6. Tags for "Matrix, The (1999)"
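# regex=False matches the title literally; by default "(1999)" is parsed as a regex group and nothing matches
matrix_movie_id = movies_df[movies_df['title'].str.contains("Matrix, The (1999)", case=False, na=False, regex=False)]['movieId'].values[0]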
matrix_tags = tags_df[tags_df['movieId'] == matrix_movie_id]['tag'].unique()
print(f"Tags for 'Matrix, The (1999)': {list(matrix_tags)}")

# 7. Average rating for "Terminator 2: Judgment Day (1991)"
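# regex=False matches the title literally; by default "(1991)" is parsed as a regex group and nothing matches
terminator_movie_id = movies_df[movies_df['title'].str.contains("Terminator 2: Judgment Day (1991)", case=False, na=False, regex=False)]['movieId'].values[0]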
terminator_avg_rating = ratings_df[ratings_df['movieId'] == terminator_movie_id]['rating'].mean()
print(f"Average rating for 'Terminator 2: Judgment Day (1991)': {terminator_avg_rating:.2f}")

# 8. Distribution of user ratings for "Fight Club (1999)"
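# regex=False matches the title literally; by default "(1999)" is parsed as a regex group and nothing matches
fight_club_movie_id = movies_df[movies_df['title'].str.contains("Fight Club (1999)", case=False, na=False, regex=False)]['movieId'].values[0]
fight_club_ratings = ratings_df[ratings_df['movieId'] == fight_club_movie_id]['rating']
# Compute the sample skewness once; zero skewness indicates symmetry, not necessarily a normal distribution
fight_club_skew = fight_club_ratings.skew()
distribution = "Left Skewed" if fight_club_skew < 0 else "Right Skewed" if fight_club_skew > 0 else "Symmetric"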
print(f"Rating distribution for 'Fight Club (1999)': {distribution}")

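# 9. Mandatory operation: aggregate ratings per movie, merge with movies, keep movies with more than 50 ratings
# Named aggregation yields the 'rating_count' and 'rating_mean' columns directly
ratings_grouped = ratings_df.groupby('movieId')['rating'].agg(rating_count='count', rating_mean='mean').reset_index()
# Inner join keeps only movies that have at least one rating
merged_movies = pd.merge(movies_df, ratings_grouped, on='movieId', how='inner')
popular_movies = merged_movies[merged_movies['rating_count'] > 50]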

# Most popular movie based on average user ratings
most_popular_movie = popular_movies.sort_values(by='rating_mean', ascending=False).iloc[0]['title']
print(f"Most popular movie based on average ratings: {most_popular_movie}")

# 10. Top 5 movies based on number of user ratings
top_5_movies = popular_movies.sort_values(by='rating_count', ascending=False).head(5)['title']
print(f"Top 5 movies based on number of ratings: {list(top_5_movies)}")

# 11. Third most popular Sci-Fi movie based on the number of user ratings
sci_fi_movies = popular_movies[popular_movies['genres'].str.contains("Sci-Fi", case=False, na=False)]
third_popular_sci_fi_movie = sci_fi_movies.sort_values(by='rating_count', ascending=False).iloc[2]['title']
print(f"Third most popular Sci-Fi movie: {third_popular_sci_fi_movie}")
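
The title lookups for questions 6 to 8 search for strings that contain parentheses. By default, `Series.str.contains` interprets its pattern as a regular expression, so `(1999)` is read as a capture group rather than as literal parentheses. The sketch below compares the default behaviour with a literal match (`regex=False`) and with exact equality. It uses a small made-up Series, not the real `movies.csv`.

```python
import warnings

import pandas as pd

# Hypothetical sample titles for illustration only (not read from movies.csv).
titles = pd.Series([
    "Matrix, The (1999)",
    "Matrix Reloaded, The (2003)",
    "Fight Club (1999)",
])
target = "Matrix, The (1999)"

# Default regex=True: "(1999)" is a capture group, so the pattern searches for
# the text "Matrix, The 1999". pandas also warns about the match group; the
# warning is silenced here only to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    regex_hits = titles.str.contains(target, case=False, na=False)

# regex=False: plain, case-insensitive substring match.
literal_hits = titles.str.contains(target, case=False, na=False, regex=False)

# Exact equality: the strictest option when the full title is known.
exact_hits = titles == target

print("regex matches:  ", int(regex_hits.sum()))
print("literal matches:", int(literal_hits.sum()))
print("exact matches:  ", int(exact_hits.sum()))
```

On this sample, the regex lookup finds no rows while the literal and exact lookups each find one. An empty regex result is what makes a following `.values[0]` raise an `IndexError`.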
