<a href="https://colab.research.google.com/github/sohelshekhatik1998/imdb_movies_analysis/blob/main/imdb_movies_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Task 1: Reading and Inspection
Subtask 1.1: Import and read the movie database

In [2]:
import numpy as np
import pandas as pd

In [6]:
# Read the movie dataset
movies = pd.read_csv("/content/Movies (1).csv")
movies.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,...,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000


Subtask 1.2: Inspect the dataframe

We inspect the dataset to understand its structure and contents.

In [7]:
# Check the number of rows and columns
print("Number of rows and columns:", movies.shape)

Number of rows and columns: (3853, 28)


In [8]:
# Check columns with null values
print("Columns with null values:", (movies.isnull().sum() > 0).sum())

Columns with null values: 12
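The count above says how many columns contain nulls but not which ones. A minimal sketch of listing them, using a small made-up dataframe (`toy`) as a stand-in for `movies`; the values are illustrative only:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies dataframe (values are made up for illustration).
toy = pd.DataFrame({
    "director_name": ["A", None, "C"],
    "gross": [1.0, 2.0, np.nan],
    "imdb_score": [7.9, 7.1, 6.8],
})

null_counts = toy.isnull().sum()                 # number of nulls per column
cols_with_nulls = null_counts[null_counts > 0]   # keep only columns that have any

print(cols_with_nulls)
print("Columns with null values:", len(cols_with_nulls))
```

Applied to `movies`, the same two lines would name the affected columns alongside their null counts.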


Task 2: Cleaning the Data
We drop columns that are not required for our analysis.

In [9]:
columns_to_drop = [
    'color', 'director_facebook_likes', 'actor_1_facebook_likes',
    'actor_2_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
    'cast_total_facebook_likes', 'actor_3_name', 'duration',
    'facenumber_in_poster', 'content_rating', 'country',
    'movie_imdb_link', 'aspect_ratio', 'plot_keywords'
]

In [10]:
movies.drop(columns=columns_to_drop, inplace=True)

Answers to Questions: 3. After dropping the 15 unnecessary columns from the original 28, the dataframe contains 13 columns.
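A quick sanity check on the column count after a drop can be sketched as follows. The dataframe `toy` and the list `to_drop` are hypothetical stand-ins, not the real dataset:

```python
import pandas as pd

# Toy stand-in with four columns (values are made up for illustration).
toy = pd.DataFrame({
    "color": ["Color"],
    "duration": [178.0],
    "gross": [760.5],
    "imdb_score": [7.9],
})
to_drop = ["color", "duration"]

columns_before = toy.shape[1]
trimmed = toy.drop(columns=to_drop)   # returns a new dataframe, toy is unchanged
columns_after = trimmed.shape[1]

# The remaining count should equal the original count minus the dropped ones.
print(columns_before, "-", len(to_drop), "=", columns_after)
print(trimmed.columns.tolist())
```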

Subtask 2.2: Inspect Null values

We find the percentage of null values in each column.

In [11]:
null_percentage = (movies.isnull().sum() / len(movies)) * 100

Answers to Questions: 4. The column with the highest percentage of null values is “language”.
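The cell above computes the percentages without displaying them. One way to rank the columns and pick out the one with the most nulls is sketched below on a made-up dataframe (`toy`); the numbers are illustrative and do not reflect the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the movies dataframe (values are made up for illustration).
toy = pd.DataFrame({
    "language": ["English", None, None, "French"],
    "gross": [1.0, np.nan, 3.0, 4.0],
    "imdb_score": [7.9, 7.1, 6.8, 8.5],
})

# The mean of a boolean mask is the fraction of nulls; scale to a percentage.
null_percentage = toy.isnull().mean() * 100
ranked = null_percentage.sort_values(ascending=False).round(2)
worst = null_percentage.idxmax()   # column label with the highest percentage

print(ranked)
print("Highest null percentage:", worst)
```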

Subtask 2.3: Fill NaN values

We fill NaN values in the “language” column with “English”.

In [12]:
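# Assign the result back; fillna(inplace=True) on a selected column may not update the dataframe in recent pandas versions
movies["language"] = movies["language"].fillna("English")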

Answers to Questions: 5. After filling NaN values, there are 3670 movies made in the English language.
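The count of English-language movies can be obtained by summing a boolean comparison. A minimal sketch on a made-up column (`toy` is a hypothetical stand-in for `movies`):

```python
import pandas as pd

# Toy stand-in: two missing languages, one English, one French.
toy = pd.DataFrame({"language": ["English", None, "French", None]})

# Fill missing values and assign the result back to the column.
toy["language"] = toy["language"].fillna("English")

# True counts as 1, so the sum is the number of English-language rows.
english_count = (toy["language"] == "English").sum()
print("English-language movies:", english_count)
```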

Task 3: Data Analysis
Subtask 3.1: Change the unit of columns

We convert the unit of the “budget” and “gross” columns from dollars to million dollars.

In [13]:
movies["gross"] = movies["gross"] / 1000000
movies["budget"] = movies["budget"] / 1000000

Subtask 3.2: Find the movies with the highest profit

We calculate the “profit” for each movie and find the top ten profiting movies.

In [14]:
movies["Profit"] = movies.gross - movies.budget
top10 = movies.sort_values("Profit", ascending=False).head(10)

Answers to Questions: 6. The movie ranked 5th from the top in the list is “The Avengers”.
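Reading off the movie at a given rank can be sketched as below. The dataframe `toy` is made up, and the column name `movie_title` is an assumption, since the title column is not visible in the truncated preview above:

```python
import pandas as pd

# Toy stand-in; gross and budget are in million dollars (values are made up).
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],   # column name is an assumption
    "gross": [760.5, 309.4, 200.1, 73.1],
    "budget": [237.0, 300.0, 245.0, 263.7],
})

toy["Profit"] = toy["gross"] - toy["budget"]

# nlargest sorts by Profit descending and keeps the first n rows.
top = toy.nlargest(3, "Profit").reset_index(drop=True)

# After reset_index, position 2 is the movie ranked 3rd.
third = top.loc[2, "movie_title"]
print(top[["movie_title", "Profit"]])
print("Ranked 3rd:", third)
```

On the real data, `top10.iloc[4]` would give the row ranked 5th.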

Subtask 3.3: Find IMDb Top 250

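We create a dataframe IMDb_Top_250 containing up to 250 of the highest-rated movies, restricted to those where imdb_score is above 8.0 and num_voted_users is greater than 25,000. A Rank column numbers them from 1 in descending order of rating.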

In [15]:
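# Keep movies rated above 8.0 that have more than 25,000 votes
IMDb_Top_250 = movies[(movies['imdb_score'] > 8.0) & (movies['num_voted_users'] > 25000)]
# Take the 250 highest-rated; .copy() avoids a SettingWithCopyWarning when Rank is added
IMDb_Top_250 = IMDb_Top_250.sort_values(by='imdb_score', ascending=False).head(250).copy()
IMDb_Top_250['Rank'] = range(1, IMDb_Top_250.shape[0] + 1)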

Answers to Questions: 7. The bucket holding the maximum number of movies from IMDb_Top_250 is "8 to 8.5".
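The bucketing step is not shown in the notebook. One possible sketch uses `pd.cut`; the bin edges, labels, and scores below are assumptions chosen for illustration, not the notebook's actual values:

```python
import pandas as pd

# Toy stand-in for IMDb_Top_250 (scores are made up for illustration).
toy = pd.DataFrame({"imdb_score": [8.1, 8.3, 8.4, 8.6, 9.2]})

# Assumed bucket edges; by default each bin includes its right edge, e.g. (8.0, 8.5].
bins = [7.5, 8.0, 8.5, 9.0, 9.5, 10.0]
labels = ["7.5 to 8", "8 to 8.5", "8.5 to 9", "9 to 9.5", "9.5 to 10"]

toy["bucket"] = pd.cut(toy["imdb_score"], bins=bins, labels=labels)
counts = toy["bucket"].value_counts()   # movies per bucket
largest = counts.idxmax()               # label of the fullest bucket

print(counts)
print("Largest bucket:", largest)
```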

Subtask 3.4: Find the critic-favorite and audience-favorite actors

We create dataframes for three actors, namely, Meryl_Streep, Leo_Caprio, and Brad_Pitt, containing movies where they are the lead actors. Then, we combine these dataframes, group by actor, and find the mean of critic and user reviews.

In [16]:
Meryl_Streep = movies[movies["actor_1_name"] == "Meryl Streep"]
Leo_Caprio = movies[movies["actor_1_name"] == "Leonardo DiCaprio"]
Brad_Pitt = movies[movies["actor_1_name"] == "Brad Pitt"]
Combined = pd.concat([Meryl_Streep, Leo_Caprio, Brad_Pitt], axis=0)
actor_reviews = Combined.groupby(by="actor_1_name")[["num_critic_for_reviews", "num_user_for_reviews"]].mean()

Answers to Questions: 8 and 9

By mean number of user reviews, “Leonardo DiCaprio” ranks highest among the three actors.
By mean number of critic reviews, “Leonardo DiCaprio” also ranks highest among the three actors.
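The favorite for each review type can be read from the grouped means with `idxmax`. A minimal sketch on made-up data, where the actor names X, Y, and Z are placeholders:

```python
import pandas as pd

# Toy stand-in for the combined dataframe (values are made up for illustration).
toy = pd.DataFrame({
    "actor_1_name": ["X", "X", "Y", "Y", "Z"],
    "num_critic_for_reviews": [100.0, 200.0, 300.0, 500.0, 250.0],
    "num_user_for_reviews": [400.0, 600.0, 900.0, 300.0, 1000.0],
})

# Mean review counts per actor, as in the cell above.
means = toy.groupby("actor_1_name")[
    ["num_critic_for_reviews", "num_user_for_reviews"]
].mean()

critic_fav = means["num_critic_for_reviews"].idxmax()   # highest mean critic reviews
user_fav = means["num_user_for_reviews"].idxmax()       # highest mean user reviews

print(means)
print("Critic favorite:", critic_fav)
print("Audience favorite:", user_fav)
```

Note that these columns count reviews rather than scores, so the comparison measures how much attention each actor's movies receive.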





Conclusion
In this analysis, we explored a movie dataset, cleaned the data, and ran several analyses on movies, actors, and ratings. We identified the ten most profitable movies, built an IMDb Top 250 list from rating and vote-count thresholds, and compared three lead actors by their mean numbers of critic and user reviews. These results may be useful to movie enthusiasts and industry professionals.
