# Project: Investigate a Dataset - TMDb Movie Data

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description 
The dataset analyzed here is the "TMDB Movie Metadata" dataset, available on Kaggle. It contains information on about 10,000 movies sourced from The Movie Database (TMDb), including budget, revenue, genre, director, cast, popularity, and user ratings.

Column descriptions:
- id: A unique code assigned to each individual movie.
- imdb_id: The unique code that identifies the movie on IMDb.
- popularity: A numerical value indicating the level of popularity of the movie.
- budget: The amount of money allocated for the production of the movie.
- revenue: The total income generated worldwide by the movie.
- original_title: The title of the movie in its original language prior to any translations or adaptations.
- cast: The names of the main and supporting actors.
- homepage: A link directing to the official webpage of the movie.
- director: The creative leader responsible for guiding all aspects of a film's production.
- tagline: A concise phrase or sentence representing the essence of the movie.
- keywords: Specific words or phrases associated with the movie.
- overview: A brief summary outlining the plot or content of the movie.
- runtime: The duration of the movie in minutes.
- genres: The categories or types of the movie, such as Action, Comedy, Thriller, etc.
- production_companies: The organizations responsible for the production of the movie.
- release_date: The date on which the movie was officially released.
- vote_count: The total number of votes received for the movie.
- vote_average: The average rating given to the movie.
- release_year: The year in which the movie was officially released.
- budget_adj: The budget of the movie adjusted to reflect 2010 dollar values, accounting for inflation.
- revenue_adj: The revenue of the movie adjusted to reflect 2010 dollar values, accounting for inflation.

### Question(s) for Analysis
1. What are the top 10 films that have generated the most profit?
2. Who are the top 10 directors with the greatest profits?
3. Which production companies rank in the top 10 for highest profit?
4. Which ten genres have the highest average vote rating?

In [None]:
# import all of the packages needed for the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

<a id='wrangling'></a>
## Data Wrangling
### General Properties

In [None]:
# load the data
df = pd.read_csv('tmdb-movies.csv')
df.head(3)

In this step, we are loading the dataset from the provided CSV file using the `pd.read_csv()` function and assigning it to the variable `df`. After loading the dataset, we are displaying the first few rows using the `head()` function to get an initial glimpse of the data structure and content.

In [None]:
# return a tuple of the dimensions of the dataframe
df.shape

In this step, we are retrieving the shape of the dataframe using the `shape` attribute. The shape is a tuple giving the dimensions of the dataframe: the number of rows and the number of columns.

In [None]:
# return the datatypes of the columns
df.dtypes

In this step, we are examining the data types of each column in the dataframe using the `dtypes` attribute. This provides us with information about the data type of each feature, such as integer, float, object (string), etc.

In [None]:
# display a concise summary of the dataframe, including the number of non-null values in each column
df.info()

In this step, we are displaying detailed information about the dataframe using the `info()` method. This includes a summary of the dataframe's structure, such as the total number of entries, the number of non-null values in each column, and the data types of each column.

In [None]:
# return the number of unique values in each column
df.nunique()

In this step, we are calculating the number of unique values for each column in the dataframe using the `nunique()` method. This provides us with insight into the diversity and variability of data within each feature.

In [None]:
# calculate the number of missing values in each column of the DataFrame
df.isnull().sum()

In this step, we are calculating the number of missing values in each column of the dataframe using the `isnull()` method followed by `.sum()`. This information helps us to understand the extent of missing data in our dataset.
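Raw counts are easier to judge when set against the size of the table. The sketch below uses a small invented frame (not the real dataset) to show how `isnull().mean()` reports the share of missing entries per column.

```python
import pandas as pd

# Toy stand-in for the movie table; the values are invented for illustration.
sample = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "director": ["X", None, "Y", None],
    "genres": ["Drama", "Comedy", None, "Action"],
})

# The mean of a boolean mask is the fraction of True values,
# so this gives the share of missing entries in each column.
missing_share = sample.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `df` would rank the columns from most to least incomplete.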

In [None]:
# return the number of duplicated data
df.duplicated().sum()

In this step, we are checking for duplicate entries in the dataframe using the `duplicated()` function. By calling `.sum()` on the result, we obtain the total number of duplicate rows in the dataframe.

In [None]:
# select the numerical columns of interest and summarize them with describe()
df[['budget', 'revenue', 'runtime', 'vote_count', 'vote_average', 'budget_adj', 'revenue_adj']].describe()

In this step, we are generating a summary of descriptive statistics for numerical columns in the dataframe using the `describe()` method. This summary includes various statistical measures such as count, mean, standard deviation, minimum, quartiles, and maximum values for each numerical feature. It provides a quick overview of the distribution and central tendencies of the data, aiding in understanding the data's characteristics and identifying potential outliers or anomalies.
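One anomaly worth checking in such a summary is a minimum of zero for columns like `budget` or `revenue`. This may mean the value was not recorded rather than being truly zero. That reading is an assumption, not something the dataset documents. A minimal sketch on invented values shows how the zeros could be counted per column.

```python
import pandas as pd

# Invented values; the column names follow the notebook.
sample = pd.DataFrame({
    "budget": [0, 1_000_000, 0, 5_000_000],
    "revenue": [0, 3_000_000, 2_000_000, 0],
    "runtime": [90, 120, 0, 100],
})

# Comparing the frame with 0 yields a boolean mask; summing counts the zeros.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

If many rows carry zeros, any profit figure derived from them should be read with caution.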


### Data Cleaning

In [None]:
df.head(3)

In [None]:
# drop the unnecessary columns
df.drop(columns=['id', 'imdb_id', 'cast', 'homepage', 'tagline', 'keywords', 'overview'], inplace=True)

In this step, we are removing the columns that are not needed to answer the research questions ('id', 'imdb_id', 'cast', 'homepage', 'tagline', 'keywords', 'overview') from the dataframe.

In [None]:
# fill missing values in the dataframe with the string 'null'
df.fillna('null', inplace=True)

In this step, we are filling any missing values in the dataframe with the string 'null'.
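Calling `fillna('null')` on the whole dataframe would also write a string into any numeric column that has gaps, which changes that column's dtype. A more conservative variant, sketched here on invented data, restricts the fill to text columns. The list `text_cols` is a hypothetical choice for illustration.

```python
import pandas as pd
import numpy as np

# Invented rows; runtime stands in for any numeric column with a gap.
sample = pd.DataFrame({
    "director": ["X", None, "Y"],
    "genres": [None, "Drama", "Comedy"],
    "runtime": [90.0, np.nan, 120.0],
})

# Hypothetical list of text columns to fill; numeric columns are left alone.
text_cols = ["director", "genres"]
sample[text_cols] = sample[text_cols].fillna("null")

print(sample)
print(sample.dtypes)
```

Here `runtime` stays a float column with its gap intact, so it can still be summarized numerically.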

In [None]:
# check the number of missing values in each column of the dataframe
df.isnull().sum()

In this step, we are confirming that no missing values remain in any column after the fill.

In [None]:
# drop the duplication in dataframe
df.drop_duplicates(inplace=True)

In this step, we are removing duplicate entries from the dataframe.

In [None]:
# calculate the number of duplicated rows in the dataframe
df.duplicated().sum()

In this step, we are verifying whether the duplicated rows have been successfully deleted.

<a id='eda'></a>
## Exploratory Data Analysis

In [None]:
df.head()

### Research Question 1 What are the top 10 films that have generated the most profit?

In [None]:
# calculate the gain by subtracting the budget from the revenue
df['gain'] = df['revenue'] - df['budget']

In [None]:
# define a function to get the top 10 movies based on 'gain' from the dataframe
def get_top_10_movies(df):
    # extract the columns 'original_title' and 'gain', sort by 'gain' in descending order, and select the top 10 rows
    top_10_movies = df[['original_title', 'gain']].sort_values(by='gain', ascending=False).head(10)
    # return the dataframe containing the top 10 movies
    return top_10_movies

In [None]:
# extract the top 10 movies with the highest total gain

# call the get_top_10_movies function with the DataFrame df
top_10_movies = get_top_10_movies(df)

# define lighter shades of yellow, purple, blue, and orange
light_yellow = '#FFFF99'  # light yellow color
light_purple = '#CC99FF'  # light purple color
light_blue = '#99CCFF'    # light blue color
light_orange = '#FFCC99'  # light orange color

# plotting the data as a horizontal bar plot
plt.figure(figsize=(11, 7))
plt.barh(top_10_movies['original_title'], top_10_movies['gain'], color=[light_blue, light_yellow, light_purple, light_orange])
plt.xlabel('Gain')
plt.ylabel('Movie Title')
plt.title('Top 10 Movies by Total Gain')
plt.gca().invert_yaxis()
plt.grid(axis='x', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

According to the data, Avatar is the most profitable film, with Star Wars close behind. Four of the top 10 movies fall under the Sci-Fi genre. This suggests that Sci-Fi films often achieve significant success at the box office, although ten titles are too few to support a general claim.
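Because the ranking above uses unadjusted dollars, older films are at a disadvantage. As a hedged alternative, the sketch below computes profit from the inflation-adjusted columns and ranks with `nlargest`. The data is invented and `gain_adj` is a hypothetical column name, not part of the dataset.

```python
import pandas as pd

# Invented figures; only the column names come from the dataset.
sample = pd.DataFrame({
    "original_title": ["Old Hit", "New Hit", "Flop"],
    "budget_adj": [50.0, 200.0, 100.0],
    "revenue_adj": [900.0, 800.0, 40.0],
})

# gain_adj is a hypothetical name for the inflation-adjusted profit.
sample["gain_adj"] = sample["revenue_adj"] - sample["budget_adj"]

# nlargest sorts in descending order and keeps the first n rows in one call.
top = sample.nlargest(2, "gain_adj")[["original_title", "gain_adj"]]
print(top)
```

Whether the adjusted ranking differs from the unadjusted one on the real data is not shown here and would need to be checked.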

### Research Question 2 Who are the top 10 directors with the greatest profits?

In [None]:
# count the occurrences of each director in the 'director' column
df['director'].value_counts()

In [None]:
# group the data by director and calculate the total gain for each director, then sort in descending order
top_10_directors = df.groupby('director')['gain'].sum().sort_values(ascending=False).head(10)
top_10_directors

In [None]:
# extract the top 10 directors with the highest total gain

# define colors for the pie chart slices
light_yellow = '#FFFF99'  # light yellow color
light_purple = '#CC99FF'  # light purple color
light_blue = '#99CCFF'    # light blue color
light_orange = '#FFCC99'  # light orange color

# define explode values to highlight the first slice
explode = (0.1, 0, 0, 0, 0, 0, 0, 0, 0, 0)

# plot the data as a pie chart with custom colors and explode effect
plt.figure(figsize=(7, 7))
plt.pie(top_10_directors, labels=top_10_directors.index, autopct='%1.1f%%', startangle=140, colors=[light_blue, light_yellow, light_purple, light_orange], shadow=True, explode=explode)
plt.title('Top 10 Directors by Total Gain')
plt.axis('equal')
plt.tight_layout()
plt.show()

The data shows Steven Spielberg as the director with the highest total gain. Spielberg is closely associated with science fiction, which is consistent with the earlier observation that Sci-Fi films often achieve significant success at the box office. This analysis does not test that link directly.
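A total favors directors with many films, and the 'null' placeholder introduced during cleaning is grouped like a real name. The sketch below, on invented data, drops the placeholder and reports the film count and mean gain next to the total.

```python
import pandas as pd

# Invented rows; "null" mimics the placeholder used during cleaning.
sample = pd.DataFrame({
    "director": ["X", "X", "Y", "null", "X"],
    "gain": [100, 50, 400, 900, 30],
})

# Exclude rows whose director is the placeholder string.
known = sample[sample["director"] != "null"]

# Named aggregation: one output column per statistic.
per_director = (
    known.groupby("director")["gain"]
    .agg(total_gain="sum", films="count", mean_gain="mean")
    .sort_values("total_gain", ascending=False)
)
print(per_director)
```

Reading `total_gain` and `mean_gain` together helps separate prolific directors from those with a single very profitable film.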

### Research Question 3 Which production companies rank in the top 10 for highest profit?

In [None]:
# count the occurrences of each value in the 'production_companies' column
df['production_companies'].value_counts()

In [None]:
# group the data by production companies and calculate the total gain for each, then sort in descending order and keep the top 5
top_5_companies = df.groupby('production_companies')['gain'].sum().sort_values(ascending=False).head(5)
top_5_companies

In [None]:
# extract the top 5 production companies with the highest total gain

# define lighter shades of yellow, purple, blue, and orange
light_yellow = '#FFFF99'  # light yellow color
light_purple = '#CC99FF'  # light purple color
light_blue = '#99CCFF'    # light blue color
light_orange = '#FFCC99'  # light orange color

# plot the data as a vertical bar plot
plt.figure(figsize=(11, 7))
top_5_companies.plot(kind='bar', color=[light_blue, light_yellow, light_purple, light_orange])
plt.xlabel('Production Companies')
plt.ylabel('Total Gain')
plt.title('Top 5 Production Companies by Total Gain')
plt.xticks(rotation=60)
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

The graph of the top 5 production companies by total gain shows a notable gap in financial performance. The leading company, Paramount, holds a dominant position, while the gains of the remaining four are closely grouped together.
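Grouping on the raw `production_companies` string treats each distinct combination of co-producers as its own company. If the column stores several names joined by `|` (an assumption about the file's format), the names can be split and exploded so that each company gets its own row. A minimal sketch on invented data:

```python
import pandas as pd

# Invented rows; the "|" separator is an assumption about the real column.
sample = pd.DataFrame({
    "production_companies": ["Alpha|Beta", "Alpha", "Beta|Gamma", "null"],
    "gain": [100, 50, 30, 10],
})

# Drop the placeholder, then turn each string into a list of names.
rows = sample[sample["production_companies"] != "null"].copy()
rows["company"] = rows["production_companies"].str.split("|")

# explode() emits one row per list element, repeating the film's gain.
per_company = (
    rows.explode("company")
    .groupby("company")["gain"]
    .sum()
    .sort_values(ascending=False)
)
print(per_company)
```

With this approach a film's full gain is credited to every co-producer, so the per-company totals add up to more than the overall gain.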

### Research Question 4 Which ten genres have the highest average vote rating?

In [None]:
# group the data by genres and calculate the average vote for each genre
top_10_genres = df.groupby('genres')['vote_average'].mean().sort_values(ascending=False).head(10)
top_10_genres

In [None]:
# extract the top 10 genres with the highest vote average

# Define colors for the bar plot
light_yellow = '#FFFF99'  
light_purple = '#CC99FF'  
light_blue = '#99CCFF'    
light_orange = '#FFCC99'

# Plot the data as a vertical bar plot
plt.figure(figsize=(11, 7))
top_10_genres.plot(kind='bar', color=[light_blue, light_yellow, light_purple, light_orange])
plt.xlabel('Genres')
plt.ylabel('Average Vote')
plt.title('Top 10 Genres by Average Vote')
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

The graph displays the top 10 genres by their average vote. It's evident that certain genres, such as Drama, Documentary, and History, tend to receive higher average votes compared to others. This suggests that these genres resonate well with audiences, potentially due to their compelling storytelling or thematic depth.
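An average based on one or two movies can top the ranking by chance. The sketch below, on invented data, splits `|`-separated genre strings (the separator is an assumption), then keeps only genres with at least `min_movies` titles before ranking. `min_movies` is a hypothetical threshold chosen for illustration.

```python
import pandas as pd

# Invented rows; the "|" separator is an assumption about the real column.
sample = pd.DataFrame({
    "genres": ["Drama|History", "Drama", "Comedy", "Documentary", "Comedy|Drama"],
    "vote_average": [8.0, 7.0, 5.0, 9.0, 6.0],
})

min_movies = 2  # hypothetical minimum number of movies per genre

# One row per (movie, genre) pair.
rows = sample.assign(genre=sample["genres"].str.split("|")).explode("genre")

# Mean vote and number of movies for each individual genre.
stats = rows.groupby("genre")["vote_average"].agg(mean_vote="mean", movies="count")

# Keep only genres with enough movies, then rank by mean vote.
ranked = stats[stats["movies"] >= min_movies].sort_values("mean_vote", ascending=False)
print(ranked)
```

In this toy example Documentary has the highest single rating but is filtered out because it appears only once.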

<a id='conclusions'></a>
## Conclusions
Based on the data analysis, "Avatar" emerges as the most profitable film, closely trailed by "Star Wars," and 4 of the top 10 movies belong to the Sci-Fi genre. This suggests that Sci-Fi films tend to achieve significant success at the box office.

Steven Spielberg is the director with the highest total gain. His association with science fiction is consistent with the strong showing of Sci-Fi movies, although the analysis does not test this link directly.

The graph of the top 5 production companies by total gain shows one leading company in a dominant position, while the rest cluster closely together.

The graph of the top 10 genres by average vote highlights genres like Drama, Documentary, and History, which receive higher average votes. This may reflect their resonance with audiences through captivating narratives or thematic richness.

#### Limitations
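The analysis has several limitations. Missing values were replaced with the placeholder string 'null', which the grouped results then treat as a real director, company, or genre. Profit was computed from the unadjusted `budget` and `revenue` columns rather than `budget_adj` and `revenue_adj`, so films from different decades are not compared in constant dollars. More generally, missing data, discrepancies, or errors in the source may limit the accuracy of these findings.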

In [None]:
# Running this cell will execute a bash command to convert this notebook to an .html file
!python -m nbconvert --to html Investigate_a_Dataset.ipynb