<a href="https://colab.research.google.com/github/yousefhm/Investigate-a-Dataset-TMDb-movie-/blob/main/Investigate_a_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project: Investigate a Dataset (TMDb movie)**
The primary goal of this project is to walk through the general data analysis process on the dataset using NumPy, pandas, Matplotlib and other libraries. The notebook contains four parts:



**Table of Contents:**
*   Introduction
*   Data wrangling 
*   Exploratory data analysis
*   Conclusions







## Introduction

**Dataset**

* In this notebook I work with the TMDb movie data set (cleaned from the original data on Kaggle), which I selected for analysis. 
This data set contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue.

    **Contains:**

      *   Total Rows = 10866
      *   Total Columns = 21

**Questions**


1.   Which year has the highest number of movie releases?
2.   Which 10 production companies produced the most movies?
3.   What kinds of properties are associated with movies that have high revenues?
4.   Which month has the highest number of movie releases across all years? And which month has the highest average revenue?
5.   Which genres are most popular?



Import Packages and libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
import plotly.graph_objects as go
import plotly.express as px

import os

## Data Wrangling 

### **Gathering** 

In [None]:
#Import the CSV file that contains the data and use head() to show a sample of it.

df = pd.read_csv('/content/tmdb-movies.csv')
df.head()

### **Assessing**

In [None]:
#The column labels of the DataFrame.

df.columns

In [None]:
#Return a tuple representing the dimensionality of the DataFrame.

df.shape

In [None]:
#Print a concise summary of a DataFrame.

df.info()

In [None]:
#Detect missing values and use sum() to count them per column.

df.isnull().sum()

In [None]:
#Generate descriptive statistics.

df.describe()

> Quality issues     
  * Incorrect data type for 'release_date' column.
  * Missing records in ['imdb_id', 'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview', 'genres', 'production_companies'] columns.
  * The dataset contains many movies where the budget or revenue has a value of '0'.

> Tidiness issues 
* The 'homepage' column is not important for the analysis.
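
One common way to handle the zero budget and revenue values noted above is to treat them as missing. The following is a minimal sketch under that assumption; the `movies` frame and its values are made up for illustration and are not taken from the TMDb file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the TMDb data; the values are invented.
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget": [0, 1_000_000, 0, 5_000_000],
    "revenue": [0, 3_000_000, 250_000, 0],
})

money_cols = ["budget", "revenue"]

# Count how many rows report 0 in each money column.
zero_counts = (movies[money_cols] == 0).sum()
print(zero_counts)

# Assumption: a value of 0 means "not reported", so mark it as missing.
# Missing values are then skipped by mean(), median() and similar statistics.
movies[money_cols] = movies[money_cols].replace(0, np.nan)
print(movies["revenue"].mean())
```

Whether zeros should be treated as missing depends on the question being asked; for simple counts of movies per year or per genre they can be left as they are.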


### **Cleaning** 

In [None]:
#Make a copy of the data so the original DataFrame is preserved.

investigate_data_copy = df.copy()
investigate_data_copy.head()

In [None]:
#The column labels of the DataFrame.

investigate_data_copy.columns

In [None]:
#check for null value 
investigate_data_copy.isnull().sum()

In [None]:
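#Preview the rows that remain after dropping every row with a missing value.
#Note: dropna() returns a new DataFrame. The result is not assigned here,
#so investigate_data_copy itself is unchanged.
investigate_data_copy.dropna()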

In [None]:
#Count the duplicate rows (duplicated() returns a boolean Series, sum() counts the True values).

investigate_data_copy.duplicated().sum()

In [None]:
#Drop duplicates and keep the result (drop_duplicates() returns a new DataFrame)
investigate_data_copy = investigate_data_copy.drop_duplicates()
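
By default, pandas methods such as `dropna()` and `drop_duplicates()` return a new DataFrame instead of modifying the original. The sketch below uses a small made-up `frame` to show the difference, and restricts `dropna()` to chosen columns with `subset` so that rows are not discarded because of an unimportant missing field. Which columns belong in `subset` is an assumption that depends on the analysis.

```python
import numpy as np
import pandas as pd

# Made-up data: one duplicate row and a few missing values.
frame = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "genres": ["Drama", "Comedy", "Comedy", np.nan],
    "tagline": [np.nan, "x", "x", "y"],
})

# Calling dropna() without assigning the result leaves `frame` untouched.
print(len(frame.dropna()), len(frame))

# Keep the result, and only require the columns the analysis needs.
cleaned = frame.dropna(subset=["genres"]).drop_duplicates()
print(cleaned)
```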

In [None]:
# Convert the release_date column to the datetime type.

investigate_data_copy['release_date'] = pd.to_datetime(investigate_data_copy['release_date'])
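
If the dates are stored with two-digit years (for example `12/25/66`), the `%y` directive maps 00-68 to 2000-2068, so older movies can end up in the wrong century. The following sketch assumes that date format and uses a `release_year` column to correct such cases. The `sample` frame is invented for illustration.

```python
import pandas as pd

# Invented sample with two-digit years and a separate release_year column.
sample = pd.DataFrame({
    "release_date": ["6/9/15", "12/25/66", "3/1/99"],
    "release_year": [2015, 1966, 1999],
})

# Parse with an explicit format instead of letting pandas guess.
parsed = pd.to_datetime(sample["release_date"], format="%m/%d/%y")

# A parsed year later than release_year means the century was guessed wrong.
wrong_century = parsed.dt.year > sample["release_year"]
parsed = parsed.where(~wrong_century, parsed - pd.DateOffset(years=100))

sample["release_date"] = parsed
print(sample)
```

The month of each date is not affected by the century, so an analysis that only uses the month works either way.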

In [None]:
#Drop extra columns in data

investigate_data_copy.drop(['homepage'], axis=1, inplace=True)

## Exploratory Data Analysis



### **Research Question 1** 

**( Which year has the highest number of movie releases? )**

In [None]:
def highest_release_movie (investigate_data_copy):

    # Set the style sheet and figure size before plotting so they apply to this figure.
    sns.set(style="whitegrid", rc={'figure.figsize':(10,5)})

    # Group the movies by release year and count the number of movies in each year.
    data = investigate_data_copy.groupby('release_year').count()['id']
    print(data.tail())

    # Plot the number of movies released per year.
    data.plot(xticks = np.arange(1960,2016,5))

    # Set the title and axis labels.
    plt.title("Year Vs Number Of Movies",fontsize = 14)
    plt.xlabel('Release year',fontsize = 13)
    plt.ylabel('Number Of Movies',fontsize = 13)

    plt.show()

The plot and the output show that 2014 has the highest number of movie releases (700), followed by 2013 (659) and 2015 (629).
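
The year with the most releases can also be read off programmatically instead of from the plot. This is a small sketch on an invented `years` Series; with the real data the same calls would be applied to the `release_year` column.

```python
import pandas as pd

# Invented release years standing in for the release_year column.
years = pd.Series([2013, 2014, 2014, 2015, 2014, 2013], name="release_year")

# Count the movies per year, then pick the year with the largest count.
per_year = years.value_counts().sort_index()
top_year = per_year.idxmax()
top_count = per_year.max()

print(f"{top_year} has the most releases ({top_count})")
```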

### **Research Question 2** 

**( Which 10 production companies produced the most movies? )**

- Production companies fund the making of movies and provide the resources and manpower required to make them possible. It is interesting to look at which companies have been the most productive according to the TMDb movie data.



In [None]:
investigate_data_copy['production_companies'].value_counts().head(10)

The following graph shows the 10 production companies with the most movies in the data set.

**The rankings are based on research as well as on how often each company name appears in the data set.**


In [None]:
def production_companies(investigate_data_copy):
    # Company names and movie counts copied from the value_counts() output above.
    x = ['Paramount Pictures', 'Universal Pictures', 'Warner Bros.', 'Walt Disney Pictures', 'Metro-Goldwyn-Mayer (MGM)', 'Columbia Pictures', 'New Line Cinema', 'Touchstone Pictures', '20th Century Fox', 'Twentieth Century Fox Film']
    y = [156, 133, 84, 76, 72, 72, 61, 51, 50, 49]

    # Use textposition='auto' to print each count directly on its bar.
    fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            textposition='auto',
        )])

    fig.update_layout(xaxis_title="Production companies", yaxis_title="Number of movies", title_text='Top 10 production companies by number of movies')
    fig.show()

According to the plot, Paramount Pictures and Universal Pictures released the most movies.
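
The `production_companies` column can hold several names joined with `|`, so `value_counts()` on the raw column counts whole combinations rather than individual companies. The sketch below shows one way to count each company separately. The `movies` frame is made up, and the assumption is that `|` is the only separator used.

```python
import pandas as pd

# Made-up rows in the same pipe-separated style as the data set.
movies = pd.DataFrame({
    "production_companies": [
        "Universal Pictures|Amblin Entertainment",
        "Paramount Pictures",
        "Universal Pictures",
        None,
    ]
})

# Split each entry into a list, give every company its own row, then count.
company_counts = (
    movies["production_companies"]
    .dropna()
    .str.split("|")
    .explode()
    .value_counts()
)
print(company_counts.head(10))
```

The resulting index and values could be passed to `go.Bar` as `x` and `y` instead of typing the numbers by hand.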



### **Research Question 3** 

**( What kinds of properties are associated with movies that have high revenues? )**





In [None]:
from IPython.display import display
def properties_associated_revenues(investigate_data_copy):
    print("Properties associated with movies that have high revenues") 
    # Select the 10 movies with the highest revenue and keep the columns of interest.
    movie_top = investigate_data_copy.nlargest(10,'revenue')[['original_title','revenue','production_companies','genres']]
    display(movie_top)


The table lists the ten movies with the highest revenue together with their production companies and genres, which are the properties we associate with high revenues.
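
Another way to look for properties associated with high revenue is to check how strongly the numeric columns correlate with it. The following is only a sketch: the numbers are invented, and the column names `budget`, `popularity` and `vote_average` are assumed to match the data set.

```python
import pandas as pd

# Invented values; only the column names mirror the data set.
movies = pd.DataFrame({
    "budget": [10, 50, 100, 200, 150],
    "popularity": [0.5, 1.2, 2.0, 6.5, 3.1],
    "vote_average": [6.1, 5.8, 7.0, 7.4, 6.6],
    "revenue": [15, 120, 300, 900, 400],
})

# Pearson correlation of every numeric column with revenue, strongest first.
revenue_corr = (
    movies.corr()["revenue"]
    .drop("revenue")
    .sort_values(ascending=False)
)
print(revenue_corr)
```

A high correlation only indicates that two columns move together; it does not show that one causes the other.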

### **Research Question 4** 

**( Which month has the highest number of movie releases across all years? And which month has the highest average revenue? )**




In [None]:
def movies_with_month(investigate_data_copy):

    # Set the style sheet before plotting so it applies to this figure.
    sns.set_style("darkgrid")

    # Extract the month number from the release date.
    month_release = investigate_data_copy['release_date'].dt.month

    # Count the movies in each month and name the column explicitly, so the code
    # does not depend on the default name that value_counts() gives the result.
    number_of_release = month_release.value_counts().sort_index().rename('number_of_release').to_frame()
    months = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
    number_of_release['month'] = months

    # Plot the bar graph.
    number_of_release.plot(x='month',kind='bar',fontsize = 11,figsize=(8,6))

    # Set the labels and title of the plot.
    plt.title('Months vs Number Of Movie Releases',fontsize = 15)
    plt.xlabel('Month',fontsize = 13)
    plt.ylabel('Number of movie releases',fontsize = 13)

    plt.show()

According to the plot, September and October have the highest number of releases.
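
The second half of the question, which month has the highest average revenue, can be approached by grouping on the month and taking the mean. This is a minimal sketch on invented data; with the real data set, zero revenues would probably need to be excluded first, which is an assumption about how missing revenue is recorded.

```python
import pandas as pd

# Invented release dates and revenues.
movies = pd.DataFrame({
    "release_date": pd.to_datetime([
        "2014-06-10", "2014-06-20", "2013-12-18", "2015-12-01", "2012-09-05",
    ]),
    "revenue": [400, 600, 900, 300, 100],
})

# Average revenue per calendar month, across all years.
monthly_mean = movies.groupby(movies["release_date"].dt.month)["revenue"].mean()

# Month number (1-12) with the highest average revenue.
best_month = monthly_mean.idxmax()
print(monthly_mean)
print("Best month:", best_month)
```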


### **Research Question 5** 

**( Which genres are most popular? )**




In [None]:
investigate_data_copy['genres'].value_counts().head(10)

In [None]:
def most_grnres(investigate_data_copy):
    # Genre labels and movie counts copied from the value_counts() output above.
    x = ['Comedy', 'Drama', 'Documentary', 'Drama|Romance', 'Comedy|Drama', 'Comedy|Romance', 'Horror|Thriller', 'Horror', 'Comedy|Drama|Romance', 'Drama|Thriller']
    y = [712, 712, 312, 289, 280, 268, 259, 253, 222, 138]

    # Use textposition='auto' to print each count directly on its bar.
    fig = go.Figure(data=[go.Bar(
            x=x, y=y,
            text=y,
            textposition='auto',
        )])

    fig.update_layout(xaxis_title="Movie genres", yaxis_title="Number of movies", title_text='Most popular genres')
    fig.show()

According to the plot, Comedy and Drama are the most popular genres.
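
Because a movie can list several genres joined with `|`, labels such as `Comedy|Drama` are counted as their own category above. The sketch below, on made-up data, splits the labels so that each genre is counted on its own, and also reports the mean of a `popularity` column per genre. Using `popularity` as a second measure is an assumption about what "popular" should mean here.

```python
import pandas as pd

# Made-up movies with pipe-separated genres and a popularity score.
movies = pd.DataFrame({
    "genres": ["Comedy|Drama", "Drama", "Comedy|Romance", "Horror"],
    "popularity": [1.0, 3.0, 2.0, 0.5],
})

# One row per (movie, genre) pair.
per_genre = movies.assign(genre=movies["genres"].str.split("|")).explode("genre")

# Number of movies and mean popularity for each individual genre.
genre_summary = (
    per_genre.groupby("genre")["popularity"]
    .agg(["count", "mean"])
    .sort_values("count", ascending=False)
)
print(genre_summary)
```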


In [None]:
if __name__ == "__main__":
    highest_release_movie(investigate_data_copy)
    production_companies(investigate_data_copy)
    properties_associated_revenues(investigate_data_copy)
    movies_with_month(investigate_data_copy)
    most_grnres(investigate_data_copy)  

## **Conclusions**

* The maximum number of movies was released in 2014.
* Paramount Pictures, Universal Pictures and Warner Bros. released more movies than the other production companies.
* September, October, November and December are the most popular months for releasing movies, which is worth considering if you want to earn more profit.
* Comedy is the most popular genre, followed by Drama and Documentary.

**Limitations**

*  These findings are not a guaranteed formula for success, but they suggest that a movie with similar characteristics has a high probability of making high profits. Releasing a movie with these characteristics also gives people high expectations of it. This is just one example of an influential factor that could lead to different results; there are many others that have to be taken into account.

*  During the data cleaning process, I dropped empty records to make parsing easier during the exploration phase. This increases the time taken to calculate the results.



**List of websites, books, forums, blog posts, GitHub repositories, etc.**

[Plotly](https://plotly.com/python/bar-charts/)

[Matplotlib](https://matplotlib.org/gallery/index.html#subplots-axes-and-figures)

[Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html)