

# Project: Investigate a Dataset (TMDb)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> What can we say about the importance of the cinema industry and its effect on the economies of the producing countries? What can we say about its social effect on people and the fun we have while watching a movie?
> I enjoy watching movies; it is one of my favorite pastimes, and I have watched more than 500 of them, so I consider myself familiar with this industry.
> Here we investigate the "TMDb Movie Database", a dataset of about 5000 movies, to answer questions such as:<br />
<br />
> **1. What is the rate of movie production over the years for each movie genre?<br/>
> 2. Which movie genres are the most profitable, highest rated, most voted on, and most popular?<br/>
> 3. Top 10 movies:<br/>
>..3A. What kinds of properties are associated with movies that have high revenues?<br />
>..3B. What kinds of properties are associated with the most popular movies?<br/>
>..3C. What kinds of properties are associated with the highest rated movies?<br/>
> 4. Which companies made the overall highest revenue per year?<br />
> 5. What is the relation between popularity and vote_count?<br />
> 6. What is the relation between the vote_count and vote rate?<br />
> 7. Is there a relation between the budget and the runtime?<br/>
> 8. Is there a relation between the budget and the revenue?<br/>
> 9. Who are the most popular directors?<br/>
> 10. What is the average runtime?
**



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os as os
import re
import json
import datetime as dt
%matplotlib inline


<a id='wrangling'></a>
## Data Wrangling


### General Properties

In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.
credits_df = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_credits.csv')
movies_df = pd.read_csv('../input/tmdb-movie-metadata/tmdb_5000_movies.csv')


In [None]:
credits_df.head()

In [None]:
movies_df.head()

**We need to rename the column `id` to `movie_id` so that it matches the credits table and is not duplicated while merging**

In [None]:
# changing the movies_df column name 'id' to 'movie_id'
movies_df_changed = movies_df.rename(columns = {'id':'movie_id'})
movies_df_changed.head()

In [None]:
# merge the two CSV tables into one dataset
movies_credit = pd.merge(credits_df, movies_df_changed, on=['title', 'movie_id'], how='outer')
movies_credit.head()
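
An outer merge keeps rows that exist in only one of the two tables, so it can be worth checking how many rows failed to match. The following is a minimal sketch on toy data; the frames `credits` and `movies` and their values are hypothetical stand-ins, not the real dataset. It uses the `indicator=True` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for the credits and movies tables (hypothetical values).
credits = pd.DataFrame({'movie_id': [1, 2, 3],
                        'title': ['A', 'B', 'C'],
                        'cast': ['[]', '[]', '[]']})
movies = pd.DataFrame({'movie_id': [1, 2, 4],
                       'title': ['A', 'B', 'D'],
                       'budget': [10, 20, 40]})

# indicator=True adds a '_merge' column that records where each row came from.
merged = pd.merge(credits, movies, on=['title', 'movie_id'],
                  how='outer', indicator=True)

# Rows that did not match on both keys are marked 'left_only' or 'right_only'.
unmatched = merged[merged['_merge'] != 'both']
print(unmatched[['movie_id', 'title', '_merge']])
```

If `unmatched` is empty on the real data, the outer merge behaves like an inner merge and no rows carry NaN values from a missing partner.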

**1. We need to save these changes into a csv file.**
**2. We need to take a copy.**

In [None]:
# 1. Save the merged dataset to a single CSV file
movies_credit.to_csv('movies_credit.csv')

In [None]:
# 2.Taking a copy of the file for cleaning the data 
movies_clean = movies_credit.copy()

In [None]:
movies_clean.shape

In [None]:
movies_clean.info()

In [None]:
movies_clean.describe()

In [None]:
movies_clean.duplicated().sum()

In [None]:
movies_clean.hist(figsize = (8,8))

### Data Cleaning (TMDb dataset cleaning):

> 1. We need to fill NaN or zero cells in the numerical columns with the column mean.
> 2. We need to remove unused and unimportant columns such as ('homepage').
> 3. We need to remove the rows that still contain NaN cells from the dataset.
> 4. We need to convert the dtype of the column ('runtime') to integer.
> 5. Capture the name of crew, cast, genres, keywords, spoken_languages, production_countries, production_companies.
> 6. We need to remove the unnecessary characters and strip spaces from these names.
> 7. Save all these changes to a new CSV file.



**We need to fill NaN or zero-valued cells with the mean value of each column**

In [None]:
# Replace zero values in the numerical columns with the column mean.
# The value to replace must be the number 0; the string '0' never matches numeric data.
numeric_columns = ['budget', 'popularity', 'revenue', 'runtime', 'vote_average', 'vote_count']
for column in numeric_columns:
    movies_clean[column] = movies_clean[column].replace(0, movies_clean[column].mean())
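
As a small illustration of how `Series.replace` treats the number 0 versus the string `'0'`, here is a sketch on a toy column (the values are hypothetical). It also shows one alternative, which is an assumption of this sketch and not part of the analysis: treating zeros as missing and filling them with the mean of the known values only.

```python
import numpy as np
import pandas as pd

# Toy budget column (hypothetical values); 0 stands for "unknown".
budget = pd.Series([100.0, 0.0, 300.0, 0.0])

# The string '0' does not match the number 0, so the data stays unchanged.
unchanged = budget.replace('0', budget.mean())

# Matching on the number 0 replaces the zeros with the mean of the whole column.
filled = budget.replace(0, budget.mean())

# Alternative: mark zeros as missing, then fill with the mean of the known values.
filled_known = budget.replace(0, np.nan)
filled_known = filled_known.fillna(filled_known.mean())

print(filled.tolist(), filled_known.tolist())
```

The two strategies give different numbers because the whole-column mean is pulled down by the zeros themselves.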

**We need to remove unused and unimportant columns such as ('homepage')**

In [None]:
# remove unused and unimportant columns such as 'homepage'
movies_clean = movies_clean.drop('homepage', axis = 1)

**We need to remove the rows that contain NaN cells from the dataset**

In [None]:
# drop every row that still contains at least one NaN cell
movies_clean.dropna(inplace=True)

In [None]:
# convert the dtype of column ('runtime') values to integer dtype
movies_clean.runtime = movies_clean.runtime.astype(int)

In [None]:
movies_clean.info()

In [None]:
movies_clean.shape

**Convert strings in columns to JSON structure to iterate over them and capture name value.<br/>
Capture the name of crew, cast, genres, keywords, spoken_languages, production_countries, production_companies.<br/>
Remove the unnecessary characters and strip spaces from these names.**

In [None]:
# collect columns needed to be converted to json into one array
# use a for loop to iterate over the array of columns to convert them
json_columns = ['genres', 'cast', 'crew', 'keywords', 'production_companies', 'production_countries', 'spoken_languages'] 
for column in json_columns:
    movies_clean[column] = movies_clean[column].apply(json.loads)
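
To make the conversion concrete, here is a minimal sketch of what `json.loads` does to a single cell. The string `raw_genres` is a hypothetical example shaped like the entries of the `genres` column.

```python
import json

# A hypothetical raw cell, shaped like an entry of the 'genres' column.
raw_genres = '[{"id": 18, "name": "Drama"}, {"id": 35, "name": "Comedy"}]'

# json.loads turns the string into a list of dictionaries.
parsed = json.loads(raw_genres)

# Collect the 'name' value of each entry.
names = [entry['name'] for entry in parsed]
print(names)
```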

**code a function to iterate over the contents of the crew column to capture the director name**

In [None]:
# code a function to iterate over the contents of the crew column to capture the director name
# use apply method to apply the function to the wanted column
def get_director(column):
    for i in column:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
movies_clean['director'] = movies_clean['crew'].apply(get_director)
movies_clean.head()
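
The lookup returns as soon as it finds the first crew member whose job is "Director", so a co-directed movie is credited to one person only. A minimal sketch on a toy crew list follows; `first_director` and the names are hypothetical and only mirror the logic of the function above.

```python
import math

def first_director(crew):
    """Return the name of the first crew member whose job is 'Director'."""
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    # No director listed: signal a missing value.
    return math.nan

# Hypothetical crew list with two directors.
crew = [
    {'job': 'Producer', 'name': 'Person A'},
    {'job': 'Director', 'name': 'Person B'},
    {'job': 'Director', 'name': 'Person C'},
]
print(first_director(crew))
```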

**iterate over the columns and clean them of the unwanted characters and spaces**

In [None]:
# keep up to the first four names in each column and join them with ' | '
# joining the names directly avoids mangling names that contain quotes or commas
for column in json_columns:
    movies_clean[column] = movies_clean[column].apply(
        lambda items: ' | '.join(item['name'] for item in items[:4]))

In [None]:
movies_clean.head()

**Save the cleaned data to a new CSV file**

In [None]:
# save to csv file
movies_clean.to_csv('movies.csv')

<a id='eda'></a>
## Exploratory Data Analysis

**Extract the year from the column (release_date) using the str.extract method**

In [None]:
# Extract the year from the column (release_date) using str.extract and a regular expression
# the raw string keeps \d intact; expand=False returns a Series instead of a DataFrame
movies_clean['year'] = movies_clean.release_date.str.extract(r'(\d{4})', expand=False)
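
The regular expression keeps the year as text. A minimal sketch on toy dates (hypothetical values) compares it with parsing the dates via `pd.to_datetime`, which is offered here only as an alternative and yields a numeric year.

```python
import pandas as pd

# Hypothetical release dates, including one missing value.
dates = pd.Series(['2009-12-10', '1997-11-18', None])

# Regex approach: the first run of four digits, returned as strings.
year_text = dates.str.extract(r'(\d{4})', expand=False)

# Alternative: parse the dates and read the year as a number.
year_num = pd.to_datetime(dates, errors='coerce').dt.year

print(year_text.tolist())
print(year_num.tolist())
```

String years sort correctly for four-digit values, but a numeric year is easier to use for ranges and arithmetic.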

In [None]:
movies_clean.to_csv('movies.csv')

In [None]:
sns.set_style('darkgrid')

### Q1: What is the rate of movie production over the years for each movie genre?

_**Get the Rate of Drama Movies**_

In [None]:
# Drama Movies
drama = movies_clean.query('genres == "Drama"')['year'].value_counts(ascending=True)
drama.plot(kind = 'bar',figsize=(16,16))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Drama Over Years', fontsize=16)

_**Get the Rate of Comedy Movies**_

In [None]:
# Comedy Movies
comedy = movies_clean.query('genres == "Comedy"')['year'].value_counts(ascending=True)
comedy.plot(kind = 'bar',figsize=(16,16))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Comedy Over Years', fontsize=16)

_**Get the Rate of Romance Movies**_

In [None]:
# Romance Movies
romance = movies_clean.query('genres == "Romance"')['year'].value_counts()
romance.plot(kind = 'bar',figsize=(6,6))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Romance Over Years', fontsize=16)

_**Get the Rate of Adventure Movies**_

In [None]:
# Adventure Movies
adventure = movies_clean.query('genres == "Adventure"')['year'].value_counts()
adventure.plot(kind = 'bar',figsize=(8,8))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Adventure Over Years', fontsize=16)

_**Get the Rate of Horror Movies**_

In [None]:
# Horror Movies
horror = movies_clean.query('genres == "Horror"')['year'].value_counts(ascending=True)
horror.plot(kind = 'bar',figsize=(10,10))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Horror Over Years', fontsize=16)

_**Get the Rate of Crime Movies**_

In [None]:
# Crime Movies
crime = movies_clean.query('genres == "Crime"')['year'].value_counts()
crime.plot(kind = 'bar',figsize=(6,6))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Crime Over Years', fontsize=16)

_**Get the Rate of Thriller Movies**_

In [None]:
# Thriller Movies
thriller = movies_clean.query('genres == "Thriller"')['year'].value_counts(ascending=True)
thriller.plot(kind = 'bar',figsize=(8,8))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Thriller Over Years', fontsize=16)

_**Get the Rate of Family Movies**_

In [None]:
#Family Movies
family = movies_clean.query('genres == "Family"')['year'].value_counts()
family.plot(kind = 'bar',figsize=(6,6))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Family Over Years', fontsize=16)

_**Get the Rate of Music Movies**_

In [None]:
# Music Movies
music = movies_clean.query('genres == "Music"')['year'].value_counts()
music.plot(kind = 'bar',figsize=(6,6))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Music Movies Over Years', fontsize=16)

_**Get the Rate of Animation Movies**_

In [None]:
# Animation Movies
animation = movies_clean.query('genres == "Animation"')['year'].value_counts()
animation.plot(kind = 'bar',figsize=(6,6))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Animation Over Years', fontsize=16)

_**Get the Rate of Action Movies**_

In [None]:
# Action Movies
action = movies_clean.query('genres == "Action"')['year'].value_counts(ascending=True)
action.plot(kind = 'bar',figsize=(10,10))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Action Over Years', fontsize=16)

_**Get the Rate of Fantasy Movies**_

In [None]:
# Fantasy Movies
fantasy = movies_clean.query('genres == "Fantasy"')['year'].value_counts(ascending=True)
fantasy.plot(kind = 'bar',figsize=(8,8))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Fantasy Over Years', fontsize=16)

_**Get the Rate of Western Movies**_

In [None]:
# Western Movies
western = movies_clean.query('genres == "Western"')['year'].value_counts(ascending=True)
western.plot(kind = 'bar',figsize=(8,8))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Western Over Years', fontsize=16)

_**Get the Rate of Science Fiction Movies**_

In [None]:
# Science Fiction Movies
science_fiction = movies_clean.query('genres == "Science Fiction"')['year'].value_counts(ascending=True)
science_fiction.plot(kind = 'bar',figsize=(8,8))
plt.xlabel('Year', fontsize = 14)
plt.ylabel('Movie_count', fontsize= 14)
plt.title('Science Fiction Over Years', fontsize=16)
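
After cleaning, each `genres` cell holds up to four names joined by ` | `, so `genres == "Drama"` matches only movies whose single listed genre is Drama. The sketch below, on toy data with hypothetical values, shows one possible way to count every movie that lists a genre: split the string, explode it into one row per genre, and count per year. It is an illustration, not the method used for the plots above.

```python
import pandas as pd

# Toy data shaped like the cleaned 'genres' and 'year' columns (hypothetical values).
movies = pd.DataFrame({
    'genres': ['Drama', 'Drama | Comedy', 'Comedy', 'Action | Drama'],
    'year': ['2006', '2006', '2012', '2015'],
})

# Exact match: only the single-genre Drama row is counted.
exact = movies.query('genres == "Drama"')['year'].value_counts()

# Split each cell into a list, then create one row per (movie, genre) pair.
exploded = movies.assign(genre=movies['genres'].str.split('|')).explode('genre')
exploded['genre'] = exploded['genre'].str.strip()

# Count movies per genre and year; missing combinations become 0.
counts = exploded.groupby(['genre', 'year']).size().unstack(fill_value=0)
drama_by_year = counts.loc['Drama'].sort_index()

print(exact)
print(drama_by_year)
```

Calling `drama_by_year.plot(kind='bar')` would then draw the counts in chronological order instead of sorted by count.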

### Q2: Which genres are the most popular, most profitable, highest rated, and most voted on?

_**Getting the Profit Values and appending them into a column "Profit"**_<br/>
_**Save the column into the CSV file**_

In [None]:
# getting the profit and appending it into profit column
movies_clean['profit'] = movies_clean.revenue - movies_clean.budget
movies_clean.head()
movies_clean.to_csv('movies.csv')

_**A function that counts how often each value appears in one of the dataset's multi-valued columns**_<br/>
_**Here we need to split the movie genres so that each genre is connected with its related data**_

In [None]:
# function that counts how often each value appears in a multi-valued column
# it joins all cells, splits them on '|', strips spaces, and counts each genre
def get_data(data, column):
    data = data[column].str.cat(sep = '|')
    data = pd.Series(data.split('|')).str.strip()
    data_counts = data.value_counts(ascending=True)
    return data_counts
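
A short usage sketch shows what the function returns. The frame `toy` and its values are hypothetical; the function body repeats the logic of `get_data` so that the example runs on its own.

```python
import pandas as pd

def get_data(data, column):
    # Join all cells into one string, split on '|', and strip the spaces.
    joined = data[column].str.cat(sep='|')
    values = pd.Series(joined.split('|')).str.strip()
    # Count each distinct value, least frequent first.
    return values.value_counts(ascending=True)

# Hypothetical rows shaped like the cleaned 'genres' column.
toy = pd.DataFrame({'genres': ['Drama | Comedy', 'Drama', 'Action | Drama']})
genre_counts = get_data(toy, 'genres')
print(genre_counts)
```

Because a movie with several genres contributes to each of them, the counts can add up to more than the number of movies.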

_**getting the highest profit data and connecting them to its related movie genre**_

In [None]:
# getting the highest profit data and connecting them to its related movie genre
highest_profit = movies_clean[movies_clean.profit > movies_clean.profit.mean()]
movie_genre = get_data(highest_profit,'genres')
movie_genre.plot(kind='bar', figsize=(14,14), color='yellowgreen')
plt.xlabel('Movie Genres', fontsize=14)
plt.ylabel('Movies Number', fontsize=14)
plt.title('Most Profitable Movie Genres', fontsize=16)

_**getting the highest vote_average data and connecting them to its related movie genre**_

In [None]:
# getting the highest vote_average data and connecting them to its related movie genre
highest_vote_average = movies_clean[movies_clean.vote_average > movies_clean.vote_average.mean()]
movie_genre = get_data(highest_vote_average,'genres')
movie_genre.plot(kind='bar', figsize=(14,14), color='purple')
plt.xlabel('Movie Genre', fontsize=16)
plt.ylabel('No. of Movies', fontsize=16)
plt.title('Movies with Highest Vote Average', fontsize=18)

_**getting the highest vote_count data and connecting them to its related movie genre**_

In [None]:
# getting the highest vote_count data and connecting them to its related movie genre
highest_vote_count = movies_clean[movies_clean.vote_count > movies_clean.vote_count.mean()]
movie_genre = get_data(highest_vote_count,'genres')
movie_genre.plot(kind='bar', figsize=(14,14), color='violet')
plt.xlabel('Movie Genre', fontsize=16)
plt.ylabel('No. of Movies', fontsize=16)
plt.title('Movies with Highest Vote Count', fontsize=18)

_**getting the highest popularity data and connecting them to its related movie genre**_

In [None]:
# getting the highest popularity data and connecting them to its related movie genre
highest_popularity = movies_clean[movies_clean.popularity > movies_clean.popularity.mean()]
movie_genre = get_data(highest_popularity,'genres')
movie_genre.plot(kind='bar', figsize=(14,14), color='#00BFFF')
plt.xlabel('Movie Genre', fontsize=16)
plt.ylabel('No. of Movies', fontsize=16)
plt.title('Movies with Highest Popularity', fontsize=18)

## TOP 10 Movies
### Q3A: What kinds of properties are associated with movies that have high revenues?

_**getting the Top 10 highest revenue movies and their properties**_

In [None]:
# getting the Top 10 highest revenue movies and their properties
top_10 = movies_clean['revenue'].nlargest(10).index
movies_clean.loc[top_10]

### Q3B: What kinds of properties are associated with the most popular movies?

_**getting the Top 10 most popular movies and their properties**_

In [None]:
# getting the Top 10 most popular movies and their properties
top_10 = movies_clean['popularity'].nlargest(10).index
movies_clean.loc[top_10]

### Q3C: What kinds of properties are associated with the highest rated movies?

_**getting the Top 10 highest rated movies and their properties**_

In [None]:
# getting the Top 10 highest rated movies and their properties
top_10 = movies_clean['vote_average'].nlargest(10).index
movies_clean.loc[top_10]
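
A list ranked by `vote_average` alone can be dominated by movies that received very few votes. One common remedy, which is an assumption of this sketch and not part of the analysis above, is to require a minimum `vote_count` before ranking. The data and the threshold `min_votes` are hypothetical.

```python
import pandas as pd

# Toy ratings table (hypothetical values).
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'vote_average': [10.0, 8.5, 8.3, 6.0],
    'vote_count': [1, 8000, 5000, 300],
})

# Ranking by rating alone puts the single-vote movie first.
top_raw = movies.nlargest(2, 'vote_average')['title'].tolist()

# Hypothetical threshold: ignore movies with fewer than 100 votes.
min_votes = 100
well_voted = movies[movies['vote_count'] >= min_votes]
top_filtered = well_voted.nlargest(2, 'vote_average')['title'].tolist()

print(top_raw, top_filtered)
```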

<a id='companies'></a>
### Q4: Which companies made the overall highest revenue per year?

_**getting the companies that made the highest revenue each year and the movie that earned that revenue**_

In [None]:
# for each year, find the row with the highest revenue, then show the
# production companies, year, revenue, and title of that movie
max_revenue_idx = movies_clean.groupby('year')['revenue'].idxmax()
companies_year_revenue_title = movies_clean.loc[
    max_revenue_idx, ['production_companies', 'year', 'revenue', 'title']]
companies_year_revenue_title
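
The key step is `groupby(...).idxmax()`, which returns, for each year, the index label of the row with the largest revenue. A minimal sketch on toy data (hypothetical values) shows the idea in isolation.

```python
import pandas as pd

# Toy data (hypothetical values).
movies = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'year': ['2009', '2009', '2012', '2012'],
    'revenue': [100, 300, 250, 50],
    'production_companies': ['X', 'Y', 'Z', 'X | Y'],
})

# Index label of the highest-revenue row within each year.
best_idx = movies.groupby('year')['revenue'].idxmax()

# Use those labels to pull the matching rows and the columns of interest.
best_per_year = movies.loc[best_idx, ['year', 'production_companies', 'revenue', 'title']]
print(best_per_year)
```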

### Q5: What is the relation between popularity and vote_count?

_**visualizing the relationship between popularity and vote_count with a scatter plot to see how the two move together**_

In [None]:
# visualizing the relationship between popularity and vote_count with a scatter plot
# the figure size must be set before plotting, otherwise a new empty figure is created
plt.figure(figsize=(20, 20))
plt.scatter(movies_clean.vote_count, movies_clean.popularity)
plt.xlabel('Vote_Count', fontsize=14)
plt.ylabel('Popularity', fontsize=14)
plt.title('Popularity Vote_Count Relationship', fontsize= 16)

### Q6: What is the relation between the vote_count and vote rate?

_**visualizing the relationship between vote rate and vote_count with a scatter plot to see how the two move together**_


In [None]:
# visualizing the relationship between vote rate and vote_count with a scatter plot
# the figure size must be set before plotting, otherwise a new empty figure is created
plt.figure(figsize=(20, 20))
plt.scatter(movies_clean.vote_count, movies_clean.vote_average)
plt.xlabel('Vote_Count', fontsize=14)
plt.ylabel('Vote_Rate', fontsize=14)
plt.title('Vote_Count & Rate Relationship', fontsize= 16)

### Q7: Is there a relation between the budget and the runtime ?

_**visualizing the relationship between budget and runtime with a scatter plot to see whether a longer runtime goes with a higher budget**_

In [None]:
# visualizing the relationship between budget and runtime with a scatter plot
# the figure size must be set before plotting, otherwise a new empty figure is created
plt.figure(figsize=(20, 20))
plt.scatter(movies_clean.runtime, movies_clean.budget)
plt.xlabel('RunTime', fontsize=14)
plt.ylabel('Budget', fontsize=14)
plt.title('RunTime & Budget Relationship', fontsize= 16)

### Q8: Is there a relation between the budget and the revenue ?

_**visualizing the relationship between budget and revenue with a scatter plot to see whether the two are related**_

In [None]:
# visualizing the relationship between budget and revenue with a scatter plot
# the figure size must be set before plotting, otherwise a new empty figure is created
plt.figure(figsize=(20, 20))
plt.scatter(movies_clean.revenue, movies_clean.budget)
plt.xlabel('Revenue', fontsize=14)
plt.ylabel('Budget', fontsize=14)
plt.title('Revenue & Budget Relationship', fontsize= 16)
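
A scatter plot shows the shape of a relationship; a correlation coefficient puts a number on it. The sketch below, on hypothetical values, uses `Series.corr`, which computes the Pearson correlation by default. It is meant as a complement to the scatter plots of Q5 to Q8, not as a result of this analysis.

```python
import pandas as pd

# Toy budget and revenue values (hypothetical).
movies = pd.DataFrame({
    'budget': [10, 20, 30, 40, 50],
    'revenue': [15, 45, 40, 90, 110],
})

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relation.
pearson = movies['budget'].corr(movies['revenue'])
print(round(pearson, 3))
```

A value close to 1 supports a positive linear relationship, while a value near 0 suggests that the two columns do not move together linearly.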

### Q9: Most Popular Directors?

_**get the 10 directors with the most movies in the dataset and the number of movies each one directed**_

In [None]:
movies_clean.director.value_counts().nlargest(10)

### Q10: Average Runtime?

_**in this visualization we show how the movie runtimes are distributed, to see around which runtime most movies fall**_

In [None]:
# visualizing the spread of runtimes with a scatter plot of runtime against movie_id
# the dense vertical band shows where most runtimes fall
# note: the y-axis is the movie ID, not a count of movies
plt.figure(figsize=(20, 20))
plt.scatter(movies_clean.runtime, movies_clean.movie_id)
plt.xlabel('RunTime', fontsize=14)
plt.ylabel('Movie ID', fontsize=14)
plt.show()
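
To read the average runtime and the most common runtime range directly, one option is to compute the mean and count the movies per 10-minute bin. This is a sketch on hypothetical runtimes; the bin edges are an assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical runtimes in minutes.
runtime = pd.Series([88, 95, 101, 104, 106, 108, 109, 115, 128])

# Average runtime.
mean_runtime = runtime.mean()

# Count movies per 10-minute bin: [80, 90), [90, 100), ..., [120, 130).
bins = list(range(80, 140, 10))
counts = pd.cut(runtime, bins=bins, right=False).value_counts().sort_index()

# The bin that holds the most movies.
most_common_bin = counts.idxmax()
print(mean_runtime, most_common_bin)
```

On the real data, `movies_clean.runtime.plot(kind='hist', bins=30)` would draw a comparable picture with the number of movies on the y-axis.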

In [None]:
plt.show()

In [None]:
movies_clean.to_csv('movies.csv')

<a id='conclusions'></a>
## Conclusions

#### Rate of movie production over years for each movie genre:
> Drama Movies: production peaked in 2006 <br/>
> Comedy Movies: production peaked in 2012 <br/>
> Horror Movies: production peaked in 2009 <br/>
> Thriller Movies: production peaked in 2015 <br/>
> Action Movies: production peaked in 2015 <br/>
> Fantasy Movies: production peaked in 2016 <br/>
> Western Movies: production peaked in 1968 and has been declining since <br/>
> Science Fiction Movies: production peaked in 2000 <br/>
#### The most profitable, highest rated, most voted on, and most popular movie genres:
> The most profitable movie genres are First: Comedy, Second: Drama, Third: Action <br/>
> The highest rated movie genres are First: Drama, Second: Comedy, Third: Thriller <br/>
> The movie genres with the highest vote_count are First: Action, Second: Drama, Third: Adventure <br/>
> The most popular movie genres are First: Drama, Second: Action, Third: Thriller <br/>
#### Top 10 Movies?:
>>**TOP 10 highest revenue:** <br/>                   
>>Avatar<br/>
>>Titanic<br/>
>>The Avengers<br/>
>>Jurassic World<br/>
>>Furious 7<br/>
>>Avengers: Age of Ultron<br/>
>>Frozen<br/>
>>Iron Man 3<br/>
>>Minions<br/>
>>Captain America: Civil War <br/><br/>
>> **TOP 10 popular:** <br/>
>>Minions<br/>
>>Interstellar<br/>
>>Deadpool<br/>
>>Guardians of the Galaxy<br/>
>>Mad Max: Fury Road<br/>
>>Jurassic World<br/>
>>Pirates of the Caribbean: The Curse of the Black Pearl<br/>
>>Dawn of the Planet of the Apes<br/>
>>The Hunger Games: Mockingjay - Part 1<br/>
>>Big Hero 6<br/><br/>
>>**TOP 10 rated:** <br/>
>>Dancer, Texas Pop. 81<br/>
>>Me You and Five Bucks<br/>
>>One Man's Hero<br/>
>>The Shawshank Redemption<br/>
>>The Prisoner of Zenda<br/>
>>The Godfather<br/>
>>Fight Club<br/>
>>Schindler's List<br/>
>>Spirited Away<br/>
>>The Godfather: Part II<br/>
#### Companies made the overall highest revenue per year:
<a href="#companies">Companies</a>
#### Relation between popularity and vote_count:
> Most points are concentrated toward the left of the plot; the relationship is positive
#### Relation between the vote_count and vote rate:
> No clear relationship between the vote count and the vote rate
#### Relation between the budget and the runtime:
> No clear relationship: a longer runtime does not go with a higher budget
#### Relation between the budget and the revenue:
> Most points are concentrated toward the left of the plot; the relationship is positive
#### Most Popular Directors and Number of their movies in our dataset:
>Steven Spielberg:      27<br/>
>Clint Eastwood:        18<br/>
>Robert Rodriguez:      15<br/>
>Ridley Scott:          15<br/>
>Renny Harlin:          15<br/>
>Martin Scorsese:       15<br/>
>Steven Soderbergh:     14<br/>
>Tim Burton:            14<br/>
>Oliver Stone:          14<br/>
>Woody Allen:           13<br/>
#### Average Runtime:
> Most movies run between 100 and 110 minutes, with a mean of about 106 minutes

### *Limitations:*
1. We used the TMDb movies dataset for our analysis and worked with popularity, revenue, and runtime. Our analysis is limited to the provided dataset. For example, the dataset does not guarantee that every release of every director is listed.<br/>
2. No normalization, exchange-rate adjustment, or currency conversion was applied during this analysis, so it is limited to the raw numerical values of revenue.<br/>
3. Dropping missing or null values from the variables of interest might skew our analysis and could introduce unintended bias into the relationships being analyzed.<br/>

## Resources:

<a href='https://pandas.pydata.org/docs/'>Pandas Documentation</a><br/>
<a href='https://matplotlib.org/3.3.3/contents.html#'>Matplotlib Documentation</a><br/>
<a href='https://numpy.org/doc/1.20/'>Numpy Documentation</a><br/>
<a href='https://docs.python.org/3/'>Python Documentation</a><br/>
<a href='https://seaborn.pydata.org/'>Seaborn Documentation</a><br/>
<a href='https://www.kaggle.com/tmdb/tmdb-movie-metadata'>Kaggle TMDb dataset</a><br/>
<a href='http://ipython.readthedocs.io/en/stable/interactive/magics.html'>Notebook Inline magics</a><br/>

In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Project: Investigate a Dataset (TMDb)'])