
# Project: Investigate TMDB 5000 Movie Dataset
> Udacity project
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

> In this analysis we investigate the TMDB movie dataset, which records details of movies such as genres, budget, revenue, and ratings.


> We will answer the following questions:

    1. Which genres are most popular from year to year?
    
    2. What is the relation between revenue and budget?
    


In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns 

%matplotlib inline 


<a id='wrangling'></a>
## Data Wrangling


In [None]:
# Load your data and print out a few lines. Perform operations to inspect data
#   types and look for instances of missing or possibly errant data.

raw_df = pd.read_csv('../input/tmdb-5000-movies/tmdb-movies.csv')

In [None]:
raw_df.head(3)   # 21 columns
                 # genres and production_companies pack several "|"-separated values into one string and need reformatting

In [None]:
raw_df.tail(3)

In [None]:
raw_df.shape     # 10866 rows

In [None]:
raw_df.info()    # release_date is stored as a string and should be converted to a datetime
                 # null values appear in: cast, homepage, director, tagline, keywords, overview, genres, production_companies

In [None]:
raw_df.describe()  # some movies have 0 popularity or 0 runtime and yet still have votes

In [None]:
raw_df.duplicated().sum()  # count of fully duplicated rows (0 means no duplicates)
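
Calling `duplicated()` on its own returns one boolean per row, which is hard to read for thousands of rows. A minimal sketch of the check on a toy frame (the `toy` data is invented for illustration and is not the TMDB file):

```python
import pandas as pd

# Toy frame standing in for raw_df; the second and third rows are identical.
toy = pd.DataFrame({
    "original_title": ["A", "B", "B"],
    "budget": [100, 200, 200],
})

n_duplicates = int(toy.duplicated().sum())   # rows that repeat an earlier row
deduped = toy.drop_duplicates()              # keeps the first copy of each row

print(n_duplicates, len(deduped))
```

If the count is non-zero on the real data, `drop_duplicates()` would be the natural next step.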



### Data Cleaning

First, let's drop the columns we won't need.


In [None]:
v1_df = raw_df.drop(['id','imdb_id','homepage','cast','overview','tagline','director','keywords'],axis=1)
v1_df.head(3)

In [None]:
v1_df.isnull().sum()   # 23 rows have a null genres value, so we can safely drop them
                       # dropping rows with a null production_companies value did not change the overall picture much

In [None]:
v1_df.dropna(inplace=True)
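
Note that `dropna()` without arguments removes a row if any column is null, so it can discard more rows than the single column being inspected suggests. A small sketch on invented data (the values and studio names are hypothetical) that compares the two behaviours:

```python
import numpy as np
import pandas as pd

# Toy frame, not the real dataset: one null in each of two columns.
toy = pd.DataFrame({
    "genres": ["Action|Drama", np.nan, "Comedy", "Drama"],
    "production_companies": ["Studio X", "Studio Y", np.nan, "Studio Z"],
    "revenue": [10, 20, 30, 40],
})

null_counts = toy.isnull().sum()               # nulls per column
cleaned_any = toy.dropna()                     # drops rows with a null in ANY column
cleaned_genres = toy.dropna(subset=["genres"]) # drops only rows with a null genre

rows_lost_any = len(toy) - len(cleaned_any)
rows_lost_genres = len(toy) - len(cleaned_genres)
print(null_counts)
print(rows_lost_any, rows_lost_genres)
```

Comparing the row counts before and after makes it easy to confirm how much data each choice costs.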

In [None]:
v1_df.isnull().any()    

In [None]:
v1_df.info()  # handled null values

In [None]:
v1_df['release_date']=pd.to_datetime(v1_df['release_date']) 
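
One thing worth checking after this conversion: if the date strings use two-digit years (an assumption about the file, for example `'12/25/66'`), the `%y` directive maps 00 to 68 onto 2000 to 2068, so older films can land in the future. A hedged sketch of a correction on toy strings, where `latest_year` is a hypothetical cutoff that would need to match the newest release in the data:

```python
import pandas as pd

# Toy date strings with two-digit years (assumed format: month/day/year).
dates = pd.Series(["6/9/15", "12/25/66", "7/4/99"])
parsed = pd.to_datetime(dates, format="%m/%d/%y")

latest_year = 2015  # hypothetical cutoff: no movie should be newer than this

# Any date after the cutoff is assumed to belong to the previous century.
fixed = parsed.where(parsed.dt.year <= latest_year,
                     parsed - pd.DateOffset(years=100))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

If the dataset also carries a separate four-digit year column, comparing it against `release_date` is a simpler way to spot the problem.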

In [None]:
v1_df.info() # fixed the type of release_date

In [None]:
v1_df['genres']=v1_df['genres'].apply(lambda x: x.split("|")[0]) # keep only the first listed genre as the movie's genre,
                                                                 # so that one movie is not counted under several genres
                                                                 # at the same time
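
Keeping only the first genre is one of two common choices. The sketch below, on invented titles, contrasts it with `explode`, which keeps every genre at the cost of repeating the movie once per genre:

```python
import pandas as pd

# Toy frame with "|"-separated genres, mirroring the format described above.
toy = pd.DataFrame({
    "original_title": ["A", "B"],
    "genres": ["Action|Adventure|Thriller", "Drama"],
})

# Option 1 (the approach used in this notebook): first listed genre only.
first_genre = toy["genres"].str.split("|").str[0]

# Option 2 (alternative, not used here): one row per (movie, genre) pair.
exploded = toy.assign(genres=toy["genres"].str.split("|")).explode("genres")

print(first_genre.tolist())
print(exploded)
```

The first option keeps one row per movie, which suits counting movies; the second suits questions about every genre a movie belongs to.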

In [None]:
v1_df.head(2)           # handled the genre format 

In [None]:
v1_df.hist(figsize=(12,10))

In [None]:
v1_df.describe()

In [None]:
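v1_df = v1_df[
    (v1_df['budget'] > 1000)        # drop missing or implausibly small budgets
    & (v1_df['runtime'] > 1)        # drop zero or near-zero runtimes
    & (v1_df['vote_count'] > 30)    # keep movies with a meaningful number of votes
    & (v1_df['revenue'] > 1)        # drop missing revenue
]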
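
Because four conditions are combined in one step, it can be useful to see how many rows each one rejects before applying them together. A minimal sketch on invented numbers, using the same thresholds as the filter above:

```python
import pandas as pd

# Toy frame; the values are made up to trigger each condition.
toy = pd.DataFrame({
    "budget":     [0, 5_000_000, 150, 20_000_000, 30_000_000],
    "revenue":    [0, 12_000_000, 900, 0, 90_000_000],
    "runtime":    [90, 0, 100, 120, 110],
    "vote_count": [10, 500, 45, 800, 1200],
})

conditions = {
    "budget > 1000":   toy["budget"] > 1000,
    "runtime > 1":     toy["runtime"] > 1,
    "vote_count > 30": toy["vote_count"] > 30,
    "revenue > 1":     toy["revenue"] > 1,
}

# Rows failing each condition, counted independently of the others.
failed_per_condition = {name: int((~mask).sum()) for name, mask in conditions.items()}

# A row is kept only if it passes every condition.
keep = pd.concat(conditions, axis=1).all(axis=1)
filtered = toy[keep]

print(failed_per_condition)
print(len(filtered))
```

On the real data, a condition that rejects a large share of rows deserves a closer look before it is treated as routine cleaning.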

In [None]:
v1_df.describe()  # removed rows with missing or implausible values

<a id='eda'></a>
## Exploratory Data Analysis


### 1. Which genres are most popular?

In [None]:
q1 = v1_df.copy()

In [None]:
top50 = q1.sort_values(by=['vote_average'], ascending=False).head(50)   # the 50 highest-rated movies
top50.groupby(['genres']).count().plot.pie(y='vote_average', figsize=(10, 10), autopct="%.2f")   # share of each genre
plt.title('Genre share among the 50 highest-rated movies')

> Drama is the most common genre among the 50 highest-rated movies.
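
The introduction asks which genres are most popular from year to year, while the chart above pools the top 50 movies across all years. One possible way to get a per-year answer is sketched below on toy data. It assumes the frame has a `release_year` column and uses mean `popularity` as the measure, both of which are choices made for this illustration rather than steps taken in the notebook:

```python
import pandas as pd

# Toy frame; release_year and the popularity values are invented.
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2014, 2015, 2015, 2015],
    "genres": ["Drama", "Action", "Action", "Drama", "Drama", "Comedy"],
    "popularity": [1.0, 3.0, 5.0, 4.0, 6.0, 2.0],
})

# Mean popularity for every (year, genre) pair.
mean_pop = toy.groupby(["release_year", "genres"], as_index=False)["popularity"].mean()

# For each year, pick the row with the highest mean popularity.
top_rows = mean_pop.loc[mean_pop.groupby("release_year")["popularity"].idxmax()]

top_genre_per_year = {int(y): g for y, g in zip(top_rows["release_year"], top_rows["genres"])}
print(top_genre_per_year)
```

A different measure, such as the number of movies per genre, would fit the same pattern by swapping the aggregation.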

### 2. What is the relation between revenue and budget?

In [None]:
q2=v1_df.copy()
q2 = q2.set_index('original_title')

In [None]:
q2 = q2.sort_values(by=['vote_average'],ascending=False).head(20)

In [None]:
q2[['budget','revenue']].plot(kind='barh',figsize=(12,12),stacked=True)

> Among the 20 highest-rated movies, The Dark Knight has the highest revenue.

In [None]:
q2['profit'] = q2['revenue'] / q2['budget']   # note: this is revenue earned per dollar of budget (a ratio), not profit in dollars
q2 = q2.sort_values(by=['profit'], ascending=False)

In [None]:
q2['profit'].plot(kind='barh',figsize=(12,12))

> Among the 20 highest-rated movies, The Godfather earned the most revenue per dollar of budget.
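
The bar charts compare individual movies, but the relation between the two columns can also be summarised in numbers. The sketch below, on invented figures, separates the revenue-to-budget ratio from profit in dollars and computes the Pearson correlation between budget and revenue. The column names `revenue_to_budget` and `profit_dollars` are hypothetical and chosen only for this example:

```python
import pandas as pd

# Toy frame; titles and amounts are made up.
toy = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "budget":  [1_000_000, 10_000_000, 50_000_000, 100_000_000],
    "revenue": [20_000_000, 15_000_000, 200_000_000, 150_000_000],
})

toy["revenue_to_budget"] = toy["revenue"] / toy["budget"]   # return per dollar spent
toy["profit_dollars"] = toy["revenue"] - toy["budget"]      # absolute profit

# Pearson correlation: close to 1 means revenue rises with budget.
corr = toy["budget"].corr(toy["revenue"])

best_ratio_title = toy.loc[toy["revenue_to_budget"].idxmax(), "original_title"]
best_profit_title = toy.loc[toy["profit_dollars"].idxmax(), "original_title"]
print(best_ratio_title, best_profit_title, round(corr, 2))
```

The two rankings can disagree: a low-budget movie may lead on the ratio while a blockbuster leads on absolute profit.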

<a id='conclusions'></a>
## Conclusions

> Drama is the most common genre among the 50 highest-rated movies.

> Among the 20 highest-rated movies, a large budget does not guarantee a large revenue, although some high-budget movies did earn a lot.
