# Investigate a Dataset (tmdb-movies)

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
##### Here we investigate the `tmdb-movies` dataset to answer the following main questions

##### popularity
* Q1: Which movies are the most popular?
* Q2: Is there a relationship between popularity and vote_count?

##### budget
* Q3: Which movies have the highest budgets?
* Q4: Which year has the highest total budget?
* Q5: Is there a correlation between a movie's budget and its runtime?

##### revenue
* Q6: Which movies have the highest revenue?
* Q7: Which year has the highest total revenue?
* Q8: What is the relationship between revenue and budget?

##### profit
* Q9: Which movies have the highest profit?
* Q10: Which year has the highest total profit?
* Q11: How has film industry profit changed over the years?

##### directors
* Q12: Who are the most active directors?
* Q13: Who directed the movies with the highest popularity and vote counts?

##### runtime
* Q14: Which movies are the longest and the shortest?

##### genres
* Q15: Which genres have the highest budgets and generate the highest revenues?

##### production_companies
* Q16: Which production companies are the most active?
* Q17: Which production companies have the highest profit?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#% matplotlib inline
import seaborn as sns



<a id='wrangling'></a>
## Data Wrangling

### General Properties

In [None]:
# Load data and print out a few lines
df = pd.read_csv('../input/tmdb-movies/tmdb-movies.csv', index_col="id")
df.head(3)

In [None]:
# this returns a tuple of the dimensions of the dataframe
df.shape

In [None]:
# this returns the datatypes of the columns
df.dtypes

In [None]:
# this displays a concise summary of the dataframe,
# including the number of non-null values in each column
df.info()

In [None]:
# this returns useful descriptive statistics for each column of data
df.describe()

##### Note 1: the dataset contains a large number of null values. We do not drop those rows, so that the results of the analysis are not affected, and we cannot fill them with a mean because they are all in non-numerical columns.
##### Note 2: some numeric columns contain zero values, which we replace with the column mean.
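
The two notes above can be checked by counting nulls and zeros per column. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are illustrative assumptions, not rows from the real dataset); the same two expressions can be applied to `df`.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "budget": [0, 150, 0, 30],
    "revenue": [0, 500, 20, 0],
    "homepage": [None, "http://example.com", None, None],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()

# Number of zero values in each numeric column.
numeric = toy.select_dtypes(include="number")
zero_counts = (numeric == 0).sum()

print(null_counts)
print(zero_counts)
```

Seeing both counts side by side makes it easier to decide, column by column, whether dropping, filling, or leaving the values alone is appropriate.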

### Data Cleaning

In [None]:
# change data types
df['release_date'] = pd.to_datetime(df['release_date'])

# df.dtypes
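
One thing worth checking after the conversion: if `release_date` is stored with two-digit years (an assumption about the file, for example `12/25/66`), the `%y` convention maps 00 to 68 onto 2000 to 2068, so films from the 1960s can land in the future. The sketch below uses a made-up toy frame and the `release_year` column as the reference to shift such dates back by a century; the explicit `format` string is also an assumption.

```python
import pandas as pd

# Toy rows (made up) with two-digit years, plus the known release year.
toy = pd.DataFrame({
    "release_date": ["6/9/15", "12/25/66"],
    "release_year": [2015, 1966],
})

# %y maps 00-68 to 2000-2068, so '66' is parsed as 2066.
parsed = pd.to_datetime(toy["release_date"], format="%m/%d/%y")

# Where the parsed year is later than release_year, move the date back 100 years.
in_future = parsed.dt.year > toy["release_year"]
fixed = parsed.where(~in_future, parsed - pd.DateOffset(years=100))

print(fixed)
```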

In [None]:
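# replace zero values in the numeric columns with the column mean
# (df.mean() on the full frame fails on text columns in recent pandas versions)
num_cols = df.select_dtypes(include='number').columns
col_means = df[num_cols].mean()
for col in num_cols:
    df[col] = df[col].replace(0, col_means[col])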

# df.describe()

In [None]:
# check for duplicated rows
df.duplicated().sum()

In [None]:
# remove duplicated
df.drop_duplicates(inplace = True)

# sum(df.duplicated())

In [None]:
# drop columns that are not used in the analysis
df.drop(['homepage','tagline','overview'], axis = 1, inplace = True)

<a id='eda'></a>

# Exploratory Data Analysis



In [None]:
# Data distribution 
df.hist(bins= 20,figsize=(20,10))

## Q1: Which movies are the most popular?

In [None]:
# top 10 most popular movies
df.groupby('original_title')['popularity'].mean().sort_values(ascending = False)[:10]

## Q2: Is there a relationship between popularity and vote_count?

In [None]:
plt.scatter(df.popularity, df.vote_count)
plt.xlabel('popularity')
plt.ylabel('vote_count')
plt.title('popularity & vote_count Relationship', fontsize= 16)

> Most points are concentrated at low values, and vote_count tends to rise with popularity, so we conclude that there is a positive relationship between popularity and vote_count
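
A scatter plot gives a visual impression; a correlation coefficient puts a number on it. The sketch below is illustrative only: it uses a made-up toy frame, and the rank-based variant (Pearson correlation computed on ranks, which is Spearman's coefficient) is included because it is less sensitive to the extreme values seen in this kind of data.

```python
import pandas as pd

# Toy values (made up) standing in for df.popularity and df.vote_count.
toy = pd.DataFrame({
    "popularity": [0.5, 1.2, 2.8, 4.1, 9.4],
    "vote_count": [20, 85, 400, 950, 3100],
})

# Pearson correlation: strength of the linear relationship.
pearson = toy["popularity"].corr(toy["vote_count"])

# Spearman correlation, computed as Pearson on the ranks.
spearman = toy["popularity"].rank().corr(toy["vote_count"].rank())

print(f"Pearson r = {pearson:.2f}, Spearman rho = {spearman:.2f}")
```

On the real data the same two lines would be applied to `df['popularity']` and `df['vote_count']`; a value close to +1 supports the positive relationship read off the plot.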

## Q3: Which movies have the highest budgets?


In [None]:
# the 10 movies with the highest budgets
top_10_budget = df.groupby('original_title')['budget'].mean().sort_values(ascending = False)[:10]
top_10_budget

### Visualizing the 10 highest movie budgets

In [None]:
top_10_budget.plot(kind='bar', title='Highest movie budgets', figsize=(15, 5))

## Q4: Which year has the highest total budget?

In [None]:
# the year with the highest total budget
# (select the column before summing so non-numeric columns are not aggregated)
budget_per_year = df.groupby('release_year')['budget'].sum()
budget_per_year.sort_values(ascending = False)[:1]

## Q5: Is there a correlation between a movie's budget and its runtime?

In [None]:
median_budget = df['budget'].median()
low = df.query('budget < {}'.format(median_budget))
high = df.query('budget >= {}'.format(median_budget))

locations = [1, 2]
heights = [low['runtime'].mean(), high['runtime'].mean()]
labels = ['Low', 'High']
plt.bar(locations, heights, tick_label=labels)
plt.title('runtime vs budget')
plt.xlabel('budget')
plt.ylabel('runtime');

> The bar chart shows no clear relation between a movie's budget and its runtime
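
A median split yields only two groups. As a finer-grained sketch, the budget can be cut into quartiles with `pd.qcut` and the mean runtime compared across them, alongside the correlation coefficient itself. The toy frame and the level labels below are made up for illustration.

```python
import pandas as pd

# Toy values (made up) standing in for df.budget and df.runtime.
toy = pd.DataFrame({
    "budget": [1, 2, 5, 10, 20, 40, 80, 160],
    "runtime": [88, 92, 95, 101, 99, 110, 124, 141],
})

# Split the budget into four equally sized groups (quartiles).
toy["budget_level"] = pd.qcut(
    toy["budget"], q=4, labels=["low", "medium", "high", "very high"]
)

# Mean runtime within each budget level.
runtime_by_level = toy.groupby("budget_level", observed=True)["runtime"].mean()

# Pearson correlation between budget and runtime.
r = toy["budget"].corr(toy["runtime"])

print(runtime_by_level)
print(f"r = {r:.2f}")
```

If many rows share one budget value (as happens after zeros are filled with the mean), quantile edges can coincide and `pd.qcut` raises a `ValueError`; passing `duplicates='drop'` without explicit labels is one way around that.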


## Q6: Which movies have the highest revenue?


In [None]:
# the 10 movies with the highest revenue
top_10_revenue = df.groupby('original_title')['revenue'].mean().sort_values(ascending = False)[:10]
top_10_revenue

### Visualizing the 10 highest movie revenues

In [None]:
top_10_revenue.plot(kind='bar', title='Highest movie revenues', figsize=(15, 5))

## Q7: Which year has the highest total revenue?


In [None]:
# the year with the highest total revenue
revenue_per_year = df.groupby('release_year')['revenue'].sum()
revenue_per_year.sort_values(ascending = False)[:1]

## Q8: What is the relationship between revenue and budget?


In [None]:
plt.scatter(df.revenue, df.budget)
plt.xlabel('Revenue')
plt.ylabel('Budget')
plt.title('Revenue & Budget Relationship', fontsize= 16)

> Most points are concentrated at low values, and revenue tends to rise with budget, so we conclude that there is a positive relationship between budget and revenue


## Q9: Which movies have the highest profit?


In [None]:
profit = df['revenue'] - df['budget']

In [None]:
df['profit'] = profit

In [None]:
# the 10 movies with the highest profit
top_10_profit = df.groupby('original_title')['profit'].mean().sort_values(ascending = False)[:10]
top_10_profit

### Visualizing the 10 highest movie profits

In [None]:
top_10_profit.plot(kind='bar', title='Highest movie profits', figsize=(15, 5))

## Q10: Which year has the highest total profit?


In [None]:
# the year with the highest total profit
profit_per_year = df.groupby('release_year')['profit'].sum()
profit_per_year.sort_values(ascending = False)[:1]

## Q11: How has film industry profit changed over the years?


In [None]:
# film industry profit over the years
profit_over_years = df.groupby('release_year')['profit'].sum()
profit_over_years.plot(kind='line', title='Film industry profit over the years')


## Q12: Who are the most active directors?


In [None]:
#the most active directors
df['director'].value_counts()[:10]

## Q13: Who directed the movies with the highest popularity and vote counts?

In [None]:
# the directors of the most popular movies
df.groupby(['director','original_title'])['popularity'].sum().sort_values(ascending = False)[:10]

In [None]:
# the directors of the movies with the most votes
df.groupby(['director','original_title'])['vote_count'].sum().sort_values(ascending = False)[:10]

## Q14: Which movies are the longest and the shortest?

In [None]:
# the longest movie
df[df['runtime'] == df['runtime'].max()].loc[:,['original_title','runtime']]

In [None]:
#the shortest movies
df[df['runtime'] == df['runtime'].min()].loc[:,['original_title','runtime']]

# the genres

In [None]:
genres_df = df['genres'].str.split("|", expand=True)
genres_df.head(3)

In [None]:
# Create a separate dataframe with one row per (movie, genre) pair.
genres_df = genres_df.stack()

genres_df = pd.DataFrame(genres_df)
genres_df.head()

In [None]:
#Renaming the genres column and verifying the genres value count
genres_df.rename(columns={0:'genres_adj'}, inplace=True)
genres_df.genres_adj.value_counts()

In [None]:
genres_df.genres_adj.nunique()

In [None]:
df_merged = df.merge(genres_df,left_index=True, right_index=True)
df_merged.head()

In [None]:
df_merged.drop('genres', axis=1, inplace=True)
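
The split, stack, and merge steps above produce one row per movie and genre. As an alternative sketch (not the approach used in this notebook), `DataFrame.explode` can reach the same long format in a single chain. The toy frame and the name `long_df` are illustrative assumptions.

```python
import pandas as pd

# Toy frame (made up) indexed by id, like the real dataset.
toy = pd.DataFrame(
    {
        "original_title": ["A", "B"],
        "genres": ["Action|Drama", "Comedy"],
        "budget_adj": [100.0, 40.0],
    },
    index=pd.Index([11, 12], name="id"),
)

# Split the pipe-separated string into a list, then emit one row per list item.
long_df = (
    toy.assign(genres_adj=toy["genres"].str.split("|"))
       .explode("genres_adj")
       .drop(columns="genres")
)

print(long_df)
```

In both approaches a movie's budget and revenue are repeated once per genre, so per-genre totals count multi-genre movies several times.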

## Q15: Which genres have the highest budgets and generate the highest revenues?

In [None]:
# genres with the highest total budget
df_merged.groupby('genres_adj').budget_adj.sum().sort_values().plot.barh(color='blue', figsize=(10,10), fontsize= 10)
plt.xlabel('budget in Dollars', fontsize= 15)
plt.ylabel('Adjusted Genre', fontsize= 15)
plt.title('Total budget per genre', fontsize=20);


In [None]:
# genres that generate the highest total revenue
df_merged.groupby('genres_adj').revenue_adj.sum().sort_values().plot.barh(color='blue', figsize=(10,10), fontsize= 10)
plt.xlabel('Revenue in Dollars', fontsize= 15)
plt.ylabel('Adjusted Genre', fontsize= 15)
plt.title('Total revenues per genre', fontsize=20);


## Q16: Which production companies are the most active?

In [None]:
df['production_companies'].value_counts()[:10]

## Q17: Which production companies have the highest profit?

In [None]:
top_10_production_companies = df.groupby('production_companies')['profit'].sum().sort_values(ascending = False)[:10]
top_10_production_companies
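
Like `genres`, the `production_companies` column can hold several pipe-separated names in one cell, so the counts and sums above are per combination of companies rather than per company. The sketch below shows one way to get per-company figures, assuming the same `|` separator; the toy frame and company names are made up.

```python
import pandas as pd

# Toy frame (made up): one co-production, one solo production, one missing value.
toy = pd.DataFrame({
    "production_companies": ["Studio A|Studio B", "Studio A", None],
    "profit": [300.0, 50.0, 10.0],
})

# One row per (movie, company); rows without a company are left out.
companies = (
    toy.dropna(subset=["production_companies"])
       .assign(company=lambda d: d["production_companies"].str.split("|"))
       .explode("company")
)

# Number of movies each company took part in.
films_per_company = companies["company"].value_counts()

# Total profit per company; a co-production's full profit counts for every partner.
profit_per_company = (
    companies.groupby("company")["profit"].sum().sort_values(ascending=False)
)

print(films_per_company)
print(profit_per_company)
```

Because each co-producer is credited with the full profit, these totals are an upper bound on any single company's share.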

<a id='conclusions'></a>
## Conclusions

During the analysis we faced two limitations:

* a large number of null cells and zero values
* the cast and genres columns hold multiple values per cell

We handled these as described above and came to the following conclusions:

* top 10 most popular movies and their directors

        original_title                            director 
        Jurassic World                       :    Colin Trevorrow                
        Mad Max: Fury Road                   :    George Miller                  
        Interstellar                         :    Christopher Nolan              
        Guardians of the Galaxy              :    James Gunn                     
        Insurgent                            :    Robert Schwentke               
        Captain America: The Winter Soldier  :    Joe Russo|Anthony Russo        
        Star Wars                            :    George Lucas                   
        John Wick                            :    Chad Stahelski|David Leitch    
        Star Wars: The Force Awakens         :    J.J. Abrams                    
        The Hunger Games: Mockingjay - Part 1:    Francis Lawrence                
    
* top 10 highest budgets

        The Warrior's Way
        Pirates of the Caribbean: On Stranger Tides    
        Pirates of the Caribbean: At World's End       
        Avengers: Age of Ultron                        
        Superman Returns                               
        John Carter                                    
        Tangled                                        
        Spider-Man 3                                   
        The Lone Ranger                                
        Harry Potter and the Half-Blood Prince         

* top 10 highest revenues

        Avatar
        Star Wars: The Force Awakens                     
        Jurassic World                                   
        Furious 7                                        
        Avengers: Age of Ultron                          
        Harry Potter and the Deathly Hallows: Part 2     
        Iron Man 3                                       
        Minions                                          
        Transformers: Dark of the Moon                   
        The Lord of the Rings: The Return of the King   
    
* top 10 highest profits

        Avatar
        Star Wars: The Force Awakens                     
        Jurassic World                                   
        Furious 7                                        
        Harry Potter and the Deathly Hallows: Part 2     
        Avengers: Age of Ultron                          
        The Net                                          
        Minions                                          
        The Lord of the Rings: The Return of the King    
        Iron Man 3                                       
    
* the year with the highest total budget

      2013 has the highest total budget

* the year with the highest total revenue

      2015 has the highest total revenue

* the year with the highest total profit

      2015 has the highest total profit
    
* the relationship between popularity and vote_count

      there is a positive relationship between popularity and vote_count

* the correlation between budget and runtime

      there is no clear relation between a movie's budget and its runtime

* the relationship between revenue and budget

      there is a positive relationship between revenue and budget

* The most active directors

        Woody Allen          45
        Clint Eastwood       34
        Steven Spielberg     29
        Martin Scorsese      29
        Ridley Scott         23
        Steven Soderbergh    22
        Ron Howard           22
        Joel Schumacher      21
        Brian De Palma       20
        Barry Levinson       19

* the longest movie

      The Story of Film: An Odyssey is the longest movie

* the genre with the highest budget and revenue

      the Drama genre has the highest total budget and generates the highest total revenue


In [None]:
from subprocess import call
call(['python', '-m', 'nbconvert', 'Investigate_a_Dataset.ipynb'])