*If this kernel helped you in any way, I would greatly appreciate your <font color='red'>UPVOTES</font>.*

# Introduction
In this notebook, I investigate the TMDB movies dataset, which covers movies released between 1960 and 2015 and includes the title, budget, revenue, cast, director, genres, release date, release year, runtime, and more.
The primary goal of the project is to perform exploratory data analysis using the numpy, pandas, seaborn, and matplotlib libraries. Before that, we need to clean the data. We start by posing the questions we want the dataset to answer, because those questions determine which columns matter and therefore guide the cleaning process.
## Questions to be Answered
* What are the all-time highest and lowest profit movies?
* What are the all-time top 10 movies by profit?
* What is the highest profit movie, and the total profit, for each year?
* What are the all-time highest and lowest budget movies?
* What are the all-time top 10 movies by budget?
* What is the highest budget movie, and the total budget, for each year?
* What are the all-time highest and lowest revenue movies?
* What are the all-time top 10 movies by revenue?
* What is the highest revenue movie, and the total revenue, for each year?
* Which genres were most common from 1960 to 2015?
* Which cast members appeared in the most movies?
* Which directors made the most movies?
* How many movies were released in each month? What is the total profit by month?
## Importing the Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

## Loading the Dataset

In [None]:
df = pd.read_csv('../input/tmdb_movies_data.csv')
df_copy = df.copy()
df.head()

In [None]:
df.info()

There are 10866 rows and 21 columns.

* The “id” and “imdb_id” columns serve the same purpose, so we can drop “imdb_id”, which adds no useful information for this analysis.
* The “popularity”, “budget”, and “revenue” columns are useful for this analysis. We will calculate the profit by subtracting the budget from the revenue, but first we need to handle the missing values in the budget and revenue columns.
* The “original_title”, “cast”, and “director” columns hold useful information about the movies.
* The “homepage”, “tagline”, “keywords”, “overview”, “vote_average”, “budget_adj”, and “revenue_adj” columns are not needed for this analysis, so they can be deleted from the data frame.
* The “release_date” and “release_year” columns are also important, and we need to convert the release_date column to a pandas DateTime object.

In [None]:
#Let's count the missing values in each column using the isnull() and sum() functions
df.isnull().sum()

## Data Cleaning
* Drop the duplicated rows.
* Replace the ‘0’ values in budget and revenue with ‘NaN’, then drop the rows that have missing values.
* Convert the release date to DateTime format.
* Delete the unused columns from the data frame.
* Check that all columns have the desired data type.
* Calculate the profit by subtracting the budget from the revenue.
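
The steps above can be sketched on a tiny, made-up data frame before applying them to the real file. The `toy` frame and its values are assumptions for illustration only; the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the TMDB file (values are invented for illustration).
toy = pd.DataFrame({
    "original_title": ["A", "A", "B", "C"],
    "budget": [100, 100, 0, 50],
    "revenue": [300, 300, 80, 20],
})

money = ["budget", "revenue"]

# 1. Remove exact duplicate rows.
clean = toy.drop_duplicates().copy()
# 2. Treat a budget or revenue of 0 as missing, then drop those rows.
clean[money] = clean[money].replace(0, np.nan)
clean = clean.dropna(subset=money)
# 3. Profit is revenue minus budget.
clean["profit"] = clean["revenue"] - clean["budget"]
# 4. The NaN step turned the money columns into floats; convert them back.
clean = clean.astype({"budget": "int64", "revenue": "int64", "profit": "int64"})

print(clean)
```

Only movies "A" (once) and "C" survive here: the duplicate "A" row is dropped, and "B" is removed because its budget is 0.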

In [None]:
#the 'duplicated()' function marks each duplicate row as True and every other row as False
#using the sum() function we can count the duplicate rows
sum(df.duplicated())

In [None]:
#Let's drop these rows using the 'drop_duplicates()' function
df.drop_duplicates(inplace=True)

In [None]:
# Let's check the data frame shape to confirm that just 1 row was dropped.
print('Shape of Data Frame after dropping duplicated rows:\n(Rows : Columns):', df.shape)

In [None]:
#Changing Format Of Release Date Into Datetime Format
df['release_date'] = pd.to_datetime(df['release_date'])
df['release_date'].head()
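
One thing worth checking after this conversion: if the file stores dates with two-digit years (for example `11/15/66`, which is an assumption about the raw format), the parser may place older movies in the future (2066 instead of 1966). The sketch below uses a hypothetical helper, `fix_century`, that shifts any date later than the last year covered by the dataset back by 100 years.

```python
import pandas as pd

def fix_century(date_strings, last_valid_year=2015):
    """Parse m/d/yy strings and move dates after last_valid_year back 100 years.

    Hypothetical helper; assumes two-digit years in month/day/year order.
    """
    dates = pd.to_datetime(date_strings, format="%m/%d/%y")
    # Dates beyond the dataset's last year can only come from a wrong century guess.
    in_future = dates.dt.year > last_valid_year
    shifted = dates - pd.DateOffset(years=100)
    return dates.where(~in_future, shifted)

sample = pd.Series(["11/15/66", "6/9/15"])
fixed = fix_century(sample)
print(fixed.dt.year.tolist())
```

Comparing `df['release_date'].dt.year` with the `release_year` column is a quick way to see whether the real data needs this correction.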

In [None]:
#Let's handle the budget and revenue
#replace the value 0 with NaN in the columns given in the list
df[['budget','revenue']] = df[['budget','revenue']].replace(0,np.nan)

#drop the rows where budget or revenue is missing
df.dropna(subset=['budget', 'revenue'], inplace=True)
print('After cleaning, we have {} rows'.format(df.shape[0]))

In [None]:
df.columns

In [None]:
#Let's delete the unused columns
del_col = ['imdb_id', 'homepage','tagline', 'keywords', 'overview','vote_average', 'budget_adj','revenue_adj']
df.drop(del_col, axis=1, inplace=True)
print('We have {} rows and {} columns' .format(df.shape[0], df.shape[1]))

In [None]:
#Before answering the questions, let's calculate the profit of each movie
df['profit'] = df['revenue']-df['budget']
#budget and revenue became floats when 0 was replaced with NaN; convert back to integers
df['profit'] = df['profit'].astype(np.int64)
df['budget'] = df['budget'].astype(np.int64)
df['revenue'] = df['revenue'].astype(np.int64)

In [None]:
df.head()

In [None]:
print(df.isnull().sum())

In [None]:
df.dtypes

## Exploratory Data Analysis
Before going into the exploratory data analysis, we will create a few helper functions that make it easier to answer the questions.

In [None]:
def find_min_max(col_name):
    #use idxmin() and idxmax() to find the rows holding the min and max value of the given column
    #idxmin returns the index of the lowest value in col_name
    min_index = df[col_name].idxmin()
    #idxmax returns the index of the highest value in col_name
    max_index = df[col_name].idxmax()
    #select the rows with the lowest and highest value of col_name
    low  = pd.DataFrame(df.loc[min_index,:])
    high = pd.DataFrame(df.loc[max_index,:])

    #print the results
    print('Movie which has highest '+col_name+' : ', df['original_title'][max_index])
    print('Movie which has lowest '+col_name+' : ', df['original_title'][min_index])
    #return both rows side by side (highest first) for comparison
    return pd.concat([high,low], axis=1)
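
To see what `idxmax()` and `idxmin()` return, here is a minimal sketch on invented data. The helper name `min_max_rows` and the `toy` frame are hypothetical; unlike `find_min_max`, this version takes the data frame as an argument instead of reading the global `df`.

```python
import pandas as pd

# Invented values, only to illustrate the lookup.
toy = pd.DataFrame(
    {"original_title": ["A", "B", "C"], "profit": [200, -30, 75]}
)

def min_max_rows(frame, col_name):
    """Return the rows holding the largest and smallest value of col_name."""
    high = frame.loc[frame[col_name].idxmax()]  # row label of the maximum
    low = frame.loc[frame[col_name].idxmin()]   # row label of the minimum
    # Put the two rows side by side, labelled for readability.
    return pd.concat([high, low], axis=1, keys=["highest", "lowest"])

result = min_max_rows(toy, "profit")
print(result)
```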

In [None]:
def top_10(col_name,size=10):
    #find the all-time top `size` movies (default 10) for a given column
    #sort the given column in descending order and keep the first `size` rows
    df_sorted = pd.DataFrame(df[col_name].sort_values(ascending=False))[:size]
    #look up the titles of the selected rows (aligned on the index)
    df_sorted['original_title'] = df['original_title']
    plt.figure(figsize=(12,6))
    #calculate the average over all movies
    avg = np.mean(df[col_name])
    sns.barplot(x=col_name, y='original_title', data=df_sorted, label=col_name)
    plt.axvline(avg, color='k', linestyle='--', label='mean')
    if col_name in ('profit', 'budget', 'revenue'):
        plt.xlabel(col_name.capitalize() + ' (U.S. Dollars)')
    else:
        plt.xlabel(col_name.capitalize())
    plt.ylabel('')
    #use `size` in the title so it stays correct when size is not 10
    plt.title('Top ' + str(size) + ' Movies in: ' + col_name.capitalize())
    plt.legend()

In [None]:
from matplotlib import gridspec
def each_year_best(col_name, size=15):
    #plot the best value of the given column for each of the last `size` (default 15) years
    #sort so that, within each year, the movie with the highest value comes first
    release = df[['release_year',col_name,'original_title']].sort_values(['release_year',col_name],
                                                                           ascending=False)
    #group by release year: max and sum of the column, plus the title of the top movie
    release = release.groupby('release_year').agg({col_name:['max','sum'],
                                                   'original_title':['first']}).tail(size)
    #select the max of the given column
    x_max = release.iloc[:,0]
    #select the sum of the given column
    x_sum = release.iloc[:,1]
    #select the movie titles
    y_title = release.iloc[:,2]
    #select the index (release years)
    r_date = release.index
    #plot the desired variable
    fig = plt.figure(figsize=(12, 6))
    gs = gridspec.GridSpec(1, 2, width_ratios=[2, 2])
    ax0 = plt.subplot(gs[0])
    ax0 = sns.barplot(x=x_max, y=y_title, palette='deep')
    for j in range(len(r_date)):
        #put the year information on the plot
        ax0.text(j,j*1.02,r_date[j], fontsize=12, color='black')
    plt.title('Last ' +str(size)+ ' years highest ' +col_name+ ' movies for each year')
    plt.xlabel(col_name.capitalize())
    plt.ylabel('')
    ax1 = plt.subplot(gs[1])
    ax1 = sns.barplot(x=r_date, y=x_sum, palette='deep')
    plt.xticks(rotation=90)
    plt.xlabel('Release Year')
    plt.ylabel('Total '+col_name.capitalize())
    plt.title('Last ' +str(size)+ ' years total '+ col_name)
    plt.tight_layout()
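
The per-year aggregation inside `each_year_best` relies on sorting first and then taking the first title of each group. An alternative that does not depend on row order is sketched below, assuming a hypothetical helper `yearly_best` and invented data: `groupby(...).idxmax()` picks the row of the best movie in each year directly.

```python
import pandas as pd

# Invented values, only to illustrate the grouping.
toy = pd.DataFrame({
    "release_year": [2014, 2014, 2015, 2015],
    "original_title": ["A", "B", "C", "D"],
    "profit": [10, 40, 25, 5],
})

def yearly_best(frame, col_name):
    """For each release year, return the top movie, its value, and the yearly total."""
    by_year = frame.groupby("release_year")[col_name]
    # idxmax gives, per year, the row label of the movie with the largest value.
    best = frame.loc[by_year.idxmax()].set_index("release_year")
    best = best[["original_title", col_name]].rename(columns={col_name: "max_" + col_name})
    # The yearly sum is aligned on the release_year index.
    best["total_" + col_name] = by_year.sum()
    return best

summary = yearly_best(toy, "profit")
print(summary)
```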

> Using these functions on the budget, revenue, and profit columns, let’s find the answers we are looking for.

### What are the all-time highest and lowest profit movies?


In [None]:
find_min_max('profit')

### What are the all-time top 10 movies by profit?

In [None]:
top_10('profit')

### What is the highest profit movie, and the total profit, for each year?

In [None]:
each_year_best('profit')

Let's answer the same questions for budget and revenue.
* What are the all-time highest and lowest budget movies?
* What are the all-time top 10 movies by budget?
* What is the highest budget movie, and the total budget, for each year?
* What are the all-time highest and lowest revenue movies?
* What are the all-time top 10 movies by revenue?
* What is the highest revenue movie, and the total revenue, for each year?

In [None]:
find_min_max('budget')

In [None]:
top_10('budget')

In [None]:
each_year_best('budget')

In [None]:
find_min_max('revenue')

In [None]:
top_10('revenue')

In [None]:
each_year_best('revenue')

In [None]:
#Let's also find the longest and shortest movies using the find_min_max() function
find_min_max('runtime')

*We are going to write another function to answer the following questions. It takes a column such as genres, cast, or director, and counts the values in that column to find the genres, cast members, or directors that appear in the most movies during this period.*
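
The counting idea can be sketched on its own, without the plotting. The snippet below assumes, as in the dataset, that multiple values in one cell are separated by `|`; the helper name `count_pipe_values` and the `toy` data are made up for illustration. It uses `str.split` followed by `explode`, which is one possible way to do the split.

```python
import pandas as pd

# Invented values; None stands for a movie with no genre information.
toy = pd.DataFrame({"genres": ["Action|Drama", "Drama", None, "Comedy|Drama"]})

def count_pipe_values(frame, col_name):
    """Count how often each '|'-separated value occurs in col_name."""
    values = (
        frame[col_name]
        .dropna()          # ignore rows without a value
        .str.split("|")    # "Action|Drama" -> ["Action", "Drama"]
        .explode()         # one row per individual value
    )
    return values.value_counts()

counts = count_pipe_values(toy, "genres")
print(counts)
```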

In [None]:
def split_count_data(col_name, size=15):
    #take any column whose cells hold several '|'-separated values and count each value
    #join all rows of the given column into one long string, separated by '|'
    data = df[col_name].str.cat(sep='|')
    #split the string and store each value separately in a series
    data = pd.Series(data.split('|'))
    #count the most frequent values for the given column
    count = data.value_counts(ascending=False)
    count_size = count.head(size)
    #Setting axis name for multiple names
    if (col_name == 'production_companies'):
        sp = col_name.split('_')
        axis_name = sp[0].capitalize()+' '+ sp[1].capitalize()
    else:
        axis_name = col_name.capitalize()
    fig = plt.figure(figsize=(14, 6))
    #set the subplot 
    gs = gridspec.GridSpec(1,2, width_ratios=[2,2])
    #count of given column on the bar plot
    ax0 = plt.subplot(gs[0])
    count_size.plot.barh()
    plt.xlabel('Number of Movies')
    plt.ylabel(axis_name)
    plt.title('Top '+str(size)+' Most Filmed ' +axis_name+' Versus Number of Movies')
    ax = plt.subplot(gs[1])
    #build the explode list so the pie chart offsets work for any given size
    explode = []
    total = 0
    for i in range(size):
         total = total + 0.015
         explode.append(total)
    #pie chart for given size and given column
    ax = count_size.plot.pie(autopct='%1.2f%%', shadow=True, startangle=0, pctdistance=0.9, explode=explode)
    plt.title('Top '+str(size)+' Most Filmed ' +axis_name+ ' in Pie Chart')
    plt.xlabel('')
    plt.ylabel('')
    plt.axis('equal')
    plt.legend(loc=9, bbox_to_anchor=(1.4, 1))

### Questions to be answered using the split_count_data() function
* Which genres were most common from 1960 to 2015?
* Which cast members appeared in the most movies?
* Which directors made the most movies?
* Which production companies made the most movies?

In [None]:
split_count_data("genres")

In [None]:
split_count_data("cast")

In [None]:
split_count_data("director")

In [None]:
split_count_data("production_companies")

### How many movies were released in each month? What is the total profit by month?

In [None]:
df_month = df.copy()
df_month['release_month'] = df_month['release_date'].dt.month_name()
#list the months in calendar order so the bars are not sorted alphabetically
month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']

fig = plt.figure(figsize=(12,6))
count_month = df_month.groupby('release_month')['profit'].count().reindex(month_order)
plt.subplot(1,2,1)
count_month.plot.bar()
plt.xlabel('Release Month')
plt.ylabel('Number of Movies')
plt.title('Number of Movies released in each month')

plt.subplot(1,2,2)
sum_month = df_month.groupby('release_month')['profit'].sum().reindex(month_order)

sum_month.plot.bar()
plt.xlabel('Release Month')
plt.ylabel('Monthly total Profit ')
plt.title('Total profit by month (1960-2015)')
plt.tight_layout()


> We can also apply the top_10 function to the popularity and vote_count columns to find the most popular movies and the movies with the most votes on the TMDB website.

In [None]:
top_10('popularity', size=30)

In [None]:
top_10('vote_count', size=30)

Let’s try to find out whether there is any correlation between these variables.

In [None]:
df_related = df[['profit','budget','revenue','runtime', 'vote_count','popularity','release_year']]
sns.pairplot(df_related, kind='reg')

Let’s check out a few plots below:

1. Budget vs. Revenue: Budget and revenue are positively correlated. This suggests that movies with higher investments tend to earn higher revenues.
2. Profit vs. Budget: Profit and budget are positively correlated. This suggests that movies with higher investments tend to earn higher profits.
3. Release Year vs. Vote Count: Release year and vote count are only weakly related, which suggests that the number of votes a movie receives depends little on its release year.
4. Popularity vs. Profit: Popularity and profit are positively correlated. This means that movies with high popularity tend to earn high profits.
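
The visual impressions above can be backed with numbers by computing Pearson correlation coefficients. The sketch below runs on invented values in a `toy` frame; on the real data the same `corr()` call could be applied to the numeric columns selected for the pair plot.

```python
import pandas as pd

# Invented values, only to illustrate the call.
toy = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [15, 35, 70, 90, 140],
    "release_year": [2011, 2012, 2013, 2014, 2015],
})
toy["profit"] = toy["revenue"] - toy["budget"]

# Pairwise Pearson correlation: +1 is a perfect positive linear relationship,
# -1 a perfect negative one, and values near 0 mean no linear relationship.
corr = toy.corr(method="pearson")
print(corr.round(2))
```

A correlation coefficient only measures linear association; it does not show that a higher budget causes a higher revenue or profit.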
# Conclusion
We analyzed the TMDB dataset, which covers movies released between 1960 and 2015. Our goal was to answer the questions posed at the beginning using this dataset. The results of the analysis can be summarized as follows.

1. The most profitable movie is Avatar, released in 2009. Star Wars: The Force Awakens is second, and Titanic is third.
2. The least profitable movie is The Warrior’s Way, which also has the highest budget.
3. The most common genres are Drama, Comedy, and Action.
4. The most filmed actors are Robert De Niro, Bruce Willis, and Samuel L. Jackson.
5. The most filmed directors are Steven Spielberg, Clint Eastwood, and Ridley Scott.
6. The most filmed production companies are Universal Pictures, Warner Bros., and Paramount Pictures.
7. The most profitable months are June, December, and May.
8. According to the TMDB dataset, the all-time most popular movies are Jurassic World, Mad Max: Fury Road, and Interstellar.
9. The all-time most voted movies are Inception, The Avengers, and Avatar.
10. Revenue and budget are positively correlated.
11. Movies with higher investments tend to earn higher profits.
    