## Data
>Two datasets:

> 1. `tmdb-movies.csv`, loaded as `df_movies`
> 2. A subset of `df_movies` containing only the rows where both revenue and budget are greater than zero, stored as `df`

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## GOAL
### The following questions need to be answered
>#### How have movie trends varied over the years?
#### What are the top 20 highest grossing movies?
#### What are the top 20 most expensive movies?
#### Who are the top 20 directors who made highly rated films?
####  How do budgets correlate with revenues? Do high budgets mean high revenues?
####  Do certain months of release associate with better revenues?
#### Which months have seen the maximum releases?
#### How do ratings correlate with commercial success (profits)?
#### How have movie genre trends changed over the years?


> ### Loading CSV File and Cleaning dataset

In [None]:
df_movies = pd.read_csv('C:/Users/kumarijy/Documents/Learning/DAND/data_analysis_project/tmdb-movies.csv')

In [None]:
df_movies.head()
#print("Movies DF:\n\n{}\n".format(df_movies.head()))

> As the preview shows, several columns are not needed for this analysis, so the unnecessary columns are dropped.

In [None]:
# Drop the columns that are not used in this analysis in a single call
# ('keywords' is deliberately kept)
df_movies.drop(columns=['imdb_id', 'homepage', 'overview', 'tagline', 'vote_count'], inplace=True)

> Duplicate Check

In [None]:
# Count fully duplicated rows. The CSV is not re-read here,
# because reloading it would undo the column drops made earlier.
df_movies.duplicated().sum()
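
Counting duplicates only reports them. If the count is nonzero, the repeated rows still have to be removed. A minimal sketch on a made-up frame (the `sample` data is hypothetical and not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the second one
sample = pd.DataFrame({
    'id': [1, 2, 2],
    'original_title': ['A', 'B', 'B'],
})

# Number of rows that repeat an earlier row
n_duplicates = int(sample.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(len(deduped))
```

On the notebook's data the equivalent call would be `df_movies.drop_duplicates(inplace=True)`.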

> Add a `release_month` column holding the month in which each movie was released, derived from `release_date`.

In [None]:
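# Note: '%y' parses two-digit years, so years 00-68 are read as 2000-2068.
# Only the month is extracted here, and it is not affected by that ambiguity.
df_movies['release_date'] = pd.to_datetime(df_movies['release_date'], format='%m/%d/%y')
df_movies['release_month'] = df_movies['release_date'].dt.month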
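
Because the dates carry two-digit years, a 1960s release is parsed into the 2060s. The month is unaffected, but the full date is not. The sketch below shows the effect and one possible repair, assuming the dataset's separate `release_year` column is trustworthy. The `sample` rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows in the same 'm/d/yy' format, with the release year stored separately
sample = pd.DataFrame({
    'release_date': ['6/9/15', '11/15/66'],
    'release_year': [2015, 1966],
})

# '%y' maps 00-68 to 2000-2068, so '66' is read as 2066
parsed = pd.to_datetime(sample['release_date'], format='%m/%d/%y')

# Rebuild each date from the trusted year column plus the parsed month and day
fixed = pd.to_datetime(pd.DataFrame({
    'year': sample['release_year'],
    'month': parsed.dt.month,
    'day': parsed.dt.day,
}))

print(parsed.dt.year.tolist())
print(fixed.dt.year.tolist())
```

Rebuilding the date from its parts avoids guessing a century cutoff, at the cost of relying on `release_year` being correct.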

> Add a `profit` column (revenue minus budget), which will be used later in the analysis.

In [None]:
df_movies['profit'] = df_movies['revenue'] - df_movies['budget']

In [None]:
df_movies.info()

>###   <span style="color:purple"> Question: What are the top 20 highest-budget movies? </span>

In [None]:
df_top_budget = df_movies.nlargest(20,'budget')
df_top_budget.loc[:,['id','budget','revenue','original_title','director','vote_average','genres']]

>###   <span style="color:purple"> Question: What are the top 20 highest-grossing movies? </span>

In [None]:
df_top_revenue = df_movies.nlargest(20,'revenue')
df_top_revenue.loc[:,['id','budget','revenue','original_title','director','vote_average','genres']]

>### <span style="color:purple">Question: Which months have seen the maximum releases?</span>
####  Answer: Most movies were released in September

In [None]:
df_movies['release_month'].value_counts()

In [None]:
# One bar per calendar month (distplot is deprecated in recent seaborn releases)
sns.countplot(x='release_month', data=df_movies)

> September has seen the most releases followed by October

###  <span style="color:purple">Question: How do budgets correlate with revenues? Do high budgets mean high revenues?</span>
###  <span style="color:purple">Calculation of Pearson's correlation coefficient</span>
#### Answer: The correlation coefficient is 0.73, which indicates a positive correlation between revenue and budget

In [None]:
g = sns.regplot(x = 'budget', y='revenue' , data = df_movies)
# remove the top and right line in graph
sns.despine()
# Set the size of the graph from here
g.figure.set_size_inches(7,5)
# Set the Title of the graph from here
g.axes.set_title('Budget vs. Revenue', fontsize=20,color="b",alpha=0.8)
# Set the x & y label of the graph from here
g.set_xlabel("Budget",size = 20,color="c",alpha=0.8)
g.set_ylabel("Revenue",size = 20,color="c",alpha=0.8)

> The plot shows that some movies have high budgets but low revenues, while others have low budgets and high revenues. Most of the outliers are high-budget movies that earned low or moderate revenues.
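
The 0.73 figure quoted for this question is Pearson's correlation coefficient, but the plotting cell does not print it. A minimal sketch of how it can be computed, both by hand and with pandas, is given below. The `toy` values and the `pearson_r` helper are hypothetical illustrations, not the notebook's data or code.

```python
import numpy as np
import pandas as pd

def pearson_r(x, y):
    """Pearson's r: covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical toy values, not taken from the dataset
toy = pd.DataFrame({'budget': [1, 2, 3, 4, 5], 'revenue': [2, 4, 5, 4, 5]})

r_manual = pearson_r(toy['budget'], toy['revenue'])
r_pandas = toy['budget'].corr(toy['revenue'])  # Series.corr uses Pearson by default

print(round(r_manual, 4), round(r_pandas, 4))
```

On the notebook's data the one-line equivalent would be `df_movies['budget'].corr(df_movies['revenue'])`.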


#### Creating a subset DataFrame containing only records with valid (nonzero) budget and revenue

In [None]:
df = df_movies.query('budget > 0 & revenue > 0')

In [None]:
df.shape
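
The filter treats a zero budget or revenue as "not recorded", so it is worth checking how many rows it discards. A small sketch on hypothetical rows (the `toy` frame is made up for illustration):

```python
import pandas as pd

# Hypothetical rows: zeros stand in for unrecorded budget or revenue
toy = pd.DataFrame({
    'budget':  [100, 0, 50, 0],
    'revenue': [300, 20, 0, 0],
})

# True only where both values are present
valid = (toy['budget'] > 0) & (toy['revenue'] > 0)

# Copy so that later column assignments on the subset do not touch `toy`
subset = toy[valid].copy()

# Fraction of rows that survive the filter
kept_share = valid.mean()

print(len(subset), kept_share)
```

The boolean mask is equivalent to the `query('budget > 0 & revenue > 0')` call used in the notebook.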

In [None]:

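f, ax = plt.subplots(figsize=(7, 7))
# Show budget on a logarithmic y-axis
ax.set(yscale="log")
# Pass x and y as keyword arguments; newer seaborn releases do not accept them positionally
p = sns.regplot(x='vote_average', y='budget', data=df, ax=ax, scatter=True)
p.set(ylabel='log(budget)')
p.axes.set_title('Log(Budget) vs User Rating', fontsize=20,color="c",alpha=0.8)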

> The plot suggests a positive correlation between budget and ratings. For the most part, TMDb users appear to enjoy big-budget movies.

> Let's see how revenue correlates with some other film figures

In [None]:
#Array with the column names for what we want to compare the revenue to
revenue_comparisons = ['budget', 'runtime', 'vote_average', 'popularity','release_month']
for comparison in revenue_comparisons:
    # The 'size' parameter was renamed to 'height' in seaborn 0.9
    sns.jointplot(y='revenue', x=comparison, data=df_movies, color='b', height=5, space=0, kind='reg')
    #p.axes.set_title('y vs x', fontsize=20,color="c",alpha=0.8)

> 1. The correlation between budget and revenue is 0.73, and the correlation between revenue and popularity is 0.66. Both are positive correlations.
2. The correlation between revenue and release month is 0.039, which suggests that revenue is largely independent of the month in which a movie is released.
3. The correlations of revenue with vote_average and with runtime are 0.17 and 0.16 respectively, which indicates only a very weak relationship.
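
The coefficients listed above can also be collected in one table instead of being read off each plot. A minimal sketch with pandas' `corrwith`, using hypothetical toy values chosen so that the expected signs are obvious:

```python
import pandas as pd

# Hypothetical toy values, not taken from the dataset
toy = pd.DataFrame({
    'revenue':       [10, 20, 30, 40],
    'budget':        [1, 2, 3, 4],     # rises with revenue
    'release_month': [12, 9, 6, 3],    # falls as revenue rises
})

# Pearson correlation of every other column with revenue
corr_with_revenue = toy.drop(columns='revenue').corrwith(toy['revenue'])

print(corr_with_revenue.sort_values(ascending=False))
```

Applied to the notebook, the same call on `df_movies[revenue_comparisons]` against `df_movies['revenue']` would give one coefficient per compared column.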

###  <span style="color:purple">Question: How have movie  trends varied over the years?</span>

In [None]:
# histplot with a KDE overlay replaces the deprecated distplot
ax = sns.histplot(df_movies['release_year'], kde=True)
ax.set_title("Growth of movie production over the years", color = 'c')


>Movie production has increased over the years from 1960 to 2015. The decade of 2000 - 2010 shows a steep increase in production compared to previous decades. The year 2015 with 900 movies, is the year of maximum movie production, and 1961 with 31 movies has been the year of least production
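
The per-year figures quoted above can be read directly from a count of releases per year rather than estimated from the histogram. A minimal sketch, using a hypothetical `years` series in place of `df_movies['release_year']`:

```python
import pandas as pd

# Hypothetical release years, not taken from the dataset
years = pd.Series([1961, 2014, 2015, 2015, 2015, 2014], name='release_year')

# Movies per year, in chronological order
per_year = years.value_counts().sort_index()

busiest_year = per_year.idxmax()   # year with the most releases
quietest_year = per_year.idxmin()  # year with the fewest releases

print(per_year.to_dict(), busiest_year, quietest_year)
```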

## Genre Analysis


In [None]:
col = df_movies['genres']

In [None]:
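col2 = []
for s in col:
    try:
        # Genres are pipe-separated, e.g. 'Action|Adventure'
        x = s.split('|')
    except AttributeError:
        # Missing genres are NaN (a float), which has no split(); use a placeholder
        x = ['No']
    col2.append(x)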

In [None]:
# Flatten the list of lists into a single list of genre names
l1 = [genre for genres in col2 for genre in genres]

gener = set(l1)

In [None]:
gener = list(gener)
gener.remove('No')

In [None]:
gener
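
The same list of unique genres can be obtained without explicit loops. The sketch below uses pandas' `str.split` and `explode` on a hypothetical `genres` series. It is an alternative formulation, not the notebook's original method.

```python
import pandas as pd

# Hypothetical genre strings in the dataset's pipe-separated format; None marks a missing value
genres = pd.Series(['Action|Adventure', 'Drama', None, 'Action|Drama'], name='genres')

# Split each string into a list, then give every genre its own row
exploded = genres.dropna().str.split('|').explode()

unique_genres = sorted(exploded.unique())
genre_counts = exploded.value_counts()  # how many movies carry each genre

print(unique_genres)
```

Dropping missing values first removes the need for a placeholder entry that has to be deleted afterwards.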

In [None]:
# The df dataset is used here; it contains only the records where revenue and budget
# are both greater than zero
df.columns

In [None]:
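for g in gener:
    # Boolean mask of movies tagged with genre g (literal match; missing genres count as False)
    df1 = df['genres'].str.contains(g, regex=False, na=False)
    #print('The total number of movies with ',g,'=',len(df[df1]))
    f, ax = plt.subplots(figsize=(25, 5))
    sns.countplot(x = 'release_year', data=df[df1], palette="Greens_d")
    plt.title(g)
    compare_movies_rating = ['budget']
    for compare in compare_movies_rating:
        # The 'size' parameter was renamed to 'height' in seaborn 0.9
        sns.jointplot(y ='profit', x=compare, data=df[df1], alpha=0.7, color='b', height=5)
        # Title the whole joint figure rather than one of its marginal axes
        plt.suptitle(g)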

> 1. For Animation and Fantasy movies, popularity is positively correlated with profit: the more popular a movie in these genres, the higher its profit tends to be.
2. Science Fiction, Action, Adventure and Family movies have a correlation of approximately 0.5 with profit, which is only a moderate correlation, while for the Western and Foreign genres the correlation is negative.
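
Per-genre coefficients like the ones above can be computed directly rather than estimated from the plots. The sketch below uses a hypothetical helper, `genre_correlation`, and made-up `toy` rows. The column names mirror the notebook's `genres`, `budget` and `profit` columns.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    'genres': ['Animation', 'Animation|Family', 'Animation',
               'Western', 'Western', 'Western|Drama'],
    'budget': [1, 2, 3, 1, 2, 3],
    'profit': [2, 4, 6, 3, 2, 1],
})

def genre_correlation(frame, genre, x='budget', y='profit'):
    """Pearson correlation between columns x and y for movies tagged with `genre`."""
    mask = frame['genres'].str.contains(genre, regex=False, na=False)
    return frame.loc[mask, x].corr(frame.loc[mask, y])

for genre in ['Animation', 'Western']:
    print(genre, round(genre_correlation(toy, genre), 2))
```

Passing `x='popularity'` instead of the default would give the popularity versus profit coefficient for a genre, provided that column is present in the frame.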


In [None]:
# The df dataset is used so that only records with valid revenue and budget are considered
df_profit = df.nlargest(20,'profit')

# Each slice is a full genre combination (e.g. 'Action|Adventure'), not a single genre
df_profit.genres.value_counts()[:10].plot.pie(autopct='%1.1f%%',figsize=(10,10))
plt.title('GENRES OF THE TOP 20 MOST PROFITABLE MOVIES')

> The plot above shows that, among the 20 most profitable movies, the most common genre is Action, followed by Adventure.

# Conclusion

> 1. There is no meaningful relationship between revenue and the month in which a movie is released (correlation 0.039).
2. Budget and revenue are positively correlated (0.73).
3. Revenue and popularity are positively correlated (0.66).
4. High-budget movies tend to have a good vote_average, suggesting that high-budget movies are liked by viewers.
5. The number of movies released per year has increased sharply over time.
6. Most movies were released in September, followed by October.
7. For Animation and Fantasy movies, popularity is positively correlated with profit: the more popular a movie in these genres, the higher its profit tends to be.
8. Science Fiction, Action, Adventure and Family movies have a correlation of approximately 0.5 with profit, which is only a moderate correlation, while for the Western and Foreign genres the correlation is negative.
