**The Movie Database's (TMDb) API is a great source of all things movie related.** <p>
QUESTIONS:<p>
    1. What is the mean, median, and mode BUDGET for films?
    2. What is the mean, median, and mode REVENUE for films?
    3. What is the mean, median, and mode RUNTIME for films?
    4. What independent variable is the most closely correlated to revenue?
    5. What films have outsized returns -- the highest revenue:budget ratio?

In [None]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
alt = 'https://raw.githubusercontent.com/premonish/Springboard/master/Data/tmdb_5000_movies.csv'
movies = pd.read_csv('/kaggle/input/tmdb-movie-metadata/tmdb_5000_movies.csv')
credits = pd.read_csv('/kaggle/input/tmdb-movie-metadata/tmdb_5000_credits.csv')
movies.columns

In [None]:
credits.columns

In [None]:
#credits.cast.value_counts()

In [None]:
print('Mean budget:', str(round(movies.budget.mean(),2)))
print('Median budget:', str(movies.budget.median()))
print('Budget Mode:', str(movies.budget.mode()[0]))

In [None]:
print('Mean revenue:', str(round(movies.revenue.mean(),2)))
print('Median revenue:', str(movies.revenue.median()))
print('Revenue Mode:', str(movies.revenue.mode()[0]))

In [None]:
print('Mean runtime:', str(round(movies.runtime.mean(),2)))
print('Median runtime:', str(movies.runtime.median()))
print('Runtime Mode:', str(movies.runtime.mode()[0]))
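
The three cells above repeat the same pattern, so a small helper could compute all three statistics for any column. The sketch below is illustrative only: `summarize` is a hypothetical helper name, and the `ignore_zeros` option rests on the assumption that a value of 0 in a column such as budget marks a missing figure rather than a real one. The toy Series stands in for a column like `movies.budget`.

```python
import pandas as pd

def summarize(series, ignore_zeros=False):
    """Return the mean, median, and mode of a numeric Series (hypothetical helper)."""
    values = series.dropna()
    if ignore_zeros:
        # Assumption: zeros are placeholders for missing data, not real values.
        values = values[values != 0]
    return {
        "mean": round(values.mean(), 2),
        "median": values.median(),
        "mode": values.mode().iloc[0],  # first mode if there are ties
    }

# Toy data standing in for a column such as movies.budget
demo = pd.Series([0, 0, 0, 10, 20, 20, 60])
print(summarize(demo))
print(summarize(demo, ignore_zeros=True))
```

Comparing the two calls shows how much placeholder zeros can pull the mean, median, and mode downward.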

In [None]:
new = movies.genres.value_counts()
new.head()

In [None]:
#movies.keywords.value_counts()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
co2 = np.corrcoef((movies.budget, movies.revenue))[0][1]

In [None]:
sns.set_style("darkgrid")
sns.regplot(x='budget', y='revenue', data=movies)
plt.title('Pearson\'s CorrCoef: '+str(co2))

In [None]:
ax = sns.kdeplot(movies.revenue, fill=True, color="b")  # 'shade' is deprecated in favor of 'fill'

In [None]:
ax = sns.kdeplot(movies.budget, fill=True, color="b")  # 'shade' is deprecated in favor of 'fill'

In [None]:
sns.regplot(x='runtime', y='revenue', data=movies)

In [None]:
co3 = np.corrcoef((movies.popularity, movies.revenue))[0][1]
sns.regplot(x='popularity', y='revenue', data=movies)
print(co3)

In [None]:
sns.regplot(x='vote_average', y='revenue', data=movies)

In [None]:
# Create a feature 'pct': the revenue:budget ratio (higher is better).
movies['pct'] = movies.revenue / movies.budget
# A zero budget yields inf (or NaN when revenue is also zero); treat both as 0.
movies['pct'] = movies['pct'].replace([np.inf, -np.inf], np.nan).fillna(0)
# Rank films by ratio, highest first, and keep only the columns needed for plotting.
h = movies.sort_values(by='pct', ascending=False)
h2 = h[['original_title', 'pct']]

In [None]:
h.original_title.head(100)

In [None]:
h2.head(20)

In [None]:
fig, ax = plt.subplots()

plt.rcParams['font.family'] = "serif"
ax.barh(h2.original_title[:2], h2.pct[:2], align='center')
ax.set_yticks(h2.original_title[:2])
ax.set_yticklabels(h2.original_title[:2])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Revenue/Budget')
ax.set_title('Films with Outsized Returns')

plt.show()

The two largest outliers, "Modern Times" and "Nurse 3-D", appear to be artifacts of inaccurate budget or revenue data rather than genuine returns, according to my research.
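
One way to keep such records from dominating the ranking is to compute the ratio only for films whose reported budget exceeds some floor. This is a sketch, not the method used in this notebook: `ratio_with_floor` is a hypothetical helper, the default `min_budget` of 1,000 is an arbitrary illustrative threshold, and the toy frame with made-up titles stands in for `movies`.

```python
import pandas as pd

def ratio_with_floor(df, min_budget=1_000):
    """Revenue:budget ratio, NaN where the reported budget is below min_budget."""
    # Budgets under the floor become NaN, so their ratio is NaN as well.
    budget = df["budget"].where(df["budget"] >= min_budget)
    return df["revenue"] / budget

# Toy rows: an implausibly small budget, a plausible one, and a zero budget
demo = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "budget": [1, 15_000_000, 0],
    "revenue": [8_500_000, 45_000_000, 1_000],
})
demo["pct"] = ratio_with_floor(demo)
print(demo.sort_values("pct", ascending=False))
```

Because `sort_values` places NaN last by default, the suspect rows fall to the bottom of the ranking instead of the top.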

In [None]:
fig, ax = plt.subplots()

plt.rcParams['font.family'] = "serif"
ax.barh(h2.original_title[2:5], h2.pct[2:5], align='center')
ax.set_yticks(h2.original_title[2:5])
ax.set_yticklabels(h2.original_title[2:5])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Revenue/Budget')
ax.set_title('Films with Outsized Returns')

plt.show()

# [Paranormal Activity](https://en.wikipedia.org/wiki/Paranormal_Activity#:~:text=It%20was%20given%20a%20limited,the%20U.S.%20rights%20for%20%24350%2C000.) was apparently made for 15,000 USD and grossed a whopping **193 million!**



In [None]:
fig, ax = plt.subplots()

plt.rcParams['font.family'] = "serif"
ax.barh(h2.original_title[5:25], h2.pct[5:25], align='center')
ax.set_yticks(h2.original_title[5:25])
#ax.set_xlabels(h2.pct[5:25])
ax.set_yticklabels(h2.original_title[5:25])
ax.invert_yaxis()  # labels read top-to-bottom
ax.set_xlabel('Revenue/Budget')
ax.set_title('Films with Outsized Returns')

plt.show()

In [None]:
ax = sns.kdeplot(movies.pct, fill=True, color="b")  # 'shade' is deprecated in favor of 'fill'

In [None]:
sns.pairplot(movies)

In [None]:
co4 = np.corrcoef((movies.vote_count, movies.revenue))[0][1]
co4

> There is a positive correlation between **movie budget** and **movie revenue** (Pearson's CorrCoef: 0.7308).
> 
> There is a stronger positive correlation between **vote count** and **movie revenue** (Pearson's CorrCoef: 0.7815).
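
Rather than checking one pair at a time, every numeric column can be ranked against revenue in a single step. The sketch below is a minimal illustration under stated assumptions: `rank_correlations` is a hypothetical helper and the toy frame stands in for `movies`. Note that `DataFrame.corr` skips missing values pairwise, whereas `np.corrcoef` returns NaN if either input contains one.

```python
import pandas as pd

def rank_correlations(df, target="revenue"):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    # Sort by absolute value so strong negative correlations rank high too.
    return corr.sort_values(key=abs, ascending=False)

# Toy data standing in for the movies DataFrame
demo = pd.DataFrame({
    "revenue": [1, 2, 3, 4, 5],
    "budget": [1, 2, 3, 4, 6],
    "runtime": [5, 3, 4, 1, 2],
    "title": list("abcde"),  # non-numeric columns are ignored
})
result = rank_correlations(demo)
print(result)
```

On the real data, calling `rank_correlations(movies)` would list the candidates for question 4 in one table.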

Is there a 'sweet spot' or a clear point of diminishing returns?
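
One possible way to probe this question is to split films into budget quantiles and compare the median revenue:budget ratio in each bin. This is only a sketch of one approach, not an analysis performed in this notebook: `returns_by_budget_bin` is a hypothetical helper, the choice of four bins is arbitrary, and the toy numbers are invented to show the mechanics.

```python
import pandas as pd

def returns_by_budget_bin(df, bins=4):
    """Median revenue:budget ratio within equal-sized budget quantile bins."""
    # Assumption: rows with a zero budget or revenue are unusable and are dropped.
    valid = df[(df["budget"] > 0) & (df["revenue"] > 0)].copy()
    valid["ratio"] = valid["revenue"] / valid["budget"]
    valid["budget_bin"] = pd.qcut(valid["budget"], q=bins, duplicates="drop")
    return valid.groupby("budget_bin", observed=True)["ratio"].median()

# Invented toy data in which the ratio falls as the budget grows
budgets = [b * 1_000_000 for b in range(1, 9)]
ratios = [10, 8, 5, 4, 3, 3, 2, 1]
demo = pd.DataFrame({
    "budget": budgets,
    "revenue": [b * r for b, r in zip(budgets, ratios)],
})
result = returns_by_budget_bin(demo)
print(result)
```

A median ratio that drops steadily from the lowest bin to the highest would suggest diminishing returns; a peak in a middle bin would hint at a sweet spot.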

In [None]:
movies2 = movies[['budget','popularity','revenue','runtime','vote_average','vote_count']]

In [None]:
ax = sns.pairplot(movies2)