In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Exploring Walt Disney Movies and Box Office Success
Explore Disney movies data, and build a linear regression model to predict box office success.

## Description
Since my childhood in the '90s and early 2000s, Walt Disney movies have been one of my main sources of entertainment.
My personal favorite Disney genres are comedies and adventures. While some movies are indeed directed towards kids, many are intended for a broad audience. 
In this notebook, we will analyze data to see how Disney movies have changed in popularity over time.
We will visualize the success of movie genres and also perform hypothesis testing to see what aspects of a movie contribute to its success.

## Tasks
    1. Importing our working tools and libraries
    2. Exploratory Data Analysis
    3. Data inspecting and cleaning
    4. Data visualization
    5. Data transformation
    6. Statistical analysis
    7. Conclusion

# 1. Importing our working tools and libraries

In [None]:
# Let's import our favourite standard libraries for numerical data manipulation
import pandas as pd
import numpy as np

# Our visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Module for statistical analysis
from sklearn.linear_model import LinearRegression

%matplotlib inline
plt.rcParams['figure.figsize'] = [13,6]
plt.style.use('ggplot')

# 2. Exploratory data analysis
Let's start by reading our dataset and getting to know it.

In [None]:
disney_df = pd.read_csv('/kaggle/input/disney-movies/disney_movies.csv')
disney_df.head(10)

In [None]:
# Display column types and null counts to get a first impression of the data
print(disney_df.info())
print()
print(disney_df.isna().sum())

As we can see from the output of `info()`, our data contains some null values.
I personally like to work with suitable data types, so I always convert columns accordingly.
Let's first take a peek at those null values. There are not many, so maybe we could impute some of them.

In [None]:
disney_df[disney_df.genre.isna()]

Judging by the movie_title, we could impute some genre names by hand, but since there are so few nulls we will leave them as they are.
Next, let's fix our data types.
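To illustrate why converting a repetitive text column to the `category` dtype can help, here is a minimal sketch on made-up data. The `toy` frame and its values are assumptions for illustration only and are not taken from the Disney dataset.

```python
import pandas as pd

# Toy data (hypothetical): a text column with only a few distinct values
toy = pd.DataFrame({'genre': ['Comedy', 'Drama', 'Adventure', 'Comedy'] * 250})

# Memory used when every value is stored as a separate Python string object
mem_object = toy['genre'].astype('object').memory_usage(deep=True)

# Memory used when values are stored as small integer codes plus a lookup table
mem_category = toy['genre'].astype('category').memory_usage(deep=True)

print(f'object: {mem_object} bytes, category: {mem_category} bytes')
```

The saving grows with the number of rows, because each distinct label is stored only once in the category lookup table.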

# 3. Data inspecting and cleaning

In [None]:
# Changing our data types for better performance
disney_df['release_date'] = pd.to_datetime(disney_df.release_date)
disney_df[['genre', 'mpaa_rating']] = disney_df[['genre', 'mpaa_rating']].astype('category')
disney_df['release_year'] = disney_df.release_date.dt.year.astype('int64')

In [None]:
# Let's check the data information once again
disney_df.info()

# 4. Data visualization
Let's try to answer some basic questions about our data using visualizations.

We will try to answer some questions like:

    * How many movies from each genre are in our dataset?
    * What is the time period covered in our dataset?
    * Which movie genre made the most money?
    * Which are the top 20 most profitable movies?

In [None]:
# What is the most common movie genre made by Walt Disney Studios?
disney_df['genre'].value_counts()

In [None]:
# What is the most common movie genre made by Walt Disney Studios? (Let's visualize)
sns.countplot(x='genre', data=disney_df, order=disney_df.genre.value_counts().index, palette='magma')
plt.xticks(rotation=60);

Ok, no surprise there, at least for me. 
I always enjoyed Disney's comedies and adventures. 
I cannot recall any Disney horror movies, though. Let's check them out very quickly.

In [None]:
disney_df[disney_df.genre == 'Horror']

In [None]:
# Check the inflation adjusted gross by genre
disney_df.groupby('genre')['inflation_adjusted_gross'].sum().sort_values(ascending=False).plot(kind='bar')

As we can see, even though there are far fewer Musicals than, for example, Action or Drama movies, they are far more successful in terms of earnings. We can confirm this by calculating the mean inflation_adjusted_gross for every genre.
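A total per genre mixes two things: how many movies the genre has and how much each one earns. The sketch below uses invented numbers (not Disney figures) and a hypothetical `gross` column to show how the sum and the mean can rank genres differently.

```python
import pandas as pd

# Toy data (hypothetical numbers, not taken from the Disney dataset)
toy = pd.DataFrame({
    'genre': ['Drama', 'Drama', 'Drama', 'Drama', 'Musical'],
    'gross': [10, 20, 30, 40, 80],
})

# Count, total, and per-movie average for each genre, side by side
summary = toy.groupby('genre')['gross'].agg(['count', 'sum', 'mean'])
print(summary)
```

Here Drama has the larger total only because it has more movies, while the single Musical earns more per movie, which is why the mean is the fairer comparison.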

In [None]:
# Visualize the adjusted gross mean by genre
disney_df.groupby('genre')['inflation_adjusted_gross'].mean().sort_values(ascending=False).plot(kind='bar')

In [None]:
# Quick check of descriptive statistics of our dataset
disney_df.describe()

Next, we will plot the mean adjusted gross per genre and year to better see how box office revenues have changed over time.

In [None]:
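# Compute mean of adjusted gross per genre and per year
# (select the numeric column first: averaging text/date columns raises an error in recent pandas;
#  observed=True skips genre/year combinations that never occur)
gen_y = disney_df.groupby(['genre', 'release_year'], observed=True)['inflation_adjusted_gross'].mean().reset_index()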

sns.lineplot(data=gen_y, x='release_year', y='inflation_adjusted_gross', hue='genre')

Something else I wanted to know was which year was the busiest for Disney's studios. I assumed that it was the time of my childhood (during the 1990s), but let's see what the data tells us.

In [None]:
# In which year did Walt Disney Studios release the most movies?
disney_df.release_year.value_counts(ascending=False).plot(kind='bar');

In [None]:
# Let's see the top 20 Disney movies by inflation-adjusted gross
# How many did you watch?
top_movies = disney_df.sort_values('inflation_adjusted_gross', ascending=False)
top_movies.head(20)

# 5. Data transformation
<p>According to the line plot above, some genres are growing in popularity faster than others. For Disney movies in this dataset, the Action and Adventure genres are growing the fastest.
Next, we will build a linear regression model to see the relationship between genre and box office gross. </p>
<p>Since linear regression requires numerical variables and the genre variable is a categorical variable, we must first convert the categorical variables to numerical.</p>

<p>For this dataset, there will be 11 dummy variables, one for each genre except the action genre, which we will use as a baseline (it is the first category alphabetically, so <code>drop_first=True</code> drops it). For example, if a movie is an adventure movie, the adventure variable will be 1 and the other dummy variables will be 0.
    Since the action genre is our baseline, if a movie is an action movie, all dummy variables will be 0.</p>
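The encoding described above can be seen on a tiny example. This is a minimal sketch with a made-up three-genre series (`genres` is a hypothetical name), not the Disney data itself.

```python
import pandas as pd

# Toy categorical series (hypothetical); categories sort alphabetically,
# so 'Action' is the first category
genres = pd.Series(['Action', 'Adventure', 'Comedy', 'Action'], dtype='category')

# drop_first=True removes the first category, which becomes the baseline
dummies = pd.get_dummies(genres, drop_first=True)
print(dummies)
```

Only the Adventure and Comedy columns remain, and both Action rows contain nothing but zeros (shown as 0 or False depending on the pandas version).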

In [None]:
# Transform genre variables to dummy variables
# Note: rows with a missing genre get 0 in every column, so they are counted with the baseline
genre_dumm = pd.get_dummies(disney_df['genre'], drop_first=True)

# Inspect the genre_dumm data frame
genre_dumm.head(10)

<p>With our dummy variables, we can now build a linear regression model to predict the adjusted gross.</p>

# 6. Statistical analysis
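<p>From the regression model, we can then check the effect of each genre by looking at its coefficient, given in units of box office gross. With action as the baseline, the intercept is the estimated mean adjusted gross of the baseline movies, and each coefficient is the estimated difference between that genre and the baseline. Our focus here will be on the action and adventure genres.
We would expect movies belonging to these genres (action, adventure) to perform better at the box office.</p>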
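To make the meaning of the intercept and coefficients concrete, here is a small sketch on invented data. It assumes a hypothetical `toy` frame and solves the least-squares problem with `np.linalg.lstsq` instead of `LinearRegression`, purely to keep the example dependency-light; both solve the same ordinary least-squares problem.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical numbers, not taken from the Disney dataset)
toy = pd.DataFrame({
    'genre': pd.Categorical(['Action', 'Action', 'Adventure', 'Adventure', 'Comedy']),
    'gross': [10.0, 20.0, 40.0, 60.0, 5.0],
})

# Dummy variables with Action dropped as the baseline
dummies = pd.get_dummies(toy['genre'], drop_first=True)

# Design matrix: a column of ones for the intercept, then the dummy columns
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy(dtype=float)])

# Ordinary least squares fit
params, *_ = np.linalg.lstsq(X, toy['gross'].to_numpy(), rcond=None)
intercept, coef_adventure, coef_comedy = params

# Compare against plain group means
group_means = toy.groupby('genre', observed=True)['gross'].mean()
print('intercept:', intercept, 'Action mean:', group_means['Action'])
print('Adventure coef:', coef_adventure,
      'mean difference:', group_means['Adventure'] - group_means['Action'])
```

With only genre dummies as predictors, the fit reproduces the group means: the intercept matches the Action mean and each coefficient matches that genre's mean minus the Action mean.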

In [None]:
# Build a linear regression model
linreg = LinearRegression()

# Fit the model to the dataset
linreg.fit(genre_dumm, disney_df['inflation_adjusted_gross'])

# Get estimated intercept (action baseline) and the first coefficient (adventure)
action = linreg.intercept_
adventure = linreg.coef_[0]

# Inspect the estimated intercept and coefficient values
print(f'Estimated intercept value is {action}, while estimated coefficient value is {adventure}')

<p>It is now time to compute 95% confidence intervals for the intercept and coefficients.

A 95% confidence interval for the intercept <b><i>a</i></b> or a coefficient <b><i>b<sub>i</sub></i></b> is built so that, if we repeated the sampling many times, about 95% of such intervals would contain the true value of <b><i>a</i></b> or <b><i>b<sub>i</sub></i></b> respectively. If there is a significant relationship between a given genre and the adjusted gross, the confidence interval of its coefficient should exclude 0.</p>
<p>We will calculate the confidence intervals using the pairs bootstrap method. </p>
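As a rough sketch of the pairs bootstrap idea, the example below resamples (x, y) pairs together, refits a line each time, and takes percentiles of the refitted slopes. The function name `pairs_bootstrap_slope`, the synthetic data, and the use of `np.polyfit` on a single numeric predictor are all assumptions made for illustration; the notebook itself applies the same idea to genre dummies.

```python
import numpy as np

def pairs_bootstrap_slope(x, y, n_reps=1000, seed=0):
    """Return bootstrap replicates of the slope of a straight-line fit."""
    rng = np.random.default_rng(seed)
    n = len(x)
    slopes = np.empty(n_reps)
    for i in range(n_reps):
        # Draw row indices with replacement so each x stays paired with its y
        idx = rng.integers(0, n, size=n)
        slope, _intercept = np.polyfit(x[idx], y[idx], deg=1)
        slopes[i] = slope
    return slopes

# Synthetic data (hypothetical): y is roughly 2*x + 1 plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, size=x.size)

slopes = pairs_bootstrap_slope(x, y)

# Percentile-based 95% confidence interval for the slope
ci_slope = np.percentile(slopes, [2.5, 97.5])
print(f'95% confidence interval of the slope is {ci_slope}')
```

Resampling whole pairs, rather than x and y separately, keeps each observation's predictor and response together, so the relationship between them is preserved in every replicate.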

In [None]:
# Create an array of indices to sample from 
inds = np.arange(0, len(disney_df['genre']))

# Initialize arrays to hold 500 bootstrap replicates
size = 500
bs_action = np.empty(size)
bs_adventure = np.empty(size)

<p>After the initialization, we will compute pairs bootstrap estimates of the regression parameters. In each iteration, we draw a sample with replacement from the (genre, adjusted gross) pairs, where the genre is the original genre variable, and then one-hot encode the sampled genres. </p>

In [None]:
# Generate replicates  
for i in range(size):
    
    # Resample the indices 
    bs_inds = np.random.choice(inds, size=len(inds))
    
    # Get the sampled genre and sampled adjusted gross
    bs_genre = disney_df['genre'][bs_inds] 
    bs_gross = disney_df['inflation_adjusted_gross'][bs_inds]
    
    # Convert sampled genre to dummy variables
    bs_dumm = pd.get_dummies(bs_genre, drop_first=True)
   
    # Build and fit a regression model
    lreg = LinearRegression().fit(bs_dumm, bs_gross)
        
    # Compute replicates of estimated intercept and coefficient
    bs_action[i] = lreg.intercept_
    bs_adventure[i] = lreg.coef_[0]

<p>Finally, we compute 95% confidence intervals for the intercept and coefficient and examine if they exclude 0. If one of them (or both) does, then it is unlikely that the value is 0 and we can conclude that there is a significant relationship between that genre and the adjusted gross. </p>

In [None]:
# Compute 95% confidence intervals for intercept and coefficient values
ci_action = np.percentile(bs_action, [2.5, 97.5])
ci_adventure = np.percentile(bs_adventure, [2.5, 97.5])
    
# Inspect the confidence intervals
print(f'95% confidence interval of the action genre is {ci_action}')
print(f'95% confidence interval of the adventure genre is {ci_adventure}')

# 7. Conclusion
<p>The confidence intervals from the bootstrap method for the intercept and coefficient do not contain the value zero, as we have already seen that the lower and upper bounds of both confidence intervals are positive. This tells us that it is likely that the adjusted gross is significantly correlated with the action and adventure genres. </p>
<p>From the results of the bootstrap analysis and the trend plot we made earlier, we could say that Disney movies with plots that fit into the action and adventure genres, according to our data, tend to do better in terms of adjusted gross than other genres.