# Phase 2 Review

In [18]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from statsmodels.formula.api import ols
import scipy.stats as stats
import math

pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing to do with a new dataset is to quickly verify its contents with the `.head()` method.

In [5]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
df.head()

(5043, 28)


Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
4,,Doug Walker,,,131.0,,Rob Walker,131.0,,Documentary,Doug Walker,Star Wars: Episode VII - The Force Awakens ...,8,143,,0.0,,http://www.imdb.com/title/tt5289954/?ref_=fn_t...,,,,,,,12.0,7.1,,0
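
Beyond `.head()`, a quick check usually also covers the shape, the column types, and the number of missing values per column. A minimal sketch on a small hypothetical frame (the `toy` DataFrame and its values are made up for illustration and are not taken from `movie_metadata.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame standing in for the real dataset
toy = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [1.0e6, np.nan, 2.5e6, 4.0e5],
    "content_rating": ["R", "PG", None, "R"],
})

print(toy.shape)    # (rows, columns)
print(toy.dtypes)   # confirms numeric columns were parsed as numbers

# Count of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Running the same three calls on `df` would show which columns need cleaning before any statistics are computed.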


## Question 1

A Hollywood executive wants to know how much an R-rated movie released after 2000 will earn. The data above is a sample of some of the movies with that rating during that timeframe, as well as other movies. How would you go about answering her question? Talk through it theoretically and then do it in code.

What is the 95% confidence interval for a post-2000 R-rated movie's box office gross?

In [None]:
# talk through your answer here
"""
1. Filter to R-rated movies released after 2000 using title_year and content_rating
2. Check for null values and drop them if they are not a significant portion of the data
3. Calculate the 95% confidence interval for the mean gross using the t-distribution
"""

In [25]:
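# do it in code here
# note: despite the name, df_2010 holds R-rated movies released after 2000
df_2010 = df[(df["title_year"]>2000) & (df["content_rating"]=="R")]
# check null
#df_2010.isnull().sum() # 96 null in gross
#df_2010.shape #477 entries. dropping null due to time constraints
# dropna() removes rows with a null in any column, not only in gross
df_2010 = df_2010.dropna()
n = df_2010["gross"].shape[0]
x_bar = df_2010["gross"].mean()
# note: std holds the standard error of the mean (sample std / sqrt(n))
std = df_2010["gross"].std()/math.sqrt(n)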


In [15]:
df_2010.describe()

Unnamed: 0,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_1_facebook_likes,gross,num_voted_users,cast_total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
count,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0,1084.0
mean,209.396679,109.51476,710.719557,657.820111,8183.774908,30138630.0,104656.412362,11841.206642,1.383764,342.702952,44591490.0,2007.73893,2175.291513,6.592251,2.149659,11898.013838
std,125.614763,20.192764,2889.3841,1610.004433,12125.443711,40313210.0,127527.477481,17282.737885,2.02212,358.955619,400213900.0,4.346395,5856.634321,0.929895,0.251547,24180.252503
min,8.0,53.0,0.0,0.0,7.0,162.0,91.0,11.0,0.0,4.0,4500.0,2001.0,0.0,2.3,1.33,0.0
25%,118.75,96.0,10.0,161.75,717.0,2703738.0,23757.25,1773.75,0.0,120.0,7000000.0,2004.0,328.75,6.1,1.85,0.0
50%,189.0,106.0,68.0,402.5,1000.0,16134920.0,60351.0,3981.0,1.0,224.5,18000000.0,2008.0,651.0,6.6,2.35,373.5
75%,275.0,118.0,246.5,655.5,14000.0,40206990.0,134564.5,17559.25,2.0,423.5,35000000.0,2011.0,971.0,7.2,2.35,15000.0
max,775.0,300.0,23000.0,17000.0,164000.0,363024300.0,955174.0,303717.0,31.0,2814.0,12215500000.0,2016.0,137000.0,8.7,2.76,199000.0


In [27]:
# normal approximation justified by the Central Limit Theorem; 1.96 is the z critical value for 95%
x_bar - 1.96*std, x_bar + 1.96*std

(27738751.08271632, 32538503.22171172)

In [30]:
# confidence level passed positionally: newer SciPy releases renamed the `alpha` keyword to `confidence`
stats.norm.interval(0.95, loc = x_bar, scale=std)

(27738795.181002267, 32538459.123425774)

In [26]:
# 95% confidence interval
# topic 13
# confidence level passed positionally: newer SciPy releases renamed the `alpha` keyword to `confidence`
stats.t.interval(0.95, df = n-1, loc = x_bar, scale=std)

(27736110.171151448, 32541144.133276593)
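
The three intervals above apply different critical values to the same standard error. A minimal, self-contained sketch of the t-based interval follows. The helper name `mean_confidence_interval` and the synthetic lognormal sample are assumptions made for illustration; they are not the notebook's data.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (lower, upper) for the population mean using the t-distribution."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]        # ignore missing observations
    n = values.size
    x_bar = values.mean()
    se = values.std(ddof=1) / np.sqrt(n)      # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return x_bar - t_crit * se, x_bar + t_crit * se


# Synthetic, right-skewed "gross" values (hypothetical data for illustration only)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=16, sigma=1.5, size=500)
lower, upper = mean_confidence_interval(sample)
print(lower, upper)
```

Note that this interval describes the mean gross of such movies. A range for a single movie's gross would be a prediction interval, which is considerably wider.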

## Question 2a

Your ability to answer the first question has the executive excited, and now she has many other questions about the types of movies being made and the differences in those movies' budgets and gross amounts.

Read through the questions below and **determine what type of statistical test you should use** for each question and **write down the null and alternative hypothesis for those tests**.

- Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?
- Do foreign films perform differently at the box office than non-foreign films?
- Of all movies created are 40% rated R?
- Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?
- Is there a relationship between the content rating of a film and its budget? 

In [None]:
# your answers here
"""
Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?

Simple linear regression; check the p-value of the F-statistic
H0: There is no linear relationship between cast Facebook likes and gross (slope = 0)
H1: There is a linear relationship between cast Facebook likes and gross (slope != 0)

Do foreign films perform differently at the box office than non-foreign films?
Two sample T test
H0: mu0 = mu1
H1: mu0 != mu1


Of all movies created are 40% rated R?
One-sample z-test for a proportion, where p is the share of all movies rated R
H0: p = 0.40
H1: p != 0.40



Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?
Chi-squared test of independence
H0: language and content rating are independent
H1: language and content rating are not independent


Is there a relationship between the content rating of a film and its budget?
One-way ANOVA
H0: the mean budget is the same for every content rating
H1: at least one content rating has a different mean budget


"""

## Question 2b

Calculate the answer for the second question:

- Do foreign films perform differently at the box office than non-foreign films?

In [35]:
# your answer here
# filter and drop nulls in one step so each result is a new DataFrame rather than a modified slice
df_USA = df[df["country"]=="USA"].dropna()
df_noUS = df[df["country"]!="USA"].dropna()


In [37]:
df_USA.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
0,Color,James Cameron,723.0,178.0,0.0,855.0,Joel David Moore,1000.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,CCH Pounder,Avatar,886204,4834,Wes Studi,0.0,avatar|future|marine|native|paraplegic,http://www.imdb.com/title/tt0499549/?ref_=fn_t...,3054.0,English,USA,PG-13,237000000.0,2009.0,936.0,7.9,1.78,33000
1,Color,Gore Verbinski,302.0,169.0,563.0,1000.0,Orlando Bloom,40000.0,309404152.0,Action|Adventure|Fantasy,Johnny Depp,Pirates of the Caribbean: At World's End,471220,48350,Jack Davenport,0.0,goddess|marriage ceremony|marriage proposal|pi...,http://www.imdb.com/title/tt0449088/?ref_=fn_t...,1238.0,English,USA,PG-13,300000000.0,2007.0,5000.0,7.1,2.35,0
3,Color,Christopher Nolan,813.0,164.0,22000.0,23000.0,Christian Bale,27000.0,448130642.0,Action|Thriller,Tom Hardy,The Dark Knight Rises,1144337,106759,Joseph Gordon-Levitt,0.0,deception|imprisonment|lawlessness|police offi...,http://www.imdb.com/title/tt1345836/?ref_=fn_t...,2701.0,English,USA,PG-13,250000000.0,2012.0,23000.0,8.5,2.35,164000
5,Color,Andrew Stanton,462.0,132.0,475.0,530.0,Samantha Morton,640.0,73058679.0,Action|Adventure|Sci-Fi,Daryl Sabara,John Carter,212204,1873,Polly Walker,1.0,alien|american civil war|male nipple|mars|prin...,http://www.imdb.com/title/tt0401729/?ref_=fn_t...,738.0,English,USA,PG-13,263700000.0,2012.0,632.0,6.6,2.35,24000
6,Color,Sam Raimi,392.0,156.0,0.0,4000.0,James Franco,24000.0,336530303.0,Action|Adventure|Romance,J.K. Simmons,Spider-Man 3,383056,46055,Kirsten Dunst,0.0,sandman|spider man|symbiote|venom|villain,http://www.imdb.com/title/tt0413300/?ref_=fn_t...,1902.0,English,USA,PG-13,258000000.0,2007.0,11000.0,6.2,2.35,0


In [38]:
df_noUS.head()

Unnamed: 0,color,director_name,num_critic_for_reviews,duration,director_facebook_likes,actor_3_facebook_likes,actor_2_name,actor_1_facebook_likes,gross,genres,actor_1_name,movie_title,num_voted_users,cast_total_facebook_likes,actor_3_name,facenumber_in_poster,plot_keywords,movie_imdb_link,num_user_for_reviews,language,country,content_rating,budget,title_year,actor_2_facebook_likes,imdb_score,aspect_ratio,movie_facebook_likes
2,Color,Sam Mendes,602.0,148.0,0.0,161.0,Rory Kinnear,11000.0,200074175.0,Action|Adventure|Thriller,Christoph Waltz,Spectre,275868,11700,Stephanie Sigman,1.0,bomb|espionage|sequel|spy|terrorist,http://www.imdb.com/title/tt2379713/?ref_=fn_t...,994.0,English,UK,PG-13,245000000.0,2015.0,393.0,6.8,2.35,85000
9,Color,David Yates,375.0,153.0,282.0,10000.0,Daniel Radcliffe,25000.0,301956980.0,Adventure|Family|Fantasy|Mystery,Alan Rickman,Harry Potter and the Half-Blood Prince,321795,58753,Rupert Grint,3.0,blood|book|love|potion|professor,http://www.imdb.com/title/tt0417741/?ref_=fn_t...,973.0,English,UK,PG,250000000.0,2009.0,11000.0,7.5,2.35,10000
12,Color,Marc Forster,403.0,106.0,395.0,393.0,Mathieu Amalric,451.0,168368427.0,Action|Adventure,Giancarlo Giannini,Quantum of Solace,330784,2023,Rory Kinnear,1.0,action hero|attempted rape|bond girl|official ...,http://www.imdb.com/title/tt0830515/?ref_=fn_t...,1243.0,English,UK,PG-13,200000000.0,2008.0,412.0,6.7,2.35,0
20,Color,Peter Jackson,422.0,164.0,0.0,773.0,Adam Brown,5000.0,255108370.0,Adventure|Fantasy,Aidan Turner,The Hobbit: The Battle of the Five Armies,354228,9152,James Nesbitt,0.0,army|elf|hobbit|middle earth|orc,http://www.imdb.com/title/tt2310332/?ref_=fn_t...,802.0,English,New Zealand,PG-13,250000000.0,2014.0,972.0,7.5,2.35,65000
25,Color,Peter Jackson,446.0,201.0,0.0,84.0,Thomas Kretschmann,6000.0,218051260.0,Action|Adventure|Drama|Romance,Naomi Watts,King Kong,316018,7123,Evan Parke,0.0,animal name in title|ape abducts a woman|goril...,http://www.imdb.com/title/tt0360717/?ref_=fn_t...,2618.0,English,New Zealand,PG-13,207000000.0,2005.0,919.0,7.2,2.35,0


In [40]:
stats.ttest_ind(df_USA.gross,df_noUS.gross)

# the p-value is far below 0.05, so reject H0: US and non-US films perform differently at the box office

Ttest_indResult(statistic=10.486552243257917, pvalue=2.226158263095092e-25)
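
By default `stats.ttest_ind` pools the two variances. When the groups differ in size and spread, as US and non-US films plausibly do, Welch's variant (`equal_var=False`) is a common alternative. A minimal sketch on synthetic data (the group means, spreads, and sizes are assumptions chosen for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Two synthetic groups with unequal sizes and unequal variances
group_a = rng.normal(loc=50.0, scale=30.0, size=400)
group_b = rng.normal(loc=35.0, scale=10.0, size=120)

pooled = stats.ttest_ind(group_a, group_b)                  # assumes equal variances
welch = stats.ttest_ind(group_a, group_b, equal_var=False)  # Welch's t-test

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

The two statistics differ because the standard error is computed differently. With unequal variances, the Welch result is generally the safer one to report.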

## Question 3

Now that you have answered all of those questions, the executive wants you to create a model that predicts the money a movie will make if it is released next year in the US. She wants to use this to evaluate different scripts and then decide which one has the largest revenue potential. 

Below is a list of potential features you could use in the model. Create a new frame containing only those variables.

Would you use all of these features in the model?

Identify which features you might drop and why.

*Remember you want to be able to use this model to predict the box office gross of a film **before** anyone has seen it.*

- **budget**: The amount of money spent to make the movie
- **title_year**: The year the movie first came out in the box office
- **years_old**: How long has it been since the movie was released
- **genre**: Each movie is assigned one genre category like action, horror, comedy
- **avg_user_rating**: This rating is taken from Rotten tomatoes, and is the average rating given to the movie by the audience
- **actor_1_facebook_likes**: The number of likes that the most popular actor in the movie has
- **total_cast_facebook_likes**: The sum of likes for the three most popular actors in the movie
- **language**: the original spoken language of the film


In [45]:
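# your answer here
# avg_user_rating is unknown before release and years_old duplicates title_year, so both would be dropped
# numeric_only=True keeps corr() from failing on the text columns (genres, language)
df_USA[["budget","title_year","genres","actor_1_facebook_likes","language"]].corr(numeric_only=True)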

Unnamed: 0,budget,title_year,actor_1_facebook_likes
budget,1.0,0.250533,0.152662
title_year,0.250533,1.0,0.093797
actor_1_facebook_likes,0.152662,0.093797,1.0


## Question 4a

Create the following variables:

- `years_old`: The number of years since the film was released.
- Dummy categories for each of the following ratings:
    - `G`
    - `PG`
    - `R`
    
Once you have those variables, create a summary output for the following OLS model:

`gross~cast_total_facebook_likes+budget+years_old+G+PG+R`

In [None]:
# your answer here
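# How to fit OLS and show the summary:
# write the formula as "target~predictor+predictor" using DataFrame column names,
# pass the DataFrame as `data`, then call the fit method
df_model = df.copy()
# assumption: "years since release" is measured from the newest title_year in the data
df_model["years_old"] = df_model["title_year"].max() - df_model["title_year"]
# one 0/1 dummy column per requested rating
for rating in ["G", "PG", "R"]:
    df_model[rating] = (df_model["content_rating"] == rating).astype(int)

model = ols(formula="gross~cast_total_facebook_likes+budget+years_old+G+PG+R", data = df_model).fit()
model.summary()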

## Question 4b

Below is the summary output you should have gotten above. Identify any key takeaways from it.
- How ‘good’ is this model?
- Which features help to explain the variance in the target variable? 
    - Which do not? 


<img src="ols_summary.png" style="width:300px;">

In [None]:
# your answer here


## Question 5

**Bayes Theorem**

An advertising executive is studying television viewing habits of married men and women during prime time hours. Based on the past viewing records he has determined that during prime time wives are watching television 60% of the time. It has also been determined that when the wife is watching television, 40% of the time the husband is also watching. When the wife is not watching the television, 30% of the time the husband is watching the television. Find the probability that if the husband is watching the television, the wife is also watching the television.

In [47]:
# your answer here
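p_wife = 0.6                     # P(wife watching)
p_husband_given_wife = 0.4       # P(husband watching | wife watching)
p_husband_given_no_wife = 0.3    # P(husband watching | wife not watching)

# Bayes' theorem: P(wife | husband) = P(husband | wife) * P(wife) / P(husband)
p_husband = (p_wife*p_husband_given_wife)+((1-p_wife)*p_husband_given_no_wife)
p_wife_given_husband = (p_wife*p_husband_given_wife)/p_husband
p_wife_given_husband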

0.6666666666666666
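
Written out, the calculation is Bayes' theorem with the law of total probability in the denominator. The symbols $W$ (the wife is watching) and $H$ (the husband is watching) are introduced here for illustration:

$$
P(W \mid H)
= \frac{P(H \mid W)\,P(W)}{P(H \mid W)\,P(W) + P(H \mid W^{c})\,P(W^{c})}
= \frac{0.4 \times 0.6}{0.4 \times 0.6 + 0.3 \times 0.4}
= \frac{0.24}{0.36}
= \frac{2}{3}
$$

The denominator, 0.36, is the overall probability that the husband is watching.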

## Question 6

Explain what a Type I error is and how it relates to the significance level when doing a statistical test. 

In [None]:
# your answer here
"""
A Type I error occurs when the test rejects a null hypothesis that is actually true (a false positive). The
significance level alpha, set before the test is run, is the probability of making a Type I error when the null
hypothesis is true: with alpha = 0.05, a true null is rejected about 5% of the time. The null is rejected when the
p-value falls below alpha, so lowering alpha reduces Type I errors at the cost of more Type II errors.
"""