# Phase 2 Review

In [504]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from statsmodels.formula.api import ols
import scipy.stats as scs
import math

pd.set_option('display.max_columns', 100)

### Check Your Data … Quickly
The first thing to do with a new dataset is to quickly verify its contents with the `.head()` method.

In [505]:
df = pd.read_csv('movie_metadata.csv')
print(df.shape)
#df.head()

(5043, 28)


In [506]:
df = df.dropna(subset=['gross','title_year'])

## Question 1

A Hollywood executive wants to know how much an R-rated movie released after 2000 will earn. The data above is a sample of some of the movies with that rating during that timeframe, as well as other movies. How would you go about answering her question? Talk through it theoretically and then do it in code.

What is the 95% confidence interval for a post-2000 R-rated movie's box office gross?

In [507]:
# talk through your answer here

# 1. Filter the data: keep only rows with title_year > 2000 and content_rating == 'R'.
# 2. Make sure the quantitative variable (gross) has no None / NaN values.
# 3. The population standard deviation is not available, so compute the interval with a t-score rather than a z-score.
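The three steps above can be sketched as follows. This is a minimal illustration, not the notebook's own result: the `sample` array is synthetic stand-in data (an assumption) in place of the filtered `gross` column, and the interval is for the population mean.

```python
import numpy as np
import scipy.stats as scs

# Synthetic stand-in for the filtered gross column (hypothetical values, not the movie data)
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=17, sigma=1, size=500)

n = len(sample)
mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(n)    # standard error of the mean

# Two-sided 95% interval using the t-distribution with n - 1 degrees of freedom
t_crit = scs.t.ppf(0.975, df=n - 1)
ci_manual = (mean - t_crit * sem, mean + t_crit * sem)

# The same interval computed by scipy directly
ci_scipy = scs.t.interval(0.95, df=n - 1, loc=mean, scale=sem)

print(ci_manual)
print(ci_scipy)
```

With the real data, `sample` would be `df1.gross` after the `dropna` step.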

In [564]:
df1 = df[(df.content_rating == 'R') & (df.title_year > 2000.0)]

sample = df1.gross

def confidence_interval(threshold):
    """
    Print a two-sided confidence interval for the population mean.

    params : threshold - confidence level as a percentage (e.g. 95)

    Note: n, m and s below are placeholder summary statistics,
    not values computed from `sample`.
    """
    n = 100    # number of observations in the sample (placeholder)
    m = 19     # sample mean (placeholder)
    s = 3      # sample standard deviation (placeholder)
    quantile = 1 - (1 - threshold / 100) / 2    # e.g. 0.975 for a 95% interval
    # t_score = scs.norm.ppf(quantile)          # z-score, for when the population std is known
    t_score = scs.t.ppf(quantile, df=n - 1)     # t-score with n - 1 degrees of freedom

    moe = t_score * (s / math.sqrt(n))    # margin of error
    conf_int = (m - moe, m + moe)         # confidence interval

    print("""
    
    We are {}% confident that the true population mean falls between {} and {}
    
    Our sample mean is {}.
    """.format(threshold, round(conf_int[0], 2), round(conf_int[1], 2), round(m, 2)))

In [565]:
confidence_interval(95)


    
    We are 95% confident that the true population mean falls between 18.4 and 19.6
    
    Our sample mean is 19.
    


## Question 2a

Your ability to answer the first question has the executive excited, and now she has many other questions about the types of movies being made and the differences in those movies' budgets and gross amounts.

Read through the questions below and **determine what type of statistical test you should use** for each question and **write down the null and alternative hypothesis for those tests**.

- Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?
- Do foreign films perform differently at the box office than non-foreign films?
- Are 40% of all movies created rated R?
- Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?
- Is there a relationship between the content rating of a film and its budget? 

In [510]:
# Is there a relationship between the number of Facebook likes for a cast and the box office gross of the movie?

# H0: B1 = 0   (B1 is the slope coefficient on cast Facebook likes)
# HA: B1 != 0
# TEST: Simple linear regression (t-test on the slope)
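A minimal sketch of this slope test, assuming synthetic `likes` and `gross` arrays (hypothetical values, not the movie data). `scipy.stats.linregress` reports the two-sided p-value for H0: B1 = 0.

```python
import numpy as np
import scipy.stats as scs

# Synthetic data with a built-in positive relationship (assumption for illustration)
rng = np.random.default_rng(1)
likes = rng.uniform(0, 50_000, size=200)
gross = 2_000 * likes + rng.normal(0, 2e7, size=200)

result = scs.linregress(likes, gross)

# result.slope estimates B1; result.pvalue tests H0: B1 = 0 (two-sided)
reject_h0 = result.pvalue < 0.05
print(result.slope, result.pvalue, reject_h0)
```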

In [511]:
# Do foreign films perform differently at the box office than non-foreign films?

# H0: mu1 = mu2   (mu1 = mean gross of foreign films, mu2 = mean gross of domestic films)
# HA: mu1 != mu2
# TEST: Two-sample, two-tailed t-test for a difference of means (mu1 - mu2 = 0)

In [512]:
# Are 40% of all movies created rated R?

# H0: P(R) = 0.4
# HA: P(R) != 0.4
# TEST: One-sample z-test for a population proportion (two-tailed)
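A minimal sketch of the proportion z-test. The helper name `proportion_z_test` and the counts passed to it are hypothetical, chosen only to illustrate the arithmetic.

```python
import math
import scipy.stats as scs

def proportion_z_test(successes, n, p0):
    """Two-tailed one-sample z-test for H0: p = p0."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)      # standard error under H0
    z = (p_hat - p0) / se
    p_value = 2 * scs.norm.sf(abs(z))      # two-sided p-value
    return z, p_value

# Hypothetical counts: 450 R-rated films out of 1000
z, p_value = proportion_z_test(successes=450, n=1000, p0=0.4)
print(z, p_value)
```

With the real data, `successes` would be the number of rows where `content_rating == 'R'` and `n` the number of rows with a known rating.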

In [513]:
# Is there a relationship between the language of a film and the content rating (G, PG, PG-13, R) of that film?

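# H0: Language and content rating are independent (the rating distribution is the same for every language)
# HA: Language and content rating are not independent (the rating distribution differs by language)
# TEST: Chi-square test of independence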
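A minimal sketch of the chi-square test on a contingency table, assuming a small hypothetical `films` frame with only two languages and two ratings. `scipy.stats.chi2_contingency` takes the table of observed counts.

```python
import pandas as pd
import scipy.stats as scs

# Hypothetical counts: English R 40, English PG-13 20, French R 15, French PG-13 25
films = pd.DataFrame({
    "language": ["English"] * 60 + ["French"] * 40,
    "content_rating": ["R"] * 40 + ["PG-13"] * 20 + ["R"] * 15 + ["PG-13"] * 25,
})

# Rows are languages, columns are ratings, cells are observed counts
table = pd.crosstab(films["language"], films["content_rating"])

chi2, p_value, dof, expected = scs.chi2_contingency(table)
print(table)
print(chi2, p_value, dof)
```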

In [514]:
# Is there a relationship between the content rating of a film and its budget?

# H0: m(R) = m(PG-13) = m(PG) = m(G)   (mean budget is the same for every rating)
# HA: at least one rating's mean budget differs from the others
# TEST: One-way ANOVA
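A minimal sketch of the one-way ANOVA, assuming synthetic budgets per rating (hypothetical values). `scipy.stats.f_oneway` takes one array per group.

```python
import numpy as np
import scipy.stats as scs

# Synthetic budgets per rating group (assumption for illustration)
rng = np.random.default_rng(2)
budget_by_rating = {
    "G": rng.normal(40e6, 10e6, size=50),
    "PG": rng.normal(45e6, 10e6, size=50),
    "PG-13": rng.normal(60e6, 10e6, size=50),
    "R": rng.normal(30e6, 10e6, size=50),
}

# H0: all group means are equal; a small p-value suggests at least one differs
f_stat, p_value = scs.f_oneway(*budget_by_rating.values())
print(f_stat, p_value)
```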

## Question 2b

Calculate the answer for the second question:

- Do foreign films perform differently at the box office than non-foreign films?

In [557]:
def diff_two_means(df, threshold):
    """
    Two-tailed test for a difference in mean gross between US and foreign films.

    params : df - DataFrame with `country` and `gross` columns
             threshold - confidence level as a percentage (e.g. 95)
    """
    df_usa = df[df.country == 'USA']['gross']
    df_foreign = df[df.country != 'USA']['gross']
    
    n1 = len(df_usa)
    n2 = len(df_foreign)
    
    x_1 = df_usa.mean()
    x_2 = df_foreign.mean()
    
    s1 = df_usa.std()
    s2 = df_foreign.std()
    
    alpha = (1 - threshold / 100) / 2    # area in each tail
    
    df1 = n1 + n2 - 2    # degrees of freedom; adequate here because both samples are large
    t_crit = scs.t.ppf(1 - alpha, df=df1)   # two-tailed critical value
 
    numer = x_1 - x_2
    denom = math.sqrt((s1**2 / n1) + (s2**2 / n2))   # unpooled standard error
    t_stat = numer / denom
    
    p = 2 * scs.t.sf(abs(t_stat), df=df1)   # two-sided p-value
    # p < 1 - threshold/100 - we reject H0
    # p > 1 - threshold/100 - we fail to reject H0
    
    if t_stat > t_crit or t_stat < -t_crit:
        
        print("""
        We reject the null hypothesis because the test statistic for the difference between the two sample means
        is = {}, which falls in the rejection region defined by the critical values {} and -{}.
        """.format(round(t_stat, 2), round(t_crit, 2), round(t_crit, 2)))
    
    else:
        print("""
        There is not enough evidence to reject the null hypothesis because the test statistic for the difference
        between the two sample means is = {}, which does not fall in the rejection region defined by the critical values {} and -{}.
        """.format(round(t_stat, 2), round(t_crit, 2), round(t_crit, 2)))

In [558]:
diff_two_means(df, 95)


        We reject the null hypothesis because the test statistic for the difference between the two sample means
        is = 14.85, which falls in the rejection region defined by the critical values 1.96 and -1.96.
        


In [552]:
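df_usa = df[df.country == 'USA']['gross']
df_foreign = df[df.country != 'USA']['gross']
# Cross-check with scipy. ttest_ind pools the variances by default (equal_var=True)
# and the foreign films are passed first, so the statistic is negative and its
# magnitude differs from the unpooled statistic computed above.
scs.ttest_ind(df_foreign, df_usa, nan_policy='omit')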

Ttest_indResult(statistic=-12.047474068413432, pvalue=7.02325288919722e-33)
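The manual statistic in `diff_two_means` divides by an unpooled standard error, which corresponds to Welch's t-test rather than the pooled test scipy runs by default. A minimal sketch of the difference, using synthetic groups (hypothetical values, not the movie data):

```python
import numpy as np
import scipy.stats as scs

# Synthetic groups with unequal sizes and unequal spreads (assumption for illustration)
rng = np.random.default_rng(3)
usa = rng.normal(50, 20, size=300)
foreign = rng.normal(40, 10, size=80)

# Manual statistic with the unpooled standard error
se = np.sqrt(usa.var(ddof=1) / len(usa) + foreign.var(ddof=1) / len(foreign))
t_manual = (usa.mean() - foreign.mean()) / se

welch = scs.ttest_ind(usa, foreign, equal_var=False)   # Welch's t-test
pooled = scs.ttest_ind(usa, foreign)                   # pooled variance (default)

print(t_manual, welch.statistic, pooled.statistic)
```

Passing `equal_var=False` to `ttest_ind` on the real data should therefore reproduce the magnitude of the hand-computed statistic, with the sign depending on argument order.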

## Question 3

Now that you have answered all of those questions, the executive wants you to create a model that predicts the money a movie will make if it is released next year in the US. She wants to use this to evaluate different scripts and then decide which one has the largest revenue potential. 

Below is a list of potential features you could use in the model. Create a new frame containing only those variables.

Would you use all of these features in the model?

Identify which features you might drop and why.

*Remember you want to be able to use this model to predict the box office gross of a film **before** anyone has seen it.*

- **budget**: The amount of money spent to make the movie
- **title_year**: The year the movie first came out in the box office
- **years_old**: How long has it been since the movie was released
- **genre**: Each movie is assigned one genre category like action, horror, comedy
- **avg_user_rating**: This rating is taken from Rotten Tomatoes, and is the average rating given to the movie by the audience
- **actor_1_facebook_likes**: The number of likes that the most popular actor in the movie has
- **total_cast_facebook_likes**: The sum of likes for the three most popular actors in the movie
- **language**: the original spoken language of the film


R-squared is a statistical measure of how close the data are to the fitted regression line. ... 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean.
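The definition above can be checked numerically as R² = 1 - SS_res / SS_tot. A minimal sketch with synthetic `x` and `y` (hypothetical values) and a straight-line fit from `numpy.polyfit`:

```python
import numpy as np

# Synthetic data: a linear trend plus noise (assumption for illustration)
rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=100)
y = 3 * x + rng.normal(0, 2, size=100)

# Least-squares line and its fitted values
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept

ss_res = np.sum((y - y_hat) ** 2)       # variation left unexplained by the line
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation around the mean
r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

For a simple linear regression this value equals the squared correlation between `x` and `y`.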


In [453]:
# def create_ols_model(df, target_var):
    
#     features = ['gross','budget','title_year','years_old','genres','imdb_score','actor_1_facebook_likes','cast_total_facebook_likes','language']
#     df1 = df[features]
#     new_features = [f[0] for f in df1.dtypes.items() if f[1] == 'float64' or f[1] == 'int']
#     df2 = df1[new_features]
#     formula = '{}~{}'.format(target_var,','.join(new_features))
    
#     #ols_model = ols(formula = formula, data=df2).fit()
    
#     return df2

In [483]:
features = ['budget','title_year','genres','imdb_score','actor_1_facebook_likes','cast_total_facebook_likes','language']

In [484]:
df_f = df[features]

In [487]:
df_f.corr()

| | budget | title_year | imdb_score | actor_1_facebook_likes | cast_total_facebook_likes |
|---|---|---|---|---|---|
| budget | 1.0 | 0.04499 | 0.029135 | 0.017544 | 0.030189 |
| title_year | 0.04499 | 1.0 | -0.131504 | 0.085532 | 0.112207 |
| imdb_score | 0.029135 | -0.131504 | 1.0 | 0.088893 | 0.099612 |
| actor_1_facebook_likes | 0.017544 | 0.085532 | 0.088893 | 1.0 | 0.945742 |
| cast_total_facebook_likes | 0.030189 | 0.112207 | 0.099612 | 0.945742 | 1.0 |


In [559]:
# We need to exclude highly correlated features to avoid multicollinearity:
# `cast_total_facebook_likes` and `actor_1_facebook_likes` have a correlation of about 0.95,
# so drop one of the two.
# `imdb_score` stands in for `avg_user_rating` in this dataset.

In [561]:
'''
`num_critic_for_reviews` and `imdb_score` can't be known before the movie is released.

we'll drop them from the model.

drop either `cast_total_facebook_likes` or `actor_1_facebook_likes` due to multicollinearity.
'''

## Question 4a

Create the following variables:

- `years_old`: The number of years since the film was released.
- Dummy categories for each of the following ratings:
    - `G`
    - `PG`
    - `R`
    
Once you have those variables, create a summary output for the following OLS model:

`gross~cast_total_facebook_likes+budget+years_old+G+PG+R`

In [560]:
df['title_year'] = df['title_year'].astype(int)
df['years_old'] = 2021 - df['title_year']    # age of the film as of 2021
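The cell above only creates `years_old`. A minimal sketch of the remaining steps (rating dummies and the OLS summary) follows, assuming a synthetic `movies` frame with hypothetical values and only four ratings. In the real data any rating other than G, PG, or R would fall into the baseline.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the movie data (assumption for illustration)
rng = np.random.default_rng(5)
n = 200
movies = pd.DataFrame({
    "content_rating": rng.choice(["G", "PG", "PG-13", "R"], size=n),
    "budget": rng.uniform(1e6, 1e8, size=n),
    "cast_total_facebook_likes": rng.uniform(0, 50_000, size=n),
    "years_old": rng.integers(1, 40, size=n),
})
movies["gross"] = 1.2 * movies["budget"] + rng.normal(0, 1e7, size=n)

# One 0/1 column per rating; PG-13 is left out and acts as the baseline
dummies = pd.get_dummies(movies["content_rating"]).astype(int)
movies = pd.concat([movies, dummies[["G", "PG", "R"]]], axis=1)

formula = "gross~cast_total_facebook_likes+budget+years_old+G+PG+R"
model = ols(formula, data=movies).fit()
print(model.summary())
```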

## Question 4b

Below is the summary output you should have gotten above. Identify any key takeaways from it.
- How ‘good’ is this model?
- Which features help to explain the variance in the target variable? 
    - Which do not? 


<img src="ols_summary.png" style="width:300px;">

In [338]:
# your answer here
# R-squared value: 0.079
# Adj. R-squared value: 0.075
# Poor model: it explains only about 7.9% of the variation in gross,
# so the data points are widely scattered around the fitted values.
# All the features except the G rating help explain the variance
# in the target variable, based on their p-values.
# The G rating is the only feature here with no statistically significant
# effect on the gross of the film.

In [562]:
'''
    The model is not very good in that it only explains about 7.9% (13.9% in mine) of the variation 
    in the data around the mean. (based on R-squared value)
    
    In the photo, Total Facebook likes, budget, age, PG rating, and R rating help to explain the variance, 
    whereas G rating does not. (based on p-values)
    
    In mine, everything other than years old helps to explain the variance.

'''

## Question 5

**Bayes Theorem**

An advertising executive is studying television viewing habits of married men and women during prime time hours. Based on the past viewing records he has determined that during prime time wives are watching television 60% of the time. It has also been determined that when the wife is watching television, 40% of the time the husband is also watching. When the wife is not watching the television, 30% of the time the husband is watching the television. Find the probability that if the husband is watching the television, the wife is also watching the television.

In [121]:
# W - wife is watching TV
# H - husband is watching TV

# P(H|W)  = 0.40 - the husband is watching 40% of the time when the wife is watching
# P(H|-W) = 0.30 - the husband is watching 30% of the time when the wife is not watching
# P(W)    = 0.60 - during prime time, wives are watching television 60% of the time
# P(-W)   = 1 - P(W) = 0.40
# P(W|H)  = ?


# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B), with A = W and B = H
# P(H)   = P(H|W) * P(W) + P(H|-W) * P(-W)
# P(W|H) = P(H|W) * P(W) / (P(H|W) * P(W) + P(H|-W) * P(-W))

P_b_given_a = 0.4        # P(H|W)
P_b_given_not_a = 0.3    # P(H|-W)
P_a = 0.6                # P(W)
P_not_a = 0.4            # P(-W)

P_b = P_b_given_a * P_a + P_b_given_not_a * P_not_a    # P(H), law of total probability
P_a_given_b = (P_b_given_a * P_a) / P_b                # P(W|H)

print("""
The probability that if the husband is watching the television, the wife is also watching the television is {}%"""
      .format(round(P_a_given_b*100,2)))



The probability that if the husband is watching the television, the wife is also watching the television is 66.67%


## Question 6

Explain what a Type I error is and how it relates to the significance level when doing a statistical test. 

In [563]:
# your answer here
'''
A Type I error occurs when you reject the null hypothesis even though the null hypothesis is true.

The probability of a Type I error equals the significance level (alpha). If you
increase the significance level, the likelihood of a Type I error also increases, and vice versa.

If our significance level is 5% (a 95% confidence level), we have a 5% chance of making a Type I error
when the null hypothesis is true.
'''

In [491]:
# If our confidence level is 95% (significance level 0.05), we have a 5% chance of making a Type I error when H0 is true.
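The link between the significance level and the Type I error rate can be illustrated with a small simulation. This is a sketch under stated assumptions: the samples are synthetic draws from a population where H0 is true by construction, and the trial count and sample size are arbitrary choices.

```python
import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(6)
alpha = 0.05
n_trials = 2000

# H0 is true in every trial: each row is a sample from a population with mean 0
samples = rng.normal(loc=0, scale=1, size=(n_trials, 30))

# One-sample t-test of H0: mean = 0 for each row
p_values = scs.ttest_1samp(samples, popmean=0, axis=1).pvalue

# Every rejection here is a Type I error, so the rejection rate should be near alpha
type_1_rate = np.mean(p_values < alpha)
print(type_1_rate)
```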