# Netflix Titles Hypothesis Test

## Introduction and background:

### I am a big fan of film and TV, and I found a dataset on Kaggle with information on all the titles available on Netflix. I think TV, like art, is a cultural representation of the people who make it, so investigating this data can tell us something, to an extent, about different cultures. I want to use this dataset to better understand the difference between American and foreign TV. I will perform a chi-squared test of independence between the country where a title was made and the rating of the title.


## Why is this business problem interesting?
### We can see whether countries lean towards making titles with certain ratings. This can give us a better sense of what kinds of movies and shows different countries produce, and perhaps a little more insight into their cultures.

### The original data set: https://www.kaggle.com/shivamb/netflix-shows

### Method used: Chi-squared test



In [102]:
import numpy as np
import pandas as pd
import scipy
pd.reset_option("^display")

In [103]:
netflix_master_df = pd.read_csv(r'C:/Users/samru/Desktop/archive/netflix_titles.csv')
netflix_df = netflix_master_df.copy()

In [104]:
netflix_df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,TV Show,3%,,"João Miguel, Bianca Comparato, Michel Gomes, R...",Brazil,"August 14, 2020",2020,TV-MA,4 Seasons,"International TV Shows, TV Dramas, TV Sci-Fi &...",In a future where the elite inhabit an island ...
1,s2,Movie,7:19,Jorge Michel Grau,"Demián Bichir, Héctor Bonilla, Oscar Serrano, ...",Mexico,"December 23, 2016",2016,TV-MA,93 min,"Dramas, International Movies",After a devastating earthquake hits Mexico Cit...
2,s3,Movie,23:59,Gilbert Chan,"Tedd Chan, Stella Chung, Henley Hii, Lawrence ...",Singapore,"December 20, 2018",2011,R,78 min,"Horror Movies, International Movies","When an army recruit is found dead, his fellow..."
3,s4,Movie,9,Shane Acker,"Elijah Wood, John C. Reilly, Jennifer Connelly...",United States,"November 16, 2017",2009,PG-13,80 min,"Action & Adventure, Independent Movies, Sci-Fi...","In a postapocalyptic world, rag-doll robots hi..."
4,s5,Movie,21,Robert Luketic,"Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...",United States,"January 1, 2020",2008,PG-13,123 min,Dramas,A brilliant group of students become card-coun...


In [105]:
netflix_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB


In [106]:
print('Rows:', netflix_df.shape[0])
print('Columns:', netflix_df.shape[1])
print('\nUnique values : \n', netflix_df.nunique())

Rows: 7787
Columns: 12

Unique values : 
 show_id         7787
type               2
title           7787
director        4049
cast            6831
country          681
date_added      1565
release_year      73
rating            14
duration         216
listed_in        492
description     7769
dtype: int64


In [107]:

netflix_df['country'].value_counts()


United States                                                      2555
India                                                               923
United Kingdom                                                      397
Japan                                                               226
South Korea                                                         183
                                                                   ... 
United Kingdom, France, United States, Belgium                        1
United Kingdom, Germany, United Arab Emirates, New Zealand            1
Bulgaria                                                              1
France, Switzerland, Spain, United States, United Arab Emirates       1
Brazil, United Kingdom                                                1
Name: country, Length: 681, dtype: int64

In [108]:
country_rating_tab = pd.crosstab(index=netflix_df['country'], columns=netflix_df['rating'])

In [109]:
country_rating_tab

country,G,NC-17,NR,PG,PG-13,R,TV-14,TV-G,TV-MA,TV-PG,TV-Y,TV-Y7,TV-Y7-FV,UR
Argentina,0,0,2,0,0,1,7,1,34,3,2,0,0,0
"Argentina, Brazil, France, Poland, Germany, Denmark",0,0,0,0,0,0,1,0,0,0,0,0,0,0
"Argentina, Chile",0,0,0,0,0,0,1,0,0,0,0,0,0,0
"Argentina, Chile, Peru",0,0,0,0,0,0,0,0,1,0,0,0,0,0
"Argentina, France",0,0,0,0,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Venezuela,0,0,0,0,0,0,1,0,0,0,0,0,0,0
"Venezuela, Colombia",0,0,1,0,0,0,0,0,0,0,0,0,0,0
Vietnam,0,0,0,0,0,0,2,1,2,0,0,0,0,0
West Germany,0,0,0,0,0,0,0,0,1,0,0,0,0,0


## This data is not very usable in its current state because many titles list several countries, which produces hundreds of overlapping country combinations. We will clean it up by keeping only titles credited to exactly one of ten selected countries.

In [110]:
df4 = netflix_df[(netflix_df['country'].isin(['United States','India', 'United Kingdom','Japan', 'South Korea','Canada','Spain','France','Egypt','Turkey']))]
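
The filter above drops every title that lists more than one country. As a hedged alternative, the sketch below splits the comma-separated `country` string and counts a title once per listed country. The frame `toy` and its rows are made up for illustration and stand in for `netflix_df`. Note the trade-off: a co-produced title then appears in several rows, so the observations are no longer independent, which a chi-squared test assumes.

```python
import pandas as pd

# Toy stand-in for netflix_df; these rows are invented for illustration.
toy = pd.DataFrame({
    "country": ["United States", "United States, India", "India", None],
    "rating": ["R", "TV-14", "TV-14", "TV-MA"],
})

# Drop rows with no country, split "A, B" into ["A", " B"], one row per country.
exploded = (
    toy.dropna(subset=["country"])
       .assign(country=lambda d: d["country"].str.split(","))
       .explode("country")
)

# Remove the whitespace left over from ", " and discard empty fragments.
exploded["country"] = exploded["country"].str.strip()
exploded = exploded[exploded["country"] != ""]

# Contingency table of country by rating, one count per (title, country) pair.
counts = pd.crosstab(exploded["country"], exploded["rating"])
print(counts)
```

Whether to explode or to filter depends on the question: filtering keeps one row per title, while exploding keeps co-productions at the cost of double counting.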

In [111]:
country_rating_tab2 = pd.crosstab(index=df4['country'], columns=df4['rating'])

In [112]:
country_rating_tab2

country,G,NC-17,NR,PG,PG-13,R,TV-14,TV-G,TV-MA,TV-PG,TV-Y,TV-Y7,TV-Y7-FV,UR
Canada,1,1,2,8,3,16,23,14,60,22,17,9,1,0
Egypt,0,0,0,0,0,0,69,0,28,4,0,0,0,0
France,0,0,0,1,1,2,18,2,75,3,9,3,0,1
India,0,0,5,3,4,2,520,9,228,133,6,11,1,1
Japan,0,0,0,5,3,0,77,1,82,39,1,17,0,0
South Korea,0,0,3,0,0,0,70,1,83,15,4,7,0,0
Spain,0,0,1,1,1,2,12,1,109,5,2,0,0,0
Turkey,0,0,3,0,0,0,36,1,50,10,0,0,0,0
United Kingdom,0,0,5,2,7,31,68,21,168,74,16,5,0,0
United States,29,1,35,144,227,364,378,77,880,241,81,93,1,1


## Hypothesis Test:

## H0: There is no relationship between the country where a title was made and the rating of the title. (INDEPENDENT)
## Ha: There is a relationship between the country where a title was made and the rating of the title. (DEPENDENT)
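
To make the test concrete, here is a minimal sketch of what a chi-squared test of independence computes, using a small made-up 2x3 table rather than the Netflix data. Under H0 the expected count in each cell is (row total x column total) / grand total, the statistic sums (observed - expected)^2 / expected over all cells, and the degrees of freedom are (rows - 1) x (columns - 1). The names `observed`, `stat_manual`, and `dof_manual` are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Made-up counts for illustration: 2 groups by 3 categories.
observed = np.array([[10, 20, 30],
                     [20, 20, 20]])

# Expected counts if rows and columns were independent.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-squared statistic and degrees of freedom computed by hand.
stat_manual = ((observed - expected) ** 2 / expected).sum()
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# scipy performs the same calculation (no continuity correction when dof > 1).
stat, p, dof, expected_scipy = chi2_contingency(observed)
print(stat_manual, stat, dof_manual, dof)
```

By the same formula, a table with 10 countries and 14 ratings has (10 - 1) x (14 - 1) = 117 degrees of freedom.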

In [113]:
from scipy.stats import chi2_contingency

# chi2_contingency accepts the crosstab directly. It returns the test
# statistic, the p-value, the degrees of freedom, and the expected counts under H0.
stat, p, dof, expected = chi2_contingency(country_rating_tab2)
print('chi2   :',stat)
print('p-value:',p)
print('dof    :',dof)

# interpret the p-value at the 5% significance level
alpha = 0.05
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')

chi2   : 1544.4410389547852
p-value: 1.6559652352317356e-247
dof    : 117
Dependent (reject H0)
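
With several thousand titles, even a modest departure from independence produces a very small p-value, so it can help to look at an effect size and at the test's assumptions as well. The sketch below is an addition, not part of the original analysis: it re-enters the ten-country table shown above, then computes Cramér's V (a 0 to 1 measure of association strength) and the share of cells with an expected count below 5, a common rule of thumb for when the chi-squared approximation becomes less reliable. The names `table`, `share_small`, and `cramers_v` are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the ten-country crosstab above (rows: country, columns: rating).
ratings = ["G", "NC-17", "NR", "PG", "PG-13", "R", "TV-14", "TV-G",
           "TV-MA", "TV-PG", "TV-Y", "TV-Y7", "TV-Y7-FV", "UR"]
counts = {
    "Canada":         [1, 1, 2, 8, 3, 16, 23, 14, 60, 22, 17, 9, 1, 0],
    "Egypt":          [0, 0, 0, 0, 0, 0, 69, 0, 28, 4, 0, 0, 0, 0],
    "France":         [0, 0, 0, 1, 1, 2, 18, 2, 75, 3, 9, 3, 0, 1],
    "India":          [0, 0, 5, 3, 4, 2, 520, 9, 228, 133, 6, 11, 1, 1],
    "Japan":          [0, 0, 0, 5, 3, 0, 77, 1, 82, 39, 1, 17, 0, 0],
    "South Korea":    [0, 0, 3, 0, 0, 0, 70, 1, 83, 15, 4, 7, 0, 0],
    "Spain":          [0, 0, 1, 1, 1, 2, 12, 1, 109, 5, 2, 0, 0, 0],
    "Turkey":         [0, 0, 3, 0, 0, 0, 36, 1, 50, 10, 0, 0, 0, 0],
    "United Kingdom": [0, 0, 5, 2, 7, 31, 68, 21, 168, 74, 16, 5, 0, 0],
    "United States":  [29, 1, 35, 144, 227, 364, 378, 77, 880, 241, 81, 93, 1, 1],
}
table = pd.DataFrame.from_dict(counts, orient="index", columns=ratings)

stat, p, dof, expected = chi2_contingency(table)
n = table.to_numpy().sum()

# Share of cells whose expected count under H0 is below 5.
share_small = (expected < 5).mean()

# Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))).
cramers_v = np.sqrt(stat / (n * (min(table.shape) - 1)))

print("cells with expected count < 5:", share_small)
print("Cramér's V:", cramers_v)
```

If many cells fall below the threshold, one option is to merge rare ratings (for example G, NC-17, TV-Y7-FV, and UR) into broader groups before rerunning the test.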


## Conclusion:

### We reject the null hypothesis that there is no relationship between the country where a title was made and the rating of the title, because the p-value is far below the significance level of 0.05. The data give strong evidence of an association between a title's rating and where it was made. The test does not tell us which countries or ratings drive that association, and it covers only Netflix titles credited to a single one of the ten selected countries. One plausible reading is that different countries tend to make certain kinds of titles, which may give a loose reflection of their cultures.
