<p style='font-weight:bold; font-size:24px'> Data analysis project 1: Yoon Tae Park (yp2201@nyu.edu)</p>

In [1]:
# Before answering questions, I will import some basic libraries and create dataset from given csv file
import numpy as np
import pandas as pd
from scipy import stats

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

dataset = pd.read_csv('./movieReplicationSet.csv')

<p style='font-weight:bold; font-size:18px'> 1) Are movies that are more popular (operationalized as having more ratings) rated higher than movies that are less popular? [Hint: You can do a median-split of popularity to determine high vs. low popularity movies] </p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>
<p style='font-size:16px'> I can check this with null hypothesis significance testing (which I will use for most of the questions).</p>
<p style='font-size:16px'> Hypothesis: Popular movies are rated higher than less popular movies.</p>
<p style='font-size:16px'> Null hypothesis: Popular movies are not rated higher than less popular movies.</p>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the movies into high-popularity and low-popularity groups at the median number of ratings. I then ran a Mann-Whitney U test on the per-movie mean ratings and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> The p-value (1.6971433120157929e-40) was far smaller than the significance level α = 0.005.</p>
<p style='font-size:16px'> So I rejected the null hypothesis. In plain English, I concluded that popular movies are rated higher than less popular movies (the detailed code supporting this conclusion is below).</p>

In [2]:
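# We don't need columns other than movie ratings, so keep only the first 400 (movie) columns
# .copy() gives an independent frame, so adding rows later does not write into a slice of `dataset`
dataset_movies = dataset.iloc[:, :400].copy()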

In [3]:
# Since popularity is determined by the number of non-null ratings for each movie,
# we need to create a new row that counts the non-null values
dataset_movies.loc['not_null_cnt'] = dataset_movies.notnull().sum()

In [4]:
# now we compute median of not null counts and find that median is 197.5
np.median(dataset_movies.loc['not_null_cnt'])

197.5

In [5]:
# Create a function that classifies each movie as high or low popularity
# Movies whose count equals the median exactly are labelled 'same' and will not be used
popularity = dataset_movies.loc['not_null_cnt']  # the count row added above (position 1097)
popularity_median = np.median(popularity)        # computed once instead of on every call

def high_low_check(x):
    if x > popularity_median: return 'high'
    elif x < popularity_median: return 'low'
    else: return 'same'

dataset_movies.loc['popular'] = popularity.apply(high_low_check)

In [6]:
# This time, every movie falls into either the low or the high group (none sits exactly at the median)
dataset_movies.loc['popular'].value_counts()

low     200
high    200
Name: popular, dtype: int64

In [7]:
# Now let's filter the movies that are popular
high_movie = dataset_movies.loc[:, dataset_movies.loc['popular'] == 'high'][:1097]

In [8]:
# For each high-popularity movie, delete null values element-wise and take the mean rating
# We do this because we don't know what values should be substituted,
# and we would lose a lot of statistical power if we deleted null values row-wise
# (I will use element-wise deletion for all of the problems)
high_movie_array = []

for i in range(len(high_movie.columns)):
    each_movie = high_movie.iloc[:,i]
    each_movie = each_movie[pd.notnull(each_movie)]
    high_movie_array.append(each_movie.mean())

In [9]:
# Doing same thing for low popular movies as well
low_movie = dataset_movies.loc[:, dataset_movies.loc['popular'] == 'low'][:1097]

In [10]:
low_movie_array = []

for i in range(len(low_movie.columns)):
    each_movie = low_movie.iloc[:,i]
    each_movie = each_movie[pd.notnull(each_movie)]
    low_movie_array.append(each_movie.mean())

In [11]:
# Then, calculate the p-value using the Mann-Whitney U test
# I am using the U test because I am comparing 2 groups without assuming a parametric distribution, and movie ratings are ordinal data.

# The p-value (1.6971433120157929e-40) is far smaller than the significance level alpha = 0.005
# Reject the null hypothesis that popular movies are not rated higher than less popular movies
# and conclude that more popular movies are rated higher than less popular movies
u1, p1 = stats.mannwhitneyu(high_movie_array, low_movie_array)
u1, p1

(35404.0, 1.6971433120157929e-40)
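
<p style='font-size:16px'> As a minimal sketch of the same median-split procedure in vectorized form: the example below runs on synthetic ratings, not the real CSV, and the table <code>ratings</code> and its column names are made up for illustration. The per-movie counts and means are computed without explicit loops. Passing <code>alternative='greater'</code> states the one-sided direction of the hypothesis explicitly instead of relying on SciPy's default, which, as far as I know, has changed between versions.</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the ratings table: rows are viewers, columns are movies.
rng = np.random.default_rng(0)
ratings = pd.DataFrame(
    rng.integers(0, 9, size=(200, 10)) / 2.0,  # ratings from 0.0 to 4.0 in 0.5 steps
    columns=[f"Movie {k}" for k in range(10)],
)

# Remove a different share of ratings per movie so that popularity varies.
for k, col in enumerate(ratings.columns):
    missing = rng.random(len(ratings)) < (0.05 + 0.08 * k)
    ratings.loc[missing, col] = np.nan

popularity = ratings.notna().sum()  # number of ratings per movie
mean_rating = ratings.mean()        # NaN is skipped by default (element-wise removal)

cutoff = popularity.median()
high = mean_rating[popularity > cutoff]
low = mean_rating[popularity < cutoff]  # movies exactly at the median are left out

# One-sided test: are mean ratings in the high group stochastically greater?
u_stat, p_value = stats.mannwhitneyu(high, low, alternative="greater")
print(len(high), len(low), u_stat, p_value)
```

<p style='font-size:16px'> Because the synthetic ratings are random, this sketch is not expected to reproduce the result above; it only illustrates the mechanics of the split and the test.</p>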

<p style='font-weight:bold; font-size:18px'> 2) Are movies that are newer rated differently than movies that are older? [Hint: Do a median split of year of release to contrast movies in terms of whether they are old or new] </p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>
<p style='font-size:16px'> I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: Newer movies are rated differently than older movies.</p>
<p style='font-size:16px'> Null hypothesis: Newer movies are not rated differently than older movies.</p><br>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the movies into newer and older groups at the median release year (movies released exactly in the median year are left out). I then ran a Mann-Whitney U test on the per-movie mean ratings and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> The p-value (0.16654749319603956) was larger than the significance level α = 0.005.</p>

<p style='font-size:16px'> So I failed to reject the null hypothesis. In plain English, the data do not provide sufficient evidence that newer movies are rated differently than older movies (the detailed code supporting this conclusion is below).</p>

In [94]:
# Starting from the 'dataset_movies' variable (which contains movie ratings only),
# we need to find the release year of each movie. I parsed the column names and extracted the year
dataset_movies.loc['year'] = dataset_movies.columns.str[-5:-1]

In [95]:
# By parsing as below, we can get year data from movie name. 
dataset_movies.columns.str[-5:-1][:1]

Index(['2003'], dtype='object')

In [96]:
# Rambo: First Blood Part II doesn't have year data 
# I'll drop this column, as this movie is the only movie that doesn't have year data
dataset_movies_v2 = dataset_movies.drop('Rambo: First Blood Part II', axis=1)

In [97]:
# Calculate the median release year across all movies. The median year is 1999
dataset_movies_v2.loc['year'].median()

1999.0

In [98]:
# Classify movies into new/old/same

def old_new_check(x):
    
    if x > dataset_movies_v2.loc['year'].median(): return 'new'
    elif x < dataset_movies_v2.loc['year'].median(): return 'old'
    else: return 'same'
    
dataset_movies_v2.loc['old_new_check'] = dataset_movies_v2.loc['year'].apply(lambda x: old_new_check(int(x)))

In [99]:
# Note that I won't be using 'same' movies, since those have median values (Not old, Not new)
dataset_movies_v2.loc['old_new_check'].value_counts()

old     196
new     174
same     29
Name: old_new_check, dtype: int64

In [100]:
# Filter by new movies
new_movies = dataset_movies_v2.loc[:, dataset_movies_v2.loc['old_new_check'] == 'new'][:1097]

In [111]:
# Create a new array and append each new movie's mean rating
new_movie_array = []

for i in range(len(new_movies.columns)):
    each_movie = new_movies.iloc[:,i]
    each_movie = each_movie[pd.notnull(each_movie)]
    new_movie_array.append(each_movie.mean())

In [112]:
# Filter by old movies
old_movies = dataset_movies_v2.loc[:, dataset_movies_v2.loc['old_new_check'] == 'old'][:1097]

In [113]:
# Create a new array and append each old movie's mean rating
old_movie_array = []

for i in range(len(old_movies.columns)):
    each_movie = old_movies.iloc[:,i]
    each_movie = each_movie[pd.notnull(each_movie)]
    old_movie_array.append(each_movie.mean())

In [114]:
# Then, calculate the p-value using the Mann-Whitney U test
# I am using the U test because I am comparing 2 groups without assuming a parametric distribution, and movie ratings are ordinal data.

# The p-value (0.16654749319603956) is larger than the significance level alpha = 0.005
# So we fail to reject the null hypothesis
# In plain English, there is not sufficient evidence that newer movies are rated differently than older movies

u1, p1 = stats.mannwhitneyu(new_movie_array, old_movie_array)
u1, p1

(18473.0, 0.16654749319603956)
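
<p style='font-size:16px'> The year parsing above relies on every title ending in a four-digit year in parentheses. A minimal sketch of a more defensive alternative, assuming that title format, is to extract the year with a regular expression so that titles without a year become missing values instead of garbage strings. The three titles below appear in this notebook; the variable names are illustrative.</p>

```python
import pandas as pd

titles = pd.Series([
    "Finding Nemo (2003)",
    "Home Alone (1990)",
    "Rambo: First Blood Part II",  # no year in the title
])

# Capture four digits inside a trailing pair of parentheses; no match yields NaN.
year_text = titles.str.extract(r"\((\d{4})\)\s*$", expand=False)
years = pd.to_numeric(year_text, errors="coerce")

has_year = years.notna()  # boolean mask of titles with a usable year
print(years.tolist(), int(has_year.sum()))
```

<p style='font-size:16px'> With this approach the mask <code>has_year</code> identifies which movies to drop, so the title without a year does not need to be named by hand.</p>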

<p style='font-weight:bold; font-size:18px'>3) Is enjoyment of ‘Shrek (2001)’ gendered, i.e. do male and female viewers rate it differently?</p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>
<p style='font-size:16px'> I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: Shrek (2001) was rated differently by male and female viewers.</p>
<p style='font-size:16px'> Null hypothesis: Shrek (2001) was not rated differently by male and female viewers.</p><br>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the ratings of Shrek (2001) into those from male viewers and those from female viewers. I then ran a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> The p-value (0.050536625925559006) was larger than the significance level α = 0.005.</p>


<p style='font-size:16px'> So I failed to reject the null hypothesis. In plain English, the data do not provide sufficient evidence that Shrek (2001) was rated differently by gender (the detailed code supporting this conclusion is below).</p>

In [22]:
# Check the gender distribution
# I did not use value 3 (self-described), since those viewers cannot be assigned to either of the two groups compared here.
dataset.iloc[:, 474].value_counts()

1.0    807
2.0    260
3.0      6
Name: Gender identity (1 = female; 2 = male; 3 = self-described), dtype: int64

In [23]:
# Filter by female rows and shrek column
# Also delete null values element-wise
female_shrek = dataset[dataset.iloc[:, 474] == 1.0]['Shrek (2001)']
female_shrek = female_shrek[pd.notnull(female_shrek)]

In [24]:
# Filter by male rows and shrek column
# Also delete null values element-wise
male_shrek = dataset[dataset.iloc[:, 474] == 2.0]['Shrek (2001)']
male_shrek = male_shrek[pd.notnull(male_shrek)]

In [25]:
# Then, calculate the p-value using the Mann-Whitney U test
# I am using the U test because I am comparing 2 groups without assuming a parametric distribution, and movie ratings are ordinal data.

# The p-value (0.050536625925559006) is larger than the significance level alpha = 0.005
# So we fail to reject the null hypothesis
# In plain English, there is not sufficient evidence that Shrek (2001) was rated differently by gender.

u1, p1 = stats.mannwhitneyu(female_shrek, male_shrek)
u1, p1

(96830.5, 0.050536625925559006)
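
<p style='font-size:16px'> The same filter, drop-nulls, and test pattern recurs in several of the later questions. As a minimal sketch, it could be wrapped in a small helper. The function name <code>compare_groups</code>, the column name <code>gender</code>, and the synthetic data are my own assumptions for illustration and are not part of the dataset.</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_groups(df, group_col, value_col, group_a, group_b, alternative="two-sided"):
    """Mann-Whitney U test of value_col between two levels of group_col, dropping NaN."""
    a = df.loc[df[group_col] == group_a, value_col].dropna()
    b = df.loc[df[group_col] == group_b, value_col].dropna()
    return stats.mannwhitneyu(a, b, alternative=alternative)

# Synthetic viewers: a group code per viewer and a rating with some values missing.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "gender": rng.choice([1.0, 2.0, 3.0], size=300, p=[0.7, 0.25, 0.05]),
    "Shrek (2001)": rng.integers(0, 9, size=300) / 2.0,
})
demo.loc[rng.random(300) < 0.2, "Shrek (2001)"] = np.nan

u_stat, p_value = compare_groups(demo, "gender", "Shrek (2001)", 1.0, 2.0)
print(u_stat, p_value)
```

<p style='font-size:16px'> The <code>alternative</code> argument is exposed so that one-sided questions can pass <code>'greater'</code> or <code>'less'</code> explicitly.</p>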

<p style='font-weight:bold; font-size:18px'>4) What proportion of movies are rated differently by male and female viewers?</p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>

<p style='font-size:16px'> I will apply the null hypothesis test below to each movie and calculate the proportion of movies whose p-value falls below the significance level. If the p-value is below that level, I conclude that the given movie is rated differently by male and female viewers.</p>

<p style='font-size:16px'> I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: The given movie is rated differently by gender.</p>
<p style='font-size:16px'> Null hypothesis: The given movie is not rated differently by gender.</p>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the viewers into male and female. Then, for each movie, I ran a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> Classifying the p-values against the significance level α = 0.005, I got 50 movies with p-values below it and 350 movies with p-values at or above it.</p>

<p style='font-size:16px'> Therefore, the proportion of movies rated differently by gender is 50/400 = 0.125.</p>

In [26]:
# Filter by female (1) and male (2) viewers, following the coding of the gender column
dataset_female = dataset[dataset.iloc[:, 474] == 1.0].iloc[:, :400]
dataset_male = dataset[dataset.iloc[:, 474] == 2.0].iloc[:, :400]

In [27]:
# small: reject the null hypothesis (the movie shows a difference)
# big: fail to reject the null hypothesis (no evidence of a difference)
# len(dataset_male.columns) == len(dataset_female.columns) == 400

# Iterate over the movies and calculate a p-value for each
# For the male and female ratings of each movie, drop null values element-wise and compare
# If the p-value is below the significance level (p < 0.005), add one to small
# Otherwise (p >= 0.005), add one to big

small = 0
big = 0

for i in range(len(dataset_male.columns)):
    male_rating = dataset_male.iloc[:,i]
    male_rating = male_rating[pd.notnull(male_rating)]
    
    female_rating = dataset_female.iloc[:,i]
    female_rating = female_rating[pd.notnull(female_rating)]
    
    u1, p1 = stats.mannwhitneyu(male_rating, female_rating)
    
    if p1 < 0.005:
        small += 1
    else:
        big += 1

In [28]:
# Calculate proportion
proportion = small / (small+big)
print('small p-values: {0}, big p-values: {1}, proportion:{2}'.format(small, big, proportion))

small p-values: 50, big p-values: 350, proportion:0.125
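
<p style='font-size:16px'> The counting loop can also be sketched as a function that keeps every p-value, which makes the proportion a one-line computation. The function name <code>proportion_significant</code> and the synthetic tables are assumptions for illustration. For context when reading such proportions: if no movie truly differed, roughly 400 × 0.005 = 2 of 400 tests would still be expected to fall below α by chance.</p>

```python
import numpy as np
import pandas as pd
from scipy import stats

def proportion_significant(group_a, group_b, alpha=0.005):
    """Run one two-sided Mann-Whitney U test per column; return (proportion, p-values)."""
    p_values = []
    for col in group_a.columns:
        a = group_a[col].dropna()  # element-wise removal of missing ratings
        b = group_b[col].dropna()
        p_values.append(stats.mannwhitneyu(a, b, alternative="two-sided").pvalue)
    p_values = np.array(p_values)
    return (p_values < alpha).mean(), p_values

# Synthetic ratings for two viewer groups over the same 20 movies.
rng = np.random.default_rng(2)
cols = [f"Movie {k}" for k in range(20)]
group_a = pd.DataFrame(rng.integers(0, 9, size=(120, 20)) / 2.0, columns=cols)
group_b = pd.DataFrame(rng.integers(0, 9, size=(80, 20)) / 2.0, columns=cols)
group_a = group_a.mask(rng.random(group_a.shape) < 0.3)  # set about 30% of cells to NaN
group_b = group_b.mask(rng.random(group_b.shape) < 0.3)

proportion, p_values = proportion_significant(group_a, group_b)
print(proportion, len(p_values))
```

<p style='font-size:16px'> Keeping the full array of p-values also makes it easy to inspect which movies fall below the threshold, rather than only how many.</p>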


<p style='font-weight:bold; font-size:18px'>5) Do people who are only children enjoy ‘The Lion King (1994)’ more than people with siblings?</p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>

<p style='font-size:16px'> I assume that a high rating means the viewer enjoyed the movie.</p>
<p style='font-size:16px'> Now, I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: The Lion King (1994) was enjoyed (rated) more by people who are only children than by people with siblings.</p>
<p style='font-size:16px'> Null hypothesis: The Lion King (1994) was not enjoyed (rated) more by people who are only children than by people with siblings.</p><br>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the ratings of The Lion King (1994) into those from only children and those from people with siblings. I then ran a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> The p-value (0.04319872995682849) was larger than the significance level α = 0.005.</p>


<p style='font-size:16px'> So I failed to reject the null hypothesis. In plain English, the data do not provide sufficient evidence that The Lion King (1994) was enjoyed (rated) more by only children than by people with siblings (the detailed code supporting this conclusion is below).</p>

In [29]:
# Check the only-child column. Per the column label, 1 = only child and 0 = has siblings.
# I did not use the -1 value (did not respond), since those viewers cannot be assigned to either group.
dataset.iloc[:, 475].value_counts()

 0    894
 1    177
-1     26
Name: Are you an only child? (1: Yes; 0: No; -1: Did not respond), dtype: int64

In [115]:
# for each only_child and siblings ratings, drop null values element-wise and compare 
only_child = dataset[dataset.iloc[:, 475] == 1.0]['The Lion King (1994)']
only_child = only_child[pd.notnull(only_child)]

siblings = dataset[dataset.iloc[:, 475] == 0.0]['The Lion King (1994)']
siblings = siblings[pd.notnull(siblings)]

In [116]:
# Then, calculate the p-value using the Mann-Whitney U test
# I am using the U test because I am comparing 2 groups without assuming a parametric distribution, and movie ratings are ordinal data.

# The p-value (0.04319872995682849) is larger than the significance level alpha = 0.005
# So we fail to reject the null hypothesis
# In plain English, there is not sufficient evidence that The Lion King (1994) was enjoyed (rated) more
# by people who are only children than by people with siblings.
u1, p1 = stats.mannwhitneyu(only_child, siblings)
u1, p1

(52929.0, 0.04319872995682849)

<p style='font-weight:bold; font-size:18px'>6) What proportion of movies exhibit an “only child effect”, i.e. are rated different by viewers with siblings vs. those without?</p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>

<p style='font-size:16px'> I will apply the null hypothesis test below to each movie and calculate the proportion of movies whose p-value falls below the significance level. If the p-value is below that level, I conclude that the given movie is rated differently by viewers with siblings vs. those without.</p>

<p style='font-size:16px'> I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: The given movie is rated differently by only children and by people with siblings.</p>
<p style='font-size:16px'> Null hypothesis: The given movie is not rated differently by only children and by people with siblings.</p>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the viewers into only children and people with siblings. Then, for each movie, I ran a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> Classifying the p-values against the significance level α = 0.005, I got 7 movies with p-values below it and 393 movies with p-values at or above it.</p>

<p style='font-size:16px'> Therefore, the proportion of movies that exhibit an 'only child effect' is 7/400 = 0.0175.</p>

In [33]:
# Filter by onlychild and siblings
dataset_onlychild = dataset[dataset.iloc[:, 475] == 1.0].iloc[:, :400]
dataset_morechild = dataset[dataset.iloc[:, 475] == 0.0].iloc[:, :400]

In [34]:
# small: reject the null hypothesis (the movie shows a difference)
# big: fail to reject the null hypothesis (no evidence of a difference)
# len(dataset_onlychild.columns) == len(dataset_morechild.columns) == 400

# Iterate over the movies and calculate a p-value for each
# For the only-child and siblings ratings of each movie, drop null values element-wise and compare
# If the p-value is below the significance level (p < 0.005), add one to small
# Otherwise (p >= 0.005), add one to big

small = 0
big = 0

for i in range(len(dataset_onlychild.columns)):
    onlychild = dataset_onlychild.iloc[:,i]
    onlychild = onlychild[pd.notnull(onlychild)]
    
    morechild = dataset_morechild.iloc[:,i]
    morechild = morechild[pd.notnull(morechild)]
    
    u1, p1 = stats.mannwhitneyu(onlychild, morechild)
    
    if p1 < 0.005:
        small += 1
    else:
        big += 1

In [35]:
# Calculate proportion
proportion = small / (small+big)
print('small p-values: {0}, big p-values: {1}, proportion:{2}'.format(small, big, proportion))

small p-values: 7, big p-values: 393, proportion:0.0175


<p style='font-weight:bold; font-size:18px'>7) Do people who like to watch movies socially enjoy ‘The Wolf of Wall Street (2013)’ more than those who prefer to watch them alone?</p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>

<p style='font-size:16px'> I assume that a high rating means the viewer enjoyed the movie.</p>
<p style='font-size:16px'> Now, I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: The Wolf of Wall Street (2013) was enjoyed (rated) more by people who like to watch movies socially than by people who prefer to watch them alone.</p>
<p style='font-size:16px'> Null hypothesis: The Wolf of Wall Street (2013) was not enjoyed (rated) more by people who like to watch movies socially than by people who prefer to watch them alone.</p>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the ratings of The Wolf of Wall Street (2013) into those from people who like to watch movies socially and those from people who prefer to watch them alone. I then ran a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> The p-value (0.1127642933222891) was larger than the significance level α = 0.005.</p>


<p style='font-size:16px'> So I failed to reject the null hypothesis. In plain English, the data do not provide sufficient evidence that The Wolf of Wall Street (2013) was enjoyed (rated) more by people who like to watch movies socially than by people who prefer to watch them alone.</p>

In [36]:
# Check the 'Movies are best enjoyed alone' column. Per the column label, 1 = prefers to watch alone
# and 0 = does not (treated here as preferring to watch socially).
# I did not use the -1 value (did not respond), since those viewers cannot be assigned to either group.
dataset.iloc[:, 476].value_counts()

 1    610
 0    462
-1     25
Name: Movies are best enjoyed alone (1: Yes; 0: No; -1: Did not respond), dtype: int64

In [117]:
# for each movie by watching alone and watching socially, drop null values element-wise and compare 
movie_alone = dataset[dataset.iloc[:, 476] == 1.0]['The Wolf of Wall Street (2013)']
movie_alone = movie_alone[pd.notnull(movie_alone)]

movie_social = dataset[dataset.iloc[:, 476] == 0.0]['The Wolf of Wall Street (2013)']
movie_social = movie_social[pd.notnull(movie_social)]

In [118]:
# The p-value (0.1127642933222891) is larger than the significance level alpha = 0.005
# So we fail to reject the null hypothesis
# In plain English, there is not sufficient evidence that The Wolf of Wall Street (2013) was enjoyed (rated) more
# by people who like to watch movies socially than by people who prefer to watch them alone.

u1, p1 = stats.mannwhitneyu(movie_alone, movie_social)
u1, p1

(56806.5, 0.1127642933222891)

<p style='font-weight:bold; font-size:18px'>8) What proportion of movies exhibit such a “social watching” effect?</p> 

<p style='font-weight:bold; font-size:18px'> Answer) </p>

<p style='font-size:16px'> I will apply the null hypothesis test below to each movie and calculate the proportion of movies whose p-value falls below the significance level. If the p-value is below that level, I conclude that the given movie is rated differently by viewers who prefer watching movies alone and viewers who prefer watching them socially.</p>

<p style='font-size:16px'> I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: The given movie is rated differently by people who watch movies alone and people who watch them socially.</p>
<p style='font-size:16px'> Null hypothesis: The given movie is not rated differently by people who watch movies alone and people who watch them socially.</p><br>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I split the viewers into people who watch movies alone and people who watch them socially. Then, for each movie, I ran a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> Classifying the p-values against the significance level α = 0.005, I got 10 movies with p-values below it and 390 movies with p-values at or above it.</p>

<p style='font-size:16px'> Therefore, the proportion of movies that exhibit a 'social watching effect' is 10/400 = 0.025.</p>

In [40]:
# Filter by people who prefer watching movies alone (1) or socially (0), following the column's coding
movies_alone = dataset[dataset.iloc[:, 476] == 1.0].iloc[:, :400]
movies_social = dataset[dataset.iloc[:, 476] == 0.0].iloc[:, :400]

In [41]:
# small: reject the null hypothesis (the movie shows a difference)
# big: fail to reject the null hypothesis (no evidence of a difference)
# len(movies_alone.columns) == len(movies_social.columns) == 400

# Iterate over the movies and calculate a p-value for each
# For the movies_alone and movies_social ratings of each movie, drop null values element-wise and compare
# If the p-value is below the significance level (p < 0.005), add one to small
# Otherwise (p >= 0.005), add one to big

small = 0
big = 0

for i in range(len(movies_alone.columns)):
    alone = movies_alone.iloc[:,i]
    alone = alone[pd.notnull(alone)]
    
    social = movies_social.iloc[:,i]
    social = social[pd.notnull(social)]
    
    u1, p1 = stats.mannwhitneyu(alone, social)
    
    if p1 < 0.005:
        small += 1
    else:
        big += 1

In [42]:
# Calculate proportion
proportion = small / (small+big)
print('small p-values: {0}, big p-values: {1}, proportion:{2}'.format(small, big, proportion))

small p-values: 10, big p-values: 390, proportion:0.025


<p style='font-weight:bold; font-size:18px'>9) Is the ratings distribution of ‘Home Alone (1990)’ different than that of ‘Finding Nemo (2003)’?</p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>

<p style='font-size:16px'> I can check this with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: The ratings distribution of ‘Home Alone (1990)’ is different from that of ‘Finding Nemo (2003)’.</p>
<p style='font-size:16px'> Null hypothesis: The ratings distribution of ‘Home Alone (1990)’ is not different from that of ‘Finding Nemo (2003)’.</p><br>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'> I extracted the ratings of ‘Home Alone (1990)’ and ‘Finding Nemo (2003)’ and dropped null values element-wise.</p>
<p style='font-size:16px'> I then ran a two-sample Kolmogorov-Smirnov (KS) test and calculated the p-value. I am using the KS test because I am comparing the shapes of the distributions of two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> The p-value (6.379381467525036e-10) was far smaller than the significance level α = 0.005.</p>

<p style='font-size:16px'> So I rejected the null hypothesis. In plain English, I concluded that the ratings distribution of ‘Home Alone (1990)’ is different from that of ‘Finding Nemo (2003)’.</p>

In [121]:
dist_home_alone = dataset['Home Alone (1990)']
dist_home_alone = dist_home_alone[pd.notnull(dist_home_alone)]

dist_find_nemo = dataset['Finding Nemo (2003)']
dist_find_nemo = dist_find_nemo[pd.notnull(dist_find_nemo)]

In [122]:
# The p-value (6.379381467525036e-10) is far smaller than the significance level alpha = 0.005
# Reject the null hypothesis and conclude that the ratings distribution of ‘Home Alone (1990)’
# is different from that of ‘Finding Nemo (2003)’.
# ks_2samp is the explicit two-sample form of the KS test
u1, p1 = stats.ks_2samp(dist_home_alone, dist_find_nemo)
u1, p1

(0.15269080020897632, 6.379381467525036e-10)
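
<p style='font-size:16px'> To illustrate why a KS test suits a question about distributions rather than typical values, here is a minimal sketch on synthetic continuous data (not movie ratings). The two samples share the same center but differ in spread. The KS test responds to any difference between the two empirical distributions, while the Mann-Whitney U test is mainly sensitive to one group tending to have larger values than the other.</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
narrow = rng.normal(loc=0.0, scale=1.0, size=500)  # same center, small spread
wide = rng.normal(loc=0.0, scale=3.0, size=500)    # same center, large spread

# Two-sample KS test: compares the full empirical distribution functions.
ks_stat, ks_p = stats.ks_2samp(narrow, wide)

# Mann-Whitney U test: asks whether one sample tends to be larger than the other.
u_stat, u_p = stats.mannwhitneyu(narrow, wide, alternative="two-sided")

print("KS:", ks_stat, ks_p)
print("MWU:", u_stat, u_p)
```

<p style='font-size:16px'> In this setup the KS p-value should be very small because the spreads differ strongly, whereas the Mann-Whitney result depends mostly on whether the two centers happen to differ in the sample.</p>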

<p style='font-weight:bold; font-size:18px'>10) There are ratings on movies from several franchises ([‘Star Wars’, ‘Harry Potter’, ‘The Matrix’, ‘Indiana Jones’, ‘Jurassic Park’, ‘Pirates of the Caribbean’, ‘Toy Story’, ‘Batman’]) in this dataset. How many of these are of inconsistent quality, as experienced by viewers? [Hint: You can use the keywords in quotation marks featured in this question to identify the movies that are part of each franchise]</p>

<p style='font-weight:bold; font-size:18px'> Answer) </p>

<p style='font-size:16px'> For each franchise, I will check for inconsistent quality by comparing the ratings of every pair of movies in that franchise (e.g., every pair of Star Wars movies). If at least one comparison yields a p-value below the significance level (so the null hypothesis is rejected), I treat the franchise as having inconsistent quality.</p>
<p style='font-size:16px'> I will also assume that differing ratings indicate inconsistent quality.</p>

<p style='font-size:16px'> Now, I can check inconsistency with null hypothesis significance testing.</p>
<p style='font-size:16px'> Hypothesis: The given pair of movies has inconsistent quality (the two movies are rated differently).</p>
<p style='font-size:16px'> Null hypothesis: The given pair of movies does not have inconsistent quality (the two movies are not rated differently).</p><br>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>


<p style='font-size:16px'> I filtered the movies by franchise. Then, within each franchise, I ran a Mann-Whitney U test and calculated the p-value for every pair of movies. I am using the Mann-Whitney U test because I am comparing two groups without assuming a parametric distribution, and movie ratings are ordinal data.</p>
<p style='font-size:16px'> Classifying the p-values against the significance level α = 0.005, I got the following results for each franchise.</p><br>

<p style='font-weight:bold;font-size:16px'> - The Star Wars series has inconsistent quality </p>
<p style='font-size:16px'> There are 6 movies in the series, so I conducted 15 pairwise comparisons. Result: 8 rejected and 7 not rejected. Therefore, the Star Wars series has inconsistent quality.</p><br>

<p style='font-weight:bold;font-size:16px'> - The Harry Potter series has consistent quality </p>
<p style='font-size:16px'> There are 4 movies in the series, so I conducted 6 pairwise comparisons. Result: 0 rejected and 6 not rejected. Therefore, the Harry Potter series has consistent quality.</p><br>


<p style='font-weight:bold;font-size:16px'> - The Matrix series has inconsistent quality </p>
<p style='font-size:16px'> There are 3 movies in the series, so I conducted 3 pairwise comparisons. Result: 2 rejected and 1 not rejected. Therefore, the Matrix series has inconsistent quality.</p><br>

<p style='font-weight:bold;font-size:16px'> - The Indiana Jones series has inconsistent quality </p>
<p style='font-size:16px'> There are 4 movies in the series, so I conducted 6 pairwise comparisons. Result: 4 rejected and 2 not rejected. Therefore, the Indiana Jones series has inconsistent quality.</p><br>

<p style='font-weight:bold;font-size:16px'> - The Jurassic Park series has inconsistent quality </p>
<p style='font-size:16px'> There are 3 movies in the series, so I conducted 3 pairwise comparisons. Result: 3 rejected and 0 not rejected. Therefore, the Jurassic Park series has inconsistent quality.</p><br>

<p style='font-weight:bold;font-size:16px'> - The Pirates of the Caribbean series has inconsistent quality </p>
<p style='font-size:16px'> There are 3 movies in the series, so I conducted 3 pairwise comparisons. Result: 2 rejected and 1 not rejected. Therefore, the Pirates of the Caribbean series has inconsistent quality.</p><br>

<p style='font-weight:bold;font-size:16px'> - The Toy Story series has inconsistent quality </p>
<p style='font-size:16px'> There are 3 movies in the series, so I conducted 3 pairwise comparisons. Result: 2 rejected and 1 not rejected. Therefore, the Toy Story series has inconsistent quality.</p><br>

<p style='font-weight:bold;font-size:16px'> - The Batman series has inconsistent quality </p>
<p style='font-size:16px'> There are 3 movies in the series, so I conducted 3 pairwise comparisons. Result: 3 rejected and 0 not rejected. Therefore, the Batman series has inconsistent quality.</p><br>
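
<p style='font-size:16px'> The comparison counts above follow from the number of unordered pairs, n(n-1)/2: 6 movies give 15 pairs, 4 give 6, and 3 give 3. As a minimal sketch, the pairwise testing can be written once with <code>itertools.combinations</code>. The function name <code>pairwise_tests</code> and the synthetic samples are assumptions for illustration.</p>

```python
from itertools import combinations
from math import comb

import numpy as np
from scipy import stats

def pairwise_tests(samples, alpha=0.005):
    """Two-sided Mann-Whitney U test for every unordered pair of samples.

    Returns a list of (i, j, p_value, rejected) tuples with i < j.
    """
    results = []
    for (i, a), (j, b) in combinations(enumerate(samples), 2):
        p = stats.mannwhitneyu(a, b, alternative="two-sided").pvalue
        results.append((i, j, p, p < alpha))
    return results

# Six synthetic "movies", each with 150 ratings from 0.0 to 4.0 in 0.5 steps.
rng = np.random.default_rng(3)
samples = [rng.integers(0, 9, size=150) / 2.0 for _ in range(6)]

results = pairwise_tests(samples)
n_reject = sum(rejected for _, _, _, rejected in results)
print(len(results), comb(6, 2), n_reject)
```

<p style='font-size:16px'> Under the rule used in this answer, a franchise would be labelled inconsistent whenever <code>n_reject</code> is at least 1.</p>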


In [45]:
# Filtering movies by franchise name  
starwars, harry, matrix, indiana, jurassic, pirate, toy, batman = [], [], [], [], [], [], [], []
movies = dataset.columns[:400]

for movie in movies:
    if 'Star Wars' in movie:
        starwars.append(movie)
    elif 'Harry Potter' in movie:
        harry.append(movie)
    elif 'The Matrix' in movie:
        matrix.append(movie)
    elif 'Indiana Jones' in movie:
        indiana.append(movie)
    elif 'Jurassic Park' in movie:
        jurassic.append(movie)
    elif 'Pirates of the Caribbean' in movie:
        pirate.append(movie)
    elif 'Toy Story' in movie:
        toy.append(movie)
    elif 'Batman' in movie:
        batman.append(movie)
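
<p style='font-size:16px'> A more compact way to sketch the same keyword filtering is a dictionary keyed by franchise name, which avoids one list variable and one <code>elif</code> branch per franchise. The short title list below is illustrative only and stands in for the real column names.</p>

```python
franchises = [
    "Star Wars", "Harry Potter", "The Matrix", "Indiana Jones",
    "Jurassic Park", "Pirates of the Caribbean", "Toy Story", "Batman",
]

# Illustrative titles only; in the notebook these would be dataset.columns[:400].
titles = [
    "The Matrix (1999)",
    "The Matrix Reloaded (2003)",
    "Toy Story (1995)",
    "Shrek (2001)",
]

# Map each franchise keyword to the titles that contain it.
franchise_movies = {
    name: [title for title in titles if name in title]
    for name in franchises
}

print(franchise_movies["The Matrix"])
print(franchise_movies["Toy Story"])
```

<p style='font-size:16px'> Unlike the <code>elif</code> chain, this version would place a title in more than one group if it contained two keywords, which does not matter when the keywords do not overlap.</p>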

In [46]:
# For each series, create a new array that contains each series rating data
starwars_test = []

for i in starwars:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    starwars_test.append(movie)

In [47]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(starwars_test)):
    for j in range(i+1, len(starwars_test)):
        u, p = stats.mannwhitneyu(starwars_test[i], starwars_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(starwars_test), reject+accept, reject, accept))
print('Therefore, Star Wars series have inconsistent quality.')

Reject null) Series between case 1 and case 2 p-value: 0.000000
Accept null) Series between case 1 and case 3 p-value: 0.081898
Reject null) Series between case 1 and case 4 p-value: 0.000000
Accept null) Series between case 1 and case 5 p-value: 0.023967
Accept null) Series between case 1 and case 6 p-value: 0.494681
Reject null) Series between case 2 and case 3 p-value: 0.000000
Accept null) Series between case 2 and case 4 p-value: 0.450309
Reject null) Series between case 2 and case 5 p-value: 0.000000
Reject null) Series between case 2 and case 6 p-value: 0.000000
Reject null) Series between case 3 and case 4 p-value: 0.000000
Accept null) Series between case 3 and case 5 p-value: 0.573254
Accept null) Series between case 3 and case 6 p-value: 0.303426
Reject null) Series between case 4 and case 5 p-value: 0.000000
Reject null) Series between case 4 and case 6 p-value: 0.000000
Accept null) Series between case 5 and case 6 p-value: 0.114050

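We have total 6 series, so conducted 15 comparisons.
Result: 8 rejects and 7 accepts
Therefore, Star Wars series have inconsistent quality.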
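
<p style='font-size:16px'> The two cells above are repeated for every franchise below, so they could be folded into one helper. This is only a sketch on synthetic ratings (not the real data). The function name pairwise_mannwhitney and the demo films are made up for illustration. I pass alternative='two-sided' explicitly because, as far as I know, the default of this argument has differed between SciPy releases.</p>

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats


def pairwise_mannwhitney(ratings, titles, alpha=0.005):
    """Run a two-sided Mann-Whitney U test on every pair of titles.

    Returns a list of (title_a, title_b, p_value, rejected) tuples.
    """
    results = []
    for a, b in combinations(titles, 2):
        # Drop missing ratings separately for each movie
        x = ratings[a].dropna().to_numpy()
        y = ratings[b].dropna().to_numpy()
        _, p = stats.mannwhitneyu(x, y, alternative='two-sided')
        results.append((a, b, p, bool(p < alpha)))
    return results


# Synthetic ratings (NOT the real data): three made-up films with missing values
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Film A': rng.choice([3.0, 3.5, 4.0, np.nan], size=300),
    'Film B': rng.choice([3.0, 3.5, 4.0, np.nan], size=300),
    'Film C': rng.choice([0.5, 1.0, 1.5, np.nan], size=300),
})

results = pairwise_mannwhitney(demo, list(demo.columns))
n_reject = sum(rejected for *_, rejected in results)
print('{0} comparisons, {1} rejects'.format(len(results), n_reject))
```

<p style='font-size:16px'> With the real data, a call such as pairwise_mannwhitney(dataset, starwars) would replace the per-franchise loops. A series with n movies gives n(n-1)/2 comparisons, for example 15 for the 6 Star Wars movies.</p>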

In [48]:
# For each series, create a new array that contains each series rating data
harry_test = []

for i in harry:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    harry_test.append(movie)

In [49]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(harry_test)):
    for j in range(i+1, len(harry_test)):
        u, p = stats.mannwhitneyu(harry_test[i], harry_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(harry_test), reject+accept, reject, accept))
print('Therefore, Harry Potter series have consistent quality.')

Accept null) Series between case 1 and case 2 p-value: 0.461328
Accept null) Series between case 1 and case 3 p-value: 0.158534
Accept null) Series between case 1 and case 4 p-value: 0.098167
Accept null) Series between case 2 and case 3 p-value: 0.497306
Accept null) Series between case 2 and case 4 p-value: 0.361903
Accept null) Series between case 3 and case 4 p-value: 0.804052

We have total 4 series, so conducted 6 comparisons.
Result: 0 rejects and 6 accepts
Therefore, Harry Potter series have consistent quality.


In [50]:
# For each series, create a new array that contains each series rating data
matrix_test = []

for i in matrix:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    matrix_test.append(movie)

In [51]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(matrix_test)):
    for j in range(i+1, len(matrix_test)):
        u, p = stats.mannwhitneyu(matrix_test[i], matrix_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(matrix_test), reject+accept, reject, accept))
print('Therefore, Matrix series have inconsistent quality.')

Accept null) Series between case 1 and case 2 p-value: 0.249811
Reject null) Series between case 1 and case 3 p-value: 0.000000
Reject null) Series between case 2 and case 3 p-value: 0.000000

We have total 3 series, so conducted 3 comparisons.
Result: 2 rejects and 1 accepts
Therefore, Matrix series have inconsistent quality.


In [52]:
# For each series, create a new array that contains each series rating data
indiana_test = []

for i in indiana:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    indiana_test.append(movie)

In [53]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(indiana_test)):
    for j in range(i+1, len(indiana_test)):
        u, p = stats.mannwhitneyu(indiana_test[i], indiana_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(indiana_test), reject+accept, reject, accept))
print('Therefore, Indiana Jones series have inconsistent quality.')

Accept null) Series between case 1 and case 2 p-value: 0.307084
Reject null) Series between case 1 and case 3 p-value: 0.000603
Reject null) Series between case 1 and case 4 p-value: 0.000178
Reject null) Series between case 2 and case 3 p-value: 0.000018
Accept null) Series between case 2 and case 4 p-value: 0.005199
Reject null) Series between case 3 and case 4 p-value: 0.000000

We have total 4 series, so conducted 6 comparisons.
Result: 4 rejects and 2 accepts
Therefore, Indiana Jones series have inconsistent quality.


In [54]:
# For each series, create a new array that contains each series rating data
jurassic_test = []

for i in jurassic:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    jurassic_test.append(movie)

In [55]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(jurassic_test)):
    for j in range(i+1, len(jurassic_test)):
        u, p = stats.mannwhitneyu(jurassic_test[i], jurassic_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(jurassic_test), reject+accept, reject, accept))
print('Therefore, Jurassic Park series have inconsistent quality.')

Reject null) Series between case 1 and case 2 p-value: 0.000275
Reject null) Series between case 1 and case 3 p-value: 0.000674
Reject null) Series between case 2 and case 3 p-value: 0.000000

We have total 3 series, so conducted 3 comparisons.
Result: 3 rejects and 0 accepts
Therefore, Jurassic Park series have inconsistent quality.


In [56]:
# For each series, create a new array that contains each series rating data
pirate_test = []

for i in pirate:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    pirate_test.append(movie)

In [57]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(pirate_test)):
    for j in range(i+1, len(pirate_test)):
        u, p = stats.mannwhitneyu(pirate_test[i], pirate_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(pirate_test), reject+accept, reject, accept))
print('Therefore, Pirates of the Caribbean series have inconsistent quality.')

Accept null) Series between case 1 and case 2 p-value: 0.251478
Reject null) Series between case 1 and case 3 p-value: 0.000009
Reject null) Series between case 2 and case 3 p-value: 0.001758

We have total 3 series, so conducted 3 comparisons.
Result: 2 rejects and 1 accepts
Therefore, Pirates of the Caribbean series have inconsistent quality.


In [58]:
# For each series, create a new array that contains each series rating data
toy_test = []

for i in toy:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    toy_test.append(movie)

In [59]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(toy_test)):
    for j in range(i+1, len(toy_test)):
        u, p = stats.mannwhitneyu(toy_test[i], toy_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(toy_test), reject+accept, reject, accept))
print('Therefore, Toy Story series have inconsistent quality.')

Reject null) Series between case 1 and case 2 p-value: 0.000084
Reject null) Series between case 1 and case 3 p-value: 0.000006
Accept null) Series between case 2 and case 3 p-value: 0.539633

We have total 3 series, so conducted 3 comparisons.
Result: 2 rejects and 1 accepts
Therefore, Toy Story series have inconsistent quality.


In [60]:
# For each series, create a new array that contains each series rating data
batman_test = []

for i in batman:
    movie = dataset[i]
    movie = np.array(movie[pd.notnull(movie)])
    batman_test.append(movie)

In [61]:
# Compare every pair of movies in series
reject, accept = 0, 0
for i in range(len(batman_test)):
    for j in range(i+1, len(batman_test)):
        u, p = stats.mannwhitneyu(batman_test[i], batman_test[j])
        if p < 0.005:
            print('Reject null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            reject += 1
        else:
            print('Accept null) Series between case {0} and case {1} p-value: {2:.6f}'.format(i+1, j+1, p))
            accept += 1
print()
print('We have total {0} series, so conducted {1} comparisons.\nResult: {2} rejects and {3} accepts'
      .format(len(batman_test), reject+accept, reject, accept))
print('Therefore, Batman series have inconsistent quality.')

Reject null) Series between case 1 and case 2 p-value: 0.000000
Reject null) Series between case 1 and case 3 p-value: 0.000000
Reject null) Series between case 2 and case 3 p-value: 0.000000

We have total 3 series, so conducted 3 comparisons.
Result: 3 rejects and 0 accepts
Therefore, Batman series have inconsistent quality.


<p style='font-weight:bold; font-size:18px'>Extra Credit: Tell us something interesting and true (supported by a significance test of some kind) about the movies in this dataset that is not already covered by the questions above [for 5% of the grade score].</p>

<p style='font-weight:bold; font-size:18px'> Extra - (1) </p>
<p style='font-size:16px'> 'Scream (1996)' was rated higher by people who enjoy watching horror movies than by those who don't. So, a viewer's enjoyment of horror movies is associated with how they rate the horror movie 'Scream (1996)'. (This may seem obvious even before testing, but it is interesting to see the actual result!) </p>

<p style='font-weight:bold; font-size:18px'> Explanation) </p>
<p style='font-size:16px'> I can check this by conducting a null hypothesis test.</p>
<p style='font-size:16px'> Hypothesis: 'Scream (1996)' is rated higher by people who enjoy watching horror movies than by those who don't. </p>
<p style='font-size:16px'> Null hypothesis: 'Scream (1996)' is not rated higher by people who enjoy watching horror movies than by those who don't. </p><br>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'>  I divided the viewers who rated 'Scream (1996)' into people who enjoy watching horror movies and people who don't. Then I conducted a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing 2 independent groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> The p-value (4.5909393543455896e-08) was smaller than the significance level a=0.005.</p>

<p style='font-size:16px'> So, I rejected the null hypothesis. In plain English, 'Scream (1996)' was rated higher by people who enjoy watching horror movies than by those who don't.</p>

In [62]:
# Checking distribution of column 'enjoy watching horror movies'
dataset.iloc[:, 409].value_counts()

1.0    316
5.0    269
4.0    253
2.0    140
3.0    114
Name: I enjoy watching horror movies, dtype: int64

In [63]:
# Split the 'Scream (1996)' ratings into people who enjoy watching horror movies and people who don't
# I divide the two groups at the median response (3.0) and exclude 3.0 itself
# Null ratings are dropped from each group before comparing

horror_enjoy = dataset[dataset.iloc[:, 409] > 3.0]['Scream (1996)']
horror_enjoy = horror_enjoy[pd.notnull(horror_enjoy)]

horror_dont_enjoy = dataset[dataset.iloc[:, 409] < 3.0]['Scream (1996)']
horror_dont_enjoy = horror_dont_enjoy[pd.notnull(horror_dont_enjoy)]

In [64]:
# The p-value (4.5909393543455896e-08) is smaller than the significance level a=0.005,
# so I rejected the null hypothesis. In plain English, 'Scream (1996)' was rated higher by people
# who enjoy watching horror movies than by those who don't.

u1, p1 = stats.mannwhitneyu(horror_enjoy, horror_dont_enjoy)
u1, p1

(13535.5, 4.5909393543455896e-08)
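
<p style='font-size:16px'> Because the hypothesis above is directional ("rated higher"), a one-sided test is one way to match the test to the claim. The sketch below uses synthetic ratings (not the survey data) and passes alternative='greater' to scipy.stats.mannwhitneyu. It also prints the group medians, since the p-value alone does not show the direction of the difference. The variable names and values are assumptions for illustration only.</p>

```python
import numpy as np
from scipy import stats

# Synthetic ratings (NOT the real survey data)
rng = np.random.default_rng(1)
enjoy = rng.choice([2.5, 3.0, 3.5, 4.0], size=150)
dont_enjoy = rng.choice([1.0, 1.5, 2.0, 2.5], size=150)

# alternative='greater' tests whether the first sample tends to be larger
u, p = stats.mannwhitneyu(enjoy, dont_enjoy, alternative='greater')

print('median (enjoy):', np.median(enjoy))
print('median (do not enjoy):', np.median(dont_enjoy))
print('reject null at a=0.005:', p < 0.005)
```

<p style='font-size:16px'> On the real data, the equivalent call would be stats.mannwhitneyu(horror_enjoy, horror_dont_enjoy, alternative='greater'). I have not run this, so I make no claim about its result.</p>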

<p style='font-weight:bold; font-size:18px'> Extra - (2) </p>
<p style='font-size:16px'> Let's assume that people who often cry during movies tend to rate movies differently than people who don't. </p>
<p style='font-size:16px'> If we call this the "crying effect", the proportion of movies rated differently because of the "crying effect" is 0.0425. </p>

<p style='font-weight:bold; font-size:18px'> Explanation) </p>
<p style='font-size:16px'> I will apply the null hypothesis test below to each movie and calculate the proportion of movies that have a small p-value. If the p-value is small, I can conclude that the given movie is rated differently by viewers who cry during movies and viewers who don't.</p>

<p style='font-size:16px'> Hypothesis: The given movie is rated differently by people who cry during movies and people who don't.</p>
<p style='font-size:16px'> Null hypothesis: The given movie is not rated differently by people who cry during movies and people who don't.</p>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'>  I divided viewers into people who have cried during a movie and people who haven't. Then, for each movie, I conducted a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing 2 independent groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> By classifying the p-values against the significance level a=0.005, I got 17 movies with small p-values (p < 0.005) and 383 movies with larger p-values.</p>

<p style='font-size:16px'> Therefore, the proportion of movies rated differently because of the "crying effect" is 17/400 = 0.0425.</p>

In [66]:
# Checking distribution of column 'cried during a movie'
dataset.iloc[:, 464].value_counts()

4.0    343
5.0    273
3.0    219
6.0     98
2.0     92
1.0     55
Name: I have cried during a movie, dtype: int64

In [67]:
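# I split the viewers around the response 3.0 and exclude 3.0 itself
# (note: per the value_counts above, the median response in this column is 4.0, not 3.0)
# Responses of 6.0 are above the 1-5 range; I treat them as high responses

# Filter by people who cry during movies (> 3.0) and people who don't (< 3.0)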
movies_cry = dataset[dataset.iloc[:, 464] > 3.0].iloc[:, :400]
movies_dont_cry = dataset[dataset.iloc[:, 464] < 3.0].iloc[:, :400]

In [68]:
# small: reject null hypothesis, so the movie shows the effect
# big: don't reject null hypothesis, so no effect is detected
# len(movies_cry.columns) == len(movies_dont_cry.columns) == 400

# iterate over each movie and calculate the p-value
# for each group's ratings, drop null values element-wise and compare
# if the p-value is small (p < 0.005), add count to small
# if the p-value is big (p >= 0.005), add count to big

small = 0
big = 0

for i in range(len(movies_cry.columns)):
    cried = movies_cry.iloc[:,i]
    cried = cried[pd.notnull(cried)]
    
    not_cried = movies_dont_cry.iloc[:,i]
    not_cried = not_cried[pd.notnull(not_cried)]
    
    u1, p1 = stats.mannwhitneyu(cried, not_cried)
    
    if p1 < 0.005:
        small += 1
    else:
        big += 1

In [69]:
# Calculate proportion
proportion = small / (small+big)
print('small p-values: {0}, big p-values: {1}, proportion:{2}'.format(small, big, proportion))

small p-values: 17, big p-values: 383, proportion:0.0425
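
<p style='font-size:16px'> The per-movie loop above can also be written as a small reusable function, since the same pattern is used again in Extra - (3). This is a minimal sketch on synthetic data (not the real ratings). The name proportion_significant and the columns m0 to m3 are hypothetical.</p>

```python
import numpy as np
import pandas as pd
from scipy import stats


def proportion_significant(group_a, group_b, alpha=0.005):
    """Share of columns whose two groups differ (two-sided Mann-Whitney U)."""
    n_small = 0
    for col in group_a.columns:
        # Drop missing ratings separately in each group
        x = group_a[col].dropna()
        y = group_b[col].dropna()
        _, p = stats.mannwhitneyu(x, y, alternative='two-sided')
        n_small += int(p < alpha)
    return n_small / len(group_a.columns)


# Synthetic example: 4 made-up movies, only 'm0' is shifted between groups
rng = np.random.default_rng(2)
cols = ['m0', 'm1', 'm2', 'm3']
a = pd.DataFrame(rng.choice([2.0, 2.5, 3.0], size=(200, 4)), columns=cols)
b = pd.DataFrame(rng.choice([2.0, 2.5, 3.0], size=(200, 4)), columns=cols)
a['m0'] = a['m0'] + 1.0  # make the two groups clearly differ on one movie

prop = proportion_significant(a, b)
print('proportion of movies with small p-values:', prop)
```

<p style='font-size:16px'> With the real data, proportion_significant(movies_cry, movies_dont_cry) would cover cells In [68] and In [69] in one call.</p>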


<p style='font-weight:bold; font-size:18px'> Extra - (3) </p>
<p style='font-size:16px'> Let's assume that people who give answers above the allowed limit tend to rate movies differently than people who don't. </p>
<p style='font-size:16px'> If we call this the "outlier effect", the proportion of movies rated differently because of the "outlier effect" is 0.05.</p>

<p style='font-weight:bold; font-size:18px'> Explanation) </p>
<p style='font-size:16px'> I checked which questions received invalid answers (that is, answers above the 1.0 ~ 5.0 range). Then I selected the column with the most invalid answers (outliers).</p>
<p style='font-size:16px'> Then I divided the viewers into an outlier group and a normal group.</p><br>


<p style='font-size:16px'> I will apply the null hypothesis test below to each movie and calculate the proportion of movies that have a small p-value. If the p-value is small, I can conclude that the given movie is rated differently by viewers in the outlier group and viewers in the normal group.</p>

<p style='font-size:16px'> Hypothesis: The given movie is rated differently by people who answered out of range and people who answered normally.</p>
<p style='font-size:16px'> Null hypothesis: The given movie is not rated differently by people who answered out of range and people who answered normally.</p>
<p style='font-size:16px'> I will assume that the null hypothesis is true.</p><br>

<p style='font-size:16px'>  I divided viewers into people who answered above the limit (above 5.0) for [column 470: The emotions on the screen "rub off" on me] and people who answered below 5.0. Then, for each movie, I conducted a Mann-Whitney U test and calculated the p-value. I am using the Mann-Whitney U test because I am comparing 2 independent groups without assuming a parametric distribution, and movie ratings are ordinal data.</p><br>

<p style='font-size:16px'> By classifying the p-values against the significance level a=0.005, I got 20 movies with small p-values (p < 0.005) and 380 movies with larger p-values.</p>

<p style='font-size:16px'> Therefore, the proportion of movies rated differently because of the "outlier effect" is 20/400 = 0.05.</p>

In [70]:
# Selecting column 470, since it has the most outliers (column 470 has 226 outliers)
for i in range(400, 474):
    outlier = dataset.iloc[:, i][dataset.iloc[:, i] > 5.0].count()
    if outlier > 0:
        print('Column {0}: {1:.50} ~ has {2} outliers'.format(i, dataset.columns[i], outlier))

Column 464: I have cried during a movie ~ has 98 outliers
Column 465: I have trouble following the story of a movie ~ has 10 outliers
Column 466: I have trouble remembering the story of a movie a  ~ has 18 outliers
Column 467: When watching a movie I cheer or shout or talk or  ~ has 46 outliers
Column 468: When watching a movie I feel like the things on th ~ has 41 outliers
Column 469: As a movie unfolds I start to have problems keepin ~ has 15 outliers
Column 470: The emotions on the screen "rub off" on me - for i ~ has 226 outliers
Column 471: When watching a movie I get completely immersed in ~ has 164 outliers
Column 472: Movies change my position on social economic or po ~ has 30 outliers
Column 473: When watching movies things get so intense that I  ~ has 14 outliers
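
<p style='font-size:16px'> The same count can be sketched without an explicit loop by comparing the whole table at once. The small table below is made up (not the real survey data), and the names survey and outlier_counts are hypothetical.</p>

```python
import pandas as pd

# Small made-up survey table (NOT the real data); valid answers are 1-5
survey = pd.DataFrame({
    'q1': [1, 5, 6, 3, 6],
    'q2': [2, 2, 4, 5, 1],
    'q3': [6, 6, 6, 1, None],
})

# Comparisons with NaN evaluate to False, so missing answers are not counted
outlier_counts = (survey > 5).sum()

# Keep only columns with at least one out-of-range answer, largest first
outlier_counts = outlier_counts[outlier_counts > 0].sort_values(ascending=False)

print(outlier_counts)
print('Column with the most out-of-range answers:', outlier_counts.idxmax())
```

<p style='font-size:16px'> On the real data, the equivalent expression would be (dataset.iloc[:, 400:474] > 5.0).sum().</p>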


In [71]:
# Checking distribution of outlier column
dataset.iloc[:, 470].value_counts()

5.0    348
4.0    303
6.0    226
3.0    120
1.0     44
2.0     34
Name: The emotions on the screen "rub off" on me - for instance if something sad is happening I get sad or if something frightening is happening I get scared, dtype: int64

In [72]:
# Outlier group: answers above 5.0; normal group: answers below 5.0
# (answers equal to 5.0 fall in neither group with these filters)
rating_outlier = dataset[dataset.iloc[:, 470] > 5.0].iloc[:, :400]
rating_normal = dataset[dataset.iloc[:, 470] < 5.0].iloc[:, :400]

In [73]:
# small: reject null hypothesis, so the movie shows the effect
# big: don't reject null hypothesis, so no effect is detected
# len(rating_outlier.columns) == len(rating_normal.columns) == 400

# iterate over each movie and calculate the p-value
# for each group's ratings, drop null values element-wise and compare
# if the p-value is small (p < 0.005), add count to small
# if the p-value is big (p >= 0.005), add count to big

small = 0
big = 0

for i in range(len(rating_outlier.columns)):
    outlier = rating_outlier.iloc[:,i]
    outlier = outlier[pd.notnull(outlier)]
    
    normal = rating_normal.iloc[:,i]
    normal = normal[pd.notnull(normal)]
    
    u1, p1 = stats.mannwhitneyu(outlier, normal)
    
    if p1 < 0.005:
        small += 1
    else:
        big += 1

In [74]:
# Calculate proportion
proportion = small / (small+big)
print('small p-values: {0}, big p-values: {1}, proportion:{2}'.format(small, big, proportion))

small p-values: 20, big p-values: 380, proportion:0.05
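
<p style='font-size:16px'> To put the counts of 17 and 20 small p-values in context, it helps to ask how many rejections 400 tests at a=0.005 would produce if every null hypothesis were true. The sketch below is a rough check only. It treats the 400 tests as independent, which is an assumption (the same viewers rate many movies), and it was not part of the original analysis.</p>

```python
from scipy import stats

n_tests = 400   # one test per movie
alpha = 0.005   # per-test significance level

# Expected number of rejections if every null hypothesis were true
expected_false_positives = n_tests * alpha
print('expected rejections under the null:', expected_false_positives)

# Chance of seeing at least k rejections under that assumption,
# treating the tests as independent (a simplification)
for k in (17, 20):
    tail = stats.binom.sf(k - 1, n_tests, alpha)
    print('P(at least {0} rejections) = {1:.3g}'.format(k, tail))
```

<p style='font-size:16px'> About 2 rejections would be expected by chance alone, so 17 or 20 rejections is more than chance would typically give under these assumptions.</p>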
