## Tasty Statistics

Recently I started a Statistics in Python course on Coursera. Instead of only learning from the examples provided in the course, I wanted to explore and apply the methods on my own.
Most of the statistics examples I come across cover serious topics that require complete accuracy when applying statistical methods, like medical tests, systematic racism and politics. As a newbie, I am afraid of jumping to conclusions too quickly and spreading an opinion without having my tools sharpened properly.
So I wanted to find data that doesn't hurt anyone and is so subjective that everybody can have an opinion about it. Ramen has recently become one of my favorite dishes, so I thought, why not look at that?
So here it comes: ramen statistics with confidence.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Data Prep

In [None]:
# Importing the data
ramen_df = pd.read_csv('/kaggle/input/ramen-ratings/ramen-ratings.csv')

In [None]:
# Looking at the data
ramen_df

In [None]:
ramen_df.dtypes

In [None]:
ramen_df['Stars'].unique()

In [None]:
ramen_df[ramen_df['Stars'] == 'Unrated'].count()

I got an error when I tried to convert Stars to float: the dataset contains 3 occurrences of "Unrated" in the Stars column. With so few affected rows, I feel comfortable removing them.

In [None]:
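# .copy() gives an independent DataFrame, so reassigning the Stars column later does not trigger a SettingWithCopyWarning
ramen_df = ramen_df[ramen_df['Stars'] != 'Unrated'].copy()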

In [None]:
ramen_df["Stars"] = ramen_df.Stars.astype(float)
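An alternative to filtering out "Unrated" by name is to let pandas coerce anything non-numeric to NaN and then drop those rows. The following is a minimal sketch on a made-up toy column (the values and the name `toy_df` are assumptions for illustration, not the real data set):

```python
import pandas as pd

# Toy data standing in for the 'Stars' column (values are made up)
toy_df = pd.DataFrame({"Stars": ["3.75", "5", "Unrated", "2.5"]})

# errors="coerce" turns anything non-numeric (here "Unrated") into NaN
toy_df["Stars"] = pd.to_numeric(toy_df["Stars"], errors="coerce")

# Drop the rows that could not be parsed
toy_df = toy_df.dropna(subset=["Stars"])
print(toy_df["Stars"].tolist())
```

This approach would also catch any other unexpected text values, at the cost of silently discarding them, so it is worth checking how many rows get dropped.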

In [None]:
ramen_df['Country'].unique()

In [None]:
count_df = ramen_df.groupby('Country').count()["Brand"]
count_df.sort_values(ascending=False)

Let's have a look at the "Top Ten" column. We see that the rank values are not complete for all the years. That can mean that our data set is not complete and we need to be careful when drawing conclusions from this column. 

In [None]:
ramen_df[ramen_df['Top Ten'].notna()][['Country', 'Top Ten']].sort_values('Top Ten')

___________________________________________________________

So now we know our data set a bit better. 
**What can we actually explore here?**

I want to see if there is a significant difference in ramen ratings between the two big ramen producers, Japan (352 reviews) and USA (323 reviews). 

For that, I want to check whether the mean rating is the same for both countries.

Hypothesis: mean(japan_ratings) == mean(usa_ratings)


For that, I first extract the data for ramen from Japan and the USA.

In [None]:
japan_ramen_df = ramen_df[ramen_df['Country'] == 'Japan']
usa_ramen_df = ramen_df[ramen_df['Country'] == 'USA']

There are different styles of ramen, and I want to compare apples with apples. So I look at how many reviews there are for each style in the two countries and pick the most represented one. In this case, that is the style "Pack".

In [None]:
japan_ramen_df.groupby('Style')['Review #'].nunique()

In [None]:
usa_ramen_df.groupby('Style')['Review #'].nunique()

Bowl, Cup and Pack look like the ones we can look at, since they have a high number of reviews. 

In [None]:
japan_bowl_df = japan_ramen_df[japan_ramen_df['Style'] == 'Bowl']
japan_pack_df = japan_ramen_df[japan_ramen_df['Style'] == 'Pack']
japan_cup_df = japan_ramen_df[japan_ramen_df['Style']  == 'Cup']
japan_cup_df.head(2)

In [None]:
usa_bowl_df = usa_ramen_df[usa_ramen_df['Style'] == 'Bowl']
usa_pack_df = usa_ramen_df[usa_ramen_df['Style'] == 'Pack']
usa_cup_df = usa_ramen_df[usa_ramen_df['Style']  == 'Cup']
usa_cup_df.head(2)

# The Analysis

Null hypothesis: 

There is no difference between mean ratings
mean(Japan) - mean(USA) = 0


Alternative hypothesis: 

There is a difference between mean ratings
mean(Japan) - mean(USA) != 0


We assume that the two samples (ratings in Japan and ratings in the USA) are random and independent from each other.

We set the confidence level to **95%**, which means that our significance level is **0.05**.
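To make the link between the significance level and the decision rule concrete, here is a small sketch that computes the critical value of a two-sided t-test. The degrees of freedom (`example_df = 200`) are a hypothetical number chosen only for illustration; the real value depends on the sample sizes computed later.

```python
from scipy import stats

alpha = 0.05          # significance level for a 95% confidence level
example_df = 200      # hypothetical degrees of freedom, for illustration only

# Two-sided test: split alpha evenly across both tails
critical_value = stats.t.ppf(1 - alpha / 2, df=example_df)
print(critical_value)
```

With a two-sided test, the null hypothesis would be rejected when the absolute value of the test statistic exceeds this critical value.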



## Boxplot of ratings Japan vs USA 

In [None]:
import matplotlib.pyplot as plt

fig1, ax1 = plt.subplots()
ax1.set_title('Ratings Ramen Pack')
labels = ['Japan', 'USA']
ax1.boxplot([japan_pack_df['Stars'],usa_pack_df['Stars']], labels=labels)

## Distribution of ratings 

In [None]:
fig, axs = plt.subplots(1, 2, sharey=True, tight_layout=True)

# We can set the number of bins with the `bins` kwarg
axs[0].hist(japan_pack_df['Stars'], bins=5)
axs[0].set_title('Ratings Ramen Pack Japan')
axs[1].hist(usa_pack_df['Stars'], bins=5)
axs[1].set_title('Ratings Ramen Pack USA')

In [None]:
japan_pack_df['Stars'].describe()

In [None]:
usa_pack_df['Stars'].describe()

In [None]:
japan_mean_ratings = japan_pack_df['Stars'].mean()
usa_mean_ratings = usa_pack_df['Stars'].mean()
print(japan_mean_ratings, usa_mean_ratings)

best_estimate = japan_mean_ratings - usa_mean_ratings
print(best_estimate)

Is that difference in means significantly different from zero?

The standard deviations are similar enough that we can use the pooled approach.
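One way to back up "similar enough" is to look at the ratio of the two standard deviations (a common rule of thumb treats a ratio below 2 as acceptable for pooling) and, optionally, to run Levene's test for equal variances. The sketch below uses synthetic samples, not the real ratings; the names `sample_a` and `sample_b` and their parameters are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two 'Stars' samples (not the real data)
sample_a = rng.normal(loc=3.8, scale=0.9, size=150)
sample_b = rng.normal(loc=3.6, scale=1.0, size=140)

s_a = sample_a.std(ddof=1)   # ddof=1 matches pandas' default for .std()
s_b = sample_b.std(ddof=1)
sd_ratio = max(s_a, s_b) / min(s_a, s_b)
print(f"ratio of standard deviations: {sd_ratio:.2f}")

# Levene's test: the null hypothesis is that both groups have equal variance
levene_stat, levene_p = stats.levene(sample_a, sample_b)
print(f"Levene p-value: {levene_p:.3f}")
```

If the ratio were large or Levene's test rejected equal variances, the unpooled (Welch) approach would be the safer choice.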

test_statistic = $\frac{\text{best estimate} - \text{null value}}{\text{estimated standard error}}$


**Pooled test statistic**

$\text{stderror}_{\text{estimated}} = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2 }{n_1 + n_2 - 2}} \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$



test_statistic = $\frac{\bar{x}_1 - \bar{x}_2 - 0}{\text{stderror}_{\text{estimated}}}$

In [None]:
n1 = japan_pack_df['Stars'].count()
n2 = usa_pack_df['Stars'].count()
print(n1, n2)

s1 = japan_pack_df['Stars'].std()
s2 = usa_pack_df['Stars'].std()
print(s1, s2)

test_statistic = (japan_mean_ratings - usa_mean_ratings)
test_statistic /= np.sqrt( ((n1 - 1)* s1**2 + (n2 -1)*s2**2) / (n1 + n2 -2) )
test_statistic /= np.sqrt(1/n1 + 1/n2)
test_statistic

If the null hypothesis were true, would a test statistic value of 2.05 be unusual enough to reject the null?

p-value: assuming the null hypothesis is true, it is the probability of observing a test statistic of 2.05 or more extreme

Since our alternative hypothesis is that the difference of the means is not equal to zero, we have to do a two-sided test. 

In [None]:
df = n1 + n2 - 2  # degrees of freedom: subtract 2 because two sample means are estimated

from scipy import stats

# Two-sided p-value: probability of a test statistic at least this extreme in either tail
p = 2 * (1 - stats.t.cdf(abs(test_statistic), df=df))
p

The two-sided p-value of about 0.04 (twice the one-sided tail probability of 0.02) is smaller than our significance level of 0.05. 
Therefore we can reject the null hypothesis in favor of the alternative hypothesis that 
the mean ratings are different for Ramen Packs from Japan and USA.  
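As a sanity check, the manual calculation can be compared with `scipy.stats.ttest_ind`, which runs the pooled two-sample t-test when `equal_var=True` and reports a two-sided p-value by default. The sketch below uses synthetic samples (`x1`, `x2` and their parameters are assumptions, not the real ratings) to show that both routes agree:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Synthetic stand-ins for the two rating samples (not the real data)
x1 = rng.normal(loc=3.9, scale=0.9, size=150)
x2 = rng.normal(loc=3.7, scale=0.9, size=140)

n1, n2 = len(x1), len(x2)
s1, s2 = x1.std(ddof=1), x2.std(ddof=1)

# Pooled standard error, as in the formula above
pooled_sd = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t_manual = (x1.mean() - x2.mean()) / (pooled_sd * np.sqrt(1 / n1 + 1 / n2))
# Two-sided p-value: both tails of the t distribution
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

# equal_var=True selects the pooled (Student) t-test
t_scipy, p_scipy = stats.ttest_ind(x1, x2, equal_var=True)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

On the real data, the same call would take `japan_pack_df['Stars']` and `usa_pack_df['Stars']` as its two arguments.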

# **Future work**
Is there a difference between ramen styles? 

In [None]:
usa_pack_df['Stars'].count()