# Purpose of this notebook

To test a simple claim: **"The mean of the price of the books of fiction genre is less than that of their non-fiction counterparts"**

This notebook does not define or explain what hypothesis testing is, nor the somewhat esoteric terms associated with it. But, both as a learner and a teacher, I understand the importance of the premises (building blocks) of any concept. So if any of my generous audience would like me to explain or dwell more on the topic, I'll gladly do that in a new notebook. Feel free to reach me through the comments.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
from matplotlib import style
import random
import math
style.use('ggplot')

In [None]:
df = pd.read_csv('/kaggle/input/amazon-top-50-bestselling-books-2009-2019/bestsellers with categories.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df['Genre'] = df['Genre'].map(lambda x: 1 if x == 'Fiction'
                             else 0)

In [None]:
df.head()

# 1. Cumulative Distribution Function

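Defining an empirical cumulative distribution function (CDF): for a value `x`, it returns the fraction of the sample that is less than or equal to `x`.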

In [None]:
def EvalCdf(sample, x):
    count = 0
    for i in sample:
        if i <= x:
            count += 1
    prob = count / len(sample)
    return prob
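As a rough sketch, the same empirical CDF can be computed without a Python loop when the sample is already sorted. The function name `eval_cdf_sorted` and the prices below are made up for illustration; they are not part of the dataset.

```python
import numpy as np

def eval_cdf_sorted(sorted_sample, x):
    """Fraction of values in an ascending-sorted sample that are <= x."""
    # side='right' returns the number of elements that are <= x
    return np.searchsorted(sorted_sample, x, side='right') / len(sorted_sample)

# Hypothetical prices, sorted in ascending order
prices = np.sort(np.array([4, 8, 8, 11, 15, 20]))

print(eval_cdf_sorted(prices, 8))   # 3 of the 6 values are <= 8
print(eval_cdf_sorted(prices, 20))  # every value is <= 20
```

This should agree with `EvalCdf` for any sorted input; it only changes how the counting is done.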

# 2. Categorizing the data into fiction and non fiction and plotting the cdfs of the price of the books belonging to these genres

In [None]:
fiction = df[df.Genre == 1]
non_fiction = df[df.Genre == 0]

In [None]:
price_fiction = sorted(fiction['Price'].values)
cdf_price_fiction = [EvalCdf(price_fiction, x) for x in price_fiction]

price_nonfiction = sorted(non_fiction['Price'].values)
cdf_price_nonfiction = [EvalCdf(price_nonfiction, x) for x in price_nonfiction]

In [None]:
plt.figure(figsize = (15, 8))

plt.plot(price_fiction, cdf_price_fiction, label = 'Fiction price')
plt.plot(price_nonfiction, cdf_price_nonfiction, label = 'Non Fiction price')

plt.legend()
plt.show()

**What does the graph tell us?**

1. The shape of both curves resembles the CDF of an exponential distribution.
2. Non-fiction books tend to be priced slightly higher than their fiction counterparts.
3. More than 90% of fiction books cost 20 units (probably USD) or less, whereas only about 85% of non-fiction books fall in the same range.
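Percentages like the ones in point 3 can also be computed directly instead of being read off the plot. The sketch below uses a small made-up table (`toy`) with the same `Genre` and `Price` column names as the dataset; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical data: Genre 1 = fiction, 0 = non-fiction
toy = pd.DataFrame({
    'Genre': [1, 1, 1, 1, 0, 0, 0, 0],
    'Price': [5, 9, 12, 25, 8, 14, 22, 30],
})

# Share of books priced at 20 or less, per genre
share_at_most_20 = (toy['Price'] <= 20).groupby(toy['Genre']).mean()
print(share_at_most_20)
```

Replacing `toy` with `df` should give the corresponding shares for the real data.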

# 3. Using Estimation to find out which is a better estimator for estimating the mean of the price of both fiction and non fiction books - sample mean or median?

**Why do we do it?**

1. Presence of outliers: the sample mean is usually a good estimator of the mean, but we cannot rule out outliers in our data. That is, some books may be priced much higher or much lower than one would expect, and in that case the sample mean would be misleading.
2. When outliers are present, the median is a more robust estimator.
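A tiny illustration of the point about outliers, using made-up prices rather than the dataset: adding one unusually expensive book moves the mean a lot but barely moves the median.

```python
import numpy as np

# Hypothetical prices without and with a single outlier
prices = np.array([8, 9, 10, 11, 12])
with_outlier = np.append(prices, 105)

print('mean:  ', np.mean(prices), '->', np.mean(with_outlier))
print('median:', np.median(prices), '->', np.median(with_outlier))
```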

**How do we do it?**

1. Run the experiment `m` times, each time drawing `n` elements at random from the data.
2. Compute the mean and the median of each draw.
3. Collect the means and the medians in separate lists.
4. Compute the RMSE (root mean squared error) of both the means and the medians against the mean of the full data.
5. If the RMSE of the means is lower than that of the medians, use the sample mean as the estimator; otherwise use the median.

***The code below should make this clearer***

In [None]:
print('mean of price of fiction books: ', fiction.Price.mean())
print('mean of price of non fiction books: ', non_fiction.Price.mean())

In [None]:
print('median of price of fiction books: ', fiction.Price.median())
print('median of price of non fiction books: ', non_fiction.Price.median())

In [None]:
def Estimate(df=fiction, mu=0, n=7, m=1000):
    # Draw m random samples of size n (without replacement) from df.Price and
    # compare how far the sample means and sample medians fall from mu
    prices = list(df.Price.values)
    means = []
    medians = []
    
    for _ in range(m):
        xs = random.sample(prices, n)
        means.append(np.mean(xs))
        medians.append(np.median(xs))
        
    print('rmse of xbar: ', RMSE(means, mu))
    print('rmse of median: ', RMSE(medians, mu))
    
def RMSE(estimates, actual):
    # Root mean squared error of the estimates around the reference value
    e2 = [(estimate - actual) ** 2 for estimate in estimates]
    mse = np.mean(e2)
    return math.sqrt(mse)

In [None]:
Estimate(mu = fiction.Price.mean())

In [None]:
Estimate(df = non_fiction, mu = non_fiction.Price.mean())
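Because the sampling is unseeded, the two calls above print slightly different numbers on every run. A possible reproducible variant is sketched below. It is an assumption-laden illustration, not the notebook's method: `estimate_rmse` is a hypothetical helper, and `toy_prices` is synthetic data standing in for a price column.

```python
import numpy as np

def estimate_rmse(prices, mu, n=7, m=1000, seed=0):
    """Return (rmse_of_means, rmse_of_medians) over m samples of size n."""
    rng = np.random.default_rng(seed)  # fixed seed makes the run repeatable
    prices = np.asarray(prices, dtype=float)
    means = np.empty(m)
    medians = np.empty(m)
    for i in range(m):
        # Sample without replacement, as random.sample does
        xs = rng.choice(prices, size=n, replace=False)
        means[i] = xs.mean()
        medians[i] = np.median(xs)

    def rmse(estimates):
        return float(np.sqrt(np.mean((estimates - mu) ** 2)))

    return rmse(means), rmse(medians)

# Synthetic, skewed "prices" (hypothetical; not the Kaggle data)
toy_prices = np.random.default_rng(42).exponential(scale=12.0, size=300)
rmse_mean, rmse_median = estimate_rmse(toy_prices, mu=toy_prices.mean())
print('rmse of xbar:  ', rmse_mean)
print('rmse of median:', rmse_median)
```

Returning the two values instead of printing them makes it easier to compare runs or to try other values of `n` and `m`.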

**What do the outcomes tell us?**

1. For fiction books, the RMSE of xbar (the sample mean) is higher than that of the median.
2. This suggests that the median is the better estimator here, so we'll use the median in place of the mean for this distribution.
3. The same holds for the other distribution (non-fiction books).

# 4. Answer/Justify the question/statement: "The mean of the price of the books of fiction genre is less than that of their non-fiction counterparts"

***To answer that question, we need to assume a "Null Hypothesis"***

**What is a null hypothesis?**

The null hypothesis is the assumption that the effect we are looking for does not exist, i.e. the opposite of what we are trying to show. [Wikipedia definition of Null Hypothesis](https://en.wikipedia.org/wiki/Null_hypothesis)

**Null Hypothesis** - *The prices of fiction and non-fiction books come from the same distribution. To show that fiction books are on average less expensive than their non-fiction counterparts, we first need to show that the two distributions differ.*

**Wait a minute! Haven't we done that before by plotting the cdfs of both the distributions?**

Yes, but now we need to check whether that conclusion is likely to hold for the larger population, or whether the apparent effect could have appeared by chance!

In [None]:
class HypothesisTest(object):
    # Base class for simulation-based hypothesis tests; subclasses supply
    # TestStatistic and RunModel (and optionally MakeModel)
    
    def __init__(self, data):
        self.data = data
        self.MakeModel()
        self.actual = self.TestStatistic(data)
        
    def PValue(self, iters=1000):
        # Fraction of simulated test statistics at least as large as the observed one
        self.test_stats = [self.TestStatistic(self.RunModel()) for _ in range(iters)]
        self.test_cdf = [EvalCdf(self.test_stats, x) for x in self.test_stats]
        count = sum(1 for x in self.test_stats if x >= self.actual)
        return count / iters
    
    def TestStatistic(self, data):
        raise NotImplementedError()
        
    def MakeModel(self):
        pass
    
    def RunModel(self):
        raise NotImplementedError()

In [None]:
class DiffMediansPermute(HypothesisTest):
    
    def TestStatistic(self, data):
        group1, group2 = data
        test_stat = abs(np.median(group1) - np.median(group2))
        return test_stat
    
    def MakeModel(self):
        group1, group2 = self.data
        self.n, self.m = len(group1), len(group2)
        self.pool = np.hstack((group1, group2))
        
    def RunModel(self):
        np.random.shuffle(self.pool)
        data = self.pool[:self.n], self.pool[self.n:]
        return data
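`DiffMediansPermute` uses the absolute difference of the medians, so it is a two-sided test: it asks whether the groups differ, not which one is lower. Since the claim under study is directional, a one-sided variant may be worth comparing. The sketch below is one possible way to write it; the function name, its parameters, and the two small groups are all hypothetical.

```python
import numpy as np

def one_sided_permutation_pvalue(group1, group2, stat=np.median,
                                 iters=1000, seed=0):
    """P-value for 'stat(group2) > stat(group1)' under random relabelling."""
    rng = np.random.default_rng(seed)
    group1 = np.asarray(group1, dtype=float)
    group2 = np.asarray(group2, dtype=float)
    n = len(group1)
    # Signed difference: positive when group2 is higher than group1
    observed = stat(group2) - stat(group1)
    pool = np.concatenate([group1, group2])
    count = 0
    for _ in range(iters):
        shuffled = rng.permutation(pool)
        diff = stat(shuffled[n:]) - stat(shuffled[:n])
        if diff >= observed:
            count += 1
    return count / iters

# Made-up prices: the second group is clearly more expensive
cheap = [5, 6, 7, 8, 9, 10, 11, 12]
pricey = [15, 16, 17, 18, 19, 20, 21, 22]
p = one_sided_permutation_pvalue(cheap, pricey)
print('one-sided p-value:', p)
```

With the notebook's data this would be called with `fiction.Price.values` as `group1` and `non_fiction.Price.values` as `group2`; passing `stat=np.mean` would test the means instead of the medians.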

***Note that we used "DiffMediansPermute" rather than "DiffMeansPermute" because we saw that the median is the better estimator for both of our distributions***

***In case one is interested in knowing more about the functions, please comment***

In [None]:
data = fiction.Price.values, non_fiction.Price.values
ht = DiffMediansPermute(data)
pvalue = ht.PValue()

In [None]:
print('the p-value under the null hypothesis is: ', pvalue)

**What does the P Value tell us?**

1. If it is small (conventionally less than 5%), the observed difference would be unlikely if the null hypothesis were true, so we reject the null hypothesis in favour of our assumption.
2. If it is more than 10%, we cannot reject the null hypothesis; to be more precise and technical, the apparent effect is not "statistically significant". This does not prove that our assumption is false.
3. If it is between 5% and 10%, the result is borderline and we cannot conclude much from it. Hypothesis testing only tells us whether there is strong evidence for an effect; it does not always give a clear-cut answer.

**What does the outcome tell us?**

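The outcome is between 0 and 1%, which means that the difference in price between the two genres is statistically significant. Because the test statistic is the absolute difference of the medians, the test itself only shows that the prices differ; the direction of the difference comes from the sample statistics computed earlier. Taken together, this is strong evidence that our claim (i.e. "The mean of the price of the books of fiction genre is less than that of their non-fiction counterparts") also holds for the larger population.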