We might want to know whether our sample and sub-samples are representative of diamonds in general. Moreover, we might want to draw some conclusions about the influence of certain diamond features on price. To that end, we propose that you perform two statistical tests.

# Challenge 3: Hypothesis testing

In [7]:
# imports
import pandas as pd

In [9]:
# the data set
df = pd.read_csv('../data/diamonds_train.csv')

## Test 1 - one sample vs constant hypothesis test.

We know from the available literature that the average diamond price is around 4000 USD. The aim is to test whether the prices in our sample are significantly different from the literature value. Draw some conclusions about the implications of your test results.

In [3]:
# import the ttest_1samp function from scipy.stats
from scipy.stats import ttest_1samp

In [12]:
# list of the diamond prices
prices_list = df['price'].to_list()

In [13]:
# use ttest_1samp to run the hypothesis test
ttest_1samp(prices_list, 4000)

Ttest_1sampResult(statistic=-3.604902369125729, pvalue=0.00031264532833074845)

In [18]:
# CONCLUSION: As the t statistic is negative, the mean of our sample is lower than 4000.
# The p-value (about 0.0003) is well below 0.05, so the difference between our sample mean and the constant is significant.
# However, significance says nothing about the size of the difference, which has to be read from the sample mean itself.

In [17]:
# check it by computing the mean price
df['price'].mean()

3928.444469163268
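
To make the statistic reported by `ttest_1samp` less of a black box, here is a minimal sketch of the one-sample t statistic computed by hand: the gap between the sample mean and the reference value, divided by the standard error of the mean. The helper `one_sample_t` and the `toy_prices` list are hypothetical and only illustrate the formula; they are not the notebook's data.

```python
import math
import statistics


def one_sample_t(sample, popmean):
    """Return the one-sample t statistic: (mean - popmean) / standard error."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)
    standard_error = sd / math.sqrt(n)
    return (mean - popmean) / standard_error


# Hypothetical toy prices, NOT taken from diamonds_train.csv
toy_prices = [3500, 4200, 3900, 4100, 3800]
t_stat = one_sample_t(toy_prices, 4000)
print(t_stat)  # negative, because the toy mean (3900) is below 4000
```

The sign of the statistic only tells us on which side of the reference value the sample mean lies; its magnitude grows with the sample size, which is why the mean itself should be inspected as well.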

## Test 2 - two independent samples.

Our sample includes diamonds with different features (carat, cut, color, clarity, etc.). It seems clear that carat plays an important role in price. However, it is less clear whether the prices of some "sub-groups" are significantly different from each other. These are the "sub-groups" that you might be suspicious about:

1. Sub-Test 1: Fair cut + color G vs. Fair cut + color I

2. Sub-Test 2: Good cut + color E vs. Good cut + color F

3. Sub-Test 3: Ideal cut + color D vs. Ideal cut + color E

4. Sub-Test 4: Premium cut + color D vs. Premium cut + color E

5. Sub-Test 5: Very Good cut + color I vs. Very Good cut + color J

6. Sub-Test 6: All cuts + color D vs. All cuts + color E
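
Since the six sub-tests listed above share the same shape (two colors compared within one cut, or across all cuts), the filtering and the test could be wrapped in a small helper. This is only a sketch: `compare_prices` and the `toy` DataFrame are hypothetical names introduced for illustration, and the toy prices are made up.

```python
import pandas as pd
from scipy.stats import ttest_ind


def compare_prices(data, cut, color_a, color_b):
    """Two-sample t-test on price for two colors; cut=None means all cuts."""
    subset = data if cut is None else data[data['cut'] == cut]
    prices_a = subset.loc[subset['color'] == color_a, 'price']
    prices_b = subset.loc[subset['color'] == color_b, 'price']
    return ttest_ind(prices_a, prices_b)


# Hypothetical toy data with the same column names as the notebook's df
toy = pd.DataFrame({
    'cut': ['Fair'] * 6,
    'color': ['G', 'G', 'G', 'I', 'I', 'I'],
    'price': [3000, 3500, 4000, 3100, 3600, 4100],
})

result = compare_prices(toy, 'Fair', 'G', 'I')
print(result.statistic, result.pvalue)
```

With the notebook's own `df`, a call such as `compare_prices(df, 'Fair', 'G', 'I')` would correspond to Sub-Test 1, and passing `cut=None` would cover the "all cuts" case of Sub-Test 6.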

### 1. Sub-Test 1: Fair cut + color G vs. Fair cut + color I

In [19]:
# as the samples are independent, we use ttest_ind
from scipy.stats import ttest_ind

In [34]:
# samples
faircut_colorg = df[(df['cut'] == 'Fair') & (df['color'] == 'G')]['price'].to_list()
faircut_colori = df[(df['cut'] == 'Fair') & (df['color'] == 'I')]['price'].to_list()

In [35]:
# testing the hypothesis: are the prices of these subgroups significantly different?
ttest_ind(faircut_colorg, faircut_colori)

Ttest_indResult(statistic=0.03552493926641288, pvalue=0.971680163699314)

In [None]:
# CONCLUSION: The t statistic tells us that the faircut_colorg mean is a little bit higher than the faircut_colori mean,
# but this difference is not significant because the p-value is greater than 0.05.
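
One caveat worth keeping in mind: by default `ttest_ind` assumes that both groups have equal variances. When the sub-groups differ in size and spread, Welch's variant (`equal_var=False`) may be a safer choice. The sketch below compares both on hypothetical toy samples; `group_a` and `group_b` are made-up values, not the notebook's sub-groups.

```python
from scipy.stats import ttest_ind

# Hypothetical toy samples with very different spreads and sizes
group_a = [3000, 3200, 3100, 2900, 3300]
group_b = [2500, 6000, 1800, 7200]

pooled = ttest_ind(group_a, group_b)                   # default: equal variances assumed
welch = ttest_ind(group_a, group_b, equal_var=False)   # Welch's t-test

print(pooled.statistic, pooled.pvalue)
print(welch.statistic, welch.pvalue)
```

If both variants lead to the same decision at the 0.05 level, the conclusion does not depend on the equal-variance assumption.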

### 2. Sub-Test 2: Good cut + color E vs. Good cut + color F

In [40]:
# samples
goodcut_colore = df[(df['cut'] == 'Good') & (df['color'] == 'E')]['price'].to_list()
goodcut_colorf = df[(df['cut'] == 'Good') & (df['color'] == 'F')]['price'].to_list()

In [41]:
# testing the hypothesis: are the prices of these subgroups significantly different?
ttest_ind(goodcut_colore, goodcut_colorf)

Ttest_indResult(statistic=-0.44021568469654665, pvalue=0.6598512677605672)

In [42]:
# CONCLUSION: the difference between the mean prices of these subgroups is not significant because the p-value is greater than 0.05.
# The negative t statistic indicates that the sample mean price of goodcut_colore is lower than that of goodcut_colorf.

In [44]:
# checking the results
print(df[(df['cut'] == 'Good') & (df['color'] == 'E')]['price'].mean())
print(df[(df['cut'] == 'Good') & (df['color'] == 'F')]['price'].mean())

3399.88115942029
3477.504518072289


### 3. Sub-Test 3: Ideal cut + color D vs. Ideal cut + color E