# Sprint 2 Challenge

### Dataset description: 

Anyone who is a fan of detective TV shows has watched a scene where human remains are discovered and some sort of expert is called in to determine when the person died. But is this science fiction or science fact? Is it possible to use evidence from skeletal remains to determine how long a body has been buried (a decent approximation of how long the person has been dead)? 

Researchers sampled long bone material from bodies exhumed from coffin burials in two cemeteries in England. In each case, date of death and burial (and therefore interment time) was known. This data is given in the Longbones.csv dataset. 

You can find Longbones.csv here: https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv

**What can we learn about the bodies that were buried in the cemetery?**

The variable names are:

Site = Site ID, either Site 1 or Site 2

Time = Interment time in years

Depth = Burial depth in ft.

Lime = Burial with Quicklime (0 = No, 1 = Yes)

Age = Age at time of death in years

Nitro = Nitrogen composition of the long bones in g per 100g of bone.

Oil = Oil contamination of the grave site (0 = No contamination, 1 = Oil contamination)

Source: D.R. Jarvis (1997). "Nitrogen Levels in Long Bones from Coffin Burials Interred for Periods of 26-90 Years," Forensic Science International, Vol. 85, pp. 199-208.

###1) Import the data 

Import the Longbones.csv file and print the first 5 rows.

In [21]:
#Import the dataset

import pandas as pd
import numpy as np


data_url = 'https://raw.githubusercontent.com/Chelsea-Myers/Lambda-Intro/master/Longbones.csv'

df = pd.read_csv(data_url, skipinitialspace=True, header=0)



In [22]:
### YOUR CODE HERE ###
df.head()

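| | Site | Time | Depth | Lime | Age | Nitro | Oil |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 88.5 | 7.0 | 1 | NaN | 3.88 | 1 |
| 1 | 1 | 88.5 | NaN | 1 | NaN | 4.00 | 1 |
| 2 | 1 | 85.2 | 7.0 | 1 | NaN | 3.69 | 1 |
| 3 | 1 | 71.8 | 7.6 | 1 | 65.0 | 3.88 | 0 |
| 4 | 1 | 70.6 | 7.5 | 1 | 42.0 | 3.53 | 0 |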


###2) Check for missing data.

Is there any missing data in the dataset?  If so, in what variable(s)?  

In [23]:
### YOUR CODE HERE ###
df.isnull().sum()

Site     0
Time     0
Depth    1
Lime     0
Age      7
Nitro    0
Oil      0
dtype: int64

Summarize your answer here.
ANSWER -->

Yes, there is missing data in this DataFrame: one missing observation in the Depth column and seven in the Age column.
No other column has missing values.

### 3) Remove any rows with missing data from the dataset.  If there is no missing data, write "No missing data" in the answer section below.

In [26]:
### YOUR CODE HERE ###
# Finding the shape of the DataFrame
df.shape
# Dropping rows with no information in any one column(variable)
df = df.dropna()
df.shape

(35, 7)

If there are no NA's, indicate that here. 

#Use the following information to answer questions 4) - 7) 

The mean nitrogen composition in living individuals is 4.3g per 100g of bone.  

We wish to use the Longbones sample to test the null hypothesis that the mean nitrogen composition per 100g of bone in the deceased is 4.3g (equal to that of living humans) vs the alternative hypothesis that the mean nitrogen composition per 100g of bone in the deceased is not 4.3g (not equal to that of living humans). 



###4) Using symbols and statistical language, write the null and alternative hypotheses outlined above.

Your answer here.
ANSWER -->

We are comparing a sample mean with a fixed reference value, so this is a case for a one-sample t-test. The null and alternative hypotheses are:

$H_0: \mu = $ 4.3 grams

In words, our null hypothesis is that the population mean nitrogen composition of long bones in the deceased is 4.3 grams per 100 grams of bone.

$H_a: \mu \neq$ 4.3 grams

In words, our alternative hypothesis is that the population mean nitrogen composition of long bones in the deceased is NOT 4.3 grams per 100 grams of bone.


###5) What is the appropriate test for these hypotheses?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

Your answer here.
ANSWER --> 

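Because nitrogen composition is a quantitative variable and we are comparing a single population mean with a reference value (4.3 grams), the appropriate test is a one-sample t-test. A chi-square test would apply if we were examining an association between categorical variables.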

###6) Use a built-in Python function to conduct the statistical test you identified in 5).  Report your p-value.  Write the conclusion to your hypothesis test at the alpha = 0.05 significance level.

In [27]:
### YOUR CODE HERE ###
import scipy.stats as st

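# Two-sided one-sample t-test against the reference value 4.3
# (ttest_1samp is called from the public scipy.stats namespace)
t_stat, pval = st.ttest_1samp(df['Nitro'], 4.3)
print('The p-value of the sample is:', pval)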

The p-value of the sample is: 8.097649978903554e-18


Your answer here.
ANSWER -->

The p-value of the one-sample t-test (about 8.1e-18) is far below the significance level of 0.05, so we reject the null hypothesis and conclude that the population mean nitrogen composition of long bones in the deceased is NOT 4.3 grams per 100 grams of bone.
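
As a rough cross-check on the built-in function, the one-sample t statistic can be computed by hand as $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$. The sketch below uses only the five Nitro values visible in the `df.head()` output from question 1, purely for illustration, so its result differs from the full-sample test. The variable names are illustrative and not part of the original notebook.

```python
import math
import statistics

# Illustrative subset: the five Nitro values shown in the df.head() output,
# not the full sample used for the reported p-value.
nitro_head = [3.88, 4.0, 3.69, 3.88, 3.53]
mu_0 = 4.3  # hypothesized mean under the null hypothesis

n = len(nitro_head)
sample_mean = statistics.mean(nitro_head)
sample_sd = statistics.stdev(nitro_head)   # sample SD (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)

# One-sample t statistic with n - 1 degrees of freedom
t_stat = (sample_mean - mu_0) / standard_error
print(f"mean = {sample_mean:.3f}, SE = {standard_error:.4f}, t = {t_stat:.2f}")
```

A large negative t statistic indicates a sample mean well below the reference value relative to its standard error, which is the pattern the full-sample test picks up.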

###7) Create a 95% confidence interval for the mean nitrogen composition in the longbones of a deceased individual.  Interpret your confidence interval in a sentence or two.

In [28]:
### YOUR CODE HERE ###
from scipy.stats import t
# calculating data mean, standard deviation, n and standard error
nitro_mean = df['Nitro'].mean()
nitro_sd = df['Nitro'].std()
nitro_n = df['Nitro'].count()
nitro_se = nitro_sd / (nitro_n**(1/2))
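# confidence level passed positionally: newer SciPy releases name this argument `confidence` instead of `alpha`
t.interval(0.95, df=nitro_n - 1, loc=nitro_mean, scale=nitro_se)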

(3.734020952024922, 3.8579790479750784)

Your answer here.
ANSWER-->

We are 95% confident that the population mean nitrogen composition of long bones in the deceased is between 3.73 grams and 3.86 grams per 100 grams of bone. 

The confidence interval is consistent with the one-sample t-test conducted earlier, since the hypothesized value of 4.3 grams lies outside the interval.
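
The link between the interval and the test can be checked directly: a two-sided test at alpha = 0.05 rejects a null value exactly when that value falls outside the 95% confidence interval. A minimal sketch using the interval endpoints reported above (variable names are illustrative):

```python
# Endpoints copied from the t.interval output above
ci_low, ci_high = 3.734020952024922, 3.8579790479750784
mu_0 = 4.3  # null-hypothesis value for mean nitrogen composition

# The t interval is symmetric around the sample mean
point_estimate = (ci_low + ci_high) / 2
margin_of_error = (ci_high - ci_low) / 2

null_value_inside = ci_low <= mu_0 <= ci_high
print(f"estimate = {point_estimate:.3f} +/- {margin_of_error:.3f}")
print("4.3 inside the interval:", null_value_inside)
```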

#Use the following information to answer questions 8) - 12) 


The researchers also want to learn more about burial practices in the parts of England where the two cemeteries in the study were located.  They wish to determine if burials with Quicklime are associated with the burial region.  

Their null hypothesis is that there is no association between cemetery site and burial with Quicklime.  The alternative hypothesis is that there is an association between cemetery site and burial with Quicklime.



###8) Calculate the joint distribution of burial with Quicklime by burial site.

In [29]:
### YOUR CODE HERE ###
pd.crosstab(df['Site'], df['Lime'])

| Site | Lime = 0 | Lime = 1 |
| --- | --- | --- |
| 1 | 14 | 5 |
| 2 | 9 | 7 |


###9) Calculate the conditional distribution of burial with Quicklime by burial site.

In [30]:
### YOUR CODE HERE ###
pd.crosstab(df['Site'], df['Lime'], normalize='index')*100

| Site | Lime = 0 | Lime = 1 |
| --- | --- | --- |
| 1 | 73.684211 | 26.315789 |
| 2 | 56.25 | 43.75 |


###10) What is the appropriate test for the hypotheses listed above?  A t-test or a chi-square test?  Explain your answer in a sentence or two.

Your answer here.
ANSWER -->

Since both burial with Quicklime and burial site are categorical variables, the appropriate test is a chi-square test of independence.

###11) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [31]:
### YOUR CODE HERE ###
#################### ANSWERING THE QUESTION BELOW INSTEAD ####################

Your answer here.

###12) Conduct your hypothesis test and report your conclusion at the alpha = 0.05 significance level.

In [32]:
### YOUR CODE HERE ###
from scipy.stats import chi2_contingency
g, p, dof, expected = chi2_contingency(pd.crosstab(df['Site'], df['Lime']))
p

0.4684181967877057

Your answer here.

ANSWER -->

The p-value of 0.47 is greater than the significance level of 0.05, so we fail to reject the null hypothesis. There is not sufficient evidence to conclude that burial with Quicklime is associated with burial site.
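
To see where this p-value comes from, the chi-square statistic can be rebuilt by hand from the counts in the joint distribution of question 8. The sketch below assumes the Yates continuity correction, which `chi2_contingency` applies by default to 2x2 tables, and relies on the fact that with one degree of freedom the upper-tail probability equals `erfc(sqrt(x / 2))`. Variable names are illustrative.

```python
import math

# Observed counts from the Site x Lime crosstab
# (rows: Site 1, Site 2; columns: Lime = 0, Lime = 1)
observed = [[14, 5], [9, 7]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under independence: row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic with Yates continuity correction (2x2 table)
chi2_stat = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Upper-tail p-value for a chi-square distribution with 1 degree of freedom
p_value = math.erfc(math.sqrt(chi2_stat / 2))
print(f"chi2 = {chi2_stat:.4f}, p = {p_value:.4f}")
```

The observed counts sit close to the expected counts, so the statistic is small and the p-value is well above 0.05.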

###13) In a few sentences, describe the difference between Bayesian and Frequentist statistics.

Your answer here.

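The main difference between Bayesian and Frequentist statistics lies in how probability and unknown parameters are treated. In Bayesian statistics, a parameter is described by a prior belief (a probability distribution), and observing data updates that belief into a posterior. In Frequentist statistics, a parameter is a fixed unknown value, no prior belief is used, and inference is based purely on the current dataset, with probability interpreted as long-run frequency over repeated samples.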
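
As a small illustration of the contrast, consider estimating the proportion of burials that used Quicklime, using the counts from the crosstab in question 8 (12 of 35 burials). The sketch below assumes a uniform Beta(1, 1) prior, an arbitrary choice made only for this example. The frequentist estimate is the sample proportion, while the Bayesian estimate is the mean of the Beta posterior obtained by updating the prior with the data.

```python
# Counts from the Site x Lime crosstab: Lime = 1 in 5 + 7 burials out of 35
lime_yes = 5 + 7
total = 14 + 5 + 9 + 7
lime_no = total - lime_yes

# Frequentist point estimate: the sample proportion
freq_estimate = lime_yes / total

# Bayesian update with a Beta prior (conjugate to the binomial likelihood).
# Beta(1, 1) is an assumed, uninformative prior chosen for illustration.
prior_a, prior_b = 1, 1
post_a = prior_a + lime_yes
post_b = prior_b + lime_no
bayes_estimate = post_a / (post_a + post_b)  # posterior mean

print(f"frequentist: {freq_estimate:.3f}, Bayesian posterior mean: {bayes_estimate:.3f}")
```

With a weak prior and 35 observations the two estimates are close; a stronger prior would pull the Bayesian estimate further from the sample proportion.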