## What is the true normal human body temperature? 

#### Background

The mean normal body temperature was held to be 37$^{\circ}$C or 98.6$^{\circ}$F for more than 120 years since it was first conceptualized and reported by Carl Wunderlich in a famous 1868 book. In 1992, this value was revised to 36.8$^{\circ}$C or 98.2$^{\circ}$F. 

#### Exercise
In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.

Answer the following questions **in this notebook below and submit to your GitHub account**.

1.  Is the distribution of body temperatures normal? 
    - Remember that this is a condition for the CLT, and hence the statistical tests we are using, to apply. 
2.  Is the true population mean really 98.6 degrees F?
    - Bring out the one sample hypothesis test! In this situation, is it appropriate to apply a z-test or a t-test? How will the result be different?
3.  At what temperature should we consider someone's temperature to be "abnormal"?
    - Start by computing the margin of error and confidence interval.
4.  Is there a significant difference between males and females in normal temperature?
    - Set up and solve for a two sample hypothesis testing.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources

+ Information and data sources: http://www.amstat.org/publications/jse/datasets/normtemp.txt, http://www.amstat.org/publications/jse/jse_data_archive.htm
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
import pandas as pd
import scipy.stats as sps
import math

In [43]:
df = pd.read_csv('data/human_body_temperature.csv')

In [44]:
# This cell helps to display dataframes more attractively
from IPython.display import HTML
css = open('style-table.css').read() + open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

In [45]:
df.head(5)

,temperature,gender,heart_rate
0,99.3,F,68
1,98.4,F,81
2,97.8,M,73
3,99.2,F,66
4,98.0,F,73


In [46]:
df.describe()

,temperature,heart_rate
count,130.0,130.0
mean,98.249231,73.761538
std,0.733183,7.062077
min,96.3,57.0
25%,97.8,69.0
50%,98.3,74.0
75%,98.7,79.0
max,100.8,89.0


In [47]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 130 entries, 0 to 129
Data columns (total 3 columns):
temperature    130 non-null float64
gender         130 non-null object
heart_rate     130 non-null float64
dtypes: float64(2), object(1)
memory usage: 4.1+ KB


### Question 1:
Is the distribution of body temperatures normal?
Remember that this is a condition for the CLT, and hence the statistical tests we are using, to apply.

In [7]:
df_temp = df.groupby(df['temperature']//0.5*0.5).size()

In [8]:
df_temp

temperature
96.0      2
96.5      4
97.0     13
97.5     21
98.0     38
98.5     33
99.0     15
99.5      2
100.0     1
100.5     1
dtype: int64

In [9]:
mean=df['temperature'].mean()
variance=df['temperature'].var()
total=df_temp.sum()
mean, variance, total

(98.249230769230749, 0.53755754323553495, 130)

In [11]:
sps.mstats.normaltest(df_temp)

NormaltestResult(statistic=1.9042014162094383, pvalue=0.38592944619402303)

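As the p-value (0.39) is larger than 0.05, we cannot reject the null hypothesis that the data come from a normal distribution. In other words, it is reasonable to assume that <b> the distribution is approximately normal </b>. Note, however, that the test above was applied to the ten binned counts in `df_temp` rather than to the 130 raw temperatures; running it on `df['temperature']` would be the more direct check.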

I wanted to add the expected number of observations, given that the null hypothesis (body temperatures ~N(mean,variance)) is true. However, I'd like to discuss how to do this exactly.
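One possible way to get those expected counts is sketched below. It assumes the half-degree bins used in `df_temp` and plugs in the sample mean and variance reported above; the names `bin_starts`, `observed`, `bin_probs`, and `expected` are introduced here for illustration only. The expected count of a bin is the total number of observations times the probability mass that N(mean, variance) assigns to that bin.

```python
import math
import numpy as np
import scipy.stats as sps

# Values copied from the cell outputs above
mean, variance, total = 98.249230769230749, 0.53755754323553495, 130
bin_starts = np.arange(96.0, 101.0, 0.5)                      # left bin edges 96.0 ... 100.5
observed = np.array([2, 4, 13, 21, 38, 33, 15, 2, 1, 1])      # counts from df_temp

sigma = math.sqrt(variance)

# Probability mass of N(mean, variance) inside each bin [start, start + 0.5)
upper = sps.norm.cdf(bin_starts + 0.5, loc=mean, scale=sigma)
lower = sps.norm.cdf(bin_starts, loc=mean, scale=sigma)
bin_probs = upper - lower

# Expected number of observations per bin if the null hypothesis is true
expected = total * bin_probs

for start, obs, exp in zip(bin_starts, observed, expected):
    print('{:6.1f}  observed={:3d}  expected={:6.2f}'.format(start, int(obs), exp))
```

The expected counts sum to slightly less than 130 because a small part of the normal distribution lies outside the range 96.0 to 101.0.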

### Question 2:
Is the true population mean really 98.6 degrees F?
* Bring out the one sample hypothesis test! In this situation, is it appropriate to apply a z-test or a t-test? How will the result be different?

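A t-test is most appropriate, as we don't know the population standard deviation and have to estimate it from the sample. However, with 130 observations the number of degrees of freedom (129) is large, so the t distribution is close to the normal distribution and a z-test will give nearly the same result.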

In [12]:
t_score = (mean-98.6)/math.sqrt(variance/len(df))
t_score

-5.4548232923463891

In [13]:
# One-sided probability:
prob_under = sps.t(len(df)-1).cdf(t_score)
prob_under

1.2053160208779541e-07

In [14]:
# Two-sided probability:
2 * prob_under

2.4106320417559082e-07

The two-sided p-value is far below 0.05, so we reject the null hypothesis: the population mean is <b> not </b> equal to 98.6 degrees F!

Note that the normal approximation (z-test) leads to the same conclusion. The one-sided probability is:

In [15]:
sps.norm.cdf(t_score)

2.4510785073006942e-08
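To make the z-test versus t-test comparison concrete, the sketch below puts the two-sided p-values and the 5% critical values side by side. It reuses the test statistic and sample size reported above; the names `p_t`, `p_z`, `crit_t`, and `crit_z` are illustrative and not part of the original analysis.

```python
import scipy.stats as sps

# Values copied from the cells above
t_score = -5.4548232923463891
n = 130

# Two-sided p-values: t distribution with n - 1 degrees of freedom vs. standard normal
p_t = 2 * sps.t(n - 1).cdf(t_score)
p_z = 2 * sps.norm.cdf(t_score)

# Two-sided 5% critical values for both distributions
crit_t = sps.t(n - 1).ppf(0.975)
crit_z = sps.norm.ppf(0.975)

print('t-test: p = {:.3g}, critical value = {:.4f}'.format(p_t, crit_t))
print('z-test: p = {:.3g}, critical value = {:.4f}'.format(p_z, crit_z))

# With the raw data loaded, sps.ttest_1samp(df['temperature'], 98.6)
# should give the same t statistic in a single call.
```

The t distribution has heavier tails, so its p-value is somewhat larger and its critical value slightly higher, but with 129 degrees of freedom the difference does not change the conclusion.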

### Question 3:
At what temperature should we consider someone's temperature to be "abnormal"?
Start by computing the margin of error and confidence interval.

In [18]:
sps.norm.interval(0.95, loc=mean, scale=math.sqrt(variance))

(96.812218185398308, 99.68624335306319)

Note that this interval is not about the mean, but about an individual drawn from the population. Temperatures below 96.8 or above 99.7 can be considered abnormal. As practical guidelines, we can use 97 and 99 (safety first).
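For comparison, the margin of error and the 95% confidence interval for the population mean can be sketched as follows. This is an illustration that reuses the sample mean, variance, and size reported earlier and assumes a t-based interval; `sem`, `t_crit`, `margin`, and `ci` are names introduced here.

```python
import math
import scipy.stats as sps

# Values copied from the cell outputs above
mean, variance, n = 98.249230769230749, 0.53755754323553495, 130

# Standard error of the mean
sem = math.sqrt(variance / n)

# Two-sided 95% critical value of the t distribution with n - 1 degrees of freedom
t_crit = sps.t(n - 1).ppf(0.975)

# Margin of error and confidence interval for the mean
margin = t_crit * sem
ci = (mean - margin, mean + margin)

print('margin of error: {:.3f}'.format(margin))
print('95% CI for the mean: ({:.2f}, {:.2f})'.format(ci[0], ci[1]))
```

This interval is much narrower than the one for individuals because it describes the uncertainty in the mean, which shrinks with the square root of the sample size.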

### Question 4:
Is there a significant difference between males and females in normal temperature?
Set up and solve for a two sample hypothesis testing.

In [21]:
df_mean = df.groupby(['gender']).mean()
df_mean

gender,temperature,heart_rate
F,98.393846,74.153846
M,98.104615,73.369231


In [22]:
df_std = df.groupby(['gender']).std()
df_std

gender,temperature,heart_rate
F,0.743488,8.105227
M,0.698756,5.875184


In [26]:
df_count = df.groupby(['gender']).size()
df_count

gender
F    65
M    65
dtype: int64

- Null hypothesis: $H_0: \mu_F - \mu_M = 0$
- Standard error of the difference: $s = \sqrt{\sigma_F^2/N_F + \sigma_M^2/N_M}$
- Degrees of freedom: $\min(N_F - 1, N_M - 1)$ (conservative)

In [29]:
df_std.temperature.F

0.74348775273483592

In [60]:
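# Standard error of the difference in means; dividing by a single count works only because both groups have 65 observations
s = math.sqrt((df_std.temperature.F)**2 + df_std.temperature.M**2)/math.sqrt(df_count.F)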
s

0.12655395041987005

In [61]:
diff_mean = df_mean.temperature.F - df_mean.temperature.M
diff_mean

0.28923076923072699

In [62]:
degfree = min(df_count-1)
degfree

64

In [63]:
# Lower bound of the two-sided 95% confidence interval for the difference in means
minimum = diff_mean + sps.t(degfree).ppf(0.025)*s
minimum

0.036410189693441342

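The lower bound of the 95% confidence interval for the difference in means lies above zero, so at an alpha of 0.05 we reject the null hypothesis that male and female mean temperatures are the same.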

A more efficient way is to let SciPy run the two-sample t-test directly:

In [51]:
Male_temps = df['temperature'][df.gender=='M']
Female_temps = df['temperature'][df.gender=='F']

In [54]:
sps.ttest_ind(Male_temps, Female_temps, equal_var=True)

Ttest_indResult(statistic=-2.2854345381656103, pvalue=0.023931883122395609)

The p-value (0.024) is below 0.05, so this also suggests that we can reject the null hypothesis that the mean temperatures of males and females are the same!
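As a cross-check that does not need the raw data, the same test can be sketched from the group summary statistics alone with `scipy.stats.ttest_ind_from_stats`. The numbers below are the rounded group means, standard deviations, and counts shown earlier; the variable names are illustrative. The block also runs Welch's variant (`equal_var=False`), which does not assume equal variances in the two groups.

```python
import scipy.stats as sps

# Summary statistics copied (rounded) from the groupby outputs above
mean_f, std_f, n_f = 98.393846, 0.743488, 65
mean_m, std_m, n_m = 98.104615, 0.698756, 65

# Pooled-variance t-test (same assumption as equal_var=True above).
# Females are passed first here, so the statistic is positive instead of negative.
pooled = sps.ttest_ind_from_stats(mean_f, std_f, n_f, mean_m, std_m, n_m, equal_var=True)

# Welch's t-test, which drops the equal-variance assumption
welch = sps.ttest_ind_from_stats(mean_f, std_f, n_f, mean_m, std_m, n_m, equal_var=False)

print(pooled)
print(welch)
```

Because both groups have the same size, the two variants give the same t statistic and differ only slightly in their degrees of freedom, and therefore in the p-value.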