## What is the true normal human body temperature? 

#### Background

The mean normal body temperature was held to be 37$^{\circ}$C or 98.6$^{\circ}$F for more than 120 years since it was first conceptualized and reported by Carl Wunderlich in a famous 1868 book. In 1992, this value was revised to 36.8$^{\circ}$C or 98.2$^{\circ}$F. 

#### Exercise
In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.

Answer the following questions **in this notebook below and submit to your GitHub account**. 

1.  Is the distribution of body temperatures normal? 
    - Remember that this is a condition for the CLT, and hence the statistical tests we are using, to apply. 
2.  Is the true population mean really 98.6 degrees F?
    - Bring out the one-sample hypothesis test! In this situation, is it appropriate to apply a z-test or a t-test? How will the result be different?
3.  At what temperature should we consider someone's temperature to be "abnormal"?
    - Start by computing the margin of error and confidence interval.
4.  Is there a significant difference between males and females in normal temperature?
    - Set up and solve a two-sample hypothesis test.

You can include written notes in notebook cells using Markdown: 
   - In the control panel at the top, choose Cell > Cell Type > Markdown
   - Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

#### Resources

+ Information and data sources: http://www.amstat.org/publications/jse/datasets/normtemp.txt, http://www.amstat.org/publications/jse/jse_data_archive.htm
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet

****

In [1]:
from scipy import stats
import numpy as np
import pandas as pd
import datetime
from pylab import figure, axes, pie, title, show

pd.set_option('display.max_columns', 10000)

In [2]:
df = pd.read_csv('data/human_body_temperature.csv')

In [3]:
df.head()

|   | temperature | gender | heart_rate |
|---|-------------|--------|------------|
| 0 | 99.3 | F | 68.0 |
| 1 | 98.4 | F | 81.0 |
| 2 | 97.8 | M | 73.0 |
| 3 | 99.2 | F | 66.0 |
| 4 | 98.0 | F | 73.0 |


# Q1) Is the distribution of body temperatures normal?

## Method 1: Check the statistics of the temperature data

In [4]:
temperatures = df.temperature
dist_stats = dict()
mean = temperatures.mean()
stdev = temperatures.std()
obs = len(temperatures)
dist_stats["mean"] = temperatures.mean()
dist_stats["median"] = temperatures.median()
dist_stats["stdev"] = temperatures.std(ddof=1)
dist_stats["obs"] = obs
dist_stats["1std"]  = len(temperatures[(temperatures < mean+1*stdev) & (temperatures > mean-1*stdev)])/obs
dist_stats["2std"]  = len(temperatures[(temperatures < mean+2*stdev) & (temperatures > mean-2*stdev)])/obs
dist_stats["3std"]  = len(temperatures[(temperatures < mean+3*stdev) & (temperatures > mean-3*stdev)])/obs
dist_stats

{'1std': 0.6923076923076923,
 '2std': 0.9461538461538461,
 '3std': 0.9923076923076923,
 'mean': 98.24923076923078,
 'median': 98.3,
 'obs': 130,
 'stdev': 0.7331831580389454}

To check whether the temperature data are normally distributed, we look for the characteristics of normally distributed data:

1. Is the mean equal to the median? 
 * Yes, approximately: the median is 98.3 and the mean is very close at 98.249
2. Are about 68% of the observations within 1 standard deviation of the mean?
 * Yes, 69.23% of the observations are within 1 standard deviation of the mean
3. Are about 95% of the observations within 2 standard deviations of the mean?
 * Yes, 94.62% of the observations are within 2 standard deviations of the mean
4. Are about 99.7% of the observations within 3 standard deviations of the mean?
 * Yes, approximately: 99.23% of the observations are within 3 standard deviations of the mean

Hence, by this cursory look, the data seem to be **normally distributed**.
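
As a rough cross-check of the percentages used above, the theoretical shares for a normal distribution can be computed from the standard normal CDF and compared with the observed shares. This is a minimal sketch: the `observed` values are copied from the `dist_stats` output, and the names `expected`, `observed`, and `gap` are introduced here only for illustration.

```python
from scipy import stats

# Theoretical share of a normal distribution within k standard deviations of the mean
expected = {k: stats.norm.cdf(k) - stats.norm.cdf(-k) for k in (1, 2, 3)}

# Observed shares, copied from the dist_stats output above
observed = {1: 0.6923076923076923, 2: 0.9461538461538461, 3: 0.9923076923076923}

for k in (1, 2, 3):
    gap = observed[k] - expected[k]
    print(f"{k} std: expected {expected[k]:.4f}, observed {observed[k]:.4f}, gap {gap:+.4f}")
```

The gaps are all within about one percentage point, which is in line with the informal conclusion above. It is not a formal test.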

## Method 2: Use the normaltest method from scipy

In [5]:
isnormal = stats.mstats.normaltest(df["temperature"], axis=0)
isnormal

NormaltestResult(statistic=2.7038014333192031, pvalue=0.2587479863488254)

The null hypothesis is that the temperature is normally distributed, and we assume a significance level (alpha) of 0.05.

The result shows a p-value of 0.2587. Since the p-value > alpha, there is insufficient evidence to reject the null hypothesis.

Hence, we treat the data as consistent with a normal distribution.

## Q2) Is the true population mean really 98.6 degrees F?
Bring out the one-sample hypothesis test! In this situation, is it appropriate to apply a z-test or a t-test? How will the result be different?

## Answer:

Null Hypothesis: True population mean = 98.6F

Alternative Hypothesis: True population mean is not equal to 98.6F

Significance level, alpha = 0.05

Sample size, n = 130

Sample mean, xbar = 98.2492F

Sample standard deviation, xstd = 0.7332F

Under the null hypothesis, the mean of the sampling distribution (sdist_mean) = the true population mean = 98.6F

The standard error (s_err) = xstd/sqrt(n)

In [6]:
alpha = 0.05
n = dist_stats["obs"]
xbar = dist_stats["mean"]
xstd = dist_stats["stdev"]
sdist_mean = 98.6
s_err = xstd/np.sqrt(n)
s_err

0.06430441683789101

Thus, the standard error (s_err) = 0.0643

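**Because the population standard deviation is unknown, a t-test is strictly the appropriate choice. However, as the sample size (n = 130) is well above 30, the t-distribution is very close to the standard normal, so we use the z-statistic; the t-statistic would not give a materially different result.**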

The z-statistic (z_stat) = (xbar - sdist_mean)/s_err

In [7]:
z_stat = (xbar - sdist_mean)/s_err
z_stat

-5.4548232923640789

Looking at the z-table, we find that a z-statistic of -5.45 gives a p-value of almost 0, which is less than alpha.
Hence, we reject the null hypothesis. 

**Hence, the data give strong evidence that the true mean body temperature of the population is not 98.6F**
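
Instead of reading the p-value off a z-table, it can be computed directly, and the same statistic can be evaluated under the t-distribution to see how much the choice of test matters. The sketch below assumes the summary statistics reported above; `mu0`, `test_stat`, `p_z`, and `p_t` are names introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the outputs above
n = 130
xbar = 98.24923076923078
xstd = 0.7331831580389454
mu0 = 98.6  # hypothesized population mean

test_stat = (xbar - mu0) / (xstd / np.sqrt(n))

# Two-sided p-values: standard normal (z-test) vs. Student's t with n - 1 degrees of freedom
p_z = 2 * stats.norm.sf(abs(test_stat))
p_t = 2 * stats.t.sf(abs(test_stat), df=n - 1)

print(f"statistic  = {test_stat:.4f}")
print(f"p (z-test) = {p_z:.2e}")
print(f"p (t-test) = {p_t:.2e}")
```

Both p-values are far below alpha = 0.05, so the decision to reject the null hypothesis does not depend on which of the two tests is used.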

## Q3) At what temperature should we consider someone's temperature to be "abnormal"?

Start by computing the margin of error and confidence interval.

## Answer:

Confidence interval computation:

For a confidence level of 95%, the z-statistic (z_stat) is 1.96

Sample mean (xbar) is 98.2492F

Sample standard deviation (xstd) is 0.7332

Sample size (n) is 130

The confidence interval (CI) is computed as follows:

xbar ± z_stat * xstd / sqrt(n)

In [8]:
z_stat = 1.96
confidence_interval = [(xbar-(z_stat*xstd/np.sqrt(n))),(xbar+(z_stat*xstd/np.sqrt(n)))]
confidence_interval

[98.123194112228518, 98.375267426233037]

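**Thus, the margin of error is about 0.126F and the 95% confidence interval for the population mean is 98.123F to 98.375F.** This interval describes where the population mean is likely to lie, not the spread of individual readings, so it is too narrow to flag a single person's temperature as abnormal. For an individual, a range based on the sample standard deviation, xbar ± 1.96 * xstd (about 96.81F to 99.69F), is a more suitable threshold: **a temperature outside that range can be considered abnormal**.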
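
The margin of error and interval can also be derived without hard-coding 1.96. The sketch below uses the summary statistics reported above and additionally computes a range for individual readings. That second range is an assumption of this sketch, not part of the original computation: it treats individual temperatures as approximately normal with the sample mean and standard deviation.

```python
import numpy as np
from scipy import stats

# Summary statistics copied from the outputs above
n = 130
xbar = 98.24923076923078
xstd = 0.7331831580389454

# Critical value for a two-sided 95% level (close to the 1.96 used above)
z_crit = stats.norm.ppf(0.975)

# Margin of error and confidence interval for the population MEAN
margin_of_error = z_crit * xstd / np.sqrt(n)
ci_mean = (xbar - margin_of_error, xbar + margin_of_error)

# ASSUMPTION (illustrative): range expected to cover about 95% of INDIVIDUAL
# readings if temperatures are roughly normal with these sample statistics
individual_range = (xbar - z_crit * xstd, xbar + z_crit * xstd)

print(f"margin of error  = {margin_of_error:.4f}")
print(f"CI for the mean  = ({ci_mean[0]:.3f}, {ci_mean[1]:.3f})")
print(f"individual range = ({individual_range[0]:.3f}, {individual_range[1]:.3f})")
```

The individual range is much wider than the interval for the mean because it is not divided by sqrt(n).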

## Q4) Is there a significant difference between males and females in normal temperature?
Set up and solve a two-sample hypothesis test.

## Answer:

In [9]:
male_temp = df.temperature[df.gender=='M']
female_temp = df.temperature[df.gender=='F']

two_sample_stats = dict()
two_sample_stats["male mean"] = male_temp.mean()
two_sample_stats["female mean"] = female_temp.mean()
two_sample_stats["male std dev"] = male_temp.std(ddof=1)
two_sample_stats["female std dev"] = female_temp.std(ddof=1)
two_sample_stats["male size"] = len(male_temp)
two_sample_stats["female size"] = len(female_temp)
two_sample_stats["male-female mean difference"] = two_sample_stats["male mean"] - two_sample_stats["female mean"]
two_sample_stats

{'female mean': 98.39384615384613,
 'female size': 65,
 'female std dev': 0.7434877527313665,
 'male mean': 98.1046153846154,
 'male size': 65,
 'male std dev': 0.6987557623265908,
 'male-female mean difference': -0.289230769230727}

Null Hypothesis: Difference in mean body temperatures of men and women = 0F

Alternative Hypothesis: Difference in mean body temperatures of men and women is not 0F

Significance level, alpha = 0.05

Sample size per group, n = 65

Difference in sample means (male - female), xbar = -0.2892F

Standard deviation calculation (square root of the sum of the two sample variances):

In [10]:
alpha = 0.05
n = 65
xbar = two_sample_stats["male-female mean difference"]
var_male = two_sample_stats["male std dev"]**2
var_female = two_sample_stats["female std dev"]**2
var_sample = var_male + var_female
xstd = np.sqrt(var_sample)
xstd

1.0203105673500361

Thus, the square root of the summed sample variances (xstd) = 1.0203

Under the null hypothesis, the mean of the sampling distribution (sdist_mean) = 0F

The standard error (s_err) = xstd/sqrt(n)

In [11]:
sdist_mean = 0
s_err = xstd/np.sqrt(n)
s_err

0.12655395041982645

Thus, the standard error (s_err) is 0.1266
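
The shortcut xstd/sqrt(n) matches the usual standard error of a difference between two independent sample means only because both groups have the same size. Writing $s_M$, $s_F$ for the male and female sample standard deviations and $n_M$, $n_F$ for the group sizes (symbols introduced here for illustration):

$$
\mathrm{s\_err} = \sqrt{\frac{s_M^2}{n_M} + \frac{s_F^2}{n_F}} = \frac{\sqrt{s_M^2 + s_F^2}}{\sqrt{n}} \quad \text{when } n_M = n_F = n
$$

With $s_M = 0.6988$, $s_F = 0.7435$ and $n = 65$, this gives $1.0203 / \sqrt{65} \approx 0.1266$, in agreement with the value above. For unequal group sizes, the left-hand form would have to be used.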

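We use the z-statistic because both samples are large (n = 65 > 30). Since the population standard deviations are unknown, a t-test would strictly be the appropriate choice, but with samples of this size it gives nearly the same result.

The z-statistic (z_stat) is: (xbar - sdist_mean) / s_err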

In [12]:
z_stat = (xbar - sdist_mean)/s_err
z_stat

-2.2854345381652736

The z-statistic is -2.2854. 

Looking at the z-table, we find the z-statistic cutoff is at +1.96 and -1.96.

**Since our computed z-statistic is less than -1.96, we reject the null hypothesis. Thus, there is a statistically significant difference between the mean body temperatures of men and women.**
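
As a cross-check on the hand computation, the same comparison can be run from the summary statistics with `scipy.stats.ttest_ind_from_stats`. The sketch below uses Welch's t-test (`equal_var=False`), which is a substitution for the z-test used above, not the original method. With equal group sizes its statistic coincides with the z-statistic, and only the reference distribution differs.

```python
from scipy import stats

# Summary statistics copied from the two_sample_stats output above
male_mean, male_std, male_n = 98.1046153846154, 0.6987557623265908, 65
female_mean, female_std, female_n = 98.39384615384613, 0.7434877527313665, 65

# Welch's t-test from summary statistics (does not assume equal variances)
result = stats.ttest_ind_from_stats(
    male_mean, male_std, male_n,
    female_mean, female_std, female_n,
    equal_var=False,
)

# Two-sided p-value for the same statistic under the standard normal (z-test)
p_z = 2 * stats.norm.sf(abs(result.statistic))

print(f"statistic   = {result.statistic:.4f}")
print(f"p (Welch t) = {result.pvalue:.4f}")
print(f"p (z-test)  = {p_z:.4f}")
```

Both p-values fall below alpha = 0.05, so the conclusion is the same under either test.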