# BMI Project 1 - English
This notebook provides starter code for many of the questions in the project.
Each section is marked with the corresponding question in the project description.
Remember that this code is only meant to get you started: it is not sufficient on its own to answer the corresponding questions.

#### Initialize python packages

In [None]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import statsmodels.api as sm

#### Read data

In [None]:
# Path to data file (insert your own path)
file_path = '/Users/johndoe/Documents/DTU/intro_stat/projects/bmi1/bmi1_data.csv'

# Load data
D = pd.read_csv(file_path, sep=';')

#### a) Simple summary of data


In [None]:
print("Dimensions of data frame (number of rows and columns): ", D.shape)
print("Column/variable names: ", D.columns)
print("First 5 rows of D:")
display(D.head())
print("Last 5 rows of D:")
display(D.tail())
print("Description of D:")
display(D.describe())
print("Types of variables:\n", D.dtypes) # (\n means new line)

#### Calculate BMI

In [None]:
# Calculate BMI and add to data frame D
D['bmi'] = D['weight'] / (D['height']/100)**2

#### b) Histogram (Empirical density)

In [None]:
# Histogram describing the empirical density of BMI
# (histogram for BMI normalized so that the area is 1)
plt.hist(D['bmi'], bins=15, density=True, color='blue', edgecolor='black')
plt.show()

#### Data subsets (men and women separately)

In [None]:
# Split data into two sub data sets (men and women)
D_female = D[D['gender']==0]
D_male = D[D['gender']==1]

#### c) Density Histograms for Men and Women separately

In [None]:
# Density histograms describing the empirical density of BMI for Women and Men

# Density histogram for women
plt.hist(D_female['bmi'], bins=15, density=True, color='red', edgecolor='black')
plt.show()

# Density histogram for men
plt.hist(D_male['bmi'], bins=15, density=True, color='blue', edgecolor='black')
plt.show()

# Combined in one plot
plt.hist(D_female['bmi'], bins=15, density=True, color='red', edgecolor='black', alpha=0.5)
plt.hist(D_male['bmi'], bins=15, density=True, color='blue', edgecolor='black', alpha=0.5)
plt.show()


#### d) Boxplot by genders

In [None]:
# Boxplot of BMI separated by gender
plt.boxplot([D_female['bmi'], D_male['bmi']])
plt.xticks([1, 2], ['Women', 'Men']) # Label the two boxes
plt.ylabel('BMI')
plt.show()

#### e) Key summary statistics for BMI

In [None]:
# Total number of observations, not counting missing values.
print("Total number of observations: ", D['bmi'].notna().sum())

# Sample mean (not split)
print("Sample mean for BMI: ", D['bmi'].mean(skipna=True))

# Sample variance (not split)
print("Sample variance for BMI: ", D['bmi'].var(skipna=True, ddof=1))
# Etc. (the remaining statistics can be computed in the same way)
# "skipna=True" means that missing values are ignored, so the statistic can still
# be calculated if there are missing values.
# If skipna=False, the function returns NaN if there are missing values in the data set.
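
The effect of `skipna` described above can be checked on a small example. This is a minimal sketch on a hypothetical toy series (the values are made up and are not from `bmi1_data.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value (not from the project data set)
s = pd.Series([22.5, 27.1, np.nan, 24.3])

n_obs = s.notna().sum()              # counts only the non-missing values
mean_skip = s.mean(skipna=True)      # the missing value is ignored
mean_noskip = s.mean(skipna=False)   # the missing value propagates to the result

print("Non-missing observations: ", n_obs)
print("Mean with skipna=True: ", mean_skip)
print("Mean with skipna=False: ", mean_noskip)
```

Note that `skipna=True` is the default for these pandas methods, so writing it out mainly serves as documentation.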


#### f) QQ-plot for model validation

In [None]:
# New variable 'logbmi' with log-transformed BMI
D['logbmi'] = np.log(D['bmi'])

# QQ-plot for log-transformed BMI
sm.qqplot(D['logbmi'],line='q')
plt.show()



#### g-h) One-sample t-test

In [None]:
# T-test for one sample on log-transformed BMI
res = stats.ttest_1samp(D['logbmi'], popmean = np.log(25))
print("t-obs: ", res[0])
print("p-value: ", res[1])

# Confidence interval directly from t-test
print(res.confidence_interval())
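
To see what `stats.ttest_1samp` computes, the test statistic and p-value can also be calculated by hand. The following is a minimal sketch that assumes a simulated sample `x` as a stand-in for `D['logbmi']`; the sample and its parameters are hypothetical and not part of the project data:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical simulated sample standing in for D['logbmi']
rng = np.random.default_rng(1)
x = rng.normal(loc=np.log(25.5), scale=0.15, size=100)
mu0 = np.log(25)  # mean under the null hypothesis

n = len(x)
std_err = x.std(ddof=1) / np.sqrt(n)             # standard error of the mean
t_obs = (x.mean() - mu0) / std_err               # observed test statistic
p_value = 2 * stats.t.sf(abs(t_obs), df=n - 1)   # two-sided p-value

# Compare with the built-in function
res = stats.ttest_1samp(x, popmean=mu0)
print("Manual: ", t_obs, p_value)
print("SciPy:  ", res.statistic, res.pvalue)
```

The two lines of output should agree up to floating-point rounding.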

#### j) Confidence interval (CI) for mean and median

In [None]:
# Using sample for women only
D_female = D[D['gender']==0]
n = D_female['logbmi'].notna().sum() # Number of non-missing observations
std_err = np.std(D_female['logbmi'],ddof=1)/np.sqrt(n)

# CI for mean of logBMI for women
KI = stats.t.interval(0.95, df=n-1, loc=D_female['logbmi'].mean(),scale=std_err)
print("95% CI for mean of logBMI for women: ", KI)

# Transform back to get CI for median BMI for women
print("95% CI for median BMI for women: ", np.exp(KI))
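
Why does back-transforming give an interval for the median rather than the mean? If log-BMI is normally distributed, its mean equals its median, and the exponential function preserves the median but not the mean. A minimal sketch with simulated log-normal data (the parameters `mu` and `sigma` are hypothetical choices for illustration):

```python
import numpy as np

# Hypothetical log-normal sample (parameters chosen for illustration only)
rng = np.random.default_rng(0)
mu, sigma = np.log(25), 0.2
x = np.exp(rng.normal(mu, sigma, size=100_000))

# Back-transformed mean of the log-values
back_transformed = np.exp(np.log(x).mean())

print("exp(mean of log-values): ", back_transformed)
print("Sample median:           ", np.median(x))
print("Sample mean:             ", x.mean())
```

In this simulation the back-transformed value lies close to the sample median, while the sample mean is noticeably larger.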

#### k) Welch t-test

In [None]:
# Comparison of logBMI for women and men
D_male = D[D['gender'] == 1]
res = stats.ttest_ind(D_female['logbmi'],D_male['logbmi'], equal_var=False)
print("t-obs: ", res[0])
print("p-value: ", res[1])
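
The Welch test statistic and its degrees of freedom (Welch-Satterthwaite approximation) can also be computed by hand. This is a minimal sketch that assumes two simulated samples `a` and `b` as hypothetical stand-ins for the log-BMI of women and men:

```python
import numpy as np
import scipy.stats as stats

# Hypothetical simulated samples (not the project data)
rng = np.random.default_rng(2)
a = rng.normal(3.20, 0.15, size=70)
b = rng.normal(3.25, 0.12, size=75)

# Estimated variance of each sample mean
va = a.var(ddof=1) / len(a)
vb = b.var(ddof=1) / len(b)

t_obs = (a.mean() - b.mean()) / np.sqrt(va + vb)
# Welch-Satterthwaite degrees of freedom
df = (va + vb)**2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
p_value = 2 * stats.t.sf(abs(t_obs), df=df)

# Compare with the built-in function
res = stats.ttest_ind(a, b, equal_var=False)
print("Manual: ", t_obs, p_value, "df =", df)
print("SciPy:  ", res.statistic, res.pvalue)
```

Unlike the pooled t-test, the degrees of freedom here are generally not an integer.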

#### m) Correlation

In [None]:
# Correlation between chosen variables
print(D[['weight', 'fastfood', 'bmi']].corr())

## EXTRA

#### Subsets in Python

In [None]:
## Extra information about picking out subsets in Python
#
# Logical (boolean) vector with True or False for every value in a column in D,
# for example: Find all women in the dataframe
women = D['gender'] == 0
print("Logical vector: \n", women)
# This logical vector can then be used to pick out all women (rows where women is True)
print("Using logical vector:")
display(D[women])
# Alternatively you can use the pandas function .loc
print("Using .loc:")
display(D.loc[D['gender'] == 0, :]) # ":" means all columns
# More complex logical expressions can be used, for example:
# Find all women under 55 kg:
print("Women under 55 kg:")
women_under_55kg = (D['gender'] == 0) & (D['weight'] < 55)
display(D[women_under_55kg])

## The display function gives a nicer table than print. This is especially useful when we
# are working with dataframes (pandas)

#### Additional Python tips

In [None]:
## Make a for-loop for calculating some summary statistics
## and save in a new dataframe
Tbl = pd.DataFrame()
for i in [0,1]:
    Tbl.loc[i, "mean"] = D[D['gender'] == i]['bmi'].mean()
    Tbl.loc[i, "var"] = D[D['gender'] == i]['bmi'].var()

Tbl.index = ['Women', 'Men'] # Naming rows

# show Tbl (dataframe)
display(Tbl)


In [None]:
# There are many ways to reach the same results, and some use more compact commands/functions.
# For example:
result = D.groupby('gender')['bmi'].agg(['mean', 'var'])
# Here the groupby function is used to group the data by gender, and then the agg (aggregate) function
# for calculating mean and variance for BMI for each group.
display(result)

# See more functions in pandas documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
# Numpy documentation: https://numpy.org/doc/stable/reference/index.html
# Or find documentation/guides for other python packages/functions online.


#### LaTeX Tips
Pandas (pd) also includes a method that is very handy for converting tables/dataframes directly into LaTeX code.
This is done with the dataframe method `to_latex()`, for example `Tbl.to_latex()`.
The following is the simplest form of the call:

In [None]:
Tbl_latex = Tbl.to_latex()
print(Tbl_latex)