# MEI Introduction to Data Science
# Lesson 7 - Activity 1
The problem in this activity is an introduction to one type of *machine learning*. In it you will explore how the values in one or more fields in a dataset can be used to form a model that predicts the values in another field.

The activity is based on body fat percentage: https://en.wikipedia.org/wiki/Body_fat_percentage. This is less easy to measure than other quantities such as height or weight. One way of measuring it is the *Brozek* estimate based on the body's density; however, this requires the volume of the body to be measured. You will explore whether other more easily measured quantities could be used to predict an expected body fat percentage. 

## Problem
> Can body measurements be used to estimate the expected body fat percentage?

## Getting the data
This activity uses data recorded from 250 adults in the United States and includes the *Brozek* body fat percentage. The values recorded are:
* age (years)
* height (inches)
* weight (pounds)
* chest (cm)
* abdom (cm)
* brozek (%)

* Run the code in the boxes below to import the libraries/commands and import the data

In [None]:
# import pandas
import pandas as pd

# import matplotlib for plotting
import matplotlib.pyplot as plt

In [None]:
# importing the data
body_data=pd.read_csv('../input/body-measurements/measures.csv')

# inspecting the dataset to check that it has imported correctly
body_data.head()

## Exploring the data
Before trying to build a model you should check that the values in each field are appropriate.
* Add and run code in the box below that uses `describe` and/or boxplots to check each field

In [None]:
# describe and/or boxplots to check each field

You can use a scatter diagram to explore the association between each of the fields and the Brozek percentage.
* Run the code below to display the scatter diagram for brozek plotted against age

In [None]:
# display a scatter diagram for age vs brozek body fat percentage
body_data.plot.scatter(x='age', y='brozek', figsize=(12,8))
plt.show()

* Display scatter diagrams for the brozek body fat percentage (plotted on the y-axis) against the other measures (plotted on the x-axis).

In [None]:
# display more scatter diagrams


**Checkpoint**
> * Which of the other measurements appear to be associated with the Brozek body fat percentage? Use your scatter diagrams to justify your answer.
> * For each of the fields give at least one reason why you either would or wouldn't expect it to be associated with body fat percentage.

## Analysing the data (1)
For the analysis of this data you are going to construct models for predicting an expected Brozek body fat percentage. Initially you will look at simple linear functions of one variable: this is the same as *drawing a line of best fit* or *finding a regression line*. To do this you will need to import some additional libraries and commands:

`numpy` is the numerical library often used in Python programs. You need it here so that `sklearn` can read the lists/arrays of numbers correctly.

`sklearn` is the *scikit-learn* library. This is required for creating and analysing your linear model. You do not need the full library, so `linear_model` (for creating the model) and `r2_score` (for measuring how well the model fits the data) are imported individually.
* Run the code below to import these libraries and commands

In [None]:
# import numpy for handling lists/arrays
import numpy as np 

# import linear_model (for creating the model) and r2_score (for measuring how well the model fits the data)
from sklearn import linear_model
from sklearn.metrics import r2_score

### Creating training and testing datasets
In mathematics, you often judge the fit of a trend line or regression line by seeing how close it is to the data it was calculated from. In data science, an alternative method is often used: the line (or any other function) is fitted to one dataset (the *training* set) and then its predictions on another dataset (the *testing* set) are used to judge how good the fit is.

As you do not have two separate sources of data in this activity, you can achieve the same effect by splitting the dataset into two smaller parts. You can take the top 80% of rows for the training set and the bottom 20% of the rows for the testing set. *Note: this makes the assumption that the rows in the dataset are in a random order. If this had not been the case (for example, if the data had been in age order) then a random split into two sets would be needed.*

The training dataset is used to create the model. The testing dataset is used to evaluate the model. This is effectively the same as taking a new set of data and measuring how closely your predictions match reality.
* Run the code in the boxes below to select 80% of the dataset for the training subset and 20% for the testing subset

In [None]:
# create the training subset from 80% of the original dataset
body_data_train=body_data.head(200)
body_data_train.head()

In [None]:
# create the testing subset from 20% of the original dataset
body_data_test=body_data.tail(50)
body_data_test.head()
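
If the rows had been sorted (for example, by age), taking the first 80% would give a biased training set. A minimal sketch of a random split is shown here, assuming a small made-up table `demo_data` in place of `body_data`; the values are illustrative only and are not taken from `measures.csv`.

```python
import pandas as pd

# made-up stand-in for body_data: ten rows deliberately sorted by age
demo_data = pd.DataFrame({
    'age': [21, 25, 30, 34, 41, 47, 52, 58, 63, 70],
    'brozek': [9.5, 12.1, 14.8, 17.0, 19.3, 21.6, 22.4, 24.9, 26.0, 27.5],
})

# pick 80% of the rows at random; random_state makes the choice repeatable
demo_train = demo_data.sample(frac=0.8, random_state=1)

# the testing subset is every row that was not picked for training
demo_test = demo_data.drop(demo_train.index)

# check the sizes of the two subsets
print(demo_train.shape)
print(demo_test.shape)
```

Dropping the training rows by index means that no row can appear in both subsets.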

You can check that the training and testing datasets have split the data into appropriate proportions by using the `shape` command.
* Add and run some code in the box below to find the sizes of the training and testing datasets.

In [None]:
# find the size of the testing and training datasets


### Fitting a model to the data
Now you have a training dataset you can use this to try and fit a linear model to the data. This is achieved in two stages: creating a list of the target data and fitting a linear model to the data.
* Run the code below to create the target list from the training data

In [None]:
# create the target data for training as a list
target_train=body_data_train['brozek']
print(target_train)

To fit the model to the data you need to create the input column for the training data as a list, tell Python you want to use the Linear Regression model and fit this to the data. The `linear_model` command you have imported will fit the best line to the training subset. 
* Run the code below to fit the model and output the coefficient (i.e. gradient) and intercept of your linear model

In [None]:
# create an array for the input data
input_a_train=body_data_train[['weight']]

# define the model to be used as linear
model_a = linear_model.LinearRegression()

# fit a linear model to the data
model_a.fit(input_a_train, target_train)

# output the coefficients and y-intercept
print('Coefficients: \n', model_a.coef_)
print('Intercept: \n', model_a.intercept_)
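
For a single input variable, `LinearRegression` fits the ordinary least-squares line, which is the regression line met in A level Mathematics. The sketch below calculates the gradient and intercept by hand so you can see where the two numbers come from. The function name `fit_line` and the demo values are assumptions made for illustration; the values are not taken from `measures.csv`.

```python
def fit_line(xs, ys):
    """Return (gradient, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # S_xy and S_xx as used in the regression line formula
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    gradient = s_xy / s_xx
    # the least-squares line passes through the mean point
    intercept = mean_y - gradient * mean_x
    return gradient, intercept

# made-up illustrative values: weights (pounds) and body fat (%)
demo_weights = [150, 160, 170, 180, 190]
demo_fat = [12.0, 15.0, 17.0, 21.0, 25.0]

gradient, intercept = fit_line(demo_weights, demo_fat)
print('Gradient:', gradient)
print('Intercept:', intercept)
```

Applied to the `weight` and `brozek` columns of the training subset, a calculation like this should agree with `model_a.coef_` and `model_a.intercept_` up to rounding.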

### Judging the model
In this activity you are trying to create a model that will predict an average brozek body fat percentage in terms of other variables. This model can be judged by using the coefficient of determination, *R*². This indicates the proportion of the variation accounted for by the model; e.g. *R*²=0.7 would indicate that 70% of the variation in brozek is accounted for by the predictions of the model and 30% is due to deviations from the predicted value.
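
The proportion described above can be written as *R*² = 1 − (sum of squared prediction errors) / (sum of squared deviations of the actual values from their mean). The sketch below applies this formula directly; the function name `r_squared` and the demo values are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    # variation left over after using the model's predictions
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # total variation of the actual values about their mean
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# made-up illustrative values
demo_actual = [10.0, 20.0, 30.0]
demo_good = [12.0, 20.0, 28.0]   # predictions close to the actual values
demo_poor = [30.0, 20.0, 10.0]   # predictions that get the trend backwards

print(r_squared(demo_actual, demo_good))
print(r_squared(demo_actual, demo_poor))
```

The close predictions give a value of about 0.96. The backwards predictions give a negative value, which shows that *R*² measured on a testing set can fall below 0 when a model predicts worse than simply using the mean.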

When the model is a simple linear function of one variable (as it is in A level Mathematics) and is judged on the data it was fitted to, *R*² is just the same as *r*², the square of the product moment correlation coefficient (pmcc). The notation *R*² is used when the model is any more complicated than that. The *R*² measure can be used to compare the goodness of fit of different models including linear functions of more than one variable or non-linear functions. For more details see: https://en.wikipedia.org/wiki/Coefficient_of_determination
* Run the code in the boxes below to create the target list for testing and calculate *R*² for brozek body fat measure plotted against weight

In [None]:
# create the target data for testing as a list
target_test=body_data_test['brozek']
print(target_test)

In [None]:
# create list for the test input data
input_a_test=body_data_test[['weight']]

# use the input data to create a list of predictions
target_pred_a = model_a.predict(input_a_test)

# calculate and display the coefficient of determination
print("Brozek body fat percentage vs weight: R²="+str(r2_score(target_test, target_pred_a)))

### Exploring other fields
* In the boxes below add and run code to fit a linear model and find the value of *R*² for at least two of the other fields as input variables - you should choose the input variables that appeared to be most associated with *brozek*

In [None]:
# fit a linear model


In [None]:
# calculate R²


**Checkpoint**
* Which of the other fields gave the best models for predicting the expected *Brozek* body fat percentage?
* State the equations for your lines of best fit algebraically (e.g. you could write the prediction for body fat, *f*, based on weight, *w*, as *f* = 3.1*w* + 20). 
* Choose at least three rows from the original dataset and use each of your linear models to calculate the expected *Brozek* body fat percentage. How well do these compare to the actual recorded *Brozek* body fat percentage?

## Analysing the data (2)
### Fitting and measuring a linear model in two variables to the training dataset 
The models created so far used a single variable to create a prediction but you are not restricted to this: it is possible to create a prediction model using a combination of two, or more, input variables. 

This is difficult to visualise. You could represent a linear model of two input variables used to predict a third value as a plane in 3D space, but for more than two input variables a simple visualisation isn't viable. However, it is straightforward to write linear functions of multiple variables algebraically. For example, a linear function predicting body fat (*f*) based on weight (*w*) and height (*h*) can be written as:

*f* = *a w* + *b h* + *c*

where *a*, *b* and *c* are coefficients/intercept to be determined.
* Run the code below to create and measure the model for body fat as a linear function of weight and chest size - *Note that the first model used here is labelled as g as you might have used up to 6 models in the previous analysis*

In [None]:
# create an array for the input data
input_g_train=body_data_train[['weight','chest']]

# define the model to be used as linear
model_g = linear_model.LinearRegression()

# fit a linear model to the data
model_g.fit(input_g_train, target_train)

# output the coefficients and y-intercept
print('Coefficients: \n', model_g.coef_)
print('Intercept: \n', model_g.intercept_)
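
`coef_` holds one coefficient per input column, in the same order as the columns were listed, so a prediction can be calculated by hand from the printed values. The sketch below shows the arithmetic; the function name `predict_linear` and all of the numbers are made up for illustration and are not the fitted values for this dataset.

```python
def predict_linear(coefficients, intercept, inputs):
    """Evaluate a linear model: sum of coefficient * input, plus the intercept."""
    return sum(c * x for c, x in zip(coefficients, inputs)) + intercept

# made-up coefficients for two input variables, in column order
demo_coefficients = [0.5, -0.2]
demo_intercept = 3.0

# one made-up row of input values, in the same column order
demo_row = [10.0, 5.0]

demo_prediction = predict_linear(demo_coefficients, demo_intercept, demo_row)
print(demo_prediction)
```

Here the calculation is 0.5 × 10 − 0.2 × 5 + 3. The same function works for any number of input variables, provided the inputs are given in the same order as the coefficients.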

In [None]:
# create an array for the input data
input_g_test=body_data_test[['weight','chest']]

# use the input data to create a list of predictions
target_pred_g = model_g.predict(input_g_test)

# The coefficient of determination: 1 is perfect prediction
print("Brozek body fat percentage vs weight and chest size: R²="+str(r2_score(target_test, target_pred_g)))

### Exploring other pairs of fields
* In the boxes below add and run code to fit a linear model and find the value of *R*² for other pairs of input variables. You should do this for at least two other pairs of fields as input variables. Your analysis in the previous section should help you decide which pairs to use based on those that appeared to be most associated with *brozek*.

In [None]:
# find the model



In [None]:
# measure the model


**Checkpoint**
* Which of the pairs of other fields gave the best models for predicting the expected *Brozek* body fat percentage? Were these models better than the linear functions of a single variable?
* State the equations for your models algebraically (e.g. you could write the prediction for body fat, *f*, based on weight, *w*, and height, *h*, as *f* = 3.1*w* + 2.1*h* + 20).
* Choose at least three rows from the original dataset and use each of your linear models to calculate the expected *Brozek* body fat percentage. How well do these compare to the actual recorded *Brozek* body fat percentage?

## Analysing the data (3) - extension work 
You are not restricted to linear functions of single variables or pairs of variables. In the boxes below you could try the extension task of fitting and judging some other models for the data, such as linear models of more than two variables or non-linear models using products, powers, exponential or logarithmic functions.

*Note: BMI is one such non-linear model that is calculated by dividing the weight (in kg) by the square of the height (in m).*
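
The dataset records weight in pounds and height in inches, so both need converting before BMI can be calculated. The sketch below shows one way to do this, using the exact definitions 1 pound = 0.45359237 kg and 1 inch = 0.0254 m. The function name `bmi_from_imperial` and the demo person are made up for illustration.

```python
KG_PER_POUND = 0.45359237   # exact by definition
METRES_PER_INCH = 0.0254    # exact by definition

def bmi_from_imperial(weight_pounds, height_inches):
    """Return BMI (kg per square metre) from weight in pounds and height in inches."""
    weight_kg = weight_pounds * KG_PER_POUND
    height_m = height_inches * METRES_PER_INCH
    return weight_kg / height_m ** 2

# made-up illustrative person: 154 pounds, 68 inches tall
demo_bmi = bmi_from_imperial(154.0, 68.0)
print(round(demo_bmi, 1))
```

The same arithmetic could be applied to the `weight` and `height` columns to create a new input field for a model.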

In [None]:
# find the model

In [None]:
# measure the model

## Communicating the result
Use your analysis above to answer the original problem:
> Can body measurements be used to estimate the expected body fat percentage?