# Body Fat Percentage Prediction

This notebook is a workflow that uses various Python-based machine learning models to predict body fat percentage.

We will take the following approach:

1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Model Evaluation

# 1. Problem Definition

Given a set of body measurements, can we predict body fat percentage?

# 2. Data

https://www.kaggle.com/fedesoriano/body-fat-prediction-dataset

## Context

The dataset lists estimates of the percentage of body fat, determined by underwater weighing, and various body circumference measurements for 252 men.

## Educational use of the dataset

This data set can be used to illustrate multiple regression techniques. Accurate measurement of body fat is inconvenient/costly and it is desirable to have easy methods of estimating body fat that are not inconvenient/costly.

## Content

The variables listed below, from left to right, are:

    Density determined from underwater weighing
    Percent body fat from Siri's (1956) equation
    Age (years)
    Weight (lbs)
    Height (inches)
    Neck circumference (cm)
    Chest circumference (cm)
    Abdomen 2 circumference (cm)
    Hip circumference (cm)
    Thigh circumference (cm)
    Knee circumference (cm)
    Ankle circumference (cm)
    Biceps (extended) circumference (cm)
    Forearm circumference (cm)
    Wrist circumference (cm)

(Measurement standards are apparently those listed in Behnke and Wilmore (1974), pp. 45-48 where, for instance, the abdomen 2 circumference is measured "laterally, at the level of the iliac crests, and anteriorly, at the umbilicus".)

These data are used to produce the predictive equations for lean body weight given in the abstract "Generalized body composition prediction equation for men using simple measurement techniques", K.W. Penrose, A.G. Nelson, A.G. Fisher, FACSM, Human Performance Research Center, Brigham Young University, Provo, Utah 84602 as listed in Medicine and Science in Sports and Exercise, vol. 17, no. 2, April 1985, p. 189. (The predictive equations were obtained from the first 143 of the 252 cases that are listed below).

## More details

A variety of popular health books suggest that the readers assess their health, at least in part, by estimating their percentage of body fat. In Bailey (1994), for instance, the reader can estimate body fat from tables using their age and various skin-fold measurements obtained by using a caliper. Other texts give predictive equations for body fat using body circumference measurements (e.g. abdominal circumference) and/or skin-fold measurements. See, for instance, Behnke and Wilmore (1974), pp. 66-67; Wilmore (1976), p. 247; or Katch and McArdle (1977), pp. 120-132.

The percentage of body fat for an individual can be estimated once body density has been determined. Folks (e.g. Siri (1956)) assume that the body consists
of two components - lean body tissue and fat tissue. Letting:

    D = Body Density (gm/cm^3)
    A = proportion of lean body tissue
    B = proportion of fat tissue (A+B=1)
    a = density of lean body tissue (gm/cm^3)
    b = density of fat tissue (gm/cm^3)

we have:

D = 1/[(A/a) + (B/b)]

solving for B we find:

B = (1/D)*[ab/(a-b)] - [b/(a-b)].

Using the estimates a=1.10 gm/cm^3 and b=0.90 gm/cm^3 (see Katch and McArdle (1977), p. 111 or Wilmore (1976), p. 123) we come up with "Siri's equation":

Percentage of Body Fat (i.e. 100*B) = 495/D - 450.
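The step from the general expression for B to the constants 495 and 450 can be checked numerically: with a=1.10 and b=0.90, ab/(a-b) = 0.99/0.20 = 4.95 and b/(a-b) = 0.90/0.20 = 4.5, and multiplying by 100 gives the two coefficients. The sketch below is illustrative only; the function name and the example density are assumptions, not part of the dataset description.

```python
def siri_body_fat_percent(density, lean_density=1.10, fat_density=0.90):
    """Percent body fat from body density (gm/cm^3) under the two-component model."""
    a, b = lean_density, fat_density
    # B = (1/D) * [ab/(a-b)] - [b/(a-b)], the fat proportion derived above
    fat_proportion = (1.0 / density) * (a * b / (a - b)) - b / (a - b)
    return 100.0 * fat_proportion


# With the default densities the general form reduces to 495/D - 450.
example_density = 1.0708  # illustrative value (assumption)
general = siri_body_fat_percent(example_density)
shortcut = 495.0 / example_density - 450.0
print(round(general, 2), round(shortcut, 2))
```

A body made entirely of lean tissue (D = a) gives 0% and one made entirely of fat (D = b) gives 100%, which is a quick sanity check on the formula.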

Volume, and hence body density, can be accurately measured in a variety of ways. The technique of underwater weighing "computes body volume as the difference between body weight measured in air and weight measured during water submersion. In other words, body volume is equal to the loss of weight in
water with the appropriate temperature correction for the water's density" (Katch and McArdle (1977), p. 113). Using this technique,

Body Density = WA/[(WA-WW)/c.f. - LV]

where:

    WA = Weight in air (kg)
    WW = Weight in water (kg)
    c.f. = Water correction factor (=1 at 39.2 deg F as one-gram of water occupies exactly one cm^3 at this temperature, =.997 at 76-78 deg F)
    LV = Residual Lung Volume (liters)

(Katch and McArdle (1977), p. 115). Other methods of determining body volume are given in Behnke and Wilmore (1974), p. 22 ff.
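The underwater weighing formula above can be sketched as a small function. The function name, argument names, and the subject's measurements are hypothetical and chosen only for illustration; the default correction factor of 0.997 is the value quoted above for water at 76-78 deg F.

```python
def body_density(weight_air_kg, weight_water_kg, residual_lung_volume_l,
                 water_correction=0.997):
    """Body density (gm/cm^3) from underwater weighing."""
    # Body volume in litres: weight lost in water, corrected for water density,
    # minus the air still held in the lungs.
    volume_l = ((weight_air_kg - weight_water_kg) / water_correction
                - residual_lung_volume_l)
    # kg per litre is numerically equal to gm per cm^3
    return weight_air_kg / volume_l


# Hypothetical subject: 80 kg in air, 3.5 kg submerged, 1.3 L residual lung volume.
density = body_density(80.0, 3.5, 1.3)
print(round(density, 4))
```

The result can then be passed to Siri's equation to obtain a body fat percentage.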

## Source

The data were generously supplied by Dr. A. Garth Fisher who gave permission to freely distribute the data and use for non-commercial purposes.

Roger W. Johnson
Department of Mathematics & Computer Science
South Dakota School of Mines & Technology
501 East St. Joseph Street
Rapid City, SD 57701

email address: rwjohnso@silver.sdsmt.edu
web address: http://silver.sdsmt.edu/~rwjohnso

## References

Bailey, Covert (1994). Smart Exercise: Burning Fat, Getting Fit, Houghton-Mifflin Co., Boston, pp. 179-186.

Behnke, A.R. and Wilmore, J.H. (1974). Evaluation and Regulation of Body Build and Composition, Prentice-Hall, Englewood Cliffs, N.J.

Siri, W.E. (1956), "Gross composition of the body", in Advances in Biological and Medical Physics, vol. IV, edited by J.H. Lawrence and C.A. Tobias, Academic Press, Inc., New York.

Katch, Frank and McArdle, William (1977). Nutrition, Weight Control, and Exercise, Houghton Mifflin Co., Boston.

Wilmore, Jack (1976). Athletic Training and Physical Fitness: Physiological Principles of the Conditioning Process, Allyn and Bacon, Inc., Boston.

# 3. Evaluation

We will build a regression model and evaluate it using the Root Mean Square Error (RMSE).

## Task Details

If you have some experience with R or Python and know the basics of machine learning, this is a suitable task. It is aimed at data science students who have completed an online course in machine learning and want to expand their skill set before trying a featured competition.

## Expected Submission

Predictions of the body fat percentage for a test set. Note that the use of the Density variable is not allowed, since body fat is calculated directly from density.

## Evaluation

Submissions are evaluated using RMSE.
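RMSE is the square root of the mean of the squared differences between predicted and true values, so it is expressed in the same units as the target (percentage points of body fat here). A minimal pure-Python sketch follows; the function name and the sample values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up body fat percentages, for illustration only.
actual = [12.0, 20.0, 25.0]
predicted = [14.0, 18.0, 26.0]
print(rmse(actual, predicted))
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.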

# 4. Features

## Input / Features
1. Density determined from underwater weighing (excluded from modelling, since the task does not allow its use)
3. Age (years)
4. Weight (lbs)
5. Height (inches)
6. Neck circumference (cm)
7. Chest circumference (cm)
8. Abdomen 2 circumference (cm)
9. Hip circumference (cm)
10. Thigh circumference (cm)
11. Knee circumference (cm)
12. Ankle circumference (cm)
13. Biceps (extended) circumference (cm)
14. Forearm circumference (cm)
15. Wrist circumference (cm)

## Output / Label
2. Percent body fat from Siri's (1956) equation

## Standard Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df = pd.read_csv('/kaggle/input/body-fat-prediction-dataset/bodyfat.csv')
df.head()

## Data Exploration (Exploratory Data Analysis, EDA)

In [None]:
df

As noted in the task description, the Density variable may not be used, since body fat is calculated directly from density. We therefore drop the Density column from the dataset.

In [None]:
df = df.drop('Density', axis = 1)

In [None]:
df

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
plt.figure(figsize=(20,10))
plt.title('Histogram of BodyFat in the dataset')
sns.histplot(data=df, x='BodyFat', bins=40, kde=True);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Age and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Age',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Weight and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Weight',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Height and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Height',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Neck and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Neck',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Chest and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Chest',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Abdomen and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Abdomen',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Hip and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Hip',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Thigh and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Thigh',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Knee and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Knee',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Ankle and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Ankle',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Biceps and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Biceps',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Forearm and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Forearm',s=100);

In [None]:
plt.figure(figsize=(20,10))
plt.title('Plot of Wrist and Body fat')
sns.scatterplot(data=df, x='BodyFat',y='Wrist',s=100);

In [None]:
df.describe()

In [None]:
plt.figure(figsize=(20,20))
plt.title('Heat Map of Correlation')
sns.heatmap(data=df.corr(), annot=True)

# 5. Modelling

In [None]:
X = df.drop('BodyFat', axis=1)
y = df['BodyFat']

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
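The scaling cell above fits the scaler on the training split only and reuses those statistics on the test split, so no information from the test data enters preprocessing. The sketch below mirrors that idea for a single column in plain Python, assuming the usual standardization z = (x - mean) / std with the population standard deviation (which is what `StandardScaler` uses by default). The helper names and the sample values are hypothetical.

```python
import math


def fit_standardizer(values):
    """Return (mean, std) of one training column, using the population std."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics learned from training data."""
    return [(v - mean) / std for v in values]


train_weights = [150.0, 180.0, 210.0]  # made-up training column
test_weights = [180.0, 240.0]          # made-up test column

mean, std = fit_standardizer(train_weights)
train_scaled = standardize(train_weights, mean, std)
test_scaled = standardize(test_weights, mean, std)  # reuses training statistics
print(train_scaled, test_scaled)
```

The scaled training column has mean zero by construction; the test column generally does not, because it is transformed with the training mean and standard deviation.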

## Importing Models

In [None]:
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR, LinearSVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor

In [None]:
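def fit_and_score(models, X_train, X_test, y_train, y_test):
    """Fit each model on the training split and score it on the test split.

    For scikit-learn regressors, `score` returns the coefficient of
    determination (R^2), so the 'Accuracy' column holds R^2 values.
    """
    np.random.seed(42)

    model_scores = {}

    for name, model in models.items():
        model.fit(X_train, y_train)
        model_scores[name] = model.score(X_test, y_test)

    # One row per model, sorted from lowest to highest score
    model_scores = pd.DataFrame(model_scores, index=['Accuracy'])
    model_scores = model_scores.transpose().sort_values('Accuracy')

    return model_scores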
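For scikit-learn regressors, `model.score` returns the coefficient of determination R² rather than a classification accuracy, so the 'Accuracy' column holds R² values. R² compares the model's squared error with that of always predicting the mean of the targets. The sketch below computes it by hand; the function name and the sample values are made up for illustration.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, for illustration only.
actual = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 33.0]
print(r2_manual(actual, predicted))

# Always predicting the mean gives R^2 = 0; worse models can go negative.
print(r2_manual(actual, [20.0, 20.0, 20.0]))
```

A score of 1.0 means perfect predictions, which is why higher values in the baseline table indicate better models.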

## Baseline Models and Scores

In [None]:
models = {'Ridge' : Ridge(),
         'Lasso': Lasso(),
         'ElasticNet': ElasticNet(),
         'KNeighborsRegressor': KNeighborsRegressor(),
         'SVR': SVR(),
         'DecisionTreeRegressor': DecisionTreeRegressor(),
         'RandomForestRegressor':RandomForestRegressor(),
         'GradientBoostingRegressor': GradientBoostingRegressor(),
         'AdaBoostRegressor': AdaBoostRegressor()}

In [None]:
baseline_model_scores_df = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
baseline_model_scores_df.sort_values('Accuracy')

In [None]:
plt.figure(figsize=(20,10))
sns.barplot(data=baseline_model_scores_df.T)
plt.title('Baseline Model Scores (R²)')
plt.xticks(rotation=90);

Based on the baseline scores, we will tune the hyperparameters of the best-performing model:
1. AdaBoostRegressor (R² of 0.697803)

## Hyperparameter Tuning via Grid Search CV

In [None]:
from sklearn.model_selection import GridSearchCV
from warnings import filterwarnings

In [None]:
filterwarnings('ignore')

### Grid Search CV 1

In [None]:
def gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test):
    """Grid-search each model with 5-fold CV, then score it on the test split.

    Returns a DataFrame of test-set negative MSE values and a dict of the
    best parameters found for each model.
    """
    np.random.seed(42)

    model_gs_scores = {}
    model_gs_best_param = {}

    for name, model in models.items():
        gs_model = GridSearchCV(model,
                                param_grid=params[name],
                                scoring='neg_mean_squared_error',
                                n_jobs=-1,
                                cv=5,
                                verbose=2)

        gs_model.fit(X_train, y_train)

        # score() applies the `scoring` metric to the refit best estimator
        model_gs_scores[name] = gs_model.score(X_test, y_test)
        model_gs_best_param[name] = gs_model.best_params_

    model_gs_scores = pd.DataFrame(model_gs_scores, index=['neg_mean_squared_error'])
    model_gs_scores = model_gs_scores.transpose().sort_values('neg_mean_squared_error')

    return model_gs_scores, model_gs_best_param

In [None]:
models = {'AdaBoostRegressor':AdaBoostRegressor()}
params = {'AdaBoostRegressor': {'n_estimators': [25,50,100,200,400],
                                'learning_rate': [0.01,0.1,0.5,0.8,1],
                                'loss': ['linear', 'square', 'exponential']}}

In [None]:
model_gs_scores_1, model_gs_best_param_1 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_1

In [None]:
model_gs_best_param_1

### Grid Search CV 2

In [None]:
models = {'AdaBoostRegressor':AdaBoostRegressor()}
params = {'AdaBoostRegressor': {'n_estimators': [40,50,60,70],
                                'learning_rate': [0.6,0.7,0.8,0.9],
                                'loss': ['linear']}}

In [None]:
model_gs_scores_2, model_gs_best_param_2 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_2

In [None]:
model_gs_best_param_2

### Grid Search CV 3

In [None]:
models = {'AdaBoostRegressor':AdaBoostRegressor()}
params = {'AdaBoostRegressor': {'n_estimators': np.arange(60,70),
                                'learning_rate': [0.8],
                                'loss': ['linear']}}

In [None]:
model_gs_scores_3, model_gs_best_param_3 = gridsearch_cv_scores(models, params, X_train, X_test, y_train, y_test)

In [None]:
model_gs_scores_3

In [None]:
model_gs_best_param_3

Across the grid searches, which were scored with negative mean squared error, the best AdaBoostRegressor configuration achieves a test-set score of -14.641099.
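scikit-learn's `neg_mean_squared_error` scorer reports the mean squared error with its sign flipped, so that a higher score is always better. To relate the grid search score to the RMSE used for evaluation, negate it and take the square root. The variable names below are illustrative.

```python
import math

neg_mse = -14.641099  # grid search score reported above
mse = -neg_mse        # undo the sign flip applied by the scorer
rmse_from_grid = math.sqrt(mse)
print(round(rmse_from_grid, 4))
```

The result is in percentage points of body fat, the same units as the target.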

# 6. Model Evaluation

In [None]:
model = AdaBoostRegressor(learning_rate=0.8, loss='linear', n_estimators=67, random_state=42)
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
r2 = r2_score(y_test,y_preds)
mae = mean_absolute_error(y_test, y_preds)
mse = mean_squared_error(y_test, y_preds)
rmse = np.sqrt(mse)

In [None]:
print(f'R2 Score: {r2}')
print(f'Mean Absolute Error: {mae}')
print(f'Mean Square Error: {mse}')
print(f'Root Mean Square Error: {rmse}')

Using an AdaBoost regressor, the final model achieves a Root Mean Square Error of 3.8263689684545876 on the test set.