# **Project Name**    -



##### **Project Type**    - Regression
##### **Contribution**    - Team
##### **Team Member 1 -** Tanika (2210992437)
##### **Team Member 2 -** Tamanna Kansal (2210992436)
##### **Team Member 3 -** Swastik Tara (2210992433)
##### **Team Member 4 -** Tanish Singla (2210992438)

# **Project Summary -**

In recent years, there has been a growing interest in leveraging artificial intelligence (AI) and machine learning (ML) techniques to develop predictive models for various applications, including health and fitness. One such area of focus is the development of calorie burn prediction models, which aim to accurately estimate the number of calories burned during physical activities. These models have the potential to revolutionize the way individuals track and manage their fitness goals, providing personalized insights and recommendations for optimizing workouts and overall health.

Introduction to Calories Burnt Prediction Model

Calorie burn prediction models utilize data from various sources, including wearable fitness trackers, heart rate monitors, accelerometers, and other sensors, to estimate the energy expenditure associated with different physical activities. These models take into account factors such as age, gender, weight, height, heart rate, duration of activity, and intensity level to generate accurate calorie burn estimates.

Data Collection and Preprocessing

The first step in building a calorie burn prediction model is to collect and preprocess the data required for training the model. This may involve gathering data from diverse sources such as fitness apps, smartwatches, fitness equipment, and research studies. The collected data is then cleaned, normalized, and preprocessed to remove noise, outliers, and inconsistencies, ensuring that it is suitable for training the ML model.
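As a rough illustration of the cleaning steps described above, the sketch below drops exact duplicate rows, filters outliers with the 1.5 times IQR rule, and standardizes what remains. The sample values, the IQR rule, and z-score scaling are all assumptions made for this example; the project may clean its data differently.

```python
import pandas as pd

# Made-up readings for illustration only; the last row is an obvious outlier
raw = pd.DataFrame({
    "Duration": [10.0, 12.0, 11.0, 13.0, 12.0, 300.0],
    "Heart_Rate": [90.0, 95.0, 92.0, 97.0, 95.0, 94.0],
})

# Step 1: remove exact duplicate rows
deduped = raw.drop_duplicates()

# Step 2: keep rows whose Duration lies within 1.5 * IQR of the quartiles
q1, q3 = deduped["Duration"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = deduped["Duration"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = deduped[in_range]

# Step 3: standardize each column to zero mean and unit variance
normalized = (cleaned - cleaned.mean()) / cleaned.std(ddof=0)

print(len(raw), len(deduped), len(cleaned))
print(normalized.round(3))
```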

Feature Engineering

Feature engineering plays a crucial role in developing an effective calorie burn prediction model. This involves selecting and extracting relevant features from the raw data that are most informative for predicting calorie burn. Features may include heart rate patterns, activity type, duration, intensity, metabolic rate, and individual characteristics such as age, weight, and gender. Advanced techniques such as time-series analysis and signal processing may be employed to extract meaningful features from sensor data.

Model Selection and Training

Once the data is prepared and features are engineered, the next step is to select an appropriate ML algorithm and train the prediction model. Commonly used ML algorithms for calorie burn prediction include linear regression, decision trees, random forests, support vector machines (SVM), neural networks, and ensemble methods. The choice of algorithm depends on factors such as the complexity of the data, the size of the dataset, and the desired accuracy of the predictions. The model is trained using labeled data, where the input features are paired with the corresponding calorie burn measurements.
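The training step can be sketched as follows, assuming scikit-learn is available. The data here is synthetic: the two features, the coefficients 6.0 and 0.5, and the noise level are invented for illustration and are not taken from the project dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed linear relationship plus noise)
rng = np.random.default_rng(0)
duration = rng.uniform(1, 30, size=200)
heart_rate = rng.uniform(70, 120, size=200)
calories = 6.0 * duration + 0.5 * heart_rate + rng.normal(0, 2, size=200)

# Pair the input features with the labelled target values
X = np.column_stack([duration, heart_rate])
X_train, X_test, y_train, y_test = train_test_split(
    X, calories, test_size=0.2, random_state=42
)

# Fit on the training split only, then score on the held-out split
model = LinearRegression().fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)

print("Learned coefficients:", model.coef_.round(2))
print("Held-out R-squared:", round(test_r2, 3))
```

Holding out 20% of the rows keeps the score honest: the model never sees those rows while fitting.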

Evaluation and Validation

After training the model, it is essential to evaluate its performance and validate its accuracy using separate test datasets. Evaluation metrics such as mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R-squared) are commonly used to assess the model's predictive accuracy. The model may undergo iterative refinement and optimization based on the evaluation results to improve its performance further.
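For reference, the three metrics named above can be computed directly from their definitions. This is a minimal sketch using NumPy; the function name `regression_metrics` and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))            # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))     # penalizes large errors more
    ss_res = np.sum(errors ** 2)             # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}

# Made-up calorie values for illustration only
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 210.0, 240.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the same unit as the target (calories), while R-squared is unitless and describes the share of variance the model explains.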

Deployment and Integration

Once the calorie burn prediction model has been trained and validated, it can be deployed and integrated into various fitness tracking platforms, mobile apps, wearable devices, and health monitoring systems. Users can then access real-time calorie burn estimates based on their physical activities, receive personalized recommendations for optimizing workouts, and track their progress towards fitness goals.

Challenges and Future Directions

Despite the advancements in AI and ML techniques for calorie burn prediction, several challenges remain. These include issues related to data quality, variability in individual metabolic rates, the accuracy of sensor measurements, and the generalization of models across diverse populations and activity types. Future research directions may focus on addressing these challenges through the development of more robust and accurate prediction models, incorporating additional physiological and contextual factors, and leveraging advanced AI techniques such as deep learning and reinforcement learning.




# **GitHub Link -**

https://github.com/tanika04

# **Problem Statement**


The problem of accurately estimating calorie burn during physical activities is a significant challenge in the field of health and fitness monitoring. While traditional methods rely on generic formulas and approximations based on factors such as activity type, duration, and intensity level, these approaches often lack precision and fail to account for individual variability in metabolic rates and physiological responses. As a result, there is a growing demand for more personalized and accurate calorie burn prediction models that leverage artificial intelligence (AI) and machine learning (ML) techniques.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure the following format is followed for each chart.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ML algorithms for model creation. Make sure the following format is followed for each algorithm.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross-Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

### Dataset Loading

In [None]:
# Load the exercise CSV file
exercise_df = pd.read_csv('/content/exercise.csv')

# Load the calories CSV file
calories_df = pd.read_csv('/content/calories.csv')

# Merge the two dataframes on the shared 'User_ID' key so that each
# exercise session is paired with its own calorie value
# (assigning the column by row position would silently misalign rows
# if the two files were ordered differently)
merged_df = pd.merge(exercise_df, calories_df, on='User_ID')

# Save the merged dataframe to a new CSV file
merged_df.to_csv('merged_exercise.csv', index=False)

### Dataset First View

In [None]:
# Dataset First Look
df=pd.read_csv('merged_exercise.csv')
df

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(df.shape)
num_rows, num_columns = df.shape
print("Number of rows:", num_rows)
print("Number of columns:", num_columns)

df.count()

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
duplicate_count = df.duplicated().sum()
print(duplicate_count)

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
null_values=df.isnull().sum()
print(null_values)

In [None]:
# Visualizing the missing values
plt.figure(figsize = (10,7))
sns.heatmap(df.isnull(), cbar=False)
plt.show()

### What did you know about your dataset?

1) The dataframe has 15000 rows and 9 columns.

2) It has no missing values.

3) Six columns are of type float64, two are of type int64, and one is of type object.

4) The heatmap provides a visual representation of missing values across all data points. It facilitates the identification of patterns in missing data, such as concentration within specific subsets of rows. In this case, since there are no missing values, the heatmap would exhibit uniformity, indicating the absence of any gaps in the data.
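The first three observations can be checked programmatically. The sketch below uses two made-up rows that only mimic the column layout of the merged dataset; on the real data the same three calls would be applied to `df`.

```python
import pandas as pd

# Two made-up rows with the same column layout as the merged dataset
sample = pd.DataFrame({
    "User_ID": [1, 2],
    "Gender": ["male", "female"],
    "Age": [30, 25],
    "Height": [175.0, 160.0],
    "Weight": [70.0, 55.0],
    "Duration": [20.0, 15.0],
    "Heart_Rate": [100.0, 95.0],
    "Body_Temp": [40.2, 39.8],
    "Calories": [120.0, 80.0],
})

n_rows, n_cols = sample.shape                            # rows and columns
total_missing = int(sample.isnull().sum().sum())         # all missing cells
dtype_counts = sample.dtypes.astype(str).value_counts()  # columns per dtype

print(n_rows, n_cols)
print(total_missing)
print(dtype_counts)
```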

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
print(df.columns)

In [None]:
# Dataset Describe
df.describe()

### Variables Description

1. User ID: Unique identifier for each user.

2. Gender: Gender of the user (male/female).

3. Age: Age of the user in years.

4. Height: Height of the user in centimeters.

5. Weight: Weight of the user in kilograms.

6. Duration: Duration of the physical activity session in minutes.

7. Heart Rate: Average heart rate of the user during the activity session.

8. Body Temperature: Average body temperature of the user during the activity session, measured in Celsius.

9. Calories: Number of calories burnt by the user during the activity session, which serves as the target variable for prediction.

### Check Unique Values for each variable.

In [None]:
# Count the number of distinct values in each column
df.nunique()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Re-check the number of missing values in each column before aggregating
missing_values = df.isnull().sum()

# Display the number of missing values in each column
missing_values

In [None]:
# Basic Aggregations
# 1. Total calories burnt by age
total_caloriesburnt_by_age = df.groupby('Age')['Calories'].sum()
print("Total calories burnt by age:")
total_caloriesburnt_by_age

In [None]:
# Basic Aggregations
# 2. Total calories burnt by gender
total_caloriesburnt_by_Gender = df.groupby('Gender')['Calories'].sum()
print("Total calories burnt by gender:")
total_caloriesburnt_by_Gender

In [None]:
# Basic Aggregations
# 3. Total calories burnt by duration
print("Total calories burnt by duration:")
df.groupby('Duration')['Calories'].sum()

In [None]:
# Basic Aggregations
# 4. Total calories burnt by height
total_caloriesburnt_by_Height = df.groupby('Height')['Calories'].sum()
print("Total calories burnt by height:")
total_caloriesburnt_by_Height

In [None]:
# Multi-Level GroupBy
#calories burnt by age and gender
calories_burnt_by_agegender  = df.groupby(['Age', 'Gender'])['Calories'].sum()
print("\ncalories burnt by age and gender :")
print(calories_burnt_by_agegender)

In [None]:
# Multi-Level GroupBy
#calories burnt by age and Duration
df.groupby(['Age', 'Duration'])['Calories'].sum()


### What all manipulations have you done and insights you found?

Data Types and Format Checking: I've identified the data types of each column based on the given dataset.

Variable Descriptions: I've provided descriptions for each variable/column based on its name and the provided dataset.

Data Quality Check: By inspecting the dataset, I can ensure that there are no missing values or anomalies in the provided data. This quality check is crucial before performing any analysis or modeling tasks.

Visualization: There are visualizations such as histograms, scatter plots, and box plots to better understand the distribution and relationships within the dataset.

Predictive Modeling: In the subsequent analysis, focus was on predicting calorie burn during physical activity by employing regression techniques. This process entailed dividing the dataset into training and testing subsets, identifying relevant features, and assessing the model's predictive performance.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Age', y='Calories')
plt.title('Age vs. Calories Burnt')
plt.xlabel('Age')
plt.ylabel('Calories')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot because it effectively displays the relationship between two continuous variables, in this case, age and calories burnt. Scatter plots are great for visualizing correlations or patterns in data points. By plotting age on the x-axis and calories burnt on the y-axis, we can see if there's any discernible trend or relationship between these two variables.

##### 2. What is/are the insight(s) found from the chart?

 The plot may reveal whether there's a correlation between age and calories burnt. For instance, if the points trend upward from left to right, it suggests that as age increases, so does the number of calories burnt.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Age vs. Calories Burnt scatter plot can potentially lead to positive business impacts.

Understanding the relationship between age and calories burnt can help fitness and wellness businesses tailor their marketing efforts to different age demographics.

#### Chart - 2

In [None]:
# Chart - 2 visualization code
# Example 2: Box plot of gender vs. duration
plt.figure(figsize=(8, 5))
sns.boxplot(data=df, x='Gender', y='Duration')
plt.title('Gender vs. Duration')
plt.xlabel('Gender')
plt.ylabel('Duration (minutes)')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot for comparing Gender vs. Duration because it effectively summarizes the distribution of duration values within each gender category.

Box plots allow for a clear comparison of the central tendency, spread, and skewness of the duration data between different gender groups.

##### 2. What is/are the insight(s) found from the chart?

Comparison of Central Tendency: The plot allows for a comparison of the median duration values between different genders.

This can show whether the typical (median) duration of activities differs notably between males and females.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Gender vs. Duration box plot can potentially lead to positive business impacts.

Understanding gender-based differences in activity durations can help businesses tailor their marketing strategies to better appeal to specific gender demographics.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
# Bar graph of average calories burnt by gender
avg_calories_gender = df.groupby('Gender')['Calories'].mean().reset_index()
plt.figure(figsize=(8, 5))
sns.barplot(data=avg_calories_gender, x='Gender', y='Calories')
plt.title('Average Calories Burnt by Gender')
plt.xlabel('Gender')
plt.ylabel('Calories')
plt.show()

##### 1. Why did you pick the specific chart?


I chose a bar graph for visualizing the average calories burnt by gender because it effectively presents a comparison of the mean calorie expenditure between different gender groups.

##### 2. What is/are the insight(s) found from the chart?

The graph clearly shows the average calorie expenditure for males and females, providing insights into potential gender-based differences in physical activity levels or metabolic rates.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the bar graph of Average Calories Burnt by Gender can potentially lead to positive business impacts.

#### Chart - 4

In [None]:
# Chart - 4 visualization code
# Histogram of calories burnt distribution
plt.figure(figsize=(10, 6))
plt.hist(df['Calories'], bins=20, color='skyblue', edgecolor='black')
plt.title('Calories Burnt Distribution')
plt.xlabel('Calories Burnt')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram for visualizing the distribution of calories burnt because it effectively illustrates the frequency distribution of a continuous variable, in this case, the calories burnt.

##### 2. What is/are the insight(s) found from the chart?

Shape of the Distribution: The histogram provides insights into the shape of the distribution of calorie burnt values. For example, it may reveal whether the distribution is symmetric, skewed to the left (negative skew), or skewed to the right (positive skew).
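The visual impression of skew can be backed by a number. The sketch below uses pandas' `Series.skew()` on two made-up series: a value near zero suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail. On the project data the same call would be `df['Calories'].skew()`.

```python
import pandas as pd

# Made-up values for illustration only
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([10.0, 20.0, 20.0, 30.0, 30.0, 30.0, 40.0, 200.0])

symmetric_skew = symmetric.skew()        # evenly spread around the mean
right_skew = right_skewed.skew()         # one large value stretches the right tail

print("Symmetric sample skewness:", round(symmetric_skew, 3))
print("Right-skewed sample skewness:", round(right_skew, 3))
```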

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the histogram of Calories Burnt Distribution can potentially lead to positive business impacts.

Understanding the distribution of calorie burnt values can help businesses in the fitness or wellness industry tailor their product offerings to better meet the needs and preferences of their target audience.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
# Pie chart of gender distribution
gender_counts = df['Gender'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(gender_counts, labels=gender_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Gender Distribution')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pie chart for visualizing the gender distribution because it effectively represents the proportion of each gender category within the dataset.

##### 2. What is/are the insight(s) found from the chart?

The pie chart provides insights into the relative proportions of each gender category within the dataset. Viewers can quickly see the distribution of genders and understand the proportion of males and females in the dataset.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Pie Chart of Gender Distribution can potentially lead to positive business impacts.


Targeted Marketing Strategies: Understanding the gender distribution can help businesses develop more targeted marketing strategies tailored to different gender demographics. By focusing marketing efforts on specific gender categories that constitute a larger proportion of the dataset, businesses can improve the effectiveness of their campaigns and increase customer engagement.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
# Scatter plot of calories burnt vs. duration
plt.figure(figsize=(10, 6))
plt.scatter(df['Duration'], df['Calories'], alpha=0.5)
plt.title('Calories Burnt vs. Duration')
plt.xlabel('Duration')
plt.ylabel('Calories')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot for visualizing the relationship between calories burnt and duration because it effectively displays the correlation or pattern between two continuous variables.

##### 2. What is/are the insight(s) found from the chart?

Correlation Analysis: The scatter plot provides insights into the correlation between duration and calories burnt.

By observing the general trend of the data points, we can determine whether there's a positive correlation (as duration increases, calories burnt also increase), negative correlation (as duration increases, calories burnt decrease), or no correlation.
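The direction and strength read off the scatter plot can be quantified with the Pearson correlation coefficient. This is a small sketch on made-up values; on the project data the equivalent call would be `df['Duration'].corr(df['Calories'])`.

```python
import pandas as pd

# Made-up sessions for illustration only
sample = pd.DataFrame({
    "Duration": [5.0, 10.0, 15.0, 20.0, 25.0],
    "Calories": [30.0, 62.0, 95.0, 130.0, 160.0],
})

# Pearson r ranges from -1 (perfect negative) to +1 (perfect positive)
r_positive = sample["Duration"].corr(sample["Calories"])

# Flipping the sign of one variable flips the sign of the correlation
r_negative = sample["Duration"].corr(-sample["Calories"])

print("r (Duration, Calories):", round(r_positive, 4))
print("r (Duration, -Calories):", round(r_negative, 4))
```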


##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Scatter Plot of Calories Burnt vs. Duration can potentially lead to positive business impacts.

Understanding the relationship between duration and calories burnt can help businesses in the fitness or wellness industry optimize their product offerings.

#### Chart - 7

In [None]:
# Chart - 7 visualization code
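# Line plot of mean calories burnt at each duration
# (plotting the raw rows in file order would connect unrelated sessions
# and draw a zigzag, so aggregate first; groupby sorts by duration)
calories_by_duration = df.groupby('Duration')['Calories'].mean()
plt.figure(figsize=(10, 6))
plt.plot(calories_by_duration.index, calories_by_duration.values)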
plt.title('Calories Burnt Over Duration')
plt.xlabel('Duration')
plt.ylabel('Calories')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a line plot for visualizing the relationship between calories burnt and duration because it effectively displays the trend or pattern of calorie expenditure over different durations of activity.

##### 2. What is/are the insight(s) found from the chart?

The line plot allows for the analysis of trends in calorie expenditure over different durations of activity.

By examining the direction and slope of the line, insights can be gained into whether there is an overall increase, decrease, or stability in calorie burnt as duration changes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the line plot of "Calories Burnt Over Duration" have the potential to create positive business impacts.

Insights from the relationship between duration and calorie burnt can help fitness businesses optimize their programs.

#### Chart - 8

In [None]:
# Chart - 8 visualization code
plt.figure(figsize=(10, 6))
sns.lineplot(x='Duration', y='Heart_Rate', data=df)
plt.title('Heart Rate Over Duration')
plt.xlabel('Duration')
plt.ylabel('Heart_Rate')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a line plot for visualizing the relationship between heart rate and duration because it effectively displays the trend or pattern of heart rate changes over different durations of activity.

##### 2. What is/are the insight(s) found from the chart?

The line plot allows for the analysis of trends in heart rate changes over different durations of activity.

By examining the direction and slope of the line, insights can be gained into whether there is an overall increase, decrease, or stability in heart rate as duration changes.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Line Plot of Heart Rate Over Duration have the potential to create positive business impacts.

Insights from the relationship between heart rate and duration can help fitness businesses optimize their programs.

By understanding how heart rate changes over different durations of activity, businesses can tailor their fitness programs to achieve specific cardiovascular training goals, leading to improved customer satisfaction and retention.

#### Chart - 9

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 6))
plt.hist(df['Age'], bins=20, color='skyblue', edgecolor='black')
plt.title('Age Distribution')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a histogram for visualizing the distribution of ages because it effectively displays the frequency or count of age values within different age ranges.

##### 2. What is/are the insight(s) found from the chart?

Central Tendency: The histogram provides insights into the central tendency of the age distribution, including the most common or typical age range within the dataset.

Peaks or modes in the histogram indicate where the age values are concentrated, providing insights into the central tendency of the age distribution.
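The most populated range in a histogram can also be found without plotting by binning the values and counting them. The ages and bin edges below are made up for illustration; on the project data the series would be `df['Age']`.

```python
import pandas as pd

# Made-up ages for illustration only
ages = pd.Series([22, 25, 27, 31, 34, 36, 38, 45, 52, 67])

# Bin edges are an assumption; each bin is open on the left, closed on the right
bins = [20, 30, 40, 50, 60, 70]
counts = pd.cut(ages, bins=bins).value_counts().sort_index()

# The modal bin is the interval holding the most observations
modal_bin = counts.idxmax()

print(counts)
print("Most common age range:", modal_bin)
```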

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Histogram of Age Distribution can potentially lead to positive business impacts.

Insights from the age distribution can inform targeted marketing strategies aimed at different age demographics.

By understanding the age composition of their target audience, businesses can tailor their marketing efforts to better resonate with specific age groups, leading to increased customer engagement and conversion rates.

#### Chart - 10

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Weight', y='Height', data=df)
plt.title('Weight vs. Height')
plt.xlabel('Weight')
plt.ylabel('Height')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a scatter plot for visualizing the relationship between weight and height because it effectively displays the correlation or pattern between two continuous variables.

##### 2. What is/are the insight(s) found from the chart?

The scatter plot provides insights into the correlation between weight and height.

By observing the general trend of the data points, we can determine whether there is a positive correlation (as weight increases, height also increases), negative correlation (as weight increases, height decreases), or no correlation between weight and height.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Scatter Plot of Weight vs. Height have the potential to create positive business impacts, but they could also lead to challenges or negative growth if not interpreted and utilized effectively.

Positive Business Impact: Insights from the relationship between weight and height can help businesses in industries such as apparel, fitness equipment, and healthcare tailor their product offerings to better meet the needs of their target customers.

Negative Growth: Misinterpreting or misusing insights from the scatter plot could lead to negative growth if businesses use weight and height data to stereotype or discriminate against individuals.



#### Chart - 11

In [None]:
# Chart - 11 visualization code
plt.figure(figsize=(10, 6))
sns.boxplot(x='Gender', y='Calories', data=df)
plt.title('Calories Burnt by Gender')
plt.xlabel('Gender')
plt.ylabel('Calories Burnt')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a box plot for visualizing the distribution of calories burnt by gender because it effectively displays the central tendency, spread, and variability of calorie expenditure between different gender groups.

##### 2. What is/are the insight(s) found from the chart?

Comparison of Median Calories Burnt: The box plot allows for a comparison of the median calorie burnt between genders. The horizontal line within each box represents the median calorie burnt for that gender group.

Comparing these medians shows whether typical calorie expenditure differs between males and females. Note that the median is not the same as the mean, so it is less affected by a few very high values.
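The figures a box plot summarizes can be computed per group with a `groupby`. The sketch below uses made-up values and reports both the median and the mean, since the two can differ when a group contains a few large values; on the project data the same aggregation would be run on `df`.

```python
import pandas as pd

# Made-up sessions for illustration only
sample = pd.DataFrame({
    "Gender": ["male", "male", "male", "female", "female", "female"],
    "Calories": [80.0, 100.0, 150.0, 70.0, 90.0, 95.0],
})

# Median (the line inside each box) alongside the mean for comparison
summary = sample.groupby("Gender")["Calories"].agg(["median", "mean"])

print(summary)
```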



##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The insights gained from the Box Plot of Calories Burnt by Gender have the potential to create positive business impacts, but they could also lead to challenges or negative growth if not interpreted and utilized effectively.

Positive Impact:

*   Tailored fitness programs for different gender groups.
*   Targeted marketing strategies based on gender-specific fitness goals.
*   Opportunities for gender-specific product development.

Negative Impact:

*   Gender stereotyping leading to discrimination.
*   Exclusionary practices alienating potential customers.
*   Missed opportunities for inclusivity and diversity.

#### Chart - 12

In [None]:
# Chart - 12 visualization code
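# Bin ages into 20-year groups so the bars match the 'Age Group' axis label
age_groups = pd.cut(df['Age'], bins=[0, 20, 40, 60, 80, 100])
plt.figure(figsize=(20, 6))
# Plot the binned groups (not the raw 'Age' column) on the x-axis
sns.barplot(x=age_groups, y='Body_Temp', data=df)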
plt.xlabel('Age Group')
plt.ylabel('Average Body Temperature')
plt.title('Average Body Temperature by Age Group')
plt.show()

##### 1. Why did you pick the specific chart?

I selected a bar plot because it effectively displays the average body temperature across different age groups.

##### 2. What is/are the insight(s) found from the chart?

From the Bar Plot of Average Body Temperature by Age Group:

Age-Related Trends in Body Temperature: The bar plot provides insights into how average body temperature varies across different age groups. By examining the heights of the bars, we can identify any trends or patterns in body temperature as individuals age.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Tailored healthcare services: Insights inform personalized care and treatment plans.

Product development: Opportunities for age-specific healthcare products.

Improved customer satisfaction: Targeted care enhances patient experience.

Negative Impact:

Health concerns: Deviations may signal underlying issues, leading to increased costs.

Ineffective marketing: Misinterpretation could result in wasted resources and decreased profitability.

#### Chart - 13

In [None]:
# Chart - 13 visualization code
# Violin plot of calories burnt distribution by age group
plt.figure(figsize=(10, 6))
sns.violinplot(x=age_groups, y='Calories', data=df)
plt.xlabel('Age Group')
plt.ylabel('Calories Burnt')
plt.title('Calories Burnt Distribution by Age Group')
plt.show()

##### 1. Why did you pick the specific chart?

I chose a violin plot for visualizing the distribution of calories burnt by age group because it effectively displays the distribution, central tendency, and spread of calorie expenditure within each age group.

##### 2. What is/are the insight(s) found from the chart?

The violin plot allows for a comparison of the distribution shapes of calorie burnt between different age groups.

Differences in the width, height, and shape of the violins indicate variations in the distribution patterns of calorie expenditure across age groups.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Positive Impact:

Tailored health programs and marketing: Insights inform age-specific strategies, enhancing customer engagement.

Product development: Opportunities for age-specific fitness products, optimizing calorie expenditure.

Negative Impact:

Exclusionary practices: Overreliance on age-specific insights may neglect diverse needs.

Misinterpretation: Misinterpreting outliers could lead to ineffective interventions.

Health disparities: Failure to address disparities may exacerbate negative health outcomes.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
df_no_gender = df.drop('Gender', axis=1) # Drop 'gender' column from DataFrame df
# Generate heatmap of correlation between variables (excluding 'gender')
plt.figure(figsize=(10, 6))
sns.heatmap(df_no_gender.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Between Variables (Excluding Gender)')
plt.show()

##### 1. Why did you pick the specific chart?

I selected a correlation heatmap because it effectively visualizes the pairwise correlations between different variables in the dataset, excluding the 'Gender' column.

##### 2. What is/are the insight(s) found from the chart?

Strength and Direction: Reveals strength and direction of correlations.

Strong Relationships: Identifies strong correlations between variables.

Weak Relationships: Highlights weak or no correlations.

Patterns: Shows clusters of variables with related correlations.

Multicollinearity: Indicates potential multicollinearity issues.

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code
# Pairplot of selected variables
g = sns.pairplot(df[['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Calories']])
# Set the title on the whole figure; plt.title would only label the last subplot
g.fig.suptitle('Pairplot of Selected Variables', y=1.02)
plt.show()

##### 1. Why did you pick the specific chart?

I chose a pair plot because it effectively visualizes the pairwise relationships between multiple selected variables in the dataset.

##### 2. What is/are the insight(s) found from the chart?

The pair plot reveals the direction and strength of relationships between pairs of variables.
Scatterplots display how variables change relative to each other, indicating whether they are positively, negatively, or not correlated.



## ***5. Hypothesis Testing***

### Based on your chart experiments, define three hypothetical statements from the dataset. In the next three questions, perform hypothesis testing to obtain final conclusion about the statements through your code and statistical testing.

1. Age has a significant effect on the number of calories burnt during a workout.

2. There is a difference in the average number of calories burnt between males and females.

3. There is a significant correlation between heart rate and calories burnt.


### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): Age has no significant effect on the number of calories burnt during a workout.

H0: β_age = 0

Alternative Hypothesis (H1): Age has a significant effect on the number of calories burnt during a workout.

H1: β_age ≠ 0

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats
# Calculate the Pearson correlation coefficient
correlation, p_value = stats.pearsonr(df["Age"], df["Calories"])
# Set significance level
alpha = 0.05
# Interpret results
if p_value < alpha:
    print("Reject H0: Age has a significant effect on the number of calories burnt during a workout.")
else:
    print("Fail to reject H0: Age does not have a significant effect on the number of calories burnt during a workout.")

##### Which statistical test have you done to obtain P-Value?

Pearson correlation coefficient.
The Pearson correlation coefficient is a measure of the linear relationship between two variables. It ranges from -1 to 1, where:

-1 indicates a perfect negative correlation

0 indicates no correlation

1 indicates a perfect positive correlation

##### Why did you choose the specific statistical test?

We chose the Pearson correlation test because both age and calories burnt are continuous variables.

The Pearson correlation coefficient is a measure of the linear relationship between two variables, and its p-value indicates whether the observed correlation differs significantly from zero.

In this case, we are interested in measuring the relationship between age and calories burnt.
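
The hypotheses above are written in terms of a slope (β_age), while the test is a correlation test. For a single predictor the two are equivalent, which the following minimal sketch illustrates. It uses synthetic stand-in data rather than the project's `df`, so the variable names and numbers are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for df["Age"] and df["Calories"] (illustrative only)
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200)
calories = 80 + 0.3 * age + rng.normal(0, 40, size=200)

# Pearson correlation test, as used in the cell above
r, p_corr = stats.pearsonr(age, calories)

# Simple linear regression of calories on age; its p-value tests H0: slope = 0
fit = stats.linregress(age, calories)

print(f"r = {r:.3f}, p (correlation) = {p_corr:.4g}")
print(f"slope = {fit.slope:.3f}, p (slope) = {fit.pvalue:.4g}")
```

Both p-values should agree up to floating-point error, so rejecting ρ = 0 and rejecting β_age = 0 lead to the same conclusion in this one-predictor setting.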

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no difference in the average number of calories burnt between males and females.

H0: μ_male = μ_female

Alternative Hypothesis (H1): There is a difference in the average number of calories burnt between males and females.

H1: μ_male ≠ μ_female

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import scipy.stats as stats

# Separate data by gender
men = df[df["Gender"] == "male"]["Calories"]
women = df[df["Gender"] == "female"]["Calories"]

# Perform independent samples t-test
t_stat, p_value = stats.ttest_ind(men, women)

# Set significance level
alpha = 0.05

# Interpret results
if p_value < alpha:
    print("Reject H0: There is a significant difference in the average number of calories burnt between males and females.")
else:
    print("Fail to reject H0: There is no significant difference in the average number of calories burnt between males and females.")

##### Which statistical test have you done to obtain P-Value?

Independent samples t-test:

The independent samples t-test is a statistical test that is used to compare the means of two independent groups. In this case, we used the independent samples t-test to compare the mean calories burned between men and women.

##### Why did you choose the specific statistical test?

The independent samples t-test is appropriate for comparing the means of two independent groups. In this case, we are comparing the mean calories burned between men and women.

The independent samples t-test is a relatively simple test to perform. This makes it a good choice for a quick and easy analysis.
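
The default `ttest_ind` call assumes both groups have equal variances. If that assumption is doubtful, Welch's variant (`equal_var=False`) is a common alternative. The sketch below compares the two on synthetic groups; the group sizes, means, and spreads are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy import stats

# Synthetic calorie samples for two groups with different spreads (illustrative only)
rng = np.random.default_rng(1)
men = rng.normal(loc=100, scale=10, size=200)
women = rng.normal(loc=80, scale=20, size=150)

# Student's t-test: assumes equal variances in both groups
t_student, p_student = stats.ttest_ind(men, women)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = stats.ttest_ind(men, women, equal_var=False)

print(f"Student: t = {t_student:.2f}, p = {p_student:.3g}")
print(f"Welch:   t = {t_welch:.2f}, p = {p_welch:.3g}")
```

On the real data, the same call would be `stats.ttest_ind(men, women, equal_var=False)` with the `men` and `women` series defined in the cell above.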

### Hypothetical Statement - 3

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

Null Hypothesis (H0): There is no significant correlation between heart rate and calories burnt.

H0: ρ = 0

Alternative Hypothesis (H1): There is a significant correlation between heart rate and calories burnt.

H1: ρ ≠ 0

#### 2. Perform an appropriate statistical test.

In [None]:
import scipy.stats as stats

# Calculate the Pearson correlation coefficient
correlation, p_value = stats.pearsonr(df["Heart_Rate"], df["Calories"])

# Set significance level
alpha = 0.05

# Interpret results
if p_value < alpha:
    print("Reject H0: There is a significant correlation between heart rate and calories burnt.")
else:
    print("Fail to reject H0: There is no significant correlation between heart rate and calories burnt.")

##### Which statistical test have you done to obtain P-Value?

Pearson correlation coefficient. The Pearson correlation coefficient is a measure of the linear relationship between two variables. It ranges from -1 to 1, where:

-1 indicates a perfect negative correlation

0 indicates no correlation

1 indicates a perfect positive correlation



##### Why did you choose the specific statistical test?

The Pearson correlation coefficient is a measure of the linear relationship between two variables.

In this case, we are interested in measuring the relationship between heart rate and calories burned.

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

#### What all missing value imputation techniques have you used and why did you use those techniques?

There are no missing values in any columns.

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments
df.info()

In [None]:
numerical_cols = ['Height', 'Weight', 'Heart_Rate', 'Age', 'Body_Temp', 'Duration']
def handle_outliers_iqr(data, cols):
    for col in cols:
        Q1 = data[col].quantile(0.25)
        Q3 = data[col].quantile(0.75)
        IQR = Q3 - Q1
        # Define the upper and lower bounds for outlier detection
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR
        # Replace outliers with NaN (Series.mask avoids chained assignment)
        outliers = (data[col] < lower_bound) | (data[col] > upper_bound)
        data[col] = data[col].mask(outliers)
    return data

# Apply outlier handling using the IQR method
df_clean_iqr = handle_outliers_iqr(df.copy(), numerical_cols)

##### What all outlier treatment techniques have you used and why did you use those techniques?

Interquartile Range (IQR) Method:
The IQR method identifies outliers based on the spread of the middle 50% of the data.

It calculates the difference between the third quartile (Q3) and the first quartile (Q1) of the dataset (IQR = Q3 - Q1). Data points outside a certain range of the IQR are considered outliers.

This method is chosen because it is robust to outliers in the data and provides a measure of variability that is not influenced by extreme values.
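
To make the IQR rule concrete, here is a small worked example on a hand-made series. The values are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy heart-rate readings with one obviously extreme value (illustrative only)
heart_rate = pd.Series([60, 62, 65, 66, 68, 70, 71, 150])

q1 = heart_rate.quantile(0.25)   # 64.25
q3 = heart_rate.quantile(0.75)   # 70.25
iqr = q3 - q1                    # 6.0

lower = q1 - 1.5 * iqr           # 55.25
upper = q3 + 1.5 * iqr           # 79.25

# Flag values outside [lower, upper] and replace them with NaN
outliers = (heart_rate < lower) | (heart_rate > upper)
cleaned = heart_rate.mask(outliers)

print(f"bounds: [{lower}, {upper}]")
print("outliers replaced:", int(outliers.sum()))
```

Only the reading of 150 falls outside the bounds, so it is the single value replaced with NaN.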

### 3. Categorical Encoding

In [None]:
df['Gender'].value_counts()

In [None]:
# Encode your categorical columns
# Assuming 'df' is your DataFrame containing the 'Gender' column
# Encode the 'Gender' column using one-hot encoding
one_hot_encoded_data = pd.get_dummies(df, columns = ['Gender'])
print(one_hot_encoded_data)

# Print the first rows of the encoded dataframe
print(one_hot_encoded_data.head())



#### What all categorical encoding techniques have you used & why did you use those techniques?

One-hot encoding: This is the most common and straightforward method of categorical encoding.

It creates a new binary column for each unique value in the categorical column. I used this technique for the "Gender" column.

This technique is simple to implement and interpret. It is also compatible with most machine learning algorithms.
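
With a two-valued column such as "Gender", the two one-hot columns are perfectly redundant (one is always 1 minus the other). A possible variation, sketched here on a toy frame, is to pass `drop_first=True` so that only one indicator column is kept. The toy data is an assumption for illustration.

```python
import pandas as pd

# Toy frame mimicking the 'Gender' column (illustrative only)
toy = pd.DataFrame({
    "Age": [25, 40, 31],
    "Gender": ["male", "female", "male"],
})

# drop_first=True keeps a single indicator column instead of two redundant ones
encoded = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)
print(encoded)
```

The result keeps `Age` and a single `Gender_male` column, where 1 means male and 0 means female.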

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
from sklearn.preprocessing import StandardScaler

numeric_df = df.select_dtypes(include=['float64', 'int64'])
correlation_matrix = numeric_df.corr()

# Feature scaling: standardize the numeric predictors (zero mean, unit variance)
scale_cols = ['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[scale_cols])
df_scaled = pd.DataFrame(scaled_features, columns=scale_cols, index=df.index)

#### 2. Feature Selection

In [None]:
# Step 1: Calculate correlation coefficients with the target variable
correlation_with_calories = df.drop(['User_ID', 'Gender'], axis=1).corr()['Calories'].abs().sort_values(ascending=False)

# Step 2: Select features based on correlation coefficients
selected_features = correlation_with_calories[1:4].index.tolist()  # Exclude 'Calories' itself

# Step 3: Manually select additional features based on domain knowledge
domain_features = ['Duration', 'Heart_Rate', 'Body_Temp']  # Additional features based on domain knowledge

# Combine selected features, dropping duplicates while preserving order
final_selected_features = list(dict.fromkeys(selected_features + domain_features))

# Print selected features
print("Selected Features:", final_selected_features)




##### What all feature selection methods have you used  and why?

Correlation Analysis: Quantifies the linear relationship between features and the target variable. Features with higher absolute correlation coefficients are considered more relevant.

Manual Selection based on Domain Knowledge: Incorporates expert insights to select features known to be important or relevant to the problem.

These methods are chosen for their simplicity, interpretability, and ability to identify informative features for predicting the target variable.

##### Which all features you found important and why?

The important features identified are Duration, Heart Rate, and Body Temperature.

These features are considered important because they have a significant impact on calorie expenditure during physical activities.

Duration represents the length of activity, while Heart Rate and Body Temperature reflect the intensity and metabolic activity, respectively, both of which influence calorie burn rates.

### 5. Data Transformation

Yes, the dataset may benefit from the following transformation:

Normalization:

The features in the Calories dataset have different units and scales.

We can normalize the features by scaling them to a common range, such as between 0 and 1.

This can be done using the MinMaxScaler class from the sklearn.preprocessing module. The StandardScaler class is an alternative that instead rescales each feature to zero mean and unit variance.


### 6. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import MinMaxScaler
features_to_scale = ['User_ID', 'Height', 'Weight', 'Heart_Rate', 'Age', 'Body_Temp', 'Duration']

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# Scale the selected features
df[features_to_scale] = scaler.fit_transform(df[features_to_scale])

# Print the scaled DataFrame
print(df.head())

##### Which method have you used to scale your data and why?

Min-max scaling: This method scales the data so that all features lie between 0 and 1.
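
One point worth illustrating: the scaler's minimum and maximum are learned from whatever data it is fitted on. A common practice, sketched below on synthetic data, is to fit the scaler on the training split only and reuse it for the test split, so that no information from the test set influences preprocessing. The toy features and ranges are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with two columns on different scales (illustrative only)
rng = np.random.default_rng(42)
X = rng.uniform(low=[20, 150], high=[80, 200], size=(100, 2))

X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

# Fit on the training split only, then apply the same transform to the test split
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("train min/max:", X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
print("test  min/max:", X_test_scaled.min(axis=0), X_test_scaled.max(axis=0))
```

The training columns span exactly 0 to 1, while test values can fall slightly outside that range, which is expected.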

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

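Dimensionality reduction is most useful when a dataset has many features, or many strongly redundant ones. Scaling, by contrast, is a preprocessing step performed to ensure that features with different scales and units contribute equally to the analysis and modeling process.

In this case, the dataset has only a handful of features and they are already scaled, so it may not be necessary to apply dimensionality reduction techniques.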

### 8. Data Splitting

In [None]:
#Split the data into features and target
features = df[["User_ID", "Gender", "Age", "Weight", "Height", "Heart_Rate", "Duration"]]
target = df["Calories"]

# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.25, random_state=42)

# Print the shapes of the training and test sets
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

##### What data splitting ratio have you used and why?

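We split the dataset into 75% training data and 25% test data. Note that the model cells in the next section re-split the data with test_size=0.2.

This is a common ratio in machine learning: it leaves most of the data for fitting the model while keeping a test set large enough to give a stable estimate of performance.

It is also computationally efficient: evaluating a machine learning model takes time, so a smaller test set saves time and resources.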

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

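To determine whether a dataset is imbalanced, we need to look at the distribution of the target variable. Class imbalance is usually a concern for classification problems, where one label is much more frequent than another.

In this case, the target variable is the Calories column, which is continuous, so we inspect the shape of its distribution instead.

We can use the value_counts() method to get the count of each unique value in the Calories column, and a histogram to see the overall distribution.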

In [None]:
#checking whether the dataset is balanced or not
df["Calories"].value_counts()


In [None]:
# Plot the distribution of the 'Calories' variable
plt.figure(figsize=(8, 6))
df['Calories'].hist(bins=20, color='skyblue', edgecolor='black', alpha=0.7)
plt.title('Distribution of Calories Burnt')
plt.xlabel('Calories')
plt.ylabel('Frequency')
plt.grid(False)
plt.show()

In [None]:
from sklearn.utils import resample

# Balance the 'Calories' column.
# Note: 'Calories' is continuous, so these equality filters only match rows
# whose value is exactly 0 or exactly 1.
majority_class = df[df['Calories'] == 0]
minority_class = df[df['Calories'] == 1]

if len(majority_class) > 0 and len(minority_class) > 0:
    # Upsample minority class
    minority_upsampled = resample(minority_class,
                                  replace=True,                   # sample with replacement
                                  n_samples=len(majority_class),  # to match majority class
                                  random_state=42)                # reproducible results
    # Combine majority class with upsampled minority class
    df_upsampled = pd.concat([majority_class, minority_upsampled])
else:
    # One of the groups is empty, so there is nothing to resample
    df_upsampled = df.copy()

# Display new class counts
df_upsampled['Calories'].value_counts()

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

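I used upsampling (also known as oversampling) to handle the imbalanced dataset.
Upsampling involves randomly duplicating samples from the minority class to balance the class distribution. Because Calories is a continuous target, this technique applies only loosely here: the code treats rows with Calories equal to 0 and to 1 as the two classes.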

## ***7. ML Model Implementation***

### ML Model - 1

In [None]:
# Assuming you have loaded your dataset into a DataFrame named 'df'
# Select relevant features (independent variables) and the target variable
X = df[['User_ID', 'Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']]
y = df['Calories']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict calories burnt for the test set
y_pred = model.predict(X_test)

# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)

# Calculate R-squared for the test set, reported below as a percentage
r2 = r2_score(y_test, y_pred)
Linear_accuracy_percentage = r2 * 100
print("Accuracy (Test): {:.2f}%".format(Linear_accuracy_percentage))

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
# Calculate MSE, MAE, RMSE, R-squared and adjusted R-squared
Linear_MSE  = mean_squared_error(y_test, y_pred)
print("MSE :" ,Linear_MSE)

Linear_MAE =mean_absolute_error(y_test, y_pred)
print("Linear MAE:", Linear_MAE)

Linear_RMSE = np.sqrt(Linear_MSE)
print("RMSE :" ,Linear_RMSE)

# Calculate R-squared (accuracy) for the test set
Linear_r2 = r2_score(y_test, y_pred)
print("r2:", Linear_r2)

Linear_adjusted_r2 = 1-(1-r2_score((y_test),(y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",Linear_adjusted_r2)

In [None]:
# Visualizing evaluation Metric Score chart

# Evaluation metric scores
mse = 132.10
accuracy_percentage = 96.73

# Labels for the bars
labels = ['Mean Squared Error (MSE)', 'Accuracy (Test)']

# Scores for the bars
scores = [mse, accuracy_percentage]

# Create a bar chart
plt.figure(figsize=(8, 6))
plt.bar(labels, scores, color=['blue', 'green'])

# Add title and labels
plt.title('Evaluation Metric Scores')
plt.xlabel('Metric')
plt.ylabel('Score')

# Show the plot
plt.show()


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

We considered R-squared and adjusted R-squared, as they show how well the model fits the data and how much of the variance in the target variable it explains. A better fit supports more reliable business decisions.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Linear Regression was chosen as the final prediction model due to its simplicity, interpretability, and good performance in explaining the variance in the target variable (calories burnt) based on the provided features.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model used for predicting calorie expenditure is Linear Regression. Linear Regression is a supervised learning algorithm used for predicting a continuous target variable based on one or more input features.

It assumes a linear relationship between the input features and the target variable.

Feature Importance:
Feature importance in a linear regression model can be determined by examining the coefficients assigned to each feature. When the features are on a comparable scale, features with higher absolute coefficient values are considered more important in influencing the target variable.
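
As a minimal sketch of this idea, the snippet below fits a linear regression on standardized synthetic features and ranks the coefficients by absolute value. The data-generating formula and feature names are assumptions chosen for illustration; they are not results from the project dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data loosely shaped like the project features (illustrative only)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Duration": rng.uniform(1, 30, n),
    "Heart_Rate": rng.uniform(70, 130, n),
    "Age": rng.uniform(20, 80, n),
})
y = 6.0 * X["Duration"] + 1.5 * X["Heart_Rate"] + 0.1 * X["Age"] + rng.normal(0, 5, n)

# Standardize so that coefficient magnitudes are comparable across features
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

# Rank features by absolute coefficient value
coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)
```

In this synthetic setup, `Duration` ranks first, followed by `Heart_Rate` and `Age`, matching how the data was generated.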

### ML Model - 2

In [None]:
# Select relevant features (independent variables) and the target variable
X = df[['User_ID', 'Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']]
y = df['Calories']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a decision tree regression model
model = DecisionTreeRegressor()

# Train the model on the training data
model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test)

# Calculate Mean Absolute Error
mae = np.mean(np.abs(y_test - y_pred))
print("Mean Absolute Error:", mae)

# Calculate R-squared (accuracy) for the test set
accuracy = r2_score(y_test, y_pred)
DT_accuracy_percentage = accuracy * 100
print("Accuracy (Test): {:.2f}%".format(DT_accuracy_percentage))

#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
DT_MSE  = mean_squared_error(y_test, y_pred)
print("MSE :" , DT_MSE)

DT_MAE  = mean_absolute_error(y_test, y_pred)
print("MAE :" , DT_MAE)

DT_RMSE = np.sqrt(DT_MSE)
print("RMSE :" ,DT_RMSE)

DT_r2 = r2_score(y_test, y_pred)
print("R2 :" ,DT_r2)
DT_adjusted_r2 = 1-(1-r2_score((y_test),(y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",DT_adjusted_r2)

1. The model we used is Decision Tree.

2. We fit the algorithm on the training data and predict on the test data.

The evaluation metrics we used are:

1. Mean Squared Error: the average of the squared differences between the predicted values and the true values.

2. Root Mean Squared Error: the square root of the Mean Squared Error, i.e. the square root of the average of the squared differences between the predicted values and the true values.

3. R2 score: the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
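
The definitions above, together with the adjusted R2 formula used in the metric cells, can be checked by hand on a tiny example. The four values below are made up for illustration, and `p = 1` assumes a single predictor.

```python
import numpy as np

# Tiny hand-made example (illustrative values only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])
n, p = len(y_true), 1  # n samples, p predictors (assumed)

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # average squared error
rmse = np.sqrt(mse)                              # square root of the MSE
ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"MSE={mse:.4f} RMSE={rmse:.4f} R2={r2:.4f} Adjusted R2={adj_r2:.4f}")
# MSE=0.5000 RMSE=0.7071 R2=0.9000 Adjusted R2=0.8500
```

Here the squared errors are 1, 0, 1, 0, so MSE is 2/4 = 0.5, and R2 is 1 - 2/20 = 0.9.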

In [None]:
# Visualizing evaluation Metric Score chart

# Plotting the predicted values against the actual values
plt.figure(figsize=(6, 3))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Evaluation Metric Score Chart')
plt.grid(True)
plt.show()


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

These two metrics, R-squared and Adjusted R-squared, are highly valued in business contexts as they provide insights into the explanatory power and generalization ability of the predictive model.

Higher values of these metrics indicate better model performance, which ultimately leads to more informed decision-making and positive business impact.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

The Decision Tree Regression model was chosen as the final prediction model due to its simplicity, interpretability, and ability to capture complex relationships in the data without extensive preprocessing.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

The model used is Decision Tree Regression, a supervised learning algorithm used for regression tasks. It works by recursively partitioning the data into subsets based on features, aiming to minimize the variance of the target variable within each subset.

Feature Importance:
Decision trees inherently provide a measure of feature importance based on how frequently they are used for splitting and how much they reduce impurity (e.g., mean squared error). Features used at the top of the tree and those that result in the largest reduction in impurity are considered more important.
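
A fitted scikit-learn tree exposes these impurity-based scores through its `feature_importances_` attribute. The sketch below shows how they could be read on synthetic data; the data, the `max_depth=5` setting, and the feature names are assumptions for illustration, not the project's configuration.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: the target depends on Duration and Heart_Rate, not on Age
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Duration": rng.uniform(1, 30, n),
    "Heart_Rate": rng.uniform(70, 130, n),
    "Age": rng.uniform(20, 80, n),
})
y = 6.0 * X["Duration"] + 1.5 * X["Heart_Rate"] + rng.normal(0, 5, n)

tree = DecisionTreeRegressor(max_depth=5, random_state=42).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

For the project's model, the same attribute could be read from `model` after the `model.fit(X_train, y_train)` call in the cell above.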

### ML Model - 3

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics import r2_score

# Assuming you have loaded your dataset into a DataFrame named 'df'
# Select relevant features (independent variables) and the target variable
X = df[['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']]
y = df['Calories']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Normalize the input features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create and train the SVR model
model = SVR(kernel='rbf')  # RBF kernel is often used for SVR
model.fit(X_train_scaled, y_train)

# Predict the target variable for the test set
y_pred = model.predict(X_test_scaled)

# Calculate Mean Squared Error and Mean Absolute Error
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)

# Calculate R-squared (accuracy) for the test set
accuracy = r2_score(y_test, y_pred)
SVM_accuracy_percentage = accuracy * 100
print("Accuracy (Test): {:.2f}%".format(SVM_accuracy_percentage))


#### 1. Explain the ML Model used and it's performance using Evaluation metric Score Chart.

In [None]:
SVM_MSE  = mean_squared_error(y_test, y_pred)
print("MSE :" , SVM_MSE)

SVM_MAE  = mean_absolute_error(y_test, y_pred)
print("MAE :" , SVM_MAE)

SVM_RMSE = np.sqrt(SVM_MSE)
print("RMSE :" ,SVM_RMSE)

SVM_r2 = r2_score(y_test, y_pred)
print("R2 :" ,SVM_r2)
# Adjusted R-squared formula = 1 - ( (1-R^2) * (n-1) / (n-p-1) )
SVM_adjusted_r2 = 1-(1-r2_score((y_test),(y_pred)))*((X_test.shape[0]-1)/(X_test.shape[0]-X_test.shape[1]-1))
print("Adjusted R2 : ",SVM_adjusted_r2)

In [None]:
# Visualizing evaluation Metric Score chart
import matplotlib.pyplot as plt

# Plotting the predicted values against the actual values
plt.figure(figsize=(6,3))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
#'k' stands for black color ('k' is the abbreviation for black in matplotlib),
# and '--' indicates that the line should be dashed. So, 'k--' creates a black dashed line.
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Evaluation Metric Score Chart')
plt.grid(True)
plt.show()


### 1. Which Evaluation metrics did you consider for a positive business impact and why?

R-squared (R2) and Adjusted R-squared are the evaluation metrics with the most positive impact. Higher values of these metrics indicate better model performance in explaining the variance in the target variable, leading to more reliable predictions and informed decision-making.

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

Support Vector Regression (SVR) with an RBF kernel was chosen as the final prediction model.

SVR offers flexibility in capturing complex relationships, benefits from feature scaling, performs well on regression tasks, and provides accurate predictions.

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

Support Vector Regression (SVR) is a supervised learning algorithm used for regression tasks.

It is based on the Support Vector Machine (SVM) algorithm and aims to find a function that approximates the mapping from input variables to the target variable.

SVR works by fitting a function that keeps most training points within a margin of tolerance (epsilon) around its predictions, penalizing only the points that fall outside this margin. The points on or outside the margin are the support vectors.
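
An SVR with an RBF kernel does not expose per-feature coefficients, so a model-agnostic tool is needed to discuss feature importance. One option is scikit-learn's `permutation_importance`, which measures how much the test score drops when a single feature is shuffled. The sketch below uses synthetic data, and `C=100.0` is an illustrative setting rather than the project's configuration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: the target depends on Duration and Heart_Rate, not on Age
rng = np.random.default_rng(0)
n = 400
feature_names = ["Duration", "Heart_Rate", "Age"]
X = np.column_stack([
    rng.uniform(1, 30, n),
    rng.uniform(70, 130, n),
    rng.uniform(20, 80, n),
])
y = 6.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 5, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling and SVR in one pipeline, so shuffled features are scaled consistently
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0))
svr.fit(X_train, y_train)

# Mean drop in R2 on the test set when each feature is shuffled
result = permutation_importance(svr, X_test, y_test, n_repeats=10, random_state=42)
for name, score in sorted(zip(feature_names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```

A larger drop means the model relies more on that feature; a value near zero suggests the feature contributes little to the predictions.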

# **Conclusion**

In [None]:
# Evaluation metric score chart
model_data = pd.DataFrame()
model_data['Model Name'] = ['Linear Regression','Decision Tree','Support Vector Machine']
model_data['MSE'] = [Linear_MSE,DT_MSE,SVM_MSE]
model_data['MAE'] = [Linear_MAE,DT_MAE,SVM_MAE]
model_data['RMSE'] = [Linear_RMSE,DT_RMSE,SVM_RMSE]
model_data['R2'] = [Linear_r2,DT_r2,SVM_r2]
model_data['Adjusted R2'] = [Linear_adjusted_r2,DT_adjusted_r2,SVM_adjusted_r2]
model_data['Accuracy']=[Linear_accuracy_percentage,DT_accuracy_percentage,SVM_accuracy_percentage]
model_data

In conclusion, we explored the implementation of various machine learning models, including Linear Regression, Decision Trees, and Support Vector Machines (SVM), for predicting calorie expenditure based on different features such as age, height, weight, duration, heart rate, and body temperature.

We evaluated the performance of each model using metrics such as Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2) score on a test dataset. Among the models tested, the Support Vector Machine (SVM) model achieved the highest R2 score, 98.88%.

We then saved the trained SVM model using joblib for future use and demonstrated how to load the saved model to make predictions on unseen data.

Overall, the SVM model showed promising results for predicting calorie expenditure, and the ability to save and load the model allows for easy deployment and use in real-world applications.
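
The saving and loading step mentioned above can be sketched as follows. This is a minimal example on synthetic data, not the project's actual cell: the file name `calorie_svr.joblib` is hypothetical, and the scaler is bundled with the model in a pipeline so that both are saved together.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic training data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(0, 0.1, 100)

# Keep scaler and model together so the saved artifact is self-contained
model = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "calorie_svr.joblib")  # hypothetical file name
    joblib.dump(model, path)      # save the fitted pipeline to disk
    loaded = joblib.load(path)    # load it back, e.g. in another session

# The reloaded pipeline reproduces the original predictions
print(np.allclose(model.predict(X[:5]), loaded.predict(X[:5])))
```

Saving the whole pipeline rather than the bare SVR avoids having to persist and re-apply the scaler separately at prediction time.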



