<font color=darkblue>
&nbsp;
    
### Cardio Good Fitness
    
<font color=blue>
&nbsp;
    
Cardio Good Fitness is a retail store. This dataset describes customers who bought its various treadmill models.

<font color=darkblue>
&nbsp;

### Dataset Information :

<font color=blue>
&nbsp;

CardioGoodFitness.csv: The CSV contains data on customers who have purchased different treadmill models from Cardio Good Fitness:

1. Product - the model number of the treadmill
2. Age - of the customer, in years
3. Gender - of the customer
4. Education - of the customer, in years
5. Marital Status - of the customer
6. Usage - average number of times per week the customer wants to use the treadmill
7. Fitness - self-rated fitness score of the customer (5 - very fit, 1 - very unfit)
8. Income - of the customer
9. Miles - number of miles the customer expects to run

<font color=darkblue>
&nbsp;

### Objective

<font color=blue>
&nbsp;

Build a customer profile (the characteristics of a typical customer) for each Product
based on the data we have, and use it to generate insights and recommendations that will help the company target new customers.
 
<font color=darkblue>
&nbsp;
    
### Hypothetical questions:
    
<font color=blue>
&nbsp;
    
1. Is there any relationship between the Age of customers and the Products they buy?
2. Are specific Products favoured by Male or Female customers?
3. Does the income of customers have any relationship with the Products they buy?
4. Are a customer's years of education related to his/her income?
5. Do Usage expectations have any impact on customers' Product preferences?
6. Do fitter people buy a specific product?
7. Does the marital status of a customer play any role in Product selection?
8. Which product sells the most, and is there a specific reason behind that?
9. Are Product sales biased towards a specific Gender?
10. Do Product sales depend on the marital status of customers?
11. Does the expected usage in terms of Miles to be run determine which Products are bought?

<font color=darkslategray>
&nbsp;&nbsp;
    
### Understanding the structure of the data

<font color=royalblue>
&nbsp;

#### Import necessary libraries for exploratory data analysis

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

import plotly.express as px

<font color=royalblue>
&nbsp;

#### Read the data file into a dataframe

In [None]:
df = pd.read_csv('/kaggle/input/cardiogoodfitness/CardioGoodFitness.csv')
# Create a copy of the dataframe, just so that the original dataframe remains intact 
# while we perform various transformations on the copy dataframe
fit_df = df.copy()

<font color=royalblue>
&nbsp;

#### Check data samples

In [None]:
fit_df.head(5)

In [None]:
fit_df.tail(5)

<font color=royalblue>
&nbsp;

#### Check dataset properties

In [None]:
fit_df.info()

In [None]:
# Converting object type columns to categories

for col in ['Product', 'Gender', 'MaritalStatus']:
    fit_df[col] = fit_df[col].astype('category')

<font color=royalblue>
&nbsp;

#### Check column level basic stats on the dataset

In [None]:
fit_df.describe().T

In [None]:
# Check the number of columns and rows
fit_df.shape

<font color='green'>
  
#### Observations:
1. There are 180 observations
2. Age has an average of 28, while the median is 26
3. Education has its average and median at almost the same point, 16 years
4. Usage has an average of 3.45 days, whereas the median is 3 days
5. Fitness level has an average of 3.31, whereas the median is 3
6. Income has an average of 53.7K, whereas the median is 50.6K
7. Miles to run has an average of 103, which is noticeably higher than the median of 94
8. There are 9 columns
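
The mean-versus-median gaps noted above can be tabulated in one step. The following is a minimal sketch, not part of the original analysis: `toy_df` holds made-up values rather than rows from CardioGoodFitness.csv, and `mean_median_gap` is a hypothetical helper name.

```python
import pandas as pd

def mean_median_gap(df):
    """Return the mean, the median and their difference for every numeric column."""
    numeric = df.select_dtypes(include="number")
    summary = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
    })
    # A positive gap (mean above median) hints at a right-skewed distribution
    summary["gap"] = summary["mean"] - summary["median"]
    return summary

# Illustrative values only, not taken from CardioGoodFitness.csv
toy_df = pd.DataFrame({
    "Age": [22, 24, 25, 26, 48],
    "Usage": [2, 3, 3, 4, 3],
    "Gender": ["Male", "Female", "Male", "Female", "Male"],
})

gap_table = mean_median_gap(toy_df)
print(gap_table)
```

On the real data, the same helper could be called as `mean_median_gap(fit_df)`; non-numeric columns are skipped automatically.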

<font color=royalblue>
&nbsp;

#### Check for missing data

In [None]:
fit_df.isnull().sum()

<font color='green'>
&nbsp;

#### Observations:
1. No missing values

<font color=royalblue>
&nbsp;

#### Check for data duplications

In [None]:
fit_df.duplicated().sum()

<font color='green'>

#### Observations:
1. No data duplication

In [None]:
# Checking for unique values and value counts:
# prints the unique values of every column and, for category columns, the value counts as well

def check_unique_value_counts(df, cols):
    for col in cols:
        print(col , ' : ' , df[col].unique())
        if str(df[col].dtype) == 'category':
            print(col , ' : ' , df[col].value_counts(), '\n')
        else:
            print('')


In [None]:
check_unique_value_counts(fit_df, fit_df.columns.to_list())

<font color='green'>

#### Observations:
1. There are 3 products: 'TM195', 'TM498', 'TM798'
2. There are more male customers (104) than female(76)
3. There are more partnered (107) than single (73)

<font color='royalblue'>
&nbsp;
    
#### Product-wise data statistics

In [None]:
fit_df[fit_df['Product'] == 'TM195'].describe().T

In [None]:
fit_df[fit_df['Product'] == 'TM498'].describe().T

In [None]:
fit_df[fit_df['Product'] == 'TM798'].describe().T

<font color='green'>

#### Observations:
1. TM195 sold the most and TM798 the least
2. Education median is a bit higher for TM798 compared to the other two products
3. The customers of TM798 earn more, the median is at 76.5K compared to 46.6K for TM195, and 49.5K for TM498
4. Fitness level median for TM798 is 5, higher than the other two products
5. Usage no. of days median is also higher for TM798 than the other two products
6. Miles median is also much higher for TM798 than the other two products
7. TM195 is popular among customers aged 26 years (the median age); the average age is 28.5 years.
8. TM498 is popular among customers aged 26 years (the median age); the average age is ~30 years.
9. TM798 is popular among customers aged 27 years (the median age); the average age is ~29 years.
    
This suggests that the TM798 treadmill is probably being used by athletes, health-conscious people or professionals
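
The three per-product `describe()` calls can be condensed into a single grouped summary. A minimal sketch under stated assumptions: `toy_df` contains invented rows (not the real dataset) and only two of the numeric columns are used for brevity.

```python
import pandas as pd

# Illustrative rows only, not taken from CardioGoodFitness.csv
toy_df = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Income": [40000, 50000, 45000, 55000, 70000, 90000],
    "Fitness": [3, 3, 3, 2, 5, 4],
})

# One row per product, one column per feature, holding the median of each
product_medians = toy_df.groupby("Product")[["Income", "Fitness"]].median()

# Number of sales per product
product_counts = toy_df["Product"].value_counts()

print(product_medians)
print(product_counts)
```

Placing the medians side by side makes comparisons such as "TM798 customers earn more" readable from one table instead of three.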

<font color=darkslategray>
&nbsp;&nbsp;
    
### Univariate Analysis

In [None]:
# The function to show distribution, box plot and violin plot for the non-categorical variables
# to understand the range, skewness, outliers, distribution of the attribute

def univariate_analysis_quantitative(df, attr):
    """
    Signature:
    univariate_analysis_quantitative(
    df=None,
    attr=None
    )
    
    df: pd.DataFrame
    attr: List of quantitative attributes in the df dataframe
    
    Draws a histogram, a boxplot and a violin plot for each of the attributes
    """
    for col in attr:
        fig, axes = plt.subplots(1, 3, figsize=(20, 5))
        fig.suptitle("Distribution of Data: " + col, fontsize=18, fontweight='bold')
        sns.histplot(data=df, x=col, kde=True, ax=axes[0], color='royalblue')
        # Dashed vertical lines mark the mean (blue) and the median (red)
        axes[0].axvline(df[col].mean(), color='b', linestyle='dashed', linewidth=2)
        axes[0].axvline(df[col].median(), color='r', linestyle='dashed', linewidth=2)
        sns.boxplot(data=df, x=col, showmeans=True, ax=axes[1], color='springgreen')
        sns.violinplot(data=df, x=col, ax=axes[2], color='coral')

In [None]:
univariate_analysis_quantitative(fit_df, ['Income', 'Age'])

<font color='green'>

#### Observations:
1. The average customer earns around 54K; the median is around 51K. There are many outliers beyond ~78K. Data is skewed to the right
2. The average customer age is around 28 years; the median is around 26. Outliers are beyond 46 years. Data is skewed to the right

In [None]:
univariate_analysis_quantitative(fit_df, ['Education', 'Miles'])

<font color='green'>

#### Observations:
1. The average and median education of customers are around 16 years. Outliers are for 20+ years of education. Data is right skewed.
2. Average miles expected to run is 103, compared to a median of 94 Miles. Outliers are for more than 190 Miles. Data is right skewed.

In [None]:
univariate_analysis_quantitative(fit_df, ['Usage', 'Fitness'])

<font color='green'>

#### Observations:
1. The average usage is around 3-4 days a week. Data is right skewed. Outliers: more than 5 days a week.
2. The average Fitness rating is around 3. Data is left skewed, with outlier of fitness rating of 1.
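
The outlier thresholds read off the box plots can also be computed numerically. By default, seaborn box plots draw whiskers up to 1.5 times the interquartile range (IQR) beyond the quartiles and mark anything further out as an outlier. A minimal sketch of that rule, assuming a made-up series `toy_income` and a hypothetical helper `iqr_outliers`:

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Illustrative values only (in thousands), not taken from CardioGoodFitness.csv
toy_income = pd.Series([30, 40, 45, 50, 52, 55, 60, 120], name="Income")

outliers = iqr_outliers(toy_income)
print(outliers)
```

Applied to the real data, for example `iqr_outliers(fit_df['Income'])`, this would list the exact rows behind the points drawn beyond the whiskers.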

In [None]:
# Data skewness (numeric columns only; skewness is undefined for the category columns)
fit_df.skew(numeric_only=True)

In [None]:
# The function to show count plots for the categorical variables
# to understand how sales are distributed across the categories of each attribute
def univariate_analysis_qualitative(df, attr):
    fig, axes =plt.subplots(1,len(attr),figsize=(20, 5))
    i = 0
    for col in attr:
        sns.countplot(data=df, x=col, ax=axes[i], order=df[col].value_counts().index).set(title=col + ' wise Sales')
        i+=1

In [None]:
univariate_analysis_qualitative(fit_df, ['Product', 'Gender', 'MaritalStatus'])

In [None]:
# Checking percentage sales by Products
fit_df.groupby(by=['Product']).agg({'Product': 'count'}).div(fit_df['Product'].count()) * 100


In [None]:
# Checking percentage sales by Gender
fit_df.groupby(by=['Gender']).agg({'Gender': 'count'}).div(fit_df['Product'].count()) * 100

In [None]:
# Checking percentage sales by Marital Status
fit_df.groupby(by=['MaritalStatus']).agg({'MaritalStatus': 'count'}).div(fit_df['Product'].count()) * 100

<font color='green'>

#### Observations:
1. TM195 has highest sales, whereas TM798 has lowest
2. Male customers are higher in number than Females
3. Partnered customers are higher in number than Singles
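
The percentage shares behind these counts can also be obtained with `value_counts(normalize=True)`, which avoids the explicit group-and-divide step. A minimal sketch, assuming invented rows in `toy_df` and a hypothetical helper `percentage_share`:

```python
import pandas as pd

def percentage_share(df, col):
    """Percentage of rows falling into each category of `col`."""
    return df[col].value_counts(normalize=True).mul(100).round(1)

# Illustrative rows only, not taken from CardioGoodFitness.csv
toy_df = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM798"],
    "Gender": ["Male", "Female", "Male", "Male"],
})

product_share = percentage_share(toy_df, "Product")
gender_share = percentage_share(toy_df, "Gender")
print(product_share)
print(gender_share)
```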

<font color=darkslategray>
&nbsp;&nbsp;
    
### Bivariate Analysis

In [None]:
# Linear regression plot for each product to check product distribution by Age and Income

sns.lmplot(data=fit_df, x='Age', y='Income', hue='Gender', col='Product', fit_reg=True);

In [None]:
# Gender-wise Marital Status specific bar plots to understand customer base per Product

fig, axes = plt.subplots(1,3,figsize=(20,5))
i = 0
for val in fit_df['Product'].unique():
    sns.countplot(data = fit_df[fit_df['Product']==val], x='Gender', hue='MaritalStatus', ax=axes[i])\
    .set(title='Product: '+val)
    i+=1

<font color='green'>

#### Observations:
1. TM798 is very popular among males.
2. TM195 is very popular among females and single males.
3. TM498 is preferred by partnered males much more than single males.
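
The Gender and MaritalStatus counts per Product shown in the bar plots can also be tabulated with `pd.crosstab`, which makes the exact numbers easier to compare. A minimal sketch with invented rows in `toy_df` (not the real dataset):

```python
import pandas as pd

# Illustrative rows only, not taken from CardioGoodFitness.csv
toy_df = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM195", "TM498", "TM498", "TM798"],
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Male"],
    "MaritalStatus": ["Single", "Single", "Partnered",
                      "Partnered", "Partnered", "Single"],
})

# Rows: products; columns: every (Gender, MaritalStatus) combination
counts = pd.crosstab(
    index=toy_df["Product"],
    columns=[toy_df["Gender"], toy_df["MaritalStatus"]],
)
print(counts)
```

Passing `normalize="index"` to `pd.crosstab` would turn each row into shares that sum to 1, which helps when the products have very different sales volumes.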

In [None]:
# The function plots the distribution of each quantitative feature, grouped by one qualitative feature
 
def bivariate_analysis(df, cat, num_list):
    # Two plots per row; the number of rows is rounded up when the list length is odd
    n_rows = (len(num_list) + 1) // 2
    fig, axes = plt.subplots(n_rows, 2, figsize=(20, 17))
    for i, num in enumerate(num_list):
        # Quantitative feature on the x-axis, categories on the y-axis (horizontal boxes)
        sns.boxplot(data=df, x=num, y=cat, showmeans=True, ax=axes[i//2, i%2]).set(title=cat + ' By ' + num)

In [None]:
# Quantitative distribution by Product

bivariate_analysis(fit_df, 'Product', ['Age', 'Income', 'Education', 'Usage', 'Fitness', 'Miles'])

<font color='green'>

#### Observations:
1. TM798 user age has a lot of outliers. There are lots of users of this treadmill aged over 40 years.
2. However TM798 is most popular in users of age 25-30.
3. TM798 users have higher education (16 - 18 years), higher income (58K - 91K), higher usage (4-5 days) per week, and the users of this have rated themselves 4-5 in fitness level, and they expect to run 120-200 miles weekly.
4. More male customers buy TM798
5. TM195 and TM498 are bought by customers of age group 23-33.
6. TM195 and TM498 are bought by customers of education of 14-16 years.
7. TM195 is bought by customers with income of 38K-55K (approx), however the customers seem to use it 3-4 days a week
8. TM498 is bought by customers with income of 45K-55K (approx), although the usage is about 3 days a week.
9. Users of both TM195 and TM498 rated their fitness at around 3 and expect to run 60-100 Miles per week. For both products, the mean of the expected miles is about 85.

In [None]:
# Quantitative distribution by Gender

bivariate_analysis(fit_df, 'Gender', ['Age', 'Income', 'Education', 'Usage', 'Fitness', 'Miles'])

<font color='green'>

#### Observations:
1. Male customers have slightly higher range of income
2. Male customers use the products 3-4 days a week, whereas female customers use the products 3 days a week on average.
3. Male customers expect to run 90 - 145 Miles per week, compared to females expecting to run 70 - 100 Miles a week.

In [None]:
# Quantitative distribution by Marital Status

bivariate_analysis(fit_df, 'MaritalStatus', ['Age', 'Income', 'Education', 'Usage', 'Fitness', 'Miles'])

<font color='green'>

#### Observations:
1. Partnered customers appear to earn more and run more than single customers.
2. Customers above the age of 30 years tend to be partnered.
3. Customers below the age of 25 years tend to be single.
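
Statements about age and marital status can be checked by binning Age and computing the share of each status per bin. A minimal sketch under stated assumptions: the rows in `toy_df` are invented, and the bin edges (25 and 30) are chosen here only to mirror the ages mentioned above.

```python
import pandas as pd

# Illustrative rows only, not taken from CardioGoodFitness.csv
toy_df = pd.DataFrame({
    "Age": [20, 23, 24, 27, 29, 31, 35, 42],
    "MaritalStatus": ["Single", "Single", "Partnered", "Single",
                      "Partnered", "Partnered", "Single", "Partnered"],
})

# Bins are right-inclusive: (0, 25], (25, 30], (30, 120]
age_band = pd.cut(toy_df["Age"], bins=[0, 25, 30, 120],
                  labels=["<=25", "26-30", ">30"])

# Share of each marital status within every age band (rows sum to 1)
share = pd.crosstab(age_band, toy_df["MaritalStatus"], normalize="index")
print(share)
```

A share close to 1 in a band would support a statement such as "customers in this band are partnered"; a share nearer 0.5 would suggest that the box plots alone overstate the pattern.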

<font color=darkslategray>
&nbsp;&nbsp;
    
### Multivariate Analysis

In [None]:
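# Heatmap visualizes the correlation between the quantitative features

plt.figure(figsize=(15,10))
# Restrict to numeric columns: correlation cannot be computed on the category columns
sns.heatmap(fit_df.select_dtypes(include='number').corr(), annot=True, cmap='YlGnBu', vmin = -1, vmax = 1);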

<font color='green'>

#### Observations:
1. High correlation between Miles and Fitness, Miles and Usage.
2. Very small correlation exists between Fitness and Usage, Education and Income, Income and Age
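
Reading pairwise values off a heatmap is easier with a ranked list of feature pairs. A minimal sketch, assuming made-up values in `toy_df` and a hypothetical helper `ranked_correlations`; it keeps each pair once by masking everything except the upper triangle of the correlation matrix.

```python
import numpy as np
import pandas as pd

def ranked_correlations(df):
    """Return each pair of numeric columns once, sorted by absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # True strictly above the diagonal, so self-pairs and mirrored pairs are dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Illustrative values only, not taken from CardioGoodFitness.csv
toy_df = pd.DataFrame({
    "Usage": [2, 3, 3, 4, 6],
    "Miles": [50, 80, 100, 150, 200],
    "Age": [30, 22, 40, 25, 35],
    "Product": ["TM195", "TM195", "TM498", "TM798", "TM798"],
})

ranked = ranked_correlations(toy_df)
print(ranked)
```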

In [None]:
# Pairplot also provides quantitative feature-vs-feature relationships

sns.pairplot(fit_df, diag_kind = 'kde', corner = True);

In [None]:
# The function draws bar plots of a quantitative feature (y) against another feature (x),
# coloured by one qualitative feature (hue), with one panel per value of another qualitative feature (col)

def categorical_plots(df, x, y, hue, col):
    # catplot is a figure-level function that creates its own figure,
    # so a preceding plt.figure() call would only leave an empty figure behind
    sns.catplot(data=df, x=x, y=y, hue=hue, col = col, kind='bar');

In [None]:
# Categorical plot for each gender, plotting Usage vs Income per product

categorical_plots(fit_df, 'Usage', 'Income', 'Product', 'Gender')

<font color='green'>

#### Observations:
1. Customers with higher income and higher usage prefer TM798

In [None]:
# Categorical plot for each marital status, plotting Gender vs Income per product

categorical_plots(fit_df, 'Gender', 'Income', 'Product', 'MaritalStatus')

<font color='green'>

#### Observations:
1. Partnered customers with higher income prefer TM798

In [None]:
# Plotting Income vs. Age, categorized based on Products, while indicating usage for each customer
# Plotly scatterplot provides interactive insights

px.scatter(data_frame=fit_df, x='Income', y='Age', color='Product', size='Usage')

In [None]:
# The same observation can be made with a seaborn scatterplot or relplot; here it is implemented with scatterplot

plt.figure(figsize=(15,10))
sns.scatterplot(data=fit_df, x='Income', y='Age', hue='Product', size='Usage', alpha=0.5, sizes=(40,400));

<font color='green'>

#### Observations:
1. Customers with higher income buy TM798
2. Customers with lower income buy TM195 or TM498
3. Customers with higher usage buy TM798, others buy TM195 or TM498

<font color='green'>
&nbsp;

### Product Profiles based on Conclusive Observations:
    
#### TM195:
1. Customers with lower income have bought this model (median income 46K), indicating this might be a very affordable model.
2. The customers have rated their fitness level at an average of 3 and use this model 3-4 days a week, indicating most customers are casual starters.
3. 44.4% of product sales came from this model, indicating this model has a good customer base.
4. Customers are quite young, with a median age of 26 years.
5. Customers have 14-16 years of education.
6. Females and single males prefer this model, indicating this model is not gender specific.
7. Miles expected to run is on the lower side (mean: 85), again indicating the basic requirements of the customers.
    
#### TM498:
1. Customers with comparatively low to medium income have bought this model (median income 49K), indicating this might be an affordable mid-level model.
2. The customers have rated their fitness level at an average of 3 and use this model 3 days a week, indicating most customers are casual starters.
3. 33.3% of product sales came from this model, indicating this model has the second best customer base.
4. Customers are quite young, with a median age of 26 years.
5. Customers have 14-16 years of education.
6. Partnered males prefer this model, but it has a female customer base as well, indicating that although this model is not gender specific, some features might attract more males than females.
7. Miles expected to run is on the lower side (mean: 85), again indicating the basic requirements of the customers.
    
#### TM798:
1. Customers with comparatively higher income have bought this model (median income 76.5K), indicating this might be a costly high-level model.
2. The customers have rated their fitness level at an average of 4.5 and use this model 4-5 days, or even 7 days, a week, indicating most customers are serious health-conscious people or athletes.
3. 22.2% of product sales came from this model, indicating this model has the lowest customer base.
4. Customers are of various age groups; the median is 27 years, but there are a lot of customers aged 40-48 years.
5. Customers have 16-18 years of education. This and point (4) indicate that the customer base is not only young people.
6. This model is very popular among male customers, although there are a few female customers as well, indicating that although this model is not gender specific, most of its features attract male customers.
7. Miles expected to run is on the higher side (mean: 160), again indicating the serious requirements of the customers.

<font color='royalblue'>
&nbsp;

### Recommendations:
    
1. About 58% of customers are male (104 of 180); we should consider concentrating on increasing our female customer base. Since our female customers prefer the TM195 and TM498 models over the TM798 model, we should run promotional offers on the TM195 and TM498 models for female customers on Mother's Day, Women's Day, etc. We can also run campaigns on those days to spread awareness of health and fitness, and engage female athletes to promote these two models.
    
2. We can brand the TM195 model as budget-friendly, TM498 model as mid-range, and TM798 model as professional treadmills.
    
3. We can engage athletes to promote TM798 to increase its customer base, since it is the least sold model.
    
4. We can run promotional offers on TM798 from time to time, as it is high-priced (an assumption, since customers with higher income buy this model), so that we can encourage our mid-budget customers to consider it. This strategy might reduce the profit margin per unit, but in the long run it should increase the sales of this model. Considering that this might be the most profitable model, a discounted price combined with a larger customer base should yield a higher overall profit in the long run.
    
5. Since our primary customer base is aged 23-33 years, we can run promotional campaigns in schools and colleges to emphasize health and fitness, and encourage youngsters and their parents to consider buying a treadmill. This should expand our customer base among people below 23 and above 33 years of age as well.
    
6. We can also check with gyms if they are interested in upgrading their treadmills or buying new treadmills in bulk with some discount.