# Exploratory Data Analysis - Cardio Good Fitness
Cardio Good Fitness is a retail store, and this dataset describes customers who purchased its various treadmill models.
 

**Dataset Information** :

CardioGoodFitness.csv: The CSV file contains data on customers who have purchased different treadmill models from Cardio Good Fitness:
- Product - the model no. of the treadmill
- Age - in no of years, of the customer
- Gender - of the customer
- Education - in no. of years, of the customer
- Marital Status - of the customer
- Usage - Avg. # times the customer wants to use the treadmill every week
- Fitness - Self rated fitness score of the customer (5 - very fit, 1 - very unfit)
- Income - of the customer
- Miles - the number of miles the customer expects to run

**Objective**
- Identify differences between customers of each product
- Explore relationships between the different attributes of customers

### Import required libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import pandas_profiling as ppf
# Suppress warnings
warnings.filterwarnings('ignore') 

### Load Dataset:

In [None]:
cgf_data=pd.read_csv("../input/cardiogoodfitness/CardioGoodFitness.csv")

### Understand the structure of Data:

#### Examine the first 5 rows of the data

In [None]:
cgf_data.head()

#### Examine the last 5 rows of the data

In [None]:
cgf_data.tail()

#### Check the size of data

In [None]:
cgf_data.shape

**Observations:** There are 180 rows (observations) and 9 columns in the dataset

#### Check the object / data types of data in dataframe

In [None]:
cgf_data.dtypes

 **Observations:**

1. Columns Product, Gender and Marital Status are of string datatype 
2. Columns Age, Education, Usage, Fitness, Income and Miles are of integer (numerical) datatype

#### Check additional information on the dataframe

In [None]:
cgf_data.info()

**Observations:**

1. There are 6 columns of integer type
2. There are 3 columns of object (string) type
3. The dataset is of approximately 12.8 kb in size
4. There are 180 rows


#### Identify the column names in dataset

In [None]:
cgf_data.columns

#### Backup data before manipulation

In [None]:
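# Use .copy() so the backup is independent; plain assignment would only alias the same DataFrame
cgf_data_back = cgf_data.copy()
# Verify the backup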
cgf_data_back
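
A backup is only useful if it does not change when the original does. The following is a minimal sketch of the difference between binding a second name and calling `DataFrame.copy()`; the three-row frame and the names `alias` and `backup` are hypothetical stand-ins for `cgf_data`.

```python
import pandas as pd

# Hypothetical three-row frame standing in for cgf_data
df = pd.DataFrame({"Product": ["TM195", "TM498", "TM798"],
                   "Miles": [85, 95, 160]})

alias = df          # second name bound to the same object
backup = df.copy()  # independent copy (deep by default)

# Modify the original in place
df.loc[0, "Miles"] = 0

alias_changed = bool(alias.loc[0, "Miles"] == 0)    # the alias sees the change
backup_intact = bool(backup.loc[0, "Miles"] == 85)  # the copy does not
print(alias is df, alias_changed, backup_intact)
```

If the data needs to be restored later, the backup can be copied back in the same way, for example `cgf_data = cgf_data_back.copy()`.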

#### Identify the missing values in dataset

In [None]:
cgf_data.isnull().sum()

**Observations:** There are no missing values in the dataset

#### Identify the duplicate records

In [None]:
cgf_data.duplicated().sum()

**Observations:** There are no duplicate records in the dataset

#### Describe the numerical columns in the dataset

In [None]:
cgf_data.describe()

#### Describe all columns of the dataset

In [None]:
cgf_data.describe(include ='all')

In [None]:
# Transpose the data for another view
cgf_data.describe().T

**Observations:** 

**A. AGE:**

    1. Customers between 18 and 50 years of age are using treadmill
    2. Average age is 28.78 years
    3. As there is not much difference in mean and median, the skewness in data is minimal

**B. INCOME:**

    1. Customers with income range of USD 29,500 to USD 104,500 are using treadmill
    2. Considering the difference between mean and median & mean being greater than median, the data is right skewed
    3. Standard deviation is very high

**C. MILES:**

    1. Customers expect to run between 21 and 360 miles per week
    2. Considering the difference between mean and median & mean being greater than median, the data is right skewed
    3. Standard deviation is very high
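
The mean-versus-median comparison used above can be backed up with a skewness statistic. The sketch below uses a small hypothetical sample (not taken from CardioGoodFitness.csv) to show how one large value pulls the mean above the median and produces a positive skewness from `Series.skew()`.

```python
import pandas as pd

# Hypothetical right-skewed sample, for illustration only
miles = pd.Series([40, 50, 60, 70, 80, 90, 100, 300])

mean_val = miles.mean()      # pulled upward by the single large value
median_val = miles.median()  # robust to the single large value
skew_val = miles.skew()      # sample skewness; > 0 indicates a longer right tail

print(f"mean={mean_val}, median={median_val}, skew={skew_val:.2f}")
```

On the real data, the same check could be run as `cgf_data[['Age', 'Income', 'Miles']].skew()`.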

#### Count based on model

In [None]:
cgf_data.Product.value_counts()

**Observations:** 

1. TM195 is the most sold model
2. TM798 is the least sold model

#### Count based on Gender

In [None]:
cgf_data.Gender.value_counts()

**Observations:** 

1. There are 76 female Customers  
2. There are 104 male Customers 
3. Male Customers are buying more treadmills compared to Female Customers

#### Count based on Marital Status

In [None]:
cgf_data.MaritalStatus.value_counts()   

**Observations:** 

1. There are 107 Partnered Customers  
2. There are 73 Single Customers. 
3. Partnered Customers are buying more treadmills compared to Single Customers

#### Understand data for Product Code TM195

In [None]:
cgf_data[cgf_data['Product'] == 'TM195'].describe().T

**Observations:**

1. A total of 80 customers purchased the TM195 model
2. Average age of customers is 28.5 (Median: 26) (Range: 18 - 50)
3. The age data is right skewed
4. Average number of years of education is 15 (Median: 16)
5. Customers want to use the treadmill at least 3 times per week
6. Customers expect to run an average of 82.78 miles per week (Median: 85)
7. Average and median income are both approximately USD 46,000

#### Understand data for Product Code TM498

In [None]:
cgf_data[cgf_data['Product'] == 'TM498'].describe().T

**Observations:**

1. A total of 60 customers purchased TM498 model
2. Average age of customer is 28.9 (Median: 26) (Range: 19-48)
3. The age data is right skewed
4. Average number of years of education is 15 (Median: 16)
5. Customers want to use the treadmill at least 3 times per week
6. Customers expect to run an average of 60 miles per week (Median: 85)
7. Average income is USD 46,000 (Median: USD 49,459)

#### Understand data for Product Code TM798

In [None]:
cgf_data[cgf_data['Product'] == 'TM798'].describe().T

**Observations:**

1. A total of 40 customers purchased TM798 model
2. Average age of customer is 29 (Median: 27) (Range: 22-48)
3. Average number of years of Education for customers is 17 (Median: 18)
4. Customers want to use the treadmill at least 4-5 times per week
5. Customers expect to run an average of 166 miles per week (Median: 160)
6. Average income is USD 75,000 (Median: USD 76,000)

## Univariate Analysis:

### Analysis Based on Age

In [None]:
# Histogram for age
plt.hist(cgf_data.Age, edgecolor = 'white')
plt.title("Histogram view of Age")
plt.show()

In [None]:
# Distribute data in Age groups
bins = [20,25,30,35,40,45,50]
plt.hist(cgf_data.Age,bins,edgecolor = 'white')
plt.title("Categorical histogram of Age")
plt.show()

In [None]:
# Distribute data into narrower 2-year age groups
bins = [18,20,22,24,26,28,30,32,34]
plt.hist(cgf_data.Age,bins,edgecolor = 'white')
plt.title("Categorical histogram of Age")
plt.show()

**Observations:**

1. Most customers are in the age range of 22 - 32
2. Finer binning reveals that most customers are between 24 and 26 years of age, followed by customers between 22 and 24 years
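
The age-group counts read off the histograms can also be tabulated directly. The sketch below bins a hypothetical list of ages (standing in for `cgf_data.Age`) with `pd.cut`, using the same 2-year bin edges as the second histogram.

```python
import pandas as pd

# Hypothetical ages standing in for cgf_data.Age
ages = pd.Series([18, 22, 23, 24, 25, 25, 26, 28, 31, 35, 40, 47])

# Same 2-year bin edges as the histogram; right=False makes each bin [left, right)
bins = [18, 20, 22, 24, 26, 28, 30, 32, 34]
age_groups = pd.cut(ages, bins=bins, right=False)

# Count customers per bin, in bin order; ages outside the edges become NaN and are not counted
counts = age_groups.value_counts().sort_index()
print(counts)
```

One small difference from `plt.hist`: the histogram's final bin includes its right edge, whereas `pd.cut` with `right=False` excludes it, so a value of exactly 34 would be counted by the histogram but not here.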

### Analysis based on Income

In [None]:
# Visualisation of income range
# histplot with kde=True replaces the deprecated distplot
sns.histplot(x='Income', data=cgf_data, kde=True)
plt.title("Distribution Plot of Income")
plt.show()

In [None]:
# boxplot view of income
sns.boxplot(x=cgf_data.Income)
plt.title("Box Plot of Income")
plt.show()

**Observations:**

1. There are two peaks shown by the income range of people
2. Data is right skewed and shows outliers on the right
3. Most Customers fall in range of USD 45,000 - USD 60,000
4. Outliers are observed above USD 85,000
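
The outlier threshold seen in the box plot can be reproduced numerically. The sketch below applies the 1.5 × IQR rule, which is the default whisker rule for box plots, to a hypothetical income sample; the helper name `iqr_outliers` and the sample values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical incomes in USD (not taken from CardioGoodFitness.csv)
income = pd.Series([30000, 40000, 45000, 50000, 52000, 55000, 60000, 104000])

outliers = iqr_outliers(income)
print(outliers.tolist())
```

On the real data, `iqr_outliers(cgf_data.Income).min()` would give the smallest income flagged as an outlier.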

### Analysis based on Gender

In [None]:
# Number of records per gender and product model
sns.countplot(x='Gender', hue='Product', data=cgf_data)
plt.title('Gender based distribution')
plt.show()

In [None]:
# Number of records per model and per gender
sns.countplot(x='Product', hue='Gender', data=cgf_data)
plt.title('Gender based distribution')
plt.show()

**Observations:**

1. More male customers than female customers purchased a treadmill
2. TM798 is the least popular treadmill model among female customers
3. TM195 is equally preferred by male and female customers
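
The counts behind these plots can be read as a table instead of estimated from bar heights. The sketch below builds a product-by-gender contingency table with `pd.crosstab` on a handful of hypothetical rows; `normalize="index"` turns each row into shares that sum to 1.

```python
import pandas as pd

# Hypothetical rows standing in for cgf_data
df = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798", "TM798", "TM195"],
    "Gender":  ["Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female"],
})

# Absolute counts per product and gender
counts = pd.crosstab(df["Product"], df["Gender"])

# Share of each gender within a product (each row sums to 1)
shares = pd.crosstab(df["Product"], df["Gender"], normalize="index")

print(counts)
print(shares.round(2))
```

On the real data, the equivalent call would be `pd.crosstab(cgf_data.Product, cgf_data.Gender)`.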

### Analysis based on Marital Status

In [None]:
# Number of records per model and per Marital Status
sns.countplot(x='Product', hue='MaritalStatus', data=cgf_data)
plt.title('Marital Status based distribution')
plt.show()

In [None]:
# Number of records per model and per Marital Status
sns.countplot(x='MaritalStatus', hue='Product', data=cgf_data)
plt.title('Marital Status based distribution')
plt.show()

**Observations:**

1. Partnered customers have purchased more treadmills than single customers
2. TM195 is the most popular model for both marital statuses


### Analysis based on Usage

In [None]:
# Number of records based on usage per week
sns.countplot(x='Usage', data=cgf_data)
plt.title('Count based on Usage')
plt.show()

In [None]:
# Number of records per model and for number of times of Usage
sns.countplot(x='Product', hue='Usage', data=cgf_data)
plt.title('Usage based distribution')
plt.show()

In [None]:
# Number of records per model and per Usage
sns.countplot(x='Usage', hue='Product', data=cgf_data)
plt.title('Usage based distribution')
plt.show()

**Observations:**

1. Most customers plan to use the treadmill at least 3 times per week
2. TM195 is most popular amongst active customers
3. A few TM798 customers plan to use the treadmill 7 times a week

### Analysis based on Fitness Level

In [None]:
# Number of records per Fitness
sns.countplot(x='Fitness', data=cgf_data)
plt.title('Count based on Self-Rated Fitness Levels')
plt.show()

In [None]:
# Number of records per model and for fitness rating
sns.countplot(x='Product', hue='Fitness', data=cgf_data)
plt.title('Distribution based on Fitness Levels')
plt.show()

In [None]:
# Number of records per model and for fitness rating
sns.countplot(x='Fitness', hue='Product', data=cgf_data)
plt.title('Distribution based on Fitness Levels')
plt.show()

**Observations:**

1. Most customers have rated themselves at Level 3 of Fitness levels
2. TM195 is most popular amongst customers at Level 3
3. Almost all Customers at Fitness Level 5 use TM798 model

### Analysis based on Education

In [None]:
# Distribution based on number of years of education
sns.countplot(x='Education', data=cgf_data)
plt.title("Count based on number of years of Education")
plt.show()

In [None]:
# Number of records per model for customer segments based on the number of years of education
sns.countplot(x='Education', hue='Product', data=cgf_data)
plt.title('Distribution based on Education')
plt.show()

In [None]:
# Number of records per model for customer segments based on the number of years of education
plt.figure(figsize=(10,5))
sns.countplot(x='Product', hue='Education', data=cgf_data)
plt.title('Distribution based on Education')
plt.show()

**Observations:**

1. Most customers who purchased a treadmill have 16 to 18 years of education
2. Customers with more than 20 years of education have only purchased the TM798 model
3. TM798 is most preferred by customers with 18 years of education

### Analysis based on Miles planned per week

In [None]:
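# Distribution plot of Miles with rug and KDE (histplot and rugplot replace the deprecated distplot)
sns.histplot(x='Miles', data=cgf_data, kde=True)
sns.rugplot(x='Miles', data=cgf_data)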
plt.title('Count based on Miles')
plt.show()

In [None]:
# Boxplot view of data based on miles
sns.boxplot(x=cgf_data.Miles)
plt.title("Boxplot of Miles")
plt.show()

**Observations:**

1. Outliers are seen among the higher values
2. A few customers plan to run more than 180 miles per week



In [None]:
# List the data where miles are greater than 180
cgf_data[cgf_data['Miles'] > 180]

## Bivariate Analysis:

#### Average age for each model

In [None]:
cgf_data.groupby('Product')['Age'].mean()

#### Average Income for each model

In [None]:
cgf_data.groupby('Product')['Income'].mean()

#### Average miles per model

In [None]:
cgf_data.groupby('Product')['Miles'].mean()

#### Average of number of years of education for each model

In [None]:
cgf_data.groupby('Product')['Education'].mean()
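
The four per-product averages above can also be collected into a single table. The sketch below does this with one `groupby(...).agg(...)` call on a small hypothetical frame (the values are made up for illustration) and adds the median alongside the mean.

```python
import pandas as pd

# Hypothetical rows standing in for cgf_data
df = pd.DataFrame({
    "Product":   ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Age":       [22, 30, 25, 35, 27, 29],
    "Income":    [40000, 50000, 45000, 55000, 70000, 90000],
    "Miles":     [60, 100, 80, 90, 150, 170],
    "Education": [14, 16, 14, 16, 18, 18],
})

# One row per product, one (column, statistic) pair per output column
summary = (
    df.groupby("Product")[["Age", "Income", "Miles", "Education"]]
      .agg(["mean", "median"])
)
print(summary)
```

The result has a two-level column index, so a single value is selected with a tuple, for example `summary.loc["TM798", ("Income", "mean")]`.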

### Analysis of Miles based on Age

In [None]:
sns.jointplot(x = 'Age' , y = 'Miles', data = cgf_data)
plt.show()

**Observations:**

There is no definite correlation observed between Age and Miles

### Analysis of Income based on Age

In [None]:
sns.jointplot(x = 'Age' , y = 'Income', data = cgf_data, color='red', kind ='hex')
plt.show()

**Observations:**

Income tends to increase with age, indicating a positive correlation.

### Analysis of Miles based on Income

In [None]:
sns.jointplot(x = 'Income' , y = 'Miles', data = cgf_data, color='orange', kind ='hex')
plt.show()

**Observations:**

Miles increases slightly as customer income increases

### Analysis of Income based on Gender

In [None]:
sns.catplot(x = 'Gender' , y = 'Income', data = cgf_data)
plt.show()

**Observations:**

Male customers have a higher income range than female customers

### Analysis of Miles based on Gender

In [None]:
sns.catplot(x = 'Gender' , y = 'Miles', data = cgf_data, kind = 'violin')
plt.show()

**Observations:**

Male customers plan to run more miles than female customers

### Analysis of Usage based on Gender

In [None]:
sns.catplot(x = 'Gender' , y = 'Usage', data = cgf_data, kind = 'bar')
plt.show()

**Observations:**

Male customers show higher average usage per week than female customers

### Analysis of Income based on Marital Status

In [None]:
sns.catplot(x = 'MaritalStatus' , y = 'Income', data = cgf_data, kind = 'box')
plt.show()

**Observations:**

Partnered customers have a higher income range than single customers

### Analysis of Miles based on Marital Status

In [None]:
sns.catplot(x = 'MaritalStatus' , y = 'Miles', data = cgf_data, kind = 'swarm')
plt.show()

**Observations:**

Partnered customers plan to run more miles than single customers

## Multivariate Analysis:

### Multicolumn catplot of Marital Status showing Gender based data compared to Income

In [None]:
sns.catplot( x = "Gender", y = 'Income', hue = 'Product', col = 'MaritalStatus',data = cgf_data, kind = 'bar');

**Observations:**

1. TM798 customers show the highest average income across all customer segments
2. Single Female Customers have purchased more of TM195 and TM498 models compared to Single Male Customers
3. Single Male Customers are more than Single Female Customers 
4. Partnered Female Customers are more than Partnered Male Customers

### Pointplot showing sales based on Education and Income

In [None]:
sns.pointplot(x=cgf_data["Education"],y=cgf_data["Income"],hue=cgf_data['Product']) 
plt.show()

**Observations:**

   1. Customers with more years of education have a higher income range
   2. TM798 customers have higher income and more years of education

### Correlation between Numerical columns of dataset

In [None]:
# Correlation of numerical values in dataset
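# Restrict to numeric columns; recent pandas versions raise an error on string columns
cgf_data.select_dtypes(include='number').corr()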

In [None]:
# Heatmap for the correlation of numerical values in dataset
sns.heatmap(cgf_data.select_dtypes(include='number').corr(), annot=True, vmin=-1, vmax=1)
plt.show()

**Observations:**

1. Miles and Usage show high correlation
2. Fitness and Miles show high correlation
3. Education and Income show notable correlation
4. Usage and Fitness show notable correlation
5. Income and Usage show little correlation
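
Rather than reading the strongest relationships off the heatmap, the column pairs can be ranked by absolute correlation. The sketch below does this on a small hypothetical frame; the values are made up, and the string column is included only to show that `select_dtypes` drops it before `corr()` is called.

```python
from itertools import combinations

import pandas as pd

# Hypothetical rows standing in for cgf_data
df = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Usage":   [2, 3, 3, 4, 5, 6],
    "Miles":   [40, 60, 66, 85, 120, 170],
    "Age":     [30, 22, 41, 25, 35, 28],
})

# Pearson correlation of the numeric columns only
corr = df.select_dtypes(include="number").corr()

# Each unordered column pair once, sorted by absolute correlation (strongest first)
pairs = sorted(
    ((a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)),
    key=lambda item: abs(item[2]),
    reverse=True,
)

for a, b, r in pairs:
    print(f"{a:>6} vs {b:<6} r = {r:+.2f}")
```

Sorting by absolute value keeps strong negative correlations near the top of the list as well.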

#### Pairplot of all numerical values with classification of Product

In [None]:
sns.pairplot(cgf_data, hue='Product')
plt.show()

#### Pairplot of all numerical values using KDE

In [None]:
sns.pairplot(cgf_data, kind='kde')
plt.show()

## Conclusion (Important Observations):

1. TM195 is most sold model, accounting for 44.44% of total sales.
2. 57.78% of Customers are Male Customers, which is more than the Female Customers.
3. Partnered customers account for 59.44% of sales.
4. Most customers are between 22 and 26 years of age.
5. TM798 is most preferred by customers with higher income range.
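
The percentages quoted above follow directly from the counts reported earlier in the notebook (80/60/40 by product, 104/76 by gender, 107/73 by marital status). The sketch below reproduces the arithmetic; the helper name `to_percent` is an assumption introduced for illustration.

```python
import pandas as pd

# Counts as reported in the observations earlier in the notebook
product_counts = pd.Series({"TM195": 80, "TM498": 60, "TM798": 40})
gender_counts = pd.Series({"Male": 104, "Female": 76})
marital_counts = pd.Series({"Partnered": 107, "Single": 73})

def to_percent(counts: pd.Series) -> pd.Series:
    """Convert absolute counts to percentages of the total, rounded to 2 decimals."""
    return (counts / counts.sum() * 100).round(2)

product_pct = to_percent(product_counts)
gender_pct = to_percent(gender_counts)
marital_pct = to_percent(marital_counts)

print(product_pct["TM195"], gender_pct["Male"], marital_pct["Partnered"])
```

On the DataFrame itself, `cgf_data.Product.value_counts(normalize=True) * 100` gives the same shares without building the counts by hand.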

## Recommendations:

1. TM195 and TM498 are popular with customers in the USD 45,000 to USD 60,000 income range and can be promoted as affordable models for these income groups
2. TM798 should be branded as Premium Model and marketed among high income groups and specific customer categories. Promotional programs can be run for upgrades from other models.
3. Rewards programs can be launched to promote higher weekly usage. Gamification based on points can also be introduced, with weekly leader boards.
4. Special promotions to be run to target Female Customers, for instance:
    1. Discounts on Women's Day and similar celebrated occasions
    2. Purchase offers using Credit Card or Bank Reward points
    3. Free additional gifts and hampers from partners
5. Market research can be conducted to check the feasibility of attracting customers outside the age range of 18-35.

In [None]:
# Module : Fundamentals of AIML
# Project: EDA - Cardio Good Fitness
# Submitted by : Ritesh Sharma
# Submission Date : 23 Jul 2021