# Project Name : Cardio Good Fitness

**Objective - Preliminary Data Analysis**
1. Come up with a customer profile of the different products
2. Perform uni-variate and multi-variate analysis
3. Generate a set of insights and recommendations that will help the company to target new customers

**What does the data contain?**

The data is for customers of the treadmill products of a retail store. It contains the following features:

1. Product - The model number of the treadmill
2. Age - Age of the customer, in years
3. Gender - Gender of the customer
4. Education - Education of the customer, in years
5. Marital status - Marital status of the customer
6. Usage - Average number of times per week the customer wants to use the treadmill
7. Fitness - Self-rated fitness score of the customer (5 - very fit, 1 - very unfit)
8. Income - Income of the customer
9. Miles - Number of miles the customer expects to run

In [None]:
# Import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.gridspec as gridspec
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Know your data

**The very first step in an Exploratory data analysis is to know your features and get familiar with them**

In the next few steps, I familiarize myself with the features and the dataset:

1. Read the dataset into a pandas DataFrame
2. Look at the first few rows of the dataset using `head()`
3. Introduce derived features to support further analysis
    *     AgeRange
    *     IncomeRange
    *     FitRange
4. Check how many rows and columns the dataset has
5. Get the five-number summary of the dataset
6. Calculate the median of the numeric features, in particular the median income
7. Check whether the dataset has any missing values. Luckily there are none; otherwise we would have had to think about imputation

In [None]:
# Load the CSV file
data = pd.read_csv('/kaggle/input/cardiogoodfitness/CardioGoodFitness.csv')

In [None]:
# Let us take a look at the first few rows of the data
data.head()

In [None]:
# Let us add the AgeRange to the dataframe
bins = [0,18, 20, 25,30,35,40,45,50,np.inf]
names = ['<=18','18-20', '20-25','25-30','30-35','35-40','40-45','45-50','50+']

data['AgeRange'] = pd.cut(data['Age'], bins, labels=names)
data.head(50)
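
A note on how the bins behave at their edges: a minimal sketch, assuming the default `right=True` of `pd.cut`, in which every bin is closed on the right. The ages below are toy values picked to sit on or near the edges; they are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy ages chosen to sit on or near the bin edges (illustrative only).
toy_ages = pd.Series([18, 19, 20, 21, 25, 26, 51], name="Age")

bins = [0, 18, 20, 25, 30, 35, 40, 45, 50, np.inf]
names = ['<=18', '18-20', '20-25', '25-30', '30-35', '35-40', '40-45', '45-50', '50+']

# With right=True (the default) each bin is (left, right], so an age that
# equals an edge value lands in the lower of the two neighbouring bins.
toy_ranges = pd.cut(toy_ages, bins, labels=names)
print(pd.DataFrame({"Age": toy_ages, "AgeRange": toy_ranges}))
```

So a 20-year-old is labelled `18-20` and a 25-year-old `20-25`, which is worth keeping in mind when reading the age range charts.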

In [None]:
# Let us add the IncomeRange to the dataframe
bins = [0,10000, 30000, 50000,70000,90000,110000,np.inf]
names = ['<=10k','10-30k', '30-50k','50-70k','70-90k','90-110k','110k+']

data['IncomeRange'] = pd.cut(data['Income'], bins, labels=names)

In [None]:
# Let us add the fitness range also
bins=[0,3,np.inf]
names=['<=3','4+']
data['FitRange']=pd.cut(data['Fitness'],bins,labels=names)

In [None]:
# How many rows and columns do we have
data.shape

In [None]:
# What is the 5 point data summary for the dataset. I love to see this in a horizontal view, so transposing
data.describe().T

In [None]:
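# What is the median of the numerical values?
# I am more interested in the median income
# The median income in the US is around 33K USD, so this group of customers looks fairly well off
# numeric_only=True skips the text and categorical columns, which newer pandas versions refuse to aggregate
data.median(numeric_only=True)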

In [None]:
# Do we have any missing value?
# We do not have any missing value in the data
data.isnull().sum()
# data.info() - This would also give you the information if there are any missing values
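
Had there been missing values, one simple option would have been to fill numeric columns with their median and the remaining columns with their most frequent value. The sketch below shows that idea on a toy frame; `simple_impute` is a hypothetical helper, the values are made up, and this is only one of several reasonable imputation strategies.

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing values (illustrative, not the real data).
toy = pd.DataFrame({
    "Income": [30000.0, np.nan, 50000.0, 70000.0],
    "Gender": ["Male", "Female", None, "Female"],
})

def simple_impute(df):
    """Fill numeric columns with the median and all other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

imputed = simple_impute(toy)
print(toy.isnull().sum())      # missing values per column before imputation
print(imputed.isnull().sum())  # and after
```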

# Feature Analysis

# Uni-variate Analysis

From here onwards, I start my feature analysis, beginning with the uni-variate analysis.

* Feature distribution - All the features are positively skewed, with the highest skew in Age, Income and Miles
* Gender distribution - We have more males than females in our customer base (a gap of about 16 percentage points)
* Marital status distribution - We have more married customers than single ones. Why are treadmills not attracting more single people? Is it because single people get more opportunity and time to do outdoor workouts?
* Age Range distribution - The customer profile is fairly young. Close to 90% are between 20 and 40 years of age.
* Income Range distribution - These customers are doing well economically. There are a few outliers whose earnings are significantly higher (more males than females fall in the higher income bracket)
* Education distribution - A well-educated customer profile: 47% are graduates and 16% are postgraduates or more. All customers have at least a high school education
* Product popularity - TM195 looks like the most popular product. We will see later whether a particular product is more popular with the high-income group
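
The percentages quoted above can also be read from a small table instead of from chart labels. The sketch below uses a hypothetical helper, `share_table`, on toy values rather than the real customer data.

```python
import pandas as pd

def share_table(series):
    """Return the count and percentage share of each category in a Series."""
    counts = series.value_counts()
    pct = (counts / counts.sum() * 100).round(1)
    return pd.DataFrame({"count": counts, "pct": pct})

# Toy values for illustration only (not the real customer data).
toy_gender = pd.Series(["Male", "Male", "Male", "Female", "Female"], name="Gender")
table = share_table(toy_gender)
print(table)
```

On the notebook's dataframe the same helper could be called as `share_table(data['Gender'])` or `share_table(data['AgeRange'])`.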

In [None]:
# What is the distribution of each feature?
dist=data.hist(figsize=(10,15))
print('Age has a skew of {} and Kurtosis of {}'.format(data['Age'].skew(),data['Age'].kurt()))
print('Education has a skew of {} and Kurtosis of {}'.format(data['Education'].skew(),data['Education'].kurt()))
print('Fitness has a skew of {} and Kurtosis of {}'.format(data['Fitness'].skew(),data['Fitness'].kurt()))
print('Income has a skew of {} and Kurtosis of {}'.format(data['Income'].skew(),data['Income'].kurt()))
print('Miles has a skew of {} and Kurtosis of {}'.format(data['Miles'].skew(),data['Miles'].kurt()))
print('Usage has a skew of {} and Kurtosis of {}'.format(data['Usage'].skew(),data['Usage'].kurt()))
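
The same skew and kurtosis figures can be collected into one table instead of printing them column by column. This is a minimal sketch with a hypothetical helper, `shape_summary`, shown on toy columns. pandas reports excess kurtosis, so a normal distribution scores around 0.

```python
import pandas as pd

def shape_summary(df):
    """Skewness and excess kurtosis for every numeric column, one row per column."""
    numeric = df.select_dtypes(include="number")
    return pd.DataFrame({"skew": numeric.skew(), "kurtosis": numeric.kurt()})

# Toy columns for illustration: one symmetric, one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_tailed": [1, 1, 1, 2, 10],
    "label": list("abcde"),  # non-numeric, ignored by the helper
})
summary = shape_summary(toy)
print(summary)
```

A positive skew means a long right tail, which is the pattern described for Age, Income and Miles.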

In [None]:
# Let's look at the gender distribution
sns.set_style('whitegrid')  # set_style applies the style; axes_style only returns the settings
g=sns.catplot(x="Gender",data=data,aspect=2,kind="count",legend=True,palette=sns.color_palette(['blue','pink']))
for i,bar in enumerate(g.ax.patches):
    h=bar.get_height()
    # Label each bar with its count and its share of all customers
    g.ax.text(i,h+2,'{},{}%'.format(int(h),round(int(h)/len(data)*100)),ha='center',va='center',fontweight='bold',size=14)

In [None]:
# Let's look at the marital status distribution
sns.set_style('whitegrid')  # set_style applies the style; axes_style only returns the settings
g=sns.catplot(x="MaritalStatus",data=data,aspect=2,kind="count",legend=True,palette=sns.color_palette(['blue','green']))
for i,bar in enumerate(g.ax.patches):
    h=bar.get_height()
    # Label each bar with its count and its share of all customers
    g.ax.text(i,h+2,'{},{}%'.format(int(h),round(int(h)/len(data)*100)),ha='center',va='center',fontweight='bold',size=14)

In [None]:
# Let us understand the customer distribution across age range
g=sns.catplot(x="AgeRange",data=data,aspect=2,kind="count",color="steelblue",legend=True)
for i,bar in enumerate(g.ax.patches):
    h=bar.get_height()
    # Empty age ranges may report a missing bar height; label those bars as 0
    if not h>0:
        h=0
    g.ax.text(i,h+2,'{},{}%'.format(int(h),round(int(h)/len(data)*100)),ha='center',va='center',fontweight='bold',size=14)


In [None]:
# Let us understand the customer distribution across Income range
g=sns.catplot(x="IncomeRange",data=data,aspect=2,kind="count",color="steelblue",legend=True)
for i,bar in enumerate(g.ax.patches):
    h=bar.get_height()
    # Empty income ranges may report a missing bar height; label those bars as 0
    if not h>0:
        h=0
    g.ax.text(i,h+2,'{},{}%'.format(int(h),round(int(h)/len(data)*100)),ha='center',va='center',fontweight='bold',size=14)


In [None]:
# This also shows that the majority of the customers earn between 40K and 70K USD, higher than the US median salary
# There are some customers at the higher end of the salary bracket (mostly males)
ax = sns.stripplot(x=data["Gender"],y=data["Income"], jitter=True)

In [None]:
# What is the education level of my clients?
sns.set_style('whitegrid')  # set_style applies the style; axes_style only returns the settings
g=sns.catplot(x="Education",data=data,aspect=2,kind="count",legend=True)
for i,bar in enumerate(g.ax.patches):
    h=bar.get_height()
    # Label each bar with its count and its share of all customers
    g.ax.text(i,h+2,'{},{}%'.format(int(h),round(int(h)/len(data)*100)),ha='center',va='center',fontweight='bold',size=14)

In [None]:
# Let's look at which product is selling more
sns.set_style('whitegrid')  # set_style applies the style; axes_style only returns the settings
g=sns.catplot(x="Product",data=data,aspect=2,kind="count",legend=True)
for i,bar in enumerate(g.ax.patches):
    h=bar.get_height()
    # Label each bar with its count and its share of all customers
    g.ax.text(i,h+2,'{},{}%'.format(int(h),round(int(h)/len(data)*100)),ha='center',va='center',fontweight='bold',size=14)

# Outlier Identification

* There are significant outliers in Income. Some customers in this profile earn significantly more than the rest, and their choice of product may differ from that of the other customers. The analysis of Income Range against Product explores this
* Age also has outliers, but they are not extreme, so we can probably live with them
* Education has a few outliers as well
* Usage does not have many outliers
* Miles has a good number of outliers
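
To put numbers on these observations, the points a box plot draws as outliers can be listed directly. The sketch below assumes the common 1.5 x IQR rule, which is also the default whisker length in seaborn and matplotlib box plots, although their quartile calculation may differ slightly from `Series.quantile`. `iqr_outliers` is a hypothetical helper and the incomes are toy values.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy incomes in thousands, with one value far above the rest (illustrative only).
toy_income = pd.Series([40, 45, 50, 52, 55, 58, 60, 150], name="Income")
flagged = iqr_outliers(toy_income)
print(flagged)
```

Calling `len(iqr_outliers(data['Miles']))` on the notebook's dataframe would give the number of outliers behind the last bullet.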




In [None]:
# Passing the column as x= keeps the box plot horizontal across seaborn versions
ax=sns.boxplot(x=data['Income'])

In [None]:
ax=sns.boxplot(x=data['Age'])

In [None]:
ax=sns.boxplot(x=data['Education'])

In [None]:
ax=sns.boxplot(x=data['Usage'])

In [None]:
ax=sns.boxplot(x=data['Miles'])

# Multi-Variate Analysis

This is mostly a bi-variate analysis.

* TM798 looks more popular with the high-income customers. The few low-income customers bought TM195
* TM798 owners appear to be the fittest. This particular equipment may be a higher-end model
* The fitness level of the majority of the customers is in the low to mid range
* TM195 is popular with both male and female customers, while TM798 is more popular with male customers
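
Product-by-group comparisons like these can also be expressed as a table of shares. The sketch below uses `pd.crosstab` with `normalize="index"` on toy rows (not the real data), so each product row sums to 1.

```python
import pandas as pd

# Toy purchases for illustration only.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM195", "TM798", "TM798"],
    "Gender":  ["Male", "Female", "Female", "Male", "Male"],
})

# normalize="index" converts the counts in each row into shares of that row.
mix = pd.crosstab(toy["Product"], toy["Gender"], normalize="index")
print(mix.round(2))
```

The same call with `data["Product"]` and `data["IncomeRange"]` or `data["FitRange"]` would tabulate the other comparisons in this section.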



In [None]:
ax=sns.catplot(x='Product',kind='count',hue='IncomeRange',data=data,aspect=2)

In [None]:
ax=sns.catplot(x='Product',kind='count',hue='FitRange',data=data,aspect=2)

In [None]:
ax=sns.catplot(x='FitRange',kind='count',hue='AgeRange',data=data,aspect=2)

In [None]:
ax=sns.catplot(x='Product',kind='count',hue='Gender',data=data,aspect=2)

In [None]:
plot=sns.catplot(x='Product',kind='count',hue='MaritalStatus',data=data)

# Bivariate Analysis - more observations

* Female customers have a lower income than male customers
* There are also more outliers among male customers
* TM798 is popular with the high-income group
* Customers with a postgraduate education prefer TM798
* Customers with high usage prefer TM798
* TM798 is popular in the 25-30 age group
* Income increases with education
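
A trend such as income rising with education can be checked numerically by comparing the median income at each education level. This is a minimal sketch on toy values, not the real data.

```python
import pandas as pd

# Toy rows for illustration: years of education and income.
toy = pd.DataFrame({
    "Education": [14, 14, 16, 16, 18, 18],
    "Income":    [35000, 45000, 50000, 60000, 80000, 90000],
})

# Median income at each education level, ordered by education.
median_by_edu = toy.groupby("Education")["Income"].median()
is_increasing = median_by_edu.is_monotonic_increasing
print(median_by_edu)
print("Median income rises with education:", is_increasing)
```

Replacing `"Education"` with `"Product"` or `"Gender"` gives the medians behind the other income observations.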


In [None]:
ax=sns.boxplot(x='Income', y='Gender', data=data)

In [None]:
ax=sns.boxplot(x='Income', y='Product', data=data)

In [None]:
ax=sns.boxplot(x='Education', y='Product', data=data)

In [None]:
ax=sns.boxplot(x='Usage', y='Product', data=data)

In [None]:
ax=sns.boxplot(x='Age', y='Product', data=data)

In [None]:
ax=sns.boxplot(x='Miles', y='Product', data=data)

In [None]:
ax=sns.boxplot(x='Education', y='Income', data=data)

In [None]:
# String aggregation names avoid the callable deprecation warning in newer pandas; 'count' replaces len
pd.pivot_table(data,values='Income', index=['Product', 'Gender'],
                     columns=[ 'MaritalStatus'],aggfunc=['median','mean','count'])

In [None]:
pd.pivot_table(data,values='Age', index=['Product', 'Gender'],
                     columns=[ 'MaritalStatus'],aggfunc=['median','mean'])

# Correlation Analysis

* Fitness and Age do not seem to be correlated
* Fitness and Usage are highly correlated
* Similarly, Fitness and Miles are highly correlated
* Income and Age are moderately correlated
* Income and Education also seem to be well correlated
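
Reading pairs off a heatmap is easier with a ranked list next to it. The sketch below flattens the correlation matrix into one row per feature pair, sorted by absolute correlation. `ranked_correlations` is a hypothetical helper, the frame is toy data, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import numpy as np
import pandas as pd

def ranked_correlations(df):
    """Return each pair of numeric columns once, strongest correlation first."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so every pair appears exactly once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack().dropna()
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy frame for illustration: Miles is an exact multiple of Usage.
toy = pd.DataFrame({
    "Usage":   [2, 3, 4, 5, 6],
    "Miles":   [40, 60, 80, 100, 120],
    "Age":     [30, 22, 41, 25, 35],
    "Product": ["a", "b", "a", "b", "a"],  # non-numeric, ignored
})
ranked = ranked_correlations(toy)
print(ranked)
```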



In [None]:
plt.figure(dpi=120,figsize=(5,4))
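# numeric_only=True restricts the correlation to numeric columns; newer pandas raises an error on text columns otherwise
corr=data.corr(numeric_only=True)
mask=np.triu(np.ones_like(corr,dtype=bool),0)
sns.heatmap(corr,mask=mask,fmt=".1f",annot=True,lw=1,cmap='plasma')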
plt.yticks(rotation=0)
plt.xticks(rotation=90)
plt.title('Correlation Heatmap')
plt.show()

In [None]:
ax=sns.pairplot(data=data)

# Pandas profiling

I kept this for the end. This is the easiest thing to do, and it works as a catch-all for everything I have done above.

In [None]:
# Note: newer releases of this package are published under the name ydata_profiling
import pandas_profiling as pp

In [None]:
data_profile = pd.read_csv('/kaggle/input/cardiogoodfitness/CardioGoodFitness.csv')
pp.ProfileReport(data_profile)