## Background

### Objective: Preliminary Data Analysis
To explore the dataset and practice extracting basic observations about the data.

### Context
The data covers customers of the treadmill products of a retail store called Cardio Good Fitness. It contains the following variables:

- Product - the model number of the treadmill
- Age - the customer's age in years
- Gender - of the customer
- Education - the customer's education in years
- Marital Status - of the customer
- Usage - average number of times per week the customer wants to use the treadmill
- Fitness - self-rated fitness score of the customer (5 = very fit, 1 = very unfit)
- Income - of the customer
- Miles - number of miles the customer expects to run







## Key Questions
- Build a customer profile (the characteristics of a typical customer) for each product
- Perform univariate and multivariate analyses
- Generate a set of insights and recommendations that will help the company target new customers


### Loading the libraries and packages: numpy, pandas, seaborn, matplotlib, scipy

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
from scipy.stats import skew, norm, probplot, boxcox, f_oneway
sns.set(style="darkgrid")

In [None]:
import warnings
warnings.filterwarnings('ignore')

### Reading the dataset

In [None]:
fit = pd.read_csv('../input/cardiogoodfitness/CardioGoodFitness.csv')  # Reading the dataset

In [None]:
fit.head()  # Displaying the first 5 rows of the dataset

In [None]:
fit.tail()  # Displaying the last 5 rows of the dataset

#### Observations
These are based on eyeballing the displayed rows. Missing values are investigated later on.

- Product contains the treadmill model
- Age contains the customer's age in years
- Gender is either male or female
- Education is the number of years of education
- Marital status is either single or partnered
- Usage is the average number of times per week the customer wants to use the treadmill
- Fitness is the customer's self-rated fitness score on a scale of 1-5 (5 = very fit, 1 = very unfit)
- Income is the customer's income
- Miles, the last column, is the number of miles the customer expects to run

In [None]:
fit.shape  # checking the shape of the dataset: (rows, columns)

- The dataset has 180 rows and 9 columns.

### Checking the structure of the dataset

In [None]:
fit.info()  # shows the columns, their data types, and non-null counts (an indication of missing values)

#### Observations:

- All columns have 180 non-null entries, indicating that there are no missing values (this is verified further with the isna function)
- Product, Gender, and MaritalStatus are of type object. It is advisable to convert them to the categorical data type, which uses less memory and is useful for model building at later stages
- The other variables are numerical (integer data types)
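The efficiency argument for the categorical type can be checked on a small example. The sketch below uses a hypothetical toy column of repeated model labels (not the real dataset) and compares the memory footprint of the two representations.

```python
import pandas as pd

# Hypothetical toy column: 1,800 repeated model labels stored as plain Python strings.
labels = ["TM195", "TM498", "TM798"] * 600
as_object = pd.Series(labels, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the string payloads, not just the references to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes} bytes")
print(f"category: {category_bytes} bytes")
```

The categorical version stores each distinct label once plus a small integer code per row, which is why it is smaller whenever a column has few distinct values.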

In [None]:
fitcopy = fit.copy()  # Making a copy of the dataset to recall (if required) after changes are made

# Data Preprocessing
This step transforms the raw data into a useful and efficient format suitable for analysis. We will:
- carry out data conversion (if any), and
- check for missing values (and fix them)
        

### *Data conversion

Fixing the data: converting object data types into categorical variables

In [None]:
fit['Product'] = fit['Product'].astype('category')
fit['Gender'] = fit['Gender'].astype('category')
fit['MaritalStatus'] = fit['MaritalStatus'].astype('category')

In [None]:
fit.info()

#### As the output above shows, the three columns are now of the category data type

In [None]:
fit.isna().sum()  # Next, we check for missing values in each column

In [None]:
fit.isnull().sum()  # isnull is an alias of isna, so this repeats the check above

#### Good news! There are no missing values. The data is ready for analysis.
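To show what the check would report if values were missing, here is a minimal sketch on a hypothetical toy frame with two gaps. The column names are borrowed from the dataset, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Age and one missing Income.
toy = pd.DataFrame({
    "Age": [25, np.nan, 31],
    "Income": [40000, 52000, np.nan],
    "Miles": [60, 85, 100],
})

# Count the missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# isnull is an alias of isna, so the two masks are identical.
same = toy.isna().equals(toy.isnull())
```

A non-zero count in any column would call for a fix (for example dropping or imputing rows) before the analysis continues.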

# EDA - Univariate Analysis

**Using value_counts & countplot to analyse certain variables**


In [None]:
fit['Product'].value_counts()#counting the number of treadmills by model

#### Observation:
- TM195 is the most frequently purchased model, followed by TM498 and TM798.
- A plausible explanation is that TM195 is the least expensive model, but the dataset contains no price information, so this cannot be confirmed here.
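Raw counts are easier to compare when they are shown next to shares. The following sketch uses a hypothetical toy series (not the real counts) to show how `value_counts(normalize=True)` returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical toy purchases, not the real counts from the dataset.
toy_products = pd.Series(["TM195", "TM195", "TM195", "TM498", "TM498", "TM798"])

counts = toy_products.value_counts()
shares = toy_products.value_counts(normalize=True)  # proportions that sum to 1

print(pd.DataFrame({"count": counts, "share": shares.round(2)}))
```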

In [None]:
fit['Gender'].value_counts()#counting the number of Males and Females

In [None]:
# Visualisation of the number of men and women

plt.figure(figsize=(5,5))
sns.countplot(x='Gender', data=fit);

#### Observation: 
- In this dataset, more treadmill customers are men than women

In [None]:
#counting the number of customers based on marital status
fit['MaritalStatus'].value_counts()

In [None]:
# Visualisation of the number of customers by marital status
plt.figure(figsize=(5,5))
sns.countplot(x='MaritalStatus', data=fit);

#### Observation: 
- In this dataset, partnered customers outnumber single customers

## *Calculate a 5-point summary
This gives an overview of the measures of central tendency and the measures of dispersion of the dataset, which is useful for further analysis

In [None]:
fit.describe().T  # note that only the numerical columns are included; the result is transposed for better readability

#### Observations:

- The mean age of a customer is 29 years and the median is 26. The minimum age is 18 and the maximum is 50. The middle 50% of customers are between 24 and 33 years old (an IQR of 9), so customers are concentrated on the younger side of the age range.
- Customers have completed at least a high school education. The mean education level is approximately 16 years, which coincides with the median of 16 years. The middle 50% of values lie between 14 and 16 years, so the data points tend to be very close to the mean.
- The mean and median usage of the treadmill are both approximately 3 times a week.
- Fitness is a self-rated variable, and most users rate themselves at an average fitness level of 3.
- The middle 50% of annual incomes lie between approximately 44k and 58k.
- The mean number of miles is 103, which is above the median of 94. The middle 50% of values lie between 66 and 115 miles.
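The IQR quoted for Age can be reproduced from the quartiles. As a minimal sketch, the toy sample below consists only of the five Age summary values quoted above (minimum, Q1, median, Q3, maximum), not the full column.

```python
import numpy as np

# Toy sample built from the quoted Age summary: min, Q1, median, Q3, max.
ages = np.array([18, 24, 26, 33, 50])

# Quartiles of the sample; the IQR is the width of the middle 50%.
q1, median, q3 = np.percentile(ages, [25, 50, 75])
iqr = q3 - q1

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
```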

### Check skewness of the quantitative variables

In [None]:
fit.skew(numeric_only=True)  # skewness of the numeric columns; the categorical columns are skipped

#### Observations:

- Income and Miles are highly positively skewed compared to Age, Education, and Usage
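For readers who want to see what the skewness number measures, the sketch below recomputes it by hand on a hypothetical right-skewed toy sample and compares the result with pandas. It assumes the sample-size-adjusted (Fisher-Pearson) form, which is the one pandas reports.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed toy sample: one large value stretches the right tail.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
n = len(x)

# Biased moment estimator g1 = m3 / m2**1.5.
deviations = x - x.mean()
m2 = (deviations ** 2).mean()
m3 = (deviations ** 3).mean()
g1 = m3 / m2 ** 1.5

# Sample-size adjustment applied on top of g1.
adjusted = np.sqrt(n * (n - 1)) / (n - 2) * g1

print(f"manual: {adjusted:.4f}, pandas: {x.skew():.4f}")
```

A positive value means the longer tail is on the right, which is the pattern described for Income and Miles.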


In [None]:
fit.kurtosis(numeric_only=True)  # kurtosis of the numeric columns; the categorical columns are skipped

#### Observations:
- Miles has the heaviest tail (towards the right), followed by Education and Income. Fitness has a negative kurtosis value, which means that its distribution has lighter tails and is flatter than the normal curve.
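The sign of the kurtosis value can be illustrated with two hypothetical toy series, one evenly spread and one with a single extreme value. pandas reports excess kurtosis, so a normal curve scores 0.

```python
import pandas as pd

# Hypothetical toy series, invented for illustration.
flat = pd.Series(range(1, 101))       # evenly spread values, light tails
heavy = pd.Series([0] * 99 + [100])   # one extreme value, heavy right tail

# Excess kurtosis (Fisher's definition): negative = flatter than normal, positive = heavier tails.
flat_kurt = flat.kurtosis()
heavy_kurt = heavy.kurtosis()

print(f"flat: {flat_kurt:.2f}, heavy-tailed: {heavy_kurt:.2f}")
```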


## BoxPlot & Histogram - Visualising the 5-point summary along with Histogram

 - Showing the box plot and histogram together will enable us to identify, visualise, and understand the patterns in our CardioGoodFitness dataset

 - The box plot gives a pictorial representation of the 5-point summary and will help us to identify the outliers, if any

 - A histogram plots the distribution of a numeric variable’s values as a series of bars. Each bar typically covers a range of numeric values called a bin or class; a bar’s height indicates the frequency of data points with a value within the corresponding bin.
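The points a box plot draws as outliers follow a simple rule: by default, values more than 1.5 times the IQR beyond the quartiles are flagged. The sketch below applies that rule using the Age quartiles quoted in the summary (24 and 33); the sample ages themselves are hypothetical.

```python
import numpy as np

# Quartiles quoted in the Age summary; the sample ages below are hypothetical.
q1, q3 = 24, 33
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (the default whisker rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

ages = np.array([18, 22, 26, 35, 46, 47, 50])
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(f"fences: [{lower_fence}, {upper_fence}] -> outliers: {outliers.tolist()}")
```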



In [None]:
def histogram_boxplot(feature, figsize=(10,10), bins=None):
    """ Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (10,10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(nrows=2, # Number of rows of the subplot grid = 2
                                           sharex=True, # x-axis will be shared among all subplots
                                           gridspec_kw={"height_ratios": (.10, .10)}, # equal heights for the two subplots
                                           figsize=figsize
                                           ) # creating the 2 subplots
    sns.boxplot(x=feature, ax=ax_box2, showmeans=True, color='red') # boxplot will be created and a marker will indicate the mean value of the column
    # For histogram: sns.distplot is deprecated, so sns.histplot (seaborn >= 0.11) is used instead
    sns.histplot(x=feature, kde=False, ax=ax_hist2, bins=bins if bins else 'auto')
    ax_hist2.axvline(np.mean(feature), color='g', linestyle='--') # Add mean to the histogram
    ax_hist2.axvline(np.median(feature), color='black', linestyle='-') # Add median to the histogram

In [None]:
histogram_boxplot(fit.Age)

### Observations on Age:
- Range is 18-50 years (5-point summary)
- Skewness = 0.982161, meaning that the distribution is positively skewed
- Mean = 29, median = 26: the customer population is young across the product lines
- The right tail of the distribution is longer, and ages are concentrated between 24 and 33 years. There are a few outliers
- More variation on the right side


In [None]:
histogram_boxplot(fit.Income)

### Observations on Income:
- Range lies between 29K and 104K (5-point summary)
- Skewness = 1.291
- The distribution is highly skewed
- Mean = 53K, median = 50K
- There is a large number of outliers, indicating a large variation. As can be seen from the distribution itself, some people earn significantly more.
- There could be a correlation between Income & Education, which we will explore soon



In [None]:
histogram_boxplot(fit.Education)

### Observations on Education:
- Skewness = 0.622. The distribution is positively skewed
- Mean = 15.57, median = 16 (close to the mean). Customers are well educated, typically having completed a bachelor’s degree
- A few outliers

In [None]:
histogram_boxplot(fit.Usage)

### Observations on Usage:
- Skewness = 0.739. The distribution is positively skewed
- Mean = 3.4, median = 3 (very close to the mean). On average, a customer uses the treadmill 3 times a week
- Very few outliers


In [None]:
histogram_boxplot(fit.Fitness)

### Observations on Fitness:
- Skewness = 0.454. The distribution is positively skewed
- Mean = 3.3, median = 3 (very close). Overall, an average level of fitness
- An outlier can be seen

In [None]:
histogram_boxplot(fit.Miles)

### Observations on Miles:
- Skewness = 1.7. The distribution is highly positively skewed
- Mean = 103, median = 94
- There are many outliers: miles vary widely among the customer base
- There could be a correlation between Fitness & Miles, which we will explore soon


## EDA Bivariate & Multivariate analysis

### Exploring more than one variable at a time to identify and analyse relationships for each product and determine a customer profile.

## *Gender

In [None]:
# Checking the popularity of each treadmill by gender by cross-tabulating the two
pd.crosstab(fit["Product"],fit["Gender"])

In [None]:
sns.countplot(x='Product' , hue='Gender',data=fit)# Impact of gender on product-visualisation

### Observation
- For TM195 there is no visible difference between genders. For TM498 there is a slight difference
- For TM798, men constitute a larger customer base than women
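Because the products have different numbers of customers, row-wise proportions can make the gender split easier to compare than raw counts. The sketch below uses a hypothetical toy frame (the values are invented) and the `normalize="index"` option of `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy purchases, chosen only to illustrate the call.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798", "TM798", "TM798"],
    "Gender":  ["Female", "Male", "Female", "Male", "Male", "Male", "Male", "Female"],
})

# normalize="index" divides each row by its total, giving the gender split per product.
row_shares = pd.crosstab(toy["Product"], toy["Gender"], normalize="index")
print(row_shares)
```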


## *Marital Status

In [None]:
pd.crosstab(fit["Product"],fit["MaritalStatus"])#Identifying the relationship between Product and Marital Status

In [None]:
sns.countplot(x='Product' , hue='MaritalStatus',data=fit)# Impact of Marital Status on product

### Observations:
- Partnered customers outnumber single customers for the products

## *Age


In [None]:
pd.pivot_table(fit, "Age" ,index=["Product", "Gender"], columns=["MaritalStatus"],aggfunc=[np.median,np.mean,len])
#Analysing the mean and median ages of customers by gender and marital status for the product

In [None]:
pd.pivot_table(fit,values='Age',index=['Product'],aggfunc=[np.median,np.mean])
# Calculating the average and median AGE of customers for each product

### Observation
Across the products, the median age is between 26 and 27 and the mean age is close to 29. The population is fairly young.
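An alternative way to build the same kind of per-product summary is `groupby` with string aggregation names. Recent pandas versions may emit a warning when NumPy functions such as `np.mean` are passed as aggregators, and the string names avoid that. The frame below is a hypothetical toy example, not the real data.

```python
import pandas as pd

# Hypothetical toy ages per product.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM195", "TM498", "TM498", "TM798"],
    "Age": [22, 26, 36, 25, 31, 27],
})

# String names select pandas' built-in aggregations.
age_by_product = toy.groupby("Product")["Age"].agg(["median", "mean", "count"])
print(age_by_product)
```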

In [None]:
fit.hist(by="Product", column="Age", bins=10)  # Age concentration by product

### Observation
- For the three products, the ages are mainly concentrated between 20 and 35 years
- Customers of TM798 are younger than those of TM195 & TM498

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(x='Age',y='Education', data=fit,hue='Product')# exploring the Product relationship with Age & Education

### Observation
- The customers of TM798 are highly educated and relatively younger than those of TM195 & TM498

## * Income

In [None]:
fit[['Income', 'Product']].groupby(['Product']).median().sort_values(by='Product')  # median income by product

In [None]:
pd.pivot_table(fit, "Income", index=["Product", "Gender"], columns=["MaritalStatus"], aggfunc=[np.median,np.mean])
# Analysing the mean and median income of the customers who buy each product, by marital status and gender

### Observation
- The median income of TM798 customers is higher than that of TM195 & TM498 customers
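Whether such an income gap is larger than chance variation could be checked with a one-way ANOVA, using the `f_oneway` function already imported at the top of the notebook. This is only a sketch on hypothetical toy incomes (in thousands), and note that ANOVA compares group means rather than medians and assumes roughly normal groups with similar variances.

```python
from scipy.stats import f_oneway

# Hypothetical incomes in thousands; not taken from the dataset.
tm195 = [45, 46, 47, 48, 49]
tm498 = [48, 49, 50, 51, 52]
tm798 = [70, 72, 75, 78, 80]

# One-way ANOVA: the null hypothesis is that all three group means are equal.
f_stat, p_value = f_oneway(tm195, tm498, tm798)

print(f"F = {f_stat:.1f}, p = {p_value:.2g}")
```

On the real data, the groups would be the Income values of each product, for example `fit.loc[fit['Product'] == 'TM798', 'Income']`.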

In [None]:
sns.scatterplot(x='Age', y='Income',data=fit, hue = 'Product')# exploring the Product relationship with Age & Income


### Observation
- The TM798 population is young, and a large number of TM798 customers earn more than the customers of TM498 & TM195.
- TM798 looks like a more expensive product, as higher-income people tend to buy it more than they buy TM195 & TM498

In [None]:
sns.barplot(x='Gender', y='Income',data=fit, hue = 'Product',ci=None)# exploring the Product relationship with Income & Gender

### Observation

- There is no visible difference between the incomes of males and females within the product categories


In [None]:
sns.barplot(x='Education', y='Income',data=fit, hue = 'Product', ci=None)
# exploring the Product relationship with Income & Education


### Observation

- TM498 and TM195 are preferred by the low-to-average income group who have finished college/vocational studies.
- TM798 appears to be a high-end model popular among high-income earners, who are also more highly educated than the buyers of TM195 & TM498.

- There might be a correlation between Income and Education; we shall see later.

## *Miles & Fitness

In [None]:
fit[['Miles', 'Product']].groupby(['Product']).median()# Miles run by product

In [None]:
pd.pivot_table(fit, "Miles" ,index=["Product"], columns=["Gender"],aggfunc=[np.mean,np.median])
# Analysing the mean and median miles run for the product by gender 

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(x='Age',y='Miles', data=fit,hue='Product');# exploring the Product relationship with Age & Miles

### Observation

- Customers of TM798 are on the younger side and run more miles than the customer base of TM195 & TM498

In [None]:
fit[['Fitness', 'Product']].groupby(['Product']).median()#Fitness level by product

In [None]:
plt.figure(figsize=(6,6))
sns.scatterplot(x='Fitness',y='Miles', data=fit,hue='Product');# exploring the Product relationship with Fitness & Miles

### Observation
- Customers of TM798 are health-conscious: they run more miles and consider themselves fitter than the customers of TM195 & TM498

In [None]:
pd.pivot_table(fit, "Fitness" ,index=["Product"], columns=["Gender"],aggfunc=[np.mean,np.median])
# Analysing the mean and median level of fitness for the product by gender

### Observation
- It seems that the customers of TM798 are fitter than those of TM195 & TM498

In [None]:
plt.figure(figsize=(6,6))
sns.barplot(x='Fitness',y='Usage', data=fit,hue='Product', ci=None);#comparing the relation between usage and fitness 

### Observation
- Customers of TM798 use the treadmill more times a week and rate themselves as fitter than the customers of TM195 & TM498

###  *Usage

In [None]:
plt.figure(figsize=(6,6))
sns.swarmplot(x='Usage',y='Miles', data=fit,hue='Product');#comparing the relation between usage and miles

In [None]:
fit[['Usage', 'Product']].groupby(['Product']).median()  # Usage per week by product

In [None]:
pd.pivot_table(fit, "Usage" ,index=["Product"], columns=["Gender"],aggfunc=[np.mean,np.median])
# Analysing the level of usage for the product by gender

### Observation
- The mean and median usage of TM798 customers is higher than that of TM195 & TM498 customers, meaning that the customers of TM798 use the treadmill more times a week. They also run more miles than the customers of TM195 & TM498


## Correlation & Heatmap

In [None]:
corr = fit.corr(numeric_only=True)  # numeric columns only; the categorical columns are skipped
corr

In [None]:
plt.figure(figsize=(8,8))
sns.heatmap(corr,annot=True)


## Observations:
- The correlations between Age & Fitness, Age & Usage, and Age & Miles are negligible.
- Education and Income have a good correlation.
- There is a moderate correlation between Usage & Miles and between Usage & Fitness, and a weaker one between Usage & Income.
- Income & Age and Income & Education have a good correlation, followed by the correlations between Income & Fitness and Income & Miles.
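Each cell of the correlation matrix is a Pearson coefficient. As a minimal sketch on a hypothetical toy frame, the code below computes one coefficient from its definition and compares it with the pandas result. It also shows that `numeric_only=True` leaves categorical columns out of the matrix.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; Product stands in for the categorical columns.
toy = pd.DataFrame({
    "Product": pd.Categorical(["TM195", "TM195", "TM498", "TM798", "TM798"]),
    "Usage": [2, 3, 3, 5, 6],
    "Miles": [50, 75, 85, 160, 180],
})

# numeric_only=True drops the categorical column before computing Pearson's r.
corr = toy.corr(numeric_only=True)

# Same coefficient from the definition: co-deviation over the product of the spreads.
u, m = toy["Usage"], toy["Miles"]
manual_r = ((u - u.mean()) * (m - m.mean())).sum() / np.sqrt(
    ((u - u.mean()) ** 2).sum() * ((m - m.mean()) ** 2).sum()
)

print(corr.round(3))
```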

### Bivariate Scatter Plots

In [None]:
sns.pairplot(fit)

### Observation
- The scatter plots mirror the correlations computed above and the earlier exploration of the interplay between the different variables


## **Insights & Conclusion 

 *TM195
- It is the most popular product among all age groups. It is equally popular with males and females, and with the low-to-average income group who have finished college/vocational studies. The customer base is comparatively less fit and runs fewer miles (its customers do not use the equipment as frequently as the customers of TM798). Partnered people use it more than single people.

*TM498
- It is the second most popular product up to the middle-age group (up to about 40 years). There is not much difference in popularity between males and females. It appeals to the low-to-average income group who have finished college/vocational studies. The customer base is comparatively less fit and runs fewer miles (its customers do not use the equipment as frequently as the customers of TM798). Partnered people use it more than single people.

*TM798
- This appears to be a high-end model. It is very popular among young, partnered males (under about 30 years) who earn more than the customers of TM195 & TM498. People buying this product are more health-conscious and fit. They run more miles.



## **Further Analysis & Recommendations

- More information on the features of each product would help in further verifying the popularity of the models
- Fitness level is a self-rated variable. Basing it on a fitness test would make the analysis more concrete and reliable.
- A further analysis of the cost price and selling price would help in understanding how (target) customers are classified by income and how that relates to product sales
- Qualitative variables such as customer fitness goals (weight loss, looking good, staying fit for health reasons or underlying conditions) would be useful.

### Recommendations

- Partnered customers buy more than single customers, so the focus should be on expanding sales among ‘singles’. Run advertising campaigns or sponsor events in colleges and universities that emphasise the benefits of staying fit, targeting the 20-25 years age range.

- Use market research and surveys to understand singles’ fitness requirements, goals, etc. Incorporate the findings into the models as “add-ons” to make the product more attractive and easier to sell to ‘singles’.

- Set up health check-up camps in residential societies to raise awareness, and tie them in with product promotion. This way both the existing ‘partnered’ base and the ‘single’ customer base may be expanded further.

- People with 12-14 years of education can also be target customers for TM195 & TM498.

- Place posters promoting health and fitness in doctors’ waiting areas, aimed at the 40-50 years age range. This age group is not targeted well and may have high potential, as its members are mostly inclined towards maintaining good health.

- Tie up with corporates to offer employees discounted deals on the purchase of a treadmill
- Offer special deals on TM798 for female customers or working women
- Boost sales and marketing campaigns during January, as people tend to make New Year’s resolutions about getting fit
- Emphasise the convenience of exercising anytime and anywhere, and that more people can exercise at the same time than when going to a gym



##################################################End of Data Analysis#########################################################