# **Cardio Good Fitness Case Study**

## Problem Statement

We have data on individuals who purchased a treadmill at a CardioGood Fitness retail store over a three-month period. The task is to identify the profile of the typical customer for each treadmill product offered by CardioGood Fitness, and to explore how customer characteristics differ across the product lines.

### Dataset
The columns of the dataset are:
1. Product - product purchased, TM195, TM498, or TM798 
2. Gender - Male or Female
3. Age - in years
4. Education - in years
5. MaritalStatus - relationship status, single or partnered
6. Income - annual household income ($)
7. Usage - average number of times the customer plans to use the treadmill each week
8. Miles - average number of miles the customer expects to walk/run each week
9. Fitness - self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.

In [None]:
# Import libraries
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (9, 5)
sns.set_style('darkgrid')

In [None]:
# Load dataset
df = pd.read_csv("../input/cardiogoodfitness/CardioGoodFitness.csv")

## Data Preparation

Real-world datasets normally require preparation and cleaning before any analysis. As a first step, we will take a quick look at the data as a pandas dataframe.

In [None]:
# View the data as dataframe
df.head(5)

In [None]:
# Check column dtypes and non-null counts
df.info()

The dataset has 180 rows and 9 columns, and no column contains null values.

In [None]:
# Count the unique values in each column
df.nunique()

There are 3 product categories, and 2 categories each in the 'Gender' and 'MaritalStatus' columns. Next we look at the numeric columns.

In [None]:
# To check numeric value columns
df.describe()

Looking at the min, max, mean, etc. of the numeric columns, the values fall in plausible ranges, so we conclude that the dataset is clean and needs no further cleaning.
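The visual check above can be backed by a few explicit tests. The following is a minimal sketch, assuming a small made-up dataframe (`toy`) with the same column names as the dataset; the helper `quality_report` and its `fitness_range` parameter are hypothetical names introduced for illustration only.

```python
import pandas as pd

# Hypothetical toy rows that mimic the dataset's columns; these are NOT real records.
toy = pd.DataFrame({
    "Product": ["TM195", "TM498", "TM798", "TM195"],
    "Age": [18, 25, 30, 18],
    "Income": [30000, 52000, 90000, 30000],
    "Fitness": [3, 3, 5, 3],
})

def quality_report(frame, fitness_range=(1, 5)):
    """Summarise a few basic data-quality checks in a dictionary."""
    low, high = fitness_range
    return {
        # Total number of missing cells across all columns
        "missing_values": int(frame.isna().sum().sum()),
        # Rows that exactly repeat an earlier row
        "duplicate_rows": int(frame.duplicated().sum()),
        # Fitness ratings outside the documented 1-to-5 scale
        "fitness_out_of_range": int((~frame["Fitness"].between(low, high)).sum()),
    }

report = quality_report(toy)
print(report)
```

A non-zero duplicate count is not necessarily an error here, since two different customers can share the same values, but it is worth knowing before drawing conclusions.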

## Exploratory Data Analysis

Since our interest lies in the products, we start by counting how many of each were purchased.

In [None]:
# Count plot
ax = sns.countplot(x = 'Product', data = df)
ax.set_title("Product Counts");

In [None]:
# Plotting percentages
plt.figure(figsize=(10,9))
plt.title('Percentage of Products Sold')
product_count = df['Product'].value_counts()
plt.pie(product_count, labels = product_count.index, autopct = '%1.2f%%');

Now we can look at the product purchase based on gender.

In [None]:
# count plot based on gender
plt.figure(figsize=(9,7))
ax = sns.countplot(x = 'Product', data = df, hue = 'Gender')
ax.set_title("Product Bought and Gender");

We can observe that female customers are less interested in TM798 than in the other two products.
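Raw counts can mislead when the products sell in different volumes, so it may help to look at the gender split within each product. The sketch below uses `pd.crosstab` with `normalize="index"` on a small made-up dataframe (`toy`); the rows are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would pass df["Product"] and df["Gender"].
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498",
                "TM798", "TM798", "TM798", "TM798"],
    "Gender": ["Male", "Female", "Male", "Female",
               "Male", "Male", "Male", "Female"],
})

# normalize="index" turns each row into proportions that sum to 1,
# i.e. the share of each gender among the buyers of one product.
share = pd.crosstab(toy["Product"], toy["Gender"], normalize="index")
print(share.round(2))
```

On the real data, replacing `toy` with `df` gives the female share per product directly, which makes the comparison independent of how many units each product sold.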

Checking the behaviour of customers based on their marital status.

In [None]:
# count plot based on marital status
ax = sns.countplot(x = 'Product', data = df, hue = 'MaritalStatus')
ax.set_title("Product bought and MaritalStatus");

No pattern is visible here. It seems marital status is not an important factor in selection of type of treadmill product. 

Next we check how planned usage per week influences the product purchased.

In [None]:
# count plot based on usage
ax = sns.countplot(x = 'Product', data = df, hue = 'Usage')
ax.set_title("Product bought and Usage per week");

It can be seen that customers expecting more usage per week prefer product TM798.

Now we can check the influence of fitness level of customers.

In [None]:
# count plot for fitness level
ax = sns.countplot(x = 'Product', data = df, hue = 'Fitness')
ax.set_title("Product bought and Fitness");

It is clear that people maintaining good fitness mostly prefer TM798.

Let's check the effect of the average miles customers plan to cover.

In [None]:
# histogram for miles expected to cover
ax = sns.histplot(data=df, x="Miles", hue="Product", kde = True)
ax.set_title("Product and Miles");

Here too we may infer that customers expecting to cover more miles are interested in TM798. There is also a slight difference between the miles distributions of customers preferring the other two products.

Let's also check the effect of customer income.

In [None]:
# histogram based on income
ax = sns.histplot(data=df, x="Income", hue="Product", kde = True)
ax.set_title("Product and Income");

We can see that high income earning customers prefer TM798.

Now we will check effect of age of customers on purchase of products.

In [None]:
# histogram based on age of customers
ax = sns.histplot(data=df, x="Age", hue="Product",  element = "poly")
ax.set_title("Product and Age");

The plot shows that younger customers make up the bulk of buyers for every product, and there is no significant pattern in product selection by age.

Finally we will look at the influence of education.

In [None]:
# histogram based on education
ax = sns.histplot(data=df, x="Education", hue="Product", multiple="stack")
ax.set_title("Product and Education");

We can see that people with more years of education tend to buy TM798.

We can identify many patterns from the plots above, but some of these variables may be associated with one another, so the picture needs more clarity. We will now check for such relationships.

In [None]:
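# heatmap of pairwise correlations between the numeric columns
# (numeric_only=True skips the text columns, which recent pandas versions refuse to correlate)
plt.figure(figsize=(9,9))
sns.heatmap(df.corr(numeric_only=True), square=True, linewidths=.5, annot=True, cbar=False);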

We can observe a correlation of 0.76 between Usage and Miles, and 0.79 between Fitness and Miles. One plausible reading is that fitter customers expect to run/walk more miles and to use the treadmill more often, while comparatively less fit customers expect lower usage at the time of purchase. Usage and Fitness are also correlated at 0.67. There is some association between income and education as well, as people with more years of education tend to have higher incomes.
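For readers who want to see what the heatmap values are, the sketch below computes the Pearson correlation coefficient by hand and compares it with pandas. It assumes a small made-up dataframe (`toy`); the `pearson` helper is a hypothetical name, and the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; NOT values from the dataset.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Usage": [2, 3, 3, 4, 5, 6],
    "Miles": [40, 66, 85, 100, 150, 200],
})

# Keep only numeric columns before correlating; text columns such as
# Product have no meaningful Pearson correlation.
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

def pearson(x, y):
    """Pearson r: covariance of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

manual = pearson(toy["Usage"], toy["Miles"])
print(manual, corr.loc["Usage", "Miles"])
```

The two printed numbers should agree up to floating-point error. Note that a high coefficient only indicates a linear association and says nothing about which variable drives the other.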

Now we will also have a look at age - income relation and the product selection.

In [None]:
# scatter plot age and income
ax = sns.scatterplot(y = 'Income', x = 'Age', data = df, hue = 'Product')
ax.set_title("Age, Income and Product Selection");

Customers with higher incomes prefer TM798 across all ages. Every product is purchased by customers of all ages, except that the youngest customers do not purchase TM798; since their incomes are also lower, we cannot treat that as an age pattern.

It will be helpful to look at the product-specific central tendency and dispersion of the variables.

In [None]:
# describe TM195 product category
print("For Product TM195")
df[df['Product'] == 'TM195'].describe()

In [None]:
# describe TM498 product category
print("For Product TM498")
df[df['Product'] == 'TM498'].describe()

In [None]:
# describe TM798 product category
print("For Product TM798")
df[df['Product'] == 'TM798'].describe()

Looking closely, values such as the mean, min, max, and standard deviation of the variables are in nearly the same range for TM195 and TM498, and noticeably different from those of TM798.
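The three separate `describe()` calls can also be collapsed into one table, which makes the side-by-side comparison easier. This is a minimal sketch using `groupby` with `agg` on a small made-up dataframe (`toy`); the rows are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; NOT values from the dataset.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM798", "TM798"],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Income": [30000, 40000, 80000, 100000],
    "Miles": [50, 70, 150, 200],
})

# One row per product, one (column, statistic) pair per output column.
summary = toy.groupby("Product")[["Income", "Miles"]].agg(["mean", "min", "max"])
print(summary)
```

On the real data, `df.groupby("Product").describe()` would give the full set of statistics in the same one-row-per-product layout.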

## Asking Questions

We will ask a few questions that can help identify customer profiles.

### 1. Which product is preferred by customers having poor to average fitness? 

In [None]:
plt.figure(figsize=(10,9))
less_fit_count = df[(df['Fitness'] <= 3)].Product.value_counts()
plt.title("Product preferred by customers with poor to average fitness")
plt.pie(less_fit_count, labels = less_fit_count.index, autopct = "%1.2f%%");

Products TM195 and TM498 are preferred by customers having fitness less than or equal to 3 in general.

### 2. Which product is preferred by customers who expect to cover fewer than 100 miles?

In [None]:
plt.figure(figsize=(10,9))
less_miles_count = df[(df['Miles'] < 100)].Product.value_counts()
plt.title("Product preferred by customers targeting fewer than 100 miles")
plt.pie(less_miles_count, labels = less_miles_count.index, autopct = "%1.2f%%");

Customers looking to work out for fewer than 100 miles mostly buy TM195 or TM498.

### 3. Which product do customers with an annual income below 45000 dollars prefer?

In [None]:
plt.figure(figsize=(10,9))
# count products bought by customers earning less than 45000 dollars a year
low_income_count = df[(df['Income'] < 45000)].Product.value_counts()
plt.title("Product preferred by customers with annual income below $45000")
plt.pie(low_income_count, labels = low_income_count.index, autopct = "%1.2f%%");

Customers with an annual income below 45000 dollars prefer to buy TM195 or TM498.

### 4. What share of each product's buyers are mid-income customers (45000 - 70000)?

In [None]:
mid_income_perc = df[(df['Income'] < 70000) & ( df['Income'] > 45000)].Product.value_counts()*100/df.Product.value_counts()
plt.title('Percentage of each product\'s buyers who are mid-income customers')
sns.barplot(x = mid_income_perc, y = mid_income_perc.index);

We can observe that approximately 75% of the customers who bought TM498 are mid-income customers.

### 5. What share of each product's buyers plan an average workout (80 - 150 miles)?
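The percentage above is obtained by dividing one `value_counts()` result by another, which relies on pandas aligning the two Series by product label. One caveat: a product with no customers in the band is absent from the numerator, so the division yields `NaN` rather than 0. The sketch below shows this on a small made-up dataframe (`toy`) and an alternative that takes the mean of a boolean mask per product; the rows are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; NOT values from the dataset.
toy = pd.DataFrame({
    "Product": ["TM195"] * 4 + ["TM498"] * 4 + ["TM798"] * 2,
    "Income": [30000, 40000, 50000, 60000,
               46000, 50000, 65000, 72000,
               90000, 100000],
})

# True for incomes strictly between 45000 and 70000 (both ends excluded,
# matching the > and < comparisons used in the notebook).
in_band = toy["Income"].between(45000, 70000, inclusive="neither")

# Ratio of two value_counts: labels are aligned, missing labels become NaN.
ratio = toy.loc[in_band, "Product"].value_counts() * 100 / toy["Product"].value_counts()

# Mean of a boolean mask per product: the share of True values, 0 when none.
share_pct = in_band.groupby(toy["Product"]).mean() * 100

print(ratio)
print(share_pct)
```

Both approaches agree wherever the band is non-empty; the mask-based version simply reports 0% instead of a missing value, which plots more predictably.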

In [None]:
avg_workout_perc = df[(df['Miles'] < 150) & (df['Miles'] > 80)].Product.value_counts()*100 / df.Product.value_counts()
plt.title('Percentage of each product\'s buyers who plan an average workout')
sns.barplot(x = avg_workout_perc, y = avg_workout_perc.index);

About 60% of the customers who bought TM498 are aiming for a medium workout (80 - 150 miles per week), whereas the share is about 50% for TM195 and 30% for TM798.

### 6. How do the income, usage, miles, etc. of females who bought TM798 differ from those of all female customers?

In [None]:
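# numeric_only=True restricts the mean to numeric columns;
# recent pandas versions raise an error on text columns otherwise
print("All females - variable mean values")
print(df[df['Gender'] == 'Female'].mean(numeric_only=True))
print("Females who bought TM798 - variable mean values")
print(df[(df['Product'] == 'TM798') & (df['Gender'] == 'Female')].mean(numeric_only=True))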

It can be seen that females who bought TM798 have higher incomes, expect to run/walk more miles, and rate themselves as more fit than female customers overall.
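Printing the two sets of means one after the other makes the comparison hard to read. A possible alternative is to place them side by side with an explicit difference column. The sketch below assumes a small made-up dataframe (`toy`); the rows are illustrative only, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; NOT values from the dataset.
toy = pd.DataFrame({
    "Product": ["TM195", "TM498", "TM798", "TM798", "TM195"],
    "Gender": ["Female", "Female", "Female", "Male", "Male"],
    "Income": [40000, 50000, 90000, 95000, 45000],
    "Miles": [60, 80, 160, 200, 90],
})

females = toy[toy["Gender"] == "Female"]
tm798_females = females[females["Product"] == "TM798"]

# One row per variable, one column per group, plus the gap between them.
comparison = pd.DataFrame({
    "all_females": females.mean(numeric_only=True),
    "tm798_females": tm798_females.mean(numeric_only=True),
})
comparison["difference"] = comparison["tm798_females"] - comparison["all_females"]
print(comparison)
```

Keep in mind that the "all females" group includes the TM798 buyers themselves, so the difference understates the gap between TM798 buyers and everyone else.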

## Conclusions

1. Customers who have very good fitness, expect to use the treadmill more often each week, or expect to cover more miles are interested in buying TM798.
2. Customers with higher annual incomes prefer to buy TM798 regardless of their age and other characteristics.
3. Marital status, age and gender of customers did not turn out to be strong influencing factors for product selection, apart from the lower interest of female customers in TM798 noted earlier.
4. Association between Usage, Fitness and Miles is observed.
5. Although TM195 and TM498 have similar customer bases, a clear majority of the people who bought TM498 are mid-income, and a majority expect to do a medium workout.

### Customer Profile for Treadmill Products

Based on the analysis, likely customer profiles for the three products are shown below.

### TM195

1. Gender - Male or Female
2. Age - 18 to 55
3. Marital Status - Single or Partnered
4. Education - 12 to 20 years
5. Income - 30000 to 70000 ($)
6. Usage - 2 to 5 times per week
7. Miles - 20 to 200 miles per week 
8. Fitness - 1 to 5 

### TM498

1. Gender - Male or Female
2. Age - 18 to 55
3. Marital Status - Single or Partnered
4. Education - 12 to 20 years
5. Income - 30000 to 80000 ($)
6. Usage - 2 to 4 times per week
7. Miles - 50 to 250 miles per week 
8. Fitness - 1 to 5 


### TM798

1. Gender - Male or Female
2. Age - 18 to 55
3. Marital Status - Single or Partnered
4. Education - 12 to 22 years
5. Income - 50000 to 110000 ($)
6. Usage - 3 to 7 times per week
7. Miles - 100 to 400 miles per week 
8. Fitness - 3 to 5 
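Ranges like those in the profiles above could be read off the data programmatically rather than by eye. The following is a rough sketch, assuming the ranges are taken as the per-product minimum and maximum (the notebook does not state how its ranges were chosen); `profile_ranges` is a hypothetical helper and `toy` is made-up data.

```python
import pandas as pd

# Hypothetical toy rows; NOT values from the dataset.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM798", "TM798"],
    "Usage": [2, 5, 3, 7],
    "Fitness": [1, 5, 3, 5],
})

def profile_ranges(frame, columns):
    """Return {product: {column: 'min to max'}} for the given numeric columns."""
    grouped = frame.groupby("Product")[columns]
    lows = grouped.min()
    highs = grouped.max()
    return {
        product: {
            col: f"{lows.loc[product, col]} to {highs.loc[product, col]}"
            for col in columns
        }
        for product in lows.index
    }

profiles = profile_ranges(toy, ["Usage", "Fitness"])
for product, ranges in profiles.items():
    print(product, ranges)
```

Minimum and maximum are sensitive to single outliers, so quantile-based bounds (for example via `quantile`) may describe a "typical" customer better; that choice is left open here.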