# Cardio Good Fitness - Analysis of Customer Demographics

Analyst: Jordan Rich

Kaggle ID: JordanRich

# Project Objective:

- Preliminary Data Analysis. Explore the dataset and extract basic observations about the data. 

- Come up with a customer profile (characteristics of a customer) for the different products
- Perform uni-variate and multi-variate analyses
- Generate a set of insights and recommendations that will help the company in targeting new customers

# Context:

The data is for customers of the treadmill product(s) of a retail store called Cardio Good Fitness. It contains the following variables:

- Product - the model no. of the treadmill

- Age - in no of years, of the customer

- Gender - of the customer

- Education - in no. of years, of the customer

- Marital Status - of the customer

- Usage - Avg. # times the customer wants to use the treadmill every week

- Fitness - Self rated fitness score of the customer (5 - very fit, 1 - very unfit)

- Income - of the customer

- Miles - number of miles the customer expects to run


### Import packages

In [None]:
#import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#configure jupyter to allow each cell to display multiple outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

#ignore warnings
import warnings
warnings.filterwarnings('ignore')

### Load data and display sample

In [None]:
#loads CardioGoodFitness dataset and displays 10 sampled rows
dataset = pd.read_csv('../input/cardiogoodfitness/CardioGoodFitness.csv')
dataset.sample(10)

### Check structure of dataset

In [None]:
#check dataset
dataset.info()

Observations:
1. The Cardio Good Fitness dataset contains 9 features.
2. Each feature contains 180 entries.
3. There are no null values.
4. Six of the features are numerical and three are objects (categorical).
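
The structural observations above can also be checked in code rather than read off the `info()` printout. The following is a minimal sketch; `toy` is a hypothetical stand-in for the real dataset (its values are illustrative only), and the variable names are assumptions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    "Product": ["TM195", "TM498", "TM798", "TM195"],
    "Age": [18, 25, 40, 33],
    "Gender": ["Male", "Female", "Male", "Female"],
    "Income": [29562, 50028, 90886, 46617],
})

n_rows, n_features = toy.shape
n_missing = int(toy.isnull().sum().sum())  # total number of missing cells
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(f"{n_features} features, {n_rows} rows, {n_missing} missing values")
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Swapping `toy` for `dataset` should give the same kind of summary for the full data.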

### Check quantitative data stats

In [None]:
#check distribution of numerical data
dataset.describe()

Observations:
1. Mean Customer age is about 29 years old, and customer ages concentrate between 24 and 33.
2. Mean Customer education is approximately 16 years (Bachelor's Degree), with the IQR ranging between 14-16 years.
3. Majority of customers expect to use the treadmill 3 or 4 days/week.
4. Majority of customers rate themselves to be in average to fit physical fitness level.
5. Majority of customers have annual incomes between approx. 45k-60k.
6. Customers typically expect to run between 66 and 115 miles on their treadmill (the interquartile range), with the average customer expecting to run 103 miles.
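
The "concentrate between" ranges quoted above correspond to the 25% and 75% rows of `describe()`. As a small sketch of how such a range can be computed directly, assuming a hypothetical `ages` series (not the real data):

```python
import pandas as pd

# Hypothetical ages, chosen only to illustrate the calculation.
ages = pd.Series([18, 22, 24, 26, 28, 30, 33, 38, 50], name="Age")

q1 = ages.quantile(0.25)  # lower quartile
q3 = ages.quantile(0.75)  # upper quartile
iqr = q3 - q1             # width of the middle 50% of values

print(f"middle 50% of ages lie between {q1} and {q3} (IQR = {iqr})")
```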

### Check skew of quantitative variables

In [None]:
#check skew
dataset.skew(numeric_only=True)

Observations:
1. Age, Education, and Usage are moderately positively skewed (0.5 < x < 1.0).
2. Income and Miles are highly positively skewed (x > 1.0).
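
The thresholds used in these observations (0.5 and 1.0) can be applied automatically. The sketch below is one possible way to do that; `skew_label` is a hypothetical helper and `toy` is made-up data, neither of which appears in the original notebook.

```python
import pandas as pd

def skew_label(value):
    """Bucket a skewness coefficient using the 0.5 and 1.0 thresholds quoted above."""
    size = abs(value)
    if size > 1.0:
        strength = "highly"
    elif size > 0.5:
        strength = "moderately"
    else:
        return "approximately symmetric"
    direction = "positively" if value > 0 else "negatively"
    return f"{strength} {direction} skewed"

# Hypothetical data: one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "long_tail": [1, 1, 1, 2, 20],
})

labels = toy.skew(numeric_only=True).apply(skew_label)
print(labels)
```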

### Create figure to display spread of data

In [None]:
#create counts of data for plotting categorical variables
products = np.unique(dataset['Product'], return_counts=True)
gender = np.unique(dataset['Gender'], return_counts=True)
marital_stat = np.unique(dataset['MaritalStatus'], return_counts=True)
print('products = {}\n'.format(products))
print('gender = {}\n'.format(gender))
print('marital_stat = {}\n'.format(marital_stat))

In [None]:
# create fontdicts for formatting figure text
axtitle_dict = {'family': 'serif',
        'color':  'darkred',
        'weight': 'bold',
        'size': 16
        }

axlab_dict = {'family': 'serif',
              'color': 'black',
              'size': 14
              }

#create figure with 3 x 3 grid of subplots
fig = plt.figure(figsize=[16,12])
fig.suptitle("SPREAD OF DATA", fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
fig.subplots_adjust(hspace=0.5, wspace=0.4)

#load plots into subplots, and set plot parameters
ax0 = fig.add_subplot(3, 3, 1)
sns.barplot(x=list(products[0]), y=list(products[1]), ax=ax0, color='teal')
ax0.text(0.3, 70, '{}%' .format(str(round(products[1][0]/sum(products[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
ax0.text(1.3, 50, '{}%' .format(str(round(products[1][1]/sum(products[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
ax0.text(2.3, 30, '{}%' .format(str(round(products[1][2]/sum(products[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
ax0.set_title('Model No.', fontdict=axtitle_dict)
ax0.set_xlabel('Categorical', fontdict=axlab_dict)
ax0.set_ylabel('Count', fontdict=axlab_dict)

ax1 = fig.add_subplot(3, 3, 2)
sns.distplot(dataset['Age'], ax=ax1, color='dodgerblue');
ax1.axvline(dataset['Age'].quantile(q=0.25),color='green',linestyle='--',label='25% Quartile')
ax1.axvline(dataset['Age'].mean(),color='red',linestyle='--',label='Mean')
ax1.axvline(dataset['Age'].median(),color='black',linestyle='--',label='Median')
ax1.axvline(dataset['Age'].quantile(q=0.75),color='blue',linestyle='--',label='75% Quartile')
ax1.text(58, 0.04, 'skewness: {}' .format(str(round(dataset['Age'].skew(),3))), ha='right', va='center', size=11)
ax1.set_title('Age', fontdict=axtitle_dict)
ax1.set_xlabel('Age [yrs]', fontdict=axlab_dict)
ax1.set_ylabel('Probability per Unit', fontdict=axlab_dict)
ax1.legend(fontsize=11)

ax2 = fig.add_subplot(3, 3, 3)
sns.barplot(x=list(gender[0]), y=list(gender[1]), ax=ax2, color='coral')
ax2.text(0.2, 62, '{}%' .format(str(round(gender[1][0]/sum(gender[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
ax2.text(1.2, 90, '{}%' .format(str(round(gender[1][1]/sum(gender[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
ax2.set_title('Gender', fontdict=axtitle_dict)
ax2.set_xlabel('Gender', fontdict=axlab_dict)
ax2.set_ylabel('Count', fontdict=axlab_dict)

ax3 = fig.add_subplot(3, 3, 4)
sns.countplot(x=dataset['Education'], ax=ax3, color='limegreen')
ax3.set_title('Education (in years)', fontdict=axtitle_dict)
ax3.set_xlabel('Integer Categorical', fontdict=axlab_dict)
ax3.set_ylabel('Count', fontdict=axlab_dict)

ax4 = fig.add_subplot(3, 3, 5)
sns.barplot(x=list(marital_stat[0]), y=list(marital_stat[1]), ax=ax4, color='orchid')
ax4.text(0.2, 92, '{}%' .format(str(round(marital_stat[1][0]/sum(marital_stat[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
ax4.text(1.2, 58, '{}%' .format(str(round(marital_stat[1][1]/sum(marital_stat[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
ax4.set_title('Marital Status', fontdict=axtitle_dict)
ax4.set_xlabel('Categorical', fontdict=axlab_dict)
ax4.set_ylabel('Count', fontdict=axlab_dict)

ax5 = fig.add_subplot(3, 3, 6)
sns.countplot(x=dataset['Usage'], ax=ax5, color='gold')
ax5.set_title('Expected Usage (freq.)', fontdict=axtitle_dict)
ax5.set_xlabel('Integer Categorical', fontdict=axlab_dict)
ax5.set_ylabel('Count', fontdict=axlab_dict)

ax6 = fig.add_subplot(3, 3, 7)
sns.countplot(x=dataset['Fitness'], ax=ax6, color='tomato')
ax6.set_title('Fitness (Self-rated)', fontdict=axtitle_dict)
ax6.set_xlabel('Integer Categorical', fontdict=axlab_dict)
ax6.set_ylabel('Count', fontdict=axlab_dict)

ax7 = fig.add_subplot(3, 3, 8)
sns.distplot(dataset['Income'], ax=ax7, color='slateblue')
ax7.axvline(dataset['Income'].quantile(q=0.25),color='green',linestyle='--',label='25% Quartile')
ax7.axvline(dataset['Income'].mean(),color='red',linestyle='--',label='Mean')
ax7.axvline(dataset['Income'].median(),color='black',linestyle='--',label='Median')
ax7.axvline(dataset['Income'].quantile(q=0.75),color='blue',linestyle='--',label='75% Quartile')
ax7.text(118000, 1.8e-5, 'skewness: {}' .format(str(round(dataset['Income'].skew(),3))), ha='right', va='center', size=11)
ax7.set_title('Income', fontdict=axtitle_dict)
ax7.set_xlabel('Annual Income [$USD]', fontdict=axlab_dict)
ax7.set_ylabel('Probability per Unit', fontdict=axlab_dict)
ax7.legend(fontsize=11)

ax8 = fig.add_subplot(3, 3, 9)
sns.distplot(dataset['Miles'], ax=ax8, color='peru');
ax8.axvline(dataset['Miles'].quantile(q=0.25),color='green',linestyle='--',label='25% Quartile')
ax8.axvline(dataset['Miles'].mean(),color='red',linestyle='--',label='Mean')
ax8.axvline(dataset['Miles'].median(),color='black',linestyle='--',label='Median')
ax8.axvline(dataset['Miles'].quantile(q=0.75),color='blue',linestyle='--',label='75% Quartile')
ax8.text(400, 0.006, 'skewness: {}' .format(str(round(dataset['Miles'].skew(),3))), ha='right', va='center', size=11)
ax8.set_title('Expected Miles', fontdict=axtitle_dict)
ax8.set_xlabel('Distance [miles]', fontdict=axlab_dict)
ax8.set_ylabel('Probability per Unit', fontdict=axlab_dict)
ax8.legend(fontsize=11)

fig.show();

Observations:

1. TM195 was purchased more than TM498, and TM498 was purchased more than TM798.
2. Customer ages appear to be concentrated at the younger end.
3. More males (57.8%) purchased treadmills than females (42.2%).
4. Majority of customers have either some college or a bachelor's degree.
5. Majority of customers are partnered (59.4%) rather than single (40.6%).
6. Majority of customers expect to use the treadmill 3-4 days per week.
7. Majority of customers self-rated their fitness as average.
8. Customer incomes appear to be skewed toward lower annual pay.
9. Majority of customers expect to run less than 100 miles on the treadmill.
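
The percentage labels on the bar charts above are computed by hand from `np.unique` counts. A shorter route, sketched here under the assumption that a pandas Series is available, is `value_counts(normalize=True)`. The counts below are hypothetical and were chosen only so that they reproduce the 57.8% / 42.2% split quoted in the observations.

```python
import pandas as pd

# Hypothetical counts chosen to reproduce the split quoted above (180 rows in total).
gender = pd.Series(["Male"] * 104 + ["Female"] * 76, name="Gender")

# Relative frequency of each category, expressed as a percentage.
shares = gender.value_counts(normalize=True).mul(100).round(1)
print(shares)
```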

### Look at outliers

In [None]:
#create fontdict for axis labels
axlab2 = {'family': 'serif',
              'color': 'black',
              'weight': 'bold',
              'size': 16
         }
#create subplot layout
fig = plt.figure(figsize=[10,10]);
grid = plt.GridSpec(6, 1, wspace=0.3, hspace=1.2);
x = ['Age', 'Education', 'Usage', 'Fitness', 'Income', 'Miles'];
col = ['forestgreen','dodgerblue','goldenrod', 'coral', 'burlywood','thistle'];

#loop to populate boxplots within subplots (one row per feature)
for i, (feature, colour) in enumerate(zip(x, col)):
    ax = plt.subplot(grid[i, 0])
    mean, std = dataset[feature].mean(), dataset[feature].std()
    sns.boxplot(x=dataset[feature], ax=ax, color=colour)
    ax.set_title(feature, fontdict=axlab2)
    ax.set_xlabel("", fontdict=axlab2)
    #mark the mean and the 2-sigma / 3-sigma bounds (lower bounds clipped at 0)
    a = ax.axvline(mean, color="red", linestyle="--", label="mean")
    b = ax.axvline(mean + 3 * std, color="orange", linestyle="--", label="3sigma")
    ax.axvline(max(mean - 3 * std, 0), color="orange", linestyle="--")
    c = ax.axvline(mean + 2 * std, color="slategrey", linestyle="--", label="2sigma")
    ax.axvline(max(mean - 2 * std, 0), color="slategrey", linestyle="--")
    if i == 0:
        legend_handles = [a, c, b]  #line handles from the first subplot
    plt.xticks(fontsize=14)

plt.legend(legend_handles, ['mean','2sigma','3sigma'], loc='upper center', bbox_to_anchor=(0.9, 14), fontsize=14)
fig.show();

In [None]:
#create fontdict for axis labels
axlab2 = {'family': 'serif',
              'color': 'black',
              'weight': 'bold',
              'size': 16
         }
#create subplot layout
fig = plt.figure(figsize=[20,10]);
grid = plt.GridSpec(6, 3, wspace=0.1, hspace=1.2);
x = ['Age', 'Education', 'Usage', 'Fitness', 'Income', 'Miles'];
col = ['forestgreen','dodgerblue','goldenrod', 'coral', 'burlywood','thistle'];
prod = ['TM195', 'TM498', 'TM798']

#loop to populate boxplots within subplots (one row per feature, one column per product)
for i, (feature, colour) in enumerate(zip(x, col)):
    for j, model in enumerate(prod):
        ax = plt.subplot(grid[i, j])
        values = dataset.loc[dataset["Product"] == model, feature]
        mean, std = values.mean(), values.std()
        sns.boxplot(x=values, ax=ax, color=colour)
        ax.set_title(feature + ' - ' + model, fontdict=axlab2)
        ax.set_xlabel("", fontdict=axlab2)
        #mark the mean and the 2-sigma / 3-sigma bounds (lower bounds clipped at 0)
        a = ax.axvline(mean, color="red", linestyle="--", label="mean")
        b = ax.axvline(mean + 3 * std, color="orange", linestyle="--", label="3sigma")
        ax.axvline(max(mean - 3 * std, 0), color="orange", linestyle="--")
        c = ax.axvline(mean + 2 * std, color="slategrey", linestyle="--", label="2sigma")
        ax.axvline(max(mean - 2 * std, 0), color="slategrey", linestyle="--")
        if i == 0 and j == 0:
            legend_handles = [a, c, b]  #line handles from the first subplot
        plt.xticks(fontsize=14)

plt.legend(legend_handles, ['mean','2sigma','3sigma'], loc='upper center', bbox_to_anchor=(0.9, 14), fontsize=14)
fig.show();

In [None]:
#plot Age - TM798 boxplot
fig = plt.figure(figsize=[16,4])
sns.boxplot(x=dataset.loc[dataset["Product"] == 'TM798', 'Age'], color='forestgreen');
plt.title('Age - TM798', fontdict=axlab2);
plt.xlabel("", fontdict=axlab2);
plt.axvline(dataset.loc[dataset["Product"] == 'TM798', 'Age'].mean(),color= "red", linestyle="--", label="mean")
plt.axvline(dataset.loc[dataset["Product"] == 'TM798', 'Age'].mean()+ 3 * dataset.loc[dataset["Product"] == 'TM798', 'Age'].std(),color= "orange", linestyle="--", label="3sigma")
plt.axvline(max([dataset.loc[dataset["Product"] == 'TM798', 'Age'].mean()- 3 * dataset.loc[dataset["Product"] == 'TM798', 'Age'].std(), 0]),color= "orange", linestyle="--")
plt.axvline(dataset.loc[dataset["Product"] == 'TM798', 'Age'].mean()+ 2 * dataset.loc[dataset["Product"] == 'TM798', 'Age'].std(),color= "slategrey", linestyle="--", label="2sigma")
plt.axvline(max([dataset.loc[dataset["Product"] == 'TM798', 'Age'].mean()- 2 * dataset.loc[dataset["Product"] == 'TM798', 'Age'].std(), 0]),color= "slategrey", linestyle="--")
plt.xticks(fontsize=14);
plt.legend(loc='upper center', bbox_to_anchor=(0.88, 1), fontsize=14);       

Observations:

Looking at the overall data, there appear to be some outliers in Income and Miles. However, after looking at the spread by product, these outliers do not appear to be significant.

There appear to be some significant outliers in Age for TM798, which more than likely indicates some untapped sales potential among customers in their 40s and older.
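
The boxplots mark the 2-sigma and 3-sigma bounds visually. To put a number on how many points fall outside them, a per-product count can be sketched as follows. `sigma_outliers` is a hypothetical helper and `toy` is made-up data (products "A" and "B" are placeholders); this illustrates the k-sigma rule rather than reproducing the notebook's results.

```python
import pandas as pd

def sigma_outliers(values, k=3.0):
    """Return the entries lying more than k sample standard deviations from the mean."""
    mean, std = values.mean(), values.std()
    return values[(values - mean).abs() > k * std]

# Hypothetical data: product "A" has one customer far older than the rest.
toy = pd.DataFrame({
    "Product": ["A"] * 12 + ["B"] * 12,
    "Age": [25] * 11 + [60] + [24, 25, 26] * 4,
})

# Number of 3-sigma outliers in Age, computed separately for each product.
counts = toy.groupby("Product")["Age"].apply(lambda s: len(sigma_outliers(s, k=3.0)))
print(counts)
```

Note that in small groups a single extreme value inflates the standard deviation, so the 3-sigma rule can miss outliers that a boxplot's whiskers would flag.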

### Perform Analysis of Correlation

In [None]:
#create dummies of categorical features so that correlation may be analyzed
dum_dataset= pd.get_dummies(dataset, prefix='Prod', columns=['Product'])
dum_dataset= pd.get_dummies(dum_dataset, prefix='Mar', columns=['MaritalStatus'])
dum_dataset= pd.get_dummies(dum_dataset, prefix='Gen', columns=['Gender'])
dum_dataset.head(10)

In [None]:
#plot correlation matrix heatmap
fig, ax = plt.subplots(figsize=[13,10])
sns.heatmap(dum_dataset.corr(), ax=ax,  annot=True, linewidths=0.05, fmt= '.2f',cmap="RdBu")
ax.tick_params(axis='both', which='major', labelsize=14)
ax.set_title('Dataset Correlation Matrix', fontdict={'family': 'serif', 'color': 'black', 'size': 18, 'weight': 'bold'})
fig.show();

In [None]:
#find which features have significant correlation (>0.5) 
blah = dum_dataset.corr()>0.5

#disregard identity
for i in np.arange(0,len(blah)):
    blah.iloc[i,i] = False

#create table of correlation relationships by index values
corr_val = []
for i in np.arange(0,len(blah.iloc[0,:])):
    for j in np.arange(0,len(blah.iloc[:,0])):
        if blah.iloc[i,j] == True:
            corr_val.append([blah.index.values[j], blah.columns.values[i]])

#drop rows in table in which relationships are duplicated in table (x vs. y and y vs. x)            
x = []
for i in np.arange(0,len(corr_val)):
    x.append(str(i))
corr_val = pd.DataFrame(corr_val, columns=["",""], index=x)
del blah, x
for i in corr_val.index:
    for j in corr_val.index:
        if any(corr_val.index == i) == True:
            a, b = corr_val.loc[i]
            if i != j:
                if np.logical_and((corr_val.loc[j][0] == b) == True, (corr_val.loc[j][1] == a) == True) == True:
                    corr_val.drop(j, inplace=True);
corr_val.reset_index(drop=True, inplace=True)

print('The following features in the data are significantly correlated (coef. > 0.5):\n {}' .format(corr_val))
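
The same list of correlated pairs can be obtained more compactly by masking the correlation matrix to its upper triangle, which removes both the diagonal and the mirrored duplicates in one step. This is an alternative sketch, not the approach used in the cell above; `correlated_pairs` is a hypothetical helper and `toy` is made-up data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(frame, threshold=0.5):
    """Return feature pairs whose correlation coefficient exceeds the threshold."""
    corr = frame.corr()
    # Keep only the strict upper triangle: each pair appears once, diagonal excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    # Masked (NaN) entries compare as False, so they never pass the filter.
    return pairs[pairs > threshold]

# Hypothetical numeric data: Usage and Miles move together, Noise does not.
toy = pd.DataFrame({
    "Usage": [2, 3, 3, 4, 5, 6],
    "Miles": [40, 60, 70, 90, 120, 150],
    "Noise": [5, 1, 4, 2, 5, 1],
})

result = correlated_pairs(toy)
print(result)
```

Applied to `dum_dataset`, the function would be called as `correlated_pairs(dum_dataset, threshold=0.5)`.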

### Scatter plots

In [None]:
sns.pairplot(dataset, corner=True);

### Create figures to investigate correlation between features

In [None]:
#create fontdict for axis labels
axlab2 = {'family': 'serif',
              'color': 'black',
              'weight': 'bold',
              'size': 16
         }

# create figure with 4 subplots
fig = plt.figure(figsize=[20,8])
fig.suptitle("Fitness, Usage, Miles, and Income are Highly Correlated Features of the Data", fontsize=20, fontweight='bold', color='darkred')
grid = plt.GridSpec(2, 3, wspace=0.3, hspace=0.4)

ax0 = plt.subplot(grid[0, 0])
sns.violinplot(data=dataset, x='Usage', y='Fitness', ax=ax0)
ax0.text(4.75, 0.5, 'Corr. Coef.: {}' .format(str(round(dataset.select_dtypes('number').corr().iloc[3,2],2))), ha='right', va='center', size=14)
ax0.set_xlabel('Expected Usage', fontdict=axlab2)
ax0.set_ylabel('Fitness (Self-rated)', fontdict=axlab2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax1 = plt.subplot(grid[0, 1])
sns.violinplot(data=dataset, x='Usage', y='Miles', ax=ax1)
ax1.text(4.75, 8, 'Corr. Coef.: {}' .format(str(round(dataset.select_dtypes('number').corr().iloc[5,2],2))), ha='right', va='center', size=14)
ax1.set_xlabel('Expected Usage', fontdict=axlab2)
ax1.set_ylabel('Expected Miles', fontdict=axlab2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax2 = plt.subplot(grid[:3, 2])
sns.violinplot(data=dataset, x='Fitness', y='Miles', ax=ax2)
ax2.text(3.75, -20, 'Corr. Coef.: {}' .format(str(round(dataset.select_dtypes('number').corr().iloc[5,3],2))), ha='right', va='center', size=14)
ax2.set_xlabel('Fitness (Self-rated)',fontdict=axlab2)
ax2.set_ylabel('Expected Miles', fontdict=axlab2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax3 = plt.subplot(grid[1, :2])
sns.pointplot(data=dataset, x='Usage', y='Income', ax=ax3, color='salmon')
ax3.text(5, 45000, 'Corr. Coef.: {}' .format(str(round(dataset.select_dtypes('number').corr().iloc[2,4],2))), ha='right', va='center', size=14)
ax3.set_xlabel('Expected Usage',fontdict=axlab2)
ax3.set_ylabel('Annual Income [USD]', fontdict=axlab2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

fig.show();

Observations:

1. Customers who expect to use their treadmill for more days/week tended to rate their fitness level higher.
2. Customers who expect to use their treadmill for more days/week also tended to expect to run more miles.
3. Customers who expect to use their treadmill for more days/week tended to make more money per year.
4. Customers who self-rated their fitness level higher tended to expect to run more miles.

In [None]:
#create boxplot of education by product
fig = plt.figure();
sns.boxplot(data=dataset, x='Product', y='Education');
plt.xlabel('Product', fontsize=14, fontweight='bold');
plt.ylabel('Education', fontsize=14, fontweight='bold');
plt.title('Education Level by Product', fontsize=16, fontweight='bold');

Observation:

Customers who bought the TM798 model tended to be more educated than customers who bought other models.

In [None]:
plt.figure();
sns.pointplot(data=dataset, x='Education', y='Income');

Observation:

As would be expected, customer annual incomes are positively correlated with education level.

In [None]:
#create barplot of total products sold
fig = plt.figure();
sns.barplot(x=list(products[0]), y=list(products[1]), palette=['dodgerblue','salmon','limegreen']);
plt.text(0.2, 70, '{}%' .format(str(round(products[1][0]/sum(products[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
plt.text(1.2, 50, '{}%' .format(str(round(products[1][1]/sum(products[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
plt.text(2.2, 30, '{}%' .format(str(round(products[1][2]/sum(products[1])*100,1))), ha='right', va='center', size=13, fontdict={'weight': 'bold'})
plt.xlabel('Product', fontsize=14, fontweight='bold');
plt.ylabel('Count', fontsize=14, fontweight='bold');
plt.title('Total Products Sold', fontsize=14, fontweight='bold');

In [None]:
#function to display percent alongside count
def spec(x):
    return '{:.1f}%\n({:.0f})'.format(x, np.sum(products[1])*x/100);

#create pie chart of products sold
fig = plt.figure(figsize=[5,5]);
patches, texts, autotexts = plt.pie(list(products[1]),labels=list(products[0]), autopct=spec, shadow=True, startangle=90);
texts[0].set_fontsize(14)
texts[1].set_fontsize(14)
texts[2].set_fontsize(14)
autotexts[0].set_fontsize(14)
autotexts[1].set_fontsize(14)
autotexts[2].set_fontsize(14)
plt.title('Total Products Sold', fontsize=16, fontweight='bold');

### Look at relationship between Product, MaritalStatus, and Gender

In [None]:
#Create bins for single and partnered per product, then create sub-bins for male customers within single or partnered
# Normalize data by Product for comparability across products
blah = np.zeros([3,2])
hah = np.zeros([3,2])
for i in np.arange(0,len(dataset['Product'])):
    if np.logical_and(dataset.loc[i, 'Product'] == 'TM195', dataset.loc[i, 'MaritalStatus'] == 'Single')==True:
        blah[0,0] += 1
        if dataset.loc[i, 'Gender'] == 'Male':
            hah[0,0] += 1
    elif np.logical_and(dataset.loc[i, 'Product'] == 'TM195', dataset.loc[i, 'MaritalStatus'] == 'Partnered')==True:
        blah[0,1] += 1
        if dataset.loc[i, 'Gender'] == 'Male':
            hah[0,1] += 1
    elif np.logical_and(dataset.loc[i, 'Product'] == 'TM498', dataset.loc[i, 'MaritalStatus'] == 'Single')==True:
        blah[1,0] += 1
        if dataset.loc[i, 'Gender'] == 'Male':
            hah[1,0] += 1
    elif np.logical_and(dataset.loc[i, 'Product'] == 'TM498', dataset.loc[i, 'MaritalStatus'] == 'Partnered')==True:
        blah[1,1] += 1
        if dataset.loc[i, 'Gender'] == 'Male':
            hah[1,1] += 1
    elif np.logical_and(dataset.loc[i, 'Product'] == 'TM798', dataset.loc[i, 'MaritalStatus'] == 'Single')==True:
        blah[2,0] += 1
        if dataset.loc[i, 'Gender'] == 'Male':
            hah[2,0] += 1
    elif np.logical_and(dataset.loc[i, 'Product'] == 'TM798', dataset.loc[i, 'MaritalStatus'] == 'Partnered')==True:
        blah[2,1] += 1
        if dataset.loc[i, 'Gender'] == 'Male':
            hah[2,1] += 1
ms_v_prod = pd.DataFrame(blah, columns=['Single','Partnered'], index=['TM195','TM498','TM798'])

#normalize data by product
blah = ms_v_prod.copy()
blah.iloc[0,:] = np.round(ms_v_prod.iloc[0,:]/ms_v_prod.iloc[0,:].sum(),2)
blah.iloc[1,:] = np.round(ms_v_prod.iloc[1,:]/ms_v_prod.iloc[1,:].sum(),2)
blah.iloc[2,:] = np.round(ms_v_prod.iloc[2,:]/ms_v_prod.iloc[2,:].sum(),2)
ms_v_prod1 = blah
ms_g_v_prod = pd.DataFrame(hah, columns=['S_male','P_male'], index=['TM195','TM498','TM798'])
hah = ms_g_v_prod.copy()
hah.iloc[0,:] = np.round(ms_g_v_prod.iloc[0,:]/ms_v_prod.iloc[0,:].sum(),2)
hah.iloc[1,:] = np.round(ms_g_v_prod.iloc[1,:]/ms_v_prod.iloc[1,:].sum(),2)
hah.iloc[2,:] = np.round(ms_g_v_prod.iloc[2,:]/ms_v_prod.iloc[2,:].sum(),2)
haha = hah.copy()
#for i in [0,1,2]:
#    for j in [0,1]:
#        haha.iloc[i,j] = np.round(hah.iloc[i,j]*blah.iloc[i,j],2)
ms_g_v_prod1 = haha

# create an overlayed and grouped barplot
fig = plt.figure(figsize=[12,10])
ax1 = fig.add_subplot(111)
ax2 = ax1.twiny()
barWidth = 0.25
 
# set height of bar
bars1 = list(ms_v_prod1['Single'])
bars2 = list(ms_v_prod1['Partnered'])
bars3 = list(ms_g_v_prod1['S_male'])
bars4 = list(ms_g_v_prod1['P_male'])
 
# Set position of bar on X axis
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1 + 0.05]
 
# Make the plot
ax1.bar(r1, bars1, color='salmon', width=barWidth, edgecolor='white', label='Female')
ax1.bar(r1, bars3, color='dodgerblue', width=barWidth, edgecolor='white', label='Male')
ax1.bar(r2, bars2, color='salmon', width=barWidth, edgecolor='white')
ax1.bar(r2, bars4, color='dodgerblue', width=barWidth, edgecolor='white')
ax1.axvline(0.65,color='grey',linestyle='--')
ax1.axvline(1.65,color='grey',linestyle='--')

# Label the plot and make it look good
ax1.set_ylabel('Normalized by Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_yticks(np.arange(0,0.8,0.1))
ax1.set_xticks([-0.09, 0.177, 0.92, 1.177, 1.92, 2.177])
ax1.set_xticklabels(['Single', 'Partnered','Single', 'Partnered','Single', 'Partnered'], fontdict={'color': 'black', 'size': 16, 'rotation': 45})
ax1.set_yticklabels(np.round(np.arange(0,0.8,0.1),1), fontdict={'color': 'black', 'size': 14})
ax1.legend(fontsize=14)
ax1.tick_params(axis = "x", which = "both", bottom = False, top = False)

ax2.set_xlim(ax1.get_xlim())
ax2.set_xticks([r + barWidth - 0.1 for r in range(len(bars1))])
ax2.set_xticklabels(['TM195','TM498','TM798'],fontdict={'color': 'black', 'size': 14})
ax2.tick_params(axis = "x", which = "both", bottom = False, top = False)
ax1.text(0.1, 0.77, 'Product:', ha='right', va='center', size=16, fontweight='bold')

fig.show();

Observations:
1. For all products, partnered customers tend to be more likely to buy a treadmill than single customers.
2. It appears that single females tend to be more likely to buy a TM498 than single males (this will be expanded upon below).
3. Males make up a large majority of TM798 customers.
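
The row-by-row counting loop in the cell above can also be expressed with `pd.crosstab`. The sketch below computes the same two kinds of table (marital-status share per product, and male share per product within each marital status) on a hypothetical `toy` frame; the data and variable names are illustrative assumptions.

```python
import pandas as pd

# Hypothetical customers; values are illustrative only.
toy = pd.DataFrame({
    "Product": ["TM195"] * 4 + ["TM798"] * 4,
    "MaritalStatus": ["Single", "Partnered", "Partnered", "Partnered",
                      "Single", "Single", "Partnered", "Partnered"],
    "Gender": ["Male", "Female", "Male", "Female",
               "Male", "Male", "Male", "Female"],
})

# Share of each marital status within a product (each row sums to 1).
status_share = pd.crosstab(toy["Product"], toy["MaritalStatus"], normalize="index")

# Share of each product's customers who are male and in a given marital status.
male = toy[toy["Gender"] == "Male"]
male_share = pd.crosstab(male["Product"], male["MaritalStatus"]).div(
    toy["Product"].value_counts(), axis=0
)

print(status_share)
print(male_share)
```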

In [None]:
#calculate fraction of males in each group in the above plot and then store in pandas dataframe
#normalize by product to allow comparability across products
nah = blah.copy()
for i in [0,1,2]:
    for j in [0,1]:
        nah.iloc[i,j] = np.round(haha.iloc[i,j]/blah.iloc[i,j],2)
frac_male = nah

#create grouped bar plot
fig = plt.figure(figsize=[10,8])
fig.suptitle("Percent Female Customers", fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
ax1 = fig.add_subplot(111)
barWidth = 0.25
 
# set height of bar
bars1 = list(1-frac_male['Single'])
bars2 = list(1-frac_male['Partnered'])
 
# Set position of bar on X axis
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1 + 0.05]
 
# Make the plot
ax1.bar(r1, bars1, color='mediumseagreen', width=barWidth, edgecolor='white', label='Single')
ax1.bar(r2, bars2, color='cornflowerblue', width=barWidth, edgecolor='white', label='Partnered')
ax1.axvline(0.65,color='grey',linestyle='--')
ax1.axvline(1.65,color='grey',linestyle='--')
ax1.set_ylabel('Normalized by Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xlabel('Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xticks([0.15, 1.15, 2.15])
ax1.set_yticks(np.arange(0,0.9,0.1))
ax1.tick_params(axis = "x", which = "both", bottom = False, top = False)
ax1.set_xticklabels(['TM195','TM498','TM798'],fontdict={'color': 'black', 'size': 14})
ax1.set_yticklabels(np.round(np.arange(0,0.9,0.1),1), fontdict={'color': 'black', 'size': 14})
fig.legend(loc='upper center', bbox_to_anchor=(0.25, 0.9), fontsize=14)
fig.show();

Observations:

1. Single males are more likely to purchase the TM195 than single females; partnered females are more likely to purchase the TM195 than partnered males.
2. Single females tend to be much more likely to purchase the TM498 than single males.
3. Partnered males tend to be much more likely to purchase the TM498 than partnered females.
4. Across marital status, males make up a large majority of customers for the TM798.
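
The fraction of female customers in each product and marital-status group can also be computed in a single `groupby`, since the mean of a boolean column is the share of `True` values. A minimal sketch on hypothetical data (`toy` and `female_frac` are illustrative names, not from the notebook):

```python
import pandas as pd

# Hypothetical customers; values are illustrative only.
toy = pd.DataFrame({
    "Product": ["TM195"] * 4 + ["TM798"] * 4,
    "MaritalStatus": ["Single", "Partnered", "Partnered", "Partnered",
                      "Single", "Single", "Partnered", "Partnered"],
    "Gender": ["Male", "Female", "Male", "Female",
               "Male", "Male", "Male", "Female"],
})

# Mean of the boolean "is female" flag within each (product, marital status) group.
female_frac = (
    toy["Gender"].eq("Female")
    .groupby([toy["Product"], toy["MaritalStatus"]])
    .mean()
    .unstack()  # products as rows, marital status as columns
)
print(female_frac.round(2))
```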

In [None]:
print('The percent of females as depicted in the above plot is as follows:\n\n{}'.format(1-frac_male))

In [None]:
#create fontdict for axis labels
axlab2 = {'family': 'serif',
              'color': 'black',
              'weight': 'bold',
              'size': 16
         }

# create figure with 6 subplots (3 x 2 grid)
fig = plt.figure(figsize=[10,10])
grid = plt.GridSpec(3, 2, wspace=0.2, hspace=0.2)

ax0 = plt.subplot(grid[0, 0])
sns.violinplot(data=dataset, x='Gender', y='Fitness', ax=ax0)
ax0.set_ylabel('Fitness', fontdict=axlab2)
ax0.set(xticklabels=[], xlabel=None)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax1 = plt.subplot(grid[0, 1])
sns.violinplot(data=dataset, x='MaritalStatus', y='Fitness', ax=ax1, palette=['forestgreen','firebrick'])
ax1.set(xticklabels=[], xlabel=None, ylabel=None)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax2 = plt.subplot(grid[1, 0])
sns.violinplot(data=dataset, x='Gender', y='Usage', ax=ax2)
ax2.set_ylabel('Expected Usage', fontdict=axlab2)
ax2.set(xticklabels=[], xlabel=None)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax3 = plt.subplot(grid[1, 1])
sns.violinplot(data=dataset, x='MaritalStatus', y='Usage', ax=ax3, palette=['forestgreen','firebrick'])
ax3.set(xticklabels=[], xlabel=None, ylabel=None)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax4 = plt.subplot(grid[2, 0])
sns.violinplot(data=dataset, x='Gender', y='Miles', ax=ax4)
ax4.set_xlabel('Gender', fontdict=axlab2)
ax4.set_ylabel('Expected Miles', fontdict=axlab2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax5 = plt.subplot(grid[2, 1])
sns.violinplot(data=dataset, x='MaritalStatus', y='Miles', ax=ax5, palette=['forestgreen','firebrick'])
ax5.set_xlabel('Marital Status', fontdict=axlab2)
ax5.set(ylabel=None)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.show();

Observations:

1. Customers who self-rated as very fit (>=5) were male and this is true regardless of marital status.
2. In general, customers who expect to use their treadmill more than 3 days per week tended to be single males or, with lower likelihood, partnered males.
3. There does not appear to be a significant difference in the expected miles run across gender or marital status.

In [None]:
#create fontdict for axis labels
axlab2 = {'family': 'serif',
              'color': 'black',
              'weight': 'bold',
              'size': 16
         }

# create figure with 3 subplots
fig = plt.figure(figsize=[14,8])
#fig.suptitle("Title", fontsize=18, fontweight='bold', color='darkred')
grid = plt.GridSpec(1, 5, wspace=0.2)

ax0 = plt.subplot(grid[0, 0:2])
sns.barplot(data=dataset, x='Product', y='Income', hue='Fitness', ax=ax0);
ax0.set_ylabel('Income', fontdict=axlab2)
ax0.set(ylim=(0, 120000))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax1 = plt.subplot(grid[0, 2:4])
sns.barplot(data=dataset, x='Product', y='Income', hue='Usage', ax=ax1);
ax1.set(yticklabels=[], ylabel=None)
ax1.set(ylim=(0, 120000))
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

ax2 = plt.subplot(grid[0, 4])
sns.violinplot(data=dataset, y='Income', ax=ax2, palette=['mediumaquamarine']);
ax2.set(ylim=(0, 120000))
ax2.yaxis.tick_right()
ax2.yaxis.set_label_position("right")
ax2.set_ylabel('Income', fontdict=axlab2)
plt.xticks(fontsize=14)
plt.yticks(fontsize=14)

plt.show();

Observations:

1. Higher income customers tended to purchase the TM798.
2. Lower income customers tended to purchase the TM498 or TM195.
3. Customers who rated their fitness level higher and/or expected to use the treadmill more days/week tended to purchase the TM798.

In [None]:
# create jointplot figure
fig = plt.figure(figsize=[14,8]);

a = sns.jointplot(data=dataset, x='Age', y='Miles', kind='hex');
a.set_axis_labels('Age', 'Miles', fontsize=16, fontweight='bold');
fig.show();

Observations:

1. This plot shows that two specific groups, approximately mid-20s and approximately mid-30s, make up the primary age groups of customers.
2. Customers in their mid-20s tend to expect to run between 50-100 miles.
3. Customers in their mid-30s tend to expect to run about 100 miles without much variability.
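
Readings taken from a hexbin plot can be cross-checked with a simple table. One way, sketched here with hypothetical numbers, is to bin ages with `pd.cut` and summarize the expected miles per band. The bin edges and labels below are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical ages and expected miles; values are illustrative only.
toy = pd.DataFrame({
    "Age":   [19, 23, 25, 26, 34, 35, 36, 45],
    "Miles": [60, 50, 75, 100, 95, 100, 105, 80],
})

# Right-closed bins: (0, 20], (20, 30], (30, 40], (40, 50].
bands = pd.cut(toy["Age"], bins=[0, 20, 30, 40, 50],
               labels=["<=20", "21-30", "31-40", "41-50"])

# Count of customers and range of expected miles within each age band.
summary = toy.groupby(bands, observed=True)["Miles"].agg(["count", "min", "max"])
print(summary)
```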

In [None]:
#create probability density plot of customer age by product
plt.figure(figsize=[10,8]);

sns.distplot(dataset['Age'].where(dataset['Product']=='TM195'), hist=None, label='TM195', kde_kws=dict(linewidth=5, color='royalblue'));
sns.distplot(dataset['Age'].where(dataset['Product']=='TM498'), hist=None, label='TM498', kde_kws=dict(linewidth=5, color='darkorange'));
sns.distplot(dataset['Age'].where(dataset['Product']=='TM798'), hist=None, label='TM798', kde_kws=dict(linewidth=5, color='forestgreen'));
#annotate with mean, median, and mode
plt.text(61, 0.06, '            Mean     Median    Mode\nTM195    {}          {}     {}\nTM498    {}          {}     {}\nTM798    {}          {}     {}'.format(round(dataset['Age'].where(dataset['Product']=='TM195').mean(),1),dataset['Age'].where(dataset['Product']=='TM195').median(),dataset['Age'].where(dataset['Product']=='TM195').mode()[0],round(dataset['Age'].where(dataset['Product']=='TM498').mean(),1),dataset['Age'].where(dataset['Product']=='TM498').median(),dataset['Age'].where(dataset['Product']=='TM498').mode()[0],round(dataset['Age'].where(dataset['Product']=='TM798').mean(),1),dataset['Age'].where(dataset['Product']=='TM798').median(),dataset['Age'].where(dataset['Product']=='TM798').mode()[0]), ha='right', va='center', size=12)

plt.legend(fontsize=14, );

plt.title('Customer Ages by Product', fontsize=16, fontweight='bold');
plt.xlabel('Customer Age', fontsize=14, fontweight='bold');
plt.ylabel('Percent Density per Unit', fontsize=14, fontweight='bold');

In [None]:
#bin age into groups with values normalized by product for comparability across products
products = ['TM195','TM498','TM798']
#each column is named after the upper edge of its age bin; the lower edge is exclusive
age_bins = {'20': (-np.inf, 20), '30': (20, 30), '40': (30, 40), '50': (40, 50)}
age_prod = pd.DataFrame(0.0, index=products, columns=list(age_bins))
for product in products:
    ages = dataset.loc[dataset['Product']==product, 'Age']
    for column, (low, high) in age_bins.items():
        #the mean of a boolean mask is the share of this product's customers in the bin
        age_prod.loc[product, column] = ((ages > low) & (ages <= high)).mean()

#create grouped bar plot
fig = plt.figure(figsize=[8,6])
fig.suptitle("Customer Age Group by Product", fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
ax1 = fig.add_subplot(111)
barWidth = 0.15
 
# set height of bar
bars1 = list(age_prod['20'])
bars2 = list(age_prod['30'])
bars3 = list(age_prod['40'])
bars4 = list(age_prod['50'])

 
# Set position of bar on X axis
r1 = np.arange(len(bars1)) 
r2 = [x + 0.05 + barWidth for x in r1]
r3 = [x + 0.05 + barWidth for x in r2]
r4 = [x + 0.05 + barWidth for x in r3]
 
# Make the plot
ax1.bar(r1, bars1, color='mediumseagreen', width=barWidth, edgecolor='white', label='Age <= 20')
ax1.bar(r2, bars2, color='cornflowerblue', width=barWidth, edgecolor='white', label='20 < Age <= 30')
ax1.bar(r3, bars3, color='salmon', width=barWidth, edgecolor='white', label="30 < Age <= 40")
ax1.bar(r4, bars4, color='orange', width=barWidth, edgecolor='white', label="40 < Age <= 50")
ax1.set_ylabel('Normalized by Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xlabel('Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xticks([0.3, 1.3, 2.3])
ax1.set_yticks(np.arange(0,1.1,0.1))
ax1.tick_params(axis = "x", which = "both", bottom = False, top = False)
ax1.set_xticklabels(['TM195','TM498','TM798'],fontdict={'color': 'black', 'size': 14})
ax1.set_yticklabels(np.round(np.arange(0,1.1,0.1),1), fontdict={'color': 'black', 'size': 14})
fig.legend(loc='upper center', bbox_to_anchor=(0.27, 0.93), fontsize=14)
fig.show();

Observations:

1. The majority of customers are in their 20s or 30s.
2. Customers in their 30s are more likely to purchase a TM498 or TM195 than a TM798.
3. Customers who purchased a TM798 were predominately in their 20s.
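The same per-product age shares can be sketched more compactly with `pd.cut` and `pd.crosstab`. The helper name `age_group_shares` and the toy rows are assumptions for illustration, not values from the dataset. One difference to keep in mind: `normalize='index'` divides by the number of binned customers per product, so the result matches the manual calculation only when no customer is older than 50.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are NOT values from CardioGoodFitness.csv
toy = pd.DataFrame({
    'Product': ['TM195', 'TM195', 'TM195', 'TM195', 'TM798', 'TM798'],
    'Age':     [19, 25, 28, 35, 24, 45],
})

def age_group_shares(df):
    """Share of each product's customers per age group (each row sums to 1)."""
    bins = [-np.inf, 20, 30, 40, 50]
    labels = ['Age <= 20', '20 < Age <= 30', '30 < Age <= 40', '40 < Age <= 50']
    # pd.cut uses right-closed intervals by default, e.g. (20, 30]
    groups = pd.cut(df['Age'], bins=bins, labels=labels)
    return pd.crosstab(df['Product'], groups, normalize='index')

shares = age_group_shares(toy)
print(shares)
```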

In [None]:
#create table of mean, median, and mode values by product
blah = pd.DataFrame(np.zeros([3,3]), columns=['mean','median','mode'], index=['TM195','TM498','TM798'])
blah['mean']= [dataset['Age'].where(dataset['Product']=='TM195').mean(),dataset['Age'].where(dataset['Product']=='TM498').mean(),dataset['Age'].where(dataset['Product']=='TM798').mean()]
blah['median']= [dataset['Age'].where(dataset['Product']=='TM195').median(),dataset['Age'].where(dataset['Product']=='TM498').median(),dataset['Age'].where(dataset['Product']=='TM798').median()]
blah['mode']= [dataset['Age'].where(dataset['Product']=='TM195').mode()[0],dataset['Age'].where(dataset['Product']=='TM498').mode()[0],dataset['Age'].where(dataset['Product']=='TM798').mode()[0]]
print(str(blah))

In [None]:
#create probability density plot of self-rated customer fitness by product
plt.figure(figsize=[10,8]);

#one KDE curve and one annotation row per product
colors = {'TM195': 'royalblue', 'TM498': 'darkorange', 'TM798': 'forestgreen'}
rows = ['            Mean   Median    Mode']
for product, color in colors.items():
    fitness = dataset['Fitness'].where(dataset['Product']==product).dropna()
    sns.distplot(fitness, hist=False, label=product, kde_kws=dict(linewidth=5, color=color));
    rows.append('{}     {}         {}       {}'.format(product, round(fitness.mean(),1), fitness.median(), fitness.mode()[0]))
#annotate with mean, median, and mode
plt.text(2, 0.6, '\n'.join(rows), ha='right', va='center', size=12)

plt.legend(fontsize=14);

plt.title('Density of Customer Fitness by Product', fontsize=16, fontweight='bold');
plt.xlabel('Customer Fitness (Self-Assessed)', fontsize=14, fontweight='bold');
plt.ylabel('Density per Unit', fontsize=14, fontweight='bold');

Observation:

Customers who purchased the TM798 tended to rate themselves as being more fit than customers who purchased the TM195 or TM498, as indicated by the right-shift in the distribution for TM798.
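The mean, median, and mode shown in the plot annotation can also be collected in a small table. This is a minimal sketch, assuming a helper named `summarize_by_product` (not part of the original notebook) and toy rows that are not values from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; these are NOT values from CardioGoodFitness.csv
toy = pd.DataFrame({
    'Product': ['TM195', 'TM195', 'TM195', 'TM798', 'TM798', 'TM798'],
    'Fitness': [3, 3, 2, 5, 5, 4],
})

def summarize_by_product(df, column):
    """Mean (rounded to 1 decimal), median, and mode of `column` per product."""
    grouped = df.groupby('Product')[column]
    return pd.DataFrame({
        'mean': grouped.mean().round(1),
        'median': grouped.median(),
        # mode() can return several values on a tie; keep the smallest
        'mode': grouped.agg(lambda s: s.mode().iloc[0]),
    })

fitness_summary = summarize_by_product(toy, 'Fitness')
print(fitness_summary)
```

The same helper would work for any numeric column, for example `summarize_by_product(dataset, 'Usage')`.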



In [None]:
#bin fitness into groups with values normalized by product
products = ['TM195','TM498','TM798']
fit_prod = pd.DataFrame(0.0, index=products, columns=['Fitness <= 2','2 < Fitness <= 4', 'Fitness = 5'])
for product in products:
    fitness = dataset.loc[dataset['Product']==product, 'Fitness']
    #the mean of a boolean mask is the share of this product's customers in the bin
    fit_prod.loc[product, 'Fitness <= 2'] = (fitness <= 2).mean()
    fit_prod.loc[product, '2 < Fitness <= 4'] = ((fitness > 2) & (fitness <= 4)).mean()
    fit_prod.loc[product, 'Fitness = 5'] = (fitness == 5).mean()

#create grouped bar plot
fig = plt.figure(figsize=[8,6])
fig.suptitle("Self-Rated Fitness Level", fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
ax1 = fig.add_subplot(111)
barWidth = 0.15
 
# set height of bar
bars1 = list(fit_prod['Fitness <= 2'])
bars2 = list(fit_prod['2 < Fitness <= 4'])
bars3 = list(fit_prod['Fitness = 5'])
 
# Set position of bar on X axis
r1 = np.arange(len(bars1)) 
r2 = [x + 0.05 + barWidth for x in r1]
r3 = [x + 0.05 + barWidth for x in r2]
 
# Make the plot
ax1.bar(r1, bars1, color='mediumseagreen', width=barWidth, edgecolor='white', label='Unfit')
ax1.bar(r2, bars2, color='cornflowerblue', width=barWidth, edgecolor='white', label='Fit')
ax1.bar(r3, bars3, color='salmon', width=barWidth, edgecolor='white', label="Very Fit")
ax1.set_ylabel('Normalized by Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xlabel('Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xticks([0.2, 1.2, 2.2])
ax1.set_yticks(np.arange(0,1.1,0.1))
ax1.tick_params(axis = "x", which = "both", bottom = False, top = False)
ax1.set_xticklabels(['TM195','TM498','TM798'],fontdict={'color': 'black', 'size': 14})
ax1.set_yticklabels(np.round(np.arange(0,1.1,0.1),1), fontdict={'color': 'black', 'size': 14})
fig.legend(loc='upper center', bbox_to_anchor=(0.805, 0.93), fontsize=14)
fig.show();

Observation:

Customers who purchased the TM798 tended to rate themselves as being more fit than customers who purchased the TM195 or TM498.



In [None]:
#create probability density plot of expected weekly usage by product
plt.figure(figsize=[10,8]);

#one KDE curve and one annotation row per product
colors = {'TM195': 'royalblue', 'TM498': 'darkorange', 'TM798': 'forestgreen'}
rows = ['            Mean   Median    Mode']
for product, color in colors.items():
    usage = dataset['Usage'].where(dataset['Product']==product).dropna()
    sns.distplot(usage, hist=False, label=product, kde_kws=dict(linewidth=5, color=color));
    rows.append('{}     {}         {}       {}'.format(product, round(usage.mean(),1), usage.median(), usage.mode()[0]))
#annotate with mean, median, and mode
plt.text(8.3, 0.4, '\n'.join(rows), ha='right', va='center', size=12)

plt.legend(fontsize=14);

plt.title('Density of Expected Usage by Product', fontsize=16, fontweight='bold');
plt.xlabel('Expected Usage [days/wk]', fontsize=14, fontweight='bold');
plt.ylabel('Density per Unit', fontsize=14, fontweight='bold');

Observation:

TM798 customers expect to use the treadmill more times per week than customers who purchased the TM195 or TM498, as indicated by the right-shift in the distribution for the TM798.

** The TM498 curve peaks much higher than the TM195 curve, which looks like a scaling problem. The area under each density curve (its Riemann sum) should equal 1, so the taller peak may simply reflect a narrower spread of TM498 usage values rather than a normalization error. This has not been verified here.
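One way to reason about the scale question is to check that a kernel density estimate always encloses an area of 1, whatever its peak height. The sketch below hand-rolls a Gaussian KDE in NumPy and approximates the area with a Riemann sum. The function name, the bandwidths, and the two toy samples are assumptions for illustration; this is not seaborn's implementation and the samples are not values from the dataset.

```python
import numpy as np

def gaussian_kde_curve(values, bandwidth, grid):
    """Average of Gaussian kernels centred on each value, evaluated on `grid`."""
    values = np.asarray(values, dtype=float)
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

grid = np.linspace(-5, 15, 4001)
dx = grid[1] - grid[0]

# Toy samples for illustration only: one tightly clustered, one spread out
narrow = gaussian_kde_curve([3, 3, 3, 3, 4, 3, 2, 3], bandwidth=0.2, grid=grid)
wide = gaussian_kde_curve([2, 3, 4, 5, 6, 7, 4, 5], bandwidth=0.8, grid=grid)

# Riemann sums: both areas come out close to 1 even though the peaks differ
area_narrow = (narrow * dx).sum()
area_wide = (wide * dx).sum()
print(area_narrow, area_wide, narrow.max(), wide.max())
```

In this sketch the tightly clustered sample has the taller curve, while both curves enclose the same area.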



In [None]:
#bin usage into three groups with values normalized by product to allow comparison
products = ['TM195','TM498','TM798']
use_prod = pd.DataFrame(0.0, index=products, columns=['Usage <= 3','3 < Usage <= 5', 'Usage > 5'])
for product in products:
    usage = dataset.loc[dataset['Product']==product, 'Usage']
    #the mean of a boolean mask is the share of this product's customers in the bin
    use_prod.loc[product, 'Usage <= 3'] = (usage <= 3).mean()
    use_prod.loc[product, '3 < Usage <= 5'] = ((usage > 3) & (usage <= 5)).mean()
    use_prod.loc[product, 'Usage > 5'] = (usage > 5).mean()

#create grouped bar plot
fig = plt.figure(figsize=[8,6])
fig.suptitle("Expected Use per Week", fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
ax1 = fig.add_subplot(111)
barWidth = 0.25
 
# set height of bar
bars1 = list(use_prod['Usage <= 3'])
bars2 = list(use_prod['3 < Usage <= 5'])
bars3 = list(use_prod['Usage > 5'])
 
# Set position of bar on X axis
r1 = np.arange(len(bars1)) 
r2 = [x + 0.05 + barWidth for x in r1]
r3 = [x + 0.05 + barWidth for x in r2]
 
# Make the plot
ax1.bar(r1, bars1, color='mediumseagreen', width=barWidth, edgecolor='white', label='Days <= 3')
ax1.bar(r2, bars2, color='cornflowerblue', width=barWidth, edgecolor='white', label='3 < Days <= 5')
ax1.bar(r3, bars3, color='salmon', width=barWidth, edgecolor='white', label="Days > 5")
ax1.set_ylabel('Normalized by Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xlabel('Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xticks([0.3, 1.3, 2.3])
ax1.tick_params(axis = "x", which = "both", bottom = False, top = False)
ax1.set_xticklabels(['TM195','TM498','TM798'],fontdict={'color': 'black', 'size': 14})
ax1.set_yticks(np.arange(0,1.1,0.1))
ax1.set_yticklabels(np.round(np.arange(0,1.1,0.1),1), fontdict={'color': 'black', 'size': 14})
fig.legend(loc='upper center', bbox_to_anchor=(0.77, 0.93), fontsize=14)
fig.show();

Observation:

TM798 customers expect to use the treadmill more times per week than customers who purchased the TM195 or TM498.

In [None]:
sns.violinplot(data=dataset, y='Product', x='Usage');

Observation:

TM798 customers expect to use the treadmill more times per week than customers who purchased the TM195 or TM498.


In [None]:
#create probability density plot of customer annual income by product
plt.figure(figsize=[10,8]);

#one KDE curve and one annotation row per product
colors = {'TM195': 'royalblue', 'TM498': 'darkorange', 'TM798': 'forestgreen'}
rows = ['            Mean         Median          Mode']
for product, color in colors.items():
    income = dataset['Income'].where(dataset['Product']==product).dropna()
    sns.distplot(income, hist=False, label=product, kde_kws=dict(linewidth=5, color=color));
    rows.append('{}     {}       {}      {}'.format(product, round(income.mean(),1), income.median(), income.mode()[0]))
#annotate with mean, median, and mode
plt.text(135000, 3.5e-5, '\n'.join(rows), ha='right', va='center', size=12)

plt.legend(fontsize=14);

plt.title('Density of Customer Annual Income by Product', fontsize=16, fontweight='bold');
plt.xlabel('Annual Income [$USD]', fontsize=14, fontweight='bold');
plt.ylabel('Density per Unit', fontsize=14, fontweight='bold');

Observation:

The TM195 and TM498 appear to appeal to lower income customers. The TM798 appears to appeal to higher income customers, but income does not appear to be a driving factor.

In [None]:
#create violinplot of customer income by product
sns.set(font_scale = 1.3);
sns.set_style("ticks");
fig = plt.figure(figsize=[8,4]);
ax = sns.violinplot(data=dataset, y='Product', x='Income');
plt.xlabel('Annual Income [$USD]',fontsize=16,fontweight='bold');
plt.ylabel('Product',fontsize=16,fontweight='bold');
plt.title('Customer Income by Product', fontsize=18, fontweight='bold');
ax.set_xticks(np.arange(0,140001,15000));
fig.show();

In [None]:
#setup subplots
fig = plt.figure(figsize=[10,6]);
grid = plt.GridSpec(3, 1, wspace=0.3, hspace=1);

#plot boxplots of income by product with mean, 2-sigma, and 3-sigma reference lines
box_colors = {'TM195': 'forestgreen', 'TM498': 'dodgerblue', 'TM798': 'coral'}
for row, (product, color) in enumerate(box_colors.items()):
    income = dataset.loc[dataset['Product']==product, 'Income']
    mean, std = income.mean(), income.std()
    #label the reference lines on the first subplot only so the legend has one entry each
    labels = ('mean', '2sigma', '3sigma') if row == 0 else (None, None, None)
    ax = plt.subplot(grid[row, 0]);
    #pass the data as x= so the box is drawn horizontally, matching the vertical reference lines
    sns.boxplot(x=income, ax=ax, color=color);
    ax.axvline(mean, color="red", linestyle="--", label=labels[0])
    ax.axvline(mean + 2 * std, color="slategrey", linestyle="--", label=labels[1])
    ax.axvline(max(mean - 2 * std, 0), color="slategrey", linestyle="--")
    ax.axvline(mean + 3 * std, color="orange", linestyle="--", label=labels[2])
    ax.axvline(max(mean - 3 * std, 0), color="orange", linestyle="--")
    #only the bottom subplot carries the x-axis label
    ax.set_xlabel('Annual Income [$USD]' if row == 2 else '', fontdict=axlab2);
    ax.set_xticks(np.arange(0,140001,15000));
    ax.set_title(product, fontsize=14, fontweight='bold')

plt.xticks(fontsize=14);
plt.yticks(fontsize=14);
fig.legend(loc='upper center', bbox_to_anchor=(0.8, 0.9), fontsize=14)
fig.show();

Observations:

1. TM195 & TM498 - Income less than 75k/yr
2. TM798 - not income specific
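The 2-sigma and 3-sigma reference lines in the boxplots are the mean plus or minus a multiple of the standard deviation, with the lower line clipped at 0 because income cannot be negative. A minimal sketch of that calculation follows; the helper name `sigma_bounds` and the toy incomes are assumptions for illustration, not values from the dataset.

```python
import numpy as np

def sigma_bounds(values, k):
    """Return (lower, upper) = mean -/+ k standard deviations, lower clipped at 0."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # ddof=1 gives the sample standard deviation, the same default as pandas Series.std()
    std = values.std(ddof=1)
    return max(mean - k * std, 0.0), mean + k * std

# Toy incomes for illustration only; these are NOT values from CardioGoodFitness.csv
toy_income = [40000, 45000, 50000, 55000, 60000]

low2, high2 = sigma_bounds(toy_income, 2)
low3, high3 = sigma_bounds(toy_income, 3)
print(low2, high2, low3, high3)
```

Values outside the 3-sigma band could then be flagged for a closer look, for example with a boolean mask on the income column.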



In [None]:
#create probability density plot of customer education by product
plt.figure(figsize=[10,8]);

#one KDE curve and one annotation row per product
colors = {'TM195': 'royalblue', 'TM498': 'darkorange', 'TM798': 'forestgreen'}
rows = ['            Mean   Median    Mode']
for product, color in colors.items():
    education = dataset['Education'].where(dataset['Product']==product).dropna()
    sns.distplot(education, hist=False, label=product, kde_kws=dict(linewidth=5, color=color));
    rows.append('{}     {}       {}      {}'.format(product, round(education.mean(),1), education.median(), education.mode()[0]))
#annotate with mean, median, and mode
plt.text(23.5, 0.28, '\n'.join(rows), ha='right', va='center', size=12);

plt.legend(fontsize=14);

plt.title('Density of Customer Education by Product', fontsize=16, fontweight='bold');
plt.xlabel('Education [yrs]', fontsize=14, fontweight='bold');
plt.ylabel('Density per Unit', fontsize=14, fontweight='bold');
plt.show();

Observations:

1. Customers with higher education levels tend to purchase the TM798, as indicated by the right-shift in the distribution for the TM798.
2. Customers who purchased the TM195 or TM498 tended to be college educated or pursuing a college degree, as indicated by the two local maxima in their distributions.



In [None]:
#create barplot of education level (High School, Some College, Bachelors, Advanced) by product
#first bin data and normalize by product
products = ['TM195','TM498','TM798']
prod_ed = pd.DataFrame(0.0, columns=['High School','Some College','Bachelors','Advanced'], index=products)
for product in products:
    education = dataset.loc[dataset['Product']==product, 'Education']
    #the mean of a boolean mask is the share of this product's customers in the bin
    prod_ed.loc[product, 'High School'] = (education == 12).mean()
    prod_ed.loc[product, 'Some College'] = ((education > 12) & (education < 16)).mean()
    prod_ed.loc[product, 'Bachelors'] = ((education >= 16) & (education < 18)).mean()
    prod_ed.loc[product, 'Advanced'] = (education >= 18).mean()

#create grouped bar plot
fig = plt.figure(figsize=[8,6])
fig.suptitle("Products Sold by Degree", fontsize=18, fontweight='bold')
fig.subplots_adjust(top=0.92)
ax1 = fig.add_subplot(111)
barWidth = 0.15
 
# set height of bar
bars1 = list(prod_ed['High School'])
bars2 = list(prod_ed['Some College'])
bars3 = list(prod_ed['Bachelors'])
bars4 = list(prod_ed['Advanced'])
 
# Set position of bar on X axis
r1 = np.arange(len(bars1)) 
r2 = [x + 0.05 + barWidth for x in r1]
r3 = [x + 0.05 + barWidth for x in r2]
r4 = [x + 0.05 + barWidth for x in r3]
 
# Make the plot
ax1.bar(r1, bars1, color='mediumseagreen', width=barWidth, edgecolor='white', label='High School')
ax1.bar(r2, bars2, color='cornflowerblue', width=barWidth, edgecolor='white', label='Some College')
ax1.bar(r3, bars3, color='lightcoral', width=barWidth, edgecolor='white', label="Bachelor's")
ax1.bar(r4, bars4, color='orange', width=barWidth, edgecolor='white', label='Advanced')
ax1.set_ylabel('Normalized by Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xlabel('Product', fontdict={'weight': 'bold', 'color': 'black', 'size': 16})
ax1.set_xticks([0.3, 1.3, 2.3])
ax1.set_yticks(np.arange(0,1,0.1))
ax1.tick_params(axis = "x", which = "both", bottom = False, top = False)
ax1.set_xticklabels(['TM195','TM498','TM798'],fontdict={'color': 'black', 'size': 14})
ax1.set_yticklabels(np.round(np.arange(0,1,0.1),1), fontdict={'color': 'black', 'size': 14})
fig.legend(loc='upper center', bbox_to_anchor=(0.775, 0.93), fontsize=14)
fig.show();

Observation:

Customers who bought the TM798 tend to be more educated than customers who bought the TM195 or TM498.



# Conclusion

### TM195 Demographic:
- Most purchased model 
- Predominately in 20s with secondary group in 30s
- Annual income less than $75,000 
- Primarily some college or bachelor’s
- 40/60 single vs. partnered
- Predominately male for single and female for partnered
- Mostly average fitness, with some unfit
- Expect to use 3-4 days per week

### TM498 Demographic:
- Predominately in 20s with close second group in 30s
- Annual income less than $75,000 
- Primarily some college or bachelor’s
- 40/60 single vs. partnered
- Predominately female for single and male for partnered
- Mostly average fitness, with some unfit
- Expect to use 3-4 days per week

### TM798 Demographic:
- Least purchased model
- Mostly in 20s with secondary 30-50 demographic
- Not income specific but with higher earners
- Primarily advanced education
- Supermajority male
- Mostly very fit with some average fitness
- Expect to use 4-5 days per week

# Recommendations

### For Better Insights:
- Look at profit by product model to better understand sales percentages
- Gather information on fitness goals: lose weight, better cardio health, maintain, etc.
- Gather information on customers' partners to get the second half of the story for partnered customers.

### To Target New Customers:
- For TM195: Advertise broadly across gender and marital status to individuals in their 20s or 30s with an annual income less than $75,000, some college education or a bachelor’s degree, and low to average fitness.

- For TM498: Advertise broadly across gender and marital status to individuals in their 20s or 30s with an annual income less than $75,000, some college education or a bachelor’s degree, and low to average fitness.

- For TM798: Concentrate advertising on males in their 20s or 30s who are of average fitness to very fit and have a bachelor’s degree or advanced education.

- <b>There may be untapped potential among customers aged 40 and over, who appear to be an underserved population. The analysis shows more than just outlying TM798 purchases in this age group.</b>

- <b>Individuals with only a high school education also appear to be an underserved population. They are likely the best candidates for the TM195 or TM498 because of annual income constraints.</b>