In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# Project explanation:
The market research team at AdRight is tasked with identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The team decides to investigate whether there are differences across the product lines with respect to customer characteristics, and collects data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months. The data are stored in the CardioGoodFitness.csv file. The team identifies the following customer variables to study: product purchased (TM195, TM498, or TM798); gender; age, in years; education, in years; relationship status, single or partnered; annual household income ($); average number of times the customer plans to use the treadmill each week; average number of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape. Perform descriptive analytics to create a customer profile for each CardioGood Fitness treadmill product line.

## Table of contents:
- ### I) Data Preprocessing:
    - #### 1) One-hot encode the Product variable so that the original dataset can be separated into one dataset per product.
    - #### 2) Modify the gender variable to 1 for Male and 0 for Female.
    - #### 3) Separate the original dataset into 3 datasets, each containing information about one of the products.
- ### II) Data Analysis:
    - #### 1) Observe the correlation of each variable with our 3 products.
    - #### 2) Analyse each feature and describe the values that represent the "best clients" for each product.
    - #### 3) Compute the mode value for the categorical features.

In [None]:
df = pd.read_csv('/kaggle/input/cardiogoodfitness/CardioGoodFitness.csv')

In [None]:
df.head()

In [None]:
df.info()

## I) Data preprocessing:

### 1) One-hot encode the Product variable to separate the original dataset into one dataset per product:

In [None]:
y = pd.get_dummies(df.Product, prefix='Product')
print(y.head())
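A note on the encoding step: in recent pandas versions `get_dummies` returns boolean columns by default. The comparison `== 1` used later still works on booleans, but integer indicators can be requested explicitly. The sketch below uses a small made-up series, not the real dataset, to show the shape of the result.

```python
import pandas as pd

# Toy product labels, invented for illustration (not read from CardioGoodFitness.csv)
products = pd.Series(['TM195', 'TM498', 'TM798', 'TM195'], name='Product')

# dtype=int asks for 0/1 integer indicators instead of booleans
dummies = pd.get_dummies(products, prefix='Product', dtype=int)
print(dummies)

# Each row has exactly one indicator set, because each customer bought one product
print((dummies.sum(axis=1) == 1).all())
```

Each indicator column is named after the prefix and the product label, which is why the notebook can read `y['Product_TM195']` in the next cell.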

In [None]:
df['TM195'] = y['Product_TM195']
df['TM498'] = y['Product_TM498']
df['TM798'] = y['Product_TM798']

In [None]:
df.head()

### 2) Modify the gender variable to 1 for Male and 0 for Female:

In [None]:
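# Assign the recoded column back to df; replace(..., inplace=True) on a selected column may not update df in recent pandas versions
df['Gender'] = df['Gender'].map({'Male': 1, 'Female': 0})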

In [None]:
df.head()

### 3) Separate the original dataset into 3 datasets, each containing information about one of the products:

In [None]:
TM195 = df[df['TM195'] == 1]
TM498 = df[df['TM498'] == 1]
TM798 = df[df['TM798'] == 1]

In [None]:
TM195.shape[0],TM498.shape[0],TM798.shape[0]

In [None]:
# Dropping redundant variables (assign the result instead of modifying a slice of df in place):
TM195 = TM195.drop(columns=['TM195','TM498','TM798'])
TM498 = TM498.drop(columns=['TM195','TM498','TM798'])
TM798 = TM798.drop(columns=['TM195','TM498','TM798'])
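As an aside, the same split can be obtained without indicator columns by grouping on the product label directly. This is only a sketch of an alternative, shown on a made-up frame; the dictionary name `subsets` is hypothetical and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy rows, invented for illustration (not read from CardioGoodFitness.csv)
toy = pd.DataFrame({
    'Product': ['TM195', 'TM498', 'TM195', 'TM798'],
    'Age': [20, 30, 25, 40],
})

# One DataFrame per product, with the now-constant Product column removed
subsets = {
    name: group.drop(columns='Product')
    for name, group in toy.groupby('Product')
}

print(list(subsets))
print(subsets['TM195'])
```

With this layout, `subsets['TM195']` plays the same role as the `TM195` DataFrame built above.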

## II) Data Analysis:

### 1) Let's start by observing the correlation of each variable with our 3 products:

In [None]:
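# numeric_only=True skips text columns such as Product and MaritalStatus; rows are selected by name rather than by position
df.corr(numeric_only=True).loc[['TM195','TM498','TM798'], ['Age','Gender','Education','Usage','Fitness','Income','Miles']]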

### Analysis: 
- General analysis: we can observe that age is not correlated with any of the products.
- Product TM195:
    - Gender: this product is negatively correlated with gender (coded 1 for Male, 0 for Female), which means that it is more popular with females.
    - Education: this product is negatively correlated with education, which means that it is more popular with people who have fewer years of education.
    - Usage: this product is more popular with people who plan a lower weekly usage.
    - Fitness: this product is more popular with people who rated themselves as less fit.
    - Income: this product is more popular with lower-income customers.
    - Miles: this product is more popular with casual users and beginners.

- Product TM498:
    - Gender: this product is less correlated with gender than the two other products, although the correlation with being male is still negative.
    - Education: this product is negatively correlated with education, which means that it is more popular with people who have fewer years of education, although the negative correlation is weaker than for the first product.
    - Usage: this product is more popular with people who plan a lower weekly usage, although the negative correlation is weaker than for the first product.
    - Fitness: this product is more popular with people who rated themselves as less fit.
    - Income: this product is more popular with lower-income customers, although the negative correlation is weaker than for the first product.
    - Miles: this product is more popular with casual users and beginners, although the negative correlation is weaker than for the first product.

- Product TM798:
    - Gender: this product is positively correlated with gender (coded 1 for Male, 0 for Female), which means that it is more popular with males.
    - Education: this product is positively correlated with education, which means that it is more popular with people who have more years of education.
    - Usage: this product is highly positively correlated with usage. It is more popular with people who plan a higher weekly usage, so it should appeal to more committed clients.
    - Fitness: this product is highly positively correlated with fitness. It is more popular with people who rated themselves as fit.
    - Income: this product is highly positively correlated with income, so it is more popular with higher-income customers.
    - Miles: this product is more popular with advanced users.
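To make the reading of these correlations concrete: the Pearson correlation between a 0/1 product indicator and a numeric variable has the same sign as the difference between the mean of that variable for buyers of the product and the mean for everyone else. The sketch below illustrates this on made-up numbers; the products `A` and `B` and all values are invented and do not come from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration
toy = pd.DataFrame({
    'Product': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Usage':   [2, 3, 3, 5, 6, 5],
})

# 0/1 indicator for product B, analogous to the one-hot columns used above
is_b = (toy['Product'] == 'B').astype(int)

# Correlation between the indicator and Usage
r = is_b.corr(toy['Usage'])

# Group means: buyers of B report higher usage than buyers of A
means = toy.groupby('Product')['Usage'].mean()

print('correlation with indicator:', r)
print(means)
```

A positive correlation here simply reflects that buyers of `B` have a higher mean usage than the others, which is how the bullets above should be read.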

### 2) Now, we are going to analyse each feature and describe the values that represent the "best clients" for each product

#### For numerical variables, we first plot each distribution and observe its shape. Then:
- For continuous values (Age, Income): we describe the distribution and provide the range mean ± 2 standard deviations for the income distribution, since in a normal distribution about 95% of the data lies within 2 standard deviations of the mean (https://www.labce.com/spg49741_acceptable_standard_deviation_sd.aspx)
- For discrete values (Usage, Fitness, Miles): we provide the two most frequent values (the mode is the value with the highest number of occurrences)
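The two summaries described above can be sketched as follows. This is a minimal illustration on made-up values, and the helper names `two_sd_range` and `top_two_values` are hypothetical; they are not part of the notebook's own code.

```python
import pandas as pd

# Toy values, invented for illustration (not read from CardioGoodFitness.csv)
income = pd.Series([30000, 40000, 50000, 60000, 70000], name='Income')
usage = pd.Series([3, 3, 3, 4, 4, 2], name='Usage')

def two_sd_range(col):
    """Return (mean - 2*sd, mean + 2*sd) using the sample standard deviation."""
    mean, sd = col.mean(), col.std()
    return mean - 2 * sd, mean + 2 * sd

def top_two_values(col):
    """Return the two most frequent values, most frequent first."""
    return col.value_counts().head(2).index.tolist()

low, high = two_sd_range(income)
print('mean +- 2 sd range:', low, high)
print('two most frequent usage values:', top_two_values(usage))
```

Note that the 95% interpretation of the range only holds approximately, and only when the distribution is close to normal, so the histograms should be checked before relying on it.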

In [None]:
numTM195 = TM195[['Age','Education','Usage','Fitness','Income']]
numTM498 = TM498[['Age','Education','Usage','Fitness','Income']]
numTM798 = TM798[['Age','Education','Usage','Fitness','Income']]

### Plotting the distribution for each numerical variable:

In [None]:
def plotdist(col):
    """Plot a histogram of a numeric column and print its summary statistics."""
    print('Distribution and range for column:', col.name)
    col.hist(bins=12, alpha=0.5)  # draw the histogram (no need to print the Axes object)
    plt.show()
    mean, sd = col.mean(), col.std()
    print('Range of mean +- 2 standard deviations: [', mean - 2 * sd, ',', mean + 2 * sd, ']')
    print('Median', col.name, ':', col.median())
    print('Mean', col.name, ':', mean)
    print('__________________________________________________________________________')
    print()

### Product TM195:

In [None]:
TM195[['Age','Income','Miles']].apply(plotdist)

#### Analysis: these distributions confirm the analysis made with the dataset correlations. The income distribution shows that this product is more popular with lower-income clients.

#### The income range, mean and median for this product are similar to those of the second product, but considerably smaller than those of the last product.

### Product TM498:

In [None]:
TM498[['Age','Income','Miles']].apply(plotdist)

#### Analysis: although the income distribution is slightly weighted towards lower incomes, this product is popular with clients whose incomes are closer to average than those of the first product's clients.
#### The income range, mean and median for this product are similar to those of the first product, but considerably smaller than those of the last product.

### Product TM798:

In [None]:
TM798[['Age','Income','Miles']].apply(plotdist)

#### Analysis: the income distribution is weighted towards very high incomes and, to a lesser extent, towards very low incomes. We can also observe that this product is far more popular with younger clients than the two other products.
#### The median and mean income are also much higher than for the other two products.

#### The income range is considerably higher than that of the two other products.

#### As for the Miles variable, clients attracted to this product expect to cover more miles per week than clients of the two other products.

### 3) Compute mode value for the categorical features:

In [None]:
TM195.head()

In [None]:
TM195[['MaritalStatus']].mode()

In [None]:
def modvalue(col):
    """Print the mode(s) of a column; ties produce more than one value."""
    print('Mode value for column:', col.name)
    print('--------------------------------')
    print('Mode / most frequent value(s):', col.mode().tolist())
    print('__________________________________________________________________________')
    print()

#### A) Product TM195:

In [None]:
TM195[['Gender','Education','MaritalStatus','Usage','Fitness']].apply(modvalue)

#### B) Product TM498:

In [None]:
TM498[['Gender','Education','MaritalStatus','Usage','Fitness']].apply(modvalue)

#### C) Product TM798:

In [None]:
TM798[['Gender','Education','MaritalStatus','Usage','Fitness']].apply(modvalue)
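The per-product statistics computed above can also be gathered into a single profile table, which makes the three products easier to compare side by side. The sketch below is one possible way to do this with `groupby` and named aggregation; it runs on made-up rows, and the helper `first_mode` and the output column names are hypothetical.

```python
import pandas as pd

# Toy rows, invented for illustration (not read from CardioGoodFitness.csv)
toy = pd.DataFrame({
    'Product': ['TM195', 'TM195', 'TM498', 'TM498', 'TM798', 'TM798'],
    'Gender':  ['Female', 'Female', 'Male', 'Male', 'Male', 'Male'],
    'Income':  [30000, 34000, 45000, 47000, 80000, 90000],
})

def first_mode(col):
    """Return the most frequent value; on ties, the first one in sorted order."""
    return col.mode().iloc[0]

# One row per product, one column per summary statistic
profile = toy.groupby('Product').agg(
    median_income=('Income', 'median'),
    mean_income=('Income', 'mean'),
    top_gender=('Gender', first_mode),
)
print(profile)
```

The same pattern extends to the other variables by adding one named aggregation per statistic of interest.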