## **Importing libraries**

In [None]:
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
import pandas as pd
from tabulate import tabulate
plt.style.use('dark_background')

In [None]:
%pip install category_encoders

## **Importing dataset**

The market research team at AdRight is assigned the task of identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The team decides to investigate whether there are differences across the product lines with respect to customer characteristics, and collects data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months. The data are stored in the CardioGoodFitness.csv file. The team identifies the following customer variables to study: product purchased (TM195, TM498, or TM798); gender; age, in years; education, in years; relationship status (single or partnered); annual household income ($); average number of times the customer plans to use the treadmill each week; average number of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape. Perform descriptive analytics to create a customer profile for each CardioGood Fitness treadmill product line.

In [None]:
dataset = pd.read_csv("../input/cardiogoodfitness/CardioGoodFitness.csv")

## **Assessing data**

In [None]:
dataset.head(10)

In [None]:
dataset.tail(10)

In [None]:
plt.figure(figsize=(8,8))  # set the figure size to 8 by 8 inches
p=sns.heatmap(dataset.corr(numeric_only=True), annot=True)  # correlation heatmap of the numeric columns only

In [None]:
dataset.shape

In [None]:
dataset.info()

In [None]:
dataset.describe()

## **Analyzing data**

In [None]:
p = dataset.hist(figsize = (10,10))

In [None]:
dataset[['Income', 'Product']].groupby(['Product'], as_index=False).median().sort_values(by='Product', ascending=False)

In [None]:
dataset[['Usage', 'Product']].groupby(['Product'], as_index=False).median().sort_values(by='Product', ascending=False)

In [None]:
dataset[['Miles', 'Product']].groupby(['Product'], as_index=False).median().sort_values(by='Product', ascending=False)

In [None]:
dataset[['Fitness', 'Product']].groupby(['Product'], as_index=False).mean().sort_values(by='Product', ascending=False)
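
The per-product summaries above can also be collected into a single table with named aggregation. The sketch below is only an illustration: it runs on a small made-up frame (the values are invented, not taken from CardioGoodFitness.csv) that reuses the notebook's column names.

```python
import pandas as pd

# Hypothetical toy rows; the numbers are invented for illustration only.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM498", "TM798", "TM798"],
    "Income":  [30000, 40000, 45000, 55000, 80000, 100000],
    "Usage":   [3, 3, 3, 4, 5, 6],
    "Miles":   [60, 80, 85, 95, 160, 200],
    "Fitness": [3, 3, 3, 3, 4, 5],
})

# One row per product: median for Income, Usage and Miles, mean for Fitness,
# mirroring the statistics used in the cells above.
profile = toy.groupby("Product").agg(
    median_income=("Income", "median"),
    median_usage=("Usage", "median"),
    median_miles=("Miles", "median"),
    mean_fitness=("Fitness", "mean"),
)
print(profile)
```

Applied to `dataset`, the same call would put all four statistics side by side, which makes the product profiles easier to compare than four separate outputs.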

In [None]:
z = dataset[['Age', 'Product']].groupby(['Product'], as_index=False).median().sort_values(by='Product', ascending=False)
print(z)
g = sns.FacetGrid(dataset, col='Product')
g.map(plt.hist, 'Age', bins=20)

In [None]:
z = dataset[['Education', 'Product']].groupby(['Product'], as_index=False).mean().sort_values(by='Product', ascending=False).round()
print(z)
g = sns.FacetGrid(dataset, col='Product')
g.map(plt.hist, 'Education', bins=20)
# People with fewer years of education were more interested in the lower-end model

In [None]:
g = sns.FacetGrid(dataset, col='Product')
g.map(plt.hist, 'Gender', bins=20)
# TM195 and TM498 are bought by both men and women, while TM798 is bought more often by men and by people with higher incomes
#sns.countplot(x='Product', hue = 'Gender', data = dataset)
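
A histogram of a categorical column shows the counts only visually. As a possible complement, a cross-tabulation gives the same counts as numbers, and row-normalizing it gives the share of each gender within a product. This is a minimal sketch on invented rows; only the column names `Product` and `Gender` come from the notebook.

```python
import pandas as pd

# Hypothetical toy rows; not the real CardioGoodFitness data.
toy = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM798", "TM798", "TM798"],
    "Gender":  ["Male", "Female", "Female", "Male", "Male", "Female"],
})

# Raw counts: one row per product, one column per gender.
counts = pd.crosstab(toy["Product"], toy["Gender"])

# Shares within each product (each row sums to 1).
shares = pd.crosstab(toy["Product"], toy["Gender"], normalize="index")

print(counts)
print(shares.round(2))
```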

In [None]:
groups = dataset[['Gender','MaritalStatus','Product']].groupby(['Gender','MaritalStatus',]).count().sort_values("Product",ascending=False )
# Partnered males are more interested in buying than single people,
# and males in general are more interested in buying treadmills than females
print(groups)
groups.plot.bar(color="white")
plt.show()

In [None]:
dataset[['Gender','Income',]].groupby(['Gender',]).median().sort_values("Income",ascending=False )

## **Drawing conclusions**

The three products are TM798, TM498 and TM195. From the analysis and visualizations above we find:
1. Males in general are more interested in buying than females, especially for the expensive model, TM798.
2. Partnered males showed more interest in buying a treadmill than partnered females.
3. Couples in general are more interested in buying a treadmill than single people, perhaps because it will be used by two people and so reflects the preference of two people rather than one.
4. People with higher incomes showed more interest in the TM798, which suggests that it is the most expensive model, while the other two are probably close in price because people with similar incomes show equal interest in them.
5. Younger people are more interested in the TM195, and as age goes up people choose the other two treadmills.
6. People with a higher fitness level choose the TM798, but this may not be trustworthy because fitness is self-rated.
7. People who bought the TM798 said that they would use it more often and run more miles on it than buyers of the other two.



## **Applying machine learning classification models**
To predict which customers will buy which product, I will try several classification models and then show the training and test scores in a table at the end.

First we need to encode the categorical columns (Product, Gender and MaritalStatus) so that the algorithms can use them. This is done with a package called category_encoders.

In [None]:
dataset = dataset.drop("Education" , axis = 1)

In [None]:
import category_encoders as ce

encoder = ce.OrdinalEncoder(cols=['Product', 'Gender', 'MaritalStatus'], return_df=True, verbose=0)

# Replace the three categorical columns with integer codes.
# Note: the encoder is fitted on the full dataset, before any train/test split.
dataset = encoder.fit_transform(dataset)

## **Checking the encoding**

The data was encoded as follows:
1. Male: 1, Female: 2
2. Single: 1, Partnered: 2
3. TM798: 1, TM498: 2 and TM195: 3
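
One way to confirm a mapping like the one listed above is to pair each original label with its integer code and drop the duplicates. The sketch below uses `pandas.factorize` as a stand-in for the encoder, which is an assumption and not the category_encoders implementation: it numbers labels in order of first appearance starting from 0, so 1 is added to get 1-based codes. The toy labels are invented.

```python
import pandas as pd

# Hypothetical labels, standing in for an un-encoded column such as Gender.
labels = pd.Series(["Male", "Male", "Female", "Male", "Female"], name="Gender")

# Stand-in encoding: codes follow the order of first appearance (0-based), shifted to 1-based.
codes, uniques = pd.factorize(labels)
encoded = pd.Series(codes + 1, name="Gender_code")

# Pair every label with its code and keep one row per distinct pair.
lookup = (
    pd.concat([labels, encoded], axis=1)
    .drop_duplicates()
    .set_index("Gender")["Gender_code"]
    .to_dict()
)
print(lookup)
```

In the notebook this check would need a copy of the un-encoded columns, because `dataset` is overwritten by the encoder's output.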



In [None]:
dataset.head(10)

## **Splitting the data**

In [None]:
X = dataset.drop("Product" , axis=1)
y = dataset["Product"]

In [None]:
#Import Libraries
from sklearn.preprocessing import StandardScaler
#----------------------------------------------------
#Standard Scaler for Data
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
X = scaler.fit_transform(X)

In [None]:
from sklearn.model_selection import train_test_split
X_train ,X_test , y_train , y_test = train_test_split(X,y , test_size = 0.1 , random_state = 42)
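
In the cells above the scaler is fitted on all rows before the split, so statistics of the test rows influence the scaling of the training rows. A common alternative, sketched here under assumptions, is to split first and let a `Pipeline` fit the scaler on the training rows only. The data comes from `make_classification` (synthetic, not the treadmill data), and the names `X_demo`, `pipe` and the parameter values are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 180 rows, 7 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=180, n_features=7, n_informative=4, n_classes=3, random_state=42
)

# Split first; stratify keeps the class proportions similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on X_tr only, then reuses its statistics on X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)
print("test accuracy:", pipe.score(X_te, y_te))
```

With a test share of 0.1 the test set is small, so a single test score can move noticeably with the random seed.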

## **Applying logistic regression**

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(solver = "newton-cg" , max_iter = 100 , C = 8)
classifier.fit(X_train, y_train)
print("logistic regression training score is " + str(classifier.score(X_train , y_train)))
print("logistic regression test score is " + str(classifier.score(X_test , y_test)))
print('----------------------------------------------------')
# Making the Confusion Matrix

from sklearn.metrics import confusion_matrix
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,5))
p=sns.heatmap(cm, annot=True)


Logisticregressionscoretraining = classifier.score(X_train , y_train)
Logisticregressionscoretest = classifier.score(X_test , y_test)
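
The confusion matrix holds more than the overall score: in scikit-learn's convention rows are true classes and columns are predicted classes, so per-class recall and precision can be read off the diagonal. The matrix below is a made-up 3-class example, not output from this notebook.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm_demo = np.array([
    [5, 1, 0],
    [2, 4, 0],
    [0, 1, 5],
])

# Accuracy: correctly classified samples (diagonal) over all samples.
acc_demo = np.trace(cm_demo) / cm_demo.sum()

# Recall per class: diagonal over row totals (all samples that truly belong to the class).
recall_demo = np.diag(cm_demo) / cm_demo.sum(axis=1)

# Precision per class: diagonal over column totals (all samples predicted as the class).
precision_demo = np.diag(cm_demo) / cm_demo.sum(axis=0)

print(acc_demo, recall_demo, precision_demo)
```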

## **Applying Neural networks**

In [None]:
#Import Libraries
from sklearn.neural_network import MLPClassifier
#----------------------------------------------------
#Applying MLPClassifier Model 
MLPClassifierModel = MLPClassifier(activation='tanh',
                                   solver='lbfgs',  
                                   learning_rate='constant',
                                   early_stopping= False,
                                   alpha=0.03,hidden_layer_sizes=(256,128) , max_iter=10000)
MLPClassifierModel.fit(X_train, y_train)
#Calculating Details
print('MLPClassifierModel Train Score is : ' , MLPClassifierModel.score(X_train, y_train))
print('MLPClassifierModel Test Score is : ' , MLPClassifierModel.score(X_test, y_test))
MLPClassifierModelTrainScore =  MLPClassifierModel.score(X_train, y_train)
MLPClassifierModelTestScore = MLPClassifierModel.score(X_test, y_test)

## **Applying Support vector machine**

In [None]:
#Import Libraries
from sklearn.svm import SVC
SVCModel = SVC(kernel= 'linear')
SVCModel.fit(X_train, y_train)
#Calculating Details
print('SVCModel Train Score is : ' , SVCModel.score(X_train, y_train))
print('SVCModel Test Score is : ' , SVCModel.score(X_test, y_test))
print('----------------------------------------------------')
accuracy = SVCModel.score(X_test, y_test)


y_pred = SVCModel.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5,5))
p=sns.heatmap(cm, annot=True)

SVCModelscoretraining = SVCModel.score(X_train, y_train)
SVCModelscoretest = SVCModel.score(X_test, y_test)

## **Applying Gaussian Naive Bayes**

In [None]:
#Import Libraries
from sklearn.naive_bayes import GaussianNB
GaussianNBModel = GaussianNB()
GaussianNBModel.fit(X_train, y_train)
print('GaussianNBModel Train Score is : ' , GaussianNBModel.score(X_train, y_train))
print('GaussianNBModel Test Score is : ' , GaussianNBModel.score(X_test, y_test))
print('----------------------------------------------------')
GaussianNBModelscoretrain = GaussianNBModel.score(X_train, y_train)
GaussianNBModelscoretest = GaussianNBModel.score(X_test, y_test)

## **Applying KNN**

In [None]:
#Import Libraries
from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifierModel = KNeighborsClassifier(n_neighbors = 1, weights='distance',
                                               algorithm = 'auto')    
KNeighborsClassifierModel.fit(X_train, y_train)
print('KNeighborsclassifierModel Train Score is : ' , KNeighborsClassifierModel.score(X_train, y_train))
print('KNeighborsclassifierModel Test Score is : ' , KNeighborsClassifierModel.score(X_test, y_test))
print('----------------------------------------------------')
KNeighborsClassifierModelscoretraining = KNeighborsClassifierModel.score(X_train, y_train)
KNeighborsClassifierModelscoretest = KNeighborsClassifierModel.score(X_test, y_test)
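
With `n_neighbors = 1` each training point is its own nearest neighbor, so the training score is 1.0 whenever no two identical rows carry different labels, and it says little about how the model generalizes. One possible way to choose the number of neighbors is cross-validation, sketched here on synthetic data; the candidate values of `k` and the 5 folds are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the treadmill data.
X_demo, y_demo = make_classification(
    n_samples=180, n_features=7, n_informative=4, n_classes=3, random_state=0
)

# Training score with k = 1: every point is matched to itself.
train_score_k1 = KNeighborsClassifier(n_neighbors=1).fit(X_demo, y_demo).score(X_demo, y_demo)

# Mean 5-fold cross-validated accuracy for a few candidate values of k.
cv_means = {}
for k in (1, 3, 5, 7, 9):
    knn = KNeighborsClassifier(n_neighbors=k)
    cv_means[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()

best_k = max(cv_means, key=cv_means.get)
print(train_score_k1, cv_means, best_k)
```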

## **Applying random forest classifier**

In [None]:
#Import Libraries
from sklearn.ensemble import RandomForestClassifier
#----------------------------------------------------

#Applying RandomForestClassifier Model 
RandomForestClassifierModel = RandomForestClassifier(criterion = 'gini',n_estimators=100,max_depth=2,random_state=33) #criterion can be also : entropy 
RandomForestClassifierModel.fit(X_train, y_train)

#Calculating Details
print('RandomForestClassifierModel Train Score is : ' , RandomForestClassifierModel.score(X_train, y_train))
print('RandomForestClassifierModel Test Score is : ' , RandomForestClassifierModel.score(X_test, y_test))
print('----------------------------------------------------')
RandomForestClassifierModeltrain =  RandomForestClassifierModel.score(X_train, y_train)
RandomForestClassifierModeltest = RandomForestClassifierModel.score(X_test, y_test)

## **Applying Gradient boosting**

In [None]:
#Import Libraries
from sklearn.ensemble import GradientBoostingClassifier
#----------------------------------------------------

#Applying GradientBoostingClassifier Model 

GBCModel = GradientBoostingClassifier(n_estimators=100,max_depth=3,random_state=33) 
GBCModel.fit(X_train, y_train)

#Calculating Details
print('GBCModel Train Score is : ' , GBCModel.score(X_train, y_train))
print('GBCModel Test Score is : ' , GBCModel.score(X_test, y_test))
GBCModeltraining = GBCModel.score(X_train, y_train)
GBCModeltesting =GBCModel.score(X_test, y_test)



## **Comparison between Algorithms**

In [None]:
models = pd.DataFrame({
    'Model': ['Logistic regression',
              'KNN',
              'Naive Bayes',
              'Linear SVC',
              'Neural networks',
              'Random forest',
              'Gradient boosting'],

    'Scoretrain': [Logisticregressionscoretraining,
                   KNeighborsClassifierModelscoretraining,
                   GaussianNBModelscoretrain,
                   SVCModelscoretraining,
                   MLPClassifierModelTrainScore,
                   RandomForestClassifierModeltrain,
                   GBCModeltraining],

    'Scoretest': [Logisticregressionscoretest,
                  KNeighborsClassifierModelscoretest,
                  GaussianNBModelscoretest,
                  SVCModelscoretest,
                  MLPClassifierModelTestScore,
                  RandomForestClassifierModeltest,
                  GBCModeltesting]})


print(tabulate(models , headers = ['Model' , 'Train' , 'Test'] , tablefmt = 'psql' , showindex =False)) 
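
The comparison table above is assembled from one pair of score variables per model. As an alternative sketch, the models can be kept in a dictionary and scored in a loop, which keeps names and scores aligned automatically. The example uses synthetic data and only three of the classifiers; `candidates`, `rows` and `summary` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the treadmill data.
X_demo, y_demo = make_classification(
    n_samples=180, n_features=7, n_informative=4, n_classes=3, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.1, random_state=42)

# Display name -> unfitted estimator.
candidates = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),
}

# Fit each model once and record its train and test accuracy.
rows = []
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    rows.append({
        "Model": name,
        "Train": model.score(X_tr, y_tr),
        "Test": model.score(X_te, y_te),
    })

# Best test score first.
summary = pd.DataFrame(rows).sort_values("Test", ascending=False)
print(summary.to_string(index=False))
```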