<h3>About the dataset</h3>
A company called Adright has been assigned the task of identifying the profile of the typical customer for each treadmill product offered by CardioGood Fitness. The market research team decides to investigate whether there are differences across the product lines with respect to customer characteristics. The team collects data on individuals who purchased a treadmill at a CardioGood Fitness retail store during the prior three months. The data are stored in the CardioGoodFitness.csv file. The team identifies the following customer variables to study: product purchased (TM195, TM498, or TM798); gender; age, in years; education, in years; relationship status (single or partnered); annual household income ($); average number of times the customer plans to use the treadmill each week; average number of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.

<br><h3>What we need to predict</h3>
We need to predict which product the customer is likely to buy based on several parameters.

<h3>We will do the following things, in order, to reach our conclusion</h3>
<ol>
<li>Importing Libraries</li>
<li>Loading the dataset</li>
<li>Do some basic EDA</li>
<li>Model the data</li>
</ol>







<h3>Importing the libraries</h3>

In [None]:
#!pip install category_encoders
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set("notebook")
sns.set_style("darkgrid")
from scipy.special import boxcox
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
from keras.layers import Dense,Dropout
from keras.models import Sequential
from keras.utils import to_categorical

<h3>Loading the dataset</h3>

In [None]:
#Reading Data
df=pd.read_csv('../input/cardiogoodfitness/CardioGoodFitness.csv')
print("\n")
print("Few Rows")
print(df.head())
print("\n")
print("Data Dictionary")
print(df.info())
print("\n")
print("Descriptive Statistics")
print(df.describe().T)

In [None]:
df.head()

In [None]:
pd.crosstab(df['Product'],df['MaritalStatus'])

In [None]:
pd.crosstab(df['Product'],df['Gender'])

<h3>Let's do some basic EDA</h3>
We will check whether Age is normally distributed.

In [None]:
print(df['Age'].median())
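# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay draws the same picture
sns.histplot(df['Age'],kde=True)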

Age looks roughly normally distributed, although a histogram alone cannot confirm this.
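
A visual check can be backed up with a number or two. The sketch below is only an illustration: it runs on a synthetic, deliberately right-skewed sample (an assumption, not the real Age column) and computes the skewness and a Shapiro-Wilk test with SciPy.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed stand-in for an age column (NOT the real data).
rng = np.random.default_rng(0)
ages = 18 + rng.gamma(shape=2.0, scale=5.0, size=200)

# Skewness near 0 suggests symmetry; positive values mean a longer right tail.
skewness = stats.skew(ages)

# Shapiro-Wilk test: a small p-value is evidence against normality.
statistic, p_value = stats.shapiro(ages)

print(f"skewness={skewness:.2f}, W={statistic:.3f}, p={p_value:.4f}")
```

On the notebook's data the same two calls could be applied to `df['Age']`; what they report there is left for the reader to run.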

Let's check the distribution of the target variable.

In [None]:
# Pass the column by keyword; recent seaborn releases no longer accept it positionally
sns.countplot(x='Product',data=df)
df['Product'].value_counts()

In [None]:
plt.figure(figsize=(10,8))
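# Restrict to numeric columns; pandas 2.x raises on string columns such as Product and Gender
corr=df.corr(numeric_only=True)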
sns.heatmap(corr,square=True,annot=True,cmap='RdYlGn')

In [None]:
df[['Product','Usage']].groupby(['Product'],as_index=False).median().sort_values(by='Product',ascending=False)

In [None]:
df[['Product','Fitness']].groupby(['Product'],as_index=False).median().sort_values(by='Product',ascending=False)

In [None]:
print(df['Age'].min())
print(df['Age'].max())

<h3>Feature Engineering</h3>
Let's do a bit of feature engineering.<br>In this simple scenario we will split Age into different groups.

In [None]:
category=pd.cut(df['Age'],bins=[17,26,42,50],labels=['Young','Middle','Senior'])
df.insert(3,'Age Group',category)

In [None]:
pd.crosstab(df['Product'],df['Age Group'])
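
To see how the bin edges behave, here is a minimal sketch on a few made-up ages (not rows from the dataset). With its default `right=True`, `pd.cut` builds intervals that are open on the left and closed on the right, so the bins used here are (17, 26], (26, 42] and (42, 50].

```python
import pandas as pd

bins = [17, 26, 42, 50]
labels = ['Young', 'Middle', 'Senior']

# Made-up ages sitting on and around the bin edges.
ages = pd.Series([18, 26, 27, 42, 43, 50])
groups = pd.cut(ages, bins=bins, labels=labels)
print(groups.tolist())

# Values outside (17, 50] fall in no bin and become NaN.
outside = pd.cut(pd.Series([17, 51]), bins=bins, labels=labels)
print(outside.isna().tolist())
```

An age of exactly 26 lands in 'Young' and 42 in 'Middle', while anything at or below 17 or above 50 would become a missing value, which is why checking the min and max of Age first is useful.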

<h3>Modelling the data</h3>

In [None]:
X=df.drop(['Product'],axis=1)
y=df['Product']

In [None]:
X=X.drop(['Age','Education'],axis=1)

<h3>Encoding categorical variables</h3>


In [None]:
import category_encoders as ce
encoder=ce.OrdinalEncoder(cols=['Gender','Age Group','MaritalStatus'],return_df=True,verbose=None)
X=encoder.fit_transform(X)

In [None]:
X.head()
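
For readers without category_encoders installed, the idea of ordinal encoding can be sketched with scikit-learn's `OrdinalEncoder` as a stand-in. This is a swapped-in tool, not the encoder used in this notebook: it numbers categories from 0 in sorted order, so the integer codes may differ from the ones category_encoders produces. The toy frame is made up.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows with two categorical columns.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female'],
    'MaritalStatus': ['Single', 'Partnered', 'Single'],
})

# Each category is replaced by an integer code (sorted order, starting at 0).
enc = OrdinalEncoder()
encoded = pd.DataFrame(enc.fit_transform(toy), columns=toy.columns)

print(encoded)
print(enc.categories_)  # the category list behind each column's codes
```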

Converting the categorical columns to the category dtype

In [None]:
categorical_cols=['Gender','Age Group','MaritalStatus','Usage','Fitness']
for col in categorical_cols:
    X[col]=X[col].astype('category')

In [None]:
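# shuffle expects a boolean; add stratify=y if the class proportions should be preserved in both splits
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.1,random_state=42,shuffle=True)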

In [None]:
print(X_train.shape)
print(X_val.shape)
print(y_train.shape)
print(y_val.shape)
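
With such a small validation set, a plain shuffled split can leave the classes unevenly represented. The sketch below shows, on made-up labels (assumed class sizes, not the real Product counts), what the `stratify` argument of `train_test_split` does. It is an optional variation, not the split used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with a 50/30/20 class mix.
X_toy = pd.DataFrame({'feature': range(100)})
y_toy = pd.Series(['A'] * 50 + ['B'] * 30 + ['C'] * 20)

# stratify=y_toy keeps the class mix the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.1, random_state=42, stratify=y_toy
)

print(y_te.value_counts().to_dict())
```

The 10 held-out rows keep the 5/3/2 proportions of the full toy data.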

<h3>Scaling the data</h3>

In [None]:
sc=StandardScaler()
ct=ColumnTransformer([('scaler',sc,[5,6])],remainder='passthrough')
X_train=ct.fit_transform(X_train)
X_val=ct.transform(X_val)

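We used a ColumnTransformer here to scale only the Income and Miles columns (positions 5 and 6). The remaining columns are passed through unchanged and are placed after the scaled ones in the output.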
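
The column order after the transform is easy to get wrong, so here is a minimal sketch on a made-up four-column frame (column names and values are illustrative only). The transformed columns come first in the output and the passthrough columns follow, which is why the DataFrame has to be rebuilt with the names in that new order.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Made-up frame: two coded columns first, two numeric columns last.
toy = pd.DataFrame({
    'Gender': [1, 2, 1, 2],
    'Usage': [3, 4, 2, 5],
    'Income': [30000.0, 50000.0, 70000.0, 90000.0],
    'Miles': [50.0, 100.0, 150.0, 200.0],
})

# Scale only the columns at positions 2 and 3; pass the rest through.
ct = ColumnTransformer([('scaler', StandardScaler(), [2, 3])],
                       remainder='passthrough')
out = ct.fit_transform(toy)

# Output order: scaled columns first, then the untouched ones.
out_df = pd.DataFrame(out, columns=['Income', 'Miles', 'Gender', 'Usage'])
print(out_df)
```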

In [None]:
X_train=pd.DataFrame(X_train,columns=['Income','Miles','Gender','Age Group','MaritalStatus','Usage','Fitness'])

In [None]:
X_val=pd.DataFrame(X_val,columns=['Income','Miles','Gender','Age Group','MaritalStatus','Usage','Fitness'])

<h3>We have selected 3 models for our classification task</h3>
<ol>
<li>Logistic Regression</li>
<li>XGBoost</li>
<li>K Nearest Neighbours</li>
</ol>    

In [None]:
clf_log=LogisticRegression(C=0.1)
clf_log.fit(X_train,y_train)
log_y_preds=clf_log.predict(X_val)
print('Accuracy Score %0.2f'%(100*accuracy_score(y_val,log_y_preds)))
print(classification_report(y_val,log_y_preds))

Logistic regression gives us a fairly decent accuracy of 72%.

In [None]:
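# Note: recent xgboost releases expect integer class labels (0..n-1);
# encode y_train first if fit() rejects the string labels.
xg_clf=XGBClassifier(n_estimators=120,learning_rate=0.1)
xg_clf.fit(X_train,y_train)
xg_y_preds=xg_clf.predict(X_val)
print('Accuracy Score %0.2f'%(100*accuracy_score(y_val,xg_y_preds)))
# Report on the XGBoost predictions (not the logistic regression ones)
print(classification_report(y_val,xg_y_preds))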

Wow! That is a bump of about 6 percentage points over the logistic regression score.

In [None]:
knn_clf=KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train,y_train)
knn_y_preds=knn_clf.predict(X_val)
print('Accuracy Score %0.2f'%(100*accuracy_score(y_val,knn_y_preds)))
print(classification_report(y_val,knn_y_preds))

XGBoost and KNN tie, with both models getting the same accuracy score.<br>
We could bump the score up further by tuning hyperparameters.<br>
But let's not get into that here.

<h3>Let's try something new and model our data with an ANN (Artificial Neural Network)</h3>

In [None]:
model=Sequential()
model.add(Dense(64,input_dim=7,activation='relu'))
model.add(Dense(32,activation='relu'))
model.add(Dense(16,activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(3,activation='softmax'))
model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['accuracy'])
model.summary()

In [None]:
y_train.value_counts()
product_code={'TM195':1,'TM498':2,'TM798':3}
y_train=y_train.map(product_code)
y_val=y_val.map(product_code)
y_train=pd.get_dummies(y_train)
y_val=pd.get_dummies(y_val)

In [None]:
print(y_train.shape)
print(y_val.shape)
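
One thing to watch with `pd.get_dummies` on a small split: if a product happens to be missing from that split, its column is silently missing too, and the target no longer has 3 columns to match the softmax layer. The sketch below uses a made-up split and adds a `reindex` step (an addition for illustration, not part of the original cell) to guarantee one column per product.

```python
import pandas as pd

product_code = {'TM195': 1, 'TM498': 2, 'TM798': 3}

# Made-up split in which 'TM498' never appears.
y_small = pd.Series(['TM195', 'TM798', 'TM195'])
codes = y_small.map(product_code)

# get_dummies alone would give only 2 columns here; reindex restores
# the missing class as an all-zero column so the shape is always (n, 3).
one_hot = (
    pd.get_dummies(codes)
    .reindex(columns=[1, 2, 3], fill_value=0)
    .astype('float32')
)
print(one_hot.shape)
```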

In [None]:
model.fit(X_train,y_train,epochs=100,verbose=1)

In [None]:
_,accuracy=model.evaluate(X_val,y_val)
print('Accuracy is {:0.2f}%'.format(100*accuracy))

We got almost the same accuracy as XGBoost and KNN. I think this may be for two reasons:
* There are not many features from which to learn a pattern
* There are not many samples to train on
