# Source of this dataset

 https://www.kaggle.com/denisadutca/customer-behaviour

# About the data

This dataset describes 400 customers of a company. Each record contains a unique user ID, the customer's gender, age and estimated salary, and whether the customer purchased the product. We want to predict whether a given customer purchased the product or not.

# Importing Modules

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Loading the data

In [None]:
customer_df = pd.read_csv("../input/customer-behaviour/Customer_Behaviour.csv")

In [None]:
customer_df.head(5)

In [None]:
customer_df.info()

# Encoding Categorical Data

In [None]:
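# One-hot encode Gender and keep only the "Male" indicator (1 = male, 0 = female).
# astype(int) keeps the column numeric on pandas versions that return booleans.
customer_df["Male"] = pd.get_dummies(customer_df["Gender"])["Male"].astype(int)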

# Drop Unnecessary Features

In [None]:
# Drop both columns in a single call
customer_df = customer_df.drop(columns=["Gender", "User ID"])

# Visualizing Data

In [None]:
sns.pairplot(customer_df)

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(data=customer_df, x="Age", y="EstimatedSalary", hue="Purchased")

**Observations:**
- Customers over the age of 40 tend to purchase the product irrespective of their salary
- Customers with a higher estimated salary tend to purchase the product irrespective of their age
- Customers with a lower salary (<80000) and an age below 40 did not buy the product
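
The observations above can also be checked numerically by grouping customers into age and salary bands and computing the purchase rate per band. The sketch below is only an illustration: the rows are made up (they are not taken from the Kaggle file), the helper name `purchase_rate_by_band` is hypothetical, and the cut-offs of 40 years and 80000 are simply the thresholds mentioned in the observations.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these are NOT taken from the Kaggle dataset.
toy_df = pd.DataFrame({
    "Age": [25, 30, 35, 45, 50, 55, 28, 38],
    "EstimatedSalary": [30000, 60000, 120000, 40000, 90000, 70000, 140000, 50000],
    "Purchased": [0, 0, 1, 1, 1, 1, 1, 0],
})

def purchase_rate_by_band(df, age_cut=40, salary_cut=80000):
    """Return the mean of `Purchased` for each (age band, salary band) pair."""
    bands = df.assign(
        age_band=np.where(df["Age"] > age_cut, "older", "younger"),
        salary_band=np.where(df["EstimatedSalary"] >= salary_cut, "higher", "lower"),
    )
    return bands.groupby(["age_band", "salary_band"])["Purchased"].mean()

rates = purchase_rate_by_band(toy_df)
print(rates)
```

Calling `purchase_rate_by_band(customer_df)` on the real data would give the actual rates for each band.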

In [None]:
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(data=customer_df, x="Age", y="EstimatedSalary", hue="Male")

**Observation:** Customers of both genders have similar salaries in this data

In [None]:
sns.countplot(data=customer_df, x="EstimatedSalary", hue="Purchased")

In [None]:
sns.countplot(data=customer_df, x="Male", hue="Purchased")

**Observations:**
- The data is well balanced
- Male customers have a slightly lower purchase rate

# Splitting the data for training and testing

In [None]:
X = customer_df.drop("Purchased", axis=1).values
y = customer_df["Purchased"].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=47)

# Feature Scaling

KNN requires feature scaling because it uses the Euclidean distance between data points to find the nearest neighbors.
Euclidean distance is sensitive to magnitude.
Features with large magnitudes weigh more in the distance than features with small magnitudes.
Ref Link : https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
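
To make the effect of magnitude concrete, here is a small sketch with three made-up customers. The feature values and the per-feature scales (10 years for age, 34000 for salary) are assumptions chosen for illustration, not statistics computed from the dataset.

```python
import numpy as np

# Three hypothetical customers as [Age, EstimatedSalary]; values are made up.
a = np.array([25.0, 50000.0])
b = np.array([60.0, 51000.0])   # very different age, similar salary
c = np.array([26.0, 58000.0])   # similar age, different salary

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))

# On raw features the salary gap dominates, so b looks closer to a than c does.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

# Assumed per-feature standard deviations (illustrative, not from the data).
scale = np.array([10.0, 34000.0])
scaled_ab = euclidean(a / scale, b / scale)
scaled_ac = euclidean(a / scale, c / scale)

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

On the raw values the 35-year age gap between `a` and `b` barely registers next to the salary differences. After dividing by the assumed scales, `c` becomes the nearer neighbor of `a`.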

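Standardization is not strictly required for plain (unregularized) logistic regression.
There, the main goal of standardizing features is to help the optimization technique converge.
If you use logistic regression with LASSO (L1) or ridge (L2) regularization (as the Weka Logistic class does), you should standardize, because the penalty depends on the scale of the coefficients. Note that scikit-learn's `LogisticRegression` applies L2 regularization by default.
Ref Link : https://stats.stackexchange.com/questions/48360/is-standardization-needed-before-fitting-logistic-regression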
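
One way to combine standardization with a regularized logistic regression is to chain both steps in a scikit-learn pipeline. The following is a minimal sketch on synthetic data: the generated columns only mimic `Age` and `EstimatedSalary`, the labelling rule is invented, and the names `X_demo`, `y_demo` and `scaled_logreg` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: columns mimic [Age, EstimatedSalary]; not the Kaggle data.
rng = np.random.default_rng(0)
age = rng.integers(18, 61, size=200)
salary = rng.integers(15, 151, size=200) * 1000
X_demo = np.column_stack([age, salary]).astype(float)
# Invented labelling rule so that the classes are learnable.
y_demo = ((age > 40) | (salary > 90000)).astype(int)

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit().
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression())
scaled_logreg.fit(X_demo, y_demo)

print("Training accuracy:", scaled_logreg.score(X_demo, y_demo))
```

With a pipeline, `predict` and `score` apply the fitted scaler automatically, which avoids keeping separate scaled and unscaled copies of the data.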

Decision Tree and Random Forest do not require feature scaling either.
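
The claim about trees can be checked with a small experiment: fit one decision tree on raw features and another on standardized features, then compare their predictions. This is a sketch on synthetic data with an invented labelling rule. All names (`X_fit`, `X_new`, `demo_scaler`, `tree_raw`, `tree_scaled`) are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data shaped like [Age, EstimatedSalary]; not the Kaggle data.
rng = np.random.default_rng(1)
X_fit = np.column_stack(
    [rng.integers(18, 61, 150), rng.integers(15, 151, 150) * 1000]
).astype(float)
y_fit = ((X_fit[:, 0] > 40) | (X_fit[:, 1] > 90000)).astype(int)
X_new = np.column_stack(
    [rng.integers(18, 61, 50), rng.integers(15, 151, 50) * 1000]
).astype(float)

demo_scaler = StandardScaler().fit(X_fit)

# Same random_state so both trees break ties between equally good splits identically.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(
    demo_scaler.transform(X_fit), y_fit
)

same = np.array_equal(
    tree_raw.predict(X_new),
    tree_scaled.predict(demo_scaler.transform(X_new)),
)
print("Identical predictions:", same)
```

Tree splits compare one feature against a threshold, and standardization preserves the ordering of values within each feature, so the two trees should partition the data in the same way.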

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Evaluating Model Performance

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
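def evaluate_model_performance(y_test, y_pred):
  """Print the accuracy and the confusion matrix for a set of predictions."""
  print("Accuracy:", accuracy_score(y_test, y_pred))
  print("Confusion matrix:")
  print(confusion_matrix(y_test, y_pred))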

# K-Nearest Neighbors Classifier

First we need to choose the right value of K for fitting the model. We will use the elbow method: fit a model for each K from 1 to 39, plot the test error against K, and pick the value of K that minimizes the error.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
error_rate = []

for i in range(1,40):
    model = KNeighborsClassifier(n_neighbors=i)
    model.fit(X_train_scaled,y_train)
    pred_i = model.predict(X_test_scaled)
    error_rate.append(np.mean(pred_i != y_test))
    
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

As per the graph, using k=5 should produce the minimum error.
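
The graph above is based on the test set. A common alternative, which is not what this notebook does, is to select K by cross-validation on the training data only, so that the test set stays untouched until the final evaluation. The sketch below runs on synthetic data with an invented labelling rule. The names `X_demo`, `y_demo`, `cv_error` and `best_k` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training split; columns mimic [Age, EstimatedSalary].
rng = np.random.default_rng(2)
X_demo = np.column_stack(
    [rng.integers(18, 61, 300), rng.integers(15, 151, 300) * 1000]
).astype(float)
y_demo = ((X_demo[:, 0] > 40) | (X_demo[:, 1] > 90000)).astype(int)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    # Scaling happens inside the pipeline, so each fold is scaled on its own training part.
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # 5-fold cross-validated accuracy, converted to an error rate.
    accuracy = cross_val_score(knn, X_demo, y_demo, cv=5).mean()
    cv_error.append(1 - accuracy)

best_k = k_values[int(np.argmin(cv_error))]
print("K with the lowest cross-validated error:", best_k)
```

Plotting `cv_error` against `k_values` gives a curve that can be read in the same way as the error-rate graph.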

In [None]:
model = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

evaluate_model_performance(y_test, y_pred)

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)

# Decision Tree Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)

# Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

evaluate_model_performance(y_test, y_pred)

# Conclusion

We will use KNeighborsClassifier for this data, as it gave the most accurate predictions (95% accuracy).