## Objective 
The purpose of this project is to build a model that predicts whether a customer will repurchase a product. To achieve this, I will train several classification models and tune their hyperparameters on a dataset containing customer information and purchase history. An accurate model would help businesses identify likely repeat customers and implement strategies to retain them.

## Data Definition
I analysed the data provided by the company. As the preview below shows, it is indexed by `ID` and has 16 columns: the binary target `Target` (1 for repurchase, 0 for no repurchase), demographic fields (`age_band`, `gender`), vehicle fields (`car_model`, `car_segment`, `age_of_vehicle_years`), and ten servicing and usage measures such as `total_services`, `mth_since_last_serv`, and `annualised_mileage`. The `age_band` and `gender` columns contain missing values that must be handled before modelling.

### Load data

In [3]:
import pandas as pd
df = pd.read_csv('./final_assignment_data.csv', index_col=0)
df.head()

ID,Target,age_band,gender,car_model,car_segment,age_of_vehicle_years,sched_serv_warr,non_sched_serv_warr,sched_serv_paid,non_sched_serv_paid,total_paid_services,total_services,mth_since_last_serv,annualised_mileage,num_dealers_visited,num_serv_dealer_purchased
1,0,3. 35 to 44,Male,model_1,LCV,9,2,10,3,7,5,6,9,8,10,4
2,0,,,model_2,Small/Medium,6,10,3,10,4,9,10,6,10,7,10
3,0,,Male,model_3,Large/SUV,9,10,9,10,9,10,10,7,10,6,10
5,0,,,model_3,Large/SUV,5,8,5,8,4,5,6,4,10,9,7
6,0,,Female,model_2,Small/Medium,8,9,4,10,7,9,8,5,4,4,9
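
Before cleaning, it can help to confirm the dataset's size, missing values, and class balance directly rather than reading them off the preview. The sketch below assumes a DataFrame shaped like the one above; `summarise` and the toy frame are hypothetical and only illustrate the checks.

```python
import pandas as pd

def summarise(df: pd.DataFrame, target: str = "Target") -> dict:
    """Return size, missing-value counts, and positive-class rate for a DataFrame."""
    return {
        "shape": df.shape,
        "missing_per_column": df.isna().sum().to_dict(),
        "positive_rate": float(df[target].mean()),
    }

# Toy stand-in for the real data (hypothetical values)
toy = pd.DataFrame(
    {
        "Target": [0, 0, 0, 1],
        "age_band": ["3. 35 to 44", None, None, None],
        "gender": ["Male", None, "Male", "Female"],
    }
)

summary = summarise(toy)
print(summary)
```

On the real data, the same call would be `summarise(df)` right after `pd.read_csv`.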


### Data Cleaning

In [4]:
from sklearn.preprocessing import LabelEncoder

# Drop the 'age_band' column
df.drop('age_band', axis=1, inplace=True)

# Encode the 'gender' column as integers; missing values are cast to the
# string 'nan' and therefore receive their own label
encoder = LabelEncoder()
df['gender'] = encoder.fit_transform(df['gender'].astype(str))

# One-hot encode the 'car_model' and 'car_segment' columns
df = pd.get_dummies(df, columns=['car_model', 'car_segment'])

# See sample data
df.head()

ID,Target,gender,age_of_vehicle_years,sched_serv_warr,non_sched_serv_warr,sched_serv_paid,non_sched_serv_paid,total_paid_services,total_services,mth_since_last_serv,...,car_model_model_4,car_model_model_5,car_model_model_6,car_model_model_7,car_model_model_8,car_model_model_9,car_segment_LCV,car_segment_Large/SUV,car_segment_Other,car_segment_Small/Medium
1,0,1,9,2,10,3,7,5,6,9,...,False,False,False,False,False,False,True,False,False,False
2,0,2,6,10,3,10,4,9,10,6,...,False,False,False,False,False,False,False,False,False,True
3,0,1,9,10,9,10,9,10,10,7,...,False,False,False,False,False,False,False,True,False,False
5,0,2,5,8,5,8,4,5,6,4,...,False,False,False,False,False,False,False,True,False,False
6,0,0,8,9,4,10,7,9,8,5,...,False,False,False,False,False,False,False,False,False,True
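
In the sample output, label encoding maps Female, Male, and missing `gender` values to 0, 1, and 2, which imposes an arbitrary ordering on the three groups. One possible alternative, sketched here on a hypothetical toy frame, is to fill missing values with an explicit `Unknown` label and one-hot encode the column. This is a variation to consider, not what the notebook above does.

```python
import pandas as pd

# Hypothetical toy data with missing gender values
toy = pd.DataFrame({"gender": ["Male", None, "Male", None, "Female"]})

# Make missingness an explicit category instead of an integer code
toy["gender"] = toy["gender"].fillna("Unknown")

# One indicator column per category, so no ordering is implied
encoded = pd.get_dummies(toy, columns=["gender"])
print(encoded.columns.tolist())
```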


### Split Data

In [6]:
from sklearn.model_selection import train_test_split

X = df.drop('Target', axis=1)
Y = df['Target']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
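
The split above is random and unseeded, so the results change from run to run, and with a rare positive class the train and test sets can end up with slightly different class ratios. A minimal sketch of a stratified, seeded split on synthetic data follows; the array names and the seed value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data with 3% positives (hypothetical)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
Y_toy = np.array([1] * 30 + [0] * 970)

# stratify keeps the class ratio equal in both parts;
# random_state makes the split reproducible
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.2, stratify=Y_toy, random_state=42
)

print("Train positive rate:", Y_tr.mean())
print("Test positive rate:", Y_te.mean())
```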

### Function to See Model Performance

In [9]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_model(gcv):
    # Print the best hyperparameters
    print("Best hyperparameters:", gcv.best_params_)
    
    # Predict the labels of the test set using the best estimator
    Y_pred = gcv.best_estimator_.predict(X_test)
    
    # Evaluate the performance of the model using various metrics
    print("Accuracy:", accuracy_score(Y_test, Y_pred))
    print("Precision:", precision_score(Y_test, Y_pred))
    print("Recall:", recall_score(Y_test, Y_pred))
    print("F1-Score:", f1_score(Y_test, Y_pred))
    

## Logistic Regression

In [None]:
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression

# Logistic Regression Model Parameters
cv_params = {
    "C": [0.01, 0.1, 0.5, 1, 10, 50],
    "max_iter": [100, 500]
}

# Logistic Regression Model
lr = LogisticRegression()

# 5 Fold Grid Search Cross Validation
gcv = GridSearchCV(lr, cv_params, cv=5)

# Fit the GridSearchCV object to the training data
gcv.fit(X_train, Y_train)

In [11]:
evaluate_model(gcv)

Best hyperparameters: {'C': 10, 'max_iter': 500}
Accuracy: 0.9779579716765646
Precision: 0.8395061728395061
Recall: 0.19738751814223512
F1-Score: 0.3196239717978848
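
The recall of about 0.20 suggests that selecting hyperparameters by accuracy, the `GridSearchCV` default for classifiers, favours a model that rarely predicts a repurchase. One variation worth trying, sketched here on synthetic data, scales the features, weights the classes inversely to their frequency, and selects by F1 instead. Whether it helps on the real data is untested; the names `pipe` and `gcv_f1` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the real features
X_demo, Y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)

# Scaling happens inside the pipeline, so each CV fold is scaled
# using only its own training portion
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("lr", LogisticRegression(class_weight="balanced", max_iter=500)),
])

# Parameters of a pipeline step are addressed as <step>__<param>
param_grid = {"lr__C": [0.01, 0.1, 1, 10]}

gcv_f1 = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
gcv_f1.fit(X_demo, Y_demo)
print(gcv_f1.best_params_, round(gcv_f1.best_score_, 3))
```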


## K Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# KNN Model Parameters
cv_params = {
    "n_neighbors": [3, 5, 7],
}

# KNN Model
knn = KNeighborsClassifier()

# 5 Fold Grid Search Cross Validation
gcv = GridSearchCV(knn, cv_params, cv=5)

# Fit the GridSearchCV object to the training data
gcv.fit(X_train, Y_train)

In [15]:
evaluate_model(gcv)

Best hyperparameters: {'n_neighbors': 3}
Accuracy: 0.9883508451347647
Precision: 0.9172113289760349
Recall: 0.6110304789550073
F1-Score: 0.7334494773519163


## Support Vector Machine

In [16]:
from sklearn.svm import SVC

# Train SVM model with radial basis function kernel
svm = SVC(kernel="rbf")
svm.fit(X_train, Y_train)

# Make predictions on test set
Y_pred = svm.predict(X_test)

# Evaluate the performance of the model using various metrics
print("Accuracy:", accuracy_score(Y_test, Y_pred))
print("Precision:", precision_score(Y_test, Y_pred))
print("Recall:", recall_score(Y_test, Y_pred))
print("F1-Score:", f1_score(Y_test, Y_pred))

Accuracy: 0.9873991167961017
Precision: 0.9475
Recall: 0.5500725689404935
F1-Score: 0.6960514233241506


## Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
# RFC Model Parameters
cv_params = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 5, 10]
}

# Random Forest Classifier Model
rfc = RandomForestClassifier()

# 5 Fold Grid Search Cross Validation
gcv = GridSearchCV(rfc, cv_params, cv=5)

# Fit the GridSearchCV object to the training data
gcv.fit(X_train, Y_train)

In [20]:
evaluate_model(gcv)

Best hyperparameters: {'max_depth': None, 'n_estimators': 150}
Accuracy: 0.9932998324958124
Precision: 0.9706422018348624
Recall: 0.7677793904208998
F1-Score: 0.8573743922204214


## Key Findings

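Test-set accuracy and F1-score of each model:
 - Logistic Regression: accuracy 97.8%, F1 0.32
 - KNN: accuracy 98.8%, F1 0.73
 - SVM: accuracy 98.7%, F1 0.70
 - Random Forest: accuracy 99.3%, F1 0.86

Accuracy alone is misleading here: Logistic Regression reaches 97.8% accuracy while recalling only about 20% of repurchasing customers, which indicates that the positive class is rare. Of all these models, `Random Forest` has the best `F1 Score`, along with the highest `Precision` (0.97) and `Recall` (0.77), so it provides the best balance of the two.   
  
`Random Forest Classifier` is a good choice of model for this purpose. 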
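
As a consistency check, the F1-scores can be recomputed from the precision and recall values printed in the sections above, since F1 is their harmonic mean. The helper name `f1_from` is introduced here for illustration only.

```python
# (precision, recall) pairs copied from the outputs reported above
reported = {
    "Logistic Regression": (0.8395061728395061, 0.19738751814223512),
    "KNN": (0.9172113289760349, 0.6110304789550073),
    "SVM": (0.9475, 0.5500725689404935),
    "Random Forest": (0.9706422018348624, 0.7677793904208998),
}

def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {name: f1_from(p, r) for name, (p, r) in reported.items()}
best_model = max(f1_scores, key=f1_scores.get)

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.4f}")
print("Best by F1:", best_model)
```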

## Next Steps

### 1. Model Comparison and Ensemble Methods:
**Consider Other Algorithms:** Although Random Forest is performing well, exploring other algorithms (e.g., Gradient Boosting, XGBoost) might yield additional insights or better results.  
**Ensemble Techniques:** Random Forest is itself a bagging ensemble; combining it with other model types (e.g., through voting or stacking) could potentially improve overall performance and robustness.  
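
A minimal sketch of one such combination, assuming a soft-voting ensemble of a Random Forest and scikit-learn's `GradientBoostingClassifier` on synthetic data. The estimator choices and settings are illustrative, and nothing here has been run on the project's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features
X_demo, Y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, stratify=Y_demo, random_state=0
)

# Soft voting averages the predicted class probabilities of the members
ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("gb", GradientBoostingClassifier(random_state=0)),
    ],
    voting="soft",
)
ensemble.fit(X_tr, Y_tr)

ensemble_f1 = f1_score(Y_te, ensemble.predict(X_te))
print("Ensemble F1:", ensemble_f1)
```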

### 2. Error Analysis:
**Confusion Matrix:** Analyze the confusion matrix to understand the types of errors the model is making. This can help identify areas for improvement, such as addressing class imbalance or collecting more data for specific categories.  
**Error Patterns:** Look for patterns in the errors. Are there specific instances where the model consistently fails? This can help uncover biases or limitations in the data or model.  
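
As a small illustration of the confusion-matrix step, the sketch below uses hypothetical labels rather than the project's predictions. For a binary problem, scikit-learn orders the flattened matrix as TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

A large FN count relative to TP, as in this toy case, corresponds to the low-recall pattern seen for several of the models above.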

### 3. Model Deployment and Monitoring:
**Deployment Plan:** Develop a plan for deploying the model into a production environment, considering factors like scalability, performance, and maintainability.  
**Monitoring:** Implement monitoring to track the model's performance over time and detect any degradation in performance. This includes monitoring metrics like accuracy, precision, recall, and F1-score.  
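
One simple way to frame the monitoring check is to recompute the metrics on each new batch of labelled data and compare F1 against a baseline. The function names, the tolerance, and the batch below are all assumptions for illustration; the baseline of 0.857 is the Random Forest F1 reported above, rounded.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

def batch_metrics(y_true, y_pred) -> dict:
    """Compute precision, recall, and F1 for one batch of labelled predictions."""
    return {
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

def has_degraded(current: dict, baseline_f1: float, tolerance: float = 0.05) -> bool:
    """Flag a batch whose F1 falls more than `tolerance` below the baseline."""
    return bool(current["f1"] < baseline_f1 - tolerance)

# Hypothetical batch in which the model misses most positives
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

current = batch_metrics(y_true, y_pred)
alert = has_degraded(current, baseline_f1=0.857)
print(current, "alert:", alert)
```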

### 4. Explainability:
**Model Interpretability:** If explainability is crucial for your application, explore techniques to understand how the Random Forest model makes its decisions. This can help build trust in the model and identify potential biases.
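
One commonly used technique is permutation importance, which measures how much a score drops when a single feature's values are shuffled. The sketch below applies it to a Random Forest on synthetic data; the feature labels and settings are hypothetical, and the real feature names would come from the columns of `X`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features
X_demo, Y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, Y_tr)

# Shuffle each feature on held-out data and record the mean drop in F1
result = permutation_importance(
    rf, X_te, Y_te, scoring="f1", n_repeats=5, random_state=0
)

# Feature indices ordered from most to least important
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f}")
```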