# **Experiment 1: Logistic Regression**


In this experiment, we build a binary classifier using logistic regression and k-nearest neighbors (KNN).

The steps are:
1.   Load and explore dataset
2.   Data Preparation
3.   Feature Scaling
4.   Data Splitting
5.   Assess model baseline
6.   Train a logistic regression model
7.   Train Logistic Regression Classifier with L1 and L2 Regularisation
8.   Build a KNN model using euclidean distance
9.   Build a KNN model using euclidean distance and 55 neighbors
10.   Assess the Best Model on the Testing Set

### 1. Load and Explore Dataset

In [8]:
# Import the libraries needed for loading and exploring the data
import pandas as pd
import numpy as np

**[1.1]** Loading Dataset

In [9]:
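# Shareable Google Drive link to the dataset
url='https://drive.google.com/file/d/177p-Vaa2__BtaxNCmd4YGBv7zfENUNmy/view?usp=sharing'
# Rebuild it as a direct-download URL using the file ID (second-to-last path segment)
url='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_csv(url, index_col = False)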

**[1.2]** Dataset Exploration

In [10]:
df.head()

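| | ID | Target | age_band | gender | car_model | car_segment | age_of_vehicle_years | sched_serv_warr | non_sched_serv_warr | sched_serv_paid | non_sched_serv_paid | total_paid_services | total_services | mth_since_last_serv | annualised_mileage | num_dealers_visited | num_serv_dealer_purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3. 35 to 44 | Male | model_1 | LCV | 9 | 2 | 10 | 3 | 7 | 5 | 6 | 9 | 8 | 10 | 4 |
| 1 | 2 | 0 | NaN | NaN | model_2 | Small/Medium | 6 | 10 | 3 | 10 | 4 | 9 | 10 | 6 | 10 | 7 | 10 |
| 2 | 3 | 0 | NaN | Male | model_3 | Large/SUV | 9 | 10 | 9 | 10 | 9 | 10 | 10 | 7 | 10 | 6 | 10 |
| 3 | 5 | 0 | NaN | NaN | model_3 | Large/SUV | 5 | 8 | 5 | 8 | 4 | 5 | 6 | 4 | 10 | 9 | 7 |
| 4 | 6 | 0 | NaN | Female | model_2 | Small/Medium | 8 | 9 | 4 | 10 | 7 | 9 | 8 | 5 | 4 | 4 | 9 |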


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131337 entries, 0 to 131336
Data columns (total 17 columns):
 #   Column                     Non-Null Count   Dtype 
---  ------                     --------------   ----- 
 0   ID                         131337 non-null  int64 
 1   Target                     131337 non-null  int64 
 2   age_band                   18962 non-null   object
 3   gender                     62029 non-null   object
 4   car_model                  131337 non-null  object
 5   car_segment                131337 non-null  object
 6   age_of_vehicle_years       131337 non-null  int64 
 7   sched_serv_warr            131337 non-null  int64 
 8   non_sched_serv_warr        131337 non-null  int64 
 9   sched_serv_paid            131337 non-null  int64 
 10  non_sched_serv_paid        131337 non-null  int64 
 11  total_paid_services        131337 non-null  int64 
 12  total_services             131337 non-null  int64 
 13  mth_since_last_serv        131337 non-null  

In [12]:
print( f" Percentage Null values - 'age_band' : {100 * df['age_band'].isna().sum() / len(df)}" )
print( f" Percentage Null values - 'gender' : {100 * df['gender'].isna().sum() / len(df)}" )

 Percentage Null values - 'age_band' : 85.56233201611123
 Percentage Null values - 'gender' : 52.77111552723147
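The same check can be run for every column at once. The sketch below uses a small hypothetical frame (`toy`, with made-up values) in place of `df`; `isna().mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative only
toy = pd.DataFrame({
    "age_band": ["3. 35 to 44", np.nan, np.nan, np.nan],
    "gender": ["Male", np.nan, "Male", "Female"],
    "car_segment": ["LCV", "Small/Medium", "Large/SUV", "Other"],
})

# Fraction of missing values per column, expressed as a percentage
null_pct = (toy.isna().mean() * 100).sort_values(ascending=False)
print(null_pct)
```

Sorting in descending order puts the columns with the most missing data first, which makes candidates for dropping easy to spot.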


In [13]:
df.describe()

| | ID | Target | age_of_vehicle_years | sched_serv_warr | non_sched_serv_warr | sched_serv_paid | non_sched_serv_paid | total_paid_services | total_services | mth_since_last_serv | annualised_mileage | num_dealers_visited | num_serv_dealer_purchased |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 | 131337.0 |
| mean | 77097.38418 | 0.026809 | 5.493022 | 5.4525 | 5.472517 | 5.452287 | 5.49705 | 5.481692 | 5.454967 | 5.469807 | 5.502836 | 5.485438 | 5.480778 |
| std | 44501.636704 | 0.161525 | 2.843299 | 2.884328 | 2.870665 | 2.886528 | 2.878699 | 2.880408 | 2.875961 | 2.859756 | 2.854896 | 2.876772 | 2.867524 |
| min | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 25% | 38563.0 | 0.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| 50% | 77132.0 | 0.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| 75% | 115668.0 | 0.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 | 8.0 |
| max | 154139.0 | 1.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 |


In [14]:
for cols in df.columns:
    print(cols)
    print(df[cols].unique())

ID
[     1      2      3 ... 154137 154138 154139]
Target
[0 1]
age_band
['3. 35 to 44' nan '1. <25' '4. 45 to 54' '2. 25 to 34' '7. 75+'
 '5. 55 to 64' '6. 65 to 74']
gender
['Male' nan 'Female']
car_model
['model_1' 'model_2' 'model_3' 'model_5' 'model_6' 'model_4' 'model_7'
 'model_8' 'model_9' 'model_10' 'model_11' 'model_13' 'model_12'
 'model_14' 'model_15' 'model_16' 'model_17' 'model_18' 'model_19']
car_segment
['LCV' 'Small/Medium' 'Large/SUV' 'Other']
age_of_vehicle_years
[ 9  6  5  8  7  1  3  4 10  2]
sched_serv_warr
[ 2 10  8  9  4  1  3  7  5  6]
non_sched_serv_warr
[10  3  9  5  4  8  1  6  2  7]
sched_serv_paid
[ 3 10  8  5  2  6  1  4  9  7]
non_sched_serv_paid
[ 7  4  9  3  1  2  6  5 10  8]
total_paid_services
[ 5  9 10  6  8  1  2  7  3  4]
total_services
[ 6 10  8  4  2  1  3  5  9  7]
mth_since_last_serv
[ 9  6  7  4  5  8  1  3 10  2]
annualised_mileage
[ 8 10  4  5  6  1  7  3  9  2]
num_dealers_visited
[10  7  6  9  4  5  2  1  3  8]
num_serv_dealer_purchased
[

### 2. Data Preparation

In [15]:
df_cleaned = df.copy()

We drop the **ID** column because it is only a row identifier and carries no predictive information.

From the exploration above, about 85.6% of the **age_band** values are null. One option would be to impute the missing entries from the observed bands, but that would mean guessing the age for roughly 85% of the rows, which would likely mislead the model. It is therefore probably safest to drop the column, even though age is plausibly an important factor in how likely an individual is to buy a car.

Similarly, about 52.8% of the **gender** values are missing, and there is no reliable way to infer them from the other columns, so this column is dropped as well.

In [16]:
## Dropping columns ID, gender and age_band
df_cleaned.drop(['ID','gender', 'age_band'], axis = 1, inplace = True)

The **car_model** feature is categorical and nominal. One could argue for treating it as ordinal because the labels are numbered, but it would be bold to assume that the numbers were assigned chronologically or relate to car performance. It is therefore treated as a nominal variable and one-hot encoded.

In [17]:
# Apply one-hot encoding to the nominal column 'car_model'
df_car_model = pd.get_dummies(df_cleaned['car_model'])
df_car_model

Unnamed: 0,model_1,model_10,model_11,model_12,model_13,model_14,model_15,model_16,model_17,model_18,model_19,model_2,model_3,model_4,model_5,model_6,model_7,model_8,model_9
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
131332,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
131333,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
131334,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
131335,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
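To make the encoding concrete, here is a minimal sketch on a hypothetical four-row column (`toy` is not part of the dataset). It shows that `pd.get_dummies` creates one indicator column per category, ordered lexicographically (so `model_10` comes before `model_2`), and that every row has exactly one indicator set.

```python
import pandas as pd

# Hypothetical stand-in for df_cleaned['car_model']
toy = pd.DataFrame({"car_model": ["model_1", "model_2", "model_10", "model_2"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(toy["car_model"], dtype=int)
print(dummies)

# Each row belongs to exactly one category, so each row sums to 1
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())
```

Because the indicator columns always sum to one, they are collinear with the model's intercept; passing `drop_first=True` is one possible way to remove that redundancy, although the notebook keeps all columns.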


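We will map the **car_segment** categories to integers, treating the variable as ordinal. The ordering used below (Other < LCV < Small/Medium < Large/SUV) is an assumption about the segments, not something given in the data.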

In [18]:
ord_cols = ['car_segment']
for col in ord_cols:
  print(col)
  print(df_cleaned[col].unique()) 

car_segment
['LCV' 'Small/Medium' 'Large/SUV' 'Other']


In [19]:
car_segment_mapper = {
    "Other": 0, 
    "LCV": 1, 
    "Small/Medium": 2,
    "Large/SUV": 3
}
car_segment_mapper

{'Other': 0, 'LCV': 1, 'Small/Medium': 2, 'Large/SUV': 3}

In [20]:
df_cleaned["car_segment"] = df_cleaned["car_segment"].replace(car_segment_mapper)
df_cleaned["car_segment"]

0         1
1         2
2         3
3         3
4         2
         ..
131332    3
131333    3
131334    3
131335    3
131336    1
Name: car_segment, Length: 131337, dtype: int64
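As an alternative sketch, the same mapping can be applied with `Series.map`, which returns `NaN` for any category missing from the dictionary and therefore makes unmapped values easy to detect. The `segments` series below is a hypothetical stand-in for `df_cleaned["car_segment"]`; recent pandas versions may also emit a downcasting warning for `replace` in this situation, which `map` avoids.

```python
import pandas as pd

car_segment_mapper = {"Other": 0, "LCV": 1, "Small/Medium": 2, "Large/SUV": 3}

# Hypothetical stand-in for df_cleaned["car_segment"]
segments = pd.Series(["LCV", "Small/Medium", "Large/SUV", "Other"])

# map() leaves NaN where a category has no entry in the dictionary
encoded = segments.map(car_segment_mapper)

# Fail loudly if any category was not covered by the mapper
if encoded.isna().any():
    raise ValueError("Unmapped car_segment categories found")

encoded = encoded.astype(int)
print(encoded.tolist())
```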

Now we merge the encoded data frames into a variable called X, which will be our refined dataset.

In [21]:
X = pd.concat([df_cleaned, df_car_model], axis=1)
## Drop the variable "car_model" since now we have the encoded columns replacing it
X.drop('car_model', axis = 1, inplace = True)

Let's extract the target variable into y

In [22]:
y = X.pop('Target')

### 3. Scale Data

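In the code below we standardise the dataset with `StandardScaler`, so that each feature has zero mean and unit variance. Note that the scaler is fitted on the full dataset here, before the split in the next section, so statistics from the validation and test rows influence the scaling.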

In [23]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X = scaler.transform(X)
X

array([[-1.64100321,  1.23342356, -1.19699048, ..., -0.25750093,
        -0.22712936, -0.08732716],
       [-0.28324066,  0.17830704,  1.5766297 , ..., -0.25750093,
        -0.22712936, -0.08732716],
       [ 1.07452189,  1.23342356,  1.5766297 , ..., -0.25750093,
        -0.22712936, -0.08732716],
       ...,
       [ 1.07452189, -0.52510398, -0.50358543, ..., -0.25750093,
        -0.22712936, -0.08732716],
       [ 1.07452189, -1.22851499, -1.543693  , ..., -0.25750093,
        -0.22712936, -0.08732716],
       [-1.64100321,  0.17830704, -1.543693  , ..., -0.25750093,
        -0.22712936, -0.08732716]])
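Since the scaler above sees every row before the data is split, one common alternative is to split first and fit the scaler on the training portion only. The sketch below illustrates that order on synthetic data; `X_toy`, `y_toy` and the split variables are hypothetical and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on a 1-10 scale, loosely mimicking the dataset's columns
rng = np.random.default_rng(8)
X_toy = rng.integers(1, 11, size=(1000, 3)).astype(float)
y_toy = rng.integers(0, 2, size=1000)

# Split first, so held-out rows never contribute to the scaling statistics
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=8)

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)  # reuses the training statistics

print(X_tr_s.mean(axis=0).round(3), X_tr_s.std(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the held-out columns are only approximately standardised, which is expected.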

### 4. Data Splitting

In [24]:
y.value_counts(normalize=True)

0    0.973191
1    0.026809
Name: Target, dtype: float64

In [25]:
from sklearn.model_selection import train_test_split

In [26]:
X_data, X_test, y_data, y_test = train_test_split(X, y, test_size=0.2, random_state=8)

In [27]:
y_test.value_counts(normalize=True)

0    0.971867
1    0.028133
Name: Target, dtype: float64

In [28]:
X_train, X_val, y_train, y_val = train_test_split(X_data, y_data, test_size=0.2, random_state=8)

In [29]:
y_train.value_counts(normalize=True)

0    0.973494
1    0.026506
Name: Target, dtype: float64

In [30]:
y_val.value_counts(normalize=True)

0    0.973637
1    0.026363
Name: Target, dtype: float64
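The share of class 1 drifts slightly between the splits above (about 2.81% in test, 2.65% in train, 2.64% in validation). A possible refinement, sketched here on synthetic labels, is to pass `stratify` to `train_test_split` so that each split keeps roughly the same class ratio. The arrays `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with 2.7% positives, similar in spirit to the Target column
n = 10000
y_toy = np.zeros(n, dtype=int)
y_toy[:270] = 1
X_toy = np.arange(n).reshape(-1, 1)

# stratify keeps the class proportions (approximately) equal across splits
X_data, X_test, y_data, y_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8, stratify=y_toy
)
X_train, X_val, y_train, y_val = train_test_split(
    X_data, y_data, test_size=0.2, random_state=8, stratify=y_data
)

for name, part in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, round(part.mean(), 4))
```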

## 5. Assess Baseline Model

**[5.1]** Finding the mode of the target variable

In [31]:
y_mode = y.mode()

**[5.2]** Creating a numpy array called y_base filled with the mode

In [32]:
y_base = np.full(y_train.shape, y_mode)

In [33]:
# Unit Tests
assert isinstance(y_base, np.ndarray)
assert y_base.shape == y_train.shape

**[5.3]** Importing accuracy score from sklearn

In [34]:
from sklearn.metrics import accuracy_score

**[5.4]** Displaying accuracy score of this baseline model

In [35]:
accuracy_score(y_train, y_base)

0.9734935458925703
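The same majority-class baseline can also be expressed with scikit-learn's `DummyClassifier`. The sketch below uses a hypothetical 100-row label vector rather than the notebook's data, and also reports recall to show that a baseline with high accuracy never identifies a positive case.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical imbalanced labels: 97 negatives, 3 positives
y_toy = np.array([0] * 97 + [1] * 3)
X_toy = np.zeros((100, 1))  # features are ignored by this strategy

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_base_toy = baseline.predict(X_toy)

acc = accuracy_score(y_toy, y_base_toy)
rec = recall_score(y_toy, y_base_toy)
print(acc, rec)
```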

## 6. Train a Logistic Regression Model

**[6.1]** Importing the LogisticRegression class from sklearn

In [36]:
from sklearn.linear_model import LogisticRegression

**[6.2]** Instantiating our model with default hyperparameters

In [37]:
log_reg = LogisticRegression()

**[6.3]** Fitting our model with the training data

In [38]:
log_reg.fit(X_train, y_train)

**[6.4]** Print the accuracy score of this model for the training set

In [39]:
y_train_preds = log_reg.predict(X_train)
accuracy_score(y_train, y_train_preds)

0.9778240437808577

**[6.5]** Print the accuracy score of this model for the validation set

In [40]:
y_val_preds = log_reg.predict(X_val)
accuracy_score(y_val, y_val_preds)

0.9789188160274103

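Accuracy on the validation set (0.9789) is slightly higher than on the training set (0.9778). The gap is small and may simply reflect sampling variation between the two splits rather than a problem with the model. Both scores are also only marginally above the 0.9735 baseline. Let us see whether regularisation changes the picture.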
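With about 97% of rows in class 0, accuracy alone says little about how well class 1 is detected. As a rough sketch on synthetic data (not the notebook's dataset), a confusion matrix together with precision and recall gives a fuller picture; all names below are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem with roughly 3% positives
X_toy, y_toy = make_classification(
    n_samples=5000, n_features=10, weights=[0.97, 0.03], random_state=8
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8, stratify=y_toy
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_pred = clf.predict(X_va)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_va, y_pred)
print(cm)
print("precision:", precision_score(y_va, y_pred, zero_division=0))
print("recall:", recall_score(y_va, y_pred))
```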

## 7. Train Logistic Regression Classifier with L1 and L2 Regularisation

**[7.1]** Instantiate a Logistic Regression with L1 and L2 regularisation

In [41]:
log_elastic_reg = LogisticRegression(penalty = 'elasticnet', l1_ratio = 0.5, solver = 'saga')

**[7.2]** Fit our model with the training data

In [42]:
log_elastic_reg.fit(X_train, y_train)



**[7.3]** Display the accuracy score for the training set

In [43]:
y_preds_train_elastic = log_elastic_reg.predict(X_train)
accuracy_score(y_train, y_preds_train_elastic)

0.9778359407530783

**[7.4]** Display the accuracy score for the validation set

In [44]:
y_preds_val_elastic = log_elastic_reg.predict(X_val)
accuracy_score(y_val, y_preds_val_elastic)

0.9789188160274103

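The validation accuracy (0.9789) is the same as for the default model (which already applies an L2 penalty), and the training accuracy differs only in the fifth decimal place, so elastic-net regularisation makes no practical difference here. A larger validation sample would give a more reliable comparison.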
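To illustrate what `l1_ratio` controls, the sketch below fits elastic-net logistic regression on synthetic data for three mixing values and counts the coefficients driven exactly to zero. The data, the stronger penalty `C=0.01` and the larger `max_iter` are assumptions chosen for the illustration, not settings from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: 20 features, only a few of them informative
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=20, n_informative=4, random_state=8
)
X_toy = StandardScaler().fit_transform(X_toy)

zero_counts = {}
for l1_ratio in (0.0, 0.5, 1.0):
    # l1_ratio=0 is pure L2, l1_ratio=1 is pure L1, values in between mix both
    clf = LogisticRegression(
        penalty="elasticnet", solver="saga", l1_ratio=l1_ratio,
        C=0.01, max_iter=5000,
    )
    clf.fit(X_toy, y_toy)
    zero_counts[l1_ratio] = int(np.sum(clf.coef_ == 0))

print(zero_counts)
```

A larger L1 share tends to zero out more coefficients, which is the main practical difference from a pure L2 penalty.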

## 8. Build a KNN model using euclidean distance

**[8.1]** Import the KNeighborsClassifier class from sklearn

In [45]:
from sklearn.neighbors import KNeighborsClassifier

**[8.2]** Instantiate our model with n_neighbors=15 and metric='euclidean'

In [46]:
knn_15_euc_class = KNeighborsClassifier(n_neighbors = 15, metric = 'euclidean')

**[8.3]** Fit our model with the training data

In [47]:
knn_15_euc_class.fit(X_train, y_train)

**[8.4]** Display the accuracy score on the training set

In [48]:
y_knn_15_train_preds = knn_15_euc_class.predict(X_train)
accuracy_score(y_train, y_knn_15_train_preds)

0.9835345904467313

**[8.5]** Display the accuracy score on the validation set

In [49]:
y_knn_15_val_preds = knn_15_euc_class.predict(X_val)
accuracy_score(y_val, y_knn_15_val_preds)

0.9823926905872276

## 9. Build a KNN model using euclidean distance and 55 neighbors

**[9.1]** Instantiate our model with n_neighbors=55 and metric='euclidean'

In [50]:
knn_55_euc_class = KNeighborsClassifier(n_neighbors = 55, metric = 'euclidean')

**[9.2]** Fit our model with the training data

In [51]:
model = knn_55_euc_class.fit(X_train, y_train)

**[9.3]** Display the accuracy score on the training set

In [52]:
y_knn_55_train_preds = knn_55_euc_class.predict(X_train)
accuracy_score(y_train, y_knn_55_train_preds)

0.9778835286419606
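Rather than fitting each value of `n_neighbors` in a separate section, the comparison can be written as a loop. The following sketch runs on synthetic data with hypothetical names, so its scores are unrelated to the figures above; it records training and validation accuracy for several neighbourhood sizes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic imbalanced data standing in for the scaled features
X_toy, y_toy = make_classification(
    n_samples=3000, n_features=10, weights=[0.9, 0.1], random_state=8
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8, stratify=y_toy
)

scores = {}
for k in (1, 15, 55):
    knn = KNeighborsClassifier(n_neighbors=k, metric="euclidean").fit(X_tr, y_tr)
    scores[k] = (
        accuracy_score(y_tr, knn.predict(X_tr)),  # training accuracy
        accuracy_score(y_va, knn.predict(X_va)),  # validation accuracy
    )

for k, (train_acc, val_acc) in scores.items():
    print(f"k={k:>2}  train={train_acc:.4f}  val={val_acc:.4f}")
```

With `k=1` every training point is its own nearest neighbour, so training accuracy is perfect; larger `k` smooths the decision boundary and usually narrows the gap between training and validation scores.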

## 10. Assess the Best Model on the Testing Set

**[10.1]** Using the trained logistic regression model with regularisation, display the accuracy score on the testing set

In [53]:
y_preds_test_elastic = log_elastic_reg.predict(X_test)
accuracy_score(y_test, y_preds_test_elastic)

0.9770443124714482