# Customer Segmentation Model
### Context
Market segmentation helps businesses align their strategy and tactics with their customers and target them more precisely. Every customer and every customer journey is different, so a single approach rarely works for everyone. This is where customer segmentation becomes valuable.

Customer segmentation is the process of dividing customers into groups based on common characteristics, such as demographics or behaviours, so that you can market to each group more effectively.

### About Data
A telco company plans to introduce several new products to its existing customers. Market research conducted by its marketing department gave promising results and suggests that these new products can be marketed according to different customer segments.

In that research, the marketing department classified all customers into 4 segments (Class 1, Class 2, Class 3 and Class 4). Marketing campaigns were run for these customer segments and the ROI was good. The same marketing campaigns will be used on the 1614 potential new customers.

You are required to use machine learning models to predict the correct segment for each new customer.

### Datasets
train.csv - the training set.

test.csv - the test set. The task is to predict the correct label.

sample_submission.csv - a sample submission file in the correct format

### Columns
- var_1 to var_9 - variables related to customer profiles
    - var_1, var_2, var_4, var_5, var_7, var_9: categorical variables
    - var_3, var_6, var_8: integer variables (with ordering)
- class - customer segments


In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pandas_profiling import ProfileReport

In [53]:
dataset_path = "./dataset/"
train_path = os.path.join(dataset_path, "train.csv")
df = pd.read_csv(train_path)
df

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,class
0,1,1,0,21,1,1.0,1.0,1,4.0,4.0,4
1,2,1,1,37,1,2.0,,2,3.0,4.0,1
2,3,1,1,66,1,2.0,1.0,1,1.0,6.0,2
3,4,1,1,66,1,3.0,0.0,3,2.0,6.0,2
4,5,1,1,39,1,4.0,,3,6.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...
6449,6450,1,1,68,1,6.0,0.0,1,2.0,4.0,2
6450,6451,1,0,25,1,4.0,6.0,1,2.0,6.0,4
6451,6452,1,1,35,1,5.0,8.0,1,4.0,4.0,4
6452,6453,1,1,24,1,2.0,6.0,2,2.0,6.0,1


In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      6454 non-null   int64  
 1   var_1   6454 non-null   int64  
 2   var_2   6454 non-null   int64  
 3   var_3   6454 non-null   int64  
 4   var_4   6454 non-null   int64  
 5   var_5   6358 non-null   float64
 6   var_6   5800 non-null   float64
 7   var_7   6454 non-null   int64  
 8   var_8   6190 non-null   float64
 9   var_9   6397 non-null   float64
 10  class   6454 non-null   int64  
dtypes: float64(4), int64(7)
memory usage: 554.8 KB


Null values in var_5, var_6, var_8 and var_9
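
The null counts reported by `df.info()` can also be summarised per column. The following is a minimal sketch on a toy frame; the helper name `missing_summary` and the toy values are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary[summary["missing"] > 0]


# Toy frame standing in for the training data (values are illustrative only).
toy = pd.DataFrame({
    "var_5": [1.0, np.nan, 3.0, 4.0],
    "var_6": [np.nan, np.nan, 2.0, 0.0],
    "var_7": [1, 2, 3, 1],
})

summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(df)` on the real training frame should list only the columns that contain nulls.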

In [55]:
report_path = "Market Segmentation.html"
if not os.path.exists(report_path):
    profile = ProfileReport(df, title="Market Segmentation")
    profile.to_file(report_path)

### Insights from Pandas Profiling
1. There's only 1 unique value in var_1 and var_4
2. ID has 6454 unique values - same as number of samples
3. var_2 has 2 unique values - value "1" (58%) and value "0" (41%); high correlation with class; categorical
4. var_3 has 67 unique values - ranging from 17 to 88; high correlation with class
5. var_5 has 9 unique values - ranging from 1 to 9; high correlation with var_2, var_3 and var_7; 96 missing values (1.5%)
6. var_6 has 15 unique values - ranging from 0 to 15; 654 missing values (10.1%)
7. var_7 has 3 unique values - value "1" (61.1%), value "2" (24.1%) and value "3" (14.7%); high correlation with var_2, var_3, var_5, var_8
8. var_8 has 9 unique values - ranging from 1 to 9; high correlation with var_7; 264 missing values (4.1%)
9. var_9 has 7 unique values - ranging from 1 to 7
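
The first two insights (constant columns and an ID-like column) can be reproduced without the profiling report by counting unique values. This is a small sketch on a toy frame; the variable names `constant_cols` and `id_like_cols` are hypothetical.

```python
import pandas as pd

# Toy frame with the same kind of columns as the training data (illustrative values).
toy = pd.DataFrame({
    "ID": [1, 2, 3],
    "var_1": [1, 1, 1],
    "var_2": [0, 1, 1],
    "var_4": [1, 1, 1],
})

n_unique = toy.nunique()

# Columns with a single value carry no information for a classifier.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print(constant_cols, id_like_cols)
```

Both kinds of column are candidates for removal before modelling, which is what the notebook does later with `ID`, `var_1` and `var_4`.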

**Columns:**
- var_1 to var_9 - variables related to customer profiles
    - var_1, var_2, var_4, var_5, var_7, var_9: categorical variables
    - var_3, var_6, var_8: integer variables (with ordering)
- class - customer segments

### Data Cleanup
Missing values in var_5, var_6, var_8 and var_9

We consider the following cleanup strategies:
1. Remove all rows with missing values -> Not possible, as the test data also has missing values
2. Fill in missing values in each column with that column's median


In [56]:
# missing_col = ["var_5", "var_6", "var_8", "var_9"]
# df1 = df.copy()
df1 = df.fillna(df.median())
df1

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,class
0,1,1,0,21,1,1.0,1.0,1,4.0,4.0,4
1,2,1,1,37,1,2.0,1.0,2,3.0,4.0,1
2,3,1,1,66,1,2.0,1.0,1,1.0,6.0,2
3,4,1,1,66,1,3.0,0.0,3,2.0,6.0,2
4,5,1,1,39,1,4.0,1.0,3,6.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...
6449,6450,1,1,68,1,6.0,0.0,1,2.0,4.0,2
6450,6451,1,0,25,1,4.0,6.0,1,2.0,6.0,4
6451,6452,1,1,35,1,5.0,8.0,1,4.0,4.0,4
6452,6453,1,1,24,1,2.0,6.0,2,2.0,6.0,1


For reference, strategy 1 (dropping rows with missing values) would leave 5473 rows, which is 84.80% of the original 6454, so it would remove 15.20% of the data. The median fill used above keeps all 6454 rows.
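
The retained and removed shares can be recomputed from the two row counts. A quick check, assuming the 5473 figure for the row-dropping strategy is correct:

```python
total_rows = 6454       # rows in train.csv, from df.info()
rows_after_drop = 5473  # rows left after dropping incomplete rows, as quoted in the text

kept_pct = rows_after_drop / total_rows * 100
removed_pct = 100 - kept_pct

print(f"kept {kept_pct:.2f}%, removed {removed_pct:.2f}%")
# prints: kept 84.80%, removed 15.20%
```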

In [57]:
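cat_variables = ["var_1", "var_2", "var_4", "var_5", "var_7", "var_9"]
for var in cat_variables:
    # NOTE: astype returns a new Series; the result is discarded here, so the
    # dtype stays unchanged (see the info() output below).
    df1.loc[:, var].astype('category')

int_variables = ["var_3", "var_6", "var_8", "class"]
for var in int_variables:
    # NOTE: same as above, this cast is not assigned back and has no effect.
    df1.loc[:, var].astype('int32')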

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      6454 non-null   int64  
 1   var_1   6454 non-null   int64  
 2   var_2   6454 non-null   int64  
 3   var_3   6454 non-null   int64  
 4   var_4   6454 non-null   int64  
 5   var_5   6454 non-null   float64
 6   var_6   6454 non-null   float64
 7   var_7   6454 non-null   int64  
 8   var_8   6454 non-null   float64
 9   var_9   6454 non-null   float64
 10  class   6454 non-null   int64  
dtypes: float64(4), int64(7)
memory usage: 554.8 KB
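
The `info()` output shows the same dtypes as before, because `astype` returns a new object rather than modifying the frame in place. One way to make the conversion stick is to assign the result back, sketched here on a toy frame (the values are illustrative and the column selection is shortened):

```python
import pandas as pd

# Toy frame with a few of the notebook's column names (values are illustrative).
toy = pd.DataFrame({
    "var_2": [0, 1, 1],
    "var_5": [1.0, 2.0, 2.0],
    "var_3": [21, 37, 66],
    "var_6": [1.0, 1.0, 0.0],
})

# astype returns a new DataFrame, so the result has to be assigned back.
toy = toy.astype({
    "var_2": "category",
    "var_5": "category",
    "var_3": "int32",
    "var_6": "int32",
})

print(toy.dtypes)
```

The integer cast only works once missing values have been filled, since NaN cannot be stored in an `int32` column.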


In [58]:
df1 = df1.drop(columns=["ID", "var_1", "var_4"])
df1

Unnamed: 0,var_2,var_3,var_5,var_6,var_7,var_8,var_9,class
0,0,21,1.0,1.0,1,4.0,4.0,4
1,1,37,2.0,1.0,2,3.0,4.0,1
2,1,66,2.0,1.0,1,1.0,6.0,2
3,1,66,3.0,0.0,3,2.0,6.0,2
4,1,39,4.0,1.0,3,6.0,6.0,1
...,...,...,...,...,...,...,...,...
6449,1,68,6.0,0.0,1,2.0,4.0,2
6450,0,25,4.0,6.0,1,2.0,6.0,4
6451,1,35,5.0,8.0,1,4.0,4.0,4
6452,1,24,2.0,6.0,2,2.0,6.0,1


## Data Prep

1. var_2, var_5, var_7 and var_9 are categorical and hence require one-hot encoding
2. Split dataset into:
    - Train: 80%
    - Test: 20%

In [59]:
df1 = pd.get_dummies(df1, columns=["var_2", "var_5", "var_7", "var_9"])
df1

Unnamed: 0,var_3,var_6,var_8,class,var_2_0,var_2_1,var_5_1.0,var_5_2.0,var_5_3.0,var_5_4.0,...,var_7_1,var_7_2,var_7_3,var_9_1.0,var_9_2.0,var_9_3.0,var_9_4.0,var_9_5.0,var_9_6.0,var_9_7.0
0,21,1.0,4.0,4,1,0,1,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1,37,1.0,3.0,1,0,1,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
2,66,1.0,1.0,2,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
3,66,0.0,2.0,2,0,1,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
4,39,1.0,6.0,1,0,1,0,0,0,1,...,0,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6449,68,0.0,2.0,2,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
6450,25,6.0,2.0,4,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,1,0
6451,35,8.0,4.0,4,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
6452,24,6.0,2.0,1,0,1,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0


In [60]:
from sklearn.model_selection import train_test_split

In [61]:
X = df1.drop(columns="class")
y = df1["class"]

In [62]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(5163, 24) (5163,)
(1291, 24) (1291,)
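
The split above is purely random. A possible variation, not used in this notebook, is to pass `stratify` so that both parts keep the same class mix. The sketch below uses toy data with an assumed 40/30/20/10 class distribution:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, four classes in a 40/30/20/10 mix (illustrative only).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"var_3": rng.integers(17, 89, size=100)})
y_toy = pd.Series([1] * 40 + [2] * 30 + [3] * 20 + [4] * 10, name="class")

# stratify=y_toy keeps the class proportions identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1234, stratify=y_toy
)

print(y_te.value_counts().sort_index())
```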


### Baseline Model

**Problem type:** Classification

**Proposed baseline:** Logistic regression

**Contender models:**
- Decision tree
- Random forest
- Neural network
- Gradient boosting?

In [63]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [64]:
pipe = [('scaler', StandardScaler()), ('logistic', LogisticRegression())]
pipe = Pipeline(pipe).fit(X_train, y_train)
pipe.score(X_test, y_test)

0.47327652982184354

In [65]:
pipe = [('scaler', StandardScaler()), ('decision', DecisionTreeClassifier())]
pipe = Pipeline(pipe).fit(X_train, y_train)
pipe.score(X_test, y_test)

0.40975987606506586

In [66]:
pipe = [('scaler', StandardScaler()), ('random', RandomForestClassifier())]
pipe = Pipeline(pipe).fit(X_train, y_train)
pipe.score(X_test, y_test)

0.4500387296669249

In [67]:
pipe = [('scaler', StandardScaler()), ('svc', LinearSVC())]
pipe = Pipeline(pipe).fit(X_train, y_train)
pipe.score(X_test, y_test)



0.4686289697908598

In [68]:
pipe = [('scaler', StandardScaler()), ('mlp', MLPClassifier())]
pipe = Pipeline(pipe).fit(X_train, y_train)
pipe.score(X_test, y_test)



0.47172734314484893
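
Each cell above overwrites `pipe`, so only the last fitted model remains available, and each score comes from a single split. A hedged alternative is to keep the pipelines in a dictionary and compare them with cross-validation. The sketch below runs on synthetic data from `make_classification`, not on the notebook's data, and the hyperparameters (`max_iter`, `n_estimators`, `cv=5`) are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class data standing in for X_train / y_train (illustrative only).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=1234,
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "decision": DecisionTreeClassifier(random_state=1234),
    "random": RandomForestClassifier(n_estimators=50, random_state=1234),
}

# One pipeline per model, kept under its own name instead of overwriting `pipe`.
pipelines = {
    name: Pipeline([("scaler", StandardScaler()), (name, model)])
    for name, model in models.items()
}

# Mean accuracy over 5 folds for each pipeline.
scores = {
    name: cross_val_score(p, X_demo, y_demo, cv=5).mean()
    for name, p in pipelines.items()
}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With named pipelines, the model used for the final prediction can be chosen explicitly, for example `pipelines["logistic"].fit(X_train, y_train)`.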

### Prediction

In [69]:
test_path = os.path.join(dataset_path, "test.csv")
df_test = pd.read_csv(test_path)
df_test

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9
0,6455,1,0,36,1,1.0,1.0,1,2.0,6.0
1,6456,1,0,41,1,5.0,0.0,1,1.0,5.0
2,6457,1,0,21,1,1.0,9.0,1,4.0,3.0
3,6458,1,1,40,1,7.0,2.0,2,4.0,6.0
4,6459,1,1,74,1,6.0,0.0,1,1.0,6.0
...,...,...,...,...,...,...,...,...,...,...
1609,8064,1,0,21,1,,0.0,1,7.0,1.0
1610,8065,1,0,34,1,6.0,3.0,1,4.0,4.0
1611,8066,1,0,32,1,1.0,1.0,1,1.0,6.0
1612,8067,1,0,26,1,1.0,1.0,1,4.0,6.0


In [47]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1614 entries, 0 to 1613
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      1614 non-null   int64  
 1   var_1   1614 non-null   int64  
 2   var_2   1614 non-null   int64  
 3   var_3   1614 non-null   int64  
 4   var_4   1614 non-null   int64  
 5   var_5   1586 non-null   float64
 6   var_6   1439 non-null   float64
 7   var_7   1614 non-null   int64  
 8   var_8   1543 non-null   float64
 9   var_9   1595 non-null   float64
dtypes: float64(4), int64(6)
memory usage: 126.2 KB


In [70]:
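# Fill missing values with the test set's own column medians
# (the training data above was filled with the training medians).
df_test = df_test.fillna(df_test.median())
df_test.info()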

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1614 entries, 0 to 1613
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      1614 non-null   int64  
 1   var_1   1614 non-null   int64  
 2   var_2   1614 non-null   int64  
 3   var_3   1614 non-null   int64  
 4   var_4   1614 non-null   int64  
 5   var_5   1614 non-null   float64
 6   var_6   1614 non-null   float64
 7   var_7   1614 non-null   int64  
 8   var_8   1614 non-null   float64
 9   var_9   1614 non-null   float64
dtypes: float64(4), int64(6)
memory usage: 126.2 KB


In [74]:
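# Apply the same column drops and one-hot encoding as for the training data
df_test1 = df_test.drop(columns=["ID", "var_1", "var_4"])
df_test1 = pd.get_dummies(df_test1, columns=["var_2", "var_5", "var_7", "var_9"])
# X_pred = df_test1.to_numpy()
# This only checks the number of columns; names and order are not verified
assert df_test1.shape[1] == X_train.shape[1]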
df_test1

Unnamed: 0,var_3,var_6,var_8,var_2_0,var_2_1,var_5_1.0,var_5_2.0,var_5_3.0,var_5_4.0,var_5_5.0,...,var_7_1,var_7_2,var_7_3,var_9_1.0,var_9_2.0,var_9_3.0,var_9_4.0,var_9_5.0,var_9_6.0,var_9_7.0
0,36,1.0,2.0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1,41,0.0,1.0,1,0,0,0,0,0,1,...,1,0,0,0,0,0,0,1,0,0
2,21,9.0,4.0,1,0,1,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
3,40,2.0,4.0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,74,0.0,1.0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609,21,0.0,7.0,1,0,0,0,0,0,1,...,1,0,0,1,0,0,0,0,0,0
1610,34,3.0,4.0,1,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1611,32,1.0,1.0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1612,26,1.0,4.0,1,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
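
The assertion in the cell above compares only the number of columns. If a category appears in one file but not the other, `get_dummies` produces different columns, and the count can still match by coincidence. One way to guard against this is to reindex the test frame to the training columns, sketched here on toy frames:

```python
import pandas as pd

# Toy frames: category 7.0 appears only in train, 5.0 only in test (illustrative).
train_toy = pd.DataFrame({"var_3": [21, 37, 66], "var_9": [4.0, 6.0, 7.0]})
test_toy = pd.DataFrame({"var_3": [36, 41], "var_9": [6.0, 5.0]})

X_train_toy = pd.get_dummies(train_toy, columns=["var_9"])
X_test_toy = pd.get_dummies(test_toy, columns=["var_9"])

# Align to the training columns: columns missing from the test frame are added
# and filled with 0, columns unseen during training are dropped.
X_test_toy = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)

print(list(X_test_toy.columns))
```

In the notebook this would correspond to `df_test1.reindex(columns=X_train.columns, fill_value=0)` before calling `predict`.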


In [77]:
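# `pipe` is the most recently fitted pipeline; if the cells ran in the order
# shown, that is the MLPClassifier pipeline from In [68].
pred = pipe.predict(df_test1)
pred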

array([4, 2, 4, ..., 4, 4, 1], dtype=int64)

In [81]:
df_pred = pd.DataFrame(data={"ID":df_test["ID"], "class":pred})
df_pred

Unnamed: 0,ID,class
0,6455,4
1,6456,2
2,6457,4
3,6458,2
4,6459,4
...,...,...
1609,8064,3
1610,8065,4
1611,8066,4
1612,8067,4


In [83]:
df_pred.to_csv("submission.csv", index=False)
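
Before submitting, the written file can be read back and checked. The sketch below uses an in-memory buffer and a toy prediction frame. The expected column names `ID` and `class` are assumed from the frame built above; the layout of `sample_submission.csv` is not shown in this notebook.

```python
import io

import pandas as pd

# Toy predictions in the same layout as df_pred (values are illustrative).
pred_toy = pd.DataFrame({"ID": [6455, 6456, 6457], "class": [4, 2, 4]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
pred_toy.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Basic sanity checks on the round-tripped submission.
assert list(check.columns) == ["ID", "class"]
assert check["ID"].is_unique
assert check["class"].isin([1, 2, 3, 4]).all()
print(check.shape)
```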