# Customer Segmentation Model
### Context
Market segmentation helps businesses align their strategy and tactics with their customers and target them more precisely. Every customer and every customer journey is different, so a single approach rarely works for everyone. This is where customer segmentation becomes a valuable process.

Customer segmentation is the process of dividing customers into groups based on common characteristics, such as demographics or behaviours, so you can market to each group more effectively.

### About Data
A telco company is planning to introduce several new products to its existing customers. Market research conducted by its marketing department is promising and suggests that these new products can be marketed according to different customer segments.

In this market research, the marketing department classified all customers into 4 segments (Class 1, Class 2, Class 3 and Class 4). Marketing campaigns targeted at these customer segments produced a good ROI. The same marketing campaigns will be used on 1,614 potential new customers.

You are required to use machine learning models to predict the correct segment for each new customer.

### Datasets
train.csv - the training set.

test.csv - the test set. The task is to predict the correct label.

sample_submission.csv - a sample submission file in the correct format

### Columns
- var_1 to var_9 - variables related to customer profiles
    - var_1, var_2, var_4, var_5, var_7, var_9 categorical variables
    - var_3, var_6, var_8 integer (with ordering)
- class - customer segments


In [171]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from pandas_profiling import ProfileReport
from sklearn.impute import SimpleImputer

In [172]:
dataset_path = "./dataset/"
train_path = os.path.join(dataset_path, "train.csv")
df = pd.read_csv(train_path)
df

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,class
0,1,1,0,21,1,1.0,1.0,1,4.0,4.0,4
1,2,1,1,37,1,2.0,,2,3.0,4.0,1
2,3,1,1,66,1,2.0,1.0,1,1.0,6.0,2
3,4,1,1,66,1,3.0,0.0,3,2.0,6.0,2
4,5,1,1,39,1,4.0,,3,6.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...
6449,6450,1,1,68,1,6.0,0.0,1,2.0,4.0,2
6450,6451,1,0,25,1,4.0,6.0,1,2.0,6.0,4
6451,6452,1,1,35,1,5.0,8.0,1,4.0,4.0,4
6452,6453,1,1,24,1,2.0,6.0,2,2.0,6.0,1


In [173]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      6454 non-null   int64  
 1   var_1   6454 non-null   int64  
 2   var_2   6454 non-null   int64  
 3   var_3   6454 non-null   int64  
 4   var_4   6454 non-null   int64  
 5   var_5   6358 non-null   float64
 6   var_6   5800 non-null   float64
 7   var_7   6454 non-null   int64  
 8   var_8   6190 non-null   float64
 9   var_9   6397 non-null   float64
 10  class   6454 non-null   int64  
dtypes: float64(4), int64(7)
memory usage: 554.8 KB


`df.info()` shows null values in var_5, var_6, var_8 and var_9.
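The missing-value counts can also be read off directly rather than from the `info()` listing. The following is a minimal sketch on a made-up toy frame (the `toy` data and the `summary` name are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "var_5": [1.0, 2.0, np.nan, 4.0],
    "var_6": [np.nan, np.nan, 1.0, 0.0],
    "var_7": [1, 2, 3, 1],
})

# Count and percentage of missing values per column.
missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": (missing / len(toy) * 100).round(1),
})

# Keep only the columns that actually contain missing values.
summary = summary[summary["n_missing"] > 0]
print(summary)
```

Applied to the real `df`, the same two lines (`df.isna().sum()` and the percentage) should reproduce the counts implied by `df.info()`.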

In [174]:
report_path = "Market Segmentation.html"
if not os.path.exists(report_path):
    profile = ProfileReport(df, title="Market Segmentation")
    profile.to_file(report_path)

### Insights from Pandas Profiling
1. There's only 1 unique value in var_1 and var_4
2. ID has 6454 unique values - same as number of samples
3. var_2 has 2 unique values - value "1" (58%) and value "0" (41%); high correlation with class; categorical
4. var_3 has 67 unique values - ranging from 17 to 88; high correlation with class
5. var_5 has 9 unique values - ranging from 1 to 9; high correlation with var_2, var_3 and var_7; 96 missing values (1.5%)
6. var_6 has 15 unique values - ranging from 0 to 14; 654 missing values (10.1%)
7. var_7 has 3 unique values - value "1" (61.1%), value "2" (24.1%) and value "3" (14.7%); high correlation with var_2, var_3, var_5, var_8
8. var_8 has 9 unique values - ranging from 1 to 9; high correlation with var_7; 264 missing values (4.1%)
9. var_9 has 7 unique values - ranging from 1 to 7; 57 missing values (0.9%)
10. class has 4 unique values - quite balanced between classes
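Insights 1 and 2 (constant columns and an identifier column) can be checked without the profiling report. This is a small sketch on made-up data; `constant_cols` and `id_like_cols` are hypothetical names introduced here for illustration:

```python
import pandas as pd

# Toy frame mimicking the structure described above (values are made up).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "var_1": [1, 1, 1, 1],
    "var_2": [0, 1, 1, 0],
})

# Number of distinct values per column.
n_unique = toy.nunique()

# Columns with a single value carry no information for a classifier.
constant_cols = n_unique[n_unique == 1].index.tolist()

# Columns with one distinct value per row behave like identifiers.
id_like_cols = n_unique[n_unique == len(toy)].index.tolist()

print("constant:", constant_cols)
print("identifier-like:", id_like_cols)
```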


### Data Cleanup
Missing values in var_5, var_6, var_8 and var_9

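We consider the following cleanup strategies:
1. Remove all rows with missing values. This is not possible because the test data also has missing values, and every test row needs a prediction.
2. Fill missing values in each column with the median or mode (the mean is not suitable because the variables are not continuous). The notebook uses the mode (`strategy="most_frequent"`).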
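To make the difference between the two fill strategies concrete, here is a minimal sketch that applies the mode to a categorical column and the median to an ordered integer column. The toy values and the choice of which column gets which strategy are assumptions for illustration; the notebook itself applies `most_frequent` to every column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up): one categorical column, one ordered integer column.
toy = pd.DataFrame({
    "var_5": [1.0, 1.0, 3.0, np.nan],   # categorical -> fill with mode
    "var_6": [0.0, 2.0, 10.0, np.nan],  # ordered integer -> fill with median
})

mode_imputer = SimpleImputer(strategy="most_frequent")
median_imputer = SimpleImputer(strategy="median")

filled = toy.copy()
# fit_transform returns a 2D array, so assign back to a one-column frame.
filled[["var_5"]] = mode_imputer.fit_transform(toy[["var_5"]])
filled[["var_6"]] = median_imputer.fit_transform(toy[["var_6"]])
print(filled)
```

Fitting the imputers on the training data only and reusing them on the test data keeps the fill values consistent between the two sets.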

In [175]:
# missing_col = ["var_5", "var_6", "var_8", "var_9"]
# df1 = df.copy()
# df1 = df.fillna(df.mode().iloc[0, :])
# df1

In [211]:
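# Fill every missing value with the most frequent value (mode) of its column.
# The imputer is fit on all feature columns (including ID) so that the same
# fitted object can later transform the test set, which has the same columns.
imputer = SimpleImputer(strategy="most_frequent")
df1 = imputer.fit_transform(df.drop(columns="class"))
# fit_transform returns a NumPy array, so rebuild the DataFrame with column names.
df1 = pd.DataFrame(df1, columns=df.columns.to_list()[:-1])
df1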

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9
0,1.0,1.0,0.0,21.0,1.0,1.0,1.0,1.0,4.0,4.0
1,2.0,1.0,1.0,37.0,1.0,2.0,1.0,2.0,3.0,4.0
2,3.0,1.0,1.0,66.0,1.0,2.0,1.0,1.0,1.0,6.0
3,4.0,1.0,1.0,66.0,1.0,3.0,0.0,3.0,2.0,6.0
4,5.0,1.0,1.0,39.0,1.0,4.0,1.0,3.0,6.0,6.0
...,...,...,...,...,...,...,...,...,...,...
6449,6450.0,1.0,1.0,68.0,1.0,6.0,0.0,1.0,2.0,4.0
6450,6451.0,1.0,0.0,25.0,1.0,4.0,6.0,1.0,2.0,6.0
6451,6452.0,1.0,1.0,35.0,1.0,5.0,8.0,1.0,4.0,4.0
6452,6453.0,1.0,1.0,24.0,1.0,2.0,6.0,2.0,2.0,6.0


In [212]:
# cat_variables = ["var_1", "var_2", "var_4", "var_5", "var_7", "var_9"]
# for var in cat_variables:
#     df1.loc[:, var].astype('category')

# int_variables = ["var_3", "var_6", "var_8", "class"]
# for var in int_variables:
#     df1.loc[:, var].astype('int32')

df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      6454 non-null   float64
 1   var_1   6454 non-null   float64
 2   var_2   6454 non-null   float64
 3   var_3   6454 non-null   float64
 4   var_4   6454 non-null   float64
 5   var_5   6454 non-null   float64
 6   var_6   6454 non-null   float64
 7   var_7   6454 non-null   float64
 8   var_8   6454 non-null   float64
 9   var_9   6454 non-null   float64
dtypes: float64(10)
memory usage: 504.3 KB


In [213]:
df1.describe()

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9
count,6454.0,6454.0,6454.0,6454.0,6454.0,6454.0,6454.0,6454.0,6454.0,6454.0
mean,3227.5,1.0,0.587078,42.39247,1.0,4.274094,2.490858,1.536102,2.813604,5.159591
std,1863.253651,0.0,0.492397,16.808992,0.0,2.147445,3.280449,0.737212,1.499872,1.410352
min,1.0,1.0,0.0,17.0,1.0,1.0,0.0,1.0,1.0,1.0
25%,1614.25,1.0,0.0,29.0,1.0,2.0,0.0,1.0,2.0,4.0
50%,3227.5,1.0,1.0,39.0,1.0,5.0,1.0,1.0,2.0,6.0
75%,4840.75,1.0,1.0,52.0,1.0,5.0,4.0,2.0,4.0,6.0
max,6454.0,1.0,1.0,88.0,1.0,9.0,14.0,3.0,9.0,7.0


In [214]:
# df1 = df1.drop(columns=["ID", "var_1", "var_4"])
df1 = df1.drop(columns=["ID", "var_1"])
df1

Unnamed: 0,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9
0,0.0,21.0,1.0,1.0,1.0,1.0,4.0,4.0
1,1.0,37.0,1.0,2.0,1.0,2.0,3.0,4.0
2,1.0,66.0,1.0,2.0,1.0,1.0,1.0,6.0
3,1.0,66.0,1.0,3.0,0.0,3.0,2.0,6.0
4,1.0,39.0,1.0,4.0,1.0,3.0,6.0,6.0
...,...,...,...,...,...,...,...,...
6449,1.0,68.0,1.0,6.0,0.0,1.0,2.0,4.0
6450,0.0,25.0,1.0,4.0,6.0,1.0,2.0,6.0
6451,1.0,35.0,1.0,5.0,8.0,1.0,4.0,4.0
6452,1.0,24.0,1.0,2.0,6.0,2.0,2.0,6.0


## Data Prep

1. var_2, var_5, var_7 and var_9 are categorical and hence require one-hot encoding
2. Split the training dataset into:
    - Train: 80%
    - Test: 20%
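The two preparation steps can be sketched on synthetic data as follows. The `stratify` argument is an addition that the notebook does not use; it is shown here only as one way to keep class proportions equal in both splits. All `toy_*` names and values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned features (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "var_3": rng.integers(17, 89, size=100),  # ordered integer, left as-is
    "var_7": rng.integers(1, 4, size=100),    # categorical, one-hot encoded
})
toy_y = pd.Series(np.repeat([1, 2, 3, 4], 25), name="class")

# Step 1: one-hot encode the categorical column.
toy_X = pd.get_dummies(toy, columns=["var_7"])

# Step 2: 80/20 split; stratify keeps the class balance in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1234, stratify=toy_y
)
print(X_tr.shape, X_te.shape)
```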

In [215]:
df2 = pd.get_dummies(df1, columns=["var_2", "var_5", "var_7", "var_9"])
df2

Unnamed: 0,var_3,var_4,var_6,var_8,var_2_0.0,var_2_1.0,var_5_1.0,var_5_2.0,var_5_3.0,var_5_4.0,...,var_7_1.0,var_7_2.0,var_7_3.0,var_9_1.0,var_9_2.0,var_9_3.0,var_9_4.0,var_9_5.0,var_9_6.0,var_9_7.0
0,21.0,1.0,1.0,4.0,1,0,1,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1,37.0,1.0,1.0,3.0,0,1,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
2,66.0,1.0,1.0,1.0,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,1,0
3,66.0,1.0,0.0,2.0,0,1,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
4,39.0,1.0,1.0,6.0,0,1,0,0,0,1,...,0,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6449,68.0,1.0,0.0,2.0,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
6450,25.0,1.0,6.0,2.0,1,0,0,0,0,1,...,1,0,0,0,0,0,0,0,1,0
6451,35.0,1.0,8.0,4.0,0,1,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
6452,24.0,1.0,6.0,2.0,0,1,0,1,0,0,...,0,1,0,0,0,0,0,0,1,0


In [216]:
from sklearn.model_selection import train_test_split

In [217]:
X = df2
y = df["class"]

In [218]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(5163, 25) (5163,)
(1291, 25) (1291,)


## Data Modelling: Baseline

**Problem type:** Classification

**Proposed baseline:** Logistic regression

**Contender models:**
- Decision tree
- Random forest
- Neural network
- Gradient boosting
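A single train/test split can make model rankings noisy, so one option is to compare the baseline and contenders with k-fold cross-validation. The sketch below is an assumption-laden illustration on synthetic data from `make_classification`, with only two of the listed models to keep it quick; it is not the comparison the notebook runs.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem standing in for the segmentation data.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_classes=4, random_state=0
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # Scaling lives inside the pipeline so it is re-fit on each training fold.
    pipe_cv = make_pipeline(MinMaxScaler(), model)
    scores = cross_val_score(pipe_cv, X_toy, y_toy, cv=5, scoring="accuracy")
    cv_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```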

In [219]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier

In [220]:
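def run_pipeline(scaler, model):
    """Fit scaler + model on the training split and print the held-out accuracy."""
    # Relies on the global X_train, y_train, X_test and y_test defined above.
    steps = [('scaler', scaler), ('model', model)]
    pipe = Pipeline(steps).fit(X_train, y_train)
    print(pipe.named_steps, pipe.score(X_test, y_test))
    return pipe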

In [221]:
combinations = [
    (MinMaxScaler(), LogisticRegression()),
    (MinMaxScaler(), DecisionTreeClassifier()),
    (MinMaxScaler(), RandomForestClassifier()),
    (MinMaxScaler(), LinearSVC()),
    (MinMaxScaler(), MLPClassifier()),
    (MinMaxScaler(), GradientBoostingClassifier()),
]
for scaler, model in combinations:
    # After the loop, `pipe` holds the last fitted pipeline
    # (MinMaxScaler + GradientBoostingClassifier), which is used for prediction later.
    pipe = run_pipeline(scaler, model)

{'scaler': MinMaxScaler(), 'model': LogisticRegression()} 0.4740511231603408
{'scaler': MinMaxScaler(), 'model': DecisionTreeClassifier()} 0.41518202943454685
{'scaler': MinMaxScaler(), 'model': RandomForestClassifier()} 0.4469403563129357
{'scaler': MinMaxScaler(), 'model': LinearSVC()} 0.4701781564678544
{'scaler': MinMaxScaler(), 'model': MLPClassifier()} 0.4817970565453137
{'scaler': MinMaxScaler(), 'model': GradientBoostingClassifier()} 0.4856700232378002
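To put these accuracies in context: with four roughly balanced classes, always predicting a single class would score about 0.25, so scores between 0.42 and 0.49 are above chance but modest. A minimal sketch of such a chance-level reference, using made-up perfectly balanced labels and scikit-learn's `DummyClassifier`:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical, perfectly balanced 4-class labels (the real data is only
# described as "quite balanced").
y_demo = np.repeat([1, 2, 3, 4], 50)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
chance_accuracy = dummy.score(X_demo, y_demo)
print(chance_accuracy)
```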


## Data Modelling: Recursive Feature Elimination with Cross Validation

In [187]:
from sklearn.feature_selection import RFECV

In [209]:
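clf = GradientBoostingClassifier()
selector = RFECV(estimator=clf, step=1)  # eliminate one feature per iteration

# Use the imputed but un-encoded features; y is df["class"] from above.
# Note: this cell's execution count (209) precedes the cleanup cells, and the
# mask printed below has 9 entries, so df1 likely had 9 columns when it ran.
X = df1
selector = selector.fit(X, y)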

In [210]:
print(selector.n_features_)
print(selector.support_)

8
[False  True  True  True  True  True  True  True  True]
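The boolean mask in `support_` is easier to interpret when mapped back to column names. The sketch below shows this on synthetic data with a decision tree as the estimator (chosen only for speed); the `f0`..`f5` column names and `*_demo` variables are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with named columns (names are made up).
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

selector_demo = RFECV(
    estimator=DecisionTreeClassifier(random_state=0), step=1, cv=3
)
selector_demo.fit(X_demo, y_demo)

# support_ is a boolean mask aligned with the input columns.
kept = X_demo.columns[selector_demo.support_].tolist()
dropped = X_demo.columns[~selector_demo.support_].tolist()
print("kept:", kept)
print("dropped:", dropped)
```

In the notebook, the equivalent would be `X.columns[selector.support_]`, which names the 8 retained features and the single eliminated one.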


## Data Modelling: Grid Search with Cross Validation
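A minimal sketch of what a grid search with cross-validation could look like for the gradient boosting model, assuming scikit-learn's `GridSearchCV`. The synthetic data and the small parameter grid are illustrative assumptions, not tuned values or results from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class problem standing in for the segmentation data.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=4, random_state=0
)

# Hypothetical, deliberately small grid to keep the search fast.
param_grid = {
    "n_estimators": [20, 40],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

By default `GridSearchCV` refits the best configuration on all the data passed to `fit`, so `search.best_estimator_` can be used for prediction afterwards.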

## Prediction

In [230]:
test_path = os.path.join(dataset_path, "test.csv")
df_test = pd.read_csv(test_path)
df_test

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9
0,6455,1,0,36,1,1.0,1.0,1,2.0,6.0
1,6456,1,0,41,1,5.0,0.0,1,1.0,5.0
2,6457,1,0,21,1,1.0,9.0,1,4.0,3.0
3,6458,1,1,40,1,7.0,2.0,2,4.0,6.0
4,6459,1,1,74,1,6.0,0.0,1,1.0,6.0
...,...,...,...,...,...,...,...,...,...,...
1609,8064,1,0,21,1,,0.0,1,7.0,1.0
1610,8065,1,0,34,1,6.0,3.0,1,4.0,4.0
1611,8066,1,0,32,1,1.0,1.0,1,1.0,6.0
1612,8067,1,0,26,1,1.0,1.0,1,4.0,6.0


In [231]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1614 entries, 0 to 1613
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      1614 non-null   int64  
 1   var_1   1614 non-null   int64  
 2   var_2   1614 non-null   int64  
 3   var_3   1614 non-null   int64  
 4   var_4   1614 non-null   int64  
 5   var_5   1586 non-null   float64
 6   var_6   1439 non-null   float64
 7   var_7   1614 non-null   int64  
 8   var_8   1543 non-null   float64
 9   var_9   1595 non-null   float64
dtypes: float64(4), int64(6)
memory usage: 126.2 KB


In [232]:
# df_test = df_test.fillna(df_test.mode().iloc[0, :])
# df_test.info()

In [233]:
df_test1 = imputer.transform(df_test)
df_test1 = pd.DataFrame(df_test1, columns=df_test.columns.to_list())
df_test1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1614 entries, 0 to 1613
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      1614 non-null   float64
 1   var_1   1614 non-null   float64
 2   var_2   1614 non-null   float64
 3   var_3   1614 non-null   float64
 4   var_4   1614 non-null   float64
 5   var_5   1614 non-null   float64
 6   var_6   1614 non-null   float64
 7   var_7   1614 non-null   float64
 8   var_8   1614 non-null   float64
 9   var_9   1614 non-null   float64
dtypes: float64(10)
memory usage: 126.2 KB


In [234]:
df_test1 = df_test1.drop(columns=["ID", "var_1"])
# df_test1 = df_test1.drop(columns=["ID"])
df_test1 = pd.get_dummies(df_test1, columns=["var_2", "var_5", "var_7", "var_9"])
df_test1

Unnamed: 0,var_3,var_4,var_6,var_8,var_2_0.0,var_2_1.0,var_5_1.0,var_5_2.0,var_5_3.0,var_5_4.0,...,var_7_1.0,var_7_2.0,var_7_3.0,var_9_1.0,var_9_2.0,var_9_3.0,var_9_4.0,var_9_5.0,var_9_6.0,var_9_7.0
0,36.0,1.0,1.0,2.0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1,41.0,1.0,0.0,1.0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,21.0,1.0,9.0,4.0,1,0,1,0,0,0,...,1,0,0,0,0,1,0,0,0,0
3,40.0,1.0,2.0,4.0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,74.0,1.0,0.0,1.0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609,21.0,1.0,0.0,7.0,1,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
1610,34.0,1.0,3.0,4.0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1611,32.0,1.0,1.0,1.0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1612,26.0,1.0,1.0,4.0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0


In [235]:
# X_pred = df_test1.to_numpy()
assert df_test1.shape[1] == X_train.shape[1], f"Expected {X_train.shape[1]} but got {df_test1.shape[1]}"
df_test1

Unnamed: 0,var_3,var_4,var_6,var_8,var_2_0.0,var_2_1.0,var_5_1.0,var_5_2.0,var_5_3.0,var_5_4.0,...,var_7_1.0,var_7_2.0,var_7_3.0,var_9_1.0,var_9_2.0,var_9_3.0,var_9_4.0,var_9_5.0,var_9_6.0,var_9_7.0
0,36.0,1.0,1.0,2.0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1,41.0,1.0,0.0,1.0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,21.0,1.0,9.0,4.0,1,0,1,0,0,0,...,1,0,0,0,0,1,0,0,0,0
3,40.0,1.0,2.0,4.0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
4,74.0,1.0,0.0,1.0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1609,21.0,1.0,0.0,7.0,1,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0
1610,34.0,1.0,3.0,4.0,1,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
1611,32.0,1.0,1.0,1.0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
1612,26.0,1.0,1.0,4.0,1,0,1,0,0,0,...,1,0,0,0,0,0,0,0,1,0
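The assertion above checks only the number of columns. Because `pd.get_dummies` is applied to the training and test data separately, a category that is absent from one set would produce different or differently ordered columns. One way to guard against this, sketched on made-up data, is to reindex the test columns to the training columns:

```python
import pandas as pd

# Toy example: category 2.0 never appears in the test data.
train_cat = pd.DataFrame({"var_9": [1.0, 2.0, 3.0]})
test_cat = pd.DataFrame({"var_9": [1.0, 3.0, 3.0]})

train_enc = pd.get_dummies(train_cat, columns=["var_9"])
test_enc = pd.get_dummies(test_cat, columns=["var_9"])

# Reindex to the training columns: dummies missing from the test set are
# added as zeros, extras are dropped, and the column order matches training.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

In the notebook this would correspond to `df_test1.reindex(columns=X_train.columns, fill_value=0)` before calling `predict`.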


In [236]:
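# `pipe` is the last pipeline fitted in the comparison loop
# (MinMaxScaler + GradientBoostingClassifier), trained on the 80% split.
pred = pipe.predict(df_test1)
pred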

array([4, 1, 4, ..., 4, 4, 1], dtype=int64)

In [237]:
df_pred = pd.DataFrame(data={"ID": df_test["ID"], "class": pred})
df_pred

Unnamed: 0,ID,class
0,6455,4
1,6456,1
2,6457,4
3,6458,2
4,6459,4
...,...,...
1609,8064,4
1610,8065,4
1611,8066,4
1612,8067,4


In [238]:
df_pred.to_csv("submission.csv", index=False)