# Customer Segmentation Model
### Context
Market segmentation helps businesses align their strategy and tactics with their customers and target them more precisely. Every customer and every customer journey is different, so a single approach rarely works for everyone. This is where customer segmentation becomes valuable.

Customer segmentation is the process of dividing customers into groups based on common characteristics, such as demographics or behaviours, so that you can market to each group more effectively.

### About Data
A telco company is planning to introduce several new products to its existing customers. Market research conducted by its marketing department is promising and suggests that these new products can be marketed according to different customer segments.

In this market research, the marketing department classified all customers into 4 segments (Class 1, Class 2, Class 3 and Class 4). Marketing campaigns were run for these customer segments and the ROI was good. The same marketing campaigns will be used on 1614 potential new customers.

You are required to use machine learning models to predict the correct segment for each new customer.

### Datasets
train.csv - the training set.

test.csv - the test set. The task is to predict the correct label.

sample_submission.csv - a sample submission file in the correct format

### Columns
- var_1 to var_9 - variables related to customer profiles
    - var_1, var_2, var_4, var_5, var_7, var_9 - categorical variables
    - var_3, var_6, var_8 - integer variables (with ordering)
- class - customer segments
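
The column types listed above can be made explicit in pandas. The following is a minimal sketch, not the notebook's own code: the helper name `apply_column_types` and the toy rows are assumptions for illustration, and the real data comes from `train.csv`.

```python
import pandas as pd

# Column groups as described in the list above.
CATEGORICAL_COLS = ["var_1", "var_2", "var_4", "var_5", "var_7", "var_9"]
ORDERED_INT_COLS = ["var_3", "var_6", "var_8"]


def apply_column_types(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the categorical columns cast to 'category' dtype."""
    typed = frame.copy()
    for col in CATEGORICAL_COLS:
        if col in typed.columns:
            typed[col] = typed[col].astype("category")
    return typed


# Hypothetical toy rows, only to demonstrate the call.
toy = pd.DataFrame({
    "var_1": [1, 1, 1], "var_2": [0, 1, 1], "var_3": [21, 37, 66],
    "var_4": [1, 1, 1], "var_5": [1.0, 2.0, None], "var_6": [1.0, None, 0.0],
    "var_7": [1, 2, 3], "var_8": [4.0, 3.0, 1.0], "var_9": [4.0, 4.0, 6.0],
    "class": [4, 1, 2],
})
typed_toy = apply_column_types(toy)
print(typed_toy.dtypes)
```

The ordered integer columns are left numeric here so that their ordering is preserved; whether that is the best encoding depends on the model used later.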


In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
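# pandas-profiling was renamed to ydata-profiling; fall back to the legacy name.
try:
    from ydata_profiling import ProfileReport
except ImportError:
    from pandas_profiling import ProfileReport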

In [3]:
dataset_path = "./dataset/"
train_path = os.path.join(dataset_path, "train.csv")
df = pd.read_csv(train_path)
df

Unnamed: 0,ID,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,class
0,1,1,0,21,1,1.0,1.0,1,4.0,4.0,4
1,2,1,1,37,1,2.0,,2,3.0,4.0,1
2,3,1,1,66,1,2.0,1.0,1,1.0,6.0,2
3,4,1,1,66,1,3.0,0.0,3,2.0,6.0,2
4,5,1,1,39,1,4.0,,3,6.0,6.0,1
...,...,...,...,...,...,...,...,...,...,...,...
6449,6450,1,1,68,1,6.0,0.0,1,2.0,4.0,2
6450,6451,1,0,25,1,4.0,6.0,1,2.0,6.0,4
6451,6452,1,1,35,1,5.0,8.0,1,4.0,4.0,4
6452,6453,1,1,24,1,2.0,6.0,2,2.0,6.0,1


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6454 entries, 0 to 6453
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ID      6454 non-null   int64  
 1   var_1   6454 non-null   int64  
 2   var_2   6454 non-null   int64  
 3   var_3   6454 non-null   int64  
 4   var_4   6454 non-null   int64  
 5   var_5   6358 non-null   float64
 6   var_6   5800 non-null   float64
 7   var_7   6454 non-null   int64  
 8   var_8   6190 non-null   float64
 9   var_9   6397 non-null   float64
 10  class   6454 non-null   int64  
dtypes: float64(4), int64(7)
memory usage: 554.8 KB


Null values are present in var_5 (96 missing), var_6 (654 missing), var_8 (264 missing) and var_9 (57 missing).
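
The missing-value counts can also be read directly from the frame rather than inferred from `df.info()`. A minimal sketch, assuming a hypothetical helper `missing_summary` and a small made-up frame standing in for `df`:

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy frame; in the notebook this would be called on df.
toy = pd.DataFrame({
    "var_5": [1.0, None, 3.0, 4.0],
    "var_6": [None, None, 0.0, 6.0],
    "var_7": [1, 2, 3, 1],
})
summary = missing_summary(toy)
print(summary)
```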

In [9]:
report_path = "Market Segmentation.html"
if not os.path.exists(report_path):
    profile = ProfileReport(df, title="Market Segmentation")
    profile.to_file(report_path)

### Insights from Pandas Profiling
1. There's only 1 unique value in var_1 and var_4
2. ID has 6454 unique values - same as number of samples
3. var_2 has 2 unique values - value "1" (58%) and value "0" (41%); high correlation with class
4. var_3 has 67 unique values - ranging from 17 to 88; high correlation with class
5. var_5 has 9 unique values - ranging from 1 to 9; high correlation with var_2, var_3 and var_7; 96 missing values (1.5%)
6. var_7 has 3 unique values - value "1" (61.1%), value "2" (24.1%) and value "3" (14.7%); high correlation with var_2, var_3, var_5, var_8
7. var_8 has 9 unique values - ranging from 1 to 9; high correlation with var_7; 264 missing values (4.1%)
8. var_9 has 7 unique values - ranging from 1 to 7
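
Insights 1 and 2 suggest that var_1, var_4 and ID carry no predictive information. The sketch below shows one way to detect such columns programmatically. The helper name `find_uninformative_columns` and the toy frame are assumptions, and the "all values unique" check is only a heuristic, since a genuine numeric feature could also be unique in every row.

```python
import pandas as pd


def find_uninformative_columns(frame: pd.DataFrame) -> dict:
    """Flag constant columns and columns that are unique in every row."""
    n_unique = frame.nunique(dropna=True)
    return {
        "constant": n_unique[n_unique <= 1].index.tolist(),
        "identifier": n_unique[n_unique == len(frame)].index.tolist(),
    }


# Hypothetical toy frame mirroring the structure described above.
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "var_1": [1, 1, 1, 1],
    "var_2": [0, 1, 1, 0],
    "var_3": [21, 37, 66, 66],
    "var_4": [1, 1, 1, 1],
})
flags = find_uninformative_columns(toy)
cleaned = toy.drop(columns=flags["constant"] + flags["identifier"])
print(flags)
print(cleaned.columns.tolist())
```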

**Problem type:** Classification

**Proposed baseline:** Logistic regression

**Contender models:**
- Decision tree
- Random forest
- Neural network
- Gradient boosting?
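
One way to compare the baseline against a contender is cross-validated accuracy on identical preprocessing. The following is a sketch under stated assumptions, not the notebook's final approach: the column subsets, the imputation strategies, the helper `build_pipeline`, and the randomly generated data are all illustrative choices. Because the toy labels are random, the scores it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column subsets, chosen only for this illustration.
CATEGORICAL = ["var_2", "var_7"]
NUMERIC = ["var_3", "var_8"]


def build_pipeline(model) -> Pipeline:
    """Impute and encode features, then fit the given classifier."""
    preprocess = ColumnTransformer([
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), CATEGORICAL),
        ("num", Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
        ]), NUMERIC),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])


# Synthetic stand-in data with random labels (NOT the real train.csv).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "var_2": rng.integers(0, 2, n),
    "var_7": rng.integers(1, 4, n),
    "var_3": rng.integers(17, 89, n),
    "var_8": rng.integers(1, 10, n).astype(float),
})
toy.loc[rng.choice(n, 10, replace=False), "var_8"] = np.nan
target = rng.integers(1, 5, n)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
scores = {
    name: cross_val_score(build_pipeline(model), toy, target,
                          cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
print(scores)
```

Keeping the preprocessing inside the pipeline means imputation and scaling are fitted on each training fold only, which avoids leaking information from the validation fold.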

Clean up data

Baseline Model