# Introduction

![Customer Segmentation contest cover](https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/Customer_Segmentation-thumbnail-1200x1200-90.jpg)

An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). 
After intensive market research, the company has concluded that the new market behaves much like its existing market. 

In the existing market, the sales team has classified all customers into 4 segments (A, B, C, D). 
They then ran targeted outreach and communication for each segment. 
This strategy has worked exceptionally well for them. 
They plan to use the same strategy in the new markets and have identified 2627 new potential customers. 

The task is to help the manager predict the right segment for each new customer.

The dataset contains two files: 

*   Train_aBjfeNk.csv - training set
*   Test_LqhgPWU.csv - test set 

# Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import KFold,StratifiedKFold
import warnings
warnings.filterwarnings("ignore")
import pickle

# Loading Data

In [2]:
train_data = pd.read_csv('Train_aBjfeNk.csv')
test_data = pd.read_csv('Test_LqhgPWU.csv')
sub_data = pd.read_csv('sample_submission_wyi0h0z.csv')

In [3]:
train_data.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


In [4]:
test_data.head()

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,458989,Female,Yes,36,Yes,Engineer,0.0,Low,1.0,Cat_6
1,458994,Male,Yes,37,Yes,Healthcare,8.0,Average,4.0,Cat_6
2,458996,Female,Yes,69,No,,0.0,Low,1.0,Cat_6
3,459000,Male,Yes,59,No,Executive,11.0,High,2.0,Cat_6
4,459001,Female,No,19,No,Marketing,,Low,4.0,Cat_6


In [5]:
# check sample submission format
sub_data.head()

Unnamed: 0,ID,Segmentation
0,458989,A
1,458994,A
2,458996,A
3,459000,A
4,459001,A


In [6]:
# concatenate data into df 
df = pd.concat([train_data,test_data])
df

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A
...,...,...,...,...,...,...,...,...,...,...,...
2622,467954,Male,No,29,No,Healthcare,9.0,Low,4.0,Cat_6,
2623,467958,Female,No,35,Yes,Doctor,1.0,Low,1.0,Cat_6,
2624,467960,Female,No,53,Yes,Entertainment,,Low,2.0,Cat_6,
2625,467961,Male,Yes,47,Yes,Executive,1.0,High,5.0,Cat_4,


In [7]:
X = df.iloc[:, 1:-1].values
y = train_data.iloc[:, -1].values
print(X)
print(y)

[['Male' 'No' 22 ... 'Low' 4.0 'Cat_4']
 ['Female' 'Yes' 38 ... 'Average' 3.0 'Cat_4']
 ['Female' 'Yes' 67 ... 'Low' 1.0 'Cat_6']
 ...
 ['Female' 'No' 53 ... 'Low' 2.0 'Cat_6']
 ['Male' 'Yes' 47 ... 'High' 5.0 'Cat_4']
 ['Female' 'No' 43 ... 'Low' 3.0 'Cat_7']]
['D' 'A' 'B' ... 'D' 'B' 'B']


# Taking care of missing data

In [8]:
# Taking care of Numerical missing data
imputer_num = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer_num.fit(X[:,[5,7]])
X[:,[5,7]] = imputer_num.transform(X[:,[5,7]])

In [9]:
# Taking care of Categorical missing data
imputer_cat = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputer_cat.fit(X[:,[1,3,4,8]])
X[:,[1,3,4,8]] = imputer_cat.transform(X[:,[1,3,4,8]])
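
The positional indices above are easier to follow by name. After dropping `ID`, columns 5 and 7 of `X` are `Work_Experience` and `Family_Size`, while columns 1, 3, 4 and 8 are `Ever_Married`, `Graduated`, `Profession` and `Var_1`. As a minimal sketch of the same two strategies (mean for numeric columns, most frequent value for categorical ones), here is a pandas version on a made-up toy frame. The toy values and the names `toy` and `imputed` are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy rows that reuse the dataset's column names; the values are made up.
toy = pd.DataFrame({
    "Ever_Married": ["Yes", np.nan, "Yes", "No"],
    "Profession": ["Engineer", "Engineer", np.nan, "Lawyer"],
    "Work_Experience": [1.0, np.nan, 0.0, 5.0],
    "Family_Size": [4.0, 3.0, np.nan, 2.0],
})

num_cols = ["Work_Experience", "Family_Size"]
cat_cols = ["Ever_Married", "Profession"]

imputed = toy.copy()

# Numeric columns: replace NaN with the column mean.
imputed[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Categorical columns: replace NaN with the most frequent value.
for col in cat_cols:
    imputed[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(imputed)
```

Naming the columns this way makes it harder to impute the wrong column if the column order ever changes.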

# Encoding categorical data

In [10]:
# Label-encode the binary independent variables (Gender, Ever_Married, Graduated)
le = LabelEncoder()
X[:, 0] = le.fit_transform(X[:, 0])
X[:, 1] = le.fit_transform(X[:, 1])
X[:, 3] = le.fit_transform(X[:, 3])

In [11]:
X[0]

array([1, 0, 22, 0, 'Healthcare', 1.0, 'Low', 4.0, 'Cat_4'], dtype=object)

In [12]:
# One-hot encode the independent variables with more than two categories (Profession, Spending_Score, Var_1)
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[4,6,8])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

In [13]:
X

array([[0.0, 0.0, 0.0, ..., 0, 1.0, 4.0],
       [0.0, 0.0, 1.0, ..., 1, 2.619777013650099, 3.0],
       [0.0, 0.0, 1.0, ..., 1, 1.0, 1.0],
       ...,
       [0.0, 0.0, 0.0, ..., 1, 2.619777013650099, 2.0],
       [0.0, 0.0, 0.0, ..., 1, 1.0, 5.0],
       [0.0, 0.0, 0.0, ..., 1, 9.0, 3.0]], dtype=object)

In [14]:
X[0]

array([0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0,
       0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1, 0, 22, 0, 1.0, 4.0], dtype=object)
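
The 25 values in `X[0]` follow from how `ColumnTransformer` orders its output: the one-hot columns come first, and the `remainder='passthrough'` columns are appended at the end in their original order. The sketch below shows that layout on a few made-up rows. `sparse_threshold=0` is added here only to force a dense result, and the toy array is an assumption for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up rows: [Spending_Score, Age, Var_1]
X_toy = np.array([
    ["Low", 22, "Cat_4"],
    ["High", 38, "Cat_6"],
    ["Low", 67, "Cat_6"],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0, 2])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
encoded = ct_toy.fit_transform(X_toy)

# Column order: [High, Low] for Spending_Score, [Cat_4, Cat_6] for Var_1, then Age.
print(encoded.shape)
print(encoded[0])
```

In the notebook the same ordering explains why the label-encoded and numeric columns (`1, 0, 22, 0, 1.0, 4.0`) sit at the end of `X[0]`.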

In [15]:
# Encoding the Dependent Variable
y = le.fit_transform(y)

In [16]:
y

array([3, 0, 1, ..., 3, 1, 1])
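
`LabelEncoder` assigns integer codes in sorted order of the class labels, so here A, B, C, D become 0, 1, 2, 3. Because `le` was last fitted on `y`, its `inverse_transform` method could turn predicted codes back into letters without a hand-written mapping. A small sketch with made-up labels (the `encoder`, `codes` and `decoded` names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["D", "A", "B", "C", "A"])  # made-up labels

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted labels; the position of each label is its code.
print(encoder.classes_)
print(codes)

# inverse_transform maps integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```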

# Splitting the dataset into the Training set and Test set

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X[:8068], y, test_size=0.2, random_state=101)

In [18]:
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((6454, 25), (6454,), (1614, 25), (1614,))

# Feature Scaling

In [19]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training the model on the Training set

CatBoost is a gradient-boosting library that handles categorical features natively and often performs well with little tuning, which makes it a useful option for datasets dominated by categorical columns. The cells below keep the CatBoost configuration commented out; the run shown here trains a LightGBM classifier instead.

In [20]:
classifier = LGBMClassifier(learning_rate=0.02,
                    boosting_type='gbdt', max_depth=4,  objective='multiclass', 
                    random_state=100,  
                  n_estimators=1000 ,reg_alpha=0, reg_lambda=1, n_jobs=-1)


#classifier = CatBoostClassifier(loss_function='MultiClass', 
                         #eval_metric='Accuracy', 
                         #depth=4,
                         #random_seed=42, 
                         #iterations=5000, 
                         #learning_rate=0.01,
                         #leaf_estimation_iterations=1,
                         #l2_leaf_reg=1,
                         #bootstrap_type='Bayesian', 
                         #bagging_temperature=4, 
                         #random_strength=1,
                         #od_type='Iter', 
                         #od_wait=1000)

In [21]:
# For Catboost
#classifier.fit(X_train, y_train, verbose=50,
        #use_best_model=True,
        #eval_set=[(X_train, y_train),(X_test, y_test)],
        #plot=False)

classifier.fit(X_train, y_train, verbose=50,
        eval_set=[(X_train, y_train),(X_test, y_test)])

[50]	training's multi_logloss: 1.16578	valid_1's multi_logloss: 1.18804
[100]	training's multi_logloss: 1.0881	valid_1's multi_logloss: 1.1245
[150]	training's multi_logloss: 1.04664	valid_1's multi_logloss: 1.09663
[200]	training's multi_logloss: 1.02321	valid_1's multi_logloss: 1.08401
[250]	training's multi_logloss: 1.00731	valid_1's multi_logloss: 1.07625
[300]	training's multi_logloss: 0.994663	valid_1's multi_logloss: 1.07194
[350]	training's multi_logloss: 0.984528	valid_1's multi_logloss: 1.06928
[400]	training's multi_logloss: 0.97561	valid_1's multi_logloss: 1.06772
[450]	training's multi_logloss: 0.967487	valid_1's multi_logloss: 1.06792
[500]	training's multi_logloss: 0.960532	valid_1's multi_logloss: 1.06858
[550]	training's multi_logloss: 0.954171	valid_1's multi_logloss: 1.06907
[600]	training's multi_logloss: 0.948515	valid_1's multi_logloss: 1.06967
[650]	training's multi_logloss: 0.942729	valid_1's multi_logloss: 1.07019
[700]	training's multi_logloss: 0.936949	valid_

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.02, max_depth=4,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=1000, n_jobs=-1, num_leaves=31,
               objective='multiclass', random_state=100, reg_alpha=0,
               reg_lambda=1, silent=True, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)

In [22]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

[[3 3]
 [1 2]
 [3 0]
 ...
 [2 2]
 [3 1]
 [2 2]]


In [23]:
# Making the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
accuracy_score(y_test, y_pred)

[[185  71  47  87]
 [ 93 117  87  48]
 [ 44  80 237  58]
 [107  24  14 315]]


0.5291201982651796
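
The accuracy is the sum of the diagonal of the confusion matrix divided by the total number of samples, and the same matrix also gives per-class recall. The sketch below recomputes both from the numbers printed above, assuming the scikit-learn convention that rows are true classes and columns are predicted classes, with codes 0 to 3 standing for segments A to D.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = np.array([
    [185,  71,  47,  87],
    [ 93, 117,  87,  48],
    [ 44,  80, 237,  58],
    [107,  24,  14, 315],
])

# Overall accuracy: correctly classified samples over all samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry divided by the row total.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(round(accuracy, 4))
for label, recall in zip("ABCD", recall_per_class):
    print(label, round(recall, 3))
```

Reading the rows this way shows that segment B is recovered least often, while segment D is recovered most often.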

# Applying k-Fold Cross Validation

In [24]:
test = X[8068:]

In [25]:
accuracy = []

fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

for train_index, test_index in fold.split(X[:8068],y):  # For k-fold -  fold.split(X[:8068]) 
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    
    #cat = CatBoostClassifier(loss_function='MultiClass', 
                         #eval_metric='Accuracy', 
                         #depth=6,
                         #random_seed=42, 
                         #iterations=1000, 
                         #learning_rate=0.1,
                         #leaf_estimation_iterations=1,
                         #l2_leaf_reg=1, 
                         #bootstrap_type='Bayesian', 
                         #bagging_temperature=1, 
                         #random_strength=1,
                         #od_type='Iter', 
                         #od_wait=200)
                            
    lbgm = LGBMClassifier(learning_rate=0.02,
                    boosting_type='gbdt', max_depth=4,  objective='multiclass', 
                    random_state=100,  
                  n_estimators=1000 ,reg_alpha=0, reg_lambda=1, n_jobs=-1)                       
                            
    lbgm.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=0, early_stopping_rounds=200)

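    y_pred_fold = lbgm.predict(X_test)
    print("Accuracy: ", accuracy_score(y_test,y_pred_fold))

    accuracy.append(accuracy_score(y_test,y_pred_fold))
    # p is overwritten on every iteration, so only the last fold's model produces the final test predictions
    p = lbgm.predict(test)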

Accuracy:  0.551980198019802
Accuracy:  0.5420792079207921
Accuracy:  0.5365551425030979
Accuracy:  0.540272614622057
Accuracy:  0.5254027261462205
Accuracy:  0.5204460966542751
Accuracy:  0.5365551425030979
Accuracy:  0.5613382899628253
Accuracy:  0.5540372670807453
Accuracy:  0.5639751552795031


In [26]:
np.mean(accuracy,0)

0.5432641840692416
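
The loop above uses `StratifiedKFold`, which keeps the class proportions of `y` roughly equal in every validation fold. A toy illustration with made-up labels (two classes in a 2:1 ratio; `X_toy`, `y_toy` and `fold_counts` are illustrative names):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Six samples of class 0 and three of class 1; the features do not matter for the split.
y_toy = np.array([0] * 6 + [1] * 3)
X_toy = np.zeros((len(y_toy), 1))

fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

fold_counts = []
for train_idx, val_idx in fold.split(X_toy, y_toy):
    # Count how many samples of each class land in this validation fold.
    fold_counts.append(np.bincount(y_toy[val_idx], minlength=2).tolist())

print(fold_counts)
```

Each validation fold receives two samples of class 0 and one of class 1, mirroring the overall ratio.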

In [27]:
p

array([0, 2, 1, ..., 0, 1, 3])

In [28]:
sub_data['Segmentation'] = p
sub_data

Unnamed: 0,ID,Segmentation
0,458989,0
1,458994,2
2,458996,1
3,459000,2
4,459001,3
...,...,...
2622,467954,3
2623,467958,0
2624,467960,0
2625,467961,1


In [29]:
sub_data['Segmentation'] = sub_data['Segmentation'].map({0:'A', 1:'B' , 2:'C' ,3:'D'})
sub_data                                                      

Unnamed: 0,ID,Segmentation
0,458989,A
1,458994,C
2,458996,B
3,459000,C
4,459001,D
...,...,...
2622,467954,D
2623,467958,A
2624,467960,A
2625,467961,B


In [30]:
# Count the customer IDs that appear in both the test set and the training set
len(set(test_data['ID'].unique()).intersection(set(train_data['ID'].unique())))

2332

In [31]:
train_data[train_data['ID'] == 466951]

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
3472,466951,Female,Yes,52,Yes,Artist,0.0,High,3.0,Cat_6,C


In [32]:
test_data[test_data['ID'] == 466951]

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
2324,466951,Female,Yes,52,Yes,Artist,0.0,High,3.0,Cat_6


In [33]:
# merge test_data and train_data  
merge_data = pd.merge(test_data,train_data,how='inner', on = 'ID')
merge_data

Unnamed: 0,ID,Gender_x,Ever_Married_x,Age_x,Graduated_x,Profession_x,Work_Experience_x,Spending_Score_x,Family_Size_x,Var_1_x,Gender_y,Ever_Married_y,Age_y,Graduated_y,Profession_y,Work_Experience_y,Spending_Score_y,Family_Size_y,Var_1_y,Segmentation
0,458989,Female,Yes,36,Yes,Engineer,0.0,Low,1.0,Cat_6,Female,Yes,42,Yes,Engineer,1.0,Low,1.0,Cat_6,B
1,458994,Male,Yes,37,Yes,Healthcare,8.0,Average,4.0,Cat_6,Male,Yes,38,Yes,Healthcare,8.0,Average,4.0,Cat_6,C
2,458996,Female,Yes,69,No,,0.0,Low,1.0,Cat_6,Female,Yes,71,No,,1.0,Low,1.0,Cat_6,A
3,459000,Male,Yes,59,No,Executive,11.0,High,2.0,Cat_6,Male,Yes,58,No,Executive,12.0,High,2.0,Cat_6,C
4,459001,Female,No,19,No,Marketing,,Low,4.0,Cat_6,Female,No,20,No,Marketing,,Low,4.0,Cat_6,C
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2327,467949,Male,No,21,No,Healthcare,1.0,Low,4.0,Cat_4,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
2328,467950,Female,No,35,Yes,Entertainment,1.0,Low,2.0,Cat_6,Female,No,38,Yes,Entertainment,0.0,Low,2.0,Cat_6,D
2329,467954,Male,No,29,No,Healthcare,9.0,Low,4.0,Cat_6,Male,No,31,No,Healthcare,8.0,Low,4.0,Cat_6,D
2330,467958,Female,No,35,Yes,Doctor,1.0,Low,1.0,Cat_6,Female,No,43,Yes,Doctor,0.0,Low,1.0,Cat_6,A


In [34]:
submission = pd.merge(sub_data,merge_data,how='left',on='ID')
submission

Unnamed: 0,ID,Segmentation_x,Gender_x,Ever_Married_x,Age_x,Graduated_x,Profession_x,Work_Experience_x,Spending_Score_x,Family_Size_x,...,Gender_y,Ever_Married_y,Age_y,Graduated_y,Profession_y,Work_Experience_y,Spending_Score_y,Family_Size_y,Var_1_y,Segmentation_y
0,458989,A,Female,Yes,36.0,Yes,Engineer,0.0,Low,1.0,...,Female,Yes,42.0,Yes,Engineer,1.0,Low,1.0,Cat_6,B
1,458994,C,Male,Yes,37.0,Yes,Healthcare,8.0,Average,4.0,...,Male,Yes,38.0,Yes,Healthcare,8.0,Average,4.0,Cat_6,C
2,458996,B,Female,Yes,69.0,No,,0.0,Low,1.0,...,Female,Yes,71.0,No,,1.0,Low,1.0,Cat_6,A
3,459000,C,Male,Yes,59.0,No,Executive,11.0,High,2.0,...,Male,Yes,58.0,No,Executive,12.0,High,2.0,Cat_6,C
4,459001,D,Female,No,19.0,No,Marketing,,Low,4.0,...,Female,No,20.0,No,Marketing,,Low,4.0,Cat_6,C
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2622,467954,D,Male,No,29.0,No,Healthcare,9.0,Low,4.0,...,Male,No,31.0,No,Healthcare,8.0,Low,4.0,Cat_6,D
2623,467958,A,Female,No,35.0,Yes,Doctor,1.0,Low,1.0,...,Female,No,43.0,Yes,Doctor,0.0,Low,1.0,Cat_6,A
2624,467960,A,,,,,,,,,...,,,,,,,,,,
2625,467961,B,Male,Yes,47.0,Yes,Executive,1.0,High,5.0,...,Male,Yes,45.0,Yes,Executive,1.0,High,5.0,Cat_4,B


In [35]:
sub_data['Segmentation2'] = submission['Segmentation_y']
sub_data

Unnamed: 0,ID,Segmentation,Segmentation2
0,458989,A,B
1,458994,C,C
2,458996,B,A
3,459000,C,C
4,459001,D,C
...,...,...,...
2622,467954,D,D
2623,467958,A,A
2624,467960,A,
2625,467961,B,B


In [36]:
sub_data['Segmentation2'] = sub_data['Segmentation2'].fillna('x')
sub_data

Unnamed: 0,ID,Segmentation,Segmentation2
0,458989,A,B
1,458994,C,C
2,458996,B,A
3,459000,C,C
4,459001,D,C
...,...,...,...
2622,467954,D,D
2623,467958,A,A
2624,467960,A,x
2625,467961,B,B


In [37]:
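# Where a test ID also appears in the training data, replace the prediction with the known training label
for i in range(len(sub_data)):
    if sub_data.iloc[i,2] != 'x':
        sub_data.iloc[i,1] = sub_data.iloc[i,2]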

In [38]:
sub_data

Unnamed: 0,ID,Segmentation,Segmentation2
0,458989,B,B
1,458994,C,C
2,458996,A,A
3,459000,C,C
4,459001,C,C
...,...,...,...
2622,467954,D,D
2623,467958,A,A
2624,467960,A,x
2625,467961,B,B
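
Cells 33 to 37 look up each test ID in the training data and, where a match exists, overwrite the model's prediction with the training label. The same idea can be sketched without the intermediate merges, the `'x'` placeholder, or the row loop. This is a minimal alternative on made-up frames, assuming each ID appears at most once in the training data; `train_toy`, `sub_toy`, `known` and `looked_up` are hypothetical names.

```python
import pandas as pd

# Made-up stand-ins for train_data and sub_data.
train_toy = pd.DataFrame({"ID": [1, 2, 3], "Segmentation": ["B", "C", "A"]})
sub_toy = pd.DataFrame({"ID": [2, 3, 4], "Segmentation": ["A", "A", "D"]})

# Series mapping ID -> known training label.
known = train_toy.drop_duplicates("ID").set_index("ID")["Segmentation"]

# Look up each submission ID; IDs without a match become NaN.
looked_up = sub_toy["ID"].map(known)

# Use the known label where available, otherwise keep the model's prediction.
sub_toy["Segmentation"] = looked_up.fillna(sub_toy["Segmentation"])

print(sub_toy)
```

ID 4 has no match in the toy training frame, so it keeps its predicted segment.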


In [39]:
sub_data[['ID','Segmentation']].to_csv('LBGM_TUNED.csv',index = False)

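**CatBoost**

Your private score for this submission is : 0.9499048826886494, Had it been a live contest, your rank would be : 12

public score = 0.9504

**LightGBM**

Your private score for this submission is : 0.9505389980976537, Had it been a live contest, your rank would be : 12

public score = 0.9504

The submission file written above includes the labels copied from the training data for overlapping IDs, so its leaderboard score is far higher than the cross-validation accuracy of about 0.54.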


# Saving the model to pickle

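We have done all the hard work of creating and testing the model. Rather than retraining it each time, we can save it for future use with the [pickle](https://docs.python.org/2/library/pickle.html) module. Note that `classifier` is the LightGBM model fitted on the standardized 80/20 split, so the fitted scaler `sc` would also be needed to prepare new inputs for it.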

In [40]:
filename = 'final_model.sav'
# Use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(classifier, f)

# Loading the model from pickle

In [41]:
#with open(filename, 'rb') as f:
#    saved_model = pickle.load(f)
#print(saved_model)
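
As a minimal sketch of the save-and-load round trip, the example below pickles a stand-in object rather than the trained classifier, writing to a temporary directory so it does not touch the working folder. The `stand_in_model` dictionary and the temporary path are assumptions for illustration; a fitted estimator can be dumped and loaded with the same two calls.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable Python object works the same way.
stand_in_model = {"name": "toy-model", "classes": ["A", "B", "C", "D"]}

path = os.path.join(tempfile.mkdtemp(), "final_model.sav")

# Save: open in binary write mode and dump the object.
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

# Load: open in binary read mode and restore the object.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.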