- An automobile company plans to enter new markets with its existing products (P1, P2, P3, P4 and P5). After intensive market research, the company has concluded that the new market behaves similarly to its existing market.

- In the existing market, the sales team classified all customers into 4 segments (A, B, C, D) and then ran segmented outreach and communication for each segment. This strategy has worked exceptionally well. The company plans to use the same strategy in the new markets and has identified 2627 new potential customers.

- You are required to help the manager predict the right segment for each new customer.

### Data Description 
|Variable         |Definition                                                            |
|-----------------|----------------------------------------------------------------------|
|ID               |Unique ID                                                             |
|Gender           |Gender of the customer                                                |
|Ever_Married     |Marital status of the customer                                        |
|Age              |Age of the customer                                                   |
|Graduated        |Is the customer a graduate?                                           |
|Profession       |Profession of the customer                                            |
|Work_Experience  |Work experience in years                                              |
|Spending_Score   |Spending score of the customer                                        |
|Family_Size      |Number of family members for the customer (including the customer)    |
|Var_1            |Anonymised category for the customer                                  |
|Segmentation     |(target) Customer segment of the customer                             |

### sample_submission.csv

|Variable         |Definition                                                            |
|-----------------|----------------------------------------------------------------------|
|ID               |Unique ID                                                             |
|Segmentation     |Predicted segment for customers in the test set                       |

In [1]:
# Read data
import numpy as np                           # Linear Algebra (calculate the mean and standard deviation)
import pandas as pd                          # manipulate data, data processing, load csv file I/O (e.g. pd.read_csv)

# Visualization
import matplotlib.pyplot as plt              # Visualization using matplotlib
%matplotlib inline
import seaborn as sns                        # Visualization using seaborn

# style
plt.style.use("fivethirtyeight")             # Set Graphs Background style using matplotlib
sns.set_style("darkgrid")                    # Set Graphs Background style using seaborn

import warnings                              # To ignore any warnings
warnings.filterwarnings("ignore")

In [70]:
# ML model building; Pre Processing & Evaluation
from sklearn.model_selection import train_test_split                     # split  data into training and testing sets
from sklearn.linear_model import LogisticRegression                      # LogisticRegression
from sklearn.tree import DecisionTreeClassifier                          # Decision tree Classifier
from sklearn.ensemble import RandomForestClassifier                      # this will make a Random Forest Classifier
import xgboost
from xgboost import XGBClassifier                                        # XGBoost Classifier
import lightgbm as lgb
from sklearn.preprocessing import StandardScaler                         # standardize features (zero mean, unit variance)
from sklearn.metrics import confusion_matrix, classification_report      # confusion matrix and per-class precision/recall
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV     # hyperparameter search with cross-validation

In [3]:
# Read train and test dataset
train = pd.read_csv("Train_aBjfeNk.csv")
test = pd.read_csv("Test_LqhgPWU.csv")
submission = pd.read_csv("sample_submission_wyi0h0z.csv")

In [4]:
# Display the first 5 rows of each dataset
display(train.head())
display(test.head())
display(submission.head())

Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,462809,Male,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,462643,Female,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,466315,Female,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,461735,Male,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,462669,Female,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


Unnamed: 0,ID,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,458989,Female,Yes,36,Yes,Engineer,0.0,Low,1.0,Cat_6
1,458994,Male,Yes,37,Yes,Healthcare,8.0,Average,4.0,Cat_6
2,458996,Female,Yes,69,No,,0.0,Low,1.0,Cat_6
3,459000,Male,Yes,59,No,Executive,11.0,High,2.0,Cat_6
4,459001,Female,No,19,No,Marketing,,Low,4.0,Cat_6


Unnamed: 0,ID,Segmentation
0,458989,A
1,458994,A
2,458996,A
3,459000,A
4,459001,A


In [5]:
# checking dimension (num of rows and columns) of dataset
print("Training data shape (Rows, Columns):",train.shape)
print("Test data shape (Rows, Columns):",test.shape)

Training data shape (Rows, Columns): (8068, 11)
Test data shape (Rows, Columns): (2627, 10)


In [6]:
train_original=train.copy() 
test_original=test.copy()

In [7]:
# check dataframe structure like columns and its counts, datatypes & Null Values
display(train.info())
display(test.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8068 entries, 0 to 8067
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               8068 non-null   int64  
 1   Gender           8068 non-null   object 
 2   Ever_Married     7928 non-null   object 
 3   Age              8068 non-null   int64  
 4   Graduated        7990 non-null   object 
 5   Profession       7944 non-null   object 
 6   Work_Experience  7239 non-null   float64
 7   Spending_Score   8068 non-null   object 
 8   Family_Size      7733 non-null   float64
 9   Var_1            7992 non-null   object 
 10  Segmentation     8068 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 693.5+ KB


None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2627 entries, 0 to 2626
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   ID               2627 non-null   int64  
 1   Gender           2627 non-null   object 
 2   Ever_Married     2577 non-null   object 
 3   Age              2627 non-null   int64  
 4   Graduated        2603 non-null   object 
 5   Profession       2589 non-null   object 
 6   Work_Experience  2358 non-null   float64
 7   Spending_Score   2627 non-null   object 
 8   Family_Size      2514 non-null   float64
 9   Var_1            2595 non-null   object 
dtypes: float64(2), int64(2), object(6)
memory usage: 205.4+ KB


None

In [8]:
display(train.dtypes.value_counts())
display(test.dtypes.value_counts())

object     7
int64      2
float64    2
dtype: int64

object     6
int64      2
float64    2
dtype: int64

In [9]:
# Number of non-null values in each column
display(train.count())
display(test.count())

ID                 8068
Gender             8068
Ever_Married       7928
Age                8068
Graduated          7990
Profession         7944
Work_Experience    7239
Spending_Score     8068
Family_Size        7733
Var_1              7992
Segmentation       8068
dtype: int64

ID                 2627
Gender             2627
Ever_Married       2577
Age                2627
Graduated          2603
Profession         2589
Work_Experience    2358
Spending_Score     2627
Family_Size        2514
Var_1              2595
dtype: int64

In [10]:
train.drop('ID', axis=1, inplace=True)
test.drop('ID', axis=1, inplace=True)

### Missing Values

In [11]:
display(train.isnull().sum())
display(test.isnull().sum())

Gender               0
Ever_Married       140
Age                  0
Graduated           78
Profession         124
Work_Experience    829
Spending_Score       0
Family_Size        335
Var_1               76
Segmentation         0
dtype: int64

Gender               0
Ever_Married        50
Age                  0
Graduated           24
Profession          38
Work_Experience    269
Spending_Score       0
Family_Size        113
Var_1               32
dtype: int64
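
Raw counts are hard to compare between datasets of different sizes, so it can help to view them as percentages as well. The following is a minimal sketch on a toy frame; `toy` and `missing_summary` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are illustrative only
toy = pd.DataFrame({
    "Age": [22, 38, 67, 40],
    "Work_Experience": [1.0, np.nan, 1.0, np.nan],
    "Family_Size": [4.0, 3.0, np.nan, 6.0],
})

def missing_summary(df):
    """Return missing-value counts and percentages per column, largest first."""
    counts = df.isnull().sum()
    percent = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    return summary.sort_values("missing", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling the same helper on `train` and `test` would put both missing-value profiles on a comparable scale.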

## Converting Categorical features into Numerical

### 1. Gender

In [12]:
display(train.Gender.value_counts())
display(test.Gender.value_counts())

Male      4417
Female    3651
Name: Gender, dtype: int64

Male      1424
Female    1203
Name: Gender, dtype: int64

In [13]:
train['Gender'] = train['Gender'].map({'Male':1, 'Female':0})
test['Gender'] = test['Gender'].map({'Male':1, 'Female':0})

In [14]:
display(train.head())
display(test.head())

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,1,No,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,0,Yes,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,0,Yes,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,1,Yes,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,0,Yes,40,Yes,Entertainment,,High,6.0,Cat_6,A


Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,Yes,36,Yes,Engineer,0.0,Low,1.0,Cat_6
1,1,Yes,37,Yes,Healthcare,8.0,Average,4.0,Cat_6
2,0,Yes,69,No,,0.0,Low,1.0,Cat_6
3,1,Yes,59,No,Executive,11.0,High,2.0,Cat_6
4,0,No,19,No,Marketing,,Low,4.0,Cat_6


### 2. Ever_Married

In [15]:
display(train['Ever_Married'].value_counts())
display(test['Ever_Married'].value_counts())

Yes    4643
No     3285
Name: Ever_Married, dtype: int64

Yes    1520
No     1057
Name: Ever_Married, dtype: int64

In [16]:
train['Ever_Married'] = train['Ever_Married'].fillna(train['Ever_Married'].mode()[0])
test['Ever_Married'] = test['Ever_Married'].fillna(train['Ever_Married'].mode()[0])

In [17]:
display(train['Ever_Married'].isnull().sum())
display(test['Ever_Married'].isnull().sum())

0

0

In [18]:
train['Ever_Married'] = train['Ever_Married'].map({'Yes':1, 'No':0})
test['Ever_Married'] = test['Ever_Married'].map({'Yes':1, 'No':0})
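
The same impute-then-encode pattern is repeated for `Graduated` below, so it could be wrapped in a small helper. This is only a sketch on toy data: `impute_and_encode`, `toy_train` and `toy_test` are hypothetical names, and the helper always takes the fill value from the training column.

```python
import numpy as np
import pandas as pd

# Toy columns standing in for train['Ever_Married'] and test['Ever_Married']
toy_train = pd.DataFrame({"Ever_Married": ["Yes", "No", "Yes", np.nan]})
toy_test = pd.DataFrame({"Ever_Married": [np.nan, "No"]})

def impute_and_encode(train_col, test_col, mapping):
    """Fill missing values with the training mode, then map labels to integers."""
    mode_value = train_col.mode()[0]            # most frequent training label
    train_out = train_col.fillna(mode_value).map(mapping)
    test_out = test_col.fillna(mode_value).map(mapping)
    return train_out, test_out

yes_no = {"Yes": 1, "No": 0}
train_enc, test_enc = impute_and_encode(
    toy_train["Ever_Married"], toy_test["Ever_Married"], yes_no
)
print(train_enc.tolist(), test_enc.tolist())
```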

In [19]:
display(train.head())
display(test.head())

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,1,0,22,No,Healthcare,1.0,Low,4.0,Cat_4,D
1,0,1,38,Yes,Engineer,,Average,3.0,Cat_4,A
2,0,1,67,Yes,Engineer,1.0,Low,1.0,Cat_6,B
3,1,1,67,Yes,Lawyer,0.0,High,2.0,Cat_6,B
4,0,1,40,Yes,Entertainment,,High,6.0,Cat_6,A


Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,1,36,Yes,Engineer,0.0,Low,1.0,Cat_6
1,1,1,37,Yes,Healthcare,8.0,Average,4.0,Cat_6
2,0,1,69,No,,0.0,Low,1.0,Cat_6
3,1,1,59,No,Executive,11.0,High,2.0,Cat_6
4,0,0,19,No,Marketing,,Low,4.0,Cat_6


### 3. Graduated

In [20]:
display(train['Graduated'].value_counts())
display(test['Graduated'].value_counts())

Yes    4968
No     3022
Name: Graduated, dtype: int64

Yes    1602
No     1001
Name: Graduated, dtype: int64

In [21]:
train['Graduated'] = train['Graduated'].fillna(train['Graduated'].mode()[0])
test['Graduated'] = test['Graduated'].fillna(train['Graduated'].mode()[0])

In [22]:
display(train['Graduated'].isnull().sum())
display(test['Graduated'].isnull().sum())

0

0

In [23]:
train['Graduated'] = train['Graduated'].map({'Yes':1, 'No':0})
test['Graduated'] = test['Graduated'].map({'Yes':1, 'No':0})

In [24]:
display(train.head())
display(test.head())

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,1,0,22,0,Healthcare,1.0,Low,4.0,Cat_4,D
1,0,1,38,1,Engineer,,Average,3.0,Cat_4,A
2,0,1,67,1,Engineer,1.0,Low,1.0,Cat_6,B
3,1,1,67,1,Lawyer,0.0,High,2.0,Cat_6,B
4,0,1,40,1,Entertainment,,High,6.0,Cat_6,A


Unnamed: 0,Gender,Ever_Married,Age,Graduated,Profession,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,1,36,1,Engineer,0.0,Low,1.0,Cat_6
1,1,1,37,1,Healthcare,8.0,Average,4.0,Cat_6
2,0,1,69,0,,0.0,Low,1.0,Cat_6
3,1,1,59,0,Executive,11.0,High,2.0,Cat_6
4,0,0,19,0,Marketing,,Low,4.0,Cat_6


### 4. Profession

In [25]:
display(train['Profession'].value_counts())
display(test['Profession'].value_counts())

Artist           2516
Healthcare       1332
Entertainment     949
Engineer          699
Doctor            688
Lawyer            623
Executive         599
Marketing         292
Homemaker         246
Name: Profession, dtype: int64

Artist           802
Healthcare       418
Entertainment    301
Doctor           242
Engineer         236
Lawyer           221
Executive        176
Marketing        111
Homemaker         82
Name: Profession, dtype: int64

In [26]:
from sklearn.preprocessing import LabelEncoder

In [27]:
# One-hot encode Profession; drop_first=True drops the first level (Artist) as the baseline
Prof_Dummies_train = pd.get_dummies(train['Profession'], drop_first=True)
Prof_Dummies_test = pd.get_dummies(test['Profession'], drop_first=True)

In [28]:
display(Prof_Dummies_train.head())
display(Prof_Dummies_test.head())

Unnamed: 0,Doctor,Engineer,Entertainment,Executive,Healthcare,Homemaker,Lawyer,Marketing
0,0,0,0,0,1,0,0,0
1,0,1,0,0,0,0,0,0
2,0,1,0,0,0,0,0,0
3,0,0,0,0,0,0,1,0
4,0,0,1,0,0,0,0,0


Unnamed: 0,Doctor,Engineer,Entertainment,Executive,Healthcare,Homemaker,Lawyer,Marketing
0,0,1,0,0,0,0,0,0
1,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0
3,0,0,0,1,0,0,0,0
4,0,0,0,0,0,0,0,1


In [29]:
train.drop('Profession', axis=1, inplace=True)
test.drop('Profession', axis=1, inplace=True)
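
Two details of this encoding are worth keeping in mind. `Profession` still contains missing values when `get_dummies` is called, and by default those rows become all zeros, which makes them look identical to the dropped baseline category. Also, encoding train and test separately only gives matching columns when both contain the same set of categories. The sketch below shows one possible way to handle both points on toy data; the `"Unknown"` label and the `toy_*` names are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy columns standing in for train['Profession'] and test['Profession']
toy_train = pd.Series(["Artist", "Doctor", "Lawyer", np.nan], name="Profession")
toy_test = pd.Series(["Doctor", "Artist"], name="Profession")

# Give missing values their own explicit category before encoding
train_dummies = pd.get_dummies(toy_train.fillna("Unknown"), dtype=int)
test_dummies = pd.get_dummies(toy_test.fillna("Unknown"), dtype=int)

# Align the test columns to the training columns; categories absent from
# the test data are added as all-zero columns
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies)
print(test_dummies)
```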

### 5. Work_Experience

In [30]:
display(train['Work_Experience'].value_counts())
display(test['Work_Experience'].value_counts())

1.0     2354
0.0     2318
9.0      474
8.0      463
2.0      286
3.0      255
4.0      253
6.0      204
7.0      196
5.0      194
10.0      53
11.0      50
12.0      48
13.0      46
14.0      45
Name: Work_Experience, dtype: int64

1.0     773
0.0     769
8.0     149
9.0     139
4.0      93
2.0      87
3.0      82
5.0      76
6.0      61
7.0      60
14.0     21
11.0     14
12.0     12
10.0     11
13.0     11
Name: Work_Experience, dtype: int64

In [31]:
train['Work_Experience'].mean()

2.641663213150988

In [32]:
test['Work_Experience'].mean()

2.552586938083121

In [33]:
# Fill missing values with a constant close to the mean (train: 3, test: 2)
train['Work_Experience'] = train['Work_Experience'].fillna(3)
test['Work_Experience'] = test['Work_Experience'].fillna(2)
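
The two fill values above differ even though both means (2.64 and 2.55) round to 3. One way to keep train and test consistent is to learn the fill value from the training data only and reuse it for the test data. The sketch below does this with scikit-learn's `SimpleImputer` on toy data; choosing the median rather than the mean is an assumption made here because the distribution is concentrated at 0 and 1.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train and test (illustrative values only)
toy_train = pd.DataFrame({"Work_Experience": [1.0, 0.0, np.nan, 9.0, 1.0]})
toy_test = pd.DataFrame({"Work_Experience": [np.nan, 8.0]})

# Learn the fill value from the training data only
imputer = SimpleImputer(strategy="median")
toy_train["Work_Experience"] = imputer.fit_transform(
    toy_train[["Work_Experience"]]
).ravel()

# Reuse the same training statistic for the test data
toy_test["Work_Experience"] = imputer.transform(
    toy_test[["Work_Experience"]]
).ravel()

print("fill value:", imputer.statistics_[0])
print(toy_test)
```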

In [34]:
display(train['Work_Experience'].isnull().sum())
display(test['Work_Experience'].isnull().sum())

0

0

In [35]:
display(train.head())
display(test.head())

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,1,0,22,0,1.0,Low,4.0,Cat_4,D
1,0,1,38,1,3.0,Average,3.0,Cat_4,A
2,0,1,67,1,1.0,Low,1.0,Cat_6,B
3,1,1,67,1,0.0,High,2.0,Cat_6,B
4,0,1,40,1,3.0,High,6.0,Cat_6,A


Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,1,36,1,0.0,Low,1.0,Cat_6
1,1,1,37,1,8.0,Average,4.0,Cat_6
2,0,1,69,0,0.0,Low,1.0,Cat_6
3,1,1,59,0,11.0,High,2.0,Cat_6
4,0,0,19,0,2.0,Low,4.0,Cat_6


### 6. Spending_Score

In [36]:
display(train['Spending_Score'].value_counts())
display(test['Spending_Score'].value_counts())

Low        4878
Average    1974
High       1216
Name: Spending_Score, dtype: int64

Low        1616
Average     625
High        386
Name: Spending_Score, dtype: int64

In [37]:
train['Spending_Score'] = train['Spending_Score'].map({'Low':0, 'Average':1, 'High':2})
test['Spending_Score'] = test['Spending_Score'].map({'Low':0, 'Average':1, 'High':2})
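
The mapping above treats `Spending_Score` as ordinal (Low < Average < High). An alternative sketch uses an ordered pandas categorical, which makes the ordering explicit and flags unexpected labels with the code -1, whereas `map` would silently produce NaN. The `levels` and `toy` names are illustrative.

```python
import pandas as pd

# Explicit ordering of the spending levels
levels = ["Low", "Average", "High"]

toy = pd.Series(["Low", "High", "Average", "Low"])
codes = pd.Categorical(toy, categories=levels, ordered=True).codes
print(list(codes))

# A label outside the known levels is encoded as -1
unknown_codes = pd.Categorical(["Medium"], categories=levels, ordered=True).codes
print(list(unknown_codes))
```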

In [38]:
display(train.head())
display(test.head())

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,1,0,22,0,1.0,0,4.0,Cat_4,D
1,0,1,38,1,3.0,1,3.0,Cat_4,A
2,0,1,67,1,1.0,0,1.0,Cat_6,B
3,1,1,67,1,0.0,2,2.0,Cat_6,B
4,0,1,40,1,3.0,2,6.0,Cat_6,A


Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,1,36,1,0.0,0,1.0,Cat_6
1,1,1,37,1,8.0,1,4.0,Cat_6
2,0,1,69,0,0.0,0,1.0,Cat_6
3,1,1,59,0,11.0,2,2.0,Cat_6
4,0,0,19,0,2.0,0,4.0,Cat_6


### 7. Family_Size

In [39]:
display(train['Family_Size'].value_counts())
display(test['Family_Size'].value_counts())

2.0    2390
3.0    1497
1.0    1453
4.0    1379
5.0     612
6.0     212
7.0      96
8.0      50
9.0      44
Name: Family_Size, dtype: int64

2.0    768
1.0    512
3.0    455
4.0    444
5.0    200
6.0     78
7.0     26
9.0     16
8.0     15
Name: Family_Size, dtype: int64

In [40]:
train['Family_Size'].median()

3.0

In [41]:
test['Family_Size'].median()

2.0

In [42]:
# Fill missing values with each dataset's own median (train: 3, test: 2)
train['Family_Size'] = train['Family_Size'].fillna(3)
test['Family_Size'] = test['Family_Size'].fillna(2)

In [43]:
display(train['Family_Size'].isnull().sum())
display(test['Family_Size'].isnull().sum())

0

0

In [44]:
display(train.head())
display(test.head())

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1,Segmentation
0,1,0,22,0,1.0,0,4.0,Cat_4,D
1,0,1,38,1,3.0,1,3.0,Cat_4,A
2,0,1,67,1,1.0,0,1.0,Cat_6,B
3,1,1,67,1,0.0,2,2.0,Cat_6,B
4,0,1,40,1,3.0,2,6.0,Cat_6,A


Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Var_1
0,0,1,36,1,0.0,0,1.0,Cat_6
1,1,1,37,1,8.0,1,4.0,Cat_6
2,0,1,69,0,0.0,0,1.0,Cat_6
3,1,1,59,0,11.0,2,2.0,Cat_6
4,0,0,19,0,2.0,0,4.0,Cat_6


### 8. Var_1

In [45]:
display(train['Var_1'].value_counts())
display(test['Var_1'].value_counts())

Cat_6    5238
Cat_4    1089
Cat_3     822
Cat_2     422
Cat_7     203
Cat_1     133
Cat_5      85
Name: Var_1, dtype: int64

Cat_6    1672
Cat_4     386
Cat_3     267
Cat_2     141
Cat_7      66
Cat_1      34
Cat_5      29
Name: Var_1, dtype: int64

In [46]:
train['Var_1'] = train['Var_1'].fillna(train['Var_1'].mode()[0])
test['Var_1'] = test['Var_1'].fillna(train['Var_1'].mode()[0])

In [47]:
display(train['Var_1'].isnull().sum())
display(test['Var_1'].isnull().sum())

0

0

In [48]:
Var_Dummies_train = pd.get_dummies(train['Var_1'], drop_first=True)
Var_Dummies_test = pd.get_dummies(test['Var_1'], drop_first=True)

In [49]:
display(Var_Dummies_train.head())
display(Var_Dummies_test.head())

Unnamed: 0,Cat_2,Cat_3,Cat_4,Cat_5,Cat_6,Cat_7
0,0,0,1,0,0,0
1,0,0,1,0,0,0
2,0,0,0,0,1,0
3,0,0,0,0,1,0
4,0,0,0,0,1,0


Unnamed: 0,Cat_2,Cat_3,Cat_4,Cat_5,Cat_6,Cat_7
0,0,0,0,0,1,0
1,0,0,0,0,1,0
2,0,0,0,0,1,0
3,0,0,0,0,1,0
4,0,0,0,0,1,0


In [50]:
train.drop('Var_1', axis=1, inplace=True)
test.drop('Var_1', axis=1, inplace=True)

### 9. Segmentation

In [51]:
train['Segmentation'].value_counts()

D    2268
A    1972
C    1970
B    1858
Name: Segmentation, dtype: int64

In [52]:
train['Segmentation'] = train['Segmentation'].map({'A':0, 'B':1, 'C':2, 'D':3})

In [53]:
train1 = pd.concat([train ,Prof_Dummies_train, Var_Dummies_train], axis=1)
test1 = pd.concat([test, Prof_Dummies_test, Var_Dummies_test], axis=1)

In [54]:
display(train1.head())
display(test1.head())

Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Segmentation,Doctor,Engineer,...,Healthcare,Homemaker,Lawyer,Marketing,Cat_2,Cat_3,Cat_4,Cat_5,Cat_6,Cat_7
0,1,0,22,0,1.0,0,4.0,3,0,0,...,1,0,0,0,0,0,1,0,0,0
1,0,1,38,1,3.0,1,3.0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
2,0,1,67,1,1.0,0,1.0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
3,1,1,67,1,0.0,2,2.0,1,0,0,...,0,0,1,0,0,0,0,0,1,0
4,0,1,40,1,3.0,2,6.0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


Unnamed: 0,Gender,Ever_Married,Age,Graduated,Work_Experience,Spending_Score,Family_Size,Doctor,Engineer,Entertainment,...,Healthcare,Homemaker,Lawyer,Marketing,Cat_2,Cat_3,Cat_4,Cat_5,Cat_6,Cat_7
0,0,1,36,1,0.0,0,1.0,0,1,0,...,0,0,0,0,0,0,0,0,1,0
1,1,1,37,1,8.0,1,4.0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
2,0,1,69,0,0.0,0,1.0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,1,1,59,0,11.0,2,2.0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,0,0,19,0,2.0,0,4.0,0,0,0,...,0,0,0,1,0,0,0,0,1,0


<h2 style="color:blue" align="left"> 7. Model building and Evaluation </h2>

In [55]:
# Independent variables
X = train1.drop('Segmentation',axis=1)        # all rows and all columns except the target

# Dependent variable
y = train1['Segmentation']                   # target column only

In [56]:
# split  data into training and testing sets of 80:20 ratio
# 20% of test size selected
# random_state is random seed
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=4)
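
The four segments are not perfectly balanced, so a stratified split is one option for keeping the class proportions the same in both parts. The sketch below passes `stratify` to `train_test_split` on synthetic data; `X_toy`, `y_toy` and the class sizes are assumptions chosen only to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 4 classes with unequal sizes (70/50/50/30)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.repeat([0, 1, 2, 3], [70, 50, 50, 30])

# stratify=y_toy keeps the class proportions identical in both splits
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4, stratify=y_toy
)

print("train class counts:", np.bincount(y_tr))
print("validation class counts:", np.bincount(y_va))
```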

In [57]:
# shape of X & Y test / train
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(6454, 21) (1614, 21) (6454,) (1614,)


### Random Forest

In [58]:
rf = RandomForestClassifier(n_estimators=300, max_depth=200, criterion='gini')
rf.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=200, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [59]:
pred_rf = rf.predict(X_test)

In [60]:
print("Train Score {:.2f} & Test Score {:.2f}".format(rf.score(X_train, y_train), rf.score(X_test, y_test)))

Train Score 0.96 & Test Score 0.46
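
The large gap between the train and test scores suggests that the forest is overfitting. A common way to investigate is to compare an unconstrained forest with a regularized one under cross-validation. The sketch below does so on synthetic data; the data, the `candidates` dictionary and the specific `max_depth` and `min_samples_leaf` values are assumptions, not tuned settings for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-class data standing in for X and y (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# An unconstrained forest versus one with limited depth and larger leaves
candidates = {
    "unconstrained": RandomForestClassifier(n_estimators=50, random_state=0),
    "regularized": RandomForestClassifier(
        n_estimators=50, max_depth=6, min_samples_leaf=5, random_state=0
    ),
}

cv_means = {}
for name, model in candidates.items():
    # cross_val_score uses accuracy by default for classifiers
    scores = cross_val_score(model, X_toy, y_toy, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: mean CV accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```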


### Logistic Regression

In [61]:
LogReg = LogisticRegression(C=5.0, max_iter=50, solver='lbfgs', multi_class='multinomial', penalty='l2')
LogReg.fit(X_train, y_train)

LogisticRegression(C=5.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=50, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [62]:
pred_LogReg = LogReg.predict(X_test)

In [63]:
print("Train Score {:.2f} & Test Score {:.2f}".format(LogReg.score(X_train, y_train), LogReg.score(X_test, y_test)))

Train Score 0.47 & Test Score 0.45
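
`StandardScaler` is imported at the top of the notebook but never applied. With `Age` on a much larger scale than the binary features and only 50 iterations, the `lbfgs` solver may stop before converging, and any such warning is hidden by the warnings filter. The sketch below shows scaling and logistic regression combined in a pipeline on synthetic data; the data and the `max_iter=1000` value are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for X and y (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)
X_toy[:, 0] = X_toy[:, 0] * 15 + 45     # put one feature on an Age-like scale

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4
)

# The scaler is fitted on the training part only and reused at prediction time
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

score = pipe.score(X_va, y_va)
print(f"validation accuracy: {score:.2f}")
```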


### XGBOOST

In [64]:
reg_xgb = xgboost.XGBClassifier(objective='multi:softprob', gamma=0.4, colsample_bytree=0.5,
                                learning_rate=0.1, max_depth=6, min_child_weight=7)
reg_xgb.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.5, gamma=0.4, gpu_id=-1,
       importance_type='gain', interaction_constraints='',
       learning_rate=0.1, max_delta_step=0, max_depth=6,
       min_child_weight=7, missing=nan, monotone_constraints='()',
       n_estimators=100, n_jobs=0, num_parallel_tree=1,
       objective='multi:softprob', random_state=0, reg_alpha=0,
       reg_lambda=1, scale_pos_weight=None, subsample=1,
       tree_method='exact', validate_parameters=1, verbosity=None)

In [65]:
# predicting X_test
y_pred_xgb = reg_xgb.predict(X_test)

In [66]:
print("Train Score {:.2f} & Test Score {:.2f}".format(reg_xgb.score(X_train,y_train),reg_xgb.score(X_test,y_test)))

Train Score 0.61 & Test Score 0.51
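
`GridSearchCV` is imported at the top of the notebook but not used. The sketch below shows how a small cross-validated grid search could be set up. To keep it self-contained, scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier`, and the data and the grid values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for X_train and y_train (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# Small illustrative grid; real searches usually cover more values
param_grid = {"max_depth": [2, 4], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print("best parameters:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```

Because every candidate is scored by cross-validation on the training data, the held-out test split stays untouched until the final evaluation.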


### LGBM

In [68]:
params = {}
params['learning_rate'] = 0.04
params['max_depth'] = 18
params['n_estimators'] = 3000
params['objective'] = 'multiclass'
params['boosting_type'] = 'gbdt'
params['subsample'] = 0.7
params['random_state'] = 42
params['colsample_bytree'] =0.7
params['min_data_in_leaf'] = 55
params['reg_alpha'] = 1.7
params['reg_lambda'] = 1.11
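# The original line used ':' instead of '=', which Python treats as a bare annotation,
# so no class weights were set (the fitted model below reports class_weight=None):
# params['class_weight'] = {0: 0.44, 1: 0.4, 2: 0.37}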
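
The class-weight dictionary in the cell above lists three codes, while the target has four (0 to 3). If class weights are wanted, one option is to derive "balanced" weights from the class counts. The sketch below uses scikit-learn's `compute_class_weight` with the segment counts shown earlier for the full training set; in practice the counts of `y_train` would be the natural input, and `y_counts` is an illustrative name.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Segment counts from the full training set (A=0, B=1, C=2, D=3)
counts = {0: 1972, 1: 1858, 2: 1970, 3: 2268}

# Rebuild a label vector with those counts
y_counts = np.repeat(list(counts.keys()), list(counts.values()))
classes = np.array(sorted(counts))

# balanced weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_counts)
class_weight = dict(zip(classes.tolist(), weights.round(3).tolist()))
print(class_weight)
```

A dictionary of this form, with one entry per class, could then be assigned to `params['class_weight']`.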

In [71]:
# class_weight='balanced'
# from lightgbm import LGBMClassifier

clf = lgb.LGBMClassifier(**params)
clf.fit(X_train, y_train)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=0.7,
        importance_type='split', learning_rate=0.04, max_depth=18,
        min_child_samples=20, min_child_weight=0.001, min_data_in_leaf=55,
        min_split_gain=0.0, n_estimators=3000, n_jobs=-1, num_leaves=31,
        objective='multiclass', random_state=42, reg_alpha=1.7,
        reg_lambda=1.11, silent=True, subsample=0.7,
        subsample_for_bin=200000, subsample_freq=0)

In [72]:
y_pred_LGBM = clf.predict(X_test)

In [73]:
print("Train Score {:.2f} & Test Score {:.2f}".format(clf.score(X_train,y_train), clf.score(X_test,y_test)))

Train Score 0.66 & Test Score 0.50
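
Accuracy alone does not show which segments are being confused with each other. `confusion_matrix` and `classification_report` are imported at the top of the notebook and could be used for that. The sketch below runs them on synthetic data with a random forest as a stand-in model; in the notebook the inputs would be `y_test` and a prediction array such as `y_pred_LGBM`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 4-class data and a stand-in model (illustrative only)
X_toy, y_toy = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=4
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
y_hat = model.predict(X_va)

codes = [0, 1, 2, 3]                  # encoded segments
names = ["A", "B", "C", "D"]          # original segment labels

# Rows are true segments, columns are predicted segments
cm = confusion_matrix(y_va, y_hat, labels=codes)
report = classification_report(
    y_va, y_hat, labels=codes, target_names=names, zero_division=0
)

print(cm)
print(report)
```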


### Submission

In [69]:
# predict segments for the competition test set with the LightGBM model fitted above (clf)
y_pred_test = clf.predict(test1)

In [71]:
#submission = pd.read_csv("sample_submission_49d68Cx.csv")

submission['Segmentation'] = y_pred_test                # predicted segment codes (0-3)
submission['ID'] = test_original['ID']                  # customer IDs from the original test set

# map the numeric codes back to the segment labels A-D
submission['Segmentation'] = submission['Segmentation'].map({0: 'A', 1: 'B', 2: 'C', 3: 'D'})

# Write the submission file; index=False keeps the row index out of the CSV
# (note: despite the file name, these predictions come from the LightGBM model)
pd.DataFrame(submission, columns=['ID','Segmentation']).to_csv('XGBOOST.csv', index=False)