## Exploratory Data Analysis for Landslide Prevention and Innovation Dataset

### Understand The Problem Statement

The core of our project is to design and shape the future of landslide prevention and management with the example of Hong Kong.

Hong Kong is one of the hilliest and most densely populated cities in the world. It is frequently hit by extreme rainfall and is therefore highly susceptible to rain-induced landslides. A landslide is the movement of masses of rock, debris, or earth down a slope and can result in significant loss of life and property. A high-quality landslide inventory is essential not only for landslide hazard and risk analysis but also for supporting agency decisions on landslide hazard mitigation and prevention.

The common practice for identifying landslides is visual interpretation, which is labor-intensive and time-consuming.

### Type of the Problem
It is a classification problem: we have to automate landslide identification using artificial intelligence techniques.

### Load Python Packages

In [1]:
# import important modules 

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns  
plt.rcParams["axes.labelsize"] = 18
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline 
import joblib

### Load Dataset

In [28]:
# Import data
data = pd.read_csv('Train.csv')
# keep an untouched copy of the raw data for later cells
new_data = data.copy()


In [3]:
# print shape 
print('train data shape :', data.shape)

train data shape : (10864, 227)


The output above shows the number of rows and columns in the dataset.

In [4]:
# Inspect the data by showing the first five rows
data.head()

Unnamed: 0,Sample_ID,1_elevation,2_elevation,3_elevation,4_elevation,5_elevation,6_elevation,7_elevation,8_elevation,9_elevation,...,17_sdoif,18_sdoif,19_sdoif,20_sdoif,21_sdoif,22_sdoif,23_sdoif,24_sdoif,25_sdoif,Label
0,1,130,129,127,126,123,126,125,124,122,...,1.281779,1.281743,1.28172,1.281684,1.281811,1.281788,1.281752,1.281729,1.281693,0
1,2,161,158,155,153,151,162,159,155,153,...,1.359639,1.359608,1.359587,1.359556,1.359683,1.359662,1.359631,1.35961,1.359579,1
2,3,149,151,154,156,158,154,157,158,160,...,1.365005,1.365025,1.365055,1.365075,1.364937,1.364967,1.364988,1.365018,1.365038,0
3,4,80,78,77,75,73,80,78,77,75,...,1.100708,1.100738,1.100759,1.100789,1.10063,1.10065,1.10068,1.1007,1.100731,0
4,5,117,115,114,112,110,115,113,111,110,...,1.28418,1.28413,1.284056,1.284006,1.284125,1.28405,1.284001,1.283926,1.283876,0


### Exploratory Data Analysis

This is the process of finding insights in your dataset before creating predictive models.


In [5]:
#show list of columns 
list(data.columns)  

['Sample_ID',
 '1_elevation',
 '2_elevation',
 '3_elevation',
 '4_elevation',
 '5_elevation',
 '6_elevation',
 '7_elevation',
 '8_elevation',
 '9_elevation',
 '10_elevation',
 '11_elevation',
 '12_elevation',
 '13_elevation',
 '14_elevation',
 '15_elevation',
 '16_elevation',
 '17_elevation',
 '18_elevation',
 '19_elevation',
 '20_elevation',
 '21_elevation',
 '22_elevation',
 '23_elevation',
 '24_elevation',
 '25_elevation',
 '1_slope',
 '2_slope',
 '3_slope',
 '4_slope',
 '5_slope',
 '6_slope',
 '7_slope',
 '8_slope',
 '9_slope',
 '10_slope',
 '11_slope',
 '12_slope',
 '13_slope',
 '14_slope',
 '15_slope',
 '16_slope',
 '17_slope',
 '18_slope',
 '19_slope',
 '20_slope',
 '21_slope',
 '22_slope',
 '23_slope',
 '24_slope',
 '25_slope',
 '1_aspect',
 '2_aspect',
 '3_aspect',
 '4_aspect',
 '5_aspect',
 '6_aspect',
 '7_aspect',
 '8_aspect',
 '9_aspect',
 '10_aspect',
 '11_aspect',
 '12_aspect',
 '13_aspect',
 '14_aspect',
 '15_aspect',
 '16_aspect',
 '17_aspect',
 '18_aspect',
 '19_aspect

In [6]:
## show Some information about the dataset 
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10864 entries, 0 to 10863
Columns: 227 entries, Sample_ID to Label
dtypes: float64(175), int64(52)
memory usage: 18.8 MB
None


The output shows the number of variables, the size of the dataset, and the data types of the variables. This helps us decide which feature engineering steps to apply.

In [7]:
# Check for missing values
print('missing values:', data.isnull().sum())
print('')
print('Total number of missing values is:', data.isnull().sum().sum())

missing values: Sample_ID      0
1_elevation    0
2_elevation    0
3_elevation    0
4_elevation    0
              ..
22_sdoif       0
23_sdoif       0
24_sdoif       0
25_sdoif       0
Label          0
Length: 227, dtype: int64

Total number of missing values is: 0


We don't have missing data in our dataset.
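
Since the later sections rely on the F1 score and on oversampling, it can also help to look at the class balance of `Label` at this stage. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration and are not taken from `Train.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the training frame; only the Label column matters here.
toy = pd.DataFrame({"Label": [0, 0, 0, 1, 0, 1, 0, 0]})

# Absolute counts and relative shares of each class.
counts = toy["Label"].value_counts()
shares = toy["Label"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, the same two calls on `data['Label']` would show how strongly the landslide class is under-represented.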

### FEATURE ENGINEERING 

In [8]:
# keep only the cell-13 features (13_*) and Label;
# this drops Sample_ID and the other 24 cells of every feature
keep_cols = [c for c in data.columns if c.startswith('13_')] + ['Label']
data = data[keep_cols]

data.shape 

(10864, 10)

In [40]:
# keep only the cell-13 features (13_*) and Label;
# this drops Sample_ID and the other 24 cells of every feature
keep_cols = [c for c in new_data.columns if c.startswith('13_')] + ['Label']
data5 = new_data[keep_cols]

data5.columns 

Index(['13_elevation', '13_slope', '13_aspect', '13_placurv', '13_procurv',
       '13_lsfactor', '13_twi', '13_geology', '13_sdoif', 'Label'],
      dtype='object')

In [9]:
#show the first five rows after removing unnecessary columns
data.head()

Unnamed: 0,13_elevation,13_slope,13_aspect,13_placurv,13_procurv,13_lsfactor,13_twi,13_geology,13_sdoif,Label
0,119,44.56372,113.9625,0.017273,0.002025,11.03584,3.154479,3,1.28173,0
1,156,32.31153,198.435,0.017014,-0.00322,9.067206,4.383853,3,1.359574,1
2,164,45.0,270.0,0.043121,0.025843,13.69647,4.169325,2,1.36505,0
3,77,16.69924,180.0,0.032324,0.011816,3.400196,4.259946,2,1.100826,0
4,109,29.49621,135.0,-0.007245,-0.012066,8.043085,4.430152,5,1.284217,0


In [10]:
from sklearn.preprocessing import OneHotEncoder

In [11]:
oneHot = OneHotEncoder()

In [12]:
data2=oneHot.fit_transform(data['13_geology'].values.reshape(-1,1))

In [13]:
data2.shape

(10864, 7)

In [14]:
data2=pd.DataFrame(data2.toarray(),columns=oneHot.get_feature_names_out())

In [15]:
data2

Unnamed: 0,x0_1,x0_2,x0_3,x0_4,x0_5,x0_6,x0_7
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...
10859,0.0,1.0,0.0,0.0,0.0,0.0,0.0
10860,0.0,0.0,1.0,0.0,0.0,0.0,0.0
10861,0.0,0.0,1.0,0.0,0.0,0.0,0.0
10862,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [16]:
data =data.drop('13_geology',axis=1)

In [17]:
data=pd.concat([data,data2],axis=1)

In [18]:
joblib.dump(oneHot,'/Users/Lenovo/Desktop/Project/preprocessing/one_Hot_encoder.pkl')

['/Users/Lenovo/Desktop/Project/preprocessing/one_Hot_encoder.pkl']
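
The saved encoder is only useful if it can be reloaded and applied to new samples later. The sketch below illustrates that round trip on made-up geology codes. The temporary path, the toy codes, and the choice of `handle_unknown="ignore"` are assumptions for illustration; the notebook itself uses the default encoder settings.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical geology codes seen during training (1..7, as in the output above).
train_codes = np.array([1, 2, 3, 4, 5, 6, 7, 3, 2]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen codes as an all-zero row instead of raising.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_codes)

# Save and reload, mirroring the joblib.dump call above (temporary path for illustration).
path = os.path.join(tempfile.mkdtemp(), "one_hot_encoder.pkl")
joblib.dump(encoder, path)
restored = joblib.load(path)

# Code 9 never appeared in training.
new_codes = np.array([3, 9]).reshape(-1, 1)
encoded = restored.transform(new_codes).toarray()
print(encoded)
```

With the default `handle_unknown="error"`, the unseen code would raise an exception at transform time, which may be preferable when unknown geology classes should not pass silently.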

In [None]:
# enc=data()

In [None]:
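# assumption: `data1` is not defined in this notebook; the one-hot encoded frame `data2` is the likely intent
joblib.dump(data2,'/Users/Lenovo/Desktop/Project/preprocessing/one_hot_encoding.pkl')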

In [None]:
#show first five rows
data.head()

In [39]:
new_data.columns

Index(['Sample_ID', '1_elevation', '2_elevation', '3_elevation', '4_elevation',
       '5_elevation', '6_elevation', '7_elevation', '8_elevation',
       '9_elevation',
       ...
       '17_sdoif', '18_sdoif', '19_sdoif', '20_sdoif', '21_sdoif', '22_sdoif',
       '23_sdoif', '24_sdoif', '25_sdoif', 'Label'],
      dtype='object', length=227)

In [None]:
# show list of columns 

list(data.columns)


In [21]:
# import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

In [24]:
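# feature scaling using the MinMaxScaler method
scaler = MinMaxScaler(feature_range=(0, 1))


# fit on the 2-D feature matrix so that each column is scaled with its own min and max
# (flattening with reshape(-1,1) would apply one global min/max to all features)
data3 = scaler.fit_transform(data.drop('Label',axis=1).values)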


In [23]:
joblib.dump(scaler,'/Users/Lenovo/Desktop/Project/preprocessing/min-max-scaler.pkl')

['/Users/Lenovo/Desktop/Project/preprocessing/min-max-scaler.pkl']

In [26]:
data=pd.DataFrame(data3.reshape(-1,15),columns=data.drop('Label',axis=1).columns)


In [27]:
data.columns

Index(['13_elevation', '13_slope', '13_aspect', '13_placurv', '13_procurv',
       '13_lsfactor', '13_twi', '13_sdoif', 'x0_1', 'x0_2', 'x0_3', 'x0_4',
       'x0_5', 'x0_6', 'x0_7'],
      dtype='object')

In [25]:
joblib.dump(scaler,'/Users/Lenovo/Desktop/Project/preprocessing/min-max-scaler2.pkl')

['/Users/Lenovo/Desktop/Project/preprocessing/min-max-scaler2.pkl']
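
`MinMaxScaler` rescales each column with x' = (x - min) / (max - min), using that column's own minimum and maximum. The sketch below checks this formula on a made-up array and contrasts it with scaling after flattening everything into a single column, which applies one global minimum and maximum to all features. The array `X` and its values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features on very different scales (values are made up).
X = np.array([[100.0, 0.1],
              [200.0, 0.3],
              [300.0, 0.5]])

# Column-wise scaling: each feature uses its own min and max.
per_column = MinMaxScaler().fit_transform(X)

# The same result from the formula x' = (x - min) / (max - min).
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Flattening to one column first applies a single global min and max to every feature.
flattened = MinMaxScaler().fit_transform(X.reshape(-1, 1)).reshape(X.shape)

print(per_column)
print(flattened.round(4))
```

In the flattened variant the small-valued second feature is squeezed close to zero, so features with a narrow range lose most of their spread.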

In [None]:
#show shape 
data.shape  

In [None]:
#show first five rows 
data.head() 

In [None]:
# show data of the first row 
data[:1].values 

### FEATURE SELECTIONS

#### Univariate Selection

In [None]:
# import packages 
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
data.shape

In [29]:
#split dataset into features and target
features = data.copy()

In [30]:
target=new_data['Label']

In [32]:
features.shape

(10864, 15)

In [None]:
#apply SelectKBest class to extract top 10 best features
bestfeatures = SelectKBest(score_func=chi2, k=10)

#train to find best features
fit = bestfeatures.fit(features,target)

#save in the dataframe 
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(features.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)

#naming the dataframe columns
featureScores.columns = ['Specs','Score'] 

#print 10 best features 
print(featureScores.nlargest(10,'Score'))  

**Where the geology codes (`x0_1` to `x0_7`) are:**

**1:** Weathered Cretaceous granitic rocks
**2:** Weathered Jurassic granite rocks
**3:** Weathered Jurassic tuff and lava
**4:** Weathered Cretaceous tuff and lava
**5:** Quaternary deposits
**6:** Fill
**7:** Weathered Jurassic sandstone, siltstone and mudstone

**13_lsfactor:** Length-slope factor  
**13_twi:** Topographic wetness index  
**13_sdoif:** Step duration orographic intensification factor

In [None]:
# fit and transform into the 10 best features
transformer = SelectKBest(chi2, k=10)

#transform from 15 features into top 10 features
top_10_features = transformer.fit_transform(features, target)

#show the shape 
top_10_features.shape 
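
`fit_transform` returns a plain array, so the names of the selected columns are lost. One way to recover them is the selector's `get_support()` mask. A minimal sketch on synthetic data follows; the column names `informative`, `noise_a` and `noise_b` are hypothetical and the data is randomly generated.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative features (chi2 requires non-negative values); names are hypothetical.
rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)
X = pd.DataFrame({
    "informative": y + rng.random(100) * 0.1,  # tracks the label closely
    "noise_a": rng.random(100),
    "noise_b": rng.random(100),
})

selector = SelectKBest(score_func=chi2, k=1)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask aligned with the input columns.
selected_names = X.columns[selector.get_support()].tolist()
print(selected_names)
```

Applied to the notebook's objects, `features.columns[transformer.get_support()]` would list the ten retained feature names.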

#### Feature Importance 


In [None]:
#import package 
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

In [None]:
#create model for training 
model = RandomForestClassifier()
model.fit(features,target)

# use the built-in feature_importances_ attribute of tree-based classifiers
print(model.feature_importances_) 

#plot graph of feature importances for better visualization
feature_importances = pd.Series(model.feature_importances_, index=features.columns)

# show the most important features (up to 30)

fig= plt.figure(figsize=(25,25))
sns.set(font_scale = 3)
feature_importances.nlargest(30).plot(kind='barh')
plt.show()

#### Correlation Matrix with Heatmap

In [None]:
#get correlations of each features in dataset
plt.figure(figsize=(30,30))

#plot heat map
sns.set(font_scale = 3)
# to show number set annot=True
d = sns.heatmap(data.corr(),annot=False, cmap="RdPu")

#save the figure 
figure = d.get_figure()
figure.savefig("heatmap_output.png")

# show the heatmap graph 
d  

In [None]:
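# show the absolute correlation of each feature with the target column
# (Label was dropped from `data` during scaling, so it is re-attached from `target`)
features_corr = pd.DataFrame(features.assign(Label=target).corr()['Label'].abs().sort_values(ascending = False)) 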

features_corr 

## Using Logistic Regression

In [33]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [34]:
model = LogisticRegression(random_state=0)
X_train,X_test,y_train,y_test=train_test_split(features,target,train_size=0.75,test_size=0.25)

### 1st Evaluation Metric:  Accuracy

In [None]:
model.fit(X_train,y_train)
predictions=model.predict(X_test)
score=accuracy_score(y_test,predictions)
print(score)

In [None]:
add_col=[]
for column in featureScores.nlargest(20,'Score')['Specs']:
    add_col.append(column)
    print(add_col)
    model.fit(X_train[add_col],y_train)
    predictions=model.predict(X_test[add_col])
    print(accuracy_score(y_test,predictions))

### 2nd Evaluation Metric: Confusion Matrix 

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

#Compute confusion matrix to evaluate the accuracy of a classification.
confusion_table = confusion_matrix(y_test, predictions)

#Confusion Matrix visualization.
cm_display = ConfusionMatrixDisplay(confusion_matrix=confusion_table,
                                    display_labels=['No', 'Yes'])
#create the plot
cm_display.plot()

#display the plot
plt.show()

### 3rd Evaluation Metric: F1 Score

In [None]:
from sklearn.metrics import f1_score

score = f1_score(y_test,predictions)

score 
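
The F1 score follows directly from the confusion matrix: it is the harmonic mean of precision, tp / (tp + fp), and recall, tp / (tp + fn). The sketch below derives it by hand on made-up labels and compares the result with `f1_score`; the arrays `y_true` and `y_pred` are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels and predictions for illustration.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true, y_pred))
```

Unlike accuracy, this value ignores the true negatives, which is why it is more informative when landslide samples are rare.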

## Using Random Forest Classification

In [None]:
#split data into train and test 
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    test_size=0.20,
                                                    random_state=42)

In [None]:
# create the model 
from sklearn.ensemble import RandomForestClassifier

classifier  = RandomForestClassifier(n_estimators=100)

In [None]:
#train the model 
classifier.fit(X_train,y_train)

In [None]:
# make prediction on test set 

preds = classifier.predict(X_test)

### 1st Evaluation Metric:  Accuracy

In [None]:
from sklearn.metrics import accuracy_score

model_score = accuracy_score(y_test,preds)

print("{:.2f}%".format(model_score*100))

### 2nd Evaluation Metric: Confusion Matrix 

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

#Compute confusion matrix to evaluate the accuracy of a classification.
confusion_table = confusion_matrix(y_test, preds)

#Confusion Matrix visualization.
cm_display = ConfusionMatrixDisplay(confusion_matrix=confusion_table,
                                    display_labels=['No', 'Yes'])
#create the plot
cm_display.plot()

#display the plot
plt.show()

### 3rd Evaluation Metric: F1 Score

In [None]:
from sklearn.metrics import f1_score

score = f1_score(y_test,preds,)

score 

## Using Histogram-Based Gradient Boosting

In [None]:
#split data into train and test 
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    test_size=0.20,
                                                    stratify=target,
                                                    random_state=42)
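
The `stratify=target` argument keeps the class proportions the same in the training and test splits. A minimal sketch on a made-up imbalanced target shows the effect; the arrays `X` and `y` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 20% positives (values are made up).
y = np.array([1] * 20 + [0] * 80)
X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)

# With stratify=y, both splits keep the 20% positive rate.
print(y_tr.mean(), y_te.mean())
```

Without stratification the share of positives in the test split varies from run to run, which makes scores from different splits harder to compare.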

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier

classifier  = HistGradientBoostingClassifier()

In [None]:
#train the model 
classifier.fit(X_train,y_train)

In [None]:
# make prediction on test set 

preds = classifier.predict(X_test)

### 1st Evaluation Metric: Accuracy

In [None]:
from sklearn.metrics import accuracy_score

model_score = accuracy_score(y_test,preds)

print("{:.2f}%".format(model_score*100))

### 2nd Evaluation Metric: F1 Score

In [None]:
from sklearn.metrics import f1_score
score = f1_score(y_test,preds,)

score 

We can see that the accuracy of the model increased slightly with histogram-based gradient boosting.

## Using Voting Classifier

In [None]:
#split data into train and test 
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(features,
                                                    target,
                                                    test_size=0.20,
                                                    stratify=target,
                                                    random_state=42)

In [None]:
#importing libraries
from sklearn.ensemble import VotingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

#ensemble/ group of models
estimator=[]
estimator.append(('LR',LogisticRegression(solver='lbfgs',multi_class='multinomial',max_iter=200)))
estimator.append(('SVC',SVC(gamma='auto',probability=True)))
estimator.append(('DTC',DecisionTreeClassifier()))

# classifier  = VotingClassifier(estimators=estimator, voting='hard')
classifier  = VotingClassifier(estimators=estimator, voting='soft')
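
With `voting='soft'`, the ensemble averages the class probabilities of its members and predicts the class with the highest average, whereas `voting='hard'` takes a majority vote over predicted labels. The sketch below reproduces the soft-voting result by hand on synthetic data. The dataset from `make_classification` and the two-member ensemble are assumptions for illustration and are smaller than the ensemble used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data standing in for the landslide features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

voter = VotingClassifier(
    estimators=[
        ("LR", LogisticRegression(max_iter=200)),
        ("DTC", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X, y)

# Soft voting: average the fitted members' class probabilities, then take the argmax.
member_probas = np.array([est.predict_proba(X) for est in voter.estimators_])
averaged = member_probas.mean(axis=0)
manual_pred = voter.classes_[averaged.argmax(axis=1)]

print((manual_pred == voter.predict(X)).all())
```

Soft voting requires every member to expose `predict_proba`, which is why `SVC` is created with `probability=True` above.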

In [None]:
#train the model 
classifier.fit(X_train,y_train)

In [None]:
# make prediction on test set 

preds = classifier.predict(X_test)

### 1st Evaluation Metric: Accuracy

In [None]:
from sklearn.metrics import accuracy_score

model_score = accuracy_score(y_test,preds)

print("{:.2f}%".format(model_score*100))

### 2nd Evaluation Metric: F1 Score

In [None]:
from sklearn.metrics import f1_score

score = f1_score(y_test,preds,)

score 

We can see that the accuracy of the model dropped when using the voting classifier.

## Using Lightgbm

In [None]:
!pip install lightgbm

In [None]:
import lightgbm as lgb
from lightgbm import LGBMClassifier

In [None]:
from imblearn.over_sampling import SMOTE

# oversample the minority class in the training split only; the test split stays untouched
oversample=SMOTE()
X_train,y_train=oversample.fit_resample(X_train,y_train)
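
SMOTE balances the classes by creating synthetic minority samples on the line segments between a minority point and one of its nearest minority neighbours. The sketch below illustrates that interpolation idea with scikit-learn's `NearestNeighbors`. It is a simplified illustration and not the `imblearn` implementation; the function `smote_like`, its parameters, and the toy points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # k + 1 because each point is returned as its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    # Pick random minority samples and, for each, one of its k neighbours.
    base = rng.integers(0, len(X_min), size=n_new)
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]
    # Interpolation weight in [0, 1) places the new point between the pair.
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[neigh] - X_min[base])

# Made-up minority-class points.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like(X_minority, n_new=10)
print(synthetic.shape)
```

Because every synthetic point lies between two existing minority points, oversampling should be fitted on the training split only, so that no test information leaks into the new samples.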

In [None]:
classifier  = LGBMClassifier()

#train the model 
classifier.fit(X_train,y_train)

# make prediction on test set 

preds = classifier.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

model_score = accuracy_score(y_test,preds)

print("{:.2f}%".format(model_score*100))

In [None]:
from sklearn.metrics import f1_score

score = f1_score(y_test,preds,)

score 

## Using Xgboost

In [None]:
!pip install xgboost

In [None]:
import xgboost as xgb
from xgboost import XGBClassifier

In [None]:
from imblearn.over_sampling import SMOTE

# note: if X_train was already resampled in an earlier cell, it is already balanced at this point
oversample=SMOTE()
X_train,y_train=oversample.fit_resample(X_train,y_train)

In [None]:
classifier  = XGBClassifier()

#train the model 
classifier.fit(X_train,y_train)

# make prediction on test set 

preds = classifier.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score

model_score = accuracy_score(y_test,preds)

print("{:.2f}%".format(model_score*100))

In [None]:
from sklearn.metrics import f1_score

score = f1_score(y_test,preds,)

score 

## Using Catboost

In [None]:
!pip install catboost

In [35]:
import catboost as clf
from catboost import CatBoostClassifier

In [36]:
from imblearn.over_sampling import SMOTE

# note: if X_train was already resampled in an earlier cell, it is already balanced at this point
oversample=SMOTE()
X_train,y_train=oversample.fit_resample(X_train,y_train)

In [37]:
classifier2  = CatBoostClassifier()

#train the model 
classifier2.fit(X_train,y_train)

# make prediction on test set 

preds = classifier2.predict(X_test)

Learning rate set to 0.029978
0:	learn: 0.6729107	total: 228ms	remaining: 3m 47s
1:	learn: 0.6536065	total: 266ms	remaining: 2m 12s
2:	learn: 0.6376268	total: 296ms	remaining: 1m 38s
3:	learn: 0.6210441	total: 330ms	remaining: 1m 22s
4:	learn: 0.6087147	total: 357ms	remaining: 1m 11s
5:	learn: 0.5973086	total: 383ms	remaining: 1m 3s
6:	learn: 0.5863635	total: 408ms	remaining: 57.9s
7:	learn: 0.5750870	total: 445ms	remaining: 55.2s
8:	learn: 0.5646530	total: 477ms	remaining: 52.6s
9:	learn: 0.5562905	total: 503ms	remaining: 49.8s
10:	learn: 0.5473116	total: 532ms	remaining: 47.8s
11:	learn: 0.5391586	total: 558ms	remaining: 45.9s
12:	learn: 0.5306327	total: 587ms	remaining: 44.6s
13:	learn: 0.5243723	total: 619ms	remaining: 43.6s
14:	learn: 0.5184207	total: 651ms	remaining: 42.8s
15:	learn: 0.5127558	total: 691ms	remaining: 42.5s
16:	learn: 0.5071054	total: 726ms	remaining: 42s
17:	learn: 0.5017502	total: 765ms	remaining: 41.7s
18:	learn: 0.4969502	total: 794ms	remaining: 41s
19:	learn:

In [38]:
joblib.dump(classifier2,'/Users/Lenovo/Desktop/Project/model/catboost-model2.pkl')

['/Users/Lenovo/Desktop/Project/model/catboost-model2.pkl']
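
The notebook saves the encoder, the scaler and the model as separate files, so all three have to be reloaded and applied in the same order at prediction time. One alternative is to bundle them in a single scikit-learn `Pipeline`. The sketch below is a minimal illustration on made-up data. `RandomForestClassifier` stands in for the CatBoost model, and the temporary path, the toy frame and the label rule are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with two of the notebook's column names; the values are invented.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "13_slope": rng.uniform(0, 60, 40),
    "13_geology": rng.integers(1, 8, 40),
})
labels = (toy["13_slope"] > 30).astype(int)

# One object that holds the encoder, the scaler and the model together.
pipeline = Pipeline([
    ("prep", ColumnTransformer([
        ("geology", OneHotEncoder(handle_unknown="ignore"), ["13_geology"]),
        ("numeric", MinMaxScaler(), ["13_slope"]),
    ])),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(toy, labels)

# A single dump/load round trip restores preprocessing and model in one step.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
joblib.dump(pipeline, path)
restored = joblib.load(path)
print((restored.predict(toy) == pipeline.predict(toy)).all())
```

The restored pipeline accepts raw columns directly, so the preprocessing cannot drift out of sync with the model.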

In [None]:
from sklearn.metrics import accuracy_score

model_score = accuracy_score(y_test,preds)

print("{:.2f}%".format(model_score*100))

In [None]:
from sklearn.metrics import f1_score

score = f1_score(y_test,preds,)

score 

In [None]:
!pip install imbalanced-learn

In [None]:
import pickle as pkl

In [None]:
# save the CatBoost model; the with-block closes the file automatically
with open("classifier2.pkl","wb") as pickle_out1:
    pkl.dump(classifier2,pickle_out1)

In [None]:
import sklearn

In [None]:
sklearn.__version__
In [None]:
sklea