## XGBoost Classification

This is a basic implementation of XGBoost classification. The notebook walks through importing the data, preparing it, setting up an `XGBClassifier` model, fitting the model, and evaluating it with several methods.


In [None]:
import xgboost as xgb
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import accuracy_score

To check that XGBoost was imported correctly and is working, print its version:

In [None]:
print(xgb.__version__)

0.90


We are using the heart attack dataset from Kaggle. Download it here: [dataset](https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset?select=o2Saturation.csv)

In [None]:
data = pd.read_csv("heart.csv")

In [None]:
data.head()

Unnamed: 0,age,sex,cp,trtbps,chol,fbs,restecg,thalachh,exng,oldpeak,slp,caa,thall,output
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


## About the Data

The column names below follow the ones in the loaded DataFrame.

age : age of the patient

sex : sex of the patient

exng : exercise-induced angina (1 = yes; 0 = no)

caa : number of major vessels (0-3)

cp : chest pain type

Value 1: typical angina
Value 2: atypical angina
Value 3: non-anginal pain
Value 4: asymptomatic

trtbps : resting blood pressure (in mm Hg)

chol : cholesterol in mg/dl, fetched via BMI sensor

fbs : fasting blood sugar > 120 mg/dl (1 = true; 0 = false)

restecg : resting electrocardiographic results

Value 0: normal
Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria

thalachh : maximum heart rate achieved

output : 0 = less chance of heart attack; 1 = more chance of heart attack



In [None]:
data.shape

(303, 14)

In [None]:
X, y = data.iloc[:,:-1], data.iloc[:,-1]

In [None]:
X.shape, y.shape

((303, 13), (303,))
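
The `iloc` split above relies on the label being the last column. As a minimal sketch, the toy frame below (three invented rows, not taken from `heart.csv`) shows the same positional slicing next to a name-based split, which keeps working if the column order ever changes. The names `toy`, `X_pos`, and `X_named` are illustrative.

```python
import pandas as pd

# Toy frame for illustration only; the values are invented
toy = pd.DataFrame({
    "age": [63, 37, 41],
    "chol": [233, 250, 204],
    "output": [1, 1, 0],
})

# Positional split: every column except the last vs. the last column
X_pos, y_pos = toy.iloc[:, :-1], toy.iloc[:, -1]

# Name-based split: does not depend on where the label column sits
X_named = toy.drop(columns="output")
y_named = toy["output"]

print(X_pos.shape, y_pos.shape)
print(X_pos.equals(X_named), y_pos.equals(y_named))
```

Both approaches give the same features and labels here; the name-based form simply makes the choice of label column explicit.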

In [None]:
X.isna().sum()

age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
dtype: int64

In [None]:
y.isna().sum()

0

In [None]:
X.dtypes

age           int64
sex           int64
cp            int64
trtbps        int64
chol          int64
fbs           int64
restecg       int64
thalachh      int64
exng          int64
oldpeak     float64
slp           int64
caa           int64
thall         int64
dtype: object
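
The missing-value and dtype checks above can be folded into a single summary table. The helper below is a hypothetical convenience function (`summarize_columns` is not part of pandas), shown on a small invented frame that contains one missing value.

```python
import pandas as pd

def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
    })

# Invented values; the None in "age" becomes NaN and makes the column float64
toy = pd.DataFrame({"age": [63, 37, None], "oldpeak": [2.3, 3.5, 1.4]})
summary = summarize_columns(toy)
print(summary)
```

On the heart data, `summarize_columns(X)` would be expected to show zero missing values in every column, matching the separate outputs above.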

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,shuffle=True)
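
The split above is random, so the scores further down change from run to run. A minimal sketch of a reproducible, class-balanced variant follows; it uses synthetic stand-in data (the arrays `X_demo` and `y_demo` are invented), and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart data: 100 rows, 60% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,
    shuffle=True,
    random_state=42,   # fixed seed: the same split on every run
    stratify=y_demo,   # keep the class ratio the same in both parts
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```

With `stratify`, the share of positive labels is the same in the training and test parts, which matters on a dataset of only a few hundred rows.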

`XGBClassifier` is a class in the `xgboost` package that is used to build a classification model. The parameters set here are:
1. max_depth: the maximum depth of each tree built by the classifier. It is set to 3 here; shallow trees are often enough, and values beyond 6 are rarely needed, even for complex datasets.
2. learning_rate: the shrinkage applied to each new tree's contribution. Smaller values learn more slowly but tend to overfit less.
3. verbosity: the default value is 1, which prints warnings; here it is set to 0 to silence them.
4. objective: specifies the learning task. This is a binary classification problem, which is why it is "binary:logistic"; other objectives exist for other tasks.
5. booster: selects the booster to use, commonly a tree-based model (`gbtree`) or a linear model (`gblinear`).
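
With `binary:logistic`, the boosted trees add up to a raw score (a log-odds margin), and the logistic function turns that score into a probability. The classifier's `predict` then reports class 1 when the probability is above 0.5. The NumPy sketch below illustrates only that mapping; the margins are invented and are not output from the model in this notebook.

```python
import numpy as np

def sigmoid(margin):
    """Logistic function: maps a raw log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-margin))

# Hypothetical raw scores for four samples (not produced by the model above)
margins = np.array([-2.0, -0.1, 0.0, 1.5])
probabilities = sigmoid(margins)

# Threshold at 0.5 to turn probabilities into hard class labels
labels = (probabilities > 0.5).astype(int)

print(probabilities.round(3))
print(labels)
```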

In [None]:
# learning_rate=0.3 matches the value shown in the model summary output
model = xgb.XGBClassifier(max_depth=3, learning_rate=0.3, verbosity=0, objective='binary:logistic', booster='gbtree')


In [None]:
model

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.3, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=0)

The output above is the model summary: it lists every hyperparameter, including the defaults that were not set explicitly.

## Fitting the model to the training data

In [None]:
model.fit(X_train,y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.3, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=0)

To make predictions on the test data, use **model.predict**:

In [None]:
# Store predictions under a new name so the original labels in y are not overwritten
y_pred = model.predict(X_test)
y_pred

array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0])

## Different Evaluation Metrics

## Testing Accuracy

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.8360655737704918
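
Accuracy is a single number and does not say which kind of mistake the model makes. For a heart-attack risk label, missed positives and false alarms are worth looking at separately. The sketch below derives accuracy, recall, and precision from a confusion matrix; the label arrays are invented for illustration and are not the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 1, 0, 1, 0, 1, 0, 0, 1, 0])

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy_manual = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)        # share of actual positives that were caught
precision = tp / (tp + fp)     # share of positive predictions that were right

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print(accuracy_manual, recall, precision)
print(accuracy_manual == accuracy_score(y_true_demo, y_pred_demo))
```

The same call pattern would apply to the test labels and the model's predictions from the cells above.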


## Training Accuracy

In [None]:
score = model.score(X_train, y_train)   
print("Training score: ", score) 

Training score:  1.0


## Cross Validation Score

In [None]:
# 5-fold cross-validation on the training set
scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean cross-validation score: %.2f" % scores.mean())

Mean cross-validation score: 0.79


## K-Fold Score

In [None]:
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(model, X_train, y_train, cv=kfold )
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

K-fold CV average score: 0.79
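
`KFold` shuffles rows without looking at the labels, so the class balance can drift from fold to fold, while `StratifiedKFold` keeps it fixed. The model-free sketch below compares the share of positive labels in each validation fold under both splitters. The imbalanced toy labels and the helper `positive_share_per_fold` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels: 80 negatives followed by 20 positives
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.zeros((100, 1))  # features are irrelevant for the split itself

def positive_share_per_fold(splitter):
    """Fraction of positive labels in each validation fold."""
    return [y_demo[val_idx].mean() for _, val_idx in splitter.split(X_demo, y_demo)]

plain = positive_share_per_fold(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = positive_share_per_fold(
    StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
)

print("KFold:          ", np.round(plain, 2))
print("StratifiedKFold:", np.round(stratified, 2))
```

Passing a `StratifiedKFold` object as `cv` to `cross_val_score` is a drop-in alternative to the `KFold` object used above when the classes are unevenly represented.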
