# **Bagging**



---



---



# **AdaBoost**

Applying AdaBoost to the Iris dataset

**Steps**
1. Import the libraries.
2. Prepare the dataset (apply preprocessing if needed).
3. Split the dataset into training and test sets.
4. Create an AdaBoost model that uses a decision tree as its base estimator.
5. Fit the initialised model on the training set.
6. Evaluate the model on the test set.
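The six steps can be sketched end to end as follows. This is a minimal illustration, not the only way to set it up: the explicit `DecisionTreeClassifier(max_depth=1)` stump and the `random_state=0` on the model are assumptions added here to make the base estimator visible and the run repeatable.

```python
from sklearn import datasets, metrics
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Steps 1-3: load the data and hold out 30% of it for testing.
iris = datasets.load_iris()
xtrain, xtest, ytrain, ytest = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=1
)

# Step 4: a depth-1 tree (decision stump) as the weak learner.
# It is passed positionally because the keyword for this argument was
# renamed from `base_estimator` to `estimator` in newer scikit-learn releases.
stump = DecisionTreeClassifier(max_depth=1)
model = AdaBoostClassifier(stump, n_estimators=50, random_state=0)

# Step 5: fit on the training split.
model.fit(xtrain, ytrain)

# Step 6: evaluate on the held-out test split.
y_pred = model.predict(xtest)
accuracy = metrics.accuracy_score(ytest, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```

The exact accuracy may differ from the cells below, depending on the scikit-learn version and its default boosting algorithm.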

In [4]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import metrics

In [5]:
df=datasets.load_iris()
x=df.data
y=df.target

In [6]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=1)

In [7]:
model=AdaBoostClassifier(n_estimators=50)

In [8]:
model.get_params() #call the method (with parentheses) to list the hyperparameters; without them only the bound method object is displayed

In [9]:
model.fit(xtrain,ytrain)

In [10]:
y_pred=model.predict(xtest)

In [11]:
print(metrics.accuracy_score(ytest,y_pred))

0.9555555555555556


In [12]:
print(model.score(xtest,ytest))

0.9555555555555556


Decision tree on the same split, for comparison

In [13]:
from sklearn.tree import DecisionTreeClassifier

In [14]:
model1=DecisionTreeClassifier()

In [15]:
model1.get_params() #call the method (with parentheses) to list the hyperparameters; without them only the bound method object is displayed

In [16]:
model1.fit(xtrain,ytrain)

In [17]:
y_pred=model1.predict(xtest)

In [18]:
print(metrics.accuracy_score(ytest,y_pred))

0.9555555555555556
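Both models reach the same accuracy on this single 45-sample test split, so one split cannot separate them. A rough way to compare them more reliably is cross-validation. The sketch below assumes 5-fold cross-validation and fixed `random_state` values, neither of which comes from the notebook.

```python
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
x, y = iris.data, iris.target

# The two models compared in the cells above.
candidates = {
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "Decision tree": DecisionTreeClassifier(random_state=0),
}

# Score each model on the same 5 folds and report mean and spread.
cv_scores = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, x, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```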


**Using AdaBoost on the breast cancer dataset**

In [19]:
df=datasets.load_breast_cancer()
x=df.data
y=df.target

In [20]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.3,random_state=1)

In [21]:
model=AdaBoostClassifier(n_estimators=50)

In [22]:
model.get_params() #call the method (with parentheses) to list the hyperparameters; without them only the bound method object is displayed

In [23]:
model.fit(xtrain,ytrain)

In [24]:
y_pred=model.predict(xtest)

In [25]:
print(metrics.accuracy_score(ytest,y_pred))

0.9415204678362573


# **Gradient Boosting**
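Boosting originated in the probably approximately correct (PAC) learning framework, which asks whether weak learners can be combined into a strong learner. Gradient boosting builds such an ensemble sequentially: each new tree is fitted to the errors (the negative gradient of the loss) of the trees before it.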
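As a rough illustration of the idea behind gradient boosting, the sketch below builds a tiny booster by hand for squared-error regression: each stage fits a shallow tree to the current residuals and adds a scaled copy of its predictions. The synthetic data, the learning rate of 0.5 and the 20 stages are assumptions chosen for the example. `GradientBoostingClassifier` applies the same principle with a classification loss.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for this sketch).
x, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

learning_rate = 0.5
n_stages = 20

# Stage 0: start from a constant prediction, the mean of the targets.
prediction = np.full(y.shape, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_stages):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(x, residuals)
    # Move the ensemble a fraction of the way towards the residuals.
    prediction = prediction + learning_rate * tree.predict(x)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.1f}")
print(f"Training MSE after {n_stages} stages: {mse_history[-1]:.1f}")
```

The training error never increases from one stage to the next here, because each tree predicts the mean residual of its leaves and the step size is at most 1.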

In [32]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier

In [34]:
train=pd.read_csv('train.csv')
test=pd.read_csv('test .csv')

In [35]:
ytrain=train['Survived'] #target (output) column
train.drop(labels="Survived", axis=1,inplace=True) #axis=1 drops a column (axis=0 would drop rows); inplace=True modifies train directly instead of returning a copy

In [36]:
full_data=pd.concat([train,test]) #stack train and test rows so both get identical preprocessing (DataFrame.append was removed in pandas 2.0)


In [37]:
drop_columns=["Name","Age","SibSp","Ticket","Cabin","Parch","Embarked"]
full_data.drop(labels=drop_columns,axis=1,inplace= True)

In [38]:
full_data=pd.get_dummies(full_data,columns=["Sex"]) #one-hot encodes the categorical Sex column into numeric indicator columns
full_data.fillna(value=0.0,inplace=True)
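To make the effect of `get_dummies` and `fillna` concrete, here is a small sketch on a hypothetical three-row frame. The rows are invented for illustration and are not taken from the Titanic files.

```python
import pandas as pd

# Hypothetical rows, not taken from the Titanic files.
toy = pd.DataFrame(
    {
        "Pclass": [3, 1, 2],
        "Sex": ["male", "female", "male"],
        "Fare": [7.25, 71.28, None],
    }
)

# One indicator column is created per category: Sex_female and Sex_male.
encoded = pd.get_dummies(toy, columns=["Sex"])

# Missing values (the last Fare) are replaced with 0.0.
encoded = encoded.fillna(value=0.0)
print(encoded)
```

Filling missing values with 0.0 is the simple choice used in this notebook. A median or mean fill is a common alternative when 0 is not a plausible value for the column.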

In [39]:
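#sequential split: the first 891 rows came from the training file, the rest from the test file
xtrain=full_data.values[0:891] #rows 0 to 890
xtest=full_data.values[891:] #use .values here too so both splits are NumPy arrays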

In [40]:
scaler=MinMaxScaler()
xtrain=scaler.fit_transform(xtrain)
xtest=scaler.transform(xtest)
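The scaler is fitted on the training rows only and then reused on the test rows, so the test data never influences the scaling. A minimal sketch on made-up numbers shows what that means: a test value above the training maximum is mapped above 1.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data for illustration.
train_col = np.array([[1.0], [3.0], [5.0]])
test_col = np.array([[7.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_col)  # learns min=1 and max=5 from training data
test_scaled = scaler.transform(test_col)        # reuses the training min and max

print(train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(test_scaled.ravel())   # (7 - 1) / (5 - 1) = 1.5
```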



In [41]:
state=12
test_size=0.30
xtrain,xval,ytrain,yval=train_test_split(xtrain,ytrain,test_size=test_size,random_state=state) #the 891 training samples are further split into training and validation sets

In [42]:
lr_list=[0.05,0.075,0.1,0.25,0.5,0.75,1] #candidate learning rates
for learning_rate in lr_list:
  #20 boosting stages (trees) of depth 2; max_features=2 means 2 randomly chosen features are considered at each split
  gb_clf=GradientBoostingClassifier(n_estimators=20,learning_rate=learning_rate,max_features=2,max_depth=2,random_state=0)
  gb_clf.fit(xtrain,ytrain) #train the model
  print("Learning rate:",learning_rate)
  print("Accuracy(training): {0:.3f}".format(gb_clf.score(xtrain,ytrain))) #accuracy on the training data
  print("Accuracy(validation): {0:.3f}".format(gb_clf.score(xval,yval))) #accuracy on the held-out validation data; this is the number to maximise


Learning rate: 0.05
Accuracy(training): 0.801
Accuracy(validation): 0.731
Learning rate: 0.075
Accuracy(training): 0.814
Accuracy(validation): 0.731
Learning rate: 0.1
Accuracy(training): 0.812
Accuracy(validation): 0.724
Learning rate: 0.25
Accuracy(training): 0.835
Accuracy(validation): 0.750
Learning rate: 0.5
Accuracy(training): 0.864
Accuracy(validation): 0.772
Learning rate: 0.75
Accuracy(training): 0.875
Accuracy(validation): 0.754
Learning rate: 1
Accuracy(training): 0.875
Accuracy(validation): 0.739
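Rather than reading the best learning rate off the printed output, the validation scores can be collected and the maximum picked programmatically. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features, which is an assumption made so the example runs without the CSV files. The names `xtr`, `ytr` and `val_scores` are also introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Titanic features (an assumption for this sketch).
x, y = make_classification(n_samples=500, n_features=8, random_state=0)
xtr, xval, ytr, yval = train_test_split(x, y, test_size=0.30, random_state=12)

lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]
val_scores = {}
for learning_rate in lr_list:
    clf = GradientBoostingClassifier(
        n_estimators=20, learning_rate=learning_rate,
        max_features=2, max_depth=2, random_state=0,
    )
    clf.fit(xtr, ytr)
    # Keep only the validation accuracy; it drives the selection.
    val_scores[learning_rate] = clf.score(xval, yval)

# Pick the learning rate with the highest validation accuracy.
best_lr = max(val_scores, key=val_scores.get)
print("Best learning rate:", best_lr)
print("Validation accuracy:", round(val_scores[best_lr], 3))
```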


In [43]:
#learning_rate=0.5 gave the highest validation accuracy (0.772) in the loop above
gb_clf2= GradientBoostingClassifier(n_estimators=20,learning_rate=0.5,max_features=2,max_depth=2,random_state=0)

gb_clf2.fit(xtrain,ytrain)
predictions=gb_clf2.predict(xval)

print("Confusion Matrix:")
print(confusion_matrix(yval,predictions))
print("Classification Report")
print(classification_report(yval,predictions))


Confusion Matrix:
[[142  19]
 [ 42  65]]
Classification Report
              precision    recall  f1-score   support

           0       0.77      0.88      0.82       161
           1       0.77      0.61      0.68       107

    accuracy                           0.77       268
   macro avg       0.77      0.74      0.75       268
weighted avg       0.77      0.77      0.77       268
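The numbers in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes precision, recall, F1 and accuracy from the four counts printed above (rows are true classes, columns are predicted classes).

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class.
cm = np.array([[142, 19],
               [42, 65]])

support = cm.sum(axis=1)                  # true samples per class
precision = np.diag(cm) / cm.sum(axis=0)  # correct predictions / all predictions of that class
recall = np.diag(cm) / support            # correct predictions / all true samples of that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

for label in (0, 1):
    print(f"class {label}: precision={precision[label]:.2f} "
          f"recall={recall[label]:.2f} f1={f1[label]:.2f} support={support[label]}")
print(f"accuracy={accuracy:.2f}")
```

The recall of 0.61 for class 1 corresponds to 65 of the 107 survivors in the validation set being identified, while 42 were missed.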



# **Gradient Boosting with cross-validation**
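## Different types of cross-validation in ML
1. KFold -> splits the data into k folds; each fold is used once as the validation set while the remaining folds are used for training.
2. Stratified KFold -> like KFold, but each fold keeps roughly the same class proportions as the full dataset.
3. Repeated KFold -> repeats KFold n times with a different randomization in each repetition.
4. Repeated Stratified KFold -> repeats Stratified KFold n times with a different randomization in each repetition.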
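The four splitters listed above can be compared with a small sketch. The imbalanced toy labels (80 zeros, 20 ones) and the choice of 5 splits and 3 repeats are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import (
    KFold, StratifiedKFold, RepeatedKFold, RepeatedStratifiedKFold,
)

# Toy data: 100 samples with an 80/20 class imbalance.
x = np.zeros((100, 1))
y = np.array([0] * 80 + [1] * 20)

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=1),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    "RepeatedKFold": RepeatedKFold(n_splits=5, n_repeats=3, random_state=1),
    "RepeatedStratifiedKFold": RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1),
}

# Repeated variants produce n_splits * n_repeats train/validation pairs.
n_splits = {name: cv.get_n_splits(x, y) for name, cv in splitters.items()}
print(n_splits)

# Stratification keeps the class ratio: 20 positives over 5 folds is 4 per fold.
positives_per_fold = [
    int(y[test_idx].sum())
    for _, test_idx in splitters["StratifiedKFold"].split(x, y)
]
print(positives_per_fold)
```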

In [44]:
from numpy import mean
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import confusion_matrix

In [45]:
from sklearn.datasets import make_classification #used for synthetically creating samples
x,y=make_classification(n_samples=100,n_features=20,random_state=1)
print(x.shape)
print(y.shape)

(100, 20)
(100,)


In [46]:
from sklearn.model_selection import cross_val_score, RepeatedStratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier

In [47]:
#split x and y into training (80%) and test (20%) sets
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=1)

In [48]:
model=GradientBoostingClassifier()

In [49]:
cv=RepeatedStratifiedKFold(n_splits=10,n_repeats=3,random_state=1) #Repeats Stratified K-Fold n times with different randomization in each repetition.
n_scores=cross_val_score(model,xtrain,ytrain,scoring='accuracy',cv=cv)
print(mean(n_scores))

0.9208333333333333
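With 10 splits repeated 3 times, the mean above averages 30 fold scores, each computed on only 8 validation samples, so individual folds can vary a lot. A minimal sketch that also reports the spread is shown below. Setting `random_state=0` on the model is an assumption added for repeatability, so the mean may differ slightly from the value printed above.

```python
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import (
    RepeatedStratifiedKFold, cross_val_score, train_test_split,
)

# Same synthetic data and split as in the cells above.
x, y = make_classification(n_samples=100, n_features=20, random_state=1)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=1)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = GradientBoostingClassifier(random_state=0)
n_scores = cross_val_score(model, xtrain, ytrain, scoring="accuracy", cv=cv)

# One score per fold: 10 splits x 3 repeats.
print(f"{len(n_scores)} folds evaluated")
print(f"Accuracy: {mean(n_scores):.3f} (std {std(n_scores):.3f})")
```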


In [50]:
model.fit(xtrain,ytrain)

In [51]:
y_pred=model.predict(xtest)

In [52]:
print(metrics.accuracy_score(ytest,y_pred))

0.9


In [53]:
predictions=model.predict(xtest)

print("Confusion Matrix:")
print(confusion_matrix(ytest,predictions))

Confusion Matrix:
[[12  0]
 [ 2  6]]


# **Gradient Boosting Regressor**

In [67]:
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
x,y=make_regression(n_samples=1000,n_features=20,random_state=7)
print(x.shape)
print(y.shape)

(1000, 20)
(1000,)


In [55]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y,test_size=0.2,random_state=1)

In [56]:
from sklearn.model_selection import cross_val_score, RepeatedKFold
from sklearn.ensemble import GradientBoostingRegressor

In [57]:
model=GradientBoostingRegressor()

In [58]:
cv=RepeatedKFold(n_splits=10,n_repeats=3,random_state=8)
n_scores=cross_val_score(model,xtrain,ytrain,scoring='r2',cv=cv)
print(mean(n_scores))

0.9161009015894994


In [59]:
model.fit(xtrain,ytrain)

In [60]:
y_pred=model.predict(xtest)

In [68]:
print(r2_score(ytest,y_pred))

0.9245811149886595
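To see how the test score builds up as trees are added, `staged_predict` yields the ensemble's predictions after each boosting stage. The sketch below repeats the data setup from the cells above. The fixed `random_state=0` on the model and the choice of stages to print are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Same synthetic data and split as in the cells above.
x, y = make_regression(n_samples=1000, n_features=20, random_state=7)
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.2, random_state=1)

model = GradientBoostingRegressor(random_state=0)  # default: 100 trees
model.fit(xtrain, ytrain)

# Test R^2 after 1, 2, ..., 100 trees.
staged_r2 = [r2_score(ytest, stage_pred) for stage_pred in model.staged_predict(xtest)]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:3d} trees: test R^2 = {staged_r2[n_trees - 1]:.3f}")
```

A curve that is still rising at the last stage suggests that more trees (or a larger learning rate) could help, while a flat or falling curve suggests the ensemble is already large enough.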
