# Step 1: Import Modules

We begin by importing the packages needed to build our bagging model. We use the pandas data analysis library to load our dataset, the Pima Indians Diabetes Database, which is used to predict the onset of diabetes from various diagnostic measures.

Next, we import the model_selection module from sklearn, the BaggingClassifier from the sklearn.ensemble package, and the DecisionTreeClassifier from the sklearn.tree package.

In [1]:
import pandas
from sklearn import model_selection
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Step 2: Loading the Dataset

We list the column names of the dataset in a list called names. We use the read_csv function from pandas to load the dataset. The call below relies on the header row already present in diabetes.csv; if your copy of the file has no header row, pass names=names to read_csv so that the columns are labeled. Next, we split the underlying array by slicing: the first eight columns are the features and the last column is the target, and we assign them to the variables X and Y respectively.

In [2]:
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv('diabetes.csv')
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]

# Step 3: Loading the Classifier

In this step, we fix a seed value for the random states so that the same random values are generated on every run. We then declare the KFold splitter, setting the n_splits parameter to 10 and the random_state parameter to the seed value. We build the bagging algorithm on top of the Classification and Regression Trees (CART) algorithm: we instantiate the DecisionTreeClassifier, store it in the cart variable, and set the number of trees, num_trees, to 100.

In [3]:
seed = 7
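# shuffle=True is needed for random_state to have any effect;
# recent scikit-learn versions raise an error if it is left out
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)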
cart = DecisionTreeClassifier()
num_trees = 100
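
To make the roles of n_splits and the seed concrete, here is a minimal pure-Python sketch of shuffled k-fold splitting. It is not scikit-learn's implementation; the helper `kfold_indices` and the sample count of 25 are hypothetical, chosen only for illustration.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    # A generator seeded once: the same seed always gives the same shuffle.
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(n_samples=25, n_splits=10, seed=7))
fold_sizes = [len(test_idx) for _, test_idx in folds]
print(fold_sizes)  # [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```

Every sample lands in exactly one test fold, and rerunning with the same seed reproduces the same folds, which is what makes cross-validation scores comparable between runs.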



In [6]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
# 70% training and 30% test; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=seed)

# Step 4: Training and Results

In this step, we build the BaggingClassifier from the base estimator and the number of trees defined in the previous step, train it on the training set, and measure its accuracy on the held-out test set.

In [13]:
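# The base estimator is passed positionally because its keyword name differs
# across scikit-learn versions (base_estimator in older releases, estimator in newer ones)
model = BaggingClassifier(cart, n_estimators=num_trees)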

model.fit(X_train,y_train)

y_pred=model.predict(X_test)


#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

# results = model_selection.cross_val_score(model, X, Y, cv=kfold)
# print(results.mean())

Accuracy: 0.7748917748917749
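
What the BaggingClassifier does internally can be sketched in plain Python: draw bootstrap samples, fit one base learner per sample, and combine predictions by majority vote. This is a simplified illustration, not scikit-learn's code. The one-feature threshold rule `fit_stump` stands in for a full decision tree, and the toy data and helper names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold rule on (x, label) pairs."""
    best = None
    for threshold, _ in points:
        left = [label for x, label in points if x <= threshold]
        right = [label for x, label in points if x > threshold]
        # Each side predicts its majority label; an empty right side reuses the left label.
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(label != left_label for label in left)
                  + sum(label != right_label for label in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: right_label if x > threshold else left_label

def fit_bagging(points, n_estimators, seed):
    """Fit one stump per bootstrap sample of the training data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(points) items with replacement.
        sample = [rng.choice(points) for _ in points]
        models.append(fit_stump(sample))
    return models

def predict_bagging(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy (feature, label) pairs with one noisy label at x = 4.
train = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0),
         (6, 1), (7, 1), (8, 1), (9, 1), (10, 1)]
models = fit_bagging(train, n_estimators=25, seed=7)
predictions = [predict_bagging(models, x) for x in (0, 11)]
print(predictions)
```

Because each stump sees a different resample of the data, the noisy point influences only some of them, and the vote tends to smooth out its effect.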


In [7]:
# using a Decision Tree classifier alone
result = model_selection.cross_val_score(cart, X, Y, cv=kfold)
print(result.mean())

0.6978468899521532
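
The two scores above are not directly comparable: the bagging model was scored on a single 70/30 split, while the single tree was scored with 10-fold cross-validation. A hedged sketch of a like-for-like comparison, evaluating both models on the same folds, could look as follows. It uses synthetic data from make_classification as a stand-in (an assumption, so that it runs without diabetes.csv), and the `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

demo_seed = 7
# Synthetic stand-in for the real data: 8 numeric features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=demo_seed)

demo_kfold = KFold(n_splits=10, shuffle=True, random_state=demo_seed)
demo_tree = DecisionTreeClassifier(random_state=demo_seed)
# Base estimator passed positionally so the call works across scikit-learn versions.
demo_bagged = BaggingClassifier(demo_tree, n_estimators=50, random_state=demo_seed)

# Both models are scored on exactly the same ten folds.
tree_scores = cross_val_score(demo_tree, X_demo, y_demo, cv=demo_kfold)
bagged_scores = cross_val_score(demo_bagged, X_demo, y_demo, cv=demo_kfold)
print("Single tree:", tree_scores.mean())
print("Bagged trees:", bagged_scores.mean())
```

To apply the same comparison to the diabetes data, replace X_demo and y_demo with X and Y.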


# Random Forest

In [11]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest classifier with 100 trees
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training set
clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)
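
A random forest differs from plain bagging of trees in one main respect: each split considers only a random subset of the features. The sketch below illustrates that sampling step alone, not the full algorithm. It assumes the square root of the feature count as the subset size, a common choice for classification; the seed and the number of splits shown are arbitrary.

```python
import math
import random

feature_names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']
# Assumption: sqrt(n_features) candidate features per split.
n_candidates = max(1, int(math.sqrt(len(feature_names))))
rng = random.Random(7)
# Each split draws its own random subset of features, without replacement.
candidate_sets = [rng.sample(feature_names, n_candidates) for _ in range(3)]
for split_number, candidates in enumerate(candidate_sets, start=1):
    print(f"split {split_number}: candidates = {candidates}")
```

Restricting the candidates in this way decorrelates the trees, since no single strong feature can dominate every split.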

In [12]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.7575757575757576
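
The accuracy reported by metrics.accuracy_score is the fraction of test samples whose predicted label equals the true label. A minimal hand-written equivalent, with a hypothetical helper `accuracy` and made-up toy labels, is:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Toy labels for illustration only: 6 of the 8 predictions are correct.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
score = accuracy(y_true_demo, y_pred_demo)
print("Accuracy:", score)  # Accuracy: 0.75
```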
