# Building a Decision Tree Classifier in Scikit-learn

In this task, you will build a decision tree model that predicts the onset of diabetes from several diagnostic measures. The dataset is the Pima Indians Diabetes Database from Kaggle.

## Importing Required Libraries

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

## Loading Data

Let's first load the Pima Indians Diabetes dataset using the pandas ``read_csv`` function. You can check the first few rows of the dataset by executing the cell below.

In [2]:
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv("diabetes.csv", header=None, names=col_names)
pima.head()

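|   | pregnant | glucose | bp | skin | insulin | bmi | pedigree | age | label |
|---|----------|---------|----|------|---------|------|----------|-----|-------|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |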
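
Note that ``header=None`` tells pandas the file has no header row. Some copies of this dataset begin with a header line; in that case, passing ``header=0`` together with ``names`` replaces the existing header instead of reading it as data. The sketch below illustrates the difference on a small in-memory CSV. The header names in ``csv_text`` and the use of ``io.StringIO`` are assumptions for illustration only.

```python
import io
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# Hypothetical two-row sample that starts with a header line (assumed layout).
csv_text = (
    "Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,"
    "DiabetesPedigreeFunction,Age,Outcome\n"
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

# header=None: every line is treated as data, so the header line becomes row 0.
as_data = pd.read_csv(io.StringIO(csv_text), header=None, names=col_names)

# header=0: the first line is consumed as the header and replaced by col_names.
replaced = pd.read_csv(io.StringIO(csv_text), header=0, names=col_names)

print(len(as_data), len(replaced))
print(as_data.head())
print(replaced.head())
```

If the first row of ``pima.head()`` contains text rather than numbers, the file has a header line and ``header=0`` is the safer choice.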


## Feature Selection

Here, you need to divide the given columns into two types of variables: the independent (feature) variables ``X`` and the dependent (target) variable ``y``. The selected features are ``['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']``; the ``skin`` column is not used.

In [3]:
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
# TODO1: Select the corresponding features and extract feature variables X and label variable y
X = pima[feature_cols]
y = pima.label
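
To see what this selection produces, the sketch below applies the same indexing to the five rows shown in the preview above. The names ``rows`` and ``sample`` are illustrative and stand in for the full ``pima`` DataFrame.

```python
import pandas as pd

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']

# The five preview rows, used here as a stand-in for the full dataset.
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
]
sample = pd.DataFrame(rows, columns=col_names)

feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = sample[feature_cols]  # 2-D DataFrame: one column per selected feature
y = sample['label']       # 1-D Series: the target to predict

print(X.shape, y.shape)
print(list(X.columns))
```

Indexing with a list of column names returns a DataFrame, while indexing with a single name (or using ``pima.label``) returns a Series, which is the shape scikit-learn expects for the target.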

## Splitting Data

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
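
With ``test_size=0.3``, 30% of the rows are held out for testing and the rest are used for training; ``random_state=1`` fixes the shuffle so the split is reproducible. As a rough check of the resulting sizes, the sketch below splits placeholder arrays, assuming the dataset has 768 rows. ``n_rows``, ``X_dummy``, and ``y_dummy`` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 768  # assumed row count of the dataset
X_dummy = np.arange(n_rows).reshape(-1, 1)  # placeholder features
y_dummy = np.arange(n_rows) % 2             # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.3, random_state=1
)
print(len(X_tr), len(X_te))

# Repeating the call with the same random_state yields the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_dummy, y_dummy, test_size=0.3, random_state=1
)
print(np.array_equal(X_te, X_te2))
```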

## Building Decision Tree Model

Here, you need to build a decision tree classifier, train it on the training set, and predict labels for the test set. Store the predictions in ``y_pred``. To make the evaluation reproducible, set ``random_state`` to 42.

In [5]:
SEED = 42
# TODO2: Calculate y_pred using decision tree model
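# Create the classifier with a fixed seed for reproducibility
clf = DecisionTreeClassifier(random_state=SEED)
# Fit on the training set, then predict labels for the test set
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)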
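
The seed matters because scikit-learn permutes the features at random when searching for each split, so ties between equally good splits can be resolved differently from run to run. The sketch below, which uses synthetic data from ``make_classification`` rather than the diabetes dataset, checks that two trees fit with the same seed give identical predictions. The ``max_depth=3`` tree is an optional, illustrative variant that limits tree size; it is not part of the task.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 42

# Synthetic stand-in data with 7 features (not the diabetes dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=1)

# Two trees built with the same seed on the same data.
tree_a = DecisionTreeClassifier(random_state=SEED).fit(X_tr, y_tr)
tree_b = DecisionTreeClassifier(random_state=SEED).fit(X_tr, y_tr)
same = bool((tree_a.predict(X_te) == tree_b.predict(X_te)).all())

# Optional variant: cap the depth to get a smaller tree.
shallow = DecisionTreeClassifier(max_depth=3, random_state=SEED).fit(X_tr, y_tr)

print(same)
print(tree_a.get_depth(), shallow.get_depth())
```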


## Evaluate Model

Let's estimate how accurately the classifier predicts the onset of diabetes. Accuracy is computed by comparing the actual test-set labels with the predicted labels.

In [6]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.6883116883116883
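
Accuracy is the fraction of test samples whose predicted label matches the true label. The sketch below reproduces ``accuracy_score`` by hand on a small pair of made-up label lists (the values are illustrative, not taken from the model above) and also builds a confusion matrix, which separates the two kinds of error.

```python
from sklearn import metrics

# Made-up labels for illustration only.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy by hand: number of matching labels divided by the total.
correct = sum(t == p for t, p in zip(y_true_demo, y_pred_demo))
manual_acc = correct / len(y_true_demo)
sk_acc = metrics.accuracy_score(y_true_demo, y_pred_demo)
print(manual_acc, sk_acc)  # both 0.75 (6 of 8 correct)

# Rows are true classes, columns are predicted classes (order: 0, 1).
cm = metrics.confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
```

On an imbalanced dataset, accuracy alone can look reasonable even when one class is predicted poorly, so the confusion matrix is a useful companion to the single number.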
