# Pima Indian diabetes data analysis with Python: Application of supervised learning models for the prediction of type 2 diabetes in females

## Felipe Núñez Villena

### 12-09-2021


## Table of contents

1. <a href ="#1.-Load-Packages">Load Packages</a>
2. <a href ="#2.-Load-data">Load data</a>
3. <a href ="#3.-Data-cleaning">Data cleaning</a><br> 
    3.1 <a href ="#3.1.-Evaluate-whether-the-dataframe-diabetes-contains-missing-values">Evaluate whether the dataframe diabetes contains missing values</a><br> 
    3.2 <a href ="#3.2.-Replacing-0s-with-missing-values">Replacing 0s with missing values</a><br> 
    3.3   <a href ="#3.3.-Imputation-of-missing-values">Imputation of missing values</a><br>
4. <a href ="#4.-Exploratory-data-analysis">Exploratory data analysis</a><br>
5. <a href ="#5.-Feature-engineering">Feature engineering</a><br>
6. <a href ="#6.-Encoding-categorical-data">Encoding categorical data</a><br>
7. <a href ="#7.-Scaling-data">Scaling data</a><br>
8. <a href ="#8.-Identification-of-correlated-features">Identification of correlated features</a><br>
9. <a href ="#9.-Machine-learning">Machine learning</a><br>

## 1. Load Packages

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
import warnings
warnings.filterwarnings('ignore')

# To center the resulting plots in the jupyter notebook
from IPython.display import HTML, display
HTML("""
<style>
.output_png {
    display: table-cell;
    text-align: center;
    vertical-align: middle;
}
</style>
""")

## 2. Load data

In [None]:
diabetes = pd.read_csv('../input/pima-indians-diabetes-database/diabetes.csv')
diabetes.head()

## 3. Data cleaning

### 3.1. Evaluate whether the dataframe diabetes contains missing values

In [None]:
diabetes.isnull().sum()

The data frame diabetes contains no entries flagged as missing. However, missing values can also be encoded as 0s or 
with other symbols (e.g. ?). Let's check whether diabetes is truly clean or contains hidden missing data.

In [None]:
sns.set_style('ticks')

fig, axs = plt.subplots(2, 4, figsize=(15, 8))
fig.tight_layout()

sns.histplot(data=diabetes, x="Pregnancies", kde=True, color="skyblue", bins = 10, ax=axs[0, 0])
axs[0, 0].annotate(text='0s',
    xy=(1, 200),
    xycoords='data',
    fontsize=20,
    xytext=(100,0),
    textcoords='offset points',
    arrowprops=dict(arrowstyle='->', color='black'),  # Use color black
    horizontalalignment='center',  # Center horizontally
    verticalalignment='center')  # Center vertically

sns.histplot(data=diabetes, x="Glucose", kde=True, color="olive", bins=10, ax=axs[0, 1])
sns.histplot(data=diabetes, x="BloodPressure", kde=True, color="olive",bins=10, ax=axs[0, 2])
sns.histplot(data=diabetes, x="SkinThickness", kde=True, color="gold", bins = 10, ax=axs[0, 3])
axs[0, 3].annotate(text='0s',
    xy=(10, 55),
    xycoords='data',
    fontsize=20,
    xytext=(75,0),
    textcoords='offset points',
    arrowprops=dict(arrowstyle='->', color='black'),  # Use color black
    horizontalalignment='center',  # Center horizontally
    verticalalignment='center')  # Center vertically

sns.histplot(data=diabetes, x="Insulin", kde=True, color="teal", bins = 10, ax=axs[1, 0])
axs[1, 0].annotate(text='0s',
    xy=(70, 140),
    xycoords='data',
    fontsize=20,
    xytext=(75,0),
    textcoords='offset points',
    arrowprops=dict(arrowstyle='->', color='black'),  # Use color black
    horizontalalignment='center',  # Center horizontally
    verticalalignment='center')  # Center vertically


sns.histplot(data=diabetes, x="BMI", kde=True, color="teal",bins=10,ax=axs[1, 1])
sns.histplot(data=diabetes, x="DiabetesPedigreeFunction", kde=True,bins=10,color="teal", ax=axs[1, 2])
sns.histplot(data=diabetes, x="Age", kde=True, color="teal",bins=10, ax=axs[1, 3])
plt.show()

In [None]:
for col in diabetes.columns:  
    print("The feature {} contains: {}".format(col,(diabetes[col]==0).sum()) + " 0s entries")

We found 6 features containing 0s:

1 - Pregnancies (111)

2 - Glucose (5)

3 - Blood Pressure (35)

4 - Skin Thickness (227)

5 - Insulin (374)

6 - BMI (11)


PREGNANCIES: It is plausible to find women with no pregnancies. Therefore, 0 in this feature does not represent missing values.

### 3.2. Replacing 0s with missing values

The remaining features are physiological measurements whose readings cannot be 0, which suggests that 0 encodes missing data.
Therefore, let's replace those 0s with missing values (NaNs).


In [None]:
# Columns where a 0 cannot be a real measurement
zero_as_missing = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
diabetes[zero_as_missing] = diabetes[zero_as_missing].replace(0, np.nan)
diabetes.isna().sum()/len(diabetes) * 100

### 3.2.1. Exploring the missingness

In [None]:
msno.matrix(diabetes)
plt.show()

### 3.3. Imputation of missing values

No specific block of the data is missing. Instead, the missing values appear to be 
randomly distributed within the Insulin and SkinThickness features.

Since we cannot afford to discard approximately 50% of the data, features containing missing values should be imputed.

### 3.3.1. Imputation of features with a low number of missing values (<4.5%)

The features Glucose, BloodPressure and BMI contain a small fraction of missing data. Therefore, these features can be
imputed with the median grouped by the target variable (Outcome).
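The same idea can be sketched on toy data without copying the medians by hand. The frame `toy` below is hypothetical (not the real dataset); `groupby(...).transform("median")` computes the per-class median for every row, and `fillna` applies it only where a value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Pima dataset
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Glucose": [100.0, np.nan, 110.0, 140.0, 150.0, np.nan],
})

# Median of Glucose within each Outcome class, broadcast back to every row
class_median = toy.groupby("Outcome")["Glucose"].transform("median")

# Fill only the missing entries with the median of their own class
toy["Glucose"] = toy["Glucose"].fillna(class_median)
print(toy)
```

This variant avoids hard-coded constants, so it keeps working if the underlying data changes.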


#### 3.3.1.1. Glucose

In [None]:
temp = diabetes[diabetes['Glucose'].notnull()]
temp = temp[['Glucose', 'Outcome']].groupby(['Outcome'])[['Glucose']].median().reset_index()
temp

In [None]:
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['Glucose'].isnull()), 'Glucose'] = 107
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['Glucose'].isnull()), 'Glucose'] = 140

#### 3.3.1.2. Blood Pressure

In [None]:
temp = diabetes[diabetes['BloodPressure'].notnull()]
temp = temp[['BloodPressure', 'Outcome']].groupby(['Outcome'])[['BloodPressure']].median().reset_index()
temp

In [None]:
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['BloodPressure'].isnull()), 'BloodPressure'] = 70
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['BloodPressure'].isnull()), 'BloodPressure'] = 74.5

#### 3.3.1.3. BMI

In [None]:
temp = diabetes[diabetes['BMI'].notnull()]
temp = temp[['BMI', 'Outcome']].groupby(['Outcome'])[['BMI']].median().reset_index()
temp

In [None]:
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['BMI'].isnull()), 'BMI'] = 30.1
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['BMI'].isnull()), 'BMI'] = 34.3

### 3.3.2. Imputation of features with a high number of missing values

#### 3.3.2.1. Insulin

Insulin is a hormone secreted by the pancreatic beta cells. Its secretion rate increases in response to higher metabolic demands for insulin.

High Insulin levels are not purely characteristic of diabetic subjects.

Let's evaluate whether insulin is correlated with other features in this dataset

In [None]:
f, ax = plt.subplots(figsize=(8, 5))
diabetes_nomissing = diabetes.dropna().copy()
sns.heatmap(diabetes_nomissing.corr(), cmap="YlGnBu",annot=True,linewidths=.5)
plt.show()

As we can see in the heatmap, insulin exhibits its highest correlation with Glucose (0.58).

Therefore, we should use glucose to impute insulin.

Based on their glucose levels (using the 25th and 75th percentiles computed below as cut-offs), subjects can be categorized as:

1 - Hypoglycemic

2 - Normoglycemic

3 - Hyperglycemic


In [None]:
sns.boxplot(diabetes['Glucose'])
plt.show()
np.percentile(diabetes['Glucose'],[25,50,75])

In [None]:
Glu_cat = []

# Cut-offs are the 25th (99.75) and 75th (140.25) percentiles computed above
for glucose in diabetes['Glucose']:
    if glucose < 99.75:
        Glu_cat.append('Hypoglycemia')
    elif 99.75 <= glucose <= 140.25:
        Glu_cat.append('Normoglycemia')
    elif glucose > 140.25:
        Glu_cat.append('Hyperglycemia')

diabetes['Glu_cat'] = Glu_cat
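As a possible vectorized alternative to the loop above, the same three-way split can be sketched with `np.select`. The readings in `glucose` are hypothetical, and the sketch assumes there are no missing values left (a NaN would fall through to the default label).

```python
import numpy as np
import pandas as pd

# Hypothetical glucose readings; the cut-offs are the quartiles used above
glucose = pd.Series([85.0, 99.75, 120.0, 140.25, 180.0])
q1, q3 = 99.75, 140.25

# Conditions are checked in order; the default covers the middle range
conditions = [glucose < q1, glucose > q3]
labels = ["Hypoglycemia", "Hyperglycemia"]
glu_cat = pd.Series(
    np.select(conditions, labels, default="Normoglycemia"),
    index=glucose.index,
)
print(glu_cat.tolist())
```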

In [None]:
display(diabetes.groupby(['Glu_cat','Outcome'])['Insulin'].mean().reset_index())

As we can see in this dataset, insulin levels increase with the glucose category,
irrespective of whether the patient is diabetic.

In [None]:
# Imputing Insulin with the insulin mean grouped by Outcome and Glucose status (values from the table above).

diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['Glu_cat'] == 'Hypoglycemia' ) & (diabetes['Insulin'].isnull()), 'Insulin'] = 70.180851
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['Glu_cat'] == 'Normoglycemia' ) & (diabetes['Insulin'].isnull()), 'Insulin'] = 141.888889
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['Glu_cat'] == 'Hyperglycemia' ) & (diabetes['Insulin'].isnull()), 'Insulin'] = 246.971429

diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['Glu_cat'] == 'Hypoglycemia' ) & (diabetes['Insulin'].isnull()), 'Insulin'] = 122.750000
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['Glu_cat'] == 'Normoglycemia' ) & (diabetes['Insulin'].isnull()), 'Insulin'] = 162.750000
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['Glu_cat'] == 'Hyperglycemia' ) & (diabetes['Insulin'].isnull()), 'Insulin'] = 249.214286
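The hard-coded values above can also be derived programmatically. The helper `impute_group_mean` below is a hypothetical name introduced for illustration, and the toy frame is made up; it is a sketch of group-wise mean imputation over two keys, not the code used in this notebook.

```python
import numpy as np
import pandas as pd

def impute_group_mean(df, target, keys):
    """Fill missing values in `target` with the mean of each `keys` group."""
    group_mean = df.groupby(keys)[target].transform("mean")
    out = df.copy()
    out[target] = out[target].fillna(group_mean)
    return out

# Hypothetical toy data: two Outcome classes x two glucose categories
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 0, 1, 1, 1, 1],
    "Glu_cat": ["Normoglycemia", "Normoglycemia", "Hyperglycemia", "Hyperglycemia"] * 2,
    "Insulin": [100.0, np.nan, 200.0, np.nan, 150.0, np.nan, 250.0, np.nan],
})

filled = impute_group_mean(toy, "Insulin", ["Outcome", "Glu_cat"])
print(filled)
```

A group that contains only missing values would stay missing, so the result should still be checked with `isna().sum()`.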

#### 3.3.2.2. Skin thickness

As we saw in the heatmap, skin thickness exhibits its highest correlation with BMI (0.67).

Therefore, we should use BMI to impute skin thickness.

Based on the BMI, subjects can be categorized as:

1 - Underweight (< 18)

2 - Normal (18 - 24.9)

3 - Overweight (25 - 29.9)

4 - Obese (30 - 34.9)

5 - Extremely obese (≥ 35)

In [None]:
bmi_cat = []

# Contiguous ranges: every non-missing BMI falls into exactly one category
for BMI in diabetes['BMI']:
    if BMI < 18:
        bmi_cat.append('Underweight')
    elif BMI <= 24.9:
        bmi_cat.append('Normal')
    elif BMI <= 29.9:
        bmi_cat.append('Overweight')
    elif BMI <= 34.9:
        bmi_cat.append('Obese')
    elif BMI > 34.9:
        bmi_cat.append('Extremely obese')

diabetes['BMI_cat'] = bmi_cat
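Binning like this can also be sketched with `pd.cut`. The BMI values below are hypothetical, and the bin edges assume BMI is recorded to one decimal place, so that "up to 24.9" and "below 25" describe the same group.

```python
import numpy as np
import pandas as pd

# Hypothetical BMI values chosen to sit on the category boundaries
bmi = pd.Series([17.5, 18.0, 24.9, 25.0, 29.9, 30.0, 34.9, 35.0, 45.2])

# Left-closed bins: [-inf, 18), [18, 25), [25, 30), [30, 35), [35, inf)
bins = [-np.inf, 18, 25, 30, 35, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese", "Extremely obese"]

bmi_cat = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(bmi_cat.astype(str).tolist())
```

Because the bins share their edges, no value can fall between two categories.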

In [None]:
display(diabetes.groupby(['BMI_cat','Outcome'])['SkinThickness'].mean().reset_index())

In [None]:
# Imputing SkinThickness

diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['BMI_cat'] == 'Normal' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 15
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['BMI_cat'] == 'Overweight' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 22.830000
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['BMI_cat'] == 'Obese' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 27.969388
diabetes.loc[(diabetes['Outcome'] == 0 ) & (diabetes['BMI_cat'] == 'Extremely obese' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 36.388350

diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['BMI_cat'] == 'Normal' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 17.666667
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['BMI_cat'] == 'Overweight' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 24.666667
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['BMI_cat'] == 'Obese' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 31.843750
diabetes.loc[(diabetes['Outcome'] == 1 ) & (diabetes['BMI_cat'] == 'Extremely obese' ) & (diabetes['SkinThickness'].isnull()), 'SkinThickness'] = 36.850575

In [None]:
diabetes.isna().sum()

## 4. Exploratory data analysis

#### 4.1 - Is the dataset balanced?

In [None]:
# Creating dataset
Diabetes_code = ['0', '1']
  
data = diabetes['Outcome'].value_counts().to_list()
  
# Creating color parameters
colors = ( "orange", "cyan")
  
# Wedge properties
wp = { 'linewidth' : 1, 'edgecolor' : "black" }
  
# Creating autopct arguments
def func(pct, allvalues):
    # Round before converting to int so floating-point error cannot drop a count by one
    absolute = int(round(pct / 100.*np.sum(allvalues)))
    return "{:.1f}%\n({:d})".format(pct, absolute)
  
# Creating plot
fig, ax = plt.subplots(figsize =(10, 7))
wedges, texts, autotexts = ax.pie(data, 
                                  autopct = lambda pct: func(pct, data), 
                                  labels = Diabetes_code,
                                  shadow = False,
                                  colors = colors,
                                  startangle = 90,
                                  wedgeprops = wp,
                                  textprops = dict(color ="black",fontsize=20))
# Adding legend
Diabetes_status = ['Healthy', 'Diabetic']

ax.legend(wedges, Diabetes_status,
          title ="Diabetes status",
          fontsize = 'medium',
          loc ="center left",
          bbox_to_anchor =(1, 0, 0.5, 1))

plt.setp(autotexts, size = 15)
ax.set_title("Women distribution according to their health status",size = 20, weight = "bold",loc='right')
  
# show plot
plt.show()

As we can see, the dataset is imbalanced: most of the records (65.1%) belong to healthy patients.
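The same check can be done numerically rather than visually. This is a small sketch on a hypothetical `outcome` series, not the real Outcome column.

```python
import pandas as pd

# Hypothetical labels: 0 = healthy, 1 = diabetic
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1], name="Outcome")

counts = outcome.value_counts()
shares = outcome.value_counts(normalize=True).mul(100).round(1)

# One row per class with its absolute count and its percentage
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```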

#### 4.2 - What are the characteristics of patients who suffer from diabetes?

##### 4.2.1 - Age

In [None]:
sns.boxplot(y='Age',x='Outcome',data=diabetes)
plt.show()

temp = diabetes.groupby('Outcome')['Age'].median().reset_index()

for idx in temp.index:
    print("The median of the group {} is: {}".format(idx,temp.iloc[idx]['Age']) + ' years')

As we can see in the boxplot, older people appear more prone to developing diabetes. 

This is not surprising, since type 2 diabetes is a progressive condition that needs time to manifest its symptoms.

##### 4.2.2 - BMI

In [None]:
temp = diabetes.groupby('Outcome')['BMI_cat'].value_counts(normalize=True).reset_index(name="Percentage")

In [None]:
temp

In [None]:
sns.barplot(y='Percentage',x = 'BMI_cat',hue = 'Outcome',order = ['Normal','Overweight','Obese','Extremely obese'],data=temp)
plt.show()

As we can see in the barplot, healthy patients are spread fairly evenly across the BMI categories (19.8% - 27.8%).
In contrast, approximately 82% of diabetic patients are obese or extremely obese based on their BMI.

One might expect most extremely obese people to be diabetic. 
However, a substantial share (26%) of the patients labeled as healthy are extremely obese.

#### 4.3 - What's the difference between healthy and diabetic extremely obese patients?

In [None]:
temp = diabetes.groupby(['BMI_cat','Outcome'])['Age'].median().reset_index()
temp

In [None]:
sns.boxplot(y='Age',x = 'BMI_cat',hue = 'Outcome',order = ['Normal','Overweight','Obese','Extremely obese'],data=diabetes)
plt.show()

Irrespective of their BMI category, every diabetic group was older than the corresponding healthy group,
which is consistent with age being a risk factor for the development of type 2 diabetes.

The median age of healthy patients was between 26 and 28 years.

The median age of diabetic patients with normal BMI was 50 years.

Furthermore, the median age of diabetic patients who were overweight, obese or extremely obese was in the range of 
27 - 36 years, suggesting that a higher BMI accelerates the onset of diabetes.

#### 4.4 - Glucose analysis

Diabetes is defined as the inability of the body to properly regulate blood glucose levels.

According to the Mayo Clinic, blood glucose measured 2 hours after an Oral Glucose Tolerance Test (OGTT) is interpreted as follows:

1) Below 140 mg/dL - Normal blood glucose

2) Between 140 and 199 mg/dL - Impaired glucose metabolism (subjects at risk of developing diabetes)

3) 200 mg/dL or more - Diabetic
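The thresholds listed above can be written as a small function. `ogtt_category` is a hypothetical helper and the readings are made up; the sketch only illustrates how the three ranges partition the values.

```python
from collections import Counter

def ogtt_category(glucose_mg_dl):
    """Classify a 2-hour OGTT reading using the thresholds listed above."""
    if glucose_mg_dl < 140:
        return "Normal"
    if glucose_mg_dl < 200:
        return "Impaired glucose metabolism"
    return "Diabetic"

# Hypothetical readings placed on and around the boundaries
readings = [95, 139, 140, 199, 200, 250]
counts = Counter(ogtt_category(g) for g in readings)
print(counts)
```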

To understand the glycemic status of the patients in this dataset, let's plot glucose levels by 2 categorical variables: Glucose categories (Glu_cat) and health status (Outcome)

In [None]:
g = sns.FacetGrid(diabetes, col="Outcome",col_wrap=2,sharex=True,sharey=True,height=5)
g.map_dataframe(sns.boxplot, x="Glu_cat", y="Glucose",hue = 'BMI_cat',order = ['Hypoglycemia','Normoglycemia','Hyperglycemia'])
g.add_legend()
plt.show()

Although we have 192 data points between 140 and 199, no Glucose value after a 2-hour OGTT exceeds 199. This is surprising, because this measurement is the gold-standard indicator for the diagnosis of diabetes.

Why, then, are 34.9% of the patients labeled as diabetic?

Based on my understanding of diabetes, it is plausible that hyperglycemic subjects were considered diabetic.
However, I don't understand why subjects with normal blood glucose levels (Normoglycemia) were considered diabetic.

In [None]:
r1 = ((diabetes['Glu_cat'] == 'Hyperglycemia') & (diabetes['Outcome'] == 0)).sum()
r2 = ((diabetes['Glu_cat'] == 'Normoglycemia') & (diabetes['Outcome'] == 1)).sum()

print("There are {}".format(r1 + r2) + ' questionable data points in this dataset')

## 5. Feature engineering

In [None]:
# Combine the glucose and BMI categories into a single feature: Metabolic_state
diabetes['Metabolic_state'] = diabetes['Glu_cat'] + '_' + diabetes['BMI_cat']

## 6. Encoding categorical data

In [None]:
# Defining Target variable
target_col = ["Outcome"] # 0 = Healthy and 1 = Diabetes

# Defining categorical variables (fewer than 17 unique values)
n_unique = diabetes.nunique()
cat_cols   = n_unique[n_unique < 17].keys().tolist()
print(cat_cols)

# numerical columns
num_cols   = [x for x in diabetes.columns if x not in cat_cols + target_col]

# Binary columns
bin_cols   = diabetes.nunique()[diabetes.nunique() == 2].keys().tolist()

#Columns with more than 2 categories
multi_cols = [i for i in cat_cols if i not in bin_cols]

# ENCODING CATEGORICAL FEATURES

# Label encoding Binary columns
le = LabelEncoder()
for i in bin_cols :
    diabetes[i] = le.fit_transform(diabetes[i])
    
# One-hot encoding columns with more than 2 categories
diabetes = pd.get_dummies(data = diabetes,columns = multi_cols)

## 7. Scaling data

In [None]:
# Scaling Numerical columns: Only numerical features
std = StandardScaler()
# Fit the scaler on the numerical features and build a new scaled dataframe
scaled = std.fit_transform(diabetes[num_cols])
scaled = pd.DataFrame(scaled, columns=num_cols, index=diabetes.index)

# Keep a copy of the original values, then replace the numerical columns with their scaled versions
df_data_og = diabetes.copy()
diabetes = diabetes.drop(columns=num_cols)
diabetes = diabetes.merge(scaled, left_index=True, right_index=True, how="left")
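For reference, StandardScaler with its default settings centers each column on its mean and divides by its population standard deviation. The sketch below reproduces that arithmetic with NumPy alone on a made-up array `x`.

```python
import numpy as np

# Hypothetical data: 3 samples, 2 features on very different scales
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

mean = x.mean(axis=0)   # per-column mean
std = x.std(axis=0)     # per-column population standard deviation (ddof=0)

# z-score: each column ends up with mean 0 and standard deviation 1
z = (x - mean) / std
print(z)
```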

## 8. Identification of correlated features

In [None]:
to_drop_feat = ['BMI_cat_Normal','BMI_cat_Obese','BMI_cat_Extremely obese','BMI_cat_Overweight',
              'Glu_cat_Hyperglycemia','Glu_cat_Hypoglycemia','Glu_cat_Normoglycemia']

diabetes = diabetes.drop(columns=to_drop_feat)

In [None]:
# Let's check which variables carry redundant information

diabetes_corr = diabetes.corr()
plt.figure(figsize=(30, 20))
sns.heatmap(diabetes_corr, cmap="YlGnBu",annot=True)
plt.show()

In [None]:
# Keep only the upper triangle (without the diagonal) so each pair is checked once
upper_tri = diabetes_corr.where(np.triu(np.ones(diabetes_corr.shape),k=1).astype(bool))
to_drop = [column for column in upper_tri.columns if any(upper_tri[column] > 0.7)]
print(to_drop)
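To see what the upper-triangle filter does, here is a sketch on synthetic data. The columns `a`, `b` and `c` are hypothetical, and taking the absolute value of the correlation is an addition here (it also flags strong negative correlations, which the cell above does not).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),  # nearly a rescaled copy of "a"
    "c": rng.normal(size=200),                      # independent noise
})

corr = toy.corr().abs()

# Boolean mask selecting the cells strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped if it is highly correlated with any earlier column
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)
```

Only the later column of each correlated pair is flagged, so one representative of the pair is always kept.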

##### 8.1 - Remove BMI feature

In [None]:
diabetes2 = diabetes.drop(columns='BMI').copy()

## 9. Machine learning

##### 9.1 - Segregate features (X) and labels (y) into separate variables

In [None]:
X = diabetes2.drop(columns='Outcome')
y = diabetes2['Outcome'] 

##### 9.2 - KNN 

In [None]:
steps = [('knn', KNeighborsClassifier())]
        
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space

neighbors = np.arange(1, 10)
parameters = {'knn__n_neighbors':neighbors}

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42,stratify=diabetes2['Outcome'])

# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline,parameters,cv=5)

# Fit to the training set

cv.fit(X_train,y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics

print("Accuracy: {}".format(cv.score(X_test, y_test)))
print("Tuned Model Parameters: {}".format(cv.best_params_))
print(classification_report(y_test, y_pred))
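In this notebook the features were scaled on the full dataset before the train/test split. One possible variation, sketched below on synthetic data, is to put the scaler inside the pipeline so that its statistics are learned from the training folds only. The data from `make_classification` and the names `X_demo`, `y_demo`, `pipe` and `search` are assumptions for illustration, not part of the analysis above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the Pima dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# The scaler is refit on each training fold during cross-validation
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
grid = {"knn__n_neighbors": list(range(1, 10))}

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```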

##### 9.3 - Logistic Regression

In [None]:
# Logistic regression with hyperparameter tuning

# 1 - Create a dictionary storing the hyperparameters (keys) and their candidate values (values)

c_space = np.logspace(-5, 8, 15)
tol = [0.01,0.001,0.0001]
max_iter = [100,150,200]
param_grid = {'C': c_space, 'penalty': ['l1', 'l2'],'tol': tol, 'max_iter': max_iter}

# 2 - Instantiate the classifier of choice, in this case logistic regression
# The liblinear solver supports both the l1 and l2 penalties in the grid above
logreg = LogisticRegression(solver='liblinear')

# 3 - Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=42,stratify=diabetes2['Outcome'])

# 4 - Instantiate GridSearchCV with: A) the classifier, B) the parameter grid and C) the number of CV folds
logreg_cv = GridSearchCV(logreg,param_grid,cv=5)

# 5 - Fit the model using the training data
logreg_cv.fit(X_train,y_train)

In [None]:
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Cross-Validated Accuracy: {}".format(logreg_cv.best_score_))
print(" ")
y_pred = logreg_cv.predict(X_test)
print(classification_report(y_test, y_pred))

##### 9.4 - Decision Tree Classifier

In [None]:
# DecisionTreeClassifier

# Setup the parameters and distributions to sample from: param_dist
# We create a dictionary with all the keys (hyperparameters) and the values (parameters value) we would like to test.

param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}


# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()

# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)

# Fit it to the training data only, so the test set stays unseen during tuning

tree_cv.fit(X_train,y_train)
y_pred = tree_cv.predict(X_test)
# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
print(" ")
print(classification_report(y_test, y_pred))

##### 9.5 - SVM

In [None]:
# 1 - SVM

steps = [('SVM', SVC())]

pipeline = Pipeline(steps)

# Specify the hyperparameter space

parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}


# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline,parameters,cv=3)

# Fit to the training set

cv.fit(X_train,y_train)

# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)

# Compute and print metrics

print("Accuracy: {}".format(cv.score(X_test, y_test)))
print("Tuned Model Parameters: {}".format(cv.best_params_))
print(" ")
print(classification_report(y_test, y_pred))

<a href ="#Table-of-contents">Back to top</a>