# Evaluating the Performance of a Support Vector Machine (SVM) Model 🚀

In this notebook, we evaluate a **Support Vector Machine (SVM)** classifier. An SVM separates classes by finding the hyperplane that leaves the largest margin between them.

## Key Objectives 🎯:
- **Model Evaluation**: We will assess how well the SVM model performs in terms of accuracy, precision, recall, and F1 score.
- **Performance Metrics 📊**: We will use confusion matrices and classification reports to understand the model's strengths and areas for improvement.

Let's dive into the details and evaluate the model's performance effectively! 🔍
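
As a quick reference for the four metrics listed above, here is a minimal sketch that computes them by hand from confusion-matrix counts. The counts (`tp`, `fp`, `fn`, `tn`) are made-up numbers for illustration and do not come from this notebook's model.

```python
# Hypothetical confusion-matrix counts for a binary classifier (illustrative only).
tp, fp, fn, tn = 40, 10, 20, 930

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With these counts the accuracy is 0.970 even though a third of the positive cases are missed (recall is about 0.667), which is why accuracy alone can be misleading when one class is rare.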


### Imports

In [13]:

import pandas as pd
import numpy as np
import os
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, f1_score,
    precision_score, recall_score, roc_auc_score, roc_curve,
)
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sn

### Loading the Dataset

In [2]:
df = pd.read_csv('../dataset/diabetes_dataset.csv')
df.head()


Unnamed: 0,year,gender,age,location,race:AfricanAmerican,race:Asian,race:Caucasian,race:Hispanic,race:Other,hypertension,heart_disease,smoking_history,bmi,hbA1c_level,blood_glucose_level,diabetes
0,2020,Female,32.0,Alabama,0,0,0,0,1,0,0,never,27.32,5.0,100,0
1,2015,Female,29.0,Alabama,0,1,0,0,0,0,0,never,19.95,5.0,90,0
2,2015,Male,18.0,Alabama,0,0,0,0,1,0,0,never,23.76,4.8,160,0
3,2015,Male,41.0,Alabama,0,0,1,0,0,0,0,never,27.32,4.0,159,0
4,2016,Female,52.0,Alabama,1,0,0,0,0,0,0,never,23.75,6.5,90,0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 16 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   year                  100000 non-null  int64  
 1   gender                100000 non-null  object 
 2   age                   100000 non-null  float64
 3   location              100000 non-null  object 
 4   race:AfricanAmerican  100000 non-null  int64  
 5   race:Asian            100000 non-null  int64  
 6   race:Caucasian        100000 non-null  int64  
 7   race:Hispanic         100000 non-null  int64  
 8   race:Other            100000 non-null  int64  
 9   hypertension          100000 non-null  int64  
 10  heart_disease         100000 non-null  int64  
 11  smoking_history       100000 non-null  object 
 12  bmi                   100000 non-null  float64
 13  hbA1c_level           100000 non-null  float64
 14  blood_glucose_level   100000 non-null  int64  
 15  d

In [4]:
# Call the method (with parentheses) to get summary statistics;
# a bare `df.describe` only displays the bound method and the raw frame.
df.describe()

In [5]:
df.columns

Index(['year', 'gender', 'age', 'location', 'race:AfricanAmerican',
       'race:Asian', 'race:Caucasian', 'race:Hispanic', 'race:Other',
       'hypertension', 'heart_disease', 'smoking_history', 'bmi',
       'hbA1c_level', 'blood_glucose_level', 'diabetes'],
      dtype='object')

In [6]:
# Drop the columns we do not need for modeling
df.drop(columns=['year', 'location', 'race:AfricanAmerican', 'race:Asian', 'race:Caucasian', 'race:Hispanic', 'race:Other'], inplace=True)
df

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hbA1c_level,blood_glucose_level,diabetes
0,Female,32.0,0,0,never,27.32,5.0,100,0
1,Female,29.0,0,0,never,19.95,5.0,90,0
2,Male,18.0,0,0,never,23.76,4.8,160,0
3,Male,41.0,0,0,never,27.32,4.0,159,0
4,Female,52.0,0,0,never,23.75,6.5,90,0
...,...,...,...,...,...,...,...,...,...
99995,Female,33.0,0,0,never,21.21,6.5,90,0
99996,Female,80.0,0,0,No Info,36.66,5.7,100,0
99997,Male,46.0,0,0,ever,36.12,6.2,158,0
99998,Female,51.0,0,0,not current,29.29,6.0,155,0


### Data Preprocessing

### Label Encoding

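`LabelEncoder` is a scikit-learn transformer that converts **categorical text data** into **integer labels**, assigning the codes in sorted order of the category names. This matters because most machine learning algorithms, SVM included, only accept numerical input. Keep in mind that integer codes imply an ordering the categories may not actually have, and that scikit-learn intends `LabelEncoder` for target labels; for unordered input features, one-hot encoding is a common alternative.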

In [7]:
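# Encode the two text columns as integers.
# fit_transform refits the encoder on each call, so after this cell
# `encoder` holds only the gender mapping.
encoder = LabelEncoder()
df['smoking_history'] = encoder.fit_transform(df['smoking_history'])
df['gender'] = encoder.fit_transform(df['gender'])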


In [8]:
df.tail()

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,hbA1c_level,blood_glucose_level,diabetes
99995,0,33.0,0,0,4,21.21,6.5,90,0
99996,0,80.0,0,0,0,36.66,5.7,100,0
99997,1,46.0,0,0,2,36.12,6.2,158,0
99998,0,51.0,0,0,5,29.29,6.0,155,0
99999,1,13.0,0,0,0,17.16,5.0,90,0
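
To see where integer codes like the ones above come from, the following sketch fits a `LabelEncoder` on a small hand-written list and prints the resulting mapping. The list is illustrative, not the real column, and `smoking_encoder` and `mapping` are names introduced here. Because the real `smoking_history` column contains more categories than this sample, the codes below differ from those in the notebook output.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only; the category names appear in the smoking_history
# column shown earlier, but this short list is not the real dataset.
smoking = ["never", "No Info", "ever", "not current", "never"]

smoking_encoder = LabelEncoder()
codes = smoking_encoder.fit_transform(smoking)

# classes_ stores the categories in sorted order; each position is the code.
mapping = {str(label): code for code, label in enumerate(smoking_encoder.classes_)}
print(mapping)

# inverse_transform turns the integer codes back into the original strings.
restored = [str(s) for s in smoking_encoder.inverse_transform(codes)]
print(restored)
```

Keeping a separate encoder per column, as in this sketch, is one way to retain every mapping so that predictions or plots can later be labeled with the original category names.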


### Data Cleaning

**Data cleaning** is the process of detecting and then correcting or removing inaccurate, incomplete, or irrelevant parts of the data. It is a crucial preprocessing step that ensures the data is reliable before analysis or modeling.

#### Common Data Cleaning Tasks:

- **Handling Missing Values**
  - Remove rows or columns with too many missing values.
  - Fill missing values with mean, median, mode, or use interpolation.
  
- **Removing Duplicates**
  - Identify and drop duplicate records to avoid bias.
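
The two tasks above can be sketched on a toy DataFrame as follows. The values in `toy` are invented for illustration, and filling with the median is just one of the options listed, not necessarily the right choice for every column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing bmi and one duplicated row.
toy = pd.DataFrame({
    "age": [32.0, 29.0, 29.0, 41.0],
    "bmi": [27.32, 19.95, 19.95, np.nan],
    "diabetes": [0, 0, 0, 1],
})

print(toy.isna().sum())        # missing values per column
print(toy.duplicated().sum())  # rows that exactly repeat an earlier row

# Fill the missing bmi with the column median, then drop exact duplicates.
cleaned = toy.fillna({"bmi": toy["bmi"].median()}).drop_duplicates()
print(cleaned)
```

Checking `isna().sum()` and `duplicated().sum()` before changing anything makes it clear how many rows each cleaning step will affect.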

In [None]:
# Remove duplicate rows, then confirm with df.info() that no null values remain
df.drop_duplicates(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 96146 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   gender               96146 non-null  int32  
 1   age                  96146 non-null  float64
 2   hypertension         96146 non-null  int64  
 3   heart_disease        96146 non-null  int64  
 4   smoking_history      96146 non-null  int32  
 5   bmi                  96146 non-null  float64
 6   hbA1c_level          96146 non-null  float64
 7   blood_glucose_level  96146 non-null  int64  
 8   diabetes             96146 non-null  int64  
dtypes: float64(3), int32(2), int64(4)
memory usage: 6.6 MB


### Data Normalization

To put the continuous variables on the same scale, we normalize them to the range 0–1 using min-max scaling:

$$x^{*} = \frac{X - \min(x)}{\max(x) - \min(x)}$$

Where:  
- `x*` = normalized value  
- `X` = original value  
- `min(x)` = minimum value of the feature (column)  
- `max(x)` = maximum value of the feature (column)  

Because an SVM relies on distances between data points, this keeps features with large numeric ranges, such as `blood_glucose_level`, from dominating features with small ranges.
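
As a small worked example of min-max scaling, the sketch below rescales one column by hand. The `ages` values are hypothetical and chosen only to keep the arithmetic easy to follow; `MinMaxScaler` with its default settings performs the same computation separately for each column.

```python
# Hypothetical ages, chosen only to make the arithmetic easy to follow.
ages = [20.0, 35.0, 50.0, 80.0]

lo, hi = min(ages), max(ages)

# Subtract the column minimum and divide by the column range.
scaled = [(x - lo) / (hi - lo) for x in ages]
print(scaled)  # → [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, which is why the row with age 80.0 shows an age of 1.000000 after scaling.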


In [14]:
# Normalize the continuous columns to the range 0-1

# Select columns to normalize
cols_to_normalize = ['age', 'bmi', 'hbA1c_level', 'blood_glucose_level']

# Initialize scaler and apply normalization
scaler = MinMaxScaler()
df[cols_to_normalize] = scaler.fit_transform(df[cols_to_normalize])

print(df.tail())

       gender       age  hypertension  heart_disease  smoking_history  \
99995       0  0.411912             0              0                4   
99996       0  1.000000             0              0                0   
99997       1  0.574575             0              0                2   
99998       0  0.637137             0              0                5   
99999       1  0.161662             0              0                0   

            bmi  hbA1c_level  blood_glucose_level  diabetes  
99995  0.130719     0.545455             0.045455         0  
99996  0.311041     0.400000             0.090909         0  
99997  0.304739     0.490909             0.354545         0  
99998  0.225023     0.454545             0.340909         0  
99999  0.083450     0.272727             0.045455         0  
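
One caveat: the cell above fits the scaler on the full dataset. If the data is later split into training and test sets, a common practice is to fit the scaler on the training rows only, so that no information from the test set leaks into preprocessing. The sketch below illustrates that pattern on synthetic data. The arrays `X` and `y`, the 80/20 split, and the random seeds are assumptions made for this example, not values from this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for four numeric feature columns (not the real dataset).
rng = np.random.default_rng(0)
X = rng.uniform(0, 200, size=(100, 4))
y = rng.integers(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

With this approach, scaled test values can fall slightly outside the 0 to 1 range, which is expected because the test rows did not contribute to the minimum and maximum.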
