<a href="https://colab.research.google.com/github/souzajvp/IA_tutorials/blob/main/template_methods/Template_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Versions and packages

In [1]:
# Python version in use
!python --version

Python 3.7.10


In [2]:
# pandas is a widely used package for managing and organizing data
import pandas as pd
# numpy is a package used for numerical operations
import numpy as np
# matplotlib and seaborn are packages for plotting graphs
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib as mpl

In [3]:
print('The Pandas version utilized in this example is -',pd.__version__)
print('The Matplotlib version utilized in this example is -',mpl.__version__)
print('The Seaborn version utilized in this example is -',sns.__version__)
print('The Numpy version utilized in this example is -',np.__version__)

The Pandas version utilized in this example is - 1.1.5
The Matplotlib version utilized in this example is - 3.2.2
The Seaborn version utilized in this example is - 0.11.1
The Numpy version utilized in this example is - 1.19.5


# Sample data 
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are **females at least 21 years old** of Pima Indian heritage.

This dataset can be found on [Kaggle](https://www.kaggle.com/uciml/pima-indians-diabetes-database).

In [4]:
# The dataset can be loaded directly from my GitHub repository
url = 'https://github.com/souzajvp/IA_tutorials/blob/main/diabetes_class/archive.zip?raw=true'

In [5]:
# Reading the dataset with pd.read_csv from pandas.
# The compression must be given explicitly because the URL does not end in '.zip'.
diabetes = pd.read_csv(url, compression='zip')

## Dealing with null values

In [6]:
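# Some variables, such as Insulin, contain zeros that make no physiological sense.
# These zeros are replaced by NaN; the first column (Pregnancies) and the last
# column (Outcome) are left untouched because 0 is a valid value there.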
diabetes[diabetes.columns[1:-1]] = diabetes[diabetes.columns[1:-1]].replace([0], [np.nan])
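
As a minimal sketch of what this replacement does, the toy frame below (hypothetical values, not taken from the dataset) marks zeros as missing only in the middle columns and leaves the first and last columns as they are:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; column names mirror the dataset
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 1],   # 0 is a valid count here
    "Glucose": [148, 0, 85],    # 0 is physiologically implausible
    "Insulin": [0, 94, 0],      # 0 is physiologically implausible
    "Outcome": [1, 0, 0],       # 0 is a valid class label
})

# Replace zeros with NaN in every column except the first and the last
middle = toy.columns[1:-1]
toy[middle] = toy[middle].replace(0, np.nan)

# Count the missing values per column, as done for the real data
missing_per_column = toy.isna().sum()
print(missing_per_column)
```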

Number of missing values for each column

In [7]:
diabetes.isna().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

Insulin and SkinThickness have the most missing values (374 and 227, respectively). All patients with at least one missing value are dropped.

In [8]:
diabetes = diabetes.dropna()

In [9]:
# The final number of patients is greatly reduced.
diabetes.shape

(392, 9)

# Models

Checking for class balance

In [10]:
diabetes['Outcome'].value_counts()

0    262
1    130
Name: Outcome, dtype: int64

There are roughly twice as many patients without diabetes (262) as with it (130). For this reason, **it is important to stratify the split into training and test sets**, so that both keep the same class proportions.
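
To put the imbalance in numbers, the sketch below rebuilds a label vector from the counts shown above (the `outcome` series is a stand-in, not the real column) and computes the class proportions, along with the accuracy a trivial model would reach by always predicting the majority class. This may serve as a reference point when reading the accuracy scores later on.

```python
import pandas as pd

# Stand-in labels built from the class counts reported by value_counts()
outcome = pd.Series([0] * 262 + [1] * 130, name="Outcome")

# Relative frequency of each class
proportions = outcome.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = proportions.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```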

## Splitting the data

In [11]:
# X: feature matrix (all columns except Outcome)
X = diabetes.iloc[:,:-1].values.copy()
# y: target vector (Outcome)
y = diabetes['Outcome'].values.copy()

### Scaling data

In [12]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
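
Fitting the scaler on the full matrix means that the mean and standard deviation also reflect rows that later end up in the test folds. A common alternative, sketched here on synthetic data (the names `X_demo`, `y_demo` and `demo_scaler` are illustrative and not part of this notebook), is to fit the scaler on the training portion only and reuse it for the test portion:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 numeric features, binary target)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(300, 8))
y_demo = rng.integers(0, 2, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# Learn mean and standard deviation from the training rows only
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

# Training features are standardized exactly; test features only approximately
print(X_tr_scaled.mean(axis=0).round(3))
print(X_te_scaled.mean(axis=0).round(3))
```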

### Splitting, training and evaluating models

In [13]:
from sklearn.model_selection import StratifiedKFold
# StratifiedKFold splits the dataset into the desired number of folds.
# With shuffle=True the rows are shuffled in a stratified way, so that the
# proportion of positive and negative outcomes is the same in training and test.

In [14]:
skf = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
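
A quick way to check that stratification behaves as described is to look at the share of positive cases in each test fold. The sketch below uses a stand-in label vector with the same class counts as the cleaned data (262 negatives, 130 positives) and placeholder features, since `split` only needs the number of rows:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stand-in labels with the same class counts as the cleaned dataset
y_demo = np.array([0] * 262 + [1] * 130)
# Placeholder features: split() only needs the number of rows
X_demo = np.zeros((len(y_demo), 1))

demo_skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

positive_rates = []
for fold, (train_idx, test_idx) in enumerate(demo_skf.split(X_demo, y_demo), start=1):
    # Share of positive outcomes in this test fold
    rate = y_demo[test_idx].mean()
    positive_rates.append(rate)
    print(f"Fold {fold}: {len(test_idx)} test rows, positive rate {rate:.3f}")
```

Each fold should show a positive rate close to the overall rate of 130/392.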

In [15]:
# Importing the LogisticRegression model from sklearn
from sklearn.linear_model import LogisticRegression
# instantiating the model with its default settings
logit = LogisticRegression()

In [16]:
# Importing metrics to evaluate the model performance
from sklearn.metrics import accuracy_score, precision_score, recall_score
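
For reference, the three metrics can be checked by hand on a small example. The labels below are hypothetical and chosen only so that the counts are easy to verify:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels for ten patients (1 = diabetes, 0 = no diabetes)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# Counts: TP = 2, FN = 2, FP = 1, TN = 5
acc = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / total = 7 / 10
prec = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 2 / 3
rec = recall_score(y_true_demo, y_pred_demo)      # TP / (TP + FN) = 2 / 4
print(acc, prec, rec)
```

Recall is the metric that reports how many of the truly diabetic patients the model finds, which is why it is tracked alongside accuracy here.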

## Training the models

In [17]:
# Here I create a set of lists to hold the scores obtained
accuracy_train = []
accuracy_test = []

precision_train = []
precision_test = []

recall_train = []
recall_test = []

# Here I iterate through the 3 splits made by skf, using the X matrix and y target.
# For every split, skf selects the indexes to be used for training and testing.
# The model is fitted on the training rows, and its performance is evaluated on
# both the training and test sets using accuracy, precision and recall.
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # fitting model
    logit.fit(X_train, y_train)
    # Predictions on the training set
    y_train_pred = logit.predict(X_train)
    # Predictions on the testing set
    y_test_pred = logit.predict(X_test)

    # In the lines below, I add the values to the set of lists.

    accuracy_train.append(accuracy_score(y_train, y_train_pred))
    accuracy_test.append(accuracy_score(y_test, y_test_pred))

    precision_train.append(precision_score(y_train, y_train_pred))
    precision_test.append(precision_score(y_test, y_test_pred))

    recall_train.append(recall_score(y_train, y_train_pred))
    recall_test.append(recall_score(y_test, y_test_pred))

# Here I create a new list to represent the average of values found
scores_stratified = ['Average_3models', np.mean(accuracy_train), np.mean(precision_train), np.mean(recall_train),
                                        np.mean(accuracy_test), np.mean(precision_test), np.mean(recall_test),]
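
The manual loop above makes every step explicit. As a more compact alternative, scikit-learn's `cross_validate` can collect the same train and test scores in one call. The sketch below runs on synthetic data from `make_classification` (an assumption made to keep the block self-contained) and wraps the scaler in a pipeline, so that it is refitted on the training rows of every fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the diabetes data (assumption: 392 rows, 8 features)
X_demo, y_demo = make_classification(
    n_samples=392, n_features=8, weights=[0.67, 0.33], random_state=42
)

# The pipeline refits the scaler inside every training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

cv_scores = cross_validate(
    pipeline, X_demo, y_demo, cv=cv,
    scoring=["accuracy", "precision", "recall"],
    return_train_score=True,
)

# Average each metric over the three folds
for name in ["accuracy", "precision", "recall"]:
    train_mean = cv_scores[f"train_{name}"].mean()
    test_mean = cv_scores[f"test_{name}"].mean()
    print(f"{name}: train {train_mean:.3f}, test {test_mean:.3f}")
```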

In [18]:
models = ['Logit_'+str(i) for i in range(1,4)]

In [19]:
# Setting everything to be in a good format for creating a dataframe
data = np.column_stack((models, accuracy_train, precision_train, recall_train,
                                accuracy_test, precision_test, recall_test))

In [20]:
results = pd.DataFrame(columns=['Model', 'Accuracy_train', 'Precision_train', 'Recall_train',
                                         'Accuracy_test', 'Precision_test', 'Recall_test'],
                       data=data)
results.loc[3] = scores_stratified
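
One side effect of `np.column_stack` is that stacking the model names together with the scores produces an array of strings, so the score columns of the first three rows are stored as text. A possible alternative, sketched here with hypothetical score lists, is to build the frame from a dictionary so that the scores stay numeric:

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold scores, used only to illustrate the construction
demo_accuracy_test = [0.80, 0.77, 0.77]
demo_recall_test = [0.49, 0.57, 0.63]

# One row per model plus a final row holding the averages
demo_results = pd.DataFrame({
    "Model": [f"Logit_{i}" for i in range(1, 4)] + ["Average_3models"],
    "Accuracy_test": demo_accuracy_test + [np.mean(demo_accuracy_test)],
    "Recall_test": demo_recall_test + [np.mean(demo_recall_test)],
})

# Score columns keep a float dtype, so rounding and sorting work as expected
print(demo_results.dtypes)
print(demo_results.round(3))
```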

In [21]:
results

| | Model | Accuracy_train | Precision_train | Recall_train | Accuracy_test | Precision_test | Recall_test |
|---|---|---|---|---|---|---|---|
| 0 | Logit_1 | 0.7777777777777778 | 0.704225352112676 | 0.5747126436781609 | 0.8015267175572519 | 0.84 | 0.4883720930232558 |
| 1 | Logit_2 | 0.8122605363984674 | 0.7761194029850746 | 0.6046511627906976 | 0.7709923664122137 | 0.6944444444444444 | 0.5681818181818182 |
| 2 | Logit_3 | 0.7977099236641222 | 0.7361111111111112 | 0.6091954022988506 | 0.7692307692307693 | 0.6585365853658537 | 0.627906976744186 |
| 3 | Average_3models | 0.795916 | 0.738819 | 0.596186 | 0.780583 | 0.730994 | 0.561487 |
