<a href="https://colab.research.google.com/github/ralsouza/machine_learning_python/blob/master/notebooks/02_machine_learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Defining the business problem
Let's create a predictive model that predicts whether or not a patient has diabetes. To do this, we'll use historical patient data.

We'll use the [Diabetes Data Set](http://archive.ics.uci.edu/ml/datasets/diabetes).


This dataset describes the medical records of Pima Indian patients. Each record is labeled with whether or not the patient developed diabetes.


**Information about the attributes:**

1. Number of times pregnant
2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. Diastolic blood pressure (mm Hg)
4. Triceps skin fold thickness (mm)
5. 2-Hour serum insulin (mu U/ml)
6. Body mass index (weight in kg/(height in m)^2)
7. Diabetes pedigree function
8. Age (years)
9. Class variable (0 or 1)

# 2. Extracting and loading data

In [1]:
# Import libraries
import pandas as pd

In [2]:
# Dataset path
path = '/content/drive/My Drive/Colab Notebooks/07_machine_learning/data/pima-data.csv'

In [3]:
# Load data
data = pd.read_csv(path)
data.head(5)

# This dataset has no header row, so the first record was read as the column names

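|   | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 1 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 2 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 3 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 4 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |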


In [4]:
# Load again and add column names
columns = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(path, names=columns)

In [5]:
# Check dataset with column names
data.head(5)

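|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |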


In [6]:
# Checking the shape
data.shape

(768, 9)

# 3. Exploratory Data Analysis

## 3.1 Descriptive Analysis

In [7]:
# Predictor variables: preg, plas, pres, skin, test, mass, pedi, age
# Target variable: class
data.head(10)

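|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
| 6 | 3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
| 7 | 10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
| 8 | 2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
| 9 | 8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |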


If the file has a very large number of rows, the algorithm may take a long time to train. If it has too few records, there may not be enough data to train the model.

If there are many columns, the algorithm may have performance problems because of the high dimensionality.

The best solution depends on the case. Remember: train the model on a subset of the whole dataset, and then apply the model to new data.
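The idea of training on a subset and keeping the rest as unseen data can be sketched with pandas alone. This is a minimal illustration, not the notebook's own method: the `toy` DataFrame, its values, the 70/30 ratio, and the `random_state` are all assumptions chosen for the example.

```python
import pandas as pd

# Toy data standing in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    'feature': range(10),
    'label': [0, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Draw 70% of the rows at random for training (assumed ratio)
train = toy.sample(frac=0.7, random_state=42)

# The remaining rows play the role of new, unseen data
test = toy.drop(train.index)

print(len(train), len(test))
```

In practice, scikit-learn's `train_test_split` is a common alternative that can also keep the class proportions equal in both parts through its `stratify` parameter.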

In [8]:
# Check shape
data.shape

(768, 9)

Data types matter: strings may need to be converted to numbers, and columns of integers may actually represent categorical or ordinal values.
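As a small sketch of both cases, the example below converts a string column to numbers and marks an integer column as categorical. The `toy` DataFrame is hypothetical. In the real dataset every column is already numeric, so this only illustrates what such a conversion could look like.

```python
import pandas as pd

# Hypothetical frame: numbers stored as strings, and integers that encode a category
toy = pd.DataFrame({
    'age': ['50', '31', '32'],
    'class': [1, 0, 1],
})

# Strings that hold numbers become a numeric column
toy['age'] = pd.to_numeric(toy['age'])

# Integers that are really labels become a categorical column
toy['class'] = toy['class'].astype('category')

print(toy.dtypes)
```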

In [9]:
# Data types of each attribute
data.dtypes

preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object

In [11]:
# Describing data
data.describe()

|   | preg | plas | pres | skin | test | mass | pedi | age | class |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [12]:
# Check the balance of the target variable - Distribution of classes
data.groupby('class').size()

class
0    500
1    268
dtype: int64

In classification problems, it may be necessary to balance the classes. Imbalanced classes (that is, one class with far more records than the other) are common and need to be addressed during the preprocessing phase. The output above shows a clear disproportion between class 0 (no diabetes, 500 records) and class 1 (diabetes, 268 records).

The algorithm may learn more about class 0 than about class 1, so a resampling technique can be applied to balance the classes.
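One possible balancing technique is random oversampling of the minority class, sketched below. The notebook does not say which technique it uses, so this choice is an assumption. The `target` frame is rebuilt only from the class counts shown above (500 and 268), and names such as `minority_up` and `balanced` are illustrative.

```python
import pandas as pd

# Rebuild only the target column from the class counts shown above
target = pd.DataFrame({'class': [0] * 500 + [1] * 268})

# Share of each class before balancing
proportions = target['class'].value_counts(normalize=True)
print(proportions)

# Random oversampling: resample the minority class with replacement
# until it has as many rows as the majority class
majority = target[target['class'] == 0]
minority = target[target['class'] == 1]
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)

balanced = pd.concat([majority, minority_up], ignore_index=True)
print(balanced['class'].value_counts())
```

Resampling of this kind should be applied to the training subset only. Otherwise, duplicated rows can end up in both the training data and the data used for evaluation.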