# Linear regression on the diabetes dataset

Let's explore the datasets included with scikit-learn. These datasets have been cleaned and formatted for use in ML algorithms.

## Ex. 1: Load and explore the sklearn diabetes dataset

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()

In [3]:
diabetes 


{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990749, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06833155, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286131, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04688253,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452873, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00422151,  0.00306441]]),
 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
         69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
         68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
         87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
        259.,  53., 190., 142.,  75., 142., 155., 225.,  59

In [4]:
print(diabetes['DESCR']) # Prints the description of the diabetes dataset


.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

## Based on the data description, answer the following questions:

1. How many attributes are there in the data? What do they mean?

1. What is the relationship between `diabetes['data']` and `diabetes['target']`?

1. How many records are there in the data?

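Number of instances: 442 (records)

Number of features: 10 (age, sex, bmi, bp, and six blood serum measurements, s1 to s6)

`diabetes['target']` holds, for each row of `diabetes['data']`, the quantitative measure of disease progression one year after baseline. The description refers to it as column 11, but it is stored as a separate array.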


## Now explore what the *data* part and the *target* part of `diabetes` contain.

Scikit-learn normally takes 2D NumPy arrays as input (pandas DataFrames are also accepted). Inspect the shape of `data` and `target`. Confirm that they are consistent with the data description.
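One way to run this check is sketched below. It relies only on the `load_diabetes` call used earlier; the variable names `X` and `y` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X, y = diabetes.data, diabetes.target  # illustrative names, not from the notebook

print(type(X), X.shape)  # 2D NumPy array: one row per patient, one column per feature
print(type(y), y.shape)  # 1D NumPy array: one target value per patient
print(diabetes.feature_names)

# The description reports 442 instances and 10 attributes.
assert X.shape == (442, 10)
assert y.shape == (442,)
assert len(diabetes.feature_names) == X.shape[1]
```

If any of the assertions fails, the loaded arrays do not match the description and the rest of the notebook should be revisited.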

## Perform an EDA of the data

In [5]:
diabetes['data']  # The feature matrix: one row per patient, one column per feature


array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]])

In [6]:
'''target column contains 
the quantitative measure of disease progression 
that the model will attempt to predict.'''

diabetes['target']




array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [None]:
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df

In [8]:
df.isnull().sum()

age    0
sex    0
bmi    0
bp     0
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64

In [9]:
# Dimensions (rows, columns)

diabetes['data'].shape




(442, 10)
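The EDA above covers missing values and dimensions. A possible next step, sketched here as a suggestion rather than as part of the original notebook, is to look at summary statistics and at how strongly each feature correlates with the target. The names `eda_df`, `corr_with_target`, and the `"target"` column are hypothetical and chosen only for this example.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

# Separate frame so the notebook's own `df` is left untouched
eda_df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
eda_df["target"] = diabetes.target  # hypothetical column name

# Summary statistics, one row per column
print(eda_df.describe().T)

# Pearson correlation of each feature with the target, strongest first
corr_with_target = eda_df.corr()["target"].drop("target")
print(corr_with_target.sort_values(key=abs, ascending=False))
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is a rough guide to which features a linear regression may lean on, not a substitute for fitting the model.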

## Build a model

1. Create a linear regression model.
2. Split the data into training and test sets. Use the last 20 rows as the test data.
3. Train the model. Show the model's parameters.
4. Make a prediction with the test data.

In [10]:
# Create a linear regression model.
from sklearn.linear_model import LinearRegression
modelo = LinearRegression()

## Plot the predictions and compare them with the test data

In [45]:
X_train = diabetes['data'][:-20]
y_train = diabetes['target'][:-20]
X_test = diabetes['data'][-20:]
y_test = diabetes['target'][-20:]

In [46]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(422, 10)
(422,)
(20, 10)
(20,)


## Compute and visualize the errors
* fit
* `modelo.intercept_` / `modelo.coef_`

In [47]:
modelo.fit(X_train, y_train)    

In [48]:
modelo.intercept_ 


# The intercept (bias) term of the model.


152.76429169049118

In [49]:
modelo.coef_   


# The coefficients (one slope per feature) learned during fitting.


array([ 3.06094248e-01, -2.37635570e+02,  5.10538048e+02,  3.27729878e+02,
       -8.14111926e+02,  4.92799595e+02,  1.02841240e+02,  1.84603496e+02,
        7.43509388e+02,  7.60966464e+01])
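The raw coefficient array is hard to read without the feature names. The sketch below, which refits the same model on the same split so that it runs on its own, labels each coefficient and then checks that a prediction is simply the dot product of the features with `coef_` plus `intercept_`. The names `coefs` and `manual_pred` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]
X_test = diabetes.data[-20:]

modelo = LinearRegression().fit(X_train, y_train)

# Label each coefficient with the feature it belongs to
coefs = pd.Series(modelo.coef_, index=diabetes.feature_names)
print(coefs)

# A linear model predicts y_hat = X @ coef_ + intercept_
manual_pred = X_test @ modelo.coef_ + modelo.intercept_
print(np.allclose(manual_pred, modelo.predict(X_test)))
```

The final comparison uses `np.allclose` rather than `==` because the two computations may differ by floating-point rounding.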

In [50]:
# Predict
modelo.predict(X_test)  

array([197.61898486, 155.44031962, 172.88875144, 111.53270645,
       164.79397301, 131.06765869, 259.12441219, 100.47873746,
       117.06005372, 124.30261597, 218.36868146,  61.19581944,
       132.24837933, 120.33293546,  52.54513009, 194.03746764,
       102.5756431 , 123.56778709, 211.03465323,  52.60221696])

In [51]:
y_test

array([233.,  91., 111., 152., 120.,  67., 310.,  94., 183.,  66., 173.,
        72.,  49.,  64.,  48., 178., 104., 132., 220.,  57.])
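Comparing the two arrays by eye is tedious. A minimal plotting sketch follows, assuming matplotlib is installed (the notebook itself does not import it). It refits the model on the same split so it is self-contained; `y_pred`, `fig`, `ax`, and `lims` are illustrative names.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]
X_test, y_test = diabetes.data[-20:], diabetes.target[-20:]

modelo = LinearRegression().fit(X_train, y_train)
y_pred = modelo.predict(X_test)

fig, ax = plt.subplots(figsize=(5, 5))

# One point per test row: actual value on x, predicted value on y
ax.scatter(y_test, y_pred)

# Dashed diagonal marks where a perfect prediction would fall
lims = [min(y_test.min(), y_pred.min()), max(y_test.max(), y_pred.max())]
ax.plot(lims, lims, linestyle="--", color="gray")

ax.set_xlabel("Actual disease progression (y_test)")
ax.set_ylabel("Predicted disease progression")
ax.set_title("Predicted vs. actual on the 20 test rows")
plt.show()
```

Points above the diagonal are overestimates and points below it are underestimates; the vertical distance to the line is the size of the error for that row.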

In [61]:
# Difference between the actual values (y_test) and the model's
# predictions (modelo.predict(X_test)) on the test data: the
# per-sample prediction errors (residuals) on the test set.


y_test - modelo.predict(X_test)



array([ 35.38101514, -64.44031962, -61.88875144,  40.46729355,
       -44.79397301, -64.06765869,  50.87558781,  -6.47873746,
        65.93994628, -58.30261597, -45.36868146,  10.80418056,
       -83.24837933, -56.33293546,  -4.54513009, -16.03746764,
         1.4243569 ,   8.43221291,   8.96534677,   4.39778304])

In [59]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Prediction on the X_test data
predict = modelo.predict(X_test)






In [60]:
print("RMSE:", np.sqrt(mean_squared_error(y_test, predict)))
print("MAE:", mean_absolute_error(y_test, predict))

RMSE: 44.77185149548999
MAE: 36.60961865545878
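Both metrics can be derived directly from the residuals computed earlier: RMSE is the square root of the mean squared residual, and MAE is the mean absolute residual. The sketch below recomputes them with NumPy and compares the results with the scikit-learn functions. It refits the model on the same split so it runs on its own; `residuals`, `rmse_manual`, and `mae_manual` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

diabetes = load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]
X_test, y_test = diabetes.data[-20:], diabetes.target[-20:]

modelo = LinearRegression().fit(X_train, y_train)
predict = modelo.predict(X_test)

# Per-sample errors on the test set
residuals = y_test - predict

rmse_manual = np.sqrt(np.mean(residuals ** 2))  # penalizes large errors more heavily
mae_manual = np.mean(np.abs(residuals))         # average error magnitude

print("RMSE (manual):", rmse_manual)
print("MAE (manual):", mae_manual)

# Both should agree with the scikit-learn metrics up to rounding
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_test, predict))))
print(np.isclose(mae_manual, mean_absolute_error(y_test, predict)))
```

RMSE is never smaller than MAE, and a large gap between them suggests that a few test rows have much larger errors than the rest.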
