# Linear regression on the diabetes dataset

Let's explore the datasets bundled with scikit-learn. These datasets have been cleaned and formatted for use with ML algorithms.

## Ex. 1: Load and explore the diabetes dataset from sklearn

In [3]:
import pandas as pd
import numpy as np

In [4]:
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()

In [5]:
diabetes 


{'data': array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
          0.01990749, -0.01764613],
        [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
         -0.06833155, -0.09220405],
        [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
          0.00286131, -0.02593034],
        ...,
        [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
         -0.04688253,  0.01549073],
        [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
          0.04452873, -0.02593034],
        [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
         -0.00422151,  0.00306441]]),
 'target': array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
         69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
         68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
         87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
        259.,  53., 190., 142.,  75., 142., 155., 225.,  59

In [6]:
print(diabetes['DESCR']) # Prints the description of the diabetes dataset


.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

## Using the data description, answer the following questions:

1. How many attributes are there in the data? What do they mean?

1. What is the relationship between `diabetes['data']` and `diabetes['target']`?

1. How many records are there in the data?

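Number of instances (records): 442

Number of features: 10 (age, sex, bmi, bp, and six blood serum measurements, s1 to s6)

`diabetes['target']` is stored separately from `diabetes['data']` (the description calls it column 11). Each row of `data` holds the 10 features for one patient, and the matching entry of `target` is that patient's quantitative measure of disease progression one year after baseline.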


## Now explore the contents of the *data* and *target* parts of `diabetes`.

Scikit-learn normally takes 2D numpy arrays as input (although pandas dataframes are also accepted). Inspect the shape of `data` and `target`. Confirm that they are consistent with the data description.
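Since pandas dataframes are also accepted, one option is to load the dataset directly as a dataframe. The sketch below assumes scikit-learn 0.23 or later, where `load_diabetes` takes an `as_frame` argument; the variable names are only illustrative and are not used elsewhere in this notebook.

```python
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of numpy arrays
diabetes_df = load_diabetes(as_frame=True)

X = diabetes_df.data        # DataFrame with the 10 feature columns
y = diabetes_df.target      # Series with the disease progression measure
frame = diabetes_df.frame   # features and target together (11 columns)

print(X.shape, y.shape, frame.shape)
```

The shapes printed here can be compared with the 442 instances and 10 attributes stated in the description.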

## Perform an EDA on the data

In [7]:
diabetes['data'] # Feature matrix: one row per patient, one column per feature


array([[ 0.03807591,  0.05068012,  0.06169621, ..., -0.00259226,
         0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, ..., -0.03949338,
        -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, ..., -0.00259226,
         0.00286131, -0.02593034],
       ...,
       [ 0.04170844,  0.05068012, -0.01590626, ..., -0.01107952,
        -0.04688253,  0.01549073],
       [-0.04547248, -0.04464164,  0.03906215, ...,  0.02655962,
         0.04452873, -0.02593034],
       [-0.04547248, -0.04464164, -0.0730303 , ..., -0.03949338,
        -0.00422151,  0.00306441]])

In [8]:
# The target array contains the quantitative measure of disease
# progression that the model will attempt to predict.

diabetes['target']




array([151.,  75., 141., 206., 135.,  97., 138.,  63., 110., 310., 101.,
        69., 179., 185., 118., 171., 166., 144.,  97., 168.,  68.,  49.,
        68., 245., 184., 202., 137.,  85., 131., 283., 129.,  59., 341.,
        87.,  65., 102., 265., 276., 252.,  90., 100.,  55.,  61.,  92.,
       259.,  53., 190., 142.,  75., 142., 155., 225.,  59., 104., 182.,
       128.,  52.,  37., 170., 170.,  61., 144.,  52., 128.,  71., 163.,
       150.,  97., 160., 178.,  48., 270., 202., 111.,  85.,  42., 170.,
       200., 252., 113., 143.,  51.,  52., 210.,  65., 141.,  55., 134.,
        42., 111.,  98., 164.,  48.,  96.,  90., 162., 150., 279.,  92.,
        83., 128., 102., 302., 198.,  95.,  53., 134., 144., 232.,  81.,
       104.,  59., 246., 297., 258., 229., 275., 281., 179., 200., 200.,
       173., 180.,  84., 121., 161.,  99., 109., 115., 268., 274., 158.,
       107.,  83., 103., 272.,  85., 280., 336., 281., 118., 317., 235.,
        60., 174., 259., 178., 128.,  96., 126., 28

In [9]:
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
df

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


In [10]:
df.isnull().sum()

age    0
sex    0
bmi    0
bp     0
s1     0
s2     0
s3     0
s4     0
s5     0
s6     0
dtype: int64

In [26]:
# Dimensions (rows, columns)

print(diabetes['data'].shape)
print(diabetes['target'].shape)




(442, 10)
(442,)


In [24]:
df.describe()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
count,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0,442.0
mean,-2.511817e-19,1.23079e-17,-2.245564e-16,-4.79757e-17,-1.3814990000000001e-17,3.9184340000000004e-17,-5.777179e-18,-9.04254e-18,9.293722000000001e-17,1.130318e-17
std,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905,0.04761905
min,-0.1072256,-0.04464164,-0.0902753,-0.1123988,-0.1267807,-0.1156131,-0.1023071,-0.0763945,-0.1260971,-0.1377672
25%,-0.03729927,-0.04464164,-0.03422907,-0.03665608,-0.03424784,-0.0303584,-0.03511716,-0.03949338,-0.03324559,-0.03317903
50%,0.00538306,-0.04464164,-0.007283766,-0.005670422,-0.004320866,-0.003819065,-0.006584468,-0.002592262,-0.001947171,-0.001077698
75%,0.03807591,0.05068012,0.03124802,0.03564379,0.02835801,0.02984439,0.0293115,0.03430886,0.03243232,0.02791705
max,0.1107267,0.05068012,0.1705552,0.1320436,0.1539137,0.198788,0.1811791,0.1852344,0.1335973,0.1356118
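The identical `std` of 0.04761905 in every column follows from how the features were preprocessed. According to the scikit-learn dataset description, each feature was mean centered and scaled so that the sum of squares of each column totals 1. With n = 442 rows, the sample standard deviation reported by `describe()` is then sqrt(1 / (n - 1)) = 1/21 ≈ 0.047619. A minimal check, with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data
n = X.shape[0]

col_means = X.mean(axis=0)           # expected to be ~0 for every column
col_sum_sq = (X ** 2).sum(axis=0)    # expected to be ~1 for every column
col_std = X.std(axis=0, ddof=1)      # sample std, as used by df.describe()

print(np.allclose(col_means, 0.0))
print(np.allclose(col_sum_sq, 1.0))
print(np.allclose(col_std, np.sqrt(1.0 / (n - 1))))
```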


In [27]:
df.corr()

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
age,1.0,0.173737,0.185085,0.335428,0.260061,0.219243,-0.075181,0.203841,0.270774,0.301731
sex,0.173737,1.0,0.088161,0.24101,0.035277,0.142637,-0.37909,0.332115,0.149916,0.208133
bmi,0.185085,0.088161,1.0,0.395411,0.249777,0.26117,-0.366811,0.413807,0.446157,0.38868
bp,0.335428,0.24101,0.395411,1.0,0.242464,0.185548,-0.178762,0.25765,0.39348,0.39043
s1,0.260061,0.035277,0.249777,0.242464,1.0,0.896663,0.051519,0.542207,0.515503,0.325717
s2,0.219243,0.142637,0.26117,0.185548,0.896663,1.0,-0.196455,0.659817,0.318357,0.2906
s3,-0.075181,-0.37909,-0.366811,-0.178762,0.051519,-0.196455,1.0,-0.738493,-0.398577,-0.273697
s4,0.203841,0.332115,0.413807,0.25765,0.542207,0.659817,-0.738493,1.0,0.617859,0.417212
s5,0.270774,0.149916,0.446157,0.39348,0.515503,0.318357,-0.398577,0.617859,1.0,0.464669
s6,0.301731,0.208133,0.38868,0.39043,0.325717,0.2906,-0.273697,0.417212,0.464669,1.0
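Reading a 10 by 10 correlation table by eye is error prone, so it can help to list only the strongly correlated feature pairs. The sketch below is one way to do that; the 0.6 cutoff is an arbitrary value chosen for illustration.

```python
from itertools import combinations

import pandas as pd
from sklearn.datasets import load_diabetes

data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
corr = df.corr()

threshold = 0.6  # arbitrary cutoff, chosen only for illustration

# Visit each unordered pair of features once and keep the strong correlations
pairs = [(a, b, corr.loc[a, b]) for a, b in combinations(df.columns, 2)]
strong = sorted(
    (p for p in pairs if abs(p[2]) > threshold),
    key=lambda p: abs(p[2]),
    reverse=True,
)

for a, b, r in strong:
    print(f"{a}-{b}: {r:.3f}")
```

Strongly correlated predictors such as s1 and s2 matter later, because they make individual linear regression coefficients harder to interpret.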


In [30]:
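import seaborn as sns
# seaborn has no corr() function; compute the correlations with pandas and plot them as a heatmap
sns.heatmap(df.corr(), annot=True, fmt=".2f", cmap="coolwarm")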

## Build a model

1. Create a linear regression model.
2. Split the data into training and test sets. Use the last 20 rows as the test data.
3. Train the model. Show the model's parameters.
4. Make a prediction on the test data.

In [13]:
# Create a linear regression model.
from sklearn.linear_model import LinearRegression
modelo = LinearRegression()

## Plot the predictions and compare them with the test data

In [14]:
X_train = diabetes['data'][:-20]
y_train = diabetes['target'][:-20]
X_test = diabetes['data'][-20:]
y_test = diabetes['target'][-20:]

In [15]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(422, 10)
(422,)
(20, 10)
(20,)
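The same split can be produced with `train_test_split` from `sklearn.model_selection`. This is a sketch of an alternative, not what the notebook does: with `shuffle=False` and an integer `test_size`, the function keeps the row order and places the last 20 rows in the test set. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# shuffle=False preserves row order, so the test set is the final 20 rows
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=20, shuffle=False)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X[-20:]), np.array_equal(y_te, y[-20:]))
```

Leaving out `shuffle=False` gives a random split, which is usually preferable unless the rows are ordered in a meaningful way or, as here, the exercise fixes which rows form the test set.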


## Compute and visualize the errors
* `fit`
* `modelo.intercept_` / `modelo.coef_`

In [16]:
modelo.fit(X_train, y_train)    

In [17]:
modelo.intercept_ 


# The intercept (bias) term of the model.


152.76429169049118

The intercept (also called the bias or constant term) is the value the linear regression model predicts when every input feature equals zero.

Here the target is the quantitative measure of disease progression one year after baseline, not a blood glucose level, so 152.76 is a predicted progression score.

In many models an all-zero input is unrealistic and the intercept has no direct real-world meaning. In this dataset, however, the `df.describe()` output above shows that every feature has a mean of approximately zero, so an all-zero input corresponds to a patient with average values for every feature. The intercept is therefore close to the mean of the training targets.

The size of the intercept alone does not indicate missing variables or overfitting. It mainly reflects the scale of the target, and it should be read together with the coefficients and the error metrics on held-out data.
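The relationship between the intercept and the training data can be checked numerically. For ordinary least squares with an intercept, the intercept equals the mean of the training targets minus the dot product of the coefficients with the training feature means. The sketch below refits the model on the same split (all rows except the last 20); the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
X_train, y_train = X[:-20], y[:-20]

model = LinearRegression().fit(X_train, y_train)

# OLS identity: intercept = mean(y) - coef . mean(X), over the training rows
reconstructed = y_train.mean() - model.coef_ @ X_train.mean(axis=0)

print("intercept:      ", model.intercept_)
print("reconstructed:  ", reconstructed)
print("mean of y_train:", y_train.mean())
```

The training feature means are close to zero but not exactly zero, because the features were centered over all 442 rows rather than the 422 training rows. The intercept is therefore near the training mean without matching it exactly.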





In [18]:
modelo.coef_   


# coef_ holds the fitted coefficient (slope) for each of the 10 features; the intercept is stored separately in intercept_.


array([ 3.06094248e-01, -2.37635570e+02,  5.10538048e+02,  3.27729878e+02,
       -8.14111926e+02,  4.92799595e+02,  1.02841240e+02,  1.84603496e+02,
        7.43509388e+02,  7.60966464e+01])

In [19]:
# Predict
modelo.predict(X_test)  

array([197.61898486, 155.44031962, 172.88875144, 111.53270645,
       164.79397301, 131.06765869, 259.12441219, 100.47873746,
       117.06005372, 124.30261597, 218.36868146,  61.19581944,
       132.24837933, 120.33293546,  52.54513009, 194.03746764,
       102.5756431 , 123.56778709, 211.03465323,  52.60221696])

The `modelo.coef_` array shown above (before the predictions) contains one fitted coefficient per feature. For a linear regression model, the prediction is computed as:

`Prediction = Intercept + Coeff1*X1 + Coeff2*X2 + ... + CoeffN*XN`

There are 10 coefficients, one for each of the 10 features, in the same order as `diabetes.feature_names`.

Each coefficient is the change in the predicted output for a 1 unit increase in that feature, keeping the other features fixed. Because the features here are scaled to roughly the range -0.14 to 0.20 (see `df.describe()`), a full 1 unit change is far larger than any difference observed in the data, which is why the coefficients are in the hundreds.

A positive coefficient means the predicted output increases as that input increases. A negative coefficient means the predicted output decreases as that input increases.

Since all 10 features share the same standard deviation, the absolute values of the coefficients can be compared directly as a rough measure of influence. By that measure, s1 (-814.11), s5 (743.51) and bmi (510.54) stand out. Treat this with caution for correlated features: s1 and s2 have a correlation of 0.90, so their large coefficients of opposite sign (-814.11 and 492.80) partly offset each other and should not be interpreted in isolation.

In [20]:
y_test

array([233.,  91., 111., 152., 120.,  67., 310.,  94., 183.,  66., 173.,
        72.,  49.,  64.,  48., 178., 104., 132., 220.,  57.])
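An earlier heading asks for the predictions to be plotted against the test data, but the notebook shows no plot. A minimal sketch with matplotlib follows. matplotlib is an assumption here, since the notebook never imports it, and the arrays are the `y_test` values and predictions printed above, with the predictions rounded to two decimals.

```python
import numpy as np
import matplotlib.pyplot as plt

# Actual test targets and model predictions, copied from the outputs above
y_test = np.array([233., 91., 111., 152., 120., 67., 310., 94., 183., 66.,
                   173., 72., 49., 64., 48., 178., 104., 132., 220., 57.])
y_pred = np.array([197.62, 155.44, 172.89, 111.53, 164.79, 131.07, 259.12,
                   100.48, 117.06, 124.30, 218.37, 61.20, 132.25, 120.33,
                   52.55, 194.04, 102.58, 123.57, 211.03, 52.60])

def plot_predictions(actual, predicted):
    """Scatter predicted vs. actual values with a y = x reference line."""
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.scatter(actual, predicted)
    lims = [min(actual.min(), predicted.min()),
            max(actual.max(), predicted.max())]
    # Perfect predictions would fall exactly on this dashed line
    ax.plot(lims, lims, linestyle="--", color="gray")
    ax.set_xlabel("Actual disease progression (y_test)")
    ax.set_ylabel("Predicted disease progression")
    ax.set_title("Predicted vs. actual on the last 20 rows")
    return fig, ax

fig, ax = plot_predictions(y_test, y_pred)
```

In a notebook the figure renders automatically at the end of the cell; in a script, call `plt.show()` or `fig.savefig(...)` afterwards. Points above the dashed line are over-predictions and points below it are under-predictions.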

In [21]:
# Difference between the actual values (y_test) and the
# predictions (modelo.predict(X_test)) on the test data.
# This gives the error of the model's predictions on the test set.


y_test - modelo.predict(X_test)



array([ 35.38101514, -64.44031962, -61.88875144,  40.46729355,
       -44.79397301, -64.06765869,  50.87558781,  -6.47873746,
        65.93994628, -58.30261597, -45.36868146,  10.80418056,
       -83.24837933, -56.33293546,  -4.54513009, -16.03746764,
         1.4243569 ,   8.43221291,   8.96534677,   4.39778304])

This array shows the residuals, which are the differences between the actual `y_test` values and the values predicted by `modelo.predict(X_test)` on the test data.

Specifically:

Residual = Actual y value - Predicted y value

A low residual value means the prediction was close to the actual.

A high positive residual means the model under-predicted the actual y value.

A high negative residual means the model over-predicted the actual y value.

Looking at the residuals is useful for evaluating model performance:

The residuals should randomly scatter around 0 if the model predictions are unbiased.

Any systematic patterns (e.g. large residuals only in one direction) indicate a problem.

The average and spread of residuals reveals overall model accuracy.

In this case, the residuals range from about -83 to 66, so individual predictions are off by as much as 83 units. Eleven of the 20 residuals are negative and the largest errors are over-predictions, so on this small test set the model tends to predict too high. With only 20 rows this is a weak signal, but it is the kind of pattern a residual check is meant to surface.
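The average and spread of the residuals can be computed directly. The sketch below copies the residuals printed above (rounded to two decimals) and summarizes them; only numpy is assumed, and the variable names are illustrative.

```python
import numpy as np

# Residuals (y_test - prediction) copied from the output above, rounded to 2 decimals
residuals = np.array([35.38, -64.44, -61.89, 40.47, -44.79, -64.07, 50.88,
                      -6.48, 65.94, -58.30, -45.37, 10.80, -83.25, -56.33,
                      -4.55, -16.04, 1.42, 8.43, 8.97, 4.40])

mean_residual = residuals.mean()           # negative means over-prediction on average
mae = np.abs(residuals).mean()             # mean absolute error
rmse = np.sqrt((residuals ** 2).mean())    # root mean squared error
n_negative = int((residuals < 0).sum())    # rows where the prediction was too high

print(f"mean residual: {mean_residual:.2f}")
print(f"MAE: {mae:.2f}  RMSE: {rmse:.2f}")
print(f"over-predicted rows: {n_negative} of {residuals.size}")
```

A mean residual near zero suggests unbiased predictions on these rows, while a clearly negative or positive mean points to systematic over- or under-prediction.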





In [22]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Prediction on the X_test data
predict = modelo.predict(X_test)






In [23]:
print("RMSE:", np.sqrt(mean_squared_error(y_test, predict)))
print("MAE:", mean_absolute_error(y_test, predict))

RMSE: 44.77185149548999
MAE: 36.60961865545878


The RMSE (Root Mean Squared Error) and MAE (Mean Absolute Error) support the following conclusions about the model's performance:

An MAE of 36.61 means the predictions miss the actual target by about 37 units on average. The RMSE of 44.77 is a comparable figure that weights large errors more heavily.

RMSE is never smaller than MAE, so the gap between them is not by itself a sign of extreme errors. For normally distributed errors the ratio RMSE/MAE is about 1.25, and here it is 44.77 / 36.61 ≈ 1.22, so these 20 residuals show no evidence of unusually heavy tails.

Whether these error levels are acceptable depends on the scale of the target and on the application. The target values shown earlier run from a few dozen to more than 300, so an error of around 40 units is substantial, but the numbers are most meaningful when compared against a baseline such as always predicting the training mean.

The test set has only 20 rows, so both metrics are noisy estimates. Options for improving or better assessing the model include:

Inspecting residuals to identify patterns in the errors

Using cross-validation instead of a single 20-row test set

Trying regularized linear models (for example Ridge or Lasso) or other algorithms

Obtaining more quality training data

Removing or transforming strongly correlated variables such as s1 and s2

Tuning hyperparameters where the chosen model has them (plain `LinearRegression` has essentially none)

The goal would be to reduce both RMSE and MAE on held-out data.
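One way to put these error values in context is to compare them with a naive baseline and to report R². The sketch below refits the model on the same split and uses "always predict the training mean" as the baseline. That baseline is an illustrative choice, not something the notebook computes.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = load_diabetes(return_X_y=True)
X_train, y_train, X_test, y_test = X[:-20], y[:-20], X[-20:], y[-20:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Baseline: ignore the features and always predict the training mean
baseline_pred = np.full_like(y_test, y_train.mean())

rmse_model = np.sqrt(mean_squared_error(y_test, pred))
rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline_pred))
r2 = r2_score(y_test, pred)

print("model RMSE:   ", rmse_model)
print("baseline RMSE:", rmse_baseline)
print("R^2 on test:  ", r2)
```

If the model RMSE is clearly below the baseline RMSE, the features carry useful signal even when the absolute error still looks large.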



