# Diabetes Data and Linear Regression

#### We will train a Linear Regression model that predicts diabetes progression, using data related to disease progression.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(2021)

# 1. Data

## 1.1 Data Load

#### The data can be loaded with the load_diabetes function from sklearn.datasets.

In [6]:
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()

#### The variable names used in the diabetes data are stored under the feature_names key.
#### The variable names and their descriptions are as follows.
* age : age
* sex : sex
* bmi : body mass index
* bp : average blood pressure
* s1, s2, s3, s4, s5, s6 : six blood serum measurements

In [8]:
diabetes['feature_names']

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

In [13]:
data, target = diabetes['data'], diabetes['target']

In [14]:
data[0]

array([ 0.03807591,  0.05068012,  0.06169621,  0.02187235, -0.0442235 ,
       -0.03482076, -0.04340085, -0.00259226,  0.01990842, -0.01764613])

In [15]:
target[0]

151.0
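
As a quick sanity check, the shapes of the two arrays can be inspected before going further. The sketch below uses the `return_X_y=True` option of `load_diabetes`, which returns the data and target directly; the names `X` and `y` are chosen here only for illustration.

```python
from sklearn.datasets import load_diabetes

# return_X_y=True returns (data, target) directly instead of the dict-like object
X, y = load_diabetes(return_X_y=True)

print(X.shape)  # one row per sample, one column per feature
print(y.shape)  # one target value per sample
```

The first dimension should match the sample count reported by `df.describe()` in the next section.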

## 1.2 Data EDA

In [17]:
df = pd.DataFrame(data, columns=diabetes['feature_names'])

In [18]:
df.describe()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 | 442.0 |
| mean | -3.639623e-16 | 1.309912e-16 | -8.013951e-16 | 1.289818e-16 | -9.042540e-17 | 1.301121e-16 | -4.563971e-16 | 3.863174e-16 | -3.848103e-16 | -3.398488e-16 |
| std | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 | 0.04761905 |
| min | -0.1072256 | -0.04464164 | -0.0902753 | -0.1123996 | -0.1267807 | -0.1156131 | -0.1023071 | -0.0763945 | -0.1260974 | -0.1377672 |
| 25% | -0.03729927 | -0.04464164 | -0.03422907 | -0.03665645 | -0.03424784 | -0.0303584 | -0.03511716 | -0.03949338 | -0.03324879 | -0.03317903 |
| 50% | 0.00538306 | -0.04464164 | -0.007283766 | -0.005670611 | -0.004320866 | -0.003819065 | -0.006584468 | -0.002592262 | -0.001947634 | -0.001077698 |
| 75% | 0.03807591 | 0.05068012 | 0.03124802 | 0.03564384 | 0.02835801 | 0.02984439 | 0.0293115 | 0.03430886 | 0.03243323 | 0.02791705 |
| max | 0.1107267 | 0.05068012 | 0.1705552 | 0.1320442 | 0.1539137 | 0.198788 | 0.1811791 | 0.1852344 | 0.133599 | 0.1356118 |
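
Every column has a mean of essentially zero and the same std of 0.04761905. This is consistent with the scikit-learn dataset description, which says each feature was mean-centered and scaled so that the sum of squares of each column is 1. With 442 samples, the sample standard deviation (ddof=1, as pandas computes it) is then 1/sqrt(441) = 1/21 ≈ 0.047619. The sketch below reproduces that scaling on synthetic data; the array `raw` is a hypothetical stand-in, not the original unscaled measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 442

# Hypothetical raw features on an arbitrary scale (not the real dataset)
raw = rng.normal(loc=50.0, scale=10.0, size=(n_samples, 3))

# Center each column, then scale it so its sum of squares equals 1
centered = raw - raw.mean(axis=0)
scaled = centered / np.sqrt((centered ** 2).sum(axis=0))

sum_sq = (scaled ** 2).sum(axis=0)
sample_std = scaled.std(axis=0, ddof=1)  # same convention as DataFrame.describe()

print(sum_sq)
print(sample_std)
```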


## 1.3 Data Split

#### We will split the data with the train_test_split function from sklearn.model_selection.

#### `train_test_split(*arrays, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)`

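* `*arrays` : the data to split, given as one or more arrays of the same length.
* test_size : the size of the test split, as a fraction of the data (float) or a number of samples (int).
* train_size : the size of the train split, as a fraction of the data (float) or a number of samples (int).
* random_state : fixes the random seed so that the same split is obtained on later runs.
* shuffle : whether to shuffle the data before splitting.
* stratify : splits the data so that the distribution of the given labels is preserved in each split.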
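
The effect of `random_state` and `stratify` can be seen on a small toy example. This is a minimal sketch with made-up arrays (`X`, `y` and the split variable names are illustrative). Note that `stratify` expects discrete class labels, so it does not apply directly to the continuous diabetes target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features, balanced binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Same random_state -> identical split on every call
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

print(len(X_tr1), len(X_te1))          # sizes of the two splits
print(np.array_equal(X_te1, X_te2))    # reproducibility check
print(np.bincount(y_te1))              # label counts in the test split
```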


In [23]:
from sklearn.model_selection import train_test_split

train_data, test_data, train_target, test_target = train_test_split(data, target, test_size=0.3)

We split the data into train and test sets at a 7:3 ratio.
Let's check that the split came out as intended.

In [24]:
len(data), len(train_data), len(test_data)

(442, 309, 133)

In [25]:
print("train ratio : {:.2f}".format(len(train_data)/len(data)))
print("test ratio : {:.2f}".format(len(test_data)/len(data)))

train ratio : 0.70
test ratio : 0.30
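
The 309/133 split follows from rounding, since 0.3 × 442 = 132.6 is not a whole number. As far as I understand scikit-learn's behavior, the test count is rounded up and the train set receives the remaining samples. A small sketch of that arithmetic, with illustrative variable names:

```python
import math

n_samples = 442
test_size = 0.3

# 0.3 * 442 = 132.6, which is rounded up for the test set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_test, n_train)
print("train ratio : {:.2f}".format(n_train / n_samples))
print("test ratio : {:.2f}".format(n_test / n_samples))
```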


# 2. Multivariate Regression

## 2.1 Training

In [27]:
from sklearn.linear_model import LinearRegression

multi_regressor = LinearRegression()
multi_regressor.fit(train_data, train_target)

LinearRegression()

## 2.2 Inspecting the Regression Equation

In [29]:
multi_regressor.intercept_

147.71524417759434
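
The intercept is the constant term of the fitted equation: a prediction is `intercept_` plus the dot product of the features with `coef_`. The sketch below checks this on synthetic, noise-free data rather than the diabetes data; `true_coef`, the constant 4.0, and the other names are assumptions made up for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data generated from a known linear equation (no noise)
X = rng.normal(size=(50, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_coef

model = LinearRegression().fit(X, y)

# Rebuild the predictions by hand from the fitted equation
manual_pred = model.intercept_ + X @ model.coef_

print(model.intercept_)
print(model.coef_)
print(np.allclose(manual_pred, model.predict(X)))
```

The same reconstruction should apply to `multi_regressor` with `train_data` or `test_data` in place of `X`.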