
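# Logistic Regression

In this notebook, you will use GPU-accelerated logistic regression to predict infection risk based on features of our population members.

## Objectives

By the time you complete this notebook, you will be able to:

- Use GPU-accelerated logistic regression

## Imports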

In [None]:
import cudf
import cuml

import cupy as cp

## Load Data

In [None]:
gdf = cudf.read_csv('./data/pop_2-05.csv', usecols=['age', 'sex', 'infected'])

In [None]:
gdf.dtypes

age         float64
sex         float64
infected    float64
dtype: object

In [None]:
gdf.shape

(58479894, 3)

In [None]:
gdf.head()

| | age | sex | infected |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 |


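## Logistic Regression

Logistic regression can be used to estimate the probability of an outcome as a function of some (assumed independent) inputs. In our case, we want to estimate infection risk based on the age and sex of population members.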
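
For reference, the model form used in this notebook can be written out as follows. This is a sketch in the notebook's own notation, where $x_1$ is `age`, $x_2$ is `sex`, $m_1$ and $m_2$ are the fitted coefficients, $b$ is the intercept, and $p$ is the estimated probability of infection:

```latex
\operatorname{logit}(p) = \ln\frac{p}{1 - p} = m_1 x_1 + m_2 x_2 + b,
\qquad
p = \frac{e^{\operatorname{logit}(p)}}{e^{\operatorname{logit}(p)} + 1}
```

The second expression is the logistic function. It maps any real-valued logit to a value strictly between 0 and 1, which is what lets the output be read as a probability.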

Here, we create a cuML logistic regression instance, `logreg`:

In [None]:
logreg = cuml.LogisticRegression()

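## Exercise: Regress Infected Status

The `logreg.fit` method takes two arguments: the model's independent variables *X* and the dependent variable *y*. Fit the `logreg` model using the `gdf` columns `age` and `sex` as *X* and the `infected` column as *y*.

#### Solution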

In [None]:
# %load solutions/regress_infected
logreg.fit(gdf[['age', 'sex']], gdf['infected'])


LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=0, solver='qn', handle=<cuml.common.handle.Handle object at 0x7f33f256dd68>)

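## Viewing the Results

After fitting the model, we could use `logreg.predict` to estimate whether someone has more than a 50% chance of being infected. However, because the virus has low prevalence in the population (around 1-2% in this dataset), each individual's probability of infection is well below 50%, so the model should correctly predict that no individual is likely to be infected.

We can also access the model coefficients at `logreg.coef_` and the intercept at `logreg.intercept_`. Both of these values are CUDA device arrays, the same type we saw when generating the `northing` and `easting` columns: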

In [None]:
type(logreg.coef_)

numba.cuda.cudadrv.devicearray.DeviceNDArray

In [None]:
type(logreg.intercept_)

numba.cuda.cudadrv.devicearray.DeviceNDArray

To view these values, we need to call their `copy_to_host` method, which returns a CPU NumPy array that we can print.

In [None]:
logreg.coef_.copy_to_host()

array([[0.01379566],
       [0.00249283]])

In [None]:
#to_host_array()
logreg_coef = logreg.coef_.copy_to_host()
logreg_int = logreg.intercept_.copy_to_host()[0]

print("Coefficients: [age, sex]")
print([logreg_coef[0][0], logreg_coef[1][0]])

print("Intercept:")
print(logreg_int)

Coefficients: [age, sex]
[0.013795661578590414, 0.002492827409631911]
Intercept:
-4.757416365733313


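## Estimating the Probability of Infection

As with any logistic regression, the coefficients let us calculate the logit (log-odds) for each individual. From the logit, we can calculate the estimated risk of infection as a probability between 0 and 1.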

In [None]:
# logit = x1 * m1 + x2 * m2 + b
exp_logit = cp.exp(gdf['age'] * logreg_coef[0][0].item() + 
                   gdf['sex'] * logreg_coef[1][0].item() + 
                   logreg_int.item())

# converting the logit to a percentage risk via the logistic function p = exp(logit) / (exp(logit) + 1)
gdf['risk'] = exp_logit / (exp_logit + 1)

In [None]:
gdf.tail()

| | age | sex | infected | risk |
|---|---|---|---|---|
| 58479889 | 90.0 | 1.0 | 0.0 | 0.028936 |
| 58479890 | 90.0 | 1.0 | 0.0 | 0.028936 |
| 58479891 | 90.0 | 1.0 | 0.0 | 0.028936 |
| 58479892 | 90.0 | 1.0 | 0.0 | 0.028936 |
| 58479893 | 90.0 | 1.0 | 0.0 | 0.028936 |
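
As a sanity check on the `risk` column, the same calculation can be reproduced on the CPU for a single individual. The following is a minimal sketch using only the Python standard library: the constants are the coefficient and intercept values printed earlier in this notebook, and `estimated_risk` is a hypothetical helper name introduced here for illustration.

```python
import math

# Coefficients and intercept as printed earlier in this notebook.
COEF_AGE = 0.013795661578590414
COEF_SEX = 0.002492827409631911
INTERCEPT = -4.757416365733313


def estimated_risk(age, sex):
    """Return the estimated infection probability for one individual.

    Hypothetical helper: mirrors the GPU calculation above on the CPU.
    """
    # logit = x1 * m1 + x2 * m2 + b
    logit = COEF_AGE * age + COEF_SEX * sex + INTERCEPT
    # logistic function: p = exp(logit) / (exp(logit) + 1)
    exp_logit = math.exp(logit)
    return exp_logit / (exp_logit + 1)


# A 90-year-old with sex = 1, matching the last rows of the data.
risk_90_f = estimated_risk(age=90.0, sex=1.0)
print(round(risk_90_f, 6))
```

The result should agree with the `risk` value shown for the age 90, sex 1 rows. It is also far below 0.5, which is consistent with the expectation that `logreg.predict` classifies no individual as infected.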


Looking at the original records alongside their new estimated risks, we can see how the estimated risk varies across individuals.

In [None]:
gdf.take(cp.random.choice(gdf.shape[0], size=5, replace=False))

| | age | sex | infected | risk |
|---|---|---|---|---|
| 23578557 | 63.0 | 0.0 | 0.0 | 0.020069 |
| 55778211 | 75.0 | 1.0 | 0.0 | 0.023655 |
| 21540428 | 57.0 | 0.0 | 0.0 | 0.018504 |
| 17131148 | 46.0 | 0.0 | 0.0 | 0.015941 |
| 27480382 | 78.0 | 0.0 | 0.0 | 0.02457 |


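## Exercise: Show That Infection Prevalence Is Related to Age

The positive coefficient for age suggests that the virus is more prevalent in older people, even when controlling for sex.

For this exercise, show that infection prevalence is related to age by grouping by age and printing the mean `infected` values for the youngest and oldest members of the population:

#### Solution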

In [None]:
# %load solutions/risk_by_age
age_groups = gdf[['age', 'infected']].groupby(['age'])
print(age_groups.mean().head())
print(age_groups.mean().tail())


     infected
age          
0.0  0.000000
1.0  0.000889
2.0  0.001960
3.0  0.002715
4.0  0.003586
      infected
age           
86.0  0.023417
87.0  0.023256
88.0  0.024569
89.0  0.024412
90.0  0.025017


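## Exercise: Show That Infection Prevalence Is Related to Sex

Similarly, the positive coefficient for sex suggests that the virus is more prevalent in people with sex = 1 (females), even when controlling for age.

For this exercise, show that infection prevalence is related to sex by grouping by sex and printing the mean `infected` value for each group: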

#### Solution

In [None]:
# %load solutions/risk_by_sex
sex_groups = gdf[['sex', 'infected']].groupby(['sex'])
sex_groups.mean()


| sex | infected |
|---|---|
| 0.0 | 0.01014 |
| 1.0 | 0.020713 |


## Using Training and Test Data

cuML gives us a simple method for producing paired training and test data:

In [None]:
x_train, x_test, y_train, y_test  = cuml.train_test_split(gdf[['age', 'sex']], gdf['infected'], train_size=0.9)
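
Conceptually, a 90/10 split like the one above amounts to shuffling the row indices and cutting the shuffled list at the 90% mark. The sketch below illustrates that idea on the CPU with the standard library only. It is not cuML's implementation, and `split_indices` and its `seed` parameter are hypothetical names introduced for illustration.

```python
import random


def split_indices(n_rows, train_size=0.9, seed=0):
    """Shuffle row indices and split them into train and test index lists.

    Hypothetical helper illustrating the idea of a random train/test split.
    """
    indices = list(range(n_rows))
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    n_train = int(n_rows * train_size)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = split_indices(n_rows=1000, train_size=0.9)
print(len(train_idx), len(test_idx))
```

Because every row index lands in exactly one of the two lists, the model is never evaluated on a row it was trained on.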

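## Exercise: Fit a Logistic Regression Model Using Training Data

For this exercise, create a new logistic regression model, `logreg`, and fit it with the *X* and *y* training data just created.

#### Solution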

In [None]:
# %load solutions/fit_training
logreg = cuml.LogisticRegression()
logreg.fit(x_train, y_train)


LogisticRegression(penalty='l2', tol=0.0001, C=1.0, fit_intercept=True, max_iter=1000, linesearch_max_iter=50, verbose=0, solver='qn', handle=<cuml.common.handle.Handle object at 0x7f33f256dd50>)

## Validating the Model with Test Data

We can now use the same procedure as above to predict infection risk from the test data:

In [None]:
logreg_coef = logreg.coef_.copy_to_host()
logreg_int = logreg.intercept_.copy_to_host()[0]

exp_logit = cp.exp(x_test['age'] * logreg_coef[0][0].item() + 
                   x_test['sex'] * logreg_coef[1][0].item() + 
                   logreg_int.item())

y_test_pred = exp_logit / (exp_logit + 1)

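As we saw earlier, very few people are actually infected, even in the highest-risk groups. As a simple check of the model, we split the test set into individuals with above-average predicted risk and individuals with below-average predicted risk, then observe that the infection prevalence in each group corresponds closely to the predicted risk.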

In [None]:
test_results = cudf.DataFrame()
test_results['infected'] = y_test
test_results['predicted_risk'] = y_test_pred
test_results['high_risk'] = test_results['predicted_risk'] > test_results['predicted_risk'].mean()

risk_groups = test_results.groupby('high_risk')
risk_groups.mean()

| high_risk | infected | predicted_risk |
|---|---|---|
| False | 0.01145 | 0.011647 |
| True | 0.020426 | 0.020162 |
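
To see why this grouped comparison is a reasonable check, it may help to run the same procedure on synthetic data where the predicted risks are correct by construction. The sketch below uses only the standard library. The risk values are made up for illustration (drawn uniformly from a range of roughly 1-3%), and `group_means` is a hypothetical helper; none of this comes from the notebook's dataset.

```python
import random

rng = random.Random(42)

# Synthetic predicted risks: hypothetical values, not taken from the dataset.
predicted = [rng.uniform(0.01, 0.03) for _ in range(200_000)]
# Simulate outcomes so each person is infected with exactly their predicted risk.
infected = [1.0 if rng.random() < p else 0.0 for p in predicted]

# Threshold used to label a row as high risk, as in the cell above.
mean_risk = sum(predicted) / len(predicted)


def group_means(high_risk):
    """Return (mean observed infection rate, mean predicted risk) for one group."""
    rows = [(y, p) for y, p in zip(infected, predicted)
            if (p > mean_risk) == high_risk]
    n = len(rows)
    return sum(y for y, _ in rows) / n, sum(p for _, p in rows) / n


low_observed, low_predicted = group_means(high_risk=False)
high_observed, high_predicted = group_means(high_risk=True)
print(low_observed, low_predicted)
print(high_observed, high_predicted)
```

When the predicted risks are well calibrated, the observed infection rate in each group sits close to that group's mean predicted risk, which is the same pattern the test results above show.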


<br>
<div align="center"><h2>Please Restart the Kernel</h2></div>

In [None]:
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

## Next

In the next notebook, you will use the GPU-accelerated k-nearest neighbors algorithm to locate the road node nearest to each hospital.