In [1]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 01. The CSV data format

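A CSV (Comma-Separated Values) file is a plain-text file in which the fields of each row are separated by commas.

For example, suppose we have a table of car sales in Taiwan for July through September 2020: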

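Model | Jul | Aug | Sep
:-----|----|-----|------
Mazda3 | 319 | 189 | 488
Corolla Sport | 303 | 338 | 239

The corresponding CSV file is a plain-text `.csv` file with these contents:

    Model,Jul,Aug,Sep
    Mazda3,319,189,488
    Corolla Sport,303,338,239

Note that a CSV file does not actually have to use commas as the delimiter. Another symbol, such as `*`, works just as well, as long as the reader is told which one to expect.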
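As a minimal sketch of reading a file with a non-comma delimiter, the cell below parses the sales table with `*` as the separator. The inline string (with English column labels) stands in for a file on disk, and the names `star_text` and `sales` are made up for this illustration.

```python
import io

import pandas as pd

# The sales table, delimited by "*" instead of ",".
# An in-memory string is used here in place of a real .csv file.
star_text = """Model*Jul*Aug*Sep
Mazda3*319*189*488
Corolla Sport*303*338*239
"""

# `sep` tells pandas which character separates the fields.
sales = pd.read_csv(io.StringIO(star_text), sep="*")
print(sales)
```

With a real file, the first argument would be the file path rather than the `io.StringIO` wrapper.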

### 02. Reading a CSV file with `pandas`

In the data folder there is a dataset named `diabetes_data_upload.csv`. It is [this dataset](https://archive.ics.uci.edu/ml/datasets/Early+stage+diabetes+risk+prediction+dataset) from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml), which is well known for hosting many datasets.

In [2]:
df = pd.read_csv('data/diabetes_data_upload.csv')

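A `pandas` DataFrame (conventionally named `df`) can be thought of as an Excel worksheet. We can inspect its contents with `.head()` for the first rows or `.tail()` for the last rows; here we look at the last five.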

In [4]:
df.tail()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
515,39,Female,Yes,Yes,Yes,No,Yes,No,No,Yes,No,Yes,Yes,No,No,No,Positive
516,48,Female,Yes,Yes,Yes,Yes,Yes,No,No,Yes,Yes,Yes,Yes,No,No,No,Positive
517,58,Female,Yes,Yes,Yes,Yes,Yes,No,Yes,No,No,No,Yes,Yes,No,Yes,Positive
518,32,Female,No,No,No,Yes,No,No,Yes,Yes,No,Yes,No,No,Yes,No,Negative
519,42,Male,No,No,No,No,No,No,No,No,No,No,No,No,No,No,Negative
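For reference, a small sketch of how `.head()` and `.tail()` behave. The frame `demo` is a made-up example, not the diabetes data; both methods return five rows by default and accept an optional row count.

```python
import pandas as pd

# A made-up ten-row frame, used only to illustrate head/tail.
demo = pd.DataFrame({"n": range(10)})

first_three = demo.head(3)   # first 3 rows
last_two = demo.tail(2)      # last 2 rows

print(demo.shape)            # (number of rows, number of columns)
print(first_three)
print(last_two)
```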


The table is laid out neatly without any extra work. The standard next step is to convert `Yes` and `No`, as well as `Male`, `Female`, `Positive`, and `Negative`, into 0 or 1.

In [6]:
egg = {"Female":1, "Male":0, "Yes":1, "No":0, "Positive":1, "Negative":0}

In [7]:
f = lambda x: egg[x]

In [9]:
f('No')

0

In [17]:
df.loc[:,'Gender':]

Unnamed: 0,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,Male,No,Yes,No,Yes,No,No,No,Yes,No,Yes,No,Yes,Yes,Yes,Positive
1,Male,No,No,No,Yes,No,No,Yes,No,No,No,Yes,No,Yes,No,Positive
2,Male,Yes,No,No,Yes,Yes,No,No,Yes,No,Yes,No,Yes,Yes,No,Positive
3,Male,No,No,Yes,Yes,Yes,Yes,No,Yes,No,Yes,No,No,No,No,Positive
4,Male,Yes,Yes,Yes,Yes,Yes,No,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
515,Female,Yes,Yes,Yes,No,Yes,No,No,Yes,No,Yes,Yes,No,No,No,Positive
516,Female,Yes,Yes,Yes,Yes,Yes,No,No,Yes,Yes,Yes,Yes,No,No,No,Positive
517,Female,Yes,Yes,Yes,Yes,Yes,No,Yes,No,No,No,Yes,Yes,No,Yes,Positive
518,Female,No,No,No,Yes,No,No,Yes,Yes,No,Yes,No,No,Yes,No,Negative


In [22]:
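# Apply f to every entry from 'Gender' onward.
# Note: pandas 2.1+ deprecates `applymap` in favor of `DataFrame.map`.
df.loc[:,'Gender':] = df.loc[:,'Gender':].applymap(f)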

In [23]:
df.head()

Unnamed: 0,Age,Gender,Polyuria,Polydipsia,sudden weight loss,weakness,Polyphagia,Genital thrush,visual blurring,Itching,Irritability,delayed healing,partial paresis,muscle stiffness,Alopecia,Obesity,class
0,40,0,0,1,0,1,0,0,0,1,0,1,0,1,1,1,1
1,58,0,0,0,0,1,0,0,1,0,0,0,1,0,1,0,1
2,41,0,1,0,0,1,1,0,0,1,0,1,0,1,1,0,1
3,45,0,0,0,1,1,1,1,0,1,0,1,0,0,0,0,1
4,60,0,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1
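The same 0/1 encoding can also be written column by column with `Series.map`, which avoids the deprecated `applymap` and makes it easy to check for labels the dictionary does not cover. This is a sketch on a made-up three-row frame (`toy`, `mapping`, and `label_cols` are illustrative names), not the notebook's own method.

```python
import pandas as pd

# Toy frame with the same kind of text labels as the diabetes data
# (the values are illustrative only).
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": ["Male", "Female", "Male"],
    "Polyuria": ["No", "Yes", "Yes"],
    "class": ["Positive", "Negative", "Positive"],
})

mapping = {"Female": 1, "Male": 0, "Yes": 1, "No": 0,
           "Positive": 1, "Negative": 0}

# Columns from 'Gender' to the end hold the text labels.
label_cols = list(toy.loc[:, "Gender":].columns)

# Series.map looks each value up in the dict; unknown labels become NaN.
encoded = toy[label_cols].apply(lambda col: col.map(mapping))

# Fail loudly if some label was not covered by the mapping.
if encoded.isna().any().any():
    raise ValueError("found a label that is missing from `mapping`")

# Store the result as integers so later numeric code sees an int dtype.
toy[label_cols] = encoded.astype(int)
print(toy)
```

Casting to `int` at the end matters because columns that started out as text may otherwise keep a generic object dtype even after their values become numbers.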


In [30]:
x = df.loc[:, :'Obesity'].values

In [34]:
y = df['class'].values

In [35]:
x[87]

array([28,  1,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  1,  1,  0,  0])

In [36]:
y[87]

1
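The last few cells split the table into a feature matrix `x` and a label vector `y`. The sketch below repeats that split on a made-up frame to highlight one detail: label-based slices in `.loc` include both endpoints, so `:'Obesity'` keeps the `Obesity` column and leaves out only `class`. The names `toy`, `X`, and `y` are illustrative.

```python
import pandas as pd

# Made-up, already-encoded frame with a reduced set of columns.
toy = pd.DataFrame({
    "Age": [40, 58, 41],
    "Gender": [0, 1, 0],
    "Obesity": [1, 0, 0],
    "class": [1, 0, 1],
})

# .loc slices by label and INCLUDES the end label,
# so 'Obesity' is part of X while 'class' is not.
X = toy.loc[:, :"Obesity"].to_numpy()
y = toy["class"].to_numpy()

print(X.shape, y.shape)   # (3, 3) (3,)
print(X[1], y[1])         # features and label of the second row
```

`to_numpy()` is the currently recommended spelling of `.values`; both return the underlying NumPy array here.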