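### Reading the data
First, we use pandas to read the main data file, application_train.csv (remember to download it from https://www.kaggle.com/c/home-credit-default-risk/data).

Note: the path `data/application_train.csv` means that `application_train.csv` and this `.ipynb` file sit in the following folder structure: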
```
data
    /application_train.csv
Day_002_first_EDA.ipynb
```

In [1]:
import os
import numpy as np
import pandas as pd

In [2]:
# Set the data directory
dir_data = './data/'

#### Read the data with pd.read_csv

In [3]:
f_app = os.path.join(dir_data, 'application_train.csv')
print('Path of read in data: %s' % (f_app))
app_train = pd.read_csv(f_app)

Path of read in data: ./data/application_train.csv
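
Before loading a large file in full, it can help to read only a small part of it. The sketch below is an illustration rather than part of the original notebook: it writes a tiny CSV of made-up values to a temporary folder (the file name `toy_train.csv` is hypothetical) and reads a slice of it back with the `nrows` and `usecols` parameters of `pd.read_csv`.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for application_train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3, 4],
    'TARGET': [1, 0, 0, 1],
    'CODE_GENDER': ['M', 'F', 'F', 'M'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    f_toy = os.path.join(tmp_dir, 'toy_train.csv')
    toy.to_csv(f_toy, index=False)

    # Read only the first 2 rows and 2 of the 3 columns.
    preview = pd.read_csv(f_toy, nrows=2, usecols=['SK_ID_CURR', 'TARGET'])

print(preview.shape)  # (2, 2)
```

The same two parameters could be passed when reading `f_app` to get a quick preview without loading every row.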


#### Note: in Jupyter Notebook, you can use `?` to look up a function's signature and docstring

In [4]:
# for example
?pd.read_csv
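
The `?` syntax is specific to IPython and Jupyter. As a hedged alternative for plain Python sessions, the standard-library `inspect` module and the built-in `help()` expose similar information. A minimal sketch:

```python
import inspect

import pandas as pd

# Signature of pd.read_csv: parameter names in declaration order.
sig = inspect.signature(pd.read_csv)
first_params = list(sig.parameters)[:3]
print(first_params)

# First line of the cleaned docstring, a one-sentence summary.
summary = inspect.getdoc(pd.read_csv).splitlines()[0]
print(summary)

# help(pd.read_csv) prints the full documentation in any Python session.
```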

#### Next, we can use the .head() method to look at the first 5 rows of the data

In [5]:
app_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0


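## Practice time
There are many ways to work with data, and the rest of the marathon will introduce the most commonly used operations. Before that, participants may want to think for themselves: when we see a dataset for the first time, what do we usually want to know?

#### Ex: how to find the number of rows and columns, which columns exist and how many there are, how to extract part of the data, and so on

Once we are curious about the data, how do we satisfy that curiosity with code?

#### See this [basic tutorial](https://bookdata.readthedocs.io/en/latest/base/01_pandas.html#DataFrame-%E5%85%A5%E9%97%A8) or search on Google yourself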
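
As one possible starting point for the questions above, the sketch below answers them on a three-row toy frame. The values are copied from four columns of the `head()` output earlier in this notebook; the variable names are illustrative only.

```python
import pandas as pd

# Three rows and four columns taken from the head() output above.
df = pd.DataFrame({
    'SK_ID_CURR': [100002, 100003, 100004],
    'TARGET': [1, 0, 0],
    'CODE_GENDER': ['M', 'F', 'M'],
    'AMT_INCOME_TOTAL': [202500.0, 270000.0, 67500.0],
})

n_rows, n_cols = df.shape              # number of rows and columns
column_names = df.columns.tolist()     # which columns exist
subset = df[['SK_ID_CURR', 'TARGET']]  # extract part of the data by column name

print(n_rows, n_cols)  # 3 4
print(column_names)
print(subset.shape)    # (3, 2)
```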

In [21]:
app_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
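
The `dtypes:` line printed by `info()` is only text. When the per-type counts are needed as data, one option is to count the entries of `df.dtypes`. The sketch below uses a small frame of values from the `head()` output, so its counts differ from the full dataset.

```python
import pandas as pd

# Small frame with mixed column types (values from the head() output above).
df = pd.DataFrame({
    'SK_ID_CURR': [100002, 100003, 100004],
    'TARGET': [1, 0, 0],
    'CODE_GENDER': ['M', 'F', 'M'],
    'AMT_INCOME_TOTAL': [202500.0, 270000.0, 67500.0],
})

# How many columns there are of each dtype.
dtype_counts = df.dtypes.value_counts()
print(dtype_counts)

# Split the column names into numeric and non-numeric groups.
numeric_cols = df.select_dtypes(include='number').columns.tolist()
other_cols = [c for c in df.columns if c not in numeric_cols]
print(numeric_cols)
print(other_cols)
```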


In [18]:
app_train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)

In [20]:
app_train.iloc[100:105]

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
100,100119,0,Revolving loans,F,N,Y,0,54000.0,180000.0,9000.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
101,100120,0,Cash loans,F,N,N,0,135000.0,781920.0,28215.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
102,100121,0,Cash loans,M,N,Y,0,126000.0,787131.0,26145.0,...,0,0,0,0,1.0,1.0,0.0,0.0,0.0,1.0
103,100122,0,Cash loans,F,N,N,1,76500.0,808650.0,26217.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
104,100123,0,Cash loans,F,N,N,0,103500.0,675000.0,19737.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,2.0
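
Row slicing behaves differently for `iloc` and `loc`, which is easy to miss when the index is the default 0, 1, 2, ... range. A small sketch on made-up data:

```python
import pandas as pd

# Toy frame with a default RangeIndex (0..4); the values are made up.
df = pd.DataFrame({'x': [10, 11, 12, 13, 14]})

by_position = df.iloc[1:3]  # positions 1 and 2; the end position is excluded
by_label = df.loc[1:3]      # labels 1, 2 and 3; the end label is included

print(len(by_position), len(by_label))  # 2 3
```

This is why `iloc[100:105]` above returns five rows (100 to 104) rather than six.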


In [27]:
app_train.dtypes[0:5]

SK_ID_CURR             int64
TARGET                 int64
NAME_CONTRACT_TYPE    object
CODE_GENDER           object
FLAG_OWN_CAR          object
dtype: object

In [29]:
app_train.empty

False

In [32]:
app_train.ndim

2

In [36]:
app_train.size == app_train.shape[0]*app_train.shape[1]

True

In [51]:
# Compare by value with ==; `is` tests object identity and only happens to hold here
app_train.values[1][2] == app_train.iloc[1, 2]

True

In [55]:
# Same check through column selection, again comparing by value
app_train.values[3][4] == app_train[app_train.columns[4]][3]

True
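
For reading a single cell, pandas also provides the scalar accessors `.iat` (by position) and `.at` (by label), which avoid building the whole `values` array first. A minimal sketch on made-up data:

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
df = pd.DataFrame({
    'SK_ID_CURR': [1, 2, 3],
    'TARGET': [1, 0, 0],
    'CODE_GENDER': ['M', 'F', 'M'],
})

by_position = df.iat[1, 2]           # row position 1, column position 2
by_label = df.at[1, 'CODE_GENDER']   # row label 1, column name

print(by_position, by_label)  # F F
```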

In [62]:
# Keep CODE_GENDER where TARGET > 0; all other rows become NaN
app_train['CODE_GENDER'].where(app_train['TARGET']>0)

0           M
1         NaN
2         NaN
3         NaN
4         NaN
5         NaN
6         NaN
7         NaN
8         NaN
9         NaN
10        NaN
11        NaN
12        NaN
13        NaN
14        NaN
15        NaN
16        NaN
17        NaN
18        NaN
19        NaN
20        NaN
21        NaN
22        NaN
23        NaN
24        NaN
25        NaN
26          F
27        NaN
28        NaN
29        NaN
         ... 
307481      M
307482    NaN
307483    NaN
307484    NaN
307485    NaN
307486    NaN
307487    NaN
307488    NaN
307489      F
307490    NaN
307491    NaN
307492    NaN
307493    NaN
307494    NaN
307495    NaN
307496    NaN
307497    NaN
307498    NaN
307499    NaN
307500    NaN
307501    NaN
307502    NaN
307503    NaN
307504    NaN
307505    NaN
307506    NaN
307507    NaN
307508    NaN
307509      F
307510    NaN
Name: CODE_GENDER, Length: 307511, dtype: object
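
`where` keeps the original length and fills the non-matching rows with NaN. If only the matching rows are wanted, boolean indexing with `loc` is a common alternative. The sketch below contrasts the two on made-up data:

```python
import pandas as pd

# Toy frame; the values are made up for illustration.
df = pd.DataFrame({
    'TARGET': [1, 0, 0, 1],
    'CODE_GENDER': ['M', 'F', 'F', 'M'],
})

mask = df['TARGET'] > 0

masked = df['CODE_GENDER'].where(mask)   # same length, NaN where mask is False
filtered = df.loc[mask, 'CODE_GENDER']   # only the rows where mask is True

print(len(masked), len(filtered))  # 4 2
```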

In [61]:
# Count rows with TARGET > 0 using the vectorized Series.sum()
int((app_train['TARGET'] > 0).sum())

24825
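
A raw count is easier to interpret as a share of all rows. The sketch below does the arithmetic with the two numbers already printed in this notebook (24825 matching rows out of 307511 entries). The `toy_target` Series is made-up data, included only to show that the mean of a boolean mask gives the same kind of ratio directly.

```python
import pandas as pd

n_positive = 24825   # rows with TARGET > 0, from the output above
n_rows = 307511      # number of entries reported by app_train.info()

rate = n_positive / n_rows
print(f'{rate:.4f}')  # 0.0807

# On a Series, the mean of a boolean mask is the fraction of True values.
toy_target = pd.Series([1, 0, 0, 1, 0])
toy_rate = (toy_target > 0).mean()
print(toy_rate)  # 0.4
```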