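## Practice Time
There are many ways to manipulate data, and over the rest of the marathon we will introduce the most commonly used operations. As a participant, first ask yourself: when you see a dataset for the first time, what do you usually want to know?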

#### Ex: How many rows and columns the data has, which fields it contains and how many, how to extract part of the data, and so on

Once you are curious about the data, how do you use code to get the answers?

#### You can refer to this [basic tutorial](https://bookdata.readthedocs.io/en/latest/base/01_pandas.html#DataFrame-%E5%85%A5%E9%97%A8) or search on Google yourself

In [3]:
import os
import numpy as np
import pandas as pd

In [1]:
# Set the data path
dir_data = './Part01/'

In [4]:
f_app = os.path.join(dir_data, 'application_train.csv')
print('Path of read in data: %s' % (f_app))
app_train = pd.read_csv(f_app)

Path of read in data: ./Part01/application_train.csv
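
Before loading a large file in full, it can help to parse only the first few rows. The sketch below assumes a tiny in-memory CSV (`csv_text` is a made-up stand-in, not the real `application_train.csv`) and uses the `nrows` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the values are invented.
csv_text = "SK_ID_CURR,TARGET,CODE_GENDER\n1,0,M\n2,1,F\n3,0,F\n"

# nrows limits how many data rows are parsed, which gives a cheap preview.
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)
print(preview.shape)
print(preview.columns.tolist())
```

With a real file, the same call would take a path such as `f_app` in place of the `io.StringIO` object.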


### If you have no ideas yet, start by answering the questions from the example above
#### Number of rows and columns

In [10]:
print("Number of rows:", app_train.shape[0])
print("Number of columns:", app_train.shape[1])

Number of rows: 307511
Number of columns: 122
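
`shape` is a `(rows, columns)` tuple, so it can also be unpacked in one step. A minimal sketch on a made-up three-row DataFrame (`toy` is a hypothetical name):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the attributes.
toy = pd.DataFrame({"id": [1, 2, 3], "gender": ["M", "F", "F"]})

# shape is a (rows, columns) tuple and can be unpacked directly.
n_rows, n_cols = toy.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# len() of a DataFrame is its number of rows; size is rows * columns.
print(len(toy), toy.size)
```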


#### List all columns

In [11]:
app_train.columns

Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)
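
The `Index` display above is abbreviated with `...` when there are many columns. One way to see every name is to convert it to a plain list; `select_dtypes` can then narrow the list by type. A small sketch on invented data:

```python
import pandas as pd

# Hypothetical toy data with mixed column types.
toy = pd.DataFrame({"id": [1, 2], "income": [10.5, 20.0], "gender": ["M", "F"]})

# A plain Python list is never abbreviated when printed.
all_columns = toy.columns.tolist()
print(all_columns)

# Keep only the numeric columns (integers and floats).
numeric_columns = toy.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```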

#### Extract part of the data

In [14]:
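# app_train.head() would show the first 5 rows instead.
# Take every other row among the first 200 rows, and the first 10 columns.
app_train.iloc[0:200:2, 0:10]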

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5
6,100009,0,Cash loans,F,Y,Y,1,171000.0,1560726.0,41301.0
8,100011,0,Cash loans,F,N,Y,0,112500.0,1019610.0,33826.5
10,100014,0,Cash loans,F,N,Y,1,112500.0,652500.0,21177.0
12,100016,0,Cash loans,F,N,Y,0,67500.0,80865.0,5881.5
14,100018,0,Cash loans,F,N,Y,0,189000.0,773680.5,32778.0
16,100020,0,Cash loans,M,N,N,0,108000.0,509602.5,26149.5
18,100022,0,Revolving loans,F,N,Y,0,112500.0,157500.0,7875.0
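
`iloc` selects by integer position with the usual Python slice rules (end excluded), while `loc` selects by label and includes the end label. The sketch below contrasts the two on a made-up DataFrame whose index labels happen to equal the positions:

```python
import pandas as pd

# Hypothetical toy data with the default 0..9 integer index.
toy = pd.DataFrame({"a": range(10), "b": range(10, 20), "c": list("abcdefghij")})

# Position-based: rows 0, 2, 4 (stop 6 excluded) and the first two columns.
every_other = toy.iloc[0:6:2, 0:2]
print(every_other)

# Label-based: rows labelled 0 through 4 (end label included), columns by name.
by_label = toy.loc[0:4, ["a", "c"]]
print(by_label)
```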


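#### There are countless other data operations. Which ones matter depends on the situations you meet in practice and the questions you want to ask; more examples will come up throughout the marathon

Find the columns that contain NaN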

In [17]:
app_train.isnull().any()

SK_ID_CURR                    False
TARGET                        False
NAME_CONTRACT_TYPE            False
CODE_GENDER                   False
FLAG_OWN_CAR                  False
FLAG_OWN_REALTY               False
CNT_CHILDREN                  False
AMT_INCOME_TOTAL              False
AMT_CREDIT                    False
AMT_ANNUITY                    True
AMT_GOODS_PRICE                True
NAME_TYPE_SUITE                True
NAME_INCOME_TYPE              False
NAME_EDUCATION_TYPE           False
NAME_FAMILY_STATUS            False
NAME_HOUSING_TYPE             False
REGION_POPULATION_RELATIVE    False
DAYS_BIRTH                    False
DAYS_EMPLOYED                 False
DAYS_REGISTRATION             False
DAYS_ID_PUBLISH               False
OWN_CAR_AGE                    True
FLAG_MOBIL                    False
FLAG_EMP_PHONE                False
FLAG_WORK_PHONE               False
FLAG_CONT_MOBILE              False
FLAG_PHONE                    False
FLAG_EMAIL                  
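
With 122 columns, a long True/False listing is hard to scan. One option is to use the boolean Series to filter itself, leaving only the names of columns that have at least one NaN. A sketch on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "x" and "z" contain missing values, "y" does not.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0],
    "y": ["a", "b", "c"],
    "z": [np.nan, np.nan, 1.0],
})

# One boolean per column: True if the column has any NaN.
has_nan = toy.isnull().any()

# Filter the Series by its own values and keep the remaining labels.
cols_with_nan = has_nan[has_nan].index.tolist()
print(cols_with_nan)
```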

Count the NaN values in each column

In [18]:
app_train.isnull().sum()

SK_ID_CURR                         0
TARGET                             0
NAME_CONTRACT_TYPE                 0
CODE_GENDER                        0
FLAG_OWN_CAR                       0
FLAG_OWN_REALTY                    0
CNT_CHILDREN                       0
AMT_INCOME_TOTAL                   0
AMT_CREDIT                         0
AMT_ANNUITY                       12
AMT_GOODS_PRICE                  278
NAME_TYPE_SUITE                 1292
NAME_INCOME_TYPE                   0
NAME_EDUCATION_TYPE                0
NAME_FAMILY_STATUS                 0
NAME_HOUSING_TYPE                  0
REGION_POPULATION_RELATIVE         0
DAYS_BIRTH                         0
DAYS_EMPLOYED                      0
DAYS_REGISTRATION                  0
DAYS_ID_PUBLISH                    0
OWN_CAR_AGE                   202929
FLAG_MOBIL                         0
FLAG_EMP_PHONE                     0
FLAG_WORK_PHONE                    0
FLAG_CONT_MOBILE                   0
FLAG_PHONE                         0
F
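
Raw counts are easier to compare when sorted and shown next to the share of rows they represent. The sketch below is one possible summary, not part of the original exercise; `summary` and the toy data are made up. It relies on the fact that the mean of a boolean column is the fraction of True values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "z" is mostly missing, "x" has one NaN, "y" has none.
toy = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": ["a", "b", "c", "d"],
    "z": [np.nan, np.nan, np.nan, 1.0],
})

is_missing = toy.isnull()
summary = pd.DataFrame({
    "nan_count": is_missing.sum(),   # NaN values per column
    "nan_ratio": is_missing.mean(),  # fraction of rows that are NaN
})

# Keep only columns with missing values, worst first.
summary = summary[summary["nan_count"] > 0].sort_values("nan_count", ascending=False)
print(summary)
```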

Count all NaN values in the whole DataFrame

In [19]:
app_train.isnull().sum().sum()

9152465
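
A total is more informative next to the number of cells. Here the DataFrame has 307511 × 122 = 37516342 cells, so 9152465 missing values is roughly 24% of them. The same ratio can be computed with `DataFrame.size`, sketched on invented data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 of the 6 cells are missing.
toy = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, np.nan], "z": ["a", "b"]})

# First sum gives per-column counts, second sum adds them up.
total_nan = int(toy.isnull().sum().sum())

# size is rows * columns, i.e. the total number of cells.
nan_fraction = total_nan / toy.size
print(total_nan, nan_fraction)
```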

Select the rows that satisfy a logical condition

In [25]:
app_train[np.logical_and(app_train['CODE_GENDER'] == 'M', app_train['NAME_CONTRACT_TYPE'] == 'Cash loans')]

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
5,100008,0,Cash loans,M,N,Y,0,99000.0,490495.5,27517.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,1.0
7,100010,0,Cash loans,M,Y,Y,0,360000.0,1530000.0,42075.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
13,100017,0,Cash loans,M,Y,N,1,225000.0,918468.0,28966.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
15,100019,0,Cash loans,M,Y,Y,0,157500.0,299772.0,20160.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
16,100020,0,Cash loans,M,N,N,0,108000.0,509602.5,26149.5,...,0,0,0,0,0.0,0.0,0.0,0.0,1.0,0.0
24,100029,0,Cash loans,M,Y,N,2,135000.0,247500.0,12703.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
27,100032,0,Cash loans,M,N,Y,1,112500.0,327024.0,23827.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
28,100033,0,Cash loans,M,Y,Y,0,270000.0,790830.0,57676.5,...,0,0,0,0,0.0,0.0,0.0,1.0,0.0,1.0
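
The same filter can be written in a few equivalent ways. The sketch below compares `np.logical_and`, the `&` operator (each comparison needs its own parentheses because `&` binds tighter than `==`), and `DataFrame.query`, on a made-up four-row table:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data reusing two column names from the dataset.
toy = pd.DataFrame({
    "CODE_GENDER": ["M", "F", "M", "M"],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Cash loans", "Revolving loans", "Cash loans"],
})

is_male = toy["CODE_GENDER"] == "M"
is_cash = toy["NAME_CONTRACT_TYPE"] == "Cash loans"

# Three ways to express "male AND cash loan".
with_logical_and = toy[np.logical_and(is_male, is_cash)]
with_ampersand = toy[is_male & is_cash]
with_query = toy.query("CODE_GENDER == 'M' and NAME_CONTRACT_TYPE == 'Cash loans'")

print(with_ampersand)
```

All three return the same rows; which one reads best is largely a matter of taste and of how many conditions are combined.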
