# Day15 Numerical Features 1/2: Filling N/A and Outliers

The Day04 article introduced several common values that can be used to replace N/As and outliers. This article puts them into practice, using the data from the Kaggle competition [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic) as the demonstration dataset.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data/train.csv')  # read in the file
df.head()  # show the first five rows

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [2]:
# Collect the names of the columns whose dtype is int64 or float64 into num_features
num_features = []
for dtype, feature in zip(df.dtypes, df.columns):
    if dtype == 'float64' or dtype == 'int64':
        num_features.append(feature)
print(f'{len(num_features)} Numeric Features : {num_features}')

7 Numeric Features : ['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
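
The loop above can also be written with pandas' built-in `select_dtypes`. The following is a minimal sketch on a small made-up frame (the `toy` data and the name `num_features_alt` are assumptions for illustration, not part of the original notebook). Note that `include='number'` matches every integer and float width, which is slightly broader than checking for `int64` and `float64` only.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are illustrative only)
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['A', 'B', 'C'],
    'Age': [22.0, None, 26.0],
    'Fare': [7.25, 71.2833, 7.925],
})

# select_dtypes keeps only the columns whose dtype matches the filter;
# 'number' covers all integer and float dtypes
num_features_alt = toy.select_dtypes(include='number').columns.tolist()
print(f'{len(num_features_alt)} Numeric Features : {num_features_alt}')
```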


In [3]:
# Drop the text columns and keep only the numeric ones
df = df[num_features]
df.head()

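|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 |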


In [4]:
# Count the missing values (N/A) in each column
df.isnull().sum().sort_values(ascending=False)

Age            177
Fare             0
Parch            0
SibSp            0
Pclass           0
Survived         0
PassengerId      0
dtype: int64
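
Only `Age` has missing values here: 177 of the 891 rows, or roughly 19.9%. When deciding how to fill a column, the share of missing values is often easier to read than the raw count. A minimal sketch on made-up data (the `toy` frame and `missing_ratio` are illustrative names, not from the original notebook):

```python
import pandas as pd
import numpy as np

# Toy frame with half of the Age values missing (illustrative only)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# isnull() returns a boolean frame; the mean of booleans is the
# fraction of True values, i.e. the share of missing entries per column
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```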

### Fill missing values with the mean

In [5]:
df_mn = df.fillna(df.mean())
df_mn['Age']

0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: Age, Length: 891, dtype: float64
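
Row 888 had no recorded age, so it now holds 29.699118, the mean of the `Age` column. To make the mechanics explicit: `df.mean()` returns one mean per column, skipping N/A, and `fillna` matches those means to the columns by name. A small sketch on made-up numbers (the `toy` frame is an assumption for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame with one missing value in each column (illustrative only)
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Fare': [10.0, 20.0, np.nan],
})

col_means = toy.mean()          # N/A is skipped: Age -> 30.0, Fare -> 15.0
filled = toy.fillna(col_means)  # each column is filled with its own mean
print(filled)
```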

### Fill missing values with the median

In [6]:
df_md = df.fillna(df.median())
df_md['Age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
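
The median is usually the safer fill value when a column contains extreme values, because a single outlier can pull the mean far from the typical value. The sketch below illustrates this on made-up fares and then replaces the outlier with the median as well. The 1.5 × IQR rule used to flag outliers is an assumption chosen for illustration; the article does not prescribe a specific rule.

```python
import pandas as pd
import numpy as np

# Toy fares: one missing value and one extreme value (illustrative only)
fare = pd.Series([7.0, 8.0, 8.0, 9.0, 10.0, np.nan, 500.0])

mean_val = fare.mean()      # pulled upward by the single extreme value
median_val = fare.median()  # 8.5, barely affected by it

# Assumed outlier rule: anything beyond 1.5 * IQR from the quartiles
q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1
is_outlier = (fare < q1 - 1.5 * iqr) | (fare > q3 + 1.5 * iqr)

# Replace outliers with the median, then fill the remaining N/A the same way
cleaned = fare.mask(is_outlier, median_val).fillna(median_val)
print(cleaned.tolist())
```

Comparisons against N/A evaluate to `False`, so the missing entry is not flagged as an outlier and is handled by the final `fillna` instead.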

The code for this article is available on [GitHub](https://github.com/tgnco1218/Data-Cleaning-and-Scraping-30Days).

Please let me know if there are any mistakes in this article. Thanks for reading.

References:

[1] Course materials from the 2nd 100-Day Machine Learning Marathon (第二屆機器學習百日馬拉松)

[2] [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

