# 範例 : (Kaggle)房價預測精簡版

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

以下是房價預測的精簡版範例  
使用最小量的特徵工程以及線性回歸模型做預測, 最後輸出可以在Kaggle提交的預測檔

# [教學目標]

以下程式碼雖然與 Day16 類似, 但是主要重點在於特徵工程的使用, 後續的課程當中會教導同學如何對這塊作調整

# [範例重點]

精簡後的特徵工程 - 包含補缺失值(fillna). 標籤編碼(LabelEncoder).  
最小最大化(MinMaxScaler) 如何使用在同一個程式區塊中 (In[3])

In [1]:
# 載入基本套件
import pandas as pd
import numpy as np

# 載入標籤編碼與最小最大化, 以便做最小的特徵工程
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# 讀取訓練與測試資料
data_path = '../data/'
df_train = pd.read_csv(data_path + 'house_train.csv.gz')
df_test = pd.read_csv(data_path + 'house_test.csv.gz')
print(df_train.head())
print(df_test.head())
print(df_train.shape)

   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities  ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold  \
0         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
1         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      5   
2         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      9   
3         Lvl    AllPub  ...        0    NaN   NaN         NaN       0      2   
4         Lvl    AllPub  ...        0    NaN   NaN         NaN       0     12   

  YrSold  SaleType  SaleCondition  SalePrice  
0   2008        WD   

# [知識點]

## 1.数据平滑处理 -- log1p( ) 和 exmp1( )

1.  数据预处理时首先可以对偏度比较大的数据用og1p函数进行转化，使其更加服从高斯分布，此步处理可能会使我们后续的分类结果得到一个好的结果。

2. 平滑问题很容易处理掉，导致模型的结果达不到一定的标准，log1p( )能够避免复值得问题 — 复值指一个自变量对应多个因变量

log1p( ) 的使用就像是一个数据压缩到了一个区间，与数据的标准类似。其逆运算就是expm1的函数

由于使用的log1p（）对数据进行了压缩，最后需要将预测出的平滑数据进行一个还原，而还原过程就是log1p的逆运算expm1.

log1p = log（x+1）

当x较大时直接计算，当x较小时用泰勒展开式计算
--------------------- 
作者：Kun Li 
来源：CSDN 
原文：https://blog.csdn.net/u012193416/article/details/83211016 
版权声明：本文为博主原创文章，转载请附上博文链接！

## 2.Pandas 合并 concat

pandas处理多组数据的时候往往会要用到数据的合并处理,  
使用 concat是一种基本的合并方式.  
而且concat中有很多参数可以调整,合并成你想要的数据形式.

## 3.Pandas drop處理丢失数据

用法：DataFrame.drop(labels=None,axis=0, index=None, columns=None, inplace=False)

In [2]:
 # 訓練資料需要 train_X, train_Y / 預測輸出需要 ids(識別每個預測值), test_X
# 在此先抽離出 train_Y 與 ids, 而先將 train_X, test_X 該有的資料合併成df, 先做特徵工程
train_Y = np.log1p(df_train['SalePrice'])

print( 'Before log is in below:\n%s\n' % df_train['SalePrice'].head())
print( 'After log is in below:\n%s' % train_Y.head())

Before log is in below:
0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

After log is in below:
0    12.247699
1    12.109016
2    12.317171
3    11.849405
4    12.429220
Name: SalePrice, dtype: float64


In [3]:
ids = df_test['Id']

# df_train = df_train.drop(column = ['ID', "SalePrice"]) 
df_train = df_train.drop(['Id', 'SalePrice'] , axis=1)
df_test = df_test.drop(['Id'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal


In [4]:
# 特徵工程-簡化版 : 全部空值先補-1, 所有類別欄位先做 LabelEncoder, 然後再與數字欄位做 MinMaxScaler
# 這邊使用 LabelEncoder 只是先將類別欄位用統一方式轉成數值以便輸入模型, 當然部分欄位做 One-Hot可能會更好, 只是先使用最簡單版本作為範例
LEncoder = LabelEncoder()
# 除上述之外, 還要把標籤編碼與數值欄位一起做最大最小化, 這麼做雖然有些暴力, 卻可以最簡單的平衡特徵間影響力
MMEncoder = MinMaxScaler()
for c in df.columns:
    if df[c].dtype == 'object': # 如果是文字型 / 類別型欄位, 就先補缺 'None' 後, 再做標籤編碼
        df[c] = df[c].fillna('None')
        df[c] = LEncoder.fit_transform(df[c]) 
    else: # 其他狀況(本例其他都是數值), 就補缺 -1
        df[c] = df[c].fillna(-1)
    # 最後, 將標籤編碼與數值欄位一起最大最小化, 因為需要是一維陣列, 所以這邊切出來後用 reshape 降維
    df[c] = MMEncoder.fit_transform(df[c].values.reshape(-1, 1))
df.head()



Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,0.235294,0.8,0.210191,0.03342,1.0,0.5,1.0,1.0,0.0,1.0,...,0.0,0.0,1.0,1.0,0.25,0.0,0.090909,0.5,1.0,0.8
1,0.0,0.8,0.257962,0.038795,1.0,0.5,1.0,1.0,0.0,0.5,...,0.0,0.0,1.0,1.0,0.25,0.0,0.363636,0.25,1.0,0.8
2,0.235294,0.8,0.219745,0.046507,1.0,0.5,0.0,1.0,0.0,1.0,...,0.0,0.0,1.0,1.0,0.25,0.0,0.727273,0.5,1.0,0.8
3,0.294118,0.8,0.194268,0.038561,1.0,0.5,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.25,0.0,0.090909,0.0,1.0,0.0
4,0.235294,0.8,0.270701,0.060576,1.0,0.5,0.0,1.0,0.0,0.5,...,0.0,0.0,1.0,1.0,0.25,0.0,1.0,0.5,1.0,0.8


In [5]:
print(df_train.shape)

(1460, 79)


In [6]:
 # 將前述轉換完畢的資料 df, 重新切成 train_X, test_X, 因為不論何種特徵工程, 都需要對　train / test 做處理
# 常見並簡便的方式就是 - 先將 train / test 接起來, 做完後再拆開, 不然過程當中往往需要將特徵工程部分寫兩次, 麻煩且容易遺漏
# 在較複雜的特徵工程中尤其如此, 若實務上如果碰到 train 與 test 需要分階段進行, 則通常會另外寫成函數處理
train_num = train_Y.shape[0]
train_X = df[:train_num]
test_X = df[train_num:]

# 使用線性迴歸模型 : train_X, train_Y 訓練模型, 並對 test_X 做出預測結果 pred
from sklearn.linear_model import LinearRegression
estimator = LinearRegression()
estimator.fit(train_X, train_Y)
pred = estimator.predict(test_X)

In [7]:
# 將輸出結果 pred 與前面留下的 ID(ids) 合併, 輸出成檔案
# 可以下載並點開 house_baseline.csv 查看結果, 以便了解預測結果的輸出格式
# 本範例所與作業所輸出的 csv 檔, 均可用於本題的 Kaggle 答案上傳, 可以試著上傳來熟悉 Kaggle 的介面操作
print('Before expm1 : %s' % (pred))
pred = np.expm1(pred)
print('After expm1 : %s' % (pred))
sub = pd.DataFrame({'Id': ids, 'SalePrice': pred})
sub.to_csv('house_baseline.csv', index=False) 

Before expm1 : [11.64385472 11.96797109 12.01066191 ... 11.95018999 11.65971802
 12.39238468]
After expm1 : [113987.71398915 157623.52921185 164498.35281831 ... 154845.56351156
 115810.36952759 240958.62760425]


  """


# 作業 : (Kaggle)鐵達尼生存預測精簡版

https://www.kaggle.com/c/titanic

# [作業目標]

試著不依賴說明, 只依照下列程式碼回答下列問題, 初步理解什麼是"特徵工程"的區塊

# [作業重點]

試著不依賴註解, 以之前所學, 回答下列問題

In [71]:
 # 程式區塊A
import os
import pandas as pd
import numpy as np

data_path = '../data/'
df_train = pd.read_csv(data_path + 'titanic_train.csv')
df_test = pd.read_csv(data_path + 'titanic_train.csv')
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [72]:
# 程式區塊 B
train_Y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['PassengerId', 'Survived'] , axis=1)
df_test = df_test.drop(['PassengerId'] , axis=1)
df = pd.concat([df_train,df_test])
df.head()

of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  


Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
0,22.0,,S,7.25,"Braund, Mr. Owen Harris",0,3,male,1,,A/5 21171
1,38.0,C85,C,71.2833,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,1,female,1,,PC 17599
2,26.0,,S,7.925,"Heikkinen, Miss. Laina",0,3,female,0,,STON/O2. 3101282
3,35.0,C123,S,53.1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,1,female,1,,113803
4,35.0,,S,8.05,"Allen, Mr. William Henry",0,3,male,0,,373450


In [73]:
 # 程式區塊C
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
    
LEncoder = LabelEncoder()
MMEncoder = MinMaxScaler()

for c in df.columns :
    df[c] = df[c].fillna(-1)
    if df[c].dtype == 'object':
        df[c] = LEncoder.fit_transform(list(df[c].values))
    df[c] = MMEncoder.fit_transform(df[c].values.reshape(-1, 1))
df.head()



Unnamed: 0,Age,Cabin,Embarked,Fare,Name,Parch,Pclass,Sex,SibSp,Survived,Ticket
0,0.283951,0.0,1.0,0.014151,0.121348,0.0,1.0,1.0,0.125,0.0,0.769118
1,0.481481,0.557823,0.333333,0.139136,0.213483,0.0,0.0,0.0,0.125,0.0,0.876471
2,0.333333,0.0,1.0,0.015469,0.396629,0.0,1.0,0.0,0.0,0.0,0.983824
3,0.444444,0.380952,1.0,0.103644,0.305618,0.0,0.0,0.0,0.125,0.0,0.072059
4,0.444444,0.0,1.0,0.015713,0.016854,0.0,1.0,1.0,0.0,0.0,0.694118


In [74]:
# 程式區塊 D
train_num = train_Y.shape[0]
train_X = df[:train_num]
test_X = df[train_num:]

from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
estimator.fit(train_X, train_Y)
pred = estimator.predict(test_X)
print(pred)

[0 1 1 1 0 0 1 0 1 1 1 1 0 0 1 1 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0
 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0
 1 0 1 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 1 1 0 0 0 0 1
 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1
 0 1 0 0 0 1 0 1 0 1 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0
 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 1
 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0
 0 1 0 1 1 0 0 1 0 1 1 1 0 1 1 1 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0
 0 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 0 0 1 0 0 0 1 1 1 1
 1 0 0 0 1 1 1 1 0 0 1 1 0 1 0 0 0 1 0 1 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1 0 0
 0 0 1 0 0 1 0 0 1 1 1 0 1 0 0 0 1 0 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1
 0 0 1 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 1 0
 0 0 0 1 1 1 0 0 0 0 0 0 



In [65]:
# 程式區塊 E
sub = pd.DataFrame({'PassengerId': ids, 'Survived': pred})
sub.to_csv('titanic_baseline.csv', index=False) 

# 作業1

下列A~E五個程式區塊中，哪一塊是特徵工程?

In [None]:
# 程式區塊 Ｃ
LEncoder = LabelEncoder()
MMencoder = MinMaxScaler()
for c in df.columns:
    df[c] = df[c].fillna(-1)
    if df[c].dtype == 'object'
        df[c] = LEncoder.fit_transform(list(df[c].values))
    df[c] = MMencoder.fit_transform(df[c].values.reshape(-1, 1))
df.head()

# 作業2

對照程式區塊 B 與 C 的結果，請問那些欄位屬於"類別型欄位"? (回答欄位英文名稱即可)

In [51]:
df[['Embarked', 'Name']].head()

Unnamed: 0,Embarked,Name
0,S,"Braund, Mr. Owen Harris"
1,C,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,S,"Heikkinen, Miss. Laina"
3,S,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,S,"Allen, Mr. William Henry"


# 作業3

續上題，請問哪個欄位是"目標值"?

In [None]:
df_train['Survived']