# [House Prices](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)

### [Top 10% example](https://jackdry.com/house-prices-advanced-regression-techniques-kaggle)

In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
import sklearn.model_selection # train/test splitting
import sklearn.linear_model # linear models
import sklearn.preprocessing # scaling (minmax_scale, robust_scale)
import sklearn.svm # SVM models
import sklearn.decomposition # PCA

%matplotlib inline

In [2]:
train = pd.read_csv('train.csv', index_col='Id') # training data
test = pd.read_csv('test.csv', index_col='Id') # test data
submission = pd.read_csv('sample_submission.csv') # submission template
saleprice = train['SalePrice']
train = train.drop(columns='SalePrice')
data = pd.concat([train, test]) # preprocess train and test together
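Because the training and test rows are stacked into one frame for preprocessing, they have to be separated again before fitting. A minimal sketch on made-up toy frames (`toy_train`, `toy_test` and their values are illustrative only, not taken from the competition files):

```python
import pandas as pd

# Toy stand-ins for train.csv / test.csv (values are made up).
toy_train = pd.DataFrame(
    {"LotArea": [8450, 9600], "SalePrice": [208500, 181500]}, index=[1, 2]
)
toy_test = pd.DataFrame({"LotArea": [11622]}, index=[3])

target = toy_train["SalePrice"]
features = pd.concat([toy_train.drop(columns="SalePrice"), toy_test])

# After shared preprocessing, slice by the training row count to undo the concat.
n_train = len(toy_train)
train_part = features.iloc[:n_train]
test_part = features.iloc[n_train:]
```

Slicing by `len(toy_train)` plays the role of the hard-coded `1460` used later in this notebook.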

In [3]:
cols_dict = {'MSSubClass':'住宅類型', 'MSZoning':'地段分區', 'LotFrontage':'距離街道距離', 'LotArea':'地坪', 'Street':'街道類型', 'Alley':'小巷類型', 'LotShape':'土地形狀', 'LandContour':'土地平坦程度', 'Utilities':'能源使用', 'LotConfig':'配置方式', 'LandSlope':'物業斜度', 'Neighborhood':'地理位置', 'Condition1':'條件1', 'Condition2':'條件2', 'BldgType':'住宅類型', 'HouseStyle':'住宅風格', 'OverallQual':'綜合材料評分', 'OverallCond':'綜合評分', 'YearBuilt':'原始施工日期', 'YearRemodAdd':'改建日期', 'RoofStyle':'屋頂造型', 'RoofMatl':'屋頂材料', 'Exterior1st':'外牆材料1', 'Exterior2nd':'外牆材料2', 'MasVnrType':'裝飾類型', 'MasVnrArea':'貼磁磚面積', 'ExterQual':'評估外部材料質量', 'ExterCond':'外部材料當前情況', 'Foundation':'房屋基礎材料', 'BsmtQual':'地下室高度', 'BsmtCond':'地下室情況評估', 'BsmtExposure':'花園牆壁材料', 'BsmtFinType1':'地下室完成區域的等級1', 'BsmtFinSF1':'地下室完成坪數1', 'BsmtFinType2':'地下室完成區域的等級2', 'BsmtFinSF2':'地下室完成坪數2', 'BsmtUnfSF':'地下室未完成坪數', 'TotalBsmtSF':'地下室總面積', 'Heating':'加熱設備形式', 'HeatingQC':'加熱設備品質', 'CentralAir':'中央空調', 'Electrical':'電氣系統', '1stFlrSF':'一樓坪數', '2ndFlrSF':'二樓坪數', 'LowQualFinSF':'低品質的總坪數', 'GrLivArea':'地上的屋內坪數', 'BsmtFullBath':'地下室完整浴室', 'BsmtHalfBath':'地下室半浴室', 'FullBath':'完整浴室', 'HalfBath':'半浴室', 'BedroomAbvGr':'地上房間', 'KitchenAbvGr':'地上廚房', 'KitchenQual':'廚房品質', 'TotRmsAbvGrd':'地上客房', 'Functional':'居家功能', 'Fireplaces':'壁爐數量', 'FireplaceQu':'壁爐質量', 'GarageType':'車庫位置', 'GarageYrBlt':'車庫建成年份', 'GarageFinish':'車庫內部裝飾', 'GarageCars':'車庫容量', 'GarageArea':'車庫坪數', 'GarageQual':'車庫質量', 'GarageCond':'車庫條件', 'PavedDrive':'車道材質', 'WoodDeckSF':'木地板面積', 'OpenPorchSF':'開放陽台面積', 'EnclosedPorch':'封閉門廊面積', '3SsnPorch':'開放門廊面積', 'ScreenPorch':'屏幕門廊面積', 'PoolArea':'泳池面積', 'PoolQC':'泳池品質', 'Fence':'圍欄質量', 'MiscFeature':'其他功能', 'MiscVal':'雜項功能', 'MoSold':'售出月份', 'YrSold':'售出年份', 'SaleType':'銷售方式', 'SaleCondition':'銷售條件'}

| Column | Chinese label | Description |
| :--- | :--- | :--- |
| MSSubClass    | 住宅類型        | Identifies the type of dwelling involved in the sale                                    |
| MSZoning      | 地段分區        | Identifies the general zoning classification of the sale                                |
| LotFrontage   | 距離街道距離      | Linear feet of street connected to property                                             |
| LotArea       | 地坪          | Lot size in square feet                                                                 |
| Street        | 街道類型        | Type of road access to property                                                         |
| Alley         | 小巷類型        | Type of alley access to property                                                        |
| LotShape      | 土地形狀        | General shape of property                                                               |
| LandContour   | 土地平坦程度      | Flatness of the property                                                                |
| Utilities     | 能源使用        | Type of utilities available                                                             |
| LotConfig     | 配置方式        | Lot configuration                                                                       |
| LandSlope     | 物業斜度        | Slope of property                                                                       |
| Neighborhood  | 地理位置        | Physical locations within Ames city limits                                              |
| Condition1    | 條件1         | Proximity to various conditions                                                         |
| Condition2    | 條件2         | Proximity to various conditions (if more than one is present)                           |
| BldgType      | 住宅類型        | Type of dwelling                                                                        |
| HouseStyle    | 住宅風格        | Style of dwelling                                                                       |
| OverallQual   | 綜合材料評分      | Rates the overall material and finish of the house                                      |
| OverallCond   | 綜合評分        | Rates the overall condition of the house                                                |
| YearBuilt     | 原始施工日期      | Original construction date                                                              |
| YearRemodAdd  | 改建日期        | Remodel date (same as construction date if no remodeling or additions)                  |
| RoofStyle     | 屋頂造型        | Type of roof                                                                            |
| RoofMatl      | 屋頂材料        | Roof material                                                                           |
| Exterior1st   | 外牆材料1       | Exterior covering on house                                                              |
| Exterior2nd   | 外牆材料2       | Exterior covering on house (if more than one material)                                  |
| MasVnrType    | 裝飾類型        | Masonry veneer type                                                                     |
| MasVnrArea    | 貼磁磚面積       | Masonry veneer area in square feet                                                      |
| ExterQual     | 評估外部材料質量    | Evaluates the quality of the material on the exterior                                   |
| ExterCond     | 外部材料當前情況    | Evaluates the present condition of the material on the exterior                         |
| Foundation    | 房屋基礎材料      | Type of foundation                                                                      |
| BsmtQual      | 地下室高度       | Evaluates the height of the basement                                                    |
| BsmtCond      | 地下室情況評估     | Evaluates the general condition of the basement                                         |
| BsmtExposure  | 花園牆壁材料      | Refers to walkout or garden level walls                                                 |
| BsmtFinType1  | 地下室完成區域的等級1 | Rating of basement finished area                                                        |
| BsmtFinSF1    | 地下室完成坪數1    | Type 1 finished square feet                                                             |
| BsmtFinType2  | 地下室完成區域的等級2 | Rating of basement finished area (if multiple types)                                    |
| BsmtFinSF2    | 地下室完成坪數2    | Type 2 finished square feet                                                             |
| BsmtUnfSF     | 地下室未完成坪數    | Unfinished square feet of basement area                                                 |
| TotalBsmtSF   | 地下室總面積      | Total square feet of basement area                                                      |
| Heating       | 加熱設備形式      | Type of heating                                                                         |
| HeatingQC     | 加熱設備品質      | Heating quality and condition                                                           |
| CentralAir    | 中央空調        | Central air conditioning                                                                |
| Electrical    | 電氣系統        | Electrical system                                                                       |
| 1stFlrSF      | 一樓坪數        | First Floor square feet                                                                 |
| 2ndFlrSF      | 二樓坪數        | Second floor square feet                                                                |
| LowQualFinSF  | 低品質的總坪數     | Low quality finished square feet (all floors)                                           |
| GrLivArea     | 地上的屋內坪數     | Above grade (ground) living area square feet                                            |
| BsmtFullBath  | 地下室完整浴室     | Basement full bathrooms                                                                 |
| BsmtHalfBath  | 地下室半浴室      | Basement half bathrooms                                                                 |
| FullBath      | 完整浴室        | Full bathrooms above grade                                                              |
| HalfBath      | 半浴室         | Half baths above grade                                                                  |
| BedroomAbvGr  | 地上房間        | Bedrooms above grade (does NOT include basement bedrooms)                               |
| KitchenAbvGr  | 地上廚房        | Kitchens above grade                                                                    |
| KitchenQual   | 廚房品質        | Kitchen quality                                                                         |
| TotRmsAbvGrd  | 地上客房        | Total rooms above grade (does not include bathrooms)                                    |
| Functional    | 居家功能        | Home functionality (Assume typical unless deductions are warranted)                     |
| Fireplaces    | 壁爐數量        | Number of fireplaces                                                                    |
| FireplaceQu   | 壁爐質量        | Fireplace quality                                                                       |
| GarageType    | 車庫位置        | Garage location                                                                         |
| GarageYrBlt   | 車庫建成年份      | Year garage was built                                                                   |
| GarageFinish  | 車庫內部裝飾      | Interior finish of the garage                                                           |
| GarageCars    | 車庫容量        | Size of garage in car capacity                                                          |
| GarageArea    | 車庫坪數        | Size of garage in square feet                                                           |
| GarageQual    | 車庫質量        | Garage quality                                                                          |
| GarageCond    | 車庫條件        | Garage condition                                                                        |
| PavedDrive    | 車道材質        | Paved driveway                                                                          |
| WoodDeckSF    | 木地板面積       | Wood deck area in square feet                                                           |
| OpenPorchSF   | 開放陽台面積      | Open porch area in square feet                                                          |
| EnclosedPorch | 封閉門廊面積      | Enclosed porch area in square feet                                                      |
| 3SsnPorch     | 開放門廊面積      | Three season porch area in square feet                                                  |
| ScreenPorch   | 屏幕門廊面積      | Screen porch area in square feet                                                        |
| PoolArea      | 泳池面積        | Pool area in square feet                                                                |
| PoolQC        | 泳池品質        | Pool quality                                                                            |
| Fence         | 圍欄質量        | Fence quality                                                                           |
| MiscFeature   | 其他功能        | Miscellaneous feature not covered in other categories                                   |
| MiscVal       | 雜項功能        | $Value of miscellaneous feature                                                         |
| MoSold        | 售出月份        | Month Sold (MM)                                                                         |
| YrSold        | 售出年份        | Year Sold (YYYY)                                                                        |
| SaleType      | 銷售方式        | Type of sale                                                                            |
| SaleCondition | 銷售條件        | Condition of sale                                                                       |


### Data preview

In [4]:
data.describe()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,2919.0,2433.0,2919.0,2919.0,2919.0,2919.0,2919.0,2896.0,2918.0,2918.0,...,2918.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0
mean,57.137718,69.305795,10168.11408,6.089072,5.564577,1971.312778,1984.264474,102.201312,441.423235,49.582248,...,472.874572,93.709832,47.486811,23.098321,2.602261,16.06235,2.251799,50.825968,6.213087,2007.792737
std,42.517628,23.344905,7886.996359,1.409947,1.113131,30.291442,20.894344,179.334253,455.610826,169.205611,...,215.394815,126.526589,67.575493,64.244246,25.188169,56.184365,35.663946,567.402211,2.714762,1.314964
min,20.0,21.0,1300.0,1.0,1.0,1872.0,1950.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,20.0,59.0,7478.0,5.0,5.0,1953.5,1965.0,0.0,0.0,0.0,...,320.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0
50%,50.0,68.0,9453.0,6.0,5.0,1973.0,1993.0,0.0,368.5,0.0,...,480.0,0.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,70.0,80.0,11570.0,7.0,6.0,2001.0,2004.0,164.0,733.0,0.0,...,576.0,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,190.0,313.0,215245.0,10.0,9.0,2010.0,2010.0,1600.0,5644.0,1526.0,...,1488.0,1424.0,742.0,1012.0,508.0,576.0,800.0,17000.0,12.0,2010.0


### Pick the numeric features with no missing values and try a simple regression

In [5]:
ar = data.describe().iloc[0]
cols = ar[ar == 2919].index
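The `count` row of `describe()` covers only numeric columns, so comparing it with the total row count picks the numeric columns that have no gaps. An equivalent selection that does not hard-code the row count might look like this (toy data; the frame `toy` is illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, np.nan, 68.0],  # numeric, but has a gap
    "Street": ["Pave", "Pave", "Grvl"],   # not numeric
})

# Keep numeric columns only, then keep those where every value is present.
numeric = toy.select_dtypes(include="number")
complete_cols = numeric.columns[numeric.notna().all()]
```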

In [6]:
data[cols].describe()

Unnamed: 0,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,...,Fireplaces,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
count,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,...,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0,2919.0
mean,57.137718,10168.11408,6.089072,5.564577,1971.312778,1984.264474,1159.581706,336.483727,4.694416,1500.759849,...,0.597122,93.709832,47.486811,23.098321,2.602261,16.06235,2.251799,50.825968,6.213087,2007.792737
std,42.517628,7886.996359,1.409947,1.113131,30.291442,20.894344,392.362079,428.701456,46.396825,506.051045,...,0.646129,126.526589,67.575493,64.244246,25.188169,56.184365,35.663946,567.402211,2.714762,1.314964
min,20.0,1300.0,1.0,1.0,1872.0,1950.0,334.0,0.0,0.0,334.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2006.0
25%,20.0,7478.0,5.0,5.0,1953.5,1965.0,876.0,0.0,0.0,1126.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,2007.0
50%,50.0,9453.0,6.0,5.0,1973.0,1993.0,1082.0,0.0,0.0,1444.0,...,1.0,0.0,26.0,0.0,0.0,0.0,0.0,0.0,6.0,2008.0
75%,70.0,11570.0,7.0,6.0,2001.0,2004.0,1387.5,704.0,0.0,1743.5,...,1.0,168.0,70.0,0.0,0.0,0.0,0.0,0.0,8.0,2009.0
max,190.0,215245.0,10.0,9.0,2010.0,2010.0,5095.0,2065.0,1064.0,5642.0,...,4.0,1424.0,742.0,1012.0,508.0,576.0,800.0,17000.0,12.0,2010.0


In [7]:
for col in cols:
    print(col, cols_dict[col])

MSSubClass 住宅類型
LotArea 地坪
OverallQual 綜合材料評分
OverallCond 綜合評分
YearBuilt 原始施工日期
YearRemodAdd 改建日期
1stFlrSF 一樓坪數
2ndFlrSF 二樓坪數
LowQualFinSF 低品質的總坪數
GrLivArea 地上的屋內坪數
FullBath 完整浴室
HalfBath 半浴室
BedroomAbvGr 地上房間
KitchenAbvGr 地上廚房
TotRmsAbvGrd 地上客房
Fireplaces 壁爐數量
WoodDeckSF 木地板面積
OpenPorchSF 開放陽台面積
EnclosedPorch 封閉門廊面積
3SsnPorch 開放門廊面積
ScreenPorch 屏幕門廊面積
PoolArea 泳池面積
MiscVal 雜項功能
MoSold 售出月份
YrSold 售出年份


In [8]:
try_data = data[cols]
X = sklearn.preprocessing.minmax_scale(try_data.values)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X[:1460], saleprice.values, test_size=0.25, random_state=59)

model = sklearn.linear_model.LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.7940519861310117
0.7828951097674459


In [9]:
model.fit(X[:1460], saleprice.values) # train on all training rows
print(model.score(X[:1460], saleprice.values))

0.7926822419354128


### Scored 0.43802, a fairly poor result (lower is better)
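The leaderboard score for this competition is the root-mean-squared error between the logarithm of the predicted price and the logarithm of the actual price, so lower is better and it is not comparable to the R² values printed by `model.score`. A small sketch of that metric (the function name `rmse_log` is hypothetical):

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """Root-mean-squared error between log prices (inputs must be positive)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

perfect = rmse_log([100000, 200000], [100000, 200000])
# Overshooting every price by 10% gives log(1.1), regardless of price level.
off_by_10_percent = rmse_log([100000, 200000], [110000, 220000])
```

Because the error is taken on the log scale, a fixed percentage miss costs the same on a cheap house as on an expensive one.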

In [10]:
model = sklearn.linear_model.Lasso(alpha=1e3)
model.fit(X_train, y_train)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

0.7451945452051486
0.7460335390943017


In [11]:
model.fit(X[:1460], saleprice.values) # train on all training rows
print(model.score(X[:1460], saleprice.values))

0.7440094305146373


### Using L1 regularization (Lasso) to prune excess features scored 0.19605

* Check which features still have nonzero coefficients

In [12]:
ind = 0
for col in cols:
    if (model.coef_ != 0)[ind]:
        print(col, cols_dict[col])
    ind += 1

MSSubClass 住宅類型
OverallQual 綜合材料評分
YearBuilt 原始施工日期
YearRemodAdd 改建日期
1stFlrSF 一樓坪數
GrLivArea 地上的屋內坪數
Fireplaces 壁爐數量
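Lasso drives some coefficients exactly to zero, so the surviving features can also be read off with a boolean mask instead of a manual counter. A sketch with made-up coefficients standing in for `model.coef_`:

```python
import numpy as np
import pandas as pd

feature_names = pd.Index(["MSSubClass", "LotArea", "OverallQual"])
coef = np.array([-1200.0, 0.0, 35000.0])  # made-up values, not the fitted ones

# Index the names with the "coefficient is nonzero" mask.
kept = feature_names[coef != 0]
```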


# From here on, use all features


### Missing-value check

In [13]:
get_nan_col = data.isnull().sum()
cols = get_nan_col[get_nan_col != 0].index
get_nan_col[get_nan_col != 0]

MSZoning           4
LotFrontage      486
Alley           2721
Utilities          2
Exterior1st        1
Exterior2nd        1
MasVnrType        24
MasVnrArea        23
BsmtQual          81
BsmtCond          82
BsmtExposure      82
BsmtFinType1      79
BsmtFinSF1         1
BsmtFinType2      80
BsmtFinSF2         1
BsmtUnfSF          1
TotalBsmtSF        1
Electrical         1
BsmtFullBath       2
BsmtHalfBath       2
KitchenQual        1
Functional         2
FireplaceQu     1420
GarageType       157
GarageYrBlt      159
GarageFinish     159
GarageCars         1
GarageArea         1
GarageQual       159
GarageCond       159
PoolQC          2909
Fence           2348
MiscFeature     2814
SaleType           1
dtype: int64

In [14]:
for col in cols:
    print(cols_dict[col])
    print(data[col].value_counts(), '\n')

地段分區
RL         2265
RM          460
FV          139
RH           26
C (all)      25
Name: MSZoning, dtype: int64 

距離街道距離
60.0     276
80.0     137
70.0     133
50.0     117
75.0     105
        ... 
137.0      1
182.0      1
119.0      1
195.0      1
141.0      1
Name: LotFrontage, Length: 128, dtype: int64 

小巷類型
Grvl    120
Pave     78
Name: Alley, dtype: int64 

能源使用
AllPub    2916
NoSeWa       1
Name: Utilities, dtype: int64 

外牆材料1
VinylSd    1025
MetalSd     450
HdBoard     442
Wd Sdng     411
Plywood     221
CemntBd     126
BrkFace      87
WdShing      56
AsbShng      44
Stucco       43
BrkComm       6
Stone         2
AsphShn       2
CBlock        2
ImStucc       1
Name: Exterior1st, dtype: int64 

外牆材料2
VinylSd    1014
MetalSd     447
HdBoard     406
Wd Sdng     391
Plywood     270
CmentBd     126
Wd Shng      81
Stucco       47
BrkFace      47
AsbShng      38
Brk Cmn      22
ImStucc      15
Stone         6
AsphShn       4
CBlock        3
Other         1
Name: Exterior2nd, dt

### Impute the columns we need

In [15]:
data.MSZoning = data.MSZoning.fillna('RL') # fill with mode
data.LotFrontage = data.LotFrontage.fillna(data.LotFrontage.mean()) # fill with mean
data.Alley = data.Alley.notnull().astype(object) # convert to "has alley access or not" (categorical)
data.Exterior1st = data.Exterior1st.fillna('VinylSd') # fill with mode
data.Exterior2nd = data.Exterior2nd.fillna('VinylSd') # fill with mode
data.MasVnrType = data.MasVnrType.fillna('None') # fill with mode
data.MasVnrArea = data.MasVnrArea.fillna(data.MasVnrArea.mean()) # fill with mean
data.BsmtQual = data.BsmtQual.fillna('TA') # fill with mode
data.BsmtCond = data.BsmtCond.fillna('TA') # fill with mode
data.BsmtExposure = data.BsmtExposure.fillna('No') # fill with mode
data.BsmtFinType1 = data.BsmtFinType1.fillna('Unf') # fill with mode
data.BsmtFinSF1 = data.BsmtFinSF1.fillna(data.BsmtFinSF1.mean()) # fill with mean
data.BsmtFinType2 = data.BsmtFinType2.fillna('Unf') # fill with mode
data.BsmtFinSF2 = data.BsmtFinSF2.fillna(data.BsmtFinSF2.mean()) # fill with mean
data.BsmtUnfSF = data.BsmtUnfSF.fillna(data.BsmtUnfSF.mean()) # fill with mean
data.TotalBsmtSF = data.TotalBsmtSF.fillna(data.TotalBsmtSF.mean()) # fill with mean
data.Electrical = data.Electrical.fillna('SBrkr') # fill with mode
data.BsmtFullBath = data.BsmtFullBath.fillna(0.0) # fill with mode
data.BsmtHalfBath = data.BsmtHalfBath.fillna(0.0) # fill with mode
data.KitchenQual = data.KitchenQual.fillna('TA') # fill with mode
data.Functional = data.Functional.fillna('Typ') # fill with mode
data.GarageType = data.GarageType.fillna('Attchd') # fill with mode
data.GarageYrBlt = data.GarageYrBlt.fillna(data.GarageYrBlt.mean()) # fill with mean
data.GarageFinish = data.GarageFinish.fillna('Unf') # fill with mode
data.GarageCars = data.GarageCars.fillna(2.0) # fill with mode
data.GarageArea = data.GarageArea.fillna(data.GarageArea.mean()) # fill with mean
data.GarageQual = data.GarageQual.fillna('TA') # fill with mode
data.GarageCond = data.GarageCond.fillna('TA') # fill with mode
data.SaleType = data.SaleType.fillna('WD') # fill with mode

drop_cols = ['Utilities', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']  # mostly missing (or, for Utilities, nearly constant), so not used
data = data.drop(columns=drop_cols)
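The fills above are chosen by hand per column. As a rough alternative, the same "mean for numeric, mode for categorical" rule can be applied in a loop. This is only a sketch on toy data, and it is not identical to the cell above, which fills some numeric count columns such as `GarageCars` with the mode:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", None, "RM"],
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
})

filled = toy.copy()
for col in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[col]):
        # Numeric column: fill gaps with the column mean.
        filled[col] = filled[col].fillna(filled[col].mean())
    else:
        # Categorical column: fill gaps with the most frequent value.
        filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```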


### Add new features
* For example, house age (year sold minus year built)

In [16]:
data['house_old1'] = np.log(data.YrSold - data.YearBuilt + 2) # log compresses the spread; the offset presumably keeps the argument positive
data['house_old2'] = np.log(data.YrSold - data.YearRemodAdd + 3)

### [Feature engineering for cyclical data](http://blog.davidkaleko.com/feature-engineering-cyclical-features.html)
*  Project the month (1 to 12) onto two axes so that one year makes a full circle

In [17]:
data["SinMoSold"] = np.sin(2 * np.pi * data["MoSold"] / 12)
data["CosMoSold"] = np.cos(2 * np.pi * data["MoSold"] / 12)
data = data.drop("MoSold", axis=1)
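The point of the sine/cosine pair is that December and January end up as neighbours, which a raw 1 to 12 number cannot express. A small check of that property (the helper `dist` is hypothetical and only used for illustration):

```python
import numpy as np

months = np.arange(1, 13)
angle = 2 * np.pi * months / 12
points = np.column_stack([np.sin(angle), np.cos(angle)])  # one (sin, cos) pair per month

def dist(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    return float(np.linalg.norm(points[m1 - 1] - points[m2 - 1]))

dec_to_jan = dist(12, 1)   # adjacent months across the year boundary
jan_to_feb = dist(1, 2)    # adjacent months within the year
jun_to_dec = dist(6, 12)   # opposite sides of the circle
```

Adjacent months are all equally far apart, and months half a year apart sit at opposite ends of the unit circle.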

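### Make sure every column has a suitable dtype
* MSSubClass is stored as a number but is really a category
* YrSold (year sold) is also numeric, but is treated as a category here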

In [18]:
def type_col(type_):
    return np.array(data.dtypes[data.dtypes == type_].index)

In [19]:
# convert to object so these columns can be one-hot encoded
cols = ["MSSubClass", "YrSold"]
data[cols] = data[cols].astype(object)

In [20]:
cols = type_col(int)
data[cols] = data[cols].astype(float)

In [21]:
data.dtypes.value_counts()

object     40
float64    37
dtype: int64

### Select features

In [22]:
n = 0
for col in type_col(object):
    print(cols_dict[col], n)
    n += 1
    print(data[col].value_counts(), '\n')

住宅類型 0
20     1079
60      575
50      287
120     182
30      139
160     128
70      128
80      118
90      109
190      61
85       48
75       23
45       18
180      17
40        6
150       1
Name: MSSubClass, dtype: int64 

地段分區 1
RL         2269
RM          460
FV          139
RH           26
C (all)      25
Name: MSZoning, dtype: int64 

街道類型 2
Pave    2907
Grvl      12
Name: Street, dtype: int64 

小巷類型 3
False    2721
True      198
Name: Alley, dtype: int64 

土地形狀 4
Reg    1859
IR1     968
IR2      76
IR3      16
Name: LotShape, dtype: int64 

土地平坦程度 5
Lvl    2622
HLS     120
Bnk     117
Low      60
Name: LandContour, dtype: int64 

配置方式 6
Inside     2133
Corner      511
CulDSac     176
FR2          85
FR3          14
Name: LotConfig, dtype: int64 

物業斜度 7
Gtl    2778
Mod     125
Sev      16
Name: LandSlope, dtype: int64 

地理位置 8
NAmes      443
CollgCr    267
OldTown    239
Edwards    194
Somerst    182
NridgHt    166
Gilbert    165
Sawyer     151
NWAmes     131
SawyerW    1

### One-hot encoding: too many one-hot columns can easily blow up the regression (when the data splits badly)

In [23]:
col_index = [3, 4, 18, 30, 37]
onehot = data[type_col(object)[col_index]]

In [24]:
onehot = pd.get_dummies(onehot).values
onehot

array([[1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 1, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=uint8)
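To make the encoding concrete, here is what `pd.get_dummies` does to a tiny frame with two categorical columns (toy values; the column names are borrowed from the dataset only for readability):

```python
import pandas as pd

toy = pd.DataFrame({
    "LotShape": ["Reg", "IR1", "Reg"],
    "LandSlope": ["Gtl", "Gtl", "Mod"],
})

# One indicator column per (source column, category) pair.
encoded = pd.get_dummies(toy)
column_names = list(encoded.columns)
ones_per_row = encoded.astype(int).sum(axis=1).tolist()
```

Each row gets exactly one active indicator per source column, so the column count grows with the number of distinct categories. That growth is why only a handful of categorical columns are selected above.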

### Normalize continuous features

In [25]:
data.corr()

Unnamed: 0,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,...,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,house_old1,house_old2,SinMoSold,CosMoSold
LotFrontage,1.0,0.364382,0.20419,-0.06835,0.116905,0.085608,0.20221,0.203603,0.041396,0.104971,...,0.15216,0.010541,0.025255,0.069348,0.160857,0.035762,-0.122641,-0.10481,-0.023985,0.015225
LotArea,0.364382,1.0,0.100541,-0.035617,0.024128,0.021612,0.125354,0.194021,0.084055,0.021361,...,0.104797,0.020974,0.015995,0.054375,0.093708,0.069029,-0.008301,-0.024271,-0.021416,0.007518
OverallQual,0.20419,0.100541,1.0,-0.093847,0.597554,0.571532,0.430961,0.281704,-0.042755,0.275072,...,0.298084,-0.139256,0.018715,0.04291,0.03074,0.005562,-0.676121,-0.583552,-0.041762,0.046515
OverallCond,-0.06835,-0.035617,-0.093847,1.0,-0.368477,0.047654,-0.135752,-0.050403,0.041489,-0.138162,...,-0.068978,0.071044,0.043739,0.043713,-0.016876,0.033956,0.365833,0.00372,-0.005429,-0.09757
YearBuilt,0.116905,0.024128,0.597554,-0.368477,1.0,0.612235,0.312579,0.279547,-0.027591,0.130457,...,0.198554,-0.374073,0.015958,-0.041046,0.002304,-0.010886,-0.896871,-0.596082,-0.01749,0.02751
YearRemodAdd,0.085608,0.021612,0.571532,0.047654,0.612235,1.0,0.196117,0.152056,-0.062125,0.165099,...,0.242182,-0.220456,0.037433,-0.046878,-0.011407,-0.003124,-0.690506,-0.937284,-0.032524,-0.002718
MasVnrArea,0.20221,0.125354,0.430961,-0.135752,0.312579,0.196117,1.0,0.301999,-0.015633,0.089712,...,0.143659,-0.111156,0.013611,0.065188,0.004512,0.04481,-0.297642,-0.199101,-0.017224,0.025477
BsmtFinSF1,0.203603,0.194021,0.281704,-0.050403,0.279547,0.152056,0.301999,1.0,-0.055045,-0.477404,...,0.124153,-0.09971,0.050908,0.096821,0.084462,0.093295,-0.191372,-0.11478,0.002237,0.044643
BsmtFinSF2,0.041396,0.084055,-0.042755,0.041489,-0.027591,-0.062125,-0.015633,-0.055045,1.0,-0.238241,...,-0.005875,0.032739,-0.023279,0.063301,0.044524,-0.005139,0.109404,0.107131,0.014899,-0.011418
BsmtUnfSF,0.104971,0.021361,0.275072,-0.138162,0.130457,0.165099,0.089712,-0.477404,-0.238241,1.0,...,0.119753,0.005006,-0.00581,-0.049157,-0.032273,-0.010492,-0.276797,-0.247502,-0.03773,0.029619


In [26]:
cont = sklearn.preprocessing.robust_scale((data[type_col(float)]))
cont

array([[-0.23921085, -0.24511241,  0.5       , ..., -0.43072011,
         0.8660254 ,  1.15470054],
       [ 0.59412248,  0.03592375,  0.        , ...,  0.33780114,
         0.5       , -0.42264973],
       [-0.07254418,  0.43914956,  0.5       , ..., -0.36816045,
        -1.        ,  0.57735027],
       ...,
       [ 5.03856693,  2.57746823, -0.5       , ..., -0.17284591,
        -1.        ,  0.57735027],
       [-0.40587752,  0.24144673, -0.5       , ..., -0.03035931,
        -0.5       , -0.42264973],
       [ 0.26078915,  0.04252199,  0.5       , ..., -0.09683886,
        -0.5       ,  1.57735027]])

In [27]:
X = np.concatenate((onehot, cont), 1)
X.shape

(2919, 56)

In [28]:
def train_(n):
    model = sklearn.linear_model.LinearRegression()
    score = np.zeros((2, n))
    for i in range(n):
        X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X[:1460], saleprice.values, test_size=0.25)

        model.fit(X_train, y_train)
        
        score[0, i] = model.score(X_train, y_train)
        score[1, i] = model.score(X_test, y_test)
#         if score[1, i] < 0:
#             return [X_train, X_test, y_train, y_test]
        
    return score

In [29]:
score = train_(200)

In [30]:
score.mean(1)

array([0.84241335, 0.76565161])
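Averaging R² over many random 75/25 splits, as `train_` does, can also be written with scikit-learn's `ShuffleSplit` and `cross_validate`. A sketch on synthetic data (`X_toy`, `y_toy` and the coefficients are made up; this is not the notebook's feature matrix):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 20 independent random splits, each holding out 25% of the rows.
splitter = ShuffleSplit(n_splits=20, test_size=0.25, random_state=0)
result = cross_validate(
    LinearRegression(), X_toy, y_toy, cv=splitter, return_train_score=True
)

mean_train_r2 = result["train_score"].mean()
mean_test_r2 = result["test_score"].mean()
```

Fixing `random_state` makes the averaged scores reproducible, which the unseeded loop above is not.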

In [31]:
model = sklearn.linear_model.LinearRegression()
model.fit(X[:1460], saleprice.values) # train on all training rows
print(model.score(X[:1460], saleprice.values))

0.8342483078657159


### Scored 0.18714

In [32]:
submission.SalePrice = model.predict(X[1460:])
submission.to_csv('submission.csv', index=False)

In [33]:
price = np.log(saleprice.values)

In [34]:
model = sklearn.linear_model.Ridge(alpha=1e2)
model.fit(X[:1460], price) # train on all training rows
print(model.score(X[:1460], price))

0.8669457600190038


In [35]:
model = sklearn.linear_model.Lasso(alpha=2e-3)
model.fit(X[:1460], price) # train on all training rows
print(model.score(X[:1460], price))

0.8672464787232553


In [36]:
(model.coef_ != 0).sum()

35

In [37]:
submission.SalePrice = np.exp(model.predict(X[1460:]))
submission.to_csv('submission.csv', index=False)
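Fitting on `np.log(saleprice)` and calling `np.exp` on the predictions can be bundled into a single estimator so the inverse transform is never forgotten. One way to sketch this is scikit-learn's `TransformedTargetRegressor`, shown here on synthetic data (`X_toy`, `y_toy` and the `Ridge` settings are illustrative assumptions, not the notebook's tuned model):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
# Made-up prices whose logarithm is linear in the features.
y_toy = np.exp(12.0 + X_toy @ np.array([0.3, -0.2, 0.1]))

# The regressor sees log(y) during fit; predict() applies exp automatically.
wrapped = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0), func=np.log, inverse_func=np.exp
)
wrapped.fit(X_toy, y_toy)
pred = wrapped.predict(X_toy)  # already back on the price scale
```

Predictions produced this way are always positive, which matters because the competition metric takes their logarithm.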

In [38]:
data.skew()

MSSubClass        1.376165
LotFrontage       1.646420
LotArea          12.829025
Alley             3.439091
OverallQual       0.197212
OverallCond       0.570605
YearBuilt        -0.600114
YearRemodAdd     -0.451252
MasVnrArea        2.612892
BsmtFinSF1        1.425966
BsmtFinSF2        4.148166
BsmtUnfSF         0.919981
TotalBsmtSF       1.163082
1stFlrSF          1.470360
2ndFlrSF          0.862118
LowQualFinSF     12.094977
GrLivArea         1.270010
BsmtFullBath      0.625153
BsmtHalfBath      3.933616
FullBath          0.167692
HalfBath          0.694924
BedroomAbvGr      0.326492
KitchenAbvGr      4.304467
TotRmsAbvGrd      0.758757
Fireplaces        0.733872
GarageYrBlt      -0.392992
GarageCars       -0.218705
GarageArea        0.241342
WoodDeckSF        1.843380
OpenPorchSF       2.536417
EnclosedPorch     4.005950
3SsnPorch        11.381914
ScreenPorch       3.948723
PoolArea         16.907017
MiscVal          21.958480
YrSold            0.132467
house_old1       -0.666134
h