# Multiple Linear Regression / Log-Linear Regression (choose one)

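## I. Multiple Linear Regression
In this part you are asked to complete a multiple linear regression. We first walk through ten-fold cross-validation of a simple (single-feature) linear regression with sklearn; you can follow the same pattern for the multiple-regression case.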

### 1. Load the data

In [6]:
import numpy as np

In [22]:
import pandas as pd

# Load the data
data = pd.read_csv('data/kaggle_house_price_prediction/kaggle_hourse_price_train.csv')

# Drop every feature (column) that contains missing values
data.dropna(axis = 1, inplace = True)

# Keep only the integer-valued features
data = data.select_dtypes(include = 'int64')
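
The cleaning step above can be tried on a small frame before running it on the full file. The sketch below is only an illustration: the `toy` frame and its values are made up, and only the column names are borrowed from the housing data.

```python
import pandas as pd

# Toy frame standing in for the housing data (values are invented).
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450, 9600, 11250],
    "LotFrontage": [65.0, None, 68.0],   # contains a missing value
    "Street": ["Pave", "Pave", "Grvl"],  # non-numeric column
    "SalePrice": [208500, 181500, 223500],
})

# Drop columns with missing values, then keep only int64 columns.
cleaned = toy.dropna(axis=1).select_dtypes(include="int64")

print(list(cleaned.columns))
```

Note that a numeric column with even one missing value is removed entirely, so this cleaning trades information for simplicity.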

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 35 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   LotArea        1460 non-null   int64
 3   OverallQual    1460 non-null   int64
 4   OverallCond    1460 non-null   int64
 5   YearBuilt      1460 non-null   int64
 6   YearRemodAdd   1460 non-null   int64
 7   BsmtFinSF1     1460 non-null   int64
 8   BsmtFinSF2     1460 non-null   int64
 9   BsmtUnfSF      1460 non-null   int64
 10  TotalBsmtSF    1460 non-null   int64
 11  1stFlrSF       1460 non-null   int64
 12  2ndFlrSF       1460 non-null   int64
 13  LowQualFinSF   1460 non-null   int64
 14  GrLivArea      1460 non-null   int64
 15  BsmtFullBath   1460 non-null   int64
 16  BsmtHalfBath   1460 non-null   int64
 17  FullBath       1460 non-null   int64
 18  HalfBath       1460 non-null   int64
 19  Bedroo

In [23]:
# Collect all feature labels, skipping the first column (Id) and the last (SalePrice)
col = list(data.columns.values)
col = col[1:-1]
col

['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold']


### 2. Import the model

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

### 3. Ten-fold cross-validation of simple linear regression with sklearn

#### Create the model

In [8]:
model = LinearRegression()

#### Select the data

In [12]:
features = ['LotArea']
x = data[features]
y = data['SalePrice']

#### Predict with ten-fold cross-validation

In [13]:
prediction = cross_val_predict(model, x, y, cv = 10)

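The ten folds are processed in row order, without shuffling: the first 10% of the rows serve as the test set first, then the rows from 10% to 20%, and so on. The predictions for the ten folds are then concatenated in that order and returned.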
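
The fold order described above can be checked directly. For a regressor, passing `cv = 10` makes sklearn use `KFold` with ten splits and no shuffling; the sketch below prints the test indices of each fold for a small stand-in of 20 rows (the size is an assumption chosen only for readability).

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20           # small stand-in for the 1460 rows
kf = KFold(n_splits=10)  # shuffle defaults to False

# Collect the test indices of each fold, in the order the folds are produced.
test_blocks = [test_idx for _, test_idx in kf.split(np.arange(n_samples))]

for fold, idx in enumerate(test_blocks):
    print(fold, idx)
```

Each fold's test set is a contiguous block of rows, so any ordering in the data file (for example by date or neighborhood) carries over into the folds.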

In [14]:
prediction.shape

(1460,)

### 4. Compute evaluation metrics

#### MAE

In [15]:
mean_absolute_error(data['SalePrice'], prediction)

55394.44195244894

#### RMSE

In [16]:
mean_squared_error(data['SalePrice'], prediction) ** 0.5

77868.51337752414
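
To make the two metrics concrete, here is a minimal hand-written sketch. MAE is the mean of the absolute errors; RMSE is the square root of the mean squared error. The helper names `mae` and `rmse` and the three price pairs are invented for illustration and are not part of sklearn or the dataset.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared error."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

# Invented example values: the errors are 10000, 10000 and 30000.
y_true = [200000, 150000, 300000]
y_pred = [210000, 140000, 330000]

mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
print(mae_value, rmse_value)
```

Because the errors are squared before averaging, RMSE weights large misses more heavily and is never smaller than MAE.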

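### 5. Choose several feature combinations to fit multiple linear regressions, and compare the MAE and RMSE of the resulting models under ten-fold cross-validation; complete at least 3 combinations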


###### Hint: for multiple linear regression, just add the names of other features to the `features` list above
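
Since every feature combination goes through the same three steps (select columns, predict with cross-validation, score), the repetition can be wrapped in a function. The sketch below is one possible way to do that; `evaluate_features` is a hypothetical helper, and the `demo` frame is synthetic data that borrows two column names so the block runs without the CSV file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_predict

def evaluate_features(frame, features, target="SalePrice", cv=10):
    """Hypothetical helper: (MAE, RMSE) of cross-validated predictions."""
    pred = cross_val_predict(LinearRegression(), frame[features], frame[target], cv=cv)
    mae = mean_absolute_error(frame[target], pred)
    rmse = mean_squared_error(frame[target], pred) ** 0.5
    return mae, rmse

# Synthetic stand-in data (not the Kaggle file): price depends on both columns.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "LotArea": rng.uniform(2000, 20000, n),
    "GrLivArea": rng.uniform(500, 4000, n),
})
demo["SalePrice"] = 5 * demo["LotArea"] + 80 * demo["GrLivArea"] + rng.normal(0, 10000, n)

for feats in (["LotArea"], ["LotArea", "GrLivArea"]):
    mae, rmse = evaluate_features(demo, feats)
    print(feats, round(mae), round(rmse))
```

With the real data, the same call would take `data` and any list of column names from `col`.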

### Model 1

In [17]:
# Select the data
features1=['LotArea','BsmtUnfSF','GarageArea']
x1 = data[features1]
y1 = data['SalePrice']


In [18]:
# Predict with ten-fold cross-validation
prediction1 = cross_val_predict(model, x1, y1, cv = 10)
prediction1.shape

(1460,)

In [19]:
mae1 = mean_absolute_error(data['SalePrice'], prediction1)
rmse1 = mean_squared_error(data['SalePrice'], prediction1) ** 0.5

print("Model 1:\nMAE:", mae1, "\nRMSE:", rmse1, "\n")

Model 1:
MAE: 40924.148811351835
RMSE: 60785.974862177216



### Model 2

In [20]:
# Select the data
features2=['LotArea', 'YearBuilt', 'GrLivArea']
x2 = data[features2]
y2 = data['SalePrice']


In [21]:
# Predict with ten-fold cross-validation
prediction2 = cross_val_predict(model, x2, y2, cv = 10)
prediction2.shape

(1460,)

In [22]:
mae2 = mean_absolute_error(data['SalePrice'], prediction2)
rmse2 = mean_squared_error(data['SalePrice'], prediction2) ** 0.5

print("Model 2:\nMAE:", mae2, "\nRMSE:", rmse2, "\n")

Model 2:
MAE: 30283.19296629594
RMSE: 46422.7681688791



### Model 3

In [23]:
# Select the data
features3=['LotArea', 'YearBuilt', 'GrLivArea', 'BsmtUnfSF','GarageArea']
x3 = data[features3]
y3 = data['SalePrice']

In [24]:
# Predict with ten-fold cross-validation
prediction3 = cross_val_predict(model, x3, y3, cv = 10)
prediction3.shape

(1460,)

In [25]:
mae3 = mean_absolute_error(data['SalePrice'], prediction3)
rmse3 = mean_squared_error(data['SalePrice'], prediction3) ** 0.5

print("Model 3:\nMAE:", mae3, "\nRMSE:", rmse3, "\n")

Model 3:
MAE: 28853.359679436122
RMSE: 44521.633949275696



### Using all features

In [24]:
# Select the data
x4 = data[col]
y4 = data['SalePrice']

In [25]:
# Predict with ten-fold cross-validation
prediction4 = cross_val_predict(model, x4, y4, cv = 10)
prediction4.shape

(1460,)

In [26]:
mae4 = mean_absolute_error(data['SalePrice'], prediction4)
rmse4 = mean_squared_error(data['SalePrice'], prediction4) ** 0.5

print("Model:\nMAE:", mae4, "\nRMSE:", rmse4, "\n")

Model:
MAE: 21876.547074388393
RMSE: 36877.16673206901



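###### Double-click here to fill in
1. Features used by Model 1: ['LotArea','BsmtUnfSF','GarageArea']
2. Features used by Model 2: ['LotArea', 'YearBuilt', 'GrLivArea']
3. Features used by Model 3: ['LotArea', 'YearBuilt', 'GrLivArea','BsmtUnfSF','GarageArea']

Model|MAE|RMSE
-|-|-
Model 1 | 40924.148811351835 | 60785.974862177216
Model 2 | 30283.19296629594 | 46422.7681688791
Model 3 | 28853.359679436122 | 44521.633949275696
All features | 21876.547074388393 | 36877.16673206901

Among the three hand-picked models in the table above, Model 3 (features ['LotArea', 'YearBuilt', 'GrLivArea','BsmtUnfSF','GarageArea']) gives the best predictions, with the lowest MAE and RMSE of the three.

Compared with those three feature sets, multiple linear regression on all features gives the best result overall.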

## Extension: polynomial regression (an extension of simple linear regression). Try transforming some features, for example feeding their squares or cubes into the model, and observe how the model's predictive power changes

### LotArea squared

In [26]:
# Select the data; copy the slice so the assignment below does not trigger SettingWithCopyWarning
features=['LotArea','BsmtUnfSF','GarageArea']
x5 = data[features].copy()
y5 = data['SalePrice']
# Replace LotArea with its square
x5['LotArea'] = x5['LotArea'] ** 2


In [27]:
# Predict with ten-fold cross-validation
prediction5 = cross_val_predict(model, x5, y5, cv = 10)
prediction5.shape

(1460,)

In [28]:
mae5 = mean_absolute_error(data['SalePrice'], prediction5)
rmse5 = mean_squared_error(data['SalePrice'], prediction5) ** 0.5

print("Model:\nMAE:", mae5, "\nRMSE:", rmse5, "\n")

Model:
MAE: 41754.53257550982
RMSE: 61536.674969921194



### LotArea cubed

In [29]:
# Select the data; copy the slice so the assignment below does not trigger SettingWithCopyWarning
features=['LotArea','BsmtUnfSF','GarageArea']
x6 = data[features].copy()
y6 = data['SalePrice']
# Replace LotArea with its cube
x6['LotArea'] = x6['LotArea'] ** 3


In [30]:
# Predict with ten-fold cross-validation
prediction6 = cross_val_predict(model, x6, y6, cv = 10)
prediction6.shape

(1460,)

In [31]:
mae6 = mean_absolute_error(data['SalePrice'], prediction6)
rmse6 = mean_squared_error(data['SalePrice'], prediction6) ** 0.5

print("Model:\nMAE:", mae6, "\nRMSE:", rmse6, "\n")

Model:
MAE: 41888.62258375124
RMSE: 61593.48035097124



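Selected features: ['LotArea','BsmtUnfSF','GarageArea']

|LotArea transform|MAE|RMSE|
|-|-|-|
|First power (Model 1)|40924.148811351835|60785.974862177216|
|Squared|41754.53257550982|61536.674969921194|
|Cubed|41888.62258375124|61593.48035097124|


The like-for-like baseline for this feature set is Model 1, not the single-feature LotArea model from section 3 (MAE 55394.44, RMSE 77868.51), which uses only one feature. Against that baseline, replacing LotArea with its square slightly increases both errors (MAE from about 40924 to 41755, RMSE from about 60786 to 61537), and replacing it with its cube increases them slightly further. For this feature set, neither transformation improves on the untransformed LotArea.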
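
The experiments above replace LotArea with its power. Polynomial regression in the usual sense keeps the original feature and adds the power as an extra column, so the model can fit both the linear and the curved part. The sketch below illustrates that variant on synthetic data with a known quadratic relationship; the variable names and the data are assumptions for illustration, and the outcome on the housing data may differ.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

# Synthetic data: y depends on x both linearly and quadratically.
rng = np.random.default_rng(0)
n = 300
x = rng.uniform(0, 10, n)
y = 3 * x + 2 * x**2 + rng.normal(0, 1, n)
demo = pd.DataFrame({"x": x})

def cv_rmse(features):
    """RMSE of ten-fold cross-validated predictions for the given columns."""
    pred = cross_val_predict(LinearRegression(), features, y, cv=10)
    return mean_squared_error(y, pred) ** 0.5

linear_only = demo[["x"]]
# Keep x and add x**2 as a new column instead of overwriting x.
augmented = demo.assign(x_squared=demo["x"] ** 2)

rmse_linear = cv_rmse(linear_only)
rmse_augmented = cv_rmse(augmented)
print(rmse_linear, rmse_augmented)
```

On the housing data, the analogous step would be to add a column such as `data['LotArea'] ** 2` alongside `LotArea` rather than in place of it, then rerun the same cross-validation.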