# Understanding the Competition Task

## Analyzing the Task

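1. This is a traditional data-mining problem: build models with data-science, machine-learning, and deep-learning methods to produce predictions.
2. It is a typical regression problem.
3. The main tools are xgb, lgb, and catboost, plus common data-mining libraries and frameworks such as pandas, numpy, matplotlib, seaborn, sklearn, and keras.
4. Use EDA to uncover relationships in the data and to get familiar with it.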

Evaluation metric: MAE (mean absolute error)


### Loading the Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
train_data = pd.read_csv("./data/used_car_train_20200313.csv", sep=' ')
test_data = pd.read_csv("./data/used_car_testA_20200313.csv", sep=' ')
print("The shape of train_data", train_data.shape)
print("The shape of test_data", test_data.shape)

The shape of train_data (150000, 31)
The shape of test_data (50000, 30)


The training set has 150,000 rows and 31 columns; the test set has 50,000 rows and 30 columns. The one extra column in the training set is `price`.
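One way to confirm which column the test set lacks is to compare the two column indexes. The sketch below is only illustrative: `train_demo` and `test_demo` are hypothetical stand-in frames that carry a handful of the real column names and no data, so it runs without the CSV files.

```python
import pandas as pd

# Hypothetical stand-in frames: a subset of the real column names, no rows needed.
train_demo = pd.DataFrame(columns=["SaleID", "name", "regDate", "power", "price"])
test_demo = pd.DataFrame(columns=["SaleID", "name", "regDate", "power"])

# Columns present in the training frame but absent from the test frame.
extra_cols = train_demo.columns.difference(test_demo.columns).tolist()
print(extra_cols)
```

With the real data, the same call would be `train_data.columns.difference(test_data.columns)`.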

**<font color=red>A quick look at the first 10 rows</font>**

In [3]:
train_data.head(10)

|  | SaleID | name | regDate | model | brand | bodyType | fuelType | gearbox | power | kilometer | ... | v_5 | v_6 | v_7 | v_8 | v_9 | v_10 | v_11 | v_12 | v_13 | v_14 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 736 | 20040402 | 30.0 | 6 | 1.0 | 0.0 | 0.0 | 60 | 12.5 | ... | 0.235676 | 0.101988 | 0.129549 | 0.022816 | 0.097462 | -2.881803 | 2.804097 | -2.420821 | 0.795292 | 0.914762 |
| 1 | 1 | 2262 | 20030301 | 40.0 | 1 | 2.0 | 0.0 | 0.0 | 0 | 15.0 | ... | 0.264777 | 0.121004 | 0.135731 | 0.026597 | 0.020582 | -4.900482 | 2.096338 | -1.030483 | -1.722674 | 0.245522 |
| 2 | 2 | 14874 | 20040403 | 115.0 | 15 | 1.0 | 0.0 | 0.0 | 163 | 12.5 | ... | 0.25141 | 0.114912 | 0.165147 | 0.062173 | 0.027075 | -4.846749 | 1.803559 | 1.56533 | -0.832687 | -0.229963 |
| 3 | 3 | 71865 | 19960908 | 109.0 | 10 | 0.0 | 0.0 | 1.0 | 193 | 15.0 | ... | 0.274293 | 0.1103 | 0.121964 | 0.033395 | 0.0 | -4.509599 | 1.28594 | -0.501868 | -2.438353 | -0.478699 |
| 4 | 4 | 111080 | 20120103 | 110.0 | 5 | 1.0 | 0.0 | 0.0 | 68 | 5.0 | ... | 0.228036 | 0.073205 | 0.09188 | 0.078819 | 0.121534 | -1.89624 | 0.910783 | 0.93111 | 2.834518 | 1.923482 |
| 5 | 5 | 137642 | 20090602 | 24.0 | 10 | 0.0 | 1.0 | 0.0 | 109 | 10.0 | ... | 0.260246 | 0.000518 | 0.119838 | 0.090922 | 0.048769 | 1.885526 | -2.721943 | 2.45766 | -0.286973 | 0.206573 |
| 6 | 6 | 2402 | 19990411 | 13.0 | 4 | 0.0 | 0.0 | 1.0 | 150 | 15.0 | ... | 0.267998 | 0.117675 | 0.142334 | 0.025446 | 0.028174 | -4.9022 | 1.610616 | -0.834605 | -1.996117 | -0.10318 |
| 7 | 7 | 165346 | 19990706 | 26.0 | 14 | 1.0 | 0.0 | 0.0 | 101 | 15.0 | ... | 0.239506 | 0.0 | 0.122943 | 0.039839 | 0.082413 | 3.693829 | -0.245014 | -2.19281 | 0.236728 | 0.195567 |
| 8 | 8 | 2974 | 20030205 | 19.0 | 1 | 2.0 | 1.0 | 1.0 | 179 | 15.0 | ... | 0.263833 | 0.116583 | 0.144255 | 0.039851 | 0.024388 | -4.925234 | 1.587796 | 0.075348 | -1.551098 | 0.069433 |
| 9 | 9 | 82021 | 19980101 | 7.0 | 7 | 5.0 | 0.0 | 0.0 | 88 | 15.0 | ... | 0.262473 | 0.068267 | 0.012176 | 0.010291 | 0.098727 | -1.089584 | 0.600683 | -4.18621 | 0.198273 | -1.025822 |


In [4]:
train_data.dtypes   # check the column names and dtypes

SaleID                 int64
name                   int64
regDate                int64
model                float64
brand                  int64
bodyType             float64
fuelType             float64
gearbox              float64
power                  int64
kilometer            float64
notRepairedDamage     object
regionCode             int64
seller                 int64
offerType              int64
creatDate              int64
price                  int64
v_0                  float64
v_1                  float64
v_2                  float64
v_3                  float64
v_4                  float64
v_5                  float64
v_6                  float64
v_7                  float64
v_8                  float64
v_9                  float64
v_10                 float64
v_11                 float64
v_12                 float64
v_13                 float64
v_14                 float64
dtype: object

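**<font color=red>The data contains the following fields</font>**<br>
* name - car code
* regDate - car registration date
* model - model code
* brand - brand
* bodyType - body type
* fuelType - fuel type
* gearbox - gearbox type
* power - engine power
* kilometer - distance driven
* notRepairedDamage - whether the car has damage that has not been repaired
* regionCode - code of the region where the car is viewed
* seller - seller
* offerType - offer type
* creatDate - date the listing was posted
* price - car price
* v_0, v_1, v_2, v_3, v_4, v_5, v_6, v_7, v_8, v_9, v_10, v_11, v_12, v_13, v_14 - embedding vectors derived from a large amount of information such as car reviews and tags (artificially constructed anonymous features)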
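`regDate` is stored as `int64` values such as `20040402`, which read naturally as `YYYYMMDD` (`creatDate` presumably follows the same layout, though its values are not shown here). As a rough sketch, assuming that layout, the integers can be parsed into real dates with `pd.to_datetime`; the three sample values are copied from the preview above, and `errors="coerce"` is a defensive choice so that any malformed entry becomes `NaT` instead of raising.

```python
import pandas as pd

# regDate values copied from the first three rows of the preview.
reg_date = pd.Series([20040402, 20030301, 20040403], name="regDate")

# Parse integer YYYYMMDD values; anything that does not match becomes NaT.
reg_dt = pd.to_datetime(reg_date.astype(str), format="%Y%m%d", errors="coerce")

reg_year = reg_dt.dt.year.tolist()
print(reg_dt.dt.strftime("%Y-%m-%d").tolist())
print(reg_year)
```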


In [5]:
train_data.count()  # non-null count per column; anything below 150000 means missing values

SaleID               150000
name                 150000
regDate              150000
model                149999
brand                150000
bodyType             145494
fuelType             141320
gearbox              144019
power                150000
kilometer            150000
notRepairedDamage    150000
regionCode           150000
seller               150000
offerType            150000
creatDate            150000
price                150000
v_0                  150000
v_1                  150000
v_2                  150000
v_3                  150000
v_4                  150000
v_5                  150000
v_6                  150000
v_7                  150000
v_8                  150000
v_9                  150000
v_10                 150000
v_11                 150000
v_12                 150000
v_13                 150000
v_14                 150000
dtype: int64
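Four columns fall short of 150,000 non-null entries: `model`, `bodyType`, `fuelType`, and `gearbox`. Subtracting each count from the row total gives the number of missing values. The sketch below does that arithmetic on the counts copied from the output above; the names `non_null` and `summary` are only illustrative.

```python
import pandas as pd

n_rows = 150000
# Non-null counts copied from the count() output (only the columns below n_rows).
non_null = pd.Series(
    {"model": 149999, "bodyType": 145494, "fuelType": 141320, "gearbox": 144019}
)

missing = n_rows - non_null                      # missing values per column
missing_pct = (missing / n_rows * 100).round(2)  # share of rows missing, in percent

summary = pd.DataFrame({"missing": missing, "missing_pct": missing_pct})
print(summary.sort_values("missing", ascending=False))
```

On the full frame, `train_data.isnull().sum()` returns the same missing counts directly.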

### Classification metric examples

In [6]:
## accuracy: classification accuracy score
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
print('ACC:', accuracy_score(y_true, y_pred))    # number of correct predictions / len(y_true)
# TP: actual positive, predicted positive; FN: actual positive, predicted negative;
# FP: actual negative, predicted positive; TN: actual negative, predicted negative.
# Accuracy = (TP + TN) / (TP + TN + FP + FN)

ACC: 0.75
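Accuracy can also be recomputed from the four confusion-matrix counts. A minimal NumPy sketch, reusing the labels from the cell above and treating `1` as the positive class:

```python
import numpy as np

y_true = np.array([0, 1, 1, 1])
y_pred = np.array([0, 1, 0, 1])

# Count each cell of the binary confusion matrix (1 = positive, 0 = negative).
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / (tp + tn + fp + fn)
print('TP, TN, FP, FN:', tp, tn, fp, fn)
print('ACC:', accuracy)
```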

In [7]:
## Precision, Recall, F1-score
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('Precision', metrics.precision_score(y_true, y_pred))   # precision
print('Recall', metrics.recall_score(y_true, y_pred))         # recall
print('F1-score:', metrics.f1_score(y_true, y_pred))          # F1 score (balanced F-score): the harmonic mean of precision and recall
# Precision = TP / (TP + FP)
# Recall    = TP / (TP + FN)
# F1-score  = 2 * Precision * Recall / (Precision + Recall)

Precision 1.0
Recall 0.5
F1-score: 0.6666666666666666
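The three scores follow directly from the TP, FP, and FN counts. The sketch below recomputes them by hand for the same labels; returning `0.0` when a denominator is empty is a convention chosen for this example, not something the source specifies.

```python
import numpy as np

y_true = np.array([0, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted positive, actually positive
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted positive, actually negative
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted negative, actually positive

# Assumed convention: fall back to 0.0 when a denominator would be zero.
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print('Precision', precision)
print('Recall', recall)
print('F1-score:', f1)
```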

In [8]:
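## AUC
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
# AUC is the area under the ROC curve. The ROC curve plots the false positive rate (FPR) on the x-axis
# against the true positive rate (TPR) on the y-axis. When the two are equal (the line y = x),
# the classifier is no better than random guessing and AUC = 0.5.
print('AUC score:', roc_auc_score(y_true, y_scores))

AUC score: 0.75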
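AUC also has a ranking interpretation: it is the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half. The sketch below checks this on the same four samples by comparing every positive/negative pair; it is a brute-force illustration, not how `roc_auc_score` is implemented.

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_scores[y_true == 1]  # scores of the positive samples
neg = y_scores[y_true == 0]  # scores of the negative samples

# Compare every positive score with every negative score via broadcasting.
diff = pos[:, None] - neg[None, :]

# A positive that outranks a negative counts 1, a tie counts 0.5, otherwise 0.
auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))
print('AUC (pairwise):', auc)
```

Here three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), so the result agrees with the 0.75 above.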


### Regression metric examples

In [9]:
import numpy as np
from sklearn import metrics

# MAPE is implemented by hand here (scikit-learn 0.24+ also provides metrics.mean_absolute_percentage_error)
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE: mean squared error
print('MSE:', metrics.mean_squared_error(y_true, y_pred))
# RMSE: root mean squared error
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE: mean absolute error
print('MAE:', metrics.mean_absolute_error(y_true, y_pred))
# MAPE: mean absolute percentage error
print('MAPE:', mape(y_true, y_pred))

MSE: 0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.1461904761904762
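Because `mape` divides by `y_true`, it breaks down when any true value is zero. One possible guard, sketched here under the assumption that zero targets should simply be skipped (other conventions exist), is to mask them out first. `safe_mape` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def safe_mape(y_true, y_pred):
    """MAPE over the entries where y_true is non-zero (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    if not mask.any():
        return float("nan")  # undefined when every target is zero
    return float(np.mean(np.abs((y_pred[mask] - y_true[mask]) / y_true[mask])))

y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

same_as_plain = safe_mape(y_true, y_pred)      # no zeros here, so it matches mape()
with_zero = safe_mape([0.0, 2.0], [1.0, 3.0])  # the zero target is skipped
print(same_as_plain, with_zero)
```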


In [10]:
# Manual implementation
y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])

# MSE: mean squared error
print("MSE: ", np.sum((y_true - y_pred) ** 2)/len(y_true))

# RMSE: root mean squared error
print('RMSE:', np.sqrt(np.sum((y_true - y_pred) ** 2)/len(y_true)))

# MAE: mean absolute error
print('MAE:', np.sum(np.absolute(y_true - y_pred))/len(y_true))

# MAPE: mean absolute percentage error
print('MAPE:', np.sum(np.absolute((y_true - y_pred)/y_true))/len(y_true))



MSE:  0.2871428571428571
RMSE: 0.5358571238146014
MAE: 0.4142857142857143
MAPE: 0.1461904761904762
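For reference, the four quantities computed above correspond to the following definitions, writing $y_i$ for the true values, $\hat{y}_i$ for the predictions, and $n$ for the number of samples (this notation is introduced here, not taken from the notebook):

```latex
\begin{aligned}
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \\
\mathrm{RMSE} &= \sqrt{\mathrm{MSE}} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \\
\mathrm{MAPE} &= \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
\end{aligned}
```

MAPE is undefined whenever some $y_i = 0$.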


In [11]:
## R2 score (coefficient of determination)
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:',r2_score(y_true, y_pred))

R2-score: 0.9486081370449679


In [12]:
# Manual implementation: R2 = 1 - MSE / Var(y_true)
y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

print('R2-score:', 1 - (np.sum((y_true - y_pred) ** 2) / len(y_true)) / np.var(y_true))


R2-score: 0.9486081370449679
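The manual computation agrees with `r2_score` because dividing both the residual and the total sum of squares by $n$ leaves their ratio unchanged, and `np.var` uses the population variance (divisor $n$) by default. With $\bar{y}$ denoting the mean of the true values:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
    = 1 - \frac{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
    = 1 - \frac{\mathrm{MSE}}{\operatorname{Var}(y)}
```

For the four samples above, $\bar{y} = 2.875$, the residual sum of squares is $1.5$, and the total sum of squares is $29.1875$, so $R^2 = 1 - 1.5 / 29.1875 \approx 0.9486$.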
