# Linear Discriminant Analysis

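## Experiment Tasks
1. Use linear discriminant analysis (LDA) to solve the spam email classification problem and the Dota 2 match outcome prediction problem.
2. Compute accuracy, precision, recall, and F1 under ten-fold cross-validation.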

## Evaluation Metrics
1. Accuracy
2. Precision
3. Recall
4. F1
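
As a reminder of how the four metrics relate to the confusion matrix, here is a minimal sketch in plain Python. The function name `binary_metrics` and the counts are hypothetical, chosen only for illustration; the notebook itself relies on the scikit-learn metric functions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the predicted positives, how many are correct
    recall = tp / (tp + fn)     # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return accuracy, precision, recall, f1


# Hypothetical counts, not taken from either dataset.
acc, prec, rec, f1 = binary_metrics(tp=8, fp=2, fn=4, tn=6)
print(acc, prec, rec, f1)
```

With these counts, accuracy is 14/20, precision is 8/10, recall is 8/12, and F1 is 8/11.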

# 1. Load the Data

In [1]:
import numpy as np

In [2]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the Model

In [3]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

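# 3. Extract the Data

Here `spamx` and `dota2x` contain all of the features in their respective datasets. The label is the last column (index 57) of the spambase data and the first column (index 0) of the Dota 2 data.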

In [4]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]

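# 4. Training

Train and predict with both models using all of the features, and store the predictions.

**Note: on the dota2 dataset, the linear discriminant analysis model emits warnings during training; they do not affect program execution.**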
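
If the warnings clutter the output, one option is to record or filter them with Python's standard `warnings` module. The sketch below assumes the warning is raised through that module; `noisy_fit` is a hypothetical stand-in for the model fit, and its message is made up.

```python
import warnings


def noisy_fit():
    # Hypothetical stand-in for a fit call that emits a warning.
    warnings.warn("example warning raised during fitting", UserWarning)
    return "fitted"


# Record warnings instead of printing them, so they can be inspected later.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure no warning is filtered out
    result = noisy_fit()

print(result, len(caught), caught[0].category.__name__)
```

Replacing `simplefilter("always")` with `simplefilter("ignore")` inside the same `with` block would silence the warnings for that block only.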

In [7]:
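model = LinearDiscriminantAnalysis()

# Ten-fold cross-validated predictions on spambase
y_pred = cross_val_predict(model, spamx, spamy, cv=10)

# Ten-fold cross-validated predictions on dota2; cross_val_predict fits
# clones of `model`, so the same estimator object can be reused here
y_pred2 = cross_val_predict(model, dota2x, dota2y, cv=10)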
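
To make the role of `cv=10` more concrete, the sketch below mimics how out-of-fold predictions are assembled: every sample is predicted exactly once, by a model fitted on the other folds. It is a simplified illustration in plain Python, not scikit-learn's implementation. `kfold_indices`, `out_of_fold_predict`, and `dummy` are hypothetical helpers, and the folds here are contiguous and unstratified, whereas scikit-learn stratifies the folds for classifiers when `cv` is an integer.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_predict(fit_predict, n_samples, k):
    """Collect one prediction per sample, each from a model that never saw it."""
    predictions = [None] * n_samples
    for test_idx in kfold_indices(n_samples, k):
        held_out = set(test_idx)
        train_idx = [i for i in range(n_samples) if i not in held_out]
        for i, p in zip(test_idx, fit_predict(train_idx, test_idx)):
            predictions[i] = p
    return predictions


def dummy(train_idx, test_idx):
    # Hypothetical stand-in for "fit on train, predict on test": it echoes
    # the training-set size so the bookkeeping is visible.
    return [len(train_idx)] * len(test_idx)


folds = kfold_indices(23, 10)
oof = out_of_fold_predict(dummy, 23, 10)
print([len(f) for f in folds])
print(oof)
```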






# 5. Computing the Evaluation Metrics

Compute the four metrics for both models.

In [8]:
# Each print below lists the values in the order: accuracy, recall, precision, F1

acc = accuracy_score(y_true=spamy, y_pred=y_pred)
recall = recall_score(y_true=spamy, y_pred=y_pred)
precision = precision_score(y_true=spamy, y_pred=y_pred)
f1 = f1_score(y_true=spamy, y_pred=y_pred)
print(acc, recall, precision, f1)


acc2 = accuracy_score(y_true=dota2y, y_pred=y_pred2)
recall2 = recall_score(y_true=dota2y, y_pred=y_pred2)
precision2 = precision_score(y_true=dota2y, y_pred=y_pred2)
f12 = f1_score(y_true=dota2y, y_pred=y_pred2)
print(acc2, recall2, precision2, f12)




0.8830688980656379 0.7815774958632101 0.90891597177678 0.8404507710557533
0.59876956287102 0.6764585297855766 0.6067107924250782 0.6396890629240491


###### Double-click here to fill in

Dataset | Accuracy | Precision | Recall | F1
-|-|-|-|-
spambase | 0.8830688980656379 | 0.90891597177678 | 0.7815774958632101 | 0.8404507710557533
dota2Results | 0.59876956287102 | 0.6067107924250782 | 0.6764585297855766 | 0.6396890629240491
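
As a quick sanity check on the reported numbers, F1 should equal the harmonic mean of precision and recall. The sketch below recomputes it from the values printed by the cell above, which appear in the order accuracy, recall, precision, F1. The dictionary name `reported` is just an illustrative choice.

```python
# (recall, precision, F1) copied from the printed output of the metrics cell.
reported = {
    "spambase": (0.7815774958632101, 0.90891597177678, 0.8404507710557533),
    "dota2Results": (0.6764585297855766, 0.6067107924250782, 0.6396890629240491),
}

for name, (recall, precision, f1_reported) in reported.items():
    f1_recomputed = 2 * precision * recall / (precision + recall)
    # Allow a small tolerance for floating-point rounding.
    print(name, f1_recomputed, abs(f1_recomputed - f1_reported) < 1e-6)
```

Because the formula is symmetric in precision and recall, this check confirms the F1 values but cannot tell which of the two numbers is which; that has to come from the order of the `print` arguments.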

# 6. Optional: transform, select, or combine features, then train the model and compute the four metrics under ten-fold cross-validation

In [11]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

def min_feature(x):
    # Rescale each feature (column) to the range [0, 1]
    return MinMaxScaler().fit_transform(x)


def nor_feature(x):
    # Rescale each sample (row) to unit L2 norm
    return Normalizer().fit_transform(x)

def standard_feature(x):
    # Center each feature to zero mean and scale it to unit variance
    return StandardScaler().fit_transform(x)

# Model 1: min-max scaling (note that spamx is overwritten)
spamx=min_feature(spamx)
y_pred = cross_val_predict(model, spamx, spamy, cv=10)
acc = accuracy_score(y_true=spamy, y_pred=y_pred)
recall = recall_score(y_true=spamy, y_pred=y_pred)
precision = precision_score(y_true=spamy, y_pred=y_pred)
f1 = f1_score(y_true=spamy, y_pred=y_pred)
print(acc, recall, precision, f1)

# Model 2: per-sample L2 normalization, applied on top of the min-max-scaled spamx
spamx=nor_feature(spamx)
y_pred = cross_val_predict(model, spamx, spamy, cv=10)
acc = accuracy_score(y_true=spamy, y_pred=y_pred)
recall = recall_score(y_true=spamy, y_pred=y_pred)
precision = precision_score(y_true=spamy, y_pred=y_pred)
f1 = f1_score(y_true=spamy, y_pred=y_pred)
print(acc, recall, precision, f1)

# Model 3: standardization, then L2 normalization, again on top of the previous spamx
spamx=standard_feature(spamx)
spamx=nor_feature(spamx)

y_pred = cross_val_predict(model, spamx, spamy, cv=10)
acc = accuracy_score(y_true=spamy, y_pred=y_pred)
recall = recall_score(y_true=spamy, y_pred=y_pred)
precision = precision_score(y_true=spamy, y_pred=y_pred)
f1 = f1_score(y_true=spamy, y_pred=y_pred)
print(acc, recall, precision, f1)


0.9128450336883286 0.8742415885273028 0.9015927189988624 0.8877065247829741
0.9130623777439687 0.8747931605074463 0.9016486640136441 0.8880179171332587
0.9184959791349706 0.8797573083287369 0.9103881278538812 0.8948106591865358


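###### Double-click here to fill in
1. Model 1 pipeline: min-max scaling of each feature to [0, 1] (`MinMaxScaler`).

2. Model 2 pipeline: the Model 1 features, followed by per-sample L2 normalization (`Normalizer`).

3. Model 3 pipeline: the Model 2 features, followed by standardization to zero mean and unit variance (`StandardScaler`) and a second per-sample L2 normalization.

Model 3, which adds standardization and normalization, gives the best result on all four metrics.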
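
For reference, the three scikit-learn transformers used above act along different axes: `MinMaxScaler` and `StandardScaler` work per feature (column), while `Normalizer` works per sample (row). The plain-Python sketch below reproduces their default behaviour on a tiny, hypothetical matrix. The helper names are illustrative, and constant columns or all-zero rows (which would divide by zero here) are not handled.

```python
import math

# Hypothetical toy matrix: 3 samples (rows), 2 features (columns).
X = [[1.0, 10.0],
     [3.0, 30.0],
     [5.0, 20.0]]


def columns(rows):
    return [list(c) for c in zip(*rows)]


def min_max(rows):
    """Column-wise rescaling to [0, 1]."""
    cols = columns(rows)
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(r, lo, hi)] for r in rows]


def standardize(rows):
    """Column-wise zero mean and unit (population) standard deviation."""
    cols = columns(rows)
    mean = [sum(c) / len(c) for c in cols]
    std = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, mean)]
    return [[(v - m) / s for v, m, s in zip(r, mean, std)] for r in rows]


def l2_normalize(rows):
    """Row-wise rescaling so that each sample has unit Euclidean length."""
    return [[v / math.sqrt(sum(u * u for u in r)) for v in r] for r in rows]


print(min_max(X))
print(standardize(X))
print(l2_normalize(X))
```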

Model | Dataset | Accuracy | Precision | Recall | F1
-|-|-|-|-|-
Model 1 | spambase | 0.9128450336883286 | 0.9015927189988624 | 0.8742415885273028 | 0.8877065247829741
Model 2 | spambase | 0.9130623777439687 | 0.9016486640136441 | 0.8747931605074463 | 0.8880179171332587
Model 3 | spambase | 0.9184959791349706 | 0.9103881278538812 | 0.8797573083287369 | 0.8948106591865358
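
One caveat with the approach above is that `MinMaxScaler` and `StandardScaler` are fitted on the full dataset before cross-validation, so the held-out folds contribute to the scaling statistics. A possible alternative, sketched below, is to chain the transformer and the classifier in one pipeline so that the scaler is refit inside each fold. The data here are synthetic (an assumption, so that the sketch runs without the dataset files), and the numbers reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class data standing in for spamx / spamy.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
               rng.normal(1.5, 1.0, size=(100, 5))])
y = np.array([0] * 100 + [1] * 100)

# The scaler is refit on the training part of every fold, so the held-out
# fold never influences the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LinearDiscriminantAnalysis())
y_pred = cross_val_predict(pipeline, X, y, cv=10)
print(y_pred.shape)
```

The same pattern should work with `MinMaxScaler` or `Normalizer` in place of `StandardScaler`, or with several transformers chained in sequence.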
