# Linear Discriminant Analysis

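## Experiment
1. Use linear discriminant analysis (LDA) to solve the spam classification problem and the Dota 2 match-outcome prediction problem.
2. Compute accuracy, precision, recall, and F1 under 10-fold cross-validation.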

## Evaluation metrics
1. Accuracy
2. Precision
3. Recall
4. F1
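
As a reminder of what the four metrics measure, here is a minimal sketch that derives them from the counts of true/false positives and negatives. The function name `binary_metrics` and the toy labels are illustrative assumptions, not part of the assignment; the notebook itself relies on the `sklearn.metrics` functions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)                # share of correct predictions
    precision = tp / (tp + fp) if tp + fp else 0.0   # correct among predicted positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # found among actual positives
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return accuracy, precision, recall, f1


# Toy labels (hypothetical): 3 TP, 2 FP, 1 FN, 2 TN.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
acc, pre, rec, f1 = binary_metrics(y_true, y_pred)
print(acc, pre, rec, f1)
```

Precision penalizes false positives and recall penalizes false negatives, which is why the two can diverge noticeably on the same predictions.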

# 1. Load the data

In [1]:
import numpy as np

In [2]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the model

In [3]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

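# 3. Extract features and labels

Here `spamx` and `dota2x` contain all the features of their respective datasets; `spamy` and `dota2y` hold the labels.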

In [4]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]
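
The two datasets keep the label in different positions: per the slicing above, spambase stores it in the last column (index 57) and dota2 in the first. The sketch below shows both layouts on a tiny made-up array; the names `label_last` and `label_first` are hypothetical.

```python
import numpy as np

# Hypothetical 3-row table: two feature columns followed by a label column
# (the spambase-style layout).
label_last = np.array([[0.1, 0.2, 1.0],
                       [0.3, 0.4, 0.0],
                       [0.5, 0.6, 1.0]])
X_a, y_a = label_last[:, :-1], label_last[:, -1]

# The same rows with the label moved to the first column
# (the dota2-style layout).
label_first = label_last[:, [2, 0, 1]]
X_b, y_b = label_first[:, 1:], label_first[:, 0]

print(X_a.shape, y_a.shape)
```

Both slicings recover the same feature matrix and label vector; only the column index of the label differs.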

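# 4. Training

Train both models on all features, run prediction, and store the predictions.

**Note: on the dota2 dataset, the linear discriminant analysis model emits warnings during training. They do not affect program execution.**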

In [5]:
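# Train and predict
spam_model = LinearDiscriminantAnalysis()
dota2_model = LinearDiscriminantAnalysis()

# spam: out-of-fold predictions from 10-fold cross-validation
spam_prediction = cross_val_predict(spam_model, spamx, spamy, cv=10)

# dota2: out-of-fold predictions from 10-fold cross-validation
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv=10)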

# 5. Computing the evaluation metrics

Compute the four metrics for both models.

In [6]:
# spam
spam_Acc = accuracy_score(spamy, spam_prediction)
spam_Pre = precision_score(spamy, spam_prediction)
spam_Recall = recall_score(spamy, spam_prediction)
spam_F1 = f1_score(spamy, spam_prediction)
print("Spam:\n","Accuracy:",spam_Acc,"\nPrecision:",spam_Pre,"\nRecall:",spam_Recall,"\nF1:",spam_F1)

# dota2
dota2_Acc = accuracy_score(dota2y, dota2_prediction)
dota2_Pre = precision_score(dota2y, dota2_prediction)
dota2_Recall = recall_score(dota2y, dota2_prediction)
dota2_F1 = f1_score(dota2y, dota2_prediction)
print("Dota2:\n","Accuracy:",dota2_Acc,"\nPrecision:",dota2_Pre,"\nRecall:",dota2_Recall,"\nF1:",dota2_F1)




Spam:
 Accuracy: 0.8832862421212779 
Precision: 0.9094993581514762 
Recall: 0.7815774958632101 
F1: 0.8407000889943637
Dota2:
 Accuracy: 0.5986724230976794 
Precision: 0.6066457034626064 
Recall: 0.6762740355048994 
F1: 0.6395703886083188


###### Double-click here to fill in

Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.8832862421212779 | 0.9094993581514762 | 0.7815774958632101 | 0.8407000889943637
dota2Results | 0.5986724230976794 | 0.6066457034626064 | 0.6762740355048994 | 0.6395703886083188
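
The figures above are computed once over the pooled out-of-fold predictions returned by `cross_val_predict`. Averaging a metric fold by fold (as `cross_val_score` does) can give a slightly different number, because ratio metrics such as precision are not linear in the fold sizes. The following sketch illustrates the difference on made-up fold results; the `folds` data and the helper `precision` are hypothetical.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted positives that are truly positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted_pos = sum(p == positive for p in y_pred)
    return tp / predicted_pos if predicted_pos else 0.0


# Hypothetical out-of-fold results for two folds: (true labels, predictions).
folds = [
    ([1, 1, 0, 0], [1, 1, 1, 0]),  # fold precision: 2 of 3 predicted positives
    ([1, 0, 0, 0], [1, 0, 0, 0]),  # fold precision: 1 of 1 predicted positives
]

# Pooled: concatenate all folds, then compute the metric once.
pooled_true = [t for y_true, _ in folds for t in y_true]
pooled_pred = [p for _, y_pred in folds for p in y_pred]
pooled = precision(pooled_true, pooled_pred)

# Per-fold mean: compute the metric in each fold, then average.
per_fold_mean = sum(precision(t, p) for t, p in folds) / len(folds)

print(pooled, per_fold_mean)
```

Either convention is defensible for a report like this one, as long as it is stated which one was used.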

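# 6. Optional: transform, select, or combine features, then train the model and compute the four metrics under 10-fold cross-validation

## Variance threshold filtering with VarianceThreshold

### Model 1.1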

In [7]:
from sklearn.feature_selection import VarianceThreshold

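print("Number of features before filtering:", spamx.shape[1])

sel11 = VarianceThreshold(threshold=0.1)
# Pitfall: fit_transform returns a new array holding only the retained columns
data_11 = sel11.fit_transform(spamx)

print("Number of features after filtering:", data_11.shape[1])

Number of features before filtering: 57
Number of features after filtering: 43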


In [8]:
# spam
spam_prediction11=cross_val_predict(spam_model, data_11, spamy, cv=10)

spam_Acc = accuracy_score(spamy, spam_prediction11)
spam_Pre = precision_score(spamy, spam_prediction11)
spam_Recall = recall_score(spamy, spam_prediction11)
spam_F1 = f1_score(spamy, spam_prediction11)
print("Spam:\n","Accuracy:",spam_Acc,"\nPrecision:",spam_Pre,"\nRecall:",spam_Recall,"\nF1:",spam_F1)

Spam:
 Accuracy: 0.8739404477287546 
Precision: 0.8944337811900192 
Recall: 0.7710976282404853 
F1: 0.8281990521327014


### Model 1.2

In [9]:
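print("Number of features before filtering:", spamx.shape[1])

sel12 = VarianceThreshold(threshold=0.5)
# Pitfall: fit_transform returns a new array holding only the retained columns
data_12 = sel12.fit_transform(spamx)

print("Number of features after filtering:", data_12.shape[1])

Number of features before filtering: 57
Number of features after filtering: 17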


In [10]:
# spam
spam_prediction12=cross_val_predict(spam_model, data_12, spamy, cv=10)

spam_Acc = accuracy_score(spamy, spam_prediction12)
spam_Pre = precision_score(spamy, spam_prediction12)
spam_Recall = recall_score(spamy, spam_prediction12)
spam_F1 = f1_score(spamy, spam_prediction12)
print("Spam:\n","Accuracy:",spam_Acc,"\nPrecision:",spam_Pre,"\nRecall:",spam_Recall,"\nF1:",spam_F1)


Spam:
 Accuracy: 0.8346011736579004 
Precision: 0.8226993865030675 
Recall: 0.7396580253723111 
F1: 0.7789718268951495


### Model 2.1

In [11]:
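print("Number of features before filtering:", dota2x.shape[1])

sel21 = VarianceThreshold(threshold=0.05)
# Pitfall: fit_transform returns a new array holding only the retained columns
data_21 = sel21.fit_transform(dota2x)

print("Number of features after filtering:", data_21.shape[1])

Number of features before filtering: 116
Number of features after filtering: 72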


In [12]:
# dota2
dota2_prediction21=cross_val_predict(dota2_model, data_21, dota2y, cv=10)

dota2_Acc = accuracy_score(dota2y, dota2_prediction21)
dota2_Pre = precision_score(dota2y, dota2_prediction21)
dota2_Recall = recall_score(dota2y, dota2_prediction21)
dota2_F1 = f1_score(dota2y, dota2_prediction21)
print("Dota2:\n","Accuracy:",dota2_Acc,"\nPrecision:",dota2_Pre,"\nRecall:",dota2_Recall,"\nF1:",dota2_F1)

Dota2:
 Accuracy: 0.5906853750674582 
Precision: 0.59839437487541 
Recall: 0.6768890164404904 
F1: 0.6352259938632013


### Model 2.2

In [13]:
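print("Number of features before filtering:", dota2x.shape[1])

sel22 = VarianceThreshold(threshold=0.1)
# Pitfall: fit_transform returns a new array holding only the retained columns
data_22 = sel22.fit_transform(dota2x)

print("Number of features after filtering:", data_22.shape[1])

Number of features before filtering: 116
Number of features after filtering: 39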


In [14]:
# dota2
dota2_prediction22=cross_val_predict(dota2_model, data_22, dota2y, cv=10)

dota2_Acc = accuracy_score(dota2y, dota2_prediction22)
dota2_Pre = precision_score(dota2y, dota2_prediction22)
dota2_Recall = recall_score(dota2y, dota2_prediction22)
dota2_F1 = f1_score(dota2y, dota2_prediction22)
print("Dota2:\n","Accuracy:",dota2_Acc,"\nPrecision:",dota2_Pre,"\nRecall:",dota2_Recall,"\nF1:",dota2_F1)

Dota2:
 Accuracy: 0.566173772261198 
Precision: 0.5733716082291026 
Recall: 0.6878766758230495 
F1: 0.6254263508098336


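###### Double-click here to fill in
1. Model 1.1 pipeline: drop the features of `spamx` whose variance is below 0.1 (57 → 43 features).
2. Model 1.2 pipeline: drop the features of `spamx` whose variance is below 0.5 (57 → 17 features).
3. Model 2.1 pipeline: drop the features of `dota2x` whose variance is below 0.05 (116 → 72 features).
4. Model 2.2 pipeline: drop the features of `dota2x` whose variance is below 0.1 (116 → 39 features).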
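
The filtering step in these four models can be sketched in plain NumPy as follows. This is an approximation of what `VarianceThreshold` does, under the assumption that it keeps the columns whose population variance (`ddof=0`) is strictly greater than the threshold; the function name `variance_filter` and the toy matrix are hypothetical.

```python
import numpy as np


def variance_filter(X, threshold):
    """Keep the columns of X whose population variance exceeds threshold."""
    variances = X.var(axis=0)      # one variance per column, ddof=0
    keep = variances > threshold   # boolean mask over columns
    return X[:, keep], keep


# Toy matrix: a constant column, a nearly constant column, and a spread-out one.
X = np.array([
    [0.0, 1.0, 10.0],
    [0.0, 1.1, 20.0],
    [0.0, 0.9, 30.0],
    [0.0, 1.0, 40.0],
])
X_kept, mask = variance_filter(X, threshold=0.1)
print(mask, X_kept.shape)
```

Because variance depends on a feature's scale, a fixed threshold can discard small-scale features that still carry signal, which is one plausible reason the metrics drop as the threshold rises.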

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1.1 | spamx | 0.8739404477287546 | 0.8944337811900192 | 0.7710976282404853 | 0.8281990521327014
Model 1.2 | spamx | 0.8346011736579004 | 0.8226993865030675 | 0.7396580253723111 | 0.7789718268951495
Model 2.1 | dota2x | 0.5906853750674582 | 0.59839437487541 | 0.6768890164404904 | 0.6352259938632013
Model 2.2 | dota2x | 0.566173772261198 | 0.5733716082291026 | 0.6878766758230495 | 0.6254263508098336
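
In the cells above, the selector is fitted on the full dataset before cross-validation. One alternative, shown here only as a sketch, is to wrap the selector and the classifier in a scikit-learn pipeline so that the filter is refitted on the training part of each fold. The data below is synthetic stand-in data from `make_classification`, not spambase or dota2, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import VarianceThreshold
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_redundant=0, random_state=0)

# Append a nearly constant column so the filter has something to drop.
rng = np.random.default_rng(0)
X = np.hstack([X, 1.0 + 1e-3 * rng.standard_normal((200, 1))])

# The selector is fitted inside each training fold, then LDA is trained
# on the retained columns only.
pipe = make_pipeline(VarianceThreshold(threshold=0.1),
                     LinearDiscriminantAnalysis())
pred = cross_val_predict(pipe, X, y, cv=10)
acc = accuracy_score(y, pred)
print("Accuracy:", acc)
```

Since `VarianceThreshold` never looks at the labels, fitting it outside the folds leaks little information, so the numbers in the table are unlikely to change much. The pipeline form matters more for supervised selectors.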