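# Logistic Regression

## Experiment tasks
1. Use logistic regression to solve the spam classification problem and the Dota 2 match outcome prediction problem.
2. Compute the accuracy, precision, recall, and F1 score under 10-fold cross-validation.

## Evaluation metrics
1. Accuracy
2. Precision
3. Recall
4. F1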
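
The four metrics listed above can all be derived from the counts of true/false positives and negatives. The following is a minimal sketch in plain Python; the helper name `binary_metrics` and the toy labels are assumptions made for illustration and are not part of the assignment.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, F1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

With the toy labels, all four metrics come out to 0.75, since precision and recall are both 3/4.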

# 1. Load the data

In [160]:
import numpy as np

In [161]:
spambase = np.loadtxt('data/spambase/spambase.data', delimiter = ",")
dota2results = np.loadtxt('data/dota2Dataset/dota2Train.csv', delimiter=',')

# 2. Import the model

In [162]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

# 3. Extract features and labels

Here `spamx` and `dota2x` contain all features of their respective datasets.

In [163]:
spamx = spambase[:, :57]
spamy = spambase[:, 57]

dota2x = dota2results[:, 1:]
dota2y = dota2results[:, 0]
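
The two datasets place the label in different columns: the slicing above takes the last column of the spam table and the first column of the Dota 2 table as the label. The sketch below reproduces both layouts on small made-up arrays (the values are hypothetical) and checks the label encoding. It assumes that the Dota 2 labels are encoded as -1/1; in that case the scikit-learn metrics still treat 1 as the positive class by default.

```python
import numpy as np

# Toy table in the spam layout: feature columns first, label in the last column.
label_last = np.array([[0.5, 0.0, 1.0],
                       [0.0, 2.0, 0.0],
                       [1.5, 1.0, 1.0]])
x_a, y_a = label_last[:, :-1], label_last[:, -1]

# Toy table in the Dota 2 layout: label in the first column, features after it.
label_first = np.array([[1.0, 223.0, 2.0],
                        [-1.0, 152.0, 2.0],
                        [1.0, 131.0, 9.0]])
x_b, y_b = label_first[:, 1:], label_first[:, 0]

# Inspect the shapes and the distinct label values before training.
print(x_a.shape, np.unique(y_a))
print(x_b.shape, np.unique(y_b))
```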


# 4. Train and predict

Train both models on the full feature set, run the predictions, and store the results.

In [164]:
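# YOUR CODE HERE
# Default settings are used (max_iter=100); the solver reports that it hit this limit.
spam_model = LogisticRegression()
dota2_model = LogisticRegression()

# spam: out-of-fold predictions from 10-fold cross-validation
spam_prediction = cross_val_predict(spam_model, spamx, spamy, cv=10)

# dota2: out-of-fold predictions from 10-fold cross-validation
dota2_prediction = cross_val_predict(dota2_model, dota2x, dota2y, cv=10)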

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
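
The warning names two remedies: raising `max_iter` or scaling the data. The sketch below combines both in a pipeline, using synthetic data rather than the notebook's datasets. The names `scaled_model`, `X`, and `y` are assumptions for illustration. The metrics reported in this notebook come from the unscaled default model, so numbers obtained this way would likely differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, which tends to slow down the solver.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 100.0, 1000.0, 10000.0])
y = (X[:, 0] + X[:, 1] / 10.0 + rng.normal(size=300) > 0).astype(int)

# The scaler is fitted inside each training fold, so no test-fold statistics leak in.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pred = cross_val_predict(scaled_model, X, y, cv=10)
print(pred.shape)
```

Putting the scaler in the pipeline, rather than scaling the whole array up front, keeps the preprocessing inside the cross-validation loop.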

# 5. Compute the evaluation metrics

Compute the four metrics for both models.

In [165]:
# spam
spam_Acc = accuracy_score(spamy, spam_prediction)
spam_Pre = precision_score(spamy, spam_prediction)
spam_Recall = recall_score(spamy, spam_prediction)
spam_F1 = f1_score(spamy, spam_prediction)
print("Spam:\n","Accuracy:",spam_Acc,"\nPrecision:",spam_Pre,"\nRecall:",spam_Recall,"\nF1:",spam_F1)

# dota2
dota2_Acc = accuracy_score(dota2y, dota2_prediction)
dota2_Pre = precision_score(dota2y, dota2_prediction)
dota2_Recall = recall_score(dota2y, dota2_prediction)
dota2_F1 = f1_score(dota2y, dota2_prediction)
print("Dota2:\n","Accuracy:",dota2_Acc,"\nPrecision:",dota2_Pre,"\nRecall:",dota2_Recall,"\nF1:",dota2_F1)

Spam:
 Accuracy: 0.9076287763529668 
Precision: 0.8747300215982722 
Recall: 0.8935466078323221 
F1: 0.8840381991814462
Dota2:
 Accuracy: 0.5993631948192121 
Precision: 0.6071120254210826 
Recall: 0.6775654954696404 
F1: 0.6404068781787358
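
The numbers above are computed once over the pooled out-of-fold predictions. An alternative is to score each fold separately and average, which can give slightly different values and also shows the spread across folds. The sketch below uses `cross_validate` for this on synthetic data; the dataset and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification problem standing in for the real datasets.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

metric_names = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=10, scoring=metric_names)

# Each entry holds one score per fold; report the mean and standard deviation.
for name in metric_names:
    fold_scores = scores[f"test_{name}"]
    print(name, fold_scores.mean(), fold_scores.std())
```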


###### Results

Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-
spambase | 0.9076287763529668 | 0.8747300215982722 | 0.8935466078323221 | 0.8840381991814462
dota2Results | 0.5993631948192121 | 0.6071120254210826 | 0.6775654954696404 | 0.6404068781787358

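# 6. Optional: transform, select, or combine features, then train the model and compute the four metrics under 10-fold cross-validation

## Number of original features in the two datasets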

In [166]:
print(spamx.shape[1])  # 57 features
print(dota2x.shape[1])  # 116 features

57
116


## Variance threshold filtering (VarianceThreshold)

### Model 1.1

In [194]:
from sklearn.feature_selection import VarianceThreshold

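print("Features before filtering:", spamx.shape[1])

sel11 = VarianceThreshold(threshold=0.1)
# Caveat: fit_transform returns a new NumPy array holding only the retained columns
data_11 = sel11.fit_transform(spamx)

print("Features after filtering:", data_11.shape[1])

Features before filtering: 57
Features after filtering: 43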
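
To see which columns a given threshold removes, the fitted selector exposes the per-feature variances and a boolean mask of retained columns. The sketch below shows this on a small made-up array; the data and the name `selector` are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy data: a constant column, a low-variance column, and a high-variance column.
X = np.array([[0.0, 1.0, 10.0],
              [0.0, 1.2, 20.0],
              [0.0, 0.8, 30.0],
              [0.0, 1.0, 40.0]])

selector = VarianceThreshold(threshold=0.1)
X_kept = selector.fit_transform(X)

print(selector.variances_)     # per-column variance computed during fit
print(selector.get_support())  # True for columns that pass the threshold
print(X_kept.shape)
```

Only the third column has a variance above 0.1 here, so it is the only one retained.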


In [195]:
# spam
spam_prediction11=cross_val_predict(spam_model, data_11, spamy, cv=10)

spam_Acc = accuracy_score(spamy, spam_prediction11)
spam_Pre = precision_score(spamy, spam_prediction11)
spam_Recall = recall_score(spamy, spam_prediction11)
spam_F1 = f1_score(spamy, spam_prediction11)
print("Spam:\n","Accuracy:",spam_Acc,"\nPrecision:",spam_Pre,"\nRecall:",spam_Recall,"\nF1:",spam_F1)


Spam:
 Accuracy: 0.9021951749619648 
Precision: 0.8717948717948718 
Recall: 0.8814120242691671 
F1: 0.8765770707624794



### Model 1.2

In [197]:
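print("Features before filtering:", spamx.shape[1])

sel12 = VarianceThreshold(threshold=0.5)
# Caveat: fit_transform returns a new NumPy array holding only the retained columns
data_12 = sel12.fit_transform(spamx)

print("Features after filtering:", data_12.shape[1])

Features before filtering: 57
Features after filtering: 17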


In [198]:
# spam
spam_prediction12=cross_val_predict(spam_model, data_12, spamy, cv=10)

spam_Acc = accuracy_score(spamy, spam_prediction12)
spam_Pre = precision_score(spamy, spam_prediction12)
spam_Recall = recall_score(spamy, spam_prediction12)
spam_F1 = f1_score(spamy, spam_prediction12)
print("Spam:\n","Accuracy:",spam_Acc,"\nPrecision:",spam_Pre,"\nRecall:",spam_Recall,"\nF1:",spam_F1)


Spam:
 Accuracy: 0.8728537274505542 
Precision: 0.8355191256830601 
Recall: 0.8433535576392719 
F1: 0.8394180620367828



### Model 2.1

In [222]:
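print("Features before filtering:", dota2x.shape[1])

sel21 = VarianceThreshold(threshold=0.05)
# Caveat: fit_transform returns a new NumPy array holding only the retained columns
data_21 = sel21.fit_transform(dota2x)

print("Features after filtering:", data_21.shape[1])

Features before filtering: 116
Features after filtering: 72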


In [223]:
# dota2
dota2_prediction21=cross_val_predict(dota2_model, data_21, dota2y, cv=10)

dota2_Acc = accuracy_score(dota2y, dota2_prediction21)
dota2_Pre = precision_score(dota2y, dota2_prediction21)
dota2_Recall = recall_score(dota2y, dota2_prediction21)
dota2_F1 = f1_score(dota2y, dota2_prediction21)
print("Dota2:\n","Accuracy:",dota2_Acc,"\nPrecision:",dota2_Pre,"\nRecall:",dota2_Recall,"\nF1:",dota2_F1)

Dota2:
 Accuracy: 0.5892174851592012 
Precision: 0.5971620666533771 
Recall: 0.6754950596531507 
F1: 0.6339178361532468


### Model 2.2

In [230]:
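print("Features before filtering:", dota2x.shape[1])

sel22 = VarianceThreshold(threshold=0.1)
# Caveat: fit_transform returns a new NumPy array holding only the retained columns
data_22 = sel22.fit_transform(dota2x)

print("Features after filtering:", data_22.shape[1])

Features before filtering: 116
Features after filtering: 39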


In [231]:
# dota2
dota2_prediction22=cross_val_predict(dota2_model, data_22, dota2y, cv=10)

dota2_Acc = accuracy_score(dota2y, dota2_prediction22)
dota2_Pre = precision_score(dota2y, dota2_prediction22)
dota2_Recall = recall_score(dota2y, dota2_prediction22)
dota2_F1 = f1_score(dota2y, dota2_prediction22)
print("Dota2:\n","Accuracy:",dota2_Acc,"\nPrecision:",dota2_Pre,"\nRecall:",dota2_Recall,"\nF1:",dota2_F1)

Dota2:
 Accuracy: 0.5662277388019428 
Precision: 0.5736017130620985 
Recall: 0.6864007215776311 
F1: 0.6249521729798334


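###### Results
1. Model 1.1 processing: drop the features of spamx whose variance is below 0.1 (43 of 57 features remain)
2. Model 1.2 processing: drop the features of spamx whose variance is below 0.5 (17 of 57 features remain)
3. Model 2.1 processing: drop the features of dota2x whose variance is below 0.05 (72 of 116 features remain)
4. Model 2.2 processing: drop the features of dota2x whose variance is below 0.1 (39 of 116 features remain)

Model|Dataset|Accuracy|Precision|Recall|F1
-|-|-|-|-|-
Model 1.1 | spamx | 0.9021951749619648 | 0.8717948717948718 | 0.8814120242691671 | 0.8765770707624794
Model 1.2 | spamx | 0.8728537274505542 | 0.8355191256830601 | 0.8433535576392719 | 0.8394180620367828
Model 2.1 | dota2x | 0.5892174851592012 | 0.5971620666533771 | 0.6754950596531507 | 0.6339178361532468
Model 2.2 | dota2x | 0.5662277388019428 | 0.5736017130620985 | 0.6864007215776311 | 0.6249521729798334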
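
In the four models above, the selector is fitted on the full dataset before cross-validation. A possible variant, sketched below, places `VarianceThreshold` and the classifier in one pipeline so that the selection is refitted on each training fold. The sketch runs on synthetic data; the names `pipeline`, `X`, and `y` and the appended constant column are assumptions for illustration, and results on the real datasets are not claimed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic data plus one constant column that the selector should drop.
X, y = make_classification(n_samples=400, n_features=12, random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 1))])

# Selection and classification are fitted together on each training fold.
pipeline = make_pipeline(VarianceThreshold(threshold=0.1),
                         LogisticRegression(max_iter=1000))
pred = cross_val_predict(pipeline, X, y, cv=10)

acc = accuracy_score(y, pred)
f1 = f1_score(y, pred)
print(acc, f1)
```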