## News Text Classification with fastText

### Preparation
1. Install [fastText](https://github.com/facebookresearch/fastText)

2. News dataset: [THUCNews, the Tsinghua text classification dataset, 1.5 GB](https://thunlp.oss-cn-qingdao.aliyuncs.com/THUCNews.zip)

Rename the dataset's categories to the following names so the script can read them: ['affairs','constellation','economic','edu','ent','fashion','game','home','house','lottery','science','sports','society','stock']

3. Preprocess the data:
```
python3 processing.py
```

Preprocessed data: [download](https://pan.baidu.com/s/1JXUIqhIu09AaIxDwa-eQKw) (extraction code: xmye)
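The preparation steps above can be sketched in Python. `processing.py` is not reproduced here, so this is only a guess at its shape: it assumes the unzipped archive has one folder per category under its original Chinese name, that the Chinese-to-English mapping below is the intended one, and that each output line is `<space-separated tokens>\t__label__<category>` (the layout the evaluation code in the Testing section reads). `CATEGORY_MAP`, `rename_categories`, and `to_fasttext_line` are hypothetical names, and word segmentation is left out.

```python
import os

# Assumed mapping from the original THUCNews folder names to the English
# names listed above; verify it against your unzipped copy.
CATEGORY_MAP = {
    "时政": "affairs", "星座": "constellation", "财经": "economic",
    "教育": "edu", "娱乐": "ent", "时尚": "fashion", "游戏": "game",
    "家居": "home", "房产": "house", "彩票": "lottery", "科技": "science",
    "体育": "sports", "社会": "society", "股票": "stock",
}

def rename_categories(root, mapping=CATEGORY_MAP):
    """Rename category folders under `root`; return the new names that were applied."""
    renamed = []
    for old, new in mapping.items():
        src = os.path.join(root, old)
        dst = os.path.join(root, new)
        # Skip folders that are missing or whose target name already exists.
        if os.path.isdir(src) and not os.path.exists(dst):
            os.rename(src, dst)
            renamed.append(new)
    return renamed

def to_fasttext_line(tokens, category, label_prefix="__label__"):
    """Build one example line: tokens joined by spaces, a tab, then the label."""
    # Trim surrounding whitespace from each token and drop tokens that end up empty.
    clean = [t.strip() for t in tokens if t.strip()]
    return " ".join(clean) + "\t" + label_prefix + category

print(to_fasttext_line(["中国队", "赢得", "比赛"], "sports"))
```

Calling `rename_categories("THUCNews")` once after unzipping is enough; running it again is harmless because folders that have already been renamed are skipped.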

### Training

In [1]:
import fasttext
import time

In [2]:
start = time.time()

classifier = fasttext.train_supervised('news_fasttext_train.txt', label_prefix = '__label__')

print('Training finished, elapsed time: %.2f s' % (time.time() - start))
classifier.save_model("model_file.bin")


Training finished, elapsed time: 16.13 s


### Testing

In [3]:
result = classifier.test('news_fasttext_test.txt')
print(result)

(58368, 0.9271861293859649, 0.9271861293859649)
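
In the fastText Python binding, `test` returns a tuple of (number of examples, precision at 1, recall at 1). When every example carries exactly one label, precision and recall at 1 coincide and equal plain accuracy, which is why the two values above are identical. A small sketch that unpacks the printed tuple; `summarize` is a hypothetical helper, not part of fastText:

```python
def summarize(result):
    """Unpack fastText's (N, precision@1, recall@1) tuple and derive F1 and the hit count."""
    n, precision, recall = result
    f1 = 2 * precision * recall / (precision + recall)
    # Only meaningful when every example has exactly one label.
    correct = round(n * precision)
    return {"examples": n, "precision": precision, "recall": recall,
            "f1": f1, "correct": correct}

summary = summarize((58368, 0.9271861293859649, 0.9271861293859649))
print(summary["correct"], "of", summary["examples"], "test examples classified correctly")
```

For the result above this reports 54118 of 58368, which agrees with the sum of the per-class correct counts printed by the next cell.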


In [11]:
labels_right = []
texts = []
# Each line is read as "<text>\t__label__<category>"
with open("news_fasttext_test.txt", encoding="utf-8") as fr:
    for line in fr:
        text, label = line.rstrip().split("\t")[:2]
        labels_right.append(label.replace("__label__", ""))
        texts.append(text)
# predict() returns (labels, probabilities); labels holds one list per input text
labels_predict = [term[0] for term in classifier.predict(texts)[0]]

text_labels = list(set(labels_right))
text_predict_labels = list(set(labels_predict))
print(text_predict_labels)
print(text_labels)
print()

A = dict.fromkeys(text_labels, 0)          # correct predictions per class
B = dict.fromkeys(text_labels, 0)          # test examples per class
C = dict.fromkeys(text_predict_labels, 0)  # predictions per class (keys keep the __label__ prefix)
for i in range(len(labels_right)):
    B[labels_right[i]] += 1
    C[labels_predict[i]] += 1
    if labels_right[i] == labels_predict[i].replace('__label__', ''):
        A[labels_right[i]] += 1

print('Correct predictions per class:', A)
print()
print('Test examples per class:', B)
print()
print('Predictions per class:', C)
print()
# Compute precision, recall, and F1 for each class
for key in B:
    try:
        r = float(A[key]) / float(B[key])
        p = float(A[key]) / float(C['__label__' + key])
        f = p * r * 2 / (p + r)
        print("%s:\t p:%f\t r:%f\t f:%f" % (key, p, r, f))
    except (KeyError, ZeroDivisionError):
        # The class was never predicted, or it has no correct predictions
        print("error:", key, "right:", A.get(key, 0), "real:", B.get(key, 0), "predict:", C.get('__label__' + key, 0))

['__label__home', '__label__game', '__label__economic', '__label__lottery', '__label__sports', '__label__science', '__label__stock', '__label__constellation', '__label__society', '__label__fashion', '__label__edu', '__label__ent', '__label__house', '__label__affairs']
['economic', 'society', 'affairs', 'fashion', 'sports', 'home', 'ent', 'science', 'game', 'edu', 'stock', 'house']

Correct predictions per class: {'economic': 4589, 'society': 4538, 'affairs': 4544, 'fashion': 3224, 'sports': 4821, 'home': 4681, 'ent': 4741, 'science': 4403, 'game': 4658, 'edu': 4691, 'stock': 4553, 'house': 4675}

Test examples per class: {'economic': 5000, 'society': 5000, 'affairs': 5000, 'fashion': 3368, 'sports': 5000, 'home': 5000, 'ent': 5000, 'science': 5000, 'game': 5000, 'edu': 5000, 'stock': 5000, 'house': 5000}

Predictions per class: {'__label__home': 4984, '__label__game': 4813, '__label__economic': 4934, '__label__lottery': 73, '__label__sports': 4915, '__label__science': 4996, '__label__stock': 5052, '__label__constellation': 

#### fastText's `result` contains only the overall test metrics; [per-class statistics](https://github.com/NLP-LOVE/ML-NLP/blob/master/NLP/16.2%20fastText/fastText.ipynb) can also be computed
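
Per-class statistics reduce to three counts per label: correct predictions, true occurrences, and predicted occurrences. A compact sketch with `collections.Counter`, run here on made-up labels rather than the news data; `per_class_metrics` is a hypothetical helper, not part of fastText:

```python
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen in either list."""
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    actual = Counter(y_true)
    predicted = Counter(y_pred)
    metrics = {}
    for label in sorted(set(actual) | set(predicted)):
        # A label that is never predicted (or never occurs) gets 0.0 instead of an error.
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / actual[label] if actual[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        metrics[label] = (p, r, f)
    return metrics

# Toy labels for illustration only.
y_true = ["sports", "sports", "game", "game", "game"]
y_pred = ["sports", "game", "game", "game", "sports"]
for label, (p, r, f) in per_class_metrics(y_true, y_pred).items():
    print("%s:\t p:%f\t r:%f\t f:%f" % (label, p, r, f))
```

When feeding it fastText output, strip the `__label__` prefix from the predicted labels first so that both lists use the same names.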


### Ref
* [ML-NLP-16.2 fastText](https://github.com/NLP-LOVE/ML-NLP/tree/master/NLP/16.2%20fastText)