# fastText for Chinese Text Classification

The fastText paper describes several tricks:

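- Hierarchical softmax
    - When the number of classes is large, the softmax layer is sped up by building a Huffman coding tree. This is the same trick used earlier in word2vec.
- N-gram features
    - Using only unigrams discards word-order information, so N-gram features are added to compensate. The hashing trick keeps the memory needed to store the N-grams bounded.
- Subword
    - Rare or out-of-vocabulary words are represented by the sum of their subword vectors. For example, the word "coresponse" would be represented as the sum of the vectors for "co" and "response". (In fastText itself, subwords are character n-grams whose lengths range from `minn` to `maxn`.)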
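
To make the last two tricks more concrete, here is a minimal sketch of how character n-grams and hashed word bigrams could be generated. The helper names `char_ngrams` and `hashed_word_bigrams` are hypothetical, and `zlib.crc32` is only a stand-in for fastText's own hash function, so the bucket ids will not match what fastText computes internally.

```python
import zlib

def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, padded with '<' and '>' boundary markers."""
    padded = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams

def hashed_word_bigrams(tokens, bucket=2000000):
    """Map each adjacent token pair to a bucket id (illustrative hash, not fastText's)."""
    ids = []
    for left, right in zip(tokens, tokens[1:]):
        key = (left + " " + right).encode("utf-8")
        ids.append(zlib.crc32(key) % bucket)
    return ids

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
print(len(hashed_word_bigrams(["中国", "制造", "汽车"])))
# 2
```

Because every bigram is mapped into a fixed number of buckets, the size of the n-gram table does not grow with the number of distinct n-grams in the corpus; the price is that different n-grams may collide and share a vector.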

fastText expects text-classification data to be stored in the following format:
```
__label__2 , birchas chaim , yeshiva birchas chaim is a orthodox jewish mesivta high school in lakewood township new jersey . it was founded by rabbi shmuel zalmen stein in 2001 after his father rabbi chaim stein asked him to open a branch of telshe yeshiva in lakewood . as of the 2009-10 school year the school had an enrollment of 76 students and 6 . 6 classroom teachers ( on a fte basis ) for a student–teacher ratio of 11 . 5 1 . 
__label__6 , motor torpedo boat pt-41 , motor torpedo boat pt-41 was a pt-20-class motor torpedo boat of the united states navy built by the electric launch company of bayonne new jersey . the boat was laid down as motor boat submarine chaser ptc-21 but was reclassified as pt-41 prior to its launch on 8 july 1941 and was completed on 23 july 1941 . 
__label__11 , passiflora picturata , passiflora picturata is a species of passion flower in the passifloraceae family . 
__label__13 , naya din nai raat , naya din nai raat is a 1974 bollywood drama film directed by a . bhimsingh . the film is famous as sanjeev kumar reprised the nine-role epic performance by sivaji ganesan in navarathri ( 1964 ) which was also previously reprised by akkineni nageswara rao in navarathri ( telugu 1966 ) . this film had enhanced his status and reputation as an actor in hindi cinema . 
```
Here the leading `__label__` is the label prefix, which can be customized; whatever follows `__label__` is the class.

We define our five categories as:
```
1:technology
2:car
3:entertainment
4:military
5:sports
```
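
As a small illustration of how these ids combine with the `__label__` prefix, the sketch below formats one training line and builds the reverse mapping from label string to category name. The helper `to_fasttext_line` and the constant `LABEL_PREFIX` are hypothetical names introduced only for this example.

```python
LABEL_PREFIX = "__label__"
cate_dic = {'technology': 1, 'car': 2, 'entertainment': 3, 'military': 4, 'sports': 5}

def to_fasttext_line(category, tokens):
    """Format one example as '__label__<id> , tok tok tok'."""
    return LABEL_PREFIX + str(cate_dic[category]) + " , " + " ".join(tokens)

# Reverse mapping from a fastText label string back to the category name
label_to_cate = {LABEL_PREFIX + str(idx): name for name, idx in cate_dic.items()}

print(to_fasttext_line('car', ['宝马', '汽车']))  # __label__2 , 宝马 汽车
print(label_to_cate['__label__5'])                # sports
```

Deriving the reverse mapping from `cate_dic` keeps the two dictionaries in sync if a category is added later.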

In [1]:
import jieba
import pandas as pd
import random

# Map each category name to a numeric id, e.g. 'technology' -> 1, 'car' -> 2, ...
cate_dic = {'technology':1, 'car':2, 'entertainment':3, 'military':4, 'sports':5}
# Load the data
df_technology = pd.read_csv("./origin_data/technology_news.csv", encoding='utf-8')
df_technology = df_technology.dropna()

df_car = pd.read_csv("./origin_data/car_news.csv", encoding='utf-8')
df_car = df_car.dropna()

df_entertainment = pd.read_csv("./origin_data/entertainment_news.csv", encoding='utf-8')
df_entertainment = df_entertainment.dropna()

df_military = pd.read_csv("./origin_data/military_news.csv", encoding='utf-8')
df_military = df_military.dropna()

df_sports = pd.read_csv("./origin_data/sports_news.csv", encoding='utf-8')
df_sports = df_sports.dropna()
# Convert to Python lists
technology = df_technology.content.values.tolist()[1000:21000]
car = df_car.content.values.tolist()[1000:21000]
entertainment = df_entertainment.content.values.tolist()[:20000]
military = df_military.content.values.tolist()[:20000]
sports = df_sports.content.values.tolist()[:20000]

### Load the stopword list and define the text-processing function that converts text into fastText's input format

In [2]:
stopwords=pd.read_csv("origin_data/stopwords.txt",index_col=False,quoting=3,sep="\t",names=['stopword'], encoding='utf-8')
# Use a set so that each membership test is O(1) instead of a scan over an array
stopwords=set(stopwords['stopword'].values)

def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs=jieba.lcut(line)
            # Drop single-character tokens (punctuation etc.) and stopwords
            segs = list(filter(lambda x:len(x)>1, segs))
            segs = list(filter(lambda x:x not in stopwords, segs))
            # Turn the sentence into the form: __label__1 , word word word ...
            sentences.append("__label__"+str(category)+" , "+" ".join(segs))
        except Exception as e:
            # Print the offending line and the error, then move on
            print(line, e)
            continue
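
The two `filter` calls can be tried in isolation on an already segmented sentence. The sketch below uses a hypothetical helper `clean_tokens` and a tiny made-up stopword set in place of `jieba` and the real stopword file, so it runs without any external data.

```python
def clean_tokens(tokens, stopwords):
    """Keep tokens longer than one character that are not stopwords."""
    return [t for t in tokens if len(t) > 1 and t not in stopwords]

# Made-up input: a pre-segmented sentence and a one-word stopword set
tokens = ["这", "是", "中国", "制造", "的", "宝马", "汽车", "，", "我们"]
demo_stopwords = {"我们"}

print(clean_tokens(tokens, demo_stopwords))
# ['中国', '制造', '宝马', '汽车']
```

The `len(t) > 1` test removes punctuation, but it also removes every single-character word, so text passed to the model at prediction time should ideally go through the same cleaning.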

In [3]:
# Build the training data
sentences = []

preprocess_text(technology, sentences, cate_dic['technology'])
preprocess_text(car, sentences, cate_dic['car'])
preprocess_text(entertainment, sentences, cate_dic['entertainment'])
preprocess_text(military, sentences, cate_dic['military'])
preprocess_text(sports, sentences, cate_dic['sports'])

# Shuffle the data randomly
random.shuffle(sentences)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.742 seconds.
Prefix dict has been built successfully.


In [4]:
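# Save the data to train_data.txt
print("writing data to fasttext format...")
# The with-block closes the file, so everything is flushed before fastText reads it
with open('train_data.txt', 'w', encoding='utf-8') as out:
    for sentence in sentences:
        out.write(sentence+"\n")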
print("done!")

writing data to fasttext format...
done!
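
Before training, it can be worth checking how many examples each label received. The following is a small sketch with a hypothetical helper `count_labels`; it works on any iterable of lines, and the sample lines are made up.

```python
from collections import Counter

def count_labels(lines, prefix="__label__"):
    """Count how often each label appears as the first token of a line."""
    counts = Counter()
    for line in lines:
        first = line.split(" ", 1)[0]
        if first.startswith(prefix):
            counts[first] += 1
    return counts

sample = [
    "__label__1 , 手机 芯片",
    "__label__2 , 宝马 汽车",
    "__label__2 , 发动机 变速箱",
]
print(count_labels(sample))
# Counter({'__label__2': 2, '__label__1': 1})
```

For the real data, the same function can be given an open file handle, for example `with open('train_data.txt', encoding='utf-8') as f: print(count_labels(f))`.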


### Train a model with fastText

In [6]:
import fasttext
"""
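  Train a supervised model and return a model object

  @param input:           path to the training data file
  @param lr:              learning rate
  @param dim:             dimension of the vectors
  @param ws:              size of the context window (used by the cbow and skipgram models)
  @param epoch:           number of epochs
  @param minCount:        word-frequency threshold; words below it are filtered out at initialization
  @param minCountLabel:   label-frequency threshold; labels below it are filtered out at initialization
  @param minn:            minimum number of chars when building subwords
  @param maxn:            maximum number of chars when building subwords
  @param neg:             number of negative samples
  @param wordNgrams:      maximum length of word n-grams
  @param loss:            loss function type, softmax, ns: negative sampling, hs: hierarchical softmax
  @param bucket:          number of hash buckets; the input matrix is [A, B]: A holds the vectors of words in the corpus, B holds the bucket vectors used for hashed n-grams and subwords
  @param thread:          number of threads; each thread processes one segment of the input data, and thread 0 prints the loss
  @param lrUpdateRate:    rate at which the learning rate is updated
  @param t:               sampling threshold (subsampling of frequent words)
  @param label:           label prefix
  @param verbose:         verbosity level
  @param pretrainedVectors: path to a pretrained word-vector file; words found in the file are initialized from it instead of randomly
  @return model object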
"""
classifier = fasttext.train_supervised(input='train_data.txt', dim=100, epoch=5,
                                         lr=0.1, wordNgrams=2, loss='softmax')
classifier.save_model('classifier.model')
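
Conceptually, a fastText classifier averages the embeddings of the input tokens (words plus word n-grams), passes the average through a linear layer, and applies the loss selected by `loss`. The following toy sketch shows that forward pass for the `softmax` case. All names and numbers are made up for illustration; it is not fastText's actual implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(token_ids, embeddings, output_weights):
    """Average the token embeddings, apply a linear layer, then softmax."""
    dim = len(embeddings[0])
    hidden = [0.0] * dim
    for tid in token_ids:
        for d in range(dim):
            hidden[d] += embeddings[tid][d]
    hidden = [h / len(token_ids) for h in hidden]
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in output_weights]
    return softmax(scores)

# Toy parameters: vocabulary of 3 tokens, dim=2, 2 classes
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output_weights = [[2.0, 0.0], [0.0, 2.0]]

probs = forward([0, 2], embeddings, output_weights)
print(probs)  # two probabilities that sum to 1, the first one larger
```

Since the hidden layer is a plain average, word order inside the input is lost unless word n-grams are included, which is why `wordNgrams=2` is set in the training call above.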

### Evaluate the model

In [7]:
# Note: this evaluates on the training file itself, so the scores below are optimistic
result = classifier.test('train_data.txt')
print('P@1:', result[1])
print('R@1:', result[2])
print('Number of examples:', result[0])

P@1: 0.9838656268198271
R@1: 0.9838656268198271
Number of examples: 87577
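
The figures above are measured on the same file the model was trained on. A more informative estimate needs held-out data. The sketch below shows one way to split the prepared lines; the helper `split_train_valid`, the 20% ratio, and the seed are assumptions for illustration, not part of the original workflow.

```python
import random

def split_train_valid(sentences, valid_ratio=0.2, seed=42):
    """Shuffle a copy of the lines and split it into (train, valid) lists."""
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]

# Made-up lines in fastText format
demo = ["__label__%d , doc%d" % (i % 5 + 1, i) for i in range(10)]
train, valid = split_train_valid(demo)
print(len(train), len(valid))  # 8 2
```

Each part would then be written to its own file in the same way as `train_data.txt` (for instance the hypothetical `train_split.txt` and `valid_split.txt`), with training done on the first file and `classifier.test` called on the second.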


### Prediction on new text

In [8]:
label_to_cate = {'__label__1':'technology', '__label__2':'car', '__label__3':'entertainment',
                 '__label__4':'military', '__label__5':'sports'}

texts = '这 是 中国 制造 宝马 汽车'
labels = classifier.predict(texts)
# print(labels)
print(label_to_cate[labels[0][0]])

car


### Top-K predictions

In [9]:
labels = classifier.predict(texts, k=3)
top_labels, top_probas = labels[0], labels[1]
for label, proba in zip(top_labels, top_probas):
    print('Prediction: %s\tprobability: %f' % (label_to_cate[label], proba))

Prediction: car	probability: 1.000009
Prediction: technology	probability: 0.000011
Prediction: military	probability: 0.000010
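
Note that the top probability printed above is slightly larger than 1, which looks like a small numerical artifact in the returned scores. If the values are shown to users, they can be clipped to the range [0, 1]. The helper `tidy_predictions` below is a hypothetical sketch that pairs labels with clipped probabilities; the input values are copied from the output above.

```python
label_to_cate = {'__label__1': 'technology', '__label__2': 'car', '__label__3': 'entertainment',
                 '__label__4': 'military', '__label__5': 'sports'}

def tidy_predictions(labels, probas, mapping):
    """Pair each label's category name with its probability, clipped to [0, 1]."""
    return [(mapping[label], min(max(float(p), 0.0), 1.0))
            for label, p in zip(labels, probas)]

# Values copied from the output above
labels = ('__label__2', '__label__1', '__label__4')
probas = (1.000009, 0.000011, 0.000010)

for cate, p in tidy_predictions(labels, probas, label_to_cate):
    print('%s\t%.6f' % (cate, p))
```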


# fastText for Chinese Unsupervised Learning

In [10]:
def preprocess_text_unsupervised(content_lines, sentences):
    for line in content_lines:
        try:
            segs=jieba.lcut(line)
            segs = list(filter(lambda x:len(x)>1, segs))
            segs = list(filter(lambda x:x not in stopwords, segs))
            # Turn the sentence into the form: word word word ...
            sentences.append(" ".join(segs))
        except Exception as e:
            print(line)
            continue
# Build the unsupervised training data
sentences = []

preprocess_text_unsupervised(technology, sentences)
preprocess_text_unsupervised(car, sentences)
preprocess_text_unsupervised(entertainment, sentences)
preprocess_text_unsupervised(military, sentences)
preprocess_text_unsupervised(sports, sentences)

print("writing data to fasttext unsupervised learning format...")
with open('unsupervised_train_data.txt', 'w', encoding='utf-8') as out:
    for sentence in sentences:
        out.write(sentence+"\n")
print("done!")

writing data to fasttext unsupervised learning format...
done!


In [11]:
import fasttext

# Skipgram model
model = fasttext.train_unsupervised('unsupervised_train_data.txt', model='skipgram')
print(model.words[:10]) # list of words in dictionary

# CBOW model
model = fasttext.train_unsupervised('unsupervised_train_data.txt', model='cbow')
print(model.words[:10]) # list of words in dictionary

['</s>', '中国', '发展', '汽车', '用户', '技术', '比赛', '市场', '平台', '服务']
['</s>', '中国', '发展', '汽车', '用户', '技术', '比赛', '市场', '平台', '服务']
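
Both models report the same vocabulary, since it depends only on the corpus; what differs is the training objective. Skipgram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. The sketch below lists the training examples each objective would see, using a fixed window for simplicity (the real implementation also samples the window size and uses subwords). The helper names are hypothetical.

```python
def skipgram_pairs(tokens, window=1):
    """One (center, context) pair per context word inside the window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=1):
    """One (context list, center) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = ['中国', '汽车', '市场']
print(skipgram_pairs(tokens))
# [('中国', '汽车'), ('汽车', '中国'), ('汽车', '市场'), ('市场', '汽车')]
print(cbow_examples(tokens))
# [(['汽车'], '中国'), (['中国', '市场'], '汽车'), (['汽车'], '市场')]
```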


In [12]:
# Look up the vector of a word
print(model['赛季'])

[ 1.3030976   0.9404861  -0.01847073 -2.0281935  -0.47557053  1.4022453
 -0.33904338  1.0796572   0.02798706  2.1330366  -0.18552518 -0.2461389
 -0.96020913 -0.60029024 -0.3148932   2.2439585  -2.0460677  -2.2157037
 -0.5819198  -0.0692748   0.26359314 -0.29606423  1.8785787  -0.19154015
 -1.3072726  -0.06210047  0.74192524 -0.5015831  -0.9866113  -0.5674383
 -0.9844613  -0.50053316  1.5576434  -0.1627377  -2.2799628  -0.83161664
 -3.1632657  -0.15478554 -1.1918309   1.7669501   0.8818059   0.78309166
  0.7428605   0.01461019  0.9616978  -2.0978618   1.9600568  -0.9531319
  0.35986143  1.4861448  -2.2054806   1.4554088   0.1940116   0.91389835
 -0.06472382 -1.0512189   0.95620656 -0.8704989  -2.5449433  -1.335377
  0.5264219  -2.4620938   2.6068513  -0.10895383  1.7347517   0.9680276
 -0.57421255  3.0573      0.07793453 -0.37695345  0.75320894 -1.8995788
 -1.1326122  -0.45068133  0.58303857 -0.06479045  1.364764   -0.8579126
  0.9971492   0.5678582   0.84928197 -1.400441    1.0710597  

## Comparison with gensim's word2vec

In [3]:
def preprocess_text_unsupervised(content_lines, sentences):
    for line in content_lines:
        try:
            segs=jieba.lcut(line)
            segs = list(filter(lambda x:len(x)>1, segs))
            segs = list(filter(lambda x:x not in stopwords, segs))
            # gensim expects each sentence as a list of tokens: [word, word, word]
            sentences.append(segs)
        except Exception as e:
            print(line)
            continue
# Build the unsupervised training data
sentences = []

preprocess_text_unsupervised(technology, sentences)
preprocess_text_unsupervised(car, sentences)
preprocess_text_unsupervised(entertainment, sentences)
preprocess_text_unsupervised(military, sentences)
preprocess_text_unsupervised(sports, sentences)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ADMINI~1\AppData\Local\Temp\jieba.cache
Loading model cost 0.728 seconds.
Prefix dict has been built successfully.


In [4]:
from gensim.models import word2vec
# `size` is the gensim 3.x name of this parameter; gensim 4.x renamed it to `vector_size`
model = word2vec.Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
model.save("gensim_word2vec.model")

In [5]:
print(model.wv['赛季'])

[ 2.8249986   0.03112162 -0.25813046  0.33364102 -1.6184547  -1.8084553
  2.1597168   0.1351388  -0.53724444  0.33972478  2.1969304  -0.14570096
 -0.7215449   0.12304255 -0.19669688  1.0330333   0.55557615  1.9715163
 -0.62517285 -1.4554526  -2.1550758  -0.75689536  0.7878873  -1.8041505
  0.6196159   0.16723332 -0.82358366 -0.27559796 -0.7487638   1.9861195
 -0.8994759  -0.9230798  -1.3786101   0.41648138 -1.6454602   0.3242791
 -1.3978794   0.08787971  0.8411618   1.4878358  -0.14459854  1.2883182
  1.4306669  -2.1317682   0.15039028  1.1206025   0.13805757 -1.6349252
 -0.9597154   1.2816765   0.6698505   0.05626296 -1.1199676  -1.0730348
 -0.79737777  1.8237026  -2.020744    0.91954416  1.2784522  -1.3709328
  0.8549252  -0.33926338 -1.3959678   0.2701675  -1.1709216  -0.83576757
 -0.8271942   1.1551841  -1.6863061   0.05112632 -0.1284147   0.33976018
  0.7828476  -1.5634745  -0.8728223  -1.9741758  -0.5336166  -1.6338608
 -1.2736895  -1.3195976  -0.44755006 -0.57404816  0.54282814 

In [6]:
# Find similar words
print(model.wv.most_similar('赛季'))

[('亚冠', 0.8779898881912231), ('中甲', 0.8449328541755676), ('BIG4', 0.8394372463226318), ('本赛季', 0.8369283676147461), ('辽足', 0.8342424631118774), ('国安', 0.8320003747940063), ('恒大', 0.8271181583404541), ('名额', 0.825170636177063), ('全北', 0.8210830688476562), ('强赛', 0.8180699348449707)]
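
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough sketch of that ranking, the example below computes cosine similarity by hand over a tiny dictionary of made-up two-dimensional vectors. The function names and the vectors are illustrative assumptions and have no relation to the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`."""
    scored = [(word, cosine_similarity(vectors[query], vec))
              for word, vec in vectors.items() if word != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, for illustration only
toy = {'赛季': [1.0, 0.0], '亚冠': [0.9, 0.1], '汽车': [0.0, 1.0]}
for word, score in most_similar('赛季', toy):
    print(word, round(score, 4))
```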
