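# Language detection with naive Bayes
The languages in this dataset are all written in the Latin alphabet, so they share most of their letters. We classify by the order and frequency in which letters are used: the unit of analysis is the character, not the word.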
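
To make the character-level idea concrete, here is a minimal sketch that counts character bigrams in two short phrases taken from the samples in this notebook. The helper `char_ngrams` is a hypothetical name introduced only for illustration.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Two short phrases taken from the samples in this notebook.
dutch = "wereld aids dag voorlichting"
spanish = "millón de afectados ante las inundaciones"

# Print the five most frequent character bigrams for each phrase.
for label, text in (("nl", dutch), ("es", spanish)):
    bigrams = Counter(char_ngrams(text, 2))
    print(label, bigrams.most_common(5))
```

Even on inputs this small the frequent bigrams differ between the two languages, which is the signal the classifier relies on.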

## A first look at the data

In [6]:
import pandas as pd
data = pd.read_csv('../nlp/input/data.csv',header=0)

Note that this file has no header row, so with `header=0` pandas treats the first sample as the column names.

In [8]:
data.head()

Unnamed: 0,1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme,nl
0,1 millón de afectados ante las inundaciones en...,es
1,1 millón de fans en facebook antes del 14 de f...,es
2,1 satellite galileo sottoposto ai test presso ...,it
3,10 der welt sind bei,de
4,10 jaar voor overval op juwelier bejaard echtp...,nl


In [16]:
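with open('../nlp/input/data.csv', encoding='utf-8') as in_f:
    lines = in_f.readlines()

# strip() removes the given characters from both ends of a string (whitespace by default)
# [:-3] drops the trailing ",xx"; [-2:] keeps the two-letter language code
# each line becomes a (text, label) tuple, and the tuples are collected in a list
dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]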

In [17]:
dataset[:5]  # each sample is a (text, label) 2-tuple

[('1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme',
  'nl'),
 ('1 millón de afectados ante las inundaciones en sri lanka unicef está distribuyendo ayuda de emergencia srilanka',
  'es'),
 ('1 millón de fans en facebook antes del 14 de febrero y paty miki dani y berta se tiran en paracaídas qué harías tú porunmillondefans',
  'es'),
 ('1 satellite galileo sottoposto ai test presso lesaestec nl galileo navigation space in inglese',
  'it'),
 ('10 der welt sind bei', 'de')]
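
The slicing above assumes every line ends with a comma followed by exactly two label characters. As a hedged alternative, the sketch below splits on the last comma instead. `parse_line` is a hypothetical helper, and the second sample line is made up to show a comma inside the text.

```python
def parse_line(line):
    """Split 'text,label' on the LAST comma so commas inside the text survive."""
    text, label = line.strip().rsplit(",", 1)
    return text, label.strip()

sample_lines = [
    "10 der welt sind bei,de\n",            # taken from the data shown above
    "hello, world this is english,en\n",    # hypothetical line with a comma in the text
]
dataset_sketch = [parse_line(line) for line in sample_lines]
print(dataset_sketch)
```

This also keeps working if a label is ever longer than two characters.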

## Building the training and test sets

In [24]:
from sklearn.model_selection import train_test_split
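# zip() pairs up elements from several sequences into tuples; zip(*seq) reverses that,
# unpacking a list of 2-tuples back into two separate sequences (texts and labels)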
x,y = zip(*dataset)

In [31]:
x[:5]

('1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme',
 '1 millón de afectados ante las inundaciones en sri lanka unicef está distribuyendo ayuda de emergencia srilanka',
 '1 millón de fans en facebook antes del 14 de febrero y paty miki dani y berta se tiran en paracaídas qué harías tú porunmillondefans',
 '1 satellite galileo sottoposto ai test presso lesaestec nl galileo navigation space in inglese',
 '10 der welt sind bei')
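
A small illustration of the `zip(*...)` unpacking used here, on two pairs. The second pair is a made-up sample.

```python
pairs = [("10 der welt sind bei", "de"), ("this is english", "en")]

# zip(*pairs) is zip(pairs[0], pairs[1]): first elements are grouped together,
# and second elements are grouped together.
texts, labels = zip(*pairs)
print(texts)   # ('10 der welt sind bei', 'this is english')
print(labels)  # ('de', 'en')

# Zipping the two tuples again restores the original pairs.
restored = list(zip(texts, labels))
print(restored == pairs)  # True
```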

In [32]:
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state=1)

In [35]:
len(x_train)

6799
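
The call above relies on the defaults of `train_test_split`, which hold out 25% of the samples for testing. The sketch below shows the same split on placeholder data. The `stratify` argument is an optional addition that the original call does not use; it keeps the label proportions equal in both parts.

```python
from sklearn.model_selection import train_test_split

# Placeholder texts and labels, four samples per language.
toy_x = ["a", "b", "c", "d", "e", "f", "g", "h"]
toy_y = ["de", "de", "de", "de", "en", "en", "en", "en"]

tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y,
    test_size=0.25,     # the default when neither test_size nor train_size is given
    stratify=toy_y,     # assumption: not in the original call
    random_state=1,
)
print(len(tx_train), len(tx_test))  # 6 2
print(sorted(ty_test))              # ['de', 'en']
```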

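## Preprocessing
The text was pulled from a microblog-style service (similar to Weibo), so we need to strip noise such as URLs, @mentions and #hashtags.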

In [36]:
import re
# Use a regular expression to strip the noise: URLs, @mentions and #hashtags
def remove_noise(document):
    noise_pattern = re.compile("|".join([r"http\S+", r"\@\w+", r"\#\w+"]))
    clean_text = re.sub(noise_pattern, "", document)
    return clean_text.strip()

remove_noise("Trump images are now more popular than cat gifs. @trump #trends http://www.trumptrends.html")

'Trump images are now more popular than cat gifs.'

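## Extracting n-gram count features (vectorization)
Every smallest-granularity element that appears is assigned an integer code, so a text can be represented by ids such as:

(123 24 56 1024 1567)

trump images are now... => 1-gram = t, r, u, m, p, ...; 2-gram = tr, ru, um, mp, ...<br>
First `fit` on all the training text to obtain the high-frequency 1-grams and 2-grams and their indices.<br>
Then `transform` each sample, which maps its n-grams to those indices and counts them.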
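
The fit/transform idea can be sketched in plain Python as follows. This is a simplified illustration, not scikit-learn's implementation: for instance, it does not pad words with spaces the way `char_wb` does. `extract_ngrams`, `fit_vocabulary` and `transform` are hypothetical names.

```python
from collections import Counter

def extract_ngrams(text, n_min=1, n_max=2):
    """All character n-grams of `text` for n in [n_min, n_max]."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def fit_vocabulary(documents, max_features):
    """Map the `max_features` most frequent n-grams to integer indices."""
    counts = Counter(g for doc in documents for g in extract_ngrams(doc))
    top = sorted(g for g, _ in counts.most_common(max_features))
    return {gram: index for index, gram in enumerate(top)}

def transform(document, vocabulary):
    """Count vector for one document; n-grams outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for gram in extract_ngrams(document):
        if gram in vocabulary:
            vector[vocabulary[gram]] += 1
    return vector

print(extract_ngrams("trump"))
# ['t', 'r', 'u', 'm', 'p', 'tr', 'ru', 'um', 'mp']

vocab = fit_vocabulary(["trump", "trends"], max_features=5)
print(vocab)
print(transform("trump", vocab))
```

The vocabulary is learned once from the training text, and every later sample is then described only in terms of that fixed set of n-grams.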

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

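model = CountVectorizer(
    lowercase=True,      # lowercase the text (ignored when a custom preprocessor is passed, as it is here)
    analyzer='char_wb',  # character n-grams, built only from text inside word boundaries
    ngram_range=(1,2),   # 1-grams: single characters and their counts; 2-grams: pairs of consecutive characters and their counts
    max_features=1000,   # keep the 1000 most frequent n-grams; to improve accuracy, enlarge the vocabulary, use longer n-grams, or change the model
    preprocessor=remove_noise
)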
model.fit(x_train)

def get_features(x):
    return model.transform(x)

In [41]:
result = model.transform(["Trump images are now more popular than cat gifs"])
print(result)

  (0, 0)	18
  (0, 11)	1
  (0, 13)	1
  (0, 17)	1
  (0, 19)	1
  (0, 23)	1
  (0, 24)	1
  (0, 26)	1
  (0, 30)	1
  (0, 150)	5
  (0, 159)	1
  (0, 166)	1
  (0, 170)	2
  (0, 172)	1
  (0, 212)	1
  (0, 215)	1
  (0, 272)	3
  (0, 273)	2
  (0, 293)	1
  (0, 302)	1
  (0, 321)	1
  (0, 330)	2
  (0, 338)	1
  (0, 342)	1
  (0, 359)	1
  :	:
  (0, 536)	2
  (0, 537)	1
  (0, 553)	1
  (0, 570)	3
  (0, 590)	1
  (0, 592)	1
  (0, 597)	1
  (0, 603)	3
  (0, 604)	1
  (0, 619)	1
  (0, 624)	1
  (0, 644)	4
  (0, 645)	1
  (0, 651)	2
  (0, 667)	1
  (0, 682)	2
  (0, 683)	2
  (0, 717)	2
  (0, 718)	1
  (0, 726)	1
  (0, 754)	2
  (0, 767)	1
  (0, 768)	1
  (0, 809)	1
  (0, 810)	1


## Preparing the ML model

In [42]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(model.transform(x_train),y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [44]:
clf.score(model.transform(x_test), y_test)

0.9770621967357741
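
For intuition about what the classifier computes, here is a minimal from-scratch sketch of multinomial naive Bayes with add-one (Laplace) smoothing, matching the `alpha=1.0` shown above. It uses single characters as features for brevity and is not scikit-learn's implementation. `train_nb`, `predict_nb` and the toy sentences (apart from the first) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(samples, alpha=1.0):
    """samples: list of (text, label). Returns (log priors, per-class log likelihoods)."""
    class_counts = Counter(label for _, label in samples)
    char_counts = defaultdict(Counter)
    for text, label in samples:
        char_counts[label].update(text)          # count each character per class
    alphabet = sorted({ch for text, _ in samples for ch in text})
    log_prior = {c: math.log(n / len(samples)) for c, n in class_counts.items()}
    log_like = {}
    for c in class_counts:
        total = sum(char_counts[c].values()) + alpha * len(alphabet)
        log_like[c] = {ch: math.log((char_counts[c][ch] + alpha) / total)
                       for ch in alphabet}
    return log_prior, log_like

def predict_nb(text, log_prior, log_like):
    """Pick the class with the highest log posterior; unseen characters are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_like[c][ch] for ch in text if ch in log_like[c])
    return max(scores, key=scores.get)

toy = [("der welt sind bei", "de"), ("wir sind hier", "de"),
       ("the world is here", "en"), ("this is the way", "en")]
log_prior, log_like = train_nb(toy)
print(predict_nb("the way", log_prior, log_like))   # en
print(predict_nb("sind bei", log_prior, log_like))  # de
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing term keeps a character unseen in one class from forcing that class's probability to zero.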

# Tidying up: wrapping everything in a class

In [45]:
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


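class LanguageDetector:

    def __init__(self, classifier=None):
        # Create the default classifier per instance instead of sharing one default object
        self.classifier = classifier if classifier is not None else MultinomialNB()
        # analyzer='char_wb' matches the character-level vectorizer configured earlier
        self.vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(1,2), max_features=1000, preprocessor=self._remove_noise)

    def _remove_noise(self, document):
        # Strip URLs, @mentions and #hashtags
        noise_pattern = re.compile("|".join([r"http\S+", r"\@\w+", r"\#\w+"]))
        clean_text = re.sub(noise_pattern, "", document)
        return clean_text

    def features(self, X):
        # Turn raw texts into n-gram count vectors
        return self.vectorizer.transform(X)

    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    def predict(self, x):
        # x is a single string; the result is an array with one label
        return self.classifier.predict(self.features([x]))

    def score(self, X, y):
        # Mean accuracy on the given texts and labels
        return self.classifier.score(self.features(X), y)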

In [47]:
with open('../nlp/input/data.csv', encoding='utf-8') as in_f:
    lines = in_f.readlines()
dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
x, y = zip(*dataset)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

language_detector = LanguageDetector()
language_detector.fit(x_train, y_train)
print(language_detector.predict('This is an English sentence'))
print(language_detector.score(x_test, y_test))

['en']
0.9770621967357741
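
As a possible alternative to the hand-written class, the same two steps can be chained with scikit-learn's `make_pipeline`. This is a sketch on a toy dataset, not a replacement for the code above. `strip_noise` is a hypothetical helper, all but the first toy sentence are made up, and the explicit `.lower()` is an addition, needed because a custom `preprocessor` replaces the built-in lowercasing step.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

NOISE = re.compile(r"http\S+|@\w+|#\w+")

def strip_noise(document):
    """Remove URLs, @mentions and #hashtags, then lowercase (lowercasing is an assumption)."""
    return NOISE.sub("", document).strip().lower()

pipeline = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 2),
                    max_features=1000, preprocessor=strip_noise),
    MultinomialNB(),
)

# Toy training data; only the first sentence comes from the dataset shown earlier.
toy_x = ["der welt sind bei", "wir sind hier",
         "the world is here", "this is the way"]
toy_y = ["de", "de", "en", "en"]
pipeline.fit(toy_x, toy_y)

query = ["This is the way @someone http://example.com"]
pred = pipeline.predict(query)
proba = pipeline.predict_proba(query)   # one row, one column per class
print(pred, pipeline[-1].classes_, proba)
```

Besides being shorter, a pipeline exposes `predict_proba`, which gives a per-language probability rather than only the top label.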
