Use head to read the first n lines of a file. In Jupyter Notebook, prefix a Linux command with ! to run it.

In [1]:
!head -5 data.csv

1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme,nl
1 millón de afectados ante las inundaciones en sri lanka unicef está distribuyendo ayuda de emergencia srilanka,es
1 millón de fans en facebook antes del 14 de febrero y paty miki dani y berta se tiran en paracaídas qué harías tú porunmillondefans,es
1 satellite galileo sottoposto ai test presso lesaestec nl galileo navigation space in inglese,it
10 der welt sind bei,de


Read the data.csv file with the open function.

In [2]:
# The with-block closes the file automatically; UTF-8 is assumed for data.csv
with open('data.csv', encoding='utf-8') as in_f:
    lines = in_f.readlines()

dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
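The slicing above relies on every label being exactly two characters long. As a hedged alternative, the sketch below splits each line on its last comma instead. `parse_line` is a hypothetical helper name, not part of the original notebook.

```python
def parse_line(line):
    """Split 'text,label' on the LAST comma so commas inside the text survive."""
    text, _, label = line.strip().rpartition(",")
    return text, label

sample = "10 der welt sind bei,de\n"
print(parse_line(sample))  # ('10 der welt sind bei', 'de')
```

With this helper, the list comprehension could become `[parse_line(line) for line in lines]`.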

Show the first five entries of the list.

In [3]:
dataset[:5]

[('1 december wereld aids dag voorlichting in zuidafrika over bieten taboes en optimisme',
  'nl'),
 ('1 millón de afectados ante las inundaciones en sri lanka unicef está distribuyendo ayuda de emergencia srilanka',
  'es'),
 ('1 millón de fans en facebook antes del 14 de febrero y paty miki dani y berta se tiran en paracaídas qué harías tú porunmillondefans',
  'es'),
 ('1 satellite galileo sottoposto ai test presso lesaestec nl galileo navigation space in inglese',
  'it'),
 ('10 der welt sind bei', 'de')]

Use sklearn's dataset-splitting function to split the original dataset into a training set and a test set.

In [4]:
from sklearn.model_selection import train_test_split

X, y = zip(*dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=100)
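When neither `test_size` nor `train_size` is given, `train_test_split` holds out 25% of the samples, and `random_state` fixes the shuffle so the split is reproducible. The pure-Python sketch below illustrates the idea only; it is not sklearn's implementation, and `shuffled_split` is a hypothetical name.

```python
import math
import random

def shuffled_split(items, test_fraction=0.25, seed=None):
    """Shuffle indices with a seeded RNG, then hold out a fraction as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    n_test = math.ceil(len(items) * test_fraction)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test

data = list(range(8))
train_a, test_a = shuffled_split(data, seed=100)
train_b, test_b = shuffled_split(data, seed=100)
print(len(train_a), len(test_a))                # 6 2
print(train_a == train_b and test_a == test_b)  # True: same seed, same split
```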

zip() packs the elements of its arguments into tuples; zip(*) does the reverse and unpacks them.

In [5]:
x = ['a', 'b']
y = ['c', 'd']

z = list(zip(x, y))
print(list(zip(x, y)))
print(list(zip(*z)))

[('a', 'c'), ('b', 'd')]
[('a', 'b'), ('c', 'd')]


In [6]:
len(X_train)

6799

Use a regular expression to remove noise from the data.

In [7]:
import re

def remove_noise(doc):
    # Strip URLs, @mentions and #hashtags; raw strings avoid invalid-escape warnings
    noise_pattern = re.compile("|".join([r"http\S+", r"@\w+", r"#\w+"]))
    clean_text = re.sub(noise_pattern, "", doc)

    return clean_text.strip()

remove_noise("Trump images are now more popular than cat gits.@Trump #tends http://ww.trumptrends.html")

'Trump images are now more popular than cat gits.'

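**sklearn CountVectorizer and TfidfVectorizer**<br/>
- CountVectorizer counts how often each term occurs in each training text and assembles the counts into a feature matrix, one row per training text. The idea is to build a vocabulary from all training texts, ignoring word order, with each term that occurs anywhere becoming one feature column.<br/>
- TfidfVectorizer considers not only how often a term occurs in the current training text but also the inverse of the number of training texts that contain it.<br/>
By comparison, the more training texts there are, the greater the advantage of TfidfVectorizer's weighting tends to be, because terms that occur in most texts are down-weighted.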
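To make the two weightings concrete, here is a minimal pure-Python sketch on three toy documents. It uses whole words for readability and the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is the default documented for sklearn's TF-IDF. The toy documents and the name `smoothed_idf` are assumptions for illustration.

```python
import math
from collections import Counter

docs = ["der hund", "the dog", "the cat"]  # toy corpus, not from data.csv

# Count step: one Counter per document, one column per vocabulary term
counts = [Counter(doc.split()) for doc in docs]
vocabulary = sorted(set().union(*counts))
count_matrix = [[c[term] for term in vocabulary] for c in counts]

def smoothed_idf(term):
    """Inverse document frequency with add-one smoothing."""
    df = sum(1 for c in counts if term in c)  # number of documents containing term
    return math.log((1 + len(docs)) / (1 + df)) + 1

idf = {term: smoothed_idf(term) for term in vocabulary}

print(vocabulary)    # ['cat', 'der', 'dog', 'hund', 'the']
print(count_matrix)  # [[0, 1, 0, 1, 0], [0, 0, 1, 0, 1], [1, 0, 0, 0, 1]]
print(idf["the"] < idf["cat"])  # True: 'the' occurs in more documents
```

A term such as "the", which appears in two of the three documents, receives a lower weight than "cat", which appears in only one.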

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(lowercase = True, 
                       analyzer = 'char_wb',
                       ngram_range = (1,3),
                       max_features = 1000,
                       preprocessor = remove_noise)
vec.fit(X_train)

def get_features(X):
    # Transform raw texts into the n-gram count matrix
    return vec.transform(X)
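The vectorizer above uses `analyzer = 'char_wb'` with `ngram_range = (1,3)`, meaning character n-grams taken inside word boundaries, with each word padded by a space on both sides. The sketch below approximates that behaviour in plain Python. `char_wb_ngrams` is a hypothetical helper, and it simplifies the edge case of words shorter than n, so treat it as an illustration and not as sklearn's exact output.

```python
def char_wb_ngrams(text, min_n=1, max_n=3):
    """Character n-grams per word, each word padded with one space on each side."""
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(min_n, max_n + 1):
            # Slide a window of width n across the padded word
            for i in range(len(padded) - n + 1):
                ngrams.append(padded[i:i + n])
    return ngrams

print(char_wb_ngrams("de", min_n=2, max_n=3))
# [' d', 'de', 'e ', ' de', 'de ']
```

Short character sequences like these tend to be distinctive for a language, which is presumably why they work well as features here.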

Train a naive Bayes classifier.

In [9]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB(alpha=0.01)
classifier.fit(vec.transform(X_train), y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

Check the accuracy.

In [10]:
classifier.score(vec.transform(X_test), y_test)

0.9907366563740626

### Wrapping it up in a class

In [11]:
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from joblib import dump, load

class LanguageDetector(object):
    # Constructor
    def __init__(self, classifier=None):
        # Create the default classifier per instance instead of sharing one default object
        self.classifier = classifier if classifier is not None else MultinomialNB()
        self.vectorizer = CountVectorizer(max_features=1100,
                                          ngram_range=(1,3),
                                          preprocessor=self._remove_noise)

    # Data cleaning: strip URLs, #hashtags and @mentions
    def _remove_noise(self, doc):
        noise_pattern = re.compile("|".join([r"http\S+", r"#\w+", r"@\w+"]))
        clean_text = re.sub(noise_pattern, "", doc)

        return clean_text.strip()
    
    # Build features
    def features(self, X):
        return self.vectorizer.transform(X)

    # Fit the vectorizer and the classifier
    def fit(self, X, y):
        self.vectorizer.fit(X)
        self.classifier.fit(self.features(X), y)

    # Predict the language of a single text
    def prediction(self, X):
        return self.classifier.predict(self.features([X]))

    # Score on the test set
    def score(self, X, y):
        return self.classifier.score(self.features(X), y)

    # Save the model
    def save_model(self, path):
        dump((self.classifier, self.vectorizer), path)

    # Load the model
    def load_model(self, path):
        self.classifier, self.vectorizer = load(path)

### Model training and saving

In [12]:
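# Load the data (UTF-8 is assumed for data.csv)
with open('data.csv', encoding='utf-8') as in_f:
    lines = in_f.readlines()

dataset = [(line.strip()[:-3], line.strip()[-2:]) for line in lines]
# Split the dataset (no random_state is set, so the score varies between runs)
X, y = zip(*dataset)
X_train, X_test, y_train, y_test = train_test_split(X, y)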

# Train the model
language_detector = LanguageDetector()
language_detector.fit(X_train, y_train)
print(language_detector.score(X_test, y_test))

0.9775033083370093


In [13]:
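# Save the model (create the target directory first; dump does not create it)
import os
os.makedirs('./model', exist_ok=True)
language_detector.save_model('./model/language_detector.model')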

In [14]:
# Load the model
load_language_detector = LanguageDetector()
load_language_detector.load_model('./model/language_detector.model')
load_language_detector.score(X_test, y_test)

0.9775033083370093