# Learning LDA Chinese topic models with scikit-learn

Scikit-learn is one of the most widely used machine learning libraries in Python; the package is imported under the short name sklearn. It covers four main families of machine learning tasks (classification, regression, dimensionality reduction, and clustering) and also includes three supporting modules: feature extraction, data preprocessing, and model evaluation.

## A quick overview of machine learning

The general workflow of a traditional machine learning task, from raw data to a finished model, is:
* obtaining data
* data preprocessing
* training models
* model evaluation
* prediction and classification.

We will follow this workflow and look at the common functions used in each step and how to call them.


## Obtaining data

### Import sklearn dataset

Sklearn ships with a number of well-known datasets. While learning machine learning, we can use these datasets to try out different models, which also helps consolidate the underlying theory.

To use the datasets in sklearn, first import the datasets module.


In [37]:
import sklearn
from sklearn import datasets

Here we use the iris data as an example of loading a built-in dataset.

In [38]:
iris = datasets.load_iris() 
X = iris.data 
y = iris.target 
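
As a quick check of what was loaded, the sketch below prints the shapes of the feature matrix and label vector together with the feature and class names. It assumes nothing beyond the load_iris call used above.

```python
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

print(X.shape)             # (150, 4): 150 samples, 4 features
print(y.shape)             # (150,): one class label per sample
print(iris.feature_names)  # names of the 4 measured features
print(iris.target_names)   # names of the 3 iris species
```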

### Create dataset

In addition to using the dataset contained in sklearn, we can also create training samples ourselves.

For details, see: https://scikit-learn.org/stable/datasets/

Here we take an example of a sample generator for classification problems:

In [39]:
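# make_classification is imported from sklearn.datasets (the samples_generator submodule was removed in newer releases)
from sklearn.datasets import make_classification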
 
X, y = make_classification(n_samples=6, n_features=5, n_informative=2,
    n_redundant=2, n_classes=2, n_clusters_per_class=2, scale=1.0,
    random_state=20)
 

Printing the generated labels and samples gives:

In [40]:
for x_,y_ in zip(X,y):
    print(y_,end=': ')
    print(x_)

0: [-0.6600737  -0.0558978   0.82286793  1.1003977  -0.93493796]
1: [ 0.4113583   0.06249216 -0.90760075 -1.41296696  2.059838  ]
1: [ 1.52452016 -0.01867812  0.20900899  1.34422289 -1.61299022]
0: [-1.25725859  0.02347952 -0.28764782 -1.32091378 -0.88549315]
0: [-3.28323172  0.03899168 -0.43251277 -2.86249859 -1.10457948]
1: [ 1.68841011  0.06754955 -1.02805579 -0.83132182  0.93286635]


## Data preprocessing

Data preprocessing is an indispensable part of machine learning: it puts the data into a form that a model or estimator can use more effectively.

### Data standardization

Standardization: In machine learning we may have to deal with many kinds of data, such as audio signals and image pixel values, and these data may be high-dimensional. After standardization, each feature has a mean of 0 (the feature's mean in the original data is subtracted from every value) and a standard deviation of 1. This method is widely used with many machine learning algorithms (for example, support vector machines, logistic regression, and neural networks).

StandardScaler computes the mean and standard deviation on the training set so that the same transformation can later be applied to the test data.

After the transformation, every feature has zero mean and unit variance; this is also called z-score normalization (zero-mean normalization). The calculation is to subtract the feature's mean from each value and divide the result by the feature's standard deviation.

* fit: computes the mean and variance of the training data. These statistics are stored in the scaler and used by later calls to transform.

* fit_transform: computes the mean and variance of the training data and also transforms the training data with them, so that each feature ends up with zero mean and unit variance. (The shape of the distribution is unchanged, so the result is not necessarily normally distributed.)

* transform: only applies the transformation, using the mean and variance already computed by fit. Typically the scaler is fitted on the training set and the same fitted scaler is then used to standardize the test set. Standardizing the training and test sets together is also seen in practice, but it lets test-set statistics leak into training.


In [45]:
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X)
scaler.transform(X)

array([[-0.23118282, -1.71131893,  1.71770304,  1.19857094, -0.51965501],
       [ 0.39217125,  0.97024754, -1.00050853, -0.50892817,  1.77777141],
       [ 1.03980354, -0.86828257,  0.75345647,  1.36421794, -1.03981919],
       [-0.57862217,  0.08660018, -0.02668957, -0.4463902 , -0.4817237 ],
       [-1.75732374,  0.437955  , -0.2542427 , -1.49369334, -0.64979461],
       [ 1.13515394,  1.08479879, -1.18971871, -0.11377717,  0.91322111]])
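
The difference between fit, fit_transform and transform can be illustrated with a small sketch. The arrays X_train and X_test below are made-up toy data, not the samples generated earlier; the sketch fits the scaler on the training rows only, reuses it on the test row, and compares the result with the z-score formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: three training rows and one test row
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # learn mean/std and transform in one step
X_test_std = scaler.transform(X_test)        # reuse the training mean/std

# The same z-score computed by hand: (x - mean) / std, per feature
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(X_train_std, manual))
print(scaler.mean_, scaler.scale_)
print(X_test_std)
```

Because the test row is scaled with the training statistics, its standardized values do not have to fall inside the range seen in training.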

### Min-max normalization

Min-max normalization applies a linear transformation to the original data and maps each feature to the interval [0, 1] (other intervals with fixed minimum and maximum values can also be used).

In [44]:
scaler = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(X)
scaler.transform(X)

array([[0.52762409, 0.        , 1.        , 0.94203914, 0.18461312],
       [0.74313278, 0.95903204, 0.06507834, 0.34457514, 1.        ],
       [0.96703505, 0.30150246, 0.66834995, 1.        , 0.        ],
       [0.40750585, 0.64300551, 0.40002079, 0.36645754, 0.19807544],
       [0.        , 0.76866361, 0.32175449, 0.        , 0.13842486],
       [1.        , 1.        , 0.        , 0.4828408 , 0.69315972]])
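
For the default range [0, 1], the linear transformation is (x - min) / (max - min), computed separately for each feature. The sketch below checks this against MinMaxScaler on a small made-up array (X_demo is an illustrative name, not data from this article).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: three samples, two features
X_demo = np.array([[1.0, -1.0], [2.0, 0.0], [4.0, 3.0]])

col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)  # per-feature min-max formula

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

print(np.allclose(scaled, manual))
print(scaled)
```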

### Normalize

Normalization is an essential operation when you want to compute the similarity of two samples. The idea is to compute the p-norm of each sample and then divide every element of the sample by that norm, so that each sample ends up with a norm of 1. More generally, normalization maps values from different ranges onto the same fixed range; scaling to [0, 1] is a common example.

The following example scales each sample to unit norm:

In [47]:
X = [[ 1., -1.,  2.],
     [ 2.,  0.,  0.],
     [ 0.,  1., -1.]]
X_normalized = preprocessing.normalize(X, norm='l2')
print(X_normalized)

[[ 0.40824829 -0.40824829  0.81649658]
 [ 1.          0.          0.        ]
 [ 0.          0.70710678 -0.70710678]]
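
To make the link to similarity concrete, the sketch below reproduces the L2 normalization by hand and shows that, once rows have unit norm, a plain dot product between two rows equals their cosine similarity. The names X_demo, manual and cos_01 are introduced here only for illustration.

```python
import numpy as np
from sklearn import preprocessing

X_demo = np.array([[1., -1., 2.],
                   [2., 0., 0.],
                   [0., 1., -1.]])

# Divide each row by its own L2 norm
row_norms = np.linalg.norm(X_demo, axis=1, keepdims=True)
manual = X_demo / row_norms

X_normalized = preprocessing.normalize(X_demo, norm='l2')

# For unit-norm rows, the dot product is the cosine similarity
cos_01 = X_normalized[0] @ X_normalized[1]

print(np.allclose(manual, X_normalized))
print(cos_01)
```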


## Dataset splitting

After obtaining a dataset, we usually split it into a training set and a validation set, which helps when selecting model parameters.

train_test_split is the function commonly used for this. It randomly splits the samples into training data and test data according to the given proportion. It lives in sklearn.model_selection (the older sklearn.cross_validation module has been removed from recent releases). In the example below, train_data and train_target stand for your feature matrix and labels:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data, train_target, test_size=0.4, random_state=0)
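
A runnable version of the split, using the iris data from earlier in place of the placeholder names, might look like the following sketch. With test_size=0.4, 60 of the 150 samples go to the test set and 90 to the training set.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# 40% of the samples are held out; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

print(X_train.shape, X_test.shape)  # (90, 4) (60, 4)
print(y_train.shape, y_test.shape)  # (90,) (60,)
```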

## Defining the model

In this step, we first analyze the type of our data and decide which model to use; then we can define that model in sklearn. Sklearn provides very similar interfaces for all models, so once you know one model it is easy to become familiar with the others. The constructors of several common models are listed here.

### Linear regression

In [50]:
from sklearn.linear_model import LinearRegression
# The normalize argument was removed in scikit-learn 1.2; scale the features beforehand if needed
model = LinearRegression(fit_intercept=True, copy_X=True, n_jobs=1)

### Naive Bayes

In [55]:
from sklearn import naive_bayes
model = naive_bayes.GaussianNB()
model = naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)
model = naive_bayes.BernoulliNB(alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)

### Support Vector Machines

In [58]:
from sklearn.svm import SVC
model = SVC(C=1.0, kernel='rbf', gamma='auto')

### K Nearest Neighbors

In [59]:
from sklearn import neighbors
model = neighbors.KNeighborsClassifier(n_neighbors=5, n_jobs=1)
model = neighbors.KNeighborsRegressor(n_neighbors=5, n_jobs=1)
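
The shared interface mentioned above comes down to three calls: fit, predict and score. The sketch below runs the same loop over three of the classifiers defined in this section, using the iris data and a 60/40 split as an assumed setup; the exact accuracies depend on the scikit-learn version and are not quoted here.

```python
from sklearn import datasets, naive_bayes, neighbors
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)

models = {
    "GaussianNB": naive_bayes.GaussianNB(),
    "SVC": SVC(C=1.0, kernel='rbf', gamma='auto'),
    "KNeighborsClassifier": neighbors.KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # train on the training set
    y_pred = model.predict(X_test)              # predicted labels for the test set
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(name, scores[name])
```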

## Model evaluation and selection

Different machine learning tasks use different evaluation metrics, and a single task can often be evaluated with several metrics.

### Cross-validation

The original dataset is divided into groups: one part is used as the training set and the other as the validation (or test) set. The classifier is first trained on the training set and then evaluated on the validation set. In k-fold cross-validation this is repeated so that each group serves once as the validation set.
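
As a minimal sketch of k-fold cross-validation, the following uses cross_val_score from sklearn.model_selection with a k-nearest-neighbors classifier on the iris data. The choice of model, dataset and cv=5 is an assumption made for illustration.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
model = neighbors.KNeighborsClassifier(n_neighbors=5)

# 5-fold cross-validation: train on 4 folds, score on the remaining one, 5 times
scores = cross_val_score(model, iris.data, iris.target, cv=5)

print(scores)         # one accuracy value per fold
print(scores.mean())  # average accuracy across the folds
```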

## scikit-learn LDA topic model overview

In scikit-learn, the LDA topic model is implemented by the class sklearn.decomposition.LatentDirichletAllocation, and its algorithm is mainly based on variational inference (a variational EM-style procedure).

We have three documents, stored in nlp_test0.txt, nlp_test2.txt and nlp_test4.txt.

### Word segmentation

First we segment the text into words with jieba, and save the segmentation results in nlp_test1.txt, nlp_test3.txt, and nlp_test5.txt, respectively.

In [76]:
import jieba
# Ask jieba to keep these names and the place name as single tokens
jieba.suggest_freq('沙瑞金', True)
jieba.suggest_freq('易学习', True)
jieba.suggest_freq('王大路', True)
jieba.suggest_freq('京州', True)
# Segment the first file and save the space-separated result
with open("nlp_test0.txt", encoding='UTF-8') as f:
    document = f.read()
document_cut = jieba.cut(document)
result = ' '.join(document_cut)
with open("nlp_test1.txt", 'w', encoding='UTF-8') as f2:
    f2.write(result)

# Read the segmented text back in
with open("nlp_test1.txt", encoding='UTF-8') as f3:
    res1 = f3.read()
print(res1)

沙瑞金 赞叹 易学习 的 胸怀 ， 是 金山 的 百姓 有福 ， 可是 这件 事对 李达康 的 触动 很大 。 易学习 又 回忆起 他们 三人 分开 的 前一晚 ， 大家 一起 喝酒 话别 ， 易学习 被 降职 到 道口 县当 县长 ， 王大路 下海经商 ， 李达康 连连 赔礼道歉 ， 觉得 对不起 大家 ， 他 最 对不起 的 是 王大路 ， 就 和 易学习 一起 给 王大路 凑 了 5 万块 钱 ， 王大路 自己 东挪西撮 了 5 万块 ， 开始 下海经商 。 没想到 后来 王大路 竟然 做 得 风生水 起 。 沙瑞金 觉得 他们 三人 ， 在 困难 时期 还 能 以沫 相助 ， 很 不 容易 。


In [77]:
# Segment the second file and save the space-separated result
with open("nlp_test2.txt", encoding='UTF-8') as f:
    document = f.read()
document_cut = jieba.cut(document)
result = ' '.join(document_cut)
with open("nlp_test3.txt", 'w', encoding='UTF-8') as f2:
    f2.write(result)

# Read the segmented text back in
with open("nlp_test3.txt", encoding='UTF-8') as f3:
    res2 = f3.read()
print(res2)

沙瑞金 向 毛娅 打听 他们 家 在 京州 的 别墅 ， 毛娅 笑 着 说 ， 王大路 事业有成 之后 ， 要 给 欧阳 菁 和 她 公司 的 股权 ， 她们 没有 要 ， 王大路 就 在 京州 帝豪园 买 了 三套 别墅 ， 可是 李达康 和 易学习 都 不要 ， 这些 房子 都 在 王大路 的 名下 ， 欧阳 菁 好像 去 住 过 ， 毛娅 不想 去 ， 她 觉得 房子 太大 很 浪费 ， 自己 家住 得 就 很 踏实 。


In [82]:
# Segment the third file and save the space-separated result
with open("nlp_test4.txt", encoding='UTF-8') as f:
    document = f.read()
document_cut = jieba.cut(document)
result = ' '.join(document_cut)
with open("nlp_test5.txt", 'w', encoding='UTF-8') as f2:
    f2.write(result)

# Read the segmented text back in
with open("nlp_test5.txt", encoding='UTF-8') as f3:
    res3 = f3.read()
print(res3)

347 年 （ 永和 三年 ） 三月 ， 桓 温兵 至 彭模 （ 今 四川 彭山 东南 ） ， 留下 参军 周楚 、 孙盛 看守 辎重 ， 自己 亲率 步兵 直攻 成都 。 同月 ， 成汉 将领 李福 袭击 彭模 ， 结果 被 孙盛 等 人 击退 ； 而桓温 三 战三胜 ， 一直 逼近 成都 。


### Import stop words from file

In [83]:
stpwrdpath = "stop_words.txt"
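# Read as text (assuming the file is UTF-8, like the corpus files) so that the
# stop words are str and can match the tokens produced by CountVectorizer;
# reading in 'rb' mode would give bytes, which never match in Python 3
with open(stpwrdpath, encoding='UTF-8') as stpwrd_dic:
    stpwrd_content = stpwrd_dic.read()
# convert to list
stpwrdlst = stpwrd_content.splitlines()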

### Convert words into word frequency vectors
Note that because LDA is based on word counts, TF-IDF is generally not used for the document features. The code is shown below:

In [84]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
corpus = [res1,res2,res3]
cntVector = CountVectorizer(stop_words=stpwrdlst)
cntTf = cntVector.fit_transform(corpus)
print(cntTf)

  (0, 47)	2
  (0, 66)	1
  (0, 41)	4
  (0, 59)	1
  (0, 72)	1
  (0, 52)	1
  (0, 42)	1
  (0, 20)	1
  (0, 68)	1
  (0, 10)	1
  (0, 44)	2
  (0, 63)	1
  (0, 36)	1
  (0, 26)	1
  (0, 12)	2
  (0, 4)	2
  (0, 15)	1
  (0, 16)	1
  (0, 28)	2
  (0, 2)	2
  (0, 24)	1
  (0, 64)	1
  (0, 73)	1
  (0, 71)	1
  (0, 17)	1
  :	:
  (2, 35)	2
  (2, 25)	1
  (2, 34)	1
  (2, 8)	1
  (2, 51)	1
  (2, 19)	1
  (2, 23)	1
  (2, 29)	2
  (2, 55)	1
  (2, 67)	1
  (2, 11)	1
  (2, 45)	1
  (2, 53)	1
  (2, 38)	2
  (2, 21)	1
  (2, 37)	1
  (2, 32)	1
  (2, 43)	1
  (2, 61)	1
  (2, 57)	1
  (2, 14)	1
  (2, 58)	1
  (2, 39)	1
  (2, 1)	1
  (2, 70)	1
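
To see how the sparse output relates to actual words, the sketch below runs CountVectorizer on two short, already segmented sentences made up for illustration (with a one-word stop list, also made up). Note that the default token pattern only keeps tokens of two or more characters, so single-character words such as 的 are dropped even without a stop list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-segmented documents (words separated by spaces)
docs = ["易学习 的 胸怀 很大", "王大路 下海经商 王大路 很 踏实"]

vectorizer = CountVectorizer(stop_words=["很大"])
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each kept word to its column index
for word, col in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, word)

dense = counts.toarray()  # rows are documents, columns are words
print(dense)
```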


### Build the LDA topic model
The output above lists the nonzero entries of the document-term matrix: each line gives (document index, word index) followed by the number of times that word occurs in that document. With these word frequency vectors, we can build the LDA topic model. Since we only have three documents, we choose the number of topics K = 2. The code is shown below:

In [87]:
lda = LatentDirichletAllocation(n_components=2, learning_offset=50., random_state=0)
docres = lda.fit_transform(cntTf)

### Print the results

The fit_transform function returns the topic distribution of each document, which we stored in docres. The topic-word distribution is stored in lda.components_.

#### The document-topic distribution is as follows:

In [91]:
print(docres)

[[0.99176826 0.00823174]
 [0.01463255 0.98536745]
 [0.01463255 0.98536745]]


#### The topic-word distribution is as follows:

In [93]:
print(lda.components_)

[[0.50064745 0.50064745 2.49967511 2.49967511 2.49967511 0.50064745
  0.50064745 2.49967511 0.50064745 1.49968362 1.49968362 0.50064745
  2.49967511 1.49968362 0.50064745 1.49968362 1.49968362 1.49968362
  1.49968362 0.50064745 1.49968362 0.50064745 1.49968362 0.50064745
  1.49968362 0.50064745 1.49968362 1.49968362 2.49967511 0.5006523
  1.49968362 2.49967511 0.50064745 1.49968362 0.50064745 0.5006523
  1.49968362 0.50064745 0.5006523  0.50064745 1.49968362 4.49967268
  1.49968362 0.50064745 2.49967511 0.50064745 0.50064745 2.49967511
  1.49968362 0.50064745 5.49967238 0.50064745 1.49968362 0.50064745
  1.49968362 0.50064745 1.49968362 0.50064745 0.50064745 1.49968362
  1.50023484 0.50064745 2.49967511 1.49968362 1.49968362 1.49968362
  1.49968362 0.50064745 1.49968362 1.49968362 0.50064745 1.49968362
  1.49968362 1.49968362 1.49968362]
 [2.49935255 2.49935255 0.50032489 0.50032489 0.50032489 2.49935255
  2.49935255 0.50032489 2.49935255 0.50031638 0.50031638 2.49935255
  0.50032489 0
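
The raw lda.components_ matrix is hard to read on its own. A common way to inspect it is to normalize each row into a word distribution and list the highest-weighted words per topic. The sketch below does this on a tiny made-up English corpus, not on the three documents above; docs, lda_demo and the other names are illustrative only.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus
docs = ["apple banana apple fruit",
        "banana fruit juice apple",
        "engine wheel car engine"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

lda_demo = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda_demo.fit_transform(counts)  # one row per document, rows sum to 1

# Column index -> word, in column order
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Normalize each topic's weights so that they sum to 1
topic_word = lda_demo.components_ / lda_demo.components_.sum(axis=1, keepdims=True)

for k, row in enumerate(topic_word):
    top_words = [vocab[i] for i in row.argsort()[::-1][:3]]
    print("topic", k, top_words)

# Index of the most probable topic for each document
dominant = doc_topics.argmax(axis=1)
print(dominant)
```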

### Conclusion

According to docres, the first document belongs to the first topic with a probability of about 0.99, while the second and third documents belong to the second topic with a probability of about 0.985 each.