# Topic Vectors

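Latent semantic analysis (LSA) lets you represent as a vector not only the meaning of individual words but also the meaning of an entire document.

This chapter covers these semantic, or topic, vectors. Weighted sums of the TF-IDF vector entries give so-called topic scores, and these scores make up the dimensions of the topic vector.

The correlations between normalized term frequencies are used to group words into the same topic, and each such grouping defines one dimension of the new topic vector.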

In [22]:
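import numpy as np

topic = {}
# Made-up TF-IDF values for six words. NumPy's generator is not seeded here
# (random.seed() would not affect np.random), so the values change on every run.
tfidf = dict(zip('cat dog apple lion NYC love'.split(), np.random.rand(6)))

print("Made-up TF-IDF value for each word")
tfidf  # a bare expression mid-cell is not displayed; use print(tfidf) to inspect the values
# Each topic score is a hand-weighted sum of the TF-IDF values.
topic['petness'] = (0.3*tfidf['cat'] + 0.3*tfidf['dog'] + 0*tfidf['apple'] + 0*tfidf['lion'] - 0.2*tfidf['NYC'] + 0.2*tfidf['love'])
topic['animalness'] = (0.1*tfidf['cat'] + 0.1*tfidf['dog'] - 0.1*tfidf['apple'] + 0.5*tfidf['lion'] + 0.1*tfidf['NYC'] - 0.1*tfidf['love'])
# Note: NYC is weighted -0.5 here, as in the recorded run below; a positive weight would be the more natural choice for 'cityness'.
topic['cityness'] = (0*tfidf['cat'] - 0.1*tfidf['dog'] + 0.2*tfidf['apple'] - 0.1*tfidf['lion'] - 0.5*tfidf['NYC'] + 0.1*tfidf['love'])
print(topic)
# Placeholder matrices for the matrix form: 3 topics x 6 words, and a 6 x 1 column of TF-IDF values
topic_m = np.zeros(shape=(3, 6))
print(topic_m)
word_tf = np.zeros(shape=(6, 1))
print(word_tf)

Made-up TF-IDF value for each word
{'petness': 0.3801465633219733, 'animalness': 0.4797383032984889, 'cityness': -0.2865590942806303}
[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]]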
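
The three weighted sums above are the rows of a single matrix product, which is what the placeholder matrices are heading toward. A minimal sketch, assuming the weights from the cell above are stacked into a 3 x 6 matrix; the names `topic_weights` and `tfidf_vec` and the seeded generator are illustrative choices, not part of the original notebook:

```python
import numpy as np

words = 'cat dog apple lion NYC love'.split()
topics = ['petness', 'animalness', 'cityness']

# One row per topic, one column per word (weights copied from the cell above).
topic_weights = np.array([
    [0.3,  0.3,  0.0,  0.0, -0.2,  0.2],   # petness
    [0.1,  0.1, -0.1,  0.5,  0.1, -0.1],   # animalness
    [0.0, -0.1,  0.2, -0.1, -0.5,  0.1],   # cityness
])

# Made-up TF-IDF values; seeded so that reruns give the same numbers.
rng = np.random.default_rng(1)
tfidf_vec = rng.random(len(words))

# (3 x 6) @ (6,) -> one score per topic.
topic_vec = topic_weights @ tfidf_vec
print(dict(zip(topics, topic_vec)))
```

Each entry of `topic_vec` equals the corresponding hand-written sum, so adding a topic only means adding a row to `topic_weights`.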


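# LDA
The LDA (linear discriminant analysis) classifier is a supervised algorithm, so the texts must be labeled, but it needs relatively few training samples.

LDA is a one-dimensional model, so it does not need SVD. For a two-class problem (such as spam versus non-spam), you only compute the centroid (mean) of all the TF-IDF vectors in each class. The model then reduces to the line between the two centroids: the further a TF-IDF vector lies toward one end of that line (measured by the dot product of the vector with the line's direction), the closer it is to that class.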
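
The centroid idea can be tried on a handful of messages before running it on the real SMS data. A minimal sketch, assuming four invented messages and labels (1 = spam); the variable names are illustrative:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy data: two spam and two non-spam messages.
texts = [
    "win free cash prize now",
    "free prize claim now",
    "are we meeting for lunch",
    "see you at lunch tomorrow",
]
labels = np.array([1, 1, 0, 0])

X = TfidfVectorizer().fit_transform(texts).toarray()
mask = labels.astype(bool)

# Centroid (mean TF-IDF vector) of each class.
spam_centroid = X[mask].mean(axis=0)
ham_centroid = X[~mask].mean(axis=0)

# Direction of the line between the centroids; projecting onto it gives the score.
direction = spam_centroid - ham_centroid
scores = X @ direction
print(scores.round(3))
```

Messages that lean toward the spam centroid get larger scores, so a single threshold on `scores` separates the two classes in this toy case.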



In [14]:
import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler

print("Processing training data: ...\n")
train_txt = pd.read_table('sms/train.txt', sep='\t', header=None)
train_txt.columns = ['label', 'text']
label_map = {'ham': 0, 'spam': 1}  # 1 marks a spam message
train_txt['label'] = train_txt['label'].map(label_map)

# train_txt = pd.get_dummies(train_txt, columns=['label'])  # one-hot encode the labels

def pre_clean_text(origin_text):
    # Replace punctuation and any other non-letter characters with spaces
    text = re.sub("[^a-zA-Z]", " ", origin_text)
    # Lowercase everything and tokenize on whitespace
    words = text.lower().split()

    # Join the remaining words back into a single str
    cleaned_text = " ".join(words)

    return cleaned_text

if __name__ == '__main__':

    # Clean the data
    train_txt['text'] = train_txt['text'].apply(pre_clean_text)

    # Drop rows that are empty after cleaning. At test time, a message left empty
    # after cleaning would be treated as spam (this did not occur in the actual tests).
    train_txt = train_txt.loc[train_txt['text'] != '', :]

    # Vectorize the texts with TF-IDF
    tfidf = TfidfVectorizer(
        analyzer="word",
        tokenizer=None,
        preprocessor=None,
        stop_words=None,
        max_features=200)
    word_vict = tfidf.fit_transform(train_txt['text']).toarray()
    print(word_vict.shape)
    print(word_vict[20, :])

    mask = np.array(train_txt['label'].astype(bool))
    print("Got a mask of shape {}".format(mask.shape))
    spam_centroid = word_vict[mask].mean(axis=0).round(2)  # axis=0: column means
    print("Spam centroid shape:", spam_centroid.shape)
    # ~mask selects the non-spam rows (the original reused mask, which made both centroids identical)
    ham_centroid = word_vict[~mask].mean(axis=0).round(2)
    print("Ham centroid shape:", ham_centroid.shape)

    # Direction of the line between the two centroids
    sh = spam_centroid - ham_centroid
    print(sh.shape)
    # Project every message onto that direction
    spamscore = word_vict @ sh
    spamscore2 = word_vict @ ham_centroid

    # reshape(-1, 1) turns the scores into a single column, as MinMaxScaler expects
    spam1 = MinMaxScaler().fit_transform(spamscore.reshape(-1, 1))
    spam2 = MinMaxScaler().fit_transform(spamscore2.reshape(-1, 1))

    train_txt['lda_score'] = spam1
    train_txt['lda_pred'] = (train_txt['lda_score'] > 0.2).astype(int)
    train_txt

Processing training data: ...

(5161, 200)
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.6447955  0.         0.
 0.47170963 0.         0.         0.         0.35643069 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.       
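
Once each message has a score, turning it into a prediction takes two more steps: rescale the scores to [0, 1] and compare them with a threshold. A small sketch on hypothetical scores and labels (the numbers below are invented, and the 0.5 threshold is an arbitrary choice for illustration, not the 0.2 used in the cell above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical projections onto the spam-minus-ham direction, with true labels.
raw_scores = np.array([-0.30, -0.10, 0.05, 0.40, 0.70])
labels = np.array([0, 0, 0, 1, 1])

# Min-max scaling maps the smallest score to 0 and the largest to 1.
scaled = MinMaxScaler().fit_transform(raw_scores.reshape(-1, 1)).ravel()

# Anything above the threshold is predicted as spam.
threshold = 0.5
pred = (scaled > threshold).astype(int)

accuracy = (pred == labels).mean()
print(scaled.round(2), pred, accuracy)
```

On real data the threshold is worth tuning against labeled examples, because min-max scaling is sensitive to a single extreme score.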

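# Latent Semantic Analysis
Under the hood, LSA is SVD (singular value decomposition): SVD factors the TF-IDF matrix into three matrices, and dimensions are then dropped according to their share of the variance (information load). When SVD is used this way in NLP, it is called latent semantic analysis (LSA).

LSA uncovers the semantics, or meaning, of words that is hidden and waiting to be discovered.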

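LSA is a mathematical technique for finding the best linear transformation (rotation and stretching) of any set of NLP vectors, such as TF-IDF vectors or bag-of-words vectors. For many applications, the best transformation aligns the coordinate axes (dimensions) of the new vectors with the directions of greatest variance in the word frequencies. You can then drop the dimensions of the new vector space that contribute little to the differences between document vectors.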
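
The decomposition and the truncation can be seen on a tiny matrix with plain NumPy. A minimal sketch, assuming an invented 4 x 3 document-term matrix; the share of squared singular values is used here as a stand-in for the variance share (the two coincide only when the columns are centered):

```python
import numpy as np

# Invented document-term matrix: 4 documents (rows) x 3 terms (columns).
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

# SVD factors A into three matrices: U, the singular values s, and Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Share of the total squared singular values carried by each dimension.
explained = s**2 / np.sum(s**2)

# Keep only the k strongest dimensions.
k = 2
doc_topic = U[:, :k] * s[:k]    # documents in the reduced topic space
A_k = doc_topic @ Vt[:k, :]     # rank-k approximation of A

print(explained.round(3))
print(np.linalg.norm(A - A_k))  # approximation error
```

The singular values come back sorted from largest to smallest, so dropping the trailing dimensions discards the directions that contribute least.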

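> LSA steps
1. Build a TF-IDF matrix (or another document-term matrix), with documents (doc) as rows and terms as columns
<gif url="https://pic4.zhimg.com/80/v2-288292d4fd98b748c4b5e37786c06bd3_720w.jpg"/>
2. Apply SVD to the matrix
3. Choose the number of dimensions to keep, based on the desired number of topics or on the cumulative share of variance
4. Reduce the original matrix to that many dimensions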
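
The four steps above can be sketched with scikit-learn, using `TruncatedSVD` for the decomposition. The six toy documents and the choice of two topics are assumptions made for illustration only:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus.
docs = [
    "the cat chased the dog",
    "my dog loves my cat",
    "a lion is a big cat",
    "NYC is a big city",
    "I love the city lights of NYC",
    "apple pie in the city",
]

# Step 1: document-term TF-IDF matrix (rows = documents, columns = terms).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Steps 2 and 3: SVD, keeping a chosen number of topics.
svd = TruncatedSVD(n_components=2, random_state=0)

# Step 4: project every document into the reduced topic space.
doc_topics = svd.fit_transform(X)

print(doc_topics.shape)
# Cumulative share of variance kept, which can guide the choice in step 3.
print(svd.explained_variance_ratio_.cumsum())
```

`svd.components_` holds the weight of every term in each topic, which is the learned counterpart of hand-written topic weights.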


In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 

In [6]:



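# Assumption: the code that defined `documents` is missing from this cell; this call mirrors the loadData() cell further down.
from sklearn.datasets import fetch_20newsgroups
documents = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes')).data
len(documents)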

INFO:sklearn.datasets._twenty_newsgroups:Downloading 20news dataset. This may take a few minutes.
INFO:sklearn.datasets._twenty_newsgroups:Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


11314

In [16]:
!pip install PandasGUI

 

In [17]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd 
from pandasgui import show
def loadData():
    '''Load the "20 Newsgroups" dataset via sklearn.
    '''
    dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
    return pd.DataFrame(dataset.data)

 
    
if __name__=="__main__":
    print("Convert to a DataFrame for inspection")
    document=loadData()
    print(type(document))
    print(document.describe())
    print(document.head(15))
    show(document)
 

ImportError: DLL load failed while importing QtWebEngineWidgets: The specified module could not be found.

In [21]:
import PyQt5.QtWebEngineWidgets

ImportError: DLL load failed while importing QtWebEngineWidgets: The specified module could not be found.

In [3]:
from collections.abc import Mapping
from nlpia.book.examples.ch04_catdog_lsa_3x6x16 import word_topic_vectors
print(word_topic_vectors.shape)

  [datetime.datetime, pd.datetime, pd.Timestamp])
  MIN_TIMESTAMP = pd.Timestamp(pd.datetime(1677, 9, 22, 0, 12, 44), tz='utc')
  np = pd.np
  np = pd.np
INFO:nlpia.constants:Starting logger in nlpia.constants...
  np = pd.np
  np = pd.np
INFO:nlpia.loaders:No BIGDATA index found in c:\Users\tomis\.conda\envs\py3.8\lib\site-packages\nlpia\data\bigdata_info.csv so copy c:\Users\tomis\.conda\envs\py3.8\lib\site-packages\nlpia\data\bigdata_info.latest.csv to c:\Users\tomis\.conda\envs\py3.8\lib\site-packages\nlpia\data\bigdata_info.csv if you want to "freeze" it.
INFO:nlpia.futil:Reading CSV with `read_csv(*('c:\\Users\\tomis\\.conda\\envs\\py3.8\\lib\\site-packages\\nlpia\\data\\mavis-batey-greetings.csv',), **{'low_memory': False})`...
INFO:nlpia.futil:Reading CSV with `read_csv(*('c:\\Users\\tomis\\.conda\\envs\\py3.8\\lib\\site-packages\\nlpia\\data\\sms-spam.csv',), **{'low_memory': False})`...


UnicodeDecodeError: 'gbk' codec can't decode byte 0x94 in position 7333: illegal multibyte sequence