# Topic Vectors

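Latent semantic analysis (LSA) lets you represent not only the meaning of individual words as vectors, but also the meaning of entire documents.

This chapter covers these semantic vectors, also called topic vectors. Weighted sums of TF-IDF scores give so-called topic scores, and those scores form the dimensions of a topic vector.

Correlations between normalized term frequencies are used to group words into the same topic, and each such group defines one dimension of the new topic vectors.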

In [22]:
import numpy as np 
topic={}
# np.random.rand() is not seeded here, so the values differ from run to run;
# call np.random.seed(...) first if reproducible numbers are needed
# (random.seed() only seeds Python's own random module, not NumPy).
tfidf= dict(list(zip('cat dog apple lion NYC love'.split(),np.random.rand(6))))

print("Made-up TF-IDF value for each word")
tfidf
topic['petness']=(0.3*tfidf['cat']+0.3*tfidf['dog']+0*tfidf['apple']+0*tfidf['lion']-0.2*tfidf['NYC']+0.2*tfidf['love'])
topic['animalness']=(0.1*tfidf['cat']+0.1*tfidf['dog']-0.1*tfidf['apple']+0.5*tfidf['lion']+0.1*tfidf['NYC']-0.1*tfidf['love'])
topic['cityness']=(0*tfidf['cat']-0.1*tfidf['dog']+0.2*tfidf['apple']-0.1*tfidf['lion']-0.5*tfidf['NYC']+0.1*tfidf['love'])
print(topic)
# Build the corresponding matrices (topics x words, and words x 1)
topic_m=np.zeros(shape=(3,6))
print(topic_m)
word_tf=np.zeros(shape=(6,1))
print(word_tf)

Made-up TF-IDF value for each word
{'petness': 0.3801465633219733, 'animalness': 0.4797383032984889, 'cityness': -0.2865590942806303}
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
[[0.]
 [0.]
 [0.]
 [0.]
 [0.]]
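
The zero matrices above are only placeholders. As a rough sketch of where they lead, the three hand-written weighted sums can be expressed as a single matrix-vector product. The weights are copied from the cell above; the TF-IDF values are fixed illustrative numbers (an assumption, not statistics from a real corpus).

```python
import numpy as np

words = 'cat dog apple lion NYC love'.split()
topics = ['petness', 'animalness', 'cityness']

# Rows are topics, columns are words; weights copied from the cell above.
topic_m = np.array([
    [0.3,  0.3,  0.0,  0.0, -0.2,  0.2],   # petness
    [0.1,  0.1, -0.1,  0.5,  0.1, -0.1],   # animalness
    [0.0, -0.1,  0.2, -0.1, -0.5,  0.1],   # cityness
])

# Illustrative TF-IDF values for one document (assumed, not from a corpus).
word_tf = np.array([[0.4], [0.7], [0.0], [0.3], [0.1], [0.1]])

# (3, 6) @ (6, 1) -> (3, 1): one score per topic.
topic_vec = topic_m @ word_tf
topic_scores = dict(zip(topics, topic_vec.ravel()))
print(topic_scores)
```

Each row of `topic_m` plays the role of one `topic[...]` formula, so adding a topic means adding a row rather than writing another sum by hand.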


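# LDA
The LDA (linear discriminant analysis) classifier is a supervised algorithm, so the text has to be labeled, but it needs relatively few training samples.

LDA is a one-dimensional model, so it does not need SVD. For a binary problem (such as spam vs. non-spam), it is enough to compute the centroid (mean) of all TF-IDF vectors in each class. The model then reduces to the line between these two centroids: the dot product of a TF-IDF vector with this line shows how far along the line the vector falls, and therefore which of the two classes it is closer to.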
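
The centroid idea can be sketched on a toy matrix before running it on real SMS data. The numbers below are made up, and the midpoint threshold is an assumed decision rule chosen for illustration (the notebook cell further down uses min-max scaling and a fixed cutoff instead).

```python
import numpy as np

# Toy TF-IDF-like matrix: 6 documents x 3 terms (values are made up).
X = np.array([
    [0.9, 0.1, 0.0],   # spam
    [0.8, 0.0, 0.1],   # spam
    [0.7, 0.2, 0.0],   # spam
    [0.1, 0.8, 0.3],   # ham
    [0.0, 0.9, 0.2],   # ham
    [0.2, 0.7, 0.4],   # ham
])
is_spam = np.array([1, 1, 1, 0, 0, 0], dtype=bool)

# One centroid (column-wise mean) per class.
spam_centroid = X[is_spam].mean(axis=0)
ham_centroid = X[~is_spam].mean(axis=0)

# The line from the ham centroid to the spam centroid is the whole model.
direction = spam_centroid - ham_centroid

# Project every document onto that line with a dot product.
scores = X @ direction

# Assumed decision rule: cut at the projection of the midpoint between centroids.
threshold = ((spam_centroid + ham_centroid) / 2) @ direction
pred = scores > threshold
print(pred.astype(int))
```

Note that the two centroids come from complementary row selections (`is_spam` and `~is_spam`); using the same mask for both would make `direction` a zero vector.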



In [14]:
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.linear_model import LogisticRegression as LR
from sklearn.model_selection import cross_val_score
import numpy as np
print("Processing training data: ...\n")
train_txt = pd.read_table('sms/train.txt',sep='\t',header=None)  
train_txt.columns = ['label', 'text']
label_map = {'ham': 0, 'spam': 1 }  # 1 means spam
train_txt['label'] = train_txt['label'].map(label_map)

#train_txt = pd.get_dummies(train_txt, columns=['label'])  # one-hot encode the labels

def pre_clean_text(origin_text):
    # Replace punctuation and any other non-letter characters with spaces
    text = re.sub("[^a-zA-Z]", " ", origin_text)
    # Lowercase everything and tokenize on whitespace
    words = text.lower().split()

    # Join the remaining tokens back into a single str
    cleaned_text = " ".join(words)
    
    return cleaned_text

if __name__=='__main__':

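    # Clean the data
    train_txt['text'] = train_txt['text'].apply(pre_clean_text)

    # Drop rows that are empty after cleaning. At test time, a message left empty after cleaning is treated as spam (this did not occur in practice)
    #print(train_txt.shape)
    train_txt = train_txt.loc[train_txt['text'] != '',:]
    # Inspect the data
     
    #print(train_txt.shape)
    # Vectorize the text with TF-IDF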
    
    tfidf = TfidfVectorizer (
    analyzer="word",
    tokenizer=None,
    preprocessor=None,
    stop_words=None,
    max_features=200)
    word_vict=tfidf.fit_transform(train_txt['text']).toarray()
    print(word_vict.shape)
    print(word_vict[20,:])
    
    mask=np.array(train_txt['label'].astype(bool))
    print("Mask shape: {}".format(mask.shape))
    spam_centroid=word_vict[mask].mean(axis=0).round(2)  # axis=0 averages each column
    print("Spam centroid shape:",spam_centroid.shape)
    # The ham centroid averages the non-spam rows, hence ~mask
    ham_centroid=word_vict[~mask].mean(axis=0).round(2)
 
    print("Ham centroid shape:",ham_centroid.shape)
    sh=spam_centroid-ham_centroid
    print(sh.shape)
    spamscore=word_vict@sh 
   
    spamscore2=word_vict@ham_centroid
     
    from sklearn.preprocessing import MinMaxScaler 
    spam1=MinMaxScaler().fit_transform(spamscore.reshape(-1,1))  # reshape(-1,1) turns the scores into a single column
 
    spam2=MinMaxScaler().fit_transform(spamscore2.reshape(-1,1))
  
    train_txt['lda_score']=spam1
    train_txt['lda_pred']=(train_txt['lda_score']>0.2).astype(int)
    train_txt

Processing training data: ...

(5161, 200)
[0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.6447955  0.         0.
 0.47170963 0.         0.         0.         0.35643069 0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.       

# Latent Semantic Analysis (LSA)
> References
> * https://zhuanlan.zhihu.com/p/46376672
> * https://zhuanlan.zhihu.com/p/144367432

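LSA is built on SVD: the TF-IDF matrix is factored into three matrices, and dimensions are then dropped according to their variance contribution (information load). When SVD is applied in NLP, it is called latent semantic analysis (LSA).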

LSA uncovers the semantics, or meaning, of words that is hidden and waiting to be discovered.
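
To make the three-matrix factorization concrete, here is a small sketch on a made-up 4 x 3 term-document matrix. The squared singular values are used as a stand-in for variance contribution, which is exact only for mean-centered data; treat the ratio as an approximation.

```python
import numpy as np

# Toy term-document matrix: 4 terms (rows) x 3 documents (columns); values are made up.
A = np.array([
    [1.0, 0.9, 0.0],
    [0.8, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
])

# Thin SVD: U is (4, 3), s holds 3 singular values, Vt is (3, 3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Multiplying the three factors back together recovers A.
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))

# Approximate share of the total carried by each singular direction.
explained = s**2 / np.sum(s**2)
print(explained.round(3))
```

`np.linalg.svd` returns the singular values sorted from largest to smallest, so the leading directions are always the ones that carry the most information.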

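LSA is a mathematical technique for finding the best linear transformation (rotation and stretch) of any set of NLP vectors, such as TF-IDF vectors or bag-of-words vectors. For many applications, the best transformation aligns the axes (dimensions) of the new vectors with the directions of greatest variance in the word frequencies. Dimensions of the new vector space that contribute little to the variation between document vectors can then be dropped.

<font color='red'>Relationship between words, documents, and SVD in LSA</font>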
<img src="https://pic2.zhimg.com/v2-b3eb29a45d1fc11f57b858fd5af7571d_r.jpg"/>
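> LSA steps
1. Build a TF-IDF (or other) document-term matrix, with documents (doc) as rows and terms (term) as columns
<img src="https://pic4.zhimg.com/80/v2-288292d4fd98b748c4b5e37786c06bd3_720w.jpg"/>
2. Apply SVD to the matrix
3. Choose how many dimensions to keep, based on the number of topics or the cumulative variance contribution ratio
4. Reduce the original matrix to that many dimensions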
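
The steps above can be sketched end to end with NumPy alone. This is a minimal illustration under stated assumptions: a random matrix stands in for the TF-IDF matrix of step 1, and the 90% cutoff is an arbitrary choice rather than a recommended value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (stand-in): a random non-negative 8 x 5 document-term matrix.
# A real pipeline would put TF-IDF values here.
X = rng.random((8, 5))

# Step 2: thin SVD of the document-term matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Step 3: keep the fewest components whose cumulative share reaches 90%.
ratio = np.cumsum(s**2) / np.sum(s**2)
k = int(np.searchsorted(ratio, 0.9) + 1)

# Step 4: project the documents onto the first k topic directions.
X_topics = X @ Vt[:k].T          # shape (8, k)
print(k, X_topics.shape)
```

Because the rows here are documents, the topic directions live in the rows of `Vt`. If the matrix is transposed first (terms as rows), the same directions appear in the columns of `U` instead.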


In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
 

In [8]:
!pip install nltk

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple


In [None]:
import nltk

nltk.set_proxy('SYSTEM PROXY')

nltk.download('stopwords')

In [31]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd 
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer as tf
import numpy as np 
 
def loadData():
    '''Load the "20 Newsgroups" dataset through sklearn.
    '''
    dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))
    return pd.DataFrame(dataset.data)

 
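def clearData(df:pd.DataFrame):
    """Clean the text before anything else: strip punctuation, digits, and special characters, drop short words (they rarely carry useful information), then lowercase everything.

    Args:
        df (pd.DataFrame): raw text in a 'document' column

    Returns:
        pd.DataFrame: the same frame with a 'clean_doc' column added
    """
    # Replace everything except letters (and '#') with a space.
    # regex=True is required: pandas >= 2.0 treats the pattern as a literal string by default.
    df['clean_doc'] = df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)
    # Remove short words (3 characters or fewer)
    df['clean_doc'] = df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>3]))
    # Make all text lowercase
    df['clean_doc'] = df['clean_doc'].apply(lambda x: x.lower())
    return df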

def stopWords(df:pd.DataFrame):
    """Remove stop words that carry little meaning, such as "it", "they", "am", "been", "about", "because", and "while". The text is tokenized (split into individual tokens or words), the stop words are dropped, and the remaining tokens are joined back together.

    Args:
        df (pd.DataFrame): frame with a 'clean_doc' column

    Returns:
        pd.DataFrame: the same frame with stop words removed from 'clean_doc'
    """
    stop_words = set(stopwords.words('english'))  # a set gives fast membership tests
    # Tokenize
    tokenized_doc = df['clean_doc'].apply(lambda x: x.split())
    # Remove stop words
    tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
    # De-tokenize: join each token list back into one string
    df['clean_doc'] = tokenized_doc.apply(' '.join)
    return df  

def vec_words(document:pd.Series,max_feature:int=200)->tuple:
    """Vectorize the documents with sklearn; each element is a TF-IDF weight. See the TfidfVectorizer parameters.

    Args:
        document (pd.Series): cleaned documents, one string per row
        max_feature (int): maximum vocabulary size

    Returns:
        tuple: (TF-IDF array of shape (n_docs, n_terms), vocabulary dict, vocabulary size)
    """
    tfidf =tf(analyzer="word",tokenizer=None,max_features=max_feature)
    vec=tfidf.fit_transform(document).toarray()  # fit and transform in a single pass
    lexcion=tfidf.vocabulary_  # mapping from term to column index
    lexcion_len=len(lexcion)   # vocabulary size
    return vec,lexcion,lexcion_len

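def svd(vec:np.array):
    # The vectorized text has documents as rows and words as columns, while the SVD formulation used here expects words as rows and documents as columns, so transpose first
    vecT=vec.T
    # full_matrices=False avoids building a full (n_docs, n_docs) vt matrix
    u,sigma,vt=np.linalg.svd(vecT, full_matrices=False)
    return u,sigma,vt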
    

if __name__=="__main__":
    print("Load into a DataFrame and inspect")
    documents=loadData()
    print(documents.head(3))
 
    print(documents.describe())
 
    # Clean the data
 
    documents.columns=['document']
 
    documents= clearData(documents)
    print(documents.head(3))
    # TF-IDF vectorization
    vec,lexcion,lexcion_len=vec_words(documents['clean_doc'])
    print("Vectorized matrix shape: {}, vocabulary size: {}".format(vec.shape,lexcion_len))
    print("Vocabulary \n",lexcion)
    #print("Vectorized text\n ",vec[:8,:])
    
    u,sigma,vt=svd(vec)
    print('sigma:',sigma.shape)
 

Load into a DataFrame and inspect
                                                   0
0  Well i'm not sure about the story nad it did s...
1  \n\n\n\n\n\n\nYeah, do you expect people to re...
2  Although I realize that principle is not one o...
            0
count   11314
unique  10994
top          
freq      218




                                            document  \
0  Well i'm not sure about the story nad it did s...   
1  \n\n\n\n\n\n\nYeah, do you expect people to re...   
2  Although I realize that principle is not one o...   

                                           clean_doc  
0  well sure about story seem biased what disagre...  
1  yeah expect people read actually accept hard a...  
2  although realize that principle your strongest...  
Vectorized matrix shape: (11314, 200), vocabulary size: 200
Vocabulary 
 {'well': 182, 'sure': 152, 'about': 1, 'what': 184, 'with': 192, 'your': 199, 'that': 159, 'most': 99, 'world': 195, 'having': 68, 'such': 150, 'have': 67, 'them': 161, 'least': 82, 'same': 135, 'think': 168, 'might': 97, 'they': 165, 'more': 98, 'from': 57, 'government': 63, 'some': 143, 'after': 4, 'look': 90, 'other': 111, 'when': 185, 'power': 122, 'people': 115, 'read': 130, 'actually': 3, 'hard': 66, 'need': 103, 'little': 88, 'these': 164, 'just': 78, 'will': 189, 'maybe': 95, 'much': 100, 'would': 196, 's
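
One way to read the result of `svd()` is to list the terms with the largest weights in each column of `u`. The sketch below uses a tiny hypothetical vocabulary and matrix as stand-ins for the notebook's `lexcion` and `vec`; the helper `top_terms` is an illustrative name, not part of any library.

```python
import numpy as np

# Hypothetical stand-ins for the notebook's `lexcion` and `vec`.
lexcion = {'cat': 0, 'dog': 1, 'city': 2, 'street': 3}
vec = np.array([
    [1.0, 0.9, 0.0, 0.0],   # documents about pets
    [0.8, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.4],   # documents about cities
    [0.0, 0.0, 0.3, 0.5],
])

# Same orientation as svd() above: rows are terms after the transpose.
u, sigma, vt = np.linalg.svd(vec.T, full_matrices=False)

# Invert the term -> index mapping so row i of u can be named.
index_to_term = {i: t for t, i in lexcion.items()}

def top_terms(topic: int, n: int = 2) -> list:
    """Return the n terms with the largest absolute weight in one topic."""
    order = np.argsort(-np.abs(u[:, topic]))[:n]
    return [index_to_term[int(i)] for i in order]

for topic in range(2):
    print(topic, top_terms(topic))
```

Absolute values are used because the sign of a singular vector is arbitrary: a topic and its negation describe the same direction.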

In [3]:
from collections.abc import Mapping
from nlpia.book.examples.ch04_catdog_lsa_3x6x16 import word_topic_vectors
print(word_topic_vectors.shape)

INFO:nlpia.constants:Starting logger in nlpia.constants...
INFO:nlpia.loaders:No BIGDATA index found in c:\Users\tomis\.conda\envs\py3.8\lib\site-packages\nlpia\data\bigdata_info.csv so copy c:\Users\tomis\.conda\envs\py3.8\lib\site-packages\nlpia\data\bigdata_info.latest.csv to c:\Users\tomis\.conda\envs\py3.8\lib\site-packages\nlpia\data\bigdata_info.csv if you want to "freeze" it.
INFO:nlpia.futil:Reading CSV with `read_csv(*('c:\\Users\\tomis\\.conda\\envs\\py3.8\\lib\\site-packages\\nlpia\\data\\mavis-batey-greetings.csv',), **{'low_memory': False})`...
INFO:nlpia.futil:Reading CSV with `read_csv(*('c:\\Users\\tomis\\.conda\\envs\\py3.8\\lib\\site-packages\\nlpia\\data\\sms-spam.csv',), **{'low_memory': False})`...


UnicodeDecodeError: 'gbk' codec can't decode byte 0x94 in position 7333: illegal multibyte sequence