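# Text Retrieval Assignment: Supplementary Exercises

This assignment covers the statistics of word/context-word tables built with table joins; see page 8 of this lecture's slides. It is meant to help you consolidate join and aggregation operations in pandas.

First, a few examples of table joins and groupby: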

In [1]:
# Compute the Cartesian product (all pairwise combinations)
# how='cross': creates the cartesian product from both frames, preserves the order of the left keys.
# Added in pandas 1.2.0

import pandas as pd

df1 = pd.DataFrame({'left': ['foo', 'bar']})
df2 = pd.DataFrame({'right': [7, 8]})
df3 = df1.merge(df2, how='cross')
df3

|    | left | right |
| ---- | ---- | ---- |
| 0 | foo | 7 |
| 1 | foo | 8 |
| 2 | bar | 7 |
| 3 | bar | 8 |


In [2]:
# groupby on a MultiIndex: use the level parameter to pick the index level to group by
arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
          ['Captive', 'Wild', 'Captive', 'Wild']]
index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
                  index=index)
display(df)

df1 = df.groupby(level=0).mean()
df2 = df.groupby(level="Type").mean()

display(df1, df2)

| Animal | Type | Max Speed |
| ---- | ---- | ---- |
| Falcon | Captive | 390.0 |
| Falcon | Wild | 350.0 |
| Parrot | Captive | 30.0 |
| Parrot | Wild | 20.0 |


| Animal | Max Speed |
| ---- | ---- |
| Falcon | 370.0 |
| Parrot | 25.0 |


| Type | Max Speed |
| ---- | ---- |
| Captive | 210.0 |
| Wild | 185.0 |


## Exercises

First, build a table from a set of sentences, where each row represents one document:

In [3]:
# Setup
import pandas as pd
import re
import numpy as np

doc = ['I know that the day will come when my sight of this earth shall be lost,',
       'life will take its leave in silence, drawing the last curtain over my eyes.',
       'Yet stars will watch at night, and morning rise as before,',
       'and hours heave like sea waves casting up pleasures and pains.',
       'When I think of this end of my moments, the barrier of the moments breaks',
       'and I see by the light of death thy world with its careless treasures.',
       'Rare is its lowliest seat, rare is its meanest of lives.',
       'Things that I longed for in vain and things that I got, let them pass.',
       'Let me but truly possess the things that I ever spurned and overlooked.'
      ]
doc = [re.sub('[^A-Za-z0-9 ]+', '', i).lower() for i in doc]

doc_id = [i + 1 for i in range(len(doc))]

# Each row of the table represents one document
data = pd.DataFrame({'doc_id': doc_id, 'doc': doc})
data

|    | doc_id | doc |
| ---- | ---- | ---- |
| 0 | 1 | i know that the day will come when my sight of... |
| 1 | 2 | life will take its leave in silence drawing th... |
| 2 | 3 | yet stars will watch at night and morning rise... |
| 3 | 4 | and hours heave like sea waves casting up plea... |
| 4 | 5 | when i think of this end of my moments the bar... |
| 5 | 6 | and i see by the light of death thy world with... |
| 6 | 7 | rare is its lowliest seat rare is its meanest ... |
| 7 | 8 | things that i longed for in vain and things th... |
| 8 | 9 | let me but truly possess the things that i eve... |


Next, apply a simple tokenization: doc_id identifies the document a word belongs to, and position is the word's position within that document.

In [4]:
split_words = data['doc'].str.split(' ', expand=True).stack().rename('word').reset_index()
new_data = pd.merge(data['doc_id'], split_words, left_index=True, right_on='level_0')
new_data.drop('level_0', axis=1, inplace=True)
new_data.rename(columns={'level_1': 'position'}, inplace=True)
new_data.head(10)

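|    | doc_id | position | word |
| ---- | ---- | ---- | ---- |
| 0 | 1 | 0 | i |
| 1 | 1 | 1 | know |
| 2 | 1 | 2 | that |
| 3 | 1 | 3 | the |
| 4 | 1 | 4 | day |
| 5 | 1 | 5 | will |
| 6 | 1 | 6 | come |
| 7 | 1 | 7 | when |
| 8 | 1 | 8 | my |
| 9 | 1 | 9 | sight |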
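
The same doc_id / position / word layout can also be reached without a merge. The sketch below is one possible alternative, not the method used in the cell above: it relies on `explode` and `groupby(...).cumcount()`, and the two-document table `toy` is a made-up stand-in for `data`.

```python
import pandas as pd

# Toy input (hypothetical), shaped like the `data` table above
toy = pd.DataFrame({"doc_id": [1, 2], "doc": ["a b c a", "b d"]})

# Split each document into a list of words, then emit one row per word
tokens = (
    toy.assign(word=toy["doc"].str.split(" "))
       .explode("word")
       .drop(columns="doc")
       .reset_index(drop=True)
)

# Number the words inside each document, starting from 0
tokens["position"] = tokens.groupby("doc_id").cumcount()
tokens = tokens[["doc_id", "position", "word"]]
print(tokens)
```

Because `cumcount` restarts for every doc_id, position plays the same role here as the `level_1` column renamed in the cell above.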


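### 1. Build a table of (word, context word) pairs
Here the "context words" of a word are all the words that appear in the same document. For example, for the document "A B C A" the resulting table must contain the pairs A-A, A-B, A-C, A-A; B-A, B-B, B-C, B-A; and so on for every word paired with every context word.

The result does not have to use a two-level index (MultiIndex); any form that shows the pairwise combinations of words is acceptable.

Pairs that match a token with itself at the same position (the first A-A in the example above) may be kept, or filtered out.

Hint: computing a Cartesian product with how="cross" in merge requires pandas 1.2.0 or later.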
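
On an older pandas where `how="cross"` is unavailable, a common workaround is to join on a constant helper column. The sketch below assumes a temporary column named `_key` (a made-up name) and reuses the two small frames from the first example.

```python
import pandas as pd

left = pd.DataFrame({"left": ["foo", "bar"]})
right = pd.DataFrame({"right": [7, 8]})

# Give every row the same temporary key so an ordinary join
# matches each left row with each right row
crossed = (
    left.assign(_key=1)
        .merge(right.assign(_key=1), on="_key")
        .drop(columns="_key")
)
print(crossed)
```

The result has len(left) * len(right) rows, the same size as a cross join.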

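Hint: the pairs must be generated within each document. One approach is to groupby doc_id and call apply on each group, merging the rows of the original table whose doc_id equals the group's doc_id with the group itself using merge(how='cross'). The result can then be tidied up (rebuilding the index with reset_index, dropping unneeded columns, and so on) to simplify the later steps.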
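
A different route to the same pairs, shown here only as a sketch, is to merge the token table with itself on doc_id, which yields the Cartesian product inside each document without groupby/apply. The table below encodes the "A B C A" example; the `_ctx` suffix and the `context` column name are choices made for this illustration.

```python
import pandas as pd

# The "A B C A" document from the example, already tokenized
tokens = pd.DataFrame({
    "doc_id": [1, 1, 1, 1],
    "position": [0, 1, 2, 3],
    "word": ["A", "B", "C", "A"],
})

# Joining the table to itself on doc_id pairs every token
# with every token of the same document
pairs = tokens.merge(tokens, on="doc_id", suffixes=("", "_ctx"))

# Drop only the pairs that match a token with itself at the same position
pairs = pairs[pairs["position"] != pairs["position_ctx"]]
pairs = (
    pairs.rename(columns={"word_ctx": "context"})
         [["doc_id", "word", "context"]]
         .reset_index(drop=True)
)
print(pairs)
```

Filtering on position rather than on the word keeps the A-A pair formed by the two different occurrences of A.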

In [5]:
#TODO
df = new_data.groupby('doc_id').apply(lambda x: pd.merge(new_data[new_data.doc_id == x.doc_id.min()],x,how='cross'))

In [6]:
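# Cross-join each document's tokens with themselves to get every (word, context) pair in that document
df = new_data.groupby('doc_id').apply(lambda x: pd.merge(new_data[new_data.doc_id == x.doc_id.min()],x,how='cross'))
# Note: comparing word_x with word_y drops every pair of identical words, including the same
# word at two different positions; comparing position_x with position_y instead would drop
# only the same-position self-pairs.
df = df[df.word_x != df.word_y].drop(['doc_id_y','position_x','position_y'],axis=1).rename(columns={'doc_id_x':'doc_id','word_x':'word','word_y':'context'})
df.reset_index(drop=True)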

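|    | doc_id | word | context |
| ---- | ---- | ---- | ---- |
| 0 | 1 | i | know |
| 1 | 1 | i | that |
| 2 | 1 | i | the |
| 3 | 1 | i | day |
| 4 | 1 | i | will |
| ... | ... | ... | ... |
| 1481 | 9 | overlooked | that |
| 1482 | 9 | overlooked | i |
| 1483 | 9 | overlooked | ever |
| 1484 | 9 | overlooked | spurned |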


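### 2. Compute TF
Group by the (word, context word) pair and count, which gives the TF of each context word for each word. Count the occurrences across all documents; there is no need to count per document.

If groupby plus the aggregation returns a Series, convert it to a DataFrame.

The output format is not prescribed. You can use a MultiIndex, roughly like this:

|  Word  | Context word | TF |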
|  ----  | ----  | ----  |
| word1  | word2 | 3 |
|   | word4 | 2 |
| word2  | word1 | 3 |
|   | word3 | 5 |

Alternatively, flatten the index and list every row in full, roughly like this:

|    |  Word  | Context word | TF |
| ---- |  ----  | ----  | ----  |
| 0 | word1  | word2 | 3 |
| 1 | word1  | word4 | 2 |
| 2 | word2  | word1 | 3 |
| 3 | word2  | word3 | 5 |
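
Both layouts can be produced from the same grouped Series. The following is a minimal sketch on a made-up `pairs` table; it uses `size()` to count rows per group, which is one of several ways to count.

```python
import pandas as pd

# Hypothetical (word, context) pairs
pairs = pd.DataFrame({
    "word":    ["a", "a", "a", "b", "b"],
    "context": ["b", "b", "c", "a", "a"],
})

# size() counts the rows in each group and returns a Series
# indexed by a (word, context) MultiIndex
tf_series = pairs.groupby(["word", "context"]).size().rename("TF")

# MultiIndex layout: wrap the Series in a one-column DataFrame
tf_indexed = tf_series.to_frame()

# Flat layout: move the index levels back into ordinary columns
tf_flat = tf_series.reset_index()

print(tf_indexed)
print(tf_flat)
```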

In [7]:
#TODO
df1 = pd.DataFrame(df.groupby(['word','context']).doc_id.count().rename('TF')).reset_index()
df1

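|    | word | context | TF |
| ---- | ---- | ---- | ---- |
| 0 | and | as | 1 |
| 1 | and | at | 1 |
| 2 | and | before | 1 |
| 3 | and | but | 1 |
| 4 | and | by | 1 |
| ... | ... | ... | ... |
| 1177 | yet | night | 1 |
| 1178 | yet | rise | 1 |
| 1179 | yet | stars | 1 |
| 1180 | yet | watch | 1 |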


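### 3. Compute IDF
Group by the context word and count the distinct doc_id values, which gives the number of documents each word appears in (its DF); then convert DF to IDF.

Hint: nunique() counts the number of distinct values.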
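
The text above does not fix an IDF formula. The sketch below assumes the common definition $\mathrm{IDF} = \log_2(N / \mathrm{DF})$, where $N$ is the number of documents, and applies it to a made-up `pairs` table.

```python
import numpy as np
import pandas as pd

# Hypothetical pairs table, with the document each pair came from
pairs = pd.DataFrame({
    "doc_id":  [1, 1, 2, 2, 3],
    "context": ["a", "b", "a", "c", "a"],
})

n_docs = pairs["doc_id"].nunique()

# DF: number of distinct documents in which each context word occurs
idf = pairs.groupby("context")["doc_id"].nunique().rename("DF").reset_index()

# Assumed IDF definition: log2(N / DF)
idf["IDF"] = np.log2(n_docs / idf["DF"])
print(idf)
```

Under this definition a word that occurs in every document gets an IDF of 0, so it contributes nothing to later TF-IDF scores.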

In [8]:
#TODO
df2 = pd.DataFrame(df.groupby('context').doc_id.nunique().rename('DF')).reset_index()
df2['IDF'] = np.log2(df.doc_id.nunique() / df2['DF'])
df2

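|    | context | DF | IDF |
| ---- | ---- | ---- | ---- |
| 0 | and | 5 | 0.847997 |
| 1 | as | 1 | 3.169925 |
| 2 | at | 1 | 3.169925 |
| 3 | barrier | 1 | 3.169925 |
| 4 | be | 1 | 3.169925 |
| ... | ... | ... | ... |
| 76 | when | 2 | 2.169925 |
| 77 | will | 3 | 1.584963 |
| 78 | with | 1 | 3.169925 |
| 79 | world | 1 | 3.169925 |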


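### 4. Compute TF-IDF
Merge the TF table and the IDF table on the context word column, so that every context word in the TF table gains an IDF field. Then add a new column holding the TF-IDF value.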
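
As a rough sketch of this step, the example below joins two small made-up tables and takes TF-IDF to be the plain product TF × IDF. The column name `TF-IDF` is an assumption of this illustration.

```python
import pandas as pd

# Hypothetical TF and IDF tables using this notebook's column names
tf = pd.DataFrame({
    "word":    ["a", "a", "b"],
    "context": ["b", "c", "a"],
    "TF":      [2, 1, 3],
})
idf = pd.DataFrame({
    "context": ["a", "b", "c"],
    "IDF":     [0.5, 1.0, 2.0],
})

# Left join: every TF row keeps its place and gains the IDF of its context word
tfidf = tf.merge(idf, on="context", how="left")

# TF-IDF taken here as the plain product of the two columns
tfidf["TF-IDF"] = tfidf["TF"] * tfidf["IDF"]
print(tfidf)
```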

In [9]:
#TODO
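# Left-join on the shared 'context' column.
# Note: this cell stops after the merge; it does not add the TF-IDF column the task asks for.
df3 = pd.merge(df1,df2,how='left')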
df3

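|    | word | context | TF | DF | IDF |
| ---- | ---- | ---- | ---- | ---- | ---- |
| 0 | and | as | 1 | 1 | 3.169925 |
| 1 | and | at | 1 | 1 | 3.169925 |
| 2 | and | before | 1 | 1 | 3.169925 |
| 3 | and | but | 1 | 1 | 3.169925 |
| 4 | and | by | 1 | 1 | 3.169925 |
| ... | ... | ... | ... | ... | ... |
| 1177 | yet | night | 1 | 1 | 3.169925 |
| 1178 | yet | rise | 1 | 1 | 3.169925 |
| 1179 | yet | stars | 1 | 1 | 3.169925 |
| 1180 | yet | watch | 1 | 1 | 3.169925 |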


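### 5. Convert to a two-dimensional table and a matrix
Keeping only the TF-IDF values, reshape the data from the previous step into a new table whose row index is the word and whose column index is the context word. (The value at [word1][word2] in the new table is the TF-IDF of the pair "word word1, context word word2"; the set of all words in the index amounts to the whole vocabulary.)

Then fill every NaN in the new table with 0, convert the table to a matrix $M$, and compute $M M^{T}$.

Hint: pandas pivot_table or pivot can take two columns of a table as the row and column indexes and a third column as the values.

Hint: to_numpy() converts a DataFrame to a matrix.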
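
The reshaping can be sketched on a tiny made-up table as follows. The reindexing step, which forces both axes onto the full vocabulary so that the matrix is square, is an extra precaution added in this sketch and is not required by the task.

```python
import pandas as pd

# Hypothetical long table: one TF-IDF value per (word, context) pair
long_table = pd.DataFrame({
    "word":    ["a", "a", "b", "c"],
    "context": ["b", "c", "a", "a"],
    "TF-IDF":  [2.0, 1.0, 3.0, 4.0],
})

# Rows become words, columns become context words; missing pairs become 0
wide = long_table.pivot(index="word", columns="context", values="TF-IDF").fillna(0)

# Put both axes on the full vocabulary so the matrix is square
vocab = sorted(set(long_table["word"]) | set(long_table["context"]))
wide = wide.reindex(index=vocab, columns=vocab, fill_value=0)

M = wide.to_numpy()
similarity = M @ M.T  # entry (i, j) is the dot product of the rows for words i and j
print(similarity)
```

Each entry of `M @ M.T` is large when two words share heavily weighted context words, so the product can be read as an unnormalized word-to-word similarity.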

In [10]:
#TODO
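# Note: this cell pivots the IDF column rather than a TF-IDF column,
# so the matrices shown below hold IDF values.
w2c = df3[['word','context','IDF']]
# Words become rows and context words become columns; missing pairs are filled with 0
w2c = w2c.set_index(['word','context']).unstack().fillna(0)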
M = w2c.to_numpy()
M

array([[0.        , 3.169925  , 3.169925  , ..., 3.169925  , 3.169925  ,
        3.169925  ],
       [0.84799691, 0.        , 3.169925  , ..., 0.        , 0.        ,
        3.169925  ],
       [0.84799691, 3.169925  , 0.        , ..., 0.        , 0.        ,
        3.169925  ],
       ...,
       [0.84799691, 0.        , 0.        , ..., 0.        , 3.169925  ,
        0.        ],
       [0.84799691, 0.        , 0.        , ..., 3.169925  , 0.        ,
        0.        ],
       [0.84799691, 3.169925  , 3.169925  , ..., 0.        , 0.        ,
        0.        ]])

In [11]:
M@M.T

array([[426.40594453,  82.89950225,  82.89950225, ...,  85.70642426,
         85.70642426,  82.89950225],
       [ 82.89950225,  83.618601  ,  73.57017649, ...,   0.71909875,
          0.71909875,  73.57017649],
       [ 82.89950225,  73.57017649,  83.618601  , ...,   0.71909875,
          0.71909875,  73.57017649],
       ...,
       [ 85.70642426,   0.71909875,   0.71909875, ...,  86.42552302,
         76.3770985 ,   0.71909875],
       [ 85.70642426,   0.71909875,   0.71909875, ...,  76.3770985 ,
         86.42552302,   0.71909875],
       [ 82.89950225,  73.57017649,  73.57017649, ...,   0.71909875,
          0.71909875,  83.618601  ]])