# Week 06: Collocation Extraction
In Assignment 5, we found all skip-grams and their frequencies in <u>*wiki1G.txt*</u>. This week, we will use the result of Assignment 5 to extract collocations of [AKL verbs](https://uclouvain.be/en/research-institutes/ilc/cecl/academic-keyword-list.html) with [Smadja’s algorithm](https://aclanthology.org/J93-1007.pdf). A few basic terms need to be explained first.

We take "*depend*" as an example:

<img src="https://imgur.com/cPyd7Gr.jpg" >

In this case, we want to find the collocations of "depend", so "depend" is called the **base word** and is denoted $W$. Words such as "on", "the", and "for" are called **collocates** and are denoted $W_{i}$, where **i** is the collocate's serial number. $P_{j}$ is the frequency with which $W_{i}$ occurs at distance j from $W$, and **Freq** is the sum of the frequencies over all distances.
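
To make the notation concrete, the short sketch below rebuilds one record by hand and checks that **Freq** is the sum of the ten per-distance frequencies. The values are copied from the first row of the `AKL_skipgram.tsv` preview printed in the Read Data section (`W = "0"`, `Wi = "account"`); the names `distances`, `row`, and `p_by_distance` are made up for this illustration only.

```python
# Distances j run from -5 to 5, skipping 0 (the position of the base word itself).
distances = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]

# One record: base word W, collocate Wi, total Freq, and the ten P_j values
# in the column order P-5 ... P-1, P1 ... P5.
row = {"W": "0", "Wi": "account", "Freq": 29,
       "P": [3, 4, 14, 7, 0, 0, 0, 0, 0, 1]}

# Map each distance j to its frequency P_j.
p_by_distance = dict(zip(distances, row["P"]))

total = sum(p_by_distance.values())
print(p_by_distance[-3])     # frequency of the pair at distance -3
print(total == row["Freq"])  # Freq should equal the sum over all distances
```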

There are three conditions for filtering the skip-grams to find collocations. We will go through them one by one below.

Because some students did not complete Assignment 5, we provide a file of precomputed skip-grams with their frequencies, called **AKL_skipgram.tsv**, so that everyone can do Assignment 6. It keeps only the skip-grams that contain an AKL verb.

## Read Data
<font color="red">**[ TODO ]**</font> Please read <u>*AKL_skipgram.tsv*</u> and store it in the way you like.

In [5]:
#### hyperparameters used by the three conditions
k0 = 1
k1 = 1
U0 = 10
base_word = "depend"

In [49]:
import pandas as pd
import os
import numpy as np
## read file here
skipgram_col = ['W', 'Wi', 'Freq', 'P-5', 'P-4', 'P-3', 'P-2', 'P-1', 'P1', 'P2', 'P3', 'P4', 'P5']
with open(os.path.join('data', 'AKL_skipgram.tsv'), encoding="utf-8") as f:
    tmp_df = pd.read_csv(f, sep='\t')  # load the TSV into a DataFrame, separated by "\t"
    tmp_df.columns = skipgram_col  # assign the column names

In [178]:
#test
print((tmp_df['P-5'] + tmp_df['P-3'])[:3])

print( tmp_df[ ['W', 'P-5'] ] )

print(tmp_df.head(3))

print(tmp_df[ tmp_df['W'] == "depend"].head(3))

0    17
1     6
2     0
dtype: int64
         W  P-5
0        0    3
1        0    4
2        0    0
3        0    4
4        0    0
...     ..  ...
5542979  𝕏    1
5542980  𝛿    0
5542981  𝛿    0
5542982  𢒉    0
5542983  𢒉    0

[5542984 rows x 2 columns]
   W       Wi  Freq  P-5  P-4  P-3  P-2  P-1  P1  P2  P3  P4  P5
0  0  account    29    3    4   14    7    0   0   0   0   0   1
1  0  achieve    13    4    3    2    0    4   0   0   0   0   0
2  0  acquire     2    0    0    0    0    0   0   0   1   0   1
              W   Wi  Freq  P-5  P-4  P-3  P-2  P-1  P1  P2  P3  P4  P5
1600255  depend    0     1    1    0    0    0    0   0   0   0   0   0
1600256  depend  000     2    0    2    0    0    0   0   0   0   0   0
1600257  depend  035     1    1    0    0    0    0   0   0   0   0   0


## C1 Condition
C1 helps eliminate the collocates that are not frequent enough. This condition specifies that the frequency of appearance of $W_{i}$ in the neighborhood of $W$ must be at least one standard deviation above the average.

The formula is here:

$$strength = \frac{freq - \bar{f}}{\sigma} \geq k_{0} = 1$$

where $freq$ is the frequency of a given collocate (e.g., 2573 for "on"),

$\bar{f}$ is the average frequency of all collocates, and

${\sigma}$ is the standard deviation of the frequencies of all collocates.
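
As a rough illustration of the formula, the sketch below computes the strength of every collocate of a single base word using only the standard library. The frequencies are hypothetical toy numbers, not taken from the corpus, and the population standard deviation (`statistics.pstdev`, the equivalent of `ddof=0` in NumPy) is an assumption about which variant of $\sigma$ is intended.

```python
import statistics

k0 = 1  # same threshold as in this assignment

# Hypothetical Freq values for ten collocates of one base word.
freq = {"on": 10}
freq.update({f"w{n}": 2 for n in range(1, 10)})  # nine rare collocates

values = list(freq.values())
f_bar = statistics.mean(values)    # average frequency of all collocates
sigma = statistics.pstdev(values)  # population standard deviation (assumed)

# strength = (freq - f_bar) / sigma, computed for every collocate
strength = {w: (f - f_bar) / sigma for w, f in freq.items()}

# C1 keeps only the collocates whose strength is at least k0.
passed = {w: round(s, 3) for w, s in strength.items() if s >= k0}
print(passed)
```

Here the average is 2.8 and the standard deviation is 2.4, so only `on` passes, with a strength of 3.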

<font color="red">**[ TODO ]**</font> Please follow the condition to filter the skip-grams of "depend" and keep those that pass.

The output should have each `collocate` with its `strength`.

In [201]:
def C1_filter(base_word, filter_word):
    ### [TODO]
    # Note: `base_word` is the full skip-gram DataFrame; `filter_word` is the base word W.
    tmp = base_word[ base_word['W'] == filter_word ] # rows whose base word is filter_word
    avg_tmp = tmp['Freq'].mean() # f bar: average frequency of all collocates
    std_tmp = np.std(tmp['Freq'], ddof=0) # population standard deviation
    strength = (tmp['Freq'] - avg_tmp)/std_tmp # strength of every collocate
    tmp_store = tmp[ strength >= k0 ].copy() # keep rows passing C1; copy() avoids SettingWithCopyWarning
    tmp_store['strength'] = strength.round(3) # add the strength column (aligned on the row index)
    return tmp_store[ ['Wi', 'strength'] ]

In [202]:
filtered_by_C1 = C1_filter(tmp_df, 'depend')
### Print
print(filtered_by_C1)

                 Wi  strength
1600339           a     6.381
1600497         all     1.151
1600518        also     1.133
1600550          an     1.367
1600558         and    15.183
1600640         are     1.962
1600675          as     2.395
1600957         but     1.529
1600961          by     1.042
1600985         can     1.421
1601752          do     1.656
1601758        does     5.299
1602246         for     4.686
1602269     formula     1.565
1602694          in     5.876
1602887          is     2.611
1602896          it     2.287
1602901         its     1.818
1603255         may     2.864
1603569         not     8.437
1603628          of    23.461
1603648          on    46.313
1603654        only     1.295
1603678          or     2.485
1603712       other     1.656
1604155  properties     1.042
1604541           s     2.161
1604817        some     1.187
1605014        such     1.439
1605166        that     7.247
1605168         the    44.707
1605169       their     2.828
1605193   



<font color="green">Expected output: </font> (The order isn't important.)

> a {'strength': 6.381}   
> all {'strength': 1.151}   
> also {'strength': 1.133}   
> an {'strength': 1.367}   
> and {'strength': 15.183}   
> are {'strength': 1.962}   
> as {'strength': 2.395}   
> but {'strength': 1.529}   
> by {'strength': 1.042}   
> can {'strength': 1.421}   
> do {'strength': 1.656}   
> does {'strength': 5.299}   
> for {'strength': 4.686}   
> formula {'strength': 1.565}   
> in {'strength': 5.876}   
> is {'strength': 2.611}   
> it {'strength': 2.287}   
> its {'strength': 1.818}   
> may {'strength': 2.864}   
> not {'strength': 8.437}   
> of {'strength': 23.461}   
> on {'strength': 46.313}   
> only {'strength': 1.295}   
> or {'strength': 2.485}   
> other {'strength': 1.656}   
> properties {'strength': 1.042}   
> s {'strength': 2.161}   
> some {'strength': 1.187}   
> such {'strength': 1.439}   
> that {'strength': 7.247}   
> the {'strength': 44.707}   
> their {'strength': 2.828}   
> these {'strength': 1.944}   
> they {'strength': 2.233}   
> this {'strength': 1.908}   
> to {'strength': 8.419}   
> type {'strength': 1.295}   
> upon {'strength': 4.902}   
> which {'strength': 4.379}   
> will {'strength': 3.784}   
> would {'strength': 1.601}   

## C2 Condition
C2 requires that the histogram of the 10 relative frequencies of appearance of $W_i$ within five words of $W$ (or $p^j_i$s) have at least one spike. If the histogram is flat, it will be rejected by this condition.

The formula is here:

$$spread = \frac{\Sigma^{10}_{j=1}(p^j_i - \bar{p_i})^2}{10} \geq U_{0} = 10$$

where $p^j_i$ is the frequency of a given collocate at distance *j* (e.g., 16 for "on" at distance -5), and

$\bar{p_i}$ is the average frequency of that collocate over the 10 distances.
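
The spread is simply the population variance of the ten per-distance frequencies of one collocate. The sketch below, with two hypothetical toy histograms and a helper name `spread` chosen for this example, shows how a spiky histogram passes C2 while a flat one is rejected.

```python
U0 = 10  # same threshold as in this assignment

def spread(p):
    """Population variance of the per-distance frequencies of one collocate."""
    p_bar = sum(p) / len(p)  # average frequency over all distances
    return sum((x - p_bar) ** 2 for x in p) / len(p)

# Hypothetical histograms in the column order P-5 ... P-1, P1 ... P5.
spiky = [0, 0, 0, 0, 0, 20, 0, 0, 0, 0]  # all occurrences at distance 1
flat = [2] * 10                           # evenly spread over all distances

print(spread(spiky), spread(spiky) >= U0)  # passes C2
print(spread(flat), spread(flat) >= U0)    # rejected by C2
```

Both histograms have the same total frequency (20), so C1 could not tell them apart; only the shape of the distribution differs.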

<font color="red">**[ TODO ]**</font> Please follow C2 to filter the result of C1 and keep those that pass C2.

The output should have each `collocate` with its `strength` and `spread`.

In [203]:
def C2_filter(base_word, filtered_by_C1, filter_word):
    ### [TODO]
    dist_cols = list(base_word.columns[3:13]) # the 10 distance columns, 'P-5' ... 'P5'
    bs = base_word[ base_word['W'] == filter_word ] # rows for the base word (e.g., "depend")
    bs = bs[ bs['Wi'].isin(filtered_by_C1['Wi']) ] # keep only the collocates that passed C1
    pi_bar = bs[dist_cols].mean(axis=1) # average frequency over the 10 distances
    spread = bs[dist_cols].sub(pi_bar, axis=0).pow(2).sum(axis=1)/10 # apply the formula
    result = filtered_by_C1.copy() # do not modify the caller's DataFrame
    result['spread'] = spread # add the spread column (aligned on the row index)
    return result[ result['spread'] >= U0 ] # drop the collocates that fail C2

In [204]:
filtered_by_C2 = C2_filter(tmp_df, filtered_by_C1, 'depend')
### Print
print(filtered_by_C2)

                 Wi  strength     spread
1600339           a     6.381     777.29
1600497         all     1.151      29.89
1600518        also     1.133     208.96
1600550          an     1.367      56.29
1600558         and    15.183    2170.41
1600640         are     1.962      98.84
1600675          as     2.395     104.96
1600957         but     1.529      24.40
1600961          by     1.042      26.21
1600985         can     1.421     208.24
1601752          do     1.656     410.21
1601758        does     5.299    6477.09
1602246         for     4.686     376.65
1602269     formula     1.565      46.16
1602694          in     5.876     396.09
1602887          is     2.611     148.20
1602896          it     2.287     112.76
1602901         its     1.818      94.24
1603255         may     2.864    1352.24
1603569         not     8.437   12938.41
1603628          of    23.461   20132.64
1603648          on    46.313  420371.01
1603654        only     1.295     134.01
1603678         

<font color="green">Expected output: </font> (The order isn't important.)

> a {'strength': 6.381, 'spread': 777.29}   
> all {'strength': 1.151, 'spread': 29.89}   
> also {'strength': 1.133, 'spread': 208.96}   
> an {'strength': 1.367, 'spread': 56.29}   
> and {'strength': 15.183, 'spread': 2170.41}   
> are {'strength': 1.962, 'spread': 98.84}   
> as {'strength': 2.395, 'spread': 104.96}   
> but {'strength': 1.529, 'spread': 24.4}   
> by {'strength': 1.042, 'spread': 26.21}   
> can {'strength': 1.421, 'spread': 208.24}   
> do {'strength': 1.656, 'spread': 410.21}   
> does {'strength': 5.299, 'spread': 6477.09}   
> for {'strength': 4.686, 'spread': 376.65}   
> formula {'strength': 1.565, 'spread': 46.16}   
> in {'strength': 5.876, 'spread': 396.09}   
> is {'strength': 2.611, 'spread': 148.2}   
> it {'strength': 2.287, 'spread': 112.76}   
> its {'strength': 1.818, 'spread': 94.24}   
> may {'strength': 2.864, 'spread': 1352.24}   
> not {'strength': 8.437, 'spread': 12938.41}   
> of {'strength': 23.461, 'spread': 20132.64}   
> on {'strength': 46.313, 'spread': 420371.01}   
> only {'strength': 1.295, 'spread': 134.01}   
> or {'strength': 2.485, 'spread': 85.61}   
> other {'strength': 1.656, 'spread': 31.61}   
> properties {'strength': 1.042, 'spread': 30.21}   
> s {'strength': 2.161, 'spread': 125.85}   
> some {'strength': 1.187, 'spread': 15.29}   
> such {'strength': 1.439, 'spread': 27.45}   
> that {'strength': 7.247, 'spread': 1492.61}   
> the {'strength': 44.707, 'spread': 98586.04}   
> their {'strength': 2.828, 'spread': 209.56}   
> these {'strength': 1.944, 'spread': 180.01}   
> they {'strength': 2.233, 'spread': 316.09}   
> this {'strength': 1.908, 'spread': 71.09}   
> to {'strength': 8.419, 'spread': 3941.16}   
> type {'strength': 1.295, 'spread': 213.41}   
> upon {'strength': 4.902, 'spread': 4984.01}   
> which {'strength': 4.379, 'spread': 346.16}   
> will {'strength': 3.784, 'spread': 2250.05}   
> would {'strength': 1.601, 'spread': 412.44}   

## C3 Condition
C3 keeps the interesting collocates by pulling out the peaks of the $p^j_i$ distributions.

Formula:

$$p^j_i \geq \bar{p_i} + (k_1 \times \sqrt{U_{i}})$$

where $U_i$ is *spread* in C2 and

$k_1$ is equal to 1 
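
A small sketch of the peak test, again on a hypothetical toy histogram: a distance is kept only if its frequency is at least one standard deviation ($\sqrt{U_i}$) above the collocate's own average. The helper name `find_peaks` is an assumption made for this example.

```python
import math

k1 = 1  # same threshold as in this assignment
distances = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]

def find_peaks(p):
    """Return (distance, count) pairs whose count reaches the C3 threshold."""
    p_bar = sum(p) / len(p)                           # average frequency
    u = sum((x - p_bar) ** 2 for x in p) / len(p)     # spread from C2
    threshold = p_bar + k1 * math.sqrt(u)             # peak threshold
    return [(d, c) for d, c in zip(distances, p) if c >= threshold]

# Hypothetical histogram in the column order P-5 ... P-1, P1 ... P5.
p = [2, 2, 2, 2, 2, 40, 10, 2, 2, 2]
print(find_peaks(p))
```

For this histogram the average is 6.6 and the threshold is roughly 17.99, so distance 1 (count 40) is a peak while distance 2 (count 10) is not.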

<font color="red">**[ TODO ]**</font> Please follow the condition to filter the result of the previous step and keep those that pass C3.

The output should have `base word, collocate, distance, strength, spread, peak, count`.

In [205]:
def C3_filter(base_word, filtered_by_C2, filter_word):
    ### [TODO]
    out_cols = ['W', 'Wi', 'Pi', 'strength', 'spread', 'peak', 'count']
    dist_cols = list(base_word.columns[3:13]) # the 10 distance columns, 'P-5' ... 'P5'
    bs = base_word[ base_word['W'] == filter_word ] # rows for the base word (e.g., "depend")
    bs = bs[ bs['Wi'].isin(filtered_by_C2['Wi']) ].copy() # keep only the collocates that passed C2
    bs['Pi_bar'] = bs[dist_cols].mean(axis=1) # Pi_bar of every row
    # bring in the columns we need; assignment aligns on the shared row index
    bs['spread'] = filtered_by_C2['spread'].astype(float)
    bs['strength'] = filtered_by_C2['strength'].astype(float)
    threshold = bs['Pi_bar'] + k1 * np.sqrt(bs['spread']) # C3 peak threshold
    bs['peak'] = threshold.round(3)

    frames = []
    for col in dist_cols: # go through the 10 distances in order
        bs_tmp = bs[ bs[col] >= threshold ].copy() # rows passing C3 at this distance
        bs_tmp['Pi'] = int(col.replace("P", "")) # e.g., 'P-5' -> -5
        bs_tmp['count'] = bs_tmp[col]
        frames.append(bs_tmp[out_cols]) # keep only the needed columns
    tmp1 = pd.concat(frames, ignore_index=True) # merge the results of all distances
    tmp1 = tmp1.sort_values(['Wi', 'Pi'], ascending=True) # sort to match the expected output
    return tmp1

In [206]:
filtered_by_C3 = C3_filter(tmp_df, filtered_by_C2, 'depend')
### Print
print(filtered_by_C3)

         W          Wi  Pi  strength     spread     peak count
34  depend           a   2     6.381     777.29   63.780    94
4   depend         all  -4     1.151      29.89   12.367    14
10  depend         all  -3     1.151      29.89   12.367    16
20  depend        also  -1     1.133     208.96   21.255    50
35  depend          an   2     1.367      56.29   15.603    24
53  depend          an   5     1.367      56.29   15.603    19
44  depend         and   4    15.183    2170.41  131.288   149
0   depend         are  -5     1.962      98.84   21.342    27
5   depend         are  -4     1.962      98.84   21.342    22
45  depend          as   4     2.395     104.96   24.045    30
54  depend          as   5     2.395     104.96   24.045    28
14  depend         but  -2     1.529      24.40   13.940    14
55  depend         but   5     1.529      24.40   13.940    15
1   depend          by  -5     1.042      26.21   11.420    13
6   depend          by  -4     1.042      26.21   11.42



<font color="green">Expected output: </font> (The order isn't important.)

> ('depend', 'a', 2) {'strength': 6.381, 'spread': 777.29, 'peak': 63.78, 'count': 94}   
> ('depend', 'all', -4) {'strength': 1.151, 'spread': 29.89, 'peak': 12.367, 'count': 14}   
> ('depend', 'all', -3) {'strength': 1.151, 'spread': 29.89, 'peak': 12.367, 'count': 16}   
> ('depend', 'also', -1) {'strength': 1.133, 'spread': 208.96, 'peak': 21.255, 'count': 50}   
> ('depend', 'an', 2) {'strength': 1.367, 'spread': 56.29, 'peak': 15.603, 'count': 24}   
> ('depend', 'an', 5) {'strength': 1.367, 'spread': 56.29, 'peak': 15.603, 'count': 19}   
> ('depend', 'and', 4) {'strength': 15.183, 'spread': 2170.41, 'peak': 131.288, 'count': 149}   
> ('depend', 'are', -5) {'strength': 1.962, 'spread': 98.84, 'peak': 21.342, 'count': 27}   
> ('depend', 'are', -4) {'strength': 1.962, 'spread': 98.84, 'peak': 21.342, 'count': 22}   
> ('depend', 'as', 4) {'strength': 2.395, 'spread': 104.96, 'peak': 24.045, 'count': 30}   
> ('depend', 'as', 5) {'strength': 2.395, 'spread': 104.96, 'peak': 24.045, 'count': 28}   
> ('depend', 'but', -2) {'strength': 1.529, 'spread': 24.4, 'peak': 13.94, 'count': 14}   
> ('depend', 'but', 5) {'strength': 1.529, 'spread': 24.4, 'peak': 13.94, 'count': 15}   
> ('depend', 'by', -5) {'strength': 1.042, 'spread': 26.21, 'peak': 11.42, 'count': 13}   
> ('depend', 'by', -4) {'strength': 1.042, 'spread': 26.21, 'peak': 11.42, 'count': 12}   
> ('depend', 'by', 4) {'strength': 1.042, 'spread': 26.21, 'peak': 11.42, 'count': 13}   
> ('depend', 'can', -1) {'strength': 1.421, 'spread': 208.24, 'peak': 22.831, 'count': 49}   
> ('depend', 'do', -2) {'strength': 1.656, 'spread': 410.21, 'peak': 29.954, 'count': 70}   
> ('depend', 'does', -2) {'strength': 5.299, 'spread': 6477.09, 'peak': 110.38, 'count': 271}   
> ('depend', 'for', 4) {'strength': 4.686, 'spread': 376.65, 'peak': 45.907, 'count': 69}   
> ('depend', 'formula', -4) {'strength': 1.565, 'spread': 46.16, 'peak': 15.994, 'count': 19}   
> ('depend', 'formula', 2) {'strength': 1.565, 'spread': 46.16, 'peak': 15.994, 'count': 17}   
> ('depend', 'formula', 5) {'strength': 1.565, 'spread': 46.16, 'peak': 15.994, 'count': 19}   
> ('depend', 'in', -5) {'strength': 5.876, 'spread': 396.09, 'peak': 53.002, 'count': 55}   
> ('depend', 'in', 4) {'strength': 5.876, 'spread': 396.09, 'peak': 53.002, 'count': 62}   
> ('depend', 'is', -5) {'strength': 2.611, 'spread': 148.2, 'peak': 27.174, 'count': 37}   
> ('depend', 'is', 5) {'strength': 2.611, 'spread': 148.2, 'peak': 27.174, 'count': 29}   
> ('depend', 'it', -3) {'strength': 2.287, 'spread': 112.76, 'peak': 23.819, 'count': 39}   
> ('depend', 'it', -2) {'strength': 2.287, 'spread': 112.76, 'peak': 23.819, 'count': 24}   
> ('depend', 'its', 2) {'strength': 1.818, 'spread': 94.24, 'peak': 20.308, 'count': 36}   
> ('depend', 'may', -1) {'strength': 2.864, 'spread': 1352.24, 'peak': 53.173, 'count': 126}   
> ('depend', 'not', -1) {'strength': 8.437, 'spread': 12938.41, 'peak': 161.047, 'count': 388}   
> ('depend', 'of', 4) {'strength': 23.461, 'spread': 20132.64, 'peak': 272.49, 'count': 495}   
> ('depend', 'on', 1) {'strength': 46.313, 'spread': 420371.01, 'peak': 905.66, 'count': 2195}   
> ('depend', 'only', 1) {'strength': 1.295, 'spread': 134.01, 'peak': 19.276, 'count': 40}   
> ('depend', 'or', 4) {'strength': 2.485, 'spread': 85.61, 'peak': 23.553, 'count': 29}   
> ('depend', 'or', 5) {'strength': 2.485, 'spread': 85.61, 'peak': 23.553, 'count': 25}   
> ('depend', 'other', 3) {'strength': 1.656, 'spread': 31.61, 'peak': 15.322, 'count': 19}   
> ('depend', 'other', 5) {'strength': 1.656, 'spread': 31.61, 'peak': 15.322, 'count': 17}   
> ('depend', 'properties', -4) {'strength': 1.042, 'spread': 30.21, 'peak': 11.796, 'count': 12}   
> ('depend', 'properties', -1) {'strength': 1.042, 'spread': 30.21, 'peak': 11.796, 'count': 15}   
> ('depend', 'properties', 3) {'strength': 1.042, 'spread': 30.21, 'peak': 11.796, 'count': 15}   
> ('depend', 's', 4) {'strength': 2.161, 'spread': 125.85, 'peak': 23.718, 'count': 41}   
> ('depend', 'some', -3) {'strength': 1.187, 'spread': 15.29, 'peak': 11.01, 'count': 13}   
> ('depend', 'some', 2) {'strength': 1.187, 'spread': 15.29, 'peak': 11.01, 'count': 14}   
> ('depend', 'such', 4) {'strength': 1.439, 'spread': 27.45, 'peak': 13.739, 'count': 17}   
> ('depend', 'that', -3) {'strength': 7.247, 'spread': 1492.61, 'peak': 79.334, 'count': 84}   
> ('depend', 'that', -1) {'strength': 7.247, 'spread': 1492.61, 'peak': 79.334, 'count': 132}   
> ('depend', 'the', 2) {'strength': 44.707, 'spread': 98586.04, 'peak': 562.384, 'count': 1140}   
> ('depend', 'their', 2) {'strength': 2.828, 'spread': 209.56, 'peak': 30.676, 'count': 52}   
> ('depend', 'these', -2) {'strength': 1.944, 'spread': 180.01, 'peak': 24.717, 'count': 48}   
> ('depend', 'they', -1) {'strength': 2.233, 'spread': 316.09, 'peak': 30.679, 'count': 63}   
> ('depend', 'this', -4) {'strength': 1.908, 'spread': 71.09, 'peak': 19.531, 'count': 28}   
> ('depend', 'this', -2) {'strength': 1.908, 'spread': 71.09, 'peak': 19.531, 'count': 22}   
> ('depend', 'to', -1) {'strength': 8.419, 'spread': 3941.16, 'peak': 109.979, 'count': 228}   
> ('depend', 'type', 3) {'strength': 1.295, 'spread': 213.41, 'peak': 22.309, 'count': 50}   
> ('depend', 'upon', 1) {'strength': 4.902, 'spread': 4984.01, 'peak': 98.298, 'count': 239}   
> ('depend', 'which', -1) {'strength': 4.379, 'spread': 346.16, 'peak': 43.405, 'count': 66}   
> ('depend', 'will', -1) {'strength': 3.784, 'spread': 2250.05, 'peak': 68.935, 'count': 159}   
> ('depend', 'would', -1) {'strength': 1.601, 'spread': 412.44, 'peak': 29.709, 'count': 70}   

## Strongest Collocation
There are too many collocations to check your result easily. Hence, we want you to use the rules below to find the single strongest collocation for "depend".

Rule:
1. Find the collocation with the maximum **`strength`** value.
2. Find the collocation with the maximum **`count`** value.

If two or more collocations share the same maximum `strength` value, use Rule 2 to pick one of them as the answer. Otherwise, you can ignore Rule 2.
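
One possible way to express the two rules is a single `max` over the pair `(strength, count)`: tuples compare element by element, so `count` only matters when `strength` is tied. The sketch below uses plain dictionaries rather than a DataFrame; the four rows are copied from the expected C3 output above, and the helper name `strongest` is made up for this example.

```python
# A few rows copied from the expected C3 output for "depend".
rows = [
    {"W": "depend", "Wi": "all", "Pi": -4, "strength": 1.151, "count": 14},
    {"W": "depend", "Wi": "all", "Pi": -3, "strength": 1.151, "count": 16},
    {"W": "depend", "Wi": "the", "Pi": 2, "strength": 44.707, "count": 1140},
    {"W": "depend", "Wi": "on", "Pi": 1, "strength": 46.313, "count": 2195},
]

def strongest(rows):
    """Rule 1: highest strength; Rule 2 (tie-break): highest count."""
    best = max(rows, key=lambda r: (r["strength"], r["count"]))
    return (best["W"], best["Wi"], best["Pi"])

print(strongest(rows))      # ('depend', 'on', 1)
print(strongest(rows[:2]))  # tie on strength, decided by count: ('depend', 'all', -3)
```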

<font color="red">**[ TODO ]**</font> Please find the strongest collocation for "depend" by the rules.

The output format should be `(base word, collocate, distance)`.

In [207]:
def find_strongest_collocation(base_word, filtered_by_C3):
    ### [TODO]
    tmp = filtered_by_C3[filtered_by_C3['strength']==filtered_by_C3['strength'].max()]
    if tmp.shape[0] > 1: # if several rows tie on strength, use Rule 2 for the final choice
        tmp = tmp[tmp['count']==tmp['count'].max()]
    result = (tmp['W'].values[0], tmp['Wi'].values[0], tmp['Pi'].values[0])
    print(result)

In [208]:
find_strongest_collocation(tmp_df, filtered_by_C3)
### Run and Print

('depend', 'on', 1)


<font color="green">Expected output: </font>

> ('depend', 'on', 1)

## Find Helpful AKL Collocation
A single example cannot show how useful this procedure is, so here are some other AKL verbs for you to try.

<font color="red">**[ TODO ]**</font> Please finish the **combination** function, which chains the previous four functions together, and use it to find the strongest collocation for each verb in **AKL_verbs**.

The output format should be `(base word, collocate, distance)`.

In [209]:
AKL_verbs = ['argue', 'can', 'consist', 'contrast', 'favour', 'lack', 'may', 
            'neglect', 'participate', 'present', 'rely', 'suggest']

In [210]:
def combination(base_word, AKL_verbs):
    ### [TODO]
    for i in AKL_verbs:
        filtered_by_C1 = C1_filter(base_word, i)
        filtered_by_C2 = C2_filter(base_word, filtered_by_C1, i)
        filtered_by_C3 = C3_filter(base_word, filtered_by_C2, i)
        find_strongest_collocation(base_word, filtered_by_C3)

In [211]:
### Run and Print
combination(tmp_df, AKL_verbs)


('argue', 'that', 1)
('can', 'be', 1)
('consist', 'of', 1)
('contrast', 'in', -1)
('favour', 'of', 1)
('lack', 'of', 1)
('may', 'be', 1)
('neglect', 'of', 1)
('participate', 'in', 1)
('present', 'with', -3)
('rely', 'on', 1)
('suggest', 'that', 1)


<font color="green">Expected output: </font>

> ('argue', 'that', 1)   
> ('can', 'be', 1)   
> ('consist', 'of', 1)   
> ('contrast', 'in', -1)   
> ('favour', 'of', 1)   
> ('lack', 'of', 1)   
> ('may', 'be', 1)   
> ('neglect', 'of', 1)   
> ('participate', 'in', 1)   
> ('present', 'with', -3)   
> ('rely', 'on', 1)   
> ('suggest', 'that', 1)  

## TA's Notes

If you complete the assignment, please use [this link](https://docs.google.com/spreadsheets/d/1QGeYl5dsD9sFO9SYg4DIKk-xr-yGjRDOOLKZqCLDv2E/edit#gid=206119035) to reserve a demo time.  
The score is only given after the TAs review your implementation, so <u>**make sure you make an appointment with a TA before the deadline**</u>.  <br>After the demo, please upload your assignment to eeclass. You only need to hand in this ipynb file, renamed to XXXXXXXXX(Your student ID).ipynb.
<br>Note that **late submission will not be allowed**.  

## Reference
[Frank Smadja, Retrieving Collocations from Texts: Xtract, Computational Linguistics, Volume 19, 1993](https://aclanthology.org/J93-1007.pdf)