# Use TF-IDF to Get Keywords

## Term Frequency – Inverse Document Frequency (TF-IDF)
TF-IDF measures how important a word (or sentence) is to a document within a collection of documents.

TF = (number of times the term appears in the document) / (total number of terms in the document)
<br>TF represents how common the word is in the current document.

IDF = log(total number of documents / number of documents containing the term)
<br>IDF scales down a term that occurs in too many documents and scales up a relatively rare one. It represents how unique or rare a word is.

TF-IDF = TF times IDF
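
A small worked example may make the three formulas concrete. The sketch below uses a toy corpus of three hand-made token lists (not the call transcripts), and the function names are illustrative only. It computes IDF from the corpus itself, whereas the Jieba setup in this notebook reads IDF values from a file.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of the term / total terms in the document."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(total docs / docs containing the term)."""
    n_with_term = sum(1 for d in docs if term in d)
    if n_with_term == 0:
        return 0.0  # choice made for this sketch: unseen terms get no weight
    return math.log(len(docs) / n_with_term)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# Toy corpus: each document is already split into tokens.
docs = [
    ["loan", "interest", "loan", "monthly"],
    ["interest", "account", "bank"],
    ["bank", "sms", "interest"],
]

# "interest" occurs in every document, so its IDF (and TF-IDF) is zero.
score_common = tf_idf("interest", docs[0], docs)
# "loan" occurs only in the first document, so it gets a high score there.
score_rare = tf_idf("loan", docs[0], docs)
print(score_common, score_rare)
```

A word that appears everywhere is scaled down to nothing, while a word concentrated in one document stands out, which is the behaviour described above.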

TF-IDF can therefore be used to rank candidate keywords in our transcripts. Words that rank highly but carry no useful meaning are added to a stopword list so that they are removed in later runs.

In the following case, the "Jieba" library is used for segmentation and TF-IDF analysis.
1. Transcripts are taken from 2 real calls
2. Sentences are segmented into words, using the AQM keywords as a user dictionary
3. The top 100 keywords are extracted from these transcripts by TF-IDF score

In [1]:
pip install jieba

Note: you may need to restart the kernel to use updated packages.


In [2]:
import jieba
import jieba.analyse

# Read the list of transcript file names, one per line
with open('TranscriptOut/TranscriptList.txt', 'r') as fInput:
    samples = [szLine.rstrip() for szLine in fInput]

# load user dictionary
jieba.load_userdict('ExtraDict/LocalDict_V2.txt')

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/7j/8vqxhyrx6tj2hjs6mq486qpc0000gn/T/jieba.cache
Loading model cost 0.489 seconds.
Prefix dict has been built succesfully.


In [3]:
szFullSent = ''
for szFName in samples:
    with open('TranscriptOut/' + szFName, 'r') as fTrans:
        for szCont in fTrans:
            # Segment each line and join the tokens with spaces
            szFullSent += ' ' + " ".join(jieba.cut(szCont))

# Default IDF file
jieba.analyse.set_idf_path("ExtraDict/IDF_big.txt")

# Get Top 100 keywords
tags = jieba.analyse.extract_tags(szFullSent, topK=100, withWeight=True)

for item in tags:
    print(item[0] + ',' + str(item[1]))

同埋,0.6291982896263157
明白,0.2894312132281052
還款,0.20134345268042103
個人資料,0.20134345268042103
每月,0.1761755210953684
短訊,0.16683196384926316
申請,0.16359155530284208
同意,0.16359155530284208
如果,0.1510075895103158
利息,0.12583965792526314
條款,0.11325569213273683
貸款,0.10152703168390526
銀碼,0.10067172634021052
任何,0.10067172634021052
細則,0.10067172634021052
收到,0.10067172634021052
本金,0.10067172634021052
通知,0.09541279093642105
清楚,0.08920319230878948
唔會,0.0880877605476842
私隱條例,0.0880877605476842
匯豐,0.08433473361726315
確認,0.08258567218812632
電話,0.0755037947551579
接受,0.0755037947551579
用途,0.0755037947551579
文件,0.0755037947551579
一個,0.0755037947551579
批核,0.0755037947551579
參考,0.0755037947551579
銀行,0.0755037947551579
兩個percent,0.0755037947551579
收取,0.06291982896263157
可以,0.059033587404800004
第一期,0.05690733804473684
關於,0.053903433108631583
戶口,0.05065354525629473
已經,0.05033586317010526
號碼,0.05033586317010526
007615,0.05033586317010526
超連結,0.05033586317010526
ok,0.05033586317010526
講番,0.05033586317010526
小姐,0.05

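From the top 100 above, the following words are added to the stopword list (English glosses in parentheses).
<br>Top 1:  同埋 ("and")
<br>Top 43: 講番 ("speaking of")
<br>Top 44: 小姐 ("Miss")
<br>Top 67: 而家 ("now")
<br>Top 68: 先生 ("Mr.")
<br>Top 69: 請問 ("may I ask")
<br>Top 95: 仲有 ("also")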
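
`jieba.analyse.set_stop_words` reads a plain text file with one stopword per line. A minimal sketch of maintaining such a file is shown here; the helper name `add_stopwords` and the temporary file path are hypothetical, and the sketch assumes a UTF-8 file that ends with a newline.

```python
import os
import tempfile

def add_stopwords(path, new_words):
    """Append words that are not yet in the stopword file, one per line."""
    existing = set()
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as f:
            existing = {line.strip() for line in f if line.strip()}
    # dict.fromkeys removes duplicates while keeping the input order
    to_add = [w for w in dict.fromkeys(new_words) if w not in existing]
    with open(path, "a", encoding="utf-8") as f:
        for w in to_add:
            f.write(w + "\n")
    return to_add

# Demo in a temporary directory so that no real dictionary file is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")  # hypothetical file name
    first = add_stopwords(path, ["同埋", "講番", "小姐"])
    second = add_stopwords(path, ["小姐", "而家"])  # "小姐" is already present
    with open(path, "r", encoding="utf-8") as f:
        saved = f.read().splitlines()
print(saved)
```

Skipping words that are already present keeps the file free of duplicates as more transcripts are reviewed.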

In [4]:
# load custom stopwords file
jieba.analyse.set_stop_words("ExtraDict/StopWords_V1.txt")

# Get Top 100 keywords
tags = jieba.analyse.extract_tags(szFullSent, topK=100, withWeight=True)

for item in tags:
    print(item[0] + ',' + str(item[1]))

明白,0.3131658913060364
還款,0.2178545330824601
個人資料,0.2178545330824601
每月,0.1906227164471526
短訊,0.18051294493940775
申請,0.17700680812949884
同意,0.17700680812949884
如果,0.16339089981184507
利息,0.13615908317653758
條款,0.12254317485888382
貸款,0.10985271081971526
銀碼,0.10892726654123006
任何,0.10892726654123006
細則,0.10892726654123006
收到,0.10892726654123006
本金,0.10892726654123006
通知,0.10323707447562643
清楚,0.09651826047078588
唔會,0.0953113582235763
私隱條例,0.0953113582235763
匯豐,0.09125056598678816
確認,0.0893580735520729
電話,0.08169544990592253
接受,0.08169544990592253
用途,0.08169544990592253
文件,0.08169544990592253
一個,0.08169544990592253
批核,0.08169544990592253
參考,0.08169544990592253
銀行,0.08169544990592253
兩個percent,0.08169544990592253
收取,0.06807954158826879
可以,0.0638746105177221
第一期,0.06157399902334851
關於,0.05832376019726652
戶口,0.05480736673517085
已經,0.05446363327061503
號碼,0.05446363327061503
007615,0.05446363327061503
超連結,0.05446363327061503
ok,0.05446363327061503
財政,0.05446363327061503
全期總利息,0.05446363327061503

## Comparing the keywords from transcripts with AQM provided keywords

The extracted keywords that are not in the AQM keyword list are shown below.
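
The comparison reads the extracted keywords from a text file of `word,weight` lines, but the notebook does not show how that file was written. A minimal sketch, assuming it is simply the printed output saved to disk, could look like this; `save_keywords`, `load_keywords` and the file name are hypothetical.

```python
import os
import tempfile

def save_keywords(tags, path):
    """Write (word, weight) pairs as "word,weight" lines, one keyword per line."""
    with open(path, "w", encoding="utf-8") as f:
        for word, weight in tags:
            f.write(f"{word},{weight}\n")

def load_keywords(path):
    """Read the words back, ignoring the weights and any blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.split(",")[0].strip() for line in f if line.strip()]

# Two (word, weight) pairs copied from the output above, used as sample data.
tags = [("明白", 0.3131658913060364), ("還款", 0.2178545330824601)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "keywords.txt")  # hypothetical file name
    save_keywords(tags, path)
    words = load_keywords(path)
print(words)
```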

In [5]:
# Keywords extracted by TF-IDF, stored as "word,weight" lines
with open('TranscriptOut/OutKeyWords_withStop.txt', 'r') as fTFKey:
    array_tf_key = [line.split(',')[0] for line in fTFKey]

# Keywords provided by AQM, one per line; a set gives fast membership tests
with open('ExtraDict/LocalDict_V2.txt', 'r') as fAQMKey:
    set_AQM_key = {line.strip() for line in fAQMKey}

# Keep the extracted keywords that AQM did not list, in their original order
diff_key = ''
for tf_key in array_tf_key:
    if tf_key not in set_AQM_key:
        diff_key += tf_key + ','

print(diff_key)

如果,清楚,電話,一個,可以,關於,已經,號碼,007615,ok,覆核,直接,再有,上述,透過,得到,提早,每次,港幣,400,概要,儲存,收返,幾點,開個,平息,上網,會經,之前,放送,重要,send,六位,數正,體圓,菜仔,問番,開始,破格,近住,一齊,玩風帆,


## Custom IDF File

Currently, the IDF value of each word is read from a default file. If a word does not appear in the file, the median IDF value of the file is used instead. The median of the default file is 11.9547675029.
<br>As mentioned above, the IDF value scales the TF-IDF score up or down. Therefore, the AQM-provided keywords are added to the IDF file with an IDF value of 20, which is well above the median, so that they rank higher.
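
One way to build such a file is sketched here. The IDF file format is one `word value` pair per line, separated by a space. The helper name `merge_idf` and the toy IDF values are assumptions for illustration; the real inputs would be the default IDF file and the AQM keyword list.

```python
def merge_idf(default_lines, keywords, boost=20.0):
    """Return "word value" lines with every keyword forced to the boost value."""
    idf = {}
    for line in default_lines:
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            idf[parts[0]] = float(parts[1])
    for word in keywords:
        idf[word] = boost  # overrides an existing entry or adds a new one
    return [f"{word} {value}" for word, value in idf.items()]

# Toy data only: the IDF values below are made up.
default_lines = ["貸款 5.67", "銀行 6.1"]
aqm_keywords = ["貸款", "還款"]

merged = merge_idf(default_lines, aqm_keywords)
print("\n".join(merged))
```

The resulting lines can be written to a file and passed to `jieba.analyse.set_idf_path`. Keywords that contain spaces would break the two-column format and need separate handling.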

In [6]:
# Load the custom IDF file; each line has the format: WORD IDF_VALUE
# A low IDF value means the word occurs in many documents (high frequency).
jieba.analyse.set_idf_path("ExtraDict/IDF_Out.txt")

# Get Top 100 keywords
tags = jieba.analyse.extract_tags(szFullSent, topK=100, withWeight=True)

for item in tags:
    print(item[0] + ',' + str(item[1]))

明白,0.5239179954441914
貸款,0.38724373576309795
還款,0.36446469248291574
個人資料,0.36446469248291574
通知,0.34168564920273353
每月,0.31890660592255127
申請,0.29612756264236906
同意,0.29612756264236906
確認,0.2733485193621868
短訊,0.2733485193621868
匯豐,0.22779043280182235
利息,0.22779043280182235
條款,0.2050113895216401
銀碼,0.18223234624145787
任何,0.18223234624145787
細則,0.18223234624145787
收到,0.18223234624145787
本金,0.18223234624145787
如果,0.16339089981184507
唔會,0.15945330296127563
私隱條例,0.15945330296127563
接受,0.1366742596810934
用途,0.1366742596810934
文件,0.1366742596810934
批核,0.1366742596810934
戶口,0.1366742596810934
第一期,0.1366742596810934
參考,0.1366742596810934
銀行,0.1366742596810934
兩個percent,0.1366742596810934
手續費,0.11389521640091117
收取,0.11389521640091117
清楚,0.09651826047078588
超連結,0.09111617312072894
隨時,0.09111617312072894
考慮,0.09111617312072894
財政,0.09111617312072894
狀況,0.09111617312072894
全期總利息,0.09111617312072894
冇問題,0.09111617312072894
指定,0.09111617312072894
確認信,0.09111617312072894
還款日期,0.09111617312072894
第三方
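
The effect of the custom IDF value can be checked against the printed weights. The sketch below assumes that 明白 was absent from the default IDF file (so it received the median) and that it is one of the AQM keywords now set to 20; both are inferences from the numbers, not facts stated by the files themselves.

```python
median_idf = 11.9547675029   # median of the default IDF file, as stated above
custom_idf = 20.0            # value assigned to AQM keywords

before = 0.3131658913060364  # weight of 明白 with the default IDF file
after = 0.5239179954441914   # weight of 明白 with the custom IDF file

# TF is unchanged between the two runs, so the weights should differ
# only by the ratio of the two IDF values.
observed = after / before
expected = custom_idf / median_idf
print(round(observed, 4), round(expected, 4))
```

The two ratios agree, which is consistent with the weight being TF multiplied by IDF. Words that are not AQM keywords, such as 如果, keep exactly the same weight in both runs.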

In [7]:
# Keywords extracted with the custom IDF file, stored as "word,weight" lines
with open('TranscriptOut/OutKeyWords_withIDF.txt', 'r') as fTFKey:
    array_tf_key = [line.split(',')[0] for line in fTFKey]

# Keywords provided by AQM, one per line; a set gives fast membership tests
with open('ExtraDict/LocalDict_V2.txt', 'r') as fAQMKey:
    set_AQM_key = {line.strip() for line in fAQMKey}

# Keep the extracted keywords that AQM did not list, in their original order
diff_key = ''
for tf_key in array_tf_key:
    if tf_key not in set_AQM_key:
        diff_key += tf_key + ','

print(diff_key)

如果,清楚,電話,一個,可以,關於,已經,號碼,007615,ok,覆核,直接,再有,上述,透過,得到,提早,每次,港幣,400,


With the custom IDF file, the number of extracted keywords that are missing from the AQM list drops from 42 to 20.
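
Counting the terms in the two printed lists makes the reduction concrete. The strings below are copied from the two comparison outputs; the helper `terms` is illustrative only.

```python
def terms(csv_line):
    """Split a comma-separated output line, dropping the empty trailing item."""
    return [t for t in csv_line.split(",") if t]

with_stopwords_only = (
    "如果,清楚,電話,一個,可以,關於,已經,號碼,007615,ok,覆核,直接,再有,上述,"
    "透過,得到,提早,每次,港幣,400,概要,儲存,收返,幾點,開個,平息,上網,會經,"
    "之前,放送,重要,send,六位,數正,體圓,菜仔,問番,開始,破格,近住,一齊,玩風帆,"
)
with_custom_idf = (
    "如果,清楚,電話,一個,可以,關於,已經,號碼,007615,ok,覆核,直接,再有,上述,"
    "透過,得到,提早,每次,港幣,400,"
)

n_before = len(terms(with_stopwords_only))
n_after = len(terms(with_custom_idf))
# Terms that dropped out of the top 100 once AQM keywords were boosted
dropped = [t for t in terms(with_stopwords_only) if t not in terms(with_custom_idf)]
print(n_before, n_after, len(dropped))
```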

## Get Keywords of Each Sentence  

In [8]:
with open('TranscriptOut/1003_3_TimeGap/1s-Table 1.csv', 'r') as fSen:
    for line in fSen:
        # Segment the sentence, then extract up to 10 keywords from it
        szTags = " ".join(jieba.cut(line))
        tags = jieba.analyse.extract_tags(szTags, topK=10)

        print(tags)

['尾數', '已經', 'send', '電話', '號碼', '六位', '數正', '007615']
['明白', '超連結', '接受', '確認', '短訊', '同事', '條款', '隨時', '放送', '體圓']
['貸款', '用途']
[]
['申請', '銀碼', '超連結', '文件', '剔返', '貸款', '考慮', '財政', '狀況', '每月']
['批核', '匯豐', '戶口', '唔會', '任何', '通知', '第一期', '還款日期', '明白', '第三方']
['同意', '根據', '披露', '個人資料', '明白', '銀海', '個頭', '上述']
['個人資料', '細則', '約束', '通知', '匯豐', '主網頁', '左下角', '收到', '短訊', '關於']
['明白', '下載', '文件', '已經', '返圖', '儲存', '可以']
['還款', '每月', '如果', '貸款', '本金', '兩個percent', '款項', '未能', '任何', '逾期未還']


## Further Process
1. Process more transcripts to find further useless or meaningless words that can be added as stopwords
2. Use the keywords of each sentence to predict the T&C rule