## MACNM Computational Workshop No. 5 (Jan. 12 Morning) <br>Text Mining<br> by Lu Guan

## NOTES
 1. What kind of problems can text mining solve?
 2. What kind of text mining approaches we need to learn?
 3. Conduct the basic text mining techniques today!

## Why do we want to conduct text mining?
- Topic Modeling (automatically discovering topics within the documents)<br/>
<img src="images/IntroToLDA.png" width="60%" style="float: left"> <br/>

- Classification: extracting text features for further assignment (e.g., Junk email identification)<br/>
<img src="images/email classification.jpg" width="60%" style="float: left"> <br/>

- Sentiment analysis: extracting subjective information and helping understand the social sentiment of the brand, product or service<br/>
<img src="images/sentiment_example.png" width="60%" style="float: left"> <br/>


### Text mining techniques
- Word extraction 
 * tokenization: Chopping up a character sequence into pieces
 * dropping stop words: Excluding extremely common and semantically nonselective words (and, are, as, at, be, for, from…)
 * normalization: Creating synonym dictionary so that match occurs despite the superficial differences (e.g., USA = America)
 * stemming: chopping off the ends of words according to some rules to remove the derivational affixes (e.g., cats ->cat)
 * lemmatization: using a dictionary to match a word with its base (e.g., does ->do)
 * Word tagging: Marking up a word, based on both its definition, as well as its context, often requiring context-specific dictionaries. (e.g., apple [noun])
- Bag of words
 * Unigram, Bigram, Ngram: contiguous sequence of n items from a given sample of text or speech
 * Term frequency: Represent text as numerical feature vectors
- Feature transformation
 * f_classif：Compute the ANOVA F-value for the provided sample
 * χ2 calculates whether the occurrence of the term and occurrence of the category are independent.
 * TFIDF, short for term frequency–inverse document frequency, is the product of two statistics, term frequency and inverse document frequency.
- Topic Modeling Algorithm
 * Latent Dirichlet Allocation (LDA), Latent semantic analysis (LSA), Non-negative matrix factorization (NMF),etc.
- Classification Modeling Algorithm
 * Linear Classifiers (Logistic Regression, Naive Bayes Classifier), Support Vector Machines, Decision Trees, Random Forest, Nearest Neighbor, etc.
- Deep learning word embedding: vector representations of a particular word by capturing context of the word in the document.
 * word2vec, GloVe, etc.

#### 1. Word Extraction
- Tokenization
- Dropping stop words
- Normalization
- Stemming/lemmatization
- Words Tagging

#### 1.1 Tokenization
Chopping up a character sequence into pieces, called tokens.

- Input: Seattle is a coastal seaport city.<br/> 
- Output: Seattle/ is/ a/ costal/ seaport/ city   
<br/> 
<br/> 
- Input: 走在風中今天陽光突然好溫柔,天的溫柔地的溫柔像你抱著我<br/> 
- Output: 走/在/風中/今天/陽光/突然/好/溫柔/,/天/的/溫柔/地/的/溫柔/像/你/抱/著/我

#### 1.2. Dropping stop words (停用词)
Excluding extremely common and
semantically nonselective words (stop words).

- and, are, as, at, be, for, from…

#### 1.3. Normalization
Creating synonym dictionary so that match occurs despite the superficial differences.

- USA=U.S.A.=America

#### 1.4. Stemming and Lemmatization
- Reducing related forms of a word to a base form.
- Stemming: chopping off the ends of words according to some rules to remove the derivational affixes.<br/>
<img src="images/stemming.png" width="60%" style="float: left"> <br/>






- Lemmatization: using a dictionary to match a word with its base (lemma)<br/>
 * does ->do 
 * women -> woman
 * went -> go


#### 1.5. Words Tagging
Marking up a word, based on both its definition, as well as its context, often requiring context-specific dictionaries.<br>

- part-of-speech (POS) <br>
  http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html<br/>
<img src="images/POS.png" width="60%" style="float: left"> 
<br/>



- Named Entity Tagging<br>
<img src="images/entitytagging.png" width="60%" style="float: left"> <br/>



- Sentimental Tagging<br>
<img src="images/sentiment.png" width="60%" style="float: left"> <br/>

### 2. Bag-of-words model (BOW)
A document is represented as the bag of its words. The order of words within the bag is free.

#### 2.1 Document Term Frequency
- Represent text as numerical feature vectors<br>
<img src="images/bagofwords.jpeg" width="60%" style="float: left"> 


#### 2.2 Ngram
- contiguous sequence of n items from a given sample of text or speech
<img src="images/ngram.png" width="60%" style="float: left"> 

### 3.Feature Selection and transformation

- 3.1 f_classif <br>
 * Compute the ANOVA F-value for the provided sample
- 3.2 χ2 <br> 
 * calculates whether the occurrence of the term and occurrence of the category are independent.
- 3.3 TFIDF  <br>
 * short for term frequency–inverse document frequency, is the product of two statistics, term frequency and inverse document frequency.<br>
 <img src="images/tfidf.png" width="60%" style="float: left"> 

### Text mining techniques
- Word extraction 
 * tokenization: Chopping up a character sequence into pieces
 * dropping stop words: Excluding extremely common and semantically nonselective words (and, are, as, at, be, for, from…)
 * normalization: Creating synonym dictionary so that match occurs despite the superficial differences (e.g., USA = America)
 * stemming: chopping off the ends of words according to some rules to remove the derivational affixes (e.g., cats ->cat)
 * lemmatization: using a dictionary to match a word with its base (e.g., does ->do)
 * Word tagging: Marking up a word, based on both its definition, as well as its context, often requiring context-specific dictionaries. (e.g., apple [noun])
- Bag of words
 * Unigram, Bigram, Ngram: contiguous sequence of n items from a given sample of text or speech
 * Term frequency: Represent text as numerical feature vectors
- Feature transformation
 * 3.1 f_classif：Compute the ANOVA F-value for the provided sample
 * 3.2 χ2 calculates whether the occurrence of the term and occurrence of the category are independent.
 * 3.3 TFIDF, short for term frequency–inverse document frequency, is the product of two statistics, term frequency and inverse document frequency.
- Topic Modeling Algorithm
 * Latent Dirichlet Allocation (LDA), Latent semantic analysis (LSA), Non-negative matrix factorization (NMF),etc.
- Classification Modeling Algorithm
 * Linear Classifiers (Logistic Regression, Naive Bayes Classifier), Support Vector Machines, Decision Trees, Random Forest, Nearest Neighbor, etc.
- Deep learning word embedding: vector representations of a particular word by capturing context of the word in the document.
 * Neural Network: word2vec, GloVe, etc.

# Map of text mining tasks (文本分析的几种套路)

### Flow chart for topic modeling 
<img src="images/topic_modeling_flow_chart.png" width="60%" style="float: left"> <br/>

### Flow chart for text classification 
photo from https://mp.weixin.qq.com/s/yNOfoOvCOb9Vj4CnVIiifQ
<img src="images/text_classification_flow_chart.jpg" width="70%" style="float: left"> <br/>

### Flow chart for sentiment analysis
photo from https://www.researchgate.net/figure/Flowchart-of-the-Proposed-Twitter-Sentiment-Analysis-System_fig1_273635463 <br/>
<img src="images/sentiment_flow_chart.png" width="60%" style="float: left"> <br/>

# Hands-on today
- Word extraction (chopping the words, dropping stop words, normalization, stemming, etc.)
- Bag of words + TF-IDF
- Topic Modeling

## Code
### 1.1 English word extraction

###  Library to install: nltk (pip install nltk)

https://www.nltk.org/install.html<br/>
<img src="images/nltk.png" width="40%" style="float: left"> <br/>


In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [181]:
import string
import re
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import brown
brown_tagged_sents=brown.tagged_sents(categories=None)
unigram_tagger=nltk.UnigramTagger(brown_tagged_sents)
wnl = nltk.WordNetLemmatizer()
porter=nltk.PorterStemmer()
stopwords = nltk.corpus.stopwords.words('english')

In [186]:
# Chopping up a character sequence into pieces
s="RT @ReutersAfrica: #UN, #WorldBank to launch refugee and reconstruction bonds #migrantcrisis http://t.co/h7QwzbOJDb http://t.co/KJGRtQgW6M"
s=s.split()
s

['RT',
 '@ReutersAfrica:',
 '#UN,',
 '#WorldBank',
 'to',
 'launch',
 'refugee',
 'and',
 'reconstruction',
 'bonds',
 '#migrantcrisis',
 'http://t.co/h7QwzbOJDb',
 'http://t.co/KJGRtQgW6M']

In [187]:
# filtering words, excluding RT, @, and URLs.
s=[token for token in s if token!="RT" and not token.startswith("@") and not token.startswith("http")]
s

['#UN,',
 '#WorldBank',
 'to',
 'launch',
 'refugee',
 'and',
 'reconstruction',
 'bonds',
 '#migrantcrisis']

In [5]:
##lowercase and drop punctuations
for token in s:
    token=token.translate(str.maketrans("", "", string.punctuation)).lower()
    print (token)

un
worldbank
to
launch
refugee
and
reconstruction
bonds
migrantcrisis


In [6]:
##stemming
for token in s:
    token=token.translate(str.maketrans("", "", string.punctuation)).lower()
    token=porter.stem(token)
    print (token)

un
worldbank
to
launch
refuge
and
reconstruct
bond
migrantcrisi


In [7]:
##Lemmatization
for token in s:
    token=token.translate(str.maketrans("", "", string.punctuation)).lower()
    token=wnl.lemmatize(token,"n")
    token=wnl.lemmatize(token,"v")
    print (token)

un
worldbank
to
launch
refugee
and
reconstruction
bond
migrantcrisis


In [189]:
##POS-tagging
for token in s:
    token=token.translate(str.maketrans("", "", string.punctuation)).lower()
    token=wnl.lemmatize(token,"n")
    token=wnl.lemmatize(token,"v")
    tag=unigram_tagger.tag([token])
    t=tag[0][1]
    print (token,t)

un NN-HL
worldbank None
to TO
launch VB
refugee NN
and CC
reconstruction NN
bond NN
migrantcrisis None


In [9]:
##Find Stop Words
for token in s:
    token=token.translate(str.maketrans("", "", string.punctuation)).lower()
    token=wnl.lemmatize(token,"n")
    token=wnl.lemmatize(token,"v")
    tag=unigram_tagger.tag([token])
    t=tag[0][1]
    if token in stopwords:
        stop=1
    else:
        stop=0
    print (token,t,stop)

un NN-HL 0
worldbank None 0
to TO 1
launch VB 0
refugee NN 0
and CC 1
reconstruction NN 0
bond NN 0
migrantcrisis None 0


<p>** In-class exercise 1**</p>
- read the dataset tweet_en_no_retw.xlsx
- tokenize, lowercase, drop punctuations, and lemmatize the text.

In [180]:
import pandas as pd

In [190]:
##read tweet_en_no_retw.xlsx
tweet_en=
tweet_en.head()

Unnamed: 0,GUID,Date (HKT),URL,Contents,Country,State/Region,City/Urban Area,Category,Emotion,Source,Gender,Posts,Followers,Following,Post Type,author_anonymous,if_ret
0,201504835877605408,2012-05-13 10:51:11,http://twitter.com/MisterBearBear/status/20150...,I'm at City University of Hong Kong 香港城市大學 (Ko...,Hong Kong S.A.R.,Kowloon City,Hong Kong,Neutral,Neutral,Twitter,M,,,,Tweet,1385,0
1,9805916782202880,2010-12-01 11:08:07,http://twitter.com/jeromyu/status/980591678220...,I'm at City University of Hong Kong 香港城市大學 (81...,Hong Kong S.A.R.,,,Neutral,Neutral,Twitter,M,,,,Tweet,1386,0
2,435352986026786816,2014-02-17 18:00:14,http://twitter.com/FDTLIRL/status/435352986026...,City University of Hong Kong makes major enter...,Ireland,Dublin,Dublin,Neutral,Neutral,Twitter,,437.0,28.0,40.0,Tweet,1387,0
3,579954355085803520,2015-03-23 18:34:27,http://twitter.com/PPSSHK/status/5799543550858...,feeling exhausted at City University of Hong K...,Hong Kong S.A.R.,Kowloon City,Hong Kong,Neutral,Neutral,Twitter,,139.0,11.0,56.0,Tweet,1388,0
4,136762160100548608,2011-11-16 19:06:54,http://twitter.com/noreenwelch/status/13676216...,Mr Mugabe made the comments as he attended his...,New Zealand,,,Neutral,Unclassified,Twitter,F,,,,Tweet,1389,0


In [193]:
##take a look at the fifth Contents
tweet_en['Contents'].iloc[5]

'Derek Leong (right), who plays for City University of Hong Kong, helped Perak win gold at SUKMA 2018. https://t.co/VNJwCCHgT0'

In [195]:
import string
def text_cleaning(content):
    word_list=str(content).split()
    word_list=[token for token in word_list if token !="RT" and not token.startswith("＠") and not token.startswith("http") and not token.startswith("https") and not token in(stopwords)]
    new_tokens=[]
    for token in word_list:
        ##lowercase the token and drop punctuation
        token=
        token=wnl.lemmatize(token,"n")
        token=wnl.lemmatize(token,"v")
        token = re.sub("[^A-Za-z0-9]", "", token)
        new_tokens.append(token)
    processed_text = " ".join(new_tokens)
    return processed_text

In [202]:
tweet_en['pro_Contents']=tweet_en['Contents'].apply(text_cleaning)

In [17]:
tweet_en['pro_Contents'].head()

0    im at city university of hong kong  kowloon to...
1    im at city university of hong kong  81 tat che...
2    city university of hong kong make major enterp...
3        feel exhaust at city university of hong kong 
4    mr mugabe make the comment a he attend his dau...
Name: pro_Contents, dtype: object

In [197]:
##take a look at the fifth processed Contents


'derek leong right play city university hong kong help perak win gold sukma 2018'

<p>**1.2 Chinese word extraction **</p>
- jieba (pip install jieba) https://pypi.org/project/jieba/ 
- PKUSeg, https://github.com/lancopku/pkuseg-python
- THULAC, http://thulac.thunlp.org/

In [39]:
import jieba

In [184]:
##tokenizing the word
text = '走在風中今天陽光突然好溫柔,天的溫柔地的溫柔像你抱著我'  
cut_obj =jieba.cut(text) 
sentence=" ".join(cut_obj)
sentence

'走 在 風中 今天 陽光 突然 好 溫柔 , 天 的 溫柔 地 的 溫柔 像 你 抱 著 我'

In [41]:
##drop punctuations
from string import punctuation
all_punc='，。、【 】 “”：；（）《》‘’{}？！⑦()、%^>℃：.”“^-——=&#@￥!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
for token in token_list:
    token=token.translate(str.maketrans("", "", all_punc))
    print (token)

走
在
風中
今天
陽光
突然
好
溫柔

天
的
溫柔
地
的
溫柔
像
你
抱
著
我


In [7]:
##POS-tagging
#documentation: https://blog.csdn.net/suibianshen2012/article/details/53487157
import jieba.posseg
text = '走在風中今天陽光突然好溫柔,天的溫柔地的溫柔像你抱著我' 
seg=jieba.posseg.cut(text)
for j in seg:
    print (j.word,j.flag)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\lguan3\AppData\Local\Temp\jieba.cache
Loading model cost 0.767 seconds.
Prefix dict has been built succesfully.


走 v
在 p
風 n
中 f
今天 t
陽光 nr
突然 ad
好 a
溫柔 a
, x
天 q
的 uj
溫柔 a
地 uv
的 uj
溫柔 a
像 v
你 r
抱著 v
我 r


<p>** In-class exercise 2**</p>
- read the dataset tweet_ch_no_retw.xlsx
- clean the "RT", @handle (e.g.,@yuexinmutian), and urls. 
- remove punctuations.
- tokenize the Chinese word and connect words with space.

In [43]:
## read the data tweet_ch_no_retw.xlsx
tweet_ch=
tweet_ch.head()

Unnamed: 0,GUID,Date (HKT),URL,Contents,Country,State/Region,City/Urban Area,Category,Emotion,Source,Gender,Posts,Followers,Following,Post Type,author_anonymous,if_ret
4,323828809482784768,2013-04-16 00:02:59,http://twitter.com/voaweishi/status/3238288094...,香港市民支援爱国民主运动联合会和香港城市大学合办的第二届六四临时纪念馆，星期一在城大校园内开...,United States of America,District of Columbia,"Washington, D.C.",Neutral,Unclassified,Twitter,,385.0,211.0,61.0,Tweet,4,0
7,652543999119630336,2015-10-10 01:59:47,http://twitter.com/TheNewsChina2/status/652543...,BBC 中文网 视频讨论：从“占中”一周年看香港民主路 BBC 中文网 讨论会嘉宾：学民思潮...,,,,Neutral,Unclassified,Twitter,,84609.0,3190.0,1184.0,Tweet,7,0
8,337392479538139072,2013-05-23 10:20:10,http://twitter.com/wangjinbo/status/3373924795...,邀请函：独立中文笔会颁奖礼暨文学研讨会 http://t.co/Tquh725Toz 独立中...,,,,Neutral,Unclassified,Twitter,,13210.0,18828.0,403.0,Tweet,8,0
9,1068570626665804032,2018-12-01 02:21:05,http://twitter.com/Rick0v0/status/106857062666...,City University of Hong Kong 香港城市大學 香港城市大学 htt...,Hong Kong S.A.R.,Kowloon City,Hong Kong,Neutral,Unclassified,Twitter,,93.0,5.0,87.0,Tweet,9,0
15,361383540182626304,2013-07-28 15:12:04,http://twitter.com/hu_jia/status/3613835401826...,有道义担当的四所大学：香港大学、中文大学、浸会大学以及城市大学。RT@queeten 港四所...,China,Beijing,,Neutral,Unclassified,Twitter,,9874.0,34065.0,1024.0,Tweet,15,0


In [107]:
## taka a look at the content of index 332


'前香港城市大学讲座教授刘汉城在汉藏会议演讲-<大清一统志>、<大明一统志>里的地图画的很清楚，中国领土里没有西藏。无论大清会典也好明史也好，都很清楚表现出，中国从来没能力在西藏抽税，从来没能力在西... https://t.co/VW1OK3LK04 via @lvv2com'

In [50]:
import re
def clean_chinese_word(contents):
    s=str(contents).split()
    sentence_list=[token for token in s if token !="RT" and not token.startswith("＠") and not token.startswith("@") and not token.startswith("http") and not token.startswith("https") and not token in(stopwords)]
    text=' '.join(sentence_list)
    ## use jieba to tokenize the text
    seg=
    new_word_list=[]
    for j in seg:
        if j.flag!='x':
            new_word_list.append(j.word)
    sentence=" ".join(new_word_list)
    return sentence

In [51]:
clean_chinese_word(tweet_ch['Contents'].iloc[332])

'前 香港城市大学 讲座 教授 刘 汉城 在 汉藏 会议 演讲 大清 一统 志 大明 一统 志 里 的 地 图画 的 很 清楚 中国 领土 里 没有 西藏 无论 大清 会典 也好 明史 也好 都 很 清楚 表现 出 中国 从来 没 能力 在 西藏 抽税 从来 没 能力 在 西 ... via'

In [201]:
tweet_ch['pro_Contents']=tweet_ch['Contents'].apply(clean_chinese_word)

In [10]:
tweet_ch['pro_Contents'].head()

4     香港市民 支援 爱国 民主运动 联合会 和 香港城市大学 合办 的 第二届 六四 临时 纪念...
7     BBC 中文网 视频 讨论 从 占 中 一周年 看 香港 民主 路 BBC 中文网 讨论会 ...
8     邀请函 独立 中文 笔会 颁奖礼 暨 文学 研讨会 独立 中文 笔会 将 於 5 月 25 ...
9          City University of Hong Kong 香港 城市 大學 香港城市大学
15    有 道义 担当 的 四所 大学 香港大学 中文 大学 浸会 大学 以及 城市 大学 RT q...
Name: pro_Contents, dtype: object

### Notes

### 2. Bag-of-words model (BOW)
Represent text as numerical feature vectors<br>
- We create a vocabulary of unique tokens—for example, words—from the entire set of documents.<br>
- We construct a feature vector from each document that contains the counts of how often each word occurs in the particular document.<br>
- Since the unique words in each document represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will consist of mostly zeros, which is why we call them sparse<br>
<img src="images/bagofwords.jpeg" width="60%" style="float: left"> 

### 3.Feature Selection and transformation 
- Finding terms that best represent texts in each category<br>
<p>**solution 1:**</p>
- Picking up the most discriminant terms in each category.<br>
- f_classif：Compute the ANOVA F-value for the provided sample
- χ2 calculates whether the occurrence of the term and occurrence of the category are independent.<br>
<p>**solution 2:**</p>
- TFIDF, short for term frequency–inverse document frequency, is the product of two statistics, term frequency and inverse document frequency.<br>
<img src="images/tfidf.png" width="60%" style="float: left"> 


### Code

In [25]:
from sklearn import feature_extraction
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction.text import TfidfTransformer  

corpus = ['aaa ccc aaa aaa', 
          'aaa aaa', 
          'aaa aaa aaa', 
          'aaa aaa aaa aaa',
          'aaa bbb aaa bbb aaa',
          'ccc aaa aaa ccc aaa'
         ]
vectorizer = CountVectorizer() 
X = vectorizer.fit_transform(corpus)

# Get the vocabulary in vectorizer   
word = vectorizer.get_feature_names()  
print(word)

['aaa', 'bbb', 'ccc']


In [52]:
# obtian the term frequency matrix
print (X.toarray())

[[3 0 1]
 [2 0 0]
 [3 0 0]
 [4 0 0]
 [3 2 0]
 [3 0 2]]


In [53]:
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X)
print(tfidf.toarray())

[[0.85151335 0.         0.52433293]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [0.55422893 0.83236428 0.        ]
 [0.63035731 0.         0.77630514]]


#### In-class exercise 3
- construct the word vector for Chinese processed tweets. 
- Get the top 10 words according to term frequency
- Use TF-IDF to process the term frequency vector.

In [204]:
##combine two datasets together
tweet_total=
len(tweet_total)

6770

In [205]:
## we are going to do bag of word model!
import operator
vectorizer = CountVectorizer()
##train the vectorizer with our tweet_total data and name the matrix as total_count_X
total_count_X=
##get vectorizer vocabulary dictionary
vocab_dic=
## sort the vocabulary by their frequency
sorted_vocab_dic = sorted(vocab_dic.items(), key=operator.itemgetter(1),reverse=True)
print (sorted_vocab_dic)

[('龟苓膏', 12356), ('龟板', 12355), ('齐鲁', 12354), ('齐韵宇', 12353), ('鼓掌', 12352), ('鼓吹', 12351), ('鼓励', 12350), ('鼓动', 12349), ('默契', 12348), ('黑色', 12347), ('黑珍珠', 12346), ('黑暗', 12345), ('黑布', 12344), ('黑客', 12343), ('黑奴', 12342), ('黎明前', 12341), ('黄色', 12340), ('黄文俊', 12339), ('黄成荣', 12338), ('黄子悦', 12337), ('麻省理工学院', 12336), ('麻省理工', 12335), ('麦镜', 12334), ('鹤立鸡群', 12333), ('鸠山由纪夫', 12332), ('鸠山', 12331), ('鲜血', 12330), ('鲍朴', 12329), ('鲁港', 12328), ('鱼蛋', 12327), ('高麗', 12326), ('高速', 12325), ('高达斌', 12324), ('高达', 12323), ('高调', 12322), ('高自联', 12321), ('高考', 12320), ('高级顾问', 12319), ('高级', 12318), ('高等教育', 12317), ('高等学府', 12316), ('高热量', 12315), ('高水平', 12314), ('高校', 12313), ('高标准', 12312), ('高敬文', 12311), ('高效', 12310), ('高招', 12309), ('高度', 12308), ('高层', 12307), ('高喊', 12306), ('高分子', 12305), ('高分', 12304), ('高估', 12303), ('高价', 12302), ('高中毕业', 12301), ('高中同学', 12300), ('高中', 12299), ('骸骨', 12298), ('骨骸', 12297), ('骗子', 12296), ('骗取', 12295), ('验水', 12294), ('驻港', 12293), ('驻法

In [56]:
feature_names = vectorizer.get_feature_names()
len(feature_names)

12357

In [206]:
##we are going to transform our data into tfidf score
transformer = TfidfTransformer()
##transform the term frequency matrix into tfidf matrix
tfidf_X = 
print(tfidf_X.toarray().shape)

(6770, 12357)


<p>**3.Topic Modeling **</p> 
- LDA (Latent Dirichlet Analysis)
- Mixed Membership Model
 * Word is in multiple topics
 * Document contains multiple topics <br>
 * For each document:Pick a mix of topics for this document (e.g. 75% economy, 25% politics)<br>

 <img src="images/IntroToLDA.png" width="80%" style="float: left"> 

Fitting an LDA model
- Find the parameters that best explain the data
- Iterative process
 * Start with random assignments
 * Improve assignments in every iteration
 * Distributions force documents to have few topics, topics to have few words
- Known as 'gibbs sampler'
 * See: Steyvers & Griffith, 2007, Probabilistic Topic Models

#### Hyperparameters (see documentation https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html)
- K: number of topics
- alpha: dispersion parameter
 * Lower alpha means fewer topics per document
 * defaults to 1 / n_components
- (Maximum number of iterations): Large dataset with high dimensions of features may cost long time to converge.


Determining number of topic (K)
- Compute 'perplexity' of different options (goodness-of-fit measure) (model.perplexity())
- Manually inspect outcomes

Sklearn Library, LatentDirichletAllocation function

In [207]:
## we firstly train the lda model with bag-of-word matrix
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_topics=10,random_state=0)
id_topic  = lda.fit_transform(total_count_X)
id_topic



array([[0.70603968, 0.0037037 , 0.00370372, ..., 0.00370374, 0.00370372,
        0.0037037 ],
       [0.08294478, 0.03522031, 0.00322584, ..., 0.09381131, 0.00322581,
        0.00322581],
       [0.0028575 , 0.00285714, 0.00285716, ..., 0.29271061, 0.00285715,
        0.00285714],
       ...,
       [0.01111111, 0.89999973, 0.01111111, ..., 0.01111111, 0.01111111,
        0.01111111],
       [0.00769231, 0.69999861, 0.00769231, ..., 0.00769231, 0.00769231,
        0.23846159],
       [0.01      , 0.80999915, 0.01      , ..., 0.01      , 0.01      ,
        0.11000014]])

In [208]:
def display_topics(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 20
display_topics(lda, feature_names, n_top_words)

Topic 0:
香港城市大学 science 香港 六四 教授 郑宇硕 中国 20 学生 报道 讲座 纪念馆 24 新闻 政治学 爆发 表示 城大 30 2011
Topic 1:
hong kong university city im kowloon tong collapse the roof chee tat ave 81 professor china student medium creative post
Topic 2:
大学 香港 mfa 香港城市大学 一名 内地 女生 副教授 发现 2010 一个 城市 毕业 研究生 万元 中新网 香港大学 被判 mba 香港科技大学
Topic 3:
cityu 2001 international run july computer hall 36 香港城市大学 cgi graphic shaw kongs 学生 invite proceedin 通过 sport latest leave
Topic 4:
kongs hong university college business course new research veterinary visit cheng politics southeast talk boost routledgecity joseph nutanix get state
Topic 5:
香港 香港城市大学 教授 学生 郑宇硕 内地 中国 城市 六四 大学 研讨会 民主 参与 独立 运动 中文 授课 15 表示 笔会
Topic 6:
香港城市大学 2018 中国 2013 co 2016 香港 党支部 中共 反恐 法官 升级 校方 维稳 https 成立 研究 新疆 讲座 临时
Topic 7:
香港城市大学 2017 read 调查 10 中国 at 开心 大学 发现 指数 香港 台湾 录取 13 招生 博士 北京 26 appstore
Topic 8:
program 屋顶 倒塌 highlight paper add danger energy misdirect call vegetation sign 绿化 connect competition 天台 icc mba alex car
Topic 9:
hongkong robot place time re

In [209]:
## then, we are going to train the lda model with tf-idf matrix
from sklearn.decomposition import LatentDirichletAllocation
tfidf_lda = LatentDirichletAllocation(n_topics=10,random_state=0)
id_topic  = tfidf_lda.fit_transform(tfidf_X)
id_topic



array([[0.84728549, 0.01696789, 0.01696795, ..., 0.01696821, 0.01696789,
        0.0169681 ],
       [0.19083976, 0.01643032, 0.01643443, ..., 0.01643005, 0.01642981,
        0.01642982],
       [0.01713369, 0.01712898, 0.01712901, ..., 0.01712918, 0.01712893,
        0.01712894],
       ...,
       [0.02971369, 0.7325761 , 0.02971369, ..., 0.02971369, 0.02971421,
        0.02971369],
       [0.02498989, 0.69285758, 0.02498989, ..., 0.02498989, 0.02499327,
        0.02498989],
       [0.02816214, 0.74653854, 0.02816214, ..., 0.02816214, 0.02816422,
        0.02816214]])

In [210]:
def display_topics(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print ("Topic %d:" % (topic_idx))
        print (" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

n_top_words = 20
display_topics(tfidf_lda, feature_names, n_top_words)

Topic 0:
香港城市大学 六四 香港 通过 学联 退出 纪念馆 graduation 郑宇硕 学生 公投 投诉 教授 报道 大学 学术 联会 多年 担任 政治学
Topic 1:
hong kong university city im kowloon tong chee tat ave 81 collapse post roof photo professor medium kongs the student
Topic 2:
mfa 2015 write do 1984 protest program full found hi 2010 内地 greencorp career 500 招生 closure know namelist sp
Topic 3:
free what old course 2011 授课 online 内地 学生 virtual best 24 tradition 26 world start blog 爆发 2017 who
Topic 4:
boost nutanix efficiency operational enterprise security challenge waste cloud result tackle platform remove administrative deployment enhancement university city hong major
Topic 5:
香港 香港城市大学 中国 教授 大学 郑宇硕 学生 北京 城市 研讨会 参与 深圳 运动 co 民主 独立 中文 活动 我们 表示
Topic 6:
20 香港城市大学 1997 prosecution wenyunchao no 13 17 天台 12 倒塌 目前 nice 反恐 nadinelustre kcapinoystar jadinelove 2009 香港 twitter
Topic 7:
香港城市大学 tfb group so fyp foursquare mayor 开心 oust much proud aejmc legal champion fuck accept wow fantastic popular googlyfish
Topic 8:
kong university hong city plac

In [90]:
def write_down_topics(model, feature_names, n_top_words, path):
    with open(path, 'w', encoding='utf-8') as p:
        for topic_idx, topic in enumerate(model.components_):
            p.write('Topic '+str(topic_idx)+',')
            p.write(" ".join([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]]))
            p.write('\n')

n_top_words = 20
write_down_topics(tfidf_lda, feature_names, n_top_words,'topic_words.csv')

In [211]:
##write down the topic distribution for each document
tweet_total1=tweet_total.reset_index()
topic_percentage=pd.DataFrame(id_topic, columns=['topic0','topic1','topic2','topic3','topic4','topic5','topic6','topic7','topic8','topic9'])
tweet_total_topic=pd.concat([tweet_total1, topic_percentage], axis=1)
tweet_total_topic.head()

Unnamed: 0,index,GUID,Date (HKT),URL,Contents,Country,State/Region,City/Urban Area,Category,Emotion,...,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9
0,4,323828809482784768,2013-04-16 00:02:59,http://twitter.com/voaweishi/status/3238288094...,香港市民支援爱国民主运动联合会和香港城市大学合办的第二届六四临时纪念馆，星期一在城大校园内开...,United States of America,District of Columbia,"Washington, D.C.",Neutral,Unclassified,...,0.847285,0.016968,0.016968,0.016968,0.016968,0.01697,0.016968,0.016968,0.016968,0.016968
1,7,652543999119630336,2015-10-10 01:59:47,http://twitter.com/TheNewsChina2/status/652543...,BBC 中文网 视频讨论：从“占中”一周年看香港民主路 BBC 中文网 讨论会嘉宾：学民思潮...,,,,Neutral,Unclassified,...,0.19084,0.01643,0.016434,0.016431,0.01643,0.61702,0.077125,0.01643,0.01643,0.01643
2,8,337392479538139072,2013-05-23 10:20:10,http://twitter.com/wangjinbo/status/3373924795...,邀请函：独立中文笔会颁奖礼暨文学研讨会 http://t.co/Tquh725Toz 独立中...,,,,Neutral,Unclassified,...,0.017134,0.017129,0.017129,0.017129,0.017129,0.538986,0.323978,0.017129,0.017129,0.017129
3,9,1068570626665804032,2018-12-01 02:21:05,http://twitter.com/Rick0v0/status/106857062666...,City University of Hong Kong 香港城市大學 香港城市大学 htt...,Hong Kong S.A.R.,Kowloon City,Hong Kong,Neutral,Unclassified,...,0.588741,0.164556,0.030835,0.030835,0.030837,0.030845,0.030836,0.03084,0.030841,0.030834
4,15,361383540182626304,2013-07-28 15:12:04,http://twitter.com/hu_jia/status/3613835401826...,有道义担当的四所大学：香港大学、中文大学、浸会大学以及城市大学。RT@queeten 港四所...,China,Beijing,,Neutral,Unclassified,...,0.021563,0.021564,0.124473,0.021561,0.021562,0.703033,0.021562,0.021561,0.021561,0.021561


In [None]:
tweet_total_topic.to_excel('tweet_to_topic.xlsx')

In [212]:
user_to_topic=tweet_total_topic[['topic0','topic1','topic2','topic3','topic4','topic5','topic6','topic7','topic8','topic9']].groupby(tweet_total_topic['author_anonymous']).mean().reset_index()
user_to_topic.head()

Unnamed: 0,author_anonymous,topic0,topic1,topic2,topic3,topic4,topic5,topic6,topic7,topic8,topic9
0,4,0.847285,0.016968,0.016968,0.016968,0.016968,0.01697,0.016968,0.016968,0.016968,0.016968
1,7,0.154038,0.018627,0.143444,0.043922,0.047383,0.383703,0.076723,0.021666,0.01736,0.093133
2,8,0.017134,0.017129,0.017129,0.017129,0.017129,0.538986,0.323978,0.017129,0.017129,0.017129
3,9,0.588741,0.164556,0.030835,0.030835,0.030837,0.030845,0.030836,0.03084,0.030841,0.030834
4,10,0.209943,0.160206,0.017691,0.017694,0.017693,0.017692,0.017691,0.017691,0.330541,0.193158


In [None]:

user_to_topic.to_excel('user_to_topic.xlsx')