---

_You are currently looking at **version 1.0** of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the [Jupyter Notebook FAQ](https://www.coursera.org/learn/python-text-mining/resources/d9pwm) course resource._

---

# Assignment 4 - Document Similarity & Topic Modelling

## Part 1 - Document Similarity

For the first part of this assignment, you will complete the functions `doc_to_synsets` and `similarity_score` which will be used by `document_path_similarity` to find the path similarity between two documents.

The following functions are provided:
* **`convert_tag:`** converts the tag given by `nltk.pos_tag` to a tag used by `wordnet.synsets`. You will need to use this function in `doc_to_synsets`.
* **`document_path_similarity:`** computes the symmetrical path similarity between two documents by finding the synsets in each document using `doc_to_synsets`, then computing similarities using `similarity_score`.

You will need to finish writing the following functions:
* **`doc_to_synsets:`** returns a list of synsets in a document. This function should first tokenize and part-of-speech tag the document using `nltk.word_tokenize` and `nltk.pos_tag`. Then it should find each token's corresponding synset using `wn.synsets(token, wordnet_tag)`. The first synset match should be used. If there is no match, that token is skipped.
* **`similarity_score:`** returns the normalized similarity score of a list of synsets (s1) onto a second list of synsets (s2). For each synset in s1, find the synset in s2 with the largest similarity value. Sum all of the largest similarity values together and normalize this value by dividing it by the number of largest similarity values found. Be careful with data types, which should be floats. Missing values should be ignored.
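
The normalization rule for `similarity_score` can be sketched without WordNet. The snippet below is only an illustration under stated assumptions: `toy_sim` and its lookup table are hypothetical stand-ins for `synset.path_similarity` (which can return `None` when no path connects two synsets), and returning `0.0` when nothing matches is a choice made here, not something the assignment specifies.

```python
# Hypothetical stand-in for synset.path_similarity: a symmetric lookup table.
# Pairs missing from the table yield None, mimicking "no similarity available".
TOY_SIMILARITY = {
    ("cat", "dog"): 0.2,
    ("cat", "cat"): 1.0,
    ("like", "like"): 1.0,
}


def toy_sim(a, b):
    """Return the toy similarity of a and b, or None if it is undefined."""
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a)))


def normalized_score(s1, s2, sim=toy_sim):
    """For each item in s1 keep its best match in s2, then average the bests."""
    best_scores = []
    for a in s1:
        scores = [sim(a, b) for b in s2]
        scores = [s for s in scores if s is not None]  # ignore missing values
        if scores:
            best_scores.append(max(scores))
    # Assumption: return 0.0 when nothing matched, to avoid dividing by zero.
    return sum(best_scores) / len(best_scores) if best_scores else 0.0


# "xyz" has no defined similarity to anything, so it does not enter the average.
score = normalized_score(["like", "cat", "xyz"], ["like", "dog"])
print(score)  # (1.0 + 0.2) / 2
```

Note that the score is directional: swapping the two lists can give a different value, which is why `document_path_similarity` averages both directions.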

Once `doc_to_synsets` and `similarity_score` have been completed, submit to the autograder which will run `test_document_path_similarity` to test that these functions are running correctly. 

*Do not modify the functions `convert_tag`, `document_path_similarity`, and `test_document_path_similarity`.*

In [12]:
import numpy as np
import nltk  # <-- NLTK is used here
from nltk.corpus import wordnet as wn
import pandas as pd


def convert_tag(tag):
    """Convert the tag given by nltk.pos_tag to the tag used by wordnet.synsets"""
    
    tag_dict = {'N': 'n', 'J': 'a', 'R': 'r', 'V': 'v'}
    try:
        return tag_dict[tag[0]]
    except KeyError:
        return None


def doc_to_synsets(doc):
    """
    Returns a list of synsets in document.

    Tokenizes and tags the words in the document doc.
    Then finds the first synset for each word/tag combination.
    If a synset is not found for that combination it is skipped.

    Args:
        doc: string to be converted

    Returns:
        list of synsets

    Example:
        doc_to_synsets('Fish are nvqjp friends.')
        Out: [Synset('fish.n.01'), Synset('be.v.01'), Synset('friend.n.01')]
    """
    

    # Your Code Here
    # synset => A set of one or more synonyms that are interchangeable in some context without changing the truth value of the proposition.
    # doc_to_synsets converts a sentence into a list of synsets.
    # (1) Tokenize.  Output: ['This', 'is', 'a', 'function', 'to', 'test', 'document_path_similarity', '.']
    wl = nltk.word_tokenize(doc)
    # (2) POS tagging.  Output: [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('function', 'NN'), ('to', 'TO'), ('test', 'VB'), ('document_path_similarity', 'NN'), ('.', '.')]
    ### Stemming (tried, not used)
#     p_stemmer = nltk.PorterStemmer()
#     wl = [p_stemmer.stem(w.lower()) for w in wl] # Unless the words are lowercased here, "I" => PRP but "i" => NN; the differing tags affect the answer.
    ### Lemmatizer (tried, not used)
#     WNLemma = nltk.WordNetLemmatizer()
#     tagged_list = nltk.pos_tag([WNLemma.lemmatize(w) for w in wl])
    tagged_list = nltk.pos_tag(wl)
    # (3) Look up the synsets in WordNet.  Additional WordNet material: https://www.nltk.org/howto/wordnet.html
    answer = []
    for word, tag in tagged_list:
        ### There are no synsets for this (DT), a (DT), or to (TO), so screening with convert_tag was tried first:
#         if convert_tag(tag) is None:
#             ####
#             #### One also needs to get the synsets for those tokens which have a None wordnet pos. Granted, nothing in the exercise question tells you to get rid of those.
#             ####        So use a try statement and keep processing even when the pos is None.
#             ####
#             continue
#         else:
        try:
            search_result = wn.synsets(word, pos=convert_tag(tag))
            if len(search_result) > 0:
                # print('Synsets list: ', search_result)
                answer.append(search_result[0])
        except Exception as e:  # "except e:" would itself raise a NameError
            print(e)
            continue
#     print('Synset list for the answer', answer)
    return answer  # Your Answer Here

def similarity_score(s1, s2):
    """
    Calculate the normalized similarity score of s1 onto s2

    For each synset in s1, finds the synset in s2 with the largest similarity value.
    Sum of all of the largest similarity values and normalize this value by dividing it by the
    number of largest similarity values found.

    Args:
        s1, s2: list of synsets from doc_to_synsets

    Returns:
        normalized similarity score of s1 onto s2

    Example:
        synsets1 = doc_to_synsets('I like cats')
        synsets2 = doc_to_synsets('I like dogs')
        similarity_score(synsets1, synsets2)
        Out: 0.73333333333333339
    """
    
    # Your Code Here
    arr = []
    for synset1 in s1:
        highest_sim = 0
        for synset2 in s2:
            ret = synset1.path_similarity(synset2)
            if ret is not None and ret > highest_sim:
                highest_sim = ret
        ### Regarding cases where path_similarity returns None: what do you do if it returns None?
        ### What do you do if, for a given t1, it returns None for every single t2?
        ### I suggest you add a whole bunch of print statements inside your loops and understand your code.
        ### (The above is what a mentor wrote on the forum. The point to take from it: if every similarity
        ### is None, should it count as 0 or be left out entirely? Which one looks right?)
        if highest_sim > 0:
            arr.append(highest_sim)
    # Assumption: return 0.0 when no match was found at all, to avoid a ZeroDivisionError.
    if not arr:
        return 0.0
    return sum(arr) / len(arr)  # Your Answer Here

# The following is the official debugging code (provided by a mentor)
# doc1 = 'This is a function to test document_path_similarity.'
# doc2 = 'Use this function to see if your code in doc_to_synsets \
# and similarity_score is correct!'
# synsets1 = doc_to_synsets(doc1)
# synsets2 = doc_to_synsets(doc2)
# print("synsets1", synsets1) # a list with 4 elements
# print("synsets2", synsets2) # a list with 7 elements
# s1s2_score = similarity_score(synsets1, synsets2)
# s2s1_score = similarity_score(synsets2, synsets1)
# print("s1s2_score", s1s2_score) # 0.6?2?0?0?0
# print("s2s1_score", s2s1_score) # 0.4?6?3?7?6
# print("s1 s2 doc similarity score", (s1s2_score + s2s1_score) / 2) # 0.5?4?6?8?3

def document_path_similarity(doc1, doc2):
    """Finds the symmetrical similarity between doc1 and doc2"""

    synsets1 = doc_to_synsets(doc1)
    synsets2 = doc_to_synsets(doc2)

    return (similarity_score(synsets1, synsets2) + similarity_score(synsets2, synsets1)) / 2

### test_document_path_similarity

Use this function to check if doc_to_synsets and similarity_score are correct.

*This function should return the similarity score as a float.*

In [13]:
def test_document_path_similarity():
    doc1 = 'This is a function to test document_path_similarity.'
    doc2 = 'Use this function to see if your code in doc_to_synsets \
    and similarity_score is correct!'

    return document_path_similarity(doc1, doc2)
ret = test_document_path_similarity()
ret
# The following is a review of nltk, which is used in Assignment 2. ===> For details, see samples/michigan/TextMiningAssignment2.ipynb
# NLTK book = corpus information
# nltk.download() # added
# nltk.download('gutenberg') # added
# nltk.download('genesis') # added
# nltk.download('inaugural') # added
# nltk.download('nps_chat') # added
# nltk.download('webtext') # added
# nltk.download('treebank') # added
# nltk.download('averaged_perceptron_tagger') # added
# nltk.download('tagsets')
# nltk.download('udhr') # required for stemming
# nltk.download('wordnet') # required for lemmatization
# nltk.download('punkt') # required for tokenization
# from nltk.book import * # added

# sents()
# # Count vocabulary of words
# print('Count vocabulary of words of text7: Wall Street Journal: ', text7)
# print(sent7)
# print('len(text7) (total number of words): ', len(text7), ',   len(sent7): ', len(sent7))
# print('Number of unique words: ',  len(set(text7)))
# print('First 10 words:', list(set(text7))[:10])
# dist = nltk.FreqDist(text7)
# print('Structure of the frequency distribution:', len(dist))
# print('most_common(5):', dist.most_common(5))
# # print('Number of actual words:', dist.keys()) omitted because there are too many
# print('Frequency of "join":', dist['join'])
# freq_words = [w for w in dist.keys() if len(w) > 5 and dist[w] > 100] # words longer than 5 characters that occur more than 100 times
# print('Words longer than 5 characters that occur more than 100 times:', freq_words)

# # Stemming / Lemmatization (normalize word forms, e.g. to the present tense). Lemmatization works like stemming, but the result is a valid (existing) word.
# porter = nltk.PorterStemmer()
# udhr = nltk.corpus.udhr.words('English-Latin1')# Universal Declaration of Human Rights
# print('Before stemming / lemmatization: ', udhr[:20])
# print('After stemming ("univers" is not an existing word): ', [porter.stem(w) for w in udhr[:20]])
# WNLemma = nltk.WordNetLemmatizer()
# # Words that start with a capital letter are left unchanged (e.g. Rights)
# print('After lemmatization ("rights" became "right"): ', [WNLemma.lemmatize(w) for w in udhr[:20]])
# # Tokenization (split words or sentences at their boundaries)
# text11 = "Children shouldn't drink a sugary drink before bed."
# print('Before tokenization: ', text11.split(' '))
# print('After tokenization: ', nltk.word_tokenize(text11))
# text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is."
# print('After sentence tokenization: ', nltk.sent_tokenize(text12))
# # Part-of-speech (POS) tagging
# text13 = nltk.word_tokenize(text11)
# print("[NLTK's Tokenizer]: ", nltk.pos_tag(text13))

# print('----------Getting part-of-speech tags (other examples)-------------')
# # Show the description of the MD tag as an example
# nltk.help.upenn_tagset('MD')
# part_of_speech = nltk.pos_tag(text7) # <===== get the part-of-speech tags
# print('Word with Part-of-Speech tagging (e.g. NN=noun, VB=verb):', part_of_speech[:5])
# pos_values = [word for (word, tag) in part_of_speech]
# print(pos_values[:5])
# cfd = nltk.FreqDist(part_of_speech)
# counts = [(tag, frequency) for (word, tag), frequency in cfd.most_common()] # <===== earlier there was only word and frequency; here the POS tag is added
# print('tagging and frequency:', counts[:5])
# End of the nltk review

0.554265873015873

<br>
___
`paraphrases` is a DataFrame which contains the following columns: `Quality`, `D1`, and `D2`.

`Quality` is an indicator variable which indicates if the two documents `D1` and `D2` are paraphrases of one another (1 for paraphrase, 0 for not paraphrase).

In [14]:
# Use this dataframe for questions most_similar_docs and label_accuracy
paraphrases = pd.read_csv('paraphrases.csv')
paraphrases.head()

Unnamed: 0,Quality,D1,D2
0,1,"Ms Stewart, the chief executive, was not expec...","Ms Stewart, 61, its chief executive officer an..."
1,1,After more than two years' detention under the...,After more than two years in detention by the ...
2,1,"""It still remains to be seen whether the reven...","""It remains to be seen whether the revenue rec..."
3,0,"And it's going to be a wild ride,"" said Allan ...","Now the rest is just mechanical,"" said Allan H..."
4,1,The cards are issued by Mexico's consulates to...,The card is issued by Mexico's consulates to i...


___

### most_similar_docs

Using `document_path_similarity`, find the pair of documents in paraphrases which has the maximum similarity score.

*This function should return a tuple `(D1, D2, similarity_score)`*
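
A minimal sketch of the selection step, in plain Python so that it runs without the course data. Both `toy_similarity` (a word-overlap ratio) and the `pairs` list are hypothetical stand-ins for `document_path_similarity` and the `D1`/`D2` columns of `paraphrases`:

```python
def toy_similarity(doc1, doc2):
    """Hypothetical stand-in score: Jaccard overlap of lowercase word sets."""
    words1, words2 = set(doc1.lower().split()), set(doc2.lower().split())
    if not words1 and not words2:
        return 0.0
    return len(words1 & words2) / len(words1 | words2)


# Made-up document pairs standing in for the D1 and D2 columns.
pairs = [
    ("the cat sat", "a dog ran"),
    ("the cat sat down", "the cat sat"),
    ("rain is falling", "snow is falling"),
]

# Score every pair once, then keep the pair with the highest score.
scored = [(d1, d2, toy_similarity(d1, d2)) for d1, d2 in pairs]
best = max(scored, key=lambda item: item[2])
print(best)
```

With a DataFrame, the same idea is usually written by storing the scores in a column and looking up the row returned by `idxmax()` on that column.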

In [40]:
def most_similar_docs():
    
    # Your Code Here
    # Find the pair of documents with the highest similarity
    paraphrases['similarity_score'] = [ document_path_similarity(row['D1'], row['D2']) for index, row in paraphrases.iterrows() ]
    
    most_similar_doc = paraphrases[paraphrases['similarity_score'] == paraphrases['similarity_score'].max()]
    
    idx = most_similar_doc.index[0]
    return most_similar_doc.loc[idx, 'D1'], most_similar_doc.loc[idx, 'D2'], most_similar_doc.loc[idx, 'similarity_score']# Your Answer Here
most_similar_docs()

('"Indeed, Iran should be put on notice that efforts to try to remake Iraq in their image will be aggressively put down," he said.',
 '"Iran should be on notice that attempts to remake Iraq in Iran\'s image will be aggressively put down," he said.\n',
 0.97530864197530864)

### label_accuracy

Provide labels for the twenty pairs of documents by computing the similarity for each pair using `document_path_similarity`. Let the classifier rule be that if the score is greater than 0.75, label is paraphrase (1), else label is not paraphrase (0). Report accuracy of the classifier using scikit-learn's accuracy_score.

*This function should return a float.*
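
The classifier rule and the accuracy computation can be sketched in plain Python. The scores and gold labels below are made up for illustration, and the hand-written `accuracy` helper is a hypothetical stand-in that computes the same fraction of correct labels that `accuracy_score(y_true, y_pred)` reports for this kind of input:

```python
def label_from_score(score, threshold=0.75):
    """Apply the classifier rule: 1 (paraphrase) if score > threshold, else 0."""
    return 1 if score > threshold else 0


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up similarity scores and gold labels, for illustration only.
scores = [0.91, 0.75, 0.40, 0.80]
gold = [1, 1, 0, 0]

# A score of exactly 0.75 is NOT greater than the threshold, so it maps to 0.
predicted = [label_from_score(s) for s in scores]
print(predicted)
print(accuracy(gold, predicted))
```

The strict inequality matters at the boundary: a pair scoring exactly 0.75 is labeled "not paraphrase".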

In [46]:
def label_accuracy():
    from sklearn.metrics import accuracy_score

    # Your Code Here
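    # Compute the accuracy of the predicted labels (1: paraphrase, 0: not a paraphrase)
    paraphrases['similarity'] = [ 1 if document_path_similarity(row['D1'], row['D2']) > 0.75 else 0 for index, row in paraphrases.iterrows() ]
    
    # accuracy_score expects (y_true, y_pred); the result is the same in either order, but this is the documented one
    return accuracy_score(paraphrases['Quality'].tolist(), paraphrases['similarity'].tolist())  # Your Answer Here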

# print(len(paraphrases))# Confirm the dataframe are the twenty pairs of documents
label_accuracy()


0.80000000000000004

## Part 2 - Topic Modelling

For the second part of this assignment, you will use Gensim's LDA (Latent Dirichlet Allocation) model to model topics in `newsgroup_data`. You will first need to finish the code in the cell below by using gensim.models.ldamodel.LdaModel constructor to estimate LDA model parameters on the corpus, and save to the variable `ldamodel`. Extract 10 topics using `corpus` and `id_map`, and with `passes=25` and `random_state=34`.

In [60]:
import pickle
import gensim
from sklearn.feature_extraction.text import CountVectorizer

# Load the list of documents
with open('newsgroups', 'rb') as f:
    newsgroup_data = pickle.load(f)

# Use CountVectorizer to find tokens of three or more characters, remove stop_words,
# remove tokens that don't appear in at least 20 documents,
# remove tokens that appear in more than 20% of the documents
vect = CountVectorizer(min_df=20, max_df=0.2, stop_words='english', 
                       token_pattern='(?u)\\b\\w\\w\\w+\\b')
# Fit and transform
X = vect.fit_transform(newsgroup_data)

# Convert sparse matrix to gensim corpus.
corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False) # <-- Convert sparse matrix to Gensim corpus format

# Mapping from word IDs to words (To be used in LdaModel's id2word parameter)
id_map = dict((v, k) for k, v in vect.vocabulary_.items())
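
The vocabulary filters used above can be illustrated with the standard library alone. This is a rough sketch, not scikit-learn's implementation: the three documents and the thresholds `min_df=2`, `max_df=0.7` are made up for the example, and stop-word removal is left out to keep it short.

```python
import re
from collections import Counter

# Same regular expression as the token_pattern passed to CountVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w\w+\b")

# Made-up mini corpus, for illustration only.
docs = [
    "The GPU is fast",
    "A fast car is red",
    "The red car",
]

# Lowercase, then keep only tokens with at least three word characters.
tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

# Keep tokens that appear in at least min_df documents (absolute count)
# and in at most a max_df share of the documents (proportion).
min_df, max_df = 2, 0.7
vocabulary = sorted(
    token for token, df in doc_freq.items()
    if df >= min_df and df / len(docs) <= max_df
)
print(tokenized[0])
print(vocabulary)
```

Short tokens such as "is" and "a" never reach the frequency filter because the pattern requires at least three characters.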


In [61]:
# Use the gensim.models.ldamodel.LdaModel constructor to estimate 
# LDA model parameters on the corpus, and save to the variable `ldamodel`

# Your code here:

print([id_map[i] for i in range(100, 105)]) # inspect the contents of id_map (a dict)

print(len(corpus), corpus)
print()

# For LDA, see ./samples/michigan/TextMiningAssignment4_review.png
### In summary:
###  Topic modeling : analyzes which topics could be the main or secondary subject of a text,
###                   using domain and word frequency (is the subject computer science, medicine, ...?)
###                   - the number of topics has to be specified
###                   - it is a text clustering problem
###
###  LDA            : the most popular topic modeling method
###                   - (1) estimates what the model parameters look like from noisy, heterogeneous text,
###                         through inference and estimation
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=10, id2word=id_map, passes=25, random_state=34)

print(ldamodel.print_topics(num_topics=10, num_words=5)) # check
#   Result of inferring 10 topics from the news articles
#   (top 5 words per topic; the labels are tentative guesses, the weights are in the printed output):
#
#   topic  tentative label     top 5 words
#   -----  ------------------  ------------------------------------
#   0      mail/URL?           edu, com, thanks, mail, know
#   1      current affairs..   ground, current, just, want, use
#   2      CPU?                drive, disk, scsi, drives, hard
#   3      personnel?          time, atheism, list, left, alt
#   4      driving?            car, just, don, bike, good
#   5      sports..            game, team, year, games, play
#   6      life..              information, help, medical, new, use
#   7      opinion..           don, people, think, just, say
#   8      DC/machines..       use, apple, power, time, data
#   9      science             space, nasa, science, edu, data

['built', 'bus', 'business', 'buy', 'buying']
2000 <gensim.matutils.Sparse2Corpus object at 0x7fb940f70048>

[(0, '0.056*"edu" + 0.043*"com" + 0.033*"thanks" + 0.022*"mail" + 0.021*"know"'), (1, '0.024*"ground" + 0.018*"current" + 0.018*"just" + 0.013*"want" + 0.013*"use"'), (2, '0.061*"drive" + 0.042*"disk" + 0.033*"scsi" + 0.030*"drives" + 0.028*"hard"'), (3, '0.023*"time" + 0.015*"atheism" + 0.014*"list" + 0.013*"left" + 0.012*"alt"'), (4, '0.025*"car" + 0.016*"just" + 0.014*"don" + 0.014*"bike" + 0.012*"good"'), (5, '0.030*"game" + 0.027*"team" + 0.023*"year" + 0.017*"games" + 0.016*"play"'), (6, '0.017*"information" + 0.014*"help" + 0.014*"medical" + 0.012*"new" + 0.012*"use"'), (7, '0.022*"don" + 0.021*"people" + 0.018*"think" + 0.017*"just" + 0.012*"say"'), (8, '0.034*"use" + 0.023*"apple" + 0.020*"power" + 0.016*"time" + 0.015*"data"'), (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data"')]


### lda_topics

Using `ldamodel`, find a list of the 10 topics and the most significant 10 words in each topic. This should be structured as a list of 10 tuples where each tuple takes on the form:

`(9, '0.068*"space" + 0.036*"nasa" + 0.021*"science" + 0.020*"edu" + 0.019*"data" + 0.017*"shuttle" + 0.015*"launch" + 0.015*"available" + 0.014*"center" + 0.014*"sci"')`

for example.

*This function should return a list of tuples.*
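
Each tuple pairs a topic id with a formatted string. If the words and weights are needed as data, the string can be split with a regular expression. The sketch below is an illustration only: `parse_topic` is a hypothetical helper, and the sample string is a shortened copy of the example shown above.

```python
import re

# Shortened copy of the example tuple above (first three terms only).
topic = (9, '0.068*"space" + 0.036*"nasa" + 0.021*"science"')

# Each term looks like  weight*"word" ; capture both parts.
TERM = re.compile(r'([0-9.]+)\*"([^"]+)"')


def parse_topic(topic_tuple):
    """Split a (topic_id, formatted_string) tuple into id and (word, weight) pairs."""
    topic_id, formatted = topic_tuple
    terms = [(word, float(weight)) for weight, word in TERM.findall(formatted)]
    return topic_id, terms


topic_id, terms = parse_topic(topic)
print(topic_id, terms)
```

The weights recovered this way are only as precise as the printed string (three decimals here).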

In [77]:
def lda_topics():
    
    # Your Code Here
    # Already shown by the check above, but once more..
    
    return ldamodel.print_topics(num_topics=10, num_words=10) # Your Answer Here

### topic_distribution

For the new document `new_doc`, find the topic distribution. Remember to use `vect.transform` on the new doc, and `Sparse2Corpus` to convert the sparse matrix to a gensim corpus.

*This function should return a list of tuples, where each tuple is `(#topic, probability)`*
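
For orientation, a gensim bag-of-words corpus is a sequence of documents, each represented as a list of `(token_id, count)` pairs. The plain-Python sketch below mimics that shape for a small made-up count matrix whose rows are documents (the layout that `documents_columns=False` assumes). The vocabulary, counts, and the `rows_to_bow` helper are all hypothetical.

```python
# Made-up vocabulary and count matrix (rows = documents, columns = token ids),
# standing in for vect.vocabulary_ and the output of vect.transform.
vocabulary = {"orbit": 0, "pluto": 1, "sun": 2}
count_rows = [
    [1, 2, 0],  # document 0
    [0, 0, 3],  # document 1
]


def rows_to_bow(rows):
    """Convert dense count rows into lists of (token_id, count) pairs."""
    return [
        [(token_id, count) for token_id, count in enumerate(row) if count > 0]
        for row in rows
    ]


bow_corpus = rows_to_bow(count_rows)

# Invert the vocabulary so token ids can be read back as words.
id_to_word = {token_id: word for word, token_id in vocabulary.items()}

print(bow_corpus)
print([(id_to_word[i], n) for i, n in bow_corpus[0]])
```

Tokens with a zero count are simply absent from a document's list, which is what keeps the representation sparse.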

In [62]:
new_doc = ["\n\nIt's my understanding that the freezing will start to occur because \
of the\ngrowing distance of Pluto and Charon from the Sun, due to it's\nelliptical orbit. \
It is not due to shadowing effects. \n\n\nPluto can shadow Charon, and vice-versa.\n\nGeorge \
Krumins\n-- "]

In [86]:
def topic_distribution():
    
    # Your Code Here
    # Extract the topics from the document
    X_validate = vect.transform(new_doc)
    corpus_validate = gensim.matutils.Sparse2Corpus(X_validate, documents_columns=False)

    ret = ldamodel.get_document_topics(corpus_validate, per_word_topics=True)

    print(type(ret), type(list(ret)))
    print()
    print(list(ret))

    return list(ret)[0][0] # Your Answer Here

ret = topic_distribution()

print()
print('Topic (subject): ', [id_map[i] for i in [315, 542, 759, 782]]) # check

# get_document_topics: extracts the topics from a document and reports each topic_id with its probability
#
# Per-word topic probabilities (taken from the output below; topic labels are the tentative guesses made earlier):
#   george : topic 3 (personnel?) 0.99
#   orbit  : topic 9 (science)    0.99
#   start  : topic 3 (personnel?) 0.99
#   sun    : topic 3 (personnel?) 0.38, topic 9 (science) 0.62
# Probabilities for the document as a whole: topic 3 (personnel?): 50%, topic 9 (science): 34%
#
# The text is clearly science (NASA territory) news (the document says the freezing starts because of
# the elliptical orbit), yet topic_id 3 gets the larger share.
# print(lda_topics())
#  
#  topic_id:3 => time, atheism, list, left, alt, faq, probably, know, send, moments
#  Is it perhaps because topic_id 3 (personnel?) is itself about predictions?


<class 'gensim.interfaces.TransformedCorpus'> <class 'list'>

[([(0, 0.020001831511363526), (1, 0.020002049019703448), (2, 0.020000000832438254), (3, 0.49626338496523503), (4, 0.020002765209762803), (5, 0.020002857122421773), (6, 0.020001697008683754), (7, 0.02000136796417671), (8, 0.020001847807404223), (9, 0.34372219855881042)], [(315, [3]), (542, [9]), (759, [3]), (782, [9, 3])], [(315, [(3, 0.99998325442510538)]), (542, [(9, 0.9999984717800936)]), (759, [(3, 0.99995348396045913)]), (782, [(3, 0.38210771214323236), (9, 0.61787390399963504)])])]

Topic (subject):  ['george', 'orbit', 'start', 'sun']


### topic_names

From the list of the following given topics, assign topic names to the topics you found. If none of these names best matches the topics you found, create a new 1-3 word "title" for the topic.

Topics: Health, Science, Automobiles, Politics, Government, Travel, Computers & IT, Sports, Business, Society & Lifestyle, Religion, Education.

*This function should return a list of 10 strings.*

In [90]:
def topic_names():
    
    # Your Code Here
    # Assign a title to each topic_id
    # This answer is a subjective assessment of the words in the topics.

    # As a first pass before running, matching each given name to a topic_id gives:
    # Health(>6), Science(>9), Automobiles(>4), Politics(>7), Government, Travel, Computers & IT(>2 and 8), Sports(>5),
    # Business(>3), Society & Lifestyle(>1), Religion, Education(>0)

    topics = ['Education', 'Society & Lifestyle', 'Computers & IT', 'Business', 'Automobiles', 'Sports',
              'Health', 'Politics', 'Computers & IT', 'Science']
    
    return topics  # Your Answer Here
# For what it's worth, this answer received full marks (10 points)

In [3]:
### Week 4 notebook provided here (no textbook was distributed; only the code was provided)

# nltk.download('wordnet_ic')
# nltk.download('stopwords')
# import re
# import pandas as pd
# import numpy as np
# import nltk
# from nltk.corpus import wordnet as wn

# # Use path length in wordnet to find word similarity
# # find sense of words via synonym set
# # n=noun, 01=synonym set for first meaning of the word
# deer = wn.synset('deer.n.01')
# deer

# elk = wn.synset('elk.n.01')
# deer.path_similarity(elk)

# horse = wn.synset('horse.n.01')
# deer.path_similarity(horse)

# # Use an information criteria to find word similarity
# from nltk.corpus import wordnet_ic
# brown_ic = wordnet_ic.ic('ic-brown.dat')
# deer.lin_similarity(elk, brown_ic)

# deer.lin_similarity(horse, brown_ic)

# # Use NLTK Collocation and Association Measures
# from nltk.collocations import *
# # load some text for examples
# from nltk.book import *
# # text1 is the book "Moby Dick"
# # extract just the words without numbers and sentence marks and make them lower case
# text = [w.lower() for w in list(text1) if w.isalpha()]

# bigram_measures = nltk.collocations.BigramAssocMeasures()
# finder = BigramCollocationFinder.from_words(text)
# finder.nbest(bigram_measures.pmi,10)

# # find all the bigrams with occurrence of at least 10, this modifies our "finder" object
# finder.apply_freq_filter(10)
# finder.nbest(bigram_measures.pmi,10)

# # Working with Latent Dirichlet Allocation (LDA) in Python
# # Several packages available, such as gensim and lda. Text needs to be
# # preprocessed: tokenizing, normalizing such as lower-casing, stopword
# # removal, stemming, and then transforming into a (sparse) matrix for
# # word (bigram, etc) occurrences.
# # generate a set of preprocessed documents
# from nltk.stem.porter import PorterStemmer
# from nltk.corpus import stopwords
# from nltk.book import *

# len(stopwords.words('english'))

# stopwords.words('english')

# # extract just the stemmed words without numbers and sentence marks and make them lower case
# p_stemmer = PorterStemmer()
# sw = stopwords.words('english')
# doc1 = [p_stemmer.stem(w.lower()) for w in list(text1) if w.isalpha() and not w.lower() in sw]
# doc2 = [p_stemmer.stem(w.lower()) for w in list(text2) if w.isalpha() and not w.lower() in sw]
# doc3 = [p_stemmer.stem(w.lower()) for w in list(text3) if w.isalpha() and not w.lower() in sw]
# doc4 = [p_stemmer.stem(w.lower()) for w in list(text4) if w.isalpha() and not w.lower() in sw]
# doc5 = [p_stemmer.stem(w.lower()) for w in list(text5) if w.isalpha() and not w.lower() in sw]
# doc_set = [doc1, doc2, doc3, doc4, doc5]

# # under Windows this generates a warning
# import gensim
# from gensim import corpora, models

# dictionary = corpora.Dictionary(doc_set)
# dictionary

# # transform each document into a bag of words
# corpus = [dictionary.doc2bow((doc)) for doc in doc_set]

# # The corpus contains the 5 documents
# # each document is a list of indexed features and occurrence count (freq)
# print(type(corpus))
# print(type(corpus[0]))
# print(type(corpus[0][0]))
# print(corpus[0][::2000])

# # let's try 4 topics for our 5 documents
# # 50 passes takes quite a while, let's try less
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=4, id2word=dictionary, passes=10)

# print(ldamodel.print_topics(num_topics=4, num_words=10))