In [1]:
# we are finding the points of topic segmentation
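
To make the goal concrete: a topic segmentation point is a position between two sentences where the subject changes. The sketch below shows one simple way such points could be proposed, by comparing each sentence with the next and flagging a break where their word overlap is low. It is only an illustration on made-up sentences, not the method used in the rest of this notebook. The helper names (`cosine`, `adjacent_similarities`, `propose_breaks`) and the `threshold=0.2` value are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def adjacent_similarities(sentences):
    """Similarity between each sentence and the one that follows it."""
    bags = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    return [cosine(bags[i], bags[i + 1]) for i in range(len(bags) - 1)]

def propose_breaks(sims, threshold=0.2):
    """Indices i where a break is proposed between sentence i and i + 1."""
    return [i for i, s in enumerate(sims) if s < threshold]

# Toy sentences (hypothetical): two about ranking, two about probabilistic models
sentences = [
    "Ranking functions score documents for a query",
    "A ranking function can score documents with term frequency",
    "Probabilistic models treat relevance as a random variable",
    "Such models estimate the probability of relevance",
]
sims = adjacent_similarities(sentences)
breaks = propose_breaks(sims)
print([round(s, 3) for s in sims])
print(breaks)  # a break is proposed after sentence index 1
```

On real transcripts, comparing windows of several sentences and removing stopwords first would likely give steadier similarities than single raw sentences.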

In [1]:
transcript = "[SOUND] This lecture is a overview of text retrieval methods. In the previous lecture, we introduced the problem of text retrieval. We explained that the main problem is the design of ranking function to rank documents for a query. In this lecture, we will give an overview of different ways of designing this ranking function. So the problem is the following. We have a query that has a sequence of words and the document that's also a sequence of words. And we hope to define a function f that can compute a score based on the query and document. So the main challenge you hear is with design a good ranking function that can rank all the relevant documents on top of all the non-relevant ones. Clearly, this means our function must be able to measure the likelihood that a document d is relevant to a query q. That also means we have to have some way to define relevance. In particular, in order to implement the program to do that, we have to have a computational definition of relevance. And we achieve this goal by designing a retrieval model, which gives us a formalization of relevance. Now, over many decades, researchers have designed many different kinds of retrieval models. And they fall into different categories. First, one family of the models are based on the similarity idea. Basically, we assume that if a document is more similar to the query than another document is, then we will say the first document is more relevant than the second one. So in this case, the ranking function is defined as the similarity between the query and the document. One well known example in this case is vector space model, which we will cover more in detail later in the lecture. A second kind of models are called probabilistic models. In this family of models, we follow a very different strategy, where we assume that queries and documents are all observations from random variables. 
And we assume there is a binary random variable called R here to indicate whether a document is relevant to a query. We then define the score of document with respect to a query as a probability that this random variable R is equal to 1, given a particular document query. There are different cases of such a general idea. One is classic probabilistic model, another is language model, yet another is divergence from randomness model. In a later lecture, we will talk more about one case, which is language model. A third kind of model are based on probabilistic inference. So here the idea is to associate uncertainty to inference rules, and we can then quantify the probability that we can show that the query follows from the document. Finally, there is also a family of models that are using axiomatic thinking. Here, an idea is to define a set of constraints that we hope a good retrieval function to satisfy. So in this case, the problem is to seek a good ranking function that can satisfy all the desired constraints. Interestingly, although these different models are based on different thinking, in the end, the retrieval function tends to be very similar. And these functions tend to also involve similar variables. So now let's take a look at the common form of a state of the art retrieval model and to examine some of the common ideas used in all these models. First, these models are all based on the assumption of using bag of words to represent text, and we explained this in the natural language processing lecture. Bag of words representation remains the main representation used in all the search engines. So with this assumption, the score of a query, like a presidential campaign news with respect to a document of d here, would be based on scores computed based on each individual word. And that means the score would depend on the score of each word, such as presidential, campaign, and news. 
Here, we can see there are three different components, each corresponding to how well the document matches each of the query words. Inside of these functions, we see a number of heuristics used. So for example, one factor that affects the function d here is how many times does the word presidential occur in the document? This is called a term frequency, or TF. We might also denote as c of presidential and d. In general, if the word occurs more frequently in the document, then the value of this function would be larger. Another factor is, how long is the document? And this is to use the document length for scoring. In general, if a term occurs in a long document many times, it's not as significant as if it occurred the same number of times in a short document. Because in a long document, any term is expected to occur more frequently. Finally, there is this factor called document frequency. That is, we also want to look at how often presidential occurs in the entire collection, and we call this document frequency, or df of presidential. And in some other models, we might also use a probability to characterize this information. So here, I show the probability of presidential in the collection. So all these are trying to characterize the popularity of the term in the collection. In general, matching a rare term in the collection is contributing more to the overall score than matching up common term. So this captures some of the main ideas used in pretty much older state of the art original models. So now, a natural question is, which model works the best? Now it turns out that many models work equally well. So here are a list of the four major models that are generally regarded as a state of the art original models, pivoted length normalization, BM25, query likelihood, PL2. When optimized, these models tend to perform similarly. And this was discussed in detail in this reference at the end of this lecture. Among all these, BM25 is probably the most popular. 
It's most likely that this has been used in virtually all the search engines, and you will also often see this method discussed in research papers. And we'll talk more about this method later in some other lectures. So, to summarize, the main points made in this lecture are first the design of a good ranking function pre-requires a computational definition of relevance, and we achieve this goal by designing appropriate retrieval model. Second, many models are equally effective, but we don't have a single winner yet. Researchers are still active and working on this problem, trying to find a truly optimal retrieval model. Finally, the state of the art ranking functions tend to rely on the following ideas. First, bag of words representation. Second, TF and document frequency of words. Such information is used in the weighting function to determine the overall contribution of matching a word and document length. These are often combined in interesting ways, and we'll discuss how exactly they are combined to rank documents in the lectures later. There are two suggested additional readings if you have time. The first is a paper where you can find the detailed discussion and comparison of multiple state of the art models. The second is a book with a chapter that gives a broad review of different retrieval models. [MUSIC]"

In [2]:
print(transcript)

[SOUND] This lecture is a overview of text retrieval methods. In the previous lecture, we introduced the problem of text retrieval. We explained that the main problem is the design of ranking function to rank documents for a query. In this lecture, we will give an overview of different ways of designing this ranking function. So the problem is the following. We have a query that has a sequence of words and the document that's also a sequence of words. And we hope to define a function f that can compute a score based on the query and document. So the main challenge you hear is with design a good ranking function that can rank all the relevant documents on top of all the non-relevant ones. Clearly, this means our function must be able to measure the likelihood that a document d is relevant to a query q. That also means we have to have some way to define relevance. In particular, in order to implement the program to do that, we have to have a computational definition of relevance. And we 

In [3]:
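import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import numpy as np

# WordNetLemmatizer needs the WordNet corpus; if it is missing, run
# nltk.download('wordnet') once (after `import nltk`).
np.random.seed(400)  # fix NumPy's global seed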

In [4]:
# LDA Code acquired from https://github.com/priya-dwivedi/Deep-Learning/blob/master/topic_modeling/LDA_Newsgroup.ipynb

'''
Helper functions that perform the preprocessing steps on the whole transcript
'''
stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

def lemmatize_stemming(text):
    # Lemmatize as a verb first, then reduce the result to its Snowball stem
    return stemmer.stem(lemmatizer.lemmatize(text, pos='v'))

# Tokenize, drop stopwords and tokens of 3 characters or fewer, then lemmatize and stem
def preprocess(text):
    result = []
    for token in simple_preprocess(text):
        if token not in STOPWORDS and len(token) > 3:
            result.append(lemmatize_stemming(token))
    return result

In [5]:
processed_docs = []
processed_docs.append(preprocess(transcript))

In [6]:
'''
Build a gensim.corpora.Dictionary from 'processed_docs' and call it 'dictionary'.
It maps every distinct token in the processed documents to an integer id.
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [7]:
'''
Check the dictionary: print the first 11 (id, token) pairs
'''
count = 0
for k, v in dictionary.items():
    print(k, v)
    count += 1
    if count > 10:
        break

0 abl
1 achiev
2 activ
3 addit
4 affect
5 appropri
6 associ
7 assum
8 assumpt
9 axiomat
10 base


In [8]:
'''
Create the bag-of-words representation of each document, i.e. for each document a list of
(token id, count) pairs recording which words appear and how many times. Save this to 'bow_corpus'
'''
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]
print(bow_corpus)

[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 3), (8, 2), (9, 1), (10, 7), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 4), (17, 2), (18, 1), (19, 5), (20, 1), (21, 1), (22, 1), (23, 2), (24, 1), (25, 1), (26, 4), (27, 2), (28, 3), (29, 1), (30, 1), (31, 4), (32, 2), (33, 2), (34, 1), (35, 1), (36, 1), (37, 5), (38, 2), (39, 1), (40, 1), (41, 7), (42, 1), (43, 1), (44, 1), (45, 9), (46, 4), (47, 1), (48, 28), (49, 1), (50, 2), (51, 1), (52, 3), (53, 1), (54, 1), (55, 2), (56, 1), (57, 2), (58, 3), (59, 1), (60, 3), (61, 3), (62, 4), (63, 1), (64, 1), (65, 4), (66, 2), (67, 16), (68, 5), (69, 3), (70, 2), (71, 4), (72, 1), (73, 1), (74, 2), (75, 7), (76, 1), (77, 1), (78, 1), (79, 2), (80, 2), (81, 1), (82, 2), (83, 1), (84, 1), (85, 3), (86, 1), (87, 3), (88, 1), (89, 4), (90, 10), (91, 3), (92, 2), (93, 2), (94, 1), (95, 3), (96, 2), (97, 5), (98, 1), (99, 4), (100, 3), (101, 1), (102, 3), (103, 29), (104, 1), (105, 1), (106, 2), (107, 2), (108, 1), (109, 2), (110
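
The two steps above (token ids, then per-document counts) can be pictured with plain Python. This is a minimal sketch of the idea only, not gensim's implementation: `build_vocab` and `to_bow` are hypothetical helpers, and the toy document is made up.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in sorted order per document."""
    token2id = {}
    for doc in docs:
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token id, count) pairs for the tokens of one document."""
    counts = Counter(doc)
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)

# Toy, already-stemmed document (hypothetical)
docs = [["model", "rank", "model", "queri"]]
token2id = build_vocab(docs)
bow = to_bow(docs[0], token2id)
print(token2id)  # {'model': 0, 'queri': 1, 'rank': 2}
print(bow)       # [(0, 2), (1, 1), (2, 1)]
```

Each pair reads as "token id, number of occurrences", which is the same shape as the entries printed for `bow_corpus`.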

In [9]:
'''
Preview BOW for our sample preprocessed document
'''
document_num = 0
bow_doc_x = bow_corpus[document_num]

# Each entry is a (token id, count) pair
for token_id, count in bow_doc_x:
    print("Word {} (\"{}\") appears {} time.".format(token_id,
                                                     dictionary[token_id],
                                                     count))

Word 0 ("abl") appears 1 time.
Word 1 ("achiev") appears 2 time.
Word 2 ("activ") appears 1 time.
Word 3 ("addit") appears 1 time.
Word 4 ("affect") appears 1 time.
Word 5 ("appropri") appears 1 time.
Word 6 ("associ") appears 1 time.
Word 7 ("assum") appears 3 time.
Word 8 ("assumpt") appears 2 time.
Word 9 ("axiomat") appears 1 time.
Word 10 ("base") appears 7 time.
Word 11 ("basic") appears 1 time.
Word 12 ("best") appears 1 time.
Word 13 ("binari") appears 1 time.
Word 14 ("book") appears 1 time.
Word 15 ("broad") appears 1 time.
Word 16 ("call") appears 4 time.
Word 17 ("campaign") appears 2 time.
Word 18 ("captur") appears 1 time.
Word 19 ("case") appears 5 time.
Word 20 ("categori") appears 1 time.
Word 21 ("challeng") appears 1 time.
Word 22 ("chapter") appears 1 time.
Word 23 ("character") appears 2 time.
Word 24 ("classic") appears 1 time.
Word 25 ("clear") appears 1 time.
Word 26 ("collect") appears 4 time.
Word 27 ("combin") appears 2 time.
Word 28 ("common") appears 3 time

In [10]:

# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore
'''
Train the LDA model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
lda_model = gensim.models.LdaMulticore(bow_corpus,
                                       num_topics=1,
                                       id2word=dictionary,
                                       passes=3,
                                       workers=2)

In [11]:

'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic ))
    print("\n")

Topic: 0 
Words: 0.042*"model" + 0.041*"document" + 0.024*"function" + 0.021*"queri" + 0.018*"word" + 0.016*"retriev" + 0.016*"rank" + 0.016*"lectur" + 0.014*"relev" + 0.014*"differ"
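
With `num_topics = 1` there is only one word distribution to fit, so the weights printed above should roughly track how often each stem occurs in the transcript. The sketch below illustrates that reading on toy data. The function name `smoothed_frequencies` and the smoothing value `eta=1.0` are assumptions made for the example, not values read from the trained model.

```python
from collections import Counter

def smoothed_frequencies(tokens, eta=1.0):
    """Relative term frequencies with additive smoothing `eta` (an assumed prior)."""
    counts = Counter(tokens)
    total = sum(counts.values()) + eta * len(counts)
    return {t: (n + eta) / total for t, n in counts.items()}

# Toy, already-stemmed tokens (hypothetical)
tokens = ["model", "model", "model", "document", "document", "rank"]
probs = smoothed_frequencies(tokens)

# Print in the same "weight*term" style as the topic listing
for term, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print('{:.3f}*"{}"'.format(p, term))
```

Read this way, a single topic summarizes the whole transcript, so on its own it cannot say where the subject changes; more than one topic, or more than one input document, would be needed for that.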




In [178]:
%pip install textsplit


Collecting textsplit
  Downloading textsplit-0.5-py3-none-any.whl (9.6 kB)
Installing collected packages: textsplit
Successfully installed textsplit-0.5
Note: you may need to restart the kernel to use updated packages.


In [196]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [197]:
corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]

vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)
# TfidfVectorizer L2-normalizes each row by default, so this product holds cosine similarities
pairwise_similarity = tfidf @ tfidf.T

In [199]:
# toarray() converts the sparse result to a dense NumPy array for printing
print(pairwise_similarity.toarray())

[[1.         0.17668795 0.27056873 0.         0.        ]
 [0.17668795 1.         0.15439436 0.         0.        ]
 [0.27056873 0.15439436 1.         0.19635649 0.16815247]
 [0.         0.         0.19635649 1.         0.54499756]
 [0.         0.         0.16815247 0.54499756 1.        ]]
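
The matrix is symmetric with ones on the diagonal because `TfidfVectorizer` scales every row to unit length by default (`norm='l2'`), so multiplying the TF-IDF matrix by its transpose gives the cosine similarity between every pair of documents. The sketch below checks that identity with plain Python on two made-up weight vectors. The helper names and the vectors are assumptions for illustration.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec] if norm else list(vec)

def cosine(a, b):
    """Cosine similarity computed directly from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical term weights for two documents over a 3-term vocabulary
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 1.0, 2.0]

unit_a, unit_b = l2_normalize(doc_a), l2_normalize(doc_b)
sim_from_units = dot(unit_a, unit_b)   # what one cell of the matrix product computes
sim_direct = cosine(doc_a, doc_b)      # textbook cosine similarity
self_sim = dot(unit_a, unit_a)         # a diagonal entry

print(round(sim_from_units, 6), round(sim_direct, 6), round(self_sim, 6))
```

A zero entry in the matrix therefore means the two documents share no terms after stopword removal, as with the first and fourth sentences of the corpus.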
