<a href="https://colab.research.google.com/github/sasanvhn/IMDb-Retrieval-Project/blob/main/IMDb-Retrieval-Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
!pip install --upgrade pip setuptools wheel
!pip install -qq datasets
!pip install ipywidgets
!pip install bertopic -qq
!pip install bertopic[visualization]
!pip install -U sentence-transformers bokeh
!pip install python-terrier



In [13]:
from datasets import load_dataset
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
import numpy as np
import pandas as pd
import random
from bokeh.io import output_file, show
from bokeh.models import ColumnDataSource, HoverTool, LinearColorMapper
from bokeh.palettes import plasma, d3, Turbo256
from bokeh.plotting import figure
from bokeh.transform import transform
import pyterrier as pt
from nltk.stem import PorterStemmer, SnowballStemmer

In [14]:
dataset = load_dataset("stanfordnlp/imdb")

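# Limit to the first 2000 documents for performance.
# Note: select(range(2000)) keeps the stored row order; if the split is sorted
# by label, this sample may be one-sided (inspect subset["label"] to check).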
subset = dataset["train"].select(range(2000))
texts = [doc["text"] for doc in subset]

print(f"Number of documents: {len(texts)}")
print(f"Sample document: {texts[0]}")

Number of documents: 2000
Sample document: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and
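
The sample review contains raw `<br />` tags, and `br` also appears among the topic keywords further down. A minimal cleaning sketch follows, assuming it is acceptable to strip these tags before topic modeling and indexing. The helper name `clean_review` is hypothetical and not part of the original notebook.

```python
import html
import re

_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
_WHITESPACE = re.compile(r"\s+")


def clean_review(text: str) -> str:
    """Replace <br> tags with spaces, unescape HTML entities, collapse whitespace."""
    without_breaks = _BR_TAG.sub(" ", text)
    unescaped = html.unescape(without_breaks)
    return _WHITESPACE.sub(" ", unescaped).strip()


# Shortened, illustrative excerpt in the style of the sample document.
sample = "I really had to see this for myself.<br /><br />The plot is centered around Lena."
print(clean_review(sample))
```

Applying it as `texts = [clean_review(t) for t in texts]` before fitting would change the topic and retrieval outputs shown in this notebook, so the printed results would need to be regenerated.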

Fit a BERTopic model on the 2,000-document subset:

---



In [15]:
topic_model = BERTopic()

topics, probs = topic_model.fit_transform(texts)

print(topic_model.get_topic_info())

    Topic  Count                                  Name  \
0      -1    943                      -1_the_and_of_to   
1       0     88                   0_it_the_movie_this   
2       1     86                 1_the_to_scarecrow_it   
3       2     78                       2_the_of_is_and   
4       3     61                     3_show_the_and_to   
5       4     50                       4_the_of_and_to   
6       5     49                       5_the_of_and_to   
7       6     35                  6_the_western_was_of   
8       7     33                 7_bollywood_is_in_the   
9       8     33                 8_funny_movie_this_it   
10      9     31                     9_to_this_the_and   
11     10     31                      10_of_the_and_to   
12     11     28                 11_in_and_the_musical   
13     12     25               12_sandler_adam_it_this   
14     13     25                     13_her_she_to_and   
15     14     23            14_zombie_zombies_the_gore   
16     15     
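
Most topic names above are made of function words (the, and, of, to), which says little about the topic itself. BERTopic accepts a `vectorizer_model` argument, so passing scikit-learn's `CountVectorizer(stop_words="english")` is one common way to address this. The effect can be illustrated without fitting a model: the sketch below picks each toy topic's most frequent terms with and without a stop list. The stop list, the helper `top_terms`, and the toy documents are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real run would use a fuller one.
STOP_WORDS = frozenset(
    {"the", "and", "of", "to", "is", "in", "this", "it", "that", "br", "a", "was"}
)


def top_terms(docs, n=3, stop_words=frozenset()):
    """Return the n most frequent lowercase tokens in docs, skipping stop_words."""
    counts = Counter()
    for doc in docs:
        for token in re.findall(r"[a-z']+", doc.lower()):
            if token not in stop_words:
                counts[token] += 1
    return [term for term, _ in counts.most_common(n)]


# Toy documents standing in for the reviews assigned to one topic.
toy_topic = [
    "The zombie gore in this movie is the best of the zombie genre",
    "It is a zombie movie and the gore is over the top",
]

print("unfiltered:", top_terms(toy_topic))
print("filtered:  ", top_terms(toy_topic, stop_words=STOP_WORDS))
```

Without the filter, "the" is the top term; with it, only content words remain.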

# Topic visualizations:

In [16]:
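# Only the last expression in a cell is rendered automatically,
# so show each Plotly figure explicitly.
topic_model.visualize_barchart().show()

topic_model.visualize_topics().show()

topic_model.visualize_hierarchy().show()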

Reduce the fitted model to fewer topics (reduce_topics updates topic_model in place rather than creating a second model):

In [17]:
topic_model.reduce_topics(texts, nr_topics=10)

print(topic_model.get_topic_info())

   Topic  Count                       Name  \
0     -1    943           -1_the_and_of_to   
1      0    829            0_the_and_to_of   
2      1     61          1_the_show_and_to   
3      2     40            2_the_to_is_and   
4      3     33             3_the_is_to_in   
5      4     29          4_the_and_of_apes   
6      5     22         5_hitler_the_of_to   
7      6     16        6_prom_the_to_night   
8      7     14  7_the_snowman_killer_snow   
9      8     13           8_the_of_and_her   

                                      Representation  \
0     [the, and, of, to, is, in, this, it, that, br]   
1     [the, and, to, of, is, in, this, it, that, br]   
2   [the, show, and, to, of, br, is, it, that, this]   
3  [the, to, is, and, of, it, in, vampire, this, ...   
4  [the, is, to, in, and, bollywood, of, br, but,...   
5  [the, and, of, apes, to, planet, that, is, was...   
6  [hitler, the, of, to, and, was, he, this, in, is]   
7  [prom, the, to, night, and, of, horror, wa

Prepare the data and index it with "IterDictIndexer":

In [18]:
df = pd.DataFrame({'docno': range(len(texts)), 'text': texts})
print(f"Sample document:\n{df['text'].iloc[0]}")

documents = [{"docno": str(i), "text": text} for i, text in enumerate(df["text"])]

indexer = pt.IterDictIndexer("./index_iterdict", overwrite=True)
indexref = indexer.index(documents)

# Load the index
index = pt.IndexFactory.of(indexref)
print("Index created successfully using IterDictIndexer!")

Sample document:
I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and
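
As its name suggests, IterDictIndexer consumes an iterable of dicts, so the document list can also be produced lazily. The sketch below assumes a generator is acceptable in place of the list; the name `iter_documents` is hypothetical. It yields the same `docno`/`text` records with string docnos, without building a second in-memory copy of the corpus.

```python
from typing import Dict, Iterable, Iterator


def iter_documents(texts: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one {"docno", "text"} dict per document, with string docnos."""
    for i, text in enumerate(texts):
        yield {"docno": str(i), "text": text}


# Toy input standing in for the review texts.
for doc in iter_documents(["first review", "second review"]):
    print(doc)
```

In the notebook this could be passed straight to the indexer, for example `indexer.index(iter_documents(texts))`.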

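Stemmers and weighting models:

Note: PyTerrier's indexers apply Terrier's default term pipeline (stopword removal and Porter stemming) unless configured otherwise, so the "original" index is already Porter-stemmed, and the NLTK-stemmed texts are stemmed a second time when indexed.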

In [19]:
# Retrievers over the default index. pt.terrier.Retriever replaces
# pt.BatchRetrieve, which is deprecated since PyTerrier 0.11.0.
bm25 = pt.terrier.Retriever(index, wmodel="BM25")
tfidf = pt.terrier.Retriever(index, wmodel="TF_IDF")

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Stem each review with NLTK. Splitting on whitespace leaves punctuation
# attached to tokens (e.g. "acting," is stemmed as a single token).
df["text_porter"] = df["text"].apply(lambda x: " ".join(porter.stem(word) for word in x.split()))
df["text_snowball"] = df["text"].apply(lambda x: " ".join(snowball.stem(word) for word in x.split()))

# Build one index per stemmed variant.
documents_porter = [{"docno": str(i), "text": text} for i, text in enumerate(df["text_porter"])]
indexer_porter = pt.IterDictIndexer("./index_porter", overwrite=True)
indexref_porter = indexer_porter.index(documents_porter)
index_porter = pt.IndexFactory.of(indexref_porter)

documents_snowball = [{"docno": str(i), "text": text} for i, text in enumerate(df["text_snowball"])]
indexer_snowball = pt.IterDictIndexer("./index_snowball", overwrite=True)
indexref_snowball = indexer_snowball.index(documents_snowball)
index_snowball = pt.IndexFactory.of(indexref_snowball)

# BM25 retrievers over the stemmed indexes.
bm25_porter = pt.terrier.Retriever(index_porter, wmodel="BM25")
bm25_snowball = pt.terrier.Retriever(index_snowball, wmodel="BM25")
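
To make the BM25 weighting model less of a black box, here is a minimal pure-Python sketch of the textbook Okapi BM25 formula with a smoothed idf. It is an illustration under stated assumptions, not Terrier's implementation, whose exact idf variant and parameters may differ. The function name `bm25_scores`, the parameter defaults, and the toy documents are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Smoothed idf (the "+1" inside the log keeps it positive).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "great acting and a great storyline",
    "terrible acting and no plot",
    "the scenery was nice",
]
scores = bm25_scores("great acting and storyline", toy_docs)
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

The document matching all query terms ranks first, the partial match second, and the document sharing no query term scores zero.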



Define and run queries:

In [20]:
query = "great acting and storyline"
queries = pd.DataFrame([{"qid": 1, "query": query}])

results_bm25 = bm25.transform(queries)
results_tfidf = tfidf.transform(queries)
results_bm25_porter = bm25_porter.transform(queries)
results_bm25_snowball = bm25_snowball.transform(queries)

print("Original BM25 Results:\n", results_bm25)
print("Original TF-IDF Results:\n", results_tfidf)
print("Porter Stemmed BM25 Results:\n", results_bm25_porter)
print("Snowball Stemmed BM25 Results:\n", results_bm25_snowball)

Original BM25 Results:
     qid  docid docno  rank      score                       query
0     1    542   542     0  10.963441  great acting and storyline
1     1    884   884     1  10.071519  great acting and storyline
2     1    216   216     2   9.745575  great acting and storyline
3     1    381   381     3   9.637109  great acting and storyline
4     1    450   450     4   8.390032  great acting and storyline
..   ..    ...   ...   ...        ...                         ...
857   1   1199  1199   857   0.653863  great acting and storyline
858   1    746   746   858   0.639973  great acting and storyline
859   1    248   248   859   0.635068  great acting and storyline
860   1   1966  1966   860   0.491815  great acting and storyline
861   1   1173  1173   861   0.439039  great acting and storyline

[862 rows x 6 columns]
Original TF-IDF Results:
     qid  docid docno  rank     score                       query
0     1    542   542     0  7.079073  great acting and storyline
1   