# W4 Lab Exercise
This is the lab exercise for MIS590: Information Retrieval. <br>
In this lab, you will gain the following experience:<br>
- Understand Vector Space Models (VSMs) for Information Retrieval.
- Develop Practical Skills in Vector-Based Document Representation, Including TF-IDF, Word2Vec, and BERT.
- Compare the Effectiveness of Different Term Weighting Schemes.
- Enhance Analytical Thinking in Evaluating IR Models.
<br>

**Note:** When you see a pencil icon ✏️ in this notebook, it's time for you to code!

# 1. Preliminaries

## 1.1 Install and Import Libraries

In [1]:
# Install the necessary packages

In [2]:
import math
import string
from collections import defaultdict

## 1.2 Input: Query & Document Collections (Corpus)

In [3]:
query = "sleep deprivation"

corpus = [
    "Sleepless nights in the lab have become my new normal. I tried to fix the experiment setup, but the apparatus seems to have a mind of its own. My advisor says results are just around the corner, but the corner keeps moving. Coffee is my only true companion these days.",
    "I thought grad school would be intellectually stimulating, but it's mostly paperwork and waiting for emails. The departmental printer jammed again, and now I'm late for a meeting. The cafeteria ran out of the good snacks, so I'm surviving on vending machine chips. Sleep has become a luxury I can no longer afford.",
    "Writing the dissertation feels like climbing an endless mountain. Every time I finish a chapter, my supervisor suggests new revisions. The impostor syndrome is real, and I wonder if they made a mistake accepting me. Maybe I should have gone to clown college instead. I am utterly deprived of any semblance of a normal life.",
    "My research data got corrupted, and now I have to start over. The lab mouse escaped, and we spent hours trying to find it. The grant proposal deadline is tomorrow, and the online submission portal is down. At least my pet cactus hasn't died yet.",
    "The group meeting turned into a three-hour debate over font choices for the presentation. I'm pretty sure my colleague is stealing my lunch from the fridge. The photocopier is out to get me; it never works when I'm in a hurry. Is there a PhD in napping? Because I'd ace that.",
    "I haven't seen the sun in days due to endless coding sessions. The simulation keeps crashing, and Stack Overflow doesn't have the answers. My roommate thinks I'm a ghost haunting the apartment. Instant noodles have become my primary food group.",
    "Attending conferences sounded fun until I realized they involve a lot of awkward networking. I accidentally spilled coffee on a famous professor's shoes. My poster fell down twice during the session. Next time, I'll just send a cardboard cutout of myself.",
    "The university gym membership was supposed to keep me healthy, but I've only used it once. I tried to attend a yoga class after staying up late for a deadline, but I fell asleep during the meditation. Maybe instead of the gym, my bed is more essential for keeping me healthy.",
    "My teaching assistantship involves grading endless stacks of exams. Students keep emailing me for extensions with creative excuses. One claimed their dog sleeps on the laptop so they cannot use it for the exam. I was deprived of excuses for not completing my dissertation draft, and I might have got some good ones.",
    "Group projects are the worst when you're the only one doing the work. My team members are as elusive as Bigfoot. The project is due next week, and I haven't heard from them. Perhaps I should just write a paper on the sociological implications of group work avoidance."
]

# Binary labels for the documents' relevancy to the query
# Relevant ones: 1, 2, 5, 6, 8
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

In [4]:
print(f"Query: {query}\n")
for idx, doc in enumerate(corpus):
    print(f"Document {idx+1}:\n{doc}\n")

Query: sleep deprivation

Document 1:
Sleepless nights in the lab have become my new normal. I tried to fix the experiment setup, but the apparatus seems to have a mind of its own. My advisor says results are just around the corner, but the corner keeps moving. Coffee is my only true companion these days.

Document 2:
I thought grad school would be intellectually stimulating, but it's mostly paperwork and waiting for emails. The departmental printer jammed again, and now I'm late for a meeting. The cafeteria ran out of the good snacks, so I'm surviving on vending machine chips. Sleep has become a luxury I can no longer afford.

Document 3:
Writing the dissertation feels like climbing an endless mountain. Every time I finish a chapter, my supervisor suggests new revisions. The impostor syndrome is real, and I wonder if they made a mistake accepting me. Maybe I should have gone to clown college instead. I am utterly deprived of any semblance of a normal life.

Document 4:
My research dat

# 2. Vector Space Model: TF-IDF

## 2.1 Data Preprocessing

### Steps for textual data preprocessing
1. Tokenization (= word segmentation)
2. Punctuation and non-alphabetic token removal
3. Stopwords removal
4. Lemmatization / stemming

### Import libraries

In [5]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [6]:
# Initialize stopwords, lemmatizer, and punctuation list
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
punctuation_table = str.maketrans('', '', string.punctuation)

# We will use this sentence as an example to showcase the different steps of data preprocessing
example_sentence = "The graduate student was typing, procrastinating, questioning herself, and finally submitting the dissertation while dreaming about sleep."
print(f"Example Sentence:\n{example_sentence}")

Example Sentence:
The graduate student was typing, procrastinating, questioning herself, and finally submitting the dissertation while dreaming about sleep.


### What is tokenization?

In [7]:
tokens = word_tokenize(example_sentence.lower())
print(tokens)

['the', 'graduate', 'student', 'was', 'typing', ',', 'procrastinating', ',', 'questioning', 'herself', ',', 'and', 'finally', 'submitting', 'the', 'dissertation', 'while', 'dreaming', 'about', 'sleep', '.']


### A quick removal of punctuation and non-alphabetic words

In [8]:
tokens_noPunc = [word.translate(punctuation_table) for word in tokens if word.isalpha()]
print(tokens_noPunc)

['the', 'graduate', 'student', 'was', 'typing', 'procrastinating', 'questioning', 'herself', 'and', 'finally', 'submitting', 'the', 'dissertation', 'while', 'dreaming', 'about', 'sleep']


### What are stopwords?

In [9]:
tokens_noSW = [word for word in tokens_noPunc if word not in stop_words]
print(tokens_noSW)

['graduate', 'student', 'typing', 'procrastinating', 'questioning', 'finally', 'submitting', 'dissertation', 'dreaming', 'sleep']


### What is lemmatization?

In [10]:
print("Original\tLemmatized\n")

# Here we use the tokens after stopword removal
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens_noSW]
for ori, lem in zip(tokens_noSW, lemmatized_tokens):
    print(f"{ori}\t{lem}")

Original	Lemmatized

graduate	graduate
student	student
typing	typing
procrastinating	procrastinating
questioning	questioning
finally	finally
submitting	submitting
dissertation	dissertation
dreaming	dreaming
sleep	sleep


### Observe the results above and discuss the following:
- What is lemmatization?
- You probably cannot tell what lemmatization does from the results above. Let's try lemmatization another way.

### How about we give the lemmatizer more information about the tokens?

In [11]:
# Part-of-Speech Tagging
tagged_tokens = pos_tag(tokens)
print(tagged_tokens)

[('the', 'DT'), ('graduate', 'NN'), ('student', 'NN'), ('was', 'VBD'), ('typing', 'VBG'), (',', ','), ('procrastinating', 'VBG'), (',', ','), ('questioning', 'VBG'), ('herself', 'PRP'), (',', ','), ('and', 'CC'), ('finally', 'RB'), ('submitting', 'VBG'), ('the', 'DT'), ('dissertation', 'NN'), ('while', 'IN'), ('dreaming', 'VBG'), ('about', 'RB'), ('sleep', 'NN'), ('.', '.')]


### Then we remove punctuation, non-alphabetic tokens, and stopwords.

In [12]:
# Remove punctuation and non-alphabetic tokens
tagged_tokens_noPunc = [(word[0].translate(punctuation_table), word[1]) for word in tagged_tokens if word[0].isalpha()]
print(tagged_tokens_noPunc)

# Remove stopwords
tagged_tokens_noSW = [(word[0], word[1]) for word in tagged_tokens_noPunc if word[0] not in stop_words]
print(tagged_tokens_noSW)

[('the', 'DT'), ('graduate', 'NN'), ('student', 'NN'), ('was', 'VBD'), ('typing', 'VBG'), ('procrastinating', 'VBG'), ('questioning', 'VBG'), ('herself', 'PRP'), ('and', 'CC'), ('finally', 'RB'), ('submitting', 'VBG'), ('the', 'DT'), ('dissertation', 'NN'), ('while', 'IN'), ('dreaming', 'VBG'), ('about', 'RB'), ('sleep', 'NN')]
[('graduate', 'NN'), ('student', 'NN'), ('typing', 'VBG'), ('procrastinating', 'VBG'), ('questioning', 'VBG'), ('finally', 'RB'), ('submitting', 'VBG'), ('dissertation', 'NN'), ('dreaming', 'VBG'), ('sleep', 'NN')]


### Take 2: what is lemmatization?

In [13]:
# Convert treebank POS tags to wordnet POS tags so the lemmatizer can read them
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wn.ADJ
    elif treebank_tag.startswith('V'):
        return wn.VERB
    elif treebank_tag.startswith('R'):
        return wn.ADV
    else:
        return wn.NOUN

print("Original\tLemmatized\n")
tagged_tokens_lemmatized = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in tagged_tokens_noSW]
for ori, lem in zip(tagged_tokens_noSW, tagged_tokens_lemmatized):
    print(f"{ori[0]}\t{lem}")

Original	Lemmatized

graduate	graduate
student	student
typing	type
procrastinating	procrastinate
questioning	question
finally	finally
submitting	submit
dissertation	dissertation
dreaming	dream
sleep	sleep


### Observe the results above and discuss the following:
- What is lemmatization?

### What is stemming?

In [14]:
print("Original\tStemmed\n")
tokens_stemmed = [stemmer.stem(word) for word in tagged_tokens_lemmatized]
for ori, stem in zip(tagged_tokens_lemmatized, tokens_stemmed):
    print(f"{ori}\t{stem}")

Original	Stemmed

graduate	graduat
student	student
type	type
procrastinate	procrastin
question	question
finally	final
submit	submit
dissertation	dissert
dream	dream
sleep	sleep


### Observe the results above and discuss the following:
- What is stemming?
- Why is stemming helpful in improving TF-IDF performance?
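
One way to explore the second question is to compare the term overlap between the query and a document before and after stemming. The sketch below uses hand-written token sets rather than calling NLTK; the stemmed forms are copied from this notebook's preprocessing output (where `deprivation` and `deprived` both end up as `depriv`), and all variable names are illustrative only.

```python
# Surface forms: the query says "deprivation", the document says "deprived".
query_raw = {"sleep", "deprivation"}
doc_raw = {"utterly", "deprived", "semblance", "normal", "life"}

# Stemmed forms as they appear in this notebook's preprocessing output.
query_stemmed = {"sleep", "depriv"}
doc_stemmed = {"utterli", "depriv", "semblanc", "normal", "life"}

# Set intersection gives the terms that the query and the document share.
overlap_raw = query_raw & doc_raw
overlap_stemmed = query_stemmed & doc_stemmed

print("Shared terms without stemming:", overlap_raw)
print("Shared terms with stemming:", overlap_stemmed)
```

Without stemming the two sets share nothing, so an exact-match model such as TF-IDF would score this document 0 for the query.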

### ✏️ Now let's preprocess the query and the documents!

In [15]:
# TODO
# Preprocessing function
def preprocess_text(text):

    # Step 1: Convert to lowercase and tokenize text into words
    tokens = word_tokenize(text.lower())

    # Step 2: Tag part-of-speech of the tokens
    tokens = pos_tag(tokens)

    # Step 3: Remove punctuation and non-alphabetic tokens
    tokens = [(word[0].translate(punctuation_table), word[1]) for word in tokens if word[0].isalpha()]

    # Step 4: Remove stopwords
    tokens = [(word[0], word[1]) for word in tokens if word[0] not in stop_words]

    # Step 5: Lemmatize tokens
    tokens = [lemmatizer.lemmatize(token, get_wordnet_pos(tag)) for token, tag in tokens]

    # Step 6: Stem tokens
    tokens = [stemmer.stem(word) for word in tokens]

    return tokens

In [16]:
# Apply preprocessing to the query and to each document in the corpus
preprocessed_query = preprocess_text(query)
print(f"Query: {preprocessed_query}\n")

preprocessed_corpus = [preprocess_text(doc) for doc in corpus]
# Print preprocessed corpus
for idx, doc in enumerate(preprocessed_corpus):
    print(f"Document {idx+1}: {doc}")

Query: ['sleep', 'depriv']

Document 1: ['sleepless', 'night', 'lab', 'becom', 'new', 'normal', 'tri', 'fix', 'experi', 'setup', 'apparatu', 'seem', 'mind', 'advisor', 'say', 'result', 'around', 'corner', 'corner', 'keep', 'move', 'coffe', 'true', 'companion', 'day']
Document 2: ['think', 'grad', 'school', 'would', 'intellectu', 'stimul', 'mostli', 'paperwork', 'wait', 'email', 'department', 'printer', 'jam', 'late', 'meet', 'cafeteria', 'run', 'good', 'snack', 'surviv', 'vend', 'machin', 'chip', 'sleep', 'becom', 'luxuri', 'longer', 'afford']
Document 3: ['write', 'dissert', 'feel', 'like', 'climb', 'endless', 'mountain', 'everi', 'time', 'finish', 'chapter', 'supervisor', 'suggest', 'new', 'revis', 'impostor', 'syndrom', 'real', 'wonder', 'make', 'mistak', 'accept', 'mayb', 'go', 'clown', 'colleg', 'instead', 'utterli', 'depriv', 'semblanc', 'normal', 'life']
Document 4: ['research', 'data', 'get', 'corrupt', 'start', 'lab', 'mous', 'escap', 'spend', 'hour', 'tri', 'find', 'grant', '

## ✏️ 2.2 Compute Term Frequency (TF)
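
The cell below uses length-normalized term frequency: the raw count of a term divided by the number of tokens in the preprocessed document. Written out, with $f_{t,d}$ denoting the raw count of term $t$ in document $d$ (notation introduced here for illustration only):

$$
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
$$

For example, `corner` appears twice among the 25 tokens of Document 1, so its TF is $2 / 25 = 0.08$.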

In [17]:
# Function to compute term frequency (TF) for each document
def compute_tf(doc):

    # Initialize the TF dictionary
    tf_dict = {}

    # TODO
    # Count the term frequency
    for word in doc:
        tf_dict[word] = tf_dict.get(word, 0) + 1

    # TODO
    # Divide term counts by total number of terms in the document
    total_terms = len(doc)
    for word in tf_dict:
        tf_dict[word] = tf_dict[word] / total_terms

    return tf_dict

# Compute TF for each document in the corpus
tf_corpus = [compute_tf(doc) for doc in preprocessed_corpus]

# Print TF values for each document
for idx, tf in enumerate(tf_corpus):
    print(f"TF for Document {idx+1}: {tf}\n")

TF for Document 1: {'sleepless': 0.04, 'night': 0.04, 'lab': 0.04, 'becom': 0.04, 'new': 0.04, 'normal': 0.04, 'tri': 0.04, 'fix': 0.04, 'experi': 0.04, 'setup': 0.04, 'apparatu': 0.04, 'seem': 0.04, 'mind': 0.04, 'advisor': 0.04, 'say': 0.04, 'result': 0.04, 'around': 0.04, 'corner': 0.08, 'keep': 0.04, 'move': 0.04, 'coffe': 0.04, 'true': 0.04, 'companion': 0.04, 'day': 0.04}

TF for Document 2: {'think': 0.03571428571428571, 'grad': 0.03571428571428571, 'school': 0.03571428571428571, 'would': 0.03571428571428571, 'intellectu': 0.03571428571428571, 'stimul': 0.03571428571428571, 'mostli': 0.03571428571428571, 'paperwork': 0.03571428571428571, 'wait': 0.03571428571428571, 'email': 0.03571428571428571, 'department': 0.03571428571428571, 'printer': 0.03571428571428571, 'jam': 0.03571428571428571, 'late': 0.03571428571428571, 'meet': 0.03571428571428571, 'cafeteria': 0.03571428571428571, 'run': 0.03571428571428571, 'good': 0.03571428571428571, 'snack': 0.03571428571428571, 'surviv': 0.03

## ✏️ 2.3 Compute Inverse Document Frequency (IDF)
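
The IDF variant implemented below is $\mathrm{idf}(t) = \ln(N / \mathrm{df}(t)) + 1$, where $N$ is the number of documents and $\mathrm{df}(t)$ is the number of documents that contain $t$. As a quick sanity check, the following sketch evaluates that formula for a 10-document corpus; the helper name `idf_value` is made up for this example and is not part of the lab code.

```python
import math

def idf_value(n_docs, doc_freq):
    """IDF with the +1 offset used in this lab: ln(N / df) + 1."""
    return math.log(n_docs / doc_freq) + 1

N = 10  # the corpus in this lab has 10 documents
for df in (1, 2, 4, 10):
    print(f"df={df:>2}  idf={idf_value(N, df):.4f}")
```

A term that appears in only one document gets about 3.3026, while a term that appears in all 10 documents gets exactly 1.0 rather than 0, so it still contributes to the TF-IDF score.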

In [18]:
# Function to compute inverse document frequency (IDF) for each term in the corpus
def compute_idf(corpus):

    N = len(corpus)  # Total number of documents

    # Initialize the IDF dictionary
    idf_dict = defaultdict(int)

    # TODO
    # Count the number of documents containing each word
    for doc in corpus:
        for word in set(doc):  # Use set to count each word only once per document
            idf_dict[word] += 1

    # TODO
    # Compute IDF (logarithmic scale)
    for word in idf_dict:
        idf_dict[word] = math.log(N / (idf_dict[word])) + 1  # Smoothing by adding 1

    return idf_dict

# Compute IDF for the corpus
idf_dict = compute_idf(preprocessed_corpus)

# Print IDF values
print("IDF for Corpus:")
for word, idf in idf_dict.items():
    print(f"{word}: {idf}")

IDF for Corpus:
tri: 2.203972804325936
normal: 2.6094379124341005
result: 3.302585092994046
advisor: 3.302585092994046
sleepless: 3.302585092994046
true: 3.302585092994046
mind: 3.302585092994046
new: 2.6094379124341005
companion: 3.302585092994046
apparatu: 3.302585092994046
becom: 2.203972804325936
corner: 3.302585092994046
coffe: 2.6094379124341005
seem: 3.302585092994046
experi: 3.302585092994046
lab: 2.6094379124341005
day: 2.6094379124341005
night: 3.302585092994046
keep: 1.916290731874155
fix: 3.302585092994046
setup: 3.302585092994046
say: 3.302585092994046
around: 3.302585092994046
move: 3.302585092994046
school: 3.302585092994046
mostli: 3.302585092994046
paperwork: 3.302585092994046
longer: 3.302585092994046
grad: 3.302585092994046
good: 2.6094379124341005
sleep: 2.6094379124341005
luxuri: 3.302585092994046
intellectu: 3.302585092994046
department: 3.302585092994046
stimul: 3.302585092994046
machin: 3.302585092994046
late: 2.6094379124341005
run: 3.302585092994046
jam: 3.302

## ✏️ 2.4 Compute TF-IDF
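
Each weight is the product of the two quantities from Sections 2.2 and 2.3. As a worked example using values printed earlier in this notebook, `corner` in Document 1 has a TF of 0.08 and an IDF of about 3.3026:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad 0.08 \times 3.3026 \approx 0.2642
$$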

In [19]:
# Function to compute TF-IDF for a document
def compute_tfidf(tf_doc, idf_dict):

    # Initialize TF-IDF dictionary
    tfidf_dict = {}

    # TODO
    # Multiply TF by corresponding IDF
    for word, tf_value in tf_doc.items():
        tfidf_dict[word] = tf_value * idf_dict.get(word, 0)  # Multiply TF by corresponding IDF

    return tfidf_dict

# Compute TF-IDF for each document in the corpus
tfidf_corpus = [compute_tfidf(tf, idf_dict) for tf in tf_corpus]

# Print TF-IDF values for each document
for idx, tfidf in enumerate(tfidf_corpus):
    print(f"TF-IDF for Document {idx+1}: {tfidf}\n")


TF-IDF for Document 1: {'sleepless': 0.13210340371976184, 'night': 0.13210340371976184, 'lab': 0.10437751649736403, 'becom': 0.08815891217303744, 'new': 0.10437751649736403, 'normal': 0.10437751649736403, 'tri': 0.08815891217303744, 'fix': 0.13210340371976184, 'experi': 0.13210340371976184, 'setup': 0.13210340371976184, 'apparatu': 0.13210340371976184, 'seem': 0.13210340371976184, 'mind': 0.13210340371976184, 'advisor': 0.13210340371976184, 'say': 0.13210340371976184, 'result': 0.13210340371976184, 'around': 0.13210340371976184, 'corner': 0.2642068074395237, 'keep': 0.0766516292749662, 'move': 0.13210340371976184, 'coffe': 0.10437751649736403, 'true': 0.13210340371976184, 'companion': 0.13210340371976184, 'day': 0.10437751649736403}

TF-IDF for Document 2: {'think': 0.09319421115836073, 'grad': 0.1179494676069302, 'school': 0.1179494676069302, 'would': 0.1179494676069302, 'intellectu': 0.1179494676069302, 'stimul': 0.1179494676069302, 'mostli': 0.1179494676069302, 'paperwork': 0.117949

## 2.5 The Implementation of an Information Retrieval System

### Measuring similarity: cosine similarity

In [20]:
# Function to compute cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    dot_product = sum(vec1.get(word, 0) * vec2.get(word, 0) for word in vec1)
    magnitude1 = math.sqrt(sum([value ** 2 for value in vec1.values()]))
    magnitude2 = math.sqrt(sum([value ** 2 for value in vec2.values()]))

    if not magnitude1 or not magnitude2:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)
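
Because the TF-IDF vectors are stored as dictionaries, a term that is missing from a vector is treated as a zero. The toy example below restates the function so that it runs on its own and applies it to a few small hand-made vectors; the weights are invented for illustration and do not come from the corpus.

```python
import math

def cosine_similarity(vec1, vec2):
    # Only terms present in both vectors contribute to the dot product
    dot_product = sum(vec1.get(word, 0) * vec2.get(word, 0) for word in vec1)
    magnitude1 = math.sqrt(sum(value ** 2 for value in vec1.values()))
    magnitude2 = math.sqrt(sum(value ** 2 for value in vec2.values()))
    # An all-zero or empty vector has no direction, so return 0.0
    if not magnitude1 or not magnitude2:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)

query_vec = {"sleep": 1.0, "depriv": 1.0}  # invented weights
doc_a = {"sleep": 2.0, "luxuri": 2.0}      # shares one term with the query
doc_b = {"coffe": 1.0, "lab": 1.0}         # shares no term with the query

sim_a = cosine_similarity(query_vec, doc_a)
sim_b = cosine_similarity(query_vec, doc_b)
sim_empty = cosine_similarity(query_vec, {})

print(sim_a, sim_b, sim_empty)
```

A document that has no term in common with the query always scores 0, regardless of its length.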

### Rank the documents using cosine similarity

In [21]:
# Compute TF for the query
tf_query = compute_tf(preprocessed_query)

# Compute TF-IDF for the query
tfidf_query = compute_tfidf(tf_query, idf_dict)

# Compute the cosine similarity of each document to the query
rankings = []
for idx, tfidf_doc in enumerate(tfidf_corpus):
    score = cosine_similarity(tfidf_doc, tfidf_query)
    rankings.append((idx + 1, score))

# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on Query:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")


Document Rankings based on Query:
Rank 1: Document 9 with score 0.20850879673278558
Rank 2: Document 2 with score 0.11131520312033671
Rank 3: Document 3 with score 0.10476487421443352
Rank 4: Document 1 with score 0.0
Rank 5: Document 4 with score 0.0
Rank 6: Document 5 with score 0.0
Rank 7: Document 6 with score 0.0
Rank 8: Document 7 with score 0.0
Rank 9: Document 8 with score 0.0
Rank 10: Document 10 with score 0.0
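
One simple way to judge this ranking against `corpus_relevancy_label` is precision at k: the fraction of the top k results that are labeled relevant. The sketch below hard-codes the document order printed above; `precision_at_k` is a helper written for this example, not a library function.

```python
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# Document IDs (1-based) in the order printed by the ranking cell
ranked_doc_ids = [9, 2, 3, 1, 4, 5, 6, 7, 8, 10]

def precision_at_k(ranked_ids, labels, k):
    """Fraction of the top-k ranked documents whose label is 1."""
    top_k = ranked_ids[:k]
    return sum(labels[doc_id - 1] for doc_id in top_k) / k

for k in (1, 3, 5):
    print(f"P@{k} = {precision_at_k(ranked_doc_ids, corpus_relevancy_label, k):.2f}")
```

Documents tied at a score of 0.0 simply keep their original order, so the values for larger k depend on that tie-breaking rather than on the model.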


### Bigram TF-IDF
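
A bigram is a pair of adjacent tokens, so a bigram model rewards documents that contain the query words next to each other. As a minimal sketch of what `ngrams(tokens, 2)` produces, the pure-Python helper below (named `make_bigrams` for this example only) pairs each token with its successor.

```python
def make_bigrams(tokens):
    """Pair every token with the token that follows it."""
    return list(zip(tokens, tokens[1:]))

print(make_bigrams(["sleepless", "nights", "lab", "become"]))
print(make_bigrams(["sleep", "deprivation"]))
print(make_bigrams(["sleep"]))  # a single token has no successor
```

The two-word query yields a single bigram, so a document can only match it if both words appear side by side after preprocessing.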

In [22]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from nltk.corpus import stopwords
import string

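# Initialize the required tools
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
punctuation_table = str.maketrans('', '', string.punctuation)

# Preprocessing function: lowercase and tokenize the text, then generate bigrams
# Note: this redefines preprocess_text from Section 2.1 and skips lemmatization and stemming
def preprocess_text(text):
    # Step 1: Convert the text to lowercase and tokenize it
    tokens = word_tokenize(text.lower())

    # Step 2: Remove punctuation and non-alphabetic tokens
    tokens = [word.translate(punctuation_table) for word in tokens if word.isalpha()]

    # Step 3: Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]

    # Step 4: Generate bigrams (pairs of adjacent tokens)
    bigrams = list(ngrams(tokens, 2))

    return bigrams

# The given documents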
corpus = [
    "Sleepless nights in the lab have become my new normal. I tried to fix the experiment setup, but the apparatus seems to have a mind of its own. My advisor says results are just around the corner, but the corner keeps moving. Coffee is my only true companion these days.",
    "I thought grad school would be intellectually stimulating, but it's mostly paperwork and waiting for emails. The departmental printer jammed again, and now I'm late for a meeting. The cafeteria ran out of the good snacks, so I'm surviving on vending machine chips. Sleep has become a luxury I can no longer afford.",
    "Writing the dissertation feels like climbing an endless mountain. Every time I finish a chapter, my supervisor suggests new revisions. The impostor syndrome is real, and I wonder if they made a mistake accepting me. Maybe I should have gone to clown college instead. I am utterly deprived of any semblance of a normal life.",
    "My research data got corrupted, and now I have to start over. The lab mouse escaped, and we spent hours trying to find it. The grant proposal deadline is tomorrow, and the online submission portal is down. At least my pet cactus hasn't died yet.",
    "The group meeting turned into a three-hour debate over font choices for the presentation. I'm pretty sure my colleague is stealing my lunch from the fridge. The photocopier is out to get me; it never works when I'm in a hurry. Is there a PhD in napping? Because I'd ace that.",
    "I haven't seen the sun in days due to endless coding sessions. The simulation keeps crashing, and Stack Overflow doesn't have the answers. My roommate thinks I'm a ghost haunting the apartment. Instant noodles have become my primary food group.",
    "Attending conferences sounded fun until I realized they involve a lot of awkward networking. I accidentally spilled coffee on a famous professor's shoes. My poster fell down twice during the session. Next time, I'll just send a cardboard cutout of myself.",
    "The university gym membership was supposed to keep me healthy, but I've only used it once. I tried to attend a yoga class after staying up late for a deadline, but I fell asleep during the meditation. Maybe instead of the gym, my bed is more essential for keeping me healthy.",
    "My teaching assistantship involves grading endless stacks of exams. Students keep emailing me for extensions with creative excuses. One claimed their dog sleeps on the laptop so they cannot use it for the exam. I was deprived of excuses for not completing my dissertation draft, and I might have got some good ones.",
    "Group projects are the worst when you're the only one doing the work. My team members are as elusive as Bigfoot. The project is due next week, and I haven't heard from them. Perhaps I should just write a paper on the sociological implications of group work avoidance."
]

# Apply preprocessing to the documents
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]

# Print the preprocessed results
for idx, doc in enumerate(preprocessed_corpus):
    print(f"Document {idx+1} Bigrams:\n{doc}\n")


Document 1 Bigrams:
[('sleepless', 'nights'), ('nights', 'lab'), ('lab', 'become'), ('become', 'new'), ('new', 'normal'), ('normal', 'tried'), ('tried', 'fix'), ('fix', 'experiment'), ('experiment', 'setup'), ('setup', 'apparatus'), ('apparatus', 'seems'), ('seems', 'mind'), ('mind', 'advisor'), ('advisor', 'says'), ('says', 'results'), ('results', 'around'), ('around', 'corner'), ('corner', 'corner'), ('corner', 'keeps'), ('keeps', 'moving'), ('moving', 'coffee'), ('coffee', 'true'), ('true', 'companion'), ('companion', 'days')]

Document 2 Bigrams:
[('thought', 'grad'), ('grad', 'school'), ('school', 'would'), ('would', 'intellectually'), ('intellectually', 'stimulating'), ('stimulating', 'mostly'), ('mostly', 'paperwork'), ('paperwork', 'waiting'), ('waiting', 'emails'), ('emails', 'departmental'), ('departmental', 'printer'), ('printer', 'jammed'), ('jammed', 'late'), ('late', 'meeting'), ('meeting', 'cafeteria'), ('cafeteria', 'ran'), ('ran', 'good'), ('good', 'snacks'), ('snacks'

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
import math
from collections import defaultdict
from nltk import word_tokenize
from nltk.util import ngrams

# Preprocess the query and the documents into bigrams
preprocessed_query = preprocess_text(query)  # Query text
preprocessed_corpus = [preprocess_text(doc) for doc in corpus]  # Document corpus

# Step 1: Compute TF (Term Frequency)
def compute_tf(doc):
    tf_dict = {}
    for word in doc:
        tf_dict[word] = tf_dict.get(word, 0) + 1
    total_terms = len(doc)
    for word in tf_dict:
        tf_dict[word] = tf_dict[word] / total_terms
    return tf_dict

tf_corpus = [compute_tf(doc) for doc in preprocessed_corpus]  # Compute TF for each document

# Print TF for each document
print("TF for each document:")
for idx, tf in enumerate(tf_corpus):
    print(f"Document {idx+1}: {tf}\n")

# Step 2: Compute IDF (Inverse Document Frequency)
def compute_idf(corpus):
    N = len(corpus)  # Total number of documents
    idf_dict = defaultdict(int)
    for doc in corpus:
        for word in set(doc):  # Use a set so each term is counted once per document
            idf_dict[word] += 1
    for word in idf_dict:
        idf_dict[word] = math.log(N / (idf_dict[word])) + 1  # Smoothing by adding 1
    return idf_dict

idf_dict = compute_idf(preprocessed_corpus)  # Compute IDF for the corpus

# Print IDF
print("\nIDF for the corpus:")
for word, idf in idf_dict.items():
    print(f"{word}: {idf}")

# Step 3: Compute TF-IDF
def compute_tfidf(tf_doc, idf_dict):
    tfidf_dict = {}
    for word, tf_value in tf_doc.items():
        tfidf_dict[word] = tf_value * idf_dict.get(word, 0)
    return tfidf_dict

tfidf_corpus = [compute_tfidf(tf, idf_dict) for tf in tf_corpus]  # Compute TF-IDF for each document

# Print TF-IDF for each document
print("\nTF-IDF for each document:")
for idx, tfidf in enumerate(tfidf_corpus):
    print(f"Document {idx+1}: {tfidf}\n")

# Step 4: Compute TF-IDF for the query
tf_query = compute_tf(preprocessed_query)  # Compute TF for the query
tfidf_query = compute_tfidf(tf_query, idf_dict)  # Compute TF-IDF for the query

# Print TF-IDF for the query
print("\nTF-IDF for the query:")
print(tfidf_query)

# Step 5: Compute cosine similarity
def cosine_similarity(vec1, vec2):
    dot_product = sum(vec1.get(word, 0) * vec2.get(word, 0) for word in vec1)
    magnitude1 = math.sqrt(sum([value ** 2 for value in vec1.values()]))
    magnitude2 = math.sqrt(sum([value ** 2 for value in vec2.values()]))

    if not magnitude1 or not magnitude2:
        return 0.0
    return dot_product / (magnitude1 * magnitude2)

# Compute the cosine similarity between each document and the query
rankings = []
for idx, tfidf_doc in enumerate(tfidf_corpus):
    score = cosine_similarity(tfidf_doc, tfidf_query)
    rankings.append((idx + 1, score))

# Sort by similarity in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print the document rankings by similarity
print("\nDocument Rankings based on Query:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")


TF for each document:
Document 1: {('sleepless', 'nights'): 0.041666666666666664, ('nights', 'lab'): 0.041666666666666664, ('lab', 'become'): 0.041666666666666664, ('become', 'new'): 0.041666666666666664, ('new', 'normal'): 0.041666666666666664, ('normal', 'tried'): 0.041666666666666664, ('tried', 'fix'): 0.041666666666666664, ('fix', 'experiment'): 0.041666666666666664, ('experiment', 'setup'): 0.041666666666666664, ('setup', 'apparatus'): 0.041666666666666664, ('apparatus', 'seems'): 0.041666666666666664, ('seems', 'mind'): 0.041666666666666664, ('mind', 'advisor'): 0.041666666666666664, ('advisor', 'says'): 0.041666666666666664, ('says', 'results'): 0.041666666666666664, ('results', 'around'): 0.041666666666666664, ('around', 'corner'): 0.041666666666666664, ('corner', 'corner'): 0.041666666666666664, ('corner', 'keeps'): 0.041666666666666664, ('keeps', 'moving'): 0.041666666666666664, ('moving', 'coffee'): 0.041666666666666664, ('coffee', 'true'): 0.041666666666666664, ('true', 'co

## Compute NDCG, Accuracy (AC), and Precision-Recall Curve (PRC) for Bigram TF-IDF

In [27]:
import numpy as np
from sklearn.metrics import precision_recall_curve, accuracy_score

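# Predefined relevance labels: Relevant = 1, Non-Relevant = 0
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# Ranked results returned for the query (document IDs in rank order)
# Note: this is a placeholder that keeps the original document order;
# it is not the similarity-based ranking computed in the previous cell
rankings = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ranked_relevancy = [corpus_relevancy_label[i-1] for i in rankings]  # Relevance labels in rank order

# Step 1: Compute DCG and iDCG (for NDCG)
def dcg_score(relevancies):
    dcg = 0.0
    for i, rel in enumerate(relevancies):
        dcg += (rel) / np.log2(i + 2)  # Discount each relevance by log2(rank + 1)
    return dcg

# Ideal DCG: relevance labels sorted into the ideal order
ideal_relevancies = sorted(corpus_relevancy_label, reverse=True)

dcg = dcg_score(ranked_relevancy)
idcg = dcg_score(ideal_relevancies)
ndcg = dcg / idcg if idcg != 0 else 0  # Guard against division by zero

# Step 2: Precision-Recall Curve (PRC)
# Compute precision and recall
# Note: every document gets the same score of 1 here, so the curve has a single threshold
precision, recall, _ = precision_recall_curve(corpus_relevancy_label, [1]*len(rankings))

# Step 3: Compute Accuracy (AC)
# Note: with the placeholder ranking, predicted_labels equals the true labels, so accuracy is trivially 1.0
predicted_labels = [1 if score > 0 else 0 for score in ranked_relevancy]
accuracy = accuracy_score(corpus_relevancy_label, predicted_labels)

# Print the results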
print(f"NDCG: {ndcg}")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")


NDCG: 0.9121559825872436
Accuracy: 1.0
Precision: [0.5 1. ]
Recall: [1. 0.]
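
The cell above evaluates a fixed document order. To score an actual retrieval run, the relevance labels need to be reordered by the ranking first. The following is a sketch of that step, assuming the ranking is a list of `(doc_id, score)` tuples with 1-based IDs, as produced by the ranking cells earlier. `ndcg_from_rankings` is a hypothetical helper, the scores in the two example rankings are made up, and `math.log2` is used instead of NumPy so that the block has no dependencies.

```python
import math

def dcg_score(relevancies):
    # Discount the relevance at 0-based position i by log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevancies))

def ndcg_from_rankings(rankings, labels):
    """rankings: list of (doc_id, score) sorted best-first; doc_id is 1-based."""
    ranked_relevancy = [labels[doc_id - 1] for doc_id, _ in rankings]
    idcg = dcg_score(sorted(labels, reverse=True))
    return dcg_score(ranked_relevancy) / idcg if idcg else 0.0

labels = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# Illustrative rankings with made-up scores
original_order = [(doc_id, 0.0) for doc_id in range(1, 11)]
relevant_first = [(d, 1.0) for d in (1, 2, 5, 6, 8)] + [(d, 0.0) for d in (3, 4, 7, 9, 10)]

ndcg_original = ndcg_from_rankings(original_order, labels)
ndcg_ideal = ndcg_from_rankings(relevant_first, labels)
print(ndcg_original, ndcg_ideal)
```

The first value corresponds to the unranked document order evaluated above, and the second shows that a ranking with every relevant document on top reaches 1.0.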


### Observe the results above and discuss the following:
- Are the highly ranked documents relevant to the query?
- Why?

# 3. Vector Space Model: Word2Vec

## 3.1 Import Libraries

In [28]:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec

## 3.2 Load Pre-trained Word2Vec Model

In [29]:
# Load the pre-trained Google News Word2Vec model
# This might take a while
model = api.load('word2vec-google-news-300')



### Let's observe a Word2Vec vector

In [30]:
model['apple']

array([-0.06445312, -0.16015625, -0.01208496,  0.13476562, -0.22949219,
        0.16210938,  0.3046875 , -0.1796875 , -0.12109375,  0.25390625,
       -0.01428223, -0.06396484, -0.08056641, -0.05688477, -0.19628906,
        0.2890625 , -0.05151367,  0.14257812, -0.10498047, -0.04736328,
       -0.34765625,  0.35742188,  0.265625  ,  0.00188446, -0.01586914,
        0.00195312, -0.35546875,  0.22167969,  0.05761719,  0.15917969,
        0.08691406, -0.0267334 , -0.04785156,  0.23925781, -0.05981445,
        0.0378418 ,  0.17382812, -0.41796875,  0.2890625 ,  0.32617188,
        0.02429199, -0.01647949, -0.06494141, -0.08886719,  0.07666016,
       -0.15136719,  0.05249023, -0.04199219, -0.05419922,  0.00108337,
       -0.20117188,  0.12304688,  0.09228516,  0.10449219, -0.00408936,
       -0.04199219,  0.01409912, -0.02111816, -0.13476562, -0.24316406,
        0.16015625, -0.06689453, -0.08984375, -0.07177734, -0.00595093,
       -0.00482178, -0.00089264, -0.30664062, -0.0625    ,  0.07

### Observe the results above and discuss the following:
- What is the data type of this vector?
- What is the dimensionality?

### Finding similar words and analogies using Word2Vec

In [31]:
model.most_similar("apple")

[('apples', 0.720359742641449),
 ('pear', 0.6450697183609009),
 ('fruit', 0.6410146355628967),
 ('berry', 0.6302295327186584),
 ('pears', 0.613396167755127),
 ('strawberry', 0.6058260798454285),
 ('peach', 0.6025872826576233),
 ('potato', 0.5960935354232788),
 ('grape', 0.5935863852500916),
 ('blueberry', 0.5866668224334717)]

In [32]:
model.most_similar("Apple")

[('Apple_AAPL', 0.7456986308097839),
 ('Apple_Nasdaq_AAPL', 0.7300410270690918),
 ('Apple_NASDAQ_AAPL', 0.717508852481842),
 ('Apple_Computer', 0.7145972847938538),
 ('iPhone', 0.6924266219139099),
 ('Apple_NSDQ_AAPL', 0.6868603229522705),
 ('Steve_Jobs', 0.6758421659469604),
 ('iPad', 0.6580768823623657),
 ('Apple_nasdaq_AAPL', 0.6444970369338989),
 ('AAPL_PriceWatch_Alert', 0.6439753174781799)]

In [33]:
model.most_similar(positive=['Gates', 'Apple'], negative=['Jobs'])

[('Microsoft', 0.457754522562027),
 ('Steve_Ballmer', 0.42643362283706665),
 ('Robert_Gates', 0.40924885869026184),
 ('Ballmer', 0.40724438428878784),
 ('Mullen', 0.4004097878932953),
 ('Chief_Executive_Steve_Ballmer', 0.3993479311466217),
 ('BlackBerry_maker', 0.39889541268348694),
 ('Apple_Nasdaq_AAPL', 0.39581313729286194),
 ('REDMOND_Wash._Microsoft', 0.3908952474594116),
 ('McAfee', 0.38951441645622253)]
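
The `positive`/`negative` query above can be read as vector arithmetic, roughly `Gates + Apple - Jobs`, followed by a nearest-neighbour search by cosine similarity. The sketch below reproduces that idea on hand-made three-dimensional toy vectors. The vectors, the meaning assigned to each axis, and the helper `toy_most_similar` are illustrative assumptions; this is not gensim's implementation and these are not real embeddings.

```python
import math

def unit(vec):
    """Scale a non-zero vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

# Hypothetical toy embeddings; axes: [person-ness, company-ness, Apple (+) vs. Microsoft (-)]
toy_vectors = {
    "Jobs":      [1.0, 0.0, 1.0],
    "Gates":     [1.0, 0.0, -1.0],
    "Apple":     [0.0, 1.0, 1.0],
    "Microsoft": [0.0, 1.0, -1.0],
    "iPhone":    [0.0, 0.5, 1.0],
}

def toy_most_similar(positive, negative, vectors, topn=3):
    """Add unit vectors of `positive`, subtract those of `negative`, rank the rest by cosine."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    query_words = set(positive) | set(negative)  # the query words are not candidates
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in query_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

analogy = toy_most_similar(["Gates", "Apple"], ["Jobs"], toy_vectors)
print(analogy)
```

In this toy space the combined vector points in the same direction as `Microsoft`, which is why it ranks first. With real embeddings the match is far less clean, as the similarity scores below 0.5 in the output above suggest.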

## 3.3 Compute Word2Vec Embeddings

In [35]:
# Notice here we only tokenize and lowercase the tokens:
tokens = [word_tokenize(doc.lower()) for doc in corpus]
query_tokens = word_tokenize(query.lower())

# Function to compute the average word vector for a document or query
def compute_avg_vector(words, model):
    vectors = [model[word] for word in words if word in model]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)  # Return zero vector if no word in model

# Compute average word vectors for each document
doc_vectors = [compute_avg_vector(doc, model) for doc in tokens]

# Compute average word vector for the query
query_vector = compute_avg_vector(query_tokens, model)
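
To make the averaging step concrete, here is a pure-Python sketch of the same logic as `compute_avg_vector`, run on a made-up two-dimensional vocabulary. The name `average_vector`, the `dim` parameter, and the toy vectors are assumptions for illustration only.

```python
def average_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension together
    return [sum(column) / len(known) for column in zip(*known)]

# Hypothetical 2-dimensional embeddings, not taken from any real model
toy_vectors = {"sleep": [1.0, 0.0], "deprivation": [0.0, 1.0]}

query_avg = average_vector(["sleep", "deprivation", "zzz"], toy_vectors, dim=2)
unknown_avg = average_vector(["zzz"], toy_vectors, dim=2)
print(query_avg)    # "zzz" is out of vocabulary, so only two vectors are averaged
print(unknown_avg)  # no known token, so the zero vector is returned
```

Every in-vocabulary token gets the same weight in the average, so frequent but uninformative tokens influence the document vector as much as content words do.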

## 3.4 The Implementation of Information Retrieval System

### Measuring similarity: cosine similarity

In [36]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    magnitude1 = np.linalg.norm(vec1)
    magnitude2 = np.linalg.norm(vec2)

    if magnitude1 == 0 or magnitude2 == 0:
        return 0  # Avoid division by zero
    return dot_product / (magnitude1 * magnitude2)
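
A few hand-checkable cases help build intuition for what this function returns. The sketch below is a pure-Python equivalent written only for illustration (the name `cosine` and the example vectors are assumptions); it shows that cosine similarity ignores vector length and depends only on direction.

```python
import math

def cosine(vec1, vec2):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

same_direction = cosine([1, 2, 3], [2, 4, 6])  # second vector is the first one doubled
orthogonal = cosine([1, 0], [0, 1])            # vectors at a right angle
diagonal = cosine([1, 0], [1, 1])              # vectors 45 degrees apart
zero_case = cosine([0, 0], [1, 1])             # guarded division by zero

print(same_direction, orthogonal, diagonal, zero_case)
```

Because length is normalised away, a long document and a short query can still score highly as long as their averaged vectors point in a similar direction.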

### ✏️ Rank the documents using cosine similarity

In [37]:
# TODO
# Rank documents based on similarity to the query
rankings = []
for idx, doc_vector in enumerate(doc_vectors):
    score = cosine_similarity(doc_vector, query_vector)
    rankings.append((idx + 1, score))

# TODO
# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on Query:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")

Document Rankings based on Query:
Rank 1: Document 9 with score 0.3664294481277466
Rank 2: Document 8 with score 0.35598620772361755
Rank 3: Document 6 with score 0.35288575291633606
Rank 4: Document 3 with score 0.34057316184043884
Rank 5: Document 2 with score 0.3265257775783539
Rank 6: Document 1 with score 0.30051562190055847
Rank 7: Document 5 with score 0.2849697172641754
Rank 8: Document 10 with score 0.27993151545524597
Rank 9: Document 4 with score 0.2621768116950989
Rank 10: Document 7 with score 0.24750055372714996


### Observe the results above and discuss the following:
- How are the results using Word2Vec different from those using TF-IDF?

### What if we train our own Word2Vec model on the corpus?

## 3.5 Train Word2Vec Model from Scratch

In [39]:
# Train Word2Vec on the corpus
model_corpus = Word2Vec(sentences=tokens, vector_size=100, window=5, min_count=1, workers=4)

In [40]:
# Function to compute the average word vector for a document or query
def compute_avg_vector(words, model):
    vectors = [model.wv[word] for word in words if word in model.wv]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)  # Return zero vector if no word in model

# Compute average word vectors for each document
doc_vectors = [compute_avg_vector(doc, model_corpus) for doc in tokens]

# Compute average word vector for the query
query_vector = compute_avg_vector(query_tokens, model_corpus)


In [41]:
def cosine_similarity(vec1, vec2):
    dot_product = np.dot(vec1, vec2)
    magnitude1 = np.linalg.norm(vec1)
    magnitude2 = np.linalg.norm(vec2)

    if magnitude1 == 0 or magnitude2 == 0:
        return 0  # Avoid division by zero
    return dot_product / (magnitude1 * magnitude2)

# Rank documents based on similarity to the query
rankings = []
for idx, doc_vector in enumerate(doc_vectors):
    score = cosine_similarity(doc_vector, query_vector)
    rankings.append((idx + 1, score))

# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on Query:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")

Document Rankings based on Query:
Rank 1: Document 9 with score 0.2207733690738678
Rank 2: Document 5 with score 0.21063056588172913
Rank 3: Document 2 with score 0.18398791551589966
Rank 4: Document 6 with score 0.12987695634365082
Rank 5: Document 7 with score 0.11967609822750092
Rank 6: Document 4 with score 0.10431124269962311
Rank 7: Document 1 with score 0.08538194000720978
Rank 8: Document 3 with score 0.06317424029111862
Rank 9: Document 10 with score 0.059812527149915695
Rank 10: Document 8 with score 0.02246974967420101


## Word2Vec: NDCG, Accuracy, and Precision


In [42]:
from sklearn.metrics import precision_recall_curve, accuracy_score

# Assume documents 1, 2, 5, 6, and 8 are relevant and the rest are not
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# Extract the document indices in ranked order
ranked_docs = [doc_idx for doc_idx, score in rankings]

# Look up the relevance label of each ranked document
ranked_relevancy = [corpus_relevancy_label[doc_idx - 1] for doc_idx in ranked_docs]

# Compute NDCG
def ndcg_at_k(ranked_relevancy, k):
    dcg = 0.0
    for i in range(k):
        if ranked_relevancy[i] == 1:
            dcg += 1.0 / np.log2(i + 2)

    ideal_relevancy = sorted(ranked_relevancy, reverse=True)
    idcg = 0.0
    for i in range(k):
        if ideal_relevancy[i] == 1:
            idcg += 1.0 / np.log2(i + 2)

    return dcg / idcg if idcg > 0 else 0.0

# Compute NDCG@10
ndcg = ndcg_at_k(ranked_relevancy, 10)
print(f"NDCG: {ndcg}")

# Compute Accuracy
predicted_labels = [1 if score > 0 else 0 for _, score in rankings]
accuracy = accuracy_score(corpus_relevancy_label, predicted_labels)
print(f"Accuracy: {accuracy}")

# Compute Precision and Recall
precision, recall, _ = precision_recall_curve(corpus_relevancy_label, [score for _, score in rankings])
print(f"Precision: {precision}")
print(f"Recall: {recall}")


NDCG: 0.7407274048032711
Accuracy: 0.5
Precision: [0.5        0.55555556 0.625      0.57142857 0.66666667 0.6
 0.5        0.66666667 1.         1.         1.        ]
Recall: [1.  1.  1.  0.8 0.8 0.6 0.4 0.4 0.4 0.2 0. ]
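
The relevance labels are indexed by document number, while `rankings` is sorted by score, so the two need to be aligned before rank-based metrics are computed. As a complementary view, the sketch below computes precision@k and recall@k directly from the ranked order. The document order is copied from the self-trained Word2Vec output above; the variable names `precision_at_k` and `recall_at_k` are illustrative assumptions.

```python
# Relevance labels indexed by document number (document 1 is at index 0)
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# Document order copied from the self-trained Word2Vec ranking printed above
ranked_docs = [9, 5, 2, 6, 7, 4, 1, 3, 10, 8]

# Align labels with the ranking: position i holds the label of the document at rank i + 1
ranked_relevancy = [corpus_relevancy_label[doc - 1] for doc in ranked_docs]

total_relevant = sum(corpus_relevancy_label)
precision_at_k, recall_at_k = [], []
hits = 0
for k, rel in enumerate(ranked_relevancy, start=1):
    hits += rel                                # relevant documents seen in the top k
    precision_at_k.append(hits / k)            # share of the top k that is relevant
    recall_at_k.append(hits / total_relevant)  # share of all relevant documents found

for k in (1, 3, 5, 10):
    print(f"P@{k}: {precision_at_k[k - 1]:.3f}  R@{k}: {recall_at_k[k - 1]:.3f}")
```

Reading the metrics at a fixed cutoff k makes it easier to compare retrieval methods whose raw similarity scores live on different scales.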


### Observe the results above and discuss the following:
- How are the results using self-trained Word2Vec different from those using pre-trained Word2Vec?
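
One way to put a number on how much the two rankings differ is to count shared documents in the top k and to compute Spearman's rank correlation. The sketch below does both for the two Word2Vec rankings printed above; the helper names `top_k_overlap` and `spearman_rho` are illustrative assumptions. Because the self-trained model starts from a random initialisation, rerunning the training cell may produce a different ranking and therefore different numbers.

```python
# Rankings copied from the two outputs above (document numbers, best first)
pretrained_ranking = [9, 8, 6, 3, 2, 1, 5, 10, 4, 7]
self_trained_ranking = [9, 5, 2, 6, 7, 4, 1, 3, 10, 8]

def top_k_overlap(ranking_a, ranking_b, k):
    """Number of documents that appear in the top k of both rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))

def spearman_rho(ranking_a, ranking_b):
    """Spearman correlation for two rankings of the same items without ties."""
    n = len(ranking_a)
    position_b = {doc: pos for pos, doc in enumerate(ranking_b)}
    d_squared = sum((pos_a - position_b[doc]) ** 2 for pos_a, doc in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

overlap_at_3 = top_k_overlap(pretrained_ranking, self_trained_ranking, 3)
overlap_at_5 = top_k_overlap(pretrained_ranking, self_trained_ranking, 5)
rho = spearman_rho(pretrained_ranking, self_trained_ranking)

print(f"Top-3 overlap: {overlap_at_3}, top-5 overlap: {overlap_at_5}, Spearman rho: {rho:.3f}")
```

A rho near 1 would mean the two models order the documents almost identically, while a value near 0 means the orderings are largely unrelated.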

# 4. Vector Space Model: BERT
This is not how a BERT model is normally used, but we can see how contextualized embeddings are helpful in matching queries and documents beyond just words.

## 4.1 Import Libraries

In [43]:
import torch
from transformers import BertTokenizer, BertModel

## 4.2 Load Pre-trained BERT Model

In [44]:
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased')

# Function to generate BERT embeddings for a given text
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model_bert(**inputs)
    # The [CLS] token embedding is typically used as the sentence representation
    return outputs.last_hidden_state[:, 0, :]  # Return the embedding for the [CLS] token

## 4.3 Compute BERT Embeddings

In [45]:
# Compute BERT embeddings for the query
query_embedding = get_bert_embedding(query)

# Compute BERT embeddings for each document in the corpus
corpus_embeddings = [get_bert_embedding(doc) for doc in corpus]

## 4.4 The Implementation of Information Retrieval System

### Measuring similarity: cosine similarity

In [46]:
# Function to compute cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    vec1 = vec1.numpy()
    vec2 = vec2.numpy()
    dot_product = np.dot(vec1, vec2.T)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)

    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)

### Rank the documents using cosine similarity

In [47]:
# Rank documents based on similarity to the query
rankings = []
for idx, doc_embedding in enumerate(corpus_embeddings):
    score = cosine_similarity(query_embedding[0], doc_embedding[0])
    rankings.append((idx + 1, score))

# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on BERT embeddings:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")

Document Rankings based on BERT embeddings:
Rank 1: Document 2 with score 0.8103365302085876
Rank 2: Document 6 with score 0.7880857586860657
Rank 3: Document 5 with score 0.7864925861358643
Rank 4: Document 3 with score 0.7857746481895447
Rank 5: Document 1 with score 0.7844794988632202
Rank 6: Document 10 with score 0.775418758392334
Rank 7: Document 7 with score 0.7519571781158447
Rank 8: Document 8 with score 0.7409092783927917
Rank 9: Document 4 with score 0.7397838234901428
Rank 10: Document 9 with score 0.7047690153121948


## BERT: NDCG, Accuracy, and Precision-Recall Computation

In [48]:
import torch
from transformers import BertTokenizer, BertModel
import numpy as np
from sklearn.metrics import precision_recall_curve, accuracy_score
from sklearn.metrics import ndcg_score

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model_bert = BertModel.from_pretrained('bert-base-uncased')

# Function to generate BERT embeddings for a given text
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model_bert(**inputs)
    # The [CLS] token embedding is typically used as the sentence representation
    return outputs.last_hidden_state[:, 0, :]  # Return the embedding for the [CLS] token

# Compute BERT embeddings for the query
query_embedding = get_bert_embedding(query)

# Compute BERT embeddings for each document in the corpus
corpus_embeddings = [get_bert_embedding(doc) for doc in corpus]

# Function to compute cosine similarity between two vectors
def cosine_similarity(vec1, vec2):
    vec1 = vec1.numpy()
    vec2 = vec2.numpy()
    dot_product = np.dot(vec1, vec2.T)
    norm1 = np.linalg.norm(vec1)
    norm2 = np.linalg.norm(vec2)

    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot_product / (norm1 * norm2)

# Rank documents based on similarity to the query
rankings = []
for idx, doc_embedding in enumerate(corpus_embeddings):
    score = cosine_similarity(query_embedding[0], doc_embedding[0])
    rankings.append((idx + 1, score))

# Sort documents by similarity score in descending order
rankings = sorted(rankings, key=lambda x: x[1], reverse=True)

# Print document rankings
print("Document Rankings based on BERT embeddings:")
for rank, (doc_idx, score) in enumerate(rankings, start=1):
    print(f"Rank {rank}: Document {doc_idx} with score {score}")

# Define relevant documents as per the original relevancy label
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # Binary labels for the relevancy

# Extract the ranked relevancy labels
ranked_relevancy = [corpus_relevancy_label[doc_idx - 1] for doc_idx, _ in rankings]

# NDCG
ndcg = ndcg_score([corpus_relevancy_label], [ranked_relevancy])

# Accuracy
y_true = np.array(corpus_relevancy_label)
y_pred = np.array([1 if score > 0 else 0 for _, score in rankings])
accuracy = accuracy_score(y_true, y_pred)

# Precision and Recall (using precision_recall_curve)
precision, recall, _ = precision_recall_curve(y_true, y_pred)

# Print the results
print(f"\nNDCG: {ndcg}")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")


Document Rankings based on BERT embeddings:
Rank 1: Document 2 with score 0.8103365302085876
Rank 2: Document 6 with score 0.7880857586860657
Rank 3: Document 5 with score 0.7864925861358643
Rank 4: Document 3 with score 0.7857746481895447
Rank 5: Document 1 with score 0.7844794988632202
Rank 6: Document 10 with score 0.775418758392334
Rank 7: Document 7 with score 0.7519571781158447
Rank 8: Document 8 with score 0.7409092783927917
Rank 9: Document 4 with score 0.7397838234901428
Rank 10: Document 9 with score 0.7047690153121948

NDCG: 0.9081989035557798
Accuracy: 0.5
Precision: [0.5 1. ]
Recall: [1. 0.]
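
scikit-learn's `ndcg_score(y_true, y_score)` expects the true relevance and the predicted score of each document in the same document order. As a cross-check that does not depend on that alignment, the sketch below recomputes NDCG directly from the printed BERT ranking, using the same binary-gain, log2-discount definition as `ndcg_at_k` in the Word2Vec section. The ranking is copied from the output above, and the helper names `dcg` and `ndcg` are illustrative assumptions. The resulting value can differ from the one printed by the cell, since it depends on how labels and scores are paired.

```python
import math

# Relevance labels indexed by document number (document 1 is at index 0)
corpus_relevancy_label = [1, 1, 0, 0, 1, 1, 0, 1, 0, 0]

# Document order copied from the BERT ranking printed above
bert_ranked_docs = [2, 6, 5, 3, 1, 10, 7, 8, 4, 9]
ranked_relevancy = [corpus_relevancy_label[doc - 1] for doc in bert_ranked_docs]

def dcg(relevancies):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevancies, start=1))

def ndcg(ranked_rel):
    """DCG of the ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg(sorted(ranked_rel, reverse=True))
    return dcg(ranked_rel) / ideal_dcg if ideal_dcg > 0 else 0.0

bert_ndcg = ndcg(ranked_relevancy)
print(f"NDCG@10 (rank-aligned labels): {bert_ndcg:.4f}")
```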


## Summary


1. NDCG (Normalized Discounted Cumulative Gain)

Bigram TF-IDF: 0.9121

BERT: 0.9082

Unigram TF-IDF: 0.9038

Word2Vec: 0.7407

Bigram TF-IDF, BERT, and unigram TF-IDF perform similarly, all at roughly 0.9, while the NDCG of Word2Vec is somewhat lower.

2. Accuracy

Bigram TF-IDF: 1.0

Unigram TF-IDF: 0.6

BERT: 0.5

Word2Vec: 0.5

Bigram TF-IDF reaches an accuracy of 100%.

Unigram TF-IDF has an accuracy of 60%, while BERT and Word2Vec reach only 50%.

3. Precision

Bigram TF-IDF: [0.5, 1.0]

BERT: [0.5, 1.0]

Unigram TF-IDF: [0.5, 0.6667, 1.0, 1.0, 1.0]

Word2Vec: [0.5, 0.5556, 0.625, 0.5714, 0.6667, 0.6, 0.5, 0.6667, 1.0, 1.0, 1.0]

For bigram TF-IDF and BERT, precision starts at 0.5 and then reaches 1.0.

For unigram TF-IDF and Word2Vec, precision fluctuates along the curve, which suggests that some non-relevant documents are mistakenly marked as relevant at lower ranks.

4. Recall

Bigram TF-IDF: [1.0, 0.0]

BERT: [1.0, 0.0]

Unigram TF-IDF: [1.0, 0.4, 0.4, 0.2, 0.0]

Word2Vec: [1.0, 1.0, 1.0, 0.8, 0.8, 0.6, 0.4, 0.4, 0.4, 0.2, 0.0]

For bigram TF-IDF and BERT, recall starts at 1.0 and then drops sharply to 0.0.

Unigram TF-IDF also starts with a recall of 1.0, which then falls step by step to 0.

The recall of Word2Vec changes more smoothly: it is high at first and gradually declines to 0.



### Observe the results above and discuss the following:
- How are the results using contextualized word embeddings (BERT) different from those using Word2Vec?

# Assignment 1

## Part 1: Implement Bigram TF-IDF
Using the same query and corpus, implement your own information retrieval system based on bigram TF-IDF.
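
As a starting point, a bigram is simply a pair of adjacent tokens, so a token list can be turned into bigram "terms" before the usual TF-IDF counting. The sketch below shows only that first step; the helper name `make_bigrams` and the whitespace tokenization are assumptions for illustration, and the TF-IDF weighting itself is left to you.

```python
def make_bigrams(tokens):
    """Pair each token with its successor: [a, b, c] -> ['a b', 'b c']."""
    return [f"{first} {second}" for first, second in zip(tokens, tokens[1:])]

query_bigrams = make_bigrams("sleep deprivation".split())
doc_bigrams = make_bigrams("sleep has become a luxury".split())
single_token = make_bigrams(["sleep"])  # a single token yields no bigram

print(query_bigrams)
print(doc_bigrams)
print(single_token)
```

Note that the query yields exactly one bigram, so a document only matches if it contains the two words next to each other after preprocessing.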

## Part 2: Analyze The Results from TF-IDF, Bigram TF-IDF, Word2Vec, and BERT.
Do they successfully retrieve the relevant documents? Compare these four methods using **quantitative** (the metrics we introduced in W3) and **qualitative** (case study) analysis.
You can write your own code to compute the quantitative evaluation metrics, or use packages such as scikit-learn.

## 💻 Assignment Submission 💻
Write your code and display the results in this Jupyter Notebook. Then, export it as an HTML file and submit both the Jupyter Notebook and the HTML file to Cyber University. </br>
**Please ensure that the code is executed and the outputs are visible when exporting the HTML file.**