# IRWA Project Part 2

|Name | Email | UPF uNum |
| --- | --- | --- |
| Clara Pena | clara.pena01@estudiant.upf.edu | u186416 |
| Yuyan Wang | yuyan.wang01@estudiant.upf.edu | u199907 |

## Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np
from collections import defaultdict
from array import array
from nltk import PorterStemmer, word_tokenize, SnowballStemmer
from nltk.corpus import stopwords
from collections import Counter
import math
import numpy.linalg as la
import string
import textwrap
import re
import warnings

In [2]:
df = pd.read_csv("./data/processed_data.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117405 entries, 0 to 117404
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        117405 non-null  int64 
 1   content   117404 non-null  object
 2   date      117405 non-null  object
 3   hashtags  116794 non-null  object
 4   likes     117405 non-null  int64 
 5   retweets  117405 non-null  int64 
 6   url       117405 non-null  object
 7   language  117405 non-null  object
 8   docId     48427 non-null   object
dtypes: int64(3), object(6)
memory usage: 8.1+ MB


We'll be working with those tweets that have a document ID associated with them; that is, the value in the docID column should not be NaN. This is basically taking those tweets in English. Besides that, for simplicity, we're not using the column language anymore, so it can be dropped.

In [3]:
tweets_df = df.dropna(subset=['docId']).drop(columns=['language'])

In [4]:
tweets_df.head(10)

Unnamed: 0,id,content,date,hashtags,likes,retweets,url,docId
1,1364506167226032128,watch full video farmersprotest nofarmersnofoo...,2021-02-24T09:23:16+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/13645...,doc_2
6,1364505991887347714,watch full video farmersprotest nofarmersnofood,2021-02-24T09:22:34+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/13645...,doc_3
9,1364505813834989568,watch full video farmersprotest nofarmersnofood,2021-02-24T09:21:51+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/13645...,doc_4
10,1364505749359976448,anoth farmer malkeet singh mahilpur hoshiarpur...,2021-02-24T09:21:36+00:00,#FarmersProtest,3,3,https://twitter.com/ShariaActivist/status/1364...,doc_5
14,1364505676375076867,hi tell boss modidontsellfarm thank farmerspro...,2021-02-24T09:21:19+00:00,#ModiDontSellFarmers #FarmersProtest,0,0,https://twitter.com/KaurDosanjh1979/status/136...,doc_6
16,1364505511073300481,watch full video farmersprotest nofarmersnofood,2021-02-24T09:20:39+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/13645...,doc_7
18,1364505452134817795,despit increas tax petroldiesel must increas t...,2021-02-24T09:20:25+00:00,#taxes #petrolDiesel #taxes #alcohol #cigarett...,1,1,https://twitter.com/Satende09192805/status/136...,doc_8
20,1364505443997937669,mockeri menac sedit charg farmersprotest,2021-02-24T09:20:23+00:00,#sedition #FarmersProtest,0,0,https://twitter.com/algo_121/status/1364505443...,doc_9
25,1364505314586951680,watch full video farmersprotest nofarmersnofood,2021-02-24T09:19:52+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/13645...,doc_10
26,1364505255946379268,left hear modi lol farmersprotest,2021-02-24T09:19:38+00:00,#FarmersProtest,1,0,https://twitter.com/kdhanjal12/status/13645052...,doc_11


In [5]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48427 entries, 1 to 117404
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        48427 non-null  int64 
 1   content   48427 non-null  object
 2   date      48427 non-null  object
 3   hashtags  48153 non-null  object
 4   likes     48427 non-null  int64 
 5   retweets  48427 non-null  int64 
 6   url       48427 non-null  object
 7   docId     48427 non-null  object
dtypes: int64(3), object(5)
memory usage: 3.3+ MB


## Indexing

### Build inverted index

In [6]:
def create_index(df):
    index = defaultdict(list)
    
    for idx, row in df.iterrows():
        doc_id = row['docId']
        terms = row['content'].split()
        
        current_page_index = {}
        for position, term in enumerate(terms):
            if term in current_page_index:
                current_page_index[term][1].append(position)
            else:
                current_page_index[term] = [doc_id, array('I', [position])]  # 'I' for array of unsigned ints.

        # Merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

    return index

In [7]:
index = create_index(tweets_df)

In [8]:
print(len(index))

30477


In [9]:
def build_terms(line):
    # stemmer = PorterStemmer()
    stemmer = SnowballStemmer("english")
    stop_words = set(stopwords.words("english"))
    line = line.lower()

    # Handle contractions by removing possessive endings and common contractions
    line = re.sub(r"\b(\w+)'s\b", r'\1', line)  # Changes "people's" to "people"
    line = re.sub(r"\b(\w+)n't\b", r'\1 not', line)  # Changes "isn't" to "is not"
    line = re.sub(r"\b(\w+)'ll\b", r'\1 will', line)  # Changes "I'll" to "I will"
    line = re.sub(r"\b(\w+)'d\b", r'\1 would', line)  # Changes "I'd" to "I would"
    line = re.sub(r"\b(\w+)'re\b", r'\1 are', line)  # Changes "you're" to "you are"
    line = re.sub(r"\b(\w+)'ve\b", r'\1 have', line)  # Changes "I've" to "I have"

    line = line.split()

    table = str.maketrans('', '', string.punctuation)
    line = [w.translate(table) for w in line]
    line = [w for w in line if w not in stop_words]
    line = [stemmer.stem(w) for w in line] 
    return line

In [10]:
def search(query, index):
    query = build_terms(query)  # Normalize and tokenize the query.
    docs = None  # Initialize docs as None to handle the intersection.

    for term in query:
        try:
            # Extract the document IDs for the term.
            term_docs = set([posting[0] for posting in index[term]])

            if docs is None:
                # Initialize docs with the set of document IDs for the first term.
                docs = term_docs
            else:
                # Intersect the sets of document IDs.
                docs = docs.intersection(term_docs)
        except KeyError:
            # If the term is not in the index, return an empty list because no documents can satisfy the query.
            return []

    if docs is None:
        return []  # If no terms were processed, return an empty list.
    else:
        return list(docs)  # Convert the set to a list before returning.

### Propose test queries

In [11]:
# Calculate the top n terms in the DataFrame for the specified column.
def get_top_terms(df, column='content', top_n=10):
    text = ' '.join(df[column].dropna())  # Combine all text and convert to lower case.
    words = text.split()
    # Get a count of all words
    word_count = Counter(words)
    # Return the most common words
    return word_count.most_common(top_n)

top_terms = get_top_terms(tweets_df, column='content', top_n=20)
print(top_terms)

[('farmersprotest', 50272), ('farmer', 17421), ('india', 7724), ('support', 6004), ('protest', 4787), ('amp', 4728), ('right', 3594), ('peopl', 3526), ('modi', 3113), ('indian', 3008), ('govern', 2753), ('bjp', 2649), ('law', 2570), ('releasedetainedfarm', 2432), ('govt', 2338), ('stand', 2207), ('farmersmakeindia', 2133), ('indiabeingsilenc', 2133), ('thank', 2129), ('farm', 2066)]


1.	“India protest”: This query aligns with the high frequency of terms like ‘india’ (7724 occurrences) and ‘protest’ (4787 occurrences). It targets the significant discussion around protests in India, capturing a broad yet significant topic within your dataset. This is crucial for understanding general sentiments or events related to protests in the region.
2.	“support farmers”: Given the prevalence of terms such as ‘farmersprotest’ (50272 occurrences) and ‘support’ (6004 occurrences), this query is highly relevant. It specifically addresses the widespread discourse on supporting farmers, likely related to agricultural policies or farmer welfare, which seems to be a prominent issue in your data.
3.	“Modi shame”: This query is crafted around the high occurrence of ‘modi’ (3113 occurrences). Including a sentiment or descriptive term like ‘shame’ might target specific discussions or criticisms related to policies or actions associated with Modi, the Prime Minister of India, offering insights into public sentiment regarding his administration.
4.	“BJP party”: With ‘bjp’ being mentioned 2649 times, focusing on the political party directly allows for an analysis of content specifically related to the Bharatiya Janata Party. This could reveal discussions on political activities, policies, or public opinions directly connected to the party’s actions and governance.
5.	“human rights violated”: Although ‘right’ appears 3594 times, coupling it with ‘violated’ expands the context to human rights issues. This query is intended to explore discussions or reports on human rights violations, a topic of critical importance that may encompass various aspects of social and political discourse.

In [12]:
def simulate_search(queries, index):
    for query in queries:
        docs = search(query, index)
        top = 10  # Number of results to display
        num_results = len(docs)
        
        print("\n======================\nSample of {} results out of {} for the searched query '{}':\n".format(min(top, num_results), num_results, query))
        for d_id in docs[:top]:
            print("docId = {}".format(d_id))

# List of queries to be processed
queries = [
    "India protest",
    "support farmers",
    "Modi shame",
    "BJP party",
    "human rights violated"
]

simulate_search(queries, index)


Sample of 10 results out of 951 for the searched query 'India protest':

docId = doc_45342
docId = doc_11845
docId = doc_46574
docId = doc_46761
docId = doc_44458
docId = doc_27337
docId = doc_40146
docId = doc_27301
docId = doc_27472
docId = doc_7237

Sample of 10 results out of 3197 for the searched query 'support farmers':

docId = doc_7211
docId = doc_7052
docId = doc_42028
docId = doc_47335
docId = doc_17669
docId = doc_35974
docId = doc_39136
docId = doc_11205
docId = doc_17527
docId = doc_43582

Sample of 10 results out of 193 for the searched query 'Modi shame':

docId = doc_7101
docId = doc_20968
docId = doc_29421
docId = doc_31277
docId = doc_37129
docId = doc_36097
docId = doc_47021
docId = doc_13115
docId = doc_8528
docId = doc_33918

Sample of 10 results out of 128 for the searched query 'BJP party':

docId = doc_24118
docId = doc_4669
docId = doc_33768
docId = doc_46569
docId = doc_2917
docId = doc_20542
docId = doc_10013
docId = doc_2934
docId = doc_40858
docId = doc_23

### Rank your results

In [13]:
def create_index_tfidf(dataframe):
    # num_documents = len(df)
    num_documents = dataframe['docId'].nunique()
    index = defaultdict(list)
    # tf = defaultdict(dict)  # Normalized term frequencies of terms in documents
    tf = defaultdict(list)
    df = defaultdict(int)  # Document frequencies of terms
    idf = defaultdict(float)

    for idx, row in dataframe.iterrows():
        doc_id = row['docId']
        terms = row['content'].split()
        
        current_page_index = {}

        for position, term in enumerate(terms):
            if term in current_page_index:
                # Append the position to the corresponding list in the array
                current_page_index[term][1].append(position)
            else:
                # Initialize the list with page_id and a new array
                current_page_index[term] = [doc_id, array('I', [position])]

        # Calculate the norm for the terms in the document
        norm = math.sqrt(sum(len(positions[1])**2 for positions in current_page_index.values()))

        # calculate the tf(dividing the term frequency by the above computed norm) and df weights
        for term, posting in current_page_index.items():
            # append the tf for current term (tf = term frequency in current doc/norm)
            tf[term].append(np.round(len(posting[1])/norm,4)) ## SEE formula (1) above
            #increment the document frequency of current term (number of documents containing the current term)
            df[term] += 1 # increment DF for current term
        
        # Merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

    # Calculate IDF for each term
    for term in df:
        idf[term] = math.log(num_documents / (1 + df[term]))  # Smoothing by adding 1 to denominator

    return index, tf, df, idf

In [14]:
index, tf, df, idf = create_index_tfidf(tweets_df)

In [15]:
def rank_documents(terms, docs, index, idf, tf):
    doc_vectors = defaultdict(lambda: np.zeros(len(terms)))
    query_vector = np.zeros(len(terms))
    query_terms_count = Counter(terms)
    query_norm = la.norm(list(query_terms_count.values()))

    # Compute the tf-idf for the query vector
    for term_index, term in enumerate(terms):
        if term in idf:
            query_vector[term_index] = query_terms_count[term] / query_norm * idf[term]
            if term not in index:
                continue
    
            for doc_index, (doc, postings) in enumerate(index[term]):
                if doc in docs:
                    doc_vectors[doc][term_index] = tf[term][doc_index] * idf[term]  # T
            
    # Calculate the score of each doc using cosine similarity (dot product of normalized vectors)
    doc_scores = [[np.dot(cur_doc_vec, query_vector), doc] for doc, cur_doc_vec in doc_vectors.items()]
    doc_scores.sort(reverse=True, key=lambda x: x[0])
    # print(doc_scores)
    result_docs = [x[1] for x in doc_scores]

    if len(result_docs) == 0:
        print("No results found, try again")
    else:
        return result_docs

In [16]:
def search_tf_idf(query, index, idf, tf):
    query = build_terms(query)
    # print(query)
    docs = None  # Initialize to None to handle the first term's document set initialization

    for term in query:
        if term in index:
            term_docs = set([posting[0] for posting in index[term]])  # Collect all document IDs containing this term
            if docs is None:
                docs = term_docs
            else:
                docs = docs.intersection(term_docs)  # Intersection with the accumulated set of documents
        else:
            return []  # If any term is not found, the intersection is empty

    if docs is None:
        return []  # No terms found, return empty list

    docs = list(docs)  # Convert set to list if necessary
    ranked_docs = rank_documents(query, docs, index, idf, tf)  # Rank the documents based on the relevance
    return ranked_docs

In [17]:
def simulate_search_tf_idf(queries, index, idf, tf):
    for query in queries:
        ranked_docs = search_tf_idf(query, index, idf, tf)
        top = 10  # Number of results to display
        num_results = len(ranked_docs)
        
        print("\n======================\nSample of {} results out of {} for the searched query '{}':\n".format(min(top, num_results), num_results, query))
        for d_id in ranked_docs[:top]:
            print("docId = {}".format(d_id))

# List of queries to be processed
queries = [
    "India protest",
    "support farmers",
    "Modi shame",
    "BJP party",
    "human rights violated"
]

In [18]:
simulate_search_tf_idf(queries, index, idf, tf)


Sample of 10 results out of 951 for the searched query 'India protest':

docId = doc_17093
docId = doc_17097
docId = doc_38842
docId = doc_40320
docId = doc_32847
docId = doc_445
docId = doc_27693
docId = doc_46574
docId = doc_13552
docId = doc_29037

Sample of 10 results out of 3197 for the searched query 'support farmers':

docId = doc_38864
docId = doc_43187
docId = doc_31878
docId = doc_47382
docId = doc_47396
docId = doc_47423
docId = doc_30210
docId = doc_24459
docId = doc_3699
docId = doc_4430

Sample of 10 results out of 193 for the searched query 'Modi shame':

docId = doc_7810
docId = doc_37129
docId = doc_38556
docId = doc_39102
docId = doc_41693
docId = doc_42459
docId = doc_42461
docId = doc_42463
docId = doc_45856
docId = doc_45858

Sample of 10 results out of 128 for the searched query 'BJP party':

docId = doc_39703
docId = doc_14912
docId = doc_35686
docId = doc_19862
docId = doc_27074
docId = doc_43419
docId = doc_29032
docId = doc_38429
docId = doc_42112
docId = doc

## Evaluation

In [19]:
evaluation_gt = pd.read_csv('./data/evaluation_gt.csv', sep=';')
df_eva = pd.merge(tweets_df, evaluation_gt, on='docId', how='left')

pd.set_option('display.max_colwidth', None)

query_1_relevant = (df_eva[(df_eva['query_id'] == 1) & (df_eva['label'] == 1)])['docId'].unique()
query_1_not_relevant = (df_eva[(df_eva['query_id'] == 1) & (df_eva['label'] == 0)])['docId'].unique()

query_2_relevant = (df_eva[(df_eva['query_id'] == 2) & (df_eva['label'] == 1)])['docId'].unique()
query_2_not_relevant = (df_eva[(df_eva['query_id'] == 2) & (df_eva['label'] == 0)])['docId'].unique()

In [20]:
# Keep in mind that for the evaluation part we will be using only the subset of documents that are being defined in the evaluation_gt.csv
df_subset_documents = df_eva[df_eva['query_id'].notnull()]
subset_documents = df_subset_documents['docId'].unique()
print(len(subset_documents))

index, tf, df, idf = create_index_tfidf(df_subset_documents)
query_1_results = search_tf_idf("people's rights", index, idf, tf)
query_2_results = search_tf_idf("Indian government", index, idf, tf)

60


In [21]:
def print_wrapped(title, data):
    wrapper = textwrap.TextWrapper(width=90)
    wrapped_text = wrapper.fill(str(data))
    print(f"{title} {wrapped_text}\n")

print(f"Ground Truth Files Query 1 (subset): {query_1_relevant}\n")
print_wrapped("Our Obtained Results Query 1:", query_1_results)

print(f"Ground Truth Files Query 2 (subset): {query_2_relevant}\n")
print_wrapped("Our Obtained Results Query 2:", query_2_results)

Ground Truth Files Query 1 (subset): ['doc_1047' 'doc_2100' 'doc_3287' 'doc_3474' 'doc_3570' 'doc_4053'
 'doc_5480' 'doc_5512' 'doc_5751' 'doc_6477' 'doc_8066' 'doc_9696'
 'doc_9850' 'doc_9937' 'doc_10048']

Our Obtained Results Query 1: ['doc_4053', 'doc_9850', 'doc_9696', 'doc_6477', 'doc_2100', 'doc_2732', 'doc_8819',
'doc_43341', 'doc_8066', 'doc_5480', 'doc_3474', 'doc_43540', 'doc_5751', 'doc_10048',
'doc_5512', 'doc_1047', 'doc_3287', 'doc_3570', 'doc_9937']

Ground Truth Files Query 2 (subset): ['doc_103' 'doc_1566' 'doc_1651' 'doc_1666' 'doc_1785' 'doc_2528'
 'doc_2653' 'doc_3005' 'doc_3076' 'doc_3116' 'doc_3646' 'doc_3682'
 'doc_3927' 'doc_4176' 'doc_4304']

Our Obtained Results Query 2: ['doc_3116', 'doc_103', 'doc_1566', 'doc_3076', 'doc_3682', 'doc_3646', 'doc_2653',
'doc_3927', 'doc_1666', 'doc_3005', 'doc_1651', 'doc_4304', 'doc_1785', 'doc_4176',
'doc_2528']



In [22]:
df_subset_documents_query_1 = df_eva[df_eva['query_id'] == 1]
print(len(df_subset_documents_query_1))
index, tf, df, idf = create_index_tfidf(df_subset_documents_query_1)
query_1_results = search_tf_idf("people's rights", index, idf, tf)

print(f"Ground Truth Files Query 1 (subset): {query_1_relevant}\n")
print_wrapped("Our Obtained Results Query 1:", query_1_results)

30
Ground Truth Files Query 1 (subset): ['doc_1047' 'doc_2100' 'doc_3287' 'doc_3474' 'doc_3570' 'doc_4053'
 'doc_5480' 'doc_5512' 'doc_5751' 'doc_6477' 'doc_8066' 'doc_9696'
 'doc_9850' 'doc_9937' 'doc_10048']

Our Obtained Results Query 1: ['doc_4053', 'doc_9850', 'doc_9696', 'doc_6477', 'doc_2732', 'doc_8819', 'doc_8066',
'doc_5480', 'doc_2100', 'doc_5751', 'doc_10048', 'doc_5512', 'doc_3474', 'doc_1047',
'doc_3287', 'doc_3570', 'doc_9937']



In [23]:
df_subset_documents_query_2 = df_eva[df_eva['query_id'] == 2]
print(len(df_subset_documents_query_2))
index, tf, df, idf = create_index_tfidf(df_subset_documents_query_2)
query_2_results = search_tf_idf("Indian government", index, idf, tf)

print(f"Ground Truth Files Query 2 (subset): {query_2_relevant}\n")
print_wrapped("Our Obtained Results Query 2:", query_2_results)

30
Ground Truth Files Query 2 (subset): ['doc_103' 'doc_1566' 'doc_1651' 'doc_1666' 'doc_1785' 'doc_2528'
 'doc_2653' 'doc_3005' 'doc_3076' 'doc_3116' 'doc_3646' 'doc_3682'
 'doc_3927' 'doc_4176' 'doc_4304']

Our Obtained Results Query 2: ['doc_3116', 'doc_103', 'doc_3076', 'doc_3682', 'doc_3646', 'doc_1566', 'doc_2653',
'doc_1666', 'doc_3005', 'doc_1651', 'doc_4304', 'doc_1785', 'doc_4176', 'doc_3927',
'doc_2528']



In [24]:
personalized_eva_gt = pd.read_csv('./data/personalized_evaluation_gt.csv', sep=';')
df_personalized_eva = pd.merge(tweets_df, personalized_eva_gt, on='docId', how='left')
df_personalized_eva = df_personalized_eva[df_personalized_eva['query_id'].notnull()]
df_personalized_eva.head(2)

Unnamed: 0,id,content,date,hashtags,likes,retweets,url,docId,query_id,label
9,1364505255946379268,left hear modi lol farmersprotest,2021-02-24T09:19:38+00:00,#FarmersProtest,1,0,https://twitter.com/kdhanjal12/status/1364505255946379268,doc_11,3.0,0.0
15,1364504281618001921,know tiger wood accid what go ten thousand indian farmer protest farmersprotest,2021-02-24T09:15:46+00:00,#FarmersProtest,0,0,https://twitter.com/GregMitchell62/status/1364504281618001921,doc_17,1.0,1.0


In [25]:
# Dictionaries to hold relevant documents and results
personalized_query_text = {}
personalized_query_relevant = {}
personalized_query_results = {}

query_texts = ["India protest", "support farmers", "Modi shame", "BJP party", "human rights violated"]

# Populate the dictionaries
for i, query_text in enumerate(query_texts, start=1):
    personalized_query_text[i] = query_texts[i-1]
    df_query = df_personalized_eva[df_personalized_eva['query_id'] == i]
    personalized_query_relevant[i] = df_query[df_query['label'] == 1]['docId'].unique()
    index, tf, df, idf = create_index_tfidf(df_query)
    personalized_query_results[i] = search_tf_idf(query_text, index, idf, tf)

### Precision@K (P@K)

Keep in mind here that we are computing the Binary Relevance.

In [26]:
def precision_at_k(ground_truth, results, K=10):
    top_k_results = results[:K]
    # Calculate the number of relevant documents in the top K results
    relevant_documents = [doc for doc in top_k_results if doc in ground_truth]
    precision = len(relevant_documents) / K if K > 0 else 0
    return precision

In [27]:
K_values = list(sorted(set([5, 10, 15, len(query_1_results)])))
for K in K_values:
    precision = precision_at_k(query_1_relevant, query_1_results, K)
    print(f"Query 1 Precision@{K}: {precision:.4f}")

Query 1 Precision@5: 0.8000
Query 1 Precision@10: 0.8000
Query 1 Precision@15: 0.8667
Query 1 Precision@17: 0.8824


In [28]:
K_values = list(sorted(set([5, 10, 15, len(query_2_results)])))
for K in K_values:
    precision = precision_at_k(query_2_relevant, query_2_results, K)
    print(f"Query 2 Precision@{K}: {precision:.4f}")

# TODO: Check following markdown text

Query 2 Precision@5: 1.0000
Query 2 Precision@10: 1.0000
Query 2 Precision@15: 1.0000


We consider that the precision should be computed for each query separately. Besides that, since we're analyzing binary relevance, it makes more sense to take as final metric the precision done at level that is the actual length of the retrieved documents.

In [29]:
for i, query_text in enumerate(query_texts, start=1):
    # Retrieve relevant documents and results from the dictionaries
    relevant_docs = personalized_query_relevant[i]
    results = personalized_query_results[i]
    
    K_values = list(sorted(set([5, 10, 15, len(results)])))
    
    for K in K_values:
        precision = precision_at_k(relevant_docs, results, K)
        print(f"Personalized Query {i} - {query_text:<25} - Precision@{K:<3}: {precision:.4f}")
    print()


Personalized Query 1 - India protest             - Precision@5  : 1.0000
Personalized Query 1 - India protest             - Precision@10 : 0.5000
Personalized Query 1 - India protest             - Precision@15 : 0.3333

Personalized Query 2 - support farmers           - Precision@5  : 1.0000
Personalized Query 2 - support farmers           - Precision@10 : 1.0000
Personalized Query 2 - support farmers           - Precision@15 : 0.6667

Personalized Query 3 - Modi shame                - Precision@5  : 1.0000
Personalized Query 3 - Modi shame                - Precision@10 : 1.0000
Personalized Query 3 - Modi shame                - Precision@15 : 0.6667

Personalized Query 4 - BJP party                 - Precision@5  : 1.0000
Personalized Query 4 - BJP party                 - Precision@9  : 1.0000
Personalized Query 4 - BJP party                 - Precision@10 : 0.9000
Personalized Query 4 - BJP party                 - Precision@15 : 0.6000

Personalized Query 5 - human rights violated   

### Recall@K (R@K)

In [30]:
def recall_at_k(ground_truth, results, K=10):
    if K > len(results): K = len(results)
    top_k_results = results[:K]
    relevant_documents_retrieved = sum(1 for doc in top_k_results if doc in ground_truth)
    total_relevant_documents = len(ground_truth)
    if total_relevant_documents == 0: return 0
    recall = relevant_documents_retrieved / total_relevant_documents
    return recall

In [31]:
K_values = list(sorted(set([5, 10, 15, len(query_1_results)])))
for K in K_values:
    recall_value = recall_at_k(query_1_relevant, query_1_results, K)
    print(f"Query 1 Recall@{K}: {recall_value:.4f}")

Query 1 Recall@5: 0.2667
Query 1 Recall@10: 0.5333
Query 1 Recall@15: 0.8667
Query 1 Recall@17: 1.0000


In [32]:
K_values = list(sorted(set([5, 10, 15, len(query_2_results)])))
for K in K_values:
    recall_value = recall_at_k(query_2_relevant, query_2_results, K)
    print(f"Query 2 Recall@{K}: {recall_value:.4f}")

Query 2 Recall@5: 0.3333
Query 2 Recall@10: 0.6667
Query 2 Recall@15: 1.0000


In [33]:
for i, query_text in enumerate(query_texts, start=1):
    relevant_docs = personalized_query_relevant[i]
    results = personalized_query_results[i]
    
    K_values = list(sorted(set([5, 10, 15, len(results)])))
    
    for K in K_values:
        recall_value = recall_at_k(relevant_docs, results, K)
        print(f"Personalized Query {i} - {query_text:<25} - Recall@{K:<3}: {recall_value:.4f}")
    print()

Personalized Query 1 - India protest             - Recall@5  : 0.5000
Personalized Query 1 - India protest             - Recall@10 : 0.5000
Personalized Query 1 - India protest             - Recall@15 : 0.5000

Personalized Query 2 - support farmers           - Recall@5  : 0.5000
Personalized Query 2 - support farmers           - Recall@10 : 1.0000
Personalized Query 2 - support farmers           - Recall@15 : 1.0000

Personalized Query 3 - Modi shame                - Recall@5  : 0.5000
Personalized Query 3 - Modi shame                - Recall@10 : 1.0000
Personalized Query 3 - Modi shame                - Recall@15 : 1.0000

Personalized Query 4 - BJP party                 - Recall@5  : 0.5556
Personalized Query 4 - BJP party                 - Recall@9  : 1.0000
Personalized Query 4 - BJP party                 - Recall@10 : 1.0000
Personalized Query 4 - BJP party                 - Recall@15 : 1.0000

Personalized Query 5 - human rights violated     - Recall@5  : 0.5000
Personalized Que

### Average Precision@K (P@K)

In [34]:
def average_precision_at_k(ground_truth, results, K=None):
    if K is None: K = len(results)

    ground_truth_set = set(ground_truth)
    relevant_documents_retrieved = 0
    cumulative_precision = 0.0

    # Iterate over the list of results up to K
    for i, doc_id in enumerate(results[:K]):
        if doc_id in ground_truth_set:
            relevant_documents_retrieved += 1
            precision_at_i = relevant_documents_retrieved / (i + 1)
            cumulative_precision += precision_at_i

    # Calculate average precision
    total_relevant = len(ground_truth_set)
    if total_relevant > 0:
        average_precision = cumulative_precision / total_relevant
    else:
        average_precision = 0

    return average_precision

In [35]:
K_values = list(sorted(set([5, 10, 15, len(query_1_results)])))
for K in K_values:
    precision = average_precision_at_k(query_1_relevant, query_1_results, K)
    print(f"Query 1 Average Precision@{K}: {precision:.4f}")

Query 1 Average Precision@5: 0.2667
Query 1 Average Precision@10: 0.4695
Query 1 Average Precision@15: 0.7509
Query 1 Average Precision@17: 0.8681


In [36]:
K_values = list(sorted(set([5, 10, 15, len(query_2_results)])))
for K in K_values:
    precision = average_precision_at_k(query_2_relevant, query_2_results, K)
    print(f"Query 2 Average Precision@{K}: {precision:.4f}")

Query 2 Average Precision@5: 0.3333
Query 2 Average Precision@10: 0.6667
Query 2 Average Precision@15: 1.0000


In [37]:
# TODO: Check if results make sense

for i, query_text in enumerate(query_texts, start=1):
    relevant_docs = personalized_query_relevant[i]
    results = personalized_query_results[i]
    
    K_values = list(sorted(set([5, 10, 15, len(results)])))
    for K in K_values:
        precision = average_precision_at_k(relevant_docs, results, K)
        print(f"Personalized Query {i} - {query_text:<25} - Average Precision@{K:<3}: {precision:.4f}")
    print()

Personalized Query 1 - India protest             - Average Precision@5  : 0.5000
Personalized Query 1 - India protest             - Average Precision@10 : 0.5000
Personalized Query 1 - India protest             - Average Precision@15 : 0.5000

Personalized Query 2 - support farmers           - Average Precision@5  : 0.5000
Personalized Query 2 - support farmers           - Average Precision@10 : 1.0000
Personalized Query 2 - support farmers           - Average Precision@15 : 1.0000

Personalized Query 3 - Modi shame                - Average Precision@5  : 0.5000
Personalized Query 3 - Modi shame                - Average Precision@10 : 1.0000
Personalized Query 3 - Modi shame                - Average Precision@15 : 1.0000

Personalized Query 4 - BJP party                 - Average Precision@5  : 0.5556
Personalized Query 4 - BJP party                 - Average Precision@9  : 1.0000
Personalized Query 4 - BJP party                 - Average Precision@10 : 1.0000
Personalized Query 4 - BJ

### F1-Score@K

In [38]:
def f1_score_at_k(ground_truth, results, K=None):
    if K is None: K = len(results)

    ground_truth_set = set(ground_truth)
    relevant_documents_retrieved = 0
    results_considered = results[:K]

    # We can be also using defined precision and recall at k functions
    # Compute precision at K
    for doc_id in results_considered:
        if doc_id in ground_truth_set: relevant_documents_retrieved += 1
    precision = relevant_documents_retrieved / len(results_considered) if results_considered else 0

    # Compute recall at K
    total_relevant = len(ground_truth_set)
    recall = relevant_documents_retrieved / total_relevant if total_relevant > 0 else 0

    # Calculate F1 score
    if precision + recall == 0: return 0
    f1 = 2 * (precision * recall) / (precision + recall)

    return f1

In [39]:
K_values = list(sorted(set([5, 10, 15, len(query_1_results)])))
for K in K_values:
    f1 = f1_score_at_k(query_1_relevant, query_1_results, K)
    print(f"Query 2 F1Score@{K}: {f1:.4f}")

Query 2 F1Score@5: 0.4000
Query 2 F1Score@10: 0.6400
Query 2 F1Score@15: 0.8667
Query 2 F1Score@17: 0.9375


In [40]:
K_values = list(sorted(set([5, 10, 15, len(query_2_results)])))
for K in K_values:
    f1 = f1_score_at_k(query_2_relevant, query_2_results, K)
    print(f"Query 2 F1Score@{K}: {f1:.4f}")

Query 2 F1Score@5: 0.5000
Query 2 F1Score@10: 0.8000
Query 2 F1Score@15: 1.0000


In [41]:
for i, query_text in enumerate(query_texts, start=1):
    relevant_docs = personalized_query_relevant[i]
    results = personalized_query_results[i]
    
    K_values = list(sorted(set([5, 10, 15, len(results)])))
    for K in K_values:
        f1 = f1_score_at_k(relevant_docs, results, K)
        print(f"Personalized Query {i} - {query_text:<25} - F1Score@{K:<3}: {f1:.4f}")
    print()

Personalized Query 1 - India protest             - F1Score@5  : 0.6667
Personalized Query 1 - India protest             - F1Score@10 : 0.6667
Personalized Query 1 - India protest             - F1Score@15 : 0.6667

Personalized Query 2 - support farmers           - F1Score@5  : 0.6667
Personalized Query 2 - support farmers           - F1Score@10 : 1.0000
Personalized Query 2 - support farmers           - F1Score@15 : 1.0000

Personalized Query 3 - Modi shame                - F1Score@5  : 0.6667
Personalized Query 3 - Modi shame                - F1Score@10 : 1.0000
Personalized Query 3 - Modi shame                - F1Score@15 : 1.0000

Personalized Query 4 - BJP party                 - F1Score@5  : 0.7143
Personalized Query 4 - BJP party                 - F1Score@9  : 1.0000
Personalized Query 4 - BJP party                 - F1Score@10 : 1.0000
Personalized Query 4 - BJP party                 - F1Score@15 : 1.0000

Personalized Query 5 - human rights violated     - F1Score@5  : 0.6667
Pe

### Mean Average Precision (MAP)

In [42]:
def mean_average_precision_at_k(queries_ground_truth, queries_results, K=10):
    ap_scores = []
    for ground_truth, results in zip(queries_ground_truth, queries_results):
        ap = average_precision_at_k(ground_truth, results, K=K)
        ap_scores.append(ap)
    
    if ap_scores: return sum(ap_scores) / len(ap_scores)
    return 0

In [43]:
queries_ground_truth = (query_1_relevant, query_2_relevant)
queries_results = (query_1_results, query_2_results)

In [44]:
K_values = list(sorted(set([5, 10, 15, min(len(query_2_results), len(query_2_results))])))
for K in K_values:
    precision = mean_average_precision_at_k(queries_ground_truth, queries_results, K)
    print(f"Query 1 & 2 MAP@{K}: {precision:.4f}")

Query 1 & 2 MAP@5: 0.3000
Query 1 & 2 MAP@10: 0.5681
Query 1 & 2 MAP@15: 0.8755


In [45]:
K_values = list(sorted(set([5, 10, 15])))
for K in K_values:
    precision = mean_average_precision_at_k(list(personalized_query_relevant.values()), list(personalized_query_results.values()), K)
    print(f"Personalized Queries MAP@{K}: {precision:.4f}")

Personalized Queries MAP@5: 0.5111
Personalized Queries MAP@10: 0.9000
Personalized Queries MAP@15: 0.9000


### Mean Reciprocal Rank (MRR)

In [46]:
def mean_reciprocal_rank(queries_results, queries_ground_truth):
    reciprocal_ranks = []
    
    for results, ground_truth in zip(queries_results, queries_ground_truth):
        ground_truth_set = set(ground_truth)
        reciprocal_rank = 0
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in ground_truth_set:
                reciprocal_rank = 1 / rank
                break
        reciprocal_ranks.append(reciprocal_rank)
    
    mrr = sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0
    return mrr

In [47]:
print(f"Query 1 & 2 MRR: {mean_reciprocal_rank(queries_ground_truth, queries_results)}")

Query 1 & 2 MRR: 1.0


In [48]:
print(f"Personalized Queries MRR: {mean_reciprocal_rank(list(personalized_query_relevant.values()), list(personalized_query_results.values())):.4f}")

Personalized Queries MRR: 0.8400


### Normalized Discounted Cumulative Gain (NDCG)

In [49]:
def dcg_at_k(relevances, k, method=1):
    relevances = np.asarray(relevances)[:k]
    if relevances.size:
        if method == 1:  # Standard method
            return relevances[0] + np.sum(relevances[1:] / np.log2(np.arange(2, k + 1)))
        elif method == 2:  # Alternative method
            return np.sum((2**relevances - 1) / np.log(np.arange(1, k + 1) + 1))
    return 0

def ndcg_at_k(ground_truth, results, k, method=1):
    # assert k <= len(results)
    if k > len(results): 
        warnings.warn("k is greater than the number of results. Adjusting to maximum available.")
        k = min(k, len(results))
    
    ground_truth_set = set(ground_truth)
    # Get binary relevance for the actual results
    actual_relevance = [1 if doc_id in ground_truth_set else 0 for doc_id in results[:k]]
    
    # Compute DCG for actual results
    actual_dcg = dcg_at_k(actual_relevance, k, method)
    
    # Sort the binary relevance to compute ideal DCG
    ideal_relevance = sorted(actual_relevance, reverse=True)
    ideal_dcg = dcg_at_k(ideal_relevance, k, method)
    
    if ideal_dcg == 0: return 0 
    return actual_dcg / ideal_dcg

In [50]:
K_values = list(sorted(set([5, 10, 15, len(query_1_results)])))
for K in K_values:
    ndcg = ndcg_at_k(query_1_relevant, query_1_results, K)
    print(f"Query 1 NDCG@{K}: {ndcg:.4f}")

Query 1 NDCG@5: 1.0000
Query 1 NDCG@10: 0.9567
Query 1 NDCG@15: 0.9509
Query 1 NDCG@17: 0.9512


In [51]:
K_values = list(sorted(set([5, 10, 15, len(query_2_results)])))
for K in K_values:
    ndcg = ndcg_at_k(query_2_relevant, query_2_results, K)
    print(f"Query 2 NDCG@{K}: {ndcg:.4f}")

Query 2 NDCG@5: 1.0000
Query 2 NDCG@10: 1.0000
Query 2 NDCG@15: 1.0000


In [52]:
for i, query_text in enumerate(query_texts, start=1):
    relevant_docs = personalized_query_relevant[i]
    results = personalized_query_results[i]
    
    K_values = list(sorted(set([5, 10, 15, len(results)])))
    for K in K_values:
        ndcg = ndcg_at_k(relevant_docs, results, K)
        print(f"Personalized Query {i} - {query_text:<25} - NDCG@{K:<3}: {ndcg:.4f}")
    print()

Personalized Query 1 - India protest             - NDCG@5  : 1.0000
Personalized Query 1 - India protest             - NDCG@10 : 1.0000
Personalized Query 1 - India protest             - NDCG@15 : 1.0000

Personalized Query 2 - support farmers           - NDCG@5  : 1.0000
Personalized Query 2 - support farmers           - NDCG@10 : 1.0000
Personalized Query 2 - support farmers           - NDCG@15 : 1.0000

Personalized Query 3 - Modi shame                - NDCG@5  : 1.0000
Personalized Query 3 - Modi shame                - NDCG@10 : 1.0000
Personalized Query 3 - Modi shame                - NDCG@15 : 1.0000

Personalized Query 4 - BJP party                 - NDCG@5  : 1.0000
Personalized Query 4 - BJP party                 - NDCG@9  : 1.0000
Personalized Query 4 - BJP party                 - NDCG@10 : 1.0000
Personalized Query 4 - BJP party                 - NDCG@15 : 1.0000

Personalized Query 5 - human rights violated     - NDCG@5  : 1.0000
Personalized Query 5 - human rights violated



## Vector Representation using T-distributed Stochastic Neighbor Embedding (T-SNE)

In [53]:
import gensim
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample data: Replace this with your actual dataframe column
tweets = ["This is a sample tweet", "Another tweet for analysis", "Machine learning with Python", "Text data visualization"]

# Create TF-IDF model
vectorizer = TfidfVectorizer(max_features=100)
tfidf_matrix = vectorizer.fit_transform(tweets)

# tfidf_matrix is a sparse matrix of shape (n_samples, n_features)

In [55]:
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Create a TSNE model: set perplexity and n_iter according to your dataset size
tsne_model = TSNE(n_components=2, verbose=1, perplexity=40, n_iter=300)
tsne_results = tsne_model.fit_transform(tfidf_matrix.toarray())  # convert to array if necessary

# tsne_results now holds the 2D coordinates of your tweets



ValueError: perplexity must be less than n_samples

In [None]:
# Plotting
plt.figure(figsize=(10, 6))
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], edgecolor='k', alpha=0.5)
plt.title('Tweet Visualization using T-SNE')
plt.xlabel('TSNE Component 1')
plt.ylabel('TSNE Component 2')
plt.show()