# IRWA Project Part 2

|Name | Email | UPF uNum |
| --- | --- | --- |
| Clara Pena | clara.pena01@estudiant.upf.edu | u186416 |
| Yuyan Wang | yuyan.wang01@estudiant.upf.edu | u199907 |

## Import Libraries and Load Data

In [30]:
import pandas as pd
import numpy as np
from collections import defaultdict
from array import array
from nltk import PorterStemmer, word_tokenize, SnowballStemmer
from nltk.corpus import stopwords
from collections import Counter
import math
import numpy.linalg as la
import string
import textwrap
import re
import emoji

In [31]:
df = pd.read_csv("./data/processed_data.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117405 entries, 0 to 117404
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        117405 non-null  int64 
 1   content   117404 non-null  object
 2   date      117405 non-null  object
 3   hashtags  116794 non-null  object
 4   likes     117405 non-null  int64 
 5   retweets  117405 non-null  int64 
 6   url       117405 non-null  object
 7   language  117405 non-null  object
 8   docId     48427 non-null   object
dtypes: int64(3), object(6)
memory usage: 8.1+ MB


We'll be working with those tweets that have a document ID associated with them; that is, the value in the docID column should not be NaN. This is basically taking those tweets in English. Besides that, for simplicity, we're not using the column language anymore, so it can be dropped.

In [32]:
tweets_df = df.dropna(subset=['docId']).drop(columns=['language'])

In [33]:
tweets_df.head(10)

Unnamed: 0,id,content,date,hashtags,likes,retweets,url,docId
1,1364506167226032128,watch full video farmersprotest nofarmersnofood tcofustokocxk,2021-02-24T09:23:16+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/1364506167226032128,doc_2
6,1364505991887347714,watch full video farmersprotest nofarmersnofood,2021-02-24T09:22:34+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/1364505991887347714,doc_3
9,1364505813834989568,watch full video farmersprotest nofarmersnofood,2021-02-24T09:21:51+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/1364505813834989568,doc_4
10,1364505749359976448,anoth farmer malkeet singh mahilpur hoshiarpur pass away delhi protest site farmersprotest,2021-02-24T09:21:36+00:00,#FarmersProtest,3,3,https://twitter.com/ShariaActivist/status/1364505749359976448,doc_5
14,1364505676375076867,hi tell boss modidontsellfarm thank farmersprotest,2021-02-24T09:21:19+00:00,#ModiDontSellFarmers #FarmersProtest,0,0,https://twitter.com/KaurDosanjh1979/status/1364505676375076867,doc_6
16,1364505511073300481,watch full video farmersprotest nofarmersnofood,2021-02-24T09:20:39+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/1364505511073300481,doc_7
18,1364505452134817795,despit increas tax petroldiesel must increas tax alcohol cigarett tobacco aatamnirbhartbharat kissabl petrolpricehik petrolpric modihaitomehngaihai bjp farmersprotest,2021-02-24T09:20:25+00:00,#taxes #petrolDiesel #taxes #alcohol #cigarettes #tobacco #aatamnirbhartbharat #Kissables #PetrolPriceHike #PetrolPrice #ModiHaiToMehngaiHai #modi_rojgaar_दो #BJP #FarmersProtest #Budget2021,1,1,https://twitter.com/Satende09192805/status/1364505452134817795,doc_8
20,1364505443997937669,mockeri menac sedit charg farmersprotest,2021-02-24T09:20:23+00:00,#sedition #FarmersProtest,0,0,https://twitter.com/algo_121/status/1364505443997937669,doc_9
25,1364505314586951680,watch full video farmersprotest nofarmersnofood,2021-02-24T09:19:52+00:00,#farmersprotest #NoFarmersNoFood,0,0,https://twitter.com/anmoldhaliwal/status/1364505314586951680,doc_10
26,1364505255946379268,left hear modi lol farmersprotest,2021-02-24T09:19:38+00:00,#FarmersProtest,1,0,https://twitter.com/kdhanjal12/status/1364505255946379268,doc_11


In [34]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48427 entries, 1 to 117404
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        48427 non-null  int64 
 1   content   48427 non-null  object
 2   date      48427 non-null  object
 3   hashtags  48153 non-null  object
 4   likes     48427 non-null  int64 
 5   retweets  48427 non-null  int64 
 6   url       48427 non-null  object
 7   docId     48427 non-null  object
dtypes: int64(3), object(5)
memory usage: 3.3+ MB


## Indexing

### Build inverted index

In [35]:
def create_index(df):
    index = defaultdict(list)
    
    for idx, row in df.iterrows():
        doc_id = row['docId']
        terms = row['content'].split()
        
        current_page_index = {}
        for position, term in enumerate(terms):
            if term in current_page_index:
                current_page_index[term][1].append(position)
            else:
                current_page_index[term] = [doc_id, array('I', [position])]  # 'I' for array of unsigned ints.

        # Merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

    return index

In [36]:
index = create_index(tweets_df)

In [37]:
print(len(index))

30477


In [38]:
def preprocess_text(text, lang_code='en', language_mapping={'en': 'english', 'es': 'spanish', 'fr': 'french', 'de': 'german', 'da': 'danish', 'nl': 'dutch', 'it': 'italian', 'fi': 'finnish', 'ru': 'russian', 'el': 'greek', 'no': 'norwegian', 'pt': 'portuguese', 'sv': 'swedish'}):
    # Map language code to SnowballStemmer language name
    language = language_mapping.get(lang_code, "english")  # Default to 'english' if lang_code is unsupported

    text = text.lower()

    # Remove mentions (anything starting with @ and followed by any characters until a space or end of the string)
    text = re.sub(r'@\w+', '', text)

    # Remove emojis
    text = emoji.replace_emoji(text, replace='')

    # Handle contractions by removing possessive endings and common contractions
    text = re.sub(r"\b(\w+)'s\b", r'\1', text)  # Changes "people's" to "people"
    text = re.sub(r"\b(\w+)n't\b", r'\1 not', text)  # Changes "isn't" to "is not"
    text = re.sub(r"\b(\w+)'ll\b", r'\1 will', text)  # Changes "I'll" to "I will"
    text = re.sub(r"\b(\w+)'d\b", r'\1 would', text)  # Changes "I'd" to "I would"
    text = re.sub(r"\b(\w+)'re\b", r'\1 are', text)  # Changes "you're" to "you are"
    text = re.sub(r"\b(\w+)'ve\b", r'\1 have', text)  # Changes "I've" to "I have"

    tokens = word_tokenize(text, language=language)

    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    stripped_tokens = [w.translate(table) for w in tokens]

    # Remove non-alphabetic tokens
    words = [word for word in stripped_tokens if word.isalpha()]

    # Remove stop words for the detected language
    try:
        stop_words = set(stopwords.words(language))
        stop_words.add('https')
    except OSError:
        stop_words = set()
    words = [w for w in words if not w in stop_words]
    
    # print("------------- WORDS", words)
    # Stemming
    try:
        stemmer = SnowballStemmer(language)
        stemmed = stemmer.stem(' '.join(words))
    except Exception as e:
        print(f"Stemming not performed due to: {e}")
        stemmed = words  # Fallback to non-stemmed words if stemming fails


    # print("------------- HERE", stemmed.split())
    return stemmed

In [39]:
def build_terms(line):
    # stemmer = PorterStemmer()
    stemmer = SnowballStemmer("english")
    stop_words = set(stopwords.words("english"))
    line = line.lower()
    line = line.split()
    table = str.maketrans('', '', string.punctuation)
    line = [w.translate(table) for w in line]
    line = [w for w in line if w not in stop_words]
    line = [stemmer.stem(w) for w in line] 
    return line

In [40]:
def search(query, index):
    query = preprocess_text(query)  # Normalize and tokenize the query.
    docs = None  # Initialize docs as None to handle the intersection.

    for term in query:
        try:
            # Extract the document IDs for the term.
            term_docs = set([posting[0] for posting in index[term]])

            if docs is None:
                # Initialize docs with the set of document IDs for the first term.
                docs = term_docs
            else:
                # Intersect the sets of document IDs.
                docs = docs.intersection(term_docs)
        except KeyError:
            # If the term is not in the index, return an empty list because no documents can satisfy the query.
            return []

    if docs is None:
        return []  # If no terms were processed, return an empty list.
    else:
        return list(docs)  # Convert the set to a list before returning.

### Propose test queries

In [41]:
# Calculate the top n terms in the DataFrame for the specified column.
def get_top_terms(df, column='content', top_n=10):
    text = ' '.join(df[column].dropna())  # Combine all text and convert to lower case.
    words = text.split()
    # Get a count of all words
    word_count = Counter(words)
    # Return the most common words
    return word_count.most_common(top_n)

top_terms = get_top_terms(tweets_df, column='content', top_n=20)
print(top_terms)

[('farmersprotest', 50272), ('farmer', 17421), ('india', 7724), ('support', 6004), ('protest', 4787), ('amp', 4728), ('right', 3594), ('peopl', 3526), ('modi', 3113), ('indian', 3008), ('govern', 2753), ('bjp', 2649), ('law', 2570), ('releasedetainedfarm', 2432), ('govt', 2338), ('stand', 2207), ('farmersmakeindia', 2133), ('indiabeingsilenc', 2133), ('thank', 2129), ('farm', 2066)]


* “farmers protest India”: This query integrates the most frequent term ‘farmersprotest’ with ‘India’, capturing a broad perspective of the geographical context. It is designed to test the engine’s capability to fetch documents that discuss the nationwide impact and scale of the protests. The inclusion of ‘India’ helps to ensure that the search results are not just about protests in general but specifically about the farmers’ protests within the Indian context.
	
* “support farmer rights”: By combining ‘support’ with ‘farmer’ and ‘rights’, this query delves into the solidarity and advocacy aspects surrounding the protests. It focuses on the socio-political dimensions of the farmers’ rights being debated or upheld. This query is intended to evaluate how well the search engine can identify and retrieve content that discusses support mechanisms, both local and global, for the farmers amidst the protests.
	
* “Modi government response”: Including ‘Modi’ and ‘government’ targets discussions related to the administrative and political response to the protests. Given that government actions and policies are central to the unfolding of the protests, this query checks the engine’s effectiveness in pulling up content that critically examines or reports on the government’s strategies and reactions, providing a lens into the political narrative.
	
* “BJP agricultural laws”: Merging ‘BJP’ with ‘laws’ specifically focuses on the political party in power and the controversial agricultural laws that sparked the protests. This query is crafted to test the search engine’s precision in sifting through discussions related to legislative actions and political affiliations that are pivotal to understanding the core issues of the protests. It aims to highlight documents that discuss the legal frameworks and political viewpoints that define the conflict.
	
* “Indian farmers rally”: This query combines ‘Indian’ with ‘farmers’ and adds a dynamic aspect with ‘rally’, pointing to organized protest events. It is designed to retrieve documents that detail specific events, their significance, and the participation dynamics during the protests. This tests the search engine’s ability to focus on event-based reporting and narratives that capture the mobilization and active participation of the farming community.

In [42]:
def simulate_search(queries, index):
    for query in queries:
        docs = search(query, index)
        top = 10  # Number of results to display
        num_results = len(docs)
        
        print("\n======================\nSample of {} results out of {} for the searched query '{}':\n".format(min(top, num_results), num_results, query))
        for d_id in docs[:top]:
            print("docId = {}".format(d_id))

# List of queries to be processed
queries = [
    "farmers protest India",
    "support farmer rights",
    "Modi government response",
    "BJP agricultural laws",
    "Indian farmers rally"
]

simulate_search(queries, index)

------------- WORDS ['farmers', 'protest', 'india']
------------- HERE ['farmers', 'protest', 'india']

Sample of 0 results out of 0 for the searched query 'farmers protest India':

------------- WORDS ['support', 'farmer', 'rights']
------------- HERE ['support', 'farmer', 'right']

Sample of 0 results out of 0 for the searched query 'support farmer rights':

------------- WORDS ['modi', 'government', 'response']
------------- HERE ['modi', 'government', 'respons']

Sample of 0 results out of 0 for the searched query 'Modi government response':

------------- WORDS ['bjp', 'agricultural', 'laws']
------------- HERE ['bjp', 'agricultural', 'law']

Sample of 0 results out of 0 for the searched query 'BJP agricultural laws':

------------- WORDS ['indian', 'farmers', 'rally']
------------- HERE ['indian', 'farmers', 'r']

Sample of 0 results out of 0 for the searched query 'Indian farmers rally':



### Rank your results

In [43]:
def create_index_tfidf(dataframe):
    # num_documents = len(df)
    num_documents = dataframe['docId'].nunique()
    index = defaultdict(list)
    # tf = defaultdict(dict)  # Normalized term frequencies of terms in documents
    tf = defaultdict(list)
    df = defaultdict(int)  # Document frequencies of terms
    idf = defaultdict(float)

    for idx, row in dataframe.iterrows():
        doc_id = row['docId']
        terms = row['content'].split()
        
        current_page_index = {}

        for position, term in enumerate(terms):
            if term in current_page_index:
                # Append the position to the corresponding list in the array
                current_page_index[term][1].append(position)
            else:
                # Initialize the list with page_id and a new array
                current_page_index[term] = [doc_id, array('I', [position])]

        # Calculate the norm for the terms in the document
        norm = math.sqrt(sum(len(positions[1])**2 for positions in current_page_index.values()))

        # calculate the tf(dividing the term frequency by the above computed norm) and df weights
        for term, posting in current_page_index.items():
            # append the tf for current term (tf = term frequency in current doc/norm)
            tf[term].append(np.round(len(posting[1])/norm,4)) ## SEE formula (1) above
            #increment the document frequency of current term (number of documents containing the current term)
            df[term] += 1 # increment DF for current term
        
        # Merge the current page index with the main index
        for term_page, posting_page in current_page_index.items():
            index[term_page].append(posting_page)

    # Calculate IDF for each term
    for term in df:
        idf[term] = math.log(num_documents / (1 + df[term]))  # Smoothing by adding 1 to denominator

    return index, tf, df, idf

In [44]:
index, tf, df, idf = create_index_tfidf(tweets_df)

In [45]:
def rank_documents(terms, docs, index, idf, tf):
    """
    Perform the ranking of the results of a search based on the tf-idf weights
    
    Arguments:
    terms -- list of query terms
    docs -- list of documents, to rank, matching the query
    index -- inverted index data structure
    idf -- inverted document frequencies
    tf -- term frequencies
    
    Returns:
    List of ranked documents based on the relevance
    """
    doc_vectors = defaultdict(lambda: np.zeros(len(terms)))
    query_vector = np.zeros(len(terms))
    query_terms_count = Counter(terms)
    query_norm = la.norm(list(query_terms_count.values()))

    # Compute the tf-idf for the query vector
    for term_index, term in enumerate(terms):
        if term in idf:
            query_vector[term_index] = query_terms_count[term] / query_norm * idf[term]
            if term not in index:
                continue
    
            for doc_index, (doc, postings) in enumerate(index[term]):
                if doc in docs:
                    doc_vectors[doc][term_index] = tf[term][doc_index] * idf[term]  # T
            
    # Calculate the score of each doc using cosine similarity (dot product of normalized vectors)
    doc_scores = [[np.dot(cur_doc_vec, query_vector), doc] for doc, cur_doc_vec in doc_vectors.items()]
    doc_scores.sort(reverse=True, key=lambda x: x[0])
    # print(doc_scores)
    result_docs = [x[1] for x in doc_scores]

    if len(result_docs) == 0:
        print("No results found, try again")
    else:
        return result_docs

In [46]:
def search_tf_idf(query, index, idf, tf):
    """
    Output is the list of documents that contain all of the query terms. 
    This requires taking the intersection of the lists of documents for each query term.
    """
    query = preprocess_text(query)
    # print(query)
    docs = None  # Initialize to None to handle the first term's document set initialization

    for term in query:
        if term in index:
            term_docs = set([posting[0] for posting in index[term]])  # Collect all document IDs containing this term
            if docs is None:
                docs = term_docs
            else:
                docs = docs.intersection(term_docs)  # Intersection with the accumulated set of documents
        else:
            return []  # If any term is not found, the intersection is empty

    if docs is None:
        return []  # No terms found, return empty list

    docs = list(docs)  # Convert set to list if necessary
    ranked_docs = rank_documents(query, docs, index, idf, tf)  # Rank the documents based on the relevance
    return ranked_docs

In [47]:
def simulate_search_tf_idf(queries, index, idf, tf):
    for query in queries:
        ranked_docs = search_tf_idf(query, index, idf, tf)
        top = 10  # Number of results to display
        num_results = len(ranked_docs)
        
        print("\n======================\nSample of {} results out of {} for the searched query '{}':\n".format(min(top, num_results), num_results, query))
        for d_id in ranked_docs[:top]:
            print("docId = {}".format(d_id))

# List of queries to be processed
queries = [
    "farmers protest India",
    "support farmer rights",
    "Modi government response",
    "BJP agricultural laws",
    "Indian farmers rally"
]

In [48]:
simulate_search_tf_idf(queries, index, idf, tf)

------------- WORDS ['farmers', 'protest', 'india']
------------- HERE ['farmers', 'protest', 'india']

Sample of 0 results out of 0 for the searched query 'farmers protest India':

------------- WORDS ['support', 'farmer', 'rights']
------------- HERE ['support', 'farmer', 'right']

Sample of 0 results out of 0 for the searched query 'support farmer rights':

------------- WORDS ['modi', 'government', 'response']
------------- HERE ['modi', 'government', 'respons']

Sample of 0 results out of 0 for the searched query 'Modi government response':

------------- WORDS ['bjp', 'agricultural', 'laws']
------------- HERE ['bjp', 'agricultural', 'law']

Sample of 0 results out of 0 for the searched query 'BJP agricultural laws':

------------- WORDS ['indian', 'farmers', 'rally']
------------- HERE ['indian', 'farmers', 'r']

Sample of 0 results out of 0 for the searched query 'Indian farmers rally':



## Evaluation

In [49]:
evaluation_gt = pd.read_csv('./data/evaluation_gt.csv', sep=';')
df_eva = pd.merge(tweets_df, evaluation_gt, on='docId', how='left')

pd.set_option('display.max_colwidth', None)

query_1_relevant = (df_eva[(df_eva['query_id'] == 1) & (df_eva['label'] == 1)])['docId'].unique()
query_1_not_relevant = (df_eva[(df_eva['query_id'] == 1) & (df_eva['label'] == 0)])['docId'].unique()

query_2_relevant = (df_eva[(df_eva['query_id'] == 2) & (df_eva['label'] == 1)])['docId'].unique()
query_2_not_relevant = (df_eva[(df_eva['query_id'] == 2) & (df_eva['label'] == 0)])['docId'].unique()

In [50]:
# Keep in mind that for the evaluation part we will be using only the subset of documents that are being defined in the evaluation_gt.csv
df_subset_documents = df_eva[df_eva['query_id'].notnull()]
subset_documents = df_subset_documents['docId'].unique()

In [51]:
index, tf, df, idf = create_index_tfidf(df_subset_documents)

In [52]:
query_1_results = search_tf_idf("people rights", index, idf, tf)
query_2_results = search_tf_idf("Indian government", index, idf, tf)

------------- WORDS ['people', 'rights']
------------- HERE ['people', 'right']
------------- WORDS ['indian', 'government']
------------- HERE ['indian', 'govern']


In [53]:
out = preprocess_text("people's rights")

------------- WORDS ['people', 'rights']
------------- HERE ['people', 'right']


In [54]:
out = preprocess_text("Indian government?")

------------- WORDS ['indian', 'government']
------------- HERE ['indian', 'govern']


In [55]:
print("Snowball Stemmer: ")
stem = SnowballStemmer("english")
print(stem.stem("people rights"))
for w in ["people", "rights"]:
    print(stem.stem(w))

print("Porter Stemmer: ")
stem = PorterStemmer()
print(stem.stem("what is being said about the Indian government"))
for w in ["what", "is", "being", "said", "about", "the", "Indian", "government"]:
    print(stem.stem(w))

Snowball Stemmer: 
people right
peopl
right
Porter Stemmer: 
what is being said about the indian govern
what
is
be
said
about
the
indian
govern


In [56]:
def print_wrapped(title, data):
    wrapper = textwrap.TextWrapper(width=90)
    wrapped_text = wrapper.fill(str(data))
    print(f"{title} {wrapped_text}\n")

print(f"Ground Truth Files Query 1 (subset): {query_1_relevant}\n")
print_wrapped("Our Obtained Results Query 1:", query_1_results)

print(f"Ground Truth Files Query 2 (subset): {query_2_relevant}\n")
print_wrapped("Our Obtained Results Query 2:", query_2_results)


Ground Truth Files Query 1 (subset): ['doc_1047' 'doc_2100' 'doc_3287' 'doc_3474' 'doc_3570' 'doc_4053'
 'doc_5480' 'doc_5512' 'doc_5751' 'doc_6477' 'doc_8066' 'doc_9696'
 'doc_9850' 'doc_9937' 'doc_10048']

Our Obtained Results Query 1: []

Ground Truth Files Query 2 (subset): ['doc_103' 'doc_1566' 'doc_1651' 'doc_1666' 'doc_1785' 'doc_2528'
 'doc_2653' 'doc_3005' 'doc_3076' 'doc_3116' 'doc_3646' 'doc_3682'
 'doc_3927' 'doc_4176' 'doc_4304']

Our Obtained Results Query 2: []

