# IRWA Project Part 3

|Name | Email | UPF uNum |
| --- | --- | --- |
| Clara Pena | clara.pena01@estudiant.upf.edu | u186416 |
| Yuyan Wang | yuyan.wang01@estudiant.upf.edu | u199907 |

## Import Libraries and Load Data

In [2]:
!pip install rank_bm25

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi
from gensim.models import Word2Vec

Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2


In [4]:
df = pd.read_csv("./data/processed_data.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117405 entries, 0 to 117404
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        117405 non-null  int64 
 1   content   117404 non-null  object
 2   date      117405 non-null  object
 3   hashtags  116794 non-null  object
 4   likes     117405 non-null  int64 
 5   retweets  117405 non-null  int64 
 6   url       117405 non-null  object
 7   language  117405 non-null  object
 8   docId     48427 non-null   object
dtypes: int64(3), object(6)
memory usage: 8.1+ MB


In [5]:
tweets_df = df.dropna(subset=['docId']).drop(columns=['language'])

In [6]:
tweets_df.head(10)

| | id | content | date | hashtags | likes | retweets | url | docId |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1364506167226032128 | watch full video farmersprotest nofarmersnofoo... | 2021-02-24T09:23:16+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_2 |
| 6 | 1364505991887347714 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:22:34+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_3 |
| 9 | 1364505813834989568 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:21:51+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_4 |
| 10 | 1364505749359976448 | anoth farmer malkeet singh mahilpur hoshiarpur... | 2021-02-24T09:21:36+00:00 | #FarmersProtest | 3 | 3 | https://twitter.com/ShariaActivist/status/1364... | doc_5 |
| 14 | 1364505676375076867 | hi tell boss modidontsellfarm thank farmerspro... | 2021-02-24T09:21:19+00:00 | #ModiDontSellFarmers #FarmersProtest | 0 | 0 | https://twitter.com/KaurDosanjh1979/status/136... | doc_6 |
| 16 | 1364505511073300481 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:20:39+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_7 |
| 18 | 1364505452134817795 | despit increas tax petroldiesel must increas t... | 2021-02-24T09:20:25+00:00 | #taxes #petrolDiesel #taxes #alcohol #cigarett... | 1 | 1 | https://twitter.com/Satende09192805/status/136... | doc_8 |
| 20 | 1364505443997937669 | mockeri menac sedit charg farmersprotest | 2021-02-24T09:20:23+00:00 | #sedition #FarmersProtest | 0 | 0 | https://twitter.com/algo_121/status/1364505443... | doc_9 |
| 25 | 1364505314586951680 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:19:52+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_10 |
| 26 | 1364505255946379268 | left hear modi lol farmersprotest | 2021-02-24T09:19:38+00:00 | #FarmersProtest | 1 | 0 | https://twitter.com/kdhanjal12/status/13645052... | doc_11 |


In [7]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48427 entries, 1 to 117404
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        48427 non-null  int64 
 1   content   48427 non-null  object
 2   date      48427 non-null  object
 3   hashtags  48153 non-null  object
 4   likes     48427 non-null  int64 
 5   retweets  48427 non-null  int64 
 6   url       48427 non-null  object
 7   docId     48427 non-null  object
dtypes: int64(3), object(5)
memory usage: 4.3+ MB


## Ranking Score

### TF-IDF + Cosine Similarity

In [9]:
# Example query
query = "support farmers"

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets_df['content'])

# Calculate TF-IDF for the query
query_vector = tfidf_vectorizer.transform([query])

# Compute cosine similarities between the query and all tweets
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

# Rank tweets based on similarity scores
top_20_tf_idf = np.argsort(-cosine_similarities)[:20]

# Store the scores for the top 20 tweets
scores = cosine_similarities[top_20_tf_idf]

# Create a DataFrame for the results
# Copy the slice so that adding the 'score' column cannot trigger SettingWithCopyWarning
top_20_tf_idf_tweets = tweets_df.iloc[top_20_tf_idf][['content', 'likes', 'retweets']].copy()
top_20_tf_idf_tweets['score'] = scores

print("Top 20 TF-IDF Ranked Tweets:")
print(top_20_tf_idf_tweets)

Top 20 TF-IDF Ranked Tweets:
                                                  content  likes  retweets  \
20948   let support farmers also modirojgard farmerspr...      1         1   
4718            farmersprotest godimed we support farmers      1         1   
15771   petrolpr godimed farmersprotest sav farmers sa...      0         0   
43362                              support farmersprotest      5         1   
93395                              support farmersprotest      1         0   
20750                              support farmersprotest      0         0   
25862                            i support farmersprotest      6         5   
72590                            i support farmersprotest      0         0   
34693                              support farmersprotest      3         2   
23493                              support farmersprotest      0         0   
92024                              farmersprotest support     31         2   
12154                              
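
To make the ranking above more concrete, here is a minimal sketch that computes TF-IDF weights and cosine similarity by hand on a toy corpus, using only the standard library. It assumes the smoothed IDF `ln((1 + N) / (1 + df)) + 1`, which is the default weighting of scikit-learn's `TfidfVectorizer`, and whitespace tokenization (a simplification: the default scikit-learn tokenizer also drops single-character tokens). The toy documents and the helper names `fit_idf`, `tfidf_vector` and `cosine` are hypothetical and are not part of the notebook.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict; terms outside the vocabulary are ignored."""
    tf = Counter(term for term in text.split() if term in idf)
    return {term: count * idf[term] for term, count in tf.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus (not taken from the dataset)
toy_docs = [
    "support farmersprotest",
    "i support farmers",
    "watch full video farmersprotest",
]

idf = fit_idf(toy_docs)
toy_query_vec = tfidf_vector("support farmers", idf)
toy_scores = [cosine(toy_query_vec, tfidf_vector(doc, idf)) for doc in toy_docs]

# Document indices sorted by decreasing similarity
toy_ranking = sorted(range(len(toy_docs)), key=lambda i: -toy_scores[i])
print(toy_ranking)
```

The document that shares both query terms ranks first, the one sharing only `support` second, and the document with no query term gets a score of zero. Because cosine similarity divides by both vector norms, the L2 normalization that `TfidfVectorizer` applies does not change the ranking.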

### Our-Score + Cosine Similarity

### BM25

In [11]:
# Tokenize each tweet content
tokenized_tweets = [tweet.split(" ") for tweet in tweets_df['content']]
query_tokens = query.split(" ")

# Initialize BM25 and calculate scores
bm25 = BM25Okapi(tokenized_tweets)
bm25_scores = bm25.get_scores(query_tokens)

# Rank tweets based on BM25 scores
top_20_bm25 = np.argsort(-bm25_scores)[:20]
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
top_20_bm25_tweets = tweets_df.iloc[top_20_bm25].copy()

# Store the scores in the DataFrame
top_20_bm25_tweets['bm25_score'] = bm25_scores[top_20_bm25]

print("Top 20 BM25 Ranked Tweets:")
print(top_20_bm25_tweets[['content', 'likes', 'retweets', 'bm25_score']])

Top 20 BM25 Ranked Tweets:
                                                  content  likes  retweets  \
4718            farmersprotest godimed we support farmers      1         1   
20948   let support farmers also modirojgard farmerspr...      1         1   
15771   petrolpr godimed farmersprotest sav farmers sa...      0         0   
85129   chenna metr abb to provid saf tunnel ventilati...      1         4   
3046    sal sal indiaonsal but farmers are not on sal ...      0         0   
96437   gobackfascistmod gobackmod gobackfascistmod go...      0         0   
85814   support farmersprotest support gretathunberg s...      6         1   
102967  support farmer support farmer support twitter ...      1         1   
91051               support farmer support farmersprotest      0         0   
98193                support farmersprotest thank support      0         0   
46713   support chakkajam farmersprotest yes support c...      3         2   
114826  support farmer mahapanchayatr
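
As an illustration of what a BM25 score measures, the following sketch implements a common textbook form of Okapi BM25 in plain Python. It is not the `rank_bm25` implementation: the function name `bm25_sketch`, the toy documents, and the non-negative IDF variant `ln(1 + (N - df + 0.5) / (df + 0.5))` are assumptions made for this example, so its numbers may differ from those returned by `BM25Okapi.get_scores`. The parameters `k1=1.5` and `b=0.75` mirror the defaults of `BM25Okapi`.

```python
import math
from collections import Counter

def bm25_sketch(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a textbook Okapi BM25."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        # Length normalization: longer-than-average documents are penalized
        length_norm = k1 * (1 - b + b * len(doc) / avgdl)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus (not taken from the dataset)
toy_tokenized = [
    "support farmersprotest".split(),
    "support farmers support farmersprotest".split(),
    "watch full video".split(),
]
toy_bm25 = bm25_sketch(["support", "farmers"], toy_tokenized)
print(toy_bm25)
```

Unlike cosine similarity over TF-IDF vectors, the term-frequency contribution saturates (controlled by `k1`) and document length is normalized explicitly (controlled by `b`), which is one reason the BM25 and TF-IDF rankings above differ.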


### Top 20 Documents using Word2Vec + Cosine Similarity

In [12]:
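# Get tweet vector by averaging word vectors
def tweet2vec(tweet):
    """Return the mean Word2Vec vector of the in-vocabulary words of `tweet`.

    Relies on the global `model`, which must be trained before this function is called.
    Tweets with no in-vocabulary words map to a zero vector.
    """
    words = tweet.split()
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(model.vector_size)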

In [14]:
# Train Word2Vec model
model = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5, min_count=1, workers=4)

# Calculate tweet vectors for each tweet and the query
tweet_vectors = np.array([tweet2vec(tweet) for tweet in tweets_df['content']])
query_vector = tweet2vec(query)

# Calculate cosine similarity between query vector and tweet vectors
cosine_similarities_word2vec = cosine_similarity([query_vector], tweet_vectors).flatten()

# Rank tweets based on word2vec similarity scores
top_20_word2vec = np.argsort(-cosine_similarities_word2vec)[:20]
scores_word2vec = cosine_similarities_word2vec[top_20_word2vec]
# Copy the slice so that adding a column does not trigger SettingWithCopyWarning
top_20_word2vec_tweets = tweets_df.iloc[top_20_word2vec].copy()
top_20_word2vec_tweets['score'] = scores_word2vec

print("Top 20 Word2Vec Ranked Tweets:")
print(top_20_word2vec_tweets[['content', 'likes', 'retweets', 'score']])

Top 20 Word2Vec Ranked Tweets:
                                                  content  likes  retweets  \
46713   support chakkajam farmersprotest yes support c...      3         2   
98193                support farmersprotest thank support      0         0   
91051               support farmer support farmersprotest      0         0   
36505   thank support support farmer farmersprotest tc...      0         0   
101005  thank support farmersprotest india appreci sup...     56        16   
19258   support farmer retweet show support farmerspro...      0         0   
102967  support farmer support farmer support twitter ...      1         1   
114866  support farmer mahapanchayatrevolut farmerspro...      0         0   
114826  support farmer mahapanchayatrevolut farmerspro...      0         0   
114963  support farmer mahapanchayatrevolut farmerspro...      2         1   
17929   support comment stand youth badpolit badleader...     13         7   
47019                     support
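
The averaging step used by `tweet2vec` can be illustrated without training a model. The sketch below uses a hand-made, hypothetical embedding table (`toy_embeddings`, three dimensions, not learned from any data) and a hypothetical helper `text2vec` that mirrors the same logic: average the vectors of known words and fall back to a zero vector when no word is known.

```python
import math

# Hypothetical 3-dimensional embeddings, chosen by hand for illustration only
toy_embeddings = {
    "support": [1.0, 0.0, 0.0],
    "farmers": [0.0, 1.0, 0.0],
    "video": [0.0, 0.0, 1.0],
}
DIM = 3

def text2vec(text, embeddings, dim):
    """Average the vectors of in-vocabulary words; zero vector if none are known."""
    vectors = [embeddings[word] for word in text.split() if word in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

toy_query = text2vec("support farmers", toy_embeddings, DIM)
toy_texts = ["support farmers", "support video", "unknown words"]
toy_sims = [cosine(toy_query, text2vec(t, toy_embeddings, DIM)) for t in toy_texts]
print(toy_sims)
```

Averaging discards word order and gives every word equal weight, so a tweet that repeats a query word many times ends up very close to the query vector. This is consistent with the top Word2Vec results above being dominated by tweets that repeat `support`.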
