# IRWA Project Part 3

|Name | Email | UPF uNum |
| --- | --- | --- |
| Clara Pena | clara.pena01@estudiant.upf.edu | u186416 |
| Yuyan Wang | yuyan.wang01@estudiant.upf.edu | u199907 |

## Import Libraries and Load Data

In [1]:
import pandas as pd
import numpy as np

from nltk import SnowballStemmer
from nltk.corpus import stopwords
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from rank_bm25 import BM25Okapi
from gensim.models import Word2Vec
from datetime import datetime, timezone
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import altair as alt

In [2]:
df = pd.read_csv("./data/processed_data.csv")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 117405 entries, 0 to 117404
Data columns (total 9 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   id        117405 non-null  int64 
 1   content   117404 non-null  object
 2   date      117405 non-null  object
 3   hashtags  116794 non-null  object
 4   likes     117405 non-null  int64 
 5   retweets  117405 non-null  int64 
 6   url       117405 non-null  object
 7   language  117405 non-null  object
 8   docId     48427 non-null   object
dtypes: int64(3), object(6)
memory usage: 8.1+ MB


In [3]:
tweets_df = df.dropna(subset=['docId']).drop(columns=['language'])

In [4]:
tweets_df.head(10)

| | id | content | date | hashtags | likes | retweets | url | docId |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1364506167226032128 | watch full video farmersprotest nofarmersnofoo... | 2021-02-24T09:23:16+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_2 |
| 6 | 1364505991887347714 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:22:34+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_3 |
| 9 | 1364505813834989568 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:21:51+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_4 |
| 10 | 1364505749359976448 | anoth farmer malkeet singh mahilpur hoshiarpur... | 2021-02-24T09:21:36+00:00 | #FarmersProtest | 3 | 3 | https://twitter.com/ShariaActivist/status/1364... | doc_5 |
| 14 | 1364505676375076867 | hi tell boss modidontsellfarm thank farmerspro... | 2021-02-24T09:21:19+00:00 | #ModiDontSellFarmers #FarmersProtest | 0 | 0 | https://twitter.com/KaurDosanjh1979/status/136... | doc_6 |
| 16 | 1364505511073300481 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:20:39+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_7 |
| 18 | 1364505452134817795 | despit increas tax petroldiesel must increas t... | 2021-02-24T09:20:25+00:00 | #taxes #petrolDiesel #taxes #alcohol #cigarett... | 1 | 1 | https://twitter.com/Satende09192805/status/136... | doc_8 |
| 20 | 1364505443997937669 | mockeri menac sedit charg farmersprotest | 2021-02-24T09:20:23+00:00 | #sedition #FarmersProtest | 0 | 0 | https://twitter.com/algo_121/status/1364505443... | doc_9 |
| 25 | 1364505314586951680 | watch full video farmersprotest nofarmersnofood | 2021-02-24T09:19:52+00:00 | #farmersprotest #NoFarmersNoFood | 0 | 0 | https://twitter.com/anmoldhaliwal/status/13645... | doc_10 |
| 26 | 1364505255946379268 | left hear modi lol farmersprotest | 2021-02-24T09:19:38+00:00 | #FarmersProtest | 1 | 0 | https://twitter.com/kdhanjal12/status/13645052... | doc_11 |


In [5]:
tweets_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 48427 entries, 1 to 117404
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        48427 non-null  int64 
 1   content   48427 non-null  object
 2   date      48427 non-null  object
 3   hashtags  48153 non-null  object
 4   likes     48427 non-null  int64 
 5   retweets  48427 non-null  int64 
 6   url       48427 non-null  object
 7   docId     48427 non-null  object
dtypes: int64(3), object(5)
memory usage: 3.3+ MB


In [6]:
queries = [
    "India protest",
    "support farmers",
    "Modi shame",
    "BJP party",
    "human rights violated"
]

In [7]:
# Helper function from previous part
def build_terms(line):
    # stemmer = PorterStemmer()
    stemmer = SnowballStemmer("english")
    stop_words = set(stopwords.words("english"))
    line = line.lower()

    # Handle contractions by removing possessive endings and common contractions
    line = re.sub(r"\b(\w+)'s\b", r'\1', line)  # Changes "people's" to "people"
    line = re.sub(r"\b(\w+)n't\b", r'\1 not', line)  # Changes "isn't" to "is not"
    line = re.sub(r"\b(\w+)'ll\b", r'\1 will', line)  # Changes "I'll" to "I will"
    line = re.sub(r"\b(\w+)'d\b", r'\1 would', line)  # Changes "I'd" to "I would"
    line = re.sub(r"\b(\w+)'re\b", r'\1 are', line)  # Changes "you're" to "you are"
    line = re.sub(r"\b(\w+)'ve\b", r'\1 have', line)  # Changes "I've" to "I have"

    line = line.split()

    table = str.maketrans('', '', string.punctuation)
    line = [w.translate(table) for w in line]
    line = [w for w in line if w not in stop_words]
    line = [stemmer.stem(w) for w in line] 
    return ' '.join(line)
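
The contraction rules in `build_terms` can be checked in isolation. The sketch below copies only the regular expressions from the helper into a hypothetical `expand_contractions` function (the name is an assumption, not part of the project) and leaves out stop-word removal and stemming, so it runs without NLTK. It suggests one caveat: the `n't` rule splits on the letters before `n't`, so a word such as "can't" becomes "ca not" rather than "can not".

```python
import re

# Same substitution rules as in build_terms, applied in the same order.
CONTRACTION_RULES = [
    (r"\b(\w+)'s\b", r"\1"),
    (r"\b(\w+)n't\b", r"\1 not"),
    (r"\b(\w+)'ll\b", r"\1 will"),
    (r"\b(\w+)'d\b", r"\1 would"),
    (r"\b(\w+)'re\b", r"\1 are"),
    (r"\b(\w+)'ve\b", r"\1 have"),
]

def expand_contractions(text):
    """Lowercase the text and apply each contraction rule in turn."""
    text = text.lower()
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text)
    return text

examples = ["people's", "isn't", "can't", "You're"]
expanded = [expand_contractions(e) for e in examples]
print(expanded)  # ['people', 'is not', 'ca not', 'you are']
```

Because "not" is typically filtered out later as a stop word, the odd prefix ("ca") is what would remain as a term in such cases.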

## Ranking Score

### TF-IDF + Cosine Similarity

In [8]:
# Example query
query = "support farmers"

# TF-IDF Vectorization
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(tweets_df['content'])

# Calculate TF-IDF for the query
query = build_terms(query)
query_vector = tfidf_vectorizer.transform([query])

# Compute cosine similarities between the query and all tweets
cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

# Rank tweets based on similarity scores
top_20_tf_idf = np.argsort(-cosine_similarities)[:20]

# Store the scores for the top 20 tweets
scores = cosine_similarities[top_20_tf_idf]

# Create a DataFrame for the results
top_20_tf_idf_tweets = tweets_df.iloc[top_20_tf_idf][['content', 'likes', 'retweets']]
top_20_tf_idf_tweets['score'] = scores

print("Top 20 TF-IDF Ranked Tweets:")
print(top_20_tf_idf_tweets)

Top 20 TF-IDF Ranked Tweets:
                                content  likes  retweets     score
35532     support farmer farmersprotest      2         6  0.968072
43070     support farmer farmersprotest      4         1  0.968072
53254     support farmer farmersprotest     13         4  0.968072
53225     support farmer farmersprotest      0         0  0.968072
97763     support farmer farmersprotest      2         4  0.968072
92638     farmersprotest support farmer      0         2  0.968072
39433   i support farmer farmersprotest      1         0  0.968072
28837     support farmer farmersprotest      2         2  0.968072
32698     farmersprotest support farmer      0         0  0.968072
47385     support farmer farmersprotest      1         0  0.968072
22325     support farmer farmersprotest      0         0  0.968072
84622     support farmer farmersprotest      0         0  0.968072
105813    support farmer farmersprotest      1         0  0.968072
34978     support farmer farmersp

In [9]:
# Generalized function:
def get_top_tweets_by_tfidf(query, df=tweets_df, top_n=20, print_results=True):
    # TF-IDF Vectorization (use the df argument, not the global tweets_df)
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(df['content'])

    # Calculate TF-IDF for the query
    query = build_terms(query)
    query_vector = tfidf_vectorizer.transform([query])

    # Compute cosine similarities between the query and all tweets
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

    # Rank tweets based on similarity scores
    top_indices = np.argsort(-cosine_similarities)[:top_n]

    # Store the scores for the top tweets
    scores = cosine_similarities[top_indices]

    # Create a DataFrame for the results
    top_tweets = df.iloc[top_indices][['content']]
    top_tweets['score'] = scores

    if print_results:
        print(f"Query '{query}' Top {top_n} TF-IDF & Cosine Similarity Ranked Tweets:")
        print(top_tweets)

    return top_tweets

In [10]:
for q in queries:
    get_top_tweets_by_tfidf(q)
    print("")

Query 'india protest' Top 20 TF-IDF & Cosine Similarity Ranked Tweets:
                                                  content     score
77988                 farmer protest india farmersprotest  0.879362
77270                      farmersprotest protest protest  0.746128
37683                              protest farmersprotest  0.723375
5610                                 india farmersprotest  0.621901
73859                                india farmersprotest  0.621901
51011                                india farmersprotest  0.621901
13284                       farmer protest farmersprotest  0.612658
107439   indian farmer protest india delhi farmersprotest  0.582579
39335   like someon said farmersprotest last protest d...  0.577910
39341   like someon said farmersprotest last protest d...  0.577910
85520   farmersprotest india farmer protest explain fa...  0.561856
69282   farmersprotest istandwithfarm protest india fa...  0.541238
48645                         india farmer fa
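
To make the ranking above easier to reason about, here is a minimal pure-Python sketch of TF-IDF weighting followed by cosine similarity. It assumes the smoothed IDF `ln((1 + N) / (1 + df)) + 1` and L2 normalization, which to our understanding match the `TfidfVectorizer` defaults; tokenization is simplified to whitespace splitting, and the helper names (`fit_idf`, `tfidf_vector`, `cosine`) and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """L2-normalized TF-IDF weights; terms unseen at fit time are ignored."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

def cosine(u, v):
    # Both vectors are unit length (or empty), so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Toy corpus (made up, in the style of the preprocessed tweets)
docs = [
    "support farmer farmersprotest".split(),
    "farmersprotest support farmer".split(),
    "india farmersprotest".split(),
]
idf = fit_idf(docs)
query_vec = tfidf_vector("support farmer".split(), idf)
scores = [cosine(query_vec, tfidf_vector(d, idf)) for d in docs]
print([round(s, 3) for s in scores])
```

Because the representation is a bag of words, the first two documents receive the same score regardless of word order, which is consistent with the identical scores for "support farmer farmersprotest" and "farmersprotest support farmer" in the output above.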

### Our-Score + Cosine Similarity

In [11]:
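def normalize(column):
    # Min-max scale to [0, 1]; a constant column maps to zeros instead of NaN
    value_range = column.max() - column.min()
    if value_range == 0:
        return column * 0.0
    return (column - column.min()) / value_range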

def calculate_social_popularity(likes, retweets, likes_w=0.3, retweets_w=0.7):
    # Compare with a tolerance instead of exact float equality
    if not np.isclose(likes_w + retweets_w, 1):
        raise ValueError(f"Incorrect weights for social popularity computation; likes_w {likes_w} retweets_w {retweets_w}")
    return likes_w * likes + retweets_w * retweets

def calculate_recency(date):
    date_times = pd.to_datetime(date, utc=True)
    current_date = datetime.now(timezone.utc)
    recency_timedelta = (current_date - date_times).dt.total_seconds().round(2)
    recency_score = 1 / (recency_timedelta + 1)
    return recency_score

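def our_score_algorithm(query, df=tweets_df, top_n=20, text_weight=0.55, social_weight=0.3,
                        recency_weight=0.15, likes_weight=0.3, retweets_weight=0.7, print_results=True):
    # Compare with a tolerance: exact float equality can fail for weights that do sum to 1
    if not np.isclose(text_weight + social_weight + recency_weight, 1):
        raise ValueError(f"Incorrect weights for score computation; text_w {text_weight} social_w {social_weight} recency_w {recency_weight}")
    # Work on a copy so the helper columns below are not added to the caller's DataFrame
    df = df.copy()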
    # TF-IDF Vectorization and Cosine Similarity for Text Relevance
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(df['content'])
    query = build_terms(query)
    query_vector = tfidf_vectorizer.transform([query])
    cosine_similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()
    
    # Calculate Social Popularity Score
    df['social_popularity'] = calculate_social_popularity(df['likes'], df['retweets'], likes_w=likes_weight, retweets_w=retweets_weight)
    
    # Calculate Recency Score
    df['recency'] = calculate_recency(df['date'])

    # Normalize scores
    df['normalized_tr'] = normalize(cosine_similarities)
    df['normalized_sp'] = normalize(df['social_popularity'])
    df['normalized_r'] = normalize(df['recency'])
    
    # Combine scores according to weight
    df['our_score'] = (df['normalized_tr'] * text_weight) + (df['normalized_sp'] * social_weight) + (df['normalized_r'] * recency_weight)
    # Select top_n
    top_tweets = df.nlargest(top_n, 'our_score').reset_index(drop=True)

    if print_results:
        with pd.option_context('display.max_colwidth', None):  # Set max_colwidth to None to display all content
            temp = top_tweets.copy()
            temp['content'] = temp['content'].str.slice(0, 25)
            print(f"Query '{query}' Top {top_n} Tweets text_w {text_weight} social_w {social_weight} recency_w {recency_weight}:")
            print(temp[['id', 'content', 'likes', 'retweets', 'our_score']])
    
    return top_tweets[['id', 'content', 'likes', 'retweets', 'our_score']]
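
`calculate_recency` measures age against `datetime.now`, so the raw recency values change on every run. The sketch below applies the same `1 / (age + 1)` formula in plain Python against a fixed reference time. The function name `recency_score` and the reference date are assumptions made for illustration; the two timestamps are taken from the sample rows shown earlier.

```python
from datetime import datetime, timezone

def recency_score(tweet_time, reference_time):
    """Inverse-age score: 1 / (age in seconds + 1)."""
    age_seconds = (reference_time - tweet_time).total_seconds()
    return 1 / (age_seconds + 1)

# Hypothetical fixed reference time (an assumption; the notebook uses the current time).
reference = datetime(2021, 3, 1, tzinfo=timezone.utc)

newer = datetime.fromisoformat("2021-02-24T09:23:16+00:00")
older = datetime.fromisoformat("2021-02-24T09:19:38+00:00")

newer_score = recency_score(newer, reference)
older_score = recency_score(older, reference)
print(newer_score > older_score)  # the more recent tweet scores higher
```

The raw values are very small for tweets that are days old, so the min-max normalization step is what brings them onto a scale comparable with the other two components.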


In [12]:
for q in queries:
    our_score_algorithm(q)
    print("")

Query 'india protest' Top 20 Tweets text_w 0.55 social_w 0.3 recency_w 0.15:
                     id                    content  likes  retweets  our_score
0   1361286582234279936  farmer protest india farm      1         0   0.591599
1   1362767045486469123     protest farmersprotest      1         0   0.543710
2   1364238946163511302       india farmersprotest      1         0   0.529926
3   1363884538678673409  farmer protest farmerspro      0         0   0.512148
4   1361317533173825544  farmersprotest protest pr      0         0   0.509296
5   1362247868468322304       india farmersprotest      0         1   0.462799
6   1362680892364267528  like someon said farmersp      4         0   0.449851
7   1362680726936686592  like someon said farmersp      0         0   0.449817
8   1364148079117615105  farmersprotest happen ger  27888      6164   0.437873
9   1361488235541581827       india farmersprotest      0         0   0.437311
10  1364451234204291073  farmer protest pawri ho r    
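
The way the three components interact can be traced on a toy example. The following sketch reimplements min-max normalization and the weighted sum in plain Python; the helper names (`min_max`, `combined_score`) and all input values are made up for illustration and are not taken from the dataset.

```python
import math

def min_max(values):
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def combined_score(text, social, recency, w_text=0.55, w_social=0.3, w_recency=0.15):
    """Weighted sum of the three normalized components."""
    if not math.isclose(w_text + w_social + w_recency, 1.0):
        raise ValueError("weights must sum to 1")
    t, s, r = min_max(text), min_max(social), min_max(recency)
    return [w_text * a + w_social * b + w_recency * c for a, b, c in zip(t, s, r)]

# Three hypothetical tweets: a close text match, a partial match, and a viral
# tweet with the lowest text relevance. Recency is held equal for simplicity.
text = [0.9, 0.6, 0.1]
social = [1.0, 0.0, 6000.0]
recency = [1.0, 1.0, 1.0]

scores = combined_score(text, social, recency)
print([round(s, 3) for s in scores])
```

With these inputs the viral tweet still collects the full social weight despite having the weakest text match, which may help explain why a tweet with 27888 likes and 6164 retweets appears in the ranking above alongside near-exact matches.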

### BM25

In [13]:
# Tokenize each tweet content
tokenized_tweets = [tweet.split(" ") for tweet in tweets_df['content']]
query = build_terms("support farmers")  # same example query as in the TF-IDF section
query_tokens = query.split(" ")

# Initialize BM25 and calculate scores
bm25 = BM25Okapi(tokenized_tweets)
bm25_scores = bm25.get_scores(query_tokens)

# Rank tweets based on BM25 scores
top_20_bm25 = np.argsort(-bm25_scores)[:20]
top_20_bm25_tweets = tweets_df.iloc[top_20_bm25].copy()  # copy to avoid SettingWithCopyWarning

# Store the scores in the DataFrame
top_20_bm25_tweets['bm25_score'] = bm25_scores[top_20_bm25]

print("Top 20 BM25 Ranked Tweets:")
print(top_20_bm25_tweets[['content', 'likes', 'retweets', 'bm25_score']])

Top 20 BM25 Ranked Tweets:
                                                  content  likes  retweets  \
102967  support farmer support farmer support twitter ...      1         1   
91051               support farmer support farmersprotest      0         0   
75748         support farmer support human farmersprotest     16        11   
114866  support farmer mahapanchayatrevolut farmerspro...      0         0   
114963  support farmer mahapanchayatrevolut farmerspro...      2         1   
114826  support farmer mahapanchayatrevolut farmerspro...      0         0   
72310   support farmer support indian youth support fa...      0         0   
63713   support farmer support indian youth support fa...      0         0   
51153   support farmer support nation farmersprotest r...      0         1   
25796   support farmer support nation farmersprotest m...      0         0   
72897   support farmer support nation remembersirchhot...      0         2   
36505   thank support support farmer 


In [14]:
# Generalized function
def rank_tweets_by_bm25(query, df=tweets_df, top_n=20, print_results=True):
    tokenized_tweets = [tweet.split() for tweet in df['content']]
    query = build_terms(query)
    query_tokens = query.split()

    bm25 = BM25Okapi(tokenized_tweets)
    bm25_scores = bm25.get_scores(query_tokens)
    top_indices = np.argsort(-bm25_scores)[:top_n]
    top_tweets = df.iloc[top_indices].copy()  # copy to avoid SettingWithCopyWarning

    # Store the scores in the DataFrame
    top_tweets['bm25_score'] = bm25_scores[top_indices]

    if print_results:
        temp = top_tweets.copy()
        temp['content'] = temp['content'].str.slice(0, 35)
        print(f"Query '{query}' Top {top_n} BM25 Ranked Tweets:")
        print(temp[['id', 'content', 'bm25_score']])

    return top_tweets[['id', 'content', 'likes', 'retweets', 'bm25_score']]

In [15]:
for q in queries:
    rank_tweets_by_bm25(q)
    print("")


Query 'india protest' Top 20 BM25 Ranked Tweets:
                         id                              content  bm25_score
91003   1360831863150026754  protest solut problem india forget     6.267487
39335   1362680892364267528  like someon said farmersprotest las    6.249838
39341   1362680726936686592  like someon said farmersprotest las    6.249838
77988   1361286582234279936  farmer protest india farmersprotest    5.950958
65696   1361738661776158721  look current protest situat india w    5.834283
94702   1360688096136712203  india evil govern call protest ando    5.752217
69282   1361627358453788674  farmersprotest istandwithfarm prote    5.660341
31872   1363012303990398978  next protest india opposit farmersp    5.660341
1147    1364451234204291073  farmer protest pawri ho rahi hai de    5.645204
66043   1361726659062300673  farmer protest turn point india far    5.505703
106348  1360304991924178946  recent interview farmer protest ind    5.505703
106254  1360309231451181062


Query 'support farmer' Top 20 BM25 Ranked Tweets:
                         id                              content  bm25_score
102967  1360457074316963840  support farmer support farmer suppo    5.194449
91051   1360830375954587651  support farmer support farmersprote    5.030880
75748   1361368311166951425  support farmer support human farmer    4.850314
114866  1360079709518745600  support farmer mahapanchayatrevolut    4.850314
114963  1360077712308621314  support farmer mahapanchayatrevolut    4.850314
114826  1360080410835755009  support farmer mahapanchayatrevolut    4.850314
72310   1361533569764495362  support farmer support indian youth    4.850297
63713   1361841188400107521  support farmer support indian youth    4.745638
51153   1362244642603819009  support farmer support nation farme    4.682770
25796   1363319560401657859  support farmer support nation farme    4.682770
72897   1361516618543550467  support farmer support nation remem    4.682770
36505   136282354696417280


Query 'bjp parti' Top 20 BM25 Ranked Tweets:
                         id                              content  bm25_score
93384   1360760525639008259  bjp parti evil parti gobackmodi far   12.021106
83793   1361076957652344840  bjp terrorist parti releasedisharav   10.586374
34774   1362924039761768449    bjp blow job parti farmersprotest   10.586374
103534  1360436874448314369  bjp religi hate creat parti repealo    9.644399
64493   1361790805770387458  dare whole bjp parti come see farme    9.644399
46552   1362369880167710724  bjp stand bhagat joker parti railro    9.644399
89825   1360880614862299139  shame act bjp parti leader kisanmaj    9.233596
69247   1361628313916035074  bjp rss parti urban naxal disharavi    9.233596
74055   1361480721437577216  bjp nation parti india serv peopl i    8.856360
92601   1360781286705561603  expect kind stupid thing bjp parti     8.508738
58403   1361954181892874242  hello sir farmer demand chang lawsl    8.221686
80581   1361194400173363203  re
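
For intuition about why tweets that repeat the query terms rank highly here, the following is a minimal pure-Python sketch of BM25 scoring. It uses the common `k1`/`b` parameterization with a non-negative IDF variant, `ln(1 + (N - df + 0.5) / (df + 0.5))`; this is an assumption, and `BM25Okapi` from `rank_bm25` may handle IDF differently, so absolute scores are not expected to match the outputs above. The function name `bm25_scores` and the toy documents are illustrative only.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.5, b=0.75):
    """Score every tokenized document against the query with textbook BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))

    scores = []
    for d in docs:
        term_freq = Counter(d)
        # Length normalization: longer-than-average documents are penalized.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores

# Toy corpus (made up, in the style of the preprocessed tweets)
docs = [
    "support farmer farmersprotest".split(),
    "support farmer support farmer support twitter".split(),
    "india farmersprotest".split(),
]
scores = bm25_scores("support farmer".split(), docs)
print([round(s, 3) for s in scores])
```

In this toy corpus the document that repeats "support" and "farmer" scores higher than the short exact match, because term-frequency saturation outweighs the length penalty. That is in line with the repeated-term tweets at the top of the BM25 results, whereas TF-IDF with cosine similarity favored the short exact matches.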


### Top 20 Documents using Word2Vec + Cosine Similarity

In [16]:
# Get tweet vector by averaging word vectors
def tweet2vec(tweet, model):
    words = tweet.split()
    word_vectors = [model.wv[word] for word in words if word in model.wv]
    return np.mean(word_vectors, axis=0) if word_vectors else np.zeros(model.vector_size)
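
The averaging step in `tweet2vec` has two properties worth keeping in mind: text with no known words collapses to a zero vector, and different word combinations can average to the same direction. The sketch below illustrates both in plain Python with hand-made 2-dimensional embeddings; the vectors and the helper names (`average_vector`, `cosine`) are hypothetical and unrelated to the trained model.

```python
import math

# Hypothetical 2-dimensional embeddings, chosen by hand for illustration only.
toy_vectors = {
    "support": [1.0, 0.0],
    "farmer": [0.0, 1.0],
    "india": [1.0, 1.0],
}

def average_vector(text, vectors, size=2):
    """Mean of the known word vectors; zero vector if no word is known."""
    found = [vectors[w] for w in text.split() if w in vectors]
    if not found:
        return [0.0] * size
    return [sum(component) / len(found) for component in zip(*found)]

def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

phrase = average_vector("support farmer", toy_vectors)
single = average_vector("india", toy_vectors)
unknown = average_vector("xyz", toy_vectors)

sim_phrase_single = cosine(phrase, single)
sim_unknown = cosine(unknown, single)
print(round(sim_phrase_single, 3), sim_unknown)
```

Here "support farmer" averages to the same direction as "india", so the two are indistinguishable under cosine similarity even though they share no words. This is one reason averaged embeddings tend to produce high scores for many loosely related tweets.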

In [17]:
# Train Word2Vec model
model = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5, min_count=1, workers=4)

# Calculate tweet vectors for each tweet and the query
tweet_vectors = np.array([tweet2vec(tweet, model) for tweet in tweets_df['content']])
query = build_terms("support farmers")  # same example query as in the TF-IDF section
query_vector = tweet2vec(query, model)

# Calculate cosine similarity between query vector and tweet vectors
cosine_similarities_word2vec = cosine_similarity([query_vector], tweet_vectors).flatten()

# Rank tweets based on word2vec similarity scores
top_20_word2vec = np.argsort(-cosine_similarities_word2vec)[:20]
scores_word2vec = cosine_similarities_word2vec[top_20_word2vec]
top_20_word2vec_tweets = tweets_df.iloc[top_20_word2vec].copy()  # copy to avoid SettingWithCopyWarning
top_20_word2vec_tweets['score'] = scores_word2vec

print("Top 20 Word2Vec Ranked Tweets:")
print(top_20_word2vec_tweets[['id', 'content', 'likes', 'retweets', 'score']])

Top 20 Word2Vec Ranked Tweets:
                         id  \
33796   1362957944325693440   
91051   1360830375954587651   
102967  1360457074316963840   
31531   1363023215560323078   
52551   1362206113786126342   
35880   1362855385103794183   
105181  1360370833542295553   
60530   1361900612640641028   
97763   1360607434851504132   
26217   1363303857363771397   
94914   1360680293259071489   
53225   1362153445881487360   
85154   1361006115912228871   
44830   1362439420285050880   
39679   1362668035379662848   
68595   1361648129272139776   
22325   1363477352043307008   
47385   1362339995617353728   
55419   1362061830047211524   
84428   1361033876878069762   

                                                  content  likes  retweets  \
33796   farmersprotest support farmer proud daughter f...      3         0   
91051               support farmer support farmersprotest      0         0   
102967  support farmer support farmer support twitter ...      1         1   
31531


In [18]:
# Generalized function
def rank_tweets_by_word2vec(query, df=tweets_df, top_n=20, print_results=True, visualization=True):
    tokenized_tweets = [tweet.split() for tweet in df['content']]
    model = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5, min_count=1, workers=4)
    # Calculate tweet vectors for each tweet
    tweet_vectors = np.array([tweet2vec(tweet, model) for tweet in df['content']])
    # Calculate vector for the query
    query = build_terms(query)
    query_vector = tweet2vec(query, model)
    cosine_similarities_word2vec = cosine_similarity([query_vector], tweet_vectors).flatten()
    # Rank tweets based on word2vec similarity scores
    top_indices = np.argsort(-cosine_similarities_word2vec)[:top_n]
    top_tweets = df.iloc[top_indices].copy()  # copy to avoid SettingWithCopyWarning
    top_tweets['score'] = cosine_similarities_word2vec[top_indices]

    if print_results:
        temp = top_tweets.copy()
        temp['content'] = temp['content'].str.slice(0, 35)
        print(f"Query '{query}' Top {top_n} Word2Vec Ranked Tweets:")
        print(temp[['id', 'content', 'score']])

    if visualization:
        # Combine tweet and query vectors for visualization
        top_tweet_vectors = tweet_vectors[top_indices]
        all_vectors = np.vstack([top_tweet_vectors, query_vector])
        
        # Perform PCA reduction to 2D
        pca = PCA(n_components=2)
        reduced_vectors = pca.fit_transform(all_vectors)
        
        # Prepare data for Altair
        plot_data = pd.DataFrame(reduced_vectors, columns=['PCA1', 'PCA2'])
        plot_data['label'] = ['Tweet'] * len(top_tweets) + ['Query']  # also works when fewer than top_n tweets exist
        plot_data['content'] = list(top_tweets['content']) + [query]
        plot_data['score'] = list(top_tweets['score']) + [None]
        plot_data['id'] = list(top_tweets['id']) + ['Query']

        # Altair scatter plot
        chart = alt.Chart(plot_data).mark_circle(size=100).encode(
            x=alt.X('PCA1', title='PCA Dimension 1'),
            y=alt.Y('PCA2', title='PCA Dimension 2'),
            color=alt.Color('label', scale=alt.Scale(domain=['Tweet', 'Query'], range=['skyblue', 'green'])),
            tooltip=['id', 'score', 'content']
        ).properties(
            title=f"Top {top_n} Word2Vec Ranked Tweets for Query '{query}' in 2D Space (PCA)",
            width=400,
            height=300
        )

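        # Display the plot
        chart.display()
        # Saving as PNG needs an image export backend (e.g. the vl-convert-python package)
        # and an existing ./visualizations directory
        chart.save(f"./visualizations/{query.replace(' ', '')}.png")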

    return top_tweets[['id', 'content', 'likes', 'retweets', 'score']]

**The visualizations below are best viewed in a Jupyter notebook, where the interactive tooltips are available; static PNG exports are also provided.**

In [19]:
q1 = rank_tweets_by_word2vec(queries[0])

Query 'india protest' Top 20 Word2Vec Ranked Tweets:
                         id                              content     score
94702   1360688096136712203  india evil govern call protest ando  0.945495
31872   1363012303990398978  next protest india opposit farmersp  0.936452
25181   1363339024421834752  farmerssuicid farmersprotest third   0.935633
77988   1361286582234279936  farmer protest india farmersprotest  0.931885
44094   1362468403647287298  garba gujarati way protest protest   0.931036
99317   1360565655779610626  ok protest jivi ditch farmersprotes  0.928580
73470   1361499782527799297  largest crowd ever seen protest exc  0.927710
96558   1360638112020402181  pictori protest india excess agit f  0.925784
91003   1360831863150026754  protest solut problem india forget   0.924961
34864   1362919925040353280  russel brand talk farmer protest fa  0.920223
107319  1360276799591075842  part ii million farmer protest agri  0.919349
47239   1362344989641084930  current farmerspro


![Altair Chart Query 1](./visualizations/indiaprotest.png)

In [20]:
q2 = rank_tweets_by_word2vec(queries[1])

Query 'support farmer' Top 20 Word2Vec Ranked Tweets:
                         id                              content     score
33796   1362957944325693440  farmersprotest support farmer proud  0.963609
91051   1360830375954587651  support farmer support farmersprote  0.962361
102967  1360457074316963840  support farmer support farmer suppo  0.959135
31531   1363023215560323078  support farmer dpstopintimidatingfa  0.957128
35880   1362855385103794183  support farmer farmersprotest tcote  0.954453
33884   1362956100476096514  support farmer farmersprotest tcoks  0.954381
38416   1362719232874708994        farmersprotest support farmer  0.954328
74014   1361482097270001665        farmersprotest support farmer  0.954328
59659   1361918432631943171        farmersprotest support farmer  0.954328
92638   1360780298305212419        farmersprotest support farmer  0.954328
20285   1363541021003550728        farmersprotest support farmer  0.954328
104241  1360412370833608705        farmersprot


![Altair Chart Query 2](./visualizations/supportfarmer.png)

In [21]:
q3 = rank_tweets_by_word2vec(queries[2])

Query 'modi shame' Top 20 Word2Vec Ranked Tweets:
                         id                              content     score
18171   1363673764043526145  shame modi govt shame shame modiign  0.965618
110409  1360194026339618817  shame modi shame bjp farmersprotest  0.958155
110416  1360193876313600004  shame modi shame bjp farmersprotest  0.958155
110427  1360193698030493701  shame modi shame bjp farmersprotest  0.958155
110405  1360194188986314764  shame modi shame bjp farmersprotest  0.957966
110326  1360196876503404544  shame modi shame bjp farmersprotest  0.957878
15851   1363749240736997383  shame modi shame bjp destroy indian  0.956434
86968   1360965271339470854            farmersprotest modi shame  0.953544
91608   1360811393910411274            shame modi farmersprotest  0.953544
98112   1360599574323290113            shame modi farmersprotest  0.953544
90197   1360869243877691394            shame modi farmersprotest  0.953544
41266   1362614081828642816       bjp shame shame 


![Altair Chart Query 3](./visualizations/modishame.png)

In [22]:
q4 = rank_tweets_by_word2vec(queries[3])

Query 'bjp parti' Top 20 Word2Vec Ranked Tweets:
                        id                              content     score
93384  1360760525639008259  bjp parti evil parti gobackmodi far  0.924734
48086  1362321744237899779  vote bjp cos dint want congress hel  0.914232
34774  1362924039761768449    bjp blow job parti farmersprotest  0.911770
58310  1361957143251996672  congress domin bjp sad fail panjabm  0.909642
55076  1362072033018830850  shame bjp bjp lost punjab reveng po  0.903352
34733  1362925696742432771             farmersprot bjp murdabad  0.903205
12884  1363910618286338048     minist bjp stupid farmersprotest  0.897229
13353  1363880027222249477  shame bjp bycott modi bjp modiignor  0.896886
87906  1360939202511179781  bjp prime minist indian address bjp  0.896107
75490  1361381394237313034  farmersprotest big issu earlier con  0.892237
777    1364464752634724353  bjp lol farmer bend knee modi farme  0.889942
37475  1362779783239786504  punjab public vote congress polit p


![Altair Chart Query 4](./visualizations/bjpparti.png)

In [23]:
q5 = rank_tweets_by_word2vec(queries[4])

Query 'human right violat' Top 20 Word2Vec Ranked Tweets:
                         id                              content     score
69954   1361596881328091143  human right protect human right rem  0.976614
73702   1361494660150140934  blood hand investig human right vio  0.969991
69360   1361624625885626371  human right protect human remembers  0.963840
69368   1361624283605311495  human right protect human remembers  0.963840
56502   1362023523858931712  human right violat detain press att  0.963343
34052   1362951062546694144  water basic human right speak human  0.960271
92622   1360780668754530308  protest peac human right subsum fun  0.950427
49614   1362282294996701186  may god merci human humanright farm  0.950028
63192   1361854741421039621     human right india farmersprotest  0.948469
29593   1363087347001286660  get comfort violat human right lite  0.945150
47566   1362334130701180928  internet shutdown violat human righ  0.941906
64908   1361768535769436171  human right a


![Altair Chart Query 5](./visualizations/humanrightviolat.png)