<div style="text-align: center">

<h2> Assignment 3 - Retrieval System </h2>
<h2> Modern Information Retrieval Course </h2>
<h3> Dr. Asgari </h3>
<h3> Group Members </h3>
Parsa Mohammadian - 98102284
<br/>
Sara Azarnoush - 98170668
<br/>
Kahbod Aeini - 98101209 
<br/>
<br/>
Sharif University of Technology
<br/>
Computer Engineering Department
<hr/>
</div>

### Introduction

In this project, we implement several retrieval systems over a dataset of tweets. We use the following models:

- Boolean Retrieval
- TF-IDF Retrieval
- Transformer Based Retrieval
- FastText Retrieval

After that, we evaluate the performance of each retrieval system on a set of queries using the MMR metric.
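
The metric is not defined further in this section. Assuming it refers to mean reciprocal rank (commonly abbreviated MRR), a minimal sketch could look like the following. The helper name `mean_reciprocal_rank` and the relevance judgments are hypothetical and are not part of the notebook.

```python
from typing import List


def mean_reciprocal_rank(relevance_per_query: List[List[bool]]) -> float:
    """
    Hypothetical helper: each inner list marks, in rank order, whether the
    retrieved document at that position was judged relevant to the query.
    """
    if not relevance_per_query:
        return 0.0
    total = 0.0
    for relevance in relevance_per_query:
        # Reciprocal rank of the first relevant result; 0 if none is relevant.
        for rank, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                total += 1.0 / rank
                break
    return total / len(relevance_per_query)


# Made-up judgments for three queries (not taken from the notebook's results).
judgments = [
    [True, False, False],   # first result relevant -> contributes 1
    [False, False, True],   # third result relevant -> contributes 1/3
    [False, False, False],  # nothing relevant      -> contributes 0
]
print(mean_reciprocal_rank(judgments))
```

Such a score needs manual relevance judgments for the top results of each query, since the dataset itself carries no relevance labels.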

---

### Requirements

---

In [45]:
try:
    from google.colab import drive
    COLAB = True
except ImportError:
    # The import fails outside Google Colab, so fall back to local paths.
    COLAB = False
    print('Not in Google Colab')

if COLAB:
    drive.mount('/content/drive')

Not in Google Colab


In [46]:
from IPython.display import clear_output


In [47]:
%pip install pandas
%pip install nltk
%pip install scikit-learn
%pip install numpy
%pip install fasttext
%pip install tensorboardX
%pip install torch
%pip install simpletransformers
%pip install faiss
%pip install faiss-cpu --no-cache
%pip install gensim

clear_output()


In [48]:
import pandas as pd
import nltk
import string
import functools
import sklearn as sk
from abc import ABC, abstractmethod
import numpy as np
import fasttext
from simpletransformers.retrieval import RetrievalModel
from gensim.models.fasttext import FastText


In [49]:
nltk.download('punkt')
nltk.download('stopwords')

clear_output()


### Load Data

Although we implemented a Twitter crawler [here](../datasets/twitter-crawler.ipynb) that takes a username as input and outputs all of that user's tweets, we use the [sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140) dataset instead because of the crawler's limitations. We load the dataset into a pandas DataFrame, drop the redundant columns, and keep only the tweet text. Finally, since the full dataset is too large to process on an ordinary computer, we use only a fraction of it in the rest of the notebook.

---


In [50]:
if COLAB:
    PATH_TO_SENTIMENT140_DATASET = 'drive/MyDrive/training.1600000.processed.noemoticon.csv'
else:
    PATH_TO_SENTIMENT140_DATASET = '../datasets/training.1600000.processed.noemoticon.csv'
CSV_COLUMNS = ['target', 'id', 'date', 'flag', 'user', 'text']
TEST_SIZE = 0.2


df = pd.read_csv(PATH_TO_SENTIMENT140_DATASET)
df.columns = CSV_COLUMNS
df.drop(columns=['target', 'id', 'date', 'flag', 'user'], inplace=True)
df.head()


| | text |
|---|---|
| 0 | is upset that he can't update his Facebook by ... |
| 1 | @Kenichan I dived many times for the ball. Man... |
| 2 | my whole body feels itchy and like its on fire |
| 3 | @nationwideclass no, it's not behaving at all.... |
| 4 | @Kwesidei not the whole crew |


#### Sample Data

We sample 10,000 tweets from the dataset to keep resource and time usage manageable.


In [51]:
df = df.sample(n=10000)


### Preprocess Data

Here we use the nltk package and a few helper functions to tokenize, normalize, and stem the tweets.

---

#### Tokenize Text

In [52]:
df['text_tokenized'] = df['text'].apply(lambda x: nltk.word_tokenize(x))


#### Normalize Text

In [53]:
# Build the stop-word set once instead of on every call.
STOP_WORDS = set(nltk.corpus.stopwords.words('english'))


def to_lower(tokens: list) -> list:
    """
    Converts the tokens to lower case.
    """
    return [token.lower() for token in tokens]


def contains_any_of(token: str, chars: str) -> bool:
    """
    Returns true if the token contains any of the given characters.
    """
    return any(char in token for char in chars)


def remove_punctuation(tokens: list) -> list:
    """
    Removes every token that contains a punctuation character.
    """
    return [token for token in tokens if not contains_any_of(token, string.punctuation+"’‘•")]


def remove_stop_words(tokens: list) -> list:
    """
    Removes stop words from the given tokens.
    """
    return [token for token in tokens if token not in STOP_WORDS]


def normalize(tokens: list) -> list:
    """
    Normalizes the tokens of a tweet.
    """
    # Apply the normalization steps in order, feeding each output to the next step.
    normalization_functions = [to_lower, remove_punctuation, remove_stop_words]
    return functools.reduce(lambda x, f: f(x), normalization_functions, tokens)


df['text_normalized'] = df['text_tokenized'].apply(normalize)


#### Stem Text

In [54]:
stemmer = nltk.stem.SnowballStemmer('english')
df['text_stemmed'] = df['text_normalized'].apply(lambda x: [stemmer.stem(t) for t in x])


#### Join Text

In [55]:
df['text_preprocessed'] = df['text_stemmed'].apply(lambda x: ' '.join(x))


### Retrieval Base Class

This abstract class is the parent of all four retrieval systems and defines their common methods. Sharing one interface lets a single test function exercise every system.

The `Query` class is a thin wrapper around the query string. It does little, but it makes the code more readable.

The cell below also defines the test queries and the test function mentioned above.

---

In [56]:
class Query:
    """
    A class that represents a query.
    """
    def __init__(self, text: str):
        self.text = text

    def __str__(self):
        return self.text

    def __repr__(self):
        return self.text

class RetrievalSystemBase(ABC):
    @abstractmethod
    def train(self, df: pd.DataFrame):
        pass

    @abstractmethod
    def retrieve(self, query: Query) -> list:
        pass

queries = [
    "twitter",
    "hair cut",
    "weekend",
    "friend",
    "nice day",
    "school class",
    "send direct",
    "spring",
    "enjoy life",
    "sleep"
]

def test_model(retrieval_system_class: type):
    retrieval_system = retrieval_system_class()
    retrieval_system.train(df)
    print(f'Retrieval system {retrieval_system_class.__name__}:')
    for idx, query in enumerate(queries):
        delim = "\n\t"
        print(f'{delim}Query {idx+1}/{len(queries)}: {query}')
        results = retrieval_system.retrieve(Query(query))
        delim = "\n\t\t"
        print(f'{delim}{delim.join([f"{i+1}. {d}" for i, d in enumerate(results)])}')

### Boolean Retrieval

The first retrieval system is the Boolean model. Because it is simple, we implemented it ourselves without any specialized packages.
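
For illustration, the same Boolean matching can also be expressed with an inverted index, which maps each term to the set of documents that contain it. This is a minimal sketch rather than the notebook's implementation, which uses a dense document-word matrix. The function names `build_inverted_index`, `boolean_and`, and `boolean_or` are hypothetical, and the three toy documents are made up.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: List[str]) -> Dict[str, Set[int]]:
    """Map each whitespace-separated term to the ids of documents containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, doc in enumerate(docs):
        for term in doc.split():
            index[term].add(doc_id)
    return index


def boolean_and(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of documents that contain every query term."""
    # Unknown terms get an empty posting set, so they match nothing.
    postings = [index.get(term, set()) for term in query.split()]
    if not postings:
        return set()
    return set.intersection(*postings)


def boolean_or(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of documents that contain at least one query term."""
    postings = [index.get(term, set()) for term in query.split()]
    return set().union(*postings)


# Toy, already-preprocessed documents (invented for this example).
toy_docs = ["hair cut today", "cut my finger", "nice hair"]
toy_index = build_inverted_index(toy_docs)
print(boolean_and(toy_index, "hair cut"))
print(boolean_or(toy_index, "hair cut"))
```

An inverted index stores only the terms that actually occur in each document, so its memory use grows with the number of term occurrences rather than with documents times vocabulary size.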

---

In [59]:
class BooleanRetrieval(RetrievalSystemBase):
    def __init__(self, k=10):
        self.k = k
        self.word_to_idx = None
        self.idx_to_document = None
        self.document_word_matrix = None

    def train(self, df: pd.DataFrame):
        all_words_list = df['text_preprocessed'].apply(
            lambda x: x.split()).tolist()
        all_words_list_flattened = [x for y in all_words_list for x in y]
        all_words = set(all_words_list_flattened)
        self.word_to_idx = {word: idx for idx, word in enumerate(all_words)}
        self.idx_to_document = {}
        self.document_word_matrix = np.zeros((len(df), len(all_words)))

        for doc_idx, (doc, text) in enumerate(zip(df['text_preprocessed'], df['text'])):
            self.idx_to_document[doc_idx] = text
            for word in doc.split():
                self.document_word_matrix[doc_idx, self.word_to_idx[word]] = 1

    def retrieve(self, query: Query) -> list:
        documents = [set(self.__retrieve_word(x)) for x in query.text.split()]
        if not documents:
            return []
        # Documents that match every query term come first,
        # followed by documents that match only some of the terms.
        intersection = functools.reduce(lambda x, y: x.intersection(y), documents)
        union = functools.reduce(lambda x, y: x.union(y), documents)
        single = union - intersection
        length = min(self.k, len(intersection))
        return list(intersection)[:length] + list(single)[:self.k-length]

    def __retrieve_word(self, word: str) -> list:
        # Words outside the vocabulary match no document.
        idx = self.word_to_idx.get(word)
        if idx is None:
            return []
        return [self.idx_to_document[i] for i in np.where(self.document_word_matrix[:, idx] == 1)[0]]


In [60]:
test_model(BooleanRetrieval)

Retrieval system BooleanRetrieval:

	Query 1/10: twitter

		1. Step away from your computer.. Twitter is not going anywhere!! 
		2. @BoomKack You are dancing up a storm with twittering feet.... 
		3. twitter is very quiet today. it's not funny 
		4. New fun unharmful juste.ru virus is spreading round twitter 
		5. Has Been Neglecting Twitter. I'm Sorry, Twitter. 
		6. @morrgand Thanks, but I already downloaded it  Rochelle's a creeper so she probably made you twitter stalk! Just kidding.
		7. @evankincade tell pastor Jules to get on the twitter train 
		8. @bellascottxx can you imagine if we didn't have iPhones we would have to go for like 7 hours without twitter :/ but it's okay 
		9. I'm exited to get twitter   
		10. someone help, why am i getting all these disgusting spam followers on twitter. i cant get rid of them either! 

	Query 2/10: hair cut

		1. @KimKardashian You have gorgeous hair. Don't get it cut, not yet. Wait until you get older like me. 
		2. Aww my puppy is getting 

### TF-IDF Retrieval

TF-IDF (term frequency-inverse document frequency) weights each term by how often it occurs in a document and how rare it is across the collection. Since this model is more involved than Boolean retrieval, we used the [scikit-learn](https://scikit-learn.org/) package to implement it.
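
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF scoring with cosine similarity. It uses the textbook formula idf = log(N / df), which is an assumption made for illustration: scikit-learn's `TfidfVectorizer` applies a smoothed variant by default, so its numbers will differ. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def compute_idf(docs: List[List[str]]) -> Dict[str, float]:
    """Textbook inverse document frequency: log(N / df) for each term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in doc_freq.items()}


def tfidf_vector(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector; terms unseen during training are ignored."""
    term_freq = Counter(tokens)
    return {term: term_freq[term] * idf[term] for term in term_freq if term in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy, already-tokenized documents (invented for this example).
toy_docs = [["hair", "cut"], ["cut", "finger"], ["nice", "day"]]
idf = compute_idf(toy_docs)
doc_vectors = [tfidf_vector(doc, idf) for doc in toy_docs]

query_vector = tfidf_vector(["hair", "cut"], idf)
scores = [cosine(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(len(toy_docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

In this toy collection "hair" appears in one document and "cut" in two, so "hair" receives the larger weight and the document containing both terms ranks first.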

---

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer


class TfIdfRetrieval(RetrievalSystemBase):
    def __init__(self, k=10):
        self.tfidf_vectorizer = TfidfVectorizer()
        self.tfidf_matrix = None
        self.documents = None
        self.k = k

    def train(self, df: pd.DataFrame):
        self.tfidf_matrix = self.tfidf_vectorizer.fit_transform(df['text_preprocessed'])
        # Keep the original tweets so retrieve() does not depend on the global df.
        self.documents = df['text'].tolist()

    def retrieve(self, query: Query) -> list:
        query_vector = self.tfidf_vectorizer.transform([query.text])
        # Rows are L2-normalized by default, so this product is the cosine similarity.
        similarities = (query_vector @ self.tfidf_matrix.T).toarray().flatten()
        top_indices = similarities.argsort()[-self.k:][::-1]
        return [self.documents[i] for i in top_indices]


In [44]:
test_model(TfIdfRetrieval)

Retrieval system TfIdfRetrieval:

	Query 1/10: twitter

		1. what am I going to do when twitter is down! 
		2. new to twitter 
		3. trying out twitter 
		4. Off to bed so I can be up at 3:15 AM. G'night Twitters 
		5. Guess I'm on Twitter now! 
		6. Good morning Twitter 
		7. @ainhoa_88 welcome to twitter 
		8. @Stoots_Askew welcome to twitter!!! 
		9. starting to like twittering 
		10. not a fan of twitter 

	Query 2/10: hair cut

		1. might have to work not sure, .......but i hope not, i want to get my hair cut!!!! 
		2. waiting for my hair to dye.. 
		3. I cut my finger 
		4. @jakesonaplane i vote no on the cutting .... sorry dude, i dig the hair 
		5. Dying my hair 
		6. Im actually starting to like my hair again  I just need to cut my bangs shorter And straight 
		7. My daughter is doing my hair for me 
		8. Yay, just got my hair cut, soon going to put on my national costume, called bunad... I'm excited 
		9. @veronica78   i gotta cut hershels hair then i'll be back  always waits 

### Transformer Based Retrieval

For this model, we used the [simpletransformers](https://simpletransformers.readthedocs.io/en/latest/index.html) package with the pretrained DPR encoders `facebook/dpr-ctx_encoder-single-nq-base` and `facebook/dpr-question_encoder-single-nq-base`.

---

In [19]:
class TransformerRetrieval(RetrievalSystemBase):
    def __init__(self, k=10):
        self.k = k

    def train(self, df: pd.DataFrame):
        model_type = "dpr"
        context_encoder_name = "facebook/dpr-ctx_encoder-single-nq-base"
        question_encoder_name = "facebook/dpr-question_encoder-single-nq-base"

        args = {
            "include_title": False,
        }

        self.model = RetrievalModel(
            model_type=model_type,
            context_encoder_name=context_encoder_name,
            query_encoder_name=question_encoder_name,
            args=args,
            use_cuda=COLAB,
        )

        self.train_df = df[['text', 'text_preprocessed']].copy(deep=True)
        self.train_df.rename(columns={'text': 'query_text', 'text_preprocessed': 'gold_passage'}, inplace=True)
        self.model.train_model(self.train_df)

    def retrieve(self, query: Query) -> list:
        to_predict = [query.text]
        prediction_passages = self.train_df.copy(deep=True)
        prediction_passages['title'] = ['']*len(prediction_passages)
        prediction_passages.rename(columns={'query_text': 'passages'}, inplace=True)
        predicted_passages, _, _, _ = self.model.predict(to_predict, prediction_passages=prediction_passages, retrieve_n_docs=self.k)
        return predicted_passages[0]
        

In [20]:
test_model(TransformerRetrieval)

Some weights of the model checkpoint at facebook/dpr-ctx_encoder-single-nq-base were not used when initializing DPRContextEncoder: ['ctx_encoder.bert_model.pooler.dense.bias', 'ctx_encoder.bert_model.pooler.dense.weight']
- This IS expected if you are initializing DPRContextEncoder from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DPRContextEncoder from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'DPRQuestionEncoderTokenizer'. 
The class this function is called from is 'DPRContextEncoderTokeniz



Retrieval system TransformerRetrieval:

	Query 1/10: twitter


		1. Starting with twitter 
		2. I made a Twitter 
		3. is new to twitter 
		4. just starting a TWITTER 
		5. i say the same thing everytime i update twitter 
		6. I want to use Twitter from my Phoneeeeee. 
		7. bom dia twitters 
		8. bom dia twitters 
		9. thnx to all the twitter followes  the old and new ones ;D
		10. Joined twitter at last 

	Query 2/10: hair cut


		1. combing my hair 
		2. Haircut! Feeling fresh and clean 
		3. I tried the hair dye. it dun work! 
		4. Blow drying hair  
		5. LOL... I just realized how this works! No i haven't seen HAIR...... I would love to see it someday... 
		6. grow HAIR grow!!! I want long mermaid hair 
		7. NEED TO DYE MY HAIR ASAP. SICK OF REGROWTHS.  
		8. just got my fringe cut 
		9. loving this weather...its good hair weather so  iv decided to straighten my hair 
		10. On a happy note I am getting ma hair did tomorrow  CHOP CHOP ! Getting a bunch cut off !

	Query 3/10: weekend


		1. is the weekend 
		2. Weekendtime 
		3. Weekend has gone to fast 
		4. as I said... no weekend for me    Im working...
		5. can't wait for nxt weekend  lol
		6. No summer vacations this time arnd 
		7. Morning all.  The long weekend begins. Hooray, and that.
		8. So glad it's the weekend! 
		9. Where's the summer? 
		10. Not looking forward to working all weekend especially with anastashia leaving.... 

	Query 4/10: friend


		1. Exedrin is my friend 
		2. I like you as a friend 
		3. its like Friends allll over again! 
		4. where are all my friends???  
		5. @PernilleNC   i wanne be your friend, haha 
		6. Maybe AVALANCHE &amp; ShinRa can become friends.  
		7.  i know. Lol.
		8. I am hanging out with my friends. 
		9. Have some problems with friends. 
		10. I miss my friends. 

	Query 5/10: nice day


		1. Nice sunny day again, HAPPY 
		2. The day looks much nicer nowww 
		3. such a nice day...goin to be spending it in the sun 
		4. I'll have a little cry about it later. Its to nice of a day 
		5. It is sunny and warm outside, today is gonna be a good day 
		6. Change is good, change is good, change is good...  I'm not good with change.
		7. Good morning Chica! Have a good day! 
		8. Days look beautiful when things are near back to normal 
		9. good morning 
		10. Not a good day to be late 

	Query 6/10: school class


		1. Schools out, but works in 
		2. I never thought that I would be this sad to leave middle school 
		3. Classes till 12 
		4. Dang todays the last day of school 
		5. I MUST love school  Hahaha
		6. Emily, you went to school? Aww 
		7. Morning world. Last damn day of school! 
		8. No one showed up for my class this morning.  But that means I get to save this lesson for next week and I don't have to prepare (cont…
		9. starts summer school today 
		10. is headed for Senior Year. FINALLY. 

	Query 7/10: send direct


		1. just arrive from dagupan 
		2. Why am I not getting text alerts 
		3. is on call  
		4. Why do I have so many problems with direct messages? I cant get to them, EVER! Please dont think I am ignoring u if I dont respond 
		5. They sent the wrong hitch 
		6. easiest one i took 
		7. Bev - my email to you is bouncing back. 
		8. It's 3AM back home. Can't really expect a response back so soon... 
		9. So much to post, so little time to post 
		10. i finally sent off the care package!  knowing my luck, now they will send him home.

	Query 8/10: spring


		1. SUMMER ! 
		2. Awww! It's the last day of spring break 
		3. SUMMER!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 
		4. First day of summer, here i come!!! 
		5. its so summery today 
		6. Where's the summer? 
		7. @kanter And waiting for sunrise in spring with an amazing colors 
		8. In a state of shock...Lets just say when it rains it pours 
		9. I don't want spring break to end  4/20 tomorrow...let's do somethinggggg!
		10. summer always comes during the exam period. so i decided to reschedule it 

	Query 9/10: enjoy life


		1. enjoying some blissful time with myself.. 
		2. enjoying the new found peace and freedom after a 30 year long war 
		3. Relaxing before my long day at work  on a beautiful Sat I could be having fun
		4. Enjoying this motorcycle ride.. 
		5. Wow! Cool! Enjoy 
		6. Should be out enjoying the sun..but im not 
		7. Peaceful and sunny 
		8. Sitting in the sunshine listening to music. Wonderful. 
		9. Rest in peace G 
		10. Enjoy playing golf cavs  lakers vs magic!!!

	Query 10/10: sleep


		1. AND I CANT sleep 
		2. got to sleep now but i can't 
		3. i want to go to bed but never can sleep 
		4. Sleep? Not these days. 
		5. tired but cant sleep 
		6. Cannot sleep... Again! 
		7. I really wish I would have gotten more sleep. 
		8. Its official....can't go back to sleep 
		9. I probably won't get any sleep tonight.  ;)
		10. Going to sleep i think working in the am.. 


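### FastText Retrieval

For FastText, we used the [gensim](https://radimrehurek.com/gensim) package. Each tweet is represented by the mean of its word vectors, and tweets are ranked by their dot product with the query vector.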
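
The class in this section ranks tweets by a raw dot product between averaged word vectors. As a point of comparison, the sketch here also ranks by cosine similarity, which divides out vector length. It is an illustrative alternative and not what the notebook does. The 2-dimensional `toy_vectors` stand in for real FastText embeddings and are invented, as are the helper names.

```python
import math
from typing import Dict, List, Optional

# Invented 2-d "embeddings"; "follow" deliberately has a very large norm.
toy_vectors: Dict[str, List[float]] = {
    "sleep": [1.0, 0.0],
    "tired": [0.9, 0.1],
    "follow": [0.0, 10.0],
}


def mean_vector(tokens: List[str], vectors: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: List[float], b: List[float]) -> float:
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def rank(scores: List[float]) -> List[int]:
    """Document indices ordered from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


toy_docs = [["tired", "sleep"], ["follow", "follow"], ["follow", "sleep"]]
doc_vectors = [mean_vector(doc, toy_vectors) for doc in toy_docs]
query_vector = mean_vector(["tired"], toy_vectors)

dot_ranking = rank([dot(query_vector, v) for v in doc_vectors])
cosine_ranking = rank([cosine(query_vector, v) for v in doc_vectors])
print("dot product:", dot_ranking)
print("cosine:", cosine_ranking)
```

With these toy vectors the dot product places the long "follow" document first, while cosine similarity prefers the document that points in the same direction as the query. Whether the same effect explains any particular result in the notebook would need to be checked on the trained model.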

---

In [24]:
class FastTextRetrieval(RetrievalSystemBase):
    def __init__(self, k=10):
        self.k = k

    def train(self, df: pd.DataFrame):
        self.model = FastText(
            sentences=df['text_preprocessed'].apply(lambda x: x.split()).tolist(),
            sg=1,
            vector_size=110,
            epochs=10,
        )
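        # Zero-initialize so tweets that are empty after preprocessing get a zero vector
        # instead of uninitialized memory.
        self.document_vectors = np.zeros((len(df), 110))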
        self.document_text_by_idx = {}
        for doc_idx, (doc, text) in enumerate(zip(df['text_preprocessed'], df['text'])):
            self.document_text_by_idx[doc_idx] = text
            splitted = doc.split()
            if len(splitted) == 0:
                continue
            document_vector = np.mean(self.model.wv[splitted], axis=0)
            self.document_vectors[doc_idx] = document_vector

    def retrieve(self, query: Query) -> list:
        query_vector = np.mean(self.model.wv[query.text.split()], axis=0)
        similarities = np.dot(self.document_vectors, query_vector)
        similarities = similarities.argsort()[-self.k:][::-1]
        return [self.document_text_by_idx[i] for i in similarities]
        

In [25]:
test_model(FastTextRetrieval)

Retrieval system FastTextRetrieval:

	Query 1/10: twitter

		1. i have no followers 
		2. follow me 
		3. @DanneelHarris_ Thank you for the follow  How you doing? 
		4. so follow friday.... follow me 
		5. Thanks for the following everyone 
		6. Wants more followers 
		7. no one is following me 
		8. to all my new follower thank you 
		9. is short of Followers 
		10. Hey! Follow to @Beela_arg  SHE IS SO COOL 

	Query 2/10: hair cut

		1. follow me 
		2. i have no followers 
		3. is boiling in the office, air con any1 ? 
		4. lit des fiction  yeah!
		5. (cont) mid day meal and now just relaxing on my hammock..What a great day 
		6. Bite off more than you can chew, then chew it 
		7. so follow friday.... follow me 
		8. sometimes I wonder how life would be if me and my dad talked. if he was der in my everyday life. 
		9. gooing to work. 
		10. @DanneelHarris_ Thank you for the follow  How you doing? 

	Query 3/10: weekend

		1. (cont) mid day meal and now just relaxing on my hammock..Wha