# **Information Retrieval CSF469**
## Lab Session - 5
## Date - 23/02/2024
## <font color ='GREEN'>Marks: 10</font>
In this lab session, we implement a vector-based query processing approach in which both the documents and the queries are converted into vectors. A similarity metric such as cosine similarity is then used to find the relevant documents.

## Students have to write their code in the blank spaces marked under the <font color = "RED">TO-DO</font> sections

### After attempting the lab sheet, please rename the file to "NAME_ID.ipynb" and then upload it to Canvas

The task is to implement a vector-based query processing approach. First, convert each text document into a TF-IDF vector and store the vectors as a matrix. Next, convert the query into a vector of the same length as a document vector. Finally, use a similarity metric to retrieve the relevant documents.
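The steps above can be sketched on a toy corpus before working with the real dataset. This is a minimal illustration in plain Python, not the expected lab solution: the three documents, the query, and the helper names `tfidf_vectors`, `vectorise`, and `cosine` are all made up for the example, and the weighting assumed here is raw term count times `log(N / df)`.

```python
import math
from collections import Counter


def vectorise(text, vocab, idf):
    """Map text to a tf-idf vector over vocab; terms outside vocab are ignored."""
    counts = Counter(text.split())
    return [counts[term] * idf[term] for term in vocab]


def tfidf_vectors(docs):
    """Return (vocab, idf, document vectors) for whitespace-tokenised docs."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenised for term in tokens})
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log(len(docs) / df[term]) for term in vocab}
    return vocab, idf, [vectorise(doc, vocab, idf) for doc in docs]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus and query (already lower-cased, no stemming)
docs = ["rail stock jumps", "defence stock rises", "rail order win"]
vocab, idf, doc_vectors = tfidf_vectors(docs)

query_vector = vectorise("rail order", vocab, idf)
scores = [cosine(query_vector, d) for d in doc_vectors]
best = max(range(len(docs)), key=scores.__getitem__)
print(best, docs[best])
```

The query shares two terms with the third document and one with the first, so the third document ranks highest, while the second document shares no terms and scores zero.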

In [304]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [305]:
dataset = pd.read_csv('IRdata.csv')
# Dataset is now stored in a Pandas Dataframe

In [306]:
print(dataset)

                                                 title     label
0    Avantel Limited Announces Resignation of Ebv R...   Avantel
1    480% Returns From 1-Year Low: Multibagger Defe...   Avantel
2    Multibagger defence stock rises 5% on Rs 68 cr...   Avantel
3    Avantel Ltd receives order worth Rs. 67.92 cro...   Avantel
4    Rs 11 to Rs 129: This defence stock turned int...   Avantel
..                                                 ...       ...
157  Titagarh Rail Systems rolls out new diving sup...  Titagarh
158  Titagarh Rail shares fall 10% from record high...  Titagarh
159  Rs 49 to Rs 813: This railway stock turned int...  Titagarh
160  Škoda Group secures €732m contract with Trenit...  Titagarh
161  Titagarh Rail Share Price: Stock at record hig...  Titagarh

[162 rows x 2 columns]


In [307]:
df_news = dataset.iloc[:, 0]
df_labels = dataset.iloc[:, -1]

In [308]:
print(df_news)
df_labels = df_labels.to_frame()
print(df_labels)

0      Avantel Limited Announces Resignation of Ebv R...
1      480% Returns From 1-Year Low: Multibagger Defe...
2      Multibagger defence stock rises 5% on Rs 68 cr...
3      Avantel Ltd receives order worth Rs. 67.92 cro...
4      Rs 11 to Rs 129: This defence stock turned int...
                             ...                        
157    Titagarh Rail Systems rolls out new diving sup...
158    Titagarh Rail shares fall 10% from record high...
159    Rs 49 to Rs 813: This railway stock turned int...
160    Škoda Group secures €732m contract with Trenit...
161    Titagarh Rail Share Price: Stock at record hig...
Name: title, Length: 162, dtype: object
        label
0     Avantel
1     Avantel
2     Avantel
3     Avantel
4     Avantel
..        ...
157  Titagarh
158  Titagarh
159  Titagarh
160  Titagarh
161  Titagarh

[162 rows x 1 columns]


In [309]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')


# Preprocessing function
def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())
    # Removing stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    # Stemming
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    # Joining tokens back to sentence
    preprocessed_text = ' '.join(stemmed_tokens)
    return preprocessed_text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# <font color = "RED">TO-DO</font> Implement code for converting text into TFIDF vectors.

In [310]:
import numpy as np

class CustomTFIDFVectorizer:
    def __init__(self):
        self.word_idf = {}
        self.vocab_size = 0

    def fit_transform(self, documents):
        ### write your code here for generating vector of documents
        word_doc_count = {}

        for document in documents:
            words = set(document.split())
            for word in words:
                if word in word_doc_count:
                    word_doc_count[word]+=1
                else:
                    word_doc_count[word] = 1

        for word, count in word_doc_count.items():
            self.word_idf[word] = np.log(len(documents)/count)

        tfidf_matrix = np.zeros((len(documents), len(self.word_idf)))

        for i, doc in enumerate(documents):
            tfidf_matrix[i] = self.transform([doc])

        self.vocab_size = len(self.word_idf)

        return tfidf_matrix

    def transform(self, query_text, max_length=None):
        ### write your code here for generating vector of query
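        # Callers pass a one-element list such as [text]; unwrap it first,
        # because str() on the list would attach "['" and "']" to the
        # first and last tokens and they would never match the vocabulary.
        if isinstance(query_text, (list, tuple)):
            query_text = query_text[0]
        words = str(query_text).split()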
        tf = {}

        for word in words:
            tf[word] = tf.get(word, 0) + 1

        query_vector = np.zeros(len(self.word_idf))

        for word, freq in tf.items():
            if word in self.word_idf:
                query_vector[list(self.word_idf).index(word)] = freq * self.word_idf[word]

        return query_vector
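For reference, the weighting that the class above computes can be written out as follows. This is a reading of the code, not a formula given in the lab sheet; the symbols $w_{t,d}$, $\mathrm{tf}_{t,d}$, $\mathrm{df}_t$, and $N$ are notation introduced here.

$$
w_{t,d} = \mathrm{tf}_{t,d} \cdot \ln\!\left(\frac{N}{\mathrm{df}_t}\right)
$$

Here $\mathrm{tf}_{t,d}$ is the raw count of term $t$ in document $d$, $\mathrm{df}_t$ is the number of documents that contain $t$, and $N$ is the total number of documents (`np.log` is the natural logarithm). Two consequences follow: a term that occurs in every document gets weight $\ln(1) = 0$, and a query term that is absent from the fitted vocabulary contributes nothing to the query vector. This unsmoothed form is not necessarily what a library vectorizer computes by default, so the values need not match one.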

In [311]:
# Preprocess documents
preprocessed_documents = [preprocess_text(doc) for doc in df_news]

In [312]:
# Initialize and fit the TF-IDF vectorizer
vectorizer = CustomTFIDFVectorizer()

# Fit and transform documents
tfidf_matrix = vectorizer.fit_transform(preprocessed_documents)
print(tfidf_matrix.shape)

(162, 611)


# <font color = "RED">TO-DO</font> Process the queries and retrieve the five most similar documents.

In [313]:
query = "multi bagger defence shares  "
# Query processing
preprocessed_query = preprocess_text(query)
query_vector = (vectorizer.transform([preprocessed_query])).reshape(1,-1)
# print(query_vector.shape)
# Calculating cosine similarity between query and documents
similarity_scores = cosine_similarity(query_vector, tfidf_matrix)
# Retrieving most similar five documents
top_indices = np.argsort(similarity_scores.ravel())[-5:][::-1]

print("Top 5 most similar documents:")
for i, idx in enumerate(top_indices):
    print(f"{i+1}. Document {idx}: {df_news[idx]}")

Top 5 most similar documents:
1. Document 27: Smallcap Defence Company Declares 2:1 Bonus Issue; Stock Zooms 20% - Equitymaster
2. Document 5: This Multibagger Defence Stock Has Given over 2000% Return In 3 Years - News18
3. Document 2: Multibagger defence stock rises 5% on Rs 68 crore order win, nears 52-week high - Business Today
4. Document 6: 2:1 Bonus Issue: Multibagger Defence Stock Up 400% YTD; Turns Rs 50000 Into Rs 2.5 Lakh In 11 Months - Goodreturns
5. Document 4: Rs 11 to Rs 129: This defence stock turned into a multibagger in two years; fell 11% from record high - Business Today


In [314]:
query1 = "Titagarh shares at 10%"
query2 = "small cap company shares"
query3 = "rail systems launch repayment"

In [315]:
# Query-1 processing
preprocessed_query = preprocess_text(query1)
query_vector = (vectorizer.transform([preprocessed_query])).reshape(1,-1)
# print(query_vector.shape)
# Calculating cosine similarity between query and documents
similarity_scores = cosine_similarity(query_vector, tfidf_matrix)
# Retrieving most similar five documents
top_indices = np.argsort(similarity_scores.ravel())[-5:][::-1]

print("Top 5 most similar documents:")
for i, idx in enumerate(top_indices):
    print(f"{i+1}. Document {idx}: {df_news[idx]}")

Top 5 most similar documents:
1. Document 158: Titagarh Rail shares fall 10% from record high; right time to book profit? - Business Today
2. Document 53: Gainers & Losers: 10 stocks that moved the most on June 12 - Moneycontrol
3. Document 39: Multibagger Alert! With over 9500% returns in a decade, this small-cap stock delivered gains in 7 out of last 10 years - MintGenie
4. Document 1: 480% Returns From 1-Year Low: Multibagger Defence Stock Less Than Rs 10 Away From New High; Co Wins Big Order - Goodreturns
5. Document 85: Top 10 stocks to watch on December 15, 2023: Texmaco Rail, Titagarh, IRCON, Dr Reddy's, Ami Organics and more - Business Today


In [316]:
# Query-2 processing
preprocessed_query = preprocess_text(query2)
query_vector = (vectorizer.transform([preprocessed_query])).reshape(1,-1)
# print(query_vector.shape)
# Calculating cosine similarity between query and documents
similarity_scores = cosine_similarity(query_vector, tfidf_matrix)
# Retrieving most similar five documents
top_indices = np.argsort(similarity_scores.ravel())[-5:][::-1]

print("Top 5 most similar documents:")
for i, idx in enumerate(top_indices):
    print(f"{i+1}. Document {idx}: {df_news[idx]}")

Top 5 most similar documents:
1. Document 27: Smallcap Defence Company Declares 2:1 Bonus Issue; Stock Zooms 20% - Equitymaster
2. Document 25: Stock of this telecom equipment company has zoomed over 200% in 5 months - Business Standard
3. Document 146: Titagarh Rail Systems stock jumps 4.4% after company signs contract worth ₹857 crore - MintGenie
4. Document 16: Small cap stock gained up to 3.2% after it received an order from Cochin Shipyard - Trade Brains
5. Document 44: 1:5 Stock Split: Small Cap Multibagger Telecommunications Stock Hits New 52-Week High - Goodreturns


In [317]:
# Query-3 processing
preprocessed_query = preprocess_text(query3)
query_vector = (vectorizer.transform([preprocessed_query])).reshape(1,-1)
# print(query_vector.shape)
# Calculating cosine similarity between query and documents
similarity_scores = cosine_similarity(query_vector, tfidf_matrix)

# Retrieving most similar five documents
top_indices = np.argsort(similarity_scores.ravel())[-5:][::-1]

print("Top 5 most similar documents:")
for i, idx in enumerate(top_indices):
    print(f"{i+1}. Document {idx}: {df_news[idx]}")

Top 5 most similar documents:
1. Document 93: Titagarh Rail Systems stock gains on launching Rs 700-crore QIP issue - Moneycontrol
2. Document 95: Titagarh Rail Systems shares rise after the launch of Rs 700 crore QIP issue - Zee Business
3. Document 156: Titagarh Rail Systems launches diving support craft for Indian Navy - MyIndMakers
4. Document 97: Titagarh Rail Systems Surges on Rs 700 Crore QIP Launch, Eyes Debt Repayment - Indiainfoline
5. Document 66: Railway stock jumps 3.4% after launching new product for Indian Navy - Trade Brains
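The query cells above all repeat the same ranking step. A minimal sketch of that step as a reusable function is given below, in plain Python so that it runs without the notebook state. The name `rank_top_k` and the sample scores are hypothetical, and ties are broken in favour of the lower index here, which may differ from the order produced by the `np.argsort` slicing used in the cells.

```python
def rank_top_k(scores, k=5):
    """Return the indices of the k largest scores, best first.

    sorted() is stable even with reverse=True, so equal scores keep
    their original (lower index first) order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Hypothetical similarity scores for six documents
scores = [0.10, 0.75, 0.00, 0.42, 0.75, 0.31]

for rank, idx in enumerate(rank_top_k(scores, k=3), start=1):
    print(f"{rank}. Document {idx}: score {scores[idx]:.2f}")
```

Printing the score next to each document makes it easier to spot cases where fewer than five documents share any term with the query, since the remaining slots are then filled with zero-score documents.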
