# Text Retrieval
There are two standard models for retrieving text data:
1. Boolean Retrieval Model
2. Vector Space Model

The aim of any information retrieval model is to retrieve documents related to a query.

## 1. Boolean Retrieval Model
In this model, we treat every query and document as a set of words, and we retrieve a document if and only if the query word is present in it. The model can be extended to support complex queries with Boolean operators such as AND, OR, and NOT.
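
To make the set view concrete, here is a minimal pure-Python sketch of Boolean retrieval. The corpus, the titles, and the helper `containing` are made up for illustration and are not part of the assignment dataset or of scikit-learn.

```python
# Toy corpus (hypothetical titles and lyrics, not from the assignment dataset)
docs = {
    "Song A": "you are a beautiful girl",
    "Song B": "beautiful day in the city",
    "Song C": "the girl next door",
}

# Represent each document as a set of lowercase words
doc_words = {title: set(text.lower().split()) for title, text in docs.items()}

def containing(word):
    """Return the set of titles whose word set contains `word`."""
    return {title for title, words in doc_words.items() if word in words}

# Boolean operators map directly onto set operations
and_hits = containing("beautiful") & containing("girl")  # AND -> intersection
or_hits = containing("beautiful") | containing("girl")   # OR  -> union
not_hits = set(docs) - containing("girl")                # NOT -> complement

print(sorted(and_hits))
print(sorted(or_hits))
print(sorted(not_hits))
```

The rest of this section does the same thing with a 0/1 term matrix instead of Python sets, which scales better to a full dataset.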

In this assignment we implement both models using the scikit-learn package, on a song lyrics dataset.


**Step 1. Import necessary packages -- numpy and pandas - 1 Mark** 

In [1]:
import pandas as pd
import numpy as np

**Step 2. Read the dataset and store it in variable 'df' - 1 mark** <br> 

The lyric column of the dataset has song lyrics. We aim to give some lyrics as a query and retrieve the song name. 


In [2]:

df = pd.read_csv("modified_song_lyrics.csv")

#df.head()


**Documentation Reference: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html**<br>

**Step 3**<br>
1. Import the 'CountVectorizer' class
2. Create a 'count_vectorizer' object of 'CountVectorizer' with parameter binary=True - 1 Mark

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(binary=True)

We aim to analyze the lyrics for the presence or absence of each word. <br> 
**Step 4. Fit and transform the lyric column using the vectorizer - 2 Marks**<br>
X is a sparse matrix of size (n_songs, n_unique_words) in which each entry is 1 if the word is present in the song and 0 otherwise. Verify the size using the X.shape attribute.

In [None]:
X = count_vectorizer.fit_transform(df['lyric'])
X.shape
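
As a sanity check on what `binary=True` does, the following sketch fits a `CountVectorizer` on three made-up lyric lines. The names `toy_lyrics`, `toy_vectorizer`, and `toy_X` are illustrative and are not part of the assignment.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus; repeated words are intentional
toy_lyrics = [
    "la la la love",
    "love me love me",
    "rain rain go away",
]

toy_vectorizer = CountVectorizer(binary=True)
toy_X = toy_vectorizer.fit_transform(toy_lyrics)

# One row per lyric, one column per distinct word
print(toy_X.shape)  # (3, 6)

# vocabulary_ maps each word to its column index
print(sorted(toy_vectorizer.vocabulary_))

# With binary=True, repeated words still give 1, never a count
print(toy_X.toarray())
```

Even though "la" occurs three times in the first line, its entry is 1, which is exactly the presence/absence information the Boolean model needs.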

In [None]:
query1 = 'beautiful'
query2 = 'girl'
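# To get the list of all documents containing a word, select that word's column of X.
# vocabulary_ maps each word to its column index (KeyError if the word is not in the vocabulary).
# toarray().ravel() turns the sparse column into a dense 1-D array of 0/1 values.
list_q1 = X[:,count_vectorizer.vocabulary_[query1]].toarray().ravel()
# Step 5. Do the same for 'query2' and store it in 'list_q2'
list_q2 = X[:,count_vectorizer.vocabulary_[query2]].toarray().ravel()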

In [None]:
# AND Operation
for i in range(list_q1.shape[0]):
    if list_q1[i]==1 and list_q2[i]==1:
        print(df.iloc[i,1])

**Step 6. Implement OR operation - 1 Mark**

In [None]:
# OR Operation
for i in range(list_q1.shape[0]):
    if list_q1[i]==1 or list_q2[i]==1:
        print(df.iloc[i,1])
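
The explicit loops above can also be written with NumPy boolean masks. This is only a sketch on hand-written stand-in data: `titles`, `has_q1`, and `has_q2` are hypothetical and assume the two term columns are available as dense 1-D arrays with one entry per song.

```python
import numpy as np

# Hypothetical stand-ins for the song names and the two 0/1 term columns
titles = np.array(["Song A", "Song B", "Song C", "Song D"])
has_q1 = np.array([1, 1, 0, 0], dtype=bool)  # songs containing the first query word
has_q2 = np.array([1, 0, 1, 0], dtype=bool)  # songs containing the second query word

# Element-wise & and | replace the per-row `and` / `or` checks
and_titles = titles[has_q1 & has_q2]
or_titles = titles[has_q1 | has_q2]

print("AND:", and_titles.tolist())
print("OR: ", or_titles.tolist())
```

Note that `&` and `|` are the element-wise operators here. Python's `and` / `or` keywords work only on single truth values, which is why the loop version compares one row at a time.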

## 2. Vector Space Model
In this model, every document and query is represented as a vector, and the document whose vector is closest to the query vector, i.e. has the highest cosine similarity with it, is returned as the answer.
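
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand with NumPy on made-up three-dimensional vectors; `cosine_sim`, `toy_query`, and `toy_docs` are illustrative names, and the assignment itself uses scikit-learn's `cosine_similarity` instead.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical query and document vectors over a 3-word vocabulary
toy_query = np.array([1.0, 1.0, 0.0])
toy_docs = np.array([
    [2.0, 2.0, 0.0],  # same direction as the query, different length
    [0.0, 0.0, 3.0],  # shares no words with the query
    [1.0, 0.0, 1.0],  # shares one word with the query
])

scores = [cosine_sim(doc, toy_query) for doc in toy_docs]
best = int(np.argmax(scores))  # index of the most similar document

print(scores)
print(best)
```

The first document scores highest even though it is twice as long as the query: cosine similarity depends only on direction, so a long song is not favored just for repeating words.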

**Documentation Reference:**<br>
1. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

**Step 1. Import above references - 1 Mark**

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

**Step 2. Create a 'tfidf_vectorizer' object of 'TfidfVectorizer' - 1 Mark**

In [None]:
tfidf_vectorizer = TfidfVectorizer()

Here we calculate the tf-idf score of every term in the lyrics, as follows. <br> 
**Step 3. Fit and transform the lyric column using the vectorizer - 2 Marks**<br>
X is a sparse matrix of size (n_songs, n_unique_words) in which each entry is the tf-idf score of the word in that song. Verify the size using the X.shape attribute. Note that this overwrites the binary matrix X from the Boolean model.

In [None]:
X = tfidf_vectorizer.fit_transform(df['lyric'])
X.shape

**Step 4. Use 'transform' method of vectorizer on 'query' and store in 'query_vec' - 1 Mark**<br>
This method converts a list of text values into tf-idf vectors, using the vocabulary and idf weights learned in Step 3.

In [None]:
query = "Take it easy, with me"
query_vec = tfidf_vectorizer.transform([query])

query_vec.shape
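
One detail worth checking is what `transform` does with query words that never occurred in the fitted lyrics. The sketch below uses a made-up three-line corpus; `toy_lyrics`, `toy_tfidf`, `toy_query_vec`, and `unseen_vec` are illustrative names only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the lyric column
toy_lyrics = ["take it easy", "easy come easy go", "hard rain"]

toy_tfidf = TfidfVectorizer()
toy_X = toy_tfidf.fit_transform(toy_lyrics)

# transform expects a list of strings and reuses the fitted vocabulary
toy_query_vec = toy_tfidf.transform(["Take it easy, with me"])

# Same number of columns as the document matrix
print(toy_query_vec.shape, toy_X.shape)

# "with" and "me" are not in the vocabulary, so only 3 entries are non-zero
print(toy_query_vec.nnz)

# A query made only of unseen words becomes an all-zero vector
unseen_vec = toy_tfidf.transform(["zebra xylophone"])
print(unseen_vec.nnz)
```

Words outside the fitted vocabulary are silently dropped, so a query with no known words has zero similarity to every song.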

**Step 5. Use 'cosine_similarity' on 'X' and 'query_vec' and store the output in 'results' - 1 Mark**

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

results = cosine_similarity(X, query_vec)

In [None]:
# Print the name of the song with the highest cosine similarity to the query
song_index = np.argmax(results.ravel())
print('Song -- ', df.iloc[song_index,1]) # column 1 is assumed to hold the song name
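
Returning only the single best match hides how close the runners-up are. A possible extension, sketched here on made-up scores, is to rank the top k songs with `np.argsort`. The names `toy_results` and `toy_titles` are hypothetical stand-ins for `results` and the song-name column.

```python
import numpy as np

# Hypothetical similarity scores with shape (n_songs, 1), like the output of cosine_similarity
toy_results = np.array([[0.10], [0.75], [0.00], [0.40]])
toy_titles = ["Song A", "Song B", "Song C", "Song D"]

k = 3
scores = toy_results.ravel()             # flatten to shape (n_songs,)
top_idx = np.argsort(scores)[::-1][:k]   # indices of the k highest scores, best first
top = [(toy_titles[i], float(scores[i])) for i in top_idx]

for title, score in top:
    print(f"{title}: {score:.2f}")
```

It is also worth checking that the best score is greater than zero before trusting the answer: if every score is 0, `np.argmax` simply returns index 0.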