### TF-Weightage Sample Code
This sample code shows how the Term-Frequency (TF) weightage model can be applied to calculate the relevance score of each document for a query.

The example is taken from **lecture 3.2**.

<h3 style = 'color:purple;'>Vector Space Model (TF-Weightage Model)</h3>

$$ f(q,d) = \mathrm{sim}(q,d) = \sum_{i=1}^n x_i y_i $$
$q = (x_1, \ldots, x_n)$ <br>
$d = (y_1, \ldots, y_n)$ <br>
$x_i$ = count of word $W_i$ in the query. <br>
$y_i$ = count of word $W_i$ in the document.
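
To make the formula concrete, here is a minimal sketch of the dot product over two count vectors. The four-word vocabulary, the helper name `dot_product`, and the example vectors are assumptions chosen for illustration; they are not taken from the lecture slides.

```python
# Hypothetical four-word vocabulary, used only to give the vectors a meaning
vocab = ["news", "about", "presidential", "campaign"]

def dot_product(x, y):
    """Return the sum of x_i * y_i for two equal-length count vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(x_i * y_i for x_i, y_i in zip(x, y))

q = [1, 1, 1, 1]  # query: each vocabulary word appears once
d = [1, 0, 2, 1]  # document: 'presidential' twice, 'about' absent

score = dot_product(q, d)
print(score)  # 1*1 + 1*0 + 1*2 + 1*1 = 4
```

A word that is missing from either the query or the document contributes zero to the sum, so only shared words raise the score.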

In [None]:
# let's say we have the following documents
documents = {
    "d1" : "news about",
    "d2" : "news about organic food campaign",
    "d3" : "news of presidential campaign",
    "d4" : "news of presidential campaign presidential candidate",
    "d5" : "news of organic food campaign campaign campaign campaign"
} # a dictionary with doc# as key and doc content as value

In [None]:
#visualize the dictionary
documents

{'d1': 'news about',
 'd2': 'news about organic food campaign',
 'd3': 'news of presidential campaign',
 'd4': 'news of presidential campaign presidential candidate',
 'd5': 'news of organic food campaign campaign campaign campaign'}

In [None]:
# create a corpus containing the vocabulary of words in the documents
corpus = [] # a list that will store words of the vocabulary
for doc in documents.values(): # iterate through documents
    for word in doc.split(): # go through each word in the current doc
        if word not in corpus:
            corpus.append(word) # add word to corpus if not already added

In [None]:
#visualize the corpus 
corpus

['news',
 'about',
 'organic',
 'food',
 'campaign',
 'of',
 'presidential',
 'candidate']
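
As an alternative sketch of the same vocabulary-building step, `dict.fromkeys` can remove duplicates while keeping first-seen order (guaranteed for dictionaries in Python 3.7+). The helper name `build_vocab` and the two-document `docs` sample are assumptions made for this example.

```python
# Small sample collection, assumed here so the block runs on its own
docs = {
    "d1": "news about",
    "d2": "news about organic food campaign",
}

def build_vocab(texts):
    """Return the unique words across all texts, in first-seen order."""
    words = (word for text in texts for word in text.split())
    # dict keys are unique and keep insertion order, so this deduplicates
    return list(dict.fromkeys(words))

vocab = build_vocab(docs.values())
print(vocab)  # ['news', 'about', 'organic', 'food', 'campaign']
```

This avoids the repeated `in` checks against a list, which become slow once the vocabulary grows large.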

In [None]:
# let's create a dictionary within a dictionary to store term frequencies for each doc
tf_docs = {} #empty dictionary
for doc_id in documents.keys(): #iterate through doc# (d1,d2,...,d5)
    tf_docs[doc_id] = {} #create empty dictionary for each doc# key

In [None]:
#visualize the state of tf_docs
tf_docs

{'d1': {}, 'd2': {}, 'd3': {}, 'd4': {}, 'd5': {}}

As you can see, we have created an empty dictionary for each doc. Next, we use these dictionaries to store the term frequencies for each doc.

In [None]:
# let's store the term frequencies for every doc
for word in corpus: # iterate through words in the corpus
    for doc_id, doc in documents.items(): # iterate through the documents dictionary
        # split first so only whole words are counted; str.count on the raw string also matches substrings
        tf_docs[doc_id][word] = doc.split().count(word) # store term frequency of the word in this doc

In [None]:
tf_docs #visualize calculated term frequencies

{'d1': {'news': 1,
  'about': 1,
  'organic': 0,
  'food': 0,
  'campaign': 0,
  'of': 0,
  'presidential': 0,
  'candidate': 0},
 'd2': {'news': 1,
  'about': 1,
  'organic': 1,
  'food': 1,
  'campaign': 1,
  'of': 0,
  'presidential': 0,
  'candidate': 0},
 'd3': {'news': 1,
  'about': 0,
  'organic': 0,
  'food': 0,
  'campaign': 1,
  'of': 1,
  'presidential': 1,
  'candidate': 0},
 'd4': {'news': 1,
  'about': 0,
  'organic': 0,
  'food': 0,
  'campaign': 1,
  'of': 1,
  'presidential': 2,
  'candidate': 1},
 'd5': {'news': 1,
  'about': 0,
  'organic': 1,
  'food': 1,
  'campaign': 4,
  'of': 1,
  'presidential': 0,
  'candidate': 0}}
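
The term frequencies can also be computed with `collections.Counter` from the standard library. The sketch below is one possible variant, not the lecture's method; the helper name `term_frequencies` and the sample strings are assumptions. It also illustrates why counting should be done on whole tokens rather than on the raw string.

```python
from collections import Counter

def term_frequencies(text, vocab):
    """Count whole-word occurrences of each vocabulary word in text."""
    counts = Counter(text.split())  # token -> number of occurrences
    # a Counter returns 0 for words it has never seen
    return {word: counts[word] for word in vocab}

tf = term_frequencies(
    "news of organic food campaign campaign",
    ["news", "of", "campaign", "candidate"],
)
print(tf)  # {'news': 1, 'of': 1, 'campaign': 2, 'candidate': 0}

# Substring counting vs. token counting on an assumed sample sentence
sample = "office of the president"
substring_count = sample.count("of")      # also matches the 'of' inside 'office'
token_count = sample.split().count("of")  # matches only the standalone word
print(substring_count, token_count)  # 2 1
```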

In [1]:
tf_docs['d3'] #checking term frequencies for d3

{'news': 1,
 'about': 0,
 'organic': 0,
 'food': 0,
 'campaign': 1,
 'of': 1,
 'presidential': 1,
 'candidate': 0}

### Querying the documents for relevance scores
Now that we have calculated the term frequencies for all the documents in our collection, let us calculate the relevance score of each document for a given query.

In [15]:
query = "news about presidential campaign" #the query
query

'news about presidential campaign'

In [16]:
query_vocab = [] # will store the unique words that occur in the query
for word in query.split():
    if word not in query_vocab:
        query_vocab.append(word)

In [17]:
query_vocab # the unique words in the query

['news', 'about', 'presidential', 'campaign']

In [19]:
query_wc = {} # a dictionary to store the count of each word in the query (i.e. x_i in the lecture slides' terminology)
for word in query_vocab:
    query_wc[word] = query.split().count(word)

In [20]:
query_wc # the count of each word that occurs in the query

{'about': 1, 'campaign': 1, 'news': 1, 'presidential': 1}

In [None]:
relevance_scores = {} # a dictionary that will store the relevance score for each doc
# doc_id will be the key and relevance score the value for this dictionary
for doc_id in documents.keys():
    score = 0 # initialize the score for the doc to 0 at the start
    for word in query_vocab:
        score += query_wc[word] * tf_docs[doc_id][word] # count of word in query * term_freq of the word
    relevance_scores[doc_id] = score

In [None]:
# let's print the relevance scores for the query
print("Document Relevancy Scores\n",relevance_scores)

Document Relevancy Scores
 {'d1': 2, 'd2': 3, 'd3': 3, 'd4': 4, 'd5': 5}
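
Relevance scores are usually used to rank documents. A minimal sketch of that step follows, with the scores printed above copied into a dictionary named `scores` (a name assumed here so the block runs on its own).

```python
# Scores copied from the output above
scores = {"d1": 2, "d2": 3, "d3": 3, "d4": 4, "d5": 5}

# Sort (doc_id, score) pairs by score, highest first.
# Python's sort is stable, so tied documents keep their original order.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for doc_id, score in ranked:
    print(doc_id, score)
```

With this model, `d5` ranks first only because it repeats "campaign" four times, even though it matches just two of the four query words. Raw term frequency rewards repetition, which is a known weakness of this simple scoring scheme.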


## HOORAY !!!
We are done with our first simple vector space model to check the relevancy of each document to our query.

### What next?
This was just one of the many ways to calculate the relevance score of a set of documents for a query. I have commented the code as much as possible to aid your understanding. It is important that you get familiar with Python as early as possible, since it will be used throughout your degree program. Please try to understand this code, as it will be useful for solving assignment 1 for this course. Regards