<img src="../../../images/banners/python-practice.png" width="600"/>

# <img src="../../../images/logos/python.png" width="23"/> 02. Practice Session 

> Covering Data Types, Functions, and IO.

## Table of Contents


* [Search Engine (TF-IDF)](#search_engine_(tf-idf))
    * [Step #1: Read Files](#step_#1:_read_files)
    * [Step #2: Extract Unique Words in all Documents](#step_#2:_extract_unique_words_in_all_documents)
    * [Step #3: Extract Number of Words in each Document](#step_#3:_extract_number_of_words_in_each_document)
    * [Step #3: Create `tf` (Term Frequency)](#step_#3:_create_`tf`_(term_frequency))
    * [Step #4: Search](#step_#4:_search)

---

<a class="anchor" id="search_engine_(tf-idf)"></a>
## Search Engine (TF-IDF)

> **TF-IDF** stands for “Term Frequency-Inverse Document Frequency”. It is a technique for quantifying the words in a set of documents: each word is assigned a weight that signifies its importance in the document and in the corpus.

Suppose I give you a sentence such as _“This building is so tall”_. It is easy for us to understand because we know the semantics of the words and of the sentence. But how can a computer understand it? A computer can only process data in numerical form. For this reason, we vectorize the text so that the computer can work with it.

By vectorizing the documents, we can perform further tasks such as finding relevant documents, ranking, and clustering. The same thing happens when you perform a Google search. The web pages are called documents, and the text you search with is called a query. Google maintains a fixed representation of all the documents. When you search with a query, Google computes the relevance of the query to each document, ranks the documents in order of relevance, and shows you the top k. This whole process uses the vectorized forms of the query and the documents. Although Google's algorithms are highly sophisticated and optimized, this is the basic idea behind them.

Terminology
- **t**: term (word)
- **d**: document (set of words)
- **N**: number of documents in the corpus
- **corpus**: the total document set
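
To make the terminology concrete, here is a minimal sketch of one common TF-IDF variant, `tf-idf(t, d) = tf(t, d) * log(N / df(t))`, where `df(t)` is the number of documents that contain `t`. The toy corpus and the function name `tf_idf` are assumptions made for illustration; other variants smooth or normalize these terms differently.

```python
import math

def tf_idf(docs):
    """Return {doc_name: {term: weight}} using raw counts and log(N / df)."""
    n_docs = len(docs)  # N: number of documents in the corpus
    # Term frequency: raw count of each term in each document
    tf = {}
    for name, text in docs.items():
        counts = {}
        for term in text.split():
            counts[term] = counts.get(term, 0) + 1
        tf[name] = counts
    # Document frequency: number of documents that contain each term
    df = {}
    for counts in tf.values():
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return {
        name: {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        for name, counts in tf.items()
    }

# Hypothetical toy corpus
toy_docs = {
    "doc_1": "the building is tall",
    "doc_2": "the tower is tall",
    "doc_3": "the river is wide",
}
weights = tf_idf(toy_docs)
print(weights["doc_1"])
```

A term that appears in every document, such as "the" here, gets weight 0, while a term that appears in only one document gets the largest weight.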

<a class="anchor" id="step_#1:_read_files"></a>
### Step #1: Read Files

> 1. Read `data/files_path.txt`, which lists all the documents you have to read.
> 2. Read the files listed in `data/files_path.txt` and create a dictionary where keys are file names and values are file contents.

```python
docs = {
    "file_1": "content_1",
    "file_2": "content_2",
    ...
} 
```
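
The two steps above could be sketched as follows. This assumes that the listing file contains one path per line, that each path can be resolved from the current working directory, and that all files are UTF-8 encoded; the function name `read_docs` is hypothetical.

```python
from pathlib import Path

def read_docs(list_path):
    """Build {file_name: file_content} from a file that lists one path per line."""
    docs = {}
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        path = Path(line)
        # Keys are bare file names; two files with the same name would collide
        docs[path.name] = path.read_text(encoding="utf-8")
    return docs

# Usage (assumes the listing file exists):
# docs = read_docs("data/files_path.txt")
```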

In [1]:
import json

In [110]:
with open("./data/result.json", encoding="utf-8") as f:
    data = json.load(f)

<a class="anchor" id="step_#2:_extract_unique_words_in_all_documents"></a>
### Step #2: Extract Unique Words in all Documents

> Create a set of all words (`vocab`) and print the number of unique words.

In [30]:
# number of messages in data
len(data['messages'])

225

In [31]:
type(data['messages'])

list

In [108]:
# we store unique words in a set
vocab = set()

In [109]:
for msg in data['messages']:
    
    if not msg['text']:
        continue
    
    if isinstance(msg['text'], list):
        # TODO: extract the text of messages that contain a link; for now they are skipped.
        continue
    
    words = msg['text'].split()
    vocab.update(words)
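
Splitting on whitespace keeps punctuation attached to words, so a word followed by a comma counts as a different vocabulary entry from the bare word. A minimal sketch of one way to reduce this is shown below. The helper names `tokenize` and `build_vocab` and the punctuation set are assumptions; the set is illustrative and incomplete.

```python
import string

# ASCII punctuation plus a few Persian/Arabic marks (an illustrative, incomplete set)
PUNCTUATION = string.punctuation + "،؛؟"

def tokenize(text):
    """Split on whitespace and strip leading/trailing punctuation from each token."""
    tokens = (raw.strip(PUNCTUATION) for raw in text.split())
    return [tok for tok in tokens if tok]  # drop tokens that were only punctuation

def build_vocab(texts):
    """Return the set of unique tokens across an iterable of strings."""
    vocab = set()
    for text in texts:
        vocab.update(tokenize(text))
    return vocab

print(build_vocab(["Hello, world!", "hello world ..."]))
```

This sketch does not change case, so "Hello" and "hello" remain separate entries; case folding could be added for Latin-script text.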

<a class="anchor" id="step_#3:_extract_number_of_words_in_each_document"></a>
### Step #3: Extract Number of Words in each Document

> 1. Count the words in each document by creating a dictionary named `tf_dict` where keys are document names and values are nested dictionaries.
> 2. In each nested dictionary, keys are words and values are the corresponding word frequencies.

In [138]:
# term frequency (tf) dictionary
tf_dict = {}

In [139]:
for msg in data['messages']:
    
    # ignoring messages that have no content
    if not msg['text']:
        continue
    
    if isinstance(msg['text'], list):
        # TODO: extract the text of messages that contain a link; for now they are skipped.
        continue
    
    # tokenizing words
    words = msg['text'].split()
    
    sender = msg['from']
    # initializing an unseen sender with empty dict
    if sender not in tf_dict:
        tf_dict[sender] = {}
    
    # counting words for each sender
    for word in words:
        if word in tf_dict[sender]:
            tf_dict[sender][word] += 1
        else:
            tf_dict[sender][word] = 1
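
The same counting can be written more compactly with `collections.Counter`. The sketch below is an alternative, not a replacement for the loop above. The function name `count_terms` and the sample messages are hypothetical; the sample only mimics the `from` and `text` fields used in this notebook.

```python
from collections import Counter, defaultdict

def count_terms(messages):
    """Return {sender: Counter of words} for messages whose 'text' is a non-empty string."""
    tf_dict = defaultdict(Counter)
    for msg in messages:
        text = msg.get("text")
        if not isinstance(text, str) or not text:
            continue  # skip empty messages and list-valued text (e.g. messages with links)
        tf_dict[msg["from"]].update(text.split())
    return dict(tf_dict)

# Hypothetical sample mimicking the structure of data['messages']
sample = [
    {"from": "A", "text": "hi hi there"},
    {"from": "B", "text": ""},
    {"from": "A", "text": ["link", {"type": "link"}]},
    {"from": "B", "text": "there"},
]
print(count_terms(sample))
```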

In [141]:
tf_dict

{'Fatemeh Modarres': {'سلام': 1,
  'ب': 2,
  'همگی،': 1,
  'علی،': 1,
  'امشب': 1,
  'کلاس': 1,
  'رفع': 1,
  'اشکال': 1,
  'چه': 1,
  'ساعتی': 1,
  'هست؟': 1,
  'اوکی': 1,
  'مرسیییی': 1,
  '👌': 3,
  'یعنی': 3,
  'در': 1,
  'اینده': 1,
  'نزدیک،': 1,
  'میشه': 2,
  'یه': 2,
  'ربات': 2,
  'ساخت': 1,
  'و': 6,
  'دقیقا': 1,
  'همون': 1,
  'بینایی': 1,
  'رو': 4,
  'بهش': 1,
  'منتقل': 1,
  'کرد،': 1,
  'احتمالا': 1,
  'همین': 1,
  'کانسپت': 1,
  'برای': 1,
  'حس': 1,
  'شنوایی': 1,
  'بویایی': 1,
  '....': 1,
  'کم': 2,
  'استخراج': 1,
  'اثبات': 1,
  'بعدش': 1,
  'اون': 1,
  'انتقال': 1,
  'دهنده': 1,
  'های': 2,
  'عصبی': 1,
  'میذارن': 1,
  'برا': 1,
  'رباته،': 1,
  'مثلا': 1,
  'فلان': 2,
  'چیزی': 1,
  'دیدی،': 1,
  'ری': 1,
  'اکشن': 1,
  'نشون': 1,
  'بده،': 1,
  'شبیه': 1,
  'سازی': 1,
  'کامل': 1,
  'از': 1,
  'انسان': 1,
  'دیگه': 2,
  'کلا': 1,
  'وجود': 1,
  'ادم': 1,
  'هیچ': 1,
  'ضرورتی': 1,
  'نداره،': 1,
  'هوشمند': 1,
  'میان': 1,
  'جای': 1,
  'ادما🙈': 1,
  'فک': 1,

<a class="anchor" id="step_#3:_create_`tf`_(term_frequency)"></a>
### Step #3: Create `tf` (Term Frequency)

> 1. Create a dictionary where keys are words and values are lists.
> 2. Each list holds the word's frequency in every document, in the same document order for every word.

```python
tf = {
    word_1: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    word_2: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    word_3: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
    ...
    word_n: [freq_doc_1, freq_doc_2, freq_doc_3, ..., freq_doc_n],
}
```

| |doc_1|doc_2|...|doc_n|
|--|--|--|--|--|
|word_1|10|4|...|14|
|word_2|8|11|...|4|
|word_3|3|5|...|1|

In [142]:
from tqdm import tqdm

In [143]:
tf = {}

In [144]:
for w in tqdm(vocab):
    vector = []
    for name, word_freq in tf_dict.items():
        vector.append(word_freq.get(w, 0))
        
    tf[w] = vector

100%|██████████| 685/685 [00:00<00:00, 202931.08it/s]
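
Each position in a `tf` vector corresponds to one document, so it helps to record the document order explicitly. The following sketch returns that order together with the vectors; the function name `build_tf_vectors` and the toy data are assumptions for illustration.

```python
def build_tf_vectors(tf_dict, vocab):
    """Return (doc_names, tf) where tf[word][i] is the count of word in doc_names[i]."""
    doc_names = list(tf_dict)  # fix the column order once
    tf = {
        word: [tf_dict[name].get(word, 0) for name in doc_names]
        for word in vocab
    }
    return doc_names, tf

# Hypothetical toy input
toy_tf_dict = {"doc_1": {"a": 2, "b": 1}, "doc_2": {"b": 3}}
doc_names, tf = build_tf_vectors(toy_tf_dict, {"a", "b", "c"})
print(doc_names)  # ['doc_1', 'doc_2']
print(tf["b"])    # [1, 3]
```

With `doc_names` in hand, an index into any vector can be mapped back to a document name without relying on the dictionary's iteration order.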


<a class="anchor" id="step_#4:_search"></a>
### Step #4: Search

Ask the user to enter a query, then use the dot product of vectors to find the most relevant documents.

Example:
- query: "محسن نقش"
- output: `[doc_28, doc_4, ..., doc_19]`

In [152]:
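import numpy as np
# element-wise product of the two word vectors; argmax returns the index
# of the document in which both words co-occur most strongly
np.argmax([i*j for i, j in zip(tf["هستم"], tf["خروس"])])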

0

In [153]:
list(tf_dict.keys())[0]

'Fatemeh Modarres'

In [154]:
query = input("Enter a phrase:")

Enter a phrase:خروس هستم
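
One way to finish this step is sketched below: the query is treated as a vector of word counts, and each document is scored by the dot product of that vector with the document's term counts. The function name `search`, the `top_k` parameter, and the toy data are assumptions. `doc_names` stands for a list of document names in the same order as the columns of `tf`, for example `list(tf_dict.keys())`. The sketch uses plain Python rather than NumPy to stay self-contained.

```python
def search(query, tf, doc_names, top_k=3):
    """Rank documents by the dot product of query and document term counts."""
    scores = [0] * len(doc_names)
    # Adding a word's vector once per occurrence in the query equals
    # multiplying it by the word's count in the query vector
    for word in query.split():
        # Words outside the vocabulary contribute nothing to the score
        for i, freq in enumerate(tf.get(word, [])):
            scores[i] += freq
    ranked = sorted(range(len(doc_names)), key=lambda i: scores[i], reverse=True)
    return [(doc_names[i], scores[i]) for i in ranked[:top_k] if scores[i] > 0]

# Hypothetical toy data
toy_doc_names = ["doc_1", "doc_2", "doc_3"]
toy_tf = {"tall": [1, 3, 0], "building": [1, 0, 0], "river": [0, 0, 1]}
print(search("tall building", toy_tf, toy_doc_names))
# [('doc_2', 3), ('doc_1', 2)]
```

Raw counts favor long documents and frequent words, so weighting the counts (for example with an inverse document frequency term) would usually give a more useful ranking.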
