
# Information Retrieval (IR)
### Goal of lesson
- Learn what Information Retrieval is
- Topic modeling documents
- How to use Term Frequency and understand the limitations
- Implement Term Frequency by Inverse Document Frequency (TF-IDF)

### What is Information Retrieval (IR)
- The task of finding relevant documents in response to a user query
- Web search engines are the most visible IR applications ([wiki](https://en.wikipedia.org/wiki/Information_retrieval))

### Topic Modeling
- Models for discovering the topics in a set of documents
    - e.g., it provides us with methods to organize, understand and summarize large collections of textual information.
- Topic modeling can be described as a method for finding a group of words that best represents the information.

## Approach 1: Term Frequency

### Term Frequency
- The number of times a term occurs in a document is called its term frequency ([wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency))

$\text{tf}(t, d) = f_{t, d}$: The number of times term $t$ occurs in document $d$.

- There are other ways to define term frequency (see [wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency_2))
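
As a minimal sketch of the definition above, raw term frequency can be computed with `collections.Counter`. The length-normalized variant shown next to it is one of the alternative definitions; the sample sentence and both helper names are made up for illustration.

```python
from collections import Counter


def term_frequency(tokens):
    """Raw count: tf(t, d) = f_{t,d}."""
    return Counter(tokens)


def relative_term_frequency(tokens):
    """Raw count divided by document length (one alternative definition)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Hypothetical toy document, already tokenized
tokens = "the dog chased the cat".split()

tf = term_frequency(tokens)
rel_tf = relative_term_frequency(tokens)

print(tf["the"], rel_tf["the"])
```

A `Counter` returns 0 for terms that never occur, which matches $f_{t, d} = 0$ for absent terms.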

> #### Programming Notes:
> - Libraries used
>     - [**nltk**](https://www.nltk.org) - Natural Language Toolkit
>     - [**os**](https://docs.python.org/3/library/os.html) Miscellaneous operating system interfaces
>     - [**math**](https://docs.python.org/3/library/math.html) Do math with Python
> - Functionality and concepts used
>     - **List/Dict Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**sorted**](https://docs.python.org/3/howto/sorting.html) sort stuff
>     - [**lambda**](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) lambda functions

In [None]:
import os
import nltk
import math
from os import system
nltk.download('punkt')

In [None]:
# Create local directories in Google Colab
!mkdir -p files/holmes

In [None]:
# Colab only: download the Holmes stories into the local directory files/holmes/

REMOTE_DIRECTORY = "https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/jupyter/final/files/holmes/"

FILES = ["bachelor.txt", "clerk.txt", "face.txt", "problem.txt", "twisted.txt", "blaze.txt", "copper.txt", "gloria_scott.txt", "ritual.txt", "bohemia.txt", "coronet.txt", "interpreter.txt", "speckled.txt", "boscombe.txt", "crooked.txt", "league.txt", "squires.txt", "carbuncle.txt", "engineer.txt", "patient.txt", "treaty.txt"]

for filename in FILES:
    full_name = REMOTE_DIRECTORY + filename
    system("curl -o files/holmes/" + filename + " " + full_name)

In [None]:
corpus = {}

for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        # Tokenize, keep alphabetic tokens only, and lowercase them
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]

        # Term frequency: number of occurrences of each distinct word
        freq = {word: content.count(word) for word in set(content)}

        corpus[filename] = freq

In [None]:
for filename in corpus:
  corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

In [None]:
for filename in corpus:
  print(filename)
  for word, score in corpus[filename][:5]:
    print(f' {word}: {score}')

### Problem: Stop or Function Words
- Words that carry little meaning on their own ([wiki](https://en.wikipedia.org/wiki/Stop_word))
- Examples: am, by, do, is, which, ...
- Student exercise: Remove function words and see result (HINT: nltk has a list of stopwords)
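
One possible way to approach the exercise, as a sketch only: filter tokens against a stop-word set before counting. The tiny `STOP_WORDS` set and the `top_terms` helper below are hypothetical stand-ins so the example runs without downloads; in the notebook, `nltk.corpus.stopwords.words('english')` (after `nltk.download('stopwords')`) would be a natural replacement for the hand-written set.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"a", "am", "by", "do", "is", "of", "the", "this", "which"}


def top_terms(tokens, n=3, stop_words=STOP_WORDS):
    """Return the n most frequent tokens that are not stop words."""
    counts = Counter(token for token in tokens if token not in stop_words)
    return counts.most_common(n)


# Hypothetical toy document, lowercased and tokenized
tokens = "this is the sample of the day which is the sample".split()

print(top_terms(tokens))
```

Without the filter, the most frequent token here would be "the"; with it, only the content words remain.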

## Approach 2: TF-IDF
- TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. ([wiki](https://en.wikipedia.org/wiki/Tf–idf))

### Inverse Document Frequency
- Measure of how common or rare a word is across documents

$\text{idf}(t, D) = \log{\frac{N}{|\{d \in D : t \in d\}|}} = \log{\frac{\text{Total Documents}}{\text{Number of Documents Containing "term"}}}$
- $D$: All documents in the corpus
- $N$: total number of documents in the corpus $N = |D|$
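
The formula above can be sketched in a few lines with `math.log`. The three toy documents and the function name are assumptions for illustration, and returning 0 for a term that appears in no document is a choice made here to avoid division by zero, not part of the definition.

```python
import math


def inverse_document_frequency(term, documents):
    """idf(t, D) = log(N / number of documents containing t)."""
    containing = sum(term in doc for doc in documents)
    if containing == 0:
        # Assumption: give unseen terms an idf of 0 instead of dividing by zero
        return 0.0
    return math.log(len(documents) / containing)


# Hypothetical corpus: each document is represented by its set of words
documents = [
    {"holmes", "watson", "pipe"},
    {"holmes", "violin"},
    {"holmes", "watson", "train"},
]

print(inverse_document_frequency("holmes", documents))  # appears in every document
print(inverse_document_frequency("violin", documents))  # appears in one document
```

A word that occurs in every document gets $\log 1 = 0$, so it contributes nothing to the ranking, while rarer words get larger values.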

### TF-IDF
- Ranks how important each word is in a document by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF)

$\text{tf-idf}(t, d) = \text{tf}(t, d)\cdot \text{idf}(t, D)$

### Example

- Document 1: *This is the sample of the day*
- Document 2: *This is another sample of the day*

In [None]:
doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()

In [None]:
corpus = [doc1, doc2]
corpus

In [None]:
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

In [None]:
tf1

In [None]:
tf2

In [None]:
term = 'sample'
# idf = log(N / number of documents containing the term)
idf = math.log(len(corpus) / sum(term in doc for doc in corpus))

tf1.get(term, 0)*idf, tf2.get(term, 0)*idf
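
To extend the single-term example to every word, the sketch below scores all terms in each document with $\text{tf}(t, d)\cdot \text{idf}(t, D)$, using the logarithmic IDF defined in this lesson. The `tf_idf` helper name is an assumption for illustration; the two documents are the ones from the example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: tf-idf score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return scores


docs = [
    "This is the sample of the day".split(),
    "This is another sample of the day".split(),
]

scores = tf_idf(docs)
for doc_scores in scores:
    ranked = sorted(doc_scores.items(), key=lambda x: x[1], reverse=True)
    print(ranked[:3])
```

Every word of Document 1 also occurs in Document 2, so all of its scores are zero; only *another*, which is unique to Document 2, receives a positive score.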