<a 
 href="https://colab.research.google.com/github/LearnPythonWithRune/MachineLearningWithPython/blob/main/colab/final/13 - Project - Information Retrieval (IR).ipynb"
 target="_parent">
<img 
 src="https://colab.research.google.com/assets/colab-badge.svg"
alt="Open In Colab"/>
</a>

# Project: Information Retrieval (IR)
- Calculate the TF-IDF of the corpus from 'files/holmes'

### Step 1: Import libraries

In [None]:
import os
import nltk
import math
from os import system
nltk.download('punkt')

### Step 2: Read the corpus
- Read all the Sherlock Holmes texts in files/holmes/
- Create a dictionary (dict) called **corpus**
- Use os.listdir(...) ([docs](https://docs.python.org/3/library/os.html)) to iterate over all the filenames in 'files/holmes'
- For each filename, open the file, read its content, and store it in **corpus[filename]**

In [None]:
# Create local directories in Google Colab
!mkdir -p files/holmes

In [None]:
# Colab only: download each story from the GitHub repository into files/holmes/

REMOTE_DIRECTORY = "https://raw.githubusercontent.com/LearnPythonWithRune/MachineLearningWithPython/main/jupyter/final/files/holmes/"

FILES = ["bachelor.txt", "clerk.txt", "face.txt", "problem.txt", "twisted.txt", "blaze.txt", "copper.txt", "gloria_scott.txt", "ritual.txt", "bohemia.txt", "coronet.txt", "interpreter.txt", "speckled.txt", "boscombe.txt", "crooked.txt", "league.txt", "squires.txt", "carbuncle.txt", "engineer.txt", "atient.txt", "treaty.txt"]

for filename in FILES:
    full_name = REMOTE_DIRECTORY + filename
    # curl -o <local path> <remote URL>
    system("curl -o files/holmes/" + filename + " " + full_name)

In [None]:
corpus = {}

for filename in os.listdir('files/holmes/'):
  with open(f'files/holmes/{filename}') as f:
    corpus[filename] = f.read()
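
The same reading step can also be written with `pathlib`. The following is a minimal sketch, not part of the original project: the helper name `read_corpus`, the two demo files, and the UTF-8 encoding are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def read_corpus(directory):
    """Return {filename: text} for every regular file in `directory`."""
    result = {}
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():  # skip subdirectories
            # Assumption: the files are UTF-8 encoded.
            result[path.name] = path.read_text(encoding="utf-8")
    return result


# Demonstration on a throwaway directory with two made-up files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("Holmes smiled.", encoding="utf-8")
    Path(tmp, "b.txt").write_text("Watson nodded.", encoding="utf-8")
    demo_corpus = read_corpus(tmp)

print(demo_corpus)
```

Skipping anything that is not a regular file avoids errors if the directory ever contains a subfolder.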

### Step 3: Tokenize the content
- Iterate over **filename** in **corpus**
- For each filename, replace **corpus[filename]** with the list of lowercased words returned by word_tokenize(...) for that file's content, keeping only the words that are alphabetic.
    - HINT: Use list comprehension
    - HINT: Use **.isalpha()**

In [None]:
for filename in corpus:
  corpus[filename] = [word.lower() for word in nltk.word_tokenize(corpus[filename]) if word.isalpha()]
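
To see what the filter keeps and drops, here is a small sketch on a hand-written token list. The tokens are hypothetical stand-ins for tokenizer output, so the example runs without NLTK.

```python
# Hypothetical tokens, written by hand to resemble tokenizer output.
sample_tokens = ["Holmes", "arrived", "at", "221B", ",", "did", "n't", "he", "?"]

# Keep purely alphabetic tokens and lowercase them.
cleaned = [word.lower() for word in sample_tokens if word.isalpha()]

print(cleaned)  # ['holmes', 'arrived', 'at', 'did', 'he']
```

Tokens containing digits, punctuation, or apostrophes (`221B`, `,`, `n't`, `?`) fail `.isalpha()` and are removed.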

### Step 4: Get all words
- Create a set **words**
    - HINT: **words = set()**
- For each **filename** in **corpus** update the set **words** with the content
    - HINT: apply **update(...)**

In [None]:
words = set()
for filename in corpus:
  words.update(corpus[filename])
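
As a quick illustration of how `update(...)` builds the vocabulary, the sketch below uses a made-up two-file corpus (`toy_corpus` and its contents are assumptions, not project data).

```python
# Made-up tokenized corpus for illustration only.
toy_corpus = {
    "a.txt": ["holmes", "smiled", "holmes"],
    "b.txt": ["watson", "smiled"],
}

toy_words = set()
for tokens in toy_corpus.values():
    toy_words.update(tokens)  # duplicates are ignored by the set

# Equivalent one-liner: union of all token lists.
toy_words_alt = set().union(*toy_corpus.values())

print(sorted(toy_words))  # ['holmes', 'smiled', 'watson']
```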

### Step 5: Calculate term frequency (TF)
- Create an empty dictionary (dict) called **tf**
- Iterate over **filename** in **corpus**
- For each filename, set **tf[filename]** to a dictionary that maps each word to the number of times it occurs in that file.
    - HINT: Use dict comprehension with **word** in **words**

In [None]:
tf = {}

for filename in corpus:
  tf[filename] = {word: corpus[filename].count(word) for word in words}
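
Calling `.count(word)` once per vocabulary word rescans the token list each time, which can be slow on a large corpus. A possible alternative, sketched here on made-up data, is to count each file once with `collections.Counter`; the names `toy_corpus`, `toy_words`, and `toy_tf` are assumptions for this example.

```python
from collections import Counter

# Made-up tokenized corpus for illustration only.
toy_corpus = {
    "a.txt": ["holmes", "smiled", "holmes"],
    "b.txt": ["watson", "smiled"],
}
toy_words = set().union(*toy_corpus.values())

toy_tf = {}
for filename, tokens in toy_corpus.items():
    counts = Counter(tokens)  # single pass over the file's tokens
    # A Counter returns 0 for words that do not occur in the file.
    toy_tf[filename] = {word: counts[word] for word in toy_words}

print(toy_tf["a.txt"]["holmes"])  # 2
```

The resulting dictionaries have the same shape as **tf** above, so the later steps would work unchanged.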

### Step 6: Calculate the inverse document frequency (IDF)
- Create an empty dictionary called **idf**
- Iterate over **word** in **words**
- For each **word**, count the number of documents in the corpus that contain it
    - HINT: **freq = sum(word in corpus[filename] for filename in corpus)**
- Set **idf[word]** to the logarithm of the number of documents divided by the calculated frequency.

In [None]:
idf = {}

for word in words:
  freq = sum(word in corpus[filename] for filename in corpus)
  idf[word] = math.log(len(corpus) / freq)
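
The sketch below works through the IDF calculation on a made-up three-file corpus. It also converts each token list to a set first, which makes the `in` test faster; that conversion is an optional variation, not part of the original steps.

```python
import math

# Made-up tokenized corpus for illustration only.
toy_corpus = {
    "a.txt": ["holmes", "smiled", "holmes"],
    "b.txt": ["watson", "smiled"],
    "c.txt": ["smiled", "lestrade"],
}
toy_words = set().union(*toy_corpus.values())

# Sets give fast membership tests compared with scanning a list.
token_sets = {name: set(tokens) for name, tokens in toy_corpus.items()}

toy_idf = {}
for word in toy_words:
    doc_freq = sum(word in tokens for tokens in token_sets.values())
    toy_idf[word] = math.log(len(toy_corpus) / doc_freq)

print(toy_idf["smiled"])  # 0.0, because the word occurs in every file
```

A word found in all documents gets an IDF of zero, while a word found in only one of the three files gets log(3).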

### Step 7: Calculate the Term Frequency-Inverse Document Frequency (TF-IDF)
- Create a dictionary tfidf
- Iterate over **filename** in **corpus**
- For each **filename** calculate the TF-IDF for each word and add it as pairs **(word, tf-idf)**
    - HINT: Use list comprehension **[(word, tf[filename][word] * idf[word]) for word in words]**

In [None]:
tfidf = {}

for filename in corpus:
  tfidf[filename] = [(word, tf[filename][word] * idf[word]) for word in words]
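
Written as formulas, Steps 5 to 7 compute the following. The symbols are introduced here for illustration: $t$ is a word, $d$ a document, $N$ the number of documents (`len(corpus)`), and $\mathrm{df}(t)$ the number of documents containing $t$ (`freq` in Step 6). The logarithm is the natural logarithm, since `math.log` is called with one argument.

$$
\mathrm{tf}(t, d) = \text{number of times } t \text{ occurs in } d, \qquad
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\text{tf-idf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t)
$$

A word that occurs in every document has $\mathrm{idf}(t) = \ln 1 = 0$, so its TF-IDF score is zero in every file regardless of how often it appears.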

### Step 8: Sort the values
- Iterate over **filename** in **corpus**
- For each **filename**, sort the pairs in **tfidf[filename]** by their second item (the score) in descending order
    - HINT: Use **sorted** ([docs](https://docs.python.org/3/howto/sorting.html)) with **key=lambda x: x[1]** and **reverse=True**

In [None]:
for filename in corpus:
  tfidf[filename] = sorted(tfidf[filename], key=lambda x: x[1], reverse=True)
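
If only the highest-scoring terms are needed, `heapq.nlargest` is a possible alternative to sorting the full list. The sketch below compares both on made-up scores; `toy_scores` and its values are assumptions for illustration.

```python
import heapq

# Made-up (word, score) pairs for illustration only.
toy_scores = [("holmes", 2.2), ("smiled", 0.0), ("watson", 1.1), ("lestrade", 3.3)]

# Full sort, highest score first (the approach used in this step).
ranked = sorted(toy_scores, key=lambda pair: pair[1], reverse=True)

# Only the two best pairs, without sorting everything.
top_two = heapq.nlargest(2, toy_scores, key=lambda pair: pair[1])

print(top_two)  # [('lestrade', 3.3), ('holmes', 2.2)]
```

Both approaches give the same leading entries; the full sort is the simpler choice when the whole ranking is reused later.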

### Step 9: Print the top five words
- Iterate over **filename** in **corpus**
- For each **filename**, print the filename, then iterate over the first five elements of **tfidf[filename]** and print each **term** and **score**

In [None]:
for filename in corpus:
  print(filename)
  for term, score in tfidf[filename][:5]:
    print(f' {term}: {score}')