# Project: Information Retrieval (IR)
- Calculate the TF-IDF of the corpus from 'files/holmes'

### Step 1: Import libraries

In [6]:
import os
import math
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /home/adel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Step 2: Read the corpus
- Read all the Sherlock Holmes texts in files/holmes/
- Create a dictionary (dict) called **corpus**
- Use os.listdir(...) ([docs](https://docs.python.org/3/library/os.html)) to iterate over all the filenames in 'files/holmes'
- For each filename, open the file, read its content, and store it in **corpus[filename]**

In [7]:
corpus = {}

for filename in os.listdir('files/holmes/'):
  with open(f'files/holmes/{filename}') as f:
    corpus[filename] = f.read()
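The reading step can also be wrapped in a small helper. The following is only a sketch: the function name `read_corpus`, the UTF-8 encoding, and the demo file names are assumptions for illustration, not something the notebook specifies. It skips subdirectories, which `os.listdir(...)` returns alongside regular files.

```python
import os
import tempfile

def read_corpus(directory):
    """Map each regular file name in `directory` to its text content."""
    corpus = {}
    for filename in sorted(os.listdir(directory)):
        path = os.path.join(directory, filename)
        if os.path.isfile(path):  # os.listdir also returns subdirectories
            # Assumption: the files are UTF-8 encoded
            with open(path, encoding='utf-8') as f:
                corpus[filename] = f.read()
    return corpus

# Demonstration on a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    for name, text in [('a.txt', 'Holmes smiled.'), ('b.txt', 'Watson nodded.')]:
        with open(os.path.join(tmp, name), 'w', encoding='utf-8') as f:
            f.write(text)
    os.mkdir(os.path.join(tmp, 'subdir'))  # ignored by read_corpus
    demo = read_corpus(tmp)

print(demo)
```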

### Step 3: Tokenize the content
- Iterate over **filename** in **corpus**
- For each filename, replace **corpus[filename]** with the list of lowercased words returned by word_tokenize(...) for that file's content, keeping only words that are purely alphabetic.
    - HINT: Use list comprehension
    - HINT: Use **.isalpha()**

In [8]:
for filename in corpus:
  corpus[filename] = [word.lower() for word in nltk.word_tokenize(corpus[filename]) if word.isalpha()]
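To see what the **.isalpha()** filter keeps and drops, here is a small sketch. The token list is hand-written for illustration and only stands in for tokenizer output; it is not the result of running word_tokenize(...) on a real sentence.

```python
# Hand-written tokens standing in for tokenizer output (illustrative assumption)
tokens = ['Holmes', "'s", 'pipe', ',', 'half-smoked', '221B', 'lay', 'there', '.']

# Same filter as in the cell above: keep alphabetic tokens, lowercased
cleaned = [word.lower() for word in tokens if word.isalpha()]

print(cleaned)  # ['holmes', 'pipe', 'lay', 'there']
```

Note that the filter removes more than punctuation: hyphenated words such as 'half-smoked' and mixed tokens such as '221B' are dropped as well.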

### Step 4: Get all words
- Create a set **words**
    - HINT: **words = set()**
- For each **filename** in **corpus**, update the set **words** with the tokens in **corpus[filename]**
    - HINT: apply **update(...)**

In [9]:
words = set()
for filename in corpus:
  words.update(corpus[filename])
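An equivalent one-line formulation uses **set().union(...)**. This is a sketch on a toy corpus whose file names and tokens are made up for illustration.

```python
# Toy tokenized corpus (hypothetical data)
corpus = {
    'a.txt': ['holmes', 'pipe', 'holmes'],
    'b.txt': ['watson', 'pipe'],
}

# Union of all token lists; duplicates collapse because the result is a set
words = set().union(*corpus.values())

print(sorted(words))  # ['holmes', 'pipe', 'watson']
```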

### Step 5: Calculate term frequency (TF)
- Create an empty dictionary (dict) called **tf**
- Iterate over **filename** in **corpus**
- For each filename, set **tf[filename]** to a dictionary mapping each word to the number of times it occurs in that file.
    - HINT: Use dict comprehension with **word** in **words**

In [10]:
tf = {}

for filename in corpus:
  tf[filename] = {word: corpus[filename].count(word) for word in words}
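Calling **.count(word)** once per vocabulary word rescans the token list each time, which can be slow on a large vocabulary. As a possible alternative (not the approach used in the cell above), `collections.Counter` counts all tokens in one pass. The sketch below uses a made-up toy corpus.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data)
corpus = {
    'a.txt': ['holmes', 'pipe', 'holmes'],
    'b.txt': ['watson', 'pipe'],
}

# One pass over each token list instead of one scan per vocabulary word
tf = {filename: Counter(tokens) for filename, tokens in corpus.items()}

print(tf['a.txt']['holmes'])  # 2
print(tf['b.txt']['holmes'])  # 0, a Counter returns 0 for missing keys
```

Because a Counter returns 0 for words it has not seen, a lookup such as **tf[filename][word]** still works for every word in the vocabulary, even though the zero entries are not stored explicitly.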

### Step 6: Calculate the inverse document frequency (IDF)
- Create an empty dictionary called **idf**
- Iterate over **word** in **words**
- For each **word**, count the number of documents in the corpus that contain it
    - HINT: **freq = sum(word in corpus[filename] for filename in corpus)**
- Set **idf[word]** to the natural logarithm of the number of documents divided by the calculated frequency.

In [11]:
idf = {}

for word in words:
  freq = sum(word in corpus[filename] for filename in corpus)
  idf[word] = math.log(len(corpus) / freq)
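A toy example may help to see how the IDF behaves. The sketch below uses three made-up documents and converts each token list to a set first, so the membership test does not scan a list; that conversion is an optional variation, not part of the cell above.

```python
import math

# Toy tokenized corpus with three documents (hypothetical data)
corpus = {
    'a.txt': ['holmes', 'pipe', 'holmes'],
    'b.txt': ['holmes', 'watson', 'pipe'],
    'c.txt': ['holmes', 'violin'],
}

# Sets make the `word in ...` test fast
doc_sets = {filename: set(tokens) for filename, tokens in corpus.items()}
words = set().union(*doc_sets.values())

idf = {}
for word in words:
    # Number of documents that contain the word
    freq = sum(word in tokens for tokens in doc_sets.values())
    idf[word] = math.log(len(corpus) / freq)

print(idf['holmes'])  # 0.0, the word appears in all 3 documents
print(idf['pipe'])    # log(3 / 2)
print(idf['violin'])  # log(3 / 1)
```

A word that occurs in every document gets an IDF of 0, while a word that occurs in a single document gets the largest value, log(3) in this toy corpus.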

### Step 7: Calculate the Term Frequency-Inverse Document Frequency (TF-IDF)
- Create a dictionary tfidf
- Iterate over **filename** in **corpus**
- For each **filename** calculate the TF-IDF for each word and add it as pairs **(word, tf-idf)**
    - HINT: Use list comprehension **[(word, tf[filename][word] * idf[word]) for word in words]**

In [12]:
tfidf = {}

for filename in corpus:
  tfidf[filename] = [(word, tf[filename][word] * idf[word]) for word in words]
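The product can be checked by hand on a small case. In this sketch the counts and document frequencies are made up: one document in a three-document corpus, where 'holmes' appears in all three documents, 'pipe' in two, and 'watson' in one.

```python
import math

# Term counts for a single toy document (hypothetical data)
tf = {'a.txt': {'holmes': 2, 'pipe': 1, 'watson': 0}}

# IDF values for a corpus of 3 documents (hypothetical document frequencies)
idf = {
    'holmes': math.log(3 / 3),  # in every document
    'pipe': math.log(3 / 2),    # in two documents
    'watson': math.log(3 / 1),  # in one document
}

tfidf = {
    filename: [(word, counts[word] * idf[word]) for word in sorted(idf)]
    for filename, counts in tf.items()
}

for word, score in tfidf['a.txt']:
    print(word, score)
```

Here 'holmes' scores 0 despite occurring twice, because its IDF is 0, and 'watson' scores 0 because it does not occur in the document. Only 'pipe' gets a positive score, log(3 / 2).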

### Step 8: Sort the values
- Iterate over **filename** in **corpus**
- For each **filename**, sort the pairs in **tfidf[filename]** by their second item (the score) in descending order
    - HINT: Use **sorted** ([docs](https://docs.python.org/3/howto/sorting.html)) with **key=lambda x: x[1]** and **reverse=True**

In [13]:
for filename in corpus:
  tfidf[filename] = sorted(tfidf[filename], key=lambda x: x[1], reverse=True)
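Since **words** is a set, the order of words with equal scores can differ between runs. If a reproducible order matters, one option is to break ties alphabetically. The sketch below uses made-up scores and shows both the plain sort and the tie-breaking variant.

```python
from operator import itemgetter

# Made-up (word, score) pairs for illustration
scores = [('pipe', 0.41), ('watson', 0.0), ('trevor', 70.02), ('holmes', 0.0)]

# Same ordering rule as the cell above; itemgetter(1) is equivalent to lambda x: x[1]
ranked = sorted(scores, key=itemgetter(1), reverse=True)

# Variation: descending score, then alphabetical word to make ties deterministic
ranked_stable = sorted(scores, key=lambda x: (-x[1], x[0]))

print(ranked)
print(ranked_stable)
```

Python's sort is stable, so in `ranked` the tied pairs keep their input order ('watson' before 'holmes'), while `ranked_stable` always places 'holmes' before 'watson'.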

### Step 9: Print the top five words
- Iterate over **filename** in **corpus**
- For each **filename**, print the filename, then iterate over the first five elements of **tfidf[filename]** and print the **term** and **score**

In [14]:
for filename in corpus:
  print(filename)
  for term, score in tfidf[filename][:5]:
    print(f' {term}: {score}')

gloria_scott.txt
 trevor: 70.02401606763873
 beddoes: 33.489746814957655
 hudson: 24.396436929918487
 prendergast: 21.31165706406396
 boat: 18.81100205730782
crooked.txt
 barclay: 103.51376288259638
 colonel: 25.05525936990736
 aldershot: 18.81100205730782
 nancy: 18.26713462634054
 regiment: 14.108251542980867
bohemia.txt
 majesty: 54.80140387902161
 briony: 33.489746814957655
 irene: 32.919253600288684
 adler: 30.56787834312521
 photograph: 26.30802233840273
squires.txt
 cunningham: 94.3801955694261
 alec: 57.845926316745036
 acton: 45.667836565851346
 william: 31.506333455467118
 colonel: 31.3190742123842
patient.txt
 blessington: 79.157583380809
 trevelyan: 48.71235900357477
 brook: 24.356179501787384
 consultation: 15.222612188617115
 resident: 14.108251542980867
speckled.txt
 roylott: 60.89044875446846
 stoner: 57.845926316745036
 ventilator: 42.62331412812792
 stepfather: 36.53426925268108
 stoke: 33.489746814957655
twisted.txt
 clair: 82.20210581853242
 neville: 57.845926316745