# Information Retrieval (IR)
### Goal of lesson
- Learn what Information Retrieval is
- Understand topic modeling of documents
- Use Term Frequency and understand its limitations
- Implement Term Frequency by Inverse Document Frequency (TF-IDF)

### What is Information Retrieval (IR)
- The task of finding relevant documents in response to a user query
- Web search engines are the most visible IR applications ([wiki](https://en.wikipedia.org/wiki/Information_retrieval))

### Topic Modeling
- Models for discovering the topics in a set of documents
    - They provide methods to organize, understand, and summarize large collections of textual information.
- Topic modeling can be described as a method for finding a group of words that best represents the information in the documents.

## Approach 1: Term Frequency

### Term Frequency
- The number of times a term occurs in a document is called its term frequency ([wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency))

$\text{tf}(t, d) = f_{t, d}$: The number of times term $t$ occurs in document $d$.

- There are other ways to define term frequency (see [wiki](https://en.wikipedia.org/wiki/Tf–idf#Term_frequency_2))
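
As a minimal sketch of the definition above, raw term frequency can be computed with `collections.Counter` from the standard library. The example sentence and the whitespace tokenization are illustrative assumptions; the cells below use `nltk.word_tokenize` instead. The relative-frequency variant is one of the alternative definitions described on the linked page.

```python
from collections import Counter

# Hypothetical example text, split on whitespace for simplicity
tokens = "the dog chased the cat and the cat ran".split()

# Raw count: tf(t, d) = f_{t, d}
tf = Counter(tokens)

# Relative frequency: raw count divided by the number of tokens in the document
tf_relative = {term: count / len(tokens) for term, count in tf.items()}

print(tf["the"], tf["cat"], tf["dog"])
print(round(tf_relative["the"], 3))
```

The raw count favors long documents, which is one reason a normalized variant is sometimes preferred.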

> #### Programming Notes:
> - Libraries used
>     - [**nltk**](https://www.nltk.org) - Natural Language Toolkit
>     - [**os**](https://docs.python.org/3/library/os.html) - Miscellaneous operating system interfaces
>     - [**math**](https://docs.python.org/3/library/math.html) - Mathematical functions
> - Functionality and concepts used
>     - **List/Dict Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**sorted**](https://docs.python.org/3/howto/sorting.html) to sort items by a key
>     - [**lambda**](https://docs.python.org/3/tutorial/controlflow.html#lambda-expressions) for small anonymous functions

In [1]:
import os
import nltk
import math

In [4]:
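corpus = {}

# Build a {filename: {word: count}} mapping for every story in the folder
for filename in os.listdir('files/holmes/'):
    with open(f'files/holmes/{filename}') as f:
        # Tokenize, keep alphabetic tokens only, and lowercase them
        content = [word.lower() for word in nltk.word_tokenize(f.read()) if word.isalpha()]

        # Term frequency: number of occurrences of each distinct word
        freq = {word: content.count(word) for word in set(content)}

        corpus[filename] = freq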

In [7]:
# Replace each frequency dict with a list of (word, count) pairs, highest count first
for filename in corpus:
    corpus[filename] = sorted(corpus[filename].items(), key=lambda x: x[1], reverse=True)

In [8]:
# Print the five most frequent words of each document
for filename in corpus:
    print(filename)
    for word, score in corpus[filename][:5]:
        print(f'  {word}: {score}')

speckled.txt
  the: 600
  and: 281
  of: 276
  a: 252
  i: 233
face.txt
  the: 326
  i: 298
  and: 226
  to: 185
  a: 173
twisted.txt
  the: 493
  a: 275
  and: 270
  i: 238
  of: 234
squires.txt
  the: 508
  of: 206
  and: 169
  to: 168
  a: 152
coronet.txt
  the: 466
  i: 356
  to: 270
  and: 238
  a: 213
carbuncle.txt
  the: 463
  of: 233
  a: 208
  and: 199
  i: 188
treaty.txt
  the: 688
  i: 348
  of: 319
  and: 318
  to: 316
bachelor.txt
  the: 401
  i: 236
  and: 234
  to: 233
  a: 211
patient.txt
  the: 346
  i: 187
  to: 184
  and: 172
  of: 171
bohemia.txt
  the: 443
  i: 261
  and: 254
  to: 245
  of: 237
problem.txt
  the: 427
  i: 231
  to: 209
  of: 191
  and: 187
crooked.txt
  the: 438
  and: 204
  of: 199
  i: 184
  a: 175
engineer.txt
  the: 431
  i: 313
  and: 250
  a: 233
  to: 215
interpreter.txt
  the: 353
  and: 188
  a: 186
  to: 178
  i: 153
gloria_scott.txt
  the: 430
  and: 273
  of: 220
  a: 203
  i: 188
clerk.txt
  the: 312
  i: 210
  a: 186
  and: 180
  of:

### Problem: Stop or Function Words
- Words that have little meaning on their own ([wiki](https://en.wikipedia.org/wiki/Stop_word))
- Examples: am, by, do, is, which, ...
- Student exercise: Remove function words and see how the result changes (HINT: nltk has a list of stopwords)
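
One possible starting point for the exercise is sketched below. It assumes a small hand-picked stop word set and partly made-up counts so that it runs without downloads: the function-word counts mirror the `speckled.txt` output above, while `holmes` and `window` and their counts are hypothetical. The helper name `remove_stop_words` is also just an illustration. With nltk, the hand-picked set could be swapped for `nltk.corpus.stopwords.words('english')` after running `nltk.download('stopwords')`.

```python
# Small hand-picked stop word set, for illustration only (not nltk's list)
STOP_WORDS = {"the", "and", "of", "a", "i", "to", "is", "am", "by", "do", "which"}

def remove_stop_words(freq):
    """Return a copy of a {word: count} dict without stop words."""
    return {word: count for word, count in freq.items() if word not in STOP_WORDS}

# Partly hypothetical counts: "holmes" and "window" are made up for this sketch
freq = {"the": 600, "and": 281, "holmes": 55, "of": 276, "window": 21}

filtered = remove_stop_words(freq)
top = sorted(filtered.items(), key=lambda x: x[1], reverse=True)
print(top)
```

Filtering before sorting keeps the ranking code from the earlier cells unchanged.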

## Approach 2: TF-IDF
- TF-IDF is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. ([wiki](https://en.wikipedia.org/wiki/Tf–idf))

### Inverse Document Frequency
- Measure of how common or rare a word is across documents

$\text{idf}(t, D) = \log{\frac{N}{|\{d\in D : t\in d\}|}} = \log{\frac{\text{Total Documents}}{\text{Number of Documents Containing "term"}}}$
- $D$: all documents in the corpus
- $N$: total number of documents in the corpus, $N = |D|$
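
The formula can be sketched in a few lines of Python. This assumes the natural logarithm (the formula does not fix a base) and a hypothetical toy corpus in which each document is a set of terms. The function name `idf` and the error raised for unseen terms are choices made for this sketch.

```python
import math

def idf(term, documents):
    """idf(t, D) = log(N / number of documents containing t)."""
    n_containing = sum(term in doc for doc in documents)
    if n_containing == 0:
        # The formula is undefined for terms that occur in no document
        raise ValueError(f"{term!r} does not occur in any document")
    return math.log(len(documents) / n_containing)

# Hypothetical toy corpus: each document is a set of terms
documents = [
    {"holmes", "watson", "pipe"},
    {"holmes", "window"},
    {"holmes", "watson", "treaty"},
]

print(idf("holmes", documents))  # occurs in every document
print(idf("treaty", documents))  # occurs in one of three documents
```

A term that occurs in every document gets an IDF of 0, while rarer terms get larger values.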

### TF-IDF
- Ranks how important each word is in a document by multiplying Term Frequency (TF) by Inverse Document Frequency (IDF)

$\text{tf-idf}(t, d) = \text{tf}(t, d)\cdot \text{idf}(t, D)$

### Example

- Document 1: *This is the sample of the day*
- Document 2: *This is another sample of the day*
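
Worked by hand with the formulas above, assuming the natural logarithm and $N = 2$, where $d_1$ and $d_2$ denote Document 1 and Document 2:

$\text{idf}(\text{another}, D) = \log{\frac{2}{1}} \approx 0.693$ and $\text{idf}(\text{the}, D) = \log{\frac{2}{2}} = 0$

$\text{tf-idf}(\text{another}, d_2) = 1 \cdot 0.693 \approx 0.693$ and $\text{tf-idf}(\text{the}, d_1) = 2 \cdot 0 = 0$

The word *the* is the most frequent term in Document 1, yet it scores 0 because it occurs in both documents.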

In [9]:
doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()

In [10]:
corpus = [doc1, doc2]
corpus

[['This', 'is', 'the', 'sample', 'of', 'the', 'day'],
 ['This', 'is', 'another', 'sample', 'of', 'the', 'day']]

In [11]:
tf1 = {word: doc1.count(word) for word in set(doc1)}
tf2 = {word: doc2.count(word) for word in set(doc2)}

In [12]:
tf1

{'This': 1, 'of': 1, 'sample': 1, 'is': 1, 'day': 1, 'the': 2}

In [13]:
tf2

{'This': 1, 'of': 1, 'sample': 1, 'is': 1, 'another': 1, 'day': 1, 'the': 1}

In [19]:
term = 'another'
# idf(t, D) = log(N / number of documents containing t)
idf = math.log(len(corpus)/sum(term in doc for doc in corpus))

tf1.get(term, 0)*idf, tf2.get(term, 0)*idf

(0.0, 0.6931471805599453)
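
To score every term instead of a single one, the TF-IDF formula above can be wrapped in a function. This is a minimal sketch using the natural logarithm from the `math` module. The helper name `tf_idf` is an assumption made for illustration, and tokens stay case-sensitive as in the cells above.

```python
import math

doc1 = "This is the sample of the day".split()
doc2 = "This is another sample of the day".split()
corpus = [doc1, doc2]

def tf_idf(doc, corpus):
    """Return {term: tf(t, d) * idf(t, D)} for every term in doc."""
    scores = {}
    for term in set(doc):
        tf = doc.count(term)
        # Every term of doc occurs in at least one document, so this is never 0
        n_containing = sum(term in d for d in corpus)
        scores[term] = tf * math.log(len(corpus) / n_containing)
    return scores

for doc in corpus:
    scores = tf_idf(doc, corpus)
    # Highest-scoring terms first
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    print(ranked[:3])
```

Every term of Document 1 also occurs in Document 2, so all of its scores are 0. In Document 2 only `another` gets a positive score.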