# Information Retrieval

Unstructured data is data with no additional, data-specific structure imposed on it. For example:

| Data type | Description | Analysis |
| :--- | :--- | :--- |
| Plain text | No structure beyond a sequence of characters. | Part-of-speech (POS) tagging and syntax trees can be used to add structure. |
| Graphics, photos, audio, video | A stream of values (bits, colors, pressure levels) in one or more dimensions. File formats and compression techniques may be structured, but the raw data is not. | Object recognition or sound filtering can be applied to the data. |
| Sensor data, experimental results | Observational data. | Statistical tests can be applied to confirm or refute a hypothesis. |


### Information Retrieval
An _information retrieval (IR)_ task is: given a query, find the relevant documents in a collection. This assumes:
- There is a large collection to be searched
- There is a query (often keywords) to search for
- The task is to find all and only the relevant documents

An example would be searching a library catalogue: if the user supplies an author's name, the catalogue should list all books by that author, and no others.
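
The catalogue lookup can be sketched in a few lines of Python. This is only an illustration: the `Book` record, the `books_by_author` helper, and the catalogue entries are all hypothetical, and exact case-insensitive matching on the author field is an assumption rather than how any real catalogue works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Book:
    title: str
    author: str


def books_by_author(catalogue, author):
    """Return every book whose author matches the query, ignoring case."""
    wanted = author.strip().lower()
    return [book for book in catalogue if book.author.lower() == wanted]


# Hypothetical catalogue entries, used only for illustration.
catalogue = [
    Book("Title A", "Author One"),
    Book("Title B", "Author Two"),
    Book("Title C", "Author One"),
]

results = books_by_author(catalogue, "author one")
print([book.title for book in results])  # ['Title A', 'Title C']
```

Here the query returns all of the matching books and nothing else, which is the "all and only" goal stated above.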

### IR Evaluation
The performance of an IR system is evaluated with two measures, precision and recall.

- __Precision:__ The proportion of the documents returned by the system that are relevant.
- __Recall:__ The proportion of all the relevant documents that are returned by the system.

Both measures are defined in terms of four counts:
- __True positives (TP):__ Number of relevant documents retrieved.
- __False positives (FP):__ Number of non-relevant documents retrieved.
- __True negatives (TN):__ Number of non-relevant documents not retrieved.
- __False negatives (FN):__ Number of relevant documents not retrieved.
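
One way to obtain the four counts is with set operations over document IDs. The sketch below is a minimal illustration; the function name `confusion_counts` and the toy IDs are hypothetical, and it assumes that both the relevant set and the retrieved set are subsets of the collection.

```python
def confusion_counts(collection, relevant, retrieved):
    """Count TP, FP, TN and FN for one query over sets of document IDs."""
    collection, relevant, retrieved = set(collection), set(relevant), set(retrieved)
    tp = len(retrieved & relevant)               # relevant and retrieved
    fp = len(retrieved - relevant)               # retrieved but not relevant
    fn = len(relevant - retrieved)               # relevant but missed
    tn = len(collection - retrieved - relevant)  # non-relevant and left out
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}


# Hypothetical toy collection of ten document IDs.
collection = range(10)
relevant = {0, 1, 2, 3}
retrieved = {2, 3, 4}

counts = confusion_counts(collection, relevant, retrieved)
print(counts)  # {'TP': 2, 'FP': 1, 'TN': 5, 'FN': 2}
```

The four counts always sum to the size of the collection, which is a handy sanity check.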

This gives us the following formulas for precision (P) and recall (R):
$$
P = \dfrac{TP}{TP + FP}
$$

$$
R = \dfrac{TP}{TP + FN}
$$
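
The two formulas translate directly into code. In this sketch the function names and the sample counts are hypothetical, and returning `0.0` when the denominator is zero is an assumed convention (the formulas themselves are undefined in that case).

```python
def precision(tp, fp):
    """Fraction of retrieved documents that are relevant: TP / (TP + FP)."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0  # assumed convention for no results


def recall(tp, fn):
    """Fraction of relevant documents that were retrieved: TP / (TP + FN)."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0  # assumed convention for no relevant docs


# Hypothetical counts: 8 relevant hits, 2 false alarms, 4 misses.
print(precision(tp=8, fp=2))         # 0.8
print(round(recall(tp=8, fn=4), 2))  # 0.67
```

Note that TN appears in neither formula, so the size of the non-relevant remainder of the collection does not affect either measure.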

#### Example
> There are 130 documents, 28 of which are relevant. System 1 returns 25 documents, of which 16 are relevant; system 2 returns 15 documents, of which 12 are relevant.

__System 1__

| | Relevant | Not Relevant | Total |
| :--- | --- | --- | --- |
| Retrieved | 16 | 9 | 25 |
| Not Retrieved | 12 | 93 | 105 |
| Total | 28 | 102 | 130 |

$$
P = \dfrac{16}{16 + 9} = 0.64
$$

$$
R = \dfrac{16}{16 + 12} \approx 0.57
$$

__System 2__

| | Relevant | Not Relevant | Total |
| :--- | --- | --- | --- |
| Retrieved | 12 | 3 | 15 |
| Not Retrieved | 16 | 99 | 115 |
| Total | 28 | 102 | 130 |

$$
P = \dfrac{12}{12 + 3} = 0.80
$$

$$
R = \dfrac{12}{12 + 16} \approx 0.43
$$

System 2 is more precise, but system 1 has a higher recall.
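
The "Not Retrieved" rows in the tables above follow from the totals by subtraction. The sketch below reproduces both tables; the helper name `contingency_table` is hypothetical, and the figure of 28 relevant documents is read from the "Total" row of the tables.

```python
def contingency_table(total_docs, total_relevant, retrieved, relevant_retrieved):
    """Fill in the 2x2 table from the collection totals and one system's results."""
    tp = relevant_retrieved
    fp = retrieved - tp            # retrieved documents that are not relevant
    fn = total_relevant - tp       # relevant documents the system missed
    tn = total_docs - tp - fp - fn # everything else
    return tp, fp, fn, tn


for name, retrieved, hits in [("System 1", 25, 16), ("System 2", 15, 12)]:
    tp, fp, fn, tn = contingency_table(130, 28, retrieved, hits)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    print(f"{name}: TP={tp} FP={fp} FN={fn} TN={tn} P={p:.2f} R={r:.2f}")
# System 1: TP=16 FP=9 FN=12 TN=93 P=0.64 R=0.57
# System 2: TP=12 FP=3 FN=16 TN=99 P=0.80 R=0.43
```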

#### F-score
An F-score is an evaluation measure that combines precision and recall into a single number.
$$
F_\alpha = \dfrac{1}{\dfrac{\alpha}{P} + \dfrac{1 - \alpha}{R}}
$$
$\alpha$ is a weighting factor between $0$ and $1$: values of $\alpha$ closer to $1$ give more weight to precision, and values closer to $0$ give more weight to recall.
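
As a quick check on the formula, setting $\alpha = 0.5$ weights both terms equally and reduces $F_\alpha$ to the harmonic mean of precision and recall:

$$
F_{0.5} = \dfrac{1}{\dfrac{0.5}{P} + \dfrac{0.5}{R}} = \dfrac{1}{\dfrac{0.5(P + R)}{PR}} = \dfrac{2PR}{P + R}
$$

At the extremes, $\alpha = 1$ gives $F_\alpha = P$ and $\alpha = 0$ gives $F_\alpha = R$.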

Comparing the systems from the example above with $\alpha = 0.5$:

$$
F_{0.5}(\text{System 1}) = \dfrac{1}{\dfrac{0.5}{0.64} + \dfrac{1 - 0.5}{0.57}} \approx 0.60
$$

$$
F_{0.5}(\text{System 2}) = \dfrac{1}{\dfrac{0.5}{0.80} + \dfrac{1 - 0.5}{0.43}} \approx 0.56
$$

So a balanced F-score rates system 1 slightly better than system 2.
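
The comparison can be repeated for other weightings with a short function. This is a sketch rather than a reference implementation: the name `f_score` is hypothetical, returning `0.0` when either input is zero is an assumed convention, and $\alpha = 0.8$ is an arbitrary value chosen to show a precision-heavy weighting.

```python
def f_score(p, r, alpha=0.5):
    """Weighted harmonic mean of precision p and recall r."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if p == 0.0 or r == 0.0:
        return 0.0  # assumed convention: avoid dividing by zero
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


# Precision and recall of the two systems from the example above.
systems = {
    "System 1": (16 / 25, 16 / 28),
    "System 2": (12 / 15, 12 / 28),
}

for alpha in (0.5, 0.8):
    scores = {name: round(f_score(p, r, alpha), 3) for name, (p, r) in systems.items()}
    print(alpha, scores)
```

With these inputs system 1 scores higher at $\alpha = 0.5$ (about 0.60 versus 0.56), while at $\alpha = 0.8$ system 2 comes out ahead (about 0.68 versus 0.625), because the heavier weight on precision favours the more precise system.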