# Words, concepts and search

You are given a collection of texts that form a dataset. Each line is a separate document. For each test, we will answer the following 2 questions:
- Which of the documents is the closest to a given query?
- How many "concepts" are enough to represent these texts?

Thus your answer will consist of 2 integers separated by a space: `doc_id concept_count`.

## Let's consider the test example:

`input.txt`

```
c d b.
d e a.
a b c.
a b c d.
d c a b.
a c.      # <--- the last one is the query
```

## Let's do vectorization
Reuse this code in your solutions.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

inp = """c d b.
d e a.
a b c.
a b c d.
d c a b.
a c.""".split('\n')

dataset, query = inp[:-1], inp[-1]

vect = TfidfVectorizer(
            analyzer='word',
            stop_words=None,
            token_pattern=r"(?u)\b\w+\b"    # default pattern is (?u)\b\w\w+\b: (?u) -- unicode flag, \b -- word boundary, \w\w+ -- 2+ word characters; ours also keeps 1-character tokens
)
DTM = vect.fit_transform(dataset).todense()
print("Vocabulary:", vect.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(DTM)

print("\nIs it normed?\n")
print(DTM @ DTM.T)

Vocabulary: ['a', 'b', 'c', 'd', 'e']
[[0.         0.57735027 0.57735027 0.57735027 0.        ]
 [0.440627   0.         0.         0.440627   0.78210977]
 [0.57735027 0.57735027 0.57735027 0.         0.        ]
 [0.5        0.5        0.5        0.5        0.        ]
 [0.5        0.5        0.5        0.5        0.        ]]

Is it normed?

[[1.         0.25439612 0.66666667 0.8660254  0.8660254 ]
 [0.25439612 1.         0.25439612 0.440627   0.440627  ]
 [0.66666667 0.25439612 1.         0.8660254  0.8660254 ]
 [0.8660254  0.440627   0.8660254  1.         1.        ]
 [0.8660254  0.440627   0.8660254  1.         1.        ]]
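
To see where these numbers come from, here is a minimal NumPy sketch that rebuilds the matrix by hand. It assumes `TfidfVectorizer` runs with its default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The `counts` matrix is typed in manually from the example, and the variable names are illustrative only.

```python
import numpy as np

# Term counts over the vocabulary ['a', 'b', 'c', 'd', 'e'], typed in by hand.
counts = np.array([
    [0, 1, 1, 1, 0],   # c d b.
    [1, 0, 0, 1, 1],   # d e a.
    [1, 1, 1, 0, 0],   # a b c.
    [1, 1, 1, 1, 0],   # a b c d.
    [1, 1, 1, 1, 0],   # d c a b.
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # document frequency of each term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed IDF (assumed sklearn default)
tfidf = counts * idf                          # weight the raw counts by IDF
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # L2-normalize every row

print(np.round(tfidf, 8))
```

Because every row ends up with unit length, the diagonal of `DTM @ DTM.T` is all ones, which is what the "Is it normed?" check shows.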


### So, we are ready to answer question 1. 
Which of the documents is the closest to a given query?

In [2]:
query_vector = vect.transform([query]).todense()
cosines_raw = # ... (cosine similarity of each document with the query)
print("Cosine similarities of query and dataset:\n", cosines_raw)
# ... (print the index of the best match)

Cosine similarities of query and dataset:
 [[0.40824829]
 [0.31157034]
 [0.81649658]
 [0.70710678]
 [0.70710678]]
Best match index: 2
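
One possible way to fill the gap above, given as a sketch and not as the reference solution: every row of the matrix has unit length, so the cosine similarity reduces to a dot product. To stay self-contained, the block copies `DTM` from the printed output and builds the query vector by hand (an assumption standing in for `vect.transform([query])`).

```python
import numpy as np

# TF-IDF rows copied from the printed DTM (already L2-normalized).
DTM = np.array([
    [0.0,        0.57735027, 0.57735027, 0.57735027, 0.0],
    [0.440627,   0.0,        0.0,        0.440627,   0.78210977],
    [0.57735027, 0.57735027, 0.57735027, 0.0,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
])

# Query "a c.": 'a' and 'c' share the same IDF, so after L2 normalization
# both weights are 1/sqrt(2). Hand-built stand-in for vect.transform([query]).
query_vector = np.array([[1.0, 0.0, 1.0, 0.0, 0.0]]) / np.sqrt(2)

# For unit-length vectors the cosine similarity is the plain dot product.
cosines_raw = DTM @ query_vector.T
best = int(np.argmax(cosines_raw))

print("Cosine similarities of query and dataset:\n", cosines_raw)
print("Best match index:", best)
```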


### Time for question 2.
How many concepts are enough to express our dataset?

In other words, how many orthogonal components do we need to pass the `allclose` test?

**NB: Can you just take the data and run PCA? Will it change the cosine metric?**
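
A hint for the NB, sketched under the assumption that PCA means the usual variant that subtracts the column means first (as scikit-learn's `PCA` does). Centering moves the origin, and cosine similarity is measured from the origin, so the values change. `cosine_to_rows` is a hypothetical helper, and `DTM` is copied from the printed output.

```python
import numpy as np

# TF-IDF rows copied from the printed DTM.
DTM = np.array([
    [0.0,        0.57735027, 0.57735027, 0.57735027, 0.0],
    [0.440627,   0.0,        0.0,        0.440627,   0.78210977],
    [0.57735027, 0.57735027, 0.57735027, 0.0,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
])
query = np.array([1.0, 0.0, 1.0, 0.0, 0.0]) / np.sqrt(2)   # hand-built vector for "a c."


def cosine_to_rows(matrix, vector):
    """Cosine similarity between every row of `matrix` and a 1-D `vector` (hypothetical helper)."""
    return (matrix @ vector) / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))


mean = DTM.mean(axis=0)                 # the shift that centering-based PCA applies
cos_plain = cosine_to_rows(DTM, query)
cos_centered = cosine_to_rows(DTM - mean, query - mean)

print("Without centering:", np.round(cos_plain, 4))
print("With centering:   ", np.round(cos_centered, 4))
```

An SVD applied directly to the uncentered matrix avoids this shift, which is why the task asks for reduced SVD.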

Implement reduced SVD to pass the test.

In [2]:
from numpy.linalg import svd

U, sigma, Vh = svd(DTM, full_matrices=True)
Sigma = np.diag(sigma)

for i in range(1, min(DTM.shape[1], DTM.shape[0]) + 1):
    doc_embeddings =  # ...
    projection =      # ...
    DTM_approx =      # ...
    print(f"If we take {i} components, allclose =", np.allclose(DTM, DTM_approx, atol=0.01))

If we take 1 components, allclose = False
If we take 2 components, allclose = False
If we take 3 components, allclose = False
If we take 4 components, allclose = True
If we take 5 components, allclose = True
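
A possible way to fill in the three gaps, given as a sketch and not as the official solution: keep the first `i` singular triplets, use `U_i Σ_i` as the document embeddings and the first `i` rows of `Vh` as the projection. `DTM` is copied from the printed output so that the block runs on its own, and the `results` list is added only to collect the outcomes.

```python
import numpy as np

# TF-IDF rows copied from the printed DTM.
DTM = np.array([
    [0.0,        0.57735027, 0.57735027, 0.57735027, 0.0],
    [0.440627,   0.0,        0.0,        0.440627,   0.78210977],
    [0.57735027, 0.57735027, 0.57735027, 0.0,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
])

U, sigma, Vh = np.linalg.svd(DTM, full_matrices=True)
Sigma = np.diag(sigma)

results = []
for i in range(1, min(DTM.shape) + 1):
    doc_embeddings = U[:, :i] @ Sigma[:i, :i]   # documents in the i-dimensional concept space
    projection = Vh[:i, :]                      # maps term space to concept space
    DTM_approx = doc_embeddings @ projection    # rank-i reconstruction of the matrix
    ok = bool(np.allclose(DTM, DTM_approx, atol=0.01))
    results.append(ok)
    print(f"If we take {i} components, allclose =", ok)
```

Four components suffice here because the last two documents have identical rows, so the matrix has rank 4 and its fifth singular value is numerically zero.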


In [5]:
k = 4

doc_embeddings =  # ...  
projection =      # ...
query_embedding = projection @ query_vector.T

cosines = doc_embeddings @ query_embedding
print(f"Cosine similarities of query and dataset (after reduction to {k} dimensions):\n", cosines)
assert np.allclose(cosines, cosines_raw), "Cosine similarities are not close"

Cosine similarities of query and dataset (after reduction to 4 dimensions):
 [[0.40824829]
 [0.31157034]
 [0.81649658]
 [0.70710678]
 [0.70710678]]


Exactly the same! The reduction to 4 dimensions preserved the similarities.
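
The check can be sketched end to end as follows, again as one possible completion and not the official one. The scores match because the rank-4 reconstruction is exact, so `doc_embeddings @ projection @ q` equals `DTM @ q`. `DTM` and the query vector are hand-copied stand-ins for the objects produced by the vectorizer.

```python
import numpy as np

# TF-IDF rows copied from the printed DTM.
DTM = np.array([
    [0.0,        0.57735027, 0.57735027, 0.57735027, 0.0],
    [0.440627,   0.0,        0.0,        0.440627,   0.78210977],
    [0.57735027, 0.57735027, 0.57735027, 0.0,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
    [0.5,        0.5,        0.5,        0.5,        0.0],
])
query_vector = np.array([[1.0, 0.0, 1.0, 0.0, 0.0]]) / np.sqrt(2)   # "a c."
cosines_raw = DTM @ query_vector.T                                  # similarities in term space

k = 4
U, sigma, Vh = np.linalg.svd(DTM, full_matrices=True)
doc_embeddings = U[:, :k] @ np.diag(sigma[:k])   # documents in k-dimensional concept space
projection = Vh[:k, :]                           # term space -> concept space
query_embedding = projection @ query_vector.T    # query mapped into the same space

cosines = doc_embeddings @ query_embedding
print(f"Cosine similarities of query and dataset (after reduction to {k} dimensions):\n", cosines)
print("Best match index:", int(np.argmax(cosines)))
```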

### And the answer is...

Thus, looking at the results of the previous blocks, we can state:
    
`output.txt`
```
2 4
```