# ДЗ 2  
## Ранжирование: TF-IDF, матрица Document-Term, косинусная близость

### __Задача__:    

Реализуйте поиск, где 
- в качестве метода векторизации документов корпуса - **TF-IDF**
- формат хранения индекса - **матрица Document-Term**
- метрика близости пар (запрос, документ) - **косинусная близость**
- в качестве корпуса - **корпус Друзей из первого задания**


Что должно быть в реализации:
- функция индексации корпуса, на выходе которой посчитанная матрица Document-Term 
- функция индексации запроса, на выходе которой посчитанный вектор запроса
- функция с реализацией подсчета близости запроса и документов корпуса, на выходе которой вектор, i-й элемент которого обозначает близость запроса с i-м документом корпуса
- главная функция, объединяющая все это вместе; на входе - запрос, на выходе - отсортированные по убыванию имена документов коллекции


**На что направлена эта задача:** 
Реализация от начала до конца механики поиска с использованием простых компонентов.


## Imports:

In [1]:
! pip install pymorphy2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pymorphy2
  Downloading pymorphy2-0.9.1-py3-none-any.whl (55 kB)
[K     |████████████████████████████████| 55 kB 3.2 MB/s 
[?25hCollecting pymorphy2-dicts-ru<3.0,>=2.4
  Downloading pymorphy2_dicts_ru-2.4.417127.4579844-py2.py3-none-any.whl (8.2 MB)
[K     |████████████████████████████████| 8.2 MB 12.0 MB/s 
[?25hCollecting dawg-python>=0.7.1
  Downloading DAWG_Python-0.7.2-py2.py3-none-any.whl (11 kB)
Collecting docopt>=0.6
  Downloading docopt-0.6.2.tar.gz (25 kB)
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... [?25l[?25hdone
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13723 sha256=3cc12f1658a5c5bb909531e9b09984e99de12edac6f723fdf205eb4ea854f3db
  Stored in directory: /root/.cache/pip/wheels/72/b0/3f/1d95f96ff986c7dfffe46ce2be4062f38ebd04b506c77c81b9
Successfully built docopt
Installing collected 

In [12]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import os
from pymorphy2 import MorphAnalyzer
from string import punctuation
import nltk
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [15]:
nltk.download("stopwords")
stopwords = set(stopwords.words("russian"))
morph = MorphAnalyzer()
vectorizer = TfidfVectorizer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Processing:

get list of files in a given directory + get texts from them

In [4]:
def get_files(path):
    f_paths = list()
    f_names = list()
    texts = list()

    for root, dirs, files in os.walk(path):
        for name in files:
            if name[0] != '.':
                f_paths.append(os.path.join(root, name))
                f_names.append(name)
    
    for file_path in f_paths:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
          texts.append(f.read())

    return texts, f_names

preprocess texts 

In [5]:
def preprocess(texts):
    prep_texts = list()
    for text in texts:
      words = text.split()
      lemmas = list()
      for w in words:
        w = w.strip(punctuation)
        lemmas.append(morph.parse(w)[0].normal_form)
      words = [w for w in words if w not in stopwords]
      prep_texts.append(' '.join(words))
    return prep_texts

get tf-idf 

In [8]:
def get_matrix(path):
    texts, file_names = get_files(path)
    corpus = preprocess(texts)
    X = vectorizer.fit_transform(corpus)
    return X, file_names

get vector of users query

In [9]:
def vec_query(users_q):
    prep = preprocess([users_q])
    return vectorizer.transform(prep)

find cos similarity

In [10]:
def get_cos_sim(x, vector):
    simularity = cosine_similarity(x, vector)
    return simularity.reshape(-1)

In [16]:
# Testing on the corpus from hw1:
path = '/content/drive/MyDrive/infosearch22/hw1/friends-data'
X, file_names = get_matrix(path)
check = True
while check == True:
    query = input("You may input your query or type 'STOP' to stop: ")
    if "STOP" not in query:
        files_sorted = []
        vec_q = vec_query(query)
        cos_li = get_cos_sim(X, vec_q)
        id_sort = np.argsort(cos_li)[::-1]
        id_sort = id_sort.tolist()
        for i in range(len(file_names)):
            files_sorted.append(file_names[id_sort[i]])
        print("Results in descending order: \n{}".format('\n'.join(files_sorted)))
    else:
      check = False

You may input your query or type 'STOP' to stop: Я очень зла
Results in descending order: Friends - 4x01 - The One With The Jellyfish.ru.txt
Friends - 4x21 - The One With The Invitations.ru.txt
Friends - 5x04 - The One Where Phoebe Hates PBS.ru.txt
Friends - 5x06 - The One With The Yeti.ru.txt
Friends - 1x13 - The One With The Boobies.ru.txt
Friends - 6x23 - The One With The Ring.ru.txt
Friends - 2x07 - The One Where Ross Finds Out.ru.txt
Friends - 4x07 - The One Where Chandler Crosses The Line.ru.txt
Friends - 3x16 - The One With The Morning After (2).ru.txt
Friends - 5x14 - The One Where Everybody Finds Out.ru.txt
Friends - 3x17 - The One Without The Ski Trip.ru.txt
Friends - 4x08 - The One With Chandler In A Box.ru.txt
Friends - 4x18 - The One With Rachel's New Dress.ru.txt
Friends - 4x16 - The One With The Fake Party.ru.txt
Friends - 6x19 - The One With Joey's Fridge.ru.txt
Friends - 3x05 - The One With Frank Jr..ru.txt
Friends - 4x10 - The One With The Girl From Poughkeepsie.ru.tx