### Формат файлов

Вам выдается 4 файла:

* `train.txt` --- обучающая выборка пар запрос-документ и асессорские метки релевантности
* `test.txt` --- тестовая выборка пар запрос-документ
* `queries_test.txt` --- запросы из `train.txt`
* `queries_train.txt` --- запросы из `test.txt`

Колонки в файлах могут быть следующего типа:

* `QueryId` --- уникальный номер запроса
* `DocumentId` --- номер документа, не повторяется для одного запроса
* `DocumentLink` --- url документа
* `Relevance` --- асессорская метка релевантности

Формат файла ответов приведен ниже. Пары запрос-документ должны соответсвовать файлу `test.txt` и должны быть упорядочены по убыванию построенной функции релевантности. То есть так, как в поисковой выдаче.

Качество оценивается с помощью метрики ${NDCG}_{10}$.

### Смотрим данные

Посмотрим на данные

In [123]:
!head -n 5 data/queries_train.txt
!head -n 5 data/queries_test.txt

1	Как заснуть не заглядывая в холодильник и телефон?
2	Как выглядеть интеллигентной когда жрешь из кастрюли руками?
3	Как отпустить препода пораньше с пары?
4	Какие мышцы качаются когда лепишь фрикадельки?
5	Как заставить домового платить коммуналку?
101	Почему нельзя неподошедшего тебе мужика просто аккуратненько вернуть обратно маме?
102	Ты никогда не замечал, какие все вокруг оптимисты. когда проблема касается тебя?
103	В какой момент нужно сказать привет когда навстречу идет знакомый?
104	Кто знает, какие успокоительные лекарства принимает медведь в мультике про Машу? мне тоже такие надо.
105	Таджики на уроке физкультуры по команде строимся,по строили новый дом .


Считаем эти данные пандасом

In [142]:
import pandas as pd
from IPython.display import display

In [143]:
queries_test = pd.read_csv("data/queries_test.txt", sep="\t", index_col=0, header=None)
queries_train = pd.read_csv("data/queries_train.txt", sep="\t", index_col=0, header=None)
queries_test.columns = ["Query"]
queries_train.columns = ["Query"]

queries_test["QueryId"] = queries_test.index
queries_train["QueryId"] = queries_train.index
print("Train queries: {}".format(len(queries_train)))
print("Test queries: {}".format(len(queries_test)))

Train queries: 100
Test queries: 100


In [144]:
display(queries_train.head())
display(queries_test.head())

Unnamed: 0_level_0,Query,QueryId
0,Unnamed: 1_level_1,Unnamed: 2_level_1
1,Как заснуть не заглядывая в холодильник и теле...,1
2,Как выглядеть интеллигентной когда жрешь из ка...,2
3,Как отпустить препода пораньше с пары?,3
4,Какие мышцы качаются когда лепишь фрикадельки?,4
5,Как заставить домового платить коммуналку?,5


Unnamed: 0_level_0,Query,QueryId
0,Unnamed: 1_level_1,Unnamed: 2_level_1
101,Почему нельзя неподошедшего тебе мужика просто...,101
102,"Ты никогда не замечал, какие все вокруг оптими...",102
103,В какой момент нужно сказать привет когда навс...,103
104,"Кто знает, какие успокоительные лекарства прин...",104
105,Таджики на уроке физкультуры по команде строим...,105


Теперь смотрим на сами пары (q, d).

In [145]:
!head -n 5 data/train.txt
!head -n 5 data/test.txt

Relevance,QueryId,DocumentId,"DocumentLink"
4,1,0,"http://otvet.mail.ru/question/195527651/"
4,1,1,"http://otvet.mail.ru/question/195596044/"
3,1,2,"http://ask.fm/Liza_1911"
3,1,3,"https://vk.com/polly_and_viole"
QueryId,DocumentId,"DocumentLink"
101,0,"https://vk.com/id377383444"
101,1,"http://xn----btbgrjageubg1a.xn--p1ai/420-skuchat"
101,2,"http://jiznenno.ru/vykhod-iz-treugolnika-karpmana/"
101,3,"http://iknigi.net/avtor-aleksandra-marinina/113760-obratnaya-sila-tom-2-19651982-aleksandra-marinina/read/page-1.html"


К сожалению, в url может быть запятая, поэтому это надо сначала пофиксить, прежде чем открывтаь пандасом

In [146]:
def fix_and_open(filename, num_columns):
    try:
        return pd.read_csv(filename)
    except pd.parser.CParserError:
        new_lines = []
        with open(filename, "r") as fin:
            for line in fin:
                line = line.strip()
                tokens = line.split(",")
                new_line = '{},"{}"'.format(",".join(tokens[:(num_columns - 1)]), ",".join(tokens[(num_columns - 1):]))
                new_lines.append(new_line)
        with open(filename, "w") as fout:
            fout.write("\n".join(new_lines))
        return pd.read_csv(filename)

In [147]:
test = fix_and_open("data/test.txt", 3)
train = fix_and_open("data/train.txt", 4)
print("Len train: {}".format(len(train)))
print("Len test: {}".format(len(test)))

Len train: 2269
Len test: 2383


In [38]:
import re


def read_link_text(url):
    try:
        r = urlopen(url, timeout=10)
        data = r.read()
        try:
            content = data.decode(r.headers.get_content_charset())
        except:
            content = data
        soup = BeautifulSoup(content, 'html.parser')
        texts = soup.findAll(text=True)
        def visible(element):
            if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
                return False
            elif re.match('<!--.*-->', str(element)):
                return False
            return True
        visible_texts = filter(visible, texts)
        return " ".join(filter(lambda x: x, map(lambda s: s.strip(), visible_texts))).lower()
    except Exception as e:
        print(e)
        return ""

In [43]:
from multiprocessing.pool import Pool

p = Pool(8)
test_text = p.map(read_link_text, test["DocumentLink"])
p.close()

<urlopen error [Errno 8] nodename nor servname provided, or not known>


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 404: Not Found


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 404: Not Found


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


<urlopen error [Errno 8] nodename nor servname provided, or not known>


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


timed out


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


timed out
HTTP Error 404: Not Found


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 404: Not Found
HTTP Error 404: Not Found


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Remote end closed connection without response
<urlopen error timed out>


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


hostname 'нанофутур.рф' doesn't match either of 'xn--80azbeklgbg.xn--p1ai', 'www.xn--80azbeklgbg.xn--p1ai'
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)>


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


timed out
<urlopen error timed out>


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


<urlopen error [Errno 8] nodename nor servname provided, or not known>


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


timed out


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 500: Internal Server Error


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 404: Not Found


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


timed out


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 403: Forbidden
HTTP Error 503: Service Unavailable
timed out


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


timed out
<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:645)>
HTTP Error 403: Forbidden
timed out
timed out
timed out
timed out
timed out
HTTP Error 500: Internal Server Error
HTTP Error 503: Service Temporarily Unavailable


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 410: Gone
<urlopen error timed out>
<urlopen error timed out>


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


HTTP Error 403: Forbidden
Remote end closed connection without response
HTTP Error 403: Forbidden
HTTP Error 404: NOT FOUND


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Remote end closed connection without response
HTTP Error 404: Not Found
timed out


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Remote end closed connection without response


Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


Remote end closed connection without response
<urlopen error [Errno 60] Operation timed out>


In [148]:
print("Nonempty: {}".format(len(list(filter(lambda x: x, test_text)))))
print("Total: {}".format(len(test_text)))

Nonempty: 2331
Total: 2383


In [149]:
from collections import defaultdict
inv_documents = defaultdict(set)
doc_tfidfs = []
for doc_id, document in enumerate(test_text):
    for word in re.split("[^\w]+", document):
        inv_documents[word].add(doc_id)
for doc_id, document in enumerate(test_text):
    tfidf = {}
    doc_word_count = defaultdict(int)
    for word in re.split("[^\w]+", document):
        doc_word_count[word] += 1
    for word in doc_word_count.keys():
        tf = doc_word_count[word] / len(words)
        idf = np.log(1 + len(test_text) / len(inv_documents[word]))
        tfidf[word] = tf * idf
    doc_tfidfs.append(tfidf)

In [150]:
test["tfidf"] = doc_tfidfs

In [151]:
test.head()

Unnamed: 0,QueryId,DocumentId,DocumentLink,tfidf
0,101,0,https://vk.com/id377383444,"{'заходила': 0.029500976271, 'вам': 0.00893042..."
1,101,1,http://xn----btbgrjageubg1a.xn--p1ai/420-skuchat,"{'обсуждали': 0.0310140117413, '': 0.009246435..."
2,101,2,http://jiznenno.ru/vykhod-iz-treugolnika-karpm...,"{'': 0.0092464351281, 'счастливые': 0.02494839..."
3,101,3,http://iknigi.net/avtor-aleksandra-marinina/11...,"{'делом': 0.0206032515583, '': 0.0092464351281..."
4,101,4,http://fb.ru/article/270014/esli-tufli-veliki-...,"{'натуральной': 0.0390399656544, 'носком': 0.0..."


In [152]:
def calc_relev(row):
    tfidf, query = row
    res = 0
    for word in re.split(r"[^w]+", query.lower()):
        if word in tfidf:
            res += tfidf[word]
    return res
result = test.merge(queries_test, on="QueryId")
result["revel"] = result[["tfidf", "Query"]].apply(calc_relev, axis=1)

In [153]:
result

Unnamed: 0,QueryId,DocumentId,DocumentLink,tfidf,Query,revel
0,101,0,https://vk.com/id377383444,"{'заходила': 0.029500976271, 'вам': 0.00893042...",Почему нельзя неподошедшего тебе мужика просто...,0.207714
1,101,1,http://xn----btbgrjageubg1a.xn--p1ai/420-skuchat,"{'обсуждали': 0.0310140117413, '': 0.009246435...",Почему нельзя неподошедшего тебе мужика просто...,1.430096
2,101,2,http://jiznenno.ru/vykhod-iz-treugolnika-karpm...,"{'': 0.0092464351281, 'счастливые': 0.02494839...",Почему нельзя неподошедшего тебе мужика просто...,0.154923
3,101,3,http://iknigi.net/avtor-aleksandra-marinina/11...,"{'делом': 0.0206032515583, '': 0.0092464351281...",Почему нельзя неподошедшего тебе мужика просто...,0.300934
4,101,4,http://fb.ru/article/270014/esli-tufli-veliki-...,"{'натуральной': 0.0390399656544, 'носком': 0.0...",Почему нельзя неподошедшего тебе мужика просто...,0.144894
5,101,5,https://new.vk.com/wall-56106344_4734187,"{'палата': 0.0337222131164, 'неподошедшего': 0...",Почему нельзя неподошедшего тебе мужика просто...,0.207714
6,101,6,https://prodoctorov.ru/krasnodar/lpu/8343-uro-...,"{'свяжется': 0.0856447613778, 'пройти': 0.0368...",Почему нельзя неподошедшего тебе мужика просто...,0.009137
7,101,7,http://garri-potter-life.narod.ru/online/hp_ka...,"{'': 0.0184928702562, 'a1': 0.0428223806889, '...",Почему нельзя неподошедшего тебе мужика просто...,0.018493
8,101,8,http://otvet.mail.ru/mail/lady.dobrota777,"{'': 0.0184928702562, 'liveinternet': 0.020398...",Почему нельзя неподошедшего тебе мужика просто...,0.054551
9,101,9,http://24advokat.ru/cat/v_kakih_sluchayah_mojn...,"{'': 0.0184928702562, 'делом': 0.0206032515583...",Почему нельзя неподошедшего тебе мужика просто...,4.300852


In [154]:
result.sort(["QueryId", "revel"], ascending=[True, False])[["QueryId", "DocumentId"]].to_csv("answer.csv", index=False)

  if __name__ == '__main__':
