# 0. [OPTIONAL] Installing course dependencies

These are dependencies for the whole course.

In [None]:
!pip install -r ../requirements.txt

You may skip the next block for now. You will need `ffmpeg` in week 12.

In [None]:
# !conda update -y base conda
!conda install -c conda-forge ffmpeg -y

Run the next cell if you want to download the embedding model. It is not required for this lab, so you can do it later.

In [None]:
!python -m spacy download en_trf_distilbertbaseuncased_lg

# 1. Touching the Internet

Solve the following task.
1. Download [this page](https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt)
2. Save it to a file with a **unique** name derived from the URL. NB: content from a different URL must not be saved under the same name. E.g. [this file](https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt) is a different file with different content!

Hints:
- [requests](https://docs.python-requests.org/en/latest/) library is cool.
- [hashlib](https://docs.python.org/3/library/hashlib.html) may help with computing hash strings.
- when you download and save the data, don't try to encode and decode it. Use binary format when working with streams and files. <span style="color:red">Discuss with your TA which encodings you know and how they differ</span>.
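To make the last hint concrete, here is a minimal sketch of why encodings matter. The sample string and the temporary file name are made up for illustration; only the standard library is used.

```python
import os
import tempfile

text = "naïve café"

# The same characters map to different byte sequences under different encodings
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")
latin1_bytes = text.encode("latin-1")
print(len(text), len(utf8_bytes), len(utf16_bytes), len(latin1_bytes))

# Decoding with the wrong codec does not fail here, it silently garbles the text
garbled = utf8_bytes.decode("latin-1")
print(garbled)

# Bytes written and read in binary mode come back unchanged
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "page.bin")  # hypothetical file name
    with open(path, "wb") as f:
        f.write(utf8_bytes)
    with open(path, "rb") as f:
        restored = f.read()

print(restored == utf8_bytes)
```

Working with bytes end to end means you never have to guess which codec the server used.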

In [None]:
import requests
import hashlib

url1 = "https://raw.githubusercontent.com/IUCVLab/information-retrieval/main/datasets/facts.txt"
url2 = "https://github.com/IUCVLab/information-retrieval/blob/main/datasets/facts.txt"

# TODO: download and save these documents
urls = [url1, url2]

for url in urls:
    response = requests.get(url)
    response.raise_for_status()

    # The file name is the SHA-256 hex digest of the URL, so different URLs map to different files
    out_file = hashlib.sha256(url.encode()).hexdigest()
    # Write the raw bytes in binary mode, without decoding and re-encoding
    with open(out_file, 'wb') as f:
        f.write(response.content)

# 2. Parsing different formats

Most probably, whatever you find on the Internet is one of: binary, plain text, XML, or JSON. XML in turn branches into XHTML, RSS, Atom, SOAP, XML-RPC, and others. Your task is to learn how to process different formats.

## 2.1. JSON

[The given file](http://sprotasov.ru/data/postnauka.txt) contains valid JSON. Parse this file and print all video URLs that have the `computer science` tag. Use the built-in features of `requests`, or just the `json` library ([ref](https://docs.python.org/3/library/json.html)).

Hint:
- if the file fails to parse, read about [the difference](https://stackoverflow.com/questions/57152985/what-is-the-difference-between-utf-8-and-utf-8-sig) between `utf-8` and `utf-8-sig`.
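The parsing issue from the hint can be reproduced offline. The sketch below uses a made-up payload (the URL and record are placeholders, not the contents of the real file) that starts with a UTF-8 byte order mark.

```python
import json

# A UTF-8 payload that starts with a byte order mark (BOM): EF BB BF
payload = b'\xef\xbb\xbf[{"url": "http://example.com/video/1", "tags": ["computer science"]}]'

# Plain utf-8 keeps the BOM as the character U+FEFF, which json.loads rejects
try:
    json.loads(payload.decode("utf-8"))
    plain_ok = True
except json.JSONDecodeError:
    plain_ok = False

# utf-8-sig drops the BOM if there is one and otherwise behaves like utf-8
records = json.loads(payload.decode("utf-8-sig"))
print(plain_ok, records[0]["tags"])
```

Decoding the raw bytes with `utf-8-sig` is safe whether or not the BOM is present.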

In [None]:
import json
import requests

url = "http://sprotasov.ru/data/postnauka.txt"


raw = requests.get(url)
# Decode the raw bytes with utf-8-sig so that a leading byte order mark (BOM), if present, is dropped
loaded_json = json.loads(raw.content.decode('utf-8-sig'))
# loaded_json
[x['url'] for x in loaded_json if 'computer science' in x['tags']]

['http://postnauka.ru/talks/31897',
 'http://postnauka.ru/video/24306',
 'http://postnauka.ru/faq/46974']

## 2.2. HTML

For the given StackExchange question page, extract the logins of the contributors (who asked and who answered) together with their votes. [bs4](https://beautiful-soup-4.readthedocs.io/en/latest/) will help you do the job.

I recommend using CSS or XPath selectors. `div` elements with the `post-layout` class represent posts. Inside each there is a `div` with the `votecell` class storing the number of votes and a `div` with the `user-details` class storing user info. My personal recommendation is to use CSS selectors, which are [documented here](https://beautiful-soup-4.readthedocs.io/en/latest/#css-selectors).
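As an illustration of the CSS selector approach, here is a minimal sketch on a hand-written HTML fragment. The fragment and the user names are invented; it only mimics the class names mentioned above, and the real page is more complex (for example, a post may contain several `user-details` blocks).

```python
from bs4 import BeautifulSoup

# Simplified, made-up markup that imitates the structure described above
html = """
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 23 </div></div>
  <div class="user-details"><a href="/users/1">alice</a></div>
</div>
<div class="post-layout">
  <div class="votecell"><div class="js-vote-count"> 17 </div></div>
  <div class="user-details"><a href="/users/2">bob</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

pairs = []
for post in soup.select("div.post-layout"):
    # select_one returns the first match inside this post, or None
    vote = post.select_one("div.votecell .js-vote-count")
    user = post.select_one("div.user-details a")
    if vote and user:
        pairs.append((user.get_text(strip=True), int(vote.get_text(strip=True))))

print(pairs)
```

Selecting inside each `post-layout` block keeps every vote count attached to the right user, which two independent page-wide searches cannot guarantee.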

In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://math.stackexchange.com/questions/411486/"\
        "understanding-the-singular-value-decomposition-svd"
print(url)

# TODO. Your code here should parse the HTML source page and find the contributors of the question and its answers.

result = requests.get(url).text
doc = BeautifulSoup(result, "html.parser")


# User names are the link texts inside div.user-details
users = doc.find_all('div', {'class': 'user-details'})
for user in users:
    name = user.find('a')
    if name:
        print(name.text)

# Vote counts live in div.js-vote-count; strip the surrounding whitespace
votes = doc.find_all('div', {'class': 'js-vote-count'})
for vote in votes:
    print(vote.text.strip())

https://math.stackexchange.com/questions/411486/understanding-the-singular-value-decomposition-svd
Rodrigo de Azevedo
Celdor
Ittay Weiss
Tomasz Bartkowiak
Bart Vanderbeke
Bart Vanderbeke
hgfei
littleO
TheSHETTY-Paradise
23
17
10
4
3
2
1
1

## 2.3. RSS feed

A lot of information is already organized in typed XML documents. Podcasts, for example, are just RSS feeds. Parse [the feed of this podcast](http://sprotasov.ru/podcast/rss.xml) and print out:
- the number of episodes
- the length of the time span between the first and the last episodes (in days).

Use [`feedparser` library for this](https://waylonwalker.com/parsing-rss-python/).

In [None]:
!pip install feedparser


In [None]:
import feedparser
rss = 'http://sprotasov.ru/podcast/rss.xml'
feedparser.parse(rss) 

# TODO: complete the code to compute the time span of all the episodes.

{'bozo': False,
 'entries': [{'title': '17 - квантовые компьютеры против Дьяконова',
   'title_detail': {'type': 'text/plain',
    'language': None,
    'base': 'http://sprotasov.ru/podcast/rss.xml',
    'value': '17 - квантовые компьютеры против Дьяконова'},
   'summary': 'В 2020 году вышла книга Михаил Игоревича Дьяконова "Will We Ever Have a Quantum Computer?". \nВ этой книге автор ставит под сомнение возможность создания универсального квантового компьютера, \nа также критикует саму область квантовых вычислений.\nЯ выступаю адвокатом обороняющейся стороны и робко отбиваюсь от аргументов известного ученого.\nКнига небольшая, её можно прочитать тут - https://link.springer.com/book/10.1007/978-3-030-42019-2',
   'summary_detail': {'type': 'text/html',
    'language': None,
    'base': 'http://sprotasov.ru/podcast/rss.xml',
    'value': 'В 2020 году вышла книга Михаил Игоревича Дьяконова "Will We Ever Have a Quantum Computer?". \nВ этой книге автор ставит под сомнение возможность созда
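
One possible way to approach the TODO is sketched below. It assumes that every entry in the feed carries a publication date, which `feedparser` exposes as `published_parsed` (a UTC `time.struct_time`). The names `episode_stats` and `main` are made up for this sketch.

```python
import calendar
from datetime import datetime, timezone


def episode_stats(entries):
    """Return (number of episodes, whole days between the earliest and latest one)."""
    # Convert each UTC struct_time into a timezone-aware datetime
    dates = [
        datetime.fromtimestamp(calendar.timegm(e["published_parsed"]), tz=timezone.utc)
        for e in entries
        if e.get("published_parsed")
    ]
    if not dates:
        return len(entries), 0
    # Use min/max rather than first/last so the order of the feed does not matter
    return len(entries), (max(dates) - min(dates)).days


def main():
    # Imported here so that the helper above can be used without feedparser installed
    import feedparser

    feed = feedparser.parse("http://sprotasov.ru/podcast/rss.xml")
    count, span = episode_stats(feed.entries)
    print(f"{count} episodes spanning {span} days")
```

Calling `main()` in a notebook cell fetches the feed and prints both numbers.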

# 3. [EXTRA TASK] Solving simple information retrieval task

As the name suggests, `information retrieval` is the discipline that helps retrieve information (from unstructured sources). Thus, we will retrieve some information from [this news article](https://www.bbc.com/news/world-us-canada-59944889). Your task is to write code that answers the question: **How many people die every day in the US waiting for a transplant?** Make your code flexible enough. Test yourself by changing the link to [this one](https://www.americantransplantfoundation.org/about-transplant/facts-and-myths/).

In [None]:
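import re
import requests
from bs4 import BeautifulSoup

def find_answer(doc, question):
    # NOTE: the keywords are hard-coded for this particular question; `question` itself is not parsed yet
    pattern = r"\d+(?:,\d{3})*(?:\.\d+)?"
    keywords = ['people', 'die', 'every day', 'US', 'waiting', 'transplant']

    # Look for the first text fragment that mentions every keyword and contains a number
    for sentence in doc.stripped_strings:
        lowered = sentence.lower()
        if all(keyword.lower() in lowered for keyword in keywords):
            numbers = re.findall(pattern, sentence)
            if numbers:
                # The pattern also matches decimals, so go through float before narrowing to int
                value = float(numbers[0].replace(',', ''))
                return int(value) if value.is_integer() else value
    return None

# First URL
url = "https://www.bbc.com/news/world-us-canada-59944889"
question = "How many people die every day in the US waiting for a transplant?"
result = requests.get(url).text
doc = BeautifulSoup(result, "html.parser")
answer = find_answer(doc, question)
print(f"Answer from URL: {answer}")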


In [None]:
# Debug helper: check whether a phrase occurs anywhere in the page text
def check_phrase(phrase, soup):
    if phrase in soup.text:
        print(phrase)
        print("here")
    else:
        print("not here")

check_phrase("every day", doc)

every day
here
