# 0. Auxillary resources

In [16]:
import os
from hashlib import md5

def get_unique_name(url, base_name):
    """
    Handles the problem with file rewriting
    """
    hash = md5((url + base_name).encode("utf-8")).hexdigest()
    return hash + base_name
    

# 1. Crawler

## 1.0. Related example

This code shows `wget`-like tool written in python. Run it from console (`python wget.py`), make it work. Check the code, reuse, and modify for your needs.

In [17]:
import argparse
import re
import requests


def wget(url):
    try:
        # allow redirects - in case file is relocated
        resp = requests.get(url, allow_redirects=True)
    except:
        return None, None

    if not resp.ok:
        return None, None

    m = re.search(r"https?:\/\/.*\/([^?#\/]*\.[^?#\/]+)?.*", url)
    filename = m.group(1)

    if filename is None or filename == "":  # if filename is not recognized or recognized as empty
        filename = "index.html"             # assume it is a html page

    return filename, resp.content

### 1.0.1. How to parse a page?

If you build a crawler, you might follow one of the approaches:
1. search for URLs in the page, assuming this is just a text.
2. search for URLs in the places where URLs should appear: `<a href=..`, `<img src=...`, `<iframe src=...` and so on.

To follow the first approach you can rely on some good regular expression. [Like this](https://stackoverflow.com/a/3809435).

To follow the second approach just read one of these: [short answer](https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup) or [exhaustive explanation](https://hackersandslackers.com/scraping-urls-with-beautifulsoup/).

## 1.1. [15] Download and persist #
Please complete a code for `load()`, `download()` and `persist()` methods of `Document` class. What they do:
- for a given URL `download()` method downloads binary data and stores in `self.content`. It returns `True` for success, else `False`.
- `persist()` method saves `self.content` somewhere in file system. We do it to avoid multiple downloads (for caching in other words).
- `load()` method loads data from hard drive. Returns `True` for success.

Tests checks that your code somehow works.

**NB Passing the test doesn't mean you correctly completed the task.** These are **criteria, which have to be fullfilled**:
1. URL is a unique identifier (as it is a subset of URI). Thus, documents with different URLs should be stored in different files. Typical errors: documents from the same domain are overwritten to the same file, URLs with similar endings are downloaded to the same file, etc.
2. The document can be not only a text file, but also a binary. Pay attention that if you download `mp3` file, it still can be played. Hint: don't hurry to convert everything to text.

In [18]:
import requests
from urllib.parse import quote

class Document:
    
    def __init__(self, url):
        if not re.match(r"https?:\/\/.*\/.*", url):
            url += "/"

        self.url = url
        self.content = None
        self._filename = None
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        self._filename, self.content = wget(self.url)
        ok = self.content is not None
        if ok:
            self._filename = get_unique_name(self.url, self._filename)
        return ok
    
    def persist(self):
        if self.content is None:
            raise ValueError("Invalid state: no content loaded")
            
        with open(self._filename.encode("utf-8"), "wb+") as f:
                f.write(self.content)
    
            
    def load(self):
        if self._filename is None or not os.path.exists(self._filename):
            return False
        else:
            try:
                with open(self._filename, "rb") as f:
                    self.content = f.read()
            except (IOError, OSError):
                return False
            else:
                return True
            

### 1.1.1. Tests ###

In [19]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

In [20]:
music_doc = Document("https://download.samplelib.com/mp3/sample-3s.mp3")
music_doc.get()
assert doc.content, "Document download failed"

## 1.2. [10] Parse HTML
`BeautifulSoap` library is a de facto standard to parse XML and HTML documents in python. Use it to complete `parse()` method that extracts document contents. You should initialize:
1. `self.anchors` list of tuples `('text', 'url')` met in a document. Be aware, there exist relative links (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to fix this issue.
2. `self.images` list of images met in a document. Again, links can be relative to current page.
3. `self.text` should keep plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

**NB All these 3 criteria must be fulfilled to get full point for the task.**

In [21]:
import nltk

from bs4 import BeautifulSoup
from bs4.element import Comment
import urllib.parse


class HtmlDocument(Document):
    
    def parse(self):
        if self.content is None:
            raise ValueError("Invalid state: document is not loaded")
        
        soup = BeautifulSoup(self.content)

        anchors = soup.select("a")
        self.anchors = []

        for anchor in anchors:
            if anchor.has_attr("href"):
                address = urllib.parse.urljoin(self.url, anchor.attrs["href"])
                if address.startswith("http"):
                    self.anchors.append((anchor.string, address,))
        
        self.text = soup.get_text(separator='\n')
        
        images = soup.select("img")
        self.images = []
        
        for image in images:
            if image.has_attr("src"):
                self.images.append(urllib.parse.urljoin(self.url, image.attrs["src"]))



### 1.2.1. Tests

In [22]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "just few links" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/gb.svg" in doc.images, "Error parsing images"
assert any(p[1] == "https://twitter.com/07C3" for p in doc.anchors), "Error parsing links"

In [23]:
doc = HtmlDocument("https://alobanov.space")
doc.get()
doc.parse()

print(doc.text)

About | Aleksandr Lobanov
Achievements
 
About
 
Projects
 
Aleksandr Lobanov
Aleksandr
Lobanov
Full Stack Developer & ML Engineer
Nice to meet you! I am Alex, a frontend and backend developer, and a machine learning engineer. I am doing my bachelor's degree in Applied Artificial Intelligence at Innopolis University. I have been working on various projects such are web applications, analytics services, servers and others for three years. I am quickly trained and adaptive person who is always searching for something new and exciting. Also, I love participating in different competitions and often take the lead in them.
Curriculum Vitae
СV
Projects
About
 
Achievements
Projects
About
Nice to meet you! I am Alex, a frontend and backend developer, and a machine learning engineer. I am doing my bachelor's degree in Applied Artificial Intelligence at Innopolis University. I have been working on various projects such are web applications, analytics services, servers and others for three years.

## 1.3. [10] Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). 

**Criteria to succeed in the task**: 
1. Your `get_word_stats()` method should return `Counter` object.
2. Don't forget to lowercase your words for counting.
3. Sentences should be obtained from inside `<body>` tag only.

In [24]:
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import Counter

class HtmlDocumentTextData:
    
    def __init__(self, url=None, doc=None):
        if doc is not None:
            self.doc = doc
            return
        
        if url is None:
            raise ValueError("Either url or doc argument should be specified")

        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()


    
    def get_sentences(self):
        result = sent_tokenize(self.doc.text)
        return result
    
    def get_word_stats(self):
        words = sum([word_tokenize(sent) for sent in self.get_sentences()], list())
        words = list(filter(lambda x: x not in [",", ".", "!", ":", ";"], words))
        words = list(map(lambda s: s.lower(), words))
        return Counter(words)

### 1.3.1. Tests ###

In [25]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис should be among most common'

[('и', 44), ('в', 22), ('иннополис', 21), ('с', 13), ('университет', 12), ('на', 12), ('университета', 11), ('центр', 10), ('«', 10), ('»', 10)]


## 1.4. [15] Crawling ##

Method `crawl_generator()` is given starting url (`source`) and max depth of search. It should return a **generator** of `HtmlDocumentTextData` objects (return a document as soon as it is downloaded and parsed). You can benefit from `yield obj_name` python construction. Use `HtmlDocumentTextData.anchors` field to go deeper.

In [26]:
from queue import Queue

class Crawler:
    
    def crawl_generator(self, source, depth=1):
        level = {source}
        visited = set(source)
        for i in range(1 + depth):
            next_level = set()
            for link in level:
                if link in visited: 
                    # Skip visited
                    continue
                visited.add(link)

                try:
                    document = HtmlDocument(link)
                    document.get()
                    document.parse()
                except:
                    continue

                next_level = next_level.union(map(lambda x: x[1], document.anchors))
                links = map(lambda x: HtmlDocument(x[1]), document.anchors)
                yield HtmlDocumentTextData(doc=document)
            level = next_level
        

### 1.4.1. Tests ###

In [28]:
crawler = Crawler()
counter = Counter()

# I have intentially changed depth from 2 to 1 since the number of links 
# that appears with depth=2 is too large to be proceed with my local machine

for c in crawler.crawl_generator("https://innopolis.university/en/", depth=1):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis sould be among most common'

https://innopolis.university/en/
346 distinct word(s) so far
https://media.innopolis.university/news/innopolis-university-extends-international-application-deadline-/
665 distinct word(s) so far
https://www.youtube.com/user/InnopolisU
687 distinct word(s) so far
https://apply.innopolis.ru/en/
2841 distinct word(s) so far
https://media.innopolis.university/en
2920 distinct word(s) so far
https://innopolis.university/en/ido/
3012 distinct word(s) so far
https://innopolis.university/en/internationalpartners/
3195 distinct word(s) so far
https://innopolis.university/en/outgoingstudents/
4030 distinct word(s) so far
https://panoroo.com/virtual-tours/NvQZM6B2
4033 distinct word(s) so far
https://innopolis.university/en/faculty/
4895 distinct word(s) so far
https://vk.com/innopolisu
5071 distinct word(s) so far
https://innopolis.university/
5094 distinct word(s) so far
https://career.innopolis.university/konkursnyezayavkiprofessorskoprepodavatelskogosostava/
5198 distinct word(s) so far
https

Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.


https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
Skipping https://innopolis.university/public/files/Consent_to_the_processing_of_PD_for_UI.pdf
https://innopolis.university/en/campus
8125 distinct word(s) so far
https://media.innopolis.university/en/
8125 distinct word(s) so far
https://media.innopolis.university/news/webinar-interstudents-eng/
8130 distinct word(s) so far
https://media.innopolis.university/en/news/
8130 distinct word(s) so far
https://t.me/universityinnopolis
8136 distinct word(s) so far
https://media.innopolis.university/news/registration-innopolis-open-2020/
8202 distinct word(s) so far
https://apply.innopolis.university/en/master/
8213 distinct word(s) so far
https://innopolis.university/en/team-structure/
8214 distinct word(s) so far
https://innopolis.university/en/team-structure/education-academics/
8214 distinct word(s) so far
https://apply.innopolis.university/en/bachelor/
8249 distinct word(s) so far
https://innopolis.univers