# Crawler
## 1. Download and persist #
Please complete a code for `load()`, `download()` and `persist()` methods of `Document` class. What they do:
- for a given URL `download()` method downloads binary data and stores in `self.content`. It returns `True` for success, else `False`.
- `persist()` method saves `self.content` somewhere in file system. We do it to avoid multiple downloads (for caching in other words).
- `load()` method loads data from hard drive. Returns `True` for success.

Tests checks that your code somehow works.

In [20]:
import requests
from urllib.parse import quote, urlparse
import requests
import os

class Document:
    
    def __init__(self, url):
        self.url = url
        
        parsed_url = urlparse(self.url)
        self.file_path = "".join(["data//", parsed_url.netloc, ".", *parsed_url.path.split("/")])
        
    def get(self):
        if not self.load():
            if not self.download():
                raise FileNotFoundError(self.url)
            else:
                self.persist()
    
    def download(self):
        #TODO download self.url content, store it in self.content and return True in case of success
        response = requests.get(self.url)
        if response.status_code == 200:
            self.content = response.content.decode(response.encoding)
            return True
        return False
    
    def persist(self):
        #TODO write document content to hard drive
        os.makedirs(os.path.dirname(self.file_path), exist_ok=True)
        with open(self.file_path, "w") as f:
            f.write(self.content)
            
    def load(self):
        #TODO load content from hard drive, store it in self.content and return True in case of success
        if os.path.isfile(self.file_path):
            with open(self.file_path, "r") as f:
                self.content = f.read()
                return True
        return False

### 1.1. Tests ###

In [21]:
doc = Document('http://sprotasov.ru/data/iu.txt')

doc.get()
assert doc.content, "Document download failed"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document content error"

doc.get()
assert doc.load(), "Load should return true for saved document"
assert "Code snippets, demos and labs for the course" in str(doc.content), "Document load from disk error"

## 2. Parse HTML ##
`BeautifulSoap` library is a de facto standard to parse XML and HTML documents in python. Use it to complete `parse()` method that extracts document contents. You should initialize:
- `self.anchors` list of tuples `('text', 'url')` met in a document. Be aware, there exist relative links (e.g. `../content/pic.jpg`). Use `urllib.parse.urljoin()` to fix this issue.
- `self.images` list of images met in a document. Again, links can be relative to current page.
- `self.text` should keep plain text of the document without scripts, tags, comments and so on. You can refer to [this stackoverflow answer](https://stackoverflow.com/a/1983219) for details.

In [22]:
from bs4 import BeautifulSoup, SoupStrainer
from bs4.element import Comment
import urllib.parse


class HtmlDocument(Document):
    
    def parse(self):
        #TODO exctact plain text, images and links from the document
        def parse_text():
            soup = BeautifulSoup(self.content)
            [s.extract() for s in soup(['style', 'script', '[document]', 'head', 'title'])]
            return soup.getText()
        
        def parse_links():
            ## Parse only 'a' attribute and get links from their
            soup = BeautifulSoup(self.content, parse_only=SoupStrainer('a'))
            return [(link.contents, urllib.parse.urljoin(self.url, link['href'])) for link in soup if link.has_attr('href')]
            
        def parse_imgs():
            soup = BeautifulSoup(self.content, parse_only=SoupStrainer('img'))
            return [urllib.parse.urljoin(self.url, link['src']) for link in soup if link.has_attr('src')]
        

        self.anchors = parse_links()
        self.images = parse_imgs()
        self.text = parse_text()

### 2.1. Tests ###

In [23]:
doc = HtmlDocument("http://sprotasov.ru")
doc.get()
doc.parse()

assert "тестирующий сервер codetest" in doc.text, "Error parsing text"
assert "http://sprotasov.ru/images/phone.png" in doc.images, "Error parsing images"
assert any(p[1] == "http://university.innopolis.ru/" for p in doc.anchors), "Error parsing links"

## 3. Document analysis ##
Complete the code for `HtmlDocumentTextData` class. Implement word and sentence splitting (use any method you can propose). Your `get_word_stats()` method should return `Counter` object. Don't forget to lowercase your words.

In [24]:
from collections import Counter
import nltk, string

nltk.download('punkt')

class HtmlDocumentTextData:
    
    def __init__(self, url):
        self.doc = HtmlDocument(url)
        self.doc.get()
        self.doc.parse()
    
    def get_sentences(self):
        #TODO*: implement sentence parser
        words = nltk.word_tokenize(self.doc.text)
        cleaned_words = [w.lower() for w in words if w.isalpha()]
        return cleaned_words
    
    def get_word_stats(self):
        #TODO return Counter object of the document, containing mapping {`word` -> count_in_doc}
        cleaned_words = self.get_sentences()
        return Counter(cleaned_words)

[nltk_data] Downloading package punkt to /Users/osmiyg/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### 3.1. Tests ###

In [25]:
doc = HtmlDocumentTextData("https://innopolis.university/")

print(doc.get_word_stats().most_common(10))
assert [x for x in doc.get_word_stats().most_common(10) if x[0] == 'иннополис'], 'иннополис sould be among most common'

[('и', 46), ('в', 28), ('иннополис', 21), ('по', 14), ('на', 12), ('университета', 10), ('университет', 10), ('лаборатория', 10), ('с', 8), ('об', 7)]


## 4. Crawling ##

Method `crawl_generator()` is given starting url (`source`) and max depth of search. It should return a **generator** of `HtmlDocumentTextData` objects (return a document as soon as it is downloaded and parsed). You can benefit from `yield obj_name` python construction. Use `HtmlDocumentTextData.anchors` field to go deeper.

In [26]:
class Crawler:
    
    def crawl_generator(self, source, depth=1):
        #TODO return real crawling results. Don't forget to process failures
        queue = []
        checked = set()
        tmp_anchor_list = []
        
        # Put initial links
        try:
            src = HtmlDocumentTextData(source)
            checked.add(source)
            yield src
            for anchor in src.doc.anchors:
                queue.append(anchor)
        except Exception as e:
            print(e)
            
        # Going until depth is satisified
        for i in range(depth):
            while (queue):
                try:
                    link = queue.pop(0)[1]
                    # We procceed if the link was not crawled before
                    if link not in checked:
                        checked.add(link)
                        src = HtmlDocumentTextData(link)
                        yield src
                        for anchor in src.doc.anchors:
                            tmp_anchor_list.append(anchor)
                        
                except Exception as e:
                    print(e)
            for anchor in tmp_anchor_list:
                    queue.append(anchor)
            tmp_anchor_list = []
            

### 4.1. Tests ###

In [27]:
crawler = Crawler()
counter = Counter()

for c in crawler.crawl_generator("https://innopolis.university/en/", 2):
    print(c.doc.url)
    if c.doc.url[-4:] in ('.pdf', '.mp3', '.avi', '.mp4', '.txt'):
        print("Skipping", c.doc.url)
        continue
    counter.update(c.get_word_stats())
    print(len(counter), "distinct word(s) so far")
    
print("Done")

print(counter.most_common(20))
assert [x for x in counter.most_common(20) if x[0] == 'innopolis'], 'innopolis sould be among most common'

https://innopolis.university/en/
280 distinct word(s) so far
http://old.innopolis.university/en/
300 distinct word(s) so far
https://media.innopolis.university/en/
422 distinct word(s) so far
https://www.facebook.com/InnopolisU
573 distinct word(s) so far
https://vk.com/innopolisu
807 distinct word(s) so far
https://www.youtube.com/user/InnopolisU
817 distinct word(s) so far
https://twitter.com/InnopolisU
https://www.instagram.com/innopolisu/
817 distinct word(s) so far
https://habr.com/ru/users/t-fazullin/posts/
1501 distinct word(s) so far
https://apply.innopolis.university/en/
2185 distinct word(s) so far
https://corporate.innopolis.university/
2436 distinct word(s) so far
https://media.innopolis.university/
2676 distinct word(s) so far
https://innopolis.university/lk/
2805 distinct word(s) so far
https://innopolis.university/en/about/
2900 distinct word(s) so far
https://innopolis.university/en/board/
2963 distinct word(s) so far
https://innopolis.university/en/team/
2964 distinct 

https://www.facebook.com/InnopolisU/posts/?ref=page_internal
7551 distinct word(s) so far
https://www.facebook.com/InnopolisU/community/?ref=page_internal
7555 distinct word(s) so far
https://www.facebook.com/login/?next=https%3A%2F%2Fwww.facebook.com%2FInnopolisU%2F&privacy_mutation_token=eyJ0eXBlIjowLCJjcmVhdGlvbl90aW1lIjoxNjExNDA0MDc3LCJjYWxsc2l0ZV9pZCI6Mzc4Mzc1MTU5OTY2NjIxfQ%3D%3D
7558 distinct word(s) so far
https://www.facebook.com/reg/?rs=2&privacy_mutation_token=eyJ0eXBlIjowLCJjcmVhdGlvbl90aW1lIjoxNjExNDA0MDc3LCJjYWxsc2l0ZV9pZCI6Mzc4Mzc1MTU5OTY2NjIxfQ%3D%3D
7603 distinct word(s) so far
https://www.facebook.com/reg/?rs=1&privacy_mutation_token=eyJ0eXBlIjowLCJjcmVhdGlvbl90aW1lIjoxNjExNDA0MDc3LCJjYWxsc2l0ZV9pZCI6NjQ4NzMxOTgyNDk1ODQyfQ%3D%3D
7603 distinct word(s) so far
https://www.facebook.com/InnopolisU/reviews/
7603 distinct word(s) so far
https://www.facebook.com/InnopolisU/community/
7603 distinct word(s) so far
https://www.facebook.com/InnopolisU/about/
7603 distinct word(s) 

https://vk.com/like?act=publish&object=wall-56385969_10180&from=innopolisu
https://vk.com/wall-56385969_10177
11241 distinct word(s) so far
https://vk.com/dovuziu?from=group
11371 distinct word(s) so far
https://vk.com/wall-175449549_2997
11382 distinct word(s) so far
https://vk.com/away.php?to=https%3A%2F%2Fvk.cc%2FbXn6QB&post=-56385969_10177
11382 distinct word(s) so far
https://vk.com/feed?section=search&q=%23itschool_innopolis
11382 distinct word(s) so far
https://vk.com/photo-175449549_457254458?list=wall-56385969_10177&from=group
11383 distinct word(s) so far
https://vk.com/like?act=add&object=wall-56385969_10177&from=innopolisu&hash=32201f20d7f06bf488&one=0
https://vk.com/like?act=publish&object=wall-56385969_10177&from=innopolisu
https://vk.com/wall-56385969_10174
11385 distinct word(s) so far
https://vk.com/away.php?to=http%3A%2F%2Fincommon.ru&post=-56385969_10174
11385 distinct word(s) so far
https://vk.com/photo-56385969_457245505?rev=1&post=-56385969_10174&from=group
11385 

'utf-8' codec can't decode byte 0xed in position 95578: invalid continuation byte
'utf-8' codec can't decode byte 0xed in position 114894: invalid continuation byte
'utf-8' codec can't decode byte 0xed in position 114894: invalid continuation byte
'utf-8' codec can't decode byte 0xed in position 114894: invalid continuation byte
'utf-8' codec can't decode byte 0xed in position 227653: invalid continuation byte
https://habr.com/ru/hub/hackathons/
21028 distinct word(s) so far
https://habr.com/ru/hub/robo_dev/
21563 distinct word(s) so far
'utf-8' codec can't decode byte 0xed in position 227653: invalid continuation byte
'utf-8' codec can't decode byte 0xed in position 227653: invalid continuation byte
'utf-8' codec can't decode byte 0xed in position 141967: invalid continuation byte
https://habr.com/ru/hub/bigdata/
21908 distinct word(s) so far
'utf-8' codec can't decode byte 0xed in position 141967: invalid continuation byte
'utf-8' codec can't decode byte 0xed in position 141967: inva

https://www.ets.org/s/cv/toefl/at-home/
25165 distinct word(s) so far
https://englishtest.duolingo.com/
25165 distinct word(s) so far
http://nic.gov.ru/en/proc/nic/legalize
25227 distinct word(s) so far
https://drive.google.com/file/d/10spw7SYKomHSyWuaNMWMl3-t_OiiR-qg/view
25227 distinct word(s) so far
https://drive.google.com/file/d/193kK6kV0RAzdp_wRksSJy9a9y1s0gMaE/view
25227 distinct word(s) so far
https://drive.google.com/file/d/1YYhSiVTm6bxxeBt0fv7rbrMdgU0C2e2_/view
25227 distinct word(s) so far
https://drive.google.com/file/d/1HATau9H2J2rJbC45o2EiREsGDKpNODA1/view
25227 distinct word(s) so far
https://innopolis.university/sveden/apply/
25315 distinct word(s) so far
https://www.embassy-worldwide.com/country/russia/?__cf_chl_jschl_tk__=a94cd95afa1311b39144c68ade05bb4c404c36e0-1603284143-0-AdT5wLmOm0W3vCpGYOe8JBgEGI_UYrAl9mWAr7wF22c8uVpnkbbSHOvIwBv5Iye4w3P-miBYqs-aStGmVo93k7jYCS4NQJ4cBWodhvlIsetlsYyQkaTTq_1OvbDiIsbarbVcfHyi_KH0wJDIB4Z1JSPevDKQ6oR5rH2pop3jbhEktZbCT0kRVDrWxej3fichPVhN

https://innopolis.university/laboratoriya-iskusstvennogo-intellekta-v-razrabotke-igr
29854 distinct word(s) so far
https://innopolis.university/laboratoriya-mashinnogo-obucheniya-i-predstavleniya-dannyh
29901 distinct word(s) so far
https://innopolis.university/laboratoriya-kiberfizicheskih-sistem
29918 distinct word(s) so far
https://innopolis.university/laboratoriya-setej-i-tekhnologij-blokchejn
29933 distinct word(s) so far
https://innopolis.university/laboratoriya-intellektualnyh-robototekhnicheskih-sistem
30023 distinct word(s) so far
https://innopolis.university/publikatsii
30037 distinct word(s) so far
https://innopolis.university/patenty
30100 distinct word(s) so far
https://innopolis.university/portfel-proektov/
30182 distinct word(s) so far
https://innopolis.university/center-blockchain2/
30214 distinct word(s) so far
https://innopolis.university/centergis3/
30321 distinct word(s) so far
https://innopolis.university/center-cybersecurity/
30424 distinct word(s) so far
https://

https://career.innopolis.university/
32570 distinct word(s) so far
decode() argument 'encoding' must be str, not None
decode() argument 'encoding' must be str, not None
https://career.innopolis.university/success-stories/julia-kazaeva/
32571 distinct word(s) so far
https://career.innopolis.university/success-stories/farid-gainullin/
32571 distinct word(s) so far
https://career.innopolis.university/success-stories/salimzhan-gafurov-en/
32571 distinct word(s) so far
No connection adapters were found for 'mailto:resume@innopolis.ru'
http://www.innopolis.com/resident/sport/sports-complex/
32719 distinct word(s) so far
http://www.innopolis.com/resident/sport/jogging-routes/
32738 distinct word(s) so far
No connection adapters were found for 'mailto:hoteluni@innopolis.ru'
http://hotel.innopolis.university/
32783 distinct word(s) so far
No connection adapters were found for 'mailto:dovuz@innopolis.university'
No connection adapters were found for 'mailto:admissions@innopolis.ru'
decode() argu

https://vk.com/innostudents
33204 distinct word(s) so far
https://www.instagram.com/innostudents/
33204 distinct word(s) so far
http://t.me/StudentAffairs_bot
33204 distinct word(s) so far
https://tilda.cc/?upm=686376
33482 distinct word(s) so far
https://us02web.zoom.us/j/9660990045?pwd=haAZQtiWaT-BVj-FamfQPy-KdnNBvQ
33482 distinct word(s) so far
https://us02web.zoom.us/j/86924417470?pwd=cWpHUlk4OG5BcVNjek94Qk9mMEFjQT09
33482 distinct word(s) so far
https://us02web.zoom.us/j/82115193290?pwd=aStDSVpXUnV5Q0JwZXB3Lys4YmJQUT09
33482 distinct word(s) so far
https://us02web.zoom.us/j/81023939550?pwd=ZHZka0hGTVh4RWFoOUZ3V3lESjVYUT09
33482 distinct word(s) so far
https://us02web.zoom.us/j/84204806112?pwd=Q0ROZkw3bmVGUTB2Q3pYRjFPRWk0UT09
33482 distinct word(s) so far
https://media.innopolis.university/en/news/kuka-robotics-innopolis-university-cooperation/
33489 distinct word(s) so far
https://media.innopolis.university/en/news/postgres-laboratory-innopolis-university/
33495 distinct word(s) s