# Crawlers

**Our plan for today:**

1. What is a crawler?
2. How can we create a simple crawler?
3. How to avoid being blocked?

## What is a crawler?

A crawler is a program that crawls across webpages and collects information. Use it responsibly: don't download data that the site's owners don't want downloaded, don't overload the server, and avoid getting your IP address blocked.
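
One common way to find out what a site's owners allow is its `robots.txt` file. The sketch below uses the standard-library `urllib.robotparser` on a hypothetical `robots.txt` body (the rules, the `example.org` URLs and the `my-crawler` name are assumptions for illustration, so the example runs without network access).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Build a robots.txt parser from the text of the file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether our (hypothetical) crawler may fetch each URL.
allowed = parser.can_fetch("my-crawler", "https://example.org/news/page1.html")
blocked = parser.can_fetch("my-crawler", "https://example.org/private/data.html")

# Some sites also state how many seconds to wait between requests.
delay = parser.crawl_delay("my-crawler")

print(allowed, blocked, delay)
```

For a real site you would point the parser at the live file with `set_url(...)` and `read()` instead of parsing an inline string.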

## How to create a simple crawler?

In [1]:
import requests
from pprint import pprint

session = requests.session()

In [2]:
response = session.get('https://ru.wikipedia.org')
response.headers['X-Client-IP']

'35.201.170.80'

In [3]:
pprint(dict(response.headers))

{'accept-ch': 'Sec-CH-UA-Arch,Sec-CH-UA-Bitness,Sec-CH-UA-Full-Version-List,Sec-CH-UA-Model,Sec-CH-UA-Platform-Version',
 'accept-ranges': 'bytes',
 'age': '79',
 'cache-control': 'private, s-maxage=0, max-age=0, must-revalidate',
 'content-encoding': 'gzip',
 'content-language': 'ru',
 'content-length': '27526',
 'content-type': 'text/html; charset=UTF-8',
 'date': 'Mon, 17 Oct 2022 07:06:22 GMT',
 'last-modified': 'Mon, 17 Oct 2022 07:06:04 GMT',
 'nel': '{ "report_to": "wm_nel", "max_age": 86400, "failure_fraction": 0.05, '
        '"success_fraction": 0.0}',
 'permissions-policy': 'interest-cohort=(),ch-ua-arch=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-bitness=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-full-version-list=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-model=(self '
                       '"intake-analytics.wikimedia.org"),ch-ua-platform-version=(self '
                       '"intake-

### Strategies of data collection


Basically, crawlers collect webpages (their HTML) in loops, or in nested loops. Some data collection strategies:
    
**By navigation type**

1. The webpages are numbered, usually with a parameter such as p=(a number) or page=(a number), for example "https://ficbook.net/fanfiction/no_fandom/originals?p=2". Then you just loop over the relevant numbers.
2. The webpages have arbitrary names. Then you first collect the links to the relevant webpages and afterwards go through them to collect the data.

**By the speed of updating**

1. If the website is updated slowly, you can first collect all the links and then go through them.
2. If the website is updated quickly, you need to collect the data right after you get each link and only then move on to the next webpage.
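
The two navigation types can be sketched as follows. This is a minimal illustration that uses only the standard library; the `example.org` URLs, the inline HTML and the names `numbered_page_urls` and `LinkCollector` are assumptions made for the example, not part of any real site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def numbered_page_urls(template, first, last):
    """Navigation type 1: pages differ only by a number, so build the URLs directly."""
    return [template.format(n=n) for n in range(first, last + 1)]


class LinkCollector(HTMLParser):
    """Navigation type 2: collect the href of every <a> tag on a listing page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute ones.
                self.links.append(urljoin(self.base_url, href))


# Hypothetical listing page, inlined so the example runs offline.
LISTING_HTML = (
    '<div><a href="/stories/first-story.html">A</a>'
    '<a href="/stories/second-story.html">B</a></div>'
)

page_urls = numbered_page_urls("https://example.org/list?p={n}", 1, 3)

collector = LinkCollector("https://example.org")
collector.feed(LISTING_HTML)

print(page_urls)
print(collector.links)
```

For a slowly updated site you could store `collector.links` first and download the pages later; for a quickly updated one you would download each page as soon as its link is found.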



## Avoiding getting blocked

By the way, Wikipedia doesn't block downloads, so it can be used as a source of linguistic data.

### A pause between requests

In [4]:
import time

for _ in range(5):
    response = session.get('https://ru.wikipedia.org')
    print(response.headers['Date'])
    time.sleep(3)

Mon, 17 Oct 2022 07:06:22 GMT
Mon, 17 Oct 2022 07:06:22 GMT
Mon, 17 Oct 2022 07:06:22 GMT
Mon, 17 Oct 2022 07:06:22 GMT
Mon, 17 Oct 2022 07:06:22 GMT


### Presenting yourself as a regular browser

In [None]:
from fake_useragent import UserAgent
ua = UserAgent(verify_ssl=False)

headers = {'User-Agent': ua.random}
print(headers)
response = session.get('https://ru.wikipedia.org', headers=headers)
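
If `fake_useragent` is not available, a similar effect can be approximated with a small hand-written list. The sketch below is an assumption-laden stand-in: the `USER_AGENTS` strings are illustrative examples of browser-style values, and `random_headers` is a hypothetical helper name.

```python
import random

# Illustrative browser-style User-Agent strings (assumed, not exhaustive).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/16.0 Safari/605.1.15",
]


def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


headers = random_headers()
print(headers)
```

The resulting dictionary can be passed as `headers=` to `session.get`, in the same way as in the cell above.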

### A pause between requests (random time)

In [8]:
import random

for _ in range(5):
    response = session.get('https://ru.wikipedia.org')
    print(response.headers['Date'])
    time.sleep(random.uniform(1.1, 5.2))

Mon, 17 Oct 2022 07:10:37 GMT
Mon, 17 Oct 2022 07:10:37 GMT
Mon, 17 Oct 2022 07:10:37 GMT
Mon, 17 Oct 2022 07:10:47 GMT
Mon, 17 Oct 2022 07:10:47 GMT
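
The random pause can also be wrapped in a small helper so that every loop over URLs uses it. This is a sketch under stated assumptions: `crawl_politely` is a hypothetical name, and `fetch` and `sleep` are passed in as parameters so that the example can run without network access or real waiting.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=1.0, max_delay=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing a random time between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # No pause before the first request, a random one before the rest.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Offline demonstration: a fake fetch, and a "sleep" that only records the pauses.
waits = []
pages = crawl_politely(
    ["page1", "page2", "page3"],
    fetch=lambda url: url.upper(),
    sleep=waits.append,
)

print(pages)
print(len(waits))
```

With a real session you could pass something like `fetch=lambda url: session.get(url).text` and leave `sleep` at its default.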


## An example

Let's download some news from the HSE website!

1. The listing pages have URLs of the form "https://www.hse.ru/news/page2.html", so we can loop through them by number.
2. Let's extract the publication date, the title, a short description, the full text of the publication, and the tags.
3. Then we put the collected data into a database.

In [10]:
import sqlite3
from html import unescape
from bs4 import BeautifulSoup
import re

In [11]:
conn = sqlite3.connect('hse_news.db')
cur = conn.cursor()

In [12]:
cur.execute("""
CREATE TABLE IF NOT EXISTS texts 
(id int PRIMARY KEY, hse_id text, pub_year int, pub_month int, 
pub_day int, title text, short_text text, full_text text)
""")

cur.execute("""
CREATE TABLE IF NOT EXISTS tags 
(id int PRIMARY KEY, tag_name text) 
""")

cur.execute("""
CREATE TABLE IF NOT EXISTS text_to_tag 
(id INTEGER PRIMARY KEY AUTOINCREMENT, id_text int, id_tag int) 
""")

conn.commit()
conn.close()

**Step 1. Finding the webpages**

In [13]:
page_number = 1
url = f'https://www.hse.ru/news/page{page_number}.html'
req = session.get(url, headers={'User-Agent': ua.random})
page = req.text

In [14]:
soup = BeautifulSoup(page, 'html.parser')

In [15]:
news = soup.find_all('div', {'class': 'post_first'})

In [16]:
title = news[0].find('a').text
title

'«Помогая друг другу, заботясь о людях вокруг нас, мы сможем вступить в достойное будущее»'

In [17]:
attrs = news[0].find('a').attrs
attrs

{'href': '/news/community/783801103.html',
 'class': ['link', 'link_dark2', 'no-visited']}

In [18]:
href = news[0].find('a').attrs['href']
href

'/news/community/783801103.html'

In [19]:
short_text = news[0].find('div', {'class': 'post__text'}).text
short_text

'19 октября в Центре культур ВШЭ состоится фестиваль «Письмо в будущее». Его придумали студенты Высшей школы экономики совместно с Центром лидерства и волонтерства Вышки. Главная цель — сплотить университетское сообщество и показать, как можно помогать сегодня, чтобы создавать благополучное завтра.'

In [20]:
pub_day = news[0].find('div', {'class': 'post-meta__day'}).text
pub_day

'14'

In [21]:
pub_month = news[0].find('div', {'class': 'post-meta__month'}).text
pub_month

'окт'

In [22]:
pub_year = news[0].find('div', {'class': 'post-meta__year'}).text
pub_year

'2022'

**Step 2. Parsing the webpage of a single news article**

In [23]:
url_one = 'http://www.hse.ru'+href
url_one

'http://www.hse.ru/news/community/783801103.html'

In [24]:
req = session.get(url_one, headers={'User-Agent': ua.random})
page = req.text

soup = BeautifulSoup(page, 'html.parser')

In [25]:
full_text = soup.find('div', {'class': 'post__content'}).text
full_text[:200]

'«Помогая друг другу, заботясь о людях вокруг нас, мы сможем вступить в достойное будущее»© iStock19 октября в Центре культур ВШЭ состоится фестиваль «Письмо в будущее». Его придумали студенты Высшей ш'

In [27]:
meta = soup.find('div', {'class': 'articleMeta'})

tags = meta.find_all('a', {'class': 'tag'})
tags = [t.text for t in tags]
tags

['приглашение к участию', 'волонтерство']

**Step 3. Wrapping the steps in functions**

In [28]:
months = {
    value: key+1
    for key, value in enumerate(
        ['янв', 'фев', 'мар', 'апр', 'мая', 'июн', 'июл', 'авг', 'сен', 'окт', 'ноя', 'дек']
    )
}
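
As an illustration of how such a mapping can be used, the sketch below turns the day, month abbreviation and year scraped in Step 1 into a `datetime.date`. The names `MONTHS` and `to_date` are hypothetical and are not used elsewhere in this notebook.

```python
from datetime import date

# Same abbreviations as in the mapping above, numbered from 1.
MONTHS = {
    name: number
    for number, name in enumerate(
        ['янв', 'фев', 'мар', 'апр', 'мая', 'июн',
         'июл', 'авг', 'сен', 'окт', 'ноя', 'дек'],
        start=1,
    )
}


def to_date(day, month_abbr, year):
    """Combine the scraped strings into a single date object."""
    return date(int(year), MONTHS[month_abbr], int(day))


pub_date = to_date('14', 'окт', '2022')
print(pub_date.isoformat())  # 2022-10-14
```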

To parse one news block from the webpage with a list of news articles:

In [29]:
def parse_first_level_info(one_block):
    block = {}
    block['title'] = one_block.find('a').text
    block['href'] = one_block.find('a').attrs['href']
    block['short_text'] = one_block.find('div', {'class': 'post__text'}).text
    block['pub_day'] = int(one_block.find('div', {'class': 'post-meta__day'}).text)
    block['pub_month'] = months[one_block.find('div', {'class': 'post-meta__month'}).text]
    block['pub_year'] = int(one_block.find('div', {'class': 'post-meta__year'}).text)
    return block

To parse the webpage of a news article:

In [30]:
def parse_second_level_info(block):
    url_one = 'http://www.hse.ru' + block['href']
    req = session.get(url_one, headers={'User-Agent': ua.random})
    page = req.text
    soup = BeautifulSoup(page, 'html.parser')
    block['full_text'] = soup.find('div', {'class': 'post__content'}).text
    meta = soup.find('div', {'class': 'articleMeta'})
    tags = meta.find_all('a', {'class': 'tag'})
    block['tags'] = [t.text for t in tags]     
    return block

In [31]:
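# Numeric article id at the end of the link, e.g. '783801103' in '/news/community/783801103.html'.
# The dot is escaped so that it matches only a literal '.'.
regex_hse_id = re.compile(r'/([0-9]+)\.html')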
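
To see what an id-extracting pattern of this kind does, here is a small self-contained check. `HSE_ID_RE` and `extract_hse_id` are hypothetical names used only in this sketch; the first link is the one obtained in Step 1.

```python
import re

# A '/' followed by digits and a literal '.html' at the end of the link.
HSE_ID_RE = re.compile(r'/([0-9]+)\.html$')


def extract_hse_id(href):
    """Return the numeric article id from a link, or None if there is none."""
    match = HSE_ID_RE.search(href)
    return match.group(1) if match else None


article_id = extract_hse_id('/news/community/783801103.html')
listing_id = extract_hse_id('/news/page2.html')

print(article_id)  # 783801103
print(listing_id)  # None
```

Returning `None` instead of indexing into an empty list makes it easy to skip links that do not point to an article.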

In [32]:
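def get_nth_page(page_number):
    # Download one listing page and find all news blocks on it.
    url = f'https://www.hse.ru/news/page{page_number}.html'
    req = session.get(url, headers={'User-Agent': ua.random})
    page = req.text
    soup = BeautifulSoup(page, 'html.parser')
    news = soup.find_all('div', {'class': 'post'})
    # First level: title, link, short text and date of each block.
    blocks = []
    for n in news:
        try:
            blocks.append(parse_first_level_info(n))
        except Exception as e:
            # Blocks that cannot be parsed are reported and skipped.
            print(e)
    # Second level: follow local links to articles that are not stored yet.
    # seen_news (defined in a later cell) holds the hse_id values already in the database.
    result = []
    for b in blocks:
        if b['href'].startswith('/'):
            idx = regex_hse_id.findall(b['href'])[0]
            if idx not in seen_news:
                try:
                    res = parse_second_level_info(b)
                    res['hse_id'] = idx
                    result.append(res)
                except Exception as e:
                    print(e)
            else:
                print('Seen', b['href'])
    return result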

**Step 4. Putting the data into a database**

We need a dictionary that maps tag names to their ids and a set of the articles we have already seen (to avoid duplicates).

In [33]:
def write_to_db(block):
    # Map each tag name to its id, inserting tags we have not met before.
    tags = []
    for tag in block['tags']:
        if tag in db_tags:
            tags.append(db_tags[tag])
        else:
            # New tag: its id is the next free number in db_tags.
            db_tags[tag] = len(db_tags) + 1
            cur.execute('INSERT INTO tags VALUES (?, ?)', (len(db_tags), tag))
            conn.commit()
            tags.append(db_tags[tag])
    # The id of the new text is the number of stored articles plus one.
    text_id = len(seen_news) + 1
    cur.execute(
        'INSERT INTO texts VALUES (?, ?, ?, ?, ?, ?, ?, ?)',
        (text_id, block['hse_id'],
         block['pub_year'], block['pub_month'], block['pub_day'],
         block['title'], block['short_text'], block['full_text'])
    )
    # Link the text to each of its tags.
    tags = [(text_id, t) for t in tags]
    cur.executemany(
        'INSERT INTO text_to_tag (id_text, id_tag) VALUES (?, ?)',
        tags
    )
    conn.commit()
    # Remember the article so that it is not downloaded again.
    seen_news.add(block['hse_id'])
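
A different way to guard against duplicates, not the one used in this notebook, is to let the database enforce uniqueness. The sketch below assumes a simplified `texts` table with a `UNIQUE` constraint on `hse_id` and uses an in-memory SQLite database; `demo_conn`, `demo_cur` and `insert_article` are hypothetical names.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()

# Simplified table: the UNIQUE constraint rejects repeated hse_id values.
demo_cur.execute("""
CREATE TABLE texts
(id INTEGER PRIMARY KEY, hse_id TEXT UNIQUE, title TEXT)
""")


def insert_article(cursor, hse_id, title):
    """Insert an article; return True if a row was added, False if it was a duplicate."""
    cursor.execute(
        'INSERT OR IGNORE INTO texts (hse_id, title) VALUES (?, ?)',
        (hse_id, title)
    )
    return cursor.rowcount == 1


first = insert_article(demo_cur, '783801103', 'First copy')
second = insert_article(demo_cur, '783801103', 'Second copy')
demo_conn.commit()

demo_cur.execute('SELECT count(*) FROM texts')
n_rows = demo_cur.fetchone()[0]
demo_conn.close()

print(first, second, n_rows)
```

The in-memory set is still useful in the crawler itself, because it lets us skip the download of an article page before touching the database.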

In [34]:
conn = sqlite3.connect('hse_news.db')
cur = conn.cursor()
cur.execute('SELECT tag_name, id FROM tags')

db_tags = {}
for name, idx in cur.fetchall():
    db_tags[name] = idx

cur.execute('SELECT hse_id FROM texts')
seen_news = set(i[0] for i in cur.fetchall())

In [35]:
from tqdm.auto import tqdm

In [36]:
def run_all(n_pages):
    for i in tqdm(range(n_pages)):
        blocks = get_nth_page(i+1)
        for block in blocks:
            write_to_db(block)

In [37]:
run_all(100)

  0%|          | 0/100 [00:00<?, ?it/s]

('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
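
The `Connection aborted` messages above suggest that the server closed some connections during the run, and each such article is simply skipped. One possible mitigation, sketched here as an assumption rather than as part of the crawler above, is to retry a failed request after an increasing pause. `fetch_with_retries` is a hypothetical helper; `fetch` and `sleep` are parameters so that the example runs offline. It retries on `OSError`, the base class of Python's built-in `ConnectionError`; the exceptions raised by `requests` also derive from it.

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0,
                       retry_on=(OSError,), sleep=time.sleep):
    """Call fetch(url); on failure wait 1x, 2x, 4x... base_delay and try again."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on:
            if attempt == attempts - 1:
                # Out of attempts: let the caller see the error.
                raise
            sleep(base_delay * 2 ** attempt)


# Offline demonstration: a fake fetch that fails twice and then succeeds.
calls = {'n': 0}


def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('Remote end closed connection without response')
    return f'<html>{url}</html>'


waits = []
page = fetch_with_retries(
    flaky_fetch, 'https://example.org/news/page1.html', sleep=waits.append
)

print(page)
print(waits)  # [1.0, 2.0]
```

In the crawler, `fetch` could be a small function that wraps `session.get(...)` and returns `req.text`.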


In [38]:
cur.execute("""
SELECT count(text_to_tag.id) as cnt, tags.tag_name 
    FROM text_to_tag 
        JOIN tags ON tags.id = text_to_tag.id_tag 
            GROUP BY text_to_tag.id_tag 
            ORDER BY cnt DESC
            LIMIT 10;
""")
cur.fetchall()

[(249, 'репортаж о событии'),
 (180, 'исследования и аналитика'),
 (167, 'студенты'),
 (159, 'идеи и опыт'),
 (158, 'приглашение к участию'),
 (118, 'достижения'),
 (110, 'новое в ВШЭ'),
 (99, 'дискуссии'),
 (97, 'магистратура'),
 (89, 'публикации')]

In [39]:
cur.execute("""
SELECT count(pub_month) as cnt, pub_month
    FROM texts
        GROUP BY pub_month
        ORDER BY cnt DESC;
""")
cur.fetchall()

[(109, 10),
 (88, 9),
 (80, 12),
 (67, 11),
 (66, 4),
 (66, 6),
 (62, 7),
 (60, 2),
 (52, 5),
 (50, 3),
 (44, 8),
 (26, 1)]