# Seminar 7

## Web Crawling & Data Parsing
* A **Web crawler**, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
* **Parsing**, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
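
To make the two definitions concrete, here is a minimal sketch that crawls a tiny in-memory "web" instead of the real Internet. The `PAGES` dictionary, the `LinkParser` class and the `crawl` function are hypothetical names invented for this illustration; only `html.parser.HTMLParser` and `collections.deque` come from the standard library.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical miniature "web": page name -> HTML source (an assumption for the demo).
PAGES = {
    'a.html': '<a href="b.html">B</a> <a href="c.html">C</a>',
    'b.html': '<a href="a.html">back to A</a> <a href="d.html">D</a>',
    'c.html': '<p>No links here.</p>',
    'd.html': '<a href="c.html">C</a>',
}


class LinkParser(HTMLParser):
    """Parsing step: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(start):
    """Crawling step: breadth-first visit of every page reachable from start."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(PAGES.get(url, ''))
        for link in parser.links:
            if link not in seen:  # never visit the same page twice
                seen.add(link)
                queue.append(link)
    return order


visited = crawl('a.html')
print(visited)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.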

## Before Parsing...
### A few words about HTML

Create a simple `*.html` file as an example:

In [None]:
my_html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset = "UTF-8">
    <title>Title</title>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr>
<ol>
    <li>One</li>
    <li>Two</li>
</ol>
    
</body>
</html>
'''

In [None]:
with open('my.html', 'w') as f:
    f.write(my_html)

Then create an `*.html` file with a table:

In [None]:
my_html2 = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset = "UTF-8">
    <title>Title</title>
    <style type='text/css'>
        table {
            border-collapse: collapse;
        }

        table, th, td {
            border: 1px solid black;
        }
    </style>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr>
<ol>
    <li>One</li>
    <li>Two</li>
</ol>
<table>
    <tr>
        <td>
            Cell 1
        </td>
        <td>
            Cell 2
        </td>
    </tr>
    <tr>
        <td>
            Cell 3
        </td>
        <td>
            Cell 4
        </td>
    </tr>
</table>
</body>
</html>
'''
with open('my2.html', 'w') as f:
    f.write(my_html2)

We can now open these simple pages in a browser and take a look at them.

## Library `requests`

This [library](http://docs.python-requests.org/en/latest/) downloads web pages from within Python.

In [None]:
#!pip install requests --user

In [None]:
import requests

If the import succeeds, try to fetch a web page:

In [None]:
# Student Theses (Management) Page Example
r = requests.get('https://spb.hse.ru/en/ba/management/students/diplomas/')

In [None]:
# Check whether the request succeeded (True for status codes below 400)
r.ok

In [None]:
# Print the HTML-code of webpage
print(r.text)
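
A request can hang or come back with an error page, so it often helps to wrap the download in a small function. The sketch below is one possible way to do that; `fetch_html` and its `session` parameter are hypothetical names introduced here, while `timeout=` and `raise_for_status()` are part of `requests`.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Return the page text; raise requests.HTTPError on a 4xx/5xx answer."""
    # Use the given session if there is one, otherwise the module-level API.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()                # turn HTTP errors into exceptions
    return response.text
```

Calling `fetch_html('https://spb.hse.ru/en/ba/management/students/diplomas/')` would then return the same text as `r.text` above, or raise an exception instead of silently returning an error page.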

---

## BeautifulSoup

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is one of the most popular Python libraries for parsing data from web pages. It builds a parse tree from a page, and that tree can then be used to extract data from the HTML, which makes it useful for web scraping.

> See also the [Scrapy](https://scrapy.org/) library for Python.

> For R, see:
> ```r
> library('XML')
> library('rvest')
> library('httr')
> ```

`BeautifulSoup` is part of the `bs4` package, which comes pre-installed with Anaconda. Run the cell below to import it:

In [None]:
from bs4 import BeautifulSoup

**BeautifulSoup** supports several parsers. `html.parser` ships with Python's standard library, so it is always available. `lxml` has to be installed separately (`pip install lxml`) and is often faster.

Try both:

In [None]:
# Use a context manager so that the file is closed after parsing
with open('my.html') as f:
    soup1 = BeautifulSoup(f, 'html.parser')

In [None]:
# The same with the lxml parser (requires `pip install lxml`)
with open('my2.html') as f:
    soup2 = BeautifulSoup(f, 'lxml')
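
`BeautifulSoup` also accepts a plain string, and the choice of parser affects the resulting tree. As a small illustration (the `fragment` string is made up for this example), `html.parser` keeps an HTML fragment as it is, whereas `lxml` would wrap it into a complete `<html><body>...` document:

```python
from bs4 import BeautifulSoup

# A fragment without <html> or <body> (hypothetical example input).
fragment = '<p>I am a paragraph.</p>'
soup = BeautifulSoup(fragment, 'html.parser')

print(soup.p.string)  # the text of the first <p> tag
print(soup.html)      # None: html.parser does not add a missing <html> tag
```

This matters for navigation chains such as `soup.html.body.p`, which only work when the `<html>` and `<body>` tags are actually present in the tree.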

---
Then inspect the resulting objects:

In [None]:
soup1

In [None]:
print(soup1.prettify())

In [None]:
soup1.html.head

In [None]:
soup1.html.body.p

In [None]:
soup2.html.body.table

Find all `<tr>` (row) tags together with their content:

In [None]:
soup2.body.table.find_all('tr')

In [None]:
soup2.find_all('td')[0].string.strip()

In [None]:
rows = soup2.body.table.find_all('tr')
for i, row in enumerate(rows):
    print(i)
    print(row.td.string)

In [None]:
for row in rows:
    print(row.td.string.strip())

Load the table into a list of lists:

In [None]:
table = []
for row in rows:
    table.append([cell.string.strip() for cell in row.find_all('td')])
print(table)
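
The `.string` attribute works here because every cell holds exactly one piece of text. When a cell is empty or contains nested tags, `.string` is `None` and `.strip()` fails. A slightly more defensive sketch, using a made-up table for illustration, relies on `get_text()` instead:

```python
from bs4 import BeautifulSoup

# Hypothetical table with a nested tag and an empty cell.
html = '''
<table>
    <tr><td> Cell 1 </td><td><b>Cell</b> 2</td></tr>
    <tr><td> Cell 3 </td><td></td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

nested = soup.find_all('td')[1]
print(nested.string)                     # None: the cell has more than one child
print(nested.get_text(' ', strip=True))  # text pieces joined with a space

# get_text() always returns a string, so empty and nested cells are safe.
table = [[cell.get_text(' ', strip=True) for cell in row.find_all('td')]
         for row in soup.find_all('tr')]
print(table)
```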

### Parse tag attributes

In [None]:
soup2.html['lang']

In [None]:
soup2.html.head.style['type']

In [None]:
soup2.select('style')[0]['type']
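
Indexing a tag with `['name']` raises a `KeyError` when the attribute is missing. The sketch below, built on a made-up `<a>` tag, shows the dictionary-like alternatives `.attrs` and `.get()`:

```python
from bs4 import BeautifulSoup

# Hypothetical tag used only for this illustration.
soup = BeautifulSoup('<a class="btn primary" href="/docs">Docs</a>', 'html.parser')
link = soup.a

print(link.attrs)         # all attributes as a dictionary
print(link['class'])      # class is multi-valued, so it comes back as a list
print(link.get('href'))   # value of an existing attribute
print(link.get('title'))  # None instead of a KeyError for a missing attribute
```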

### Very nice!
Now parse the page with theses:

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')

Extract the hyperlinks (`<a href = ....>` tags) from the page:

In [None]:
soup('a', href = True)

Print the links themselves (here only for `<a>` tags without a `class` attribute):

In [None]:
for link in soup('a', href = True, class_ = False):
    print(link['href'])
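
Many of the printed links are relative, so they cannot be passed to `requests.get` directly. One common way to resolve them is `urllib.parse.urljoin` from the standard library. In this sketch the `hrefs` list is a hypothetical sample of the kinds of values a page may contain, not the actual output of the cell above:

```python
from urllib.parse import urljoin

# Address of the page the links were taken from.
base_url = 'https://spb.hse.ru/en/ba/management/students/diplomas/'

# Hypothetical href values: site-relative, page-relative, absolute, fragment.
hrefs = ['/en/ba/management/', 'page2.html', 'https://www.hse.ru/en/', '#top']

# Resolve every href against the address of the page it was found on.
absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```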

Then find the elements that hold the data for the table (inspect the page with `F12` in the browser) and collect them:

In [None]:
rows = soup.find_all('div', {'class': 'edu-programm__item small'})
rows[0]

Then build the list of column values for a single row:

In [None]:
[i.text.strip() for i in rows[0].find_all('div')]

Then do the same for all rows (nested loop):

In [None]:
table_theses = []
for j in rows:
    table_theses.append([i.text.strip() for i in j.find_all('div')])

table_theses

Create the table as a `pandas` DataFrame:

In [None]:
import pandas as pd
df = pd.DataFrame(table_theses, columns = ['Supervisor', 'Year', 'Student', 'Text', 'Title'])
df

### Now we can write the table to Excel
See the [example here](https://xlsxwriter.readthedocs.io/example_pandas_simple.html).

In [None]:
# The context manager saves and closes the file on exit
# (`writer.save()` was removed in pandas 2.0)
with pd.ExcelWriter('table_theses.xlsx', engine = 'xlsxwriter') as writer:
    df.to_excel(writer, index = False)

## Small Task
**Now try to run the loop over all pages.**

---
## Parse POST-requests (optional)
Run the following code with your LMS account (use the old-style LMS):

In [None]:
import getpass

login = input('Enter your login before @edu: ')
passw = getpass.getpass('Enter the password (it is stored only on your computer): ')

print('\nWait a second...\n')

post = {'user_login': str(login)+'@edu.hse.ru',
        'user_password': str(passw),
        'userLogin': '%D0%92%D0%BE%D0%B9%D1%82%D0%B8'
}
url = 'https://lms.hse.ru/index.php?_qf__login_form='

session = requests.session()
r = session.post(url, data = post)
r2 = session.get('https://lms.hse.ru/?gb')

table = BeautifulSoup(r2.text, 'lxml')
out = table.find_all('tr', {'class': 'tabl_0 trhidden'})
# Print the selected columns of every matching row, one row per line
for row in out:
    cells = row.find_all('td')
    print(' '.join(cells[i].get_text() for i in (0, 5, 6, 8, 9)))

login = ''
passw = ''