# ML for Finance
## Fall 2020
---

## Web Crawling & Data Parsing
* A **Web crawler**, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (web spidering).
* **Parsing**, syntax analysis, or syntactic analysis is the process of analysing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
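
To make the two definitions concrete, here is a minimal offline sketch that uses only the standard library. The `FAKE_WEB` dictionary, its page names, and the helper names are hypothetical stand-ins for real pages; a real crawler would download each page instead of looking it up.

```python
from collections import deque
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Parsing step: collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical in-memory "web": page name -> HTML source
FAKE_WEB = {
    'index': '<a href="about">About</a> <a href="news">News</a>',
    'about': '<a href="index">Home</a>',
    'news': '<a href="about">About</a> <a href="archive">Archive</a>',
    'archive': '<p>No links here.</p>',
}


def crawl(start):
    """Crawling step: visit every page reachable from `start` exactly once."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in extract_links(FAKE_WEB.get(page, '')):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl('index')
print(visited)
```

The parser turns raw HTML into a list of links, and the crawler follows those links breadth-first while remembering which pages it has already seen.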

## Library `requests`

This [library](http://docs.python-requests.org/en/latest/) downloads webpages from within a Python environment.

In [None]:
#!pip install requests --user

In [None]:
import requests

If the import succeeds, try to fetch a webpage:

In [None]:
# Term Paper Page Example
r = requests.get('https://spb.hse.ru/en/ma/finance/VKR')

In [None]:
# CHECK the status of loading
r.ok

In [None]:
# Print the HTML-code of webpage
print(r.text)

In [None]:
# Check Timetable...
page = 'https://spb.hse.ru/ma/finance/timetable?fromdate=2020.11.16&todate=2020.11.21&groupoid=38882&receiverType=3&timetable-courses=2&timetable-groups=38882'
q = requests.get(page)

In [None]:
q.ok

In [None]:
q.content
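
The timetable URL above carries all of its settings in the query string. As a sketch, the same URL can be assembled from a dictionary with the standard library, which makes it easier to change the dates or the group. The variable names `base` and `params` are illustrative only.

```python
from urllib.parse import urlencode

base = 'https://spb.hse.ru/ma/finance/timetable'
params = {
    'fromdate': '2020.11.16',
    'todate': '2020.11.21',
    'groupoid': '38882',
    'receiverType': '3',
    'timetable-courses': '2',
    'timetable-groups': '38882',
}

# urlencode joins the pairs with '&' and escapes unsafe characters
url = base + '?' + urlencode(params)
print(url)
```

`requests.get(base, params=params)` accepts the same dictionary, so the query string does not have to be built by hand.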

---

## Before Parsing...
### A few words about HTML

Create a simple `*.html` file as an example:

In [None]:
my_html = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset = "UTF-8">
    <title>Title</title>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr>
<ol>
    <li>One</li>
    <li>Two</li>
</ol>
    
</body>
</html>
'''

In [None]:
with open('my.html', 'w', encoding='utf-8') as f:
    f.write(my_html)

Then a `*.html` file with a table:

In [None]:
my_html2 = '''
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset = "UTF-8">
    <title>Title</title>
    <style type='text/css'>
        table {
            border-collapse: collapse;
        }

        table, th, td {
            border: 1px solid black;
        }
    </style>
</head>
<body>
<h1>Hello</h1>
<p>I'm a paragraph.</p>
<hr>
<ol>
    <li>One</li>
    <li>Two</li>
</ol>
<table>
    <tr>
        <td>
            Cell 1
        </td>
        <td>
            Cell 2
        </td>
    </tr>
    <tr>
        <td>
            Cell 3
        </td>
        <td>
            Cell 4
        </td>
    </tr>
</table>
</body>
</html>
'''
with open('my2.html', 'w', encoding='utf-8') as f:
    f.write(my_html2)

We can now open these simple pages in a browser and take a look at them.

## BeautifulSoup

[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is one of the most popular Python libraries for parsing data from webpages. It builds a parse tree from a page, which can then be used to extract data from HTML; this makes it useful for web scraping.

> See also the [Scrapy](https://scrapy.org/) library for Python

> For R, check:
> ```r
> library('XML')
> library('rvest')
> library('httr')
> ```

`BeautifulSoup` is part of the `bs4` package, which comes pre-installed with Anaconda. Run the cell below:

In [None]:
from bs4 import BeautifulSoup

**BeautifulSoup** supports several parsers. Python's built-in `html.parser` is available by default; `lxml`, which sometimes performs better, can be installed separately (`pip install lxml`).

So, check the following:

In [None]:
soup = BeautifulSoup(r.text, 'html.parser')

In [None]:
# Grab all links in list
soup.find_all('a')

In [None]:
# Grab specific links with documents
soup.find_all('a', {'class': 'link fileRef'})

In [None]:
# Extract only the text of those links
[i.text for i in soup.find_all('a', {'class': 'link fileRef'})]

In [None]:
# Extract a specific attribute (here, the link target)
[i['href'] for i in soup.find_all('a', {'class': 'link fileRef'})]

In [None]:
[i['href'] for i in soup.find_all('a', {'class': 'link fileRef'})][0]

---

Library **[re](https://docs.python.org/3/library/re.html)**

In [None]:
# Extract everything after the last slash (regular expressions)
import re
re.sub(r'.*/', '', [i['href'] for i in soup.find_all('a', {'class': 'link fileRef'})][0])
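
The greedy pattern `.*/` removes everything up to the last slash, but it keeps any query string that follows the file name. A small sketch comparing it with the standard library's URL tools is given here; the `href` value is a made-up example, not a link taken from the page.

```python
import posixpath
import re
from urllib.parse import urlparse

# Hypothetical link, used only for illustration
href = 'https://example.com/data/2020/thesis_rules.pdf?download=1'

# Regular expression: drop everything up to and including the last slash
by_regex = re.sub(r'.*/', '', href)

# URL-aware alternative: take the last component of the path part only
by_path = posixpath.basename(urlparse(href).path)

print(by_regex)
print(by_path)
```

For links without a query string both approaches give the same file name; they differ only when something follows the path.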

And for our local pages we can use the following:

In [None]:
with open('my.html') as f:
    soup1 = BeautifulSoup(f, 'html.parser')

In [None]:
with open('my2.html') as f:
    soup2 = BeautifulSoup(f, 'lxml')

---
Then check our variables

In [None]:
soup1

In [None]:
print(soup1.prettify())

In [None]:
soup1.html.head

In [None]:
soup1.html.body.p

In [None]:
soup2.html.body.table

Find all content inside row tags `<tr>`:

In [None]:
soup2.body.table.find_all('tr')

In [None]:
soup2.find_all('td')[0].string.strip()

In [None]:
rows = soup2.body.table.find_all('tr')
for i, row in enumerate(rows):
    print(i)
    print(row.td.string)

In [None]:
for row in rows:
    print(row.td.string.strip())

Load the table into a list of lists:

In [None]:
table = []
for row in rows:
    table.append([cell.string.strip() for cell in row.find_all('td')])
print(table)
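
Once the cells are in a list of lists, they can be saved without any extra packages. The sketch below writes the four cells of `my2.html` as CSV text to an in-memory buffer; the names `cells` and `buffer` are illustrative, and the values are typed in by hand so the example runs on its own.

```python
import csv
import io

# The same values that the loop above extracts from my2.html
cells = [['Cell 1', 'Cell 2'], ['Cell 3', 'Cell 4']]

# Write to an in-memory buffer; a real file opened with newline='' works the same way
buffer = io.StringIO()
csv.writer(buffer).writerows(cells)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the original list of lists
round_trip = list(csv.reader(io.StringIO(csv_text)))
print(round_trip)
```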

### Parse tag attributes:

In [None]:
soup2.html['lang']

In [None]:
soup2.html.head.style['type']

In [None]:
soup2.select('style')[0]['type']

Extract hyperlinks in tags `<a href = ....>` from the site:

In [None]:
soup('a', href = True)

Print the links themselves:

In [None]:
for link in soup('a', href = True, class_ = False):
    print(link['href'])
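
Links extracted this way are often relative to the page they came from. A minimal sketch of turning them into absolute URLs with the standard library is shown here; the three `href` values are hypothetical examples, while `base` is the page requested earlier.

```python
from urllib.parse import urljoin

base = 'https://spb.hse.ru/en/ma/finance/VKR'

# Hypothetical href values of the three common kinds
hrefs = [
    '/data/2020/file.pdf',      # relative to the site root
    'docs/a.pdf',               # relative to the current folder
    'https://www.hse.ru/en/',   # already absolute
]

absolute = [urljoin(base, h) for h in hrefs]
for url in absolute:
    print(url)
```

An absolute link passes through `urljoin` unchanged, so it is safe to apply the function to every extracted `href`.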

### Very Nice!
Now parse our timetable:

In [None]:
soup3 = BeautifulSoup(q.content, 'lxml')

In [None]:
soup3('div', {'class': 'scheduleItem__inner'})

In [None]:
soup3('table')

In [None]:
# Try This
page_en = 'https://www.hse.ru/api/timetable/lessons?fromdate=2020.11.16&todate=2020.11.21&groupoid=43082&receiverType=3'
tt2 = requests.get(page_en)
print(tt2.ok)

In [None]:
soup3_en = BeautifulSoup(tt2.content, 'lxml')

In [None]:
soup3_en

In [None]:
# The response is JSON text, which the lxml parser wraps in <html><body><p>
print(soup3_en.prettify())

In [None]:
ttext = soup3_en.select('p')[0].string

In [None]:
with open('ttext.json', 'w') as f:
    f.write(ttext)

In [None]:
import json
with open('ttext.json') as f:
    j = json.load(f)

In [None]:
j['Lessons']
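
Since the API answers with JSON, the detour through BeautifulSoup and a temporary file is optional. The sketch below parses a JSON string directly with `json.loads`. Only the `Lessons` key comes from the response above; the other field names and all values are made up for illustration.

```python
import json

# Hypothetical payload: the real response has its own fields and values
raw = (
    '{"Count": 2, "Lessons": ['
    '{"discipline": "ML for Finance", "beginLesson": "18:10"}, '
    '{"discipline": "Econometrics", "beginLesson": "19:40"}]}'
)

data = json.loads(raw)       # parse the string straight into Python objects
lessons = data['Lessons']    # a list of dicts, one per lesson

for lesson in lessons:
    print(lesson['discipline'], lesson['beginLesson'])
```

With `requests`, the same result should be available in one step as `tt2.json()`, which decodes the response body as JSON.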

In [None]:
import pandas as pd

In [None]:
pd.DataFrame(j['Lessons'])

---

### So, we can write to Excel
[example here](https://xlsxwriter.readthedocs.io/example_pandas_simple.html)

In [None]:
df = pd.DataFrame(j['Lessons'])
# The context manager saves and closes the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('my_schedule.xlsx', engine = 'xlsxwriter') as writer:
    df.to_excel(writer, index = False)

## Financial Data

In [None]:
moex = 'https://www.moex.com/en/index/IMOEX/archive/#/from=2019-09-20&till=2019-10-18&sort=TRADEDATE&order=desc'
m = requests.get(moex)
m.ok

In [None]:
s = BeautifulSoup(m.content, 'lxml')

In [None]:
s.findAll('table')

In [None]:
# Presumably no data rows: the table is filled in by JavaScript after the page loads
print(':(')
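
One likely reason why `requests` gets no table here is that the date filters sit after the `#` in the URL. That part, the fragment, is handled by the browser and is never sent to the server. A short standard-library sketch of what the server actually receives:

```python
from urllib.parse import parse_qs, urldefrag, urlsplit

moex = 'https://www.moex.com/en/index/IMOEX/archive/#/from=2019-09-20&till=2019-10-18&sort=TRADEDATE&order=desc'

parts = urlsplit(moex)
sent_to_server, fragment = urldefrag(moex)

print(sent_to_server)   # the URL without the fragment
print(parts.query)      # empty: there is no real query string
print(fragment)         # everything after '#', used only by in-page JavaScript

# The filters are readable, but only client-side code acts on them
filters = parse_qs(parts.fragment.lstrip('/'))
print(filters)
```

So the plain HTTP request returns the same page shell for any date range, and the rows appear only after the page's scripts run in a browser.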

## Try [Selenium](https://selenium-python.readthedocs.io/)

In [None]:
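# First put a chromedriver matching your Chrome version in the same folder
# https://selenium-python.readthedocs.io/installation.html#downloading-python-bindings-for-selenium
import os
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service
# Selenium 4 takes the driver path through a Service object
driver = Chrome(service=Service(executable_path=os.path.join(os.getcwd(), 'chromedriver')))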

In [None]:
# Get the page
driver.get(moex)

In [None]:
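# Click the element (Selenium 4 removed find_elements_by_xpath; use By.XPATH)
from selenium.webdriver.common.by import By
driver.find_elements(By.XPATH, "//a[@role = 'button' and @data-dismiss = 'modal']")[0].click()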

In [None]:
# You can store your webpage
m1 = driver.page_source

In [None]:
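# And collect data
tab = BeautifulSoup(m1, 'lxml').find('table', {'ng-table': 'tableParams'})
table = []  # start from an empty list; `table` still holds the cells of my2.html
for row in tab.find_all('tr'):
    cells = [cell.get_text(strip=True) for cell in row.find_all('td')]
    if cells:  # skip rows without <td> cells, such as header rows
        table.append(cells)
pd.DataFrame(table)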

In [None]:
# Or download the *.csv file
from selenium.webdriver.common.by import By
driver.find_elements(By.XPATH, "//a[@target = '_blank' and @ng-click = 'DownloadReport(item.prefix)']")[1].click()
driver.quit()

In [None]:
# Better: specify the download directory first
import os
import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Set download dir
chromeOptions = webdriver.ChromeOptions()
prefs = {"download.default_directory": os.getcwd()}
chromeOptions.add_experimental_option("prefs", prefs)
chromedriver = os.path.join(os.getcwd(), 'chromedriver')

# Open browser (Selenium 4 takes the driver path through a Service object)
driver = webdriver.Chrome(service=Service(executable_path=chromedriver), options=chromeOptions)
time.sleep(5)

# Get the page
driver.get(moex)
time.sleep(5)

# Click the element
driver.find_elements(By.XPATH, "//a[@role = 'button' and @data-dismiss = 'modal']")[0].click()
time.sleep(5)

# Or download the *.csv file
driver.find_elements(By.XPATH, "//a[@target = '_blank' and @ng-click = 'DownloadReport(item.prefix)']")[1].click()
time.sleep(5)  # give the download time to finish before closing

# Close
driver.quit()

In [None]:
# Find all csv files
files = os.listdir()
csv_files = [i for i in files if i.endswith('.csv')]
csv_files[0]

In [None]:
# Open
pd.read_csv(csv_files[0], delimiter = ';', skiprows = 1)

---
## Parse POST-requests
Run the following code with your LMS account (it targets the old LMS version):

In [None]:
import getpass

login = input('Enter your login (the part before @edu): ')
passw = getpass.getpass('Enter the password (it is stored only on your computer): ')

print('\nWait a second...\n')

post = {'user_login': str(login)+'@edu.hse.ru',
        'user_password': str(passw),
        'userLogin': '%D0%92%D0%BE%D0%B9%D1%82%D0%B8'  # URL-encoded "Войти" ("Log in")
}
url = 'https://lms.hse.ru/index.php?_qf__login_form='

session = requests.session()
r = session.post(url, data = post)
r2 = session.get('https://lms.hse.ru/?gb')

table = BeautifulSoup(r2.text, 'lxml')
out = table.find_all("tr", { "class" : "tabl_0 trhidden" })
# Print selected columns of every row, one row per line
for row in out:
    cells = row.find_all('td')
    print(' '.join(cells[i].get_text() for i in (0, 5, 6, 8, 9)))

# Overwrite the credentials so they do not stay in the notebook variables
login = ''
passw = ''
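
A note on the `userLogin` field: its value is already percent-encoded, and `requests` percent-encodes form data again when a dictionary is passed as `data`. The sketch below uses the standard library to show what the value decodes to and what double encoding looks like. Whether the LMS server is sensitive to this is not something the code above tells us, so treat it as a point to check rather than a confirmed bug.

```python
from urllib.parse import unquote, urlencode

encoded = '%D0%92%D0%BE%D0%B9%D1%82%D0%B8'

# Decode the percent-escapes back to text
label = unquote(encoded)
print(label)

# Encoding the plain text once reproduces the original escapes
once = urlencode({'userLogin': label})
print(once)

# Encoding the already-encoded string escapes every '%' as '%25'
twice = urlencode({'userLogin': encoded})
print(twice)
```

Passing the decoded text in the dictionary and letting the library encode it once is the usual way to avoid the second layer of escapes.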