# Mastering Applied Skills in Management, Analytics and Entrepreneurship

## DATA COLLECTION TECHNIQUES
## Part III. Web scraping intro

__NOTE:__ use this notebook with the `Data Science environment`.

### 1. Libraries

Let's start with a very basic example. We will need the [urllib](https://docs.python.org/3/library/urllib.html) library for opening and reading URLs. We will also use the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) Python library to parse HTML data.

In [None]:
import os
import re
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

### 2. Get text from HTML page

In [None]:
URL_2_SCRAP = 'https://gsom.spbu.ru/en/programmes/graduate/miba/'
print(URL_2_SCRAP)

In [None]:
request = Request(URL_2_SCRAP)

In [None]:
request

In [None]:
request.host

In [None]:
request.full_url

In [None]:
response = urlopen(request)

In [None]:
response.code

In [None]:
response.length

In [None]:
html = response.read()

In [None]:
html
print(html)
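
The request above assumes that the server answers without problems. As a minimal sketch (not part of the original notebook), the same steps can be wrapped in a helper that handles failed requests and decodes the bytes into a string. The function name `fetch_text`, the `User-Agent` value and the 10-second timeout are illustrative assumptions.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10):
    """Return the decoded body of `url`, or None if the request fails."""
    # Some servers reject requests without a browser-like User-Agent;
    # the value below is an arbitrary placeholder.
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0 (course-notebook)'})
    try:
        with urlopen(request, timeout=timeout) as response:
            raw = response.read()
            # Use the charset declared in the response headers,
            # falling back to UTF-8 when none is declared.
            charset = response.headers.get_content_charset() or 'utf-8'
    except HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        print(f'HTTP error {err.code} for {url}')
        return None
    except URLError as err:
        # The server could not be reached or the URL scheme is unsupported.
        print(f'Could not reach {url}: {err.reason}')
        return None
    return raw.decode(charset, errors='replace')
```

Calling `fetch_text(URL_2_SCRAP)` would then return the page as a `str`, or `None` if something went wrong.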

Time to [make a soup](https://beautiful-soup-4.readthedocs.io/en/latest/#making-the-soup) from our `html`. Soup will help us easily navigate inside the HTML structure:

In [None]:
soup = BeautifulSoup(html, 'html.parser')
soup

In [None]:
soup.contents

In [None]:
type(soup.contents)

In [None]:
soup.contents[0]

In [None]:
soup.contents[-1]

In [None]:
# keep only the text of the page, without the HTML tags
soup.text

In [None]:
text = soup.text
for ch in ['\n', '\t', '\r']:
    text = text.replace(ch, ' ')

In [None]:
' '.join(text.split())
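
Depending on the Beautiful Soup version, the text of a page may also contain the contents of `<script>` and `<style>` tags, which are not visible to a reader. The sketch below removes these tags explicitly before extracting the text. The helper name `visible_text` and the sample HTML are illustrative assumptions.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the page text with scripts, styles and extra whitespace removed."""
    soup = BeautifulSoup(html, 'html.parser')
    # Calling the soup like a function is a shortcut for find_all().
    for tag in soup(['script', 'style']):
        tag.decompose()  # remove the tag together with its contents
    text = soup.get_text(separator=' ', strip=True)
    # Collapse any remaining runs of whitespace into single spaces.
    return ' '.join(text.split())


sample_html = (
    '<html><head><style>p {color: red}</style></head>'
    '<body><p>Hello</p><script>var x = 1;</script><p>world</p></body></html>'
)
print(visible_text(sample_html))
```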

### 3. Simple NLP analysis

In [None]:
# keep only Latin and Cyrillic letters (ё/Ё lie outside the а-я/А-Я ranges), then lowercase
text = re.sub('[^а-яА-ЯёЁa-zA-Z]+', ' ', text).strip().lower()
text

In [None]:
text_as_list = text.split()
text_as_list[:5]
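
The cleaning and splitting steps can also be combined into one small function. This is only a sketch of an alternative approach: instead of replacing unwanted characters, `re.findall` collects every run of letters directly. The name `tokenize` is an illustrative assumption.

```python
import re


def tokenize(text):
    """Split text into lowercase words made of Latin or Cyrillic letters."""
    # The letter ё is listed separately because it is outside the а-я range.
    return re.findall(r'[a-zа-яё]+', text.lower())


print(tokenize('Hello, World! Привет, ёж 2024'))
```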

In [None]:
from collections import Counter

In [None]:
Counter(text_as_list)

In [None]:
# word frequencies
freqs = dict(Counter(text_as_list))

In [None]:
freqs

In [None]:
freqs = dict(
    sorted(
        freqs.items(), 
        key=lambda item: item[1], 
        reverse=True
    )
)
freqs
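
Sorting by frequency is common enough that `Counter` offers it directly through its `most_common` method. A short sketch on a toy sentence (the sentence is an illustrative assumption):

```python
from collections import Counter

words = 'the cat and the dog and the bird'.split()
counts = Counter(words)

# most_common(n) returns the n most frequent items as (word, count) pairs,
# already sorted from the most to the least frequent.
top_two = counts.most_common(2)
print(top_two)
```

In the notebook, `Counter(text_as_list).most_common(20)` would give the 20 most frequent words without a manual `sorted` call.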

In [None]:
import matplotlib.pyplot as plt

In [None]:
freqs_bar = {k: v for k, v in freqs.items() if v >= 10}
freqs_bar

In [None]:
plt.figure(figsize=(16, 6))
plt.bar(*zip(*freqs_bar.items()))
plt.xticks(rotation='vertical')
plt.show()

We can see that the most frequent words are mostly 'junk' or 'stop' words, which are useless for the analysis. As a crude filter, let's exclude every word that occurs more than 60 times:

In [None]:
freqs_bar = {k: v for k, v in freqs_bar.items() if v <= 60}
freqs_bar

In [None]:
plt.figure(figsize=(16, 6))
plt.bar(*zip(*freqs_bar.items()))
plt.xticks(rotation='vertical')
plt.show()
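
A frequency threshold depends on the page: on a shorter page the same stop words may occur fewer than 60 times. A possible alternative, sketched below, is to filter against an explicit stop-word list. The tiny `STOP_WORDS` set and the `min_length` parameter are illustrative assumptions, not a complete list.

```python
from collections import Counter

# A deliberately tiny, hand-written stop-word list for illustration only.
STOP_WORDS = {'the', 'a', 'an', 'and', 'of', 'in', 'to', 'for', 'is', 'on', 'with'}


def content_word_counts(words, stop_words=STOP_WORDS, min_length=3):
    """Count words that are not stop words and have at least min_length letters."""
    return Counter(
        word for word in words
        if word not in stop_words and len(word) >= min_length
    )


sample = 'the master of management and the analytics of management'.split()
print(content_word_counts(sample).most_common(3))
```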

## <font color='red'>INTERMEDIATE QUIZ #3-1</font>
Find any site you like and do the following:
1. Collect the HTML page of that site
2. Take the text from that page and process it
3. Draw a word frequency plot and draw a conclusion about the theme of the site (e.g. news, education, IT, etc.)
4. Find a few major drawbacks of this simple analysis based on the word-count approach

In [None]:
### YOUR CODE HERE ###

### 4. Save HTML to disk as raw data

#### 4.1. How to save files in Python

In [None]:
file = open('test_save_file_0.txt', 'w')
file.write('I am text and you write me into the file.')

# do not forget to close the file
file.close()

The [with](https://docs.python.org/3/reference/compound_stmts.html#with) statement will help us again: it closes the file automatically, even if an error occurs inside the block.

In [None]:
with open('test_save_file_1.txt', 'w') as file:
    file.write('I am another text but you write me into the file also.')

In [None]:
# mode 'a' appends to the end of the file instead of overwriting it
with open('test_save_file_1.txt', 'a') as file:
    file.write('I will be appended and this is the way.')

In [None]:
# start the appended text on a new line with '\n'
with open('test_save_file_1.txt', 'a') as file:
    file.write('\nI will be appended with the new string and this is the way too.')

#### 4.2. Saving MiBA page

In [None]:
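with open('miba_page.html', 'w', encoding='utf-8') as file:
    # html is bytes; decode it to str before writing in text mode
    file.write(html.decode('utf-8'))

In [None]:
# read the data back using the same encoding
with open('miba_page.html', 'r', encoding='utf-8') as file:
    text = file.read()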

In [None]:
text
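
Since the downloaded page is already a `bytes` object, another option is to write it in binary mode and decode it only when reading it back. The sketch below uses `pathlib` for that. The helper names `save_raw` and `load_text` are illustrative assumptions, and UTF-8 is assumed as the page encoding.

```python
from pathlib import Path


def save_raw(raw, path):
    """Write bytes to disk exactly as they were received."""
    path = Path(path)
    path.write_bytes(raw)
    return path


def load_text(path, encoding='utf-8'):
    """Read a saved page back as a string, decoding with the given encoding."""
    return Path(path).read_text(encoding=encoding)
```

For example, `save_raw(html, 'miba_page_raw.html')` would store the page unchanged, and `load_text('miba_page_raw.html')` would return it as text.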

### 5. More than text

In [None]:
soup

In [None]:
soup.find_all('img')

In [None]:
soup.find_all('img')[19]

We can search through the soup:

In [None]:
soup.find_all('img', attrs={'alt': 'Vladimir Andreevich Gorovoy'})

In [None]:
soup.find_all('img', attrs={'alt': re.compile(r".*Gorovoy")})

In [None]:
all_found = soup.find_all('img', attrs={'alt': re.compile(r".*Gorovoy")})
all_found

In [None]:
all_found[0]

In [None]:
all_found[0]['alt']

In [None]:
all_found[0]['class']

In [None]:
all_found[0]['src']
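
Indexing a tag with square brackets raises a `KeyError` when the attribute is missing, for instance for an image without `alt` text. A small sketch with the tag's `get` method, which returns a default value instead (the sample HTML is an illustrative assumption):

```python
from bs4 import BeautifulSoup

sample_html = '''
<div>
  <img src="/images/a.jpg" alt="First photo">
  <img src="/images/b.jpg">
</div>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Collect (alt, src) pairs; a missing alt becomes an empty string.
images = [
    (img.get('alt', ''), img.get('src'))
    for img in sample_soup.find_all('img')
]
print(images)
```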

In [None]:
URL_2_SCRAP = 'https://gsom.spbu.ru' + all_found[0]['src']
print(URL_2_SCRAP)
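
String concatenation works here because the `src` value starts with `/`. For relative or already absolute links, `urllib.parse.urljoin` from the standard library is a more general option. In the sketch below the image paths are hypothetical examples, not real files on the site.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://gsom.spbu.ru/en/programmes/graduate/miba/'


def absolute_url(src, base=PAGE_URL):
    """Resolve a link found on the page against the URL of that page."""
    return urljoin(base, src)


# Hypothetical src values illustrating the three common cases.
print(absolute_url('/upload/photo.jpg'))              # root-relative path
print(absolute_url('photo.jpg'))                      # path relative to the page
print(absolute_url('https://example.com/photo.jpg'))  # already absolute
```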

In [None]:
request = Request(URL_2_SCRAP)
response = urlopen(request)
img = response

In [None]:
img

To open and display the downloaded image we will use the [PIL library](https://pillow.readthedocs.io/en/stable/index.html) (Pillow) for Python.

In [None]:
from PIL import Image
import numpy as np

In [None]:
plt.figure(figsize=(12, 8))
img = Image.open(img)
plt.imshow(np.array(img))
plt.show()

In [None]:
img.save('Vladimir.jpg')