# Mastering Applied Skills in Management, Analytics and Entrepreneurship I

## DATA COLLECTION TECHNIQUES
## Part III. Web scraping intro

### 1. Libraries

Let's start with a very basic example. We will need the [urllib](https://docs.python.org/3/library/urllib.html) library to open and read URLs, and the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) Python library to parse HTML data.

In [None]:
import os
import re
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

### 2. Get text from an HTML page

In [None]:
URL_2_SCRAP = 'https://gsom.spbu.ru/en/programmes/graduate/miba/'

In [None]:
request = Request(URL_2_SCRAP)
response = urlopen(request)
html = response.read()
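
Some servers reject requests that carry urllib's default `User-Agent`, and the bytes returned by `response.read()` still have to be decoded before they can be handled as text. A minimal sketch of both steps, assuming the hypothetical helpers `build_request` and `decode_body` and an illustrative `User-Agent` string (none of these are part of the original notebook):

```python
from urllib.request import Request

# Hypothetical User-Agent value, chosen for illustration only.
USER_AGENT = 'Mozilla/5.0 (course-notebook example)'


def build_request(url):
    """Return a Request that sends a browser-like User-Agent header."""
    return Request(url, headers={'User-Agent': USER_AGENT})


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 if no charset is given."""
    return raw.decode(charset or 'utf-8', errors='replace')


# No network access happens here: the request object is only constructed.
demo_request = build_request('https://gsom.spbu.ru/en/programmes/graduate/miba/')
demo_text = decode_body('<p>MiBA</p>'.encode('utf-8'))

print(demo_request.get_header('User-agent'))
print(demo_text)
```

With a live response, the charset declared by the server (if any) can be read with `response.headers.get_content_charset()` and passed as the `charset` argument.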

In [None]:
html

In [None]:
soup = BeautifulSoup(html, 'html.parser')
soup

In [None]:
soup.contents

In [None]:
soup.text

In [None]:
text = soup.text
for ch in ['\n', '\t', '\r']:
    text = text.replace(ch, ' ')

In [None]:
' '.join(text.split())
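
`str.split()` called without arguments splits on any run of whitespace, including `\n`, `\t` and `\r`, so the join above normalizes the text on its own. Note that its result is only displayed, not assigned back to `text`. A small sketch on a made-up string (`sample` and `normalize_whitespace` are illustrative names, not part of the notebook):

```python
def normalize_whitespace(raw_text):
    """Collapse every run of whitespace into a single space and trim the ends."""
    return ' '.join(raw_text.split())


# Made-up input mixing newlines, tabs, carriage returns and repeated spaces.
sample = 'Master in\n\tBusiness \r\n Analytics  '
clean = normalize_whitespace(sample)

print(clean)  # Master in Business Analytics
```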

### 3. Simple NLP analysis

In [None]:
# lowercase the text and keep only Cyrillic and Latin letters
# (ё/Ё fall outside the а-я range, so they are listed explicitly)
text = re.sub('[^а-яА-ЯёЁa-zA-Z]+', ' ', text).strip().lower()
text

In [None]:
text_as_list = text.split()
text_as_list[:5]

In [None]:
from collections import Counter

In [None]:
Counter(text_as_list)

In [None]:
freqs = dict(Counter(text_as_list))

In [None]:
freqs

In [None]:
freqs = dict(
    sorted(
        freqs.items(), 
        key=lambda item: item[1], 
        reverse=True
    )
)
freqs
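
`Counter` can also return its entries already ordered by frequency through `most_common`, which avoids the manual sort. A minimal sketch on a made-up token list (the tokens are illustrative, not scraped data):

```python
from collections import Counter

# Illustrative tokens standing in for text_as_list.
tokens = ['management', 'data', 'management', 'analytics', 'data', 'management']

# The two most frequent words as (word, count) pairs.
top_words = Counter(tokens).most_common(2)
print(top_words)  # [('management', 3), ('data', 2)]

# Calling most_common() without an argument returns every word, sorted by count.
freqs_sorted = dict(Counter(tokens).most_common())
print(freqs_sorted)
```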

In [None]:
import matplotlib.pyplot as plt

In [None]:
freqs_bar = {k: v for k, v in freqs.items() if v >= 10}

In [None]:
plt.figure(figsize=(16, 6))
plt.bar(*zip(*freqs_bar.items()))
plt.xticks(rotation='vertical')
plt.show()

## <font color='red'>INTERMEDIATE QUIZ</font>
Pick any site you like and do the following:
1. Collect the HTML page of that site
2. Extract the text from the page and process it
3. Draw a word frequency plot and draw a conclusion about the theme of the site (e.g. news, education, IT, etc.)
4. Identify a few major drawbacks of this simple word-count analysis

### 4. Save HTML to disk as raw data

In [None]:
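# use the same explicit encoding for writing and reading,
# so the result does not depend on the platform default
with open('miba_page.html', 'w', encoding='utf-8') as file:
    file.write(html.decode('utf-8'))

In [None]:
with open('miba_page.html', 'r', encoding='utf-8') as file:
    text = file.read()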
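
An alternative is to write the bytes exactly as they were received, which sidesteps any mismatch between the page encoding and the platform's default text encoding. A sketch using `pathlib`, with a made-up byte string and a temporary directory standing in for the real page and file path:

```python
import tempfile
from pathlib import Path

# Made-up page content standing in for the bytes returned by response.read().
raw_html = '<html><body>Программа MiBA</body></html>'.encode('utf-8')

with tempfile.TemporaryDirectory() as tmp_dir:
    page_path = Path(tmp_dir) / 'page.html'
    page_path.write_bytes(raw_html)      # store the raw bytes unchanged
    restored = page_path.read_bytes()    # read them back as bytes
    decoded = restored.decode('utf-8')   # decode only when text is needed

print(restored == raw_html)
```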

In [None]:
text

### 5. More than text

In [None]:
soup

In [None]:
soup.find_all('img')

In [None]:
soup.find_all('img')[19]

In [None]:
soup.find_all('img', attrs={'alt': 'Vladimir Andreevich Gorovoy'})

In [None]:
soup.find_all('img', attrs={'alt': re.compile(r".*Gorovoy")})

In [None]:
soup.find_all('img')[20]['src']

In [None]:
URL_2_SCRAP = 'https://gsom.spbu.ru' + soup.find_all('img')[20]['src']
print(URL_2_SCRAP)
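
String concatenation works here because the `src` value is relative to the site root. `urllib.parse.urljoin` handles that case as well as page-relative and already-absolute links. A short sketch, where the image paths are hypothetical examples rather than real files on the site:

```python
from urllib.parse import urljoin

page_url = 'https://gsom.spbu.ru/en/programmes/graduate/miba/'

# Root-relative path: resolved against the site root.
root_relative = urljoin(page_url, '/upload/photo.jpg')
# Page-relative path: resolved against the directory of the page.
page_relative = urljoin(page_url, 'photo.jpg')
# Absolute URL: returned unchanged.
absolute = urljoin(page_url, 'https://example.com/photo.jpg')

print(root_relative)
print(page_relative)
print(absolute)
```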

In [None]:
request = Request(URL_2_SCRAP)
response = urlopen(request)
img = response

In [None]:
img

In [None]:
from PIL import Image
import numpy as np

In [None]:
plt.figure(figsize=(12, 8))
img = Image.open(img)
plt.imshow(np.array(img))
plt.show()

In [None]:
img.save('Vladimir.jpg')