# Scraping and Parsing Sites

![Harry Potter Parsing](images/parser_tongue.png)

Many data analysis projects require gathering and processing data from web pages. The following code examples introduce the basic tools for scraping and parsing data from websites.

## Import libraries

In [None]:
import os
import re
import nltk
import string
import pymorphy2
import matplotlib.pyplot as plt
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from nltk import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from pymorphy2 import tokenizers
from wordcloud import WordCloud

nltk.download('stopwords')
nltk.download('punkt')
MORPH = pymorphy2.MorphAnalyzer()

## Get text from MiBA's page

Let's get the text from the [Master in Business Analytics and Big Data (MiBA)](https://gsom.spbu.ru/en/programmes/graduate/miba/) web page. We use the standard `urllib` library to download the `html` data from the page and the [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) library to parse the `html`:

In [None]:
URL_2_SCRAP = 'https://gsom.spbu.ru/en/programmes/graduate/miba/'

In [None]:
request = Request(URL_2_SCRAP)
response = urlopen(request)
html = response.read()
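
Some servers reject requests that carry the default `urllib` user agent, and a request without a timeout can hang. The sketch below shows one way to set both. The names `USER_AGENT`, `build_request` and `fetch_html` are hypothetical helpers introduced for illustration, and whether this particular site needs the header is an assumption, not something verified here.

```python
from urllib.request import Request, urlopen

# Hypothetical identifier string; replace it with one that describes your script.
USER_AGENT = 'Mozilla/5.0 (compatible; course-notebook-example)'


def build_request(url, user_agent=USER_AGENT):
    """Create a Request object that carries an explicit User-Agent header."""
    return Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, timeout=10):
    """Download the raw bytes of a page; raises URLError on network problems."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network yet.
request = build_request('https://gsom.spbu.ru/en/programmes/graduate/miba/')
print(request.get_header('User-agent'))
```

Calling `fetch_html(URL_2_SCRAP)` would then return the same kind of bytes object as `response.read()`.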

In [None]:
def get_text(html):
    soup = BeautifulSoup(html, 'html.parser')
    for script in soup(['script', 'style']):
        script.extract()
    page_text = soup.get_text()
    for ch in ['\n', '\t', '\r']:
        page_text = page_text.replace(ch, ' ')
    return ' '.join(page_text.split())
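
To make the idea behind `get_text` easier to follow, here is a dependency-free sketch of the same steps using the standard library's `html.parser` instead of Beautiful Soup: skip `script` and `style` content, collect the remaining text, and collapse whitespace. `VisibleTextParser`, `visible_text` and `sample_html` are hypothetical names, and this is a simplified illustration rather than a replacement for Beautiful Soup.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that is not inside <script> or <style> elements."""

    SKIPPED_TAGS = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return ' '.join(' '.join(parser.chunks).split())


sample_html = (
    '<html><head><style>p {color: red}</style></head>'
    '<body><h1>MiBA</h1>\n<p>Big\tData</p>'
    '<script>var x = 1;</script></body></html>'
)
print(visible_text(sample_html))  # MiBA Big Data
```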

In [None]:
text_from_page = get_text(html)
print('sample of text:', text_from_page[:100])

Save parsed text to a file:

In [None]:
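os.makedirs('./data', exist_ok=True)  # make sure the target folder exists
with open('./data/miba_page.txt', 'w', encoding='utf-8') as file:
    file.write(text_from_page)

## Text preprocessing

Read the text back from the file and do basic preprocessing:

In [None]:
with open('./data/miba_page.txt', 'r', encoding='utf-8') as file:
    text = file.read()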
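
The default file encoding depends on the platform, so text with non-ASCII characters is safest when written and read with the same explicit encoding. A minimal round-trip sketch follows; `save_text`, `load_text` and the temporary sample path are hypothetical names used only for this illustration.

```python
import os
import tempfile


def save_text(path, text):
    """Write text as UTF-8, creating the parent folder if it is missing."""
    os.makedirs(os.path.dirname(path) or '.', exist_ok=True)
    with open(path, 'w', encoding='utf-8') as file:
        file.write(text)


def load_text(path):
    """Read a UTF-8 text file back into a string."""
    with open(path, 'r', encoding='utf-8') as file:
        return file.read()


# Use a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'data', 'sample_page.txt')
    save_text(sample_path, 'Программа MiBA: Big Data')
    restored = load_text(sample_path)

print(restored)
```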

In [None]:
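def preprocessing(text):
    # Lowercase and map 'ё' to 'е' first: 'ё' lies outside the 'а-я' range,
    # so the character filter below would otherwise delete it.
    result = text.lower().replace('ё', 'е')
    # Replace every run of non-letter characters (whitespace included) with one space.
    return re.sub('[^а-яa-z]+', ' ', result).strip()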
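
The order of the steps matters here because the letter `ё` sits outside the Unicode range `а-я`. The sketch below compares the two orders on a made-up sample string; `strip_then_replace` and `replace_then_strip` are hypothetical names for this illustration.

```python
import re


def strip_then_replace(text):
    """Filter characters first: 'ё'/'Ё' are removed before they can be mapped."""
    result = re.sub('[^а-яА-Яa-zA-Z]+', ' ', text).strip().lower()
    return result.replace('ё', 'е')


def replace_then_strip(text):
    """Map 'ё' to 'е' first, then filter everything that is not a letter."""
    result = text.lower().replace('ё', 'е')
    return re.sub('[^а-яa-z]+', ' ', result).strip()


sample = 'Ёлка, Big-Data 2024!'
print(strip_then_replace(sample))  # лка big data
print(replace_then_strip(sample))  # елка big data
```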

In [None]:
text = preprocessing(text)
print('total symbols:', len(text))
print('sample of text:', text[2200:2500])

## More processing

Convert the words in the text to their [dictionary form](https://en.wikipedia.org/wiki/Lemmatisation) and drop function words (interjections, particles, conjunctions, and prepositions):

In [None]:
def advprocessing(text):
    # Parts of speech to drop: interjections, particles, conjunctions, prepositions.
    function_words = {'INTJ', 'PRCL', 'CONJ', 'PREP'}
    # Use the first parse that the analyzer returns for every word.
    parsed_words = [MORPH.parse(word)[0] for word in text.split()]
    result = [word.normal_form for word in parsed_words
              if word.tag.POS not in function_words]
    return result, ' '.join(result)
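
The filtering logic can be tried without installing a morphological analyzer. In the sketch below a tiny hand-made lookup table stands in for `MORPH.parse`; `TOY_LEXICON` and `toy_lemmatize` are hypothetical, and the table entries are illustrative assumptions, not output produced by `pymorphy2`.

```python
# Hypothetical toy lexicon standing in for the analyzer:
# word -> (part-of-speech tag, dictionary form).
TOY_LEXICON = {
    'студенты': ('NOUN', 'студент'),
    'программы': ('NOUN', 'программа'),
    'учатся': ('VERB', 'учиться'),
    'и': ('CONJ', 'и'),
    'в': ('PREP', 'в'),
}

# Same set of function-word tags as in the notebook.
FUNCTION_WORDS = {'INTJ', 'PRCL', 'CONJ', 'PREP'}


def toy_lemmatize(text):
    """Return dictionary forms of the words, skipping function words."""
    lemmas = []
    for word in text.split():
        # Unknown words keep their surface form and get no tag.
        pos, normal_form = TOY_LEXICON.get(word, (None, word))
        if pos not in FUNCTION_WORDS:
            lemmas.append(normal_form)
    return lemmas


print(toy_lemmatize('студенты и программы'))  # ['студент', 'программа']
```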

In [None]:
text_tokens, text = advprocessing(text)
print('total symbols:', len(text))
print('total words:', len(text_tokens))
print('sample of text:', text[2200:2500])
print('sample of text tokens:', text_tokens[:50])

## Some visualizations

Prepare and display some diagrams:

In [None]:
freq_dist = FreqDist(text_tokens)
freq_dist

In [None]:
print('most common 10 words:', freq_dist.most_common(10))
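
`FreqDist` is built on top of the standard library's `collections.Counter`, so the counting step can be sketched with `Counter` alone. The token list below is made up for illustration.

```python
from collections import Counter

# Made-up tokens; in the notebook these would come from text_tokens.
tokens = ['data', 'business', 'data', 'analytics', 'data', 'business']

counts = Counter(tokens)
top_two = counts.most_common(2)
print(top_two)  # [('data', 3), ('business', 2)]
```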

In [None]:
plt.figure(figsize=(16, 8))
plt.title('50 most frequent words in text')
freq_dist.plot(50, cumulative=False)
plt.show()

In [None]:
wordcloud = WordCloud(background_color='white').generate(text)

In [None]:
plt.figure(figsize=(16, 8))
plt.axis('off')
plt.imshow(wordcloud, interpolation='bilinear')
plt.show()