# Mastering Applied Skills in Management, Analytics and Entrepreneurship

## DATA COLLECTION TECHNIQUES
## Part IV. Web scraping deeper dive

__NOTE:__ use this notebook with `Data Science environment`.

### 1. Libraries

In [None]:
# some basic libraries
import os
import re
import json
import socket
from random import randint, uniform
# for sending requests
from urllib.request import (
    Request, 
    urlopen, 
    URLError, 
    HTTPError, 
    ProxyHandler, 
    build_opener, 
    install_opener)
# to parse HTML data
from bs4 import BeautifulSoup
# for time delay while scraping
from time import sleep, gmtime, strftime
from tqdm.notebook import tqdm
from urllib.parse import quote, unquote
# to work with the data
import pandas as pd

### 2. Tools and hints for requests

Here is the site we would like to parse:

In [None]:
url_to_parce = 'https://realpython.github.io/fake-jobs/'
print(url_to_parce)

We can use a simple approach:

In [None]:
request = Request(url_to_parce)
request

...but it is good practice to emulate human behaviour when parsing sites. Many sites block plain vanilla requests like this one, so let's make our parsing code look more human:
- add a [User Agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent), a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting source
- add a random delay between requests, which matters when we send many requests in a loop
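
A minimal sketch of how the header ends up on the request object, without sending anything over the network. The `demo_user_agent` value is a shortened placeholder, not a real browser string. Note that `urllib` normalises header names with `str.capitalize()`, so the stored key is `User-agent`.

```python
from urllib.request import Request

# placeholder value for illustration only
demo_user_agent = 'Mozilla/5.0 (example placeholder)'

demo_request = Request('https://realpython.github.io/fake-jobs/')
demo_request.add_header('User-Agent', demo_user_agent)

# urllib stores the header name capitalized as 'User-agent'
print(demo_request.has_header('User-agent'))
print(demo_request.get_header('User-agent'))
print(demo_request.header_items())
```

Checking the request this way is a cheap sanity test before running a long scraping loop.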

In [None]:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'
MIN_TIME_SLEEP = .1
MAX_TIME_SLEEP = .5
MAX_COUNTS = 2
TIMEOUT = 5

In [None]:
def get_content_lite(url_page, timeout):
    """
    Loads page's content by URL address.
    
    Keyword arguments:
      url_page: URL address of the page to be downloaded
      timeout: timeout for urlopen function
    
    MIN_TIME_SLEEP, MAX_TIME_SLEEP are used 
    for random sleep between requests. 
    
    """
    # sleep for a while so as not to overload the site
    sleep(uniform(MIN_TIME_SLEEP, MAX_TIME_SLEEP))
    # make a request
    request = Request(url_page)
    request.add_header('User-Agent', USER_AGENT)
    # get the response
    response = urlopen(request, timeout=timeout)
    content = response.read()
    return content

In [None]:
html = get_content_lite(url_to_parce, timeout=TIMEOUT)
soup = BeautifulSoup(html, 'html.parser')

In [None]:
soup.name

### 3. How to work with soup, examples

In [None]:
soup

In [None]:
soup.text

In [None]:
soup.contents

An example of how to search:

In [None]:
soup.find('meta')

In [None]:
soup.find_all('meta')

#### 3.1. Find an element

To identify the elements to search for, it is a good idea to open the [desired URL](https://realpython.github.io/fake-jobs/) and switch on `Developer mode` with the `F12` key. Then right-click the element and select `Inspect` (the exact label depends on the browser).

In [None]:
# start from the main page header
# you can `Copy element` from `Developer mode`
# <h1 class="title is-1">
#   Fake Python
# </h1>
soup.find(
    'h1',               # filter on a tag name
    class_='title is-1' # filter on attribute values, NOTE: `class_` instead of `class`
)

In [None]:
soup.find(
    'h1',
    {'class': 'title is-1'} # dictionary style also works
)
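
How the `class` filter matches is worth a quick check. The sketch below uses a one-line stand-in for the page header (`sample_html` is an assumption, not the full site) and compares four ways of searching by class.

```python
from bs4 import BeautifulSoup

# small stand-in for the page header, not the full site
sample_html = '<h1 class="title is-1">Fake Python</h1>'
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# exact string value of the class attribute
exact = sample_soup.find('h1', class_='title is-1')
# a single class is enough to match
single = sample_soup.find('h1', class_='title')
# the same classes in another order do not match as a string
reordered = sample_soup.find('h1', class_='is-1 title')
# a CSS selector matches regardless of class order
by_css = sample_soup.select_one('h1.is-1.title')

print(exact, single, reordered, by_css, sep='\n')
```

If the order of classes on a site is not stable, the CSS selector form is the safer choice.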

We found the element; now extract what we want, e.g. its text or attributes:

In [None]:
elem = soup.find('h1', {'class': 'title is-1'})

In [None]:
elem

In [None]:
elem.attrs

In [None]:
elem.contents

In [None]:
elem.get_text()

In [None]:
elem.text
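
The two calls above return the text together with the whitespace from the HTML source. A minimal sketch of the difference between `.text` and `get_text(strip=True)`, using a stand-in header (`sample_html` is an assumption):

```python
from bs4 import BeautifulSoup

sample_html = '<h1 class="title is-1">\n  Fake Python\n</h1>'
sample_elem = BeautifulSoup(sample_html, 'html.parser').find('h1')

# keeps the newlines and indentation from the source
raw_text = sample_elem.text
# strips whitespace around each piece of text
clean_text = sample_elem.get_text(strip=True)

print(repr(raw_text))
print(repr(clean_text))
```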

In [None]:
# try the other sub-header
# <p class="subtitle is-3">
#   Fake Jobs for Your Web Scraping Journey
# </p>
soup.find('p', {'class': 'subtitle is-3'}).text

In [None]:
# find the first `p` tag of any class
soup.find('p').text

#### 3.2. Find many elements

It is possible to find all elements that satisfy the search conditions in one search:

In [None]:
# use `find_all` instead of `find`
soup.find_all('p')

In [None]:
all_p_elements = soup.find_all('p')
type(all_p_elements)

In [None]:
all_p_elements[0]

In [None]:
all_p_elements[0].text

__TIP:__ the built-in function `enumerate` helps to number the results. See [enumerate](https://docs.python.org/3/library/functions.html#enumerate) in the Python documentation.

In [None]:
for i, p_element in enumerate(all_p_elements):
    print(
        '`p` element number', i, '->',
        p_element.text.strip()
    )

#### 3.3. Few more steps

Now we will do something more useful for our data collection task: collect all job descriptions from the page.

In [None]:
# again, copy element from `Developer mode`
# <h2 class="title is-5">Senior Python Developer</h2>
soup.find('h2', class_='title is-5')

In [None]:
# all job descriptions
soup.find_all('h2', class_='title is-5')

In [None]:
all_jobs = soup.find_all('h2', class_='title is-5')
all_jobs[0]

But these are only the headers of the descriptions. Can we get the whole card for a job description?

In [None]:
# <div class="card-content">
#   ...here is all we need...
# </div>
soup.find('div', class_='card-content')

In [None]:
one_card = soup.find('div', class_='card-content')

In [None]:
one_card.contents

In [None]:
one_card.find('h2')

In [None]:
one_card.find('h3')

In [None]:
one_card.find('p', class_='location')

In [None]:
one_card.find('a', class_='card-footer-item')

In [None]:
# extract URL from element
one_card.find('a', class_='card-footer-item')['href']
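
Square brackets raise `KeyError` when the attribute is missing, while `.get` returns `None` or a default. A minimal sketch with two stand-in links (the second one, without `href`, is made up for illustration):

```python
from bs4 import BeautifulSoup

sample_html = (
    '<a class="card-footer-item" href="https://www.realpython.com">Learn</a>'
    '<a class="card-footer-item">No link</a>'
)
with_href, without_href = BeautifulSoup(sample_html, 'html.parser').find_all('a')

print(with_href['href'])
# `.get` does not raise if the attribute is absent
print(without_href.get('href'))
print(repr(without_href.get('href', '')))
```

In a long scraping loop `.get` avoids stopping on the first malformed element.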

### 4. Let's cook our soup

We can use `Developer mode` on the site or simply search with `CTRL+F` inside the soup printed in the Jupyter notebook.

#### Step 1. Collect all the cards with job descriptions

In [None]:
# use `find_all` function
all_cards = soup.find_all('div', class_='card-content')
type(all_cards)

In [None]:
print(
    'total number of cards:',
    len(all_cards)
)

##### Create data structure for one card

In [None]:
# one sample card
all_cards[0]

Now we will extract the data from a single card and store it in an easy-to-handle data structure. Let it be a Python dictionary.

In [None]:
card_data = {}

In [None]:
# let's see what we can extract
all_cards[0].contents

In [None]:
card_data['job_description'] = all_cards[0].find('h2', class_='title is-5').text
print(card_data)

In [None]:
card_data['company'] = all_cards[0].find('h3', class_='subtitle is-6 company').text
print(card_data)

In [None]:
card_data['location'] = all_cards[0].find('p', class_='location').text
print(card_data)

In [None]:
# better: strip the surrounding whitespace
card_data['location'] = all_cards[0].find('p', class_='location').text.strip()
print(card_data)

Converting a string to a date requires a bit more skill and the [datetime](https://docs.python.org/3/library/datetime.html) library.

In [None]:
all_cards[0].find('time').get('datetime')

In [None]:
import datetime

datetime.datetime.strptime('2021-04-08', '%Y-%m-%d').date()

In [None]:
card_data['publish_time'] = datetime.datetime.strptime(
    all_cards[0].find('time').get('datetime'), '%Y-%m-%d'
).date()
print(card_data)
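
For dates that are already in ISO format (`YYYY-MM-DD`), as the `datetime` attribute is here, `datetime.date.fromisoformat` (Python 3.7+) is a shorter alternative to `strptime`. A minimal sketch with the same sample value:

```python
import datetime

raw_date = '2021-04-08'

# explicit format string, works for any layout
via_strptime = datetime.datetime.strptime(raw_date, '%Y-%m-%d').date()
# shortcut for ISO formatted dates
via_isoformat = datetime.date.fromisoformat(raw_date)

print(via_strptime, via_isoformat, via_strptime == via_isoformat)
```

Both calls raise `ValueError` on a malformed string, so the choice is mostly a matter of readability.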

In [None]:
#<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>

In [None]:
# extracting the URL requires `find_all`
# because there are two links in one card
all_cards[0].find_all('a', class_='card-footer-item')

In [None]:
# use list comprehension
[x['href'] for x in all_cards[0].find_all('a', class_='card-footer-item')]

In [None]:
# we need only the second url
all_cards[0].find_all('a', class_='card-footer-item')[1]['href']

In [None]:
card_data['url_details'] = all_cards[0].find_all('a', class_='card-footer-item')[1]['href']
print(card_data)
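
Taking the link by position breaks if the order of the links changes. A possible alternative, sketched on a simplified stand-in for the card footer (the `Learn` link address is an assumption), is to select the link by its text with the `string` argument:

```python
from bs4 import BeautifulSoup

# simplified stand-in for the footer of one card
sample_footer = BeautifulSoup('''
<footer class="card-footer">
  <a href="https://www.realpython.com" class="card-footer-item">Learn</a>
  <a href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" class="card-footer-item">Apply</a>
</footer>
''', 'html.parser')

# match the tag whose text is exactly 'Apply'
apply_link = sample_footer.find('a', class_='card-footer-item', string='Apply')
print(apply_link['href'])
```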

##### Run the loop for all cards

In [None]:
all_cards_list = []

In [None]:
# add a progress bar with the `tqdm` library
from tqdm.auto import tqdm

for card in tqdm(all_cards):
    card_data = {}
    
    card_data['job_description'] = card.find('h2', class_='title is-5').text
    card_data['company'] = card.find('h3', class_='subtitle is-6 company').text
    card_data['location'] = card.find('p', class_='location').text.strip()
    card_data['publish_time'] = datetime.datetime.strptime(
        card.find('time').get('datetime'), '%Y-%m-%d'
    ).date()
    card_data['url_details'] = card.find_all('a', class_='card-footer-item')[1]['href']
    
    all_cards_list.append(card_data)

In [None]:
len(all_cards_list)

In [None]:
# sample of data
all_cards_list[-1]
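
The body of the loop can also live in a function, which makes it easier to test on one card and to survive cards with missing fields. This is only a sketch: `text_or_none` and `parse_card` are hypothetical helpers, and the sample card is a simplified stand-in with a made-up company name.

```python
import datetime
from bs4 import BeautifulSoup

def text_or_none(parent, name, **kwargs):
    """Return stripped text of the first matching tag, or None if absent."""
    found = parent.find(name, **kwargs)
    return found.get_text(strip=True) if found is not None else None

def parse_card(card):
    """Extract the main fields from one card into a dictionary."""
    time_tag = card.find('time')
    return {
        'job_description': text_or_none(card, 'h2', class_='title is-5'),
        'company': text_or_none(card, 'h3', class_='subtitle is-6 company'),
        'location': text_or_none(card, 'p', class_='location'),
        'publish_time': (
            datetime.date.fromisoformat(time_tag['datetime'])
            if time_tag is not None else None
        ),
    }

# simplified stand-in for one card, the company name is made up
sample_card = BeautifulSoup('''
<div class="card-content">
  <h2 class="title is-5">Senior Python Developer</h2>
  <h3 class="subtitle is-6 company">Example Company</h3>
  <p class="location">
    Stewartbury, AA
  </p>
  <time datetime="2021-04-08">2021-04-08</time>
</div>
''', 'html.parser').find('div', class_='card-content')

record = parse_card(sample_card)
print(record)
```

With such a function the collection loop reduces to `[parse_card(card) for card in all_cards]`.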

#### Step 2. Collect detailed descriptions

We can get more data if we parse the page behind `url_details`.

In [None]:
url_details = all_cards_list[0]['url_details']
print(url_details)

In [None]:
html = get_content_lite(url_details, timeout=TIMEOUT)
soup = BeautifulSoup(html, 'html.parser')

In [None]:
# with the help of `Developer mode`
# we can find the required data
# <div class="content">
#     <p>Professional asset web application environmentally...</p>
#     <p id="location"><strong>Location:</strong> Stewartbury, AA</p>
#     <p id="date"><strong>Posted:</strong> 2021-04-08</p>
# </div>

soup.find('div', class_='content')

In [None]:
# we need only description text
# which is in the first `p` tag
# so just use `find` again
soup.find('div', class_='content').find('p')

In [None]:
text = soup.find('div', class_='content').find('p').text
text

#### Step 3. Combine all steps into final data collection loop

In [None]:
all_cards_list = []

for card in tqdm(all_cards):
    card_data = {}
    
    # block for main data
    card_data['job_description'] = card.find('h2', class_='title is-5').text
    card_data['company'] = card.find('h3', class_='subtitle is-6 company').text
    card_data['location'] = card.find('p', class_='location').text.strip()
    card_data['publish_time'] = datetime.datetime.strptime(
        card.find('time').get('datetime'), '%Y-%m-%d'
    ).date()
    card_data['url_details'] = card.find_all('a', class_='card-footer-item')[1]['href']
    
    # block for detailed data
    # here we parse site pages in a loop,
    # so a random time delay is a good idea
    url_details = card_data['url_details']
    html = get_content_lite(url_details, timeout=TIMEOUT)
    soup = BeautifulSoup(html, 'html.parser')
    card_data['text'] = soup.find('div', class_='content').find('p').text
    
    all_cards_list.append(card_data)

In [None]:
len(all_cards_list)

In [None]:
all_cards_list[0]

In [None]:
# convert data to dataframe
# if necessary for analysis
df = pd.DataFrame(all_cards_list)
print(df.shape)
df.head()
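
The collected records can also be stored as JSON. One thing to keep in mind is that `datetime.date` objects are not JSON serializable by default. A minimal sketch, assuming records shaped like the ones above (`sample_records` is made up for illustration):

```python
import datetime
import json

sample_records = [
    {
        'job_description': 'Senior Python Developer',
        'publish_time': datetime.date(2021, 4, 8),
    },
]

# `default=str` is called for objects json cannot serialize, here the date
as_json = json.dumps(sample_records, default=str, ensure_ascii=False, indent=2)
print(as_json)

# after loading, the date comes back as a plain string
restored = json.loads(as_json)
print(restored[0]['publish_time'])
```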

### 5. Hints and tips for parsing sites

Hints for data requests:
1. Proxies
2. Exception handling
3. Retry strategy (unlimited or a fixed number of attempts)
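
The retry strategy from the list above can be sketched separately from any network code. This is only an illustration: `retry`, `flaky` and the parameter names are hypothetical, and the growing random delay is one possible choice among many. `OSError` is used as the default because `URLError` and `socket.timeout` are both its subclasses.

```python
from random import uniform
from time import sleep

def retry(func, max_counts=3, min_sleep=0.1, max_sleep=0.5, errors=(OSError,)):
    """Call `func()` up to `max_counts` times, sleeping longer after each failure."""
    for attempt in range(1, max_counts + 1):
        try:
            return func()
        except errors as e:
            print('attempt', attempt, 'failed:', e)
            if attempt < max_counts:
                # the delay grows with the number of failed attempts
                sleep(uniform(attempt * min_sleep, attempt * max_sleep))
    return None

# hypothetical flaky function: fails twice, then succeeds
calls = {'n': 0}

def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise OSError('temporary failure')
    return 'ok'

result = retry(flaky, min_sleep=0.01, max_sleep=0.02)
print(result, calls['n'])
```

A fixed number of attempts keeps the loop from hanging forever on a dead URL; an unlimited strategy would replace the `for` loop with `while True`.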

In [None]:
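import ssl
from urllib.request import HTTPSHandler

def get_content(url_page, timeout, proxies=None, file=False):
    """
    Loads page's content by URL address with a limited number of attempts.

    Keyword arguments:
      url_page: URL address of the page to be downloaded
      timeout: timeout for the request
      proxies: dictionary mapping protocol names to proxy URLs
      file: return raw bytes if True, decoded text otherwise

    Returns None if all MAX_COUNTS attempts fail.
    """
    counts = 0
    content = None
    while counts < MAX_COUNTS:
        try:
            request = Request(url_page)
            request.add_header('User-Agent', USER_AGENT)
            if proxies:
                # NOTE: certificate verification is switched off here
                context = ssl._create_unverified_context()
                # passing `context` to urlopen would bypass an installed opener,
                # so the proxy and the context go into one opener
                opener = build_opener(ProxyHandler(proxies), HTTPSHandler(context=context))
                response = opener.open(request, timeout=timeout)
            else:
                response = urlopen(request, timeout=timeout)
            if file:
                content = response.read()
            else:
                # fall back to UTF-8 if the server does not declare a charset
                charset = response.headers.get_content_charset() or 'utf-8'
                try:
                    content = response.read().decode(charset)
                except (UnicodeDecodeError, LookupError):
                    content = None
            break
        except HTTPError as e:
            # HTTPError is a subclass of URLError, so it must go first
            counts += 1
            print('HTTPError | ', url_page, ' | ', e, ' | counts: ', counts)
            sleep(uniform(counts * MIN_TIME_SLEEP, counts * MAX_TIME_SLEEP))
        except URLError as e:
            counts += 1
            print('URLError | ', url_page, ' | ', e, ' | counts: ', counts)
            sleep(uniform(counts * MIN_TIME_SLEEP, counts * MAX_TIME_SLEEP))
        except socket.timeout as e:
            counts += 1
            print('socket timeout | ', url_page, ' | ', e, ' | counts: ', counts)
            sleep(uniform(counts * MIN_TIME_SLEEP, counts * MAX_TIME_SLEEP))
    return content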
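
`HTTPError` is a subclass of `URLError`, so an `except URLError` clause placed first also catches HTTP errors. A small offline check of the hierarchy; `classify` is a hypothetical helper and the errors are constructed by hand instead of coming from real requests:

```python
import socket
from urllib.error import HTTPError, URLError

print(issubclass(HTTPError, URLError))
print(issubclass(URLError, OSError))
print(issubclass(socket.timeout, OSError))

def classify(error):
    """Show which `except` clause handles the given error."""
    try:
        raise error
    except HTTPError:
        # the most specific class goes first
        return 'HTTPError'
    except URLError:
        return 'URLError'
    except socket.timeout:
        return 'timeout'

# errors built by hand, no network involved
http_error = HTTPError('https://example.com', 404, 'Not Found', None, None)
print(classify(http_error))
print(classify(URLError('unreachable')))
print(classify(socket.timeout('timed out')))
```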

## <font color='red'>INTERMEDIATE QUIZ #4-1</font>
Now we will look at St Petersburg University or, to be precise, at the page with [key news at the University](https://english.spbu.ru/news-events/news).

Your goals for the quiz are to:
1. Take one news item with the help of `soup`
2. Extract its title and the URL of its details page

#### HINTS

In [None]:
# parse the site as usual
url_details = 'https://english.spbu.ru/news-events/news'
html = get_content(url_details, timeout=TIMEOUT)
soup = BeautifulSoup(html, 'html.parser')

In [None]:
# use `Developer mode` to find
# <div class="card-context  card--with-img card-context--large ">
#   ...
# </div>

soup.find('div', class_='card-context')
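
Links extracted from a page are often relative. Whether that is the case on this site is an assumption to check in `Developer mode`; the `relative_href` value below is made up. A minimal sketch of turning such a link into a full URL with `urljoin`:

```python
from urllib.parse import urljoin

base_url = 'https://english.spbu.ru/news-events/news'
# hypothetical value of an extracted `href`
relative_href = '/news-events/news/example-news-item'
# an already absolute link is returned unchanged
absolute_href = 'https://english.spbu.ru/other'

print(urljoin(base_url, relative_href))
print(urljoin(base_url, absolute_href))
```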

In [None]:
### YOUR CODE HERE ###

## <font color='red'>LAB WORK #3</font>

Collect the data for all the news items on the first page of [key news at the University](https://english.spbu.ru/news-events/news):
1. Title of the record
2. Time it was published
3. URL (link) to the detailed news
4. Annotation (first paragraph) of every text

In [None]:
### YOUR CODE HERE ###