# Mastering Applied Skills in Management, Analytics and Entrepreneurship

## DATA COLLECTION TECHNIQUES
## Part V. Web scraping with Selenium

The JupyterHub installation includes [Selenium with Python](https://selenium-python.readthedocs.io/), which provides a simple API for writing functional/acceptance tests with Selenium WebDriver, or simply for scraping sites on the Internet.

### 1. Selenium library

In [None]:
import io
import time
import matplotlib.pyplot as plt
import selenium
from selenium import webdriver
from selenium.webdriver import FirefoxOptions

The first step is to create a browser to access the site; this browser will be our eyes and hands for the task. To the site, it will look much like a regular user's browser.

In [None]:
opts = FirefoxOptions()
opts.add_argument('--headless')
browser = webdriver.Firefox(options=opts)

In [None]:
browser

In [None]:
browser.name

In [None]:
browser.current_url

### 2. Basic demo

We will take a task from last year's diploma project, which analysed AI job descriptions. The first step is to collect data from the site.

In [None]:
url_ai_jobs = 'https://aijobs.ai/'
print(url_ai_jobs)

In [None]:
# put the url to our browser
browser.get(url_ai_jobs)

In [None]:
# now our browser got the url
browser.current_url

In [None]:
img_bytes = browser.get_full_page_screenshot_as_png()

In [None]:
from PIL import Image

# the screenshot comes back as raw PNG bytes,
# so wrap it in BytesIO before opening it with PIL
plt.figure(figsize=(16, 64))
img = Image.open(io.BytesIO(img_bytes))
plt.imshow(img)
plt.show()

In [None]:
# easy task, just collect text from the page
text_from_site = browser.find_element('xpath', 'html').text

In [None]:
print(text_from_site)

In [None]:
# remember this number!
len(text_from_site)

### 3. Click buttons

What is the problem with the `AI jobs` site? Why can't we use the `BeautifulSoup` library as usual?

Because of the `Load more jobs` button. More jobs appear on the page only after the button is clicked, and `BeautifulSoup` can parse the HTML it is given but cannot click `Load more` buttons.

In [None]:
# find the button first
# the browser's developer tools will help us again:
# inspect the element
# and choose `Copy XPath`
# the result is `//*[@id="load-more-button"]`

button_xpath = '//*[@id="load-more-button"]'

browser.find_element('xpath', button_xpath)

In [None]:
browser.find_element('xpath', button_xpath).text

In [None]:
# try to click it

browser.find_element('xpath', button_xpath).click()

In [None]:
# below is an ugly workaround
# but it works!
# if you want a production solution, see
# https://stackoverflow.com/questions/56085152/selenium-python-error-element-could-not-be-scrolled-into-view

import time
import random

MIN_TIME_SLEEP = 3
MAX_TIME_SLEEP = 10
flag = True

while flag:
    try:
        browser.find_element('xpath', button_xpath).click()
        print('clicked')
        flag = False
    except Exception:
        sec2sleep = random.uniform(MIN_TIME_SLEEP, MAX_TIME_SLEEP)
        print('sleep', sec2sleep, 'sec(s) and then click again')
        time.sleep(sec2sleep)
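
The loop above keeps retrying forever if the button never becomes clickable. A minimal sketch of a bounded version follows; the helper name `click_with_retry` and its parameters (`max_attempts`, `min_sleep`, `max_sleep`, `sleep`) are assumptions made for illustration, not part of Selenium. It only relies on `find_element(...).click()`, as used above.

```python
import random
import time


def click_with_retry(browser, xpath, max_attempts=5,
                     min_sleep=3, max_sleep=10, sleep=time.sleep):
    """Try to click the element at `xpath`, giving up after `max_attempts`.

    Returns True if a click succeeded and False otherwise.
    `sleep` is injectable so the pause can be replaced (hypothetical parameter).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            browser.find_element('xpath', xpath).click()
            return True
        except Exception as error:
            # Broad on purpose for this sketch; real code could catch
            # narrower Selenium exceptions instead.
            print(f'attempt {attempt} failed: {type(error).__name__}')
            if attempt < max_attempts:
                # Random pause before the next attempt, as in the loop above.
                sleep(random.uniform(min_sleep, max_sleep))
    return False
```

With the objects defined earlier, the call would look like `click_with_retry(browser, button_xpath)`; the return value tells you whether it is worth re-reading the page text.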

In [None]:
text_from_site = browser.find_element('xpath', 'html').text

In [None]:
# look at the number carefully
len(text_from_site)

### 4. Click buttons wisely

#### 4.1. Make many clicks

In [None]:
counter = 0
while counter < 3: # or 'while True:' for endless
    try:
        # click button
        browser.find_element('xpath', button_xpath).click()
        # ...and then collect data from site
        text_from_site = browser.find_element('xpath', 'html').text
        counter += 1
        print('click', counter, '| text', len(text_from_site))
    except Exception:
        sec2sleep = random.uniform(MIN_TIME_SLEEP, MAX_TIME_SLEEP)
        print('sleep', sec2sleep, 'sec(s) and then click again')
        time.sleep(sec2sleep)
text_from_site = browser.find_element('xpath', 'html').text
len(text_from_site)
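
A fixed number of clicks is a guess. One possible alternative, sketched below, is to keep clicking until the page text stops growing. The function name `load_more_until_stable` and the parameters `max_clicks`, `pause` and `sleep` are hypothetical, and the fixed pause after each click is an assumption about how quickly the site appends new jobs.

```python
import time


def load_more_until_stable(browser, button_xpath, max_clicks=20,
                           pause=2.0, sleep=time.sleep):
    """Click the button until the page text stops growing.

    Stops after `max_clicks` clicks at the latest and returns the number
    of clicks that actually added text to the page.
    """
    previous = len(browser.find_element('xpath', 'html').text)
    useful_clicks = 0
    for _ in range(max_clicks):
        try:
            browser.find_element('xpath', button_xpath).click()
        except Exception:
            # Button missing or not clickable: assume nothing more to load.
            break
        sleep(pause)  # give the page time to append the new jobs
        current = len(browser.find_element('xpath', 'html').text)
        if current <= previous:
            break  # the click added nothing, so we are done
        previous = current
        useful_clicks += 1
    return useful_clicks
```

The length of the page text is the same signal the cells above print after each click; comparing it before and after the click turns that manual check into a stopping rule.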

#### 4.2. Try to find out what we have collected

It is a good idea to locate elements by `XPath`, which we can easily copy from the browser's developer tools.

In [None]:
from selenium.webdriver.common.by import By

In [None]:
# first copy the XPath of the first job card,
# which is `/html/body/div[3]/div/div/div[3]/div/a`
# but here is a trick - we need all cards, not just one,
# so the right XPath will be `/html/body/div[3]/div/div/div[*]/div/a`

# earlier attempts, kept for reference (only the last assignment is used):
# jobs_xpath = '/html/body/div[*]/div/div/div[*]/div/a'
# jobs_xpath = '//*[@id="mix-job"]/div[1]'
# jobs_xpath = '//*[@id="mix-job"]/div[*]'
jobs_xpath = '//*[@id="mix-job"]/div[*]/div/a'

jobs = browser.find_elements(By.XPATH, jobs_xpath)
len(jobs)

In [None]:
one_job = jobs[0]
one_job

In [None]:
# data we can extract
one_job.text

In [None]:
# we can read an element's attributes with `get_attribute`;
# the developer tools show which attributes
# the desired element has, e.g.
# <a href="https://aijobs.ai/job/solution-architect-partner-development" ...
# ...
# </a>

one_job.get_attribute(name='href')

In [None]:
# we can go deeper with XPath to reach sub-elements,
# but the copied path must be made relative:
# from `//*[@id="mix-job"]/div[3]/div/a/div/div[1]/div[1]/div[1]`
# to `.//div/div[1]/div[1]/div[1]`, because `//*[@id="mix-job"]/div[3]/div/a`
# is the job card itself, which `one_job` already points to

one_job.find_element(By.XPATH, './/div/div[1]/div[1]/div[1]').text
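
To see why the leading `.` matters, here is a small offline sketch. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of Selenium, and the markup in `SNIPPET` is hypothetical and much simpler than the real page. The point is only that a path starting with `.` is resolved relative to the element it is called on.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: two job cards inside one container.
SNIPPET = """
<div id="mix-job">
  <div><a href="/job/one"><div class="title">Data Engineer</div></a></div>
  <div><a href="/job/two"><div class="title">ML Researcher</div></a></div>
</div>
""".strip()

root = ET.fromstring(SNIPPET)

# All cards: every <a> that sits inside a direct <div> child of the container.
cards = root.findall('./div/a')

# Relative search: `.//div` starts at each card, not at the document root,
# so every card yields its own title.
titles = [card.find('.//div').text for card in cards]
links = [card.get('href') for card in cards]

print(titles)
print(links)
```

In Selenium the same idea applies to `one_job.find_element(By.XPATH, './/...')`: the search is scoped to the job card rather than to the whole page.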

In [None]:
one_job.find_element(By.XPATH, './/div/div[1]/div[2]/span[1]').text

In [None]:
one_job.find_element(By.XPATH, './/div/div[1]/div[2]/span[2]').text

In [None]:
# to get several elements at once, use `find_elements` (plural),
# which returns a list of all matches
# look for `<span class="badge rounded-pill text-bg-light">`
[x.text for x in one_job.find_elements(By.XPATH, ".//div/div[1]/div[2]/span")]

In [None]:
one_job.find_element(By.XPATH, './/div/div[1]/div[3]/span').text

In [None]:
one_job.find_element(By.XPATH, './/div/div[2]/div/div/span').text

In [None]:
one_job.find_element(By.XPATH, './/div/div[2]/div/span/span').text

#### 4.3. Make a loop for all job descriptions

In [None]:
from tqdm.auto import tqdm

In [None]:
all_jobs = []
for job in tqdm(jobs):
    job_dict = {}
    job_dict['url'] = job.get_attribute(name='href')
    try:
        job_dict['location'] = job.find_element(By.XPATH, './/div/div[2]/div/span/span').text
    except Exception:
        job_dict['location'] = ''
    try:
        job_dict['salary_range'] = job.find_element(By.XPATH, './/div/div[1]/div[3]/span').text
    except Exception:
        job_dict['salary_range'] = ''
    job_dict['position'] = job.find_element(By.XPATH, './/div/div[1]/div[1]/div[1]').text
    job_dict['company'] = job.find_element(By.XPATH, './/div/div[2]/div/div/span').text
    job_dict['other'] = [
        x.text 
        for x in job.find_elements(By.XPATH, ".//div/div[1]/div[2]/span")
    ]
    all_jobs.append(job_dict)
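
The repeated `try`/`except` blocks can be folded into one helper. The sketch below is one way to do it; the names `text_or_default`, `FIELD_XPATHS` and `job_to_dict` are hypothetical, and the XPaths are the ones used in the loop above. Unlike that loop, this version also falls back to an empty string when `position` or `company` is missing.

```python
# XPaths relative to one job card, copied from the loop above.
FIELD_XPATHS = {
    'location': './/div/div[2]/div/span/span',
    'salary_range': './/div/div[1]/div[3]/span',
    'position': './/div/div[1]/div[1]/div[1]',
    'company': './/div/div[2]/div/div/span',
}
OTHER_XPATH = './/div/div[1]/div[2]/span'


def text_or_default(element, xpath, default=''):
    """Return the text of the sub-element at `xpath`, or `default` if lookup fails."""
    try:
        return element.find_element('xpath', xpath).text
    except Exception:
        # Broad on purpose for this sketch; with Selenium one could catch
        # selenium.common.exceptions.NoSuchElementException instead.
        return default


def job_to_dict(job):
    """Collect the fields of one job card into a plain dictionary."""
    record = {'url': job.get_attribute('href')}
    for field, xpath in FIELD_XPATHS.items():
        record[field] = text_or_default(job, xpath)
    record['other'] = [x.text for x in job.find_elements('xpath', OTHER_XPATH)]
    return record
```

The loop then shrinks to `all_jobs = [job_to_dict(job) for job in tqdm(jobs)]`.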

In [None]:
import pandas as pd

In [None]:
df = pd.DataFrame(all_jobs)

In [None]:
df

### 5. Single position with Selenium

In [None]:
print(all_jobs[10]['url'])

In [None]:
df.url[0]

We open a fresh browser for the job's URL:

In [None]:
opts = FirefoxOptions()
opts.add_argument('--headless')
browser = webdriver.Firefox(options=opts)

In [None]:
url_ai_job = all_jobs[10]['url']
browser.get(url_ai_job)

In [None]:
browser.find_element('xpath', 'html').text

In [None]:
description = browser.find_element(By.XPATH, '/html/body/div[2]/div[2]/div/div[1]/div')
print(description.text)

In [None]:
location = browser.find_element(By.XPATH, '/html/body/div[2]/div[2]/div/div[2]/div[1]/div/div/div/div/p')
print(location.text)

In [None]:
tag = browser.find_element(By.XPATH, '/html/body/div[2]/div[2]/div/div[2]/div[2]/div/a')
print(tag.text)

## <font color='red'>LAB WORK #4</font>

Collect at least 100 jobs from [ai-jobs.net](https://ai-jobs.net) and find the maximum base salary (maxValue) in USD per year across all collected job descriptions.
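
As a starting point for the salary part, here is a rough sketch of turning salary strings into numbers. The input formats (`USD 120K - 180K`, `$95,000 - $130,000`) are assumptions, not taken from the site, and the helper names `parse_salary_bounds` and `max_salary` are hypothetical; check the real strings you collect and adapt the pattern. The sketch does not check the currency or the pay period.

```python
import re

# A number with optional thousands separators and an optional K suffix.
_NUMBER = re.compile(r'(\d[\d,]*(?:\.\d+)?)\s*([kK])?')


def parse_salary_bounds(text):
    """Return every number found in a salary string as a list of integers."""
    values = []
    for digits, kilo in _NUMBER.findall(text):
        value = float(digits.replace(',', ''))
        if kilo:
            value *= 1000  # "120K" means 120 000
        values.append(int(value))
    return values


def max_salary(salary_texts):
    """Return the largest number across all salary strings, or None if none found."""
    all_values = [v for text in salary_texts for v in parse_salary_bounds(text)]
    return max(all_values) if all_values else None
```

For example, `max_salary(['USD 120K - 180K', '', '$95,000 - $130,000'])` picks the largest bound across the three strings and ignores the empty one.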

In [None]:
import os

folder = 'ai_jobs_data'
os.makedirs(folder, exist_ok=True)

In [None]:
### YOUR CODE HERE ###