# Tutorial 1 (Beautiful Soup)

1. Reading Robots.txt file
2. Open URL with [urllib](https://docs.python.org/3/library/urllib.request.html) and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/)


## 1. Reading Robots.txt file

In [63]:
#https://docs.python.org/3/library/urllib.robotparser.html
import urllib.robotparser
robot = urllib.robotparser.RobotFileParser() 
robot.set_url("https://en.wikipedia.org/robots.txt")
robot.read() #Reads the robots.txt URL and feeds it to the parser.

In [64]:
print(robot.can_fetch(useragent="*", url="https://en.wikipedia.org/wiki/Switzerland"))
print(robot.crawl_delay(useragent="*"))

True
None
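To see how `can_fetch` and `crawl_delay` react to specific rules without touching the network, the parser can also be fed lines directly through `parse()`. The sketch below uses made-up rules and a placeholder domain (`example.com`); they are assumptions for illustration and not Wikipedia's actual robots.txt.

```python
import urllib.robotparser

# Hypothetical robots.txt content, invented for this example.
rules = """
User-agent: *
Crawl-delay: 5
Disallow: /private/
""".strip().splitlines()

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)  # parse() accepts an iterable of lines

# A path that no rule mentions is allowed; a path under /private/ is not.
allowed = parser.can_fetch("*", "https://example.com/wiki/Switzerland")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")
delay = parser.crawl_delay("*")

print(allowed, blocked, delay)
```

A `crawl_delay` of `None`, as in the Wikipedia output above, means the file sets no delay for that user agent.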


## 2. Open URL with [urllib](https://docs.python.org/3/library/urllib.request.html) and [BeautifulSoup](https://beautiful-soup-4.readthedocs.io/en/latest/)

Beautiful Soup transforms a complex HTML document into a tree of Python objects such as Tag, NavigableString, BeautifulSoup, and Comment. A Tag corresponds to an XML or HTML tag in the original document and has a name and attributes. Beautiful Soup also supports different parsers for HTML, XML, etc. Find more details [here](https://beautiful-soup-4.readthedocs.io/en/latest/index.html?highlight=find_all#specifying-the-parser-to-use).
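As a minimal sketch of these object types, the snippet below parses a small HTML string (invented for this example) and prints the type of each node directly under the `<p>` tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
html = '<p class="intro">Hello <b>world</b><!-- a note --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(type(soup).__name__)  # the object that represents the whole document
print(p.name, p.attrs)      # tag name and attribute dictionary

# Collect the class name of each direct child of <p>:
# a text node, a nested tag, and a comment.
child_types = [type(child).__name__ for child in p.children]
print(child_types)
```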
## Task 1
Open the link https://realpython.github.io/fake-jobs/ in your browser and inspect the elements

In [1]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [69]:
html_response = urlopen('https://realpython.github.io/fake-jobs/') # returns http.client.HTTPResponse as a file like object
bs = BeautifulSoup(html_response, 'html.parser')
bs.contents

['html',
 '\n',
 <html>
 <head>
 <meta charset="utf-8"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <title>Fake Python</title>
 <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
 <section class="section">
 <div class="container mb-5">
 <h1 class="title is-1">
         Fake Python
       </h1>
 <p class="subtitle is-3">
         Fake Jobs for Your Web Scraping Journey
       </p>
 </div>
 <div class="container">
 <div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
 <div class="card">
 <div class="card-content">
 <div class="media">
 <div class="media-left">
 <figure class="image is-48x48">
 <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
 </figure>
 </div>
 <div class="media-content">
 <h2 class="title is-5">Senior Python Developer</h2>
 <h3 class="subtitle is-6 company">Payne, Robert

In [70]:
# Alternative way to make a soup with requests
import requests
html_response = requests.get('https://realpython.github.io/fake-jobs/') # Get URL Content
print(html_response)
bs_alternate = BeautifulSoup(html_response.content, 'html.parser')
bs_alternate.contents

<Response [200]>


['html',
 '\n',
 <html>
 <head>
 <meta charset="utf-8"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <title>Fake Python</title>
 <link href="https://cdn.jsdelivr.net/npm/bulma@0.9.2/css/bulma.min.css" rel="stylesheet"/>
 </head>
 <body>
 <section class="section">
 <div class="container mb-5">
 <h1 class="title is-1">
         Fake Python
       </h1>
 <p class="subtitle is-3">
         Fake Jobs for Your Web Scraping Journey
       </p>
 </div>
 <div class="container">
 <div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
 <div class="card">
 <div class="card-content">
 <div class="media">
 <div class="media-left">
 <figure class="image is-48x48">
 <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
 </figure>
 </div>
 <div class="media-content">
 <h2 class="title is-5">Senior Python Developer</h2>
 <h3 class="subtitle is-6 company">Payne, Robert

In [6]:
# Print in a readable format
print(bs.prettify())


## Error Handling when Opening URLs
Opening a URL that does not exist raises a URLError. A server-side failure (for example a 404 or 500 response) raises an HTTPError, which is a subclass of URLError.
We can handle both cases with a try/except block that catches HTTPError first and URLError second.

In [None]:
# Without error handling, this raises a URLError because the domain cannot be resolved
html = urlopen('https://somerandomwebsitethatdoesnotexist.com')
bs = BeautifulSoup(html.read(), 'html.parser')

In [72]:
from urllib.error import HTTPError, URLError
try:
    html = urlopen('https://somerandomwebsitethatdoesnotexist.com')
    bs = BeautifulSoup(html.read(), 'html.parser')
except HTTPError as e:
    print(e)
except URLError as e:
    print(e)

<urlopen error [Errno 11001] getaddrinfo failed>
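The same pattern can be wrapped in a small helper so that every request in a scraper is handled the same way. This is only a sketch: the function name `fetch_html` and the `timeout` default are assumptions, not part of the tutorial.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper for illustration only.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status code.
        print(f"Server returned an error status: {e.code}")
    except URLError as e:
        # The server could not be reached at all.
        print(f"Could not open the URL: {e.reason}")
    return None


html = fetch_html("https://somerandomwebsitethatdoesnotexist.com")
if html is None:
    print("Nothing to parse")
```

Returning `None` lets the caller decide whether to skip the page or retry, instead of stopping the whole run.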


In [73]:
#Exploring tags
html_response = urlopen('https://realpython.github.io/fake-jobs/') 
bs = BeautifulSoup(html_response, 'html.parser')
print(bs.h1.prettify())
print(f'name: {bs.h1.name}')
print(f'\nattrs: {bs.h1.attrs}')
print(f'\nstring: {bs.h1.string}')

<h1 class="title is-1">
 Fake Python
</h1>

name: h1

attrs: {'class': ['title', 'is-1']}

string: 
        Fake Python
      


## Finding Elements
## Task 2

Get the name and attrs of the div tag. How can you access its children?

In [79]:
bs.div.p

<p class="subtitle is-3">
        Fake Jobs for Your Web Scraping Journey
      </p>
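One possible way to explore a tag's children is sketched below on a small fragment that mimics the header of the fake-jobs page (the fragment is retyped here so the example runs offline).

```python
from bs4 import BeautifulSoup, Tag

# Fragment modelled on the page header, used only for illustration.
html = """
<div class="container mb-5">
  <h1 class="title is-1">Fake Python</h1>
  <p class="subtitle is-3">Fake Jobs for Your Web Scraping Journey</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

div = soup.div  # first <div> in the document
print(div.name, div.attrs)

# .children also yields the whitespace strings between tags,
# so keep only the Tag objects.
child_tags = [child.name for child in div.children if isinstance(child, Tag)]
print(child_tags)

# find_all(recursive=False) searches the direct children only.
direct = [tag.name for tag in div.find_all(recursive=False)]
print(direct)
```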

### Question
1. What happens if you try to access a nonexistent tag?
2. Did you get all the div tags?

In [None]:
bs.find_all("div")  # find_all is the current name; findAll is a legacy alias

[<div class="container mb-5">
 <h1 class="title is-1">
         Fake Python
       </h1>
 <p class="subtitle is-3">
         Fake Jobs for Your Web Scraping Journey
       </p>
 </div>,
 <div class="container">
 <div class="columns is-multiline" id="ResultsContainer">
 <div class="column is-half">
 <div class="card">
 <div class="card-content">
 <div class="media">
 <div class="media-left">
 <figure class="image is-48x48">
 <img alt="Real Python Logo" src="https://files.realpython.com/media/real-python-logo-thumbnail.7f0db70c2ed2.jpg?__no_cf_polish=1"/>
 </figure>
 </div>
 <div class="media-content">
 <h2 class="title is-5">Senior Python Developer</h2>
 <h3 class="subtitle is-6 company">Payne, Roberts and Davis</h3>
 </div>
 </div>
 <div class="content">
 <p class="location">
         Stewartbury, AA
       </p>
 <p class="is-small has-text-grey">
 <time datetime="2021-04-08">2021-04-08</time>
 </p>
 </div>
 <footer class="card-footer">
 <a class="card-footer-item" href="https://www.re

In [82]:
# Searching with strings (matches the full text of a node exactly)
bs.find_all(string="Materials engineer")

['Materials engineer', 'Materials engineer']
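A plain string passed to `string=` has to match the whole text of a node. For partial or case-insensitive matches, a compiled regular expression can be passed instead. The sketch below uses made-up headings rather than the live page.

```python
import re

from bs4 import BeautifulSoup

# Made-up job titles for illustration.
html = """
<h2>Materials engineer</h2>
<h2>Energy engineer</h2>
<h2>Legal executive</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# No text node is exactly "engineer", so the exact search finds nothing.
exact = soup.find_all(string="engineer")

# A regular expression matches any text node that contains the pattern.
partial = [str(s) for s in soup.find_all(string=re.compile("engineer", re.I))]

print(exact)
print(partial)
```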

In [100]:
# get_text() always returns a string, whereas .string is None for tags with several children
url_apply_links = [elem.get("href") for elem in bs.find_all("a") if "Apply" in elem.get_text()]

In [103]:
bs.select(".card-footer>a:nth-child(2)")

[<a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/senior-python-developer-0.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/energy-engineer-1.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/legal-executive-2.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/fitness-centre-manager-3.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/product-manager-4.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/medical-technical-officer-5.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.github.io/fake-jobs/jobs/physiological-scientist-6.html" target="_blank">Apply</a>,
 <a class="card-footer-item" href="https://realpython.git
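To unpack the selector used above, here is a small offline sketch. The fragment imitates two card footers, and the `example.com` URLs are placeholders rather than links from the real page.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics two card footers.
html = """
<footer class="card-footer">
  <a class="card-footer-item" href="https://example.com/learn">Learn</a>
  <a class="card-footer-item" href="https://example.com/jobs/a.html">Apply</a>
</footer>
<footer class="card-footer">
  <a class="card-footer-item" href="https://example.com/learn">Learn</a>
  <a class="card-footer-item" href="https://example.com/jobs/b.html">Apply</a>
</footer>
"""
soup = BeautifulSoup(html, "html.parser")

# ".card-footer > a" : <a> tags that are direct children of .card-footer
# ":nth-child(2)"    : only those that are the second child of their parent
links = soup.select(".card-footer > a:nth-child(2)")
hrefs = [a["href"] for a in links]
print(hrefs)
```

The selector relies on the Apply link always being the second child of the footer; if the page layout changes, filtering by link text is a sturdier alternative.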

## Task 3

Extract all the different job titles, companies present in the page.

Bonus: Can you extract all the apply links? Can you associate the links with the respective jobs?

**Hint:** ```bs.find_all("a")``` can help you find all the links.

In [None]:
for elem in bs.find_all("div", {"class": "media-content"}):
    print(elem.h2.string, elem.h3.string)

In [None]:
title_list = []
company_list =[]

for elem in bs.find_all("div", {"class": "media-content"}):
    title = elem.h2.string
    title_list.append(title)
    company = elem.h3.string
    company_list.append(company)
    print(title, company)
    


In [None]:
import pandas as pd

In [102]:
df = pd.DataFrame({"Job Titles": title_list,
              "Company": company_list,
              "Apply URL": url_apply_links})
df.to_csv("fake_job.csv")
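The DataFrame above pairs three lists by position, which only works if every list has the same length and order. An alternative sketch is to walk the page card by card and read the title, company, and link from the same card. The fragment below is made up, and it assumes each card keeps its links inside the `card-content` div, as the page output earlier suggests.

```python
from bs4 import BeautifulSoup

# Made-up fragment with two cards; names and URLs are placeholders.
html = """
<div class="card-content">
  <h2 class="title">Job A</h2>
  <h3 class="company">  Company A  </h3>
  <a href="https://example.com/learn">Learn</a>
  <a href="https://example.com/jobs/a.html">Apply</a>
</div>
<div class="card-content">
  <h2 class="title">Job B</h2>
  <h3 class="company">Company B</h3>
  <a href="https://example.com/learn">Learn</a>
  <a href="https://example.com/jobs/b.html">Apply</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for card in soup.find_all("div", class_="card-content"):
    apply_link = card.find("a", string="Apply")  # search inside this card only
    rows.append({
        "Title": card.h2.get_text(strip=True),   # strip surrounding whitespace
        "Company": card.h3.get_text(strip=True),
        "URL": apply_link["href"] if apply_link else None,
    })

print(rows)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)` to get the same three columns.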

In [None]:
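apply_url_list = []
# Reuse the CSS selector from above: the second link in each card footer is "Apply"
for elem in bs.select(".card-footer>a:nth-child(2)"):
    print(elem)
    apply_url_list.append(elem.get("href"))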

In [62]:
import pandas as pd
data = pd.DataFrame({"Title": title_list,
              "Company": company_list,
              "URL": apply_url_list})

data.to_csv("job_desc.csv")