# Scraping teaching jobs in Hungary
This notebook scrapes a Hungarian government website that advertises job vacancies and records how many teacher vacancies are listed each day. First, we import the libraries we need.

In [1]:
import requests
import bs4

Now, let's specify the URL we want to scrape, then use the `requests` library to send a `POST` request to the form endpoint on the page. Once we receive a response, we'll parse it for the data we need.

In [2]:
url = 'https://kozigallas.gov.hu/publicsearch.aspx'

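Okay, time to make the `POST` request. This is basically what happens when you submit the search form on that website: the browser packages up a bunch of data, including your search query, sends it to the server, and receives a response, which we'll parse later.

A bunch of extra tokens get sent along with the request too: hidden ASP.NET form fields such as `__VSTATE` and `__EVENTVALIDATION`, plus a session cookie. They are hard-coded in the request below.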

In [8]:
def send_request():
    try:
        response = requests.post(
            url=url,
            params={
                "p": "1",
            },
            headers={
                "Content-Type": "application/x-www-form-urlencoded; charset=utf-8",
                "Cookie": "ASP.NET_SessionId=nialgtjlnsftsb551ou1dj55",
            },
            data={
                "__EVENTTARGET": "",
                "__EVENTARGUMENT": "",
                "__VSTATE": "H4sIAAAAAAAEAO29B2AcSZYlJi9tynt/SvVK1+B0oQiAYBMk2JBAEOzBiM3mkuwdaUcjKasqgcplVmVdZhZAzO2dvPfee++999577733ujudTif33/8/XGZkAWz2zkrayZ4hgKrIHz9+fB8/Iv7vX/M3/e1+zV/3F/z+v/9JtWzrqmxe5b9oXdT5y6ppn2TTt79Xfv37//6/za/56/6W07bc2fnd3nz5cvd3m7TL59VFsfxNf91k+9N7ew8O6H/7v+nst/m1zvHPr/Vr0b+/3q/1a9oPzv2/f5tf5xf8WLFc5vW8XZS/7hePi8VF2tTTzz5a5LMiu1sssou8udvW2bJZZXW+bMYXxflH6VUxa+effbS7u/NROs+Li3lLf+zf+yidVPUsrz/7aOeju0e/4Nf/yaIpJmU+nzES3N3v8mt8w738Lr/mxezX+nUI/K/DY/p17Mh+LerqW8fpeV7Os+YHy3/0Ly7/0b91+Y/+1ZfpZXZxnWbpT+dl84N/9G9N58XkH/2Lm7HiyKB+LQLy2/xav+DXJhx/3d91PL4boNnkWT2d//6TddtWSyA6+7V+bdexIHH+m+L9X+dN/q7NZ4CMkf/av+5v/XsV5S/Ms+m6zQ9X5pfmd/k15zOB4SboNwIGv84v+HWnZdY0v+6vs8iXaxrOzuMsndf5+WcfrRiVb1eLfJw1q3cfHb24LtpfWAnEtCpnWfn4bnYEsNTzr2Mh7HUgnPykvv8Lv1wvysOlQarMaTZ++gcC49cVGL9Rk5f5lAaksD76hcfauPyFmY4lfZvXeWOHNvu1fv2w//ud/l/lF0XT0juz71STRnF5SXN1/Y/+xT/I2qxYCAo/hvlUMLNf6zcOgX7eAfqyrs6L0lDmmTKAwbE0ZCrS88L8/tZQy0zhr0eYo89f+9f9NXdm6B9/0ETxzPxms9/st/6Nfo3f+CD9R//gss3Kf/QvXlbN7/j6B9nbxXr5lpjp7T/6V69+8I//yWmxbP/Rv/oHefvb/17/6N/xA/rsvHr7j/49Kb7Nm3/0r6fvmt9Ovymat1WZ3b0oFsSoPyjWi19wusjLNm1+QDD+8b+ZfmZv/fd+7E0+nS+Lt+vFb0fja/7xP7kP+see/eN/soD99U8vrvM2X/zaXzw5/rVfzp/+1r/Rr7nza+3S/3Z/zd1fc+/XvPdr7v+a93/NT3/NB7/mwa/58Lf65Nf+jS6857f5NZm/MeLf4NdId1MIUFZmbVswTousSNtsRfRt8env/jy/AD0mk3R3+x43jrT6hf1WlzmRifAdAHZv+/4tgJlWMWCfuGbSiPRCPjCK0VDbGODf+jf4NXeYjPeJkJ8SKR8QCX+DgHy/LpPvt/41fosn/+hfPG22fy+alfUyXeQ0K7/xk4y03nVm/vpH/2qawkb++p2eVHVTzbaPJ9k/+vf89PZP5YtV+Y/+1frib/BkPctortvf9KSplhf1P/oXz+SL3+hZ/tP/6F9dyx+/3efX//ifXG9/UVGb7dfVqq709d/i29lPz/7Rv2f7STHPtO1v+O38Mteuf8fvEBf/YPsFaUpCdfv1D6pyWb1VmL9XtfhH/+K6WmyfEgny+qJayBe/yYt/9G/1EEleEnaK0+tqUZHS5T9+ZxKUSVUSIeiXFqC2n5ACuFAs3lBXSo4f+8lM8flNfzJvfrCq/9G/WrtKfopIL7/+1r+10j9gY2HvX2t379favfdr7e7/Wrv3f63dT3+t3Qe/1u7Br7X78Nfa26FJ+q19Pg8Z/sdoxmb8229of/uNvc+gm3+L1XpSFs38adbmxtD9gt8oW63Ka2v3WFuI7v/tn5OiI9mu2jYl/Ubs1VRvH5EA7Ij+/3W13a//JZT3o1Q+/TH99DfnT7O3xK7/6F+8yB6l+tZvfEvov3kU+m99A/RflxThbyIj+AW/4YvssrigsX5Vl7/ub0pGcZafZ+uyZS1LMH4DUjdsd9iWAe6v/+v
+VtRMNHydtQVZS237O/CHP2hro5OnVhFbSvz6v+4voLd/f//1p9V0TSq/bSwcT/+mNHxi/LcViWX7j/6tM5gegfObMhy4D/bF3/U71QXxs7VTKSnR1vzVkrS/tbT/9X/d3wbvv6yLy2x6varKYnptwPwWx7OsvTTvzei9xUzZg977LfFesViRQWx+sF6Yl5Iz+5GdwF//1/2N0fjbebmyzV7nF6TIm3/0r76Yzf4f2Ro3yBcKAAA=",
                "__VIEWSTATE": "",
                "__EVENTVALIDATION": "/wEWBQL+raDpAgK3hdWgDwKYhJHtDwKSvba2CQKbgZmZA575eIrMtkZINDZao4TPeGOLl/xf",
                "userName": "Felhasználói+név",
                "password": "Jelszó",
                "ctl00$ContentPlaceHolder1$JobSearchForm1$txtKeyword": "tanár",
                "ctl00$ContentPlaceHolder1$JobSearchForm1$btnSearch": "Keresés",
            },
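            timeout=30,  # don't hang forever if the server is unresponsive
        )
        response.raise_for_status()  # treat HTTP error status codes as failures too
        return response.content
    except requests.exceptions.RequestException as exc:
        print(f'HTTP request failed: {exc}')
        raise  # re-raise so the parsing step never receives None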
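
The hidden field values in the request above are tied to the page state the server generated at the time they were captured, so they may eventually go stale. One possible alternative is to `GET` the search page first and read the hidden inputs out of its HTML. The sketch below illustrates only the extraction step: the `HiddenFieldParser` class, the `extract_hidden_fields` helper, and the sample markup are all hypothetical, and it uses the standard library's `html.parser` so that it runs without network access. It has not been tested against the live site.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            # A missing value attribute is treated as an empty string.
            self.fields[attributes["name"]] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Hypothetical sample markup, standing in for the real search page.
sample_html = """
<form method="post" action="publicsearch.aspx">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__VSTATE" value="abc123" />
  <input type="text" name="keyword" value="ignored" />
</form>
"""

hidden = extract_hidden_fields(sample_html)
print(hidden)
```

On the live site, the resulting dictionary could be merged with the search fields (the `txtKeyword` and `btnSearch` entries) and passed as `data=` to `requests.post`, ideally through a single `requests.Session()` so the session cookie is handled automatically instead of being pasted in by hand.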

The response is an HTML page, so we'll use BeautifulSoup to parse the returned content:

In [10]:
soup = bs4.BeautifulSoup(send_request(), 'html.parser')

The number of results for a search for "tanár" (Hungarian for "teacher") is in a `span` element with an `id` of `ctl00_ContentPlaceHolder1_JobSearchForm1_lblCount`. Let's use BS4 to find that and print the value.

In [11]:
count = soup.find(id='ctl00_ContentPlaceHolder1_JobSearchForm1_lblCount').text
print(count)

2577
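
The lookup above assumes the `span` is always present. If the site changes its markup or returns an error page, `soup.find(...)` returns `None` and `.text` raises an `AttributeError`. A minimal sketch of a more defensive conversion follows. `parse_count` is a hypothetical helper, and it assumes the element contains nothing but the number, possibly with whitespace or separator characters around or inside it.

```python
import re


def parse_count(text):
    """Convert the text of the results counter to an int.

    Returns None when the text is missing or contains no digits.
    """
    if text is None:
        return None
    # Keep digits only, so surrounding whitespace or separators are ignored.
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


print(parse_count(" 2577 "))  # 2577
print(parse_count("2 577"))   # 2577
print(parse_count(None))      # None
```

With BeautifulSoup, the element would be fetched first and the helper called as `parse_count(element.text if element else None)`, which lets the caller decide what to do when no count is found.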


Let's also record the date and time of retrieval. We'll put the count and the timestamp into a list so we can easily add it as a row to a CSV.

**NOTE** We pass `timezone.utc` so that every timestamp is in UTC, because that is what GitHub Actions uses.

In [12]:
from datetime import datetime, timezone

current_time = datetime.now(timezone.utc).strftime("%m/%d/%Y, %H:%M")

row = [count.strip(), current_time]
print(row)

['2577', '06/27/2022, 11:53']
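
The `%m/%d/%Y` format can be misread by anyone who expects day-first dates, and it does not sort chronologically as plain text. If that matters for later analysis, an ISO 8601 timestamp is one alternative. The sketch below is a suggestion rather than what this notebook writes, and `utc_timestamp` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def utc_timestamp(now=None):
    """Return an ISO 8601 UTC timestamp with minute precision."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Normalise to UTC in case an aware datetime in another zone is passed.
    return now.astimezone(timezone.utc).isoformat(timespec="minutes")


fixed = datetime(2022, 6, 27, 11, 53, tzinfo=timezone.utc)
print(utc_timestamp(fixed))  # 2022-06-27T11:53+00:00
```

Strings in this form can be read back with `datetime.fromisoformat`, and sorting them alphabetically also sorts them by time.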


Time to write the result to a CSV file. For that we'll need Python's `csv` module, plus `os` to check whether the file already exists.

In [25]:
import os
import csv

And we'll also prepare the headers for the CSV file, and the filename.

In [13]:
HEADERS = ['count','datetime']
FILENAME = 'teacher_vacancies_count.csv'

Let's check whether the file already exists. If it does, we'll just append a row; if not, we'll create the file from scratch and write the headers first.

In [40]:
if not os.path.isfile(FILENAME):
    # New file: write the header row first, then the data row.
    with open(FILENAME, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(HEADERS)
        writer.writerow(row)
else:
    # File already exists: append the data row without repeating the header.
    with open(FILENAME, 'a', newline='') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(row)
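
The two branches above differ only in whether the header is written, so they could be folded into one function. This is a sketch of that idea, with `append_row` as a hypothetical name. It always opens the file in append mode and writes the header only when the file is new or empty. The second row in the demonstration is made-up data, used only to show that the header is not repeated.

```python
import csv
import os
import tempfile


def append_row(filename, headers, row):
    """Append row to a CSV file, writing headers first if the file is new or empty."""
    needs_header = not os.path.isfile(filename) or os.path.getsize(filename) == 0
    with open(filename, "a", newline="") as outfile:
        writer = csv.writer(outfile)
        if needs_header:
            writer.writerow(headers)
        writer.writerow(row)


# Demonstration in a temporary directory, so no real data file is touched.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "teacher_vacancies_count.csv")
    append_row(path, ["count", "datetime"], ["2577", "06/27/2022, 11:53"])
    # Made-up second row, only to show that the header is written once.
    append_row(path, ["count", "datetime"], ["2580", "06/28/2022, 11:53"])
    with open(path, newline="") as infile:
        rows = list(csv.reader(infile))

print(rows)
```

In the notebook itself, the call would be `append_row(FILENAME, HEADERS, row)`. Checking for an empty file as well as a missing one also covers the case where the CSV was created but never written to.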

Done! 👏