First draft of a project that scrapes the Pracuj.pl website in search of the skills required for positions related to Data Science.

The first step is to import the libraries needed to load and parse the website: urllib and BeautifulSoup. The latter handles messy, poorly written HTML and can display it in a much more readable form. It also provides convenient tools for reading the data enclosed in tags. Since BeautifulSoup cannot download a page itself, I will use urllib for that purpose.

In [3]:
import urllib
from bs4 import BeautifulSoup as bs

Let's take a look at our website. First I will use urllib to download the page from the URL, and then I will load it into bs4.

In [35]:
url="http://www.pracuj.pl/praca/data%20science;kw/warszawa;wp"
html=urllib.urlopen(url).read()
soup=bs(html, "html")
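
The cells in this notebook are written for Python 2, where `urllib.urlopen` exists. Under Python 3 the same functionality lives in `urllib.request`. The sketch below is one possible equivalent and is not part of the original notebook; the helper name `fetch_html`, the timeout, and the UTF-8 fallback are my own assumptions.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and return it as text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared in the response headers, if any;
        # falling back to UTF-8 is an assumption, not a rule.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# A data: URL lets us try the helper without touching the network.
demo_page = fetch_html("data:text/html;charset=utf-8,%3Cp%3Ehello%3C%2Fp%3E")
print(demo_page)
```

With the real site, the text returned by `fetch_html(url)` could be passed to `bs(...)` in the same way as `html` in the cell above.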

In [192]:
from IPython.display import IFrame
IFrame('https://www.pracuj.pl/praca/data%20science;kw/warszawa;wp', 950, 350)

Let's see what our website looks like as a bs4 object.

In [161]:
len(soup.prettify())

281588

In [173]:
print soup.prettify()[102000:103000]

o-list_star star " data-href="/logowanie?returnUrl=%2fpraca%2fdata%2520science%3bkw%2fwarszawa%3bwp&amp;emailOriginId=Favourites&amp;addFav=4733139" data-id="4733139" data-tooltip="Dodaj do ulubionych" data-tooltip-align="right" data-tooltip-checked="Usuń z ulubionych">
        </span>
        <h2 class="o-list_item_link" data-applied-offer-id="4733139">
         <a class="o-list_item_link_name" href="/praca/mlodszy-specjalista-ds-przeciwdzialania-naduzyciom-warszawa,oferta,4733139" itemprop="title" title="Praca Młodszy Specjalista ds. Przeciwdziałania Nadużyciom">
          Młodszy Specjalista ds. Przeciwdziałania Nadużyciom
         </a>
        </h2>
        <h3 class="o-list_item_link" data-offers="list" itemprop="hiringOrganization" itemscope="" itemtype="http://schema.org/Organization">
         <a class="o-list_item_link_emp" href="/poznaj-pracodawce/deutsche%20bank%20polska%20s.a.,3134133" itemprop="name" title="Praca Deutsche Bank Polska S.A.">
          Deutsche Bank Polska S

So our HTML file has over 280 thousand characters. In the prettified output, the links to job offers show up around character 102,000. Each of them is an <a> tag with the class "o-list_item_link_name".

Let's extract the link for each offer that meets our criteria.

In [155]:
links = soup.find_all('a', class_="o-list_item_link_name")
for link in links[0:10]:
    print link.get('href')

/praca/praktykant-ka-zespol-aktuarialny-warszawa,oferta,4756161
/praca/business-intelligence-ms-specialist-warszawa,oferta,4758430
/praca/mlodszy-specjalista-ds-przeciwdzialania-naduzyciom-warszawa,oferta,4733139
/praca/programista-aplikacji-bi-warszawa,oferta,4732396
/praca/microsoft-bi-developer-warszawa,oferta,4717759
/praca/business-intelligence-consultant-warszawa,oferta,4744292
/praca/specjalista-ds-baz-danych-i-analizy-portfela-kredytowego-warszawa,oferta,4742891
/praca/starszy-specjalista-ds-analiz-i-statystyki-warszawa,oferta,4720884
/praca/bi-development-team-lead-warszawa,oferta,4717565
/praca/etl-specialist-warszawa,oferta,4740993


We have a link for each offer, but it would be nice to identify the offers by job title, as advertised by the employer.

In [147]:
job_titles = soup.find_all('a', class_="o-list_item_link_name")
for title in job_titles[0:10]:
    print title.contents[0].strip()

Praktykant/ka - Zespół Aktuarialny
Business Intelligence (MS) Specialist
Młodszy Specjalista ds. Przeciwdziałania Nadużyciom
Programista Aplikacji BI
Microsoft BI Developer
Business Intelligence Consultant
Specjalista ds. baz danych i analizy portfela kredytowego
Starszy Specjalista ds Analiz i Statystyki
BI Development Team Lead
ETL Specialist
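
To make explicit what the two `find_all` calls above select, here is a rough sketch of the same extraction written with only Python 3's standard-library `html.parser`. It is a swapped-in technique for illustration, not how BeautifulSoup works internally. The class `OfferLinkParser` is my own hypothetical name, and `sample_html` is a shortened, hand-made fragment modelled on the excerpt above (the employer link in it is a placeholder).

```python
from html.parser import HTMLParser


class OfferLinkParser(HTMLParser):
    """Collect (href, text) for every <a> tag that has a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.offers = []       # list of (href, title) tuples
        self._inside = False   # True while we are inside a matching <a>
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._inside = True
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.offers.append((self._href, "".join(self._text).strip()))
            self._inside = False


# Hand-made fragment; the second link is a placeholder, not real data.
sample_html = """
<h2 class="o-list_item_link">
 <a class="o-list_item_link_name" href="/praca/mlodszy-specjalista-ds-przeciwdzialania-naduzyciom-warszawa,oferta,4733139">
  Młodszy Specjalista ds. Przeciwdziałania Nadużyciom
 </a>
</h2>
<h3 class="o-list_item_link">
 <a class="o-list_item_link_emp" href="/poznaj-pracodawce/example">Example employer</a>
</h3>
"""

parser = OfferLinkParser("o-list_item_link_name")
parser.feed(sample_html)
parser.close()
offers = parser.offers
print(offers)
```

Only the first link is collected, because the employer link carries a different class. This is the same filtering that `class_="o-list_item_link_name"` expresses in the bs4 calls.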


In [82]:
prefix = "www.pracuj.pl"

In [148]:
links = soup.find_all('a', class_="o-list_item_link_name")
for link in links[0:10]:
    print prefix + link.get('href')

www.pracuj.pl/praca/praktykant-ka-zespol-aktuarialny-warszawa,oferta,4756161
www.pracuj.pl/praca/business-intelligence-ms-specialist-warszawa,oferta,4758430
www.pracuj.pl/praca/mlodszy-specjalista-ds-przeciwdzialania-naduzyciom-warszawa,oferta,4733139
www.pracuj.pl/praca/programista-aplikacji-bi-warszawa,oferta,4732396
www.pracuj.pl/praca/microsoft-bi-developer-warszawa,oferta,4717759
www.pracuj.pl/praca/business-intelligence-consultant-warszawa,oferta,4744292
www.pracuj.pl/praca/specjalista-ds-baz-danych-i-analizy-portfela-kredytowego-warszawa,oferta,4742891
www.pracuj.pl/praca/starszy-specjalista-ds-analiz-i-statystyki-warszawa,oferta,4720884
www.pracuj.pl/praca/bi-development-team-lead-warszawa,oferta,4717565
www.pracuj.pl/praca/etl-specialist-warszawa,oferta,4740993
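
The concatenated links above have no scheme, so they would need one before they can be fetched. A minimal Python 3 sketch of an alternative is `urljoin` from the standard library. The name `BASE_URL` and the choice of `https` are my assumptions, and the two relative links are copied from the output above.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.pracuj.pl"  # assumed scheme and host

relative_links = [
    "/praca/praktykant-ka-zespol-aktuarialny-warszawa,oferta,4756161",
    "/praca/etl-specialist-warszawa,oferta,4740993",
]

# urljoin resolves each relative path against the base URL.
absolute_links = [urljoin(BASE_URL, href) for href in relative_links]
for link in absolute_links:
    print(link)
```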


To combine each job ad link with its title, I will create a dictionary with the relative links as keys.

In [120]:
job_offers = {}
for link in links:
    job_offers[link.get('href')]={}

Now for each link I will fill its nested dictionary with the full link to the job offer and the job title.

In [121]:
for link in links:
    job_offers[link.get('href')]["link"]=prefix + link.get('href')

In [153]:

for link in links:
    # Each <a> tag holds both the href and the title text, so soup does not need to be queried again.
    job_offers[link.get('href')]["Job_Title"] = link.contents[0].strip()
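
The three loops above could also be folded into a single pass. The sketch below shows the idea in Python 3 on plain data. The `scraped` pairs are copied from the earlier output and stand in for `link.get('href')` and `link.contents[0].strip()`, and the name `job_offers_demo` is hypothetical.

```python
# (href, title) pairs copied from the earlier output; in the notebook
# they would come from the bs4 link objects.
scraped = [
    ("/praca/praktykant-ka-zespol-aktuarialny-warszawa,oferta,4756161",
     "Praktykant/ka - Zespół Aktuarialny"),
    ("/praca/etl-specialist-warszawa,oferta,4740993",
     "ETL Specialist"),
]
prefix = "www.pracuj.pl"

# One nested dictionary per offer, keyed by the relative link.
job_offers_demo = {
    href: {"link": prefix + href, "Job_Title": title}
    for href, title in scraped
}

for offer in job_offers_demo.values():
    print(offer["Job_Title"] + "\n" + offer["link"] + "\n")
```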

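Let's take a look at the results. Each entry should have a job title and a link pointing to the ad with the offer. Because a Python 2 dictionary does not keep insertion order, the first five entries shown here differ from the offers listed earlier.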

In [166]:
for key in job_offers.keys()[0:5]:
    print job_offers[key]["Job_Title"] + "\n" + job_offers[key]["link"] + "\n"

Junior Business Intelligence Analyst
www.pracuj.pl/praca/junior-business-intelligence-analyst-warszawa,oferta,4734392

Specjalista ds. CRM Analitycznego
www.pracuj.pl/praca/specjalista-ds-crm-analitycznego-warszawa,oferta,4733798

Specjalista ds. rozwoju systemu raportowego
www.pracuj.pl/praca/specjalista-ds-rozwoju-systemu-raportowego-warszawa,oferta,4747572

Inżynier - Text Mining / Data Mining / Knowledge Engineering
www.pracuj.pl/praca/inzynier-text-mining-data-mining-knowledge-engineering-warszawa,oferta,4744875

Specjalista ds. E-commerce
www.pracuj.pl/praca/specjalista-ds-e-commerce-warszawa,oferta,4720582



I want to save this data to a file, as I will use it later for further analysis. JSON files are easy to parse and quick to write and read, so let's make use of the json library.

In [162]:
import json
with open("job_offers.json", "w") as writeJSON:
    json.dump(job_offers, writeJSON)

In [176]:
with open('job_offers.json') as data_file:
    data = json.load(data_file)
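
The job titles contain Polish characters, so it may be worth checking that they survive the round trip. The Python 3 sketch below writes a small sample to a temporary file and reads it back. The sample entry, the file name `job_offers_demo.json`, and the choice of `ensure_ascii=False` with UTF-8 are my assumptions rather than what the cells above do.

```python
import json
import os
import tempfile

# One sample entry, shaped like the job_offers dictionary above.
offers = {
    "/praca/praktykant-ka-zespol-aktuarialny-warszawa,oferta,4756161": {
        "link": "www.pracuj.pl/praca/praktykant-ka-zespol-aktuarialny-warszawa,oferta,4756161",
        "Job_Title": "Praktykant/ka - Zespół Aktuarialny",
    }
}

path = os.path.join(tempfile.mkdtemp(), "job_offers_demo.json")

# ensure_ascii=False keeps "ó" and "ł" readable in the file instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as fh:
    json.dump(offers, fh, ensure_ascii=False, indent=2)

with open(path, encoding="utf-8") as fh:
    restored = json.load(fh)

print(restored == offers)
```

The default `ensure_ascii=True` also round-trips correctly. The difference is only in how readable the file is when opened in a text editor.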