## Data scraping from hh.ru and preparation

The main goal of this file is to understand how one can get data about vacancies from the biggest Russian job site - hh.ru.  
Here I'll analyze the search results about DS vacancies. Transformed data will be saved in .csv for further processing.

In [35]:
#import requests
#import urllib.request
#import datetime
#import re
from bs4 import BeautifulSoup
import json
import numpy as np
import pandas as pd

from time import sleep

from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium import webdriver

import sys
import logging

In [38]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter(fmt='[%(asctime)s: %(funcName)s: %(levelname)s] %(message)s'))
logger.addHandler(handler)

Preparing browser under selenium control to collect data

In [2]:
chrome_mode = 'headed' #'headless' # for debug purposes we can change this value to any but 'headless' to run Chrome in standard mode
chrome_options = Options()
if chrome_mode == 'headless':
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--disable-gpu')
    chrome_options.add_argument('--headless')
service = Service(executable_path="c:\\Applications\\WebDriver\\chromedriver-x32.exe")
browser = webdriver.Chrome(service=service, options=chrome_options)

In [3]:
"""
urls to search vacancies by words "аналитик данных" / "data scien*" in Russia looks like:
(1) https://hh.ru/search/vacancy?
                        text=data+scien*&                   # will be format parameter
                        search_field=name&search_field=description&
                        area=1&                             # will be format parameter
                        salary=150000&currency_code=RUR&    # will be format parameter
                        experience=doesNotMatter&
                        order_by=relevance&
                        search_period=0&
                        items_on_page=100&
                        no_magic=true&
                        L_save_area=true
(2) https://hh.ru/search/vacancy?
                        text=data+scien*&
                        search_field=name&search_field=description&
                        area=1&
                        salary=150000&currency_code=RUR&
                        experience=doesNotMatter&
                        order_by=relevance&
                        search_period=0&
                        items_on_page=100&
                        no_magic=true&
                        L_save_area=true&
                        page=1&                         # these 2 parameters are used when navigating to different pages
                        hhtmFrom=vacancy_search_list
(3) https://hh.ru/search/vacancy?
                        text=data+scien*&
                        search_field=name&search_field=description&
                                                        # 'area' parameter is missing => search in all areas
                        salary=&currency_code=RUR&      # 'salary' parameter set to '' => search for all salary amounts
                        experience=doesNotMatter&
                        order_by=relevance&
                        search_period=0&
                        items_on_page=100&
                        no_magic=true&
                        L_save_area=true
"""

'\nurls to search vacancies by words "аналитик данных" / "data scien*" in Russia looks like:\n(1) https://hh.ru/search/vacancy?\n                        text=data+scien*&                   # will be format parameter\n                        search_field=name&search_field=description&\n                        area=1&                             # will be format parameter\n                        salary=150000&currency_code=RUR&    # will be format parameter\n                        experience=doesNotMatter&\n                        order_by=relevance&\n                        search_period=0&\n                        items_on_page=100&\n                        no_magic=true&\n                        L_save_area=true\n(2) https://hh.ru/search/vacancy?\n                        text=data+scien*&\n                        search_field=name&search_field=description&\n                        area=1&\n                        salary=150000&currency_code=RUR&\n                        experience=d

In [4]:
search_url_template = "https://hh.ru/search/vacancy?text={}&search_field=name&search_field=description&{}salary={}&currency_code=RUR&experience=doesNotMatter&order_by=relevance&search_period=0&items_on_page=100"
#items_on_page = 100
salary_level = '' #'150000'
search_texts = [
    "data+scien*",
    "%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85",
    "data analy*",
    "{}",
]
areas = {
    '#later': '{}', # left to configure later
    'All': '',
    'Moscow': 'area=1&',
    'SPb': 'area=2&',
    'Ekaterinburg': 'area=3&',
    'Novosib': 'area=4&',
    'Austria': 'area=7&',
    'Erevan': 'area=13&',
    'NNovgorod': 'area=66&',
    'RostovND': 'area=76&',
    'Samara': 'area=78&',
    'Saratov': 'area=79&',
    'Kazan': 'area=88&',
    'Chelyabinsk': 'area=104&',
    '???': 'area=159&',
    'Almaty': 'area=160&',
    'Minsk': 'area=1002&',
    'Nur-Sultan': 'area=159&',
    'Tbilisi': 'area=2758&',
    'Tashkent': 'area=2759&',
}
url_tail = '&page={}&hhtmFrom=vacancy_search_list'
output_filename = 'vacancies_data_analyst'

In [5]:
def combine_base_url(template=None, search_text='', area_keys=[], salary=''): #, items_per_page=100
    if template is None:
        return None
    areas_str = ''
    for key in area_keys:
        areas_str += areas[key]
    if salary == '' or salary is None:
        salary_str = ''
    else:
        salary_str = str(salary) + '&only_with_salary=true&'
    return template.format(search_text, areas_str, salary_str) #, items_per_page

The baseline url to start search with:

In [6]:
base_url = combine_base_url(template = search_url_template, search_text=search_texts[1], area_keys=['All'], salary=salary_level)
base_url

'https://hh.ru/search/vacancy?text=%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85&search_field=name&search_field=description&salary=&currency_code=RUR&experience=doesNotMatter&order_by=relevance&search_period=0&items_on_page=100'

In [7]:
browser.get(base_url)

The page source code analysis shows that all info about vacancies is located under 'template' key. This 1-row data looks like json-parseable multilevel structure. So I'm going to feed it to json module and the analyse more deeply.

In [8]:
soup = BeautifulSoup(browser.page_source, 'html.parser')
# Page source code showed us there is only 1 'template' tag on page.
# It contains a huge amount of data including vacancies list in dictionary-like format (possibly for JS parsing).
# So here I'll use json library to convert html text to dictionaries/lists
json_parsed = json.loads(soup.find_all('template')[0].text)
print('"Template" tag contains {} keys'.format(len(json_parsed)))

"Template" tag contains 498 keys


In [9]:
json_parsed

{'authPhone': None,
 'authNewEmployerAreaIdsToRedirect': [],
 'authNewEmployerCategories': [],
 'authNewEmployerFields': [],
 'authNewEmployerInitialValues': {},
 'authNewEmployerPhoneMask': None,
 'activeResumeAccessType': None,
 'accountTemporarilyLocked': {},
 'accountPhoneVerification': None,
 'applicantSignup': {'fields': [], 'hideLogin': False},
 'applicantVacancyResponseStatuses': {},
 'applicantResumes': [],
 'applicantResponseStreaks': {},
 'applicantPackageType': 'basic',
 'applicantServiceType': '',
 'applicantPaymentBackUrl': '',
 'applicantAnalyticsAction': '',
 'applicantPaymentTypes': [],
 'applicantAvailableResumeServices': [],
 'applicantPackageContent': [],
 'applicantAvailableQuantities': [],
 'applicantServicesPrices': {},
 'applicantPaymentSource': 'desktop',
 'applicantFindJobRecommendedQuantity': None,
 'applicantSuitableVacancyByResume': {},
 'account': {'firstName': None, 'middleName': None, 'lastName': None},
 'accountConnect': {},
 'accountConnectOAuth': {},


#### Analyzing page structure

We've got a huge amount of 'empty' data structures after translating page to json format. There are empty dictionaries and dictionaries containing 'empty' data structures. So I'll clean out these artifacts to make the visual analysis of data more efficient.

In [10]:
# Function to check if dictionary is 'empty'
def is_dict_empty(input_dict):
    result = True
    for key in input_dict.keys():
        if (type(input_dict[key]) is type(dict())) and (len(input_dict[key]) > 0):
            # Рекуррентная проверка словарей
            result = result and is_dict_empty(input_dict[key])
        else:
            # "Пустыми" считать структуры, длина которых равна 0, имеющие значение None или являющиеся пустым словарем или списком
            checks_empty = (input_dict[key] is None) or (str(input_dict[key]) in ['{}', '[]']) or (len(str(input_dict[key])) == 0)
            result = result and checks_empty
        if not result:
            break
    return result

In [11]:
clean_dict = {}
for key in json_parsed.keys():
    if (type(json_parsed[key]) is type(dict())):
        if not is_dict_empty(json_parsed[key]):
            clean_dict[key] = json_parsed[key]
    else:
        checks_empty = (json_parsed[key] is None) or (str(json_parsed[key]) in ['{}', '[]']) or (len(str(json_parsed[key])) == 0)
        if not checks_empty:
            clean_dict[key] = json_parsed[key]
print('{} non-empty keys in result'.format(len(clean_dict)))
print('====================================================')
for key in clean_dict.keys():
    print('{} ====> {}'.format(key, clean_dict[key]))

183 non-empty keys in result
applicantSignup ====> {'fields': [], 'hideLogin': False}
applicantPackageType ====> basic
applicantPaymentSource ====> desktop
accountHistoryReplenishments ====> {'bills': [], 'documentLinksVisibility': False, 'currency': 'RUR'}
accountDelete ====> {'applicantName': '', 'resumesList': {'resumes': {'published': [], 'unpublished': []}, 'count': 0}}
adsSearchParams ====> {'puid11': 'searchVacancy', 'puid23': '', 'puid14': 'аналитик данных', 'puid29': '', 'puid30': '', 'puid12': '', 'puid13': ''}
advancedSearch ====> {'showSearchConditions': False, 'hideSuggest': False, 'vacancy': None, 'resume': None, 'experience': [], 'keySkills': [], 'university': [], 'citizenship': [], 'work_ticket': [], 'employment': [], 'schedule': [], 'driver_license_types': [], 'job_search_status': [], 'language': [], 'exclusion': []}
appleBusinessChat ====> {'isEnabled': False, 'href': ''}
abortPageContent ====> False
addressesSuggestRemoteMode ====> False
analyticsParams ====> {'hhtmS

Now the *clean_dict* variable contains key with not-empty data bound to them.  
Keys analysis shows main search results are under _'vacancySearchResult'->'vacancies'_ keys.  
Another useful keys are:  
_'searchClusters'_ contains grouping characteristics ('industry', 'groups')  
_'searchClustersBasic'_ contains options to split data further ('area', 'compensation', etc.)

In [12]:
# Total number of search results
print(clean_dict['searchCounts'])
# or
print(clean_dict['vacancySearchResult']['totalResults'])
# ?

{'isLoad': False, 'value': 10627}
10627


In [13]:
clean_dict['searchClustersBasic'].keys()

dict_keys(['label', 'industry', 'experience', 'schedule', 'professionalArea', 'professional_role', 'area', 'employment', 'compensation', 'part_time', 'search_field', 'excluded_text'])

In case total search results number exceeds 2000 (seems to be hardcoded limit) it is possible to use 'searchClustersBasic'->'area' to implement partial searches

In [14]:
clean_dict['searchClustersBasic']['area']['groups']['113']

{'count': 10002,
 'seoDomain': 'hh.ru',
 'order': 1,
 'title': 'Россия',
 'id': '113'}

The key _'count'_ contains number of vacancies found in area, _'id'_ - area id. I can parse _'area'->'groups'_ keys to search in regions separately. It gives me a way to bypass the max vacancies limitation mentioned above

In [15]:
areas_to_crawl = []
area_ids_excluded = ['113'] # grouping ids: 113 = Russia
if json_parsed['vacancySearchResult']['totalResults'] > 2000:
    search_json_obj = json_parsed['searchClustersBasic']['area']['groups']
    for key in search_json_obj.keys():
        if (key not in area_ids_excluded) and (search_json_obj[key]['count'] > 0):
            areas_to_crawl.append(key)
print('Total areas with vacancies num: {}'.format(len(areas_to_crawl)))
print("First 10 values of areas' ids: ", areas_to_crawl[:10])

Total areas with vacancies num: 128
First 10 values of areas' ids:  ['1', '2', '9', '13', '16', '28', '40', '48', '94', '97']


Let's check vacancies number per page to ensure _'items_on_page'_ parameter works fine

In [16]:
vacancies_info = clean_dict['vacancySearchResult']['vacancies']
print('Vacancies data type: {}'.format(type(vacancies_info)))
print('Num of vacancies: {}'.format(len(vacancies_info)))

Vacancies data type: <class 'list'>
Num of vacancies: 101


There is 100 records containing vacancies info from the 1st search page. So this parameter is OK. And now I want to find total number of pages to get all of them.  
The _'paging'_ key seems a right place for this info

In [17]:
clean_dict['vacancySearchResult']['paging']

{'previous': {'page': -1, 'disabled': True},
 'pages': [{'text': '1', 'page': 0, 'selected': True, 'inShortRange': True},
  {'text': '2', 'page': 1, 'selected': False, 'inShortRange': True},
  {'text': '3', 'page': 2, 'selected': False, 'inShortRange': True},
  {'text': '4', 'page': 3, 'selected': False, 'inShortRange': False},
  {'text': '5', 'page': 4, 'selected': False, 'inShortRange': False},
  {'text': '...', 'page': 5, 'selected': False, 'inShortRange': False}],
 'lastPage': {'page': 19, 'selected': False},
 'next': {'page': 1, 'disabled': False},
 'os': 'Win'}

Indeed here is pagination data. Last page number contains in _'lastPage'->'page'_ key. And total search results is splitted on N+1 pages as they are 0-indexed.  
Now I have all the necessary info to get all vacancies into 1 place.  
> [!!!] In case of only 1 page the key _'paging'_ will contain _null_ value  
In case there are 2 or 3 pages the key _'lastPage'_ is absent

#### Harvesting vacancies info from hh.ru

It's time to combine all informations about harvesting procedure to make it correct:  
(1) if total number of results exceeds 2000 I have to get info partially from different areas (which are in *areas_to_crawl* variable). In this case there will possibly be duplicates as I don't know display algorithm in details. If *areas_to_crawl* has length == 0 then there are less than 2000 vacancies found;  
(1.5) if it's not enough I should go to split further by underground lines or stations (parameter _metro=9&_ or _metro=9.37&_ in address line);  
(2) I don't need to clean json results as I know all the necessary keys;  
(3) I need to loop through all the pages.

In [18]:
# returns num of pages to parse
def get_num_pages(json_dump, logger=logger):
    result = None
    div_paging = json_dump['vacancySearchResult'].get('paging', None)
    if div_paging is not None:
        if div_paging.get('lastPage', None) is None:
            result = max([x['page'] for x in div_paging['pages']])
        else:
            result = div_paging['lastPage']['page']
    return result

In [19]:
# parse all pages from first
def parse_pages(browser, first_page_url, tail_url, logger=logger):
    logger.debug('Flow control received')
    result = []
    browser.get(first_page_url)
    max_available_records = 2000
    if json_parsed['vacancySearchResult']['totalResults'] > max_available_records:
        logger.debug('More than {} records detected => passing flow control to "parse_pages_by_undeground"'.format(max_available_records))
        result += parse_pages_by_underground(browser, first_page_url, tail_url)
    else:
        logger.debug('Less than {} records detected => beginning parsing pages'.format(max_available_records))
        json_dump = json.loads(BeautifulSoup(browser.page_source, 'html.parser').find_all('template')[0].text)
        result += json_dump['vacancySearchResult']['vacancies']
        num_pages = get_num_pages(json_dump)
        logger.debug('Page 1 / {} parsed'.format(num_pages+1))
        if num_pages is not None:
            for page in range(1, num_pages+1):
                browser.get(first_page_url+tail_url.format(page))
                div_list = BeautifulSoup(browser.page_source, 'html.parser').find_all('template')
                json_dump = json.loads(div_list[0].text)
                result += json_dump['vacancySearchResult']['vacancies']
                logger.debug('Page {} / {} parsed'.format(page+1, num_pages+1))
    return result

In [20]:
# page address to filter search results by underground lines/stations
"""
https://hh.ru/search/vacancy?
                            area=1&
                            metro=9.37&
                            search_field=name&search_field=description&
                            text=%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85&
                            from=suggest_post&
                            clusters=true&
                            no_magic=true&
                            ored_clusters=true&
                            items_on_page=100&
                            enable_snippets=true
"""

'\nhttps://hh.ru/search/vacancy?\n                            area=1&\n                            metro=9.37&\n                            search_field=name&search_field=description&\n                            text=%D0%B0%D0%BD%D0%B0%D0%BB%D0%B8%D1%82%D0%B8%D0%BA+%D0%B4%D0%B0%D0%BD%D0%BD%D1%8B%D1%85&\n                            from=suggest_post&\n                            clusters=true&\n                            no_magic=true&\n                            ored_clusters=true&\n                            items_on_page=100&\n                            enable_snippets=true\n'

In [21]:
def parse_pages_by_underground(browser, start_page_url, tail_url, logger=logger):
    # at the moment function is called browser has already opened start_page_url
    # start_page_url doesn't contain 'page=' parameter
    #browser.get(start_page_url)
    logger.debug('Flow control received')
    json_dump = json.loads(BeautifulSoup(browser.page_source, 'html.parser').find_all('template')[0].text)
    # get all underground lines with search result > 0
    underground_dump = json_dump['searchClustersBasic']['metro']['groups']
    underground_lines = {key: underground_dump[key]['title'] for key in underground_dump.keys() if underground_dump[key]['type'] == 'line'}
    new_url_template = start_page_url + '&metro={}'
    logger.debug('Got {} underground lines. Begin parsing'.format(len(underground_lines)))
    
    result = []
    for line in underground_lines:
        logger.debug('Underground line: {}'.format(underground_lines[line]))
        url = new_url_template.format(line)
        browser.get(url)
        json_dump = json.loads(BeautifulSoup(browser.page_source, 'html.parser').find_all('template')[0].text)
        result += json_dump['vacancySearchResult']['vacancies']
        num_pages = get_num_pages(json_dump)
        logger.debug('Page 1 / {} parsed'.format(num_pages+1))
        if num_pages is not None:
            for page in range(1, num_pages+1):
                browser.get(url+tail_url.format(page))
                json_dump = json.loads(BeautifulSoup(browser.page_source, 'html.parser').find_all('template')[0].text)
                result += json_dump['vacancySearchResult']['vacancies']
                logger.debug('Page {} / {} parsed'.format(page+1, num_pages+1))
    return result

In [24]:
vacancies_info = []
if len(areas_to_crawl) > 0:
    url_by_area = combine_base_url(template = search_url_template, search_text=search_texts[1], area_keys=['#later'], salary=salary_level)
    for area in areas_to_crawl:
        vacancies_info += parse_pages(browser, url_by_area.format('area={}&'.format(area)), url_tail)

print('Final vacancies info contains {} record(s)'.format(len(vacancies_info)))

Final vacancies info contains 4467 record(s)


In [25]:
len(vacancies_info)

4467

We've got info from all pages and now can close browser

In [26]:
browser.quit()

Dumping data to have an opportunity to restore raw data later...

In [27]:
with open('datasets/'+output_filename+'.json', 'w') as f:
    json.dump(vacancies_info, f)

And now it's time to select fields to fill in DataFrame

In [28]:
for key in vacancies_info[0].keys():
    print('{} =======> {}'.format(key, vacancies_info[0][key]))



Useful info keys:
- 'vacancyId' - unique vacancy id which can be used to see description on https://hh.ru/vacancy/[vacancyId]
- 'name' - no comments )
- 'company'->'visibleName' + 'company'->'department'->'@name' - department info (if exists)
- 'area'->'@id', 'area'->'name' - код и название условной географической области поиска
- 'address'->'displayName', 'address'->'marker'->('@lat', '@lng') - показываемый адрес и координаты для карты
- 'compensation'->{'from', 'to', 'currencyCode', 'gross'=(True="before taxes", False="clean amount")} (или 'compensation'->'noCompensation', if no data) - salary info
- 'workSchedule' - full / shift / remote...
- 'snippet' - vacancy's description pieces: dict('req' - requirements, 'resp' - responsibilities, 'cond' - conditions, 'skill' - ?not used?, 'desc' - ?)
- 'publicationTime'
- 'lastChangeTime'

Lets define function to get necessary data from json:

In [29]:
df_column_names = [
    'vacancy_id',
    'vacancy_name',
    'company_name',
    'company_dept',
    'area',
    'address',
    'latitude',
    'longitude',
    'salary_from',
    'salary_to',
    'salary_currency',
    'salary_gross',
    'publication_time',
    'last_changed',
    'schedule',
    'req',
    'resp',
    'cond',
    'skills'
]

In [30]:
def get_record_data(rec):
    result = dict()
    result['vacancy_id'] = rec['vacancyId']
    result['vacancy_name'] = rec['name']
    result['company_name'] = rec['company']['visibleName']
    if rec['company'].get('department', np.NAN) is np.NAN:
        result['company_dept'] = np.NAN
    else:
        result['company_dept'] = rec['company']['department'].get('@name', np.NAN)
    result['area'] = rec['area']['@id']
    if rec.get('address', None) is None:
        result['address'] = np.NAN
    else:
        result['address'] = rec['address'].get('displayName', np.NAN)
        if rec['address'].get('marker', None) is None:
            result['latitude'] = np.NAN
            result['longitude'] = np.NAN
        else:
            result['latitude'] = rec['address']['marker'].get('@lat', np.NAN)
            result['longitude'] = rec['address']['marker'].get('@lng', np.NAN)
    if rec['compensation'].get('noCompensation', None) is None:
        result['salary_from'] = rec['compensation'].get('from', np.NAN)
        result['salary_to'] = rec['compensation'].get('to', np.NAN)
        result['salary_currency'] = rec['compensation'].get('currencyCode', np.NAN)
        result['salary_gross'] = rec['compensation'].get('gross', np.NAN)
    else:
        result['salary_from'] = np.NAN
        result['salary_to'] = np.NAN
        result['salary_currency'] = np.NAN
        result['salary_gross'] = np.NAN
    result['publication_time'] = rec['publicationTime']['@timestamp']
    result['last_changed'] = rec['lastChangeTime']['@timestamp']
    result['schedule'] = rec['workSchedule']
    result['req'] = rec['snippet'].get('req', np.NAN)
    result['resp'] = rec['snippet'].get('resp', np.NAN)
    result['cond'] = rec['snippet'].get('cond', np.NAN)
    result['skills'] = rec['snippet'].get('skills', np.NAN)

    return result

In [31]:
raw_parsed_data = {name: [] for name in df_column_names}
for rec in vacancies_info:
    parsed = get_record_data(rec)
    for key in df_column_names:
        raw_parsed_data[key].append(parsed.get(key, np.NAN))
print('Control of num of records created:', len(raw_parsed_data['vacancy_id']))
print('Vacancies ID sample: ', raw_parsed_data['vacancy_id'][:10])

Control of num of records created: 4467
Vacancies ID sample:  [67551709, 67119378, 67547530, 67627301, 67302297, 67583082, 54571452, 67583022, 66891845, 67579748]


Converting data structures created to pandas DataFrame. Then making readable data in 'publication_time' and 'last_changed' columns. And finally assigning 'vacancy_id' as primary index (it's unique for each record). It will be useful later if I'll decide to update data so I'll be able to filter out existing data or update existing records

In [32]:
df = pd.DataFrame(raw_parsed_data)
df['publication_time'] = df['publication_time'].apply(pd.to_datetime, unit='s')
df['last_changed'] = df['last_changed'].apply(pd.to_datetime, unit='s')
df.set_index('vacancy_id', inplace=True)
df.head()

Unnamed: 0_level_0,vacancy_name,company_name,company_dept,area,address,latitude,longitude,salary_from,salary_to,salary_currency,salary_gross,publication_time,last_changed,schedule,req,resp,cond,skills
vacancy_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
67551709,"Backend developer (Python, PostgreSQL)",Шамсуллин Рустам Радикович,,1,"Москва, Лужнецкая набережная, 2/4с4",55.715489,37.573627,,350000.0,RUR,False,2022-07-06 15:43:19,2022-07-11 19:15:48,FULL_DAY,,Обработку спортивных он-лайн видео-трансляции....,Вторым этапом будет оплачиваемое тестовое зада...,
67119378,Product Analyst (продуктовый аналитик),Мидлэнд Ритейл Груп,,1,"Москва, улица Льва Толстого, 20",55.735439,37.584981,,180000.0,RUR,False,2022-07-11 09:50:12,2022-07-11 10:00:13,FULL_DAY,Умение выбирать нужный способ оценки данных – ...,Взаимодействие со всеми продуктовыми командами...,Удаленку или офис в пешей доступности от Метро...,
67547530,Аналитик,МСУ-1,,1,"Москва, Ольховская улица, 49",55.778644,37.674418,120000.0,,RUR,False,2022-07-11 06:42:04,2022-07-11 06:42:05,FULL_DAY,Опыт работы в строительстве и знание технологи...,Нормирование строительно-монтажных работ. Ресу...,Работа в крупной стабильной строительной компа...,
67627301,Аналитик BI,А101,,1,посёлок Коммунарка,55.570395,37.475495,,,,,2022-07-10 08:05:10,2022-07-11 06:31:47,FULL_DAY,Высшее физико-математическое или экономическое...,"Сбор данных из различных источников (1С, SQL, ...","Удаленный формат работы. График работы: 5/2, с...",
67302297,Ведущий аналитик,Триафлай,,1,"Москва, Турчанинов переулок, 6с2",55.737101,37.597199,200000.0,,RUR,False,2022-07-10 14:12:28,2022-07-10 14:59:19,FULL_DAY,"...3х лет на позиции аналитика, включая опыт с...",Внедрение продукта конечному заказчику. Вендор...,Испытательный срок 3 месяца. Гибкий график (40...,


In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4467 entries, 67551709 to 66835012
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   vacancy_name      4467 non-null   object        
 1   company_name      4467 non-null   object        
 2   company_dept      785 non-null    object        
 3   area              4467 non-null   int64         
 4   address           4090 non-null   object        
 5   latitude          4090 non-null   float64       
 6   longitude         4090 non-null   float64       
 7   salary_from       853 non-null    float64       
 8   salary_to         544 non-null    float64       
 9   salary_currency   988 non-null    object        
 10  salary_gross      988 non-null    object        
 11  publication_time  4467 non-null   datetime64[ns]
 12  last_changed      4467 non-null   datetime64[ns]
 13  schedule          4467 non-null   object        
 14  req          

In [34]:
df.to_csv('datasets/'+output_filename+'.csv')