<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: Web Scraping Job Postings

<div class='alert alert-danger'>

## Directions

This project draws on a variety of skills. First, use the web-scraping and/or API techniques you've learned to collect data on data-related jobs from any job aggregator. Once you have collected and cleaned the data, use it to answer the two questions listed under the main objectives.

<div class='alert alert-warning'>

**Focus on data-related job postings**, e.g. <font color=red>data scientist, data analyst, research scientist, business intelligence</font>, and any others you might think of. You may also want to decrease the scope by **limiting your search to a single region**.

Features to collect: <font color=red>location, title, summary of job</font>

Main objectives:
   1. Determine the <font color=red>industry factors</font> that are most important in <font color=red>predicting the salary amounts</font> for these data.
   2. Determine the <font color=red>factors that distinguish job categories and titles</font> from each other. For example, can required skills accurately predict job title?

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
from bs4 import BeautifulSoup
import os
from selenium import webdriver # initialize browser
from time import sleep
import re

<font color=red>Step 1 - extract job description links using 'element'</font>:
``` Python
# collect the link to every job posting across the search result pages
job_details = pd.DataFrame()
jobs = []

# path to the local chromedriver binary (newer Selenium releases expect a Service object instead)
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
for i in range(240):
    url = "https://www.website.com/search?search=data&sortBy=new_posting_date&page={}".format(i)
    driver.get(url)

    # wait for the page to finish loading
    sleep(3)

    # Selenium does not expose the HTTP status code, so check it with a separate request
    response = requests.get(url)
    if response.status_code != 200:
        print('Status code:', response.status_code)

    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')

    try:
        for link in soup.find_all('a', {'class':'bg-white mb3 w-100 dib v-top pa3 no-underline flex-ns flex-wrap JobCard__card___22xP3'}):
            jobs.append('https://www.website.com' + link.get('href'))
    except Exception:
        # e.g. an anchor without an href; record a placeholder
        jobs.append('None')

    # check progress of url scraping
    if i % 10 == 0:
        print(i)

driver.quit()  # end the browser session

job_details['url'] = jobs
job_details.to_csv('project4_urls.csv', index=False)

job_details.head()
```
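The link-building step above concatenates the site root with each `href`. A minimal sketch of the same idea with `urllib.parse.urljoin` follows; it also copes with hrefs that are already absolute, skips anchors that have no `href`, and drops repeated links. The function name `build_job_urls` and the sample paths are hypothetical, and `https://www.website.com` is the placeholder domain used in this notebook.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.website.com"  # placeholder domain used in this notebook


def build_job_urls(hrefs, base_url=BASE_URL):
    """Turn raw href values into absolute URLs, skipping blanks and duplicates."""
    seen = set()
    urls = []
    for href in hrefs:
        if not href:  # <a> tags without an href give None
            continue
        url = urljoin(base_url, href)
        if url not in seen:  # keep the first occurrence, preserve page order
            seen.add(url)
            urls.append(url)
    return urls


# hypothetical hrefs, as returned by link.get('href')
sample = ["/job/data-analyst-123", None, "/job/data-analyst-123",
          "https://www.website.com/job/data-scientist-456"]
print(build_job_urls(sample))
```

Skipping missing hrefs, rather than appending a `'None'` placeholder, keeps the URL list free of rows that cannot be visited in the next step.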

<font color=red>Step 2 - extract job details & description from each url</font>

``` Python
# prototype: parse the detail fields from one job page
# (`html` is the page source of a posting already loaded with driver.get)
job_details = pd.DataFrame(columns=['company_name','job_title','location','employment_type','seniority','job_category','salary_range','salary_freq','roles & responsibilities','requirements'])

# page_source is a plain string, so parse it before searching for tags
soup = BeautifulSoup(html, 'lxml')

for entry in soup.find_all('div', {'class':'bg-white pa4'}):
    name = entry.find('p', {'name':'company'}).text
    job = entry.find('h1', {'id':'job_title'}).text
    location = entry.find('a', {'href':'#location_map'}).text
    emp_type = entry.find('p', {'id':'employment_type'}).text
    seniority = entry.find('p', {'id':'seniority'}).text
    job_categ = entry.find('p', {'id':'job-categories'}).text
    salary_range = entry.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
    salary_freq = entry.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
    roles = entry.find('div', {'id':'description-content'}).text
    requirements = entry.find('div', {'id':'requirements-content'}).text

    # append one row per posting
    job_details.loc[len(job_details)] = [name, job, location, emp_type, seniority, job_categ, salary_range, salary_freq, roles, requirements]

job_details
```
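Each field lookup above fails with an `AttributeError` when a posting lacks that tag, because `find` returns `None`. One way to handle this without a separate `try`/`except` per field is a small helper, sketched here under assumptions: `safe_text` is a hypothetical name, the HTML fragment is made up, and `html.parser` stands in for `lxml` so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup


def safe_text(soup, tag, attrs, default="None"):
    """Return the stripped text of the first matching tag, or `default` if absent."""
    node = soup.find(tag, attrs)
    return node.get_text(strip=True) if node is not None else default


# hypothetical page fragment reusing the attribute names from the code above
html = """
<div class="bg-white pa4">
  <p name="company">Acme Analytics</p>
  <h1 id="job_title">Data Analyst</h1>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

record = {
    "company_name": safe_text(soup, "p", {"name": "company"}),
    "job_title": safe_text(soup, "h1", {"id": "job_title"}),
    "seniority": safe_text(soup, "p", {"id": "seniority"}),  # missing on this page
}
print(record)
```

Unlike `.text`, `get_text(strip=True)` also trims surrounding whitespace, which may save a cleaning step later. Collecting one dictionary per posting and building the dataframe from the list of dictionaries would avoid keeping ten parallel lists in sync.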

<div class='alert alert-danger'>

#### Load csv with scraped urls

In [None]:
df = pd.read_csv('project4_urls.csv')
df.head()

In [None]:
df.shape

<div class='alert alert-danger'>

### Split dataframe into sets of 500 urls each

#### Scrape data for first set of 500 urls

In [None]:
df1 = df.loc[0:499].copy(deep=True)
df1.shape
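The slice boundaries for each set of 500 can also be computed rather than typed by hand. A minimal sketch, assuming the default integer index produced by `pd.read_csv`; the helper name `chunk_bounds` and the row count of 1200 are made up for illustration.

```python
def chunk_bounds(n_rows, size=500):
    """Return inclusive (start, stop) row labels for consecutive chunks."""
    return [(start, min(start + size, n_rows) - 1)
            for start in range(0, n_rows, size)]


bounds = chunk_bounds(1200)  # 1200 is a made-up row count
print(bounds)  # [(0, 499), (500, 999), (1000, 1199)]
```

Each pair can be passed to `df.loc[start:stop]`. Because `.loc` slices by label and includes both endpoints, `0:499` selects exactly 500 rows.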

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[0:499]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%100 == 0:
        print(i)

df1['company_name'] = company
df1['job_title'] = position
df1['location'] = address
df1['employment_type'] = employment
df1['seniority'] = snr
df1['job_category'] = job_cat
df1['salary_range'] = sal_r
df1['salary_freq'] = sal_f
df1['roles & responsibilities'] = r_and_r
df1['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df1.shape)
df1.head(10)

In [None]:
df1.to_csv('project4_df1.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for second set of 500 urls

In [None]:
df2 = df.loc[500:999].copy(deep=True)
df2.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[500:999]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df2['company_name'] = company
df2['job_title'] = position
df2['location'] = address
df2['employment_type'] = employment
df2['seniority'] = snr
df2['job_category'] = job_cat
df2['salary_range'] = sal_r
df2['salary_freq'] = sal_f
df2['roles & responsibilities'] = r_and_r
df2['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df2.shape)
df2.head(10)

In [None]:
df2.to_csv('project4_df2.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for third set of 500 urls

In [None]:
df3 = df.loc[1000:1499].copy(deep=True)
df3.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[1000:1499]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df3['company_name'] = company
df3['job_title'] = position
df3['location'] = address
df3['employment_type'] = employment
df3['seniority'] = snr
df3['job_category'] = job_cat
df3['salary_range'] = sal_r
df3['salary_freq'] = sal_f
df3['roles & responsibilities'] = r_and_r
df3['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df3.shape)
df3.head(10)

In [None]:
df3.to_csv('project4_df3.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for fourth set of 500 urls

In [None]:
df4 = df.loc[1500:1999].copy(deep=True)
df4.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[1500:1999]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df4['company_name'] = company
df4['job_title'] = position
df4['location'] = address
df4['employment_type'] = employment
df4['seniority'] = snr
df4['job_category'] = job_cat
df4['salary_range'] = sal_r
df4['salary_freq'] = sal_f
df4['roles & responsibilities'] = r_and_r
df4['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df4.shape)
df4.head(10)

In [None]:
df4.to_csv('project4_df4.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for fifth set of 500 urls

In [None]:
df5 = df.loc[2000:2499].copy(deep=True)
df5.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[2000:2499]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df5['company_name'] = company
df5['job_title'] = position
df5['location'] = address
df5['employment_type'] = employment
df5['seniority'] = snr
df5['job_category'] = job_cat
df5['salary_range'] = sal_r
df5['salary_freq'] = sal_f
df5['roles & responsibilities'] = r_and_r
df5['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df5.shape)
df5.head(10)

In [None]:
df5.to_csv('project4_df5.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for sixth set of 500 urls

In [None]:
df6 = df.loc[2500:2999].copy(deep=True)
df6.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[2500:2999]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df6['company_name'] = company
df6['job_title'] = position
df6['location'] = address
df6['employment_type'] = employment
df6['seniority'] = snr
df6['job_category'] = job_cat
df6['salary_range'] = sal_r
df6['salary_freq'] = sal_f
df6['roles & responsibilities'] = r_and_r
df6['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df6.shape)
df6.head(10)

In [None]:
df6.to_csv('project4_df6.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for seventh set of 500 urls

In [None]:
df7 = df.loc[3000:3499].copy(deep=True)
df7.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[3000:3499]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df7['company_name'] = company
df7['job_title'] = position
df7['location'] = address
df7['employment_type'] = employment
df7['seniority'] = snr
df7['job_category'] = job_cat
df7['salary_range'] = sal_r
df7['salary_freq'] = sal_f
df7['roles & responsibilities'] = r_and_r
df7['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df7.shape)
df7.head(10)

In [None]:
df7.to_csv('project4_df7.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for eighth set of 500 urls

In [None]:
df8 = df.loc[3500:3999].copy(deep=True)
df8.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[3500:3999]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df8['company_name'] = company
df8['job_title'] = position
df8['location'] = address
df8['employment_type'] = employment
df8['seniority'] = snr
df8['job_category'] = job_cat
df8['salary_range'] = sal_r
df8['salary_freq'] = sal_f
df8['roles & responsibilities'] = r_and_r
df8['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df8.shape)
df8.head(10)

In [None]:
df8.to_csv('project4_df8.csv',index=False)

<div class='alert alert-danger'>

#### Scrape data for remaining urls

In [None]:
df9 = df.loc[4000:].copy(deep=True)
df9.shape

In [None]:
company = []
position = []
address = []
employment = []
snr = []
job_cat = []
sal_r = []
sal_f = []
r_and_r = []
req = []

i = 0

driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")

for url in df['url'].loc[4000:]:
    driver.get(url)
    
    sleep(3)
    
    response = requests.get(url)
    x = response.status_code
    if x != 200:
        print('Status code:', response.status_code)
    
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    ##### get company name
    try:
        name = soup.find('p', {'name':'company'}).text
        company.append(name)
    except:
        company.append('None')
    ##### get job title
    try:
        job = soup.find('h1', {'id':'job_title'}).text
        position.append(job)
    except:
        position.append('None')
    ##### get company location
    try:
        location = soup.find('a', {'href':'#location_map'}).text
        address.append(location)
    except:
        address.append('None')
    ##### get employment type
    try:
        emp_type = soup.find('p', {'id':'employment_type'}).text
        employment.append(emp_type)
    except:
        employment.append('None')
    ##### get job seniority
    try:
        seniority = soup.find('p', {'id':'seniority'}).text
        snr.append(seniority)
    except:
        snr.append('None')
    ##### get job category
    try:
        job_categ = soup.find('p', {'id':'job-categories'}).text
        job_cat.append(job_categ)
    except:
        job_cat.append('None')
    ##### get salary range
    try:
        salary_range = soup.find('span', {'class':'salary_range dib f2-5 fw6 black-80'}).text
        sal_r.append(salary_range)
    except:
        sal_r.append('None')
    ##### get salary type
    try:
        salary_freq = soup.find('span', {'class':'salary_type dib f5 fw4 black-60 pr1 i pb'}).text
        sal_f.append(salary_freq)
    except:
        sal_f.append('None')
    ##### get job description
    try:
        roles = soup.find('div', {'id':'description-content'}).text
        r_and_r.append(roles)
    except:
        r_and_r.append('None')
    ##### get job requirements
    try:
        requirements = soup.find('div', {'id':'requirements-content'}).text
        req.append(requirements)
    except:
        req.append('None')
    
    i += 1
    
    if i%50 == 0:
        print(i)

df9['company_name'] = company
df9['job_title'] = position
df9['location'] = address
df9['employment_type'] = employment
df9['seniority'] = snr
df9['job_category'] = job_cat
df9['salary_range'] = sal_r
df9['salary_freq'] = sal_f
df9['roles & responsibilities'] = r_and_r
df9['requirements'] = req
print('done')

In [None]:
driver.quit()  # end the browser session
print(df9.shape)
df9.head(10)

In [None]:
df9.to_csv('project4_df9.csv',index=False)

<div class='alert alert-danger'>

### Combining dataframes

```Python
df1 = pd.read_csv('project4_df1.csv')
print(df1.shape)

df2 = pd.read_csv('project4_df2.csv')
print(df2.shape)

df3 = pd.read_csv('project4_df3.csv')
print(df3.shape)

df4 = pd.read_csv('project4_df4.csv')
print(df4.shape)

df5 = pd.read_csv('project4_df5.csv')
print(df5.shape)

df6 = pd.read_csv('project4_df6.csv')
print(df6.shape)

df7 = pd.read_csv('project4_df7.csv')
print(df7.shape)

df8 = pd.read_csv('project4_df8.csv')
print(df8.shape)

df9 = pd.read_csv('project4_df9.csv')
print(df9.shape)

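frames = [df1, df2, df3, df4, df5, df6, df7, df8, df9]
# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating each file's own row numbers
df = pd.concat(frames, ignore_index=True)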
```

```Python
print(df.shape)
df.to_csv('project4_compiled.csv', index=False)
```
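The nine numbered files can also be read in a loop instead of one statement per file. The sketch below writes two tiny stand-in CSVs to a temporary folder so that it runs on its own; the folder, the sample URLs `u1` to `u3`, and the variable names are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# write two tiny stand-in files so the sketch runs on its own
tmp_dir = tempfile.mkdtemp()
for n, urls in enumerate([["u1", "u2"], ["u3"]], start=1):
    pd.DataFrame({"url": urls}).to_csv(
        os.path.join(tmp_dir, "project4_df{}.csv".format(n)), index=False)

# read the numbered files in order and stack them into one frame
frames = []
for n in range(1, 3):  # range(1, 10) for the nine files in this notebook
    part = pd.read_csv(os.path.join(tmp_dir, "project4_df{}.csv".format(n)))
    print(n, part.shape)
    frames.append(part)

combined = pd.concat(frames, ignore_index=True)
print(combined.shape)
```

Printing each shape inside the loop keeps the per-file check from the cell above while removing the repeated statements.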