In [1]:
# Install necessary libraries if they are not present
!pip install requests
!pip install beautifulsoup4
!pip install pandas



In [2]:
# Import relevant packages
import requests
from bs4 import BeautifulSoup
import pandas as pd
import datetime
import random

# Data Extraction through Web Scraping.

## Introduction.

Almost 10 years ago, the job of a data scientist was labeled by Harvard Business Review as "the sexiest job of the 21st century" [(Davenport & Patil 2012)](https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century). Since then, there has been a steady increase in the demand for data experts, and it is expected that both job creation and salaries will continue to rise in the coming years. The following articles illustrate this situation:

https://www.smithhanley.com/2022/01/04/data-science-in-2022/
https://www.bbva.com/en/big-data-the-demand-for-expert-talent-continues-to-grow/

The cited studies refer to labor markets in Europe and the United States. Suppose you are in charge of developing a study of the labor market for data scientists in Latin America, for which you need to build a database with job offers published in different countries of the region.

The objective of this task is to use web scraping techniques to extract data on job offers for data scientists published on an open job portal (www.linkedin.com/jobs).


#### 1. Go to the website www.linkedin.com/jobs, click on the `Search Jobs` button, and search for jobs for *data scientist* in your country's capital (or another city of interest). Inspect and analyze the source code of the results page to understand the structure of its HTML code.

#### 2. Extract the list of job postings returned by your search on LinkedIn.

In [3]:
# Define the position and location to scrape
position = 'data scientist'
url_friendly_position = position.replace(" ","%20")
location = 'Monterrey'
url_search = 'https://www.linkedin.com/jobs/search/?keywords=%s&location=%s'%(url_friendly_position, location)
print(url_search)

https://www.linkedin.com/jobs/search/?keywords=data%20scientist&location=Monterrey
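
The string replacement in the cell above only handles spaces in `position`; `location` is inserted unencoded, which would matter for a city such as "Ciudad de México". A minimal sketch of building the same URL with the standard library's `urllib.parse` follows. The helper name `build_search_url` is an illustration and not part of the original notebook.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://www.linkedin.com/jobs/search/"


def build_search_url(position: str, location: str) -> str:
    """Percent-encode both search terms and append them as a query string."""
    # quote_via=quote encodes spaces as %20 (the urlencode default would use '+').
    query = urlencode({"keywords": position, "location": location}, quote_via=quote)
    return f"{BASE_URL}?{query}"


print(build_search_url("data scientist", "Monterrey"))
# https://www.linkedin.com/jobs/search/?keywords=data%20scientist&location=Monterrey
```

For "Monterrey" the result is identical to the URL printed above; the difference only shows up for terms containing spaces, accents, or characters such as `&`.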


In [4]:
# To prevent the website from thinking you are a bot, use one of the following headers when making the request:

# List of headers (each User-Agent appears once, so all are equally likely to be chosen)
headers = [
    {'User-Agent': 'Mozilla/5.0'},
    {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36'},
    {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Mobile Safari/537.36'}
]

# Randomly select one header
head = random.choice(headers)

# Print the selected header
print(f"Using header: {head}")


Using header: {'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Mobile Safari/537.36'}


In [5]:
# Obtain a list of jobs related to the position and location
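response = requests.get(url_search, headers=head, timeout=30)
print(response)
response.raise_for_status()  # stop early on a 4xx/5xx response
soup = BeautifulSoup(response.text, 'html.parser')
joblist = soup.find('ul', class_="jobs-search__results-list")
if joblist is None:
    # The page layout may have changed, or the request may have been blocked
    raise RuntimeError('Results list not found in the response')
alljobs = joblist.find_all('li')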

<Response [200]>


#### 3. Select only the first job posting from the list and extract the following information: job title, company name, location, and job URL.

Note: By location, we mean the city, district, or municipality specified in the posting.

In [6]:
print(alljobs[0])

<li>
<div class="base-card relative w-full hover:no-underline focus:no-underline base-card--link base-search-card base-search-card--link job-search-card" data-column="1" data-entity-urn="urn:li:jobPosting:3979385708" data-impression-id="jobs-search-mobile-0" data-reference-id="q/tu8eGHf9i/eoHCiTScqw==" data-row="1" data-tracking-id="+Ik0gwXmDNJmNEY1XRrsvw==">
<a class="base-card__full-link absolute top-0 right-0 bottom-0 left-0 p-0 z-[2]" data-tracking-client-ingraph="" data-tracking-control-name="public_jobs_jserp-result_search-card" data-tracking-will-navigate="" href="https://mx.linkedin.com/jobs/view/data-science-analyst-at-steelcase-3979385708?position=1&amp;pageNum=0&amp;refId=q%2Ftu8eGHf9i%2FeoHCiTScqw%3D%3D&amp;trackingId=%2BIk0gwXmDNJmNEY1XRrsvw%3D%3D&amp;trk=public_jobs_jserp-result_search-card">
<span class="sr-only">
              
        
        Data Science Analyst
      
      
          </span>
</a>
<div class="search-entity-media">
<img alt="" class="artdeco-entity-i

In [7]:
info = alljobs[0].find('div', class_="base-search-card__info")
title = info.find('h3', class_="base-search-card__title").text.strip()
company = info.find('h4', class_="base-search-card__subtitle").text.strip()
metadata = alljobs[0].find('div', class_="base-search-card__metadata")
location_element = metadata.find('span', class_="job-search-card__location")
location_job = location_element.text.strip()

joburl = alljobs[0].find('a', class_="base-card__full-link")['href']

# Information about the first job
print(f'Title: {title}')
print(f'Company: {company}')
print(f'Location: {location_job}')
print(f'Job URL: {joburl}')

Title: Data Science Analyst
Company: Steelcase
Location: San Pedro Garza García, Nuevo León, Mexico
Job URL: https://mx.linkedin.com/jobs/view/data-science-analyst-at-steelcase-3979385708?position=1&pageNum=0&refId=q%2Ftu8eGHf9i%2FeoHCiTScqw%3D%3D&trackingId=%2BIk0gwXmDNJmNEY1XRrsvw%3D%3D&trk=public_jobs_jserp-result_search-card
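
The extracted URL carries tracking parameters (`refId`, `trackingId`, `trk`) that can make the same posting look different from one search to the next. The sketch below, using only the standard library, strips the query string and pulls out the numeric suffix of the path. Treating that number as the posting identifier is an assumption, based on the `data-entity-urn="urn:li:jobPosting:3979385708"` attribute in the card printed earlier. Both helper names are hypothetical.

```python
import re
from typing import Optional
from urllib.parse import urlsplit, urlunsplit


def clean_job_url(url: str) -> str:
    """Keep scheme, host and path; drop the query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def job_id_from_url(url: str) -> Optional[str]:
    """Return the run of digits that ends the URL path, or None if there is none."""
    match = re.search(r"-(\d+)$", urlsplit(url).path)
    return match.group(1) if match else None


raw_url = ("https://mx.linkedin.com/jobs/view/data-science-analyst-at-steelcase-3979385708"
           "?position=1&pageNum=0&trk=public_jobs_jserp-result_search-card")
print(clean_job_url(raw_url))
# https://mx.linkedin.com/jobs/view/data-science-analyst-at-steelcase-3979385708
print(job_id_from_url(raw_url))
# 3979385708
```

A cleaned URL or extracted identifier could then serve as a key for removing duplicate postings when results from several searches are combined.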


In [8]:
# Create a list of dictionaries with job information

jobs = []

for job in alljobs:
    info = job.find('div', class_="base-search-card__info")
    title = info.find('h3', class_="base-search-card__title").text.strip()
    company = info.find('h4', class_="base-search-card__subtitle").text.strip()
    
    metadata = job.find('div', class_="base-search-card__metadata")
    location_element = metadata.find('span', class_="job-search-card__location")
    location_job = location_element.text.strip()
    
    joburl = job.find('a', class_="base-card__full-link")['href']
    
    job_info = {
        'Location': location_job,
        'Title': title,
        'Company': company,
        'Url': joburl
    }

    jobs.append(job_info)

# Select the first job from the list and extract the relevant information
first_job = jobs[0]
location_job = first_job['Location']
title_job = first_job['Title']
company_job = first_job['Company']
joburl_job = first_job['Url']

# Print the information of the first job
print(f'Title: {title_job}')
print(f'Company: {company_job}')
print(f'Location: {location_job}')
print(f'Job URL: {joburl_job}')

Title: Data Science Analyst
Company: Steelcase
Location: San Pedro Garza García, Nuevo León, Mexico
Job URL: https://mx.linkedin.com/jobs/view/data-science-analyst-at-steelcase-3979385708?position=1&pageNum=0&refId=q%2Ftu8eGHf9i%2FeoHCiTScqw%3D%3D&trackingId=%2BIk0gwXmDNJmNEY1XRrsvw%3D%3D&trk=public_jobs_jserp-result_search-card


#### 4. Based on the previous points, write a routine to extract the information for location, job title, company name, and job URL for all the job postings returned by your LinkedIn search, and store the data in a pandas dataframe.


In [9]:
df_jobs = pd.DataFrame(jobs, columns=['Location', 'Title', 'Company', 'Url'])
print(df_jobs.head(1))

                                     Location                 Title  \
0  San Pedro Garza García, Nuevo León, Mexico  Data Science Analyst   

     Company                                                Url  
0  Steelcase  https://mx.linkedin.com/jobs/view/data-science...  


#### 5. Export your dataframe to a .csv file.

In [10]:
# Export the DataFrame to a CSV file
date = datetime.datetime.now().strftime('%Y-%m-%d')
position_slug = position.replace(" ", "_")  # leave `position` itself unchanged
nombre_archivo = f'LinkedIn_{position_slug}_{location}_{date}.csv'
df_jobs.to_csv(nombre_archivo, index=False)
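
The file name in the cell above is built directly from the search terms, so a location with spaces or accents would end up in the name verbatim. A small sketch of a more conservative naming helper follows; `slugify` and `csv_filename` are hypothetical names, and the date passed in the example is an arbitrary value.

```python
import re
import unicodedata
from datetime import date


def slugify(text: str) -> str:
    """Strip accents and replace each run of non-alphanumeric characters with '_'."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^A-Za-z0-9]+", "_", ascii_text).strip("_")


def csv_filename(position: str, location: str, day: date) -> str:
    """Build a file name of the form LinkedIn_<position>_<location>_<YYYY-MM-DD>.csv."""
    return f"LinkedIn_{slugify(position)}_{slugify(location)}_{day.isoformat()}.csv"


print(csv_filename("data scientist", "Ciudad de México", date(2024, 1, 15)))
# LinkedIn_data_scientist_Ciudad_de_Mexico_2024-01-15.csv
```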

#### 6. How many job postings does your dataframe contain, and how many results are there in total from the LinkedIn search? Comment on the differences or matches, and explain what you would need to do to extract all available results from LinkedIn (in words, implementation is not necessary).

In [11]:
ofertas_df = df_jobs['Url'].count()
ofertas_linkedin = int(soup.find('span', {'class': 'results-context-header__job-count'}).text)
print(f'The number of job postings in the DataFrame is {ofertas_df}, while the total number of results on LinkedIn is {ofertas_linkedin}.')

The number of job postings in the DataFrame is 59, while the total number of results on LinkedIn is 659.
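
The `int(...)` conversion works here because the counter text is the plain number `659`. If the counter ever contained thousands separators or a trailing `+` (an assumption about how larger counts might be rendered, not something observed in this run), the conversion would raise `ValueError`. A more tolerant parser, with the hypothetical name `parse_job_count`, could look like this:

```python
import re


def parse_job_count(text: str) -> int:
    """Keep only the digits of a counter string and convert them to an integer."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        raise ValueError(f"no digits found in {text!r}")
    return int(digits)


print(parse_job_count("659"))       # 659
print(parse_job_count(" 1,000+ "))  # 1000
```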


The DataFrame contains 59 postings, while LinkedIn reports 659 results for the search. The discrepancy most likely arises because only the postings on the first page of results were extracted: LinkedIn displays a limited number of postings per page and shows the rest on subsequent pages.

To extract all available results from LinkedIn, you would need a loop that iterates through every results page: extract the URL of the next page from the current page, scrape that page, append its postings to the dataframe, and repeat until there are no more pages or results.
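
The loop described above can be sketched independently of LinkedIn's actual page structure. In the example below, `fetch_page` is a hypothetical callback that would download one page and return its postings together with the URL of the next page (or `None` when there is none); how that URL is located in the real HTML is not shown. A dictionary of fake pages stands in for the network so the sketch runs offline.

```python
from typing import Callable, Dict, List, Optional, Tuple

Job = Dict[str, str]
Page = Tuple[List[Job], Optional[str]]  # (postings on this page, next-page URL or None)


def scrape_all_pages(first_url: str,
                     fetch_page: Callable[[str], Page],
                     max_pages: int = 50) -> List[Job]:
    """Follow next-page links until none is left or max_pages pages were read."""
    jobs: List[Job] = []
    url: Optional[str] = first_url
    pages_read = 0
    while url is not None and pages_read < max_pages:
        page_jobs, url = fetch_page(url)
        jobs.extend(page_jobs)
        pages_read += 1
    return jobs


# Stand-in for real HTTP requests: three fake pages linked together.
FAKE_SITE: Dict[str, Page] = {
    "page-1": ([{"Title": "A"}, {"Title": "B"}], "page-2"),
    "page-2": ([{"Title": "C"}], "page-3"),
    "page-3": ([{"Title": "D"}], None),
}

all_jobs = scrape_all_pages("page-1", FAKE_SITE.__getitem__)
print(len(all_jobs))  # 4
```

The `max_pages` cap is a design choice that keeps a malformed "next" link from producing an endless loop.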

However, this process is complex and requires careful handling to avoid being blocked by LinkedIn for sending too many requests in a short period of time. Note also that scraping LinkedIn is against its terms of service, so this would likely be considered bad practice.
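
One common way to keep the request rate low is to retry failed calls with an exponentially growing pause plus random jitter. The sketch below illustrates the idea in general terms and is not tuned to any real rate limit; `call_with_backoff` is a hypothetical helper, and the `sleep` parameter exists only so the waiting can be replaced in tests.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(action: Callable[[], T],
                      max_attempts: int = 4,
                      base_delay: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run action(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Up to one second of random jitter avoids a fixed request rhythm.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    raise AssertionError("unreachable")
```

With `requests`, the `action` argument could be a function that calls `requests.get(url, headers=head, timeout=30)` and then `raise_for_status()`, so that an error response triggers the next, longer wait.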
