---
format:
    html:
        embed-resources: true
---

# Crawling 

## Overview 

In this portion of the homework, you will crawl Google Jobs to collect job descriptions for later processing.

We will use the SerpApi API (`serpapi`) to crawl Google Jobs. SerpApi is a paid service, but its free tier should be more than enough for this homework. The API lets you search Google programmatically, which has a wealth of practical applications.

You will need an API key from SerpApi. Signing up is free, and you shouldn't enter any payment information.

https://serpapi.com/manage-api-key

This will get you 100 free searches per month. Each search returns up to 10 job descriptions, for a total of around 1000 possible job descriptions per month.

This portion of the homework relies on limited API search resources, so prototype carefully. Once you are sure your code is working 100%, run it one last time.

Consider reserving about 10 searches for prototyping, with the remaining 90 searches for your final "production" run.
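
The budget above can be sanity-checked with a little arithmetic before any request is sent. The sketch below assumes 30 job titles with 3 searches each, which is the plan used later in this homework; all variable names are illustrative and not part of the SerpApi library.

```python
# Rough search-budget check (all names here are illustrative assumptions)
FREE_SEARCHES_PER_MONTH = 100   # free-tier quota stated above
PROTOTYPE_SEARCHES = 10         # searches reserved for prototyping
n_titles = 30                   # assumed number of job titles
searches_per_title = 3          # assumed number of paginated searches per title

production_searches = n_titles * searches_per_title
remaining = FREE_SEARCHES_PER_MONTH - PROTOTYPE_SEARCHES - production_searches

print(f"production searches: {production_searches}")
print(f"searches left over:  {remaining}")

# Fail loudly before any API call if the plan does not fit the quota
if remaining < 0:
    raise ValueError("Planned searches exceed the free-tier budget")
```

With these numbers the production run uses exactly the 90 searches that remain, so there is no slack for reruns.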

Make sure to save the outputs and back them up after this final run, in case you delete them by mistake.
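
One possible way to back up the saved outputs is to copy every JSON file into a separate folder. This is only a minimal sketch: the function name `backup_json_files` and the folder names are hypothetical, and it assumes the results were saved as `.json` files in a single directory.

```python
import shutil
from pathlib import Path

def backup_json_files(src_dir, backup_dir):
    """Copy every .json file in src_dir into backup_dir; return the copied paths."""
    src = Path(src_dir)
    dst = Path(backup_dir)
    dst.mkdir(parents=True, exist_ok=True)   # create the backup folder if needed
    copied = []
    for path in sorted(src.glob("*.json")):  # non-recursive: top-level files only
        target = dst / path.name
        shutil.copy2(path, target)           # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

For example, `backup_json_files(".", "backup")` would copy the JSON files in the current directory into a `backup` subfolder.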

We will use the following Python wrapper for the API. It can be installed with

`pip install google-search-results`

For instructions on the API, see the following reference resources:

* [https://serpapi.com/google-jobs-api](https://serpapi.com/google-jobs-api)
* [https://serpapi.com/blog/scrape-google-jobs-organic-results-with-python/](https://serpapi.com/blog/scrape-google-jobs-organic-results-with-python/)
* [https://serpapi.com/integrations/python](https://serpapi.com/integrations/python)

## Starter code 

Here is some starter code:

`Note: uule parameter`

The uule parameter is an encoded location parameter used in Google search queries. It stands for "Unique User Location Encoding" and is used to specify the geographic location from which the search is being conducted. This can influence the search results to be more relevant to the specified location.

This can be set to `'w+CAIQICINVW5pdGVkIFN0YXRlcw'`, which is an encoded string representing a specific location, i.e. the United States. This encoding helps simulate searches as if they are being conducted from that location, which can be useful for testing or gathering location-specific data.
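
To make the encoding less opaque, here is a small sketch of how such a string could be built. It relies on a commonly described, community-documented layout (a fixed prefix, one character encoding the length of the location name, then the base64 of the name without padding). That layout is an assumption, not something confirmed by this homework or guaranteed by Google, and `encode_uule` is a hypothetical helper name.

```python
import base64
import string

def encode_uule(location_name):
    """Build a uule string from a location name (assumed, unofficial format)."""
    # The character at index len(name) in this alphabet encodes the name length
    alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits
    name_bytes = location_name.encode("utf-8")
    if len(name_bytes) >= len(alphabet):
        raise ValueError("location name too long for this simple sketch")
    length_char = alphabet[len(name_bytes)]
    # Base64-encode the name and drop the trailing '=' padding
    encoded = base64.b64encode(name_bytes).decode("ascii").rstrip("=")
    return "w+CAIQICI" + length_char + encoded

print(encode_uule("United States"))
```

For `"United States"` this reproduces the value `w+CAIQICINVW5pdGVkIFN0YXRlcw` given above.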

In [6]:
from serpapi import GoogleSearch
import json

Save your API key in a centralized location, e.g. `~/.api-keys.json`

Read it in with the `json` module.

In [7]:
import json
with open('/Users/zp/Desktop/api-key.json') as f:
    keys = json.load(f)
API_KEY = keys['serpapi']
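
If the key file lives in your home directory, as in the `~/.api-keys.json` suggestion above, the `~` has to be expanded before opening the file. The following is a minimal sketch; `load_api_key` is a hypothetical helper, and it assumes the file is a JSON object with a `serpapi` entry like the one read above.

```python
import json
from pathlib import Path

def load_api_key(path="~/.api-keys.json", service="serpapi"):
    """Return the key stored under `service` in a JSON file of API keys."""
    key_file = Path(path).expanduser()   # expand "~" to the home directory
    with key_file.open() as f:
        keys = json.load(f)
    if service not in keys:
        raise KeyError(f"No '{service}' entry found in {key_file}")
    return keys[service]
```

Keeping the key in a file outside the project folder also makes it harder to commit it to version control by accident.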

Be careful: don't run this too many times while debugging and prototyping, or you will use up all your free searches.

In [12]:
search_query = 'data science'
params = {
	'api_key':API_KEY,                          # https://serpapi.com/manage-api-key
	'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',		# encoded location (USA)
	'q': search_query,              			# search query
    'hl': 'en',                         		# language of the search
    'gl': 'us',                         		# country of the search
	'engine': 'google_jobs',					# SerpApi search engine
}

Let's do one search and explore the output.

In [13]:
search = GoogleSearch(params)   			# where data extraction happens on the SerpApi backend
result_dict = search.get_dict() 			# JSON -> Python dict

if 'error' in result_dict:
    print("ERROR FOUND IN SEARCH")
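
Printing a fixed message hides the reason a search failed. A slightly more informative variant is sketched below; it assumes that the value stored under the `error` key is a human-readable message, and `check_search_result` is a hypothetical helper name rather than part of the `serpapi` package.

```python
def check_search_result(result_dict):
    """Raise if the response reports an error; otherwise return the job list."""
    if "error" in result_dict:
        # Include the reported message so the cause is visible in the traceback
        raise RuntimeError(f"SerpApi search failed: {result_dict['error']}")
    # A missing 'jobs_results' key is treated as "no jobs found"
    return result_dict.get("jobs_results", [])
```

Raising an exception stops a long loop immediately, which helps avoid spending further searches after the first failure.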

In [14]:
for result in result_dict['jobs_results']:
    print(result)
    # google_jobs_results.append(result)

{'title': 'AVP, Data Science - Underwriting and Operations Analytics | Hartford, WI, USA', 'company_name': 'The Travelers Indemnity Company', 'location': 'Hartford, WI', 'via': 'EFinancialCareers', 'share_link': 'https://www.google.com/search?ibp=htl;jobs&q=data+science&htidocid=3idjrGTlPG2bZbk3AAAAAA%3D%3D&hl=en-US&shndl=-1&source=sh/x/job/li/m1/1#fpstate=tldetail&htivrt=jobs&htiq=data+science&htidocid=3idjrGTlPG2bZbk3AAAAAA%3D%3D', 'thumbnail': 'https://serpapi.com/searches/672859379fd309ef4b4fbdde/images/38d82cd27cffaaa7d885388f748141ba8257bfd16e379fa3e64fc0cf0bc18251.jpeg', 'extensions': ['21 hours ago', 'Full-time', 'Paid time off', 'Health insurance'], 'detected_extensions': {'posted_at': '21 hours ago', 'schedule_type': 'Full-time', 'paid_time_off': True, 'health_insurance': True}, 'description': "AVP, Data Science - Underwriting and Operations Analytics\n\nWho Are We?\n\nTaking care of our customers, our communities and each other. That's the Travelers Promise. By honoring this

In web crawling, pagination involves retrieving data across multiple pages of a website. It helps manage large datasets by fetching a limited number of results per request, enabling efficient data extraction without overwhelming the server or exceeding resource limits.

You can use pagination to get more results for a given search, but each new request needs to start where the last search left off.

You can do this by adding the `next_page_token` to the `params` dictionary.

You get the most recent `next_page_token` from `result_dict["serpapi_pagination"]`.

If you don't use pagination and search 10 times, you will just get the same first 10 results over and over again.
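
The token-passing logic can be tried out without spending any searches. The sketch below uses a fake, in-memory stand-in for the API; `collect_pages`, `fake_fetch`, and `FAKE_PAGES` are hypothetical names, and the fake responses only mimic the `jobs_results` and `serpapi_pagination` keys used in this homework.

```python
def collect_pages(fetch_page, max_pages=3):
    """Follow next_page_token links for up to max_pages requests."""
    all_jobs = []
    token = None                      # the first request has no token
    for _ in range(max_pages):
        result = fetch_page(token)
        all_jobs.extend(result.get("jobs_results", []))
        # The token for the next page comes from the current response
        token = result.get("serpapi_pagination", {}).get("next_page_token")
        if not token:                 # no token means there are no more pages
            break
    return all_jobs

# Fake two-page "API" keyed by the token that requests each page
FAKE_PAGES = {
    None: {"jobs_results": [{"title": "job 1"}],
           "serpapi_pagination": {"next_page_token": "page-2"}},
    "page-2": {"jobs_results": [{"title": "job 2"}]},
}

def fake_fetch(token):
    return FAKE_PAGES[token]

jobs = collect_pages(fake_fetch)
print([job["title"] for job in jobs])
```

In a real run, `fetch_page` would wrap the actual search call and add the token to `params` when it is not `None`.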

In [15]:
print(result_dict["serpapi_pagination"],"\n")
print(result_dict["serpapi_pagination"]["next_page_token"])

{'next_page_token': 'eyJmYyI6IkVvd0ZDc3dFUVVwSE9VcHJUMDFMUnpSMlowazJVRVEyUTJRdFVrZFRiV2s0TmxoQ1NYUkpkSFZETjFwVk4waDVVMk5HYTNFMFptNTBUUzFFWkhsSFkwWTJjMEphZFhoSVRUbHRaR05PT0cxbFFtcFNOVnBIVDJOMk9FcENOVk5NVTNaUGEydFRiVU5CWVRZNE1EUktNelJGZFhKd05VaG1kRGRvUnpaaVgybFdWakUwVUdKQ1NtVkVTSE5uYjJoWGRWRmFTa2gxYkZKbk9GUjRiM2hXUzJGV1MwUndXblZCVTNCcldVUlJaV0Z5T1hsVGJ6VjRia2x6YjFSYVNUTmxkelV5YjFGNmNWaFpOR3REU0dsalpFWnRWRTlUZUhSd1NtYzFkRkZoUm1sUWJuTkhibGxxY1RONVkyMUpSRjlsTFRCMlpHOUVTMVo1U1c5eU4xRnpaRGszTmpKVlZtZFlSRzF4WkRCSGQzbzVTVlF3UjBORGJuWkphRTVmWTNkTlltSTNkMmsxWVVjNVQwdGtlbkF5VERsV01FUTVRMUJNTlUxeVowMTBXVmMyY1hjd1IwbEpTamRHVjNwdFVEVlJlRXhzUm0xWVFVaEdWbEpTUjNvNWJWbDJZa2xoYzB0alkzWkVMVkZWV0VoQmNXUjBkSE5UY1VsU1UyNXhNRWgzUmxkVlpVdEJiR1kzWVY5R2FFTlBWVnB2V2pWTlNXaGtkR3RWVHpNMU1EUXpUa2R1VkY5UU1qZFFUbkpQYW5weVdrMVNXbEJNWlRoUE4yVlJXUzFSUVdaeGN6TlhiVU5CVHkxbFR6VndTMDlFVUc5QmR6Sm9hSEJDWVRsVU0wNTZkR1UwVjJ0dWQxaFNjSFZTUlRScFkyRkpSVXhETVVnMk5scHdZMm8yVTJZeFlWYzFiMlpoVGkxS1NWaHBOMU4yUmxwMlJYQkdjekZ0ZEVGTVUxSkRTMmx

Let's do one more search, but this time with pagination, starting where the last search left off.

In [16]:
search_query = 'data science'
params = {
	'api_key':API_KEY,                          # https://serpapi.com/manage-api-key
	'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',		# encoded location (USA)
	'q': search_query,              			# search query
    'hl': 'en',                         		# language of the search
    'gl': 'us',                         		# country of the search
    "num": 10,									# number of results per page
	'engine': 'google_jobs',					# SerpApi search engine
    'next_page_token': result_dict["serpapi_pagination"]["next_page_token"]
}

In [17]:
search = GoogleSearch(params)   			# where data extraction happens on the SerpApi backend
result_dict = search.get_dict() 			# JSON -> Python dict

if 'error' in result_dict:
    print("ERROR FOUND IN SEARCH")

In [18]:
for result in result_dict['jobs_results']:
    print(result)
    # google_jobs_results.append(result)

{'title': 'Product Manager, Data Science & Analytics (Remote)', 'company_name': 'The Home Depot', 'location': 'Anywhere', 'via': 'The Home Depot Careers', 'share_link': 'https://www.google.com/search?ibp=htl;jobs&q=data+science&htidocid=94jynhwGXj80buIGAAAAAA%3D%3D&hl=en-US&shndl=37&shmd=H4sIAAAAAAAA_xXLsQrCMBAAUFz7CYJwk6jUVgQXnQqFqiCIuss1HkklvQvJCfWL_E11edvLPqPscI7yeBmFEzJaijnUqAhX0xEbgilUjP6tnUkwu1AvSnNYwlFaSITROBCGRsR6Gu-cakjbskzJFzYp_lZhpC-FqZWhfEqb_tyTw0jBo9J9vVkNRWC7mNwcwV56gpqCKHQMlXpkxRya6gtct7sRqgAAAA&shmds=v1_AXX-3kHmDDMrjc_GzaFuGjtkeiTmB_0hW03ImTUlmWVU9ZFDNg&shem=jbt1,jbto1&source=sh/x/job/li/m1/1#fpstate=tldetail&htivrt=jobs&htiq=data+science&htidocid=94jynhwGXj80buIGAAAAAA%3D%3D', 'thumbnail': 'https://serpapi.com/searches/6728597ad11a893f71a2704e/images/2d7133faaf25aab208bf30de0f369b322fe2ec214b178dec8a29c2f5e080d6fb.gif', 'extensions': ['Work from home', 'Full-time', 'No degree mentioned'], 'detected_extensions': {'work_from_home': True, 'schedule_type': 'Full-time', 'qu

## Utility function

Create a utility function to search Google Jobs and save the results to a file.

Here is one sketch of what the function might look like:

- Imports the current date and time using `datetime`.
- Defines `search_google_jobs` to perform a Google Jobs search with a default or custom query.
- Accepts parameters for the search query, pagination token, and verbosity.
- Sets search parameters like API key, location, language, and search engine.
- Appends the pagination token if provided.
- Creates a timestamped output filename based on the query and time.
- Does a search and data extraction.
- Optionally prints the data if `verbose` is `True` and saves results to a JSON file.
- Returns the `next_page_token` for pagination or handles errors.

In [12]:
# INSERT CODE HERE
import datetime
def search_google_jobs(search_query, next_page_token=None, verbose=False):
    current_time = datetime.datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    filename = f"jobs_results_{search_query.replace(' ', '_')}_{current_time}.json"

    params = {
        'api_key': API_KEY,                         # Replace with your API key
        'uule': 'w+CAIQICINVW5pdGVkIFN0YXRlcw',     # Encoded location (USA)
        'q': search_query,                          # Search query
        'hl': 'en',                                 # Language of the search
        'gl': 'us',                                 # Country of the search
        'num': 10,                                  # Number of results per page
        'engine': 'google_jobs',                    # SerpApi search engine
    }

    if next_page_token:
        params['next_page_token'] = next_page_token

    search = GoogleSearch(params)
    result_dict = search.get_dict()

    if verbose:
        print(json.dumps(result_dict, indent=4))  

    if 'error' in result_dict:
        print("ERROR FOUND IN SEARCH")
        return None, []

    # Save the JSON response to a file
    with open(filename, 'w') as file:
        json.dump(result_dict, file, indent=4)

    jobs_results = result_dict.get('jobs_results', [])
    next_page_token = result_dict.get("serpapi_pagination", {}).get("next_page_token")

    return next_page_token, jobs_results
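
Because each search is written to its own JSON file, the saved results can later be reloaded without calling the API again. A minimal sketch follows, assuming the files follow the `jobs_results_*.json` naming used by the function above; `load_saved_jobs` is a hypothetical helper name.

```python
import json
from pathlib import Path

def load_saved_jobs(directory=".", pattern="jobs_results_*.json"):
    """Collect the 'jobs_results' entries from every saved response file."""
    jobs = []
    for path in sorted(Path(directory).glob(pattern)):
        with path.open() as f:
            saved = json.load(f)
        # Files without a 'jobs_results' key contribute nothing
        jobs.extend(saved.get("jobs_results", []))
    return jobs
```

This does not remove duplicates, so the same posting may appear more than once if it was returned by several searches.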


In [25]:
next_page_token, jobs_results = search_google_jobs(search_query="machine learning engineer", verbose=True)

{'title': 'Machine Learning Engineer', 'company_name': 'Apple', 'location': 'Austin, TX', 'via': 'Careers At Apple', 'share_link': 'https://www.google.com/search?ibp=htl;jobs&q=machine+learning+engineer&htidocid=MMbmdDm79xD9nKF9AAAAAA%3D%3D&hl=en-US&shndl=-1&source=sh/x/job/li/m1/1#fpstate=tldetail&htivrt=jobs&htiq=machine+learning+engineer&htidocid=MMbmdDm79xD9nKF9AAAAAA%3D%3D', 'thumbnail': 'https://serpapi.com/searches/6728653ff79a0150f41a2bda/images/1721331cca54d3a6bd2659205cb921ad0ac69ef13e72054cb9b92c9640189527.png', 'extensions': ['18 days ago', 'Full-time'], 'detected_extensions': {'posted_at': '18 days ago', 'schedule_type': 'Full-time'}, 'description': 'Summary\nPosted: Oct 17, 2024\n\nRole Number:200573979\n\nImagine what you could do here! The people here at Apple don’t just create products — they build the kind of wonder that’s revolutionized entire industries. It’s the diversity of those people and their ideas that inspires the innovation that runs through everything we d

In [26]:
next_page_token

'eyJmYyI6IkVxSUZDdUlFUVVwSE9VcHJUWFpxU0Y4dExYYzViRFp6V2xkd05GWkllVlJuWVZkc1ZtSldNRzlFV21KWVQwSTBhbmhLTmtsaFh6ZEplbFZIUVY5ZldXbDNVa0pEWmxOWE1XNTFVV1JOYTBKRVNVbzBSelJSV2s1NU5WaG5VMEpsTlRoMFdXSnBVR3BFV0dKNVNVVnZUMEZ4ZUZwcFFtVjVZV1JTVjJjek5FVkZlVFJLVmxabVNTMWxORTA1YlROT1RtZEVYMlJYZGtsemIycHpjVkY1WXpsQ1UyTkpORU5MTFRGdmFXRnBPSGR6UlZOSlZGTjFkMWQxWnpKWE5qRjBaRU5KUzJaYVdWcGlZM0JRWVU1d1RVY3lWV1ZaWVdKdlNuRjZlakJHYzBGTlpVeGhXbXhLYkdGM2NXSXRUVkZOTVRoaGFrcG9kV054V1dacGJFOVhNVU4zUVVGM01rSkNVM2R2TlZSelZraE5ielJhT1Y4M2RIZEpWM0pYVG5oVE5WaFBORGt5T0RoV1EySm1TM0phYjNoSE5GQm5hbkpzT1RoelUyRlpUemgzYUVZMk5XVnlURlJKUVhaZlYyZERTalJvZUhrNE5tUmxSMGREY3kxWlF6SktlazlKU2toWGJEUlJVblUzVEVKdWJ6bFlhSFUyTVVVdGREVlRiRWxZTm5samRHTlZkMTlvTmpKVVRVSXhkVTlaTUdRMU5ITjRWV3RsU21KWE5HRlZWa0l4TlcxdFIzbEhlSFUxYUZoU2MzaFBNUzB0VEVjeFFtdDJSRVprZVdsa1IwNVRkMHRwTW1KSFlWaG1jMWwwVHpGRU5tRndaWEpsZFRWak5EQndNREJsWjBzNVRqQkVVbGRuVGtGRE1sVnRNSEZJT0ZWdFUxZGxZWHB1VDI5dGMyc3dPR1UwUW5SYWMxVm5kRFJvYWxKd2NTMXFla2htZDNsRVpHdDNOSE5LTTBVeWVURTFXblY2Wkd

In [27]:
search_google_jobs(search_query="machine learning engineer", next_page_token=next_page_token, verbose=True)

{'title': 'Senior Machine Learning Engineer, Applied ML', 'company_name': 'Robinhood', 'location': 'Menlo Park, CA', 'via': 'Greenhouse', 'share_link': 'https://www.google.com/search?ibp=htl;jobs&q=machine+learning+engineer&htidocid=Kyuf4_bkbLrcCB3fAAAAAA%3D%3D&hl=en-US&shndl=-1&source=sh/x/job/li/m1/1#fpstate=tldetail&htivrt=jobs&htiq=machine+learning+engineer&htidocid=Kyuf4_bkbLrcCB3fAAAAAA%3D%3D', 'thumbnail': 'https://serpapi.com/searches/6728654b9819f3637e4d8e85/images/0237a7282e485314c9401a5417a6670966a3a855094015ec78dffcbba41bc0dc.gif', 'extensions': ['20 days ago', 'Full-time', 'Health insurance', 'Paid time off'], 'detected_extensions': {'posted_at': '20 days ago', 'schedule_type': 'Full-time', 'health_insurance': True, 'paid_time_off': True}, 'description': "About the team + role\n\nThe mission of the Applied Machine Learning team is to provide scalable data and model driven decision making solutions to the various business functions at Robinhood. We aim to create a personali

'eyJmYyI6IkV2Y0VDcmNFUVVwSE9VcHJUMk0wYWxob1dVWkdkVUptZDI1dmJYcDBXRWhsVFRaeFYwcERhbm93UTAxb2J6RndkWFJ3VFRkYWFsSkxkbkY0YzBsVWJEQlFURUpuUW5OcGJIWnJja2c0ZFZwdldFUnhWR1UyY0hCUGVqRlpOMDV0UWxaU1dpMXdSVEEyTFVGa1dUSTBjbmhRZDB4b01WVlNZMDlaVVRJME1YZGlTV055ZGxkMFFUQklSbFJUT1dNMlFXZElMWEJuYzFSTGRrZFZRVTFxVkhGT2IxTmFiRmx1YVd4WlNWWmlhalF5UkRob1RFOVJTbHBNVDAxRllqbDZlRXBVUVRacWJIcEVZWFV0VVhJMFoyeFhVR2xwUlU5SGMySXlUMnhwY0hKeE5XUk1NamRwZWpSUllVWndObE5SVlhJeWR6Sm1jbVpaT0Vwa05rZzVWakZIUkZSRVNFaEJSbWhSVVd0S2NUYzFha2hYWjIxSloyOU1ibkpNVDNOSGJWUnNUVXQ2V1RKdlZsVnRZVE5EU21Vek5GVlNZMGM1UXkweE1VVnpXbkZZUVRCTExXSmhRVTVZZEY5c2FtdFBaMFJGTldKTGRVWk9Va3hJYlVGT1JVaGxjMUEzUWxOcll6aFRSMDA0WkVkSGFVSkZTMEkzTkRSRlZVTlhOMWM0VTBoZlgyVjZlV0V4TlhoSFlYTTNiV2h6U20xR1YzZFFXakl6VEZJMVZFbGZkVVJLZURJMGNUQkhORGd0U1VacWFWazNZM0EzWVZWdVFVTnZTV05tWTFoT2RXdExTMFZaTFRCV1pVNWtTakJKV2tSTVdIbGlNbmhKUkdwNGVtVldORmRHYlhGWWNETkhTMk5RU2xoMk5EWTFORFprZEcxVVREaFhkRGRZYW1kRmFFbHdRVGhHVlRsTk5sZzBURWd5U1hsMVYyVkxOVzlvZW5Fd1UwSTFNM2R2VmpabVFXNVZObU53UVR

In [28]:
for result in result_dict['jobs_results']:
    print(result["title"], result["company_name"])

Product Manager, Data Science & Analytics (Remote) The Home Depot
Data Scientist, Paramount Advertising Paramount
Manager Data Scientist Capital One
Data Science Solution Specialist - Generative AI Deloitte
Senior Data Scientist, ASE iCloud Data Organization [Executive Communications] Apple
(USA) Senior, Data Scientist (Computer Vision Engineer) Walmart Inc.
Public Notice for Direct Hire (STEM) - Data Scientist Public Health
Data Scientist , Global FP&A Technology Amazon.com Services LLC
Data Scientist Intern United States Cisco
Senior Director of Data Science Formation Bio


## Iterate over job titles

These titles reflect a wide range of roles that leverage data science and machine learning skills in various industries and specialties.

- Data Scientist
- Machine Learning Engineer
- Artificial Intelligence Specialist
- Data Analyst
- Business Intelligence Analyst
- Research Scientist (AI/ML)
- Deep Learning Engineer
- NLP Engineer (Natural Language Processing)
- Computer Vision Engineer
- Data Engineer
- Applied Scientist
- Quantitative Analyst (Quant)
- Predictive Modeler
- AI Solutions Architect
- Statistician
- Big Data Engineer
- Data Science Consultant
- Automation Engineer
- Analytics Manager
- Decision Scientist
- Operations Research Analyst
- Robotics Engineer
- Bioinformatics Data Scientist
- Healthcare Data Analyst
- Financial Data Scientist
- Customer Insights Analyst
- Marketing Data Analyst
- Data Strategy Manager
- Cloud AI Engineer
- Computational Scientist
- Fraud Detection Specialist
- Risk Analyst
- Data Architect
- Algorithm Engineer

For each keyword, do three searches using pagination. This will result in around 30 jobs per keyword (assuming there are at least 30 jobs for that keyword). Save the results of each search to a file.

Note: just to be safe, wait one second between requests, e.g. using `time.sleep(1)`.

In [3]:
job_titles = [
    "Data Scientist",
    "Machine Learning Engineer",
    "Artificial Intelligence Specialist",
    "Data Analyst",
    "Business Intelligence Analyst",
    "Research Scientist (AI-ML)",
    "Deep Learning Engineer",
    "NLP Engineer (Natural Language Processing)",
    "Computer Vision Engineer",
    "Data Engineer",
    "Applied Scientist",
    "Quantitative Analyst (Quant)",
    "AI Solutions Architect",
    "Statistician",
    "Big Data Engineer",
    "Data Science Consultant",
    "Automation Engineer",
    "Analytics Manager",
    "Operations Research Analyst",
    "Robotics Engineer",
    "Bioinformatics Data Scientist",
    "Financial Data Scientist",
    "Customer Insights Analyst",
    "Marketing Data Analyst",
    "Data Strategy Manager",
    "Cloud AI Engineer",
    "Computational Scientist",
    "Fraud Detection Specialist",
    "Risk Analyst",
    "Data Architect"
]

# print(len(job_titles)*3)

Now insert code to iterate over the job titles, and perform the searches.

Be very careful, this needs to be 100% correct before running it, otherwise you will burn through your free searches.

I would recommend doing just one iteration of the loop as a trial run. If that looks good, do the next iteration and carefully check the results. If everything still looks good, do the remaining 28 iterations.

Note: sometimes pagination will return fewer than 10 results, so you may end up with slightly fewer than 30 results per keyword, e.g. 25 to 30.

Remember to clean the job titles to remove any characters like spaces, `/`, or `()`.
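
One way to do this cleaning in a single step is with a regular expression. The sketch below is only one possible approach: `clean_title` is a hypothetical helper, and it replaces every run of characters other than letters and digits with a single underscore.

```python
import re

def clean_title(title):
    """Turn a job title into a string that is safe to use in a filename."""
    # Replace each run of non-alphanumeric characters with one underscore
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", title)
    # Drop any underscore left at the start or end, e.g. from a closing ")"
    return cleaned.strip("_")

print(clean_title("Research Scientist (AI/ML)"))
print(clean_title("NLP Engineer (Natural Language Processing)"))
```

This prints `Research_Scientist_AI_ML` and `NLP_Engineer_Natural_Language_Processing`.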

In [18]:
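# INSERT CODE HERE
import time
def job_search(job_titles):
    # Run up to three paginated searches for each job title in the list
    for title in job_titles:
        # Clean the title; note the cleaned string is used both as the
        # search query and as part of the output filename
        clean_title = title.replace(" ", "_").replace("/", "_").replace("(", "").replace(")", "")
        print(f"\n---------------\n{title}")
        next_page_token = None  # start each title from the first page
        for num in range(3):
            print(f"SEARCH- {num}")
            next_page_token, jobs_results = search_google_jobs(clean_title, next_page_token)

            if jobs_results:
                for job in jobs_results:
                    print(f"{job['title']} : {job['company_name']}")
            else:
                print(f"No results found for '{title}' on SEARCH- {num}.")

            # Stop early if there are no more pages (or the search failed)
            if not next_page_token:
                break

            time.sleep(1)  # wait one second between requests


job_search(job_titles)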


---------------
Data Scientist
SEARCH- 0
Data Scientist II : Cencora
Lead Analyst – Data Science : Colorado Rockies
Technical Intern - Data Scientist : BAE Systems
Data Scientist- REMOTE : Prime Therapeutics
Data Scientist : VIZIO, Inc.
Data Scientist Principal (REMOTE) – Life Insurance Company : USAA
Data Scientist - Clearance Required : Logistics Management Institute
Data Scientist II : NSA Storage
Principal Data Scientist : Self Financial
Data Scientist, R&D : Eight Sleep
SEARCH- 1
Lead Data Scientist : Vendavo
Full Stack Data Scientist : Cardinal Health
Staff LLM Data Scientist : TIFIN
Principal Data Scientist - Service Enablement Analytics : Atlassian
Wildfire Data Scientist : Xcel Energy
Cleared Data Scientist (All Levels) : Noblis
Data Scientist II : Grubhub Holdings Inc.
Principal Data Scientist : Sovrn
Data Scientist : Partner Opportunities
SEARCH- 2
Data Scientist II (NLP Required) : CareSource
Data Scientist : NSA Storage
Technical Intern - Data Scientist : BAE Systems, Inc