# Your Turn: Build A Pipeline

- Combine Your Knowledge of the Website, `requests` and `bs4`
- Automate Your Scraping Process Across Multiple Pages
- Generalize Your Code For Varying Searches
- Target & Save Specific Information You Want

## Your Tasks:

- Scrape the first 100 available search results
- Generalize your code to allow searching for different locations/jobs
- Pick out information about the URL, job title, and job location
- Save the results to a file

In [181]:
import requests
from bs4 import BeautifulSoup

### Part 1: Inspect

- How do the URLs change when you navigate to the next results page?
- How do the URLs change when you use a different location and/or job title search?
- Which HTML elements contain the link, title, and location of each job?

 * Using thhe same Indeed website, we would complete the task. When URLs change there is an additional query parameter `&start=10` This doubles with every page.

This denotes the first page: `https://www.indeed.com/jobs?q=python&l=new+york&start=0`
This follows as the second page: `https://www.indeed.com/jobs?q=python&l=new+york&start=10`

 * URLs change when a different location or key word is changed by modifying the query parameter information in the URL
 


`https://www.indeed.com/jobs?q=python&l=new+york`

- **Base URL**
    - `https://www.indeed.com/jobs`
- **Query Parameters**
    - Start & Separators: `?`, `&`
    - Information: `q=python`, `l=new+york`

### Part 2: Scrape

- Build the code to fetch the first 100 search results. This means you will need to automatically navigate to multiple results pages
- Write functions that allow you to specify the job title, location, and amount of results as arguments

In [182]:
base_url = "https://www.indeed.com/jobs"

In [183]:
#This function opens the page number specified

def get_url(title, loc, page_num):
    
    url = base_url+"?q="+title.lower().replace(" ", "+")+"&l="+loc.lower().replace(" ", "+")
    page_inc = "&start="+ str(page_num * 10)
    url = url + page_inc
    repsonse = requests.get(url)
    return response 

site_url = get_url("python","new york", 3)

In [184]:
site_content = site_url.content

In [185]:
soup = BeautifulSoup(site_content)
soup

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<meta content="text/html;charset=utf-8" http-equiv="content-type"/>
<script src="//d3fw5vlhllyvee.cloudfront.net/s/ee8d2b7/en_US.js" type="text/javascript"></script>
<link href="//d3fw5vlhllyvee.cloudfront.net/s/34ab604/jobsearch_all.css" rel="stylesheet" type="text/css"/>
<link href="https://rss.indeed.com/rss?q=python&amp;l=new+york" rel="alternate" title="Python Jobs, Employment in New York State" type="application/rss+xml"/>
<link href="/m/jobs?q=python&amp;l=new+york" media="only screen and (max-width: 640px)" rel="alternate"/>
<script type="text/javascript">

if (typeof window['closureReadyCallbacks'] == 'undefined') {
window['closureReadyCallbacks'] = [];
}

function call_when_jsall_loaded(cb) {
if (window['closureReady']) {
cb();
} else {
window['closureReadyCallbacks'].push(cb);
}
}
</script>
<meta content="1" name="ppstriptst"/>
<script>
var _scriptDownloadCount = 0;
var _retryDownload = function() {
var script = document.crea

### Part 3: Parse

- Sieve through your HTML soup to pick out only the job title, link, and location
- Format the results in a readable format (e.g. JSON)
- Save the results to a file

In [186]:

def parse_site(soup):
    results = soup.find(id='resultsCol')
    jobs = results.find_all('div', class_='row')
    base_url = "https://www.indeed.com"
    results = []
    
    #Loops through jobs and loks for the a tag in the h2 tag
    for job in jobs:
        job_titles = job.find('h2').find('a').text.strip()
        job_link = job.find('h2').find('a')["href"]
        job_url = base_url + job_link
        job_loc = job.find(class_="location").text
        results.append({"Job Title": job_titles, "Job Link": job_url, "Job Location":job_loc})

        
    return results
#base_url + title_link['href']
parse_site(soup)


[{'Job Title': 'Data Technician (Full- or Part-Time)',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=da727c0cddda240e&fccid=56a26d4c816e53d1&vjs=3',
  'Job Location': 'New York, NY'},
 {'Job Title': 'Python / Java Software Engineer',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=6b49e869fa7440a0&fccid=aaf3b433897ea465&vjs=3',
  'Job Location': 'New York, NY'},
 {'Job Title': 'Penetration Testing Trainee (Remote USA)',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=487b30db63184515&fccid=bf0600f0f252b45b&vjs=3',
  'Job Location': 'Florida, NY'},
 {'Job Title': 'Healthcare Data Scientist',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=6720cf1c03a1c426&fccid=2c23f29fcd5c78da&vjs=3',
  'Job Location': 'New York, NY'},
 {'Job Title': 'Digital Archives Assistant',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=7c541433bfbcb0b8&fccid=ec3edbdb9b5c3b64&vjs=3',
  'Job Location': 'Long Island City, NY'},
 {'Job Title': 'Research Associates',
  'Job Link': 'https://www.indeed.com/rc/clk?jk

In [199]:
# Declare necessary variables
qty_page = 15
num_jobs = 100

# To determine the number of pages to open 
page_num = num_jobs//qty_page

def tot_page(title, loc, page_num):
    jobs_tot = []
    for i in range(page_num):
        site_url = get_url(title, loc, page_num)
        soup = BeautifulSoup(site_url.content)
        jobs_tot += (parse_site(soup))
    return jobs_tot
        
Task = tot_page("python","new york",page_num)
Task

[{'Job Title': 'Data Technician (Full- or Part-Time)',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=da727c0cddda240e&fccid=56a26d4c816e53d1&vjs=3',
  'Job Location': 'New York, NY'},
 {'Job Title': 'Python / Java Software Engineer',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=6b49e869fa7440a0&fccid=aaf3b433897ea465&vjs=3',
  'Job Location': 'New York, NY'},
 {'Job Title': 'Penetration Testing Trainee (Remote USA)',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=487b30db63184515&fccid=bf0600f0f252b45b&vjs=3',
  'Job Location': 'Florida, NY'},
 {'Job Title': 'Healthcare Data Scientist',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=6720cf1c03a1c426&fccid=2c23f29fcd5c78da&vjs=3',
  'Job Location': 'New York, NY'},
 {'Job Title': 'Digital Archives Assistant',
  'Job Link': 'https://www.indeed.com/rc/clk?jk=7c541433bfbcb0b8&fccid=ec3edbdb9b5c3b64&vjs=3',
  'Job Location': 'Long Island City, NY'},
 {'Job Title': 'Research Associates',
  'Job Link': 'https://www.indeed.com/rc/clk?jk

In [203]:
#import necessary modules
import pandas as pd


job_search = pd.DataFrame(Task, columns=["Job Title","Job Link","Job Location"])

job_search.head()

Unnamed: 0,Job Title,Job Link,Job Location
0,Data Technician (Full- or Part-Time),https://www.indeed.com/rc/clk?jk=da727c0cddda2...,"New York, NY"
1,Python / Java Software Engineer,https://www.indeed.com/rc/clk?jk=6b49e869fa744...,"New York, NY"
2,Penetration Testing Trainee (Remote USA),https://www.indeed.com/rc/clk?jk=487b30db63184...,"Florida, NY"
3,Healthcare Data Scientist,https://www.indeed.com/rc/clk?jk=6720cf1c03a1c...,"New York, NY"
4,Digital Archives Assistant,https://www.indeed.com/rc/clk?jk=7c541433bfbcb...,"Long Island City, NY"


In [204]:
job_search.to_csv("job_search.csv")