## resources
1. https://dev.to/hhsm95/using-user-agent-to-scraping-data-lli
2. [How to Web Scrape Indeed with Python](https://www.youtube.com/watch?v=PPcgtx0sI2E) by John Watson Rooney
3. [Comprehensive Python Beautiful Soup Web Scraping Tutorial!](https://www.youtube.com/watch?v=GjKQ6V_ViQE&t=2205s) by Keith Galli
4. [How to scrape JOB posts from INDEED with PYTHON](https://www.youtube.com/watch?v=eN_3d4JrL_w&lc=Ugw9P4LYvEssGrIcNf94AaABAg.9FOng9tpc_Q9FOtU0NVkpR) by Izzy Analytics
    - [corresponding github repo](https://github.com/israel-dryer/Indeed-Job-Scraper/blob/master/indeed-job-scraper.ipynb)

## note to self
1. status_code == 200
    - an HTTP status code meaning "OK": the server has successfully answered our HTTP request.
    - more info: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
2. [f-strings in python](https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/)
3. So many duplicated job posts on indeed.com... soooooooo many
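4. card.find/find_all('span', 'company'): a string passed as the second positional argument is matched against the CSS class by default, so it is equivalent to class_ = 'company'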
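
For note 3, a minimal sketch of one way to drop repeated posts after scraping. It assumes two records are the same post when title, company, and location match after lowercasing and trimming. That rule is a guess, not something Indeed documents. `dedupe_jobs` and the sample records are hypothetical.

```python
def dedupe_jobs(jobs, keys=('title', 'company', 'location')):
    """Return jobs with repeated (title, company, location) combinations removed."""
    seen = set()
    unique = []
    for job in jobs:
        # Normalize case and whitespace so trivially different copies still match
        fingerprint = tuple(str(job.get(k, '')).strip().lower() for k in keys)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(job)  # keep the first occurrence only
    return unique


# Hypothetical records shaped like the dicts built while scraping
sample = [
    {'title': 'Data Scientist', 'company': 'Acme', 'location': 'Austin, TX'},
    {'title': 'data scientist ', 'company': 'ACME', 'location': 'Austin, TX'},
    {'title': 'Data Scientist', 'company': 'Acme', 'location': 'Denver, CO'},
]
print(len(dedupe_jobs(sample)))  # 2
```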

In [87]:
from datetime import datetime #to get the current date
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
import numpy as np

In [98]:
def extract(job_title, location):
    inquiry = 'https://www.indeed.com/jobs?q={}&l={}&filter=0'
    job_title = job_title.replace(' ', '+')
    location = location.replace(' ', '+')
    url = inquiry.format(job_title, location)
    return url
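
Replacing spaces with `+` by hand works for plain words, but characters such as `+`, `&`, or `,` in a job title or location would be sent unescaped. A possible alternative, sketched with the standard library's `urllib.parse.urlencode`; `build_search_url` is a hypothetical name, and the URL pattern is the one used in `extract`.

```python
from urllib.parse import urlencode


def build_search_url(job_title, location):
    """Build the same search URL as extract(), percent-encoding the query values."""
    params = {'q': job_title, 'l': location, 'filter': 0}
    # urlencode turns spaces into '+' and escapes reserved characters
    return 'https://www.indeed.com/jobs?' + urlencode(params)


print(build_search_url('data scientist', 'united states'))
# https://www.indeed.com/jobs?q=data+scientist&l=united+states&filter=0
```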

In [99]:
def transform(card):
    title = card.h2.a.get('title')
    job_link = 'https://www.indeed.com' + card.h2.a.get('href')
    company = card.find('span', class_ = 'company').text.strip()
    try:
        rating = card.find('span', class_ = 'ratingsContent').text.strip()
    except AttributeError:
        rating = ''
    location = card.find('div', class_ = 'recJobLoc').get('data-rc-loc')
    try:
        salary = card.find('span', 'salaryText').text.strip()
    except AttributeError:
        salary = ''
    summary = card.find('div', 'summary').text.strip()
    post_date = card.find('span', 'date').text
    today = datetime.today().strftime('%Y-%m-%d')
        
    job = {'title': title,
           'company': company,
           'rating': rating,
           'location': location,
           'salary': salary,
           'summary': summary,
           'post_date': post_date,
           'record obtained': today,
           'job_url': job_link
        }
    return job
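
`post_date` is stored as the raw label from the page, so it only makes sense next to the `record obtained` date. A rough sketch of converting it to a calendar date, assuming the labels look like "Just posted", "Today", "1 day ago", or "30+ days ago". Those formats are an assumption about the page, and `approximate_post_date` is a hypothetical helper.

```python
import re
from datetime import date, timedelta


def approximate_post_date(post_date, today=None):
    """Turn a relative label such as '3 days ago' into an ISO date string.

    Returns '' when the label is not recognized.
    """
    today = today or date.today()
    text = post_date.strip().lower()
    if text in ('today', 'just posted'):
        return today.isoformat()
    # '30+ days ago' is treated as exactly 30 days, so the result is only the latest possible date
    match = re.match(r'(\d+)\+? days? ago', text)
    if match:
        return (today - timedelta(days=int(match.group(1)))).isoformat()
    return ''


print(approximate_post_date('3 days ago', today=date(2020, 11, 10)))  # 2020-11-07
```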

In [100]:
def get_jobs(job_title, location):
    joblist = []
    url = extract(job_title, location)
    
    while True:
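        # Send a browser-like User-Agent with every request
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36'}
        # Wait a randomly chosen number of seconds between page requests
        delays = [7, 4, 6, 2, 10, 19]
        delay = np.random.choice(delays)
        # headers must be passed by keyword: the second positional argument of requests.get is params
        r = requests.get(url, headers=headers)
        time.sleep(delay)
        soup = bs(r.content, 'html.parser')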
        cards = soup.find_all('div', class_ = 'jobsearch-SerpJobCard')
        for index, card in enumerate(cards):
            job = transform(card)
            joblist.append(job)
            print('moving along', index)
        try:
            url = 'https://www.indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
        except AttributeError:
            break

    data = pd.DataFrame(joblist)
    data.to_csv(job_title + '_jobs_indeed.csv')
    print('JOB FINISHED!')
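
`get_jobs` parses whatever comes back without looking at `status_code` (note 1). A minimal sketch of a check-and-retry wrapper, assuming a non-200 response is worth retrying a few times with a growing pause. `fetch_ok` and its parameters are hypothetical. The request function is passed in so that the helper itself depends only on the standard library.

```python
import random
import time


def fetch_ok(get, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call get(url) until it returns status code 200; return None if it never does."""
    for attempt in range(retries):
        response = get(url)
        if response.status_code == 200:
            return response
        # Wait longer after each failed attempt, plus up to one second of random jitter
        sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    return None
```

Inside the loop this could be called as `r = fetch_ok(lambda u: requests.get(u, headers=headers), url)`, breaking out of the loop when `r` is `None`.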

In [102]:
get_jobs('data scientist', 'united states')

moving along 0
moving along 1
moving along 2
moving along 3
moving along 4
moving along 5
moving along 6
moving along 7
moving along 8
moving along 9
moving along 10
moving along 11
moving along 12
moving along 13
moving along 14
moving along 15
moving along 0
moving along 1
moving along 2
moving along 3
moving along 4
moving along 5
moving along 6
moving along 7
moving along 8
moving along 9
moving along 10
moving along 11
moving along 12
moving along 13
moving along 14
moving along 15
moving along 16
moving along 17
moving along 18
moving along 0
moving along 1
moving along 2
moving along 3
moving along 4
moving along 5
moving along 6
moving along 7
moving along 8
moving along 9
moving along 10
moving along 11
moving along 12
moving along 13
moving along 14
moving along 15
moving along 16
moving along 17
moving along 18
moving along 0
moving along 1
moving along 2
moving along 3
moving along 4
moving along 5
moving along 6
moving along 7
moving along 8
moving along 9
moving along 10
