## Scraping to get job info for OS orgs

**TL;DR:**
If you want to avoid paid tools, I'd start by having users post positions through a form, or by importing them from a structured source such as a CSV, Google Sheet, or Excel file.
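As a rough illustration of the structured-import route, the sketch below reads postings from a CSV export using only the standard library. The column names (`org`, `title`, `url`) and the function name `load_postings` are assumptions made up for this example, not an agreed schema.

```python
import csv
import io

REQUIRED_COLUMNS = ("org", "title", "url")  # hypothetical schema for this sketch


def load_postings(csv_file):
    """Read job postings from an open CSV file, skipping incomplete rows."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"CSV is missing columns: {missing}")
    postings = []
    for row in reader:
        cleaned = {c: (row[c] or "").strip() for c in REQUIRED_COLUMNS}
        if all(cleaned.values()):  # drop rows with any empty required field
            postings.append(cleaned)
    return postings


# In-memory example; a real export would be opened with open(path, newline="").
sample = io.StringIO(
    "org,title,url\n"
    "Example Org,Developer,https://example.org/jobs/1\n"
    "Example Org,,https://example.org/jobs/2\n"
)
postings = load_postings(sample)
```

Rows with a missing title or link are dropped rather than guessed at, which keeps the imported list clean at the cost of silently skipping bad rows.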

**Attempted dataflow:**
- Read from gSheet listing OS orgs of interest
- Attempt to find jobs/careers pages on their websites
    - Parse the resulting HTML
- Attempt to find jobs/careers from search engine results for the orgs
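The "find jobs/careers pages on their websites" step could look something like the sketch below, which scans a page's HTML for links whose text or target mentions a careers-like keyword. This is not what `helper` does; the keyword list, the class name `CareerLinkFinder`, and the sample HTML are all assumptions for illustration, and fetching the page is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

KEYWORDS = ("job", "career", "opportunit", "hiring")  # assumed keyword list


class CareerLinkFinder(HTMLParser):
    """Collect <a> links whose href or text mentions a careers-like keyword."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # (link text, absolute URL) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())
            haystack = (text + " " + self._href).lower()
            if any(k in haystack for k in KEYWORDS):
                self.links.append((text, urljoin(self.base_url, self._href)))
            self._href = None


def find_career_links(html, base_url):
    finder = CareerLinkFinder(base_url)
    finder.feed(html)
    return finder.links


sample_html = '<nav><a href="/about">About</a> <a href="/jobs/">Work with us</a></nav>'
career_links = find_career_links(sample_html, "https://example.org")
```

Checking the href as well as the link text catches links labelled something like "Work with us", but keyword matching of this kind will still miss or mislabel plenty of real pages.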

In [1]:
# helper supplies read_gsheet, google and get_zenserp_results
from helper import *
import pandas as pd

In [2]:
# read the list of orgs from the Google Sheet
scope = ['https://spreadsheets.google.com/feeds',
         'https://www.googleapis.com/auth/drive']

orgs = read_gsheet(scope, 'os_orgs')['org']

### First approach: Look at what the careers/jobs pages look like for some of these

In [3]:
# first google() result for "<org> careers", for the first five orgs
job_pages_df = pd.DataFrame([google(x + ' careers')[0] for x in orgs[:5]])

In [4]:
job_pages_df

|   | text | url |
|---|------|-----|
| 0 | Jobs - Hypothesis | https://web.hypothes.is/jobs/ |
| 1 | Careers - PLOS | https://plos.org/careers/ |
| 2 | Opportunities - Creative Commons | https://creativecommons.org/about/team/opportu... |
| 3 | Dat Project · GitHub | https://github.com/datproject |
| 4 | Careers - ContentMine | http://contentmine.org/get-in-touch/careers/ |


### Findings:

There's no easy way to get jobs or job descriptions from these pages:
- The results vary a lot (some are GitHub pages rather than careers pages)
- If you click through, some of them link out to third-party job software or websites
- You have to parse the HTML, which is messy for some pages
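To make the first two findings concrete, here is a minimal sketch that labels a result URL by its hostname, so GitHub pages and hosted job boards can be handled separately from an org's own careers page. The list of job-board domains is an assumption and far from exhaustive, and the labels are made up for this example.

```python
from urllib.parse import urlparse

# Assumed, non-exhaustive list of hosted job-board domains.
THIRD_PARTY_JOB_HOSTS = ("greenhouse.io", "lever.co", "workable.com")


def host_matches(host, domain):
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify_result(url):
    """Label a search hit as 'code_host', 'third_party_jobs', or 'own_site'."""
    host = (urlparse(url).hostname or "").lower()
    if host_matches(host, "github.com"):
        return "code_host"
    if any(host_matches(host, d) for d in THIRD_PARTY_JOB_HOSTS):
        return "third_party_jobs"
    return "own_site"


labels = {u: classify_result(u) for u in (
    "https://web.hypothes.is/jobs/",
    "https://github.com/datproject",
)}
```

Anything that is not recognised falls through to `own_site`, so the label only means "not obviously somewhere else".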

### Second approach: Try with a SERP API and see what we get

I used https://app.zenserp.com/ because it had a free trial.

This is probably the **best option** for this kind of thing: https://serpapi.com/playground?engine=google_jobs&q=Hypothesis. It didn't have a free trial, but the playground is very nice.
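I didn't run anything against SerpApi, but a request using the same `engine` and `q` parameters as the playground link might be built roughly as follows. The `search.json` endpoint and the `jobs_results`, `title`, and `company_name` response keys reflect my understanding of the API and should be checked against the SerpApi documentation; the sample response is made-up data, and no network call is made here.

```python
from urllib.parse import urlencode

SERPAPI_ENDPOINT = "https://serpapi.com/search.json"  # assumed endpoint


def build_jobs_query_url(query, api_key):
    """Build a request URL that asks for the google_jobs engine."""
    params = {"engine": "google_jobs", "q": query, "api_key": api_key}
    return SERPAPI_ENDPOINT + "?" + urlencode(params)


def summarize_jobs(response):
    """Pull (title, company) pairs out of a parsed JSON response.

    The key names are assumptions about the response format.
    """
    return [(job.get("title"), job.get("company_name"))
            for job in response.get("jobs_results", [])]


url = build_jobs_query_url("Hypothesis", "YOUR_API_KEY")

# Made-up response in the assumed shape, just to exercise summarize_jobs.
sample_response = {"jobs_results": [{"title": "Developer", "company_name": "Example Org"}]}
summary = summarize_jobs(sample_response)
```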

In [5]:
res = [get_zenserp_results(search_text) for search_text in job_pages_df['text']]

In [6]:
from urllib.parse import urlparse

# host part of each URL, e.g. 'plos.org'
job_pages_df['baseurl'] = job_pages_df['url'].apply(lambda x: urlparse(x).netloc)

In [7]:
res_d = []
for i in range(len(job_pages_df)):
    baseurl = job_pages_df.iloc[i]['baseurl']
    # keep only the organic results whose URL contains the org's baseurl from above
    res_d.append([val for val in res[i]['organic']
                  if val.get('url') is not None and baseurl in val['url']])

In [8]:
res_flat = [item for sublist in res_d for item in sublist]
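The filter-and-flatten steps can also be done in one pass. The sketch below is a stricter variant, not the code used in this notebook: it compares hostnames instead of doing a substring check, so a URL that merely mentions the org's domain in its query string is not kept. The function names and toy data are made up for the example.

```python
from urllib.parse import urlparse


def on_org_site(url, baseurl):
    """True if url's hostname is baseurl or a subdomain of it."""
    host = (urlparse(url).hostname or "").lower()
    baseurl = baseurl.lower()
    return host == baseurl or host.endswith("." + baseurl)


def filter_and_flatten(results, baseurls):
    """Keep organic hits on each org's own site and return one flat list."""
    return [
        hit
        for result, baseurl in zip(results, baseurls)
        for hit in result.get("organic", [])
        if hit.get("url") and on_org_site(hit["url"], baseurl)
    ]


# Toy data in the same shape as the responses used above.
toy_results = [{"organic": [
    {"title": "Careers - PLOS", "url": "https://plos.org/careers/"},
    {"title": "Jobs - PLOS ONE", "url": "https://journals.plos.org/plosone/browse/jobs"},
    {"title": "Unrelated", "url": "https://example.com/?ref=plos.org"},
    {"title": "No URL"},
]}]
flat = filter_and_flatten(toy_results, ["plos.org"])
```

Subdomains such as `journals.plos.org` are still kept for a base URL of `plos.org`, which matches what the substring check lets through.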

In [9]:
pd.DataFrame(res_flat)

|   | position | title | url | destination | description | isAmp | sitelinks |
|---|----------|-------|-----|-------------|-------------|-------|-----------|
| 0 | 1 | Jobs - Hypothesis | https://web.hypothes.is/jobs/ | web.hypothes.is › jobs | We believe the impact of this pervasive conver... | False | |
| 1 | 1 | Careers - PLOS | https://plos.org/careers/ | plos.org › careers | PLOS people have revolutionized science publis... | False | [{'title': 'PLOS ONE: Careers', 'description':... |
| 2 | 2 | Careers in research - PLOS ONE | https://journals.plos.org/plosone/browse/caree... | journals.plos.org › plosone › browse › careers... | Get an email alert for Careers in research; Ge... | False | |
| 3 | 3 | Employment - PLOS ONE | https://journals.plos.org/plosone/browse/emplo... | journals.plos.org › plosone › browse › employment | Employment. Related content. Labor economics ·... | False | |
| 4 | 4 | Get Involved - PLOS | https://plos.org/get-involved/ | plos.org › get-involved | ... your CV can help demonstrate your current ... | False | |
| 5 | 5 | Jobs - PLOS ONE | https://journals.plos.org/plosone/browse/jobs | journals.plos.org › plosone › browse › jobs | Job postings in the substance use disorder tre... | False | |
| 6 | 1 | Opportunities - Creative Commons | https://creativecommons.org/about/team/opportu... | creativecommons.org › What We Do › Team | If you are interested in applying, please emai... | False | |
| 7 | 2 | New Job Opportunities at Creative Commons - Cr... | https://creativecommons.org/2016/03/08/new-job... | creativecommons.org › Blog › About CC | Mar 8, 2016 - New Job Opportunities at Creativ... | False | |
| 8 | 3 | jobs Archives - Creative Commons | https://creativecommons.org/tag/jobs/ | creativecommons.org › Blog | New job at CC: Director of development. Work f... | False | |
| 9 | 4 | opportunities Archives - Creative Commons | https://creativecommons.org/tag/opportunities/ | creativecommons.org › Blog | Job opportunity: Chief Technology Officer at C... | False | |


### Findings:
This is nicer.
- You could definitely get a clean list of jobs this way: at the very least, a title and a link for each listing.
- If you do use this approach, I'd recommend SerpAPI.com over zenserp, since it lets you filter by engine (`google_jobs`).
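The "title and a link" idea can be sketched as a small reduction over the search hits. This assumes each hit is a dict with `title` and `url` keys, as in the results table above; the function name `to_listing` is made up for the example.

```python
def to_listing(results):
    """Reduce search hits to unique (title, url) pairs, keeping first-seen order."""
    seen = set()
    listing = []
    for hit in results:
        title, url = hit.get("title"), hit.get("url")
        if not title or not url or url in seen:
            continue  # skip incomplete hits and repeated URLs
        seen.add(url)
        listing.append((title, url))
    return listing


hits = [
    {"title": "Jobs - Hypothesis", "url": "https://web.hypothes.is/jobs/"},
    {"title": "Careers - PLOS", "url": "https://plos.org/careers/"},
    {"title": "Careers - PLOS", "url": "https://plos.org/careers/"},  # duplicate
    {"title": "No link"},
]
listing = to_listing(hits)
```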