A lasting problem has been the fragility of large aggregation jobs, which tend to require the coordination of CPU- or memory-intensive work. Traditionally we have only saved the aggregated results, which prevents us from reusing computation for different aggregations or from salvaging a half-done list of computed properties, among many other problems. Here we introduce the skills_ml.job_postings.computed_properties module, which is split into two parts: computers (computing properties on collections of job postings) and aggregators (aggregating them). The computers save data indexed by job posting to daily-partitioned storage on S3. This cache is keyed per job posting id, so if a large collection fails halfway through, the completed computation can be reused.

In addition, I'm trying to get the interface between Airflow and skills-ml right. I want to put as little code in Airflow as possible, in large part because it's very tough to unit test anything in Airflow. Just the same, we do want to be able to configure aggregations in Airflow: if we decide we now want a new tabular dataset that mixes properties together in a different way, or increases the number of top skills that are present, that should be a simple change in Airflow. The result is a bit of a complex dance between the JobPostingPropertyComputers and their aggregators: the computers define compatible aggregate functions that work with their output (enforced by unit test), but each aggregation chooses which of those functions it uses for a particular run, and that choice should be made in Airflow.

To wrap up, there is now an example of a basic computation and aggregation task, runnable without any dependencies, that is basically a mini version of the Data@Work Research Hub.

There was also a moto problem that came with the introduction of boto3, which I was able to fix by converting the geocoder and CBSA finders to the new S3BackedJsonDict class. They had implemented their own version using boto2; switching them over let me remove a bunch of custom code and tests, as well as fix Travis.
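To make the computer/aggregator split concrete, here is a minimal sketch of the two-phase flow. The class and function names are taken from the example script added in this commit; the bucket paths, the aggregation name, and the job_postings iterable are placeholder assumptions, and it presumes an S3 bucket is available (real or moto-mocked, as in the full example below), so treat it as an illustration rather than the definitive API.

from skills_ml.job_postings.computed_properties.computers import TitleCleanPhaseOne, PostingIdPresent
from skills_ml.job_postings.computed_properties.aggregators import aggregate_properties_for_quarter
import numpy

# Placeholder assumptions: an existing S3 bucket and an iterable of job posting dicts
computed_path = 's3://some-bucket/computed_properties'
job_postings = []  # in practice, an iterable of job posting dictionaries

# Phase 1: compute properties per job posting. Each computer caches its output
# keyed by job posting id in daily-partitioned S3 storage, so a run that dies
# partway through a large collection can be resumed without redoing finished work.
title = TitleCleanPhaseOne(path=computed_path)
present = PostingIdPresent(path=computed_path)
for prop in (title, present):
    prop.compute_on_collection(job_postings)

# Phase 2: aggregate previously computed properties. The choice of grouping
# properties, aggregate properties, and aggregate functions is exactly the kind
# of configuration meant to live in Airflow.
aggregate_properties_for_quarter(
    quarter='2016Q1',
    grouping_properties=[title],
    aggregate_properties=[present],
    aggregate_functions={'posting_id_present': [numpy.sum]},
    aggregations_path='s3://some-bucket/aggregated_properties',
    aggregation_name='counts_by_cleaned_title',  # hypothetical name
)

Because the two phases only communicate through the per-posting S3 cache, a later aggregation with different grouping properties or aggregate functions can reuse everything computed in phase 1.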
Showing 15 changed files with 1,168 additions and 203 deletions.
@@ -0,0 +1,106 @@
"""Computing and aggregating job posting properties

To show job posting property computation and aggregation,
we calculate job posting counts by cleaned title, and upload
the resulting CSV to S3.

This is essentially a mini version of the Data@Work Research Hub.

To enable this example to be run with as few dependencies as possible, we use:

- a fake local s3 instance
- a sample of the Virginia Tech open job postings dataset
- only title cleaning and job counting

To make this example a little more interesting, one could incorporate more
classes from the computed_properties.computers module, such as skill extractors or geocoders.
"""
import json
import logging
import urllib.request

from skills_ml.job_postings.computed_properties.computers import \
    TitleCleanPhaseOne, PostingIdPresent
from skills_ml.job_postings.computed_properties.aggregators import \
    aggregate_properties_for_quarter

from moto import mock_s3
import boto3
import s3fs
import unicodecsv as csv
import numpy

logging.basicConfig(level=logging.INFO)

VT_DATASET_URL = 'http://opendata.cs.vt.edu/dataset/ab0abac3-2293-4c9d-8d80-22d450254389/resource/9a810771-d6c9-43a8-93bd-144678cbdd4a/download/openjobs-jobpostings.mar-2016.json'


logging.info('Downloading sample Virginia Tech open jobs file')
response = urllib.request.urlopen(VT_DATASET_URL)
string = response.read().decode('utf-8')
logging.info('Download complete')
lines = string.split('\n')
logging.info('Found %s job posting lines', len(lines))

with mock_s3():
    client = boto3.resource('s3')
    client.create_bucket(Bucket='test-bucket')
    computed_properties_path = 's3://test-bucket/computed_properties'
    job_postings = []

    for line in lines:
        try:
            job_postings.append(json.loads(line))
        except ValueError:
            # Some rows in the dataset are not valid JSON; just skip them
            logging.warning('Could not decode JSON')
            continue

    # Create properties. In this example, we are going to both compute and aggregate,
    # but this is not necessary! Computation and aggregation are entirely decoupled,
    # so it's entirely valid to just compute a bunch of properties and then later
    # figure out how you want to aggregate them.
    # We are only introducing the 'grouping' and 'aggregate' semantics this early in the
    # script so as to avoid defining these properties twice in the same script.

    # Create properties to be grouped on. In this case, we want to group on cleaned job title
    grouping_properties = [
        TitleCleanPhaseOne(path=computed_properties_path),
    ]
    # Create properties to aggregate for each group
    aggregate_properties = [
        PostingIdPresent(path=computed_properties_path),
    ]

    # Regardless of their role in the final dataset, we need to compute
    # all properties from the dataset. Since the computed properties
    # partition their S3 caches by day, for optimum performance one
    # could parallelize each property's computation by a day's worth of postings.
    # But to keep it simple for this example, we just run them in a loop.
    for cp in grouping_properties + aggregate_properties:
        logging.info('Computing property %s for %s job postings', cp, len(job_postings))
        cp.compute_on_collection(job_postings)

    # Now that the time-consuming computation is done, we aggregate,
    # choosing an aggregate function for each aggregate column.
    # Here, the 'posting id present' property just emits the number 1,
    # so numpy.sum gives us a count of job postings.
    # Many other properties, like skill counts, will commonly use
    # an aggregate function like 'most common'.
    # A selection is available in skills_ml.algorithms.aggregators.pandas
    logging.info('Aggregating properties')
    aggregate_path = aggregate_properties_for_quarter(
        quarter='2016Q1',
        grouping_properties=grouping_properties,
        aggregate_properties=aggregate_properties,
        aggregate_functions={'posting_id_present': [numpy.sum]},
        aggregations_path='s3://test-bucket/aggregated_properties',
        aggregation_name='title_state_counts'
    )

    s3 = s3fs.S3FileSystem()
    logging.info('Logging all rows in aggregate file')
    with s3.open(aggregate_path, 'rb') as f:
        reader = csv.reader(f)
        for row in reader:
            logging.info(row)
@@ -1 +1 @@
{"incentiveCompensation": "", "experienceRequirements": "Here are some experience and requirements", "baseSalary": {"maxValue": 0.0, "@type": "MonetaryAmount", "minValue": 0.0}, "description": "We are looking for a person to fill this job", "title": "Bilingual (Italian) Customer Service Rep (Work from Home)", "employmentType": "Full-Time", "industry": "Call Center / SSO / BPO, Consulting, Sales - Marketing", "occupationalCategory": "", "qualifications": "Here are some qualifications", "educationRequirements": "Not Specified", "skills": "Customer Service, Consultant, Entry Level", "validThrough": "2014-02-05T00:00:00", "jobLocation": {"@type": "Place", "address": {"addressLocality": "Salisbury", "addressRegion": "PA", "@type": "PostalAddress"}}, "@context": "http://schema.org", "alternateName": "Customer Service Representative", "datePosted": "2013-03-07", "@type": "JobPosting"} | ||
{"incentiveCompensation": "", "experienceRequirements": "Here are some experience and requirements", "baseSalary": {"maxValue": 0.0, "@type": "MonetaryAmount", "minValue": 0.0}, "description": "We are looking for a person to fill this job", "title": "Bilingual (Italian) Customer Service Rep (Work from Home)", "employmentType": "Full-Time", "industry": "Call Center / SSO / BPO, Consulting, Sales - Marketing", "occupationalCategory": "", "qualifications": "Here are some qualifications", "educationRequirements": "Not Specified", "skills": "Customer Service, Consultant, Entry Level", "validThrough": "2014-02-05T00:00:00", "jobLocation": {"@type": "Place", "address": {"addressLocality": "Salisbury", "addressRegion": "PA", "@type": "PostalAddress"}}, "@context": "http://schema.org", "alternateName": "Customer Service Representative", "datePosted": "2013-03-07", "@type": "JobPosting", "id": "1"} |