Design Doc: Geographic filtering

Background

We wish to add geographic filtering across all Open Skills API endpoints, both those pertaining to jobs and to skills. We need to figure out a definition of relevance of a job or skill to any given geography, and how to display this relevance in the API and use it to perform filtering when needed.

What Exists Now

The /jobs/{id} endpoint currently exposes geographic information. It uses a simple definition of job relevance, with Core-Based Statistical Area (CBSA) as the geography:

# of occurrences of job title in geography / max # of occurrences of any job title in geography

We don't expose this score directly in the API, but if a FIPS code is given, the job will only be returned if this score is greater than 0. So, in effect, a single occurrence of the job title in the geography is enough to pass this filter.
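To make the definition concrete, here is a minimal Python sketch of the score and the >0 filter. The function names and the sample counts are hypothetical and chosen only for illustration; they are not taken from the API code.

```python
def job_importance(title_counts):
    """Score each job title in one geography as count / max count.

    title_counts maps job title -> number of occurrences in the geography.
    """
    if not title_counts:
        return {}
    max_count = max(title_counts.values())
    if max_count == 0:
        return {title: 0.0 for title in title_counts}
    return {title: count / max_count for title, count in title_counts.items()}


def passes_filter(scores, title):
    """A title passes the current filter if its score is greater than 0."""
    return scores.get(title, 0.0) > 0


# Hypothetical occurrence counts for a single CBSA.
counts = {"nurse": 40, "welder": 10, "baker": 1}
scores = job_importance(counts)
```

With these counts, the most common title scores 1.0, and "baker" still passes the filter with a single occurrence, while a title that never appears in the geography does not.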

In the database, this is stored in the jobs_importance table:

(quarter_id, geography_id, job_uuid, importance) (note: quarter_id is as yet unused)

Next steps

Jobs endpoints

If we don't wish to change this definition of job importance at this time, it should be applicable to all of the job-based endpoints:

/jobs can optionally take a FIPS code, and paginate through all jobs that have a >0 importance score. I think it makes sense to order by geographic importance here, because there is nothing else to order by.
/jobs/{id}/related_jobs can optionally take a FIPS code, and paginate through only the related jobs that also have a >0 importance score. In the absence of other ordering (it looks like this is offloaded to the DB for now) maybe we can order by geographic importance here.
/jobs/autocomplete can optionally take a FIPS code and only suggest autocompleted job titles if the job title has a >0 importance score. In the absence of other ordering (it looks like this is offloaded to the DB for now) maybe we can order by geographic importance here.
/jobs/normalize can optionally take a FIPS code and only suggest normalized job titles if the job title has a >0 importance score. Ordering is interesting here, because the normalizer gives its own score to order by, and paginates from Elasticsearch. We would need to index the job importance scores into Elasticsearch if we want to incorporate that ordering. This probably isn't worth pursuing at the moment.
/jobs/unusual_titles can probably be skipped
/skills/related_jobs can optionally take a FIPS code and paginate through only the related jobs that have a >0 importance score. We have an existing ordering here, as well. Although there is no technical problem with mixing the orderings as there is when Elasticsearch is involved, if we give the option to switch the sorting to geographical importance we probably want to look at a default threshold for the importance of the skill to the job.

Other notes:

We can expose the geographic importance score in the returned job entry if a FIPS code is given.
We can pick a threshold for job importance above 0 and use it instead.
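As a rough sketch of what the /jobs case could look like, the query below filters on importance, orders by it, and paginates. It uses an in-memory SQLite database instead of the API's real database. The column names come from the jobs_importance table described earlier; the column types, the sample rows, the function name, and the assumption that a FIPS code has already been resolved to a geography_id are all hypothetical.

```python
import sqlite3


def jobs_in_geography(conn, geography_id, threshold=0.0, limit=20, offset=0):
    """Return (job_uuid, importance) rows above the threshold, most important first."""
    return conn.execute(
        """
        SELECT job_uuid, importance
        FROM jobs_importance
        WHERE geography_id = ? AND importance > ?
        ORDER BY importance DESC, job_uuid
        LIMIT ? OFFSET ?
        """,
        (geography_id, threshold, limit, offset),
    ).fetchall()


# Hypothetical sample data; column types are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs_importance "
    "(quarter_id INTEGER, geography_id INTEGER, job_uuid TEXT, importance REAL)"
)
conn.executemany(
    "INSERT INTO jobs_importance VALUES (?, ?, ?, ?)",
    [
        (1, 100, "job-a", 1.0),
        (1, 100, "job-b", 0.25),
        (1, 100, "job-c", 0.0),
        (1, 200, "job-a", 0.5),
    ],
)

page = jobs_in_geography(conn, 100)
stricter_page = jobs_in_geography(conn, 100, threshold=0.3)
```

Returning the importance column alongside the job is one way to expose the score in the response, and raising the threshold argument above 0 covers the second note.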

Skill endpoints

I'll attempt to define geographic relevance for a skill in the same simple manner as we are currently implementing for job titles. There is room for more complication here, as not only is there the axis of 'number of jobs in geography that require this skill at all', but also 'how relevant are these skills to these jobs'? If a skill is 0.01 important to 95% of the jobs in a geography, how does that compare to a skill that is 0.4 important to 25% of the jobs in a geography?

Here are some proposed approaches:

Don't weight skills by importance to the jobs directly, but do implement a threshold to reduce noise. We've already gotten feedback that the skill lists are noisy, so we can pick some threshold, and make the geographic score (# of job titles in geography that deem this skill important) / (# of job titles in geography)
Weight by both importance of skill to job title and importance of job title to geography. Since these are both captured in normalized scores, what would happen if we multiplied them together? Using the example above, you would be comparing 0.01 * 0.95 = 0.0095 to 0.4 * 0.25 = 0.1. This score can be thresholded and ordered on. In this way, we can de facto exclude both skills that are barely relevant to jobs and skills that are relevant to jobs that are barely relevant in the geography.
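The two approaches above can be sketched in Python as follows. The function names, the sample data, and the 0.1 threshold are hypothetical. The second function only reproduces the multiplication from the worked example; how to aggregate the product across many job titles is left open here, as it is in the proposal.

```python
def threshold_score(skill_importance_by_job, threshold):
    """Approach 1: share of job titles in the geography that deem the skill important.

    skill_importance_by_job maps each job title present in the geography to the
    importance of the skill for that title (0.0 if the skill is unrelated).
    """
    if not skill_importance_by_job:
        return 0.0
    important = sum(1 for v in skill_importance_by_job.values() if v >= threshold)
    return important / len(skill_importance_by_job)


def weighted_score(skill_to_job, job_to_geography):
    """Approach 2: product of the two normalized scores."""
    return skill_to_job * job_to_geography


# Hypothetical geography with four job titles.
skill_by_job = {"nurse": 0.4, "welder": 0.01, "baker": 0.0, "pilot": 0.3}
approach_1 = threshold_score(skill_by_job, threshold=0.1)

# The two skills from the example in the text.
broad_but_weak = weighted_score(0.01, 0.95)
narrow_but_strong = weighted_score(0.4, 0.25)
```

Under approach 2, the skill that matters a lot to a quarter of the jobs ranks well above the skill that barely matters to nearly all of them.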

Assuming either of these approaches, the endpoints would then look like this:

/jobs/{id}/related_skills can optionally take a FIPS code and then only return skills that are present/relevant in the given geography. Since there is an existing ordering here, we have to figure out how to mix them.
/skills can optionally take a FIPS code and paginate through skills in the given geography, probably ordered by importance to the geography
/skills/{id} can optionally take a FIPS code and only return the skill if it is present/relevant in the geography. This is what /jobs/{id} does, but I'm just as skeptical about the utility of this query as I am for the jobs version.
/skills/{id}/related_skills can optionally take a FIPS code and then only return skills that are present/relevant in the given geography. I'm not seeing an existing ordering here, so we can use geographical importance.
/skills/autocomplete can optionally take a FIPS code and only return autocomplete suggestions that are present/relevant in a given geography. No ordering here yet so we can use geographical importance.
/skills/normalize can optionally take a FIPS code and only return normalization suggestions that are present/relevant in a given geography. We don't have a sophisticated definition of this endpoint yet (it looks similar to autocomplete), so I think we can order by geographical importance too.

Implementation

API side

Given the definitions of importance for both jobs and skills above, we shouldn't need to add more tables to implement this. The notion of 'skill importance to a geography' is based on combining two data fields that already exist in the API's database. It is possible to precompute this in a skills_geographic_importance table (or add a geography_id to the skills_importance table) but from a technical standpoint I'm not sure it's necessary, at least yet.
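As an illustration of computing skill importance at query time rather than precomputing it, the sketch below joins the two tables in an in-memory SQLite database. Several things here are assumptions made only for the example: the columns of skills_importance (job_uuid, skill_uuid, importance), the column types, the sample rows, and the choice to sum the per-job products into a single score per skill. The real API database and any real aggregation rule may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs_importance
        (quarter_id INTEGER, geography_id INTEGER, job_uuid TEXT, importance REAL);
    -- Column layout of skills_importance is assumed for this sketch.
    CREATE TABLE skills_importance
        (job_uuid TEXT, skill_uuid TEXT, importance REAL);
    """
)
conn.executemany(
    "INSERT INTO jobs_importance VALUES (?, ?, ?, ?)",
    [(1, 100, "job-a", 1.0), (1, 100, "job-b", 0.5), (1, 200, "job-a", 0.25)],
)
conn.executemany(
    "INSERT INTO skills_importance VALUES (?, ?, ?)",
    [("job-a", "skill-x", 0.5), ("job-b", "skill-x", 0.5), ("job-b", "skill-y", 0.25)],
)


def skills_in_geography(conn, geography_id):
    """Score each skill as the sum over jobs of (skill-to-job * job-to-geography)."""
    return conn.execute(
        """
        SELECT s.skill_uuid, SUM(s.importance * j.importance) AS geo_importance
        FROM skills_importance AS s
        JOIN jobs_importance AS j ON j.job_uuid = s.job_uuid
        WHERE j.geography_id = ? AND j.importance > 0
        GROUP BY s.skill_uuid
        ORDER BY geo_importance DESC
        """,
        (geography_id,),
    ).fetchall()


skill_scores = skills_in_geography(conn, 100)
```

If a query like this turns out to be too slow across endpoints, that would be the point to revisit the precomputed skills_geographic_importance table.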

A lot of the work is just in constructing queries within the API endpoints. There is some opportunity for reuse here (I think SQLAlchemy allows you to encapsulate the join and filtering in a clause that is reused) but given that each endpoint constructs its queries in a different way, there is no quick way I know of to integrate this into each endpoint.

Processing side

We can talk about increasing the size of the dataset used to calculate the jobs_importance table. The common schema processing probably isn't mature enough to power this in the two-week cycle that we want to deliver, so using the current data in jobs_importance might be the way to go.
