This project aims to:
- give programmatic access to data on the recent bioinformatics job market in France.
- provide some basic analyses and charts based on these data.
The data come from the Société Française de Bioinformatique (SFBI), an association that, among other activities, gathers job offers and posts them on its website and mailing list. You will find here information on more than 2500 job offers posted from April 2012 onward. The data will be updated regularly (every 4-5 months).
This project concerns data of French origin and was primarily intended for the French bioinformatics community. English has been used for the code, but the output charts are in French.
Please read the details section before using the charts.
We highly recommend using the conda environment manager to install and use this project. Not only does it provide a clean environment to work in, it also makes it really easy to install all the necessary packages.
The following procedure assumes you have already installed conda. If not, here is the miniconda download page.
2.1.1 Create the virtual environment
Use the provided environment definition file:
wget https://raw.githubusercontent.com/royludo/SFBIStats/master/env.yml
conda env create -f env.yml
This will set up a complete environment called sfbistatsenv with the core package requirements already installed.
Alternatively, you can use env_full.yml. It also contains the packages required by the code in the examples. If you decide to use env.yml, refer to the READMEs in each example's directory for the requirements that you will have to install yourself.
In both cases, once the environment is created, don't forget to activate it:
source activate sfbistatsenv
2.1.2 Get the code
Clone the repository directly into your environment:
git clone https://github.com/royludo/SFBIStats.git
You will end up with a folder sfbistatsenv/SFBIStats containing the whole project.
2.1.3 Install the package
Go to the project's directory and run:
python setup.py install
2.2 Run the examples
You probably want to use the data and create some charts. The examples folder contains scripts that use the SFBI jobs data to produce charts as seen here or here.
Each example folder is different and has its own dependencies. Please refer to the README provided in each folder for instructions on how to install and run each example. If you used env_full.yml, you can run them directly.
Among the job offers posted on the SFBI mailing list, only the formatted ones (posted through the SFBI website) have been considered, for practical reasons. The SFBI started formatting job offers through its website only in 2012. Before that, offers were sent and forwarded to the list as is. That is why the dataset starts in April 2012.
However, throughout 2012 and part of 2013, users could still send unformatted offers to the mailing list. Those offers don't appear here. Thus, great care must be taken when interpreting these data. The increase in job offers must be put into perspective, as people were switching from the previous anarchic job posts to the formatted ones.
Due to various technical issues (crazy encodings, dead links...), a small number of offers do not appear in the dataset. Since these issues introduce no real bias, this should be ok.
Each job entry contains the following fields:
- contract_type: 'CDD', 'CDI', 'Stage', 'Thèse'
- contract_subtype: 'PR', 'MdC', 'CR', 'IR', 'IE', 'CDI autre', 'Post-doc / IR', 'CDD Ingénieur', 'ATER', 'CDD autre'
Stage and Thèse don't have any subtypes, so their contract_subtype field is empty.
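As an illustration of how these two fields can be used, here is a minimal sketch that tallies offers per contract type and per subtype. The sample entries are made up for the example; real entries come from the dataset and may carry other fields.

```python
from collections import Counter

# Hypothetical sample entries, limited to the two fields described above.
sample_jobs = [
    {"contract_type": "CDD", "contract_subtype": "Post-doc / IR"},
    {"contract_type": "CDD", "contract_subtype": "CDD Ingénieur"},
    {"contract_type": "CDI", "contract_subtype": "IR"},
    {"contract_type": "Stage", "contract_subtype": ""},
    {"contract_type": "Thèse", "contract_subtype": ""},
    {"contract_type": "CDD", "contract_subtype": "Post-doc / IR"},
]


def count_contracts(jobs):
    """Return (offers per type, offers per (type, subtype) pair)."""
    by_type = Counter(job["contract_type"] for job in jobs)
    by_subtype = Counter(
        (job["contract_type"], job["contract_subtype"])
        for job in jobs
        if job.get("contract_subtype")  # skip entries with an empty subtype
    )
    return by_type, by_subtype


by_type, by_subtype = count_contracts(sample_jobs)
print(by_type.most_common())
print(by_subtype.most_common())
```

Using job.get() for the subtype means the sketch works whether an empty subtype is stored as an empty string, as None, or is missing altogether.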
The file jobs_anon.json contains all the data used in this project. It comes from MongoDB, so there are some points to be aware of regarding its specific strict mode JSON format. The data can nonetheless be easily parsed with:
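import json
from bson import json_util

# jobs_anon.json is a file path, so open the file and parse its content
with open('jobs_anon.json', encoding='utf-8') as f:
    jobs = json.load(f, object_hook=json_util.object_hook)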
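If installing bson (shipped with pymongo) is not an option, the MongoDB-specific wrappers can be handled by hand. The sketch below covers only the {"$date": ...} wrapper, and whether the file actually uses it is an assumption. It is not a replacement for json_util, and the field name submission_date is hypothetical.

```python
import json
from datetime import datetime, timezone


def mongo_date_hook(obj):
    """Turn {"$date": ...} wrappers into datetime objects, leave the rest untouched."""
    if set(obj) == {"$date"}:
        value = obj["$date"]
        if isinstance(value, (int, float)):
            # Number of milliseconds since the Unix epoch
            return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
        if isinstance(value, str):
            # ISO-8601 string; a trailing "Z" is rewritten for older Python versions
            return datetime.fromisoformat(value.replace("Z", "+00:00"))
    return obj


# Hypothetical entry with a single, made-up field name
raw = '{"submission_date": {"$date": 1333238400000}}'
entry = json.loads(raw, object_hook=mongo_date_hook)
print(entry["submission_date"].isoformat())
```

Any other wrapper is returned unchanged as a plain dict, so nothing is lost when the hook does not recognize a value.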
The data have been scraped from web pages and are delivered raw. Sanitization of the fields is left to users, but feel free to reuse the functions in sfbistats/utils/utils.py for that.
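The kind of cleanup meant here can be sketched as follows. This is a generic illustration, not the code found in sfbistats/utils/utils.py, and the function name sanitize_field is hypothetical.

```python
import unicodedata


def sanitize_field(value):
    """Normalize Unicode and collapse runs of whitespace in a scraped string."""
    if value is None:
        return ""
    # NFC merges a letter and its combining accent into a single code point
    text = unicodedata.normalize("NFC", str(value))
    # split() without arguments drops leading, trailing and repeated whitespace,
    # including non-breaking spaces and newlines
    return " ".join(text.split())


print(sanitize_field("  CDD\u00a0Inge\u0301nieur \n"))
```

Normalizing before comparing values matters for fields such as contract_subtype, where 'CDD Ingénieur' could otherwise appear in two visually identical but unequal forms.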
You can see a sample of the output charts here. They are named according to the script that created them.
summary_5 and 10, time_series_8 and 9, summary_lins_6 and 7:
The education level required for a job has been inferred only from job subtypes, and concerns only CDD and CDI. Stage and Thèse categories have been excluded. The job was considered as requiring:
- a master's degree with subtypes 'CDD Ingénieur' and 'IE'
- a PhD with subtypes 'Post-doc / IR', 'PR', 'MdC', 'CR', 'IR', 'ATER'
The fuzzy subtypes 'CDD autre' and 'CDI autre' have been excluded. So beware that the information displayed in these charts may not be the most accurate there is. Use with caution.
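The inference rule described above can be written down as a small lookup. This is a sketch of the rule as stated in this section, not necessarily how the chart scripts implement it; the function and constant names are hypothetical.

```python
# Subtypes taken from the two lists above
MASTER_SUBTYPES = {"CDD Ingénieur", "IE"}
PHD_SUBTYPES = {"Post-doc / IR", "PR", "MdC", "CR", "IR", "ATER"}


def education_level(contract_type, contract_subtype):
    """Return "Master", "PhD", or None when the offer is excluded."""
    if contract_type not in ("CDD", "CDI"):
        return None  # Stage and Thèse are excluded
    if contract_subtype in MASTER_SUBTYPES:
        return "Master"
    if contract_subtype in PHD_SUBTYPES:
        return "PhD"
    return None  # 'CDD autre', 'CDI autre', or an unknown subtype


print(education_level("CDD", "Post-doc / IR"))
print(education_level("CDI", "CDI autre"))
```

Offers for which the function returns None are simply left out of the counts, which is why the totals in these charts are lower than the total number of CDD and CDI offers.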
Generated with the word_cloud module using the titles of the job offers. Types and subtypes of contracts are specified in each image title.
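A word cloud sizes each word according to how often it occurs. The counting step can be sketched with the standard library alone. The title field name, the sample titles and the stopword list are assumptions made for the example, and the tokenization is much cruder than what word_cloud does.

```python
import re
from collections import Counter

# Hypothetical sample entries; "title" is an assumed field name
sample_jobs = [
    {"title": "Ingénieur en bioinformatique"},
    {"title": "Post-doc en bioinformatique structurale"},
    {"title": "Stage M2 : analyse de données RNA-seq"},
]

# Tiny illustrative stopword list, far from complete
STOPWORDS = {"en", "de", "des", "la", "le", "et"}


def title_word_frequencies(jobs):
    """Count lowercase words across all titles, ignoring stopwords."""
    words = []
    for job in jobs:
        # \w+ matches runs of letters, digits and underscores, accents included
        words.extend(re.findall(r"\w+", job.get("title", "").lower()))
    return Counter(word for word in words if word not in STOPWORDS)


frequencies = title_word_frequencies(sample_jobs)
print(frequencies.most_common(3))
```

A mapping like this one can be passed to WordCloud.generate_from_frequencies to draw the image, after filtering the jobs by contract type or subtype as needed.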
If you want to transform the charts with your own awesome style, if you have a better way to get the data (or more data), or if you feel like some different kinds of charts could be useful, then don't hesitate! Fork, code, and tell us about it. We will happily accept any kind of contribution to this project!
- remake maps the same way as jacklabelette
- go full machine learning on all the mails as done here
- make efforts to get the authorization to release the content of the mails
- provide an interface to visualize the data dynamically, maybe with plotly/dash
- job update
- job update
- use the definitive new region names
- job update
- reorganized the repo's architecture and moved all the code from python 2 to python 3
- scripts used for articles and everything that is not related to getting the data have been put in the example directory
- core code to get and parse mails stays in sfbistats
- started experiments with machine learning
#bioinfo-fr on freenode. Nick is fragmeister.