# Cleaning and Tidying SNSF Public Data

In [4]:
import requests
import os
import pandas as pd

## Gather

In [12]:
folder_name = 'rawdata'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

Do not run (this cell downloads the raw export files):

```{python}
# Export files to download from the SNSF P3 portal
file_names = ["P3_GrantExport.csv",
              "P3_GrantExport_with_abstracts.csv",
              "P3_PersonExport.csv",
              "P3_PublicationExport.csv",
              "P3_GrantOutputDataExport.csv",
              "P3_CollaborationExport.csv"]

for k in file_names:
    url_grant = "http://p3.snf.ch/P3Export/" + k
    print(url_grant)
    response = requests.get(url_grant)
    # Stop early if a download did not succeed
    assert response.status_code == 200, "status code for " + k + " not ok"

    # Write the raw bytes into the rawdata folder
    with open(os.path.join(folder_name, k), mode="wb") as file:
        file.write(response.content)
```

In [18]:
grants = pd.read_csv('rawdata/P3_GrantExport.csv', sep=';')

In [25]:
people = pd.read_csv('rawdata/P3_PersonExport.csv', sep=';')

In [29]:
publications = pd.read_csv('rawdata/P3_PublicationExport.csv', sep=';')

In [48]:
collaborations = pd.read_csv('rawdata/P3_CollaborationExport.csv', sep=';')

In [55]:
output_data = pd.read_csv('rawdata/P3_GrantOutputDataExport.csv', sep=';')

## Assess

In [23]:
grants.sample(6)

Unnamed: 0,Project Number,Project Number String,Project Title,Project Title English,Responsible Applicant,Funding Instrument,Funding Instrument Hierarchy,Institution,Institution Country,University,Discipline Number,Discipline Name,Discipline Name Hierarchy,All disciplines,Start Date,End Date,Approved Amount,Keywords
27637,63528,2000-063528,Applications of nonliner spectroscopy to the s...,,Vauthey Eric,Project funding (Div. I-III),Project funding,Département de Chimie Université de Fribourg,Switzerland,University of Geneva - GE,20301,Physical Chemistry,"Mathematics, Natural- and Engineering Sciences...",20301,01.04.2001,31.03.2003,343213.00,PHOTOCHEMISTRY; NONLINEAR LASER SPECTROSC; OPY...
73835,191292,P2ZHP3_191292,A multi-omics approach to understand the aetio...,,Keller Nadia,Early Postdoc.Mobility,Careers;Fellowships,School of Chemistry and Molecular Biosciences ...,Australia,Institution abroad - IACH,30307,Medical Microbiology,Biology and Medicine;Basic Medical Sciences,30307/30107,01.01.2020,31.10.2021,data not included in P3,group A Streptococcus; interleukine; puerperal...
67661,175559,32003B_175559,Deciphering the neoantigen landscape in bladde...,Deciphering the neoantigen landscape in bladde...,Derré Laurent,Project funding (Div. I-III),Project funding,Service d'Urologie CHUV,Switzerland,University of Lausanne - LA,30403,"Immunology, Immunopathology",Biology and Medicine;Experimental Medicine,30403/30401,01.03.2018,28.02.2022,628644.00,neoantigens; mutanome; immunotherapy; bladder ...
68560,177984,IZSEZ0_177984,1st Winter School at EPFL Valais: Challenges a...,,Buonsanti Raffaella,Scientific Exchanges,Science communication,Laboratory of Nanochemistry for Energy EPFL - ...,Switzerland,EPF Lausanne - EPFL,20507,Chemical Engineering,"Mathematics, Natural- and Engineering Sciences...",20507/20303,01.11.2017,31.10.2018,25000.00,Energy Research; School; doctoral students
23191,52486,2100-052486,Experimental stress analysis of composites usi...,,Botsis John,Project funding (Div. I-III),Project funding,Laboratoire de mécanique appliquée et d'analys...,Switzerland,EPF Lausanne - EPFL,20505,Material Sciences,"Mathematics, Natural- and Engineering Sciences...",20505,01.09.1998,30.09.2000,238263.00,COMPOSITES MATERIALS; CRACK BRIDGING; DEFORMAT...
30011,67666,823A-067666,Etude chez l'homme du processus d'automatisati...,,Posada Andres,Fellowships for advanced researchers,Careers;Fellowships,UNI: CNRS Institut des Sciences Cognitives Bron F,France,Institution abroad - IACH,30302,Neurophysiology and Brain Research,Biology and Medicine;Basic Medical Sciences,30302,01.10.2002,31.03.2004,data not included in P3,


In [24]:
grants.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74519 entries, 0 to 74518
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Project Number                74519 non-null  int64 
 1   Project Number String         74519 non-null  object
 2   Project Title                 74519 non-null  object
 3   Project Title English         31400 non-null  object
 4   Responsible Applicant         74519 non-null  object
 5   Funding Instrument            74519 non-null  object
 6   Funding Instrument Hierarchy  74479 non-null  object
 7   Institution                   68860 non-null  object
 8   Institution Country           68794 non-null  object
 9   University                    74514 non-null  object
 10  Discipline Number             74519 non-null  int64 
 11  Discipline Name               74519 non-null  object
 12  Discipline Name Hierarchy     74020 non-null  object
 13  All disciplines 

In [26]:
people.sample(6)

Unnamed: 0,Last Name,First Name,Gender,Institute Name,Institute Place,Person ID SNSF,OCRID,Projects as responsible Applicant,Projects as Applicant,Projects as Partner,Projects as Practice Partner,Projects as Employee,Projects as Contact Person
100473,Tschan Semmer,Franziska,female,IPTO - Institut de Psychologie du Travail et d...,Neuchâtel,32844,,30596;43472;47904;52861;56997;58452;138273;149...,53123;103724;113429,173111.0,,1552,
64055,Medici,Nicola,male,,,745801,,,,,,179186,
61171,Malaise,Grégory,male,,,134018,,,,,,59905,
19370,Coray,Renata,female,Institut für Mehrsprachigkeit Université de Fr...,Fribourg,91619,,108647;120931;134214;158079,,179426.0,,49546;55858;108647,
69435,Musso,Alessandra,female,Laboratoire de Moléculaire Neurodegenerative R...,Lausanne,597986,,,,,,127478;144063,
24434,Dos Santos Ferreira,João,male,,,736388,,,,,,174038,


In [27]:
people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 111903 entries, 0 to 111902
Data columns (total 13 columns):
 #   Column                             Non-Null Count   Dtype 
---  ------                             --------------   ----- 
 0   Last Name                          111903 non-null  object
 1   First Name                         111896 non-null  object
 2   Gender                             111903 non-null  object
 3   Institute Name                     54186 non-null   object
 4   Institute Place                    54083 non-null   object
 5   Person ID SNSF                     111903 non-null  int64 
 6   OCRID                              7092 non-null    object
 7   Projects as responsible Applicant  28898 non-null   object
 8   Projects as Applicant              18934 non-null   object
 9   Projects as Partner                5300 non-null    object
 10  Projects as Practice Partner       735 non-null     object
 11  Projects as Employee               82000 non-null   

In [30]:
publications.sample(6)

Unnamed: 0,Publication ID SNSF,Project Number,Peer Review Status,Type of Publication,Title of Publication,Authors,Status,Publication Year,ISBN,DOI,...,Publisher,Editors,Journal Title,Volume,Issue / Number,Page from,Page to,Proceeding Title,Proceeding Place,Abstract
22014,{0598A41B-3792-44C9-8F98-703008F5F76B},129120,Peer-reviewed,Proceedings (peer-reviewed),PARAMETRIC SCRIPTING FOR EARLY DESIGN PERFORMA...,"Julien Nembrini, Guillaume Labelle, Christop...",Published,2011.0,,,...,"CISBAT,EPFL",,CISBAT 2011,,,,,CISBAT 2011,,
42969,{D0582FB6-F9EA-456B-9D9B-76FBD2541770},136243,Peer-reviewed,Original article (peer-reviewed),Spezialisierung an Gerichten,Rüefli Anna,Published,2013.0,,,...,,,Justice - Justiz - Giustizia,2013.0,2.0,1.0,18.0,Justice - Justiz - Giustizia,,Aufgrund der zunehmenden Komplexität Verästelu...
18334,{46357B67-8E5F-4510-B15A-7FB77EAE8ECC},127461,Peer-reviewed,Original article (peer-reviewed),An audit of diagnostic reference levels in int...,"Samara Eleni-Theano, Aroua Abbas, De Palma R...",Published,2011.0,,10.1093/rpd/ncq600 ...,...,,,Radiation Protection Dosimetry,148.0,1.0,74.0,82.0,Radiation Protection Dosimetry,,A wide variation in patient exposure has been ...
78692,{934BAA4A-4A52-4233-83A2-1A37FF410988},149221,Peer-reviewed,Original article (peer-reviewed),Archaeal populations in two distinct sedimenta...,"Thomas C., Ionescu D., Ariztegui D.",Published,2014.0,,10.1016/j.margen.2014.09.001 ...,...,,,Marine Genomics,17.0,,53.0,62.0,Marine Genomics,,
25336,{4C8F474D-5718-4CD2-B27A-1E68EF3728CD},130314,Peer-reviewed,Original article (peer-reviewed),The fate of high redshift massive compact gala...,"Kaufmann T., Mayer L., Carollo M., Feldman...",Accepted,,,,...,,,Monthly Notices of the Royal Astronomical Society,,,,,Monthly Notices of the Royal Astronomical Society,,
31887,{2EA25550-30E7-42D0-8E6F-71529BBE0778},132767,Peer-reviewed,Original article (peer-reviewed),The effects of dynamical interactions on plane...,"Parker RJ, Quanz SP",Published,2012.0,,10.1111/j.1365-2966.2011.19911.x ...,...,,,MONTHLY NOTICES OF THE ROYAL ASTRONOMICAL SOCIETY,419.0,3.0,2448.0,2458.0,MONTHLY NOTICES OF THE ROYAL ASTRONOMICAL SOCIETY,,We present N body simulations of young substru...


In [31]:
publications.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 133541 entries, 0 to 133540
Data columns (total 26 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Publication ID SNSF        133541 non-null  object 
 1   Project Number             133541 non-null  int64  
 2   Peer Review Status         133541 non-null  object 
 3   Type of Publication        128867 non-null  object 
 4   Title of Publication       133504 non-null  object 
 5   Authors                    131026 non-null  object 
 6   Status                     133541 non-null  object 
 7   Publication Year           118806 non-null  float64
 8   ISBN                       15201 non-null   object 
 9   DOI                        78871 non-null   object 
 10  Import Source              107910 non-null  object 
 11  Last Change of Outputdata  0 non-null       float64
 12  Open Access Status         133541 non-null  int64  
 13  Open Access Type           42

In [34]:
publications['Open Access Type'].value_counts()

Publisher (Gold Open Access)                                           18131
Repository (Green Open Access)                                         12349
Website                                                                10359
Green OA Embargo (Freely available via Repository after an embargo)     1791
Name: Open Access Type, dtype: int64

In [35]:
publications['Status'].value_counts()

Published    118975
Accepted      14464
NotSet          102
Name: Status, dtype: int64

In [36]:
publications['Volume'].value_counts()

8               2078
7               2002
9               1851
6               1808
10              1705
                ... 
vol. 53            1
111(5)             1
2:10               1
Supp 05/2016       1
16 (2015)          1
Name: Volume, Length: 3657, dtype: int64

In [37]:
publications['Issue / Number'].value_counts()

1                  8688
2                  6471
3                  5426
4                  5071
5                  3638
                   ... 
1771                  1
7631                  1
2171                  1
368                   1
20. Januar 2014       1
Name: Issue / Number, Length: 2720, dtype: int64

In [50]:
publications[publications.DOI.notna() & publications[['DOI', 'Project Number']].duplicated()].shape

(1724, 26)
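
The count above only includes the second and later copies of each `DOI`/`Project Number` pair. A minimal sketch on a toy frame (made-up DOIs and project numbers, not rows from the export) of how `keep=False` could be used to look at every member of a duplicate group instead:

```python
import pandas as pd

# Toy rows; the DOIs and project numbers are invented for illustration.
pubs_demo = pd.DataFrame({
    "Project Number": [1, 1, 2, 3],
    "DOI": ["10.0000/a", "10.0000/a", "10.0000/a", None],
})

key = ["DOI", "Project Number"]
has_doi = pubs_demo["DOI"].notna()

# Default keep="first": only the repeats are flagged (what the cell above counts).
later_copies = pubs_demo[has_doi & pubs_demo.duplicated(subset=key)]

# keep=False: every row that belongs to a duplicated pair is flagged.
all_members = pubs_demo[has_doi & pubs_demo.duplicated(subset=key, keep=False)]

print(len(later_copies), len(all_members))
```

The same DOI attached to a different project (third row) is not flagged, since the pair is what has to repeat.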

In [51]:
collaborations.sample(6)

Unnamed: 0,Project Number,Group/Person,Types of collaboration,Country,Project Start Date,Project End Date
4992,131932,"Wellcome Trust Centre for Neuroimaging, UCL","in-depth/constructive exchanges on approaches,...",Great Britain and Northern Ireland,01.01.2011,31.12.2013
30719,153359,"Prof. Dr. Dario Gamboni, UniGE","in-depth/constructive exchanges on approaches,...",Switzerland,01.07.2014,30.09.2016
24219,148807,SUNY Stony Brook,"in-depth/constructive exchanges on approaches,...",United States of America,01.09.2013,31.08.2014
12911,140211,"Dr Jeanne Crassous, Université de Rennes",Publication;Research Infrastructures,France,01.04.2012,31.03.2015
19439,144857,Prof. Matthias Lutolf / Laboratory of Stem Cel...,"in-depth/constructive exchanges on approaches,...",Switzerland,01.12.2013,30.11.2017
4493,130824,Department of Diagnostic Radiology,"in-depth/constructive exchanges on approaches,...",Switzerland,01.09.2010,28.02.2014


In [52]:
collaborations.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60324 entries, 0 to 60323
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   Project Number          60324 non-null  int64 
 1   Group/Person            60324 non-null  object
 2   Types of collaboration  60324 non-null  object
 3   Country                 60324 non-null  object
 4   Project Start Date      60324 non-null  object
 5   Project End Date        60322 non-null  object
dtypes: int64(1), object(5)
memory usage: 2.8+ MB


In [53]:
collaborations['Types of collaboration'].value_counts()

in-depth/constructive exchanges on approaches, methods or results;Publication                                                                                                      18300
in-depth/constructive exchanges on approaches, methods or results                                                                                                                  16506
in-depth/constructive exchanges on approaches, methods or results;Publication;Research Infrastructures                                                                              6305
in-depth/constructive exchanges on approaches, methods or results;Publication;Research Infrastructures;Exchange of personnel                                                        3260
in-depth/constructive exchanges on approaches, methods or results;Research Infrastructures                                                                                          3078
Publication                                                                

In [54]:
collaborations['Group/Person'].value_counts()

EPFL                                                       92
University of Geneva                                       80
ETH Zurich                                                 72
University of Zurich                                       64
ETH Zürich                                                 56
                                                           ..
Vetrinary Pathology University of Zurich                    1
Prof. Gilles Gasser, University of Zurich                   1
Office fédéral du personnel (OFPER)                         1
University of Stanford, department of cognitive science     1
Mathias Pessiglione / ICM Paris                             1
Name: Group/Person, Length: 52235, dtype: int64

In [56]:
output_data.sample(6)

Unnamed: 0,Project Number,Output Type,Output Title,Url,Year
19598,159300,"Print (books, brochures, leaflets)",Research update - “Nairobi`s splintered sanita...,,2017.0
7915,139230,Talks/events/exhibitions,TUN Basel / MUBA Beitrag '*Drug the Bug: Eine ...,http://www.tunbasel.ch/institutionen-projekte....,2012.0
26029,173330,Talks/events/exhibitions,Resultate der Lärmwirkungsstudie SiRENE,,2018.0
1436,124753,Video/Film,"Les Sciences, ça m'interesse",,2010.0
23126,165607,Talks/events/exhibitions,Was heisst Digitalisierung für unsere Unterneh...,,2017.0
15294,152020,"New media (web, blogs, podcasts, news feeds etc.)",The Server’s Security Certificate Is Not Valid...,http://theiii.org/index.php/882/the-servers-se...,2014.0


In [57]:
output_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28730 entries, 0 to 28729
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Project Number  28730 non-null  int64  
 1   Output Type     28730 non-null  object 
 2   Output Title    28726 non-null  object 
 3   Url             18712 non-null  object 
 4   Year            28487 non-null  float64
dtypes: float64(1), int64(1), object(3)
memory usage: 1.1+ MB


In [58]:
output_data['Output Type'].value_counts()

Media relations: print media, online media           10709
Talks/events/exhibitions                              7201
New media (web, blogs, podcasts, news feeds etc.)     3679
Media relations: radio, television                    3623
Print (books, brochures, leaflets)                    1413
Other activities                                      1003
Video/Film                                             680
Software                                               286
Start-up                                               136
Name: Output Type, dtype: int64

#### Quality

- spaces in column names
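
One possible way to handle the spaces, sketched on an empty frame that reuses a few column labels from the exports. The helper `to_snake_case` is a hypothetical name, and snake_case is only one of several reasonable conventions:

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Lower-case a label and collapse runs of non-alphanumerics into '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()


# Empty toy frame carrying column labels seen in the exports.
demo = pd.DataFrame(columns=["Project Number", "Issue / Number", "Group/Person"])
demo = demo.rename(columns=to_snake_case)

print(list(demo.columns))
# ['project_number', 'issue_number', 'group_person']
```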

##### `grants` (i.e. `GrantExport`) table

- `Project Number` and `Project Number String` are redundant
- `Project Number String` encodes division information?
- `Project Title English` is often redundant or null
- `Responsible Applicant` is a name, not a unique identifier
- `Institution` free text? If yes, is it relevant? Better named as department?
- `Start Date` and `End Date` are strings, not dates
- `Approved Amount` is a string, not numeric (some rows read `data not included in P3`)
- `Keywords` not consistent (see keyword extraction from abstracts)
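
A minimal sketch of how the date and amount issues might be addressed, on a two-row toy frame built from values visible in the sample above. It assumes every date follows the `DD.MM.YYYY` pattern and that any non-numeric amount should become missing:

```python
import pandas as pd

# Toy rows copied from the sample output; not the full export.
grants_demo = pd.DataFrame({
    "Start Date": ["01.04.2001", "01.01.2020"],
    "End Date": ["31.03.2003", "31.10.2021"],
    "Approved Amount": ["343213.00", "data not included in P3"],
})

# Parse day.month.year strings into datetimes.
for col in ["Start Date", "End Date"]:
    grants_demo[col] = pd.to_datetime(grants_demo[col], format="%d.%m.%Y")

# Non-numeric placeholders such as "data not included in P3" become NaN.
grants_demo["Approved Amount"] = pd.to_numeric(
    grants_demo["Approved Amount"], errors="coerce"
)

print(grants_demo.dtypes)
```

Coercing to NaN drops the distinction between "not disclosed" and "missing"; if that matters, a separate boolean flag could be kept before converting.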

##### `people` (i.e. `PersonExport`) table

- column name `OCRID` is a typo for `ORCID`
- `Gender` is not a categorical variable
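
Both points could be fixed in two lines. A sketch on a toy frame (IDs and genders taken from the sample above, everything else omitted):

```python
import pandas as pd

# Toy rows; only the columns needed for the illustration.
people_demo = pd.DataFrame({
    "Person ID SNSF": [32844, 745801],
    "Gender": ["female", "male"],
    "OCRID": [None, None],
})

# Correct the misspelled column label.
people_demo = people_demo.rename(columns={"OCRID": "ORCID"})

# Store gender as a categorical variable.
people_demo["Gender"] = people_demo["Gender"].astype("category")

print(people_demo.dtypes)
```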

##### `publications` table

- missing DOIs
- `Last Change of Outputdata` empty
- `Publication Year` shows as float
- `Status`, `Peer Review Status`, `Type of Publication`, and `Open Access Type` are strings, not categories
- `Volume`, `Issue / Number`, `Page from`, `Page to` are strings, not numeric
- `[..] Title` columns show inconsistent capitalization
- duplicated entries: 1,724 rows repeat a non-null DOI and project number pair
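
A sketch of how the empty column and the duplicated entries might be cleaned, on a toy frame with made-up DOIs. It assumes rows without a DOI should all be kept, because they cannot be matched reliably:

```python
import pandas as pd

# Toy rows; DOIs are invented, project numbers are only illustrative.
pubs_demo = pd.DataFrame({
    "Project Number": [127461, 127461, 130314, 130314],
    "DOI": ["10.0000/a", "10.0000/a", None, None],
    "Last Change of Outputdata": [None, None, None, None],
})

# Drop columns that contain no values at all.
pubs_demo = pubs_demo.dropna(axis="columns", how="all")

# Flag repeats of a (DOI, Project Number) pair, but only where a DOI exists.
is_repeat = pubs_demo["DOI"].notna() & pubs_demo.duplicated(
    subset=["DOI", "Project Number"]
)
pubs_demo = pubs_demo[~is_repeat].reset_index(drop=True)

print(pubs_demo)
```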

##### `collaborations` table

- `Types of collaboration` is a string, not a category
- `Switzerland` should not be a valid `Types of collaboration` value
- `Project Start Date` and `Project End Date` are strings, not dates
- `Group/Person` encoding seems inconsistent ("," vs "/", "Prof", "Dr")

##### `output_data` table

- `Output Type` string, not category
- `Year` float, not integer
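
`Year` is read as float because the column has missing values. A minimal sketch, on toy rows, of converting it to pandas' nullable integer type so that the gaps survive the cast (the same idea would apply to `Publication Year`):

```python
import pandas as pd

# Toy rows; the missing year mimics the gaps reported by info().
output_demo = pd.DataFrame({
    "Output Type": ["Video/Film", "Software", "Video/Film"],
    "Year": [2010.0, None, 2017.0],
})

# "Int64" (capital I) is the nullable integer dtype; NaN becomes <NA>.
output_demo["Year"] = output_demo["Year"].astype("Int64")

print(output_demo["Year"])
```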


#### Tidiness

##### `grants` (i.e. `GrantExport`) table

- `Funding Instrument`, `Funding Instrument Hierarchy` are confusing
- `Discipline`, ... `Discipline Name Hierarchy` are confusing
- details about Institute out of scope
- `University` contains both long and short names: details out of scope

##### `people` (i.e. `PersonExport`) table

- `Projects as ...` columns mix variables and observations for grant and role
- details about Institute out of scope
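
One way the `Projects as ...` columns could be tidied is a long table with one row per person, role and project. A sketch on a toy frame; the output names `Role` and `roles` are hypothetical, and it assumes every cell is either missing or a `;`-separated string of project numbers (cells shown as `173111.0` in the sample would need normalising first):

```python
import pandas as pd

# Toy rows with two of the role columns; values follow the sample above.
people_demo = pd.DataFrame({
    "Person ID SNSF": [91619, 745801],
    "Projects as responsible Applicant": ["108647;120931", None],
    "Projects as Employee": ["49546;55858;108647", "179186"],
})

role_cols = [c for c in people_demo.columns if c.startswith("Projects as ")]

# Wide to long: one row per person and role, then drop empty roles.
roles = people_demo.melt(
    id_vars="Person ID SNSF",
    value_vars=role_cols,
    var_name="Role",
    value_name="Project Number",
).dropna(subset=["Project Number"])

# Keep only the role label, then split the lists into one project per row.
roles["Role"] = roles["Role"].str.replace("Projects as ", "", regex=False)
roles["Project Number"] = roles["Project Number"].str.split(";")
roles = roles.explode("Project Number").reset_index(drop=True)
roles["Project Number"] = roles["Project Number"].astype(int)

print(roles)
```

The resulting table can then be joined to `grants` on `Project Number`.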

##### `publications` table

- `Authors` contains multiple observations

##### `collaborations` table

- `Types of collaboration` contains multiple observations
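
The multiple observations appear to be separated by `;` (the longest label contains a comma but no semicolon), so a split followed by `explode` is one option. A sketch on toy rows with illustrative values:

```python
import pandas as pd

# Toy rows; the second project number is invented.
collab_demo = pd.DataFrame({
    "Project Number": [140211, 1],
    "Types of collaboration": [
        "Publication;Research Infrastructures",
        "Publication",
    ],
})

# Split on ';' and give each collaboration type its own row.
collab_types = collab_demo.copy()
collab_types["Types of collaboration"] = (
    collab_types["Types of collaboration"].str.split(";")
)
collab_types = collab_types.explode("Types of collaboration").reset_index(drop=True)

# With one value per row, the column can be stored as a category.
collab_types["Types of collaboration"] = (
    collab_types["Types of collaboration"].astype("category")
)

print(collab_types)
```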


## Clean

#### Define

#### Code

#### Test