# Cleaning and Tidying SNSF Public Data

In [4]:
import requests
import os
import pandas as pd

## Gather

In [12]:
folder_name = 'rawdata'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)

Do not run (this cell downloads every export file from the SNSF server):

```{python}
# P3 export files to download
file_names = ["P3_GrantExport.csv",
              "P3_GrantExport_with_abstracts.csv",
              "P3_PersonExport.csv",
              "P3_PublicationExport.csv",
              "P3_GrantOutputDataExport.csv",
              "P3_CollaborationExport.csv"]

for k in file_names:
    url_grant = "http://p3.snf.ch/P3Export/" + k
    print(url_grant)
    response = requests.get(url_grant)
    assert response.status_code == 200, "status code for " + k + " not ok"

    # Save the raw bytes under rawdata/
    with open(os.path.join(folder_name, k), mode="wb") as file:
        file.write(response.content)
```

In [18]:
grants = pd.read_csv("rawdata/P3_GrantExport.csv", sep=';')

## Assess

In [23]:
grants.sample(6)

|  | Project Number | Project Number String | Project Title | Project Title English | Responsible Applicant | Funding Instrument | Funding Instrument Hierarchy | Institution | Institution Country | University | Discipline Number | Discipline Name | Discipline Name Hierarchy | All disciplines | Start Date | End Date | Approved Amount | Keywords |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27637 | 63528 | 2000-063528 | Applications of nonliner spectroscopy to the s... |  | Vauthey Eric | Project funding (Div. I-III) | Project funding | Département de Chimie Université de Fribourg | Switzerland | University of Geneva - GE | 20301 | Physical Chemistry | Mathematics, Natural- and Engineering Sciences... | 20301 | 01.04.2001 | 31.03.2003 | 343213.00 | PHOTOCHEMISTRY; NONLINEAR LASER SPECTROSC; OPY... |
| 73835 | 191292 | P2ZHP3_191292 | A multi-omics approach to understand the aetio... |  | Keller Nadia | Early Postdoc.Mobility | Careers;Fellowships | School of Chemistry and Molecular Biosciences ... | Australia | Institution abroad - IACH | 30307 | Medical Microbiology | Biology and Medicine;Basic Medical Sciences | 30307/30107 | 01.01.2020 | 31.10.2021 | data not included in P3 | group A Streptococcus; interleukine; puerperal... |
| 67661 | 175559 | 32003B_175559 | Deciphering the neoantigen landscape in bladde... | Deciphering the neoantigen landscape in bladde... | Derré Laurent | Project funding (Div. I-III) | Project funding | Service d'Urologie CHUV | Switzerland | University of Lausanne - LA | 30403 | Immunology, Immunopathology | Biology and Medicine;Experimental Medicine | 30403/30401 | 01.03.2018 | 28.02.2022 | 628644.00 | neoantigens; mutanome; immunotherapy; bladder ... |
| 68560 | 177984 | IZSEZ0_177984 | 1st Winter School at EPFL Valais: Challenges a... |  | Buonsanti Raffaella | Scientific Exchanges | Science communication | Laboratory of Nanochemistry for Energy EPFL - ... | Switzerland | EPF Lausanne - EPFL | 20507 | Chemical Engineering | Mathematics, Natural- and Engineering Sciences... | 20507/20303 | 01.11.2017 | 31.10.2018 | 25000.00 | Energy Research; School; doctoral students |
| 23191 | 52486 | 2100-052486 | Experimental stress analysis of composites usi... |  | Botsis John | Project funding (Div. I-III) | Project funding | Laboratoire de mécanique appliquée et d'analys... | Switzerland | EPF Lausanne - EPFL | 20505 | Material Sciences | Mathematics, Natural- and Engineering Sciences... | 20505 | 01.09.1998 | 30.09.2000 | 238263.00 | COMPOSITES MATERIALS; CRACK BRIDGING; DEFORMAT... |
| 30011 | 67666 | 823A-067666 | Etude chez l'homme du processus d'automatisati... |  | Posada Andres | Fellowships for advanced researchers | Careers;Fellowships | UNI: CNRS Institut des Sciences Cognitives Bron F | France | Institution abroad - IACH | 30302 | Neurophysiology and Brain Research | Biology and Medicine;Basic Medical Sciences | 30302 | 01.10.2002 | 31.03.2004 | data not included in P3 |  |


In [24]:
grants.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74519 entries, 0 to 74518
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Project Number                74519 non-null  int64 
 1   Project Number String         74519 non-null  object
 2   Project Title                 74519 non-null  object
 3   Project Title English         31400 non-null  object
 4   Responsible Applicant         74519 non-null  object
 5   Funding Instrument            74519 non-null  object
 6   Funding Instrument Hierarchy  74479 non-null  object
 7   Institution                   68860 non-null  object
 8   Institution Country           68794 non-null  object
 9   University                    74514 non-null  object
 10  Discipline Number             74519 non-null  int64 
 11  Discipline Name               74519 non-null  object
 12  Discipline Name Hierarchy     74020 non-null  object
 13  All disciplines 

#### Quality

##### `grants` table

- Column names contain spaces.
- `Project Number` and `Project Number String` are redundant.
- `Project Number String` may encode division information (to be confirmed).
- `Project Title English` is often null or a duplicate of `Project Title`.
- `Responsible Applicant` is a name, not a unique identifier.
- `Institution` looks like free text. If so, is it relevant, and would it be better named `department`?
- `Start Date` and `End Date` are strings, not dates.
- `Approved Amount` is a string, not numeric (some rows hold the text `data not included in P3`).
- `Keywords` are not consistent (see keyword extraction from abstracts).
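
Several of these issues are mechanical and could be handled with small converters. The sketch below is one possible approach, not the notebook's own cleaning code: the helper names `snake_case`, `parse_date`, and `parse_amount` are hypothetical, the `DD.MM.YYYY` date format is inferred from the sample rows above, and mapping unparseable values to `None` is an assumption.

```python
from datetime import datetime


def snake_case(name):
    """Turn a column name like 'Start Date' into 'start_date'."""
    return "_".join(name.strip().lower().split())


def parse_date(text):
    """Parse 'DD.MM.YYYY' (format assumed from the sample); None if not parseable."""
    try:
        return datetime.strptime(text, "%d.%m.%Y").date()
    except (TypeError, ValueError):
        return None


def parse_amount(text):
    """Convert an amount string to float. Placeholders such as
    'data not included in P3' become None (an assumption)."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


# Values copied from the sample rows above
print(snake_case("Project Number String"))
print(parse_date("01.04.2001"))
print(parse_amount("343213.00"))
print(parse_amount("data not included in P3"))
```

On the full table, `grants.rename(columns=snake_case)` would apply the renaming. For the date and amount columns, the vectorised `pd.to_datetime(..., format="%d.%m.%Y")` and `pd.to_numeric(..., errors="coerce")` are probably a better fit than row-wise converters.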

#### Tidiness

##### `grants` table

- `University` holds both a long name and a short code in one column (e.g. `University of Geneva - GE`).
- `Funding Instrument` and `Funding Instrument Hierarchy` are confusing.
- `Discipline Number`, `Discipline Name`, and `Discipline Name Hierarchy` are confusing.
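
As a rough illustration of the `University` issue, the combined value could be split at its last ` - ` separator. This is a sketch only: `split_university` is a hypothetical helper, and the assumption that every value follows the `long name - CODE` pattern is based solely on the sample rows shown earlier.

```python
def split_university(value, sep=" - "):
    """Split 'Long name - CODE' at the last separator into (name, code).

    Values without the separator (or missing values) are returned
    unchanged with a code of None.
    """
    if not isinstance(value, str) or sep not in value:
        return value, None
    name, _, code = value.rpartition(sep)
    return name.strip(), code.strip()


# Values copied from the sample rows
for raw in ["University of Geneva - GE",
            "EPF Lausanne - EPFL",
            "Institution abroad - IACH"]:
    print(split_university(raw))
```

Splitting at the last separator keeps any ` - ` inside the long name intact. In pandas, `grants["University"].str.rsplit(" - ", n=1, expand=True)` should give a comparable two-column result in one call.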

## Clean

#### Define

#### Code

#### Test