# Göttingen Campus Institutions

This notebook builds a list of GRID IDs for the member institutions of the Göttingen Campus and writes it to a file for use in other notebooks.

Author: Andreas Lüschow

2021/07/08

-----

## Imports

In [1]:
# run imports
%run ../imports.ipynb

# import constants from constants notebook
%run ../constants.ipynb

# import methods from utils notebook
%run ../utils.ipynb


    [Errno 2] No such file or directory: '../../output/goe_campus_grid_ids.txt'
    Consider creating a file with GRID IDs from the Göttingen Campus using the Jupyter notebook './jupyter/data/goettingen_campus_institutions' first!
    (You may ignore this error if you do not need to import GRID IDs inside the current notebook.)
    
env: GOOGLE_APPLICATION_CREDENTIALS=../../../bigquery_credentials.json


## Setting up the Google BigQuery Client

In [2]:
client = get_bq_client()

-----

## Inspect the data

- Total number of publications in Dimensions data
- Structure of Dimensions grid table

In [3]:
# get total number of publications in Dimensions data
sql = f"""
SELECT
    count(*) as nr_of_publications
FROM {DS_PUBLICATIONS}
"""

q = client.query(sql)
q.to_dataframe()

|   | nr_of_publications |
|---|---|
| 0 | 512769 |


In [4]:
# look into the grid table
sql = f"""
SELECT
    id,
    name
FROM {DS_GRID}
"""

q = client.query(sql)
q.to_dataframe().head(10)

|   | id | name |
|---|---|---|
| 0 | grid.437499.4 | Gpack (France) |
| 1 | grid.488233.6 | Bristol-Myers Squibb (Switzerland) |
| 2 | grid.447587.d | Virgin Islands Humanities Council |
| 3 | grid.432141.1 | Biotec (United Kingdom) |
| 4 | grid.429701.9 | Wellness Pointe |
| 5 | grid.430069.8 | OncoDetect (United States) |
| 6 | grid.435281.9 | Helping Others in a Positive Environment |
| 7 | grid.425513.4 | Bodycote (United Kingdom) |
| 8 | grid.507886.4 | Ontario Turtle Conservation Centre |
| 9 | grid.410734.5 | Jiangsu Provincial Center for Disease Control ... |


## Get Göttingen Campus Institutions

- Find candidates using regular expressions.
- Check the resulting list manually.
- Create a final list of Göttingen Campus GRID IDs.
- Write this list to a file for further use in other notebooks.

In [5]:
# find possible institutions from the Göttingen Campus using regular expressions
# (raw f-string, so backslashes such as \s reach BigQuery unchanged)
sql = rf"""
SELECT
    id,
    name
FROM {DS_GRID}
WHERE
    REGEXP_CONTAINS(name, r"[gG](ö|oe)ttingen")
    OR REGEXP_CONTAINS(name, r"Max[\s-]?Planck")
    OR REGEXP_CONTAINS(name, r"Primaten")
    OR REGEXP_CONTAINS(name, r"MPI")
    OR REGEXP_CONTAINS(name, r"DLR")
    OR REGEXP_CONTAINS(name, r"Deutsches Zentrum")
    OR REGEXP_CONTAINS(name, r"Primate")
    OR REGEXP_CONTAINS(name, r"German Aerospace")
"""

q = client.query(sql)
df = q.to_dataframe()

In [6]:
save(df, f"{OUTPUT_FOLDER}possible_goe_campus_institutions.csv")
df

|   | id | name |
|---|---|---|
| 0 | grid.507417.0 | Biomedicine Research Institute of Buenos Aires... |
| 1 | grid.494632.e | Max Planck Institutes Library |
| 2 | grid.419552.e | Max Planck Institute for Solid State Research |
| 3 | grid.419534.e | Max Planck Institute for Intelligent Systems |
| 4 | grid.429508.2 | Max Planck Institute for Astronomy |
| ... | ... | ... |
| 118 | grid.421937.a | MPI Research (United States) |
| 119 | grid.410436.4 | Oregon National Primate Research Center |
| 120 | grid.453465.5 | Primate Conservation |
| 121 | grid.507517.1 | Max Planck-Bristol Centre for Minimal Biology |
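
The candidate list is intentionally broad: entries such as "MPI Research (United States)" or "Oregon National Primate Research Center" match a pattern without belonging to the Göttingen Campus, which is why the manual check follows. The sketch below reproduces a subset of the patterns locally so this behaviour can be explored without querying BigQuery. It assumes Python's `re` module as a stand-in for BigQuery's RE2 engine (the two agree for these simple patterns); `PATTERNS` and `is_candidate` are hypothetical names that do not exist in the notebook.

```python
import re

# Subset of the patterns used in the query. Python's `re` is only a local
# stand-in for BigQuery's RE2 engine (assumption: equivalent for these patterns).
PATTERNS = [
    r"Max[\s-]?Planck",
    r"MPI",
    r"Primate",
    r"German Aerospace",
]


def is_candidate(name: str) -> bool:
    """Return True if any pattern occurs anywhere in the name."""
    return any(re.search(pattern, name) for pattern in PATTERNS)


# Institution names taken from the outputs shown in this notebook.
names = [
    "Max Planck Institute for Astronomy",
    "MPI Research (United States)",
    "Oregon National Primate Research Center",
    "Wellness Pointe",
]

for name in names:
    print(f"{name}: {is_candidate(name)}")
```

Only "Wellness Pointe" is rejected here; the other three pass the pattern check even though two of them are unrelated to Göttingen.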


In [7]:
# manually select the member institutions of the Göttingen Campus from the candidates
# (see https://goettingen-campus.de/de/ueber-uns/translate-to-deutsch-members for a list of members)
members = [
    "University of Göttingen",
    "Universitätsmedizin Göttingen",
    "Max Planck Institute for Biophysical Chemistry",
    "Max Planck Institute for Dynamics and Self-Organization",
    "Max Planck Institute of Experimental Medicine",
    "German Primate Center",
    # the following institutions are deliberately excluded
    # "Max Planck Institute for Solar System Research",
    # "Max Planck Institute for the Study of Religious and Ethnic Diversity",
    # "Göttingen Academy of Sciences and Humanities",
    # "German Aerospace Center"
]
# keep only rows whose name matches a member exactly (row order of df is preserved)
goe_campus = df[df["name"].isin(members)]
grid_ids = tuple(goe_campus["id"])
grid_ids

('grid.418140.8',
 'grid.418215.b',
 'grid.411984.1',
 'grid.7450.6',
 'grid.419514.c',
 'grid.419522.9')
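
Because the selection relies on exact name matches, a member whose GRID name is spelled differently would be dropped silently. A minimal sketch of a sanity check follows; `unmatched_members` is a hypothetical helper that is not part of the notebook, and the sample candidate names are made up for illustration (including the deliberately altered spelling "Centre").

```python
def unmatched_members(members, candidate_names):
    """Return the member names that do not appear verbatim among the candidates."""
    available = set(candidate_names)
    return [member for member in members if member not in available]


# Made-up sample data: the second candidate differs in spelling on purpose.
sample_members = ["University of Göttingen", "German Primate Center"]
sample_candidates = ["University of Göttingen", "German Primate Centre"]

missing = unmatched_members(sample_members, sample_candidates)
print(missing)
```

In the notebook, the same check could be run with `members` and `df["name"]`; an empty result would indicate that every listed member was found.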

In [8]:
# write GRID IDs to file
save_grid_ids(grid_ids)
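
`save_grid_ids` is defined in the utils notebook and is not shown here. As an illustration only, the sketch below assumes a plain-text format with one GRID ID per line and shows how such a file could be written and read back. `write_ids` and `read_ids` are hypothetical names, and the actual helper may use a different format or location.

```python
import tempfile
from pathlib import Path


def write_ids(ids, path):
    """Write one ID per line (assumed format, not necessarily that of save_grid_ids)."""
    Path(path).write_text("\n".join(ids) + "\n", encoding="utf-8")


def read_ids(path):
    """Read the IDs back, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Two of the GRID IDs from the output above, written to a temporary directory.
ids = ("grid.418140.8", "grid.7450.6")
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "goe_campus_grid_ids.txt"
    write_ids(ids, target)
    restored = read_ids(target)

print(restored)
```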