# Using the search-API for the Norwegian Central Coordinating Register for Legal Entities


In [None]:
!pip install requests

In [1]:
import requests

## Example: Get organisations with a certain activity type (NACE-code)


Let's say we want a list of organisations whose activity type is public libraries. Two NACE-codes are relevant: 91.011 and 91.012. The search-API offers a [large set of properties](https://data.brreg.no/enhetsregisteret/api/docs/index.html#enheter-sok-detaljer) that can be used to narrow the search, passed as query parameters to the API. In our case, we use the parameter naeringskode, which takes a list of NACE-codes. In addition to the list of NACE-codes, we specify that we want the first page (i.e. 0), as well as the size of each page. To make the output easier to read we choose a low number, 2.

In [2]:
# NACE-codes are written as strings so that a code ending in 0 keeps its exact form
query_parametre = { 'naeringskode': ['91.011', '91.012'], 'page': 0, 'size': 2 }
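
To see roughly what these parameters look like once they are part of the URL, the sketch below encodes them with the standard library's `urllib.parse.urlencode`. It assumes that a list value is sent as one repeated key per item, which is what `doseq=True` does; the names `demo_params` and `query_string` are only for illustration.

```python
from urllib.parse import urlencode

# Illustrative copy of the query parameters, with NACE-codes written as strings
demo_params = {'naeringskode': ['91.011', '91.012'], 'page': 0, 'size': 2}

# doseq=True repeats the key once per list item instead of encoding the list itself
query_string = urlencode(demo_params, doseq=True)
print(query_string)
# naeringskode=91.011&naeringskode=91.012&page=0&size=2
```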

Now we are ready to run the request and save the result in the variable req:

In [3]:
req = requests.get('https://data.brreg.no/enhetsregisteret/api/enheter', params=query_parametre)

As we have requested a small size, we could show the result immediately, but it is good practice to first check that the request was successful, i.e. that the status code is in the 2xx range:

In [4]:
req.status_code

200
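
The 2xx check can be written as a small helper. This is a minimal sketch; the function name `is_success` is hypothetical and not part of requests. With requests itself, calling `req.raise_for_status()` is an alternative that raises an exception for 4xx and 5xx responses.

```python
def is_success(status_code: int) -> bool:
    """Return True for any HTTP status code in the 2xx range."""
    return 200 <= status_code < 300

print(is_success(200))  # True
print(is_success(400))  # False
```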

If the status_code is 200, we can use the text-attribute to see the results. But since we are using an API that serves JSON, the formatting is better if we use the json-method:

In [16]:
req.json()

{'_embedded': {'enheter': [{'organisasjonsnummer': '920989330',
    'navn': 'F.DELBANCO GMBH & CO.KG',
    'organisasjonsform': {'kode': 'NUF',
     'beskrivelse': 'Norskregistrert utenlandsk foretak',
     '_links': {'self': {'href': 'https://data.brreg.no/enhetsregisteret/api/organisasjonsformer/NUF'}}},
    'registreringsdatoEnhetsregisteret': '2018-07-13',
    'registrertIMvaregisteret': False,
    'naeringskode1': {'beskrivelse': 'Drift av fag- og forskningsbiblioteker',
     'kode': '91.012'},
    'antallAnsatte': 0,
    'forretningsadresse': {'land': 'Tyskland',
     'landkode': 'DE',
     'poststed': '21339 Lüneburg',
     'adresse': ['Bessemerstrasse 3']},
    'registrertIForetaksregisteret': False,
    'registrertIStiftelsesregisteret': False,
    'registrertIFrivillighetsregisteret': False,
    'konkurs': False,
    'underAvvikling': False,
    'underTvangsavviklingEllerTvangsopplosning': False,
    'maalform': 'Bokmål',
    '_links': {'self': {'href': 'https://data.brreg.no
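
The nested dictionary can still be hard to scan. As a sketch using only the standard library, `json.dumps` with `indent` prints one key per line. The `fragment` below is a hand-copied part of the response above, used only for illustration; with a real response one would pass `req.json()` instead.

```python
import json

# Hand-copied fragment of the response shown above, for illustration only
fragment = {
    'organisasjonsnummer': '920989330',
    'navn': 'F.DELBANCO GMBH & CO.KG',
    'forretningsadresse': {'land': 'Tyskland', 'poststed': '21339 Lüneburg'},
}

# indent adds line breaks and nesting; ensure_ascii=False keeps 'ü' readable
pretty = json.dumps(fragment, indent=2, ensure_ascii=False)
print(pretty)
```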

### Storing the relevant data in a separate variable
Let's save the relevant part of the first response to a variable result. The relevant part is a list located under _embedded > enheter. First we create the variable result as an empty list, then we add the relevant content.

In [1]:
result = []
result += req.json()['_embedded']['enheter']

In [31]:
result

[{'organisasjonsnummer': '920989330',
  'navn': 'F.DELBANCO GMBH & CO.KG',
  'organisasjonsform': {'kode': 'NUF',
   'beskrivelse': 'Norskregistrert utenlandsk foretak',
   '_links': {'self': {'href': 'https://data.brreg.no/enhetsregisteret/api/organisasjonsformer/NUF'}}},
  'registreringsdatoEnhetsregisteret': '2018-07-13',
  'registrertIMvaregisteret': False,
  'naeringskode1': {'beskrivelse': 'Drift av fag- og forskningsbiblioteker',
   'kode': '91.012'},
  'antallAnsatte': 0,
  'forretningsadresse': {'land': 'Tyskland',
   'landkode': 'DE',
   'poststed': '21339 Lüneburg',
   'adresse': ['Bessemerstrasse 3']},
  'registrertIForetaksregisteret': False,
  'registrertIStiftelsesregisteret': False,
  'registrertIFrivillighetsregisteret': False,
  'konkurs': False,
  'underAvvikling': False,
  'underTvangsavviklingEllerTvangsopplosning': False,
  'maalform': 'Bokmål',
  '_links': {'self': {'href': 'https://data.brreg.no/enhetsregisteret/api/enheter/920989330'}}},
 {'organisasjonsnummer'
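
Indexing with `['_embedded']['enheter']` raises a `KeyError` if either key is missing. The sketch below assumes (this is not confirmed by the text above) that a search without hits may return a response without the `_embedded` key, and falls back to an empty list in that case. The helper name `extract_enheter` and both payloads are made up for illustration.

```python
def extract_enheter(payload: dict) -> list:
    """Return the list under _embedded > enheter, or an empty list if absent."""
    return payload.get('_embedded', {}).get('enheter', [])

# Two hand-made payloads: one with a hit, one without the '_embedded' key
with_hits = {'_embedded': {'enheter': [{'organisasjonsnummer': '920989330'}]}}
without_hits = {'page': {'size': 2, 'totalElements': 0, 'totalPages': 0, 'number': 0}}

print(len(extract_enheter(with_hits)))     # 1
print(len(extract_enheter(without_hits)))  # 0
```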

### Accessing the full result set with "page" and "size"
Depending on the parameters, your response likely contains only part of the total result. The last element in the JSON-response holds information about how many elements and how many pages of results there are. Since the response is a dictionary, we can look at this element by using 'page' as the key:

In [18]:
req.json()['page']

{'size': 2, 'totalElements': 21, 'totalPages': 11, 'number': 0}
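
The numbers in this element appear to be related in the usual way for paged results: totalPages is totalElements divided by size, rounded up. A small sketch of that arithmetic, using a hand-copied version of the dictionary above (the variable names are illustrative):

```python
import math

# Hand-copied from the output above
page_info = {'size': 2, 'totalElements': 21, 'totalPages': 11, 'number': 0}

# Number of pages needed to hold all elements at the given page size
expected_pages = math.ceil(page_info['totalElements'] / page_info['size'])
print(expected_pages)  # 11

# Pages are numbered from 0, so the last page has number totalPages - 1
last_page_number = page_info['totalPages'] - 1
print(last_page_number)  # 10
```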

To move forward in the list of pages, we can just update the 'page' entry in the dictionary query_parametre:


In [42]:
query_parametre['page'] += 1

Then we can re-run the request.


In [43]:
req = requests.get('https://data.brreg.no/enhetsregisteret/api/enheter', params=query_parametre)

Let's check if we got page two (which has number 1 ...):

In [44]:
req.json()['page']

{'size': 2, 'totalElements': 21, 'totalPages': 11, 'number': 1}

Now we can update our variable result with the new data by repeating the code that copied the relevant data to our variable result:

In [45]:
result += req.json()['_embedded']['enheter']

If our calculations are correct, the variable result should now be a list with four elements, each representing an organisation:

In [35]:
len(result)

4

Let's get a summary of the identifier, name and type of these elements. In Python this can be done with a for-loop, iterating through each element and extracting the relevant information. It can also be written as a list-comprehension as follows:

In [46]:
[(e['organisasjonsnummer'], e['navn'], e['organisasjonsform']['kode']) for e in result]

[('920989330', 'F.DELBANCO GMBH & CO.KG', 'NUF'),
 ('984456573', 'STIFTELSEN DE ANKERSKE SAMLINGER', 'STI'),
 ('984456573', 'STIFTELSEN DE ANKERSKE SAMLINGER', 'STI'),
 ('916087330', "SIGMUND HALDÅS' SAMLINGER", 'FLI'),
 ('982132541', 'KARASJOGA GIELDA / KARASJOK KOMMUNE BOKBUSSEN', 'ORGL'),
 ('986418989', 'HEMSEDAL HISTORIELAG', 'FLI')]
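
For comparison, the same summary written as an explicit for-loop. The `sample` list is a hand-copied subset of the fields for two of the organisations above, so the sketch runs without calling the API.

```python
# Hand-copied subset of the fields for two of the organisations shown above
sample = [
    {'organisasjonsnummer': '920989330', 'navn': 'F.DELBANCO GMBH & CO.KG',
     'organisasjonsform': {'kode': 'NUF'}},
    {'organisasjonsnummer': '984456573', 'navn': 'STIFTELSEN DE ANKERSKE SAMLINGER',
     'organisasjonsform': {'kode': 'STI'}},
]

summary = []
for e in sample:
    # Pick identifier, name and organisation type from each element
    summary.append((e['organisasjonsnummer'], e['navn'], e['organisasjonsform']['kode']))

print(summary)
```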

The result list is not kept static after the first request, so its content can change in the time between accessing the different pages. An organisation listed at the bottom of one page may, for example, become the first on the next page. When using the results, it is therefore a good idea to remove duplicates.
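
One way to remove duplicates is to keep only the first occurrence of each organisasjonsnummer. This is a minimal sketch; the helper name `remove_duplicates` and the `collected` list are made up for illustration.

```python
def remove_duplicates(enheter: list) -> list:
    """Keep the first occurrence of each organisasjonsnummer, preserving order."""
    seen = set()
    unique = []
    for e in enheter:
        orgnr = e['organisasjonsnummer']
        if orgnr not in seen:
            seen.add(orgnr)
            unique.append(e)
    return unique

collected = [
    {'organisasjonsnummer': '920989330'},
    {'organisasjonsnummer': '984456573'},
    {'organisasjonsnummer': '984456573'},  # same organisation fetched twice
    {'organisasjonsnummer': '916087330'},
]
print([e['organisasjonsnummer'] for e in remove_duplicates(collected)])
# ['920989330', '984456573', '916087330']
```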

### Automatically fetching all pages
Manually fetching each page is a bit tedious, even though our example has no more than eleven pages. There are only 21 elements in total, so if we increased the size parameter in the request we would get all elements in one response. Alternatively, you can loop through the pages and add the elements to your variable result.

*Warning*

There is an *important limitation* in the search-API demonstrated here that caps how many organisations you can reach: if (page+1)*size exceeds 10 000, the API returns an error (HTTP 400, "Bad request"). It is not relevant in our example, but if the first page of a response reports a "totalElements" of more than 10 000, it will not be possible to get past the first 10 000 elements. Instead, you should use the functionality to download the full dataset. See [REF] for how to do that.
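
The rule can be checked before sending a request. The sketch below only restates the arithmetic from the warning; the names `MAX_REACHABLE` and `within_limit` are hypothetical.

```python
MAX_REACHABLE = 10_000  # limit described in the warning above

def within_limit(page: int, size: int) -> bool:
    """True if (page + 1) * size does not exceed the limit."""
    return (page + 1) * size <= MAX_REACHABLE

print(within_limit(0, 50))     # True:  1 * 50 = 50
print(within_limit(99, 100))   # True:  100 * 100 = 10 000
print(within_limit(100, 100))  # False: 101 * 100 = 10 100
```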

*Alternative 1*, changing the size of each page to a high enough number. In our case 50 is more than enough. Don't forget to make sure you ask for the first page, in case you have set the page-parameter earlier:

In [48]:
result_full = []
query_parametre = { 'naeringskode': ['91.011', '91.012'], 'page': 0, 'size': 50 }
req = requests.get('https://data.brreg.no/enhetsregisteret/api/enheter', params=query_parametre)

In [49]:
result_full += req.json()['_embedded']['enheter']

The list result_full should now have the same length as the number reported in "totalElements".

In [58]:
req.json()['page']['totalElements'] == len(result_full)

True

*Alternative 2*, looping through all the pages and adding the elements of each page to a list:

In [56]:
result_full_through_looping = []
query_parametre = { 'naeringskode': ['91.011', '91.012'], 'page': 0, 'size': 2 }
url = 'https://data.brreg.no/enhetsregisteret/api/enheter'

req = requests.get(url, params=query_parametre) # First request, only to learn the total number of pages

total_pages = req.json()['page']['totalPages']

while query_parametre['page'] < total_pages:
    req = requests.get(url, params=query_parametre)
    result_full_through_looping += req.json()['_embedded']['enheter']

    # prepare for next iteration, updating the page-parameter in the request
    query_parametre['page'] += 1



In [59]:
req.json()['page']['totalElements'] == len(result_full_through_looping)

True
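
The loop above requests the first page twice: once to learn the number of pages and once to collect its elements. As a sketch of an alternative, the generator below reads totalPages from each response as it goes. The function that fetches a page is passed in as an argument, so the example can run against a stand-in instead of the real API; the names `iter_enheter`, `fetch_page` and `fake_fetch` are hypothetical.

```python
from typing import Callable, Iterator

def iter_enheter(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every element, asking fetch_page for one page number at a time."""
    page = 0
    while True:
        payload = fetch_page(page)
        yield from payload.get('_embedded', {}).get('enheter', [])
        page += 1
        # Stop once the page count reported in the response has been reached
        if page >= payload['page']['totalPages']:
            break

# Stand-in for the API: three elements served two per page
def fake_fetch(page: int) -> dict:
    data = ['A', 'B', 'C']
    chunk = data[page * 2:(page + 1) * 2]
    return {'_embedded': {'enheter': [{'navn': n} for n in chunk]},
            'page': {'size': 2, 'totalElements': 3, 'totalPages': 2, 'number': page}}

print([e['navn'] for e in iter_enheter(fake_fetch)])
# ['A', 'B', 'C']
```

With requests, `fetch_page` could be a small function that sets `query_parametre['page']` and returns `requests.get(url, params=query_parametre).json()`.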

In [73]:
import pandas as pd  # needed for the DataFrame steps that follow
df = pd.read_json('er.json', compression='gzip')

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1037749 entries, 0 to 1037748
Data columns (total 43 columns):
 #   Column                                        Non-Null Count    Dtype  
---  ------                                        --------------    -----  
 0   Organisasjonsnummer                           1037749 non-null  int64  
 1   Navn                                          1037749 non-null  object 
 2   Organisasjonsform.kode                        1037749 non-null  object 
 3   Organisasjonsform.beskrivelse                 1037749 non-null  object 
 4   Næringskode 1                                 982062 non-null   float64
 5   Næringskode 1.beskrivelse                     982062 non-null   object 
 6   Næringskode 2                                 38573 non-null    float64
 7   Næringskode 2.beskrivelse                     38573 non-null    object 
 8   Næringskode 3                                 1562 non-null     float64
 9   Næringskode 3.beskrivelse          

In [20]:
df_sample = df.sample(10000)

In [21]:
len(df_sample)

10000

In [22]:
df_sample.describe()

| | Organisasjonsnummer | Næringskode 1 | Næringskode 2 | Næringskode 3 | Hjelpeenhetskode | Antall ansatte | Postadresse.postnummer | Postadresse.kommunenummer | Forretningsadresse.postnummer | Forretningsadresse.kommunenummer | Institusjonell sektorkode | Siste innsendte årsregnskap | Overordnet enhet i offentlig sektor |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 9447.0 | 383.0 | 20.0 | 196.0 | 10000.0 | 1892.0 | 1892.0 | 9461.0 | 9461.0 | 9153.0 | 3360.0 | 21.0 |
| mean | 943430400.0 | 59.785001 | 42.585413 | 30.13635 | 65.342204 | 2.747 | 3796.597252 | 2774.402748 | 3902.312335 | 2916.015537 | 5575.314105 | 2019.007738 | 947032200.0 |
| std | 43794870.0 | 29.479587 | 32.339305 | 33.282255 | 16.788348 | 41.7555 | 2851.340814 | 1739.036626 | 2616.633611 | 1646.625832 | 2859.231711 | 1.147087 | 36414210.0 |
| min | 811548800.0 | 0.0 | 0.0 | 1.11 | 0.0 | 0.0 | 28.0 | 301.0 | 26.0 | 301.0 | 1120.0 | 2002.0 | 864966000.0 |
| 25% | 918494300.0 | 43.341 | 2.2 | 2.0525 | 68.209 | 0.0 | 1337.0 | 1103.0 | 1593.0 | 1506.0 | 2100.0 | 2019.0 | 939607700.0 |
| 50% | 926062100.0 | 68.209 | 68.209 | 5.1555 | 70.1 | 0.0 | 3455.5 | 3024.0 | 3619.0 | 3030.0 | 7000.0 | 2019.0 | 959412300.0 |
| 75% | 985455600.0 | 86.211 | 70.1 | 54.82125 | 70.1 | 0.0 | 5863.0 | 4224.25 | 5690.0 | 4225.0 | 8200.0 | 2019.0 | 964963600.0 |
| max | 999666800.0 | 97.0 | 96.01 | 81.101 | 82.99 | 2350.0 | 9991.0 | 5444.0 | 9990.0 | 5444.0 | 8300.0 | 2020.0 | 991012100.0 |


In [87]:
df_sample['organisasjonsnummer'] = df_sample['organisasjonsnummer'].astype('string')

In [89]:
df_sample['overordnetEnhet'] = df_sample['overordnetEnhet'].astype('string')

In [91]:
df.columns

Index(['organisasjonsnummer', 'navn', 'organisasjonsform',
       'registreringsdatoEnhetsregisteret', 'registrertIMvaregisteret',
       'naeringskode1', 'antallAnsatte', 'forretningsadresse',
       'institusjonellSektorkode', 'registrertIForetaksregisteret',
       'registrertIStiftelsesregisteret', 'registrertIFrivillighetsregisteret',
       'konkurs', 'underAvvikling',
       'underTvangsavviklingEllerTvangsopplosning', 'maalform', 'links',
       'stiftelsesdato', 'postadresse', 'naeringskode2', 'hjemmeside',
       'sisteInnsendteAarsregnskap', 'frivilligMvaRegistrertBeskrivelser',
       'naeringskode3', 'overordnetEnhet'],
      dtype='object')

In [28]:
df['Stiftelsesdato']

0                 NaN
1                 NaN
2          1990-01-30
3                 NaN
4                 NaN
              ...    
1037744    2021-04-20
1037745           NaN
1037746           NaN
1037747    2021-03-26
1037748    2021-02-05
Name: Stiftelsesdato, Length: 1037749, dtype: object

In [44]:
# Without errors='coerce' this raises an error. infer_datetime_format is there to save time by trying ISO 8601.
pd.to_datetime(df['Stiftelsesdato'], errors='coerce', infer_datetime_format=True)

0                NaT
1                NaT
2         1990-01-30
3                NaT
4                NaT
             ...    
1037744   2021-04-20
1037745          NaT
1037746          NaT
1037747   2021-03-26
1037748   2021-02-05
Name: Stiftelsesdato, Length: 1037749, dtype: datetime64[ns]
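
One likely reason for the error without errors='coerce' is that pandas timestamps with nanosecond resolution only cover dates from roughly 1677 to 2262, while this dataset contains foundation dates such as 1277-09-13. With errors='coerce' such values become NaT. If the old dates matter, the standard library's `datetime.date` can represent them. The sketch below is one possible approach, not the notebook's method; the helper name `parse_iso_date` is hypothetical.

```python
from datetime import date
from typing import Optional

def parse_iso_date(value) -> Optional[date]:
    """Parse 'YYYY-MM-DD' strings; return None for missing or malformed values."""
    if not isinstance(value, str):
        return None  # e.g. NaN (a float) or None
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None

values = ['1277-09-13', float('nan'), '2021-04-20']
print([parse_iso_date(v) for v in values])
# [datetime.date(1277, 9, 13), None, datetime.date(2021, 4, 20)]
```

Applied to the DataFrame, something like `df['Stiftelsesdato'].map(parse_iso_date)` should give a column of plain Python date objects rather than a datetime64 column.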

In [41]:
df.loc[pd.notnull(df['Stiftelsesdato']), ['Organisasjonsform.kode', 'Stiftelsesdato']].sort_values('Stiftelsesdato')

| | Organisasjonsform.kode | Stiftelsesdato |
|---|---|---|
| 948075 | STI | 1277-09-13 |
| 798696 | STI | 1538-12-31 |
| 46181 | ANNA | 1550-12-31 |
| 986540 | STI | 1635-12-31 |
| 892130 | ANS | 1671-12-31 |
| ... | ... | ... |
| 1032596 | KBO | 2021-04-21 |
| 1032590 | KBO | 2021-04-21 |
| 1032572 | KBO | 2021-04-21 |
| 1035689 | KBO | 2021-04-21 |
