# Accessing the National Library of Norway

## Part II: N-grams and galaxies

by Koenraad De Smedt at UiB
(based on materials from the National Library)

---
The National Library of Norway (Nasjonalbiblioteket, NB) offers access to its collections from Python or R. This notebook gives a few examples of how to compute and plot frequencies of words and n-grams over time and how to show galaxies of related words. For more information, see [DH-lab at NB](https://dh.nb.no) (in Norwegian).

---

The `dhlab` package needs to be installed and imported first.

In [None]:
!pip install dhlab

In [None]:
import dhlab as dh
from dhlab import Ngram, NgramBook, NgramNews
from dhlab.ngram.nb_ngram import nb_ngram
from dhlab.api.nb_ngram_api import make_word_graph
from dhlab import graph_networkx_louvain as gnl

## Ngram

Get relative frequencies (percentages) of words per year, over a given period. The result is a dataframe which can be plotted or further processed.
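
As a rough illustration of what a relative frequency in percent means, the sketch below divides a word's yearly count by a yearly total. The function name `relative_frequency`, the choice of denominator (all tokens in that year) and the figures are assumptions made for this example; they are not taken from the NB service.

```python
def relative_frequency(count, total):
    """Return count as a percentage of total (0.0 if total is zero)."""
    return 100.0 * count / total if total else 0.0


# Made-up yearly figures, for illustration only (not NB data):
# year -> (occurrences of the word, all tokens in that year)
yearly = {1910: (50, 200_000), 1911: (90, 300_000), 1912: (0, 0)}

percentages = {year: relative_frequency(c, t) for year, (c, t) in yearly.items()}
print(percentages)
```

Normalising by a yearly total is what makes years with very different amounts of text comparable.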

`smooth` is a parameter for smoothing the curve: the relative frequency for each year is computed as the mean over a number of preceding and following years. `lw` is the line width.
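
The idea behind smoothing can be sketched with a plain centred moving mean. This is only an assumption about the general technique: the helper `centred_mean` is hypothetical, the values are made up, and the window that `dhlab` actually applies may be weighted or handle edges differently.

```python
def centred_mean(values, window):
    """Smooth yearly values with a centred moving mean.

    Assumption: each year is replaced by the unweighted mean of the values
    in a window centred on it; near the edges the window is truncated.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        neighbours = values[lo:hi]
        smoothed.append(sum(neighbours) / len(neighbours))
    return smoothed


# Made-up yearly percentages, for illustration only (not NB data).
raw = [1.0, 3.0, 2.0, 6.0, 3.0]
smooth3 = centred_mean(raw, window=3)
print(smooth3)
```

A larger window gives a flatter curve, at the cost of hiding short-lived peaks.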

In [None]:
Ngram(words=['i og med'],
      from_year=1910, to_year=2020, 
      doctype='avis').plot(smooth=9, kind='line', figsize=(10, 6), lw=3)

Compare two words over time.

In [None]:
Ngram(words=['det', 'der'],
      from_year=1810, to_year=2000, 
      doctype='avis').plot(smooth=9, kind='line', figsize=(10, 6), lw=3)

Truncation: a trailing `*` matches any continuation of the word.

In [None]:
Ngram(words=['universitets*'],
      from_year=1920, to_year=2000, 
      doctype='avis').plot(smooth=9, kind='line', figsize=(10, 6), lw=3)
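
To see which word forms a truncated pattern would cover, the matching can be sketched locally with the standard library. The vocabulary and the helper `match_truncated` are invented for this example; how the NB service itself selects and displays matches is not shown here.

```python
from fnmatch import fnmatchcase

# Made-up vocabulary, for illustration only (not NB data).
vocabulary = [
    "universitetet",
    "universitetsbiblioteket",
    "universitetslektor",
    "universiteter",
    "skole",
]


def match_truncated(pattern, words):
    """Return the words that match a `*`-truncated pattern, sorted."""
    return sorted(w for w in words if fnmatchcase(w, pattern))


matches = match_truncated("universitets*", vocabulary)
print(matches)
```

Note that *universitetet* is not matched, because the pattern requires the linking *s* after the stem.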

Compare two bigrams (two-word sequences).

In [None]:
Ngram(words=['min arm', 'armen min'],
      from_year=1910, to_year=2000, 
      doctype='avis').plot(smooth=9, kind='line', figsize=(10, 6), lw=3)
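
For readers new to n-grams, the following sketch shows how bigrams can be extracted and counted from a tokenised text. The sentence is made up, the helper `ngrams` is hypothetical, and splitting on whitespace is a simplifying assumption; the tokenisation used by NB may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A made-up sentence, for illustration only.
tokens = "jeg løftet armen min og så på armen min".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts["armen min"], bigram_counts["min arm"])
```

The same function gives trigrams with `n=3`, which is the unit behind a query such as `'i og med'`.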

We can take the dataframe of the `Ngram` result and do other things with it.

In [None]:
hanhun = Ngram(words=['han', 'hun'],
               from_year=1950, to_year=1970,
               doctype='bok').frame
hanhun.head(10).style.background_gradient(cmap="Reds", axis=None)

In [None]:
hanhun['hun'].loc['1954']
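
Further processing could look like the sketch below, which runs on a small stand-in dataframe so that it works without network access. The values are made up, and the string year labels only mirror the lookup above; the real frame's index type and column layout are assumptions.

```python
import pandas as pd

# Made-up percentages, for illustration only (not NB data).
toy = pd.DataFrame(
    {"han": [0.80, 0.78, 0.75], "hun": [0.20, 0.26, 0.30]},
    index=["1950", "1951", "1952"],
)

# Ratio of 'hun' to 'han' per year, added as a new column.
toy["hun_per_han"] = toy["hun"] / toy["han"]

# Year with the highest ratio, and the change in 'hun' over the period.
peak_year = toy["hun_per_han"].idxmax()
hun_change = toy["hun"].iloc[-1] - toy["hun"].iloc[0]
print(toy.round(3))
print(peak_year, round(hun_change, 2))
```

The same column arithmetic should carry over to `hanhun`, provided its columns are named after the query words.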

### Exercises

1.  Investigate examples of the Norwegian a-ending vs. en-ending for definite nouns, such as *sola* vs. *solen*.
2.  Search for *på* and *paa* from 1900 to 1950. When did a change occur?

## Galaxies

A galaxy is a graph that connects a central word to semantically related words. These connections can be used for several purposes, such as sentiment analysis.

In [None]:
klar_graf = make_word_graph('klar', corpus='all', cutoff=16, leaves=0)

Display the graph.

In [None]:
gnl.show_graph(klar_graf, spread=4)


Show different communities of words, each roughly related to a meaning.

In [None]:
gnl.show_communities(klar_graf)
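
The intuition behind communities can be sketched on a tiny hand-made graph. This is not the algorithm used by `gnl` (its module name suggests Louvain); it is a much simpler stand-in that removes the central word and groups the remaining words by connectivity. The edges and the helper `communities_without_centre` are invented for this example.

```python
from collections import defaultdict

# Made-up edges around the word 'klar', for illustration only (not NB data).
edges = [
    ("klar", "ferdig"), ("klar", "parat"), ("klar", "rede"),
    ("klar", "tydelig"), ("klar", "åpenbar"),
    ("klar", "lys"), ("klar", "blank"),
    ("ferdig", "parat"), ("parat", "rede"),
    ("tydelig", "åpenbar"),
    ("lys", "blank"),
]


def communities_without_centre(edges, centre):
    """Group the centre's neighbours by connectivity once the centre is removed."""
    adjacency = defaultdict(set)
    for a, b in edges:
        if centre in (a, b):
            # Keep the neighbour as a node, but do not link through the centre.
            neighbour = b if a == centre else a
            adjacency.setdefault(neighbour, set())
            continue
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, groups = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


groups = communities_without_centre(edges, "klar")
print(groups)
```

Each group loosely corresponds to one sense of *klar* (ready, evident, bright). A modularity-based method such as Louvain can also split groups that remain connected by a few weak links, which this simple version cannot.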

### Exercises

1.  Choose another word and make a graph. 