# Python For The Digital Humanities, Part 2: Practical Project: H.P. Lovecraft's *At the Mountains of Madness*
Manuel Huth, 2026

This file is part of a Python online course consisting of a reader and several scripts. You can find the entire course on GitHub (https://github.com/talant26). The course aims to introduce scholars of the humanities with no prior knowledge to the basics of Python, demonstrating typical applications in the humanities, such as extracting and visualising information from texts.


## H.P. Lovecraft's *At the Mountains of Madness*
* Source: https://www.gutenberg.org/ebooks/70652
* https://en.wikipedia.org/wiki/At_the_Mountains_of_Madness

## Goal:
* Read the file
* Extract information
* Analyze the extracted data
* Visualize the data

We will get to know **different types of modules** that allow us to **work with text and web data**.

## Recommendations for working with Google Colab:
* Tutorial: https://www.tutorialspoint.com/google_colab/index.htm
* Change the editor so that line numbers are displayed. (Tools -> Settings -> Editor -> show line numbers)
* Google Colab does not permanently store files. To keep your files, download them or save them to Google Drive (https://www.tutorialspoint.com/google_colab/google_colab_saving_work.htm).
* If errors occur when executing individual scripts, please click the 'Run all' button, which can be found either at the top of the menu bar or in the 'Runtime' > 'Run all' tab. This will execute all scripts in the correct order. This helps to avoid errors that may occur if the session has been inactive for too long, for example, or if a script accesses variables that have not been set previously.
* If this does not work, you can try to restart the session (Runtime > Restart session).
* For the scripts to work, make sure you have uploaded the files 'Copyright.txt', 'Mainpart.txt' and 'DoubleNames.csv'.

Important note: The text of the novel 'At the Mountains of Madness' comes from Project Gutenberg ( https://www.gutenberg.org/ebooks/70652 ). The text has been split into two parts, 'Copyright.txt' and 'Mainpart.txt'. **Please note the copyright!**



## Interacting with files





Working with files can be very useful. I recommend opening files with the *with statement* and the *open function*; the file is then closed automatically as soon as the indented block ends. The open function takes the following arguments:
- First of all, the filename in quotes (e.g. 'starwars.txt').
- Then we define the type of access we want. This is a short code that tells the interpreter what we want to do: read a file ('r'), write a file ('w'), or append something to the end of the file ('a'). Note that 'w' creates the file if it does not exist and erases its content if it does. You can combine the letters with a '+': use 'r+' to read and write to an existing file, 'a+' to append to and read from a file, and 'w+' to write and read (like 'w', it creates the file or erases its existing content).
- Sometimes you need to specify the encoding (for example: encoding='utf-8'). This will be explained in more detail during the course.
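
The difference between the access modes can be sketched as follows. The file name 'modes_demo.txt' is only an illustration and not part of the course files; the sketch writes it into a temporary folder so that nothing in your working directory is changed.

```python
import os
import tempfile

# Hypothetical file, used only for this demonstration
demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w' creates the file (or empties it if it already exists)
with open(demo_path, 'w', encoding='utf-8') as outputfile:
  outputfile.write('first line\n')

# 'a' keeps the existing content and adds to the end
with open(demo_path, 'a', encoding='utf-8') as outputfile:
  outputfile.write('second line\n')

# 'r' only reads
with open(demo_path, 'r', encoding='utf-8') as inputfile:
  after_append = inputfile.read()
print(after_append)

# Opening the file with 'w' again discards what was there before
with open(demo_path, 'w', encoding='utf-8') as outputfile:
  outputfile.write('new content\n')

with open(demo_path, 'r', encoding='utf-8') as inputfile:
  after_overwrite = inputfile.read()
print(after_overwrite)
```

If you want to keep what is already in a file, 'a' is usually the safer choice than 'w'.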

Let us look at an example. First, let us open a file that does not exist yet for writing:

In [None]:
# We want writing access to the file 'starwars.txt' (it is created if it does not exist yet)
# It is encoded with 'utf-8'
# We store the accessed file in the variable 'outputfile'
# Using this variable we can access and change the file

with open('starwars.txt', 'w+', encoding='utf-8') as outputfile:

  # Now we write 'A long time ago...' into the file
  outputfile.write('A long time ago...\n')
  outputfile.write('\n')
  outputfile.write('The end.')

Now we want to open the file we just created and read its contents. There are two ways to do this: the *read method* and the *readlines method*.
With the *read method* we can store the entire contents of the file in a single string variable, with the *readlines method* we can store all the lines of the file in a list (i.e. a list where each entry corresponds to one line of the file).
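
The difference between the two methods can be seen in a small side-by-side sketch. It uses a hypothetical two-line file ('read_demo.txt' in a temporary folder) rather than one of the course files.

```python
import os
import tempfile

# Hypothetical file with two lines, created only for this comparison
demo_path = os.path.join(tempfile.mkdtemp(), 'read_demo.txt')
with open(demo_path, 'w', encoding='utf-8') as outputfile:
  outputfile.write('line one\nline two\n')

# read() returns the whole file as one string
with open(demo_path, 'r', encoding='utf-8') as inputfile:
  as_string = inputfile.read()

# readlines() returns a list with one string per line
with open(demo_path, 'r', encoding='utf-8') as inputfile:
  as_list = inputfile.readlines()

print(type(as_string), repr(as_string))
print(type(as_list), as_list)
```

Note that each element returned by readlines() still ends with the line break character '\n'.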

### The read method

Note: The following script will not work if you did not run the script above that created the starwars.txt file.

In [None]:
# We want reading access to the existing file 'starwars.txt'
# It is encoded with 'utf-8'
# We store the accessed file in the variable 'inputfile'
with open('starwars.txt', 'r', encoding='utf-8') as inputfile:

  # Now we store the filecontent as a string in the variable 'content'
  content = inputfile.read()
  print(content)

### The readlines method
Now we want to read one of the files we just uploaded. So let us read the copyright notice ('Copyright.txt') first.

Note: If you have uploaded the file to a different folder (e.g. in your Google Drive), open the file explorer, right-click on the file and choose 'Copy path'.




In [None]:
# We want reading access to the existing file 'Copyright.txt'
# It is encoded with 'utf-8'

# We store the accessed file in the variable 'inputfile'
with open('Copyright.txt', 'r', encoding='utf-8') as inputfile:

  # Now we store the lines of the file as a list in the variable 'content'
  content = inputfile.readlines()
  print(content)

  # Each element still ends with a line break, which we remove before printing
  for element in content:
    print(element.rstrip('\n'))

Instead of using the *readlines method* you can just loop through the elements of the file:

In [None]:
# We want reading access to the existing file 'Copyright.txt'
# It is encoded with 'utf-8'
# We store the accessed file in the variable 'inputfile'
with open('Copyright.txt', 'r', encoding='utf-8') as inputfile:

  # We loop through the lines of the file
  for line in inputfile:

    # We remove the line break at the end of each line before printing it
    print('Content of the line: ' + line.rstrip('\n'))

### Further References
- https://www.w3schools.com/python/python_file_handling.asp
- https://www.geeksforgeeks.org/file-handling-python/
- https://automatetheboringstuff.com/2e/chapter9/

### Project exercise: Read the Textfile

Read the file 'Mainpart.txt' with the **read method**, store the text in a variable and display the content of the variable. The file is encoded with 'utf-8'.

In [None]:
# Space for your code

### Solution

In [None]:
with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()
  print(text)

## String functions

Strings are similar to lists. To access certain parts of a string or to check if a string is inside another string, see the chapter on *Lists*.
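
As a quick reminder, the list-style operations mentioned above look like this when applied to a string. The variable names are chosen only for this illustration.

```python
title = 'At the Mountains of Madness'

# Index 0 is the first character
first_char = title[0]

# A slice returns a part of the string
first_word = title[:2]
last_word = title[-7:]

# 'in' checks whether a string occurs inside another string
has_word = 'Mountains' in title

print(first_char)
print(first_word)
print(last_word)
print(has_word)
```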


But there are also a number of methods specific to strings. Here are some
important ones:

Command    | Explanation
-----------|----------------------------------------------------
endswith() | Checks if a string ends with a specific value (e.g. a character).
find() | Searches for a string inside another string and returns the index of the first occurrence (or -1 if it is not found).
isdigit() | Checks if all characters of a string are digits.
islower() | Checks if all letters of a string are lowercase.
isupper() | Checks if all letters of a string are uppercase.
replace() | Replaces parts of a string with another string.
rfind() | Like find(), but returns the index of the last occurrence.
split() | Splits a string at a separator (like a comma or semicolon for example). The result is stored as a list.
splitlines() | Splits a string into lines. The result is stored as a list.
startswith() | Checks if a string starts with a specific value (e.g. a character).
strip() | Removes white spaces at the beginning and end of a string. Instead of white spaces other characters can be removed as well.


For information about the methods and additional methods, see 'Further References'.
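
A few of the methods from the table can be tried out in a short sketch. The example sentence is made up for this purpose and does not come from the novel.

```python
sentence = 'the camp, the plane and the mountains'

# find() searches from the left, rfind() from the right
first_pos = sentence.find('the')
last_pos = sentence.rfind('the')

# If nothing is found, find() returns -1
missing = sentence.find('dog')

# replace() returns a new string; the original is not changed
replaced = sentence.replace('the', 'a')

# These methods answer with True or False
starts = sentence.startswith('the')
ends = sentence.endswith('.')
year_is_digit = '1931'.isdigit()

# splitlines() creates a list with one entry per line
lines = 'first line\nsecond line'.splitlines()

print(first_pos, last_pos, missing)
print(replaced)
print(starts, ends, year_is_digit)
print(lines)
```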

### Splitting strings: The *split method*

You can split a string into a list with the split() method:

In [None]:
exampleString = 'This is an examplestring. It has more than one sentence.'

# By default a string is split at each white space
wordlist = exampleString.split()

print(wordlist)

# If you want to split at another separator, you can write it in the brackets
wordlist = exampleString.split('.')

print(wordlist)

### The strip method
With the strip method you can remove either white spaces or any other characters at the beginning and end of a string:

In [None]:
# Removing white spaces
exampleWord = '    word           '
exampleWord = exampleWord.strip()

print(exampleWord)

# Removing other characters
exampleString = '#####***another string*****####'
exampleString = exampleString.strip('#*')
print(exampleString)

### Further References
- https://www.w3schools.com/python/python_strings_methods.asp
- https://automatetheboringstuff.com/2e/chapter6/

### Project Exercise: Getting statistics about words

**Exercise 1:**

Open the file 'Mainpart.txt' and save its text in a variable (you can copy the code you used above). Now create a list containing all the words of the text (use white spaces as separators) and display the list. Each word should appear only once in the list.

**Exercise 2 (Optional):**

Change your code so that the following characters at the end of each word are removed before the word is added to the list: periods, commas, semicolons, question marks, and exclamation marks.

**Exercise 3 (Optional):**

Display the length of the list (there are a few ways to do this).




In [None]:
# Space for your code

### Solution

In [None]:
with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # Now let us create a list of each word
  words = text.split()

  # We create an empty list. Here we want to store each word only one time
  filteredlist = []

  # Now we loop through each word of the list words
  for word in words:

    # if the word is not in filteredlist, we append it to filteredlist
    if word not in filteredlist:

      filteredlist.append(word)

  for entry in filteredlist:

    print(entry)

**Solution to exercise 2:**


In [None]:
with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # Now let us create a list of each word
  words = text.split()

  # We create an empty list. Here we want to store each word only one time
  filteredlist = []

  # Now we loop through each word of the list words
  for word in words:

    ##############ADDITIONAL CODE##################################
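    # Let us remove unnecessary characters at the beginning and end of each word
    # (besides the characters named in the exercise, we also remove colons and quotation marks)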
    word = word.strip(':.,;?!"')

    # if the word is not in filteredlist, we append it to filteredlist
    if word not in filteredlist:

      filteredlist.append(word)

  for entry in filteredlist:

    print(entry)

**Solution to exercise 3:**

In [None]:
with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # Now let us create a list of each word
  words = text.split()

  # We create an empty list. Here we want to store each word only one time
  filteredlist = []

  # Now we loop through each word of the list words
  for word in words:

    # Let us remove unnecessary characters at the beginning and end of the words
    word = word.strip(':.,;?!"')

    # if the word is not in filteredlist, we append it to filteredlist
    if word not in filteredlist:

      filteredlist.append(word)

  # We display the length of the 'filteredlist'
  print(len(filteredlist))
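
As an alternative to checking `word not in filteredlist` for every word, the unique words can also be collected in a set, which stores each value only once. This is only a sketch of that alternative, not the course's solution; the short text is made up so that the example runs without 'Mainpart.txt'.

```python
# Illustrative text; in the project you would use the content of 'Mainpart.txt'
text = 'The wind howled. The wind died; the camp was silent.'

# A set ignores values that it already contains
unique_words = set()

for word in text.split():
  unique_words.add(word.strip(':.,;?!"'))

# The number of different words
print(len(unique_words))

# A set has no fixed order, so we sort it for display
print(sorted(unique_words))
```

Unlike the list, a set does not remember the order in which the words first appeared. For long texts it is usually faster, because Python does not have to search through the whole collection for every word.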

## RegEx (Regular Expressions)



With **regular expressions**, you can search for strings that have a certain **pattern**, such as phone numbers, dates, or certain types of names. They can be very helpful and are essential when working with text.

Creating patterns and working with regular expressions can be a bit tricky. Here is a simplified but working version of how to find all strings in a text that match a certain pattern:

### Importing the module
First, you need to import the module for regular expressions (*re*) using the *import statement*.

In [None]:
import re

In [None]:
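# In a normal string, '\n' is interpreted as a line break
string = 'Hello \n Hello'
print(string)

# In a raw string (prefix r), the backslash is kept as it is.
# This is why patterns for regular expressions are written as raw strings.
rawstring = r'Hello \n Hello'
print(rawstring)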

### Compiling a pattern
Then we want to create a pattern that searches for strings consisting of 4 digits to find all the years in a text.

`examplePattern = re.compile(r'[0-9]{4}')`

**It consists of:**
* a variable in which we store the pattern (here: `examplePattern`)
* the command to create the pattern (here: `re.compile`)
* a raw-string containing the pattern (here: `r'....'`)
* the pattern (here: `[0-9]{4}`)

**Patterns can consist of two parts:**
- the characters that the pattern should have (here: `[0-9]`)
- the desired number of characters (here: `{4}`)

Characterpattern  | Explanation
------------------|------------------------------------
[A-Z]             | all uppercase characters from A-Z
[a-z]             | all lowercase characters from a-z
[0-9]             | all digits from 0-9
[A-Zabc12]        | all uppercase characters from A-Z and the characters a, b, c, 1 and 2

Characternumber   | Explanation
------------------|------------------------------------
{1}               | Exactly 1 character
{1,}              | One or more characters
{1,4}             | 1-4 characters
{,4}              | 0 to 4 characters

### Finding a pattern
After we have created the pattern we can search with it in a text. Here we store the results in the 'resultlist' variable:

`resultlist = examplePattern.findall(text)`



In [None]:
import re

# Our text
text = 'Philipp Melanchthon was born in the year 1497 and died in the year 1560.'

# Our pattern
examplePattern = re.compile(r'[0-9]{4}')

# Our result
results = examplePattern.findall(text)

# Let us print it
print(results)
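
One thing to keep in mind is that `[0-9]{4}` also matches four digits that are part of a longer number. The following sketch compares it with two variants. The example sentence is made up, and the word boundary `\b` is an addition that is not covered in the tables above.

```python
import re

# Illustrative text with numbers of different lengths
text = 'In 1497 the number 123456 was written on page 12.'

# Exactly four digits, even if they are part of a longer number
four_digits = re.compile(r'[0-9]{4}')
four_digit_hits = four_digits.findall(text)
print(four_digit_hits)

# \b marks a word boundary: only numbers with exactly four digits match
whole_numbers = re.compile(r'\b[0-9]{4}\b')
whole_number_hits = whole_numbers.findall(text)
print(whole_number_hits)

# {1,} means 'one or more': every number is found in full
one_or_more = re.compile(r'[0-9]{1,}')
all_number_hits = one_or_more.findall(text)
print(all_number_hits)
```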

### Further References
- https://automatetheboringstuff.com/2e/chapter7/
- https://www.w3schools.com/python/python_regex.asp
- **https://www.geeksforgeeks.org/regular-expression-python-examples/**

### Project Exercises: Finding names, dates and chapters in the text

**Exercise 1:**

Find all years in the novel 'At the Mountains of Madness'.

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # Space for your code



**Exercise 2:**

Find all Roman numerals in the novel 'At the Mountains of Madness'. Roman numerals can contain the following characters: I, V, X, L, C, D and M.

The last character is usually followed by a dot.

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # Space for your code

**Exercise 3:**

Find all names in the text. For the sake of simplicity, we assume that a name begins with a capital letter and is followed by one or more lowercase letters.

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # Space for your code


**Exercise 4:**

Find all combinations of two capitalized terms in the text (i.e. according to the following scheme: capitalized word, space, second capitalized word). This way we hope to find all first and last name combinations.

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # Space for your code

### Solutions

**Solution to exercise 1**

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # We create a pattern that matches four digits in a row
  yearPattern = re.compile(r'[0-9]{4}')


  print(yearPattern.findall(text))

**Solution to Exercise 2:**

Find all Roman numerals in the novel 'At the Mountains of Madness'. Roman numerals can contain the following characters: I, V, X, L, C, D and M.

The last character is usually followed by a dot.

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # We create a pattern: one or more Roman numeral characters followed by a dot
  romanNumberPattern = re.compile(r'[IVXLCDM]{1,}[.]{1}')

  print(romanNumberPattern.findall(text))

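Congratulations: the chapter numerals are among the results, so we can now work out the number of chapters. Note that the pattern can also match other capital letters followed by a dot (for example an initial such as 'D.'), so it is worth checking the results.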

**Solution to exercise 3:**

Find all names in the text. For the sake of simplicity, we assume that a name begins with a capital letter and is followed by one or more lowercase letters.

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # We create a pattern: one uppercase letter followed by one or more lowercase letters
  namePattern = re.compile(r'[A-Z]{1}[a-z]{1,}')

  print(namePattern.findall(text))

**Solution to exercise 4:**

Find all combinations of two capitalized terms in the text (i.e. according to the following scheme: capitalized word, space, second capitalized word). This way we hope to find all first and last name combinations.

In [None]:
import re

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  text = inputfile.read()

  # We create a pattern: two capitalized words separated by a single space
  doublenamePattern = re.compile(r'[A-Z]{1}[a-z]{1,}[ ]{1}[A-Z]{1}[a-z]{1,}')

  print(doublenamePattern.findall(text))
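
The file 'DoubleNames.csv' used in this course pairs each double name with the number of times it occurs. How that file was produced is not shown here; the following is only one possible way to count the matches, using `Counter` from the standard library. The example text and the names in it are made up for this illustration.

```python
import re
from collections import Counter

# Illustrative text; in the project you would use the content of 'Mainpart.txt'
text = 'Ada Lovelace met Charles Babbage in London. Later, Ada Lovelace wrote her notes.'

# The same pattern as in the solution to exercise 4
doublenamePattern = re.compile(r'[A-Z]{1}[a-z]{1,}[ ]{1}[A-Z]{1}[a-z]{1,}')
matches = doublenamePattern.findall(text)

# Counter counts how often each entry occurs in the list
counts = Counter(matches)

# most_common() returns (name, count) pairs, the most frequent first
ranking = counts.most_common()

for name, number in ranking:
  print(name, number)
```

On a real text, the pattern will also find capitalized word pairs that are not names (for example at the beginning of a sentence), so the counts should be checked by hand.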

## Natural language processing (NLP) with the Natural Language Toolkit
**NLTK** (Natural Language Toolkit) is a tool for analyzing and working with language using computational linguistics. According to its website, it provides interfaces to "over 50 corpora and lexical resources".

This module can do a lot of useful things. Here are just some examples.



### Splitting the novel into sentences
With the sentence tokenizer you can analyse a text and split it up into sentences. The result is stored as a list.



In [None]:
# First we need to import the module
import nltk

# You may need to download the following package
nltk.download('punkt')
nltk.download("punkt_tab")

# We open our file
with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:

  # we read the file and store its content in the variable 'text'
  text = inputfile.read()

  # We create a list of sentences with the help of the 'sentence tokenizer'
  sentences = nltk.sent_tokenize(text)

  for sentence in sentences:

    print(sentence)

### Creating a wordlist without stop words
Stop words are the most common words of a language (such as 'is' or 'and'). If we want to know which words occur in a text, we normally want to ignore them.

In [None]:
# We import the module
import nltk

# You may need to download the following package
nltk.download('punkt')
nltk.download("punkt_tab")

# First we need to download the stopwords and import them separately
nltk.download("stopwords")
from nltk.corpus import stopwords

# we want reading access to our file
with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:
  text = inputfile.read()

  # We create a list of all words inside the text. Instead of using the split method we use the word tokenizer of nltk
  words = nltk.word_tokenize(text)

  # We create a set (something like a list) of stopwords
  stopwordset = set(stopwords.words("english"))

  # We create an empty list. Here all words will be stored, that are not in our stopwordset
  filteredlist = []

  # We loop through the list of words
  for word in words:

    # We check if the word is not in the stopwordset
    if word not in stopwordset:

      # We check if the word is not yet in our filteredlist
      if word not in filteredlist:

        # We append it to our filtered list
        filteredlist.append(word)

  # we loop through each item of the filteredlist and display it
  for item in filteredlist:

    print(item)

  # We display the length of the list = the number of all words in the list
  print(len(filteredlist))


### Lemmatizing Words
Now our list still has the disadvantage that it contains several forms of the same word (e.g. 'went', 'goes' and 'go' should not be counted as different words). We can trace such forms back to their base form (the lemma) with the help of a lemmatizer.

Let us try this with the word 'goes' and find its base form.

In [None]:
# We import the module
import nltk

# We may need to download 'wordnet'
nltk.download('wordnet')

# We create our lemmatizer
lemmatizer = nltk.WordNetLemmatizer()
# We lemmatize our word
lemmatizedword = lemmatizer.lemmatize('goes')

# Now the base form (lemma) of the word will be displayed
print(lemmatizedword)

### Lemmatizing Words in the novel "At the mountains of madness"
What we just did with one word, we now want to apply to our project. Earlier, we made a list of the words in the novel 'At the Mountains of Madness' (see 'Creating a wordlist without stop words'). Now we want to add only lemmatized words to the list. To do this, we need to make some small changes to the script we created earlier.

In [None]:
# We import the module
import nltk

# First we need to download the stopwords and import them separately
nltk.download("stopwords")
from nltk.corpus import stopwords

############ Additional Code #######
# We may need to download 'wordnet'
nltk.download('wordnet')
# We create our lemmatizer
lemmatizer = nltk.WordNetLemmatizer()
####################################

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:
  text = inputfile.read()

  # We create a list of all words inside the text. Instead of using the split method we use the word tokenizer of nltk
  words = nltk.word_tokenize(text)

  # We create a set (something like a list) of stopwords
  stopwordset = set(stopwords.words("english"))

  # We create an empty list. Here all words will be stored, that are not in our stopwordset
  filteredlist = []

  # We loop through the list of words
  for word in words:

    ########################## Additional Code ######################
    # Let us trace the word back to its root
    word = lemmatizer.lemmatize(word)
    #################################################################

    # We check if the word is not in the stopwordset
    if word not in stopwordset:

      # We check if the word is not yet in our filteredlist
      if word not in filteredlist:

        # We append it to our filtered list
        filteredlist.append(word)

  # we loop through each item of the filteredlist and display it
  for item in filteredlist:

    print(item)

  # We display the length of the list = the number of all words in the list
  print(len(filteredlist))

### POS-Tagging
Part of speech tagging: Identification of words as verbs, nouns, adjectives ...

The output is a list that contains pairs of analyzed words and their part of speech. The pairs are stored as tuples (something similar to a list). Example:

[('Luke', 'NNP'), ('I', 'PRP'), ('am', 'VBP'), ('your', 'PRP$'), ('father', 'NN')]

So we have the word 'Luke' and its part of speech is 'NNP' (proper noun), then we have 'I' and its part of speech is 'PRP' (personal pronoun), and so on.

See the following list of abbreviations: https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html
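
Working with such a list of tuples can be sketched without NLTK. Here the tagged list from the example above is typed in by hand. Selecting all tags that start with 'NN' is just an illustration of how the pairs can be filtered.

```python
# The tagged list from the example above, typed in by hand
taggedlist = [('Luke', 'NNP'), ('I', 'PRP'), ('am', 'VBP'),
              ('your', 'PRP$'), ('father', 'NN')]

# A tuple is accessed by index, just like a list
first_pair = taggedlist[0]
print(first_pair[0], first_pair[1])

# In a loop, each pair can be unpacked into two variables
nouns = []
for word, tag in taggedlist:

  # 'NN' and 'NNP' both start with 'NN'
  if tag.startswith('NN'):
    nouns.append(word)

print(nouns)
```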

In [None]:
import nltk

# You may need to download the following additions with these commands
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

# We have a wordlist, that we want to analyze
wordlist = [ 'Luke', 'I', 'am', 'your', 'father' ]

# Now we do the part-of-speech tagging
taggedlist = nltk.pos_tag(wordlist)

# Now we display taggedlist
print(taggedlist)

# Now let's assume we only want to display words that are tagged with VBP
# (i.e. verb, non-3rd person singular present).

# We loop through the pairs of the taggedlist
for item in taggedlist:

  # We check if the second item (i.e. the part of speech) equals 'VBP'
  if item[1] == 'VBP':
    # We display the first item (i.e. the word)
    print(item[0])

# Now the word 'am' should be displayed

### Project Exercise: POS-Tagging to find names / proper nouns

The following program creates a list of all words in the novel 'At the Mountains of Madness'. Then it analyzes these words with POS tagging. Use the empty list 'filteredlist' and store in it only words that have been tagged as proper nouns ('NNP').

In [None]:
# We import the module
import nltk

# You may need to download the following additions
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:
  text = inputfile.read()

  # We create a list of all words inside the text. Instead of using the split method we use the word tokenizer of nltk
  words = nltk.word_tokenize(text)

  # Now we create a list containing word-POS pairs
  taggedlist = nltk.pos_tag(words)

  filteredlist = []
  # Space for your code



### Solution

In [None]:
# We import the module
import nltk

# You may need to download the following additions
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

with open('Mainpart.txt', 'r', encoding='utf-8') as inputfile:
  text = inputfile.read()

  # We create a list of all words inside the text. Instead of using the split method we use the word tokenizer of nltk
  words = nltk.word_tokenize(text)

  # Now we create a list containing word-POS pairs
  taggedlist = nltk.pos_tag(words)

  # We create an empty list 'filteredlist', where only the
  # proper nouns shall be stored
  filteredlist = []

  #################### Additional Code ##################
  # now we loop through each word-POS pair
  for pair in taggedlist:

    # we check if the POS equals 'NNP' (=proper noun)
    if pair[1] == 'NNP':

      # Now we check if the name inside the pair is already in filteredlist
      if pair[0] not in filteredlist:

        # We append the name to the filteredlist
        filteredlist.append(pair[0])


# Now we print each entry of the filteredlist
for item in filteredlist:

  print(item)


### Further References
- Website and introduction: https://www.nltk.org/
- **Tutorial**: https://realpython.com/nltk-nlp-python/

## CSV
CSV (comma separated values) is a standard format when it comes to storing or exchanging (large) data. You can think of a CSV file as a very basic form of an Excel spreadsheet.

Rules for creating csv-files:
- The first line contains the column titles.
- A new line marks a new row and a comma marks a new column (see the following example)

**Excel**

Name | Age
-----|-----
John   | 32
Maria  | 32
Stephen| 44
Hank   |12

**CSV**
```
Name,Age
John,32
Maria,32
Stephen,44
Hank,12

```




### How to create a CSV file
Now let us create a csv file from the example above.

In [None]:
# Let us open and create a csv file
with open('Examplecsv.csv', 'w+', encoding='utf-8') as outputfile:

  # We create the first line which contains the header 'Name' and 'Age'
  outputfile.write('Name,Age\n')

  # now we write the data
  outputfile.write('John,32\n')
  outputfile.write('Maria,32\n')
  outputfile.write('Stephen,44\n')
  outputfile.write('Hank,12')

You can access the newly created file using the file browser. It is also possible to create csv files using the csv module (see 'Further References').
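
A minimal sketch of the csv module variant might look like this. It uses `csv.writer` from the standard library and writes the same example data. The file name 'Examplecsv_writer.csv' and the temporary folder are assumptions made for this illustration.

```python
import csv
import os
import tempfile

# Hypothetical output file in a temporary folder
csv_path = os.path.join(tempfile.mkdtemp(), 'Examplecsv_writer.csv')

# Each inner list is one row; the first row contains the column titles
rows = [['Name', 'Age'],
        ['John', 32],
        ['Maria', 32],
        ['Stephen', 44],
        ['Hank', 12]]

# newline='' is recommended by the csv documentation when opening csv files
with open(csv_path, 'w', encoding='utf-8', newline='') as outputfile:

  writer = csv.writer(outputfile)

  # writerows writes all rows at once and adds the commas and line breaks
  writer.writerows(rows)

# We read the file again to check the result
with open(csv_path, 'r', encoding='utf-8', newline='') as inputfile:
  written = inputfile.read()

print(written)
```

One advantage of the writer is that it takes care of values that themselves contain a comma by putting them in quotes.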

### Importing a CSV-file
Importing a *csv file* can easily be done with the *csv module*. It allows you to transform the *csv data* into Python values. See the following example, where we extract information from the file 'DoubleNames.csv'. (I created this file from data I extracted from the novel 'At the Mountains of Madness'.) The file has two columns. The first one contains a double name (DoubleName), the second one contains the number of times this double name appears in the novel (Occurences).

In [None]:
# First we have to import the csv module
import csv

# we open the file
with open('DoubleNames.csv', 'r', encoding='utf-8') as csvfile:

  # we read it with the DictReader class, which turns each row into a dictionary
  data = csv.DictReader(csvfile)

  # we loop through the rows
  for dictPerson in data:

    # now we display the row
    print(dictPerson)
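
Since every row is a dictionary, the individual columns can be accessed by their titles. The following sketch assumes the column titles 'DoubleName' and 'Occurences' described above. The two data rows are made up, and the text is read from a string instead of a file so that the example runs on its own.

```python
import csv
import io

# Hypothetical rows; only the column titles are taken from 'DoubleNames.csv'
sample = 'DoubleName,Occurences\nFirst Name,3\nSecond Name,12\n'

names = []
total = 0

# io.StringIO lets us treat a string like an opened file
for row in csv.DictReader(io.StringIO(sample)):

  # We access a column by its title
  names.append(row['DoubleName'])

  # All values are read as strings, so we convert the number with int()
  total = total + int(row['Occurences'])

print(names)
print(total)
```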

### Further References
- https://www.geeksforgeeks.org/working-csv-files-python/
- https://automatetheboringstuff.com/2e/chapter16/

## Requests and Beautiful soup
With the *requests module*, you can check the status of a website (i.e., see if it is online) or download its page source.

In the following example, we will download and display the page source of the website 'w3schools.com/python'.

In [None]:
# We import the module
import requests

page = requests.get('https://w3schools.com/python/')

print(page.text)

If you want to navigate through the tags of a web page or through an *xml file* (which may be important if you are creating an edition), you should use the *Beautiful Soup* module. For more information about this module, see 'Further references'.

### Further References
**The requests module**
- https://www.w3schools.com/python/module_requests.asp
- https://www.geeksforgeeks.org/python-requests-tutorial/
- https://automatetheboringstuff.com/2e/chapter12/

**The beautiful soup module**
- https://beautiful-soup-4.readthedocs.io/en/latest/
- https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/
- https://automatetheboringstuff.com/2e/chapter12/

## JSON
JSON is another important format for exchanging data, along with CSV. You can think of it as a combination of Python lists and dictionaries.

See the following page for an example:
https://lobid.org/gnd/139907211.json

Let us try to get the page source of this site with the requests module and convert it to python values with the json module using the *json.loads* method:



In [None]:
# we need the json and requests modules
import json
import requests

# we access the website and download its page source
page = requests.get('https://lobid.org/gnd/139907211.json')
pageSource = page.text

# If you want to see the text of the website (= the variable pageSource)
# just remove the hash sign (#) at the beginning of the following line:
# print(pageSource)

# We convert pageSource using json loads and display it
pythondata = json.loads(pageSource)
print(pythondata)

You can also convert Python values into a json file. For instructions on how to do this, see 'Further References'.
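
As a minimal sketch of that direction, `json.dumps` turns Python values into a JSON string, and `json.loads` turns the string back into Python values. The dictionary below is a made-up record for illustration, reusing the years from the regular expression example.

```python
import json

# A small Python dictionary that contains a string and a list
record = {'name': 'Philipp Melanchthon', 'years': [1497, 1560]}

# We convert the Python values to a JSON string (indent makes it easier to read)
json_text = json.dumps(record, indent=2)
print(json_text)

# We convert the JSON string back to Python values
restored = json.loads(json_text)
print(restored['years'][0])
```

To write the data directly into a file, `json.dump` can be used with an opened file instead of `json.dumps`.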

### Further References
- https://automatetheboringstuff.com/2e/chapter16/
- https://www.w3schools.com/python/python_json.asp
- https://www.geeksforgeeks.org/python-json/

## Pandas
Pandas is a powerful tool for data analysis. It is built on top of *NumPy*, but is easier to understand and work with.

It is not only good for analyzing and exploring data, but also for cleaning and manipulating it (for example, you can remove duplicates), and even looking for correlations. We will focus on the analyzing and exploring part.

An important aspect of understanding Pandas is that it uses dataframes.

A dataframe is similar to an Excel spreadsheet or a csv file. Each column of a dataframe is called a *series*.

One of the most common cases of using Pandas is when you want to analyze data from a csv file.

### Displaying a dataframe

In [None]:
# we import pandas and rename it to pd (this way we do not have to type so much)
import pandas as pd

# we read the csv-file and create a dataframe
df = pd.read_csv('DoubleNames.csv')

# we display the dataframe
df

### The describe method
With the describe method, you can get statistical information for each numeric column (such as maximum, minimum, mean and standard deviation).

In [None]:
# we import pandas and rename it to pd (this way we do not have to type so much)
import pandas as pd

# we read the csv-file and create a dataframe
df = pd.read_csv('DoubleNames.csv')

# we 'describe' the dataframe
df.describe()

### Further References
- https://www.w3schools.com/python/pandas/default.asp
- https://www.geeksforgeeks.org/pandas-tutorial/


## Matplotlib
We can now visualize the dataframes we created with Pandas using Matplotlib. See the examples below. For tutorials and instructions, see 'Further References'.

### Scatter plot

In [None]:
# we import pandas and name it pd
import pandas as pd
# we import matplotlib and name it plt
import matplotlib.pyplot as plt

# we create a pandas dataframe
df = pd.read_csv("DoubleNames.csv")

# We create a scatter plot, using the columns of the dataframe
plt.scatter(df['DoubleName'], df['Occurences'])

# We add a title
plt.title("Scatter Plot")

# we name the x-axis and y-axis
plt.xlabel('DoubleName')
plt.ylabel('Occurences')

# we display the scatter plot
plt.show()


### Line Chart

In [None]:
# we import pandas and name it pd
import pandas as pd
# we import matplotlib and name it plt
import matplotlib.pyplot as plt

# we create a pandas dataframe
df = pd.read_csv("DoubleNames.csv")

# We create a line chart, using the columns of the dataframe
plt.plot(df['DoubleName'], df['Occurences'])

# We add a title
plt.title("Line Chart")

# we name the x-axis and y-axis
plt.xlabel('DoubleName')
plt.ylabel('Occurences')

# we display the line chart
plt.show()

### Bar Chart

In [None]:
# we import pandas and name it pd
import pandas as pd
# we import matplotlib and name it plt
import matplotlib.pyplot as plt

# We create a dataframe with pandas
data = pd.read_csv("DoubleNames.csv")

# we create a bar chart using the columns of the dataframe
plt.bar(data['DoubleName'], data['Occurences'])

# We add a title
plt.title("Bar Chart")

# We name the X- and Y-Axis
plt.xlabel('DoubleName')
plt.ylabel('Occurences')

plt.show()


### Further References
- https://www.geeksforgeeks.org/data-visualization-with-python/
- https://www.geeksforgeeks.org/matplotlib-tutorial/