# Digital Humanities Exercise 2
Author: Tobias Famos

## Theoretical Questions
### 1) Cite four differences between XML and HTML standards.
- XML requires every element to be closed. HTML allows unclosed (void) elements such as `<br>`.
- XML is case-sensitive; HTML tag and attribute names are case-insensitive.
- Purpose: XML is designed mainly to store and transport data, whereas HTML is designed to present it.
- XML has no fixed vocabulary, so users can define new tags as needed. HTML only allows a predefined set of tags.
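
The first, second and fourth differences can be observed with Python's standard-library parsers. The following is a minimal sketch, not part of the exercise data: the helper names `is_well_formed_xml`, `TagCollector` and `html_start_tags` are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


def is_well_formed_xml(text):
    """Return True if the XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Collect the names of all start tags seen by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_start_tags(text):
    """Return the start tag names of an HTML snippet as reported by the parser."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


# An unclosed <br> breaks XML well-formedness, a self-closed <br/> does not
print(is_well_formed_xml("<p>line<br>break</p>"))
print(is_well_formed_xml("<p>line<br/>break</p>"))
# Tag names differing only in case do not match in XML
print(is_well_formed_xml("<Play></play>"))
# A freely invented tag is fine in XML
print(is_well_formed_xml("<myOwnTag/>"))
# The HTML parser accepts the unclosed <BR> and reports tag names in lower case
print(html_start_tags("<P>line<BR>break</P>"))
```

The sketch only checks well-formedness. Whether invented tags are acceptable in a given XML document additionally depends on any schema the document is validated against.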

### 2) Are both XML and HTML fully declarative languages?
A declarative language is a language that does not specify any control flow. It tells the executing machine **what** to do, but not **when** or in which order to do it.
Both HTML and XML are fully declarative: XML only describes how the data is structured, and HTML only describes how the data should be presented.

## Practical Questions (consider the French Theater data)
### 1: How many unique author names can you find?


In [42]:
import os
import lxml
from bs4 import BeautifulSoup
import pandas
import re


In [43]:
def get_author_tag(fileName):
    """Return the text of the first <author> tag in the given XML file."""
    # Read the whole XML file into one string
    with open(fileName, 'r', encoding='iso-8859-1') as f:
        content = f.read()
    bs_content = BeautifulSoup(content, "lxml")
    # Take the first <author> tag (raises IndexError if the file has none)
    author_tag = bs_content.find_all('author')[0]
    return author_tag.get_text()

In [44]:
# Directory that contains the XML files
directory = 'data'
# Collect the paths of all regular files in that directory
all_files = []
for filename in os.listdir(directory):
    f = os.path.join(directory, filename)
    # Skip subdirectories and anything else that is not a file
    if os.path.isfile(f):
        all_files.append(f)


In [45]:
df = pandas.DataFrame(columns=['XML Tag'])
for file in all_files:
    author_string = get_author_tag(file)
    df.loc[len(df)] = author_string

df.head(5)

                            XML Tag
0                           Molière
1  LA CALPRENEDE, Gautier Costes de
2                          VOLTAIRE
3      DANCOURT, Florent CARTON dit
4               REGNARD et DUFRESNY

In [46]:
# Count the distinct contents of the author tag
print(f"{df['XML Tag'].nunique()} Unique author tag contents")

197 Unique author tag contents


I found 197 unique author tag contents. These are 197 distinct spellings of author names, which does not mean that there are 197 distinct authors.

### 2: Are you sure that all these unique names refer to distinct authors?
No, I am by no means sure that they refer to distinct authors. There are most likely many duplicates, because the same author may be written differently in different author tags.
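
One way to look for such duplicates, sketched here under assumptions, is to compare the lower-cased names pairwise and list those that are almost identical. The function `similar_pairs`, the threshold of 0.9 and the sample names are hypothetical and not taken from the corpus; `difflib.SequenceMatcher` is part of the Python standard library.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(names, threshold=0.9):
    """Return pairs of distinct lower-cased names whose similarity reaches the threshold."""
    # Lower-casing already merges spellings that differ only in case
    unique = sorted({name.lower() for name in names})
    pairs = []
    for a, b in combinations(unique, 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


# Hypothetical sample for illustration only
sample = [
    "SCUDERY, Georges de",
    "SCUDERY, Geroges de",
    "VOLTAIRE",
    "Voltaire",
    "MARIVAUX",
]
print(similar_pairs(sample))
```

The pairs found this way are only candidates. Whether two similar strings really denote the same author still has to be decided by hand.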

### 3: Can you reduce the variability around the author names?
We can reduce the number of duplicates by applying the following corrections to the author names. Each rule fixes a typo or merges alternative spellings of the same author.

In [49]:
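def apply_rules(row):
    # Lower-case the name so that spellings differing only in case are merged
    name = str(row['XML Tag']).lower()
    # Unify the spelling of anonymous authors
    name = re.sub(r"\(anonyme\)", "anonyme", name)
    # Fix a typo in a first name
    name = re.sub("geroges", "georges", name)
    # Merge alternative spellings of the same author
    name = re.sub("voltaire, françois-marie arouet de", "voltaire", name)
    name = re.sub("chabanon, michel-paul-guy de", "chabanon, michel paul guy de", name)
    # Treat missing values ("nan") as anonymous; unanchored, so it also matches "nan" inside longer names
    name = re.sub("nan", "anonyme", name)
    # Repair a name that was damaged by a wrong encoding
    name = re.sub("moliï¿½re", "molière", name)

    return name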



df['corrected'] = df.apply(apply_rules, axis=1)
unique_corrected = df["corrected"].nunique()

print(f'New count of unique authors {unique_corrected}')


New count of unique authors 188
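
The rule `re.sub("nan", "anonyme", ...)` in `apply_rules` matches the letters "nan" anywhere in a name. This explains entries such as "pompiganonyme" in the per-author counts, where "pompignan" was rewritten by accident. A possible refinement, given here only as a sketch with the hypothetical helper name `normalise_missing`, is to anchor the pattern so that it matches only when the whole string is the missing-value placeholder:

```python
import re


def normalise_missing(name):
    """Map the missing-value placeholder 'nan' to 'anonyme' without touching real names."""
    name = str(name).lower()
    # ^ and $ anchor the pattern, so only a string that is exactly "nan" matches
    return re.sub(r"^nan$", "anonyme", name)


# The unanchored rule also rewrites part of a real name
print(re.sub("nan", "anonyme", "pompignan, jean-jacques lefranc, marquis de"))
# The anchored rule leaves the name intact and still handles missing values
print(normalise_missing("POMPIGNAN, Jean-Jacques LEFRANC, marquis de"))
print(normalise_missing(float("nan")))
```

Applying the anchored rule to the corpus may change the number of unique authors reported above, so the counts would have to be recomputed.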


### 4: Can you count the number of plays per author?
Sure. Let's take the corrected names from the last normalization step and count how often each one occurs. Assuming each file contains exactly one play, this tells us how many plays each author has in our dataset.

In [48]:
value_counts_lower_case = df["corrected"].value_counts()
value_counts_lower_case

dancourt, florent carton dit                       49
voltaire                                           38
marivaux                                           33
molière                                            31
regnard, jean-françois                             24
                                                   ..
pompiganonyme, jean-jacques lefranc, marquis de     1
françois-augustin paradis de moncrif                1
mathieu, pierre                                     1
plancher de valcour, philippe                       1
lafont, joesph de                                   1
Name: corrected, Length: 188, dtype: int64