# Peace Treaties (one page)

The website that I'm scraping here is a database that contains metadata on European peace treaties from 1450 to 1789. It was created as part of a project that was funded by the DFG (German Research Foundation): https://www.ieg-friedensvertraege.de/vertraege

The collection contains a selection of some 1800 treaties.

### Import libraries and packages

In [1]:
import requests
from requests import get
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import re
import csv

**re** is only needed if you use regular expressions to extract text. Use Python 3 for this script; Python 2 does not handle UTF-8 well.

The following imports are used to slow down the rate of requests sent to the website. This is probably not necessary for such a small database, but it is good practice not to stress a website with automated requests.

In [2]:
from time import sleep
from random import randint
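As a minimal sketch of how these two imports can be combined, the helper below waits a random number of seconds and reports how long it waited. The name **polite_pause** and its parameters are my own illustration and not part of the original notebook; the **sleeper** argument only exists so the pause can be swapped out when trying the function.

```python
from random import randint
from time import sleep


def polite_pause(min_seconds=2, max_seconds=10, sleeper=sleep):
    """Wait a random whole number of seconds and return the delay used.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    delay = randint(min_seconds, max_seconds)  # both bounds are inclusive
    sleeper(delay)
    return delay
```

When looping over several pages, calling **polite_pause()** before each request spaces the requests out.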

### Set up the containers for capturing treaty names, the sides involved and the dates of the treaty:

In [3]:
treatytitles = []
treatysides = []
treatydates = []

### Indicate the url

You need to indicate the url of the webpage in order to scrape the data. The parser used here is **html.parser**, which ships with Python.

In [4]:
# url = "https://www.ieg-friedensvertraege.de/likecms.php?searchlang=de&step2onpage=3&function=process&process_target=ieg_treaty_mask_step2&site=index.html&nav=1&siteid=2&formlang=de&date=&year_from=&year_till=&location=&partner1=&partner2=&partner3=&language=&archive=&limit=&submit2=suche"
# results = requests.get(url) 
# soup = BeautifulSoup(results.text, "html.parser")

Get **filtered results** using a different url. This alternative url returns only treaties involving Russia. It was produced with the website's search function and differs from the url above only in **partner1=9**. There are more elegant ways to do this, but this works, too.

In [4]:
url="https://www.ieg-friedensvertraege.de/likecms.php?searchlang=de&step2onpage=3&function=process&process_target=ieg_treaty_mask_step2&site=index.html&nav=1&siteid=2&formlang=de&date=&year_from=&year_till=&location=&partner1=9&partner2=&partner3=&language=&archive=&limit=&submit2=suche"
results = requests.get(url) 
soup = BeautifulSoup(results.text, "html.parser")
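One possibly more elegant way to handle the long url is to build the query string from a dictionary, so the filter value is easy to spot and change. This is only a sketch: the function name **build_search_url** is hypothetical, and the parameter names and values are copied from the url above rather than from any documentation of the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.ieg-friedensvertraege.de/likecms.php"


def build_search_url(partner1=""):
    """Rebuild the search url; partner1="9" is the value used for Russia above."""
    params = {
        "searchlang": "de",
        "step2onpage": "3",
        "function": "process",
        "process_target": "ieg_treaty_mask_step2",
        "site": "index.html",
        "nav": "1",
        "siteid": "2",
        "formlang": "de",
        "date": "",
        "year_from": "",
        "year_till": "",
        "location": "",
        "partner1": partner1,
        "partner2": "",
        "partner3": "",
        "language": "",
        "archive": "",
        "limit": "",
        "submit2": "suche",
    }
    # urlencode keeps the dictionary order and writes empty values as "key="
    return BASE_URL + "?" + urlencode(params)


url = build_search_url(partner1="9")
```

The resulting **url** can be passed to **requests.get** exactly as in the cell above.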

### Write scraping code

If you don't know how to read the html text of a website to identify your data, you should check one of the tutorials mentioned in the repository description.

This is the scraping code for titles, dates, and sides. Sides was the most difficult to get, because it is not wrapped in a tag of its own. It just sits between other bits of data. Apparently it always comes immediately after 'br' in td[1], so it can be grabbed with next_sibling, but that also grabs whitespace and newlines.

In [5]:
treaties = soup.find_all('tr', class_='text13')

sleep(randint(2,10))
    
for container in treaties:
    td = container.find_all('td')
    
    title = td[1].b.a.get_text() 
    treatytitles.append(title)
    
    sides = td[1].find('br').next_sibling #works but includes the whitespace and newline as well
    treatysides.append(sides)
    
    dates = td[2].get_text()
    treatydates.append(dates)

**treaties** pre-defines the frame within which I will look for the data. If you inspect the page content with the Web Developer Tools in your browser, you will find that all the database information (titles, sides, and dates) sits within a series of **tr** (table row) tags. I have further specified the class ('text13') to make sure I get the correct table rows in case the tag is used elsewhere on the page.

All subsequent searches will refer to this **treaties** container.

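**sleep** pauses the script for a random interval of 2 to 10 seconds. In this one-page version only a single request is sent, so the pause mainly matters once you loop over several pages.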

Within each **tr** tag, there are three **td** (table data) tags: one for the number of the database entry (td[0]), one for the treaty title and the sides involved (td[1]), and one for the dates (td[2]). Note that counting starts at 0!

By referring to the respective **td** tag by its index, I can then further indicate where the text that I want to grab is located within that tag.

The title data, for example, is located within an **a** tag that is again contained within a **b** tag. **get_text()** grabs just the text content within.
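To try the extraction without contacting the website, the same logic can be run on a small HTML snippet. The markup below is a hypothetical reconstruction based on the description above, so the real page may differ in detail. The sketch also shows one way to drop the whitespace that **next_sibling** picks up, by converting the text node to a string and calling **strip()**.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described above.
html = """
<table>
  <tr class="text13">
    <td>1</td>
    <td><b><a href="#">Waffenstillstand von Moskau</a></b><br>
        Polen, Russland
    </td>
    <td>1522 XII 25_1523 I 4</td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(html, "html.parser")
rows = []
for row in sample_soup.find_all("tr", class_="text13"):
    td = row.find_all("td")
    title = td[1].b.a.get_text(strip=True)
    # The text node right after <br> holds the sides, padded with whitespace.
    sides = str(td[1].find("br").next_sibling).strip()
    dates = td[2].get_text(strip=True)
    rows.append((title, sides, dates))
```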

### Build a dataframe from the filled containers:

In [6]:
out_treaties = pd.DataFrame({
    'title': treatytitles,
    'sides': treatysides,
    'dates': treatydates
})

If you re-run the scraping cell (for example, because something went wrong the first time), the lists keep growing and the dataframe ends up with double or triple the rows. In that case, delete the dataframe and re-create the empty containers before scraping again:

In [8]:
#del(out_treaties)
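Another way to avoid the piling-up is to create the lists inside a function, so every run starts with empty containers. This is a sketch under my own assumptions: **collect_columns** is a hypothetical helper, and the sample row stands in for the values scraped from the page.

```python
def collect_columns(rows):
    """Return fresh title, sides and dates lists built from (title, sides, dates) tuples."""
    titles, sides, dates = [], [], []  # new empty lists on every call
    for title, side, date in rows:
        titles.append(title)
        sides.append(side)
        dates.append(date)
    return titles, sides, dates


sample_rows = [("Waffenstillstand von Moskau", "Polen, Russland", "1522 XII 25_1523 I 4")]
first_run = collect_columns(sample_rows)
second_run = collect_columns(sample_rows)  # re-running does not add duplicate rows
```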

The following code gets rid of all newlines in the dataframe.

In [7]:
new_treaties = out_treaties.replace('\n',' ', regex=True)
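Replacing newlines with spaces still leaves the padding around the sides. A possible refinement, shown here on a made-up one-row dataframe rather than the scraped data, is to collapse every run of whitespace into a single space and then strip the ends of the string.

```python
import pandas as pd

# Made-up sample that mimics the padded 'sides' value described earlier.
sample = pd.DataFrame({
    "title": ["Waffenstillstand von Moskau"],
    "sides": ["\n        Polen, Russland\n    "],
})

cleaned = sample.copy()
# Collapse runs of whitespace (including newlines), then trim both ends.
cleaned["sides"] = (
    cleaned["sides"].str.replace(r"\s+", " ", regex=True).str.strip()
)
```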

Get a preview of the data

In [8]:
print(new_treaties)

                                                title  \
0         Bündnis von Gmunden (verlesen: *Brunndenau)   
1                         Waffenstillstand von Moskau   
2                        Präliminarvertrag von Krakau   
3                       Waffenstillstand von Nowgorod   
4                       Waffenstillstand von Nowgorod   
..                                                ...   
95         Konvention über die bewaffnete Neutralität   
96           Schiffahrtskonvention von St. Petersburg   
97  Freundschafts- und Handelsvertrag von St. Pete...   
98                     Friedensvertrag von Versailles   
99  Friedensvertrag von Konstantinopel (Beta-Version)   

                                                sides                 dates  
0                                    Kaiser, Russland           1514 VIII 4  
1                                     Polen, Russland  1522 XII 25_1523 I 4  
2                                     Polen, Russland            1523 II 22  
3  

### Export dataframe as csv file

If you are happy with the results, you can export the dataframe to a CSV file.

In [9]:
new_treaties.to_csv('new_treaties_rus.csv')
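By default, **to_csv** also writes the dataframe's row index as an extra first column. If you do not want that column, a variant like the following sketch may be useful. It uses a made-up one-row dataframe and a temporary file (both are assumptions for illustration), and reads the file back to check that the umlauts survive.

```python
import os
import tempfile

import pandas as pd

example = pd.DataFrame({
    "title": ["Bündnis von Gmunden (verlesen: *Brunndenau)"],
    "sides": ["Kaiser, Russland"],
    "dates": ["1514 VIII 4"],
})

# Temporary location so the example does not overwrite a real export.
out_path = os.path.join(tempfile.mkdtemp(), "treaties_example.csv")

# index=False leaves out the row numbers; utf-8 keeps characters like "ü".
example.to_csv(out_path, index=False, encoding="utf-8")
restored = pd.read_csv(out_path, encoding="utf-8")
```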