![bse_logo_textminingcourse](https://bse.eu/sites/default/files/bse_logo_small.png)

# Text Mining: Models and Algorithms


Hannes Mueller

## Scraping

In this session we will learn some of the fundamentals of using web scraping to generate a corpus of text. There are several packages that help you achieve this. I will assume throughout that you have already covered the basics, so we will jump right in. Even if you have never done any scraping, it should be intuitive.

ATTENTION: At the end of the notebook I give you the full code to generate a dataframe with a string column that contains the text of the resolutions passed in 2024 plus some metadata.

## Scraping UN Resolutions

We will first use Beautiful Soup to scrape the resolutions of the UN Security Council.

The Security Council has primary responsibility for the maintenance of international peace and security. It has 15 Members, and each Member has one vote. Under the Charter of the United Nations, all Member States are obligated to comply with Council decisions.

The Security Council takes the lead in determining the existence of a threat to the peace or act of aggression. It calls upon the parties to a dispute to settle it by peaceful means and recommends methods of adjustment or terms of settlement. In some cases, the Security Council can resort to imposing sanctions or even authorize the use of force to maintain or restore international peace and security.

### Inspect the Webpage

#### Visual Inspection
The first step is always to check out the webpage. Let's go to the UN webpage that lists the Security Council resolutions: https://www.un.org/securitycouncil/content/resolutions-0

or (once we click on a year)

https://www.un.org/securitycouncil/content/resolutions-adopted-security-council-1946
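The year sits at the very end of this address, so the page for any year can be built from a fixed prefix. The snippet below is a minimal sketch of that idea. It assumes every year follows the same pattern as the 1946 page, which is not verified here, and `year_url` is a hypothetical helper name.

```python
# Pattern observed in the address bar: the year is the last part of the URL.
ROOT = "https://www.un.org/securitycouncil/content/"
PAGE_PREFIX = "resolutions-adopted-security-council-"

def year_url(year):
    """Return the listing page URL for one year (hypothetical helper)."""
    return f"{ROOT}{PAGE_PREFIX}{year}"

# Build the URLs for the first three years as an illustration.
urls = [year_url(y) for y in range(1946, 1949)]
for u in urls:
    print(u)
```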

#### Inspect using Developer Tools
The second step is to use the developer tools to understand the structure of the website. All modern browsers come with developer tools installed. We will now do this for Chrome.

On macOS, you can open up the developer tools through the menu by selecting View → Developer → Developer Tools. On Windows and Linux, you can access them by clicking the top-right menu button (⋮) and selecting More Tools → Developer Tools. You can also access your developer tools by right-clicking on the page and selecting the Inspect option or using a keyboard shortcut:

Mac: Cmd+Option+I (Option is the Alt key)

Windows/Linux: Ctrl+Shift+I

Try to find where the resolution links sit on the webpage. A good way to find a specific element is to click the dotted box with the arrow in the inspection tools and then click on the element. The inspection tab on the right will jump to that element.

Note what happens when you go into inspection mode and click on one of the resolution links. In particular, note the parts `embed-responsive-item` and `src="/pdf?symbol=en/S/RES/15(1946)"`.

#### Beautiful Soup and CSS selector references

The Beautiful Soup documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

The following CSS selector reference can come in handy when maneuvering on a page with Beautiful Soup:
https://www.w3schools.com/cssref/css_selectors.asp
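To see how CSS selectors map onto Beautiful Soup, here is a minimal sketch that runs `select()` on a tiny hand-written table. The HTML string is a stand-in that only mimics the structure of the UN page, and the selector `table tbody tr td a` is one plausible way to reach the resolution links, not necessarily the only one.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the real page (an assumption, for illustration only).
html = """
<table class="table table-striped table-sm">
  <tbody>
    <tr><td><a href="http://undocs.org/en/S/RES/2795(2025)">S/RES/2795 (2025)</a></td>
        <td>31 October 2025</td><td>The situation in Bosnia and Herzegovina</td></tr>
    <tr><td><a href="http://undocs.org/en/S/RES/2794(2025)">S/RES/2794 (2025)</a></td>
        <td>17 October 2025</td><td>The question concerning Haiti (Haiti sanctions)</td></tr>
  </tbody>
</table>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# "table tbody tr td a" = every <a> nested inside a cell of a table body row.
links = soup_demo.select("table tbody tr td a")
hrefs = [a["href"] for a in links]
print(hrefs)

# select_one() returns only the first match (or None if nothing matches).
first_date = soup_demo.select_one("tbody tr td:nth-of-type(2)").get_text(strip=True)
print(first_date)
```

`select()` always returns a list of matches, so an empty list is a quick sign that the selector does not fit the page.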

In [1]:
import requests
from bs4 import BeautifulSoup
import os

# The URL follows the same pattern as the 1946 page shown above; here we request the page for 2025

root="https://www.un.org/securitycouncil/content/"
first_year="resolutions-adopted-security-council-2025"

URL = root+first_year
print(URL)
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
response = requests.get(URL, headers=headers)
print(response.text)

https://www.un.org/securitycouncil/content/resolutions-adopted-security-council-2025
<!DOCTYPE html>
<html lang="en" dir="ltr">
  <head>
    <meta charset="utf-8" />
<noscript><style>form.antibot * :not(.antibot-message) { display: none !important; }</style>
</noscript><meta name="description" content="Resolutions adopted by the Security Council" />
<link rel="canonical" href="https://main.un.org/securitycouncil/en/content/resolutions-adopted-security-council-2025" />
<meta name="MobileOptimized" content="width" />
<meta name="HandheldFriendly" content="true" />
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no" />
<meta http-equiv="x-ua-compatible" content="ie=edge" />
<link rel="icon" href="/securitycouncil/themes/custom/un3_sc/favicon.ico" type="image/vnd.microsoft.icon" />
<link rel="alternate" hreflang="ar" href="https://main.un.org/securitycouncil/ar/content/resolutions-adopted-security-council-2025" />
<link rel="alternate" hreflang="zh-hans" h

In [2]:
#seeing it as a list of hyperlinks with the content
#we first load into soup
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.prettify())


<!DOCTYPE html>
<html dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <noscript>
   <style>
    form.antibot * :not(.antibot-message) { display: none !important; }
   </style>
  </noscript>
  <meta content="Resolutions adopted by the Security Council" name="description"/>
  <link href="https://main.un.org/securitycouncil/en/content/resolutions-adopted-security-council-2025" rel="canonical"/>
  <meta content="width" name="MobileOptimized"/>
  <meta content="true" name="HandheldFriendly"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <meta content="ie=edge" http-equiv="x-ua-compatible"/>
  <link href="/securitycouncil/themes/custom/un3_sc/favicon.ico" rel="icon" type="image/vnd.microsoft.icon"/>
  <link href="https://main.un.org/securitycouncil/ar/content/resolutions-adopted-security-council-2025" hreflang="ar" rel="alternate"/>
  <link href="https://main.un.org/securitycouncil/zh/content/resolutions-adopted-security-council-2025" h

## Maneuver to the right object

Beautiful Soup objects have powerful methods like `find()` and `find_all()`.

In the soup above, the table of resolutions starts with these elements:

                table class="table table-striped table-sm"
                tbody

The following code maneuvers there:

                table = soup.find('table')
                table_body = table.find('tbody')

Then we collect all rows in a list-like object:

                rows = table_body.find_all('tr')

### Let's do one example

In [3]:
soup = BeautifulSoup(response.text, 'html.parser')

table = soup.find('table')

table_body = table.find('tbody')

#the row command identifies them through <tr>
rows = table_body.find_all('tr')
print("rows is a ResultSet, which is a subclass of list:", type(rows))
print("So indexing works as usual. The first element is:", rows[0])

rows is a ResultSet, which is a subclass of list: <class 'bs4.element.ResultSet'>
So indexing works as usual. The first element is: <tr><td><a href="http://undocs.org/en/S/RES/2795(2025)">S/RES/2795 (2025)</a></td><td>31 October 2025</td><td>The situation in Bosnia and Herzegovina</td></tr>


## Getting into the table and retrieving cells

`row.find_all('td')`: Retrieves all table cells (`<td>` tags) in the current row.

For the row above, `cells` will contain:

    Cell 1: `<td><a href="http://undocs.org/en/S/RES/2795(2025)">S/RES/2795 (2025)</a></td>`

    Cell 2: `<td>31 October 2025</td>`

    Cell 3: `<td>The situation in Bosnia and Herzegovina</td>`


In [4]:
cells=rows[0].find_all('td')
print(cells[1])


<td>31 October 2025</td>
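A cell often holds more than plain text. The sketch below pulls both the link and the visible text out of the first cell, using a row copied from the printed output above. The variable names (`row_html`, `demo_cells`, `anchor`) are chosen here for illustration.

```python
from bs4 import BeautifulSoup

# One table row copied from the printed output above.
row_html = ('<tr><td><a href="http://undocs.org/en/S/RES/2795(2025)">S/RES/2795 (2025)</a></td>'
            '<td>31 October 2025</td><td>The situation in Bosnia and Herzegovina</td></tr>')

demo_cells = BeautifulSoup(row_html, "html.parser").find_all("td")

anchor = demo_cells[0].find("a")             # the <a> tag inside the first cell, or None
link = anchor["href"] if anchor else None    # attributes are read like dictionary keys
label = demo_cells[0].get_text(strip=True)   # visible text without surrounding whitespace
date = demo_cells[1].get_text(strip=True)

print(link)
print(label)
print(date)
```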


## Putting it all together: collecting the metadata in one table

The following code uses regular expressions: the pattern `\d` (a digit) and the function `re.search`.

In [5]:
import re

year = 2025
year = str(year)

root = "https://www.un.org/securitycouncil/content/"
first_year = "resolutions-adopted-security-council-"

URL = root + first_year + year

# Send a browser-like User-Agent header with the request
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
response = requests.get(URL, headers=headers)

# Maneuver to the rows of the resolutions table
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')
table_body = table.find('tbody')
rows = table_body.find_all('tr')

# Collect number, date, headline and link for every row that has a link
data = []
for row in rows:
    cells = row.find_all('td')
    if len(cells) >= 3:
        resolution_link = cells[0].find('a')['href'] if cells[0].find('a') else None
        resolution_number = re.search(r'S/RES/(\d+)', cells[0].text.strip()).group(1)
        resolution_date = cells[1].text.strip()
        resolution_headline = cells[2].text.strip()

        if resolution_link:
            data.append([resolution_number, resolution_date, resolution_headline, resolution_link])


In [6]:
import pandas as pd

# Convert to DataFrame
df = pd.DataFrame(data, columns=["Resolution Number", "Date", "Headline", "PDF Link"])
df.head()

| | Resolution Number | Date | Headline | PDF Link |
|---|---|---|---|---|
| 0 | 2795 | 31 October 2025 | The situation in Bosnia and Herzegovina | http://undocs.org/en/S/RES/2795(2025) |
| 1 | 2794 | 17 October 2025 | The question concerning Haiti (Haiti sanctions) | http://undocs.org/en/S/RES/2794(2025) |
| 2 | 2793 | 30 September 2025 | The question concerning Haiti | http://undocs.org/en/S/RES/2793(2025) |
| 3 | 2792 | 17 September 2025 | The situation between Iraq and Kuwait | http://undocs.org/en/S/RES/2792(2025) |
| 4 | 2791 | 12 September 2025 | Reports of the Secretary-General on the Sudan ... | http://undocs.org/en/S/RES/2791(2025) |
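The `Date` column is still plain text. Here is a minimal sketch of converting such strings to dates with the standard library. It assumes an English locale, because the month-name format `%B` is locale-dependent, and `parse_un_date` is a hypothetical helper name that is not part of the notebook.

```python
from datetime import datetime

def parse_un_date(text):
    """Parse dates written like '31 October 2025' (hypothetical helper)."""
    return datetime.strptime(text.strip(), "%d %B %Y").date()

# Date strings copied from the table above.
dates = ["31 October 2025", "17 October 2025", "30 September 2025"]
parsed = [parse_un_date(d) for d in dates]
print(parsed[0].isoformat())
```

Real dates sort chronologically, which the raw strings do not.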



### Explanation of `re.search(r'S/RES/(\d+)', cells[0].text.strip()).group(1)`

See documentation for match object here: https://docs.python.org/3/library/re.html#match-objects

(using ChatGPT)
#### Step-by-Step Breakdown:

1. **`cells[0].text.strip()`**:
   - Extracts the text content of the first cell (`cells[0]`) and removes any leading or trailing whitespace.
   - **Example**: If the cell contains `" S/RES/2564 "`, this becomes `"S/RES/2564"`.

---

2. **`re.search(r'S/RES/(\d+)', ...)`**:
   - **Purpose**: Uses a regular expression to search for a specific pattern within the extracted text.
   - **Pattern Explanation (`r'S/RES/(\d+)'`)**:
     - `r'...'`: The `r` prefix denotes a raw string, so backslashes are treated literally.
     - `S/RES/`: Matches the literal text `S/RES/`.
     - `(\d+)`: Captures a sequence of one or more digits (`\d` = digit, `+` = one or more).
       - Parentheses `()` create a **capture group**, allowing you to extract the matched digits.
   - **Example**: In the string `"S/RES/2564"`, it matches the entire pattern and captures `"2564"` as the group.

---

3. **`.group(1)`**:
   - Retrieves the first (and in this case, only) **capture group** from the match.
   - If the pattern matches, `.group(1)` will return the digits captured by `(\d+)`.
   - **Example**: If the match is `"S/RES/2564"`, `.group(1)` returns `"2564"`.
   - If there’s no match, `re.search(...)` returns `None`, and calling `.group(1)` would raise an `AttributeError`.

---

#### Let's check in an example:

In [None]:
text = "S/RES/2564"
match = re.search(r'S/RES/(\d+)', text)
if match:
    print(match.group(1))

2564


In [7]:
text = "S/RES/2564"
match = re.search(r'S/RES/(\d+)', text)
if match:
    print(match.group(0))

S/RES/2564


In [8]:
text = "Not a resolution"
match = re.search(r'S/RES/(\d+)', text)
if match:
    print(match.group(1))
else:
    print("No match found")

No match found
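The three checks above can be folded into one small function. This is a sketch and not part of the scraping code: `parse_resolution` and `RES_PATTERN` are hypothetical names, and the pattern is extended with named groups to also capture the year in parentheses, as in the cell text `S/RES/2795 (2025)`.

```python
import re

# Named groups: (?P<name>...) lets us ask for a group by name instead of by position.
RES_PATTERN = re.compile(r"S/RES/(?P<number>\d+)\s*\((?P<year>\d{4})\)")

def parse_resolution(text):
    """Return (number, year) as strings, or None if the text does not match."""
    match = RES_PATTERN.search(text)
    if match is None:
        return None
    return match.group("number"), match.group("year")

print(parse_resolution("S/RES/2795 (2025)"))
print(parse_resolution("Not a resolution"))
```

Returning `None` instead of calling `.group()` directly avoids the `AttributeError` described in step 3.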


# Bonus

It took me half a day to figure this out, but in the end the webpage gave up its secrets. Below I give you the full code that downloads a full year from the webpage and builds a dataframe with the text.

You cannot get this code entirely from ChatGPT. It takes the same attitude as fully hand-coded scraping: try things out and work out the patterns in how the pages are generated.

### The Scraping Core

This is the core step: it downloads from a link that you can only find if you try a few things manually first.

Note that the code only downloads the PDFs for the first three rows of the dataframe generated above.

In [9]:
import pandas as pd
import requests
import time
import os

# Base URL for PDF download
base_url = "https://daccess-ods.un.org/access.nsf/Get?OpenAgent&DS=s/res/{}&Lang=E"

# Function to download PDF
def download_pdf(resolution_number, download_url, folder):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        response = requests.get(download_url, headers=headers, timeout=30)
        if response.status_code == 200:
            filename = os.path.join(folder, f"Resolution_{resolution_number}.pdf")
            with open(filename, "wb") as file:
                file.write(response.content)
            print(f"Downloaded: {filename}")
        else:
            print(f"Failed to download: {download_url} (Status code: {response.status_code})")
    except Exception as e:
        print(f"Error downloading {download_url}: {e}")

# Create destination folder based on year (must match the year used to build df above)
year = 2025
folder_name = f"UNResolutions_{year}"
os.makedirs(folder_name, exist_ok=True)

# Loop through the first three rows of the dataframe and download PDFs
for index, row in df.iloc[:3].iterrows():
    resolution_number = row["Resolution Number"]
    pdf_link = base_url.format(row["PDF Link"].split("/")[-1])
    download_pdf(resolution_number, pdf_link, folder_name)
    time.sleep(3)  # Delay of 3 seconds


Downloaded: UNResolutions_2025/Resolution_2795.pdf
Downloaded: UNResolutions_2025/Resolution_2794.pdf
Downloaded: UNResolutions_2025/Resolution_2793.pdf
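The key line is the one that turns the link from the table into the download URL. The sketch below walks through it for a single link taken from the dataframe. The `_demo` variable names are used only to avoid overwriting the variables above.

```python
base_url_demo = "https://daccess-ods.un.org/access.nsf/Get?OpenAgent&DS=s/res/{}&Lang=E"
pdf_link_demo = "http://undocs.org/en/S/RES/2795(2025)"

# Everything after the last "/" is the resolution number plus the year in parentheses.
symbol = pdf_link_demo.split("/")[-1]

# str.format() puts that piece in place of the {} placeholder.
download_url = base_url_demo.format(symbol)

print(symbol)
print(download_url)
```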


### The Full Program

This program first defines some helper functions and then goes to the year that needs to be scraped and produces the full dataframe.

For me the code sometimes fails to download. If that happens, simply stop and rerun the following cell. It should eventually work.

In [None]:
! pip install PyPDF2
import pandas as pd
import requests
import time
import os
import re
from bs4 import BeautifulSoup
from PyPDF2 import PdfReader

# Base URL for PDF download
base_url = "https://daccess-ods.un.org/access.nsf/Get?OpenAgent&DS=s/res/{}&Lang=E"

# Function to download PDF
def download_pdf(resolution_number, download_url, folder):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
        response = requests.get(download_url, headers=headers, timeout=30)
        if response.status_code == 200:
            filename = os.path.join(folder, f"Resolution_{resolution_number}.pdf")
            with open(filename, "wb") as file:
                file.write(response.content)
            print(f"Downloaded: {filename}")
        else:
            print(f"Failed to download: {download_url} (Status code: {response.status_code})")
    except Exception as e:
        print(f"Error downloading {download_url}: {e}")

# Function to extract text from PDF
def extract_text_from_pdf(pdf_path):
    try:
        reader = PdfReader(pdf_path)
        text = ""
        for page in reader.pages:
            text += page.extract_text()
        return text
    except Exception as e:
        print(f"Error reading PDF {pdf_path}: {e}")
        return None

# Scrape the webpage to create the dataframe
year = 2024
year_str = str(year)

root = "https://www.un.org/securitycouncil/content/"
first_year = "resolutions-adopted-security-council-"

URL = root + first_year + year_str

headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36"}
response = requests.get(URL, headers=headers)

soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')

if table:
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')

    data = []
    for i, row in enumerate(rows):
        cells = row.find_all('td')
        if len(cells) >= 3:
            resolution_link = cells[0].find('a')['href'] if cells[0].find('a') else None
            resolution_number = re.search(r'S/RES/(\d+)', cells[0].text.strip()).group(1)
            resolution_date = cells[1].text.strip()
            resolution_headline = cells[2].text.strip()

            if resolution_link:
                data.append([resolution_number, resolution_date, resolution_headline, resolution_link])

    # Create dataframe
    df = pd.DataFrame(data, columns=["Resolution Number", "Date", "Headline", "PDF Link"])

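    # Create destination folder based on year.
    # outpath is not defined earlier in this notebook; the run shown below used a Google Drive folder.
    # Change it to a directory that exists on your machine.
    outpath = "/content/drive/MyDrive/BSE Text Mining 2025/Session 2/"
    folder_name = os.path.join(outpath, f"UNResolutions_{year}")
    os.makedirs(folder_name, exist_ok=True)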

    # Add a column for extracted text
    df["Extracted Text"] = ""

    # Loop through the dataframe and download PDFs
    for index, row in df.iterrows():
        resolution_number = row["Resolution Number"]
        pdf_link = base_url.format(row["PDF Link"].split("/")[-1])
        pdf_path = os.path.join(folder_name, f"Resolution_{resolution_number}.pdf")

        # Download the PDF
        download_pdf(resolution_number, pdf_link, folder_name)

        # Extract text from the downloaded PDF
        if os.path.exists(pdf_path):
            extracted_text = extract_text_from_pdf(pdf_path)
            df.at[index, "Extracted Text"] = extracted_text

        time.sleep(3)  # Delay of 3 seconds
else:
    print("No table found on the webpage.")


Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2767.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2766.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2765.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2764.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2763.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2762.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2761.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2760.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025/Session 2/UNResolutions_2024/Resolution_2759.pdf
Downloaded: /content/drive/MyDrive/BSE Text Mining 2025
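Instead of rerunning the cell by hand when a download fails, the retry could be automated. The following is only a sketch: `with_retries` is a hypothetical helper, and it assumes the download step is adapted to return `True` on success, which `download_pdf` above does not do (it only prints).

```python
import time

def with_retries(action, attempts=3, delay=3.0):
    """Call action() until it returns True or attempts run out (hypothetical helper)."""
    for attempt in range(1, attempts + 1):
        if action():
            return attempt          # number of tries that were needed
        if attempt < attempts:
            time.sleep(delay)       # wait before trying again
    return None                     # every attempt failed

# Stand-in for a download that fails twice and then succeeds.
outcomes = iter([False, False, True])
tries = with_retries(lambda: next(outcomes), attempts=5, delay=0)
print(tries)
```

Capping the number of attempts keeps the loop from hammering the server when a document is genuinely unavailable.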

In [None]:
df

| | Resolution Number | Date | Headline | PDF Link | Extracted Text |
|---|---|---|---|---|---|
| 0 | 2767 | 27 December 2024 | The situation in Somalia (AUSSOM) | https://www.undocs.org/s/res/2767(2024) | United Nations S/RES/2767 (2024) \n Secur... |
| 1 | 2766 | 20 December 2024 | The situation in the Middle East (UNDOF) | https://www.undocs.org/s/res/2766(2024) | United Nations S/RES/2766 (2024) \n Secur... |
| 2 | 2765 | 20 December 2024 | The situation concerning the Democratic Republ... | https://www.undocs.org/s/res/2765(2024) | United Nations S/RES/2765 (2024) \n Secur... |
| 3 | 2764 | 20 December 2024 | Children and armed conflict | https://www.undocs.org/s/res/2764(2024) | United Nations S/RES/2764 (2024) \n Secur... |
| 4 | 2763 | 13 December 2024 | Threats to international peace and security ca... | https://www.undocs.org/s/res/2763(2024) | United Nations S/RES/2763 (2024) \n Secur... |
| 5 | 2762 | 13 December 2024 | Peace and security in Africa | https://www.undocs.org/s/res/2762(2024) | United Nations S/RES/2762 (2024) \n Secur... |
| 6 | 2761 | 6 December 2024 | General issues related to sanctions | https://www.undocs.org/s/res/2761(2024) | United Nations S/RES/2761 (2024) \n Secur... |
| 7 | 2760 | 14 November 2024 | Reports of the Secretary-General on the Sudan ... | https://www.undocs.org/s/res/2760(2024) | United Nations S/RES/2760 (2024) \n Secur... |
| 8 | 2759 | 14 November 2024 | The situation in the Central African Republic ... | https://www.undocs.org/s/res/2759(2024) | United Nations S/RES/2759 (2024) \n Secur... |
| 9 | 2758 | 13 November 2024 | The situation in the Middle East (2140 sanctions) | https://www.undocs.org/s/res/2758(2024) | United Nations S/RES/2758 (2024) \n Secur... |


In [None]:
print(df["Extracted Text"][0])

 United Nations   S/RES/2767 (2024)  
  Security Council   Distr.: General  
27 December 2024  
 
24-24584 (E)  
*2424584*   
 
Resolution 2767 (2024)  
 
 
Adopted by the Security Council at its 9828th meeting, on 
27 December 2024  
 
 
 The Security Council , 
 Recalling  all its previous resolutions and statements of its President on the 
situation in Somalia,  
 Reaffirming  its full respect for the sovereignty, territorial integrity, political 
independence, and unity of Somalia,  
 Recalling  that the Federal Government of Somalia (FGS) has primary 
responsibility for ensuring security within Somalia, and recognising  the FGS’s 
request for continued international support to enable it to achieve progressively its 
aim of a secure, stable, peaceful, united and democratic country,  
 Commending  the contribution to peace and security in Somalia made by the 
African Union Mission in Somalia (AMISOM) and its successor, the African Union 
Transition Mission in Somalia (ATMIS), since 
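The extracted text keeps the layout whitespace of the PDF. Here is a minimal sketch of tidying it with regular expressions, using the first lines of the output above as a sample. `normalise_whitespace` is a hypothetical helper, and whether to remove line breaks as well depends on what you want to do with the text later.

```python
import re

def normalise_whitespace(text):
    """Collapse runs of spaces/tabs and drop empty lines (hypothetical helper)."""
    text = re.sub(r"[ \t]+", " ", text)    # runs of spaces or tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)   # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)   # drop empty lines
    return text.strip()

# Sample copied from the start of the extracted text shown above.
sample = (" United Nations   S/RES/2767 (2024)  \n"
          "  Security Council   Distr.: General  \n"
          "27 December 2024  \n \n")
cleaned = normalise_whitespace(sample)
print(cleaned)
```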