# POP77142 Assignment 1: Text Preparation

## Before Submission

-   Make sure that you can run all cells without errors
-   You can do it by clicking `Kernel`, `Restart & Run All` in the menu
    above
-   Make sure that you save the output by pressing Command+S / CTRL+S
-   Rename the file from `01_assignment.ipynb` to
    `01_lastname_firstname_studentnumber.ipynb`
-   Use the Firefox browser when submitting your Jupyter notebook on
    Blackboard.

## Overview

In this assignment you will need to collect and prepare textual data for
analysis. As the data source we will use debates in Dáil Éireann (the
Irish Parliament) for the first 2 months of 2025 (in practice, once you
implement a solution for those, it should be relatively straightforward
to scale up).

There are 2 broad strategies that can be used to obtain Dáil debates:

1.  Use the [Oireachtas
    website](https://www.oireachtas.ie/en/debates/find/) to scrape the
    debates using R (e.g. `rvest`) or Python (e.g. `Beautiful Soup`).
    There can be different strategies to solve this, but, crucially, the
    website is largely static, so dealing with it as a set of HTML files
    is quite manageable.
2.  Use the [Oireachtas API](https://api.oireachtas.ie/) to scrape the
    debates using R (e.g. `httr2`) or Python (e.g. `requests`). This
    might be a more advanced option, but it is also a lot more powerful
    and flexible. Importantly, this API does not require authentication,
    which makes working with it quite a bit simpler than with many other
    APIs.
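
For the second strategy, a request for debates in a date range might be sketched as follows. The endpoint path `/v1/debates` and the query parameter names (`chamber_type`, `chamber`, `date_start`, `date_end`, `limit`, `skip`) are assumptions that should be checked against the API documentation before use, and `build_debates_url` and `fetch_debates` are hypothetical helper names.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the documentation at https://api.oireachtas.ie/
API_DEBATES_URL = "https://api.oireachtas.ie/v1/debates"


def build_debates_url(date_start, date_end, limit=50, skip=0):
    """Build a debates query URL for the Dáil (parameter names are assumptions)."""
    params = {
        "chamber_type": "house",
        "chamber": "dail",
        "date_start": date_start,
        "date_end": date_end,
        "limit": limit,
        "skip": skip,
    }
    return f"{API_DEBATES_URL}?{urlencode(params)}"


def fetch_debates(date_start, date_end, limit=50):
    """Request one page of debate records and return the parsed JSON."""
    import requests  # imported lazily so the URL helper works without it

    response = requests.get(build_debates_url(date_start, date_end, limit), timeout=30)
    response.raise_for_status()  # no authentication needed, but requests can still fail
    return response.json()


# Building the URL does not touch the network; fetch_debates() does.
url = build_debates_url("2025-01-01", "2025-02-28")
print(url)
```

Keeping the URL construction separate from the request makes it easy to inspect the query before sending anything to the server.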


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import re

## Part 1: Data Acquisition

In this part you will need to write a scraper that collects the data
either directly from the Oireachtas website or using the Oireachtas API.
The data should be collected for the first 2 months of 2025 (January and
February, but the bulk of the debates would be in February).
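
Rather than hard-coding sitting days, one option is to generate every calendar date in the period and let the scraper skip dates that have no debate page. The sketch below uses the same URL pattern as the scraper in this notebook; `candidate_dates` is a hypothetical helper, and the idea that non-sitting days simply return an error or an empty page is an assumption to verify.

```python
from datetime import date, timedelta

# Same base URL as used by the scraper in this notebook
BASE_URL = "https://www.oireachtas.ie/en/debates/debate/dail/"


def candidate_dates(start, end):
    """Yield every calendar date from start to end (inclusive) as YYYY-MM-DD."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


# All days in January and February 2025 (2025 is not a leap year)
all_dates = list(candidate_dates(date(2025, 1, 1), date(2025, 2, 28)))
candidate_urls = [f"{BASE_URL}{d}/" for d in all_dates]

print(len(all_dates), all_dates[0], all_dates[-1])
```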

Depending on how you choose to organise your code, you may build up a
standard tabular dataset straight away, or you might find it easier to
store the data in a different container (e.g. a list of vectors, a list
of lists, a list of dictionaries, etc.) and then convert
it to a tabular format in the next part.
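
As a minimal sketch of the list-of-dictionaries route, each speech can be appended as one dictionary and the whole list converted to a `pandas` DataFrame at the end. The records here are made-up placeholders, not real debate data, and `df_example` is a hypothetical name.

```python
import pandas as pd

# Placeholder records standing in for scraped speeches (not real data)
records = [
    {"date": "2025-01-22", "speaker": "Speaker A", "text": "First example speech."},
    {"date": "2025-01-22", "speaker": "Speaker B", "text": "Second example speech."},
]

# New speeches are appended one dictionary at a time while scraping
records.append(
    {"date": "2025-01-23", "speaker": "Speaker A", "text": "Third example speech."}
)

# Dictionary keys become column names, one row per dictionary
df_example = pd.DataFrame(records)
print(df_example.shape)
```

Appending to a plain list and converting once tends to be simpler and faster than growing a DataFrame row by row.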

You may use generative AI to help you with trialing different
approaches. If you do use AI, you need to report the version of the LLM
that you are using (e.g. `code-davinci-002`,
`meta-llama-3.1-8b-instruct`, etc.). Hardware permitting, I encourage you
to use offline models to have better control over the data and the
model.

While there may also be some bindings for the API that are readily
available, none of them are officially supported, so you shouldn’t
rely on those.

In [2]:
#Counting tokens and unique words
def count_tokens(text):
    words = re.findall(r'\b\w+\b', text.lower()) 
    return len(words), len(set(words))  #(Total tokens, Unique words)

#List of dates to scrape 
dates = ["2025-01-22", "2025-01-23", "2025-02-05", "2025-02-06","2025-02-11",
         "2025-02-12","2025-02-13","2025-02-18","2025-02-19","2025-02-20",
         "2025-02-25","2025-02-26","2025-02-27"]

#Base URL for Debates
base_url = "https://www.oireachtas.ie/en/debates/debate/dail/"

#Storing URLs
date_urls = {}

#Loop through each date and find URLs for each section and subsection if applicable
for date in dates:
    section_urls = []
    full_url = f"{base_url}{date}/"

    try:
        response = requests.get(full_url)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")

        results_div = soup.find("div", class_="results")

        if results_div:
            for link in results_div.find_all("a", href=True):
                section_url = f"https://www.oireachtas.ie{link['href']}"

                #Adds new sections
                if section_url not in section_urls:
                    section_urls.append(section_url)

                #Checks for subsections
                subsection_response = requests.get(section_url)
                subsection_response.raise_for_status()
                subsection_soup = BeautifulSoup(subsection_response.text, "html.parser")

                #Find subsection links
                for sublink in subsection_soup.find_all("a", href=True):
                    sub_href = sublink.get("href")
                    if sub_href.startswith("#s") and f"{section_url}{sub_href}" not in section_urls:
                        section_urls.append(f"{section_url}{sub_href}")

        #Store found URLs
        date_urls[date] = section_urls

    except requests.exceptions.RequestException as err:
        #Report the failure (e.g. blocked or not found) instead of silently skipping the date
        print(f"Request failed for {date}: {err}")

#Confirm finished
print(" URL compiling complete.")
print(date_urls)


 URL compiling complete.
{'2025-01-22': ['https://www.oireachtas.ie/en/debates/debate/dail/2025-01-22/1/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-22/2/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-22/3/'], '2025-01-23': ['https://www.oireachtas.ie/en/debates/debate/dail/2025-01-23/1/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-23/2/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-23/7/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-23/12/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-23/13/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-23/18/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-01-23/19/'], '2025-02-05': ['https://www.oireachtas.ie/en/debates/debate/dail/2025-02-05/1/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-02-05/2/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-02-05/3/', 'https://www.oireachtas.ie/en/debates/debate/dail/2025-02-05/8/',

In [3]:
data = []

#For loop to scrape data from URLs found above
for date, section_urls in date_urls.items():
    for url in section_urls :
        try:
            section_response = requests.get(url)
            section_response.raise_for_status()
            section_soup = BeautifulSoup(section_response.text, "html.parser")

            #Volume and Debate Number source description: <p class="c-hero__subtitle">
            vol, debate_no = "N/A", "N/A"

            subtitle = section_soup.find("p", class_="c-hero__subtitle")
            if subtitle:
                vol_match = re.search(r'Vol\.\s*(\d+)', subtitle.text)
                no_match = re.search(r'No\.\s*(\d+)', subtitle.text)

                vol = vol_match.group(1) if vol_match else vol
                debate_no = no_match.group(1) if no_match else debate_no

            #Extract Dáil number
            dail = "N/A"
            meta_title = section_soup.find("meta", {"name": "title"})
            if meta_title:
                title_content = meta_title.get("content", "")
                dail_match = re.search(r'\((\d+.. Dáil)\)', title_content)
                dail = dail_match.group(1) if dail_match else dail

            #Track unique speakers per section
            seen_speakers = set()

            #Find all speaker sections Source: <div class="speech brief">
            speaker_sections = section_soup.find_all("div", class_="speech brief")

            for speech in speaker_sections:
                #Extract Speaker Name Source: <h4 class="c-avatar__name">
                speaker_name = "Unknown"
                speaker_tag = speech.find("h4", class_="c-avatar__name")

                if speaker_tag:
                    speaker_link = speaker_tag.find("a")
                    if speaker_link:
                        speaker_name = speaker_link.text.strip()

                #Prevent duplicates in a section (precursor to ensuring unique only)
                if speaker_name in seen_speakers:
                    continue
                seen_speakers.add(speaker_name)

                #Extract Speech Text
                text = ""
                paragraphs = speech.find_all("p")
                if paragraphs:
                    text = " ".join([p.text.strip() for p in paragraphs])

                #Count tokens and types
                ntokens, ntypes = count_tokens(text)

                #Store data
                data.append({
                    "Dáil": dail,
                    "Vol": vol,
                    "No": debate_no,
                    "Date": date,
                    "Speaker": speaker_name,
                    "Text": text,
                    "ntokens": ntokens,
                    "ntypes": ntypes
                })

            time.sleep(random.uniform(1, 3))  #Time delays to not overload

        except Exception as err:
            #Report the failure instead of silently skipping the section
            print(f"Skipping {url}: {err}")
#Convert to DataFrame
df = pd.DataFrame(data)
print(df)
print("Data scraping complete.")
#If an empty DataFrame is printed, the site has likely returned Error 403 (Forbidden) and blocked the scraper
#In that case wait some time before retrying, as the alternatives require complex coding, VPNs, or IP changes


           Dáil   Vol No        Date  \
0     34th Dáil  1062  2  2025-01-22   
1     34th Dáil  1062  2  2025-01-22   
2     34th Dáil  1062  2  2025-01-22   
3     34th Dáil  1062  2  2025-01-22   
4     34th Dáil  1062  2  2025-01-22   
...         ...   ... ..         ...   
1193  34th Dáil  1063  5  2025-02-26   
1194  34th Dáil  1063  5  2025-02-26   
1195  34th Dáil  1063  5  2025-02-26   
1196  34th Dáil  1063  5  2025-02-26   
1197  34th Dáil  1063  5  2025-02-26   

                                               Speaker  \
0                                  Deputy Eoin Ó Broin   
1                             Deputy Mary Lou McDonald   
2                          Deputy Richard Boyd Barrett   
3                               Deputy Louise O'Reilly   
4                                   An Ceann Comhairle   
...                                                ...   
1193                  Deputy Jennifer Carroll MacNeill   
1194                            Deputy David Cullinane 

## Part 2: Text Preprocessing

In this part you will need to clean up the collected data. Depending on
how the previous part was implemented it might take more or fewer steps.
The ultimate goal is to have a dataset of the following form:

| dail | vol | no  | date | speaker | text | ntokens | ntypes |
|------|-----|-----|------|---------|------|---------|--------|

where:

`dail` - is the number of the Dáil (e.g. 34th Dáil)

`vol` - is the volume number of the debates (e.g. 1000)

`no` - is the number of the debate in the volume (e.g. 1)

`date` - is the date of the debate (in YYYY-MM-DD form, e.g. 2025-01-01)

`speaker` - is the name of the speaker

`text` - is the text of the speech

`ntokens` - is the number of tokens in the speech

`ntypes` - is the number of types in the speech
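
To make the distinction between the last two columns concrete: tokens are all word occurrences, while types are the distinct words. The sketch below uses the same simple regular-expression tokenisation as `count_tokens` in this notebook; other tokenisers would give different counts, and `count_tokens_and_types` is a hypothetical name.

```python
import re


def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a lowercased text."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return len(tokens), len(set(tokens))


n_tokens, n_types = count_tokens_and_types("The cat sat on the mat.")
print(n_tokens, n_types)  # 6 tokens, 5 types ("the" occurs twice)
```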

Note that you **don’t** need to submit the actual dataset. However,
after organising the textual data in this way, you will need to perform
the following steps:

-   Print out the first and last 5 rows of the data
-   Print the dimensionality of the data (number of rows and number of
    columns)
-   Print the total number of unique speakers in the dataset.
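
If the collected data uses different column names from the target schema, one possible preliminary step is to rename them before running these checks. The sketch below assumes capitalised source columns (`Dáil`, `Vol`, and so on) and uses made-up rows; `toy`, `column_map` and `tidy` are hypothetical names.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame(
    {
        "Dáil": ["34th Dáil"] * 3,
        "Vol": ["1062"] * 3,
        "No": ["2"] * 3,
        "Date": ["2025-01-22"] * 3,
        "Speaker": ["Speaker A", "Speaker B", "Speaker A"],
        "Text": ["One two.", "Three.", "Four five six."],
        "ntokens": [2, 1, 3],
        "ntypes": [2, 1, 3],
    }
)

# Map the assumed source names onto the target schema
column_map = {
    "Dáil": "dail",
    "Vol": "vol",
    "No": "no",
    "Date": "date",
    "Speaker": "speaker",
    "Text": "text",
}
tidy = toy.rename(columns=column_map)

print(tidy.head(5))                 # first rows
print(tidy.tail(5))                 # last rows
print(tidy.shape)                   # (rows, columns)
print(tidy["speaker"].nunique())    # number of distinct speakers
```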

In [4]:

          
# Output required info
print("First 5 rows:")
display(df.head())

print("Last 5 rows:")
display(df.tail())

print(f"Data dimensions: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"Number of unique speakers: {df['Speaker'].nunique()}")

First 5 rows:


|     | Dáil | Vol | No | Date | Speaker | Text | ntokens | ntypes |
|-----|------|-----|----|------|---------|------|---------|--------|
| 0 | 34th Dáil | 1062 | 2 | 2025-01-22 | Deputy Eoin Ó Broin | That is a joke. | 4 | 4 |
| 1 | 34th Dáil | 1062 | 2 | 2025-01-22 | Deputy Mary Lou McDonald | It is ridiculous. | 3 | 3 |
| 2 | 34th Dáil | 1062 | 2 | 2025-01-22 | Deputy Richard Boyd Barrett | That is not agreed. | 4 | 4 |
| 3 | 34th Dáil | 1062 | 2 | 2025-01-22 | Deputy Louise O'Reilly | It is absolutely not agreed. | 5 | 5 |
| 4 | 34th Dáil | 1062 | 2 | 2025-01-22 | An Ceann Comhairle | Deputies might give me a chance to ask the que... | 19 | 18 |


Last 5 rows:


|      | Dáil | Vol | No | Date | Speaker | Text | ntokens | ntypes |
|------|------|-----|----|------|---------|------|---------|--------|
| 1193 | 34th Dáil | 1063 | 5 | 2025-02-26 | Deputy Jennifer Carroll MacNeill | Yes. | 1 | 1 |
| 1194 | 34th Dáil | 1063 | 5 | 2025-02-26 | Deputy David Cullinane | I do not want to see children with disabilitie... | 46 | 32 |
| 1195 | 34th Dáil | 1063 | 5 | 2025-02-26 | Deputy Marie Sherlock | Could the Minister please put them in place? | 8 | 8 |
| 1196 | 34th Dáil | 1063 | 5 | 2025-02-26 | Deputy Cathy Bennett | -----including more pharmacies, rehab care and... | 11 | 10 |
| 1197 | 34th Dáil | 1063 | 5 | 2025-02-26 | An Cathaoirleach Gníomhach (Deputy Cathal Crowe) | Go raibh maith agat a Theachta. Anois leanfaid... | 17 | 16 |


Data dimensions: 1198 rows × 8 columns
Number of unique speakers: 138
