## Web Scraping Approach

We used web scraping to collect news articles from The Guardian's website. Three libraries were used in the process:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of the pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Our extraction focuses on article titles and their main body text, taken from a hand-picked list of URLs. For each page, we locate the HTML elements expected to hold the title and the body text, extract them, and compile the results into a single dataset. Applying the same selectors to every page kept the collection process efficient and the resulting records consistent in structure.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_guardian_article(url):
    try:
        response = requests.get(url, timeout=10)  # Time out rather than hang on an unresponsive server
        response.raise_for_status()  # Raise an exception for 4xx/5xx status codes

        soup = BeautifulSoup(response.content, 'html.parser')

        title_element = soup.find('h1', class_='dcr-h5yuj3')  # Assuming this class for Guardian article titles
        title = title_element.text.strip() if title_element else 'The Guardian Article'

        content_element = soup.find('div', class_='article-body-commercial-selector')  # Assuming this class for Guardian content
        if content_element:
            paragraphs = content_element.find_all('p')
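            # Strip each paragraph as a whole; get_text(strip=True) would glue inline fragments (e.g. link text) to the next word
            text = '\n'.join(p.get_text().strip() for p in paragraphs)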
        else:
            text = 'Content not found'

        return {
            'url': url,
            'title': title,
            'text': text,
            'label': 'Human-written'
        }
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

# List of Guardian article URLs to scrape
guardian_urls = [
     "https://www.theguardian.com/football/2024/mar/28/compete-with-the-big-dogs-new-era-beckons-for-the-dominican-republic",
    "https://www.theguardian.com/football/2024/mar/28/euro-2024-power-rankings-a-look-at-the-24-teams-going-to-germany",
    "https://www.theguardian.com/football/2024/mar/27/marcus-rashford-england-manchester-united-international-southgate-euro-2024-football",
    "https://www.theguardian.com/football/2024/mar/28/a-league-tv-broadcast-blackout-production-team-global-advance-administration",
    "https://www.theguardian.com/football/2024/mar/27/southgate-backs-push-for-26-man-squads-at-euro-2024-amid-injury-pile-up",
    "https://www.theguardian.com/football/live/2024/mar/27/chelsea-v-ajax-womens-champions-league-quarter-final-second-leg-live",
    "https://www.theguardian.com/football/2024/mar/27/chelsea-ajax-womens-champions-league-quarter-final-match-report",
    "https://www.theguardian.com/football/2024/mar/27/luis-rubiales-30-month-jail-sentence-hermoso-kiss",
    "https://www.theguardian.com/football/2024/mar/27/lionesses-legend-steph-houghton-to-retire-at-the-end-of-the-season",
    "https://www.theguardian.com/football/2024/mar/27/we-need-support-ukraine-steeled-to-win-battle-for-attention-at-euro-2024",
    "https://www.theguardian.com/football/2024/mar/27/eric-cantona-finally-discusses-seagulls-sardines-comment-manchester-united",
    "https://www.theguardian.com/football/2024/mar/27/football-daily-wales-poland-euro-2024",
    "https://www.theguardian.com/football/2024/mar/27/sexism-is-the-risk-factor-footballs-race-to-learn-more-about-acl-injuries",
    "https://www.theguardian.com/football/2024/mar/27/romeo-lavia-chelsea-season-over-one-appearance-thigh-injury-setback",
    "https://www.theguardian.com/football/2024/mar/27/richarlison-considered-quitting-football-during-post-world-cup-depression-brazil",
    "https://www.theguardian.com/football/2024/mar/27/england-euro-2024-squad-gareth-southgate-on-plane",
    "https://www.theguardian.com/football/2024/mar/27/india-world-cup-dream-fading-after-shock-home-defeat-to-afghanistan-2026-asian-qualifiers",
    "https://www.theguardian.com/football/2024/mar/27/share-your-reaction-to-ukraine-qualifying-for-euro-2024-football",
    "https://www.theguardian.com/football/2024/mar/27/premier-league-social-media-instagram-tiktok",
    "https://www.theguardian.com/football/2024/mar/27/central-defence-premier-league-arsenal-manchester-united",
    "https://www.theguardian.com/football/2024/mar/27/the-knowledge-football-teams-parentheses",
    "https://www.theguardian.com/football/2024/mar/27/wales-euro-2024-playoff-shootout-poland",
    "https://www.theguardian.com/football/2024/mar/27/wales-are-going-places-despite-missing-out-on-euro-2024-insists-rob-page",
    "https://www.theguardian.com/football/2024/mar/26/international-roundup-spain-brazil-germany-netherlands-georgia-greece-france-chile",
    "https://www.theguardian.com/football/live/2024/mar/26/wales-v-poland-euro-2024-playoff-final-live-score-updates",
    "https://www.theguardian.com/football/2024/mar/26/wales-poland-euro-2024-qualifying-playoff-match-report",
    "https://www.theguardian.com/football/live/2024/mar/26/england-v-belgium-international-football-friendly-live",
    "https://www.theguardian.com/football/2024/mar/26/england-belgium-international-friendly-match-report",
    "https://www.theguardian.com/football/2024/mar/26/ukraine-iceland-euro-2024-playoff",
    "https://www.theguardian.com/world/2024/mar/26/vinicius-junior-spain-racism-football",
    "https://www.theguardian.com/football/2024/mar/26/16-year-old-lily-yohannes-called-up-to-uswnt-shebelieves-squad",
    "https://www.theguardian.com/football/audio/2024/mar/26/will-the-wsl-title-race-go-down-to-the-final-day-womens-football-weekly",
    "https://www.theguardian.com/football/2024/mar/26/football-daily-email-luton-coldplay",
    'https://www.theguardian.com/football/2024/mar/26/mls-reach-agreement-with-referees-to-end-month-long-lockout',
    'https://www.theguardian.com/football/2024/mar/26/sarina-wiegman-believes-coach-player-relationships-are-very-inappropriate',
    'https://www.theguardian.com/football/blog/2024/mar/26/debutants-dazzle-for-socceroos-to-add-vigour-to-arnies-hardened-pros',
    'https://www.theguardian.com/football/2024/mar/26/caroline-graham-hansen-interview-barcelona-norway-moving-the-goalposts',
    'https://www.theguardian.com/football/picture/2024/mar/26/david-squires-on-england-flag-furore-kit-collar-woke-things-destroying-football',
    'https://www.theguardian.com/football/live/2024/mar/26/australia-vs-lebanon-socceroos-world-cup-qualifier-live-updates-scores-results-lineups-start-time-gio-stadium-canberra',
    'https://www.theguardian.com/football/2024/mar/26/craig-goodwin-headlines-socceroos-dominant-victory-over-lebanon',
    'https://www.theguardian.com/football/2024/mar/26/sven-goran-eriksson-england-manager',
    'https://www.theguardian.com/football/2024/mar/26/yellow-red-green-and-glorious-remembering-wales-admiral-76-kit',
    'https://www.theguardian.com/world/2024/mar/26/chen-xuyuan-former-head-of-china-football-association-jailed-for-life-bribery',
    'https://www.theguardian.com/football/2024/mar/25/declan-rice-steps-up-as-england-leader-as-he-reaches-half-century',
    'https://www.theguardian.com/football/2024/mar/25/declan-rice-ben-white-back-england-arsenal',
    'https://www.theguardian.com/football/2024/mar/25/he-said-i-didnt-look-happy-mctominay-credits-steve-clarke-for-upturn-in-form',
    'https://www.theguardian.com/football/2024/mar/25/rob-page-confident-wales-can-subdue-fantastic-lewandowski-in-playoff-clash',
    'https://www.theguardian.com/football/2024/mar/25/france-flounder-antoine-griezmann-seven-years-world-record',
    'https://www.theguardian.com/football/2024/mar/25/plastic-football-fans-from-abroad-can-be-just-as-passionate-as-local-lifers',
    'https://www.theguardian.com/football/2024/mar/25/vinicius-junior-racist-abuse-spain-brazil-football-real-madrid',
    'https://www.theguardian.com/world/2024/mar/25/dani-alves-spain-jail-bail',
    'https://www.theguardian.com/football/2024/mar/25/football-daily-england-toney-watkins-euro-2024',
    'https://www.theguardian.com/world/2024/mar/25/dani-alves-spain-jail-bail',
    'https://www.theguardian.com/football/2024/mar/25/newcastle-united-amanda-staveley-victor-restis',
    'https://www.theguardian.com/football/2024/mar/25/law-firm-hits-out-at-uefa-over-liverpool-fans-yet-to-see-paris-final-claims-settled',
    'https://www.theguardian.com/football/2024/mar/25/england-gareth-southgate-euro-2024',
    'https://www.theguardian.com/football/2024/mar/25/usmnt-mexico-concacaf-champions-league-gregg-berhalter-soccer',
    'https://www.theguardian.com/football/2024/mar/25/james-weir-my-manchester-united-debut-was-an-out-of-body-experience-i-wouldnt-change-that-for-anything',
    'https://www.theguardian.com/football/2024/mar/24/usmnt-mexico-concacaf-nations-league-final-tyler-adams-goal-homophobic-chants',
    'https://www.theguardian.com/football/2024/mar/24/lies-told-joao-cancelo-blasts-manchester-city',
    'https://www.theguardian.com/football/2024/mar/24/wsl-womens-super-league-roundup',
    'https://www.theguardian.com/football/live/2024/mar/24/aston-villa-v-arsenal-womens-super-league-live',
    'https://www.theguardian.com/football/2024/mar/24/former-norwich-sporting-director-denounced-for-callous-remarks-about-black-footballers',
    'https://www.theguardian.com/football/live/2024/mar/24/everton-v-liverpool-womens-super-league-live',
    'https://www.theguardian.com/football/2024/mar/24/jay-bothroyd-reveals-he-hid-epilepsy-and-seizures-during-football-career'

]

# Scraping articles
scraped_data = []
for url in guardian_urls:
    result = scrape_guardian_article(url)
    if result:
        scraped_data.append(result)

df = pd.DataFrame(scraped_data)

df.to_pickle("guardian_data.pkl")
print("Data scraped and saved to guardian_data.pkl")

Data scraped and saved to guardian_data.pkl
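
The URL list can be cleaned before any request is sent. The following is a minimal sketch using only the standard library. It drops repeated URLs (the `dani-alves-spain-jail-bail` link appears twice in the list) and skips pages whose path contains a `live`, `audio`, or `picture` segment. Treating those three segments as non-article pages is an assumption based on the URL patterns in the list, not something the scraping code checks. The names `clean_url_list` and `NON_ARTICLE_SEGMENTS` are hypothetical.

```python
from urllib.parse import urlparse

# Path segments assumed (not confirmed by the source) to mark non-article pages.
NON_ARTICLE_SEGMENTS = {"live", "audio", "picture"}


def clean_url_list(urls):
    """Drop duplicate URLs and URLs whose path contains a non-article segment."""
    unique_urls = list(dict.fromkeys(urls))  # removes repeats, keeps first-seen order
    kept = []
    for url in unique_urls:
        segments = urlparse(url).path.strip("/").split("/")
        # Compare whole path segments, so a slug ending in "-live" is not matched.
        if NON_ARTICLE_SEGMENTS.isdisjoint(segments):
            kept.append(url)
    return kept


sample_urls = [
    "https://www.theguardian.com/world/2024/mar/25/dani-alves-spain-jail-bail",
    "https://www.theguardian.com/football/live/2024/mar/24/everton-v-liverpool-womens-super-league-live",
    "https://www.theguardian.com/world/2024/mar/25/dani-alves-spain-jail-bail",
]
print(clean_url_list(sample_urls))
```

Filtering up front avoids requests for pages that would probably be discarded later, but it relies on the URL scheme staying as it is.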


In [2]:
df

| | url | title | text | label |
|---|---|---|---|---|
| 0 | https://www.theguardian.com/football/2024/mar/... | ‘Compete with the big dogs’: new era beckons f... | They came, they saw and they … lost heavily. F... | Human-written |
| 1 | https://www.theguardian.com/football/2024/mar/... | The Guardian Article | This was not a vintage window for Didier Desch... | Human-written |
| 2 | https://www.theguardian.com/football/2024/mar/... | Rashford’s England rivals circle as his United... | Gareth Southgate did say. When the England man... | Human-written |
| 3 | https://www.theguardian.com/football/2024/mar/... | The Guardian Article | Longstanding broadcasters NEP have stepped in ... | Human-written |
| 4 | https://www.theguardian.com/football/2024/mar/... | The Guardian Article | Gareth Southgate has suggested he would be in ... | Human-written |
| ... | ... | ... | ... | ... |
| 60 | https://www.theguardian.com/football/2024/mar/... | The Guardian Article | Arsenalproduced an impressive second-half disp... | Human-written |
| 61 | https://www.theguardian.com/football/live/2024... | The Guardian Article | Content not found | Human-written |
| 62 | https://www.theguardian.com/football/2024/mar/... | The Guardian Article | Stuart Webber, the former Norwich sporting dir... | Human-written |
| 63 | https://www.theguardian.com/football/live/2024... | The Guardian Article | Content not found | Human-written |
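
Many rows above carry the placeholder title `The Guardian Article`, which suggests the assumed `h1` class does not match every page. One possible remedy is to try progressively more general selectors. This is a hedged sketch, not the method used to build the dataset: the helper name `extract_title` is hypothetical, and whether the first `<h1>` or the `<title>` tag holds a usable headline on every Guardian page is an assumption.

```python
from bs4 import BeautifulSoup


def extract_title(soup, title_class="dcr-h5yuj3"):
    """Return the headline text, or None if no candidate element has any text."""
    candidates = (
        soup.find("h1", class_=title_class),  # class assumed in the scraping cell
        soup.find("h1"),                      # fallback: first <h1> on the page
        soup.title,                           # fallback: the page's <title> tag
    )
    for element in candidates:
        if element is None:
            continue
        # Collapse runs of whitespace while keeping spaces between inline fragments.
        text = " ".join(element.get_text().split())
        if text:
            return text
    return None


html = (
    "<html><head><title>Page title</title></head>"
    "<body><h1 class='other'>Real <em>headline</em></h1></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
print(extract_title(soup))
```

Returning `None` instead of a placeholder string lets missing titles be counted with `df['title'].isna().sum()` rather than by string comparison.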


In [3]:
(df['text'] == 'Content not found').sum()

8

In [4]:
df = df[df['text'] != 'Content not found']

In [5]:
(df['text'] == 'Content not found').sum()

0
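
The filtering step can also be wrapped in a small helper that renumbers the remaining rows, so the index has no gaps where rows were removed. This is a sketch on made-up demo data. The names `drop_missing_content` and `PLACEHOLDER` are hypothetical, and the placeholder string is assumed to match the one written by the scraping function.

```python
import pandas as pd

PLACEHOLDER = "Content not found"  # assumed to match the marker used when scraping


def drop_missing_content(frame, column="text"):
    """Return a new DataFrame without placeholder rows, indexed from zero."""
    mask = frame[column] != PLACEHOLDER
    # reset_index returns a new object, so the input frame is left unchanged.
    return frame.loc[mask].reset_index(drop=True)


demo = pd.DataFrame(
    {
        "url": ["a", "b", "c"],
        "text": ["Body one", PLACEHOLDER, "Body three"],
    }
)
cleaned = drop_missing_content(demo)
print(len(demo) - len(cleaned), "row(s) removed")
```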

## Data Storage for Further Analysis

After scraping and cleaning the data, we stored it in a pickle file. This gives us a stable, easily accessible dataset for further analysis without repeating the scraping process. A pickle file serializes Python objects directly, so the DataFrame's structure and content are restored unchanged when the file is loaded. Saving again at this point overwrites the earlier file, which still contained the 'Content not found' rows.


In [6]:
df.to_pickle("guardian_data.pkl")
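
To illustrate the round trip, the sketch below writes a small DataFrame to a pickle file and reads it back with `pd.read_pickle`. It uses made-up demo data and a temporary directory (both assumptions for illustration) rather than `guardian_data.pkl` itself.

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame(
    {
        "title": ["A headline"],
        "text": ["Body text"],
        "label": ["Human-written"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_data.pkl")  # hypothetical file name
    demo.to_pickle(path)
    restored = pd.read_pickle(path)

# equals() compares shape, values, and dtypes.
print(restored.equals(demo))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.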

### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)