## Web Scraping Approach

We scraped news articles from an online platform. The scraping process relied on three Python libraries:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of these pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Our extraction focuses on each article's title and main textual content, drawn from a curated list of URLs. For each page, we locate the HTML elements known to hold the title and body text, extract their contents, and compile the results into a single dataset. This approach kept data collection efficient for our needs and helped keep the prepared data consistent and accurate.


## Data Storage for Further Analysis

After scraping and organizing the data, we stored it in a pickle file named `conversation_data.pkl`. This keeps a stable, easily accessible dataset for further analysis and removes the need to repeat the scraping process. A pickle file suits this purpose because it serializes Python objects directly, so the structure and content of the data are preserved when the file is reloaded.
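The round trip described above can be sketched with the standard-library `pickle` module, which `DataFrame.to_pickle` builds on. This is a minimal illustration, not the notebook's own code: the records, the file name `example_articles.pkl`, and the helpers `save_records` and `load_records` are hypothetical placeholders.

```python
import os
import pickle
import tempfile

# Hypothetical records shaped like the scraper's output (placeholder values).
records = [
    {"url": "https://example.com/a", "title": "First", "text": "Body one", "label": "Human-written"},
    {"url": "https://example.com/b", "title": "Second", "text": "Body two", "label": "Human-written"},
]


def save_records(data, path):
    """Serialize a Python object to the given path with pickle."""
    with open(path, "wb") as fh:
        pickle.dump(data, fh)


def load_records(path):
    """Load a pickled object back. Only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "example_articles.pkl")
    save_records(records, pickle_path)
    restored = load_records(pickle_path)

# The restored object has the same structure and content as the original.
print(restored == records)
```

For a DataFrame saved with `df.to_pickle(...)`, the matching call to reload it is `pd.read_pickle(...)` with the same path.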



In [27]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

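def scrape_article(url):
    """Fetch one article and return its URL, title, body text, and label, or None on a request error."""
    try:
        # A timeout keeps the loop from hanging on an unresponsive server
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an exception for 4xx/5xx status codes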

        soup = BeautifulSoup(response.content, 'html.parser')

        title_element = soup.find('h1', class_='c-article-header__hed')  
        title = title_element.text.strip() if title_element else 'Article Title Not Found'

        content_element = soup.find('div', class_='c-article-body') 
        if content_element:
            paragraphs = content_element.find_all('p')
            text = ' '.join(p.get_text(strip=True) for p in paragraphs)
        else:
            text = 'Content not found'

        return {
            'url': url,
            'title': title,
            'text': text,
            'label': 'Human-written'
        }
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

# URLs to scrape
conversation_urls = [
    "https://theconversation.com/aphantasia-ten-years-since-i-coined-the-term-for-lacking-a-minds-eye-the-journey-so-far-226090",
    "https://theconversation.com/easter-eggs-are-more-expensive-this-year-and-climate-change-may-be-a-culprit-226183",
    "https://theconversation.com/eating-some-chocolate-really-might-be-good-for-you-heres-what-the-research-says-226759",
    "https://theconversation.com/us-election-two-graphs-show-how-young-voters-influence-presidential-results-as-biden-gets-poll-boost-226350",
    "https://theconversation.com/tiktok-health-hacks-promising-to-change-the-taste-and-smell-of-female-genitals-are-more-sour-than-sweet-225970",
    "https://theconversation.com/why-is-jesus-often-depicted-with-a-six-pack-the-muscular-messiah-reflects-christian-values-of-masculinity-224909",
    "https://theconversation.com/argentina-javier-mileis-government-poses-an-urgent-threat-to-human-rights-226534",
    "https://theconversation.com/judas-and-the-economics-of-betrayal-221813",
    "https://theconversation.com/the-anthropocene-already-exists-in-our-heads-even-if-its-now-officially-not-a-geological-epoch-226554",
    "https://theconversation.com/the-total-solar-eclipse-in-north-america-could-help-shed-light-on-a-persistent-puzzle-about-the-sun-226558",
    "https://theconversation.com/gaza-war-is-un-security-council-demand-for-a-ceasefire-legally-binding-heres-what-international-law-says-226662",
    "https://theconversation.com/why-did-modern-humans-replace-the-neanderthals-the-key-might-lie-in-our-social-structures-195056",
    "https://theconversation.com/chinas-uk-election-hack-how-and-why-the-electoral-commission-was-targeted-226668",
    "https://theconversation.com/the-roots-of-the-easter-story-where-did-christian-beliefs-about-jesus-resurrection-come-from-221071",
    "https://theconversation.com/julian-assange-how-british-extradition-law-works-224703",
    "https://theconversation.com/ive-captained-ships-into-tight-ports-like-baltimore-and-this-is-how-captains-like-me-work-with-harbor-pilots-to-avoid-deadly-collisions-226700",
    "https://theconversation.com/britains-forgotten-prison-island-remembering-the-thousands-of-convicts-who-died-working-in-bermudas-dockyards-226044",
    "https://theconversation.com/honey-is-said-to-help-with-hay-fever-symptoms-heres-what-the-research-says-about-this-claim-225728",
    "https://theconversation.com/go-on-an-easter-egg-case-hunt-on-the-beach-to-discover-more-about-sharks-and-rays-224715",
    "https://theconversation.com/in-the-fog-of-the-video-streaming-wars-job-losses-and-business-closures-are-imminent-225829",
    "https://theconversation.com/ten-years-since-its-annexation-crimea-serves-as-a-grim-warning-to-any-ukrainian-lands-that-fall-under-russian-occupation-226270",
    "https://theconversation.com/moscow-attacks-why-the-kremlin-may-have-ignored-any-terrorist-warnings-from-the-cia-226549",
    "https://theconversation.com/how-will-the-un-security-councils-call-for-a-gaza-ceasefire-affect-israeli-politics-and-relations-with-the-us-expert-qanda-226653",
    "https://theconversation.com/vladimir-putin-why-its-time-for-democracies-to-denounce-russias-leader-as-illegitimate-226158",
    "https://theconversation.com/leo-varadkar-the-political-backdrop-to-his-shock-resignation-as-irelands-prime-minister-226370",
    "https://theconversation.com/what-we-learned-from-teaching-a-course-on-the-science-of-happiness-226280",
    "https://theconversation.com/measuring-emotional-emptiness-could-help-manage-this-potentially-life-threatening-experience-223418",
    "https://theconversation.com/how-central-asian-jews-and-muslims-worked-together-in-londons-20th-century-fur-and-carpet-trade-226283",
    "https://theconversation.com/how-henry-viiis-grandmother-used-a-palace-in-northamptonshire-to-build-the-mighty-tudor-dynasty-221275",
    "https://theconversation.com/your-brain-can-reveal-if-youre-rightwing-plus-three-other-things-it-tells-us-about-your-politics-226175",
    "https://theconversation.com/nasal-rinsing-why-flushing-the-nasal-passages-with-tap-water-to-tackle-hayfever-could-be-fatal-225811",
    "https://theconversation.com/what-your-sad-desk-sandwich-says-about-your-working-habits-224326",
    "https://theconversation.com/many-drugs-are-prescribed-for-conditions-they-werent-tested-for-heres-what-you-need-to-know-225715",
    "https://theconversation.com/hot-tubs-are-as-full-of-nasty-germs-as-you-fear-226195",
    "https://theconversation.com/buying-affordable-ethical-chocolate-is-almost-impossible-but-some-firms-are-offering-the-next-best-thing-225239",
    "https://theconversation.com/road-house-explores-what-it-means-to-be-a-hyper-masculine-hardman-in-the-21st-century-226533",
    "https://theconversation.com/immaculate-how-a-nunsploitation-film-tunes-into-womens-anger-over-misogyny-and-oppression-226528",
    "https://theconversation.com/3-body-problem-netflix-adaptation-of-liu-cixins-alien-invasion-trilogy-is-captivating-226049",
    "https://theconversation.com/lear-is-not-okay-meta-play-explores-what-happens-when-teenagers-rewrite-shakespeares-tragedy-225713",
    "https://theconversation.com/the-program-netflix-show-exposes-the-dark-side-of-americas-troubled-teens-schools-225399",
    "https://theconversation.com/climate-quitting-the-people-leaving-their-fossil-fuel-jobs-because-of-climate-change-226246",
    "https://theconversation.com/for-people-with-mental-illness-drugs-and-alcohol-can-be-a-key-survival-strategy-ive-learned-they-shouldnt-have-to-get-clean-to-get-treatment-225827",
    "https://theconversation.com/ive-spent-time-with-refugees-in-french-coastal-camps-and-they-told-me-the-governments-rwanda-plan-is-not-putting-them-off-coming-to-the-uk-221798",
    "https://theconversation.com/i-couldnt-stand-the-pain-the-turkish-holiday-resort-thats-become-an-emergency-dental-centre-for-britons-who-cant-get-treated-at-home-224762",
    "https://theconversation.com/descendants-of-holocaust-survivors-explain-why-they-are-replicating-auschwitz-tattoos-on-their-own-bodies-206821",
    "https://theconversation.com/historys-crisis-detectives-how-were-using-maths-and-data-to-reveal-why-societies-collapse-and-clues-about-the-future-218969",
    "https://theconversation.com/princess-of-wales-and-king-charles-one-in-two-people-develop-cancer-during-their-lives-the-diseases-and-treatments-explained-226456",
    "https://theconversation.com/china-why-the-countrys-economy-has-hit-a-wall-and-what-it-plans-to-do-about-it-225623",
    "https://theconversation.com/the-middle-aged-brain-changes-a-lot-and-its-key-to-understanding-dementia-225412",
    "https://theconversation.com/the-worm-moon-once-marked-the-spring-return-of-earthworms-until-global-warming-kicked-in-226643",
    "https://theconversation.com/how-long-before-quantum-computers-can-benefit-society-thats-googles-us-5-million-question-226257",
    "https://theconversation.com/thousands-of-irish-viewers-boycott-licence-fee-after-presenter-salary-scandal-what-this-says-about-the-future-of-public-broadcasting-225394",
    "https://theconversation.com/ben-and-jerrys-and-why-its-hard-for-activist-brands-to-stay-true-to-themselves-after-corporate-buyouts-226524",
    "https://theconversation.com/ukraine-war-russias-baltic-neighbours-to-create-massive-border-defences-as-trump-continues-undermining-nato-225944",
    "https://theconversation.com/announcing-kate-middletons-cancer-diagnosis-should-have-been-simple-but-the-palace-let-it-get-out-of-hand-226490",
    "https://theconversation.com/why-would-islamic-state-attack-russia-and-what-does-this-mean-for-the-terrorism-threat-globally-226464",
    "https://theconversation.com/announcing-kate-middletons-cancer-diagnosis-should-have-been-simple-but-the-palace-let-it-get-out-of-hand-226490",
    "https://theconversation.com/uk/events/the-69th-annual-conference-of-the-association-of-hispanists-of-great-britain-and-ireland-13061",
]

# Scraping articles
scraped_data = []
for url in conversation_urls:
    result = scrape_article(url)
    if result:
        scraped_data.append(result)

# Creating a DataFrame with the scraped data
df = pd.DataFrame(scraped_data)

df.to_pickle("conversation_data.pkl")
print("Data scraped and saved to conversation_data.pkl")


Error fetching URL: 403 Client Error: Forbidden for url: https://theconversation.com/the-total-solar-eclipse-in-north-america-could-help-shed-light-on-a-persistent-puzzle-about-the-sun-226558
Error fetching URL: 403 Client Error: Forbidden for url: https://theconversation.com/how-will-the-un-security-councils-call-for-a-gaza-ceasefire-affect-israeli-politics-and-relations-with-the-us-expert-qanda-226653
Error fetching URL: 403 Client Error: Forbidden for url: https://theconversation.com/vladimir-putin-why-its-time-for-democracies-to-denounce-russias-leader-as-illegitimate-226158
Error fetching URL: 403 Client Error: Forbidden for url: https://theconversation.com/leo-varadkar-the-political-backdrop-to-his-shock-resignation-as-irelands-prime-minister-226370
Error fetching URL: 403 Client Error: Forbidden for url: https://theconversation.com/measuring-emotional-emptiness-could-help-manage-this-potentially-life-threatening-experience-223418
Error fetching URL: 403 Client Error: Forbidden f
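
Several requests in the output above were rejected with a 403 status. One common way to handle transient failures is to retry with an increasing delay between attempts. The sketch below is an illustration under assumptions, not part of the original notebook: `fetch_with_retries` and `flaky_fetch` are hypothetical names, and whether retrying or slower pacing would resolve these particular 403 responses is not established by the output.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff when it raises.

    fetch is any callable that returns the page content or raises on failure.
    sleep is injectable so the waiting can be skipped when testing.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller see the last error
            wait = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed for {url}: {exc}; retrying in {wait:.1f}s")
            sleep(wait)


# Demonstration with a stand-in fetcher that fails twice and then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("403 Client Error: Forbidden")
    return "<html>ok</html>"


waits = []  # Record the requested delays instead of actually sleeping
page = fetch_with_retries(flaky_fetch, "https://example.com/article", sleep=waits.append)
print(page, waits)
```

In the notebook, `fetch` could be a small function that calls `requests.get(url, timeout=10)` and then `raise_for_status()`, so that HTTP errors trigger a retry. Any retry policy should respect the site's terms of use and rate limits.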

In [28]:
df

| | url | title | text | label |
|---|---|---|---|---|
| 0 | https://theconversation.com/aphantasia-ten-yea... | Aphantasia: ten years since I coined the term ... | Words are powerful things. In 2015, with the h... | Human-written |
| 1 | https://theconversation.com/easter-eggs-are-mo... | Easter eggs are more expensive this year and c... | Chocolate eggs and bunnies cost more than ever... | Human-written |
| 2 | https://theconversation.com/eating-some-chocol... | Eating some chocolate really might be good for... | Although it always makes me scoff slightly to ... | Human-written |
| 3 | https://theconversation.com/us-election-two-gr... | US election: two graphs show how young voters ... | American politics is very polarised at the mom... | Human-written |
| 4 | https://theconversation.com/tiktok-health-hack... | TikTok health hacks promising to change the ta... | Wake up. Brush teeth. Exfoliate. Drink a glass... | Human-written |
| 5 | https://theconversation.com/why-is-jesus-often... | Why is Jesus often depicted with a six-pack? T... | Have you ever wondered why so many images depi... | Human-written |
| 6 | https://theconversation.com/argentina-javier-m... | Argentina: Javier Milei’s government poses an ... | “Milei, you scumbag, you are the dictatorship.... | Human-written |
| 7 | https://theconversation.com/judas-and-the-econ... | Judas and the economics of betrayal | No one remembers the names of the soldiers who... | Human-written |
| 8 | https://theconversation.com/the-anthropocene-a... | The Anthropocene already exists in our heads, ... | An international subcommittee of geologists re... | Human-written |
| 9 | https://theconversation.com/gaza-war-is-un-sec... | Gaza war: is UN security council ‘demand’ for ... | Despite the groundbreaking adoption of aUN sec... | Human-written |
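
Before analysis, it may be worth checking the scraped rows for repeated URLs (the URL list above contains one repeated entry) and for the placeholder strings that `scrape_article` returns when an element is missing. The following is a minimal sketch on plain dictionaries; `clean_records` and the sample rows are hypothetical and not part of the original notebook.

```python
def clean_records(records):
    """Drop repeated URLs and rows whose title or text is missing or a placeholder."""
    # Placeholder strings produced by scrape_article when an element is not found
    placeholders = {"Article Title Not Found", "Content not found"}
    seen_urls = set()
    cleaned = []
    for record in records:
        if record["url"] in seen_urls:
            continue  # Keep only the first occurrence of each URL
        seen_urls.add(record["url"])
        if record["title"] in placeholders or record["text"] in placeholders:
            continue  # Extraction failed for this page
        if not record["text"]:
            continue  # Body element was found but contained no paragraphs
        cleaned.append(record)
    return cleaned


# Hypothetical sample rows (placeholder values, not real scraped data).
sample = [
    {"url": "https://example.com/a", "title": "A", "text": "Body A", "label": "Human-written"},
    {"url": "https://example.com/a", "title": "A", "text": "Body A", "label": "Human-written"},
    {"url": "https://example.com/b", "title": "B", "text": "Content not found", "label": "Human-written"},
    {"url": "https://example.com/c", "title": "C", "text": "Body C", "label": "Human-written"},
]

cleaned = clean_records(sample)
print([row["url"] for row in cleaned])
```

On a DataFrame, the duplicate check alone can be done with `df.drop_duplicates(subset="url")`.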


### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)