## Web Scraping Approach

A web scraping process aimed at collecting news articles from Al Jazeera's online platform is performed. 3 libraries were used to carry out our web scraping process.

- `requests`: Retrieving web page content.(Python Software Foundation, n.d.)
- `BeautifulSoup`: Parsing and navigating the HTML structure of these pages for data extraction.(Mitchell & Richardson, n.d.)
- `pandas`: To structure the scraped information into a format ready for further analysis. (McKinney, n.d.)

Our extraction focuses on the articles' titles and their main textual content, from a carefully chosen list of URLs. This required a process of identifying and extracting HTML elements known to house the title and body text, which were then compiled into a coherent dataset. This approach not only made the data collection process efficient for our needs but also helped the consistency and accuracy of the data prepared for analysis.


In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Al Jazeera news scraping function
def scrape_aljazeera_article(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Assuming the article title is correctly found within an <h1> tag
    title = soup.find('h1').get_text(strip=True) if soup.find('h1') else 'No Title Found'
    
    # Finding the main content container
    content_div = soup.find('div', class_='wysiwyg wysiwyg--all-content css-ibbk12')
    
    # list to hold all text elements
    text_elements = []
    
    # Joining the list of text elements into a single string, separating each element with a space or newline
    full_text = ' '.join(text_elements)
    
    return {
        'url': url,
        'title': title,
        'text': full_text,
        'label': 'Human-written'
    }

# List of Al Jazeera URLs to scrape
urls = [
    'https://www.aljazeera.com/opinions/2024/3/27/beit-daras-and-gaza-an-intergenerational-tale-of-struggle-against-erasure',
    'https://www.aljazeera.com/news/2024/3/27/israels-war-on-gaza-list-of-key-events-day-173',
    'https://www.aljazeera.com/news/2024/3/27/hezbollah-launches-rocket-barrage-after-israeli-strikes-on-lebanon-kill-7',
    'https://www.aljazeera.com/features/2024/3/26/south-sudan-on-the-brink-after-oil-exports-derailed-by-sudans-civil-war',
    'https://www.aljazeera.com/news/2024/3/25/binance-executive-detained-in-nigeria-in-crypto-case-escapes-custody',
    'https://www.aljazeera.com/news/2024/3/25/un-humanitarian-chief-martin-griffiths-to-step-down-due-to-health-reasons',
    'https://www.aljazeera.com/news/2024/3/25/senegal-election-results-who-is-diomaye-faye-tipped-to-be-next-president',
    'https://www.aljazeera.com/news/2024/3/25/senegals-former-pm-ba-concedes-defeat-in-presidential-election',
    'https://www.aljazeera.com/program/newsfeed/2024/3/25/opposition-celebrations-in-senegal-as-faye-takes-early-election-lead',
    'https://www.aljazeera.com/news/2024/3/25/senegals-bassirou-diomaye-faye-takes-early-lead-in-presidential-election',
    'https://www.aljazeera.com/news/2024/3/24/chad-main-opposition-figures-barred-as-leaders-cleared-for-election',
    'https://www.aljazeera.com/news/2024/3/24/senegal-votes-in-delayed-presidential-election',
    'https://www.aljazeera.com/news/2024/3/24/hundreds-of-kidnapped-nigerian-students-released-after-more-than-two-weeks',
    'https://www.aljazeera.com/features/2024/3/24/this-superfood-coffee-is-made-from-thorny-alien-trees',
    'https://www.aljazeera.com/news/2024/3/24/senegals-women-voters-could-make-a-miracle-happen-in-presidential-election',
    'https://www.aljazeera.com/news/2024/3/23/the-tax-inspectors-competing-to-be-senegals-new-president',
    'https://www.aljazeera.com/news/2024/3/23/captured-somali-pirates-arrive-in-india-to-face-trial-over-ship-hijacking',
    'https://www.aljazeera.com/podcasts/2024/3/22/the-take-senegal-won-back-its-election-but-who-will-win-the-vote',
    'https://www.aljazeera.com/news/2024/3/22/increasing-water-scarcity-fuelling-more-global-conflicts-un-report-warns',
    'https://www.aljazeera.com/news/2024/3/22/ugandas-president-museveni-promotes-son-to-army-chief',
    'https://www.aljazeera.com/news/2024/3/22/senegals-2024-election-why-does-it-matter',
    'https://www.aljazeera.com/news/2024/3/20/sudan-is-one-of-the-worst-humanitarian-disasters-in-recent-memory-un',
    'https://www.aljazeera.com/news/2024/3/20/foreign-investors-on-alert-as-senegal-nears-election-marred-by-uncertainty',
    'https://www.aljazeera.com/news/2024/3/19/did-russia-iran-provoke-niger-walkout-from-us-military-pact',
    'https://www.aljazeera.com/news/2024/3/19/un-reports-35-percent-increase-in-people-affected-by-south-sudan-violence',
    'https://www.aljazeera.com/features/2024/3/19/migrant-workers-exploited-abused-in-italys-prized-fine-wine-vineyards',
    'https://www.aljazeera.com/features/2024/3/19/migrant-workers-exploited-abused-in-italys-prized-fine-wine-vineyards',
    'https://www.aljazeera.com/features/2024/3/19/after-12-years-in-power-senegals-macky-sall-leaves-a-fragile-democracy',
    'https://www.aljazeera.com/news/2024/3/19/the-gambia-votes-to-reverse-landmark-ban-on-female-genital-mutilation',
    'https://www.aljazeera.com/program/inside-story/2024/3/18/whats-the-impact-of-niger-cutting-military-ties-with-us',
    'https://www.aljazeera.com/news/2024/3/18/israel-asks-icj-not-to-order-new-measures-over-looming-famine-in-gaza',
    'https://www.aljazeera.com/news/2024/3/18/gunmen-in-nigeria-kidnap-at-least-87-people-in-new-attack',
    'https://www.aljazeera.com/news/2024/3/17/niger-suspends-military-cooperation-with-us',
    'https://www.aljazeera.com/news/2024/3/16/indian-navy-captures-ship-from-somali-pirates-rescues-crew-members',
    'https://www.aljazeera.com/news/2024/3/16/indian-navy-captures-ship-from-somali-pirates-rescues-crew-members',
    'https://www.aljazeera.com/news/2024/3/16/nigerian-soldiers-killed-in-attack-in-delta-state',
    'https://www.aljazeera.com/news/2024/3/15/we-want-sonko-senegal-opposition-boosted-after-leaders-freed-before-vote',
    'https://www.aljazeera.com/news/2024/3/15/al-shabab-fighters-killed-as-overnight-siege-of-mogadishu-hotel-ends',
    'https://www.aljazeera.com/features/2024/3/15/russian-time-how-burkina-faso-fell-for-the-charms-of-moscow',
    'https://www.aljazeera.com/news/2024/3/15/senegals-top-opposition-leaders-released-from-prison-as-elections-loom',
    'https://www.aljazeera.com/program/people-power/2024/3/14/the-sacrifice-zone',
    'https://www.aljazeera.com/opinions/2024/3/14/the-never-ending-war-on-truth',
    'https://www.aljazeera.com/program/newsfeed/2024/3/14/south-african-mother-charged-with-trafficking-her-missing-6-year-old',
    'https://www.aljazeera.com/news/2024/3/14/three-egyptian-coptic-orthodox-monks-killed-in-south-africa-monastery',
    'https://www.aljazeera.com/news/2024/3/13/tax-inspectors-to-poultry-boss-senegals-presidential-candidates',
    'https://www.aljazeera.com/news/2024/3/13/uk-plans-to-pay-asylum-seekers-to-move-to-rwanda',
    'https://www.aljazeera.com/news/2024/3/12/twenty-armed-men-take-control-of-cargo-ship-off-somalia-say-watchdogs',
    'https://www.aljazeera.com/news/2024/3/12/twenty-armed-men-take-control-of-cargo-ship-off-somalia-say-watchdogs',
    'https://www.aljazeera.com/news/2024/3/10/ramadan-mubarak-hear-greetings-in-different-languages',
    'https://www.aljazeera.com/news/2024/3/10/nigeria-school-abductions-more-pupils-snatched-as-army-hunts-for-missing',
    'https://www.aljazeera.com/program/inside-story/2024/3/9/why-cant-nigeria-stop-the-kidnapping-of-schoolchildren',
    'https://www.aljazeera.com/sports/liveblog/2024/3/8/live-anthony-joshua-vs-francis-ngannou-heavyweight-boxing',
    'https://www.aljazeera.com/program/the-stream/2024/3/8/what-impact-do-solidarity-campaigns-have-for-the-people-of-drc',
    'https://www.aljazeera.com/news/2024/3/8/nigeria-abduction-at-least-275-pupils-missing-after-gunmen-storm-school',
    'https://www.aljazeera.com/program/newsfeed/2024/3/8/parents-protest-after-280-children-are-abducted-from-nigeria-school',
    'https://www.aljazeera.com/news/2024/3/8/why-are-refugee-deaths-at-an-all-time-high',
    'https://www.aljazeera.com/news/2024/3/7/dozens-of-pupils-abducted-by-gunmen-in-nigerias-northwest',
    'https://www.aljazeera.com/news/2024/3/7/ramadan-2024-fasting-hours-and-iftar-times-around-the-world',
    'https://www.aljazeera.com/news/2024/3/7/oceans-break-high-temperature-record-on-warmest-february-marked-globally',
    'https://www.aljazeera.com/sports/2024/3/7/preview-anthony-joshua-vs-francis-ngannou-heavyweight-boxing-fight',
    'https://www.aljazeera.com/news/2024/3/7/senegal-sets-delayed-presidential-elections-for-march-24',
    'https://www.aljazeera.com/program/newsfeed/2024/3/6/haiti-gang-leader-warns-of-genocide-if-pm-returns',
    'https://www.aljazeera.com/news/2024/3/6/why-are-people-blaming-the-houthis-for-cutting-the-red-sea-cables',
    'https://www.aljazeera.com/news/2024/3/6/when-is-ramadan-2024-and-how-is-the-moon-sighted',
    'https://www.aljazeera.com/features/2024/3/5/no-justification-for-gaza-carnage-nigeria-foreign-minister-yusuf-tuggar',
    'https://www.aljazeera.com/opinions/2024/3/4/sudanese-democracy-should-not-be-us-made',
    'https://www.aljazeera.com/program/newsfeed/2024/3/4/thousands-of-inmates-escape-after-gang-led-jailbreak-attack-in-haiti',
    'https://www.aljazeera.com/news/2024/3/3/around-170-people-executed-in-burkina-faso-attacks-regional-official-says',
    'https://www.aljazeera.com/features/2024/3/3/the-two-faced-cookie-that-called-out-a-lying-politician-in-south-africa',
    'https://www.aljazeera.com/program/talk-to-al-jazeera/2024/3/3/david-miliband-on-global-crises-gaza-drc-sudan-ukraine',
    'https://www.aljazeera.com/opinions/2024/3/2/protecting-climate-refugees-requires-a-legal-definition',
    'https://www.aljazeera.com/news/2024/3/2/chad-interim-leader-deby-confirms-plan-to-run-for-president-in-may',
    'https://www.aljazeera.com/features/2024/3/2/no-identity-why-is-kenyan-music-failing-to-break-through-globally',
    'https://www.aljazeera.com/news/2024/3/1/un-official-warns-of-possible-war-crimes-rape-as-a-weapon-in-sudan',
    'https://www.aljazeera.com/program/upfront/2024/3/1/can-somalia-win-its-war-against-al-shabab',
    'https://www.aljazeera.com/news/2024/3/1/kenya-haiti-sign-reciprocal-agreement-on-police-deployment-ruto',
    'https://www.aljazeera.com/gallery/2024/3/1/un-peacekeepers-begin-pullout-from-dr-congos-restive-east',
    'https://www.aljazeera.com/news/2024/2/29/zambia-declares-national-disaster-after-drought-devastates-agriculture',
    'https://www.aljazeera.com/news/2024/2/29/why-is-chad-boiling-over-ahead-of-long-awaited-elections-and-whats',
    'https://www.aljazeera.com/news/2024/2/29/chadian-opposition-leader-dies-in-gun-exchange-state-prosecutor-says',
    'https://www.aljazeera.com/features/2024/2/29/how-an-eu-funded-security-force-helped-senegal-crush-democracy-protests',
    'https://www.aljazeera.com/news/2024/2/28/ghanas-parliament-passes-anti-lgbtq-bill',
    'https://www.aljazeera.com/features/2024/2/28/jolt-to-reality-gaza-war-forces-voter-rethink-ahead-of-south-africa-poll',
    'https://www.aljazeera.com/news/2024/2/28/june-elections-proposed-during-senegal-dialogue-to-end-political-crisis',
    'https://www.aljazeera.com/opinions/2024/2/28/real-solutions-to-climate-change-in-africa-are-about-people-not-profit',
    'https://www.aljazeera.com/news/2024/2/28/chad-announces-several-deaths-after-foiled-intelligence-office-attack',
    'https://www.aljazeera.com/features/2024/2/27/malawi-banana-wine-entrepreneurs-press-on-despite-climate-change',
    'https://www.aljazeera.com/podcasts/2024/2/27/the-take-whats-behind-the-armed-conflict-in-eastern-dr-congo',
    'https://www.aljazeera.com/news/2024/2/27/two-people-shot-dead-as-guinea-protest-turns-bloody',
    
]

# Scraping the articles and collect the data
articles = []
for url in urls:
    result = scrape_aljazeera_article(url)
    if result:
        articles.append(result)

aljazeera_df = pd.DataFrame(articles)

display(aljazeera_df)


Unnamed: 0,url,title,text,label
0,https://www.aljazeera.com/opinions/2024/3/27/b...,Beit Daras and Gaza: An intergenerational tale...,,Human-written
1,https://www.aljazeera.com/news/2024/3/27/israe...,"Israel’s war on Gaza: List of key events, day 173",,Human-written
2,https://www.aljazeera.com/news/2024/3/27/hezbo...,Hezbollah launches rocket barrage after Israel...,,Human-written
3,https://www.aljazeera.com/features/2024/3/26/s...,South Sudan on the brink after oil exports der...,,Human-written
4,https://www.aljazeera.com/news/2024/3/25/binan...,Binance executive detained in Nigeria in crypt...,,Human-written
...,...,...,...,...
84,https://www.aljazeera.com/opinions/2024/2/28/r...,Real solutions to climate change in Africa are...,,Human-written
85,https://www.aljazeera.com/news/2024/2/28/chad-...,"Troops deployed, internet shut down in Chad’s ...",,Human-written
86,https://www.aljazeera.com/features/2024/2/27/m...,The Malawians braving climate shocks and red t...,,Human-written
87,https://www.aljazeera.com/podcasts/2024/2/27/t...,The Take: What’s behind the armed conflict in ...,,Human-written



## Data storage for further analysis

After successfully scraping and organizing the data, it is stored in a pickle file named `aljazeera_articles.pkl`. This step allowed us to keep a stable and easily accessible dataset for further analysis, obviating the need to redo the scraping process. Opting for a pickle file as the storage medium was particularly advantageous due to its capacity to store Python objects, thereby maintaining the integrity of the data's structure and content. 


In [None]:
aljazeera_df.to_pickle("aljazeera_articles.pkl")

### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Richard Mitchell, Leonard Richardson. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- Wes McKinney. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)