## Web Scraping Approach

This section describes how we collected news articles from CNN's international edition (`edition.cnn.com`) by web scraping. Three libraries were used:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of the pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Our extraction focuses on each article's title and main body text, taken from a hand-picked list of URLs. For each page, we locate the HTML elements known to hold the title (an `h1` with class `headline__text`) and the body (the paragraphs inside the `div` with class `article__content`), extract their text, and compile the results into a single dataset. This keeps data collection efficient for our needs and gives every record the same structure for analysis.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# CNN news scraping function
def scrape_cnn_article(url):
    response = requests.get(url, timeout=30)  # Time out rather than hang on an unresponsive server
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Finding the title of the article
    title = soup.find('h1', class_='headline__text')  # Assuming CNN uses this class for article titles
    article_title = title.get_text(strip=True) if title else 'CNN Article'  # Fallback title if not found
    
    # Find the main content of the article
    content_div = soup.find('div', class_='article__content')
    
    # Extracting text from all paragraphs within the content div
    paragraphs = content_div.find_all('p') if content_div else []
    article_text = ' '.join(p.get_text(strip=True) for p in paragraphs)
    
    return {
        'title': article_title,
        'text': article_text.strip()
    }

scraping_functions = {
    'edition.cnn.com': scrape_cnn_article,  
}

def scrape_article(url):
    domain = url.split('//')[1].split('/')[0]
    if domain in scraping_functions:
        func = scraping_functions[domain]
        article_data = func(url)
        return article_data
    print(f"No specific scraping function for URL: {url}")
    return None

# List of CNN URLs to scrape
urls = [
    'https://edition.cnn.com/2024/03/01/asia/japan-demographic-crisis-population-intl-hnk-dst/index.html',
    'https://edition.cnn.com/2024/03/01/world/bangladesh-dhaka-fire-deaths-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/25/europe/isis-k-explainer-russia-moscow-attack-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/21/europe/kyiv-missile-attack-ukraine-russia-intl/index.html',
    'https://edition.cnn.com/2024/03/23/europe/us-had-warned-russia-isis-was-determined-to-attack-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/24/europe/ireland-taoiseach-varadkar-resignation-reason-intl/index.html',
    'https://edition.cnn.com/2024/03/20/europe/sullivan-kyiv-visit-ukraine-aid-intl/index.html',
    'https://edition.cnn.com/2024/03/20/europe/leo-varadkar-ireland-prime-minister-resignation-intl/index.html',
    'https://edition.cnn.com/2024/03/21/europe/kyiv-missile-attack-ukraine-russia-intl/index.html',
    'https://edition.cnn.com/2024/03/16/europe/ukraine-appeal-ignore-putin-pseudo-elections-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/19/europe/aleksandr-moiseev-russian-navy-intl/index.html',
    'https://edition.cnn.com/2024/03/25/americas/ecuador-brigitte-garcia-dead-intl-latam/index.html',
    'https://edition.cnn.com/2024/03/11/americas/haiti-pm-ariel-henry-resigns-gang-violence-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/12/americas/haiti-gangs-prime-minister-analysis-intl/index.html',
    'https://edition.cnn.com/2024/03/07/americas/haiti-port-terminal-violence-intl/index.html',
    'https://edition.cnn.com/2024/03/07/americas/grenada-american-couple-missing-arrests-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/06/americas/haiti-henry-us-elections-intl/index.html',
    'https://edition.cnn.com/2024/03/04/americas/haiti-ariel-henry-gangs-protests-bsap-intl-latam/index.html',
    'https://edition.cnn.com/2024/03/03/americas/april-total-solar-eclipse-scn/index.html',
    'https://edition.cnn.com/2024/02/29/americas/brian-mulroney-obit/index.html',
    'https://edition.cnn.com/2024/02/27/americas/mexico-mayor-candidates-killed-intl-latam/index.html',
    'https://edition.cnn.com/2024/02/18/americas/brazil-lula-israel-holocaust-summon-ambassador/index.html',
    'https://edition.cnn.com/2024/03/26/china/china-cyber-hacking-accusations-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/18/china/china-teacher-li-followers-police-questioning-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/14/india/india-bangalore-water-crisis-impact-intl-hnk-dst/index.html',
    'https://edition.cnn.com/2024/03/17/asia/russian-election-putin-china-north-korea-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/14/asia/asian-elephant-burials-scli-intl-scn/index.html',
    'https://edition.cnn.com/2024/03/18/india/india-gujarat-university-attack-muslim-prayers-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/24/india/india-madrasa-court-ruling-uttar-pradesh-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/26/china/china-soccer-officials-jailed-corruption-intl-hnk-spt/index.html',
    'https://edition.cnn.com/2024/03/22/asia/brown-giant-panda-genetic-trait-scn/index.html',
    'https://edition.cnn.com/2024/03/23/asia/china-coast-guard-water-cannon-philippine-south-china-sea-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/21/india/india-arvind-kejriwal-arrest-corruption-allegations-intl/index.html',
    'https://edition.cnn.com/2024/03/21/india/china-india-sela-tunnel-border-dispute-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/21/asia/nearly-5-million-animals-dead-in-mongolias-harshest-winter-in-half-a-century-climate-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/12/india/india-mirv-icbm-intl-hnk-ml/index.html',
    'https://edition.cnn.com/2024/03/11/china/china-two-sessions-takeaway-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/11/china/china-nationalist-attack-nobel-laureate-mo-yan-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/04/asia/malaysia-renew-search-missing-plane-mh370-intl/index.html',
    'https://edition.cnn.com/2024/02/09/china/china-russia-xi-putin-call-ukraine-war-intl-hnk/index.html',
    'https://edition.cnn.com/2023/12/22/india/india-tamil-nadu-flood-rain-weather-intl-hnk/index.html',
    'https://edition.cnn.com/2023/12/19/india/india-opposition-lawmakers-suspended-record-number-intl-hnk/index.html',
    'https://edition.cnn.com/2023/12/13/india/india-delhi-zafar-mahal-mughal-vandalism-intl-hnk/index.html',
    'https://edition.cnn.com/2023/12/12/india/india-united-states-gurpatwant-singh-pannun-indictment-analysis-intl-hnk/index.html',
    'https://edition.cnn.com/2024/02/01/china/china-execution-couple-murder-children-intl-hnk/index.html',
    'https://edition.cnn.com/2024/02/05/china/china-sentences-australian-yang-hengjun-suspended-death-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/15/australia/worlds-heaviest-blueberry-scli-intl/index.html',
    'https://edition.cnn.com/2024/03/14/australia/josh-cavallo-australia-gay-soccer-star-proposal-intl-hnk-spt/index.html',
    'https://edition.cnn.com/2024/03/12/australia/australia-united-airlines-sydney-san-francisco-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/07/australia/mass-coral-bleaching-event-great-barrier-reef-intl-hnk-scn/index.html',
    'https://edition.cnn.com/2024/03/03/australia/marion-barter-florabella-remakel-ric-blum-intl-hnk-dst/index.html',
    'https://edition.cnn.com/2024/02/23/australia/australia-police-murder-charge-missing-gay-couple-nsw-intl-hnk/index.html',
    'https://edition.cnn.com/2024/02/22/australia/australia-bishop-saunders-charged-historical-sex-offenses-intl-hnk/index.html',
    'https://edition.cnn.com/2024/01/25/australia/male-marsupials-sleep-sex-intl-scli-scn/index.html',
    'https://edition.cnn.com/2024/01/17/australia/fire-ants-forming-rafts-to-cross-australia-flood-waters-intl-scli-scn-climate/index.html',
    'https://edition.cnn.com/2024/01/10/australia/australian-fugitive-murder-suspect-greece-intl-hnk/index.html',
    'https://edition.cnn.com/2023/12/20/australia/australia-businessman-convicted-foreign-influence-china-intl-hnk/index.html',
    'https://edition.cnn.com/2023/12/14/australia/australia-folbigg-murder-convictions-quashed-intl-hnk/index.html',
    'https://edition.cnn.com/2023/12/08/australia/tropical-storm-jasper-australia-heatwave-climate-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/26/middleeast/palestinians-drown-gaza-aid-drop-intl/index.html',
    'https://edition.cnn.com/2024/03/26/middleeast/israel-gaza-ceasefire-un-resolution-war-impact-intl/index.html',
    'https://edition.cnn.com/2024/03/25/middleeast/un-security-council-gaza-israel-ceasefire-intl/index.html',
    'https://edition.cnn.com/2024/03/24/middleeast/idf-gaza-hospitals-red-crescent-intl/index.html',
    'https://edition.cnn.com/2024/03/19/middleeast/famine-northern-gaza-starvation-ipc-report-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/15/middleeast/hamas-interview-diamond-hostages-ceasefire-intl/index.html',
    'https://edition.cnn.com/world/middleeast/krista-kim-heart-space-art-dubai-spc/index.html',
    'https://edition.cnn.com/2024/03/15/middleeast/gaza-aid-ship-explainer-looming-famine-mime-intl/index.html',
    'https://edition.cnn.com/2024/03/18/middleeast/palestinian-relocation-gaza-east-jerusalem-intl/index.html',
    'https://edition.cnn.com/2024/03/13/middleeast/china-russia-iran-navy-drills-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/12/middleeast/us-religious-freedom-saudi-arabia-rabbi-kippah-intl/index.html',
    'https://edition.cnn.com/2024/03/12/middleeast/israel-police-shoot-palestinian-boy-dead-intl/index.html',
    'https://edition.cnn.com/2024/03/11/middleeast/netanyahu-biden-rift-widens-rafah-gaza-mime-intl/index.html',
    'https://edition.cnn.com/2024/03/08/middleeast/iran-women-crimes-against-humanity-un-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/07/middleeast/israel-gaza-hostage-cnn-interview-intl/index.html',
    'https://edition.cnn.com/2024/03/06/middleeast/israel-gaza-starvation-siege-mothers-babies-intl/index.html',
    'https://edition.cnn.com/2024/03/05/middleeast/gaza-hamas-ceasefire-israel-intl/index.html',
    'https://edition.cnn.com/2024/03/05/middleeast/israel-fire-palestinians-aid-northern-gaza-intl/index.html',
    'https://edition.cnn.com/2024/03/08/middleeast/worlds-oldest-bread-discovered-turkey-intl-scli-scn/index.html',
    'https://edition.cnn.com/2024/03/08/middleeast/israel-building-road-splitting-gaza-cmd-intl/index.html',
    'https://edition.cnn.com/2024/03/08/middleeast/gaza-israelis-aid-trucks-protests/index.html',
    'https://edition.cnn.com/2024/03/06/uk/catherine-princess-of-wales-trooping-colour-claim-confusion-intl-scli-gbr/index.html',
    'https://edition.cnn.com/2024/03/01/uk/kate-princess-of-wales-prince-william-royal-newsletter-030124-intl-scli/index.html',
    'https://edition.cnn.com/2024/03/11/uk/kate-royal-photograph-edited-intl-gbr/index.html',
    'https://edition.cnn.com/2024/03/11/uk/kate-royal-photo-apology-gbr-intl/index.html',
    'https://edition.cnn.com/2024/03/14/uk/uk-government-extremism-definition-gbr-intl/index.html',
    'https://edition.cnn.com/2024/03/13/uk/england-nhs-puberty-blockers-trans-children-intl-gbr/index.html',
    'https://edition.cnn.com/2024/03/10/uk/news-agencies-recall-image-of-catherine-princess-of-wales/index.html',
    'https://edition.cnn.com/2024/03/22/uk/kate-hysteria-royal-newsletter-032224-intl-scli-gbr-cmd/index.html',
    'https://edition.cnn.com/2024/03/21/uk/king-charles-medical-records-london-hospital-intl-gbr/index.html',
    'https://edition.cnn.com/2024/03/20/uk/kate-princess-of-wales-records-hospital-staff-intl/index.html',
    'https://edition.cnn.com/2024/03/10/uk/kate-princess-wales-photo-released-intl/index.html',
    'https://edition.cnn.com/2024/03/26/middleeast/israel-gaza-ceasefire-un-resolution-war-impact-intl/index.html',
    'https://edition.cnn.com/2024/03/26/uk/julian-assange-us-extradition-appeal-intl-gbr/index.html',
    'https://edition.cnn.com/2024/03/25/europe/isis-k-explainer-russia-moscow-attack-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/25/africa/former-senegalese-pm-concedes-defeat-to-opposition-candidate-intl/index.html',
    'https://edition.cnn.com/2024/03/25/uk/microplastics-archeological-remains-study-scli-intl-scn-gbr/index.html',
    'https://edition.cnn.com/2024/03/25/climate/brazil-flooding-landslide-climate-intl/index.html',
    'https://edition.cnn.com/2024/03/24/europe/kate-princess-of-wales-cancer-public-support-intl/index.html',
    'https://edition.cnn.com/2024/03/24/africa/nigeria-gunmen-children-released-intl-hnk/index.html',
    'https://edition.cnn.com/2024/03/26/style/nora-turato-spruth-magers-la/index.html'

]

# Scrape the articles and collect the data
articles = []
for url in urls:
    result = scrape_article(url)
    if result:
        articles.append({'url': url, 'title': result['title'], 'text': result['text'], 'label': 'Human-written'})

cnn_df = pd.DataFrame(articles)
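
The dispatch in `scrape_article` extracts the domain by splitting the URL string, which works for the URLs listed here but raises an `IndexError` on a string without `//`. As a minimal sketch of a more tolerant lookup, the standard library's `urllib.parse.urlparse` can do the same job. The helper name `get_domain` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlparse


def get_domain(url):
    """Return the lower-cased host name of a URL, or None if it has none.

    Hypothetical helper for illustration; not in the original notebook.
    """
    return urlparse(url).hostname


# A well-formed article URL yields the key used in scraping_functions
print(get_domain('https://edition.cnn.com/2024/03/01/world/bangladesh-dhaka-fire-deaths-intl-hnk/index.html'))

# A malformed string yields None instead of raising an exception
print(get_domain('not-a-url'))
```

Because `hostname` is lower-cased and excludes any port, the result can be used directly as a key into a dictionary such as `scraping_functions`.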

In [3]:
cnn_df

| | url | title | text | label |
|---|---|---|---|---|
| 0 | https://edition.cnn.com/2024/03/01/asia/japan-... | Japan’s population crisis was years in the mak... | Each spring, as reliably as the changing of th... | Human-written |
| 1 | https://edition.cnn.com/2024/03/01/world/bangl... | Bangladesh inferno kills at least 43, injures ... | A massive fire raced through a six-storey buil... | Human-written |
| 2 | https://edition.cnn.com/2024/03/25/europe/isis... | Who are ISIS-K, the group linked to the Moscow... | ISIS claimed responsibly forFriday’s deadly as... | Human-written |
| 3 | https://edition.cnn.com/2024/03/21/europe/kyiv... | Thousands shelter from Kyiv missile barrage, h... | Ukraine’s capital came under heavy missile att... | Human-written |
| 4 | https://edition.cnn.com/2024/03/23/europe/us-h... | US had warned Russia ISIS was determined to at... | The US warned Moscow that ISIS militants were ... | Human-written |
| ... | ... | ... | ... | ... |
| 95 | https://edition.cnn.com/2024/03/25/uk/micropla... | Archaeologists are now finding microplastics i... | Sign up for CNN’s Wonder Theory science newsle... | Human-written |
| 96 | https://edition.cnn.com/2024/03/25/climate/bra... | At least 27 dead as flooding ravages southeast... | At least 27 people have been killed in southea... | Human-written |
| 97 | https://edition.cnn.com/2024/03/24/europe/kate... | Prince and Princess of Wales ‘enormously touch... | The Prince and Princess of Wales expressed the... | Human-written |
| 98 | https://edition.cnn.com/2024/03/24/africa/nige... | 137 school children kidnapped by gunmen in Nig... | At least 137 school children who were kidnappe... | Human-written |
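
The `urls` list contains a few repeated entries (for example, the Kyiv missile attack article appears twice), so the scraped DataFrame holds duplicate rows. A minimal sketch of removing repeats before scraping while preserving order is shown here on a shortened stand-in list. The names `sample_urls` and `unique_urls` are illustrative and not from the original notebook.

```python
# Shortened stand-in for the full `urls` list, with one repeated entry
sample_urls = [
    'https://edition.cnn.com/2024/03/21/europe/kyiv-missile-attack-ukraine-russia-intl/index.html',
    'https://edition.cnn.com/2024/03/20/europe/sullivan-kyiv-visit-ukraine-aid-intl/index.html',
    'https://edition.cnn.com/2024/03/21/europe/kyiv-missile-attack-ukraine-russia-intl/index.html',
]

# Dictionary keys are unique and keep insertion order,
# so this drops repeats without reordering the list
unique_urls = list(dict.fromkeys(sample_urls))

print(len(sample_urls), len(unique_urls))
```

An alternative, if the articles have already been scraped, would be to call `cnn_df.drop_duplicates(subset='url')` on the DataFrame.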



## Data Storage for Further Analysis

After scraping and organizing the data, we store the DataFrame in a pickle file named `cnn_data.pkl`. This gives us a stable, easily accessible dataset for further analysis without having to repeat the scraping process. A pickle file suits this purpose because it serializes Python objects directly, so the DataFrame's structure and content are preserved when it is loaded again.


In [4]:
cnn_df.to_pickle("cnn_data.pkl")
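
To illustrate the round trip described above, the following sketch saves a small stand-in DataFrame to a temporary pickle file and loads it back with `pd.read_pickle`. The DataFrame `demo_df`, its values, and the file name `demo.pkl` are assumptions made for the example; the real dataset lives in `cnn_data.pkl`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for cnn_df with the same columns (illustrative values only)
demo_df = pd.DataFrame({
    'url': ['https://example.com/a'],
    'title': ['Example title'],
    'text': ['Example body text.'],
    'label': ['Human-written'],
})

# Write the DataFrame to a temporary pickle file and read it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.pkl')
    demo_df.to_pickle(path)
    restored = pd.read_pickle(path)

# True when the columns, dtypes, and values all match the original
print(restored.equals(demo_df))
```

In a later analysis notebook, the same pattern would presumably reduce to `cnn_df = pd.read_pickle("cnn_data.pkl")`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.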

### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)