## Web Scraping Approach

We scraped news articles from TechCrunch's website. Three libraries were used to carry out the scraping process:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of each page for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Our extraction focuses on each article's title and main body text, drawn from a selected list of URLs. For each page, we locate the HTML elements that hold the title (an `h1` with class `article__title`) and the body (the `p` tags inside a `div` with class `article-content`), then compile the results into a single dataset. If either element is missing, a placeholder value is recorded instead, so every successfully fetched URL yields a record with the same four fields: `url`, `title`, `text`, and `label`. This keeps the collection step simple and the records consistent for analysis.
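
Before running the scraper against live pages, the element lookup can be checked offline. The following is a minimal sketch, assuming the same class names used in the cell below; the HTML snippet and the helper name `extract_title_and_text` are hypothetical and only serve the illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet that mimics the structure the scraper expects.
SAMPLE_HTML = """
<html><body>
  <h1 class="article__title"> Example headline </h1>
  <div class="article-content">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

def extract_title_and_text(html):
    """Return (title, text), using placeholders when an element is missing."""
    soup = BeautifulSoup(html, "html.parser")

    title_el = soup.find("h1", class_="article__title")
    title = title_el.get_text(strip=True) if title_el else "TechCrunch Article"

    body_el = soup.find("div", class_="article-content")
    if body_el:
        # Join the text of every <p> inside the body container.
        text = " ".join(p.get_text(strip=True) for p in body_el.find_all("p"))
    else:
        text = "Content not found"
    return title, text

title, text = extract_title_and_text(SAMPLE_HTML)
missing_title, missing_text = extract_title_and_text("<html><body></body></html>")
print(title, "|", text)
print(missing_title, "|", missing_text)
```

A check like this makes it easier to notice when a site changes its markup, since the placeholders would then appear for every page.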


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_techcrunch_article(url):
    try:
        response = requests.get(url, timeout=10)  # Time out rather than hang on an unresponsive server
        response.raise_for_status()  # Raise an exception for non-200 status codes

        soup = BeautifulSoup(response.content, 'html.parser')

        title_element = soup.find('h1', class_='article__title')  
        title = title_element.text.strip() if title_element else 'TechCrunch Article'

        content_element = soup.find('div', class_='article-content') 
        if content_element:
            paragraphs = content_element.find_all('p')
            text = ' '.join(p.get_text(strip=True) for p in paragraphs)
        else:
            text = 'Content not found'

        return {
            'url': url,
            'title': title,
            'text': text,
            'label': 'Human-written'
        }
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

# List of TechCrunch URLs to scrape
techcrunch_urls = [
    'https://techcrunch.com/2024/03/27/linkedin-is-experimenting-with-a-tiktok-like-video-feed-in-its-app/',
    'https://techcrunch.com/2024/03/27/marissa-mayers-startup-just-rolled-out-apps-for-group-photo-sharing-and-event-planning-and-the-internet-isnt-sure-what-to-think/',
    'https://techcrunch.com/2024/03/27/databricks-spent-10m-on-a-generative-ai-model-that-still-cant-beat-gpt-4/',
    'https://techcrunch.com/2024/03/27/uk-unicorn-new-york-hologram-venture-capital/',
    'https://techcrunch.com/2024/03/27/unicorn-founders/',
    'https://techcrunch.com/2024/03/27/wase-seed-fundraise/',
    'https://techcrunch.com/2024/03/13/tech-layoffs-2023-list/',
    'https://techcrunch.com/2024/03/26/apple-wwdc-2024-set-for-june-10-14-promises-to-be-absolutely-incredible/',
    'https://techcrunch.com/2024/03/26/robinhood-goes-after-apple-card-with-a-new-credit-card-loaded-with-impressive-features/',
    'https://techcrunch.com/2024/03/26/apple-wwdc-2024-set-for-june-10-14-promises-to-be-absolutely-incredible/',
    'https://techcrunch.com/2024/03/26/with-affinity-acquisition-canva-should-be-able-to-compete-better-with-adobes-creative-tools/',
    'https://techcrunch.com/2024/03/13/chatgpt-everything-to-know-about-the-ai-chatbot/',
    'https://techcrunch.com/2024/03/27/newretirement-wants-to-simplify-financial-planning-for-retirement/',
    'https://techcrunch.com/2024/03/27/google-generate-travel-itineraries-for-your-vacations/',
    'https://techcrunch.com/2024/03/27/google-swipe-clothes-better-fashion-recommendations/',
    'https://techcrunch.com/2024/03/27/techcrunch-minute-robinhoods-credit-card-has-arrived-to-take-on-apple-and-any-upcoming-challengers/',
    'https://techcrunch.com/2024/03/27/musical-toy-startup-playtime-engineering-wants-to-simplify-electronic-music-making-for-kids/',
    'https://techcrunch.com/2024/03/27/cyvl-ai-is-bringing-data-driven-solutions-to-transportation-infrastructure/',
    'https://techcrunch.com/2024/03/27/orchard-robotics-ai-powered-camera-system-turns-existing-farm-equipment-into-apple-growing-data-collectors/',
    'https://techcrunch.com/2024/03/27/observe-snowflake-data-observability/',
    'https://techcrunch.com/2024/03/27/century-health-2m-ai-pharma-patient-data/',
    'https://techcrunch.com/2024/03/27/act-fast-just-3-days-remain-to-grab-your-techcrunch-early-stage-2024-tickets/',
    'https://techcrunch.com/2024/03/27/new-summit-is-raising-a-new-100-million-fund-to-back-climate-tech-and-underrepresented-fund-managers/',
    'https://techcrunch.com/2024/03/27/amazon-dsa-ads-library-cjeu/',
    'https://techcrunch.com/2024/03/27/robinhood-goes-after-apple-card-with-a-new-credit-card-loaded-with-impressive-features/',
    'https://techcrunch.com/2024/03/27/databricks-spent-10m-on-a-generative-ai-model-that-still-cant-beat-gpt-4/',
    'https://techcrunch.com/2024/03/27/tech-layoffs-2023-list/',
    'https://techcrunch.com/2024/03/27/amazon-dark-pattern-design-fine/',
    'https://techcrunch.com/2024/03/27/african-b2b-e-commerce-giant-wasoko-marked-down-to-260m-after-vc-halves-stake/',
    'https://techcrunch.com/2024/03/27/rabbit-partners-with-elevenlabs-to-power-voice-commands-on-its-device/',
    'https://techcrunch.com/2024/03/27/marissa-mayers-startup-just-rolled-out-apps-for-group-photo-sharing-and-event-planning-and-the-internet-isnt-sure-what-to-think/',
    'https://techcrunch.com/2024/03/26/paypal-backs-indonesia-insurance-startup-qoala-in-47m-funding/',
    'https://techcrunch.com/2024/03/26/elon-musk-says-all-premium-subscribers-on-x-will-gain-access-to-ai-chatbot-grok-this-week/',
    'https://techcrunch.com/2024/03/26/discipluus-ventures-mentors-founders-norman-rockwell-america/',
    'https://techcrunch.com/2024/03/26/how-to-turn-off-instagrams-political-content-filter/',
    'https://techcrunch.com/2024/03/26/facebook-secret-project-snooped-snapchat-user-traffic/',
    'https://techcrunch.com/2024/03/26/nasas-snake-robot-is-designed-to-search-out-life-in-the-icy-oceans-of-a-saturn-moon/',
    'https://techcrunch.com/2024/03/26/vibrant-planet-uses-ai-for-land-mapping-and-improving-climate-resiliency/',
    'https://techcrunch.com/2024/03/26/baas-startup-synctera-layoffs-fintech/',
    'https://techcrunch.com/2024/03/26/eu-election-security-guidance-for-vlops/',
    'https://techcrunch.com/2024/03/26/evari-seed-round/',
    'https://techcrunch.com/2024/03/26/adobes-genstudio-brings-brand-safe-generative-ai-to-marketers/',
    'https://techcrunch.com/2024/03/26/confetti-a-team-building-platform-used-by-apple-google-and-microsoft-raises-16m/',
    'https://techcrunch.com/2024/03/26/tesla-fsd-beta-free-trial-promotion-driver-assistance/',
    'https://techcrunch.com/2024/03/26/former-nextdoor-exec-raises-25-million-for-pipedreams-a-startup-rolling-up-hvac-companies/',
    'https://techcrunch.com/2024/03/26/ionobell-fundraise-exclusive/',
    'https://techcrunch.com/2024/03/26/fireworks-ai-open-source-api-puts-generative-ai-in-reach-of-any-developer/',
    'https://techcrunch.com/2024/03/26/evoloh-series-a-funding/',
    'https://techcrunch.com/2024/03/26/chiyo-startup-helps-new-moms-postpartum-nutrition/',
    'https://techcrunch.com/2024/03/26/just-4-days-left-to-cash-in-on-early-bird-savings-to-tc-early-stage-2024/',
    'https://techcrunch.com/2024/03/26/0g-labs-launches-with-whopping-35m-pre-seed-to-build-a-modular-ai-blockchain/',
    'https://techcrunch.com/2024/03/26/viam-looks-beyond-robotics-with-its-no-code-automation-platform/',
    'https://techcrunch.com/2024/03/26/picogrid-scores-new-funding-to-connect-the-militarys-stovepipe-systems/'
]

# Scraping articles
scraped_data = []
for url in techcrunch_urls:
    result = scrape_techcrunch_article(url)
    if result:
        scraped_data.append(result)

# Creating a DataFrame with the scraped data
df = pd.DataFrame(scraped_data)
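
The URL list above contains a few repeated entries (for example, the WWDC article appears twice), so the resulting DataFrame holds duplicate rows. Below is a minimal sketch of one way to drop repeats before scraping; `unique_in_order` is a hypothetical helper and not part of the original notebook.

```python
def unique_in_order(urls):
    """Drop repeated URLs while keeping the first occurrence of each."""
    # dict preserves insertion order (Python 3.7+), so list order is kept.
    return list(dict.fromkeys(urls))

# Entries taken from the list above; the first one appears there twice.
sample_urls = [
    'https://techcrunch.com/2024/03/26/apple-wwdc-2024-set-for-june-10-14-promises-to-be-absolutely-incredible/',
    'https://techcrunch.com/2024/03/26/robinhood-goes-after-apple-card-with-a-new-credit-card-loaded-with-impressive-features/',
    'https://techcrunch.com/2024/03/26/apple-wwdc-2024-set-for-june-10-14-promises-to-be-absolutely-incredible/',
]

unique_urls = unique_in_order(sample_urls)
print(len(sample_urls), len(unique_urls))
```

Iterating over `unique_in_order(techcrunch_urls)` in the scraping loop would avoid fetching the same page twice. Alternatively, duplicates can be removed afterwards with `df.drop_duplicates(subset='url')`.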

In [2]:
df

| | url | title | text | label |
|---|---|---|---|---|
| 0 | https://techcrunch.com/2024/03/27/linkedin-is-... | LinkedIn is experimenting with a TikTok-like v... | LinkedIn is testing a new TikTok-like short-fo... | Human-written |
| 1 | https://techcrunch.com/2024/03/27/marissa-maye... | Marissa Mayer’s startup just rolled out photo ... | When Marissa Mayer co-founded a startup six ye... | Human-written |
| 2 | https://techcrunch.com/2024/03/27/databricks-s... | Databricks spent $10M on new DBRX generative A... | If you wanted to raise the profile of your maj... | Human-written |
| 3 | https://techcrunch.com/2024/03/27/uk-unicorn-n... | The UK threw a splashy event in New York this ... | A 3D hologram, dubbed the Ever-Changing Statue... | Human-written |
| 4 | https://techcrunch.com/2024/03/27/unicorn-foun... | New study of unicorn founders finds most are ‘... | A new study that zeros-in on the founders of s... | Human-written |
| 5 | https://techcrunch.com/2024/03/27/wase-seed-fu... | Wase zaps microbes to squeeze more biogas from... | Few people get as excited about wastewater as ... | Human-written |
| 6 | https://techcrunch.com/2024/03/13/tech-layoffs... | A comprehensive list of 2023 & 2024 tech layoffs | The tech-widereckoningthat began in 2022and ra... | Human-written |
| 7 | https://techcrunch.com/2024/03/26/apple-wwdc-2... | Apple WWDC 2024, set for June 10-14, promises ... | Apple SVP Greg “Joz” Joswiak just confirmed vi... | Human-written |
| 8 | https://techcrunch.com/2024/03/26/robinhood-go... | Robinhood’s new credit card goes after Apple C... | Eight months afteracquiring credit card startu... | Human-written |
| 9 | https://techcrunch.com/2024/03/26/apple-wwdc-2... | Apple WWDC 2024, set for June 10-14, promises ... | Apple SVP Greg “Joz” Joswiak just confirmed vi... | Human-written |
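
Rows 6 and 8 show words run together in the `text` column (for example, `afteracquiring`). This is consistent with how `get_text(strip=True)` behaves: it strips each text fragment and then joins the fragments with no separator, so spaces next to inline tags such as links are lost. The sketch below reproduces the effect on a hypothetical HTML fragment and shows one possible alternative that normalizes whitespace instead; it is an illustration, not a change made in the original notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with an inline link, similar to typical article markup.
html = '<p>Eight months after <a href="#">acquiring</a> a startup, the company launched.</p>'
p = BeautifulSoup(html, "html.parser").p

# strip=True trims every fragment, then concatenates them without spaces.
glued = p.get_text(strip=True)

# Alternative: take the raw text and collapse runs of whitespace to one space.
normalized = " ".join(p.get_text().split())

print(glued)
print(normalized)
```

Passing a separator, as in `p.get_text(' ', strip=True)`, is another option, though it can leave a stray space before punctuation that directly follows an inline tag.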


## Data Storage for Further Analysis

After scraping and organizing the data, we store the DataFrame in a pickle file named `techcrunch_data.pkl`. This gives us a stable, easily accessible dataset for further analysis without repeating the scraping step. A pickle file suits this purpose because it serializes Python objects directly, so the DataFrame's structure (columns, dtypes, and index) and content are preserved when it is loaded again.


In [3]:
df.to_pickle("techcrunch_data.pkl")
print("Data scraped and saved to techcrunch_data.pkl")

Data scraped and saved to techcrunch_data.pkl
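
To illustrate why a pickle file avoids re-scraping, here is a minimal round-trip sketch. It uses a small stand-in DataFrame with illustrative values and a temporary file (both assumptions made for the example) rather than the real `techcrunch_data.pkl`.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the scraped DataFrame (values are illustrative only).
df_demo = pd.DataFrame({
    'url': ['https://example.com/a', 'https://example.com/b'],
    'title': ['First', 'Second'],
    'text': ['Body one', 'Body two'],
    'label': ['Human-written', 'Human-written'],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_data.pkl')
    df_demo.to_pickle(path)          # serialize the DataFrame to disk
    restored = pd.read_pickle(path)  # load it back in a later session

# equals() compares shape, dtypes, and values of the two frames.
round_trip_ok = restored.equals(df_demo)
print(round_trip_ok)
```

In a later session, the saved dataset could be reloaded the same way with `pd.read_pickle("techcrunch_data.pkl")`. Pickle files can execute arbitrary code when loaded, so they should only be read from trusted sources.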


### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)