## Web Scraping Approach

We scraped news articles from an online platform using three Python libraries:

- `requests`: retrieves web page content (Python Software Foundation, n.d.).
- `BeautifulSoup`: parses and navigates the HTML structure of these pages for data extraction (Mitchell & Richardson, n.d.).
- `pandas`: structures the scraped information into a format ready for further analysis (McKinney, n.d.).

Our extraction focuses on article titles and their main body text, taken from a hand-picked list of URLs. For each page, we locate the HTML elements that hold the title and the body text, extract their contents, and compile the results into a single dataset. This keeps data collection efficient for our needs and produces consistently structured records for analysis.
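
The element lookup described above can be tried offline before any page is fetched. The following is a minimal sketch, assuming the title sits in an `h1` tag and the body in a `div` with class `caas-body`; the HTML fragment and the helper name `extract_title_and_text` are hypothetical, used here for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; real pages are larger and may use other class names.
SAMPLE_HTML = """
<html><body>
  <h1> Example headline </h1>
  <div class="caas-body">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
</body></html>
"""

def extract_title_and_text(html):
    """Return (title, text) from an HTML string, or None/'' when an element is missing."""
    soup = BeautifulSoup(html, "html.parser")

    # Title: first <h1> on the page, if any.
    title_element = soup.find("h1")
    title = title_element.get_text(strip=True) if title_element else None

    # Body: all <p> tags inside the assumed content container.
    body = soup.find("div", class_="caas-body")
    paragraphs = body.find_all("p") if body else []
    text = " ".join(p.get_text(strip=True) for p in paragraphs)
    return title, text

title, text = extract_title_and_text(SAMPLE_HTML)
print(title)  # Example headline
print(text)   # First paragraph. Second paragraph.
```

Testing selectors against a saved fragment like this makes it easier to notice when a site changes its markup, since a missing element shows up as `None` or an empty string rather than as a network error.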

## Data Storage for Further Analysis

After scraping and organizing the data, we store it in a pickle file named `yahoo_scraped_data.pkl`. This gives us a stable, easily accessible dataset for further analysis without having to repeat the scraping process. A pickle file suits this purpose because it serializes Python objects directly, so the structure and content of the DataFrame are preserved.
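
A minimal sketch of the storage round trip, assuming the dataset is a pandas DataFrame with `url`, `title`, `text`, and `label` columns. The sample rows, the file name `sample_articles.pkl`, and the helper `load_clean` are hypothetical; the `'Content not found'` value is the placeholder the scraper writes when no body text is located.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for scraped results.
sample = pd.DataFrame([
    {"url": "https://example.com/a", "title": "A", "text": "Body A", "label": "Human-written"},
    {"url": "https://example.com/a", "title": "A", "text": "Body A", "label": "Human-written"},
    {"url": "https://example.com/b", "title": "B", "text": "Content not found", "label": "Human-written"},
])

def load_clean(path):
    """Load a pickled DataFrame, then drop repeated URLs and rows without body text."""
    df = pd.read_pickle(path)
    df = df.drop_duplicates(subset="url")
    df = df[df["text"] != "Content not found"]
    return df.reset_index(drop=True)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_articles.pkl")
    sample.to_pickle(path)              # write the DataFrame to disk
    restored = pd.read_pickle(path)     # read it back unchanged
    cleaned = load_clean(path)          # read it back with basic checks applied

print(restored.equals(sample))  # True
print(len(cleaned))             # 1
```

Pickle files can execute arbitrary code when loaded, so `pd.read_pickle` should only be used on files from a trusted source.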


In [15]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_article(url):
    try:
        response = requests.get(url, timeout=10)  # Time out instead of hanging on an unresponsive server
        response.raise_for_status()  # Raise an exception for 4xx/5xx status codes

        soup = BeautifulSoup(response.content, 'html.parser')

        title_element = soup.find('h1')  # Assuming the article title is wrapped in an 'h1' tag
        title = title_element.text.strip() if title_element else 'Article Title Not Found'

        content_element = soup.find('div', class_='caas-body')  # Assuming this class for article content
        if content_element:
            paragraphs = content_element.find_all('p')
            text = ' '.join(p.get_text(strip=True) for p in paragraphs)
        else:
            text = 'Content not found'

        return {
            'url': url,
            'title': title,
            'text': text,
            'label': 'Human-written'
        }
    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

# List of URLs to scrape
urls = [
    "https://finance.yahoo.com/video/energy-stocks-soared-march-211521930.html",
    "https://finance.yahoo.com/news/sec-suit-against-coinbase-forward-135353133.html",
    "https://finance.yahoo.com/news/washingtons-next-test-funding-ukraine-to-stop-putin-183041926.html",
    "https://finance.yahoo.com/video/troubles-mount-apple-wwdc-nears-211314671.html",
    "https://finance.yahoo.com/news/stock-market-today-sp-500-hits-fresh-record-dow-jumps-over-475-points-200500649.html",
    "https://finance.yahoo.com/news/invested-most-salary-7-years-181442762.html",
    "https://finance.yahoo.com/news/trump-media-stock-surges-on-day-2-of-market-debut-153225087.html",
    "https://finance.yahoo.com/news/baltimore-bridge-collapse-could-cost-carnival-10-million-this-year-160921545.html",
    "https://finance.yahoo.com/video/walgreens-earnings-gdp-consumer-sentiment-210542071.html",
    "https://finance.yahoo.com/news/elon-musk-says-almost-anyone-203000043.html",
    "https://finance.yahoo.com/news/fast-food-workers-losing-jobs-225107245.html",
    "https://finance.yahoo.com/news/interior-department-issues-rule-limit-203610416.html",
    "https://finance.yahoo.com/news/biden-methane-crackdown-reaches-oil-190000777.html",
    "https://finance.yahoo.com/news/oil-declines-industry-report-shows-232152865.html",
    "https://finance.yahoo.com/news/shale-ceos-bemoan-us-political-163609255.html",
    "https://finance.yahoo.com/news/swiss-economy-likely-picked-somewhat-144435122.html",
    "https://finance.yahoo.com/news/larry-fink-joins-jamie-dimon-115249796.html",
    "https://finance.yahoo.com/news/spanish-inflation-quickens-government-removes-082346372.html",
    "https://finance.yahoo.com/news/thai-pm-says-economy-needs-053810053.html",
    "https://finance.yahoo.com/news/oman-becoming-hot-spot-ship-031905219.html",
    "https://finance.yahoo.com/news/oil-prices-fall-second-day-020039217.html",
    "https://finance.yahoo.com/news/oil-holds-advance-opec-cutbacks-001540829.html",
    "https://finance.yahoo.com/news/pgim-bullet-proofs-bond-portfolios-150439109.html",
    "https://finance.yahoo.com/news/americans-got-more-pessimistic-about-the-future-of-the-economy-in-march-170648639.html",
    "https://finance.yahoo.com/news/pennsylvania-county-joins-other-local-163845380.html",
    "https://finance.yahoo.com/news/small-business-hiring-woes-show-152534078.html",
    "https://finance.yahoo.com/news/fed-says-official-net-negative-150259152.html",
    "https://finance.yahoo.com/news/us-consumer-confidence-steady-march-142153941.html",
    "https://finance.yahoo.com/news/nigeria-hikes-rates-again-fights-132412009.html",
    "https://finance.yahoo.com/news/better-time-think-buy-home-172201456.html",
    "https://finance.yahoo.com/news/survey-stock-market-inch-2-171018094.html",
    "https://finance.yahoo.com/news/nepo-housing-market-more-third-170559313.html",
    "https://finance.yahoo.com/news/earn-500-month-apple-stock-170539964.html",
    "https://finance.yahoo.com/news/survey-best-ways-play-stock-170442809.html",
    "https://finance.yahoo.com/news/survey-market-pros-see-10-165309953.html",
    "https://finance.yahoo.com/news/3-reits-shunned-wall-street-163012691.html",
    "https://finance.yahoo.com/news/1-wall-street-analyst-thinks-163000639.html",
    "https://finance.yahoo.com/news/truth-social-stock-price-soaring-161336948.html",
    "https://finance.yahoo.com/news/general-electric-stock-going-200-155838469.html",
    "https://finance.yahoo.com/news/4-reasons-buy-baidu-stock-155500921.html",
    "https://finance.yahoo.com/news/billionaire-bill-gates-35-billion-154629407.html",
    "https://finance.yahoo.com/news/junk-market-flashes-warning-fed-154815671.html",
    "https://finance.yahoo.com/news/draftkings-stock-24-upside-according-144802795.html",
    "https://finance.yahoo.com/news/nvidia-8-upside-according-1-144600995.html",
    "https://finance.yahoo.com/news/nvidia-stock-investors-know-recent-143431252.html",
    "https://www.yahoo.com/lifestyle/thredup-releases-resale-state-union-141418782.html",
    "https://finance.yahoo.com/news/meme-stocks-back-trump-media-143135828.html",
    "https://finance.yahoo.com/news/rising-challenge-aspiring-magnificent-seven-141500024.html",
    "https://finance.yahoo.com/news/apple-just-suffered-largest-single-140000413.html",
    "https://finance.yahoo.com/news/better-electric-vehicle-stock-tesla-135300941.html",
    "https://finance.yahoo.com/news/starbucks-stock-continues-to-struggle-as-competition-heats-up-in-the-us-overseas-133603950.html",
    "https://finance.yahoo.com/news/tipranks-perfect-10-list-3-132528882.html",
    "https://finance.yahoo.com/news/disney-looking-grow-ads-business-130000843.html",
    "https://finance.yahoo.com/news/reddit-stock-headed-17-lower-124400600.html",
    "https://finance.yahoo.com/news/nikola-stock-buy-121500508.html",
    "https://finance.yahoo.com/news/3-ai-stocks-buy-hand-110000283.html",
    "https://finance.yahoo.com/news/3-dividend-stocks-paid-raised-104500319.html",
    "https://finance.yahoo.com/news/were-magnificent-seven-purely-ai-104500595.html",
    "https://finance.yahoo.com/news/1-super-stock-track-record-101700681.html",
    "https://finance.yahoo.com/news/1-top-cryptocurrency-buy-soars-100000573.html",
    "https://finance.yahoo.com/news/forget-tesla-think-stock-replace-095000495.html"
    "https://finance.yahoo.com/news/3-high-yield-dividend-stocks-094400057.html",
    "https://finance.yahoo.com/news/could-bull-market-buy-help-093500871.html",
    "https://finance.yahoo.com/news/fantastic-stock-outperformed-tesla-past-093000464.html",
    "https://finance.yahoo.com/news/goldman-sees-pension-funds-offloading-092437436.html",
    "https://finance.yahoo.com/news/goldman-sees-pension-funds-offloading-092437436.html",
    "https://finance.yahoo.com/news/3-no-brainer-stocks-buy-092100439.html",
    "https://finance.yahoo.com/news/trader-bets-trump-dollar-rally-090000916.html",
    "https://finance.yahoo.com/news/arm-holdings-trillion-dollar-stock-084000948.html",
    "https://finance.yahoo.com/news/forget-tesla-1-unstoppable-artificial-082700121.html",
    "https://finance.yahoo.com/news/analysis-yuan-skids-markets-bet-082413508.html",
    "https://finance.yahoo.com/news/reddit-next-big-ai-stock-081000684.html",
    "https://finance.yahoo.com/news/no-risk-contagion-first-cyberattack-055924605.html",
    "https://finance.yahoo.com/news/p-global-downgrades-outlooks-five-232203780.html",
    "https://finance.yahoo.com/news/too-buy-duolingo-stock-050000677.html",
    "https://finance.yahoo.com/news/hong-kong-moribund-market-loses-041355220.html",
    "https://finance.yahoo.com/news/fed-rate-cut-could-actually-034032686.html",
    "https://finance.yahoo.com/news/cathie-wood-buys-48-million-000300053.html",
    "https://finance.yahoo.com/news/djt-stock-might-become-etf-234500621.html"
]

# Initializing an empty list to store scraped data
results = []

# Looping through each URL and scraping the article
for url in urls:
    result = scrape_article(url)
    if result:
        results.append(result)
    else:
        print(f"Failed to scrape article from {url}")

# Creating a DataFrame with the scraped data
df = pd.DataFrame(results)

# Saving the DataFrame to a pickle file
df.to_pickle("yahoo_scraped_data.pkl")
print("Data scraped and saved to yahoo_scraped_data.pkl")

Error fetching URL: 400 Client Error: Invalid HTTP Request for url: https://finance.yahoo.com/video/energy-stocks-soared-march-211521930.html
Failed to scrape article from https://finance.yahoo.com/video/energy-stocks-soared-march-211521930.html
Error fetching URL: 400 Client Error: Invalid HTTP Request for url: https://finance.yahoo.com/video/troubles-mount-apple-wwdc-nears-211314671.html
Failed to scrape article from https://finance.yahoo.com/video/troubles-mount-apple-wwdc-nears-211314671.html
Error fetching URL: 502 Server Error: Next Hop Connection Failed for url: https://finance.yahoo.com/video/walgreens-earnings-gdp-consumer-sentiment-210542071.html
Failed to scrape article from https://finance.yahoo.com/video/walgreens-earnings-gdp-consumer-sentiment-210542071.html
Error fetching URL: 404 Client Error: Not Found for url: https://finance.yahoo.com/news/interior-department-issues-rule-limit-203610416.html
Failed to scrape article from https://finance.yahoo.com/news/interior-depar
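
The log above mixes two kinds of failure: 400 and 404 client errors, which retrying will generally not fix, and a 502 server error, which is often transient. The following is a minimal sketch, assuming retries are acceptable for this workload, of a `requests.Session` that retries 502/503/504 responses with backoff using urllib3's `Retry` policy. The helper name `build_session` and its parameter values are illustrative choices, not part of the original notebook.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(total_retries=3, backoff=0.5):
    """Return a Session that retries transient server errors with exponential backoff."""
    retry = Retry(
        total=total_retries,                # maximum number of retries per request
        backoff_factor=backoff,             # sleep grows between successive attempts
        status_forcelist=[502, 503, 504],   # retry only on these response codes
    )
    adapter = HTTPAdapter(max_retries=retry)

    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

session = build_session()
```

`scrape_article` could then call `session.get(url, timeout=10)` in place of its `requests.get` call. The 400 and 404 responses would still fail, so logging them for manual review remains useful.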

In [16]:
df

| | url | title | text | label |
|---|---|---|---|---|
| 0 | https://finance.yahoo.com/news/sec-suit-agains... | SEC Suit Against Coinbase Can Go Forward, Judg... | (Bloomberg) — The US Securities and Exchange C... | Human-written |
| 1 | https://finance.yahoo.com/news/washingtons-nex... | Washington’s next test: Funding Ukraine to sto... | It’s been ugly, but a Congress split between D... | Human-written |
| 2 | https://finance.yahoo.com/news/stock-market-to... | Stock market today: S&P 500 hits fresh record,... | US stocks rebounded Wednesday after several da... | Human-written |
| 3 | https://finance.yahoo.com/news/invested-most-s... | I invested most of my salary for 7 years and h... | Daniel George worked at Google X and then as a... | Human-written |
| 4 | https://finance.yahoo.com/news/trump-media-sto... | Trump Media stock surges on day 2 of market debut | Trump Media & Technology Group (DJT) soared as... | Human-written |
| 5 | https://finance.yahoo.com/news/baltimore-bridg... | Baltimore bridge collapse could cost Carnival ... | Cruise operator Carnival (CCL) warned that the... | Human-written |
| 6 | https://finance.yahoo.com/news/elon-musk-says-... | Elon Musk Says 'Almost Anyone' Can Afford A $1... | SpaceX CEOElon Muskhas outlined an ambitious p... | Human-written |
| 7 | https://finance.yahoo.com/news/fast-food-worke... | Fast food workers are losing their jobs in Cal... | Fast food workersare losing their jobs in Cali... | Human-written |
| 8 | https://finance.yahoo.com/news/biden-methane-c... | Biden’s Methane Crackdown Reaches Oil Wells on... | (Bloomberg) -- The Biden administration is fin... | Human-written |
| 9 | https://finance.yahoo.com/news/oil-declines-in... | Oil Slips as Rising US Stockpiles Undercut Tig... | (Bloomberg) -- Oil lost more of the ground it ... | Human-written |
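
Rows 6 and 7 in the output above contain fused words (`CEOElon Muskhas`, `workersare`). A likely cause is the `get_text(strip=True)` call in `scrape_article`: it strips every text fragment inside a paragraph and joins the fragments with no separator, so spaces next to inline tags such as links are lost. The sketch below reproduces the effect on a hypothetical paragraph and shows one possible alternative, which keeps the original spacing and then collapses runs of whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph with an inline link, similar in shape to the affected rows.
html = "<p>SpaceX CEO <a href='#'>Elon Musk</a> has outlined a plan.</p>"
p = BeautifulSoup(html, "html.parser").find("p")

# Each fragment is stripped, then fragments are joined with an empty string.
fused = p.get_text(strip=True)

# Keep the text as written, then collapse any run of whitespace into one space.
normalized = " ".join(p.get_text().split())

print(fused)       # SpaceX CEOElon Muskhas outlined a plan.
print(normalized)  # SpaceX CEO Elon Musk has outlined a plan.
```

Passing a separator, as in `p.get_text(" ", strip=True)`, also avoids fused words, but it can leave a space before punctuation that directly follows an inline tag.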


### References

- Python Software Foundation. (n.d.). *Requests: HTTP for Humans™*. Retrieved from [https://requests.readthedocs.io](https://requests.readthedocs.io)

- Mitchell, R., & Richardson, L. (n.d.). *Beautiful Soup Documentation*. Retrieved from [https://www.crummy.com/software/BeautifulSoup/bs4/doc/](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

- McKinney, W. (n.d.). *pandas: powerful Python data analysis toolkit*. Retrieved from [https://pandas.pydata.org/pandas-docs/stable/index.html](https://pandas.pydata.org/pandas-docs/stable/index.html)