# Oil News retrieval

### Data Retrieval from OilPrice.com

This section describes how textual news data was collected from OilPrice.com.

**Source**: [https://oilprice.com/Latest-Energy-News/World-News/](https://oilprice.com/Latest-Energy-News/World-News/)

**Extracted Fields**

The following fields were extracted for each news article:

- **Title**: The headline of the article.
- **URL**: The full link to the original article.
- **Date**: The publication date as provided on the website.
- **Author**: The name of the author or contributor, if available.
- **Excerpt**: A short summary or preview of the article.

**Methodology**

Data was retrieved using a Python web scraping pipeline built with the `requests` and `BeautifulSoup` libraries. For each listing page, the script fetches the HTML, locates every article block by its tag and class attribute, and extracts the fields listed above. The records are collected in a `pandas.DataFrame` and written to CSV for further analysis.
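To illustrate the extraction step, here is a minimal sketch that parses a single article block offline. The class names are taken from the scraping code in this notebook, but the sample markup, the author name, and the helper `parse_article` are assumptions made for illustration; the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical listing markup: the class names come from the scraping code in
# this notebook, but the surrounding structure is an assumption.
SAMPLE_HTML = """
<div class="categoryArticle__content">
  <a href="https://oilprice.com/Latest-Energy-News/World-News/example.html">
    <h2 class="categoryArticle__title">Example Headline</h2>
  </a>
  <p class="categoryArticle__meta">Jun 25, 2025 at 06:45 | Jane Doe</p>
  <p class="categoryArticle__excerpt">A short preview of the article.</p>
</div>
"""


def parse_article(block):
    """Extract the five fields from one article block; missing tags give None."""

    def text_of(name, class_name):
        tag = block.find(name, class_=class_name)
        return tag.get_text(strip=True) if tag else None

    a_tag = block.find("a", href=True)
    meta = text_of("p", "categoryArticle__meta")
    # Split "date | author" once; otherwise keep the whole string as the date.
    date, author = meta.split(" | ", 1) if meta and " | " in meta else (meta, None)
    return {
        "title": text_of("h2", "categoryArticle__title"),
        "url": a_tag["href"] if a_tag else None,
        "date": date,
        "author": author,
        "excerpt": text_of("p", "categoryArticle__excerpt"),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [parse_article(b) for b in soup.find_all("div", class_="categoryArticle__content")]
print(records[0])
```

Keeping the per-article logic in one function makes it possible to test the parsing against saved HTML without issuing any requests.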

This dataset serves as the input for subsequent semantic embedding and natural language processing experiments.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

In [3]:
base_url = "https://oilprice.com/Latest-Energy-News/World-News/Page-{}.html"
all_articles = []

headers = {"User-Agent": "Mozilla/5.0"}

### Pages 1 to 1171

In [4]:
for page in range(1, 1172):  # Change the range to scrape more or fewer pages
    url = base_url.format(page)
    print(f"Scraping: {url}")
    # A timeout keeps a stalled connection from hanging the whole run
    response = requests.get(url, headers=headers, timeout=30)
    soup = BeautifulSoup(response.text, "html.parser")
    
    articles = soup.find_all("div", class_="categoryArticle__content")
    
    for article in articles:
        # Title
        title_tag = article.find("h2", class_="categoryArticle__title")
        title = title_tag.get_text(strip=True) if title_tag else None

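        # Link (hrefs may already be absolute, so only prefix relative ones)
        a_tag = article.find("a", href=True)
        href = a_tag["href"] if a_tag else None
        link = ("https://oilprice.com" + href) if href and not href.startswith("http") else href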

        # Date and author
        meta_tag = article.find("p", class_="categoryArticle__meta")
        meta_text = meta_tag.get_text(strip=True) if meta_tag else None
        # Split on the first " | " only, so an extra separator cannot break unpacking
        date, author = meta_text.split(" | ", 1) if meta_text and " | " in meta_text else (meta_text, None)

        # Excerpt
        excerpt_tag = article.find("p", class_="categoryArticle__excerpt")
        excerpt = excerpt_tag.get_text(strip=True) if excerpt_tag else None

        all_articles.append({
            "title": title,
            "url": link,
            "date": date,
            "author": author,
            "excerpt": excerpt
        })

    time.sleep(1.5)

df_300 = pd.DataFrame(all_articles)
df_300
df_300.to_csv('../raw/df_300.csv')


Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-1.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-2.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-3.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-4.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-5.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-6.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-7.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-8.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-9.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-10.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-11.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-12.html
Scraping: https://oilprice.com/Latest-Energy-News/World-News/Page-13.html
Scraping: https://oilprice.com/Latest-Energy-Ne
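The loop above issues one request per page over more than a thousand pages, so a single connection error would abort the run. One possible safeguard is to retry failed requests with an increasing delay. The sketch below is not part of the original pipeline: `fetch_with_retry` and its `get` parameter are hypothetical names, and the getter is injected so the example runs without network access.

```python
import time


def fetch_with_retry(get, url, retries=3, backoff=1.0, sleep=time.sleep):
    """Call get(url) up to `retries` times, doubling the wait after each failure.

    `get` is any callable that returns page text or raises on failure
    (a hypothetical interface, chosen so the sketch runs offline).
    """
    delay = backoff
    for attempt in range(1, retries + 1):
        try:
            return get(url)
        except Exception as exc:
            if attempt == retries:
                raise  # out of attempts: surface the last error
            print(f"Attempt {attempt} failed for {url}: {exc}; retrying in {delay}s")
            sleep(delay)
            delay *= 2


# Offline demonstration: a fake getter that fails twice, then succeeds.
calls = {"n": 0}


def flaky_get(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"


waits = []  # record the requested delays instead of actually sleeping
html = fetch_with_retry(
    flaky_get,
    "https://oilprice.com/Latest-Energy-News/World-News/Page-1.html",
    sleep=waits.append,
)
print(html, waits)
```

In the notebook, the getter could wrap the existing `requests.get(url, headers=headers)` call and return `response.text`.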

In [5]:
df_300

|  | title | url | date | author | excerpt |
|---|---|---|---|---|---|
| 0 | April Price Crash Dragged Saudi Arabia’s Oil R... | https://oilprice.comhttps://oilprice.com/Lates... | Jun 25, 2025 at 06:45 | Tsvetana Paraskova | Saudi Arabia’s revenues from oil exports crash... |
| 1 | Giant Leviathan Gas Field Offshore Israel Resu... | https://oilprice.comhttps://oilprice.com/Lates... | Jun 25, 2025 at 06:12 | Tsvetana Paraskova | The massive Leviathan gas field offshore Israe... |
| 2 | China and India Cut Imports of Lower-Quality C... | https://oilprice.comhttps://oilprice.com/Lates... | Jun 25, 2025 at 04:56 | Tsvetana Paraskova | The world’s biggest and second-biggest coal im... |
| 3 | Iran-Israel War Prompts China to Reconsider Ru... | https://oilprice.comhttps://oilprice.com/Lates... | Jun 25, 2025 at 02:50 | Irina Slav | The war between Israel and Iran has spark worr... |
| 4 | EU Set to Change Subsidy Rules for Energy Costs | https://oilprice.comhttps://oilprice.com/Lates... | Jun 25, 2025 at 02:00 | Irina Slav | National governments in the EU would soon be a... |
| ... | ... | ... | ... | ... | ... |
| 23415 | Australia's Desalinization Plant Workers in In... | https://oilprice.comhttps://oilprice.com/Lates... | Jun 20, 2011 at 20:09 | Charles Kennedy | Victoria state’s troubled Wonthaggi desalinati... |
| 23416 | Chinese Energy Workers in Somalia Threatened | https://oilprice.comhttps://oilprice.com/Lates... | Jun 19, 2011 at 18:20 | Charles Kennedy | The Ogaden National Liberation Front has warne... |
| 23417 | Argentina Now Receiving 40 Percent of Chinese ... | https://oilprice.comhttps://oilprice.com/Lates... | Jun 19, 2011 at 09:28 | Joao Peixe | In Argentina Mandarin Chinese is now the main ... |
| 23418 | Chinese Dam and Pipeline Projects Raise Burmes... | https://oilprice.comhttps://oilprice.com/Lates... | Jun 18, 2011 at 17:44 | Joao Peixe | Lucrative China-backed hydropower projects are... |
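The recorded `url` values begin with `https://oilprice.comhttps://oilprice.com/`, which suggests that the `href` attributes on the listing page are already absolute, so prefixing the domain duplicates it. A minimal sketch of one way to handle both relative and absolute links uses `urllib.parse.urljoin` from the standard library. The helper name `absolute_url` and both sample paths are illustrative, not taken from the scraped data.

```python
from urllib.parse import urljoin

BASE = "https://oilprice.com"


def absolute_url(href):
    """Resolve href against BASE; hrefs that are already absolute are returned unchanged."""
    return urljoin(BASE, href)


# Both href values are illustrative, not taken from the scraped data.
relative = absolute_url("/Latest-Energy-News/World-News/example.html")
already_absolute = absolute_url("https://oilprice.com/Latest-Energy-News/World-News/example.html")
print(relative)
print(already_absolute)
```

Both calls yield the same well-formed URL, so the same code path works whichever form the site emits.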


## Dataset Date Setting

In [5]:
import pandas as pd

In [10]:

df = pd.read_csv("../raw/df_300.csv")

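# Parse strings such as "Jun 25, 2025 at 06:45" into timestamps
df['date'] = pd.to_datetime(df['date'], format='%b %d, %Y at %H:%M')
# Split each timestamp into a calendar date and a time of day
df['Date'] = df['date'].dt.date
df['time'] = df['date'].dt.time

# Drop the index column that to_csv wrote alongside the data
if 'Unnamed: 0' in df.columns:
    df = df.drop(columns=['Unnamed: 0'])
# Keep only the title, calendar date, and excerpt
df_final = df[['title', 'Date', 'excerpt']]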
df_final

|  | title | Date | excerpt |
|---|---|---|---|
| 0 | April Price Crash Dragged Saudi Arabia’s Oil R... | 2025-06-25 | Saudi Arabia’s revenues from oil exports crash... |
| 1 | Giant Leviathan Gas Field Offshore Israel Resu... | 2025-06-25 | The massive Leviathan gas field offshore Israe... |
| 2 | China and India Cut Imports of Lower-Quality C... | 2025-06-25 | The world’s biggest and second-biggest coal im... |
| 3 | Iran-Israel War Prompts China to Reconsider Ru... | 2025-06-25 | The war between Israel and Iran has spark worr... |
| 4 | EU Set to Change Subsidy Rules for Energy Costs | 2025-06-25 | National governments in the EU would soon be a... |
| ... | ... | ... | ... |
| 23415 | Australia's Desalinization Plant Workers in In... | 2011-06-20 | Victoria state’s troubled Wonthaggi desalinati... |
| 23416 | Chinese Energy Workers in Somalia Threatened | 2011-06-19 | The Ogaden National Liberation Front has warne... |
| 23417 | Argentina Now Receiving 40 Percent of Chinese ... | 2011-06-19 | In Argentina Mandarin Chinese is now the main ... |
| 23418 | Chinese Dam and Pipeline Projects Raise Burmes... | 2011-06-18 | Lucrative China-backed hydropower projects are... |
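The date conversion above uses a strict format string, so `pd.to_datetime` raises an error if any value deviates from it. As a hedged alternative, `errors="coerce"` converts unparsable values to `NaT`, which makes it possible to inspect the offending rows rather than stop. The sample values below are illustrative: the first two follow the format seen in the scraped data, and the third is deliberately malformed.

```python
import pandas as pd

# Illustrative values: the first two follow the format seen in the scraped data,
# the third is a deliberately malformed string.
raw = pd.Series(["Jun 25, 2025 at 06:45", "Jun 18, 2011 at 17:44", "unknown"])

# errors="coerce" turns unparsable strings into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%b %d, %Y at %H:%M", errors="coerce")

# Select the original strings that failed to parse, for manual inspection.
bad_rows = raw[parsed.isna()]
print(parsed)
print(bad_rows.tolist())
```

Checking `bad_rows` before dropping or repairing anything keeps silent data loss out of the downstream dataset.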


In [11]:
df_final.to_csv("../raw/df_oilnews.csv", index=False)