# SQL vs Pandas for Energy Data

This section walks through a step-by-step example in which energy production data from Eurostat is loaded into a SQLite database and then queried, analyzed, and plotted with Python. It covers downloading the individual data files, loading them into SQLite, and visualizing energy production for different countries.

This guide shows how to:
1. Download the data files from Eurostat.
2. Load data into a SQLite database.
3. Use Python to query and visualize the data.
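
Steps 2 and 3 can be sketched on a tiny table before working with the full files. This is only a minimal sketch under stated assumptions: it uses an in-memory SQLite database and a hypothetical table name `coal_sample`, and the five values are the October 2023 coal figures printed in the output later in this section. It runs the same filter and sort once in SQL and once in pandas.

```python
import sqlite3

import pandas as pd

# October 2023 coal values copied from the output shown later in this section.
sample = pd.DataFrame({
    "Country": ["Belgium", "Bulgaria", "Czechia", "Denmark", "Germany"],
    "October 2023": [70.669, 781.035, 2610.974, 44.275, 9674.588],
})

# Step 2: load the DataFrame into SQLite (in-memory here, a file in the real workflow).
# 'coal_sample' is a hypothetical table name used only for this sketch.
conn = sqlite3.connect(":memory:")
sample.to_sql("coal_sample", conn, if_exists="replace", index=False)

# Step 3a: filter and sort with SQL. The column name contains a space,
# so it is written in double quotes.
sql_result = pd.read_sql_query(
    'SELECT Country, "October 2023" FROM coal_sample '
    'WHERE "October 2023" > 500 ORDER BY "October 2023" DESC',
    conn,
)
conn.close()

# Step 3b: the same filter and sort in pandas.
pandas_result = (
    sample[sample["October 2023"] > 500]
    .sort_values("October 2023", ascending=False)
    .reset_index(drop=True)
)

print(sql_result)
print(pandas_result)
```

Both results should list the same countries in the same order; which style is preferable depends mostly on where the data already lives.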

## Step 1: Download the Data from Eurostat

The datasets are available on the Eurostat website under [Energy Statistics](https://ec.europa.eu/eurostat/web/energy). The code below assumes the files have already been downloaded as Excel spreadsheets (`.xlsx`) and saved under `../data/section4/euroStat/`.

For this example, we are using the following datasets:

In [103]:
import pandas as pd
import sqlite3
import os
import matplotlib.pyplot as plt

In [104]:
file_paths = {
    'coal': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_coal.xlsx',
    'nonRenewables': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_combustionFuels_nonRenewables.xlsx',
    'renewables': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_combustionFuels_Renewables.xlsx',
    'geothermal': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_geothermal.xlsx',
    'hydro': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_hydro.xlsx',
    'naturalGas': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_naturalGas.xlsx',
    'nuclear': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_nuclear.xlsx',
    'oil': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_oil.xlsx',
    'otherRenewables': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_otherRenewables.xlsx',
    'solar': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_solar.xlsx',
    'wind': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_wind.xlsx'
}

In [105]:
# Folder to save individual databases
db_folder = '/workspaces/EAGE_PythonRenewableEnergyCourse/EAGE_PythonRenewableEnergyCourse/section5/individualDB'

In [106]:
# Create folder if it doesn't exist
os.makedirs(db_folder, exist_ok=True)

In [107]:
# Let's start by processing one dataset, for example, coal
dataset = 'coal'
file_path = file_paths[dataset]
db_name = f'{db_folder}/{dataset}_individualDB.db'

In [108]:
# Check if the database already exists and delete it if needed
if os.path.exists(db_name):
    os.remove(db_name)
    print(f"Existing database {db_name} deleted.")

In [109]:
# Load the dataset from Excel
coal_data = pd.read_excel(file_path, sheet_name='coal', skiprows=range(0, 8))

In [110]:
# 1) Rename the first column from "TIME" to "Country"
coal_data.rename(columns={'TIME': 'Country'}, inplace=True)

In [111]:
# Remove the first row, which holds the GEO (Labels) header
coal_data = coal_data.drop(index=0)

In [112]:
# 2) Drop the 'European Union - 27 countries (from 2020)' aggregate row
coal_data = coal_data[coal_data['Country'] != 'European Union - 27 countries (from 2020)']

In [113]:
# 3) Delete the columns where the name starts with 'Unnamed'
coal_data = coal_data.loc[:, ~coal_data.columns.str.contains('^Unnamed')]

In [114]:
# 4) Keep only the first 40 rows and drop everything after them
coal_data = coal_data[:40]

In [115]:
# 5) Rename the date columns to human-readable formats (January 2016, February 2016, etc.)
# Assuming the columns after 'Country' are date-like (e.g., '2016-01', '2016-02', etc.)
date_columns = pd.to_datetime(coal_data.columns[1:], format='%Y-%m')  # Convert to datetime format
new_column_names = ['Country'] + date_columns.strftime('%B %Y').tolist()  # Convert back to desired string format
coal_data.columns = new_column_names
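
The column renaming step can be tried in isolation. The following is a minimal sketch on three hand-written labels (not read from the Eurostat file); it assumes an English locale, since `%B` produces month names in the current locale.

```python
import pandas as pd

# Three example labels in the 'YYYY-MM' form assumed by the cell above.
raw_columns = pd.Index(["2016-01", "2016-02", "2023-10"])

# Parse the labels, then format them as 'Month YYYY'.
parsed = pd.to_datetime(raw_columns, format="%Y-%m")
readable = parsed.strftime("%B %Y").tolist()
print(readable)

# The readable labels can be parsed back when real dates are needed,
# for example to build a time axis for a plot.
round_trip = pd.to_datetime(readable, format="%B %Y")
print(round_trip)
```

One consequence of the readable labels is that they sort alphabetically rather than chronologically, so parsing them back into dates is usually the safer way to order columns by time.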

In [116]:
# Let's inspect the first few rows to make sure it looks correct
print("First few rows of cleaned coal dataset:")
print(coal_data.head())

First few rows of cleaned coal dataset:
    Country January 2016 February 2016 March 2016 April 2016 May 2016  \
2   Belgium            :             :          :          :        :   
3  Bulgaria            :             :          :          :        :   
4   Czechia            :             :          :          :        :   
5   Denmark            :             :          :          :        :   
6   Germany            :             :          :          :        :   

  June 2016 July 2016 August 2016 September 2016  ... October 2023  \
2         :         :           :              :  ...       70.669   
3         :         :           :              :  ...      781.035   
4         :         :           :              :  ...     2610.974   
5         :         :           :              :  ...       44.275   
6         :         :           :              :  ...     9674.588   

  November 2023 December 2023 January 2024 February 2024 March 2024  \
2         64.24       100.765
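
The `:` entries in the output above appear to mark values that are not available. As long as they are present, the affected columns hold text rather than numbers. A minimal sketch of converting such placeholders to `NaN`, using the same `pd.to_numeric(..., errors='coerce')` call that the aggregation code later in this section relies on, is shown below on made-up rows.

```python
import pandas as pd

# Made-up rows that mimic the layout above: ':' stands for a missing value.
raw = pd.DataFrame({
    "Country": ["A", "B"],
    "January 2016": [":", "12.5"],
    "February 2016": ["3", ":"],
})

# Convert every column except 'Country'; anything non-numeric becomes NaN.
value_columns = raw.columns[1:]
numeric = raw[value_columns].apply(pd.to_numeric, errors="coerce")
cleaned = pd.concat([raw[["Country"]], numeric], axis=1)

print(cleaned)
print(cleaned.dtypes)
```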

In [117]:
# Connect to the new SQLite database
conn = sqlite3.connect(db_name)
cursor = conn.cursor()

In [118]:
# Store the cleaned data into the database
coal_data.to_sql(f'{dataset}', conn, if_exists='replace', index=False)

40

In [119]:
# Verify that the data was stored correctly
print("First few rows from the database:")
query = pd.read_sql_query(f"SELECT * FROM {dataset} LIMIT 5", conn)
print(query)

First few rows from the database:
    Country January 2016 February 2016 March 2016 April 2016 May 2016  \
0   Belgium            :             :          :          :        :   
1  Bulgaria            :             :          :          :        :   
2   Czechia            :             :          :          :        :   
3   Denmark            :             :          :          :        :   
4   Germany            :             :          :          :        :   

  June 2016 July 2016 August 2016 September 2016  ... October 2023  \
0         :         :           :              :  ...       70.669   
1         :         :           :              :  ...      781.035   
2         :         :           :              :  ...     2610.974   
3         :         :           :              :  ...       44.275   
4         :         :           :              :  ...     9674.588   

  November 2023 December 2023 January 2024 February 2024 March 2024  \
0         64.24       100.765      

In [120]:
# Close the connection
conn.close()
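
Because the stored column names contain spaces (for example `October 2023`), they have to be quoted when selected individually in SQL. The sketch below uses only the standard-library `sqlite3` module and an in-memory table with three of the October 2023 values from the output above; it is an illustration, not part of the course notebook.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    # Double quotes mark identifiers, so column names with spaces are allowed.
    conn.execute('CREATE TABLE coal ("Country" TEXT, "October 2023" REAL)')
    conn.executemany(
        "INSERT INTO coal VALUES (?, ?)",
        [("Belgium", 70.669), ("Denmark", 44.275), ("Germany", 9674.588)],
    )

    # Correct: the double-quoted name refers to the column.
    top_two = conn.execute(
        'SELECT "Country", "October 2023" FROM coal '
        'ORDER BY "October 2023" DESC LIMIT 2'
    ).fetchall()

    # Pitfall: single quotes create a string literal, not a column reference.
    literal = conn.execute("SELECT 'October 2023' FROM coal LIMIT 1").fetchall()

print(top_two)
print(literal)
```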

---

## Repeat for all datasets

### Individual databases

![datasetDataset_Image](../images/section5/individualDatabases.png) 

In [180]:
def process_dataset_to_db(dataset, file_path, db_folder):
    """
    Processes a dataset from an Excel file and stores it in an individual SQLite database.
    
    Parameters:
    - dataset: The name of the dataset (e.g., 'coal', 'renewables', etc.)
    - file_path: The path to the Excel file containing the dataset.
    - db_folder: The folder where the SQLite database will be saved.
    """
    # Create folder if it doesn't exist
    os.makedirs(db_folder, exist_ok=True)

    # Create the database name and path
    db_name = f'{db_folder}/{dataset}_individualDB.db'

    # Check if the database already exists and delete it if needed
    if os.path.exists(db_name):
        os.remove(db_name)
        print(f"Existing database {db_name} deleted.")

    # Load the dataset from Excel
    data = pd.read_excel(file_path, sheet_name=dataset, skiprows=range(0, 8))

    # 1) Rename the first column from "TIME" to "Country"
    data.rename(columns={'TIME': 'Country'}, inplace=True)

    # 2) Remove the first row with GEO (Labels)
    data = data.drop(index=0)

    # 3) Drop the 'European Union - 27 countries (from 2020)' aggregate row
    data = data[data['Country'] != 'European Union - 27 countries (from 2020)']

    # 4) Delete the columns where the name starts with 'Unnamed'
    data = data.loc[:, ~data.columns.str.contains('^Unnamed')]

    # 5) Keep only the first 40 rows and drop everything after them
    data = data[:40]

    # 6) Rename the date columns to human-readable formats (January 2016, February 2016, etc.)
    # Assuming the columns after 'Country' are date-like (e.g., '2016-01', '2016-02', etc.)
    date_columns = pd.to_datetime(data.columns[1:], format='%Y-%m')  # Convert to datetime format
    new_column_names = ['Country'] + date_columns.strftime('%B %Y').tolist()  # Convert back to desired string format
    data.columns = new_column_names

    # Let's inspect the first few rows to make sure it looks correct
    print(f"First few rows of cleaned {dataset} dataset:")
    print(data.head())

    # Connect to the new SQLite database
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()

    # Store the cleaned data into the database
    data.to_sql(f'{dataset}', conn, if_exists='replace', index=False)

    # Verify that the data was stored correctly
    print(f"First few rows from the {dataset} database:")
    query = pd.read_sql_query(f"SELECT * FROM {dataset} LIMIT 5", conn)
    print(query)

    # Close the connection
    conn.close()
    print(f"Database for {dataset} created successfully.\n")

In [181]:
# Folder to save individual databases
db_folder = '/workspaces/EAGE_PythonRenewableEnergyCourse/EAGE_PythonRenewableEnergyCourse/section5/individualDB'

In [182]:
# File paths for datasets
file_paths = {
    'coal': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_coal.xlsx',
    'nonRenewables': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_combustionFuels_nonRenewables.xlsx',
    'renewables': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_combustionFuels_Renewables.xlsx',
    'geothermal': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_geothermal.xlsx',
    'hydro': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_hydro.xlsx',
    'naturalGas': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_naturalGas.xlsx',
    'nuclear': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_nuclear.xlsx',
    'oil': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_oil.xlsx',
    'otherRenewables': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_otherRenewables.xlsx',
    'solar': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_solar.xlsx',
    'wind': '../data/section4/euroStat/nrg_cb_pem_page_spreadsheet_wind.xlsx'
}

In [183]:
# Process all datasets and create individual databases
for dataset, file_path in file_paths.items():
    process_dataset_to_db(dataset, file_path, db_folder)

Existing database /workspaces/EAGE_PythonRenewableEnergyCourse/EAGE_PythonRenewableEnergyCourse/section5/individualDB/coal_individualDB.db deleted.
First few rows of cleaned coal dataset:
    Country January 2016 February 2016 March 2016 April 2016 May 2016  \
2   Belgium            :             :          :          :        :   
3  Bulgaria            :             :          :          :        :   
4   Czechia            :             :          :          :        :   
5   Denmark            :             :          :          :        :   
6   Germany            :             :          :          :        :   

  June 2016 July 2016 August 2016 September 2016  ... October 2023  \
2         :         :           :              :  ...       70.669   
3         :         :           :              :  ...      781.035   
4         :         :           :              :  ...     2610.974   
5         :         :           :              :  ...       44.275   
6         :         :  

---

#### Database Operations

In [434]:
# Folder to save the final aggregated databases
aggregated_db_folder = '/workspaces/EAGE_PythonRenewableEnergyCourse/EAGE_PythonRenewableEnergyCourse/section5/aggregatedDB'

In [435]:
# Create subfolders perCountry and totalEU
per_country_folder = f'{aggregated_db_folder}/perCountry'
total_eu_folder = f'{aggregated_db_folder}/totalEU'

In [436]:
os.makedirs(per_country_folder, exist_ok=True)
os.makedirs(total_eu_folder, exist_ok=True)

In [437]:
# Define the file paths for datasets already processed and stored in individualDB
db_folder = '/workspaces/EAGE_PythonRenewableEnergyCourse/EAGE_PythonRenewableEnergyCourse/section5/individualDB'

In [438]:
# Define dataset categories
non_renewables = ['coal', 'naturalGas', 'nuclear', 'oil']
renewables = ['geothermal', 'hydro', 'solar', 'wind']
other_renewables = ['otherRenewables']

In [439]:
# Define the final EU databases
eu_dbs = {
    'nonRenewableEnergiesEU': non_renewables,
    'renewableEnergiesEU': renewables,
    'otherRenewableEnergiesEU': other_renewables,
    'energySourcesEU': ['renewables', 'nonrenewable', 'otherRenewables']
}

# Define the final Country databases
country_dbs = {
    'nonRenewableEnergiesCountry': non_renewables,
    'renewableEnergiesCountry': renewables,
    'otherRenewableEnergiesCountry': other_renewables,
    'energySourcesCountry': ['renewables', 'nonrenewable', 'otherRenewables']
}

In [440]:
# Function to load dataset from individual DB
def load_dataset_from_db(dataset, db_folder):
    db_path = f'{db_folder}/{dataset}_individualDB.db'
    conn = sqlite3.connect(db_path)
    data = pd.read_sql_query(f"SELECT * FROM {dataset}", conn)
    conn.close()
    return data

In [441]:
# Function to save data to DB
def save_to_db(data, table_name, db_path):
    conn = sqlite3.connect(db_path)
    data.to_sql(table_name, conn, if_exists='replace', index=False)
    conn.close()
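
In the two helpers above, `conn.close()` is skipped if the statement before it raises an exception. One possible alternative, sketched here with the standard library only, is `contextlib.closing`, which closes the connection when the block exits. Note that a `sqlite3` connection used directly in a `with` statement manages the transaction but does not close the connection. The table name `demo` is a placeholder.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute("CREATE TABLE demo (value REAL)")
        conn.execute("INSERT INTO demo VALUES (1.5)")
    row_count = conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]

# After the outer block the connection is closed, so further use fails.
try:
    conn.execute("SELECT 1")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(row_count, still_open)
```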

In [442]:
### 1) Row-wise Summation for EU Databases (Summing all rows) ###
def create_eu_aggregated_dbs():
    for db_name, datasets in eu_dbs.items():
        print(f"Creating {db_name} database...")
        db_path = f'{total_eu_folder}/{db_name}.db'  # Save in totalEU folder

        aggregated_data = pd.DataFrame()

        for dataset in datasets:
            if dataset == 'nonrenewable':
                # Sum all non-renewable datasets (coal, oil, etc.)
                nonrenewable_data = pd.DataFrame()
                for sub_dataset in non_renewables:
                    sub_data = load_dataset_from_db(sub_dataset, db_folder)
                    sub_data.iloc[:, 1:] = sub_data.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')

                    if nonrenewable_data.empty:
                        nonrenewable_data = sub_data
                    else:
                        nonrenewable_data.iloc[:, 1:] += sub_data.iloc[:, 1:]

                dataset_data = nonrenewable_data
            elif dataset == 'renewables':
                # Sum all renewable datasets (geothermal, hydro, etc.)
                renewables_data = pd.DataFrame()
                for sub_dataset in renewables:
                    sub_data = load_dataset_from_db(sub_dataset, db_folder)
                    sub_data.iloc[:, 1:] = sub_data.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')

                    if renewables_data.empty:
                        renewables_data = sub_data
                    else:
                        renewables_data.iloc[:, 1:] += sub_data.iloc[:, 1:]

                dataset_data = renewables_data
            else:
                # For other datasets like 'otherRenewables', handle normally
                dataset_data = load_dataset_from_db(dataset, db_folder)

            dataset_data.iloc[:, 1:] = dataset_data.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')

            # Sum rows across all countries (excluding Country column)
            summed_row = dataset_data.iloc[:, 1:].sum(axis=0).to_frame().T

            # Insert dataset name (renewables/nonrenewables) into 'Category'
            summed_row.insert(0, 'Category', dataset)

            # Ensure rounding after summation
            summed_row.iloc[:, 1:] = summed_row.iloc[:, 1:].round(2)

            if aggregated_data.empty:
                aggregated_data = summed_row
            else:
                aggregated_data = pd.concat([aggregated_data, summed_row], ignore_index=True)

        # Save aggregated data to a new database
        save_to_db(aggregated_data, db_name, db_path)
        print(f"{db_name} database created successfully!\n")
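
One detail of the accumulation with `+=` is worth checking against the intended result: adding a number to `NaN` gives `NaN`, so a country and month that is missing in any one source ends up missing in the combined table, and the later `.sum()` then skips it. The sketch below contrasts this with `DataFrame.add(..., fill_value=0)` on made-up numbers. Whether treating a missing value as zero is appropriate depends on the analysis, so this is an option to consider rather than a correction.

```python
import pandas as pd

# Made-up values: the second row is missing in the first source only.
coal = pd.DataFrame({"January 2016": [1.0, float("nan")]})
gas = pd.DataFrame({"January 2016": [2.0, 5.0]})

# Plain addition: NaN + 5.0 stays NaN.
plain = coal + gas

# add() with fill_value=0: a value missing on one side is treated as 0.
filled = coal.add(gas, fill_value=0)

print(plain["January 2016"].tolist())
print(filled["January 2016"].tolist())
```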

In [443]:
### 2) Column-wise Summation for Country Databases (Summing all columns) ###
def create_country_aggregated_dbs():
    for db_name, datasets in country_dbs.items():
        print(f"Creating {db_name} database...")
        db_path = f'{per_country_folder}/{db_name}.db'  # Save in perCountry folder

        aggregated_data = pd.DataFrame()

        for dataset in datasets:
            if dataset in ['nonrenewable', 'renewables']:
                # For nonrenewable and renewables, we dynamically sum the relevant sub-datasets
                sub_datasets = non_renewables if dataset == 'nonrenewable' else renewables
                aggregated_dataset = pd.DataFrame()
                for sub_dataset in sub_datasets:
                    data = load_dataset_from_db(sub_dataset, db_folder)
                    data.iloc[:, 1:] = data.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')

                    if aggregated_dataset.empty:
                        aggregated_dataset = data
                    else:
                        aggregated_dataset.iloc[:, 1:] += data.iloc[:, 1:]

                data = aggregated_dataset
            else:
                # Load regular datasets like 'otherRenewables'
                data = load_dataset_from_db(dataset, db_folder)

            data.iloc[:, 1:] = data.iloc[:, 1:].apply(pd.to_numeric, errors='coerce')

            # Sum columns per country (excluding the 'Country' column itself)
            summed_col = data.iloc[:, 1:].sum(axis=1).to_frame(name=dataset)  # Sum columns for each country

            # Round values to 2 decimal places after summing
            summed_col = summed_col.round(2)

            if aggregated_data.empty:
                aggregated_data = pd.concat([data[['Country']], summed_col], axis=1)  # Add 'Country' in the first pass
            else:
                aggregated_data = pd.concat([aggregated_data, summed_col], axis=1)

        # Save aggregated data to a new database
        save_to_db(aggregated_data, db_name, db_path)
        print(f"{db_name} database created successfully!\n")
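
The two aggregation functions differ mainly in the `axis` argument passed to `sum`. A minimal sketch on a made-up two-country table shows the two directions side by side; the column name `total` is chosen for this example only.

```python
import pandas as pd

# Made-up table with the same layout as the cleaned datasets.
data = pd.DataFrame({
    "Country": ["A", "B"],
    "January 2016": [1.0, 2.0],
    "February 2016": [3.0, 4.0],
})
values = data.iloc[:, 1:]  # everything except the 'Country' column

# axis=0 collapses the rows: one total per month (the EU-style aggregation).
per_month = values.sum(axis=0)

# axis=1 collapses the columns: one total per country.
per_country = pd.concat(
    [data[["Country"]], values.sum(axis=1).to_frame(name="total")],
    axis=1,
)

print(per_month)
print(per_country)
```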

In [444]:
# Execute the creation of both types of databases
create_eu_aggregated_dbs()
create_country_aggregated_dbs()

Creating nonRenewableEnergiesEU database...
nonRenewableEnergiesEU database created successfully!

Creating renewableEnergiesEU database...
renewableEnergiesEU database created successfully!

Creating otherRenewableEnergiesEU database...
otherRenewableEnergiesEU database created successfully!

Creating energySourcesEU database...
energySourcesEU database created successfully!

Creating nonRenewableEnergiesCountry database...
nonRenewableEnergiesCountry database created successfully!

Creating renewableEnergiesCountry database...
renewableEnergiesCountry database created successfully!

Creating otherRenewableEnergiesCountry database...
otherRenewableEnergiesCountry database created successfully!

Creating energySourcesCountry database...
energySourcesCountry database created successfully!



---

### Difference between a single database with many tables and multiple databases

![datasetDataset_Image](../images/section5/database_tables.png)
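
The practical difference between the two layouts shows up when data from two datasets is combined in one query. The sketch below is an illustration under stated assumptions: the file names follow the `{dataset}_individualDB.db` pattern used above, but the files are created in a temporary folder, the values are made up, and the column name `amount` and the alias `wind_db` are hypothetical. With one file per dataset, the second file has to be attached first; with one file holding several tables, a plain `JOIN` is enough.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

with tempfile.TemporaryDirectory() as folder:
    coal_path = os.path.join(folder, "coal_individualDB.db")
    wind_path = os.path.join(folder, "wind_individualDB.db")

    # Layout A: one database file per dataset, each holding a single table.
    for path, table, value in [(coal_path, "coal", 10.0), (wind_path, "wind", 4.0)]:
        with closing(sqlite3.connect(path)) as conn:
            with conn:
                conn.execute(f'CREATE TABLE {table} ("Country" TEXT, "amount" REAL)')
                conn.execute(f"INSERT INTO {table} VALUES ('A', ?)", (value,))

    # Combining two files in one query requires attaching the second database.
    with closing(sqlite3.connect(coal_path)) as conn:
        conn.execute("ATTACH DATABASE ? AS wind_db", (wind_path,))
        attached = conn.execute(
            "SELECT c.Country, c.amount + w.amount "
            "FROM coal AS c JOIN wind_db.wind AS w ON c.Country = w.Country"
        ).fetchall()

# Layout B: one database with one table per dataset.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute('CREATE TABLE coal ("Country" TEXT, "amount" REAL)')
    conn.execute('CREATE TABLE wind ("Country" TEXT, "amount" REAL)')
    conn.execute("INSERT INTO coal VALUES ('A', 10.0)")
    conn.execute("INSERT INTO wind VALUES ('A', 4.0)")
    single = conn.execute(
        "SELECT c.Country, c.amount + w.amount "
        "FROM coal AS c JOIN wind AS w ON c.Country = w.Country"
    ).fetchall()

print(attached)
print(single)
```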