# Validation of the PyPSA-Earth stats

## Description
This task aims to develop a notebook that:
- takes as input the `results/{scenarios}/stats.csv` files produced by pypsa-earth (see PR Create statistics #579). In the meantime, data is loaded from `notebooks/validation/temp_stats_csv/stats_merged.csv`
- loads open data on power systems across the world
- creates plots to perform the validation

Plots and tables shall have different aggregation levels (e.g. demand for a continent).
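
As a rough illustration of what an aggregation level could look like, the sketch below sums a country-level quantity per continent. The demand values and the `continent` lookup are made-up assumptions for illustration only; a real notebook would need a complete country-to-continent mapping.

```python
import pandas as pd

# Made-up demand per country in TWh (illustrative values only)
demand = pd.Series({"TG": 1.5, "CM": 6.0, "ZA": 200.0, "IR": 280.0}, name="demand")

# Hypothetical country-to-continent lookup (not part of pypsa-earth)
continent = {"TG": "Africa", "CM": "Africa", "ZA": "Africa", "IR": "Asia"}

# A dict passed to groupby maps each index label to its group
demand_by_continent = demand.groupby(continent).sum()
print(demand_by_continent)
```

The same pattern should carry over to a DataFrame with one column per technology, since `groupby` then sums every column per continent.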

Create statistics for:
- demand
- installed capacity by technology
- renewable sources
- network characteristics (length of lines for example)

Plots:
- Compare the statistics of the PyPSA-Earth model with open data

## Public data sources collection
These sources could be helpful:
- [ENTSO-E](https://transparency.entsoe.eu/generation/r2/installedGenerationCapacityAggregation/show)
- [IRENA](https://www.irena.org/data-and-statistics), not working
- [IEA](https://www.iea.org/data-and-statistics)
- [WEC](https://www.worldenergy.org/statistics/), not working
- [WRI](https://www.wri.org/resources/data-sets)
- [UN](https://unstats.un.org/unsd/snaama/)
- [WBG](https://datacatalog.worldbank.org/dataset/world-development-indicators)
- [OECD](https://data.oecd.org/)
- [Eurostat](https://ec.europa.eu/eurostat/data/database)
- [EIA](https://www.eia.gov/outlooks/aeo/data/browser/)
- [Enerdata](https://www.enerdata.net/research/)
- [BP](https://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html)
- [USAID](https://www.usaid.gov/what-we-do/energy/global-energy-database), single countries only? (e.g. https://www.usaid.gov/powerafrica/nigeria)


## Preparation

### Import packages

In [None]:
import logging
import os
import sys

import pypsa
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

logger = logging.getLogger(__name__)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_colwidth", 70)

### Set main directory to root folder

In [None]:
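# Add the pypsa-earth scripts folder to the import path (needed to import _helpers)
module_path = os.path.abspath(os.path.join("..", "..", ".."))
scripts_path = os.path.join(module_path, "pypsa-earth", "scripts")

# Check the path that is actually appended, so re-running the cell adds no duplicates
if scripts_path not in sys.path:
    sys.path.append(scripts_path)

from _helpers import sets_path_to_root, country_name_2_two_digits, two_digits_2_name_country

# Set the main directory to the root folder
sets_path_to_root("documentation")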

### Load stats data (obtained from pypsa-earth)

In [None]:
# Read the stats with multilevel (two-row) column names.
# TODO: are multilevel column names necessary?
stats = pd.read_csv("notebooks/validation/temp_stats_csv/stats_merged.csv", index_col=0, header=[0, 1])

In [None]:
stats.head()
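
Regarding the TODO on multilevel column names: with `header=[0, 1]` pandas builds a two-level column index, so selecting a first-level key (as done further down with `stats["add_electricity"]`) returns a sub-frame keyed by the second level. The minimal sketch below shows this on a made-up two-country CSV and also one way to flatten the levels. The layout, the `demand`/`total` labels and all numbers are illustrative assumptions, not the content of `stats_merged.csv`.

```python
import io

import pandas as pd

# Toy CSV with two header rows (made-up values, hypothetical layout)
csv_text = (
    ",add_electricity,add_electricity,demand\n"
    ",solar,onwind,total\n"
    "AA,1.0,2.0,10.0\n"
    "BB,3.0,,20.0\n"
)
toy = pd.read_csv(io.StringIO(csv_text), index_col=0, header=[0, 1])

# Selecting a first-level key returns a frame with single-level columns
caps = toy["add_electricity"]
print(caps.columns.tolist())

# Alternative: flatten both levels into plain string column names
flat = toy.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat.columns.tolist())
```

Whether the two levels are worth keeping probably depends on whether the same second-level names occur under several first-level keys. If they do, flattening needs a prefix as in the sketch.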

### Load public data

In [None]:
EXAMPLE_URL = "https://pxweb.irena.org/pxweb/en/IRENASTAT/IRENASTAT__Power%20Capacity%20and%20Generation/ELECCAP_2022_cycle2.px/"

In [99]:
# Read the data
irena_eleccap = pd.read_csv("notebooks/validation/temp_irena/ELECCAP_20230314-165057.csv", encoding="latin-1", skiprows=2)

# Replace ".." in the dataframe with NaN
irena_eleccap = irena_eleccap.replace("..", np.nan)

# Change dtype of the capacity column to float
capacity_col = "Installed electricity capacity by country/area (MW)"
irena_eleccap[capacity_col] = irena_eleccap[capacity_col].astype(float)

In [100]:
# Sum the capacities per country, year and technology
irena_eleccap = (
    irena_eleccap.groupby(["Country/area", "Year", "Technology"])
    .sum(numeric_only=True)
    .reset_index()
)

In [101]:
# Check data for a single country
irena_eleccap[irena_eleccap["Country/area"] == "Germany"].head(5)

,Country/area,Year,Technology,Installed electricity capacity by country/area (MW)
1482,Germany,2021,Biogas,7611.0
1483,Germany,2021,Coal and peat,0.0
1484,Germany,2021,Fossil fuels n.e.s.,78335.0
1485,Germany,2021,Geothermal energy,40.0
1486,Germany,2021,Liquid biofuels,231.0


## Validation

### Installed capacity by technology

In [89]:
# Define the technologies which should be compared
techs = ["CCGT", "OCGT", "nuclear", "oil", "onwind", "ror", "solar", "hydro"]

In [90]:
stats_capacities = stats["add_electricity"].loc[:, techs]

In [91]:
stats_capacities.head()

,CCGT,OCGT,nuclear,oil,onwind,ror,solar,hydro
IR,19758.292377,1568.594751,950.263323,34203.239284,279.222988,1562.931331,397.407953,9768.455414
TG,28.3,,,92.63355,0.0,11.830661,5.124544,64.764331
ZA,21.740341,153.625259,1813.10242,2140.576063,2606.828347,67.420576,5447.851077,592.356688
CM,200.0,,,51.38,0.0,542.752315,14.057588,71.082803
NE,,,,122.6,0.0,,27.033268,


#### Uniform technology names and combine datasets

In [None]:
# Show names of IRENA technologies
#irena_eleccap["Technology"].unique()

In [92]:
# Create dict to match the technology names of stats_capacities and irena_eleccap
names = {
    "Solar photovoltaic": "solar",
    "Onshore wind energy": "onwind",
    # "Offshore wind energy": "offwind",
    "Renewable hydropower": "hydro",
    "Nuclear": "nuclear",
    "Oil": "oil",
    "Natural gas": "CCGT",  # TODO All natural gas is CCGT, is that okay?
    "Mixed Hydro Plants": "ror",  # TODO Is this correct? Check IRENA
}

In [93]:
# Rename the technologies in irena_eleccap to match the names in stats_capacities using the dict names
irena_eleccap["Technology"] = irena_eleccap["Technology"].replace(names)

In [96]:
# Transform technologies to columns and use (country, year) as index
irena_eleccap = irena_eleccap.pivot_table(
    index=["Country/area", "Year"],
    columns="Technology",
    values="Installed electricity capacity by country/area (MW)",
)

In [97]:
irena_eleccap.head(10)

Country/area,Year,Biogas,CCGT,Coal and peat,Fossil fuels n.e.s.,Geothermal energy,Liquid biofuels,Marine energy,Offshore wind energy,Other non-renewable energy,Pumped storage,Renewable municipal waste,Solar thermal energy,Solid biofuels,hydro,nuclear,oil,onwind,ror,solar
Afghanistan,2021,0.0,40.0,0.0,97.477,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,340.493,0.0,139.184,0.4,0.0,30.745
Albania,2021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.425,0.0,1.425,0.0,0.0,2289.0,0.0,98.0,0.0,0.0,21.95
Algeria,2021,0.0,25116.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,25.0,0.0,228.0,0.0,359.0,10.0,0.0,423.0
American Samoa,2021,0.0,0.0,0.0,0.96,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,38.75,0.0,0.0,5.16
Andorra,2021,0.0,1.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,46.0,0.0,0.0,0.0,0.0,3.827
Angola,2021,0.0,1146.04,0.0,526.716,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,51.0,3729.279,0.0,464.1,0.0,0.0,13.377
Anguilla,2021,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.8,0.0,0.0,1.511
Antigua and Barbuda,2021,0.0,0.0,0.0,82.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,12.864
Argentina,2021,69.272,22872.871,0.0,0.0,0.0,0.0,0.0,0.0,0.0,974.0,0.0,0.0,219.145,10375.525,1755.0,2844.702,3292.124,0.0,1071.366
Armenia,2021,0.0,1826.6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1336.4,407.5,0.0,2.925,0.0,182.75


In [None]:
# TODO: Bring both dataframes to the same index and merge them
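
One possible way to carry out this step, sketched on toy data: convert the IRENA country names to the two-letter codes used as index in `stats_capacities`, keep a single year, and concatenate both tables side by side under a source label. The `name_to_code` dict is a hypothetical stand-in for a real name-to-code conversion (the imported `country_name_2_two_digits` helper might serve here, but its behaviour is not checked in this sketch), and all numbers are made up.

```python
import pandas as pd

# Toy stand-in for stats_capacities (index: two-letter country codes)
model = pd.DataFrame(
    {"solar": [5.0, 14.0], "hydro": [65.0, 71.0]},
    index=["TG", "CM"],
)

# Toy stand-in for the pivoted IRENA table (index: country name, year)
irena = pd.DataFrame(
    {"solar": [6.0, 15.0, 30.0], "hydro": [60.0, 70.0, 340.0]},
    index=pd.MultiIndex.from_tuples(
        [("Togo", 2021), ("Cameroon", 2021), ("Afghanistan", 2021)],
        names=["Country/area", "Year"],
    ),
)

# Hypothetical mapping; a real notebook would use a proper converter
name_to_code = {"Togo": "TG", "Cameroon": "CM", "Afghanistan": "AF"}

# Keep one year and replace country names by two-letter codes
irena_2021 = irena.xs(2021, level="Year").rename(index=name_to_code)

# Align on the shared countries and technologies, then label each source
common_idx = model.index.intersection(irena_2021.index)
common_cols = model.columns.intersection(irena_2021.columns)
comparison = pd.concat(
    {
        "pypsa-earth": model.loc[common_idx, common_cols],
        "irena": irena_2021.loc[common_idx, common_cols],
    },
    axis=1,
)
print(comparison)
```

Restricting both tables to the intersection drops countries that appear in only one source (here the made-up `AF` row). An outer join would keep them with NaN instead, which may be preferable for spotting coverage gaps.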


#### Plot
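
A minimal sketch of how such a comparison plot could look, assuming a table with one row per technology and one column per data source. The values below are made up and the labels are only suggestions, not an agreed layout for this notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up capacities in MW for one country (not real model or IRENA data)
comparison = pd.DataFrame(
    {"pypsa-earth": [5.0, 65.0, 0.0], "irena": [6.0, 60.0, 1.0]},
    index=["solar", "hydro", "onwind"],
)

# Grouped bars: one group per technology, one bar per data source
fig, ax = plt.subplots(figsize=(6, 3))
comparison.plot.bar(ax=ax)
ax.set_xlabel("Technology")
ax.set_ylabel("Installed capacity (MW)")
ax.set_title("Installed capacity: model vs. open data (toy values)")
fig.tight_layout()
```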

### Demand