# Introduction

This notebook is a step-by-step guide to extracting per-unit generation data from the SMARD data.

The SMARD power plant data is processed so that the generation of each unit can be addressed by its EIC ID.

## Preparation of Python environment

The following modules and their dependencies are required to run this notebook:
* Pandas (with openpyxl)
* Numpy
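
Before running the cells below, it can help to confirm that the required packages are importable. The following is a minimal sketch using only the standard library; the helper name `find_missing_modules` is an assumption made for illustration and is not part of the original notebook.

```python
from importlib.util import find_spec


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be imported."""
    return [name for name in module_names if find_spec(name) is None]


# Packages required by this notebook (openpyxl is needed by pandas.read_excel)
required = ["pandas", "numpy", "openpyxl"]
missing = find_missing_modules(required)
if missing:
    print(f"Missing packages: {', '.join(missing)}")
else:
    print("All required packages are available.")
```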

In [8]:
import pandas as pd
import numpy as np
from pandas import read_csv
from pandas import read_excel
import os


## Download of input files and setup of folder structure

First we need to download and prepare the raw primary data. For that we need two different datasets.

- We need the _blocks.xlsx_ data, which is a list of all power plants in Germany. It was collected by INATECH and is available via the repository, so it isn't necessary to download it separately.

- We also need the generation data from the SMARD website, which can be downloaded here:
https://www.smard.de/en/downloadcenter/download-power-plant-data/?downloadAttributes=%7B%22selectedPowerPlant%22:%22all%22,%22selectedContent%22:%22generation%22,%22from%22:1451602800000,%22to%22:1483225199999,%22selectedFileType%22:%22CSV%22%7D
It needs to be downloaded separately for each year from 2016 to 2022. In the filters, select the following options:
    - _All Power Plants_
    - _Content: Actual generation_
    - _01.01.Year - 31.12.Year_, where _Year_ is each year from 2016 to 2022
    - _Select resolution: original resolution_
    - _CSV_

Then download the data via the _Download File_ button. The data is downloaded as a zip archive. Extract the archive and put each csv file directly into the _input/smard_ folder (without a subfolder), so that in the end all csv files are located in _input/smard_.
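
The extraction step can also be scripted. The following is a minimal sketch, assuming the downloaded zip archives were saved to a single folder; the function name `extract_smard_zips` and the folder name `downloads` in the usage note are hypothetical and not part of the original notebook.

```python
import os
import zipfile


def extract_smard_zips(zip_folder, target_folder="input/smard"):
    """Extract all csv files from the zip archives in zip_folder into
    target_folder, dropping any subfolders inside the archives."""
    os.makedirs(target_folder, exist_ok=True)
    extracted = []
    for entry in sorted(os.listdir(zip_folder)):
        if not entry.lower().endswith(".zip"):
            continue
        with zipfile.ZipFile(os.path.join(zip_folder, entry)) as archive:
            for member in archive.namelist():
                # Skip directories and anything that is not a csv file
                if member.endswith("/") or not member.lower().endswith(".csv"):
                    continue
                # Keep only the base name so that no subfolder is created
                target_path = os.path.join(target_folder, os.path.basename(member))
                with archive.open(member) as source, open(target_path, "wb") as target:
                    target.write(source.read())
                extracted.append(target_path)
    return extracted
```

Calling `extract_smard_zips("downloads")` would then place every csv file directly in _input/smard_, which is the layout the following cells expect.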

In [9]:
# INITIAL SETUP

# data is currently provided for the years of 2016 to 2022
years = [2016, 2017, 2018, 2019, 2020, 2021, 2022]

# import main file for German powerplant data including EIC, ETS-ID, electrical and heat power, CHP
blocks = pd.read_excel('input/blocks.xlsx')

blocks.head(10)

Unnamed: 0,ETS-ID,EIC,Name BNA,City,Plant name,Block name,Filename,Column_Block_Smard,Column_Block_Smard_2
0,1399,11WD7BERG1S--A-X,Bergkamen A,Bergkamen,Bergkamen,A,Bergkamen,Generation_DE Bergkamen A [MW],
1,1453,11WD8BOXB1L---N8,Boxberg Block N,Boxberg,Boxberg,N,Boxberg,Generation_DE Boxberg Block N [MW],
2,1453,11WD8BOXB1L---P4,Boxberg Block P,Boxberg,Boxberg,P,Boxberg,Generation_DE Boxberg Block P [MW],
3,1454,11WD8BOXB1L---Q2,Boxberg Block Q,Boxberg,Boxberg,Q,Boxberg,Generation_DE Boxberg Block Q [MW],
4,1454,11WD8BOXB1L---R0,Boxberg Block R,Boxberg,Boxberg,R,Boxberg,Generation_DE Boxberg Block R [MW],
5,1460,11WD8LIPD1L---R6,Kraftwerk Lippendorf Block R,Boehlen,Lippendorf,R,Braunkohlekraftwerk_Lippendorf,Generation_DE Kraftwerk Lippendorf Block R [MW],
6,1460,11WD8LIPD1L---S4,Kraftwerk Lippendorf Block S,Boehlen,Lippendorf,S,Braunkohlekraftwerk_Lippendorf,Generation_DE Kraftwerk Lippendorf Block S [MW],
7,795,11WD2BURG000145R,Gasturbinenanlage 12,Burghausen,Burghausen,,Burghausen_GT,Generation_DE Gasturbinenanlage 12 [MW],
8,1419,11WD2BUSD0000386,Buschhaus,Helmstedt,Buschhaus,,Buschhaus,Generation_DE D [MW],
9,1409,11WD7HERD2G-H6-X,Cuno Heizkraftwerk Herdecke H6,Herdecke,Cuno HKW Herdecke,H6,Cuno_Heizkraftwerk_Herdecke,Generation_DE Cuno Heizkraftwerk Herdecke H6 [MW],


### Prepare the smard data
The SMARD data contains one file per power plant and year. We generate a DataFrame that maps each file to the correct plant name. This plant name is also present in the column _Filename_ in the _blocks.xlsx_ data.

In [10]:
# Folder which contains all smard data files
smard_folder = "./input/smard"
files_smard = os.listdir(smard_folder)

# Initialize columns for DataFrame, in which we save the year, the plant name and the filename
years_smard = []
plant_names_smard = []
filenames_smard = []
blocks_smard = []

# Loop through all filenames; files that are not csv files (e.g. the .gitignore) are skipped
for filename in files_smard:
    # Split the filename on underscores to get the information encoded in the filename
    filename_array = filename.split("_")
    
    # Check if the file is a csv file
    if not filename.endswith(".csv"):
        continue

    # Skip files whose plant name is not listed in the blocks data
    if "_".join(filename_array[:-4]) not in blocks["Filename"].values:
        continue

    # Save filename in the designated array
    filenames_smard.append(filename)
    
    # Get year and plant name and save them in the arrays
    years_smard.append(filename_array[-3][:4])
    plant_names_smard.append("_".join(filename_array[:-4]))

# Convert everything to a DataFrame
smard_files_df = pd.DataFrame({"year":years_smard, "plant name":plant_names_smard, "filename":filenames_smard}) 

smard_files_df.head(10)

Unnamed: 0,year,plant name,filename
0,2016,Bergkamen,Bergkamen_201601010000_201612312359_hour_6.csv
1,2017,Bergkamen,Bergkamen_201701010000_201712312359_stunde_6.csv
2,2018,Bergkamen,Bergkamen_201801010000_201812312359_stunde_6.csv
3,2019,Bergkamen,Bergkamen_201901010000_201912312359_stunde_6.csv
4,2020,Bergkamen,Bergkamen_202001010000_202012312359_stunde_6.csv
5,2021,Bergkamen,Bergkamen_202101010000_202112312359_hour_6.csv
6,2022,Bergkamen,Bergkamen_202201010000_202212312359_stunde_6.csv
7,2016,Boxberg,Boxberg_201601010000_201612312359_hour_8.csv
8,2017,Boxberg,Boxberg_201701010000_201712312359_stunde_8.csv
9,2018,Boxberg,Boxberg_201801010000_201812312359_stunde_8.csv
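
The filenames in the table follow the pattern `<plant>_<start>_<end>_<resolution>_<number>.csv`. As a minimal sketch of the parsing step, the hypothetical helper `parse_smard_filename` below splits such a name from the right, so that plant names containing underscores stay intact. It takes the year from the start timestamp, which is an assumption that holds for files covering a single calendar year. The second filename in the example is made up for illustration.

```python
import os


def parse_smard_filename(filename):
    """Split a SMARD filename into plant name and year.

    Assumes the pattern <plant>_<start>_<end>_<resolution>_<number>.csv,
    where <start> begins with the four-digit year.
    Returns None if the file does not match this pattern.
    """
    stem, extension = os.path.splitext(filename)
    if extension.lower() != ".csv":
        return None
    # Split from the right, because plant names may contain underscores
    parts = stem.rsplit("_", 4)
    if len(parts) != 5:
        return None
    plant_name, start, _end, _resolution, _number = parts
    return {"plant name": plant_name, "year": start[:4], "filename": filename}


print(parse_smard_filename("Bergkamen_201601010000_201612312359_hour_6.csv"))
# Hypothetical filename with an underscore in the plant name
print(parse_smard_filename(
    "Braunkohlekraftwerk_Lippendorf_201701010000_201712312359_stunde_6.csv"))
# Files that are not csv files yield None
print(parse_smard_filename(".gitignore"))
```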


First we get all unique ETS and EIC IDs from the blocks.xlsx list.

In [11]:
# GENERATION OF EIC AND ETS-ID LISTS

# list of all powerplant block EIC codes
eic_list = blocks['EIC'].unique()

# list of all powerplant location ETS-ID codes
ets_list = blocks['ETS-ID'].unique()
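
Several blocks can share one ETS-ID, which matters for the processing further down. The following sketch illustrates this on a small excerpt of the blocks table shown earlier; the variable names `blocks_example`, `ets_list_example` and `eic_per_ets` are assumptions for illustration only.

```python
import pandas as pd

# Small excerpt of the blocks table (values copied from the output shown earlier)
blocks_example = pd.DataFrame({
    "ETS-ID": [1399, 1453, 1453, 1454, 1454],
    "EIC": ["11WD7BERG1S--A-X", "11WD8BOXB1L---N8", "11WD8BOXB1L---P4",
            "11WD8BOXB1L---Q2", "11WD8BOXB1L---R0"],
})

# unique() removes duplicates and keeps the order of first appearance
ets_list_example = blocks_example["ETS-ID"].unique()
print(ets_list_example)

# Collect the EIC codes of all blocks that share an ETS-ID
eic_per_ets = blocks_example.groupby("ETS-ID")["EIC"].apply(list).to_dict()
print(eic_per_ets[1453])
```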

### Define helper functions for importing the SMARD data

Because the SMARD data needs to be preprocessed, we need two helper functions.

The first (read_csv_file) reads a csv file and handles errors that occur while loading it. This is needed because some csv files are not readable and would otherwise raise an error that stops the code.

The second function (rename_columns_smard) renames the columns of the csv files so that they match the column names stored in the _Column_Block_Smard_ column of the blocks data. To do so it removes the ending of each column name, because the ending differs from year to year (Originalauflösung or original resolution).

In [12]:
# Helper function used to handle errors when loading csv files
def read_csv_file(path, delimiter, thousands):
    try:
        df = read_csv(path, delimiter=delimiter, thousands=thousands)
        return df
    except Exception:
        # Return an error message instead of raising, so the calling loop can skip the file
        return f"Invalid File for path {path}"
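
As an alternative to returning an error string, the helper could return `None` and catch only the exceptions expected for missing, empty or malformed files. The following is a sketch of such a variant, not the notebook's own method; the name `read_csv_or_none` is hypothetical, and the loop further down would need to check for `None` instead of a string if this variant were used.

```python
import os
import tempfile

import pandas as pd


def read_csv_or_none(path, delimiter=";", thousands="."):
    """Hypothetical variant of read_csv_file: return a DataFrame,
    or None if the file is missing, empty or cannot be parsed."""
    try:
        return pd.read_csv(path, delimiter=delimiter, thousands=thousands)
    except (FileNotFoundError, pd.errors.EmptyDataError, pd.errors.ParserError) as error:
        print(f"Invalid file for path {path}: {error}")
        return None


# Demonstration with a temporary empty file and a path that does not exist
with tempfile.TemporaryDirectory() as folder:
    empty_path = os.path.join(folder, "empty.csv")
    open(empty_path, "w").close()
    empty_result = read_csv_or_none(empty_path)
    missing_result = read_csv_or_none(os.path.join(folder, "missing.csv"))

print(empty_result is None, missing_result is None)
```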

In [13]:
# Helper function to rename the columns in the smard datasets
def rename_columns_smard(column_name):
    column_name_split = column_name.split(" ")
    index_mw = column_name_split.index("[MW]")
    column_name_new = " ".join(column_name_split[:index_mw+1])
    return column_name_new
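
A short usage sketch of the renaming helper: the function is repeated here so the example is self-contained, and the two column names are hypothetical examples built from the suffixes mentioned in the text.

```python
def rename_columns_smard(column_name):
    # Same logic as the helper above: keep everything up to and including "[MW]"
    column_name_split = column_name.split(" ")
    index_mw = column_name_split.index("[MW]")
    return " ".join(column_name_split[:index_mw + 1])


# Hypothetical column names with the two suffixes mentioned in the text
german = "Generation_DE Bergkamen A [MW] Originalauflösung"
english = "Generation_DE Bergkamen A [MW] original resolution"

# Both names are reduced to the same name without the suffix
print(rename_columns_smard(german))
print(rename_columns_smard(english))
```

Note that the helper raises a `ValueError` if `[MW]` does not appear as a separate, space-delimited token, which is why it should only be applied to generation columns.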

Now we import the generation data from the SMARD csv files. To do that we loop over the plant names in _smard_files_df_ and, for each plant and year, map the generation columns of the csv file to the EIC IDs of the units in the blocks list. 
The DataFrame _smard_generation_data_df_ is set up with one row per EIC and one column per year, 
so that it can hold the yearly generation of each unit.
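
The core of the per-unit calculation can be illustrated on a few made-up values. The sketch below assumes hourly resolution (so that summing MW values yields MWh) and uses hypothetical column names and EIC codes. It replaces the `-` placeholder, converts the column to numbers, splits it equally between two units that share one column, and sums the result.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly values in MW as they might appear after reading a csv file;
# "-" marks a missing value
column_name = "Generation_DE Example Block [MW]"
smard_example = pd.DataFrame({column_name: ["100", "200", "-", "300"]})

# Hypothetical EIC codes of two units that share this single column
eic_ids_column = ["EIC-UNIT-1", "EIC-UNIT-2"]

# Replace the placeholder and convert the column to numbers
smard_example = smard_example.replace("-", np.nan)
smard_example[column_name] = pd.to_numeric(smard_example[column_name], errors="coerce")

# Give every unit an equal share of the values in the shared column
for eic_id in eic_ids_column:
    smard_example[eic_id] = smard_example[column_name] / len(eic_ids_column)

# With hourly resolution, summing the MW values gives MWh; sum() skips NaN values
yearly_generation = smard_example[eic_ids_column].sum()
print(yearly_generation)
```

The equal split is a simplification: it cannot reflect how the generation was actually distributed between the units.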

In [14]:
# Importing the generation data from the smard csv files
# Columns of the result DataFrame: the EIC and, for each year, a column with the generation per year
columns = ["EIC"] + [f'Generation elec. {y} [MWh_el]' for y in years]
smard_generation_data_df = pd.DataFrame([], columns=columns)
smard_generation_data_df["EIC"] = blocks["EIC"]

# Loop over the plant names of the smard files and process the smard data of each plant
for smard_file in smard_files_df["plant name"].unique():

    # Get all blocks belonging to this plant, matched via the filename
    blocks_smard_df = blocks[blocks["Filename"] == smard_file]

    # Check if DataFrame is empty
    if len(blocks_smard_df) == 0:
        continue

    # Get all ETS-IDs; mostly it's just one ID, but some plants have more than one
    ets_ids = blocks_smard_df["ETS-ID"].unique()

    # Get the Plant Name
    plant_name = blocks_smard_df["Plant name"].values[0]

    # Loop over years to get Generation Data for each year
    for year in years:

        # Use the plant name to get the correct csv filename for this plant and year
        filename = smard_files_df[(smard_files_df["plant name"] == smard_file) & (smard_files_df["year"] == str(year))]["filename"].values
        
        # Check if a filename was found; if not, print a warning and continue
        if len(filename) == 0: 
            bna_name = blocks_smard_df["Name BNA"].values
            print(f"WARNING: No CSV Filename found for {bna_name} in year {year}")
            continue
        filename = filename[0]

        # Read in the CSV data with the helper function defined above.
        # Some csv files don't contain data and would raise an error that stops
        # the code from running; the helper returns an error message instead
        smard_data_df = read_csv_file(smard_folder + "/" + filename, delimiter=";", thousands=".")

        if isinstance(smard_data_df, str):
            print(smard_data_df)
            continue

        # Rename Columns so they can be matched to the entries in the Blocks Dataset
        smard_data_df = smard_data_df.rename(lambda x: rename_columns_smard(x) if "[MW]" in x else x, axis=1)

        # Entries without a value are filled by SMARD with the character '-'
        # To handle these missing values they are replaced with NaN
        # (np.nan is used because replace() with None can forward-fill in older pandas versions)
        smard_data_df.replace("-", np.nan, inplace=True)

        # Loop over ETS-IDs to be able to separate the generation data by ETS-ID
        for ets_id in ets_ids:

            # Get the Blocks which have all the same ETS ID out of the Blocks
            # with the same filename
            blocks_ets_df = blocks_smard_df[blocks_smard_df["ETS-ID"] == ets_id]

            # Get the corresponding EIC IDs
            eic_ids = blocks_ets_df["EIC"].values

            # Get columns Names to be able to handle multiple EICs for one column
            column_names = blocks_ets_df["Column_Block_Smard"].unique()

            # Loop over all blocks belonging to the csv file which are defined in the blocks dataset
            # and have the same ETS-ID
            for column_name in column_names:

                # Get the rows and EIC IDs of the blocks mapped to this column name
                blocks_column_df = blocks_ets_df[blocks_ets_df["Column_Block_Smard"] == column_name]
                eic_ids_column = blocks_column_df["EIC"].values

                # Skip column names without a matching block (e.g. empty entries)
                if len(eic_ids_column) == 0:
                    continue

                # If the column does not exist in the csv file, try the second column name
                if column_name not in smard_data_df.columns:
                    column_name = blocks_column_df["Column_Block_Smard_2"].values[0]

                # If the second column name does not exist either, print a warning and continue
                if column_name not in smard_data_df.columns:
                    print(f"WARNING: No column found for {eic_ids_column} in {filename}")
                    continue

                # Change the datatype of the column to be able to use the pandas sum() method
                smard_data_df[column_name] = pd.to_numeric(smard_data_df[column_name], errors="coerce")

                # Check if one column represents multiple EIC IDs
                if len(eic_ids_column) > 1:

                    # Create one new column per EIC ID, named after the EIC ID.
                    # Each column gets an equal share of the values, calculated by dividing
                    # the values in the original column by the number of EIC IDs
                    for eic_id in eic_ids_column:
                        smard_data_df[eic_id] = smard_data_df[column_name] / len(eic_ids_column)
                else:
                    # If there is just one EIC ID for that column, rename the column to the EIC ID
                    eic_id = eic_ids_column[0]
                    smard_data_df.columns = [eic_id if x == column_name else x for x in smard_data_df.columns]

            # Get all Columns containing Generation Data but not belonging to the defined Units with EIC-ID
            column_drops = [column for column in smard_data_df.columns.values[3:] if column not in eic_ids]

            # Delete the columns not having the correct EIC-IDs
            smard_data_df_output = smard_data_df.drop(column_drops, axis=1).copy()

            # Rename the first 3 columns so they are named the same in every file
            new_column_names = ["date", "start", "end"]
            smard_data_df_output.columns = new_column_names + list(smard_data_df_output.columns[3:])

            # Set Index to the Date
            smard_data_df_output.set_index(["date", "start"], inplace=True)

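            # Save as CSV file with the name pattern {plant_name}_{ets_id}_{year}.csv
            # (the output folder is created first if it does not exist yet)
            os.makedirs("output", exist_ok=True)
            smard_data_df_output.to_csv(f"output/{plant_name}_{ets_id}_{year}.csv")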