# Spanish Electricity Market Analysis

## 1. Dataset import and data cleaning

### 1.1. Libraries import

In [37]:
import pandas as pd
import seaborn as sns
import os
import matplotlib as mpl
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

sns.set()

### 1.2. Reading .csv file for REE Spanish Electricity Market Data

In [2]:
os.getcwd()  # get the current working directory

'/home/ingrid/Documents/DA_Ironhack/Week5/Project-Week-5-Your-Own-Project/your-project'

In [15]:
raw_data = pd.read_csv('datasets/spain_energy_market.csv')  # read the CSV file with the aggregated data from 2014-2018

In [16]:
raw_data.head()

Unnamed: 0,datetime,id,name,geoid,geoname,value
0,2014-01-01 23:00:00,600,Precio mercado SPOT Diario ESP,3.0,España,25.280833
1,2014-01-02 23:00:00,600,Precio mercado SPOT Diario ESP,3.0,España,39.924167
2,2014-01-03 23:00:00,600,Precio mercado SPOT Diario ESP,3.0,España,4.992083
3,2014-01-04 23:00:00,600,Precio mercado SPOT Diario ESP,3.0,España,4.091667
4,2014-01-05 23:00:00,600,Precio mercado SPOT Diario ESP,3.0,España,13.5875


In [18]:
raw_data.shape #how many rows and columns this df has 

(40212, 6)

In [17]:
raw_data.dtypes #checking the types of the dataframe. 

datetime     object
id            int64
name         object
geoid       float64
geoname      object
value       float64
dtype: object

In [6]:
raw_data['datetime'] = pd.to_datetime(raw_data['datetime'])  # the datetime column is imported as a string, so we convert it

In [7]:
raw_data.dtypes

datetime    datetime64[ns]
id                   int64
name                object
geoid              float64
geoname             object
value              float64
dtype: object
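
As a side note, `pd.to_datetime` raises an error when it meets a value it cannot parse. The sketch below uses a made-up toy column (not the real dataset) to show how `errors='coerce'` turns unparseable entries into `NaT`, so they can be counted before going further. The names `toy` and `n_bad` are illustrative only.

```python
import pandas as pd

# Toy column with one deliberately invalid entry (illustrative values, not from the dataset)
toy = pd.DataFrame({'datetime': ['2014-01-01 23:00:00', 'not a date', '2014-01-03 23:00:00']})

# errors='coerce' replaces unparseable strings with NaT instead of raising
toy['datetime'] = pd.to_datetime(toy['datetime'], errors='coerce')

# Count the rows that failed to parse
n_bad = int(toy['datetime'].isna().sum())
print(toy.dtypes)
print(f'unparseable rows: {n_bad}')
```

If the count is zero, the plain conversion used above is enough.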

In [8]:
raw_data.name.unique()

array(['Precio mercado SPOT Diario ESP', 'Precio mercado SPOT Diario FRA',
       'Precio mercado SPOT Diario POR',
       'Energía asignada en Mercado SPOT Diario España',
       'Energía asignada en Mercado SPOT Diario Francia', nan,
       'Rentas de congestión mecanismos implícitos diario Francia exportación',
       'Rentas de congestión mecanismos implícitos diario Portugal exportación',
       'Rentas de congestión mecanismos implícitos diario Francia importación',
       'Rentas de congestión mecanismos implícitos diario Portugal importación',
       'Demanda real', 'Demanda programada PBF total',
       'Generación programada PBF total',
       'Generación programada PBF Eólica',
       'Generación programada PBF Ciclo combinado',
       'Generación programada PBF Carbón',
       'Generación programada PBF Nuclear',
       'Generación programada PBF Gas Natural Cogeneración',
       'Generación programada PBF UGH + no UGH',
       'Generación programada PBF Solar fotovoltaica'

From the names above we can see that the dataset mixes several kinds of information. In this table we can find data from different sources:

- Spot, or day-ahead market (DAM), prices for Spain, France and Portugal
- Energy allocated in the daily spot market for Spain and France
- Congestion rents from the implicit daily mechanisms on the interconnections with France and Portugal (imports and exports)
- Real electricity demand
- Scheduled demand and generation from the baseline schedule (PBF, Programa Base de Funcionamiento)
- Scheduled generation per technology

Hence, we should split the full table into sub-tables, one per indicator, so that we can analyse each of them according to our objectives.
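
One possible way to do this split in a single step is to group the long-format table by `name`. This is a minimal sketch on a made-up toy frame, not the approach the notebook follows below (which filters one indicator at a time). The names `toy` and `subtables` are hypothetical.

```python
import pandas as pd

# Toy long-format frame mimicking the structure of raw_data (values are made up)
toy = pd.DataFrame({
    'name': ['Demanda real', 'Demanda real', 'Generación programada PBF total', None],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# One sub-table per indicator name; by default groupby skips rows whose key is missing
subtables = {name: group.reset_index(drop=True) for name, group in toy.groupby('name')}

print(list(subtables))
print(subtables['Demanda real'])
```

Note that the row with a missing `name` does not end up in any sub-table, which matters here because the `unique()` output above contains `nan`.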

### 1.3. Total scheduled electricity generation (per day)

In [30]:
PBF_total_df = raw_data[raw_data.name == 'Generación programada PBF total']

In [23]:
PBF_total_df.head()

Unnamed: 0,datetime,id,name,geoid,geoname,value
23819,2014-01-01 23:00:00,10258,Generación programada PBF total,,,642771.8
23820,2014-01-02 23:00:00,10258,Generación programada PBF total,,,658078.5
23821,2014-01-03 23:00:00,10258,Generación programada PBF total,,,680564.6
23822,2014-01-04 23:00:00,10258,Generación programada PBF total,,,644494.7
23823,2014-01-05 23:00:00,10258,Generación programada PBF total,,,598661.4


From this dataframe we can see that, since we are working with the *total* scheduled electricity for Spain on each day, the geoid, geoname, name and id columns carry no additional information and can be dropped. We will also reset the index.
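
Before dropping columns, a quick check can confirm that they really are empty or constant. The sketch below rebuilds two rows from the output above as a toy frame; the names `toy`, `n_unique` and `redundant` are hypothetical.

```python
import pandas as pd

# Two rows shaped like the filtered dataframe (values copied from the head() output above)
toy = pd.DataFrame({
    'datetime': ['2014-01-01 23:00:00', '2014-01-02 23:00:00'],
    'id': [10258, 10258],
    'name': ['Generación programada PBF total'] * 2,
    'geoid': [float('nan')] * 2,
    'geoname': [None, None],
    'value': [642771.8, 658078.5],
})

# A column is redundant here if it is entirely missing (0 distinct values) or constant (1 distinct value)
n_unique = toy.nunique(dropna=True)
redundant = n_unique[n_unique <= 1].index.tolist()
print(redundant)
```

Any column that appears in `redundant` can be removed without losing information for this sub-table.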

In [33]:
PBF_total_df = PBF_total_df[['datetime','value']].reset_index(drop=True)

In [36]:
PBF_total_df.head()

Unnamed: 0,datetime,value
0,2014-01-01 23:00:00,642771.8
1,2014-01-02 23:00:00,658078.5
2,2014-01-03 23:00:00,680564.6
3,2014-01-04 23:00:00,644494.7
4,2014-01-05 23:00:00,598661.4


This table now holds the total energy scheduled for each day, as calculated the day before. We can export this table to SQL.

In [40]:
# MySQL Workbench database credentials
driver = 'mysql+pymysql'
user = 'root'
password = 'iMc91linux'
ip = 'localhost'
database = 'REE_analysis'


# Connection string for the MySQL Workbench database
connection_string = f'{driver}://{user}:{password}@{ip}/{database}'

# Engine creation
engine = create_engine(connection_string)

# Upload the PBF REE data to the MySQL database
PBF_total_df.to_sql('PBF_total', engine)
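
Hard-coding the password in the notebook exposes it to anyone who reads the file. A minimal sketch of one alternative is to read the credentials from environment variables and escape the password before building the URL. The variable names (`REE_DB_USER`, `REE_DB_PASSWORD`, `REE_DB_HOST`, `REE_DB_NAME`) and the helper `build_connection_string` are hypothetical, not part of the original project.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(env=os.environ):
    """Build a SQLAlchemy-style MySQL URL from a mapping of settings.

    The key names used here (REE_DB_*) are assumptions made for this sketch.
    """
    user = env.get('REE_DB_USER', 'root')
    # quote_plus escapes characters such as '@' or '/' that would break the URL
    password = quote_plus(env['REE_DB_PASSWORD'])
    host = env.get('REE_DB_HOST', 'localhost')
    database = env.get('REE_DB_NAME', 'REE_analysis')
    return f'mysql+pymysql://{user}:{password}@{host}/{database}'


# Example with a plain dict standing in for the real environment
example = build_connection_string({'REE_DB_PASSWORD': 'p@ss/word'})
print(example)
```

The resulting string can be passed to `create_engine` in the same way as `connection_string` above.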

### 1.4. Importing geolocation tables

In [10]:
geo_id = pd.read_excel('datasets/geolocalizaciones.xlsx')

In [13]:
geo_id.head()

Unnamed: 0,GEO_ID,NIVEL,NAME
0,1,0,Portugal
1,2,0,Francia
2,3,0,España
3,8739,0,Andorra
4,8740,0,Marruecos


As we can see above, the dataset contains a geographic id that tells us which generation unit has been turned on to produce the expected electricity for the time period $t_i$. This table has three columns:

- *GEO_ID*: the unique id of each location
- *NIVEL*: the level of each location. It is an integer from 0 to 6 with the following meaning:
    - 0: Country
    - 1: Areas in Spain (Peninsula, Canary Islands, Balearic Islands, Ceuta, Melilla)
    - 2: Autonomous communities
    - 3: Provinces
    - 4: Province capital
    - 5: Municipality
    - 6: Hydrological basin
- *NAME*: name of the location identified by GEO_ID
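
The level codes listed above can be attached to the table as readable labels. The sketch below rebuilds the first rows of the geolocation table shown earlier; `NIVEL_LABELS`, `geo_sample` and the `NIVEL_LABEL` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Level codes as described in the list above
NIVEL_LABELS = {
    0: 'Country',
    1: 'Area in Spain',
    2: 'Autonomous community',
    3: 'Province',
    4: 'Province capital',
    5: 'Municipality',
    6: 'Hydrological basin',
}

# First rows of the geolocation table, copied from the head() output above
geo_sample = pd.DataFrame({
    'GEO_ID': [1, 2, 3, 8739, 8740],
    'NIVEL': [0, 0, 0, 0, 0],
    'NAME': ['Portugal', 'Francia', 'España', 'Andorra', 'Marruecos'],
})

# Map each numeric level to its label; codes missing from the dict would become NaN
geo_sample['NIVEL_LABEL'] = geo_sample['NIVEL'].map(NIVEL_LABELS)
print(geo_sample)
```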

with the tables we are working with, we don't have 