# Assignment I

## Data extraction

### Utility functions

Import the required libraries.

In [47]:
import pandas as pd
import requests, json

This function retrieves the raw data frame for a dataset given its Eurostat code. Set `gziped` to `True` when the endpoint returns a gzip-compressed file.

In [48]:
def get_raw_data_frame( key, gziped ): 
    
    # url_template = 'https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/%s$DEFAULTVIEW/?format=TSV&compressed=false' 
    
    url_template =  'https://ec.europa.eu/eurostat/databrowser-backend/api/extraction/1.0/LIVE/false/tsv/%s?i'

    url = url_template % key

    return pd.read_table( url, compression = 'gzip' ) if gziped else  pd.read_table(  url )
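
To make the parsing step concrete, here is a minimal sketch that reads a small in-memory sample instead of calling the Eurostat endpoint. The sample layout (one comma-separated key column followed by tab-separated year columns with trailing spaces) and the dimension names in the header are assumptions about the TSV format, and the rows are illustrative.

```python
import io

import pandas as pd

# Assumed shape of a Eurostat TSV extract: the first column packs all
# dimension codes into one comma-separated key, the rest are time periods.
sample_tsv = (
    "freq,nrg_prc,nrg_cons,currency,unit,geo\\TIME_PERIOD\t2017 \t2018 \n"
    "A,NRG_SUP,TOT_GJ,EUR,KWH,AT\t0.0299 \t0.0304 \n"
    "A,NRG_SUP,TOT_GJ,EUR,KWH,EL\t: \t0.0311 \n"
)

# read_table splits on tabs by default, just as it does for a URL.
raw = pd.read_table(io.StringIO(sample_tsv))

print(raw.shape)
print(raw.columns.tolist())
```

The padded column names and the `:` placeholder are why the later trimming and numeric-cleaning steps are needed.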


Extract the country code from the first column of the dataframe. That column holds a comma-separated key, and the country is its last field.

In [49]:
def extract_country( data ):
    
    new_data = data.rename( columns={data.iloc[:, 0].name :'country'} )

    new_data['country'] = new_data['country'].str.replace(r'^.*,(.*)$', r'\1', regex=True)
    
    return new_data
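
As a quick illustration of the regular expression, the sketch below applies the same pattern to two sample keys. The keys are illustrative and follow the filter strings used later in this notebook.

```python
import pandas as pd

keys = pd.Series([
    "A,NRG_SUP,TOT_GJ,EUR,KWH,AT",
    "A,NRG_SUP,TOT_GJ,EUR,KWH,EL",
])

# The greedy `^.*,` consumes everything up to the last comma,
# so the captured group is the final field of the key.
codes = keys.str.replace(r'^.*,(.*)$', r'\1', regex=True)

print(codes.tolist())  # ['AT', 'EL']
```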


Strip whitespace from the column names.

In [50]:
def trim_column_names( data ):

    # Strip leading and trailing whitespace from every column name
    return data.rename( columns=lambda col: col.strip() )
    

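Clean and convert all numeric columns: unavailable values become missing, flag letters and surrounding spaces are removed, and the result is cast to a numeric type.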

In [51]:
def clean_numeric_columns( data ):

    data.iloc[:, 1:] = data.iloc[:, 1:].replace(
        r'^.*[:].*$', None, regex=True    # Not available (":") and confidential flag
    ).replace(
        r'e', '', regex=True              # Remove the "estimated" flag
    ).replace(
        r'd', '', regex=True              # Remove the "definition differs" flag
    ).replace(
        r'^(.+?) +$', r'\1', regex=True   # Right trim (lazy group drops every trailing space)
    ).replace(
        r'^ +(.+)$', r'\1', regex=True    # Left trim
    )

    # Convert every column except the first one to a numeric dtype
    for col in data.iloc[:, 1:].columns:
        data[col] = pd.to_numeric( data[col] )

    return data
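
The cleaning can also be sketched on a single column. The snippet below is an alternative technique, not the function above: it pulls the numeric part out of each cell with `Series.str.extract` and lets everything else become `NaN`. The sample cells are illustrative.

```python
import pandas as pd

# Illustrative raw cells: a missing marker, flagged values, padded values.
raw = pd.Series([": ", "0.0299 e", "0.0170 d", " 0.0304 ", ": c"])

# Capture an optional sign, digits, and an optional decimal part.
# Cells without a number (such as ":") yield NaN.
number_text = raw.str.extract(r'(-?\d+(?:\.\d+)?)', expand=False)

numeric = pd.to_numeric(number_text, errors="coerce")

print(numeric)
```

Extracting the number avoids listing every flag letter, at the cost of silently dropping any flag the analysis might want to keep.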
    
    

Build a function that keeps only the rows whose first column contains the given filter string.

In [52]:
def filter_data( filter ): 
    def _filter_data( data ):
        new_data = data[data.iloc[:, 0].str.contains( filter )]    
        new_data.reset_index(inplace = True, drop = True)
        return new_data
    return _filter_data
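
A small sketch of the same closure pattern on an in-memory dataframe follows. The helper name `keep_rows`, the `OTHER` component code, and the values are hypothetical. It passes `regex=False` so the filter is matched literally, which is an assumption about the intended behaviour.

```python
import pandas as pd

df = pd.DataFrame({
    "key": [
        "A,NRG_SUP,TOT_GJ,EUR,KWH,AT",
        "A,OTHER,TOT_GJ,EUR,KWH,AT",   # hypothetical component code
        "A,NRG_SUP,TOT_GJ,EUR,KWH,BE",
    ],
    "2021 ": ["0.0316 ", "0.0100 ", "0.0315 "],  # illustrative values
})

def keep_rows(pattern):
    """Return a function that filters a dataframe on its first column."""
    def _apply(data):
        mask = data.iloc[:, 0].str.contains(pattern, regex=False)
        return data[mask].reset_index(drop=True)
    return _apply

result = keep_rows("A,NRG_SUP,TOT_GJ,EUR,KWH")(df)
print(result)
```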
    

Fetch the country master list, which maps each country identifier to its name.

In [53]:
def get_country_names():
    response = requests.get("https://ec.europa.eu/eurostat/databrowser-backend/api/codelist/LIVE/GEO/getCodeListJson/9.0/ESTAT/en/false")
    data = json.loads(response.text)
    return data['category']['label']
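
The lookup relies on the response exposing a `category.label` object keyed by country code. The sketch below parses a hand-written payload with that shape, without any network call. The payload is an assumption that mirrors only the two keys the function reads.

```python
import json

# Hand-written stand-in for the code list response (assumed shape).
payload = '{"category": {"label": {"AT": "Austria", "BE": "Belgium"}}}'

labels = json.loads(payload)["category"]["label"]

print(labels["AT"])  # Austria
```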


Add a column with the country names.

In [54]:
def add_column_country_name(data):    
    country_names = get_country_names()
    data.insert(
        1, 
        'country_name', 
        data.country.map(lambda v: country_names[v] ), 
        True
    )
    return data
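
The lambda lookup raises `KeyError` when a code is missing from the master list. As a hedged alternative, the sketch below passes the dictionary directly to `Series.map`, which leaves unknown codes as `NaN`. The code `XX` and the values are made up for illustration.

```python
import pandas as pd

names = {"AT": "Austria", "BE": "Belgium"}

df = pd.DataFrame({
    "country": ["AT", "BE", "XX"],      # "XX" is a hypothetical unknown code
    "2021": [0.0316, 0.0315, None],
})

# Mapping with a dict yields NaN for codes that have no entry.
df.insert(1, "country_name", df["country"].map(names))

print(df)
```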
    

Compose all the steps into a single extraction and cleaning process.

In [55]:
class Compose:
    _f = None
    def __init__(self, f):
        self._f = f
    def andThen( self, g ):
        return Compose( lambda s: ( g( self._f(s) ) ) )
    def apply(self, a): 
        return self._f( a )

def flow( filter ) :
    return Compose( 
        filter_data( filter )
    ).andThen(
        extract_country
    ).andThen(
        trim_column_names 
    ).andThen( 
        clean_numeric_columns 
    ).andThen( 
        add_column_country_name
    )

def dataframe_by_key( key, filter, gziped = False ):
    return flow(filter).apply(  get_raw_data_frame( key, gziped ) )
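
The `andThen` chain is left-to-right function composition. A compact sketch of the same idea with `functools.reduce` is shown below. The helper name `pipeline` is hypothetical and is not part of this notebook.

```python
from functools import reduce

def pipeline(*steps):
    """Return a function that feeds its argument through each step in order."""
    return lambda value: reduce(lambda acc, step: step(acc), steps, value)

# Two toy steps stand in for the dataframe transformations.
increment_then_double = pipeline(lambda x: x + 1, lambda x: x * 2)

print(increment_then_double(3))  # 8
```

With dataframes, each step would be one of the cleaning functions, applied in the order in which `flow` lists them.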


Export the dataframe to a CSV file.

In [56]:
def export_dataframe( df, file_name, directory ):
    file = '/home/jovyan/work/%s/%s.csv' % (directory, file_name )
    df.to_csv(file )
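
A variant of the export is sketched below, assuming a configurable base directory rather than the fixed `/home/jovyan/work` path. The helper name `export_to` and its parameters are hypothetical. Passing `index=False` keeps the row index out of the file, so reading it back does not produce an extra `Unnamed: 0` column.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_to(df, base_dir, directory, file_name):
    """Write df to <base_dir>/<directory>/<file_name>.csv and return the path."""
    target_dir = Path(base_dir) / directory
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{file_name}.csv"
    df.to_csv(target, index=False)
    return target

with tempfile.TemporaryDirectory() as tmp:
    sample = pd.DataFrame({"country": ["AT"], "2021": [0.0316]})
    path = export_to(sample, tmp, "subdataset", "demo")
    round_trip = pd.read_csv(path)

print(round_trip.columns.tolist())
```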
    

### _DATASET I_: Household gas price in € per kWh

Source: [Gas prices components for household consumers - annual data](https://ec.europa.eu/eurostat/databrowser/view/nrg_pc_202_c/default/table?lang=en)

Dataset identification key: **`NRG_PC_202_C`**

The data are filtered by:

 - Annual data
 - Energy price component: _"Energy and supply"_
 - Energy consumption: all bands, in gigajoules
 - Currency: euro (€)
 - Unit of measure: kilowatt-hour

In [57]:
data_gas_prices_household_consumers = dataframe_by_key( 
    key    = 'NRG_PC_202_C', 
    filter = 'A,NRG_SUP,TOT_GJ,EUR,KWH'
) 
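
One way to read the filter string is as one code per dimension, in the order of the list above. The sketch below pairs each code with a label. The dimension names are assumed labels chosen for illustration, not names confirmed by the dataset.

```python
# Assumed dimension labels, in the order the filter string uses them.
dimensions = ["freq", "nrg_prc", "nrg_cons", "currency", "unit"]

filter_key = "A,NRG_SUP,TOT_GJ,EUR,KWH"

selection = dict(zip(dimensions, filter_key.split(",")))

for name, code in selection.items():
    print(f"{name}: {code}")
```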

Dataset columns:

In [58]:
display( data_gas_prices_household_consumers.dtypes )

country          object
country_name     object
2017            float64
2018            float64
2019            float64
2020            float64
2021            float64
dtype: object

Sample values:

In [59]:
data_gas_prices_household_consumers

Unnamed: 0,country,country_name,2017,2018,2019,2020,2021
0,AT,Austria,0.0299,0.0304,0.0312,0.0308,0.0316
1,BA,Bosnia and Herzegovina,0.024,0.024,0.0249,0.0258,
2,BE,Belgium,0.0283,0.0288,0.0289,0.0252,0.0315
3,BG,Bulgaria,0.017,0.0209,0.024,0.0177,0.0331
4,CZ,Czechia,0.036,0.039,0.0455,0.0431,0.0448
5,DE,Germany (until 1990 former territory of the FRG),,,0.0278,0.0292,0.0293
6,DK,Denmark,0.0234,0.0259,0.0209,0.016,0.0415
7,EA,"Euro area (EA11-1999, EA12-2001, EA13-2007, EA...",0.0295,0.0303,0.0319,0.0302,
8,EE,Estonia,0.0234,0.0239,0.0253,0.024,0.0361
9,EL,Greece,,0.0311,0.0338,0.0258,


Export the dataframe

In [60]:
export_dataframe( 
    df = data_gas_prices_household_consumers, 
    file_name = 'data_gas_prices_household_consumers', 
    directory = 'subdataset' 
) 

### _DATASET II_: Non-household gas price in € per kWh

Source: [Gas prices components for non-household consumers - annual data](https://ec.europa.eu/eurostat/databrowser/view/nrg_pc_203_c/default/table?lang=en)

Dataset identification key: **`NRG_PC_203_C`**

The data are filtered by:

 - Annual data
 - Energy price component: _"Energy and supply"_
 - Energy consumption: all bands, in gigajoules
 - Currency: euro (€)
 - Unit of measure: kilowatt-hour

In [61]:
data_gas_prices_no_household_consumers = dataframe_by_key( 
    key    = 'NRG_PC_203_C', 
    filter = 'A,NRG_SUP,TOT_GJ,EUR,KWH'
) 

Dataset columns:

In [62]:
display( data_gas_prices_no_household_consumers.dtypes )

country          object
country_name     object
2017            float64
2018            float64
2019            float64
2020            float64
2021            float64
dtype: object

Sample values:

In [63]:
data_gas_prices_no_household_consumers

Unnamed: 0,country,country_name,2017,2018,2019,2020,2021
0,AT,Austria,,,0.0184,0.0168,0.0297
1,BA,Bosnia and Herzegovina,,,0.0257,0.0259,0.0248
2,BE,Belgium,,,0.0189,0.0148,0.0318
3,BG,Bulgaria,,,0.0213,0.0142,0.0299
4,CZ,Czechia,,,0.0226,0.0192,
5,DE,Germany (until 1990 former territory of the FRG),,,0.0196,0.0171,0.0262
6,DK,Denmark,0.0194,0.0234,0.0178,0.0137,0.0448
7,EA,"Euro area (EA11-1999, EA12-2001, EA13-2007, EA...",0.022,0.024,0.0211,0.0175,
8,EE,Estonia,,,0.0213,0.0155,0.0352
9,EL,Greece,,,0.026,0.0165,


Export the dataframe

In [64]:
export_dataframe( 
    df = data_gas_prices_no_household_consumers, 
    file_name = 'data_gas_prices_no_household_consumers', 
    directory = 'subdataset' 
) 

### _DATASET III_: Household electricity price for the 2,500 to 4,999 kWh band

Source: [Electricity prices components for household consumers - annual data (from 2007 onwards)](https://ec.europa.eu/eurostat/databrowser/view/NRG_PC_204_C__custom_2388428/default/table?lang=en)

Dataset identification key: **`NRG_PC_204_C__custom_2388428`**

The data are filtered by:

 - Annual data
 - Energy consumption: between 2,500 kWh and 4,999 kWh
 - Energy price component: _"Energy and supply"_
 - Currency: euro (€)

In [65]:
data_electricity_prices_household_consumers = dataframe_by_key( 
    key    = 'NRG_PC_204_C__custom_2388428', 
    filter = 'A,KWH2500-4999,NRG_SUP,EUR', 
    gziped = True 
) 

Dataset columns:

In [66]:
display( data_electricity_prices_household_consumers.dtypes )

country          object
country_name     object
2012-S2         float64
2013-S2         float64
2014-S2         float64
2015-S2         float64
2016-S2         float64
2017            float64
2018            float64
2019            float64
2020            float64
2021            float64
dtype: object

Sample values:

In [67]:
data_electricity_prices_household_consumers

Unnamed: 0,country,country_name,2012-S2,2013-S2,2014-S2,2015-S2,2016-S2,2017,2018,2019,2020,2021
0,AL,Albania,,,,,,0.0713,0.0759,0.0778,,
1,AT,Austria,,,,,,0.0613,0.0623,0.0687,0.0732,0.0745
2,BA,Bosnia and Herzegovina,,,,,,0.0342,0.0338,0.0361,0.0365,
3,BE,Belgium,,,,,,0.0735,0.0808,0.0859,0.0786,0.0844
4,BG,Bulgaria,,,,,,0.0575,0.0585,0.0558,0.056,0.0608
5,CY,Cyprus,,,,,,0.1036,0.1157,0.1241,0.1042,0.1094
6,CZ,Czechia,,,,,,0.0541,0.057,0.069,0.0749,0.0979
7,DE,Germany (until 1990 former territory of the FRG),,,,,,0.0686,0.0622,0.0581,0.0574,0.0803
8,DK,Denmark,,,,,,0.0388,0.0503,0.0539,0.0409,0.0747
9,EA,"Euro area (EA11-1999, EA12-2001, EA13-2007, EA...",,,,,,0.076,0.0801,0.0727,0.0697,
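
The half-yearly `-S2` columns are empty in the rows shown. If they turn out to be empty for every row, which the sample alone does not confirm, they could be dropped. The sketch below shows one way to do that on a small illustrative frame.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["AT", "BE"],
    "2016-S2": [float("nan"), float("nan")],  # column with no values at all
    "2017": [0.0613, 0.0735],
})

# Drop only the columns in which every value is missing.
trimmed = df.dropna(axis="columns", how="all")

print(trimmed.columns.tolist())  # ['country', '2017']
```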


Export the dataframe

In [68]:
export_dataframe( 
    df = data_electricity_prices_household_consumers, 
    file_name = 'data_electricity_prices_household_consumers', 
    directory = 'subdataset' 
) 

### _DATASET IV_: Non-household electricity price

Source: [Electricity prices components for non-household consumers - annual data (from 2007 onwards)](https://ec.europa.eu/eurostat/databrowser/view/nrg_pc_205_c/default/table?lang=en)

Dataset identification key: **`NRG_PC_205_C`**

The data are filtered by:

 - Annual data
 - Energy price component: _"Energy and supply"_
 - Energy consumption: less than 20 MWh
 - Currency: euro (€)

In [69]:
data_electricity_prices_no_household_consumers = dataframe_by_key(
    key    = 'NRG_PC_205_C', 
    filter = 'A,NRG_SUP,MWH_LT20,EUR'
) 

Dataset columns:

In [70]:
display( data_electricity_prices_no_household_consumers.dtypes )

country          object
country_name     object
2007-S2         float64
2008-S2         float64
2009-S2         float64
2010-S2         float64
2011-S2         float64
2012-S2         float64
2013-S2         float64
2014-S2         float64
2015-S2         float64
2016-S2         float64
2017            float64
2018            float64
2019            float64
2020            float64
2021            float64
dtype: object

Sample values:

In [71]:
data_electricity_prices_no_household_consumers

Unnamed: 0,country,country_name,2007-S2,2008-S2,2009-S2,2010-S2,2011-S2,2012-S2,2013-S2,2014-S2,2015-S2,2016-S2,2017,2018,2019,2020,2021
0,AT,Austria,,,,,,,,,,,0.0598,0.061,0.0654,0.0702,0.0723
1,BA,Bosnia and Herzegovina,,,,,,,,,,,0.0649,0.0621,0.0624,0.0648,
2,BE,Belgium,,,,,,,,,,,0.0672,0.0624,0.0663,0.0745,0.089
3,BG,Bulgaria,,,,,,,,,,,0.0817,0.081,0.0764,0.073,0.1075
4,CY,Cyprus,,,,,,,,,,,0.1187,0.124,0.1271,0.1055,0.1136
5,CZ,Czechia,,,,,,,,,,,0.058,0.0602,0.0721,0.0811,0.0848
6,DE,Germany (until 1990 former territory of the FRG),,,,,,,,,,,0.0468,0.0612,0.0525,0.0651,0.0707
7,DK,Denmark,,,,,,,,,,,0.0433,0.0514,0.0517,0.0426,0.0898
8,EA,"Euro area (EA11-1999, EA12-2001, EA13-2007, EA...",,,,,,,,,,,0.0757,0.0837,0.0794,0.078,
9,EE,Estonia,,,,,,,,,,,0.0406,0.0489,0.0516,0.044,0.085


Export the dataframe

In [72]:
export_dataframe( 
    df = data_electricity_prices_no_household_consumers, 
    file_name = 'data_electricity_prices_no_household_consumers', 
    directory = 'subdataset' 
    
) 

