## <i>EDA - Energy Technology RD&D Budget</i><br>
### This dataset has been taken from [International Energy Agency](https://www.iea.org/)<br>

##### The focus of this study is to explore the [RD&D Budget Database](https://www.iea.org/data-and-statistics/data-product/energy-technology-rd-and-d-budget-database-2#documentation), which tracks trends in spending on energy technologies in the IEA countries since 1974.<br>

<ul>In this project we will be looking at:<br>
    <li>Which countries are part of the International Energy Agency (IEA)?</li>
    <li>Which sectors receive the largest budgets? We will also compare countries to see which spend the most or the least in each sector.</li>
    <li>The total budget allocated by each country for research and development of energy technologies.</li></ul><br>
    
We will see further as we explore our dataset.<br><br>

<strong><u>Note:</u></strong> The dataset downloaded from the [IEA](https://www.iea.org/) was an <b>.xlsx</b> file containing one column for each year starting from 1974.<br>For convenience of data handling (preparation, cleaning & visualisation), the year columns were <b>unpivoted</b> using Excel's <b>Power Query</b> and the required sheets were then saved as <b>CSV</b> files.<br><br>

The original xlsx file will also be available if you want to work with it directly.
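
If you would rather unpivot in pandas than in Power Query, `DataFrame.melt` does a comparable job. The sketch below uses a small made-up wide table; the sample values are hypothetical, and the real sheet may need extra clean-up first. The output columns are named `Attribute` and `Value` to match the CSV columns that are renamed later in this notebook.

```python
import pandas as pd

# Hypothetical wide-format sample: one column per year (values are made up).
wide = pd.DataFrame({
    "Country": ["Austria", "Belgium"],
    "Currency": ["USD", "USD"],
    "Economic Indicators": ["Total Budget", "Total Budget"],
    "1974": [1.0, 2.0],
    "1975": [1.5, 2.5],
})

id_cols = ["Country", "Currency", "Economic Indicators"]

# melt() turns every non-id column (the years) into (Attribute, Value) rows.
long = wide.melt(id_vars=id_cols, var_name="Attribute", value_name="Value")

print(long)
```

Each country/year pair becomes its own row, which is the long layout the rest of the notebook works with.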

In [None]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

import plotly.io as pio
# to save plotly interactive graphs

%matplotlib inline

In [None]:
# getting public rd & d budget

public_rd = pd.read_csv('../input/iea-energy-technology-rdd-budget-database/Public_RDD_Budget_Unpivot.csv')

public_rd.head()

In [None]:
# changing column names

public_rd.rename(columns={'Currency': 'Currency_type', "Economic Indicators": "Sectors",
                          "Attribute": "Year", "Value": "Amount"}, inplace=True)


In [None]:
# let's create a function to check the size of a dataset so we can check other dataframes easily
def df_size(df):
    size = df.shape
    print(f'Set contains {size[0]} rows and {size[1]} columns')
    
df_size(public_rd)

In [None]:
# create a function to check the null values in df

def check_null(df):
    for col in df.columns:
        # isnull().mean() is a fraction, so scale it to a percentage
        pct_null = df[col].isnull().mean() * 100
        print(f"{col}\t-\t{pct_null:.2f}% null values")
        
check_null(public_rd)

There seems to be a problem! As seen in <i>'public_rd.head()'</i>, the Amount column does contain missing values, yet the check above reports 0.0% null values.<br>
It looks like the placeholder " .. " is read as a regular value instead of a null value.<br>
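
As an aside, pandas can be told to treat such placeholders as missing. The sketch below is an illustration only, on a made-up three-row sample rather than the IEA file. It shows two options: declaring the placeholders when reading the file, or coercing a text column that is already loaded.

```python
import io
import pandas as pd

# Hypothetical sample; '..' and 'x' stand in for missing figures.
raw = io.StringIO("Country,Value\nAustria,12.5\nBelgium,..\nCanada,x\n")

# Option 1: declare the placeholders as missing when reading the file.
at_read = pd.read_csv(raw, na_values=["..", "x"])

# Option 2: coerce an already-loaded text column; non-numeric text becomes NaN.
text_col = pd.Series(["12.5", "..", "x"])
coerced = pd.to_numeric(text_col, errors="coerce")

print(at_read["Value"].isna().sum())
print(coerced.dtype)
```

Either way the column ends up numeric, and a null check then counts the placeholders as missing.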

To check this, let's look at the dtypes of our dataset!

In [None]:
# checking dtypes

public_rd.info()

In [None]:
# while unpivoting I created a total column in the end which contained NAN values
public_rd.drop(index=79920, inplace=True) 

In [None]:
# changing year dtype from object to int

# the Year column contains text as well as digits, so keep only the leading digits
# (np.int was removed in NumPy 1.24, so pandas string methods are used instead)
public_rd['Year'] = (public_rd['Year'].astype(str)
                     .str.extract(r'^(\d+)', expand=False)
                     .astype('int64'))

In [None]:
# changing amount dtype from object to float

print(type(public_rd['Amount'][2]))
# As expected, " .. " values in the column are seen as string values.
# While filtering '..' I also came across that values contained string 'x' in more than 3000 rows

# replacing the string placeholders with 0 in a single vectorised call
# (calling replace once per row inside a loop repeats the same work needlessly)
public_rd['Amount'] = public_rd['Amount'].replace(['..', 'x'], 0).astype('float64')

In [None]:
# now let's check dtypes of our columns again

public_rd.info()

# public_rd.to_csv('psql_datasets/IEA_public_rd.csv')

Great! Now that we have converted the Year and Amount columns, it's time for some data visualisations!<br>

Before we begin, though, we need to select a currency type in which to compare each country's spending per sector.

In [None]:
print(public_rd['Currency_type'].unique(),"\n")
# we will be using USD (2020 prices and exchange rates) so that we can compare spendings done by each
# country on a same scale.
# we will be also removing rows with year 2021 since it contains incomplete data & is still in process.

public_rd = public_rd[(public_rd['Currency_type'] == 'USD (2020 prices and exchange rates)') &
                      (public_rd['Year'] != 2021) ]

df_size(public_rd)

In [None]:
print(public_rd['Country'].unique())

# removing regional aggregates so that only individual countries remain
regions = ['Estimated IEA Total', 'European Union', 'Estimated IEA Americas total',
           'Estimated IEA Europe total', 'Estimated IEA Asia Oceania total']
rd_countries = public_rd[~public_rd['Country'].isin(regions)]


rd_countries.head(5)

In [None]:
sector_plot = rd_countries[rd_countries['Sectors'] != 'Total Budget']
sector_plot = sector_plot.sort_values(by='Amount', ascending=False)

plt.style.use('fivethirtyeight')
plt.figure(1, figsize=(12,9))
ax = sns.barplot(data=sector_plot, x='Amount', y='Sectors', palette='pastel', errcolor='white')

plt.title('Spendings by Sectors in Public RD&D', fontsize=25, fontweight='bold', color='black', pad=20)
plt.ylabel(None)
plt.xlabel('Amount in USD (millions)', fontsize=19, fontweight='bold', color='black', labelpad=20)
ax.set_yticklabels(sector_plot['Sectors'].unique(),fontsize=17, color='black')
plt.xticks(np.arange(0, 250, step=20),fontsize=16, color='black')

# ax.set_facecolor('#2e3141')

fig = plt.gcf()
plt.show()

# fig.savefig('trial.jpg', bbox_inches='tight', facecolor='#2e3141')

#### In <b>public RD&D</b>, most of the energy technology spending by IEA countries over the years <i>(1974-2020)</i> has gone to the Nuclear sector, at approximately 200 million USD.

#### Let's see which regions have had a bigger budget for their public RD&D energy departments. <br><br>This will be calculated using the mean of the total budget allotted by each region for each year.
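
As a small illustration of that averaging step, the sketch below uses made-up regions and amounts (not IEA figures). It shows that `pd.pivot_table` averages duplicate Year/Region entries by default, and that a `groupby` gives the same numbers while making the averaging explicit.

```python
import pandas as pd

# Hypothetical long-format sample; region names and amounts are made up.
sample = pd.DataFrame({
    "Region": ["Americas", "Americas", "Europe", "Europe"],
    "Year": [2019, 2019, 2019, 2020],
    "Amount": [10.0, 30.0, 5.0, 7.0],
})

# pivot_table aggregates duplicate (Year, Region) pairs with the mean by default.
wide = pd.pivot_table(sample, index="Year", columns="Region", values="Amount")

# The same result via groupby, which makes the averaging explicit.
explicit = sample.groupby(["Year", "Region"])["Amount"].mean().unstack("Region")

print(wide)
```

Because every row for a region and year goes into the mean, it is worth checking which `Sectors` rows are present before pivoting.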


In [None]:
# getting only the region values and its total budget for all the years & renaming column to region
total_budget = public_rd[public_rd['Country'].isin(['Estimated IEA Americas total',
                                                     'Estimated IEA Europe total',
                                                     'Estimated IEA Asia Oceania total'])]
total_budget = total_budget.rename(columns={'Country': 'Region'})

# pivoting the dataframe so that we can plot lines according to the regions
total_budget = pd.pivot_table(total_budget, index='Year', values='Amount', columns='Region')

plt.style.use('seaborn')
plt.figure(2, figsize=(12,8))

ax = sns.lineplot(data=total_budget, palette='Set1')
ax.legend(loc='upper center', labelcolor='mfc', fontsize=14, title='Region', title_fontsize=16,
         frameon=False)
plt.ylabel('Amount in USD (millions)', fontsize=17, fontweight='bold', color='gray', labelpad=20)
plt.xlabel('Year', fontsize=17, fontweight='bold', color='gray', labelpad=20)
plt.title('Total Spendings by Region in Public RD&D', fontsize=21, pad=20, color='gray')
ax.tick_params(labelsize=13)

# ax.set_facecolor('#2e3141')

fig = plt.gcf()
plt.show()

# fig.savefig('trial2.jpg', bbox_inches='tight', facecolor='#2e3141')

### Interesting! 

In [None]:
# now lets bring in the private RD & D dataset

private_rd = pd.read_csv('../input/iea-energy-technology-rdd-budget-database/Private_RDD_Budget_Unpivot.csv')

private_rd.head()

In [None]:
private_rd.info()
# looks like we will have to clean and change the dtypes on this dataset as well

In [None]:
# before we carry on lets change the column names
private_rd.rename(columns={'Currency': 'Currency_type', 'Economic Indicators': 'Sectors',
                         'Attribute': 'Year', 'Value': 'Amount'}, inplace=True)

In [None]:
private_rd['Country'].unique()

#### Looks like only 3 IEA countries have private RD&D departments in this dataset

In [None]:
# the year's data is only from 2013 to 2020
# (while the 2020 values are estimated we will have to remove string values from Year column)
private_rd['Year'].unique()

In [None]:
# converting column year from dtype object to int

# keep only the leading digits of each Year entry
# (np.int was removed in NumPy 1.24, so pandas string methods are used instead)
private_rd['Year'] = (private_rd['Year'].astype(str)
                      .str.extract(r'^(\d+)', expand=False)
                      .astype('int64'))

In [None]:
# assign the result back instead of calling replace(inplace=True) on a column selection
private_rd['Amount'] = private_rd['Amount'].replace(['..', 'x'], np.nan)

In [None]:
check_null(private_rd)

##### The Amount column contains about 0.5% null values. Since this is less than 1%, we can replace these null values with 0 without noticeably affecting our dataset.
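
To see what zero-filling does to an average, here is a toy illustration with made-up numbers. The effect is exaggerated on purpose by using one missing value in four; with a very small share of nulls the two means would be much closer.

```python
import numpy as np
import pandas as pd

# Hypothetical amounts with one missing entry.
amounts = pd.Series([10.0, 20.0, np.nan, 30.0])

# mean() skips NaN by default: (10 + 20 + 30) / 3
mean_skipping_nan = amounts.mean()

# After fillna(0) the missing entry counts as a zero: (10 + 20 + 0 + 30) / 4
mean_with_zero = amounts.fillna(0).mean()

print(mean_skipping_nan, mean_with_zero)
```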

In [None]:
private_rd['Amount'] = private_rd['Amount'].fillna(0).astype('float64')

In [None]:
# before we continue we will have to choose one currency type
# private_rd['Currency_type'].unique()

# .copy() avoids SettingWithCopyWarning when columns are modified later
private_rd = private_rd[private_rd['Currency_type'] == 'USD (2020 prices and exchange rates)'].copy()

In [None]:
private_rd.head()

In [None]:
print(private_rd['Sectors'].unique())

# looks like we have duplicate values with an extra space, we will have to remove these blank spaces
print('\n----------------------------------------\n')

private_rd['Sectors'] = private_rd['Sectors'].str.rstrip()
print(private_rd['Sectors'].unique())

# private_rd.to_csv('psql_datasets/IEA_private_rd.csv')

In [None]:
# let's create a pivot to plot the spendings as per our country's private RD & D department

# for this plot we will not use the Total Budget in the Sector's column, hence removing
country_plot = private_rd[private_rd['Sectors'] != 'Total Budget']

country_plot = pd.pivot_table(country_plot, values='Amount', index='Year', columns='Country')

plt.style.use('fivethirtyeight')
plt.figure(3, figsize=(12,8))

ax = sns.barplot(data=country_plot, palette='GnBu', errcolor='gray')
plt.ylabel('Amount in USD (millions)', fontsize=18, color='black', labelpad=20)
plt.xlabel('Country', fontsize=18, color='black', labelpad=20)
plt.title('Spendings by Country in Private RD&D', fontsize=20, pad=20)
plt.xticks(color='black', fontsize=13)
plt.yticks(color='black', fontsize=13)

# ax.set_facecolor('#2e3141')

fig = plt.gcf()
plt.show()

# fig.savefig('trial3.jpg', bbox_inches='tight', facecolor='#2e3141')

### It looks like private RD&D departments in Italy spend more than 90 million USD on energy technologies.
### Let's see how much that differs from Italy's public RD&D

In [None]:
# since our private RD & D data is since 2013 for the public dataset we will have to take values only
# from 2013 as well

it_public = rd_countries[(rd_countries['Country'] == 'Italy') & (rd_countries['Year'] >= 2013)].dropna()
it_private = private_rd[(private_rd['Country'] == 'Italy')].dropna()

it_public['Country'] = it_public['Country'].replace('Italy', 'Italy Public RD')
it_private['Country'] = it_private['Country'].replace('Italy', 'Italy Private RD')



it_total_budget = pd.merge(it_public, it_private, how='outer', on=
                    ['Country', 'Currency_type','Sectors', 'Year', 'Amount'])
plot = it_total_budget[it_total_budget['Sectors'] == 'Total Budget']

plt.style.use('seaborn')
sns.catplot(data=plot, x='Country', y='Amount', kind='bar', aspect=2, height=6)
plt.ylabel('Amount in USD (millions)', fontsize=18, color='black', labelpad=20)
plt.xlabel('Country', fontsize=18, color='black', labelpad=20)
plt.title('Spendings by Italy in Public & Private RD&D', fontsize=20, pad=20)
plt.xticks(color='black', fontsize=13)
plt.yticks(color='black', fontsize=13)

plt.show()

### We can see that Italy's private RD&D energy technology departments have a larger budget than the public ones
#### Let's check how the sectors compare between Italy's public and private RD&D departments
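
Since the public and private tables share the same columns, another way to combine them is to stack them with `pd.concat` and label each source in its own column. This is only a sketch with made-up figures; the `Department` label mirrors the column name used for the plot in this notebook.

```python
import pandas as pd

# Hypothetical figures; column names follow the renamed columns in this notebook.
public = pd.DataFrame({
    "Sectors": ["Nuclear", "Total Budget"],
    "Year": [2019, 2019],
    "Amount": [1.0, 4.0],
})
private = pd.DataFrame({
    "Sectors": ["Nuclear", "Total Budget"],
    "Year": [2019, 2019],
    "Amount": [2.0, 9.0],
})

# Label each source, then stack the rows into one long table.
stacked = pd.concat(
    [public.assign(Department="Public RD"), private.assign(Department="Private RD")],
    ignore_index=True,
)

# Separate the overall totals from the per-sector rows.
totals = stacked[stacked["Sectors"] == "Total Budget"]
by_sector = stacked[stacked["Sectors"] != "Total Budget"]

print(totals)
```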

In [None]:
it_total_budget = pd.merge(it_public, it_private, how='outer', on=
                    ['Country', 'Currency_type','Sectors', 'Year', 'Amount'])
plot = it_total_budget[it_total_budget['Sectors'] != 'Total Budget']

# rename returns a new frame, which avoids modifying a slice in place
plot = plot.rename(columns={'Country' : 'Department'})


fig = px.scatter(plot, x='Sectors', y="Amount", facet_col="Department", width=900, hover_name='Year',
          facet_col_spacing=0.09, height=650, 
          title="Spendings in USD (millions) by Sectors for Italy's Public & Private RD Departments")
fig.show()

# pio.write_html(fig, file="Italy_RD.html", auto_open=True)

### You can hover over the scatter plot to see the year in which an amount was allocated to a particular sector by the public and private RD&D departments.

### Let's get our last dataset (Economic Indicators) 
#### Here we will see :
#### Which countries are IEA members?
#### How do IEA countries compare in their energy technology spending relative to GDP?
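
To make the GDP-relative comparison concrete, the sketch below assumes the indicator is simply the budget divided by GDP and scaled by 1000, which is how its name reads; the dataset's exact definition may differ. Countries and figures are made up.

```python
import pandas as pd

# Hypothetical budgets and GDP in the same currency units; not IEA figures.
econ = pd.DataFrame({
    "Country": ["A", "B"],
    "Budget": [500.0, 200.0],
    "GDP": [1_000_000.0, 250_000.0],
})

# Assumed definition: budget per thousand units of GDP = budget / GDP * 1000.
econ["RDD_per_thousand_GDP"] = econ["Budget"] / econ["GDP"] * 1000

ranked = econ.sort_values("RDD_per_thousand_GDP", ascending=False)
print(ranked[["Country", "RDD_per_thousand_GDP"]])
```

Country B ranks first here despite its smaller absolute budget, which is the kind of difference a GDP-relative measure is meant to surface.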

In [None]:
eco_indicator = pd.read_csv('../input/iea-energy-technology-rdd-budget-database/Economic_Indicators_Unpivot.csv')

eco_indicator.head()

In [None]:
eco_indicator.info()
# we will have to edit and change dtypes of Year (Attribute) and Amount (Value)

eco_indicator.rename(columns={'Attribute': 'Year', 'Value': 'Amount'}, inplace=True)

In [None]:
# removing any str properties from Year column
# keep only the leading digits of each Year entry
# (np.int was removed in NumPy 1.24, so pandas string methods are used instead)
eco_indicator['Year'] = (eco_indicator['Year'].astype(str)
                         .str.extract(r'^(\d+)', expand=False)
                         .astype('int64'))

# treating the '..' placeholder in the Amount column as missing

eco_indicator['Amount'] = eco_indicator['Amount'].replace('..', np.nan)
check_null(eco_indicator)

# since missing values in Amount are less than 0.3%, replacing missing values with zero
eco_indicator['Amount'] = eco_indicator['Amount'].fillna(0)


eco_indicator['Amount']  = eco_indicator['Amount'].astype('float64')

In [None]:
eco_indicator.info()

In [None]:
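# printing explicitly, since only the last expression of a cell is displayed automatically
print(eco_indicator['Country'].unique(), "\n")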

print(eco_indicator['Indicator'].unique())

# changing duplicate values having blank spaces
eco_indicator['Indicator'] = eco_indicator['Indicator'].str.rstrip()

print('\n----------------------------------\n')
print(eco_indicator['Indicator'].unique())

# eco_indicator.to_csv('psql_datasets/IEA_economic_indicators.csv')

In [None]:
# for this plot we will be looking at RD&D per thousand units of a country's GDP

eco_indicator = eco_indicator[eco_indicator['Indicator'] == 'RD&D per thousand units of GDP']

In [None]:
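# sort by Year first so the animation frames appear in chronological order,
# then by Amount within each year
eco_indicator = eco_indicator.sort_values(by=['Year', 'Amount'])
fig = px.bar(data_frame=eco_indicator, x='Amount', y='Country', height=1000,
      animation_frame='Year',
      title='Total public energy RD&D budgets per thousand units of GDP by country')
fig.show()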

# pio.write_html(fig, file="Total_Budget_RD.html", auto_open=True)

## As of 2020, the most recent year, Norway has the largest RD&D budget per thousand units of GDP, followed by Finland and Japan!

### You can hover over the bars to see each country's exact budget per thousand units of GDP. <br> You can scroll through the years to see which IEA countries have had the largest total budgets for energy technology RD&D.