# Global Power Plant Database

Project Description

The Global Power Plant Database is a comprehensive, open source database of power plants around the world. It centralizes power plant data to make it easier to navigate, compare and draw insights for one’s own analysis. The database covers approximately 14,000 power plants from 3 countries (USA, AUS, INDIA) and includes thermal plants (e.g. coal, gas, oil, nuclear, biomass, waste, geothermal) and renewables (e.g. hydro, wind, solar). Each power plant is geolocated and entries contain information on plant capacity, generation, ownership, and fuel type. It will be continuously updated as data becomes available.

Key attributes of the database

The database includes the following indicators:

• country (text): 3 character country code corresponding to the ISO 3166-1 alpha-3 specification [5]

• country_long (text): longer form of the country designation

• name (text): name or title of the power plant, generally in Romanized form

• gppd_idnr (text): 10 or 12 character identifier for the power plant

• capacity_mw (number): electrical generating capacity in megawatts

• latitude (number): geolocation in decimal degrees; WGS84 (EPSG:4326)

• longitude (number): geolocation in decimal degrees; WGS84 (EPSG:4326)

• primary_fuel (text): energy source used in primary electricity generation or export

• other_fuel1 (text): energy source used in electricity generation or export

• other_fuel2 (text): energy source used in electricity generation or export

• other_fuel3 (text): energy source used in electricity generation or export

• commissioning_year (number): year of plant operation, weighted by unit-capacity when data is available

• owner (text): majority shareholder of the power plant, generally in Romanized form

• source (text): entity reporting the data; could be an organization, report, or document, generally in Romanized form

• url (text): web document corresponding to the source field

• geolocation_source (text): attribution for geolocation information

• wepp_id (text): a reference to a unique plant identifier in the widely-used PLATTS-WEPP database.

• year_of_capacity_data (number): year the capacity information was reported

• generation_gwh_2013 (number): electricity generation in gigawatt-hours reported for the year 2013

• generation_gwh_2014 (number): electricity generation in gigawatt-hours reported for the year 2014

• generation_gwh_2015 (number): electricity generation in gigawatt-hours reported for the year 2015

• generation_gwh_2016 (number): electricity generation in gigawatt-hours reported for the year 2016

• generation_gwh_2017 (number): electricity generation in gigawatt-hours reported for the year 2017

• generation_gwh_2018 (number): electricity generation in gigawatt-hours reported for the year 2018

• generation_gwh_2019 (number): electricity generation in gigawatt-hours reported for the year 2019

• generation_data_source (text): attribution for the reported generation information

• estimated_generation_gwh_2013 (number): estimated electricity generation in gigawatt-hours for the year 2013

• estimated_generation_gwh_2014 (number): estimated electricity generation in gigawatt-hours for the year 2014

• estimated_generation_gwh_2015 (number): estimated electricity generation in gigawatt-hours for the year 2015

• estimated_generation_gwh_2016 (number): estimated electricity generation in gigawatt-hours for the year 2016

• estimated_generation_gwh_2017 (number): estimated electricity generation in gigawatt-hours for the year 2017

• estimated_generation_note_2013 (text): label of the model/method used to estimate generation for the year 2013

• estimated_generation_note_2014 (text): label of the model/method used to estimate generation for the year 2014

• estimated_generation_note_2015 (text): label of the model/method used to estimate generation for the year 2015

• estimated_generation_note_2016 (text): label of the model/method used to estimate generation for the year 2016

• estimated_generation_note_2017 (text): label of the model/method used to estimate generation for the year 2017

Fuel Type Aggregation

We define the "Fuel Type" attribute of our database based on common fuel categories.

Prediction: Make two predictions: 1) primary_fuel 2) capacity_mw

Hint: Use pandas methods to combine all the datasets and then start working on this project.

Dataset link: https://github.com/FlipRoboTechnologies/ML_-Datasets/tree/main/Global%20Power%20Plant%20Database

In [None]:
!pip install rasterio
!pip install folium

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# date and time handling
import datetime as dt
from datetime import datetime

import folium
import rasterio as rio
from folium import plugins
from folium.plugins import HeatMap

import warnings
warnings.filterwarnings('ignore')



In [None]:
GPP_url_IND = 'https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/Global%20Power%20Plant%20Database/database_IND.csv'
GPP_url_usa = 'https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/Global%20Power%20Plant%20Database/database_USA.csv'
GPP_url_AUS = 'https://raw.githubusercontent.com/FlipRoboTechnologies/ML_-Datasets/main/Global%20Power%20Plant%20Database/database_AUS.csv'
gpp_IND = pd.read_csv(GPP_url_IND)
gpp_USA = pd.read_csv(GPP_url_usa)
gpp_AUS = pd.read_csv(GPP_url_AUS)

In [None]:
print(gpp_IND.shape)
print(gpp_USA.shape)
print(gpp_AUS.shape)

print(gpp_IND.head())
print(gpp_USA.head())
print(gpp_AUS.head())

In [None]:
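# Combine the three country datasets into one frame by concatenation;
# ignore_index=True renumbers the rows so index labels stay unique.

df_GPP = pd.concat([gpp_IND,gpp_USA,gpp_AUS], ignore_index=True)
df_GPP.head()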
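
When frames that each carry their own 0-based index are concatenated, index labels repeat unless `ignore_index=True` is passed. A minimal sketch with toy frames (the names `toy_ind` and `toy_usa` and all values are made up for illustration) shows the difference:

```python
import pandas as pd

# Toy stand-ins for the per-country frames; the values are illustrative only.
toy_ind = pd.DataFrame({'country': ['IND', 'IND'], 'capacity_mw': [2.5, 98.0]})
toy_usa = pd.DataFrame({'country': ['USA'], 'capacity_mw': [10.0]})

# Default concat keeps each frame's own index, so label 0 appears twice.
stacked = pd.concat([toy_ind, toy_usa])
# ignore_index=True renumbers the rows from 0 to n-1.
renumbered = pd.concat([toy_ind, toy_usa], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Unique labels matter mostly for label-based selection with `.loc`, where a repeated label returns several rows.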

In [None]:
df_GPP.columns

In [None]:
df_GPP.info()

In [None]:
df_GPP.sample(10)

In [None]:
df_GPP.describe()

Here we can see some power plants with negative power generation.
This looks like a mistake in the dataset, but some power plants do consume more energy than they produce.
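
To see how widespread the negative readings are, one could count them per year and list the affected rows. The sketch below uses a toy frame; the column names follow the dataset, but the values and plant names are invented for illustration:

```python
import pandas as pd

# Toy frame; column names follow the dataset, values are illustrative only.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C'],
    'generation_gwh_2017': [120.0, -3.2, None],
    'generation_gwh_2018': [-0.5, 40.0, 15.0],
})

gen_cols = [c for c in toy.columns if c.startswith('generation_gwh_')]
# Count negative readings per year; NaN compares as False and is not counted.
negative_counts = (toy[gen_cols] < 0).sum()
# Rows with at least one negative reading, for manual inspection.
suspect_rows = toy[(toy[gen_cols] < 0).any(axis=1)]

print(negative_counts)
print(suspect_rows['name'].tolist())
```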

In [None]:
df_GPP.isnull().sum()

We found many missing values in this dataset. In particular, estimated_generation_gwh, wepp_id, other_fuel3, other_fuel2 and other_fuel1 are missing for all or most rows.

**We are going to investigate the heavily missing other_fuel3 and other_fuel2 columns, and will probably have to drop the estimated_generation_gwh and wepp_id columns from the dataset.**
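
A share of missing values per column is often easier to act on than raw counts. The following is a minimal sketch on a toy frame; the 0.7 cut-off is a hypothetical choice, not something prescribed by the dataset:

```python
import numpy as np
import pandas as pd

# Toy frame; column names follow the dataset, values are illustrative only.
toy = pd.DataFrame({
    'capacity_mw': [10.0, 20.0, np.nan, 40.0],
    'other_fuel3': [np.nan, np.nan, np.nan, np.nan],
    'wepp_id': [np.nan, np.nan, np.nan, 'X1'],
})

# Fraction of missing entries per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
threshold = 0.7  # hypothetical cut-off for "too sparse to keep"
drop_candidates = missing_share[missing_share > threshold].index.tolist()

print(missing_share)
print(drop_candidates)
```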


In [None]:
df_GPP.duplicated().sum()

In [None]:
# Impute missing values in the numeric columns
# capacity_mw, latitude and longitude with the column mean

num_cols = ['capacity_mw','latitude','longitude']
df_GPP[num_cols] = df_GPP[num_cols].fillna(df_GPP[num_cols].mean())

df_GPP.isnull().sum()

In [None]:
# Impute missing values in the categorical columns with the most frequent value

categorical_cols = ['primary_fuel','other_fuel1']
if all(col in df_GPP.columns for col in categorical_cols):
    df_GPP[categorical_cols] = df_GPP[categorical_cols].fillna(df_GPP[categorical_cols].mode().iloc[0])
    print(df_GPP.isnull().sum())
else:
    miss_cols = set(categorical_cols) - set(df_GPP.columns)
    print("Missing columns:", miss_cols)


In [None]:
#impute missing values for  numerical columns with median
median_cols = ['commissioning_year','year_of_capacity_data']
df_GPP[median_cols] = df_GPP[median_cols].fillna(df_GPP[median_cols].median())
df_GPP.isnull().sum()

In [None]:
#impute missing values for categorical columns with most frequent value

most_frequent_cols = ['owner','geolocation_source']
df_GPP[most_frequent_cols] = df_GPP[most_frequent_cols].fillna(df_GPP[most_frequent_cols].mode().iloc[0])

In [None]:
df_GPP.isnull().sum()



In [None]:
df_GPP.head()

In [None]:
# Check for consistency
# Generation values should be consistent across years: there should not be a drastic increase or decrease without a reasonable cause.
# Validation of 'generation_gwh_<year>' against 'commissioning_year':
# a plant cannot report generation for a year before it was commissioned, so those values are set to NaN.

for year in range(2013,2019):
  year_col = f'generation_gwh_{year}'
  df_GPP.loc[df_GPP['commissioning_year'] > year, year_col] = np.nan
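
The comment above also asks for a check on drastic year-to-year swings, which the loop does not perform. A minimal sketch of such a check on toy data follows; the plant labels, the values and the 200% threshold are assumptions chosen for illustration:

```python
import pandas as pd

# Toy generation figures; column names follow the dataset, values are illustrative only.
toy = pd.DataFrame({
    'generation_gwh_2016': [100.0, 50.0],
    'generation_gwh_2017': [110.0, 400.0],
    'generation_gwh_2018': [105.0, 60.0],
}, index=['steady_plant', 'spiky_plant'])

values = toy.to_numpy()
# Relative change from each year to the next: (next - previous) / previous.
rel_change = pd.DataFrame(
    (values[:, 1:] - values[:, :-1]) / values[:, :-1],
    index=toy.index,
    columns=toy.columns[1:],
)
# Largest absolute swing per plant across all consecutive years.
max_jump = rel_change.abs().max(axis=1)
threshold = 2.0  # hypothetical: flag swings larger than 200%
flagged = max_jump[max_jump > threshold].index.tolist()

print(flagged)
```

Flagged rows are candidates for manual review rather than automatic removal, since a real capacity expansion or outage can also cause large swings.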


In [None]:
df_GPP.drop(columns=['wepp_id','estimated_generation_gwh'],inplace=True)

In [None]:
df_GPP.head()

In [None]:

high_miss_values = ['other_fuel2','other_fuel3']
df_GPP.drop(columns=high_miss_values,inplace=True)


In [None]:

df_GPP.head()

In [None]:
df_GPP.info()

In [None]:
df_GPP.isnull().sum()

In [None]:
from sklearn.impute import SimpleImputer

# Fill the remaining gaps in the yearly generation columns with 0, using a constant-strategy imputer
ness_cols = [f'generation_gwh_{year}' for  year in range(2013,2019)]
imputer_gen = SimpleImputer(strategy='constant', fill_value = 0 )
df_GPP[ness_cols] = imputer_gen.fit_transform(df_GPP[ness_cols])

In [None]:
df_GPP.isnull().sum()

In [None]:
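# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not modify df_GPP.
df_GPP['generation_gwh_2019'] = df_GPP['generation_gwh_2019'].fillna(0)
df_GPP['generation_data_source'] = df_GPP['generation_data_source'].fillna('N/A')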

In [None]:
df_GPP.isnull().sum()

In [None]:
df_GPP.to_csv('df_GPP_cleaned_data.csv',index=False)

**Data was cleaned and stored in df_GPP_cleaned_data.csv**

In [None]:
df_GPP.head()

In [None]:
df_GPP.describe()

In [None]:
df_GPP.hist(figsize=(20,15))
plt.show()

In [None]:
#Fuel details

plt.figure(figsize=(15,10))
sns.barplot(x='primary_fuel',y='capacity_mw',data=df_GPP)
plt.show()

In [None]:
#scatter plot
sns.scatterplot(x=df_GPP.capacity_mw, y = df_GPP.primary_fuel)
plt.title('Capacity vs Primary Fuel')
plt.xlabel('Capacity')
plt.ylabel('Primary Fuel')
plt.show()

In [None]:
#Correlation Matrix

numerical_df_GPP = df_GPP.select_dtypes(include=['int64', 'float64'])

corr_matrix = numerical_df_GPP.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

According to the correlation matrix above, capacity_mw is positively correlated with generation_gwh_2014 through generation_gwh_2018, with coefficients ranging from 0.78 to 0.80. This indicates that higher capacity is associated with higher generation in these years.

**Generation figures for different years are strongly correlated with each other: 0.73 to 0.97.**

**Negative correlation: latitude and longitude**

**Weak correlation: commissioning_year with the other features**
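
Reading the strongest relationships off a heatmap by eye gets harder as the number of columns grows. As a sketch, the pairs can be ranked programmatically; the toy values below are invented, and only the column names come from the dataset:

```python
import numpy as np
import pandas as pd

# Toy numeric frame; values are illustrative only.
toy = pd.DataFrame({
    'capacity_mw': [10.0, 50.0, 200.0, 800.0, 1200.0],
    'generation_gwh_2017': [30.0, 160.0, 700.0, 2500.0, 4100.0],
    'latitude': [35.0, 12.0, 28.0, -33.0, 40.0],
})

corr = toy.corr()
# Keep the upper triangle only, so each pair appears once and self-correlations are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()
# Rank pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=abs, ascending=False)

print(pairs)
```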




In [None]:
# Capacity vs generation analysis
years = range(2013,2019)

for year in years:
  plt.figure(figsize=(10, 6))
  sns.scatterplot(x='capacity_mw', y=f'generation_gwh_{year}', data=df_GPP,hue = 'primary_fuel')
  plt.title(f'Capacity vs Generation : {year}')
  plt.xlabel('Capacity')
  plt.ylabel('Generation')
  plt.show()

In [None]:
df_GPP.head()

In [None]:
df_GPP.columns

In [None]:
from sklearn.preprocessing import LabelEncoder #,OneHotEncoder
le = LabelEncoder()
#one_hot_en = OneHotEncoder()

label_encoder_coulmns = ['country','country_long','name','gppd_idnr','primary_fuel','other_fuel1','geolocation_source','generation_data_source','owner','source','url']
#one_hot_encode_columns = ['geolocation_source','generation_data_source']



In [None]:
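label_encode = {}

for col in label_encoder_coulmns:
  # Fit a separate encoder per column so each stored encoder can decode its own column later.
  col_encoder = LabelEncoder()
  df_GPP[col] = col_encoder.fit_transform(df_GPP[col])
  label_encode[col] = col_encoder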
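
Decoding predictions back to fuel names needs the encoder that was fitted on that particular column, so each column should get its own `LabelEncoder` instance. A minimal sketch on toy data (the frame and the `encoders` dictionary are illustrative, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; column names follow the dataset, values are illustrative only.
toy = pd.DataFrame({
    'primary_fuel': ['Coal', 'Solar', 'Coal', 'Hydro'],
    'country': ['IND', 'USA', 'AUS', 'IND'],
})

encoders = {}
for col in ['primary_fuel', 'country']:
    enc = LabelEncoder()          # a fresh encoder for every column
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Map the integer codes back to the original fuel names.
decoded = encoders['primary_fuel'].inverse_transform(toy['primary_fuel'])
print(list(decoded))
```

If one shared encoder were refitted in the loop, every dictionary entry would point to the same object, and it would only remember the classes of the last column.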

In [None]:
#apply One Hot encoding

#df_GPP = pd.get_dummies(df_GPP, columns=one_hot_encode_columns)
df_GPP.head()

In [None]:
df_GPP['generation_gwh_2019'] = pd.to_numeric(df_GPP['generation_gwh_2019'], errors='coerce')

In [None]:

df_GPP.dtypes

In [None]:
#Feature and target selection
features = df_GPP.drop(['primary_fuel','capacity_mw'],axis=1)
target_fuel = df_GPP['primary_fuel']
target_capacity = df_GPP['capacity_mw']


In [None]:
# Split the data into train and test sets
from sklearn.model_selection import train_test_split

X_train_fuel,X_test_fuel,y_train_fuel,y_test_fuel = train_test_split(features,target_fuel,test_size=0.2,random_state=42)
X_train_capacity,X_test_capacity,y_train_capacity,y_test_capacity = train_test_split(features,target_capacity,test_size=0.2,random_state=42)

In [None]:
# Standardize the features
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score


scaler = StandardScaler()
X_train_fuel = scaler.fit_transform(X_train_fuel)
X_test_fuel = scaler.transform(X_test_fuel)
# Both splits use the same features and random_state, so refitting yields the same scaling.
X_train_capacity = scaler.fit_transform(X_train_capacity)
X_test_capacity = scaler.transform(X_test_capacity)


In [None]:
# Classifiers for fuel prediction

#from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import BaggingClassifier

In [None]:
classifiers = {
          'svc':SVC(),
          'rfc':RandomForestClassifier(),
          'knc':KNeighborsClassifier(),
          'gau':GaussianNB(),
          'dtc' : DecisionTreeClassifier(),
          'abc' : AdaBoostClassifier(),
          'grd':GradientBoostingClassifier(),
          'bagg':BaggingClassifier()
}

In [None]:
# Train and evaluate classifiers for primary fuel prediction
imputer = SimpleImputer(strategy='mean')

X_train_fuel = imputer.fit_transform(X_train_fuel)
X_test_fuel = imputer.transform(X_test_fuel)

for name, classifier in classifiers.items():
  classifier.fit(X_train_fuel,y_train_fuel)
  y_pred_fuel = classifier.predict(X_test_fuel)
  accuracy = accuracy_score(y_test_fuel,y_pred_fuel)
  print(f'{name} : {accuracy}')

Result :
svc : 0.6973684210526315

rfc : 0.83796992481203

knc : 0.7090225563909774

gau : 0.02706766917293233

dtc : 0.7699248120300752

abc : 0.37105263157894736

grd : 0.8150375939849624

bagg : 0.8150375939849624

**The best model here is the Random Forest classifier, with 83.80% accuracy.**
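
The accuracies above come from a single train/test split, so the ranking of close models can change from run to run. One common way to get a steadier comparison is stratified k-fold cross-validation with fixed seeds. The sketch below uses synthetic data from `make_classification` rather than the power plant features, and the two candidate models are only examples:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class problem standing in for the fuel classification task.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Fixed seeds make the folds and the forest reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
candidates = {
    'rfc': RandomForestClassifier(n_estimators=50, random_state=0),
    # Scaling lives inside the pipeline, so it is refitted on each training fold only.
    'knc': make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

cv_scores = {name: cross_val_score(est, X, y, cv=cv, scoring='accuracy')
             for name, est in candidates.items()}
for name, scores in cv_scores.items():
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```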





In [None]:
import joblib
rfc = RandomForestClassifier()
rfc.fit(X_train_fuel,y_train_fuel)
y_pred_fuel = rfc.predict(X_test_fuel)
accuracy = accuracy_score(y_test_fuel,y_pred_fuel)
print(f'Random forest classifier : {accuracy}')



In [None]:
joblib.dump(rfc,'best_rfc_model.pkl') # best classification model is randomforest classifier

In [None]:
# Train and evaluate models for capacity_mw prediction
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression,Lasso,Ridge
from sklearn.svm import SVR
from sklearn.linear_model import SGDRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import cross_val_score

## capacity_mw prediction using regression models

In [None]:
# Keep each name next to its estimator so the labels cannot drift out of step with the models.
models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=0.001),
    'Lasso': Lasso(alpha=0.003),
    'SVR': SVR(),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'GradientBoostingRegressor': GradientBoostingRegressor(),
    # The base estimator is passed positionally because its keyword name differs across scikit-learn versions.
    'AdaBoostRegressor': AdaBoostRegressor(LinearRegression()),
}

rows = []
for model_name, model in models.items():
    print(model)

    model.fit(X_train_capacity, y_train_capacity)
    pred = model.predict(X_test_capacity)
    # Root mean squared error on the test split.
    rmse = np.sqrt(mean_squared_error(y_test_capacity, pred))
    r2 = model.score(X_test_capacity, y_test_capacity)

    # Mean negated MSE over 5 folds; values closer to 0 are better.
    averages = cross_val_score(model, X_train_capacity, y_train_capacity, cv=5, scoring='neg_mean_squared_error').mean()

    rows.append({'Model': model_name, 'RMSE': rmse, 'R2': r2, 'MeanCV': averages})

model_df = pd.DataFrame(rows, columns=['Model', 'RMSE', 'R2', 'MeanCV'])
print(model_df)

**Based on the results above, we can pick the best model.**


**Best suggested model:**
In the recorded run, the row labelled **DecisionTreeRegressor** scored best, with the lowest error, the highest R² and the best mean CV score: 166.833201  0.795335  -18154.799626

**Second-best model:**
RandomForestRegressor  174.778641  **0.775376**  -19560.574025
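
The figures quoted above sit on different scales: the first number in each row is a root mean squared error on the test split, while the mean CV score is a negated mean squared error. To compare them, the CV score can be converted to an RMSE in megawatts. A small worked sketch using only the numbers quoted above:

```python
import math

# Mean CV scores quoted above (negated MSE from scoring='neg_mean_squared_error').
mean_cv = {
    'DecisionTreeRegressor': -18154.799626,
    'RandomForestRegressor': -19560.574025,
}

# Flip the sign and take the square root to get an error in the target's unit (MW).
cv_rmse = {name: math.sqrt(-score) for name, score in mean_cv.items()}
for name, value in cv_rmse.items():
    print(f'{name}: CV RMSE ~ {value:.2f} MW')
```

This gives roughly 134.74 MW and 139.86 MW, somewhat below the test-split errors of about 166.83 and 174.78.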


**Hyperparameter Tuning**

In [None]:
from sklearn.model_selection import GridSearchCV

dtr = DecisionTreeRegressor()

# Define the parameter grid

param_grid = {
    'max_depth': [None, 5, 10, 15],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
grid_search = GridSearchCV(dtr, param_grid, cv=5, scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(X_train_capacity,y_train_capacity)



**Best parameters**

In [None]:
best_params = grid_search.best_params_
print(best_params)
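
Beyond `best_params_`, a fitted `GridSearchCV` also exposes the per-combination scores in `cv_results_` and, with the default `refit=True`, an already refitted `best_estimator_`. The sketch below shows both on synthetic data from `make_regression` with a smaller, illustrative grid:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the capacity features.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

param_grid = {'max_depth': [None, 5], 'min_samples_leaf': [1, 4]}  # illustrative grid
search = GridSearchCV(DecisionTreeRegressor(random_state=0), param_grid, cv=5,
                      scoring='neg_mean_squared_error', return_train_score=True)
search.fit(X, y)

# One row per parameter combination; a large train/test gap hints at overfitting.
results = pd.DataFrame(search.cv_results_)[
    ['params', 'mean_train_score', 'mean_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))

# With refit=True (the default) the best model is already refitted on all of X, y.
tuned = search.best_estimator_
print(search.best_params_)
```

Using `best_estimator_` avoids copying parameter values by hand, which can go stale when the search is rerun.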

In [None]:
#Best parameters
best_dtr = DecisionTreeRegressor(max_depth=5,min_samples_leaf=2,min_samples_split=5)
best_dtr

In [None]:
# Train the tuned model and evaluate it on the test split

best_dtr.fit(X_train_capacity,y_train_capacity)
y_pred_capacity = best_dtr.predict(X_test_capacity)
# Root mean squared error; np.sqrt avoids the `squared` argument, which is deprecated in recent scikit-learn releases.
rmse_capacity = np.sqrt(mean_squared_error(y_test_capacity,y_pred_capacity))
r2_capacity = best_dtr.score(X_test_capacity,y_test_capacity)
print(f'Decision Tree regression RMSE: {rmse_capacity},  Decision Tree regression R2: {r2_capacity}')

**Save the Model**

In [None]:
import joblib

joblib.dump(rfc,'best_rfc_model.pkl') # best classification model: random forest classifier
joblib.dump(best_dtr,'best_dtr_model.pkl') # tuned regression model: decision tree regressor
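
A quick way to confirm that a saved model is usable is to load it back and compare predictions. The sketch below does this round trip with a toy regressor on synthetic data; the file name `toy_dtr_model.pkl` is hypothetical and written to a temporary directory:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Toy model on synthetic data, standing in for the tuned regressor.
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_dtr_model.pkl')  # hypothetical file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
print(same_predictions)
```

The saved models expect inputs that were encoded and scaled like the training data, so in practice the fitted scaler and label encoders would likely need to be saved alongside them.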