<h2>Summary</h2> 
To get reasonable electricity generation values from the provided Power Plant dataset (gppd_120_pr.csv), the "estimated_generation_gwh" values must be fixed. Only the code in the first code cell below is needed; the rest of the Notebook is the justification.

<h4>Update: some data forensics</h4>
The Kaggle Power Plant dataset for Puerto Rico is a subset of the [GPPD (Global Power Plant Database)](http://datasets.wri.org/dataset/globalpowerplantdatabase), a wonderful effort to collect information on as many of the world's Power Plants as possible. The GPPD data for US Power Plants comes in turn from the EIA-923 reports, which seem very reliable. However, I have not seen Puerto Rico Power Plants in EIA-923 reports prior to 2017, so if Puerto Rico's information in the GPPD was populated before 2017, it may come from a combination of sources. The geopositioning data seems OK, and the "capacity_mw" and "primary_fuel" data match those in ["Tabla de datos"](http://energia.pr.gov/datos/plantas/) from PREPA, the Puerto Rico Electric Power Authority (Autoridad de Energía Eléctrica, in Spanish). However, the "estimated_generation_gwh" values seem invented, as I justify below.

In [None]:
import pandas as pd
import numpy as np

source_capacity_factors = {"Coal": 0.55, "Hydro": 0.40, "Gas": 0.80, "Oil": 0.64, "Solar": 0.25, "Wind": 0.30, "Nuclear": 0.85}

# "force_fix": - if False, the source_capacity_factors dictionary values are applied only to the 
#                "estimated_generation_gwh" values whose Capacity Factor is > 1
#              - if True, all the "estimated_generation_gwh" values are fixed with the source_capacity_factors dictionary values
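def fix_estimated_generation(gpp_df, source_capacity_factors, force_fix=False):
    # Capacity Factor = yearly generation / maximum possible yearly generation (MW * 24 h * 365 days / 1000 = GWh)
    gpp_df["capacity_factor"] = np.where(gpp_df["capacity_mw"] > 0, gpp_df["estimated_generation_gwh"] / (gpp_df["capacity_mw"]*24*365/1000), 0)
    # Iterate over the index labels (not positions), so .loc also works when the index is not 0..n-1
    for idx in gpp_df.index:
        if (gpp_df.loc[idx, 'capacity_factor'] > 1) or force_fix:
            # Every fuel that needs fixing must be a key of source_capacity_factors (KeyError otherwise)
            gpp_df.loc[idx, 'capacity_factor'] = source_capacity_factors[gpp_df.loc[idx, "primary_fuel"]]
            gpp_df.loc[idx, 'estimated_generation_gwh'] = gpp_df.loc[idx, "capacity_factor"] * gpp_df.loc[idx, "capacity_mw"] * 24*365/1000
    return gpp_df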

# Uncomment the next two lines if you have not loaded the Power Plants file yet
# global_power_plants = pd.read_csv('../input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv')
# global_power_plants = fix_estimated_generation(global_power_plants, source_capacity_factors)

<h2>The issue</h2>
While browsing the nice analysis in https://www.kaggle.com/parulpandey/understanding-the-data-wip, I realized there seemed to be some inconsistency between the "capacity_mw" and "estimated_generation_gwh" columns. If we represent the "capacity_mw" by source (Coal, Oil, Hydro... in "primary_fuel" column), we get the following plot:

In [None]:
import pandas as pd
global_power_plants = pd.read_csv('../input/ds4g-environmental-insights-explorer/eie_data/gppd/gppd_120_pr.csv')
total_capacity_mw = global_power_plants['capacity_mw'].sum()
print('Total Installed Capacity: '+'{:.2f}'.format(total_capacity_mw) + ' MW')
capacity = (global_power_plants.groupby(['primary_fuel'])['capacity_mw'].sum()).to_frame()
capacity = capacity.sort_values('capacity_mw',ascending=False)
capacity['percentage_of_total'] = (capacity['capacity_mw']/total_capacity_mw)*100
capacity['percentage_of_total'].plot(kind='bar',color=['orange', 'yellow', 'black', 'orange','cyan','blue'], 
                                     title="Capacity per fuel type (%)")

On the other hand, representing the "estimated_generation_gwh" by source shows the following plot:

In [None]:
total_estimated_generation_gwh = global_power_plants['estimated_generation_gwh'].sum()
print('Total Estimated Generation per year: '+'{:.2f}'.format(total_estimated_generation_gwh) + ' GWh')
estimated_generation = (global_power_plants.groupby(['primary_fuel'])['estimated_generation_gwh'].sum()).to_frame()
estimated_generation = estimated_generation.sort_values('estimated_generation_gwh',ascending=False)
estimated_generation['percentage_of_total'] = (estimated_generation['estimated_generation_gwh']/total_estimated_generation_gwh)*100
estimated_generation['percentage_of_total'].plot(kind='bar',color=['orange', 'yellow', 'black', 'orange','cyan','blue'],
                                                title="Annual generation per fuel type (%)")

This would mean that **the coal Power Plants in Puerto Rico (in fact, there is just one), whose capacity is less than 10% of the total, generate more than 90% of the total electricity?** This seemed inconsistent, so I checked the data: the estimated generation for the coal Power Plant is 450,562.69 GWh.

However, according to the [original source of Power Plant data](http://datasets.wri.org/dataset/globalpowerplantdatabase), the biggest Power Plant in the world, the hydro plant at the Three Gorges Dam (China), generated 92,452.6 GWh in a year, which is almost 5 times less!
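A quick arithmetic check of the two comparisons above, using only figures quoted in this Notebook (486,860.88 GWh is the total printed by the generation cell). The variable names are mine, just for illustration.

```python
# Figures quoted in this Notebook (GWh per year)
coal_generation_gwh = 450562.69      # estimated generation of the single coal plant
total_generation_gwh = 486860.88     # total estimated generation in gppd_120_pr.csv
three_gorges_gwh = 92452.6           # Three Gorges Dam, from the GPPD

# Share of the total attributed to coal, and how many times Three Gorges that would be
coal_share = coal_generation_gwh / total_generation_gwh
times_three_gorges = coal_generation_gwh / three_gorges_gwh

print(f"Coal share of total generation: {coal_share:.1%}")
print(f"Coal plant vs Three Gorges: {times_three_gorges:.2f}x")
```

Both numbers point the same way: a single plant with under 10% of the installed capacity cannot plausibly out-generate the largest plant in the world several times over.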

I guessed there should be a concept relating the capacity and the energy generation, and it does exist: the [Capacity Factor](https://en.wikipedia.org/wiki/Capacity_factor) is "the unitless ratio of an actual electrical energy output over a given period of time to the maximum possible electrical energy output over that period". This means the Capacity Factor must satisfy 0 ≤ CF ≤ 1.

*Note: for the ratio to be properly unitless, the numerator and denominator must use the same units. The "maximum possible electrical energy output" in a year is what the plant would produce running at its full capacity (in MW, megawatts) for the whole year: multiply the capacity by 24 hours/day and 365 days/year to get MWh, then divide by 1000 to get GWh (gigawatt hours).*
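As a minimal sketch of that unit conversion, the two helper functions below are hypothetical (they are not part of the dataset or of any library), and the 100 MW plant is a made-up example rather than a plant from gppd_120_pr.csv.

```python
HOURS_PER_YEAR = 24 * 365  # non-leap year, as in the note above


def max_annual_generation_gwh(capacity_mw):
    """Energy (GWh) produced by running at full capacity for a whole year."""
    # MW * hours = MWh; dividing by 1000 converts MWh to GWh
    return capacity_mw * HOURS_PER_YEAR / 1000


def capacity_factor(generation_gwh, capacity_mw):
    """Unitless ratio of actual to maximum possible yearly generation."""
    return generation_gwh / max_annual_generation_gwh(capacity_mw)


# Made-up example: a 100 MW plant that generated 438 GWh in a year
example_max_gwh = max_annual_generation_gwh(100.0)
example_cf = capacity_factor(generation_gwh=438.0, capacity_mw=100.0)
print(example_max_gwh, example_cf)
```

A 100 MW plant can produce at most 876 GWh in a year, so 438 GWh corresponds to a Capacity Factor of 0.5.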

Thus I added a capacity_factor column:

In [None]:
global_power_plants["capacity_factor"] = global_power_plants["estimated_generation_gwh"]/(global_power_plants["capacity_mw"]*24*365/1000)
global_power_plants[["name", "capacity_mw", "primary_fuel", "estimated_generation_gwh", "capacity_factor"]]

We can see the Capacity Factor is 113 (11300%) for that coal Power Plant, which is not possible.

Also, the Capacity Factor for the hydro Power Plants is 5.2 (520%), which is not possible either. However, this is much less important, since those are very low capacity plants compared to the Oil and Coal Power Plants.

Finally, we can see that all the plants of a given primary_fuel have EXACTLY the same Capacity Factor. Since the "capacity_mw" values are particular to each plant, this means the "estimated_generation_gwh" values are not measurements: they have been computed to maintain a fixed ratio (a Capacity Factor, though not always a reasonable one) with the corresponding "capacity_mw" values.
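One way to check this kind of pattern programmatically is to count the distinct Capacity Factor values per fuel. The sketch below uses a small toy DataFrame with made-up numbers (not values from gppd_120_pr.csv); only the column names follow the dataset.

```python
import pandas as pd

HOURS_PER_YEAR = 24 * 365

# Toy data with made-up values: Hydro and Oil share one CF each, Solar does not
toy_plants = pd.DataFrame({
    "primary_fuel": ["Hydro", "Hydro", "Oil", "Oil", "Solar", "Solar"],
    "capacity_mw": [10.0, 25.0, 200.0, 600.0, 5.0, 10.0],
    "estimated_generation_gwh": [35.04, 87.6, 876.0, 2628.0, 8.76, 26.28],
})

toy_plants["capacity_factor"] = toy_plants["estimated_generation_gwh"] / (
    toy_plants["capacity_mw"] * HOURS_PER_YEAR / 1000
)

# Round before counting so floating-point noise does not create spurious distinct values
distinct_cf_per_fuel = (
    toy_plants.assign(cf_rounded=toy_plants["capacity_factor"].round(6))
    .groupby("primary_fuel")["cf_rounded"]
    .nunique()
)
print(distinct_cf_per_fuel)
```

A count of 1 for a fuel with several plants of different capacities suggests the generation values were derived from the capacity rather than measured.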

Consequently, we must choose reasonable Capacity Factors per source and fix the estimated generation accordingly; the following Capacity Factor ranges are typical in the USA (just as a reference):
* Coal: between 50-60% (sometimes even 70%); I chose 55%
* Hydro: 40-50%, but can be much less with long drought periods; I chose 40%

For the other sources, the CF values derived from the provided capacity and generation data are < 1, so they can stay as they are; alternatively, they can also be replaced with the following values (I have added an optional force_fix flag for this):
* Gas: 80%
* Oil: 64%
* Solar: 25%; the CF derived from the values in gppd_120_pr.csv, 5%, is very low, and such plants would hardly be profitable
* Wind: 30%; the CF derived from the values in gppd_120_pr.csv, 2%, is very low, and such plants would hardly be profitable
* Nuclear: 85%; there are no nuclear power plants in Puerto Rico, but this value is included in case another dataset with nuclear power plants is provided

The code for the fix (the source_capacity_factors dictionary and the fix_estimated_generation function) is at the top of the Notebook.
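For larger datasets, the same idea can be expressed without a row-by-row loop. The following is only a sketch of a possible vectorized variant, not the function used in this Notebook: the name fix_estimated_generation_vectorized is hypothetical, it returns a modified copy instead of changing the DataFrame in place, and it leaves untouched any row whose fuel is missing from the dictionary (the original function raises a KeyError in that case). The toy data is made up.

```python
import pandas as pd

HOURS_PER_YEAR = 24 * 365


def fix_estimated_generation_vectorized(gpp_df, source_capacity_factors, force_fix=False):
    """Hypothetical vectorized variant; returns a fixed copy of gpp_df."""
    fixed = gpp_df.copy()
    max_generation_gwh = fixed["capacity_mw"] * HOURS_PER_YEAR / 1000
    # CF is set to 0 for plants with no capacity, as in the loop-based function
    capacity_factor = (fixed["estimated_generation_gwh"] / max_generation_gwh).where(
        fixed["capacity_mw"] > 0, 0.0
    )
    reference_cf = fixed["primary_fuel"].map(source_capacity_factors)
    # Fix impossible values (CF > 1), or everything when force_fix is True,
    # but only where a reference Capacity Factor exists for the fuel
    needs_fix = ((capacity_factor > 1) | force_fix) & reference_cf.notna()
    fixed["capacity_factor"] = capacity_factor.where(~needs_fix, reference_cf)
    fixed["estimated_generation_gwh"] = fixed["estimated_generation_gwh"].where(
        ~needs_fix, reference_cf * max_generation_gwh
    )
    return fixed


# Made-up example: one impossible coal value (CF of about 113) and one plausible oil value (CF 0.5)
toy_plants = pd.DataFrame({
    "primary_fuel": ["Coal", "Oil"],
    "capacity_mw": [100.0, 200.0],
    "estimated_generation_gwh": [99000.0, 876.0],
})
reference_factors = {"Coal": 0.55, "Oil": 0.64}

fixed_default = fix_estimated_generation_vectorized(toy_plants, reference_factors)
fixed_forced = fix_estimated_generation_vectorized(toy_plants, reference_factors, force_fix=True)
print(fixed_default)
print(fixed_forced)
```

With the default flag only the coal row changes; with force_fix=True the oil row is also recomputed from the dictionary value.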

After applying the fix, the table and the estimated generation plot become:

In [None]:
global_power_plants = fix_estimated_generation(global_power_plants, source_capacity_factors)
global_power_plants[["name", "capacity_mw", "primary_fuel", "estimated_generation_gwh", "capacity_factor"]]    

In [None]:
total_estimated_generation_gwh = global_power_plants['estimated_generation_gwh'].sum()
print('Total Estimated Generation per year: '+'{:.2f}'.format(total_estimated_generation_gwh) + ' GWh')
estimated_generation = (global_power_plants.groupby(['primary_fuel'])['estimated_generation_gwh'].sum()).to_frame()
estimated_generation = estimated_generation.sort_values('estimated_generation_gwh',ascending=False)
estimated_generation['percentage_of_total'] = (estimated_generation['estimated_generation_gwh']/total_estimated_generation_gwh)*100
estimated_generation['percentage_of_total'].plot(kind='bar',color=['orange', 'yellow', 'black', 'orange','cyan','blue'],
                                                title="Annual generation per fuel type (%)")

Now the generation plot is much more consistent with the capacity plot, as expected.

Also, the total estimated generation per year has dropped from 486860.88 GWh to 33913.69 GWh (14 times less!). This can have an impact on the NO2 emission factor attributed to Power Plants, since they are 14 times less guilty than the provided data suggests, compared to other sources such as cars.
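The reduction factor can be reproduced from the two totals quoted above; the variable names are only illustrative.

```python
# Totals quoted above (GWh per year)
generation_before_gwh = 486860.88   # with the provided estimated_generation_gwh values
generation_after_gwh = 33913.69     # after applying fix_estimated_generation

reduction_ratio = generation_before_gwh / generation_after_gwh
print(f"Total estimated generation reduced by a factor of {reduction_ratio:.1f}")
```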