# Final Project Phase 2 Summary

# Data Collection and Cleaning
You are required to provide data collection and cleaning for the minimum of three (3) required datasets. For each of the following sections, create a function that reads or scrapes data from a file or website, manipulates and cleans the parsed data, and writes the cleaned data to a new file.

Make sure your data cleaning and manipulation process is not too simple. Performing complex manipulation and using modules not taught in class show effort, which will increase your chance of receiving full credit.
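
As a rough illustration of the read, clean, and write pattern described above, here is a minimal sketch. The function name `clean_and_write`, the sample rows, and the column names are all made up for this example; a real dataset will need its own cleaning rules.

```python
import io
import os
import tempfile

import pandas as pd


def clean_and_write(source, out_path):
    """Read a CSV, clean it, and write the cleaned table to a new file."""
    df = pd.read_csv(source)

    # Normalize column names so later steps can rely on one spelling.
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

    # Strip stray whitespace so " Alpha " and "Alpha" compare as equal.
    df["plant_name"] = df["plant_name"].str.strip()

    # Drop exact duplicate rows, then rows without a capacity value.
    df = df.drop_duplicates().dropna(subset=["capacity_mw"])

    df.to_csv(out_path, index=False)
    return df


# Hypothetical sample input: one duplicated row and one missing capacity.
raw = io.StringIO(
    "Plant Name,Capacity MW\n"
    " Alpha ,100\n"
    " Alpha ,100\n"
    "Beta,\n"
    "Gamma,250\n"
)
out_path = os.path.join(tempfile.mkdtemp(), "plants_clean.csv")
cleaned = clean_and_write(raw, out_path)
print(cleaned)
```

Returning the cleaned frame as well as writing it keeps the function easy to inspect in a notebook cell.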


## Data Sources
Include links to the sources of your datasets. Add any additional data sources if needed. Clearly indicate if a data source differs from one submitted in Phase 1, as we will check that it satisfies the requirements.
*   Downloaded Dataset Source: [Global Power Plant Database](https://datasets.wri.org/dataset/globalpowerplantdatabase) and [Global Power Plant Emissions Database](http://meicmodel.org/?page_id=91&lang=en)
*   Web Collection #1 Source: [WhatToMine Coins](https://whattomine.com/coins.json) and [WhatToMine GPUs](https://whattomine.com/gpus)
*   Web Collection #2 Source: [CoinGecko](https://www.coingecko.com/en/apiBlo)



## Downloaded Dataset Requirement

Fill in the predefined functions with your data scraping/parsing code. You may modify or rename each function as you see fit, but you must provide at least 3 separate functions, one to clean each of your required datasets.


In [2]:
import pandas as pd
import numpy as np
import requests

In [3]:
def global_power():
  # IMPORT FILES
  # forward slashes work on every OS and avoid invalid "\g" escape sequences
  generation_path = "data/global-power-plants/global_power_plant_database.csv"
  emission_path = "data/global-power-plants/global_power_emissions_database.xlsx"
  
  # power generation csv
  ppg = pd.read_csv(generation_path, encoding='utf8', low_memory=False)
  
  # power plant emissions xlsx (column names are on the second row of the sheet)
  ppe = pd.read_excel(emission_path, sheet_name='GPED_v1.0_Plant Level', header=1)

  # removing unwanted data
  unwanted_columns = ['latitude',
                      'longitude',
                      'other_fuel1',
                      'other_fuel2',
                      'other_fuel3',
                      'commissioning_year',
                      'gppd_idnr',
                      'owner',
                      'source',
                      'url',
                      'geolocation_source',
                      'wepp_id',
                      'year_of_capacity_data',
                      'generation_data_source',
                      'generation_gwh_2018',
                      'generation_gwh_2019',
                      'estimated_generation_note_2013',
                      'estimated_generation_note_2014',
                      'estimated_generation_note_2015',
                      'estimated_generation_note_2016',
                      'estimated_generation_note_2017']
  ppg.drop(unwanted_columns, axis=1, inplace=True)

  unwanted_columns = ['No.', 'Number of Units', 'Total Plant Installed Capacity (MW)']
  ppe.drop(unwanted_columns, axis=1, inplace=True)

  # AGGREGATING DATA
  avgs = ['generation_gwh_2013',
          'generation_gwh_2014',
          'generation_gwh_2015',
          'generation_gwh_2016',
          'generation_gwh_2017']
  ppg['AVG_GENERATION'] = ppg[avgs].mean(axis=1)

  avgs = ['estimated_generation_gwh_2013',
          'estimated_generation_gwh_2014',
          'estimated_generation_gwh_2015',
          'estimated_generation_gwh_2016',
          'estimated_generation_gwh_2017']
  ppg['AVG_EST_GENERATION'] = ppg[avgs].mean(axis=1)

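  # merge the two average columns into a single column: np.fmax keeps the
  # larger value and ignores NaN when only one of the two is missing
  # (note: the source columns are in GWh, despite the "_MW" suffix)
  ppg['GENERATION_MW'] = np.fmax(ppg['AVG_GENERATION'], ppg['AVG_EST_GENERATION'])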

  # remove unaggregated columns
  unwanted_columns = ['AVG_GENERATION',
                      'AVG_EST_GENERATION',
                      'generation_gwh_2013',
                      'generation_gwh_2014',
                      'generation_gwh_2015',
                      'generation_gwh_2016',
                      'generation_gwh_2017',
                      'estimated_generation_gwh_2013',
                      'estimated_generation_gwh_2014',
                      'estimated_generation_gwh_2015',
                      'estimated_generation_gwh_2016',
                      'estimated_generation_gwh_2017']
  ppg.drop(unwanted_columns, axis=1, inplace=True)

  # merge rows by fuel type
  generation_dist = ppg.groupby('primary_fuel')['GENERATION_MW'].sum().sort_values(ascending=False)
  emission_dist = ppe.groupby('Fuel Types').aggregate({'CO2 Emissions (Mg)':'sum',
                                                       'SO2 Emissions (Mg)':'sum',
                                                       'NOx Emissions (Mg)':'sum',
                                                       'PM2.5 Emissions (Mg)':'sum'})

  # COMBINING TABLES
  generation_dist.index = generation_dist.index.str.upper()
  generation_dist.drop('WAVE AND TIDAL', axis=0, inplace=True)
  gen_other = ['PETCOKE','WASTE','COGENERATION','STORAGE','NUCLEAR']
  generation_dist['OTHER'] = generation_dist[gen_other].sum(axis=0)
  generation_dist.drop(gen_other, axis=0, inplace=True)
  emission_dist = emission_dist.rename(index={'NG':'GAS'})

  power = pd.concat([generation_dist, emission_dist], axis=1)
  power = power.fillna(0)

  return power

############ Function Call ############
global_power()

| Fuel type | GENERATION_MW | CO2 Emissions (Mg) | SO2 Emissions (Mg) | NOx Emissions (Mg) | PM2.5 Emissions (Mg) |
|---|---|---|---|---|---|
| COAL | 9960694.0 | 8880799000.0 | 29734190.0 | 18480380.0 | 2508320.0 |
| GAS | 6304916.0 | 2518846000.0 | 52287.39 | 3440040.0 | 41447.6 |
| HYDRO | 3755360.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| WIND | 709442.1 | 0.0 | 0.0 | 0.0 | 0.0 |
| OIL | 536091.7 | 737013900.0 | 8689335.0 | 2723445.0 | 93044.16 |
| SOLAR | 348600.2 | 0.0 | 0.0 | 0.0 | 0.0 |
| GEOTHERMAL | 60833.82 | 0.0 | 0.0 | 0.0 | 0.0 |
| BIOMASS | 33689.12 | 119606700.0 | 179237.6 | 174493.0 | 37133.16 |
| OTHER | 3011469.0 | 274978500.0 | 150082.7 | 338207.3 | 10422.5 |
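
The merge step inside `global_power` combines the reported and estimated averages with `np.fmax`. The sketch below, on made-up numbers, shows how that call treats missing values and how it differs from a "reported value first" rule built with `fillna`. The `demo` frame and the `GENERATION` and `REPORTED_FIRST` columns exist only for this illustration; which rule is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Made-up averages: reported values on the left, estimates on the right.
demo = pd.DataFrame({
    "AVG_GENERATION":     [90.0, np.nan, 80.0, np.nan],
    "AVG_EST_GENERATION": [100.0, 45.0, np.nan, np.nan],
})

# np.fmax returns the larger value and falls back to the non-NaN side
# when only one value is missing; it is NaN only if both are missing.
demo["GENERATION"] = np.fmax(demo["AVG_GENERATION"], demo["AVG_EST_GENERATION"])

# A different rule, for comparison: keep the reported figure whenever it
# exists and use the estimate only as a fallback.
demo["REPORTED_FIRST"] = demo["AVG_GENERATION"].fillna(demo["AVG_EST_GENERATION"])

print(demo)
```

In the first row the two rules disagree: `np.fmax` picks the estimate because it is larger, while the fallback rule keeps the reported value.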


## Web Collection Requirement \#1


In [4]:
def web_parser1():
  pass





############ Function Call ############
web_parser1()

## Web Collection Requirement \#2

In [5]:
def web_parser2():
  pass





############ Function Call ############
web_parser2()

## Additional Dataset Parsing/Cleaning Functions

Write any supplemental (optional) functions here.

In [6]:
def extra_source1():
    pass

    
############ Function Call ############
extra_source1()

In [7]:
# Define further extra source functions as necessary

# Inconsistencies
For each inconsistency (NaN, null, duplicate values, empty strings, etc.) you discover in your datasets, write at least 2 sentences stating the significance, how you identified it, and how you handled it.

1. 

2. 

3. 

4. (if applicable)

5. (if applicable)
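
One possible way to find the kinds of inconsistency named above (NaN/null values, duplicate rows, empty strings) is a small audit helper run on each dataset before cleaning. This is only a sketch: `audit`, `count_empty_strings`, and the sample frame are hypothetical names and data, not part of the project template.

```python
import numpy as np
import pandas as pd


def count_empty_strings(col):
    """Count values that are strings containing only whitespace (or nothing)."""
    return int(col.map(lambda v: isinstance(v, str) and v.strip() == "").sum())


def audit(df):
    """Return per-column counts of missing and empty values, plus duplicate rows."""
    report = pd.DataFrame({
        "missing": df.isna().sum(),
        "empty_strings": df.apply(count_empty_strings),
    })
    return report, int(df.duplicated().sum())


# Made-up sample: one empty string, one duplicated row, one missing number.
sample = pd.DataFrame({
    "primary_fuel": ["Coal", "", "Gas", "Gas", "Hydro"],
    "capacity_mw": [500.0, 120.0, 300.0, 300.0, np.nan],
})

report, n_duplicates = audit(sample)
print(report)
print("duplicate rows:", n_duplicates)
```

The counts give a concrete answer to "how you identified it" for each item in the list; how to handle each case (drop, fill, or keep) remains a judgment call per dataset.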
