# Final Project Phase 2 Summary

# Data Collection and Cleaning
You are required to provide data collection and cleaning for the three (3) minimum datasets. Create a function for each of the following sections that reads or scrapes data from a file or website, manipulates and cleans the parsed data, and writes the cleaned data into a new file.

Make sure your data cleaning and manipulation process is not too simple. Performing complex manipulation and using modules not taught in class shows effort, which will increase the chance of receiving full credit.
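
As a rough illustration of the read, clean, and write pattern described above, here is a minimal sketch using pandas. The function name `read_clean_write`, the file paths, and the specific cleaning steps are hypothetical choices for illustration, not part of the assignment template.

```python
import pandas as pd


def read_clean_write(in_path, out_path):
    """Read a CSV, apply a few generic cleaning steps, and save the result.

    Hypothetical helper: the name and the cleaning steps are illustrative only.
    """
    df = pd.read_csv(in_path)

    # Normalize column names: trim whitespace, lowercase, use underscores
    df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]

    # Drop rows that are entirely empty, then drop exact duplicate rows
    df = df.dropna(how='all').drop_duplicates()

    # Write the cleaned table to a new file without the row index
    df.to_csv(out_path, index=False)
    return df
```

A real project function would add dataset-specific steps (type conversion, unit handling, aggregation) between the read and the write.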


## Data Sources
Include sources (as links) to your datasets. Add any additional data sources if needed. Clearly indicate if a data source is different from one submitted in your Phase I, as we will check that it satisfies the requirements.
*   Downloaded Dataset Source: [Global Power Plant Database](https://datasets.wri.org/dataset/globalpowerplantdatabase) and [Global Power Plant Emissions Database](http://meicmodel.org/?page_id=91&lang=en)
*   Web Collection #1 Source: [WhatToMine Coins](https://whattomine.com/coins.json) and [WhatToMine GPUs](https://whattomine.com/gpus)
*   Web Collection #2 Source: [CoinGecko](https://www.coingecko.com/en/apiBlo)



## Downloaded Dataset Requirement

Fill in the predefined functions with your data scraping/parsing code. You may modify/rename each function as you see fit, but you must provide at least 3 separate functions that clean each of your required datasets.


In [1]:
import json
import re
from pprint import pprint

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [3]:
def global_power():
  # IMPORT FILES
  # raw strings keep the Windows-style backslashes from being read as escape sequences
  generation_path = r"data\global-power-plants\global_power_plant_database.csv"
  emission_path = r"data\global-power-plants\global_power_emissions_database.xlsx"
  
  # power generation csv
  with open(generation_path, encoding='utf8') as fin:
    ppg = pd.read_csv(fin, low_memory=False)
  
  # power plant emissions xlsx
  ppe = pd.read_excel(emission_path, sheet_name='GPED_v1.0_Plant Level', skiprows=0, header=1)

  # removing unwanted data
  unwanted_columns = ['latitude',
                      'longitude',
                      'other_fuel1',
                      'other_fuel2',
                      'other_fuel3',
                      'commissioning_year',
                      'gppd_idnr',
                      'owner',
                      'source',
                      'url',
                      'geolocation_source',
                      'wepp_id',
                      'year_of_capacity_data',
                      'generation_data_source',
                      'generation_gwh_2018',
                      'generation_gwh_2019',
                      'estimated_generation_note_2013',
                      'estimated_generation_note_2014',
                      'estimated_generation_note_2015',
                      'estimated_generation_note_2016',
                      'estimated_generation_note_2017']
  ppg.drop(unwanted_columns, axis=1, inplace=True)

  unwanted_columns = ['No.', 'Number of Units', 'Total Plant Installed Capacity (MW)']
  ppe.drop(unwanted_columns, axis=1, inplace=True)

  # AGGREGATING DATA
  avgs = ['generation_gwh_2013',
          'generation_gwh_2014',
          'generation_gwh_2015',
          'generation_gwh_2016',
          'generation_gwh_2017']
  ppg['AVG_GENERATION'] = ppg[avgs].mean(axis=1)

  avgs = ['estimated_generation_gwh_2013',
          'estimated_generation_gwh_2014',
          'estimated_generation_gwh_2015',
          'estimated_generation_gwh_2016',
          'estimated_generation_gwh_2017']
  ppg['AVG_EST_GENERATION'] = ppg[avgs].mean(axis=1)

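  # merge the two average columns into a single column; np.fmax ignores NaN,
  # so a row keeps whichever average is available (or the larger if both are)
  # note: the source columns are in GWh, even though the new column is named GENERATION_MW
  ppg['GENERATION_MW'] = np.fmax(ppg['AVG_GENERATION'], ppg['AVG_EST_GENERATION'])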

  # remove unaggregated columns
  unwanted_columns = ['AVG_GENERATION',
                      'AVG_EST_GENERATION',
                      'generation_gwh_2013',
                      'generation_gwh_2014',
                      'generation_gwh_2015',
                      'generation_gwh_2016',
                      'generation_gwh_2017',
                      'estimated_generation_gwh_2013',
                      'estimated_generation_gwh_2014',
                      'estimated_generation_gwh_2015',
                      'estimated_generation_gwh_2016',
                      'estimated_generation_gwh_2017']
  ppg.drop(unwanted_columns, axis=1, inplace=True)

  # merge rows by fuel type
  generation_dist = ppg.groupby('primary_fuel')['GENERATION_MW'].sum().sort_values(ascending=False)
  emission_dist = ppe.groupby('Fuel Types').aggregate({'CO2 Emissions (Mg)':'sum',
                                                       'SO2 Emissions (Mg)':'sum',
                                                       'NOx Emissions (Mg)':'sum',
                                                       'PM2.5 Emissions (Mg)':'sum'})

  # COMBINING TABLES
  generation_dist.index = generation_dist.index.str.upper()
  generation_dist.drop('WAVE AND TIDAL', axis=0, inplace=True)
  gen_other = ['PETCOKE','WASTE','COGENERATION','STORAGE','NUCLEAR']
  generation_dist['OTHER'] = generation_dist[gen_other].sum(axis=0)
  generation_dist.drop(gen_other, axis=0, inplace=True)
  emission_dist = emission_dist.rename(index={'NG':'GAS'})

  power = pd.concat([generation_dist, emission_dist], axis=1)
  power = power.fillna(0)

  return power

############ Function Call ############
global_power()

| | GENERATION_MW | CO2 Emissions (Mg) | SO2 Emissions (Mg) | NOx Emissions (Mg) | PM2.5 Emissions (Mg) |
|---|---|---|---|---|---|
| COAL | 9960694.0 | 8880799000.0 | 29734190.0 | 18480380.0 | 2508320.0 |
| GAS | 6304916.0 | 2518846000.0 | 52287.39 | 3440040.0 | 41447.6 |
| HYDRO | 3755360.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| WIND | 709442.1 | 0.0 | 0.0 | 0.0 | 0.0 |
| OIL | 536091.7 | 737013900.0 | 8689335.0 | 2723445.0 | 93044.16 |
| SOLAR | 348600.2 | 0.0 | 0.0 | 0.0 | 0.0 |
| GEOTHERMAL | 60833.82 | 0.0 | 0.0 | 0.0 | 0.0 |
| BIOMASS | 33689.12 | 119606700.0 | 179237.6 | 174493.0 | 37133.16 |
| OTHER | 3011469.0 | 274978500.0 | 150082.7 | 338207.3 | 10422.5 |
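
The step in `global_power()` that merges `AVG_GENERATION` and `AVG_EST_GENERATION` depends on how `np.fmax` treats missing values. The following sketch shows that behavior on a toy example; the numbers and the Series names `reported` and `estimated` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values: a plant with both figures, one with only an estimate, one with neither
reported = pd.Series([1.0, np.nan, np.nan])
estimated = pd.Series([2.0, 3.0, np.nan])

# np.fmax returns the element-wise maximum and ignores NaN when the other value exists
merged = np.fmax(reported, estimated)

# np.maximum, by contrast, propagates NaN whenever either input is NaN
propagated = np.maximum(reported, estimated)
```

Note that `np.fmax` takes the larger of the two values when both are present. If the reported figure should always win over the estimate, `reported.fillna(estimated)` may be a closer fit.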


## Web Collection Requirement \#1


In [16]:
def web_scrape():
    # download and parse the GPU page once
    response = requests.get('https://whattomine.com/gpus')
    soup = BeautifulSoup(response.text, 'html.parser')

    # hash rates: collect every string in the 'position-relative' divs,
    # then keep every other entry (the hash-rate strings, e.g. '114.00 Mh/s')
    hashlist = []
    for hasher in soup.find_all('div', {'class': 'position-relative'}):
        hashlist.extend(hasher.stripped_strings)
    hashlist2 = hashlist[::2]
    hashdict = {'Hashrate(Millions of Hash Per Sec)': hashlist2}
    print(hashdict)

    # 24-hour revenues: strings from the highlighted revenue cells
    revlist = []
    for rev in soup.find_all('td', {'class': 'text-right table-success font-weight-bold'}):
        revlist.extend(rev.stripped_strings)
    revdict = {'24Hour Revenue': revlist}
    print(revdict)

    # GPU names: drop the '(*)' markers so the names fall on a regular stride
    namelist = []
    for name in soup.find_all('td'):
        namelist.extend(n for n in name.stripped_strings if n != '(*)')
    namelist2 = namelist[1:650:16]
    namedict = {'GPU Model': namelist2}
    print(namedict)

    # this dictionary matches GPU model to the hash rate and 24 hour revenue
    fulldict = {}
    for model, rate, revenue in zip(namelist2, hashlist2, revlist):
        fulldict[model] = {'Hashrate(Millions of Hash Per Sec)': rate, '24Hour Revenue': revenue}
    print(fulldict)

    # creates a data frame with one row per GPU model and one column per measurement
    data = [hashlist2, revlist]
    df = pd.DataFrame(data, index=['Hashrate (Millions of Hashes Per Sec)', '24 Hour Revenue'], columns=namelist2).T
    return df

############ Function Call ############
web_scrape()

{'Hashrate(Millions of Hash Per Sec)': ['114.00 Mh/s', '93.00 Mh/s', '91.50 Mh/s', '230.00 Mh/s', '64.00 Mh/s', '64.00 Mh/s', '64.00 Mh/s', '58.10 Mh/s', '58.10 Mh/s', '2.45 h/s', '55.00 Mh/s', '55.00 Mh/s', '170.00 Mh/s', '160.00 Mh/s', '155.00 Mh/s', '48.00 Mh/s', '48.00 Mh/s', '26.50 Mh/s', '40.50 Mh/s', '41.00 Mh/s', '40.00 Mh/s', '40.00 Mh/s', '40.00 Mh/s', '37.00 Mh/s', '39.00 Mh/s', '34.00 Mh/s', '29.30 Mh/s', '30.50 Mh/s', '30.00 Mh/s', '28.00 Mh/s', '30.00 Mh/s', '29.00 Mh/s', '26.00 Mh/s', '32.50 h/s', '23.00 Mh/s', '22.50 Mh/s', '17.00 Mh/s', '13.00 Mh/s', '12.50 Mh/s', '6.10 Mh/s', '33.00 Mh/s']}
{'24Hour Revenue': ['$6.63', '$5.56', '$5.39', '$4.30', '$3.79', '$3.79', '$3.79', '$3.46', '$3.46', '$3.40', '$3.26', '$3.26', '$3.24', '$3.03', '$2.94', '$2.85', '$2.78', '$2.37', '$2.36', '$2.28', '$2.26', '$2.26', '$2.26', '$2.14', '$2.10', '$1.80', '$1.69', '$1.67', '$1.66', '$1.63', '$1.61', '$1.55', '$1.47', '$1.32', '$1.30', '$1.24', '$1.09', '$1.00', '$0.97', '$0.46', '$0.

| | Hashrate (Millions of Hashes Per Sec) | 24 Hour Revenue |
|---|---|---|
| GeForce RTX 3090 | 114.00 Mh/s | $6.63 |
| Radeon VII | 93.00 Mh/s | $5.56 |
| GeForce RTX 3080 | 91.50 Mh/s | $5.39 |
| GeForce RTX 3080 Ti | 230.00 Mh/s | $4.30 |
| Radeon RX 6800 | 64.00 Mh/s | $3.79 |
| Radeon RX 6900 XT | 64.00 Mh/s | $3.79 |
| Radeon RX 6800 XT | 64.00 Mh/s | $3.79 |
| GeForce RTX 3060 Ti | 58.10 Mh/s | $3.46 |
| GeForce RTX 3070 | 58.10 Mh/s | $3.46 |
| GeForce RTX 2080 Ti | 2.45 h/s | $3.40 |
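
The scraped hash rates and revenues are still text (for example `114.00 Mh/s` and `$6.63`), and at least one row uses a different unit (`h/s`). One possible cleaning step, sketched below, splits each hash rate into a number and a unit and converts revenue to a float. The helper names and the output column names are hypothetical, and the sketch deliberately keeps the unit rather than converting it, since the source does not say how the units relate.

```python
import re

import pandas as pd

# A number, optional whitespace, then a unit ending in 'h/s' (e.g. 'Mh/s' or 'h/s')
RATE_PATTERN = re.compile(r'^\s*([\d.,]+)\s*([A-Za-z]*h/s)\s*$')


def split_hashrate(text):
    """Return (value, unit) for strings like '114.00 Mh/s', or (None, None)."""
    match = RATE_PATTERN.match(text)
    if match is None:
        return (None, None)
    return (float(match.group(1).replace(',', '')), match.group(2))


def parse_dollars(text):
    """Convert strings like '$6.63' to a float."""
    return float(text.replace('$', '').replace(',', ''))


def clean_gpu_table(df):
    """Build a numeric table from the text table returned by web_scrape()."""
    parsed = df['Hashrate (Millions of Hashes Per Sec)'].map(split_hashrate)
    out = pd.DataFrame(index=df.index)
    out['hashrate'] = [value for value, _ in parsed]       # hypothetical column name
    out['hashrate_unit'] = [unit for _, unit in parsed]    # hypothetical column name
    out['revenue_usd_24h'] = df['24 Hour Revenue'].map(parse_dollars)
    return out
```

Keeping the unit in its own column makes rows such as the `2.45 h/s` entry easy to spot and handle separately before plotting.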


## Web Collection Requirement \#2

In [5]:
def web_parser2():
  pass





############ Function Call ############
web_parser2()
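
Since `web_parser2()` is still a stub, the sketch below shows one way the CoinGecko collection could look. It assumes the public `/api/v3/coins/markets` endpoint and the field names `id`, `symbol`, `name`, `current_price`, `market_cap`, and `total_volume`; these should be checked against the current CoinGecko API documentation before use. The function names `fetch_markets` and `clean_markets` are hypothetical.

```python
import pandas as pd

# Assumed endpoint; verify against the CoinGecko API documentation
MARKETS_URL = 'https://api.coingecko.com/api/v3/coins/markets'


def fetch_markets(vs_currency='usd'):
    """Download the market listing as a list of dicts (one per coin)."""
    import requests  # imported here so clean_markets() can be used offline

    response = requests.get(MARKETS_URL, params={'vs_currency': vs_currency}, timeout=30)
    response.raise_for_status()
    return response.json()


def clean_markets(records):
    """Keep a few assumed fields, drop incomplete and duplicate coins."""
    wanted = ['id', 'symbol', 'name', 'current_price', 'market_cap', 'total_volume']
    df = pd.DataFrame(records).reindex(columns=wanted)

    # A coin without an id or a price is not usable for later comparisons
    df = df.dropna(subset=['id', 'current_price']).drop_duplicates(subset='id')

    # Ticker symbols are upper-cased so they can be matched against other sources
    df['symbol'] = df['symbol'].astype('string').str.upper()
    return df.set_index('id')
```

Separating the download from the cleaning keeps the cleaning step testable on a saved response, for example `clean_markets(fetch_markets()).to_csv('coingecko_clean.csv')`, where the output file name is only a placeholder.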

## Additional Dataset Parsing/Cleaning Functions

Write any supplemental (optional) functions here.

In [13]:
import json
#data with details on specific crypto - to utilize when visualizing 
def coin_info():
    url = requests.get('https://whattomine.com/coins.json')
    j = url.json()
    pprint(j)
############ Function Call ############
coin_info()

{'coins': {'01coin': {'algorithm': 'NeoScrypt',
                      'block_reward': 10.924,
                      'block_reward24': 10.924,
                      'block_time': '159.0',
                      'btc_revenue': '0.00004603',
                      'btc_revenue24': '0.00003114',
                      'difficulty': 0.460923324,
                      'difficulty24': 0.724959909116895,
                      'estimated_rewards': '979.34868',
                      'estimated_rewards24': '662.46804',
                      'exchange_rate': 4.7e-08,
                      'exchange_rate24': 4.70000000000007e-08,
                      'exchange_rate_curr': 'BTC',
                      'exchange_rate_vol': 0.2252778666741,
                      'id': 338,
                      'lagging': False,
                      'last_block': 627965,
                      'market_cap': '$22,791.76',
                      'nethash': 12450632,
                      'profitability': 26,
              

In [20]:
def hashrates():
    #will use this for graphical interpretation 
    with open ('hash-rate.json', 'r') as f:
        jsondict = json.load(f)
        pprint(jsondict)
############ Function Call ############
hashrates()

{'description': 'The estimated number of tera hashes per second (trillions of '
                'hashes per second) the Bitcoin network is performing.',
 'name': 'Hash Rate',
 'period': 'day',
 'status': 'ok',
 'unit': 'Hash Rate TH/s',
 'values': [{'x': 1594857600, 'y': 122988997.50264357},
            {'x': 1594944000, 'y': 121358766.03401928},
            {'x': 1595030400, 'y': 117847537.53861415},
            {'x': 1595116800, 'y': 116374944.77118158},
            {'x': 1595203200, 'y': 115473898.09459317},
            {'x': 1595289600, 'y': 115544420.5882286},
            {'x': 1595376000, 'y': 113203968.57204916},
            {'x': 1595462400, 'y': 113943058.6824216},
            {'x': 1595548800, 'y': 113943058.6824216},
            {'x': 1595635200, 'y': 119239871.14009087},
            {'x': 1595721600, 'y': 120718051.36083588},
            {'x': 1595808000, 'y': 124043956.85751186},
            {'x': 1595894400, 'y': 124990478.24995872},
            {'x': 1595980800, 'y': 126
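
For plotting, the `values` list shown above is easier to work with as a time-indexed table. The sketch below assumes, as the output suggests, that `x` is a Unix timestamp in seconds and `y` is the hash rate in TH/s. The function name `hashrate_frame` and the column name `hashrate_ths` are hypothetical.

```python
import pandas as pd


def hashrate_frame(payload):
    """Turn the parsed hash-rate JSON into a date-indexed DataFrame."""
    df = pd.DataFrame(payload['values'])

    # Assumption: 'x' holds Unix timestamps in seconds
    df['date'] = pd.to_datetime(df['x'], unit='s')

    # Rename 'y' to a descriptive name and index the rows by date
    df = df.rename(columns={'y': 'hashrate_ths'}).drop(columns='x')
    return df.set_index('date').sort_index()
```

The dictionary loaded in `hashrates()` could be passed straight to this helper if that function returned `jsondict` instead of only printing it.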

# Inconsistencies
For each inconsistency (NaN, null, duplicate values, empty strings, etc.) you discover in your datasets, write at least 2 sentences stating the significance, how you identified it, and how you handled it.

1. In the HTML web scrape, some GPU names were followed by an asterisk marker, `(*)`. This broke the regular index pattern I needed to slice out the names, so I could not iterate over the data cleanly. At first I pulled the names out manually; once I realized the `(*)` entries were the cause, I removed them all at once. The names then fell on a regular index pattern, which cut my code from 15 lines to 1.

2. 

3. 

4. (if applicable)

5. (if applicable)
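
The fix described in the first inconsistency can be sketched in a few lines: filter out the marker strings first, then slice with a fixed stride. The cell values and the stride of 3 below are made up for illustration, and `drop_markers` is a hypothetical helper name.

```python
def drop_markers(cells, marker='(*)'):
    """Return the cells with every footnote marker removed."""
    return [cell for cell in cells if cell != marker]


# Made-up cell strings; the real page has many more cells per row
cells = ['1', 'GPU A', '(*)', '10 Mh/s', '2', 'GPU B', '20 Mh/s']

cleaned = drop_markers(cells)

# With the markers gone, every row has the same width, so a stride picks out the names
names = cleaned[1::3]
```

Without the filtering step, the extra `(*)` cell would shift every later row by one position and the stride would pick up the wrong cells.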
