# Getting the most solar power for your dollar
## Data wrangling
### Zachary Brown

The goal of this project is to use the Tracking the Sun dataset from Lawrence Berkeley National Laboratory to build a model that identifies which factors make residential solar panel installations in Austin, Texas as cost-efficient as possible. In this notebook, the data will be loaded from Parquet files and wrangled into the working dataframe used for the rest of the project.

I'll start off by importing the modules needed for importing and wrangling the data.

In [10]:
import os
import wget
import requests  # used by import_data() below to fetch the directory pages
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
from datetime import datetime

Next I need to download the Parquet data files into the external data directory within this repository, so I'll first switch to that directory.

In [26]:
print(os.getcwd())
os.chdir(r"C:\Users\Zjbro\OneDrive\Documents\GitHub\Solar-Panel-Capstone\data\external")
data_dir = os.getcwd()
print(os.getcwd())

C:\Users\Zjbro\OneDrive\Documents\GitHub\Solar-Panel-Capstone\data\external
C:\Users\Zjbro\OneDrive\Documents\GitHub\Solar-Panel-Capstone\data\external
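
Hard-coding an absolute path ties the notebook to one machine. A minimal sketch of an alternative, assuming the repository keeps the same `data/external` layout; the helper name `external_data_dir` is hypothetical and not part of the original notebook:

```python
from pathlib import Path

def external_data_dir(repo_root):
    """Return <repo_root>/data/external, creating it if it does not exist.

    `repo_root` is assumed to be the top-level folder of the repository.
    """
    data_dir = Path(repo_root) / "data" / "external"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir
```

Called with the repository root, this would return the same directory as the cell above without relying on `os.chdir`, so later reads and writes can use explicit paths.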


The next step is to download the Parquet files. The entire collection of data is hosted here: https://data.openei.org/s3_viewer?bucket=oedi-data-lake&prefix=tracking-the-sun%2F

I only need the 2021 data for this project, as I want the most up-to-date model possible. Within 2021 the data is broken down by state. I need to download all 26 files, so I'll create a function to download them all automatically.

In [23]:
# Download every Parquet file linked from the state pages under root_url

def import_data(root_url):
    urls = []
    file_names = []

    # Parse the 2021 data directory to get URLs for each state
    response = requests.get(root_url)
    soup = BeautifulSoup(response.content, "html.parser")

    for a in soup.find_all('a', href=True):
        if '/s3_viewer?bucket=oedi-data-lake&prefix=tracking-the-sun%2F' in a['href']:
            urls.append('https://data.openei.org' + a['href'])

    # Collect the Parquet file links found on each state page
    for url in urls:
        url_response = requests.get(url)
        url_soup = BeautifulSoup(url_response.content, 'html.parser')

        for a in url_soup.find_all('a', href=True):
            if '.parquet' in a['href']:
                file_names.append(a['href'])

    # Download each file into the current working directory.
    # Iterating over file_names directly avoids silently dropping files,
    # which zip(urls, file_names) would do if the two lists differ in length.
    for file in file_names:
        wget.download(file)
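
The link filter in `import_data` matches any href under `tracking-the-sun%2F`, which could include links to other years or back to the parent directory. A minimal sketch of a stricter check that decodes the `prefix` query parameter with the standard library. The helper name `is_2021_link` and the example hrefs (including the `state=TX` directory naming) are assumptions for illustration, not taken from the actual page:

```python
from urllib.parse import urlparse, parse_qs

def is_2021_link(href, prefix="tracking-the-sun/2021/"):
    """Return True if the href's decoded `prefix` parameter lies below the 2021 directory."""
    query = parse_qs(urlparse(href).query)  # parse_qs percent-decodes the values
    values = query.get("prefix", [])
    # Exclude the 2021 directory itself; keep only its children.
    return any(v.startswith(prefix) and v != prefix for v in values)

# Hypothetical hrefs standing in for links scraped from the directory page.
hrefs = [
    "/s3_viewer?bucket=oedi-data-lake&prefix=tracking-the-sun%2F",
    "/s3_viewer?bucket=oedi-data-lake&prefix=tracking-the-sun%2F2021%2F",
    "/s3_viewer?bucket=oedi-data-lake&prefix=tracking-the-sun%2F2021%2Fstate%3DTX%2F",
    "/s3_viewer?bucket=oedi-data-lake&prefix=tracking-the-sun%2F2020%2Fstate%3DTX%2F",
]
state_links = [h for h in hrefs if is_2021_link(h)]
```

Only the third href survives the filter here. Whether the real page encodes its links this way would need to be checked against the live HTML.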

With the download function defined, I'll run it on the 2021 URL to download the state data files locally.

In [24]:
import_data('https://data.openei.org/s3_viewer?bucket=oedi-data-lake&prefix=tracking-the-sun%2F2021%2F')

100% [............................................................................] 289731 / 289731

Now the data files need to be loaded into a single Pandas dataframe to begin the wrangling process.

In [40]:
file_list = []

# I'll start by adding each Parquet file name to an empty list.
for file in os.listdir(data_dir):
    if file.endswith('.parquet'):
        file_list.append(file)

# Now I'll iterate over file_list and read each file into a dataframe,
# collecting them so they can be concatenated into one.
tables = []

for filename in file_list:
    # Build the full path so the read does not depend on the working directory
    df = pd.read_parquet(os.path.join(data_dir, filename))
    tables.append(df)

raw_data = pd.concat(tables, ignore_index=True)
raw_data.head()

Unnamed: 0,data_provider_1,data_provider_2,system_id_1,system_id_2,installation_date,system_size_dc,total_installed_price,rebate_or_grant,customer_segment,expansion_system,...,output_capacity_inverter_1,output_capacity_inverter_2,output_capacity_inverter_3,dc_optimizer,inverter_loading_ratio,date_of_battery_install,battery_manufacturer,battery_model,battery_rated_capacity_kw,battery_rated_capacity_kwh
0,"Washington, D.C. Public Service Commission",-1,DC-2012700-SUN-I,-1,2020-11-30,6.8,-1.0,-1.0,-1,0,...,-1.0,-1.0,-1.0,-1,-1.0,NaT,-1,-1,-1.0,-1.0
1,"Washington, D.C. Public Service Commission",-1,DC-2012701-SUN-I,-1,2020-12-07,7.04,-1.0,-1.0,-1,0,...,-1.0,-1.0,-1.0,-1,-1.0,NaT,-1,-1,-1.0,-1.0
2,"Washington, D.C. Public Service Commission",-1,DC-2012702-SUN-I,-1,2020-12-07,7.68,-1.0,-1.0,-1,0,...,-1.0,-1.0,-1.0,-1,-1.0,NaT,-1,-1,-1.0,-1.0
3,"Washington, D.C. Public Service Commission",-1,DC-2012703-SUN-I,-1,2020-12-06,5.12,-1.0,-1.0,-1,0,...,-1.0,-1.0,-1.0,-1,-1.0,NaT,-1,-1,-1.0,-1.0
4,"Washington, D.C. Public Service Commission",-1,DC-2012704-SUN-I,-1,2020-12-07,5.76,-1.0,-1.0,-1,0,...,-1.0,-1.0,-1.0,-1,-1.0,NaT,-1,-1,-1.0,-1.0


Ok, now that the dataframe has been assembled it's time to explore it.

In [44]:
raw_data.shape

(2041668, 80)

In [45]:
raw_data.columns

Index(['data_provider_1', 'data_provider_2', 'system_id_1', 'system_id_2',
       'installation_date', 'system_size_dc', 'total_installed_price',
       'rebate_or_grant', 'customer_segment', 'expansion_system',
       'multiple_phase_system', 'new_construction', 'tracking',
       'ground_mounted', 'zip_code', 'city', 'utility_service_territory',
       'third_party_owned', 'installer_name', 'self_installed', 'azimuth_1',
       'azimuth_2', 'azimuth_3', 'tilt_1', 'tilt_2', 'tilt_3',
       'module_manufacturer_1', 'module_model_1', 'module_quantity_1',
       'module_manufacturer_2', 'module_model_2', 'module_quantity_2',
       'module_manufacturer_3', 'module_model_3', 'module_quantity_3',
       'additional_modules', 'technology_module_1', 'technology_module_2',
       'technology_module_3', 'bipv_module_1', 'bipv_module_2',
       'bipv_module_3', 'bifacial_module_1', 'bifacial_module_2',
       'bifacial_module_3', 'nameplate_capacity_module_1',
       'nameplate_capacity_module

In [50]:
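# Group column names by dtype, using the dtypes of raw_data itself
# (df only holds the last file read in the loading loop)
col = raw_data.columns.to_series().groupby(raw_data.dtypes).groups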
print(col)

{datetime64[ns]: ['installation_date', 'date_of_battery_install'], int32: ['expansion_system', 'multiple_phase_system', 'new_construction', 'tracking', 'ground_mounted', 'third_party_owned', 'self_installed', 'module_quantity_1', 'module_quantity_2', 'module_quantity_3', 'additional_modules', 'bipv_module_1', 'bipv_module_2', 'bipv_module_3', 'bifacial_module_1', 'bifacial_module_2', 'bifacial_module_3', 'nameplate_capacity_module_1', 'nameplate_capacity_module_2', 'nameplate_capacity_module_3', 'inverter_quantity_2', 'inverter_quantity_3', 'additional_inverters', 'micro_inverter_1', 'micro_inverter_2', 'micro_inverter_3', 'solar_storage_hybrid_inverter_1', 'solar_storage_hybrid_inverter_2', 'solar_storage_hybrid_inverter_3', 'built_in_meter_inverter_1', 'built_in_meter_inverter_2', 'built_in_meter_inverter_3', 'output_capacity_inverter_2', 'output_capacity_inverter_3', 'dc_optimizer'], float64: ['system_size_dc', 'total_installed_price', 'rebate_or_grant', 'azimuth_1', 'azimuth_2', 'a

In [68]:
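# Build a row-wise boolean mask: len() on the Series returns a single
# row count, so the string length of each value needs the .str accessor
ext_zip = raw_data['zip_code'].str.len() > 5
raw_data[ext_zip].head()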

Right now I'm missing a state column. The zip code column is of object type and includes 9-digit zip codes with hyphens, but that isn't a problem for this analysis.
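
One possible way to get a state column, sketched on toy data, is to label each per-state table before concatenating. This assumes the state can be recovered for each file (for example from its file name or directory), which this notebook has not established; the `tables_by_state` dictionary and its contents are hypothetical:

```python
import pandas as pd

# Hypothetical per-state tables standing in for the Parquet files.
tables_by_state = {
    "TX": pd.DataFrame({"zip_code": ["78701", "78704-1234"]}),
    "DC": pd.DataFrame({"zip_code": ["20001"]}),
}

# Tag each table with its state label, then stack them into one dataframe.
tagged = [table.assign(state=state) for state, table in tables_by_state.items()]
combined = pd.concat(tagged, ignore_index=True)
```

Tagging before concatenation keeps the provenance of each row, which is lost once the tables are merged.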

Before I create a state column, I'll pare the data down to only the applicable rows: residential installations from 2021 that are not missing 'total_installed_price' values.

In [70]:
print(raw_data['customer_segment'].unique())

['-1' 'RES' 'NON-PROFIT' 'SCHOOL' 'GOV' 'NON-RES' 'COM']


In [78]:
# Now I'll subset the data to just residential installations
res_data = raw_data[raw_data['customer_segment'] == 'RES']
res_data.shape

(1948175, 80)

In [79]:
newest_year = res_data['installation_date'].dt.year == 2021
res_data[newest_year].shape

(2266, 80)

In [None]:
newest_res_data = res_data[newest_year].copy()
missing_sale = newest_res_data['total_installed_price'] == -1
newest_res_data = newest_res_data[~missing_sale]
newest_res_data.shape
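
The outputs above suggest the dataset marks missing values with `-1` (as a number in numeric columns and as the string `'-1'` in object columns). A minimal sketch on toy data of turning those sentinels into `NaN` and combining the residential and price filters into one mask; the `toy` dataframe is made up for illustration:

```python
import pandas as pd

# Toy rows standing in for the real data.
toy = pd.DataFrame({
    "customer_segment": ["RES", "RES", "COM", "-1"],
    "total_installed_price": [25000.0, -1.0, 40000.0, -1.0],
})

# Series.where keeps values where the condition holds and inserts NaN elsewhere.
price = toy["total_installed_price"].where(toy["total_installed_price"] != -1)
segment = toy["customer_segment"].where(toy["customer_segment"] != "-1")

# Keep residential rows that have a recorded price.
keep = (segment == "RES") & price.notna()
filtered = toy[keep]
```

Converting sentinels to `NaN` lets pandas' missing-value tools (`isna`, `dropna`, `notna`) work directly, rather than comparing against `-1` column by column.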