# Photovoltaic Installations

### Problem
Despite technological advances and a dramatic drop in installation prices, solar energy remains underutilized and perhaps misunderstood by the general public in the United States. It can provide clean, stable energy production if we choose to pursue it. Cost and system sizing, however, can deter many potential clients, so they should be well informed before making this important decision. To build a better understanding of solar energy's potential, we can use data to predict the cost and energy production of photovoltaic installations.

### Clients
The client can be a homeowner or a small business owner deciding whether a photovoltaic installation is feasible at their location. They want to estimate the array capacity that meets their needs, along with the cost and available rebates.


### Dataset



```python
import pandas as pd

# low_memory=False avoids mixed-type inference from chunked parsing.
pv = pd.read_csv('Capstone/openpv_all.csv', sep=',', low_memory=False)

# info() prints its summary directly and returns None, so print() is not needed.
pv.info()
```

```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1020583 entries, 0 to 1020582
Data columns (total 81 columns):
state                          1020583 non-null object
date_installed                 1020578 non-null object
incentive_prog_names           797958 non-null object
type                           1020578 non-null object
size_kw                        1020578 non-null float64
appraised                      224036 non-null object
zipcode                        1020578 non-null float64
install_type                   978002 non-null object
installer                      702521 non-null object
cost_per_watt                  763002 non-null float64
cost                           999030 non-null object
lbnl_tts_version_year          797958 non-null float64
lbnl_tts                       797958 non-null object
city                           799016 non-null object
utility_clean                  792720 non-null object
tech_1                         580919 non-null object
model1_clean  
```

This dataset contains solar array installations across the United States from 1998 through January 2018. For each location, it records the installation's cost, state rebate amount, capacity, local annual insolation (solar radiation), estimated annual power output, and reported power output.

### Data Wrangling/Cleaning

The original data contained over 1 million rows, including columns that were entirely empty and others with very little data. After those columns were dropped, some of the remaining data types turned out to be incorrect. Numerical columns such as cost and rebate were stored as strings because they contained dollar signs, commas, and leading or trailing spaces. These symbols had to be found and removed before the data types could be converted, and all strings were then converted to lowercase for consistency. Other inconsistencies appeared in the city names. For example, the city name column contained two versions of *'st. louis'*, one with the period and one without. The period was therefore stripped altogether so that all city names stayed consistent. Minor spelling errors in the *install type* column were also corrected.
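The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample rows are made up, and the helper names `to_number` and `normalize_city` are hypothetical.

```python
import pandas as pd

# Made-up rows that mimic the raw formatting problems described above.
raw = pd.DataFrame({
    "cost": [" $12,500.00", "8,000 ", "$33,941.69", None],
    "city": ["St. Louis", "st louis", " San Diego ", "ST. LOUIS"],
})

def to_number(series: pd.Series) -> pd.Series:
    """Remove dollar signs, commas, and whitespace, then convert to numbers."""
    cleaned = series.str.replace(r"[$,\s]", "", regex=True)
    # errors="coerce" turns anything that still cannot be parsed into NaN.
    return pd.to_numeric(cleaned, errors="coerce")

def normalize_city(series: pd.Series) -> pd.Series:
    """Trim, lowercase, and drop periods so 'St. Louis' matches 'st louis'."""
    return series.str.strip().str.lower().str.replace(".", "", regex=False)

clean = raw.assign(
    cost=to_number(raw["cost"]),
    city=normalize_city(raw["city"]),
)
print(clean)
```

Coercing unparseable values to `NaN` keeps the conversion from failing on a single bad row, at the price of having to check afterwards how many values were lost.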

A number of APIs were used to add to or correct the original dataset. For example, the columns for incentive program names and solar radiation contained missing values, which were filled in using the National Renewable Energy Laboratory (NREL) API. A city population column was also added through an API covering the 1,000 largest cities in the US. The dataset was then filtered to contain only those 1,000 cities, with populations ranging from about 36,000 to about 8 million.
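One way to attach the population column and filter to the listed cities in a single step is an inner join. The sketch below assumes the API response has already been loaded into a table; the frames `installs` and `cities` and all of their values are placeholders, not real data.

```python
import pandas as pd

# Placeholder stand-ins: `installs` for the cleaned installation data and
# `cities` for the population table obtained from the city API.
installs = pd.DataFrame({
    "city": ["san diego", "st louis", "smallville", "san diego"],
    "state": ["CA", "MO", "KS", "CA"],
    "size_kw": [5.2, 3.1, 4.0, 7.5],
})
cities = pd.DataFrame({
    "city": ["san diego", "st louis"],
    "state": ["CA", "MO"],
    "population": [1_000_000, 300_000],  # illustrative numbers only
})

# An inner join keeps only installations whose (city, state) pair appears in
# the population table and adds the population column to each kept row.
filtered = installs.merge(cities, on=["city", "state"], how="inner")
print(filtered)
```

Joining on both city and state, rather than city alone, avoids matching same-named cities in different states.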


### Initial Findings

The exploratory data analysis mostly focused on the following questions:

1. Which states have the cheapest and most expensive installations, and which states have the highest incentives?

2. How have the prices changed over the years?

3. Which factors contribute the most to the total cost? 


It is worth noting that California had more rows than all other states combined. Although this biases the data toward a single state, the other states still contain enough data for a meaningful analysis. In addition, most of the installation types were residential.


**Which states have the cheapest and most expensive installations, and which states have the highest incentives?**

Averaged over the past 20 years, the most expensive installations were in New Jersey (`$`49,282.18), Florida (`$`34,231.69), and Pennsylvania (`$`33,941.69), in that order. The highest rebates were in New Jersey (`$`18,322.60), Florida (`$`15,437.61), and Connecticut (`$`11,637.82). The cheapest installations were in Michigan (`$`8,000), Illinois (`$`13,000), and Indiana (`$`16,453.15). Note that these figures cover the full 20-year period and do not represent the most recent prices.
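A per-state ranking like the one above can be produced with a group-by followed by a mean. The sketch below uses made-up rows, and the column name `rebate` is an assumption (only `state` and `cost` appear in the column listing shown earlier).

```python
import pandas as pd

# Made-up rows for illustration; the real analysis would use the cleaned data.
df = pd.DataFrame({
    "state": ["NJ", "NJ", "MI", "MI", "FL"],
    "cost": [50000.0, 40000.0, 9000.0, 7000.0, 30000.0],
    "rebate": [20000.0, 12000.0, 0.0, 1000.0, 15000.0],  # assumed column name
})

# Mean cost and rebate per state.
by_state = df.groupby("state")[["cost", "rebate"]].mean()

# Top and bottom states by average cost, and top states by average rebate.
most_expensive = by_state["cost"].nlargest(3)
cheapest = by_state["cost"].nsmallest(3)
highest_rebates = by_state["rebate"].nlargest(3)

print(most_expensive)
print(cheapest)
print(highest_rebates)
```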



**How have the prices changed over the years?**

Installation costs peaked between 2008 and 2009 and have been declining since. More precisely, installation costs have fallen by about 51% since 2008. The *cost per watt* has been declining since 1998, for a total change of 77%. From 1998 to 2018, the number of installations and the number of rebates increased dramatically due to the change in costs and advances in photovoltaic technology. These developments have created more favorable conditions for consumers to switch to solar power.
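Percentage declines of this kind are computed relative to the starting value. The helper below is a small sketch of that arithmetic; the function name is hypothetical and the inputs are placeholders chosen to give round results, not values from the dataset.

```python
def percent_decline(start: float, end: float) -> float:
    """Return the drop from `start` to `end` as a percentage of `start`."""
    if start == 0:
        raise ValueError("start must be non-zero")
    return (start - end) / start * 100

# Placeholder inputs, not dataset values.
print(round(percent_decline(10.0, 4.9), 1))
print(round(percent_decline(100.0, 23.0), 1))
```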


**Which factors contribute the most to the total cost?**

It is no surprise that a large-capacity system (10 kW) costs more than a relatively small one (3 kW): system capacity is one of the major contributors to installation cost. However, because insolation varies with geographic location, a system of the same capacity (e.g., 5 kW) will not produce the same amount of energy everywhere in the country. The system capacity is therefore sized according to how much energy a home consumes annually, so the cost also depends on the home's yearly energy consumption.
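The relationship between consumption, insolation, capacity, and cost can be illustrated with a common rule of thumb: annual output ≈ capacity × daily peak sun hours × 365 × a performance ratio. This is a rough sketch under that assumption, not the model used in this project; the function names, the default performance ratio of 0.8, and all input values are illustrative.

```python
def required_capacity_kw(annual_kwh: float, peak_sun_hours: float,
                         performance_ratio: float = 0.8) -> float:
    """Estimate the array capacity (kW) needed to cover annual consumption.

    peak_sun_hours is the average daily insolation in kWh/m^2/day;
    performance_ratio (an assumed default) accounts for system losses.
    """
    if peak_sun_hours <= 0 or not 0 < performance_ratio <= 1:
        raise ValueError("peak_sun_hours must be > 0 and 0 < performance_ratio <= 1")
    return annual_kwh / (peak_sun_hours * 365 * performance_ratio)

def estimate_cost(capacity_kw: float, cost_per_watt: float) -> float:
    """Estimate the installation cost from capacity and a cost-per-watt figure."""
    return capacity_kw * 1000 * cost_per_watt

# Same household consumption, two hypothetical locations.
sunny = required_capacity_kw(10_000, peak_sun_hours=6.0)
cloudy = required_capacity_kw(10_000, peak_sun_hours=3.5)
print(f"sunny site:  {sunny:.2f} kW")
print(f"cloudy site: {cloudy:.2f} kW")
```

With the same annual consumption, the site with less sunlight needs a larger array, which in turn raises the estimated cost.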

