<b> This EDA looks at a year's worth of residential solar generation and weather data. My hypothesis is that Cloud Cover will be the most significant variable in predicting daily energy output. </b>

In [None]:
#load packages and explore column field names, total rows and columns, explore data types
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
solar = pd.read_csv(r'../input/residential-solar-weather-and-energy-csv/Weather_and_energy_Final_2020_2021.csv', header=0)
panels = pd.read_csv(r'../input/solar-panel-monthly-totals-csv/Solar_modals_totals_2020_2021.csv', header=0)
print(solar.columns.values)
print(np.shape(solar))
print(solar.dtypes)
print(solar.head())
print(panels.columns.values)
print(np.shape(panels))
print(panels.dtypes)
print(panels.head())

In [None]:
#Exploring the distribution in the data very generally
solar.hist(figsize=(14,14), xrot=45)
plt.show()

panels.hist(figsize=(14,14), xrot=45)
plt.show()

<b> Looking at the distributions, I can see some general trends across power generation and weather. The data skews toward low cloud cover and high visibility, both of which favor solar power generation. Energy Discharged is left-skewed, meaning most days fall at the higher end of the range, which is the desirable outcome. In the individual panel data, energy generation also sits toward the higher end of the plot. </b>
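
<b> To put a number on the skew seen in the histograms, a skewness value can be computed per column. The sketch below uses made-up values rather than the real data; in this notebook the analogous pandas call would be solar['Energy Discharged (Wh)'].skew(), which additionally applies a small-sample bias correction. A negative value indicates a left skew (a long tail toward low values). </b>

```python
from statistics import fmean

def skewness(values):
    """Biased (population) skewness: third central moment / variance**1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical daily energy values (Wh), not taken from the dataset:
# mostly high output with a couple of low-output days.
daily_wh = [9500, 9800, 10100, 9900, 10200, 9700, 3000, 10000, 4500, 9600]
skew = skewness(daily_wh)
print(f"skewness = {skew:.2f}")  # negative means left-skewed
```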

In [None]:
#Describe the data with summary statistics
solar.describe()

In [None]:
#Describe the panel data with summary statistics
panels.describe()

In [None]:
#Look at correlation between variables
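#Restrict to numeric columns; recent pandas versions raise an error if non-numeric columns are passed to corr()
solarcorr = solar.select_dtypes(include='number').corr()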
print(solarcorr)

<b> The raw correlation table is harder to interpret than a visual. I will make a heatmap so the relationships between variables are easier to read. </b>

In [None]:
#Heatmap of correlations
plt.figure(figsize=(10,8))
sns.heatmap(solarcorr, cmap='RdBu_r', annot=True)
plt.show()

<b> From the heatmap I can see that Cloud Cover is the variable most correlated with Energy Discharged, so it will be used in the linear regression model. </b>
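
<b> Reading the strongest correlation off a heatmap can be double-checked by sorting the absolute correlations with the target. This is a minimal sketch on synthetic data: the column names mirror the notebook, but the values and coefficients are invented. In the notebook itself, the same ranking could come from solarcorr['Energy Discharged (Wh)'].abs().sort_values(ascending=False). </b>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 365

# Synthetic stand-ins for the notebook's columns (values are invented).
cloud = rng.uniform(0, 100, n)
visibility = rng.uniform(5, 10, n)
energy = 12000 - 80 * cloud + 200 * visibility + rng.normal(0, 1000, n)

features = {"Cloud Cover": cloud, "Visibility": visibility}
target_name = "Energy Discharged (Wh)"

# Pearson correlation of each feature with the target.
corr_with_target = {
    name: float(np.corrcoef(col, energy)[0, 1]) for name, col in features.items()
}

# Rank by absolute value: the sign gives direction, the magnitude gives strength.
ranked = sorted(corr_with_target.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name:12s} r = {r:+.2f} vs {target_name}")
```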

In [None]:
#Plot a basic scatter plot to observe the trend generally
solar.plot.scatter(x='Cloud Cover', y='Energy Discharged (Wh)')

In [None]:
#Generate a regression line with a robust fit to down-weight outliers (robust=True requires statsmodels)
#The solar DataFrame loaded in the first cell is reused, so the CSV is not read again
sns.set_theme(color_codes=True)
ax = sns.regplot(x='Cloud Cover', y='Energy Discharged (Wh)', data=solar, robust=True)

In [None]:
#Spreading out the data to better observe the trends
plt.subplot(2,1,1)
sns.stripplot(x='Cloud Cover', y='Energy Discharged (Wh)', data=solar, jitter=True, size=3)

In [None]:
#Scatterplot with a regression line for the secondary variable, reusing the solar DataFrame loaded earlier
sns.set_theme(color_codes=True)
ax = sns.regplot(x='Visibility', y='Energy Discharged (Wh)', data=solar)

<b> With the data graphed, we can see that the relationship can be modeled well with a linear regression. I will now create some predictive models. </b>

In [None]:
#In this model the Energy Discharged is dependent on the Cloud Cover
import statsmodels.api as sm

X = solar['Cloud Cover']
y = solar['Energy Discharged (Wh)']

solar_model = sm.OLS(y, X).fit()
predictions = solar_model.predict(X) 

solar_model.summary()

<b> The first model is not a great fit. I will add a constant to the model. </b>

In [None]:
#Compare model to energy output
ax = solar_model.predict(X).plot(linewidth=3, marker='*')
ax2 = ax.twinx()
ax2.plot(solar['Energy Discharged (Wh)'], 'mediumseagreen', linewidth=2, marker='.')
plt.tight_layout()
plt.show()

In [None]:
#Second model with a constant
X = solar['Cloud Cover']
y = solar['Energy Discharged (Wh)']
X = sm.add_constant(X)

solar_model_2 = sm.OLS(y, X).fit()
predictions_2 = solar_model_2.predict(X) 

solar_model_2.summary()

In [None]:
#Compare model to energy output
ax = solar_model_2.predict(X).plot(linewidth=3, marker='*')
ax2 = ax.twinx()
ax2.plot(solar['Energy Discharged (Wh)'], 'mediumseagreen', linewidth=2, marker='.')
plt.tight_layout()
plt.show()

<b> The R squared value has increased with the incorporation of a constant, but I will now include Visibility, the second most correlated variable, as a predictor of Energy Discharged. </b>

In [None]:
#Adding a second independent variable
import statsmodels.api as sm

X = solar[["Cloud Cover", "Visibility"]]
y = solar["Energy Discharged (Wh)"]


solar_model_3 = sm.OLS(y, X).fit()
predictions_3 = solar_model_3.predict(X) 

solar_model_3.summary()


In [None]:
#Compare model to energy output
ax = solar_model_3.predict(X).plot(linewidth=3, marker='*')
ax2 = ax.twinx()
ax2.plot(solar['Energy Discharged (Wh)'], 'mediumseagreen', linewidth=2, marker='.')
plt.tight_layout()
plt.show()

<b> We now have a pretty good model with an R squared value of 0.919. I will continue to refine the model but I like the simplicity of this model with only two inputs. </b>
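
<b> A caveat when comparing R squared across these models: models 1 and 3 are fit without sm.add_constant, and for a model without a constant statsmodels reports the uncentered R squared (measured against zero rather than against the mean of y). That number is computed against a different baseline than the centered value reported for model 2, so the two are not directly comparable. The sketch below uses synthetic data and plain NumPy least squares (not the notebook's statsmodels calls) to show both definitions side by side. </b>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 365

# Synthetic data (invented): energy falls as cloud cover rises.
cloud = rng.uniform(0, 100, n)
y = 12000 - 80 * cloud + rng.normal(0, 1500, n)

def fit_predict(X, y):
    """Ordinary least squares via NumPy; returns the fitted values."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# Model A: no constant (analogous to sm.OLS(y, X) without add_constant).
X_no_const = cloud.reshape(-1, 1)
resid_a = y - fit_predict(X_no_const, y)
r2_uncentered = 1 - (resid_a @ resid_a) / (y @ y)

# Model B: with a constant column (analogous to sm.add_constant(X)).
X_const = np.column_stack([np.ones(n), cloud])
resid_b = y - fit_predict(X_const, y)
y_dev = y - y.mean()
r2_centered = 1 - (resid_b @ resid_b) / (y_dev @ y_dev)

print(f"no constant, uncentered R^2: {r2_uncentered:.3f}")
print(f"with constant, centered R^2: {r2_centered:.3f}")

# A like-for-like comparison is the sum of squared residuals: on the same
# data, the model with a constant can never have a larger one.
ssr_a = float(resid_a @ resid_a)
ssr_b = float(resid_b @ resid_b)
print(f"SSR no constant:   {ssr_a:.3e}")
print(f"SSR with constant: {ssr_b:.3e}")
```

<b> If the models are meant to be compared on R squared, one option is to add the constant to every model so they all report the centered value. </b>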

In [None]:
#Adding precipitation as an independent variable does not help the model
X = solar[["Cloud Cover", "Visibility", 'Precipitation']]
y = solar["Energy Discharged (Wh)"]

solar_model_4 = sm.OLS(y, X).fit()
predictions_4 = solar_model_4.predict(X) 

solar_model_4.summary()

In [None]:
#Compare model to energy output
ax = solar_model_4.predict(X).plot(linewidth=3, marker='*')
ax2 = ax.twinx()
ax2.plot(solar['Energy Discharged (Wh)'], 'mediumseagreen', linewidth=2, marker='.')
plt.tight_layout()
plt.show()

In [None]:
#Including Temperature does increase the accuracy of the model
X = solar[["Cloud Cover", "Visibility", 'Precipitation', "Temperature"]]
y = solar["Energy Discharged (Wh)"]

solar_model_5 = sm.OLS(y, X).fit()
predictions_5 = solar_model_5.predict(X) 

solar_model_5.summary()

In [None]:
#Compare model to energy output
ax = solar_model_5.predict(X).plot(linewidth=3, marker='*')
ax2 = ax.twinx()
ax2.plot(solar['Energy Discharged (Wh)'], 'mediumseagreen', linewidth=2, marker='.')
plt.tight_layout()
plt.show()

In [None]:
#Adding additional variables gets a more accurate model
X = solar[["Cloud Cover", "Visibility", 'Precipitation', "Temperature", "Maximum Temperature"]]
y = solar["Energy Discharged (Wh)"]

solar_model_6 = sm.OLS(y, X).fit()
predictions_6 = solar_model_6.predict(X) 

solar_model_6.summary()

In [None]:
#Compare model to energy output
ax = solar_model_6.predict(X).plot(linewidth=3, marker='*')
ax2 = ax.twinx()
ax2.plot(solar['Energy Discharged (Wh)'], 'mediumseagreen', linewidth=2, marker='.')
plt.tight_layout()
plt.show()

<b> While adding more variables to the model makes it more accurate, I believe the increase in R squared is not worth the burden of supplying additional weather variables. Whether a user enters the values by hand or the reporting is automated by pipelining weather data, a simpler process may be more desirable, and a 2-3% difference in predicted energy generation for the day is acceptable. Looking at the graphs, model 6 is the best fit, although model 2 predicts well for its simplicity. </b>
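
<b> One way to weigh accuracy against the number of inputs is adjusted R squared, which penalizes each added predictor: 1 - (1 - R^2)(n - 1)/(n - p - 1) for a model with a constant, n observations and p predictors. statsmodels exposes it as rsquared_adj on the fitted results. The helper name and the numbers below are placeholders for illustration, not values from the models above. </b>

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept and n_predictors features."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than parameters")
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Placeholder values for illustration only (not the notebook's results).
n_days = 365
candidates = {
    "2 predictors": (0.9000, 2),
    "5 predictors": (0.9005, 5),
}
for label, (r2, p) in candidates.items():
    adj = adjusted_r_squared(r2, n_days, p)
    print(f"{label}: R^2 = {r2:.4f}, adjusted R^2 = {adj:.4f}")
```

<b> When the gain in R squared from extra predictors is small, the adjusted value can end up lower than that of the simpler model, which supports preferring the model with fewer inputs. </b>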

<b> I have explored my data. I will move on to visualizations and come back to data for the macro analysis of trends by state and implications for improving residential solar 