## Introduction
* The goal of this assignment is to perform data analysis and transformation on selected datasets.

* For this assignment I picked the solar industry. I looked at a dataset of solar panel production data covering a span of 7 years in Antwerp, a city in Belgium. The production data starts on 10/26/2011 and runs through 10/26/2018.

* The dataset can be found at the following link:
https://www.kaggle.com/fvcoppen/solarpanelspower/version/5#

* The solar_production dataset contains two columns: the first is the date and the second is the cumulative power in kWh since 10/26/2011.


* The other dataset I looked at was the temperature in Antwerp over the same 7-year span. The temperature data was entered manually into a CSV file from the following link:
https://www.wunderground.com/history/monthly/be/antwerp/EBAW/date/2012-9

* The temperature data consists of four columns: date, maximum temperature, average temperature and minimum temperature, all in degrees Fahrenheit.



* My goal is to find a correlation between solar panel production and temperature. It might then be possible to develop a model that predicts solar panel production from a temperature forecast.
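
One common way to quantify such a relationship, once the two tables are aligned by date, is the Pearson correlation coefficient. The following is a minimal sketch of the mechanics only, not the analysis itself: the frame names `solar` and `temps` are hypothetical, and the five rows are copied from the sample outputs shown later in this document, so the resulting number says nothing about the full 7-year series.

```python
import pandas as pd

dates = ["26/10/2011", "27/10/2011", "28/10/2011", "29/10/2011", "30/10/2011"]

# Hypothetical miniature frames built from the first five sample rows.
solar = pd.DataFrame({"date": dates, "day_power": [10.1, 10.0, 9.4, 4.6, 3.8]})
temps = pd.DataFrame({"date": dates, "Avg Temp": [50.8, 49.7, 55.8, 56.7, 57.1]})

# Align the two tables on their shared date column.
merged = solar.merge(temps, on="date", how="inner")

# Series.corr computes the Pearson correlation by default.
r = merged["day_power"].corr(merged["Avg Temp"])
print(f"Pearson r on this tiny sample: {r:.2f}")
```

With the full datasets, the same `merge` and `corr` calls would apply after cleaning; whether a linear correlation is the right measure for this data is left open here.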

## Data Cleaning and Transformation
* I loaded the datasets into pandas data frames and performed some initial data cleaning and transformation on both of them. The commands used to load the datasets are shown below, together with sample output from each data frame.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
solar_production=pd.read_csv("Data/Solar_Panels_Production.csv")
temperature=pd.read_csv("Data/Temperature.csv")
solar_production.head()

Unnamed: 0,date,cum_power
0,26/10/2011,0.1
1,27/10/2011,10.2
2,28/10/2011,20.2
3,29/10/2011,29.6
4,30/10/2011,34.2


In [2]:
temperature.head()

Unnamed: 0,date,Max Temp,Avg Temp,Min Temp,Unnamed: 4
0,26/10/2011,59,50.8,43.0,
1,27/10/2011,63,49.7,39.0,
2,28/10/2011,66,55.8,46.0,
3,29/10/2011,63,56.7,48.0,
4,30/10/2011,61,57.1,54.0,
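
The dates in both files are written day first (for example 26/10/2011), and `read_csv` loads them as plain strings by default. If date arithmetic or date-based joins are needed later, they can be converted explicitly. The sketch below uses a small inline sample instead of the real files; the name `sample` is hypothetical.

```python
import io
import pandas as pd

# Tiny inline sample in the same day/month/year layout as the real files.
sample_csv = io.StringIO(
    "date,cum_power\n"
    "26/10/2011,0.1\n"
    "27/10/2011,10.2\n"
    "28/10/2011,20.2\n"
)
sample = pd.read_csv(sample_csv)

# An explicit format avoids any guessing about month/day order.
sample["date"] = pd.to_datetime(sample["date"], format="%d/%m/%Y")
print(sample.dtypes)
```

Passing an explicit `format` is a deliberate choice here: a date such as 05/11/2011 is ambiguous without it.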


* The temperature data frame has an undesired column called "Unnamed: 4". To get rid of this column, the following command was used:

In [3]:
temperature = temperature.loc[:, ~temperature.columns.str.contains('^Unnamed')]
temperature.head()

Unnamed: 0,date,Max Temp,Avg Temp,Min Temp
0,26/10/2011,59,50.8,43.0
1,27/10/2011,63,49.7,39.0
2,28/10/2011,66,55.8,46.0
3,29/10/2011,63,56.7,48.0
4,30/10/2011,61,57.1,54.0
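
To make the column filter above easier to follow, here is the same idea split into two steps on a hypothetical miniature frame (`demo` is not part of the original notebook): first build a boolean mask over the column names, then invert it to keep only the named columns.

```python
import pandas as pd

# Hypothetical miniature of the temperature frame with a stray empty column.
demo = pd.DataFrame({
    "date": ["26/10/2011", "27/10/2011"],
    "Max Temp": [59, 63],
    "Unnamed: 4": [None, None],
})

# True for every column whose name starts with "Unnamed".
is_unnamed = demo.columns.str.contains("^Unnamed")

# ~ inverts the mask, so .loc keeps only the properly named columns.
cleaned = demo.loc[:, ~is_unnamed]
print(list(cleaned.columns))
```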


* The solar_production data frame has a column with the cumulative power since 10/26/2011. To correlate daily temperature with solar panel production, I obtained the daily power production by subtracting the cumulative power of consecutive days: each row's day_power is the next day's cumulative reading minus the current day's reading, so the last row has no value. The Python command for this is shown below:

In [4]:
solar_production['day_power']=solar_production['cum_power'].shift(-1) - solar_production['cum_power']
solar_production.head()

Unnamed: 0,date,cum_power,day_power
0,26/10/2011,0.1,10.1
1,27/10/2011,10.2,10.0
2,28/10/2011,20.2,9.4
3,29/10/2011,29.6,4.6
4,30/10/2011,34.2,3.8
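
The `shift(-1)` expression above attributes each difference to the earlier of the two days. An alternative is `Series.diff()`, which attributes it to the later day. Which alignment is correct depends on when the meter is read each day, and the source does not say, so the comparison below is only a sketch on a hypothetical `demo` frame built from the first four sample rows.

```python
import pandas as pd

demo = pd.DataFrame({
    "date": ["26/10/2011", "27/10/2011", "28/10/2011", "29/10/2011"],
    "cum_power": [0.1, 10.2, 20.2, 29.6],
})

# Approach used above: next reading minus current reading (last row is NaN).
demo["day_power_shift"] = demo["cum_power"].shift(-1) - demo["cum_power"]

# Alternative: current reading minus previous reading (first row is NaN).
demo["day_power_diff"] = demo["cum_power"].diff()
print(demo)
```

The two columns hold the same differences, offset by one row, so the choice matters when joining against the temperature of a specific day.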


* For the temperature dataset, some values were missing from the minimum temperature column. I filled them in by calculating the minimum temperature from the maximum and average temperatures.
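
The exact formula used is not stated. One plausible reading, assumed here, is that the average is treated as the midpoint of the daily maximum and minimum, which gives Min = 2 * Avg - Max. In the sample rows the midpoint only approximately matches the reported average (for example, max 59 and min 43 give 51 against a reported 50.8), so values filled this way are estimates. The frame `demo` and the missing entry in it are hypothetical.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Max Temp": [59, 63, 66],
    "Avg Temp": [50.8, 49.7, 55.8],
    "Min Temp": [43.0, np.nan, 46.0],  # middle value removed for illustration
})

# Assumption: Avg is roughly (Max + Min) / 2, hence Min is roughly 2 * Avg - Max.
estimate = 2 * demo["Avg Temp"] - demo["Max Temp"]

# fillna only touches the rows where Min Temp is missing.
demo["Min Temp"] = demo["Min Temp"].fillna(estimate)
print(demo)
```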


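* After this I decided to get rid of all rows containing null values using the following commands. Note that dropna returns a new data frame and leaves the original unchanged unless the result is assigned back, which is why the statistics further down still count the final row that has no day_power value:

In [5]:
# dropna returns a new data frame; without assignment these calls only display the result
temperature.dropna(how="any")
solar_production.dropna(how="any")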

Unnamed: 0,date,cum_power,day_power
0,26/10/2011,0.1,10.1
1,27/10/2011,10.2,10.0
2,28/10/2011,20.2,9.4
3,29/10/2011,29.6,4.6
4,30/10/2011,34.2,3.8
...,...,...,...
2552,21/10/2018,28095.0,6.0
2553,22/10/2018,28101.0,8.0
2554,23/10/2018,28109.0,6.0
2555,24/10/2018,28115.0,2.0
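
Because `dropna` returns a new data frame by default, a call whose result is not stored has no lasting effect. The sketch below shows the difference on a hypothetical three-row frame with made-up values.

```python
import numpy as np
import pandas as pd

# Made-up values; the last row mimics the trailing NaN in day_power.
demo = pd.DataFrame({
    "cum_power": [100.0, 106.0, 111.0],
    "day_power": [6.0, 5.0, np.nan],
})

# Calling dropna without using the result leaves `demo` untouched.
demo.dropna(how="any")
rows_before = len(demo)

# Assigning the result back is what actually removes the incomplete row.
demo = demo.dropna(how="any")
rows_after = len(demo)
print(rows_before, rows_after)
```

Applied to the real data, this would be `temperature = temperature.dropna(how="any")` and likewise for `solar_production`; the statistics shown later would then change accordingly.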


* Here are some samples from the data frames after cleaning both of the loaded datasets.

In [6]:
temperature.head()

Unnamed: 0,date,Max Temp,Avg Temp,Min Temp
0,26/10/2011,59,50.8,43.0
1,27/10/2011,63,49.7,39.0
2,28/10/2011,66,55.8,46.0
3,29/10/2011,63,56.7,48.0
4,30/10/2011,61,57.1,54.0


In [7]:
solar_production.head()

Unnamed: 0,date,cum_power,day_power
0,26/10/2011,0.1,10.1
1,27/10/2011,10.2,10.0
2,28/10/2011,20.2,9.4
3,29/10/2011,29.6,4.6
4,30/10/2011,34.2,3.8


* Summary statistics for both data frames are shown below:

In [8]:
temperature.describe()

Unnamed: 0,Max Temp,Avg Temp,Min Temp
count,2558.0,2558.0,2558.0
mean,59.432369,52.701759,45.623612
std,13.130341,11.390258,10.903156
min,19.0,11.5,0.0
25%,50.0,44.3,39.0
50%,59.0,52.5,46.0
75%,70.0,61.6,54.0
max,99.0,83.1,72.0


In [9]:
solar_production.describe()

Unnamed: 0,cum_power,day_power
count,2558.0,2557.0
mean,13461.057349,10.997223
std,8129.192104,8.209054
min,0.1,0.0
25%,6665.35,4.0
50%,13000.5,9.7
75%,20183.75,17.0
max,28120.0,34.0
