<a id="top"></a>
# Pandas and Scikit-Learn

<img src="http://www.kulturwirt.de/wp-content/uploads/2021/04/1920px-Pandas_logo.svg.png" width=200/>

pandas is a Python library for loading and manipulating labeled tabular data (e.g., from CSV or Excel files). It provides many of the same operations as NumPy, plus tools for working with time series, cleaning up data, and creating plots.

scikit-learn provides tools for creating statistical and machine learning models. This notebook introduces linear regression with scikit-learn and shows how to prepare data before using it to fit a model.

Before we get started with coding, let's install one more library to help us work with Microsoft Excel files. In a new terminal window, run the following command:

`conda install -n data_science openpyxl --yes`

which will install `openpyxl` into our `data_science` conda environment.

## Introduction to pandas

1. [Series](#series)
2. [DataFrames](#dataframes)
3. [Saving and loading data](#saving-loading)
4. [Working with real data](#real-data)

## Exercises

[Exercise 1](#exercise1)

## [Jump to Scikit-Learn](#scikit-learn)

<a id="series"></a>
## 1. Series

Series objects are labeled arrays. While NumPy provides access to array objects, it does not provide labels for the data points within the array. Series solve this issue by introducing an index to describe the data points.
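
As a minimal sketch of that difference, the example below looks up the same value by position in a NumPy array and by label in a Series. The readings and the text labels are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical temperature readings, used only for illustration
values = np.array([21.5, 22.0, 23.1])

# A NumPy array can only be addressed by integer position
second_by_position = values[1]

# A Series attaches a label to each value, so the same point can be looked up by label
labels = ["06:00", "07:00", "08:00"]
temps = pd.Series(data=values, index=labels, name="temperature")
second_by_label = temps.loc["07:00"]

print(second_by_position, second_by_label)
```

Positional access is still available on a Series through `.iloc`, so nothing is lost by adding the index.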

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
index = pd.date_range("2020-01-01 00:00", "2020-01-02 00:00", freq="H", name="Time")
data = np.sin(np.arange(0.0, index.size * 0.5, 0.5))

series = pd.Series(data=data, index=index, name="sinx")

In [None]:
series.index

In [None]:
series.values

In [None]:
series

In [None]:
series.iloc[2]

In [None]:
series.loc["2020-01-01 02:00:00"]

In [None]:
ax = series.plot()
ax.set_ylabel(series.name)

In [None]:
ax = series.plot()
ax.set_ylabel(series.name)
ax.grid()

[Return to top](#top)

<a id="dataframes"></a>
## 2. DataFrames

DataFrames go one step further to provide a way to work with tabular data that has a common index. Now, instead of having an array with labels for each data point, we have multiple arrays with unique names and a common index, all of which describe our data.
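
One way to see what a common index means is to build a DataFrame from two Series whose labels only partly overlap. This is a small sketch with made-up names (`a`, `b`) and values, not part of the dataset used later.

```python
import pandas as pd

# Two hypothetical Series; "b" has no value for the label "x"
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"], name="a")
b = pd.Series([10.0, 20.0], index=["y", "z"], name="b")

# Columns are aligned on the index labels; labels missing from a column become NaN
frame = pd.DataFrame({"a": a, "b": b})
print(frame)
```

Because alignment happens on labels rather than positions, the value 10.0 ends up in row `y`, not in the first row.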

In [None]:
index = pd.date_range("2020-01-01 00:00", "2020-01-02 00:00", freq="H", name="Time")
data = {
    "sinx": np.sin(np.arange(0.0, index.size * 0.5, 0.5)),
    "cosx": np.cos(np.arange(0.0, index.size * 0.5, 0.5)),
}
df = pd.DataFrame(data=data, index=index)

In [None]:
df.shape

In [None]:
df.index

In [None]:
df.columns

In [None]:
df

In [None]:
df.sinx

In [None]:
df["sinx"]

In [None]:
df["sinx"].values

In [None]:
df.iloc[12]

In [None]:
df.loc["2020-01-01 12:00"]

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

df.plot(ax=ax)

[Return to top](#top)

<a id="saving-loading"></a>
## 3. Saving and loading data

Tabular data of many forms can be saved to disk and loaded from disk using pandas. Two common file formats are CSVs (comma separated values) and Microsoft Excel files.

In [None]:
df.to_csv("data/my_data.csv")

In [None]:
pd.read_csv("data/my_data.csv", index_col=0, parse_dates=True)

In [None]:
df.to_excel("data/my_data.xlsx")

In [None]:
pd.read_excel("data/my_data.xlsx", index_col=0, parse_dates=True)
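
The `index_col` and `parse_dates` arguments matter when reading the file back. The sketch below, which assumes a small synthetic table and writes to a temporary directory rather than `data/`, compares a plain `read_csv` call with one that restores the time index.

```python
import os
import tempfile

import pandas as pd

# Small synthetic table with an hourly time index (illustrative values only)
index = pd.date_range("2020-01-01 00:00", periods=3, freq="60min", name="Time")
frame = pd.DataFrame({"sinx": [0.0, 0.5, 1.0]}, index=index)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_data.csv")
    frame.to_csv(path)

    # Without index_col/parse_dates, "Time" comes back as an ordinary column
    raw = pd.read_csv(path)

    # With them, the first column is restored as a DatetimeIndex
    restored = pd.read_csv(path, index_col=0, parse_dates=True)

print(raw.columns.tolist())
print(type(restored.index).__name__)
```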

[Return to top](#top)

<a id="real-data"></a>
## 4. Working with real data

When doing research, we won't necessarily have nice sine and cosine curves to look at. Let's explore some real data from a weather observation station located in Junction, TX. This includes 1-minute observations of many weather variables over the month of June 2016.

In [None]:
df = pd.read_csv("data/junction_201604.csv", index_col="Time", parse_dates=True)

In [None]:
df.shape

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.head(10)

In [None]:
df["2-m Temperature (C)"].plot()

In [None]:
df[["2-m Temperature (C)", "9-m Temperature (C)"]].plot()

In [None]:
time_slice = slice(pd.Timestamp("2016-06-01 00:00-05:00"), pd.Timestamp("2016-06-02 00:00-05:00"))
df.loc[time_slice, ["2-m Temperature (C)", "9-m Temperature (C)"]].plot()
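
Time ranges can also be selected with date strings instead of an explicit `slice` of `Timestamp` objects. The following sketch uses synthetic, timezone-naive hourly data as a stand-in for the station observations, so the column name `value` and the dates are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic, timezone-naive hourly data standing in for the station observations
index = pd.date_range("2016-06-01 00:00", "2016-06-03 23:00", freq="60min", name="Time")
frame = pd.DataFrame({"value": np.arange(index.size, dtype=float)}, index=index)

# Explicit slice object built from Timestamps
by_slice = frame.loc[slice(pd.Timestamp("2016-06-01 00:00"), pd.Timestamp("2016-06-02 00:00"))]

# Equivalent string slice; with .loc both end points are included
by_string = frame.loc["2016-06-01 00:00":"2016-06-02 00:00"]

print(len(by_slice), len(by_string))
```

Unlike ordinary Python slicing, label-based slicing with `.loc` includes the end label, which is why the selection contains the midnight observation of the second day.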

In [None]:
df.describe()

In [None]:
df.corr()

In [None]:
df[["2-m Temperature (C)", "Solar Radiation (W/m^2)"]].corr()
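
By default, `corr()` returns the Pearson correlation coefficient for each pair of columns. As a rough check of what that number means, the sketch below computes it by hand for two hypothetical columns (`temperature` and `solar`, with made-up values) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Made-up values, used only to illustrate the calculation
frame = pd.DataFrame(
    {
        "temperature": [20.0, 22.0, 25.0, 27.0, 30.0],
        "solar": [100.0, 300.0, 450.0, 700.0, 800.0],
    }
)

# Pearson r: covariance of the two columns divided by the product of their spreads
x = frame["temperature"] - frame["temperature"].mean()
y = frame["solar"] - frame["solar"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same value taken from the correlation table
r_pandas = frame.corr().loc["temperature", "solar"]

print(r_manual, r_pandas)
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.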

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

df.plot.scatter(x="2-m Temperature (C)", y="Solar Radiation (W/m^2)", ax=ax)

[Return to top](#top)

<a id="exercise1"></a>
## Exercise 1

1. Select a column of data from the real dataset above and plot it over a 1-week period of your choosing (available dates are June 1, 2016 to June 30, 2016).

In [None]:
# your code here

***

[Return to top](#top)

<a id="scikit-learn"></a>
# Scikit-Learn

<img src="https://scikit-learn.org/stable/_static/scikit-learn-logo-small.png" width=200/>

Scikit-learn is a library that provides tools for building and assessing statistical and machine learning models from datasets. In this section we will explore how to fit a model to real data and how preparing the data can improve predictions.

1. [Fitting a model to data](#fitting)
2. [Cleaning up and manipulating data](#clean-manipulate)

## Exercises

[Exercise 2](#exercise2)

<a id="fitting"></a>
## 1. Fitting a model to data

The correlation table above indicates a relationship between 2-meter temperature and solar radiation. Let's fit a linear model to these two variables.

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
temp_and_solar = df[["2-m Temperature (C)", "Solar Radiation (W/m^2)"]]
print(temp_and_solar.shape)

In [None]:
x = temp_and_solar["2-m Temperature (C)"].values.reshape((-1, 1))
y = temp_and_solar["Solar Radiation (W/m^2)"].values.reshape((-1, 1))
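
The `reshape((-1, 1))` calls are needed because scikit-learn expects the predictors as a 2-D array of shape `(n_samples, n_features)`. The sketch below uses made-up numbers to show the resulting shapes, and how the shape of the target changes the shape of the fitted coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values that lie exactly on a straight line
temperature = np.array([20.0, 25.0, 30.0, 35.0])
solar = np.array([200.0, 400.0, 600.0, 800.0])

# One column (feature) and as many rows as there are samples
x = temperature.reshape((-1, 1))

# Fit once with a 1-D target and once with a 2-D target
model_1d = LinearRegression().fit(x, solar)
model_2d = LinearRegression().fit(x, solar.reshape((-1, 1)))

print(x.shape)
print(model_1d.coef_.shape, model_2d.coef_.shape)
```

With a 2-D target the coefficients come back as a 2-D array, so a single slope is found at `coef_[0, 0]`.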

In [None]:
x.shape

In [None]:
x

In [None]:
model = LinearRegression()
model.fit(x, y)

In [None]:
r_sq = model.score(x, y)
# y is 2-D, so intercept_ has shape (1,) and coef_ has shape (1, 1)
intercept = model.intercept_[0]
slope = model.coef_[0, 0]
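
For a regressor, `score` returns the coefficient of determination, R-squared. A minimal sketch with made-up values shows how it relates to the residuals: one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values, used only to illustrate the calculation
x = np.array([20.0, 24.0, 28.0, 32.0, 36.0]).reshape((-1, 1))
y = np.array([150.0, 420.0, 480.0, 790.0, 860.0])

model = LinearRegression().fit(x, y)
predicted = model.predict(x)

# Residual and total sums of squares
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)

r_sq_manual = 1.0 - ss_res / ss_tot
r_sq_model = model.score(x, y)

print(r_sq_manual, r_sq_model)
```

An R-squared of 1 means the line passes through every point, while a value near 0 means the line explains little more than the mean of `y` does.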

In [None]:
temp = temp_and_solar["2-m Temperature (C)"].values

fig, ax = plt.subplots(figsize=(8, 8))

temp_and_solar.plot.scatter(x="2-m Temperature (C)", y="Solar Radiation (W/m^2)", ax=ax)
ax.plot(temp, temp * slope + intercept, color="k")
ax.text(x=15, y=1000, s=f"R-squared = {r_sq:.2f}")

In [None]:
xi = np.arange(20.0, 40.0, 1.0).reshape((-1, 1))
yi = model.predict(xi)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

temp_and_solar.plot.scatter(x="2-m Temperature (C)", y="Solar Radiation (W/m^2)", ax=ax, label="Observations")
ax.scatter(xi, yi, color="orange", zorder=10, label="Predictions")
ax.plot(temp, temp * slope + intercept, color="k")
ax.text(x=15, y=1000, s=f"R-squared = {r_sq:.2f}")
ax.legend()

[Return to top](#top)

<a id="clean-manipulate"></a>
## 2. Cleaning up and manipulating data

The model fit between 2-meter temperature and solar radiation above is not as strong as it could be. Let's manipulate the data to improve the model.

In [None]:
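# Average over the previous hour and keep one value per hour
temp_solar_hourly = temp_and_solar.rolling("H").mean().asfreq("H")
# where() returns a new DataFrame, so assign the result; dropna(inplace=True) on it would be discarded
temp_solar_hourly = temp_solar_hourly.where(temp_solar_hourly["Solar Radiation (W/m^2)"] > 0.0).dropna()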
print(temp_solar_hourly.shape)
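
The two steps, hourly averaging and removal of night-time rows, can be tried on synthetic data where the answer is easy to check. The sketch below assumes one day of 1-minute values with hypothetical column names (`temperature`, `solar`) and a constant radiation of 500 between 06:00 and 18:00.

```python
import numpy as np
import pandas as pd

# One synthetic day of 1-minute data (illustrative values only)
index = pd.date_range("2016-06-01 00:00", periods=24 * 60, freq="min", name="Time")
minute_of_day = np.arange(index.size)
frame = pd.DataFrame(
    {
        "temperature": 25.0 + 5.0 * np.sin(2.0 * np.pi * minute_of_day / index.size),
        "solar": np.where((minute_of_day >= 360) & (minute_of_day < 1080), 500.0, 0.0),
    },
    index=index,
)

# Average over the previous 60 minutes, then keep one row per hour
hourly = frame.rolling("60min").mean().asfreq("60min")

# where() masks rows with no sunlight as NaN and dropna() removes them;
# both return new objects, so the result is assigned to a name
daytime = hourly.where(hourly["solar"] > 0.0).dropna()

print(hourly.shape, daytime.shape)
```

Averaging smooths out minute-to-minute noise, and dropping the zero-radiation rows removes the night-time cluster in which temperature varies while radiation stays at zero.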

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

temp_solar_hourly.plot.scatter(x="2-m Temperature (C)", y="Solar Radiation (W/m^2)", ax=ax)

In [None]:
x = temp_solar_hourly["2-m Temperature (C)"].values.reshape((-1, 1))
y = temp_solar_hourly["Solar Radiation (W/m^2)"].values.reshape((-1, 1))

model = LinearRegression()
model.fit(x, y)

In [None]:
r_sq = model.score(x, y)
# y is 2-D, so intercept_ has shape (1,) and coef_ has shape (1, 1)
intercept = model.intercept_[0]
slope = model.coef_[0, 0]

yi = model.predict(xi)

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

temp_solar_hourly.plot.scatter(x="2-m Temperature (C)", y="Solar Radiation (W/m^2)", ax=ax)
ax.scatter(xi, yi, color="orange", zorder=10, label="Predictions")
ax.plot(temp, temp * slope + intercept, color="k")
ax.text(x=15, y=1000, s=f"R-squared = {r_sq:.2f}")
ax.legend()

[Return to top](#top)

<a id="exercise2"></a>
## Exercise 2

Fit a multivariate linear regression to the data using 2-meter temperature, solar radiation, and at least one other variable of your choice. Consider the following:
* What is the correlation between the variables you have selected?
* Are there any physical relationships between the variables that you may need to consider before fitting a model to the data?

Plot the resulting predictions (no need to plot the regression line) on top of the data. Hint: you will still use `LinearRegression` to create this model.
* Are the predictions better than you expected or worse?
* Are they better than our previous predictions or worse? Why?

In [None]:
# your code here