<a id="top-of-pd"></a>

<img src="http://www.kulturwirt.de/wp-content/uploads/2021/04/1920px-Pandas_logo.svg.png" width=200/>

# Pandas

pandas is a Python library for loading and manipulating labeled tabular data, such as the contents of CSV, text, and Excel files. It is built on top of NumPy and supports many of the same operations, with extra tools for working with time series, cleaning up data, and creating plots.

1. [Series](#series)
2. [DataFrames](#dataframes)
3. [Operating on Pandas objects](#operations)
4. [Saving and loading data](#saving-loading)

### Exercises

[Exercise 1](#exercise1-pd)<br>
[Exercise 2](#exercise2-pd)

<a id="series"></a>
## 1. Series

Series objects are labeled arrays. While NumPy provides access to array objects, it does not provide labels for the data points within the array. Series solve this issue by introducing an index to describe the data points.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [None]:
# "h" is the hourly frequency alias; the uppercase "H" is deprecated in recent pandas releases.
index = pd.date_range("2020-01-01 00:00", "2020-01-02 00:00", freq="h", name="Time")
data = np.sin(np.arange(0.0, index.size * 0.5, 0.5))

series = pd.Series(data=data, index=index, name="sinx")

In [None]:
series.index

In [None]:
series.values

In [None]:
series

In [None]:
series.iloc[2]

In [None]:
series.loc["2020-01-01 02:00:00"]
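
The difference between `iloc` (position-based) and `loc` (label-based) shows up most clearly when slicing. The sketch below uses a small hypothetical Series named `temps` with made-up values, not the notebook's sine data, to illustrate that a positional slice excludes its stop position while a label slice includes its stop label.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
temps = pd.Series(
    [12.5, 14.0, 15.5, 13.0],
    index=["mon", "tue", "wed", "thu"],
    name="temp",
)

by_position = temps.iloc[1]   # second element, counting from zero
by_label = temps.loc["tue"]   # element whose index label is "tue"

# Positional slice: positions 1 and 2 (the stop position 3 is excluded).
pos_slice = temps.iloc[1:3]

# Label slice: "tue", "wed", and "thu" (the stop label is included).
label_slice = temps.loc["tue":"thu"]

print(len(pos_slice), len(label_slice))
```

The same inclusive behavior applies when slicing a time-indexed Series with date strings.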

In [None]:
ax = series.plot()
ax.set_ylabel(series.name)

In [None]:
ax = series.plot()
ax.set_ylabel(series.name)
ax.grid()

[Return to top of notebook](#top-of-pd)<br>
[Return to top of section](#series)

<a id="dataframes"></a>
## 2. DataFrames

DataFrames go one step further to provide a way to work with tabular data that has a common index. Now, instead of having an array with labels for each data point, we have multiple arrays with unique names and a common index, all of which describe our data.

In [None]:
# "h" is the hourly frequency alias; the uppercase "H" is deprecated in recent pandas releases.
index = pd.date_range("2020-01-01 00:00", "2020-01-02 00:00", freq="h", name="Time")
data = {
    "sinx": np.sin(np.arange(0.0, index.size * 0.5, 0.5)),
    "cosx": np.cos(np.arange(0.0, index.size * 0.5, 0.5)),
}
df = pd.DataFrame(data=data, index=index)

In [None]:
df.shape

In [None]:
df.index

In [None]:
df.columns

In [None]:
df

In [None]:
df.sinx

In [None]:
df["sinx"]

In [None]:
df["sinx"].values

In [None]:
df.iloc[12]

In [None]:
df.loc["2020-01-01 12:00"]

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))

df.plot(ax=ax)
ax.set_ylabel("y")

<a id="exercise1-pd"></a>

### Exercise 1

1. Plot `sinx` and `cosx` on the same axes over the time period 06:00 on 01 January 2020 to 18:00 on 01 January 2020.

In [None]:
# your code here

[Return to top of notebook](#top-of-pd)<br>
[Return to top of section](#dataframes)

<a id="operations"></a>
## 3. Operating on Pandas objects

If you're loading data with pandas, you're likely aiming to analyze that data in some way by performing operations. Let's take a look at how we can do that.

Pandas Series and DataFrames contain arrays, so we can perform math on them in the same way that we do math with NumPy arrays.

In [None]:
df["sinx"] * 5.0

Notice that the result is a `Series` object. We can operate on the result in the same way that we operated on our other series objects.

In [None]:
(df["sinx"] * 5.0).plot()

We can also multiply columns in our dataframe and get a `Series` as a result.

In [None]:
df["sinx"] * df["cosx"]

We can also select columns and call methods to carry out some task.

In [None]:
df["sinx"].multiply(df["cosx"])

In the above cell, we performed the same multiplication as with `*`, but using a `Series` method (selecting a single column of a dataframe returns a `Series`). The method form provides some extra options, such as `fill_value` for handling missing values, which can make our code more flexible and easier to read in more complex analyses.
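
As a minimal sketch of one such option, the example below compares `*` with `multiply(..., fill_value=...)` on two hypothetical Series, `a` and `b`, whose indexes do not fully match. The names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series: "z" exists in a but not in b.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["x", "y"])

# The operator aligns on the index; "z" has no partner in b, so the result is NaN.
plain = a * b

# The method form can substitute a value for the missing partner before multiplying.
filled = a.multiply(b, fill_value=1.0)

print(plain)
print(filled)
```

With `fill_value=1.0`, the unmatched entry keeps its original value instead of becoming `NaN`. Whether that is the right choice depends on the analysis.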

If we want to keep the result of an analysis, we can assign it to a new column. Before we do, let's copy the dataframe so that our raw or observed data remains unmodified. Note that *this isn't always efficient*, as your dataset may be too large to hold in memory more than once, but it's good practice to separate data from analyses.

In [None]:
df_analysis = df.copy(deep=True)
df_analysis["sinx*cosx"] = df["sinx"] * df["cosx"]
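
To see why the copy protects the original, here is a small sketch using a hypothetical dataframe `raw` (not the notebook's `df`). Changes made to the deep copy, whether new columns or overwritten values, do not appear in the original.

```python
import pandas as pd

# Hypothetical stand-in for observed data.
raw = pd.DataFrame({"a": [1.0, 2.0, 3.0]})

# A deep copy gets its own data, independent of raw.
analysis = raw.copy(deep=True)
analysis["a_doubled"] = analysis["a"] * 2.0  # new column on the copy only
analysis.loc[0, "a"] = -1.0                  # overwrite a value on the copy only

print(raw)
print(analysis)
```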

In [None]:
df_analysis.plot()

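There are many ways to assign new values to a dataframe, including the `assign` method. Below, we first drop the column we just created and then add it back with `assign`. Because `sinx*cosx` is not a valid Python keyword argument name, we pass it by unpacking a dictionary with `**`.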

In [None]:
df_analysis = df_analysis.drop(columns=["sinx*cosx"])

In [None]:
df_analysis = df_analysis.assign(**{"sinx*cosx": df_analysis["sinx"] * df_analysis["cosx"]})
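
When the new column name is a valid Python identifier, `assign` can take it directly as a keyword argument, and it also accepts a callable that receives the dataframe. The sketch below uses a hypothetical dataframe `rect` with invented column names to show both forms.

```python
import pandas as pd

# Hypothetical data, used only to illustrate assign.
rect = pd.DataFrame({"width": [1.0, 2.0], "height": [3.0, 4.0]})

# Keyword form: the keyword becomes the column name.
with_area = rect.assign(area=rect["width"] * rect["height"])

# Callable form: each lambda receives the dataframe being assigned to.
with_both = rect.assign(
    area=lambda d: d["width"] * d["height"],
    perimeter=lambda d: 2.0 * (d["width"] + d["height"]),
)

# assign returns a new dataframe; rect itself is left unchanged.
print(rect.columns.tolist())
print(with_both)
```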

In [None]:
df_analysis.plot()

<a id="exercise2-pd"></a>
### Exercise 2

1. Calculate `sinx - cosx` and `sinx + cosx` and assign them to columns in the `df_analysis` dataframe.
2. Plot all of the columns, including the original `sinx` and `cosx` columns.

In [None]:
# your code here

[Return to top of notebook](#top-of-pd)<br>
[Return to top of section](#operations)

<a id="saving-loading"></a>
## 4. Saving and loading data

Tabular data in many forms can be saved to disk and loaded from disk using pandas. Two common file formats are CSV (comma-separated values) and Microsoft Excel files.

In [None]:
# Assumes a data/ directory already exists next to the notebook.
df.to_csv("data/my_data.csv")

In [None]:
pd.read_csv("data/my_data.csv", index_col=0, parse_dates=True)

In [None]:
# Writing .xlsx files requires an Excel engine such as openpyxl to be installed.
df.to_excel("data/my_data.xlsx")

In [None]:
pd.read_excel("data/my_data.xlsx", index_col=0, parse_dates=True)
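
The `index_col=0` and `parse_dates=True` arguments are what turn the first column of the file back into a datetime index. As a minimal sketch, the round trip below writes a small hypothetical dataframe `original` to an in-memory buffer instead of a file, so it runs without touching the disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row dataframe with an hourly datetime index.
index = pd.date_range("2020-01-01 00:00", periods=3, freq="h", name="Time")
original = pd.DataFrame({"value": [0.0, 0.5, 1.0]}, index=index)

# Write CSV text to an in-memory buffer, then rewind to read it back.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# index_col=0 uses the first column as the index; parse_dates=True parses it as datetimes.
restored = pd.read_csv(buffer, index_col=0, parse_dates=True)

print(type(restored.index).__name__, restored.index.name)
print(np.allclose(restored["value"], original["value"]))
```

Without these two arguments, the timestamps would come back as an ordinary column of strings and the dataframe would get a default integer index.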

[Return to top of notebook](#top-of-pd)<br>
[Return to top of section](#saving-loading)