## Python Workshop: Part 2

Welcome! This is part 2 of 2 of our intro to Python and Pandas.

In this notebook we will cover:
- Using APIs and API wrapper libraries to retrieve data
- Additional data visualizations with Seaborn
- Window functions and aggregations with Pandas
- Excel file I/O
- [Regular Expressions](https://docs.python.org/3/howto/regex.html)
- Decision Tree Learning with scikit-learn

We will focus on common data manipulation tasks in Pandas. The emphasis is on practical solutions without too many low-level details, so be sure to check out the [documentation](https://pandas.pydata.org/docs/index.html) if you want to learn more about the underlying functionality.

For additional visualization ideas, see the Seaborn [Gallery](https://seaborn.pydata.org/examples/index.html).

In [None]:
import os
import re
import requests
import numpy as np
import pandas as pd
import seaborn as sns
import yfinance as yf
from sklearn.tree import DecisionTreeRegressor, plot_tree

### Getting data from an API with requests
We will be using [NYC Air Quality Data](https://data.cityofnewyork.us/Environment/Air-Quality/c3uy-2p5r/about_data) accessed via an API endpoint.

The requests library is used to send HTTP requests to APIs. API responses are typically in JSON, a text format whose objects and arrays map closely to Python dicts and lists.
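
To make the JSON-to-dict correspondence concrete, here is a minimal sketch using the standard-library `json` module. The payload string is a made-up example shaped like an API response, not actual data from the endpoint.

```python
import json

# hypothetical payload: a JSON array of objects, as an API might return
payload = '[{"geo_place_name": "Bronx", "data_value": "7.2"}, {"geo_place_name": "Queens", "data_value": "6.8"}]'

# a JSON array becomes a Python list, and each JSON object becomes a dict
records = json.loads(payload)
first = records[0]

print(type(records), type(first))
print(first["geo_place_name"], first["data_value"])
```

Calling `.json()` on a requests response performs the same kind of parsing on the response body. Note that the numeric value is still a string here because it was quoted in the JSON.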

The endpoint is an https address, followed by a set of filters after the "?", written as `key=value` pairs joined by "&". Here we are filtering to PM 2.5 pollution (indicator ID = 365) at the borough level.
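
As a rough illustration of how the query string is put together, the sketch below builds the same URL from a dict of filters using only the standard library, then parses it back. The variable names `base_url` and `filters` are chosen for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://data.cityofnewyork.us/resource/c3uy-2p5r.json"

# each filter becomes a key=value pair after the "?"
filters = {"indicator_id": 365, "geo_type_name": "Borough"}
api_url = f"{base_url}?{urlencode(filters)}"
print(api_url)

# going the other way: split the query string back into its filters
parsed = parse_qs(urlsplit(api_url).query)
print(parsed)
```

With requests, passing the dict as `requests.get(base_url, params=filters)` should produce an equivalent query string without building it by hand.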

In [None]:
# The URL is used to specify the endpoint and query parameters
api_url = "https://data.cityofnewyork.us/resource/c3uy-2p5r.json?indicator_id=365&geo_type_name=Borough"

# a GET request asks the server for a response, in this case a list of air quality records
result = requests.get(api_url)

In [None]:
type(result)

In [None]:
type(result.json())

In [None]:
type(result.json()[0])

In [None]:
# let's check out the first result
result.json()[0]

In [None]:
# pd.DataFrame can parse a list of JSON items/dicts
aq = pd.DataFrame.from_records(result.json())
aq.head()

In [None]:
# what time periods are in this dataset?
aq.time_period.unique()

In [None]:
# drop annual from data so we can look at seasonal trends (we are using the NOT operator "~" instead of df.drop)
aq = aq[~aq.time_period.str.startswith("Annual")]
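
The filtering step above relies on a boolean mask. Here is a small sketch on made-up period labels (assumed for illustration, not taken from the dataset) showing what the mask and its negation look like.

```python
import pandas as pd

# hypothetical period labels for illustration
periods = pd.DataFrame({"time_period": ["Annual Average 2021", "Summer 2022", "Winter 2021-22"]})

# str.startswith returns one True/False per row
is_annual = periods.time_period.str.startswith("Annual")

# "~" flips each boolean, so indexing keeps only the rows that are NOT annual
seasonal = periods[~is_annual]
print(seasonal)
```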

In [None]:
# confirm we have borough-level data (not zip code or district)
aq.geo_type_name.unique()

In [None]:
# note that all of the API columns read in as text, so we'll need to convert the numbers and datetime
aq.info()

In [None]:
# convert to numeric
aq["data_value"] = pd.to_numeric(aq.data_value)

# convert to datetime
aq["start_date"] = pd.to_datetime(aq.start_date)
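
If a text column might contain values that cannot be parsed, both converters accept `errors="coerce"`, which turns bad entries into missing values instead of raising. A minimal sketch on made-up data (the `raw` frame is hypothetical):

```python
import pandas as pd

# hypothetical text columns with one unparseable entry each
raw = pd.DataFrame({
    "data_value": ["7.2", "n/a", "6.8"],
    "start_date": ["2022-06-01", "2021-12-01", "not a date"],
})

# unparseable numbers become NaN, unparseable dates become NaT
raw["data_value"] = pd.to_numeric(raw["data_value"], errors="coerce")
raw["start_date"] = pd.to_datetime(raw["start_date"], errors="coerce")

raw.info()
```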

In [None]:
# let's check out the distribution of air quality values in the data set (high is bad)
sns.histplot(aq.data_value, bins=15)

In [None]:
# plotting PM2.5 air pollution over time in each borough
sns.lineplot(data=aq, x="start_date", y="data_value", hue="geo_place_name")

### Using an API wrapper
Many APIs have a wrapper, i.e. a library containing higher-level functions for easier access to the API. **yfinance** is a community-maintained library for retrieving data from Yahoo Finance, which includes stock prices and other information on public companies.


In [None]:
# specify query parameters
ticker_list = ["AAPL", "MSFT", "NVDA", "SPY"]
start_date = "2024-01-01"
end_date = "2024-04-30"

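# auto_adjust=False keeps the separate "Adj Close" column used below (some newer yfinance releases adjust prices by default)
stocks = yf.download(tickers=ticker_list, start=start_date, end=end_date, interval="1d", auto_adjust=False)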

In [None]:
# note that the columns are a MultiIndex with two levels (Price, Ticker)
stocks.head()

In [None]:
# let's select prices from the dataframe and melt to normalize the data
prices = stocks["Adj Close"].reset_index().melt(id_vars="Date")
prices.tail()
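
To see what the select-and-melt step does without downloading anything, here is a sketch on a tiny made-up frame with the same two-level column layout. The tickers "AAA" and "BBB" and all values are invented for illustration.

```python
import pandas as pd

# hypothetical wide frame with two column levels, mimicking the (Price, Ticker) layout
cols = pd.MultiIndex.from_product([["Close", "Volume"], ["AAA", "BBB"]], names=["Price", "Ticker"])
wide = pd.DataFrame(
    [[10.0, 20.0, 100, 200], [11.0, 19.0, 110, 190]],
    index=pd.DatetimeIndex(["2024-01-02", "2024-01-03"], name="Date"),
    columns=cols,
)

# selecting on the first level leaves one column per ticker
close = wide["Close"]

# melt stacks the ticker columns into rows: one row per (Date, Ticker) pair
long_df = close.reset_index().melt(id_vars="Date")
print(long_df)
```

Because the remaining column level is named "Ticker", melt uses that name for the new variable column, which is why the plots can refer to `hue="Ticker"`.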

In [None]:
sns.lineplot(data=prices, x="Date", y="value", hue="Ticker")

### Calculating Window functions

Window functions are functions applied to distinct groups of data, without the final aggregation step of the GroupBy. This means you end up with the same number of rows you started with. Examples include ranking or a cumulative sum within each group.

In Pandas, you still use GroupBy to do this, and the function you choose determines whether you aggregate the rows in the group or return a windowed value.
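
The contrast between aggregating and windowed functions can be sketched on a toy frame. The data and column names below are made up for illustration.

```python
import pandas as pd

# hypothetical toy data: two groups with three rows each
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1, 2, 3, 10, 20, 30],
})

# aggregation: collapses each group to a single row
totals = df.groupby("group")["value"].sum()

# window-style functions: one result per original row
df["running_total"] = df.groupby("group")["value"].cumsum()
df["rank_in_group"] = df.groupby("group")["value"].rank(ascending=False)

# transform broadcasts an aggregate back onto every row of its group
df["group_total"] = df.groupby("group")["value"].transform("sum")

print(totals)
print(df)
```

Here `totals` has two rows (one per group), while the three new columns each have six values, matching the original frame.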

In [None]:
# we will calculate the daily percent change in price for each ticker symbol (ensure DF is sorted correctly!)
prices["daily_pct"] = (prices
                       .sort_values(["Ticker", "Date"])
                       .groupby("Ticker")["value"] # just pct change on the value column
                       .pct_change())

In [None]:
# using the daily percent changes, we can compound them into a cumulative return
# (1 + x).cumprod() is the growth factor relative to the first day, so 1.05 means +5%
prices["pct_return"] = (prices
                        .sort_values(["Ticker", "Date"])
                        .groupby("Ticker")["daily_pct"]
                        .transform(lambda x: (1 + x).cumprod())) # transform returns a Series aligned to the original index

In [None]:
# now we have daily and cumulative percent returns for each stock
prices.head(10)
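
As a sanity check on the compounding step, the sketch below uses a made-up price series to show that the cumulative product of (1 + daily change) matches each price divided by the first price.

```python
import pandas as pd

# hypothetical prices for a single ticker
prices_toy = pd.Series([100.0, 110.0, 99.0, 108.9])

# daily percent change; the first value is NaN because there is no previous day
daily = prices_toy.pct_change()

# compounding the daily changes gives the growth factor since the first day
growth = (1 + daily).cumprod()

# the same quantity computed directly from the prices
direct = prices_toy / prices_toy.iloc[0]

print(pd.DataFrame({"price": prices_toy, "daily": daily, "growth": growth, "direct": direct}))
```

The first row of the compounded series stays missing because the first daily change is undefined. After that, the two columns agree up to floating-point rounding.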

### Visualizing categorical and continuous data

With one categorical axis (ticker) and one continuous axis (price/change) we can use box-style and scatter-style plots to visualize the distribution of price changes for each ticker.

To plot the cumulative returns, we'll make a line plot of returns over time.

In [None]:
# let's look at the distribution of daily percent changes for each stock
sns.boxplot(data=prices, x="daily_pct", y="Ticker", hue="Ticker")

In [None]:
# let's make a plot to compare the percent returns
sns.lineplot(data=prices, x="Date", y="pct_return", hue="Ticker")

### Machine Learning (in 5 mins)

We will briefly demonstrate decision tree learning in Python with scikit-learn, using a decision tree to approximate a sine function. The example is based on [this](https://scikit-learn.org/stable/auto_examples/tree/plot_tree_regression.html#sphx-glr-auto-examples-tree-plot-tree-regression-py) scikit-learn tutorial.
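
For intuition about what a regression tree learns, here is a simplified sketch of choosing a single split on one feature by minimizing squared error. The function `best_split` is a hypothetical helper written for this illustration. It is not scikit-learn's implementation, which is considerably more elaborate.

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on 1-D x that minimizes the summed squared error
    when each side of the split predicts its own mean."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_thr, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical x values
        left, right = y[:i], y[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            # place the threshold halfway between the two neighboring points
            best_thr, best_sse = (x[i - 1] + x[i]) / 2, sse
    return best_thr, best_sse

# made-up data with an obvious jump between x=2 and x=3
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.1, 0.0, 1.0, 0.9, 1.0])

thr, sse = best_split(x, y)
print(thr, sse)
```

A full tree repeats this search on each side of the chosen split until a stopping rule such as `max_depth` is reached, which is why deeper trees produce more, smaller steps.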

In [None]:
# Create a random dataset
rng = np.random.RandomState(1)
# generate 80 random numbers on [0,1], multiply by 5 and sort
X = np.sort(5 * rng.rand(80, 1), axis=0)
# create target sin(X) with Gaussian noise (std 0.1) added to every observation
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# Fit two regression trees of different depths so we can compare them
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# combine into dataframe
ml_df = pd.DataFrame()
ml_df["x"] = X_test.ravel()
ml_df["y_true"] = np.sin(X_test).ravel()
ml_df["y_1"] = y_1
ml_df["y_2"] = y_2

ml_df_melt = ml_df.melt(id_vars=["x"], var_name="series")

In [None]:
ml_df_melt.head()

In [None]:
# let's plot the tree and learned rules
plot_tree(regr_1)

In [None]:
# let's compare the 2 fitted trees to the actual sine function
sns.lineplot(data=ml_df_melt, x="x", y="value", hue="series")