# Exploratory Data Analysis
This project runs some simple analyses of two large energy-sector ETFs, USO and ICLN. USO is an oil ETF; ICLN is a clean energy ETF.

## Yahoo Finance Python API Wrapper
Here we use `yfinance`, a Python wrapper around the Yahoo Finance API, to pull in data on stocks, markets, and commodities.

In [None]:
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# scipy.stats.mstats targets masked arrays; the plain scipy.stats versions suffice here
from scipy.stats import pearsonr, ttest_ind
from scipy import stats

sns.set_theme(style="whitegrid")

In [None]:
# Daily price history for both ETFs, 2018 through 2022
icln_5y = yf.download("ICLN", start="2018-01-01", end="2023-01-01").reset_index()
uso_5y = yf.download("USO", start="2018-01-01", end="2023-01-01").reset_index()

## Data Trends

First, we'll compare the oil and clean energy ETFs over the five years from 2018 through 2022.

In [None]:
sns.lineplot(uso_5y, x="Date", y="Open")
plt.title("USO (Oil) Prices 5y")
plt.ylabel("Price per Share $")

plt.figure()

sns.lineplot(icln_5y, x="Date", y="Open")
plt.title("ICLN (Clean Energy) Prices 5y")
plt.ylabel("Price per Share $")

In [None]:
cor_5y = pearsonr(uso_5y["Open"], icln_5y["Open"])
print(f"The correlation between USO and ICLN over the past 5 years is {round(cor_5y[0], 3)} with a p-value of {round(cor_5y[1], 3)}.")
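`pearsonr` needs two series of equal length, with each position referring to the same trading day. If the two downloads ever differ by a row, matching on `Date` first avoids a silent misalignment. A minimal sketch, assuming flat `Date` and `Open` columns as used above; `aligned_correlation` and the toy frames are hypothetical names with made-up values:

```python
import pandas as pd

def aligned_correlation(left: pd.DataFrame, right: pd.DataFrame, column: str = "Open") -> float:
    """Pearson correlation of `column` after matching rows on Date."""
    merged = left[["Date", column]].merge(
        right[["Date", column]],
        on="Date",
        how="inner",                      # keep only dates present in both frames
        suffixes=("_left", "_right"),
    )
    # Series.corr computes the Pearson coefficient by default
    return merged[f"{column}_left"].corr(merged[f"{column}_right"])

# Toy frames standing in for uso_5y and icln_5y (values are made up)
toy_a = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06"]),
    "Open": [10.0, 11.0, 12.0, 13.0],
})
toy_b = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07"]),
    "Open": [22.0, 24.0, 26.0, 30.0],
})

r = aligned_correlation(toy_a, toy_b)
print(f"Correlation on the shared dates: {r:.3f}")
```

Only the three shared dates are compared, so rows that exist in just one frame do not shift the pairing.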

Now we'll compare them just in 2022.

In [None]:
# Keep only the 2022 rows; >= includes January 1 itself if a row exists for it
uso_1y = uso_5y.loc[uso_5y["Date"] >= np.datetime64("2022-01-01")]
icln_1y = icln_5y.loc[icln_5y["Date"] >= np.datetime64("2022-01-01")]

In [None]:
sns.lineplot(uso_1y, x="Date", y="Open")
plt.title("USO (Oil) Prices 1y")
plt.ylabel("Price per Share $")

plt.figure()

sns.lineplot(icln_1y, x="Date", y="Open")
plt.title("ICLN (Clean Energy) Prices 1y")
plt.ylabel("Price per Share $")

In [None]:
cor_1y = pearsonr(uso_1y["Open"], icln_1y["Open"])
print(f"The correlation between USO and ICLN over 2022 is {round(cor_1y[0], 3)} with a p-value of {round(cor_1y[1], 3)}.")

Now we'll check whether the trading volumes of the two ETFs are significantly different.

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10, 5))

sns.histplot(uso_5y.Volume, bins="sturges", ax=axs[0])
sns.histplot(icln_5y.Volume, bins="sturges", ax=axs[1])

axs[0].set_title("USO 5y Volume Distribution")
axs[1].set_title("ICLN 5y Volume Distribution")

plt.tight_layout()

In [None]:
print(f"The USO mean trading volume over the past 5 years is {round(uso_5y.Volume.mean())}.")
print(f"The ICLN mean trading volume over the past 5 years is {round(icln_5y.Volume.mean())}.")

In [None]:
ttest_ind(uso_5y.Volume, icln_5y.Volume)

In [None]:
fig, axs = plt.subplots(1, 2, figsize=(10,5))

sns.histplot(uso_1y.Volume, bins="sturges", ax=axs[0])
sns.histplot(icln_1y.Volume, bins="sturges", ax=axs[1])

axs[0].set_title("USO 2022 Volume Distribution")
axs[1].set_title("ICLN 2022 Volume Distribution")

plt.tight_layout()

In [None]:
print(f"The USO mean trading volume over 2022 is {round(uso_1y.Volume.mean())}.")
print(f"The ICLN mean trading volume over 2022 is {round(icln_1y.Volume.mean())}.")

In [None]:
# Compare the 2022 subsets here, not the full 5-year frames
ttest_ind(uso_1y.Volume, icln_1y.Volume)

So the two ETFs have significantly different mean trading volumes in each time period. Over the 5-year window USO has the higher mean, but in 2022 ICLN had the higher mean trading volume.
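With default settings, `ttest_ind` assumes both samples share the same variance. Trading volumes are often right-skewed with unequal spreads, so Welch's variant (`equal_var=False`) may be a safer choice. A minimal sketch on synthetic data, not the real USO or ICLN volumes; the names `vol_a`, `vol_b`, and `welch` are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "volumes"; these are NOT real USO or ICLN figures
vol_a = rng.lognormal(mean=15.0, sigma=0.5, size=500)
vol_b = rng.lognormal(mean=14.0, sigma=0.5, size=500)

# equal_var=False selects Welch's t-test, which does not pool the variances
welch = stats.ttest_ind(vol_a, vol_b, equal_var=False)
print(f"t = {welch.statistic:.2f}, p = {welch.pvalue:.3g}")
```

The same call could be applied to `uso_5y.Volume` and `icln_5y.Volume`, or to the 2022 subsets, to check whether the conclusion holds without the equal-variance assumption.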

## Looking at Volatility
Now we'll compare the volatility of the two ETFs by looking at the standard deviation of their prices. This should probably be done as a percentage of each ETF's price, since the raw dollar value alone doesn't tell much of a story.
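As a sketch of the percentage-based approach, one option is the standard deviation of day-over-day percentage changes; another is the coefficient of variation (standard deviation divided by mean). Both are unitless, so ETFs at different price levels become comparable. The helper names and the toy series below are hypothetical, and the values are made up:

```python
import pandas as pd

def daily_return_volatility(prices: pd.Series) -> float:
    """Sample standard deviation of day-over-day percentage changes."""
    returns = prices.pct_change().dropna()   # first row has no previous day
    return float(returns.std())              # pandas uses ddof=1 by default

def coefficient_of_variation(prices: pd.Series) -> float:
    """Population standard deviation of the price divided by its mean."""
    return float(prices.std(ddof=0) / prices.mean())

# Toy opening prices: up 10%, down 10%, up 10%
toy_open = pd.Series([10.0, 11.0, 9.9, 10.89])

vol = daily_return_volatility(toy_open)
cv = coefficient_of_variation(toy_open)
print(f"Daily return sd: {vol:.4f}, coefficient of variation: {cv:.4f}")
```

Either helper could be called on `uso_5y.Open` and `icln_5y.Open` (or the 2022 subsets) alongside the dollar-based figures.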

In [None]:
print(f"USO 5y sd: {round(np.std(uso_5y.Open), 3)}")
print(f"ICLN 5y sd: {round(np.std(icln_5y.Open), 3)}")

In [None]:
print(f"USO 1y sd: {round(np.std(uso_1y.Open), 3)}")
print(f"ICLN 1y sd: {round(np.std(icln_1y.Open), 3)}")