<h1> Machine Learning: Regression Model</h1>
<h2>"Gold Price Prediction Model"</h2>

<p>
    <font size='3'> Gold is perceived as one of the safest investments, so its price tends to rise during economic crises. Since the onset of the latest crisis, the COVID-19 pandemic, the price of gold has indeed been rising. Given this relationship, I will explore the data to find relevant features among the indicators that reflect the COVID-19 trend, and build a regression model that predicts the next day's gold price from today's number of new COVID-19 cases and new deaths.
        <ul>
            <li><b>Dataset:</b> historical Gold ETF (<a href="https://en.wikipedia.org/wiki/SPDR_Gold_Shares">GLD</a>) prices and the COVID-19 dataset </li>
            <li><b>Prediction Target:</b> the next-day price of gold </li>
        </ul>
    </font>
    <font size='2'>
        Gold ETF price data source: <a href='https://finance.yahoo.com/quote/GLD/'>Yahoo Finance, SPDR Gold Shares</a>
        <br>
        COVID-19 data source: <a href='https://ourworldindata.org/coronavirus-source-data'>Our World in Data, Coronavirus Source Data</a>
    </font>
</p>

In [None]:
import wget
import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [None]:
# load data: covid 19 data
covid_df=pd.read_csv('https://raw.githubusercontent.com/owid/covid-19-data/master/public/data/owid-covid-data.csv')

## About Dataset

This COVID-19 dataset is a collection of the COVID-19 data maintained by [Our World in Data](https://ourworldindata.org/coronavirus-source-data). It is updated daily and includes data on confirmed cases, deaths, hospitalizations, testing, and vaccinations as well as other variables of potential interest.

**Column details can be found [here](https://github.com/owid/covid-19-data/blob/master/public/data/owid-covid-codebook.csv)**

## 1. Exploratory Data Analysis

In [None]:
covid_df.tail()

In [None]:
# load data: gold ETF, 'GLD' price
# period: from the start date of the covid dataset to today

covid_sdate=covid_df['date'].min()
today=pd.to_datetime('today')
gld_df=yf.download('GLD', covid_sdate, today, auto_adjust=True)
gld_df.head()

In [None]:
# reset index
gld_df.reset_index(inplace=True)
gld_df.tail()

In [None]:
gld_df.info()

In [None]:
covid_df.info()

In [None]:
covid_df['location'].unique()

<br>
<p>
    <font size='3'>This dataset contains observations for individual countries, such as 'United Kingdom', as well as for regions, such as 'North America'. Conveniently, there is also a location category named 'World'. I'll select the rows whose location column has the value 'World'.
    </font>
</p>
<br>

In [None]:
# pick rows related to worldwide data.
fea_df=covid_df[covid_df['location']=='World']
fea_df.info()

<br>
<p>
    <font size='3'> I picked 'date', 'new_cases', and 'new_deaths' because they seem the most relevant columns for showing the daily rise and fall of COVID-19 numbers, while the remaining columns are cumulative values or demographic information.
        'New vaccinations' could be a good predictor, but the number of observations is still very low (62 as of February 24th, 2021), so I excluded it from further analysis. 
    </font>
</p>
<br>
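
The coverage argument above can be checked mechanically. The following is a minimal sketch on a made-up toy frame, not the actual OWID data; the `min_coverage` threshold is a hypothetical choice for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the filtered COVID data (all values are made up).
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "new_cases": [100.0, 120.0, 90.0, 110.0, 130.0],
    "new_deaths": [5.0, 6.0, 4.0, 5.0, 7.0],
    "new_vaccinations": [np.nan, np.nan, np.nan, 10.0, 12.0],
})

# count() returns the number of non-null values in each column.
non_null = toy.count()

# Keep only the columns whose share of non-null rows reaches the
# (hypothetical) threshold.
min_coverage = 0.8
coverage = non_null / len(toy)
usable_columns = coverage[coverage >= min_coverage].index.tolist()
print(usable_columns)
```

On the real data, the same `count()` call on `fea_df` would give the per-column observation counts that `fea_df.info()` reports.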

In [None]:
# pick relevant columns (copy so that later column assignments do not act on a slice)
fea_df=fea_df[['date','new_cases','new_deaths']].copy()

# reset index
fea_df.reset_index(drop=True, inplace=True)

# view a dataframe
fea_df.tail()

In [None]:
fea_df.info()

In [None]:
# merge the gold price dataframe and covid dataframe

# rename the column name of date to match with the other dataframe.
gld_df.rename(columns={'Date':'date'},inplace=True)
gld=gld_df[['date','Close']]

# convert data type to match with the other dataframe.
fea_df['date']=pd.to_datetime(fea_df['date'])

# merge
gc_df=fea_df.merge(gld, how='inner', on='date')
gc_df.tail()

In [None]:
gc_df.shape

In [None]:
# plot the number of Covid new cases and GLD price
gc_df.plot(x='new_cases', y='Close', kind='scatter')
plt.xticks(rotation=45)
plt.xlabel('Covid New Cases')
plt.ylabel('GLD price')
plt.xlim(gc_df.new_cases.min(), gc_df.new_cases.max())
plt.ylim(gc_df.Close.min()-5, gc_df.Close.max()+5)
plt.show()

<br>
<p>
    <font size='3'> The plot shows that the GLD price rose steeply from 140 to 170 USD as daily new COVID-19 cases went from 0 to 100,000, and a spike above 190 USD appeared in the 200,000 to 300,000 range.
        Beyond 300,000 cases, the price moved between 170 and 180 USD. It appears that the number of daily new COVID-19 cases could be a good predictor of the GLD price.
    </font>
</p>
<br>
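
A visual impression like this can be backed by a correlation coefficient. The sketch below uses made-up numbers, not the real `gc_df`, and computes Pearson correlation alongside a rank-based (Spearman) one, which is less sensitive to a non-linear but monotonic relationship.

```python
import pandas as pd

# Toy data standing in for gc_df (values are made up for illustration).
toy = pd.DataFrame({
    "new_cases": [10, 20, 30, 40, 50, 60],
    "Close": [140.0, 155.0, 165.0, 190.0, 175.0, 178.0],
})

# Pearson correlation measures the linear association.
pearson = toy["new_cases"].corr(toy["Close"])

# Spearman correlation is the Pearson correlation of the ranks.
spearman = toy["new_cases"].rank().corr(toy["Close"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Neither coefficient proves predictive power, but a value near zero would be a warning sign before fitting a model.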

In [None]:
# plot the number of Covid new deaths and GLD price
gc_df.plot(x='new_deaths', y='Close', kind='scatter')
plt.xlabel('Covid New Deaths')
plt.ylabel('GLD price')
plt.show()

In [None]:
# plot the number of Covid new cases against the number of Covid new deaths
gc_df.plot(x='new_cases', y='new_deaths', kind='scatter')
plt.xlabel('Covid New Cases')
plt.ylabel('Covid New Deaths')
plt.show()

<br>
<p>
    <font size='3'> In the plot of new deaths against GLD price, the data points are too widely dispersed along the x-axis to reveal a general trend. The number of daily new COVID-19 deaths may therefore not be a good predictor of the GLD price. However, I'll check whether any distinguishable trend emerges when both features are plotted in a 3D graph.
    </font>
</p>
<br>

In [None]:
gc_df=gc_df.sort_values('new_cases')
gc_df=gc_df.reset_index(drop=True)

In [None]:
# plotting 3D
from mpl_toolkits import mplot3d
fig = plt.figure()
ax = plt.axes(projection='3d')


# Data for a three-dimensional scattered points
zdata = gc_df.Close
xdata = gc_df.new_cases
ydata = gc_df.new_deaths

ax.scatter3D(xdata, ydata, zdata, c=zdata, cmap='Greens')
ax.set_xlabel('new cases')
ax.set_ylabel('new deaths')
ax.set_zlabel('Price')
ax.invert_xaxis()
ax.view_init(30, 300)

<br>
<p>
    <font size='3'>
        The 3D plot doesn't show any particular shape, and I can't see the number of new deaths having any effect on it.  
        <br>
        Therefore, I will exclude 'new_deaths' from the feature set. 
    </font>
</p>
<br>

In [None]:
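# restore chronological order first: gc_df was sorted by 'new_cases' above,
# and shift(-1) only yields the next trading day's price on date-ordered rows
gc_df=gc_df.sort_values('date').reset_index(drop=True)

# move the 'Close' column up one row so that each row holds the next day's price
gc_df['Close']=gc_df['Close'].shift(-1)

# drop the last row
gc_df=gc_df.drop(index=gc_df.shape[0]-1)

# rename the price column
gc_df=gc_df.rename(columns={'Close':'next_day_price'})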
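
The effect of `shift(-1)` is easiest to see on a small example. This is a minimal sketch with made-up dates and prices; it assumes the rows must be in date order for the shifted value to be the next trading day's price.

```python
import pandas as pd

# Toy frame with rows deliberately out of date order (values are made up).
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-06", "2021-01-04", "2021-01-05", "2021-01-07"]),
    "new_cases": [300, 100, 200, 400],
    "Close": [172.0, 170.0, 171.0, 173.0],
})

# Sort chronologically so that "the next row" means "the next day".
toy = toy.sort_values("date").reset_index(drop=True)

# shift(-1) pulls each following row's Close up by one position.
toy["next_day_price"] = toy["Close"].shift(-1)

# The last row has no following day, so its target is NaN and is dropped.
toy = toy.dropna(subset=["next_day_price"])
print(toy)
```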

## 2. Modeling

In [None]:
# import preprocessing
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

# import the library pipeline
from sklearn.pipeline import Pipeline

# pipeline constructor: create a list of (name, step) tuples
xtrain, xtest, ytrain, ytest=train_test_split(gc_df[['new_cases']], gc_df['next_day_price'], test_size=0.3, random_state=0)
Input=[('Scale', StandardScaler()),('polynomial', PolynomialFeatures(degree=3)), ('model', LinearRegression()) ] 
pipe=Pipeline(Input)  #input the list in the pipeline constructor.

# train the pipeline object
pipe.fit(xtrain, ytrain)

# produce prediction
yhat=pipe.predict(xtest)

## 3. Model Evaluation

In [None]:
# evaluate the model: Pipeline.score returns the R^2 of the final LinearRegression step
print('Evaluation Score: {0:.2f}'.format(pipe.score(xtest, ytest)*100))
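
For a regression pipeline, `score` reports the coefficient of determination (R²), not a classification-style accuracy. The sketch below uses synthetic data, not the GLD data, to show that `score` matches `sklearn.metrics.r2_score` and to compute RMSE, which is expressed in the target's own units.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data (an assumption for illustration): y = 3x + 5 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 5.0 + rng.normal(0, 1.0, size=200)

# Fit on the first 150 rows and evaluate on the remaining 50.
model = LinearRegression().fit(X[:150], y[:150])
y_pred = model.predict(X[150:])

# score() and r2_score() compute the same quantity.
r2_from_score = model.score(X[150:], y[150:])
r2_from_metric = r2_score(y[150:], y_pred)

# RMSE is in the same units as the target (USD in the notebook's case).
rmse = np.sqrt(mean_squared_error(y[150:], y_pred))
print(f"R^2: {r2_from_score:.3f}, RMSE: {rmse:.3f}")
```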

## 4. Model Visualization

In [None]:
# create a dataframe with actual GLD price and predicted price.
# copy xtest so that the added columns do not modify the test features in place
test=xtest.copy()
test['actual_price']=ytest
test['predicted_price']=yhat

# sort values by the number of new cases in ascending order
test=test.sort_values('new_cases')
test.reset_index(drop=True, inplace=True)

In [None]:
# plot prediction line and actual observations

# plot actual price data points (labelled so that plt.legend() has entries to show)
plt.scatter(test.new_cases,test.actual_price, color='red', label='actual price')

# plot prediction model (polynomial regression model)
plt.plot(test.new_cases, test.predicted_price, color='blue', label='predicted price')
plt.title('Next-Day GLD Price Prediction Model')
plt.xlabel('Number of Covid New Cases')
plt.ylabel('GLD price(USD)')
plt.legend()
plt.show()

## Additional Theory Testing: GLD price prediction model with Russell 2000 index
<br>
<p>
    <font size='3'>
         The Russell 2000 Index is a small-cap stock index consisting of the smallest 2,000 stocks in the Russell 3000 Index. Its constituents are often regarded as risky investments, so the index responds sensitively to economic conditions. In light of these characteristics, I will explore the index's relationship with the GLD price and build a prediction model to see whether the index can also be a good predictor. 
    </font>
</p>
<br>


In [None]:
# load 'Russell 2000 index' data
rs=yf.download('^RUT', covid_sdate, today, auto_adjust=True)
print('Russell 2000 Index: ',rs.shape[0])

In [None]:
rs.info()

In [None]:
# rearrange a dataframe
rs.reset_index(inplace=True)
fea_df2=rs[['Date','Close']]
fea_df2=fea_df2.rename(columns={'Date':'date','Close':'Russell_2000_Index'})
# convert the renamed 'date' column to datetime and keep the result
fea_df2['date']=pd.to_datetime(fea_df2['date'])
fea_df2.head()

In [None]:
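# move the 'Close' column up one row so that each row holds the next day's price
# gld2 is not defined earlier in this notebook; it is assumed here to be a copy
# of the GLD date/price dataframe ('gld') built in the merge step above
gld2=gld.copy()
gld2['Close']=gld2['Close'].shift(-1)

# drop the last row
gld2=gld2.drop(index=gld2[gld2.Close.isnull()].index)

# rename the price column
gld2=gld2.rename(columns={'Close':'next_day_price'})

# merge on the shared 'date' column
gr_df=fea_df2.merge(gld2, how='inner', on='date')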

In [None]:
gr_df.info()

In [None]:
gr_df.plot(x='Russell_2000_Index', y='next_day_price', kind='scatter')
plt.show()

In [None]:
# pipeline constructor: create a list of (name, step) tuples
xtrain2, xtest2, ytrain2, ytest2=train_test_split(gr_df[['Russell_2000_Index']], gr_df['next_day_price'], test_size=0.2, random_state=0)

degrees=[]
models=[]
yhats=[]
accuracies=[]
for d in range(2,20):
    degrees.append(d)
    Input2=[('Scale', StandardScaler()),('polynomial', PolynomialFeatures(degree=d)), ('model', LinearRegression()) ] 
    pipe2=Pipeline(Input2)  #input the list in the pipeline constructor.

    # train the pipeline object
    pipe2.fit(xtrain2, ytrain2)
    models.append(pipe2)
    
    # produce prediction
    yhat2=pipe2.predict(xtest2)
    yhats.append(yhat2)

    # evaluate the model
    accuracies.append(round((pipe2.score(xtest2, ytest2)*100),2))

In [None]:
pd.DataFrame(accuracies, index=degrees).plot()
plt.xticks(degrees)
plt.yticks(np.arange(0,101, 10))
plt.legend('')
plt.xlabel('Polynomial Degree')
plt.ylabel('Model Accuracy')
plt.text(x=13.5, y=max(accuracies)+5, s=str(max(accuracies))+'%')
plt.show()
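
Choosing the degree by its score on the test set means the reported score is no longer an unbiased estimate. One common alternative is to pick the degree by cross-validation on the training data only. The sketch below uses synthetic quadratic data, not the Russell 2000 data, and the degree range 1 to 7 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption for illustration): y = x^2 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = X[:, 0] ** 2 + rng.normal(0, 0.5, size=300)

# Mean 5-fold cross-validated R^2 for each candidate degree.
cv_scores = {}
for d in range(1, 8):
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("polynomial", PolynomialFeatures(degree=d)),
        ("model", LinearRegression()),
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring="r2")
    cv_scores[d] = scores.mean()

# Pick the degree with the highest mean cross-validated score.
best_degree = max(cv_scores, key=cv_scores.get)
print(best_degree, round(cv_scores[best_degree], 3))
```

For date-ordered data such as prices, `sklearn.model_selection.TimeSeriesSplit` could be passed as `cv` so that each fold is validated only on later observations.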

In [None]:
i=accuracies.index(max(accuracies))

# create a dataframe with actual GLD price and predicted price.
# copy xtest2 so that the added columns do not modify the test features in place
test2=xtest2.copy()
test2['actual_price']=ytest2
test2['predicted_price']=yhats[i]

# sort values by the Russell 2000 Index in ascending order
test2=test2.sort_values('Russell_2000_Index')
test2.reset_index(drop=True, inplace=True)

# plot prediction line and actual observations

# plot actual price data points (labelled so that plt.legend() has entries to show)
plt.scatter(test2.Russell_2000_Index,test2.actual_price, color='red', label='actual price')

# plot prediction model (polynomial regression model)
plt.plot(test2.Russell_2000_Index, test2.predicted_price, color='blue', label='predicted price')
plt.title('Next-Day GLD Price Prediction Model (degree: %s)' %(degrees[i]))
plt.xlabel('Russell_2000_Index')
plt.ylabel('GLD price(USD)')
plt.legend()
plt.show()

## Additional Theory Testing: GLD price prediction model with Russell 2000 index
<br>
<p>
    <font size='3'>
          With the best-performing polynomial degree, the model accuracy (R² score) was 46%.
    </font>
</p>
<br>