# QF 627 Programming and Computational Finance
## Problem-Sets for Exercise `9` | `Questions`

> Hi Team, 👋

> The initial problem sets were designed for practicing supervised learning in classification problems and hierarchical risk parity algorithms, as well as applying unsupervised learning to portfolio management.

> Given that we haven't covered some of these topics in depth yet and will be discussing them further in Lessons 9 and 10, the problem sets have been revised.

> Having reviewed your submissions so far, some of the questions have been crafted specifically to enhance your grasp of the course material.

> I trust that the exercises below will support your review and understanding of the course content. 🤞

#### <font color = "green"> Please submit your answers via the eLearn submission folder. Again, you may submit incomplete answers. (Answer as fully as you can. This will help me to see where you stand.)

### For standardization of your answers…

> Please execute the lines of code below before you start work on your answers.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import matplotlib as mpl

from pandas_datareader import data as pdr
import datetime as dt
import yfinance as yf
yf.pdr_override()

> Let's set some print options.

In [2]:
np.set_printoptions(precision = 3)

plt.style.use("ggplot")

mpl.rcParams["axes.grid"] = True
mpl.rcParams["grid.color"] = "grey"
mpl.rcParams["grid.alpha"] = 0.25

mpl.rcParams["axes.facecolor"] = "white"

mpl.rcParams["legend.fontsize"] = 14

%matplotlib inline

## 👇 <font color = "purple"> Bigger Question 1. 
    
### The first question is to look for clusters of correlations using the agglomerative hierarchical clustering technique (AGNES).
    
### <font color = green> Using the 102 tickers below, and what you have learned in class, run the analysis and develop a dendrogram. Make sure to employ the inclusion criterion of less than 30% of missing values.
    
    According to the dendrogram, which of the stocks are most correlated? 
    
    Also based on the dendrogram, please identify two stocks that are not well correlated.

In [3]:
nasdaq100_components = pd.read_html("https://en.wikipedia.org/wiki/Nasdaq-100")[4]

nasdaq100_components

### Below are the lines of code that lead to an answer:

In [4]:
stocks = pdr.get_data_yahoo(list(nasdaq100_components["Ticker"]), start=dt.datetime(2000,1,1))
stocks = stocks.loc[ : , ("Adj Close")]

In [5]:
missing_fractions = \
    stocks \
    .isnull() \
    .mean() \
    .sort_values(ascending = False)

In [6]:
drop_list =\
    sorted(list(missing_fractions
                [missing_fractions > 0.3]
                .index)
           )

In [7]:
stocks1 =\
    stocks \
    .drop(labels= drop_list, 
          axis=1)
stocks1 = stocks1.ffill()  # forward-fill; fillna(method = "ffill") is deprecated in recent pandas
stocks1 = stocks1.dropna()
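
As a minimal sketch of the inclusion criterion, the toy example below keeps only the columns with less than 30% missing values and then forward-fills the remaining gaps. The prices and the tickers `FULL`, `MILD`, and `GAPPY` are made up for illustration; they are not part of the NASDAQ-100 data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy prices (not real market data)
toy_prices = pd.DataFrame({
    "FULL":  [1.0, 2.0, 3.0, 4.0, 5.0],        # 0% missing
    "GAPPY": [np.nan, np.nan, 3.0, 4.0, 5.0],  # 40% missing -> excluded
    "MILD":  [1.0, np.nan, 3.0, 4.0, 5.0],     # 20% missing -> kept
})

# Fraction of missing values per column
toy_missing = toy_prices.isnull().mean()

# Inclusion criterion: strictly less than 30% missing
keep = toy_missing[toy_missing < 0.3].index

# Forward-fill remaining gaps, then drop rows that are still incomplete
toy_filtered = toy_prices[keep].ffill().dropna()
print(toy_filtered)
```

Note that the sketch uses a strict `< 0.3` test to match the wording of the question; whether a column with exactly 30% missing values is kept depends on which inequality you choose.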

In [8]:
#Calculate average annual percentage return and volatilities over a theoretical one year period

returns =\
(
    stocks1
    .pct_change()
    .mean() 
    * 252
)

returns = pd.DataFrame(returns)

returns.columns = ["Returns"]

In [9]:
returns["Volatility"] =\
(    
     stocks1
    .pct_change()
    .std() 
    * np.sqrt(252)
)
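
The annualization in the two cells above scales the mean daily return by 252 and the daily standard deviation by the square root of 252. A small worked sketch on made-up prices (the series and the constant name `TRADING_DAYS` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

TRADING_DAYS = 252  # assumed number of trading days in a year

# Hypothetical toy price series
toy_prices = pd.Series([100.0, 101.0, 100.0, 102.0, 103.0])

# Simple daily returns; the first value is NaN and is dropped
toy_daily = toy_prices.pct_change().dropna()

# Mean scales linearly with time, volatility with the square root of time
toy_annual_return = toy_daily.mean() * TRADING_DAYS
toy_annual_vol = toy_daily.std() * np.sqrt(TRADING_DAYS)

print(f"annualized return:     {toy_annual_return:.2%}")
print(f"annualized volatility: {toy_annual_vol:.2%}")
```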

In [10]:
data = np.asarray([np.asarray(returns['Returns']),np.asarray(returns['Volatility'])]).T

In [11]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(data)
rescaledDataset = pd.DataFrame(scaler.transform(data), columns = returns.columns, index = returns.index)
X = rescaledDataset
X.head()

In [12]:
from scipy.cluster.hierarchy import dendrogram, linkage, ward, fcluster
# Calculate linkage
Z = linkage(X, method = "ward")
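
The linkage above is computed on standardized return and volatility features. Because the question is phrased in terms of correlation, one possible alternative is to cluster on a correlation-based distance instead. The sketch below is only an illustration on synthetic returns: the distance `sqrt(0.5 * (1 - rho))`, the average linkage, and the tickers `A` to `D` are all assumptions, not the method used in class.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)

# Synthetic daily returns: A and B share a common driver, C and D are independent
common = rng.normal(size=500)
toy_returns = pd.DataFrame({
    "A": common + 0.1 * rng.normal(size=500),
    "B": common + 0.1 * rng.normal(size=500),
    "C": rng.normal(size=500),
    "D": rng.normal(size=500),
})

# Correlation -> distance: highly correlated pairs end up close together
corr = toy_returns.corr()
dist = np.sqrt((0.5 * (1 - corr)).clip(lower=0))

# linkage expects a condensed (upper-triangle) distance vector
condensed = squareform(dist.values, checks=False)
Z_corr = linkage(condensed, method="average")

# Cut the tree into at most two clusters
toy_labels = fcluster(Z_corr, t=2, criterion="maxclust")
print(dict(zip(toy_returns.columns, toy_labels)))
```

With this distance, the pair that merges lowest in the dendrogram is the most correlated pair, which makes the "most correlated" question directly readable from the tree.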

### <font color = red> Answer 1 is presented in the cell below: </font>

In [13]:
# Plot dendrogram

plt.figure(figsize=(18, 10)
          )
plt.title("Stocks Dendrogram")

dendrogram(Z, labels = X.index)

plt.show()

In [14]:
distance_threshold = 0.02

clusters = fcluster(Z, distance_threshold, criterion='distance')

chosen_clusters = pd.DataFrame(data=clusters, 
                               columns=['cluster']
                              )

chosen_clusters['cluster'].value_counts().head(2)

According to the dendrogram, which of the stocks are most correlated? 

In [15]:
X.iloc[chosen_clusters[chosen_clusters['cluster']==43].index]

Also based on the dendrogram, please identify two stocks that are not well correlated.

Pick any stock from cluster 1 and any stock from cluster 2. Stocks that fall into different clusters (the two largest clusters are shown below) are not well correlated.

In [16]:
distance_threshold = 10

clusters = fcluster(Z, distance_threshold, criterion='distance')

chosen_clusters = pd.DataFrame(data=clusters, 
                               columns=['cluster']
                              )

chosen_clusters['cluster'].value_counts().head(2)

In [17]:
X.iloc[chosen_clusters[chosen_clusters['cluster']==1].index].head()

In [18]:
X.iloc[chosen_clusters[chosen_clusters['cluster']==2].index].head()

## 👇 <font color = "purple"> Bigger Question 2.

### The second question asks you to run a principal components analysis (PCA) for portfolio management. Begin your analysis with all the above stocks. Make sure to employ the inclusion criterion of less than 30% of missing values.
    
    Your objective is to find the portfolio using PCA.
    
    Select and normalize the four largest components and use them as weights for 
    portfolios that you can compare to an equal-weighted portfolio comprising all stocks.
    
    Identify the profile of the portfolio based on the portfolio weights.
    
    When comparing the performance of each portfolio over the sample period 
    to "the market", assess the performance of other portfolios that capture different 
    return patterns.
    
> Please use 75% of your data for PCA and 25% for backtesting.    
    
### <font color = "green"> NOTE: The investment horizon will be 10 years between 2010 and 2019.

### Below are the lines of code that lead to an answer:

In [19]:
stocks = pdr.get_data_yahoo(list(nasdaq100_components["Ticker"]), start=dt.datetime(2000,1,1), end=dt.datetime(2020,1,1))

In [20]:
stocks = stocks.loc[:,("Adj Close")]

In [21]:
missing_values =\
(
    stocks
    .isnull() # True (1) vs. False (0)
    .mean()
    .sort_values(ascending = False)
)

In [22]:
# Simplification: drop every column that has any missing value (stricter than the 30% criterion in the question)
drop_list =\
(
    sorted(list(missing_values[missing_values > 0]
                .index)
          )
)

stocks =\
(
    stocks
    .drop(labels = drop_list,
          axis = 1)
)

In [23]:
stocks =\
(
    stocks
    .ffill()  # forward-fill; fillna(method = "ffill") is deprecated in recent pandas
)

In [24]:
Daily_Linear_Return =\
(
    stocks
    .pct_change(1)
)

In [25]:
# Drop outliers: keep only the days on which every stock is within 3 standard deviations of its mean
Daily_Linear_Return =\
(
    Daily_Linear_Return[Daily_Linear_Return 
                        .apply(lambda x:(x - x.mean() #by column
                                        ).abs() < (3 * x.std()
                                                  )
                              )
                        .all(1)#by row
    ]
)
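
The filter above can be hard to read because of the nested `apply`. A minimal sketch of the same 3-standard-deviation rule on synthetic data, written with broadcasting instead, might look as follows (the data and the injected outlier are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic daily returns for three hypothetical stocks
toy_ret = pd.DataFrame(rng.normal(0, 0.01, size=(200, 3)), columns=list("ABC"))
toy_ret.iloc[10, 1] = 0.5  # inject one extreme return into column B

# True where a value is within 3 standard deviations of its column mean
within_band = (toy_ret - toy_ret.mean()).abs() < 3 * toy_ret.std()

# Keep a day only if every column is within its band on that day
toy_clean = toy_ret[within_band.all(axis=1)]

print(len(toy_ret), "->", len(toy_clean), "rows")
```

Because a whole row is dropped when any single stock breaches its band, the number of removed days grows with the number of stocks in the panel.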

In [26]:
scaler =\
(
    StandardScaler()
    .fit(Daily_Linear_Return)
)

In [27]:
scaled_stocks =\
(
    pd
    .DataFrame(scaler.fit_transform(Daily_Linear_Return),
               columns = Daily_Linear_Return.columns,
               index = Daily_Linear_Return.index)
)

scaled_stocks.describe()

In [28]:
prop =\
    int(len(scaled_stocks) * 0.75)

X_Train = scaled_stocks[    : prop] 
X_Test  = scaled_stocks[prop:     ] 

X_Train_Raw = Daily_Linear_Return[    :prop]
X_Test_Raw  = Daily_Linear_Return[prop:    ]

In [29]:
stock_tickers =\
(
 scaled_stocks
 .columns
 .values
)

stock_tickers

In [30]:
from sklearn.decomposition import PCA
from sklearn.decomposition import KernelPCA
pca = PCA()

PrincipalComponent = pca.fit(X_Train)

In [31]:
def PCWeights():

    weights = pd.DataFrame()

    for i in range(len(pca.components_[:4])  # only the four largest components
                  ):
        weights["weights_{}".format(i)] = pca.components_[i] / sum(pca.components_[i]
                                                                  )

    weights = weights.values.T
    return weights # Team, be careful with indentation

In [32]:
weights = PCWeights()
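
To make the normalization step concrete, here is a minimal sketch on synthetic returns. Each of the first four principal components is divided by the sum of its loadings so that the resulting weights add up to one. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)

# Synthetic returns: five hypothetical stocks driven by one common factor plus noise
market = rng.normal(size=(300, 1))
toy_returns = market + 0.5 * rng.normal(size=(300, 5))

pca_toy = PCA().fit(toy_returns)

# Components are ordered by explained variance, so the first four are the largest
top_components = pca_toy.components_[:4]

# Normalize each component so its loadings sum to 1 (portfolio weights)
toy_weights = top_components / top_components.sum(axis=1, keepdims=True)

print(np.round(toy_weights, 2))
print("explained variance ratio:", np.round(pca_toy.explained_variance_ratio_[:4], 3))
```

One thing to watch for: components after the first often have loadings that nearly cancel, so their sum can be close to zero and the normalized weights can become very large long/short positions.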

In [33]:
def calculate_sharpe_ratio(ts_returns, periods_per_year = 252):

    n_years = ts_returns.shape[0] / periods_per_year

    annualized_return = np.power(np.prod(1 + ts_returns), (1 / n_years)
                                ) - 1

    annualized_vol = ts_returns.std() * np.sqrt(periods_per_year)

    annualized_sharpe = annualized_return / annualized_vol

    return annualized_return, annualized_vol, annualized_sharpe
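
As a worked check of the arithmetic, the sketch below repeats the same calculation in a standalone function and applies it to a made-up two-year return series. The function name `annualized_stats` and the toy returns are hypothetical; as in the function above, no risk-free rate is subtracted.

```python
import numpy as np
import pandas as pd

def annualized_stats(daily_returns, periods_per_year=252):
    """Geometric annualized return, annualized volatility, and their ratio."""
    n_years = len(daily_returns) / periods_per_year
    growth = np.prod(1 + daily_returns)        # total growth factor over the sample
    ann_return = growth ** (1 / n_years) - 1   # geometric average per year
    ann_vol = daily_returns.std() * np.sqrt(periods_per_year)
    return ann_return, ann_vol, ann_return / ann_vol

# Hypothetical series: +2% and -1% alternating for 504 trading days (two years)
toy_returns = pd.Series([0.02, -0.01] * 252)

ann_return, ann_vol, sharpe = annualized_stats(toy_returns)
print(f"return = {ann_return:.2%}, volatility = {ann_vol:.2%}, Sharpe = {sharpe:.2f}")
```

Each +2%/-1% pair multiplies wealth by 1.02 × 0.99 = 1.0098, and there are 126 such pairs per year, so the annualized return equals 1.0098 to the power 126, minus 1.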

In [34]:
# Yet another gift

def backtest_PCA_porfolios(eigen):

    eigen_prtfi =\
        (
            pd
            .DataFrame(data = {"weights": eigen.squeeze()
                              },
                       index = stock_tickers)
        )

    eigen_prtfi.sort_values(by = ["weights"],
                            ascending = False,
                            inplace = True)

    eigen_prtfi_returns =\
    (
        np
        .dot(X_Test_Raw
             .loc[ : , eigen_prtfi.index],
             eigen)
    )

    eigen_portfolio_returns =\
    (
        pd
        .Series(eigen_prtfi_returns.squeeze(),
                index = X_Test_Raw.index)
    )

    returns, vol, sharpe = calculate_sharpe_ratio(eigen_portfolio_returns)

    print("Our PCA-based Portfolio:\nReturn = %.2f%%\nVolatility = %.2f%%\nSharpe = %.2f"  %
          (returns * 100, vol * 100, sharpe)
         )

    # Compared with what? Equal-weightage Portfolio

    equal_weight_return =\
    (
        X_Test_Raw * (1 / len(pca.components_)
                     )
    ).sum(axis = 1)

    df_plot =\
        (
            pd
            .DataFrame({"ML Portfolio Return": eigen_portfolio_returns,
                        "Equal Weight Index": equal_weight_return},
                      index = X_Test.index
                      )
        )

    (
        np
        .cumprod(df_plot + 1)
        .plot(title = "Returns of the equal weighted index vs. Eigen-Portfolio",
              figsize = [16, 8]
             )
    )

    plt.show()

### <font color = red> Answer 2 is presented in the cell below: </font>

In [35]:
for i in range(len(weights)):
    backtest_PCA_porfolios(eigen = weights[i])

### <font color = blue> 👉 Question 3. Using `pandas_datareader`, extract the stock prices of the following ticker symbols, between July 2015 and June 2019.

- General Motors `GM`
- Marriott `MAR`
- Pfizer `PFE`
- ExxonMobil `XOM`
- The Walt Disney Company `DIS`
- Bank of America `BAC`
- Procter & Gamble `PG`
- Hilton `HLT`
- Walmart `WMT`
- Twitter `TWTR`

### Then, calculate simple daily percentage changes in the stock prices, and store them into an object, printing the results into an output cell.

### Below are the lines of code that lead to an answer:

In [36]:
stock_tickers = ["GM", "MAR", "PFE", "XOM", "DIS", "BAC", "PG", "HLT", "WMT", "TWTR"]

In [37]:
stocks = pdr.get_data_yahoo(stock_tickers, start=dt.datetime(2015,7,1),end=dt.datetime(2019,7,1))

In [38]:
stocks = stocks.loc[:,("Adj Close")]

In [39]:
stocks = stocks.drop(labels = ["TWTR"], axis=1)

In [40]:
stocks.isnull().values.any()

In [41]:
returns = stocks.pct_change()

### <font color = red> Answer 3 is presented in the cell below: </font>

In [42]:
returns

### <font color = blue> 👉 Question 4. Using a box-and-whisker plot, compare the performance of the stocks over the given period of time. Find the stock with the highest variability and risk, based on the visualization.

### Below are the lines of code that lead to an answer:

In [43]:
returns.dropna(inplace=True)

### <font color = red> Answer 4 is presented in the cell below: </font>

In [44]:
fig = plt.figure(figsize=[16,8])
ax = fig.add_subplot(111)
plt.boxplot(returns)
ax.set_xticklabels(returns.columns)
plt.show()

### <font color = blue> 👉 Question 5. Create your own function to compare daily percentage changes between stocks, using a scatter plot and its distribution relative to a perfect diagonal (regression line). 

### Assess which of the following pairs seem to show the closest relationships.

1. ExxonMobil (`XOM`) and General Motors (`GM`)
2. Twitter (`TWTR`) and The Walt Disney Company (`DIS`)
3. Marriott (`MAR`) and Hilton (`HLT`)
4. Pfizer (`PFE`) and Procter & Gamble (`PG`)
5. Bank of America (`BAC`) and Walmart (`WMT`)

### Upon completion of the above, please carry out one further task for this question. 

### As you have learned in class, if you wish to look for all combinations of stocks you can use the scatter matrix graph provided by the `pandas` module. Create the scatter matrix, along with a Kernel Density Estimation on the diagonal.

### Below are the lines of code that lead to an answer:

In [45]:
def compare_two_stocks(stock1, stock2):
    stocks = pd.DataFrame({stock1.name:stock1, stock2.name:stock2})
    x = np.linspace(stocks.values.min(), stocks.values.max())
    plt.plot(x,x)
    plt.scatter(stock1, stock2)
    
    plt.xlabel(stock1.name)
    plt.ylabel(stock2.name)
    plt.title(f"{stock1.name} vs {stock2.name}")
    plt.show()

### <font color = red> Answer 5 is presented in the cell below: </font>

In [46]:
lst_pairs = [["XOM","GM"], ["MAR", "HLT"], ["PFE", "PG"], ["BAC", "WMT"]]

for pair in lst_pairs:
    compare_two_stocks(returns[pair[0]], returns[pair[1]])

`MAR` and `HLT` appear to show the closest relationship.

In [47]:
from pandas.plotting import scatter_matrix
scatter_matrix(returns,
               figsize = (16, 16),
               diagonal = "kde"  # kernel density estimate on the diagonal, as the question asks
              )
plt.show()

### <font color = blue> 👉 Question 6. It is often useful to analyze stock performance against a market index such as the S&P 500. This will give a sense of how a stock price compares to movements in the overall market.

### Carry out the following analysis steps.

<font color = green>

> ### 1. Extract the S&P 500 (`^GSPC`) data for the same time period used for the stocks in Question 1.

> ### 2. In order to perform comparisons, you must run the same calculations to derive the daily percentage changes and cumulative returns on the index. You might first want to concatenate the index calculations in the results of the calculations of the stocks with respect to daily percentage changes. The process will lead you to efficiently compare the overall set of stocks and index calculations for daily percentage changes.

> ### 3. Calculate the cumulative daily returns.

> ### 4. To complete this analysis, calculate the correlation of the daily percentage change values.

> ### 5. Using the location accessor, print only the correlation coefficients of each stock relative to the S&P 500, in descending order.

</font> 
        
### Which stock price moved in the most similar way to the S&P 500? Which moved in the least similar way?

### Below are the lines of code that lead to an answer:

In [48]:
sp500 = pdr.get_data_yahoo(["^GSPC"], start=dt.datetime(2000,1,1))[["Adj Close"]]

In [49]:
sp500 = sp500.rename(columns={"Adj Close":"sp500"})
prices = pd.concat([sp500, stocks1], axis=1)
returns = prices.pct_change()

In [50]:
cul_returns = np.cumprod(returns+1)

In [51]:
correlation_matrix = returns.corr()

### <font color = red> Answer 6 is presented in the cell below: </font>

In [52]:
correlation_matrix["sp500"].sort_values(ascending=False)

**Which stock price moved in the most similar way to the S&P 500?**

In [53]:
correlation_matrix["sp500"].sort_values(ascending=False).index[1]

**Which moved in the least similar way?**

In [54]:
correlation_matrix["sp500"].sort_values(ascending=False).index[-1]
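
The cells above rely on the index correlating perfectly with itself, so that position 0 of the sorted result is the index and position 1 is the closest stock. A slightly more explicit sketch, using the location accessor to exclude the index row first, is shown below on synthetic data (the tickers `HIGH` and `LOW` and all returns are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic daily returns: HIGH tracks the index closely, LOW is unrelated
index_ret = rng.normal(0, 0.01, 500)
toy_ret = pd.DataFrame({
    "sp500": index_ret,
    "HIGH": index_ret + rng.normal(0, 0.002, 500),
    "LOW": rng.normal(0, 0.01, 500),
})

toy_corr = toy_ret.corr()

# .loc: all rows except the index itself, only the "sp500" column
vs_index = (
    toy_corr
    .loc[toy_corr.index != "sp500", "sp500"]
    .sort_values(ascending=False)
)

most_similar = vs_index.index[0]
least_similar = vs_index.index[-1]
print(vs_index)
```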

### <font color = blue> 👉 Question 7. One common type of data visualization in finance is a stock’s trading volume relative to its closing price.

### Create the chart below after obtaining the data from Yahoo! Finance, using `pandas_datareader`. The target symbol is `AMZN`, and our period of interest is between January 2007 and December 2009. 

In [55]:
AMZN = pdr.get_data_yahoo(["AMZN"], start=dt.datetime(2007,1,1), end=dt.datetime(2010,1,1))

### Below are the lines of code that lead to an answer:

In [56]:
def visulize_price_volume(stock):
    plt.figure(figsize=(18, 9))

    plt.bar(stock.index, stock["Volume"], color='blue', alpha=0.5)

    ax2 = plt.twinx()

    ax2.plot(stock.index, stock["Close"], color='red')


    plt.xlabel('Date')
    plt.ylabel('Volume')
    ax2.set_ylabel('Closing Price')

    plt.title('Stock Trading Volume vs. Closing Price')
    plt.show()

### <font color = red> Answer 7 is presented in the cell below: </font>

In [57]:
visulize_price_volume(AMZN)

## 👇 <font color = "purple"> Bigger Question 8. 

### Please create a predictive model for the weekly return of NFLX stock. You will use supervised learning for your predictive modelling.

> As you learned in class, to do this it is essential to know what factors are related to Netflix’s stock price, and to incorporate as much information as you can into the model.

> Among the three major factors (correlated assets, technical indicators, and fundamental analysis), you will use correlated assets and technical indicators as features here.

    Step 1. Use 75% of your data for the training of your algorithm, and 25% for the testing set.

    Step 2. For your feature engineering...
    
> Our operational definition of `outcome` (`Y`) is the weekly return of Netflix (NFLX). The number of trading days in a week is assumed to be five, and we compute the return using five trading days. 
<br>
    
* <font color = "green"> NOTE: The lagged five-day variables embed the time series component by using a time-delay approach, where the lagged variable is included as one of the predictor variables. This step translates the time series data into a supervised regression-based model framework.
<br>    
    
> For `input features` (`predictors`; `Xs`), we use the following variables:

> `Correlated assets`

* lagged five-day returns of stocks (META, AAPL, AMZN, GOOGL);
* currency exchange rates (USD/JPY and GBP/USD);
* indices (S&P 500, Dow Jones, and VIX);
* lagged five-day, 15-day, 30-day, and 60-day returns of NFLX.

> `Technical indicators`

* 21-day, 63-day, and 252-day moving averages;
* 10-day, 30-day, and 200-day exponential moving averages;
* 10-day, 30-day, and 200-day relative strength index;
* stochastic oscillator %K and %D (using rolling windows of 10-, 30-, 200-day);
* rate of change (using 10-, 30-day past prices).
    
    
    Step 3. Please assess the model performance of the following algorithms: 

    
* Linear Regression
* Elastic Net
* LASSO
* Support Vector Machine
* K-Nearest Neighbor
* ARIMA
* Decision Tree
* Extra Trees 
* Random Forest
* Gradient Boosting Tree
* Adaptive Boosting
    
    
    Step 4. For this exercise, hyperparameter tuning is not requested. 
    
    Step 5. But make sure to compare the model performance of the above algorithms.

> The metric for assessing model performance will be mean squared error (`MSE`).
<br>

> Show which of the algorithms perform relatively better with a comparison visualization of performance on both the training and testing sets, as learned in class. 

    Step 6. Using the model of your choice, please visualize the actual vs. predicted (estimated) data.

### Below are the lines of code that lead to an answer:

In [58]:
stock_ticker = ["NFLX","META","AAPL","AMZN","GOOGL"]

currency_ticker = ["DEXJPUS", "DEXUSUK"]

index_ticker = ["SP500", "DJIA", "VIXCLS"]


stock_data = pdr.get_data_yahoo(stock_ticker)
currency_data = pdr.get_data_fred(currency_ticker)
index_data = pdr.get_data_fred(index_ticker)

In [59]:
return_period = 5

In [60]:
Y =\
    (np
     .log(stock_data.loc[ : , ("Adj Close", "NFLX")]
         )
     .diff(return_period)
    )

Y

In [61]:
Y.name = Y.name[-1]+"_pred"
Y

In [62]:
X1 =\
    (np.
     log(stock_data.loc[ : , ("Adj Close", ("META","AAPL","AMZN","GOOGL")
                             )
                       ]
        )
     .diff(return_period)
     .shift(return_period)
    )

X1.columns =\
    (X1
     .columns
     .droplevel()
    )

X1
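
The combination of `.diff(return_period)` and `.shift(return_period)` is what keeps the predictors strictly in the past relative to the outcome. A minimal sketch on a made-up log-price series shows the alignment (the series and the names `target` and `feature` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

h = 5  # return horizon in trading days

# Hypothetical log-price series
log_price = pd.Series(np.arange(20, dtype=float) ** 2 / 100)

# Outcome at day t: log return over the window (t-5, t]
target = log_price.diff(h)

# Predictor at day t: log return over the previous window (t-10, t-5]
feature = log_price.diff(h).shift(h)

aligned = pd.DataFrame({"target": target, "feature": feature})
print(aligned.iloc[8:13])
```

The predictor on day 10 equals the outcome on day 5, so the two windows never overlap and no future information leaks into the features.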

In [63]:
X2 =\
    (np
     .log(currency_data)
     .diff(return_period)
    )

X2

In [64]:
X3 =\
    (np
     .log(index_data)
     .diff(return_period)
    )

X3

In [65]:
X4 =\
    (
    pd
    .concat([np
             .log(stock_data.loc[ : , ("Adj Close", "NFLX")
                                ]
                 )
             .diff(i) for i in [return_period, 
                                return_period * 3, 
                                return_period * 6, 
                                return_period * 12]
            ],
           axis = 1
           )
    .shift(return_period)
    .dropna()
)

X4.columns = ["NFLX_DT", "NFLX_3DT", "NFLX_6DT", "NFLX_12DT"]

In [66]:
NFLX = pdr.get_data_yahoo(["NFLX"])
NFLX.head()

In [67]:
X5 = (
    pd.concat(
        [
            NFLX["Adj Close"]
            .rolling(i)
            .mean() 
            for i in [21
                      ,63
                      ,252
                     ]
        ],
        axis=1
    )
)

X5.columns = ["NFLX_SMA21","NFLX_SMA63","NFLX_SMA252"]

In [68]:
X6 = (
    pd.concat(
        [
            NFLX["Adj Close"]
            .ewm(span = i).mean()  # span = i: conventional i-day EMA (a bare ewm(i) sets com = i)
            for i in [10,
                      30,
                      200]
        ],
        axis=1
    )
)

X6.columns = ["NFLX_EMA10","NFLX_EMA30","NFLX_EMA200"]

In [69]:
def cal_RSI(period, stock1):
    stock = stock1.copy()
    stock["change"] = stock.diff()

    stock["gain"] = stock["change"].apply(lambda x: x if x > 0 else 0)
    stock["loss"] = stock["change"].apply(lambda x: -x if x < 0 else 0)

    stock["avg_gain"] = stock["gain"].rolling(period).mean()
    stock["avg_loss"] = stock["loss"].rolling(period).mean()
    
    # Wilder smoothing; use the requested period rather than a hard-coded 14
    for i in range(period, len(stock)):
        stock.iloc[i,stock.columns.get_loc("avg_gain")] = (stock.iloc[i-1]["avg_gain"]*(period - 1) + stock.iloc[i]["gain"])/period
        stock.iloc[i,stock.columns.get_loc("avg_loss")] = (stock.iloc[i-1]["avg_loss"]*(period - 1) + stock.iloc[i]["loss"])/period
    
    stock["RS"] = stock["avg_gain"]/stock["avg_loss"]
    stock["RSI"] = 100 - 100/(1+stock["RS"])
    
    return stock["RSI"]
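
The row-by-row loop above can be slow on long price histories. Wilder's smoothing is an exponentially weighted mean with `alpha = 1 / period`, so a vectorized variant is possible. The sketch below is an alternative under that assumption, not a drop-in replacement: the function name `rsi_wilder` is hypothetical, and it seeds the average differently from the loop (which starts from a simple rolling mean), so early values will not match exactly.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close, period=14):
    """Relative strength index with Wilder smoothing (vectorized sketch)."""
    change = close.diff()
    gain = change.clip(lower=0)       # positive moves, zero otherwise
    loss = (-change).clip(lower=0)    # size of negative moves, zero otherwise

    # Wilder smoothing == EWM with alpha = 1 / period and adjust=False
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()

    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Hypothetical price paths: strictly rising and strictly falling
rising = pd.Series(np.arange(1.0, 41.0))
falling = pd.Series(np.arange(40.0, 0.0, -1.0))

rsi_up = rsi_wilder(rising, period=14)
rsi_down = rsi_wilder(falling, period=14)
print(rsi_up.dropna().unique(), rsi_down.dropna().unique())
```

A price that only rises has no losses, so the index sits at 100; a price that only falls sits at 0. These two extremes are a quick sanity check for any RSI implementation.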

In [70]:
X7 = pd.DataFrame([cal_RSI(10,NFLX[["Adj Close"]]),cal_RSI(30,NFLX[["Adj Close"]]),cal_RSI(200,NFLX[["Adj Close"]])]).T

X7.columns = ["NFLX_RSI10","NFLX_RSI30","NFLX_RSI200"]

In [71]:
def cal_SO(period, stock1):
    stock = stock1.copy()
    stock["lowest_low"] = stock["Low"].rolling(period).min()
    stock["highest_high"] = stock["High"].rolling(period).max()
    stock[f"{period}%K"] = ((stock["Close"] - stock["lowest_low"]) / (stock["highest_high"] - stock["lowest_low"])) * 100
    stock[f"{period}%D"] = stock[f"{period}%K"].rolling(3).mean()
    
    return stock[[f"{period}%K", f"{period}%D"]]

In [72]:
X8 = pd.concat([cal_SO(10, NFLX),cal_SO(30, NFLX),cal_SO(200, NFLX)], axis = 1)
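
To check the stochastic oscillator arithmetic by hand, the sketch below applies the same %K and %D definitions to a tiny made-up table of highs, lows, and closes with a 3-day window. The function name `stochastic_kd` and the prices are hypothetical.

```python
import pandas as pd

def stochastic_kd(ohlc, period, smooth=3):
    """%K: position of the close within the rolling high-low range; %D: its moving average."""
    lowest_low = ohlc["Low"].rolling(period).min()
    highest_high = ohlc["High"].rolling(period).max()
    k = (ohlc["Close"] - lowest_low) / (highest_high - lowest_low) * 100
    d = k.rolling(smooth).mean()
    return pd.DataFrame({"%K": k, "%D": d})

# Hypothetical toy prices
toy_ohlc = pd.DataFrame({
    "High":  [10.0, 11.0, 12.0, 13.0, 14.0],
    "Low":   [8.0, 9.0, 10.0, 11.0, 12.0],
    "Close": [9.0, 10.0, 11.0, 12.0, 14.0],
})

toy_so = stochastic_kd(toy_ohlc, period=3)
print(toy_so)
```

On the last day the close equals the highest high of the window, so %K is 100; on the third day the close of 11 sits three quarters of the way up the 8 to 12 range, so %K is 75.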

In [73]:
X9 = pd.concat([NFLX["Adj Close"].pct_change(10)*100, NFLX["Adj Close"].pct_change(30)*100], axis = 1)
X9.columns = ["ROC10", "ROC30"]

In [74]:
X = pd.concat([X1, X2, X3, X4, X5, X6, X7, X8, X9],axis=1)
X

In [75]:
data =\
(
pd
.concat([Y, X],
        axis = 1)
.dropna()
.iloc[ : :return_period, :]
)

In [76]:
Y = data.loc[ : , Y.name]

Y

In [77]:
X = data.loc[ : , X.columns]

X

In [78]:
validation_size = 0.25

train_size =\
    int(len(X) 
        * 
        (1 - validation_size)
       )

X_train, X_test =\
    (X[0         :train_size], 
     X[train_size:len(X)    ]
    )

Y_train, Y_test =\
    (Y[0         :train_size], 
     Y[train_size:len(X)    ]
    )

In [79]:
# Next, we load the models mentioned in Step 3
# Loading Algorithm

from sklearn.linear_model import LinearRegression

# Regularization
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

# Decision Tree
from sklearn.tree import DecisionTreeRegressor

# ENSEMBLE

## Bagging
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import ExtraTreesRegressor

## Boosting
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

# Support Vector Machine
from sklearn.svm import SVR

# K-Nearest Neighbor
from sklearn.neighbors import KNeighborsRegressor

# Multi-layer Perceptron (Neural Networks)
from sklearn.neural_network import MLPRegressor

In [80]:
# for assessment
from sklearn.metrics import mean_squared_error

# to ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [81]:
models = []
#Linear Regression
models.append(("LR", LinearRegression()
             )
            )

#Elastic Net
models.append(("EN", ElasticNet()
             )
            )

#LASSO
models.append(("LASSO", Lasso()
             )
            )

#Support Vector Machine
models.append(("SVR", SVR()
             )
            )

#K-Nearest Neighbor
models.append(("KNN", KNeighborsRegressor()
             )
            )

#ARIMA

#Decision Tree
models.append(("CART", DecisionTreeRegressor()
             )
            )

#Extra Trees
models.append(("ETR", ExtraTreesRegressor()
              )
             )

#Random Forest
models.append(("RFR", RandomForestRegressor()
              )
             )

#Gradient Boosting Tree
models.append(("GBR", GradientBoostingRegressor()
              )
             )

#Adaptive Boosting
models.append(("ABR", AdaBoostRegressor()
              )
             )

In [82]:
train_results = []
test_results = []

names = []

for name, model in models:
    
    names.append(name)
    
    res = model.fit(X_train, Y_train)
    train_result = mean_squared_error(res.predict(X_train), Y_train)
    train_results.append(train_result)
    
    test_result = mean_squared_error(res.predict(X_test), Y_test)
    test_results.append(test_result)
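
The same fit-and-score loop can be tried on synthetic data to see why both training and testing errors are reported. In the sketch below, which uses made-up features and only two of the algorithms, an unconstrained decision tree memorizes the training set (training MSE of essentially zero) while its testing error stays well above that. All names and data here are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)

# Synthetic regression problem: linear signal plus noise
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.5, 200)

# Chronological 75/25 split (no shuffling)
split = int(len(X_toy) * 0.75)
X_tr, X_te = X_toy[:split], X_toy[split:]
y_tr, y_te = y_toy[:split], y_toy[split:]

toy_models = [
    ("LR", LinearRegression()),
    ("CART", DecisionTreeRegressor(random_state=0)),
]

toy_scores = {}
for name, model in toy_models:
    model.fit(X_tr, y_tr)
    toy_scores[name] = (
        mean_squared_error(y_tr, model.predict(X_tr)),  # training MSE
        mean_squared_error(y_te, model.predict(X_te)),  # testing MSE
    )

for name, (train_mse, test_mse) in toy_scores.items():
    print(f"{name}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

A large gap between training and testing error is the usual sign of overfitting, which is why the comparison chart plots the two side by side.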
    

In [83]:
#we're not done yet still have ARIMA left
import statsmodels.tsa.arima.model as stats
import statsmodels.api as sm

modelARIMA =\
(    stats
     .ARIMA(endog = Y_train,
                exog = X_train,
                order = [1, 0, 0]
            )
)

model_fit = modelARIMA.fit()

In [84]:
train_mse = mean_squared_error(Y_train,model_fit.fittedvalues)

In [85]:
predicted =\
(
    model_fit
    .predict(start = train_size - 1,
             end = len(X) - 1,
             exog = X_test)[1: ]
)

In [86]:
test_mse = mean_squared_error(Y_test,predicted)

In [87]:
names.append("ARIMA")
train_results.append(train_mse)
test_results.append(test_mse)

In [88]:
#finally we visualize it
fig = plt.figure(figsize = [16, 8])

ind = np.arange(len(names)
               )

width = 0.30

fig.suptitle("Comparing the Performance of Various Algorithms on the Training vs. Testing Data")

ax = fig.add_subplot(111)

(plt
 .bar(ind - width/2,
      train_results,
      width = width,
      label = "Errors in Training Set")
)

(plt
 .bar(ind + width/2,
      test_results,
      width = width,
      label = "Errors in Testing Set")
)

plt.legend()

ax.set_xticks(ind)
ax.set_xticklabels(names)

plt.ylabel("Mean Squared Error (MSE)")

plt.show()

In [89]:
res = models[-1][1].fit(X_train,Y_train)
predicted = res.predict(X_test)

In [90]:
predicted = pd.DataFrame(predicted)
predicted.index = Y_test.index

### <font color = red> Answer 8 is presented in the cell below: </font>

In [91]:
plt.figure(figsize=[16, 8])
plt.plot(predicted, label="ABR_predicted_data")
plt.plot(Y_test, label="actual data")

plt.legend()

plt.show()

> 💯 “Thank you for putting your efforts into the individual assessment questions” 😊