# Regression Model

**a. Assumptions:**

* We assume a linear relationship between the predictors (Open, High, Low, Volume) and Close. The correlation matrix computed during EDA supports this assumption.
* Even though Open, High and Low are highly correlated with one another, we keep all of them as predictors of Close.
* We accounted for multicollinearity only in the regularised models (Ridge and Lasso).

**b. Model Chosen with Rationale:**

We tested Ridge regression, Lasso regression, XGBoost and Multiple Linear Regression (MLR).
We chose MLR as our model because:
* It overfit the data less than the other models.
* It gave the best coefficient estimates for the independent variables, consistent with the strong linear relationship between Close and Open, High and Low.
* Given the strong linear relationship between the independent variables and the dependent variable, MLR is the most natural choice.

**c. Parameters**

* We used a scikit-learn `Pipeline` that chains `StandardScaler` and `LinearRegression`, trained on the entire training dataset.
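
The pipeline described above can be sketched as follows. This is a minimal illustration on synthetic data, not the competition dataset: the feature values, the way `close` is generated, and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Open/High/Low/Volume features (not the real data).
rng = np.random.default_rng(0)
open_ = rng.uniform(100, 200, size=200)
high = open_ + rng.uniform(0, 5, size=200)
low = open_ - rng.uniform(0, 5, size=200)
volume = rng.uniform(1e5, 1e6, size=200)
X = np.column_stack([open_, high, low, volume])

# Hypothetical target: Close sits between Low and High, plus a little noise.
close = 0.5 * high + 0.5 * low + rng.normal(0, 0.1, size=200)

# Scale the features, then fit ordinary least squares.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
pipe.fit(X, close)

r2 = pipe.score(X, close)                   # in-sample R^2
coefs = pipe.named_steps["model"].coef_     # one coefficient per scaled feature
print(r2, coefs)
```

For ordinary least squares with an intercept, standardising the features does not change the predictions; it only puts the coefficients on a comparable scale.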

# Time Series Model

**a. Assumptions**

* We use an ARIMA(2,1,1) model for prediction; its first-order differencing (d=1) removes the trend component.
* We found that the series is not stationary: it has both a trend and a seasonal component.
* We do not model the seasonal component.
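
To illustrate why d=1 removes a trend, the sketch below differences a synthetic series once and compares the fitted slope before and after. The series (a linear trend plus noise) and the helper `slope` are assumptions made for the example, not the stock data.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(300)

# Hypothetical series: linear trend of 0.8 per step plus Gaussian noise.
series = 50.0 + 0.8 * t + rng.normal(0, 1.0, size=300)

# First-order differencing, which is what d=1 applies inside ARIMA.
diffed = np.diff(series, n=1)

def slope(y):
    """Least-squares slope of y against its integer position."""
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0]

slope_before = slope(series)   # close to the trend of 0.8
slope_after = slope(diffed)    # close to 0: the trend is gone
print(slope_before, slope_after, diffed.mean())
```

After differencing, the trend shows up only as a constant mean (about 0.8 here) instead of a steadily rising level. A seasonal pattern would still be present in the differenced series, which is why a non-seasonal ARIMA leaves it unmodelled.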

**b. Model Chosen with Rationale**

* From the ACF and PACF plots, we read off q=1 and p=1 as starting values.
* Using auto-ARIMA, we found ARIMA(2,1,1) to be the optimal model and used it for training.
* We also tested an LSTM on the dataset, since our research suggested it was among the best models for time series analysis.

**c. Parameters**

* p=2, d=1, q=1; no seasonal terms (P=0, D=0, Q=0; auto-ARIMA was run with `seasonal=False` and `m=1`)

# Comparison between Regression and Time Series models
**a. Based on validation performance metrics**

We chose the MLR model (RMSE=1.68) because it achieved a lower RMSE on the given test data than the ARIMA(2,1,1) model (RMSE=32.4).
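
For reference, the RMSE used in this comparison is the square root of the mean squared error. The helper below is a small sketch of that formula; the function name `rmse` and the toy numbers are illustrative assumptions and are unrelated to the scores reported above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values only: every error in pred_a is +/-1, every error in pred_b is +/-10.
actual = [100.0, 102.0, 101.0, 105.0]
pred_a = [101.0, 101.0, 102.0, 104.0]
pred_b = [110.0, 92.0, 111.0, 95.0]

print(rmse(actual, pred_a), rmse(actual, pred_b))
```

Because errors are squared before averaging, RMSE is in the same units as the target and penalises large misses more heavily than small ones.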

**b. Which model is more suitable for the data? (account for not just best performance metrics, but handling of fluctuations of data as well)**

Even though SARIMA seems like the better option because it accounts for the seasonality and trend components during prediction, the RMSE obtained by the Multiple Linear Regression model was much lower than that of the ARIMA model.
We obtained p=1 and q=1 from the PACF and ACF plots respectively and used them as the base values for our model. Because auto-ARIMA gave more accurate results, we used the model it selected, ARIMA(2,1,1). This model does not account for seasonal changes, which are a very important factor. Seasonal decomposition of the dataset showed that one seasonal cycle lasts 43-45 days (about one and a half months). Even when given this period, auto-ARIMA did not account for the seasonality of the data in its predictions. When we tried changing the P, D, Q and m values (m=45 or m=365), the system ran out of memory.

**c. Reasoning for what model is chosen to predict test (hidden) data**

According to our analysis, MLR gave better predictions, and thus a lower RMSE, because it captured the fluctuations due to the seasonal component well. The ARIMA model did not capture the seasonal component, which gave it a higher RMSE than the Multiple Linear Regression model. Since Open, High, Low and Close are highly correlated, a linear relationship exists between them. Hence, Multiple Linear Regression came out as the best model. Following Occam's razor, we decided on MLR.

# EDA

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
df=pd.read_csv('/kaggle/input/pes-ec-dataanalytics-assign/train.csv')
df.set_index('Date', inplace=True)

In [None]:
x_t=pd.read_csv('/kaggle/input/pes-ec-dataanalytics-assign/test.csv')
x_t.set_index('Date', inplace=True)
x_t=x_t.iloc[:,0:-1]

In [None]:
df.head()

In [None]:
df.isna().sum()
#There are no null values

In [None]:
df.duplicated().sum()
#no duplicate values

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
sns.set(rc={'figure.figsize':(11, 4)})
df['Open'].plot(linewidth=0.5);
# there seems to be an upward trend and a multiplicative seasonality component

In [None]:
sns.set(rc={'figure.figsize':(11, 4)})
df['High'].plot(linewidth=0.5);
# there seems to be an upward trend and a multiplicative seasonality component

In [None]:
sns.set(rc={'figure.figsize':(11, 4)})
df['Low'].plot(linewidth=0.5);
# there seems to be an upward trend and a multiplicative seasonality component

In [None]:
sns.set(rc={'figure.figsize':(25, 8)})
df['Volume'].plot(linewidth=0.5);
#since there seems to be a spike at similar intervals, we can say there is a seasonality component 

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
X_variables = df[['Open','High','Low','Volume']]
vif_data = pd.DataFrame()
vif_data["feature"] = X_variables.columns
vif_data["VIF"] = [variance_inflation_factor(X_variables.values, i) for i in range(len(X_variables.columns))]
vif_data
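
As a cross-check on the VIF values, the sketch below computes VIF directly from its definition, 1 / (1 - R^2), where R^2 comes from regressing each feature on the others with an intercept. It uses synthetic columns, and the function name `vif_with_intercept` is a hypothetical helper. Note that `variance_inflation_factor` works on the design matrix exactly as passed, so results may differ when that matrix has no constant column.

```python
import numpy as np

def vif_with_intercept(X):
    """VIF per column: regress each column on the others plus a constant."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for i in range(k):
        target = X[:, i]
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic example: columns a and b are nearly identical, c is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.05 * rng.normal(size=500)
c = rng.normal(size=500)
vifs = vif_with_intercept(np.column_stack([a, b, c]))
print(vifs)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of strong multicollinearity, which is what one would expect for columns that move together as closely as Open, High and Low.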

In [None]:
import matplotlib.pyplot as plt
from mlxtend.plotting import heatmap
cols=['Open','High','Low','Close','Volume']
cm=np.corrcoef(df[cols].values.T)
hm=heatmap(cm,row_names=cols,column_names=cols)
plt.show()

In [None]:
df.corr()

Since Open, High and Low are highly correlated with Close, we can use linear regression to predict the Close value.

In [None]:
df.boxplot(figsize=(10,10))

In [None]:
# Count outliers per column using the 1.5 * IQR rule
no_of_outliers=dict()
for i in ['Open','High','Low','Volume']:
    Q1=df[i].quantile(q=0.25,interpolation='midpoint')
    Q3=df[i].quantile(q=0.75,interpolation='midpoint')
    IQR=Q3-Q1
    # Start every column at 0 so that the first outlier found is counted too
    no_of_outliers[i]=0
    for j in df[i]:
        if j>Q3+1.5*IQR or j<Q1-1.5*IQR:
            no_of_outliers[i]+=1
print(no_of_outliers)
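
The 1.5 * IQR rule can also be applied with a boolean mask instead of an inner Python loop. The sketch below shows one way to do that; `count_iqr_outliers` and the toy frame are hypothetical and only illustrate the technique.

```python
import pandas as pd

def count_iqr_outliers(frame, columns):
    """Number of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each column."""
    counts = {}
    for col in columns:
        q1 = frame[col].quantile(0.25, interpolation="midpoint")
        q3 = frame[col].quantile(0.75, interpolation="midpoint")
        iqr = q3 - q1
        mask = (frame[col] < q1 - 1.5 * iqr) | (frame[col] > q3 + 1.5 * iqr)
        counts[col] = int(mask.sum())
    return counts

# Toy data: only the value 100 in "Open" lies outside the IQR fences.
toy = pd.DataFrame({
    "Open": [10, 11, 12, 13, 14, 15, 16, 17, 100],
    "Volume": [5, 5, 6, 6, 7, 7, 8, 8, 9],
})
result = count_iqr_outliers(toy, ["Open", "Volume"])
print(result)
```

Summing the mask counts every flagged value, and columns without outliers are reported with a count of 0 rather than being left out of the dictionary.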

# REGRESSION MODEL

We tried:
* Lasso
* Ridge
* Multiple Linear Regression

In [None]:
from sklearn import metrics

In [None]:
from sklearn.model_selection import train_test_split
# X-> Contains the features
X = df.iloc[:, 0:-1]
# y-> Contains all the targets
y = df.iloc[:, -1]

# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

In [None]:
# LASSO REGRESSION
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
lasso=Lasso()
parameters={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,100]}
lasso_regressor=GridSearchCV(lasso,parameters,scoring='neg_mean_squared_error',cv=5)

lasso_regressor.fit(X_train,y_train)
print(lasso_regressor.best_params_)
print(lasso_regressor.best_score_)
prediction_lasso=lasso_regressor.predict(X_test)
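# RMSE on the held-out split (np.sqrt avoids the `squared` argument, which is deprecated in newer scikit-learn releases)
print(np.sqrt(metrics.mean_squared_error(y_test, prediction_lasso)))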


In [None]:
#RIDGE REGRESSION
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge=Ridge()
parameters={'alpha':[1e-15,1e-10,1e-8,1e-3,1e-2,1,5,10,20,30,35,40,45,50,55,100]}
ridge_regressor=GridSearchCV(ridge,parameters,scoring='neg_mean_squared_error',cv=5)
ridge_regressor.fit(X_train,y_train)
#print(ridge_regressor.best_params_)
#print(ridge_regressor.best_score_)
prediction_ridge=ridge_regressor.predict(X_test)
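# RMSE on the held-out split (np.sqrt avoids the `squared` argument, which is deprecated in newer scikit-learn releases)
print(np.sqrt(metrics.mean_squared_error(y_test, prediction_ridge)))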
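
Because the rows are ordered by date, one option for the grid searches above is a time-ordered cross-validation scheme, so that each validation fold lies strictly after its training rows. The sketch below only demonstrates how `TimeSeriesSplit` partitions row indices; the row count and the number of splits are arbitrary assumptions, and this is not what the notebook's `cv=5` setting does.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 20 dummy rows standing in for date-ordered observations.
n_rows = 20
X_demo = np.arange(n_rows).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = [(train_idx, val_idx) for train_idx, val_idx in tscv.split(X_demo)]

for train_idx, val_idx in folds:
    # Training rows always precede the validation rows.
    print(f"train rows 0..{train_idx[-1]}, validate rows {val_idx[0]}..{val_idx[-1]}")
```

A `TimeSeriesSplit` instance can be passed as the `cv` argument of `GridSearchCV` in place of an integer.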


In [None]:
#MULTIPLE LINEAR REGRESSION

In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import sklearn.linear_model as lm
import sklearn.pipeline as pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


# Read the data, then parse the 'dd-mm-YYYY' dates on the index
# (pd.datetime, used previously, is no longer available in recent pandas)
train = pd.read_csv('/kaggle/input/pes-ec-dataanalytics-assign/train.csv',sep=',', index_col='Date').fillna(0)
train.index = pd.to_datetime(train.index, format='%d-%m-%Y')
df_test = pd.read_csv('/kaggle/input/pes-ec-dataanalytics-assign/test.csv',sep=',', index_col='Date').fillna(0)
df_test.index = pd.to_datetime(df_test.index, format='%d-%m-%Y')
df=train

In [None]:
df

In [None]:
x=df.drop('Close', axis=1)
y=df['Close']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [None]:
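# StandardScaler already scales the features; the `normalize` argument was removed from LinearRegression in newer scikit-learn
pipe=pipeline.Pipeline(steps=[('scaler', StandardScaler()), ('model', lm.LinearRegression())])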

In [None]:
pipe.fit(x_train, y_train)

In [None]:
x_testing = df_test.drop('Close', axis=1)
lr_p_testing = pipe.predict(x_testing)

In [None]:
lr_p_testing

In [None]:
ndf=pd.DataFrame()
ndf['Date']=df_test.index
ndf['Close']=lr_p_testing
ndf.set_index('Date',inplace=True)
#ndf.to_csv('LinearRegression.csv')
ndf.to_csv('8790.csv')

# TIME SERIES

In [None]:
%pip install pmdarima

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings('ignore')
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13, so this notebook needs an older release
from statsmodels.tsa.arima_model import ARIMA
from pmdarima.arima import auto_arima
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math

train=pd.read_csv('/kaggle/input/pes-ec-dataanalytics-assign/train.csv')
test=pd.read_csv('/kaggle/input/pes-ec-dataanalytics-assign/test.csv')
train
df=test

In [None]:
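# pd.datetime is no longer available in recent pandas; use the standard-library datetime instead
from datetime import datetime
dateparse = lambda dates: datetime.strptime(dates, '%d-%m-%Y')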

In [None]:
t1= pd.read_csv('/kaggle/input/pes-ec-dataanalytics-assign/train.csv',sep=',', index_col='Date', parse_dates=['Date'], date_parser=dateparse).fillna(0)

In [None]:
t1

In [None]:
#plot close price
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Date')
plt.ylabel('Close Prices')
plt.plot(t1['Close'])
plt.title('Closing price')
plt.show()

In [None]:
#Distribution of the dataset
df_close = t1['Close']
df_close.plot(kind='kde')

In [None]:
#Test for stationarity
def test_stationarity(timeseries):
    #Determine rolling statistics
    rolmean = timeseries.rolling(12).mean()
    rolstd = timeseries.rolling(12).std()
    #Plot rolling statistics:
    plt.plot(timeseries, color='blue',label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Standard Deviation')
    plt.show(block=False)
    print("Results of dickey fuller test")
    adft = adfuller(timeseries,autolag='AIC')
    output = pd.Series(adft[0:4],index=['Test Statistics','p-value','No. of lags used','Number of observations used'])
    for key,values in adft[4].items():
        output['critical value (%s)'%key] =  values
    print(output)
test_stationarity(df_close)

In [None]:
#To separate the trend and the seasonality from a time series, we can decompose the series
# 'freq' was renamed to 'period' in statsmodels 0.11
result = seasonal_decompose(df_close, model='multiplicative', period=30)
fig = result.plot()
fig.set_size_inches(16, 9)

In [None]:
from matplotlib.pyplot import figure


figure(figsize=(20, 6), dpi=80)
#plt.grid()
#plt.xticks(np.arange(0, 201, 5))
plt.plot(result.seasonal[:200])

In [None]:
#it is not stationary so eliminate trend
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
df_log = np.log(df_close)
moving_avg = df_log.rolling(12).mean()
std_dev = df_log.rolling(12).std()
plt.title('Moving Average')
plt.plot(std_dev, color ="black", label = "Standard Deviation")
plt.plot(moving_avg, color="red", label = "Mean")
# Add the legend only after the labelled lines have been plotted
plt.legend(loc='best')
plt.show()

In [None]:
#split data into train and test sets (chronological 90/10 split)
train_data, test_data = df_log[3:int(len(df_log)*0.9)], df_log[int(len(df_log)*0.9):]
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Dates')
plt.ylabel('Closing Prices')
plt.plot(train_data, 'green', label='Train data')
plt.plot(test_data, 'blue', label='Test data')
plt.legend()

In [None]:
from matplotlib import pyplot
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
plot_acf(train_data)
pyplot.show()
#q=1

In [None]:
plot_pacf(train_data)
pyplot.show()
#p=1

In [None]:
model_autoARIMA = auto_arima(train_data, start_p=0, start_q=0, test='adf',max_p=3, max_q=3, m=1, d=None, seasonal=False, start_P=0,D=0,trace=True,error_action='ignore', suppress_warnings=True, stepwise=True)
print(model_autoARIMA.summary())
model_autoARIMA.plot_diagnostics(figsize=(15,8))
plt.show()

In [None]:
# Build Model
model = ARIMA(train_data, order=(2,1,1))  
fitted = model.fit(disp=-1)  
print(fitted.summary())

In [None]:
# Forecast
# Forecast one step per row of the held-out test split
fc, se, conf = fitted.forecast(len(test_data), alpha=0.05)  # 95% conf

In [None]:
# Make as pandas series
fc_series = pd.Series(fc, index=test_data.index)
lower_series = pd.Series(conf[:, 0], index=test_data.index)
upper_series = pd.Series(conf[:, 1], index=test_data.index)
# Plot
plt.figure(figsize=(10,5), dpi=100)
plt.plot(train_data, label='training data')
plt.plot(test_data, color = 'blue', label='Actual Stock Price')
plt.plot(fc_series, color = 'orange',label='Predicted Stock Price')
plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.10)
plt.title('Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Stock Price')
plt.legend(loc='upper left', fontsize=8)
plt.show()

In [None]:
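# Note: test_data and fc are both on the log scale (they come from df_log), so this RMSE is in log-price units
rmse = math.sqrt(mean_squared_error(test_data, fc))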
print('RMSE: '+str(rmse))
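
Since the model in this part of the notebook is fitted on log prices, an RMSE computed on its forecasts is not directly comparable to one measured in price units. A sketch of the conversion, using made-up numbers rather than the notebook's data: apply `np.exp` to both series before computing the error.

```python
import numpy as np

# Hypothetical actual prices and a flat forecast, in price units.
actual_price = np.array([100.0, 110.0, 120.0])
forecast_price = np.array([105.0, 105.0, 105.0])

# What a model trained on log prices would see and produce.
log_actual = np.log(actual_price)
log_forecast = np.log(forecast_price)

# RMSE on the log scale: small, and not in price units.
rmse_log = float(np.sqrt(np.mean((log_actual - log_forecast) ** 2)))

# RMSE after mapping both series back to prices with exp.
rmse_price = float(np.sqrt(np.mean((np.exp(log_actual) - np.exp(log_forecast)) ** 2)))

print(rmse_log, rmse_price)
```

The two numbers differ by orders of magnitude for the same forecast, so the scale should be stated whenever RMSE values from different models are compared.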

In [None]:
test_data

FOR FULL TRAINING DATA

In [None]:
# Build Model
model = ARIMA(y, order=(2,1,1))  
fitted = model.fit(disp=-1)  
print(fitted.summary())

In [None]:
# Forecast
# Forecast one step per row of the hidden test set
fc, se, conf = fitted.forecast(len(x_t), alpha=0.05)  # 95% conf

In [None]:
# Make as pandas series
fc_series = pd.Series(fc, index=x_t.index)
lower_series = pd.Series(conf[:, 0], index=x_t.index)
upper_series = pd.Series(conf[:, 1], index=x_t.index)
# Plot
plt.figure(figsize=(10,5), dpi=100)
plt.plot(y, label='training data')

plt.plot(fc_series, color = 'orange',label='Predicted Stock Price')
plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.10)
plt.title('Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Stock Price')
plt.legend(loc='upper left', fontsize=8)
plt.show()

In [None]:
# When trained on the entire training data and submitted, this model's RMSE on the test data was 32.74