# **Step 1: Business Understanding**

# **Problem Statement: Prediction of Yahoo Stock Market**

# **Data Set: Yahoo Stock Price**

The question we want to answer is: what will the value of Yahoo's stock be in the near future? At this stage, some additional questions should also be reviewed in order to obtain a better and more useful data set. For example: Which factors that influence asset prices might be overlooked? Does the price depend on the depreciation of competing companies? Do foreign policies affect asset value? Does inflation contribute to the rise in asset prices?


# **Step 2: Data Exploration**

**Import the following libraries**

In [None]:
import pandas as pd 
from datetime import datetime
import numpy as np 
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

**Read the dataset**

In [None]:
SPY_data = pd.read_csv("../input/data-science-project-lifecycle/SPY_2015.csv")
 
# Change the Date column from object to datetime object 
SPY_data["Date"] = pd.to_datetime(SPY_data["Date"])
 
# Preview the data
SPY_data.head(10)

# **Step 3: Data Cleansing and Transformation**

**Indexing and Sorting**

In [None]:
# Set Date as index
SPY_data.set_index('Date',inplace=True)
 
# Sort the dataframe by its Date index so that the oldest values are at the top.
# The result is assigned back, because sorting returns a new dataframe.
SPY_data = SPY_data.sort_index(ascending=True)

# Check Null Values

In [None]:
# Use the column names of SPY_data to check whether any null values exist
variables = SPY_data.columns 
SPY_data.isnull().sum().loc[variables]

# **Step 4: Exploratory Data Analysis**

In [None]:
jet= plt.get_cmap('jet')
colors = iter(jet(np.linspace(0,1,10)))
 
def correlation(df,variables, n_rows, n_cols):
    fig = plt.figure(figsize=(8,6))
    #fig = plt.figure(figsize=(14,9))
    for i, var in enumerate(variables):
        ax = fig.add_subplot(n_rows,n_cols,i+1)
        asset = df.loc[:,var]
        # Pass a single RGBA colour via `color`; `c` is meant for per-point values
        ax.scatter(df["Adj Close"], asset, color=next(colors))
        ax.set_xlabel("Adj Close")
        ax.set_ylabel("{}".format(var))
        ax.set_title(var +" vs price")
    fig.tight_layout() 
    plt.show()

# **Correlation**

# Is there any correlation between Volume and Adj Close price?

In [None]:
# Is there any correlation between Volume and Adj Close price?
variables =SPY_data.columns[-1:] # read last column name
correlation(SPY_data,variables,1,1)

# **Is there any correlation between Adj Close price vs. Open, High, Low, Close ?**

In [None]:
# Is there any correlation between Adj Close price vs. Open, High, Low, Close?
variables =SPY_data.columns#[0:6]   
correlation(SPY_data,variables,3,3)

In [None]:
SPY_data.corr()['Adj Close'].loc[variables]
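
As a minimal sketch of what `DataFrame.corr()` returns, the example below computes the default Pearson correlation on a small made-up frame and checks one value against the textbook formula. The `toy` frame, its column names, and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "price": [1.0, 2.0, 3.0, 4.0],
    "up": [2.0, 4.0, 6.0, 8.0],    # rises exactly in line with price
    "down": [4.0, 3.0, 2.0, 1.0],  # falls exactly as price rises
})

# Pearson correlation of every column with "price" (the pandas default method)
corr_with_price = toy.corr()["price"]
print(corr_with_price)

# The same coefficient for "up", computed from the definition
x = toy["price"] - toy["price"].mean()
y = toy["up"] - toy["up"].mean()
manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())
print(manual)
```

A value close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.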

# **Step 5: Feature Engineering**

In [None]:
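# Day-over-day percentage change of the daily High-Low range
SPY_data['High-Low_pct'] = (SPY_data['High'] - SPY_data['Low']).pct_change()
# 5-day exponentially weighted mean of Close, lagged by one day
SPY_data['ewm_5'] = SPY_data["Close"].ewm(span=5).mean().shift(periods=1)
# Rolling std of Close, lagged by one day (note: the window is 30 days despite the "_5" suffix)
SPY_data['price_std_5'] = SPY_data["Close"].rolling(center=False,window= 30).std().shift(periods=1)
 
# Day-over-day percentage change of Volume
SPY_data['volume Change'] = SPY_data['Volume'].pct_change()
# 5-day rolling mean of Volume, lagged by one day
SPY_data['volume_avg_5'] = SPY_data["Volume"].rolling(center=False,window=5).mean().shift(periods=1)
# 5-day rolling std of Volume, lagged by one day (stored under the name 'volume Close')
SPY_data['volume Close'] = SPY_data["Volume"].rolling(center=False,window=5).std().shift(periods=1)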
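
To make the effect of `.shift(periods=1)` concrete, here is a minimal sketch on a made-up five-value price series. The `toy` frame, its values, and the 2-row window are assumptions chosen for illustration. The sketch shows that the shifted rolling mean in a given row only uses values from earlier rows.

```python
import pandas as pd

# Hypothetical toy prices, invented for illustration only
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0]})

# Rolling 2-row mean that includes the current row
toy["mean_2"] = toy["Close"].rolling(window=2).mean()

# Same statistic shifted down by one row:
# row i now holds the mean of rows i-2 and i-1 only
toy["mean_2_lagged"] = toy["Close"].rolling(window=2).mean().shift(periods=1)

print(toy)
```

The first rows of the lagged column are `NaN` because there is not yet enough history, which is why missing values appear after this kind of feature is created.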

# Correlation with New features

In [None]:
jet= plt.get_cmap('jet')
colors = iter(jet(np.linspace(0,1,10)))

# Take the name of the last 6 columns of the SPY_data which are the model features
variables = SPY_data.columns[-6:]  
 
correlation(SPY_data,variables,3,3)

In [None]:
SPY_data.corr()['Adj Close'].loc[variables]

# **Step 6: Build Predictive Model**

***Check Null values***

In [None]:
SPY_data.head(5)

In [None]:
SPY_data.isnull().sum().loc[variables]

# **Drop/Remove NA records**

In [None]:
# To train a model, it is necessary to drop missing values.
SPY_data = SPY_data.dropna(axis=0)

# **Train & Test Dataset Distribution**

In [None]:
# Generate the train and test sets
train = SPY_data[SPY_data.index < datetime(year=2015, month=1, day=1)]

test = SPY_data[SPY_data.index >= datetime(year=2015, month=1, day=1)]
dates = test.index
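
The date-based split can be sketched on a small made-up frame. The `toy` frame, its six dates, and its values are hypothetical; only the cut-off of 1 January 2015 mirrors the cell above. The point of the sketch is that every training row is older than every test row, so the model is never trained on data from the period it is evaluated on.

```python
import pandas as pd

# Hypothetical daily frame that straddles the split date; values are invented
idx = pd.date_range("2014-12-29", periods=6, freq="D")
toy = pd.DataFrame({"Adj Close": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]}, index=idx)

split_date = pd.Timestamp("2015-01-01")
toy_train = toy[toy.index < split_date]   # rows strictly before the cut-off
toy_test = toy[toy.index >= split_date]   # rows on or after the cut-off

print(len(toy_train), len(toy_test))
print(toy_train.index.max() < toy_test.index.min())
```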

# **Building Regression Model**

In [None]:
lr = LinearRegression()
X_train = train[["High-Low_pct","ewm_5","price_std_5","volume_avg_5","volume Change","volume Close"]]
 
Y_train = train["Adj Close"]
 
lr.fit(X_train,Y_train) 
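
As a rough illustration of what `LinearRegression.fit` learns, the sketch below fits a model to made-up points that lie exactly on the line y = 2x + 1 and then inspects the fitted slope and intercept. The names `X_toy`, `y_toy`, and `toy_lr` are hypothetical and are not part of the notebook's model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical single-feature data lying exactly on y = 2x + 1
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = 2.0 * X_toy[:, 0] + 1.0

toy_lr = LinearRegression()
toy_lr.fit(X_toy, y_toy)

# The fitted slope and intercept should recover the line up to rounding
print(toy_lr.coef_, toy_lr.intercept_)

# Predict for an unseen input
toy_pred = toy_lr.predict(np.array([[4.0]]))
print(toy_pred)
```

With several features, as in the cell above, `coef_` holds one weight per feature column in the order the columns were passed.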

# **Test Dataset**

In [None]:
# Create the test features dataset (X_test) which will be used to make the predictions.
X_test = test[["High-Low_pct","ewm_5","price_std_5","volume_avg_5","volume Change","volume Close"]].values 

# The labels of the model
Y_test = test["Adj Close"].values # will be used for comparison

# **Prediction**

In [None]:
close_predictions = lr.predict(X_test) 

# **Model Evaluation**

**Mean Absolute Error (MAE):**

In statistics, mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon. Examples of Y versus X include comparisons of predicted versus observed, subsequent time versus initial time, and one technique of measurement versus an alternative technique of measurement. MAE is calculated as:
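
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.
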
The mean absolute error is a common measure of forecast error in time series analysis.
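
A small worked example may help: MAE is the average of the absolute differences between paired actual and predicted values. The arrays below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted values, invented for illustration
actual = np.array([100.0, 102.0, 105.0, 103.0])
predicted = np.array([98.0, 103.0, 109.0, 103.0])

# Absolute errors are 2, 1, 4 and 0, so MAE = 7 / 4
mae_toy = np.mean(np.abs(actual - predicted))
print(mae_toy)  # 1.75
```

Because the errors are not squared, MAE is expressed in the same units as the target, here the price.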

In [None]:
mae = sum(abs(close_predictions - test["Adj Close"].values)) / test.shape[0]
print(mae)

The MAE value is approx. 18.

# **Error Graph for last 25 days**

The simple error (Actual - Predicted) is computed and plotted for the last 25 days.

In [None]:
# Create a dataframe that holds the Date, the Actual and the Predicted values
df = pd.DataFrame({'Date':dates,'Actual': Y_test, 'Predicted': close_predictions})
# Copy the last 25 rows so that the column assignment below does not write to a slice of df
df1 = df.tail(25).copy()
 
# set the date with string format for plotting
df1['Date'] = df1['Date'].dt.strftime('%Y-%m-%d')
 
df1.set_index('Date',inplace=True)
 
error = df1['Actual'] - df1['Predicted']
 
# Plot the error term between the actual and predicted values for the last 25 days
 
error.plot(kind='bar',figsize=(8,6))
plt.grid(which='major', linestyle='-', linewidth=0.5, color='green')
plt.grid(which='minor', linestyle=':', linewidth=0.5, color='black')
plt.xticks(rotation=45)
plt.show()