# Checkpoint Five: Modeling Data

With your visualizations ready to go, the final step in your project is to do some predictive analysis on your dataset. You will be using linear regression for your model. You will not be penalized if your linear regression model does not work out. You just need to build the model and make notes as to the results.

Link to my dataset:

## Prepare Your Dataframe

Import any libraries you need and create a dataframe.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

In [2]:
stock_df = pd.read_csv("aapl_c.csv")

In [3]:
stock_df.shape

(1258, 6)

In [4]:
stock_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1258 entries, 0 to 1257
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    1258 non-null   object 
 1   Open    1258 non-null   float64
 2   High    1258 non-null   float64
 3   Low     1258 non-null   float64
 4   Close   1258 non-null   float64
 5   Volume  1258 non-null   int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 59.1+ KB


In [5]:
stock_df.describe()

| | Open | High | Low | Close | Volume |
|---|---|---|---|---|---|
| count | 1258.0 | 1258.0 | 1258.0 | 1258.0 | 1258.0 |
| mean | 128.601988 | 129.715323 | 127.475242 | 128.625531 | 41300080.0 |
| std | 38.228897 | 38.589302 | 37.858325 | 38.223879 | 21217890.0 |
| min | 67.917 | 68.4357 | 67.1539 | 67.9446 | 11475920.0 |
| 25% | 100.6062 | 101.521025 | 99.594625 | 100.572975 | 26281930.0 |
| 50% | 115.9025 | 116.47725 | 114.74875 | 115.85205 | 35903160.0 |
| 75% | 156.81575 | 158.307575 | 155.1907 | 156.847525 | 50315110.0 |
| max | 228.9953 | 231.6645 | 228.0031 | 230.2754 | 189978100.0 |


## Find Correlations

Use either pairplot or a heatmap or both to find the two variables with the strongest correlation in your dataset.

In [None]:
# Compare every numeric column against every other one; 'Date' is a string column, so it is left out
sns.pairplot(stock_df, vars=['Open','High','Low','Close','Volume'], height=2, kind='scatter')
plt.show()


In [None]:
# numeric_only=True skips the string 'Date' column, which cannot be correlated
sns.heatmap(stock_df.corr(numeric_only=True), cmap="YlGnBu", annot = True)
plt.show()
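
To read the strongest pair off the correlation matrix rather than by eye, something like the following could work. It is a minimal sketch on a small made-up frame: `demo_df` and its values are illustrative, not the AAPL data, and `strongest_pair` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def strongest_pair(df):
    """Return (col_a, col_b, r) for the column pair with the largest |r|."""
    corr = df.corr(numeric_only=True)
    # Keep only the strict upper triangle so self-correlations (1.0)
    # and mirrored duplicates are ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    a, b = pairs.abs().idxmax()
    return a, b, pairs[(a, b)]

# Illustrative data only (hypothetical values, not from aapl_c.csv)
demo_df = pd.DataFrame({
    "Open":   [10.0, 11.0, 12.0, 13.0, 14.0],
    "Close":  [10.2, 11.1, 12.3, 13.0, 14.4],
    "Volume": [500, 300, 450, 200, 350],
})
a, b, r = strongest_pair(demo_df)
print(a, b, round(r, 3))
```

In the notebook, the same helper could be called as `strongest_pair(stock_df)`.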

## Create Your Model

Use the two columns with the strongest correlation to create and train your model. Make sure to print out the summary and plot the column values and the line produced by the model.

In [None]:
# Step 1 is to assign your x and y
X = stock_df['Open']
y = stock_df['Volume']

In [None]:
# Step 2 is to split the variables into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.7, 
                                                    test_size = 0.3, random_state = 100)

In [None]:
X_train

In [None]:
y_train

In [None]:
# Step 3 is to build the model using statsmodels.api (imported above as sm)

# Adding a constant to get an intercept
X_train_sm = sm.add_constant(X_train)

In [None]:
# Fitting the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()
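
For intuition about what `sm.add_constant` and `sm.OLS(...).fit()` do in Step 3, here is a rough NumPy sketch. It is not the statsmodels implementation: it simply prepends a column of ones so the fit includes an intercept, then solves the least-squares problem. The arrays `x_demo` and `y_demo` are made-up values.

```python
import numpy as np

x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = 3.0 - 2.0 * x_demo   # points on an exact line: intercept 3, slope -2

# The column of ones plays the role of the 'const' column
X_design = np.column_stack([np.ones_like(x_demo), x_demo])

# Least-squares solution for [intercept, slope]
coef, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
intercept, slope = coef
print(intercept, slope)
```

Without the column of ones, the fitted line would be forced through the origin.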

In [None]:
# Printing the parameters
lr.params


In [None]:
lr.summary()

# Step 4: perform residual analysis


1. The coefficient for Open price is -2.199e+05, and its corresponding p-value is very low, almost 0. A p-value this low means the coefficient is statistically significant. The negative sign means that Volume tends to decrease as the Open price increases.

2. R-squared value is 0.154, which means that 15.4% of the Volume variance can be explained by the Open Price column using this line.

3. Prob (F-statistic) is very low, practically zero, which indicates that the model fit as a whole is statistically significant.
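
The R-squared in point 2 is the share of the variance in y that the fitted line accounts for: one minus the residual sum of squares divided by the total sum of squares. A small sketch of that arithmetic, using hypothetical values (`y_actual`, `y_fitted`) rather than the notebook's data:

```python
import numpy as np

def r_squared(y_actual, y_fitted):
    """R^2 = 1 - SS_res / SS_tot."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_fitted = np.asarray(y_fitted, dtype=float)
    ss_res = np.sum((y_actual - y_fitted) ** 2)         # unexplained variation
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

# Made-up numbers: ss_res = 2, ss_tot = 20
y_actual = [2.0, 4.0, 6.0, 8.0]
y_fitted = [3.0, 4.0, 6.0, 7.0]
print(r_squared(y_actual, y_fitted))
```

With these numbers the result is 1 - 2/20 = 0.9. A value of 0.154 means the residual variation is still about 85% of the total.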


In [None]:
# Visualizing the regression line
plt.scatter(X_train, y_train)
# Draw the line from the fitted parameters (about 6.908e+07 + -2.199e+05 * Open)
plt.plot(X_train, lr.params['const'] + lr.params['Open']*X_train, 'r')
plt.show()
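
The test split from Step 2 is not used in the cells above. One way to use it is to compare R-squared on the training rows with R-squared on the held-out rows. The sketch below uses simulated data and swaps in `np.polyfit` and a manual shuffle as stand-ins for `sm.OLS` and `train_test_split`; every value in it is made up. In the notebook, the corresponding prediction would be `lr.predict(sm.add_constant(X_test))`.

```python
import numpy as np

rng = np.random.default_rng(100)

# Simulated data: a noisy negative linear relationship (not the AAPL data)
x = rng.uniform(60, 230, size=200)
y = 7e7 - 2e5 * x + rng.normal(0, 2e7, size=200)

# 70/30 split by shuffled index (stand-in for train_test_split)
idx = rng.permutation(len(x))
train, test = idx[:140], idx[140:]

# Fit a straight line on the training rows only
slope, intercept = np.polyfit(x[train], y[train], deg=1)

def r_squared(y_actual, y_fitted):
    ss_res = np.sum((y_actual - y_fitted) ** 2)
    ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_train = r_squared(y[train], intercept + slope * x[train])
r2_test = r_squared(y[test], intercept + slope * x[test])
print(f"train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}")
```

A test R-squared far below the training value would suggest the line does not generalize beyond the rows it was fitted on.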

## Error Terms

Finally, plot your error terms!

In [None]:
#Error = Actual y value - y predicted value

# Predicting y values using the training data of X
y_train_pred = lr.predict(X_train_sm)

# Creating residuals from the y_train data and predicted y_data
res = (y_train - y_train_pred)

In [None]:
# Plotting the histogram using the residual values
fig = plt.figure()
sns.histplot(res, bins = 15, kde = True, stat = 'density')
plt.title('Error Terms', fontsize = 15)
plt.xlabel('y_train - y_train_pred', fontsize = 15)
plt.show()

In [None]:
# Looking for any patterns in the residuals
plt.scatter(X_train,res)
plt.show()
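
To back up the visual bell-curve check with numbers, the skewness and excess kurtosis of the residuals can be computed. Both are close to 0 for normally distributed errors. This is a sketch on simulated residuals: `res_demo` is made up and `skew_and_excess_kurtosis` is a hypothetical helper. In the notebook, `res` would be passed instead.

```python
import numpy as np

def skew_and_excess_kurtosis(values):
    """Sample skewness and excess kurtosis (both about 0 for a normal sample)."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()       # standardize using the population std
    skew = np.mean(z ** 3)
    excess_kurt = np.mean(z ** 4) - 3.0
    return skew, excess_kurt

rng = np.random.default_rng(0)
res_demo = rng.normal(0.0, 1.0, size=5000)   # simulated, roughly normal residuals
skew, excess_kurt = skew_and_excess_kurtosis(res_demo)
print(skew, excess_kurt)
```

The `lr.summary()` table also reports Skew and Kurtosis for the residuals. The Kurtosis shown there is not the excess value, so a normal distribution corresponds to about 3.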

## Summarize Your Work

Make notes of your answers to the questions below.

1. What was your R-squared value? 0.154
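2. Based on your results, do you think a linear regression model was the best type of predictive analysis for your dataset? No. With an R-squared of 0.154, the Open price explains only about 15% of the variance in Volume, so the model was not particularly effective in this case.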
3. Was your plot of the error terms a bell curve? Yes  

