# Predicting Stock Returns with Cluster-Then-Predict

<img src="images/stockreturns.jpg"/>

In the second lecture sequence this week, we heard about cluster-then-predict, a methodology in which you first cluster observations and then build cluster-specific prediction models. In the lecture sequence, we saw how this methodology helped improve the prediction of heart attack risk. In this assignment, we'll use cluster-then-predict to predict whether stocks will have positive future returns, using historical stock data.

When selecting which stocks to invest in, investors seek to obtain good future returns. In this problem, we will first use clustering to identify clusters of stocks that have similar returns over time. Then, we'll use logistic regression to predict whether or not the stocks will have positive future returns.

For this problem, we'll use StocksCluster.csv, which contains monthly stock returns from the NASDAQ stock exchange. The NASDAQ is the second-largest stock exchange in the world, and it lists many technology companies. The stock price data used in this problem was obtained from infochimps, a website providing access to many datasets.

Each observation in the dataset is the monthly returns of a particular company in a particular year. The years included are 2000-2009. The companies are limited to tickers that were listed on the exchange for the entire period 2000-2009, and whose stock price never fell below $1. So, for example, one observation is for Yahoo in 2000, and another observation is for Yahoo in 2001. Our goal will be to predict whether or not the stock return in December will be positive, using the stock returns for the first 11 months of the year.

### This dataset contains the following variables:

    ReturnJan = the return for the company's stock during January (in the year of the observation). 

    ReturnFeb = the return for the company's stock during February (in the year of the observation). 

    ReturnMar = the return for the company's stock during March (in the year of the observation). 

    ReturnApr = the return for the company's stock during April (in the year of the observation). 

    ReturnMay = the return for the company's stock during May (in the year of the observation). 

    ReturnJune = the return for the company's stock during June (in the year of the observation). 

    ReturnJuly = the return for the company's stock during July (in the year of the observation). 

    ReturnAug = the return for the company's stock during August (in the year of the observation). 

    ReturnSep = the return for the company's stock during September (in the year of the observation). 

    ReturnOct = the return for the company's stock during October (in the year of the observation). 

    ReturnNov = the return for the company's stock during November (in the year of the observation). 

    PositiveDec = whether or not the company's stock had a positive return in December (in the year of the observation). This variable takes value 1 if the return was positive, and value 0 if the return was not positive.

For the first 11 variables, the value stored is a proportional change in stock value during that month. For instance, a value of 0.05 means the stock increased in value 5% during the month, while a value of -0.02 means the stock decreased in value 2% during the month.
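
As a small illustration of this definition, the sketch below computes proportional monthly changes from a made-up pair of prices. The prices and the helper name `monthlyReturn` are assumptions for illustration only; they do not come from StocksCluster.csv.

```r
# Hypothetical start- and end-of-month prices (not taken from the dataset)
startPrice <- c(100, 50)
endPrice   <- c(105, 49)

# Proportional change over the month: (end - start) / start
monthlyReturn <- function(start, end) (end - start) / start

returns <- monthlyReturn(startPrice, endPrice)
returns  # first stock gained 5%, second stock lost 2%
```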

### Problem 1.1 - Exploring the Dataset
Load StocksCluster.csv into a data frame called "stocks". **How many observations are in the dataset?**

In [1]:
# Load the dataset

stocks = read.csv("data/StocksCluster.csv")

head(stocks)

,ReturnJan,ReturnFeb,ReturnMar,ReturnApr,ReturnMay,ReturnJune,ReturnJuly,ReturnAug,ReturnSep,ReturnOct,ReturnNov,PositiveDec
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<int>
1,0.08067797,0.06625,0.03294118,0.18309859,0.130333952,-0.01764234,-0.02051703,0.02467587,-0.02040816,-0.17331768,-0.02538531,0
2,-0.01067989,0.10211539,0.14549595,-0.08442804,-0.327300392,-0.35926605,-0.02532131,0.21129,-0.58000326,-0.26714125,-0.15123457,0
3,0.04774193,0.03598972,0.03970223,-0.16235294,-0.147426982,0.04858934,-0.13538462,0.03339192,0.0,0.0916955,-0.05956113,0
4,-0.07404022,-0.04816956,0.01821862,-0.02467917,-0.006036217,-0.02530364,-0.094,0.09529025,0.05668016,-0.09633911,-0.04051173,1
5,-0.03104575,-0.21267723,0.09147609,0.18933823,-0.153846154,-0.10611511,0.35530086,0.0568421,0.03360215,0.03626943,-0.08530511,1
6,0.57980016,0.33225225,-0.40546095,-0.06,0.060732113,-0.21536106,0.27444694,0.53834395,0.12706817,-0.17142857,-0.19537452,1


In [2]:
str(stocks)

'data.frame':	11580 obs. of  12 variables:
 $ ReturnJan  : num  0.0807 -0.0107 0.0477 -0.074 -0.031 ...
 $ ReturnFeb  : num  0.0663 0.1021 0.036 -0.0482 -0.2127 ...
 $ ReturnMar  : num  0.0329 0.1455 0.0397 0.0182 0.0915 ...
 $ ReturnApr  : num  0.1831 -0.0844 -0.1624 -0.0247 0.1893 ...
 $ ReturnMay  : num  0.13033 -0.3273 -0.14743 -0.00604 -0.15385 ...
 $ ReturnJune : num  -0.0176 -0.3593 0.0486 -0.0253 -0.1061 ...
 $ ReturnJuly : num  -0.0205 -0.0253 -0.1354 -0.094 0.3553 ...
 $ ReturnAug  : num  0.0247 0.2113 0.0334 0.0953 0.0568 ...
 $ ReturnSep  : num  -0.0204 -0.58 0 0.0567 0.0336 ...
 $ ReturnOct  : num  -0.1733 -0.2671 0.0917 -0.0963 0.0363 ...
 $ ReturnNov  : num  -0.0254 -0.1512 -0.0596 -0.0405 -0.0853 ...
 $ PositiveDec: int  0 0 0 1 1 1 1 0 0 0 ...


Answer: 11,580 observations.

### Problem 1.2 - Exploring the Dataset
**What proportion of the observations have positive returns in December?**

In [3]:
# Tabulate December outcomes (0 = non-positive return, 1 = positive return)
dec = table(stocks$PositiveDec)
dec


   0    1 
5256 6324 

In [4]:
round(dec[2]/(sum(dec)),4)

Answer: 54.61%
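
Because PositiveDec is coded 0/1, the same proportion is simply its mean. The sketch below rebuilds a 0/1 vector from the counts in the table above (rather than reading the CSV) and shows two equivalent calculations; the variable names are introduced here for illustration.

```r
# Rebuild a 0/1 outcome vector from the tabulated counts shown above
positiveDec <- rep(c(0, 1), times = c(5256, 6324))

# Option 1: the mean of a 0/1 variable is the proportion of 1s
propMean <- mean(positiveDec)

# Option 2: convert the frequency table to proportions
propTable <- prop.table(table(positiveDec))

round(propMean, 4)
round(propTable, 4)
```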

### Problem 1.3 - Exploring the Dataset
**What is the maximum correlation between any two return variables in the dataset?** You should look at the pairwise correlations between ReturnJan, ReturnFeb, ReturnMar, ReturnApr, ReturnMay, ReturnJune, ReturnJuly, ReturnAug, ReturnSep, ReturnOct, and ReturnNov.

In [5]:
# Obtain a correlation matrix of the data

cor = cor(stocks)
cor

,ReturnJan,ReturnFeb,ReturnMar,ReturnApr,ReturnMay,ReturnJune,ReturnJuly,ReturnAug,ReturnSep,ReturnOct,ReturnNov,PositiveDec
ReturnJan,1.0,0.06677458,-0.090496798,-0.037678006,-0.044411417,0.09223831,-0.081429765,-0.0227920187,-0.0264371526,0.14297723,0.06763233,0.004728518
ReturnFeb,0.066774583,1.0,-0.155983263,-0.191351924,-0.09552092,0.16999448,-0.0617785094,0.1315597863,0.0435017706,-0.08732427,-0.15465828,-0.038173184
ReturnMar,-0.090496798,-0.15598326,1.0,0.009726288,-0.003892789,-0.08590549,0.0033741597,-0.0220053995,0.0765183267,-0.01192376,0.03732353,0.022408661
ReturnApr,-0.037678006,-0.19135192,0.009726288,1.0,0.063822504,-0.01102775,0.0806319317,-0.051756051,-0.0289209718,0.04854003,0.03176184,0.094353528
ReturnMay,-0.044411417,-0.09552092,-0.003892789,0.063822504,1.0,-0.02107454,0.0908502642,-0.033125658,0.0219628623,0.01716673,0.04804659,0.058201934
ReturnJune,0.092238307,0.16999448,-0.085905486,-0.011027752,-0.021074539,1.0,-0.0291525996,0.010710526,0.0447472692,-0.02263599,-0.06527054,0.023409745
ReturnJuly,-0.081429765,-0.06177851,0.00337416,0.080631932,0.090850264,-0.0291526,1.0,0.0007137558,0.0689478037,-0.05470891,-0.04837384,0.07436421
ReturnAug,-0.022792019,0.13155979,-0.0220054,-0.051756051,-0.033125658,0.01071053,0.0007137558,1.0,0.0007407139,-0.07559456,-0.11648903,0.004166966
ReturnSep,-0.026437153,0.04350177,0.076518327,-0.028920972,0.021962862,0.04474727,0.0689478037,0.0007407139,1.0,-0.05807924,-0.0197198,0.041630286
ReturnOct,0.142977229,-0.08732427,-0.011923758,0.048540025,0.017166728,-0.02263599,-0.0547089088,-0.0755945614,-0.0580792362,1.0,0.19167279,-0.052574956


In [6]:
round(cor,2)

,ReturnJan,ReturnFeb,ReturnMar,ReturnApr,ReturnMay,ReturnJune,ReturnJuly,ReturnAug,ReturnSep,ReturnOct,ReturnNov,PositiveDec
ReturnJan,1.0,0.07,-0.09,-0.04,-0.04,0.09,-0.08,-0.02,-0.03,0.14,0.07,0.0
ReturnFeb,0.07,1.0,-0.16,-0.19,-0.1,0.17,-0.06,0.13,0.04,-0.09,-0.15,-0.04
ReturnMar,-0.09,-0.16,1.0,0.01,0.0,-0.09,0.0,-0.02,0.08,-0.01,0.04,0.02
ReturnApr,-0.04,-0.19,0.01,1.0,0.06,-0.01,0.08,-0.05,-0.03,0.05,0.03,0.09
ReturnMay,-0.04,-0.1,0.0,0.06,1.0,-0.02,0.09,-0.03,0.02,0.02,0.05,0.06
ReturnJune,0.09,0.17,-0.09,-0.01,-0.02,1.0,-0.03,0.01,0.04,-0.02,-0.07,0.02
ReturnJuly,-0.08,-0.06,0.0,0.08,0.09,-0.03,1.0,0.0,0.07,-0.05,-0.05,0.07
ReturnAug,-0.02,0.13,-0.02,-0.05,-0.03,0.01,0.0,1.0,0.0,-0.08,-0.12,0.0
ReturnSep,-0.03,0.04,0.08,-0.03,0.02,0.04,0.07,0.0,1.0,-0.06,-0.02,0.04
ReturnOct,0.14,-0.09,-0.01,0.05,0.02,-0.02,-0.05,-0.08,-0.06,1.0,0.19,-0.05


Answer: ReturnOct and ReturnNov, with a correlation of 0.1917.
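
Rather than scanning the matrix by eye, the largest off-diagonal correlation can be located programmatically. The sketch below uses a small simulated data frame as a stand-in (an assumption; the real analysis would pass the 11 return columns of `stocks`), with column `D` deliberately built to track column `A`.

```r
# Simulated stand-in for the return columns (hypothetical data)
set.seed(1)
returns <- data.frame(A = rnorm(50), B = rnorm(50), C = rnorm(50))
returns$D <- returns$A + rnorm(50, sd = 0.1)  # constructed to correlate with A

corMat <- cor(returns)
diag(corMat) <- NA                 # ignore each variable's correlation with itself
corMat[lower.tri(corMat)] <- NA    # keep each pair only once

# Row/column position of the largest remaining correlation
idx <- which(corMat == max(corMat, na.rm = TRUE), arr.ind = TRUE)
maxPair <- c(rownames(corMat)[idx[1, "row"]], colnames(corMat)[idx[1, "col"]])
maxPair
```

Masking the diagonal matters because every variable has correlation 1 with itself, which would otherwise always be reported as the maximum.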

### Problem 1.4 - Exploring the Dataset
**Which month (from January through November) has the largest mean return across all observations in the dataset?**

In [7]:
# Obtain a summary of the data
summary(stocks)

   ReturnJan            ReturnFeb           ReturnMar        
 Min.   :-0.7616205   Min.   :-0.690000   Min.   :-0.712994  
 1st Qu.:-0.0691663   1st Qu.:-0.077748   1st Qu.:-0.046389  
 Median : 0.0009965   Median :-0.010626   Median : 0.009878  
 Mean   : 0.0126316   Mean   :-0.007605   Mean   : 0.019402  
 3rd Qu.: 0.0732606   3rd Qu.: 0.043600   3rd Qu.: 0.077066  
 Max.   : 3.0683060   Max.   : 6.943694   Max.   : 4.008621  
   ReturnApr           ReturnMay          ReturnJune       
 Min.   :-0.826503   Min.   :-0.92207   Min.   :-0.717920  
 1st Qu.:-0.054468   1st Qu.:-0.04640   1st Qu.:-0.063966  
 Median : 0.009059   Median : 0.01293   Median :-0.000880  
 Mean   : 0.026308   Mean   : 0.02474   Mean   : 0.005938  
 3rd Qu.: 0.085338   3rd Qu.: 0.08396   3rd Qu.: 0.061566  
 Max.   : 2.528827   Max.   : 6.93013   Max.   : 4.339713  
   ReturnJuly           ReturnAug           ReturnSep        
 Min.   :-0.7613096   Min.   :-0.726800   Min.   :-0.839730  
 1st Qu.:-0.0731917   

In [8]:
jan = round(mean(stocks$ReturnJan),3)
feb = round(mean(stocks$ReturnFeb),3)
mar = round(mean(stocks$ReturnMar),3)
apr = round(mean(stocks$ReturnApr),3)
may = round(mean(stocks$ReturnMay),3)
jun = round(mean(stocks$ReturnJune),3)
jul = round(mean(stocks$ReturnJuly),3)
aug = round(mean(stocks$ReturnAug),3)
sep = round(mean(stocks$ReturnSep),3)
oct = round(mean(stocks$ReturnOct),3)
nov = round(mean(stocks$ReturnNov),3)

paste("jan: ",jan)
paste("feb: ",feb)
paste("mar: ",mar)
paste("apr: ",apr)
paste("may: ",may)
paste("jun: ",jun)
paste("jul: ",jul)
paste("aug: ",aug)
paste("sep: ",sep)
paste("oct: ",oct)
paste("nov: ",nov)

Answer: Largest mean was April with 0.026308

**Which month (from January through November) has the smallest mean return across all observations in the dataset?**

Answer: Smallest mean was September with -0.014721.
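
The eleven separate `mean` calls can be collapsed into a single `colMeans` call, with `which.max` and `which.min` picking out the extreme months. The sketch below runs on a small simulated data frame (hypothetical values, three columns only) so that it is self-contained; on the real data one would pass the return columns of `stocks` instead.

```r
# Simulated stand-in with clearly separated column means (hypothetical data)
set.seed(2)
toy <- data.frame(
  ReturnJan = rnorm(100, mean = 0.01, sd = 0.1),
  ReturnFeb = rnorm(100, mean = -0.5, sd = 0.1),
  ReturnMar = rnorm(100, mean = 0.5,  sd = 0.1)
)

monthMeans <- colMeans(toy)          # one mean per column
largest  <- names(which.max(monthMeans))
smallest <- names(which.min(monthMeans))

round(monthMeans, 3)
largest
smallest
```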

### Problem 2.1 - Initial Logistic Regression Model
Run the following commands to split the data into a training set and testing set, putting 70% of the data in the training set and 30% of the data in the testing set:

    set.seed(144)

    spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

    stocksTrain = subset(stocks, spl == TRUE)

    stocksTest = subset(stocks, spl == FALSE)

Then, use the stocksTrain data frame to train a logistic regression model (name it StocksModel) to predict PositiveDec using all the other variables as independent variables. Don't forget to add the argument family=binomial to your glm command.

**What is the overall accuracy on the training set, using a threshold of 0.5?**

In [9]:
library(caTools)

In [10]:
# Split the dataset -> Training and Testing 

set.seed(144)

spl = sample.split(stocks$PositiveDec, SplitRatio = 0.7)

stocksTrain = subset(stocks, spl == TRUE)

stocksTest = subset(stocks, spl == FALSE)
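
`sample.split` keeps the proportion of each outcome roughly the same in the training and testing sets. The following is a rough base-R sketch of that idea on a made-up outcome vector; it is not the caTools implementation, and the names `y` and `inTrain` are assumptions for illustration.

```r
# Hypothetical outcome vector: 40 zeros and 60 ones
set.seed(144)
y <- rep(c(0, 1), times = c(40, 60))

# Sample 70% of the indices within each outcome class separately
inTrain <- logical(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  inTrain[sample(idx, size = round(0.7 * length(idx)))] <- TRUE
}

sum(inTrain)        # number of training observations
mean(y[inTrain])    # share of 1s in the training set
mean(y[!inTrain])   # share of 1s in the testing set
```

Sampling within each class is what keeps the share of positive outcomes the same on both sides of the split.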

In [11]:
# Create Logistic Regression Model
StocksModel= glm(PositiveDec ~ ., data = stocksTrain, family=binomial)

# Predict on the training set
predictTrain = predict(StocksModel, type="response")

# Confusion matrix: actual outcomes (rows) vs. predictions at a 0.5 threshold (columns)
cmTR = table(stocksTrain$PositiveDec, predictTrain > 0.5)
cmTR

   
    FALSE TRUE
  0   990 2689
  1   787 3640

The rows are labeled with the actual outcome, and the columns are labeled with the predicted outcome.

                      Predict 0       Predict 1
        Actual 0    True Negative   False Positive
        Actual 1    False Negative  True Positive

        Linear (column-major) indices of the cells in cmTR:

        cmTR = [1][3]
               [2][4]

In [12]:
# Compute Logistic Regression Model Training Set Accuracy
accurLGTR = sum(diag(cmTR))/sum(cmTR)
paste("Accuracy Logistic Regression Training Set:", round(accurLGTR,digits=4))
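
As a worked check, the accuracy can be recomputed directly from the counts printed in the training-set confusion matrix above. The sketch rebuilds that matrix by hand; sensitivity and specificity are added here as an illustration and are not asked for in the problem.

```r
# Rebuild the training-set confusion matrix from the printed counts
# (matrix() fills column by column: FALSE column first, then TRUE column)
cmTR <- matrix(c(990, 787, 2689, 3640), nrow = 2,
               dimnames = list(actual = c("0", "1"),
                               predicted = c("FALSE", "TRUE")))

accuracy    <- sum(diag(cmTR)) / sum(cmTR)            # (TN + TP) / N
sensitivity <- cmTR["1", "TRUE"]  / sum(cmTR["1", ])  # TP / (TP + FN)
specificity <- cmTR["0", "FALSE"] / sum(cmTR["0", ])  # TN / (TN + FP)

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
```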

### Problem 2.2 - Initial Logistic Regression Model
Now obtain test set predictions from StocksModel. **What is the overall accuracy of the model on the test set, again using a threshold of 0.5?**

In [13]:
# Predict on the test set
predictTest = predict(StocksModel, type="response", newdata=stocksTest)

# Confusion matrix: actual outcomes (rows) vs. predictions at a 0.5 threshold (columns)
cmTE = table(stocksTest$PositiveDec, predictTest > 0.5)
cmTE

   
    FALSE TRUE
  0   417 1160
  1   344 1553

In [14]:
# Compute Logistic Regression Model Test Set Accuracy
accurLGTE = sum(diag(cmTE))/sum(cmTE)
paste("Accuracy Logistic Regression Test Set:", round(accurLGTE,digits=4))

### Problem 2.3 - Initial Logistic Regression Model
**What is the accuracy on the test set of a baseline model that always predicts the most common outcome (PositiveDec = 1)?**

In [15]:
# Tabulate the test set positive december returns
cmPD = table(stocksTest$PositiveDec)
cmPD


   0    1 
1577 1897 

In [16]:
# Compute Baseline Model Accuracy
accurBL = max(cmPD)/sum(cmPD)
paste("Accuracy Baseline:", round(accurBL,digits=4))
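
Strictly, a baseline should pick its "most common outcome" from the training set and then be scored on the test set. The sketch below does this using counts taken from the tables printed above (training counts are the row sums of the training confusion matrix); the variable names are introduced for illustration.

```r
# Outcome counts taken from the printed tables above
trainCounts <- c("0" = 3679, "1" = 4427)  # 990 + 2689 and 787 + 3640
testCounts  <- c("0" = 1577, "1" = 1897)

# Choose the majority class on the training set...
majorityClass <- names(which.max(trainCounts))

# ...and score "always predict that class" on the test set
baselineAcc <- testCounts[[majorityClass]] / sum(testCounts)

majorityClass
round(baselineAcc, 4)
```

Here both approaches agree because PositiveDec = 1 is the majority in both sets.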

### Problem 3.1 - Clustering Stocks
Now, let's cluster the stocks. The first step in this process is to remove the dependent variable using the following commands:

    limitedTrain = stocksTrain

    limitedTrain$PositiveDec = NULL

    limitedTest = stocksTest

    limitedTest$PositiveDec = NULL

**Why do we need to remove the dependent variable in the clustering phase of the cluster-then-predict methodology?**

In [17]:
#Remove Dependent Variable

limitedTrain = stocksTrain

limitedTrain$PositiveDec = NULL

limitedTest = stocksTest

limitedTest$PositiveDec = NULL

Answer: In cluster-then-predict, our final goal is to predict the dependent variable, which is unknown to us at the time of prediction. Therefore, if we need to know the outcome value to perform the clustering, the methodology is no longer useful for prediction of an unknown outcome value.

This is an important point that is sometimes mistakenly overlooked. If you use the outcome value to cluster, you might conclude your method strongly outperforms a non-clustering alternative. However, this is because it is using the outcome to determine the clusters, which is not valid.

### Problem 3.2 - Clustering Stocks
In the market segmentation assignment in this week's homework, you were introduced to the preProcess command from the caret package, which normalizes variables by subtracting the mean and dividing by the standard deviation.

In cases where we have a training and testing set, we'll want to normalize by the mean and standard deviation of the variables in the training set. We can do this by passing just the training set to the preProcess function:

    library(caret)

    preproc = preProcess(limitedTrain)

    normTrain = predict(preproc, limitedTrain)

    normTest = predict(preproc, limitedTest)

**What is the mean of the ReturnJan variable in normTrain?**

In [18]:
library(caret)

Loading required package: lattice

Loading required package: ggplot2



In [19]:
#Preprocess the data

preproc = preProcess(limitedTrain)

normTrain = predict(preproc, limitedTrain)

normTest = predict(preproc, limitedTest)

In [20]:
#Obtains the mean of the ReturnJan variable in normTrain
mj = mean(normTrain$ReturnJan)
paste("ReturnJan Mean in normTrain: ", mj)

**What is the mean of the ReturnJan variable in normTest?**

In [21]:
#Obtains the mean of the ReturnJan variable in normTest
mjt = mean(normTest$ReturnJan)
paste("ReturnJan Mean in normTest: ", mjt)

### Problem 3.3 - Clustering Stocks
**Why is the mean ReturnJan variable much closer to 0 in normTrain than in normTest?**

Answer: Comparing `mean(stocksTrain$ReturnJan)` with `mean(stocksTest$ReturnJan)` shows that the average return in January is slightly higher in the training set than in the testing set. normTrain is centered on its own mean, so its ReturnJan mean is essentially 0. normTest, however, was constructed by subtracting the training-set mean of ReturnJan, which explains why the mean value of ReturnJan is slightly negative in normTest.
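
The effect can be reproduced in base R by centering and scaling both sets with the training set's mean and standard deviation. This is a minimal sketch on simulated data, assuming the default behavior of `preProcess` (centering and scaling); the data and variable names are hypothetical.

```r
# Simulated training and testing values for one variable (hypothetical data)
set.seed(3)
train <- data.frame(x = rnorm(200, mean = 0.02, sd = 0.1))
test  <- data.frame(x = rnorm(80,  mean = 0.02, sd = 0.1))

# Normalization parameters come from the training set only
mu    <- mean(train$x)
sigma <- sd(train$x)

normTrainX <- (train$x - mu) / sigma
normTestX  <- (test$x  - mu) / sigma

mean(normTrainX)  # zero up to floating-point error
mean(normTestX)   # generally not zero: test mean differs from the training mean
```

The test-set mean after normalization equals the gap between the two raw means, measured in training-set standard deviations.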

### Problem 3.4 - Clustering Stocks
Set the random seed to 144 (it is important to do this again, even though we did it earlier). Run k-means clustering with 3 clusters on normTrain, storing the result in an object called kmc.

**Which cluster has the largest number of observations?**

In [22]:
# Run k-means clustering with 3 clusters
set.seed(144)
kmc = kmeans(normTrain, centers=3)

# Subset the clusters into three different clusters
KmeansCluster1 = subset(normTrain, kmc$cluster == 1)
KmeansCluster2 = subset(normTrain, kmc$cluster == 2)
KmeansCluster3 = subset(normTrain, kmc$cluster == 3)

In [23]:
# Number of observations
no = table(kmc$cluster)
no


   1    2    3 
2479 4731  896 

In [24]:
# Cluster with the largest number of observations
which.max(no)

Answer: Cluster 2 with 4731 observations.
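
The object returned by `kmeans` also stores the cluster sizes directly in its `size` component, which gives the same counts as tabulating the assignments. A small sketch on simulated two-dimensional points (hypothetical data, not the stock returns):

```r
# Three simulated groups of points of different sizes (hypothetical data)
set.seed(144)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),  # 50 points
  matrix(rnorm(40,  mean = 5),  ncol = 2),  # 20 points
  matrix(rnorm(20,  mean = -5), ncol = 2)   # 10 points
)

km <- kmeans(toy, centers = 3)

km$size              # observations per cluster
table(km$cluster)    # same counts, from the assignment vector
which.max(km$size)   # label of the largest cluster
```

Cluster labels are arbitrary, so the label of the largest cluster can change with the seed or the R version even when the grouping itself is the same.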

### Problem 3.5 - Clustering Stocks
Recall from the recitation that we can use the flexclust package to obtain training set and testing set cluster assignments for our observations (note that the call to as.kcca may take a while to complete):

    library(flexclust)

    km.kcca = as.kcca(kmc, normTrain)

    clusterTrain = predict(km.kcca)

    clusterTest = predict(km.kcca, newdata=normTest)

**How many test-set observations were assigned to Cluster 2?**

In [25]:
# Flexclust package
library(flexclust)

kmc.kcca = as.kcca(kmc, normTrain)

Loading required package: grid

Loading required package: modeltools

Loading required package: stats4



In [26]:
# Predict the Training set
clusterTrain = predict(kmc.kcca)

# Predict the Test set
clusterTest = predict(kmc.kcca, newdata=normTest)

In [27]:
# Number of observations 
notest = table(clusterTest)
notest

clusterTest
   1    2    3 
1058 2029  387 

In [28]:
notest[2]
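
Conceptually, assigning a new observation to a k-means cluster means finding the closest centroid. The sketch below shows that idea in base R with Euclidean distance; it is an illustration under that assumption, not flexclust's code, and `assignNearest` and the toy centroids are hypothetical.

```r
# Assign each row of `points` to the nearest row of `centers`
assignNearest <- function(points, centers) {
  points  <- as.matrix(points)
  centers <- as.matrix(centers)
  # Squared Euclidean distance from every point to every centroid
  # (rows = points, columns = centroids)
  d2 <- vapply(
    seq_len(nrow(centers)),
    function(k) rowSums(sweep(points, 2, centers[k, ])^2),
    numeric(nrow(points))
  )
  d2 <- matrix(d2, nrow = nrow(points))  # keep matrix shape for a single point
  apply(d2, 1, which.min)
}

# Hypothetical centroids and new observations
centers   <- rbind(c(0, 0), c(5, 5), c(-5, -5))
newPoints <- rbind(c(0.2, -0.1), c(4.8, 5.3), c(-6, -4))

assigned <- assignNearest(newPoints, centers)
assigned
table(assigned)
```

With a fitted model, `kmc$centers` would play the role of `centers` and the normalized test set the role of `newPoints`.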

### Problem 4.1 - Cluster-Specific Predictions
Using the subset function, build data frames stocksTrain1, stocksTrain2, and stocksTrain3, containing the elements in the stocksTrain data frame assigned to clusters 1, 2, and 3, respectively (be careful to take subsets of stocksTrain, not of normTrain). Similarly build stocksTest1, stocksTest2, and stocksTest3 from the stocksTest data frame.

**Which training set data frame has the highest average value of the dependent variable?**

In [29]:
# Subsetting stocksTrain into 1, 2, and 3 from the respective clusters
stocksTrain1 = subset(stocksTrain, clusterTrain == 1)
stocksTrain2 = subset(stocksTrain, clusterTrain == 2)
stocksTrain3 = subset(stocksTrain, clusterTrain == 3)

# Subsetting stocksTest into 1, 2, and 3 from the respective clusters
stocksTest1 = subset(stocksTest, clusterTest == 1)
stocksTest2 = subset(stocksTest, clusterTest == 2)
stocksTest3 = subset(stocksTest, clusterTest == 3)

In [30]:
# Compute the average value of the Positive December returns in each respective cluster

a = mean(stocksTrain1$PositiveDec)
b = mean(stocksTrain2$PositiveDec)
c = mean(stocksTrain3$PositiveDec)

paste("Average Value of Positive December Returns in Cluster Train 1: ",round(a,4))
paste("Average Value of Positive December Returns in Cluster Train 2: ",round(b,4))
paste("Average Value of Positive December Returns in Cluster Train 3: ",round(c,4))

Answer: Cluster 1.
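
The per-cluster averages can also be obtained in one step with `tapply`, without building the subsets first. A minimal sketch on made-up vectors (the values below are hypothetical, chosen only to keep the example self-contained):

```r
# Hypothetical 0/1 outcomes and cluster assignments for eight observations
outcome <- c(1, 0, 1, 1, 0, 0, 1, 0)
cluster <- c(1, 1, 1, 2, 2, 3, 3, 3)

# Mean outcome within each cluster
clusterMeans <- tapply(outcome, cluster, mean)
clusterMeans

# Label of the cluster with the highest average outcome
names(which.max(clusterMeans))
```

On the assignment data, the equivalent call would be `tapply(stocksTrain$PositiveDec, clusterTrain, mean)`.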

### Problem 4.2 - Cluster-Specific Predictions
Build logistic regression models StocksModel1, StocksModel2, and StocksModel3, which predict PositiveDec using all the other variables as independent variables. StocksModel1 should be trained on stocksTrain1, StocksModel2 should be trained on stocksTrain2, and StocksModel3 should be trained on stocksTrain3.

**Which variables have a positive sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3 and a negative sign for the coefficient in at least one of StocksModel1, StocksModel2, and StocksModel3?**

In [31]:
# Create Logistic Regression Model for each Cluster Training

StocksModel1= glm(PositiveDec ~ ., data = stocksTrain1, family=binomial)

StocksModel2= glm(PositiveDec ~ ., data = stocksTrain2, family=binomial)

StocksModel3= glm(PositiveDec ~ ., data = stocksTrain3, family=binomial)

In [32]:
# Examine the logistic regression models
summary(StocksModel1)


Call:
glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7220  -1.2879   0.8679   1.0096   1.7170  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.20739    0.08020   2.586  0.00971 ** 
ReturnJan    0.12448    0.31474   0.396  0.69246    
ReturnFeb   -0.46307    0.32713  -1.416  0.15692    
ReturnMar    0.55465    0.24804   2.236  0.02534 *  
ReturnApr    1.08354    0.25005   4.333 1.47e-05 ***
ReturnMay    0.30487    0.24993   1.220  0.22253    
ReturnJune   0.00172    0.33525   0.005  0.99591    
ReturnJuly  -0.02763    0.30216  -0.091  0.92714    
ReturnAug    0.40299    0.34570   1.166  0.24373    
ReturnSep    0.70779    0.32611   2.170  0.02998 *  
ReturnOct   -1.33254    0.29055  -4.586 4.51e-06 ***
ReturnNov   -0.78944    0.30583  -2.581  0.00984 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial f

In [33]:
# Examine the logistic regression models
summary(StocksModel2)


Call:
glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2268  -1.2086   0.9698   1.1294   1.6769  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.1165255  0.0335061   3.478 0.000506 ***
ReturnJan    0.4489316  0.2414090   1.860 0.062938 .  
ReturnFeb   -0.1385458  0.1514734  -0.915 0.360373    
ReturnMar    0.4657754  0.2442129   1.907 0.056488 .  
ReturnApr    0.7839726  0.2558668   3.064 0.002184 ** 
ReturnMay    0.7709831  0.2625217   2.937 0.003316 ** 
ReturnJune   0.5764584  0.2191809   2.630 0.008537 ** 
ReturnJuly   0.8654737  0.2869778   3.016 0.002563 ** 
ReturnAug    0.0177313  0.2317020   0.077 0.939001    
ReturnSep    1.0464947  0.2684133   3.899 9.67e-05 ***
ReturnOct   -0.0001062  0.2426989   0.000 0.999651    
ReturnNov   -0.3716212  0.2603210  -1.428 0.153421    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersio

In [34]:
# Examine the logistic regression models
summary(StocksModel3)


Call:
glm(formula = PositiveDec ~ ., family = binomial, data = stocksTrain3)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8938  -1.0863  -0.5241   1.0874   2.1892  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.4504     0.1700   2.649 0.008071 ** 
ReturnJan     0.3816     0.2742   1.392 0.163927    
ReturnFeb     0.3943     0.4612   0.855 0.392490    
ReturnMar    -1.7220     0.4372  -3.938 8.20e-05 ***
ReturnApr     0.5214     0.3275   1.592 0.111316    
ReturnMay     1.1301     0.4274   2.644 0.008194 ** 
ReturnJune    1.5889     0.4423   3.592 0.000328 ***
ReturnJuly    1.3800     0.4602   2.999 0.002709 ** 
ReturnAug     0.4146     0.4824   0.860 0.390054    
ReturnSep    -0.0148     0.4754  -0.031 0.975167    
ReturnOct    -0.6820     0.3044  -2.241 0.025039 *  
ReturnNov    -1.7175     0.4133  -4.156 3.24e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial f

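Answer: ReturnFeb, ReturnMar, ReturnJuly, and ReturnSep differ in sign between the models. In the summaries above, ReturnFeb is negative in StocksModel1 and StocksModel2 but positive in StocksModel3; ReturnMar and ReturnSep are positive in StocksModel1 and StocksModel2 but negative in StocksModel3; and ReturnJuly is negative in StocksModel1 but positive in the other two. The remaining variables keep the same sign in all three models.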
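
One way to check the signs mechanically is to collect the estimates printed in the three summaries above into a matrix and flag the variables whose coefficients are positive in at least one model and negative in at least one other. The estimates below are copied from those summaries; the names `coefs` and `mixedSign` are introduced for illustration.

```r
# Coefficient estimates copied from the three model summaries (intercepts omitted)
coefs <- rbind(
  StocksModel1 = c(0.12448, -0.46307, 0.55465, 1.08354, 0.30487, 0.00172,
                   -0.02763, 0.40299, 0.70779, -1.33254, -0.78944),
  StocksModel2 = c(0.4489316, -0.1385458, 0.4657754, 0.7839726, 0.7709831,
                   0.5764584, 0.8654737, 0.0177313, 1.0464947, -0.0001062,
                   -0.3716212),
  StocksModel3 = c(0.3816, 0.3943, -1.7220, 0.5214, 1.1301, 1.5889,
                   1.3800, 0.4146, -0.0148, -0.6820, -1.7175)
)
colnames(coefs) <- c("ReturnJan", "ReturnFeb", "ReturnMar", "ReturnApr",
                     "ReturnMay", "ReturnJune", "ReturnJuly", "ReturnAug",
                     "ReturnSep", "ReturnOct", "ReturnNov")

# TRUE where a variable has both a positive and a negative estimate
mixedSign <- apply(coefs, 2, function(b) any(b > 0) && any(b < 0))
names(mixedSign)[mixedSign]
```

With the fitted models in memory, the same matrix (transposed, and including the intercept) could be built with `sapply(list(StocksModel1, StocksModel2, StocksModel3), coef)` instead of typing the values.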

### Problem 4.3 - Cluster-Specific Predictions
Using StocksModel1, make test-set predictions called predictTest1 on the data frame stocksTest1. Using StocksModel2, make test-set predictions called predictTest2 on the data frame stocksTest2. Using StocksModel3, make test-set predictions called predictTest3 on the data frame stocksTest3.

**What is the overall accuracy of StocksModel1 on the test set stocksTest1, using a threshold of 0.5?**

In [35]:
# Predict on the test set
predictTest1 = predict(StocksModel1, type="response", newdata=stocksTest1)

# Confusion matrix: actual outcomes (rows) vs. predictions at a 0.5 threshold (columns)
cmTest1 = table(stocksTest1$PositiveDec, predictTest1 > 0.5)
cmTest1

   
    FALSE TRUE
  0    43  350
  1    26  639

In [36]:
# Compute StocksModel1 stocksTest1 Accuracy
accurS1S1 = sum(diag(cmTest1))/sum(cmTest1)
paste("Accuracy StocksModel1 StocksTest1:", round(accurS1S1,digits=4))

**What is the overall accuracy of StocksModel2 on the test set stocksTest2, using a threshold of 0.5?**

In [37]:
# Predict on the test set
predictTest2 = predict(StocksModel2, type="response", newdata=stocksTest2)

# Confusion matrix: actual outcomes (rows) vs. predictions at a 0.5 threshold (columns)
cmTest2 = table(stocksTest2$PositiveDec, predictTest2 > 0.5)
cmTest2

   
    FALSE TRUE
  0   277  719
  1   221  812

In [38]:
# Compute StocksModel2 stocksTest2 Accuracy
accurS2S2 = sum(diag(cmTest2))/sum(cmTest2)
paste("Accuracy StocksModel2 StocksTest2:", round(accurS2S2,digits=4))

**What is the overall accuracy of StocksModel3 on the test set stocksTest3, using a threshold of 0.5?**

In [39]:
# Predict on the test set
predictTest3 = predict(StocksModel3, type="response", newdata=stocksTest3)

# Confusion matrix: actual outcomes (rows) vs. predictions at a 0.5 threshold (columns)
cmTest3 = table(stocksTest3$PositiveDec, predictTest3 > 0.5)
cmTest3

   
    FALSE TRUE
  0   119   69
  1    76  123

In [40]:
# Compute StocksModel3 stocksTest3 Accuracy
accurS3S3 = sum(diag(cmTest3))/sum(cmTest3)
paste("Accuracy StocksModel3 StocksTest3:", round(accurS3S3,digits=4))

### Problem 4.4 - Cluster-Specific Predictions
To compute the overall test-set accuracy of the cluster-then-predict approach, we can combine all the test-set predictions into a single vector and all the true outcomes into a single vector:

    AllPredictions = c(predictTest1, predictTest2, predictTest3)

    AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)

**What is the overall test-set accuracy of the cluster-then-predict approach, again using a threshold of 0.5?**

In [41]:
# Combine all test-set predictions and outcomes into a single vector

AllPredictions = c(predictTest1, predictTest2, predictTest3)

AllOutcomes = c(stocksTest1$PositiveDec, stocksTest2$PositiveDec, stocksTest3$PositiveDec)


# Confusion Matrix with All Outcomes and All Predictions

cmAA = table(AllOutcomes, AllPredictions > 0.5)
cmAA

           
AllOutcomes FALSE TRUE
          0   439 1138
          1   323 1574

In [42]:
# Compute overall test-set accuracy of the cluster-then-predict approach Accuracy
accurAA = sum(diag(cmAA))/sum(cmAA)
paste("Accuracy all the test-set predictions combined:", round(accurAA,digits=4))

We see a modest improvement over the original logistic regression model: based on the confusion matrices above, test-set accuracy rises from about 0.5671 to about 0.5794. Since predicting stock returns is a notoriously hard problem, this is a worthwhile increase in accuracy. By investing only in stocks for which the model is more confident of a positive return (those with higher predicted probabilities), the cluster-then-predict model can give us an edge over the original logistic regression model.
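
The pooled accuracy is also the size-weighted average of the three per-cluster accuracies. The sketch below checks this using the counts printed in the cluster-specific confusion matrices above; only the variable names are introduced here.

```r
# Per-cluster test-set confusion matrices, rebuilt from the printed counts
# (matrix() fills column by column: FALSE column first, then TRUE column)
cm1 <- matrix(c(43, 26, 350, 639),   nrow = 2)
cm2 <- matrix(c(277, 221, 719, 812), nrow = 2)
cm3 <- matrix(c(119, 76, 69, 123),   nrow = 2)
cms <- list(cm1, cm2, cm3)

clusterAcc  <- sapply(cms, function(m) sum(diag(m)) / sum(m))  # accuracy per cluster
clusterSize <- sapply(cms, sum)                                # test observations per cluster

# Weight each cluster's accuracy by its share of the test set
pooledAcc <- sum(clusterAcc * clusterSize) / sum(clusterSize)

round(clusterAcc, 4)
round(pooledAcc, 4)
```

Because the second cluster holds most of the test observations, its accuracy has the largest influence on the pooled figure.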