# Forecasting Elantra Sales

An important application of linear regression is understanding sales. Consider a company that produces and sells a product. In a given period, if the company produces more units than consumers will buy, it earns nothing on the unsold units and incurs additional costs for storing them in inventory until they can be sold. If it produces fewer units than consumers will buy, it earns less than it could have. Being able to predict consumer sales is therefore of first-order importance to the company.

In this problem, we will try to predict monthly sales of the Hyundai Elantra in the United States. The Hyundai Motor Company is a major automobile manufacturer based in South Korea. The Elantra is a car model that has been produced by Hyundai since 1990 and is sold all over the world, including the United States. We will build a linear regression model to predict monthly sales using economic indicators of the United States as well as Google search queries.

<img src="images/hyundai.jpg"/>

The file elantra.csv contains data for the problem. Each observation is a month, from January 2010 to February 2014. For each month, we have the following variables:

    Month = the month of the year for the observation (1 = January, 2 = February, 3 = March, ...).

    Year = the year of the observation.

    ElantraSales = the number of units of the Hyundai Elantra sold in the United States in the given month.

    Unemployment = the estimated unemployment percentage in the United States in the given month.

    Queries = a (normalized) approximation of the number of Google searches for "hyundai elantra" in the given month.

    CPI_energy = the monthly consumer price index (CPI) for energy for the given month.

    CPI_all = the consumer price index (CPI) for all products for the given month; this is a measure of the magnitude of the prices paid by consumer households for goods and services (e.g., food, clothing, electricity, etc.).

### Problem 1 - Loading the Data

Load the data set. Split the data set into training and testing sets as follows: place all observations for 2012 and earlier in the training set, and all observations for 2013 and 2014 into the testing set.

How many observations are in the training set?

In [1]:
elantra <- read.csv("data/elantra.csv")
head(elantra)

| | Month | Year | ElantraSales | Unemployment | Queries | CPI_energy | CPI_all |
|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` |
| 1 | 1 | 2010 | 7690 | 9.7 | 153 | 213.377 | 217.466 |
| 2 | 1 | 2011 | 9659 | 9.1 | 259 | 229.353 | 221.082 |
| 3 | 1 | 2012 | 10900 | 8.2 | 354 | 244.178 | 227.666 |
| 4 | 1 | 2013 | 12174 | 7.9 | 230 | 242.56 | 231.321 |
| 5 | 1 | 2014 | 15326 | 6.6 | 232 | 247.575 | 234.933 |
| 6 | 2 | 2010 | 7966 | 9.8 | 130 | 209.924 | 217.251 |


In [2]:
str(elantra)

'data.frame':	50 obs. of  7 variables:
 $ Month       : int  1 1 1 1 1 2 2 2 2 2 ...
 $ Year        : int  2010 2011 2012 2013 2014 2010 2011 2012 2013 2014 ...
 $ ElantraSales: int  7690 9659 10900 12174 15326 7966 12289 13820 16219 16393 ...
 $ Unemployment: num  9.7 9.1 8.2 7.9 6.6 9.8 9 8.3 7.7 6.7 ...
 $ Queries     : int  153 259 354 230 232 130 266 296 239 240 ...
 $ CPI_energy  : num  213 229 244 243 248 ...
 $ CPI_all     : num  217 221 228 231 235 ...


In [3]:
summary(elantra)

     Month           Year       ElantraSales    Unemployment      Queries     
 Min.   : 1.0   Min.   :2010   Min.   : 7690   Min.   :6.600   Min.   :130.0  
 1st Qu.: 3.0   1st Qu.:2011   1st Qu.:12560   1st Qu.:7.725   1st Qu.:224.8  
 Median : 6.0   Median :2012   Median :15624   Median :8.250   Median :262.5  
 Mean   : 6.3   Mean   :2012   Mean   :16005   Mean   :8.422   Mean   :263.5  
 3rd Qu.: 9.0   3rd Qu.:2013   3rd Qu.:19197   3rd Qu.:9.100   3rd Qu.:311.0  
 Max.   :12.0   Max.   :2014   Max.   :26153   Max.   :9.900   Max.   :427.0  
   CPI_energy       CPI_all     
 Min.   :204.2   Min.   :217.3  
 1st Qu.:230.1   1st Qu.:221.3  
 Median :244.4   Median :227.9  
 Mean   :236.9   Mean   :226.7  
 3rd Qu.:247.1   3rd Qu.:231.7  
 Max.   :256.4   Max.   :235.2  

In [4]:
elantraTrain <- subset(elantra, Year <= 2012)
elantraTest <- subset(elantra, Year > 2012)

In [5]:
str(elantraTrain)

'data.frame':	36 obs. of  7 variables:
 $ Month       : int  1 1 1 2 2 2 3 3 3 4 ...
 $ Year        : int  2010 2011 2012 2010 2011 2012 2010 2011 2012 2010 ...
 $ ElantraSales: int  7690 9659 10900 7966 12289 13820 8225 19255 19681 9657 ...
 $ Unemployment: num  9.7 9.1 8.2 9.8 9 8.3 9.9 9 8.2 9.9 ...
 $ Queries     : int  153 259 354 130 266 296 138 281 303 132 ...
 $ CPI_energy  : num  213 229 244 210 232 ...
 $ CPI_all     : num  217 221 228 217 222 ...


In [6]:
str(elantraTest)

'data.frame':	14 obs. of  7 variables:
 $ Month       : int  1 1 2 2 3 4 5 6 7 8 ...
 $ Year        : int  2013 2014 2013 2014 2013 2013 2013 2013 2013 2013 ...
 $ ElantraSales: int  12174 15326 16219 16393 26153 24445 25090 22163 23958 24700 ...
 $ Unemployment: num  7.9 6.6 7.7 6.7 7.5 7.5 7.5 7.5 7.3 7.2 ...
 $ Queries     : int  230 232 239 240 313 248 252 320 274 271 ...
 $ CPI_energy  : num  243 248 253 246 245 ...
 $ CPI_all     : num  231 235 233 235 232 ...


In [7]:
nrow(elantraTrain)

### Problem 2.1 - A Linear Regression Model

Build a linear regression model to predict monthly Elantra sales using Unemployment, CPI_all, CPI_energy and Queries as the independent variables. Use all of the training set data to do this.

What is the model R-squared? Note: In this problem, we will always be asking for the "Multiple R-Squared" of the model.

In [8]:
lmSale<-lm(ElantraSales ~ Unemployment + CPI_energy + CPI_all + Queries, data=elantraTrain)
summary(lmSale)


Call:
lm(formula = ElantraSales ~ Unemployment + CPI_energy + CPI_all + 
    Queries, data = elantraTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-6785.2 -2101.8  -562.5  2901.7  7021.0 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)   95385.36  170663.81   0.559    0.580
Unemployment  -3179.90    3610.26  -0.881    0.385
CPI_energy       38.51     109.60   0.351    0.728
CPI_all        -297.65     704.84  -0.422    0.676
Queries          19.03      11.26   1.690    0.101

Residual standard error: 3295 on 31 degrees of freedom
Multiple R-squared:  0.4282,	Adjusted R-squared:  0.3544 
F-statistic: 5.803 on 4 and 31 DF,  p-value: 0.00132


In [9]:
summary(lmSale)$r.squared

### Problem 2.2 - Significant Variables

How many variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff.

Answer: None. Queries has the smallest p-value (0.101), but it is still above the 0.10 cutoff.
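As a quick check, the p-values printed in the coefficient table above can be compared against the cutoff directly. This is only a sketch: the vector `pvals` is typed in by hand from that output, and the names `pvals`, `cutoff` and `significant` are introduced here for illustration.

```r
# p-values copied by hand from the coefficient table above (intercept omitted)
pvals <- c(Unemployment = 0.385, CPI_energy = 0.728,
           CPI_all = 0.676, Queries = 0.101)

cutoff <- 0.10

# Names of the predictors whose p-value falls below the cutoff
significant <- names(pvals)[pvals < cutoff]

length(significant)  # number of predictors below the cutoff
```

With the fitted model at hand, the same column should be available as `coef(summary(lmSale))[, "Pr(>|t|)"]`, which avoids retyping rounded values.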

### Problem 2.3 - Coefficients

What is the coefficient of the Unemployment variable?

Answer: -3179.90

### Problem 2.4 - Interpreting the Coefficient

What is the interpretation of this coefficient?

Answer: For an increase of 1 in Unemployment, the prediction of Elantra sales decreases by approximately 3000, holding the other variables fixed.
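The reading of a regression coefficient can be illustrated by evaluating the fitted equation at two points that differ only in Unemployment. The sketch below copies the coefficients from the summary above; `predict_sales` is a hypothetical helper and the input values (240, 228, 300) are arbitrary.

```r
# Coefficients copied from the summary of the first model above
coefs <- c(Intercept = 95385.36, Unemployment = -3179.90,
           CPI_energy = 38.51, CPI_all = -297.65, Queries = 19.03)

# Hypothetical helper: linear prediction for a single month
predict_sales <- function(unemployment, cpi_energy, cpi_all, queries) {
  unname(coefs["Intercept"] +
           coefs["Unemployment"] * unemployment +
           coefs["CPI_energy"] * cpi_energy +
           coefs["CPI_all"] * cpi_all +
           coefs["Queries"] * queries)
}

# Two months that are identical except for a one-point gap in Unemployment
low  <- predict_sales(8.0, 240, 228, 300)
high <- predict_sales(9.0, 240, 228, 300)

high - low  # difference in predicted sales; equals the Unemployment coefficient
```

Whatever values are chosen for the other three inputs, the difference stays the same, which is what "holding the other variables fixed" means in a linear model.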

### Problem 3.1 - Modeling Seasonality

Our model R-Squared is relatively low, so we would now like to improve our model. In modeling demand and sales, it is often useful to model seasonality. Seasonality refers to the fact that demand is often cyclical/periodic in time. For example, in countries with different seasons, demand for warm outerwear (like jackets and coats) is higher in fall/autumn and winter (due to the colder weather) than in spring and summer. (In contrast, demand for swimsuits and sunscreen is higher in the summer than in the other seasons.) Another example is the "back to school" period in North America: demand for stationery (pencils, notebooks and so on) in late July and all of August is higher than the rest of the year due to the start of the school year in September.

In our problem, since our data includes the month of the year in which the units were sold, it is feasible for us to incorporate monthly seasonality. From a modeling point of view, it is reasonable to expect that the month has an effect on how many Elantra units are sold.

To incorporate the seasonal effect due to the month, build a new linear regression model that predicts monthly Elantra sales using Month as well as Unemployment, CPI_all, CPI_energy and Queries. Do not modify the training and testing data frames before building the model.

What is the model R-Squared?

In [10]:
lmSale2 <- lm(ElantraSales ~ Month + Unemployment + CPI_all + CPI_energy + Queries, data=elantraTrain)
summary(lmSale2)


Call:
lm(formula = ElantraSales ~ Month + Unemployment + CPI_all + 
    CPI_energy + Queries, data = elantraTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-6416.6 -2068.7  -597.1  2616.3  7183.2 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  148330.49  195373.51   0.759   0.4536  
Month           110.69     191.66   0.578   0.5679  
Unemployment  -4137.28    4008.56  -1.032   0.3103  
CPI_all        -517.99     808.26  -0.641   0.5265  
CPI_energy       54.18     114.08   0.475   0.6382  
Queries          21.19      11.98   1.769   0.0871 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3331 on 30 degrees of freedom
Multiple R-squared:  0.4344,	Adjusted R-squared:  0.3402 
F-statistic: 4.609 on 5 and 30 DF,  p-value: 0.003078


In [11]:
summary(lmSale2)$r.squared

### Problem 3.2 - Effect of Adding a New Variable

Which of the following best describes the effect of adding Month?

Answer: The model is not better because the adjusted R-squared has gone down and none of the variables (including the new one) are very significant.
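The drop in adjusted R-squared can be reproduced from the multiple R-squared values printed in the two summaries, using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. The function name `adjusted_r2` below is introduced for illustration; the inputs are the rounded values from the outputs above, so the results match the printed ones only to about three decimal places.

```r
# Adjusted R-squared from multiple R-squared, n observations, p predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 36  # rows in the training set

adj_without_month <- adjusted_r2(0.4282, n, p = 4)
adj_with_month    <- adjusted_r2(0.4344, n, p = 5)

round(c(adj_without_month, adj_with_month), 3)
```

Adding Month raises the multiple R-squared slightly, but the penalty for the extra coefficient outweighs that gain.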

### Problem 3.3 - Understanding the Model

Let us try to understand our model.

In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in March?

In [12]:
110.69 * 2

In the new model, given two monthly periods that are otherwise identical in Unemployment, CPI_all, CPI_energy and Queries, what is the absolute difference in predicted Elantra sales given that one period is in January and one is in May?

In [13]:
110.69 * 4

### Problem 3.4 - Numeric vs. Factors

You may be experiencing an uneasy feeling that there is something not quite right in how we have modeled the effect of the calendar month on the monthly sales of Elantras. If so, you are right. In particular, we added Month as a variable, but Month is an ordinary numeric variable. In fact, we must convert Month to a factor variable before adding it to the model.

What is the best explanation for why we must do this?

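Answer: By modeling Month as a factor variable, the effect of each calendar month is not restricted to be linear in the numerical coding of the month. With the numeric coding, every step from one month to the next shifts predicted sales by the same amount (110.69 in the model above), so December is forced to differ from January by eleven times the February effect, which is not how seasonality behaves.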
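One way to see what the conversion changes is to compare the design matrices that R builds for the two codings. The sketch below uses a toy data frame with one row per calendar month, not the Elantra data; `toy`, `X_numeric` and `X_factor` are names introduced here.

```r
# Toy data: one row per calendar month (not the Elantra data)
toy <- data.frame(Month = 1:12)

# Numeric coding: intercept plus a single Month column,
# so one slope is shared by all months
X_numeric <- model.matrix(~ Month, data = toy)

# Factor coding: intercept plus one indicator column for each month
# other than the reference level (January)
X_factor <- model.matrix(~ as.factor(Month), data = toy)

dim(X_numeric)
dim(X_factor)
```

Each indicator column gets its own coefficient, so each month's effect is estimated relative to January rather than being tied to a straight line through the month number.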

### Problem 4.1 - A New Model

Re-run the regression with the Month variable modeled as a factor variable. (Create a new variable that models the Month as a factor (using the as.factor function) instead of overwriting the current Month variable. We'll still use the numeric version of Month later in the problem.)

What is the model R-Squared?

In [14]:
elantraTrain$MonthFactor = as.factor(elantraTrain$Month)
elantraTest$MonthFactor = as.factor(elantraTest$Month)
head(elantraTrain)

| | Month | Year | ElantraSales | Unemployment | Queries | CPI_energy | CPI_all | MonthFactor |
|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 1 | 2010 | 7690 | 9.7 | 153 | 213.377 | 217.466 | 1 |
| 2 | 1 | 2011 | 9659 | 9.1 | 259 | 229.353 | 221.082 | 1 |
| 3 | 1 | 2012 | 10900 | 8.2 | 354 | 244.178 | 227.666 | 1 |
| 6 | 2 | 2010 | 7966 | 9.8 | 130 | 209.924 | 217.251 | 2 |
| 7 | 2 | 2011 | 12289 | 9.0 | 266 | 232.188 | 221.816 | 2 |
| 8 | 2 | 2012 | 13820 | 8.3 | 296 | 247.615 | 228.138 | 2 |


In [15]:
modelSalesM <- lm(ElantraSales ~ MonthFactor + Unemployment + CPI_all + CPI_energy + Queries, data=elantraTrain)
summary(modelSalesM)


Call:
lm(formula = ElantraSales ~ MonthFactor + Unemployment + CPI_all + 
    CPI_energy + Queries, data = elantraTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-3865.1 -1211.7   -77.1  1207.5  3562.2 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   312509.280 144061.867   2.169 0.042288 *  
MonthFactor2    2254.998   1943.249   1.160 0.259540    
MonthFactor3    6696.557   1991.635   3.362 0.003099 ** 
MonthFactor4    7556.607   2038.022   3.708 0.001392 ** 
MonthFactor5    7420.249   1950.139   3.805 0.001110 ** 
MonthFactor6    9215.833   1995.230   4.619 0.000166 ***
MonthFactor7    9929.464   2238.800   4.435 0.000254 ***
MonthFactor8    7939.447   2064.629   3.845 0.001010 ** 
MonthFactor9    5013.287   2010.745   2.493 0.021542 *  
MonthFactor10   2500.184   2084.057   1.200 0.244286    
MonthFactor11   3238.932   2397.231   1.351 0.191747    
MonthFactor12   5293.911   2228.310   2.376 0.027621 *  
Unemployment   -7739.381   2

In [16]:
summary(modelSalesM)$r.squared

### Problem 4.2 - Significant Variables

Which variables are significant, or have levels that are significant? Use 0.10 as your p-value cutoff.

Answer: MonthFactor, Unemployment, CPI_energy and CPI_all. Queries is not significant.

### Problem 5.1 - Multicollinearity

Another peculiar observation about the regression is that the sign of the Queries coefficient has changed. When we naively modeled Month as a numeric variable, Queries had a positive coefficient; now it has a negative one. Furthermore, CPI_energy has a positive coefficient: the model predicts that Elantra sales increase as the overall price of energy increases, which seems counter-intuitive (if the price of energy increases, we'd expect consumers to have less money to spend on automobiles, leading to lower Elantra sales).

As we have seen before, changes in coefficient signs and signs that are counter to our intuition may be due to a multicollinearity problem. To check, compute the correlations of the variables in the training set.

Which of the following variables is CPI_energy highly correlated with? Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)

In [17]:
cor(elantraTrain[c("Unemployment","Month","Queries","CPI_energy","CPI_all")])

| | Unemployment | Month | Queries | CPI_energy | CPI_all |
|---|---|---|---|---|---|
| Unemployment | 1.0 | -0.2036029 | -0.6411093 | -0.8007188 | -0.9562123 |
| Month | -0.2036029 | 1.0 | 0.0158443 | 0.1760198 | 0.2667883 |
| Queries | -0.6411093 | 0.0158443 | 1.0 | 0.8328381 | 0.7536732 |
| CPI_energy | -0.8007188 | 0.1760198 | 0.8328381 | 1.0 | 0.9132259 |
| CPI_all | -0.9562123 | 0.2667883 | 0.7536732 | 0.9132259 | 1.0 |


Answer: Unemployment, Queries, CPI_all.
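The 0.6 cutoff can also be applied programmatically rather than by eye. The sketch below copies the CPI_energy row of the correlation matrix above into a named vector; `cor_cpi_energy`, `others` and `high` are names introduced here for illustration.

```r
# CPI_energy row of the correlation matrix, copied from the output above
cor_cpi_energy <- c(Unemployment = -0.8007188, Month = 0.1760198,
                    Queries = 0.8328381, CPI_energy = 1.0,
                    CPI_all = 0.9132259)

# Drop the variable's correlation with itself, then apply the 0.6 cutoff
# to the absolute values so strong negative correlations are kept too
others <- cor_cpi_energy[names(cor_cpi_energy) != "CPI_energy"]
high <- names(others)[abs(others) > 0.6]

high
```

Taking the absolute value matters here: Unemployment is strongly but negatively correlated with CPI_energy and would be missed by a plain `> 0.6` test.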

### Problem 5.2 - Correlations

Which of the following variables is Queries highly correlated with? Again, compute the correlations on the training set. Select all that apply. (Include only variables where the absolute value of the correlation exceeds 0.6. For the purpose of this question, treat Month as a numeric variable, not a factor variable.)

Answer: Unemployment, CPI_energy, CPI_all.

### Problem 6.1 - A Reduced Model

Let us now simplify our model (the model using the factor version of the Month variable). We will do this by iteratively removing variables, one at a time. Remove the variable with the highest p-value (i.e., the least statistically significant variable) from the model. Repeat this until there are no variables that are insignificant or variables for which all of the factor levels are insignificant. Use a threshold of 0.10 to determine whether a variable is significant.

Which variables, and in what order, are removed by this process?

In [18]:
lmSaleMon3<-lm(ElantraSales ~ MonthFactor + Queries + Unemployment + CPI_energy + CPI_all, data=elantraTrain)
summary(lmSaleMon3)


Call:
lm(formula = ElantraSales ~ MonthFactor + Queries + Unemployment + 
    CPI_energy + CPI_all, data = elantraTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-3865.1 -1211.7   -77.1  1207.5  3562.2 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)   312509.280 144061.867   2.169 0.042288 *  
MonthFactor2    2254.998   1943.249   1.160 0.259540    
MonthFactor3    6696.557   1991.635   3.362 0.003099 ** 
MonthFactor4    7556.607   2038.022   3.708 0.001392 ** 
MonthFactor5    7420.249   1950.139   3.805 0.001110 ** 
MonthFactor6    9215.833   1995.230   4.619 0.000166 ***
MonthFactor7    9929.464   2238.800   4.435 0.000254 ***
MonthFactor8    7939.447   2064.629   3.845 0.001010 ** 
MonthFactor9    5013.287   2010.745   2.493 0.021542 *  
MonthFactor10   2500.184   2084.057   1.200 0.244286    
MonthFactor11   3238.932   2397.231   1.351 0.191747    
MonthFactor12   5293.911   2228.310   2.376 0.027621 *  
Queries           -4.764    

In [19]:
lmSaleMon4<-lm(ElantraSales ~ MonthFactor + Unemployment + CPI_energy + CPI_all, data=elantraTrain)
summary(lmSaleMon4)


Call:
lm(formula = ElantraSales ~ MonthFactor + Unemployment + CPI_energy + 
    CPI_all, data = elantraTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-3866.0 -1283.3  -107.2  1098.3  3650.1 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)   325709.15  136627.85   2.384 0.026644 *  
MonthFactor2    2410.91    1857.10   1.298 0.208292    
MonthFactor3    6880.09    1888.15   3.644 0.001517 ** 
MonthFactor4    7697.36    1960.21   3.927 0.000774 ***
MonthFactor5    7444.64    1908.48   3.901 0.000823 ***
MonthFactor6    9223.13    1953.64   4.721 0.000116 ***
MonthFactor7    9602.72    2012.66   4.771 0.000103 ***
MonthFactor8    7919.50    2020.99   3.919 0.000789 ***
MonthFactor9    5074.29    1962.23   2.586 0.017237 *  
MonthFactor10   2724.24    1951.78   1.396 0.177366    
MonthFactor11   3665.08    2055.66   1.783 0.089062 .  
MonthFactor12   5643.19    1974.36   2.858 0.009413 ** 
Unemployment   -7971.34    2840.79  -2.806 0.010586

Answer: Queries.
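The cells above judge MonthFactor by reading the p-values of its individual levels. A different, complementary check, not the procedure used above, is `drop1()` with an F-test, which reports one p-value per term and so tests the factor as a whole. The sketch below runs on synthetic data, not the Elantra data; `toy`, `x1`, `x2` and `y` are made up for illustration.

```r
set.seed(1)  # synthetic data, not the Elantra data
n <- 48
toy <- data.frame(
  MonthFactor = factor(rep(1:12, times = 4)),
  x1 = rnorm(n),
  x2 = rnorm(n)
)
# Response depends on the month and on x1, but not on x2
toy$y <- 10 + as.integer(toy$MonthFactor) + 3 * toy$x1 + rnorm(n)

fit <- lm(y ~ MonthFactor + x1 + x2, data = toy)

# One F-test per term; MonthFactor is tested as a single term with 11 df
tests <- drop1(fit, test = "F")
tests
```

In a backward-elimination loop one would drop the term with the largest p-value in this table, refit, and repeat until every remaining term is below the chosen threshold.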

### Problem 6.2 - Test Set Predictions

Using the model from Problem 6.1, make predictions on the test set. What is the sum of squared errors of the model on the test set?

In [20]:
predictions = predict(lmSaleMon4, newdata = elantraTest)
SSE = sum((predictions - elantraTest$ElantraSales)^2)
SSE

### Problem 6.3 - Comparing to a Baseline

What would the baseline method predict for all observations in the test set? Remember that the baseline method we use predicts the average outcome of all observations in the training set.

In [21]:
baseline = mean(elantraTrain$ElantraSales)
baseline

### Problem 6.4 - Test Set R-Squared

What is the test set R-Squared?

In [22]:
SST = sum((baseline - elantraTest$ElantraSales)^2)
R2 = 1 - SSE/SST
R2
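The calculation above can be wrapped in a small function, which makes explicit that the baseline in the denominator is the training-set mean rather than the test-set mean. This is a sketch: `test_r2` is a hypothetical helper and the numbers in the usage example are made up.

```r
# Out-of-sample R-squared, using the training-set mean as the baseline
test_r2 <- function(actual, predicted, train_mean) {
  sse <- sum((actual - predicted)^2)   # squared errors of the model
  sst <- sum((actual - train_mean)^2)  # squared errors of the baseline
  1 - sse / sst
}

# Made-up numbers, for illustration only
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
test_r2(actual, predicted, train_mean = 15)
```

Because the baseline comes from the training set, a test-set R-squared can be negative when the model predicts worse than that baseline, unlike the in-sample R-squared of a model with an intercept.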

### Problem 6.5 - Absolute Errors

What is the largest absolute error that we make in our test set predictions?

In [23]:
abs_errors = abs(predictions - elantraTest$ElantraSales)
max(abs_errors)

### Problem 6.6 - Month of Largest Error

In which period (Month,Year pair) do we make the largest absolute error in our prediction?

In [24]:
which.max(abs_errors)

In [25]:
elantraTest[which.max(abs_errors), ]

| | Month | Year | ElantraSales | Unemployment | Queries | CPI_energy | CPI_all | MonthFactor |
|---|---|---|---|---|---|---|---|---|
| | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<fct>` |
| 14 | 3 | 2013 | 26153 | 7.5 | 313 | 244.598 | 232.075 | 3 |


Answer: March, 2013.