# REDA1-CE1000: Introduction to Real Estate Data Analytics

## Week 4



### Continued Exploration of Causality: Raise the Rent to Lower Vacancy!



### Extending the Bivariate Algorithm to the Multivariate Algorithm: CAPM to Fama-French Factor Models (and Beyond)



### Model Diagnostics and Their Valid Application

## A Continued Discussion of Empirical Causality

* Prior to the 1930s, there were few, if any, government-sponsored measures of economic activity.
    * The US unemployment rate during the Great Depression [peaked at roughly 25%](https://www.marketwatch.com/story/the-soaring-us-unemployment-rate-could-approach-great-depression-era-levels-2020-04-03) in 1933.
    * In truth, this is an approximation as there were few measures of economic activity at the time.
    * This is not to say that **sovereigns** were not economic actors.  See the [remarkable research](https://www.bankofengland.co.uk/working-paper/2020/eight-centuries-of-global-real-interest-rates-r-g-and-the-suprasecular-decline-1311-2018) from the Bank of England on long-duration, risk-free interest rates.



* As the US and Europe slid into the Great Depression of the 1930s, policymakers lacked basic information.



* In the US, the National Income and Product Accounts (NIPA) were born in 1934 and greatly expanded during and after WWII.
    * Gross Domestic Product = Consumption + Gov't Spending + Investment + Exports - Imports = $C + G + I + X - M$ (in a closed economy, simply $C + G + I$)
    * Gross National Product = Gross Domestic Product + Net Income Earned Abroad



* At the time, Alfred Cowles established the [Cowles Commission](https://en.wikipedia.org/wiki/Cowles_Foundation) for Research in Economics.



* The Cowles approach was a probabilistic framework to estimate systems of simultaneous equations to model an economy.  Consider [GDP](https://en.wikipedia.org/wiki/Multiplier_(economics)):
    * $C = C(\text{aggregate output and other things})$
    * $G = \bar G$, fixed and given
    * $I = I(\text{interest rates, changes in consumption, and other things})$
    * Your spending is my income and my spending is your income
    * Consumption and investment are related via a multiplier
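
The "your spending is my income" logic can be sketched numerically. The block below is a minimal illustration, not the Cowles system itself: it assumes a closed economy $Y = C + I + G$ with a linear consumption function $C = a + bY$, and every parameter value (`a`, `b`, `investment`, `gov`) is hypothetical. Under those assumptions, a one-unit rise in $G$ raises equilibrium output by $1/(1-b)$, which is also the limit of successive rounds of re-spending.

```python
# Hypothetical closed-economy sketch: Y = C + I + G, with C = a + b*Y.
# All parameter values are illustrative assumptions, not course data.

def equilibrium_output(a, b, investment, gov):
    """Solve Y = a + b*Y + I + G for Y (requires 0 <= b < 1)."""
    return (a + investment + gov) / (1.0 - b)

def spending_rounds(b, injection, rounds):
    """Sum successive rounds of re-spending: injection * (1 + b + b^2 + ...)."""
    return sum(injection * b**r for r in range(rounds))

a, b = 100.0, 0.8                 # autonomous consumption, marginal propensity to consume
investment, gov = 200.0, 300.0    # investment and government spending, taken as given

y0 = equilibrium_output(a, b, investment, gov)
y1 = equilibrium_output(a, b, investment, gov + 10.0)   # raise G by 10 units
multiplier = (y1 - y0) / 10.0                            # should match 1 / (1 - b)
rounds_total = spending_rounds(b, 10.0, 200)             # cumulative effect of the injection

print(f"Equilibrium output before the change: {y0:.1f}")
print(f"Implied multiplier: {multiplier:.2f}")
print(f"Cumulative spending after 200 rounds: {rounds_total:.2f}")
```

The two calculations agree because the geometric series $1 + b + b^2 + \dots$ converges to $1/(1-b)$ when $0 \le b < 1$.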



* Ultimately, Cowles would develop very large-scale **macroeconometric models** to examine a host of different economic variables.
    * Hard-won gains
    * But the approach was found to be inadequate for policy evaluation: **Goodhart’s Conjecture** and the **Lucas critique**.
    


### Goodhart's Conjecture

* Goodhart “asserts that any economic relation tends to break down when used for policy purposes.”  [Wickens](https://www.amazon.com/Macroeconomic-Theory-Dynamic-Equilibrium-Approach/dp/0691152861/ref=sr_1_1?dchild=1&keywords=wickens&qid=1591918350&sr=8-1)
    * Proposed relationships, economic or otherwise, are not themselves structural in nature.
    * Instead, they are derived from more fundamental behavioral relationships (the ones sometimes called *structural*).
    


### The Lucas Critique

* Lucas (1976) notes that individual decision rules affected by policy are driven by “deep structural parameters.” 
    * Decision rules and, therefore, decisions are contingent on the state of the system as it is.
    * Change the system through policy, change the decision rule.
    * Such changes may not be captured in non-structural models.
    



### Experimental Design: Natural Experiments (Freakonomics)

> A natural experiment is an empirical study in which individuals (or clusters of individuals) exposed to the experimental and control conditions are determined by nature or by other factors outside the control of the investigators, yet the process governing the exposures arguably resembles random assignment. [Wikipedia](https://en.wikipedia.org/wiki/Natural_experiment)

* Examples
    * **Natural disasters**
        * 1906 S.F. earthquake to examine the impact of stock changes (vacancy) on rent.
            * $\rightarrow$ Given the gold standard, the 1906 earthquake arguably **caused** the Panic of 1907, itself the proximate cause of the 1913 Federal Reserve Act.
        * Hurricane Sandy and a [potential algorithmic counterfactual](https://github.com/thsavage/Causation/blob/master/Poster.pdf).
        * COVID-19 and its [potential impacts](https://www.vox.com/recode/2020/4/14/21211789/coronavirus-office-space-work-from-home-design-architecture-real-estate) on commercial real estate.
    * **Lotteries** that truly randomize a group of individuals.  
        * Vietnam draft in the US as a means to explore the impacts of education on wages.
        * Delay in schooling reduces wages (even if the same level of schooling is achieved).
        * Randomized eligibility for mandatory military service in Argentina.  
            * Actual conscription increases the likelihood of having a criminal record later in adulthood. 
            * Possible inference: Delayed entry to the labor market has adverse implications in later labor market outcomes.
        * Recent paper on the impact of rent control on quality of housing stock.
    * **Jurisdictional boundaries** over which policies are different.
        * Different minimum wage laws (New Jersey and Pennsylvania) to examine the impact of minimum wages on employment levels.
        * [North v. South Korea at night](https://www.sciencephoto.com/media/108308/view/korea-at-night-satellite-image).
    * **Same individuals** faced with exogenous changes.
        * Tim in the classroom versus Tim online.
        


### Big Data: Will the Explosion of Data Generation Help?
* Remember the AAPL graph: the introduction of the iPhone created the current digital revolution.
    * We generate more data in a day today than we created in a year 10 years ago.
    * It is the **digital exhaust** of human activity.
    * Non-experimental.
* Will it help? 
    * Perhaps, but it depends on understanding algorithms.
* In the meantime, let's explore how CRE thinks about these ideas.



### A Real Estate Example: Landlords Should Raise the Rent to Lower Their Vacancy

## Regression with Multiple Features and the Valid Application of Diagnostics

* It is trivial to extend the bivariate linear model to a linear model that simultaneously incorporates multiple features.  
    * The interpretation of the results of a statistical model that uses multiple features is the same as the interpretation of the partial derivative from the calculus of many variables: the effect of a small change in a particular feature on a label (or outcome) **holding all else constant**.  



* A model with $K$ features, $x_{ik}$ and label $y_i$:
    $y_i=\sum_{k=1}^Kx_{ik}\cdot\beta_k+\epsilon_i = x_i^\prime \beta + \epsilon_i$



* The $K$ features $x_{ik}$ influence the label $y_i$ through the $K$-vector $\beta$, which we fit using the linear estimator, sometimes called **multivariate regression**.  As a result, we can use partial differentiation to interpret the results.

    ${\displaystyle \frac{\partial E(\hat y_i)}{\partial x_{ik}}=\hat \beta_k}$
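
The partial-derivative interpretation can be checked on simulated data. The following is a minimal sketch, assuming NumPy is available. The data, the true coefficients, and the variable names are all hypothetical. It fits the linear model by least squares and confirms that moving one feature by one unit, with the other held fixed, changes the fitted label by exactly that feature's coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Simulated (hypothetical) data: two correlated features and a label.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
eps = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + eps      # assumed true beta = (1, 2, -1)

# Design matrix with a constant column.
X = np.column_stack([np.ones(n), x1, x2])

# Least-squares fit of the coefficient vector.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Partial effect: raise x1 by one unit while holding x2 fixed.
x_base = np.array([1.0, 0.3, -0.2])
x_bump = x_base + np.array([0.0, 1.0, 0.0])
effect = x_bump @ beta_hat - x_base @ beta_hat

print("beta_hat:", beta_hat.round(3))
print("change in fitted y from a one-unit change in x1:", round(float(effect), 3))
```

Because the model is linear, the result does not depend on the baseline point `x_base`; any other choice gives the same change.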



* For those interested in history: [Frisch–Waugh–Lovell theorem](https://en.wikipedia.org/wiki/Frisch–Waugh–Lovell_theorem).  One can apply the basic principle from multivariate calculus: holding everything else constant, what is the impact of a feature on an outcome of interest?
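
The Frisch–Waugh–Lovell result gives "holding everything else constant" a concrete meaning, which a short simulation can illustrate. This is a sketch under assumed, simulated data (all names and coefficient values are hypothetical): the coefficient on `x1` from the full regression equals the coefficient obtained by first removing the influence of the other regressors from both `y` and `x1`, and then regressing residual on residual.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400

# Simulated (hypothetical) data with correlated features.
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 0.5 + 1.5 * x1 + 2.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients for design matrix X and label y."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

const = np.ones(n)

# Full regression: y on a constant, x1, and x2.
beta_full = ols(np.column_stack([const, x1, x2]), y)

# Two-step version: partial the constant and x2 out of both y and x1,
# then regress the y residuals on the x1 residuals.
Z = np.column_stack([const, x2])
y_resid = y - Z @ ols(Z, y)
x1_resid = x1 - Z @ ols(Z, x1)
beta_fwl = ols(x1_resid.reshape(-1, 1), y_resid)[0]

print("coefficient on x1, full regression:", round(float(beta_full[1]), 6))
print("coefficient on x1, two-step FWL:   ", round(float(beta_fwl), 6))
```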

**Bottom line is simple**: Fit the linear model with multiple features.  The basic approach to hypothesis testing remains unchanged.  The challenge is the interpretation of the results and the valid application of regression diagnostics.  

## Goodness of Fit: R$^2$

* Different statistical computing environments largely produce the same regression output, formatted differently.  
    * This information is typically used for **regression diagnostics**.  
    * We have discussed regression coefficients and their interpretation, as well as the use of confidence intervals for hypothesis testing. 



* The R$^2$ **goodness of fit** metric is a frequently cited regression diagnostic.  
    * If a linear regression uses a constant (which should be included in practice), the R$^2$ is bounded between 0 and 1.  
    * It measures the share of the variation in $y$ explained by the variation in the features used in a model.  
    * Given this definition, the idea that **bigger is better** tends to prevail.  R$^2$ is often the first place that people go to evaluate the quality of a model, which is unwarranted.  
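
The definition above can be written out directly. The following is a minimal sketch on simulated data (the data and the helper name `r_squared` are hypothetical). It computes R$^2$ as one minus the ratio of the residual sum of squares to the total sum of squares, and checks the familiar bivariate fact that, with a constant included, R$^2$ equals the squared sample correlation between feature and label.

```python
import numpy as np

def r_squared(X, y):
    """R^2 = 1 - SSR/SST for a least-squares fit; X must include a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ssr = np.sum(resid**2)               # unexplained variation
    sst = np.sum((y - y.mean())**2)      # total variation in the label
    return 1.0 - ssr / sst

rng = np.random.default_rng(2)
n = 300

# Simulated (hypothetical) bivariate data.
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
r2 = r_squared(X, y)
corr2 = np.corrcoef(x, y)[0, 1] ** 2     # squared sample correlation

print("R^2 from the regression:   ", round(float(r2), 4))
print("squared sample correlation:", round(float(corr2), 4))
```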



> "However, it can still be challenging to determine what is a good R$^2$ value, and in general, this will depend on the application.  For instance, in certain problems in physics, we may know that the data truly comes from a linear model with a small residual error.  In this case, we would expect to see an R$^2$ value that is extremely close to 1, and a substantially smaller R$^2$ might indicate serious problems with the experiment in which the data were generated.  On the other hand, in typical applications in biology, psychology, marketing and other domains, the linear model is at best an extremely rough approximation to the data, and residual errors due to other unmeasured factors are often very large.  In this setting, we would expect only a very small proportion of the variance in the response to be explained by the predictor, and an R$^2$ value well below 0.1 might be more realistic."  

> Trevor Hastie, Robert Tibshirani, et al.



* Let's consider this idea by extending CAPM to the Fama-French factor models.
    * An exercise in data mining in finance.
    
    
    
* Consider the distinction between risk and uncertainty made by [Frank Knight](https://en.wikipedia.org/wiki/Frank_Knight).  The same argument can be applied to the R$^2$ metric when used in isolation.  We do not know the total variance of the system (which is essentially the point being made by Hastie and Tibshirani).
    * COVID-19 is an example where **Knightian uncertainty** has increased.  As a result, market volatility may, in truth, not have increased. 
    * In the R example, we have generated data from two systems, one with smaller **total variance**.
    * In the R example, we generated the **perfect model**.  (This is not factually correct.)
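
The point about total variance can be reproduced in a few lines. This is not the R example referred to above. It is a separate Python sketch under assumed parameter values (the noise levels, sample size, and helper name `fit_and_r2` are hypothetical). Both systems share the same true relationship and the fitted model has the correct functional form in each; only the noise variance differs, yet the R$^2$ values are far apart.

```python
import numpy as np

def fit_and_r2(x, y):
    """Fit y on a constant and x by least squares; return (slope, R^2)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return beta[1], r2

rng = np.random.default_rng(3)
n = 2000
x = rng.normal(size=n)

# Same assumed true model in both systems: y = 1 + 2x + noise.
y_quiet = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # small total variance
y_noisy = 1.0 + 2.0 * x + rng.normal(scale=5.0, size=n)   # large total variance

slope_quiet, r2_quiet = fit_and_r2(x, y_quiet)
slope_noisy, r2_noisy = fit_and_r2(x, y_noisy)

print(f"quiet system: slope = {slope_quiet:.3f}, R^2 = {r2_quiet:.3f}")
print(f"noisy system: slope = {slope_noisy:.3f}, R^2 = {r2_noisy:.3f}")
```

A low R$^2$ in the noisy system says nothing against the model specification; it reflects variance that no set of features could explain.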
    
    
    
* The valid application of the R$^2$ metric is to compare nested models.  (In time series, we will see another method of optimization, called the **method of maximum likelihood**.  The same argument will prevail there.)
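
A nested comparison might look like the following sketch. The data are simulated and all names are hypothetical. The F-statistic is the standard textbook form based on the two R$^2$ values; it is added here as an assumption about how one might formalize the comparison, not as a method taken from these notes. The example also shows why raw R$^2$ cannot be compared at face value: adding a feature never lowers it, even when the feature is unrelated to the label.

```python
import numpy as np

def r_squared(X, y):
    """R^2 for a least-squares fit; X must include a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(4)
n = 250

# Simulated (hypothetical) data.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
junk = rng.normal(size=n)                 # unrelated to y by construction
y = 1.0 + 1.0 * x1 + 0.5 * x2 + rng.normal(size=n)

const = np.ones(n)
X_small = np.column_stack([const, x1])              # restricted model
X_big = np.column_stack([const, x1, x2, junk])      # nests the restricted model

r2_small = r_squared(X_small, y)
r2_big = r_squared(X_big, y)

q = X_big.shape[1] - X_small.shape[1]     # number of added features
k = X_big.shape[1]                        # parameters in the larger model
f_stat = ((r2_big - r2_small) / q) / ((1.0 - r2_big) / (n - k))

print(f"R^2, restricted model: {r2_small:.4f}")
print(f"R^2, larger model:     {r2_big:.4f}")
print(f"F-statistic for the {q} added features: {f_stat:.2f}")
```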