# Non-linear Least Squares and Quantile Regression (Python Version)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize, curve_fit
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg
import utils

# Set up plotting style
utils.set_pitt_style()
PITT_BLUE = utils.PITT_BLUE
PITT_GOLD = utils.PITT_GOLD
PITT_GRAY = utils.PITT_GRAY
PITT_LGRAY = utils.PITT_LGRAY

In some settings, the standard linear model might fail to give us the relationship we want.

OLS estimation gives us:
* Constant marginal effects for each variable 
    - i.e. linear in the parameters $\beta_0$, $\beta_1$, $\ldots$, $\beta_k$
* The expected relationship in the population
    - i.e. we are trying to understand the effects on the *average* person

Here I'll discuss two estimation methods that allow us to move beyond this:
* Non-linear Least Squares (NLS)
    - this method will allow us to specify relationships where the parameters enter non-linearly
    - for example, via ratios, interactions, etc. 

* Quantile regression
    - this method will allow us to switch to predicting a *quantile* for the population
    - for example, the median, the top 10 percent, etc.

* In each case, the model means that we will not have a nice closed-form solution as in OLS
* As such, we're going to have to find the optimal parameters using some numerical methods

We'll go through both methods today:
* We'll talk about the objectives that are minimized
* We'll set-up and use optimizers to search for good solutions
* And finally, we'll look at the libraries in Python that can run these estimations for us

## Non-linear least squares:
The set-up is very similar to ordinary least squares:
* We have an outcome $y_i$ we are seeking to explain
* We have a series of $k$ different explanatory variables, $\mathbf{x}_i=\left(x_{i,1}, x_{i,2},\ldots, x_{i,k}\right)$

Similar to OLS we are trying to estimate the conditional expectation of $y$ given the observables $\mathbf{x}_i$. 

However, we now specify that this conditional mean has a non-linear relationship:
$\mathbb{E}(y|\mathbf{x})=f(\mathbf{x};\boldsymbol{\beta})$
where the function $f(\cdot)$ has $j+1$ unknown parameters $\boldsymbol{\beta}=(\beta_0, \beta_1,\ldots, \beta_j)$.

Again, similar to OLS, we specify that there is an *additive* error $\epsilon_i$ for each observation such that the realized $y_i$ satisfies:
$$ y_i=f(\mathbf{x}_i;\boldsymbol{\beta})+\epsilon_i$$
where $\epsilon_i$ is a mean-zero disturbance with variance $\sigma^2$.

As an example, suppose we are trying to estimate the growth of a user-base across time $t$.

We want to fit the data to a logistic function 
 $$ f(t)=\frac{\alpha}{1+\gamma \exp(-\lambda t)}$$

In [None]:
# R: logis.f <- function(t, alpha, gamma, lambda) alpha/(1+gamma*exp(-lambda*t))
# Python:
def logistic_f(t, alpha, gamma, lam):
    """Logistic growth function."""
    return alpha / (1 + gamma * np.exp(-lam * t))

# Plot the logistic function with example parameters
t_vals = np.linspace(0, 20, 200)
fig, ax = plt.subplots()
ax.plot(t_vals, logistic_f(t_vals, 10, 50, 0.5), color=PITT_BLUE, linewidth=2)
ax.set_xlabel('t')
ax.set_ylabel('f(t)')
ax.set_title('Logistic Growth Function')
plt.show()

So this is a function that allows for a slow initial growth, a period of quick growth, then a levelling off.

So we'll use this to model the growth of active Twitter users in the US (source: public Twitter account, before takeover and rebranding to X)

In [None]:
# R: twitter.us <- read.csv('social/TwitterUsers.csv')
# Python:
twitter_us = pd.read_csv('social/TwitterUsers.csv')
print(twitter_us.tail())

fig, ax = plt.subplots()
ax.scatter(twitter_us['t'], twitter_us['TwitterActiveUS'], color=PITT_BLUE, s=40, zorder=5)
ax.set_xlabel('t')
ax.set_ylabel('Twitter Active US (millions)')
ax.set_title('Twitter Active Users in the US')
plt.show()

So here we have a function with a single observable input $t$, and three parameters $\boldsymbol{\beta}=(\alpha,\gamma,\lambda)$ in:
$$ f(t)=\frac{\alpha}{1+\gamma\exp(-\lambda t)}$$

We will estimate the parameters by minimizing the sum of squared residuals:
$$ \text{Sum of Squares}(\boldsymbol{\beta})=\sum_i \left( y_i -f(t_i) \right)^2$$
which is given by:
$$ \text{Sum of Squares}(\boldsymbol{\beta})=\sum_i \left( y_i - \frac{\alpha}{1+\gamma\exp(-\lambda t_i)}\right)^2$$

So we want to find the estimated values $\hat{\boldsymbol{\beta}}=\left(\hat{\alpha},\hat{\gamma},\hat{\lambda} \right)$ that minimize this.

### Estimating the model

I'm going to write the objective function first:

In [None]:
# R: non.lin.LS.obj <- function(beta) sum((twitter.us$TwitterActiveUS - beta[1]/(1+beta[2]*exp(-beta[3]*twitter.us$t)))**2)
# Python:
def nls_objective(beta):
    """Sum of squared residuals for the logistic model."""
    predicted = beta[0] / (1 + beta[1] * np.exp(-beta[2] * twitter_us['t'].values))
    return np.sum((twitter_us['TwitterActiveUS'].values - predicted)**2)

# Test with some initial values
print(f"SSR at (68, 2, 0.14): {nls_objective([68, 2, 0.14]):.4f}")

and then optimize it using the Nelder-Mead solver via:

```python
minimize(fun=nls_objective, x0=[?, ?, ?], method='Nelder-Mead')
```

However, I need to tell the routine some starting values. Here I quickly calibrate things, using my understanding of the model.

Looking at the model:
$$ f(t)=\frac{\alpha}{1+\gamma \cdot e^{-\lambda t}}$$
We know that 
* $\alpha$ is the number of users as $t\rightarrow\infty$

In [None]:
# R: tail(twitter.us)
print(twitter_us.tail())

In [None]:
alpha0 = 69

So I set $\alpha_0$ to 69, just above the last observed value of 68. Using this value, I can guess a value for $\gamma_0$ because at time zero we have that:
$$f(0)=\frac{\alpha_0}{1+\gamma_0}=9$$
which I can invert to get a guess for $\gamma_0$.

In [None]:
print(twitter_us.head())

In [None]:
gamma0 = alpha0 / 9 - 1
print(f"gamma0 = {gamma0:.4f}")

Finally, for the logistic function, the inflection point where the slope stops increasing is given by $t=\frac{\log(\gamma)}{\lambda}$. If we ballpark this as happening at $t=10$ in the data, we can guess a value for $\lambda_0$.

In [None]:
lambda0 = np.log(gamma0) / 10
print(f"lambda0 = {lambda0:.6f}")
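
The three calibration steps above can be wrapped into a small helper and checked against the targets they were derived from. This is only a sketch: the function names `logistic_start_values` and `logistic_scalar` and the short variable names are hypothetical, and the inputs (asymptote 69, value 9 at $t=0$, inflection at $t=10$) are the ballpark numbers used above.

```python
import math

def logistic_start_values(asymptote, y_at_zero, t_inflection):
    """Back out (alpha0, gamma0, lambda0) from three features read off a plot."""
    a0 = asymptote                        # f(t) -> alpha as t -> infinity
    g0 = a0 / y_at_zero - 1               # from f(0) = alpha / (1 + gamma)
    l0 = math.log(g0) / t_inflection      # from t* = log(gamma) / lambda
    return a0, g0, l0

def logistic_scalar(t, alpha, gamma, lam):
    """Logistic growth function for a single value of t."""
    return alpha / (1 + gamma * math.exp(-lam * t))

a0, g0, l0 = logistic_start_values(asymptote=69, y_at_zero=9, t_inflection=10)

# The starting curve should pass through 9 at t=0 and through alpha/2 at t=10
print(f"f(0)  = {logistic_scalar(0, a0, g0, l0):.4f}")
print(f"f(10) = {logistic_scalar(10, a0, g0, l0):.4f}  (alpha/2 = {a0 / 2:.4f})")
```

The value at the inflection point is $\alpha/2$ because $\gamma e^{-\lambda t}=1$ there, which gives a quick visual check when reading starting values off a scatter plot.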

### Running the estimation:

Using `scipy.optimize.minimize` with Nelder-Mead (analogous to R's `optim`):

In [None]:
# R: twitter.opt <- optim(fn=non.lin.LS.obj, par=c(alpha0, gamma0, lambda0), hessian=TRUE)
# Python:
result = minimize(fun=nls_objective, x0=[alpha0, gamma0, lambda0], method='Nelder-Mead')

estimated_params = result.x
param_names = ['alpha', 'gamma', 'lambda']
for name, val in zip(param_names, estimated_params):
    print(f"{name} = {val:.6f}")

# Residual standard error
n = len(twitter_us)
sigma_hat_sq = result.fun / (n - 3)
print(f"\nResidual std. error: {np.sqrt(sigma_hat_sq):.4f}")

# Long-run prediction
print(f"\nPrediction at t=100: {logistic_f(100, *estimated_params):.4f}")
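
As a sanity check on the recipe (write the sum of squared residuals, then hand it to Nelder-Mead), it can help to run it on synthetic data where the answer is known. The sketch below uses made-up "true" parameters and noise-free data, so the minimizer should land very close to them; all names (`logistic_curve`, `true_params`, `ssr_sim`, `fit_sim`) are hypothetical and not part of the notebook above.

```python
import numpy as np
from scipy.optimize import minimize

def logistic_curve(t, alpha, gamma, lam):
    """Logistic growth function (same form as in the text)."""
    return alpha / (1 + gamma * np.exp(-lam * t))

# Hypothetical true parameters and a noise-free synthetic sample
true_params = np.array([70.0, 7.0, 0.2])
t_sim = np.arange(0, 40, dtype=float)
y_sim = logistic_curve(t_sim, *true_params)

def ssr_sim(beta):
    """Sum of squared residuals on the synthetic sample."""
    return np.sum((y_sim - logistic_curve(t_sim, *beta)) ** 2)

# Start away from the truth; tighter tolerances than the Nelder-Mead defaults
fit_sim = minimize(ssr_sim, x0=[60.0, 5.0, 0.15], method='Nelder-Mead',
                   options={'xatol': 1e-8, 'fatol': 1e-8, 'maxiter': 5000})

print("true:     ", true_params)
print("recovered:", fit_sim.x)
print("SSR at optimum:", fit_sim.fun)
```

With real, noisy data the recovered values will of course differ from any "truth", but the same check is a cheap way to confirm that the objective function is coded correctly.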

### Plotting the output:

In [None]:
fig, ax = plt.subplots()
t_fit = np.linspace(0, 40, 200)
ax.plot(t_fit, logistic_f(t_fit, *estimated_params), color=PITT_GOLD, linewidth=2, label='Fitted')
ax.scatter(twitter_us['t'], twitter_us['TwitterActiveUS'], color=PITT_BLUE, s=40, zorder=5, label='Data')
ax.set_xlabel('t')
ax.set_ylabel('Twitter Active US (millions)')
ax.set_title('NLS Fit: Logistic Growth Model')
ax.legend()
plt.show()

So we now have a working model to forecast the number of active US Twitter users going forward (in millions):
$$ f(t)=\frac{ 68.39 }{1+7.53e^{-0.219\cdot t}} $$

## Standard errors and inference
* Inference for these models is similar to OLS models, in that we can generate standard errors for each of our estimates, generate confidence intervals, etc.
* Inferential tests on the parameters rely on asymptotic approximations as the amount of data $n$ grows, together with Taylor expansions of the model around the estimates
* We will not dwell on the formulas for the standard errors here, and will come back to this a little when we talk about the delta-method and maximum likelihood estimations
* Here is some example code to generate the variance-covariance matrix for the estimates using numerical derivatives

In [None]:
# Compute variance-covariance matrix via numerical derivatives of f
eps = 1e-8

def nls_f_vec(beta):
    """Model predictions evaluated at all data points."""
    return beta[0] / (1 + beta[1] * np.exp(-beta[2] * twitter_us['t'].values))

# Numerical Jacobian: k x n matrix of partial derivatives
g = np.zeros((3, n))
for j in range(3):
    beta_plus = estimated_params.copy()
    beta_plus[j] += eps
    g[j, :] = (nls_f_vec(beta_plus) - nls_f_vec(estimated_params)) / eps

# Variance-covariance matrix: sigma^2 * (G G')^{-1}
D_mat = sigma_hat_sq * np.linalg.inv(g @ g.T)
D_df = pd.DataFrame(D_mat, index=param_names, columns=param_names)
print("Variance-Covariance Matrix:")
print(D_df)
print(f"\nStd error of alpha: {np.sqrt(D_mat[0,0]):.4f}")
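
The forward difference above uses a fixed step of `1e-8` for every parameter. One common alternative, sketched here under the assumption that the same logistic form is used, is a central difference with a step scaled to each parameter's size, which can then be compared with the analytical derivatives. The function names and the parameter values in `demo_params` are hypothetical, chosen only to be in the neighborhood of the estimates.

```python
import numpy as np

def logistic_curve(t, alpha, gamma, lam):
    return alpha / (1 + gamma * np.exp(-lam * t))

def analytic_jacobian(t, alpha, gamma, lam):
    """Rows: df/dalpha, df/dgamma, df/dlambda evaluated at each t."""
    e = np.exp(-lam * t)
    d = 1 + gamma * e
    return np.vstack([1 / d, -alpha * e / d**2, alpha * gamma * t * e / d**2])

def central_jacobian(t, params, rel_step=1e-6):
    """Central-difference Jacobian with a step scaled to each parameter."""
    params = np.asarray(params, dtype=float)
    jac = np.zeros((params.size, t.size))
    for j in range(params.size):
        h = rel_step * max(abs(params[j]), 1.0)
        up, down = params.copy(), params.copy()
        up[j] += h
        down[j] -= h
        jac[j] = (logistic_curve(t, *up) - logistic_curve(t, *down)) / (2 * h)
    return jac

t_demo = np.arange(0, 30, dtype=float)
demo_params = [68.0, 7.5, 0.22]   # illustrative values, not the fitted estimates

max_gap = np.max(np.abs(central_jacobian(t_demo, demo_params)
                        - analytic_jacobian(t_demo, *demo_params)))
print(f"largest absolute gap between numerical and analytical Jacobian: {max_gap:.2e}")
```

A small gap suggests that either Jacobian could be used in $\hat\sigma^2 (GG')^{-1}$; a large one usually points to a coding error in the analytical derivatives.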

It is also possible to make predictions through the model, but this relies on constructing different standard errors, using derivatives of the model with respect to the parameters.

Below I do this using some *analytical* derivatives

*(Note: We will come back to this when we talk about the **delta-method**)*

In [None]:
def prediction_with_ci(t_val, params, D_mat, sigma_hat_sq):
    """Generate prediction with confidence and prediction intervals."""
    alpha, gamma, lam = params
    
    # Analytical derivatives of f with respect to each parameter
    denom = (1 + gamma * np.exp(-lam * t_val))**2
    g_deriv = np.array([
        (1 + gamma * np.exp(-lam * t_val)) / denom,           # df/dalpha
        -alpha * np.exp(-lam * t_val) / denom,                 # df/dgamma
        alpha * gamma * t_val * np.exp(-lam * t_val) / denom   # df/dlambda
    ])
    
    model_est = alpha / (1 + gamma * np.exp(-lam * t_val))
    se = np.sqrt(g_deriv @ D_mat @ g_deriv)      # std. error of the fitted mean
    se_pred = np.sqrt(se**2 + sigma_hat_sq)      # add the error variance for a new observation
    
    return {
        't': t_val,
        'model': model_est,
        'lcl': model_est - 1.96 * se,
        'ucl': model_est + 1.96 * se,
        'lpred': model_est - 1.96 * se_pred,
        'upred': model_est + 1.96 * se_pred
    }

# Generate predictions for a range of t values
t_range = np.arange(-4, 71)
out_model = pd.DataFrame([prediction_with_ci(t, estimated_params, D_mat, sigma_hat_sq) for t in t_range])
print(out_model.head())

And so we can plot the confidence interval for the fitted model together with the prediction interval for new observations.

In [None]:
fig, ax = plt.subplots()
# Prediction interval (lighter)
ax.fill_between(out_model['t'], out_model['lpred'], out_model['upred'],
                color=PITT_GOLD, alpha=0.2, label='Prediction interval')
# Confidence interval (darker)
ax.fill_between(out_model['t'], out_model['lcl'], out_model['ucl'],
                color=PITT_GOLD, alpha=0.8, label='Confidence interval')
# Data points
ax.scatter(twitter_us['t'], twitter_us['TwitterActiveUS'], color=PITT_BLUE, s=40, zorder=5, label='Data')
ax.set_xlabel('t')
ax.set_ylabel('Twitter Active US (millions)')
ax.set_title('NLS Fit with Confidence and Prediction Intervals')
ax.legend()
plt.show()

### Using `scipy.optimize.curve_fit`

In R we used the `nls()` function. In Python, the closest equivalent is `scipy.optimize.curve_fit`, which fits a non-linear function to data and also returns the covariance matrix of the parameter estimates.

In [None]:
# R: nl.twitter.model <- nls(TwitterActiveUS ~ alpha/(1+gamma*exp(-lambda*t)), data=twitter.us,
#                            start=list(alpha=alpha0, gamma=gamma0, lambda=lambda0))
# Python:
popt, pcov = curve_fit(
    logistic_f, 
    twitter_us['t'].values, 
    twitter_us['TwitterActiveUS'].values,
    p0=[alpha0, gamma0, lambda0]
)

print("Estimated Parameters:")
for name, val, se in zip(param_names, popt, np.sqrt(np.diag(pcov))):
    print(f"  {name:>6s} = {val:10.6f}  (SE = {se:.6f})")

# Residual standard error
residuals = twitter_us['TwitterActiveUS'].values - logistic_f(twitter_us['t'].values, *popt)
rse = np.sqrt(np.sum(residuals**2) / (n - 3))
print(f"\nResidual std. error: {rse:.4f}")

So we can construct confidence intervals from the standard errors:

In [None]:
# R: confint(nl.twitter.model, level=0.95)
# Python: use parameter estimates +/- 1.96 * SE (or use t-distribution)
from scipy.stats import t as t_dist

dof = n - 3
t_crit = t_dist.ppf(0.975, dof)

print("95% Confidence Intervals:")
print(f"{'Param':>8s}  {'Lower':>12s}  {'Upper':>12s}")
for name, val, se in zip(param_names, popt, np.sqrt(np.diag(pcov))):
    lower = val - t_crit * se
    upper = val + t_crit * se
    print(f"{name:>8s}  {lower:12.6f}  {upper:12.6f}")

Here is the estimated variance-covariance matrix from `curve_fit` which we can compare to the one we calculated above with the simple numerical derivatives:

In [None]:
# R: vcov(nl.twitter.model)
# Python:
print("Variance-Covariance from curve_fit:")
print(pd.DataFrame(pcov, index=param_names, columns=param_names))

print("\nVariance-Covariance from manual calculation:")
print(D_df)

### Generate predictions:
Suppose we wanted extrapolated values for Twitter's earlier years, for which we have no data, in order to compare them to our own company's growth.

In [None]:
# R: prev.years <- data.frame("t"=c(0,-1,-2,-3,-4))
#    prev.years["prediction"] <- predict(nl.twitter.model, newdata=prev.years)
# Python:
prev_years = pd.DataFrame({'t': [0, -1, -2, -3, -4]})
prev_years['prediction'] = logistic_f(prev_years['t'].values, *popt)
print(prev_years)

In [None]:
fig, ax = plt.subplots()
t_fit = np.linspace(-5, 40, 200)
ax.plot(t_fit, logistic_f(t_fit, *popt), color=PITT_GOLD, linewidth=2, label='Fitted model')
ax.scatter(twitter_us['t'], twitter_us['TwitterActiveUS'], color=PITT_BLUE, s=40, zorder=5, label='Data')
ax.scatter(prev_years['t'], prev_years['prediction'], color='red', s=40, zorder=5, label='Predictions')
ax.set_xlabel('t')
ax.set_ylabel('Twitter Active US (millions)')
ax.legend()
plt.show()

### Model misspecification
One warning here is that the non-linear functional form is taken as given. Your standard errors and other inference will be made under the maintained assumption that the model is correct.

As an example, let's look at some publicly available data for the growth of Snapchat's user base:

In [None]:
# R: snapchat.all <- read.csv('social/SnapchatUsers.csv')
# Python:
snapchat_all = pd.read_csv('social/SnapchatUsers.csv')
print(snapchat_all.head())

In [None]:
fig, ax = plt.subplots()
ax.scatter(snapchat_all['t'], snapchat_all['snapchatDailyAll'], color=PITT_GOLD, s=40, zorder=5)
ax.set_xlabel('t')
ax.set_ylabel('Snapchat Daily Users (millions)')
ax.set_title('Snapchat Daily Active Users')
plt.show()

Guess some new initial values from the above graph (300 million asymptote, 48 million at time 0, inflection point at t=15)

In [None]:
alpha0_snap = 300
gamma0_snap = alpha0_snap / 48 - 1
lambda0_snap = np.log(gamma0_snap) / 15
print(f"Starting values: alpha={alpha0_snap}, gamma={gamma0_snap:.4f}, lambda={lambda0_snap:.6f}")

We can again easily fit the non-linear logistic growth model using `curve_fit`:

In [None]:
# R: nl.snapchat.model <- nls(snapchatDailyAll ~ (alpha/(1+gamma*exp(-lambda*t))),
#                             data=snapchat.all, start=list(alpha=alpha0, gamma=gamma0, lambda=lambda0))
# Python:
popt_snap, pcov_snap = curve_fit(
    logistic_f,
    snapchat_all['t'].values,
    snapchat_all['snapchatDailyAll'].values,
    p0=[alpha0_snap, gamma0_snap, lambda0_snap]
)

print("Estimated Parameters (Snapchat):")
for name, val, se in zip(param_names, popt_snap, np.sqrt(np.diag(pcov_snap))):
    print(f"  {name:>6s} = {val:10.4f}  (SE = {se:.4f})")

**Question:** Looking at these results are you worried about anything here?

Graphing the fitted function:

In [None]:
fig, ax = plt.subplots()
t_fit = np.linspace(0, 40, 200)
ax.plot(t_fit, logistic_f(t_fit, *popt_snap), color=PITT_GOLD, linewidth=2, label='Fitted')
ax.scatter(snapchat_all['t'], snapchat_all['snapchatDailyAll'], color=PITT_BLUE, s=40, zorder=5, label='Data')
ax.set_xlabel('t')
ax.set_ylabel('Snapchat Daily Users (millions)')
ax.set_title('NLS Fit: Snapchat Logistic Growth')
ax.legend()
plt.show()

The problem here is that our assumed functional form pins each parameter down fairly well only when we move it on its own, holding the others fixed.

In [None]:
# Profiling the objective as a function of alpha alone
def snap_rmse(alpha, gamma, lam):
    predicted = alpha / (1 + gamma * np.exp(-lam * snapchat_all['t'].values))
    return np.sqrt(np.mean((snapchat_all['snapchatDailyAll'].values - predicted)**2))

opt_rmse = snap_rmse(*popt_snap)

alpha_range = np.arange(200, 271)
rel_errors = [(snap_rmse(a, popt_snap[1], popt_snap[2]) - opt_rmse) / opt_rmse for a in alpha_range]

fig, ax = plt.subplots()
ax.scatter(alpha_range, rel_errors, color=PITT_BLUE, s=20)
ax.set_xlabel('alpha')
ax.set_ylabel('Relative Error')
ax.set_title('Sensitivity of Objective to Alpha (holding other params fixed)')
plt.show()

### Making the model richer
Let's modify the model adding a vertical shift parameter (intercept), and show that it can be well estimated with the **Twitter** data.

In [None]:
def logistic_f_c(t, alpha, gamma, lam, intercept):
    """Logistic growth function with intercept."""
    return intercept + alpha / (1 + gamma * np.exp(-lam * t))

# R: nls(TwitterActiveUS ~ (intercept + alpha/(1+gamma*exp(-lambda*t))), ...)
# Python:
popt_tw2, pcov_tw2 = curve_fit(
    logistic_f_c,
    twitter_us['t'].values,
    twitter_us['TwitterActiveUS'].values,
    p0=[10, 2, 0.219654, 0]
)

param_names_ext = ['alpha', 'gamma', 'lambda', 'intercept']
print("Estimated Parameters (Twitter, extended model):")
for name, val, se in zip(param_names_ext, popt_tw2, np.sqrt(np.diag(pcov_tw2))):
    print(f"  {name:>10s} = {val:10.6f}  (SE = {se:.6f})")

**Question:** Are the estimated parameters sensitive to the initial conditions?

Let's plot the fitted function:

In [None]:
fig, ax = plt.subplots()
t_fit = np.linspace(0, 40, 200)
ax.plot(t_fit, logistic_f_c(t_fit, *popt_tw2), color=PITT_GOLD, linewidth=2, label='Fitted (with intercept)')
ax.scatter(twitter_us['t'], twitter_us['TwitterActiveUS'], color=PITT_BLUE, s=40, zorder=5, label='Data')
ax.set_xlabel('t')
ax.set_ylabel('Twitter Active US (millions)')
ax.legend()
plt.show()

### Moving on to the Snapchat data...
In the Twitter data, there's enough curvature to the model for us to estimate all of the effects relatively precisely, but as we saw, the non-linear features of the model don't really show up well in the Snapchat data.

When we try the extended model with an intercept on the Snapchat data, the Levenberg-Marquardt algorithm (the default in `curve_fit` when no bounds are given; R's `nls` uses Gauss-Newton) may struggle to converge. Let's try manual optimization with Nelder-Mead instead:

In [None]:
def snap_ssr_intercept(beta):
    """Sum of squared residuals for logistic + intercept on Snapchat data."""
    predicted = beta[3] + beta[0] / (1 + beta[1] * np.exp(-beta[2] * snapchat_all['t'].values))
    return np.sum((snapchat_all['snapchatDailyAll'].values - predicted)**2)

# First round of optimization
optim1 = minimize(snap_ssr_intercept, x0=[235.2, 4.38650, 0.17, 0], method='Nelder-Mead')
print(f"Round 1 - Converged: {optim1.success}")
print(f"  Parameters: {optim1.x}")
print(f"  Objective: {optim1.fun:.2f}")

In [None]:
# Second round, restarting from previous output
optim2 = minimize(snap_ssr_intercept, x0=optim1.x, method='Nelder-Mead')
print(f"Round 2 - Converged: {optim2.success}")
print(f"  Parameters: {optim2.x}")
print(f"  Objective: {optim2.fun:.2f}")

In [None]:
# Third round
optim3 = minimize(snap_ssr_intercept, x0=optim2.x, method='Nelder-Mead')
print(f"Round 3 - Converged: {optim3.success}")
print(f"  Parameters: {optim3.x}")
print(f"  Objective: {optim3.fun:.2f}")

Let's look at the two parameter estimates we have:

In [None]:
print("Round 1 params:", optim1.x)
print("Round 3 params:", optim3.x)

And now we can plot out the two functions:

In [None]:
fig, ax = plt.subplots()
t_fit = np.linspace(0, 35, 200)
ax.plot(t_fit, logistic_f_c(t_fit, *optim1.x), color=PITT_GOLD, linewidth=5, alpha=0.7, label='Round 1')
ax.plot(t_fit, logistic_f_c(t_fit, *optim3.x), color='red', linewidth=1, label='Round 3')
ax.scatter(snapchat_all['t'], snapchat_all['snapchatDailyAll'], color=PITT_BLUE, s=40, zorder=5, label='Data')
ax.set_xlabel('t')
ax.set_ylabel('Snapchat Daily Users (millions)')
ax.legend()
plt.show()

But when you extrapolate outside of the data, the two models can have very different predictions!

In [None]:
# Look at t=80...
print(f"Round 1 prediction at t=80: {logistic_f_c(80, *optim1.x):.2f}")
print(f"Round 3 prediction at t=80: {logistic_f_c(80, *optim3.x):.2f}")

The problem here is that the function parameters are not well identified by the data once I allow for this change.
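
One rough way to see this kind of weak identification, offered here only as a sketch, is to look at how close to collinear the derivative directions of the model are over the observed range of $t$. The code below compares a short window that stops near the inflection point with a long window that runs into the plateau. The parameter values are hypothetical (loosely based on the Snapchat starting values, not estimates), the windows are made up, and the row-normalized condition number is just one of several possible diagnostics.

```python
import numpy as np

def logistic_c_jacobian(t, alpha, gamma, lam):
    """Rows: derivatives of intercept + alpha/(1+gamma*exp(-lam*t))
    with respect to alpha, gamma, lambda and the intercept."""
    e = np.exp(-lam * t)
    d = 1 + gamma * e
    return np.vstack([1 / d,
                      -alpha * e / d**2,
                      alpha * gamma * t * e / d**2,
                      np.ones_like(t)])

# Hypothetical parameter values, for illustration only
alpha_demo, gamma_demo, lam_demo = 300.0, 5.25, 0.11

windows = {
    "short": np.linspace(0, 15, 30),   # data stop around the inflection point
    "long": np.linspace(0, 60, 30),    # data run well into the plateau
}

cond_numbers = {}
for label, t_window in windows.items():
    G = logistic_c_jacobian(t_window, alpha_demo, gamma_demo, lam_demo)
    # Scale each row to unit length so the comparison reflects collinearity, not units
    G_unit = G / np.linalg.norm(G, axis=1, keepdims=True)
    cond_numbers[label] = np.linalg.cond(G_unit)
    print(f"{label:>5s} window: condition number = {cond_numbers[label]:.3g}")
```

A very large condition number means that quite different parameter combinations produce almost the same fitted curve inside the sample, which is consistent with restarts of the optimizer drifting to different parameters while the in-sample fit barely changes.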

## Direct transformations
Some non-linear models can, however, be directly transformed into linear ones.

Consider the Cobb-Douglas model for how output is related to capital and labor inputs: $$ y=\beta \cdot C^\gamma L^{1-\gamma} $$

This model can be directly linearized to get:
$$\log(y)=\log(\beta)+\gamma\log(C)+(1-\gamma)\log(L)$$

So you could estimate the non-linear Cobb-Douglas production model using a linear model by taking log transformations of the outputs and inputs.

The logistic growth model we used can't be similarly transformed, hence the need for the non-linear approach.
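
To make the log-linearization concrete, here is a small simulation sketch. It assumes a *multiplicative* error term (so that the error becomes additive after taking logs), which is a different assumption from the additive error used for NLS above. The parameter values, sample size, and variable names are all made up. The restriction that the exponents sum to one is imposed by regressing $\log(y/L)$ on $\log(C/L)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_firms = 500
beta_true, gamma_true = 2.0, 0.3          # hypothetical values for the simulation

capital = rng.uniform(1, 10, n_firms)
labor = rng.uniform(1, 10, n_firms)
shock = np.exp(rng.normal(0, 0.05, n_firms))   # multiplicative error
output = beta_true * capital**gamma_true * labor**(1 - gamma_true) * shock

# log(y/L) = log(beta) + gamma * log(C/L): the constraint holds by construction
X_cd = np.column_stack([np.ones(n_firms), np.log(capital / labor)])
coef_cd, *_ = np.linalg.lstsq(X_cd, np.log(output / labor), rcond=None)

beta_hat, gamma_hat = np.exp(coef_cd[0]), coef_cd[1]
print(f"beta_hat  = {beta_hat:.3f}  (true {beta_true})")
print(f"gamma_hat = {gamma_hat:.3f}  (true {gamma_true})")
```

If the error were instead additive in levels, the log transformation would no longer give a clean linear model, and NLS on the original equation would be the more natural choice.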

---
# Quantile Regression

## The idea

What if instead of estimating the **mean** conditional on observables, we wanted to estimate other features of the conditional distribution?

For instance:
* The **median** (50th percentile)
* The bottom 10th percentile
* The top 1 percent (the 99th percentile)

Quantile regression allows exactly this: we estimate the conditional $\tau$-th quantile of the outcome $y$ as a function of $\mathbf{x}$.

OLS and Quantile Regression differ over the objective we attempt to minimize to find the parameters. These objectives are illustrated by the following two costs attributed to errors in each.

* **OLS**: Minimize $\sum_i (y_i - \mathbf{x}_i \boldsymbol{\beta})^2$ -- the *squared* loss
* **LAD (Median)**: Minimize $\sum_i |y_i - \mathbf{x}_i \boldsymbol{\beta}|$ -- the *absolute* loss

The squared loss penalizes large errors more heavily, pulling the estimate toward the mean. The absolute loss penalizes every error in proportion to its size, so a single large error carries less weight, and the minimizer is the median.

![Image](https://alistairjwilson.github.io/MQE_AW/i/Compareopt.gif)

## Why will it be the median?
Focus in on the animation just of the LAD problem here:

![Image](https://alistairjwilson.github.io/MQE_AW/i/MedOpt.gif)

Any changes to the parameter have exactly offsetting effects from the left and right points, so we just move the median point towards zero.
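
The same point can be checked numerically with five made-up data points: a brute-force search over a grid of candidate values finds that the absolute loss is minimized at the sample median, while the squared loss is minimized at the sample mean. The data and grid below are arbitrary choices for illustration.

```python
import numpy as np

points = np.array([1.0, 2.0, 4.0, 7.0, 20.0])   # five made-up data points
candidates = np.linspace(0, 25, 2501)            # grid of candidate values (step 0.01)

# Total loss at every candidate value (rows: data points, columns: candidates)
errors = points[:, None] - candidates[None, :]
abs_loss = np.abs(errors).sum(axis=0)
sq_loss = (errors ** 2).sum(axis=0)

abs_minimizer = candidates[abs_loss.argmin()]
sq_minimizer = candidates[sq_loss.argmin()]

print(f"absolute-loss minimizer: {abs_minimizer:.2f} | median: {np.median(points):.2f}")
print(f"squared-loss minimizer:  {sq_minimizer:.2f} | mean:   {points.mean():.2f}")
```

Moving the outlying point at 20 even further out changes the squared-loss minimizer but leaves the absolute-loss minimizer where it is.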

But this insight leads to a generalization: suppose we have five points, but we make the left hand side of the tick three times steeper!

![Image](https://alistairjwilson.github.io/MQE_AW/i/MedOpt2.gif)

* So the optimal parameter here will be such that there are three times as many points to the right as to the left, since each point on the steeper left-hand side counts three times as much.

* With lots and lots of data points, this will be close to the 25th percentile of the data.

* Hence, the general method of quantile regression can be used to estimate different quantiles of the data by changing the relative slope of errors to the left and to the right.

### General Objective
So we can define the loss function that targets the $\tau$-th quantile by weighting positive and negative errors differently:
$$ \rho_\tau\left(u \right)= \begin{cases}
-u\cdot(1-\tau) & \text{if } u<0   \\
 u\cdot\tau & \text{if } u\geq 0
\end{cases}$$

We can then look for the parameter vector $\boldsymbol{\beta}$ that minimizes:
$$ \sum_i \rho_\tau\bigl(y_i -\mathbf{x}_i \boldsymbol{\beta}\bigr)$$
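
As a quick check of this objective in the simplest case (a constant and no regressors), the sketch below codes $\rho_\tau$ directly and confirms on made-up skewed data that the value minimizing the summed loss lines up with the sample quantile. The function name `check_loss` and the simulated sample are assumptions for illustration only.

```python
import numpy as np

def check_loss(u, tau):
    """rho_tau(u): slope tau for u >= 0 and slope (1 - tau) for u < 0."""
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0, tau * u, -(1 - tau) * u)

rng = np.random.default_rng(1)
sample = rng.exponential(scale=2.0, size=2001)   # skewed, made-up data

minimizers = {}
for tau in (0.25, 0.5, 0.9):
    # The summed loss is piecewise linear, so a minimizer sits at a data point
    total_loss = [check_loss(sample - q, tau).sum() for q in sample]
    minimizers[tau] = sample[int(np.argmin(total_loss))]
    print(f"tau={tau}: loss minimizer = {minimizers[tau]:.4f}, "
          f"np.quantile = {np.quantile(sample, tau):.4f}")
```

Setting $\tau=0.5$ gives half the absolute loss, which has the same minimizer as the LAD objective above.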

## Finding parameters
* With meaningful regressors on the right-hand side of the model, we again have to use numerical methods to solve for the minimizing parameters. 

* Rather than numerically set this up and solve it again, let's jump straight to a Python package that does this for us: `statsmodels.regression.quantile_regression.QuantReg`
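
For readers curious what "numerically set this up" could look like, one standard formulation writes the quantile regression objective as a linear program by splitting each residual into positive and negative parts. The sketch below uses `scipy.optimize.linprog` on simulated data. It is an illustration only and not necessarily how `QuantReg` solves the problem internally; the function name `quantile_regression_lp` and the simulated data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def quantile_regression_lp(X, y, tau):
    """Minimize sum_i rho_tau(y_i - x_i b) as a linear program.

    Each residual is written as u_plus - u_minus with both parts >= 0, so the
    objective is tau * sum(u_plus) + (1 - tau) * sum(u_minus).
    """
    n, k = X.shape
    cost = np.concatenate([np.zeros(k), tau * np.ones(n), (1 - tau) * np.ones(n)])
    A_eq = np.hstack([X, np.eye(n), -np.eye(n)])           # X b + u_plus - u_minus = y
    bounds = [(None, None)] * k + [(0, None)] * (2 * n)    # b free, residual parts >= 0
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[:k]

# Simulated data whose noise spread grows with x, so quantile slopes differ
rng = np.random.default_rng(2)
n_obs = 200
x_sim = rng.uniform(0, 10, n_obs)
y_sim = 1.0 + 2.0 * x_sim + rng.normal(0, 1 + 0.3 * x_sim)
X_sim = np.column_stack([np.ones(n_obs), x_sim])

share_below = {}
for tau in (0.1, 0.5, 0.9):
    b = quantile_regression_lp(X_sim, y_sim, tau)
    share_below[tau] = np.mean(y_sim < X_sim @ b - 1e-8)
    print(f"tau={tau}: intercept={b[0]:.3f}, slope={b[1]:.3f}, "
          f"share of points below the line={share_below[tau]:.3f}")
```

The share of observations below each fitted line should sit close to $\tau$, which is the regression analogue of the counting argument used for the median above.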

In [None]:
# R: house.sales <- read.csv('real_estate/sales_all.csv')
# Python:
house_sales = pd.read_csv('real_estate/sales_all.csv')
print(house_sales.columns.tolist())
print(house_sales.head(1))

In [None]:
# Look at sale description frequencies
desc_counts = house_sales['SALEDESC'].value_counts().head(12)
print(desc_counts)

Wrangle the data a little here:

In [None]:
# R: valid.sales <- subset(house.sales, SALEDESC=="VALID SALE")
# Python:
valid_sales = house_sales[house_sales['SALEDESC'] == 'VALID SALE'].copy()

valid_sales['date'] = pd.to_datetime(valid_sales['SALEDATE'])
valid_sales['m'] = valid_sales['date'].dt.strftime('%m-%B')
valid_sales['y'] = valid_sales['date'].dt.year.astype(str)
valid_sales['month_n'] = valid_sales['date'].dt.month
valid_sales['year_n'] = valid_sales['date'].dt.year
valid_sales['price'] = pd.to_numeric(valid_sales['PRICE'], errors='coerce')
valid_sales['t'] = 12 * (valid_sales['year_n'] - 2016) + valid_sales['month_n']
valid_sales['high_summer'] = ((valid_sales['month_n'] >= 5) & (valid_sales['month_n'] <= 8)).astype(int)

# Drop rows with missing prices
valid_sales = valid_sales.dropna(subset=['price'])

print(f"Valid sales: {len(valid_sales)} observations")
print(valid_sales[['price', 'y', 'm', 'high_summer']].head())

First, let's run a standard OLS model for comparison:

In [None]:
# R: ols.model.expec <- lm(price ~ y + m, data=valid.sales)
# Python: create dummies for y and m
X_ols = pd.get_dummies(valid_sales[['y', 'm']], drop_first=True).astype(float)
X_ols = sm.add_constant(X_ols)
y_price = valid_sales['price'].values

ols_model = sm.OLS(y_price, X_ols).fit()
print(ols_model.summary().tables[1])

In [None]:
# Simpler model: price ~ y + high_summer
# R: lm(price ~ y + high.summer, data=valid.sales)
# Python:
X_simple = pd.get_dummies(valid_sales[['y']], drop_first=True).astype(float)
X_simple['high_summer'] = valid_sales['high_summer'].values
X_simple = sm.add_constant(X_simple)

ols_simple = sm.OLS(y_price, X_simple).fit()
print(ols_simple.summary().tables[1])

## Running a quantile regression
Let's run a quantile regression instead!

In [None]:
# R: q.model.median <- rq(price ~ y + m, data=valid.sales)
# Python:
qr_median = QuantReg(y_price, X_ols).fit(q=0.5)
print("Median Regression (tau=0.5):")
print(qr_median.summary())

In [None]:
# Median regression with simpler specification
# R: rq(price ~ y + high.summer, data=valid.sales, tau=0.5)
# Python:
qr_simple_median = QuantReg(y_price, X_simple).fit(q=0.5)
print("Median Regression (price ~ y + high_summer):")
print(qr_simple_median.summary())

### Comparing OLS vs Median Regression

Let's create prediction data to compare the month-level effects:

In [None]:
# Build prediction data for months in 2019
pred_2019 = valid_sales[valid_sales['y'] == '2019'][['m', 'y', 't']].drop_duplicates().sort_values('t')
pred_2019 = pred_2019.reset_index(drop=True)
pred_2019['t_adjusted'] = pred_2019['t'] - 36  # relative time index

# Create dummy variables for prediction (keep every level, then align to the training design)
X_pred = pd.get_dummies(pred_2019[['y', 'm']]).astype(float)
# Keep exactly the training columns: baseline levels are dropped, unseen dummies become 0
X_pred = X_pred.reindex(columns=X_ols.columns, fill_value=0.0)
X_pred['const'] = 1.0  # the intercept column must be 1, not 0

# Predictions - subtracting baseline to show month effects
ols_pred = ols_model.predict(X_pred)
qr_pred = qr_median.predict(X_pred)

pred_2019['Expectation'] = ols_pred - ols_pred.mean()
pred_2019['Median'] = qr_pred - qr_pred.mean()

print(pred_2019[['m', 't_adjusted', 'Expectation', 'Median']])

In [None]:
fig, ax = plt.subplots()
ax.plot(pred_2019['t_adjusted'], pred_2019['Expectation'], color=PITT_BLUE, linewidth=3, label='Expectation (OLS)')
ax.plot(pred_2019['t_adjusted'], pred_2019['Median'], color=PITT_GOLD, linewidth=3, label='Median (QR)')
ax.set_xlabel('Month (relative)')
ax.set_ylabel('Price Differential ($)')
ax.set_title('Month Effects: OLS vs Quantile Regression')
ax.legend()
plt.show()

## Check whether this is true for the top of the market too

We can fit quantile regressions at other quantiles to see how the effects differ across the price distribution:

In [None]:
# R: rq(price ~ y + m, data=valid.sales, tau=0.9)
# Python:
qr_90 = QuantReg(y_price, X_ols).fit(q=0.9)
print("90th Percentile Regression (tau=0.9):")
print(qr_90.summary())

In [None]:
# R: rq(price ~ y + high.summer, data=valid.sales, tau=0.99)
# Python:
qr_99 = QuantReg(y_price, X_simple).fit(q=0.99)
print("99th Percentile Regression (price ~ y + high_summer):")
print(qr_99.summary())

In [None]:
# Add top decile predictions
qr_90_pred = qr_90.predict(X_pred)
pred_2019['Top_Decile'] = qr_90_pred - qr_90_pred.mean()

print(pred_2019[['m', 't_adjusted', 'Expectation', 'Median', 'Top_Decile']])

In [None]:
fig, ax = plt.subplots()
ax.plot(pred_2019['t_adjusted'], pred_2019['Expectation'], color=PITT_BLUE, linewidth=3, label='Expectation (OLS)')
ax.plot(pred_2019['t_adjusted'], pred_2019['Median'], color=PITT_GOLD, linewidth=3, label='Median (QR 0.5)')
ax.plot(pred_2019['t_adjusted'], pred_2019['Top_Decile'], color='red', linewidth=3, label='Top Decile (QR 0.9)')
ax.set_xlabel('Month (relative)')
ax.set_ylabel('Price Differential ($)')
ax.set_title('Month Effects: OLS vs Quantile Regressions')
ax.legend()
plt.show()

From looking at this, I'd certainly believe that trying to sell my house in the summer makes sense (though I should likely account for a month's lag to the closing dates listed here).

I'm less certain about the specific month effects, especially for the higher-value houses.

### Comparing multiple quantiles

Let's fit several quantile regressions and compare the coefficients across quantiles:

In [None]:
taus = [0.1, 0.25, 0.5, 0.75, 0.9]
qr_results = {}

for tau in taus:
    qr_results[tau] = QuantReg(y_price, X_simple).fit(q=tau)

# Compare the high_summer coefficient across quantiles
print(f"{'Quantile':>10s}  {'Intercept':>12s}  {'high_summer':>12s}")
print("-" * 38)

# Also include OLS for comparison
print(f"{'OLS':>10s}  {ols_simple.params['const']:12.2f}  {ols_simple.params['high_summer']:12.2f}")
for tau in taus:
    model = qr_results[tau]
    # Label-based access: positional integer indexing on a labelled Series is deprecated in pandas
    print(f"{tau:>10.2f}  {model.params['const']:12.2f}  {model.params['high_summer']:12.2f}")

### Interpretation Issues

There are some issues with quantile regression to do with how we can *interpret* the resulting effects. So what we were looking at before was the median house prices conditional on being sold in each of these months. However, it's not necessarily a causal story that if we just change the month we sell in, we should expect this gain. It could be that the houses that come up for sale in these months are just different in some other aspect.

More generally, we should be cautious because when we conduct quantile regressions we are modelling the conditional quantile across the data, where we can't directly think of the estimated effects in the same way as OLS.

That is, we might be interested in understanding how the median person's wage responds if we were to change that median person's education:
$$\frac{\partial\text{Median}(\text{wage})}{\partial\text{Educ}}=\beta_1$$

But instead, a quantile regression of wage on education tells us how the conditional-median changes as we look across different levels of education. That is, for a quantile-regression with $\tau=0.5$ we have the interpretation:
$$\beta_1 = \frac{\partial\text{Median}(\text{wage}|\text{Educ} ) }{\partial \text{Educ} } $$

More concretely, if education is measured in years, the model would tell us that the median wage among those with a college degree is $4\beta_1$ greater than the median wage for those with high school education.

Note that this *does not* tell us the marginal effect of the variable for the overall median person, that is, how they would be affected by an additional year of education.

This is different from standard regression, where the *law of iterated expectations* means that taking expectations over conditional expectations gives the same overall expected effect:
$$ \mathbb{E}\left[ \mathbb{E}[Y|X]\right]=\mathbb{E}[Y]$$
And so we can interpret a slope $\beta$ as the unconditional effect for the average person too.
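
A tiny numerical example, using made-up wages for two education groups, illustrates the contrast: averaging the group means (weighted by group size) reproduces the overall mean exactly, while doing the same with the group medians does not reproduce the overall median.

```python
import numpy as np

# Made-up wages for two education groups (the second group is heavily skewed)
wage_hs = np.array([20.0, 22.0, 25.0, 30.0, 33.0])
wage_college = np.array([28.0, 35.0, 60.0, 90.0, 150.0, 200.0, 400.0])
all_wages = np.concatenate([wage_hs, wage_college])

# Group shares play the role of the distribution of X
weights = np.array([wage_hs.size, wage_college.size]) / all_wages.size

mean_of_means = weights @ np.array([wage_hs.mean(), wage_college.mean()])
avg_of_medians = weights @ np.array([np.median(wage_hs), np.median(wage_college)])

print(f"overall mean   = {all_wages.mean():.2f} | weighted group means   = {mean_of_means:.2f}")
print(f"overall median = {np.median(all_wages):.2f} | weighted group medians = {avg_of_medians:.2f}")
```

This is why a quantile regression coefficient describes how the conditional quantile shifts across groups, rather than what happens to the person at that quantile of the overall distribution.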

## Summary: R to Python Reference Table

| Task | R | Python |
|------|---|--------|
| Non-linear least squares | `nls(y ~ alpha/(1+gamma*exp(-lambda*t)), start=list(...))` | `curve_fit(func, x, y, p0=[...])` |
| Manual optimization | `optim(fn=obj, par=c(...))` | `minimize(fun=obj, x0=[...], method='Nelder-Mead')` |
| Get coefficients | `coef(model)` | `model.params` or `popt` from `curve_fit` |
| Confidence intervals | `confint(model)` | `model.conf_int()` or manual from `pcov` |
| Variance-covariance | `vcov(model)` | `model.cov_params()` or `pcov` from `curve_fit` |
| Quantile regression | `rq(y ~ x, tau=0.5)` (quantreg) | `QuantReg(y, X).fit(q=0.5)` (statsmodels) |
| QR summary | `summary(model, se="iid")` | `model.summary()` |
| Predict | `predict(model, newdata=df)` | `model.predict(X_new)` or `func(x_new, *popt)` |
| Create dummies | `factor(x)` in formula | `pd.get_dummies(df, drop_first=True)` |