
ENH: heteroscedasticity, variance function with quantile regression - nonlinearity #3302

Open
josef-pkt opened this issue Dec 1, 2016 · 1 comment


(Another random thought: a quick Google search shows there is some literature on this, but I haven't read any of it.)

In the illustrative quantile regression example, I used heteroscedasticity to obtain quadratic quantile functions. In general, outside of i.i.d. (or non-normal?) samples, the quantile function might be a nonlinear function of the parameters. We can handle this, for example, with splines; however, splines can easily run into quantile-crossing problems (example?). If one source of the nonlinearity of the quantile function is systematic, exog-dependent heteroscedasticity, then pre-weighting as in HETWLS might remove most of the nonlinearity.
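A minimal sketch of the crossing problem mentioned above (my own illustration, not from the issue; the seed and spline df are arbitrary, and whether crossing actually occurs depends on them): fit two quantiles independently with a flexible spline basis and count grid points where the lower curve ends up above the upper one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(12345)
nobs = 500
x = np.random.uniform(-4, 4, nobs)
# heteroscedastic data, same DGP as the script further down
y = x + 0.25 * x**2 + 0.1 * np.exp(2 + np.abs(x)) * np.random.randn(nobs)
df = pd.DataFrame({'temp': x, 'dens': y})

# fit two quantiles independently with a flexible spline basis
mod = smf.quantreg('dens ~ bs(temp, df=8, degree=3, include_intercept=True) - 1', df)
res_lo = mod.fit(q=0.25)
res_hi = mod.fit(q=0.75)

# predict on a grid inside the observed range (patsy's bs does not extrapolate)
x_grid = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 200)})
pred_lo = res_lo.predict(x_grid)
pred_hi = res_hi.predict(x_grid)

# any grid point where the 0.25 curve lies above the 0.75 curve is a crossing
n_cross = int((pred_lo.values > pred_hi.values).sum())
print('grid points with quantile crossing:', n_cross)
```

Nothing in the independently solved LP problems ties the two fitted curves together, so with a rich enough basis they are free to cross in sparse or high-variance regions.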

Question: Can we add a HetQuantileRegression that does the variance scaling before computing the quantile regression and then moves back to the original space? (Pre-whiten and recolor, i.e. add variance-function scaling to the objective function, viewed as an M-estimator with scaling, analogous to RLM.)

Aside: RLM doesn't allow for prior variance functions or heteroscedasticity either.

@josef-pkt (Member, Author) commented:

Parking a script related to this, ex_quantile_regression_polynomial_reduced.py; it requires #3183 (which I don't have checked out right now).

AFAIR, I got quite large quantile crossings in the center for some parameters: all quantiles are bunched together in that area, but the overall shape of the regression curve is dominated by the range with large heteroscedasticity.

# -*- coding: utf-8 -*-
"""
copy from mailing list 2016-02-03
"using quantile regression for 2nd order polynomial fit"
Created on Wed Feb  3 15:55:14 2016

added M-quantiles, RobustNorm class moved to statsmodels
Created on Wed Aug 31 12:35:57 2016

Author: Josef Perktold
License: BSD-3
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.formula.api as smf

from statsmodels.robust import norms
from statsmodels.robust.robust_linear_model import RLM

MQuantileNorm = norms.MQuantileNorm


#np.random.seed(654123)
np.random.seed(654579)

# generate an interesting dataset with heteroscedasticity
nobs = 1000
x = np.random.uniform(-4, 4, nobs)
y = x + 0.25 * x**2 + 0.1 * np.exp(2 + np.abs(x)) * np.random.randn(nobs)

df = pd.DataFrame({'temp': x, 'dens': y})

x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), nobs)})

poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)', data=df).fit()
plt.plot(x1.temp, poly_2.predict(x1), 'g-',
         label='2nd order poly fit, $R^2$=%.2f' % poly_2.rsquared,
         alpha=0.9)
plt.plot(x, y, 'o', alpha=0.5)
plt.legend(loc="upper center")


##################
# with quantile regression

# we already have the data frame
#d = {'temp': x, 'dens': y}
#df = pd.DataFrame(data=d)

# Least Absolute Deviation
# The LAD model is a special case of quantile regression where q=0.5

# Notes: possible problem with smoothing
# quantile regression has few neighboring points in endog where the variance
# is large. Choosing a larger df or smaller bandwidth results in overfitting
# in the high-variance regions (e.g. bs with df=8 or df=10).
# Local smoothing would need to depend on local density, I guess.

# quadratic model; immediately overridden by the spline model on the next line
mod = smf.quantreg('dens ~ temp + I(temp ** 2.0)', df)
mod = smf.quantreg('dens ~ bs(temp, df=6, degree=3, include_intercept=True) - 1', df)
#mod = smf.quantreg('dens ~ cr(temp, df=7) - 1', df)
res = mod.fit(q=.5)
print(res.summary())

# Quantile regression for several quantiles

quantiles = [.25, .50, .75]  # alternatively: [.05, .25, .50, .75, .95]

# get all result instances in a list
res_all = [mod.fit(q=q) for q in quantiles]

res_ols = smf.ols('dens ~ temp + I(temp ** 2.0)', df).fit()

#####  MQuantile Regression
y = res_ols.model.endog
xx = res_ols.model.exog

t_eps = 100  # alternatives tried: 1.345, 1e-6
mq_norm = MQuantileNorm(0.25, norms.HuberT(t=t_eps))
mod_rlm = RLM(y, xx, M=mq_norm)
#mod_rlm = RLM(y, x, M=norms.HuberT())
res_rlm = mod_rlm.fit()

mq_norm2 = MQuantileNorm(0.75, norms.HuberT(t=t_eps))
mod_rlm2 = RLM(y, xx, M=mq_norm2)
#mod_rlm = RLM(y, x, M=norms.HuberT())
res_rlm2 = mod_rlm2.fit()

mq_norm1 = MQuantileNorm(0.2, norms.HuberT(t=t_eps))
mod_rlmf = smf.rlm('dens ~ bs(temp, df=6, degree=3, include_intercept=True) - 1', df, M=mq_norm1)
res_rlmf = mod_rlmf.fit()

########

plt.figure()

# create x grid for prediction
x_p = np.linspace(df.temp.min(), df.temp.max(), 50)
df_p = pd.DataFrame({'temp': x_p})
from scipy import stats
yp25 = x_p + 0.25 * x_p**2 + 0.1 * np.exp(2 + np.abs(x_p)) * stats.norm.ppf(0.25)
yp75 = x_p + 0.25 * x_p**2 + 0.1 * np.exp(2 + np.abs(x_p)) * stats.norm.ppf(0.75)
yp50 = x_p + 0.25 * x_p**2

for qm, res in zip(quantiles, res_all):
    # get prediction for the model and plot
    # here we use a dict which works the same way as the df in ols
    #plt.plot(x_p, res.predict({'temp': x_p}), linestyle='dotted', lw=2,
    plt.plot(x_p, res.predict(df_p), linestyle='dotted', lw=3,
             color='blue', label='q=%.2F' % qm)

x_predict = np.column_stack((np.ones(len(x_p)), x_p, x_p**2))
y_rlm_predicted = res_rlm.predict(x_predict)
y_rlm2_predicted = res_rlm2.predict(x_predict)
y_rlmf_predicted = res_rlmf.predict(df_p)
y_ols_predicted = res_ols.predict(df_p)
#plt.plot(x_p, y_ols_predicted, color='red', label='OLS')
plt.plot(x_p, yp25, color='green', label='DGP')
plt.plot(x_p, yp75, color='green')
plt.plot(x_p, yp50, color='green')
plt.plot(x_p, y_rlm_predicted, color='red', label='RLM')
plt.plot(x_p, y_rlm2_predicted, color='red')
plt.plot(x_p, y_rlmf_predicted, '--', color='red', lw=3)
plt.scatter(df.temp, df.dens, alpha=.2)
plt.xlim((-4, 4))
plt.ylim((-30, 40))
plt.legend(loc="upper center")
plt.xlabel('temp')
plt.ylabel('dens')
plt.title('Quantile Regression and Heteroscedasticity')
plt.show()

y25 = x + 0.25 * x**2 + 0.1 * np.exp(2 + np.abs(x)) * stats.norm.ppf(0.25)
print(res_all[0].fittedvalues[:10])

yp25 = x_p + 0.25 * x_p**2 + 0.1 * np.exp(2 + np.abs(x_p)) * stats.norm.ppf(0.25)
yq25 = res_all[0].predict(df_p).values

# proportions of negative residuals, should be close to the quantile q
print((res_rlm.resid < 0).mean())
print((res_rlm2.resid < 0).mean())
print((res_all[0].resid < 0).mean())
print(((y - y25) < 0).mean())

# check fraction of exactly zero residuals
print((res_rlm.resid == 0).mean())
print((res_all[0].resid == 0).mean())

# check knot location
# for 'dens ~ bs(temp, df=6, degree=3, include_intercept=True) - 1'
'''
>>> np.argmax(mod.exog[np.argsort(x)], 0)
array([  0, 150, 373, 627, 834, 999], dtype=int64)
'''
