
ENH: heteroscedasticity, variance function with quantile regression - nonlinearity #3302

Open
josef-pkt opened this issue Dec 1, 2016 · 1 comment


(Another random thought: a quick Google search shows there is some literature on this, but I haven't read any of it.)

In the illustrative quantile regression example, I used heteroscedasticity to obtain quadratic quantile functions. In general, outside of i.i.d. (or non-normal?) samples, the quantile function might be a nonlinear function of the parameters. We can handle this, for example, with splines; however, splines can easily run into quantile-crossing problems (example?). If one source of the nonlinearity of the quantile function is systematic, exog-dependent heteroscedasticity, then pre-weighting as in HETWLS might remove most of the nonlinearity.
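A minimal sketch of the crossing problem mentioned above (my own illustration, not from the issue; the seed and spline df are arbitrary, and whether crossing actually occurs depends on them): fit two quantiles independently with a flexible spline basis and count grid points where the lower curve ends up above the upper one.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(12345)
nobs = 500
x = np.random.uniform(-4, 4, nobs)
# heteroscedastic data, same DGP as the script further down
y = x + 0.25 * x**2 + 0.1 * np.exp(2 + np.abs(x)) * np.random.randn(nobs)
df = pd.DataFrame({'temp': x, 'dens': y})

# fit two quantiles independently with a flexible spline basis
mod = smf.quantreg('dens ~ bs(temp, df=8, degree=3, include_intercept=True) - 1', df)
res_lo = mod.fit(q=0.25)
res_hi = mod.fit(q=0.75)

# predict on a grid inside the observed range (patsy's bs does not extrapolate)
x_grid = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), 200)})
pred_lo = res_lo.predict(x_grid)
pred_hi = res_hi.predict(x_grid)

# any grid point where the 0.25 curve lies above the 0.75 curve is a crossing
n_cross = int((pred_lo.values > pred_hi.values).sum())
print('grid points with quantile crossing:', n_cross)
```

Nothing in the independently solved LP problems ties the two fitted curves together, so with a rich enough basis they are free to cross in sparse or high-variance regions.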

Question: Can we add a HetQuantileRegression that does the variance scaling before computing the quantile regression and then moves back to the original space? (Pre-whiten and recolor, i.e. add variance-function scaling to the objective function, viewed as an M-estimator with scaling, analogous to RLM.)

Aside: RLM doesn't allow for prior variance functions or heteroscedasticity either.

@josef-pkt (Member, Author) commented:

Parking a script related to this, ex_quantile_regression_polynomial_reduced.py; it requires #3183 (which I don't have checked out right now).

AFAIR, I got quite large quantile crossings in the center for some parameters: all quantiles are bunched together in that area, but the overall shape of the regression curve is dominated by the range with large heteroscedasticity.

# -*- coding: utf-8 -*-
"""
copy from mailing list 2016-02-03
"using quantile regression for 2nd order polynomial fit"
Created on Wed Feb  3 15:55:14 2016

added M-quantiles, RobustNorm class moved to statsmodels
Created on Wed Aug 31 12:35:57 2016

Author: Josef Perktold
License: BSD-3
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import statsmodels.formula.api as smf

from statsmodels.robust import norms
from statsmodels.robust.robust_linear_model import RLM

MQuantileNorm = norms.MQuantileNorm


#np.random.seed(654123)
np.random.seed(654579)

# generate an interesting dataset with heteroscedasticity
nobs = 1000
x = np.random.uniform(-4, 4, nobs)
y = x + 0.25 * x**2 + 0.1 * np.exp(2 + np.abs(x)) * np.random.randn(nobs)

df = pd.DataFrame({'temp': x, 'dens': y})

x1 = pd.DataFrame({'temp': np.linspace(df.temp.min(), df.temp.max(), nobs)})

poly_2 = smf.ols(formula='dens ~ 1 + temp + I(temp ** 2.0)', data=df).fit()
plt.plot(x1.temp, poly_2.predict(x1), 'g-',
         label='2nd order poly fit, $R^2$=%.2f' % poly_2.rsquared,
         alpha=0.9)
plt.plot(x, y, 'o', alpha=0.5)
plt.legend(loc="upper center")


##################
# with quantile regression

# we already have the data frame
#d = {'temp': x, 'dens': y}
#df = pd.DataFrame(data=d)

# Least Absolute Deviation
# The LAD model is a special case of quantile regression where q=0.5

# Notes: possible problem with smoothing
# quantile regression has few neighboring points in endog where the variance
# is large. Choosing a larger df or smaller bandwidth results in overfitting
# in the high-variance regions (e.g. bs with df=8 or df=10).
# Local smoothing would need to depend on local density, I guess.

# quadratic model; immediately overridden by the spline model on the next line
mod = smf.quantreg('dens ~ temp + I(temp ** 2.0)', df)
mod = smf.quantreg('dens ~ bs(temp, df=6, degree=3, include_intercept=True) - 1', df)
#mod = smf.quantreg('dens ~ cr(temp, df=7) - 1', df)
res = mod.fit(q=.5)
print(res.summary())

# Quantile regression for several quantiles

quantiles = [.25, .50, .75]  # alternatively: [.05, .25, .50, .75, .95]

# get all result instances in a list
res_all = [mod.fit(q=q) for q in quantiles]

res_ols = smf.ols('dens ~ temp + I(temp ** 2.0)', df).fit()

#####  MQuantile Regression
y = res_ols.model.endog
xx = res_ols.model.exog

t_eps = 100  # alternatives tried: 1.345, 1e-6
mq_norm = MQuantileNorm(0.25, norms.HuberT(t=t_eps))
mod_rlm = RLM(y, xx, M=mq_norm)
#mod_rlm = RLM(y, x, M=norms.HuberT())
res_rlm = mod_rlm.fit()

mq_norm2 = MQuantileNorm(0.75, norms.HuberT(t=t_eps))
mod_rlm2 = RLM(y, xx, M=mq_norm2)
#mod_rlm = RLM(y, x, M=norms.HuberT())
res_rlm2 = mod_rlm2.fit()

mq_norm1 = MQuantileNorm(0.2, norms.HuberT(t=t_eps))
mod_rlmf = smf.rlm('dens ~ bs(temp, df=6, degree=3, include_intercept=True) - 1', df, M=mq_norm1)
res_rlmf = mod_rlmf.fit()

########

plt.figure()

# create x grid for prediction
x_p = np.linspace(df.temp.min(), df.temp.max(), 50)
df_p = pd.DataFrame({'temp': x_p})
from scipy import stats
yp25 = x_p + 0.25 * x_p**2 + 0.1 * np.exp(2 + np.abs(x_p)) * stats.norm.ppf(0.25)
yp75 = x_p + 0.25 * x_p**2 + 0.1 * np.exp(2 + np.abs(x_p)) * stats.norm.ppf(0.75)
yp50 = x_p + 0.25 * x_p**2

for qm, res in zip(quantiles, res_all):
    # get prediction for the model and plot
    # here we use a dict which works the same way as the df in ols
    #plt.plot(x_p, res.predict({'temp': x_p}), linestyle='dotted', lw=2,
    plt.plot(x_p, res.predict(df_p), linestyle='dotted', lw=3,
             color='blue', label='q=%.2F' % qm)

x_predict = np.column_stack((np.ones(len(x_p)), x_p, x_p**2))
y_rlm_predicted = res_rlm.predict(x_predict)
y_rlm2_predicted = res_rlm2.predict(x_predict)
y_rlmf_predicted = res_rlmf.predict(df_p)
y_ols_predicted = res_ols.predict(df_p)
#plt.plot(x_p, y_ols_predicted, color='red', label='OLS')
plt.plot(x_p, yp25, color='green', label='DGP')
plt.plot(x_p, yp75, color='green')
plt.plot(x_p, yp50, color='green')
plt.plot(x_p, y_rlm_predicted, color='red', label='RLM')
plt.plot(x_p, y_rlm2_predicted, color='red')
plt.plot(x_p, y_rlmf_predicted, '--', color='red', lw=3)
plt.scatter(df.temp, df.dens, alpha=.2)
plt.xlim((-4, 4))
plt.ylim((-30, 40))
plt.legend(loc="upper center")
plt.xlabel('temp')
plt.ylabel('dens')
plt.title('Quantile Regression and Heteroscedasticity')
plt.show()

y25 = x + 0.25 * x**2 + 0.1 * np.exp(2 + np.abs(x)) * stats.norm.ppf(0.25)
print(res_all[0].fittedvalues[:10])

yp25 = x_p + 0.25 * x_p**2 + 0.1 * np.exp(2 + np.abs(x_p)) * stats.norm.ppf(0.25)
yq25 = res_all[0].predict(df_p).values

# proportions of negative residuals, should be close to the quantile q
print((res_rlm.resid < 0).mean())
print((res_rlm2.resid < 0).mean())
print((res_all[0].resid < 0).mean())
print(((y - y25) < 0).mean())

# check fraction of exactly zero residuals
print((res_rlm.resid == 0).mean())
print((res_all[0].resid == 0).mean())

# check knot location
# for 'dens ~ bs(temp, df=6, degree=3, include_intercept=True) - 1'
'''
>>> np.argmax(mod.exog[np.argsort(x)], 0)
array([  0, 150, 373, 627, 834, 999], dtype=int64)
'''
