Robust regression is a form of regression analysis that, like other regression methods, models the relationship between one or more independent variables and a dependent variable. It aims to overcome limitations of traditional parametric and non-parametric methods, which can give misleading results when the assumptions of ordinary least squares do not hold. Robust regression is designed so that violations of those assumptions by the underlying data-generating process do not unduly influence the fit.
Robust regression is a regression method that is resistant to outliers. Given a data set of n statistical units, a linear regression model assumes that the relationship between the dependent variable y and the vector of regressors X is linear. This relationship is modeled through a disturbance term or error variable ε, an unobserved random variable that adds "noise" to the linear relationship between the dependent variable and the regressors. However, if the noise comes from abnormal measurement error or other violations of the standard assumptions, the validity of the conventional linear regression model suffers. The robust regression model improves on this by allowing the variance to depend on the independent variable X. The model is expressed in the following form:

$$y = Xw + \varepsilon$$
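Robust regression adopts the Huber loss function, which splits the residuals into segments and applies a different loss to each segment:

$$L_\delta\bigl(y_i, f(x_i)\bigr) = \begin{cases} \frac{1}{2}\bigl(y_i - f(x_i)\bigr)^2, & |y_i - f(x_i)| \le \delta \\ \delta\,|y_i - f(x_i)| - \frac{1}{2}\delta^2, & \text{otherwise} \end{cases}$$

where $(x_i, y_i)$ is a group of samples, $f(x_i)$ is the model's prediction, and $\delta$ is the point at which the loss switches segments. This method combines the squared loss and the absolute loss so that the fit is not dominated by particularly large outliers.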
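The Huber loss described above can be illustrated with a short sketch. This is a plain-Python illustration of the standard Huber definition, not Angel's implementation; the function names and the `delta` argument are chosen here for the example.

```python
def huber_loss(residual, delta):
    """Huber loss for one residual r = y - prediction."""
    a = abs(residual)
    if a <= delta:
        # Small residuals: squared loss.
        return 0.5 * residual * residual
    # Large residuals: absolute loss, shifted so the two pieces meet at |r| = delta.
    return delta * (a - 0.5 * delta)


def huber_grad(residual, delta):
    """Derivative of the Huber loss with respect to the residual."""
    if abs(residual) <= delta:
        return residual
    # The gradient is capped at +/- delta, which limits the pull of outliers.
    return delta if residual > 0 else -delta
```

Because the gradient is capped at ±delta, a single extreme sample cannot contribute more than a bounded amount to any update, which is what keeps outliers from dominating the fit.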
The robust regression algorithm can be abstracted as a 1×N PSModel, denoted by w, where $w = (w_1, w_2, \ldots, w_N)$, as shown in the following figure:
Angel MLLib provides a robust regression algorithm trained with mini-batch gradient descent.
- In each iteration, each worker pulls the up-to-date w from the PS, computes the model update △w using mini-batch gradient descent, and pushes △w back to the PS.
- In each iteration, the PS receives △w from all workers and adds their average to w, obtaining a new model.
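The worker and PS steps above can be sketched as a single-process simulation. This is only an illustration under stated assumptions: it does not use Angel's API, each "worker" is a list of samples, and the names `worker_delta`, `ps_apply`, `train`, `lr`, `delta` and `batch_size` are hypothetical.

```python
import random


def huber_grad(residual, delta):
    """Derivative of the Huber loss with respect to the residual."""
    if abs(residual) <= delta:
        return residual
    return delta if residual > 0 else -delta


def worker_delta(w, batch, lr, delta):
    """Worker step: compute the update (delta w) from one mini-batch."""
    grad = [0.0] * len(w)
    for x, y in batch:
        pred = sum(wi * xi for wi, xi in zip(w, x))
        g = huber_grad(pred - y, delta)  # d(loss) / d(prediction)
        for j, xj in enumerate(x):
            grad[j] += g * xj
    return [-lr * gj / len(batch) for gj in grad]


def ps_apply(w, deltas):
    """PS step: add the average of all workers' updates to w."""
    n = len(deltas)
    return [wi + sum(d[j] for d in deltas) / n for j, wi in enumerate(w)]


def train(shards, dim, epochs=100, lr=0.1, delta=1.0, batch_size=4, seed=0):
    """Run the pull / compute / push / average loop over simulated workers."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(epochs):
        deltas = []
        for shard in shards:  # one shard of data per simulated worker
            batch = rng.sample(shard, min(batch_size, len(shard)))
            deltas.append(worker_delta(w, batch, lr, delta))  # "pull" w, compute update
        w = ps_apply(w, deltas)  # "push" updates; PS averages them into w
    return w
```

For example, `train([shard_a, shard_b], dim=2)` simulates two workers, where each shard is a list of `(features, label)` pairs with `features` a list of length `dim`.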
Decaying learning rate
The learning rate decays along iterations as a function of:
- α, the decay rate
- T, the epoch
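As an illustration of how a decay rate α and an epoch T can combine, here is a minimal sketch of one common schedule. The inverse-square-root form used below is an assumption, since the formula itself is not shown in this text, and the function name is hypothetical.

```python
import math


def decayed_learning_rate(initial_rate, decay, epoch):
    """Assumed schedule: initial_rate / sqrt(1 + decay * epoch).

    initial_rate corresponds to ml.learn.rate and decay to ml.learn.decay;
    the functional form is an assumption made for this example.
    """
    return initial_rate / math.sqrt(1.0 + decay * epoch)
```

Under this assumed form, a decay rate of 0 keeps the learning rate constant, and larger values shrink it faster as the epoch count grows.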
- Data format is set in "", supporting "libsvm", "dense" and "dummy" types. For details, see Angel Data Format
- Model size is set in "ml.model.size". For some sparse models, there are feature indices at which every sample is zero (invalid indices), so ml.model.size = ml.feature.index.range - number of invalid indices
- The feature vector's dimension is set in "ml.feature.index.range"
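The relationship between the model size and the feature index range can be checked with a small sketch. It assumes each sample is a dictionary mapping a feature index in `[0, feature_index_range)` to a value; both helper names are hypothetical and not part of Angel.

```python
def count_invalid_indices(samples, feature_index_range):
    """Count feature indices at which every sample is zero (or absent)."""
    used = set()
    for row in samples:
        for idx, value in row.items():
            if value != 0:
                used.add(idx)
    return feature_index_range - len(used)


def model_size(samples, feature_index_range):
    """ml.model.size = ml.feature.index.range - number of invalid indices."""
    return feature_index_range - count_invalid_indices(samples, feature_index_range)
```

With a feature index range of 6 and samples that only ever use indices 0 and 3, four indices are invalid and the model size is 2.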
Algorithm Parameters
- ml.epoch.num: number of iterations
- ml.num.update.per.epoch: number of updates in each epoch
- proportion of data used for validation, no validation when set to 0
- ml.learn.rate: initial learning rate
- ml.learn.decay: decay rate of the learning rate
- coefficient of the L1 penalty
- ml.reg.l2: coefficient of the L2 penalty
- threshold δ at which the Huber loss switches between its squared and absolute segments
I/O Parameters
- ml.feature.num: number of features
- Angel Data Format, supporting "dense" and "libsvm"
- input path for training
- save path for the trained model
- input path for prediction
- angel.predict.out.path: output path for predict
- angel.log.path: save path for the log
Resource Parameters
- angel.workergroup.number: number of workers
- angel.worker.memory.mb: memory requested per worker, in MB
- angel.worker.task.number: number of tasks on each worker, default is 1
- number of PS
- PS's memory requested in G
Training Job
./bin/angel-submit \
    --action.type=train \
    \$input_path \$model_path \
    --angel.log.path=$log_path \
    --ml.epoch.num=10 \
    --ml.feature.index.range=150361 \
    --ml.learn.rate=0.1 \
    --ml.learn.decay=1 \
    --ml.reg.l2=0.001 \
    --ml.model.type=T_FLOAT_DENSE \
    --ml.num.update.per.epoch=10 \
    --ml.worker.thread.num=4 \
    --angel.workergroup.number=2 \
    --angel.worker.memory.mb=5000 \
    --angel.worker.task.number=1 \
    --angel.output.path.deleteonexist=true
IncTraining Job
./bin/angel-submit \
    --action.type=inctrain \
    \$input_path \
    --angel.load.model.path=$model_path \
    \$model_path \
    --angel.log.path=$log_path \
    --ml.epoch.num=10 \
    --ml.feature.index.range=$featureNum+1 \
    --ml.learn.rate=0.1 \
    --ml.learn.decay=1 \
    --ml.reg.l2=0.001 \
    --ml.model.type=T_FLOAT_DENSE \
    --ml.num.update.per.epoch=10 \
    --ml.worker.thread.num=4 \
    --angel.workergroup.number=2 \
    --angel.worker.memory.mb=5000 \
    --angel.worker.task.number=1 \
    --angel.output.path.deleteonexist=true
Prediction Job
./bin/angel-submit \
    --action.type=predict \
    \$input_path \
    --angel.load.model.path=$model_path \
    --angel.predict.out.path=$predict_path \
    --angel.log.path=$log_path \
    --ml.feature.index.range=150361 \
    --ml.model.type=T_FLOAT_DENSE \
    --ml.worker.thread.num=4 \
    --angel.workergroup.number=2 \
    --angel.worker.memory.mb=5000 \
    --angel.worker.task.number=1 \
    --angel.output.path.deleteonexist=true
- Data: E2006-tfidf, 1.5×10^5 features, 1.6×10^4 samples
- Resources:
  - Angel: executor: 2, 5G memory, 1 task; ps: 2, 5G memory
- Time of 100 epochs:
  - Angel: 22min