## Using Orange for regression

In [1]:
import Orange
import random
import numpy as np
np.random.seed(42)

Regression in Orange is very similar to classification: both require labeled data, and both are implemented with learners and models. A regression learner is an object that accepts data and returns a regression model (a regressor). The regressor is then given data instances and predicts the value of the continuous class for each:

In [2]:
data = Orange.data.Table("housing")
learner = Orange.regression.LinearRegressionLearner()
model = learner(data)

print("predicted, observed:")
for d in data[:3]:
    print("%.1f, %.1f" % (model(d), d.get_class()))

predicted, observed:
30.0, 24.0
25.0, 21.6
30.6, 34.7
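The learner/model split used above can be illustrated without Orange. The following is a minimal sketch of the calling pattern only; `ToyMeanLearner` and `ToyMeanModel` are hypothetical names invented for this illustration and are not part of the Orange API.

```python
import numpy as np


class ToyMeanLearner:
    """Hypothetical learner: called with training data, returns a model."""

    def __call__(self, X, y):
        # "Learning" here is just computing the mean of the target.
        return ToyMeanModel(float(np.mean(y)))


class ToyMeanModel:
    """Hypothetical regressor: called with instances, returns predictions."""

    def __init__(self, mean):
        self.mean = mean

    def __call__(self, X):
        # Predict the stored mean for every row of X.
        return np.full(len(X), self.mean)


# Made-up toy data, used only to exercise the two classes.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([10.0, 20.0, 30.0, 40.0])

toy_learner = ToyMeanLearner()
toy_model = toy_learner(X, y)       # learner(data) -> model
toy_predictions = toy_model(X[:2])  # model(instances) -> predicted values
print(toy_predictions)
```

The point of the design is that the learner holds the settings of the method, while the model holds whatever was fitted to one particular dataset, so the same learner can be reused on different training sets.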



Let us start with regression trees. Below is an example script that builds a tree from the housing price data and prints it out in textual form:

In [3]:
tree_learner = Orange.regression.SimpleTreeLearner(max_depth=2)
tree = tree_learner(data)
print(tree.to_string())


RM (22.5: 506.0)
: <=6.941
   LSTAT (19.9: 430.0)
   : <=14.40 --> (23.3: 255.0)
   : >14.40 --> (15.0: 175.0)
: >6.941
   RM (37.2: 76.0)
   : <=7.437 --> (32.1: 46.0)
   : >7.437 --> (45.1: 30.0)
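In the printout, each node shows the mean class value and the number of instances that reach it, and each branch shows a threshold on one feature. How such a threshold can be chosen is sketched below for a single feature. The sketch assumes a common criterion, minimising the summed squared error around the mean on each side of the split; the source does not say which criterion `SimpleTreeLearner` uses, and the helper `best_split` and the toy data are made up for illustration.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, sse) of the binary split x <= threshold that
    minimises the summed squared error around each side's mean."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate equal values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = (((left - left.mean()) ** 2).sum()
               + ((right - right.mean()) ** 2).sum())
        if sse < best_sse:
            # Place the threshold halfway between neighbouring values.
            best_threshold = (x_sorted[i] + x_sorted[i - 1]) / 2
            best_sse = sse
    return best_threshold, best_sse


# Made-up toy data: the target jumps once the feature exceeds about 6.5.
rooms = np.array([5.0, 5.5, 6.0, 6.2, 7.0, 7.5, 8.0])
price = np.array([18.0, 20.0, 21.0, 22.0, 35.0, 40.0, 45.0])

threshold, sse = best_split(rooms, price)
print(threshold, sse)
```

A tree learner would repeat this search over all features, pick the best one, and then recurse into each branch until a stopping rule such as `max_depth` applies.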


The following code initializes a few other regressors and prints their predictions for five randomly sampled instances that are held out from the housing price dataset:

In [4]:
random.seed(42)
test = Orange.data.Table.from_list(data.domain, random.sample(data, 5))
train = Orange.data.Table.from_list(data.domain, [d for d in data if d not in test])

lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()

learners = [lin, rf, ridge]
regressors = [learner(train) for learner in learners]

print("y   ", " ".join("%5s" % l.name for l in regressors))

for d in test:
    print(("{:<5}" + " {:5.1f}"*len(regressors)).format(
        d.get_class(),
        *(r(d) for r in regressors)))

y    linear regression    rf ridge regression
22.2   19.3  20.7  19.5
31.6   33.2  31.8  33.2
21.7   20.9  19.1  21.0
10.2   16.9  12.1  16.8
14.0   13.6  14.4  13.5


It looks like housing prices are not that hard to predict.
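To put a number on that impression, the errors can be computed directly from the printed table. The sketch below copies the five observed values and the three columns of predictions from the output above and reports the largest absolute error and the root mean squared error for each regressor; the variable names are chosen for this illustration.

```python
import numpy as np

# Values copied from the table printed above.
observed = np.array([22.2, 31.6, 21.7, 10.2, 14.0])
predicted = {
    "linear regression": np.array([19.3, 33.2, 20.9, 16.9, 13.6]),
    "rf": np.array([20.7, 31.8, 19.1, 12.1, 14.4]),
    "ridge regression": np.array([19.5, 33.2, 21.0, 16.8, 13.5]),
}

sample_rmse = {}
for name, values in predicted.items():
    errors = values - observed
    # Root mean squared error over the five held-out instances.
    sample_rmse[name] = float(np.sqrt(np.mean(errors ** 2)))
    print("{:18s} max |error| = {:.1f}, RMSE = {:.2f}".format(
        name, np.abs(errors).max(), sample_rmse[name]))
```

Five instances are far too few for a reliable estimate, so these figures should be read only as a rough check; the cross-validation further down gives a sounder comparison.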

##### Question 5-4-1
Use a scatter plot to show how the predicted value changes with the actual value. Comment on the plot.

##### Question 5-4-2
Show how the prediction error changes with the actual value. Comment on the plot.

### Cross-Validation

Evaluation and scoring methods are available in Orange.evaluation:

In [5]:
random.seed(42)
lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()
mean = Orange.regression.MeanLearner()

learners = [lin, rf, ridge, mean]

cv = Orange.evaluation.CrossValidation(k=5)
res = cv(data, learners)
rmse = Orange.evaluation.RMSE(res)
r2 = Orange.evaluation.R2(res)

print("Learner  RMSE  R2")
for i in range(len(learners)):
    print("{:8s} {:.2f} {:5.2f}".format(learners[i].name, rmse[i], r2[i]))

Learner  RMSE  R2
linear regression 4.88  0.72
rf       4.04  0.81
ridge regression 4.91  0.71
mean     9.20 -0.00


Linear and ridge regression score almost the same, while the random forest has a somewhat lower error. Each regression method has a set of parameters; we ran all of them with default settings, and tuning the parameters would likely help. We also included MeanLearner in the list of regressors; this regressor simply predicts the mean value of the training set and is used as a baseline.
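The baseline's R2 of about zero follows from how the score is defined. The sketch below runs a hand-written 5-fold cross-validation of a mean predictor with NumPy and computes RMSE and R2 from the pooled predictions. It uses a synthetic target rather than the housing data, and it assumes R2 is computed as 1 - SS_res / SS_tot over all folds together; Orange's own implementation may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a continuous target; not the housing data.
y = rng.normal(loc=22.5, scale=9.0, size=500)

k = 5
indices = rng.permutation(len(y))
folds = np.array_split(indices, k)

predictions = np.empty_like(y)
for test_idx in folds:
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    # The baseline "model" is the mean of the training part only.
    predictions[test_idx] = y[train_mask].mean()

residuals = y - predictions
baseline_rmse = np.sqrt(np.mean(residuals ** 2))
# R2 = 1 - SS_res / SS_tot, with SS_tot taken around the overall mean.
baseline_r2 = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print("RMSE = {:.2f}, R2 = {:.3f}".format(baseline_rmse, baseline_r2))
```

Because each training-fold mean differs slightly from the overall mean, the baseline's R2 under this definition comes out at or just below zero, and its RMSE is close to the standard deviation of the target. Any useful regressor should therefore show an R2 clearly above zero and an RMSE clearly below the baseline's.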