## Using Orange for regression

In [1]:
import Orange
import random
import numpy as np
np.random.seed(42)

Regression in Orange is very similar to classification. Both require labeled data. As in classification, regression is implemented with learners and regression models (regressors). Regression learners are objects that accept data and return regressors. A regressor is then given data instances and predicts the value of the continuous class for each:

In [2]:
data = Orange.data.Table("housing")
learner = Orange.regression.LinearRegressionLearner()
model = learner(data)

print("predicted, observed:")
for d in data[:3]:
    print("%.1f, %.1f" % (model(d)[0], d.get_class()))

predicted, observed:
30.0, 24.0
25.0, 21.6
30.6, 34.7
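
The learner/model split used above can be pictured with a toy example. This is only a sketch of the pattern, not Orange's implementation: `ToyMeanLearner` and `MeanModel` are hypothetical names, and the feature values are made up (only the three target values are taken from the output above).

```python
class MeanModel:
    """A fitted regressor: callable on an instance, returns a prediction."""

    def __init__(self, mean):
        self.mean = mean

    def __call__(self, features):
        # This toy model ignores the features and always predicts the mean.
        return self.mean


class ToyMeanLearner:
    """A learner: callable on training data, returns a fitted model."""

    def __call__(self, rows):
        targets = [target for _, target in rows]
        return MeanModel(sum(targets) / len(targets))


# (features, target) pairs; the feature values are invented for illustration.
rows = [((6.5, 5.0), 24.0), ((6.4, 9.1), 21.6), ((7.2, 4.0), 34.7)]

toy_learner = ToyMeanLearner()
toy_model = toy_learner(rows)           # learner(data) -> model
prediction = toy_model((6.0, 10.0))     # model(instance) -> prediction
print("%.1f" % prediction)
```

The same two-step shape (`learner(data)`, then `model(instance)`) is what every Orange learner in this notebook follows.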


Let us start with regression trees. The script below builds a tree from the housing price data and prints it in textual form:

In [3]:
tree_learner = Orange.regression.SimpleTreeLearner(max_depth=2)
tree = tree_learner(data)
print(tree.to_string())


RM (22.5: 506.0)
: <=6.941
   LSTAT (19.9: 430.0)
   : <=14.4 --> (23.3: 255.0)
   : >14.4 --> (15.0: 175.0)
: >6.941
   RM (37.2: 76.0)
   : <=7.437 --> (32.1: 46.0)
   : >7.437 --> (45.1: 30.0)
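
One way to read the printed tree is as nested threshold tests. The sketch below hard-codes the splits and leaf values from the output above; it assumes that the first number in each pair of parentheses is the mean target value at that node and the second is the number of training instances (the counts do add up: 430 + 76 = 506). The function name is hypothetical.

```python
def predict_from_printed_tree(rm, lstat):
    """Follow the depth-2 tree printed above and return the leaf value."""
    if rm <= 6.941:
        # Left branch: split on LSTAT
        return 23.3 if lstat <= 14.4 else 15.0
    # Right branch: split on RM again
    return 32.1 if rm <= 7.437 else 45.1


# Illustrative inputs (made-up attribute values)
print(predict_from_printed_tree(rm=6.5, lstat=5.0))
print(predict_from_printed_tree(rm=7.8, lstat=3.0))
```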


Next we initialize a few other regressors, train them on all but five randomly sampled instances of the housing price dataset, and compare their predictions on those five held-out instances:

In [4]:
random.seed(42)
test = Orange.data.Table(data.domain, random.sample(data, 5))
train = Orange.data.Table(data.domain, [d for d in data if d not in test])

lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()

learners = [lin, rf, ridge]
regressors = [learner(train) for learner in learners]

print("y   ", " ".join("%5s" % l.name for l in regressors))

for d in test:
    print(("{:<5}" + " {:5.1f}"*len(regressors)).format(
        d.get_class(),
        *(r(d)[0] for r in regressors)))

y    linear regression    rf ridge regression
22.2   19.3  20.8  19.5
31.6   33.2  31.9  33.2
21.7   20.9  19.0  21.0
10.2   16.9  12.0  16.8
14.0   13.6  13.8  13.5


It looks like housing prices are not that hard to predict.

##### Question 5-4-1
Use a scatter plot to show how the predicted value changes with the actual value. Comment on the plot.

##### Question 5-4-2
Show how the prediction error changes with the actual value. Comment on the plot.

### Cross-Validation

Evaluation and scoring methods are available in Orange.evaluation:

In [5]:
random.seed(42)
lin = Orange.regression.linear.LinearRegressionLearner()
rf = Orange.regression.random_forest.RandomForestRegressionLearner()
rf.name = "rf"
ridge = Orange.regression.RidgeRegressionLearner()
mean = Orange.regression.MeanLearner()

learners = [lin, rf, ridge, mean]

res = Orange.evaluation.CrossValidation(data, learners, k=5)
rmse = Orange.evaluation.RMSE(res)
r2 = Orange.evaluation.R2(res)

print("Learner  RMSE  R2")
for i in range(len(learners)):
    print("{:8s} {:.2f} {:5.2f}".format(learners[i].name, rmse[i], r2[i]))

Learner  RMSE  R2
linear regression 4.88  0.72
rf       4.06  0.80
ridge regression 4.91  0.71
mean     9.20 -0.00


The linear and ridge regression scores are nearly identical, and random forest does somewhat better. Each regression method has a set of parameters. We ran them with default values, and tuning the parameters would likely help. We also included MeanLearner in the list of regressors; it simply predicts the mean value of the training set and serves as a baseline.
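
To see what the two scores measure, here is a minimal pure-Python sketch of RMSE and R2. It is not Orange's implementation. The target values and random forest predictions are copied from the five-instance comparison earlier in this notebook, so the numbers are not comparable to the cross-validated scores above. The baseline here uses the mean of the same five values, whereas MeanLearner uses the mean of the training data.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: typical size of a prediction error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [22.2, 31.6, 21.7, 10.2, 14.0]
rf_predicted = [20.8, 31.9, 19.0, 12.0, 13.8]
# Predicting the mean everywhere gives R2 = 0 by construction.
baseline = [sum(actual) / len(actual)] * len(actual)

print("rf       %.2f %5.2f" % (rmse(actual, rf_predicted), r2(actual, rf_predicted)))
print("baseline %.2f %5.2f" % (rmse(actual, baseline), r2(actual, baseline)))
```

This is why the mean row in the table above has an R2 of about zero: a model only earns a positive R2 by beating the mean prediction.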

## Association rules

In [6]:
from orangecontrib.associate.fpgrowth import * 
from scipy.sparse import issparse

import random
random.seed(42)

Orange provides two algorithms for induction of association rules: a standard Apriori algorithm for sparse (basket) data analysis and a variant of Apriori for attribute-value data sets. Both algorithms also support mining of frequent itemsets.

Let's start with market basket data:

In [7]:
data = Orange.data.Table("podatki/foodmart.basket")

Let's explore the data.

In [8]:
print(len(data))
print(type(data.X))
data[:5]

62560
<class 'scipy.sparse.csr.csr_matrix'>


[[Pasta=3.000, Soup=2.000, STORE_ID_2=1.000],
 [Soup=1.000, STORE_ID_2=1.000, Fresh Vegetables=3.000, Milk=3.000, Plastic Utensils=2.000, ...],
 [STORE_ID_2=1.000, Cheese=2.000, Deodorizers=1.000, Hard Candy=2.000, Jam=2.000, ...],
 [STORE_ID_2=1.000, Fresh Vegetables=2.000],
 [STORE_ID_2=1.000, Cleaners=1.000, Cookies=2.000, Eggs=2.000, Preserves=1.000, ...]
]

We can’t use the table data directly; we first have to one-hot transform it.

The result is a database we can use to find frequent itemsets, and a mapping we will use later to revert the transformation.

In [9]:
X, mapping = OneHot.encode(data)
X, mapping

([[0, 1, 2],
  [1, 2, 3, 4, 5],
  [2, 6, 7, 8, 9],
  [2, 3],
  [2, 10, 11, 12, 13],
  [1, 2, 6, 14],
  [2, 15, 16, 17],
  [2, 11, 13, 15],
  [2, 3, 10, 18, 19, 20],
  [1, 2, 16, 21, 22, 23],
  [2, 24, 25, 26],
  [2, 3, 11, 12, 27, 28, 29],
  [2, 11, 30, 31, 32],
  [2, 3, 33, 34, 35],
  [1, 2, 4, 30, 36, 37, 38],
  [2, 14, 15, 26, 32, 39, 40],
  [2, 3, 31, 35, 41, 42, 43],
  [2, 3, 22, 30, 40, 44],
  [2, 3, 41, 42, 45, 46],
  [1, 2, 33],
  [2, 6, 11, 30, 47, 48],
  [2, 20, 27, 30],
  [2, 6, 45, 49],
  [2, 47, 50],
  [2, 27, 40, 41, 51],
  [1, 2, 28, 42],
  [2, 23, 30, 33, 39],
  [2, 38, 41, 50, 52],
  [2, 27, 53],
  [2, 12, 40],
  [2, 3, 5, 9, 34, 36, 54],
  [2, 3, 55],
  [2, 3, 23, 38, 56],
  [2, 50, 55, 57],
  [2, 3, 30],
  [2, 48],
  [2, 4, 9, 11, 58],
  [2, 30, 38],
  [2, 13, 19, 29, 30, 34, 41],
  [2, 10, 14, 18, 36, 59, 60],
  [2, 3, 4, 33, 40, 61],
  [2, 3, 47, 49],
  [1, 2, 3, 62, 63],
  [0, 2, 3, 22, 32, 33, 56],
  [2, 19, 59, 64, 65],
  [2, 16, 22],
  [2, 5, 27, 30, 32, 35, 66
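
The shape of this encoding can be reproduced on a toy example. The sketch below is not how `OneHot.encode` is implemented: `encode_baskets` is a hypothetical helper that assigns ids in order of first appearance and ignores quantities. It only illustrates how each transaction becomes a list of integer item ids plus a mapping back to item names.

```python
def encode_baskets(baskets):
    """Turn lists of item names into sorted lists of integer ids."""
    item_ids = {}
    encoded = []
    for basket in baskets:
        row = []
        for item in basket:
            if item not in item_ids:
                item_ids[item] = len(item_ids)  # next unused id
            row.append(item_ids[item])
        encoded.append(sorted(row))
    # Reverse mapping, used to turn ids back into names
    id_to_item = {i: item for item, i in item_ids.items()}
    return encoded, id_to_item


# Item names taken from three of the baskets printed earlier
baskets = [
    ["Pasta", "Soup", "STORE_ID_2"],
    ["Soup", "STORE_ID_2", "Fresh Vegetables", "Milk", "Plastic Utensils"],
    ["STORE_ID_2", "Fresh Vegetables"],
]
encoded, id_to_item = encode_baskets(baskets)
print(encoded)
print([id_to_item[i] for i in encoded[2]])
```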

We can use ```decode``` to link each index with an item's name.

In [10]:
# Map each item index to its variable name; the value is not needed for basket data
names = {item: var.name
         for item, var, _ in OneHot.decode(mapping, data, mapping)}

We ask for itemsets with a very low minimum support (0.01% of transactions, which is about six baskets), since it is hard to find prevailing itemsets among more than 62,000 transactions.

In [11]:
itemsets = {}
for itemset, support in frequent_itemsets(X, 0.01/100):
    itemsets[itemset] = support
len(itemsets)

64760

Now we can generate all association rules that have at least 70% confidence (i.e. classification rules):

In [12]:
rules = []
for rule in association_rules(itemsets, 0.7):
    left, right, support, confidence = rule
    left_str =  ', '.join(names[i] for i in sorted(left))
    right_str = ', '.join(names[i] for i in sorted(right))
    rules.append(left_str + " -> " + right_str)

In [13]:
rules[:10]

['Cheese, Cookies, Fresh Fruit, Wine -> Fresh Vegetables',
 'Soup, Fresh Fruit, Dried Fruit, Paper Wipes -> Fresh Vegetables',
 'Soup, Cheese, Fresh Fruit, Nuts -> Fresh Vegetables',
 'Cheese, Preserves, Fresh Fruit, Nuts -> Fresh Vegetables',
 'Soup, Preserves, Fresh Fruit, Nuts -> Fresh Vegetables',
 'Soup, Fresh Vegetables, Preserves, Nuts -> Fresh Fruit',
 'Fresh Vegetables, Cheese, Juice, Pizza -> Soup',
 'Soup, Fresh Vegetables, Juice, Pizza -> Cheese',
 'Soup, Deli Meats, Chips, Wine -> Fresh Vegetables',
 'Soup, Fresh Vegetables, Deli Meats, Chips -> Wine']
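
To make the support and confidence thresholds concrete, here is a small pure-Python sketch on made-up transactions. The helper functions are hypothetical and only follow the usual definitions: the support of an itemset is the number of transactions containing it, and the confidence of a rule `left -> right` is support(left and right together) divided by support(left).

```python
def support(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(left, right, transactions):
    """Share of transactions containing `left` that also contain `right`."""
    return support(left | right, transactions) / support(left, transactions)


# Five invented baskets, for illustration only
transactions = [
    frozenset({"Soup", "Cheese", "Wine"}),
    frozenset({"Soup", "Cheese"}),
    frozenset({"Soup", "Wine"}),
    frozenset({"Cheese", "Wine"}),
    frozenset({"Soup", "Cheese", "Wine"}),
]

min_confidence = 0.7
candidates = [
    (frozenset({"Wine"}), frozenset({"Soup"})),
    (frozenset({"Soup", "Cheese"}), frozenset({"Wine"})),
]
for left, right in candidates:
    conf = confidence(left, right, transactions)
    verdict = "kept" if conf >= min_confidence else "dropped"
    print(", ".join(sorted(left)), "->", ", ".join(sorted(right)),
          "confidence %.2f" % conf, verdict)
```

With the 0.7 threshold, the first rule is kept and the second is dropped, even though both rules cover the same number of complete baskets.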

##### Question 5-4-3
Filter the rules. Find all rules that predict the purchase of cheese.