# Homework 1

## Task 1:

1.a: Complete the function najdi_koeficiente so that it finds the optimal coefficients for linear regression. Use the lecture slides, the matrix inverse function numpy.linalg.inv(array), and matrix multiplication (numpy.matmul(array1, array2) or @). You can check the function with the code below.

What coefficient values do you expect, and what values did you get?

In [1]:
import numpy as np

In [2]:
def najdi_koeficiente(x_train: np.array, y_train: np.array):
    coefs = np.linalg.inv(x_train.T @ x_train) @ x_train.T @ y_train
    return coefs

In [3]:
x = np.random.random((1000, 4))
y = x[:, 0] + 5*x[:, 1]
coefs = najdi_koeficiente(x, y)

assert coefs.shape == (4,)
print(coefs)

[ 1.00000000e+00  5.00000000e+00  2.10942375e-15 -2.55351296e-15]
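
Since y is built as x[:, 0] + 5*x[:, 1] with no noise, the expected coefficients are 1, 5, 0, 0, which matches the output above up to floating-point error. Explicitly inverting x.T @ x works here, but it can be numerically fragile when columns are nearly collinear. A minimal sketch of the same least-squares fit using `numpy.linalg.lstsq` follows; the function name `najdi_koeficiente_lstsq` and the `*_demo` variables are hypothetical and not part of the assignment.

```python
import numpy as np


def najdi_koeficiente_lstsq(x_train: np.ndarray, y_train: np.ndarray) -> np.ndarray:
    # Solve min ||x_train @ coefs - y_train||^2 without forming an explicit inverse.
    coefs, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    return coefs


rng = np.random.default_rng(0)
x_demo = rng.random((1000, 4))
y_demo = x_demo[:, 0] + 5 * x_demo[:, 1]

coefs_demo = najdi_koeficiente_lstsq(x_demo, y_demo)
print(coefs_demo)  # expected to be close to 1, 5, 0, 0
```

On well-conditioned data such as this, both approaches should agree to within rounding error.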


1.b: Complete the functions *najdi_koeficiente2* and *napovej* so that you can use them to build a model with RMSE < 1e-10 on test data from the function x_1 + 5x_2 + 12. You may use the function numpy.concatenate(list of columns, axis=1).

In [4]:
from sklearn.metrics import mean_squared_error

In [5]:
def najdi_koeficiente2(x_train, y_train):
    x = np.concatenate([np.ones((x_train.shape[0], 1)), x_train], axis=1)
    coefs = np.linalg.inv(x.T @ x) @ x.T @ y_train
    return coefs

In [6]:
def napovej(coefs, x_test):
    return x_test @ coefs[1:] + coefs[0]

In [7]:
x = np.random.random((1000, 4))
y = x[:, 0] + 5*x[:, 1] + 12
coefs = najdi_koeficiente2(x, y)

print(f"Coefficients: {', '.join([str(np.round(c, 5)) for c in coefs])}")

x_test = np.random.random((1000, 4))
y_test = x_test[:, 0] + 5*x_test[:, 1] + 12
y_pred = napovej(coefs, x_test)
error = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE: {error}")
assert error < 1e-10

Coefficients: 12.0, 1.0, 5.0, -0.0, 0.0
RMSE: 2.4060161591314086e-14
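
The column of ones added in najdi_koeficiente2 is what lets the model learn the constant 12 as its first coefficient. As a cross-check, here is a small sketch that fits the same kind of data with scikit-learn's `LinearRegression`, whose default `fit_intercept=True` plays the same role. The `*_demo` names and the random seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_demo = rng.random((1000, 4))
y_demo = x_demo[:, 0] + 5 * x_demo[:, 1] + 12

# fit_intercept=True (the default) estimates the constant term separately,
# which corresponds to the added column of ones in the manual solution.
reference = LinearRegression().fit(x_demo, y_demo)
print(reference.intercept_)  # expected to be close to 12
print(reference.coef_)       # expected to be close to 1, 5, 0, 0
```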


1.c: Load the data from the file "" and split it into data (x) and target values (y). The target values are in the last column.

In [8]:
data = np.load("../Podatki/dn1_1.npy")
X = data[:, :-1]
y = data[:, -1]

1.d: Estimate the accuracy of the predictive model from task 1.b as well as you can. Assume that the model will be trained on all the data from task 1.c and that new data will come from the same domain (and distribution) as the training data.

In [9]:
from sklearn.model_selection import KFold

In [10]:
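# n_splits equal to the number of samples gives leave-one-out cross-validation;
# each test fold holds one sample, so its RMSE equals that sample's absolute error.
kfold = KFold(n_splits=X.shape[0])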
errors = []

for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    x_train = X[train_idx, :]
    y_train = y[train_idx]
    x_test = X[test_idx, :]
    y_test = y[test_idx]
    coefs = najdi_koeficiente2(x_train, y_train) 
    y_pred = napovej(coefs, x_test)
    error = np.sqrt(mean_squared_error(y_test, y_pred))
    errors.append(error)

print(f"Mean RMSE: {np.mean(errors)}")

Mean RMSE: 0.19247086658284746
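
With one sample per test fold, the per-fold RMSE is just the absolute error of that sample, so the mean printed above behaves like a mean absolute error. One possible alternative is to collect all leave-one-out predictions first and compute a single pooled RMSE. The sketch below assumes synthetic stand-in data (`x_demo`, `y_demo`) and uses scikit-learn's `LinearRegression` in place of najdi_koeficiente2; it is not the notebook's data or result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
x_demo = rng.random((200, 3))  # synthetic stand-in for the real data
y_demo = x_demo @ np.array([1.0, 5.0, -2.0]) + 12 + rng.normal(0, 0.2, 200)

# One out-of-sample prediction per row, each from a model trained on all other rows.
y_loo = cross_val_predict(LinearRegression(), x_demo, y_demo, cv=LeaveOneOut())

mae = np.mean(np.abs(y_demo - y_loo))            # what averaging per-fold RMSE yields
rmse = np.sqrt(np.mean((y_demo - y_loo) ** 2))   # RMSE pooled over all held-out samples
print(f"MAE: {mae:.4f}, pooled RMSE: {rmse:.4f}")
```

The pooled RMSE is never smaller than the MAE, so the two numbers should not be compared as if they were the same metric.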


## Task 2:

2.a: Read the data with a discrete target variable from the file "dn1_2.npz" (see task 2 from lab 3 for help). Compute a few simple statistics for each column and preprocess the data if needed.

In [11]:
data = np.load("../Podatki/dn1_2.npz")
X = data["x"]
y = data["y"]

In [12]:
for i in range(X.shape[1]):
    print(f"x{i+1}: min {np.round(np.min(X[:, i]), 4)}, max {np.round(np.max(X[:, i]), 4)}, mean {np.round(np.mean(X[:, i]), 4)}")

x1: min -2.9932, max 2.9962, mean -0.0166
x2: min -0.9992, max 8.9973, mean 3.9371
x3: min 4.0011, max 6.9997, mean 5.4998
x4: min 0.0002, max 1.9992, mean 0.9804
x5: min -0.9993, max 3.9982, mean 1.4863
x6: min 0.2002, max 0.9996, mean 0.5971
x7: min 0.5, max 0.9998, mean 0.7453
x8: min 0.0003, max 0.9992, mean 0.5085


2.b: Evaluate the accuracy and stability of logistic regression. Make sure the experiments are reproducible and the model's accuracy is estimated as well as possible. If you use cross-validation, the number of folds should not exceed 10.

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [14]:
np.random.seed(18)
kfold = KFold(n_splits=10, shuffle=True)
errors = []

for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    x_train = X[train_idx, :]
    y_train = y[train_idx]
    x_test = X[test_idx, :]
    y_test = y[test_idx]
    model = LogisticRegression().fit(x_train, y_train)
    y_pred = model.predict(x_test)
    error = accuracy_score(y_test, y_pred)
    errors.append(error)
    print(f"Fold {i} accuracy: {error}")

print(f"Mean accuracy: {np.mean(errors)}")

Fold 0 accuracy: 0.945
Fold 1 accuracy: 0.96
Fold 2 accuracy: 0.975
Fold 3 accuracy: 0.95
Fold 4 accuracy: 0.95
Fold 5 accuracy: 0.96
Fold 6 accuracy: 0.98
Fold 7 accuracy: 0.975
Fold 8 accuracy: 0.955
Fold 9 accuracy: 0.96
Mean accuracy: 0.961
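
The task also asks about stability, which the mean alone does not show. A minimal sketch of one way to report it is the standard deviation of the fold accuracies, with the seed passed directly to the splitter for reproducibility. The data here are synthetic (`make_classification`), and the choice of `StratifiedKFold` is an assumption, not what the cell above uses.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data; the real notebook loads dn1_2.npz instead.
x_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# random_state on the splitter keeps the folds identical between runs.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=18)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), x_demo, y_demo, cv=cv, scoring="accuracy"
)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

For the fold accuracies printed above, the same summary can be obtained with `np.std(errors)`.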


2.c: A domain expert has hinted that the target value is correlated with the square of the first variable (X[:, 0]) and with the product of the first three (X[:, 0], X[:, 1], X[:, 2]). Can you use this hint to improve the model? If so, by how much does the model improve?

Note: You can avoid the convergence warning by increasing the number of iterations of the logistic regression (e.g. to 1000).

In [15]:
X = np.concatenate([X, (X[:, 0]**2)[:, None], (X[:, 0]*X[:, 1]*X[:, 2])[:, None]], axis=1)

In [16]:
np.random.seed(18)
kfold = KFold(n_splits=10, shuffle=True)
errors = []

for i, (train_idx, test_idx) in enumerate(kfold.split(X)):
    x_train = X[train_idx, :]
    y_train = y[train_idx]
    x_test = X[test_idx, :]
    y_test = y[test_idx]
    model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    y_pred = model.predict(x_test)
    error = accuracy_score(y_test, y_pred)
    errors.append(error)
    print(f"Fold {i} accuracy: {error}")

print(f"Mean accuracy: {np.mean(errors)}")

Fold 0 accuracy: 0.985
Fold 1 accuracy: 1.0
Fold 2 accuracy: 0.995
Fold 3 accuracy: 0.99
Fold 4 accuracy: 0.985
Fold 5 accuracy: 1.0
Fold 6 accuracy: 1.0
Fold 7 accuracy: 1.0
Fold 8 accuracy: 1.0
Fold 9 accuracy: 1.0
Mean accuracy: 0.9955


In [17]:
from sklearn.preprocessing import normalize

In [18]:
# By default, normalize() rescales each row (sample) to unit L2 norm; it does not scale columns.
normalized_x = normalize(X)

np.random.seed(18)
kfold = KFold(n_splits=10, shuffle=True)
errors = []

for i, (train_idx, test_idx) in enumerate(kfold.split(normalized_x)):
    x_train = normalized_x[train_idx, :]
    y_train = y[train_idx]
    x_test = normalized_x[test_idx, :]
    y_test = y[test_idx]
    model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    y_pred = model.predict(x_test)
    error = accuracy_score(y_test, y_pred)
    errors.append(error)
    print(f"Fold {i} accuracy: {error}")

print(f"Mean accuracy: {np.mean(errors)}")

Fold 0 accuracy: 0.955
Fold 1 accuracy: 0.96
Fold 2 accuracy: 0.95
Fold 3 accuracy: 0.965
Fold 4 accuracy: 0.95
Fold 5 accuracy: 0.96
Fold 6 accuracy: 0.98
Fold 7 accuracy: 0.97
Fold 8 accuracy: 0.96
Fold 9 accuracy: 0.975
Mean accuracy: 0.9624999999999998
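
The drop in accuracy after `normalize` is plausible because that function works per sample, not per feature. The sketch below contrasts it with `StandardScaler`, which works per column; the synthetic `x_demo` array is an assumption used only to make the difference visible.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

rng = np.random.default_rng(0)
# Three synthetic features on very different scales.
x_demo = rng.random((100, 3)) * np.array([1.0, 10.0, 100.0])

row_normalized = normalize(x_demo)                         # each row gets unit L2 norm
col_standardized = StandardScaler().fit_transform(x_demo)  # each column: mean 0, std 1

row_norms = np.linalg.norm(row_normalized, axis=1)
print(row_norms[:3])
print(col_standardized.mean(axis=0), col_standardized.std(axis=0))
```

Inside cross-validation, a scaler would normally be fitted on the training fold only, for example by wrapping it with the model in a scikit-learn `Pipeline`.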


2.d: Can you use logistic regression to find variables that are not important for classification? If so, which variables did you select, why, and what is the accuracy without them?

In [19]:
model = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Intercept: {model.intercept_}")
for i in range(X.shape[1]):
    print(f"Coefficient of feature {i}: {model.coef_[0][i]}")


Intercept: [-26.85666836]
Coefficient of feature 0: 0.19655533428944624
Coefficient of feature 1: 0.2486838590895063
Coefficient of feature 2: -1.15475246798134
Coefficient of feature 3: 0.10730059787665548
Coefficient of feature 4: 3.8830382732492916
Coefficient of feature 5: 0.3432592869754726
Coefficient of feature 6: 0.15627957514272783
Coefficient of feature 7: 0.5809316615842955
Coefficient of feature 8: 0.5015772948141607
Coefficient of feature 9: 0.7695602562480682


In [20]:
x_smaller = X[:, [2, 4, 7, 8, 9]]
np.random.seed(18)
kfold = KFold(n_splits=10, shuffle=True)
errors = []

for i, (train_idx, test_idx) in enumerate(kfold.split(x_smaller)):
    x_train = x_smaller[train_idx, :]
    y_train = y[train_idx]
    x_test = x_smaller[test_idx, :]
    y_test = y[test_idx]
    model = LogisticRegression(max_iter=1000).fit(x_train, y_train)
    y_pred = model.predict(x_test)
    error = accuracy_score(y_test, y_pred)
    errors.append(error)
    print(f"Fold {i} accuracy: {error}")

print(f"Mean accuracy: {np.mean(errors)}")

Fold 0 accuracy: 0.985
Fold 1 accuracy: 1.0
Fold 2 accuracy: 0.995
Fold 3 accuracy: 0.99
Fold 4 accuracy: 0.99
Fold 5 accuracy: 1.0
Fold 6 accuracy: 1.0
Fold 7 accuracy: 1.0
Fold 8 accuracy: 1.0
Fold 9 accuracy: 1.0
Mean accuracy: 0.9960000000000001


## Task 3: K-nearest neighbors

3.a: Read the dataset with a numeric target variable from the file `dn1_3.npz`.

In [2]:
import numpy as np
data3 = np.load("../Podatki/dn1_3.npz")
x = data3['x']
y = data3['y']

3.b: Build a k-nearest neighbors predictive model with the lowest possible RMSE. In addition to the estimate of the model's accuracy, report the stability of that estimate. Describe your model and how and why you made each choice. Also describe the experiments you tried that did not improve the result.

Note: The variables have the following meaning:
- $x_1$: Sequential number of the diamond in the database
- $x_2$: Number of carats
- $x_3$: Depth percentage of the diamond ($\frac{2\cdot z}{x+y}$)
- $x_4$: Ratio of the width of the top to the widest point
- $x_5$: Length of the diamond
- $x_6$: Width of the diamond
- $x_7$: Depth of the diamond
- $x_8$: Cut quality ("Ideal": 4, "Premium": 3, "Very Good": 2, "Good": 1, "Fair": 0)
- $x_{9}$: Color ("D": 0, "E": 1, "F": 2, "G": 3, "H": 4, "I": 5, "J": 6)
- $x_{10}$: Clarity ("I1": 0, "SI2": 1, "SI1": 2, "VS2": 3, "VS1": 4, "VVS2": 5, "VVS1": 6, "IF": 7)
- $x_{11}$: Distance of the find site from the equator
- $y$: Price of the diamond

In [3]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
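
Because k-nearest neighbors relies on distances, features on large scales can dominate the result, so scaling and the choice of k are natural things to test. The sketch below shows one possible experiment loop that reports both the mean RMSE and its spread across folds. It runs on synthetic stand-in data, and the candidate values of k and all `*_demo` names are assumptions, not results for `dn1_3.npz`.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: four features on very different scales.
x_demo = rng.random((300, 4)) * np.array([1.0, 10.0, 100.0, 1000.0])
y_demo = x_demo[:, 0] + 0.5 * x_demo[:, 1] + rng.normal(0, 0.1, 300)

cv = KFold(n_splits=10, shuffle=True, random_state=18)
results = {}
for k in (1, 3, 5, 9, 15):
    # The scaler sits inside the pipeline, so it is refitted on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    scores = -cross_val_score(
        model, x_demo, y_demo, cv=cv, scoring="neg_root_mean_squared_error"
    )
    results[k] = (scores.mean(), scores.std())
    print(f"k={k}: RMSE {scores.mean():.3f} +/- {scores.std():.3f}")

best_k = min(results, key=lambda k: results[k][0])
print(f"Best k on the synthetic data: {best_k}")
```

On the real data, it may also be worth testing whether dropping columns that look unrelated to the price, such as the sequential number $x_1$, changes the RMSE.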