
## HW01: Regression, Cross-Validation, and Regularization

Solve the coding tasks below and write an explanation for each.

## Code Task 1: Implement `calc_root_mean_squared_error`

See the test cases below and complete `calc_root_mean_squared_error`.

In [1]:
'''
Test Cases
--------
>>> y_N = 0.0
>>> yhat_N = 4.123
>>> calc_root_mean_squared_error(y_N, yhat_N)
4.123

>>> y_N = np.asarray([-2, 0, 2], dtype=np.float64)
>>> yhat_N = np.asarray([-4, 0, 2], dtype=np.float64)
>>> rmse = calc_root_mean_squared_error(y_N, yhat_N)
>>> np.round(rmse, 6)
1.154701
'''

import numpy as np


def calc_root_mean_squared_error(y_N, yhat_N):
    r''' Compute root mean squared error given true and predicted values

    Args
    ----
    y_N : 1D array, shape (N,)
        Each entry represents 'ground truth' numeric response for an example
    yhat_N : 1D array, shape (N,)
        Each entry represents predicted numeric response for an example

    Returns
    -------
    rmse : scalar float
        Root mean squared error performance metric
        .. math:
            rmse(y,\hat{y}) = \sqrt{\frac{1}{N} \sum_{n=1}^N (y_n - \hat{y}_n)^2}
    '''
    y_N = np.atleast_1d(y_N) # convert input to an array with at least 1 dimension, so scalar inputs are allowed
    yhat_N = np.atleast_1d(yhat_N)
    assert y_N.ndim == 1 # check that y_N is a 1D array
    assert y_N.shape == yhat_N.shape # check that true and predicted values have the same shape (same length)
    mse = np.mean((y_N - yhat_N) ** 2) # compute the mean squared error (MSE)
    rmse = np.sqrt(mse) # take the square root to get the RMSE
    return rmse

In [2]:
# Test case
y_N = 0.0
yhat_N = 4.123
calc_root_mean_squared_error(y_N, yhat_N)

4.123

In [3]:
# Compute RMSE on test-case arrays
y_N = np.asarray([-1, 0, 1], dtype=np.float64)
yhat_N = np.asarray([-2, 0, 2], dtype=np.float64)
rmse = calc_root_mean_squared_error(y_N, yhat_N)
np.round(rmse, 6)

0.816497

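## Ans 1: Code Explanation

- This function computes the root mean squared error (RMSE) between the true values (`y_N`) and the predicted values (`yhat_N`).

- It takes the square root of the mean squared error (MSE) to obtain the RMSE.

- RMSE (Root Mean Squared Error)
    - A metric that measures how close the predictions are to the true values
    - A smaller value means the predictions are closer to the true values

### Function definition and parameters
- `y_N`
    - 1D array holding the true values
- `yhat_N`
    - 1D array holding the predicted values
- `Return`
    - The RMSE, a scalar (float) giving the root mean squared error between the true and predicted values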
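
As a quick sanity check, the second docstring test case can be worked through step by step. This is only a minimal sketch that repeats the same arithmetic as the function above, using the intermediate name `sq_err_N` (an illustrative name, not part of the starter code).

```python
import numpy as np

# Second docstring test case from Task 1
y_N = np.asarray([-2.0, 0.0, 2.0])
yhat_N = np.asarray([-4.0, 0.0, 2.0])

sq_err_N = (y_N - yhat_N) ** 2   # squared errors: [4, 0, 0]
mse = sq_err_N.mean()            # mean squared error: 4 / 3
rmse = np.sqrt(mse)              # root mean squared error: sqrt(4 / 3)

print(np.round(rmse, 6))         # 1.154701
```

Only the first example contributes to the error, so the RMSE is $\sqrt{4/3}$, which is smaller than the single absolute error of 2 because the mean is taken over all three examples.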



## Code Task 2: Implement `fit` and `predict`

This code defines a LeastSquaresLinearRegressor class with the two key methods of the usual sklearn regression API: `fit` and `predict`. You will edit this file to complete the `fit` and the `predict` methods, which will demonstrate your understanding of what goes on "inside" sklearn-like regressor objects.



### Task 2(a):  The fit method should take in a labeled dataset $\{x_n, y_n\}_{n=1}^N$ and instantiate two instance attributes

* `w_F` : 1D numpy array, shape (n_features = F,). Represents the 'weights'. Contains float64 entries of the weight coefficients.
* `b` : scalar float Represents the 'bias' or 'intercept'.

Hint: Within a Python class, you can set an attribute like `self.b = 1.0`.

Nothing should be returned. You're updating the internal state of the object.

These attributes should be set using the formulas discussed in class (Lecture 03) for solving the "least squares" optimization problem (finding w and b values that minimize squared error on the training set).



### Task 2(b):  The `predict` method should take in an array of feature vectors $\{x_n\}^N_{n=1}$ and produce (return) the predicted responses $\{\hat{y}(x_n)\}^N_{n=1}$.

Recall that for linear regression, we've defined the prediction function as:

$$\hat{y}(x_n)=b+w^Tx_n=b+\sum_{f=1}^F{w_f x_{n,f}}$$
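
The two forms of the prediction formula can be compared directly. The sketch below uses small made-up values for `b`, `w_F`, and `x_NF` (chosen only for illustration) and checks that the explicit per-feature sum matches the vectorized dot product.

```python
import numpy as np

# Hypothetical values, chosen only to make the arithmetic easy to follow
b = 0.5
w_F = np.asarray([1.0, -2.0, 3.0])
x_NF = np.asarray([[1.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0]])

# Explicit sum over features f, computed one example n at a time
yhat_loop_N = np.asarray([
    b + sum(w_F[f] * x_NF[n, f] for f in range(w_F.size))
    for n in range(x_NF.shape[0])])

# Vectorized form: b + x_NF w_F for all examples at once
yhat_vec_N = b + np.dot(x_NF, w_F)

print(yhat_loop_N)  # [1.5 2.5]
print(yhat_vec_N)   # [1.5 2.5]
```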

In [4]:
'''
Test Case
---------
>>> prng = np.random.RandomState(0)
>>> N = 100

>>> true_w_F = np.asarray([1.1, -2.2, 3.3])
>>> true_b = 0.0
>>> x_NF = prng.randn(N, 3)
>>> y_N = true_b + np.dot(x_NF, true_w_F) + 0.03 * prng.randn(N)

>>> linear_regr = LeastSquaresLinearRegressor()
>>> linear_regr.fit(x_NF, y_N)

>>> yhat_N = linear_regr.predict(x_NF)
>>> np.set_printoptions(precision=3, formatter={'float':lambda x: '% .3f' % x})
>>> print(linear_regr.w_F)
[ 1.099 -2.202  3.301]
>>> print(np.asarray([linear_regr.b]))
[-0.005]
'''

import numpy as np
# No other imports allowed!

class LeastSquaresLinearRegressor(object):
    ''' A linear regression model with sklearn-like API

    Fit by solving the "least squares" optimization problem.

    Attributes
    ----------
    * self.w_F : 1D numpy array, size n_features (= F)
        vector of weights, one value for each feature
    * self.b : float
        scalar real-valued bias or "intercept"
    '''

    def __init__(self):
        ''' Constructor of an sklearn-like regressor

        Should do nothing. Attributes are only set after calling 'fit'.
        '''
        # Leave this alone
        pass

    def fit(self, x_NF, y_N):
        r''' Compute and store weights that solve least-squares problem.

        Args
        ----
        x_NF : 2D numpy array, shape (n_examples, n_features) = (N, F)
            Input measurements ("features") for all examples in train set.
            Each row is a feature vector for one example.
        y_N : 1D numpy array, shape (n_examples,) = (N,)
            Response measurements for all examples in train set.
            Each entry is a scalar response for one example.

        Returns
        -------
        Nothing.

        Post-Condition
        --------------
        Internal attributes updated:
        * self.w_F (vector of weights for each feature)
        * self.b (scalar real bias, if desired)

        Notes
        -----
        The least-squares optimization problem is:

        .. math:
            \min_{w \in \mathbb{R}^F, b \in \mathbb{R}}
                \sum_{n=1}^N (y_n - b - \sum_f x_{nf} w_f)^2
        '''
        N, F = x_NF.shape

        # Prepend a column of ones for the intercept
        X_aug = np.hstack((np.ones((N, 1)), x_NF))

        # Solve the normal equations of linear regression with np.linalg.solve
        # to obtain the intercept and weights
        theta = np.linalg.solve(np.dot(X_aug.T, X_aug), np.dot(X_aug.T, y_N))

        # Intercept b
        self.b = theta[0]
        # Weights
        self.w_F = theta[1:]


    def predict(self, x_MF):
        ''' Make predictions given input features for M examples

        Args
        ----
        x_MF : 2D numpy array, shape (n_examples, n_features) = (M, F)
            Input measurements ("features") for all examples of interest.
            Each row is a feature vector for one example.

        Returns
        -------
        yhat_M : 1D array, size M
            Each value is the predicted scalar for one example
        '''
        # Compute predictions for input x_MF using the weights w_F and intercept b
        yhat_M = self.b + np.dot(x_MF, self.w_F)
        return yhat_M

In [5]:
# Test code
prng = np.random.RandomState(0)
N = 100

true_w_F = np.asarray([1.1, -2.2, 3.3])
true_b = 0.0
x_NF = prng.randn(N, 3)
y_N = true_b + np.dot(x_NF, true_w_F) + 0.03 * prng.randn(N)

linear_regr = LeastSquaresLinearRegressor()
linear_regr.fit(x_NF, y_N)

yhat_N = linear_regr.predict(x_NF)
np.set_printoptions(precision=3, formatter={'float':lambda x: '% .3f' % x})
print(linear_regr.w_F)

print(np.asarray([linear_regr.b]))

[ 1.099 -2.202  3.301]
[-0.005]


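## Ans 2: Code Explanation

- This code defines the `LeastSquaresLinearRegressor` class, which implements a linear regression model.
- The class learns from data and makes predictions through its `fit` and `predict` methods.
- `fit` learns the model's weights and intercept; `predict` returns the predicted values for the given inputs.

### `__init__`
- Constructor of the regressor
- Does nothing at initialization
- Attributes are set only after `fit` is called

### `fit`
- `fit` uses the given training data to compute the model's weights (`w_F`) and intercept (`b`).
- It prepends a column of ones to the input matrix `x_NF` to form `X_aug`, so the intercept is estimated together with the weights.
- It solves the normal equations of linear regression with `np.linalg.solve` to obtain the weights and intercept.
- The first entry of `theta` is the intercept (`b`); the remaining entries are the weights (`w_F`).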
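
The normal equations solved in `fit` are $(X^T X)\,\theta = X^T y$, where $X$ is the augmented matrix `X_aug`. As a hedged cross-check (not part of the assignment, which allows only the `numpy` import), the sketch below rebuilds the toy data from the test case and compares the normal-equation solution with NumPy's general least-squares solver `np.linalg.lstsq`. The names `theta_normal` and `theta_lstsq` are illustrative.

```python
import numpy as np

# Same toy data as the Task 2 test case
prng = np.random.RandomState(0)
N, F = 100, 3
x_NF = prng.randn(N, F)
y_N = np.dot(x_NF, np.asarray([1.1, -2.2, 3.3])) + 0.03 * prng.randn(N)

# Prepend a column of ones so the first coefficient acts as the intercept
X_aug = np.hstack((np.ones((N, 1)), x_NF))

# Normal equations: (X^T X) theta = X^T y
theta_normal = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_N)

# Cross-check with NumPy's least-squares solver
theta_lstsq, _, _, _ = np.linalg.lstsq(X_aug, y_N, rcond=None)

print(np.allclose(theta_normal, theta_lstsq))  # True
```

Both approaches minimize the same squared-error objective, so they should agree up to floating-point error whenever $X^T X$ is well conditioned. When features are nearly collinear, `np.linalg.solve` on $X^T X$ can become numerically unstable, and a solver such as `lstsq` is usually the safer choice.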

### `predict`
- Computes predictions for the input data `x_MF` using the weights (`w_F`) and intercept (`b`)
- Prediction = intercept `b` plus the dot product of `x_MF` with the weights `w_F`

In [6]:
# Test code
def test_on_toy_data(N=100):
    '''
    Simple example use case
    With toy dataset with N=100 examples
    created via a known linear regression model plus small noise
    '''
    prng = np.random.RandomState(0)

    true_w_F = np.asarray([1.1, -2.2, 3.3])
    true_b = 0.0
    x_NF = prng.randn(N, 3)
    y_N = true_b + np.dot(x_NF, true_w_F) + 0.03 * prng.randn(N)

    linear_regr = LeastSquaresLinearRegressor()
    linear_regr.fit(x_NF, y_N)

    yhat_N = linear_regr.predict(x_NF)

    np.set_printoptions(precision=3, formatter={'float':lambda x: '% .3f' % x})

    print("True weights")
    print(true_w_F)
    print("Estimated weights")
    print(linear_regr.w_F)

    print("True intercept")
    print(np.asarray([true_b]))
    print("Estimated intercept")
    print(np.asarray([linear_regr.b]))

if __name__ == '__main__':
    test_on_toy_data()

True weights
[ 1.100 -2.200  3.300]
Estimated weights
[ 1.099 -2.202  3.301]
True intercept
[ 0.000]
Estimated intercept
[-0.005]


## Code Task 3: Randomly divide data into splits and estimate training and heldout error.

### Task 3(a) : Implement the `make_train_and_test_row_ids_for_n_fold_cv` function


This function should consume the number of examples, the desired number of folds, and a pseudo-random number generator. Then, it will produce, for each of the desired number of folds, arrays of integers indicating which rows of the dataset belong to the training set, and which belong to the test set.


See the starter code for detailed specification.

Note: For each fold, you do NOT need to produce exactly the same random splits as our code. For instance, when creating 3-fold splits for an array all_examples=[1, 2, 3, 4, 5], the examples in each fold could be:

* fold0_examples=[1, 2]
* fold1_examples=[3, 5]
* fold2_examples=[4]

**OR**

* fold0_examples=[3, 4]
* fold1_examples=[1, 5]
* fold2_examples=[2]



In [7]:
def make_train_and_test_row_ids_for_n_fold_cv(
        n_examples=0, n_folds=3, random_state=0):
    ''' Divide row ids into train and test sets for n-fold cross validation.

    Will *shuffle* the row ids via a pseudorandom number generator before
    dividing into folds.

    Args
    ----
    n_examples : int
        Total number of examples to allocate into train/test sets
    n_folds : int
        Number of folds requested
    random_state : int or numpy RandomState object
        Pseudorandom number generator (or seed) for reproducibility

    Returns
    -------
    train_ids_per_fold : list of 1D np.arrays
        One entry per fold
        Each entry is a 1-dim numpy array of unique integers in the interval [0, N)
    test_ids_per_fold : list of 1D np.arrays
        One entry per fold
        Each entry is a 1-dim numpy array of unique integers in the interval [0, N)

    Guarantees for Return Values
    ----------------------------
    Across all folds, guarantee that no two folds put same object in test set.
    For each fold f, we need to guarantee:
    * The *union* of train_ids_per_fold[f] and test_ids_per_fold[f]
    is equal to [0, 1, ... N-1]
    * The *intersection* of the two is the empty set
    * The total size of train and test ids for any fold is equal to N

    Examples
    --------
    >>> N = 11
    >>> n_folds = 3
    >>> tr_ids_per_fold, te_ids_per_fold = (
    ...     make_train_and_test_row_ids_for_n_fold_cv(N, n_folds))
    >>> len(tr_ids_per_fold)
    3

    # Count of items in training sets
    >>> np.sort([len(tr) for tr in tr_ids_per_fold])
    array([7, 7, 8])

    # Count of items in the test sets
    >>> np.sort([len(te) for te in te_ids_per_fold])
    array([3, 4, 4])

    # Test ids should uniquely cover the interval [0, N)
    >>> np.sort(np.hstack([te_ids_per_fold[f] for f in range(n_folds)]))
    array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

    # Train ids should cover the interval [0, N) TWICE
    >>> np.sort(np.hstack([tr_ids_per_fold[f] for f in range(n_folds)]))
    array([ 0,  0,  1,  1,  2,  2,  3,  3,  4,  4,  5,  5,  6,  6,  7,  7,  8,
            8,  9,  9, 10, 10])
    '''
    if hasattr(random_state, 'rand'):
        # Handle case where provided random_state is a random generator
        # (e.g. has methods rand() and randn())
        random_state = random_state # just remind us we use the passed-in value
    else:
        # Handle case where we pass "seed" for a PRNG as an integer
        random_state = np.random.RandomState(int(random_state))

    # TODO obtain a shuffled order of the n_examples

    # Create an integer array from 0 to n_examples-1
    shuffled_indices = np.arange(n_examples)
    # Shuffle the array in place into a random order
    random_state.shuffle(shuffled_indices)

    # fold_sizes: array holding the size of each fold;
    # the first (n_examples % n_folds) folds get one extra example
    fold_sizes = np.full(n_folds, n_examples // n_folds, dtype=int)
    fold_sizes[:n_examples % n_folds] += 1

    train_ids_per_fold = list()
    test_ids_per_fold = list()

    # TODO establish the row ids that belong to each fold's
    # train subset and test subset

    current = 0 # tracks the current position in shuffled_indices
    for fold_size in fold_sizes:
        start, stop = current, current + fold_size # boundaries of this fold's test slice
        test_ids = shuffled_indices[start:stop]
        train_ids = np.concatenate((shuffled_indices[:start], shuffled_indices[stop:]))
        test_ids_per_fold.append(test_ids)
        train_ids_per_fold.append(train_ids)
        current = stop

    return train_ids_per_fold, test_ids_per_fold

### Task 3(b) : Implement the `train_models_and_calc_scores_for_n_fold_cv` function

This function will use the procedure from 3(a) to determine the different "folds", and then train a separate model at each fold and return that model's training error and heldout error.

In [8]:
def train_models_and_calc_scores_for_n_fold_cv(
        estimator, x_NF, y_N, n_folds=3, random_state=0):
    ''' Perform n-fold cross validation for a specific sklearn estimator object

    Args
    ----
    estimator : any regressor object with sklearn-like API
        Supports 'fit' and 'predict' methods.
    x_NF : 2D numpy array, shape (n_examples, n_features) = (N, F)
        Input measurements ("features") for all examples of interest.
        Each row is a feature vector for one example.
    y_N : 1D numpy array, shape (n_examples,)
        Output measurements ("responses") for all examples of interest.
        Each row is a scalar response for one example.
    n_folds : int
        Number of folds to divide provided dataset into.
    random_state : int or numpy.RandomState instance
        Allows reproducible random splits.

    Returns
    -------
    train_error_per_fold : 1D numpy array, size n_folds
        One entry per fold
        Entry f gives the error computed for train set for fold f
    test_error_per_fold : 1D numpy array, size n_folds
        One entry per fold
        Entry f gives the error computed for test set for fold f

    Examples
    --------
    # Create simple dataset of N examples where y given x
    # is perfectly explained by a linear regression model
    >>> N = 101
    >>> n_folds = 7
    >>> x_N3 = np.random.RandomState(0).rand(N, 3)
    >>> y_N = np.dot(x_N3, np.asarray([1., -2.0, 3.0])) - 1.3337
    >>> y_N.shape
    (101,)

    >>> import sklearn.linear_model
    >>> my_regr = sklearn.linear_model.LinearRegression()
    >>> tr_K, te_K = train_models_and_calc_scores_for_n_fold_cv(
    ...                 my_regr, x_N3, y_N, n_folds=n_folds, random_state=0)

    # Training error should be indistinguishable from zero
    >>> np.array2string(tr_K, precision=8, suppress_small=True)
    '[0. 0. 0. 0. 0. 0. 0.]'

    # Testing error should be indistinguishable from zero
    >>> np.array2string(te_K, precision=8, suppress_small=True)
    '[0. 0. 0. 0. 0. 0. 0.]'
    '''
    train_error_per_fold = np.zeros(n_folds, dtype=np.float32)
    test_error_per_fold = np.zeros(n_folds, dtype=np.float32)

    # TODO define the folds here by calling your function
    # e.g. ... = make_train_and_test_row_ids_for_n_fold_cv(...)

    # TODO loop over folds and compute the train and test error
    # for the provided estimator

    # Define the folds here by calling the function
    train_ids_per_fold, test_ids_per_fold = make_train_and_test_row_ids_for_n_fold_cv(
        n_examples=x_NF.shape[0], n_folds=n_folds, random_state=random_state)

    # Train and evaluate a model for each fold
    for fold in range(n_folds):
        train_ids = train_ids_per_fold[fold] # row ids of this fold's training set
        test_ids = test_ids_per_fold[fold] # row ids of this fold's test set

        x_train = x_NF[train_ids]
        y_train = y_N[train_ids]
        x_test = x_NF[test_ids]
        y_test = y_N[test_ids]

        estimator.fit(x_train, y_train) # fit the model on the training data

        y_train_pred = estimator.predict(x_train) # predict on x_train with the fitted model
        y_test_pred = estimator.predict(x_test) # predict on x_test with the fitted model

        # Compare predictions with true values to compute the MSE
        train_error_per_fold[fold] = np.mean((y_train - y_train_pred) ** 2)
        test_error_per_fold[fold] = np.mean((y_test - y_test_pred) ** 2)

    return train_error_per_fold, test_error_per_fold

In [9]:
# Test code
N = 101
n_folds = 7
x_N3 = np.random.RandomState(0).rand(N, 3)
y_N = np.dot(x_N3, np.asarray([1., -2.0, 3.0])) - 1.3337
print(y_N.shape)

import sklearn.linear_model
my_regr = sklearn.linear_model.LinearRegression()
tr_K, te_K = train_models_and_calc_scores_for_n_fold_cv(my_regr, x_N3, y_N, n_folds=n_folds, random_state=0)

# Training error should be indistinguishable from zero
print(np.array2string(tr_K, precision=8, suppress_small=True))

# Testing error should be indistinguishable from zero
print(np.array2string(te_K, precision=8, suppress_small=True))

(101,)
[ 0.000  0.000  0.000  0.000  0.000  0.000  0.000]
[ 0.000  0.000  0.000  0.000  0.000  0.000  0.000]


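## Ans 3: Code Explanation

This code performs n-fold cross-validation with the given dataset and model. The test code then computes and prints the training error and test error for each fold.

### `make_train_and_test_row_ids_for_n_fold_cv`
- Generates the training and test row indices needed to split the dataset into n folds for cross-validation.
- Each fold has a different set of test indices, and within every fold the training and test indices together cover the entire dataset.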
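
The guarantees listed in the function's docstring can be checked on a small case. The sketch below is an alternative construction, not the implementation above: it assumes `np.array_split` on a shuffled permutation, which (like the `fold_sizes` logic) gives the first `n_examples % n_folds` folds one extra example.

```python
import numpy as np

n_examples, n_folds = 11, 3
perm = np.random.RandomState(0).permutation(n_examples)

# Split the shuffled ids into n_folds nearly equal test sets
test_ids_per_fold = np.array_split(perm, n_folds)
# Each fold's training set is every id that is not in its test set
train_ids_per_fold = [np.setdiff1d(perm, te) for te in test_ids_per_fold]

for tr, te in zip(train_ids_per_fold, test_ids_per_fold):
    assert np.intersect1d(tr, te).size == 0      # train and test are disjoint
    assert tr.size + te.size == n_examples       # sizes add up to N
    assert np.array_equal(np.union1d(tr, te), np.arange(n_examples))

# Across folds, every row id appears in exactly one test set
all_test_ids = np.sort(np.hstack(test_ids_per_fold))
print(all_test_ids)
```

For N = 11 and 3 folds this yields test sets of sizes 4, 4, and 3, matching the counts in the docstring example.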

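### `train_models_and_calc_scores_for_n_fold_cv`
- Performs cross-validation by splitting the given dataset into several folds, then training and evaluating a model on each fold
- Uses the `make_train_and_test_row_ids_for_n_fold_cv` function implemented above to define the folds, then runs training and evaluation for each fold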

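### Analysis of test results
- The test code prints the training error and test error (computed as mean squared error) for each fold
- Because the dataset is perfectly explained by a linear regression model, the printed errors are close to 0.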
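
To see why the errors vanish only for noise-free data, the sketch below fits the same linear data once as given and once with added Gaussian noise. It is a simplified illustration: it fits on the whole dataset instead of per fold, and the noise level `noise_std = 0.1` and the helper `fit_and_mse` are assumptions made for this example.

```python
import numpy as np

prng = np.random.RandomState(0)
N = 101
x_N3 = prng.rand(N, 3)
y_N = np.dot(x_N3, np.asarray([1.0, -2.0, 3.0])) - 1.3337

noise_std = 0.1  # assumed noise level, for illustration only
y_noisy_N = y_N + noise_std * prng.randn(N)

# Augment with a column of ones for the intercept
X_aug = np.hstack((np.ones((N, 1)), x_N3))

def fit_and_mse(X, y):
    """Fit least squares on (X, y) and return the training MSE."""
    theta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return np.mean((y - X @ theta) ** 2)

mse_clean = fit_and_mse(X_aug, y_N)
mse_noisy = fit_and_mse(X_aug, y_noisy_N)
print(mse_clean, mse_noisy)
```

With noise-free responses the MSE is zero up to floating-point error, while with noise it settles near the noise variance (`noise_std ** 2`), which no linear model can explain away.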

Original resource: [Introduction to Machine Learning, Tufts](https://www.cs.tufts.edu/comp/135/2023f/)
