## 1. Answer the questions

### 1. Derive an analytical solution to the regression problem. Use a vector form of the equation.

Модель линейной регрессии:
$$
\hat y = X w + b
$$
Объединим параметры:
$$
\tilde w = \begin{bmatrix} b \\ w \end{bmatrix}
$$
Добавим столбец единиц:
$$
\tilde X = \begin{bmatrix}
1 & x_{11} & \dots & x_{1d} \\
1 & x_{21} & \dots & x_{2d} \\
\vdots & \vdots & & \vdots \\
1 & x_{n1} & \dots & x_{nd}
\end{bmatrix}
$$
Тогда:
$$
\hat y = \tilde X \tilde w
$$
Рассмотрим функцию потерь MSE:
$$
L(\tilde w) = \| \tilde X \tilde w − y \|^2
$$
Найдём точку минимума, приравняв градиент функции потерь к 0:
$$
\nabla L = 2 \tilde X^T(\tilde X \tilde w - y) = 0
$$
Из полученного уравнения находим аналитическое решение задачи:
$$
\tilde w = (\tilde X^T \tilde X)^{-1} \tilde X^T y
$$

### 2. What changes in the solution when L1 and L2 regularizations are added to the loss function?

L2-регуляризация добавляет к функции потерь штраф за большие веса:
$$
L(w) = \|X w - y\|^2 + \lambda \| w \|_2^2
$$
Веса модели уменьшаются, но остаются ненулевыми, что делает модель более устойчивой к переобучению и мультиколлинеарности.  
L1-регуляризация добавляет штраф на абсолютные значения весов:
$$
L(w) = \|X w - y\|^2 + \lambda \| w \|_1
$$
Она решает другую задачу — упрощение модели и отбор признаков.

### 3. Explain why L1 regularization is often used to select features. Why are there many weights equal to 0 after the model is fit?

L1-регуляризация часто используется для отбора признаков, потому что она делает модель разреженной: многие коэффициенты становятся равными нулю.  
Градиент по весу $w_j$:
$$
\nabla L_{w_j} = 2 X_j^T (X w - y) + \lambda \text{sign}(w_j)
$$
Т.к. второй член постоянен по модулю, он может занулить маленькие коэффициенты, в отличие от L2, где градиент уменьшается с весом и редко делает его равным нулю.

### 4. Explain how you can use the same models (Linear regression, Ridge, etc.) but make it possible to fit nonlinear dependencies.

Линейные модели можно применять к нелинейным зависимостям, если преобразовать признаки с помощью функций или полиномов.  
Можно, например, 
- добавить квадраты признаков, их произведения или другие функции (x^2, sqrt(x), sin(x)),
- либо воспользоваться PolynomialFeatures() из sklearn.  

После такого преобразования линейная модель остаётся линейной по новым признакам, но фактически обучает нелинейную зависимость исходной целевой переменной.

## 2. Import

In [None]:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.metrics import mean_absolute_error, root_mean_squared_error
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, SGDRegressor

In [60]:
train = pd.read_json('./data/train.json')
test = pd.read_json('./data/test.json')

In [61]:
train.head(3)

Unnamed: 0,bathrooms,bedrooms,building_id,created,description,display_address,features,latitude,listing_id,longitude,manager_id,photos,price,street_address,interest_level
4,1.0,1,8579a0b0d54db803821a35a4a615e97a,2016-06-16 05:55:27,Spacious 1 Bedroom 1 Bathroom in Williamsburg!...,145 Borinquen Place,"[Dining Room, Pre-War, Laundry in Building, Di...",40.7108,7170325,-73.9539,a10db4590843d78c784171a107bdacb4,[https://photos.renthop.com/2/7170325_3bb5ac84...,2400,145 Borinquen Place,medium
6,1.0,2,b8e75fc949a6cd8225b455648a951712,2016-06-01 05:44:33,BRAND NEW GUT RENOVATED TRUE 2 BEDROOMFind you...,East 44th,"[Doorman, Elevator, Laundry in Building, Dishw...",40.7513,7092344,-73.9722,955db33477af4f40004820b4aed804a0,[https://photos.renthop.com/2/7092344_7663c19a...,3800,230 East 44th,low
9,1.0,2,cd759a988b8f23924b5a2058d5ab2b49,2016-06-14 15:19:59,**FLEX 2 BEDROOM WITH FULL PRESSURIZED WALL**L...,East 56th Street,"[Doorman, Elevator, Laundry in Building, Laund...",40.7575,7158677,-73.9625,c8b10a317b766204f08e613cef4ce7a0,[https://photos.renthop.com/2/7158677_c897a134...,3495,405 East 56th Street,medium


## 3. Intro data analysis

### 1. Clean data and remove the lines outside the 1st and 99th percentiles.

In [62]:
percentile_1 = np.percentile(train['price'], 1)
percentile_99 = np.percentile(train['price'], 99)
train = train.loc[(percentile_1 < train['price']) & (train['price'] < percentile_99)]

In [63]:
percentile_1 = np.percentile(test['price'], 1)
percentile_99 = np.percentile(test['price'], 99)
test = test.loc[(percentile_1 < test['price']) & (test['price'] < percentile_99) & (test['bathrooms'] <= 10)]

### 2. Remove unused symbols ([, ], ', ", and space) from 'features' column.

In [64]:
def remove_symbols(feature):
    for s in '[]\'" ':
        feature = feature.replace(s, '')
    return feature

train['features'] = train['features'].apply(lambda features: [remove_symbols(feature) for feature in features])
test['features'] = test['features'].apply(lambda features: [remove_symbols(feature) for feature in features])

### 3. Collect the result in one huge list.

In [65]:
features = [feature for features in train['features'] for feature in features]

### 4. How many unique values does a result list contain?

In [66]:
count_features = Counter(features)
len(count_features.keys())

1529

`Answer`: 1529

### 5. Print the top 20 most popular features.

In [67]:
count_features.most_common(20)

[('Elevator', 25375),
 ('HardwoodFloors', 23146),
 ('CatsAllowed', 23135),
 ('DogsAllowed', 21652),
 ('Doorman', 20479),
 ('Dishwasher', 20081),
 ('NoFee', 17793),
 ('LaundryinBuilding', 16082),
 ('FitnessCenter', 12989),
 ('Pre-War', 8971),
 ('LaundryinUnit', 8437),
 ('RoofDeck', 6417),
 ('OutdoorSpace', 5132),
 ('DiningRoom', 4890),
 ('HighSpeedInternet', 4223),
 ('Balcony', 2898),
 ('SwimmingPool', 2643),
 ('LaundryInBuilding', 2564),
 ('NewConstruction', 2504),
 ('Terrace', 2177)]

### 6. Create 20 new features based on the top 20 values.

In [68]:
feature_list = np.array(count_features.most_common(20))[:, 0]
feature_list

array(['Elevator', 'HardwoodFloors', 'CatsAllowed', 'DogsAllowed',
       'Doorman', 'Dishwasher', 'NoFee', 'LaundryinBuilding',
       'FitnessCenter', 'Pre-War', 'LaundryinUnit', 'RoofDeck',
       'OutdoorSpace', 'DiningRoom', 'HighSpeedInternet', 'Balcony',
       'SwimmingPool', 'LaundryInBuilding', 'NewConstruction', 'Terrace'],
      dtype='<U21')

In [69]:
train = train[['bathrooms', 'bedrooms', 'features', 'price']]
for feature in feature_list:
    train[str(feature)] = train['features'].apply(lambda features: int(feature in features))
train = train.drop('features', axis=1)
train.head()

Unnamed: 0,bathrooms,bedrooms,price,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
4,1.0,1,2400,0,1,1,1,0,1,0,...,0,0,0,1,0,0,0,0,0,0
6,1.0,2,3800,1,1,0,0,1,1,1,...,0,0,0,0,0,0,0,0,0,0
9,1.0,2,3495,1,1,0,0,1,1,0,...,1,0,0,0,0,0,0,0,0,0
10,1.5,3,3000,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
15,1.0,0,2795,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


### 7. Do the same with test data.

In [70]:
test = test[['bathrooms', 'bedrooms', 'features', 'price']]
for feature in feature_list:
    test[str(feature)] = test['features'].apply(lambda features: int(feature in features))
test = test.drop('features', axis=1)
test.head()

Unnamed: 0,bathrooms,bedrooms,price,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
0,1.0,1,2950,1,1,0,0,0,1,0,...,1,0,1,0,0,0,0,0,0,0
1,1.0,2,2850,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1.0,0,2295,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1.0,2,2900,0,1,1,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1.0,1,3254,1,0,1,1,1,0,0,...,0,1,0,0,1,0,0,0,0,0


### 8. Divide into train and training data.

In [71]:
X_train = train.drop('price', axis=1)
y_train = train['price']
X_test = test.drop('price', axis=1)
y_test = test['price']

## 4. Models implementation — Linear regression

### 1. Implementing classes and metrics.

#### 1. Linear regression using `stochastic gradient descent` (SGD) with `MSE` loss function.

In [72]:
class CustomSGD():
    def __init__(self, learning_rate=0.0001, epochs=100, batch=30, random_state=None):
        self._learning_rate = learning_rate
        self._epochs = epochs
        self._batch = batch
        self._rng = np.random.default_rng(random_state)
     
    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        N, D = X.shape
        self._w = self._rng.random(D)
        self._b = 0
        B = self._batch
        for _ in range(self._epochs):
            indexes = np.arange(N)
            self._rng.shuffle(indexes)
            X = X[indexes]
            y = y[indexes]
            for i in range(0, N, B):
                X_batch = X[i : i+B]
                y_batch = y[i : i+B]
                pred = X_batch.dot(self._w) + self._b
                err = pred - y_batch
                grad_w = 2 * X_batch.T.dot(err) / B
                grad_b = 2 * np.mean(err)
                self._w -= self._learning_rate * grad_w
                self._b -= self._learning_rate * grad_b

    def predict(self, X):
        X = np.asarray(X)
        return X.dot(self._w) + self._b

#### 2. Linear regression using `gradient descent` (GD) with `MSE` loss function.

In [73]:
class CustomGD():
    def __init__(self, learning_rate=0.001, max_iter=10000, random_state=None):
        self._learning_rate = learning_rate
        self._max_iter = max_iter
        self._rng = np.random.default_rng(random_state)
     
    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        N, D = X.shape
        self._w = self._rng.random(D)
        self._b = 0
        for _ in range(self._max_iter):
            pred = X.dot(self._w) + self._b
            err = pred - y
            grad_w = 2 * X.T.dot(err) / N
            grad_b = 2 * np.mean(err)
            self._w -= self._learning_rate * grad_w
            self._b -= self._learning_rate * grad_b

    def predict(self, X):
        X = np.asarray(X)
        return X.dot(self._w) + self._b

#### 3. Linear regression using an `analytical method`.

In [74]:
class CustomAnalytical():
    def _to_bias_matrix(self, X):
        X = np.asarray(X)
        N, D = X.shape
        ones = np.ones((N, 1))
        return np.hstack([X, ones])
        
    def fit(self, X, y):
        y = np.asarray(y)
        X = self._to_bias_matrix(X)
        X_T = X.T
        inv_matrix = np.linalg.inv(X_T.dot(X))
        self._w = inv_matrix.dot(X_T.dot(y))
    
    def predict(self, X):
        X = self._to_bias_matrix(X)
        return X.dot(self._w)

#### 4. R squared (`R2`) coefficient.

In [75]:
def r2_coefficient(y_true, y_pred):
    mean = np.mean(y_true)
    y_mean = [mean] * len(y_true)
    r2 = 1 - np.sum((y_true - y_pred) ** 2 )/ np.sum((y_true - y_mean) ** 2)
    return r2

### 2. Testing classes with MAE, RMSE and R2 metrics.

In [76]:
result_MAE = pd.DataFrame(columns=['model', 'train', 'test'])
result_RMSE = pd.DataFrame(columns=['model', 'train', 'test'])
result_R2 = pd.DataFrame(columns=['model', 'train', 'test'])

In [77]:
def add_result(model_name, y_pred_train, y_pred_test):
    mae_train = mean_absolute_error(y_train, y_pred_train)
    mae_test = mean_absolute_error(y_test, y_pred_test)
    result_MAE.loc[len(result_MAE)] = [model_name, mae_train, mae_test]

    rmse_train = root_mean_squared_error(y_train, y_pred_train)
    rmse_test = root_mean_squared_error(y_test, y_pred_test)
    result_RMSE.loc[len(result_RMSE)] = [model_name, rmse_train, rmse_test]

    r2_train = r2_coefficient(y_train, y_pred_train)
    r2_test = r2_coefficient(y_test, y_pred_test)
    result_R2.loc[len(result_R2)] = [model_name, r2_train, r2_test]

#### 1. Stochastic gradient descent (SGD).

In [78]:
sgd_reg = CustomSGD(random_state=21)
sgd_reg.fit(X_train, y_train)
y_pred_train = sgd_reg.predict(X_train)
y_pred_test = sgd_reg.predict(X_test)
add_result('SGD', y_pred_train, y_pred_test)

#### 2. Gradient descent (GD).

In [79]:
gd_reg = CustomGD(random_state=21)
gd_reg.fit(X_train, y_train)
y_pred_train = gd_reg.predict(X_train)
y_pred_test = gd_reg.predict(X_test)
add_result('GD', y_pred_train, y_pred_test)

#### 3. Analytical method.

In [80]:
anl_reg = CustomAnalytical()
anl_reg.fit(X_train, y_train)
y_pred_train = anl_reg.predict(X_train)
y_pred_test = anl_reg.predict(X_test)
add_result('Analytical', y_pred_train, y_pred_test)

#### 4. Linear regression from Sklearn.

In [81]:
lr = LinearRegression()
lr.fit(X=X_train, y=y_train)
y_pred_train = lr.predict(X_train)
y_pred_test = lr.predict(X_test)
add_result('Sklearn Linear', y_pred_train, y_pred_test)

#### 5. Results 

In [82]:
result_MAE

Unnamed: 0,model,train,test
0,SGD,708.369964,709.405253
1,GD,708.986886,709.908019
2,Analytical,708.737118,709.726085
3,Sklearn Linear,708.737118,709.726085


In [83]:
result_RMSE

Unnamed: 0,model,train,test
0,SGD,1027.678837,1021.569984
1,GD,1028.63871,1022.516649
2,Analytical,1027.262879,1021.194868
3,Sklearn Linear,1027.262879,1021.194868


In [84]:
result_R2

Unnamed: 0,model,train,test
0,SGD,0.579929,0.578353
1,GD,0.579144,0.577571
2,Analytical,0.580269,0.578662
3,Sklearn Linear,0.580269,0.578662


`Conclusion`: По сравнению с результатами проекта ml_1 добавление 20 новых признаков улучшило качество линейной регрессии на обучающей и тестовой выборках. Все реализации (**SGD**, **GD** И **Analytical**) близки к **Sklearn** версии линейной регрессии.

## 5. Regularized models implementation — Ridge, Lasso, ElasticNet

### 1. Implementing classes.

#### 1. Ridge Regression.

In [85]:
class CustomRidge():
    def __init__(self, alpha=0.1, learning_rate=0.0001, epochs=100, batch=30, random_state=None):
        self._alpha = alpha
        self._learning_rate = learning_rate
        self._epochs = epochs
        self._batch = batch
        self._rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        N, D = X.shape
        self._w = self._rng.random(D)
        self._b = 0
        B = self._batch
        for _ in range(self._epochs):
            for i in range(0, N, B):
                X_batch = X[i : i+B]
                y_batch = y[i : i+B]
                pred = X_batch.dot(self._w) + self._b
                err = pred - y_batch
                grad_w = 2 * X_batch.T.dot(err) / B + 2 * self._alpha * self._w
                grad_b = 2 * np.mean(err)
                self._w -= self._learning_rate * grad_w
                self._b -= self._learning_rate * grad_b

    def predict(self, X):
        X = np.asarray(X)
        return X.dot(self._w) + self._b

#### 2. Lasso Regression.

In [86]:
class CustomLasso():
    def __init__(self, alpha=1, learning_rate=0.0001, epochs=100, batch=30, random_state=None):
        self._alpha = alpha
        self._learning_rate = learning_rate
        self._epochs = epochs
        self._batch = batch
        self._rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        N, D = X.shape
        self._w = self._rng.random(D)
        self._b = 0
        B = self._batch
        for _ in range(self._epochs):
            for i in range(0, N, B):
                X_batch = X[i : i+B]
                y_batch = y[i : i+B]
                pred = X_batch.dot(self._w) + self._b
                err = pred - y_batch
                grad_w = 2 * X_batch.T.dot(err) / B + self._alpha * np.sign(self._w)
                grad_b = 2 * np.mean(err)
                self._w -= self._learning_rate * grad_w
                self._b -= self._learning_rate * grad_b

    def predict(self, X):
        X = np.asarray(X)
        return X.dot(self._w) + self._b

#### 3. ElasticNet.

In [87]:
class CustomElasticNet():
    def __init__(self, alpha=1, l1_ratio=0.5, learning_rate=0.0001, epochs=100, batch=30, random_state=None):
        self._alpha = alpha
        self._l1_ratio = l1_ratio
        self._learning_rate = learning_rate
        self._epochs = epochs
        self._batch = batch
        self._rng = np.random.default_rng(random_state)

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y)
        N, D = X.shape
        self._w = self._rng.random(D)
        self._b = 0
        B = self._batch
        for _ in range(self._epochs):
            for i in range(0, N, B):
                X_batch = X[i : i+B]
                y_batch = y[i : i+B]
                pred = X_batch.dot(self._w) + self._b
                err = pred - y_batch
                grad_w = 2 * X_batch.T.dot(err) / B + self._alpha * (self._l1_ratio * np.sign(self._w) + 2 * (1 - self._l1_ratio) * self._w)
                grad_b = 2 * np.mean(err)
                self._w -= self._learning_rate * grad_w
                self._b -= self._learning_rate * grad_b

    def predict(self, X):
        X = np.asarray(X)
        return X.dot(self._w) + self._b

### 2. Testing classes with MAE, RMSE and R2 metrics.

#### 1. Ridge Regression.

In [88]:
rid_reg = CustomRidge(random_state=21)
rid_reg.fit(X_train, y_train)
y_pred_train = rid_reg.predict(X_train)
y_pred_test = rid_reg.predict(X_test)
add_result('Ridge', y_pred_train, y_pred_test)

#### 2. Lasso Regression.

In [89]:
las_reg = CustomLasso(random_state=21)
las_reg.fit(X_train, y_train)
y_pred_train = las_reg.predict(X_train)
y_pred_test = las_reg.predict(X_test)
add_result('Lasso', y_pred_train, y_pred_test)

#### 3. ElasticNet.

In [90]:
ela_reg = CustomElasticNet(random_state=21)
ela_reg.fit(X_train, y_train)
y_pred_train = ela_reg.predict(X_train)
y_pred_test = ela_reg.predict(X_test)
add_result('ElasticNet', y_pred_train, y_pred_test)

#### 4. Sklearn regressions.

In [91]:
rid_reg = Ridge(random_state=21)
rid_reg.fit(X_train, y_train)
y_pred_train = rid_reg.predict(X_train)
y_pred_test = rid_reg.predict(X_test)
add_result('Sklearn Ridge', y_pred_train, y_pred_test)

In [92]:
las_reg = Lasso(random_state=21)
las_reg.fit(X_train, y_train)
y_pred_train = las_reg.predict(X_train)
y_pred_test = las_reg.predict(X_test)
add_result('Sklearn Lasso', y_pred_train, y_pred_test)

In [93]:
ela_reg = ElasticNet(random_state=21)
ela_reg.fit(X_train, y_train)
y_pred_train = ela_reg.predict(X_train)
y_pred_test = ela_reg.predict(X_test)
add_result('Sklearn ElasticNet', y_pred_train, y_pred_test)

#### 5. Results.

In [94]:
result_MAE

Unnamed: 0,model,train,test
0,SGD,708.369964,709.405253
1,GD,708.986886,709.908019
2,Analytical,708.737118,709.726085
3,Sklearn Linear,708.737118,709.726085
4,Ridge,723.645851,724.142397
5,Lasso,707.59855,708.625361
6,ElasticNet,802.107558,802.474767
7,Sklearn Ridge,708.734175,709.722774
8,Sklearn Lasso,708.38337,709.350489
9,Sklearn ElasticNet,804.298801,804.656101


In [95]:
result_RMSE

Unnamed: 0,model,train,test
0,SGD,1027.678837,1021.569984
1,GD,1028.63871,1022.516649
2,Analytical,1027.262879,1021.194868
3,Sklearn Linear,1027.262879,1021.194868
4,Ridge,1062.230278,1056.554388
5,Lasso,1027.822497,1021.618801
6,ElasticNet,1180.692803,1174.245871
7,Sklearn Ridge,1027.262884,1021.19508
8,Sklearn Lasso,1027.454741,1021.242235
9,Sklearn ElasticNet,1180.302849,1173.93149


In [96]:
result_R2

Unnamed: 0,model,train,test
0,SGD,0.579929,0.578353
1,GD,0.579144,0.577571
2,Analytical,0.580269,0.578662
3,Sklearn Linear,0.580269,0.578662
4,Ridge,0.551208,0.548979
5,Lasso,0.579812,0.578313
6,ElasticNet,0.445526,0.442903
7,Sklearn Ridge,0.580269,0.578662
8,Sklearn Lasso,0.580112,0.578623
9,Sklearn ElasticNet,0.445892,0.443201


## 6. Feature normalization

### 1. Write several examples of why and where feature normalization is required and vice versa.

Нормализация признаков обязательна в:
- Градиентных методах SGD/GD, т.к. без неё сходимость будет медленной,
- Регуляризации L1/L2/ElasticNet, т.к. размер штрафа зависит напрямую от масштаба весов,
- KNN, k-Means, т.к. признак с большим разбросом будет вносить больший вес в расчёты.  
  
Нормализация не нужна в:
- Деревьях Random Forest, т.к. разбиение по порогам не зависит от масштаба признаков.

### 2. The classical normalization methods.

#### 1. MinMaxScaler

##### 1. Mathematical formula.

Нормализация в [0, 1]:
$$
X_{\text{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}
$$
Масштабирование в произвольный диапазон [a, b]:
$$
X_{\text{scaled}} = X_{\text{std}} \cdot (b - a) + a
$$

##### 2. Implement your own class.

In [97]:
class CustomMinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self._a, self._b = feature_range

    def _normalization(self, X, feature):
        denom = self._max[feature] - self._min[feature]
        if denom == 0:
            X_scaled = self._a
        else: 
            X_std = (X - self._min[feature]) / denom
            X_scaled = X_std * (self._b - self._a) + self._a
        return X_scaled

    def fit(self, X):
        X = pd.DataFrame(X)
        self._max = X.max()
        self._min = X.min()

    def transform(self, X):
        X = pd.DataFrame(X)
        X_copy = pd.DataFrame(columns=X.columns)
        for feature in X.columns:
            X_copy[feature] = self._normalization(X[feature], feature)
        return X_copy
    
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

##### 3. Compare the feature normalization with your own method and with sklearn.

In [98]:
custom_min_max_scaler = CustomMinMaxScaler()
custom_min_max_scaler.fit_transform(X_train)

Unnamed: 0,bathrooms,bedrooms,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,LaundryinBuilding,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
4,0.10,0.125,0.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.10,0.250,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.10,0.250,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.15,0.375,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
15,0.10,0.000,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124000,0.10,0.375,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
124002,0.10,0.250,1.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
124004,0.10,0.125,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
124008,0.10,0.250,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [99]:
scaler = MinMaxScaler()
scaler.fit_transform(X_train)

array([[0.1  , 0.125, 0.   , ..., 0.   , 0.   , 0.   ],
       [0.1  , 0.25 , 1.   , ..., 0.   , 0.   , 0.   ],
       [0.1  , 0.25 , 1.   , ..., 0.   , 0.   , 0.   ],
       ...,
       [0.1  , 0.125, 1.   , ..., 0.   , 0.   , 0.   ],
       [0.1  , 0.25 , 0.   , ..., 0.   , 0.   , 0.   ],
       [0.1  , 0.375, 1.   , ..., 0.   , 0.   , 0.   ]], shape=(48343, 22))

`Conclusion`: CustomMinMaxScaler() нормализует аналогично MinMaxScaler() из sklearn.

#### 2. StandardScaler

##### 1. Mathematical formula.

StandardScaler нормализует каждый признак так, чтобы он имел среднее 0 и стандартное отклонение 1.
$$
X_{\text{scaled}} ​= \frac{X - \mu}{\sigma}
$$
где:
$$
\mu = \frac{1}{N} \sum_{i=1}^{N} X_i
$$
- среднее значение признака,  
$$
\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2}
$$
- стандартное отклонение признака.

##### 2. Implement your own class.

In [100]:
class CustomStandardScaler:
    def _normalization(self, X, feature):
        if self._std[feature] == 0:
            self._std[feature] = 1
        X_scaled = (X - self._mean[feature]) / self._std[feature]
        return X_scaled

    def fit(self, X):
        X = pd.DataFrame(X)
        self._mean = X.mean()
        self._std = X.std(ddof=0)

    def transform(self, X):
        X = pd.DataFrame(X)
        X_copy = pd.DataFrame(columns=X.columns)
        for feature in X.columns:
            X_copy[feature] = self._normalization(X[feature], feature)
        return X_copy
    
    def fit_transform(self, X):
        self.fit(X)
        return self.transform(X)

##### 3. Compare the feature normalization with your own method and with sklearn.

In [101]:
custom_standard_scaler = CustomStandardScaler()
custom_standard_scaler.fit_transform(X_train)

Unnamed: 0,bathrooms,bedrooms,Elevator,HardwoodFloors,CatsAllowed,DogsAllowed,Doorman,Dishwasher,NoFee,LaundryinBuilding,...,LaundryinUnit,RoofDeck,OutdoorSpace,DiningRoom,HighSpeedInternet,Balcony,SwimmingPool,LaundryInBuilding,NewConstruction,Terrace
4,-0.427289,-0.485075,-1.051094,1.043365,1.043841,1.110282,-0.857300,1.186339,-0.763166,1.416344,...,-0.459806,-0.391223,-0.344625,2.980955,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154
6,-0.427289,0.422959,0.951390,1.043365,-0.958000,-0.900672,1.166453,1.186339,1.310331,1.416344,...,-0.459806,-0.391223,-0.344625,-0.335463,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154
9,-0.427289,0.422959,0.951390,1.043365,-0.958000,-0.900672,1.166453,1.186339,-0.763166,1.416344,...,2.174829,-0.391223,-0.344625,-0.335463,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154
10,0.670290,1.330994,-1.051094,-0.958437,-0.958000,-0.900672,-0.857300,-0.842929,-0.763166,-0.706043,...,-0.459806,-0.391223,-0.344625,-0.335463,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154
15,-0.427289,-1.393110,0.951390,-0.958437,-0.958000,-0.900672,1.166453,-0.842929,-0.763166,1.416344,...,-0.459806,-0.391223,-0.344625,-0.335463,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124000,-0.427289,1.330994,0.951390,1.043365,-0.958000,-0.900672,-0.857300,1.186339,-0.763166,-0.706043,...,-0.459806,-0.391223,-0.344625,-0.335463,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154
124002,-0.427289,0.422959,0.951390,-0.958437,1.043841,1.110282,1.166453,-0.842929,1.310331,-0.706043,...,-0.459806,-0.391223,-0.344625,-0.335463,-0.30938,-0.252526,-0.240486,4.225461,-0.233722,-0.217154
124004,-0.427289,-0.485075,0.951390,1.043365,1.043841,1.110282,-0.857300,1.186339,1.310331,1.416344,...,2.174829,-0.391223,-0.344625,2.980955,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154
124008,-0.427289,0.422959,-1.051094,-0.958437,-0.958000,-0.900672,-0.857300,1.186339,1.310331,-0.706043,...,2.174829,-0.391223,2.901709,-0.335463,-0.30938,-0.252526,-0.240486,-0.236661,-0.233722,-0.217154


In [102]:
scaler = StandardScaler()
scaler.fit_transform(X_train)

array([[-0.42728906, -0.4850753 , -1.05109371, ..., -0.23666054,
        -0.233722  , -0.21715413],
       [-0.42728906,  0.42295936,  0.95138996, ..., -0.23666054,
        -0.233722  , -0.21715413],
       [-0.42728906,  0.42295936,  0.95138996, ..., -0.23666054,
        -0.233722  , -0.21715413],
       ...,
       [-0.42728906, -0.4850753 ,  0.95138996, ..., -0.23666054,
        -0.233722  , -0.21715413],
       [-0.42728906,  0.42295936, -1.05109371, ..., -0.23666054,
        -0.233722  , -0.21715413],
       [-0.42728906,  1.33099403,  0.95138996, ..., -0.23666054,
        -0.233722  , -0.21715413]], shape=(48343, 22))

`Conclusion`: CustomStandardScaler() нормализует аналогично StandardScaler() из sklearn.

## 7. Fit custom and sklearn models with normalized data

### 1. Fit all models with MinMaxScaler.

In [103]:
X_train_copy = custom_min_max_scaler.transform(X_train)
X_test_copy = custom_min_max_scaler.transform(X_test)

#### 1. Linear Regression.

In [104]:
lin_reg = CustomSGD(learning_rate=0.01, random_state=21)
lin_reg.fit(X_train_copy, y_train)
y_pred_train = lin_reg.predict(X_train_copy)
y_pred_test = lin_reg.predict(X_test_copy)
add_result('SGD MinMaxScaler', y_pred_train, y_pred_test)

#### 2. Ridge.

In [105]:
rid_reg = CustomRidge(alpha=0.001, learning_rate=0.01, random_state=21)
rid_reg.fit(X_train_copy, y_train)
y_pred_train = rid_reg.predict(X_train_copy)
y_pred_test = rid_reg.predict(X_test_copy)
add_result('Ridge MinMaxScaler', y_pred_train, y_pred_test)

#### 3. Lasso.

In [106]:
las_reg = CustomLasso(learning_rate=0.01, random_state=21)
las_reg.fit(X_train_copy, y_train)
y_pred_train = las_reg.predict(X_train_copy)
y_pred_test = las_reg.predict(X_test_copy)
add_result('Lasso MinMaxScaler', y_pred_train, y_pred_test)

#### 4. ElasticNet.

In [107]:
ela_reg = CustomElasticNet(learning_rate=0.01, random_state=21)
ela_reg.fit(X_train_copy, y_train)
y_pred_train = ela_reg.predict(X_train_copy)
y_pred_test = ela_reg.predict(X_test_copy)
add_result('ElasticNet MinMaxScaler', y_pred_train, y_pred_test)

#### 5. Results.

In [108]:
result_MAE

Unnamed: 0,model,train,test
0,SGD,708.369964,709.405253
1,GD,708.986886,709.908019
2,Analytical,708.737118,709.726085
3,Sklearn Linear,708.737118,709.726085
4,Ridge,723.645851,724.142397
5,Lasso,707.59855,708.625361
6,ElasticNet,802.107558,802.474767
7,Sklearn Ridge,708.734175,709.722774
8,Sklearn Lasso,708.38337,709.350489
9,Sklearn ElasticNet,804.298801,804.656101


In [109]:
result_RMSE

Unnamed: 0,model,train,test
0,SGD,1027.678837,1021.569984
1,GD,1028.63871,1022.516649
2,Analytical,1027.262879,1021.194868
3,Sklearn Linear,1027.262879,1021.194868
4,Ridge,1062.230278,1056.554388
5,Lasso,1027.822497,1021.618801
6,ElasticNet,1180.692803,1174.245871
7,Sklearn Ridge,1027.262884,1021.19508
8,Sklearn Lasso,1027.454741,1021.242235
9,Sklearn ElasticNet,1180.302849,1173.93149


In [110]:
result_R2

Unnamed: 0,model,train,test
0,SGD,0.579929,0.578353
1,GD,0.579144,0.577571
2,Analytical,0.580269,0.578662
3,Sklearn Linear,0.580269,0.578662
4,Ridge,0.551208,0.548979
5,Lasso,0.579812,0.578313
6,ElasticNet,0.445526,0.442903
7,Sklearn Ridge,0.580269,0.578662
8,Sklearn Lasso,0.580112,0.578623
9,Sklearn ElasticNet,0.445892,0.443201


### 2. Fit all models with StandardScaler.

In [111]:
X_train_copy = custom_standard_scaler.transform(X_train)
X_test_copy = custom_standard_scaler.transform(X_test)

#### 1. Linear Regression.

In [112]:
lin_reg = CustomSGD(random_state=21)
lin_reg.fit(X_train_copy, y_train)
y_pred_train = lin_reg.predict(X_train_copy)
y_pred_test = lin_reg.predict(X_test_copy)
add_result('SGD StandardScaler', y_pred_train, y_pred_test)

#### 2. Ridge.

In [113]:
rid_reg = CustomRidge(random_state=21)
rid_reg.fit(X_train_copy, y_train)
y_pred_train = rid_reg.predict(X_train_copy)
y_pred_test = rid_reg.predict(X_test_copy)
add_result('Ridge StandardScaler', y_pred_train, y_pred_test)

#### 3. Lasso.

In [114]:
las_reg = CustomLasso(random_state=21)
las_reg.fit(X_train_copy, y_train)
y_pred_train = las_reg.predict(X_train_copy)
y_pred_test = las_reg.predict(X_test_copy)
add_result('Lasso StandardScaler', y_pred_train, y_pred_test)

#### 4. ElasticNet.

In [115]:
ela_reg = CustomElasticNet(random_state=21)
ela_reg.fit(X_train_copy, y_train)
y_pred_train = ela_reg.predict(X_train_copy)
y_pred_test = ela_reg.predict(X_test_copy)
add_result('ElasticNet StandardScaler', y_pred_train, y_pred_test)

#### 5. Results.

In [116]:
result_MAE

Unnamed: 0,model,train,test
0,SGD,708.369964,709.405253
1,GD,708.986886,709.908019
2,Analytical,708.737118,709.726085
3,Sklearn Linear,708.737118,709.726085
4,Ridge,723.645851,724.142397
5,Lasso,707.59855,708.625361
6,ElasticNet,802.107558,802.474767
7,Sklearn Ridge,708.734175,709.722774
8,Sklearn Lasso,708.38337,709.350489
9,Sklearn ElasticNet,804.298801,804.656101


In [117]:
result_RMSE

Unnamed: 0,model,train,test
0,SGD,1027.678837,1021.569984
1,GD,1028.63871,1022.516649
2,Analytical,1027.262879,1021.194868
3,Sklearn Linear,1027.262879,1021.194868
4,Ridge,1062.230278,1056.554388
5,Lasso,1027.822497,1021.618801
6,ElasticNet,1180.692803,1174.245871
7,Sklearn Ridge,1027.262884,1021.19508
8,Sklearn Lasso,1027.454741,1021.242235
9,Sklearn ElasticNet,1180.302849,1173.93149


In [118]:
result_R2

Unnamed: 0,model,train,test
0,SGD,0.579929,0.578353
1,GD,0.579144,0.577571
2,Analytical,0.580269,0.578662
3,Sklearn Linear,0.580269,0.578662
4,Ridge,0.551208,0.548979
5,Lasso,0.579812,0.578313
6,ElasticNet,0.445526,0.442903
7,Sklearn Ridge,0.580269,0.578662
8,Sklearn Lasso,0.580112,0.578623
9,Sklearn ElasticNet,0.445892,0.443201


## 8. Overfit models

In [119]:
poly = PolynomialFeatures(degree=10)
X_train_copy = poly.fit_transform(X_train[['bathrooms', 'bedrooms']])
X_test_copy = poly.transform(X_test[['bathrooms', 'bedrooms']])
custom_standard_scaler = CustomStandardScaler()
X_train_copy = custom_standard_scaler.fit_transform(X_train_copy)
X_test_copy = custom_standard_scaler.transform(X_test_copy)

#### 1. Linear Regression.

In [120]:
lin_reg = CustomSGD(learning_rate=10 ** -5, random_state=21)
lin_reg.fit(X_train_copy, y_train)
y_pred_train = lin_reg.predict(X_train_copy)
y_pred_test = lin_reg.predict(X_test_copy)
add_result('SGD Polynomial', y_pred_train, y_pred_test)

#### 2. Ridge.

In [121]:
rid_reg = CustomRidge(learning_rate=10 ** -5, random_state=21)
rid_reg.fit(X_train_copy, y_train)
y_pred_train = rid_reg.predict(X_train_copy)
y_pred_test = rid_reg.predict(X_test_copy)
add_result('Ridge Polynomial', y_pred_train, y_pred_test)

#### 3. Lasso.

In [122]:
las_reg = CustomLasso(learning_rate=10 ** -5, random_state=21)
las_reg.fit(X_train_copy, y_train)
y_pred_train = las_reg.predict(X_train_copy)
y_pred_test = las_reg.predict(X_test_copy)
add_result('Lasso Polynomial', y_pred_train, y_pred_test)

#### 4. ElasticNet.

In [123]:
ela_reg = CustomElasticNet(learning_rate=10 ** -5, random_state=21)
ela_reg.fit(X_train_copy, y_train)
y_pred_train = ela_reg.predict(X_train_copy)
y_pred_test = ela_reg.predict(X_test_copy)
add_result('ElasticNet Polynomial', y_pred_train, y_pred_test)

#### 5. Results.

In [124]:
result_MAE

Unnamed: 0,model,train,test
0,SGD,708.369964,709.405253
1,GD,708.986886,709.908019
2,Analytical,708.737118,709.726085
3,Sklearn Linear,708.737118,709.726085
4,Ridge,723.645851,724.142397
5,Lasso,707.59855,708.625361
6,ElasticNet,802.107558,802.474767
7,Sklearn Ridge,708.734175,709.722774
8,Sklearn Lasso,708.38337,709.350489
9,Sklearn ElasticNet,804.298801,804.656101


In [125]:
result_RMSE

Unnamed: 0,model,train,test
0,SGD,1027.678837,1021.569984
1,GD,1028.63871,1022.516649
2,Analytical,1027.262879,1021.194868
3,Sklearn Linear,1027.262879,1021.194868
4,Ridge,1062.230278,1056.554388
5,Lasso,1027.822497,1021.618801
6,ElasticNet,1180.692803,1174.245871
7,Sklearn Ridge,1027.262884,1021.19508
8,Sklearn Lasso,1027.454741,1021.242235
9,Sklearn ElasticNet,1180.302849,1173.93149


In [126]:
result_R2

Unnamed: 0,model,train,test
0,SGD,0.579929,0.578353
1,GD,0.579144,0.577571
2,Analytical,0.580269,0.578662
3,Sklearn Linear,0.580269,0.578662
4,Ridge,0.551208,0.548979
5,Lasso,0.579812,0.578313
6,ElasticNet,0.445526,0.442903
7,Sklearn Ridge,0.580269,0.578662
8,Sklearn Lasso,0.580112,0.578623
9,Sklearn ElasticNet,0.445892,0.443201


`Conclusion`: Использование полиномов **PolynomialFeatures** ухудшило метрики: упали до ~0.49-0.51, что связано мультиколлинеарностью признаков и избыточной сложностью модели для данного набора признаков.

## 9. Naive models 

### 1. Naive mean

In [127]:
y_pred_train = [y_train.mean()] * len(y_train)
y_pred_test = [y_test.mean()] * len(y_test)
add_result('Naive mean', y_pred_train, y_pred_test)

### 2. Naive median

In [128]:
y_pred_train = [y_train.mean()] * len(y_train)
y_pred_test = [y_test.mean()] * len(y_test)
add_result('Naive median', y_pred_train, y_pred_test)

## 10. Results

In [129]:
result_MAE

Unnamed: 0,model,train,test
0,SGD,708.369964,709.405253
1,GD,708.986886,709.908019
2,Analytical,708.737118,709.726085
3,Sklearn Linear,708.737118,709.726085
4,Ridge,723.645851,724.142397
5,Lasso,707.59855,708.625361
6,ElasticNet,802.107558,802.474767
7,Sklearn Ridge,708.734175,709.722774
8,Sklearn Lasso,708.38337,709.350489
9,Sklearn ElasticNet,804.298801,804.656101


In [130]:
result_RMSE

Unnamed: 0,model,train,test
0,SGD,1027.678837,1021.569984
1,GD,1028.63871,1022.516649
2,Analytical,1027.262879,1021.194868
3,Sklearn Linear,1027.262879,1021.194868
4,Ridge,1062.230278,1056.554388
5,Lasso,1027.822497,1021.618801
6,ElasticNet,1180.692803,1174.245871
7,Sklearn Ridge,1027.262884,1021.19508
8,Sklearn Lasso,1027.454741,1021.242235
9,Sklearn ElasticNet,1180.302849,1173.93149


In [131]:
result_R2

Unnamed: 0,model,train,test
0,SGD,0.579929,0.578353
1,GD,0.579144,0.577571
2,Analytical,0.580269,0.578662
3,Sklearn Linear,0.580269,0.578662
4,Ridge,0.551208,0.548979
5,Lasso,0.579812,0.578313
6,ElasticNet,0.445526,0.442903
7,Sklearn Ridge,0.580269,0.578662
8,Sklearn Lasso,0.580112,0.578623
9,Sklearn ElasticNet,0.445892,0.443201


`Conclusion`: 
- Лучшая модель - `Lasso/SGD`.  
**Lasso** имеет наибольшее значение $R^2$ на тестовой выборке при использовании **StandardScaler** (0.578675). Показатели MAE и RMSE находятся всегда на минимальном уровне и практически не меняются при различных способах масштабирования. По качеству предсказания **Lasso** сопоставима с **SGD**. **Lasso** показала себя более устойчивой в условиях мультиколлинеарности: при использовании полиномиальных признаков она показала наилучшие значения $R^2$ на обучающей и тестовой выборках: 0.506426 и 0.495742 соответсвенно.
- Худшая модель (обучаемая) - `ElasticNet`.
**ElasticNet** имеет наименьшие значения $R^2$ (от 0.12 до 0.54), высокие MAE и RMSE. Метрики сильно меняются при использовании нормализации и полиномиальных признаков, что говорит о низкой устойчивости модели к преобразованиям данных. Особенно плохо себя показала с **MinMaxScaler**.
- Стабильная модель - `Lasso/SGD`.  
Для них наблюдается минимальная разница между метриками на обучающей и тестовой выборках.