## Warm-Up!

### Import Relevant Packages

In [1]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

### Loading and Examining the Diabetes Dataset

In [2]:
diabetes = load_diabetes(scaled=False)

df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)

In [3]:
print(df.head())

print(df.dtypes)

    age  sex   bmi     bp     s1     s2    s3   s4      s5    s6
0  59.0  2.0  32.1  101.0  157.0   93.2  38.0  4.0  4.8598  87.0
1  48.0  1.0  21.6   87.0  183.0  103.2  70.0  3.0  3.8918  69.0
2  72.0  2.0  30.5   93.0  156.0   93.6  41.0  4.0  4.6728  85.0
3  24.0  1.0  25.3   84.0  198.0  131.4  40.0  5.0  4.8903  89.0
4  50.0  1.0  23.0  101.0  192.0  125.4  52.0  4.0  4.2905  80.0
age    float64
sex    float64
bmi    float64
bp     float64
s1     float64
s2     float64
s3     float64
s4     float64
s5     float64
s6     float64
dtype: object


All ten columns have dtype float64, which confirms that every feature is numeric.

#### Checking for Missing Data and Scaling the Data

---

For the diabetes dataset we chose StandardScaler because the features are measured on different scales. In the rows printed above, for example, body mass index (bmi) lies between roughly 21 and 32, blood pressure (bp) between 84 and 101, and s5 between roughly 3.9 and 4.9. StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up centered at zero with unit variance. Because this is a linear transformation applied per feature, the shape of each feature's distribution and the relative distances between samples within a feature are preserved.

StandardScaler is not robust to outliers in the strict sense, since extreme values shift both the mean and the standard deviation. Its statistics are computed from all samples rather than from the two most extreme ones, however, which makes it less sensitive to a single outlier than min-max scaling.

In contrast, we did not use MinMaxScaler. It maps each feature to a fixed range (for example [0, 1]) using only the feature's minimum and maximum, so a single extreme value can compress the remaining values into a narrow band.
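
To see how a single extreme value affects the two scalers discussed above, the following minimal sketch applies both to a hypothetical toy column (the values are made up for illustration and are not taken from the diabetes data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy column: five ordinary values plus one extreme outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])

standardized = StandardScaler().fit_transform(values)
min_maxed = MinMaxScaler().fit_transform(values)

# StandardScaler output: zero mean and unit variance (population std, ddof=0).
print("standardized mean/std:", standardized.mean(), standardized.std())

# MinMaxScaler output: confined to [0, 1], with the outlier mapped to 1.
print("min-max min/max:", min_maxed.min(), min_maxed.max())

# How much of each output scale the five ordinary values occupy.
standard_spread = standardized[:5].max() - standardized[:5].min()
min_max_spread = min_maxed[:5].max() - min_maxed[:5].min()
print("spread of ordinary values (standard):", standard_spread)
print("spread of ordinary values (min-max):", min_max_spread)
```

In this toy case the outlier influences both results, so neither scaler should be treated as outlier-proof.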

---

In [4]:
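# Count missing values per column; impute with the column mean if any exist.
missing_values = df.isnull().sum()
if missing_values.any():
    df = df.fillna(df.mean())
else:
    print("No missing data found!\n")

# Standardize every feature to zero mean and unit variance.
# Note: the scaler is fitted on the full dataset, before the split.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)

# Hold out 5% of the rows for testing. No random_state is set, so the split
# (and any metric computed on it) changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(scaled_features, diabetes.target, test_size=0.05)

print("Number of instances in training set:", len(X_train))
print("Number of instances in testing set:", len(X_test))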


No missing data found!

Number of instances in training set: 419
Number of instances in testing set: 23


Inspecting the dataset after standardization:

In [5]:
sdf = pd.DataFrame(data=scaled_features, columns=diabetes.feature_names)
print(sdf.head())

        age       sex       bmi        bp        s1        s2        s3  \
0  0.800500  1.065488  1.297088  0.459841 -0.929746 -0.732065 -0.912451   
1 -0.039567 -0.938537 -1.082180 -0.553505 -0.177624 -0.402886  1.564414   
2  1.793307  1.065488  0.934533 -0.119214 -0.958674 -0.718897 -0.680245   
3 -1.872441 -0.938537 -0.243771 -0.770650  0.256292  0.525397 -0.757647   
4  0.113172 -0.938537 -0.764944  0.459841  0.082726  0.327890  0.171178   

         s4        s5        s6  
0 -0.054499  0.418531 -0.370989  
1 -0.830301 -1.436589 -1.938479  
2 -0.054499  0.060156 -0.545154  
3  0.721302  0.476983 -0.196823  
4 -0.054499 -0.672502 -0.980568  


## Main Task

### Part 1: Functions’ Implementation

The loss functions and evaluation metrics implemented in this part are defined as follows:

1. **Mean Squared Error (MSE)**:
$  \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $

2. **Mean Absolute Error (MAE)**:
$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $

3. **Root Mean Squared Error (RMSE)**:
$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $

4. **R² Score (Coefficient of Determination)**:
$ R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} $

where:
- $ n $ is the number of observations,
- $ y_i $ is the actual value,
- $ \hat{y}_i $ is the predicted value,
- $ \bar{y} $ is the mean of the actual values of the dependent variable,
- $ SS_{\text{res}} $ is the sum of squares of residuals,
- $ SS_{\text{tot}} $ is the total sum of squares.
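
As a quick check of these definitions, consider a hypothetical toy example (not taken from the diabetes data) with $ n = 4 $, actual values $ y = (3, 5, 7, 9) $ and predictions $ \hat{y} = (2, 5, 8, 11) $. Then $ \bar{y} = 6 $ and the residuals $ y_i - \hat{y}_i $ are $ (1, 0, -1, -2) $, which gives:

$ \text{MSE} = \frac{1^2 + 0^2 + (-1)^2 + (-2)^2}{4} = \frac{6}{4} = 1.5 $

$ \text{MAE} = \frac{|1| + |0| + |-1| + |-2|}{4} = \frac{4}{4} = 1 $

$ \text{RMSE} = \sqrt{1.5} \approx 1.225 $

$ R^2 = 1 - \frac{6}{(-3)^2 + (-1)^2 + 1^2 + 3^2} = 1 - \frac{6}{20} = 0.7 $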

In [6]:
from math import sqrt

def multiply_vectors(v1, v2):
    """Return the dot product of two equal-length vectors."""
    if len(v1) != len(v2):
        raise ValueError("Vector dimensions do not match")

    total = 0
    for i in range(len(v1)):
        total += v1[i] * v2[i]

    return total

def mse_loss(X, y, model_weights, model_bias):
    """Mean squared error of the linear model on (X, y)."""
    sum_errors = 0

    for i in range(len(X)):
        predicted_y = multiply_vectors(X[i], model_weights) + model_bias
        sum_errors += (y[i] - predicted_y) ** 2

    return sum_errors / len(X)

def mae_loss(X, y, model_weights, model_bias):
    """Mean absolute error of the linear model on (X, y)."""
    sum_errors = 0

    for i in range(len(X)):
        predicted_y = multiply_vectors(X[i], model_weights) + model_bias
        sum_errors += abs(y[i] - predicted_y)

    return sum_errors / len(X)

def rmse_loss(X, y, model_weights, model_bias):
    """Root mean squared error: the square root of the MSE."""
    return sqrt(mse_loss(X, y, model_weights, model_bias))

def r2score(X, y, model_weights, model_bias):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    n = len(X)
    sum_squared_residuals = 0
    sum_squared_total = 0
    y_mean = sum(y) / n

    for i in range(n):
        predicted_y = multiply_vectors(X[i], model_weights) + model_bias
        sum_squared_residuals += (y[i] - predicted_y) ** 2
        sum_squared_total += (y[i] - y_mean) ** 2

    return 1 - (sum_squared_residuals / sum_squared_total)
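
The loop-based functions above can also be written in vectorized form. The following is a minimal NumPy sketch of that idea, not the implementation used in this notebook. The `_np` function names and the toy inputs are hypothetical, and scikit-learn's metric functions are used only as a cross-check:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def predict_np(X, model_weights, model_bias):
    """Linear predictions for all rows at once: X @ w + b."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(model_weights, dtype=float)
    return X @ w + model_bias

def mse_loss_np(X, y, model_weights, model_bias):
    residuals = np.asarray(y, dtype=float) - predict_np(X, model_weights, model_bias)
    return float(np.mean(residuals ** 2))

def mae_loss_np(X, y, model_weights, model_bias):
    residuals = np.asarray(y, dtype=float) - predict_np(X, model_weights, model_bias)
    return float(np.mean(np.abs(residuals)))

def rmse_loss_np(X, y, model_weights, model_bias):
    return float(np.sqrt(mse_loss_np(X, y, model_weights, model_bias)))

def r2score_np(X, y, model_weights, model_bias):
    y = np.asarray(y, dtype=float)
    ss_res = np.sum((y - predict_np(X, model_weights, model_bias)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

# Hypothetical toy inputs: four samples, two features.
X_toy = [[1, 2], [3, 4], [5, 6], [7, 8]]
y_toy = [3, 5, 10, 12]
w_toy, b_toy = [0.5, 1.0], 0.5
y_hat = predict_np(X_toy, w_toy, b_toy)

print("MSE :", mse_loss_np(X_toy, y_toy, w_toy, b_toy), mean_squared_error(y_toy, y_hat))
print("MAE :", mae_loss_np(X_toy, y_toy, w_toy, b_toy), mean_absolute_error(y_toy, y_hat))
print("RMSE:", rmse_loss_np(X_toy, y_toy, w_toy, b_toy))
print("R2  :", r2score_np(X_toy, y_toy, w_toy, b_toy), r2_score(y_toy, y_hat))
```

Each line prints the vectorized result next to the scikit-learn value, so any disagreement between the two is easy to spot.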

### Part 2: Building and Training the Linear Regression Model

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error (MSE):", mse)
print("R-squared (R2) Score:", r2)

Mean Squared Error (MSE): 3569.585514620719
R-squared (R2) Score: 0.46656442173732404
