# Linear Regression
## Question 1
Create a class called LinearRegression that provides two methods: fit and predict. Try to implement it from scratch. If you get stuck, refer to the examples folder.

In [None]:
import numpy as np

class LinearRegression:
    def __init__(self):
        self.slope = None
        self.intercept = None

    # The slope is the value that minimises the squared distance between the
    # predicted and actual y values; setting the derivative of that error to
    # zero gives the closed-form expression used in fit().
    def fit(self, X, y):

        # Compute the means of X and y
        x_mean = np.mean(X)
        y_mean = np.mean(y)

        # Plug the means into the formula obtained by differentiating the squared error
        numerator = np.sum((X - x_mean) * (y - y_mean))
        denominator = np.sum((X - x_mean)**2)

        # Calculate coefficients
        self.slope = numerator / denominator
        self.intercept = y_mean - (self.slope * x_mean)



    def predict(self, X):

        return self.intercept + self.slope * X
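
As a quick sanity check of the closed-form approach, the sketch below fits a single-feature model to noise-free synthetic data drawn from y = 2x + 1. The class name `SimpleLinearRegression` and the toy arrays are assumptions for illustration, not part of the assignment; the slope and intercept expressions are the same ones used above. Note that this closed form handles only one feature.

```python
import numpy as np

class SimpleLinearRegression:
    """Closed-form least squares for a single feature (hypothetical name)."""

    def __init__(self):
        self.slope = None
        self.intercept = None

    def fit(self, X, y):
        # Flatten inputs so that column vectors and 1-D arrays behave the same
        X = np.asarray(X, dtype=float).ravel()
        y = np.asarray(y, dtype=float).ravel()
        x_mean, y_mean = X.mean(), y.mean()
        self.slope = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
        self.intercept = y_mean - self.slope * x_mean
        return self

    def predict(self, X):
        return self.intercept + self.slope * np.asarray(X, dtype=float)

# Synthetic, noise-free data generated from y = 2x + 1 (illustrative assumption)
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0

demo_model = SimpleLinearRegression().fit(x_demo, y_demo)
demo_prediction = demo_model.predict(np.array([5.0]))
print(demo_model.slope, demo_model.intercept, demo_prediction)
```

On noise-free data the fitted slope and intercept should recover the generating values, which makes this a convenient test before moving to real data.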

## Question 2

Use the dataset https://www.kaggle.com/datasets/quantbruce/real-estate-price-prediction (*).
1. Read it using pandas.
2. Check for **null values**.
3. For each of the columns (except the first and last), plot the column values in the X-axis against the last column of prices in the Y-axis.
4. Remove the unwanted columns.
5. Split the dataset into train and test data. Test data size = 25% of total dataset.
6. **Normalize** the X_train and X_test using MinMaxScaler from sklearn.preprocessing.
7. Fit the training data into the model created in question 1 and predict the testing data.
8. Use **mean squared error and R<sup>2</sup>** from sklearn.metrics as evaluation criteria.
9. Fit the training data into the models of the same name provided by sklearn.linear_model and evaluate the predictions using MSE and R<sup>2</sup>.
10. Tune the hyperparameters of your models (learning rate, epochs) to achieve losses close to that of the sklearn models.

Note: (*) To solve this question, you may proceed in any of the following ways:
1. Prepare the notebook in Kaggle, download it and submit it separately with the other questions.
2. Download the dataset from Kaggle. Upload it to the session storage in Colab.
3. Use the Kaggle data directly in Colab. [Refer here](https://www.kaggle.com/general/74235). For this, you need to create a Kaggle API token. Before submitting, hide or remove the API token.
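
Step 6 asks for min-max normalization. The sketch below shows, with plain NumPy, the computation that MinMaxScaler performs under its default settings: each column is shifted by its training minimum and divided by its training range. The helper names `min_max_fit` and `min_max_transform` and the toy arrays are hypothetical, and the sketch assumes no column is constant (which would give a zero range).

```python
import numpy as np

def min_max_fit(X_train):
    """Return the per-column minimum and range, computed from training data only."""
    col_min = X_train.min(axis=0)
    col_range = X_train.max(axis=0) - col_min
    return col_min, col_range

def min_max_transform(X, col_min, col_range):
    """Scale X using statistics that were learned on the training data."""
    return (X - col_min) / col_range

# Toy data (illustrative assumption)
toy_train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
toy_test = np.array([[2.0, 25.0], [6.0, 5.0]])

col_min, col_range = min_max_fit(toy_train)
toy_train_scaled = min_max_transform(toy_train, col_min, col_range)
toy_test_scaled = min_max_transform(toy_test, col_min, col_range)
print(toy_train_scaled)
print(toy_test_scaled)
```

Because the statistics come from the training split only, scaled test values can fall outside the range 0 to 1, as the second test row does here.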

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression as SklearnLinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv('Real estate.csv')
null = df.isnull().sum()
print("Null Values:\n", null)
features = df.columns[1:-1]
target_col = df.columns[-1]

for i, col in enumerate(features):
    plt.scatter(df[col], df[target_col], alpha=0.5)
    plt.xlabel(col)
    plt.ylabel(target_col)
    plt.title(f"{col} vs Price")
    plt.savefig(f"plot_{i}.png")
    plt.close()

X_raw = df.drop(columns=['No', 'X1 transaction date', target_col])
y_raw = df[target_col].values
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.25, random_state=42)
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train_raw)
X_test = scaler.transform(X_test_raw)
class CustomLinearRegression:
    def __init__(self, learning_rate=0.01, epochs=1000):
        self.lr = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.epochs):
            y_predicted = np.dot(X, self.weights) + self.bias

            # Gradients
            dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))
            db = (1 / n_samples) * np.sum(y_predicted - y)

            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        return np.dot(X, self.weights) + self.bias

def evaluate(model, X, y_true):
    """Return (MSE, R^2) for the model's predictions on X."""
    preds = model.predict(X)
    mse = mean_squared_error(y_true, preds)
    r2 = r2_score(y_true, preds)
    return mse, r2

sk_model = SklearnLinearRegression()
sk_model.fit(X_train, y_train)
sk_mse, sk_r2 = evaluate(sk_model, X_test, y_test)

best_custom_mse = float('inf')
best_params = {}
lrs = [0.1, 0.05, 0.01]
epochs_list = [5000, 10000]

print("\nTuning Custom Model:")
for lr in lrs:
    for ep in epochs_list:
        model = CustomLinearRegression(learning_rate=lr, epochs=ep)
        model.fit(X_train, y_train)
        mse, r2 = evaluate(model, X_test, y_test)
        print(f"LR: {lr}, Epochs: {ep} -> MSE: {mse:.4f}, R2: {r2:.4f}")
        if mse < best_custom_mse:
            best_custom_mse = mse
            best_custom_r2 = r2
            best_params = {'lr': lr, 'epochs': ep}

# Final output for report
print("\nFinal Results:")
print(f"Sklearn Model: MSE={sk_mse:.4f}, R2={sk_r2:.4f}")
print(f"Best Custom Model ({best_params}): MSE={best_custom_mse:.4f}, R2={best_custom_r2:.4f}")




Null Values:
 No                                        0
X1 transaction date                       0
X2 house age                              0
X3 distance to the nearest MRT station    0
X4 number of convenience stores           0
X5 latitude                               0
X6 longitude                              0
Y house price of unit area                0
dtype: int64

Tuning Custom Model:
LR: 0.1, Epochs: 5000 -> MSE: 66.0170, R2: 0.5838
LR: 0.1, Epochs: 10000 -> MSE: 66.6232, R2: 0.5800
LR: 0.05, Epochs: 5000 -> MSE: 65.2572, R2: 0.5886
LR: 0.05, Epochs: 10000 -> MSE: 66.0168, R2: 0.5838
LR: 0.01, Epochs: 5000 -> MSE: 65.7928, R2: 0.5852
LR: 0.01, Epochs: 10000 -> MSE: 65.0616, R2: 0.5898

Final Results:
Sklearn Model: MSE=66.7486, R2=0.5792
Best Custom Model ({'lr': 0.01, 'epochs': 10000}): MSE=65.0616, R2=0.5898
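
The custom model's test MSE differs slightly from sklearn's, which may mean that gradient descent has not fully converged to the least-squares solution at these settings. One way to check convergence, sketched here on synthetic data with hypothetical variable names, is to compare the gradient-descent parameters against the closed-form solution from `np.linalg.lstsq`.

```python
import numpy as np

# Synthetic regression problem with features in [0, 1] (illustrative assumption)
rng = np.random.default_rng(0)
X_syn = rng.uniform(0.0, 1.0, size=(200, 3))
y_syn = X_syn @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(0.0, 0.1, size=200)

# Closed-form least squares: append a column of ones so the last entry is the bias
X_aug = np.hstack([X_syn, np.ones((X_syn.shape[0], 1))])
theta_exact, *_ = np.linalg.lstsq(X_aug, y_syn, rcond=None)

# Batch gradient descent with the same update rule as CustomLinearRegression
w = np.zeros(X_syn.shape[1])
b = 0.0
lr, epochs = 0.1, 5000
for _ in range(epochs):
    residual = X_syn @ w + b - y_syn
    w -= lr * (X_syn.T @ residual) / len(y_syn)
    b -= lr * residual.mean()

# Largest absolute difference between the two parameter vectors
gap = np.max(np.abs(np.append(w, b) - theta_exact))
print(f"max parameter gap: {gap:.2e}")
```

A gap near zero suggests gradient descent has converged; a larger gap suggests more epochs or a different learning rate are needed.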


# Logistic Regression
## Question 3

The breast cancer dataset is a binary classification dataset commonly used in machine learning tasks. It is available in scikit-learn (sklearn) as part of its datasets module.
Here is an explanation of the breast cancer dataset's components:

* Features (X):

 * The breast cancer dataset consists of 30 numeric features computed from fine needle aspirate (FNA) images of breast masses. These features include the mean, standard error, and worst (largest) values of various attributes such as radius, texture, smoothness, compactness, concavity, symmetry, and fractal dimension.

* Target (y):

 * The breast cancer dataset is a binary classification problem, and the target variable (y) represents the diagnosis of the breast mass. It contains two classes:
    * 0: Represents a malignant (cancerous) tumor.
    * 1: Represents a benign (non-cancerous) tumor.
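
The description above can be checked directly against the dataset object. This short sketch assumes scikit-learn is installed; the variable names `cancer` and `class_counts` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

# Feature matrix shape: (number of samples, number of features)
print(cancer.data.shape)

# Class names in label order: index 0 is the name for label 0, index 1 for label 1
print(cancer.target_names)

# Number of samples per class label
class_counts = np.bincount(cancer.target)
print(dict(zip(cancer.target_names, class_counts)))
```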

Complete the code given below in place of the "..."

1. Load the dataset from sklearn.datasets
2. Separate out the X and Y columns.
3. Normalize the X data using MinMaxScaler or StandardScaler.
4. Create a train-test-split. Take any suitable test size.

In [None]:

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression as SklearnLogistic
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

data = load_breast_cancer()
X, y = data.data, data.target
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
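
The cell above fits the scaler on the full dataset before splitting, so test-set statistics influence the scaling. A common alternative is to split first and compute the scaling statistics from the training portion only. The sketch below illustrates that order with plain NumPy on synthetic data; all names in it are hypothetical.

```python
import numpy as np

# Synthetic feature matrix (illustrative assumption)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 4))

# Split first: shuffle the row indices and hold out 20% for testing
indices = rng.permutation(len(X_demo))
n_test = int(0.2 * len(X_demo))
test_idx, train_idx = indices[:n_test], indices[n_test:]
X_tr, X_te = X_demo[train_idx], X_demo[test_idx]

# Then standardize using the training mean and standard deviation only
train_mean = X_tr.mean(axis=0)
train_std = X_tr.std(axis=0)
X_tr_scaled = (X_tr - train_mean) / train_std
X_te_scaled = (X_te - train_mean) / train_std

print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.mean(axis=0).round(3))
```

With this order the training columns have zero mean and unit variance, while the test columns are only approximately standardized, which mirrors how unseen data would be treated.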


5. Write code for the sigmoid function and Logistic regression.


In [None]:
def sigmoid(z):
    # This function "squashes" the linear input into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    # Useful for understanding the slope of the curve at any point
    s = sigmoid(z)
    return s * (1 - s)

class LogisticRegression:
    def __init__(self, learning_rate, epochs):
        # Initialise the hyperparameters of the model
        self.lr = learning_rate
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def fit(self, X, y):
        n_samples, n_features = X.shape
        y = y.reshape(-1, 1)

        # Initialise weights as zeros and bias as 0
        self.weights = np.zeros((n_features, 1))
        self.bias = 0

        # Implement the GD algorithm
        for _ in range(self.epochs):
            # 1. Linear combination of features
            z = np.dot(X, self.weights) + self.bias

            # 2. Map to probability using sigmoid
            y_pred = sigmoid(z)

            # 3. Calculate gradients (Partial derivatives of Loss)
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)

            # 4. Update the knowledge of the model
            self.weights -= self.lr * dw
            self.bias -= self.lr * db

    def predict(self, X):
        # Write the predict function
        z = np.dot(X, self.weights) + self.bias
        probabilities = sigmoid(z)

        # Convert probabilities to discrete classes (0 or 1) using a 0.5 threshold,
        # then flatten the (n_samples, 1) column into a 1-D array of labels
        y_pred = (probabilities >= 0.5).astype(int).ravel()

        return y_pred
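
Two optional additions can make the training loop easier to inspect. For very negative inputs, `np.exp(-z)` can overflow, and tracking the loss shows whether gradient descent is still improving. The sketch below is one possible approach, not part of the required solution; `stable_sigmoid` and `binary_cross_entropy` are hypothetical helper names.

```python
import numpy as np

def stable_sigmoid(z):
    """Sigmoid that avoids overflow by never exponentiating a large positive number."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.empty_like(z)
    pos = z >= 0
    # For z >= 0, exp(-z) is at most 1, so this form is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # For z < 0, exp(z) is at most 1, so use the equivalent form exp(z) / (1 + exp(z))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean log loss; probabilities are clipped so that log(0) never occurs."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_prob = np.clip(np.asarray(y_prob, dtype=float).ravel(), eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))

# Extreme inputs that would overflow a naive 1 / (1 + exp(-z))
z_demo = np.array([-1000.0, 0.0, 1000.0])
p_demo = stable_sigmoid(z_demo)
loss_demo = binary_cross_entropy(np.array([0, 1, 1]), p_demo)
print(p_demo, loss_demo)
```

Calling `binary_cross_entropy` every few hundred epochs inside `fit` would be one way to monitor convergence.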

6. Fit your model on the dataset and make predictions.
7. Compare your model with the Sklearn Logistic Regression model. Try out all the different penalties.
8. Print accuracy_score in each case using sklearn.metrics.

In [None]:
my_model = LogisticRegression(learning_rate=0.1, epochs=2000)
my_model.fit(X_train, y_train)
my_preds = my_model.predict(X_test)
print("First 10 Predictions:   ", my_preds[:10])
print("First 10 Actual Labels: ", y_test[:10])

penalties = ['l2', None] # Note: 'l1' often requires a different solver like 'liblinear'
for p in penalties:
    sk_model = SklearnLogistic(penalty=p, solver='lbfgs', max_iter=5000)
    sk_model.fit(X_train, y_train)
    sk_preds = sk_model.predict(X_test)
    print(f"Sklearn Accuracy (penalty={p}): {accuracy_score(y_test, sk_preds):.4f}")

print("\n--- Custom Model Report ---")
print(f"Accuracy: {accuracy_score(y_test, my_preds):.4f}")
print("Confusion Matrix:\n", confusion_matrix(y_test, my_preds))
print("Classification Report:\n", classification_report(y_test, my_preds))


First 10 Predictions:    [1 0 0 1 1 0 0 0 1 1]
First 10 Actual Labels:  [1 0 0 1 1 0 0 0 1 1]
Sklearn Accuracy (penalty=l2): 0.9737
Sklearn Accuracy (penalty=None): 0.9386

--- Custom Model Report ---
Accuracy: 0.9912
Confusion Matrix:
 [[42  1]
 [ 0 71]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99        43
           1       0.99      1.00      0.99        71

    accuracy                           0.99       114
   macro avg       0.99      0.99      0.99       114
weighted avg       0.99      0.99      0.99       114
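
The figures in the report follow directly from the confusion matrix. As a sketch, the block below recomputes accuracy, precision, recall, and F1 from the matrix printed above, using sklearn's convention that rows are true labels and columns are predicted labels. The variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class)
cm = np.array([[42, 1],
               [0, 71]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)    # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)       # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print("precision:", precision.round(2))
print("recall:   ", recall.round(2))
print("f1:       ", f1.round(2))
```

Rounded to two decimals, these values agree with the classification report.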



9. For the best model in each case (yours and scikit-learn), print the classification_report using sklearn.metrics.
10. For the best model in each case (yours and scikit-learn), print the confusion_matrix using sklearn.metrics.

In [None]:
# Done in the code cell above (the report and confusion matrix are printed for the custom model)

# KNN
## Question 4

How accurately can a K-Nearest Neighbors (KNN) model classify different types of glass based on a glass classification dataset consisting of 214 samples and 7 classes? Use the Kaggle dataset https://www.kaggle.com/datasets/uciml/glass.

Context: This is the Glass Identification Data Set from UCI. It contains 10 attributes, including an id. The response is the glass type (a discrete variable with 7 possible values).

1. Load the data as you did in the 2nd question.
2. Extract the X and Y columns.
3. Split it into training and testing datasets.

In [4]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
df = pd.read_csv('glass.csv')
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
class KNN:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # KNN is a "lazy" learner; it just stores the training data
        self.X_train = np.array(X)
        self.y_train = np.array(y)

    def _euclidean_distance(self, x1, x2):
        return np.sqrt(np.sum((x1 - x2)**2))

    def predict(self, X_test):
        predictions = [self._predict(x) for x in X_test]
        return np.array(predictions)

    def _predict(self, x):
        # Calculate distances to all training points
        distances = [self._euclidean_distance(x, x_train) for x_train in self.X_train]
        # Sort and get the indices of the K nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        # Get the labels of those neighbors
        k_labels = [self.y_train[i] for i in k_indices]
        # Majority vote
        return max(set(k_labels), key=k_labels.count)

# Train the custom KNN and predict on the test set
knn_custom = KNN(k=3)
knn_custom.fit(X_train, y_train)
custom_preds = knn_custom.predict(X_test)

# Train the sklearn KNN with the same k for comparison
knn_sklearn = KNeighborsClassifier(n_neighbors=3)
knn_sklearn.fit(X_train, y_train)
sklearn_preds = knn_sklearn.predict(X_test)

# Compare the first few predictions and the overall accuracy
print("--- PREDICTIONS COMPARISON (First 10) ---")
print(f"Actual Labels:  {y_test[:10]}")
print(f"Custom Preds:   {custom_preds[:10]}")
print(f"Sklearn Preds:  {sklearn_preds[:10]}")
print(f"Custom Model Accuracy (k=3): {accuracy_score(y_test, custom_preds):.4f}")
print(f"Sklearn Model Accuracy (k=3): {accuracy_score(y_test, sklearn_preds):.4f}")



--- PREDICTIONS COMPARISON (First 10) ---
Actual Labels:  [1 7 1 7 2 2 1 2 2 2]
Custom Preds:   [1 7 1 7 2 2 1 2 2 2]
Sklearn Preds:  [1 7 1 7 2 2 1 2 2 2]
Custom Model Accuracy (k=3): 0.7674
Sklearn Model Accuracy (k=3): 0.7674


4. Define Euclidean distance.
5. Build the KNN model.
6. Fit the model on the training data. (Note: you may need to convert the data from a pandas DataFrame to NumPy arrays, for example X = np.array(X).)
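
For step 4, the Euclidean distance between two points is the square root of the sum of squared coordinate differences. The sketch below computes it for every query/training pair at once using NumPy broadcasting, as a possible alternative to looping over points; `pairwise_euclidean` and the toy arrays are hypothetical.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """Return a matrix D where D[i, j] is the Euclidean distance between A[i] and B[j]."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Shape (len(A), 1, d) minus shape (1, len(B), d) broadcasts to (len(A), len(B), d)
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

# Toy points (illustrative assumption)
queries = np.array([[0.0, 0.0], [1.0, 1.0]])
points = np.array([[3.0, 4.0], [1.0, 1.0], [0.0, 1.0]])

dist_matrix = pairwise_euclidean(queries, points)
print(dist_matrix)
```

The broadcasted difference uses memory proportional to the number of pairs times the number of features, which is fine for a dataset of this size.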

In [None]:
#done in first coding block of this question

7. Make predictions. Find their accuracy using accuracy_score. Try different k values. k=3 worked well in our case.
8. Compare with the sklearn model (from sklearn.neighbors import KNeighborsClassifier)
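
Step 7 suggests trying different k values. A minimal sketch of such a sweep is shown below on synthetic two-class data rather than the glass dataset; the helper `knn_predict`, the data, and the chosen k values are all assumptions for illustration.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k):
    """Predict labels by majority vote among the k nearest training points."""
    diff = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt(np.sum(diff ** 2, axis=2))
    nearest = np.argsort(dists, axis=1)[:, :k]   # indices of the k closest points
    votes = y_train[nearest]                     # labels of those neighbors
    # bincount requires non-negative integer labels; ties go to the smaller label
    return np.array([np.bincount(row).argmax() for row in votes])

# Two well-separated synthetic clusters (illustrative assumption)
rng = np.random.default_rng(0)
X_all = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
                   rng.normal(5.0, 1.0, size=(60, 2))])
y_all = np.array([0] * 60 + [1] * 60)

# Shuffle and hold out 20% for testing
order = rng.permutation(len(X_all))
split = int(0.8 * len(X_all))
tr, te = order[:split], order[split:]

# Record test accuracy for each candidate k
k_scores = {}
for k in (1, 3, 5, 7):
    preds = knn_predict(X_all[tr], y_all[tr], X_all[te], k)
    k_scores[k] = float(np.mean(preds == y_all[te]))
print(k_scores)
```

The same loop can be run with the KNN class and the glass data from the first code cell of this question, replacing the accuracy line with accuracy_score.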

In [None]:
#done in first coding block of this question
