
Machine Learning terms and metrics

Module 1, Lab 2

In this lab, we walk through part of the ML pipeline using the California Housing dataset. It contains 20640 samples, each with 8 attributes such as the median income and the median house age of a district (block group). The task is to predict the median house value of each district. We use the scikit-learn library to load the data, perform some basic preprocessing, and train a model. We also show how to split the data into training and testing sets, evaluate the model with some common metrics, and use cross-validation to get a better estimate of the model's performance.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Step 1: Getting the Data
# Load the California Housing dataset
california = fetch_california_housing()

# Create a DataFrame with the data
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target

# Display the top 5 rows of data
print("Top 5 rows of data:")
print(data.head())

# Step 2: Preparing the Data
# Split the data into features (X) and target (y)
X = data.drop('MedHouseVal', axis=1)
y = data['MedHouseVal']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Training the Model
# Create a Linear Regression model
model = LinearRegression()

# Train the model using the training data
model.fit(X_train, y_train)

# Step 4: Evaluating the Model
# Predict the prices using the testing data
y_pred = model.predict(X_test)

# Calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)

# Calculate the R-squared score
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared Score: {r2}")

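# Step 5: Cross-Validation
# Perform 5-fold cross-validation. scikit-learn reports the MSE as a negative
# score ('neg_mean_squared_error') so that a higher score is always better.
cv_scores = cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Calculate the mean of the cross-validation scores
mean_cv_score = np.mean(cv_scores)

# Flip the sign to report the usual (positive) mean squared error
print(f"Cross-Validation Mean Squared Error: {-mean_cv_score}")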


Loading the Data: We fetch the California Housing dataset and create a DataFrame to see the first 5 rows of data.

Preparing the Data: We split the data into features (information about the houses) and the target (house prices). Then, we split this into training and testing sets.

Training the Model: We create a Linear Regression model and teach it using the training data.

Evaluating the Model: We make the model guess the house prices using the testing data and check how good the guesses are using mean squared error and R-squared score.

Cross-Validation: We use cross-validation to get a better estimate of the model's performance. The data is split into 5 folds; the model is trained on 4 of them and evaluated on the remaining one, with each fold taking a turn as the evaluation set, and the 5 scores are averaged.
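
To make the two metrics above concrete, here is a minimal sketch that computes the mean squared error and the R-squared score by hand and compares them with scikit-learn's functions. The arrays `y_true` and `y_hat` are hypothetical toy values, not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical toy values (assumption), not taken from the housing data
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"MSE manual: {mse_manual}, sklearn: {mean_squared_error(y_true, y_hat)}")
print(f"R2 manual: {r2_manual}, sklearn: {r2_score(y_true, y_hat)}")
```

An R-squared of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the target.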

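One of the simplest models to use for classification is the K-Nearest Neighbors model. Because it is a classifier, we first bin the continuous house value into three categories (Low, Medium, High) and then use the model to predict the category with a K value of 1. We use the accuracy metric to evaluate the model.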

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: Load and Prepare the Data
# Load the California Housing dataset
california = fetch_california_housing()

# Create a DataFrame with the data
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target

# Convert the target variable into categories
# Let's create 3 categories for simplicity: Low, Medium, High
data['MedHouseValCat'] = pd.qcut(data['MedHouseVal'], 3, labels=['Low', 'Medium', 'High'])

# Display the top 5 rows of data
print("Top 5 rows of data:")
print(data.head())

# Step 2: Split the Data
# Split the data into features (X) and target (y)
X = data.drop(['MedHouseVal', 'MedHouseValCat'], axis=1)
y = data['MedHouseValCat']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Train the KNN Model
# Create a KNN model with K=1
knn = KNeighborsClassifier(n_neighbors=1)

# Train the model using the training data
knn.fit(X_train, y_train)

# Step 4: Evaluate the Model
# Predict the categories using the testing data
y_pred = knn.predict(X_test)

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")


Does averaging the validation accuracy across multiple splits give more consistent results?

Yes, averaging the validation accuracy across multiple splits (cross-validation) gives more consistent results. It reduces the variability that might result from a single train-test split.
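
One way to check this claim is to repeat both procedures with different random seeds and compare how much the resulting numbers spread. The sketch below does this on a small synthetic dataset from `make_classification`, which is an assumption made so the example runs quickly; the names `single_split_acc` and `averaged_acc` are ours. The spread of the 5-fold means is typically smaller than the spread of the single-split accuracies.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), chosen so the sketch runs quickly
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
knn_demo = KNeighborsClassifier(n_neighbors=3)

single_split_acc = []
averaged_acc = []
for seed in range(20):
    # One hold-out split per seed
    X_tr, X_va, y_tr, y_va = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=seed
    )
    single_split_acc.append(knn_demo.fit(X_tr, y_tr).score(X_va, y_va))

    # Mean accuracy over 5 shuffled folds for the same seed
    folds = KFold(n_splits=5, shuffle=True, random_state=seed)
    averaged_acc.append(cross_val_score(knn_demo, X_demo, y_demo, cv=folds).mean())

print(f"Std of single-split accuracy: {np.std(single_split_acc):.4f}")
print(f"Std of 5-fold mean accuracy:  {np.std(averaged_acc):.4f}")
```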

Does it give a more accurate estimate of test accuracy?

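Generally yes. Cross-validation provides a more reliable estimate of the model's performance on unseen data because every data point is used for validation once and for training in the other folds, so the estimate does not depend on one particular, possibly lucky or unlucky, train-test split. It is still an estimate: if the cross-validation score is used to choose hyperparameters, a separate test set is needed for the final evaluation.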

What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?

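Not necessarily. With a higher number of iterations (folds), each model is trained on a larger share of the data (80% with 5 folds, 95% with 20 folds), so the estimate is less pessimistic. At the same time, each validation fold becomes smaller, so the individual fold scores are noisier, and the computational cost grows with the number of folds. Repeating the cross-validation with different random shuffles and averaging the results can also help reduce the variance of the estimate.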
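
The trade-off between training size and validation size can be seen directly from the fold sizes. This is a small sketch that only counts indices; `placeholder` is a dummy array we introduce because `KFold` needs nothing more than the number of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 20640  # size of the California Housing dataset
placeholder = np.zeros((n_samples, 1))  # dummy array; KFold only needs the row count

fold_sizes = {}
for n_splits in [5, 10, 20]:
    # Look at the first train/validation split for this number of folds
    train_idx, val_idx = next(KFold(n_splits=n_splits).split(placeholder))
    fold_sizes[n_splits] = (len(train_idx), len(val_idx))
    print(f"{n_splits} folds: train={len(train_idx)}, validation={len(val_idx)}")
```

With 5 folds each model is validated on 4128 samples, while with 20 folds each validation fold has only 1032 samples.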

Can we deal with a very small train dataset or validation dataset by increasing the iterations?

Only partly. In k-fold cross-validation each data point is used for validation exactly once and for training k-1 times, so increasing the number of folds (up to leave-one-out, where k equals the number of samples) lets each model train on almost all of the data. This makes better use of a small dataset, but it does not add new information, and the estimate can still be noisy. If the dataset is very small, other techniques like data augmentation or using simpler models might be necessary.
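
As a quick check of how k-fold cross-validation reuses the data, the following sketch counts how often each sample lands in the training and validation sets. `X_small` is a hypothetical 12-sample dataset introduced only for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_small = np.arange(12).reshape(-1, 1)  # hypothetical 12-sample dataset

val_counts = np.zeros(len(X_small), dtype=int)
train_counts = np.zeros(len(X_small), dtype=int)
kf_small = KFold(n_splits=4, shuffle=True, random_state=0)
for train_idx, val_idx in kf_small.split(X_small):
    train_counts[train_idx] += 1
    val_counts[val_idx] += 1

print(val_counts)    # how many times each sample was used for validation
print(train_counts)  # how many times each sample was used for training

# Leave-one-out is the extreme case: one split per sample
n_loo_splits = LeaveOneOut().get_n_splits(X_small)
print(n_loo_splits)
```

Every sample is validated once and trained on 3 times (n_splits - 1), and leave-one-out produces 12 splits for 12 samples.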

How does the accuracy of the 3 nearest neighbor classifier change with the number of splits? How is it affected by the split size? Compare the results with the 1 nearest neighbor classifier.

To analyze this, we'll write a Python script to perform the necessary experiments.

In [None]:
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load the California Housing dataset
california = fetch_california_housing()

# Create a DataFrame with the data
data = pd.DataFrame(california.data, columns=california.feature_names)
data['MedHouseVal'] = california.target

# Convert the target variable into categories
data['MedHouseValCat'] = pd.qcut(data['MedHouseVal'], 3, labels=['Low', 'Medium', 'High'])

# Prepare features (X) and target (y)
X = data.drop(['MedHouseVal', 'MedHouseValCat'], axis=1)
y = data['MedHouseValCat']

# Standardize the features inside a pipeline so that the scaler is fit on each
# training fold only and no validation-fold statistics leak into preprocessing
from sklearn.pipeline import make_pipeline

# Define a function to perform cross-validation with different K and splits
def cross_val_knn(k, n_splits):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_val_score(knn, X, y, cv=kf, scoring='accuracy')
    return scores

# Experiment with different K values and number of splits
results = {}
for k in [1, 3]:
    for n_splits in [5, 10, 20]:
        scores = cross_val_knn(k, n_splits)
        results[(k, n_splits)] = {
            'mean_accuracy': np.mean(scores),
            'std_accuracy': np.std(scores)
        }

# Print the results
for key, value in results.items():
    k, n_splits = key
    print(f"K: {k}, Splits: {n_splits} - Mean Accuracy: {value['mean_accuracy']:.4f}, Std Dev: {value['std_accuracy']:.4f}")

# Plotting the results for better visualization
import matplotlib.pyplot as plt

ks = [1, 3]
splits = [5, 10, 20]
accuracies = [[results[(k, s)]['mean_accuracy'] for s in splits] for k in ks]

plt.figure(figsize=(10, 5))
for i, k in enumerate(ks):
    plt.plot(splits, accuracies[i], marker='o', label=f'K={k}')

plt.xlabel('Number of Splits')
plt.ylabel('Mean Accuracy')
plt.title('Accuracy of K-Nearest Neighbors Classifier')
plt.legend()
plt.grid(True)
plt.show()


Explanation of Results

Impact of Number of Splits on Accuracy: The mean accuracy should change only slightly with the number of splits. More splits give each model a slightly larger training set, which can raise the mean a little, while averaging over several folds reduces the dependence on any single train-test split.

Comparison of K=1 and K=3:

K=1 (1-Nearest Neighbor): This classifier typically has low bias but high variance. Each prediction depends on a single training point, so it is highly sensitive to noise in the data.

K=3 (3-Nearest Neighbors): This classifier tends to be more robust as it considers more neighbors. It balances bias and variance better compared to K=1, often resulting in more stable accuracy across different splits.
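
The low-bias, high-variance behaviour of K=1 shows up as a perfect score on the training data: every training point is its own nearest neighbor. The sketch below illustrates this on synthetic data with 10% label noise; the dataset and the names `train_acc` and `test_acc` are assumptions made for this example, not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 10% label noise (assumption, for illustration only)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, n_informative=5, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

train_acc = {}
test_acc = {}
for k in [1, 3]:
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_acc[k] = clf.score(X_tr, y_tr)  # accuracy on data the model has seen
    test_acc[k] = clf.score(X_te, y_te)   # accuracy on held-out data
    print(f"K={k}: train accuracy={train_acc[k]:.3f}, test accuracy={test_acc[k]:.3f}")
```

The gap between training and test accuracy is a rough indicator of how much the classifier has memorised noise rather than learned structure.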

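Effect of Split Size: The number of splits determines the split size. With k folds, each validation fold holds 1/k of the data and each training set holds the remaining (k-1)/k. More splits therefore mean larger training sets but smaller validation folds, so the per-fold accuracies vary more (a larger standard deviation) even though their mean stays stable. Too many splits can also increase computation time significantly.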
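
A rough way to see why smaller validation folds give noisier per-fold scores is the binomial standard error of an accuracy measured on n samples, sqrt(p(1-p)/n). The sketch below evaluates it for the fold sizes used above; `assumed_accuracy` is a hypothetical value chosen only for illustration, not a result from this lab.

```python
import math

n_samples = 20640        # size of the California Housing dataset
assumed_accuracy = 0.7   # hypothetical true accuracy (assumption)

standard_errors = {}
for n_splits in [5, 10, 20]:
    fold_size = n_samples // n_splits
    # Binomial standard error of an accuracy measured on fold_size samples
    standard_errors[n_splits] = math.sqrt(
        assumed_accuracy * (1 - assumed_accuracy) / fold_size
    )
    print(f"{n_splits} folds: fold size={fold_size}, "
          f"per-fold standard error={standard_errors[n_splits]:.4f}")
```

Quadrupling the number of folds quarters the fold size and so doubles the per-fold standard error, which is consistent with a larger standard deviation across folds.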

Small Train/Validation Dataset: With very small datasets, increasing the number of splits helps but does not completely solve the problem. The estimates become more stable, but the fundamental issue of limited data might still hinder the model's ability to generalize.

By running the provided code, you can see how the accuracy of the 1-Nearest Neighbor and 3-Nearest Neighbor classifiers changes with different numbers of splits and compare their performance.