
# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. It has 20640 samples, each with 8 attributes such as the median income of the block and the median age of its houses. The task is to predict the median house value of each district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate to integers so that we can classify (np.int was removed from NumPy)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label using the 1-nearest neighbour rule:

In [3]:
def NN1(traindata, trainlabel, query):
  diff = traindata - query  # difference between features; NumPy broadcasting applies the query to every training row
  sq = diff*diff  # square the differences
  dist = sq.sum(axis=1)  # add up the squares for each training sample
  label = trainlabel[np.argmin(dist)]  # the predicted label is the label of the training sample closest to the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
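
To see the broadcasting in `NN1` at work, here is a minimal sketch on a made-up 2-D training set. The points, labels, and the name `nearest_label` are illustrative assumptions and are not part of the lab.

```python
import numpy as np

# Hypothetical toy data: four 2-D training points with integer labels
toy_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
toy_labels = np.array([0, 1, 1, 2])

def nearest_label(traindata, trainlabel, query):
    # (n, d) - (d,) broadcasts the query across every training row
    diff = traindata - query
    dist = (diff * diff).sum(axis=1)  # squared Euclidean distance per row
    return trainlabel[np.argmin(dist)]

query = np.array([4.0, 4.5])
print(nearest_label(toy_train, toy_labels, query))  # closest point is [5, 5], so the label is 2
```

The square root is skipped because it does not change which distance is the smallest.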

We will also define a 'random classifier', which assigns a uniformly random label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used at all; trainlabel is only used to find the set of classes

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
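
As a quick check of the definition, here is a small sketch with made-up labels (both arrays are hypothetical). It computes the same ratio as `Accuracy` by taking the mean of the element-wise comparison.

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 1])    # hypothetical ground-truth labels
pred = np.array([0, 1, 1, 2, 0])  # hypothetical predictions

# Element-wise comparison gives a boolean array; its mean is the fraction correct
accuracy = (gt == pred).mean()
print(accuracy)  # 3 of the 5 predictions match, so this is 0.6
```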

Let us make a function to split the dataset with the desired probability.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
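
Because `split` draws an independent random number for each sample, the first part holds roughly, not exactly, the requested fraction of the data. The sketch below illustrates this with a separate generator (`demo_rng`) and a made-up sample count, both of which are assumptions for illustration.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)  # separate generator, for illustration only
n = 10_000
rnd = demo_rng.random(n)  # one uniform draw in [0, 1) per sample
mask = rnd < 0.2          # True for samples sent to the first part

frac = mask.mean()
print(frac)  # close to 0.2, but generally not exactly 0.2
print(mask.sum() + (~mask).sum() == n)  # every sample lands in exactly one part
```

This is why the test set above ends up at about 20.08% of the data rather than exactly 20%.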

We will reserve about 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve about 25% of our train data as a validation set and keep the remaining 75% for training.

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


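For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour (at distance 0), as long as no two identical points carry different labels. The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.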
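
The 1/(number of classes) figure holds even when the classes are imbalanced: if class c has frequency p_c and each class is guessed with probability 1/K, the expected accuracy is the sum of p_c/K over all classes, which equals 1/K. The following sketch checks this by simulation; the class frequencies and sample count are made up for illustration.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=1)
n, k = 60_000, 6

# Hypothetical, deliberately imbalanced ground truth over k classes
class_probs = np.array([0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
gt = demo_rng.choice(k, size=n, p=class_probs)

# Uniform random guesses, in the same spirit as RandomClassifier
guess = demo_rng.integers(low=0, high=k, size=n)

acc = (gt == guess).mean()
print(acc, 1 / k)  # the two values should be close
```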

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour (about 0.34) is considerably lower than its train accuracy (1.0), while the validation accuracy of the random classifier stays about the same as its train accuracy (about 0.17). Even so, nearest neighbour is roughly twice as accurate as the random classifier on the validation set.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
The accuracy differs from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href="https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py">plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
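
One way to run such an experiment is sketched below. To stay self-contained and fast, it uses made-up data (two overlapping Gaussian blobs) instead of the housing dataset, and the names `demo_rng`, `nn_predict`, and `results` are assumptions for illustration. For each train fraction it repeats the random split 20 times and records the mean and standard deviation of the validation accuracy.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)

# Hypothetical stand-in data: two overlapping 2-D Gaussian blobs
n = 600
X = np.vstack([demo_rng.normal(0.0, 1.0, (n // 2, 2)),
               demo_rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def nn_predict(trainX, trainy, testX):
    # squared distances between every test row and every train row
    d = ((testX[:, None, :] - trainX[None, :, :]) ** 2).sum(axis=2)
    return trainy[d.argmin(axis=1)]

results = {}
for train_frac in [0.1, 0.5, 0.75, 0.9]:
    accs = []
    for _ in range(20):
        mask = demo_rng.random(n) < train_frac
        if mask.all() or not mask.any():
            continue  # skip degenerate splits where one part is empty
        pred = nn_predict(X[mask], y[mask], X[~mask])
        accs.append((pred == y[~mask]).mean())
    results[train_frac] = (np.mean(accs), np.std(accs))
    print(train_frac, results[train_frac])
```

The means and standard deviations in `results` can then be plotted against the train fraction. The same loop can be pointed at `split`, `NN`, and `RandomClassifier` from this lab, although nearest neighbour on the full dataset will take much longer.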

- Increasing the Percentage of Validation Set:
    - Pros: Increasing the percentage of the validation set (while decreasing the size of the training set) can lead to a more reliable estimate of the model's generalization performance. With a larger validation set, you have more data to evaluate the model's performance on, which can result in a more accurate assessment.
    - Cons: However, as you increase the size of the validation set, the size of the training set decreases. A smaller training set might lead to poorer model training, and your model may not generalize well to the test data. It can be challenging to train complex models with very limited training data.

- Reducing the Percentage of Validation Set:
    - Pros: Reducing the percentage of the validation set (while increasing the size of the training set) allows you to train your model on a larger portion of the data. This can be beneficial when you have a small overall dataset, as it provides more data for model training, potentially leading to a model that better captures underlying patterns.
    - Cons: On the downside, a smaller validation set might result in a less reliable estimate of the model's generalization performance. With less validation data, there's a higher chance of variability in the validation accuracy, making it more challenging to confidently assess the model's performance.

In practice, the choice of the validation set size depends on the specific problem and the amount of available data. Cross-validation techniques, such as k-fold cross-validation, can be useful when you have limited data. K-fold cross-validation divides the data into k subsets, where each subset serves as a validation set in turn while the others are used for training. This helps provide a more stable estimate of model performance without sacrificing too much training data.

- Representativeness: The training set should be representative of the overall data distribution. If the training set is too small, it may not capture the diversity of the data, leading to a poorly trained model. A larger training set is generally better for model training, as it provides more examples for the model to learn from.

- Generalization: The validation set serves as a proxy for the test set, helping you estimate how well your model will perform on unseen data. If the validation set is too small, it may not accurately represent the test set's characteristics. A larger validation set can provide a more reliable estimate of the model's generalization performance.

- Prediction Accuracy: The size of the validation set can affect how accurately you can predict the model's performance on the test set. If the validation set is small, there's a higher chance of variability in the evaluation metric (e.g., accuracy), making it less reliable as an indicator of the model's performance on unseen data.
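
The variability mentioned above can be quantified under a simplifying assumption: if the validation samples are independent and the true accuracy is p, the measured accuracy on n samples has a binomial standard error of sqrt(p(1-p)/n). The sketch below evaluates this for a few made-up validation set sizes; the function name `accuracy_standard_error` is hypothetical.

```python
import math

def accuracy_standard_error(p, n):
    # Binomial standard error of an accuracy p measured on n independent samples
    return math.sqrt(p * (1 - p) / n)

p = 0.34  # roughly the nearest-neighbour validation accuracy reported above
for n in (40, 400, 4000):
    print(n, round(accuracy_standard_error(p, n), 4))
```

The error shrinks with the square root of n, so making the validation set 100 times larger narrows the estimate by a factor of 10.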

Here are some scenarios to consider:

- Large Training Set, Small Validation Set: If you have a large training set and a small validation set, your model might be well-trained on the training data, but the validation set might not provide a robust estimate of generalization performance. The accuracy on the validation set may not reliably predict performance on the test set.

- Small Training Set, Large Validation Set: Conversely, if you have a small training set and a large validation set, your model may not be well-trained due to limited training data. However, the large validation set can still provide a more stable estimate of generalization performance.

- Balanced Training and Validation Sets: Ideally, you want a balance between the training and validation set sizes. This allows the model to be trained on a representative dataset while providing a sufficiently reliable estimate of its generalization performance.

In practice, cross-validation techniques, such as k-fold cross-validation, can help mitigate the impact of the validation set size on performance estimation. Cross-validation divides the data into multiple subsets, allowing you to train and validate the model multiple times. This provides a more robust estimate of how well the model generalizes to unseen data, regardless of the size of the validation set.
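
The k-fold procedure described above can be sketched as follows. This is a minimal illustration, not the method used in this lab: the helper `kfold_indices` is a hypothetical name, and in practice a library routine such as `sklearn.model_selection.KFold` would normally be used instead.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # Shuffle once, then cut the shuffled indices into k nearly equal folds
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]  # fold i is the validation set for this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

demo_rng = np.random.default_rng(seed=0)
for train_idx, val_idx in kfold_indices(10, 5, demo_rng):
    print(sorted(val_idx.tolist()), len(train_idx))
```

Unlike the random splits used in this lab, every sample is used for validation exactly once across the k rounds.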

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)">Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of the test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


- Reduced Variance: A single random split of your data into a training and validation set can result in a high variance in the evaluation metric. The choice of which data points end up in the training or validation set can significantly impact the performance metric. By averaging across multiple splits (e.g., k-fold cross-validation), you obtain a more stable estimate of the model's performance, as the variance in the evaluation metric is reduced.

- Better Generalization: Averaging over multiple splits helps you assess how well your model generalizes to different subsets of the data. If your model consistently performs well across all splits, it is more likely to generalize well to unseen data.

- Detecting Data Sensitivity: It can also help you identify cases where your model's performance is highly sensitive to the specific data split. For example, if the model performs exceptionally well on one split but poorly on another, it may indicate that your model is not robust and may require further tuning or regularization.

- More Robust Hyperparameter Tuning: When tuning hyperparameters or selecting the best model among several candidates, cross-validation provides a more robust basis for comparison. It reduces the risk of selecting a model that happened to perform well by chance on a particular data split.

- Validation Accuracy (Cross-Validation): Cross-validation is a technique used to estimate how well your model is likely to perform on unseen data based on the training data you have. It involves splitting your data into multiple subsets (folds), training and evaluating the model multiple times (using a different fold as the validation set each time), and then averaging the results. This process helps you assess the model's generalization performance and provides a more reliable estimate of its likely performance on new, unseen data from the same distribution as your training data.

- Test Accuracy: The test accuracy is the performance metric you compute using a separate, held-out dataset that the model has never seen during training or cross-validation. This dataset is meant to simulate how well your model will perform in a real-world scenario on completely new and unseen data.

While cross-validation helps you estimate how well your model generalizes to different subsets of the training data and provides a more reliable measure of its potential performance, it does not replace the need for a separate test set. The test set is essential for evaluating your model's true performance on unseen data and gives you an accurate estimate of how well it will perform in production.

In practice, you typically use cross-validation during model development and hyperparameter tuning to guide your decisions and assess your model's robustness. Once you are satisfied with your model's performance based on cross-validation results, you then evaluate its final performance using the test set to get a more accurate estimate of its performance in a real-world setting.
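
The variance reduction from averaging can be checked with a small simulation. The sketch below uses made-up data (two overlapping Gaussian blobs) rather than the housing dataset, and the names `one_split_accuracy` and `averaged_accuracy` are assumptions for illustration. It compares how much a single-split estimate fluctuates with how much an average over 10 splits fluctuates.

```python
import numpy as np

demo_rng = np.random.default_rng(seed=0)

# Hypothetical stand-in data: two overlapping 2-D Gaussian blobs
n = 300
X = np.vstack([demo_rng.normal(0.0, 1.0, (n // 2, 2)),
               demo_rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def one_split_accuracy(train_frac=0.75):
    # one random train/validation split followed by 1-nearest-neighbour prediction
    mask = demo_rng.random(n) < train_frac
    d = ((X[~mask][:, None, :] - X[mask][None, :, :]) ** 2).sum(axis=2)
    pred = y[mask][d.argmin(axis=1)]
    return (pred == y[~mask]).mean()

def averaged_accuracy(iterations):
    # same idea as AverageAccuracy: mean validation accuracy over several splits
    return np.mean([one_split_accuracy() for _ in range(iterations)])

single = [one_split_accuracy() for _ in range(100)]
averaged = [averaged_accuracy(10) for _ in range(100)]
print("std of single-split estimates:", np.std(single))
print("std of 10-split averages:     ", np.std(averaged))
```

Averaging makes the estimate more consistent, but it does not by itself make the estimate match the test accuracy: every split still trains on less data than the final model does.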



- Training Convergence: In many machine learning algorithms, training involves optimizing a model's parameters to minimize a loss function. The number of iterations determines how many times the optimization algorithm (e.g., gradient descent) updates the model's parameters. More iterations can lead to better convergence, meaning the model's parameters get closer to their optimal values. However, there's a point of diminishing returns, and training for too many iterations can lead to overfitting on the training data.

- Validation Performance: During training, you typically monitor the model's performance on a validation set (or a cross-validation fold) to assess how well it generalizes to unseen data. In general, as the number of iterations increases, the model's performance on the validation set may improve initially if it's still learning useful patterns. However, after a certain point, it may start overfitting the training data, and the validation performance may degrade. The optimal number of iterations for good generalization depends on factors like the complexity of the model and the amount of training data.

- Estimating Model Performance: The number of iterations does not directly affect the estimate of a model's performance. You still evaluate the model's performance on validation or test data using appropriate evaluation metrics (e.g., accuracy, F1 score, etc.) after training. The key is to choose the number of iterations that results in the best trade-off between model convergence and generalization performance.

In practice, it's common to monitor the validation performance during training and use techniques like early stopping to automatically halt training when the validation performance starts to degrade. This helps prevent overfitting and allows you to choose an appropriate number of iterations.
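
A minimal sketch of the early-stopping rule described above follows. The function name `early_stopping_epoch`, the `patience` parameter, and the validation scores are all hypothetical; the nearest-neighbour classifier in this lab has no training iterations, so this applies only to iteratively trained models.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return the index of the best epoch, stopping once `patience`
    consecutive epochs fail to improve on the best validation score."""
    best_score = float("-inf")
    best_epoch = -1
    waited = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch, waited = score, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation score has stopped improving
    return best_epoch

# Hypothetical validation accuracies recorded after each epoch
scores = [0.60, 0.68, 0.72, 0.71, 0.70, 0.69, 0.75]
print(early_stopping_epoch(scores))  # training stops at epoch 5; the best epoch is 2
```

With `patience=3` the later score of 0.75 is never reached, which shows the trade-off: a small patience saves time but can stop before a late improvement.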

In summary, while increasing the number of training iterations can improve a model's convergence, it doesn't necessarily result in a better estimate of its performance. The balance between training iterations and model generalization is crucial, and it's typically determined through experimentation and monitoring validation performance during training. Note that the `iterations` argument of `AverageAccuracy` counts random train/validation splits rather than optimizer steps: averaging over more splits makes the averaged validation accuracy more stable, with diminishing returns, but it does not change how much data each individual split trains on.

Increasing the number of iterations can help to some extent when dealing with a very small training dataset or validation dataset, but it is not a complete solution to the challenges posed by limited data. Here's why:

- Overfitting: When you have a very small training dataset, increasing the number of iterations can lead to overfitting. Overfitting occurs when the model learns to fit the training data noise or idiosyncrasies rather than capturing the underlying patterns. More iterations allow the model to become increasingly complex and potentially memorize the training data, resulting in poor generalization to new, unseen data.

- Validation Data: Increasing the number of iterations doesn't inherently address the problem of a small validation dataset. If your validation dataset is small, it may not accurately represent the overall performance of your model. A small validation dataset can lead to high variability in validation results, making it difficult to draw reliable conclusions about your model's generalization capabilities.

- Data Augmentation: Instead of relying solely on increasing iterations, a more effective approach for mitigating the effects of a small dataset is to use data augmentation techniques. Data augmentation involves creating additional training examples by applying various transformations (e.g., rotations, flips, cropping) to your existing data. This artificially increases the effective size of your training dataset and can help improve the model's ability to generalize.

- Regularization: Implementing regularization techniques like dropout, weight decay, or early stopping can also help prevent overfitting when you have limited data. These techniques encourage the model to have simpler, more generalizable representations.

- Transfer Learning: Another strategy for small datasets is to leverage pre-trained models using transfer learning. You can fine-tune a pre-trained model on your specific task, which often requires fewer iterations and can lead to better results when data is limited.

In summary, while increasing the number of iterations may help improve the model's performance on a small dataset to some extent, it should be done cautiously to avoid overfitting. Addressing the challenges of limited data often involves a combination of techniques, including data augmentation, regularization, and, in some cases, transfer learning, to enhance model generalization and robustness.
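
To make the data augmentation idea concrete, here is a minimal sketch on a made-up 3x3 grayscale "image". It assumes image-like data: flips and rotations of this kind do not carry over directly to tabular features such as those in the housing dataset.

```python
import numpy as np

# Hypothetical 3x3 grayscale "image"
image = np.arange(9).reshape(3, 3)

augmented = [
    image,
    np.fliplr(image),  # horizontal flip
    np.flipud(image),  # vertical flip
    np.rot90(image),   # 90-degree counter-clockwise rotation
]
print(len(augmented), "training examples from 1 original")
```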

