<a href="https://colab.research.google.com/github/veerankiteja/FMML-LAB-ASSIGNMENTS/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through a part of the ML pipeline: extracting features, training a classifier, and testing it.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set randomseed
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
# truncate the continuous target to integers so that we can classify;
# the np.int alias was deprecated in NumPy 1.20 and later removed, so use the built-in int
dataset.target = dataset.target.astype(int)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts the label of a query using the 1-nearest-neighbour rule, followed by a wrapper that applies it to every test sample.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
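
The per-query loop above can be slow on the full dataset. As a rough sketch of an alternative (not part of the original lab), the same 1-nearest-neighbour rule can be written with matrix operations. The name `nn_vectorized` and the toy arrays are made up for illustration, and the full distance matrix may need to be computed in chunks when both sets are large.

```python
import numpy as np

def nn_vectorized(traindata, trainlabel, testdata):
    # Squared Euclidean distance between every test row and every train row,
    # using ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 (shape: n_test x n_train).
    test_sq = (testdata ** 2).sum(axis=1)[:, None]
    train_sq = (traindata ** 2).sum(axis=1)[None, :]
    dist = test_sq - 2.0 * testdata @ traindata.T + train_sq
    # For each test row, pick the label of the closest training row.
    return trainlabel[np.argmin(dist, axis=1)]

# Toy check: two training points, three queries.
toy_train = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_label = np.array([0, 1])
toy_test = np.array([[1.0, 1.0], [9.0, 8.0], [4.0, 4.0]])
toy_pred = nn_vectorized(toy_train, toy_label, toy_test)
print(toy_pred)  # [0 1 0]
```

The expanded distance formula can differ from the direct one by tiny rounding errors, so exact ties may be broken differently than in `NN1`.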

We will also define a 'random classifier', which randomly allots labels to each sample

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not actually used; trainlabel only supplies the set of possible classes

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
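
As a small worked example of the definition (the label values are made up, and `accuracy` is a stand-alone re-statement of the function above so that the snippet runs on its own): with 5 samples of which 4 are predicted correctly, the accuracy is 4/5.

```python
import numpy as np

def accuracy(gtlabel, predlabel):
    # Fraction of positions where the prediction equals the ground truth.
    gtlabel = np.asarray(gtlabel)
    predlabel = np.asarray(predlabel)
    assert len(gtlabel) == len(predlabel)
    return (gtlabel == predlabel).mean()

gt = [0, 1, 2, 2, 1]
pred = [0, 2, 2, 2, 1]  # only the second sample is wrong
acc = accuracy(gt, pred)
print(acc)  # 0.8
```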

Let us make a function to split the dataset with the desired probability.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
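
Because each sample is assigned independently, the split sizes are only approximately equal to the requested fraction. The following sketch illustrates this on made-up data; `split_demo` and `rng_demo` are hypothetical names, and a separate generator is used so that the lab's own `rng` stream is not disturbed.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # separate generator, for this demo only

def split_demo(data, label, percent, rng):
    # Each sample independently lands in the first part with probability `percent`.
    mask = rng.random(len(label)) < percent
    return data[mask], label[mask], data[~mask], label[~mask]

data = np.arange(2000, dtype=float).reshape(1000, 2)  # 1000 fake samples, 2 features
label = np.arange(1000) % 6                            # 6 fake classes
d1, l1, d2, l2 = split_demo(data, label, 0.2, rng_demo)
frac = len(l1) / len(label)
print(len(l1), len(l2), frac)  # the fraction is close to 0.2, but rarely exactly 0.2
```

This is why the lab reports 20.08% rather than exactly 20% for the test set.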

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


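For nearest neighbour, the train accuracy is 1 because every training sample is its own nearest neighbour (at distance 0); this can only fail if two identical samples carry different labels. The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case, since the integer-valued target takes the 6 values 0 to 5.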
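
The 1/(number of classes) figure holds even when the classes are imbalanced: if class c has proportion p_c and there are K classes, a uniform guess is right with probability sum_c p_c * (1/K) = 1/K. A quick simulation sketch, where the class proportions are invented for the demo and are not those of the housing data:

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

# Imbalanced ground truth over 6 classes (proportions are made up for the demo).
true_labels = rng_demo.choice(6, size=60000, p=[0.4, 0.3, 0.15, 0.1, 0.04, 0.01])
# Uniform random guesses over the same 6 classes, as RandomClassifier does.
guesses = rng_demo.integers(low=0, high=6, size=true_labels.size)
chance_acc = (true_labels == guesses).mean()
print(chance_acc, 1 / 6)  # the two values should be close
```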

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour (about 0.34) is considerably lower than its train accuracy (1.0), while the validation accuracy of the random classifier stays roughly the same (about 0.17). Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy differs slightly from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.
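
One possible way to organise such an experiment is sketched below. To keep it fast and runnable without downloading anything, it uses a small synthetic dataset (three Gaussian blobs) as a stand-in for the housing data; the names `nn_predict`, `points`, `labels` and `results` are made up for this sketch, and the numbers it prints say nothing about the housing dataset itself.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)

# Synthetic stand-in for the housing data: 3 Gaussian blobs in 2-D.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = rng_demo.integers(0, 3, size=600)
points = centers[labels] + rng_demo.normal(size=(600, 2))

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour via pairwise squared distances (fine for small arrays).
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff ** 2).sum(axis=2), axis=1)]

results = {}
for train_frac in [0.05, 0.25, 0.5, 0.75, 0.95]:
    accs = []
    for _ in range(20):
        mask = rng_demo.random(len(labels)) < train_frac
        if mask.sum() == 0 or (~mask).sum() == 0:
            continue  # skip degenerate splits with an empty train or validation set
        pred = nn_predict(points[mask], labels[mask], points[~mask])
        accs.append((pred == labels[~mask]).mean())
    # Mean and spread of the validation accuracy for this train fraction.
    results[train_frac] = (np.mean(accs), np.std(accs))
    print(train_frac, results[train_frac])
```

The means and spreads collected in `results` can then be passed to `plt.plot` to see how both change with the split percentage.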

1. Increasing the percentage of the validation set:

Pros:

- A larger validation set gives a more reliable estimate of the model's performance on unseen data, because the accuracy is averaged over more samples. This is easiest to afford when the dataset is large, since the training set stays big even after more data is set aside.
- The measured accuracy fluctuates less from one random split to the next.

Cons:

- Every sample moved to the validation set is removed from the training set. If the training set is not very large to begin with, the model has less data to learn from, so its validation accuracy tends to drop. For nearest neighbour, fewer training points mean that the nearest neighbour of a query is, on average, farther away and less likely to share its label.
- The validation accuracy then describes a model trained on less data than the final one, so it tends to underestimate the test accuracy.

Reducing the percentage of the validation set:

Pros:

- More data is available for training, which matters most when the dataset is small. For nearest neighbour, the validation accuracy tends to rise slightly.

Cons:

- The estimate of performance on unseen data becomes less reliable. With a small validation set the accuracy has high variability: at an extreme split such as 99.9% train data, only a handful of validation samples remain, and the accuracy can change widely between runs.

For the random classifier, the split percentage does not change the expected accuracy, which stays near 1/(number of classes); only the run-to-run variability changes with the size of the validation set.

The choice of the percentage is therefore a trade-off between having enough validation data to estimate performance reliably and having enough train data to build a good model. The best value depends on the size of the dataset, the complexity of the model, and the nature of the problem. Cross-validation techniques, such as k-fold cross-validation, reduce the impact of this choice: the data is divided into multiple subsets, and each subset takes a turn as the validation set.







2. The sizes of the train set and validation set affect how well the validation accuracy predicts the test accuracy in several ways:

Very small validation set:

- The validation accuracy is computed from few samples, so it is noisy. It may land well above or well below the test accuracy purely by chance; the error has no preferred direction.

Very small train set (very large validation set):

- The validation accuracy is stable, but it is measured on a model that has seen little data. In this lab the test predictions use all of `alltraindata`, so the validation accuracy of a model trained on a small fraction of it tends to underestimate the test accuracy.

Overfitting to the validation set:

- This happens only when the validation accuracy is used to make choices, such as selecting a model or tuning hyperparameters. The more choices are made against a small validation set, the more optimistic its accuracy becomes. The validation samples themselves are never used for training.

Balance:

- Ideally, the validation set is large enough to give a reasonably precise estimate of performance on unseen data, but not so large that it noticeably shrinks the train set. This balance is usually found through experimentation.

Cross-validation:

- In k-fold cross-validation, the data is divided into k subsets, and the model is trained and validated k times, each time with a different subset as the validation set. Averaging the k results reduces the variability caused by any single split.

In summary, a small validation set makes the estimate noisy, and a small train set makes it pessimistic. A reliable prediction of the test accuracy needs a split that keeps both effects small, and cross-validation techniques reduce the dependence on one particular split.
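
To put rough numbers on how the validation-set size affects the reliability of the estimate, the sketch below uses the binomial standard error sqrt(p(1-p)/n). It assumes that the validation samples are independent and that the model is fixed; the accuracy 0.34 is taken from the runs above, and the function name is made up.

```python
import math

def accuracy_std_error(acc, n_val):
    # Standard error of an accuracy measured on n_val independent samples.
    return math.sqrt(acc * (1.0 - acc) / n_val)

# Each 100-fold increase in validation size cuts the standard error by a factor of 10.
for n_val in [40, 400, 4000]:
    print(n_val, round(accuracy_std_error(0.34, n_val), 4))
```

Under these assumptions, a validation set of about 4000 samples pins the accuracy down to within roughly one percentage point, while 40 samples leave an uncertainty of several percentage points.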







3. There is no one-size-fits-all answer to what percentage of the dataset should be reserved for the validation set, as the ideal split depends on various factors, including the size of your dataset, the complexity of your model, and the nature of your problem. However, here are some general guidelines and common practices to consider when choosing the size of the validation set:

80/20 Rule: A common starting point is the 80/20 rule, where you allocate 80% of your data to the training set and 20% to the validation set. This split is often used when you have a reasonably large dataset.

70/30 or 75/25 Split: For smaller datasets, you might consider a larger portion of the data for validation. A 70/30 or 75/25 split can provide more data for validation while still leaving a significant portion for training.

Cross-Validation: Instead of a fixed percentage split, you can use cross-validation techniques like k-fold cross-validation. For example, in 5-fold cross-validation, you divide your data into 5 subsets, and each time you train on 4 of them and validate on the remaining 1. This approach provides a robust estimate of model performance and helps mitigate the impact of the validation set size.

Stratified Split: If your dataset has class imbalance, consider using stratified sampling to ensure that each class is represented proportionally in both the training and validation sets. This can help ensure a representative validation set.

Data Size Consideration: If you have a very large dataset, you can afford to allocate a smaller percentage to the validation set while still having a reasonably large validation set. Conversely, if your dataset is small, you may need to allocate a larger percentage to validation to obtain reliable performance estimates.

Experimentation: Ultimately, the choice of the validation set size should involve experimentation. You can try different splits and cross-validation strategies and monitor how they affect model performance. The goal is to find a balance that provides a reliable estimate of generalization performance without sacrificing too much training data.

Domain Knowledge: Consider any domain-specific knowledge or requirements that might influence the validation set size. For example, in some scientific experiments, there may be specific guidelines for splitting data.

Remember that there is no fixed rule for the "perfect" validation set size, and it often depends on the specific context of your project. The key is to make an informed decision based on your dataset and problem characteristics and to ensure that your choice allows for reliable model evaluation and generalization to unseen data.
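
The k-fold idea mentioned in the guidelines above can be sketched with plain NumPy as follows. This is only an illustration: `kfold_indices` is a hypothetical helper written for this note, not a function from the lab or from scikit-learn.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # Shuffle once, then cut the shuffled indices into k nearly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]  # fold i is the validation set this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng_demo = np.random.default_rng(seed=0)
fold_sizes = [(len(tr), len(va)) for tr, va in kfold_indices(10, 5, rng_demo)]
print(fold_sizes)  # [(8, 2), (8, 2), (8, 2), (8, 2), (8, 2)]
```

Each sample lands in the validation set exactly once, so 5 folds correspond to an 80/20 split repeated five times without overlap between the validation sets.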







## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies
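
A possible variant, sketched here on toy data, keeps every per-split accuracy so that the spread can be reported along with the average. The names `accuracy_stats` and `nn_predict` and the two-blob dataset are assumptions made for this sketch; they are not part of the lab code.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour via pairwise squared distances (fine for small arrays).
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff ** 2).sum(axis=2), axis=1)]

def accuracy_stats(alldata, alllabel, splitpercent, iterations, classifier, rng):
    # Same loop as AverageAccuracy, but every per-split accuracy is kept.
    accs = np.empty(iterations)
    for ii in range(iterations):
        mask = rng.random(len(alllabel)) < splitpercent  # True marks train samples
        pred = classifier(alldata[mask], alllabel[mask], alldata[~mask])
        accs[ii] = (pred == alllabel[~mask]).mean()
    return accs.mean(), accs.std()

# Toy data: two well-separated Gaussian blobs with labels 0 and 1.
rng_demo = np.random.default_rng(seed=0)
toy_label = rng_demo.integers(0, 2, size=400)
toy_data = rng_demo.normal(size=(400, 2)) + 2.5 * toy_label[:, None]
mean_acc, std_acc = accuracy_stats(toy_data, toy_label, 0.75, 10, nn_predict, rng_demo)
print(mean_acc, std_acc)
```

The standard deviation shows how far a single split could have been from the average.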

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of the test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


1. Yes. Averaging the validation accuracy across multiple splits gives more consistent results than a single split, because it reduces the effect of the randomness in any one split. This holds for the repeated random splits used in `AverageAccuracy` as well as for k-fold cross-validation.

Here's how it works:

Variability Reduction: In a single train-validation split, the specific data points that happen to fall in the validation set influence the validation accuracy. Repeating the process with different random splits and averaging the results reduces the impact of any particular split.

Better Estimation: The average over several splits is a more stable estimate of the model's performance, so it gives a more reliable indication of how well the model is likely to perform on unseen data.

Effective Use of Data: Instead of fixing one portion of the data as the validation set (which can be problematic for small datasets), different subsets take turns as the validation set, so that data points contribute to both training and validation across the runs.







2. Not necessarily. Averaging over multiple splits makes the estimate more reliable (less variable), but it does not by itself bring the estimate closer to the test accuracy. Here's the distinction:

Accuracy vs. Reliability:

- Averaging reduces the variance of the estimate: the result depends less on the specific random split than a single train-validation split does. It does not remove any systematic difference (bias) between validation accuracy and test accuracy.

Source of the bias:

- Each model in the loop is trained on only about 75% of `alltraindata`, while the test predictions use all of it. The averaged validation accuracy therefore tends to sit slightly below the test accuracy. This matches the run above, where the average validation accuracy is about 0.336 and the test accuracy is about 0.349.

Assessment of Generalization:

- The true accuracy on unseen data is unknown, and the test accuracy is itself measured on a finite sample. The averaged estimate is more trustworthy because it accounts for the variability between splits, not because it is guaranteed to be closer to the test value.

Model Selection and Hyperparameter Tuning:

- The reduced variability is particularly useful when comparing multiple models or tuning hyperparameters, since small differences between candidates are less likely to be caused by a lucky split.

In summary, averaging gives a more stable estimate of generalization performance, but it does not necessarily make the estimate more accurate in an absolute sense.







3. Here, the number of iterations is the number of random train/validation splits that `AverageAccuracy` averages over. It is not the number of training epochs: nearest neighbour has no iterative training, so underfitting, overfitting and early stopping over epochs do not apply to this question.

Less Variability:

Each split gives a slightly different validation accuracy. Averaging over more splits reduces the run-to-run variability of the average, so repeating the whole experiment gives closer results when the number of iterations is higher.

Diminishing Returns:

The improvement slows down as the iterations increase. If the splits were independent, the spread of the average would shrink in proportion to 1/sqrt(iterations); since all splits are drawn from the same data, the real reduction is somewhat smaller. Going from 1 to 10 iterations helps far more than going from 10 to 100.

No Change in Bias:

More iterations do not move the value that the average settles on. If the validation models are trained on less data than the final model, the average stays slightly below the test accuracy no matter how many iterations are used.

Computational Resources:

The running time grows linearly with the number of iterations, and each nearest-neighbour evaluation is already slow on this dataset.

In summary, higher iterations give a more stable estimate, with diminishing returns, but not a less biased one. A moderate number of iterations is usually enough; beyond that, the extra computation buys very little.
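
For the repeated-split averaging done by `AverageAccuracy`, the effect of the iteration count on the stability of the average can be illustrated with a small simulation. It is idealised: each validation accuracy is drawn independently from a binomial distribution with an assumed true accuracy of 0.34 and an assumed validation size of 400, whereas real splits overlap and are correlated, so the actual reduction is smaller.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)
true_acc, n_val = 0.34, 400  # assumed accuracy and validation-set size

def spread_of_average(iterations, repeats=2000):
    # Simulate `repeats` experiments, each averaging `iterations` validation accuracies,
    # and return the standard deviation of those averages.
    accs = rng_demo.binomial(n_val, true_acc, size=(repeats, iterations)) / n_val
    return accs.mean(axis=1).std()

spreads = {k: spread_of_average(k) for k in [1, 10, 100]}
print(spreads)  # the spread shrinks roughly like 1/sqrt(iterations)
```

In this idealised setting, ten times more iterations reduce the spread only by a factor of about 3, which is the diminishing return in practice.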







4. Only partly. Increasing the number of iterations (splits) helps with a very small validation dataset, but it cannot make up for a very small train dataset. Here are some key points to consider:

Small Validation Set: A single small validation set gives a very noisy accuracy. Averaging over many random splits smooths out this noise, because different samples take turns in the validation set. More iterations therefore compensate, to a good extent, for a small validation set.

Small Train Set: Each model in the loop is still trained on very few samples, so every individual validation accuracy is low, and so is their average. The average then underestimates the accuracy of the final model trained on all of the train data.

Limited Information: A small dataset contains limited information about the underlying patterns. No number of additional iterations can create information that is not present in the data; iterations only reduce the noise in the estimate.

Cost: Every extra iteration adds a full round of nearest-neighbour predictions, so the improvement has to be weighed against the running time.

In summary, increasing the iterations makes the estimate from small validation sets more stable, but it does not improve a model that is starved of train data. A balanced split, or k-fold cross-validation in which every sample is used for both training and validation, addresses both problems better than iterations alone.





