
# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will show a part of the ML pipeline by extracting features, training and testing.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes like income of the block, age of the houses per district etc. The task is to predict the cost of the houses per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate the prices to integers so that we can classify (np.int was deprecated in NumPy 1.20 and later removed)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label with the 1-nearest-neighbour rule.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
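
As a quick sanity check of the idea above, here is a minimal sketch on a made-up 2-D dataset. The points and labels are invented for illustration, and the function repeats the squared-distance computation of `NN1` under the hypothetical name `nn1_sketch` so that the snippet runs on its own.

```python
import numpy as np

def nn1_sketch(traindata, trainlabel, query):
    # same idea as NN1: squared Euclidean distance to every training row
    diff = traindata - query          # broadcasting: (n, d) - (d,) -> (n, d)
    dist = (diff * diff).sum(axis=1)  # one squared distance per training sample
    return trainlabel[np.argmin(dist)]

# toy data, invented for illustration
toy_train = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
toy_label = np.array([0, 0, 1])

pred_near_origin = nn1_sketch(toy_train, toy_label, np.array([0.2, 0.1]))
pred_far = nn1_sketch(toy_train, toy_label, np.array([4.0, 4.5]))
print(pred_near_origin, pred_far)
```

The first query is closest to the point at the origin and the second to the point at (5, 5), so the two predictions take the labels of those points.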

We will also define a 'random classifier', which assigns a randomly chosen label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used; trainlabel gives the set of classes and testdata the number of predictions to make

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
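
A small worked example may help here. The two label arrays below are invented for illustration; the computation is the same comparison-and-count used in `Accuracy`.

```python
import numpy as np

gt = np.array([1, 0, 2, 2, 1])    # invented ground-truth labels
pred = np.array([1, 0, 1, 2, 0])  # invented predictions

correct = (gt == pred).sum()      # elementwise comparison, then count the matches
acc = correct / len(gt)
print(correct, acc)               # 3 of the 5 predictions match, so accuracy is 0.6
```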

Let us make a function to split the dataset with the desired probability.

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
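
Because `split` draws one random number per sample, the two parts are only approximately the requested size. The sketch below illustrates this with a separate generator (`rng_demo`, with an arbitrarily chosen seed) and an invented sample count; it is not part of the lab's pipeline.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=0)  # separate generator, seed chosen arbitrarily

n = 10_000       # invented number of samples
percent = 0.2
rnd = rng_demo.random(n)   # one uniform number in [0, 1) per sample
mask1 = rnd < percent      # True with probability `percent`
mask2 = rnd >= percent     # the complement of mask1

frac1 = mask1.mean()       # fraction of samples that landed in the first part
print(mask1.sum(), mask2.sum(), frac1)
```

Every sample lands in exactly one of the two parts, but the first part holds close to, rather than exactly, 20% of the samples.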

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


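For nearest neighbour, the train accuracy is 1 because each training sample is its own nearest neighbour (at distance 0); it could only drop below 1 if two samples had identical features but different labels. The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.167 in our case.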
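
Both observations can be reproduced on synthetic data. The sketch below uses invented random features and labels (not the housing data) and a separate generator `rng_demo`; it assumes no two samples share identical features, which holds for continuous random features.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=1)  # arbitrary seed
n, d, k = 600, 3, 6                       # invented sizes: samples, features, classes
X = rng_demo.random((n, d))
y = rng_demo.integers(0, k, size=n)       # labels unrelated to the features

# 1-NN evaluated on its own training set: all pairwise squared distances, shape (n, n)
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
nn_pred = y[np.argmin(d2, axis=1)]        # the nearest sample is always the sample itself
nn_train_acc = (nn_pred == y).mean()

# random classifier: uniform guess among the k classes
random_pred = rng_demo.integers(0, k, size=n)
random_acc = (random_pred == y).mean()

print(nn_train_acc, random_acc, 1 / k)
```

Even though the labels here carry no information, nearest neighbour still scores perfectly on its own training set, which is why train accuracy alone says little about how well a classifier generalizes.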

Let us predict the labels for our validation set and get the accuracy

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is about twice that of the random classifier.

Now let us try another random split and check the validation accuracy.

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for splits, like 99.9% or 0.1%.
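
One possible way to organize such an experiment is sketched below. It runs on small synthetic data so that it finishes quickly; the helper names `nn_predict` and `random_split`, the data, and the list of percentages are all assumptions made for illustration. Extreme percentages are left out because they can produce an empty train or validation set, which would need an extra guard.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=2)  # arbitrary seed

# synthetic stand-in data: the label is determined by the first feature
n = 500
X = rng_demo.random((n, 2))
y = (X[:, 0] * 3).astype(int)             # classes 0, 1, 2

def nn_predict(train_x, train_y, test_x):
    # vectorized 1-nearest neighbour using squared Euclidean distances
    d2 = ((test_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    return train_y[np.argmin(d2, axis=1)]

def random_split(x, labels, percent):
    # same idea as split(): each sample goes to the first part with probability `percent`
    mask = rng_demo.random(len(labels)) < percent
    return x[mask], labels[mask], x[~mask], labels[~mask]

results = []  # (train percent, train size, validation size, validation accuracy)
for percent in [0.1, 0.3, 0.5, 0.7, 0.9]:
    tr_x, tr_y, va_x, va_y = random_split(X, y, percent)
    acc = (nn_predict(tr_x, tr_y, va_x) == va_y).mean()
    results.append((percent, len(tr_y), len(va_y), acc))
    print(percent, len(tr_y), len(va_y), round(acc, 3))
```

The first and last entries of each tuple in `results` can be passed to `plt.plot` to draw accuracy against the train percentage. Repeating the loop a few times per percentage also shows how much the measured accuracy varies from split to split.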

***1)*** In machine learning, the size of the validation set plays a crucial role in assessing the performance of a model during training and in preventing overfitting. The validation set is used to estimate how well a model generalizes to unseen data, and its size can have different effects on the accuracy and performance metrics of the model. Let's explore how changing the percentage of the validation set affects the model's accuracy and behavior:

1. **Increasing the Percentage of Validation Set:**
   - **Pros:**
     - As you allocate more data to the validation set, you get a better estimate of the model's generalization performance because the validation set is representative of a larger portion of your dataset.
     - A gap between training and validation accuracy, which is a sign of overfitting, can be measured more reliably when the validation set is larger.
   - **Cons:**
     - You have less data available for training, which might lead to slower model convergence or less accurate model parameters.
     - If the validation set becomes too large, you might not have enough data for effective model training, and the model might underfit.

   The accuracy on the validation set is likely to be a more reliable estimate of the model's generalization performance as you increase the size of the validation set. However, the training performance (accuracy on the training set) may decrease as more data is moved to the validation set.

2. **Reducing the Percentage of Validation Set:**
   - **Pros:**
     - You have more data available for training, which can lead to better model parameter estimation and potentially higher training performance.
     - The model may converge faster during training.
   - **Cons:**
     - A smaller validation set may lead to a less reliable estimate of the model's generalization performance, as it might not be representative of the entire dataset.
     - Overfitting may be harder to detect, because the accuracy measured on a few validation samples is noisy.

   With a smaller validation set, the accuracy on the validation set may be a less reliable indicator of generalization performance. The model's training performance is likely to be higher due to having more training data.

In practice, the size of the validation set is often determined empirically through techniques like cross-validation. Cross-validation involves splitting the dataset into multiple folds and training/validating the model on different subsets to get a more robust estimate of its performance. This can help mitigate the impact of the validation set size on model assessment.



***2)*** The size of the training and validation sets can significantly affect how well you can predict the accuracy on the test set using the validation set. In machine learning, this is often referred to as the reliability of the validation set as a proxy for test set performance. Here's how the sizes of these sets can impact this prediction:

1. **Larger Training Set:**
   - When you allocate a larger portion of your data to the training set, the model has more data to learn from. This typically leads to better model parameter estimation and can result in a model that is more representative of the underlying data distribution.
   - With a larger training set, the model is more likely to generalize well to unseen data, assuming it is not overfitting.
   - As a result, the validation set's performance (e.g., accuracy) tends to be a better predictor of the test set performance. If the model performs well on a large validation set, it's more likely to perform well on the test set.

2. **Larger Validation Set:**
   - A larger validation set can provide a more reliable estimate of the model's generalization performance because it's based on a larger sample of the data. This means that the validation set's performance is a better reflection of how the model is likely to perform on unseen data.
   - However, if you allocate too much data to the validation set, the training set becomes smaller, which can lead to slower convergence and potentially less accurate model parameters.
   - Despite a larger validation set, it's still essential to ensure that the training set is representative and provides sufficient data for the model to learn effectively.

3. **Balanced Split:**
   - A balanced split between the training and validation sets strikes a compromise between having enough data for training and a representative subset for validation.
   - This balance ensures that the model has a good chance of learning the underlying patterns in the data while still getting a reliable estimate of its generalization performance.
   - The performance on the validation set in a balanced split is a reasonably good indicator of the model's performance on the test set.

In summary, the sizes of the training and validation sets are interrelated, and their relative sizes affect how well you can predict the accuracy on the test set using the validation set. A well-chosen balance that considers the complexity of your model, the size of your dataset, and the need for reliable estimates of generalization performance is crucial. Cross-validation techniques can also be useful to mitigate the impact of a specific data split on model assessment by repeatedly splitting the data into different training and validation sets.
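
A rough way to quantify how reliable a validation accuracy is: if each validation prediction is treated as an independent correct/incorrect trial with true accuracy `p` (a simplifying assumption), the measured accuracy on `n` samples has a standard error of about `sqrt(p * (1 - p) / n)`. The sketch below evaluates this for a few invented validation sizes, using `p = 0.34` because it is close to the nearest-neighbour accuracies reported above.

```python
import math

def accuracy_std_error(p, n):
    # binomial standard error of an accuracy measured on n independent samples
    return math.sqrt(p * (1 - p) / n)

p = 0.34  # roughly the nearest-neighbour validation accuracy seen in this lab
for n_val in [16, 165, 4124, 16496]:  # invented validation set sizes
    print(n_val, round(accuracy_std_error(p, n_val), 4))
```

The error shrinks with the square root of the validation size, so a hundred times more validation data makes the estimate only about ten times more precise.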

***3)*** The choice of the percentage to reserve for the validation set in a machine learning task is not one-size-fits-all and depends on several factors, including the size of your dataset, the complexity of your model, and the available computational resources. However, there are some common practices and guidelines to consider when determining the validation set size:

1. **Rule of Thumb: 70/30 or 80/20 Split:**
   - A common starting point is to use a 70/30 or 80/20 split for training and validation, respectively. This means you reserve 70% or 80% of your data for training and the remaining 30% or 20% for validation.
   - This split is a good starting point for moderate-sized datasets and models of moderate complexity. It allows the model to learn from a substantial portion of the data while still providing a reasonably large validation set for performance estimation.

2. **Cross-Validation:**
   - Cross-validation techniques, such as k-fold cross-validation, can be used to get a more robust estimate of model performance. Instead of a fixed validation set size, you partition your data into k subsets (folds), train and validate the model k times, and then average the performance scores.
   - Cross-validation can help mitigate the impact of a specific data split on model assessment and provides a more reliable estimate of generalization performance.

3. **Data Size Considerations:**
   - If you have a very large dataset, you can afford to allocate a smaller percentage for the validation set since you still have a substantial amount of data for training.
   - Conversely, if you have a very small dataset, you might need to allocate a larger percentage for validation to ensure you have enough data to assess performance reliably.

4. **Model Complexity:**
   - More complex models, which are prone to overfitting, may benefit from larger validation sets to detect overfitting early. In such cases, you might consider a larger percentage for validation.

5. **Resource Constraints:**
   - Consider your available computational resources. Smaller validation sets result in faster training times, which can be crucial when you have limited resources.

6. **Plotting Learning Curves:**
   - Plot learning curves that show the model's performance on both the training and validation sets as you increase the validation set size or vary the training/validation split ratio.
   - Observe how the performance changes with different split ratios and choose the one that provides a good balance between training and validation sizes while maintaining reliable performance estimates.

In practice, it's often a good idea to experiment with different validation set sizes and use cross-validation to determine the optimal split for your specific problem and dataset. Keep in mind that the goal is to achieve a balance where the model can learn effectively while still providing a reliable estimate of its generalization performance on unseen data.
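
The k-fold partitioning mentioned above can be sketched with NumPy alone. The generator `kfold_indices` below is a hypothetical helper written for illustration, not a library function; it shuffles the sample indices, cuts them into k folds, and uses each fold once for validation.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # shuffle the indices, cut them into k folds, and use each fold once for validation
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

rng_demo = np.random.default_rng(seed=3)  # arbitrary seed
splits = list(kfold_indices(10, 5, rng_demo))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))   # 8 training and 2 validation indices per fold
```

Unlike the repeated random splits used in this lab, every sample appears in a validation fold exactly once.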

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


***1)*** Yes, averaging the validation accuracy (or other performance metrics) across multiple splits, typically through techniques like k-fold cross-validation, can provide more consistent and reliable results in machine learning. Cross-validation helps mitigate the impact of a specific data split on model assessment and offers several advantages:

1. **Reduced Variance:** When you split your data into multiple subsets and perform cross-validation, you essentially train and validate your model multiple times, each time on a different subset of the data. This reduces the variance in the performance estimate compared to a single fixed validation set, which may be sensitive to the particular data points included.

2. **Better Generalization Estimate:** Cross-validation provides a more accurate estimate of a model's generalization performance because it assesses how well the model performs on different subsets of the data. It provides a more robust indicator of how the model is likely to perform on unseen data.

3. **Effective Use of Data:** In machine learning, especially when data is limited, it's essential to make the most efficient use of the available data. Cross-validation allows you to leverage the entire dataset for both training and validation, which can lead to more informed decisions about your model's performance.

4. **Detecting Overfitting and Underfitting:** Cross-validation can help you detect overfitting or underfitting more effectively. By observing how performance metrics vary across different folds, you can identify whether your model is generalizing well or if it's suffering from high variance or bias.

5. **Hyperparameter Tuning:** Cross-validation is often used for hyperparameter tuning. By repeatedly training and validating the model with different hyperparameter settings, you can select the best combination that maximizes performance across multiple splits.

In summary, averaging validation accuracy (or other performance metrics) across multiple splits through cross-validation is a recommended practice in machine learning. It provides more stable and reliable results, reduces the risk of making decisions based on a single, possibly biased data split, and helps you make more informed choices about your model's performance and hyperparameter settings.
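
The variance reduction can be checked empirically. The sketch below is an illustration on invented synthetic data with noisy labels, using a hypothetical helper `one_split_accuracy`; it compares the spread of 30 single-split accuracies with the spread of 30 averages over 10 splits each.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=4)  # arbitrary seed
n = 300
X = rng_demo.random((n, 2))
# noisy binary labels, invented for illustration
y = (X[:, 0] + 0.3 * rng_demo.standard_normal(n) > 0.5).astype(int)

def one_split_accuracy(percent=0.75):
    # one random train/validation split followed by 1-nearest-neighbour prediction
    mask = rng_demo.random(n) < percent
    d2 = ((X[~mask][:, None, :] - X[mask][None, :, :]) ** 2).sum(axis=2)
    pred = y[mask][np.argmin(d2, axis=1)]
    return (pred == y[~mask]).mean()

single = np.array([one_split_accuracy() for _ in range(30)])
averaged = np.array([np.mean([one_split_accuracy() for _ in range(10)])
                     for _ in range(30)])

print("std of single-split accuracies:", single.std())
print("std of 10-split averages:      ", averaged.std())
print("means:", single.mean(), averaged.mean())
```

The averaged estimates should cluster more tightly, while both kinds of estimate centre on roughly the same value: averaging makes the estimate more consistent but does not move its centre.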

***2)***  Cross-validation, while providing a more accurate estimate of a model's generalization performance compared to a single validation split, still doesn't give a direct estimate of the test accuracy. However, it does provide a more reliable and robust estimate of how your model is likely to perform on unseen data compared to a single validation split. Here's why:

1. **Consistency:** By repeatedly splitting your data into different subsets (folds) and performing cross-validation, you get a more consistent estimate of your model's performance. This consistency helps in reducing the impact of random variations that may occur with a single validation split.

2. **Better Generalization Estimate:** Cross-validation assesses how well your model performs on various subsets of the data. It provides a more informed and reliable estimate of your model's generalization performance. This estimate is likely to be closer to the true test performance because it considers a more comprehensive view of your data.

3. **Model Selection and Hyperparameter Tuning:** Cross-validation is often used for model selection and hyperparameter tuning. Models or hyperparameter settings that perform well across multiple cross-validation folds are more likely to perform well on unseen test data.

4. **Risk Mitigation:** Using cross-validation reduces the risk of making decisions based on an overly optimistic or pessimistic assessment of your model's performance. A single validation split might, by chance, result in unusually good or bad performance, leading to misleading conclusions.

However, it's important to note that while cross-validation provides a more reliable estimate of model performance, it is still an estimate and not a direct measure of test accuracy. The true test accuracy can only be obtained by evaluating your model on completely unseen data that it has not been exposed to during training or cross-validation.

In practice, cross-validation is a valuable tool for assessing and comparing models, selecting the best model or hyperparameters, and getting a more accurate sense of how well your model is likely to perform in real-world scenarios. But, for reporting final test accuracy, you should reserve a separate, completely untouched test dataset that is not used during model development or hyperparameter tuning. This test dataset provides the most accurate estimate of how your model will perform when deployed in the real world.

***3)*** In this lab, the number of iterations is the number of random train/validation splits that `AverageAccuracy` averages over. More splits make the averaged estimate more stable, with diminishing returns, but they do not remove any systematic gap between validation and test accuracy. The word "iterations" is also used for training iterations or epochs, which affect the model itself rather than the estimate of its performance. Let's clarify the impact of the number of iterations in both contexts:

**Number of Iterations During Training:**
- When training a machine learning model, particularly deep learning models like neural networks, the number of training iterations or epochs can significantly impact the model's performance.
- Increasing the number of training iterations can lead to better model convergence, where the model learns the underlying patterns in the data more effectively.
- With more iterations, the model may be able to fit the training data better, potentially improving training performance metrics.

However, it's important to note that there's a trade-off with the number of training iterations:

- Too few iterations might result in underfitting, where the model doesn't capture the data's complexity.
- Too many iterations can lead to overfitting, where the model starts to memorize the training data and performs poorly on unseen data (including the validation or test set).

Therefore, finding the optimal number of training iterations is often determined through techniques like early stopping, where you monitor the model's performance on a validation set and stop training when it starts to degrade.

**Number of Iterations in Cross-Validation:**
- In the context of cross-validation or k-fold cross-validation, the number of iterations typically refers to how many times you split the data into folds and perform the validation.
- Increasing the number of cross-validation folds (iterations) can provide a more robust estimate of model performance. With more folds, you assess the model's performance on different subsets of the data, reducing the impact of the specific data split.
- However, increasing the number of folds means the model is trained more times, which increases the total computation, and each validation fold contains fewer samples, so the score of any individual fold becomes noisier.

In summary, the effect of the number of iterations on model performance estimation varies depending on whether you're referring to training iterations or cross-validation iterations:

- In training, the number of iterations should be carefully chosen to balance model convergence and the risk of overfitting. More iterations are not necessarily better if they lead to overfitting.

- In cross-validation, increasing the number of iterations (folds) can provide a more robust estimate of model performance. However, it should be balanced with the computational cost and the amount of data available for validation in each fold. In practice, 5-fold or 10-fold cross-validation is common.
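
To make the fold-size trade-off concrete, the arithmetic below shows how the training and validation sizes change with the number of folds. It uses the 16496 non-test samples reported earlier and an invented list of fold counts; the sizes are approximate because the sample count is not always divisible by k.

```python
n = 16496  # number of non-test samples reported earlier in this lab

fold_sizes = {}
for k in [2, 5, 10, 20]:
    val_size = n // k          # approximate size of one validation fold
    train_size = n - val_size  # the remaining k - 1 folds are used for training
    fold_sizes[k] = (train_size, val_size)
    print(k, train_size, val_size)
```

With more folds, each model trains on more data but is validated on fewer samples, and the number of models to train grows with k.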

***4)*** Increasing the number of iterations during training can sometimes help mitigate the impact of a very small training dataset, but it's not a guaranteed solution, and it comes with trade-offs. Let's explore the implications of using more iterations to compensate for a small training dataset:

**Mitigating the Impact of a Small Training Dataset with More Iterations:**

1. **Pros:**

   - More iterations can allow the model to see the same data multiple times (especially in techniques like gradient descent), which may help it converge to a better solution.
   
   - In deep learning, more iterations can help the model learn more complex representations from limited data.

2. **Cons:**

   - Overfitting Risk: Increasing the number of iterations can increase the risk of overfitting, especially when the training dataset is very small. The model may start to memorize the training examples rather than learning generalizable patterns.

   - Computational Cost: More iterations require more computational resources, which can be expensive and time-consuming.

   - Diminishing Returns: There's a point of diminishing returns where further iterations may not significantly improve performance but increase computational cost.

**Mitigating the Impact of a Small Validation Dataset with More Iterations:**

1. **Pros:**

   - When the validation dataset is small, averaging over more splits (or folds) can help in obtaining a more reliable estimate of model performance, because the noise of any single small validation set is averaged out.

   - More iterations can give you a better sense of the model's performance stability across different data splits.

2. **Cons:**

   - Computational Cost: As you increase the number of cross-validation folds (iterations), the training and evaluation process becomes more computationally expensive.

   - Data Availability: More iterations do not add data. With repeated random splits, each run still trains on the same small fraction, so the averaged accuracy stays pessimistic when the training set is very small.

In summary, while increasing the number of iterations can be a strategy to deal with very small training or validation datasets, it's not a silver bullet and must be used judiciously. It may help the model learn more effectively from limited data, but it also increases the risk of overfitting and comes with increased computational costs. In the case of small validation datasets, increasing the number of cross-validation folds can provide a more stable estimate of performance, but this also requires careful consideration of computational resources and data availability.

The ideal approach often involves a combination of strategies, including data augmentation, regularization techniques, and model selection, to address the limitations posed by small datasets while maintaining model generalization.
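
The limits of adding iterations can be illustrated with a small experiment. The sketch below uses invented synthetic data and a hypothetical helper `average_val_accuracy`, which mirrors the idea of `AverageAccuracy`; it averages 40 random splits once with a tiny training fraction and once with a 75% training fraction.

```python
import numpy as np

rng_demo = np.random.default_rng(seed=5)  # arbitrary seed
n = 500
X = rng_demo.random((n, 2))
y = (X[:, 0] * 3).astype(int)             # classes 0, 1, 2, determined by the first feature

def average_val_accuracy(percent, iterations):
    # average 1-NN validation accuracy over several random splits
    total = 0.0
    done = 0
    while done < iterations:
        mask = rng_demo.random(n) < percent
        if mask.sum() == 0 or mask.sum() == n:
            continue  # skip degenerate splits with an empty train or validation set
        d2 = ((X[~mask][:, None, :] - X[mask][None, :, :]) ** 2).sum(axis=2)
        pred = y[mask][np.argmin(d2, axis=1)]
        total += (pred == y[~mask]).mean()
        done += 1
    return total / iterations

tiny_train = average_val_accuracy(0.03, 40)    # about 15 training samples per split
normal_train = average_val_accuracy(0.75, 40)  # about 375 training samples per split
print(tiny_train, normal_train)
```

Averaging makes both numbers stable, but the accuracy obtained with the tiny training set stays lower: more iterations reduce the noise of the estimate, not the handicap of training on too little data.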