<a href="https://colab.research.google.com/github/umakoduru2204/FMML-LAB-ASSIGNMENT/blob/main/Module_01_Lab_02_MLPractice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>


In this lab, we will walk through part of the ML pipeline: extracting features, training a model, and testing it.

In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seeded random number generator, so that results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income of the block and the median age of its houses. The task is to predict the median house value per district.

Let us download and examine the dataset.

In [2]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # truncate the targets to integers so that we can treat them as class labels
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function that predicts a label using the 1-nearest neighbour rule, followed by a wrapper that applies it to every test sample.

In [3]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  label = trainlabel[np.argmin(dist)] # our predicted label is the label of the training data which has the least distance from the query
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
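To see what these functions compute, here is a small check on made-up 2-D points (toy data, not taken from the housing dataset). It restates the same squared-distance rule in vectorized form; `nn_predict` is a hypothetical name introduced only for this sketch.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # pairwise squared Euclidean distances, shape (n_test, n_train)
    diff = testdata[:, None, :] - traindata[None, :, :]
    dist = (diff * diff).sum(axis=2)
    # label of the closest training point for each test point
    return trainlabel[np.argmin(dist, axis=1)]

toy_train = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
toy_label = np.array([0, 1, 2])
toy_query = np.array([[1.0, 1.0], [9.0, 8.0], [1.0, 9.0]])

toy_pred = nn_predict(toy_train, toy_label, toy_query)
print(toy_pred)  # [0 1 2]: each query takes the label of its closest training point
```

The broadcasting step builds an array of shape (n_test, n_train, n_features), so this form is only suitable for small inputs; the loop-based version above uses far less memory on the full dataset.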

We will also define a 'random classifier', which assigns a randomly chosen label to each sample.

In [4]:
def RandomClassifier(traindata, trainlabel, testdata):
  # in reality, we don't need these arguments

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm.

In [5]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)
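As a quick worked example of the metric, the sketch below computes accuracy on two short made-up label arrays. It converts its inputs with `np.asarray` first, since comparing two plain Python lists with `==` yields a single boolean instead of an element-wise result; the lowercase name `accuracy` is used only for this illustration.

```python
import numpy as np

def accuracy(gtlabel, predlabel):
    # convert to arrays so that == compares element by element
    gtlabel = np.asarray(gtlabel)
    predlabel = np.asarray(predlabel)
    assert len(gtlabel) == len(predlabel), "label arrays must have the same length"
    # mean of a boolean array is the fraction of True entries
    return (gtlabel == predlabel).mean()

acc = accuracy([0, 1, 2, 2, 1], [0, 1, 1, 2, 0])
print(acc)  # 3 of the 5 labels match, so 0.6
```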

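Let us make a function that splits the dataset at random. Each sample goes to the first part with probability `percent`, which is passed as a fraction between 0 and 1, so the split sizes are only approximately the requested proportion.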

In [6]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
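Because each sample is assigned independently, the sizes of the two parts vary from run to run. The sketch below contrasts this with a shuffle-and-cut split that gives exact sizes. Both function names, the toy data, and the seed are assumptions made for illustration, not part of the lab code.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumed seed, for repeatability only

def random_split(data, label, fraction):
    # each sample lands in part 1 independently with probability `fraction`
    mask = rng.random(len(label)) < fraction
    return data[mask], label[mask], data[~mask], label[~mask]

def exact_split(data, label, fraction):
    # shuffle the indices and cut, so part 1 has exactly round(fraction * n) samples
    order = rng.permutation(len(label))
    cut = int(round(fraction * len(label)))
    first, second = order[:cut], order[cut:]
    return data[first], label[first], data[second], label[second]

data = np.arange(2000).reshape(1000, 2)  # toy data: 1000 samples, 2 features
label = np.arange(1000) % 3              # toy labels

_, l1, _, l2 = random_split(data, label, 0.2)
print("random split sizes:", len(l1), len(l2))  # near 200 and 800, depends on the seed
_, e1, _, e2 = exact_split(data, label, 0.2)
print("exact split sizes: ", len(e1), len(e2))  # 200 800
```

Either approach works for the experiments here; the probabilistic version is simply why the measured percentages in this lab are close to, but not exactly, the requested ones.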

We will reserve 20% of our dataset as the test set. We will not change this portion throughout our experiments.

In [7]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


## Experiments with splits

Let us reserve some of our train data as a validation set.

In [8]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

What is the accuracy of our classifiers on the train dataset?

In [9]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


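For nearest neighbour, the train accuracy is 1 here because every training sample is its own nearest neighbour (this can fail only when identical feature vectors carry different labels). The accuracy of the random classifier is close to 1/(number of classes), which is 1/6 ≈ 0.1667 in our case.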
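The 1/(number of classes) figure follows from a short expectation. Assuming the random classifier picks each of the $k$ classes with equal probability, independently of the true label $y$:

$$
\mathbb{E}[\text{Accuracy}] = P(\hat{y} = y) = \sum_{c=1}^{k} P(y = c)\,P(\hat{y} = c) = \sum_{c=1}^{k} P(y = c)\cdot\frac{1}{k} = \frac{1}{k}
$$

This holds whatever the class proportions are, because the probabilities $P(y = c)$ sum to 1. With six classes, $1/6 \approx 0.1667$, which is close to the measured value of about 0.164.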

Let us predict the labels for our validation set and get the accuracy.

In [10]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


The validation accuracy of nearest neighbour is considerably lower than its train accuracy, while the validation accuracy of the random classifier is about the same as its train accuracy. Even so, the validation accuracy of nearest neighbour is roughly twice that of the random classifier.

Now let us try another random split and check the validation accuracy.

In [11]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


You can run the above cell multiple times to try with different random splits.
We notice that the accuracy is different for each run, but close together.

Now let us compare it with the accuracy we get on the test dataset.

In [12]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of the validation set? What happens when we reduce it?
2. How do the sizes of the train and validation sets affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the split, like 99.9% or 0.1%.
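One possible way to organise such an experiment is sketched below. To keep it self-contained and fast, it uses synthetic Gaussian blobs as stand-in data (not the housing data) and a compact nearest-neighbour helper; `nn_predict`, `val_accuracy`, the seed, and the chosen fractions are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumed seed

# Synthetic stand-in data (NOT the housing data): 3 overlapping Gaussian blobs in 2-D
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.concatenate([c + rng.normal(size=(200, 2)) for c in centers])
y = np.repeat(np.arange(3), 200)

def nn_predict(traindata, trainlabel, testdata):
    # 1-nearest neighbour via pairwise squared distances
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]

def val_accuracy(val_fraction):
    # shuffle, then hold out a fixed number of samples for validation
    order = rng.permutation(len(y))
    n_val = max(1, int(round(val_fraction * len(y))))
    val, train = order[:n_val], order[n_val:]
    return (nn_predict(X[train], y[train], X[val]) == y[val]).mean()

results = {}
for frac in [0.01, 0.1, 0.5, 0.9]:
    # repeat each setting to see both the typical value and its spread
    accs = np.array([val_accuracy(frac) for _ in range(30)])
    results[frac] = (accs.mean(), accs.std())
    print(f"validation fraction {frac:.2f}: mean={accs.mean():.3f}, std={accs.std():.3f}")
```

The means and standard deviations collected in `results` could then be plotted against the validation fraction. On the real dataset, the same loop would call the lab's `split`, `NN`, and `Accuracy` functions instead, at a much higher running time.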

**ANSWER FOR Q1**
The size of the validation set affects both how reliably the validation accuracy is measured and how much data is left for training. It is a design choice of the evaluation setup that should be made deliberately, and its effects can vary depending on the specific dataset and problem. Here's how increasing and reducing the percentage of the validation set can affect accuracy:

1. **Increasing the Percentage of Validation Set:**

   - **Pros:**
     - **More Reliable Evaluation:** A larger validation set provides a more reliable evaluation of your model's performance. It reduces the variability in performance metrics, making them less prone to random fluctuations.
     - **Better Estimate of Generalization:** With a larger validation set, your model is tested on a more diverse set of data points, so the measured accuracy is a better estimate of how the model generalizes to unseen data. The model itself does not generalize better because of this.
     - **Easier Detection of Overfitting:** A larger validation set may help you detect overfitting more effectively. Overfitting occurs when a model performs well on training data but poorly on unseen data. A larger validation set can capture more representative samples of unseen data.

   - **Cons:**
     - **Reduced Training Data:** As you allocate a larger portion of your data to the validation set, you'll have less data available for training your model. This can potentially limit the model's ability to learn complex patterns from the training data.

2. **Reducing the Percentage of Validation Set:**

   - **Pros:**
     - **More Training Data:** By reducing the size of the validation set, you have more data available for training. This can be beneficial, especially when you have limited data, as it allows your model to learn from a larger portion of the dataset.
     - **Faster Evaluation:** A smaller validation set has fewer samples to predict, so each evaluation is quicker. Training itself does not get faster, since more of the data goes to the training set.

   - **Cons:**
     - **Less Reliable Evaluation:** With a smaller validation set, the evaluation of your model's performance can be more susceptible to randomness and may not be as reliable. Performance metrics can fluctuate more due to variations in the validation set.
     - **Risk of Overfitting:** A smaller validation set may be less effective at detecting overfitting. If the validation set is not representative of the data distribution, it might not reveal overfitting issues until the model is deployed in the real world.

The choice of the validation set size should be made carefully, considering your specific dataset and problem. Cross-validation techniques, such as k-fold cross-validation, can also be used to mitigate the impact of the validation set size on model evaluation. These techniques involve splitting the data into multiple folds and repeatedly using different subsets as validation sets, providing a more robust assessment of model performance.
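The link between validation set size and reliability can be quantified with the binomial standard error. The sketch below assumes the validation samples are independent and uses a true accuracy of about 0.34, roughly the nearest-neighbour validation accuracy reported earlier; the function name is hypothetical.

```python
import math

def accuracy_standard_error(p, n):
    # binomial standard error of an accuracy estimated from n independent samples
    return math.sqrt(p * (1 - p) / n)

p = 0.34  # assumed true accuracy, roughly the value measured in this lab
for n in [40, 400, 4000]:
    print(n, round(accuracy_standard_error(p, n), 4))
```

Each tenfold increase in validation samples shrinks the standard error by a factor of about 3.16 (the square root of 10), which is why very small validation sets give noticeably jumpy accuracy values.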

**ANSWER FOR Q2**
The size of the training and validation sets can indeed affect how well you can predict the accuracy of your model on the test set using the validation set. The relationship between these sets plays a crucial role in assessing your model's generalization performance. Here's how the size of these sets can impact your ability to predict test set accuracy:

1. **Larger Training Set and Smaller Validation Set:**

   - **Effect:** When you allocate a larger portion of your data to the training set and a smaller portion to the validation set, you provide your model with more data to learn from during training.
   
   - **Implication:** This can result in a model that is better at capturing the underlying patterns and nuances in the data. It may lead to better training performance, which could potentially translate into better test set performance.
   
   - **Prediction Accuracy:** With few validation samples, the measured validation accuracy is a noisier estimate of test accuracy, even if the validation set is representative of the test set. Treat a single validation score with caution in this setting, because a small validation set may also fail to reveal overfitting.

2. **Larger Validation Set and Smaller Training Set:**

   - **Effect:** When you allocate a larger portion of your data to the validation set and a smaller portion to the training set, you prioritize model evaluation and generalization assessment over model training.
   
   - **Implication:** This can result in a more reliable estimate of how well your model is likely to perform on unseen data, as the validation set is more representative of the test set. It can help you identify potential issues like overfitting early in the model development process.
   
   - **Prediction Accuracy:** The validation accuracy itself is measured more precisely, but the model is trained on less data, so it may be weaker and could underfit. The estimate then tends to be pessimistic for a model that is later retrained on all the available training data.

3. **Balanced Training and Validation Sets:**

   - **Effect:** When you split your data evenly between the training and validation sets, you strike a balance between providing the model with ample data for learning and having a representative validation set for evaluation.
   
   - **Implication:** This approach can be a good compromise, allowing you to train a reasonably well-performing model while still having a reliable estimate of its generalization performance.
   
   - **Prediction Accuracy:** Your prediction of test set accuracy based on the validation set is likely to be reasonably accurate, assuming that both sets are representative of the test set. However, it may not be as strong as in the case of a larger training set.

In summary, the size of the training and validation sets can affect your ability to predict test set accuracy based on the validation set. The key is to strike a balance that suits your specific problem and dataset. Regardless of the size of these sets, it's essential to monitor both training and validation performance and consider additional techniques like cross-validation to get a more robust estimate of your model's generalization performance.

**ANSWER FOR Q3**
The percentage of data to reserve for the validation set can vary depending on the size of your overall dataset and the complexity of your machine learning or deep learning model. There is no one-size-fits-all answer, but a common practice is to reserve somewhere between 10% and 30% of your data for the validation set. Here are some considerations to help you choose an appropriate percentage:

1. **Dataset Size:**

   - **Larger Datasets:** If you have a large dataset (e.g., tens of thousands or more data points), you can typically afford to reserve a smaller percentage (e.g., 10% to 20%) for the validation set because even a smaller fraction will provide an adequate number of samples for evaluation.

   - **Smaller Datasets:** In cases where your dataset is relatively small, you might need to allocate a larger percentage (e.g., 20% to 30%) to the validation set to ensure a representative sample for evaluation.

2. **Model Complexity:**

   - **Simple Models:** If you are training a relatively simple model with few parameters (e.g., linear regression), you may require a smaller validation set percentage because such models are less prone to overfitting.

   - **Complex Models:** For more complex models (e.g., deep neural networks with many layers and parameters), you might want to allocate a larger validation set percentage to ensure effective evaluation and early detection of overfitting.

3. **Availability of Data:**

   - **Limited Data:** If you have a limited amount of data, you should be cautious about reserving too much for the validation set, as it may reduce the amount available for training. In such cases, you might consider techniques like cross-validation to make the most of your data.

4. **Cross-Validation:**

   - **K-Fold Cross-Validation:** Another approach to balance the trade-off between training and validation data is to use k-fold cross-validation. In k-fold cross-validation, you split your data into k subsets, train and validate your model k times, each time using a different subset as the validation set. This allows you to use all your data for both training and validation, helping you get a more robust estimate of model performance.
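A minimal sketch of k-fold cross-validation follows, written with plain NumPy. The helper names, the synthetic stand-in data, and the seed are assumptions for illustration; the `classifier` argument takes the same `(traindata, trainlabel, testdata)` arguments as the `NN` function in this lab, so that function could be passed in as well. The sketch expects k to be at least 2.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # shuffle once, then cut the shuffled indices into k nearly equal folds
    return np.array_split(rng.permutation(n_samples), k)

def kfold_accuracy(data, label, k, classifier, rng):
    folds = kfold_indices(len(label), k, rng)
    scores = []
    for i in range(k):
        val_idx = folds[i]
        # every fold except the i-th one is used for training
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        pred = classifier(data[train_idx], label[train_idx], data[val_idx])
        scores.append((pred == label[val_idx]).mean())
    return np.array(scores)

def nn_predict(traindata, trainlabel, testdata):
    # compact 1-nearest neighbour, suitable for small inputs only
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]

rng = np.random.default_rng(seed=0)  # assumed seed
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.concatenate([c + rng.normal(size=(100, 2)) for c in centers])  # synthetic data
y = np.repeat(np.arange(3), 100)

scores = kfold_accuracy(X, y, 5, nn_predict, rng)
print("per-fold accuracy:", np.round(scores, 3))
print("mean accuracy:", round(float(scores.mean()), 3))
```

Unlike repeated random splits, every sample is used for validation exactly once, so no sample is over- or under-represented in the averaged score.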

In practice, starting with a 70-30 or 80-20 split for training and validation data is common. However, you should adjust these percentages based on the factors mentioned above. Experimentation and iteration are often necessary to find the optimal split for your specific problem. Additionally, monitoring both training and validation performance, as well as considering other techniques like regularization, can help you strike the right balance between underfitting and overfitting.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple random train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [13]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [14]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation and leave-one-out. These will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
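To judge how consistent the averaged estimate is, it helps to keep the accuracy of every split instead of only the running sum. The sketch below is a hypothetical variant of the averaging idea that returns all per-split accuracies; it runs on synthetic stand-in data with a compact nearest-neighbour helper, and all names and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumed seed

def split_accuracies(alldata, alllabel, splitfraction, iterations, classifier):
    # validation accuracy of each random split, kept separately instead of summed
    accs = []
    for _ in range(iterations):
        mask = rng.random(len(alllabel)) < splitfraction  # True means training sample
        pred = classifier(alldata[mask], alllabel[mask], alldata[~mask])
        accs.append((pred == alllabel[~mask]).mean())
    return np.array(accs)

def nn_predict(traindata, trainlabel, testdata):
    # compact 1-nearest neighbour, suitable for small inputs only
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]

# synthetic stand-in data (NOT the housing data)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.concatenate([c + rng.normal(size=(200, 2)) for c in centers])
y = np.repeat(np.arange(3), 200)

accs = split_accuracies(X, y, 0.75, 10, nn_predict)
print(f"mean={accs.mean():.3f} std={accs.std():.3f} min={accs.min():.3f} max={accs.max():.3f}")
```

The standard deviation and the min/max range show how much a single split can deviate from the average, which is the quantity the questions below ask about.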

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


**ANSWER FOR Q1**
Yes, averaging the validation accuracy across multiple splits, such as in k-fold cross-validation, generally gives more consistent and robust results compared to using a single fixed validation set. This approach is especially beneficial in cases where the dataset is relatively small or where the distribution of the data may vary across different subsets. Here's why averaging validation accuracy across multiple splits is valuable:

1. **Reduced Variability:** Averaging over multiple splits helps reduce the impact of randomness or the specific choice of a single validation set. In a single split, the performance evaluation can be influenced by the particular data points included in that validation set, which might not be fully representative of the entire dataset. Averaging over multiple splits mitigates this issue.

2. **Better Generalization Estimate:** By repeatedly training and evaluating the model on different subsets of data, you obtain a more reliable estimate of the model's generalization performance. This is crucial for assessing how well the model is likely to perform on unseen data.

3. **Improved Detection of Overfitting:** Cross-validation can better detect overfitting because it evaluates the model's performance on multiple validation sets. If the model consistently performs well across different validation sets, it is less likely to be overfitting the training data.

4. **More Robust Hyperparameter Tuning:** When performing hyperparameter tuning, cross-validation provides a more robust basis for selecting the best hyperparameters. Averaging the results from different splits helps you make more informed choices.

5. **Maximizing Data Usage:** Cross-validation allows you to make the most of your available data. Instead of reserving a fixed portion for a single validation set, you can use all your data for both training and validation, which is especially beneficial when you have a limited dataset.

Common choices for the number of folds (k) in k-fold cross-validation include 5-fold and 10-fold, but you can adjust this value based on your specific circumstances. A higher value of k can provide a more robust estimate but requires more computation.

In summary, averaging the validation accuracy across multiple splits, as done in cross-validation, is a valuable technique for obtaining more consistent and reliable results when assessing the performance of machine learning or deep learning models. It helps ensure that the evaluation is less influenced by the randomness of a single validation set and provides a better estimate of how well the model generalizes to unseen data.

**ANSWER FOR Q2**
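Averaging over multiple splits, as in k-fold cross-validation, mainly reduces the variance of the estimate, so it is usually a more dependable predictor of test accuracy than a single fixed validation set. It does not remove bias: in the run above, the averaged validation accuracy (0.336) was no closer to the test accuracy (0.349) than the single-split values (about 0.34), which is consistent with the validation models being trained on only 75% of the data used for the final model. It is also essential to clarify the terminology here: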

- **Test Accuracy:** Test accuracy refers to the performance of your machine learning or deep learning model on a completely unseen dataset that is not used during training or model selection. Test accuracy is the gold standard for evaluating how well your model generalizes to new, unseen data.

- **Validation Accuracy:** Validation accuracy is an estimate of how well your model is likely to perform on unseen data based on its performance on a validation set. This is an intermediate step used during model development to guide hyperparameter tuning and assess generalization before deploying the model.

Cross-validation helps in obtaining a more accurate estimate of the test accuracy indirectly by providing a more reliable estimate of how well your model generalizes to unseen data. Here's how it contributes to a more accurate estimate:

1. **Robust Evaluation:** Cross-validation evaluates your model's performance on multiple validation sets created by splitting your data into multiple folds. This robust evaluation accounts for variations in data distribution and randomness in the choice of a single validation set, providing a more consistent and reliable assessment of your model's generalization performance.

2. **Better Hyperparameter Tuning:** When using cross-validation, you typically perform hyperparameter tuning (e.g., grid search) over multiple iterations, considering various combinations of hyperparameters. This process helps you select hyperparameters that lead to better generalization, which, in turn, contributes to a more accurate test accuracy estimate.

3. **Effective Detection of Overfitting:** Cross-validation can effectively detect overfitting because it evaluates the model's performance on multiple validation sets. If the model consistently performs well across different validation sets, it is more likely to generalize well to unseen data.

While cross-validation provides a more accurate estimate of how well your model is likely to perform on unseen data, it's essential to note that the actual test accuracy on a completely independent test dataset might still vary due to factors like dataset distribution shifts, noise in real-world data, and other external factors.

In summary, cross-validation helps obtain a more accurate estimate of how well your model generalizes to unseen data, and it is a crucial step in assessing model performance during development. However, for the most accurate estimate of test accuracy, it's still important to evaluate your final model on a dedicated, completely unseen test dataset.

**ANSWER FOR Q3**
The number of iterations, meaning the number of random train/validation splits that `AverageAccuracy` averages over, affects how stable the estimate is. In general, more iterations lead to a more stable and reliable estimate of model performance. However, there are diminishing returns, and the computational cost grows with the number of iterations. Let's explore this in more detail:

**Effect of the Number of Iterations:**

1. **Fewer Iterations:**
   - **Pros:** With fewer random splits (or a smaller k in k-fold cross-validation), the evaluation is computationally cheaper and faster.
   - **Cons:** The estimate of model performance is more sensitive to the particular random splits. With limited data, a single split might not be representative of the entire dataset, leading to more variability in the performance estimates.

2. **More Iterations:**
   - **Pros:** Averaging over more random splits (or using a higher k in k-fold cross-validation) gives a more stable estimate of model performance, because the randomness of individual splits averages out.
   - **Cons:** The computational cost grows with every extra split. In k-fold cross-validation, as k approaches the number of data points (leave-one-out cross-validation), each validation fold contains a single sample and the training sets overlap almost completely, so the cost rises sharply while the per-fold results become highly correlated.

**Balancing Act:**
The number of iterations should balance computational resources against the need for a robust estimate. For k-fold cross-validation, common choices are 5 and 10 folds, which strike a reasonable balance between reducing variability in the estimates and computational efficiency. With a very large dataset, fewer iterations or folds can suffice without sacrificing the reliability of the estimate.

**Diminishing Returns:**
It's important to note that increasing the number of iterations beyond a certain point might yield diminishing returns in terms of improving the accuracy of your estimate. At some point, additional iterations might not significantly reduce the variability in the performance estimates but will increase the computational burden.

In summary, increasing the number of iterations in cross-validation generally leads to a more stable and reliable estimate of model performance, but there's a trade-off with computational cost. It's advisable to choose a value of k that suits your specific dataset size, computational resources, and the desired level of confidence in your performance estimate.
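The diminishing returns can be made concrete with a standard variance identity. As a simplifying assumption (not something measured in this lab), suppose each of the $n$ per-split accuracies $a_i$ has variance $\sigma^2$ and every pair has correlation $\rho$, which is positive when the splits are drawn from the same pool of samples:

$$
\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} a_i\right) = \frac{\sigma^2}{n} + \frac{n-1}{n}\,\rho\,\sigma^2 \;\xrightarrow[\,n\to\infty\,]{}\; \rho\,\sigma^2
$$

The first term shrinks as $1/n$, so the first few iterations help the most. The second term sets a floor that additional iterations cannot remove, because all splits reuse the same underlying data.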

**ANSWER FOR Q4**
Increasing the number of iterations in cross-validation can help mitigate the impact of having a very small training or validation dataset to some extent, but it may not completely compensate for the limitations imposed by small dataset sizes. Here's how increasing iterations can help and what its limitations are:

**Advantages of Increasing Iterations with Small Datasets:**

1. **More Robust Estimate:** With more iterations (e.g., increasing k in k-fold cross-validation), you can obtain a more robust and reliable estimate of model performance, even with small training or validation datasets. This is because you repeatedly evaluate the model on different subsets of the data, which helps reduce the impact of random data splits.

2. **More Reliable Model Comparison:** Because the averaged estimate is less noisy, comparisons between hyperparameter settings are less likely to be decided by a lucky split. This can help you find a good model configuration despite the limited data.

**Limitations of Increasing Iterations with Small Datasets:**

1. **Limited Data:** If your training dataset is very small, no amount of iterations can magically increase the amount of training data available. A small training dataset may result in models that struggle to capture complex patterns, even with robust performance estimates from cross-validation.

2. **Risk of Overfitting:** With a very small training dataset, there's a higher risk of overfitting because the model may have difficulty generalizing from limited examples. Cross-validation can detect overfitting to some extent, but it doesn't address the fundamental issue of insufficient data.

3. **Resource Constraints:** While increasing the number of iterations can improve the estimate's reliability, it also increases computational costs. If you have resource constraints, such as limited computing power or time, performing a large number of iterations may not be feasible.

**Recommended Strategies for Small Datasets:**

1. **Data Augmentation:** If obtaining more data is not possible, consider data augmentation techniques to artificially increase the effective size of your training dataset. This can help your model learn more robust features.

2. **Regularization:** Use regularization techniques (e.g., L1, L2 regularization) to combat overfitting when dealing with small datasets. Regularization encourages simpler models that are less likely to overfit.

3. **Transfer Learning:** When applicable, consider using pre-trained models and transfer learning. Leveraging features learned on larger datasets can be highly beneficial when you have limited data.

4. **Feature Engineering:** Invest time in careful feature engineering to extract meaningful information from your limited dataset.

5. **Ensemble Methods:** Ensemble methods, such as bagging and boosting, can help improve model performance with limited data.

In conclusion, while increasing iterations in cross-validation can improve the reliability of performance estimates, it cannot fully compensate for the challenges posed by very small training or validation datasets. Dealing with limited data often requires a combination of data augmentation, regularization, and other strategies to enhance model performance and generalization.
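The limitation can be illustrated with a small simulation. The sketch below uses synthetic, well-separated blobs with ten classes (stand-in data, not the housing data) and compares a tiny training set, averaged over few and many iterations, against a large training set. All names, sizes, and the seed are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # assumed seed

# Synthetic stand-in data: 10 well-separated Gaussian blobs placed on a circle
angles = 2 * np.pi * np.arange(10) / 10
centers = 10 * np.stack([np.cos(angles), np.sin(angles)], axis=1)
X = np.concatenate([c + rng.normal(size=(100, 2)) for c in centers])
y = np.repeat(np.arange(10), 100)

def nn_predict(traindata, trainlabel, testdata):
    # compact 1-nearest neighbour, suitable for small inputs only
    diff = testdata[:, None, :] - traindata[None, :, :]
    return trainlabel[np.argmin((diff * diff).sum(axis=2), axis=1)]

def mean_val_accuracy(n_train, iterations):
    # average validation accuracy over repeated random splits with a fixed training size
    accs = []
    for _ in range(iterations):
        order = rng.permutation(len(y))
        train, val = order[:n_train], order[n_train:]
        accs.append((nn_predict(X[train], y[train], X[val]) == y[val]).mean())
    return float(np.mean(accs))

small_few = mean_val_accuracy(n_train=10, iterations=5)
small_many = mean_val_accuracy(n_train=10, iterations=100)
large_few = mean_val_accuracy(n_train=800, iterations=5)
print(f"10 training samples,   5 iterations: {small_few:.3f}")
print(f"10 training samples, 100 iterations: {small_many:.3f}")
print(f"800 training samples,  5 iterations: {large_few:.3f}")
```

With only ten training samples, some classes are usually missing from the training set altogether, so the averaged accuracy stays well below the large-training-set value no matter how many iterations are used. More iterations make the estimate steadier, but they do not make the underlying model better.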