<a href="https://colab.research.google.com/github/raviteja9849/FMML-2021/blob/main/Module1_LAB_2%5BMachine_Learning_Terms_and_Metrics%5D.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine learning terms and metrics

FMML Module 1, Lab 2<br>
 Module Coordinator: Thrupthi Ann John thrupthi.ann@research.iiit.ac.in <br>
 Release date: 11 October 2021 Monday <br>

 In this lab, we will walk through part of the ML pipeline: extracting features, training, and testing.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# set the random seed so that the results are reproducible
rng = np.random.default_rng(seed=42)

In this lab, we will use the California Housing dataset. There are 20640 samples, each with 8 attributes such as the median income and the median house age of the district. The task is to predict the median house value per district.

Let us download and examine the dataset. 

In [None]:
dataset = datasets.fetch_california_housing()
# print(dataset.DESCR)  # uncomment this if you want to know more about this dataset
# print(dataset.keys())  # if you want to know what else is there in this dataset
dataset.target = dataset.target.astype(int)  # so that we can classify (np.int was removed in NumPy 1.24; use int)
print(dataset.data.shape)
print(dataset.target.shape)

(20640, 8)
(20640,)


Here is a function for 1-nearest-neighbour classification.

In [None]:
def NN1(traindata, trainlabel, query):
  diff  = traindata - query  # find the difference between features. Numpy automatically takes care of the size here 
  sq = diff*diff # square the differences
  dist = sq.sum(1) # add up the squares
  # our predicted label is the label of the training sample
  # that has the least distance from the query
  label = trainlabel[np.argmin(dist)]
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data 
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
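
The loop over test samples above can also be written without a Python loop. The following is a minimal sketch, not part of the lab code: `nn_vectorized` and the toy arrays are hypothetical names introduced here, and the approach assumes the full distance matrix of shape (number of test samples, number of train samples) fits in memory.

```python
import numpy as np

def nn_vectorized(traindata, trainlabel, testdata):
    # Squared Euclidean distance between every test row and every train row,
    # using ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2 instead of a Python loop.
    train_sq = (traindata ** 2).sum(axis=1)          # shape (n_train,)
    test_sq = (testdata ** 2).sum(axis=1)[:, None]   # shape (n_test, 1)
    dist = test_sq - 2.0 * testdata @ traindata.T + train_sq  # (n_test, n_train)
    # For each test row, take the label of the closest train row.
    return trainlabel[np.argmin(dist, axis=1)]

# Toy data (an assumption for illustration only)
toy_train = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
toy_labels = np.array([0, 1, 2])
toy_queries = np.array([[1.0, 1.0], [9.0, 8.0], [1.0, 9.0]])

toy_pred = nn_vectorized(toy_train, toy_labels, toy_queries)
print(toy_pred)  # [0 1 2]
```

For large train and test sets, the distance matrix can become very large, so processing the test samples in chunks may be necessary.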


We will also define a 'random classifier', which randomly assigns a label to each sample.

In [None]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is not used at all; trainlabel is only used to find the set of classes

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel

Let us define a metric 'Accuracy' to see how good our learning algorithm is. Accuracy is the ratio of the number of correctly classified samples to the total number of samples. The higher the accuracy, the better the algorithm. 

In [None]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel) == len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  # count the number of times the groundtruth label is equal to the predicted label
  correct = (gtlabel == predlabel).sum()
  return correct/len(gtlabel)
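
As a small worked example of the accuracy computation, consider five samples with made-up labels (the arrays below are illustrative assumptions, not data from the lab):

```python
import numpy as np

gt = np.array([0, 1, 2, 2, 1])     # hypothetical groundtruth labels
pred = np.array([0, 2, 2, 2, 0])   # hypothetical predictions

correct = (gt == pred).sum()       # matches at positions 0, 2 and 3
acc = correct / len(gt)
print(correct, acc)  # 3 0.6
```

Three of the five predictions match, so the accuracy is 3/5 = 0.6.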

Let us make a function that splits the dataset randomly, placing each sample in the first split with the desired probability.

In [None]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
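
Because `split` draws an independent random number for each sample, the sizes of the two parts are only approximately equal to the requested fraction and change from run to run. If exact sizes are wanted, one possible alternative is to shuffle the indices and cut them at a fixed position. The sketch below is an assumption for illustration: `split_exact` is a hypothetical helper, not part of the lab, and the toy arrays are made up.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    # Shuffle the sample indices, then cut at a fixed position so that
    # the first part has exactly round(fraction * n) samples.
    n = len(label)
    order = rng.permutation(n)
    n_first = int(round(fraction * n))
    first, second = order[:n_first], order[n_first:]
    return data[first], label[first], data[second], label[second]

demo_rng = np.random.default_rng(0)
demo_data = np.arange(40).reshape(20, 2)   # row i is [2i, 2i+1]
demo_label = np.arange(20)

d1, l1, d2, l2 = split_exact(demo_data, demo_label, 0.8, demo_rng)
print(len(l1), len(l2))  # 16 4
```

Both approaches give a random split; the difference is only whether the split sizes are fixed or random.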

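We will reserve part of our dataset as the test set and will not change this portion throughout our experiments. The intent is to hold out 20%, but note that split returns the samples with rnd < percent as the first part, so the call below with 85/100 actually places about 85% of the data in the test set. Passing 20/100 would reserve 20%.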

In [None]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 85/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  17562
Number of other samples =  3078
Percent of test data =  85.08720930232558 %


## Experiments with splits

Let us reserve some of our train data as a validation set

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 85/100)

What is the accuracy of our classifiers on the train dataset?

In [None]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.1626848691695108


For nearest neighbour, the train accuracy is always 1. The accuracy of the random classifier is close to 1/(number of classes) which is 0.1666 in our case.
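
The 1/(number of classes) figure for uniform random guessing can be checked with a small simulation. This is a sketch under stated assumptions: the class proportions below are invented and deliberately imbalanced, to illustrate that the expected accuracy of uniform guessing does not depend on how the true labels are distributed.

```python
import numpy as np

sim_rng = np.random.default_rng(0)
n_classes = 6
n_samples = 100_000

# Made-up, imbalanced "true" labels: class 0 is far more common than class 5.
true_labels = sim_rng.choice(
    n_classes, size=n_samples, p=[0.5, 0.2, 0.1, 0.1, 0.05, 0.05]
)
# Uniform random guesses, as in RandomClassifier.
guesses = sim_rng.integers(low=0, high=n_classes, size=n_samples)

sim_acc = (true_labels == guesses).mean()
print(sim_acc, 1 / n_classes)  # the two values should be close
```

Each guess matches the true label with probability 1/6 regardless of which class the true label is, which is why the imbalance does not matter here.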

Let us predict the labels for our validation set and get the accuracy

In [None]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.2857142857142857
Validation accuracy using random classifier is  0.1927437641723356


The validation accuracy of nearest neighbour (about 0.29) is considerably lower than its train accuracy (1.0), while the validation accuracy of the random classifier (about 0.19) stays close to its train accuracy (about 0.16). Even so, the validation accuracy of nearest neighbour is about 1.5 times that of the random classifier.

Now let us try another random split and check the validation accuracy

In [None]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 85/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.2813852813852814


You can run the above cell multiple times to try with different random splits. 
We notice that the accuracy differs from run to run, but the values stay close together.

Now let us compare it with the accuracy we get on the test dataset. 

In [None]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.305603006491288


### Try it out for yourself and answer:
1. How is the accuracy of the validation set affected if we increase the percentage of validation set? What happens when we reduce it?
2. How does the size of the train and validation set affect how well we can predict the accuracy on the test set using the validation set?
3. What do you think is a good percentage to reserve for the validation set so that these two factors are balanced?

Answer for both nearest neighbour and random classifier. You can note down the values for your experiments and plot a graph using <a href=https://matplotlib.org/stable/gallery/lines_bars_and_markers/step_demo.html#sphx-glr-gallery-lines-bars-and-markers-step-demo-py>plt.plot</a>. Also check extreme values for the splits, like 99.9% or 0.1%.
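
One way to organise such an experiment is to sweep over several train fractions and repeat each split a few times. The sketch below is an illustration under assumptions: it uses synthetic Gaussian blobs instead of the housing data so that it runs quickly, and `nn_predict`, `val_accuracy`, and the chosen fractions are hypothetical names and values, not part of the lab.

```python
import numpy as np

sweep_rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the housing dataset):
# three Gaussian blobs in 2-D, 100 points each.
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([c + sweep_rng.normal(size=(100, 2)) for c in centers])
y = np.repeat(np.arange(3), 100)

def nn_predict(train_x, train_y, test_x):
    # 1-nearest neighbour with squared Euclidean distance
    diff = test_x[:, None, :] - train_x[None, :, :]
    return train_y[np.argmin((diff ** 2).sum(axis=2), axis=1)]

def val_accuracy(train_fraction):
    # Random split: each sample goes to the train part with this probability.
    mask = sweep_rng.random(len(y)) < train_fraction
    if mask.sum() == 0 or (~mask).sum() == 0:
        return np.nan  # one side of the split is empty, nothing to measure
    pred = nn_predict(X[mask], y[mask], X[~mask])
    return (pred == y[~mask]).mean()

fractions = [0.01, 0.1, 0.5, 0.9, 0.99]
results = {f: [val_accuracy(f) for _ in range(20)] for f in fractions}

for f, accs in results.items():
    # mean and spread of the validation accuracy over 20 random splits
    print(f, np.nanmean(accs), np.nanstd(accs))
```

The per-fraction means can then be passed to plt.plot, and the spread gives a rough idea of how noisy the estimate is at each extreme.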

# ANSWERS:

# 1.Answer:

 **In my experiments, the accuracy on the validation set increased when I increased the percentage of the validation set, and decreased when I reduced it. The notable point, however, is that the change is not significant for this data: even with 95 percent of the data assigned to the validation set, the accuracy stays below 40 percent. I think the model was not fed a large enough amount of data, so it needs more training data to give better results overall.**


# 2.Answer:

 If both the training set and the validation set are very small, the test accuracy is low. If both are large, the test accuracy is higher, but not significantly higher than in the small case.
 If the training set is large and the validation set is small, we get good test accuracy. If the training set is small and the validation set is large, the test accuracy is low.

 **So, to get better test accuracy, the training set should be large but not too large, and the validation set should be small but not too small. We should maintain a suitable ratio depending on the model, such as 80 or 85 percent of the data for training and 15 or 20 percent for validation.**

# 3.Answer:

I think a good choice is to reserve about 15 to 20 percent of the data for the validation set and use the remaining 80 to 85 percent for training. This balances the two factors: the model sees enough data to reach good accuracy, and the validation set is still large enough to give a usable estimate. The exact value can be moved by 5 or 10 percentage points depending on the dataset and the model, so the training portion should be more than 70 but less than 90 percent, with the rest going to validation.

## Multiple Splits

One way to get more accurate estimates for the test accuracy is by using <b>cross-validation</b>. Here, we will try a simple version, where we do multiple train/val splits and take the average of the validation accuracies as the estimate of the test accuracy. Here is a function for doing this. Note that this function will take a long time to execute.

In [None]:
# you can use this function for random classifier also
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies
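
Besides the average, it can be useful to look at how much the individual validation accuracies vary. The sketch below is an assumption for illustration: `summarize` is a hypothetical helper, and the two input values are simply the two nearest-neighbour validation accuracies printed earlier in this lab. With only two values the spread is a very rough indication.

```python
import numpy as np

def summarize(accuracies):
    # Mean, sample standard deviation, and standard error of the mean
    accs = np.asarray(accuracies, dtype=float)
    mean = accs.mean()
    std = accs.std(ddof=1)
    sem = std / np.sqrt(len(accs))
    return mean, std, sem

# The two validation accuracies reported earlier for nearest neighbour
example_accs = [0.2857142857142857, 0.2813852813852814]

mean, std, sem = summarize(example_accs)
print(mean, std, sem)
```

Repeated random splits of the same data overlap, so the accuracies are not independent and the standard error computed this way is likely to be optimistic.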

In [None]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.28211626797265893
test accuracy is  0.305603006491288


This is a very simple way of doing cross-validation. There are many well-known algorithms for cross-validation, like k-fold cross-validation, leave-one-out etc. This will be covered in detail in a later module. For more information about cross-validation, check <a href=https://en.wikipedia.org/wiki/Cross-validation_(statistics)>Cross-validation (Wikipedia)</a>.
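
To give an idea of how k-fold cross-validation differs from repeated random splits, here is a minimal sketch of generating the fold indices. `kfold_indices` is a hypothetical helper written for this illustration, not the lab's method and not a library API; it only assumes NumPy.

```python
import numpy as np

def kfold_indices(n_samples, k, rng):
    # Shuffle the indices once, cut them into k folds, and use each fold
    # exactly once as the validation set while the rest form the train set.
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

fold_rng = np.random.default_rng(0)
splits = list(kfold_indices(10, 5, fold_rng))

for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))  # 8 2 for each of the 5 folds
```

Unlike repeated random splits, every sample ends up in a validation fold exactly once, so no sample is validated on twice and none is left out.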

### Questions
1. Does averaging the validation accuracy across multiple splits give more consistent results?
2. Does it give a more accurate estimate of test accuracy?
3. What is the effect of the number of iterations on the estimate? Do we get a better estimate with higher iterations?
4. Consider the results you got for the previous questions. Can we deal with a very small train dataset or validation dataset by increasing the iterations?


# ANSWERS:

# 1.Answer:

**Yes, averaging the validation accuracy across multiple splits gives more consistent results. Each split validates the classifier on a different random subset of the data, so a single unusually easy or unusually hard validation set has less influence on the final number. Averaging reduces this split-to-split variation in the estimate; it does not change the classifier itself.**

# 2.Answer:

**Yes, but only to a limited extent. The average over 10 splits (about 0.282) is a more stable estimate than the result of any single split, but it is still somewhat below the test accuracy (about 0.306). Averaging does not give the model extra training: each iteration builds the classifier from scratch on a new random split, and different iterations can reuse the same samples. One likely reason for the remaining gap is that each validation run trains on only 75 percent of the available training data, while the test run uses all of it.**

# 3.Answer:

**The effect of the number of iterations is on the stability of the estimate: the average over many random splits varies less from run to run than the average over a few. So we do get a better, less noisy estimate with more iterations, although the improvement levels off while the computation time keeps growing. More iterations do not make the model itself more accurate, because each iteration starts from a fresh split and nothing is carried over between iterations. If the training set is very small, every split still produces a weak classifier, so the average settles at the accuracy of that weak classifier rather than at a higher value.**

# 4.Answer:

**Yes, to an extent. Increasing the number of iterations helps most with a small validation set, because averaging over many splits uses far more of the available data for validation than a single split does. It cannot fully make up for a small training set, since every split still trains on little data.**

**Say we have only 100 examples. If we do a simple 80–20 split, we get 20 examples in our test set, which is not enough: we can get almost any performance on this set purely by chance. The problem is even worse for a multi-class problem. With 10 classes and only 20 examples, we have only 2 examples per class on average, and testing on only 2 examples cannot lead to any real conclusion.**

**If we use cross-validation in this case (that is, evaluating over several splits instead of one), we build K different models, so we are able to make predictions on all of our data. For each instance, the prediction comes from a model that did not see that example, so we effectively get 100 examples in our test set. For the multi-class problem, that is 10 examples per class on average, which is much better than just 2. After evaluating the learning algorithm this way, we can train the final model on all of our data: if the K models had similar performance on different train sets, we assume that a model trained on all the data will perform similarly. With cross-validation, we are able to use all 100 examples for both training and testing while still evaluating the learning algorithm on examples it has never seen before.**

**So, by increasing the number of iterations we get a more reliable accuracy estimate even when very little data is available, although the accuracy of the model itself is still limited by the size of the training set.**