
In [1]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
# seed the random number generator so that results are reproducible
rng = np.random.default_rng(seed=42)

In [3]:
dataset = datasets.fetch_california_housing()


In [4]:
print(dataset.DESCR)
print(dataset.keys())
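# truncate the continuous target (median house value in $100,000s) to integers
# so that each integer value can be treated as a class label
dataset.target = dataset.target.astype(int)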
print(dataset.data.shape)
print(dataset.target.shape)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [5]:
def NN1(traindata, trainlabel, query):
  diff = traindata - query  # feature-wise differences; NumPy broadcasts query across every training row
  sq = diff * diff  # square the differences
  dist = sq.sum(1)  # squared Euclidean distance from the query to each training sample
  label = trainlabel[np.argmin(dist)]  # predict the label of the closest training sample
  return label

def NN(traindata, trainlabel, testdata):
  # we will run nearest neighbour for each sample in the test data
  # and collect the predicted classes in an array using list comprehension
  predlabel = np.array([NN1(traindata, trainlabel, i) for i in testdata])
  return predlabel
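
The per-query loop above can also be written as one matrix computation. The block below is a minimal sketch, not the notebook's method: `nn_predict` is a hypothetical helper, and it assumes the full test-by-train distance matrix fits in memory. It uses the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, so floating-point rounding may break exact distance ties differently from the loop version.

```python
import numpy as np

def nn_predict(traindata, trainlabel, testdata):
    # Hypothetical vectorized 1-nearest-neighbour helper (not part of the notebook).
    train_sq = (traindata ** 2).sum(axis=1)          # shape (n_train,)
    test_sq = (testdata ** 2).sum(axis=1)[:, None]   # shape (n_test, 1)
    # Squared Euclidean distance between every test row and every training row
    dist = test_sq + train_sq - 2.0 * testdata @ traindata.T  # shape (n_test, n_train)
    return trainlabel[np.argmin(dist, axis=1)]

# Toy data: three training points, each with its own label
train = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
labels = np.array([0, 1, 2])
queries = np.array([[1.0, 1.0], [9.0, 8.0], [1.0, 9.0]])

pred = nn_predict(train, labels, queries)
print(pred)  # [0 1 2]
```

Each query lies closest to a different training point, so the three predictions are the three distinct labels.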

In [6]:
def RandomClassifier(traindata, trainlabel, testdata):
  # traindata is unused; it is accepted only so the signature matches NN

  classes = np.unique(trainlabel)
  rints = rng.integers(low=0, high=len(classes), size=len(testdata))
  predlabel = classes[rints]
  return predlabel
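
A uniform random guess over K classes is correct with probability 1/K, however imbalanced the true labels are. The sketch below checks this on made-up labels. The six classes and their proportions are assumptions for illustration, not values taken from the dataset. If the integer-cast target does take six distinct values (the notebook does not print them), the expected baseline would be about 0.167.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical, deliberately imbalanced labels over six classes (0..5)
true_labels = rng.choice(6, size=60_000, p=[0.05, 0.35, 0.30, 0.15, 0.10, 0.05])

# Same guessing rule as RandomClassifier: pick one of the observed classes uniformly
classes = np.unique(true_labels)
guesses = classes[rng.integers(low=0, high=len(classes), size=len(true_labels))]

accuracy = (guesses == true_labels).mean()
print("simulated accuracy:", accuracy)
print("1 / number of classes:", 1 / len(classes))
```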

In [7]:
def Accuracy(gtlabel, predlabel):
  assert len(gtlabel)==len(predlabel), "Length of the groundtruth labels and predicted labels should be the same"
  correct = (gtlabel==predlabel).sum() # count the number of times the groundtruth label is equal to the predicted label.
  return correct/len(gtlabel)

In [8]:
def split(data, label, percent):
  # generate a random number for each sample
  rnd = rng.random(len(label))
  split1 = rnd<percent
  split2 = rnd>=percent
  split1data = data[split1,:]
  split1label = label[split1]
  split2data = data[split2,:]
  split2label = label[split2]
  return split1data, split1label, split2data, split2label
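
Because `split` draws an independent random number for every sample, the first part holds only approximately `percent` of the data. When an exact size matters, one alternative is to shuffle the indices and cut them at a fixed position. This is a sketch under that assumption; `split_exact` is a hypothetical helper, not part of the notebook, and it takes the generator as an explicit argument.

```python
import numpy as np

def split_exact(data, label, fraction, rng):
    # Hypothetical alternative to split(): the first part gets exactly
    # round(fraction * n) samples, chosen by shuffling the row indices.
    n = len(label)
    order = rng.permutation(n)
    n_first = int(round(fraction * n))
    first, second = order[:n_first], order[n_first:]
    return data[first], label[first], data[second], label[second]

rng = np.random.default_rng(seed=0)
data = np.arange(200, dtype=float).reshape(100, 2)  # toy features: row i is [2i, 2i+1]
label = np.arange(100)                              # toy labels: row index

d1, l1, d2, l2 = split_exact(data, label, 0.2, rng)
print(len(l1), len(l2))  # 20 80
```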

In [9]:
testdata, testlabel, alltraindata, alltrainlabel = split(dataset.data, dataset.target, 20/100)
print('Number of test samples = ', len(testlabel))
print('Number of other samples = ', len(alltrainlabel))
print('Percent of test data = ', len(testlabel)*100/len(dataset.target),'%')

Number of test samples =  4144
Number of other samples =  16496
Percent of test data =  20.07751937984496 %


In [10]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)

In [11]:
trainpred = NN(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using nearest neighbour is ", trainAccuracy)

trainpred = RandomClassifier(traindata, trainlabel, traindata)
trainAccuracy = Accuracy(trainlabel, trainpred)
print("Train accuracy using random classifier is ", trainAccuracy)

Train accuracy using nearest neighbour is  1.0
Train accuracy using random classifier is  0.164375808538163


In [12]:
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using nearest neighbour is ", valAccuracy)

valpred = RandomClassifier(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy using random classifier is ", valAccuracy)

Validation accuracy using nearest neighbour is  0.34108527131782945
Validation accuracy using random classifier is  0.1688468992248062


In [13]:
traindata, trainlabel, valdata, vallabel = split(alltraindata, alltrainlabel, 75/100)
valpred = NN(traindata, trainlabel, valdata)
valAccuracy = Accuracy(vallabel, valpred)
print("Validation accuracy of nearest neighbour is ", valAccuracy)

Validation accuracy of nearest neighbour is  0.34048257372654156


In [14]:
testpred = NN(alltraindata, alltrainlabel, testdata)
testAccuracy = Accuracy(testlabel, testpred)
print('Test accuracy is ', testAccuracy)

Test accuracy is  0.34917953667953666


1) When we increase the percentage of data reserved for the validation set, less data is left for training.
A larger validation set gives a less noisy estimate of the model's performance on unseen data, which makes it more reliable for hyperparameter tuning and model selection.
With a smaller training set, however, the model sees fewer examples and may generalize less well, so validation and test accuracy can drop. (For 1-nearest neighbour the training accuracy itself stays at 1.0, as in the output above, because each training sample is its own nearest neighbour.)
Reducing the size of the validation set leaves more data for training.
A larger training set usually helps the model generalize better.
A smaller validation set, however, gives a noisier estimate of generalization performance. Choices tuned against such a noisy estimate can overfit to the validation set, so the selected model may look good on validation data but perform poorly on test data.

2) If the training set is too small, the model might not learn important patterns, and the validation accuracy may underestimate the test accuracy of a model trained on all the available training data.
If the validation set is too small, the validation accuracy becomes a noisy, unreliable estimate of test accuracy.

3) The validation set should be large enough to give a meaningful estimate of the model's performance, but not so large that it leaves too little data for training.
A common split is 70-80% for training and 20-30% for validation.
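
The trade-off described in these answers can be explored empirically by sweeping the training fraction and recording the mean and spread of validation accuracy. The block below is a rough sketch on synthetic stand-in data, not on the housing dataset. The two Gaussian blobs, the fractions tried, and the helper names are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in data (an assumption): two Gaussian blobs with 2 features
n = 600
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)), rng.normal(2.0, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def nn_predict(train_x, train_y, test_x):
    # 1-nearest neighbour using all pairwise squared distances
    diff = test_x[:, None, :] - train_x[None, :, :]
    return train_y[(diff ** 2).sum(axis=2).argmin(axis=1)]

def val_accuracy_stats(train_fraction, repeats=20):
    # Repeat the random split and collect validation accuracy each time
    accs = []
    for _ in range(repeats):
        mask = rng.random(len(y)) < train_fraction   # True -> training sample
        pred = nn_predict(X[mask], y[mask], X[~mask])
        accs.append((pred == y[~mask]).mean())
    return float(np.mean(accs)), float(np.std(accs))

results = {}
for frac in (0.1, 0.5, 0.75, 0.95):
    results[frac] = val_accuracy_stats(frac)
    mean_acc, std_acc = results[frac]
    print(f"train fraction {frac:.2f}: mean {mean_acc:.3f}, std {std_acc:.3f}")
```

The standard deviation column is the quantity to watch: it shows how much a single split's validation accuracy can move when the validation set is small.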

In [15]:
# pass classifier=RandomClassifier to evaluate the random baseline with the same function
def AverageAccuracy(alldata, alllabel, splitpercent, iterations, classifier=NN):
  accuracy = 0
  for ii in range(iterations):
    traindata, trainlabel, valdata, vallabel = split(alldata, alllabel, splitpercent)
    valpred = classifier(traindata, trainlabel, valdata)
    accuracy += Accuracy(vallabel, valpred)
  return accuracy/iterations # average of all accuracies

In [16]:
print('Average validation accuracy is ', AverageAccuracy(alltraindata, alltrainlabel, 75/100, 10, classifier=NN))
testpred = NN(alltraindata, alltrainlabel, testdata)
print('test accuracy is ',Accuracy(testlabel, testpred) )

Average validation accuracy is  0.33584635395170215
test accuracy is  0.34917953667953666


Averaging validation accuracy across multiple splits gives more consistent and reliable results than a single train-validation split.
When we perform multiple splits and calculate the validation accuracy for each one, we see how the model performs on different subsets of the data. This reduces the impact of an unusually easy or hard validation set, which a single random split can produce.
Averaging the results smooths out the variation in performance caused by the random selection of the validation set, which leads to more stable and consistent estimates.

By averaging results from multiple splits, we get a more robust estimate of how well our model generalizes to unseen data. The estimate tends to be more accurate because it is based on a broader range of validation samples. It also reduces the risk of overfitting to a specific validation set, which can happen when a single fixed validation set is used.

The number of iterations affects the accuracy and stability of the estimate. Increasing the number of iterations reduces the variance of the averaged estimate, with diminishing returns, but also requires more computation. The optimal number of iterations depends on the size and characteristics of the dataset. If the dataset is extremely small, increasing the number of iterations alone is not a complete solution: we still face the limited amount of data available for training and validation.
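
The effect of the iteration count can be measured directly: compute the averaged accuracy many times for each count and compare how much the results scatter. This is a sketch on synthetic stand-in data, not the housing dataset. The blobs, the 0.75 training fraction, the 40 repetitions, and the function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in data (an assumption): two overlapping Gaussian blobs
n = 300
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)), rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def one_split_accuracy(train_fraction=0.75):
    # One random train/validation split scored with 1-nearest neighbour
    mask = rng.random(n) < train_fraction            # True -> training sample
    diff = X[~mask][:, None, :] - X[mask][None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
    return (y[mask][nearest] == y[~mask]).mean()

def averaged_accuracy(iterations):
    # Same idea as AverageAccuracy: mean over several random splits
    return np.mean([one_split_accuracy() for _ in range(iterations)])

spread = {}
for iterations in (1, 5, 25):
    estimates = [averaged_accuracy(iterations) for _ in range(40)]
    spread[iterations] = float(np.std(estimates))
    print(f"iterations={iterations:2d}: std of the averaged estimate = {spread[iterations]:.4f}")
```

Since the splits are drawn independently on a fixed dataset, the scatter should shrink roughly with the square root of the iteration count, which is why the gains flatten out as iterations grow.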

In summary, averaging validation accuracy across multiple random splits, a form of cross-validation, provides more consistent and accurate estimates of test accuracy. While increasing the number of iterations can improve the quality of the estimate, it is not a substitute for a sufficiently large and diverse dataset, especially when the dataset is exceptionally small. A balanced approach that considers both dataset size and cross-validation is often the best strategy for model evaluation.
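
The averaging in this notebook uses repeated random splits. A related scheme that the notebook does not use is k-fold cross-validation, in which every sample is validated exactly once. The block below is a minimal NumPy sketch on synthetic stand-in data. `make_folds` and `kfold_accuracies` are hypothetical helpers, and the data and the choice of k=5 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic stand-in data (an assumption): two overlapping Gaussian blobs
n = 300
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, 2)), rng.normal(1.5, 1.0, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def make_folds(n_samples, k, rng):
    # Shuffle the indices and cut them into k nearly equal, disjoint folds
    return np.array_split(rng.permutation(n_samples), k)

def kfold_accuracies(X, y, folds):
    accs = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        diff = X[val_idx][:, None, :] - X[train_idx][None, :, :]
        nearest = (diff ** 2).sum(axis=2).argmin(axis=1)
        accs.append((y[train_idx][nearest] == y[val_idx]).mean())
    return np.array(accs)

folds = make_folds(len(y), k=5, rng=rng)
accs = kfold_accuracies(X, y, folds)
print("per-fold accuracy:", np.round(accs, 3))
print("mean accuracy:", accs.mean())
```

Unlike repeated random splits, the folds never overlap, so no sample is counted twice and none is left out of validation.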




