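# Question 1
To compute P(sunny | a cone of ice cream) and P(rainy | a cup of hot coffee), we apply Bayes' Theorem and then use the Naïve Bayes assumption that the words of a statement are conditionally independent given the category, so the likelihood factors into per-word terms. The evidence term P(statement) is the same for every category, so it is dropped and each posterior is written up to proportionality (∝):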

1. P(sunny|a cone of ice cream)
   We want to find the probability of the category "Sunny" given the statement "a cone of ice cream."

   According to Bayes' Theorem:
   P(sunny|a cone of ice cream) ∝ P(a cone of ice cream|sunny) * P(sunny)

   With the Naïve Bayes assumption:
   P(a cone of ice cream|sunny) = P(a|sunny) * P(cone|sunny) * P(of|sunny) * P(ice|sunny) * P(cream|sunny)

2. P(rainy|a cup of hot coffee)
   Similarly, we want to find the probability of the category "Rainy" given the statement "a cup of hot coffee."

   Using Bayes' Theorem:
   P(rainy|a cup of hot coffee) ∝ P(a cup of hot coffee|rainy) * P(rainy)

   With the Naïve Bayes assumption:
   P(a cup of hot coffee|rainy) = P(a|rainy) * P(cup|rainy) * P(of|rainy) * P(hot|rainy) * P(coffee|rainy)
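
The two decompositions above can be checked numerically with a small sketch. The four-sentence corpus, the function names `train` and `posterior`, and the choice of add-one (Laplace) smoothing are all assumptions made for illustration; they are not part of the question, so the resulting numbers only demonstrate the mechanics.

```python
from collections import Counter
import math

# Hypothetical toy corpus: (statement, category) pairs invented for illustration.
training = [
    ("a cone of ice cream", "sunny"),
    ("a day at the beach", "sunny"),
    ("a cup of hot coffee", "rainy"),
    ("a wet umbrella", "rainy"),
]

def train(examples, k=1.0):
    """Estimate P(category) and add-k smoothed P(word | category)."""
    label_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in label_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    priors = {c: n / len(examples) for c, n in label_counts.items()}
    likelihoods = {}
    for c, counts in word_counts.items():
        total = sum(counts.values())
        likelihoods[c] = {
            w: (counts[w] + k) / (total + k * len(vocab)) for w in vocab
        }
    return priors, likelihoods

def posterior(text, priors, likelihoods):
    """Return normalized P(category | text) under the Naive Bayes assumption."""
    log_scores = {}
    for c in priors:
        # log P(c) + sum of log P(word | c); words outside the vocabulary are skipped.
        log_scores[c] = math.log(priors[c]) + sum(
            math.log(likelihoods[c][w]) for w in text.split() if w in likelihoods[c]
        )
    # Normalize across categories; subtracting the max keeps exp() well behaved.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {c: v / z for c, v in unnormalized.items()}

priors, likelihoods = train(training)
post = posterior("a cone of ice cream", priors, likelihoods)
print(post)
```

Working in log space and normalizing at the end recovers the full posterior, which is why the ∝ form is enough to choose between categories.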


In [113]:
import util
import classificationMethod

class MostFrequentClassifier(classificationMethod.ClassificationMethod):
  """
  The MostFrequentClassifier is a very simple classifier: for
  every test instance presented to it, the classifier returns
  the label that was seen most often in the training data.
  """
  def __init__(self, legalLabels):
    self.guess = None
    self.type = "mostfrequent"
  
  def train(self, data, labels, validationData, validationLabels):
    """
    Find the most common label in the training data.
    """
    counter = util.Counter()
    counter.incrementAll(labels, 1)
    self.guess = counter.argMax()
  
  def classify(self, testData):
    """
    Classify all test data as the most common label.
    """
    return [self.guess for i in testData]


In [116]:
import util
import classificationMethod
import math

class NaiveBayesClassifier(classificationMethod.ClassificationMethod):
    def __init__(self, legalLabels):
        self.legalLabels = legalLabels
        self.type = "naivebayes"
        self.k = 1  # this is the smoothing parameter
        self.automaticTuning = False
        self.class_probs = {}  
        
    def setSmoothing(self, k):
        self.k = k

    def train(self, trainingData, trainingLabels, validationData, validationLabels):
        self.features = list(trainingData[0].keys())

        if self.automaticTuning:
            kgrid = [0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 20, 50]
        else:
            kgrid = [self.k]

        # Run for both branches; with tuning off, kgrid holds only the current k.
        self.trainAndTune(trainingData, trainingLabels, validationData, validationLabels, kgrid)

    def trainAndTune(self, trainingData, trainingLabels, validationData, validationLabels, kgrid):
        best_accuracy = 0.0
        best_k = None
        original_k = self.k  # Store the original k value

        for k in kgrid:
            # Train the classifier with Laplace smoothing
            self.k = k
            self.trainNaiveBayes(trainingData, trainingLabels)

            # Calculate accuracy on validation data
            accuracy = self.validate(validationData, validationLabels)

            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_k = k

        # Keep the best-performing k and retrain so the final model matches it.
        self.k = best_k if best_k is not None else original_k
        self.trainNaiveBayes(trainingData, trainingLabels)

    def trainNaiveBayes(self, trainingData, trainingLabels):
        
        self.class_probs = {label: 0.0 for label in self.legalLabels}
        
        self.feature_probs = {label: {feature: 0.0 for feature in self.features} for label in self.legalLabels}

        total_count = len(trainingLabels)

        # Collect class counts
        class_counts = {label: 0 for label in self.legalLabels}
        for label_ in trainingLabels:
            class_counts[label_] += 1

        # Calculate class probabilities
        for label in self.legalLabels:
            self.class_probs[label] = (class_counts[label] + self.k) / (total_count + self.k * len(self.legalLabels))

        # Calculate feature probabilities
        for label in self.legalLabels:
            label_data = [datum for datum, label_ in zip(trainingData, trainingLabels) if label_ == label]
            label_word_counts = {feature: 0 for feature in self.features}
            for datum in label_data:
                for feature, count in datum.items():
                    label_word_counts[feature] += count
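            # Compute the smoothing denominator once per label rather than once per feature.
            denominator = sum(label_word_counts.values()) + self.k * len(self.features)
            for feature in self.features:
                self.feature_probs[label][feature] = (label_word_counts[feature] + self.k) / denominator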

    def validate(self, validationData, validationLabels):
        correct = 0
        total = len(validationLabels)

        for datum, true_label in zip(validationData, validationLabels):
            
            # classify expects a list of data, so wrap the single datum and unwrap the result.
            predicted_label = self.classify([datum])[0]
            
            if predicted_label == true_label:
                correct += 1

        accuracy = correct / total
        return accuracy

    def classify(self, testData):
        guesses = []

        for datum in testData:
            
            max_prob = float("-inf")
            best_label = None

            for label in self.legalLabels:
                prob = math.log(self.class_probs[label])

                for feature, count in datum.items():
                    if feature in self.feature_probs[label]:
                        prob += count * math.log(self.feature_probs[label][feature])

                if prob > max_prob:
                    max_prob = prob
                    best_label = label

            guesses.append(best_label)

        # Return only after every datum has been classified.
        return guesses

    def calculateLogJointProbabilities(self, datum):
        logJoint = {label: 0.0 for label in self.legalLabels}

        for label in self.legalLabels:
            logJoint[label] = math.log(self.class_probs[label])

            for feature, count in datum.items():
                if feature in self.feature_probs[label]:
                    logJoint[label] += count * math.log(self.feature_probs[label][feature])

        return logJoint

    def findHighOddsFeatures(self, label1, label2):
        featuresOdds = []

        for feature in self.features:
            # The stored probabilities are already smoothed, so take their ratio directly.
            odds_ratio = self.feature_probs[label1][feature] / self.feature_probs[label2][feature]
            featuresOdds.append((feature, odds_ratio))

        featuresOdds.sort(key=lambda x: -x[1])
        return [feature for feature, _ in featuresOdds[:100]]
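
The class and feature estimates in `trainNaiveBayes` both have the add-k form (count + k) / (total + k * V), where V is the number of possible values being smoothed over. The sketch below isolates that formula to show how the choice of k changes a single estimate. The helper name `smoothed_estimate` and the counts are hypothetical and chosen only for illustration.

```python
def smoothed_estimate(count, total, num_values, k):
    """Add-k (Laplace) estimate: (count + k) / (total + k * num_values)."""
    return (count + k) / (total + k * num_values)

# Hypothetical counts: an outcome seen 8 times in 10 trials, with 2 possible outcomes.
for k in [0.001, 1, 50]:
    print(k, smoothed_estimate(8, 10, 2, k))
```

With a small k the estimate stays close to the raw frequency 8/10, while a large k pulls it toward the uniform value 1/2. Heavy smoothing can therefore wash out the differences between classes.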
    

In [None]:
# This file contains feature extraction methods and harness 
# code for data classification

import mostFrequent
import naiveBayes
import samples
import sys
import util

TEST_SET_SIZE = 140
DIGIT_DATUM_WIDTH=28
DIGIT_DATUM_HEIGHT=28
FACE_DATUM_WIDTH=60
FACE_DATUM_HEIGHT=70


def basicFeatureExtractorDigit(datum):
  """
  Returns a set of pixel features indicating whether
  each pixel in the provided datum is white (0) or gray/black (1)
  """
  a = datum.getPixels()

  features = util.Counter()
  for x in range(DIGIT_DATUM_WIDTH):
    for y in range(DIGIT_DATUM_HEIGHT):
      if datum.getPixel(x, y) > 0:
        features[(x,y)] = 1
      else:
        features[(x,y)] = 0
  return features


def analysis(classifier, guesses, testLabels, testData, rawTestData, printImage):
  """
  This function is called after learning.
  Include any code that you want here to help you analyze your results.
  
  Use the printImage(<list of pixels>) function to visualize features.
  
  An example of use has been given to you.
  
  - classifier is the trained classifier
  - guesses is the list of labels predicted by your classifier on the test set
  - testLabels is the list of true labels
  - testData is the list of test datapoints (as util.Counter of features)
  - rawTestData is the list of test datapoints (as samples.Datum)
  - printImage is a method to visualize the features 
  (see its use in the odds ratio part in runClassifier method)
  
  This code won't be evaluated. It is for your own optional use
  (and you can modify the signature if you want).
  """
  
  # Put any code here...
  mistake_count = 0  # Track the number of mistakes found
  for i in range(len(guesses)):
      prediction = guesses[i]
      truth = testLabels[i]
      if prediction != truth:
          print("===================================")
          print("Mistake on example %d" % i)
          print("Predicted %d; truth is %d" % (prediction, truth))
          print("Image:")
          print(rawTestData[i])
          mistake_count += 1
          
      if mistake_count >= 4:
          break  # Stop after finding the first four mistakes



## =====================
## You don't have to modify any code below.
## =====================


class ImagePrinter:
    def __init__(self, width, height):
      self.width = width
      self.height = height

def default(str):
  return str + ' [Default: %default]'

def readCommand(argv):
  "Processes the command used to run from the command line."
  from optparse import OptionParser  
  parser = OptionParser(USAGE_STRING)
  
  parser.add_option('-c', '--classifier', help=default('The type of classifier'), choices=['mostFrequent', 'nb', 'naiveBayes', 'perceptron', 'mira', 'minicontest'], default='mostFrequent')
  parser.add_option('-d', '--data', help=default('Dataset to use'), choices=['digits', 'faces'], default='digits')
  parser.add_option('-t', '--training', help=default('The size of the training set'), default=100, type="int")
  parser.add_option('-a', '--autotune', help=default("Whether to automatically tune hyperparameters"), default=False, action="store_true")
  parser.add_option('-i', '--iterations', help=default("Maximum iterations to run training"), default=3, type="int")

  options, otherjunk = parser.parse_args(argv)
  if len(otherjunk) != 0:
      raise Exception('Command line input not understood: ' + str(otherjunk))
                  
  args = {}
  
  # Set up variables according to the command line input.
  print("Doing classification")
  print("--------------------")
  print("data:\t\t" + options.data)
  print("classifier:\t\t" + options.classifier)
  print("training set size:\t" + str(options.training))
  if(options.data=="digits"):
    printImage = ImagePrinter(DIGIT_DATUM_WIDTH, DIGIT_DATUM_HEIGHT)
    featureFunction = basicFeatureExtractorDigit    
  else:
    print("Unknown dataset", options.data)
    print(USAGE_STRING)
    sys.exit(2)
    
  if(options.data=="digits"):
    legalLabels = list(range(10))
  else:
    legalLabels = list(range(2))
    
  if options.training <= 0:
      print("Training set size should be a positive integer (you provided: %d)" % options.training)
      print(USAGE_STRING)
      sys.exit(2)

  if(options.classifier == "mostFrequent"):
    classifier = mostFrequent.MostFrequentClassifier(legalLabels)
  elif(options.classifier == "naiveBayes" or options.classifier == "nb"):
    classifier = naiveBayes.NaiveBayesClassifier(legalLabels)
    if (options.autotune):
        print("using automatic tuning for naivebayes")
        classifier.automaticTuning = True
  else:
    print("Unknown classifier:", options.classifier)
    print(USAGE_STRING)
    
    sys.exit(2)

  args['classifier'] = classifier
  args['featureFunction'] = featureFunction
  args['printImage'] = printImage
  
  return args, options

USAGE_STRING = """
  USAGE:      python dataClassifier.py <options>
  EXAMPLES:   (1) python dataClassifier.py
                  - trains the default mostFrequent classifier on the digit dataset
                  using the default 100 training examples and
                  then test the classifier on test data
                 """

# Main harness code

def runClassifier(args, options):

  featureFunction = args['featureFunction']
  classifier = args['classifier']
  printImage = args['printImage']
      
  # Load data  
  numTraining = options.training

  rawTrainingData = samples.loadDataFile("trainingimages", numTraining,DIGIT_DATUM_WIDTH,DIGIT_DATUM_HEIGHT)
  trainingLabels = samples.loadLabelsFile("traininglabels", numTraining)
  rawValidationData = samples.loadDataFile("validationimages", TEST_SET_SIZE,DIGIT_DATUM_WIDTH,DIGIT_DATUM_HEIGHT)
  validationLabels = samples.loadLabelsFile("validationlabels", TEST_SET_SIZE)
  rawTestData = samples.loadDataFile("testimages", TEST_SET_SIZE,DIGIT_DATUM_WIDTH,DIGIT_DATUM_HEIGHT)
  testLabels = samples.loadLabelsFile("testlabels", TEST_SET_SIZE)
    
  
  # Extract features
  print("Extracting features...")
  trainingData = list(map(featureFunction, rawTrainingData))
  validationData = list(map(featureFunction, rawValidationData))
  testData = list(map(featureFunction, rawTestData))
  
  # Conduct training and testing
  print("Training...")
  classifier.train(trainingData, trainingLabels, validationData, validationLabels)
  print("Validating...")
  guesses = classifier.classify(validationData)
  correct = [guesses[i] == validationLabels[i] for i in range(len(validationLabels))].count(True)
  print(str(correct), ("correct out of " + str(len(validationLabels)) + " (%.1f%%).") % (100.0 * correct / len(validationLabels)))
  print("Testing...")
  guesses = classifier.classify(testData)
  correct = [guesses[i] == testLabels[i] for i in range(len(testLabels))].count(True)
  print(str(correct), ("correct out of " + str(len(testLabels)) + " (%.1f%%).") % (100.0 * correct / len(testLabels)))
  analysis(classifier, guesses, testLabels, testData, rawTestData, printImage)

if __name__ == '__main__':
  # Read input
  args, options = readCommand( sys.argv[1:] ) 
  # Run classifier
  runClassifier(args, options)

In [130]:
!python dataClassifier.py

Doing classification
--------------------
data:		digits
classifier:		mostFrequent
training set size:	100
Extracting features...
Training...
Validating...
17 correct out of 140 (12.1%).
Testing...
18 correct out of 140 (12.9%).
Mistake on example 0
Predicted 1; truth is 9
Image:
                            
                            
                            
                            
                            
                            
                            
             ++###+         
             ######+        
            +######+        
            ##+++##+        
           +#+  +##+        
           +##++###+        
           +#######+        
           +#######+        
            +##+###         
              ++##+         
              +##+          
              ###+          
            +###+           
            +##+            
           +##+             
          +##+              
         +##+               
         ##+             

In [87]:
!python dataClassifier.py -h 

Usage: 
  USAGE:      python dataClassifier.py <options>
  EXAMPLES:   (1) python dataClassifier.py
                  - trains the default mostFrequent classifier on the digit dataset
                  using the default 100 training examples and
                  then test the classifier on test data
                 

Options:
  -h, --help            show this help message and exit
  -c CLASSIFIER, --classifier=CLASSIFIER
                        The type of classifier [Default: mostFrequent]
  -d DATA, --data=DATA  Dataset to use [Default: digits]
  -t TRAINING, --training=TRAINING
                        The size of the training set [Default: 100]
  -a, --autotune        Whether to automatically tune hyperparameters
                        [Default: False]
  -i ITERATIONS, --iterations=ITERATIONS
                        Maximum iterations to run training [Default: 3]


In [127]:
!python dataClassifier.py -c naiveBayes --autotune

Doing classification
--------------------
data:		digits
classifier:		naiveBayes
training set size:	100
using automatic tuning for naivebayes
Extracting features...
Training...
Performance validation set for k=0.001: 66.4%
Performance validation set for k=0.010: 67.9%
Performance validation set for k=0.050: 70.0%
Performance validation set for k=0.100: 70.7%
Performance validation set for k=0.500: 68.6%
Performance validation set for k=1.000: 67.9%
Performance validation set for k=5.000: 55.0%
Performance validation set for k=10.000: 48.6%
Performance validation set for k=20.000: 36.4%
Performance validation set for k=50.000: 26.4%
Validating...
99 correct out of 140 (70.7%).
Testing...
81 correct out of 140 (57.9%).
Mistake on example 3
Predicted 3; truth is 5
Image:
                            
                            
                            
                            
                            
          +#########+       
         +###########+      
         ##########