# Probability and Naive Bayes
From 'A Programmer's Guide to Data Mining' by Ron Zacharski
http://guidetodatamining.com/chapter6/


## Bayes Theorem
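* P(h): prior probability of hypothesis h, before any data is seen
* P(D|h): probability of observing data D given that h holds
* P(h|D): posterior probability of h after observing D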
```
P(A|B) = P(B|A)P(A)/P(B)
    
P(h|D) = P(D|h)P(h)/P(D)    
```
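As a minimal sketch of the formula above, the function below applies Bayes' theorem to three probabilities. The function name `posterior` and all of the numbers are made up for illustration; P(D) is obtained by summing over h and not-h.

```python
def posterior(p_d_given_h, p_h, p_d):
    """Bayes' theorem: P(h|D) = P(D|h) * P(h) / P(D)."""
    if p_d <= 0:
        raise ValueError("P(D) must be positive")
    return p_d_given_h * p_h / p_d


# Made-up numbers, for illustration only.
p_h = 0.4               # prior P(h)
p_d_given_h = 0.5       # P(D|h)
p_d_given_not_h = 0.25  # P(D|not h)

# Total probability of the data: P(D) = P(D|h)P(h) + P(D|not h)P(not h)
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

p_h_given_d = posterior(p_d_given_h, p_h, p_d)
print(round(p_h_given_d, 4))
```

Observing D raises the probability of h from 0.4 to roughly 0.57 in this made-up case, because D is twice as likely under h as under its alternative.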

## Naïve Bayes Classifier
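* Goal here: classify each customer by the wearable exercise monitor that suits them best
* 'Naïve' because it assumes the attributes are independent of one another given the class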


## Example: i100 / i500
* iHealth sells two wearable exercise monitors of increasing functionality: the i100 and the i500.
* Task: build a recommendation system for customers.
* Data: attributes, gathered by questionnaire, of customers who purchased a monitor.

## Features: iHealth data set
* Main reason for starting an exercise program: health, appearance, or both.
* Current exercise level: sedentary, moderate, or active.
* How motivated they are: moderate or aggressive.
* Whether they are comfortable using technological devices: yes or no.

## Load packages and data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

## Naïve Bayes: Which model would you recommend to a person whose:
* main interest is health
* current exercise level is moderate
* moderately motivated
* and comfortable with technological devices

## Calculate both probabilities, select the model with the higher one.
* P(i100 | health, moderateExercise, moderateMotivation, techComfortable)
* P(i500 | health, moderateExercise, moderateMotivation, techComfortable)

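## Find the following:
The denominator P(D) is the same for both models, so it can be dropped; each score is then proportional to (∝) the posterior, and the independence assumption lets it factor per attribute.
* P(i100 | health, moderateExercise, moderateMotivation, techComfortable) ∝ P(health|i100) * P(moderateExercise|i100) * P(moderateMotivation|i100) * P(techComfortable|i100) * P(i100)

* P(i500 | health, moderateExercise, moderateMotivation, techComfortable) ∝ P(health|i500) * P(moderateExercise|i500) * P(moderateMotivation|i500) * P(techComfortable|i500) * P(i500)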

## First, compute terms for i100:
* Six people bought the i100 model: P(i100) = 6/15 = 0.4
* One of them had 'health' as the main interest: P(health|i100) = 1/6 = 0.167
* One had a moderate exercise level: P(moderateExercise|i100) = 1/6 = 0.167
* Five were moderately motivated: P(moderateMotivation|i100) = 5/6 = 0.833
* Two were comfortable with tech devices: P(techComfortable|i100) = 2/6 = 0.333
```
P(i100 | evidence) ∝ 0.167*0.167*0.833*0.333*0.4 = 0.00309
```

## Second, compute terms for i500:
* Nine people bought the i500 model: P(i500) = 9/15 = 0.6
* Four of them had 'health' as the main interest: P(health|i500) = 4/9 = 0.444
* Three had a moderate exercise level: P(moderateExercise|i500) = 3/9 = 0.333
* Three were moderately motivated: P(moderateMotivation|i500) = 3/9 = 0.333
* Six were comfortable with tech devices: P(techComfortable|i500) = 6/9 = 0.667
```
P(i500 | evidence) ∝ 0.444*0.333*0.333*0.667*0.6 = 0.01975
```
The i500 score is the larger of the two (0.01975 vs. 0.00309), so the classifier recommends the i500.
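The two hand calculations can be reproduced exactly from the counts quoted in the bullets. The sketch below uses Python's `fractions` module to avoid rounding; the dictionary layout and the names `counts`, `score`, and `posteriors` are illustrative choices, not part of the original notebook. Dividing each score by their sum recovers the actual posterior probabilities.

```python
from fractions import Fraction

TOTAL = 15  # customers in the data set

# Counts taken from the worked example: "n" is the number of buyers of each
# model, the other entries are how many of those buyers match the evidence.
counts = {
    "i100": {"n": 6, "health": 1, "moderateExercise": 1,
             "moderateMotivation": 5, "techComfortable": 2},
    "i500": {"n": 9, "health": 4, "moderateExercise": 3,
             "moderateMotivation": 3, "techComfortable": 6},
}
evidence = ["health", "moderateExercise", "moderateMotivation", "techComfortable"]


def score(model):
    """Prior times the product of the per-attribute conditionals."""
    c = counts[model]
    result = Fraction(c["n"], TOTAL)
    for attribute in evidence:
        result *= Fraction(c[attribute], c["n"])
    return result


scores = {model: score(model) for model in counts}

# Normalising by the sum of the scores gives the true posteriors.
norm = sum(scores.values())
posteriors = {model: s / norm for model, s in scores.items()}

best = max(scores, key=scores.get)
for model in counts:
    print(model, float(scores[model]), float(posteriors[model]))
print("recommend:", best)
```

With exact arithmetic the scores are 1/324 and 8/405, which normalise to posteriors of 5/37 for the i100 and 32/37 for the i500.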

# Implement Naive Bayes in Python

In [4]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('iHealth.csv')

df.shape

(15, 6)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 6 columns):
Interest       15 non-null object
Exercise       15 non-null object
Motivated      15 non-null object
Comfortable    15 non-null object
Income         15 non-null int64
Model          15 non-null object
dtypes: int64(1), object(5)
memory usage: 800.0+ bytes


In [11]:
df.tail()

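|    | Interest   | Exercise  | Motivated  | Comfortable | Income | Model |
|----|------------|-----------|------------|-------------|--------|-------|
| 10 | both       | moderate  | aggressive | yes         | 100    | i500  |
| 11 | appearance | active    | aggressive | yes         | 120    | i500  |
| 12 | both       | active    | moderate   | no          | 95     | i500  |
| 13 | health     | active    | moderate   | no          | 90     | i500  |
| 14 | health     | sedentary | aggressive | yes         | 85     | i500  |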


## Training 
Training needs to produce:
* A set of prior probabilities, e.g. P(i100) = 0.4
* A set of conditional probabilities, e.g. P(health|i100) = 0.167
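One way to get both sets of probabilities straight from a DataFrame is sketched below. It is an assumption-laden illustration, not the method used later in this notebook: the helper name `train` is hypothetical, and the sample contains only the five rows printed by `df.tail()`, so its numbers differ from the full 15-row values quoted above.

```python
import pandas as pd

# Only the five rows shown by df.tail(); the other ten rows are not reproduced.
columns = ["Interest", "Exercise", "Motivated", "Comfortable", "Income", "Model"]
rows = [
    ("both", "moderate", "aggressive", "yes", 100, "i500"),
    ("appearance", "active", "aggressive", "yes", 120, "i500"),
    ("both", "active", "moderate", "no", 95, "i500"),
    ("health", "active", "moderate", "no", 90, "i500"),
    ("health", "sedentary", "aggressive", "yes", 85, "i500"),
]
sample = pd.DataFrame(rows, columns=columns)


def train(frame, class_col, attr_cols):
    """Return (prior, conditional) as plain dictionaries of relative frequencies."""
    # P(class): share of rows belonging to each class
    prior = frame[class_col].value_counts(normalize=True).to_dict()
    # P(value | class): share of each attribute value within one class
    conditional = {}
    for label, group in frame.groupby(class_col):
        conditional[label] = {
            col: group[col].value_counts(normalize=True).to_dict()
            for col in attr_cols
        }
    return prior, conditional


prior, conditional = train(
    sample, "Model", ["Interest", "Exercise", "Motivated", "Comfortable"]
)
print(prior)
print(conditional["i500"]["Interest"])
```

Calling `train(df, "Model", [...])` on the full data set would apply the same counting to all 15 rows; the `Income` column is left out here because it is numeric rather than categorical.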

## Represent the set of prior probabilities as a dictionary

```
self.prior = {'i500': 0.6, 'i100': 0.4}
```

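## Conditional probabilities are a little more complicated
They are stored as nested dictionaries keyed by class, then column number (1 = Interest, 2 = Exercise, 3 = Motivated, 4 = Comfortable), then attribute value.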
```
{'i500': {1: {'appearance': 0.3333333333333, 'health': 0.4444444444444,
              'both': 0.2222222222222},
          2: {'sedentary': 0.2222222222222, 'moderate': 0.333333333333,
              'active': 0.4444444444444444},
          3: {'moderate': 0.333333333333, 'aggressive': 0.66666666666},
          4: {'no': 0.3333333333333333, 'yes': 0.6666666666666666}}, 
 
 'i100':{1: {'appearance': 0.333333333333, 'health': 0.1666666666666,
             'both': 0.5},
         2: {'sedentary': 0.5, 'moderate': 0.16666666666666,
             'active': 0.3333333333333},
         3: {'moderate': 0.83333333334, 'aggressive': 0.166666666666},
         4: {'no': 0.6666666666666, 'yes': 0.3333333333333}}}
```
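To see how these two dictionaries are used, here is a small standalone sketch that classifies one customer from them. The values are the ones shown above, written as fractions; the free function `classify` and its argument names are illustrative, and an attribute value missing from a class is treated as probability 0.

```python
prior = {"i500": 9 / 15, "i100": 6 / 15}

# conditional[class][column number][attribute value]
conditional = {
    "i500": {1: {"appearance": 3 / 9, "health": 4 / 9, "both": 2 / 9},
             2: {"sedentary": 2 / 9, "moderate": 3 / 9, "active": 4 / 9},
             3: {"moderate": 3 / 9, "aggressive": 6 / 9},
             4: {"no": 3 / 9, "yes": 6 / 9}},
    "i100": {1: {"appearance": 2 / 6, "health": 1 / 6, "both": 3 / 6},
             2: {"sedentary": 3 / 6, "moderate": 1 / 6, "active": 2 / 6},
             3: {"moderate": 5 / 6, "aggressive": 1 / 6},
             4: {"no": 4 / 6, "yes": 2 / 6}},
}


def classify(prior, conditional, item_vector):
    """Return (score, category) for the highest-scoring category."""
    results = []
    for category, p in prior.items():
        prob = p
        # columns are numbered from 1, matching the dictionary keys
        for col, value in enumerate(item_vector, start=1):
            prob *= conditional[category][col].get(value, 0.0)
        results.append((prob, category))
    return max(results)


score, label = classify(prior, conditional, ["health", "moderate", "moderate", "yes"])
print(label, round(score, 5))
```

For the customer from the worked example this selects the i500 with a score of about 0.01975, matching the hand calculation.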

In [14]:
# Naive Bayes Classifier chapter 6

class Classifier:
    def __init__(self, bucketPrefix, testBucketNumber, dataFormat):

        """A classifier will be built from the files with the given bucketPrefix,
        excluding the file numbered testBucketNumber. dataFormat is a string that
        describes how to interpret each line of the data files. For example,
        for the iHealth data the format is:
        "attr	attr	attr	attr	class"
        """
   
        total = 0
        classes = {}
        counts = {}
        
        
        # reading the data in from the file
        
        self.format = dataFormat.strip().split('\t')
        self.prior = {}
        self.conditional = {}
        # for each of the buckets numbered 1 through 10:
        for i in range(1, 11):
            # if it is not the bucket we should ignore, read in the data
            if i != testBucketNumber:
                filename = "%s-%02i" % (bucketPrefix, i)
                with open(filename) as f:
                    lines = f.readlines()
                for line in lines:
                    fields = line.strip().split('\t')
                    ignore = []
                    vector = []
                    # use j here so the bucket counter i is not shadowed
                    for j in range(len(fields)):
                        if self.format[j] == 'num':
                            vector.append(float(fields[j]))
                        elif self.format[j] == 'attr':
                            vector.append(fields[j])
                        elif self.format[j] == 'comment':
                            ignore.append(fields[j])
                        elif self.format[j] == 'class':
                            category = fields[j]
                    # now process this instance
                    total += 1
                    classes.setdefault(category, 0)
                    counts.setdefault(category, {})
                    classes[category] += 1
                    # now process each attribute of the instance
                    col = 0
                    for columnValue in vector:
                        col += 1
                        counts[category].setdefault(col, {})
                        counts[category][col].setdefault(columnValue, 0)
                        counts[category][col][columnValue] += 1
        
        #
        # ok done counting. now compute probabilities
        #
        # first prior probabilities p(h)
        #
        for (category, count) in classes.items():
            self.prior[category] = count / total
        #
        # now compute conditional probabilities p(D|h)
        #
        for (category, columns) in counts.items():
            self.conditional.setdefault(category, {})
            for (col, valueCounts) in columns.items():
                self.conditional[category].setdefault(col, {})
                for (attrValue, count) in valueCounts.items():
                    self.conditional[category][col][attrValue] = (
                        count / classes[category])
        self.tmp = counts
        
    def testBucket(self, bucketPrefix, bucketNumber):
        """Evaluate the classifier with data from the file
        bucketPrefix-bucketNumber"""
        
        filename = "%s-%02i" % (bucketPrefix, bucketNumber)
        with open(filename) as f:
            lines = f.readlines()
        totals = {}
        for line in lines:
            data = line.strip().split('\t')
            vector = []
            classInColumn = -1
            for i in range(len(self.format)):
                if self.format[i] == 'num':
                    vector.append(float(data[i]))
                elif self.format[i] == 'attr':
                    vector.append(data[i])
                elif self.format[i] == 'class':
                    classInColumn = i
            theRealClass = data[classInColumn]
            classifiedAs = self.classify(vector)
            totals.setdefault(theRealClass, {})
            totals[theRealClass].setdefault(classifiedAs, 0)
            totals[theRealClass][classifiedAs] += 1
        return totals

    def classify(self, itemVector):
        """Return the class we think itemVector is in"""
        results = []
        for (category, prior) in self.prior.items():
            prob = prior
            col = 1
            for attrValue in itemVector:
                if attrValue not in self.conditional[category][col]:
                    # we did not find any instances of this attribute value
                    # occurring with this category so prob = 0
                    prob = 0
                else:
                    prob = prob * self.conditional[category][col][attrValue]
                col += 1
            results.append((prob, category))
        # return the category with the highest probability
        return max(results)[1]
 
def tenfold(bucketPrefix, dataFormat):
    results = {}
    for i in range(1, 11):
        c = Classifier(bucketPrefix, i, dataFormat)
        t = c.testBucket(bucketPrefix, i)
        for (key, value) in t.items():
            results.setdefault(key, {})
            for (ckey, cvalue) in value.items():
                results[key].setdefault(ckey, 0)
                results[key][ckey] += cvalue
                
    # now print results
    categories = list(results.keys())
    categories.sort()
    print(   "\n            Classified as: ")
    header =    "             "
    subheader = "               +"
    for category in categories:
        header += "% 10s   " % category
        subheader += "-------+"
    print(header)
    print(subheader)
    total = 0.0
    correct = 0.0
    for category in categories:
        row = " %10s    |" % category 
        for c2 in categories:
            if c2 in results[category]:
                count = results[category][c2]
            else:
                count = 0
            row += " %5i |" % count
            total += count
            if c2 == category:
                correct += count
        print(row)
    print(subheader)
    print("\n%5.3f percent correct" %((correct * 100) / total))
    print("total of %i instances" % total)

tenfold("house-votes/hv", "class\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr")
#c = Classifier("house-votes/hv", 0,
#                       "class\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr")

#c = Classifier("iHealth/i", 10,
#                       "attr\tattr\tattr\tattr\tclass")
#print(c.classify(['health', 'moderate', 'moderate', 'yes']))

#c = Classifier("house-votes-filtered/hv", 5, "class\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr\tattr")
#t = c.testBucket("house-votes-filtered/hv", 5)
#print(t)



            Classified as: 
               democrat   republican   
               +-------+-------+
   democrat    |   111 |    13 |
 republican    |     9 |    99 |
               +-------+-------+

90.517 percent correct
total of 232 instances
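The reported accuracy can be checked directly from the confusion matrix above. The sketch below re-enters the four counts by hand (rows are the real class, columns are what the classifier chose) and also derives a per-class hit rate, which the original output does not print; the variable names are illustrative.

```python
# confusion[real class][classified as], copied from the table above
confusion = {
    "democrat": {"democrat": 111, "republican": 13},
    "republican": {"democrat": 9, "republican": 99},
}

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c].get(c, 0) for c in confusion)
accuracy = 100 * correct / total

print("%5.3f percent correct" % accuracy)
print("total of %i instances" % total)

# Share of each real class that was classified correctly
per_class = {c: row.get(c, 0) / sum(row.values()) for c, row in confusion.items()}
for c, rate in per_class.items():
    print("%s: %.3f" % (c, rate))
```

The diagonal holds 111 + 99 = 210 correct classifications out of 232, which is where the 90.517 percent comes from.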
