# Summary

## Overview

1. Simple yet powerful ML algorithm
2. Used for <u>Classification tasks based on features</u>

## Naive Assumption
1. Assumes features are independent of each other given the class, which is why it is called <u>Naive</u>

## Algorithm Workflow
1. Starts from the prior probability of each class
2. Uses Bayes' theorem to update that probability based on the observed features
3. Selects the class with the highest resulting probability as the predicted class
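
Written as a formula (with $c$ ranging over the classes and $x_1, \dots, x_n$ the observed feature values; these symbols are chosen here for illustration), the workflow above amounts to:

$$ \hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c) $$

The denominator of Bayes' theorem, $P(x_1, \dots, x_n)$, is the same for every class, so it can be left out when only the most probable class is needed.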

## Key Features
1. Simple and computationally efficient
2. Particularly effective for text classification tasks like <u>spam detection and document categorization</u>.
3. Performs well in practice, even with the "naive" independence assumption.

## Applications

1. Widely used in various applications due to its speed, simplicity, and effectiveness
2. Especially <u>suitable for high-dimensional data with many features</u>

## A Very Basic Example

Suppose we have a dataset of emails, where each email is represented by two features: 
1. the presence of the word "lottery" (0 or 1) 
2. the presence of the word "free" (0 or 1)

The target variable is whether the email is spam (1) or not spam (0).

| Email ID | Lottery | Free | Spam |
|----------|---------|------|------|
| 1        | 1       | 0    | 1    |
| 2        | 0       | 1    | 1    |
| 3        | 0       | 0    | 0    |
| 4        | 1       | 1    | 1    |
| 5        | 0       | 0    | 0    |

Now, let's say we receive a new email with the following features: "lottery" (1) and "free" (0).

Using Naive Bayes, we want to predict whether this new email is spam or not spam.

**The algorithm works as follows:**

1. <u>Calculate Class Probabilities:</u>

Compute the probability of spam and not spam emails in the dataset. In this case, P(Spam) = 3/5 and P(Not Spam) = 2/5.

2. <u>Calculate Feature Probabilities:</u>

For each feature (lottery and free), compute the conditional probabilities of the feature given the class (spam or not spam).  

For example, P(Lottery=1 | Spam) = 2/3 (as 2 out of 3 spam emails contain "lottery").
Similarly, P(Lottery=1 | Not Spam) = 0 (neither of the 2 non-spam emails contains "lottery") and P(Free=0 | Not Spam) = 1.

3. <u>Calculate Posterior Probabilities:</u>

Use Bayes' theorem to calculate the posterior probabilities of each class given the features.  

For the new email, calculate P(Spam | Lottery=1, Free=0) and P(Not Spam | Lottery=1, Free=0) using the conditional probabilities and class probabilities.

4. <u>Predict the Class:</u>

Choose the class with the highest posterior probability as the predicted class.

In this case, if P(Spam | Lottery=1, Free=0) > P(Not Spam | Lottery=1, Free=0), then predict spam; otherwise, predict not spam.

5. <u>Result:</u>

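For the new email, the unnormalized score for spam is P(Spam) × P(Lottery=1 | Spam) × P(Free=0 | Spam) = 3/5 × 2/3 × 1/3 = 2/15. The score for not spam is P(Not Spam) × P(Lottery=1 | Not Spam) × P(Free=0 | Not Spam) = 2/5 × 0/2 × 2/2 = 0, because neither non-spam email contains "lottery". The new email is therefore predicted to be spam.
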
This is a basic example of how Naive Bayes works for classification tasks. It's simple yet effective, especially for text classification problems like spam detection.
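
The four steps above can be checked with a few lines of Python. This is only a sketch for the five-email table: it estimates every probability by plain counting (no smoothing), and the helper name `unnormalized_posterior` is made up for illustration.

```python
# Toy dataset from the table above: (lottery, free, spam)
emails = [(1, 0, 1), (0, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0)]

def unnormalized_posterior(lottery: int, free: int, label: int) -> float:
    """P(label) * P(lottery | label) * P(free | label), estimated by counting."""
    in_class = [e for e in emails if e[2] == label]
    prior = len(in_class) / len(emails)
    p_lottery = sum(e[0] == lottery for e in in_class) / len(in_class)
    p_free = sum(e[1] == free for e in in_class) / len(in_class)
    return prior * p_lottery * p_free

# New email: "lottery" present, "free" absent
spam_score = unnormalized_posterior(lottery=1, free=0, label=1)  # 3/5 * 2/3 * 1/3
ham_score = unnormalized_posterior(lottery=1, free=0, label=0)   # 2/5 * 0 * 1

# Dividing by the sum of the scores turns them into posterior probabilities
p_spam = spam_score / (spam_score + ham_score)
prediction = "spam" if spam_score > ham_score else "not spam"
print(prediction, p_spam)
```

Because no non-spam email in the table contains "lottery", the not-spam score is exactly 0 here. On such a tiny dataset this says more about the lack of data than about the email.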

# 1. A Really Dumb Spam Filter

- Consider 2 events:
  1. S : "the message is spam"
  2. B : "the message contains the word `bitcoin`"
 
- According to Bayes' theorem, the probability that a message is spam given that it contains the word bitcoin is:
  $$ P(S|B) = \frac{P(B|S).P(S)}{P(B|S).P(S) + P(B|¬S).P(¬S)}$$

- This simply represents the proportion of bitcoin messages that are spam.

- **Assumption**
  1. We have a sample data of known 'spam' and 'not spam' - thus we can estimate 
     $ P(B|S)$ and $P(B|¬S)$
  2. Any new message has equal probability of being 'spam' or 'not spam' - thus $P(S) = P(¬S) = 0.5$

     So, from Bayes' theorem:
   >  $$ P(S|B) = \frac{P(B|S).P(S)}{P(B|S).P(S) + P(B|¬S).P(¬S)}$$
   >  $$ P(S|B) = \frac{P(B|S).P(S)}{P(S).(P(B|S) + P(B|¬S))}$$
   >  $$ P(S|B) = \frac{P(B|S)}{P(B|S) + P(B|¬S)}$$

  **Example**
  If 50% of spam messages contain the word bitcoin (P(B|S) = 0.5) but only 1% of non-spam messages do (P(B|¬S) = 0.01), then the probability that any given bitcoin-containing email is spam is:
$$ P(S|B) = \frac{0.5}{0.5 + 0.01} \approx 0.98 $$
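
The same calculation as a small function. This is a sketch: the function and parameter names are chosen here for illustration, and the prior `p_spam` defaults to the 0.5 assumed above.

```python
def p_spam_given_word(p_word_given_spam: float,
                      p_word_given_ham: float,
                      p_spam: float = 0.5) -> float:
    """Bayes' theorem for a single word; p_spam is the prior P(S)."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    return numerator / (numerator + p_word_given_ham * p_ham)

result = p_spam_given_word(0.5, 0.01)
print(round(result, 2))  # 0.98
```

Passing a different prior, for example `p_spam=0.2`, shows how much the equal-prior assumption influences the answer.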

# 2. A More Sophisticated Spam Filter
- Imagine we have a vocabulary of words $w_{1}, w_{2}, ... w_{n}$
  > Event $X_{i}$ means ' the message contains $w_i$ '
- Imagine **we have sample data** of spam and non-spam messages from which we know:
  > Probability that a spam message contains word $w_{i}$ is $P(X_{i}|S)$  
  > and  
  > Probability that a non-spam message contains word $w_{i}$ is $P(X_{i}|¬S)$
- Naive Bayes assumes that **the presence of each word is independent of the others**, conditional on the message being spam (or not spam), such that:
  > $P(X_{1} = x_{1}, . . . ,X_{n} = x_{n}|S) = P(X_{1} = x_{1}|S) ×⋯× P(X_{n} = x_{n}|S)$
  - This is an extreme assumption.
  - Imagine we have 100 spam messages.
  - Of these, 50 contain the word 'bitcoin' and the other 50 contain the word 'rolex'.
  - No spam message contains both words, so the probability that a spam message contains both should be 0.
  - But because the model treats the two words as occurring independently, the formula gives:
  > $P(X_{1} = 1, X_{2} = 1|S) = P(X_{1} = 1|S) × P(X_{2} = 1|S)$  
  > $P(X_{1} = 1, X_{2} = 1|S) = 0.5 × 0.5 = 0.25$  
  > Although this assumption is unrealistic, the model often performs well and has historically been used in actual spam filters.
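
The bitcoin/rolex example can be reproduced directly. This is a sketch with 100 made-up messages, each represented as the set of words it contains.

```python
# 100 hypothetical spam messages: 50 contain only "bitcoin", 50 contain only "rolex"
spam = [{"bitcoin"}] * 50 + [{"rolex"}] * 50

p_bitcoin = sum("bitcoin" in m for m in spam) / len(spam)
p_rolex = sum("rolex" in m for m in spam) / len(spam)

# What the independence assumption predicts for "both words present"
naive_joint = p_bitcoin * p_rolex

# What the data actually says: fraction of messages containing both words
observed_joint = sum({"bitcoin", "rolex"} <= m for m in spam) / len(spam)

print(naive_joint, observed_joint)  # 0.25 0.0
```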

- Using Bayes' theorem we can compute $P(S|X=x)$
  > $$ P(S|X=x_{i}) = \frac{P(x_{i}|S)}{P(x_{i}|S) + P(x_{i}|¬S)}$$
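  > To compute the probability that a message is spam given its whole list of words $x_{1}, x_{2},..., x_{n}$, multiply the individual estimates $P(x_{i}|S)$ together to get $P(x|S)$ (and likewise $P(x_{i}|¬S)$ to get $P(x|¬S)$), then apply the same formula to the two products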

- **Problem of Underflow**: multiplying lots of probabilities together risks underflow, because floating-point numbers cannot represent values that are too close to 0. When the computer multiplies many very small numbers (e.g. 0.0001 x 0.00025 x ...), the product eventually rounds to 0.
  > To avoid this we use $\log(ab) = \log a + \log b$ and $\exp(\log x) = x$: we add log probabilities instead of multiplying probabilities.   
  > This is floating-point-friendlier.
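
A quick demonstration of the underflow problem and of the log-sum workaround. This is a sketch using 100 made-up probabilities of 0.00001 each.

```python
import math

probs = [1e-5] * 100   # 100 tiny probabilities (made-up values)

# Direct multiplication: the true product is 1e-500, far below the smallest
# positive float, so the running product collapses to 0.0
naive_product = 1.0
for p in probs:
    naive_product *= p

# Summing logs keeps the same information as an ordinary-sized negative number
log_sum = sum(math.log(p) for p in probs)

print(naive_product)  # 0.0
print(log_sum)
```

Two classes can then be compared through their log sums directly, without ever converting back to the tiny product.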

- **Problem of 0 probability of any word**:
  > **How do we calculate $P(x_{i}|S)$ and $P(x_{i}|¬S)$?** -- From sample data -- simply as the fraction of spam (or non-spam) messages containing the word $w_{i}$.    
  > **Problem in this?** - Imagine that in our training set the vocabulary word `data` only occurs in nonspam messages. Then we’d estimate P(`data`|S) = 0. The result is that our Naive Bayes classifier would always assign spam probability 0 to any message containing the word data, even a message like “data on free bitcoin and authentic rolex watches.”    
  > **How to solve this?** -- we’ll choose a pseudocount, k, and estimate the probability of seeing the $i^{th}$ word in a spam message as:  
$$ P(X_{i}|S) = \frac{k + \text{number of spams containing } w_{i}}{2k + \text{number of spams}} $$    
  > We do similarly for $P(X_{i}|¬S)$. That is, when computing the spam probabilities for the $i^{th}$ word, we assume we also saw k additional spams containing the word and k additional spams not containing the word.  
  > For example, if 'data' occurs in 0/98 spam messages, and if k is 1, we estimate P(data|S) as 1/100 = 0.01, **which allows our classifier to still assign some nonzero spam probability** to messages that contain the word 'data'.
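
The smoothed estimate is a one-liner. This is a sketch in which the function name `smoothed_probability` is chosen here for illustration.

```python
def smoothed_probability(count_with_word: int, total: int, k: float = 1.0) -> float:
    """(k + count) / (2k + total): pseudocount-smoothed estimate of P(word | class)."""
    return (k + count_with_word) / (2 * k + total)

# 'data' occurs in 0 of 98 spam messages
print(smoothed_probability(0, 98, k=1))   # 0.01
print(smoothed_probability(0, 98, k=0))   # 0.0, the unsmoothed estimate
```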

# 3. Implementation

## Without functions - Basic code
> 1. We have messages as named tuples of (text: str, is_spam: bool)
> 2. First we tokenize each text and keep the result in a dict of {message: Set[words]}
> 3. Then we check whether each message is_spam and count the number of spam messages, the number of ham messages, and how many spam and ham messages contain each token
> 4. Finally we calculate the probability of each token in spam and ham messages

In [37]:
# Our training data
from typing import NamedTuple
class Message(NamedTuple):
    text: str
    is_spam: bool
    
messages = [Message(text = 'Congratulations Tanu, you have won 0.5 BTC. Click here to claim your BTC now', is_spam = True),
            Message(text = 'Hey Tanu, you won a new gift! Click here to buy your Rolex now', is_spam = True),
            Message(text = 'Dear Tanu, this mail is a reply to your query', is_spam = False)]

# Let's print the class
for message in messages:
    print(f"{message.text=}")
    print(f"{message.is_spam=}")


message.text='Congratulations Tanu, you have won 0.5 BTC. Click here to claim your BTC now'
message.is_spam=True
message.text='Hey Tanu, you won a new gift! Click here to buy your Rolex now'
message.is_spam=True
message.text='Dear Tanu, this mail is a reply to your query'
message.is_spam=False


In [38]:
# Tokenize the training data

import re

tokens_dict = {}
for message in messages:
    text = message.text.lower()
    words = re.findall("[a-z0-9']+", text) # Returns a list of all non-overlapping matches in the string
    words = set(words)
    tokens_dict[message] = words

# Let's print the dict of tokens
print(f"{tokens_dict=}")

tokens_dict={Message(text='Congratulations Tanu, you have won 0.5 BTC. Click here to claim your BTC now', is_spam=True): {'tanu', 'won', 'click', 'claim', 'you', 'now', 'congratulations', 'here', 'have', 'to', 'your', 'btc', '0', '5'}, Message(text='Hey Tanu, you won a new gift! Click here to buy your Rolex now', is_spam=True): {'tanu', 'won', 'new', 'a', 'you', 'click', 'rolex', 'now', 'buy', 'here', 'your', 'to', 'gift', 'hey'}, Message(text='Dear Tanu, this mail is a reply to your query', is_spam=False): {'mail', 'tanu', 'reply', 'a', 'this', 'is', 'your', 'to', 'dear', 'query'}}


In [39]:
# Counts

from collections import defaultdict

spam_messages = 0
ham_messages = 0
spam_token_counts = defaultdict(int)
ham_token_counts = defaultdict(int)

for message in messages:
    if message.is_spam:
        spam_messages +=1 
        for word in tokens_dict[message]:
            spam_token_counts[word] +=1
    else:
        ham_messages +=1
        for word in tokens_dict[message]:
            ham_token_counts[word] +=1

# Let's print the token counts in spam and ham

print(spam_token_counts)
print(ham_token_counts)

defaultdict(<class 'int'>, {'tanu': 2, 'won': 2, 'click': 2, 'claim': 1, 'you': 2, 'now': 2, 'congratulations': 1, 'here': 2, 'have': 1, 'to': 2, 'your': 2, 'btc': 1, '0': 1, '5': 1, 'new': 1, 'a': 1, 'rolex': 1, 'buy': 1, 'gift': 1, 'hey': 1})
defaultdict(<class 'int'>, {'mail': 1, 'tanu': 1, 'reply': 1, 'a': 1, 'this': 1, 'is': 1, 'your': 1, 'to': 1, 'dear': 1, 'query': 1})
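
The last step of the list, turning counts into probabilities, could look like the sketch below. It is self-contained, so it copies a few of the counts printed above instead of reusing the notebook variables. The pseudocount `k = 0.5` is an assumed value, and `token_probs` is a name chosen here for illustration.

```python
# Counts copied from the cell output above (2 spam messages, 1 ham message);
# only a few tokens are listed to keep the sketch short
spam_messages, ham_messages = 2, 1
spam_token_counts = {"tanu": 2, "won": 2, "btc": 1, "rolex": 1}
ham_token_counts = {"tanu": 1, "mail": 1, "reply": 1}

k = 0.5  # pseudocount; an assumed value

vocabulary = set(spam_token_counts) | set(ham_token_counts)
token_probs = {}
for token in sorted(vocabulary):
    # Smoothed fraction of spam / ham messages that contain the token
    p_spam = (spam_token_counts.get(token, 0) + k) / (spam_messages + 2 * k)
    p_ham = (ham_token_counts.get(token, 0) + k) / (ham_messages + 2 * k)
    token_probs[token] = (p_spam, p_ham)

for token, (p_spam, p_ham) in token_probs.items():
    print(f"{token:>6}: P(token|spam)={p_spam:.3f}  P(token|ham)={p_ham:.3f}")
```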


## Naive Bayes' Classifier

1. Tokenize
2. Define type of training data as NamedTuple
3. Define classifier
   > 1. init method 
   > 2. training by iteration
   > 3. probability calculation from (k + spam_with_w_counts)/(2k + total_spam_counts)
   > 4. predict probability of message being spam, assuming all tokens are independent of each other
   

In [40]:
#Tokenize
from typing import Dict, Set, Iterable, Tuple
import re, math
from collections import defaultdict


def tokenize(text:str) -> Set[str]:
    text = text.lower()
    all_words = re.findall("[a-z0-9']+", text)
    return set(all_words)

assert tokenize("Data science is science") == {"data", "science", "is"}


# Define type for training data
# For named tuple type we use class
from typing import NamedTuple
class Message(NamedTuple):
    text: str
    is_spam: bool


# Classifier needs to keep track of tokens, counts, labels etc
# For this we create a class
# Initialize all counters to 0

class NaiveBayesClassifier:
    def __init__(self, k: float = 0.5) -> None:
        self.k = k # smoothing factor

        self.tokens: Set[str] = set()
        self.token_spam_counts: Dict[str, int] = defaultdict(int)
        self.token_ham_counts:  Dict[str, int] = defaultdict(int)
        self.spam_messages = self.ham_messages = 0

   
    # Method to train a bunch of messages
    # Increment spam and ham message counts
    # tokenize spam and ham message and keep count track in dicts
    def train(self, messages: Iterable[Message]) -> None:

        # Increment spam and ham counts
        for message in messages:
            if message.is_spam:
                self.spam_messages +=1
            else:
                self.ham_messages +=1

            # Increment word counts
            for token in tokenize(message.text):
                self.tokens.add(token)
                if message.is_spam:
                    self.token_spam_counts[token] +=1
                else:
                    self.token_ham_counts[token] +=1

    # To predict P(spam|token) using Bayes' theorem
    # We need P(token|spam) and P(token|ham)
    # Create private helper function for this

    def _probabilities(self, token: str) -> Tuple[float, float]:
        """
        returns P(token|spam) and P(token|ham)
        """
        spam = self.token_spam_counts[token] # number of spam messages containing the token
        ham = self.token_ham_counts[token]   # number of ham messages containing the token

        p_token_spam = (spam + self.k) / (self.spam_messages + 2 * self.k)
        p_token_ham = (ham + self.k) /(self.ham_messages + 2* self.k)
        return p_token_spam, p_token_ham
        
    # Finally predict using log and exp method instead of multiplication of small p's
   

    def predict(self, text: str) -> float:
        text_tokens = tokenize(text)
        log_p_if_spam = log_p_if_ham = 0.0

        # Iterate through each word in our vocab 

        for token in self.tokens:
            prob_if_spam, prob_if_ham = self._probabilities(token)

            # If 'token' appears in the message,
            # Add log prob of seeing it
            if token in text_tokens:
                log_p_if_spam += math.log(prob_if_spam)
                log_p_if_ham += math.log(prob_if_ham)

            # Otherwise add the log probability of not seeing it
            else:
                log_p_if_spam  += math.log(1.0 - prob_if_spam)
                log_p_if_ham += math.log(1.0 - prob_if_ham)

        prob_if_spam = math.exp(log_p_if_spam)
        prob_if_ham = math.exp(log_p_if_ham)
        return prob_if_spam/(prob_if_spam+prob_if_ham)



### Testing 

In [41]:
# Sample test data

messages = [Message("spam rules", is_spam=True),
Message("ham rules", is_spam=False),
Message("hello ham", is_spam=False)]

model = NaiveBayesClassifier(k = 0.5)
model.train(messages)

# Let's check what we got
print(f" {model.k =},\n {model.tokens=},\n {model.token_spam_counts=},\n {model.token_ham_counts=},\n {model.spam_messages =},\n {model.ham_messages=} ")

       

 model.k =0.5,
 model.tokens={'spam', 'ham', 'rules', 'hello'},
 model.token_spam_counts=defaultdict(<class 'int'>, {'spam': 1, 'rules': 1}),
 model.token_ham_counts=defaultdict(<class 'int'>, {'rules': 1, 'ham': 2, 'hello': 1}),
 model.spam_messages =1,
 model.ham_messages=2 


In [42]:
# Let's make the prediction now

text = "hello spam"
model.predict(text)

0.8350515463917525

In [43]:
#Let's calculate it by hand 

text = "hello spam"

# Probability of each vocabulary word being present (or absent) in the text, given spam
prob_if_spam =  [
(1 + 0.5) / (1 + 2 * 0.5), # "spam" (present in text)  
1 - (0 + 0.5) / (1 + 2 * 0.5), # "ham" (not present)  
1 - (1 + 0.5) / (1 + 2 * 0.5), # "rules" (not present)  
(0 + 0.5) / (1 + 2 * 0.5) # "hello" (present)
]

# Probability of each vocabulary word being present (or absent) in the text, given ham
prob_if_ham =  [
(0+0.5)/(2*0.5 + 2),  #'spam'(present)  
(1- (2+0.5) / (2*0.5 + 2)),  #'ham'(not present)    
(1- (1+0.5) / (2*0.5 + 2)), #'rules'(not present)   
(1+0.5)/(2*0.5 + 2) #'hello'(present in text)   
]

In [44]:
p_if_spam = math.exp(sum(math.log(p) for p in prob_if_spam))
p_if_ham = math.exp(sum(math.log(p) for p in prob_if_ham))

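# Should be about 0.83
# Compare with a tolerance: the set of tokens is iterated in arbitrary order,
# so the floating-point sums can differ in the last digits
assert math.isclose(model.predict(text), p_if_spam / (p_if_spam + p_if_ham))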

# 4. Using our model on real data

In [45]:
# Download and unpack dataset

import requests # To request file from url
import tarfile # To open files in .tar format
from io import BytesIO # So we can treat bytes as a file.


BASE_URL = "https://spamassassin.apache.org/old/publiccorpus/"
FILES = ["20021010_easy_ham.tar.bz2",
         "20021010_hard_ham.tar.bz2",
         "20021010_spam.tar.bz2"]

OUTPUT_DIR = 'spam_data' # This is where data will be downloaded

# Download and extract all three FILES
for filename in FILES:
    # BASE_URL already ends with "/", so no extra separator is needed
    content = requests.get(f"{BASE_URL}{filename}").content
    fin = BytesIO(content)
    with tarfile.open(fileobj = fin, mode = 'r:bz2') as tf:
        tf.extractall(OUTPUT_DIR)


- After downloading the data you should have three folders: spam, easy_ham, and hard_ham.
  
- Each folder contains many emails, each contained in a single file.

- To keep things really simple, we’ll just look at the subject lines of each email.

- How do we identify the subject line? When we look through the files, the subject lines all seem to <u>start with “Subject:”</u>. So we’ll look for that.

In [46]:
from typing import List
import glob # Helps in searching for files matching a specified pattern
import re

# Create a Message type class
from typing import NamedTuple
class Message(NamedTuple):
    text: str
    is_spam: bool

# spam_data is directory where files are present
# * is the wildcard character used to find path patterns
path = 'spam_data/*/*'

# Create empty list of Message type
data: List[Message] = []

for filename in glob.glob(path):
    is_spam = "ham" not in filename # True if spam, False if ham
    with open(filename, errors = 'ignore') as email_file:
        for line in email_file:
            if line.startswith("Subject:"):
                # lstrip("Subject: ") would strip a set of characters, not the prefix
                subject = line[len("Subject:"):].strip()
                data.append(Message(subject, is_spam))
                break
#data
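
A note on extracting the subject: `str.lstrip` removes any leading characters that appear in its argument, not a literal prefix, so it can eat into the subject itself. The sketch below uses a made-up subject line to compare it with slicing off the prefix.

```python
line = "Subject: Spectacular savings on bitcoin\n"

stripped = line.lstrip("Subject: ")        # strips *characters* S,u,b,j,e,c,t,:,space
sliced = line[len("Subject:"):].strip()    # drops the prefix, then surrounding whitespace

print(repr(stripped))  # 'pectacular savings on bitcoin\n'
print(repr(sliced))    # 'Spectacular savings on bitcoin'
```

On Python 3.9 and later, `line.removeprefix("Subject:")` is another way to drop the prefix.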

In [47]:
# Split data into training and test data
import random
from scratch.machine_learning import split_data

random.seed(0)
train_messages, test_messages = split_data(data, 0.75)

model = NaiveBayesClassifier()
model.train(train_messages)

In [48]:
# Generate predictions

from collections import Counter 

predictions = [(message, model.predict(message.text))
              for message in test_messages]
confusion_matrix = Counter((message.is_spam, spam_probability >0.5)
                           for message, spam_probability in predictions)

print(confusion_matrix)

Counter({(False, False): 668, (True, True): 85, (True, False): 54, (False, True): 18})


- According to the confusion matrix, our model predicts:
  > spam as spam = 85   
  > spam as ham = 54  
  > ham as ham = 668  
  > ham as spam = 18

  > Precision = 85/(85+18) ≈ 83%  
  > Recall = 85/(85+54) ≈ 61%  

- Not bad for such a simple model that looks only at the subject lines of emails
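
The same arithmetic in code, with accuracy and F1 added for comparison. This is a sketch: the four counts are copied by hand from the confusion matrix printed above.

```python
# Counts read off the confusion matrix above
tp, fn, tn, fp = 85, 54, 668, 18

precision = tp / (tp + fp)                    # of predicted spam, how much is spam
recall = tp / (tp + fn)                       # of actual spam, how much was caught
accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{precision=:.3f} {recall=:.3f} {accuracy=:.3f} {f1=:.3f}")
```

Accuracy looks much better than recall because most messages in the test set are ham.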

In [49]:
# Let's see which words are most and least indicative of spam

def p_spam_given_token(token:str, model: NaiveBayesClassifier) -> float:
    prob_if_spam, prob_if_ham = model._probabilities(token)

    return prob_if_spam / (prob_if_ham + prob_if_spam)

words = sorted(model.tokens, key = lambda t: p_spam_given_token(t, model))
print("spammiest_words", words[-10:])
print("hammiest_words", words[:10])

spammiest_words ['500', 'assistance', 'account', 'attn', 'sale', 'zzzz', 'systemworks', 'money', 'adv', 'rates']
hammiest_words ['spambayes', '2', 'users', 'razor', 'zzzzteana', 'sadev', 'ouch', 'apt', 'bliss', 'selling']


That's it!

What can we do to get better performance?
- Look at the message content, not just the subject line. You’ll have to be careful how you deal with the message headers.

- Our classifier takes into account every word that appears in the training set, even words that appear only once. Modify the classifier to accept an optional min_count threshold and ignore tokens that don’t appear at least that many times.

- The tokenizer has no notion of similar words (e.g., cheap and cheapest). Modify the classifier to take an optional stemmer function that converts words to equivalence classes of words. For example, a really simple stemmer function might be:  
`def drop_final_s(word):`         
    `return re.sub("s$", "", word)`

  Writing a really good stemmer is difficult; people frequently use the Porter stemmer - https://tartarus.org/martin/PorterStemmer/

- Although our features are all of the form “message contains word $w_{i}$,” there’s no reason why this has to be the case. In our implementation, we could add extra features like “message contains a number” by creating phony tokens like `contains:number` and modifying the tokenizer to emit them when appropriate.
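
Two of these ideas, the optional stemmer and the phony `contains:number` token, might be sketched as follows. `tokenize_with_extras` is a hypothetical variant of `tokenize`, not part of the classifier above. It treats only all-digit words as numbers, which is a simplifying assumption.

```python
import re
from typing import Callable, Optional, Set

def drop_final_s(word: str) -> str:
    """The really simple stemmer from the list above."""
    return re.sub("s$", "", word)

def tokenize_with_extras(text: str,
                         stemmer: Optional[Callable[[str], str]] = None) -> Set[str]:
    """Hypothetical variant of tokenize: optional stemming plus a phony number token."""
    words = re.findall("[a-z0-9']+", text.lower())
    tokens = {stemmer(word) if stemmer else word for word in words}
    # ':' is never matched by the regex, so the phony token cannot
    # collide with a real word
    if any(word.isdigit() for word in words):
        tokens.add("contains:number")
    return tokens

print(tokenize_with_extras("Win 500 rolex watches", stemmer=drop_final_s))
```

Note how crude `drop_final_s` is: "watches" becomes "watche", which is one reason a proper stemmer is usually preferred.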

# 5. Naive Bayes using sklearn

Steps:
1. Import libraries
2. Prepare data
3. Split data
4. Vectorize features
   > Vectorization assigns an integer index, starting from 0, to each word of the texts in alphabetical order (vectorizer.vocabulary_)  
   > When we fit the vectorizer on the texts, it creates a matrix with one row per text and one column per word, whose entries are the word counts
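
To make the vectorization step concrete, here is a plain-Python sketch of the idea. It is not scikit-learn's implementation. The helper names are made up, and the regex mimics the vectorizer's default behaviour of keeping only lowercased words of two or more characters.

```python
import re
from typing import Dict, List

WORD = r"\b\w\w+\b"  # words of 2+ characters, as in CountVectorizer's default pattern

def build_vocabulary(texts: List[str]) -> Dict[str, int]:
    """Map each distinct word to its position in alphabetical order."""
    words = {w for text in texts for w in re.findall(WORD, text.lower())}
    return {word: index for index, word in enumerate(sorted(words))}

def count_matrix(texts: List[str], vocab: Dict[str, int]) -> List[List[int]]:
    """One row per text, one column per vocabulary word, entries are word counts."""
    rows = []
    for text in texts:
        row = [0] * len(vocab)
        for w in re.findall(WORD, text.lower()):
            if w in vocab:
                row[vocab[w]] += 1
        rows.append(row)
    return rows

texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
vocab = build_vocabulary(texts)
matrix = count_matrix(texts, vocab)
print(vocab)
print(matrix[1])   # row for "not a good movie"
```

The single-letter words "a" and "i" are dropped, which is why the vocabulary has 7 entries rather than 9. scikit-learn stores the same information as a sparse matrix instead of nested lists.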

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Sample training data
texts = ["good movie", "not a good movie", "did not like", "i like it", "good one"]
labels = [1, 0, 0, 1, 1]

#Vectorize the texts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
vocab = vectorizer.vocabulary_

#Let's see what's vocab and vectorized matrix
assert vocab=={'good': 1, 'movie': 4, 'not': 5, 'did': 0, 'like': 3, 'it': 2, 'one': 6}
print(X)
X.shape

  (0, 1)	1
  (0, 4)	1
  (1, 1)	1
  (1, 4)	1
  (1, 5)	1
  (2, 5)	1
  (2, 0)	1
  (2, 3)	1
  (3, 3)	1
  (3, 2)	1
  (4, 1)	1
  (4, 6)	1


(5, 7)

In [51]:
#Split the vectorized data X into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.5, random_state=42)
print(X_train)

# Train the Naive Bayes classifier on the training data
clf = MultinomialNB()
clf.fit(X_train,y_train)


  (0, 1)	1
  (0, 4)	1
  (1, 3)	1
  (1, 2)	1


In [52]:
# Make predictions on test data
predictions = clf.predict(X_test)
print(predictions)

# Calculate accuracy
accuracy = accuracy_score(y_test,predictions)
print(accuracy)

print(classification_report(y_test,predictions))

[1 1 1]
0.3333333333333333
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.33      1.00      0.50         1

    accuracy                           0.33         3
   macro avg       0.17      0.50      0.25         3
weighted avg       0.11      0.33      0.17         3





In [53]:
from typing import List
import glob # Helps in searching for files matching a specified pattern
import re

random.seed(0)
# Create a Message type class
from typing import NamedTuple
class Message(NamedTuple):
    text: str
    is_spam: bool

# spam_data is directory where files are present
# * is the wildcard character used to find path patterns
path = 'spam_data/*/*'

# Create empty list of Message type
data: List[Message] = []

for filename in glob.glob(path):
    is_spam = "ham" not in filename # True if spam, False if ham
    with open(filename, errors = 'ignore') as email_file:
        for line in email_file:
            if line.startswith("Subject:"):
                # lstrip("Subject: ") would strip a set of characters, not the prefix
                subject = line[len("Subject:"):].strip()
                data.append(Message(subject, is_spam))
                break


In [54]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

#split data
train_messages, test_messages = train_test_split(data, train_size=0.75, random_state = 42)

# Vectorize
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform([message.text for message in train_messages])
X_test = vectorizer.transform([message.text for message in test_messages])

In [55]:
clf = MultinomialNB()
clf.fit(X_train, [message.is_spam for message in train_messages])
predictions = clf.predict(X_test)
accuracy = accuracy_score([message.is_spam for message in test_messages], predictions)
accuracy

0.9212121212121213

In [56]:
from collections import Counter
confusion_matrix = Counter(zip([message.is_spam for message in test_messages], predictions))
print(f"{confusion_matrix=}")

confusion_matrix=Counter({(False, False): 660, (True, True): 100, (True, False): 49, (False, True): 16})


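In [59]:
from scratch.machine_learning import recall, precision

# Keys of confusion_matrix are (actual, predicted), so iterating i and then j
# over range(2) yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = [confusion_matrix[(i,j)] for i in range(2)
               for j in range(2)]

print(f"recall = {recall(tp, fp, fn, tn):.3f}")
print(f"precision = {precision(tp, fp, fn, tn):.3f}")

recall = 0.671
precision = 0.862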

In [58]:
precision?

Signature: precision(tp: int, fp: int, fn: int, tn: int) -> float
Docstring: <no docstring>
File:      ~/Data-science-from-scratch/scratch/machine_learning.py
Type:      function