<a href="https://colab.research.google.com/github/michaelwnau/ai_academy_notebooks/blob/main/WKS7_Student_tues_nau.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Workshop 7: Bayes Rule Rules

In this workshop, we'll be looking at how to use Naive Bayes and Bayes Nets.

---

In [41]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


# 0) Imports

In [42]:
import numpy as np

import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets

# set a seed for reproducibility
random_seed = 25
np.random.seed(random_seed)

# 1) Naive Bayes Spam Filtering

One historical use of Naive Bayes is to try and detect [spam emails](https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering). 

In this exercise, you will be using a dataset of emails from the [Enron Corporation](https://en.wikipedia.org/wiki/Enron_Corpus), an energy company that [went bankrupt in 2001 due to an accounting scandal](https://en.wikipedia.org/wiki/Enron_scandal).

## 1.1) Exploring the Data (Follow)

In [47]:
df = pd.read_csv("/content/drive/MyDrive/AI ACADEMY/2 - Data Mining/7- Week 7/WKS7_Student/enron_emails.csv")

In [48]:
# Ham is a legitimate email, while spam is unwanted
# Let's look at our distribution of spam and ham emails
email_counts = df.label.value_counts()
print(email_counts)

ham     3672
spam    1499
Name: label, dtype: int64


Keeping with the theme of meat products, some researchers call emails that are *not spam*, **ham**. 

To sum up: a **ham** email is a legitimate email, while a **spam** email is unwanted.


Let's explore some of the ham emails...

In [50]:
print(df[df["label"]=="ham"].text.iloc[4])

Subject: ehronline web address change
this message is intended for ehronline users only .
due to a recent change to ehronline , the url ( aka " web address " ) for accessing ehronline needs to be changed on your computer . the change involves adding the letter " s " to the " http " reference in the url . the url for accessing ehronline should be : https : / / ehronline . enron . com .
this change should be made by those who have added the url as a favorite on the browser .


And now the spam emails...

In [51]:
print(df[df["label"]=="spam"].text.iloc[18])

Subject: back
emile (
the cablefilterz will allow you to receive
all the channels that you order with your remote control ,
payperviews , axxxmovies , sport events , special - events !
http : / / www . 8006 hosting . com / cable /
avocation , despoil .



Try exploring different emails by changing the index in the lines above. **What common traits do you notice across the ham emails? The spam emails?**

## 1.2) Bag of Words (Follow)

Last week we used tf-idf to represent words as feature vectors. However, sometimes simpler methods work just as well (if not better). For this, we'll be using the [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) representation of a piece of text, which is much more interpretable than tf-idf.

In [52]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(X.shape)

(4, 9)


The CountVectorizer's `fit_transform` method returns an N×M (sparse) matrix. `N` is the number of documents (sentences) in your corpus, and `M` is the number of unique words in your corpus. Entry `(n, m)` is how many times word `m` appears in document `n`.

In [53]:
vectorizer.get_feature_names_out()

array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)

In [54]:
print(X.toarray())

[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


A more interpretable view...

In [55]:
print(corpus[0])
dict(zip(vectorizer.get_feature_names_out(),X.toarray()[0]))

This is the first document.


{'and': 0,
 'document': 1,
 'first': 1,
 'is': 1,
 'one': 0,
 'second': 0,
 'the': 1,
 'third': 0,
 'this': 1}

In [56]:
print(corpus[1])
dict(zip(vectorizer.get_feature_names_out(),X.toarray()[1]))

This document is the second document.


{'and': 0,
 'document': 2,
 'first': 0,
 'is': 1,
 'one': 0,
 'second': 1,
 'the': 1,
 'third': 0,
 'this': 1}

Now, if you want to vectorize new data (e.g. test data), then you use the `.transform` function. If the vectorizer encounters a word it hasn't seen before, it will simply ignore it.

In [57]:
vectorizer.transform(["This is the coolest document"]).toarray()

array([[0, 1, 0, 1, 0, 0, 1, 0, 1]])
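
To make the mechanics concrete, here is a minimal standard-library sketch of roughly what the vectorizer is doing. The helper names (`tokenize`, `fit_vocabulary`, `to_counts`) are made up for illustration, and the tokenization rule (lowercase, tokens of two or more word characters) is an approximation of the `CountVectorizer` defaults rather than its actual implementation.

```python
import re
from collections import Counter

# Tokens of 2+ word characters, lowercased (approximates the CountVectorizer defaults)
TOKEN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN.findall(text.lower())

def fit_vocabulary(docs):
    """Sorted list of the unique tokens seen in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def to_counts(doc, vocabulary):
    """Count vector for one document; tokens outside the vocabulary are dropped."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]
vocab = fit_vocabulary(corpus)                           # plays the role of fit
matrix = [to_counts(doc, vocab) for doc in corpus]       # plays the role of transform
new_row = to_counts("This is the coolest document", vocab)

print(vocab)
print(matrix[1])
print(new_row)   # "coolest" was never seen during fitting, so it is ignored
```

The vocabulary is fixed at fit time, which is why unseen words in new text have no column to land in.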

## 1.3) Building and Running the Model (Group)

Now that you have all the required tools, build a **Naive Bayes Classifier** and evaluate it on a train and test set. In this instance, use the [Multinomial Naive Bayes](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) classifier, which is best suited to discrete features such as frequency counts (e.g. a bag of words vector).
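
If it helps to see what the classifier does under the hood, the following is a from-scratch sketch of multinomial Naive Bayes on a tiny, hypothetical toy corpus. The function names and the data are invented for illustration; `alpha=1.0` is add-one (Laplace) smoothing, which matches the default `alpha` of `MultinomialNB`, but this is not scikit-learn's code.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs, labels, alpha=1.0):
    """Return (log_prior, log_likelihood) for already-tokenized documents."""
    vocabulary = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)

    # log P(class): fraction of training documents in each class
    log_prior = {c: math.log(n / len(labels)) for c, n in class_docs.items()}

    # log P(word | class): smoothed share of the class's word tokens
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood

def predict(doc, log_prior, log_likelihood):
    """Pick the class with the highest log P(class) + sum of log P(word | class)."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Hypothetical toy data, not taken from the Enron corpus
train_docs = [
    ["free", "pills", "free"],
    ["cheap", "pills"],
    ["meeting", "schedule"],
    ["gas", "schedule", "meeting"],
]
train_labels = ["spam", "spam", "ham", "ham"]

nb_model = train_multinomial_nb(train_docs, train_labels)
print(predict(["free", "pills", "now"], *nb_model))
print(predict(["schedule", "gas"], *nb_model))
```

The "naive" part is the sum inside `predict`: each word contributes independently, given the class.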

In [58]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [59]:
# Create training and test splits - 20% split
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=random_seed)

In [60]:
# Vectorize on your training data using BoW
vectorizer = CountVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)

In [62]:
# Fit the classifier below
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)

In [63]:
# Vectorize your test data using transform and then predict the test data
X_test_vec = vectorizer.transform(X_test)
y_pred = clf.predict(X_test_vec)

In [66]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [67]:
# Print a confusion matrix using confusion_matrix
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Confusion Matrix:
 [[728   6]
 [ 14 287]]


In [69]:
# Print a classification report using classification_report
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Classification Report:
               precision    recall  f1-score   support

         ham       0.98      0.99      0.99       734
        spam       0.98      0.95      0.97       301

    accuracy                           0.98      1035
   macro avg       0.98      0.97      0.98      1035
weighted avg       0.98      0.98      0.98      1035
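
As a sanity check, the numbers in the report can be recomputed by hand from the confusion matrix printed earlier. The sketch below assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, classes in sorted order `ham`, `spam`); the `report` dictionary is just an illustrative container.

```python
# Rows are true labels, columns are predicted labels, in the order [ham, spam].
cm = [[728, 6],
      [14, 287]]
labels = ["ham", "spam"]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

report = {}
for i, name in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_class = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actually_class = sum(cm[i])                                     # row sum (the "support")
    precision = true_positives / predicted_as_class
    recall = true_positives / actually_class
    f1 = 2 * precision * recall / (precision + recall)
    report[name] = (precision, recall, f1)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

print(f"accuracy={accuracy:.2f}")
```

Reading it this way makes the asymmetry visible: 14 spam emails slipped through as ham, while only 6 legitimate emails were flagged as spam.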



In [70]:
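# Code from Lori K: the same pipeline, trained on the numeric labels (0 = ham, 1 = spam).
# Note that this refits `vectorizer` and `clf` without stop-word removal;
# the cells below use these refitted objects.

train, test = train_test_split(df, test_size=0.2, random_state=random_seed)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train.text)
clf = MultinomialNB()
clf.fit(X,train.label_num)
test_vecs = vectorizer.transform(test.text)
predictions = clf.predict(test_vecs)
confusion_matrix(test.label_num,predictions)  # not displayed: only a cell's last expression is echoed
print(classification_report(test.label_num,predictions))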

              precision    recall  f1-score   support

           0       0.99      0.99      0.99       734
           1       0.98      0.96      0.97       301

    accuracy                           0.98      1035
   macro avg       0.98      0.98      0.98      1035
weighted avg       0.98      0.98      0.98      1035



## 1.4) Exploring Important Words (Group)

Before you start, predict what words might be more predictive of SPAM or HAM. Make a list below of 5 words you think will be very _predictive_ of an email being SPAM, and 5 words that are predictive of being HAM. Remember this is an office email database from Enron in the late 1990s and early 2000s.

Words that will predict SPAM (junk emails):

1. x
2. x
3. x
4. x
5. x

Words that will predict HAM (real emails):

1. x
2. x
3. x
4. x
5. x

**Technical Note: Log Probabilities**: 

When using probabilistic methods with large datasets, you often get features with extremely small probabilities (e.g. $10^{-10}$).

This becomes a problem because multiplying many such numbers together quickly underflows floating-point precision. Therefore, in most systems, operations are done on the *log* of the probabilities.

This makes calculations much more manageable (e.g. $\log_{10}(10^{-10})=-10$). As an added bonus, due to log rules ($\log(ab)=\log(a)+\log(b)$), all multiplications turn into additions, which are cheaper and numerically more stable.

Some general rules of thumb: **the closer to zero a log prob is, the more probable it is**, and **each time a base-10 log prob decreases by one, it's an order of magnitude less probable**. scikit-learn stores natural logs, so there a decrease of one means roughly $e \approx 2.7$ times less probable.

`feature_log_prob_` gives us the log probabilities for each word. In notation, each of these is $P(word | class)$.
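
A quick way to see why logs are needed is to multiply many small probabilities directly and compare that with summing their logs. The numbers below are hypothetical (400 words, each with probability $10^{-10}$), chosen only to trigger the effect.

```python
import math

probs = [1e-10] * 400   # hypothetical per-word probabilities

product = 1.0
for p in probs:
    product *= p        # the running product underflows to exactly 0.0

log_sum = sum(math.log(p) for p in probs)   # natural log, as scikit-learn uses

print(product)
print(log_sum)
```

The direct product carries no information once it hits zero, while the sum of logs is still a perfectly ordinary number that can be compared between classes.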

In [71]:
# Given that a message is ham, how probable is it for the words to show up?
clf.feature_log_prob_[0]

array([ -5.75023447,  -5.71398455, -11.28296498, ..., -13.07472445,
       -13.07472445, -13.07472445])

In [72]:
# Given that a message is SPAM, how probable is it for the words to show up?
clf.feature_log_prob_[1]

array([ -6.23363088,  -7.07284789,  -9.8691907 , ..., -11.74099287,
       -11.74099287, -11.74099287])

This code will sort all the words by log probability, so that all of the most probable words show up first...

In [73]:
spam_args = np.argsort(clf.feature_log_prob_[1])
spam_words = np.array(vectorizer.get_feature_names_out())[spam_args]
spam_words = np.flip(spam_words)
print(spam_words)


ham_args = np.argsort(clf.feature_log_prob_[0])
ham_words = np.array(vectorizer.get_feature_names_out())[ham_args]
ham_words = np.flip(ham_words)
print(ham_words)

['the' 'to' 'and' ... 'brewer' 'breveffo' 'cima']
['the' 'to' 'ect' ... 'luge' 'lugging' 'ikoogybhmxdc']


In [74]:
spam_words[0:10]

array(['the', 'to', 'and', 'of', 'in', 'you', 'for', 'this', 'is', 'your'],
      dtype=object)

In [75]:
ham_words[0:10]

array(['the', 'to', 'ect', 'for', 'and', 'hou', 'enron', 'subject', 'on',
       'of'], dtype=object)

However, a more useful way to look at the data is to look at the *ratios* of the probabilities for a given word. For example, if we have the word "free":

*If an email is spam, there is a 50% probability it will contain the word "free"*

$P(free|spam)=0.5$

*If an email is ham, there is a 10% probability it will contain the word "free"*

$P(free|ham)=0.1$

*The ratio*

$P(free|spam)/P(free|ham)=5$

This means that the word *free* is 5x as likely to show up in spam messages as in ham messages. So, we can use this to calculate and sort for words that are proportionally more present in spam emails.
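
In log space the same ratio becomes a difference. The short sketch below reuses the illustrative numbers for "free" and adds a second, made-up word ("enron" with invented probabilities) to show what a ham-leaning word looks like.

```python
import math

# Illustrative probabilities from the text above
p_free_given_spam = 0.5
p_free_given_ham = 0.1

log_ratio_free = math.log(p_free_given_spam) - math.log(p_free_given_ham)
ratio_free = math.exp(log_ratio_free)   # undo the log to recover the plain ratio

# Hypothetical numbers for a ham-leaning word
log_ratio_enron = math.log(0.001) - math.log(0.05)

print(ratio_free)
print(log_ratio_free > 0)    # positive log ratio: leans spam
print(log_ratio_enron < 0)   # negative log ratio: leans ham
```

Sorting words by this difference therefore puts the most spam-leaning words at one end and the most ham-leaning words at the other.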

In [82]:
# Since we're operating on logs, division turns into subtraction
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
spam_ham_args = np.argsort(log_odds)
spam_ham_words = np.array(vectorizer.get_feature_names_out())[spam_ham_args]
spam_ham_words = np.flip(spam_ham_words)

Here are some of the "spammiest" words...

In [83]:
top_x=200
spam_ham_words[0:top_x]

array(['td', 'nbsp', 'pills', 'width', 'computron', 'br', 'font', 'href',
       'viagra', 'height', 'xp', 'src', '2004', 'cialis', 'soft', 'meds',
       'paliourg', 'php', 'voip', 'drugs', 'oo', 'valign', 'bgcolor',
       'biz', 'hotlist', 'moopid', 'div', 'photoshop', 'mx', 'img',
       'knle', 'pharmacy', 'gr', 'intel', 'corel', 'prescription', 'iit',
       'demokritos', 'rolex', 'xanax', 'macromedia', 'dealer',
       'uncertainties', 'valium', 'htmlimg', 'darial', '000000',
       '0310041', 'lots', 'projections', 'jebel', 'adobe', 'rnd', 'color',
       'alt', '161', 'colspan', 'pain', 'readers', 'rx', 'canon',
       'export', 'draw', 'fontfont', 'gra', 'speculative', '1226030',
       'gold', 'pro', 'logos', 'wi', 'toshiba', 'china', '1933', 'spam',
       'vicodin', 'itoy', 'viewsonic', 'ooking', '1618', 'cellpadding',
       'weight', 'hewlett', '4176', 'pill', 'robotics', 'soma',
       'resellers', '8834464', '8834454', 'apc', 'intellinet', 'aopen',
       'iomega', 'en

Note that words like `td`, `nbsp` and `br` all come from HTML markup (table cells, non-breaking spaces and line breaks, respectively). This suggests that SPAM is more likely to have fancy HTML formatting than HAM.

Reverse the list, and now we have the "hammiest" words... (words most indicative of a legitimate email)

In [84]:
np.flip(spam_ham_words)[0:top_x]

array(['enron', 'ect', 'meter', 'hpl', 'daren', 'mmbtu', 'xls', 'pec',
       'sitara', 'hou', 'volumes', 'ena', 'forwarded', 'melissa',
       'tenaska', 'teco', 'nom', '2001', 'pat', 'aimee', 'actuals',
       'noms', 'hsc', 'susan', 'cotten', 'chokshi', 'nomination', 'fyi',
       'pipeline', 'wellhead', 'eastrans', 'clynes', 'hplc', '713',
       'counterparty', 'pefs', 'bob', 'nominations', 'cec', 'gcs',
       'lannou', 'txu', 'farmer', 'hplno', 'rita', 'weissman', 'cc',
       'equistar', 'enronxgate', 'iferc', 'scheduled', 'spreadsheet',
       'wynne', 'allocated', 'entex', 'path', 'buyback', 'fuels', 'hplo',
       'lisa', 'scheduling', 'pops', 'anita', 'calpine', 'gco', 'darren',
       'clem', 'steve', 'aep', 'katy', 'tu', 'flowed', 'follows',
       'sherlyn', 'donna', 'lloyd', 'midcon', 'pm', 'redeliveries',
       'jackie', 'gary', 'vance', 'papayoti', 'meters', 'cornhusker',
       'luong', 'howard', 'pg', 'lsk', 'revision', 'julie', 'utilities',
       '281', 'bryan', 

In [85]:
# Given that a message is ham, how probable is it for the words to show up?
print("Ham words log probabilities:\n", clf.feature_log_prob_[0])

# Given that a message is spam, how probable is it for the words to show up?
print("\nSpam words log probabilities:\n", clf.feature_log_prob_[1])

# Sort words by log probability for spam and ham classes
spam_args = np.argsort(clf.feature_log_prob_[1])
spam_words = np.array(vectorizer.get_feature_names_out())[spam_args]
spam_words = np.flip(spam_words)

ham_args = np.argsort(clf.feature_log_prob_[0])
ham_words = np.array(vectorizer.get_feature_names_out())[ham_args]
ham_words = np.flip(ham_words)

# Print the top 10 spam and ham words
print("\nTop 10 spam words:\n", spam_words[:10])
print("\nTop 10 ham words:\n", ham_words[:10])

# Calculate the log odds and sort words based on it
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
spam_ham_args = np.argsort(log_odds)
spam_ham_words = np.array(vectorizer.get_feature_names_out())[spam_ham_args]
spam_ham_words = np.flip(spam_ham_words)

top_x = 200
print("\nTop {} spammiest words:\n".format(top_x), spam_ham_words[:top_x])

# Reverse the list to get the "hammiest" words
print("\nTop {} hammiest words:\n".format(top_x), np.flip(spam_ham_words)[:top_x])


Ham words log probabilities:
 [ -5.75023447  -5.71398455 -11.28296498 ... -13.07472445 -13.07472445
 -13.07472445]

Spam words log probabilities:
 [ -6.23363088  -7.07284789  -9.8691907  ... -11.74099287 -11.74099287
 -11.74099287]

Top 10 spam words:
 ['the' 'to' 'and' 'of' 'in' 'you' 'for' 'this' 'is' 'your']

Top 10 ham words:
 ['the' 'to' 'ect' 'for' 'and' 'hou' 'enron' 'subject' 'on' 'of']

Top 200 spammiest words:
 ['td' 'nbsp' 'pills' 'width' 'computron' 'br' 'font' 'href' 'viagra'
 'height' 'xp' 'src' '2004' 'cialis' 'soft' 'meds' 'paliourg' 'php' 'voip'
 'drugs' 'oo' 'valign' 'bgcolor' 'biz' 'hotlist' 'moopid' 'div'
 'photoshop' 'mx' 'img' 'knle' 'pharmacy' 'gr' 'intel' 'corel'
 'prescription' 'iit' 'demokritos' 'rolex' 'xanax' 'macromedia' 'dealer'
 'uncertainties' 'valium' 'htmlimg' 'darial' '000000' '0310041' 'lots'
 'projections' 'jebel' 'adobe' 'rnd' 'color' 'alt' '161' 'colspan' 'pain'
 'readers' 'rx' 'canon' 'export' 'draw' 'fontfont' 'gra' 'speculative'
 '1226030' 'gol

In [88]:
# from sklearn.manifold import TSNE

# # Combine spam and ham words
# top_words = np.concatenate((spam_words[:200], ham_words[:200]))

# # Get the feature vectors for the top words
# top_word_vectors = np.array([vectorizer.transform([word]).toarray()[0] for word in top_words])

# # Apply t-SNE to project the word vectors into 2D
# tsne = TSNE(n_components=2, random_state=random_seed)
# word_vectors_2d = tsne.fit_transform(top_word_vectors)
# import matplotlib.pyplot as plt

# # Create a scatter plot of the t-SNE projected word vectors
# plt.figure(figsize=(12, 12))

# # Plot spam words in red
# plt.scatter(word_vectors_2d[:200, 0], word_vectors_2d[:200, 1], c='red', label='Spam')

# # Plot ham words in blue
# plt.scatter(word_vectors_2d[200:, 0], word_vectors_2d[200:, 1], c='blue', label='Ham')

# # Set plot title and axis labels
# plt.title('t-SNE Clustering of Top Spam and Ham Words')
# plt.xlabel('t-SNE Dimension 1')
# plt.ylabel('t-SNE Dimension 2')

# # Add a legend
# plt.legend()

# # Show the plot
# plt.show()


**Look at the words that best distinguish SPAM and HAM:**
1. How many of your words showed up in the SPAM and HAM top 200 predictive words?
2. Are they what you would have expected?
3. Based on this, what can you say about the differences between how people make predictions and how ML algorithms make predictions?
4. Does this make you more confident or less confident in ML predictions?


**Discuss here**

# 2) Bayesian Networks

In this problem, we'll be using the `ASIA` dataset, which showcases the relationships between travel, smoking, etc. and the probability of having various conditions. We will be using the [pomegranate](https://pomegranate.readthedocs.io/en/latest/) library to handle the heavy lifting for Bayes Nets.

![](./asia_data.png)

## 2.1) Exploring Bayes Nets (Group)

First go here: https://www.bayesserver.com/examples/networks/asia

Try checking different boxes and seeing how the model updates. When you check a box, you're "given" a specific value for that node. For example, checking "True" for "Visit to Asia" means we know the patient has visited Asia. All the other probabilities are then recomputed "given" that visit.

Now answer the following questions. For each one, first make a **prediction** about how the model will change, and then try it to see if you're right. Write down your prediction and then the actual answer. If your prediction differs from the actual answer, try to discuss why.

**After each question, uncheck all boxes.**

1. If you set the value of Visit to Asia, which nodes will update?
2. If you set the value of XRay Result, which nodes will update?
3. First set the value for Has Tuberculosis. If you then set the value for Visit to Asia, which nodes will update?
4. First set the value for Tuberculosis or Cancer. If you then set the value for Dyspnea, which nodes will update?
5. If you check the box for Has Tuberculosis, will Has Lung Cancer update?
6. First set the value for Tuberculosis or Cancer. Now if you check the box for Has Tuberculosis, will Has Lung Cancer update?

If you wish to understand more about conditional independence and D-Separation, go here:  https://www.youtube.com/watch?v=_R_RYn5KelA


**Discuss Here**

## 2.2) Building the Bayes Net (Follow)

Just like in class, we can define the initial structure and conditional probability tables for the Bayes Net using our expert knowledge of the scenario (in this case, given to us by experts).

For example, the first table gives the probability of having TB given that you have (or have not) visited Asia.

| Asia | HasTB | P(HasTB\|Asia) |
| ---- | ----- | ------------- |
| T | T | 0.05 |
| T | F | 0.95 |
| F | T | 0.01 |
| F | F | 0.99 |

In [None]:
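# Note: this notebook relies on the pre-1.0 pomegranate API
# (DiscreteDistribution, ConditionalProbabilityTable, Node, bake, ...)
from pomegranate import *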

In [None]:
# First, we define our top level nodes with their base probabilities.

visit_to_asia = DiscreteDistribution({'T':0.01, 'F':0.99})
smoke = DiscreteDistribution({'T':0.5, 'F':0.5})

# Now, we have to fill in all of the conditional probability tables for the other nodes

has_tb = ConditionalProbabilityTable(
    [
        #Asia? #HasTB #Probability
        ["T","T",0.05],
        ["T","F",0.95],
        
        ["F","T",0.01],
        ["F","F",0.99],
    ], [visit_to_asia])


has_lung_cancer = ConditionalProbabilityTable(
    [
        #Smoke? #HasLungCancer #Probability
        ["T","T",0.1],
        ["T","F",0.9],
        
        ["F","T",0.01],
        ["F","F",0.99]
    ], [smoke])


has_bc = ConditionalProbabilityTable(
    [
        #Smoke? #HasBronchitis #Probability
        ["T","T",0.6],
        ["T","F",0.4],
        
        ["F","T",0.3],
        ["F","F",0.7]
    ], [smoke])

tb_or_cancer = ConditionalProbabilityTable(
    [
        #Lung? #TB? #TBorCancer #Probability
        ["T","T","T",1],
        ["T","T","F",0],
        
        ["T","F","T",1],
        ["T","F","F",0],
        
        ["F","T","T",1],
        ["F","T","F",0],
        
        ["F","F","T",0],
        ["F","F","F",1]
    ], [has_lung_cancer,has_tb])

x_ray_abnormal = ConditionalProbabilityTable(
    [
        #TBorCancer? #XRayAbnormal #Probability
        ["T","T",0.98],
        ["T","F",0.02],
        
        ["F","T",0.05],
        ["F","F",0.95]
    ], [tb_or_cancer])

dyspnea = ConditionalProbabilityTable(
    [
        #Bronchitis? #TBorCancer? #Dyspnea #Probability
        ["T","T","T",0.9],
        ["T","T","F",0.1],
        
        ["T","F","T",0.8],
        ["T","F","F",0.2],
        
        ["F","T","T",0.7],
        ["F","T","F",0.3],
        
        ["F","F","T",0.1],
        ["F","F","F",0.9]
    ], [has_bc, tb_or_cancer])
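
To connect these tables back to Bayes' rule, here is a small worked sketch that uses only the `smoke` prior and the lung cancer table defined above. The dictionary names are invented for illustration and no library is needed.

```python
# Numbers copied from the smoke prior and the lung cancer table above
p_smoke = {"T": 0.5, "F": 0.5}
p_lung_given_smoke = {"T": 0.1, "F": 0.01}   # P(lung cancer = T | smoke)

# Law of total probability: P(lung) = sum over s of P(lung | s) * P(s)
p_lung = sum(p_lung_given_smoke[s] * p_smoke[s] for s in p_smoke)

# Bayes' rule: P(smoke = T | lung = T) = P(lung = T | smoke = T) * P(smoke = T) / P(lung = T)
p_smoke_given_lung = p_lung_given_smoke["T"] * p_smoke["T"] / p_lung

print(f"P(lung cancer) = {p_lung:.3f}")
print(f"P(smoker | lung cancer) = {p_smoke_given_lung:.3f}")
```

Even though only half of the population smokes, learning that a patient has lung cancer shifts the belief that they smoke sharply upward. A Bayes net automates this kind of update across many variables at once.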


In [None]:
# Next we have to create all the nodes
asia_node = Node(visit_to_asia, name="asia")
tb_node = Node(has_tb, name="tb")
smoke_node = Node(smoke, name="smoke")
lung_node = Node(has_lung_cancer, name="lung")
bronc_node = Node(has_bc, name="bc")
either_node = Node(tb_or_cancer, name="either")
xray_node = Node(x_ray_abnormal,name="xray")
dysp_node = Node(dyspnea, name="dysp")

In [None]:
# Now we init the model
model = BayesianNetwork("ASIA")
model.add_states(asia_node,
                 tb_node,
                 smoke_node,
                 lung_node,
                 bronc_node,
                 either_node,
                 xray_node,
                 dysp_node)

# Add all of the correct edges 
model.add_edge(asia_node, tb_node)

model.add_edge(smoke_node, bronc_node)
model.add_edge(smoke_node, lung_node)

model.add_edge(tb_node,either_node)
model.add_edge(lung_node,either_node)

model.add_edge(either_node, xray_node)
model.add_edge(either_node, dysp_node)

model.add_edge(bronc_node, dysp_node)

# And then commit our changes
model.bake()

In [None]:
# Helper function to print the model structure
def print_model_structure(model, features):
    for i in range(len(features)):
        parents = [features[pi] for pi in model.structure[i]]
        print(f'Node "{features[i]}" has parents: {parents}')

In [None]:
# We'll keep our features in this order for consistency
features = [
    "Visit to Asia",
    "Has TB",
    "Smoker",
    "Has Lung Cancer",
    "Has Bronchitis",
    "TB or Cancer",
    "XRay Abnormal",
    "Dyspnea"
]

In [None]:
# Let's make sure the structure of our newly created model is correct
print_model_structure(model, features)
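
If you want something to compare that printout against, the expected parents can be derived directly from the edges that were added to the model. This sketch needs no library; `edges` simply restates the `add_edge` calls using the display names, and `expected_parents` is an illustrative name.

```python
features = [
    "Visit to Asia", "Has TB", "Smoker", "Has Lung Cancer",
    "Has Bronchitis", "TB or Cancer", "XRay Abnormal", "Dyspnea",
]

# (parent, child) pairs, mirroring the add_edge calls above
edges = [
    ("Visit to Asia", "Has TB"),
    ("Smoker", "Has Bronchitis"),
    ("Smoker", "Has Lung Cancer"),
    ("Has TB", "TB or Cancer"),
    ("Has Lung Cancer", "TB or Cancer"),
    ("TB or Cancer", "XRay Abnormal"),
    ("TB or Cancer", "Dyspnea"),
    ("Has Bronchitis", "Dyspnea"),
]

expected_parents = {name: [] for name in features}
for parent, child in edges:
    expected_parents[child].append(parent)

for name in features:
    print(f'Node "{name}" has parents: {sorted(expected_parents[name])}')
```

The order in which a library lists a node's parents may differ, so compare the sets of parents rather than their order.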

## 2.3) Predictions (Group)

`predict` allows us to do inference from partial observations. For every value passed as `None`, it fills in the most likely value given the values we did provide.

For example, let's say we have a patient who has an abnormal X-ray, but is not a smoker and hasn't visited Asia. We can then infer the most likely values for all of the other variables.

In [None]:
model.predict([
    ["F",None,"F",None,None,None,"T",None]
])

In [None]:
# Now let's say they *were* a smoker. See how it changes?
model.predict([
    ["F",None,"T",None,None,None,"T",None]
])

We can get a little more detail and check out the actual probabilities.

In [None]:
def pretty_results(results):
    for i,dist in enumerate(results):
        print(features[i])
        if isinstance(dist,str):
            print(dist)
        else:
            print(dist.parameters)

In [None]:
res = model.predict_proba([
    ["F",None,"F",None,None,None,"T",None]
])
pretty_results(res[0])
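
For a network this small, the same kind of posterior can be computed by brute force, which is a useful cross-check. The sketch below is a plain-Python inference-by-enumeration routine with the conditional probabilities copied from section 2.2; the names `joint`, `posterior` and `bern` are invented here. It is exact, whereas the library may use a different and possibly approximate inference algorithm, so small differences from `predict_proba` are plausible.

```python
from itertools import product

NAMES = ["asia", "tb", "smoke", "lung", "bc", "either", "xray", "dysp"]

# P(dyspnea = T | bronchitis, tb_or_cancer), copied from the table in 2.2
DYSP_TABLE = {(True, True): 0.9, (True, False): 0.8,
              (False, True): 0.7, (False, False): 0.1}

def bern(p_true, value):
    """Probability of `value` for a binary variable with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(asia, tb, smoke, lung, bc, either, xray, dysp):
    """Joint probability of one full assignment: the product of all table entries."""
    p = bern(0.01, asia) * bern(0.5, smoke)
    p *= bern(0.05 if asia else 0.01, tb)
    p *= bern(0.1 if smoke else 0.01, lung)
    p *= bern(0.6 if smoke else 0.3, bc)
    p *= bern(1.0 if (lung or tb) else 0.0, either)
    p *= bern(0.98 if either else 0.05, xray)
    p *= bern(DYSP_TABLE[(bc, either)], dysp)
    return p

def posterior(query, evidence):
    """P(query = True | evidence), summing the joint over all 2**8 assignments."""
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this assignment contradicts the evidence
        p = joint(*values)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator

# Same patient as above: no Asia visit, non-smoker, abnormal X-ray
p_lung = posterior("lung", {"asia": False, "smoke": False, "xray": True})
print(f"P(lung cancer | evidence) = {p_lung:.3f}")
```

Enumeration scales as $2^n$ in the number of binary nodes, which is fine for 8 nodes and hopeless for large networks. That is the reason dedicated inference algorithms exist.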

Use the Bayes net to calculate the following probabilities:

1. $P(xray=true | TBorCancer=true)$
2. $P(xray=true | TBorCancer=true, TB=true)$
3. $P(TB=true)$
4. $P(TB=true | smoke=false)$
5. $P(TB=true | smoke=false, TBorCancer=true)$

What values are equivalent? Why?

**Write your answers here:**

1.
2.
3.
4.
5.

What values are equivalent? Why?

And write code below to help you.

In [None]:
# As a reminder, here are the indices of the features
features

In [None]:
# You can modify this code to help you
pretty_results(model.predict_proba([
    [None,None,'F',None,None,'T',None,None]
])[0])

## 2.4) Evaluating Bayes Nets (Follow)

Let's see how well this net does at inference from data. We're going to blank out the Bronchitis column in a sample of this dataset, and see if our net can predict what the missing values should be.

In [None]:
# Some data we will use to generate our probabilities
asia_data = pd.read_csv("Asia10k.csv")
asia_data.shape

In [None]:
asia_data.head()

In [None]:
# Let's make sure we're consistent with our labels
asia_data = asia_data.replace("no", "F").replace("yes", "T")
asia_data.head()

In [None]:
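values = asia_data.values.copy()
# Sample 1000 rows (with replacement) to evaluate on
indices = np.random.choice(asia_data.index, 1000)
values = values[indices]
# Hide column 4 ("Has Bronchitis") so the model has to infer it
values[:,4] = None
values[1]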

In [None]:
predictions = model.predict(values)

In [None]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [None]:
confusion_matrix(asia_data.values[indices,4],np.array(predictions)[:,4])

In [None]:
print(classification_report(asia_data.values[indices,4],np.array(predictions)[:,4]))

## 2.5) Fitting a Bayes Net to Data (Group)

In many applications, you may have a general idea of the structure of the Bayes Net, but do not have a list of probabilities. Luckily, given some data, we can fill out the probabilities in a given net. **Note:** You may only get similar results to the previous method, since it turns out this data was *simulated* from the given conditional probabilities. So, one would expect that the model would learn parameters like the ones we've given.
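
The simplest way to fill in a conditional probability table from data is to count: for each parent value, take the fraction of rows showing each child value. The sketch below does this for a single edge on a tiny, hypothetical sample. It illustrates the maximum-likelihood idea only and is not a description of pomegranate's internals.

```python
from collections import Counter

# Hypothetical (smoker, bronchitis) observations, not taken from Asia10k.csv
rows = [
    ("T", "T"), ("T", "T"), ("T", "F"), ("T", "T"),
    ("F", "T"), ("F", "F"), ("F", "F"), ("F", "F"),
]

pair_counts = Counter(rows)                            # how often each (parent, child) pair occurs
parent_counts = Counter(parent for parent, _ in rows)  # how often each parent value occurs

# Estimated P(child | parent) = count(parent, child) / count(parent)
cpt = {
    (parent, child): pair_counts[(parent, child)] / parent_counts[parent]
    for parent in ("T", "F")
    for child in ("T", "F")
}

for (parent, child), prob in cpt.items():
    print(f"P(bronchitis={child} | smoker={parent}) = {prob:.2f}")
```

With only 8 rows these estimates are noisy. With 10,000 simulated rows the counts should land close to the probabilities the data was generated from.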

In [None]:
fitted_model = model.fit(asia_data)

In [None]:
# Helper function to print the probability distributions
def print_distributions(model):
    # Use the model passed in (not a global), one node at a time
    for i, state in enumerate(model.states):
        print(features[i])
        params = state.distribution.parameters[0]
        if len(params)>1:
            for row in params:
                print(row)
        print()

In [None]:
print_distributions(fitted_model)

**Take a look at the learned probability distributions. Are they similar to "expert" ones given in the previous problem?**

**Discuss Here**

Now, perform the same evaluation that you did in the previous problem. 

In [None]:
# Remove some column other than Bronchitis

In [None]:
# Make the predictions

In [None]:
# Confusion Matrix

In [None]:
# Classification report

How well did it perform?

## 2.6) Learning Structure from Data (Group)

Now, the most interesting problem is when we only have data, but we don't know the structure of the data (however, we still have a reason to believe that *it can be represented as a Bayes Net*). Luckily, pomegranate has the ability to solve this problem as well. Given a dataset, we can use `from_samples` to build a Bayes net, structure and all, from the data.
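
One common family of approaches is score-based: propose a structure, measure how well it explains the data, and penalize it for complexity. The sketch below compares just two candidate structures for two binary variables ("no edge" versus "A -> B") with a BIC-style score on hypothetical data. It is only meant to convey the idea of trading fit against complexity. It is not pomegranate's algorithm, and all function names are invented.

```python
import math
from collections import Counter

def log_likelihood_independent(rows):
    """Log-likelihood of the data if A and B have no edge between them."""
    n = len(rows)
    a_counts = Counter(a for a, _ in rows)
    b_counts = Counter(b for _, b in rows)
    return sum(math.log(a_counts[a] / n) + math.log(b_counts[b] / n) for a, b in rows)

def log_likelihood_edge(rows):
    """Log-likelihood of the data under the structure A -> B."""
    n = len(rows)
    a_counts = Counter(a for a, _ in rows)
    ab_counts = Counter(rows)
    return sum(
        math.log(a_counts[a] / n) + math.log(ab_counts[(a, b)] / a_counts[a])
        for a, b in rows
    )

def penalized_score(log_likelihood, n_params, n_rows):
    """BIC-style score: reward fit, penalize the number of free parameters."""
    return log_likelihood - 0.5 * n_params * math.log(n_rows)

def best_structure(rows):
    # No edge: P(A), P(B) -> 2 free parameters. A -> B: P(A), P(B|A=T), P(B|A=F) -> 3.
    scores = {
        "independent": penalized_score(log_likelihood_independent(rows), 2, len(rows)),
        "A -> B": penalized_score(log_likelihood_edge(rows), 3, len(rows)),
    }
    return max(scores, key=scores.get), scores

# Hypothetical data where B mostly follows A
dependent_rows = [("T", "T")] * 40 + [("T", "F")] * 10 + [("F", "T")] * 10 + [("F", "F")] * 40
# Hypothetical data where B ignores A
independent_rows = [("T", "T")] * 25 + [("T", "F")] * 25 + [("F", "T")] * 25 + [("F", "F")] * 25

print(best_structure(dependent_rows)[0])
print(best_structure(independent_rows)[0])
```

When the extra edge does not improve the fit, the penalty makes the simpler structure win. This is one reason a learned network can drop weak edges that an expert would draw.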

In [None]:
learned_model = BayesianNetwork.from_samples(asia_data, algorithm='exact')

In [None]:
print_model_structure(learned_model, features)

**Compare the model structure from experts vs learned from the data:**

1. Draw out both models (the one you made earlier and the one learned) on a piece of paper.
2. What are the differences you observe?
3. Why might a model learned from the data have a different structure? Are some influences (edges) in the model more or less important than others?

**Discuss Here**

## 2.7) BYOB: Build Your Own Bayes Net (Group, if time permits)
If you find yourself with some extra time after this portion, consider **building a Bayes net that represents something in your daily life**. It could be the effect of traffic on a morning commute, deciding what to do for dinner, etc. It can be very small, probably only 3-5 nodes (conditional probability tables are a pain!). Then play around with predictions and probabilities to see how various factors impact your model.

In [None]:
# Start building!