# Tutorial for Introduction to ML Lecture

version 0.1, September 2023

Bryan Scott, CIERA/Northwestern

## Problem 1: Bayes Classifiers

A good starting point for machine learning is the Bayes classifier. The basic idea is to assign the most probable label to each data point using Bayes' theorem:

$$
p(y | x_1, ..., x_n) \propto p(y)p(x_1, ..., x_n | y)
$$

where $y$ is the label for a data point and $x_1, ..., x_n$ are the features of the data that we want to use to classify it. A $\textit{Naive Bayes}$ classifier makes the simplifying assumption that gives it its name: it assumes that the features are conditionally independent given the class, $p(x_1, ..., x_n | y) = p(x_1|y)... p(x_n | y)$. That is, once the class is known, the probability of observing any individual feature doesn't depend on any of the other features. Our task is to construct this classifier from a set of previously observed examples and then apply it to new data.
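
To make the factorization concrete, here is a minimal sketch of how a posterior could be computed for a single data point. The Gaussian form of the per-feature likelihoods, the function name `naive_bayes_posterior`, and all numbers are assumptions chosen for illustration; they are not taken from the LSST data.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical toy setup: two classes, two features, Gaussian per-feature likelihoods.
priors = np.array([0.3, 0.7])                 # p(y) for class 1 and class 2
means = np.array([[0.0, 1.0], [2.0, 3.0]])    # means[class, feature]
stdevs = np.array([[1.0, 1.0], [1.0, 1.0]])   # stdevs[class, feature]

def naive_bayes_posterior(x, priors, means, stdevs):
    """Return the normalized p(y | x) under the naive independence assumption."""
    # The sum of per-feature log likelihoods is the log of their product.
    log_like = norm.logpdf(x, loc=means, scale=stdevs).sum(axis=1)
    log_post = np.log(priors) + log_like
    log_post -= log_post.max()                # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

x_new = np.array([1.9, 3.1])                  # lies close to the class 2 means
posterior = naive_bayes_posterior(x_new, priors, means, stdevs)
print(posterior)
```

Working with log probabilities is a design choice rather than a requirement: products of many small likelihoods can underflow, while sums of logs do not.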

### Part 0: Load and split the data

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

In [4]:
lsst_data = pd.read_csv('session_19_extragalactic_subset.csv', index_col=0)

In [5]:
lsst_data[0:1000].to_csv('session_19_DC2_subset.csv', index=False)

In [6]:
lsst_data.shape[0]

1000

### Loading and splitting the data. 

Read in the data, then select the id, flux, and object truth type columns from the LSST data file you've been provided. 

Once you have selected those, randomly split the data into two arrays, one containing 80% of the data, and a second array containing 20% of the data. 

In [22]:
lsst_data = pd.read_csv('session_19_DC2_subset.csv') #path to your data

col_to_classify = ['id','flux_g','flux_i','flux_r','flux_u','flux_y','flux_z','truth_type']

lsst_data_to_classify = lsst_data.loc[:,col_to_classify]
N_data_to_classify = lsst_data_to_classify.shape[0]
random_data = np.sort(np.random.choice(N_data_to_classify, int(N_data_to_classify*0.2), replace=False))

train_data = lsst_data_to_classify.drop(random_data)
test_data = lsst_data_to_classify.loc[random_data].reset_index(drop=True)

In [19]:
print(random_data)

[  9  17  28  42  44  50  51  54  62  64  70  82  92 104 110 112 114 118
 120 132 135 139 140 154 157 159 162 164 165 166 167 179 186 195 197 221
 225 226 228 230 234 236 243 244 247 253 254 255 262 289 292 301 302 307
 308 322 323 329 339 346 363 364 366 372 375 385 387 388 392 394 397 399
 402 411 413 414 417 419 428 429 434 436 454 462 463 472 477 478 479 483
 484 487 495 498 505 510 519 522 523 524 526 529 531 538 540 541 546 549
 557 572 580 582 584 593 594 596 599 601 608 610 614 617 620 629 631 636
 642 650 660 663 669 676 679 682 684 685 687 691 695 708 711 719 726 766
 767 771 773 786 791 793 798 799 800 803 804 806 811 815 819 825 827 830
 849 851 852 853 857 858 867 876 878 881 890 894 897 898 902 904 908 910
 913 915 927 931 933 935 937 944 946 949 955 956 976 978 981 984 988 990
 997 998]


In [23]:
test_data

Unnamed: 0,id,flux_g,flux_i,flux_r,flux_u,flux_y,flux_z,truth_type
0,40969793594,17.83660,190.3410,51.4846,1.75680,422.2170,328.8920,2
1,40969784981,9.22705,73.9039,27.5084,1.01195,140.4130,114.2860,2
2,40969794479,12.76260,221.4810,44.3510,1.59968,817.0840,486.3460,2
3,40969800379,52.11700,902.9370,180.9370,6.54124,3328.5400,1981.8000,2
4,41021116967,33.44870,57.3507,49.3317,12.15540,61.9937,60.5627,2
...,...,...,...,...,...,...,...,...
195,40047153469,3467.59000,27791.0000,10341.6000,380.10700,52816.9000,42984.4000,2
196,40969624595,9.53608,67.9682,28.6144,1.09289,121.3620,100.3890,2
197,40969636266,29.97310,274.1900,89.0342,3.17704,559.9740,446.7240,2
198,40969635236,19.47590,129.9370,60.3894,2.10310,219.0790,184.0700,2


In [24]:
train_data

Unnamed: 0,id,flux_g,flux_i,flux_r,flux_u,flux_y,flux_z,truth_type
0,40749426052,589.69000,7390.7700,1654.2300,56.45420,17941.500,13607.600,2
1,40969792214,46.01050,328.2330,138.1330,5.26928,586.331,484.929,2
2,40749426351,209.17300,2232.3500,603.7970,20.60080,4952.030,3857.390,2
4,40969791100,2729.48000,4716.7300,4090.1800,1051.78000,5011.580,4917.180,2
5,40969810898,9.54055,76.4186,28.4438,1.04629,145.195,118.177,2
...,...,...,...,...,...,...,...,...
995,40969637437,188.09400,1716.6500,557.9820,19.97450,3502.080,2794.960,2
996,40969622675,1429.32000,11451.0000,4261.8200,156.72600,21758.800,17709.300,2
997,40969632587,55.82410,397.1930,167.3400,6.40664,708.629,586.350,2
998,40034381371,70087.20000,640120.0000,208000.0000,7438.53000,1306330.000,1042430.000,2


### Part 1: Estimate Class Frequency in the training set

One of the ingredients in our classifier is $p(y)$, the unconditional class probabilities. 

We can get this by counting the number of rows belonging to each class in train_data and dividing by the length of the training data set. 

In [27]:
def estimate_class_probabilities(data, class_column_name):
    """
    Computes unconditional class probabilities. 
     
    Args:
        data (pd.DataFrame): training data for the classifier
        class_column_name (str): name of the column holding the class labels
 
    Returns:
        floats p1, p2: unconditional probabilities of an element of the training set belonging to class 1 and class 2
    """
    
    p1 = data.loc[data[class_column_name] == 1].shape[0]/data.shape[0]
    p2 = data.loc[data[class_column_name] == 2].shape[0]/data.shape[0]

    print(data.loc[data[class_column_name] == 2].shape[0])
    print(data.loc[data[class_column_name] == 1].shape[0])
    print(data.shape[0])

    return p1, p2

p1, p2 = estimate_class_probabilities(train_data, 'truth_type')

792
0
800


In [26]:
p1, p2

(0.0, 0.99)
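
Note that in the output above the two probabilities do not sum to 1: 792 of the 800 training rows are class 2 and none are class 1, so the remaining rows presumably carry some other `truth_type` value. As a cross-check, the same fractions can be obtained for every class present with `value_counts`. The sketch below uses a small hypothetical table rather than the LSST data.

```python
import pandas as pd

# Hypothetical toy labels standing in for the truth_type column.
toy = pd.DataFrame({"truth_type": [2, 2, 1, 2, 1, 2, 2, 2]})

# Fraction of rows in each class; these sum to 1 over the classes present.
class_probs = toy["truth_type"].value_counts(normalize=True)

# .get falls back to 0.0 for a class that never appears in the data.
p1 = class_probs.get(1, 0.0)
p2 = class_probs.get(2, 0.0)
print(p1, p2)
```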

### Part 2:  Feature Likelihoods

We are assuming that the joint likelihood of the features factorizes over the individual features:

$p(x_1, ..., x_n | y) =  p(x_1|y)... p(x_n | y)$

However, we still need to make an assumption about the functional form of the $p(x_i | y)$. As a simple case, we will assume $p(x_i | y)$ follows a Gaussian distribution given by:

$$
p(x_i | y) = \frac{1}{\sqrt{2 \pi \sigma_y^2}} \exp{\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right)}
$$

and we will make a maximum likelihood estimate of $\mu_y$ and $\sigma_y$ from the data. This means using the empirical estimates $\bar{x}$ and $\hat{\sigma}$, computed separately for each feature and each class, as estimators of the true parameters $\mu_y$ and $\sigma_y$. 

Write a fitting function that takes the log of the fluxes and returns an estimate of the parameters of the per-feature likelihoods for each class.

In [31]:
def per_feature_likelihood_parameters(x_train, label):
    """
    Computes maximum likelihood estimates for the class conditional likelihood. 
     
    Args:
        x_train (array or pd series): training data for the classifier
        label (int): class label whose likelihood parameters are estimated
 
    Returns:
        means, stdevs (array): maximum likelihood estimates of the parameters of the Gaussian conditional probability distributions for a specific class
    """
    
    means = 
    stdevs = 
    
    return means, stdevs
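
One possible way to approach this exercise is sketched below; it is not the only valid solution. The function name `gaussian_mle_parameters`, its extra arguments, the choice of base-10 logarithms, and the toy table are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def gaussian_mle_parameters(df, label, feature_cols, class_col="truth_type"):
    """Per-feature mean and standard deviation of log10 flux for one class."""
    rows = df.loc[df[class_col] == label, feature_cols]
    log_flux = np.log10(rows)                 # assumes strictly positive fluxes
    means = log_flux.mean(axis=0).to_numpy()
    # ddof=0 gives the maximum likelihood estimate (divide by N, not N - 1).
    stdevs = log_flux.std(axis=0, ddof=0).to_numpy()
    return means, stdevs

# Hypothetical toy table with two flux columns and a class column.
toy = pd.DataFrame({
    "flux_g": [10.0, 100.0, 1000.0, 1.0],
    "flux_r": [100.0, 100.0, 100.0, 5.0],
    "truth_type": [2, 2, 2, 1],
})
means, stdevs = gaussian_mle_parameters(toy, 2, ["flux_g", "flux_r"])
print(means, stdevs)
```

In this toy table the `flux_r` values for class 2 are identical, so the estimated standard deviation is zero. A Gaussian with zero width cannot be evaluated, so real data with a constant feature would need special handling.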


### Part 3: MAP Estimates of the Class Probabilities

Now that we have the unconditional class probabilities and the parameters of the per feature likelihoods in hand, we can put this all together to build the classifier. Use the methods you have already written to write a function that takes in the training data and returns fit parameters. Once you have done that, write a method that takes the fit parameters as an argument and predicts the class of new (and unseen) data. 

In [32]:
# build the classifier

# solved 

def fit(x_train):
    """
    Convenience function to perform fitting on the training data
     
    Args:
        x_train (array or pd series): training data for the classifier
 
    Returns:
        p1, p2, class_1_mean, class_2_mean, class_1_std, class_2_std: see documentation for estimate_class_probabilities and per_feature_likelihood_parameters
    """
    
    # compute class probabilities and maximum likelihood estimates of the Gaussian distribution's parameters using the methods you wrote above
    
    return p1, p2, class_1_mean, class_2_mean, class_1_std, class_2_std


In [34]:

def predict(x_test, class_probability, class_means, class_dev):
    """
    Predict method
     
    Args:
        x_test (array): data to perform classification on
        class_probability (array): unconditional class probabilities
        class_means, class_dev (array): maximum likelihood estimates produced by the fit method
 
    Returns:
        predict_list (list): class membership predictions
    """
    
    # compute probabilities of an element of the test set belonging to class 1 or 2
        
    for i in range():
        if 
            
        if 
    
    return predict_list
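
The fit and predict steps can be sketched end to end as follows. This is one possible design under stated assumptions, not the reference solution: the names `fit_sketch` and `predict_sketch`, the dictionary of parameters, the base-10 logarithm, and the synthetic two-class data are all hypothetical, and the interface differs from the `fit` and `predict` signatures above.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

def fit_sketch(train, feature_cols, class_col="truth_type", classes=(1, 2)):
    """Return {class label: (prior, means, stdevs)} estimated from the training data."""
    params = {}
    for c in classes:
        rows = np.log10(train.loc[train[class_col] == c, feature_cols])
        if len(rows) == 0:
            continue                          # skip classes absent from the training set
        prior = len(rows) / len(train)
        params[c] = (prior,
                     rows.mean(axis=0).to_numpy(),
                     rows.std(axis=0, ddof=0).to_numpy())
    return params

def predict_sketch(test, params, feature_cols):
    """Assign each row the class with the largest log posterior."""
    x = np.log10(test[feature_cols].to_numpy())
    labels = list(params)
    scores = []
    for c in labels:
        prior, means, stdevs = params[c]
        # log p(y) + sum_i log p(x_i | y), evaluated for every row at once
        scores.append(np.log(prior) + norm.logpdf(x, loc=means, scale=stdevs).sum(axis=1))
    best = np.argmax(np.vstack(scores), axis=0)
    return [labels[i] for i in best]

# Synthetic, well-separated classes (log-normal fluxes) for a quick check.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "flux_g": np.concatenate([10 ** rng.normal(1.0, 0.2, n), 10 ** rng.normal(3.0, 0.2, n)]),
    "flux_r": np.concatenate([10 ** rng.normal(1.5, 0.2, n), 10 ** rng.normal(3.5, 0.2, n)]),
    "truth_type": np.repeat([1, 2], n),
})
features = ["flux_g", "flux_r"]
params = fit_sketch(toy.iloc[::2], features)          # even rows for training
held_out = toy.iloc[1::2]                             # odd rows held out
predicted = predict_sketch(held_out, params, features)
accuracy = np.mean(np.array(predicted) == held_out["truth_type"].to_numpy())
print(accuracy)
```

Skipping classes with no training rows matters for the data in this tutorial, where the training set shown above contains no class 1 objects: without any examples there is nothing to estimate a mean or standard deviation from.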

### Part 4: Metrics

After creating a classifier, you will want to evaluate how often it correctly and incorrectly classifies objects. Use the held-out test set for this, since the classifier has already seen the labels in the training set. To do this, we'll construct a confusion matrix: a matrix whose entries are the counts of predicted vs. actual class. The diagonal entries count correct classifications (for example, the first entry is the number of objects that are predicted to be of class 1 and actually are of class 1), while the off-diagonal entries count instances of class 1 that are predicted to be of class 2 and instances of class 2 that are predicted to be of class 1. 

In [37]:
def plot_confusion_matrix(df_confusion, cmap=):
    """
    Convenience function to plot the confusion matrix from a pd.crosstab result. Hint: use plt.matshow and choose a sensible color map.
    
    Args:
        df_confusion (pd.DataFrame): the table returned by pd.crosstab.
        cmap: color map passed to plt.matshow.
        
    Returns:
        None
    """
    
    
    plt.matshow()
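
As an illustration of the input this function expects, the sketch below builds a confusion matrix with `pd.crosstab` from hypothetical true and predicted labels and draws it with `matshow`. The labels, the `"Blues"` color map, and the axis decoration are arbitrary choices made for the example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical true and predicted labels for six test objects.
actual = pd.Series([1, 1, 2, 2, 2, 1], name="Actual")
predicted = pd.Series([1, 2, 2, 2, 1, 1], name="Predicted")

# Rows index the true class, columns the predicted class.
df_confusion = pd.crosstab(actual, predicted)
print(df_confusion)

fig, ax = plt.subplots()
image = ax.matshow(df_confusion.to_numpy(), cmap="Blues")
fig.colorbar(image)
ax.set_xticks(range(df_confusion.shape[1]))
ax.set_xticklabels(df_confusion.columns)
ax.set_yticks(range(df_confusion.shape[0]))
ax.set_yticklabels(df_confusion.index)
ax.set_xlabel(df_confusion.columns.name)
ax.set_ylabel(df_confusion.index.name)
```

With rows as the true class, each row sums to the number of test objects that actually belong to that class.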


## Problem 2: The Cramer-Rao bound (pen & paper, challenging, optional)

As we saw in the lecture, the Cramer-Rao bound is an important result in statistics that has intuitive consequences for many applied problems in ML. The proof of the Cramer-Rao bound can be insightful to work through. 

The starting point for the proof of the bound is the Cauchy-Schwarz inequality, which can be used to show that:

$$
[\mathrm{Cov}(U, V)]^2 \le \mathrm{Var}(U)\mathrm{Var}(V)
$$

Define $U = T(X)$, where $T(X)$ is an estimator of some parameter $\theta$ of the distribution $f(X|\theta)$ from which the data are sampled, and $V = \frac{\partial}{\partial \theta} \log f(X |\theta)$. Use the Cauchy-Schwarz inequality to derive the Cramer-Rao bound for these choices of $U$ and $V$. 

$\textit{Hint:}$ you will need the fact that $\mathbb{E}(V) = 0$, where $\mathbb{E}$ denotes the expectation of a random variable.
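
For orientation, one common way to write the target of the derivation is shown below. This form assumes the usual regularity conditions (differentiation with respect to $\theta$ can be exchanged with the integral over $X$); the lecture may have stated the bound in a different but equivalent form.

$$
\mathrm{Var}\left(T(X)\right) \ge \frac{\left[\frac{\partial}{\partial \theta} \mathbb{E}\left(T(X)\right)\right]^2}{\mathbb{E}\left[\left(\frac{\partial}{\partial \theta} \log f(X|\theta)\right)^2\right]}
$$

If $T(X)$ is unbiased, so that $\mathbb{E}(T(X)) = \theta$, the numerator equals 1 and the denominator is the Fisher information.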