In [1]:
# load libraries
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
import numpy as np

# ML Stuff

### Supervised Machine Learning
- Training, Validation, Testing datasets
    - Training: used to train the model
    - Validation: used to tune the hyperparameters
        - Hyperparameters: parameters that are not learned by the model
        - modern models often handle this automatically
        - terminology has evolved so older sources may say validation but mean testing data
    - Testing: used to evaluate the model
- Cross Validation
    - iteratively train and test the model on different subsets of the data
    - allows you to use all of the data for training and testing without overfitting (hopefully)
    - Leave-One-Out Cross Validation (LOOCV)
        - train on all but one data point
        - test on the one data point
        - repeat for all data points
        - pros: uses all data for training and testing
        - cons: computationally expensive and can lead to overfitting
        - **Should not be used**
    - k-fold Cross Validation
        - split data into k subsets
        - train on k-1 subsets
        - test on the remaining subset
        - repeat for all subsets
        - pros: computationally efficient (only k models are trained)
        - cons: each model is trained on only (k-1)/k of the data
  ### We will use 2-fold cross validation for this course
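
The k-fold steps above can be sketched in plain Python. This is only an illustration of the splitting logic: `k_fold_indices` and `train_test_splits` are hypothetical helper names, not scikit-learn functions, and the seed value is arbitrary.

```python
import random

def k_fold_indices(n_samples, k, seed=1):
    """Shuffle the indices 0..n_samples-1 and deal them into k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # fold i takes every k-th shuffled index, so fold sizes differ by at most 1
    return [indices[i::k] for i in range(k)]

def train_test_splits(n_samples, k, seed=1):
    """Yield (train_indices, test_indices) once per fold."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        # training set = every fold except the one held out for testing
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 2-fold cross validation on 10 samples: each half is the test set exactly once
for train, test in train_test_splits(10, k=2):
    print("train:", sorted(train), "test:", sorted(test))
```

With `k=2` every sample is used once for training and once for testing, which is the setup used in the cells below.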

### Useful Python Libraries
- NumPy
    - good for linear algebra
- scikit-learn
    - good for machine learning
- pandas
    - good for data manipulation

In [2]:
#load dataset
url = "files/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length',
'petal-width', 'class']
dataset = read_csv(url, names=names)

In [3]:
# print(dataset.head(20))

Most scikit-learn library functions use the following convention:
- X is an array containing only the features, one column per feature and one row per sample.
- y is an array containing only the classes, one entry per row of X.
- Note: `test_size` must be set to 0.50 for the 2-fold cross-validation we will be using in this class.

In [4]:
#Create Arrays for Features and Classes
array = dataset.values
X = array[:,0:4] #contains flower features (petal length, etc..)
y = array[:,4] #contains flower names
#Split Data into 2 Folds for Training and Test
X_Fold1, X_Fold2, y_Fold1, y_Fold2 = train_test_split(X, y, test_size=0.50, random_state=1)

In [26]:
model = GaussianNB() #create model of type Gaussian Naive Bayes
model.fit(X_Fold1, y_Fold1)  #train model on Fold1
pred1 = model.predict(X_Fold2)  #test model on Fold2
model.fit(X_Fold2, y_Fold2)  #train model on Fold2
pred2 = model.predict(X_Fold1)  #test model on Fold1

### Evaluating the Model
- used to quantify
    - desired performance vs actual performance
    - desired vs baseline performance
    - progress over time
- Accuracy
    - number of correct predictions / total number of predictions
    - good for balanced datasets
    - bad for unbalanced datasets
- Confusion Matrix
    - shows the number of correct and incorrect predictions
    - good for unbalanced datasets
    - at its most basic, made up of 4 values
        - true positives (TP)
        - true negatives (TN)
        - false positives (FP)
        - false negatives (FN)
        - FP and FN are often called Type I and Type II errors, respectively
        - <img src="images/FP_and_FN.png" alt="drawing" width="500"/>
    - accuracy, precision, recall, and F1 score can be calculated from the confusion matrix
        - accuracy = (TP + TN) / (TP + TN + FP + FN)
                - how often the model is correct
        - precision = TP / (TP + FP)
                - how often the model is correct when it predicts positive
        - recall = TP / (TP + FN)
                - how often the model predicts positive when the actual class is positive
        - F1 score = 2 * (precision * recall) / (precision + recall)
                - harmonic mean of precision and recall
                - good for unbalanced datasets
    - F-Score
        - F-Score or F-measure is used in statistical analysis of binary classification
        - F-Score is the harmonic mean of precision and recall
        - highest possible value is 1.0
        - lowest possible value is 0.0
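
The formulas above can be checked with a few lines of plain Python. This is a minimal sketch: `binary_metrics` is a hypothetical helper, the counts are made up, and reporting a 0/0 ratio as 0.0 is an assumption of this example.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # a ratio with a zero denominator is reported as 0.0 (an assumption of this sketch)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# roughly balanced data: 50 actual positives, 50 actual negatives
balanced = binary_metrics(tp=40, tn=45, fp=5, fn=10)
# unbalanced data: 95 negatives, 5 positives, and a model that always predicts "negative"
unbalanced = binary_metrics(tp=0, tn=95, fp=0, fn=5)
print(balanced)
print(unbalanced)
```

In the unbalanced case accuracy is 0.95 even though the model never finds a positive (recall and F1 are both 0), which is why accuracy alone is a poor measure for unbalanced datasets.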

### Multiclass Confusion Matrices
- confusion matrices can be extended to multiclass problems
    - e.g.
        - <img src="images/multiclass.png" alt="drawing" width="500"/>
        - precision of cat is from the horizontal cat row, 4/13
        - recall of cat is from the vertical cat column, 4/6
- there is no standard orientation of the matrix
    - i.e. the predicted and true labels can be on the rows or columns
    - so always read the labels
    - the diagonal is always the true positives


In [32]:
actual = np.concatenate([y_Fold2, y_Fold1])  #combine the actual labels from both folds
predicted = np.concatenate([pred1, pred2])   #combine the predicted labels from both folds
print(f"Accuracy: {accuracy_score(actual, predicted)}")   #print the accuracy
print("Confusion Matrix:")   #print the confusion matrix
print(confusion_matrix(actual, predicted))   #print the confusion matrix
print("Classification Report:")   #print the classification report
print(classification_report(actual, predicted))   #print the classification report

Accuracy: 0.96
Confusion Matrix:
[[50  0  0]
 [ 0 47  3]
 [ 0  3 47]]
Classification Report:
              precision    recall  f1-score   support

      Setosa       1.00      1.00      1.00        50
  Versicolor       0.94      0.94      0.94        50
   Virginica       0.94      0.94      0.94        50

    accuracy                           0.96       150
   macro avg       0.96      0.96      0.96       150
weighted avg       0.96      0.96      0.96       150
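
The per-class numbers in the report can be recomputed from the confusion matrix printed above. The sketch below assumes the scikit-learn orientation (actual classes on the rows, predicted classes on the columns) and copies the matrix by hand; `per_class_scores` is a hypothetical helper.

```python
# confusion matrix copied from the output above
# rows = actual class, columns = predicted class (scikit-learn's orientation)
matrix = [[50, 0, 0],
          [0, 47, 3],
          [0, 3, 47]]
labels = ["Setosa", "Versicolor", "Virginica"]

def per_class_scores(matrix):
    """Return a (precision, recall) pair for each class."""
    n = len(matrix)
    scores = []
    for k in range(n):
        tp = matrix[k][k]                                       # diagonal entry
        predicted_k = sum(matrix[row][k] for row in range(n))   # column sum
        actual_k = sum(matrix[k])                               # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        scores.append((precision, recall))
    return scores

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total
scores = per_class_scores(matrix)
print(f"accuracy: {accuracy:.2f}")
for label, (precision, recall) in zip(labels, scores):
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Precision divides the diagonal entry by its column sum and recall divides it by its row sum; with the other orientation the two would swap.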


### Regression Classifiers
- linear regression
    - single input variable
    - $y = b_0 + b_1x$
        - $y$ is the response
        - $b_0$ is the bias coefficient
        - $b_1$ is the coefficient for the input variable
    - training data is used to find the values of the coefficients
        - finding the best fit line
        - many different algorithms can be used to find the best fit line
            - ordinary least squares
            - gradient descent
            - stochastic gradient descent
            - etc...
    - once the coefficients are found, the model can be used to make predictions
        - $y = 0.5 + 0.8x$
        - $y = 0.5 + 0.8(5)$
        - $y = 4.5$ 
- polynomial regression
    - nonlinear relationship between the input and response
        - $y = b_0 + b_1x + b_2x^2 +$ ...    
    - formulas are typically represented as matrices
- multiple linear regression
    - multiple input variables
    - $y = b_0 + b_1x_1 + b_2x_2 +$ ...
    - formulas are typically represented as matrices
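
For the single-variable case, ordinary least squares has a closed form, sketched below in plain Python. `fit_simple_linear` is a hypothetical helper and the data points are made up so that they lie exactly on the example line $y = 0.5 + 0.8x$.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares for y = b0 + b1*x with a single input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    # the best fit line always passes through (mean_x, mean_y)
    b0 = mean_y - b1 * mean_x
    return b0, b1

# made-up points that lie exactly on y = 0.5 + 0.8x
xs = [0, 1, 2, 3, 4]
ys = [0.5 + 0.8 * x for x in xs]
b0, b1 = fit_simple_linear(xs, ys)
print(round(b0, 6), round(b1, 6))
print(round(b0 + b1 * 5, 6))  # prediction for x = 5
```

Gradient descent would reach approximately the same coefficients iteratively instead of in one step.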

In [5]:
x = [[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15], [55, 34], [60, 35]]
y = [4, 5, 20, 14, 32, 22, 38, 43]
x, y = np.array(x), np.array(y)
# print(x)
# print(y)
from sklearn.linear_model import LinearRegression
model = LinearRegression() #create model of type Linear Regression
model.fit(x, y)  #train model on data
# model = LinearRegression().fit(x, y) # oneliner for the above 2 lines

### Evaluating Regression Models
- $R^2$ is a measure of the fit
- can be obtained with `.score()`
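
$R^2$ compares the model's squared error with the squared error of always predicting the mean: $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$. A minimal sketch with made-up numbers (`r_squared` is a hypothetical helper, not the scikit-learn implementation):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # squared error of the model's predictions
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # squared error of always predicting the mean
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, [1.0, 2.0, 3.0, 4.0]))   # perfect predictions
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))   # always predicting the mean
```

A perfect fit scores 1.0, and a model no better than predicting the mean scores 0.0.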

In [9]:
model.score(x, y)  # R^2 on the training data; not displayed because it is not the last line of the cell
print('B_0:', model.intercept_)
print('[B_1 B_2]:', model.coef_)

B_0: 5.52257927519819
[B_1 B_2]: [0.44706965 0.25502548]


### Regression
- Strengths
    - straightforward to understand and explain
    - can be regularized to avoid overfitting
    - easily updated with new data via gradient descent
- Weaknesses
    - assumes a linear relationship between the input and response
        - performs poorly with nonlinear relationships
    - not flexible enough to capture more complex relationships
        - e.g. curved data that needs polynomial terms

### In Class 29Aug23

1. a) Polynomial regression, the data clearly does not follow a straight line
1. b) Linear regression, the data follows a straight line
2.  
- $y = 0.2 + 0.1x_1 + 0.05x_2$
- $x_1 = 5.1$
- $x_2 = 1.8$
- $y = 0.2 + 0.1(5.1) + 0.05(1.8)$
- $y = 0.2 + 0.51 + 0.09$
- $y = 0.8$
- The model predicts Iris-setosa
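
The arithmetic in question 2 can be checked with a short function. `predict` is a hypothetical helper that evaluates $y = b_0 + b_1x_1 + b_2x_2 + \dots$ for any number of inputs.

```python
def predict(b0, coefs, xs):
    """Evaluate a multiple linear regression: b0 + b1*x1 + b2*x2 + ..."""
    return b0 + sum(b * x for b, x in zip(coefs, xs))

# coefficients and inputs from question 2
result = predict(0.2, [0.1, 0.05], [5.1, 1.8])
print(round(result, 2))  # 0.8
```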

#### scikit-learn Algorithm for Regression
- needed for assignment 2
- doesn't work without other code (as of 29Aug23)

In [1]:
from sklearn.preprocessing import PolynomialFeatures

def regModel(name, model):
    #Build the polynomial feature transformer for the requested regression degree
    poly_reg = None
    if (name == "Linear Regression"):
        poly_reg = PolynomialFeatures(degree=1)
    elif(name == "2 Degree Polynomial Regression"):
        poly_reg = PolynomialFeatures(degree=2)
    elif(name == "3 Degree Polynomial Regression"):
        poly_reg = PolynomialFeatures(degree=3)
    #create 2 folds
    X_Poly1 = poly_reg.fit_transform(X_Fold1)
    X_Poly2 = poly_reg.fit_transform(X_Fold2)

In [2]:
model.fit(X_Poly1, y_Fold1) #first fold training
pred1 = model.predict(X_Poly2).round() #first fold testing
#after rounding, regression may produce values < 0 or > 2
pred1 = np.where(pred1 >= 3.0, 2.0, pred1) #map all values >= 3 to 2
pred1 = np.where(pred1 <= -1.0, 0.0, pred1) #map all values <= -1 to 0
model.fit(X_Poly2, y_Fold2) #second fold training
pred2 = model.predict(X_Poly1).round() #second fold testing
pred2 = np.where(pred2 >= 3.0, 2.0, pred2)
pred2 = np.where(pred2 <= -1.0, 0.0, pred2)
actual = np.concatenate([y_Fold2, y_Fold1])
predicted = np.concatenate([pred1, pred2])

NameError: name 'model' is not defined
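
The rounding and clamping step in the cell above can be tried on its own. This sketch assumes the classes are encoded as the integers 0, 1, and 2, which the thresholds in the cell suggest but the notebook does not show. `to_class_label` is a hypothetical helper written in plain Python rather than with `np.where`.

```python
def to_class_label(value, lowest=0, highest=2):
    """Round a regression output and clamp it to the valid class labels."""
    rounded = round(value)  # nearest integer
    # anything below the lowest label or above the highest label is pulled back in range
    return min(max(rounded, lowest), highest)

raw_outputs = [-1.2, -0.4, 0.3, 1.4, 2.6, 3.4]  # made-up regression outputs
print([to_class_label(v) for v in raw_outputs])
```

Clamping to the range [0, 2] has the same effect as the two `np.where` calls: rounded values of 3 or more become 2, and rounded values of -1 or less become 0.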

### Naive Bayesian Classifiers
- simplest ML classifier
- common baseline for comparing other classifiers
    - if a new classifier is not better than a naive bayesian classifier, it is not worth using 
- based on Bayes' Theorem of conditional probability
    - $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$
    - $P(A|B)$ is the probability of A given B
    - $P(B|A)$ is the probability of B given A
    - $P(A)$ is the probability of A
    - $P(B)$ is the probability of B
- mean and variance are used to summarize the data
    - mean $\mu$
        - the average
        - $\mu = \frac{1}{n}\sum_{i=1}^{n}x_i$
    - variance $\sigma^2$
        - how much the data varies from the mean
        - $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
- NB Classifiers are conditional probability models
    - a sample to be classified is represented as a vector of features
        - $x = (x_1, x_2, x_3, ..., x_n)$
    - calculates the conditional probability of each class given the features
        - $P(C_k|x_1, x_2, x_3, ..., x_n)$
    - the class with the highest probability is the predicted class
- problem
    - if the number of features is large, classification by conditional probability is infeasible
    - thus the model is reformulated to be more tractable
        - the denominator is removed because it is effectively a constant
- reduced form
    - posterior numerator
        - posterior numerator = prior * likelihood
        - can estimate $p(x_k|C_i)$ from the training data
            - $p(x_k|C_i) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}}e^{-\frac{(x_k - \mu_{ik})^2}{2\sigma_{ik}^2}}$
            - $x_k$ is the value of feature k in the sample
            - $\mu_{ik}$ is the mean of feature k over the training samples of class $C_i$
            - $\sigma_{ik}^2$ is the variance of feature k over the training samples of class $C_i$
            - $C_i$ is the class
            - $e$ is Euler's number (2.71828...)
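
The reduced form above can be sketched in plain Python: estimate a prior and a per-class mean and variance for every feature, then pick the class with the largest prior * likelihood. This illustrates the formulas only and is not how `GaussianNB` is implemented. The function names and the data are made up, and it assumes every per-class variance is nonzero.

```python
import math

def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    params = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        n = len(rows)
        columns = list(zip(*rows))  # one tuple of values per feature
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(columns, means)]
        params[label] = {"prior": n / len(y), "means": means, "variances": variances}
    return params

def gaussian_pdf(x, mean, variance):
    """Gaussian density p(x_k | C_i) for one feature."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def predict(params, sample):
    """Return the class whose prior * likelihood (posterior numerator) is largest."""
    best_label, best_score = None, -1.0
    for label, p in params.items():
        score = p["prior"]
        # naive assumption: features are independent, so the densities multiply
        for x, mean, variance in zip(sample, p["means"], p["variances"]):
            score *= gaussian_pdf(x, mean, variance)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# made-up two-feature data with two well-separated classes
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.8, 4.0]]
y = ["a", "a", "a", "b", "b", "b"]
params = fit_gaussian_nb(X, y)
print(predict(params, [1.1, 2.0]))
print(predict(params, [2.9, 4.0]))
```

With many features the product of densities can underflow to zero, so practical implementations usually add log-probabilities instead of multiplying.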

### Summary of Naive Bayes
- Strengths
    - simple and easy to implement
    - fast
    - good for high dimensional data
    - good for categorical data
    - good for text classification
- Weaknesses
    - assumes independence of features
    - assumes a Gaussian distribution of features (for Gaussian Naive Bayes)
    - based on probability theory
        - real world data is often more complex
    - can be outperformed by other classifiers
- Training
    - calculate one probability for each class
    - calculate n * m conditional probabilities
        - n is the number of classes
        - m is the number of features 

### In Class 31Aug23

In [1]:
print("test")

test
