# ML Stuff

### Supervised Machine Learning
- Training, Validation, Testing datasets
    - Training: used to train the model
    - Validation: used to tune the hyperparameters
        - Hyperparameters: settings chosen before training (e.g. learning rate, tree depth) rather than learned from the data
        - many modern libraries can automate this tuning
        - terminology has evolved, so older sources may say "validation" when they mean testing data
    - Testing: used to evaluate the model
- Cross Validation
    - iteratively train and test the model on different subsets of the data
    - lets every data point be used for both training and testing, without ever testing a model on data it was trained on, so overfitting is easier to detect
    - Leave-One-Out Cross Validation (LOOCV)
        - train on all but one data point
        - test on the one data point
        - repeat for all data points
        - pros: uses all data for training and testing
        - cons: computationally expensive (one model fit per data point) and can lead to overfitting
        - **Should not be used**
    - k-fold Cross Validation
        - split data into k subsets
        - train on k-1 subsets
        - test on the remaining subset
        - repeat for all subsets
        - pros: computationally efficient (only k model fits)
        - cons: each model is trained on only (k-1)/k of the data and tested on 1/k of it
    - **We will use 2-fold cross validation for this course**
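
To make the difference between the two schemes concrete, here is a minimal sketch that runs both on the same data. It assumes scikit-learn's bundled iris data (`load_iris`) as a stand-in for the course dataset; the variable names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Assumption: the bundled iris data stands in for the course's CSV file.
X, y = load_iris(return_X_y=True)
model = GaussianNB()

# k-fold with k=2: two model fits, each tested on the fold it did not train on.
kfold = KFold(n_splits=2, shuffle=True, random_state=1)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# LOOCV: one model fit per data point, each tested on a single held-out sample.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())

print("2-fold fits:", len(kfold_scores), "mean accuracy:", kfold_scores.mean())
print("LOOCV fits:", len(loo_scores), "mean accuracy:", loo_scores.mean())
```

The number of fits is where the cost difference shows up: 2 for 2-fold versus one per sample for LOOCV.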

### Useful Python Libraries
- NumPy
    - good for linear algebra
- scikit-learn
    - good for machine learning
- pandas
    - good for data manipulation

In [2]:
# load libraries
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB

In [23]:
# load dataset
url = "files/iris.csv"
# column names are supplied explicitly via names=
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)

In [24]:
# print(dataset.head(20))

Most scikit-learn library functions use the following convention:
- X is a 2-D array containing only the features: one row per sample and one column per feature.
- y is a 1-D array containing only the classes, one entry per row of X.
- Note: test_size must be set to 0.50 for 2-fold cross-validation, which we will be using in this class.

In [25]:
# Create arrays for features and classes
array = dataset.values
X = array[:, 0:4]  # flower features (sepal and petal length and width)
y = array[:, 4]    # flower class names
# Split the data into 2 equal folds; each fold serves once for training and once for testing
X_Fold1, X_Fold2, y_Fold1, y_Fold2 = train_test_split(X, y, test_size=0.50, random_state=1)

In [26]:
model = GaussianNB()            # create a Gaussian Naive Bayes model
model.fit(X_Fold1, y_Fold1)     # train on Fold1
pred1 = model.predict(X_Fold2)  # test on Fold2
model.fit(X_Fold2, y_Fold2)     # retrain from scratch on Fold2 (fit() discards the earlier fit)
pred2 = model.predict(X_Fold1)  # test on Fold1
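
The two prediction arrays can then be scored against the labels of the fold each model did not see. The sketch below is one way to do that, not necessarily the course's exact procedure. It assumes scikit-learn's bundled iris data in place of files/iris.csv so it runs on its own, and `acc1`, `acc2`, and `mean_acc` are illustrative names.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumption: bundled iris data replaces files/iris.csv so the sketch is standalone.
X, y = load_iris(return_X_y=True)
X_Fold1, X_Fold2, y_Fold1, y_Fold2 = train_test_split(
    X, y, test_size=0.50, random_state=1
)

model = GaussianNB()
pred1 = model.fit(X_Fold1, y_Fold1).predict(X_Fold2)  # train on Fold1, test on Fold2
pred2 = model.fit(X_Fold2, y_Fold2).predict(X_Fold1)  # train on Fold2, test on Fold1

# Compare each set of predictions with the true labels of the fold it was tested on.
acc1 = accuracy_score(y_Fold2, pred1)
acc2 = accuracy_score(y_Fold1, pred2)
mean_acc = (acc1 + acc2) / 2

print("Fold2 accuracy:", acc1)
print("Fold1 accuracy:", acc2)
print("Mean 2-fold accuracy:", mean_acc)
```

Averaging the two fold scores gives a single 2-fold estimate in which every sample has been used exactly once for testing.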

### Evaluating the Model
- used to quantify
    - desired performance vs actual performance
    - desired vs baseline performance
    - progress over time
- Accuracy
    - number of correct predictions / total number of predictions
    - good for balanced datasets
    - misleading for unbalanced datasets: always predicting the majority class can still score high
- Confusion Matrix
    - shows the number of correct and incorrect predictions for each class
    - good for unbalanced datasets because it shows which classes the errors fall in
    - at its most basic (binary classification), made up of 4 values
        - true positives (TP)
        - true negatives (TN)
        - false positives (FP)
        - false negatives (FN)
        - FP and FN are often called Type I and Type II errors, respectively

![image.png](attachment:c00b143a-4e49-4da5-a0f6-42b8a3777900.png)
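
As a small illustration of why accuracy alone can mislead on unbalanced data, the sketch below scores a "model" that always predicts the majority class. The labels are made up for the example and do not come from the course data.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy_score(y_true, y_pred)

# With labels=[0, 1], ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

print("accuracy:", acc)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Accuracy comes out at 0.95, yet the confusion matrix shows TP = 0 and FN = 5: every positive case was missed.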