## Intro to Machine Learning (ML)
### 2 Main Types
1. Supervised learning: labeled data (i.e. there is a special attribute called the "class" that we are interested in predicting for "unseen" instances)
    * If the class is numeric -> regression task
    * If the class is categorical -> classification task
    * Example algorithm: `kNN` (k nearest neighbors)
2. Unsupervised learning: unlabeled data (i.e. no class attribute)
    * Example algorithm: k-means clustering

### Supervised learning
* We need a way to divide a dataset into a training set and a testing set
    * Train/build an algorithm/model using a training set
    * Test/evaluate an algorithm/model using the testing set
        * Testing has "unseen instances"
    * Note: The training and testing sets are *different* (they share no instances)
* Example: tiny t-shirt dataset
    * Goal: predict t-shirt sizes using height and weight of people
    * Training set
        * 4 instances, 2 attributes (AKA features), 1 class (t-shirt size)
    * Testing set
        * 1 unseen instance (161cm, 63kg)
        * What should the t-shirt size be?
        * Let's say the "actual" AKA "ground truth" value for this instance is M (Medium)
        * If `kNN` predicts M, then we have 100% accuracy
        * If `kNN` predicts L, then we have 0% accuracy

### `kNN` Algorithm
* Identify the k nearest neighbors in the training set for the unseen instance
    * The most frequent class label amongst these k neighbors will be the algorithm's prediction for the class label for the unseen instance
* We need a way to measure "nearness" or "closeness"
    * 2-dimensional: Pythagorean theorem
    * N-dimensional: Euclidean distance $dist(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$
* To avoid inadvertently weighting our attributes when we use this formula (an attribute with a larger range would dominate the distance), we typically apply a preprocessing step called normalization (or standardization)
    * For `kNN`, we will use a normalization technique called min-max scaling
    * For each attribute, subtract the min and divide by the range: $x' = \frac{x - \min}{\max - \min}$
    * Each attribute will then be in `[0, 1]`
* Trace time!
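
Before the scikit-learn trace, here is a minimal sketch of the two building blocks described above (min-max scaling, then Euclidean distance) in plain NumPy. The helper names `min_max_scale` and `euclidean_distance` are made up for illustration, and the sample values are the heights and weights of the tiny t-shirt dataset loaded below. It assumes no attribute is constant (a range of zero would divide by zero).

```python
import numpy as np

def min_max_scale(X, X_ref=None):
    """Scale each column of X to [0, 1] using the min and range of X_ref.

    X_ref defaults to X itself. Pass the training set as X_ref when
    scaling a test instance so both use the same min and range.
    """
    X = np.asarray(X, dtype=float)
    X_ref = X if X_ref is None else np.asarray(X_ref, dtype=float)
    col_min = X_ref.min(axis=0)
    col_range = X_ref.max(axis=0) - col_min  # assumed non-zero
    return (X - col_min) / col_range

def euclidean_distance(a, b):
    """Square root of the sum of squared attribute differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

train = [[158, 58], [163, 61], [165, 61], [168, 66]]  # height(cm), weight(kg)
test = [[161, 63]]                                    # the unseen instance

train_scaled = min_max_scale(train)
test_scaled = min_max_scale(test, X_ref=train)
print(train_scaled)
print(test_scaled)
print(euclidean_distance(test_scaled[0], train_scaled[1]))
```

Scaling the test instance with the training set's min and range mirrors what `scaler.transform(X_test)` does in the cells that follow.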
    

In [1]:
import pandas as pd

df = pd.read_csv("shirt_sizes.csv")
df

   height(cm)  weight(kg)  t-shirt size
0         158          58             M
1         163          61             M
2         165          61             L
3         168          66             L


In [2]:
# We will use the scikit-learn ML library
# Note: it mostly provides classical ("shallow") ML algorithms, not deep learning
# Notation:
# X: a feature matrix (2D; rows are feature vectors AKA instances)
#    (remove the class attribute from the data and store it in y)
# y: a class vector (1D; what you are trying to predict)
# X and y are parallel
# Add a _train or _test suffix to denote the training or testing set
X_train = df.drop("t-shirt size", axis=1)
print(X_train)
y_train = df["t-shirt size"]
print(y_train)

X_test = [[161, 63]]  # 2D; add more rows if there are more test instances
print(X_test)

   height(cm)  weight(kg)
0         158          58
1         163          61
2         165          61
3         168          66
0    M
1    M
2    L
3    L
Name: t-shirt size, dtype: object
[[161, 63]]


In [3]:
# Normalize the data
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(X_train)
X_train_normalized = scaler.transform(X_train)
print(X_train_normalized)

X_test_normalized = scaler.transform(X_test)
print(X_test_normalized)

[[0.    0.   ]
 [0.5   0.375]
 [0.7   0.375]
 [1.    1.   ]]
[[0.3   0.625]]




In [4]:
# get prediction
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_clf.fit(X_train_normalized, y_train) # to train
y_predicted = knn_clf.predict(X_test_normalized) # to test
print(y_predicted)
print(knn_clf.kneighbors(X_test_normalized))

['M']
(array([[0.32015621, 0.47169906, 0.69327123]]), array([[1, 2, 0]]))
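
To see where this output comes from, the vote can be traced without scikit-learn. The sketch below is a hand-rolled illustration, not scikit-learn's implementation: `knn_predict` is a hypothetical helper, the normalized values are copied from the output of cell 3, and ties are broken by whichever label was counted first, which may not match scikit-learn's tie-breaking.

```python
from collections import Counter
import math

def knn_predict(X_train, y_train, x, k):
    """Return (predicted label, indices of the k nearest training rows)."""
    # Euclidean distance from x to every training instance
    distances = [math.dist(row, x) for row in X_train]
    # Indices of the k smallest distances, nearest first
    nearest = sorted(range(len(X_train)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[i] for i in nearest)
    label = votes.most_common(1)[0][0]
    return label, nearest

X_train_normalized = [[0.0, 0.0], [0.5, 0.375], [0.7, 0.375], [1.0, 1.0]]
y_train = ["M", "M", "L", "L"]
x = [0.3, 0.625]  # the normalized unseen instance

label, nearest = knn_predict(X_train_normalized, y_train, x, k=3)
print(label, nearest)
```

The three nearest rows are 1, 2, and 0 (labels M, L, M), so the majority label is M, in agreement with `knn_clf.predict()` and `knn_clf.kneighbors()` above.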


### Closing Thoughts on `kNN`
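* Not a very efficient algorithm: a straightforward implementation computes the distance from each unseen instance to every training instance
* What if our attributes are not numeric (i.e. they are categorical)?
    * Convert category labels to integers
        * `from sklearn.preprocessing import LabelEncoder`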
    * Use a different distance metric (or make your own)
* Note that `kNN` is not the only supervised ML algorithm (it is a good one to start learning with though)
    * Naive Bayes
    * Decision trees
    * Random forests
    * Support vector machines (SVMs)
    * Neural networks
    * etc.
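
For the categorical-attribute case mentioned above, here is a small sketch of converting category labels to integers with `LabelEncoder`. The `colors` attribute is hypothetical and not part of the t-shirt dataset.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "blue"]  # hypothetical categorical attribute

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

print(list(encoder.classes_))                  # categories, in sorted order
print(codes.tolist())                          # integer code for each value
print(list(encoder.inverse_transform(codes)))  # back to the original labels
```

Note that the integer codes imply an ordering the categories may not have (here "blue" ends up closer to "green" than to "red" under Euclidean distance), which is one reason to consider a different distance metric for categorical attributes. scikit-learn documents `LabelEncoder` as intended for class labels; `OrdinalEncoder` is the analogous encoder for feature columns.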

### Warm-up Tasks

In [9]:
df = pd.read_csv("shirt_sizes_long.csv")

X = df.drop("t-shirt size", axis=1)
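# Reuse the MinMaxScaler from earlier; note that the min and range are
# computed from the full dataset here, before any train/test split
X = scaler.fit_transform(X)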
y = df["t-shirt size"]

print(X.shape, y.shape)

(18, 2) (18,)


### Classifier Evaluation
* In our previous demo, we had one instance in our "test" set
    * If our classifier correctly predicts the label for this instance -> 100% accuracy
    * If our classifier incorrectly predicts the label for this instance -> 0% accuracy
* Notes:
    * We want a "large" testing set so we get a reliable picture of how well our classifier has learned
    * Accuracy doesn't tell the whole story 
* We need a way to "divide" our dataset into training and testing
    * A few ways to do this:
        1. Holdout method
        1. Cross validation

### Holdout Method
* "Hold out" a certain number or percentage of instances for testing
    * Train on the remaining instances
* Typically use a common "split" for how much to hold out
    * 2:1 -> 1/3 held out for testing, 2/3 used for training
    * 25% held out for testing, 75% used for training
        * Default for scikit-learn's `train_test_split()`
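
The idea behind the holdout method can be sketched in a few lines of NumPy. This is an illustrative version only, not how `train_test_split()` is implemented: `holdout_split` is a hypothetical helper, it does no stratification, and rounding the test size up is an assumption chosen so that 18 instances split 13/5 as in the run below.

```python
import numpy as np

def holdout_split(n_instances, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) for a random holdout split."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_instances)  # random order of row indices
    n_test = int(np.ceil(test_fraction * n_instances))
    # The first n_test shuffled indices are held out; the rest are for training
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = holdout_split(18)
print(len(train_idx), len(test_idx))
```

Every index lands in exactly one of the two sets, so the training and testing sets never overlap.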

In [12]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
# random_state=0 for reproducible results
# stratify=y to force the same distribution of class labels in your training and testing set

print(X_train)
print("****")
print(y_train)

[[0.58333333 0.3       ]
 [1.         1.        ]
 [0.83333333 0.5       ]
 [0.41666667 0.2       ]
 [0.58333333 0.7       ]
 [0.         0.5       ]
 [0.         0.1       ]
 [0.41666667 0.6       ]
 [1.         0.6       ]
 [0.16666667 0.1       ]
 [0.16666667 0.2       ]
 [1.         0.5       ]
 [0.83333333 0.8       ]]
****
9     L
17    L
13    L
5     M
11    L
2     M
1     M
8     L
16    L
3     M
4     M
15    L
14    L
Name: t-shirt size, dtype: object


In [14]:
print(X_test)
print(y_test)

[[0.83333333 0.4       ]
 [0.41666667 0.3       ]
 [0.16666667 0.6       ]
 [0.         0.        ]
 [0.58333333 0.4       ]]
12    L
6     M
7     L
0     M
10    L
Name: t-shirt size, dtype: object
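
The effect of `stratify=y` can be checked by counting class labels on each side of the split. The sketch below uses stand-in data with the same class balance as the printed labels above (7 "M" and 11 "L"); the feature values are placeholders, not the real heights and weights.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class balance as the 18-instance dataset
y = ["M"] * 7 + ["L"] * 11
X = [[i] for i in range(len(y))]  # placeholder feature values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Class counts on each side of the split
print(Counter(y_train))
print(Counter(y_test))
```

With stratification, both sets keep roughly the 7:11 ratio of M to L, which is what the printed `y_train` (5 M, 8 L) and `y_test` (2 M, 3 L) show.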


In [13]:
from sklearn.metrics import accuracy_score

knn_clf.fit(X_train, y_train)
acc = knn_clf.score(X_test, y_test)
print("Accuracy:", acc)

# Or

y_predicted = knn_clf.predict(X_test)
print(y_predicted)
acc = accuracy_score(y_test, y_predicted)
print("Accuracy:", acc)

Accuracy: 0.8
['L' 'M' 'M' 'M' 'L']
Accuracy: 0.8
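
Since accuracy doesn't tell the whole story, one common complement is a confusion matrix, which shows which classes get confused with which. A small sketch, with the actual and predicted labels copied from the outputs above:

```python
from sklearn.metrics import confusion_matrix

y_test = ["L", "M", "L", "M", "L"]       # actual labels from the run above
y_predicted = ["L", "M", "M", "M", "L"]  # kNN predictions from the run above

# Rows are actual classes, columns are predicted classes, ordered as in `labels`
matrix = confusion_matrix(y_test, y_predicted, labels=["L", "M"])
print(matrix)
```

The single error is an actual L predicted as M; no M was predicted as L. An accuracy of 0.8 alone would not reveal that.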


In [16]:
from sklearn.tree import DecisionTreeClassifier

tree_clf = DecisionTreeClassifier(random_state=0)
tree_clf.fit(X_train, y_train) # to train
acc = tree_clf.score(X_test, y_test)
print("accuracy", acc)

accuracy 1.0


### k Fold Cross Validation
* With cross validation, every instance is in the test set exactly one time
* Basic algorithm: divide the dataset into "folds"
    * For each fold:
        * Test on the fold
        * Train on the remaining folds (folds - fold)
* Accuracy is the total correctly predicted over all the folds divided by the total number of instances
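
The fold bookkeeping described above can be sketched as follows. `k_fold_indices` is a hypothetical helper that cuts the indices into consecutive folds with no shuffling or stratification; scikit-learn's cross validation helpers use stratified folds for classifiers by default, so their folds will generally differ.

```python
import numpy as np

def k_fold_indices(n_instances, k):
    """Yield (train_indices, test_indices) for each of k consecutive folds."""
    folds = np.array_split(np.arange(n_instances), k)
    for i, test_idx in enumerate(folds):
        # Train on every fold except the one currently used for testing
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

fold_sizes = []
seen = []
for train_idx, test_idx in k_fold_indices(18, 5):
    fold_sizes.append(len(test_idx))
    seen.extend(test_idx.tolist())

print(fold_sizes)
print(sorted(seen) == list(range(18)))  # each instance is tested exactly once
```

In a full evaluation, a classifier would be fit on `train_idx` and asked to predict `test_idx` inside the loop, and the predictions from all folds would be pooled before computing accuracy.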

In [20]:
from sklearn.model_selection import cross_val_score, cross_val_predict
import numpy as np

# Do 5 fold cross validation for the kNN and decision tree classifiers
for clf in [knn_clf, tree_clf]:
    print(type(clf))
    accuracies = cross_val_score(clf, X, y)
    print(accuracies, np.mean(accuracies))
    # The preferred way to calculate accuracy: pool the predictions
    # from all folds, then compute accuracy once over all instances
    y_predicted = cross_val_predict(clf, X, y)
    acc = accuracy_score(y, y_predicted)
    print(acc)

<class 'sklearn.neighbors._classification.KNeighborsClassifier'>
[0.75       0.5        1.         1.         0.66666667] 0.7833333333333333
0.7777777777777778
<class 'sklearn.tree._classes.DecisionTreeClassifier'>
[0.5        0.5        1.         1.         0.66666667] 0.7333333333333333
0.7222222222222222
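
The two kNN numbers above differ (0.7833 vs. 0.7778) because the first is an unweighted mean of per-fold accuracies, while the second pools all predictions. When folds have unequal sizes, smaller folds get more weight per instance in the unweighted mean. The sketch below reproduces both numbers; the fold sizes are an assumption (18 instances in 5 folds as 4, 4, 4, 3, 3), since the output does not print them.

```python
import numpy as np

# kNN fold accuracies copied from the output above
fold_accuracies = np.array([0.75, 0.5, 1.0, 1.0, 2 / 3])
# Assumed fold sizes for 18 instances in 5 folds (not printed in the output)
fold_sizes = np.array([4, 4, 4, 3, 3])

# Unweighted mean: every fold counts equally
mean_of_folds = fold_accuracies.mean()
# Pooled accuracy: total correct predictions over total instances
pooled = np.sum(fold_accuracies * fold_sizes) / fold_sizes.sum()

print(mean_of_folds, pooled)
```

Under that assumption the pooled value is 14 correct out of 18, which matches the accuracy computed from `cross_val_predict()`.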
