STAT 451: Machine Learning (Fall 2021)  
Instructor: Sebastian Raschka (sraschka@wisc.edu)  

# L05 - Data Preprocessing and Machine Learning with Scikit-Learn

# 5.3 Object Oriented Programming (OOP) & Python Classes

## Python Classes

- This section illustrates the concept of "classes" in Python, which is relevant for understanding how the scikit-learn API works on a fundamental level later in this lecture.
- Note that Python is an object-oriented language, and everything in Python is an object.
- Classes are "templates" for creating objects (this is called "instantiating" objects).
- An object bundles attributes (data) with special "functions" that operate on them; a "function" that belongs to an object or class is called a "method."
- Note that `self` is not a reserved keyword but the conventional name of the first parameter of a method; it refers to the instantiated object on which the method is called, i.e., the object "itself."

In [1]:
class VehicleClass():
    
    def __init__(self, horsepower):
        "This is the 'init' method"
        # this is an instance attribute:
        self.horsepower = horsepower
        
    def horsepower_to_torque(self, rpm):
        "This is a regular method"
        # torque (lb-ft) = horsepower * 5252 / rpm
        torque = self.horsepower * 5252 / rpm
        return torque
    
    def tune_motor(self):
        self.horsepower *= 2
    
    def _private_method(self):
        print('this is private')
    
    def __very_private_method(self):
        print('this is very private')

In [2]:
# instantiate an object:
car1 = VehicleClass(horsepower=123)
print(car1.horsepower)

123


In [3]:
car1.horsepower_to_torque(rpm=5000)

129.1992

In [4]:
car1.tune_motor()
car1.horsepower_to_torque(rpm=5000)

258.3984
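
To make the role of `self` more concrete, here is a minimal sketch that uses a trimmed-down copy of `VehicleClass` (repeated only so that the snippet runs on its own). It illustrates two points: calling a method on an object passes that object as `self` automatically, and each object keeps its own copy of the attributes set in `__init__`. The variable names `car_a` and `car_b` are made up for this illustration.

```python
class VehicleClass:
    """Trimmed-down copy of the class above, so this snippet is self-contained."""

    def __init__(self, horsepower):
        self.horsepower = horsepower  # stored on the individual object

    def tune_motor(self):
        self.horsepower *= 2


car_a = VehicleClass(horsepower=100)
car_b = VehicleClass(horsepower=100)

# The usual call: Python passes `car_a` as `self` behind the scenes.
car_a.tune_motor()
print(car_a.horsepower)  # 200

# The same call written out explicitly via the class:
VehicleClass.tune_motor(car_a)
print(car_a.horsepower)  # 400

# `car_b` is a separate object with its own `horsepower`, so it is unchanged.
print(car_b.horsepower)  # 100
```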

In [5]:
car1._private_method()

this is private


- Python has the motto "we are all adults here," which means that the language does not enforce access restrictions: a user can access the same methods and attributes as the developer of a class (in contrast to other programming languages, e.g., Java, where methods declared `private` cannot be called from outside the class).
- A preceding underscore is an indicator that a method is considered "private" -- this means that the method is meant to be used internally and not by the user directly (also, it does not show up in the "help" documentation).
- A preceding double underscore is a "stronger" indicator for methods that are supposed to be private; users can still access these (adhering to the "we are all adults here" motto), but only via the "name-mangled" form, in which Python prefixes the method name with an underscore and the class name (`_ClassName__method`).

In [6]:
# Executing the following would raise an AttributeError:
# car1.__very_private_method()

In [7]:
# If we use "name mangling" we can access this private method:
car1._VehicleClass__very_private_method()

this is very private
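
The following sketch shows what name mangling does in practice, under the assumption of a reduced `VehicleClass` that returns the string instead of printing it. The helper method `call_internally` is hypothetical and only added to show that the double-underscore name still works unchanged *inside* the class body.

```python
class VehicleClass:
    def __very_private_method(self):
        return 'this is very private'

    def call_internally(self):
        # Inside the class body, Python rewrites this name to
        # _VehicleClass__very_private_method as well, so the call succeeds.
        return self.__very_private_method()


car = VehicleClass()

# From outside the class, the unmangled name does not exist on the object:
try:
    car.__very_private_method()
except AttributeError as err:
    print('AttributeError:', err)

# The mangled name is what is actually stored on the class:
print('_VehicleClass__very_private_method' in dir(car))  # True
print(car._VehicleClass__very_private_method())          # this is very private
print(car.call_internally())                             # this is very private
```

One practical consequence of this design is that a subclass can define its own `__very_private_method` without accidentally overriding the parent's, since the two names are mangled with different class names.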


- Another useful aspect of using classes is the concept of "inheritance."
- Using inheritance, we can "inherit" methods and attributes from a parent class for re-use.
- For instance, consider the `VehicleClass` as a more general class than the `CarClass` -- i.e., a car, truck, or motorbike are specific cases of a vehicle.
- Below is an example of a `CarClass` that inherits the methods from the `VehicleClass` and adds a specific `self.num_wheels=4` attribute -- if we were to create a `BikeClass`, we could set this to `self.num_wheels=2`, for example.
- All in all, this is a very simple demonstration of class inheritance; however, it is a concept that is very useful for writing "clean code" and structuring projects. The scikit-learn machine learning library makes heavy use of this concept internally. As users, we don't have to worry about it too much, but it is useful to know in case you would like to modify or contribute to the library.

In [8]:
class CarClass(VehicleClass):

    def __init__(self, horsepower):
        super().__init__(horsepower)
        self.num_wheels = 4
    
new_car = CarClass(horsepower=123)
print('Number of wheels:', new_car.num_wheels)
print('Horsepower:', new_car.horsepower)
new_car.tune_motor()
print('Horsepower:', new_car.horsepower)

Number of wheels: 4
Horsepower: 123
Horsepower: 246
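
As a sketch of the `BikeClass` idea mentioned above, the snippet below sets `num_wheels = 2` and additionally overrides an inherited method. The parent class is repeated in reduced form so the snippet is self-contained, and the 50% tuning rule is a made-up example, not something from the lecture.

```python
class VehicleClass:
    def __init__(self, horsepower):
        self.horsepower = horsepower

    def tune_motor(self):
        self.horsepower *= 2


class BikeClass(VehicleClass):
    def __init__(self, horsepower):
        super().__init__(horsepower)  # reuse the parent's initialization
        self.num_wheels = 2

    def tune_motor(self):
        # Override: hypothetical rule that tuning a bike adds 50% instead of doubling.
        self.horsepower *= 1.5


bike = BikeClass(horsepower=80)
bike.tune_motor()
print('Number of wheels:', bike.num_wheels)  # Number of wheels: 2
print('Horsepower:', bike.horsepower)        # Horsepower: 120.0

# A BikeClass object is also a VehicleClass object:
print(isinstance(bike, VehicleClass))        # True
```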


## K-Nearest Neighbors Implementation

- Below is a very simple implementation of a *k*-nearest neighbors (*k*NN) classifier.
- This is a very slow and inefficient implementation; for real-world problems, it is generally recommended to use established libraries (like scikit-learn) instead of implementing algorithms from scratch.
- The scikit-learn library, for example, implements *k*NN much more efficiently and robustly, using advanced data structures (KD-Tree and Ball-Tree, which we briefly discussed in Lecture 02).
- Implementing algorithms from scratch is useful for learning and teaching purposes, or if we want to try out new algorithms. Hence the implementation below, which gently introduces how things are implemented in scikit-learn.

In [9]:
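import numpy as np


class KNNClassifier(object):
    def __init__(self, k, dist_fn=None):
        self.k = k
        # Fall back to the Euclidean distance unless a custom function is given.
        if dist_fn is None:
            self.dist_fn = self._euclidean_dist
        else:
            self.dist_fn = dist_fn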
    
    def _euclidean_dist(self, a, b):
        dist = 0.
        for ele_i, ele_j in zip(a, b):
            dist += ((ele_i - ele_j)**2)
        dist = dist**0.5
        return dist
        
    def _find_nearest(self, x):
        dist_idx_pairs = []
        for j in range(self.dataset_.shape[0]):
            d = self.dist_fn(x, self.dataset_[j])
            dist_idx_pairs.append((d, j))
            
        sorted_dist_idx_pairs = sorted(dist_idx_pairs)

        return sorted_dist_idx_pairs
    
    def fit(self, X, y):
        self.dataset_ = X.copy()
        self.labels_ = y.copy()
        self.possible_labels_ = np.unique(y)

    def predict(self, X):
        predictions = np.zeros(X.shape[0], dtype=int)
        for i in range(X.shape[0]):
            k_nearest = self._find_nearest(X[i])[:self.k]
            indices = [entry[1] for entry in k_nearest]
            k_labels = self.labels_[indices]
            # Majority vote; assumes integer class labels 0, 1, ..., n_classes - 1.
            counts = np.bincount(k_labels,
                                 minlength=self.possible_labels_.shape[0])
            pred_label = np.argmax(counts)
            predictions[i] = pred_label
        return predictions

In [10]:
# Code repeated from 5-2-basic-data-handling.ipynb

import pandas as pd
import numpy as np


df = pd.read_csv('data/iris.csv')

d = {'Iris-setosa': 0,
     'Iris-versicolor': 1,
     'Iris-virginica': 2}
df['Species'] = df['Species'].map(d)

X = df.iloc[:, 1:5].values
y = df['Species'].values

indices = np.arange(X.shape[0])
rng = np.random.RandomState(123)
permuted_indices = rng.permutation(indices)

train_size, valid_size = int(0.65*X.shape[0]), int(0.15*X.shape[0])
test_size = X.shape[0] - (train_size + valid_size)
train_ind = permuted_indices[:train_size]
valid_ind = permuted_indices[train_size:(train_size + valid_size)]
test_ind = permuted_indices[(train_size + valid_size):]
X_train, y_train = X[train_ind], y[train_ind]
X_valid, y_valid = X[valid_ind], y[valid_ind]
X_test, y_test = X[test_ind], y[test_ind]

print(f'X_train.shape: {X_train.shape}')
print(f'X_valid.shape: {X_valid.shape}')
print(f'X_test.shape: {X_test.shape}')

X_train.shape: (97, 4)
X_valid.shape: (22, 4)
X_test.shape: (31, 4)
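
The split sizes above follow from simple index arithmetic on the 150 examples. The sketch below reproduces only that arithmetic in plain Python, using the standard library's `random.Random` in place of `np.random.RandomState`, so the shuffled order will not match the NumPy-based cell. It also checks that the three index sets do not overlap.

```python
import random

n_examples = 150  # number of rows implied by the shapes printed above

indices = list(range(n_examples))
random.Random(123).shuffle(indices)  # seeded, so the shuffle is reproducible

train_size = int(0.65 * n_examples)
valid_size = int(0.15 * n_examples)
test_size = n_examples - (train_size + valid_size)

train_ind = indices[:train_size]
valid_ind = indices[train_size:train_size + valid_size]
test_ind = indices[train_size + valid_size:]

print(len(train_ind), len(valid_ind), len(test_ind))  # 97 22 31

# The three sets are disjoint and together cover every example exactly once.
all_ind = train_ind + valid_ind + test_ind
print(len(set(all_ind)) == n_examples)  # True
```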


In [11]:
knn_model = KNNClassifier(k=3)
knn_model.fit(X_train, y_train)


print(knn_model.predict(X_valid))

[0 1 2 1 1 1 0 0 1 2 0 0 1 1 1 2 1 1 1 2 0 0]
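
The `dist_fn` argument of `KNNClassifier` suggests that the distance measure is meant to be exchangeable. The following is a minimal pure-Python sketch of that idea on a tiny made-up dataset; the function names `euclidean_dist`, `manhattan_dist`, and `knn_predict_one` are hypothetical and not part of the class above. Note that the majority vote via `collections.Counter` may break ties differently than the `np.argmax`-based version.

```python
from collections import Counter


def euclidean_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5


def manhattan_dist(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


def knn_predict_one(x, dataset, labels, k, dist_fn=euclidean_dist):
    """Predict the label of a single query point `x` by majority vote."""
    # Sort the training indices by their distance to x and keep the k closest.
    nearest = sorted(range(len(dataset)),
                     key=lambda j: dist_fn(x, dataset[j]))[:k]
    votes = Counter(labels[j] for j in nearest)
    return votes.most_common(1)[0][0]


# Tiny made-up dataset: two well-separated clusters in 2D.
dataset = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict_one((0.5, 0.5), dataset, labels, k=3)
pred_b = knn_predict_one((5.5, 5.5), dataset, labels, k=3,
                         dist_fn=manhattan_dist)
print(pred_a, pred_b)  # 0 1
```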


- Note that there are attributes with a `_` suffix in the implementation above -- this is not a typo.
- The trailing `_` (e.g., here: `self.dataset_`) is a scikit-learn convention and indicates that these are "fit" attributes -- that is, attributes that are available only *after* calling the `fit` method.
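
To illustrate the "fit" attribute convention, here is a small sketch with a hypothetical toy estimator, `MeanPredictor`, which is not part of scikit-learn. The attribute `mean_` does not exist until `fit` has been called, and `predict` checks for it before using it. Returning `self` from `fit` mirrors what scikit-learn estimators do.

```python
class MeanPredictor:
    """Hypothetical toy estimator that always predicts the mean training target."""

    def fit(self, X, y):
        # Trailing underscore: this attribute only exists after fit was called.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        if not hasattr(self, 'mean_'):
            raise RuntimeError('Call fit before predict.')
        return [self.mean_ for _ in X]


model = MeanPredictor()
print(hasattr(model, 'mean_'))    # False

model.fit(X=[[1], [2], [3]], y=[10, 20, 30])
print(hasattr(model, 'mean_'))    # True
print(model.predict([[4], [5]]))  # [20.0, 20.0]
```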