## Module 4

This is Module 4 in building a reusable, class-based machine learning framework.

In [Module 1](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module1/module_1.ipynb), we focused on preparing to code outside of Jupyter Notebook. We set up virtual environments with [Pipenv](https://pipenv.pypa.io/en/latest/), installed and configured [VS Code](https://code.visualstudio.com/), and added automatic [Black](https://github.com/psf/black) formatting to our code.

In [Module 2](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module2/module_2.ipynb), we reviewed how to build Python classes. We built our base model class with an initial exploratory method. We added docstrings to our class methods so that we get documentation when calling `help()`.

In [Module 3](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module3/module_3.ipynb), we built an EDA Cleaning class and integrated it into our Base Model.

In Module 4, we will focus on the following skills:
- Understanding the concept of an abstract class
- Setting up a Regression abstract class
- Introducing type hints into our code

### Abstract Classes

While a class is a blueprint for similar objects, an abstract class is a blueprint for other classes. We use an abstract class when there are methods that we ALWAYS want as part of our classes, under the SAME NAMES, but whose *behavior may differ* from class to class.

A good example, and in fact the one we will use in our framework, is regression vs classification. In both of these model types we'll want to use train_test_split to break up our data, but we might want to approach that differently depending on the model type. However, we would like to be able to call the same method name to accomplish the task. Another example would be making a basic model: linear regression for a regression problem, logistic regression for a classification problem. We can call a method named basic_model to accomplish this task, but have it use the correct type of model depending on our data set.

We are going to turn our BaseModel into a class blueprint (an abstract class) and define two methods - a split method and a basic regression method. However, we will define the methods as abstract methods, meaning that the classes we make from the blueprint MUST implement them.
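Before touching BaseModel, here is a minimal, self-contained sketch of the idea. The class names `ModelBlueprint`, `RegressionSketch`, and `ClassificationSketch` are hypothetical and are not part of our framework; they only illustrate one method name with different behavior per class.

```python
from abc import ABC, abstractmethod


class ModelBlueprint(ABC):
    """Hypothetical blueprint: every model type must provide basic_model()."""

    @abstractmethod
    def basic_model(self) -> str:
        """Return the name of the baseline model for this model type."""


class RegressionSketch(ModelBlueprint):
    def basic_model(self) -> str:
        return "linear regression"


class ClassificationSketch(ModelBlueprint):
    def basic_model(self) -> str:
        return "logistic regression"


# Same method name, different behavior depending on the class.
chosen = [model.basic_model() for model in (RegressionSketch(), ClassificationSketch())]
print(chosen)
```

The calling code never needs to know which kind of model it is holding; it simply calls `basic_model()`.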

Our first step is to redefine our BaseModel as a class blueprint by making it an abstract class. We have a new import statement, and importantly, our BaseModel class declaration is now `class BaseModel(ABC)`, which defines it as a class blueprint.


In [3]:
from abc import ABC, abstractmethod
import pandas as pd

from module4_eda_cleaning import EDACleaning

class BaseModel(ABC):
    def __init__(self, filename):
        self.df = self._load_file(filename)
        self.cleaner = EDACleaning()

    def _load_file(self, filename):
        """Load file from filename
        Args:
            filename: filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target):
        """Sets model target field
        Args:
            target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self):
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)


The next step is to define our two abstract methods, which are the methods that every class built from the blueprint, such as a regression model or a classification model, is required to have.

We add two methods to the bottom of our BaseModel. They still take self, just like other class methods. Their bodies contain only `pass`, because we don't put logic for abstract methods in the class blueprint. Most importantly, they have the `@abstractmethod` decorator before the function definition. This declares that classes made from the BaseModel blueprint MUST define these methods.

In [4]:
class BaseModel(ABC):
    def __init__(self, filename):
        self.df = self._load_file(filename)
        self.cleaner = EDACleaning()

    def _load_file(self, filename):
        """Load file from filename
        Args:
            filename: filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target):
        """Sets model target field
        Args:
            target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self):
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    @abstractmethod
    def split_data(self, stratify):
        pass

    @abstractmethod
    def basic_regression(self):
        pass

### Setting up a Regression Model class

Now we will make a class using our class blueprint. It's time to set up our RegressionModel! Let's start with a simple class whose \_\_init\_\_ does nothing but `pass`. Since we're using the BaseModel as a blueprint for this class, we declare it with `class RegressionModel(BaseModel)` to indicate that it inherits from BaseModel.

After we make the RegressionModel class, we try making a RegressionModel object.

In [10]:
class RegressionModel(BaseModel):
    def __init__(self):
        pass

model_object = RegressionModel()

TypeError: Can't instantiate abstract class RegressionModel with abstract methods basic_regression, split_data

When we tried to instantiate our empty class, we got an error that RegressionModel is missing the methods basic_regression and split_data. These are the two methods that we decorated in our BaseModel with `@abstractmethod`, which means every child class is required to define them.
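The same check applies when only some of the required methods are present. A minimal sketch, using hypothetical stand-in classes (`RequiredMethods`, `IncompleteModel`) rather than our real BaseModel so that it runs without a data file:

```python
from abc import ABC, abstractmethod


class RequiredMethods(ABC):
    """Hypothetical stand-in for BaseModel with the same two abstract methods."""

    @abstractmethod
    def split_data(self, stratify):
        pass

    @abstractmethod
    def basic_regression(self):
        pass


class IncompleteModel(RequiredMethods):
    """Implements only one of the two required methods."""

    def basic_regression(self):
        pass


try:
    IncompleteModel()
    instantiated = True
except TypeError as error:
    # The exact wording of the message differs between Python versions.
    instantiated = False
    print(f"TypeError: {error}")
```

Python raises the error when the object is created, not when the class is defined, so an incomplete child class only fails once we try to use it.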

In [12]:
class RegressionModel(BaseModel):
    def __init__(self, filename):
        super().__init__(filename)

    def basic_regression(self):
        """basic linear regression"""
        pass

    def split_data(self):
        """train_test_split data"""
        pass

Above is our updated RegressionModel class. We've also added another new concept here: `super().__init__()`! This says that we want to run the \_\_init\_\_ of the parent class, in this case our BaseModel, which also means we need to pass the arguments that the BaseModel \_\_init\_\_ expects. Since the BaseModel \_\_init\_\_ is in charge of opening our file via a filename argument, we also pass that argument here.

And, as mentioned earlier, we have the two required methods now present in our RegressionModel. There is no requirement that they do anything yet, and in fact they only pass. But they still must be defined.
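To see what `super().__init__()` does in isolation, here is a small sketch. `ParentSketch` and `ChildSketch` are hypothetical classes, and the filename is a placeholder that is stored but never opened.

```python
from abc import ABC, abstractmethod


class ParentSketch(ABC):
    """Hypothetical parent whose __init__ stores the filename it receives."""

    def __init__(self, filename: str):
        self.filename = filename

    def describe(self) -> str:
        return f"model built from {self.filename}"

    @abstractmethod
    def split_data(self):
        pass


class ChildSketch(ParentSketch):
    def __init__(self, filename: str):
        # Run the parent's __init__ so that self.filename gets set.
        super().__init__(filename)

    def split_data(self):
        pass


child = ChildSketch("example.csv")  # placeholder filename; no file is opened
print(child.describe())
```

If the child's `__init__` skipped the `super().__init__(filename)` call, `self.filename` would never be set and `describe()` would fail with an AttributeError.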

Now we will make an actual object. See if you notice what is different in our object declaration this time:

In [13]:
model_object = RegressionModel('kc_house_data.csv')

We're now using the child class to make our object. Because it is a child of BaseModel, it still has ALL of the methods that we've written in BaseModel:

In [14]:
model_object.set_target('price')
model_object.print_statistics()

           id             date     price  ...     long  sqft_living15  sqft_lot15
0  7129300520  20141013T000000  221900.0  ... -122.257           1340        5650
1  6414100192  20141209T000000  538000.0  ... -122.319           1690        7639
2  5631500400  20150225T000000  180000.0  ... -122.233           2720        8062
3  2487200875  20141209T000000  604000.0  ... -122.393           1360        5000
4  1954400510  20150218T000000  510000.0  ... -122.045           1800        7503

[5 rows x 21 columns]
DF shape: (21613, 21)

Data types: id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64


### Introducing Type Hints

As we add methods to our code and have lots of functions with different arguments, it becomes less obvious what each function takes and what it returns. Type hints are a best practice that makes our code understandable at a glance by declaring what type each variable should be. We can and should type hint both our inputs and our outputs wherever possible.
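One thing to keep in mind is that Python itself does not enforce type hints when the code runs; they are documentation that editors and static type checkers can read. A minimal sketch, using a hypothetical helper `count_rows` that is not part of our framework:

```python
from typing import get_type_hints


def count_rows(filename: str, header_rows: int = 1) -> int:
    """Hypothetical helper: pretend a file has 100 lines and skip the header."""
    return 100 - header_rows


# The hints are stored on the function and can be inspected.
hints = get_type_hints(count_rows)
print(hints)

# Python does not check hints at runtime: passing an int as filename still runs.
result = count_rows(12345)
print(result)
```

The call with the wrong type succeeds silently, which is why hints are most useful together with an editor or checker that reads them.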

We'll start by adding type hints to our BaseModel class. See if you can find the type hints I've added below. We also add a new import statement.

In [16]:
from typing import Optional

class BaseModel(ABC):
    def __init__(self, filename: str):
        self.df = self._load_file(filename)
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str):
        """Sets model target field
        Args:
            target (str): target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self):
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    @abstractmethod
    def split_data(self, stratify: Optional[bool]):
        pass

    @abstractmethod
    def basic_regression(self):
        pass

In our \_\_init\_\_, we've specified that our filename should be a string.

In our _load_file, we expect the filename again as a string, and we return a pd.DataFrame.

In set_target, we expect the target as a string.

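In our split_data method, we annotate stratify as `Optional[bool]`, which means the value may be either a bool or None. Note that `Optional[]` only describes the allowed type; it does not by itself make the argument omittable. An argument can be left out of a call only when the signature gives it a default value.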
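One detail worth checking for yourself: `Optional[bool]` describes the allowed type (a bool or None), while a default value is what lets a caller leave the argument out. The sketch below uses two hypothetical functions and the standard library's `inspect` module to show which parameter is actually required.

```python
import inspect
from typing import Optional


def split_without_default(stratify: Optional[bool]):
    """stratify may be a bool or None, but it must still be passed."""
    return stratify


def split_with_default(stratify: Optional[bool] = False):
    """The default value is what lets callers omit stratify."""
    return stratify


def is_required(func, name: str) -> bool:
    """Return True when the named parameter has no default value."""
    parameter = inspect.signature(func).parameters[name]
    return parameter.default is inspect.Parameter.empty


print(is_required(split_without_default, "stratify"))
print(is_required(split_with_default, "stratify"))
print(split_with_default())
```

Calling `split_without_default()` with no argument raises a TypeError, even though its hint is `Optional[bool]`.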

Now we type hint our RegressionModel:

In [None]:
class RegressionModel(BaseModel):
    def __init__(self, filename: str):
        super().__init__(filename)

    def basic_regression(self):
        """basic linear regression"""
        pass

    def split_data(self, stratify: Optional[bool] = False):
        """train_test_split data"""
        pass

Note that here we've entered a default value for our stratify variable. This means that this function can be used without being passed an argument for stratify, in which case it will default to False.

Type hinting our EDACleaning:

In [None]:
class EDACleaning:
    def __init__(self):
        pass

    def set_target(self, target: str):
        """set model target variable"""
        self.target = target

    def print_statistics(self, df: pd.DataFrame):
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

In this module we added the last of our class and clean code concepts. In Module 5 we will start expanding on our EDA module, and think about other modules that we want to incorporate.