## Module 5

This is Module 5 in building a reusable, class-based machine learning framework.

In [Module 1](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module1/module_1.ipynb), we focused on preparing to code outside of Jupyter Notebook. We set up virtual environments with [Pipenv](https://pipenv.pypa.io/en/latest/), installed and configured [VS Code](https://code.visualstudio.com/), and added automatic [Black](https://github.com/psf/black) formatting to our code.

In [Module 2](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module2/module_2.ipynb), we reviewed how to build Python classes. We built our base model class with an initial exploratory method. We added docstrings to our class methods so that we get documentation when calling `help()`.

In [Module 3](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module3/module_3.ipynb), we built an EDA Cleaning class and integrated it into our Base Model.

In [Module 4](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module4/module_4.ipynb), we introduced the concept of an abstract class and started our Regression abstract class, as well as introduced type hints into our code.

In Module 5, we will focus on extending our Cleaning class and our Regression class by adding the following:
- Default parameter values
- An undo function that reverses our last alteration method (maybe you dropped outliers by IQR, then changed your mind)
- Additional EDA methods
- Additional Cleaning methods

Let's look back at our BaseModel and EDACleaning objects, as we left them in Module 4.

In [6]:
from abc import ABC, abstractmethod
from typing import Optional, Literal
import pandas as pd
import numpy as np

class BaseModel(ABC):
    def __init__(self, filename: str):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    @abstractmethod
    def split_data(self, stratify: Optional[bool]):
        pass

    @abstractmethod
    def basic_regression(self):
        pass

In [7]:
class EDACleaning:
    def __init__(self):
        pass

    def set_target(self, target: str) -> None:
        """set model target variable"""
        self.target = target

    def print_statistics(self, df: pd.DataFrame) -> None:
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

Our first step is to add some new EDA methods to our EDACleaning class. The EDA methods are easy to identify because they return nothing and only print. The modification/cleaning methods in our EDACleaning class, by contrast, return an altered data frame.

Our new methods are:
- print_sorted
- check_value_counts
- find_outliers

In [8]:
class EDACleaning:
    def __init__(self):
        pass

    def set_target(self, target: str):
        """set model target variable"""
        self.target = target

    def print_statistics(self, df: pd.DataFrame) -> None:
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

    def print_sorted(
        self,
        df: pd.DataFrame,
        field: str,
        groupby: Optional[str],
        asc: bool,
    ) -> None:
        if groupby:
            print(df.groupby(groupby)[field].mean().sort_values(ascending=asc))
        else:
            print(df.sort_values(field, ascending=asc).head())

    def check_value_counts(self, df: pd.DataFrame, field: str) -> None:
        print(df[field].value_counts(normalize=True).head())

    def find_outliers(self, df: pd.DataFrame, field: str) -> None:
        # df is passed in by the caller; EDACleaning holds no dataframe of its own
        outlier_stats = df.groupby(field)[self.target].describe()
        print(outlier_stats.sort_values("mean", ascending=False).head(20))
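
To make the three new EDA methods concrete, here is a small sketch of the pandas calls behind them, run on a made-up housing table. The column names (`bedrooms`, `price`) and the values are assumptions for illustration only, not the dataset used in this series.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up for illustration
toy_df = pd.DataFrame(
    {
        "bedrooms": [2, 2, 3, 3, 3, 4],
        "price": [200, 220, 300, 310, 900, 450],
    }
)

# What print_sorted does with a groupby: mean of the field per group, sorted
mean_price = toy_df.groupby("bedrooms")["price"].mean().sort_values(ascending=False)

# What check_value_counts does: share of rows taken by each value
bedroom_share = toy_df["bedrooms"].value_counts(normalize=True)

# What find_outliers does: describe the target within each group, sorted by mean
price_summary = toy_df.groupby("bedrooms")["price"].describe()
price_summary = price_summary.sort_values("mean", ascending=False)

print(mean_price)
print(bedroom_share)
print(price_summary)
```

In this toy table the 3-bedroom group has the highest mean only because of a single 900 entry, which is exactly the kind of thing the per-group `describe()` output helps us spot.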

Now we will add methods into our BaseModel class that call on these.

We are using our BaseModel class as the dispatcher for everything we do, so that we, as eventual users of the system, need to know as little as possible to use our work. Without dispatch methods in BaseModel, we would need to call `help()` separately on each module and figure out where a particular method lives. Then, to call the method, we would have to reach through the model to the other object, making for long and messy calls such as `model.cleaner.print_sorted(df=my_df)`. Instead, we can call `help()` on our model object alone and get the minimal information required to use the system. For the messy example call above, we only need `model.print_sorted()` to accomplish exactly the same thing.

Generally, these "pass-through" methods that simply call a method in another class are discouraged. In standard software design you'd want to avoid them. We appreciate them in our case because our end use of this system will be manually within Jupyter Notebook, and we're making our system as simple for ourselves to use as possible.
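
Here is a stripped-down sketch of the pass-through idea, using plain Python so it runs without any data. The `Cleaner` and `Model` classes and the `summarize` method are hypothetical stand-ins for EDACleaning and BaseModel, not part of our framework.

```python
class Cleaner:
    """Stand-in for EDACleaning: does the real work."""

    def summarize(self, values: list) -> str:
        return f"{len(values)} values, max {max(values)}"


class Model:
    """Stand-in for BaseModel: owns the data and forwards calls."""

    def __init__(self, values: list):
        self.values = values
        self.cleaner = Cleaner()

    def summarize(self) -> str:
        """Summarize the model's data (pass-through to Cleaner.summarize)."""
        return self.cleaner.summarize(self.values)


model = Model([3, 1, 4])

# Without the pass-through, the caller must know about cleaner and supply the data
long_call = model.cleaner.summarize(model.values)

# With the pass-through, the caller only needs the model
short_call = model.summarize()

print(long_call)
print(short_call)
```

Both calls produce the same result; the second just hides where the work happens.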

In [11]:
class BaseModel(ABC):
    def __init__(self, filename: str, seed: Optional[int] = None):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    def print_sorted(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: Optional[bool] = False,
    ) -> None:
        """Prints sorted based on provided field. Will use target if no field provided.
        Args:
            field (_type_, optional): Will sort by this field. Defaults to target.
            asc (bool, optional): Sort ascending. Defaults to False.
            groupby (_type_, optional): If entered, will group by this field. Defaults to None.
        """
        if not field:
            field = self.target
        self.cleaner.print_sorted(df=self.df, field=field, asc=asc, groupby=groupby)

    def check_value_counts(self, field: Optional[str] = None) -> None:
        """Will print value counts for field
        Args:
            field (_type_, optional): Will report on this field. Defaults to target.
        """
        if not field:
            field = self.target
        self.cleaner.check_value_counts(self.df, field)

    def find_outliers(self, field: str) -> None:
        self.cleaner.find_outliers(self.df, field)

You'll see that in our new methods, print_sorted in particular, we are making use of default arguments. These are the values a function uses automatically when the caller does not supply an argument. Default behavior is a powerful tool to reduce the complexity of our user interface.

The print_sorted method can take several arguments: a field to sort on, a groupby field to group with first, and a switch to ascending=True. However, the method doesn't require any of these, and can be called with a plain `model.print_sorted()` to sort on the target field automatically. We can optionally add other arguments when we call it if we want different information, such as `model.print_sorted(groupby='bedrooms', asc=True)`.
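
The fallback-to-target behavior can be sketched in isolation. `TinyModel` and `describe_sort` below are hypothetical names used only for this illustration; the method returns a description string instead of printing a dataframe so the effect of each default is easy to see.

```python
from typing import Optional


class TinyModel:
    """Hypothetical stand-in for BaseModel, reduced to the default-argument logic."""

    def __init__(self, target: str):
        self.target = target

    def describe_sort(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: bool = False,
    ) -> str:
        # Fall back to the target when no field is given, as print_sorted does
        if not field:
            field = self.target
        direction = "ascending" if asc else "descending"
        if groupby:
            return f"mean of {field} per {groupby}, {direction}"
        return f"rows sorted by {field}, {direction}"


model = TinyModel(target="price")

print(model.describe_sort())
print(model.describe_sort(groupby="bedrooms", asc=True))
```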

Remember we can always use the handy help() call to find out what our options are!

In [12]:
help(BaseModel)

Help on class BaseModel in module __main__:

class BaseModel(abc.ABC)
 |  BaseModel(filename: str, seed: Optional[int] = None)
 |  
 |  Method resolution order:
 |      BaseModel
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, filename: str, seed: Optional[int] = None)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  check_value_counts(self, field=None) -> None
 |      Will print value counts for field
 |      Args:
 |          field (_type_, optional): Will report on this field. Defaults to target.
 |  
 |  find_outliers(self, field: str) -> None
 |  
 |  print_sorted(self, field: Optional[str] = None, groupby: Optional[str] = None, asc: Optional[bool] = False) -> None
 |      Prints sorted based on provided field. Will use target if no field provided.
 |      Args:
 |          field (_type_, optional): Will sort by this field. Defaults to target.
 |          asc (bool, optional): Sort ascending. Defaults to

Now we're going to add some methods to our EDACleaning object that actually alter our original dataframe! The new methods are added at the bottom:
- drop_dupes
- remove_outliers
- _calculate_iqr

In [None]:
class EDACleaning:
    def __init__(self):
        pass

    def set_target(self, target: str):
        """set model target variable"""
        self.target = target

    def print_statistics(self, df: pd.DataFrame) -> None:
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

    def print_sorted(
        self,
        df: pd.DataFrame,
        field: str,
        groupby: Optional[str],
        asc: bool,
    ) -> None:
        if groupby:
            print(df.groupby(groupby)[field].mean().sort_values(ascending=asc))
        else:
            print(df.sort_values(field, ascending=asc).head())

    def check_value_counts(self, df: pd.DataFrame, field: str) -> None:
        print(df[field].value_counts(normalize=True).head())

    def find_outliers(self, df: pd.DataFrame, field: str) -> None:
        # df is passed in by the caller; EDACleaning holds no dataframe of its own
        outlier_stats = df.groupby(field)[self.target].describe()
        print(outlier_stats.sort_values("mean", ascending=False).head(20))

    def drop_dupes(self, df: pd.DataFrame, subset: Optional[list] = None) -> pd.DataFrame:
        if subset:
            df.drop_duplicates(subset, keep="last", inplace=True)
        else:
            df.drop_duplicates(inplace=True)
        return df

    def remove_outliers(
        self, df: pd.DataFrame, fields: list, method: str, range: float
    ) -> pd.DataFrame:
        if method == "iqr":
            for field in fields:
                lower_range, upper_range = self._calculate_iqr(df[field], range)
                df = df.drop(
                    df[(df[field] > upper_range) | (df[field] < lower_range)].index
                )
        return df

    def _calculate_iqr(self, column: pd.Series, range: float) -> tuple[float, float]:
        """return the lower range and upper range for the data based on IQR
        Arguments:
        column - column to be evaluated
        range - multiplier applied to the IQR to set the lower and upper bounds
        """
        Q1, Q3 = np.percentile(column, [25, 75])
        iqr = Q3 - Q1
        lower_range = Q1 - (range * iqr)
        upper_range = Q3 + (range * iqr)
        return lower_range, upper_range
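
The two branches of drop_dupes behave a little differently, so here is a small sketch of what `subset` and `keep="last"` do. The `listings` table is hypothetical data made up for this example.

```python
import pandas as pd

# Hypothetical listings; the second and third rows share an id
listings = pd.DataFrame(
    {
        "id": [1, 2, 2, 3],
        "price": [100, 200, 250, 300],
    }
)

# No subset: only rows identical in every column count as duplicates,
# so nothing is dropped here
full_row = listings.drop_duplicates()

# With a subset: rows are compared on "id" only, and keep="last" retains
# the final occurrence of each id
by_id = listings.drop_duplicates(["id"], keep="last")

print(len(full_row))
print(by_id["price"].tolist())
```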


You might notice that in our type hints, these functions note that they return a pd.DataFrame instead of None. These are our first cleaning functions that actually make a modification to our original dataframe.
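
To see the IQR arithmetic with real numbers, here is a minimal worked sketch using the same `np.percentile` call as `_calculate_iqr`. The `prices` array is made-up data, and the 1.5 multiplier is the conventional choice for IQR fences.

```python
import numpy as np

# Hypothetical prices with one obvious outlier
prices = np.array([100, 110, 120, 130, 140, 150, 160, 170, 1000])

# Same arithmetic as _calculate_iqr, with a multiplier of 1.5
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower_range = q1 - 1.5 * iqr
upper_range = q3 + 1.5 * iqr

# Keep only the values inside the fences
kept = prices[(prices >= lower_range) & (prices <= upper_range)]

print(q1, q3, iqr)
print(lower_range, upper_range)
print(kept)
```

Here Q1 is 120 and Q3 is 160, so the IQR is 40 and the fences sit at 60 and 220. Only the 1000 entry falls outside them and gets dropped.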

Now we add the methods that call these to our BaseModel.

In [13]:
class BaseModel(ABC):
    def __init__(self, filename: str, seed: Optional[int] = None):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    def print_sorted(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: Optional[bool] = False,
    ) -> None:
        """Prints sorted based on provided field. Will use target if no field provided.
        Args:
            field (_type_, optional): Will sort by this field. Defaults to target.
            asc (bool, optional): Sort ascending. Defaults to False.
            groupby (_type_, optional): If entered, will group by this field. Defaults to None.
        """
        if not field:
            field = self.target
        self.cleaner.print_sorted(df=self.df, field=field, asc=asc, groupby=groupby)

    def check_value_counts(self, field: Optional[str] = None) -> None:
        """Will print value counts for field
        Args:
            field (_type_, optional): Will report on this field. Defaults to target.
        """
        if not field:
            field = self.target
        self.cleaner.check_value_counts(self.df, field)

    def find_outliers(self, field: str) -> None:
        self.cleaner.find_outliers(self.df, field)

    def drop_dupes(self, subset: Optional[list] = None):
        """drops duplicate dataframe rows
        Args:
            subset (list, optional): Subset on which to drop dupes. Defaults to None.
        """
        self.df = self.cleaner.drop_dupes(self.df, subset)

    def remove_outliers(
        self,
        fields: Optional[list] = None,
        method: Optional[str] = "iqr",
        range: Optional[float] = 1.5,
    ) -> None:
        """removes outliers, defaulting to IQR with a default IQR range of 1.5
        Args:
            fields (list): list of fields. Must be list even if one item.
            method (str, optional): outlier removal method. Defaults to "iqr".
            range (float, optional): IQR range. Defaults to 1.5.
        """
        # None instead of [] avoids sharing one mutable default list across calls
        fields = fields or []
        self.df = self.cleaner.remove_outliers(self.df, fields, method, range)

    def reset_df_index(self) -> None:
        """resets dataframe index"""
        print(self.df.head())
        self.df.reset_index(inplace=True, drop=True)
        print(self.df.head())

    @abstractmethod
    def split_data(self, stratify: Optional[bool]):
        pass

    @abstractmethod
    def basic_regression(self):
        pass


You may notice a method here that wasn't in the EDACleaning object. reset_df_index is only located here. This method is SO simple that I opted to keep its full implementation here instead of using a pass-through method - but for the sake of consistency, you could opt to pass it through.
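
If you're wondering why we'd want to reset the index at all, here is a small sketch of what happens to index labels after rows are dropped. The data is hypothetical, and the drop mirrors the pattern used in remove_outliers.

```python
import pandas as pd

# Hypothetical data with one extreme value in the middle
df = pd.DataFrame({"price": [100, 5000, 120, 130]})

# Dropping a row leaves a gap in the index labels
trimmed = df.drop(df[df["price"] > 1000].index)
print(trimmed.index.tolist())

# drop=True discards the old labels instead of keeping them as a new column
renumbered = trimmed.reset_index(drop=True)
print(renumbered.index.tolist())
```

After the drop, the labels run 0, 2, 3. Resetting gives us a clean 0, 1, 2 again.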

We could be done here, but now that we've added methods that change our initial data, we've run into a potential problem point. What if we make a change and then don't like our change? As it stands, we would have to go all the way back to the beginning of our work and remake our model object from scratch with our filename, in order to reset the data.

Instead of doing that, we're going to implement an undo() method that undoes our last change call. We're going to do the following steps:
- Implement a _set_save() method that saves our dataframe state
- Implement an undo() method that restores the last saved state
- Insert a `_set_save()` call into any method that makes data frame changes, controlled by a `save` argument that defaults to True.

Here's what this all looks like, implemented in our BaseModel class:

In [14]:
class BaseModel(ABC):
    def __init__(self, filename: str, seed: Optional[int] = None):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def _set_save(self, saved: str) -> None:
        """Sets a save point on the dataframe before performing an alteration task

        Args:
            saved (str): description of saved task to report if undone
        """
        self.saved_df = self.df.copy()
        self.saved_action = saved

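    def undo(self) -> None:
        """Undoes the last data frame alteration task, and reports on Undo"""
        # saved_df only exists once a save point has been set
        if getattr(self, "saved_df", None) is None:
            print("Nothing to undo")
            return
        self.df = self.saved_df
        print(f"Undid last change: {self.saved_action}")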

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    def print_sorted(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: Optional[bool] = False,
    ) -> None:
        """Prints sorted based on provided field. Will use target if no field provided.
        Args:
            field (_type_, optional): Will sort by this field. Defaults to target.
            asc (bool, optional): Sort ascending. Defaults to False.
            groupby (_type_, optional): If entered, will group by this field. Defaults to None.
        """
        if not field:
            field = self.target
        self.cleaner.print_sorted(df=self.df, field=field, asc=asc, groupby=groupby)

    def check_value_counts(self, field: Optional[str] = None) -> None:
        """Will print value counts for field
        Args:
            field (_type_, optional): Will report on this field. Defaults to target.
        """
        if not field:
            field = self.target
        self.cleaner.check_value_counts(self.df, field)

    def find_outliers(self, field: str) -> None:
        self.cleaner.find_outliers(self.df, field)

    def drop_dupes(self, subset: Optional[list] = None, save: Optional[bool] = True):
        """Save point, then drops duplicate dataframe rows
        Args:
            subset (list, optional): Subset on which to drop dupes. Defaults to None.
            save (boolean, optional): Toggles to save. Defaults to True.
        """
        if save:
            self._set_save("drop_dupes")
        self.df = self.cleaner.drop_dupes(self.df, subset)

    def remove_outliers(
        self,
        fields: Optional[list] = None,
        method: Optional[str] = "iqr",
        range: Optional[float] = 1.5,
        save: Optional[bool] = True,
    ) -> None:
        """Save point, then removes outliers, defaulting to IQR with a default IQR range of 1.5
        Args:
            fields (list): list of fields. Must be list even if one item.
            method (str, optional): outlier removal method. Defaults to "iqr".
            range (float, optional): IQR range. Defaults to 1.5.
            save (boolean, optional): Toggles to save. Defaults to True.
        """
        if save:
            self._set_save("remove_outliers")
        # None instead of [] avoids sharing one mutable default list across calls
        fields = fields or []
        self.df = self.cleaner.remove_outliers(self.df, fields, method, range)

    def reset_df_index(self, save: Optional[bool] = True) -> None:
        """Save point, then resets dataframe index"""
        print(self.df.head())
        if save:
            self._set_save("reset_df_index")
        self.df.reset_index(inplace=True, drop=True)
        print(self.df.head())

    @abstractmethod
    def split_data(self, stratify: Optional[bool]):
        pass

    @abstractmethod
    def basic_regression(self):
        pass
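
To watch the save/undo cycle work end to end, here is a minimal sketch that keeps only the undo-related pieces. `UndoDemo` is a hypothetical stand-in for BaseModel, and the three-row dataframe is made up for the example.

```python
import pandas as pd


class UndoDemo:
    """Hypothetical stand-in for BaseModel with only the save/undo logic."""

    def __init__(self, df: pd.DataFrame):
        self.df = df
        self.saved_df = None
        self.saved_action = None

    def _set_save(self, saved: str) -> None:
        # copy() matters: without it saved_df would be the same object as df,
        # and in-place changes would alter the save point too
        self.saved_df = self.df.copy()
        self.saved_action = saved

    def undo(self) -> None:
        if self.saved_df is None:
            print("Nothing to undo")
            return
        self.df = self.saved_df
        print(f"Undid last change: {self.saved_action}")

    def drop_dupes(self, save: bool = True) -> None:
        if save:
            self._set_save("drop_dupes")
        self.df.drop_duplicates(inplace=True)


demo = UndoDemo(pd.DataFrame({"price": [100, 100, 200]}))

demo.drop_dupes()
rows_after_drop = len(demo.df)

demo.undo()
rows_after_undo = len(demo.df)

print(rows_after_drop, rows_after_undo)
```

Keep in mind that this design stores a single save point, so only the most recent change can be undone. Supporting several levels of undo would need something like a list of saved frames, at the cost of more memory.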
