## Module 5

This is Module 5 in building a reusable, class-based machine learning framework.

In [Module 1](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module1/module_1.ipynb), we focused on preparing to code outside of Jupyter Notebook. We set up virtual environments with [Pipenv](https://pipenv.pypa.io/en/latest/), installed and configured [VS Code](https://code.visualstudio.com/), and added automatic [Black](https://github.com/psf/black) formatting to our code.

In [Module 2](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module2/module_2.ipynb), we reviewed how to build Python classes. We built our base model class with an initial exploratory method. We added docstrings to our class methods so that we get documentation when calling help().

In [Module 3](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module3/module_3.ipynb), we built an EDA Cleaning class and integrated it into our Base Model.

In [Module 4](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module4/module_4.ipynb), we introduced the concept of an abstract class, started our Regression abstract class, and introduced type hints into our code.

In Module 5, we will focus largely on extending our Cleaning class. We will add or use the following elements:
- Default parameter values
- An undo function that reverses our last alteration method (maybe you dropped outliers by IQR, then changed your mind)
- Additional EDA methods
- Additional Cleaning methods

Let's look back at our BaseModel and EDACleaning objects, as we left them in Module 4.

In [11]:
from abc import ABC, abstractmethod
from typing import Optional, Literal
import pandas as pd
import numpy as np

class BaseModel(ABC):  # inherit from ABC so that @abstractmethod is enforced
    def __init__(self, filename: str):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    @abstractmethod
    def split_data(self, stratify: Optional[bool]):
        pass

    @abstractmethod
    def basic_regression(self):
        pass

In [12]:
class EDACleaning:
    def __init__(self):
        pass

    def set_target(self, target: str) -> None:
        """set model target variable"""
        self.target = target

    def print_statistics(self, df: pd.DataFrame) -> None:
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

Our first step is to add some new EDA methods to our EDACleaning class. The EDA methods are easy to identify because they print a report and do not return anything. The cleaning methods that we put into the EDACleaning class return an altered data frame.

Our new methods are:
- print_sorted
- check_value_proportions
- find_outliers

In [42]:
class EDACleaning:
    def __init__(self):
        pass

    def set_target(self, target: str):
        """set model target variable"""
        self.target = target

    def print_statistics(self, df: pd.DataFrame) -> None:
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

    def print_sorted(
        self,
        df: pd.DataFrame,
        field: str,
        groupby: Optional[str],
        asc: bool,
    ) -> None:
        if groupby:
            print(df.groupby(groupby)[field].mean().sort_values(ascending=asc))
        else:
            print(df.sort_values(field, ascending=asc).head())

    def check_value_proportions(self, df: pd.DataFrame, field: str) -> None:
        print(df[field].value_counts(normalize=True).head())

    def find_outliers(self, df: pd.DataFrame, field: str) -> None:
        find_outliers = df.groupby(field)[self.target].describe()
        outlier_report = find_outliers.sort_values("mean", ascending=False).head(20)
        print(outlier_report)

Now we will add methods into our BaseModel class that call on these.

We are using our BaseModel class as the dispatcher for everything we do, so that we, as eventual users of the system, need to know as little as possible to use our work. Without dispatch methods in BaseModel, we would need to call `help()` separately on each module and figure out where a particular method lives. To call the method, we would then have to reach through the model to the other object, making for long and messy calls such as `model.cleaner.print_sorted(df=model.df, field='price', groupby=None, asc=False)`. Instead, we can call `help()` on our model object and get the minimum information required to use the system. That same call becomes just `model.print_sorted()`.

Generally, "pass-through" methods that simply call a method in another class are discouraged. In standard software design you'd want to avoid them. We appreciate them in our use case, because our end use of this system will be done manually from within Jupyter Notebook. We're making our system as simple for ourselves to use as possible.
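To make the trade-off concrete, here is a minimal sketch of the dispatcher idea using only the standard library. The `Cleaner` and `Model` classes and the `top_values` method are hypothetical stand-ins for EDACleaning and BaseModel, and a plain list stands in for the dataframe.

```python
class Cleaner:
    """Hypothetical stand-in for EDACleaning; works on a plain list."""

    def top_values(self, values: list, n: int) -> list:
        # The helper needs the data handed to it on every call
        return sorted(values, reverse=True)[:n]


class Model:
    """Hypothetical stand-in for BaseModel; owns the data and the helper."""

    def __init__(self, values: list):
        self.values = values
        self.cleaner = Cleaner()

    def top_values(self, n: int = 3) -> list:
        """Pass-through: supplies the stored data so the caller does not have to."""
        return self.cleaner.top_values(self.values, n)


model = Model([5, 1, 9, 3, 7])

# Reaching through the model works, but the caller must know where the method lives
long_form = model.cleaner.top_values(model.values, 3)

# The pass-through gives the same result with nothing to remember
short_form = model.top_values()

print(long_form == short_form)
```

Both calls return the same list; the pass-through only removes the need to know about `cleaner` and to hand over the data yourself.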

In [43]:
class BaseModel(ABC):
    def __init__(self, filename: str):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    def print_sorted(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: Optional[bool] = False,
    ) -> None:
        """Prints sorted based on provided field. Will use target if no field provided.
        Args:
            field (_type_, optional): Will sort by this field. Defaults to target.
            asc (bool, optional): Sort ascending. Defaults to False.
            groupby (_type_, optional): If entered, will group by this field. Defaults to None.
        """
        if not field:
            field = self.target
        self.cleaner.print_sorted(df=self.df, field=field, asc=asc, groupby=groupby)

    def check_value_proportions(self, field: Optional[str] = None) -> None:
        """Will print value counts for field
        Args:
            field (_type_, optional): Will report on this field. Defaults to target.
        """
        if not field:
            field = self.target
        self.cleaner.check_value_proportions(self.df, field)

    def find_outliers(self, field: str) -> None:
        self.cleaner.find_outliers(self.df, field)

You'll see that our new methods, print_sorted in particular, make use of default arguments. These are the values a function uses automatically when the caller does not supply an argument. Default behavior is a powerful tool for reducing the complexity of our user interface.

The print_sorted method can take several arguments: a field to sort on, a groupby field to group by first, and an `asc` switch for ascending order. However, the method doesn't require any of these; a plain `model.print_sorted()` will sort on the target field in descending order. We can optionally pass other arguments when we want different information, such as `model.print_sorted(groupby='bedrooms', asc=True)`.
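The fallback logic can be seen in isolation in the following sketch. `SortReport` and `describe_sort` are hypothetical names, and the method only returns a description of the sort it would run, so the example needs no data file.

```python
from typing import Optional


class SortReport:
    """Hypothetical stand-in for BaseModel that only describes the sort it would run."""

    def __init__(self, target: str):
        self.target = target

    def describe_sort(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: bool = False,
    ) -> str:
        # Fall back to the target when the caller gives no field
        if not field:
            field = self.target
        direction = "ascending" if asc else "descending"
        if groupby:
            return f"mean {field} per {groupby}, {direction}"
        return f"rows sorted by {field}, {direction}"


report = SortReport(target="price")

# No arguments at all: every default applies
no_args = report.describe_sort()

# Override only the arguments we care about, by keyword
grouped = report.describe_sort(groupby="bedrooms", asc=True)

print(no_args)
print(grouped)
```

Using `None` as the default for `field` lets the method decide the real default at call time, which matters here because the target is not known when the method is defined.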

Remember we can always use the handy help() call to find out what our options are!

In [44]:
help(BaseModel)

Help on class BaseModel in module __main__:

class BaseModel(abc.ABC)
 |  BaseModel(filename: str)
 |  
 |  Method resolution order:
 |      BaseModel
 |      abc.ABC
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __init__(self, filename: str)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  check_value_proportions(self, field: Optional[str] = None) -> None
 |      Will print value counts for field
 |      Args:
 |          field (_type_, optional): Will report on this field. Defaults to target.
 |  
 |  find_outliers(self, field: str) -> None
 |  
 |  print_sorted(self, field: Optional[str] = None, groupby: Optional[str] = None, asc: Optional[bool] = False) -> None
 |      Prints sorted based on provided field. Will use target if no field provided.
 |      Args:
 |          field (_type_, optional): Will sort by this field. Defaults to target.
 |          asc (bool, optional): Sort ascending. Defaults to False.
 |          groupby (_type

Let's try using some of our EDA methods to look at our data.

In [45]:
model = BaseModel('./kc_house_data.csv')

model.print_statistics()

           id             date     price  bedrooms  bathrooms  sqft_living  \
0  7129300520  20141013T000000  221900.0         3       1.00         1180   
1  6414100192  20141209T000000  538000.0         3       2.25         2570   
2  5631500400  20150225T000000  180000.0         2       1.00          770   
3  2487200875  20141209T000000  604000.0         4       3.00         1960   
4  1954400510  20150218T000000  510000.0         3       2.00         1680   

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0   
1      7242     2.0           0     0  ...      7        2170            400   
2     10000     1.0           0     0  ...      6         770              0   
3      5000     1.0           0     0  ...      7        1050            910   
4      8080     1.0           0     0  ...      8        1680              0   

   yr_built  yr_renovated  zipcode      lat     lo

In [46]:
model.set_target('price')

model.print_sorted()

              id             date      price  bedrooms  bathrooms  \
7252  6762700020  20141013T000000  7700000.0         6       8.00   
3914  9808700762  20140611T000000  7062500.0         5       4.50   
9254  9208900037  20140919T000000  6885000.0         6       7.75   
4411  2470100110  20140804T000000  5570000.0         5       5.75   
1448  8907500070  20150413T000000  5350000.0         5       5.00   

      sqft_living  sqft_lot  floors  waterfront  view  ...  grade  sqft_above  \
7252        12050     27600     2.5           0     3  ...     13        8570   
3914        10040     37325     2.0           1     2  ...     11        7680   
9254         9890     31374     2.0           0     4  ...     13        8860   
4411         9200     35069     2.0           0     0  ...     13        6200   
1448         8000     23985     2.0           0     4  ...     12        6720   

      sqft_basement  yr_built  yr_renovated  zipcode      lat     long  \
7252           3480     

In [47]:
model.find_outliers('bedrooms')

           count          mean            std       min        25%       50%  \
bedrooms                                                                       
0           13.0  4.095038e+05  358682.627507  139950.0  235000.00  288000.0   
1          199.0  3.176429e+05  148864.955017   75000.0  222000.00  299000.0   
2         2760.0  4.013727e+05  198051.827269   78000.0  269837.50  374000.0   
3         9824.0  4.662321e+05  262469.771863   82000.0  295487.50  413000.0   
4         6882.0  6.354195e+05  388594.441911  100000.0  376962.50  549997.5   
5         1601.0  7.865998e+05  596204.003693  133000.0  438000.00  620000.0   
6          272.0  8.255206e+05  799238.819958  175000.0  435000.00  650000.0   
7           38.0  9.511847e+05  739953.558961  280000.0  539250.00  728580.0   
8           13.0  1.105077e+06  897495.725295  340000.0  490000.00  700000.0   
9            6.0  8.939998e+05  381533.900984  450000.0  624999.25  817000.0   
10           3.0  8.193333e+05  284677.5

In [35]:
model.check_value_proportions(field='bedrooms')

3    0.454541
4    0.318419
2    0.127701
5    0.074076
6    0.012585
Name: bedrooms, dtype: float64


In [40]:
model.print_sorted(field='price', groupby='bedrooms')

bedrooms
8     1.105077e+06
7     9.511847e+05
9     8.939998e+05
6     8.255206e+05
10    8.193333e+05
5     7.865998e+05
33    6.400000e+05
4     6.354195e+05
11    5.200000e+05
3     4.662321e+05
0     4.095038e+05
2     4.013727e+05
1     3.176429e+05
Name: price, dtype: float64


Now we're going to add some methods to our EDACleaning object that actually alter our original dataframe! The new methods are added at the bottom:
- drop_dupes
- remove_outliers
- _calculate_iqr

In [None]:
class EDACleaning:
    def __init__(self):
        pass

    def set_target(self, target: str):
        """set model target variable"""
        self.target = target

    def print_statistics(self, df: pd.DataFrame) -> None:
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

    def print_sorted(
        self,
        df: pd.DataFrame,
        field: str,
        groupby: Optional[str],
        asc: bool,
    ) -> None:
        if groupby:
            print(df.groupby(groupby)[field].mean().sort_values(ascending=asc))
        else:
            print(df.sort_values(field, ascending=asc).head())

    def check_value_proportions(self, df: pd.DataFrame, field: str) -> None:
        print(df[field].value_counts(normalize=True).head())

    def find_outliers(self, df: pd.DataFrame, field: str) -> None:
        find_outliers = df.groupby(field)[self.target].describe()
        outlier_report = find_outliers.sort_values("mean", ascending=False).head(20)
        print(outlier_report)

    def drop_dupes(
        self, df: pd.DataFrame, subset: Optional[list] = None
    ) -> pd.DataFrame:
        if subset:
            df.drop_duplicates(subset, keep="last", inplace=True)
        else:
            df.drop_duplicates(inplace=True)
        return df

    def remove_outliers(
        self, df: pd.DataFrame, fields: list, method: str, range: float
    ) -> pd.DataFrame:
        if method == "iqr":
            for field in fields:
                lower_range, upper_range = self._calculate_iqr(df[field], range)
                df = df.drop(
                    df[(df[field] > upper_range) | (df[field] < lower_range)].index
                )
        return df

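    # tuple[float, float] is the built-in generic (Python 3.9+); use typing.Tuple on older versions
    def _calculate_iqr(self, column: pd.Series, range: float) -> tuple[float, float]:
        """return the lower range and upper range for the data based on IQR
        Arguments:
        column - column to be evaluated
        range - multiplier applied to the IQR to set the lower and upper bounds
        """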
        Q1, Q3 = np.percentile(column, [25, 75])
        iqr = Q3 - Q1
        lower_range = Q1 - (range * iqr)
        upper_range = Q3 + (range * iqr)
        return lower_range, upper_range


You might notice that the type hints for drop_dupes and remove_outliers show a return type of pd.DataFrame instead of None. These are our first cleaning functions that actually modify our original dataframe.
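To see what the IQR rule does with real numbers, here is a small worked sketch using only the standard library. The `iqr_fences` function and the sample values are hypothetical. It uses `statistics.quantiles` with `method="inclusive"`, which should give the same quartiles as the default linear interpolation of `np.percentile`, but treat that equivalence as an assumption if you swap one for the other.

```python
from statistics import quantiles
from typing import List, Tuple


def iqr_fences(values: List[float], iqr_range: float = 1.5) -> Tuple[float, float]:
    """Return the lower and upper bounds outside of which a value is an outlier."""
    # n=4 splits the data into quartiles; the middle cut point (the median) is unused
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - iqr_range * iqr, q3 + iqr_range * iqr


prices = [1, 2, 3, 4, 5, 6, 7, 8, 100]

# Q1 = 3 and Q3 = 7, so the IQR is 4 and the bounds sit 1.5 * 4 = 6 beyond each quartile
lower, upper = iqr_fences(prices)

# Keep only the values inside the bounds, as remove_outliers does with rows
kept = [p for p in prices if lower <= p <= upper]

print(lower, upper)  # -3.0 13.0
print(kept)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

A larger `iqr_range` widens the bounds and removes fewer values; with `iqr_range=25` the value 100 would survive in this sample.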

Now we add the methods that call these to our BaseModel.

In [None]:
class BaseModel(ABC):
    def __init__(self, filename: str):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    def print_sorted(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: Optional[bool] = False,
    ) -> None:
        """Prints sorted based on provided field. Will use target if no field provided.
        Args:
            field (_type_, optional): Will sort by this field. Defaults to target.
            asc (bool, optional): Sort ascending. Defaults to False.
            groupby (_type_, optional): If entered, will group by this field. Defaults to None.
        """
        if not field:
            field = self.target
        self.cleaner.print_sorted(df=self.df, field=field, asc=asc, groupby=groupby)

    def check_value_proportions(self, field: Optional[str] = None) -> None:
        """Will print value counts for field
        Args:
            field (_type_, optional): Will report on this field. Defaults to target.
        """
        if not field:
            field = self.target
        self.cleaner.check_value_proportions(self.df, field)

    def find_outliers(self, field: str) -> None:
        self.cleaner.find_outliers(self.df, field)

    def drop_dupes(self, subset: Optional[list] = None):
        """drops duplicate dataframe rows
        Args:
            subset (list, optional): Subset on which to drop dupes. Defaults to None.
        """
        self.df = self.cleaner.drop_dupes(self.df, subset)

    def remove_outliers(
        self,
        fields: Optional[list] = None,
        method: Optional[str] = "iqr",
        range: Optional[float] = 1.5,
    ) -> None:
        """removes outliers, defaulting to IQR with a default IQR range of 1.5
        Args:
            fields (list, optional): list of fields. Must be list even if one item. Defaults to None.
            method (str, optional): outlier removal method. Defaults to "iqr".
            range (float, optional): IQR range. Defaults to 1.5.
        """
        # None stands in for "no fields" so that calls never share a mutable default list
        self.df = self.cleaner.remove_outliers(self.df, fields or [], method, range)

    def reset_df_index(self) -> None:
        """resets dataframe index"""
        print(self.df.head())
        self.df.reset_index(inplace=True, drop=True)
        print(self.df.head())

    @abstractmethod
    def split_data(self, stratify: Optional[bool]):
        pass

    @abstractmethod
    def basic_regression(self):
        pass


You may notice a method here that wasn't in the EDACleaning object. reset_df_index is only located here. This method is SO simple that I opted to keep its full implementation here instead of using a pass-through method - but for the sake of consistency, you could opt to pass it through.

We could be done here, but now that we've added methods that change our initial data, we've run into a potential problem point. What if we make a change and then don't like our change? As it stands, we would have to go all the way back to the beginning of our work and remake our model object from scratch with our filename, in order to reset the data.

Instead of doing that, we're going to implement an undo() method that undoes our last change call. We're going to do the following steps:
- Implement a _set_save() method that saves our dataframe state
- Implement an undo() method that restores the last saved state
- Insert a _set_save() call into any method that changes the data frame, controlled by a save parameter that defaults to True.

Here's what this all looks like, implemented in our BaseModel class:

In [None]:
class BaseModel(ABC):
    def __init__(self, filename: str):
        self.df = self._load_file(filename)
        self.target = None
        self.cleaner = EDACleaning()

    def _load_file(self, filename: str) -> pd.DataFrame:
        """Load file from filename and set target field
        Args:
            filename (str): filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target: str) -> None:
        """Sets model target field
        Args:
            target (str): target: target field for model
        """
        self.target = target
        self.cleaner.set_target(target)

    def _set_save(self, saved: str) -> None:
        """Sets a save point on the dataframe before performing an alteration task

        Args:
            saved (str): description of saved task to report if undone
        """
        self.saved_df = self.df.copy()
        self.saved_action = saved

    def undo(self) -> None:
        """Undoes the last data frame alteration task, and reports on Undo"""
        # saved_df is a DataFrame attribute, not a method, so it is copied rather than called
        self.df = self.saved_df.copy()
        print(f"Undid last change: {self.saved_action}")

    def print_statistics(self) -> None:
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

    def print_sorted(
        self,
        field: Optional[str] = None,
        groupby: Optional[str] = None,
        asc: Optional[bool] = False,
    ) -> None:
        """Prints sorted based on provided field. Will use target if no field provided.
        Args:
            field (_type_, optional): Will sort by this field. Defaults to target.
            asc (bool, optional): Sort ascending. Defaults to False.
            groupby (_type_, optional): If entered, will group by this field. Defaults to None.
        """
        if not field:
            field = self.target
        self.cleaner.print_sorted(df=self.df, field=field, asc=asc, groupby=groupby)

    def check_value_proportions(self, field: Optional[str] = None) -> None:
        """Will print value counts for field
        Args:
            field (_type_, optional): Will report on this field. Defaults to target.
        """
        if not field:
            field = self.target
        self.cleaner.check_value_proportions(self.df, field)

    def find_outliers(self, field: str) -> None:
        self.cleaner.find_outliers(self.df, field)

    def drop_dupes(self, subset: Optional[list] = None, save: Optional[bool] = True):
        """Save point, then drops duplicate dataframe rows
        Args:
            subset (list, optional): Subset on which to drop dupes. Defaults to None.
            save (boolean, optional): Toggles to save. Defaults to True.
        """
        if save:
            self._set_save("drop_dupes")
        self.df = self.cleaner.drop_dupes(self.df, subset)

    def remove_outliers(
        self,
        fields: Optional[list] = None,
        method: Optional[str] = "iqr",
        range: Optional[float] = 1.5,
        save: Optional[bool] = True,
    ) -> None:
        """Save point, then removes outliers, defaulting to IQR with a default IQR range of 1.5
        Args:
            fields (list, optional): list of fields. Must be list even if one item. Defaults to None.
            method (str, optional): outlier removal method. Defaults to "iqr".
            range (float, optional): IQR range. Defaults to 1.5.
            save (boolean, optional): Toggles to save. Defaults to True.
        """
        if save:
            self._set_save("remove_outliers")
        # None stands in for "no fields" so that calls never share a mutable default list
        self.df = self.cleaner.remove_outliers(self.df, fields or [], method, range)

    def reset_df_index(self, save: Optional[bool] = True) -> None:
        """Save point, then resets dataframe index"""
        print(self.df.head())
        if save:
            self._set_save("reset_df_index")
        self.df.reset_index(inplace=True, drop=True)
        print(self.df.head())

    @abstractmethod
    def split_data(self, stratify: Optional[bool]):
        pass

    @abstractmethod
    def basic_regression(self):
        pass


_set_save() is an internal-only method that copies our current data frame and records the name of the action about to be taken. undo() returns the data frame to that save point and reports what was rolled back. Only the most recent save point is kept, so undo() reverses a single step.

_set_save() is called in every method that alters the data frame, as long as save=True, which is the default behavior. Each method passes its own name to _set_save() to be recorded as the rollback message.

It is now very easy for us to roll back our last action if we realize we made a mistake, without resetting our entire object.
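Here is a minimal sketch of the save and undo cycle, using a list of dictionaries as a stand-in for the dataframe so that it runs without a data file. The `UndoableData` class and its `drop_above` method are hypothetical. The "Nothing to undo" guard and the clearing of the save point after an undo are assumptions added for this sketch and are not part of the BaseModel shown above.

```python
import copy
from typing import List, Optional


class UndoableData:
    """Hypothetical stand-in for BaseModel; a list of rows plays the role of the dataframe."""

    def __init__(self, rows: List[dict]):
        self.rows = rows
        self.saved_rows: Optional[List[dict]] = None
        self.saved_action: Optional[str] = None

    def _set_save(self, action: str) -> None:
        # Copy the data so that later changes cannot touch the save point
        self.saved_rows = copy.deepcopy(self.rows)
        self.saved_action = action

    def drop_above(self, field: str, limit: float, save: bool = True) -> None:
        """Save point, then drops rows whose field exceeds limit."""
        if save:
            self._set_save("drop_above")
        self.rows = [row for row in self.rows if row[field] <= limit]

    def undo(self) -> str:
        """Restores the last save point and returns a message describing the rollback."""
        # Assumption for this sketch: guard against calling undo with nothing saved
        if self.saved_rows is None:
            return "Nothing to undo"
        self.rows = self.saved_rows
        message = f"Undid last change: {self.saved_action}"
        # Assumption for this sketch: an undo consumes the save point
        self.saved_rows, self.saved_action = None, None
        return message


data = UndoableData([{"price": 100}, {"price": 250}, {"price": 9000}])

data.drop_above("price", 1000)
after_drop = len(data.rows)

message = data.undo()
after_undo = len(data.rows)

print(after_drop, after_undo, message)
```

Because only one save point is stored, a second alteration overwrites the first save, and only the most recent change can be rolled back.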



In this module, we implemented new EDA and cleaning methods and set up the simplest possible interfaces to access them. For the methods that alter our data, we set up a save system that allows us to undo the last change. Finally, we made use of default parameters so that our methods are as easy to call as possible.