## Module 3

This is Module 3 in building a reusable, class-based machine learning framework.

In [Module 1](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module1/module_1.ipynb), we focused on preparing to code outside of Jupyter Notebook. We set up virtual environments with Pipenv, installed and configured VS Code, and added automatic Black formatting to our code.

In [Module 2](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module2/module_2.ipynb), we reviewed how to build Python classes. We built our base model class with an initial exploratory method. We added docstrings to our class methods so that we get documentation when calling help().

In Module 3, we will focus on the following skills:
- Building an EDA handler class
- Integrating our EDA handler into our BaseModel

### Build an EDA Handler Class

We're going to build our EDA Handler and do the following:
- Add an `__init__`, and have it do any setup tasks
- Add a new method to our new EDA class
- Move a method from our BaseModel into our EDACleaning class


In [16]:
import pandas as pd

class EDACleaning:
    def __init__(self):
        pass
    
    def _set_target(self, target):
        """set model target variable"""
        self.target = target

    def print_statistics(self, df):
        """Print basic dataframe statistics"""
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")

We didn't need our `__init__` to do any setup tasks for us, so its body is just `pass`.

Our new method _set_target will be used by our BaseModel to set the target in our EDACleaning object.

print_statistics used to be in the BaseModel, but since it does tasks that are part of EDA, we now move it into the EDACleaning object.
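One consequence of an `__init__` that does no setup is worth seeing in code: the `target` attribute does not exist on a new EDACleaning object until `_set_target` has been called. The sketch below uses a trimmed copy of the class (print_statistics is left out so it runs without pandas) and is only meant to illustrate that behavior.

```python
class EDACleaning:
    """Trimmed copy of the class above; print_statistics is omitted here."""

    def __init__(self):
        pass

    def _set_target(self, target):
        """set model target variable"""
        self.target = target


cleaner = EDACleaning()
print(hasattr(cleaner, "target"))  # False: nothing has set it yet

cleaner._set_target("price")
print(hasattr(cleaner, "target"))  # True
print(cleaner.target)  # price
```

If reading `target` before it is set ever becomes a problem, one option (a suggestion, not something this module does) would be to assign `self.target = None` in `__init__`.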

Now we'll see how to use the EDACleaning class together with the BaseModel class.



### Integrate our EDA Handler into our BaseModel

We begin with our BaseModel class from Module 2.

First, we create our EDACleaning object inside the BaseModel `__init__`. Now we can access the cleaning object's methods, but we'll wrap each of them in a BaseModel method so that we don't have to think about how they work! If this is your first time seeing an object instantiated INSIDE another object, we're unlocking a powerful tool here. Objects can live inside of other objects, which makes the interactions between them much simpler!

In our second change, we add a line to our set_target method to also set the target in the EDACleaning object.

For our third change, we moved our print_statistics method into the EDA Cleaner, so instead of duplicating the code here in the BaseModel, we call the method on the EDA Cleaner. However, keeping a method of the same name in our BaseModel lets us use it from the BaseModel without needing to know what arguments to send, and without having to find and call the cleaning object ourselves.
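The pattern behind these changes, one object stored inside another with the outer object forwarding calls to it, can be sketched in isolation. The `Formatter` and `Greeter` classes below are hypothetical and exist only for this illustration; they are not part of the framework.

```python
class Formatter:
    """Hypothetical helper class that knows how to do one job."""

    def shout(self, text):
        return text.upper() + "!"


class Greeter:
    """Hypothetical outer class that owns a Formatter instance."""

    def __init__(self, name):
        self.name = name
        # The helper object is created once and stored as an attribute
        self.formatter = Formatter()

    def greet(self):
        # Forward the formatting work to the inner object,
        # supplying the argument it needs on the caller's behalf
        return self.formatter.shout(f"hello {self.name}")


greeter = Greeter("module 3")
print(greeter.greet())  # HELLO MODULE 3!
```

The caller only needs `greeter.greet()`; it never has to know that a `Formatter` exists or what `shout` expects.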

Let's take a look at our updated BaseModel object, and then I'll provide examples to explain.

In [17]:
class BaseModel:
    def __init__(self, filename):
        self.df = self._load_file(filename)
        self.cleaner = EDACleaning()

    def _load_file(self, filename):
        """Load file from filename
        Args:
            filename: filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target):
        """Sets model target field
        Args:
            target: target field for model
        """
        self.target = target
        self.cleaner._set_target(target)

    def print_statistics(self):
        """Print basic statistics for data"""
        self.cleaner.print_statistics(self.df)

Now let's take a look at using our new methods.

In [18]:
model_object = BaseModel('../resources/kc_house_data.csv')

We've made a model object, and we know from the BaseModel `__init__` that we've done two setup tasks. The first is that we made a dataframe from our file, which we can access directly if we need to, like so:

In [19]:
model_object.df.head()

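|   | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |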


The other thing we did is make our EDACleaning object INSIDE of our BaseModel object. We're going to integrate the EDACleaning functions into our BaseModel in a way that's totally invisible and easy to use from the outside. For now, you can confirm that the object was made:

In [20]:
model_object.cleaner

<__main__.EDACleaning at 0x17ddba55960>

Now let's use some of the methods in our model_object.

In [10]:
model_object.print_statistics()

           id             date     price  bedrooms  bathrooms  sqft_living  \
0  7129300520  20141013T000000  221900.0         3       1.00         1180   
1  6414100192  20141209T000000  538000.0         3       2.25         2570   
2  5631500400  20150225T000000  180000.0         2       1.00          770   
3  2487200875  20141209T000000  604000.0         4       3.00         1960   
4  1954400510  20150218T000000  510000.0         3       2.00         1680   

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0   
1      7242     2.0           0     0  ...      7        2170            400   
2     10000     1.0           0     0  ...      6         770              0   
3      5000     1.0           0     0  ...      7        1050            910   
4      8080     1.0           0     0  ...      8        1680              0   

   yr_built  yr_renovated  zipcode      lat     lo

We're calling the exact same method in model_object that we did in Module 2, and it prints the same thing. The difference is that it now calls on the EDACleaning object that we made and asks the cleaning object to perform the method. We've done this invisibly so that it doesn't affect our usability from the outside at all.

We can actually call on the cleaner directly as well, but it requires more code if we do it manually. More importantly, it requires us to know what arguments to send (in this case, our dataframe), whereas when we call it using the method we made in our model_object, we don't need to know these things.

Call on the cleaning object's print_statistics manually:

In [25]:
model_object.cleaner.print_statistics(model_object.df)

           id             date     price  bedrooms  bathrooms  sqft_living  \
0  7129300520  20141013T000000  221900.0         3       1.00         1180   
1  6414100192  20141209T000000  538000.0         3       2.25         2570   
2  5631500400  20150225T000000  180000.0         2       1.00          770   
3  2487200875  20141209T000000  604000.0         4       3.00         1960   
4  1954400510  20150218T000000  510000.0         3       2.00         1680   

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0   
1      7242     2.0           0     0  ...      7        2170            400   
2     10000     1.0           0     0  ...      6         770              0   
3      5000     1.0           0     0  ...      7        1050            910   
4      8080     1.0           0     0  ...      8        1680              0   

   yr_built  yr_renovated  zipcode      lat     lo

So the same results CAN be accomplished through direct code, but we have to remember a lot more about the method.

Let's use our other updated method and set our target:

In [21]:
model_object.set_target('price')

And here, we've set our target variable both in our BaseModel object and in our EDACleaning object! For our model_object we are setting it directly as an attribute, and for the cleaner object, we are calling on an internal method and sending the target as an argument to tell it what to set as the target.
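Because the target now lives in two places, it is worth noting what keeps them in sync: only the set_target method updates both. The sketch below uses trimmed copies of the two classes (file loading and print_statistics are left out so no CSV or pandas is needed) to show that assigning the attribute directly would bypass the cleaner.

```python
class EDACleaning:
    """Trimmed copy: only the target-setting method is kept."""

    def _set_target(self, target):
        self.target = target


class BaseModel:
    """Trimmed copy: file loading is left out so no CSV is needed."""

    def __init__(self):
        self.cleaner = EDACleaning()

    def set_target(self, target):
        self.target = target
        self.cleaner._set_target(target)


model_object = BaseModel()
model_object.set_target("price")
print(model_object.target, model_object.cleaner.target)  # price price

# Assigning the attribute directly skips the cleaner update
model_object.target = "bedrooms"
print(model_object.target, model_object.cleaner.target)  # bedrooms price
```

Going through set_target every time avoids this kind of drift between the two objects.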

We can now access the target for both and check on them:

In [26]:
model_object.target

'price'

In [27]:
model_object.cleaner.target

'price'
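To try the whole flow without the kc_house_data.csv file, the sketch below repeats both classes and feeds BaseModel an in-memory CSV instead of a path. It relies on the fact that `pd.read_csv` accepts file-like objects as well as filenames; the three-column data is made up for illustration, and the docstrings are dropped to keep the block short.

```python
import io

import pandas as pd


class EDACleaning:
    def __init__(self):
        pass

    def _set_target(self, target):
        self.target = target

    def print_statistics(self, df):
        print(df.head())
        print(f"DF shape: {df.shape}\n")
        print(f"Data types: {df.dtypes}\n")
        print(f"Describe: {df.describe()}\n")
        print(f"isna sum: {df.isna().sum()}\n")


class BaseModel:
    def __init__(self, filename):
        self.df = self._load_file(filename)
        self.cleaner = EDACleaning()

    def _load_file(self, filename):
        # Works for a path or, as here, a file-like object
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target):
        self.target = target
        self.cleaner._set_target(target)

    def print_statistics(self):
        self.cleaner.print_statistics(self.df)


# Made-up rows, not taken from the real dataset
csv_text = "id,price,bedrooms\n1,100000.0,2\n2,250000.0,3\n"

model_object = BaseModel(io.StringIO(csv_text))
model_object.set_target("price")
model_object.print_statistics()

# Both objects should report the same target
print(model_object.target == model_object.cleaner.target)  # True
```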