## Module 2

This is Module 2 in building a reusable, class-based machine learning framework.

In [Module 1](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module1/module_1.ipynb), we focused on preparing to code outside of Jupyter Notebook. We set up virtual environments with Pipenv, installed and configured VS Code, and added automatic Black formatting to our code.

In Module 2, we will focus on the following skills:
- Quick refresher of classes and object-oriented programming (or a quick introduction, but overall this is not intended as a tutorial about class-based programming)
- Building our base model class with an initial exploratory method
- Introducing docstrings into our code

### Class Refresher (or brief introduction)

This section is a basic overview/refresher of classes. If you haven't seen a class before, follow along, but I cannot promise full understanding. As the tutorial proceeds, however, the intricacies should become clearer.

A class is a blueprint. You can create many objects from the blueprint, and they are all similar, because they are made from the same blueprint. The class has its own set of functions (in a class we call these methods) that will all perform the same way no matter how many objects we make from the blueprint. It also has its own set of variables (in a class we call these attributes) that every method in the class can read and change, much like global variables in a script.

A common teaching example of a class blueprint is a Dog. Let's make a sample class of type Dog, and make some different dog objects to explore how this works.

In [None]:
class Dog:

    def __init__(self):
        pass

Classes usually start with a special method called the init. At minimum it is written as def \_\_init\_\_(self), and it may contain nothing but a pass, as in the above Dog example. It's not required to do anything. (Python does not strictly require you to write an \_\_init\_\_ at all, but we will always include one in this tutorial.) What the \_\_init\_\_ DOES do is this: anytime you make an object using your CLASS blueprint, Python runs everything inside of the init as it makes the object. You can think of the \_\_init\_\_ as the class setup function.
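To see the setup function run at object-creation time, here is a minimal sketch. The `Puppy` class and the `created` list are hypothetical names used only for this illustration; they are not part of the framework we build later.

```python
# Hypothetical example: record a message each time __init__ runs.
created = []


class Puppy:
    def __init__(self):
        # This line runs automatically whenever Puppy() is called.
        created.append("a new Puppy was set up")


first = Puppy()
second = Puppy()

print(len(created))  # 2, one entry per object created
```

We never call \_\_init\_\_ ourselves. Writing `Puppy()` is what triggers it.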

We're going to make our Dog class a little more interesting. Here is our class Dog:

In [16]:
class Dog:

    def __init__(self, dog_type):
        self.type = dog_type
    
    def bark(self):
        print("WOOF!")
        print(self.type)

In the Dog class \_\_init\_\_, we're passing an argument "dog_type" and setting an attribute on the object. We'll talk more about how that works in a moment.


In the Dog class I am showing both an ATTRIBUTE and a METHOD. 

The ATTRIBUTE we made in the Dog class is self.type. Any variable inside a class that we preface with a "self." becomes available to all of the methods inside of the class, without us having to pass it around.

Our class METHOD is "def bark(self)". You can see that it's just a function. Whenever we have a function inside a class, it is called a method. You can see in the body of "bark" that we reference the variable self.type, but *we did not have to pass it into the function*. This is one of the special ways that a class works. Every method receives the object itself as its first argument, "self", and that is how it reaches all of the self.xxxx attributes. The "self" that you see in all of the class methods, such as def bark(self), is an *invisible argument*: Python fills it in for you, so you ignore it when you are passing things to your class object. I'll show you what I mean shortly.

Let's make some dogs.

We are going to make a beagle.

In [18]:
beagle = Dog('beagle')

We now have an instance of our class named beagle. We also make a second dog of a different type.

In [None]:
husky = Dog('husky')

When we make a Dog instance of any type, that \_\_init\_\_ setup function gets run, and we can send it arguments just like any other function. You may recall that I referred to "self" as an *invisible* argument, meaning you always ignore it when sending arguments to a class method. The same is true for the \_\_init\_\_. Our Dog init calls for a dog_type argument, so we DO send it a dog_type argument when we make an instance of a dog.
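One way to see what "invisible" means is to write the same method call both ways. In this sketch, `bark` returns its text instead of printing it (a small change from the Dog class above, made only so the two results can be compared), and the variable names `usual` and `explicit` are just for illustration.

```python
class Dog:
    def __init__(self, dog_type):
        self.type = dog_type

    def bark(self):
        # Return the text instead of printing so the two calls can be compared.
        return f"WOOF! {self.type}"


# __init__ lists two parameters (self, dog_type), but we pass only one.
beagle = Dog("beagle")

# The usual call: Python fills in `self` with `beagle` for us.
usual = beagle.bark()

# The same call written out by hand, passing the object in as `self`.
explicit = Dog.bark(beagle)

print(usual == explicit)  # True
```

You would rarely write the second form in practice. It is shown here only to make the hidden argument visible.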

Now that we have two different dogs, we can make them do the same thing by calling their class method "bark":

In [19]:
beagle.bark()

WOOF!
beagle


In [20]:
husky.bark()

WOOF!
husky


When we call on the bark method with each dog, they woof and then report their dog_type. These give different outputs even though they are both Dog class items, because Dog is only a BLUEPRINT. The beagle and the husky are OBJECTS. A class definition is a BLUEPRINT for multiple similar objects to use.

We can also access our attributes directly if we want to see them:

In [22]:
print(beagle.type)

beagle
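Because each object carries its own attributes, changing one dog does not change the other. A minimal sketch, using a trimmed-down copy of the Dog class and a made-up replacement value:

```python
class Dog:
    def __init__(self, dog_type):
        self.type = dog_type


beagle = Dog("beagle")
husky = Dog("husky")

# Reassign the attribute on one object only.
beagle.type = "very good beagle"

print(beagle.type)  # very good beagle
print(husky.type)   # husky (unchanged)
```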


### Building the Base Model Class

Time to start our modeling framework! We will build a BaseModel and then use this base class throughout the tutorial, adding additional modules and functionality. To start, our class will do only a few simple things:
- load a file
- set a target
- print some basic statistics about our data frame

We start by declaring the name of our model class. We are going to call it BaseModel.

We give our class an \_\_init\_\_ as a setup function. As a reminder, the init doesn't have to do anything. It could simply say pass. By convention it is the first method in the class. Our init is going to load our file, but it's going to load our file by calling a DIFFERENT method inside our class.

In [25]:
import pandas as pd

class BaseModel:

    def __init__(self, filename):
        self.df = self._load_file(filename)
    
    def _load_file(self, filename):
        return pd.read_csv(filename, low_memory=False, on_bad_lines="skip")

You may notice that our \_load_file method starts with an underscore. This is a naming convention signaling that the method is meant to be called only from inside the class, not directly by us. Python does not enforce this; it is a hint to whoever reads the code. This particular method is only called by the \_\_init\_\_ setup function.
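The underscore is a convention rather than a lock. The sketch below uses a hypothetical `Example` class with made-up method names to show that Python will still let you call an underscore-prefixed method from outside, even though you normally shouldn't.

```python
class Example:
    def _helper(self):
        # Leading underscore: "internal use" by convention only.
        return "called the helper"

    def run(self):
        # Intended usage: other methods of the class call the helper.
        return self._helper()


example = Example()

print(example.run())      # the intended route
print(example._helper())  # still works; Python does not block it
```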

We make an instance of our object:

In [28]:
model_object = BaseModel('../resources/kc_house_data.csv')

Our BaseModel \_\_init\_\_ requires an argument to be sent, which is the file name that we want to load. The \_\_init\_\_ automatically calls the _load_file method using the filename as an argument, and pandas loads the file and returns it to be set as our self.df. From now on, we can reference self.df in any class method, and it will know what variable we are referring to.
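If you don't have the CSV file handy, you can try the same flow on a tiny in-memory file. This sketch relies on the fact that `pd.read_csv` accepts a file-like object as well as a path. The two sample rows are made up for illustration and are not taken from kc_house_data.csv, and `tiny_model` is a hypothetical name.

```python
import io

import pandas as pd


class BaseModel:
    def __init__(self, filename):
        self.df = self._load_file(filename)

    def _load_file(self, filename):
        return pd.read_csv(filename, on_bad_lines="skip")


# Made-up sample data standing in for a real CSV file on disk.
sample_csv = io.StringIO("price,bedrooms\n100000.0,2\n250000.0,3\n")

tiny_model = BaseModel(sample_csv)

print(tiny_model.df.shape)          # (2, 2)
print(list(tiny_model.df.columns))  # ['price', 'bedrooms']
```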

Now we will add a method that prints some basic statistics for our dataframe:

In [29]:
import pandas as pd

class BaseModel:
    def __init__(self, filename):
        self.df = self._load_file(filename)

    def _load_file(self, filename):
        return pd.read_csv(filename, on_bad_lines="skip")

    def print_statistics(self):
        print(f"DF columns: {self.df.columns}\n")
        print(f"DF shape: {self.df.shape}\n")
        print(f"Data types: {self.df.dtypes}\n")
        print(f"Describe: {self.df.describe()}\n")
        print(f"isna sum: {self.df.isna().sum()}\n")
        print(self.df.head())

You should be familiar with all of these pandas methods. We've simply made a class method that prints them all for us. To run this we need to make a new object (because our blueprint changed!) and run this class method.

In [31]:
model_object = BaseModel('../resources/kc_house_data.csv')
model_object.print_statistics()

DF columns: Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

DF shape: (21613, 21)

Data types: id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

Describe:                  id         price      bedrooms     bathrooms   sqft_living  \
count  2.161300e+04  2.1613

Note that when we call the print_statistics() method, we don't pass any argument for self. You'll recall that self is an invisible argument.

You can also access the dataframe directly if you want to. Just remember that the class attribute is called df. Simply append the object name at the beginning of your calls. Example:

In [32]:
model_object.df.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


We're going to add one last method to our class in this module, which sets our target field.

In [33]:
import pandas as pd

class BaseModel:
    def __init__(self, filename):
        self.df = self._load_file(filename)

    def _load_file(self, filename):
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target):
        self.target = target

    def print_statistics(self):
        print(f"DF columns: {self.df.columns}\n")
        print(f"DF shape: {self.df.shape}\n")
        print(f"Data types: {self.df.dtypes}\n")
        print(f"Describe: {self.df.describe()}\n")
        print(f"isna sum: {self.df.isna().sum()}\n")
        print(self.df.head())

Now that we've looked at some base statistics for our dataframe, we should have an idea of our regression or classification target. We're going to call our set_target method to set this as the target field.

We remake the model object, since we changed the blueprint.

In [34]:
model_object = BaseModel('../resources/kc_house_data.csv')
model_object.set_target('price')

We've now set our target field. We can confirm this by calling our target directly.

In [35]:
model_object.target

'price'
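One thing to be aware of: in the class as written, the target attribute does not exist until set_target has been called, so reading it earlier raises an AttributeError. The sketch below shows one optional variation, which is not something this tutorial's BaseModel does: giving the attribute a placeholder value in the \_\_init\_\_. The `TargetHolder` class is a hypothetical stand-in for BaseModel with the file loading left out.

```python
class TargetHolder:
    """Hypothetical stand-in for BaseModel, without the file loading."""

    def __init__(self):
        # Optional variation: give the attribute a placeholder value up front.
        self.target = None

    def set_target(self, target):
        self.target = target


holder = TargetHolder()
before = holder.target

holder.set_target("price")
after = holder.target

print(before)  # None
print(after)   # price
```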

### Introducing Docstrings Into Our Code

We're done writing new code for this module, but we're going to hit one more concept, which is docstrings. Most people have heard of and used them, and their importance cannot be overstated. Our class object right now is pretty small, but as we build on it, we'll want good docstrings so that we can use one of Python's simplest and best tools - help()!

In [36]:
help(BaseModel)

Help on class BaseModel in module __main__:

class BaseModel(builtins.object)
 |  BaseModel(filename)
 |  
 |  Methods defined here:
 |  
 |  __init__(self, filename)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  print_statistics(self)
 |  
 |  set_target(self, target)
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



Right now if we call help() on BaseModel, we don't get very much information. If we add proper docstrings to our class, we'll get all of the information about our class methods when we call help().

As I mentioned in Module 1, I use VS Code with autoDocstring installed.  This extension builds a docstring template in any function if you type """ and hit enter to "Generate Docstring".

In [39]:
import pandas as pd

class BaseModel:
    def __init__(self, filename):
        self.df = self._load_file(filename)

    def _load_file(self, filename):
        """Load file from filename into a dataframe
        Args:
            filename: filename in csv format
        Returns:
            pd.DataFrame: df loaded from file
        """
        return pd.read_csv(filename, on_bad_lines="skip")

    def set_target(self, target):
        """Sets model target field
        Args:
            target: target field for model
        """
        self.target = target

    def print_statistics(self):
        """Print basic dataframe statistics"""
        print(f"DF columns: {self.df.columns}\n")
        print(f"DF shape: {self.df.shape}\n")
        print(f"Data types: {self.df.dtypes}\n")
        print(f"Describe: {self.df.describe()}\n")
        print(f"isna sum: {self.df.isna().sum()}\n")
        print(self.df.head())

We've added some docstrings to our base class, so now when we print help, we'll get some more information about each method.

In [40]:
help(BaseModel)

Help on class BaseModel in module __main__:

class BaseModel(builtins.object)
 |  BaseModel(filename)
 |  
 |  Methods defined here:
 |  
 |  __init__(self, filename)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  print_statistics(self)
 |      Print basic dataframe statistics
 |  
 |  set_target(self, target)
 |      Sets model target field
 |      Args:
 |          target: target field for model
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)



Much more informative!
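Notice that the help(BaseModel) output above does not list \_load_file, because its name starts with an underscore. The docstring is still stored on the method, and you can read it directly. Here is a minimal sketch using a hypothetical `Example` class with made-up method names; the same approach should work for `BaseModel._load_file`.

```python
import inspect


class Example:
    def _hidden(self):
        """Docstring on an underscore-prefixed method."""
        return None

    def visible(self):
        """Docstring on a public method."""
        return None


# Docstrings are stored on the function objects themselves.
print(Example.visible.__doc__)
print(inspect.getdoc(Example._hidden))
```

Calling help() directly on a single method, for example `help(Example._hidden)`, also displays its docstring.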

Up next in [Module 3](https://github.com/threnjen/object_oriented_machine_learning/blob/main/module3/module_3.ipynb) we'll build an EDA Cleaning class and integrate it into our Base Model.