# Classes & Objects - Credit Approval Dataset

This notebook demonstrates:
- Python classes
- Scikit-learn transformer classes 
- Scikit-learn `Pipeline` class
- Scikit-learn `FeatureUnion` class

The dataset created below is used in most of the demonstrations. 

The credit approval dataset is read by the `get_initial_pdf` function. 

In [0]:
import pandas as pd
def get_initial_pdf(): 
    pdf = pd.read_csv('https://raw.githubusercontent.com/datalab-datasets/credit-approval/master/crx.data',
                    header=None,
                    names=[f'col_{col_num}' for col_num in range(16)],
                    dtype={f'col_{num}': 'float' for num in [1,2,7,13,14]},
                    na_values=['?']
                    ) \
          .replace(to_replace={'col_15': {'+':1.0, '-':0.0}}) 
    return pdf

The resulting dataframe is stored in the `initial_pdf` variable and is reused in the later sections.

In [0]:
initial_pdf = get_initial_pdf()
float_pdf   = initial_pdf.select_dtypes(include='float')
object_pdf  = initial_pdf.select_dtypes(include='object')

For reasons that will make more sense later, the dataset is split into two pieces: 
- the `float` columns are in `float_pdf` 
- the `object` columns are in `object_pdf`.
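The split relies on `DataFrame.select_dtypes`. The following is a minimal sketch on a small, made-up dataframe (the `toy_*` names and values are hypothetical and are not part of the credit approval data), so the effect can be seen without downloading anything:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the credit approval data.
toy_pdf = pd.DataFrame({
    'col_0': pd.Series(['a', 'b', np.nan], dtype='object'),  # object column
    'col_1': [30.83, np.nan, 24.5],                          # float column
    'col_2': [0.0, 4.46, 0.5],                               # float column
})

# Same calls as in the cell above, applied to the toy frame.
toy_float_pdf = toy_pdf.select_dtypes(include='float')
toy_object_pdf = toy_pdf.select_dtypes(include='object')

print(list(toy_float_pdf.columns))
print(list(toy_object_pdf.columns))
```

Each piece keeps only the columns of the requested dtype, which is what allows a different set of transformations to be applied to each piece later.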

## Classes and objects
The examples in this section use three kinds of methods:
- init
- get
- set 

This class has only an init method.

In [0]:
class Person: 
    def __init__(self,name=''): 
        self.name=name

In [0]:
host = Person(name="David")
brother = Person(name='John')

In [5]:
host.name, brother.name

('David', 'John')

A "get" method is added to this class. 

In [0]:
class Person: 
    def __init__(self,name=''):
        self.name=name
    def get_name(self):
        return self.name

In [0]:
host = Person('david')

In [8]:
host.get_name()

'david'

A "set" method is added to this class. 

In [0]:
class Person: 
    def __init__(self,name=''):
        self.name=name
    def get_name(self):
        return self.name
    def set_name(self,name):
        self.name=name
        return self

In [0]:
host = Person()

In [11]:
host.get_name()

''

In [12]:
host.set_name('David')

<__main__.Person at 0x7f2762465400>

In [13]:
host.get_name()

'David'

In [14]:
host.set_name('Davie').get_name()

'Davie'

From this section remember:
- The "init" method records the initial parameters used when creating the object
- The "get" method returns values from the object
- The "set" method stores values in the object and returns `self` (the object)
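Because a "set" method returns `self`, several calls can be chained on one line. A minimal sketch of that pattern follows; the `Account` class and its attributes are hypothetical and exist only for this illustration:

```python
class Account:
    """Hypothetical class used only to illustrate the init/get/set pattern."""

    def __init__(self, owner='', limit=0.0):
        # "init": record the initial parameters on the object
        self.owner = owner
        self.limit = limit

    def get_limit(self):
        # "get": return a value stored on the object
        return self.limit

    def set_owner(self, owner):
        # "set": store a value and return the object itself
        self.owner = owner
        return self

    def set_limit(self, limit):
        self.limit = limit
        return self


# Each "set" call returns the object, so the next call can follow directly.
account = Account().set_owner('David').set_limit(500.0)
print(account.owner, account.get_limit())
```

The same chaining style appears later in this notebook in calls such as `.fit(...).transform(...)`.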


## Transformer classes and objects

Two examples of transformer classes are created below.

### `DoNothing` class (first example)

This class is the minimum needed to create a transformer class. It runs without errors as written, so it is a reasonable starting point when you write a new transformer class.

In [0]:
from sklearn.base import BaseEstimator, TransformerMixin
class DoNothing(BaseEstimator, TransformerMixin):
    def __init__(self):
        return
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

In [0]:
no_op = DoNothing()

In [17]:
no_op.fit(initial_pdf)

DoNothing()

In [0]:
transformed_pdf = no_op.transform(initial_pdf)

In [19]:
transformed_pdf.equals(initial_pdf)

True
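The two base classes are what make even a minimal class like this useful: `BaseEstimator` supplies `get_params`/`set_params`, and `TransformerMixin` supplies `fit_transform`. The sketch below illustrates this with a hypothetical `AddConstant` transformer (not part of the original notebook) that only defines `__init__`, `fit`, and `transform`:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class AddConstant(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: adds a fixed constant to every value."""

    def __init__(self, constant=0.0):
        self.constant = constant

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        return X + self.constant


adder = AddConstant(constant=1.0)
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])

# fit_transform is inherited from TransformerMixin.
result = adder.fit_transform(X_demo)
# get_params is inherited from BaseEstimator and reads the __init__ arguments.
params = adder.get_params()

print(result)
print(params)
```

This is why the init method should simply store its arguments as attributes with the same names: `get_params` looks them up by name.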

### `DataFrameSelector` class (second example)

This class saves an init parameter as an attribute. Its `transform` method returns only the columns of the dataframe that were listed in that init parameter.

In [0]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names=[]):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names]

In [21]:
(DataFrameSelector(attribute_names=['col_1','col_2'])
.fit(float_pdf)
.transform(float_pdf)
.head()
)

   col_1  col_2
0  30.83  0.0
1  58.67  4.46
2  24.5   0.5
3  27.83  1.54
4  20.17  5.625


## Transforming `object` columns 

In this section objects from two common Scikit-learn transformer classes are used to transform the `object_pdf` dataframe. The result of the first is passed to the second.

- The `fit` method of a transformer class is a "set" method. 
- The `transform` method of a transformer class is a "get" method.
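The "set"/"get" analogy can be made concrete with a small sketch: `fit` stores a learned value on the object, and `transform` reads that stored value, even when it is applied to different data. The arrays and the `imp_demo` name below are made up for illustration:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: one column, with one missing value in each array.
train = np.array([[1.0], [3.0], [np.nan]])
new = np.array([[np.nan], [10.0]])

imp_demo = SimpleImputer(missing_values=np.nan, strategy='mean')

# "set": fit stores the column mean of the training data and returns self.
imp_demo.fit(train)
learned = imp_demo.statistics_

# "get": transform uses the stored mean; it does not recompute it from `new`.
filled = imp_demo.transform(new)

print(learned)
print(filled)
```

The missing value in `new` is filled with the mean learned from `train`, not with a mean computed from `new`.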

### `Object` - `SimpleImputer`

In [0]:
from sklearn.impute import SimpleImputer
import numpy as np

In [0]:
imp = SimpleImputer(missing_values=np.nan,
                    strategy='most_frequent')

In [28]:
imp.fit(object_pdf)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='most_frequent', verbose=0)

In [29]:
imp.statistics_

array(['b', 'u', 'g', 'c', 'v', 't', 'f', 'f', 'g'], dtype=object)

In [30]:
object_pdf.columns

Index(['col_0', 'col_3', 'col_4', 'col_5', 'col_6', 'col_8', 'col_9', 'col_11',
       'col_12'],
      dtype='object')

In [31]:
object_pdf.col_0.value_counts()

b    468
a    210
Name: col_0, dtype: int64

In [32]:
object_pdf.col_3.value_counts()

u    519
y    163
l      2
Name: col_3, dtype: int64

In [0]:
imputed_object_arr = imp.transform(object_pdf)

In [34]:
imputed_object_arr

array([['b', 'u', 'g', ..., 't', 'f', 'g'],
       ['a', 'u', 'g', ..., 't', 'f', 'g'],
       ['a', 'u', 'g', ..., 'f', 'f', 'g'],
       ...,
       ['a', 'y', 'p', ..., 't', 't', 'g'],
       ['b', 'u', 'g', ..., 'f', 'f', 'g'],
       ['b', 'u', 'g', ..., 'f', 't', 'g']], dtype=object)

### `Object` - `OneHotEncoder`

- https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

In [0]:
from sklearn.preprocessing import OneHotEncoder

In [0]:
ohe = OneHotEncoder()

In [37]:
test_arr = imputed_object_arr[5:15,:3]
test_arr

array([['b', 'u', 'g'],
       ['b', 'u', 'g'],
       ['a', 'u', 'g'],
       ['b', 'y', 'p'],
       ['b', 'y', 'p'],
       ['b', 'u', 'g'],
       ['b', 'u', 'g'],
       ['a', 'u', 'g'],
       ['b', 'u', 'g'],
       ['a', 'u', 'g']], dtype=object)

In [44]:
ohe.fit(test_arr).transform(test_arr) #.todense()

<10x6 sparse matrix of type '<class 'numpy.float64'>'
	with 30 stored elements in Compressed Sparse Row format>

In [0]:
ohe_imputed_object_arr = ohe.fit(imputed_object_arr).transform(imputed_object_arr)

In [43]:
ohe_imputed_object_arr.todense()

matrix([[0., 1., 0., ..., 1., 0., 0.],
        [1., 0., 0., ..., 1., 0., 0.],
        [1., 0., 0., ..., 1., 0., 0.],
        ...,
        [1., 0., 0., ..., 1., 0., 0.],
        [0., 1., 0., ..., 1., 0., 0.],
        [0., 1., 0., ..., 1., 0., 0.]])

The two objects above transformed the `object_pdf` by
- replacing any missing values with the most frequent value in that column
- encoding categorical/object columns as multiple columns of `0`/`1`
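Both steps are easier to follow on an array small enough to read in full. The sketch below uses a made-up 4x2 object array (the `*_demo` names are hypothetical) and applies the same two transformers in the same order:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical object array with one missing value in each column.
raw_demo = np.array([['a', 'u'],
                     ['b', 'u'],
                     [np.nan, 'y'],
                     ['b', np.nan]], dtype=object)

# Step 1: fill missing values with the most frequent value per column.
imp_demo = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputed_demo = imp_demo.fit(raw_demo).transform(raw_demo)

# Step 2: encode each category as its own 0/1 column.
ohe_demo = OneHotEncoder()
encoded_demo = ohe_demo.fit(imputed_demo).transform(imputed_demo)
dense_demo = encoded_demo.toarray()

print(imputed_demo)
print(ohe_demo.categories_)
print(dense_demo)
```

Each input column becomes one output column per distinct category, so the two input columns here produce four output columns.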

## Transforming `float` columns

In this section objects from two common Scikit-learn transformer classes are used to transform the `float_pdf` dataframe. The result of the first is passed to the second.

Recall that:
- The `fit` method of a transformer class is a "set" method. 
- The `transform` method of a transformer class is a "get" method.

### `Float` - `SimpleImputer`

- https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [62]:
float_pdf.isnull().sum()

col_1     12
col_2      0
col_7      0
col_13    13
col_14     0
col_15     0
dtype: int64

In [0]:
from sklearn.impute import SimpleImputer
import numpy as np

In [0]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

In [47]:
imp.fit(float_pdf)

SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)

In [49]:
imp.statistics_

array([3.15681711e+01, 4.75872464e+00, 2.22340580e+00, 1.84014771e+02,
       1.01738551e+03, 4.44927536e-01])

In [50]:
float_pdf.columns

Index(['col_1', 'col_2', 'col_7', 'col_13', 'col_14', 'col_15'], dtype='object')

In [0]:
imputed_float_arr = imp.transform(float_pdf)

In [61]:
np.sum(np.isnan(imputed_float_arr))

0

### `Float` - `MinMaxScaler`

In [68]:
np.max(imputed_float_arr,axis=0)

array([8.025e+01, 2.800e+01, 2.850e+01, 2.000e+03, 1.000e+05, 1.000e+00])

In [0]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [70]:
scaler.fit(imputed_float_arr)

MinMaxScaler(copy=True, feature_range=(0, 1))

In [71]:
scaler.data_min_, scaler.data_max_

(array([13.75,  0.  ,  0.  ,  0.  ,  0.  ,  0.  ]),
 array([8.025e+01, 2.800e+01, 2.850e+01, 2.000e+03, 1.000e+05, 1.000e+00]))

In [0]:
scaled_imputed_float_arr = scaler.transform(imputed_float_arr)

In [77]:
(scaled_imputed_float_arr.min(axis=0), 
 scaled_imputed_float_arr.max(axis=0)
 )

(array([0., 0., 0., 0., 0., 0.]), array([1., 1., 1., 1., 1., 1.]))
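With the default `feature_range=(0, 1)`, the scaling applied to each column is `(x - min) / (max - min)`, where `min` and `max` are the values stored by `fit` in `data_min_` and `data_max_`. The sketch below checks this by hand on a small made-up array (the values and variable names are illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two columns with different ranges.
X_demo = np.array([[13.75, 0.0],
                   [47.00, 14.0],
                   [80.25, 28.0]])

# Manual scaling, column by column.
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

# The same scaling done by MinMaxScaler.
scaler_demo = MinMaxScaler()
scaled_demo = scaler_demo.fit(X_demo).transform(X_demo)

print(manual)
print(np.allclose(manual, scaled_demo))
```

Every column ends up in the range 0 to 1, which matches the minimum and maximum arrays shown in the output above.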

## `Object` - `Pipeline`

The pair of transformations (impute, then one-hot encode) is chained together with a `Pipeline` object, which allows you to run both transformations with a single command.

In [0]:
imp = SimpleImputer(missing_values=np.nan,strategy='most_frequent')
imputed_object_arr =  imp.fit(object_pdf).transform(object_pdf)

In [0]:
ohe = OneHotEncoder()
ohe_imputed_object_arr = ohe.fit(imputed_object_arr).transform(imputed_object_arr)

In [0]:
ohe_imputed_object_arr.todense()

In [0]:
from sklearn.pipeline import Pipeline
imp_ohe_pipe = Pipeline(steps=[('imp_obj', SimpleImputer(missing_values=np.nan,strategy='most_frequent')),
                               ('ohe_obj', OneHotEncoder())
                               ]
)              

In [0]:
pipe_ohe_imputed_object_arr = imp_ohe_pipe.fit(object_pdf).transform(object_pdf)
np.array_equal(pipe_ohe_imputed_object_arr.todense(),
                    ohe_imputed_object_arr.todense()
               )
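Two `Pipeline` features are convenient once the steps are chained: `fit_transform` runs fit and transform in one call, and `named_steps` gives access to each fitted step by the name used in `steps`. The following is a minimal sketch on a made-up object array; the `*_demo` names are hypothetical, while the step names match the ones used above:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical object array with one missing value in each column.
raw_demo = np.array([['a', 'u'],
                     ['b', 'u'],
                     [np.nan, 'y'],
                     ['b', np.nan]], dtype=object)

pipe_demo = Pipeline(steps=[
    ('imp_obj', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('ohe_obj', OneHotEncoder()),
])

# One call fits every step in order and returns the final transformed data.
pipe_out = pipe_demo.fit_transform(raw_demo)

# Fitted steps remain accessible by name.
learned_fill = pipe_demo.named_steps['imp_obj'].statistics_

print(pipe_out.shape)
print(learned_fill)
```

Looking up a step by name is useful for checking what a pipeline learned, for example the fill values chosen by the imputer.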

## `Float` - `Pipeline`

In [0]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_float_arr = imp.fit(float_pdf).transform(float_pdf)

In [0]:
scaler = MinMaxScaler()
scaled_imputed_float_arr = scaler.fit(imputed_float_arr).transform(imputed_float_arr)

In [0]:
from sklearn.pipeline import Pipeline
imp_sca_pipe = Pipeline(steps=[('imp_flt', SimpleImputer(missing_values=np.nan,strategy='mean')),
                               ('sca_flt', MinMaxScaler())
                               ]
)              

In [0]:
pipe_scaled_imputed_float_arr = imp_sca_pipe.fit(float_pdf).transform(float_pdf)

In [0]:
np.array_equal(pipe_scaled_imputed_float_arr,
               scaled_imputed_float_arr)

## `FeatureUnion`

In [0]:
from sklearn.pipeline import FeatureUnion

In [0]:
class SelectDtypePDF(BaseEstimator, TransformerMixin):
    def __init__(self, dtype=[]):
        self.dtype = dtype
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X.select_dtypes(self.dtype)

In [0]:
SelectDtypePDF(dtype='float').fit(initial_pdf).transform(initial_pdf).head()

In [0]:
float_pdf.columns

In [0]:
from sklearn.pipeline import Pipeline
sel_imp_sca_pipe = Pipeline(steps=[('sel_flt', SelectDtypePDF('float')),
                                   ('imp_flt', SimpleImputer(missing_values=np.nan,strategy='mean')),
                                   ('sca_flt', MinMaxScaler())])              

In [0]:
np.array_equal(sel_imp_sca_pipe.fit(initial_pdf).transform(initial_pdf),
                   imp_sca_pipe.fit(float_pdf)  .transform(float_pdf))

In [0]:
from sklearn.pipeline import Pipeline
sel_imp_ohe_pipe = Pipeline(steps=[('sel_obj', SelectDtypePDF('object')),
                                   ('imp_obj', SimpleImputer(missing_values=np.nan,strategy='most_frequent')),
                                   ('ohe_obj', OneHotEncoder())
                               ])              

In [0]:
np.array_equal(imp_ohe_pipe.fit(object_pdf).transform(object_pdf).todense(),
               sel_imp_ohe_pipe.fit(initial_pdf).transform(initial_pdf).todense()
               )

In [0]:
float_object_fea_un = FeatureUnion(transformer_list=[('flt',sel_imp_sca_pipe),
                                                    ('obj',sel_imp_ohe_pipe)])

In [0]:
fea_un_arr = float_object_fea_un.fit(initial_pdf).transform(initial_pdf)

In [0]:
# Concatenate in the same order as the FeatureUnion: float columns first, then object columns.
concat_flt_obj_arr = \
np.concatenate((sel_imp_sca_pipe.fit(initial_pdf).transform(initial_pdf),
                sel_imp_ohe_pipe.fit(initial_pdf).transform(initial_pdf).todense()),
               axis=1)

In [0]:
np.array_equal(fea_un_arr.shape, concat_flt_obj_arr.shape)

In [0]:
float_object_fea_un = \
FeatureUnion(
    transformer_list=[('flt',Pipeline(steps=[('sel_flt', SelectDtypePDF('float')),
                                             ('imp_flt', SimpleImputer(missing_values=np.nan,strategy='mean')),
                                             ('sca_flt', MinMaxScaler())])),
                      ('obj',Pipeline(steps=[('sel_obj', SelectDtypePDF('object')),
                                             ('imp_obj', SimpleImputer(missing_values=np.nan,strategy='most_frequent')),
                                             ('ohe_obj', OneHotEncoder())]))])

In [0]:
float_object_fea_un.fit(initial_pdf).transform(initial_pdf)
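A `FeatureUnion` applies each of its transformers to the same input and places the results side by side, in the order given in `transformer_list`. The sketch below reproduces the structure above on a made-up two-column dataframe so that the full result can be read; the `toy_*` names and values are hypothetical, and `SelectDtypePDF` is redefined only to keep the example self-contained:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder


class SelectDtypePDF(BaseEstimator, TransformerMixin):
    """Keeps only the dataframe columns of the requested dtype."""

    def __init__(self, dtype=None):
        self.dtype = dtype

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X.select_dtypes(self.dtype)


# Hypothetical toy frame: one object column and one float column.
toy_pdf = pd.DataFrame({
    'cat': pd.Series(['a', 'b', 'a', np.nan], dtype='object'),
    'num': [1.0, np.nan, 3.0, 5.0],
})

toy_union = FeatureUnion(transformer_list=[
    ('flt', Pipeline(steps=[
        ('sel_flt', SelectDtypePDF(dtype='float')),
        ('imp_flt', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('sca_flt', MinMaxScaler())])),
    ('obj', Pipeline(steps=[
        ('sel_obj', SelectDtypePDF(dtype='object')),
        ('imp_obj', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('ohe_obj', OneHotEncoder())])),
])

# The one-hot branch is sparse, so the combined result is sparse as well.
toy_union_arr = toy_union.fit(toy_pdf).transform(toy_pdf)
toy_dense = toy_union_arr.toarray()

print(toy_dense)
```

The first output column is the scaled float column and the remaining two are the one-hot columns, so the output width is the sum of the widths produced by the two pipelines.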