# About

This notebook tries out custom classes defined in the `src` folder and imported as an installed package.

See:
* https://drivendata.github.io/cookiecutter-data-science/#getting-started
* https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e
* https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually
* https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree
* https://github.com/kennethreitz/setup.py


Steps to reproduce:

1. Use the environment.yml file to set up the project's Anaconda environment.
2. Use the cookiecutter template to set up the project itself, especially the setup.py file.
3. Run `pip install --editable .` from the project root to install the src files.
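
After step 3, one quick way to confirm the editable install worked is to ask Python whether the package resolves on the import path. The snippet below is a minimal sketch: the helper name `is_importable` is hypothetical, and it assumes the cookiecutter template installs the package under the name `src`, as the imports in this notebook suggest.

```python
import importlib.util


def is_importable(package_name):
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "src" is assumed to be the package name installed by `pip install --editable .`
    print("src importable:", is_importable("src"))
```

If this prints `False`, the usual suspects are running pip from the wrong directory or having a different conda environment active in the notebook kernel.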

In [32]:
# OPTIONAL: use inline matplotlib
%matplotlib inline

# OPTIONAL: Load the "autoreload" extension so that code can change
%reload_ext autoreload

# OPTIONAL: always reload modules so that as you change code in src, it gets loaded
%autoreload 2

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd

from src.features.build_features import DataFrameSelector, MyTransformer

In [6]:
# loads data
train_pd = pd.read_csv("../data/raw/train.csv.zip", compression="zip")
test_pd = pd.read_csv("../data/raw/test.csv.zip", compression="zip")

In [27]:
# check if we can create an instance of MyTransformer
# Pipeline must be imported before it is first used
from sklearn.pipeline import Pipeline

silly = MyTransformer()
pipeline = Pipeline([
    ('silly', silly),
])

test = pipeline.transform(train_pd)
print(test.head(2))

                 Dates        Category                  Descript  DayOfWeek  \
0  2015-05-13 23:53:00        WARRANTS            WARRANT ARREST  Wednesday   
1  2015-05-13 23:53:00  OTHER OFFENSES  TRAFFIC VIOLATION ARREST  Wednesday   

  PdDistrict      Resolution             Address           X          Y  \
0   NORTHERN  ARREST, BOOKED  OAK ST / LAGUNA ST -122.425892  37.774599   
1   NORTHERN  ARREST, BOOKED  OAK ST / LAGUNA ST -122.425892  37.774599   

           Z  
0 -84.651293  
1 -84.651293  
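
The source of `MyTransformer` is not shown in this notebook. In the output above, the new `Z` column equals `X + Y` for both rows (-122.425892 + 37.774599 = -84.651293), so a transformer along the following lines would produce the same result. This is only a sketch under that assumption: the class name `SumXYTransformer` is hypothetical and is not the actual implementation in `src/features/build_features.py`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SumXYTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for MyTransformer: adds a column Z = X + Y."""

    def fit(self, X, y=None):
        # Nothing to learn from the data, so fit is a no-op
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched
        result = X.copy()
        result["Z"] = result["X"] + result["Y"]
        return result


# Tiny example using the coordinates from the first row of the output above
df = pd.DataFrame({"X": [-122.425892], "Y": [37.774599]})
out = SumXYTransformer().fit_transform(df)
print(out)
```

Inheriting from `TransformerMixin` is what provides `fit_transform` here, so the class only needs to define `fit` and `transform`.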


In [30]:
# check if we can create and use an instance of dataframe selector
from sklearn.pipeline import Pipeline

num_features = ["X", "Y"]

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_features))
])

train_prepared = num_pipeline.transform(train_pd)

print(train_prepared[0:10])

[[-122.42589168   37.7745986 ]
 [-122.42589168   37.7745986 ]
 [-122.42436302   37.80041432]
 [-122.42699533   37.80087263]
 [-122.43873762   37.77154117]
 [-122.40325236   37.7134307 ]
 [-122.42332698   37.72513804]
 [-122.37127432   37.72756407]
 [-122.50819403   37.77660126]
 [-122.41908768   37.80780155]]


In [33]:
# check if we can print the source code for our dataframe selector class
import inspect
lines = inspect.getsource(DataFrameSelector)
print(lines)

class DataFrameSelector(BaseEstimator, TransformerMixin): 
    """
    Simple helper class, meant to make it easier to use Pandas
    along with sklearn Pipeline. Create and initiate with a
    list of features; then, when the pipeline transform function
    is called, it will return a Numpy array of the features.
    
    See Chap 2 transformation pipelines
    
    Example:
        train_pd = pd.read_csv("data.csv")
        num_features = ["X", "Y"]
        num_pipeline = Pipeline([
            ("selector", DataFrameSelector(num_features))
        ])
        train_prepared = num_pipeline.transform(train_pd)
        
    """
    def __init__(self, attribute_names): 
        self.attribute_names = attribute_names 
        
    def fit(self, X, y = None): 
        return self 
    
    def transform(self, X): 
        return X[self.attribute_names].values

