# About

Try out the custom classes that live in the src folder.

See:
* https://drivendata.github.io/cookiecutter-data-science/#getting-started
* https://medium.com/@rrfd/cookiecutter-data-science-organize-your-projects-atom-and-jupyter-2be7862f487e
* https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#create-env-file-manually
* https://packaging.python.org/tutorials/installing-packages/#installing-from-a-local-src-tree
* https://github.com/kennethreitz/setup.py


Steps to reproduce:

1. Use the environment.yml file to set up the project's Anaconda environment.
2. Use the cookiecutter template to set up the project itself, especially the setup.py file.
3. Run pip install --editable . from the project root to install the src package.
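Once the editable install has run, a quick check can confirm that the src package is visible to the notebook kernel. The following is a minimal sketch using only the standard library; the helper name is_importable is hypothetical and not part of the project.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module can be located on the current path."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# "json" ships with Python; "src" is only found after the editable install
# has been run in the active environment.
for name in ("json", "src"):
    print(name, "importable:", is_importable(name))
```

If src is reported as not importable, the usual suspects are a kernel running in a different environment or the install having been run outside the project root.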

In [1]:
# OPTIONAL: use inline matplotlib
%matplotlib inline

# OPTIONAL: Load the "autoreload" extension so that code can change
%reload_ext autoreload

# OPTIONAL: always reload modules so that as you change code in src, it gets loaded
%autoreload 2

import matplotlib.pyplot as plt

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline

from src.features.build_features import DataFrameSelector, MyTransformer, SFCCTransformer

In [2]:
# load the raw train and test sets (zipped CSV files)
train_pd = pd.read_csv("../data/raw/train.csv.zip", compression="zip")
test_pd = pd.read_csv("../data/raw/test.csv.zip", compression="zip")

# MyTransformer

Check that we can create an instance of MyTransformer and use it in a Pipeline.

In [27]:
silly = MyTransformer()

pipeline = Pipeline([
    ('silly', silly),
])

test = pipeline.transform(train_pd)
print(test.head(2))

                 Dates        Category                  Descript  DayOfWeek  \
0  2015-05-13 23:53:00        WARRANTS            WARRANT ARREST  Wednesday   
1  2015-05-13 23:53:00  OTHER OFFENSES  TRAFFIC VIOLATION ARREST  Wednesday   

  PdDistrict      Resolution             Address           X          Y  \
0   NORTHERN  ARREST, BOOKED  OAK ST / LAGUNA ST -122.425892  37.774599   
1   NORTHERN  ARREST, BOOKED  OAK ST / LAGUNA ST -122.425892  37.774599   

           Z  
0 -84.651293  
1 -84.651293  
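The source of MyTransformer is not shown in this notebook. In the output above, the Z column equals X + Y for both rows (-122.425892 + 37.774599 = -84.651293), so it appears to add a column of that kind. The sketch below is an assumption along those lines, not the actual class: SumColumnsTransformer and its parameters are hypothetical names.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class SumColumnsTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: adds a column holding the sum of two others."""

    def __init__(self, left="X", right="Y", target="Z"):
        self.left = left
        self.right = right
        self.target = target

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        out = X.copy()
        out[self.target] = out[self.left] + out[self.right]
        return out


demo = pd.DataFrame({"X": [-122.5, -122.25], "Y": [37.5, 37.75]})
demo_pipeline = Pipeline([("sum_columns", SumColumnsTransformer())])
result = demo_pipeline.fit_transform(demo)
print(result)
```

Inheriting from BaseEstimator and TransformerMixin supplies get_params, set_params and fit_transform, which is what lets a small class like this slot into a Pipeline.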


# DataFrameSelector

Check that we can create an instance of DataFrameSelector and use it in a Pipeline.

In [30]:
num_features = ["X", "Y"]

num_pipeline = Pipeline([
    ("selector", DataFrameSelector(num_features))
])

train_prepared = num_pipeline.transform(train_pd)

print(train_prepared[0:10])

[[-122.42589168   37.7745986 ]
 [-122.42589168   37.7745986 ]
 [-122.42436302   37.80041432]
 [-122.42699533   37.80087263]
 [-122.43873762   37.77154117]
 [-122.40325236   37.7134307 ]
 [-122.42332698   37.72513804]
 [-122.37127432   37.72756407]
 [-122.50819403   37.77660126]
 [-122.41908768   37.80780155]]


# Print Code for DataFrameSelector

Check that we can print the source code of our DataFrameSelector class.

* We may need to do this in the real notebook that we turn in to Isabell.

In [33]:
import inspect
lines = inspect.getsource(DataFrameSelector)
print(lines)

class DataFrameSelector(BaseEstimator, TransformerMixin): 
    """
    Simple helper class, meant to make it easier to use Pandas
    along with sklearn Pipeline. Create and initialize with a
    list of features; when the pipeline transform function
    is called, it returns a NumPy array of the features.
    
    See Chap 2 transformation pipelines
    
    Example:
        train_pd = pd.read_csv("data.csv")
        num_features = ["X", "Y"]
        num_pipeline = Pipeline([
            ("selector", DataFrameSelector(num_features))
        ])
        train_prepared = num_pipeline.transform(train_pd)
        
    """
    def __init__(self, attribute_names): 
        self.attribute_names = attribute_names 
        
    def fit(self, X, y = None): 
        return self 
    
    def transform(self, X): 
        return X[self.attribute_names].values



# SFCCTransformer
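The source of SFCCTransformer is not printed in this notebook. Judging only by the column names used in the cell below, it derives calendar features from the Dates column, with the *_delta columns apparently measuring time elapsed since the earliest record. The sketch below shows how a few of those columns could be computed with pandas; the function add_date_features and the exact definitions are assumptions, not the project's implementation.

```python
import pandas as pd


def add_date_features(df: pd.DataFrame, date_col: str = "Dates") -> pd.DataFrame:
    """Hypothetical helper: add a few calendar features derived from date_col."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])

    # Elapsed time since the earliest timestamp in the data (assumed definition).
    elapsed = dates - dates.min()
    out["hour_delta"] = (elapsed.dt.total_seconds() // 3600).astype(int)
    out["day_delta"] = elapsed.dt.days

    # Calendar components of each timestamp.
    out["hour_of_day"] = dates.dt.hour
    out["day_of_month"] = dates.dt.day
    out["month_of_year"] = dates.dt.month
    out["year"] = dates.dt.year

    # Monday is 0, so Saturday and Sunday are 5 and 6.
    out["is_weekend"] = dates.dt.dayofweek >= 5
    return out


sample = pd.DataFrame({"Dates": ["2015-05-13 23:53:00", "2015-05-16 01:00:00"]})
features = add_date_features(sample)
print(features)
```

Holiday flags such as is_holiday need a calendar of holidays and are left out of this sketch.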

In [62]:
sfcc = SFCCTransformer()

pipe = Pipeline([
    ("transformer", sfcc)
])

df = pipe.transform(train_pd)

res = df[["Dates", "DayOfWeek"
          , "hour_delta", "day_delta", "week_delta", "month_delta", "year_delta"
          , "hour_of_day", "day_of_month", "week_of_year", "month_of_year", "quarter_of_year", "year"
          , "is_weekend", "is_holiday"
         ]].sort_values("Dates")

print(res[["hour_delta", "day_delta", "week_delta", "month_delta", "year_delta"]].describe())
print(res[["hour_of_day", "day_of_month", "week_of_year", "month_of_year", "quarter_of_year", "year"]].describe())
print(res.groupby(["DayOfWeek", "is_weekend"]).size())
print(res.groupby(["is_holiday"]).size())


          hour_delta      day_delta     week_delta    month_delta  \
count  878049.000000  878049.000000  878049.000000  878049.000000   
mean    54271.786110    2260.778323     322.540805      73.797015   
std     31808.213578    1325.343365     189.330707      43.540851   
min         0.000000       0.000000       0.000000       0.000000   
25%     26426.000000    1101.000000     157.000000      36.000000   
50%     54063.000000    2252.000000     321.000000      74.000000   
75%     82666.000000    3444.000000     492.000000     113.000000   
max    108263.000000    4510.000000     644.000000     148.000000   

          year_delta  
count  878049.000000  
mean        5.708516  
std         3.630844  
min         0.000000  
25%         3.000000  
50%         6.000000  
75%         9.000000  
max        12.000000  
         hour_of_day   day_of_month   week_of_year  month_of_year  \
count  878049.000000  878049.000000  878049.000000  878049.000000   
mean       13.412655      15.5706