# Notebook 01: Generate Project Interval Data Splits

This notebook generates train-test splits of the cleaned NYC capital projects data, summarized over a specified time interval of project changes. The resulting dataframe contains one record for each project included in the interval. Each record holds the project's descriptive attributes, its forecasted budget and schedule at the start of the project, and a set of metrics describing the change over the measured interval.

## Instructions:

**To use this notebook in its current state:**

1. **Re-run the 01_EDA_clean_NYC_data notebook** to ensure you have the latest version of the cleansed dataset


2. **Set the `predict_interval` value in the "Read Dataset" section below** to the number of years the prediction interval should cover (set it to `None` to include all cumulative changes, regardless of observation year).


3. **Run the entire notebook.**


4. The resulting features dataframes will be saved to the `../data/interim/` and `../data/processed/` directories as outlined in the **outputs** section below.


### Inputs:

**`../data/interim/Capital_Projects_clean.csv`**

A cleansed version of the Capital_Projects.csv file, wherein each record represents one project change.

### Outputs:

1. **`../data/interim/NYC_capital_projects_3yr.csv`**


2. **`../data/interim/NYC_capital_projects_all.csv`** 


3. **`../data/processed/NYC_capital_projects_3yr_train.csv`**


4. **`../data/processed/NYC_capital_projects_3yr_test.csv`**


### Resulting data features

**For a summary of the data features included in the resulting dataframe, see the data dictionary printed by `print_interval_dict()` in the "Show resulting interval dataframe info" section below.**

<a name='index'></a>

## Notebook Index

1. <a href=#imports>Imports</a>


2. <a href=#read>Read cleansed dataset</a>


3. <a href=#run>Generate and save interval features dataframe</a>


4. <a href=#inspect>Show resulting dataframe info</a>


5. <a href=#cats>Consolidate and encode the categories</a>


6. <a href=#split>Generate and save a train test split of the data</a>

<a name='imports'></a>

## Imports
Imports for functions used in this notebook.

<a href=#index>index</a>

In [1]:
import os
import sys
import math
from datetime import datetime, timedelta

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split

import logging

# Avoid scientific notation output in Pandas
pd.options.display.float_format = '{:,.2f}'.format

# import custom local source modules
# sys.path.append('..')
from caproj.datagen import generate_interval_data, print_interval_dict
from caproj.scale import encode_categories


# Improve resolution of output graphics
%config InlineBackend.figure_format = 'retina'

<a name='read'></a>
## Read Dataset
Read the cleansed dataset, convert the date fields to datetimes, and sort the records by project and change index.

<a href=#index>index</a>

In [2]:
predict_interval = 3 # number of years for interval analysis (SET THIS TO 3)

file_path = '../data/interim/Capital_Projects_clean.csv'

# train and test splits are saved to the processed data directory
savefile_train = '../data/processed/NYC_capital_projects_{}yr_train.csv'.format(predict_interval)
savefile_test = '../data/processed/NYC_capital_projects_{}yr_test.csv'.format(predict_interval)

if os.path.isfile(file_path):
    print("OK - path points to file.")
else:
    print("ERROR - check the 'file_path' and ensure it points to the source file.")

OK - path points to file.


In [3]:
data = pd.read_csv(file_path)

In [4]:
# entries
print(f"Number of dataset records: {len(data)}")

# num projects
print(f"Number of unique projects in dataset: {len(data['PID'].unique())}")

Number of dataset records: 2095
Number of unique projects in dataset: 355
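
Because each record is one reported change, 2095 records across 355 projects works out to roughly 5.9 change records per project on average. The sketch below shows one way the per-project distribution could be inspected. It uses a small made-up frame in place of `data`, so the values are illustrative only.

```python
import pandas as pd

# Toy stand-in for `data`: one row per reported project change (values are made up)
toy = pd.DataFrame({
    "PID": [3, 3, 3, 7, 7, 91],
    "PID_Index": [0, 1, 2, 0, 1, 0],
})

# Number of change records for each project
records_per_project = toy.groupby("PID").size()

n_records = len(toy)
n_projects = toy["PID"].nunique()
mean_records = n_records / n_projects

print(records_per_project)
print(f"Mean records per project: {mean_records:.2f}")
```

Running the same `groupby("PID").size()` call on `data` would show how unevenly the change records are spread across projects.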


In [5]:
datetime_cols = [
    'Date_Reported_As_Of',
    'Design_Start',
    'Original_Schedule',
    'Forecast_Completion'
]

for col in datetime_cols:
    data[col] = pd.to_datetime(data[col])
    
# make sure data is sorted properly
data = data.sort_values(by=['PID', 'PID_Index'])
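
The conversion and sort above can be sanity-checked before moving on. The following is a minimal sketch on a made-up frame, not part of the original notebook: `errors="coerce"` is an optional `pd.to_datetime` argument that turns unparseable strings into `NaT` so they can be counted, and the final check confirms that `PID_Index` increases within each project after sorting.

```python
import pandas as pd

# Toy stand-in for `data` with one unparseable date (values are made up)
toy = pd.DataFrame({
    "PID": [7, 3, 3],
    "PID_Index": [0, 1, 0],
    "Date_Reported_As_Of": ["2015-02-01", "2015-02-01", "not a date"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising
toy["Date_Reported_As_Of"] = pd.to_datetime(
    toy["Date_Reported_As_Of"], errors="coerce"
)
n_bad_dates = int(toy["Date_Reported_As_Of"].isna().sum())

# Sort, then confirm PID_Index increases within every project
toy = toy.sort_values(by=["PID", "PID_Index"])
is_ordered = bool(
    toy.groupby("PID")["PID_Index"]
    .apply(lambda s: s.is_monotonic_increasing)
    .all()
)

print(f"Unparseable dates: {n_bad_dates}, ordered within project: {is_ordered}")
```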

In [6]:
data.info()
data.head(3)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2095 entries, 0 to 2094
Data columns (total 24 columns):
 #   Column                   Non-Null Count  Dtype         
---  ------                   --------------  -----         
 0   Record_ID                2095 non-null   object        
 1   Date_Reported_As_Of      2095 non-null   datetime64[ns]
 2   PID                      2095 non-null   int64         
 3   Project_Name             2095 non-null   object        
 4   Description              2095 non-null   object        
 5   Category                 2095 non-null   object        
 6   Borough                  2095 non-null   object        
 7   Managing_Agency          2095 non-null   object        
 8   Client_Agency            2095 non-null   object        
 9   Current_Phase            2095 non-null   object        
 10  Design_Start             2095 non-null   datetime64[ns]
 11  Original_Budget          2095 non-null   float64       
 12  Budget_Forecast          2095 non-

Unnamed: 0,Record_ID,Date_Reported_As_Of,PID,Project_Name,Description,Category,Borough,Managing_Agency,Client_Agency,Current_Phase,...,Total_Budget_Changes,Original_Schedule,Forecast_Completion,Latest_Schedule_Changes,Total_Schedule_Changes,PID_Index,Change_Years,Change_Year,Current_Project_Years,Current_Project_Year
0,3-0,2014-05-01,3,26th Ward Waste Water Treatment Plant Prelimin...,The 26th Ward WWTP is mandated to be upgraded ...,Wastewater Treatment,Brooklyn,DEP,DEP,2-Design,...,-4318643.37,2020-01-13,2020-01-14,1.0,270.0,0,0.6,1,5.94,6
1,3-1,2015-02-01,3,26th Ward Waste Water Treatment Plant Prelimin...,The 26th Ward WWTP is mandated to be upgraded ...,Wastewater Treatment,Brooklyn,DEP,DEP,3-Construction Procurement,...,-4318643.37,2020-01-13,2020-07-19,187.0,270.0,1,1.36,2,5.94,6
2,3-2,2015-08-01,3,26th Ward Waste Water Treatment Plant Prelimin...,The 26th Ward WWTP is mandated to be upgraded ...,Wastewater Treatment,Brooklyn,DEP,DEP,3-Construction Procurement,...,-4318643.37,2020-01-13,2020-08-08,20.0,270.0,2,1.85,2,5.94,6


<a name='run'></a>

## Generate and save interval features dataframe

<a href=#index>index</a>

In [7]:
# save interval dataframe for 3yr predictions
df_feat_3yr = generate_interval_data(
    data,
    change_year_interval=predict_interval,
    to_csv=True,
)

# save dataframe with entire data set as the interval
df_feat_all = generate_interval_data(
    data,
    change_year_interval=None,
    to_csv=True,
)

The number of unique projects in the resulting dataframe: 149

The resulting interval features dataframe was saved to .csv at:

	../data/interim/NYC_capital_projects_3yr.csv

The number of unique projects in the resulting dataframe: 355

The resulting interval features dataframe was saved to .csv at:

	../data/interim/NYC_capital_projects_all.csv
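
The 3-year dataframe holds fewer projects (149) than the full-interval dataframe (355). One plausible reason is that only projects old enough to have a full 3-year history are kept. The sketch below illustrates that general idea on made-up data. The helper `sketch_interval_summary`, its filtering rule, and the use of `Budget_Forecast` for the start and end values are all assumptions made for illustration, not the actual logic inside `caproj.datagen.generate_interval_data`.

```python
import pandas as pd


def sketch_interval_summary(df, interval=None):
    """Hypothetical helper: one row per project summarizing change over an interval."""
    if interval is not None:
        # Assumption: keep projects at least `interval` years old, and only
        # the change records reported within the first `interval` years
        df = df[
            (df["Current_Project_Year"] >= interval)
            & (df["Change_Year"] <= interval)
        ]
    df = df.sort_values(["PID", "PID_Index"])
    summary = df.groupby("PID").agg(
        Budget_Start=("Budget_Forecast", "first"),
        Budget_End=("Budget_Forecast", "last"),
        Number_Changes=("PID_Index", "count"),
    )
    summary["Budget_Change"] = summary["Budget_End"] - summary["Budget_Start"]
    return summary.reset_index()


# Toy change records (values are made up)
toy = pd.DataFrame({
    "PID": [3, 3, 3, 7, 7],
    "PID_Index": [0, 1, 2, 0, 1],
    "Change_Year": [1, 2, 5, 1, 2],
    "Current_Project_Year": [6, 6, 6, 2, 2],
    "Budget_Forecast": [100.0, 120.0, 150.0, 50.0, 55.0],
})

summary_3yr = sketch_interval_summary(toy, interval=3)
summary_all = sketch_interval_summary(toy, interval=None)
```

In this toy example, project 7 is only in its second year, so it drops out of the 3-year summary but stays in the full-interval one.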



<a name='inspect'></a>

## Show resulting interval dataframe info

<a href=#index>index</a>

In [8]:
df_feat_3yr.info()
df_feat_3yr.head(3)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149 entries, 0 to 148
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   PID                    149 non-null    int64         
 1   Project_Name           149 non-null    object        
 2   Description            149 non-null    object        
 3   Category               149 non-null    object        
 4   Borough                149 non-null    object        
 5   Managing_Agency        149 non-null    object        
 6   Client_Agency          149 non-null    object        
 7   Phase_Start            149 non-null    object        
 8   Current_Project_Years  149 non-null    float64       
 9   Current_Project_Year   149 non-null    int64         
 10  Design_Start           149 non-null    datetime64[ns]
 11  Budget_Start           149 non-null    float64       
 12  Schedule_Start         149 non-null    datetime64[ns]
 13  Final

Unnamed: 0,PID,Project_Name,Description,Category,Borough,Managing_Agency,Client_Agency,Phase_Start,Current_Project_Years,Current_Project_Year,...,Schedule_Change,Budget_Change,Schedule_Change_Ratio,Budget_Change_Ratio,Budget_Abs_Per_Error,Budget_Rel_Per_Error,Duration_End_Ratio,Budget_End_Ratio,Duration_Ratio_Inv,Budget_Ratio_Inv
0,3,26th Ward Waste Water Treatment Plant Prelimin...,The 26th Ward WWTP is mandated to be upgraded ...,Wastewater Treatment,Brooklyn,DEP,DEP,2-Design,5.94,6,...,91,-15619967.29,0.04,-0.08,0.09,0.08,1.04,0.92,-0.04,0.09
1,7,Bowery Bay Waste Water Treatment Plant Main Se...,The existing Main Sewage Pumps have deteriorat...,Wastewater Treatment,Queens,DEP,DEP,2-Design,6.34,7,...,334,9618785.3,0.19,0.18,0.15,0.18,1.19,1.18,-0.16,-0.15
2,91,Mill Basin Bridge Replacement,Complete replacement of bascule bridge with a ...,Bridges,Brooklyn,DOT,not_specified,2-Design,7.44,8,...,247,-34672662.0,0.09,-0.09,0.1,0.09,1.09,0.91,-0.08,0.1


In [9]:
print_interval_dict()

DATA DICTIONARY: GENERATED INTERVAL DATASETS

0: PID (int64)

	The project identification number


1: Project_Name (object)

	The name of the project


2: Description (object)

	A brief written description of the project


3: Category (object)

	The type of project (i.e. bridges, roadways, schools, etc.)


4: Borough (object)

	The primary borough location for the project


5: Managing_Agency (object)

	The primary managing agency for the project


6: Client_Agency (object)

	The primary client agency for the project


7: Phase_Start (object)

	The project phase at the time of the first change record recorded


8: Current_Project_Years (float64)

	The number of years since (in decimal form) the Design_Start date (i.e. the age of the project when the dataset was compiled)


9: Current_Project_Year (int64)

	The number of years since (in integer form) the Design_Start date


10: Design_Start (datetime64)

	The date at which the design phase of the project began


11: Budget_Start (float6

<a name='cats'></a>

## Consolidate and encode categories

<a href=#index>index</a>

In [10]:
print(
    'The original {} project categories and associated project counts:\n\n'\
    '{}\n\n'.format(
        df_feat_3yr['Category'].nunique(),
        df_feat_3yr['Category'].value_counts()
    )
)

# Fold compound labels into a single category and group several
# facility types under one shared label
rename_cat_dict = {
    'Bridges, Streets and Roadways': 'Bridges',
    'Parks, Streets and Roadways': 'Parks',
    'Industrial Development, Parks': 'Industrial Development',
    'Other Government Facilities': 'Other Govt Facilities and Improvements',
    'Public Safety and Criminal Justice': 'Other Govt Facilities and Improvements',
    'Arts and Culture': 'Other Govt Facilities and Improvements',
    'Health and Hospitals': 'Other Govt Facilities and Improvements',
}

# Keep the original labels, then remap; categories missing from the
# dict map to NaN and are restored to their original name by fillna
df_feat_3yr['Category_Old'] = df_feat_3yr['Category'].copy()
df_feat_3yr['Category'] = df_feat_3yr['Category'].map(rename_cat_dict).fillna(df_feat_3yr['Category'])

print(
    'The newly mapped {} project categories and project counts:\n\n'\
    '{}\n'.format(
        df_feat_3yr['Category'].nunique(),
        df_feat_3yr['Category'].value_counts()
    )
)

The original 17 project categories and associated project counts:

Streets and Roadways                  31
Sewers                                20
Schools                               15
Industrial Development                15
Water Supply                          13
Wastewater Treatment                  13
Bridges, Streets and Roadways          9
Bridges                                7
Sanitation                             6
Public Safety and Criminal Justice     4
Other Government Facilities            4
Ferries                                3
Health and Hospitals                   3
Arts and Culture                       2
Parks                                  2
Parks, Streets and Roadways            1
Industrial Development, Parks          1
Name: Category, dtype: int64


The newly mapped 11 project categories and project counts:

Streets and Roadways                      31
Sewers                                    20
Industrial Development                    16
Bridges    
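
The remapping above relies on `map` followed by `fillna`, because `map` returns `NaN` for any label that is missing from the dictionary. A small sketch on a made-up series shows the behavior, along with `Series.replace` as an alternative that leaves unmapped labels untouched. The shortened `rename_map` here is illustrative, not the full dictionary used above.

```python
import pandas as pd

categories = pd.Series(
    ["Bridges, Streets and Roadways", "Sewers", "Arts and Culture", "Bridges"]
)
rename_map = {
    "Bridges, Streets and Roadways": "Bridges",
    "Arts and Culture": "Other Govt Facilities and Improvements",
}

# map() returns NaN for labels missing from the dict; fillna() restores them
mapped = categories.map(rename_map).fillna(categories)

# replace() leaves unmapped labels untouched, so no fillna() is needed
replaced = categories.replace(rename_map)

print(mapped.value_counts())
```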

In [11]:
# One hot encode categorical variables
drop_cat = 'Other Govt Facilities and Improvements'
one_hot = True

df_feat_3yr = encode_categories(
    df_feat_3yr, colname='Category',
    one_hot=one_hot,
    drop_cat=drop_cat,
    drop_original_col=False,
)

df_feat_3yr.columns

Index(['PID', 'Project_Name', 'Description', 'Category', 'Borough',
       'Managing_Agency', 'Client_Agency', 'Phase_Start',
       'Current_Project_Years', 'Current_Project_Year', 'Design_Start',
       'Budget_Start', 'Schedule_Start', 'Final_Change_Date',
       'Final_Change_Years', 'Phase_End', 'Budget_End', 'Schedule_End',
       'Number_Changes', 'Duration_Start', 'Duration_End', 'Schedule_Change',
       'Budget_Change', 'Schedule_Change_Ratio', 'Budget_Change_Ratio',
       'Budget_Abs_Per_Error', 'Budget_Rel_Per_Error', 'Duration_End_Ratio',
       'Budget_End_Ratio', 'Duration_Ratio_Inv', 'Budget_Ratio_Inv',
       'Category_Old', 'Bridges', 'Ferries', 'Industrial_Development', 'Parks',
       'Sanitation', 'Schools', 'Sewers', 'Streets_and_Roadways',
       'Wastewater_Treatment', 'Water_Supply'],
      dtype='object')

In [12]:
# Add a single coded category column (Category_Code) rather than one-hot columns
drop_cat = 'Other Govt Facilities and Improvements'
one_hot = False

df_feat_3yr = encode_categories(
    df_feat_3yr, colname='Category',
    one_hot=one_hot,
    drop_cat=drop_cat,
    drop_original_col=False
)

df_feat_3yr.columns

Index(['PID', 'Project_Name', 'Description', 'Category', 'Borough',
       'Managing_Agency', 'Client_Agency', 'Phase_Start',
       'Current_Project_Years', 'Current_Project_Year', 'Design_Start',
       'Budget_Start', 'Schedule_Start', 'Final_Change_Date',
       'Final_Change_Years', 'Phase_End', 'Budget_End', 'Schedule_End',
       'Number_Changes', 'Duration_Start', 'Duration_End', 'Schedule_Change',
       'Budget_Change', 'Schedule_Change_Ratio', 'Budget_Change_Ratio',
       'Budget_Abs_Per_Error', 'Budget_Rel_Per_Error', 'Duration_End_Ratio',
       'Budget_End_Ratio', 'Duration_Ratio_Inv', 'Budget_Ratio_Inv',
       'Category_Old', 'Bridges', 'Ferries', 'Industrial_Development', 'Parks',
       'Sanitation', 'Schools', 'Sewers', 'Streets_and_Roadways',
       'Wastewater_Treatment', 'Water_Supply', 'Category_Code'],
      dtype='object')
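
For readers without access to `caproj.scale`, the two encodings can be approximated with plain pandas. This is a rough sketch under stated assumptions, not the implementation of `encode_categories`: the underscore column naming mirrors the output above, while the alphabetical ordering of the integer codes is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({"Category": ["Sewers", "Bridges", "Water Supply", "Sewers"]})

# One-hot columns, with spaces replaced by underscores as in the output above
dummies = pd.get_dummies(toy["Category"]).astype(int)
dummies.columns = [c.replace(" ", "_") for c in dummies.columns]

# Dropping one reference category avoids perfectly collinear indicator columns
dummies = dummies.drop(columns=["Water_Supply"])

# Single integer code per category (alphabetical order here; the ordering
# used by `encode_categories` is not shown in this notebook)
codes = toy["Category"].astype("category").cat.codes

encoded = pd.concat([toy, dummies], axis=1).assign(Category_Code=codes)
print(encoded)
```

Dropping a reference category, as `drop_cat` does above, matters mostly for linear models, where a full set of indicator columns is perfectly collinear with the intercept.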

<a name='split'></a>

## Perform train-test split and save resulting dataframes

<a href=#index>index</a>

In [13]:
random_state = 109
test_size = 0.1

data_train, data_test = train_test_split(
    df_feat_3yr,
    test_size=test_size,
    random_state=random_state,
    shuffle=True,
)

In [14]:
print('{}\t{}'.format(data_train.shape, data_test.shape))

(134, 43)	(15, 43)
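
The 134/15 split follows from how scikit-learn handles a fractional `test_size`: the test count is rounded up and the remainder goes to training. The sketch below reproduces the arithmetic on a plain array of 149 row positions, which stands in for the dataframe.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 149
test_size = 0.1

# scikit-learn rounds the test count up and gives the remainder to training
expected_test = math.ceil(test_size * n_samples)
expected_train = n_samples - expected_test

# Stand-in for the dataframe: just the row positions
rows = np.arange(n_samples)
train_rows, test_rows = train_test_split(
    rows, test_size=test_size, random_state=109, shuffle=True
)

print(expected_train, expected_test, len(train_rows), len(test_rows))
```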


In [15]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 14 to 6
Data columns (total 43 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   PID                     134 non-null    int64         
 1   Project_Name            134 non-null    object        
 2   Description             134 non-null    object        
 3   Category                134 non-null    object        
 4   Borough                 134 non-null    object        
 5   Managing_Agency         134 non-null    object        
 6   Client_Agency           134 non-null    object        
 7   Phase_Start             134 non-null    object        
 8   Current_Project_Years   134 non-null    float64       
 9   Current_Project_Year    134 non-null    int64         
 10  Design_Start            134 non-null    datetime64[ns]
 11  Budget_Start            134 non-null    float64       
 12  Schedule_Start          134 non-null    datetime64[

In [16]:
data_train.to_csv(savefile_train, index=True)
data_test.to_csv(savefile_test, index=True)
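
Because the files are written with `index=True`, the original row index is stored as the first CSV column, and datetime columns are stored as text. A later notebook would need to undo both when reading the files back. The sketch below shows one way to do that, using an in-memory buffer and made-up values in place of the saved files.

```python
import io

import pandas as pd

# Toy stand-in for `data_train` with a shuffled index (values are made up)
toy_train = pd.DataFrame(
    {
        "PID": [91, 3],
        "Design_Start": pd.to_datetime(["2012-07-01", "2014-05-01"]),
    },
    index=[14, 6],
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy_train.to_csv(buffer, index=True)
buffer.seek(0)

# index_col=0 restores the saved index; parse_dates restores datetime columns
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["Design_Start"])
print(reloaded.dtypes)
```

Without `index_col=0`, the saved index would come back as an extra `Unnamed: 0` column.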