# NYC Capital Projects

## Notebook 06: Merge Engineered Features Into Final NYC Capital Projects Datasets 

This notebook merges the cleansed NYC capital projects 3-year interval training and test datasets with the engineered features generated in the preceding notebooks. Those engineered features include the $k$-means-generated reference class labels; the PCA-, autoencoder-, and UMAP-generated 2-dimensional encoded BERT embeddings; and the UMAP+HDBSCAN reference class labels. The resulting merged training and test datasets are used for predictive modeling in all subsequent notebooks for this project.

### Project authors

- [An Hoang](https://github.com/hoangthienan95)
- [Mark McDonald](https://github.com/mcdomx)
- [Mike Sedelmeyer](https://github.com/sedelmeyer)


### Inputs:

The following files are required to successfully run this notebook.

- ``../data/interim/NYC_capital_projects_3yr_train.csv``

    The training split of our 3-year interval subsetted project data.


- ``../data/interim/NYC_capital_projects_3yr_test.csv``

    The test split of our 3-year interval subsetted project data.


- ``../data/interim/kmeans3_attribute_labels_train.csv``

    The $k$-means labels corresponding to each project record in our 3-year training split.


- ``../data/interim/kmeans3_attribute_labels_test.csv``

    The $k$-means labels corresponding to each project record in our 3-year test split.


- ``../data/interim/ae_pca_encoded_embed_train.csv``

    The PCA and autoencoder 2-dimensional feature values corresponding to each project record in our 3-year training split.


- ``../data/interim/ae_pca_encoded_embed_test.csv``

    The PCA and autoencoder 2-dimensional feature values corresponding to each project record in our 3-year test split.


- ``../data/interim/UMAP_embeddings_NYC_capital_projects_3yr_train.csv``

    The UMAP 2-dimensional encoded embeddings and the UMAP+HDBSCAN reference class cluster labels corresponding to each project record in our 3-year training split.


- ``../data/interim/UMAP_embeddings_NYC_capital_projects_3yr_test.csv``

    The UMAP 2-dimensional encoded embeddings and the UMAP+HDBSCAN reference class cluster labels corresponding to each project record in our 3-year test split.


### Outputs:

The following files are generated by executing the code in this notebook.

- ``../data/processed/NYC_capital_projects_3yr_final_train.csv``

    The final training dataset, including all engineered features, for use in subsequent prediction models.


- ``../data/processed/NYC_capital_projects_3yr_final_test.csv``

    The final test dataset, including all engineered features, for use in subsequent prediction models.


# Notebook contents

1. [Imports](#Imports)

2. [Read and merge datasets](#Read-and-merge-datasets)

3. [Save merged datasets](#Save-merged-datasets)

4. [Print data dictionary for resulting dataset](#Print-data-dictionary-for-resulting-dataset)

# Imports

[Return to top](#Notebook-contents)

In [1]:
from functools import reduce
import os

import pandas as pd

import caproj.datagen

# Read and merge datasets

Load all required datasets, then merge the engineered features into the NYC capital projects training and test datasets.

[Return to top](#Notebook-contents)

In [2]:
files_needed = set(
    [
        'NYC_capital_projects_3yr_test',
        'NYC_capital_projects_3yr_train',
        'ae_pca_encoded_embed_test',
        'ae_pca_encoded_embed_train',
        'UMAP_embeddings_NYC_capital_projects_3yr_test',
        'UMAP_embeddings_NYC_capital_projects_3yr_train',
        'kmeans3_attribute_labels_test',
        'kmeans3_attribute_labels_train',
    ]
)

files_needed_paths = [f"../data/interim/{file}.csv" for file in files_needed]

savepath_train = "../data/processed/NYC_capital_projects_3yr_final_train.csv"
savepath_test = "../data/processed/NYC_capital_projects_3yr_final_test.csv"

In [3]:
# Check that every required input file exists before loading, so that
# missing inputs are reported together rather than failing mid-run
path_errors = [
    filepath for filepath in files_needed_paths if not os.path.isfile(filepath)
]

if not path_errors:
    print("OK - all 'files_needed_paths' point to existing files!")
else:
    raise ValueError(
        "The following target paths do not exist...\n\n\t{}\n".format(path_errors)
    )

OK - all 'files_needed_paths' point to existing files!
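
The same existence check can also be written with `pathlib`. The following is a minimal sketch, not part of the original notebook: the helper name `find_missing` and the temporary file names are hypothetical and are used only to demonstrate the behaviour.

```python
from pathlib import Path
import tempfile


def find_missing(paths):
    """Return the paths (as strings) that do not point to an existing file."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Demonstration in a temporary directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "present.csv"
    present.write_text("PID\n1\n")
    absent = Path(tmp) / "absent.csv"

    missing = find_missing([present, absent])

print(missing)
```

Collecting every missing path before raising makes it possible to fix all of them in one pass.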


In [4]:
data_dict = {"train": {}, "test": {}}

print('Dataframes added to data dictionary:\n')

# sorted() places the base 'NYC_...' files first, so they become the
# left-most dataframe when the dictionary values are merged later
for file in sorted(files_needed_paths):
    file_name = os.path.splitext(os.path.basename(file))[0]

    if file_name.startswith("NYC"):
        # base project data: parse date columns and drop the unnamed index column
        date_cols = [
            'Design_Start',
            'Final_Change_Date',
            'Schedule_Start',
            'Schedule_End',
        ]
        drop_col = "Unnamed: 0"
    else:
        # engineered feature files (k-means, PCA/autoencoder, UMAP): load as-is
        date_cols = []
        drop_col = []

    df = pd.read_csv(file, parse_dates=date_cols).drop(columns=drop_col)

    # route each dataframe to its split based on the file name suffix
    split = file_name.split("_")[-1]
    if split in ("train", "test"):
        data_dict[split][file_name] = df
    else:
        data_dict[file_name] = df

    print(f'\t{file_name}')

Dataframes added to data dictionary:

	NYC_capital_projects_3yr_test
	NYC_capital_projects_3yr_train
	UMAP_embeddings_NYC_capital_projects_3yr_test
	UMAP_embeddings_NYC_capital_projects_3yr_train
	ae_pca_encoded_embed_test
	ae_pca_encoded_embed_train
	kmeans3_attribute_labels_test
	kmeans3_attribute_labels_train
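
To illustrate the two mechanics the loading cell relies on, the sketch below reads a small in-memory CSV with `parse_dates` and derives the split from a file name suffix. The two-row CSV content and the helper `split_of` are hypothetical stand-ins, not the project's data or code.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for one of the 'NYC_...' interim CSV files.
csv_text = (
    "Unnamed: 0,PID,Design_Start\n"
    "0,3,2015-01-15\n"
    "1,7,2016-06-30\n"
)

# parse_dates converts the named column to a datetime dtype while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Design_Start"]).drop(
    columns="Unnamed: 0"
)


def split_of(file_name):
    """Return 'train' or 'test' from a file name suffix, else None."""
    suffix = file_name.rsplit("_", 1)[-1]
    return suffix if suffix in ("train", "test") else None


print(df.dtypes)
print(split_of("kmeans3_attribute_labels_train"))
```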


In [5]:
data_dict["train"].keys()

dict_keys(['NYC_capital_projects_3yr_train', 'UMAP_embeddings_NYC_capital_projects_3yr_train', 'ae_pca_encoded_embed_train', 'kmeans3_attribute_labels_train'])

In [6]:
data_dict["test"].keys()

dict_keys(['NYC_capital_projects_3yr_test', 'UMAP_embeddings_NYC_capital_projects_3yr_test', 'ae_pca_encoded_embed_test', 'kmeans3_attribute_labels_test'])

In [7]:
data_dict["train"]['UMAP_embeddings_NYC_capital_projects_3yr_train'].columns[
    data_dict["train"]['UMAP_embeddings_NYC_capital_projects_3yr_train'].columns.str.contains("label")
]

Index(['attribute_clustering_label'], dtype='object')

In [8]:
# keep only PID, the 2D UMAP features, and the cluster label (cast to string)
umap_df_train = data_dict["train"]['UMAP_embeddings_NYC_capital_projects_3yr_train']
umap_df_train['attribute_clustering_label'] = umap_df_train['attribute_clustering_label'].astype("str") 
data_dict["train"]['UMAP_embeddings_NYC_capital_projects_3yr_train'] = umap_df_train[
    ["PID"] + list(
        umap_df_train.columns[
            umap_df_train.columns.str.startswith("umap_attributes_2D") | 
            umap_df_train.columns.str.startswith("umap_descr_2D") 
        ]
    ) + ['attribute_clustering_label']
]

# apply the same filter to the test split
umap_df_test = data_dict["test"]['UMAP_embeddings_NYC_capital_projects_3yr_test']
umap_df_test['attribute_clustering_label'] = umap_df_test['attribute_clustering_label'].astype("str") 
data_dict["test"]['UMAP_embeddings_NYC_capital_projects_3yr_test'] = umap_df_test[
    ["PID"]  + list(
        umap_df_test.columns[
            umap_df_test.columns.str.startswith("umap_attributes_2D") |
            umap_df_test.columns.str.startswith("umap_descr_2D")
        ]
    ) + ['attribute_clustering_label']
]
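
Because the train and test filters are identical, the selection could be factored into a single helper. This is a sketch under assumptions: the function name `select_umap_2d` and the toy column suffixes (`_embed_1`) are hypothetical, and only the prefixes and the `attribute_clustering_label` column come from the cell above.

```python
import pandas as pd


def select_umap_2d(
    df,
    prefixes=("umap_attributes_2D", "umap_descr_2D"),
    label_col="attribute_clustering_label",
):
    """Return PID, the 2D UMAP feature columns, and the string-cast label."""
    out = df.copy()  # leave the caller's dataframe untouched
    out[label_col] = out[label_col].astype(str)
    feature_cols = [c for c in out.columns if c.startswith(prefixes)]
    return out[["PID"] + feature_cols + [label_col]]


# Toy frame with hypothetical column names to exercise the helper.
toy = pd.DataFrame(
    {
        "PID": [1, 2],
        "umap_attributes_2D_embed_1": [0.1, 0.2],
        "umap_attributes_3D_embed_1": [0.3, 0.4],
        "umap_descr_2D_embed_1": [0.5, 0.6],
        "attribute_clustering_label": [0, -1],
    }
)

selected = select_umap_2d(toy)
print(list(selected.columns))
```

Working on a copy avoids mutating the dataframe stored in `data_dict` as a side effect of the cast.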


In [9]:
df_train_merged = reduce(
    lambda left, right: pd.merge(left.copy(), right.copy(), on='PID', how='left'),
    data_dict["train"].values(),
)

assert df_train_merged.isnull().sum().sum() == 0
assert df_train_merged.shape == (134,53)


In [10]:
df_test_merged = reduce(
    lambda left, right: pd.merge(left.copy(), right.copy(), on='PID', how='left'),
    data_dict["test"].values(),
)

assert df_test_merged.isnull().sum().sum() == 0
assert df_test_merged.shape == (15,53)
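
The shape assertions above catch unexpected row counts after the merge. A left join silently adds rows when `PID` is duplicated in a right-hand frame, so one option is to have pandas reject duplicates during the merge itself through the `validate` argument of `pd.merge`. The sketch below is an illustration rather than the notebook's method: the helper `merge_on_pid` and the toy frames are hypothetical.

```python
from functools import reduce

import pandas as pd


def merge_on_pid(frames):
    """Left-merge frames on PID, raising if PID is duplicated on either side."""
    return reduce(
        lambda left, right: pd.merge(
            left, right, on="PID", how="left", validate="one_to_one"
        ),
        list(frames),
    )


# Toy frames standing in for the base data and two engineered-feature files.
base = pd.DataFrame({"PID": [1, 2, 3], "Budget_Start": [10.0, 20.0, 30.0]})
labels = pd.DataFrame({"PID": [3, 1, 2], "label": ["a", "b", "c"]})
embed = pd.DataFrame({"PID": [1, 2, 3], "x": [0.1, 0.2, 0.3]})

merged = merge_on_pid([base, labels, embed])

# A right-hand frame with a repeated PID is rejected instead of adding rows.
dupes = pd.DataFrame({"PID": [1, 1, 2, 3], "label": list("abcd")})
try:
    merge_on_pid([base, dupes])
    duplicate_caught = False
except pd.errors.MergeError:
    duplicate_caught = True

print(merged)
print(duplicate_caught)
```

The first frame in the sequence acts as the base of the left join, so the order of the frames determines which rows are kept.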


In [11]:
df_train_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 134 entries, 0 to 133
Data columns (total 53 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   PID                         134 non-null    int64         
 1   Project_Name                134 non-null    object        
 2   Description                 134 non-null    object        
 3   Category                    134 non-null    object        
 4   Borough                     134 non-null    object        
 5   Managing_Agency             134 non-null    object        
 6   Client_Agency               134 non-null    object        
 7   Phase_Start                 134 non-null    object        
 8   Current_Project_Years       134 non-null    float64       
 9   Current_Project_Year        134 non-null    int64         
 10  Design_Start                134 non-null    datetime64[ns]
 11  Budget_Start                134 non-null    float64       

In [12]:
df_test_merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15 entries, 0 to 14
Data columns (total 53 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   PID                         15 non-null     int64         
 1   Project_Name                15 non-null     object        
 2   Description                 15 non-null     object        
 3   Category                    15 non-null     object        
 4   Borough                     15 non-null     object        
 5   Managing_Agency             15 non-null     object        
 6   Client_Agency               15 non-null     object        
 7   Phase_Start                 15 non-null     object        
 8   Current_Project_Years       15 non-null     float64       
 9   Current_Project_Year        15 non-null     int64         
 10  Design_Start                15 non-null     datetime64[ns]
 11  Budget_Start                15 non-null     float64       
 

# Save merged datasets

[Return to top](#Notebook-contents)

In [13]:
df_train_merged.to_csv(savepath_train)
df_test_merged.to_csv(savepath_test)
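
By default, `to_csv` also writes the dataframe index, which `read_csv` then returns as a column named `Unnamed: 0` (the same column name the loading cell drops from the `NYC_...` interim files). The sketch below shows both behaviours on a hypothetical toy frame. Whether later notebooks expect that extra column is not stated here, so this is only an illustration of the difference.

```python
import io

import pandas as pd

# Toy frame with hypothetical values.
df = pd.DataFrame({"PID": [11, 12], "Budget_Start": [1.5, 2.5]})

# Default behaviour: the index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# index=False omits the index, so the round trip returns the same columns.
buf_no_index = io.StringIO()
df.to_csv(buf_no_index, index=False)
buf_no_index.seek(0)
without_index = pd.read_csv(buf_no_index)

print(list(with_index.columns))
print(list(without_index.columns))
```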

# Print data dictionary for resulting dataset

[Return to top](#Notebook-contents)

For ease of reference, the data dictionary for all features included in the resulting merged dataset is printed below.

In [14]:
caproj.datagen.print_data_dictionary(
    '../references/documentation/data_dict_final_features.csv'
)

DATA DICTIONARY: GENERATED INTERVAL DATASETS

0: PID (int64)

	The project identification number


1: Project_Name (object)

	The name of the project


2: Description (object)

	A brief written description of the project


3: Category (object)

	The type of project (i.e. bridges, roadways, schools, etc.)


4: Borough (object)

	The primary borough location for the project


5: Managing_Agency (object)

	The primary managing agency for the project


6: Client_Agency (object)

	The primary client agency for the project


7: Phase_Start (object)

	The project phase at the time of the first change record recorded


8: Current_Project_Years (float64)

	The number of years since (in decimal form) the Design_Start date (i.e. the age of the project when the dataset was compiled)


9: Current_Project_Year (int64)

	The number of years since (in integer form) the Design_Start date


10: Design_Start (datetime64)

	The date at which the design phase of the project began


11: Budget_Start (float6