# CIT Data Pipeline: Formatting

In this notebook, we ingest the available, pre-populated data and format it for upload to SQL.

In [190]:
import pandas as pd
import numpy as np

import requests

from termcolor import colored

In [191]:
files = pd.read_excel("CIT_Newly_added_Catalog_0521.xlsx")
files

Unnamed: 0,Plan Name,Date Added,Suggested By,Url,Plan Resolution,Planning Method,Land Conservation,Unnamed: 7,Unnamed: 8,RESTORE GOALS,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,,NaT,,,,,Aquisition,Easement,Stewardship,Habitat,Water Quality,Resources/Species,Community Resilience,Gulf Economy,Code
1,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
2,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
3,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
4,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291,Green Links Regional CLIP Database,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/panamacity/resources/Green...,geopolitical,,,,yes,"assist conservation, listed species, green inf...",,"assist conservation, listed species, green inf...","assist conservation, listed species, green inf...",,REG
292,Waterbird Conservation for the Americas: North...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/migratorybirds/pdf/managem...,geopolitical,,yes,,yes,"protect, restore, and manage habitat",,"protect, restore, and manage populations",education and outreach,,REG
293,West Florida Comprehensive Economic Developmen...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.ecrc.org/document_center/Programs/...,geopolitical,,,,yes,,,resource protection and sustainability as econ...,"make appealing to residents and visitors, prov...",economic development strategies,REG
294,Comprehensive Economic Development Strategy fo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.ncfrpc.org/Publications/CEDS/Withla...,geopolitical,,yes,,yes,,oncrease long-term sustainability of regional ...,"support, protect, and enhance the regions natu...","workforce to add value, high quality education...",economic development strategies,REG


This step pings the file links to check which ones actually point to PDFs. Written by Ethan.
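The code for this check is not included in this notebook, so the following is only a minimal sketch of how such a check could look, not the original implementation. The helper names `looks_like_pdf` and `url_points_to_pdf` are hypothetical. It assumes a link counts as a PDF when the server declares `application/pdf` or the body starts with the `%PDF` signature.

```python
import requests


def looks_like_pdf(content_type, first_bytes=b""):
    """Return True if the declared type or the leading bytes indicate a PDF."""
    declared = (content_type or "").split(";")[0].strip().lower()
    return declared == "application/pdf" or first_bytes.startswith(b"%PDF")


def url_points_to_pdf(url, timeout=10):
    """Fetch only the start of `url` and report whether it looks like a PDF."""
    # Placeholder values such as "na" or NaN are not links, so skip the request.
    if not isinstance(url, str) or not url.lower().startswith(("http://", "https://")):
        return False
    try:
        # stream=True avoids downloading the whole document.
        with requests.get(url, stream=True, timeout=timeout) as resp:
            if not resp.ok:
                return False
            first_bytes = next(resp.iter_content(chunk_size=5), b"")
            return looks_like_pdf(resp.headers.get("Content-Type"), first_bytes)
    except requests.RequestException:
        # Treat unreachable links as "not a PDF" rather than stopping the run.
        return False
```

Applied to a column of links with `.apply(url_points_to_pdf)`, this would yield a boolean column marking the rows whose link resolves to a PDF.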

### Write The New Column Names

In [192]:
files.columns

Index(['Plan Name', 'Date Added', 'Suggested By', 'Url', 'Plan Resolution',
       'Planning Method', 'Land Conservation ', 'Unnamed: 7', 'Unnamed: 8',
       'RESTORE GOALS', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14'],
      dtype='object')

In [193]:
new_header = ['plan_name', 'date_added', 'suggested_by', 'url', 'plan_resolution',
              'planning_method', 'aquisition', 'easement', 'stewardship',
              'habit', 'water_quality', 'resource_species', 'community_resilience',
              'gulf_economy', 'code', 'related_state', 'status', 'is_new', 'existing_planid','username']

### Strip The Header Row

In [194]:
files = files.iloc[1: , :]
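As a small illustration of what this step does, the sketch below uses a made-up three-row frame (the data and the `raw`, `sub_header`, and `body` names are assumptions for the example). The first row of the sheet holds sub-headers such as Aquisition and Easement rather than a plan, so it is sliced off. `iloc` keeps the original index labels, which is why the output afterwards starts at 1.

```python
import pandas as pd

# Toy stand-in for the spreadsheet: row 0 carries sub-headers, not a plan.
raw = pd.DataFrame({
    "Plan Name": [None, "Plan A", "Plan B"],
    "Land Conservation ": ["Aquisition", "Yes", None],
    "Unnamed: 7": ["Easement", None, "yes"],
})

# Keep the sub-header row around in case the names are needed later.
sub_header = raw.iloc[0]

# Drop the first row; .copy() makes the result independent of `raw`.
body = raw.iloc[1:, :].copy()

print(body.index.tolist())
```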

Verify that things look correct.

In [195]:
files

Unnamed: 0,Plan Name,Date Added,Suggested By,Url,Plan Resolution,Planning Method,Land Conservation,Unnamed: 7,Unnamed: 8,RESTORE GOALS,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
1,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
2,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
3,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
4,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,
5,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,,,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291,Green Links Regional CLIP Database,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/panamacity/resources/Green...,geopolitical,,,,yes,"assist conservation, listed species, green inf...",,"assist conservation, listed species, green inf...","assist conservation, listed species, green inf...",,REG
292,Waterbird Conservation for the Americas: North...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/migratorybirds/pdf/managem...,geopolitical,,yes,,yes,"protect, restore, and manage habitat",,"protect, restore, and manage populations",education and outreach,,REG
293,West Florida Comprehensive Economic Developmen...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.ecrc.org/document_center/Programs/...,geopolitical,,,,yes,,,resource protection and sustainability as econ...,"make appealing to residents and visitors, prov...",economic development strategies,REG
294,Comprehensive Economic Development Strategy fo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.ncfrpc.org/Publications/CEDS/Withla...,geopolitical,,yes,,yes,,oncrease long-term sustainability of regional ...,"support, protect, and enhance the regions natu...","workforce to add value, high quality education...",economic development strategies,REG


In [196]:
files.columns[:]

Index(['Plan Name', 'Date Added', 'Suggested By', 'Url', 'Plan Resolution',
       'Planning Method', 'Land Conservation ', 'Unnamed: 7', 'Unnamed: 8',
       'RESTORE GOALS', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14'],
      dtype='object')

### Add Empty Columns Where No Data Exists

In [210]:
# Count how many target columns the sheet is missing
start_count = len(files.columns)
final_count = len(new_header)
column_deficit = final_count - start_count

# Rename the existing columns with the first `start_count` names of the new header
files.columns = new_header[:start_count]

# The remaining names are the columns that still have to be added
new_columns = new_header[start_count:final_count]

In [212]:
# Add New Empty Columns
files = files.reindex(columns=[*files.columns.tolist(), *new_columns], fill_value="")
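The rename-then-reindex pattern can be checked on a toy frame. In this sketch the `target` list and the sample data are made up for illustration. Note that `fill_value` only applies to the columns that `reindex` creates; missing values in the existing columns stay NaN.

```python
import pandas as pd

# Hypothetical target schema: three existing columns plus two new ones.
target = ["plan_name", "url", "code", "status", "is_new"]

df = pd.DataFrame({
    "Plan Name": ["Plan A", "Plan B"],
    "Url": ["na", None],
    "Unnamed: 14": [None, "REG"],
})

# Rename the columns that already exist, in order.
n_existing = len(df.columns)
df.columns = target[:n_existing]

# Add the missing columns; only these new columns receive the fill value.
df = df.reindex(columns=target, fill_value="")

print(df.columns.tolist())
```

Because the target list starts with the existing names, passing it directly to `reindex` has the same effect as concatenating the existing and new column lists.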

Check the outcome of the process.

In [213]:
files

Unnamed: 0,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aquisition,easement,stewardship,habit,water_quality,resource_species,community_resilience,gulf_economy,code,related_state,status,is_new,existing_planid,username
1,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,,,,,,
2,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,,,,,,
3,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,,,,,,
4,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,,,
5,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,,,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291,Green Links Regional CLIP Database,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/panamacity/resources/Green...,geopolitical,,,,yes,"assist conservation, listed species, green inf...",,"assist conservation, listed species, green inf...","assist conservation, listed species, green inf...",,REG,,,,,
292,Waterbird Conservation for the Americas: North...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/migratorybirds/pdf/managem...,geopolitical,,yes,,yes,"protect, restore, and manage habitat",,"protect, restore, and manage populations",education and outreach,,REG,,,,,
293,West Florida Comprehensive Economic Developmen...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.ecrc.org/document_center/Programs/...,geopolitical,,,,yes,,,resource protection and sustainability as econ...,"make appealing to residents and visitors, prov...",economic development strategies,REG,,,,,
294,Comprehensive Economic Development Strategy fo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.ncfrpc.org/Publications/CEDS/Withla...,geopolitical,,yes,,yes,,oncrease long-term sustainability of regional ...,"support, protect, and enhance the regions natu...","workforce to add value, high quality education...",economic development strategies,REG,,,,,


Double check the columns are what you expect.

In [214]:
files.columns

Index(['plan_name', 'date_added', 'suggested_by', 'url', 'plan_resolution',
       'planning_method', 'aquisition', 'easement', 'stewardship', 'habit',
       'water_quality', 'resource_species', 'community_resilience',
       'gulf_economy', 'code', 'related_state', 'status', 'is_new',
       'existing_planid', 'username'],
      dtype='object')


### Fill Missing Rows 

Empty cells cause trouble on import. I believe they should be filled with NULL. In the pandas DataFrame they exist as NaN, but after export the cells are simply left empty. Try filling them with the option adjusted below (`na_rep` in the CSV export).

<img src="figures/bloq_importing_nofill_csv.png"
     alt="Markdown Monster icon"
     width = 600 
     style="float: left; margin-right: 10px;" />

Preview the csv conversion.
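One way to preview the conversion without writing a file is to call `to_csv` with no path, which returns the CSV text as a string. The sketch below uses a made-up frame and shows a detail worth checking: `na_rep` only replaces NaN, so cells that hold an empty string are still exported as empty. Converting `""` to NaN first is one possible fix; whether that is wanted for this import is an assumption.

```python
import numpy as np
import pandas as pd

# Toy frame: one NaN cell and one column of empty strings.
df = pd.DataFrame({
    "plan_name": ["Plan A", "Plan B"],
    "code": [np.nan, "REG"],
    "status": ["", ""],
})

# With no path argument, to_csv returns the CSV as a string.
preview = df.to_csv(index=False, na_rep="NULL")
print(preview)

# Turn empty strings into NaN so that na_rep applies to them as well.
filled_preview = df.replace("", np.nan).to_csv(index=False, na_rep="NULL")
print(filled_preview)
```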

### Fill In Missing Rows (skip for now)

### Write The ID Column To Match Existing Plans

This step is still manual. Ultimately it should be automated, either by picking up the exact starting number from the existing plans or by letting SQL assign the IDs.
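A minimal sketch of the first option, assuming the IDs of the existing plans are available as a list (for example from a query against the existing plans table, which is not shown in this notebook). The helper `assign_ids` is hypothetical. It starts numbering one past the highest existing ID and can be re-run safely because it drops any previous `id` column first.

```python
import pandas as pd


def assign_ids(new_plans, existing_ids, id_column="id"):
    """Return a copy of `new_plans` with consecutive IDs after the existing ones."""
    # Start one past the highest existing ID (or at 1 if there are none).
    start = max(existing_ids, default=0) + 1
    # Drop a previous ID column so the function is safe to call twice.
    out = new_plans.drop(columns=[id_column], errors="ignore").copy()
    out.insert(0, id_column, list(range(start, start + len(out))))
    return out


# Made-up example data.
plans = pd.DataFrame({"plan_name": ["Plan A", "Plan B"]})
with_ids = assign_ids(plans, existing_ids=[341, 342, 343])
print(with_ids)
```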

In [222]:
len(files.index)

295

In [223]:
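rows = len(files.index)
values = list(range(344, 344 + rows))

# Insert the ID column as the first column of the dataframe.
# The guard makes this cell safe to re-run; without it, a second run raises
# "ValueError: cannot insert id, already exists".
if "id" not in files.columns:
    files.insert(0, "id", values)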

In [224]:
files

Unnamed: 0,id,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aquisition,easement,stewardship,...,water_quality,resource_species,community_resilience,gulf_economy,code,related_state,status,is_new,existing_planid,username
1,344,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,...,,,,,,,,,,
2,345,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,...,,,,,,,,,,
3,346,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,...,,,,,,,,,,
4,347,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,...,Yes,Yes,Yes,Yes,,,,,,
5,348,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,,...,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
291,634,Green Links Regional CLIP Database,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/panamacity/resources/Green...,geopolitical,,,,yes,...,,"assist conservation, listed species, green inf...","assist conservation, listed species, green inf...",,REG,,,,,
292,635,Waterbird Conservation for the Americas: North...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.fws.gov/migratorybirds/pdf/managem...,geopolitical,,yes,,yes,...,,"protect, restore, and manage populations",education and outreach,,REG,,,,,
293,636,West Florida Comprehensive Economic Developmen...,2018-02-27,FL Fish and Wildlife Conservation Commission,https://www.ecrc.org/document_center/Programs/...,geopolitical,,,,yes,...,,resource protection and sustainability as econ...,"make appealing to residents and visitors, prov...",economic development strategies,REG,,,,,
294,637,Comprehensive Economic Development Strategy fo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.ncfrpc.org/Publications/CEDS/Withla...,geopolitical,,yes,,yes,...,oncrease long-term sustainability of regional ...,"support, protect, and enhance the regions natu...","workforce to add value, high quality education...",economic development strategies,REG,,,,,


## Export To CSV

Write the formatted table to CSV, with missing values (NaN) written as `NULL`.

In [226]:
files.to_csv(r'CIT_Newly_added_Catalog_0521.csv', na_rep='NULL')
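By default `to_csv` also writes the DataFrame index as an unnamed first column. Since the table now carries its own `id` column, passing `index=False` may be preferable; whether the SQL import expects that extra column is an assumption to verify. The sketch below compares the two header lines on a made-up frame.

```python
import pandas as pd

# Toy frame with a leftover index that starts at 1, as in this notebook.
df = pd.DataFrame(
    {"id": [344, 345], "plan_name": ["Plan A", "Plan B"]},
    index=[1, 2],
)

# Default behaviour: the index is written as an unnamed leading column.
with_index = df.to_csv(na_rep="NULL")

# index=False leaves only the real columns in the file.
without_index = df.to_csv(index=False, na_rep="NULL")

print(with_index.splitlines()[0])
print(without_index.splitlines()[0])
```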