# CIT Data Pipeline: Formatting

In this notebook, we ingest the available, pre-populated data and format it for upload to SQL.

In [67]:
import pandas as pd
import numpy as np

import requests

from termcolor import colored

In [68]:
files = pd.read_excel("CIT_Newly_added_Catalog_0521.xlsx")
files.head() 

Unnamed: 0,Plan Name,Date Added,Suggested By,Url,Plan Resolution,Planning Method,Land Conservation,Unnamed: 7,Unnamed: 8,RESTORE GOALS,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,,NaT,,,,,Aquisition,Easement,Stewardship,Habitat,Water Quality,Resources/Species,Community Resilience,Gulf Economy,Code
1,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
2,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
3,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
4,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,


This step pings each file link to check whether it actually points to a PDF. Written by Ethan.
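The checking code itself is not included in this notebook, so the following is only a minimal sketch of how such a check might look, not Ethan's original implementation. The helper names `looks_like_pdf` and `url_points_to_pdf` are hypothetical, and the sketch assumes that a PDF can be recognised either by an `application/pdf` Content-Type header or by the `%PDF` bytes at the start of the file.

```python
def looks_like_pdf(content_type, first_bytes=b""):
    """Return True if the header or the leading bytes indicate a PDF."""
    declared_pdf = "application/pdf" in (content_type or "").lower()
    magic_pdf = first_bytes.startswith(b"%PDF")
    return declared_pdf or magic_pdf


def url_points_to_pdf(url, timeout=10):
    """Fetch the start of `url` and report whether it looks like a PDF."""
    # Skip NaN and placeholder values such as "na" without touching the network.
    if not isinstance(url, str) or not url.lower().startswith(("http://", "https://")):
        return False

    import requests  # imported lazily; only needed for real links

    try:
        with requests.get(url, stream=True, timeout=timeout) as response:
            if not response.ok:
                return False
            # Read only the first small chunk instead of the whole document.
            first_bytes = next(response.iter_content(chunk_size=8), b"")
            return looks_like_pdf(response.headers.get("Content-Type"), first_bytes)
    except requests.RequestException:
        return False
```

Applied to the catalog, this could look like `files["Url"].map(url_points_to_pdf)`, which would yield one True/False value per plan.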

### Write The New Column Names

In [69]:
files.columns

Index(['Plan Name', 'Date Added', 'Suggested By', 'Url', 'Plan Resolution',
       'Planning Method', 'Land Conservation ', 'Unnamed: 7', 'Unnamed: 8',
       'RESTORE GOALS', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14'],
      dtype='object')

In [70]:
new_header = ['plan_name', 'date_added', 'suggested_by', 'url', 'plan_resolution',
              'planning_method', 'aquisition', 'easement', 'stewardship',
              'habit', 'water_quality', 'resource_species', 'community_resilience',
              'gulf_economy', 'code', 'related_state', 'status', 'is_new', 'existing_planid','username']

### Strip Some Rows

In [71]:
files = files.iloc[1:, :]

In [72]:
files.head()

Unnamed: 0,Plan Name,Date Added,Suggested By,Url,Plan Resolution,Planning Method,Land Conservation,Unnamed: 7,Unnamed: 8,RESTORE GOALS,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
1,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
2,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
3,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,
4,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,
5,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,,,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,


In [73]:
files.columns[:]

Index(['Plan Name', 'Date Added', 'Suggested By', 'Url', 'Plan Resolution',
       'Planning Method', 'Land Conservation ', 'Unnamed: 7', 'Unnamed: 8',
       'RESTORE GOALS', 'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12',
       'Unnamed: 13', 'Unnamed: 14'],
      dtype='object')

### Add Empty Columns Where No Data Exists

In [74]:
# Determine how many columns the new header adds
start_count = len(files.columns)
final_count = len(new_header)
column_deficit = final_count - start_count

# Rename the existing columns with the first part of the new header
files.columns = new_header[:start_count]

# The remaining header names are the columns that still have to be added
new_columns = new_header[start_count:final_count]

In [75]:
# Add New Empty Columns
files = files.reindex(columns=[*files.columns.tolist(), *new_columns], fill_value="")
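The rename-then-pad logic of the last two cells can be tried out on a toy frame. The sketch below wraps it in a hypothetical helper, `apply_header`, and uses made-up data; it assumes, as the cells above do, that the existing columns line up positionally with the start of the new header.

```python
import pandas as pd


def apply_header(frame, header, fill_value=""):
    """Rename the existing columns positionally, then append the missing ones."""
    if len(frame.columns) > len(header):
        raise ValueError("header is shorter than the existing column list")
    renamed = frame.copy()
    # The first len(frame.columns) names replace the current column labels.
    renamed.columns = header[:len(frame.columns)]
    # reindex creates every header name that is still missing, filled with fill_value.
    return renamed.reindex(columns=header, fill_value=fill_value)


# Toy data for illustration only.
toy = pd.DataFrame({"Plan Name": ["Plan A"], "Url": ["na"]})
result = apply_header(toy, ["plan_name", "url", "status", "is_new"])
print(result.columns.tolist())
```

Working on a copy keeps the original frame untouched, which makes the step safe to re-run in a notebook.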

Check the outcome of the process.

In [76]:
files.head()

Unnamed: 0,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aquisition,easement,stewardship,habit,water_quality,resource_species,community_resilience,gulf_economy,code,related_state,status,is_new,existing_planid,username
1,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,,,,,,
2,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,,,,,,
3,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,,,,,,,,,,,
4,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,Yes,Yes,Yes,Yes,Yes,,,,,,
5,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,,,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,,,,,,


Double check the columns are what you expect.

In [77]:
files.columns

Index(['plan_name', 'date_added', 'suggested_by', 'url', 'plan_resolution',
       'planning_method', 'aquisition', 'easement', 'stewardship', 'habit',
       'water_quality', 'resource_species', 'community_resilience',
       'gulf_economy', 'code', 'related_state', 'status', 'is_new',
       'existing_planid', 'username'],
      dtype='object')

Verify things look correct.

### Fill Missing Rows 

Empty cells cause trouble on import. I believe these should be filled with NULL. In the Python dataframe a missing value exists as NaN, but after exporting to CSV the cell is simply left empty. Try filling these in with the export option as adjusted below.

<img src="figures/bloq_importing_nofill_csv.png"
     alt="Markdown Monster icon"
     width = 600 
     style="float: left; margin-right: 10px;" />

Preview the csv conversion.
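One way to preview the conversion without touching the disk is to write to an in-memory buffer. The sketch below uses toy data and a hypothetical helper `to_csv_text`. It also shows a detail worth keeping in mind: `na_rep` only replaces true missing values (NaN), so columns that were created with `fill_value=""` hold empty strings and stay empty unless they are converted to NaN first.

```python
import io

import numpy as np
import pandas as pd

# Toy data for illustration only.
df = pd.DataFrame({
    "plan_name": ["Plan A", "Plan B"],
    "url": ["http://example.com/a.pdf", np.nan],  # a true missing value
    "status": ["", "approved"],                   # an empty string, not NaN
})


def to_csv_text(frame):
    """Return the CSV text that to_csv would write, with NaN shown as NULL."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False, na_rep="NULL")
    return buffer.getvalue()


raw_text = to_csv_text(df)                   # the empty string is written as an empty field
filled_text = to_csv_text(df.mask(df == ""))  # empty strings become NaN, then NULL
print(raw_text)
print(filled_text)
```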

### Fill In Missing Rows (skip for now)

### Write The ID Column To Match Existing Plans

This should be automated. Ultimately, the starting ID should either be picked up from the existing plans instead of being hardcoded (344 below), or the IDs should be assigned by SQL.

In [78]:
len(files.index)

295

In [79]:
rows = len(files.index)
values = list(range(344,344 + rows))

# Insert ID column to the dataframe
files.insert(0, "id", values)
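As a first step toward automating this, the starting value could be derived from the IDs already in use rather than typed in. The following is a sketch under assumptions: the helper `assign_ids` is hypothetical, and the list of existing IDs is made up here, whereas in practice it would have to come from the existing plans table.

```python
import pandas as pd


def assign_ids(new_plans, existing_ids):
    """Number the new plans consecutively after the largest existing ID."""
    start = max(existing_ids, default=0) + 1
    numbered = new_plans.copy()
    numbered.insert(0, "id", list(range(start, start + len(numbered))))
    return numbered


# Toy data for illustration only; the real IDs live in the database.
existing_ids = [341, 342, 343]
new_plans = pd.DataFrame({"plan_name": ["Plan A", "Plan B", "Plan C"]})
numbered = assign_ids(new_plans, existing_ids)
print(numbered)
```

If the database assigns IDs itself (for example through an auto-incrementing key), this step would not be needed at all and the `id` column could be left out of the export.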

In [85]:
files.head()

Unnamed: 0,id,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aquisition,easement,stewardship,...,water_quality,resource_species,community_resilience,gulf_economy,code,related_state,status,is_new,existing_planid,username
1,344,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,,...,,,,,,,,,,
2,345,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,,...,,,,,,,,,,
3,346,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,,...,,,,,,,,,,
4,347,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,Yes,...,Yes,Yes,Yes,Yes,,,,,,
5,348,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,,...,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,,,,,,


In [84]:
files.to_csv(r'CIT_Newly_added_Catalog_0521.csv', na_rep='NULL')

### Review 

In [82]:
files_ascsv = pd.read_csv("CIT_Newly_added_Catalog_0521.csv")

In [83]:
files_ascsv.head()

Unnamed: 0.1,Unnamed: 0,id,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aquisition,easement,...,water_quality,resource_species,community_resilience,gulf_economy,code,related_state,status,is_new,existing_planid,username
0,1,344,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,,,
1,2,345,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,,,
2,3,346,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,,,
3,4,347,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,...,Yes,Yes,Yes,Yes,,,,,,
4,5,348,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,...,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,,,,,,


# Troubleshooting

Work through the issues and blockers encountered here with some experimentation.

## BLQ: 'is_new' column with invalid Boolean values

FIX: Set values in Boolean columns to exactly True or False. NaN does not count.

In [86]:
files_ascsv = files_ascsv.assign(is_new=True)
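Before exporting, a quick check along these lines could confirm that a Boolean column really contains only True or False. The helper name `invalid_boolean_rows` is hypothetical and the frame is toy data; only the `is_new` column name comes from this notebook.

```python
import numpy as np
import pandas as pd


def invalid_boolean_rows(frame, column):
    """Return the rows whose value in `column` is not exactly True or False."""
    is_valid = frame[column].map(lambda value: isinstance(value, (bool, np.bool_)))
    return frame[~is_valid]


# Toy data for illustration only: the middle row has a missing value.
toy = pd.DataFrame({"plan_name": ["Plan A", "Plan B", "Plan C"],
                    "is_new": [True, np.nan, False]})
bad_rows = invalid_boolean_rows(toy, "is_new")
fixed_rows = invalid_boolean_rows(toy.assign(is_new=True), "is_new")
print(len(bad_rows), len(fixed_rows))
```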

In [87]:
files = files_ascsv

In [88]:
files.to_csv(r'CIT_Newly_added_Catalog_0521.csv')

In [89]:
files.head()

Unnamed: 0.1,Unnamed: 0,id,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aquisition,easement,...,water_quality,resource_species,community_resilience,gulf_economy,code,related_state,status,is_new,existing_planid,username
0,1,344,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,True,,
1,2,345,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,True,,
2,3,346,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,True,,
3,4,347,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,...,Yes,Yes,Yes,Yes,,,,True,,
4,5,348,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,...,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,,,,True,,


In [90]:
files.to_csv(r'CIT_Newly_added_Catalog_0521.csv', na_rep='NULL')

## BLQ: extra data after last expected column

ERROR:  extra data after last expected column 
CONTEXT:  COPY plans, line 1: ",Unnamed: 0,Unnamed: 0.1,id,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aq..."

In [91]:
files.head()

Unnamed: 0.1,Unnamed: 0,id,plan_name,date_added,suggested_by,url,plan_resolution,planning_method,aquisition,easement,...,water_quality,resource_species,community_resilience,gulf_economy,code,related_state,status,is_new,existing_planid,username
0,1,344,Habitat Management Plan - Baldwin County Meado...,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,True,,
1,2,345,THE MOBILE PENINSULA - CORRIDOR MASTER PLAN,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,True,,
2,3,346,Management Plan for the - Audubon Bird Sanctuary,2017-12-11,Jeniffer Roberts,na,,,,,...,,,,,,,,True,,
3,4,347,Apalachee Region Comprehensive Economic Develo...,2018-02-27,FL Fish and Wildlife Conservation Commission,http://www.nado.org/wp-content/uploads/2014/08...,Regional,,Yes,Yes,...,Yes,Yes,Yes,Yes,,,,True,,
4,5,348,Fishery Management Plan for Spanish Mackerel,2018-02-27,FL Fish and Wildlife Conservation Commission,http://sedarweb.org/docs/wsupp/S17RD03%20ASMFC...,GCR,,,,...,,Manage Spanish mackerel resourse,,Minimize disruptions of markets for Spanish ma...,,,,True,,
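A likely cause of the COPY error above is the dataframe index: by default `to_csv` writes it as an extra, unnamed first column, and each round trip through `read_csv` turns it into another `Unnamed: ...` column, so the file ends up with more fields than the table expects. The sketch below reproduces this on toy data with a hypothetical `round_trip` helper and shows that `index=False` keeps the columns unchanged. Whether this alone satisfies the `plans` table is an assumption that still needs to be tested against the database.

```python
import io

import pandas as pd


def round_trip(frame, **to_csv_kwargs):
    """Write `frame` to an in-memory CSV and read it straight back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


# Toy data for illustration only.
plans = pd.DataFrame({"id": [344, 345], "plan_name": ["Plan A", "Plan B"]},
                     index=[1, 2])

with_index = round_trip(plans)                   # the index becomes a data column
twice = round_trip(with_index)                   # a second trip adds yet another one
without_index = round_trip(plans, index=False)   # the columns stay as they were

print(with_index.columns.tolist())
print(twice.columns.tolist())
print(without_index.columns.tolist())
```

With that in mind, the export cells above could pass both options, for example `files.to_csv(r'CIT_Newly_added_Catalog_0521.csv', index=False, na_rep='NULL')`, after dropping the `Unnamed: ...` columns that have already accumulated.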
