In [None]:
# default_exp merge

# Merge Data

> This notebook demonstrates how to merge datasets by matching the values of a single column shared between two datasets. We add columns of data from a foreign dataset to the ACS data we downloaded in our last tutorial.

This Coding Notebook is the __second__ in a series.

An Interactive version can be found here <a href="https://colab.research.google.com/github/karpatic/dataplay/blob/master/notebooks/02_Merge_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>.


This colab and more can be found on our [webpage](https://karpatic.github.io/dataplay/). 

- Content covered in previous tutorials will be used in later tutorials. 

- __New code and information *should* come with explanations or descriptions__ attached. 

- Concepts or code covered in previous tutorials will be used without being explained in full.

- The [Dataplay](https://karpatic.github.io/dataplay/) Handbook uses development techniques covered in the [Datalabs](https://karpatic.github.io/datalabs/) Guidebook.

- __If content cannot be found in the current tutorial and is not covered in previous tutorials, please let me know.__

- This notebook has been optimized for Google Colab running in a Chrome browser. 

- Statements found on the index page regarding views expressed, responsibility, errors and omissions, use at risk, and licensing extend throughout the tutorial.

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/karpatic/datalab/master?filepath=%2Fnotebooks%2Findex.ipynb)
[![Binder](https://pete88b.github.io/fastpages/assets/badges/colab.svg)](https://colab.research.google.com/github/karpatic/datalab/blob/master/notebooks/index.ipynb)
[![Binder](https://pete88b.github.io/fastpages/assets/badges/github.svg)](https://github.com/karpatic/datalab/tree/master/notebooks/index.ipynb)
[![Open Source Love svg3](https://badges.frapsoft.com/os/v3/open-source.svg?v=103)](https://github.com/ellerbrock/open-source-badges/)

[![NPM License](https://img.shields.io/npm/l/all-contributors.svg?style=flat)](https://github.com/karpatic/dataplay/blob/master/LICENSE)
[![Active](http://img.shields.io/badge/Status-Active-green.svg)](https://karpatic.github.io) 
[![Python Versions](https://img.shields.io/pypi/pyversions/dataplay.svg)](https://pypi.python.org/pypi/dataplay/)
[![GitHub last commit](https://img.shields.io/github/last-commit/karpatic/dataplay.svg?style=flat)]() 
[![No Maintenance Intended](http://unmaintained.tech/badge.svg)](http://unmaintained.tech/) 

[![GitHub stars](https://img.shields.io/github/stars/karpatic/dataplay.svg?style=social&label=Star)](https://github.com/karpatic/dataplay) 
[![GitHub watchers](https://img.shields.io/github/watchers/karpatic/dataplay.svg?style=social&label=Watch)](https://github.com/karpatic/dataplay) 
[![GitHub forks](https://img.shields.io/github/forks/karpatic/dataplay.svg?style=social&label=Fork)](https://github.com/karpatic/dataplay) 
[![GitHub followers](https://img.shields.io/github/followers/karpatic.svg?style=social&label=Follow)](https://github.com/karpatic/dataplay) 

[![Tweet](https://img.shields.io/twitter/url/https/github.com/karpatic/dataplay.svg?style=social)](https://twitter.com/intent/tweet?text=Check%20out%20this%20%E2%9C%A8%20colab%20by%20@bniajfi%20https://github.com/karpatic/dataplay%20%F0%9F%A4%97) 
[![Twitter Follow](https://img.shields.io/twitter/follow/bniajfi.svg?style=social)](https://twitter.com/bniajfi)

## About this Tutorial: 

### What's Inside?

#### __The Tutorial__

In this notebook, the basics of how to perform a merge are introduced.

- We will merge two datasets
- We will merge two datasets using a crosswalk


#### __Objectives__

By the end of this tutorial users should have an understanding of:
- How dataset merges are performed
- The different types of union approaches a merge can take
- The 'mergeDatasets' function, and how to use it in the future

# Guided Walkthrough

## SETUP

Install these libraries onto the virtual environment.

In [None]:
%%capture
!pip install geopandas
!pip install dataplay

In [None]:
# @title Run: Install Modules

In [None]:
#export
# @title Run: Import Modules

# These imports will handle everything
import os
import sys
import csv
import numpy as np
import pandas as pd
from dataplay.acsDownload import retrieve_acs_data
from dataplay.intaker import Intake



In [None]:
# hide
pd.set_option('max_colwidth', 20)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.precision', 2)

## Retrieve Datasets

Our example will merge two simple datasets, pulling CSA names using tract IDs.

The __first__ dataset will be obtained from the Census Bureau's ACS 5-year surveys. 

The functions used to obtain this data were introduced in the earlier tutorial, ACS: Explore and Download. 


The __second__ dataset comes from a publicly accessible link.

### Get the Principal dataset.

We will use the function we created in our last tutorial to download the data!

In [None]:
# Our download function takes Baltimore City's tract, county, and state codes as parameters.
# Changing these values to different geographic reference codes will download data for a different area.
tract = '*'
county = '510'
state = '24'

# Specify the download parameters the function will receive here
tableId = 'B19001'
year = '17'
saveAcs = False

In [None]:
df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)
df.head()

Number of Columns 17


Unnamed: 0_level_0,B19001_001E_Total,"B19001_002E_Total_Less_than_$10,000","B19001_003E_Total_$10,000_to_$14,999","B19001_004E_Total_$15,000_to_$19,999","B19001_005E_Total_$20,000_to_$24,999","B19001_006E_Total_$25,000_to_$29,999","B19001_007E_Total_$30,000_to_$34,999","B19001_008E_Total_$35,000_to_$39,999","B19001_009E_Total_$40,000_to_$44,999","B19001_010E_Total_$45,000_to_$49,999","B19001_011E_Total_$50,000_to_$59,999","B19001_012E_Total_$60,000_to_$74,999","B19001_013E_Total_$75,000_to_$99,999","B19001_014E_Total_$100,000_to_$124,999","B19001_015E_Total_$125,000_to_$149,999","B19001_016E_Total_$150,000_to_$199,999","B19001_017E_Total_$200,000_or_more",state,county,tract
NAME,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
Census Tract 2710.02,1510,209,73,94,97,110,119,97,65,36,149,168,106,66,44,50,27,24,510,271002
Census Tract 2604.02,1134,146,29,73,80,41,91,49,75,81,170,57,162,63,11,6,0,24,510,260402
Census Tract 2712,2276,69,43,41,22,46,67,0,30,30,80,146,321,216,139,261,765,24,510,271200
Census Tract 2804.04,961,111,108,61,42,56,37,73,30,31,106,119,74,23,27,24,39,24,510,280404
Census Tract 901,1669,158,124,72,48,108,68,121,137,99,109,191,160,141,28,88,17,24,510,90100


### Get the Secondary Dataset

Spatial data can be obtained with the Census Bureau's 2010 Census Tract shapefile picking [tool](https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2010&layergroup=Census+Tracts), or by searching their website for
TIGER/[Line](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.2010.html) Shapefiles.
> The core TIGER/Line Files and Shapefiles do not include demographic data, but they do contain geographic entity codes (GEOIDs) that can be linked to the Census Bureau’s demographic data, available on data.census.gov. (Source: census.gov)


For this example, we will simply pull a local dataset whose columns label the tracts within Baltimore City and their corresponding CSA (Community Statistical Area). Typically, we use this dataset internally as a "crosswalk": after a successful merge on the tract column, the result is merged with a third dataset along its CSA column.  
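The two-step crosswalk pattern described above can be sketched as follows. This is only an illustration under stated assumptions: the tract and CSA values mirror rows shown in this notebook, while `tract_data`, `csa_data`, `households`, and `example_indicator` are hypothetical names and made-up values.

```python
import pandas as pd

# Hypothetical tract-level table (values are made up for illustration).
tract_data = pd.DataFrame({
    "tract": [10100, 10300, 10500],
    "households": [900, 1200, 700],
})

# A few crosswalk rows in the same shape as CSA-to-Tract-2010.csv.
crosswalk = pd.DataFrame({
    "TRACTCE10": [10100, 10300, 10500],
    "CSA2010": ["Canton", "Canton", "Fells Point"],
})

# Hypothetical CSA-level table standing in for the "third dataset".
csa_data = pd.DataFrame({
    "CSA2010": ["Canton", "Fells Point"],
    "example_indicator": [1.5, 2.5],
})

# Step 1: attach a CSA name to every tract via the crosswalk.
step1 = tract_data.merge(crosswalk, left_on="tract", right_on="TRACTCE10", how="left")

# Step 2: use the CSA name to bring in the CSA-level columns.
step2 = step1.merge(csa_data, on="CSA2010", how="left")
print(step2)
```

Each tract keeps its own row, and tracts that share a CSA receive the same CSA-level values.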

In [None]:
!curl https://bniajfi.org/vs_resources/CSA-to-Tract-2010.csv	> CSA-to-Tract-2010.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  8101  100  8101    0     0  23896      0 --:--:-- --:--:-- --:--:-- 23896


In [None]:
print('Boundaries Example:CSA-to-Tract-2010.csv')

Boundaries Example:CSA-to-Tract-2010.csv


In [None]:
# Get the second dataset.
# Our example dataset is a crosswalk that labels each tract with its CSA.
# We want to merge it onto our principal dataset,
# matching on the tract column.

# The file downloaded above comes from a public URL.

print('Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv')

inFile = input("\n Please enter the location of your file : \n" )

crosswalk = pd.read_csv( inFile ) 
crosswalk.head()

Tract 2 CSA Crosswalk : CSA-to-Tract-2010.csv

 Please enter the location of your file : 
CSA-to-Tract-2010.csv


Unnamed: 0,TRACTCE10,GEOID10,CSA2010
0,10100,24510010100,Canton
1,10200,24510010200,Patterson Park N...
2,10300,24510010300,Canton
3,10400,24510010400,Canton
4,10500,24510010500,Fells Point


In [None]:
crosswalk.columns

Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
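The `GEOID10` values in the crosswalk follow the Census convention for tract identifiers: a 2-digit state code, a 3-digit county code, and a 6-digit tract code joined together. If you would rather match on `GEOID10` than on `TRACTCE10`, a key can be built from the ACS `state`, `county`, and `tract` columns. The snippet below is a minimal sketch with a hypothetical two-row table, not part of the original workflow.

```python
import pandas as pd

# Hypothetical mini-table shaped like the ACS download (code columns only).
acs = pd.DataFrame({
    "state": [24, 24],
    "county": [510, 510],
    "tract": [10100, 271002],
})

# Zero-pad each part to its fixed width (2 + 3 + 6 = 11 characters) and join them.
acs["GEOID10"] = (
    acs["state"].astype(str).str.zfill(2)
    + acs["county"].astype(str).str.zfill(3)
    + acs["tract"].astype(str).str.zfill(6)
)
print(acs)
```

Keeping the key as text preserves the leading zeros that integer columns drop.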

## Perform Merge & Save

The following picture does nothing important but serves as a friendly reminder of the 4 basic join types.

<img src="https://docs.trifacta.com/download/attachments/123830435/JoinVennDiagram.png" height='200px'/>

- Left - Returns all left records, and includes a right record only if it has a match
- Right - Returns all right records, and includes a left record only if it has a match
- Full - Returns all records from both datasets, whether or not their keys match (pandas calls this an `outer` merge)
- Inner - Returns only records whose keys match in both datasets
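To make the four join types concrete, here is a small sketch using two hypothetical three-row tables that share a `key` column. The table and column names are made up for illustration; only `pd.merge` and its `how` argument come from this tutorial.

```python
import pandas as pd

# Keys 2 and 3 appear in both tables; 1 is left-only and 4 is right-only.
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Count the rows each join type returns.
row_counts = {
    how: len(pd.merge(left, right, on="key", how=how))
    for how in ["left", "right", "outer", "inner"]
}
print(row_counts)
```

The left and right merges each keep three rows, the outer merge keeps all four distinct keys, and the inner merge keeps only the two shared keys.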

Choose the column from each dataset to match on.

You can read the candidate names from the column listings above.

Our example uses the values suggested in the prompts.

In [None]:
print( 'Principal Columns ' + str(df.columns) + '')
left_on = input("Left on principal column: ('tract') \n" )
print(' \n ');
print( 'Crosswalk Columns ' + str(crosswalk.columns) + '')
right_on = input("Right on crosswalk column: ('TRACTCE10') \n" ) 

Princpal Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
Left on principal column: ('tract') 
tract
 
 
Crosswalk Columns Index(['TRACTCE10', 'GEOID10', 'CSA2010'], dtype='object')
Right on crosswalk column: ('TRACTCE10') 
TRACTCE10


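Specify how the merge will be performed.

A left merge is the usual choice for this example.

It returns our Principal dataset with columns from the second dataset appended to records where their specified columns match. The recorded run below entered `outer` instead, which also keeps crosswalk rows that have no match in the Principal dataset.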


In [None]:
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " )

How: (‘left’, ‘right’, ‘outer’, ‘inner’) outer


Actually perform the merge

In [None]:
merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)
merged_df.head()

Unnamed: 0,B19001_001E_Total,"B19001_002E_Total_Less_than_$10,000","B19001_003E_Total_$10,000_to_$14,999","B19001_004E_Total_$15,000_to_$19,999","B19001_005E_Total_$20,000_to_$24,999","B19001_006E_Total_$25,000_to_$29,999","B19001_007E_Total_$30,000_to_$34,999","B19001_008E_Total_$35,000_to_$39,999","B19001_009E_Total_$40,000_to_$44,999","B19001_010E_Total_$45,000_to_$49,999","B19001_011E_Total_$50,000_to_$59,999","B19001_012E_Total_$60,000_to_$74,999","B19001_013E_Total_$75,000_to_$99,999","B19001_014E_Total_$100,000_to_$124,999","B19001_015E_Total_$125,000_to_$149,999","B19001_016E_Total_$150,000_to_$199,999","B19001_017E_Total_$200,000_or_more",state,county,TRACTCE10,GEOID10,CSA2010
0,1510,209,73,94,97,110,119,97,65,36,149,168,106,66,44,50,27,24,510,271002.0,24500000000.0,Greater Govans
1,1134,146,29,73,80,41,91,49,75,81,170,57,162,63,11,6,0,24,510,260402.0,24500000000.0,Claremont/Armistead
2,2276,69,43,41,22,46,67,0,30,30,80,146,321,216,139,261,765,24,510,271200.0,24500000000.0,North Baltimore/...
3,961,111,108,61,42,56,37,73,30,31,106,119,74,23,27,24,39,24,510,280404.0,24500000000.0,Allendale/Irving...
4,1669,158,124,72,48,108,68,121,137,99,109,191,160,141,28,88,17,24,510,90100.0,24500000000.0,Greater Govans


As you can see, our Census data now has a CSA appended to it.
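One way to check which rows actually matched is the `indicator=True` option of `pd.merge`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses hypothetical tract values and placeholder CSA names. It also shows why `TRACTCE10` is displayed with a trailing `.0` in the output above: an outer merge introduces missing values, so pandas stores the integer column as floats.

```python
import pandas as pd

# Hypothetical tables: tract 999999 has no crosswalk row, tract 10300 has no ACS row.
acs = pd.DataFrame({"tract": [10100, 10200, 999999], "total": [100, 200, 300]})
crosswalk = pd.DataFrame({
    "TRACTCE10": [10100, 10200, 10300],
    "CSA2010": ["CSA A", "CSA B", "CSA A"],  # placeholder names
})

checked = pd.merge(acs, crosswalk, left_on="tract", right_on="TRACTCE10",
                   how="outer", indicator=True)

# Tally matched and unmatched rows.
match_counts = checked["_merge"].value_counts().to_dict()
print(match_counts)

# The unmatched rows leave gaps, so the integer key is stored as float.
print(checked["TRACTCE10"].dtype)
```

Rows flagged `left_only` or `right_only` are the ones worth inspecting before saving the merged file.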

In [None]:
# Save Data to User Specified File
outFile = input("Please enter the new Filename to save the data to ('acs_csa_merge_test'): " )
merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL) 

Please enter the new Filename to save the data to ('acs_csa_merge_test'): acs_csa_merge_test


## Final Result

In [None]:
flag = input("Enter a URL? If not ACS data will be used. (Y/N):  " )
if (flag == 'y' or flag == 'Y'):
  df = pd.read_csv( input("Please enter the location of your Principal file: " ) )
else:
  tract = input("Please enter tract id (*): " )
  county = input("Please enter county id (510): " )
  state = input("Please enter state id (24): " )
  tableId = input("Please enter acs table id (B19001): " ) 
  year = input("Please enter acs year (18): " )
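  # Convert the answer to a boolean; a non-empty string such as 'N' would otherwise count as True
  saveAcs = input("Save ACS? (Y/N): " ).strip().lower() == 'y'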
  df = retrieve_acs_data(state, county, tract, tableId, year, saveAcs)

print( 'Principal Columns ' + str(df.columns))

print('Crosswalk Example: CSA-to-Tract-2010.csv')

crosswalk = pd.read_csv( input("Please enter the location of your crosswalk file: " ) )
print( 'Crosswalk Columns ' + str(crosswalk.columns) + '\n')

left_on = input("Left on: " )
right_on = input("Right on: " )
how = input("How: (‘left’, ‘right’, ‘outer’, ‘inner’) " )

merged_df = pd.merge(df, crosswalk, left_on=left_on, right_on=right_on, how=how)
merged_df = merged_df.drop(left_on, axis=1)

# Save the data
saveFile = input("Save File ('Y' or 'N'): ")
if saveFile == 'Y' or saveFile == 'y':
  outFile = input("Saved Filename (Do not include the file extension ): ")
  merged_df.to_csv(outFile+'.csv', quoting=csv.QUOTE_ALL);

Enter a URL? If not ACS data will be used. (Y/N):  n
Please enter tract id (*): *
Please enter county id (510): 510
Please enter state id (24): 24
Please enter acs table id (B19001): B19001
Please enter acs year (18): 18
Save ACS? (Y/N): N
Number of Columns 17
Principal Columns Index(['B19001_001E_Total', 'B19001_002E_Total_Less_than_$10,000',
       'B19001_003E_Total_$10,000_to_$14,999',
       'B19001_004E_Total_$15,000_to_$19,999',
       'B19001_005E_Total_$20,000_to_$24,999',
       'B19001_006E_Total_$25,000_to_$29,999',
       'B19001_007E_Total_$30,000_to_$34,999',
       'B19001_008E_Total_$35,000_to_$39,999',
       'B19001_009E_Total_$40,000_to_$44,999',
       'B19001_010E_Total_$45,000_to_$49,999',
       'B19001_011E_Total_$50,000_to_$59,999',
       'B19001_012E_Total_$60,000_to_$74,999',
       'B19001_013E_Total_$75,000_to_$99,999',
       'B19001_014E_Total_$100,000_to_$124,999',
       'B19001_015E_Total_$125,000_to_$149,999',
       'B19001_016E_Total_$150,000_to_$

In [None]:
merged_df

Unnamed: 0,B19001_001E_Total,"B19001_002E_Total_Less_than_$10,000","B19001_003E_Total_$10,000_to_$14,999","B19001_004E_Total_$15,000_to_$19,999","B19001_005E_Total_$20,000_to_$24,999","B19001_006E_Total_$25,000_to_$29,999","B19001_007E_Total_$30,000_to_$34,999","B19001_008E_Total_$35,000_to_$39,999","B19001_009E_Total_$40,000_to_$44,999","B19001_010E_Total_$45,000_to_$49,999","B19001_011E_Total_$50,000_to_$59,999","B19001_012E_Total_$60,000_to_$74,999","B19001_013E_Total_$75,000_to_$99,999","B19001_014E_Total_$100,000_to_$124,999","B19001_015E_Total_$125,000_to_$149,999","B19001_016E_Total_$150,000_to_$199,999","B19001_017E_Total_$200,000_or_more",state,county,TRACTCE10,GEOID10,CSA2010
0,1842,140,71,143,112,169,101,113,116,56,147,158,151,200,8,89,68,24,510,272004.0,2.45e+10,Cross-Country/Ch...
1,1638,246,71,90,104,74,168,115,97,173,115,16,117,64,0,91,97,24,510,120202.0,2.45e+10,Greater Charles ...
2,1252,91,27,55,75,71,59,85,44,47,80,131,108,78,43,104,154,24,510,272005.0,2.45e+10,Cross-Country/Ch...
3,1622,235,177,184,79,67,88,71,64,60,72,155,140,100,77,20,33,24,510,272006.0,2.45e+10,Glen-Fallstaff
4,1775,127,70,190,136,111,114,152,106,11,172,143,193,42,86,53,69,24,510,272007.0,2.45e+10,Glen-Fallstaff
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
196,2143,138,37,76,33,16,30,28,17,64,111,286,181,206,100,234,586,24,510,20300.0,2.45e+10,Fells Point
197,1220,159,93,82,137,51,79,102,79,98,55,140,85,38,22,0,0,24,510,150400.0,2.45e+10,Greater Mondawmin
198,1393,41,12,22,0,10,11,19,85,30,95,128,200,207,240,180,113,24,510,10200.0,2.45e+10,Patterson Park N...
199,761,94,39,39,53,31,20,16,33,27,49,127,50,64,22,49,48,24,510,60400.0,2.45e+10,Oldtown/Middle East


# Advanced

For this next example to work, we will need to import hypothetical CSV files.

__Intro__

The following Python function is a fleshed-out version of the previous notes. 
- It contains everything from the tutorial plus more.
- It can be imported and used in future projects or used standalone.

**Description:** Adds columns of data from a foreign dataset to a primary dataset according to set parameters. 

**Purpose:** Makes merging datasets simple.

__Services__

- Merge two datasets without a crosswalk
- Merge two datasets with a crosswalk


In [None]:
#export
# @title Run: Create mergeDatasets()

# Worried about infinite interactive loops. Not an issue at the moment.
# The crosswalk needs to have exactly the same column names as the left/right datasets

def mergeDatasets(left_ds=False, right_ds=False, crosswalk_ds=False,
                  left_col=False, right_col=False,
                  crosswalk_left_col = False, crosswalk_right_col = False,
                  merge_how=False, # left right or columnname to retrieve
                  interactive=True):

  if (interactive): print('\n Loading Left Dataset');
  left_ds, left_col = Intake.getAndCheck(left_ds, left_col, interactive)

  if (interactive): print('\n Loading Right Dataset');
  right_ds, right_col  = Intake.getAndCheck(right_ds, right_col, interactive)

  # 1. returns column or False
  def checkMergeHow(right_ds, how, interactive):
    inList = how in ['left', 'right', 'outer', 'inner']
    inDf = Intake.getAndCheck(right_ds, how, False)
    if ( inList or inDf ): return how
    elif ( not interactive ): return False
    else:
      try:
        print('\n InValid Crosswalk Column Given. \n Please select a value from either list');
        print("\n 1) Pull A single Column from the Right Dataset: ", right_ds.columns)
        print("OR \n 2) Join Operation: (‘left’, ‘right’, ‘outer’, ‘inner’, columnName) " )
        return checkMergeHow(right_ds, input("Column Name: " ), interactive);
      except: return False # User probably trying to escape interactivity

  if (interactive): print('\n Validating the merge_how Parameter');
  merge_how = checkMergeHow(right_ds, merge_how, interactive)

  # 2i. This will load our dataset if provided as a url. As well as coerce the dtypes for merging.
  def coerceForMerge( msg, first_ds, second_ds, first_col, second_col, interactive ):
      if (interactive):
        print('\n coerceForMerge: ' + msg);
        print('cols : ', first_col, second_col)
        print('BEFORE COERCE : ', first_ds[first_col].dtype, second_ds[second_col].dtypes)
      second_ds, second_col = Intake.getAndCheck(second_ds, second_col, interactive)
      first_ds, second_ds, status = Intake.coerce(first_ds, second_ds, first_col, second_col, interactive);
      if (not status and interactive): print('\n There was a problem!');
      if (interactive):
        print('AFTER COERCE', first_ds[first_col].dtype, second_ds[second_col].dtypes, second_col )
      return first_ds, second_ds, second_col, status
  # 2ii.
  def mergeAndFilter(msg, first_ds, second_ds, first_col, second_col, how, interactive):
      if interactive:
        print('PERFORMING MERGE : '+ msg);
        print('first_col : ', first_col, first_ds[first_col].dtype)
        print('how: ', how)
        print('second_col : ', second_col, second_ds[second_col].dtype)
      first_ds = mergeOrPull(first_ds, second_ds, first_col, second_col, how)
      return filterEmpties(first_ds, second_ds, first_col, second_col, how, interactive)

  # Decide to perform a merge or commit a pull
  def mergeOrPull(df, cw, left_on, right_on, how):

    def merge(df, cw, left_on, right_on, how):
      df = pd.merge(df, cw, left_on=left_on, right_on=right_on, how=how)
      # df.drop(left_on, axis=1)
      df[right_on] = df[right_on].fillna(value='empty')
      return df

    def pull(df, cw, left_on, right_on, how):
      crswlk = dict(zip(cw[right_on], cw[how]  ) )
      dtype = df[left_on].dtype
      if dtype =='object':  df[how] = df.apply(lambda row: crswlk.get(str(row[left_on]), "empty"), axis=1)
      elif dtype == 'int64':
        df[how] = df.apply(lambda row: crswlk.get(int(row[left_on]), "empty"), axis=1)
      return df

    mergeType = how in ['left', 'right', 'outer', 'inner']
    if mergeType: return merge(df, cw, left_on, right_on, how)
    else: return pull(df, cw, left_on, right_on, how)

  # 2iii. Separate matched records from unmatched ones.
  def filterEmpties(df, cw, left_on, right_on, how, interactive):

    if how in ['left', 'right', 'outer', 'inner']: how = right_on
    nomatch = df.loc[df[how] == 'empty']
    nomatch = nomatch.sort_values(by=left_on, ascending=True)

    if nomatch.shape[0] > 0:
      # Do the same thing with our foreign tracts
      if(interactive):
        print('\n Local Column Values Not Matched ')
        print(nomatch[left_on].unique() )
        print(len(nomatch[left_on]))
        print('\n Crosswalk Unique Column Values')
        print(cw[right_on].unique() )

    # Turn the 'empty' placeholders back into NaN, then drop the unmatched rows
    df[how] = df[how].replace('empty', np.nan)
    df = df.dropna(subset=[how])
    # cw = cw.sort_values(by=how, ascending=True)
    return df

  # 2. If crosswalk check left-cw, right-cw. try coercing. return ds's, col's, and coerce status
  def checkMerge(left_ds, right_ds, crosswalk_ds, left_col, right_col, crosswalk_left_col, crosswalk_right_col , interactive):
    status = False
    use_crosswalk = crosswalk_ds
    if (interactive and not use_crosswalk):
      use_crosswalk = input("\n Import a crosswalk? ('True'/'False') " ) == "True"
      if (use_crosswalk): crosswalk_ds = input("crosswalk Url" )
    if (use_crosswalk):
      if (interactive): print('\n Loading Crosswalk... \r\n\r\n Left: ', crosswalk_left_col, ' Right: ', crosswalk_right_col, '\n\r\n');
      crosswalk_ds = Intake.getAndCheckDataset(crosswalk_ds, interactive)

      left_ds, crosswalk_ds, crosswalk_left_col, status = coerceForMerge(
        'Left-Crosswalk', left_ds, crosswalk_ds, left_col, crosswalk_left_col, interactive )

      right_ds, crosswalk_ds, crosswalk_right_col, status = coerceForMerge(
        'Right-Crosswalk',right_ds, crosswalk_ds, right_col, crosswalk_right_col, interactive )

      if (interactive): print('\n\r\n\r End Crosswalk Update. Coerceing complete. Status: ', status, '\n \r\n\r\n');
    else:
      left_ds, right_ds, right_col, status = coerceForMerge('Left-Right', left_ds, right_ds, left_col, right_col, interactive )
    return left_ds, right_ds, crosswalk_ds, right_col, crosswalk_left_col, crosswalk_right_col, status

  left_ds, right_ds, crosswalk_ds, right_col, crosswalk_left_col, crosswalk_right_col, status = checkMerge(
      left_ds, right_ds, crosswalk_ds, left_col, right_col, crosswalk_left_col, crosswalk_right_col , interactive
  )

  if ( not status ):
    if (interactive):print('\n Was not able to complete merge!');
    return False;
  else:
    if (not type( crosswalk_ds ) == bool):
      left_ds = mergeAndFilter('LEFT->CROSSWALK', left_ds, crosswalk_ds, left_col, crosswalk_left_col, crosswalk_right_col, interactive)
      left_col = crosswalk_right_col
    left_ds = mergeAndFilter('LEFT->RIGHT', left_ds, right_ds, left_col, right_col, merge_how, interactive)
  return left_ds

### Function Explanation

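**Input(s):** 
- `left_ds`, `right_ds`: the left and right datasets (file paths or URLs in the examples below)
- `crosswalk_ds`: an optional crosswalk dataset 
- `left_col`, `right_col`: the columns to match on 
- `crosswalk_left_col`, `crosswalk_right_col`: the crosswalk columns that correspond to the left and right datasets 
- `merge_how`: a join type or the name of a column to pull 
- `interactive`: whether to print progress and prompt for missing values 

**Output:** The merged DataFrame, or `False` if the merge could not be completed. The function does not write a file.

**How it works:**
- Read in the datasets
- Coerce the matching columns to compatible dtypes
- If a crosswalk is given, first pull `crosswalk_right_col` into the left dataset and match on it
- Perform the merge and drop records that found no match

- If the 'how' parameter is one of ['left', 'right', 'outer', 'inner']
- - then a merge will be performed. 
- If a column name is provided in the 'how' parameter
- - then that single column will be pulled from the right dataset as a new column in the left_ds.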
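The difference between a merge and a single-column pull can be sketched with plain pandas. The helper `pull_column` below is a hypothetical, simplified stand-in written for this illustration; it is not the function's internal code, and the three-row tables are made up.

```python
import pandas as pd

left = pd.DataFrame({"tract": [10100, 10500, 999999]})
right = pd.DataFrame({
    "TRACTCE10": [10100, 10500],
    "GEOID10": [24510010100, 24510010500],
    "CSA2010": ["Canton", "Fells Point"],
})

def pull_column(left_df, right_df, left_on, right_on, column):
    """Copy one column from right_df into left_df by key lookup (hypothetical helper)."""
    lookup = dict(zip(right_df[right_on], right_df[column]))
    out = left_df.copy()
    out[column] = out[left_on].map(lookup)   # keys without a match become NaN
    return out.dropna(subset=[column])       # drop records that found no match

# Pull: only the requested column is added.
pulled = pull_column(left, right, "tract", "TRACTCE10", "CSA2010")

# Merge: every column of the right dataset comes along.
merged = pd.merge(left, right, left_on="tract", right_on="TRACTCE10", how="inner")

print(list(pulled.columns))
print(list(merged.columns))
```

A pull keeps the left dataset narrow, which is convenient when only one label, such as the CSA name, is needed.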

## Function Diagrams

Class Diagram mergeDatasets()

In [None]:
%%html
<img src="https://bniajfi.org/images/mermaid/class_diagram_merge_datasets.PNG">

mergeDatasets Flow Chart

In [None]:
%%html
<img src="https://bniajfi.org/images/mermaid/flow_chart_merge_datasets.PNG">

Gantt Chart mergeDatasets()

In [None]:
%%html
<img src="https://bniajfi.org/images/mermaid/gannt_chart_merge_datasets.PNG">

Sequence Diagram  mergeDatasets()

In [None]:
%%html
<img src="https://bniajfi.org/images/mermaid/sequence_diagram_merge_datasets.PNG">

## Function Examples

#### Interactive Example 1

In [None]:
import geopandas as gpd
import numpy as np
import pandas as pd
from dataplay.geoms import readInGeometryData 
Hhchpov = gpd.read_file("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhchpov/FeatureServer/1/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson")
Hhchpov = Hhchpov[['CSA2010', 'hhchpov15', 'hhchpov16', 'hhchpov17', 'hhchpov18']]
Hhchpov.to_csv('Hhchpov.csv')

Hhpov = gpd.read_file("https://services1.arcgis.com/mVFRs7NF4iFitgbY/ArcGIS/rest/services/Hhpov/FeatureServer/1/query?where=1%3D1&outFields=*&returnGeometry=true&f=pgeojson")
Hhpov = Hhpov[['CSA2010', 'hhpov15', 'hhpov16', 'hhpov17', 'hhpov18']]
Hhpov.to_csv('Hhpov.csv')

In [None]:
Hhchpov.head(1)

Unnamed: 0,CSA2010,hhchpov15,hhchpov16,hhchpov17,hhchpov18
0,Allendale/Irving...,38.93,34.73,32.77,35.27


In [None]:
Hhpov.head(1)

Unnamed: 0,CSA2010,hhpov15,hhpov16,hhpov17,hhpov18
0,Allendale/Irving...,24.15,21.28,20.7,23.0


In [None]:
ls

24510_B19001_5y18_est.csv           CSA-to-Tract-2010.csv  sample_data/
24510_B19001_5y18_est_Original.csv  Hhchpov.csv
acs_csa_merge_test.csv              Hhpov.csv


In [None]:
import pandas as pd
import geopandas as gpd
# Left table: Hhpov
# Columns: CSA2010, hhpov15, hhpov16, hhpov17, hhpov18
left_ds = 'Hhpov.csv'
left_col = 'CSA2010'

# Right table: Hhchpov
# Columns: CSA2010, hhchpov15, hhchpov16, hhchpov17, hhchpov18
right_ds = 'Hhchpov.csv'
right_col='CSA2010'

merge_how = 'outer'
interactive = True

merged_df = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=False,
                  left_col=left_col, right_col=right_col,
                  crosswalk_left_col = False, crosswalk_right_col = False,
                  merge_how=merge_how, # left right or columnname to retrieve
                  interactive=True)
merged_df.head()


 Loading Left Dataset
getData :  Hhpov.csv

 Loading Right Dataset
getData :  Hhchpov.csv

 Validating the merge_how Parameter

 Import a crosswalk? ('True'/'False') False

 coerceForMerge: Left-Right
cols :  CSA2010 CSA2010
BEFORE COERCE :  object object
AFTER COERCE object object CSA2010
PERFORMING MERGE : LEFT->RIGHT
first_col :  CSA2010 object
how:  outer
second_col :  CSA2010 object


Unnamed: 0,Unnamed: 0_x,CSA2010,hhpov15,hhpov16,hhpov17,hhpov18,Unnamed: 0_y,hhchpov15,hhchpov16,hhchpov17,hhchpov18
0,0,Allendale/Irving...,24.15,21.28,20.7,23.0,0,38.93,34.73,32.77,35.27
1,1,Beechfield/Ten H...,11.17,11.59,10.47,10.9,1,19.42,21.22,23.92,21.9
2,2,Belair-Edison,18.61,19.59,20.27,22.83,2,36.88,36.13,34.56,39.74
3,3,Brooklyn/Curtis ...,28.36,26.33,24.21,21.54,3,45.01,46.45,46.41,39.89
4,4,Canton,3.0,2.26,3.66,2.05,4,5.49,2.99,4.02,4.61


In [None]:
ls

24510_B19001_5y18_est.csv           CSA-to-Tract-2010.csv  sample_data/
24510_B19001_5y18_est_Original.csv  Hhchpov.csv
acs_csa_merge_test.csv              Hhpov.csv


#### Example 2: Get CSA and Geometry with a Crosswalk Using 3 Links

In [None]:
# Primary Table
# Description: I created a public dataset from a google xlsx sheet 'Bank Addresses and Census Tract' from a workbook of the same name.
# Table: FDIC Baltimore Banks
# Columns: Bank Name, Address(es), Census Tract
left_ds = 'https28768&single=true&output=csv'
left_col = 'Census Tract'

# Alternate Primary Table
# Description: Same workbook, different Sheet: 'Branches per tract' 
# Columns: Census Tract, Number branches per tract
# left_ds = 'https://docssingle=true&output=csv'
# left_col = 'Number branches per tract'

# Crosswalk Table
# Table: Crosswalk Census Communities
# 'TRACT2010', 'GEOID2010', 'CSA2010'
crosswalk_ds = 'https://docs.goot=csv'
use_crosswalk = True
crosswalk_left_col = 'TRACT2010'
crosswalk_right_col = 'GEOID2010'

# Secondary Table
# Table: Baltimore Boundaries
# 'TRACTCE10', 'GEOID10', 'CSA', 'NAME10', 'Tract', 'geometry'
right_ds = 'httpse=true&output=csv'
right_col ='GEOID10'

# merge_how may be a join type, or a column name (e.g. 'geometry') to pull from the right dataset
merge_how = 'outer'
interactive = True

merged_df_geom = mergeDatasets(left_ds=left_ds, right_ds=right_ds, crosswalk_ds=crosswalk_ds,
                  left_col=left_col, right_col=right_col,
                  crosswalk_left_col=crosswalk_left_col, crosswalk_right_col=crosswalk_right_col,
                  merge_how=merge_how, # left right or columnname to retrieve
                  interactive=interactive)

merged_df_geom.head()

Here we can save the data so that it may be used in later tutorials. 

In [None]:
out_name = 'test_save_data_with_geom_and_csa'
merged_df_geom.to_csv(out_name+'.csv', encoding="utf-8", index=False, quoting=csv.QUOTE_ALL)

#### Example 3: Run Alone

In [None]:
mergeDatasets()