General Checks

All original data sets are represented in the final database
Number of missing values per field
Number of unique values per field
Coverage of Canada (how many provinces/territories are represented)
Group by various fields and count unique entries. This is an easy way to spot abnormalities (e.g. numbers in a text field or text in a numeric field).
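The checks above can be sketched with pandas roughly as follows. This is a minimal illustration on a made-up frame; the column names (`source`, `province`, `postal_code`) and the data set labels are assumptions, not the database's actual schema.

```python
import pandas as pd

# Hypothetical sample standing in for the final database
df = pd.DataFrame({
    "source": ["ds_a", "ds_a", "ds_b", "ds_c"],
    "province": ["ON", "QC", "ON", None],
    "postal_code": ["K1A0B1", None, "K1A0B1", "H2X1Y4"],
})

# Every original data set should appear in the final database
expected_sources = {"ds_a", "ds_b", "ds_c"}
missing_sources = expected_sources - set(df["source"].unique())

missing_per_field = df.isna().sum()           # missing values per field
unique_per_field = df.nunique()               # unique non-null values per field
provinces_covered = df["province"].nunique()  # provinces/territories represented

# Group by one field and count unique entries in another to spot abnormalities
unique_pc_by_province = df.groupby("province")["postal_code"].nunique()

print(missing_sources, provinces_covered)
print(missing_per_field, unique_per_field, unique_pc_by_province, sep="\n\n")
```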


Sanity Check

Sample input from each dataset and confirm that the data appears in the final output and that the values kept are unchanged
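One possible way to automate this check is a left merge with an indicator column. The sketch below assumes the source and final tables share the compared columns; the helper name `find_unmatched` and the columns `record_id` and `business_name` are hypothetical.

```python
import pandas as pd

def find_unmatched(sample, final, cols):
    """Return sampled rows that have no identical row in the final output."""
    check = sample.merge(final[cols].drop_duplicates(), how="left",
                         on=cols, indicator=True)
    # "left_only" rows are missing from the output or were changed on the way
    return check.loc[check["_merge"] == "left_only", cols]

# Hypothetical source data set and final output
source = pd.DataFrame({"record_id": ["1", "2", "3", "4"],
                       "business_name": ["A", "B", "C", "D"]})
final = pd.DataFrame({"record_id": ["1", "2", "4"],
                      "business_name": ["A", "B", "d"]})

cols = ["record_id", "business_name"]
sample = source.sample(n=3, random_state=0)
print(find_unmatched(sample, final, cols))
```

Rows dropped on purpose (nulls, duplicates) will also show up as unmatched, so the result is a list to review rather than a list of errors.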


Data types

Of each field
At each processing step (converting a dtype can change values, e.g. when coercing float to str pandas may append ".0", so 123 becomes "123.0")
Use low_memory=False and dtype=str in pd.read_csv in ALL processing steps to ensure proper output.
Document where rounding might occur (e.g. in LAT, LON processing to n decimal places)
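The coercion issue can be reproduced on a small in-memory CSV. The sketch below uses made-up column names (`id`, `employees`) and compares default type inference with reading every field as str.

```python
import io
import pandas as pd

# Hypothetical CSV: an ID with leading zeros and a numeric field with a gap
raw = "id,employees\n00123,5\n00456,\n"

# Default inference: id becomes an integer (leading zeros lost) and employees
# becomes float because of the missing value, so casting to str gives "5.0"
inferred = pd.read_csv(io.StringIO(raw), low_memory=False)
inferred_ids = inferred["id"].astype(str).tolist()
inferred_employees = inferred["employees"].astype(str).tolist()

# Reading everything as str keeps the values exactly as written
as_text = pd.read_csv(io.StringIO(raw), low_memory=False, dtype=str)
text_ids = as_text["id"].tolist()
text_employees = as_text["employees"].tolist()

print(inferred_ids, inferred_employees)
print(text_ids, text_employees)
```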


Nulls and Dedupes

Null values removed (+)
Nulls removed are sensible (-)
Duplicate values removed (+)
Duplicates removed have original in database (-)
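A minimal sketch of these checks, assuming duplicates are defined on a set of key columns (the names `business_name` and `postal_code` are placeholders):

```python
import pandas as pd

# Hypothetical table before cleaning
before = pd.DataFrame({
    "business_name": ["A", "A", "B", None, "C"],
    "postal_code": ["K1A0B1", "K1A0B1", "H2X1Y4", None, None],
})
key_cols = ["business_name", "postal_code"]

# Drop rows that are empty in every key column, then drop exact duplicates
no_nulls = before.dropna(subset=key_cols, how="all")
after = no_nulls.drop_duplicates(subset=key_cols, keep="first")

# How many rows each step removed
n_nulls_removed = len(before) - len(no_nulls)
n_dupes_removed = len(no_nulls) - len(after)

# Every removed duplicate should still have its original in the output
removed_dupes = no_nulls.loc[no_nulls.duplicated(subset=key_cols, keep="first")]
originals_kept = removed_dupes.merge(after, on=key_cols, how="left", indicator=True)
all_originals_present = bool((originals_kept["_merge"] == "both").all())

print(n_nulls_removed, n_dupes_removed, all_originals_present)
```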


Spatial

Visualize in QGIS
Cartographic file (with water boundaries) is used to determine points that fall outside of CSDUID boundaries (reproject to EPSG:4326 for spatial joins because the lat/lon points are expressed in that CRS)
Check CSDUID assignments

Filter to only retain cities with multiple CSD assignments
Compute max distances between coordinates for each city
Reproject to EPSG:3347 so that coordinates are measured in metres
Scipy convex hull to reduce set space (and memory usage)
Scipy cdist for pairwise comparisons
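The hull-then-cdist step can be sketched as below. It assumes the coordinates for one city are already projected to metres in an (n, 2) array; the function name `max_pairwise_distance` is hypothetical. The two farthest points of a set always lie on its convex hull, which is why reducing to hull vertices first does not change the result.

```python
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import cdist

def max_pairwise_distance(xy):
    """Return the largest distance between any two points in an (n, 2) array."""
    pts = np.unique(np.asarray(xy, dtype=float), axis=0)
    if len(pts) < 2:
        return 0.0
    # Qhull needs at least 3 points that are not all on one line;
    # otherwise compare the points directly
    if len(pts) >= 3 and np.linalg.matrix_rank(pts - pts[0]) == 2:
        pts = pts[ConvexHull(pts).vertices]
    # Pairwise Euclidean distances between the remaining points
    return float(cdist(pts, pts).max())

# Hypothetical projected coordinates (metres) for one city
coords = np.array([[0, 0], [3000, 0], [3000, 4000], [1500, 1000], [0, 0]])
print(max_pairwise_distance(coords))
```

A large maximum distance for a city with multiple CSD assignments is a hint that some points may be geocoded to the wrong place.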

Address processing
Spec_checks

Explore integration of open databases (ODB) and BR if relevant
Reasons for data not matching BR data
Wrong NAICS code in BR
BR incorrectly has data as inactive

In [1]:
# create a csv containing only the rows that have usable lats/lons

# importing package
import pandas as pd

# file paths
filename = '/home/jovyan/ODBiz/8-Cleaning/Not_QC.csv'   # file to read in
newfile = '/home/jovyan/ODBiz/8-Cleaning/filtered.csv'  # outputted new file

# filtering original merged csv
# read_csv already returns a data frame, so no extra conversion is needed
data = pd.read_csv(filename, low_memory=False)

# keep rows with a positive latitude and a negative longitude (the hemisphere
# that contains Canada); comparisons against NaN are False, so rows with
# missing lat/long info are dropped as well
data = data.loc[(data['latitude'] > 0) & (data['longitude'] < 0)]

data.to_csv(newfile, index=False)  # save new data frame as a csv


In [4]:
data = pd.read_csv('/home/jovyan/ODBiz/8-Cleaning/filtered.csv', low_memory=False)
half_df = len(data) // 2  # number of rows in the first half

# split by rows (iloc[:, ...] would slice columns instead)
df1 = data.iloc[:half_df]
df2 = data.iloc[half_df:]


df1.to_csv('/home/jovyan/ODBiz/8-Cleaning/first_half.csv', index=False) 
df2.to_csv('/home/jovyan/ODBiz/8-Cleaning/second_half.csv', index=False)

In [5]:
import folium
import pandas as pd
from folium.plugins import FastMarkerCluster


data = pd.read_csv("/home/jovyan/ODBiz/8-Cleaning/first_half.csv", low_memory=False)

#create a map
this_map = folium.Map(prefer_canvas=True)

def plotDot(point):
    '''input: a row (series) that contains numeric latitude and longitude
    fields and a business_name field
    this function creates a CircleMarker and adds it to this_map'''
    folium.CircleMarker(location=[point['latitude'], point['longitude']],
                        weight=5,
                        radius=2,
                        color='magenta',
                        tooltip=point['business_name'],
                        popup=point['business_name']).add_to(this_map)

#use df.apply(,axis=1) to "iterate" through every row in your dataframe
try:
    data.apply(plotDot, axis=1)
except Exception as err:
    # apply stops at the first row that fails, so report it instead of hiding it
    print(f"Stopped adding markers: {err}")

#alternative for large files: add all points as one clustered layer
#this_map.add_child(FastMarkerCluster(data[['latitude', 'longitude']].values.tolist()))

#Fit the view to the bounds of the plotted markers
this_map.fit_bounds(this_map.get_bounds())

#Save the map to an HTML file
this_map.save('/home/jovyan/ODBiz/8-Cleaning/folium_map_test.html')


# ****** TO VIEW MAP, RIGHT CLICK THE HTML FILE AND OPEN IN NEW BROWSER TAB *******

In [1]:
import folium
import pandas as pd

data = pd.read_csv("/home/jovyan/ODBiz/8-Cleaning/filtered.csv", low_memory=False)

#create a map
this_map = folium.Map(prefer_canvas=True)

def plotDot(point):
    '''input: a row (series) that contains numeric latitude and longitude
    fields and a business_name field
    this function creates a CircleMarker and adds it to this_map'''
    folium.CircleMarker(location=[point['latitude'], point['longitude']],
                        weight=5,
                        radius=2,
                        color='magenta',
                        tooltip=point['business_name'],
                        popup=point['business_name']).add_to(this_map)

#use df.apply(,axis=1) to "iterate" through every row in your dataframe
try:
    data.apply(plotDot, axis=1)
except Exception as err:
    # apply stops at the first row that fails, so report it instead of hiding it
    print(f"Stopped adding markers: {err}")

#Fit the view to the bounds of the plotted markers
this_map.fit_bounds(this_map.get_bounds())

#Save the map to an HTML file
this_map.save('/home/jovyan/ODBiz/8-Cleaning/folium_map_test.html')


In [None]:
"""
Fixes errors revealed by validate.py
Identified errors:
    1. move "unit" in postal code to unit
    2. Postal Code:
        - replace - with ""
        - remove non-breaking spaces (\\xa0)
        - remove "MB" or "mb" or "manitoba" or "bonnet" from start, if exists
        - ensure upper case
"""
import numpy as np
import pandas as pd
import re

df = pd.read_csv("4_filled.csv", encoding = "utf-8-sig")

######
# 1.
######
# get units and boxes
def split_unit(x):
    if "-" in str(x):
        print(x, str(x).split('-')[0])
        return str(x).split('-')[0]
    else:
        return np.nan

def split_street_no(x):
    if "-" in str(x):
        print(x, str(x).split('-')[1])
        return str(x).split('-')[1]
    else:
        return x


df['temp_unit'] = df.street_no.map(split_unit)
df['temp_unit_2'] = df['postal_code'].str.extract(r'([uU]nit.*[0-9]{1,})', expand=False)
df['temp_unit_3'] = df.street_name.str.extract(r'([bB]ox.*[0-9]{1,})', expand=False)

# any existing unit values are discarded; unit is rebuilt from the three
# candidates in priority order: street_no, postal_code, street_name
df['unit'] = df['temp_unit'].fillna(df['temp_unit_2']).fillna(df['temp_unit_3'])

# remove unit and boxes from other cols
df['street_no']=df.street_no.map(split_street_no)
df['street_name']=df['street_name'].str.replace(r'[bB]ox.*[0-9]{1,}', '', regex=True)
df['postal_code']=df['postal_code'].str.replace(r'[uU]nit.*[0-9]{1,}', '', regex=True)
print(df.unit.unique())


######
# 2.
######
# remove dashes and non-breaking spaces (\xa0); literal match, not regex
df['postal_code'] = df.postal_code.str.replace("-", "", regex=False)
df['postal_code'] = df.postal_code.str.replace("\xa0", "", regex=False)

# remove prefixes from pcs that start with it
# df.postal_code = df.postal_code.str.strip(',')

def rmv_prefix(x):
    """
    Strips a leading "MB", "mb", "manitoba" or "bonnet" from a postal code.
    """
    # pd.isna catches None and every NaN, which `x in [None, np.nan]` does not
    if pd.isna(x):
        return x
    for prefix in ("MB", "mb", "manitoba", "bonnet"):
        if x.startswith(prefix):
            return x[len(prefix):]
    return x

df['postal_code'] = df.postal_code.map(rmv_prefix)

def length_6(x):
    """
    Replaces postal codes that are not 3 or 6 characters in length with None.
    Spaces are removed from the codes that are kept.
    """
    if pd.isna(x):
        return x  # keep missing values missing instead of the string "nan"
    cleaned = str(x).replace(" ", "").strip()
    if len(cleaned) not in (3, 6):
        print(f"Replacing {x, len(x)} with None")
        return None
    return cleaned

df['postal_code'] = df.postal_code.map(length_6)

# Ensure upper case
df.postal_code = df.postal_code.str.upper()


# export
df.to_csv("5_cleaned.csv", encoding="utf-8-sig", index=False)
df.loc[df.latitude.isna()].to_csv("6_missing_coord.csv", encoding="utf-8-sig")
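
The length check in the script accepts any 3- or 6-character string. A stricter follow-up check, offered here as a sketch and not as part of the original script, could test the cleaned codes against the Canadian postal code pattern (letter-digit-letter, optionally followed by digit-letter-digit). The names `is_valid_postal_code`, `FSA`, `LDU` and `POSTAL_CODE_RE` are hypothetical.

```python
import re

# Forward sortation area (first 3 characters) and local delivery unit (last 3).
# Canada Post does not use D, F, I, O, Q or U, and W and Z never come first.
FSA = r"[ABCEGHJ-NPRSTVXY][0-9][ABCEGHJ-NPRSTV-Z]"
LDU = r"[0-9][ABCEGHJ-NPRSTV-Z][0-9]"
POSTAL_CODE_RE = re.compile(rf"{FSA}({LDU})?")

def is_valid_postal_code(code):
    """Return True for an upper-case, space-free 3- or 6-character code."""
    return isinstance(code, str) and POSTAL_CODE_RE.fullmatch(code) is not None

for example in ["K1A0B1", "R3C", "D1A0B1", "K1A 0B1", None]:
    print(example, is_valid_postal_code(example))
```

Applied to the cleaned column, `df.loc[~df.postal_code.map(is_valid_postal_code)]` would list the rows to review; rows with a missing postal code are included in that list.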
