# Verifying MLAR Consistency

The legacy MLAR dataset and the version produced by the ETL pipeline (v2) look almost identical. The only identifiable difference is the rounding of the `tract_to_msa_income_percentage` field: the column means differ by about 0.003 between legacy and v2. I can live with that.

In [1]:
import pandas as pd

## Load a sample of the legacy MLAR dataset (2022)

I'm only loading the first 5 million rows and forcing every column to the string datatype. I don't want pandas doing any processing on the raw datasets, because I'm looking to compare string literals.

In [2]:
legacy_22_mlar = pd.read_csv("/Users/bienstocke/Desktop/2022_lar.txt", 
                             sep="|", 
                             nrows=5_000_000,
                             dtype=str,
                             na_filter=False)

legacy_subset = legacy_22_mlar[legacy_22_mlar.lei == "RVDPPPGHCGZ40J4VQ731"].copy()
legacy_subset.shape

(328053, 99)
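
To illustrate why the load forces `dtype=str` and sets `na_filter=False`, here is a minimal sketch on a made-up two-row sample. The column names and values are hypothetical, not taken from the MLAR file. It shows how default type inference would rewrite the literals that this comparison depends on.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited sample; the real file has 99 columns.
sample = "lei|rate|tract\nABC|3.50|NA\nABC|4|\n"

# Same settings as the notebook: everything stays a raw string.
as_str = pd.read_csv(io.StringIO(sample), sep="|", dtype=str, na_filter=False)

# Default settings: pandas infers types and converts NA-like tokens.
inferred = pd.read_csv(io.StringIO(sample), sep="|")

print(as_str["rate"].tolist())     # literals kept exactly as written
print(inferred["rate"].tolist())   # parsed to floats, trailing zero lost
print(as_str["tract"].tolist())    # "NA" and "" survive as strings
print(inferred["tract"].tolist())  # both become missing values
```

With inference enabled, `"3.50"` and `"3.5"` would be indistinguishable, which would hide exactly the kind of formatting difference this notebook is trying to detect.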

## Load the ETL Pipeline MLAR file

Read only the records for the LEI used above. The file is read in chunks because it won't fit in memory.

In [3]:
etl_subset_holder = []

for chunk in pd.read_csv("/Users/bienstocke/Documents/Github/kedro-etl-pipeline/hmda-etl-pipeline/data/2022/modified_lar/02_data_publication/2022_lar.txt",
                         sep="|",
                         chunksize=1_000_000,
                         dtype=str,
                         na_filter=False):
    etl_subset_holder.append(chunk[chunk.lei == "RVDPPPGHCGZ40J4VQ731"])

etl_subset = pd.concat(etl_subset_holder)
etl_subset.shape

(328053, 99)
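
The chunked read above can be wrapped in a small helper so the same filter applies to either file. This is only a sketch: `read_rows_for_lei` is a hypothetical name, and the in-memory sample stands in for the real file.

```python
import io
import pandas as pd

def read_rows_for_lei(source, lei, chunksize=1_000_000):
    """Stream a pipe-delimited file and keep only the rows for one LEI."""
    kept = [
        chunk[chunk["lei"] == lei]
        for chunk in pd.read_csv(source, sep="|", chunksize=chunksize,
                                 dtype=str, na_filter=False)
    ]
    # ignore_index=True renumbers rows 0..n-1 instead of keeping file positions.
    return pd.concat(kept, ignore_index=True)

# Hypothetical sample with a tiny chunk size to exercise the chunking path.
sample = "lei|loan_amount\nAAA|100\nBBB|200\nAAA|300\nCCC|400\nAAA|500\n"
subset = read_rows_for_lei(io.StringIO(sample), "AAA", chunksize=2)
print(subset)
```

Unlike the cell above, this sketch resets the index. Drop `ignore_index=True` if the original file positions are needed later.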

## Verify column names and ordering

In [4]:
etl_subset.columns == legacy_subset.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True])
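
The elementwise comparison above returns one boolean per column, which has to be scanned by eye. As a sketch, `Index.equals` collapses the same check into a single value. The column names below are placeholders rather than the real MLAR schema.

```python
import pandas as pd

# Placeholder column names; the real frames have 99 columns.
etl_cols = pd.Index(["lei", "loan_amount", "state_code"])
legacy_cols = pd.Index(["lei", "loan_amount", "state_code"])
reordered = pd.Index(["lei", "state_code", "loan_amount"])

# True only when the names match and appear in the same order.
same_names_and_order = etl_cols.equals(legacy_cols)
same_after_reorder = etl_cols.equals(reordered)

# Set comparison ignores order, which helps tell "reordered" from "renamed".
same_names_any_order = set(etl_cols) == set(reordered)

print(same_names_and_order, same_after_reorder, same_names_any_order)
```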

## Examine the unique values within each column

We don't have any way of (easily) aligning the legacy and new datasets. Let's instead take a look at the unique values appearing in each column. I make the assumption that if two columns have the exact same set of values, the corresponding values within each row will align. The possibility that they're identical but scrambled seems pretty low, so I'm not going to worry about that.

The following analysis indicates that there are two columns that do not overlap from a values perspective.

In [5]:
holder = {}

for column in etl_subset:
    
    unique_values_etl = set(etl_subset[column].unique())
    unique_values_legacy = set(legacy_subset[column].unique())
    
    common_record_count = len(unique_values_etl.intersection(unique_values_legacy))
    only_in_etl = len(unique_values_etl.difference(unique_values_legacy))
    
    holder[column] = {"overlap_count": common_record_count, 
                      "only_in_etl_count": only_in_etl}
    
difference_count_by_column = pd.DataFrame(holder).transpose()
difference_count_by_column.query("only_in_etl_count > 0")

    

| | overlap_count | only_in_etl_count |
|---|---|---|
| tract_minority_population_percent | 8666 | 963 |
| tract_to_msa_income_percentage | 0 | 15695 |

In [9]:
column = "tract_minority_population_percent"
unique_values_etl = set(etl_subset[column].unique().astype(float))
unique_values_legacy = set(legacy_subset[column].unique().astype(float))

common_records = list(unique_values_etl.intersection(unique_values_legacy))
only_in_etl = list(unique_values_etl.difference(unique_values_legacy))
only_in_legacy = list(unique_values_legacy.difference(unique_values_etl))

In [10]:
etl_subset[column].astype(float).mean()

40.49404303572898

In [11]:
legacy_subset[column].astype(float).mean()

40.49404303572898
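
Identical means alongside 963 string-level mismatches would be consistent with the two files formatting the same numbers differently. That is an assumption: the raw values are not shown here. The sketch below uses made-up values to show how comparing as strings and comparing as floats can disagree.

```python
import pandas as pd

# Made-up values: numerically equal, textually different.
etl_values = pd.Series(["40.5", "12.30", "7"])
legacy_values = pd.Series(["40.50", "12.3", "7.0"])

# Compared as strings, every ETL value looks unique to the ETL file.
string_only_in_etl = set(etl_values) - set(legacy_values)

# Compared as floats, nothing is unique to the ETL file.
numeric_only_in_etl = (set(etl_values.astype(float))
                       - set(legacy_values.astype(float)))

print(sorted(string_only_in_etl))
print(numeric_only_in_etl)
```

Checking `len(only_in_etl)` and `len(only_in_legacy)` from the cell above would confirm or rule this out for the real column.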