## Converting Excel to Parquet

Excel is a poor format for programmatic work: the files are bulky and slow to parse.
This notebook does a one-time conversion of the Excel files in our `data/hud`
directory to Parquet, a much more usable format.

In [1]:
import pandas as pd

from hxpr.utils import here

In [2]:
DATA_DIR = here() / "data"

In [3]:
for filename in (DATA_DIR / "hud").glob("*.xlsx"):
    print(f"Converting {filename}...")
    df = pd.read_excel(filename)

    # The `05d` format assumes ZIP was read as an integer, which drops leading
    # zeros; restore them as 5-character strings.
    df["zipcode"] = df["ZIP"].apply(lambda x: f"{x:05d}")
    df = df.drop(columns=["ZIP"])
    df = df.rename(columns=str.lower)

    # County FIPS codes are also 5 digits and need the same zero-padding.
    if "county" in df.columns:
        df["county"] = df["county"].apply(lambda x: f"{x:05d}")
        df = df.rename(columns={"county": "county_fips"})

    # Write the Parquet file next to the original workbook.
    df.to_parquet(filename.with_suffix(".parquet"))
print("Done")

Converting /Users/kwilso14/repo/brown/HX-PersonalResponsibility/data/hud/ZIP_TRACT_092021.xlsx...
Converting /Users/kwilso14/repo/brown/HX-PersonalResponsibility/data/hud/COUNTY_ZIP_092021.xlsx...
Converting /Users/kwilso14/repo/brown/HX-PersonalResponsibility/data/hud/TRACT_ZIP_092021.xlsx...
Converting /Users/kwilso14/repo/brown/HX-PersonalResponsibility/data/hud/ZIP_COUNTY_092021.xlsx...
Done
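
The zero-padding step in the loop above only works when the codes arrive as integers, because the `05d` format rejects floats and strings. As a minimal sketch of a more forgiving alternative, the hypothetical helper `zero_pad` below (not part of `hxpr` or this notebook) casts to integer first and then pads with `str.zfill`. It assumes the column has no missing values.

```python
import pandas as pd


def zero_pad(values: pd.Series, width: int = 5) -> pd.Series:
    """Return the values as zero-padded strings of the given width.

    Hypothetical helper for illustration. Assumes every value is a whole
    number (int, float, or numeric string) and that none are missing.
    """
    # Cast to int first so that 2906, 2906.0 and "2906" all become "02906".
    return values.astype(int).astype(str).str.zfill(width)


# Example with made-up codes, not values taken from the HUD files.
codes = pd.Series([2906, 10001, 501])
padded = zero_pad(codes)
print(padded.tolist())
```

Inside the loop this could replace the two `apply` calls, for example `df["zipcode"] = zero_pad(df["ZIP"])`.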
