## Generating Data Subsets in Parquet Format with Dask

Because the flood insurance policies dataset is too large to fit into memory, we subset it with Dask and save the subsets we want to work with in Parquet format.

In [1]:
import dask.dataframe as dd
import pandas as pd
import numpy as np


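We read the data from the .csv file with manually optimized datatypes. Dask evaluates lazily, so this cell only sets up the computation; the full file is read when the subsets are written to disk further down.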

In [2]:
df = dd.read_csv(
    "../data/FimaNfipPolicies.csv",
    dtype={
        "agricultureStructureIndicator": "bool",
        "baseFloodElevation": "float32",
        "basementEnclosureCrawlspace": "float32",
        "censusTract": "float32",
        "cancellationDateOfFloodPolicy": "string",
        "condominiumIndicator": "string",
        "construction": "bool",
        "countyCode": "string",
        "crsClassCode": "float32",
        "deductibleAmountInBuildingCoverage": "string",
        "deductibleAmountInContentsCoverage": "string",
        "elevationBuildingIndicator": "bool",
        "elevationCertificateIndicator": "float32",
        "elevationDifference": "float32",
        "federalPolicyFee": "int16",
        "floodZone": "string",
        "hfiaaSurcharge": "int16",
        "houseOfWorshipIndicator": "bool",
        "latitude": "float32",
        "longitude": "float32",
        "locationOfContents": "float32",
        "lowestAdjacentGrade": "float32",
        "lowestFloorElevation": "float32",
        "nonProfitIndicator": "bool",
        "numberOfFloorsInTheInsuredBuilding": "float32",
        "obstructionType": "float32",
        "occupancyType": "float32",
        "originalConstructionDate": "string",
        "originalNBDate": "string",
        "policyCost": "int32",
        "policyCount": "int16",
        "policyEffectiveDate": "string",
        "policyTerminationDate": "string",
        "policyTermIndicator": "float32",
        "postFIRMConstructionIndicator": "bool",
        "primaryResidenceIndicator": "bool",
        "propertyState": "string",
        "reportedZipCode": "string",
        "rateMethod": "string",
        "regularEmergencyProgramIndicator": "string",
        "reportedCity": "string",
        "smallBusinessIndicatorBuilding": "bool",
        "totalBuildingInsuranceCoverage": "float64",
        "totalContentsInsuranceCoverage": "float32",
        "totalInsurancePremiumOfThePolicy": "float64",
        "id": "string",
    },
)

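To illustrate why the explicit datatypes matter, the following minimal sketch compares the in-memory size of three numeric columns under pandas' default type inference with the compact datatypes chosen above. The four rows of toy data are made up for illustration and are not taken from the policies file; only the column names and datatypes come from the cell above.

```python
import io

import pandas as pd

# Toy data (hypothetical values); column names match the policies data.
csv_text = (
    "latitude,policyCost,policyCount\n"
    "29.7,512,1\n"
    "30.1,1340,1\n"
    "29.9,875,2\n"
    "32.5,430,1\n"
)

# Subset of the dtype mapping passed to read_csv above.
compact_dtypes = {"latitude": "float32", "policyCost": "int32", "policyCount": "int16"}

default_df = pd.read_csv(io.StringIO(csv_text))
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=compact_dtypes)

# Bytes used by the column data only (index excluded).
default_bytes = int(default_df.memory_usage(index=False).sum())
compact_bytes = int(compact_df.memory_usage(index=False).sum())

print(f"default dtypes: {default_bytes} bytes")
print(f"compact dtypes: {compact_bytes} bytes")
```

The saving per row is small, but it applies to every row of the full file, which is why narrowing the datatypes is worthwhile before subsetting.
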
### State Parquets

We produce parquets for three states: Texas, Montana and Louisiana. Texas and Louisiana experienced major flood events in recent years, whose impact on insurance we will explore in a separate notebook. Montana did not, and serves a role analogous to that of a control group. The id and propertyState columns are dropped: the former takes up a lot of space without being useful for analysis, and the latter becomes unnecessary once the data is split by state.

In [4]:
df_tx = df[df.propertyState == "TX"].drop(["propertyState", "id"], axis=1)
df_tx.to_parquet("../data/TXPolicies.parquet")

In [5]:
# Montana's postal abbreviation is "MT" ("MO" would select Missouri)
df_mt = df[df.propertyState == "MT"].drop(["propertyState", "id"], axis=1)
df_mt.to_parquet("../data/MTPolicies.parquet")

In [6]:
df_la = df[df.propertyState == "LA"].drop(["propertyState", "id"], axis=1)
df_la.to_parquet("../data/LAPolicies.parquet")
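
The three cells above repeat the same filter-and-drop step. As a sketch of how that step could be factored out, the hypothetical helper `subset_by_state` below (not part of the original notebook) wraps it in a function. It is demonstrated on a small pandas DataFrame with made-up values; since it only uses boolean indexing and `drop`, it should work the same way on the Dask DataFrame `df`.

```python
import pandas as pd


def subset_by_state(frame, state, drop_cols=("propertyState", "id")):
    """Return the rows for one state, without the columns not needed for analysis."""
    return frame[frame["propertyState"] == state].drop(list(drop_cols), axis=1)


# Toy stand-in for the Dask DataFrame `df` (values are made up).
toy = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3", "d4"],
        "propertyState": ["TX", "LA", "TX", "FL"],
        "policyCost": [512, 1340, 875, 430],
    }
)

# One subset per state code of interest.
subsets = {state: subset_by_state(toy, state) for state in ["TX", "LA"]}
print(subsets["TX"])
```

With the real data, each subset would then be written out as before, for example `subset_by_state(df, "TX").to_parquet("../data/TXPolicies.parquet")`.
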

### Sampled Parquet

A further option for exploring the data is to take a random sample of the dataset instead of splitting it by category. The following cell produces a sample parquet with 2% of the rows, using a fixed random_state so that the sample is reproducible.

In [8]:
df_sample = df.sample(frac=0.02, random_state=42).drop(["id"], axis=1)
df_sample.to_parquet("../data/SamplePolicies.parquet")
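
As a quick illustration of what `frac` and `random_state` do, the sketch below applies the same call to a small pandas DataFrame. The 1,000 rows are made up and only stand in for the policies data.

```python
import pandas as pd

# Toy stand-in for the full data set: 1,000 rows with made-up values.
toy = pd.DataFrame(
    {
        "id": [f"policy-{i}" for i in range(1000)],
        "policyCost": range(1000),
    }
)

# Same call as in the cell above, run twice with the same seed.
sample_a = toy.sample(frac=0.02, random_state=42).drop(["id"], axis=1)
sample_b = toy.sample(frac=0.02, random_state=42).drop(["id"], axis=1)

print(len(sample_a))              # 20 rows, i.e. 2% of 1,000
print(sample_a.equals(sample_b))  # True: same seed, same rows
```

Note that Dask samples each partition separately, so on the real data the row count should be approximately, rather than exactly, 2% of the total.
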