## Dask Starter Notebook


This notebook loads the policies data into a Dask DataFrame and shows some basic computations on it. Writing Dask code is very similar to writing pandas code: many methods and attributes work the same way in both. One big difference, which is what lets Dask handle large amounts of data, is that it evaluates lazily, i.e. a result is computed only when it is explicitly requested. To request it, append `.compute()` to the method call, as shown below with `groupby()` and `describe()`. Some methods, such as `head()`, return a result immediately without `.compute()`. See the [quick intro to Dask in the pandas docs](https://pandas.pydata.org/docs/user_guide/scale.html) and the [Dask docs](https://docs.dask.org/en/stable/dataframe.html).

In [11]:
import dask.dataframe as dd


In [12]:
# Pass explicit dtypes: Dask infers types from a sample at the start of the file,
# which can disagree with values found in later partitions.
df = dd.read_csv(
    "../../data/FimaNfipPolicies.csv",
    dtype={
        "agricultureStructureIndicator": "int64",
        "baseFloodElevation": "float64",
        "basementEnclosureCrawlspace": "float64",
        "censusTract": "float64",
        "cancellationDateOfFloodPolicy": "object",
        "condominiumIndicator": "object",
        "construction": "int64",
        "countyCode": "float64",
        "crsClassCode": "float64",
        "deductibleAmountInBuildingCoverage": "object",
        "deductibleAmountInContentsCoverage": "object",
        "elevationBuildingIndicator": "int64",
        "elevationCertificateIndicator": "float64",
        "elevationDifference": "float64",
        "federalPolicyFee": "int64",
        "floodZone": "object",
        "hfiaaSurcharge": "int64",
        "houseOfWorshipIndicator": "int64",
        "latitude": "float64",
        "longitude": "float64",
        "locationOfContents": "float64",
        "lowestAdjacentGrade": "float64",
        "lowestFloorElevation": "float64",
        "nonProfitIndicator": "int64",
        "numberOfFloorsInTheInsuredBuilding": "float64",
        "obstructionType": "float64",
        "occupancyType": "float64",
        "originalConstructionDate": "object",
        "originalNBDate": "object",
        "policyCost": "int64",
        "policyCount": "int64",
        "policyEffectiveDate": "object",
        "policyTerminationDate": "object",
        "policyTermIndicator": "float64",
        "postFIRMConstructionIndicator": "int64",
        "primaryResidenceIndicator": "int64",
        "propertyState": "object",
        "reportedZipCode": "object",
        "rateMethod": "object",
        "regularEmergencyProgramIndicator": "object",
        "reportedCity": "object",
        "smallBusinessIndicatorBuilding": "int64",
        "totalBuildingInsuranceCoverage": "float64",
        "totalContentsInsuranceCoverage": "float64",
        "totalInsurancePremiumOfThePolicy": "float64",
        "id": "object",
    },
)
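Several integer-looking columns in the `dtype` mapping above are declared as `float64`. A likely reason is that they contain missing values, which a plain `int64` column cannot hold. The sketch below illustrates this with pandas on a hypothetical three-row sample; the rows are invented, and only the column names are borrowed from the mapping.

```python
import io

import pandas as pd

# Hypothetical sample: crsClassCode has one missing value, policyCount has none.
sample = io.StringIO(
    "policyCount,crsClassCode\n"
    "1,7\n"
    "2,\n"
    "1,5\n"
)

# Let pandas infer the types and collect them as a {column: dtype name} dict.
inferred = pd.read_csv(sample).dtypes.astype(str).to_dict()
print(inferred)
```

The complete column is inferred as `int64`, while the column with a gap becomes `float64` so that the gap can be stored as `NaN`. A mapping like `inferred` could serve as a starting point for the `dtype` argument, though types inferred from a small sample may not hold for the whole file.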

In [20]:
df.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 46 entries, agricultureStructureIndicator to id
dtypes: object(15), float64(19), int64(12)

In [14]:
df.head()

,agricultureStructureIndicator,baseFloodElevation,basementEnclosureCrawlspace,censusTract,cancellationDateOfFloodPolicy,condominiumIndicator,construction,countyCode,crsClassCode,deductibleAmountInBuildingCoverage,...,propertyState,reportedZipCode,rateMethod,regularEmergencyProgramIndicator,reportedCity,smallBusinessIndicatorBuilding,totalBuildingInsuranceCoverage,totalContentsInsuranceCoverage,totalInsurancePremiumOfThePolicy,id
0,0,1.0,,22071000000.0,,N,0,22071.0,7.0,1,...,LA,70128,1,R,Temporarily Unavailable,0,34600.0,0.0,92.0,139f1262-a301-44c9-bb58-98d5b8763032
1,0,1.0,,22071000000.0,,N,0,22071.0,7.0,1,...,LA,70128,1,R,Temporarily Unavailable,0,34600.0,0.0,92.0,8fe661f8-ab82-4566-baa3-091dc729e1f4
2,0,1.0,,22071000000.0,,N,0,22071.0,7.0,1,...,LA,70128,1,R,Temporarily Unavailable,0,34600.0,0.0,92.0,a75a4ed2-72b8-4f25-a372-da04e6349694
3,0,,,22071010000.0,,N,0,22071.0,7.0,2,...,LA,70118,1,R,Temporarily Unavailable,0,107000.0,20000.0,900.0,4d442216-c7bf-42c9-993a-2019323e1db4
4,0,,,22071000000.0,,N,0,22071.0,7.0,2,...,LA,70117,1,R,Temporarily Unavailable,0,140800.0,42000.0,1294.0,828b664a-0adf-4672-94e9-194248f438a1


In [16]:
df.groupby("countyCode").mean().compute()

countyCode,agricultureStructureIndicator,baseFloodElevation,basementEnclosureCrawlspace,censusTract,construction,crsClassCode,elevationBuildingIndicator,elevationCertificateIndicator,elevationDifference,federalPolicyFee,...,occupancyType,policyCost,policyCount,policyTermIndicator,postFIRMConstructionIndicator,primaryResidenceIndicator,smallBusinessIndicatorBuilding,totalBuildingInsuranceCoverage,totalContentsInsuranceCoverage,totalInsurancePremiumOfThePolicy
12.0,0.007246,14.969841,0.024967,,0.028986,5.617357,0.253623,2.774892,2.334921,45.167572,...,1.980072,1039.230072,1.124094,1.000000,0.722826,0.503623,0.042572,235986.956522,50278.804348,791.977355
17.0,0.000000,534.685714,1.009709,,0.000000,7.214286,0.559406,,2.952381,43.242574,...,2.336634,873.846535,1.000000,1.000000,0.480198,0.524752,0.049505,107636.633663,63130.198020,666.415842
22.0,0.000000,197.849216,0.000000,,0.003492,7.937792,0.340084,2.900000,2.240190,40.102654,...,1.674581,822.012570,1.000000,1.000000,0.791201,0.613128,0.028631,181907.122905,45710.265363,619.588687
1003.0,0.000034,73.723270,0.714378,1.003011e+09,0.003084,7.315572,0.509099,2.384697,3.326576,66.957527,...,1.842179,1293.181547,2.338633,1.001447,0.870964,0.482888,0.007610,443314.619142,60305.621884,1060.072257
1043.0,0.000000,463.555328,1.070978,1.043965e+09,0.012596,,0.498950,2.773622,3.003854,36.478656,...,1.550035,749.392582,1.000000,1.000000,0.615115,0.582225,0.017495,163810.496851,45016.165150,583.822953
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29085.0,0.000000,,,,0.000000,,0.000000,,,36.666667,...,4.000000,861.333333,1.000000,1.000000,0.000000,0.000000,0.000000,90800.000000,23400.000000,824.666667
29227.0,0.000000,,2.000000,2.922796e+10,0.000000,,0.000000,,,40.571429,...,1.000000,336.285714,1.000000,1.000000,0.000000,0.714286,0.000000,24985.714286,0.000000,257.000000
38.0,0.000000,,1.000000,,0.000000,,0.000000,,,25.000000,...,1.000000,534.000000,1.000000,1.000000,1.000000,1.000000,0.000000,250000.000000,100000.000000,421.000000
38061.0,0.000000,,1.000000,3.806196e+10,0.000000,,0.000000,,,43.583333,...,1.000000,2042.166667,1.000000,1.000000,0.000000,0.000000,0.000000,111666.666667,0.000000,1725.333333


In [18]:
df.describe().compute()

,agricultureStructureIndicator,baseFloodElevation,basementEnclosureCrawlspace,censusTract,construction,countyCode,crsClassCode,elevationBuildingIndicator,elevationCertificateIndicator,elevationDifference,...,occupancyType,policyCost,policyCount,policyTermIndicator,postFIRMConstructionIndicator,primaryResidenceIndicator,smallBusinessIndicatorBuilding,totalBuildingInsuranceCoverage,totalContentsInsuranceCoverage,totalInsurancePremiumOfThePolicy
count,61414280.0,21542780.0,25084230.0,61061610.0,61414280.0,61205340.0,44476260.0,61414280.0,14951770.0,21837320.0,...,61386170.0,61414280.0,61414280.0,61414130.0,61414280.0,61414280.0,61414280.0,61413970.0,61412080.0,61414130.0
mean,0.0005947639,322.8299,0.6813089,26250950000.0,0.001928607,26273.91,6.258923,0.1777941,2.085269,1.594389,...,1.713396,963.008,1.255176,1.00615,0.5505555,0.7004227,0.01137066,244235.4,58499.48,803.4454
std,0.02438053,1486.465,1.039847,15856050000.0,0.04387354,15884.07,1.491518,0.3823393,1.035596,78.89538,...,2.123534,1944.266,5.187299,0.1106864,0.4974376,0.4580729,0.1060253,1039159.0,60223.0,1723.44
min,0.0,-9999.0,0.0,1001020000.0,0.0,1.0,1.0,0.0,1.0,-90000.0,...,1.0,-18112.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,-17644.0
25%,0.0,8.0,0.0,12111380000.0,0.0,12111.0,5.0,0.0,1.0,1.0,...,1.0,450.0,1.0,1.0,0.0,1.0,0.0,175000.0,25000.0,348.0
50%,0.0,11.0,1.0,32031000000.0,0.0,32023.0,7.0,0.0,3.0,2.0,...,1.0,654.0,1.0,1.0,1.0,1.0,0.0,250000.0,91100.0,480.0
75%,0.0,990.0,2.0,54039010000.0,0.0,72015.0,10.0,1.0,3.0,4.0,...,11.0,1406.0,1.0,1.0,1.0,1.0,0.0,250000.0,100000.0,1232.0
max,1.0,9998.0,4.0,78030960000.0,1.0,78030.0,10.0,1.0,4.0,998.0,...,19.0,1907728.0,1203.0,3.0,1.0,1.0,1.0,265250000.0,6000000.0,1903850.0
