# State Farm analysis

Author: Mo Al Elew

**What this notebook does/produces:**

Replicates and fact-checks all the data findings used in the publication.

**Approach:**

Each finding generally follows this pattern:
1. Quote the relevant text
2. Determine the asserted figure to reproduce
3. Run the operations to reproduce that figure
4. Assert the expected value against the actual value
5. Print the relevant text with the actual value templated in

Some findings cannot be directly tested using an assertion against a single value. In those cases, I display the relevant data slice, chart, or other presentation.
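
The five steps can be sketched as a small helper. This is only an illustration of the pattern, not the notebook's code: `check_figure` and its parameters are hypothetical names, and the two premium values are the maximum and minimum reported in the "Largest gap" section.

```python
def check_figure(asserted, actual, template):
    """Assert that a reproduced value matches the published one, then
    return the quoted sentence with the actual value templated in.

    Hypothetical helper for illustration; the notebook inlines these steps.
    """
    assert asserted == actual, f"{asserted} != {actual}"
    return template.format(actual=actual)


# Steps 1-2: the quoted text says "nearly seven times", asserted here as 6.8.
ASSERTED_FIGURE = 6.8

# Step 3: reproduce the figure from the underlying values.
max_rate, min_rate = 28071.22, 4153.92
max_div_min = round(max_rate / min_rate, 1)

# Steps 4-5: compare, then print the sentence with the reproduced value.
sentence = check_figure(
    ASSERTED_FIGURE,
    max_div_min,
    "Its location effect was {actual} times that of the lowest territory.",
)
print(sentence)
```

Keeping the assertion and the templated sentence together means a change in the data breaks the cell instead of silently printing a stale figure.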

In [1]:
import geopandas as gpd
import pandas as pd

# Constants

In [2]:
INSURER = "State Farm"
DATA_FP = "./outputs/statefarm_auto_clean_gis.zip"

ROUNDING_PRECISION = 3
WAYNE_COUNTY_FIPS = "26163"
PROJECTED_CRS = "EPSG:3078"


def prptn_to_pct(val, precision=3):
    return round(val, precision) * 100


def google_maps_lat_lon(latitude, longitude, zoom_level=12):
    return (
        f"https://maps.google.com/maps?z={zoom_level}&t=m&q=loc:{latitude}+{longitude}"
    )


def google_maps_coords(coords_shapely):
    return google_maps_lat_lon(coords_shapely.y, coords_shapely.x)
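
`prptn_to_pct` rounds the proportion first and multiplies by 100 afterwards, which can leave binary floating-point noise in the result. The sketch below contrasts that order with a hypothetical variant, `prptn_to_pct_scaled`, that multiplies before rounding. It is an alternative to consider, not the helper this notebook uses, and the example proportion is taken from the population counts printed in the 8 Mile demographics section.

```python
def prptn_to_pct(val, precision=3):
    # Same logic as the notebook helper: round the proportion, then scale.
    return round(val, precision) * 100


def prptn_to_pct_scaled(val, precision=3):
    # Hypothetical variant: scale to a percentage first, then round.
    # Two fewer decimal places are needed once the value is multiplied by 100.
    return round(val * 100, precision - 2)


proportion = 9963 / 134767  # roughly 0.0739

print(prptn_to_pct(proportion))         # may show trailing float noise
print(prptn_to_pct_scaled(proportion))  # rounded after scaling
```

Both functions agree to within floating-point error; the difference only matters when the value is printed without further formatting.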

In [3]:
RATE_Q_LABELS = [
    "lowest effect",
    "middle low",
    "median",
    "middle high",
    "highest effect",
]
INCOME_Q_LABELS = [
    "lowest income",
    "middle low",
    "median",
    "middle high",
    "highest income",
]
DENSITY_Q_LABELS = [
    "lowest density",
    "middle low",
    "median",
    "middle high",
    "highest density",
]


QUARTILE_ANALYSIS_COLS = [
    "geo_id",
    "geo_name",
    "total_pop",
    "white_tot",
    "black_tot",
    "generic_location_based_premium",
    "effect_quantile",
]

GEOID_GROUP_BY_COLS = [
    "generic_location_based_premium",
    "white_tot",
    "black_tot",
    "total_pop",
    "density",
    "median_income",
    "bg_median_income",
]

QUANTILE_GROUP_BY_COLS = ["black_tot", "white_tot", "total_pop"]

# Read data

In [4]:
gdf = gpd.read_file(DATA_FP)
gdf.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 175436 entries, 0 to 175435
Data columns (total 23 columns):
 #   Column                          Non-Null Count   Dtype   
---  ------                          --------------   -----   
 0   geo_id                          175436 non-null  object  
 1   geo_name                        175436 non-null  object  
 2   total_pop                       175436 non-null  int64   
 3   white_pct                       175436 non-null  object  
 4   black_pct                       175436 non-null  object  
 5   white_tot                       175436 non-null  int64   
 6   black_tot                       175436 non-null  int64   
 7   median_income                   175436 non-null  object  
 8   generic_location_based_premium  175436 non-null  float64 
 9   location_effect                 175436 non-null  float64 
 10  bg_geo_id                       175436 non-null  object  
 11  bg_tot_pop                      175436 non-null  int64   

# Preprocess

In [5]:
gdf["bg_median_income"] = gdf["bg_median_income"].replace("-", -1)
gdf["bg_median_income"] = gdf["bg_median_income"].replace("250,000+", 250000)
gdf["bg_median_income"] = gdf["bg_median_income"].astype(int)

gdf["median_income"] = gdf["median_income"].replace("-", -1)
gdf["median_income"] = gdf["median_income"].replace("250,000+", 250000)
gdf["median_income"] = gdf["median_income"].astype(int)

gdf["black_pct"] = gdf["black_pct"].replace("-", 0).astype(float)
gdf["white_pct"] = gdf["white_pct"].replace("-", 0).astype(float)

gdf["is_in_wayne"] = gdf["geo_id"].str.startswith(WAYNE_COUNTY_FIPS)
gdf_wayne = gdf[gdf["geo_id"].str.startswith(WAYNE_COUNTY_FIPS)]
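
The income cleanup can be illustrated on a toy column. The values below are made up, and the final masking step is an assumption about how the `-1` sentinel might be handled before summarizing; the notebook itself keeps the sentinel in place.

```python
import pandas as pd

# Toy column mimicking the raw income strings; values are illustrative only.
raw = pd.Series(["51171", "-", "250,000+", "36563"], dtype=object)

# Same replacements as the preprocessing cell: "-" becomes a -1 sentinel and
# the top-coded "250,000+" becomes 250000 before casting to integers.
income = raw.replace("-", -1).replace("250,000+", 250000).astype(int)

# The -1 sentinel is not a real income, so mask it out before summarizing.
has_income = income >= 0
print(income.tolist())
print(income[has_income].median())
```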

# Largest gap

> State Farm set its highest location-adjusted base rates in Detroit’s Morningside neighborhood, which is 97 percent Black. Its location effect was nearly seven times that of the territory with the lowest location effect in the state, somewhere in the southeast corner of Washtenaw County’s Saline Township.

In [6]:
ASSERTED_FIGURE = 6.8
max_rate = gdf["generic_location_based_premium"].max()
min_rate = gdf["generic_location_based_premium"].min()
max_div_min = round(max_rate / min_rate, 1)
assert (
    max_div_min == ASSERTED_FIGURE
), f"{max_rate} / {min_rate} == {max_div_min} != {ASSERTED_FIGURE}"

max_effect_entry = gdf[
    gdf["generic_location_based_premium"] == gdf["generic_location_based_premium"].max()
].iloc[0]
min_effect_entry = gdf[
    gdf["generic_location_based_premium"] == gdf["generic_location_based_premium"].min()
].iloc[0]
max_gmaps = google_maps_coords(max_effect_entry["geometry"])
min_gmaps = google_maps_coords(min_effect_entry["geometry"])
print(f"Max location: {max_gmaps}\nMin location: {min_gmaps}")
f"Its location effect was {max_div_min} ({max_rate} / {min_rate}) times that of a gridded territory in the southeast corner of Washtenaw County’s Saline Township, which had the lowest location effect in the state."

Max location: https://maps.google.com/maps?z=12&t=m&q=loc:42.41+-82.98
Min location: https://maps.google.com/maps?z=12&t=m&q=loc:42.1+-83.8


'Its location effect was 6.8 (28071.22 / 4153.92) times that of a gridded territory in the southeast corner of Washtenaw County’s Saline Township, which had the lowest location effect in the state.'

# Detroit minimum

> The location effects in Detroit under State Farm were significantly higher than the rest of the state. The lowest location effect in Detroit was double the lowest effect statewide and 1.4 times the state’s median effect.

In [7]:
gdf_detroit = gdf[gdf["is_in_detroit"]]

In [8]:
ASSERTED_FIGURE = 1.4
detroit_min_effect = gdf_detroit["location_effect"].min()
assert (
    ASSERTED_FIGURE == detroit_min_effect
), f"{ASSERTED_FIGURE} != {detroit_min_effect}"

ASSERTED_FIGURE = 2
detroit_min_rate = gdf_detroit["generic_location_based_premium"].min()
detroit_min_rate_div_state_min = round(detroit_min_rate / min_rate)
assert (
    ASSERTED_FIGURE == detroit_min_rate_div_state_min
), f"{ASSERTED_FIGURE} != {detroit_min_rate_div_state_min}"

print(
    f"""The lowest location effect in Detroit was {detroit_min_rate_div_state_min} ({detroit_min_rate} / {min_rate}) times the lowest effect statewide
    and {detroit_min_effect} times the state’s median effect."""
)

The lowest location effect in Detroit was 2 (8673.96 / 4153.92) times the lowest effect statewide
    and 1.4 times the state’s median effect.


# 8 Mile Rd.

## Preprocess

In [9]:
gdf_north_8_rd = gdf[gdf["is_north_8_mile"]]
gdf_south_8_rd = gdf[gdf["is_south_8_mile"]]

gdf_8_mile_south_nn = gpd.sjoin_nearest(
    gdf_south_8_rd.to_crs("EPSG:3857"),
    gdf_north_8_rd.to_crs("EPSG:3857"),
    distance_col="distances",
    lsuffix="south",
    rsuffix="north",
    exclusive=True,
)
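
Conceptually, the nearest join pairs each territory south of the road with its closest territory to the north so that their rates can be compared. The pure-Python sketch below illustrates that idea on made-up planar coordinates. It is not how `gpd.sjoin_nearest` is implemented, and the territory names, coordinates, and rates are all hypothetical.

```python
import math

# Hypothetical territories: (x, y) in projected metres and a location rate.
south = {"s1": ((0.0, 0.0), 18000.0), "s2": ((1000.0, -200.0), 9000.0)}
north = {"n1": ((100.0, 900.0), 9000.0), "n2": ((1100.0, 800.0), 10000.0)}


def nearest_north(south_xy):
    """Return (name, distance) of the closest northern territory."""
    return min(
        ((name, math.dist(south_xy, xy)) for name, (xy, _) in north.items()),
        key=lambda pair: pair[1],
    )


pairs = {}
for s_name, (s_xy, s_rate) in south.items():
    n_name, distance = nearest_north(s_xy)
    n_rate = north[n_name][1]
    # South-to-north ratio: above 1 means the southern rate is higher.
    pairs[s_name] = (n_name, round(distance, 1), s_rate / n_rate)

print(pairs)
```

In this toy data, `s1` pairs with `n1` and pays twice its rate, while `s2` pairs with `n2` and pays slightly less.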

## Demographics

> Multiple territories in predominantly Black neighborhoods south of the road had a significantly higher location effect than the territories in the mostly White neighborhoods.

In [10]:
def calculate_racial_demographics(df, area_name="This region"):
    """Print the White and Black population shares for the rows in df."""
    total_pop = df["total_pop"].sum()
    white_tot = df["white_tot"].sum()
    black_tot = df["black_tot"].sum()
    # Scale to a percentage before rounding to avoid floating-point noise.
    white_pct = round(white_tot / total_pop * 100, ROUNDING_PRECISION - 2)
    black_pct = round(black_tot / total_pop * 100, ROUNDING_PRECISION - 2)
    print(
        f"{area_name} is {white_pct}% ({white_tot} / {total_pop}) White and {black_pct}% ({black_tot} / {total_pop}) Black."
    )


calculate_racial_demographics(
    gdf_north_8_rd.drop_duplicates(subset=["geo_id"], keep="first"),
    area_name="The area north of 8 Mile",
)
print("\n")
calculate_racial_demographics(
    gdf_south_8_rd.drop_duplicates(subset=["geo_id"], keep="first"),
    area_name="The area south of 8 Mile",
)

The area north of 8 Mile is 39.0% (51499 / 131881) White and 50.1% (66022 / 131881) Black.


The area south of 8 Mile is 7.4% (9963 / 134767) White and 88.1% (118664 / 134767) Black.


In [11]:
assert (
    len(gdf_north_8_rd) != gdf_north_8_rd["geo_id"].nunique()
), "Each block maps to a single rate grid"
gdf_n_temp = gdf_north_8_rd.drop_duplicates(subset=["geo_id"], keep="first")
assert (
    len(gdf_n_temp) == gdf_n_temp["geo_id"].nunique()
), "There are multiple entries for a single Block"
is_maj_white = gdf_n_temp["white_pct"] > 50
maj_white_blocks_pct = round(is_maj_white.sum() / len(gdf_n_temp), 3) * 100
print(f"{maj_white_blocks_pct}% of blocks north of 8 Mile are majority White.")

30.8% of blocks north of 8 Mile are majority White.


## Topline stat

> Eighty percent of territories south of 8 Mile Road had location effects higher than their northern counterparts. Nearly half of those territories had location effects 50 percent higher or more than their neighboring block north. There were no blocks south of the road that paid less than 10 percent of its northern counterpart.


For each gridded unit south of 8 Mile, I calculate this ratio:
$$
  south\ north\ ratio = \frac{Location\ rate\ south\ of\ 8\ Mile}{Closest\ location\ rate\ north\ of\ 8\ Mile}
$$

In [12]:
gdf_8_mile_south_nn["sn_ratio"] = (
    gdf_8_mile_south_nn["generic_location_based_premium_south"]
    / gdf_8_mile_south_nn["generic_location_based_premium_north"]
)
gdf_8_mile_south_nn = gdf_8_mile_south_nn.sort_values(["sn_ratio"])

In [13]:
south_terr_count = len(gdf_8_mile_south_nn["location_effect_south"])
is_higher_effect_south = (
    gdf_8_mile_south_nn["location_effect_south"]
    > gdf_8_mile_south_nn["location_effect_north"]
)
south_terr_with_higher_effect_count = is_higher_effect_south.sum()
south_terr_with_higher_effect_pct = (
    round(south_terr_with_higher_effect_count / south_terr_count, ROUNDING_PRECISION)
    * 100
)
f"{south_terr_with_higher_effect_pct} percent ({south_terr_with_higher_effect_count} / {south_terr_count}) of territories south of 8 Mile Road had location effects higher than their northern counterparts."

'79.5 percent (62 / 78) of territories south of 8 Mile Road had location effects higher than their northern counterparts.'

In [14]:
ASSERTED_FIGURE_MIN = 0.485
south_terr_one_point_five_count = (gdf_8_mile_south_nn["sn_ratio"] > 1.49).sum()
actual_figure = south_terr_one_point_five_count / south_terr_with_higher_effect_count
assert actual_figure >= ASSERTED_FIGURE_MIN, f"{actual_figure} < {ASSERTED_FIGURE_MIN}"
f"{round(actual_figure, 3) * 100} percent of those territories had location effects at least 50 percent higher than their neighboring block north."

'51.6 percent of those territories had location effects at least 50 percent higher than their neighboring block north.'

In [15]:
is_lower_effect_south = gdf_8_mile_south_nn["sn_ratio"] < 1
avg_effect_lower_than_nn = round(
    gdf_8_mile_south_nn[is_lower_effect_south]["sn_ratio"].mean(), 2
)
print(
    f"On average when a location rate was lower the ratio between the effect south to north was {avg_effect_lower_than_nn}"
)

On average when a location rate was lower the ratio between the effect south to north was 0.91


## Effect double neighboring area

> Four State Farm territories within a mile south of the road had location effects twice that of their nearest neighbors to the north. These neighborhoods south of 8 Mile Road are all at least 60 percent Black, while their nearest neighborhoods north of 8 Mile Road were all less than 10 percent Black and majority White. 

I filter out unpopulated territories because one of these gridded territories is centered in a park adjacent to one of the other territories whose location effect is double that of its northern neighbor.

In [16]:
has_population = gdf_8_mile_south_nn["total_pop_south"] > 0
is_double_effect = (gdf_8_mile_south_nn["sn_ratio"] > 2) & has_population
double_effect_count = is_double_effect.sum()
f"{double_effect_count} State Farm territories within a mile south of the road had location effects twice that of their nearest neighbors to the north."

'5 State Farm territories within a mile south of the road had location effects twice that of their nearest neighbors to the north.'

In [17]:
gdf_8_mile_south_nn[is_double_effect]

Unnamed: 0,geo_id_south,geo_name_south,total_pop_south,white_pct_south,black_pct_south,white_tot_south,black_tot_south,median_income_south,generic_location_based_premium_south,location_effect_south,...,bg_black_pct_north,bg_median_income_north,is_in_detroit_north,is_along_8_mile_north,is_south_8_mile_north,is_north_8_mile_north,density_north,is_in_wayne_north,distances,sn_ratio
22905,26163538600,Census Tract 5386; Wayne County; Michigan,5888,4.3,94.6,256,5572,51171,19547.4,3.17,...,6.0,105250,False,True,False,True,0.001611,False,3016.852144,2.092323
22903,26163538300,Census Tract 5383; Wayne County; Michigan,2010,9.4,89.0,189,1788,36563,18407.84,2.98,...,3.4,86801,False,True,False,True,0.002216,False,3016.852144,2.118815
22904,26163538400,Census Tract 5384; Wayne County; Michigan,4145,14.5,76.5,601,3169,105109,18969.71,3.07,...,0.0,95132,False,True,False,True,0.002216,False,3016.852144,2.134556
22902,26163538300,Census Tract 5383; Wayne County; Michigan,2010,9.4,89.0,189,1788,36563,18688.87,3.03,...,1.4,92699,False,True,False,True,0.002183,False,3016.852144,2.157799
22901,26163509000,Census Tract 5090; Wayne County; Michigan,1463,11.2,79.4,164,1162,48281,18707.97,3.03,...,6.5,70150,False,True,False,True,0.00105,False,3016.852144,2.212735


In [18]:
is_60_plus_pct_black = gdf_8_mile_south_nn["black_pct_south"] >= 60
assert is_double_effect.sum() == (is_60_plus_pct_black & is_double_effect).sum()
min_black_pct_south = gdf_8_mile_south_nn[is_double_effect]["black_pct_south"].min()

is_maj_white_north = gdf_8_mile_south_nn["white_pct_north"] > 50
is_second_claim = (
    is_double_effect
    & is_maj_white_north
    & (gdf_8_mile_south_nn["black_pct_north"] < 10)
)
assert is_second_claim.sum() == is_double_effect.sum()
max_black_pct_north = gdf_8_mile_south_nn[is_double_effect]["black_pct_north"].max()
min_white_pct_north = gdf_8_mile_south_nn[is_double_effect]["white_pct_north"].min()


print(
    f"""These neighborhoods south of 8 Mile Road are all at least {min_black_pct_south} percent Black,
while their nearest neighborhoods north of 8 Mile Road were all less than {max_black_pct_north} percent Black and at least {min_white_pct_north} percent White."""
)

These neighborhoods south of 8 Mile Road are all at least 76.5 percent Black,
while their nearest neighborhoods north of 8 Mile Road were all less than 8.9 percent Black and at least 80.6 percent White.
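
The assertions in this cell check that every row matching one condition also matches another by comparing counts of boolean masks. Below is a minimal sketch of that pattern on made-up data; the column names mirror the joined frame, but the values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "sn_ratio": [2.1, 2.2, 1.3, 0.9],
        "black_pct_south": [94.6, 76.5, 40.0, 88.0],
    }
)

is_double_effect = df["sn_ratio"] > 2
is_60_plus_pct_black = df["black_pct_south"] >= 60

# Count-based form used in the notebook: the subset loses no rows when the
# second condition is added, so every double-effect row satisfies it.
assert is_double_effect.sum() == (is_double_effect & is_60_plus_pct_black).sum()

# Equivalent form: restrict to the subset and require the condition for all rows.
assert is_60_plus_pct_black[is_double_effect].all()

min_black_pct = df.loc[is_double_effect, "black_pct_south"].min()
print(min_black_pct)
```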


> On average, the location effects directly south of 8 Mile were about 40 percent higher than the location effects just north of the road.

In [19]:
avg_sn_ratio = round(gdf_8_mile_south_nn["sn_ratio"].mean(), 2)
# Convert the south-to-north ratio into a "percent higher" figure.
avg_pct_higher = int(round((avg_sn_ratio - 1) * 100))
print(
    f"On average, the location effects directly south of 8 Mile were about {avg_pct_higher} percent higher than the location effects just north of the road."
)

On average, the location effects directly south of 8 Mile were about 38 percent higher than the location effects just north of the road.


# Location effect quantiles

Since State Farm uses a gridded map, the same `geo_id` can appear in more than one row, so I attempted three different ways to analyze the quantiles by demographics:

1. Average the generic rate for each tract `geo_id`
2. Drop all duplicate tract `geo_id` entries within quantiles - this approach allows the same tract to appear in multiple quantiles
3. Drop all duplicate tract `geo_id` entries, retaining the one in the lowest quantile

The main analysis uses the first approach. I moved the other two approaches to the [appendix below](#Different_quantile_approaches).
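
The three approaches can be compared on a toy frame. This is a sketch under assumptions: the `geo_id` values, rates, populations, and two-level labels are made up, and only the pandas calls mirror what the notebook does.

```python
import pandas as pd

# Hypothetical grid rows: one geo_id can be covered by several rate grids.
df = pd.DataFrame(
    {
        "geo_id": ["A", "A", "B", "C", "C", "D"],
        "rate": [100.0, 300.0, 150.0, 200.0, 220.0, 400.0],
        "total_pop": [10, 10, 20, 30, 30, 40],
    }
)
labels = ["low", "high"]
df["effect_quantile"] = pd.qcut(df["rate"], q=len(labels), labels=labels)

# 1. Average the rate per geo_id, then cut the averages into quantiles.
averaged = df.groupby("geo_id")[["rate", "total_pop"]].mean()
averaged["effect_quantile"] = pd.qcut(
    averaged["rate"], q=len(labels), labels=labels
)

# 2. Drop duplicates within each quantile; a geo_id may remain in several.
within = df.drop_duplicates(["geo_id", "effect_quantile"])

# 3. Keep one row per geo_id: the one with the lowest rate.
lowest = df.sort_values("rate").drop_duplicates("geo_id")

print(len(averaged), len(within), len(lowest))
```

Only the first and third approaches count each `geo_id` once. The second keeps `A` and `C` in both quantiles, so their populations are counted twice when summing.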

## Average rate

I average the rates by geographic id to avoid double counting 

In [20]:
gdf_groupby_geo_id = gdf.groupby("geo_id")[GEOID_GROUP_BY_COLS].mean()
gdf_groupby_geo_id["income_quartile"] = pd.qcut(
    gdf_groupby_geo_id["median_income"],
    q=len(INCOME_Q_LABELS),
    labels=INCOME_Q_LABELS,
)
gdf_groupby_geo_id["effect_quantile"] = pd.qcut(
    gdf_groupby_geo_id["generic_location_based_premium"],
    q=len(RATE_Q_LABELS),
    labels=RATE_Q_LABELS,
)
gdf_groupby_geo_id["density_quantile"] = pd.qcut(
    gdf_groupby_geo_id["density"], q=len(DENSITY_Q_LABELS), labels=DENSITY_Q_LABELS
)

> Seventeen percent of the state’s residents live in tracts with location effects in the top quintile of the state. This includes 62 percent of Black Michiganders and less than 10 percent of White Michiganders.

In [21]:
gdf_groupby_quartiles = gdf_groupby_geo_id.groupby("effect_quantile", observed=False)[
    QUANTILE_GROUP_BY_COLS
].sum()

print("This calculates (group subset in quantile / total group population)")
df_distribution = prptn_to_pct(
    gdf_groupby_quartiles.div(gdf_groupby_quartiles.sum(axis=0), axis=1)
)
df_distribution

This calculates (group subset in quantile / total group population)


effect_quantile,black_tot,white_tot,total_pop
lowest effect,5.7,25.3,22.5
middle low,6.1,23.4,20.4
median,9.9,21.8,19.9
middle high,16.3,20.6,20.5
highest effect,61.9,8.9,16.7
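
The division in this cell normalizes each column by its own total, so every group's shares across the quantiles sum to 100 percent. Here is a minimal sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical population counts per effect quantile.
counts = pd.DataFrame(
    {"black_tot": [10, 30, 60], "white_tot": [50, 30, 20]},
    index=["lowest effect", "median", "highest effect"],
)

# Divide each column by its own total so every column sums to 1:
# (group population in quantile) / (group population statewide).
shares = counts.div(counts.sum(axis=0), axis=1)

print(shares)
```

Reading down a column answers "where does this group live?", which is a different question from the racial makeup within a single quantile.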


Visual first draft

In [22]:
%run ../00_misc/helper-func-notebook.ipynb
stacked_quintile_chart = stacked_race_hbar(df_distribution, "State Farm")
stacked_quintile_chart.save("../00_misc/charts/state_farm_population_quintile.png")
stacked_quintile_chart

## Detroit and Wayne County distribution

In [23]:
gdf["effect_quantile"] = pd.qcut(
    gdf["generic_location_based_premium"],
    q=len(RATE_Q_LABELS),
    labels=RATE_Q_LABELS,
)

> All of State Farm’s territories in Detroit showed location effects within the top quintile of the state.

In [24]:
gdf[gdf["is_in_detroit"]]["effect_quantile"].value_counts() / len(
    gdf[gdf["is_in_detroit"]]
)

effect_quantile
highest effect    1.0
lowest effect     0.0
middle low        0.0
median            0.0
middle high       0.0
Name: count, dtype: float64

> Furthermore, 90 percent of territories in all of Wayne County showed location effects in the top quintile of the state. 

In [25]:
prptn_to_pct(
    gdf[gdf["is_in_wayne"]]["effect_quantile"].value_counts()
    / len(gdf[gdf["is_in_wayne"]])
)

effect_quantile
highest effect    88.7
middle high        4.5
median             2.4
middle low         2.3
lowest effect      2.0
Name: count, dtype: float64
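
The share of territories per quantile is computed here as counts divided by the number of rows. pandas can also return proportions directly with `value_counts(normalize=True)`; the sketch below shows both forms on hypothetical labels.

```python
import pandas as pd

labels = ["lowest effect", "median", "highest effect"]
# Hypothetical quantile assignments for five territories.
assigned = ["highest effect"] * 3 + ["median", "highest effect"]
quantiles = pd.Series(pd.Categorical(assigned, categories=labels, ordered=True))

# Manual form used in the notebook: counts divided by the number of rows.
manual = quantiles.value_counts() / len(quantiles)

# Built-in equivalent: normalize=True returns proportions directly.
normalized = quantiles.value_counts(normalize=True)

print(normalized)
```

Because the column is categorical, quantiles with no territories still appear with a share of zero, as in the Detroit output.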

# Appendix

## Detroit range

In [26]:
highest_effect = gdf["location_effect"].max()
detroit_loc_effects = gdf[gdf["is_in_detroit"]]["location_effect"]
detroit_min_loc_effects = detroit_loc_effects.min()
detroit_max_loc_effects = detroit_loc_effects.max()
detroit_avg_loc_effect = round(detroit_loc_effects.mean(), 2)
assert highest_effect == detroit_max_loc_effects

print(
    f"{INSURER} range of location effects within Detroit: {detroit_min_loc_effects}-{detroit_max_loc_effects}. The avg: {detroit_avg_loc_effect}"
)
print(f"The highest effect {highest_effect} is in Detroit")

State Farm range of location effects within Detroit: 1.4-4.55. The avg: 3.4
The highest effect 4.55 is in Detroit


## Population density

In [27]:
gdf_temp = gdf_groupby_geo_id.pivot_table(
    index="effect_quantile", columns="density_quantile", aggfunc="count", observed=False
)["median_income"]
df_density_quintile = prptn_to_pct(gdf_temp / gdf_temp.sum())
df_density_quintile

effect_quantile \ density_quantile,lowest density,middle low,median,middle high,highest density
lowest effect,17.4,34.3,28.3,13.4,6.8
middle low,33.0,29.7,19.0,11.7,6.6
median,34.2,22.3,19.7,13.9,9.9
middle high,10.2,11.7,21.0,29.2,27.8
highest effect,5.2,2.1,12.0,31.8,49.0
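
This cross tab counts `geo_id` entries in each effect and density cell and divides every column by its total. The sketch below reproduces that normalization on hypothetical labels. It uses `pd.crosstab` instead of the notebook's `pivot_table` call, a swapped-in technique that gives the same kind of table.

```python
import pandas as pd

# Hypothetical per-geo_id quantile labels.
df = pd.DataFrame(
    {
        "effect_quantile": ["low", "low", "high", "high", "high", "low"],
        "density_quantile": ["sparse", "sparse", "dense", "dense", "sparse", "dense"],
    }
)

# Count geo_ids in each (effect, density) cell, then divide every column by
# its total so each density group sums to 1.
counts = pd.crosstab(df["effect_quantile"], df["density_quantile"])
by_column = counts / counts.sum()

# pd.crosstab can do the same normalization in one step.
direct = pd.crosstab(
    df["effect_quantile"], df["density_quantile"], normalize="columns"
)

print(by_column.round(3))
```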


In [28]:
gdf_groupby_density_quantiles = gdf_groupby_geo_id.groupby(
    "density_quantile", observed=False
)[QUANTILE_GROUP_BY_COLS].sum()
column_sums = gdf_groupby_density_quantiles.sum(axis=0)
df_density_distribution = prptn_to_pct(
    gdf_groupby_density_quantiles.div(column_sums, axis=1), 2
)
df_density_distribution

density_quantile,black_tot,white_tot,total_pop
lowest density,1.0,19.0,15.0
middle low,4.0,26.0,22.0
median,16.0,22.0,21.0
middle high,30.0,18.0,20.0
highest density,48.0,15.0,21.0


## Effect x density quantiles

In [29]:
def pivot_effect_density_quantiles(
    gdf, race_group, race_label=None, calculate_percent=True
):
    gdf_temp = gdf.pivot_table(
        index="effect_quantile",
        columns="density_quantile",
        values=race_group,
        aggfunc="sum",
        observed=False,
    )
    if calculate_percent:
        gdf_temp = prptn_to_pct(gdf_temp / gdf_temp.sum().sum())
    gdf_temp = gdf_temp.reset_index()
    if race_label:
        gdf_temp["race"] = race_label
    else:
        gdf_temp["race"] = race_group
    gdf_temp["insurer"] = INSURER
    return gdf_temp


def join_effect_density_quantiles_pivots(calculate_percent=True):
    gdf_white = pivot_effect_density_quantiles(
        gdf_groupby_geo_id, "white_tot", "White", calculate_percent=calculate_percent
    )
    gdf_black = pivot_effect_density_quantiles(
        gdf_groupby_geo_id, "black_tot", "Black", calculate_percent=calculate_percent
    )
    return pd.concat([gdf_white, gdf_black], ignore_index=True)


gdf_effect_density_quantiles_pivot = join_effect_density_quantiles_pivots(False)
gdf_effect_density_quantiles_pivot.to_csv(
    "./outputs/effect_density_quantiles_pivot_count.csv", index=False
)

## Income cross tab

In [30]:
gdf_temp = gdf_groupby_geo_id.pivot_table(
    index="effect_quantile", columns="income_quartile", aggfunc="count", observed=False
)["median_income"]
gdf_temp / gdf_temp.sum()

effect_quantile \ income_quartile,lowest income,middle low,median,middle high,highest income
lowest effect,0.105903,0.132174,0.210435,0.262609,0.289931
middle low,0.119792,0.233043,0.231304,0.238261,0.177083
median,0.154514,0.243478,0.208696,0.184348,0.208333
middle high,0.118056,0.175652,0.22087,0.229565,0.255208
highest effect,0.501736,0.215652,0.128696,0.085217,0.069444


In [31]:
%run ../00_misc/helper-func-notebook.ipynb
df_income_quintile = prptn_to_pct(gdf_temp / gdf_temp.sum(), 15)
income_hbar = stacked_income_hbar(df_income_quintile, title="State Farm")
income_hbar.save("../00_misc/charts/state_farm_income_quintile.png")
income_hbar

## Effect in top quantile

In [32]:
gdf_highest_effects = gdf_groupby_geo_id[
    gdf_groupby_geo_id["effect_quantile"] == "highest effect"
]
lowest_quantile_min_effect = (
    gdf_highest_effects["generic_location_based_premium"].min()
    / gdf_groupby_geo_id["generic_location_based_premium"].median()
)
highest_quantile_max_effect = (
    gdf_highest_effects["generic_location_based_premium"].max()
    / gdf_groupby_geo_id["generic_location_based_premium"].median()
)

print(
    f"The location effect in the top quantile ranged from {lowest_quantile_min_effect} to {highest_quantile_max_effect}"
)

The location effect in the top quantile ranged from 1.4722174813497502 to 4.264382726204148


## Export data

In [33]:
%run ../00_misc/helper-func-notebook.ipynb
df_export = datawrapper_race_distribution(df_distribution, "State Farm")
df_export.to_csv("./outputs/state_farm_race_chart_data.csv")
df_export

race,lowest effect,middle low,median,middle high,highest effect,Insurer
Black,5.7,6.1,9.9,16.3,61.9,State Farm
White,25.3,23.4,21.8,20.6,8.9,State Farm
Total,22.5,20.4,19.9,20.5,16.7,State Farm


In [34]:
%run ../00_misc/helper-func-notebook.ipynb
df_export = datawrapper_income_distribution(df_income_quintile, "State Farm")
df_export.to_csv("./outputs/state_farm_income_chart_data.csv")
df_export

income,lowest effect,middle low,median,middle high,highest effect,Insurer
Lowest income,10.590278,11.979167,15.451389,11.805556,50.173611,State Farm
Lower income,13.217391,23.304348,24.347826,17.565217,21.565217,State Farm
Middle income,21.043478,23.130435,20.869565,22.086957,12.869565,State Farm
Higher income,26.26087,23.826087,18.434783,22.956522,8.521739,State Farm
Highest incomes,28.993056,17.708333,20.833333,25.520833,6.944444,State Farm


In [35]:
%run ../00_misc/helper-func-notebook.ipynb
df_export = datawrapper_race_distribution(df_density_distribution, "State Farm")
df_export.to_csv("./outputs/state_farm_race_density_chart_data.csv")

In [36]:
%run ../00_misc/helper-func-notebook.ipynb
df_export = datawrapper_pop_density_distribution(df_density_quintile, "State Farm")
df_export.to_csv("./outputs/statefarm_pop_density_chart_data.csv")
df_export

Population density,lowest effect,middle low,median,middle high,highest effect,Insurer
Lowest density,17.4,33.0,34.2,10.2,5.2,State Farm
Lower density,34.3,29.7,22.3,11.7,2.1,State Farm
Middle density,28.3,19.0,19.7,21.0,12.0,State Farm
Higher density,13.4,11.7,13.9,29.2,31.8,State Farm
Highest density,6.8,6.6,9.9,27.8,49.0,State Farm


## Exclude Detroit

I average the rates by geographic id to avoid double counting 

In [37]:
gdf_groupby_geo_id = (
    gdf[~gdf["is_in_detroit"]].groupby("geo_id")[GEOID_GROUP_BY_COLS].mean()
)

gdf_groupby_geo_id["effect_quantile"] = pd.qcut(
    gdf_groupby_geo_id["generic_location_based_premium"],
    q=len(RATE_Q_LABELS),
    labels=RATE_Q_LABELS,
)

In [38]:
gdf_groupby_quantiles = gdf_groupby_geo_id.groupby("effect_quantile", observed=False)[
    QUANTILE_GROUP_BY_COLS
].sum()


print("This calculates (group subset in quantile / total group population)")
df_distribution = prptn_to_pct(
    gdf_groupby_quantiles.div(gdf_groupby_quantiles.sum(axis=0), axis=1)
)
df_distribution

This calculates (group subset in quantile / total group population)


effect_quantile,black_tot,white_tot,total_pop
lowest effect,7.9,23.4,21.9
middle low,8.6,22.0,20.2
median,12.7,20.7,19.7
middle high,18.5,19.1,19.5
highest effect,52.3,14.8,18.8


## Different quantile approaches

These are the results for different approaches to categorizing and analyzing the data into quantiles.

### Drop quartile duplicate entries

In [39]:
gdf["effect_quantile"] = pd.qcut(
    gdf["generic_location_based_premium"], q=len(RATE_Q_LABELS), labels=RATE_Q_LABELS
)
gdf_temp = gdf[QUARTILE_ANALYSIS_COLS]
gdf_temp = gdf_temp.drop_duplicates(["geo_id", "effect_quantile"])
duplicated_pct = round(
    gdf_temp["geo_id"].duplicated().sum() / gdf_temp["geo_id"].nunique() * 100, 1
)
print(
    f"After dropping duplicates within the quantiles, {duplicated_pct} percent of blocks are categorized into multiple quantiles"
)

After dropping duplicates within the quantiles, 70.9 percent of blocks are categorized into multiple quantiles


In [40]:
gdf_temp = gdf_temp.groupby("effect_quantile", observed=False)[
    QUANTILE_GROUP_BY_COLS
].sum()

gdf_temp.div(gdf_temp.sum(axis=0), axis=1)

effect_quantile,black_tot,white_tot,total_pop
lowest effect,0.069635,0.211331,0.197814
middle low,0.069939,0.207285,0.190451
median,0.076643,0.184526,0.172314
middle high,0.08196,0.153823,0.146659
highest effect,0.701824,0.243035,0.292762


### Drop all duplicates

In [41]:
gdf["effect_quantile"] = pd.qcut(
    gdf["generic_location_based_premium"], q=len(RATE_Q_LABELS), labels=RATE_Q_LABELS
)
gdf_temp = gdf[QUARTILE_ANALYSIS_COLS]
gdf_temp = gdf_temp.sort_values(["generic_location_based_premium"])
gdf_temp = gdf_temp.drop_duplicates(["geo_id"])
duplicated_pct = round(
    gdf_temp["geo_id"].duplicated().sum() / gdf_temp["geo_id"].nunique() * 100, 1
)
print(
    f"After dropping all duplicates, {duplicated_pct} percent of tracts are categorized into multiple quantiles"
)

After dropping all duplicates, 0.0 percent of tracts are categorized into multiple quantiles


In [42]:
gdf_temp = gdf_temp.groupby("effect_quantile", observed=False)[
    QUANTILE_GROUP_BY_COLS
].sum()
gdf_temp.div(gdf_temp.sum(axis=0), axis=1)

effect_quantile,black_tot,white_tot,total_pop
lowest effect,0.083031,0.397953,0.348263
middle low,0.043271,0.152018,0.132456
median,0.047804,0.096869,0.090546
middle high,0.04214,0.072597,0.069504
highest effect,0.783753,0.280563,0.359231
