# Scoring Model
For this contest, we kept the scoring as simple as possible. We have no ground-truth values to evaluate against and only basic product information, so the scores are heuristic estimates based on our assumptions about the target market.

In [34]:
import os

import pandas as pd
import us

In [35]:
# Read in
zip_data = pd.read_csv("zip_data_final.csv", index_col=None)

# remove pesky "Unnamed" column
zip_data = zip_data.loc[:, ~zip_data.columns.str.contains('Unnamed')]

# Change zips to str and pad with 0s
zip_data["zip"] = zip_data["zip"].apply(lambda x: str(x).zfill(5))
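
The `zfill` call above pads correctly when the column was parsed as integers. If a value arrives as a float (an assumption for illustration, not something the data is known to contain), `str(501.0).zfill(5)` yields `'501.0'` rather than `'00501'`. A minimal sketch of a more defensive helper, where `format_zip` is a hypothetical name and not part of the original notebook:

```python
def format_zip(value) -> str:
    """Return a 5-character zip string from an int, float, or numeric str."""
    # int(float(...)) drops a trailing ".0" left over from float parsing
    return str(int(float(value))).zfill(5)


# The plain approach keeps the ".0" when given a float
plain = str(501.0).zfill(5)

# The helper gives the same padded result for all three input types
padded = [format_zip(501), format_zip(501.0), format_zip("00501")]
print(plain, padded)
```

This sketch does not handle missing values or ZIP+4 strings; both would raise an error and would need separate treatment.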

We removed all zip codes with 0 population and all features that we deemed unnecessary for the scoring model.

In [36]:
# Isolate zip codes with 0 population
zero_pop_zips = zip_data[zip_data["population"] == 0]

# Remove zip codes with 0 population, these are not ideal to sell to
zip_data = zip_data[zip_data["population"] > 0]

# Remove US territories from data
states = [state.abbr for state in us.states.STATES]
territories = zip_data[~zip_data.state.isin(states)]
zip_data = zip_data[zip_data.state.isin(states)]

# Remove unneeded features
zip_data = pd.DataFrame(zip_data, columns=['zip', 'city', 'state', 'population', 
                                           'pop_45-49', 'pop_50-54', 
                                           'median_indiv_income', 'rpp'])

# Create primary target population feature
zip_data['pop_45-54'] = zip_data['pop_45-49'] + zip_data['pop_50-54']

We decided to weight income with an absolute value function.

Our reasoning for this is:
* Low-income individuals cannot afford the whole life insurance product
* High-income individuals will likely have more than $25,000 in cash or liquid assets at the time of death and so have no need for this product

So, we want to penalize both high and low incomes with an absolute value function. Our function gives a perfect score to zip codes with a median normalized individual income of $40,000. This value was selected arbitrarily but seems reasonable. We assume the ideal customer is responsible and has a steady job with a reasonable income, but has few assets aside from possibly a home and a retirement account. In areas with a low to median cost of living, individuals with even slightly higher-than-median incomes are likely to have significant cash or liquid assets available at the time of death.

Normalization of income is achieved with the formula:
$$\text{normalized income} = \text{income} \times \left(2 - \frac{\text{RPP}}{100}\right)$$

Regional Price Parity (RPP) values describe the cost of goods, rents, and services relative to the national average, which is assigned a value of 100. So the RPP of an expensive zip code may be 122, while the RPP of a cheap zip code might be 87. Normalizing income this way potentially lets us target more customers in zip codes with an unusually high or low cost of living.
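
To make the normalization concrete, here is a small sketch that applies the formula to a single income at the two RPP values mentioned above. The function name `normalize_income` and the income of 50,000 are illustrative assumptions, not values from the data set.

```python
def normalize_income(income: float, rpp: float) -> float:
    """Scale income by the linear cost-of-living factor (2 - RPP / 100)."""
    return income * (2 - rpp / 100)


# Illustrative income of 50,000 at three RPP values
expensive = normalize_income(50_000, 122)  # factor 0.78, roughly 39,000
cheap = normalize_income(50_000, 87)       # factor 1.13, roughly 56,500
average = normalize_income(50_000, 100)    # factor 1.00, unchanged

print(round(expensive), round(cheap), round(average))
```

An income in an expensive area is scaled down and the same income in a cheap area is scaled up, so both are compared on a common footing.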

We also want to boost scores for zip codes with larger target populations. We do so by multiplying the normalized income score by the zip code's target population expressed as a fraction of the largest target population of any zip code in the data set.

This gives us a final formula of:

$$\text{score} = \left(1 - \frac{\lvert 40000 - X_{\text{norm. income}} \rvert}{40000}\right) \times \frac{X_{\text{target population}}}{\max(\text{target population})}$$

Here $X$ denotes a single zip code, and $\max(\text{target population})$ is the largest target population of any zip code in the data set.
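
Before applying the formula to the whole table, a scalar sketch can help check how it behaves at a few incomes. The function names and all input values below are hypothetical and chosen only for illustration.

```python
IDEAL_INCOME = 40_000  # the arbitrary target income chosen in the text


def income_component(normalized_income: float) -> float:
    """Equal to 1 at the ideal income, falling linearly on either side."""
    return 1 - abs(IDEAL_INCOME - normalized_income) / IDEAL_INCOME


def score(normalized_income: float, target_pop: float, max_target_pop: float) -> float:
    """Income component scaled by the share of the largest target population."""
    return income_component(normalized_income) * (target_pop / max_target_pop)


# Hypothetical inputs, not taken from the data set
at_ideal = score(40_000, 5_000, 10_000)      # 1.0 * 0.5
low = score(20_000, 10_000, 10_000)          # 0.5 * 1.0
high = score(60_000, 10_000, 10_000)         # 0.5 * 1.0
very_high = score(100_000, 10_000, 10_000)   # -0.5 * 1.0

print(at_ideal, low, high, very_high)
```

The penalty is symmetric around the ideal income. The income component also turns negative once normalized income exceeds twice the ideal value, so such zip codes score below 0 unless the result is clipped.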

In [38]:
# The max value for the target population
max_target_pop = zip_data['pop_45-54'].max()

# Normalize income for cost of living
zip_data['normalized_income'] = zip_data['median_indiv_income'] * (2 - zip_data['rpp'] / 100)

# Compute score
zip_data['score'] = (1 - abs(40000 - zip_data['normalized_income']) / 40000) * (zip_data['pop_45-54'] / max_target_pop)

Our top ten zip codes after scoring are:

In [39]:
scores = pd.DataFrame(zip_data, columns=['zip', 'city', 'state', 'population', 'median_indiv_income', 'rpp', 'pop_45-54', 'score'])
scores = scores.rename(index=str, columns={"zip": "Zip Code",
                                          "city": "City",
                                          "state": "State",
                                          "population": "Population",
                                          "median_indiv_income": "Median Individual Income",
                                          "rpp": "RPP",
                                          "pop_45-54": "Population Ages 45-54",
                                          "score": "Score"})
scores.sort_values(by="Score", ascending=False).head(10)

| | Zip Code | City | State | Population | Median Individual Income | RPP | Population Ages 45-54 | Score |
|---|---|---|---|---|---|---|---|---|
| 13438 | 30043 | Lawrenceville | GA | 88588.0 | 38131.0 | 96.3 | 14493.0 | 0.856008 |
| 35432 | 77449 | Katy | TX | 119204.0 | 37931.0 | 101.6 | 15262.0 | 0.85087 |
| 34100 | 75052 | Grand Prairie | TX | 94133.0 | 40933.0 | 100.2 | 13964.0 | 0.816566 |
| 36700 | 79936 | El Paso | TX | 111918.0 | 30951.0 | 88.7 | 14974.0 | 0.770495 |
| 40674 | 92592 | Temecula | CA | 82551.0 | 52350.0 | 116.3 | 14140.0 | 0.764217 |
| 4074 | 10314 | Staten Island | NY | 89960.0 | 48897.0 | 122.0 | 13239.0 | 0.754214 |
| 35213 | 77084 | Houston | TX | 104582.0 | 37010.0 | 101.6 | 13776.0 | 0.749376 |
| 40195 | 91709 | Chino Hills | CA | 78025.0 | 51704.0 | 117.7 | 12663.0 | 0.70831 |
| 40554 | 92336 | Fontana | CA | 94327.0 | 39384.0 | 107.4 | 12946.0 | 0.705227 |
| 9642 | 22193 | Woodbridge | VA | 82726.0 | 47012.0 | 119.1 | 12090.0 | 0.686825 |


Our top score belongs to a zip code in Lawrenceville, GA. This zip code seems like an ideal place to sell the whole life insurance product in question: it has a large target population, a reasonable median income, and a lower-than-average RPP.

Ideally, the formula would be more sophisticated. For example, we could incorporate a sliding scale based on life expectancy to raise the scores of zip codes with large 55-59 and 60-64 populations where life expectancy is sufficiently high. However, with only a heuristic basis for evaluation, it seems unwise to introduce more variables that would need to be tuned. Better yet, given labeled outcomes to train on, machine learning could be used to build a model that predicts saleability for each zip code.

In [40]:
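# Zero-population zip codes receive a score of 0
zero_pop_zips = pd.DataFrame(zero_pop_zips, columns=['zip', 'city', 'state'])
zero_pop_zips['score'] = 0.0

# Keep only the submission columns and append the zero-population zip codes
zip_data = pd.DataFrame(zip_data, columns=['zip', 'city', 'state', 'score'])
zip_data = pd.concat([zip_data, zero_pop_zips], axis=0)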

In [42]:
# Write submission; index=False avoids writing the row index as an extra "Unnamed" column
zip_data.to_csv("solution_submission.csv", index=False)