# 1. Intro

In terms of ethnic background, which U.S. states are the most representative of the entire country?

It is easy enough to get data on the ethnic makeup of U.S. states and make a simple ranking. However, all of the obvious metrics I thought of had some kind of flaw in this context, which made me feel uneasy about using them. This raised the question: is there an alternative metric, better suited for this particular task?

I believe that there is. The goal of this notebook is to formulate this new metric and investigate its properties.


In [1]:
import pandas as pd
from scipy.stats import pearsonr
from sklearn.metrics import mean_absolute_error, mean_squared_error

Let's start by grabbing our data, which comes from [KFF](https://www.kff.org/other/state-indicator/distribution-by-raceethnicity/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D).

In [2]:
df = pd.read_csv('../raw_data.csv', skiprows = 2)
df = df.set_index('Location').drop(columns = ['Total','Footnotes'])

for col in df.columns:
    df[col] = pd.to_numeric(df[col],errors = 'coerce').fillna(0)

In [3]:
df_country_values = df[df.index == 'United States'].values[0]
df_states = df[df.index != 'United States']
print(df_states.head())
print(df_country_values)

            White  Black  Hispanic  Asian  American Indian/Alaska Native  \
Location                                                                   
Alabama     0.654  0.265     0.044  0.014                          0.004   
Alaska      0.600  0.022     0.070  0.060                          0.151   
Arizona     0.542  0.043     0.318  0.033                          0.039   
Arkansas    0.721  0.152     0.078  0.016                          0.006   
California  0.364  0.053     0.395  0.147                          0.004   

            Native Hawaiian/Other Pacific Islander  Multiple Races  
Location                                                            
Alabama                                      0.000           0.019  
Alaska                                       0.015           0.083  
Arizona                                      0.002           0.024  
Arkansas                                     0.004           0.024  
California                                   0.004   

# 2. Simple rankings

So we have our target numbers (the US population), and our candidate versions (the states). We want to rank the states in such a way that the highest-ranked states have distributions similar to the whole country. When quantifying similarity, a few metrics come to mind immediately: MAE, MSE, and correlation.

In [4]:
def make_comparisons(row):
    # Compare one state's distribution against the national distribution.
    mae = mean_absolute_error(row.values, df_country_values)
    mse = mean_squared_error(row.values, df_country_values)
    corr = pearsonr(row.values, df_country_values)[0]
    return pd.Series([mae, mse, corr], index = ['mae','mse','corr'])

In [5]:
# join() returns a new DataFrame, so we never write into a slice of df
df_states = df_states.join(df_states.apply(make_comparisons, axis = 1))


In [6]:
def get_top_n(df, metric, n):
    # Error metrics sort ascending (lower is better); similarity scores sort descending.
    return df.sort_values(metric, ascending = metric not in ['corr','log_loss']).index[:n]

In [7]:
pd.DataFrame({'mae' : get_top_n(df_states,'mae',15), 
              'mse' : get_top_n(df_states,'mse',15), 
              'corr' : get_top_n(df_states,'corr',15)
             })

|   | mae | mse | corr |
|---|---|---|---|
| 0 | Illinois | Illinois | Illinois |
| 1 | New York | New York | Connecticut |
| 2 | Connecticut | Connecticut | New York |
| 3 | New Jersey | New Jersey | New Jersey |
| 4 | Virginia | Virginia | Rhode Island |
| 5 | Florida | Florida | Colorado |
| 6 | Delaware | Colorado | Massachusetts |
| 7 | Rhode Island | Rhode Island | Kansas |
| 8 | Colorado | North Carolina | Nebraska |
| 9 | North Carolina | Washington | Pennsylvania |


We have some possible lists. However, do we really love any of these metrics for this task?

MAE is interpretable, because it is proportional to the total fraction of the population that would have to change groups for the distribution to match the target distribution. However, it has an undesirable property: it only cares about absolute difference, not relative difference. So 61% vs. 60% is punished as severely as 5% vs. 4%. Intuitively, the one-point difference matters more to the smaller group, and the metric should take that into account.
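
To make the complaint about MAE concrete, here is a small sketch. The three-group distributions are made up purely for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical target and two candidates (illustration only, not real data).
target = np.array([0.60, 0.04, 0.36])
big_group_off = np.array([0.61, 0.04, 0.35])    # one point moved into the 60% group
small_group_off = np.array([0.60, 0.05, 0.35])  # one point moved into the 4% group

def mae(candidate, target):
    # Mean absolute error across groups.
    return float(np.mean(np.abs(candidate - target)))

mae_big = mae(big_group_off, target)
mae_small = mae(small_group_off, target)

# Relative change of the group that gained the extra point.
rel_big = 0.01 / 0.60
rel_small = 0.01 / 0.04

print(mae_big, mae_small)
print(rel_big, rel_small)
```

Both candidates get the same MAE, even though the second one changes the small group by a quarter of its size while the first changes the large group by under 2%.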

MSE has the same problem, except worse: by squaring the errors, it will tend to care even more about larger groups, which tend to have larger errors. It is also not interpretable. MSE is used often because errors in the real world tend to be normally distributed, but that does not really apply here.

Correlation has the opposite problem. It gives every dimension equal weight, even the ones with very low values. We really shouldn't care as much about a group with 0.0001% population and 0.0001% variance as we do about a group with 10% population and 10% variance. Plus, again, it isn't really interpretable.

In addition to the aforementioned mundane problems, I feel like all these metrics also have a more fundamental problem. They all fail to take advantage of the unique nature of population data, which is that the fractions always sum to 1. The traditional metrics don't care about that and apply equally well to data with no specific sum.

A natural question we might ask is, "what is special about a list of numbers that add up to 1?" My answer is that we can think of it as a probability distribution, and sample it. 

So... what if we did sample it? Perhaps we can choose N people from each candidate distribution and see how often the sample matches the target?

# 3. A new approach

Let's try this approach to rank two populations, Y and Z, on their similarity to X. Our first instinct might be to run a bunch of experiments, choosing N people from each population and seeing whether Y or Z matches X more often. However, there is an immediate problem: if a single group has a proportion of 0 in Y or Z, but greater than 0 in X, the simulated population will never match.
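
The zero-group problem can be seen without running any simulation, by computing the exact multinomial probability of a match. The sketch below uses made-up three-group distributions, and the helper name `exact_match_probability` is my own.

```python
from math import factorial, prod

def exact_match_probability(counts, probs):
    """Multinomial probability of drawing exactly `counts` when sampling from `probs`."""
    n_c = factorial(sum(counts))
    for c in counts:
        n_c //= factorial(c)  # number of orderings that give these counts
    return n_c * prod(p ** c for p, c in zip(probs, counts))

# Hypothetical example: X = (0.5, 0.4, 0.1), so 10 draws "match" X as counts (5, 4, 1).
x_counts = [5, 4, 1]
y = [0.5, 0.5, 0.0]  # close to X, but the third group is missing entirely
z = [0.2, 0.2, 0.6]  # far from X, but every group is present

p_y = exact_match_probability(x_counts, y)
p_z = exact_match_probability(x_counts, z)
print(p_y, p_z)
```

Here Y can never reproduce X because its third group is empty, so it loses to Z even though Z looks much less like X.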

To get around the problem, we can change our experiment a little bit. Let's start with X and mix in an infinitesimal smidgeon of Y or Z, so that the mixture stays close to X but moves slightly towards the other distribution. Our problem statement changes to "which population can be averaged with the original distribution while causing the least distortion in overall sample probabilities?" We will start with a weighted average of X and the other distribution, then later take the limit as the weight on the other distribution goes to 0.

The probability of a sample from the mixture of X and Y matching the original is $n_c \prod_{i=1}^{j} ((1-w)x_i + wy_i)^{x_i n}$, where $x_i$ is the proportion of X which is group $i$, $y_i$ is the proportion of Y which is group $i$, $j$ is the number of groups within the population, $w$ is the weight of the Y distribution, $n$ is the number of draws we are making, and $n_c$ is the number of ways of ordering the draws to match the exact population. For example, if X has two 50-50 groups and $n$ is two, the "true" distribution can be A-B or B-A, so $n_c = 2$. With a higher value of $n$, $n_c$ will become a very large number. Here we are assuming that $x_i n$ is always an integer, which is essentially true when we bring $n$ to infinity. The same equation applies equally well to Z instead of Y.

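Because the logarithm is monotonic, we can take the log and drop the constants ($\log n_c$ and the factor $n$, which are the same for Y and Z) without changing which distribution scores higher. After all, we are only interested in which distribution is more likely, not exactly how likely either is. This reduces to $\sum_{i=1}^{j} x_i \log((1-w)x_i + wy_i)$.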

Now we want to take the limit as $w$ approaches 0. If we apply that directly, we get $\sum_{i=1}^{j} x_i \log(x_i)$, which does not help us because it is always the same.
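
A quick numerical sketch of the score so far, using made-up distributions and a helper I am calling `mixture_score`: at $w = 0$ the score cannot tell Y and Z apart, but for a small positive $w$ it already can.

```python
import numpy as np

def mixture_score(x, y, w):
    """Sum over groups of x_i * log((1 - w) * x_i + w * y_i)."""
    return float(np.sum(x * np.log((1 - w) * x + w * y)))

# Hypothetical distributions (illustration only, not real data).
x = np.array([0.5, 0.4, 0.1])
y = np.array([0.5, 0.5, 0.0])  # fairly close to x
z = np.array([0.2, 0.2, 0.6])  # far from x

score_y_at_0 = mixture_score(x, y, 0.0)
score_z_at_0 = mixture_score(x, z, 0.0)
score_y_small = mixture_score(x, y, 0.01)
score_z_small = mixture_score(x, z, 0.01)

print(score_y_at_0, score_z_at_0)    # identical: both equal sum(x * log(x))
print(score_y_small, score_z_small)  # y's mixture scores higher than z's
```

Note that Y's empty third group no longer causes trouble, because the mixture always keeps some of X's mass in every group.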

Fortunately we can also look at derivatives. If the first derivative with respect to $w$ is higher for one distribution than the other, then for sufficiently small $w$ that distribution's score will be higher. Taking the first derivative we get
$\sum_{i=1}^{j} \frac{x_i(y_i-x_i)}{-wx_i+wy_i+x_i}$. Taking the limit as $w$ goes to 0, we get $\sum_{i=1}^{j} (y_i - x_i)$, which is an extremely boring constant 0, since $y$ and $x$ both sum to 1.

This is looking bad but we can keep going and look at more derivatives. Perhaps the second derivative will be a tiebreaker. 

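The second derivative is $-\sum_{i=1}^{j} \frac{x_i(y_i-x_i)^2}{(-wx_i+wy_i+x_i)^2}$. Taking the limit as $w$ approaches 0, we get $-\sum_{i=1}^{j} \frac{(y_i-x_i)^2}{x_i}$. The sum itself is never negative, so the distribution with the smaller $\sum_{i=1}^{j} \frac{(y_i-x_i)^2}{x_i}$ loses likelihood more slowly as we mix it in, and that sum is a well-behaved error metric!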
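
As a sanity check on the calculus, the derivatives at $w = 0$ can be approximated with central finite differences. This is only a sketch on made-up distributions; `mixture_score` is my own helper name and the step size `h` is an arbitrary choice.

```python
import numpy as np

def mixture_score(x, y, w):
    """Sum over groups of x_i * log((1 - w) * x_i + w * y_i)."""
    return float(np.sum(x * np.log((1 - w) * x + w * y)))

# Hypothetical distributions (illustration only, not real data).
x = np.array([0.5, 0.4, 0.1])
y = np.array([0.45, 0.35, 0.2])

h = 1e-3  # step size for the central differences
first = (mixture_score(x, y, h) - mixture_score(x, y, -h)) / (2 * h)
second = (mixture_score(x, y, h) - 2 * mixture_score(x, y, 0.0)
          + mixture_score(x, y, -h)) / h**2

# The closed form for the limit of the second derivative is minus this sum.
weighted_error = float(np.sum((y - x) ** 2 / x))

print(first)                    # approximately 0
print(second, -weighted_error)  # approximately equal
```

The first difference comes out at roughly zero and the second at roughly minus the weighted sum, so a smaller sum means the likelihood drops off more slowly.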

It is a weighted sum of squared errors, where each group is weighted by the inverse of its "true" value. We can rearrange it to $\sum_{i=1}^{j} (y_i-x_i) \cdot \frac{(y_i-x_i)}{x_i}$, which shows that we are calculating the product of the absolute deviation and the proportional deviation for each group. This gives us exactly the properties we wanted: with the proportional deviation held equal, the group with the higher absolute deviation matters more, and with the absolute deviation held equal, the group for which it is a higher percentage matters more. Plus, it is interpretable. We didn't just declare this as a metric because it has nice properties, we set it up as the answer to a meaningful question!
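
Here is a small sketch of that behaviour on made-up numbers (not the real data). Both candidates are off by one percentage point in two groups, but one of them puts the error in a small group. The function name `inverse_weighted_error` is just for this illustration.

```python
import numpy as np

def inverse_weighted_error(candidate, target):
    """Sum over groups of (candidate_i - target_i)^2 / target_i."""
    return float(np.sum((candidate - target) ** 2 / target))

# Hypothetical target and two candidates (illustration only).
target = np.array([0.60, 0.04, 0.36])
big_group_off = np.array([0.61, 0.04, 0.35])    # one point moved into the 60% group
small_group_off = np.array([0.60, 0.05, 0.35])  # one point moved into the 4% group

score_big = inverse_weighted_error(big_group_off, target)
score_small = inverse_weighted_error(small_group_off, target)
print(score_big, score_small)
```

The candidate that distorts the small group gets the larger error, which is the property MAE was missing. Readers may also recognize the sum as having the same form as Pearson's chi-squared statistic with the sample size divided out.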

We may be worried that this metric involves division, invoking the specter of divide-by-zero errors. However, this will not happen because the denominator is $x_i$, and the frequency of every group at the aggregate level must be greater than 0.

Now let's try applying this new metric.

In [8]:
import numpy as np

def inverse_weighted_mse(row):
    # Use only the group columns; the metric columns added earlier come after them.
    groups = row.values[:len(df_country_values)]
    return np.sum((groups - df_country_values)**2 / df_country_values)

In [9]:
# assign() returns a new DataFrame, so we never write into a slice of df
df_states = df_states.assign(inverse_weighted_mse = df_states.apply(inverse_weighted_mse, axis = 1))


In [10]:
pd.DataFrame({'mae' : get_top_n(df_states,'mae',15), 
              'mse' : get_top_n(df_states,'mse',15), 
              'corr' : get_top_n(df_states,'corr',15),
              'inverse_weighted_mse' : get_top_n(df_states,'inverse_weighted_mse',15)
             })

|   | mae | mse | corr | inverse_weighted_mse |
|---|---|---|---|---|
| 0 | Illinois | Illinois | Illinois | Illinois |
| 1 | New York | New York | Connecticut | Connecticut |
| 2 | Connecticut | Connecticut | New York | New York |
| 3 | New Jersey | New Jersey | New Jersey | New Jersey |
| 4 | Virginia | Virginia | Rhode Island | Rhode Island |
| 5 | Florida | Florida | Colorado | Massachusetts |
| 6 | Delaware | Colorado | Massachusetts | Florida |
| 7 | Rhode Island | Rhode Island | Kansas | Colorado |
| 8 | Colorado | North Carolina | Nebraska | Virginia |
| 9 | North Carolina | Washington | Pennsylvania | Kansas |


# 4. Conclusion

The top four are still the same, which is not surprising. Those four match the country the best by a long shot. 

Below that, we get something that looks like a mixture of the other lists. Virginia, for example, is number 5 by MAE and MSE, number 15 by correlation, and number 9 by our inverse_weighted_mse.
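
One way to look up a single state's position under every metric at once is to turn each metric column into ranks. The sketch below uses a tiny made-up table with placeholder state names and scores, not the real results; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical metric values for three placeholder states (illustration only).
scores = pd.DataFrame(
    {"mae": [0.02, 0.01, 0.03],
     "corr": [0.99, 0.97, 0.98],
     "inverse_weighted_mse": [0.05, 0.02, 0.09]},
    index=["State A", "State B", "State C"],
)

# Correlation is a similarity (higher is better); the others are errors (lower is better).
higher_is_better = {"corr"}

ranks = pd.DataFrame({
    col: scores[col].rank(ascending=col not in higher_is_better, method="min").astype(int)
    for col in scores.columns
})

print(ranks.loc["State A"])
```

With the real `df_states`, the same idea would apply to its metric columns, and `ranks.loc['Virginia']` would give the kind of comparison described above.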

Also, if you are curious about the full ranking, here it is.

In [20]:
z = pd.DataFrame({'Rank' : range(1,52)
                 ,'State' : get_top_n(df_states,'inverse_weighted_mse',51)
                 }).set_index('Rank')

In [21]:
z

| Rank | State |
|---|---|
| 1 | Illinois |
| 2 | Connecticut |
| 3 | New York |
| 4 | New Jersey |
| 5 | Rhode Island |
| 6 | Massachusetts |
| 7 | Florida |
| 8 | Colorado |
| 9 | Virginia |
| 10 | Kansas |
