# Sociomepy

Sociomepy is a Python package that provides convenient functions, data structures, and models for manipulating socio-environmental data. It is implemented as a wrapper around the GeoPandas library and is fully compatible with all data formats that GeoPandas accepts.

## Model API
The model API builds on the metrics API to let users build explanatory models on sociome data.

We start by importing the SociomeDataFrame and initializing it with some default data.

In [2]:
from sociomepy.data import SociomeDataFrame

We can initialize a SociomeDataFrame with many different types of data. In this example, we work with an ArcGIS address file for the City of Chicago. We read the first 10,000 addresses from the file and keep only those whose Post_Comm is CHICAGO.

In [3]:
chicago = SociomeDataFrame.from_arcgis_file('../data/chicago-addresses.csv', nrows=10000)
chicago.data = chicago.data[chicago.data['Post_Comm'] == 'CHICAGO']

In [4]:
chicago.data.head()

| | Lat | Long | ADDRDELIV | Post_Comm | State | Post_Code | LSt_Type | LSt_PreDir | geometry | LOCATIONS |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41.935796 | -87.673357 | 1763 WEST WELLINGTON AVENUE | CHICAGO | IL | 60657 | AVE | W | POINT (-87.67336 41.93580) | 1 |
| 1 | 41.855258 | -87.667139 | 1622 WEST CULLERTON STREET | CHICAGO | IL | 60608 | ST | W | POINT (-87.66714 41.85526) | 1 |
| 2 | 41.917926 | -87.651651 | 917 WEST ARMITAGE AVENUE | CHICAGO | IL | 60614 | AVE | W | POINT (-87.65165 41.91793) | 1 |
| 6 | 41.868253 | -87.626334 | 1132 SOUTH WABASH AVENUE | CHICAGO | IL | 60605 | AVE | S | POINT (-87.62633 41.86825) | 1 |
| 7 | 41.8954 | -87.658712 | 732 NORTH WILLARD COURT | CHICAGO | IL | 60642 | CT | N | POINT (-87.65871 41.89540) | 1 |


## Adding Metrics
Now, we will add several metrics to this SociomeDataFrame. First, let's get our imports in order.

In [5]:
from sociomepy.data import SociomeDataFrame
from sociomepy.metrics import SpatialDensityFunction, SpatialSubdivisionFunction
from sociomepy.accessors import *

First, we add a park metric, stored in the column park_distance.

In [6]:
parks = SociomeDataFrame.from_json('https://data.cityofchicago.org/resource/2eaw-bdhe.json', access_by_location_dict('location'))
park_distance = SpatialDensityFunction(parks)
chicago.add_metric_to_data(park_distance, 'park_distance')

<sociomepy.data.SociomeDataFrame at 0x1163928b0>
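
The source does not show how SpatialDensityFunction derives its value, so the following is only a sketch of the "distance to a park" idea in plain Python. It assumes great-circle (haversine) distance to the nearest park. The two address coordinates come from the table above; the park coordinates and the helper names `haversine_km` and `nearest_distance_km` are made up for illustration and are not part of sociomepy.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_distance_km(point, candidates):
    """Distance from one point to the closest candidate location."""
    return min(haversine_km(point[0], point[1], c[0], c[1]) for c in candidates)


# Two addresses from the Chicago table, as (lat, lon)
addresses = [(41.935796, -87.673357), (41.855258, -87.667139)]

# Hypothetical park locations, not real data
parks = [(41.9400, -87.6700), (41.8500, -87.6600)]

distances = [nearest_distance_km(a, parks) for a in addresses]
```

A brute-force scan like this is fine for a handful of parks; for large datasets a spatial index would be the more usual choice.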

Next, we add crime data.

In [7]:
crimes = SociomeDataFrame.from_json('https://data.cityofchicago.org/resource/dfnk-7re6.json', access_by_attribute('longitude', 'latitude'))
crime_density = SpatialDensityFunction(crimes)
chicago.add_metric_to_data(crime_density, 'crime')

<sociomepy.data.SociomeDataFrame at 0x1163928b0>

## Adding Prediction Targets
We will try to predict Zillow house prices using these two explanatory variables. Using the metrics API, we add the Zillow price for 2022-07-31 in each address's neighborhood as a metric called prices.

In [8]:
zillow = SociomeDataFrame.from_save_file('../data/zillow')
chicago.add_subdivision(zillow, 'neighborhood', 'RegionID')
prices_in_july = SpatialSubdivisionFunction(zillow, 'neighborhood', '2022-07-31', 'RegionID')
chicago.add_metric_to_data(prices_in_july, 'prices')

<sociomepy.data.SociomeDataFrame at 0x1163928b0>
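
Conceptually, a subdivision metric gives each address the value recorded for the region that contains it. The sketch below shows that lookup in plain Python, assuming every address has already been tagged with a region ID (which is what add_subdivision appears to arrange). The region IDs, prices, and the helper `add_region_metric` are hypothetical and not part of sociomepy.

```python
# Hypothetical region table: region ID -> value of the chosen column
region_prices = {"R1": 510_000.0, "R2": 350_000.0}

# Hypothetical addresses, each tagged with the region it falls in
addresses = [
    {"address": "A", "RegionID": "R1"},
    {"address": "B", "RegionID": "R2"},
    {"address": "C", "RegionID": None},  # lies outside every region
]


def add_region_metric(rows, table, key, name):
    """Copy the region-level value onto each row under the column `name`."""
    for row in rows:
        # Rows without a matching region get None instead of raising
        row[name] = table.get(row[key])
    return rows


tagged = add_region_metric(addresses, region_prices, "RegionID", "prices")
```

Addresses that match no region end up with a missing value, so a real pipeline would need to decide whether to drop or impute them before modeling.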

## Modeling

Now, we will fit a few different explanatory models over this data using our model API.

In [9]:
from sociomepy.model import GeospatialModel

A model is initialized with two parameters: a target and a set of desired explanatory variables.

In [10]:
model = GeospatialModel('prices', ['park_distance', 'crime'])

By default the model is a regularized least squares model, but any scikit-learn model can be used. The model is fit against a particular SociomeDataFrame. Additionally, the user needs to specify what to call the predicted and residual values.

The output of the fitting procedure is another dataframe.

In [11]:
sdf = model.fit(chicago, 'predicted_prices', 'residual_prices')

In [12]:
sdf.data.head()

| | predicted_prices | residual_prices | geometry |
|---|---|---|---|
| 0 | 392936.296477 | 116489.703523 | POINT (-87.67336 41.93580) |
| 1 | 297007.212935 | 52419.787065 | POINT (-87.66714 41.85526) |
| 2 | 531509.530988 | 228836.469012 | POINT (-87.65165 41.91793) |
| 6 | 319428.661955 | -125562.661955 | POINT (-87.62633 41.86825) |
| 7 | 286719.385084 | -65926.385084 | POINT (-87.65871 41.89540) |


Because the predictions are returned as a dataframe, they are easy to reuse elsewhere. The fitting procedure also generates several statistics about the fit. For example, effects_table lists the regression coefficients. Not surprisingly, the coefficient for park_distance is positive and the coefficient for crime is negative with respect to Zillow prices.

In [13]:
model.effects_table

| | Variable | Coefficient |
|---|---|---|
| 1 | crime | -18376.90863 |
| 0 | park_distance | 22421.44902 |
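
To make the predicted values, residuals, and coefficients concrete, here is a minimal sketch of a regularized least squares fit written directly in NumPy. It is not GeospatialModel's implementation: the closed-form ridge solution, the `alpha` value, the synthetic data, and the convention residual = observed - predicted are all assumptions made for illustration.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc for the coefficients
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


# Synthetic stand-ins for two explanatory variables and a price target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 300_000 + 20_000 * X[:, 0] - 15_000 * X[:, 1] + rng.normal(scale=5_000, size=200)

w, b = fit_ridge(X, y, alpha=1.0)
predicted = X @ w + b      # analogous to a predicted_prices column
residual = y - predicted   # analogous to a residual_prices column (assumed sign)
```

Here `w` plays the role of the coefficient column in the effects table: one number per explanatory variable, whose sign shows the direction of the fitted relationship.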


We can also get other statistics. The r2 score of this model is negative (about -4.56), which means it fits worse than simply predicting the mean price everywhere.

In [14]:
model.stats

{'mse': 10520363570.805363,
 'r2': -4.564924146925049,
 'coefficients': [('park_distance', 22421.449020229607),
  ('crime', -18376.90863046265)]}
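
A negative r2 can look surprising, so here is a small worked example using the standard definition r2 = 1 - SS_res / SS_tot. The numbers are made up, and the source does not say which data the stats are evaluated on; the sketch only shows that r2 is 0 for a constant mean prediction and drops below 0 when predictions are worse than that.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [100, 200, 300, 400]

# Predicting the mean everywhere is the baseline that scores exactly 0
mean_pred = [250, 250, 250, 250]
print(r2_score(y_true, mean_pred))  # 0.0

# Predictions that run in the wrong direction score below 0
bad_pred = [400, 300, 200, 100]
print(r2_score(y_true, bad_pred))   # -3.0
```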

## Adding ACS Variables

One way to improve these predictions is to add ACS variables to the model. We create a new model that includes every socio-economic ACS variable, that is, every column whose name contains SE_.

In [15]:
gdf = SociomeDataFrame.from_save_file('../data/acs')
chicago.add_subdivision(gdf, 'tract', 'GEOID')
exp = ['park_distance', 'crime']


for c in gdf.data.columns:
    if 'SE_' in c:
        agg = SpatialSubdivisionFunction(gdf, 'tract', c, 'GEOID')
        chicago.add_metric_to_data(agg, c)
        exp.append(c)

model2 = GeospatialModel('prices', exp)
sdf = model2.fit(chicago, 'predicted_prices', 'residual_prices')

In [17]:
sdf.data.head()

| | predicted_prices | residual_prices | geometry |
|---|---|---|---|
| 0 | 615209.746285 | -105783.746285 | POINT (-87.67336 41.93580) |
| 1 | 413142.231921 | -63715.231921 | POINT (-87.66714 41.85526) |
| 2 | 779291.875724 | -18945.875724 | POINT (-87.65165 41.91793) |
| 6 | 158098.476833 | 35767.523167 | POINT (-87.62633 41.86825) |
| 7 | 264445.498114 | -43652.498114 | POINT (-87.65871 41.89540) |


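The statistics below show that this model is substantially better: the r2 score rises from about -4.56 to about 0.45, and the MSE falls from about 1.05e10 to about 4.43e9.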

In [19]:
model2.stats

{'mse': 4429541802.576027,
 'r2': 0.4450358017612569,
 'coefficients': [('park_distance', 1256.3626359302743),
  ('crime', -5605.54807305008),
  ('SE_A00001_', 12.429737088823844),
  ('SE_A00002_', 12.429706939934986),
  ('SE_A0000_1', 0.5715638651582584),
  ('SE_A0000_2', -3126.01844073742),
  ('SE_A02001_', 12.42965185936015),
  ('SE_A0200_1', 22.718724274519207),
  ('SE_A0200_2', -10.288782581931935),
  ('SE_A01001_', 12.429636462727991),
  ('SE_A0100_1', -59.04603474041965),
  ('SE_A0100_2', 75.78719138180145),
  ('SE_A0100_3', -189.22611834319895),
  ('SE_A0100_4', -86.97054279308837),
  ('SE_A0100_5', -39.962045747034026),
  ('SE_A0100_6', 41.59888097403053),
  ('SE_A0100_7', 146.88794951189865),
  ('SE_A0100_8', -30.06525744626507),
  ('SE_A0100_9', -130.82306359703279),
  ('SE_A010010', 181.4276235179752),
  ('SE_A010011', -31.390203872937278),
  ('SE_A010012', 134.21160743675625),
  ('SE_A03001_', 12.429536813882535),
  ('SE_A0300_1', -89.79285094222048),
  ('SE_A0300_2', -101

## Partial Correlation Analysis
We can also use this model to run a partial correlation analysis: to what extent do the crime and park variables add information beyond the ACS variables?

Let's first fit a model with only the ACS variables. We have already added them to the chicago dataframe, so there is no need to add them again.

In [20]:
exp = []

for c in gdf.data.columns:
    if 'SE_' in c:
        exp.append(c)

model3 = GeospatialModel('prices', exp)
sdf = model3.fit(chicago, 'predicted_prices', 'residual_prices')

The r2 score drops only slightly, from about 0.445 to about 0.431, when the crime and park variables are removed.

In [24]:
model3.stats['r2']

0.4307582391519388

We can then turn our residual_prices into a new spatial metric.

In [25]:
from sociomepy.metrics import SpatialIdentityFunction
res_prices = SpatialIdentityFunction(sdf, 'residual_prices')
chicago.add_metric_to_data(res_prices, 'residual_prices')

<sociomepy.data.SociomeDataFrame at 0x1163928b0>

In [26]:
chicago.data.head()

| | Lat | Long | ADDRDELIV | Post_Comm | State | Post_Code | LSt_Type | LSt_PreDir | geometry | LOCATIONS | ... | SE_A10001_ | SE_A10060_ | SE_A1006_1 | SE_A1006_2 | SE_B13004_ | SE_B1300_1 | SE_B1300_2 | SE_B1300_3 | SE_B1300_4 | residual_prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41.935796 | -87.673357 | 1763 WEST WELLINGTON AVENUE | CHICAGO | IL | 60657 | AVE | W | POINT (-87.67336 41.93580) | 1 | ... | 995.0 | 948.0 | 567.0 | 381.0 | 2317.0 | 63.0 | 160.0 | 223.0 | 2094.0 | -108449.902299 |
| 1 | 41.855258 | -87.667139 | 1622 WEST CULLERTON STREET | CHICAGO | IL | 60608 | ST | W | POINT (-87.66714 41.85526) | 1 | ... | 1678.0 | 1534.0 | 404.0 | 1130.0 | 4509.0 | 773.0 | 1974.0 | 2747.0 | 1762.0 | -71116.595039 |
| 2 | 41.917926 | -87.651651 | 917 WEST ARMITAGE AVENUE | CHICAGO | IL | 60614 | AVE | W | POINT (-87.65165 41.91793) | 1 | ... | 1921.0 | 1816.0 | 933.0 | 883.0 | 3969.0 | 296.0 | 205.0 | 501.0 | 3468.0 | -24806.066301 |
| 6 | 41.868253 | -87.626334 | 1132 SOUTH WABASH AVENUE | CHICAGO | IL | 60605 | AVE | S | POINT (-87.62633 41.86825) | 1 | ... | 2665.0 | 2152.0 | 1160.0 | 992.0 | 4749.0 | 776.0 | 1189.0 | 1965.0 | 2784.0 | -73111.042897 |
| 7 | 41.8954 | -87.658712 | 732 NORTH WILLARD COURT | CHICAGO | IL | 60642 | CT | N | POINT (-87.65871 41.89540) | 1 | ... | 1592.0 | 1275.0 | 398.0 | 877.0 | 2651.0 | 633.0 | 860.0 | 1493.0 | 1158.0 | -73111.042897 |


Every data point is now annotated with a residual price. A new model can then try to predict those residuals using other variables.
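
The residual workflow can be sketched outside sociomepy with ordinary least squares in NumPy. This is a simplified illustration on synthetic data, not the library's procedure: `controls` stands in for the ACS variables, `extra` for an additional metric such as crime, and the helper `ols_fit_predict` is hypothetical. A stricter partial correlation analysis would also residualize `extra` on the controls; that step is skipped here because the synthetic variables are independent.

```python
import numpy as np


def ols_fit_predict(X, y):
    """Ordinary least squares with an intercept; returns predictions and coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef, coef


rng = np.random.default_rng(1)
n = 500
controls = rng.normal(size=(n, 3))  # stand-in for the ACS variables
extra = rng.normal(size=(n, 1))     # stand-in for an additional metric
y = controls @ np.array([2.0, -1.0, 0.5]) + 3.0 * extra[:, 0] + rng.normal(scale=0.5, size=n)

# Step 1: fit the controls-only model and keep what it fails to explain
pred_controls, _ = ols_fit_predict(controls, y)
residual = y - pred_controls

# Step 2: try to explain the residual with the additional variable
pred_resid, coef = ols_fit_predict(extra, residual)
ss_res = np.sum((residual - pred_resid) ** 2)
ss_tot = np.sum((residual - residual.mean()) ** 2)
r2_resid = 1 - ss_res / ss_tot
```

A high `r2_resid` indicates that the additional variable carries information the controls did not capture, while a value near zero suggests it adds little.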