# Data Fusion

[Data fusion](https://en.wikipedia.org/wiki/Data_fusion#:~:text=Data%20fusion%20is%20the%20process,at%20which%20fusion%20takes%20place.) is `the process of integrating multiple data sources to produce more consistent, accurate, and useful information than that provided by any individual data source.`

The next steps in the journey are:

  1.  Combine polygons from LA City and State of California opportunity zones to find the opportunity zones in LA
  2.  Use geography to link LA businesses and opportunity zones
  3.  Add new data sets for Business Improvement Districts (BIDs) in LA
  4.  Use spatial techniques to understand businesses and BIDs

Before starting, we need to set up the environment.  I like to do (almost) all of my imports up front.  I normally do that with a start.py in my profile_default; running the script here accomplishes the same thing.

**Note:** This can be a bit slow because it initializes osmnx.  

In [None]:
#imports
%run start.py

In [None]:
#from 01-data-wrangling.ipynb
businesses_gdf = gpd.read_parquet('../data/businesses-gdf.parq')
la_boundary_gdf = gpd.read_parquet('../data/la-boundary.parq')
opportunity_zones_gdf = gpd.read_parquet('../data/opportunity-zones.parq')

# new data
city_council_gdf = gpd.read_file('../data/LA_City_Council_Districts_(Adopted_2021).zip')
bid_gdf = gpd.read_file('../data/Business Improvement Districts.zip')

# LA Opportunity Zones

From the previous notebook we have the polygons for all the opportunity zones in CA and the LA city boundary.  We can use geopandas spatial operations to get the opportunity zones in LA.

If we look at the problem in maps, the map on the left shows the opportunity zones for the state.  The map on the right shows the boundary for LA.  There are a couple of approaches to combine them spatially.

**Note:** Combining outputs side by side is a common display idiom.  You should design a function to do this.

In [None]:
zones_output = Output(layout={'border': '1px solid black',
                            'width': '50%'})

la_output = Output(layout={'border': '1px solid black',
                            'width': '50%'})

with zones_output:
    display(opportunity_zones_gdf.explore())

with la_output:
    display(la_boundary_gdf.explore())

HBox([zones_output, la_output])

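We can use either the sjoin or the overlay operator from geopandas.  In short, sjoin matches rows using a spatial predicate and keeps the original geometries, while overlay computes new geometries (here, the intersection).  There are a number of references online comparing the two operations.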

I am going to use the overlay operator to create a new geometry.

In [None]:
la_opportunity_zones_gdf = la_boundary_gdf.overlay(opportunity_zones_gdf, how='intersection')

Explore the map.  Notice the attributes for each opportunity zone polygon.

In [None]:
la_opportunity_zones_gdf.explore()

You may want to try this code using the sjoin operator.
```python
join_poly = opportunity_zones_gdf.sjoin(la_boundary_gdf, how='inner', predicate='within')
```

If you do, compare the dataframes.
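To see the difference before comparing the real dataframes, here is a minimal sketch on toy data. The squares, column names, and variable names are hypothetical (they are not from the LA datasets), and no CRS is set because the coordinates are just planar toy units.

```python
import geopandas as gpd
from shapely.geometry import box

# Hypothetical toy data: one "boundary" square and two "zones",
# one fully inside the boundary and one straddling its edge.
boundary = gpd.GeoDataFrame({"city": ["toy"]}, geometry=[box(0, 0, 10, 10)])
zones = gpd.GeoDataFrame(
    {"zone": ["inside", "straddling"]},
    geometry=[box(1, 1, 3, 3), box(8, 8, 12, 12)],
)

# overlay builds new geometries: the straddling zone is clipped to the boundary.
clipped = boundary.overlay(zones, how="intersection")

# sjoin keeps the original zone geometries and only decides which rows match.
within = zones.sjoin(boundary, how="inner", predicate="within")
intersects = zones.sjoin(boundary, how="inner", predicate="intersects")

print(clipped[["zone"]].assign(area=clipped.area))
print(within["zone"].tolist())
print(intersects["zone"].tolist())
```

In this toy case `within` drops the straddling zone entirely, `intersects` keeps it with its full (unclipped) shape, and `overlay` keeps only the part inside the boundary. Which behavior you want for zones that cross the city line is a modeling choice.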

# Businesses in Opportunity Zones

At this point we've used the LA boundary polygon to "select" the opportunity zones in LA.

The next step is to find the businesses in the opportunity zones.  We can use the sjoin operator to get the points (businesses) in polygon (opportunity zones).

In [None]:
biz_in_oz_gdf = businesses_gdf.sjoin(la_opportunity_zones_gdf, how='inner', predicate='within')

Back-of-the-envelope to see what we have:

In [None]:
biz_oz_count = len(biz_in_oz_gdf)

print(f"Businesses within an OZ: {biz_oz_count} ({biz_oz_count / len(businesses_gdf):.2%})")
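One sanity check worth running on a count like this: an inner sjoin returns one row per (point, polygon) match, so a point that falls inside two overlapping polygons appears twice. Whether the LA opportunity zone polygons overlap is not established in this notebook, so the sketch below uses hypothetical toy points and squares just to show the check.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical toy data: three points and two overlapping squares.
points = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(1, 1), Point(2.5, 2.5), Point(9, 9)],
)
polys = gpd.GeoDataFrame(
    {"zone": ["z1", "z2"]},
    geometry=[box(0, 0, 3, 3), box(2, 2, 5, 5)],
)

joined = points.sjoin(polys, how="inner", predicate="within")

# Rows count matches; the left index identifies the original points.
match_rows = len(joined)
unique_points = joined.index.nunique()
share = unique_points / len(points)

print(f"rows: {match_rows}, unique points: {unique_points} ({share:.2%})")
```

If the row count and the unique-index count agree on the real data, the percentage printed above is a share of businesses; if they differ, it is a share of matches.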

I suppose this is reasonable?  At least it isn't obviously wrong.

Let's see how we can better understand the relationship between businesses and Opportunity Zones.

There are several ways we might want to look at the data.  We'll start with the types of business (sector_desc from NAICS).

**Note:** Remember that I added this description, plus the sector code in parentheses.

In [None]:
biz_in_oz_gdf.sector_desc.value_counts()

Drilling down a bit, let's see OZ distribution for sector 62.

In [None]:
biz_in_oz_gdf.query("sector == '62'").TRACTCE.value_counts()

So Census tract 128303 has 164 businesses of type `Health Care and Social Assistance`.

With this as a starting point we can look at the distribution of all businesses in this OZ.

In [None]:
biz_in_oz_gdf.query("TRACTCE == '128303'").sector_desc.value_counts()

In [None]:
sum(_)  # _ holds the output of the previous cell (IPython)

So this OZ has a total of 847 businesses from the LA city dataset.  I wonder if that is a `normal` number of businesses in an OZ?

How can we start to figure that out?

In [None]:
biz_in_oz_gdf.groupby(['TRACTCE', 'sector_desc'])['sector'].count().head(20)
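A long two-level groupby is hard to scan. One alternative, sketched here with pandas `crosstab` on hypothetical toy rows (the tract codes and sector labels below are made up), is a tract-by-sector table with row and column totals.

```python
import pandas as pd

# Hypothetical toy rows standing in for biz_in_oz_gdf.
biz = pd.DataFrame(
    {
        "TRACTCE": ["0001", "0001", "0001", "0002", "0002"],
        "sector_desc": ["Health", "Health", "Retail", "Retail", "Retail"],
    }
)

# One row per tract, one column per sector, plus totals in the margins.
table = pd.crosstab(
    biz["TRACTCE"], biz["sector_desc"], margins=True, margins_name="Total"
)
print(table)
```

The `Total` column gives the per-tract business count, which is the number we want to compare across OZs.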

Now I would like to generate, and visualize, summary information for businesses in opportunity zones.

I will use the biz_in_oz_gdf to generate the needed counts and then join (merge) that information back into the la_opportunity_zones_gdf.

We can use groupby to get the counts we need for each tract, then join with the opportunity zone gdf.

In [None]:
count_df = (
    biz_in_oz_gdf.groupby('TRACTCE')['sector']
    .count()
    .rename('count')
    .reset_index()
)
la_opportunity_zones_gdf = la_opportunity_zones_gdf.merge(count_df, on='TRACTCE')
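Note that `merge` defaults to an inner join, so any zone with no matching businesses would drop out of the result. If you would rather keep those zones with a count of zero, a left merge is one option. This is a minimal pandas sketch with hypothetical toy frames, not the notebook's data.

```python
import pandas as pd

# Hypothetical toy frames: three zones, businesses in only two of them.
biz = pd.DataFrame(
    {"TRACTCE": ["0001", "0001", "0003"], "sector": ["62", "44", "62"]}
)
zones = pd.DataFrame({"TRACTCE": ["0001", "0002", "0003"]})

counts = biz.groupby("TRACTCE").size().rename("count").reset_index()

# Inner merge (the default) silently drops zone 0002.
inner = zones.merge(counts, on="TRACTCE")

# Left merge keeps every zone; missing counts become 0.
left = zones.merge(counts, on="TRACTCE", how="left")
left["count"] = left["count"].fillna(0).astype(int)

print(len(inner), len(left))
print(left)
```

Also keep in mind that re-running a merge cell on an already-merged frame produces suffixed duplicate columns (`count_x`, `count_y`), so it helps to merge into a new variable while experimenting.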

In [None]:
#la_opportunity_zones_gdf

My goal here is to show techniques for deriving quantitative information from the data.  You may want to [decide](https://www.pbcgis.com/normalize/) for yourself how useful a map of raw counts is.

In [None]:
la_opportunity_zones_gdf['count'].describe()

In [None]:
la_opportunity_zones_gdf['count'].plot();

It looks like a couple of OZs skew the stats.

We can also look at the density of businesses (biz / sq mile).

Once we compute that value we can look at the distribution in a choropleth.  You shouldn't be surprised by what it looks like!  Maybe we look at the OZs that are not downtown?  I suspect we'll want to step up a level.  Maybe look at businesses in a collection of OZs ...

**Note:** See this [note](https://www.census.gov/quickfacts/fact/note/US/LND110210) for an explanation of the next step: ALAND is land area in square meters, and dividing by 2,589,988 converts it to square miles.

In [None]:
la_opportunity_zones_gdf['density'] = la_opportunity_zones_gdf.apply(lambda row: round((row['count'] / (row['ALAND'] / 2589988)), 2), axis=1)
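The constant 2589988 is the number of square meters in a square mile. A small sketch showing where it comes from and the same density formula as a plain function; the function and constant names are hypothetical, chosen just for this illustration.

```python
# One international mile is exactly 1609.344 meters.
METERS_PER_MILE = 1609.344
SQ_METERS_PER_SQ_MILE = METERS_PER_MILE ** 2  # about 2,589,988.11


def density_per_sq_mile(count, aland_sq_m):
    """Businesses per square mile, given land area in square meters."""
    return round(count / (aland_sq_m / SQ_METERS_PER_SQ_MILE), 2)


# A made-up zone of exactly two square miles containing 100 businesses.
example = density_per_sq_mile(100, 2 * SQ_METERS_PER_SQ_MILE)
print(round(SQ_METERS_PER_SQ_MILE), example)
```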

In [None]:
#la_opportunity_zones_gdf

In [None]:
la_opportunity_zones_gdf['density'].plot()

**Note:** You really should build some code to do this!  It's getting quite repetitive.

Let's look at the results on a choropleth map.  The explore method on the geodataframe supports this out of the box.  Since I added both count and density we can look at them side-by-side.  Once the choropleths are displayed, you can navigate around a bit.  The tooltip popup has the TRACTCE, so you can run more queries as we've done above.

In [None]:
count_output = Output(layout={'border': '1px solid black',
                            'width': '50%'})

density_output = Output(layout={'border': '1px solid black',
                            'width': '50%'})

with count_output:
    display(la_opportunity_zones_gdf[['TRACTCE', 'count', 'geometry']].explore(column='count', cmap='YlOrRd', legend=True, tiles='cartodbpositron', style_kwds=dict(color="black")))

with density_output:
    display(la_opportunity_zones_gdf[['TRACTCE', 'density', 'geometry']].explore(column='density', cmap='YlOrRd', legend=True, tiles='cartodbpositron', style_kwds=dict(color="black")))

HBox([count_output, density_output])
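Since the side-by-side pattern has now appeared twice, here is one possible helper. It is a sketch only: `side_by_side` is a hypothetical name, and it assumes ipywidgets and IPython are available, as they are in the cells above.

```python
from IPython.display import display
from ipywidgets import HBox, Output


def side_by_side(*objects, border="1px solid black"):
    """Render each object in its own Output widget and lay them out in a row."""
    if not objects:
        raise ValueError("side_by_side needs at least one object to display")
    width = f"{100 / len(objects):.0f}%"
    outputs = []
    for obj in objects:
        out = Output(layout={"border": border, "width": width})
        with out:
            display(obj)
        outputs.append(out)
    return HBox(outputs)


# Placeholder strings here; in the notebook you would pass maps instead.
demo_box = side_by_side("left panel", "right panel")
```

In the notebook, the cell above could then shrink to something like `side_by_side(count_map, density_map)`, where the two arguments are the results of the two `explore(...)` calls.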

Enough of this level of analysis.  We can now start down the path of looking at higher level aggregations.

# Business Improvement Districts (BIDs)

Next we'll combine [BID](https://en.wikipedia.org/wiki/Business_improvement_district)s, Opportunity Zones, and businesses.  We can use similar techniques to those described above.

Steps in this section:

  1. We created bid_gdf at the beginning of this notebook.
  2. I will look at the BIDs with the standard geopandas explore method.
  3. Use the point-in-polygon idiom, as above, to associate businesses (by LOCATION) with the BID.
  4. Combine BID and OZ polygons for new perspective.
  5. Introduce ipyleaflet maps to explore.
  
**Note:** Because of the number of businesses and the limitations of browser-based maps, we'll look at one BID to show the steps.

In [None]:
print(f"There are {len(bid_gdf)} BIDs in the LA City dataset.")

In [None]:
bid_gdf.explore()

Since this is the first time we've used the BID dataset, and we want to apply spatial operations, we need to verify the CRS.

In [None]:
bid_gdf.crs

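We need to change the CRS: a spatial join only gives meaningful results when both layers use the same CRS, so we reproject the BIDs to EPSG:4326 to match businesses_gdf.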

In [None]:
bid_gdf = bid_gdf.to_crs('EPSG:4326')
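Rather than hard-coding the target CRS each time a new layer arrives, you could reproject to whatever the reference layer uses. A minimal sketch, assuming a hypothetical helper named `align_crs` and a toy point near downtown LA. As far as I know geopandas warns, rather than fails, on a CRS mismatch in a spatial join, so an explicit check like this is cheap insurance.

```python
import geopandas as gpd
from shapely.geometry import Point


def align_crs(gdf, reference):
    """Return gdf reprojected to reference's CRS (unchanged if they already match)."""
    if gdf.crs is None or reference.crs is None:
        raise ValueError("both layers need a CRS before they can be aligned")
    if gdf.crs != reference.crs:
        return gdf.to_crs(reference.crs)
    return gdf


# Toy layers: the same point stored in two different CRSs.
reference = gpd.GeoDataFrame(geometry=[Point(-118.25, 34.05)], crs="EPSG:4326")
other = reference.to_crs("EPSG:3857")

aligned = align_crs(other, reference)
print(aligned.crs, aligned.geometry.iloc[0])
```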

Now, with the same crs we can apply the sjoin operator to associate each business with a BID polygon.

In [None]:
biz_in_bid_gdf = businesses_gdf.sjoin(bid_gdf, how='inner', predicate='within')

In [None]:
len(biz_in_bid_gdf)

I think it's interesting this number is very similar to the number in OZs.  

In [None]:
biz_in_bid_gdf.columns

In [None]:
biz_in_bid_gdf.prog_name.value_counts()

# Conclusion

We started looking at ways to combine the datasets.

Three datasets have been curated for the rest of the analysis.  Specifically, we:

  1. Used a spatial intersection to find the opportunity zones in LA
  2. Used a point-in-polygon approach to link (sjoin) businesses and OZs
  3. Demonstrated possible queries and statistics for businesses
  4. Introduced Business Improvement Districts (BIDs)
  5. Joined BIDs and businesses for analysis
  
The next [notebook](3-combined-details.ipynb) will show how to combine and get some of the details.