# Combining Different Types of BIG Data

-   Merging vector and raster data
-   Merging vector data with tabular (but huge) data
-   Putting it all together

In [None]:
# Start with our basic imports

import matplotlib.pyplot as plt
import pandas as pd
import os
import numpy as np
import geopandas as gpd

### Joins

-   We briefly touched on this last lecture, where we joined two DataFrames on a common column.
    -   A different type of join becomes relevant with spatial vector data
        -   The Spatial Join or `sjoin`

In [None]:
# Grab a world map that includes extra attribute data
# Note: gpd.datasets was removed in geopandas 1.0, so this cell needs an older version
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))

# Take a look at it
world.head()

### But I want to KEEP this data

-   The built-in data downloading is great, but we're trying to equip you for later work that might USE this information
    -   Currently the data is in memory, but we want to save it to a file

In [None]:
# Define a path to save the output
base_data_dir = '../../../../data'
world_filename = 'world.gpkg'
world_path = os.path.join(base_data_dir, world_filename)

# Save the GDF as a GPKG
world.to_file(world_path, driver='GPKG')

![](images/paste-39.png)

### Also get the cities of the world

In [None]:
# Grab data on cities
cities = gpd.read_file(gpd.datasets.get_path("naturalearth_cities"))
# Force cities and world to share the world CRS:
cities = cities.to_crs(world.crs)
cities.head()

In [None]:
cities_filename = 'cities.gpkg'
cities_path = os.path.join(base_data_dir, cities_filename)

# Save the GDF as a GPKG
cities.to_file(cities_path, driver='GPKG')

![](images/paste-40.png)

### Example regression

-   Suppose you want to do a regression that involves city-level observations
    -   Possible example: "Is crime lower in high income countries?"
        -   Hilariously simple, and we've already got the data

### Problem! The cities data doesn't have country GDP

-   If the cities data had a column for "Country ISO3" or something similar, this wouldn't be a problem
    -   We could merge them like we discussed previously (or even just in R)
    -   But there is no column we can merge on
-   However, their spatial information can be used to create this relationship
    -   We will do this via a spatial join
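
To see what a spatial join does on a small scale, here is a minimal sketch with made-up data. The names `toy_countries`, `toy_cities`, and all values are hypothetical and only serve to illustrate the idea; it assumes geopandas 0.10 or newer (for the `predicate` argument).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two made-up square "countries" carrying the attribute we want to transfer
toy_countries = gpd.GeoDataFrame(
    {"name": ["Westland", "Eastland"], "gdp": [100.0, 250.0]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10)],
    crs="EPSG:4326",
)

# Three made-up "cities"; city C lies outside both squares
toy_cities = gpd.GeoDataFrame(
    {"city": ["A", "B", "C"]},
    geometry=[Point(2, 5), Point(15, 5), Point(30, 5)],
    crs="EPSG:4326",
)

# Each city picks up the attributes of the polygon it intersects.
# how="inner" keeps only cities that matched some polygon.
toy_joined = gpd.sjoin(toy_cities, toy_countries, how="inner", predicate="intersects")
print(toy_joined[["city", "name", "gdp"]])
```

City C matches no polygon, so the inner join drops it; the result keeps the point geometry of the cities, not the polygons.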

In [None]:
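# Spatially join each city to the country polygon it intersects
# (geopandas >= 0.10 uses `predicate`; older versions called this argument `op`)
cities_and_countries = gpd.sjoin(cities, world, how="inner", predicate="intersects")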
cities_and_countries.head()

In [None]:
cities_and_countries_filename = 'cities_and_countries.gpkg'
cities_and_countries_path = os.path.join(base_data_dir, cities_and_countries_filename)

# Save it as a GPKG
cities_and_countries.to_file(cities_and_countries_path, driver='GPKG')

### What did we just do?

-   Let's look at it in QGIS

    -   Adding all three we see what we probably expected.

-   ![](images/paste-36.png)

-   If you untoggle the `cities` and `world` layers though, you might be surprised

-   ![](images/paste-37.png)

    -   Where did the polygons go?
        -   Check out the attributes table (right click on layer name)

-   ![](images/paste-38.png)

-   Our table now has the information that was previously just in the `world.gpkg` file!

    -   Out-of-class exercise: Save a GPKG that contains only the cities that are in countries with GDP per capita above 15,000

## Zonal Statistics

-   Often we want to combine raster and vector data

    -   Spatial join only works between vector layers (points, lines, polygons); it cannot combine a vector with a raster

        -   What can we do?

-   Our task is to sum up how much maize production there is in each country of the world

    -   This is similar to our problem set, but that only asked for a global sum

        -   Which is easy because you can just sum the whole raster

    -   How would you do this sum so that it reports the total PER COUNTRY?

-   Look at our relevant data in QGIS

    -   ![](images/paste-41.png)

-   As always, we could use a traditional GIS approach to do this, pictured below

    -   But this can be extremely slow and fail on big data

    -   ![](images/paste-42.png)
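
Conceptually, a zonal sum is just a grouped sum over pixels: every pixel belongs to one zone (here, a country), and we add up the raster values zone by zone. The sketch below uses two small made-up arrays to show the idea; it is an illustration only, not how any particular GIS library implements it.

```python
import numpy as np

# Made-up 4x4 "production" raster (e.g. tons per pixel)
production = np.array([
    [1.0, 2.0, 0.0, 4.0],
    [3.0, 1.0, 2.0, 2.0],
    [0.0, 0.0, 5.0, 1.0],
    [2.0, 2.0, 1.0, 1.0],
])

# Made-up zone raster on the same grid: each pixel holds a country id (0, 1, or 2)
zones = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 1, 1],
    [2, 2, 2, 2],
])

# Sum the production values separately for each zone id
zonal_sums = np.bincount(zones.ravel(), weights=production.ravel())

for zone_id, total in enumerate(zonal_sums):
    print(f"zone {zone_id}: {total}")
```

The per-zone totals add up to the global sum of the raster, which is the number the problem set asked for.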

### Using Pygeoprocessing Zonal Stats

In [None]:
import pygeoprocessing as pgp

# Inspect the documentation for the function: put your cursor on zonal_statistics,
# right-click, and select "Go to definition"
pgp.zonal_statistics

![](images/paste-43.png)

### Enter in the required arguments

-   The path-band might be confusing, but reading it closely, we see:

    -   It should be a tuple (or list) containing `(path_to_raster, band_number)`

In [None]:
maize_production_filename = 'maize_production.tif'
maize_production_path = os.path.join(base_data_dir, maize_production_filename)
base_raster_path_band = (maize_production_path, 1)

In [None]:
# Call the zonal_statistics function on our raster and the world polygons,
# and save the result to a variable named result
result = pgp.zonal_statistics(base_raster_path_band, world_path)

# Inspect the result
print(result)

### What is this result?

- We see a dictionary with a key for each country's index
  - Nested in there is a new dictionary containing the stats
- Unless you're a robot, this is probably hard to read. Let's join it back with our world vector file
  - The best way forward is to merge this dictionary back into our world GDF
    - The key of the dictionary corresponds to the `id` for each country within the GDF
    - So it's a simple merge on that column
- I just might assign this in the next problem set...
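
As a rough sketch of the mechanics, a nested dictionary of this shape can be turned into a table with `pd.DataFrame.from_dict` and then merged on the id. Everything below is made up for illustration: `toy_result` stands in for the zonal statistics output, `toy_world` stands in for the world GDF (a plain DataFrame is enough to show the merge), and the `fid` column is an assumption about how the keys line up with the rows. Whether the keys match the GDF's index directly or are offset by one can depend on the file format, so check a few rows by hand.

```python
import pandas as pd

# Made-up stand-in for the zonal statistics output: {feature_id: {stat_name: value}}
toy_result = {
    1: {"min": 0.0, "max": 3.0, "sum": 7.0, "count": 4, "nodata_count": 0},
    2: {"min": 0.0, "max": 5.0, "sum": 14.0, "count": 6, "nodata_count": 0},
    3: {"min": 0.0, "max": 2.0, "sum": 6.0, "count": 6, "nodata_count": 0},
}

# Made-up stand-in for the world GDF
toy_world = pd.DataFrame({"name": ["Westland", "Eastland", "Southland"]})

# Assumption: the dictionary keys are 1-based feature ids, so row 0 has id 1
toy_world["fid"] = toy_world.index + 1

# One row per feature id, one column per statistic
stats = pd.DataFrame.from_dict(toy_result, orient="index")

# Attach the statistics to the matching rows
toy_merged = toy_world.merge(stats, left_on="fid", right_index=True, how="left")
print(toy_merged[["name", "sum"]])
```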


# Appendix

### Grouping

Because geopandas builds on pandas, all the usual operations (e.g. `groupby`) that you might do on a pandas DataFrame also work for non-geometry columns.
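
A minimal sketch with a made-up GeoDataFrame (the column names and values here are hypothetical) shows a regular `groupby` on attribute columns:

```python
import geopandas as gpd
from shapely.geometry import Point

# Made-up points with two non-geometry columns
toy_places = gpd.GeoDataFrame(
    {"continent": ["X", "X", "Y"], "pop_est": [10, 20, 5]},
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
)

# Ordinary pandas groupby on attribute columns; the geometry column is not involved
pop_by_continent = toy_places.groupby("continent")["pop_est"].sum()
print(pop_by_continent)
```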

### Aggregation and Dissolve

A common and useful map operation is to aggregate by spatial region: that is, to eliminate some of the granularity of a map by combining smaller regions into larger regions. This is called a "dissolve" because it dissolves the boundaries between regions to form larger ones.

To demonstrate it, let's take the many UK local authority districts and dissolve them into the UK countries. We can do this using the `LAD20CD` column because its code always begins with a letter that represents the country: E for England, W for Wales, N for Northern Ireland, and S for Scotland. So to do the dissolve, we first create a new column, just as we would for a groupby in a regular pandas DataFrame. But instead of using groupby, we dissolve by the new column. This gives us a new map made up of only 4 polygons.
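
The steps can be sketched as follows. Since the real district file is not loaded here, the sketch uses a handful of made-up squares with invented `LAD20CD`-style codes; `toy_lads` and `toy_uk` are hypothetical names.

```python
import geopandas as gpd
from shapely.geometry import box

# Made-up "districts": unit squares with invented LAD20CD-style codes
toy_lads = gpd.GeoDataFrame(
    {"LAD20CD": ["E0000001", "E0000002", "W0000001", "S0000001", "N0000001"]},
    geometry=[
        box(0, 0, 1, 1),  # two adjacent "English" districts
        box(1, 0, 2, 1),
        box(0, 1, 1, 2),
        box(0, 2, 1, 3),
        box(3, 0, 4, 1),
    ],
)

# New column: the first letter of the code identifies the country
toy_lads["country"] = toy_lads["LAD20CD"].str[0]

# Dissolve district boundaries so that each country becomes a single geometry
toy_uk = toy_lads.dissolve(by="country")
print(toy_uk)
```

The two adjacent "E" squares are merged into one shape, and the result has one row per country letter.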