# Introduction to GeoPandas and Geospatial Vector Data

In this second practical session we will begin more hands-on work with geospatial *vector* data (i.e. points, lines, and polygons). We will also introduce the `GeoPandas` package and explore some of its key features for working with this type of data.

Objectives:
* Read/write spatial vector datasets in different formats
* Create/convert/view geometry data
* Produce basic plots of spatial data


In [None]:
# load main packages
import geopandas as gpd
import pandas as pd

## Overview and helpful links

Remember that GeoPandas is a developing project that makes dealing with geospatial data in Python much easier. It leverages components from several other projects, including `pandas`, `shapely`, and `fiona`. 

To learn more and to find the documentation for these projects, follow these links:
* GeoPandas
    * [https://geopandas.org/](https://geopandas.org/)
    * [https://geopandas.readthedocs.io/en/latest/](https://geopandas.readthedocs.io/en/latest/)
* pandas - for DataFrames
    * [https://pandas.pydata.org/](https://pandas.pydata.org/)
* shapely - for geometry
    * [https://shapely.readthedocs.io/en/latest/](https://shapely.readthedocs.io/en/latest/)
* fiona - for file I/O
    * [https://fiona.readthedocs.io/en/latest/](https://fiona.readthedocs.io/en/latest/)

Under the hood, these packages rely on other libraries like GDAL/OGR ([https://www.osgeo.org/projects/gdal/](https://www.osgeo.org/projects/gdal/)) and proj ([https://www.osgeo.org/projects/proj/](https://www.osgeo.org/projects/proj/)).

Refer back to the documentation as you explore the practicals during this workshop. There are many helpful features and we won't be able to cover everything.


### Background on working with tabular data

The practicals will try to give enough guidance and examples of working with different data types, but if you have less experience with `pandas` you may want to refer back to some of the other tutorials for additional help along the way.

* [https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html](https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html)

## Geospatial vector data

There are different mathematical data models that can be used to represent objects, surfaces, or other geographic phenomena. These specify how data are defined, organised, updated, and queried. The *vector* data model is one example (and as we discussed in the lecture, it is usually contrasted with the *raster* data model).

Vector data represent features whose geometry can consist of:
* Points - a vertex or position in space with X, Y, and optionally Z coordinates

* Lines - sets of vertices linked by paths where the start and ending point are not the same location

* Polygons - a closed shape formed by a set of vertices and paths in a set order, where the start and ending point are the same
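
These three definitions can be illustrated with `shapely`, the geometry library that GeoPandas builds on. This is only a minimal sketch, and the coordinates are made up for illustration.

```python
from shapely.geometry import Point, LineString, Polygon

# a point: a single position in space (X, Y)
pt = Point(1.0, 2.0)

# a line: ordered vertices where the start and end are different locations
line = LineString([(0, 0), (1, 1), (2, 0)])

# a polygon: an ordered ring of vertices; shapely repeats the first
# vertex at the end so that the shape is closed
poly = Polygon([(0, 0), (2, 0), (2, 2), (0, 2)])

print(pt.geom_type, line.geom_type, poly.geom_type)

line_coords = list(line.coords)
ring_coords = list(poly.exterior.coords)
print(line_coords[0] == line_coords[-1])   # start and end differ for the line
print(ring_coords[0] == ring_coords[-1])   # start and end match for the polygon
```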

Don't confuse the data model with a specific file format. There are now several different file formats that can be used to store geospatial vector data (see also: [https://en.wikipedia.org/wiki/GIS_file_formats](https://en.wikipedia.org/wiki/GIS_file_formats)).

The *shapefile* is still a well-known and common file format because it is used by a major GIS software company (ESRI). Other common formats you're likely to encounter are *geopackages* and *GeoJSON* files. You might notice that the files we're using in the `../data` directory are compressed (&ast;.zip). This is mostly for convenience of storing them on GitHub. When working with your own data you may want to leave them unzipped.

Try unzipping one of the shapefiles (&ast;.shp.zip). You will see that it isn't one file, but a collection of several files. These contain the geometry, the attributes, and the coordinate system information, and you have to keep them all together.
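
If you prefer to look inside an archive without extracting it, Python's standard `zipfile` module can list the members. The sketch below builds a small stand-in archive so that it runs anywhere; the helper name `list_zip_members` and the `example.*` file names are hypothetical, and the four extensions are the usual shapefile components (geometry, index, attributes, projection).

```python
import os
import tempfile
import zipfile


def list_zip_members(path):
    """Return the sorted file names stored inside a zip archive."""
    with zipfile.ZipFile(path) as zf:
        return sorted(zf.namelist())


# Build a stand-in archive with empty placeholder files that mimic
# the typical parts of a zipped shapefile.
tmp_dir = tempfile.mkdtemp()
demo_zip = os.path.join(tmp_dir, 'example.shp.zip')
with zipfile.ZipFile(demo_zip, 'w') as zf:
    for ext in ('.shp', '.shx', '.dbf', '.prj'):
        zf.writestr('example' + ext, b'')

members = list_zip_members(demo_zip)
print(members)
```

With the workshop data you would pass the path to one of the zipped files instead, for example `list_zip_members('../data/cb_2018_us_state_500k.zip')` (adjust the path if you are running the notebook locally).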

If you unzip one of the geopackages (&ast;.gpkg.zip) and inspect it, you will see that inside is actually a SQLite database (which adheres to certain standards: [https://www.geopackage.org/](https://www.geopackage.org/)). There are many ways to work with vector data files, and we're going to focus just on using `GeoPandas` and Python.

Regardless of the format of the vector data, `fiona` makes it easy to read into GeoPandas using the `geopandas.read_file` command.


In [None]:
# load an example of vector data
# we will use a set of US states from the US Census Bureau
gdf = gpd.read_file('zip://../data/cb_2018_us_state_500k.zip')  # note the use of zip:// + relative path
# paths may need to be changed if you are running the notebook locally

# if we have an uncompressed shapefile instead, point towards the *.shp file: gpd.read_file('/path/to/data.shp')

## Linking place and attributes

The key idea and the real power of geospatial data comes from linking information about a feature to a location in space or in the real world. This may seem obvious and straightforward, but (as we will see) it opens up many possibilities for how to integrate many different kinds of data and what questions we can start to ask and answer.

In GeoPandas this central idea of linking attributes to places is operationalised with the core data structure of the `GeoDataFrame`.

The data we just loaded is an example of a `GeoDataFrame`.

In [None]:
type(gdf)

In [None]:
# get the dimensions (rows, cols)
gdf.shape

In [None]:
gdf.head()

We saw some of these basic operations in the first practical, but let's look at the results in a bit more detail this time.

* `.head()` shows the first rows of the dataset (from Pandas)
* each row is an observation or a feature to be represented
* the **attributes** of the observations in the `GeoDataFrame` are columns of data (in a `DataFrame`)
* there is the additional **'geometry'** column
* in this example, the geometry is using polygons
    * in fact, they are **multipolygons**. We will discuss these a bit more in a later section, but for now understand that they have multiple parts, each a polygon, which together describe a single feature


Because the `gdf` is still (also) a **DataFrame** from pandas, we can work with the attribute columns as we might with a "normal" or non-spatial DataFrame.

In [None]:
# Calculate the total area of land in the US
gdf['ALAND'].sum()

In [None]:
# GeoPandas has another way to quickly get the area of polygon shapes
# examine just the beginning
gdf.area.head()

But what units are those area measurements in? The first seems really big... and the second seems so different...

We will spend a whole practical looking at how to accurately measure things like area (and perimeter, distance, etc.). Generally, you want to be very careful using something pre-calculated like this because often the process for how the calculations were done isn't clear. 

And, as you can see in the `Warning`, the coordinate reference system used can have a big impact on the answer. This is telling us that we should "re-project" our data in order to get a more accurate answer. We'll do that later.
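
As a small preview of why the coordinate reference system matters, the sketch below measures the same made-up square twice: once in longitude/latitude and once after re-projecting. The choice of `EPSG:6933` (a global equal-area projection) is an assumption made for this illustration, not necessarily the projection we will use later.

```python
import warnings

import geopandas as gpd
from shapely.geometry import Polygon

# a 1 x 1 degree square touching the equator, stored as longitude/latitude
square = gpd.GeoSeries(
    [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])],
    crs='EPSG:4326',
)

# area in the native CRS is in "square degrees", which is rarely meaningful
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # silence the geographic-CRS warning for this demo
    area_deg = square.area.iloc[0]

# re-project to an equal-area CRS (assumed choice) to get square metres
area_m2 = square.to_crs('EPSG:6933').area.iloc[0]

print(area_deg)        # square degrees
print(area_m2 / 1e6)   # square kilometres
```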

We can select a subset of attributes and even convert the `GeoDataFrame` back into a pandas `DataFrame` by removing the geometry attribute if we really want to.

In [None]:
gdfSub = gdf[['NAME', 'GEOID', 'ALAND']]

In [None]:
# it used to be a GeoDataFrame, but now...
type(gdfSub)

In [None]:
gdfSub.head()

We can subset rows of the dataframe into a new GeoDataFrame based on conditions.

In [None]:
# Extract just North Carolina into a new GeoDataFrame
nc = gdf[gdf['NAME'] == 'North Carolina']

nc

In [None]:
nc.plot()

The geometry column deserves a bit more attention.

In [None]:
# extract just the geometry attribute
gs = gdf.geometry

In [None]:
# inspect the column (now extracted into its own vector)
type(gs)

In [None]:
gs.head()

This column is a unique attribute for spatial dataframes. In GeoPandas it is called a `GeoSeries` and it is a set of shapes for each observation in the DataFrame. It is also made up of `shapely` objects. Because of this, the `GeoSeries` allows us to use most of the methods and attributes from `shapely`.
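
To make that concrete, here is a minimal sketch using a tiny hand-made `GeoSeries` (the shapes are invented for illustration). Each attribute is evaluated element by element, just as it would be on a single `shapely` object.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

shapes = gpd.GeoSeries([
    Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
    Point(5, 5),
])

print(shapes.geom_type)  # geometry type of each element
print(shapes.area)       # element-wise area (0 for the point)
print(shapes.centroid)   # element-wise centroid
print(shapes.bounds)     # DataFrame of minx, miny, maxx, maxy per element
```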

We can also access and work with individual elements of the geometry series within the `GeoDataFrame`, though in practice we may not often do this.

In [None]:
# get the first record and plot its geometry
gdf.loc[0, 'geometry']

In [None]:
# or as an alternative
gdf.geometry[0]

In a section above we talked about a **multipolygon** instead of just a polygon. At this *scale*, the state looks like one shape, but it is, in fact, a multi-part polygon. Together these polygon pieces make up the geometric representation for this one object.

In [None]:
# how many "parts" are there in the geometry
# (multi-part geometries expose their parts through `.geoms`)
len(gdf.geometry[0].geoms)

# note, you can also run `print(gdf.geometry[0])` to see the whole, long string

In [None]:
# inspect the first part
gdf.geometry[0].geoms[0]

In [None]:
# inspect the last part
gdf.geometry[0].geoms[-1]

## Creating a GeoDataFrame

If we don't have a ready-made GIS file, there are several ways we can construct a GeoDataFrame within Python.

### Constructing manually

It's possible to construct a GeoDataFrame by hand, supplying the geometries together with the attribute data.

In [None]:
# first need to import geometry objects
from shapely.geometry import Point, LineString, Polygon

In [None]:
# create a test dataset manually
test = gpd.GeoDataFrame({
                         'geometry': [Point(1, 1), Point(2,2)],
                         'var1': [1, 2],
                         'var2': [3, 4]
                        })

print(test)
type(test)

Notice that I called the column **'geometry'**, but that can be changed.

In [None]:
# rename the geometry column to 'shape'
test = test.rename(columns={'geometry': 'shape'}).set_geometry('shape')

test

In [None]:
test.geometry

The `.geometry` attribute will always return the active column, or we can check its actual name.

In [None]:
# return the name of the column that holds the active geometries
test.geometry.name

Setting, resetting, renaming a geometry column can come in handy. Your GeoDataFrame could theoretically have two (or more) columns containing spatial information, but you could switch and set your choice with `.set_geometry`. You would still need to take care with correctly managing the objects and coordinate systems.

And it's still a `GeoDataFrame` after that switching.

In [None]:
type(test)
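
The idea of holding more than one geometry column and switching between them with `.set_geometry` can be sketched as follows. The `places` table and the `buffered` column name are hypothetical, and no coordinate reference system is set, so the buffer distance is in arbitrary units.

```python
import geopandas as gpd
from shapely.geometry import Point

places = gpd.GeoDataFrame({
    'name': ['a', 'b'],
    'geometry': [Point(0, 0), Point(3, 4)],
})

# add a second column of geometries: a buffer of radius 1 around each point
places['buffered'] = places.geometry.buffer(1)

print(places.geometry.name)  # the active geometry column before switching

# switch the active geometry; the original point column stays in the table
places = places.set_geometry('buffered')

print(places.geometry.name)  # the active geometry column after switching
print(places.geometry.geom_type.tolist())
```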

**Side Track**

The example above shows how it can be quite easy to construct geometries. Let's try a few others.

In [None]:
# Create a test polygon
poly = Polygon([(0.5, 1), (1, 2), (.75, 3), (.75, 3.5), (.5, 4)])

# notice that `Polygon` automatically closes our set of vertices for us
print(poly)
poly

In [None]:
# Create a test line (or "linestring")
# using the same set of coordinates as above
ln = LineString([(0.5, 1), (1, 2), (.75, 3), (.75, 3.5), (.5, 4)])

# this time the points aren't closed
print(ln)
ln

#### Try for yourself!

Experiment with making a few of your own geometries for points, lines, and polygons. 

What happens when:
* a `LineString` ends at the same point it started with?
* a `Polygon` intersects itself (repeating a vertex)?

Can you store different geometry types within the same GeoDataFrame?
* Hint: construct something like `test` above, but don't use all `Point` elements.

In [None]:
# Test your code here...


#

### Constructing from a `pandas` DataFrame

More commonly than manually creating data, we will need to create a **Geo**DataFrame from an existing *DataFrame* which might also contain geometry information. In the example below we are going to read in a `.csv` file which contains some coordinate locations and then we'll construct a GeoDataFrame.

This uses an example dataset of populated places from [https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-populated-places/](https://www.naturalearthdata.com/downloads/10m-cultural-vectors/10m-populated-places/) which I converted into a `.csv` file.

In [98]:
# read .csv file with pandas
df = pd.read_csv('../data/ne_cities_10m.csv')

# creates a non-spatial DataFrame
type(df)

pandas.core.frame.DataFrame

In [94]:
df.shape

(7343, 38)

In [95]:
df.head()

Unnamed: 0,scalerank,natscale,labelrank,featurecla,name,namepar,namealt,diffascii,nameascii,adm0cap,...,pop_other,rank_max,rank_min,geonameid,meganame,ls_name,ls_match,checkme,min_zoom,ne_id
0,10,1,8,Admin-1 capital,Colonia del Sacramento,,,0,Colonia del Sacramento,0.0,...,0,7,7,3443013.0,,,0,0,9.0,1159112629
1,10,1,8,Admin-1 capital,Trinidad,,,0,Trinidad,0.0,...,0,7,7,3439749.0,,,0,0,9.0,1159112647
2,10,1,8,Admin-1 capital,Fray Bentos,,,0,Fray Bentos,0.0,...,0,7,7,3442568.0,,,0,0,9.0,1159112663
3,10,1,8,Admin-1 capital,Canelones,,,0,Canelones,0.0,...,0,6,6,3443413.0,,,0,0,9.0,1159112679
4,10,1,8,Admin-1 capital,Florida,,,0,Florida,0.0,...,0,7,7,3442585.0,,,0,0,7.0,1159112703


There are a lot of columns in this dataset, more than `pandas` shows by default. Among the hidden ones are two columns, `latitude` and `longitude`, that you can use to create `Point` objects and convert the data into a GeoDataFrame.

In [99]:
# we can do the conversion in one step
# pass the data frame along with a shapely object created from the coordinates
cities = gpd.GeoDataFrame(df,
                          geometry = gpd.points_from_xy(df.longitude, df.latitude))

The result should look like this:

In [100]:
cities.head()

Unnamed: 0,scalerank,natscale,labelrank,featurecla,name,namepar,namealt,diffascii,nameascii,adm0cap,...,rank_max,rank_min,geonameid,meganame,ls_name,ls_match,checkme,min_zoom,ne_id,geometry
0,10,1,8,Admin-1 capital,Colonia del Sacramento,,,0,Colonia del Sacramento,0.0,...,7,7,3443013.0,,,0,0,9.0,1159112629,POINT (-57.84000 -34.48000)
1,10,1,8,Admin-1 capital,Trinidad,,,0,Trinidad,0.0,...,7,7,3439749.0,,,0,0,9.0,1159112647,POINT (-56.90100 -33.54400)
2,10,1,8,Admin-1 capital,Fray Bentos,,,0,Fray Bentos,0.0,...,7,7,3442568.0,,,0,0,9.0,1159112663,POINT (-58.30400 -33.13900)
3,10,1,8,Admin-1 capital,Canelones,,,0,Canelones,0.0,...,6,6,3443413.0,,,0,0,9.0,1159112679,POINT (-56.28400 -34.53800)
4,10,1,8,Admin-1 capital,Florida,,,0,Florida,0.0,...,7,7,3442585.0,,,0,0,7.0,1159112703,POINT (-56.21500 -34.09900)


Notice that the number of columns has changed from the original DataFrame. There is one more column called `geometry` which we created.

In [105]:
cities.shape

(7343, 39)

`geopandas.points_from_xy` is a convenience function that creates an array of `shapely` Points from the coordinate columns. The result is equivalent to `[Point(x, y) for x, y in zip(df.longitude, df.latitude)]`.

In [104]:
[Point(x, y) for x, y in zip(df.longitude, df.latitude)][0:9]

[<shapely.geometry.point.Point at 0x7efde3a26690>,
 <shapely.geometry.point.Point at 0x7efde3a26750>,
 <shapely.geometry.point.Point at 0x7efde3a26710>,
 <shapely.geometry.point.Point at 0x7efde3a26810>,
 <shapely.geometry.point.Point at 0x7efde3a267d0>,
 <shapely.geometry.point.Point at 0x7efde3a26990>,
 <shapely.geometry.point.Point at 0x7efde3a269d0>,
 <shapely.geometry.point.Point at 0x7efde3a26890>,
 <shapely.geometry.point.Point at 0x7efde397ae50>]

### Constructing from `WKT` (well-known text) representation
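
One way to do this, sketched here with made-up data, is to parse each WKT string with `shapely.wkt.loads` and pass the resulting objects as the geometry. The table `df_wkt`, its column name `wkt`, and the values are hypothetical.

```python
import geopandas as gpd
import pandas as pd
from shapely import wkt

# a plain DataFrame where the geometry is stored as WKT text
df_wkt = pd.DataFrame({
    'name': ['a point', 'a line'],
    'wkt': ['POINT (1 1)', 'LINESTRING (0 0, 1 1, 2 0)'],
})

# parse each string into a shapely geometry object
geoms = [wkt.loads(s) for s in df_wkt['wkt']]

# build the GeoDataFrame from the attributes plus the parsed geometries
gdf_wkt = gpd.GeoDataFrame(df_wkt, geometry=geoms)

print(type(gdf_wkt))
print(gdf_wkt.geom_type.tolist())
```

Recent GeoPandas versions also provide `geopandas.GeoSeries.from_wkt`, which does the parsing for a whole column at once.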


## Saving or exporting data

Once we've finished creating or manipulating a GeoDataFrame in our notebooks, we may want to save that as a new output file.

## Bonus section: Big Data

For most analyses (and certainly everything we will do in this workshop) you can use the `geopandas.read_file` command to use standard `fiona` processes to read/write your files. But what about when you have some really, *really* big datasets?

Below I'm going to show you a few tips and tricks and resources that might come in handy for your research.

For starters, we won't 

### Reading in batches of records

### Writing to a SQL database
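
GeoPandas provides `GeoDataFrame.to_postgis` for writing to a PostGIS database, but that needs a running database server, which is not assumed here. As a self-contained stand-in, the sketch below stores the geometries as WKT text in an in-memory SQLite database using `pandas` and the standard `sqlite3` module, then rebuilds a GeoDataFrame from the query result. The table name `places` and the column name `geom_wkt` are hypothetical.

```python
import sqlite3

import geopandas as gpd
import pandas as pd
from shapely import wkt
from shapely.geometry import Point

src = gpd.GeoDataFrame(
    {'name': ['a', 'b']},
    geometry=[Point(1, 1), Point(2, 2)],
)

# plain table: attributes plus the geometry serialised as WKT text
table = pd.DataFrame({
    'name': src['name'],
    'geom_wkt': [g.wkt for g in src.geometry],
})

con = sqlite3.connect(':memory:')
table.to_sql('places', con, index=False)

# read the rows back and parse the WKT into geometry objects again
back = pd.read_sql('SELECT * FROM places', con)
con.close()

restored = gpd.GeoDataFrame(
    back[['name']],
    geometry=[wkt.loads(s) for s in back['geom_wkt']],
)
print(restored)
```

Storing WKT as text is simple but carries no spatial index or coordinate reference system, so a spatial database remains the better choice for real workloads.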

Future options: Apache, dask, parquet tiles