# Part 2: Working with vector data

## Introduction to vector data, shapely, and geopandas
Now that we understand raster data a bit better, we can dive into the second type of data: vector data. Vector data represents geographic data in a symbolized way: as points, lines or polygons.  
Each geometry in vector data has a specific geographic location and may also contain additional features, also called attributes (such as names, IDs, or other descriptive information).

The left image below displays the three main vector object types: points, lines and polygons. The right image demonstrates a vector representation of topographic and land use information. We see the different types of geometries containing class information: for example, a polygon represents the desert, while a point represents a house.

<div style="text-align:center">
  <img src="imgs/vector_example.png" alt="Standard geopandas dataframe format" width="500" height="300">
</div>

## Creating geometries
Unlike raster data, which stores values in a grid of pixels, vector data consists of individual objects, each with certain values or features attributed to it. Let's create our first vector objects in all three possible forms:
1. points - individual (x,y) locations
2. lines - composed of multiple vertices or connected points
3. polygons - three or more connected vertices that are *closed*

We can create these geometries with the Python library Shapely, which is often used for manipulating and analyzing geometric objects.
Let's create our first geometries!

In [None]:
# Import Shapely
from shapely.geometry import Point, Polygon, LineString

# Create a Point
point = Point(0, 0)

# Create a LineString
line = LineString([(0, 0), (1, 1), (2, 1)])

# Create a Polygon
polygon = Polygon([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])

Each of these geometries is now an *object* with its own properties. We can view these properties by simply accessing the attributes of our object.

For example, if we are interested in the coordinates corresponding to the geometry of our created objects, we can access them with the following command:

In [None]:
# get the (x, y) coordinates of our geometries
print(
    f"Point coordinates = {point.xy}\n"
    f"Line coordinates = {line.xy}\n"
    f"Polygon coordinates = {polygon.exterior.xy}"
)

Note that we need to specify `exterior` for the polygon, because a polygon can also contain holes. The coordinates of those holes can be accessed with the `interiors` attribute.
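
As a minimal sketch of the difference between `exterior` and `interiors`, the cell below builds a square with a square hole in it. The names `outer`, `hole`, and `donut` are made up for this illustration and are not used elsewhere in the notebook.

```python
from shapely.geometry import Polygon

# Outer boundary: a 4 x 4 square
outer = [(0, 0), (4, 0), (4, 4), (0, 4)]
# Inner boundary: a 2 x 2 square hole in the middle
hole = [(1, 1), (3, 1), (3, 3), (1, 3)]

# The second argument is a list of holes (a polygon can have several)
donut = Polygon(outer, holes=[hole])

# One interior ring, whose coordinates we can read like any other ring
print(f"Number of holes = {len(donut.interiors)}")
print(f"Hole coordinates = {list(donut.interiors[0].coords)}")

# The hole does not count towards the area: 16 - 4
print(f"Area = {donut.area}")
```

The exterior ring is still available through `donut.exterior.xy`, exactly as for a polygon without holes.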

We can easily visualize our newly made geometries using the `pyplot` module from `matplotlib`.

In [None]:
import matplotlib.pyplot as plt

# Create a figure and axis
fig, ax = plt.subplots(figsize=(6, 4))

# Plot the Point
x, y = point.xy
ax.plot(x, y, "ro", label="Point")
# Plot the LineString
x, y = line.xy
ax.plot(x, y, "g--", label="LineString")
# Plot the Polygon
x, y = polygon.exterior.xy
ax.plot(x, y, "b-", label="Polygon")

# Add labels and legend
ax.set_title("Geometries")
ax.legend()
plt.show()

## Working with attributes
Vector data files, such as shapefiles, geodatabases, or GeoJSON, often include not only geometric information (points, lines, polygons) but also attribute data associated with these geometries. These attributes describe additional characteristics of the geographic features, which is very useful for geospatial data analysis! This comes back to our example, where we saw that landcover maps use different kinds of geometries to represent the location of an area such as a forest, and can also store additional information about that area, such as its size, the types of trees, or the number of inhabitants.  
For our use case, these attributes will help us distinguish between the different types of forest. But more on that later!

First, let's print some general attributes and methods of our geometries. A few examples are the attributes `geom_type` and `area`, and the method `distance()`. (You can find all general and geometry-specific attributes in the [Shapely user manual](https://shapely.readthedocs.io/en/stable/manual.html#:~:text=General%20Attributes%20and%20Methods).)

In [None]:
# Check the geometry types
print(f"We created a beautiful {point.geom_type}, {line.geom_type}, and {polygon.geom_type}\n")

# Get the area of your polygon
print(f"Our polygon has a size of 1 x 1 = {polygon.area}\n")
# what would happen if you'd do this for your point?

# Get the distance between two geometries
print(f"Distance between two points = {point.distance(Point(1,1))}")
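
To build some intuition for the `area` value printed above, here is a pure-Python sketch of the shoelace formula, a standard way to compute the area of a simple polygon from its vertices. The helper name `shoelace_area` is hypothetical and not part of Shapely, and the sketch makes no claim about how Shapely computes areas internally.

```python
def shoelace_area(coords):
    """Area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]  # wrap around to close the ring
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Same unit square as our polygon (the closing vertex is implied here)
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_area = shoelace_area(square)
print(f"Shoelace area of the unit square = {square_area}")
```

The result should match `polygon.area` for the unit square we created earlier.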

### Exercise
Play around with the geometries and their attributes. Can you find the *length* of our line?

In [None]:
# Fill in your solution below
#[YOUR SOLUTION]

## Working with real geographic data
Now that we understand the principles of vector data, we can start working with real geographic data!  
We'll work with a landcover map of our area in Greece. The vector format is often used for landcover maps, because it allows each area to carry different types of information. If part of the map consists of forest, for example, we might be able to read additional attributes from the vector data, such as forest type and area. If we had used raster data instead, we would only have been able to see that the landcover type (or *landcover class*) of a certain pixel corresponds to forest.  
Being able to read this information helps us understand in which types of forest the wildfire was most severe. This improves our understanding of how wildfires spread, which is essential information for monitoring and predicting wildfires in the future!

As we discussed, vector data often comes in a _geojson_ file format. This file contains all coordinates and features we need to know, and, fortunately, the geographic version of `pandas`, `geopandas`, follows intuitive steps to load and read the data. 

In [None]:
# very similar to the pandas library
import geopandas as gpd

# load the vector data set and display the first rows
landcover_vector_file_path = "data/landcover_greece.geojson"
landcover = gpd.read_file(landcover_vector_file_path)
landcover.head()

We can recognize the standard pandas dataframe with an extra column: *geometry*. This geometry column always contains the objects with the coordinates.

<div style="text-align:center">
  <img src="imgs/geopandas_df_structure.png" alt="Standard geopandas dataframe format">
</div>
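
To make this structure concrete, the sketch below builds a tiny GeoDataFrame by hand from two Shapely geometries. The variable `toy` and the `name` values are invented for illustration (loosely following the house and desert example from the introduction) and have nothing to do with the landcover dataset.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# A regular column ("name") plus a "geometry" column holding Shapely objects
toy = gpd.GeoDataFrame(
    {
        "name": ["house", "desert"],
        "geometry": [
            Point(1, 1),
            Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),
        ],
    }
)

# Each row pairs attribute values with exactly one geometry
print(toy)
print(toy.geom_type.tolist())
```

Note that this toy example sets no coordinate reference system, so the coordinates are plain numbers rather than real-world locations.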

If you want to read more about the dataset, you can read the [documentation of the landcover dataset](https://land.copernicus.eu/en/products/corine-land-cover/clc2018).

Now let's visualize our MultiPolygons (from the *geometry* column)!  
Geopandas has a built-in method `plot()`, which makes it easy to plot all geometries from our geodataframe. It even lets us pass a column name to color the geometries by that column's categorical or continuous values.

In [None]:
# plot the geometries with the landcover_class as categorical value
landcover.plot("landcover_class", legend=True)

### Exercise
For our forest fire analysis, we are most interested in the forest area. Can you create a new dataframe with only the forested areas?

Extra time? Can you visualize the subclasses of the forest area in a plot? Try to find out what `gdf.explore()` does!

In [None]:
# Fill in your solution below
# make a copy of the dataframe
forest_df = landcover.copy()

# select the rows from your dataframe where your landcover class corresponds to 'Forest and semi natural area'
# (Hint: this is the same as for 'regular' pandas dataframes!)
#[YOUR SOLUTION]

# show your new dataframe
forest_df.head()

# plot the subclasses
#[YOUR SOLUTION]


# Hint: explore is very similar to plot()

## Writing vector data
To take our hard work to the next notebook, we can save our filtered dataframe to a file. With the built-in geopandas method `to_file()`, we can easily do this. By default, an ESRI shapefile is written, but any geodata source supported by `Fiona` can be written. We can see these drivers with:

In [None]:
# supported data types to write vector files to
import fiona
fiona.supported_drivers

Now let's store our `forest_df` in the data folder in a geojson format for consistency.

In [None]:
# store it with the other data
forest_df.to_file('data/forest_landcover.geojson', driver='GeoJSON')
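
For a rough idea of what ends up in such a file, the sketch below builds a minimal GeoJSON `FeatureCollection` by hand using only the standard library. The coordinates and the `"example class"` value are placeholders invented for illustration; the real file contains the geometries and columns of `forest_df`.

```python
import json

# A GeoJSON file is a FeatureCollection: a list of features, each with
# a geometry (the coordinates) and properties (the attribute columns)
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Polygon",
                # one ring: the exterior, closed by repeating the first vertex
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
            },
            "properties": {"landcover_class": "example class"},
        }
    ],
}

# Serialize to text and parse it back, as writing and reading a file would
text = json.dumps(feature_collection, indent=2)
parsed = json.loads(text)
first = parsed["features"][0]
print(first["geometry"]["type"], first["properties"])
```

When geopandas reads a file like this, each feature becomes one row: the `properties` become regular columns and the `geometry` becomes the *geometry* column.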

# Extra

If you want to explore vector data more, go through the sections below!

## Multi-geometries
When working with vector data, you often encounter a situation where multiple geometries, instead of a single geometry, are used for a certain area or class. As an example, we will look at a multipolygon here, but the theory is the same for lines and points.  

A **multipolygon** is a collection of polygons grouped together, where each polygon has its own set of vertices (bounds). 
Multipolygons are useful when you have complex geographic features that can't be adequately represented by a single polygon. For example, a country with islands could be represented as a multipolygon with each island as a separate polygon within it. Moreover, they help you organize your data efficiently by grouping related polygons together, avoid repetition by letting the polygons share one set of attributes, and let you work with multiple polygons as a single entity when performing vector operations.

We can easily create a multipolygon from two polygons (or even from just one):

In [None]:
# import the multipolygon class
from shapely.geometry import MultiPolygon

# create two polygons
polygon1 = Polygon([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])
polygon2 = Polygon([(2, 2), (3, 1), (4, 3), (3, 3), (2, 2)])

# combine the two polygons to a single multipolygon
multi_polygon = MultiPolygon(polygons=[polygon1, polygon2])

# plot our new creation
plt.figure(figsize=(5, 3))
for poly in multi_polygon.geoms:
    plt.plot(*poly.exterior.xy, color="blue")
plt.title("Our First MultiPolygon")
plt.grid(alpha=0.3)
plt.show()
plt.close()
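
To illustrate what "working with multiple polygons as a single entity" can look like, here is a small sketch with two unit squares. The names `island_a`, `island_b`, and `islands` are made up for this example.

```python
from shapely.geometry import MultiPolygon, Polygon

# Two separate unit squares, e.g. two small islands
island_a = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
island_b = Polygon([(2, 0), (3, 0), (3, 1), (2, 1)])
islands = MultiPolygon([island_a, island_b])

# The parts remain accessible individually...
print(f"Number of parts = {len(islands.geoms)}")

# ...but attributes are computed over the whole collection
print(f"Total area = {islands.area}")
print(f"Bounding box (minx, miny, maxx, maxy) = {islands.bounds}")
```

The total area is simply the sum of the areas of the two squares, and the bounding box spans both of them.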

## Invalid geometries

One thing to be cautious about is invalid geometries. Common causes of invalid geometries are:
1. **self-intersecting polygons**: a polygon whose boundary crosses over itself (like a figure 8) is self-intersecting and considered invalid.
2. **geometry with too few points**: a line should consist of at least two points and a polygon of at least three (so that an area can be enclosed).
3. **geometry gaps or overlaps**: if there are gaps, or if polygons overlap or touch at an infinite number of points (along a line).

We can create our own example of what this looks like:

In [None]:
# Creating a self-intersecting polygon
polygon_coords = [(0, 0), (2, 0), (1, 1), (0, 1), (2, 1)]
self_intersecting_polygon = Polygon(polygon_coords)

# Checking if it's valid
print("Is our multipolygon valid?", multi_polygon.is_valid)
print("Is our new polygon valid?", self_intersecting_polygon.is_valid)

# visualize our invalid polygon
fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(*self_intersecting_polygon.exterior.xy)
ax.set_title("Invalid geometry: self intersecting polygon")
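
Beyond checking `is_valid`, Shapely can also report why a geometry is invalid and attempt a repair. The sketch below uses `explain_validity` and `make_valid` from `shapely.validation` on a simple figure-8 ("bowtie") polygon. It assumes Shapely 1.8 or newer for `make_valid`, and the names `bowtie`, `reason`, and `repaired` are made up for this example.

```python
from shapely.geometry import Polygon
from shapely.validation import explain_validity, make_valid

# A figure-8 shape: the boundary crosses itself at (1, 1)
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print("Is the bowtie valid?", bowtie.is_valid)

# Ask Shapely for a human-readable reason
reason = explain_validity(bowtie)
print("Reason:", reason)

# Attempt a repair; the result is a new, valid geometry
repaired = make_valid(bowtie)
print("Repaired type:", repaired.geom_type)
print("Is the repaired geometry valid?", repaired.is_valid)
```

A repair like this can change the geometry type or shape, so it is worth inspecting the result before using it in further analysis.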