# Creating and Packaging GeoDataFrame from CSV

In this notebook we load the CSV file containing the building labels and extract metadata from the file names (year, month, and location id) that we may later be able to use for data augmentation.

## Import Dependencies

In [None]:
import pandas as pd 
import re
from pathlib import Path
import shapely.wkt  # import the submodule explicitly; shapely.wkt.loads is used later
import geopandas as gpd
from tqdm.notebook import tqdm
tqdm.pandas();

## Define Directory Paths

In [None]:
sample_path = Path('../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train_sample/sample/L15-0506E-1204N_2027_3374_13')
label_csv_path = Path('../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train_csvs/csvs/sn7_train_ground_truth_pix.csv')
output_path = Path.cwd()
output_csv_path = output_path/'output_csvs/'
Path(output_csv_path).mkdir(parents=True, exist_ok=True)

## Read CSV into pandas DataFrame

In [None]:
df = pd.read_csv(label_csv_path)

In [None]:
df.head()

### Extract File Metadata
#### Notes on file names

- The format of a filename (as defined above for the footprint definition CSV file) is
  `global_monthly_<time>_mosaic_<AOI-name>`, for example
  `global_monthly_2018_02_mosaic_L15-0369E-1244N_1479_3214_13`.

- `<time>` is a timestamp in YYYY_MM format that represents when image collection happened.

- `<AOI-name>` is a unique identifier of a location. All AOI-names are 28 characters long.

- All ids (filenames and AOI names) are case sensitive.

- Image data is stored in files named `<filename>.tif` in the images, images_masked and UDM_masks folders.

The function below will use regex to extract the year and month from an input string.

In [None]:
def extract_date(string):
    # The first two runs of digits in the filename are the year and the month.
    pattern = r'(\d+)'
    match = re.findall(pattern=pattern, string=string)
    return (match[0], match[1])

The function below will extract the unique file id from the input string.

In [None]:
def extract_file_id(string):
    # Capture everything after the first '_L', i.e. the AOI name that follows 'mosaic_'.
    pattern = r'_(L.+)'
    match = re.findall(pattern=pattern, string=string)
    return match[0]
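
As a cross-check on the two helpers above, the whole filename can also be parsed with a single anchored regular expression. This is only a sketch: the names `FILENAME_RE` and `parse_filename` are illustrative and not part of the dataset tooling, and the pattern assumes every filename follows the `global_monthly_<time>_mosaic_<AOI-name>` layout described in the notes.

```python
import re

# Hypothetical pattern mirroring the documented layout:
# global_monthly_<YYYY>_<MM>_mosaic_<AOI-name>
FILENAME_RE = re.compile(
    r'^global_monthly_(?P<year>\d{4})_(?P<month>\d{2})_mosaic_(?P<file_id>.+)$'
)

def parse_filename(filename):
    """Return (year, month, file_id) as strings, or raise ValueError."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f'unexpected filename format: {filename!r}')
    return match['year'], match['month'], match['file_id']

year, month, file_id = parse_filename(
    'global_monthly_2018_02_mosaic_L15-0369E-1244N_1479_3214_13'
)
print(year, month, file_id)
```

Unlike the simpler helpers, this version fails loudly on a filename that does not match the expected layout instead of silently returning the wrong pieces.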

We then map the `extract_date` function over the column containing the filename strings in our dataframe and store the results in two new columns: one for the year and one for the month.

In [None]:
df['year'],df['month'] = zip(*df['filename'].progress_map(extract_date))

We do the same for extracting the file_id from the filename string.

In [None]:
df['file_id'] = df['filename'].progress_map(extract_file_id)

Let's have a look at the output we have so far:

In [None]:
df.head()

Great, looking great!

## About our Data

Now that we have extracted the metadata, let's try and further understand the contents of the dataframe. 

Our dataframe contains information about the polygons that appear in our satellite imagery data. The dataset consists of satellite imagery for specific regions around the world, and each region is imaged every month. The images are then annotated by professional annotators, and the geolocation of each building polygon is stored in a format known as Well Known Text (WKT).

Since this dataset is about change detection in satellite imagery, we expect the number of polygons to increase from year to year: the imaged areas are being built up, so more buildings should appear over time.

Let's have a look below.

In [None]:
df['year'].value_counts()

In [None]:
df['month'].value_counts()
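
Note that `value_counts()` orders its result by frequency, not chronologically. To see how the polygon count develops over time, one option is to group by year and month together. The sketch below uses a small made-up dataframe (`toy`) in place of the real one; because the extracted year and month are zero-padded strings, sorting them as text also sorts them in time order.

```python
import pandas as pd

# Toy stand-in for the real dataframe: one row per building polygon.
toy = pd.DataFrame({
    'year':  ['2018', '2018', '2018', '2019', '2019', '2019'],
    'month': ['01',   '02',   '02',   '01',   '01',   '01'],
})

# size() counts rows per group; groupby sorts the keys, so the result
# is ordered by (year, month) rather than by frequency.
polygons_per_month = toy.groupby(['year', 'month']).size()
print(polygons_per_month)
```

On the real data the same call would be `df.groupby(['year', 'month']).size()`.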

### Convert Well Known Text (WKT) format to shapely polygons

Currently the WKT is stored in the geometry column of our dataframe, formatted as a string.

In order to get the most out of our polygons, we need to convert them from WKT strings to shapely polygons.

We will also convert our dataframe into a GeoDataFrame, which is a dataframe with built-in support for spatial data and polygons. It includes methods that make it easier to analyze our vector data while maintaining our georeferencing.

In [None]:
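# Copy df so that the original keeps its WKT strings for the comparison below.
gdf = gpd.GeoDataFrame(df.copy())
# Parse each WKT string into a shapely geometry object.
gdf['geometry'] = gdf['geometry'].progress_map(shapely.wkt.loads)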

In [None]:
print('data type of geometry column before conversion: ', df['geometry'].dtype)
print('data type of geometry column after conversion: ', gdf['geometry'].dtype)

Let's compare how a WKT string and a shapely polygon are displayed.

In [None]:
display(df['geometry'][0])
print(gdf['geometry'][0])
display(gdf['geometry'][1])
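
To make the benefit of the conversion concrete, here is a minimal sketch on a made-up rectangle (the coordinates are for illustration only and do not come from the dataset). Once parsed, the geometry exposes properties such as its area and bounding box, which a plain string cannot provide.

```python
import shapely.wkt

# Toy rectangle in WKT; the coordinates are made up for illustration.
wkt_string = 'POLYGON ((0 0, 4 0, 4 3, 0 3, 0 0))'
polygon = shapely.wkt.loads(wkt_string)

print(type(wkt_string).__name__)  # a plain Python string
print(polygon.geom_type)          # the parsed geometry type
print(polygon.area)               # area in squared coordinate units
print(polygon.bounds)             # (minx, miny, maxx, maxy)
```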

### Set the Coordinate Reference System (crs) of our GeoDataFrame

According to the dataset, the CRS of our polygons is EPSG:4326, but what does that actually mean?

A coordinate reference system (CRS) defines how the coordinates of our geometries relate to locations on the Earth, including the map projection used, and makes it possible to transform between different spatial reference systems. EPSG:4326 is the code for WGS 84, which expresses positions as longitude and latitude in degrees. So by defining our CRS we are able to accurately plot our polygons on a map.

Note that if we want to plot our polygons on top of an existing basemap, both the basemap and the polygons must use the same CRS.

In [None]:
gdf.crs = 'EPSG:4326'

In [None]:
gdf.crs
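
Assigning a CRS only labels the coordinates; it does not check or transform them. Longitude must lie within [-180, 180] and latitude within [-90, 90], so a quick plausibility check on the bounding box can catch a mismatch. The label file used here is named `sn7_train_ground_truth_pix.csv`, which may indicate pixel rather than geographic coordinates, so the check seems worthwhile. The helper `looks_like_lon_lat` and the bounds below are illustrative assumptions; on the real data the bounds would come from `gdf.total_bounds`.

```python
def looks_like_lon_lat(bounds):
    """Return True if (minx, miny, maxx, maxy) fits inside the lon/lat range."""
    minx, miny, maxx, maxy = bounds
    return -180 <= minx <= maxx <= 180 and -90 <= miny <= maxy <= 90

# Made-up bounds for illustration; on the real data use gdf.total_bounds.
geographic_bounds = (5.06, 12.04, 5.07, 12.05)
pixel_like_bounds = (0.0, 0.0, 1024.0, 1024.0)

print(looks_like_lon_lat(geographic_bounds))
print(looks_like_lon_lat(pixel_like_bounds))
```

If the check fails on the real data, the polygons are probably not in EPSG:4326 and the CRS assignment should be revisited before plotting on a basemap.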

## Saving our GeoDataFrames
Finally, let's save our GeoDataFrames in a format known as GeoPackage, which makes the data easy to reload later on.

In [None]:
gdf.to_file(output_csv_path/"global_geodataframe.gpkg", driver="GPKG")

We will also extract the subset of our labels that corresponds to the SpaceNet 7 building training sample images.

In [None]:
sample_image_id = 'L15-0506E-1204N_2027_3374_13'

In [None]:
sample_gdf = gdf[gdf['file_id'] == sample_image_id].copy()

In [None]:
sample_gdf['id'].value_counts()

In [None]:
sample_gdf.to_file(output_csv_path/"sample_geodataframe.gpkg", driver="GPKG")