# Creating and Packaging GeoDataFrame from Data

In the [previous notebook](https://www.kaggle.com/amerii/spacenet-7-metadata-extraction/) we dealt with the raw data labels, which were in CSV format. In this notebook we are going to extract metadata for the remaining files in the other directories. This should make the data easier to access, summarize and explore.

Before we begin, here are some notes on the filenames:

- The format of a filename (as defined for the footprint definition CSV file) is:
  `global_monthly_<time>_mosaic_<AOI-name>_<file_type>`,
  for example:
  `global_monthly_2018_02_mosaic_L15-0369E-1244N_1479_3214_13_UDM`

- `<time>` is a timestamp in `YYYY_MM` format that represents when image collection happened.
 
-  `<AOI-name>` is a unique identifier of a location. All AOI-names are 28 characters long.

-  All ids (filenames and AOI names) are case sensitive.

- `<file_type>` is either `Buildings` or `UDM`.
    Note that the files in the `images` and `images_masked` directories do not have a `<file_type>`.

-  Image data is stored in files named `<filename>.tif` in the `images`, `images_masked` and `UDM_masks` folders. <br>
    *Note: Not all directories have a `UDM_masks` folder*

-  Vector data (Building Labels and UDM Labels) is stored in files named `<filename>.geojson` in the `labels`, `labels_match` and `labels_match_pix` directories

Note: AOI stands for Area of Interest
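
To make the naming scheme concrete, here is a minimal sketch that splits a filename (without its extension) into the parts described above. The helper name `parse_filename` is hypothetical and not part of the dataset tooling; it relies on the fixed 28-character AOI name mentioned in the list.

```python
def parse_filename(stem):
    """Split a file stem (no extension) into year, month, AOI name and file type."""
    prefix, sep, rest = stem.partition('_mosaic_')
    if not sep or not prefix.startswith('global_monthly_'):
        raise ValueError(f'unexpected filename: {stem!r}')
    time = prefix[len('global_monthly_'):]       # 'YYYY_MM'
    year, month = time.split('_')
    aoi_name = rest[:28]                         # AOI names are 28 characters long
    # Whatever follows the AOI name is the optional '_<file_type>' suffix
    file_type = rest[28:].lstrip('_') or None
    return {'year': year, 'month': month, 'aoi_name': aoi_name, 'file_type': file_type}


example = parse_filename('global_monthly_2018_02_mosaic_L15-0369E-1244N_1479_3214_13_UDM')
print(example)
```

For a file from the `images` or `images_masked` directories, which has no suffix, `file_type` comes back as `None`.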

## Import Dependencies

In [None]:
import pandas as pd 
import re
from pathlib import Path
import shapely
import geopandas as gpd
import matplotlib as mpl
import matplotlib.pyplot as plt
from tqdm.notebook import tqdm
from glob import glob
tqdm.pandas();

## Define Directory Paths
### Input Directories

In [None]:
train_dir = Path('../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train')
test_dir = Path('../input/spacenet-7-multitemporal-urban-development/SN7_buildings_test_public')
sample_dir = Path('../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train_sample')

### Output Paths

In [None]:
output_path = Path.cwd()
output_csv_path = output_path/'output_csvs/'
Path(output_csv_path).mkdir(parents=True, exist_ok=True)

### Extract Paths and Metadata

Now that we have set up the input and output directories, we can use the functions below to extract our desired metadata.

The functions below will be used to extract the following metadata from a list of paths:
* Complete Path String: <br>`../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train/train/L15-0358E-1220N_1433_3310_13/images/global_monthly_2018_02_mosaic_L15-0358E-1220N_1433_3310_13.tif`
* Mid Path String: <br>`L15-1210E-1025N_4840_4088_13/labels_match/global_monthly_2018_01_mosaic_L15-1210E-1025N_4840_4088_13_Buildings.geojson`
* Unique File Name: `L15-0361E-1300N_1446_2989_13`
* Directory Name: name of the directory containing the file: `UDM_masks`, `images`, `images_masked`, `labels`, `labels_match` or `labels_match_pix`
* Year: year in which the image was taken
* Month: month in which the image was taken
* Data Type: `Buildings` or `UDM`
* File Extension: `.geojson` or `.tif`

In [None]:
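def extract_metadata_from_string(string):
    # Extracted groups, in order:
    # parent_dir - image_dir_name - sub_dir_name - fname - year - month - data_type - extension
    # data_type is '' when the filename has no `_<file_type>` suffix
    pattern = r'/(t.+|sample)/(L.+)/(\w+)/(.+_(\d+)_(\d+)_m.+_\d+_\d+_\d+)(?:_(\w+))?\.(\w+)'
    match = re.findall(pattern=pattern, string=string)
    if not match:
        raise ValueError(f'Path does not follow the expected layout: {string}')
    return match[0]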

In [None]:
string1 = '../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train_sample/sample/L15-0506E-1204N_2027_3374_13/UDM_masks/global_monthly_2019_11_mosaic_L15-0506E-1204N_2027_3374_13_UDM.tif'
string2 = '../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train_sample/sample/L15-0506E-1204N_2027_3374_13/UDM_masks/global_monthly_2019_11_mosaic_L15-0506E-1204N_2027_3374_13.tif'

In [None]:
extract_metadata_from_string(string1)

In [None]:
extract_metadata_from_string(string2)
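
The tuples returned above are positional, so it is easy to mix up the fields. As an illustration only, the sketch below uses the same pattern with named groups and returns a dictionary instead. The function name `extract_metadata_as_dict` is hypothetical, and the sketch escapes the dot before the extension so that it matches a literal `.`.

```python
import re

# Same structure as the pattern used in this notebook, with a name for each group
PATTERN = re.compile(
    r'/(?P<parent_dir>t.+|sample)'
    r'/(?P<image_dir_name>L.+)'
    r'/(?P<sub_dir_name>\w+)'
    r'/(?P<fname>.+_(?P<year>\d+)_(?P<month>\d+)_m.+_\d+_\d+_\d+)'
    r'(?:_(?P<data_type>\w+))?'
    r'\.(?P<extension>\w+)'
)


def extract_metadata_as_dict(string):
    match = PATTERN.search(string)
    if match is None:
        raise ValueError(f'Path does not follow the expected layout: {string}')
    # Unmatched optional groups are None; use '' to mirror what re.findall returns
    return {key: (value or '') for key, value in match.groupdict().items()}


base = ('../input/spacenet-7-multitemporal-urban-development/SN7_buildings_train_sample/'
        'sample/L15-0506E-1204N_2027_3374_13/UDM_masks/'
        'global_monthly_2019_11_mosaic_L15-0506E-1204N_2027_3374_13')
with_type = extract_metadata_as_dict(base + '_UDM.tif')
without_type = extract_metadata_as_dict(base + '.tif')
print(with_type)
print(without_type)
```

The second call shows the case discussed later in this notebook: a filename without a `<file_type>` suffix yields an empty `data_type`.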

In [None]:
def extract_list_of_paths(directory):
    # Recursively collect every entry whose name contains a dot (i.e. has an extension)
    paths_list = list(directory.glob('**/*.*'))
    return paths_list
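
To see what the `**/*.*` pattern picks up, here is a small self-contained sketch on a temporary directory. The directory layout and the AOI name are made up for the demonstration, and the `is_file()` filter is an extra safeguard that the function above does not apply.

```python
import tempfile
from pathlib import Path


def list_files(directory):
    """Recursively list entries whose name contains a dot, keeping only regular files."""
    return sorted(p for p in Path(directory).glob('**/*.*') if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    # Made-up AOI directory, only used to mimic the dataset layout
    aoi_dir = Path(tmp) / 'train' / 'L15-0000E-0000N_0000_0000_13' / 'images'
    aoi_dir.mkdir(parents=True)
    (aoi_dir / 'a.tif').touch()
    (aoi_dir / 'b.geojson').touch()
    (aoi_dir / 'README').touch()      # no dot in the name, so '*.*' skips it
    found_names = [p.name for p in list_files(tmp)]

print(found_names)
```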

In [None]:
def extract_metadata_from_list_of_paths(list_of_paths):
    d_keys = ['parent_dir','image_dir_name','sub_dir_name','fname','year','month','data_type','extension']
    d = {key:[] for key in d_keys}
    d['full_path'] = []
    for path in list_of_paths:
        metadata = extract_metadata_from_string(str(path))
        d['full_path'].append(path)
        
        for i,data in enumerate(metadata):
            d[d_keys[i]].append(data)
    return d

### Extracting the Metadata
The cells below first collect the paths of all files inside each input directory, then extract the metadata from each path and load it into a dataframe.

In [None]:
train_paths = extract_list_of_paths(directory=train_dir)
test_paths = extract_list_of_paths(directory=test_dir)
sample_paths = extract_list_of_paths(directory=sample_dir)

In [None]:
train_metadata_dict = extract_metadata_from_list_of_paths(train_paths)
test_metadata_dict = extract_metadata_from_list_of_paths(test_paths)
sample_metadata_dict = extract_metadata_from_list_of_paths(sample_paths)

In [None]:
df_train = pd.DataFrame(train_metadata_dict)
df_test = pd.DataFrame(test_metadata_dict)
df_sample = pd.DataFrame(sample_metadata_dict)

In [None]:
df_train

In [None]:
df_test

In [None]:
df_sample

In [None]:
df_train[df_train['data_type'] == 'Buildings']['extension'].value_counts()

In [None]:
df_train[df_train['data_type'] == 'UDM']['extension'].value_counts()

In [None]:
df_train[df_train['data_type'] == '']['extension'].value_counts()
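
The three counts above can also be viewed in a single table. The sketch below uses `pd.crosstab` on a small made-up frame (the values are illustrative, not taken from the dataset) to count the extensions per data type.

```python
import pandas as pd

# Toy data standing in for the `data_type` and `extension` columns
toy = pd.DataFrame({
    'data_type': ['Buildings', 'Buildings', 'UDM', 'UDM', '', ''],
    'extension': ['geojson', 'geojson', 'geojson', 'tif', 'tif', 'tif'],
})

# Rows are data types, columns are extensions, cells are counts
summary = pd.crosstab(toy['data_type'], toy['extension'])
print(summary)
```

On the real data, the same call would be `pd.crosstab(df_train['data_type'], df_train['extension'])`.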

As you may have noticed, some rows have a data type of `''`. As mentioned earlier, the image files (the `.tif` files in the `images` and `images_masked` directories) do not have a `<file_type>` suffix at the end of their name.

To remedy this, we will simply replace the `''` values with `Images`.

In [None]:
df_train.loc[df_train['data_type'] =='','data_type'] = 'Images'
df_test.loc[df_test['data_type'] =='','data_type'] = 'Images'
df_sample.loc[df_sample['data_type'] =='','data_type'] = 'Images'

In [None]:
df_train

In [None]:
df_test

In [None]:
df_sample

Finally, let's create a function that automates all the steps above.

In [None]:
def get_metadata(input_dir):
    list_of_paths = extract_list_of_paths(input_dir)
    metadata_dict = extract_metadata_from_list_of_paths(list_of_paths)
    df = pd.DataFrame(metadata_dict)
    
    df.loc[df['data_type'] =='','data_type'] = 'Images'
    

    # Identify images that have UDM masks
    condition = (df['sub_dir_name'] == 'UDM_masks')
    # Collect the unique file names that have UDMs (a set gives fast membership tests)
    udm_fnames = set(df.loc[condition, 'fname'])
    # Flag every row whose file name has a UDM
    udm_mask = df['fname'].progress_map(lambda x: x in udm_fnames)
    # Initialize has_udm column
    df['has_udm'] = False
    # Apply mask and update udm value
    df.loc[udm_mask, 'has_udm'] = True

    return df
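
The `has_udm` step is the least obvious part of the function, so here is a minimal sketch of the same idea on a made-up frame. It uses `Series.isin` as a vectorized alternative to the row-by-row lookup; the file names are placeholders.

```python
import pandas as pd

# Toy frame: 'img_a' appears in UDM_masks, 'img_b' does not
toy = pd.DataFrame({
    'fname': ['img_a', 'img_a', 'img_b'],
    'sub_dir_name': ['images', 'UDM_masks', 'images'],
})

# File names that have at least one entry in the UDM_masks directory
udm_fnames = set(toy.loc[toy['sub_dir_name'] == 'UDM_masks', 'fname'])
# Every row sharing one of those file names is flagged, whatever its directory
toy['has_udm'] = toy['fname'].isin(udm_fnames)
print(toy)
```

Both rows of `img_a` are flagged, including the one in `images`, which is what lets us later ask "does this image have a UDM mask?" from any row.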

## Saving the Outputs
Next we are going to save the output dataframes as CSV files. We are going to have 4 CSVs in total:
* CSV for the train dataframe
* CSV for the test dataframe
* CSV for the sample dataframe
* CSV for the concatenated train, test and sample dataframes

In [None]:
df_train = get_metadata(train_dir)
df_test = get_metadata(test_dir)
df_sample = get_metadata(sample_dir)
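# ignore_index=True renumbers the rows without adding a leftover `index` column to the saved CSV
df_concat = pd.concat([df_train, df_test, df_sample], ignore_index=True)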

In [None]:
df_train.to_csv(output_csv_path/'df_train.csv',index=False)
df_test.to_csv(output_csv_path/'df_test.csv',index=False)
df_sample.to_csv(output_csv_path/'df_sample.csv',index=False)
df_concat.to_csv(output_csv_path/'df_concat.csv',index=False)

In [None]:
df_train.head()

Let's make sure that the output is saved

In [None]:
!ls ./output_csvs
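
The `!ls` shell command only works inside a notebook. As a portable alternative, the sketch below lists the CSV files with `pathlib`. The helper name `list_saved_csvs` is hypothetical, and the demo runs on a temporary directory rather than on the real output folder.

```python
import tempfile
from pathlib import Path


def list_saved_csvs(directory):
    """Return the sorted names of the CSV files directly inside `directory`."""
    return sorted(p.name for p in Path(directory).glob('*.csv'))


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'df_train.csv').write_text('fname\n')
    (Path(tmp) / 'notes.txt').write_text('not a csv\n')
    saved = list_saved_csvs(tmp)

print(saved)
```

In this notebook the call would be `list_saved_csvs(output_csv_path)`.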

We have now saved the first version of our CSVs, which are in a format known as tidy data: each row describes a single file. This format makes it easy to analyse our metadata. Next we are going to reshape our dataframe so that it is easier to create a dataset class from it in PyTorch.

We are going to do this by adding a column for each of the image and label paths.

In [None]:
def untidy_df(df):
    
    parent_dir = df['parent_dir']
    im_dir_name = df['image_dir_name']
    fname = df['fname']
    year = df['year']
    month = df['month']
    has_udm = df['has_udm']
    
    images_masked = im_dir_name + '/images_masked/' + fname + '.tif'
    
    if parent_dir == 'test_public':
        # Only the `images_masked` path is built for the test set
        images = None
        labels_buildings = None
        labels_udm = None
        labels_match = None
        labels_match_pix = None
        udm_masks = None
    else:
        if has_udm:
            udm_masks = im_dir_name + '/UDM_masks/' + fname + '.tif'
        else:
            udm_masks = None


        images = im_dir_name + '/images/' + fname + '.tif'
        labels_buildings = im_dir_name + '/labels/' + fname + '_Buildings.geojson'
        labels_udm = im_dir_name + '/labels/' + fname + '_UDM.geojson'
        labels_match = im_dir_name + '/labels_match/' + fname + '_Buildings.geojson'
        labels_match_pix = im_dir_name + '/labels_match_pix/' + fname + '_Buildings.geojson'

    keys = ['parent_dir','image_dir_name','fname','year','month','has_udm','udm_masks','images','images_masked','labels_buildings','labels_udm','labels_match','labels_match_pix']
    values = [parent_dir,im_dir_name,fname,year,month,has_udm,udm_masks,images,images_masked,labels_buildings,labels_udm,labels_match,labels_match_pix]
    
    return {k:v for (k,v) in zip(keys,values)}

In [None]:
def get_untidy_frame(df):
    # apply function on each row of the input dataframe
    list_of_dicts = df.progress_apply(untidy_df, axis=1)
    # all files sharing a file name yield the same record, so drop the duplicated rows
    untidy_frame = pd.DataFrame.from_records(list(list_of_dicts)).drop_duplicates()
    # bask in all the glory of your untidy frame ;D
    return untidy_frame

In [None]:
test_untidy_df = get_untidy_frame(df_test)
train_untidy_df = get_untidy_frame(df_train)
sample_untidy_df = get_untidy_frame(df_sample)
concat_untidy_df = get_untidy_frame(df_concat)

In [None]:
concat_untidy_df

The format above will make it easier for us to create our custom PyTorch dataset class: each row gives access to every image and geojson file of one AOI for a given month and year.
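
To show why this layout is convenient, here is a rough sketch of a map-style dataset built on top of an untidy frame. It is an illustration under stated assumptions, not the class used in the later notebooks: the class name `UntidyFrameDataset` and the default column choices are hypothetical, and a real implementation would subclass `torch.utils.data.Dataset` and load the rasters and labels instead of returning paths.

```python
import pandas as pd


class UntidyFrameDataset:
    """Map-style dataset over the rows of an untidy frame (returns paths only)."""

    def __init__(self, frame, image_col='images_masked', label_col='labels_match_pix'):
        # Renumber the rows so that positional indexing matches 0..len-1
        self.frame = frame.reset_index(drop=True)
        self.image_col = image_col
        self.label_col = label_col

    def __len__(self):
        return len(self.frame)

    def __getitem__(self, idx):
        row = self.frame.iloc[idx]
        return row[self.image_col], row[self.label_col]


# Made-up paths that follow the column layout of the untidy frame
toy = pd.DataFrame({
    'images_masked': ['aoi_1/images_masked/f1.tif', 'aoi_1/images_masked/f2.tif'],
    'labels_match_pix': ['aoi_1/labels_match_pix/f1_Buildings.geojson',
                         'aoi_1/labels_match_pix/f2_Buildings.geojson'],
})
dataset = UntidyFrameDataset(toy)
first_image, first_label = dataset[0]
print(len(dataset), first_image, first_label)
```

Because each row already holds every path for one image, `__getitem__` needs no extra lookups or joins.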

## Saving the Untidy Dataframes

In [None]:
train_untidy_df.to_csv(output_csv_path/'df_train_untidy.csv',index=False)
test_untidy_df.to_csv(output_csv_path/'df_test_untidy.csv',index=False)
sample_untidy_df.to_csv(output_csv_path/'df_sample_untidy.csv',index=False)
concat_untidy_df.to_csv(output_csv_path/'df_concat_untidy.csv',index=False)

In [None]:
!ls ./output_csvs

# What Next?
Next, [in our next notebook](https://www.kaggle.com/amerii/spacenet-7-helper-functions), we are going to create a set of helper functions that will make navigating, visualizing and understanding our dataset much easier!