## Import Libraries

In [None]:
from pathlib import Path
import geopandas as gpd
import matplotlib.pyplot as plt
import matplotlib as mpl
import itertools
from tqdm.notebook import tqdm

In [None]:
mpl.rcParams['figure.dpi'] = 1000

In [None]:
%matplotlib inline

## Configure Paths to Data

In [None]:
spacenet_path = Path('../input/spacenet-7-multitemporal-urban-development')
metadata_path = Path('../input/spacenet-7-metadata-extraction')

sample_csv_path = metadata_path/'output_csvs/sample_geodataframe.gpkg'

## Read & Explore the GeoPackage

In [None]:
gdf = gpd.read_file(sample_csv_path)

In [None]:
gdf.head()

### Group Data by Filename, Year and Month

In [None]:
gdf_group = gdf.groupby(['filename','year','month'])

### Extract Each Group to its own DataFrame

In [None]:
# Iterating over the groupby yields (group_key, sub-dataframe) pairs;
# re-index every group from 0 so positional lookups behave predictably
dfs = [df.reset_index(drop=True) for _, df in gdf_group]

Let's check out the length of our list below:

In [None]:
len(dfs)

Most of the data was collected monthly over a period of two years, so a list of 24 DataFrames (one per month) matches our expectations.
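To see what the grouping step produces, here is a minimal sketch on a toy table using plain pandas. The file names and the `building_id` column are made up for illustration and are not taken from the SpaceNet data; only the `filename`, `year` and `month` column names mirror the notebook.

```python
import pandas as pd

# Toy stand-in for the label table: two "images", with 2 and 3 footprints
toy = pd.DataFrame({
    "filename": ["a_2018_01", "a_2018_01", "a_2018_02", "a_2018_02", "a_2018_02"],
    "year": ["2018"] * 5,
    "month": ["01", "01", "02", "02", "02"],
    "building_id": [1, 2, 1, 2, 3],  # hypothetical column
})

grouped = toy.groupby(["filename", "year", "month"])

# One dataframe per group, each re-indexed from 0
frames = [frame.reset_index(drop=True) for _, frame in grouped]

# Row count per group key, handy as a quick sanity check
sizes = {key: len(frame) for key, frame in grouped}
print(len(frames), sizes)
```

The number of resulting dataframes equals `grouped.ngroups`, which is the quantity compared against the expected 24 in the real data.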

In [None]:
df1 = dfs[0]
df1.head()

In [None]:
df24 = dfs[23]
df24.head()

Let's plot one of our dataframes and see what kind of output we get. To do that, we loop through all of the shapes and plot each outline on the same axes, as shown below.

In [None]:
f, ax = plt.subplots(1, figsize=(12, 12))

for geom in df1['geometry']:
    ax.plot(*geom.exterior.xy)
plt.show()

Let's make a function so that we can automate the plotting of our geodataframes.

In [None]:
def plot_gdf(gdf,show_plot=True):
    f, ax = plt.subplots(1, figsize=(12, 12))

    for geom in gdf['geometry']:
        ax.plot(*geom.exterior.xy)
    if show_plot:
        plt.show()

One interesting thing to do is to get the difference in the labels between any two months. The function below returns the difference between two of our geodataframes.

When `get_dates` is `True`, the function returns a tuple consisting of the difference dataframe and the dates of the two dataframes being compared, with the date of the dataframe that has fewer rows first: `(difference_dataframe, (shorter_df_date, longer_df_date))`.

**Note that the function below only works with the way our data is formatted and will not work for an arbitrary pair of geodataframes.**

In [None]:
def dataframe_difference(df1, df2, get_dates=False):
    # Re-index copies from 0 so positional slicing works and the inputs stay untouched
    df1 = df1.reset_index(drop=True)
    df2 = df2.reset_index(drop=True)

    # "year month" label of each dataframe, taken from its first row
    date_info = [f"{df['year'][0]} {df['month'][0]}" for df in (df1, df2)]

    len_1 = len(df1)
    len_2 = len(df2)
    len_diff = abs(len_2 - len_1)

    # The extra rows at the end of the longer dataframe are treated as the difference
    if len_2 > len_1:
        diff_df = df2[len_2 - len_diff:].copy()
        date_info = (date_info[0], date_info[1])
    else:
        diff_df = df1[len_1 - len_diff:].copy()
        date_info = (date_info[1], date_info[0])

    # Re-index the result from 0 and return the dataframe itself in both cases
    diff_df = diff_df.reset_index(drop=True)
    if get_dates:
        return diff_df, date_info
    return diff_df
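The caveat about data formatting is worth illustrating. As far as the code suggests, the function treats the extra rows at the end of the longer dataframe as the new labels, which is only valid if later months keep the earlier footprints in the same order and append new ones at the end. Below is a minimal sketch of that idea on plain Python lists; `tail_difference` and the building names are hypothetical and not part of the notebook.

```python
def tail_difference(labels_a, labels_b):
    """Return the items of the longer list that lie beyond the shorter list's length."""
    # sorted() is stable, so lists of equal length keep their input order
    shorter, longer = sorted((labels_a, labels_b), key=len)
    return longer[len(shorter):]

# Hypothetical labels: March repeats January's footprints and appends two new ones
january = ["b1", "b2", "b3"]
march = ["b1", "b2", "b3", "b4", "b5"]
appended = tail_difference(january, march)

# Same footprints stored in a different order: the tail is no longer "what is new"
march_shuffled = ["b4", "b1", "b5", "b2", "b3"]
misleading = tail_difference(january, march_shuffled)

print(appended)
print(misleading)
```

In the second case the result contains `b2` and `b3`, which already existed in January, so tail slicing should not be applied to data whose row order is not guaranteed.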

We can extend the function above to augment our data by computing the difference for every possible pair of dataframes.

For example, given a list of dataframes `[df1, df2, df3, df4]`, the function returns the list `[diff_df1_df2, diff_df1_df3, diff_df1_df4, diff_df2_df3, diff_df2_df4, diff_df3_df4]`.

In [None]:
def get_all_difference_combinations(dfs,get_dates=False):
    diff_dfs = []
    combinations = itertools.combinations(dfs,2)
    
    for comb in combinations:
        diff_dfs.append(dataframe_difference(*comb,get_dates=get_dates))
                        
    return diff_dfs
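To make the pairing order concrete, here is a small sketch that uses placeholder strings instead of dataframes. The `diff_...` labels are hypothetical names that mirror the four-dataframe example; `math.comb` gives the number of pairs without building them.

```python
import itertools
import math

names = ["df1", "df2", "df3", "df4"]

# combinations() yields each unordered pair once, following the input order
pair_labels = [f"diff_{a}_{b}" for a, b in itertools.combinations(names, 2)]
print(pair_labels)

# The number of pairs for n items is n * (n - 1) / 2
n_pairs_small = math.comb(len(names), 2)
n_pairs_full = math.comb(24, 2)  # 24 monthly dataframes
print(n_pairs_small, n_pairs_full)
```

Because each pair appears only once, `(df1, df2)` is produced but `(df2, df1)` is not, which is why four dataframes give six differences rather than twelve.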

In [None]:
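def plot_all_dfs(dfs, ncols=4, dates_exist=False):
    n_dfs = len(dfs)
    nrows = (n_dfs - 1) // ncols + 1  # ceiling division
    width = ncols * 10
    height = nrows * 8.5
    # squeeze=False keeps axs two-dimensional even when there is a single row
    fig, axs = plt.subplots(nrows=nrows, ncols=ncols,
                            figsize=(width, height), squeeze=False)
    # axs.flat walks the grid in row-major order: k = row * ncols + col
    for k, ax in enumerate(axs.flat):
        if k >= n_dfs:
            ax.axis('off')  # hide unused axes in the last row
            continue
        # Each item is (dataframe, dates) when dates_exist, otherwise a bare dataframe
        df, dates = dfs[k] if dates_exist else (dfs[k], None)
        if dates_exist:
            ax.set_title(f'difference between {dates[0]} and {dates[1]}')
        for geom in df['geometry']:
            ax.plot(*geom.exterior.xy)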
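The grid arithmetic behind a multi-panel figure like this can be sketched without matplotlib: the number of rows is the ceiling of `n_items / ncols`, and the item shown in row `i`, column `j` is the one at flat index `i * ncols + j`. The helper name `grid_positions` is hypothetical.

```python
def grid_positions(n_items, ncols=4):
    """Map each flat item index to its (row, col) cell in a row-major grid."""
    nrows = (n_items - 1) // ncols + 1  # ceiling division for n_items >= 1
    cells = [divmod(k, ncols) for k in range(n_items)]  # k = row * ncols + col
    return nrows, cells

nrows, cells = grid_positions(6, ncols=4)
print(nrows)
print(cells)
```

With 6 items and 4 columns the second row is only half full, so a plotting loop has to skip or hide the two unused axes.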

In [None]:
diff_dfs = get_all_difference_combinations(dfs,get_dates=True)

In [None]:
diff_dfs[0][1]

In [None]:
plot_all_dfs(diff_dfs[0:16],dates_exist=True)

In [None]:
plot_all_dfs(diff_dfs[100:116],dates_exist=True)

## Conclusion

We started out with 24 dataframes, each labeling the building footprints for its corresponding satellite image. We created a new set of labels based on the difference between each pair of images and stored them in their own dataframes. In conclusion, instead of having 24 output labels for 24 satellite images, we now have 276 output labels (the number of ways to choose 2 images out of 24), obtained by taking every pair of images and finding the difference between them.