# Concatenating data frames

Concatenating data is a common GIS operation.  It refers to combining multiple data sources that usually share a common structure (i.e., the same attributes and geometry type).  For example, suppose you are collecting data in the field using a mobile GPS device, and every day the data that you collect is downloaded as a shapefile.  After a week you have 5 shapefiles with similar data, and they must all be combined into a master file that contains all of the data.

Let's load the BUOWL data first.

In [None]:
%matplotlib inline
import geopandas as gpd

buowl = gpd.read_file("data/BUOWL_Habitat.shp")
buowl.head()

Now let's separate this dataframe into two dataframes.  One will contain the historically occupied BUOWL habitat and the other the undetermined BUOWL habitat.

In [None]:
buowl_ho = buowl[buowl['hist_occup'] == 'Yes']
buowl_ho.head()

In [None]:
buowl_und = buowl[buowl['hist_occup'] == 'Undetermined']
buowl_und.head()

In [None]:
buowl_und.count()

We know that these two dataframes have identical structures because we created them from the same dataframe.  In this case it is very easy to concatenate them back into a single dataframe using the pandas concat function.  concat can be fairly complex, but in its simplest form we just pass it a list of dataframes to concatenate.

In [None]:
import pandas as pd

buowl_all = pd.concat([buowl_ho, buowl_und])
buowl_all.count()
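One detail worth checking after a concatenation is the index: by default concat keeps each input's original row labels.  The sketch below uses two small made-up tables (the names part_a and part_b and their values are illustrative, not taken from the BUOWL data) to compare the default behaviour with ignore_index=True, which renumbers the rows.

```python
import pandas as pd

# Two toy tables with the same columns (values are invented for illustration)
part_a = pd.DataFrame({"habitat_id": [1, 2], "hist_occup": ["Yes", "Yes"]})
part_b = pd.DataFrame({"habitat_id": [3], "hist_occup": ["Undetermined"]})

# Default: the original index labels are kept, so label 0 appears twice
kept = pd.concat([part_a, part_b])

# ignore_index=True builds a fresh 0..n-1 index instead
renumbered = pd.concat([part_a, part_b], ignore_index=True)

print(list(kept.index))
print(list(renumbered.index))
```

In the BUOWL example both subsets were filtered from one dataframe, so their labels do not collide.  Tables loaded from separate files typically each start at 0, which is where ignore_index=True is likely to be useful.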

But what if the data structures are not identical?

Let's read in a few data frames that were created as the intersections of environmental constraints and project buffers.

In [None]:
raptor_buffer = gpd.read_file("data/intersections.gpkg", layer='raptor_buffer')
buowl_buffer = gpd.read_file("data/intersections.gpkg", layer='buowl_buffer')

In [None]:
raptor_buffer.info()

In [None]:
buowl_buffer.info()

Notice that in this case some of the column names differ (although they hold similar types of information).  Let's do a quick, simple concatenation and see what happens.

In [None]:
ec = pd.concat([raptor_buffer, buowl_buffer])
ec

Notice that identical column names are combined automatically into the same column, but there are now 2 new columns, *habitat_id* and *hist_occup*, that reflect the columns in buowl_buffer that are not found in raptor_buffer.  Rows that come from a dataframe lacking a column are filled with missing values (NaN) in that column.
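To see this behaviour in isolation, here is a minimal sketch using two tiny stand-in tables.  The column names mirror the ones discussed above, but the table names (raptor_demo, buowl_demo) and all values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the two layers (values invented for illustration)
raptor_demo = pd.DataFrame({"Nest_ID": [10], "recentstat": ["ACTIVE"], "area_ha": [1.5]})
buowl_demo = pd.DataFrame({"habitat_id": [7], "hist_occup": ["Yes"], "area_ha": [0.4]})

combined = pd.concat([raptor_demo, buowl_demo], ignore_index=True)

# The shared column 'area_ha' is stacked into one column; every other
# column is NaN for the rows whose source table did not have it
print(combined.columns.tolist())
print(combined.isna().sum())
```

Counting the missing values per column, as in the last line, is a quick way to spot columns that failed to line up.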

Now we have a few issues to resolve.  One is that buowl_buffer has no species column, because all of the BUOWL habitat reflects a single species.  But if we are going to combine these results with the raptor intersections, then we should have a species column containing the text BUOWL so that we can differentiate them from the raptor nests.  This is easy enough to do.

In [None]:
buowl_buffer['recentspec'] = 'BUOWL'
buowl_buffer.head()

We see that we now have a recentspec column containing the text BUOWL for all the buowl_buffer records.

Another issue is that the buowl_buffer dataframe has column names *habitat_id* and *hist_occup* that contain similar types of information to the *Nest_ID* and *recentstat* columns in the raptor_buffer dataframe.  We want those columns to be combined when we concatenate so they have to have the same field names.

I will have an entire lecture on various ways to rename columns, but for now just know that we can simply assign a new list of column names to the columns attribute of the data frame.  The list must contain one name for every column, in the existing column order.  This is probably the easiest approach, although if you have a lot of column names there are more efficient ways.

In [None]:
buowl_buffer.columns = ['Nest_ID', 'recentstat', 'Project', 'type', 'length_m', 'area_ha', 'geometry', 'recentspec']
buowl_buffer.head()
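Assigning a full list relies on the columns being in exactly the expected order.  One alternative, sketched here on a toy table rather than the real buowl_buffer layer (buowl_demo and its values are made up), is the rename method with an old-name to new-name mapping, so only the columns that differ need to be listed.

```python
import pandas as pd

# Toy table with the BUOWL-style column names (values invented for illustration)
buowl_demo = pd.DataFrame({
    "habitat_id": [7, 8],
    "hist_occup": ["Yes", "Undetermined"],
    "area_ha": [0.4, 1.1],
})

# Map old name -> new name; columns not listed are left unchanged.
# rename returns a new dataframe and does not modify buowl_demo.
renamed = buowl_demo.rename(columns={"habitat_id": "Nest_ID", "hist_occup": "recentstat"})

print(renamed.columns.tolist())
```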

Now that we have the same column names for both dataframes, let's combine them again.

In [None]:
ec = pd.concat([raptor_buffer, buowl_buffer])
ec

Notice that there are now no new columns and that the data line up properly, even though the columns are not in the same order.  The important thing is the column name.

And now we can create data summaries that integrate both raptor nests and BUOWL habitat.

In [None]:
pd.pivot_table(ec, index=['Project', 'recentspec', 'recentstat'], values='area_ha', aggfunc=['sum', 'count'])

Pandas has a rich set of functionality for these kinds of data manipulations.  As is sometimes the case with open source projects, there may be several different ways to achieve the same result.  I chose the simplest method for concatenating data that I could, and the one most similar to methods available in desktop GIS, but the same result could be achieved in other ways.
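As one illustration of "several different ways", the summary step can also be written with groupby.  This is only a sketch on a made-up table: ec_demo, the project labels, and the species code RTHA are placeholders, not values from the real data.

```python
import pandas as pd

# Toy combined table (all values invented for illustration)
ec_demo = pd.DataFrame({
    "Project": ["A", "A", "B"],
    "recentspec": ["BUOWL", "BUOWL", "RTHA"],
    "area_ha": [1.0, 2.0, 4.0],
})

# Pivot-table form, mirroring the call used above
by_pivot = pd.pivot_table(ec_demo, index=["Project", "recentspec"],
                          values="area_ha", aggfunc=["sum", "count"])

# groupby form: same groups, same two aggregations
by_group = ec_demo.groupby(["Project", "recentspec"])["area_ha"].agg(["sum", "count"])

print(by_group)
```

The two results hold the same numbers.  The pivot table labels its columns with two levels (aggregation, then value column), while the groupby result has a single level.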

Take a look at the documentation for the concat function for more information on other ways it can be used.

In [None]:
help(pd.concat)

More information on this topic can be found in the [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html).  This page provides a good overview of the many methods that are available.