<a href="https://colab.research.google.com/github/rg-smith/GIS/blob/main/Exercise2/GIS_exercise2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Class exercise 2: exploring geospatial data with Python

This exercise introduces some powerful tools for geospatial data analysis and plotting in Python. It works through examples of increasing complexity. At the end, you will upload your own shapefile and plot it.

The first section only needs to be run once each time you start a new session. It installs the necessary packages and clones the GitHub repository (that is, it copies all the files from my GitHub GIS page to your Google Colab session).

After running this, if you click the folder icon to the left, you should see a 'GIS' folder. If you don't see it, you can refresh (icon on the top left) and then you should see it.

In [None]:
!pip install rasterio
!pip install geopandas
!git clone https://github.com/rg-smith/GIS.git

Now that we have installed the necessary packages, we will import them. Some packages come pre-installed in the Python environment on Google Colab, so we don't need to install those (but we still need to import them).

In [2]:
import rasterio # tools for working with raster data
import geopandas as gpd # tools for working with vector data
from rasterio.plot import show # rasterio tool for plotting
import pandas as pd # tools for working with tabular (non-spatial) data
import matplotlib.pyplot as plt # tools for generic plotting (non-spatial or spatial)

Now we will load our first raster. You will see a check-mark after running this to verify it worked. 

In [3]:
raster = rasterio.open('GIS/Rasters/srtm_sw_utah.tif')

To plot the raster, we simply use the function we imported earlier (show). We assigned the raster to the variable name 'raster', so we just type 'show(raster)' to plot it. If the raster had a different name assigned to it, we would use that name as the argument below.

Try changing the name of your raster in the code block above, then changing the name below to plot it to see what I mean.

Take a screenshot and include it in your report. What type of data do you think this is (that is, what do the values represent)?

In [None]:
show(raster)

Now, we will explore the data a little bit. This is one of the nice things about python. It's fairly straightforward to dive into the data. We'll plot a histogram of the raster, showing how common each value is.

In [None]:
plt.figure()
plt.hist(raster.read().flatten(), bins=100) # flatten all pixel values into one long list, then bin them

Try changing the number of bins in the code above, and take a screenshot of your result. Describe the distribution of the data.
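
To make the role of bins concrete, here is a small sketch using numpy's np.histogram, which does the binning without drawing anything. The elevation values are made up (they are not read from the SRTM raster), and the nodata value of -32768 is an assumption for illustration; check raster.nodata for your own file.

```python
import numpy as np

# Made-up elevations in metres (NOT read from srtm_sw_utah.tif).
# -32768 stands in for a nodata value; this is an assumption for illustration.
elev = np.array([1500, 1620, 1710, 1800, 1850, 2100, 2400, 2950, 3100, -32768])

# Drop the nodata pixels so they do not stretch the histogram range.
valid = elev[elev != -32768]

# np.histogram returns the count in each bin and the bin edges.
counts_coarse, edges_coarse = np.histogram(valid, bins=2)
counts_fine, edges_fine = np.histogram(valid, bins=4)

print(counts_coarse, edges_coarse)
print(counts_fine, edges_fine)
```

More bins split the same values into narrower ranges, so the total count stays the same while the shape of the distribution becomes more detailed.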

Now, we will load a shapefile. We do this using geopandas. We created an alias for geopandas when we imported it (import geopandas as gpd), so to use a geopandas function we type gpd. followed by the name of the function (for example, gpd.read_file).

In [6]:
shp = gpd.read_file('GIS/Shapefiles/parowan_watershed.shp')

Now that we've loaded the shapefile, let's take a look at it. The shapefile is loaded as a table (a GeoDataFrame), with one column representing the geometry. If we print the shapefile, we can see the table.

In [None]:
print(shp)

As you can see, there's just one row in this table. Let's plot it to see what it looks like.

In [None]:
shp.plot()

Do you think the shapefile and raster have the same or different coordinate reference systems? Why?

After making an initial guess, we can check the coordinate reference systems.

In [None]:
print('raster coordinate reference system:')
print(raster.crs)
print('shapefile coordinate reference system:')
print(shp.crs)

EPSG is a code that defines a unique coordinate reference system. Google EPSG 4326, and include the details of this crs, including its datum, in your report.

Now that we've verified that these two datasets have the same crs, we can plot them together.

We'll create a blank plot first, then add each dataset to the plot. Note that rasterio and geopandas have different syntax for plotting. With rasterio we call the show function with the raster as its argument, and with geopandas we type '.plot()' after the shapefile name. In both cases, the argument ax=ax tells the function to draw on the blank plot we created.

In [None]:
fig,ax = plt.subplots()
show(raster,ax=ax)
shp.plot(ax=ax,facecolor='none',edgecolor='red')

## Part 2: joining shapefile data in python

Now we will load some county-level data. Since this shapefile's attribute table has a lot of columns, we will print a summary of each column with the .info() function.

In [None]:
counties = gpd.read_file('GIS/Shapefiles/cb_2017_us_county_wgs84.shp')
counties.info() # .info() prints the summary itself, so no print() is needed

Now let's take a look at a few columns more specifically

In [None]:
print(counties[['STATEFP','COUNTYFP','ALAND']])

As we discovered earlier this year, counties can be identified with a code, called FIPS. Each county has a unique FIPS code. This includes a state FIPS and county FIPS. This shapefile has those included separately, but most tabular county data we want to join with this has them together, so we will merge them.

In [None]:
counties['fips'] = counties['STATEFP'] + counties['COUNTYFP']
print(counties['fips'])
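
The merge above works because STATEFP and COUNTYFP are stored as text, so + glues them together instead of adding them. The plain-Python sketch below illustrates the difference; the codes are example values chosen for illustration, not read from the shapefile.

```python
# Example codes for illustration: state "49" and county "021".
state_fp = "49"
county_fp = "021"

# With strings, + concatenates, which is what we want for a FIPS code.
fips = state_fp + county_fp

# With integers, + would add the numbers and the leading zero would be lost.
wrong = int(state_fp) + int(county_fp)

# If a code was stored as a number, zfill pads it back to a fixed width.
restored = str(1001).zfill(5)

print(fips, wrong, restored)
```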

Let's plot the counties.

In [None]:
counties.plot()

Now, we will read in some tabular county-level data that can be joined with our county shapefile. The tabular dataset is a table showing surface and groundwater withdrawals for each county in 2010. We will only read in four columns: fips, county population, total freshwater groundwater withdrawals, and total freshwater surface water withdrawals.

In [None]:
gw = pd.read_table('GIS/Tabular/usco2010.txt',usecols=['FIPS','TP-TotPop','TO-WGWFr','TO-WSWFr'],dtype={'FIPS':object})
gw.info() # .info() prints the summary itself, so no print() is needed
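
The dtype={'FIPS':object} argument keeps the FIPS column as text. A minimal sketch of why that matters, using a tiny made-up table in place of usco2010.txt (the numbers are illustrative only):

```python
import io
import pandas as pd

# A tiny made-up tab-separated table standing in for usco2010.txt.
text = "FIPS\tTP-TotPop\n01001\t54.6\n49021\t46.2\n"

# Default parsing treats FIPS as an integer and drops the leading zero.
as_number = pd.read_table(io.StringIO(text))

# Reading FIPS as object keeps it as text, so '01001' survives intact.
as_text = pd.read_table(io.StringIO(text), dtype={"FIPS": object})

print(as_number["FIPS"].tolist())
print(as_text["FIPS"].tolist())
```

Without the leading zero, codes for states such as Alabama (01) would no longer match the text codes built from the shapefile.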

Since this dataset has no spatial data attached to it, we need to join it to our county shapefile, using the fips column. In pandas and geopandas, to join a file, you first need to set an 'index'. This is the column that we are joining the data on.

We do this with .set_index(['column name']) following the shapefile or data table.

In [16]:
counties2 = counties.set_index(['fips'])
gw2 = gw.set_index(['FIPS'])

Now, we can join the datasets.

In [17]:
counties2 = counties2.join(gw2)
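
As a rough illustration of what .join does, the sketch below joins two small made-up tables that share an index. The table and column contents are invented stand-ins for counties2 and gw2.

```python
import pandas as pd

# Made-up stand-ins for counties2 and gw2, both indexed by FIPS code.
left = pd.DataFrame({"NAME": ["A", "B", "C"]}, index=["01001", "01003", "49021"])
right = pd.DataFrame({"TO-WGWFr": [1.5, 7.0]}, index=["01001", "49021"])

# .join keeps every row of the left table and matches rows on the index.
joined = left.join(right)

# Rows with no match in the right table get NaN in the new columns.
unmatched = joined[joined["TO-WGWFr"].isna()]

print(joined)
print(unmatched.index.tolist())
```

Checking for NaN rows after a join is a quick way to spot counties whose FIPS code did not match anything in the table.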

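We will create some new columns representing groundwater and surface water withdrawals per unit area. We will also convert these from million gallons per day (Mgal/d) to mm of water per year, averaged over the whole county area. The conversion uses 1 gallon = 0.00378541 cubic meters and 365 days per year; ALAND is the county land area in square meters. The population column is in thousands of people, so dividing by ALAND and multiplying by 1e6 gives thousands of people per square km.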

In [18]:
counties2['gw_withdrawal'] = 1e6*1000*counties2['TO-WGWFr']*0.00378541*365/counties2['ALAND'] # convert to mm/year
counties2['sw_withdrawal'] = 1e6*1000*counties2['TO-WSWFr']*0.00378541*365/counties2['ALAND'] # convert to mm/year
counties2['pop_density'] = 1e6*counties2['TP-TotPop']/counties2['ALAND'] # thousand people per square km
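
The conversion can be checked by hand for a single county. The sketch below spells out each step for a hypothetical county; the function name and the input numbers are made up for illustration.

```python
import math

GALLON_M3 = 0.00378541   # cubic metres in one US gallon
DAYS_PER_YEAR = 365

def mgal_per_day_to_mm_per_year(mgal_per_day, area_m2):
    """Convert a withdrawal in Mgal/d to a depth in mm/year over area_m2."""
    m3_per_year = mgal_per_day * 1e6 * GALLON_M3 * DAYS_PER_YEAR
    metres_per_year = m3_per_year / area_m2
    return metres_per_year * 1000  # metres to millimetres

# Hypothetical county: 1 Mgal/d withdrawn over 1,000 square km of land.
depth = mgal_per_day_to_mm_per_year(1.0, 1000 * 1e6)
print(round(depth, 3))
```

So 1 Mgal/d spread over 1,000 square km is a little under 1.4 mm of water per year, which gives a sense of scale for the color range used in the plots.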

Now, we can plot the data. Let's plot the groundwater withdrawals first. The optional argument figsize sets the figure size, and vmin and vmax set the minimum and maximum of the color range. Try changing the values for these arguments, or removing them, to see how it changes the figure. Take a screenshot and describe how you changed the figure.

Also, comment on what you observe about spatial patterns in groundwater withdrawals in the US. Where are they most substantial?

In [None]:
counties2.plot(column = 'gw_withdrawal',figsize=(24,24),vmin=0,vmax=300)
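
A simplified picture of what vmin and vmax do: values are scaled linearly between the two limits, and anything outside the limits is drawn with the color at the nearest end of the scale. The helper below is a hypothetical sketch of that idea in plain Python, not the actual plotting code.

```python
def color_position(value, vmin, vmax):
    """Return where a value falls on the color scale, from 0.0 to 1.0."""
    fraction = (value - vmin) / (vmax - vmin)
    # Values outside the range are drawn with the end colors.
    return min(max(fraction, 0.0), 1.0)

# With vmin=0 and vmax=300, 150 mm/year sits in the middle of the scale,
# while 300 and 900 mm/year both land at the top and look the same.
for value in [0, 150, 300, 900]:
    print(value, color_position(value, vmin=0, vmax=300))
```

This is why a lower vmax brings out differences among counties with modest withdrawals, at the cost of making the heaviest-pumping counties indistinguishable from one another.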

Now, it's a little hard to tell where things are located by the county borders alone. If we added the state boundaries, it would be easier. We will do that now. First, we load the state shapefile and check the crs. Is it the same crs as the others?

In [20]:
state = gpd.read_file('GIS/Shapefiles/us_states.shp')
print(state.crs)

epsg:4269


To change the state crs, we can use the .to_crs function. Fill in the XXXX with the correct epsg code so that this one matches the others.

In [21]:
state2 = state.to_crs(epsg = 'XXXX')

Now, we will plot the states and counties together. We don't have to assign a plot to a variable, but if we do, it saves the axis, and this can be used later to plot another shapefile/raster on the same axis. We will call the axis base here, then make another plot with the argument ax=base.

In [None]:
base = counties2.plot(column = 'gw_withdrawal',figsize=(24,24),vmin=0,vmax=300)
state2.plot(ax=base,color='none',edgecolor='red',linewidth=2)

Once again, try playing around with the arguments in the plot function to change the color/thicknesses of the state outlines, and vmin/vmax to change the color of the pumping values. Try to make it look nice, not just different! Take a screenshot and include in your report.

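Now we will experiment with indexing. By placing a condition inside square brackets, we keep only the rows where the condition is true. Here, we will only plot counties with more than 50 mm/year pumping averaged over their whole area.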

In [None]:
base = counties2[counties2['gw_withdrawal']>50].plot(column = 'gw_withdrawal',figsize=(24,24),vmin=0,vmax=300)
state2.plot(ax=base,color='none',edgecolor='red',linewidth=2)

Try changing the threshold from 50 to another value to make a plot that more clearly demonstrates regions that have the most groundwater pumping.
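
The filtering step can be seen more clearly on a small table. The sketch below uses made-up withdrawal values and arbitrary index labels; only the pattern (condition inside square brackets) mirrors the notebook code.

```python
import pandas as pd

# Made-up withdrawals in mm/year for four hypothetical counties.
df = pd.DataFrame(
    {"gw_withdrawal": [5.0, 80.0, 49.9, 120.0]},
    index=["01001", "06019", "49021", "20055"],
)

# Comparing a column to a number gives one True/False per row.
mask = df["gw_withdrawal"] > 50

# Passing the mask in square brackets keeps only the True rows.
heavy = df[mask]

print(mask.tolist())
print(heavy.index.tolist())
```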

Now, we will plot surface water withdrawals.

In [None]:
base = counties2.plot(column = 'sw_withdrawal',figsize=(24,24),vmin=0,vmax=300)
state2.plot(ax=base,color='none',edgecolor='red',linewidth=2)

And finally, the population density.

In [None]:
base = counties2.plot(column = 'pop_density',figsize=(24,24),vmin=0,vmax=0.5)
state2.plot(ax=base,color='none',edgecolor='red',linewidth=1)

Comment on any spatial correlation (or lack thereof) between surface water withdrawals, groundwater withdrawals, and population density. 

## Part 3: application
Now, you will apply what you have learned to plot your own data.

Pick a raster from any of the labs we have done. The srtm_us could be a straightforward one, but pick any that you want. 

On the folder tab to the left, click the arrow next to GIS, then Rasters. Find the folder on your computer that contains your raster, then click and drag the raster file into the 'Rasters' folder.

Now do the same with a shapefile. Click the arrow next to Shapefiles to expand it, then click and drag your files associated with your shapefile into that folder.

Replace the 'XXX' with your raster filename and shapefile filenames, then run the code below.

In [None]:
your_raster = rasterio.open('GIS/Rasters/XXX.tif')
your_shapefile = gpd.read_file('GIS/Shapefiles/XXX.shp')

Check that your shapefile and raster are in the same coordinate reference system.

In [None]:
print(your_raster.crs)
print(your_shapefile.crs)

The format may look different, but check for the EPSG code, or if it's UTM, you can also check for the same zone. If they are not in the same coordinate reference system, determine the EPSG code of your raster, and replace the 'XXX' below with that value. The following line is only needed if they are NOT in the same coordinate reference system:

In [None]:
your_shapefile = your_shapefile.to_crs(epsg = 'XXX')

Now, see if you can figure out how to plot them separately, then together, based on the code above. Those who can do this before class Thursday will get 10 pts extra credit on this assignment. I will provide more guidance on Thursday for the rest of the class!

## Extras

This section only needs to be completed by those who want to do a make-up lab. It is going to be challenging, and you are expected to do the majority of this without TA/instructor help (but you can still ask questions if you are really stuck).

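First, we will load world countries and cities shapefiles, then plot them together. Take a screenshot and comment. Note that gpd.datasets was deprecated in geopandas 0.14 and removed in geopandas 1.0, so the cell below only runs as written on older geopandas versions.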



In [None]:
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
cities = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))

fig,ax = plt.subplots(figsize=(15,10))
world.plot(ax=ax,column = 'pop_est')
cities.plot(ax=ax,color = 'red')

Now, we will load a few different tabular datasets.

In [None]:
education = pd.read_csv('GIS/Tabular/Education.csv',encoding = 'ISO-8859-1',dtype={'FIPS Code':object})
education = education.set_index(['FIPS Code'])
education.info() # .info() prints the summary itself, so no print() is needed

In [None]:
poverty = pd.read_csv('GIS/Tabular/PovertyEstimates.csv',dtype={'FIPStxt':object})
poverty = poverty.set_index(['FIPStxt'])
poverty.info() # .info() prints the summary itself, so no print() is needed

Choose one of these tables to join to your counties2 shapefile. NOTE: if you use poverty, you will probably have to do some digging to figure out what the attributes represent. Education will be simpler, but you are welcome to explore either one.

Both data tables have already had their index set as fips, so you can just use the join command, as shown previously, to join them with the counties2 shapefile. 

Join these tables, then plot the counties with an attribute from the joined table. Take a screenshot, describe what you are plotting, and what the range in color values is in your lab report. Email the lab report to Jiawei and cc me by Dec 7 for credit.

Some code is shown below to get you started. The XXX are things you would replace. 

In [None]:
#XXX = counties2.join(XXX)

In [None]:
#base = XXX.plot(column = "XXX",figsize=(XX,XX))
#state2.plot(ax=base,color='none',edgecolor='XX')