<a href="https://colab.research.google.com/github/williamlidberg/Analyses-of-Environmental-Data-2/blob/main/modules/module_3/Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Wrangling with Python
Data is generated in many parts of society, but there is no common agreement on how it should be stored. It is often messy and semi-structured, and in larger projects you will spend most of your time acquiring, preparing and cleaning data. The aim of this module is therefore to introduce some tools and methods for handling datasets of different types. You will work with tabular data, vector data and image data.


All data used in this course is "real" data, which means that it is not always ready for analysis. For part of this exercise you will use data from the Swedish Agency for Digital Government (https://www.dataportal.se/en). The portal is still under construction, but spend some time poking around in it and see if you can find some inspiration for your future Master's thesis.


# Download data
Students and researchers in Sweden have access to most of Sweden's geographical data. You can use the [GET tool](https://maps.slu.se/) to select data, draw a square on a map and then download the data.

### Task 1
Download buildings from your favorite part of Sweden using the GET tool. Send me an email if you fail to log in. You will need to download data from GET in module 4 as well, so make sure you get this to work. On a scale from 1 to 5, how hard was it to download the data?

### Useful Linux commands
The Python environment in Google Colab runs on Linux, so I want to introduce you to some useful Linux commands that you can use in the terminal. If you want to use these commands in a notebook, you need to put a ! in front of them.

Linux commands are available in Google Colab, but if you are using Anaconda on Windows, you can install some basic Linux commands with conda install m2-base. Yet another reason to use Linux on your workstation. [Ubuntu](https://ubuntu.com/tutorials/install-ubuntu-desktop) is a good option if you have an old computer lying around. The support for Windows 10 ends this fall, so there will be plenty of old but perfectly usable machines that you can acquire for your first workstation. Ask your nearest IT department.


### Download data with git clone
Small datasets can be stored in GitHub repositories. There are about 1000 images of impact craters on the moon in this repository: https://github.com/williamlidberg/Unet-tutorial

The following command clones the repository, including the data.

In [None]:
!git clone https://github.com/williamlidberg/Unet-tutorial.git

It's always a good idea to inspect your data, so let's plot some images using matplotlib.

In [None]:
import os
import matplotlib.pyplot as plt
from PIL import Image

def plot_image_grid(image_dir, num_images=4, grid_size=(2, 2)):
    image_files = [file for file in os.listdir(image_dir) if file.endswith(".png")][:num_images]

    fig, axes = plt.subplots(*grid_size, figsize=(8, 8))

    for i, image_file in enumerate(image_files):
        img = Image.open(os.path.join(image_dir, image_file)).convert('L')  # Convert to grayscale
        row, col = divmod(i, grid_size[1])

        axes[row, col].imshow(img, cmap='gray')
        axes[row, col].axis("off")
        axes[row, col].set_title(image_file)

    plt.tight_layout()
    plt.show()

# Example usage:
image_directory = "/content/Unet-tutorial/craters_1000_samples/train/images"
plot_image_grid(image_directory, num_images=4, grid_size=(2, 2))


The data also contain label images where each pixel is labeled 1 for impact craters and 0 for background. Let's quickly count the number of pixels labeled as impact craters across the dataset. Say we want to plot the distribution of the area of each image covered by impact craters.

In [None]:
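import os
import numpy as np
import imageio.v2 as imageio
import matplotlib.pyplot as plt

sums = []

# NOTE: this path points at the images folder. To count crater pixels,
# point it at the folder that holds the label images instead.
label_directory = '/content/Unet-tutorial/craters_1000_samples/train/images'

for image in os.listdir(label_directory):
    image_with_path = os.path.join(label_directory, image)
    numpy_array = imageio.imread(image_with_path)  # read the image as a numpy array
    sums.append(np.sum(numpy_array))  # sum of all pixel values in the image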

plt.hist(sums, bins=10, color='blue', edgecolor='black', alpha=0.7)
plt.title('Distribution of Total Values Across Images')
plt.xlabel('Total Value')
plt.ylabel('Frequency')
plt.show()
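
Summing a 0/1 mask gives the number of crater pixels, so dividing by the total number of pixels gives the fraction of the image covered by craters. The sketch below illustrates this on two small made-up masks instead of the real label files. The array values and the names `masks` and `crater_fraction` are assumptions for illustration only.

```python
import numpy as np

# Two small synthetic label masks (1 = crater, 0 = background).
# These arrays are made up for illustration and are not from the dataset.
masks = [
    np.array([[0, 1, 1, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0]]),
    np.zeros((4, 4), dtype=int),
]

def crater_fraction(mask):
    """Return the share of pixels labeled 1 in a binary mask."""
    return np.count_nonzero(mask == 1) / mask.size

fractions = [crater_fraction(mask) for mask in masks]
print(fractions)
```

If the labels are stored with another value for craters, for example 255, compare against that value instead of 1.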


### Task 2
Clone this repository https://github.com/williamlidberg/Geographical-Intelligence and use numpy to find the maximum value in the digital elevation model stored under https://github.com/williamlidberg/Geographical-Intelligence/tree/main/data/rasters/dem


### Download data with wget
wget downloads files from the internet given a link. The -P flag after the link sets the directory where the file is stored, in this case /content/sample_data/ if you are using Google Colab.

In [None]:
!wget https://geodata.naturvardsverket.se/nedladdning/diken/Diken_Sverige/Diken_lansvis/Diken_K.zip -P /content/sample_data/


### unzip
unzip unpacks zipped data. The -o flag means that existing files with the same name are overwritten without asking, and -d sets the directory to unpack into.

In [None]:
!unzip -o /content/sample_data/Diken_K.zip -d /content/
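
If the `unzip` command is not available, for example on Windows without `m2-base`, Python's built-in `zipfile` module can do the same job. The following is a minimal sketch that builds a tiny throwaway archive in a temporary folder and unpacks it. The file names are made up for the demonstration.

```python
import os
import tempfile
import zipfile

# Work in a temporary folder so nothing on disk is overwritten.
workdir = tempfile.mkdtemp()
zip_path = os.path.join(workdir, "demo.zip")

# Create a small archive to unpack (stands in for a downloaded zip file).
with zipfile.ZipFile(zip_path, "w") as archive:
    archive.writestr("demo.txt", "hello")

# Unpack everything into a target folder, similar to `unzip -o file.zip -d target`.
target_dir = os.path.join(workdir, "unpacked")
with zipfile.ZipFile(zip_path) as archive:
    archive.extractall(target_dir)

extracted_files = sorted(os.listdir(target_dir))
print(extracted_files)
```

For a real download, replace `zip_path` and `target_dir` with the path to your zip file and the folder you want to unpack into.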

.gpkg is a GeoPackage, an open, standards-based file format for storing geospatial data. It is designed to be a single-file container that can hold various types of geographic data, including vector features, raster images and attribute tables. There has been a push to move away from shapefiles to GeoPackage. Python packages such as geopandas, fiona and gdal can open GeoPackage files. We will dive deeper into geopandas in module 6, so for now let's just open the GeoPackage. Install geopandas if it's not already installed.

In [None]:
import geopandas as gpd
path_to_geopackage = '/content/Diken_K.gpkg'
data = gpd.read_file(path_to_geopackage)
data

In [None]:
data.plot()

### Task 3
Select another region of Sweden here and download it: https://geodata.naturvardsverket.se/nedladdning/diken/Diken_Sverige/Diken_lansvis/?C=N;O=A

Answer the question: what is the most common type of ditch in that area? The column with the ditch type is named "Typ".

### Download data with urllib
urllib is a Python package that collects several modules for working with URLs.


In [None]:
from urllib.request import urlretrieve
url = ('https://opendata.umea.se/api/v2/catalog/datasets/vaxthusgasutslapp_umea/exports/csv')
filename = '/content/sample_data/vaxthusgasutslapp_umea.csv' # you need to adjust this path on your own computer if you are using anaconda.

urlretrieve(url, filename)

In Google Colab the data is stored under '/content/sample_data/', but you need to specify a path on your own machine. R has built-in functionality to read CSV files, but in Python we need a library like pandas. Here we will import pandas as pd and then read the CSV file using the command pd.read_csv().

# Tabular data
Tabular data refers to information organized in a table. Each row represents a single observation, entity, or record in the dataset. Each column represents a specific attribute, variable, or field associated with the observations. The R community sometimes refers to this as tidy data.

In [None]:
import pandas as pd
ghg_emissions = pd.read_csv('/content/sample_data/vaxthusgasutslapp_umea.csv')

Something is off with this file. You can download it and open it in Excel or Notepad to see what the issue could be, but a common problem with CSV files is how the columns are separated. This is especially common with Swedish data, where a comma is used as the decimal mark instead of a period, so the columns are separated by another character. Just like in R, you can specify the separator in pandas. In this case we can try ';'.

In [None]:
import pandas as pd
ghg_emissions = pd.read_csv('/content/sample_data/vaxthusgasutslapp_umea.csv', sep=';')
ghg_emissions
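
The column separator and the decimal mark are two different settings in `pd.read_csv`. As a small illustration, the sketch below reads a made-up semicolon-separated text with decimal commas. The column names and numbers are invented and are not from the Umeå file.

```python
import io
import pandas as pd

# A tiny made-up CSV text in the Swedish style:
# semicolons between columns and commas as decimal marks.
csv_text = "sektor;artal;varde\nJordbruk;2019;12,5\nTransporter;2019;30,25\n"

# sep sets the column separator, decimal sets the decimal mark.
demo = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
print(demo.dtypes)
print(demo["varde"].sum())
```

Without `decimal=","` the `varde` column would be read as text rather than numbers, so sums and means would not work as expected.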

Note that the column names are in Swedish, so let's rename them to something more useful using the rename method. It is called on the dataframe, and you pair up each old name with its new name, separated by a colon (:).

In [None]:
ghg_emissions.rename(columns={'huvudsektor': 'MainSector',
                   'undersektor': 'SubSector',
                   'artal': 'Year',
                   'varde_co2e': 'CO2EmissionValue'}, inplace=True)
ghg_emissions

Now the column headers are in English, but the content is still in Swedish. This can be a bit more cumbersome to fix, but here is a demonstration of how you can replace Personbilar with Cars in the SubSector column.

In [None]:
ghg_emissions['SubSector'] = ghg_emissions['SubSector'].replace('Personbilar', 'Cars')
ghg_emissions
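
To translate several values at once, `replace` also accepts a dictionary of old and new values. The sketch below uses a small made-up DataFrame. The numbers are invented, and the translation of `Bussar` to `Buses` is only an example.

```python
import pandas as pd

# Made-up rows that mimic the SubSector column.
demo = pd.DataFrame({
    "SubSector": ["Personbilar", "Bussar", "Personbilar"],
    "CO2EmissionValue": [10.0, 2.0, 11.0],
})

# Map each Swedish value to its English translation.
translations = {"Personbilar": "Cars", "Bussar": "Buses"}
demo["SubSector"] = demo["SubSector"].replace(translations)
print(demo)
```

Values that are not in the dictionary are left unchanged.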

For a quick overview of a pandas dataframe you can use the command describe().

If the DataFrame contains numerical data, the description contains the following information for each column:

* count - The number of non-empty values.
* mean - The average (mean) value.
* std - The standard deviation.
* min - The minimum value.
* 25% - The 25th percentile.
* 50% - The 50th percentile (the median).
* 75% - The 75th percentile.
* max - The maximum value.

In [None]:
ghg_emissions.describe()

While this is useful, it does not always make sense. For example, we might want a summary per SubSector or year instead of all of them combined. This is where the groupby command shines.

Group by 'MainSector' and calculate the sum of 'CO2EmissionValue':

In [None]:
summarized_df = ghg_emissions.groupby(['MainSector'])['CO2EmissionValue'].sum().reset_index()
summarized_df

Applying reset_index() turns the group labels from the index back into an ordinary column, so the result is a regular DataFrame. This is useful in many cases, especially when you group by several columns and do not want to deal with a multi-level index.
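
To see what `reset_index()` changes, it helps to compare the result with and without it. A minimal sketch on a made-up DataFrame (the values are invented):

```python
import pandas as pd

demo = pd.DataFrame({
    "MainSector": ["Jordbruk", "Jordbruk", "Transporter"],
    "CO2EmissionValue": [1.0, 2.0, 5.0],
})

# Without reset_index(): a Series with MainSector as the index.
as_series = demo.groupby("MainSector")["CO2EmissionValue"].sum()

# With reset_index(): a DataFrame where MainSector is an ordinary column again.
as_frame = as_series.reset_index()

print(type(as_series).__name__, list(as_series.index))
print(type(as_frame).__name__, list(as_frame.columns))
```

The numbers are the same in both versions. Only the way the group labels are stored differs.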

### Task 4
What was the total CO2 emission in Umeå in 2017?

# Plotting data with Python
R and ggplot are excellent for making beautiful plots, and I sometimes go back to R for my figures. However, there are some nice tools for visualising data in Python as well. Matplotlib and seaborn are a nice combination of packages.

In [None]:
!pip install matplotlib
!pip install seaborn

Let's use the previously downloaded data ghg_emissions to make some plots. We can start by summarizing the emissions by year and creating a barplot.

In [None]:
yearly_emissions = ghg_emissions.groupby(['Year'])['CO2EmissionValue'].sum().reset_index()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(10, 6))

sns.barplot(x='Year', y='CO2EmissionValue', data=yearly_emissions) # note that we use sns.barplot
plt.xlabel('Year')
plt.ylabel('CO2 Emissions')
plt.show()

If you want to group by multiple columns, you can pass a list of column names to the groupby method. Here's an example:

In [None]:
yearly_emissions = ghg_emissions.groupby(['Year', 'SubSector'])['CO2EmissionValue'].sum().reset_index()
yearly_emissions

You can go even further and filter the dataframe for a specific value or class in a column. Here is an example of how to find the emissions over time, but only for Jordbruk (agriculture).

In [None]:
column = 'Jordbruk'
agriculture_emissions = ghg_emissions.loc[ghg_emissions['MainSector'] == column]
agriculture_emissions

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(10, 6))

sns.boxplot(x='Year', y='CO2EmissionValue', data=agriculture_emissions) # note that we use sns.boxplot
plt.xlabel('Year')
plt.ylabel('CO2 Emissions from agriculture in Umeå')
plt.show()

Filtering dataframes is a very powerful tool, and you can filter with multiple conditions. For example, you can select emissions from agriculture within a range of years.

In [None]:
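# Keep only agriculture (Jordbruk) rows from 2015 to 2020
agriculture_subset = ghg_emissions.loc[(ghg_emissions['MainSector'] == 'Jordbruk') &
                                       (ghg_emissions['Year'] >= 2015) &
                                       (ghg_emissions['Year'] <= 2020)]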
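
Each condition goes inside its own parentheses, and the conditions are combined with `&` (and) or `|` (or). A minimal sketch on made-up data, also showing `between` as a shorter way to write a range check. The values are invented.

```python
import pandas as pd

demo = pd.DataFrame({
    "MainSector": ["Jordbruk", "Jordbruk", "Transporter", "Jordbruk"],
    "Year": [2014, 2016, 2016, 2021],
    "CO2EmissionValue": [1.0, 2.0, 5.0, 3.0],
})

# Three comparisons combined with & ...
mask_long = (demo["MainSector"] == "Jordbruk") & (demo["Year"] >= 2015) & (demo["Year"] <= 2020)

# ... and the same range check written with between (inclusive on both ends).
mask_short = (demo["MainSector"] == "Jordbruk") & demo["Year"].between(2015, 2020)

subset = demo.loc[mask_short]
print(subset)
```

Both masks select the same rows, so which one to use is mostly a matter of readability.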

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(10, 6))

sns.boxplot(x='Year', y='CO2EmissionValue', data=agriculture_subset, showfliers=False) # Note that we excluded outliers this time
plt.xlabel('Year')
plt.ylabel('CO2 Emissions from agriculture in Umeå')
plt.show()

We cannot cover everything pandas can do here, but one final example is how to exclude rows from the dataframe. In this case we will exclude the main sector Alla. This is done with the != operator in Python, i.e. not equal.

In [None]:
column = 'Alla'
emissions = ghg_emissions.loc[ghg_emissions['MainSector'] != column]
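
To exclude several values at once, `isin` can be combined with `~`, which negates a condition. A sketch on made-up data, where `Okand` is a hypothetical category that is not taken from the dataset:

```python
import pandas as pd

demo = pd.DataFrame({
    "MainSector": ["Alla", "Jordbruk", "Transporter", "Okand"],
    "CO2EmissionValue": [8.0, 1.0, 5.0, 2.0],
})

# "Okand" is a hypothetical category used only for this demonstration.
excluded = ["Alla", "Okand"]

# ~ negates the condition: keep the rows whose MainSector is NOT in the list.
kept = demo.loc[~demo["MainSector"].isin(excluded)]
print(kept)
```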

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(10, 6))

sns.boxplot(x='Year', y='CO2EmissionValue', data=emissions, showfliers=False) # Note that we excluded outliers this time
plt.xlabel('Year')
plt.ylabel('CO2 Emissions in Umeå')
plt.show()

### Task 5
Make a boxplot of the CO2 emissions from the transportation sector (Transporter) in Umeå between 2015 and 2020, but exclude the emissions from buses (Bussar).

## Animated plots
If you really want to impress your friends you can flex with animated plots. Some of you might have heard of Hans Rosling and gapminder. Let's start with an example using gapminder data and then make an animation with our data.

In [None]:
!pip install plotly

df = px.data.gapminder() loads the gapminder dataset.

In [None]:
import plotly.express as px
df = px.data.gapminder()
px.scatter(df, x="gdpPercap", y="lifeExp", animation_frame="year", animation_group="country",
           size="pop", color="continent", hover_name="country",
           log_x=True, size_max=55, range_x=[100,100000], range_y=[25,90])

Start by removing the rows where the main sector is 'Alla' from the dataset, and then sort the dataset by year.

In [None]:
column = 'Alla'
emissions = ghg_emissions.loc[ghg_emissions['MainSector'] != column]
emissions_sorted = emissions.sort_values(by='Year')

Finally we can plot the data as an animation of emissions over time.

In [None]:
import plotly.express as px

fig = px.bar(emissions_sorted, x="MainSector", y="CO2EmissionValue", color="MainSector",
             animation_frame="Year", animation_group="MainSector", range_y=[0, 100000])
fig.show()

### Task 6
Change the above animation to a boxplot instead of a barplot and set the theme to dark mode to make it extra cool. It is OK to use LLMs for this task.

## Animated maps
It's also possible to create time series animations using geospatial data. Here is an example of planted trees in Umeå. The [tree dataset](https://opendata.umea.se/explore/dataset/trad-som-forvaltas-av-gator-och-parker/export/?disjunctive.tradart_vetenskap_namn_1_1_2&disjunctive.tradart_svenskt_namn_1_1_3) is from Umeå kommun.

### Task 7
Download the [tree dataset](https://opendata.umea.se/explore/dataset/trad-som-forvaltas-av-gator-och-parker/export/?disjunctive.tradart_vetenskap_namn_1_1_2&disjunctive.tradart_svenskt_namn_1_1_3) from Umeå


In [None]:
url = ('https://opendata.umea.se/api/explore/v2.1/catalog/datasets/trad-som-forvaltas-av-gator-och-parker/exports/geojson?lang=en&timezone=Europe%2FStockholm')
filename = '/content/sample_data/trees.geojson' # you need to adjust this path on your own computer if you are using anaconda.

urlretrieve(url, filename)

Read the geojson file using geopandas and create a basic plot.

In [None]:
import geopandas as gpd
import plotly.express as px

# Load your geojson
gdf = gpd.read_file("/content/sample_data/trees.geojson")

# Convert to DataFrame with lat/lon
df = gdf.copy()
df["lon"] = gdf.geometry.x
df["lat"] = gdf.geometry.y

fig = px.scatter_mapbox(
    df,
    lat="lat",
    lon="lon",
    color="gatu_eller_parktrad_1_4_4",
    animation_frame="planteringsdatum_6_1_1",
    mapbox_style="open-street-map",
    zoom=10
)

fig.update_layout(
    width=800,    # width in pixels
    height=1000    # height in pixels
)
fig.show()
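
An animation gets one frame per unique value in the `animation_frame` column, so full dates can produce a very large number of frames. One option, sketched below on made-up data, is to animate by planting year instead. The column name `planting_date`, the coordinates and the dates are assumptions for illustration. The rows are sorted because the frames follow the order of the data.

```python
import pandas as pd

# Made-up points with a date column stored as text (one date is missing).
demo = pd.DataFrame({
    "planting_date": ["2019-05-02", "2017-09-15", "2019-10-01", None],
    "lat": [63.82, 63.83, 63.84, 63.85],
    "lon": [20.26, 20.27, 20.28, 20.29],
})

# Parse the text to dates; values that cannot be parsed become NaT.
demo["planting_date"] = pd.to_datetime(demo["planting_date"], errors="coerce")

# Drop rows without a date, extract the year and sort so frames run in time order.
demo = demo.dropna(subset=["planting_date"]).copy()
demo["planting_year"] = demo["planting_date"].dt.year
demo = demo.sort_values("planting_year")

print(demo["planting_year"].tolist())
```

The new year column can then be passed as `animation_frame` instead of the full date.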


### Task 8
There is a major campaign to restore wetlands in Sweden. Your final task is to make an animated map of restored wetlands over time. Download restored wetlands from the Swedish Forest Agency. Below is the code you need for downloading the data. Feel free to use another dataset if you want. For example, if you have a powerful computer and some patience, you can create map animations from all harvested forest in Sweden: https://geodpags.skogsstyrelsen.se/geodataport/data/sksUtfordAvverk_gpkg.zip



In [None]:
!wget https://geodpags.skogsstyrelsen.se/geodataport/data/sksAtervatningYta_gpkg.zip -P /content/sample_data/
!unzip -o /content/sample_data/sksAtervatningYta_gpkg.zip -d /content/
gdf = gpd.read_file("/content/sksAtervatningYta.gpkg")
print(gdf.columns)

This data is stored as polygons of the wetlands, but for our task we only need the coordinates. Here we use geopandas to extract the centroid coordinates. Ignore the warning.

In [None]:
gdf = gpd.read_file("/content/sksAtervatningYta.gpkg")

# Reproject to WGS84 (latitude/longitude) for plotting
gdf = gdf.to_crs(epsg=4326)

# Extract lon/lat of the polygon centroids
gdf["lon"] = gdf.centroid.x
gdf["lat"] = gdf.centroid.y



Make an animated map of restored wetlands where the size is controlled by "KartaHektar" (area), the color is "Kommun" (municipality) and the time series is "AvtalatDatum" (date of contract).