# **Geospatial analyses using WaPOR data**
# Topic 3 - Notebook 2: Introduction to Pandas and GeoPandas
This notebook is a brief introduction to the Python packages Pandas and GeoPandas.

The contents:
1. Pandas
2. Exercises
3. GeoPandas




---



# **1. Pandas**

Pandas is a widely used Python package for data analysis and visualization. It has become the default package for data manipulation, exploratory data analysis, and data cleaning. Its ability to read and write many data formats makes it a versatile tool for data analysis.

A huge amount of data is saved in different formats, including comma-separated values (CSV), text files, spreadsheets, and many more. These data are organized in rows and columns, and Pandas is built for manipulating such tabular data.

Pandas can be used to:
*   Import datasets from CSV files, databases, spreadsheets, and more.
*   Perform time series analysis.
*   Calculate summary statistics, correlations between columns, and more.
*   Visualize datasets.


### Installing Pandas
Pandas is not part of the Python standard library, but it comes preinstalled in many environments, such as Anaconda and Google Colab. It can also be installed using the ***`!pip install pandas`*** command.




## **Importing data in pandas**

Before using pandas, you need to import the package. The following cell imports pandas. When importing pandas, the most common alias for pandas is pd.

In [None]:
import pandas as pd

Let's start by importing some actual data. We will read the countries dataset, a CSV file shared in a folder that is accessible through a URL.

In [None]:
countries = pd.read_csv("https://surfdrive.surf.nl/files/index.php/s/Wie88hfXHsOmM86/download?path=%2F&files=countries.csv")

countries

The object created here (countries) is a **DataFrame**.
A `DataFrame` is a 2-dimensional, **tabular data structure** comprised of rows and columns. It is similar to a spreadsheet, a database (SQL) table or the data.frame in R.
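
As a minimal sketch of the same idea without a network connection, the cell below builds a small DataFrame from CSV text held in memory. The country names and numbers are made up for illustration; only the column names mirror the countries dataset.

```python
import io
import pandas as pd

# Made-up CSV text (placeholder countries and values), laid out like countries.csv
csv_text = """name,continent,pop_est,gdp_md_est
Country A,Europe,1200000,35000.0
Country B,Africa,8500000,42000.0
Country C,Asia,560000,9800.0
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column labels
```

The `shape` attribute is a quick way to confirm that the file was parsed into the expected number of rows and columns.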

A DataFrame can store data of different types (including characters, integers, floating point values, categorical data and more) in columns. In pandas, we can check the data types of the columns with the `dtypes` attribute:

![](https://surfdrive.surf.nl/files/index.php/s/Wie88hfXHsOmM86/download?path=%2F&files=dataframe.png)

In [None]:
type(countries)

In [None]:
countries.dtypes
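
To see how pandas infers one data type per column, here is a small sketch on a hand-made DataFrame (the values are invented placeholders). Text columns usually appear as `object` or as a string type, depending on the pandas version, so the checks below look only at the numeric columns.

```python
import pandas as pd

# Invented values, one column per kind of data
sample = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C"],   # text
    "pop_est": [1_200_000, 8_500_000, 560_000],        # whole numbers
    "gdp_md_est": [35_000.0, 42_000.0, 9_800.0],       # decimal numbers
})

print(sample.dtypes)  # one dtype per column

# Helper functions in pd.api.types test the kind of a column
pop_is_integer = pd.api.types.is_integer_dtype(sample["pop_est"])
gdp_is_float = pd.api.types.is_float_dtype(sample["gdp_md_est"])
print(pop_is_integer, gdp_is_float)
```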

### Each column in a `DataFrame` is a `Series`
When selecting a single column of a pandas `DataFrame`, the result is a pandas `Series`, a 1-dimensional data structure.
To select the column, use the column label in between square brackets `[]`.

In [None]:
countries['pop_est']

In [None]:
s = countries['pop_est']
type(s)

### Pandas objects have attributes and methods
Pandas provides a lot of functionalities for the DataFrame and Series. The `.dtypes` shown above is an *attribute* of the DataFrame. In addition, there are also functions that can be called on a DataFrame or Series, i.e. *methods*. As methods are functions, do not forget to use parentheses `()` to call them.
A few examples that can help exploring the data:

In [None]:
countries.head() # Top 5 rows

In [None]:
countries.tail() # Bottom 5 rows

The ``describe`` method computes summary statistics, either for each numeric column of a DataFrame or, as here, for a single Series:

In [None]:
countries['pop_est'].describe()

**Sort**ing your data **by** a specific column is another important first-check:

In [None]:
countries.sort_values(by='pop_est', ascending=False)
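
The following sketch, on invented data, illustrates two details of these exploration methods: `sort_values` returns a new, sorted DataFrame and leaves the original untouched, and the index labels travel with their rows.

```python
import pandas as pd

# Invented placeholder data
sample = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C"],
    "pop_est": [1_200_000, 8_500_000, 560_000],
})

# Sort from largest to smallest population; the result is a new DataFrame
ranked = sample.sort_values(by="pop_est", ascending=False)
top2 = ranked.head(2)                     # two most populous rows

# describe() returns a Series of statistics, indexed by the statistic name
stats = sample["pop_est"].describe()

print(top2)
print(stats["max"], stats["count"])
```

To keep the sorted order, assign the result to a variable as done with `ranked` here.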

## **Basic operations on Series and DataFrames**



### **Elementwise-operations**

The typical arithmetic (+, -, \*, /) and comparison (==, >, <, ...) operations work *element-wise*.

With a scalar:

In [None]:
population = countries['pop_est'].head()
population

In [None]:
population / 1000

In [None]:
population > 1_000_000

With two Series objects:

In [None]:
countries['gdp_md_est'] / countries['pop_est']
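
One detail worth knowing when combining two Series is that pandas matches elements by index label, not by position. The sketch below uses invented numbers to show scalar operations, comparisons, and label alignment.

```python
import pandas as pd

# Invented placeholder values
pop = pd.Series([1_200_000, 8_500_000, 560_000])

pop_thousands = pop / 1000        # every element divided by the scalar
is_large = pop > 1_000_000        # element-wise comparison gives booleans

# Two Series with the same labels in a different order
a = pd.Series([1, 2], index=["x", "y"])
b = pd.Series([10, 20], index=["y", "x"])
total = a + b                     # values are paired by label: x with x, y with y

print(pop_thousands.tolist())
print(is_large.tolist())
print(total["x"], total["y"])
```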

### **Aggregations (reductions)**

Pandas provides a large set of **summary** functions that operate on different kinds of pandas objects (DataFrames, Series, Index) and produce a single value. When applied to a DataFrame, the result is returned as a pandas Series (one value for each column).

For example, the average population is computed as follows (here for the five countries in the `population` Series defined above):

In [None]:
population.mean()

The maximum GDP:

In [None]:
countries['gdp_md_est'].max()

For DataFrames, pass `numeric_only=True` so that only the numeric columns are included in the result:

In [None]:
countries.median(numeric_only=True)
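
As a small illustration on invented data, the sketch below shows that an aggregation on a Series gives a single value, while the same aggregation on a DataFrame gives a Series with one entry per numeric column.

```python
import pandas as pd

# Invented placeholder data
sample = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C"],
    "pop_est": [1_200_000, 8_500_000, 560_000],
    "gdp_md_est": [35_000.0, 42_000.0, 9_800.0],
})

max_gdp = sample["gdp_md_est"].max()       # Series -> single value
means = sample.mean(numeric_only=True)     # DataFrame -> Series, text column skipped

print(max_gdp)
print(means)
```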

### **Adding new columns**

We can add a new column to a DataFrame with similar syntax as selecting a column: create a new column by assigning the output to the DataFrame with a new column name in between the `[]`.

For example, to add the GDP per capita calculated above, we can do:

In [None]:
countries['gdp_capita'] = countries['gdp_md_est'] / countries['pop_est']

In [None]:
countries.head()
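
The sketch below repeats the pattern on invented data and adds a unit conversion. It assumes, as the column name suggests, that `gdp_md_est` is expressed in millions of dollars; the new column name `gdp_capita_usd` is a hypothetical choice for this example.

```python
import pandas as pd

# Invented placeholder data; gdp_md_est is assumed to be in millions of dollars
sample = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C"],
    "pop_est": [1_000_000, 8_000_000, 500_000],
    "gdp_md_est": [20_000.0, 40_000.0, 10_000.0],
})

# Assigning to a new label in [] appends a column, computed element-wise
sample["gdp_capita_usd"] = sample["gdp_md_est"] * 1_000_000 / sample["pop_est"]

print(sample[["name", "gdp_capita_usd"]])
```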

## **Indexing: selecting a subset of the dataframe**

The pandas package offers several ways to subset, filter, and isolate data in your DataFrames.

### **Subset variables (columns)**

You can select a single column using square brackets `[]` with a column name inside. The output is a pandas Series object. A pandas Series is a one-dimensional array containing data of any type, including integer, float, string, boolean, Python objects, etc.

Selecting a **single column**:

In [None]:
countries['pop_est']

Remember that the same syntax can also be used to *add* a new column: `df['new'] = ...`.

We can also select **multiple columns** by passing a list of column names into `[]`. Here, square brackets are used in two different ways: the outer square brackets indicate a subset of the DataFrame, and the inner square brackets create a list.

In [None]:
countries[['name', 'pop_est']] # double [[]]
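
A quick way to see the difference between single and double brackets is to check the type of the result. The sketch below uses invented data.

```python
import pandas as pd

# Invented placeholder data
sample = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C"],
    "pop_est": [1_200_000, 8_500_000, 560_000],
    "gdp_md_est": [35_000.0, 42_000.0, 9_800.0],
})

single = sample["pop_est"]              # a column label -> Series
double = sample[["pop_est"]]            # a list with one label -> DataFrame
two_cols = sample[["name", "pop_est"]]  # a list with two labels -> DataFrame

print(type(single).__name__, type(double).__name__)
print(two_cols.shape)
```

Even with a single column name, the double brackets keep the result two-dimensional, which matters when later code expects a DataFrame.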

### **Subset observations (rows)**
Using `[]` with a slice or a boolean mask selects **rows**:

### **Slicing**

In [None]:
countries[0:4]
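
Note that an integer slice inside `[]` works on row positions, not on index labels. The sketch below makes this visible by giving an invented DataFrame the index labels 10, 20, 30 and 40.

```python
import pandas as pd

# Invented placeholder data with index labels that differ from the positions
sample = pd.DataFrame(
    {"name": ["Country A", "Country B", "Country C", "Country D"]},
    index=[10, 20, 30, 40],
)

first_two = sample[0:2]   # positions 0 and 1, the end position is excluded

print(first_two)
```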

### **Boolean indexing (filtering)**

Often, you want to select rows based on a certain condition. This can be done with *'boolean indexing'* (like a WHERE clause in SQL).

The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.

In [None]:
# taking the first 5 rows to illustrate
df = countries.head()
df

In [None]:
mask = df['pop_est'] > 1_000_000
mask

The `mask` in the above cell returns boolean values (True and False). True for rows which have 'pop_est' greater than 1_000_000 and False for others. Using this mask, you can select rows of the dataframe with 'pop_est' greater than 1,000,000 as shown below.

In [None]:
df[mask]

In [None]:
# or in one go
df[df['pop_est'] > 1_000_000]

With the full dataset:

In [None]:
countries[countries['gdp_md_est'] > 5_000_000]

In [None]:
countries[countries['continent'] == "Oceania"]

Rows can also be selected by membership using the `.isin()` method. For example, to select the rows whose index labels are in `range(2, 10)`, that is 2 through 9:

In [None]:
countries[countries.index.isin(range(2,10))]
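
The `.isin()` method is perhaps most often applied to a column rather than to the index. As a sketch on invented data, the cell below keeps the rows whose continent is one of several values.

```python
import pandas as pd

# Invented placeholder data
sample = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C", "Country D"],
    "continent": ["Europe", "Africa", "Asia", "Africa"],
})

# True where the continent is one of the listed values
mask = sample["continent"].isin(["Africa", "Asia"])
subset = sample[mask]

print(subset)
```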

It is also possible to select rows by label or by position using `.loc[]` and `.iloc[]` ("location" and "integer location"). `.loc[]` uses labels (or a boolean mask) to point to a row, column or cell, whereas `.iloc[]` uses the integer position.

In [None]:
countries.loc[1:5] # rows with index labels 1 through 5 (end label included)

In [None]:
countries.iloc[1:5] # rows at positions 1 through 4 (end position excluded)

A good illustration of the difference between `loc[]` and `iloc[]` is the example below, taken from Stack Overflow.

In [None]:
s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])

In [None]:
s.loc[0]    # value at index label 0

In [None]:
s.iloc[0]   # value at index location 0

In [None]:
s.loc[0:1]  # rows at index labels between 0 and 1 (inclusive)

In [None]:
s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
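
The same rules carry over to DataFrames, where `.loc[]` and `.iloc[]` accept a row selector and a column selector separated by a comma. The following sketch uses invented data with index labels 10, 20 and 30.

```python
import pandas as pd

# Invented placeholder data with non-default index labels
sample = pd.DataFrame(
    {
        "name": ["Country A", "Country B", "Country C"],
        "pop_est": [1_200_000, 8_500_000, 560_000],
    },
    index=[10, 20, 30],
)

by_label = sample.loc[20, "pop_est"]   # row label 20, column label "pop_est"
by_position = sample.iloc[1, 1]        # second row, second column
label_slice = sample.loc[10:20]        # labels 10 and 20, end label included
position_slice = sample.iloc[0:1]      # position 0 only, end position excluded

# A boolean mask for the rows combined with a list of column labels
filtered = sample.loc[sample["pop_est"] > 1_000_000, ["name"]]

print(by_label, by_position)
print(filtered)
```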

**An overview of the possible comparison operations:**

Operator   |  Description
------ | --------
==       | Equal
!=       | Not equal
\>       | Greater than
\>=       | Greater than or equal
<       | Less than
<=       | Less than or equal

and to combine multiple conditions:

Operator   |  Description
------ | --------
&       | And (`cond1 & cond2`)
\|       | Or (`cond1 \| cond2`)
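
When combining conditions, each condition needs its own parentheses, because `&` and `|` bind more tightly than comparison operators in Python. The sketch below uses invented data; the `~` operator (negation) is not in the table above but is included here as a related, commonly used operator.

```python
import pandas as pd

# Invented placeholder data
sample = pd.DataFrame({
    "name": ["Country A", "Country B", "Country C", "Country D"],
    "continent": ["Europe", "Africa", "Asia", "Africa"],
    "pop_est": [1_200_000, 8_500_000, 560_000, 300_000],
})

# And: both conditions must hold
big_african = sample[(sample["pop_est"] > 1_000_000) & (sample["continent"] == "Africa")]

# Or: at least one condition must hold
small_or_asian = sample[(sample["pop_est"] < 500_000) | (sample["continent"] == "Asia")]

# Not: invert a condition with ~
not_african = sample[~(sample["continent"] == "Africa")]

print(list(big_african["name"]))
print(list(small_or_asian["name"]))
print(list(not_african["name"]))
```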



---



# **2. Exercises**

<div class="alert alert-success">

**EXERCISE 1**:

>What is the population of Canada in the countries dataframe?


<details>
  <summary>Hints</summary>

*  Use Boolean indexing by the country name and get the population estimate.
* `countries[countries['name'] == 'the country name here']['pop_est']`

</details>
    
</div>

<div class="alert alert-success">

**EXERCISE 2**:

>On which continent is Trinidad and Tobago located?


<details>
  <summary>Hints</summary>

*  Use Boolean indexing by the country name and get the continent.
* `countries[countries['name'] == 'the country name here']['continent']`

</details>
    
</div>

<div class="alert alert-success">

**EXERCISE 3**:

>From the countries dataframe, how many countries are in Europe?


<details>
  <summary>Hints</summary>

*  Use Boolean indexing on the continent column to filter the countries in Europe, then get the length of the filtered dataframe.
* `europe_countries = countries[countries['continent'] == 'the continent you want']`
* `length = len(europe_countries)`

</details>
    
</div>

<div class="alert alert-success">

**EXERCISE 4**:

>From the dataframe you got in exercise 3, compute the total population estimate of Europe.


<details>
  <summary>Hints</summary>

*  Get the 'pop_est' column from the `europe_countries` dataframe and compute the sum of the column.
* `popn_europe = europe_countries['pop_est']`
* `popn_europe.sum()`

</details>
    
</div>



---



# **3. GeoPandas**


Geospatial data is often available from specific GIS file formats or data stores, like ESRI shapefiles, GeoJSON files, geopackage files, PostGIS (PostgreSQL) database, ...

We can use the GeoPandas library to read many of those GIS file formats with the `geopandas.read_file` function. Under the hood it relies on the `fiona` or `pyogrio` library (depending on the GeoPandas version), both of which are interfaces to GDAL/OGR.

For example, a shapefile with all the countries of the world can be read as follows:

In [None]:
import geopandas as gpd

In [None]:
countries = gpd.read_file("https://surfdrive.surf.nl/files/index.php/s/Wie88hfXHsOmM86/download?path=%2F&files=ne_110m_admin_0_countries.zip")
countries

In [None]:
countries.head()

In [None]:
countries.plot()

## **What's a GeoDataFrame?**

We used the GeoPandas library to read the geospatial data, and `read_file` returned a `GeoDataFrame`.

A GeoDataFrame contains a tabular, geospatial dataset:

* It has a **'geometry' column** that holds the geometry information (or features in GeoJSON).
* The other columns are the **attributes** (or properties in GeoJSON) that describe each of the geometries

Such a `GeoDataFrame` is just like a pandas `DataFrame`, but with some additional functionality for working with geospatial data:

* A `.geometry` attribute that always returns the column with the geometry information (returning a GeoSeries). The column name itself does not necessarily need to be 'geometry', but it will always be accessible as the `.geometry` attribute.
* It has some extra methods for working with spatial data (area, distance, buffer, intersection, ...), which we will learn in later notebooks

In [None]:
countries.geometry

In [None]:
type(countries.geometry)

In [None]:
countries.geometry.area
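
To make the `.geometry` attribute and the `area` property concrete, here is a minimal sketch that builds a GeoDataFrame by hand from two invented square parcels, using `shapely` (a dependency of GeoPandas). The parcel names and the column name `footprint` are hypothetical. Keep in mind that `area` is expressed in the squared units of the coordinate system, so for longitude/latitude data such as the countries dataset the values are in squared degrees rather than square kilometres.

```python
import geopandas as gpd
from shapely.geometry import box

# Two invented square parcels: 1 x 1 and 2 x 2 coordinate units
parcels = gpd.GeoDataFrame(
    {"name": ["Parcel A", "Parcel B"]},
    geometry=[box(0, 0, 1, 1), box(0, 0, 2, 2)],
)

areas = parcels.geometry.area             # one area per geometry

# The geometry column can be renamed; .geometry still points to it
renamed = parcels.rename_geometry("footprint")

print(areas.tolist())
print(renamed.geometry.name)
```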

**It's still a DataFrame**, so all the Pandas functionality can be used on the geospatial dataset, which lets us manipulate the attributes and the geometry information together.

For example, the average population over all countries can be calculated by accessing the 'pop_est' column and calling the `mean` method on it:

In [None]:
countries['pop_est'].mean()

Or, we can use boolean filtering to select a subset of the dataframe based on a condition:

In [None]:
africa = countries[countries['continent'] == 'Africa'] # Selecting the African continent

In [None]:
africa.plot() #plotting continent Africa
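
As a small check on invented point data, the sketch below shows that a boolean filter on a GeoDataFrame returns another GeoDataFrame, so the geometries stay attached to the selected rows and the usual pandas aggregations still work. The site names and values are placeholders.

```python
import geopandas as gpd
from shapely.geometry import Point

# Invented placeholder sites with made-up attributes
places = gpd.GeoDataFrame(
    {
        "name": ["Site A", "Site B", "Site C"],
        "continent": ["Africa", "Europe", "Africa"],
        "pop_est": [100, 200, 300],
    },
    geometry=[Point(0, 0), Point(1, 1), Point(2, 2)],
)

african = places[places["continent"] == "Africa"]   # still a GeoDataFrame
mean_pop = african["pop_est"].mean()                # plain pandas aggregation

print(type(african).__name__, len(african), mean_pop)
```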

## **References**


1.   This notebook is compiled from the notebooks in [DS-python-geospatial](https://github.com/jorisvandenbossche/DS-python-geospatial).
2.   The official **[10 minutes to pandas](https://pandas.pydata.org/docs/user_guide/10min.html#min)** guide is a very good introduction for beginners.







