# GEOG 5160 6160 Lab 01

This lab provides a brief overview of using pandas and Python for data manipulation. Some examples and the introduction are taken from https://ourcodingclub.github.io

`pandas` is a hugely popular, and still growing, Python library. It is used across a range of disciplines, including environmental and climate science, social science, linguistics, and biology, as well as in industry applications such as data analytics and financial trading.

`pandas` is designed to help you deal with a whole set of tasks necessary for data analysis and manipulation. It simplifies the loading of data from external sources such as text files and databases, as well as providing ways of analyzing and manipulating data once it is loaded into your computer. The features provided in `pandas` automate and simplify a lot of the common tasks that would take many lines of code to write in the basic Python language. 

If you have used R’s dataframes before, or the `numpy` package in Python, you may find some similarities in the Python `pandas` package. But if not, don’t worry because this tutorial doesn’t assume any knowledge of NumPy or R, only basic-level Python.

To use `pandas` functions, you will first need to `import` the package using the following code. By convention, the `pandas` module is almost always imported `as pd`. Every time we use a pandas feature thereafter, we can shorten what we type by just typing `pd`, such as `pd.some_function()`.

In [1]:
import pandas as pd

## pandas data types

### Series

A series is a one-dimensional array-like structure designed to hold a single vector of data and an associated index. You can easily create a series using the `pd.Series()` function and a Python list. Here we use a set of numbers:

In [2]:
my_series = pd.Series([4.6, 2.1, -4.0, 3.0])
print(my_series)

0    4.6
1    2.1
2   -4.0
3    3.0
dtype: float64


Note the list of numbers, organized into a single column on the right, and the index on the left (0 - 3). Note that Python is zero-indexed, i.e. indices start at 0, not 1. If you just want the values, not the index:

In [3]:
print(my_series.values)

[ 4.6  2.1 -4.   3. ]
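
The index does not have to be the default 0, 1, 2, and so on. As a minimal sketch, the example below attaches text labels (the names `temps` and `"a"` to `"d"` are made up for illustration) and then looks up the same value by label and by integer position:

```python
import pandas as pd

# Hypothetical labels used as the index instead of the default 0..3
temps = pd.Series([4.6, 2.1, -4.0, 3.0], index=["a", "b", "c", "d"])

print(temps["c"])         # look up a value by its label
print(temps.iloc[2])      # look up the same value by integer position
print(list(temps.index))  # the labels on their own
```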


### Data frames

DataFrames represent tabular data, a bit like a spreadsheet. These are organized into columns (each of which is a Series), which generally represent variables or features, and rows, which generally represent observations. Each column can store a single data-type, e.g. floating point numbers, strings, boolean values etc. DataFrames can be indexed by either their row or column names. 

DataFrames can be created from a Python dictionary, or by loading in a text file containing tabular data.

#### Dictionaries

Dictionaries are Python structures that contain a set of `key:value` pairs. In the most basic form, each key maps to a single value. To build a DataFrame, however, each key maps to a list of values. The key becomes the column heading, and the list supplies that column's values (one per observation). The following creates a dictionary with some basic information about a set of Scottish hills, then uses `pd.DataFrame()` to convert it to a DataFrame:

In [None]:
scottish_hills = {'Hill Name': ['Ben Nevis', 'Ben Macdui', 'Braeriach', 'Cairn Toul', 'Sgòr an Lochain Uaine'],
                  'Height': [1345, 1309, 1296, 1291, 1258],
                  'Latitude': [56.79685, 57.070453, 57.078628, 57.054611, 57.057999],
                  'Longitude': [-5.003508, -3.668262, -3.728024, -3.71042, -3.725416]}
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)

The dictionary keys are used as column headers, and each list becomes a column of values (a Series). Note that a row index (0 to 4) was generated automatically.
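
One practical consequence is that every list in the dictionary needs the same number of values, so that the columns can be lined up into rows. The sketch below uses a cut-down, two-hill version of the dictionary (the variable names `ok` and `mismatch_failed` are just for illustration) and shows that unequal lengths raise a `ValueError`:

```python
import pandas as pd

# Both lists have length 2, so the columns line up into two rows
ok = pd.DataFrame({"Hill Name": ["Ben Nevis", "Ben Macdui"],
                   "Height": [1345, 1309]})
print(ok.shape)

# Lists of different lengths cannot be lined up into rows
try:
    pd.DataFrame({"Hill Name": ["Ben Nevis", "Ben Macdui"],
                  "Height": [1345]})
    mismatch_failed = False
except ValueError as err:
    mismatch_failed = True
    print("ValueError:", err)
```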

#### Reading from files

`pandas` has a built-in function (`pd.read_csv()`) that reads comma-separated value (CSV) files directly into a DataFrame. An alternative is Python's built-in `csv` module, although it returns plain rows rather than a DataFrame. Here, we'll read in the penguins dataset and use it to create a new DataFrame called `penguins`:

In [None]:
penguins = pd.read_csv("../datafiles/penguins.csv")
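
If you do not have the file to hand, you can try `pd.read_csv()` on text held in memory. The sketch below uses a tiny made-up CSV (three rows, not the real penguins data) wrapped in `io.StringIO` so that it behaves like a file. Note how the empty field is read as a missing value:

```python
import io
import pandas as pd

# A tiny made-up CSV in the same style as the penguins file
csv_text = """species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Torgersen,
Gentoo,Biscoe,5000
"""

demo = pd.read_csv(io.StringIO(csv_text))
print(demo.shape)                        # (rows, columns)
print(demo.dtypes)                       # data type of each column
print(demo["body_mass_g"].isna().sum())  # number of missing values
```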

DataFrames have a set of attributes and methods that can be used to find out basic information about them. For example, `shape` returns the dimensions of the data (here, 344 rows and 8 columns):

In [None]:
penguins.shape

Use `head(6)` to get the first six rows (changing the `6` will show more or fewer rows):

In [None]:
penguins.head(6)

And you can use `tail()` to see the last few rows:

In [None]:
penguins.tail(4)

If you want to just see the column names:

In [None]:
penguins.columns.values

Use `describe()` to get summary statistics on the columns (numeric only):

In [None]:
penguins.describe()

## Accessing values

You can access individual series in the DataFrame using the column name:

In [None]:
penguins['flipper_length_mm']

Note that the 4th observation has a missing value marked by `NaN`. The pandas method `dropna()` removes every observation (row) that has a missing value in any column. Here, we create a new DataFrame, then extract the flipper length values again. Note that the length (shown at the bottom) has decreased.

In [None]:
penguins2 = penguins.dropna()
penguins2['flipper_length_mm']
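
Because `dropna()` checks every column by default, it can discard rows that are complete for the variable you care about. The sketch below uses a small made-up DataFrame to show the `subset` argument, which limits the check to the named columns:

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 lacks a flipper length, row 2 lacks a sex
demo = pd.DataFrame({
    "flipper_length_mm": [181.0, np.nan, 195.0],
    "sex": ["male", "female", None],
})

all_complete = demo.dropna()                               # drops rows 1 and 2
flipper_known = demo.dropna(subset=["flipper_length_mm"])  # drops row 1 only
print(len(demo), len(all_complete), len(flipper_known))
```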

You can also access individual Series using a `.` notation. 

In [None]:
penguins.flipper_length_mm

You can also access values in a pandas DataFrame using a row, column index. The format for this is `dataframe.iloc[row]` or `dataframe.iloc[row, col]`. The `iloc` is important here - it tells pandas that you are using integer indices to access the data. To access the first row, simply enter

In [None]:
penguins.iloc[0]

To access the first value of the first row:

In [None]:
penguins.iloc[0,0]

An equivalent approach would be to extract the first Series (the `species` column) and get the first value

In [None]:
penguins.species.iloc[0]

If you want to access a range of rows and/or columns, you can use a colon to indicate the start and end of the range you want. To extract the first 3 rows of our DataFrame (note that the end of the range is the row after the range we want):

In [None]:
penguins.iloc[0:3]

You can do the same to extract a range of columns. Here, we'll extract the 4th to 6th rows (`3:6`) and the 3rd to 6th columns (`2:6`):

In [None]:
penguins.iloc[3:6, 2:6]
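
pandas also provides `.loc`, which selects by index label and column name rather than by integer position. The sketch below, on a small made-up DataFrame, highlights one difference that often catches people out: an `iloc` slice excludes its end point, while a `loc` slice includes it.

```python
import pandas as pd

demo = pd.DataFrame({"species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
                     "body_mass_g": [3750, 5000, 3500, 3800]})

by_position = demo.iloc[0:2]   # positions 0 and 1; the end is excluded
by_label = demo.loc[0:2]       # labels 0, 1 and 2; the end is included
print(len(by_position), len(by_label))

# .loc accepts a row label and a column name
print(demo.loc[1, "species"])
```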

To add a new column to a DataFrame, simply specify the new Series name within `[]` and provide the vector of values. For example, to include an observation index starting at 1:

In [None]:
penguins['id'] = range(1, len(penguins) + 1)  # one id per row, whatever the row count
penguins.head()

Unwanted columns can be removed using `drop()`:

In [None]:
penguins = penguins.drop('id', axis = 1)
penguins.head()
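
New columns are often calculated from existing ones. As a minimal sketch on made-up values (the column name `body_mass_kg` is an illustration, not part of the penguins file), the example below adds a derived column, then removes a helper column using the `columns` argument of `drop()`, which is equivalent to `axis = 1`:

```python
import pandas as pd

demo = pd.DataFrame({"body_mass_g": [3750, 5000, 3500]})

demo["id"] = range(1, len(demo) + 1)               # helper column
demo["body_mass_kg"] = demo["body_mass_g"] / 1000  # derived column, element by element
demo = demo.drop(columns="id")                     # same effect as drop('id', axis = 1)
print(demo)
```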

## Conditional selection

You can use the values in any of the DataFrame series to conditionally filter parts of the dataset. For example, if we want to find all the rows corresponding to female penguins:

In [None]:
penguins.sex == 'female'

This returns a Boolean Series, with all 'female' rows marked as `True`. If we incorporate this as an index, then this will only return the rows that meet the condition (i.e. are `True`):

In [None]:
penguins[penguins.sex == 'female']

And we can easily use this to create a new DataFrame with only the female penguin data.

In [None]:
females = penguins[penguins['sex'] == 'female']
females.shape

You can also use conditional selection with numeric values. Here, we'll create a subset of all penguins weighing over 5kg:

In [None]:
bigbirds = penguins[penguins['body_mass_g'] > 5000]
bigbirds.shape

You can combine multiple conditions using the element-wise operators `&` (AND), `|` (OR) and `~` (NOT). Wrap each condition in parentheses, because these operators bind more tightly than comparisons such as `==`. In this example, we extract all Adelie penguins from Biscoe Island:

In [None]:
adelie_biscoe = penguins[(penguins['species'] == 'Adelie') & (penguins['island'] == 'Biscoe')]
adelie_biscoe.shape
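
The other operators work in the same way. The sketch below uses a small made-up DataFrame to show `|` for OR, `~` to negate a Boolean Series, and the `isin()` method as a shorter way to match any of several values:

```python
import pandas as pd

demo = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Biscoe", "Biscoe", "Dream", "Dream"],
})

# OR: Gentoo penguins, or any penguin on Dream
either = demo[(demo["species"] == "Gentoo") | (demo["island"] == "Dream")]

# NOT: everything except Adelie
not_adelie = demo[~(demo["species"] == "Adelie")]

# isin(): rows whose species matches any value in the list
two_species = demo[demo["species"].isin(["Adelie", "Chinstrap"])]

print(len(either), len(not_adelie), len(two_species))
```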

## Split, apply, combine

pandas DataFrames include a useful method `groupby()`. This allows you to form subgroups of the data and run some functions on each one without having to create a new DataFrame for each subgroup. This is called split, apply, combine as we need to do three steps:

- split involves breaking up a data frame into groups
- apply involves running some function for each group
- combine merges the results into a new table
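
To make the three steps concrete, the sketch below carries them out by hand on a small made-up DataFrame, then checks the result against `groupby()`. The loop is only for illustration; in practice `groupby()` does all of this for you.

```python
import pandas as pd

demo = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "body_mass_g": [3400, 4000, 3600, 4200],
})

means = {}
for sex_value in demo["sex"].unique():
    subset = demo[demo["sex"] == sex_value]          # split: one group at a time
    means[sex_value] = subset["body_mass_g"].mean()  # apply: a function per group
by_hand = pd.Series(means).sort_index()              # combine: gather the results

with_groupby = demo.groupby("sex")["body_mass_g"].mean()
print(by_hand)
print(with_groupby)
```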

To demonstrate this, we'll calculate the mean body mass of female and male penguins. To start, we'll just calculate the mean for the whole dataset. We can do this by simply calling the `mean()` method on the Series:

In [None]:
penguins.body_mass_g.mean()

Now let's group the penguins by `sex`:

In [None]:
penguins.groupby('sex')

This returns a `DataFrameGroupBy` object, which records how the rows are grouped but does not calculate anything yet. To actually calculate the mean weight, we can simply append the `mean()` method to this:

In [None]:
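penguins.groupby('sex').mean(numeric_only = True)

As we didn't specify which Series we wanted the mean for, pandas has calculated it for every numeric Series. The `numeric_only = True` argument tells `mean()` to skip text columns such as `species`; without it, recent versions of pandas raise an error. To only get the body mass values, simply include this following the `groupby()` method: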

In [None]:
penguins.groupby('sex').body_mass_g.mean()

The `groupby()` method also works with two (or more) groups. Simply specify each group name between `[]`. To get the mean body mass for each species on each island:

In [None]:
penguins.groupby(['species', 'island'])['body_mass_g'].mean()
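
If you need more than one summary statistic per group, the `agg()` method accepts a list of function names. The sketch below uses made-up values, including one missing body mass, to show that `count` reports only the non-missing values in each group:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "body_mass_g": [3700.0, 3900.0, 5000.0, np.nan],
})

# One row per species, one column per statistic
summary = demo.groupby("species")["body_mass_g"].agg(["mean", "count"])
print(summary)
```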

For two groups, pandas DataFrames have a `pivot_table()` method that allows you to make a better formatted version of this table. By default, this will calculate the mean of the variable for each combination of the `index` and `columns`:

In [None]:
penguins.pivot_table('body_mass_g', index = 'sex', columns = 'species')
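
The mean is only the default. As a sketch on made-up values, the `aggfunc` argument of `pivot_table()` lets you choose a different summary, here the maximum body mass in each sex and species combination:

```python
import pandas as pd

demo = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
    "body_mass_g": [3400, 4000, 4700, 5500, 4900],
})

# Rows are sexes, columns are species, cells hold the largest body mass
table = demo.pivot_table(values="body_mass_g", index="sex",
                         columns="species", aggfunc="max")
print(table)
```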

## Plotting

### matplotlib

We will run through some simple plotting examples, beginning with matplotlib, one of the oldest and most established plotting packages for Python. First, import it:

In [None]:
import matplotlib.pyplot as plt

Unlike the previous packages, we are importing only a submodule of matplotlib called `pyplot`, which has a more user friendly interface than the basic package. To import a submodule, we specify `import module.submodule`. 

If you are using a Jupyter notebook, the following code will force any plot made to appear in the notebook:

In [None]:
%matplotlib inline

We'll now plot a couple of variables from the penguins dataset as a scatter plot. The function is `plt.scatter`, and we need to specify the x and y variable, here as pandas Series:

In [None]:
plt.scatter(x = penguins.bill_length_mm, y = penguins.bill_depth_mm)

The basic function produces a plot without annotations. Matplotlib works as a series of layers, and you can easily add axis labels and a title as follows:

In [None]:
plt.scatter(x = penguins.bill_length_mm, y = penguins.bill_depth_mm)
plt.xlabel("Bill length (mm)")  # name the variable as well as the unit
plt.ylabel("Bill depth (mm)")
plt.title("Bill length vs bill depth")
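
The `plt.` functions above all act on whichever figure is current. matplotlib also has an object-oriented style, in which `plt.subplots()` returns a figure and an axes object and you call methods on the axes directly. The sketch below uses a handful of made-up bill measurements rather than the penguins file:

```python
import matplotlib.pyplot as plt

# Made-up measurements, for illustration only
bill_length_mm = [39.1, 46.5, 50.0, 38.2]
bill_depth_mm = [18.7, 17.9, 15.2, 18.1]

fig, ax = plt.subplots()                   # one figure containing one set of axes
ax.scatter(bill_length_mm, bill_depth_mm)  # draw on that specific set of axes
ax.set_xlabel("Bill length (mm)")
ax.set_ylabel("Bill depth (mm)")
ax.set_title("Bill length vs bill depth")
```

This style becomes more useful once a figure holds several subplots, because each one has its own axes object to modify.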

Matplotlib comes with a set of plot styles that change the plot layout, background color, etc. Use `plt.style.use()` to change the style. Here, we'll change to the `classic` style. The full set of styles can be found here:

https://matplotlib.org/3.2.1/gallery/style_sheets/style_sheets_reference.html

In [None]:
plt.style.use('classic')
plt.scatter(x = penguins.bill_length_mm, y = penguins.bill_depth_mm)
plt.xlabel("Bill length (mm)")
plt.ylabel("Bill depth (mm)")
plt.title("Bill length vs bill depth")

Histograms can be made with the `plt.hist()` function. Note the use of `color` to set the fill and `ec` (short for `edgecolor`) to set the outline:

In [None]:
plt.hist(penguins.bill_length_mm, color = "skyblue", ec="white")
plt.xlabel("Bill length (mm)")
plt.ylabel("Frequency")
plt.title("Histogram of bill length")

### Seaborn

The matplotlib package has a great deal more flexibility than we can go through here. In particular, the ability to add elements individually allows you to make some very complex figures, but may take some time and trial and error to complete. A newer package, seaborn, was developed on top of matplotlib and provides a much simpler way to make complex figures, including easy use of color and size to represent data values. Start by loading this (it is usually imported with the alias `sns`):

In [None]:
import seaborn as sns
sns.set()

To start with, let's make a simple bar plot of the Scottish hills data. This shows the basic format for a seaborn plot: `sns.plottype(x, y, data)`:

In [None]:
sns.barplot(x = "Hill Name", y = "Height", data = dataframe)  # the DataFrame built from scottish_hills

Now, we'll use the penguins data set to make a scatterplot of bill length and bill depth:

In [None]:
sns.scatterplot(x = "bill_length_mm", y = "bill_depth_mm", data = penguins)

Like matplotlib, the seaborn package has a set of themes to change the layout of your plots. Here, we'll change to a white background with gridlines:

In [None]:
sns.set_theme(style="whitegrid")
sns.scatterplot(x = "bill_length_mm", y = "bill_depth_mm", data = penguins)

seaborn makes it easy to include a third (and possibly fourth) Series of data in the plot to control the size and color of symbols and lines. Here, we'll include the body mass Series to change the symbol size:

In [None]:
sns.scatterplot(x = "bill_length_mm", y = "bill_depth_mm", 
                size = "body_mass_g",
                data = penguins)

If you'd prefer to set the symbol color rather than size, then use the `hue` argument. We also use the `s` argument to increase the size of the symbols (try changing this to see the effect):

In [None]:
sns.scatterplot(x = "bill_length_mm", y = "bill_depth_mm", 
                s = 200, hue = "body_mass_g",
                data = penguins)

seaborn has a large set of built in color palettes (including the ColorBrewer palettes). To change the palette, simply use `palette = 'name'`. Here we'll use the ColorBrewer Blues palette. We also change the symbol outline to black to help highlight symbols with a lighter shade of blue. 

Further information about seaborn's color palettes can be found here:
https://seaborn.pydata.org/tutorial/color_palettes.html

In [None]:
sns.scatterplot(x = "bill_length_mm", y = "bill_depth_mm", 
                s = 100, hue = "body_mass_g",
                palette = 'Blues', edgecolor = 'black',
                data = penguins)

If you use a pandas Series with character strings, seaborn will use this to color by group level. For example, if we want to illustrate the difference between the different species of penguin:

In [None]:
sns.scatterplot(x = "bill_length_mm", y = "bill_depth_mm", 
                hue = "species",
                data = penguins)

Because seaborn is built on top of matplotlib, it returns a matplotlib axes object that can be modified with matplotlib methods. Here, we save the output in an object called `ax`. We can then add different axis labels and a title:

In [None]:
ax = sns.scatterplot(x = "bill_length_mm", y = "bill_depth_mm", 
                     hue = "species",
                     data = penguins)
ax.set(xlabel = "Bill length (mm)", ylabel = "Bill depth (mm)")
ax.set_title("Penguins, penguins, penguins")

If you have a dataset with a set of groups, seaborn can use facets to split the full dataset into subplots, one per level of a group. This is done in two steps. First, we define which Series is to be used to partition the data, with the `col` argument used to arrange the group plots as columns in the final plot. Second, we use the `map()` method to draw the same type of plot for each level of the requested group:

In [None]:
g = sns.FacetGrid(penguins, col="species")
g.map(sns.scatterplot, "bill_length_mm", "bill_depth_mm")

With two groupings, you can split the data by each; one along the rows and one along the columns of the final plot. To illustrate this, we'll split the penguins data by both species (columns) and sex (rows):

In [None]:
g = sns.FacetGrid(penguins, col = "species", row = "sex")
g.map(sns.scatterplot, "bill_length_mm", "bill_depth_mm")

#### Other plot types

Boxplots

In [None]:
sns.set_theme(style="ticks")
sns.boxplot(x = "species", y = "body_mass_g", data = penguins)

Histograms (single variable)

In [None]:
sns.displot(x = "body_mass_g", data = penguins)

Histograms by group

In [None]:
sns.displot(x = "body_mass_g", hue = "sex", data = penguins)

Stacked histograms

In [None]:
sns.displot(x = "body_mass_g", hue = "sex", data = penguins, multiple = "stack")

Density plots

In [None]:
sns.displot(x = "body_mass_g", hue = "species", data = penguins, 
            kind = "kde", fill = True)

Stacked density plots

In [None]:
sns.displot(x = "body_mass_g", hue = "species", data = penguins, 
            kind = "kde", fill = True, multiple = "stack")