# Data Exploration with Python and Jupyter

Basic usage of the Pandas library to download a dataset,
explore its contents, clean up missing or invalid data,
filter the data according to different criteria,
and plot visualizations of the data.

- [Part 1: Python and Jupyter](https://ssciwr.github.io/jupyter-data-exploration)
- **Part 2: Pandas with toy data**
- [Part 3: Pandas with real data](https://ssciwr.github.io/jupyter-data-exploration/pandas-real-data.slides.html)

*Press `Spacebar` to go to the next slide (or `?` to see all navigation shortcuts)*

# Pandas
is a Python library for data analysis and manipulation

In [None]:
# Import the Pandas library
import pandas as pd

In [None]:
# Import some toy data as a pandas DataFrame
df = pd.read_csv("https://ssciwr.github.io/jupyter-data-exploration/data.csv")

In [None]:
type(df)

In [None]:
len(df)

In [None]:
# Display the first few rows of data
df.head()

In [None]:
# Display general DataFrame info (columns, entries, types)
df.info()

# Selecting rows and columns

Three main ways of doing this:

- Python-style indexing operator `[]`
- Pandas `loc` indexer (label-based)
- Pandas `iloc` indexer (position-based)

We'll start with the more intuitive Python-style methods, and later move on to the more powerful `loc` and `iloc` alternatives.

In [None]:
# A DataFrame is a bit like a Dictionary - we can look up columns by name
names = df["Name"]

In [None]:
# A column of a DataFrame is a Series
type(names)

In [None]:
names.head()

In [None]:
# A Series is a bit like a List - we can select items by index
# (here the index labels are 0, 1, 2, ... so label 0 is also the first item)
names[0]

In [None]:
# Here are the first three items:
names[0:3]

In [None]:
# Can also iterate over items
for name in names:
    print(name, "", end="")

In [None]:
# alternative syntax: dataframe.column_name
# both of these are equivalent:
ages1 = df["Age"]
print(ages1.head())

ages2 = df.Age
print(ages2.head())

# note: this only works if the column label is a valid Python identifier (e.g. it can't contain a space)
# and doesn't clash with the name of an existing DataFrame attribute or method
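
The limits of the `dataframe.column_name` syntax can be sketched with a small made-up DataFrame. The data and the `count` column below are illustrative assumptions, not part of the toy dataset:

```python
import pandas as pd

# Hypothetical data, not the toy dataset used above
people = pd.DataFrame(
    {
        "Age": [7, 9, 11],
        "Eye colour": ["blue", "brown", "green"],
        "count": [1, 2, 3],
    }
)

# Attribute access works for a simple column name
same = people.Age.equals(people["Age"])
print(same)

# "Eye colour" contains a space, so only the [] form can reach it
print(people["Eye colour"].tolist())

# "count" clashes with the DataFrame.count method:
# the attribute is the method, and only the [] form returns the column
print(callable(people.count))
print(people["count"].tolist())
```

Using `[]` everywhere avoids both problems, at the cost of a little more typing.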

# iloc
- select data based on its *location*
- first specify row(s), then column(s): `df.iloc[row, col]`
- treating dataset as a matrix, or a list of lists
- note: slices are *exclusive*, i.e. `df.iloc[0:2]` returns rows 0 and 1, but not 2

In [None]:
# First row of data (column is implicitly "all" if not specified)
df.iloc[0]

In [None]:
# First row of data (using : slice operator to select all columns)
df.iloc[0, :]

In [None]:
# First column of data
df.iloc[:, 0].head()

In [None]:
# Can select slices of rows and columns: e.g. first 3 rows, last 2 columns
df.iloc[0:3, -2:]

In [None]:
# Can also select a list of indices, e.g. rows 3,5,7, columns 3,5
df.iloc[[3, 5, 7], [3, 5]]

# loc
- select data based on its index *label* and column *label*, instead of location
- first specify row(s), then column(s): `df.loc[:, "Name"]`
- often the most useful method
- note: slices are *inclusive*, i.e. `df.loc[0:2]` returns rows 0 and 1 *and* 2
- note: in our example, the index label is a number, and it is the same as the row index, but in general this is not the case

In [None]:
# Row with index label 0 (column is implicitly "all" if not specified)
df.loc[0]

In [None]:
# Row with index label 0 (using : slice operator to select all columns)
df.loc[0, :]

In [None]:
# "Name" column of data (using : slice operator to select all rows)
df.loc[:, "Name"].head()

In [None]:
# Can also select a list of labels, e.g. index labels 3,5,7, columns "Height","Wears glasses"
df.loc[[3, 5, 7], ["Height", "Wears glasses"]]
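
In the toy data the index labels happen to match the row positions, which hides the difference between `loc` and `iloc`. A minimal sketch with a made-up DataFrame whose labels are 10, 20, 30, 40 (an assumption chosen for illustration) makes the difference visible:

```python
import pandas as pd

# Hypothetical data with index labels that differ from the row positions
pets = pd.DataFrame(
    {"Name": ["Ada", "Bo", "Cy", "Di"], "Age": [3, 5, 2, 8]},
    index=[10, 20, 30, 40],
)

# iloc uses positions: rows at positions 0 and 1 (end of slice excluded)
by_position = pets.iloc[0:2]

# loc uses labels: rows labelled 10, 20 and 30 (end of slice included)
by_label = pets.loc[10:30]

print(by_position["Name"].tolist())
print(by_label["Name"].tolist())

# Position 0 and label 10 refer to the same row in this DataFrame
print(pets.iloc[0]["Name"] == pets.loc[10]["Name"])
```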

# Conditionals
- a statement that is either true or false
  - `a == b` : true if `a` is equal to `b`
  - `a != b` : true if `a` is not equal to `b`
  - `a > b` : true if `a` is greater than `b`
  - `a >= b` : true if `a` is greater than or equal to `b`
  - `a < b` : true if `a` is less than `b`
  - `a <= b` : true if `a` is less than or equal to `b`
- they can be combined
  - `a & b` : true if a and b are both true, otherwise false
  - `a | b` : true if a or b is true, otherwise false
- if `a` is a pandas Series, the result is a Boolean Series
  - with a True or False result for each row
  - which can be used by loc to select data
- this is very flexible and powerful

In [None]:
# This returns True, as the condition 10 > 9 is true
10 > 9

In [None]:
# Similarly, this returns False, as the condition 8 > 9 is false
8 > 9

In [None]:
# Can do the same with a Series - returns a Boolean (true/false) Series
df["Age"] > 9

In [None]:
# loc can take this as the selection, e.g. older than 9
df.loc[df["Age"] > 9]

In [None]:
# can combine conditions with & e.g. older than 9 and have blue eyes
df.loc[(df["Age"] > 9) & (df["Eye colour"] == "blue")]
# note: wrap each condition in brackets when combining them,
# because & and | bind more tightly than comparisons like > and ==

In [None]:
# can have multiple conditions with | e.g. younger than 7 or wears glasses
df.loc[(df["Age"] < 7) | (df["Wears glasses"] == "yes")]
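
The role of the brackets can be sketched on a small made-up DataFrame. It reuses column names from the toy dataset, but the values are assumptions:

```python
import pandas as pd

# Hypothetical data using the same column names as the toy dataset
kids = pd.DataFrame(
    {
        "Age": [6, 8, 10, 12],
        "Eye colour": ["blue", "brown", "blue", "green"],
    }
)

# Each comparison gives a Boolean Series with one value per row
older = kids["Age"] > 9
blue = kids["Eye colour"] == "blue"

# & keeps rows where both are True, | keeps rows where either is True
both = kids.loc[older & blue]
either = kids.loc[older | blue]
print(both)
print(either)

# Without brackets, & is evaluated before > and <, so Python sees a
# chained comparison around `7 & kids["Age"]`, which needs a single
# True/False value that a Series cannot provide
try:
    kids.loc[kids["Age"] > 7 & kids["Age"] < 12]
    unbracketed_ok = True
except ValueError:
    unbracketed_ok = False
print(unbracketed_ok)
```

Storing each condition in a named variable, as with `older` and `blue` here, is another way to keep combined conditions readable.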

# Summarizing data

Some useful functions for getting a quick overview of a Series:

- `describe()`
  - for numerical data: mean, min, max, std deviation, etc
  - for strings: count, number of unique items, most common item
- `count()` - the number of non-missing items in a Series
- `unique()` - an array of the unique items in a Series
- `value_counts()` - the count of each unique item in a Series
- `plot()` - plot numerical data on the y-axis (with the Index on the x-axis)
- `hist()` - plot a histogram of frequency of each unique item

Most of these methods can also be used on the whole DataFrame. By default, `describe()`, `plot()` and `hist()` then only consider the numerical columns.

In [None]:
df["Eye colour"].describe()

In [None]:
df["Eye colour"].count()

In [None]:
df["Eye colour"].unique()

In [None]:
df["Eye colour"].value_counts()

In [None]:
df["Eye colour"].hist()

In [None]:
df["Height"].describe()

In [None]:
df["Height"].hist()
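
A short sketch of how these summaries behave when a value is missing. The DataFrame is made up for illustration and only borrows two column names from the toy dataset:

```python
import pandas as pd

# Hypothetical data with one missing height
sample = pd.DataFrame(
    {
        "Eye colour": ["blue", "brown", "blue", "green"],
        "Height": [120.0, 135.5, None, 142.0],
    }
)

# len() counts every row, count() skips missing values
n_rows = len(sample["Height"])
n_present = sample["Height"].count()

# value_counts() sorts by frequency, most common first
colour_counts = sample["Eye colour"].value_counts()

# describe() on the whole DataFrame summarises numerical columns by default
numeric_summary = sample.describe()

print(n_rows, n_present)
print(colour_counts)
print(numeric_summary)
```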

# Plotting

- `df.plot` contains various plot methods
  - type `df.plot.` in a code cell then press `Tab` to see them listed
  - or read the docs by typing `?df.plot` in a code cell and running the cell
- specify the column to plot with `x="Column Name"`
- commonly used plots
  - `line` for time series data
  - `hist` for the distribution of numerical data
  - `scatter` to plot the relationship between two columns
- can use `plot` method on Series or DataFrame
- returns a Matplotlib `Axes` object
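
As a minimal sketch of the last point, the object returned by a plot method can be kept and used to customise the plot. The data below is made up for illustration:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data using column names from the toy dataset
growth = pd.DataFrame({"Age": [6, 8, 10, 12], "Height": [115, 128, 138, 150]})

# The plot method returns the Matplotlib Axes it drew on
ax = growth.plot.scatter(x="Age", y="Height")

# The plot can then be customised through that object
ax.set_title("Height vs Age")
ax.set_ylabel("Height (cm)")
```

In a notebook the figure is displayed when the cell finishes; in a script, `plt.show()` would be needed as well.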

# Matplotlib

- terminology
  - `figure`: the "canvas" on which plots will be made
  - `axes`: a single plot within a figure - a figure can have one or several of these
  - these are implicitly created if you don't do so yourself
- useful commands
  - `ax = plt.subplot()`: create an axes you can pass to Pandas to plot on
  - `fig, axs = plt.subplots(ncols=3)`: create a figure and multiple axes you can pass to Pandas to plot on
  - `plt.savefig("plot.png")`: save the plot as an image

In [None]:
# import the Matplotlib plotting interface
import matplotlib.pyplot as plt

In [None]:
# see how two columns are correlated with a scatter plot
df.plot.scatter(x="Age", y="Height")

In [None]:
# do the same thing, but use matplotlib to customise the plot
# make a larger figure
fig, ax = plt.subplots(figsize=(12, 4))
# pass our axes to pandas plot
df.plot.scatter(x="Age", y="Height", ax=ax)
# set a title
ax.set_title("Height vs Age")
# display the plot
plt.show()

In [None]:
# filter the data before plotting, and plot multiple labelled datapoints
fig, ax = plt.subplots(figsize=(12, 4))
df.loc[df["Sex"] == "Male"].plot.scatter(
    x="Age", y="Height", ax=ax, label="Male", marker="x", color="green"
)
df.loc[df["Sex"] == "Female"].plot.scatter(
    x="Age", y="Height", ax=ax, label="Female", marker="o", color="blue"
)
ax.legend()
ax.set_title("Height vs Age")
plt.show()

In [None]:
# plot the value counts of three columns as pie charts, side by side
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(16, 6))
df["Sex"].value_counts().plot.pie(ax=axs[0])
df["Wears glasses"].value_counts().plot.pie(ax=axs[1])
df["Eye colour"].value_counts().plot.pie(ax=axs[2])
plt.show()

In [None]:
# the same plots, written as a loop over the axes and column names
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(16, 6))
for ax, column in zip(axs, ["Sex", "Wears glasses", "Eye colour"]):
    df[column].value_counts().plot.pie(ax=ax)
plt.show()
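
Saving a figure to a file was listed among the useful commands. A minimal sketch, assuming made-up data and an arbitrary output file name in the system's temporary directory:

```python
import os
import tempfile

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data using column names from the toy dataset
survey = pd.DataFrame(
    {
        "Sex": ["Male", "Female", "Female", "Male", "Female"],
        "Wears glasses": ["yes", "no", "no", "yes", "no"],
    }
)

# One figure with two axes, one pie chart per column
fig, axs = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
for ax, column in zip(axs, ["Sex", "Wears glasses"]):
    survey[column].value_counts().plot.pie(ax=ax)
    ax.set_title(column)

# Save the whole figure as an image (the file name is arbitrary)
output_path = os.path.join(tempfile.gettempdir(), "pie_charts.png")
fig.savefig(output_path)
print(output_path)
```

Calling `savefig` on the figure object, rather than `plt.savefig`, makes it explicit which figure is written when several are open.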

# Groupby

- **split** the data into groups
- **apply** some function to each group
- **combine** the results

In [None]:
# split the rows into one group per value of the "Sex" column
grouped = df.groupby("Sex")

In [None]:
type(grouped)

In [None]:
grouped.groups

In [None]:
# number of non-missing ages in each group
grouped["Age"].count()

In [None]:
# mean age of each group
grouped["Age"].mean()
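
The split, apply and combine steps can be written out by hand to see what `groupby` is doing. This is only a sketch on made-up data and says nothing about how pandas implements it internally:

```python
import pandas as pd

# Hypothetical data using column names from the toy dataset
people = pd.DataFrame(
    {
        "Sex": ["Male", "Female", "Female", "Male"],
        "Age": [6, 8, 10, 14],
    }
)

# Split: one sub-DataFrame per value of "Sex"
males = people.loc[people["Sex"] == "Male"]
females = people.loc[people["Sex"] == "Female"]

# Apply the same function to each group,
# then combine the results into one Series
by_hand = pd.Series(
    {"Female": females["Age"].mean(), "Male": males["Age"].mean()}
)

# groupby does all three steps in one expression
with_groupby = people.groupby("Sex")["Age"].mean()

print(by_hand)
print(with_groupby)
```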

# Types

- Default type is `object`, which usually means a column of strings
- Pandas tries to identify types like numbers, dates and booleans
- Can also tell Pandas what type a column should be
- Using the correct type has multiple benefits
  - better performance (use less RAM, faster to run)
  - more functionality (e.g. summary / plots of numerical types)

In [None]:
# display type of each column
df.dtypes

In [None]:
# display memory usage of each column
df.memory_usage(deep=True)

In [None]:
# list unique values in "Sex" column:
df["Sex"].unique()

In [None]:
# see how much RAM is used to store this column as strings
df["Sex"].memory_usage(deep=True)

In [None]:
# convert to a category type
df["Sex"] = df["Sex"].astype("category")

In [None]:
# list unique values in column:
df["Sex"].unique()

In [None]:
# check RAM usage now
df["Sex"].memory_usage(deep=True)

In [None]:
# do the same for eye colour
df["Eye colour"] = df["Eye colour"].astype("category")

In [None]:
# list unique values to confirm Wears glasses is really a boolean:
df["Wears glasses"].unique()

In [None]:
# see how much RAM is used to store this column as strings
df["Wears glasses"].memory_usage(deep=True)

In [None]:
# convert "yes" to True, "no" to False
df["Wears glasses"] = df["Wears glasses"].map({"yes": True, "no": False})
df["Wears glasses"].unique()

In [None]:
df["Wears glasses"].memory_usage(deep=True)
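
Types can also be set while the data is being read, using the `dtype` argument of `pd.read_csv`. The sketch below uses made-up CSV text held in memory instead of the downloaded file:

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a downloaded file
csv_text = "Name,Sex,Wears glasses\n" + "Ann,Female,yes\nBob,Male,no\n" * 500

# Let pandas infer the types
inferred = pd.read_csv(io.StringIO(csv_text))

# Tell pandas the type of a column while reading
typed = pd.read_csv(io.StringIO(csv_text), dtype={"Sex": "category"})

# map() leaves a missing value for anything not in the dictionary,
# so check that every value was converted before relying on the result
glasses = typed["Wears glasses"].map({"yes": True, "no": False})
all_converted = glasses.notna().all()

print(inferred["Sex"].memory_usage(deep=True))
print(typed["Sex"].memory_usage(deep=True))
print(all_converted)
```

The saving from the category type grows with the number of rows, since each distinct string is stored only once.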

# Next

- [Part 3: Pandas with real data](https://ssciwr.github.io/jupyter-data-exploration/pandas-real-data.slides.html)