# Data Exploration with Python and Jupyter

Basic usage of the Pandas library to download a dataset,
explore its contents, clean up missing or invalid data,
filter the data according to different criteria,
and plot visualizations of the data.

- [Part 1: Python and Jupyter](https://ssciwr.github.io/jupyter-data-exploration)
- **Part 2: Pandas with toy data**
- [Part 3: Pandas with real data](https://ssciwr.github.io/jupyter-data-exploration/pandas-real-data.slides.html)

*Press Spacebar or the right arrow key to go to the next slide*

# Pandas
is a data analysis and manipulation Python library

In [None]:
# Import the Pandas library
import pandas as pd

In [None]:
# Import some toy data as a pandas DataFrame
df = pd.read_csv("https://ssciwr.github.io/jupyter-data-exploration/data.csv")

In [None]:
type(df)

In [None]:
len(df)

In [None]:
# Display the first few rows of data
df.head()

In [None]:
# List the columns
df.columns

# Selecting rows and columns

Three main ways of doing this:

- Python-style indexing operator `[]`
- Pandas `loc` indexer (label-based)
- Pandas `iloc` indexer (position-based)

We'll start with the more intuitive Python-style methods, and later move on to the more powerful `loc` and `iloc` alternatives.
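
As a rough side-by-side sketch of the three approaches, the example below uses a small made-up DataFrame (the names, ages, and the `a`/`b`/`c` index labels are invented for illustration and are not part of the course dataset). Each approach pulls data out of the same table.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not the course dataset)
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cat"], "Age": [7, 9, 11]},
    index=["a", "b", "c"],
)

# 1. Indexing operator: look up a whole column by name
by_brackets = toy["Age"]

# 2. loc: look up by row label and column label
by_label = toy.loc["b", "Age"]

# 3. iloc: look up by integer position (second row, second column)
by_position = toy.iloc[1, 1]

print(by_brackets.tolist(), by_label, by_position)
```

Here `loc` and `iloc` reach the same cell by two different routes: one by name, the other by counting from zero.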

In [None]:
# A DataFrame is a bit like a Dictionary - we can look up columns by name
names = df["Name"]

In [None]:
# A column of a DataFrame is a Series
type(names)

In [None]:
names.head()

In [None]:
# A Series is a bit like a List - we can select items by index
names[0]

In [None]:
# Here are the first three items:
names[0:3]

In [None]:
# Can also iterate over items
for name in names:
    print(name, end=" ")

# iloc
- select data based on its location
- first specify row(s), then column(s)
- treating dataset as a matrix, or a list of lists
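
To make the "list of lists" picture concrete, here is a minimal sketch that builds a DataFrame from a nested list and checks that `iloc` positions line up with plain Python indexing. The `rows` data is hypothetical, chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data as a list of lists (one inner list per row)
rows = [["Ann", 7], ["Bob", 9], ["Cat", 11]]
toy = pd.DataFrame(rows, columns=["Name", "Age"])

# Row 1, column 0 is the same element as rows[1][0]
same_element = toy.iloc[1, 0] == rows[1][0]

# Slices exclude the end position, just like Python lists
first_two = toy.iloc[0:2]

# Negative positions count from the end: -1 is the last column
last_column = toy.iloc[:, -1]

print(same_element, len(first_two), last_column.tolist())
```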

In [None]:
# First row of data (column is implicitly "all" if not specified)
df.iloc[0]

In [None]:
# First row of data (using : slice operator to select all columns)
df.iloc[0, :]

In [None]:
# First column of data
df.iloc[:, 0].head()

In [None]:
# Can select slices of rows and columns: e.g. first 3 rows, last 2 columns
df.iloc[0:3, -2:]

In [None]:
# Can also select a list of indices, e.g. rows 3,5,7, columns 3,5
df.iloc[[3, 5, 7], [3, 5]]

# loc
- select data based on its index *label* and column *label*, instead of location
- first specify row(s), then column(s)
- often the most useful method
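
A DataFrame read with `pd.read_csv` and no index column gets row labels 0, 1, 2, ..., so `loc[0]` and `iloc[0]` happen to return the same row. The sketch below uses a hypothetical DataFrame with labels 10, 20, 30, 40 (invented for illustration) to show where the two differ.

```python
import pandas as pd

# Hypothetical toy data whose row labels are NOT 0, 1, 2, ...
toy = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cat", "Dan"], "Age": [7, 9, 11, 8]},
    index=[10, 20, 30, 40],
)

# loc uses labels: the row whose index label is 20
by_label = toy.loc[20, "Name"]

# iloc uses positions: the second row, first column
by_position = toy.iloc[1, 0]

# loc slices include the end label; iloc slices exclude the end position
label_slice = toy.loc[10:30, "Name"]
position_slice = toy.iloc[0:2, 0]

print(by_label, by_position, len(label_slice), len(position_slice))
```

The inclusive end of a `loc` slice is easy to overlook: `toy.loc[10:30]` returns three rows, while `toy.iloc[0:2]` returns two.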

In [None]:
# First row of data (column is implicitly "all" if not specified)
df.loc[0]

In [None]:
# First row of data (using : slice operator to select all columns)
df.loc[0, :]

In [None]:
# "Name" column of data
df.loc[:, "Name"].head()

In [None]:
# Can also select a list of labels, e.g. rows 3,5,7, columns "Height","Wears glasses"
df.loc[[3, 5, 7], ["Height", "Wears glasses"]]

# Conditionals
- a statement that is either true or false
  - `a == b` : true if `a` is equal to `b`
  - `a != b` : true if `a` is not equal to `b`
  - `a > b` : true if `a` is greater than `b`
  - `a >= b` : true if `a` is greater than or equal to `b`
  - `a < b` : true if `a` is less than `b`
  - `a <= b` : true if `a` is less than or equal to `b`
- they can be combined
  - `a & b` : true if a and b are both true, otherwise false
  - `a | b` : true if a or b is true, otherwise false
- if `a` is a pandas Series, the result is a Boolean Series
  - with a True or False result for each row
  - which can be used by loc to select data
- this is very flexible and powerful
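
The following sketch shows how Boolean Series combine, using a hypothetical four-row DataFrame invented for illustration. It also uses `~` to negate a condition, which is an addition not covered in the list above. Each comparison is wrapped in parentheses or stored in a variable first, because `&` and `|` bind more tightly than `>` or `==` in Python.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration
toy = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cat", "Dan"],
        "Age": [6, 10, 12, 8],
        "Eye colour": ["blue", "brown", "blue", "green"],
    }
)

# Each comparison gives one True/False value per row
older = toy["Age"] > 9
blue = toy["Eye colour"] == "blue"

# Combine the Boolean Series and pass the result to loc
both = toy.loc[older & blue]
either = toy.loc[older | blue]
neither = toy.loc[~(older | blue)]  # ~ flips True and False

print(both["Name"].tolist(), either["Name"].tolist(), neither["Name"].tolist())
```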

In [None]:
# This returns a Boolean (true/false) Series for the statement Age is greater than 9:
df["Age"] > 9

In [None]:
# loc can take this as the selection, e.g. older than 9
df.loc[df["Age"] > 9]

In [None]:
# can combine conditions with & e.g. older than 9 and have blue eyes
df.loc[(df["Age"] > 9) & (df["Eye colour"] == "blue")]

In [None]:
# can have multiple conditions with | e.g. younger than 7 or wears glasses
df.loc[(df["Age"] < 7) | (df["Wears glasses"] == "yes")]

# Types

- Default type for text is `object`, which usually holds Python strings
- Pandas tries to identify types like numbers, dates and booleans
- Can also tell Pandas what type a column should be
- Using the correct type has multiple benefits
  - better performance (use less RAM, faster to run)
  - more functionality (e.g. summary / plots of numerical types)
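
One way to tell Pandas what type a column should be is the `dtype` argument of `pd.read_csv`. The sketch below reads a small in-memory CSV (the `csv_text` contents are hypothetical, invented for illustration) twice: once letting Pandas infer the types, and once declaring `Sex` as a category.

```python
import io

import pandas as pd

# Hypothetical CSV contents, invented for illustration
csv_text = "Name,Age,Sex\nAnn,7,female\nBob,9,male\nCat,11,female\n"

# Let Pandas infer the types: Age is recognised as an integer column
inferred = pd.read_csv(io.StringIO(csv_text))

# Declare the type of one column up front
declared = pd.read_csv(io.StringIO(csv_text), dtype={"Sex": "category"})

print(inferred.dtypes)
print(declared.dtypes)

# Numerical columns support summaries such as the mean
mean_age = declared["Age"].mean()
print(mean_age)
```

Declaring the type at load time avoids a separate `astype` step later, although converting afterwards gives the same result.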

In [None]:
# display type of each column
df.dtypes

In [None]:
# display memory usage of each column
df.memory_usage(deep=True)

In [None]:
# list unique values in "Sex" column:
df["Sex"].unique()

In [None]:
# see how much RAM is used to store this column as strings
df["Sex"].memory_usage(deep=True)

In [None]:
# convert to a category type
df["Sex"] = df["Sex"].astype("category")

In [None]:
# list unique values in column:
df["Sex"].unique()

In [None]:
# check RAM usage now
df["Sex"].memory_usage(deep=True)

In [None]:
# do the same for eye colour
df["Eye colour"] = df["Eye colour"].astype("category")

In [None]:
# list unique values to check that Wears glasses only contains "yes" and "no":
df["Wears glasses"].unique()

In [None]:
# see how much RAM is used to store this column as strings
df["Wears glasses"].memory_usage(deep=True)

In [None]:
# convert "yes" to True, "no" to False
df["Wears glasses"] = df["Wears glasses"].map({"yes": True, "no": False})

In [None]:
# list unique values to confirm Wears glasses is really a boolean:
df["Wears glasses"].unique()

In [None]:
# see how much RAM is used to store this column as booleans
df["Wears glasses"].memory_usage(deep=True)
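
One thing to watch with `map` and a dictionary: any value that is not a key in the dictionary becomes a missing value (NaN). The sketch below uses a hypothetical Series with one inconsistently capitalised entry, invented for illustration, to show how such values can be counted after the conversion.

```python
import pandas as pd

# Hypothetical column with one inconsistent value ("Yes" instead of "yes")
glasses = pd.Series(["yes", "no", "Yes", "no"])

# Values missing from the dictionary are mapped to NaN
mapped = glasses.map({"yes": True, "no": False})

# Count how many values could not be converted
n_unmapped = mapped.isna().sum()
print(n_unmapped)
```

Checking `unique()` before converting, as in the cells above, is a simple way to catch such values in advance.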

# Next

- [Part 3: Pandas with real data](https://ssciwr.github.io/jupyter-data-exploration/pandas-real-data.slides.html)