# Introduction to Pandas

## Topics

- What is a dataframe
- Getting information in
- Getting information out
- Changing information while it's in there

Time: 5 minutes

In [2]:
import pandas as pd

## Dataframes

Dataframes are tables.

They have rows, identified by an index.

They have columns, identified by a name.

They are _mutable_.
This means that some pandas operations can change the data inside a dataframe.

## Getting Data In

The simplest way to get data (already in Python) into a dataframe is to first make the data into a _dictionary_.

Dictionaries in Python are of the form:

`{key: value}`

In [3]:
# Make a dictionary with two keys
# The values here are lists, of equal length
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
d

{'col1': [1, 2, 3], 'col2': [4, 5, 6]}

In [4]:
# Pass the dictionary straight to the pd.DataFrame class
pd.DataFrame(data=d)

Unnamed: 0,col1,col2
0,1,4
1,2,5
2,3,6


The columns are named after the keys in the dictionary.
The rows are given an index, to help you specify them later on.

Each column in a pandas dataframe has a _dtype_ (data type).
If the entries in a column don't all share one type, pandas falls back to the most generic dtype, "object".

In [7]:
pd.DataFrame(
    data={"mixed": [1,2.0,d], "floats": [0.0, 0.2, 0.4]}
)

Unnamed: 0,mixed,floats
0,1,0.0
1,2.0,0.2
2,"{'col1': [1, 2, 3], 'col2': [4, 5, 6]}",0.4


In general, the more specific the dtype, the more efficiently pandas can store the column in memory, and the faster operations on it tend to be.
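
To check which dtype pandas chose for each column, you can inspect `df.dtypes`. The following is a small sketch using made-up data (the name `mixed_df` and its values are only for illustration); `memory_usage(deep=True)` gives a rough per-column size in bytes.

```python
import pandas as pd

# Hypothetical example data: one mixed column, one all-float column
mixed_df = pd.DataFrame(
    data={"mixed": [1, 2.0, "three"], "floats": [0.0, 0.2, 0.4]}
)

# df.dtypes returns a Series mapping each column name to its dtype
print(mixed_df.dtypes)

# memory_usage(deep=True) reports bytes per column,
# including the Python objects held in "object" columns
print(mixed_df.memory_usage(deep=True))
```

The exact byte counts depend on your platform and pandas version, so treat them as a comparison between columns rather than fixed numbers.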

If you have data in a `csv` file (as we will) there's a dedicated `pd.read_csv` function to read that data into a dataframe.
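
As a minimal sketch of `pd.read_csv`, the example below reads CSV text from an in-memory buffer so that it runs without any file on disk. The CSV contents and the name `csv_df` are invented for illustration; with a real file you would pass its path instead of the `io.StringIO` object.

```python
import io

import pandas as pd

# Hypothetical CSV contents; the first line becomes the column names
csv_text = "col1,col2\n1,4\n2,5\n3,6\n"

# read_csv accepts a file path or any file-like object
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df)
```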

We use `df` to mark a variable as a dataframe (or as a stand-in for "generic dataframe").

In [8]:
example_df = pd.DataFrame(data=d)
example_df

Unnamed: 0,col1,col2
0,1,4
1,2,5
2,3,6


In [9]:
# use df.insert to add a column
example_df.insert(
    1,                      # Where to put it
    "inserted_column",      # What to call it
    ["a", "b", "c"]         # Values
    )

Did anything happen?

In [10]:
example_df

Unnamed: 0,col1,inserted_column,col2
0,1,a,4
1,2,b,5
2,3,c,6


This operation happened "in place".
We didn't create a new value and assign it to a variable.
We gave the dataframe an instruction and it _mutated_ itself based on the instruction.

This can save memory, because no second copy of the dataframe is made.
It also makes it harder to see what has changed.
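
One way to keep the original safe is to mutate a copy instead. This sketch (the names `original_df` and `working_df` are illustrative only) also shows that `insert` returns `None`, which is a useful hint that a method works in place.

```python
import pandas as pd

original_df = pd.DataFrame(data={"col1": [1, 2, 3], "col2": [4, 5, 6]})

# df.copy() makes an independent dataframe we can mutate freely
working_df = original_df.copy()

# insert mutates working_df and returns None
result = working_df.insert(1, "inserted_column", ["a", "b", "c"])

print(result)                     # None
print(list(original_df.columns))  # ['col1', 'col2']
print(list(working_df.columns))   # ['col1', 'inserted_column', 'col2']
```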

You can stick dataframes together using `pd.concat` and passing it a list of dataframes.

In [11]:
# Make a dataframe with only one row and some columns
second_df = pd.DataFrame(
    data = {"col1": [4,], "novel_column": ["d"],  "col2": [7,]}
    )

pd.concat([example_df, second_df]) 

# This concatenated row-wise (one dataframe stacked under the other)
# You can concatenate column-wise as well

Unnamed: 0,col1,inserted_column,col2,novel_column
0,1,a,4,
1,2,b,5,
2,3,c,6,
0,4,,7,d


Some things to note:

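1. Pandas filled in the gaps with `NaN` ("Not a Number", its marker for missing values)
2. The indices aren't unique (two rows are labelled `0`)
3. `pd.concat` returned a new dataframe and mutated nothing, so the joined dataframe isn't saved unless we assign it to a variable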
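
For comparison, here is a small sketch of column-wise concatenation using `axis=1`. The dataframes `left_df` and `right_df` are made up for this example; rows are matched up by their index labels.

```python
import pandas as pd

left_df = pd.DataFrame(data={"col1": [1, 2, 3]})
right_df = pd.DataFrame(data={"letters": ["a", "b", "c"]})

# axis=1 places the dataframes side by side instead of stacking them
wide_df = pd.concat([left_df, right_df], axis=1)
print(wide_df)
```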

In [12]:
# It defaults to concatenating rows,
# you can concatenate columns by passing axis=1 as an argument
concatenated_df = pd.concat([example_df, second_df]) 
concatenated_df.reset_index()

Unnamed: 0,index,col1,inserted_column,col2,novel_column
0,0,1,a,4,
1,1,2,b,5,
2,2,3,c,6,
3,0,4,,7,d


`reset_index` didn't mutate `concatenated_df`; it returned a new dataframe with better indices!
It also stored all the old indices as a new column.
We can specify that we want to `drop` the old indices,
and that we want the operation to happen `inplace` like so.

In [13]:
concatenated_df.reset_index(inplace=True, drop=True)
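
The difference between the two calling styles can be sketched on a small stand-in dataframe (`demo_df` is a made-up name, built with a deliberately duplicated index):

```python
import pandas as pd

# A stand-in dataframe with a repeated index label
demo_df = pd.DataFrame(data={"col1": [1, 2, 4]}, index=[0, 1, 0])

# Default behaviour: a new dataframe comes back, demo_df is untouched
new_df = demo_df.reset_index(drop=True)
index_before = demo_df.index.tolist()

# With inplace=True: demo_df itself changes and the call returns None
returned = demo_df.reset_index(drop=True, inplace=True)
index_after = demo_df.index.tolist()

print(index_before, index_after, returned)
```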

This kind of inconsistency (some methods mutate in place by default, while others return a new dataframe unless told otherwise) is part of why people are building replacement tools, such as `polars` (also because polars uses a more recent standard for in-memory data storage and transfer).
However, we still need the tools built on top of pandas, such as `geopandas`, until replacements for them exist.

## Getting Data Out

Things we might want out of a dataframe:

- A certain column
- A certain row
- A certain cell
- All of the rows that match some criteria

### Grabbing columns

In [14]:
# Select a single column, returns a pd.Series
concatenated_df["inserted_column"]

0      a
1      b
2      c
3    NaN
Name: inserted_column, dtype: object

In [15]:
# Select multiple columns, returns a pd.DataFrame
concatenated_df[["inserted_column", "col1"]]

Unnamed: 0,inserted_column,col1
0,a,1
1,b,2
2,c,3
3,,4


In [16]:
# Select a subset of rows using iloc (integer location)
# This is rows 1 (inclusive) to 3 (exclusive)
concatenated_df.iloc[1:3]

Unnamed: 0,col1,inserted_column,col2,novel_column
1,2,b,5,
2,3,c,6,


In [17]:
# Select a subset of rows using iloc
# This is rows 1 (inclusive) to 3 (exclusive)
# Then we select only column 0 (inclusive) to 2 (exclusive)
concatenated_df.iloc[1:3, 0:2]

Unnamed: 0,col1,inserted_column
1,2,b
2,3,c


In [18]:
# The start:stop slice syntax is the same one Python lists use
# A colon means "between", and on its own selects everything
# Negative numbers count back from the end
# So this is all rows, and the last column
concatenated_df.iloc[:, -1]

0    NaN
1    NaN
2    NaN
3      d
Name: novel_column, dtype: object

In [19]:
# Use `df.loc` (location) instead of `df.iloc` (integer location) to refer to indices and labels
concatenated_df.loc[1, ["col2"]] 

col2    5
Name: 1, dtype: object
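
The difference between `loc` and `iloc` is easiest to see when the index labels are not in positional order. This is a sketch with an invented dataframe (`labelled_df`) whose index is deliberately shuffled.

```python
import pandas as pd

# Index labels 2, 0, 1 sit at positions 0, 1, 2
labelled_df = pd.DataFrame(data={"col1": [10, 20, 30]}, index=[2, 0, 1])

# iloc counts positions: row 0 is the first row
by_position = labelled_df.iloc[0, 0]

# loc uses labels: the row labelled 0 is the second row
by_label = labelled_df.loc[0, "col1"]

print(by_position, by_label)  # 10 20
```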

In [20]:
# use  `df.at` to get a cell
concatenated_df.at[2, "col1"]

3

### Comparisons

Generally we don't know which cell holds the data we want, but we can describe the kind of data we want.

By applying a comparison to a Series (a labelled, one-dimensional column of data) we get back a Series holding the result of that comparison for each entry.

In [21]:
# Showcasing "col1"
concatenated_df["col1"]

0    1
1    2
2    3
3    4
Name: col1, dtype: int64

In [22]:
# Comparing a Series with an integer
concatenated_df["col1"] > 2

0    False
1    False
2     True
3     True
Name: col1, dtype: bool

A series like this is called a _truth series_ (you will also see it called a boolean mask).
We can pass a truth series into a dataframe's square brackets, and it will return only those rows marked `True`.

In [23]:
concatenated_df[concatenated_df["col1"] > 2]

Unnamed: 0,col1,inserted_column,col2,novel_column
2,3,c,6,
3,4,,7,d


In [24]:
concatenated_df[concatenated_df["col2"] == 6]

Unnamed: 0,col1,inserted_column,col2,novel_column
2,3,c,6,


In [25]:
# You can use logical operations like & (and) and ~ (not) as well
# Note the brackets: & binds more tightly than > and ==,
# so without them Python would group the expression the wrong way
(concatenated_df["col1"] > 2) & (concatenated_df["col2"] == 6)

0    False
1    False
2     True
3    False
dtype: bool

In [26]:
concatenated_df[
    (concatenated_df["col1"] > 2) & (concatenated_df["col2"] == 6)
]

Unnamed: 0,col1,inserted_column,col2,novel_column
2,3,c,6,
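
The other logical operators work the same way. Below is a sketch using `|` (or), `~` (not), and `notna()` on a small invented dataframe (`mask_df`) that mimics two of the columns above.

```python
import pandas as pd

mask_df = pd.DataFrame(
    data={"col1": [1, 2, 3, 4], "novel_column": [None, None, None, "d"]}
)

# Name the truth series to keep the selection readable
is_big = mask_df["col1"] > 2
has_novel = mask_df["novel_column"].notna()

# | keeps rows where either condition holds
either_df = mask_df[is_big | has_novel]

# ~ flips True and False, keeping rows where col1 is NOT above 2
not_big_df = mask_df[~is_big]

print(either_df)
print(not_big_df)
```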


## Updating information

Just as you can get information out, you can put information back in the same way.

In [27]:
concatenated_df.at[2,"col1"] = 12

In [28]:
concatenated_df

Unnamed: 0,col1,inserted_column,col2,novel_column
0,1,a,4,
1,2,b,5,
2,12,c,6,
3,4,,7,d
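
The same idea extends to many cells at once: `df.loc` accepts a truth series for the rows and a column name, and assignment updates every matching cell. This is a minimal sketch on an invented dataframe (`update_df`), not the one used above.

```python
import pandas as pd

update_df = pd.DataFrame(data={"col1": [1, 2, 3, 4], "col2": [4, 5, 6, 7]})

# Set col2 to 0 on every row where col1 is greater than 2
update_df.loc[update_df["col1"] > 2, "col2"] = 0

print(update_df)
```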


We can _map_ a series or dataframe by passing a function to the `map` method.
(For series this is simple, it's a little more fiddly for dataframes.)

In [47]:
def square(x):
    # Despite the name, this shifts x down by 3 before squaring
    x = x - 3
    return x*x

# Make a new column called "col3", and fill it
# with the results of the calculation
concatenated_df["col3"] = concatenated_df["col2"].map(square)

In [36]:
concatenated_df

Unnamed: 0,col1,inserted_column,col2,novel_column,col3
0,1,a,4,,1
1,2,b,5,,4
2,12,c,6,,9
3,4,,7,d,16
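
`Series.map` also accepts existing functions and dictionaries. The sketch below uses a made-up Series (`letters`); note that values missing from the dictionary become `NaN`.

```python
import pandas as pd

letters = pd.Series(["a", "b", "c"])

# Any one-argument function works, including built-in methods
upper = letters.map(str.upper)

# A dictionary acts as a lookup table; "c" has no entry, so it maps to NaN
numbered = letters.map({"a": 1, "b": 2})

print(upper)
print(numbered)
```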


## Recap

Dataframes are tables

Some actions mutate the dataframe, so watch out. Usually there are optional arguments to control this.

Rows are labelled by an index, columns by a name

You can make truth series using comparison operators, and use them to select matching rows

You can update and add information into an existing dataframe

