# `pandas`

In [None]:
%pylab inline
plt.style.use('ggplot')

Note the import convention:

In [None]:
import numpy as np
import pandas as pd

In [None]:
np.random.seed(983456)

## Creating `pd.Series`

When creating a Pandas `Series`, you can provide values only:

In [None]:
s = pd.Series(np.random.randn(10))
s

Values and series name:

In [None]:
s = pd.Series(np.random.randn(10), name="random_series")
s

Values, index and series name:

In [None]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=np.random.randint(23, size=(10,)))
s

In [None]:
s.index

The index can be created explicitly (and can have its own name):

In [None]:
s = pd.Series(np.random.randn(10), name="random_series",
              index=pd.Index(np.random.randint(23, size=(10,)), name="main_index"))
s

In [None]:
s.index

A `Series` can be created from a dictionary as well:

In [None]:
s = pd.Series({'a':3, 'c':6, 'b':2}, name="dict_series")
s

In [None]:
s.index

## Creating `pd.DataFrame`

Again, we can use just values and Pandas will create an index (both row and column) for us:

In [None]:
df = pd.DataFrame(np.arange(20).reshape((5,4)))
df

An easy way to access the data types in a dataframe:

In [None]:
df.dtypes

We can provide column names:

In [None]:
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'])
df

Values, index and column names:

In [None]:
import string
df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['a', 'b', 'c', 'd'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

In [None]:
df.columns

In [None]:
df.index

Can you guess what `df['a']` means?

In [None]:
df['a']

Can we access a row in the same way?

In [None]:
df['h']

Each column is a `pd.Series`:

In [None]:
type(df['a'])

# Reading CSV files

We will use [Titanic dataset](https://www.kaggle.com/c/titanic/data):

In [None]:
titanic_train = pd.read_csv('train.csv')

By default, Pandas creates an integer row index and reads column names from the first (0-th) row of a CSV file:

In [None]:
titanic_train

Glimpse into a dataframe:

In [None]:
titanic_train.head()

In [None]:
titanic_train.tail()

In [None]:
titanic_train.info()

In [None]:
titanic_train.describe()

In [None]:
titanic_train

## Basic indexing of Pandas dataframes

We can set the index column in `pd.read_csv`:

In [None]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')
titanic_test = pd.read_csv('test.csv', index_col='PassengerId')
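To see the effect of `index_col` without the Kaggle files, here is a minimal sketch that reads a small in-memory CSV. The three rows are made up for illustration and only mimic the layout of `train.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for train.csv (made-up rows).
csv_text = "PassengerId,Survived,Age\n1,0,22.0\n2,1,38.0\n3,1,26.0\n"

# Default: the first line becomes the column names, rows get a 0-based index.
default_df = pd.read_csv(io.StringIO(csv_text))

# index_col promotes one of the columns to the row index.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

print(default_df.index)
print(indexed_df.index)
```

With `index_col`, `PassengerId` is no longer a regular column, so it does not show up in `indexed_df.columns`.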

Accessing a single column:

In [None]:
titanic_train["Survived"]

A set of columns:

In [None]:
titanic_train[["Name", "Survived"]]

Just in case: the order in which you list the columns is (usually) not important for performance:

In [None]:
%timeit titanic_train[["Name", "Survived"]]

In [None]:
%timeit titanic_train[["Survived", "Name"]]

Integer indexing is also available with `[]` notation, but with some peculiarities:

In [None]:
titanic_train[2:4]

But:

In [None]:
titanic_train[2]

`[]` may be ambiguous, and it's better to use it only for column access. If you want to use row labels, use `.loc`:

In [None]:
titanic_train.loc[2]

In [None]:
titanic_train.head()

Note that `titanic_train.loc[...]` is label-based, not positional, even though the row labels are integers. The difference is even more pronounced for non-monotonic indexes (both the default index and `PassengerId` are unique and monotonic).
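The label/position distinction is easy to miss when labels are integers. A minimal sketch on a made-up frame whose labels start at 1, as `PassengerId` does (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame (made-up values) with integer labels starting at 1.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]},
                   index=pd.Index([1, 2, 3, 4], name="PassengerId"))

by_label = toy.loc[2, "Age"]    # row whose label is 2, i.e. the second row
by_position = toy.iloc[2, 0]    # row at position 2, i.e. the third row

label_slice = toy.loc[2:4]      # labels 2, 3 and 4 (upper bound included)
position_slice = toy.iloc[2:4]  # positions 2 and 3 (upper bound excluded)

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```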

Label-based slice (inclusive bounds):

In [None]:
titanic_train.loc[2:4]

Positional slice (exclusive upper bound):

In [None]:
titanic_train[2:4]

`.loc` indexing is very flexible and can combine row and column access in one run:

In [None]:
titanic_train.loc[2:4, "Age"]

In [None]:
titanic_train.loc[2:4, ["Age"]]

This one won't work:

In [None]:
titanic_train[2:4, ["Age"]]

In [None]:
titanic_train.loc[2:10:2, ["Age"]]

In [None]:
titanic_train.loc[titanic_train["Age"] < 5, ["Name", "Pclass"]]

In [None]:
titanic_train.loc[(titanic_train["Age"] < 5) & (titanic_train.Pclass == 2), "Name"]

This won't work, because `&` binds more tightly than `<` and `==`:

In [None]:
titanic_train.loc[titanic_train["Age"] < 5 & titanic_train.Pclass == 2, "Name"]

In [None]:
titanic_train["Age"] < 5 & titanic_train.Pclass

In [None]:
titanic_train["Age"] < 5 & titanic_train.Pclass == 2
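The failures above come from operator precedence. The sketch below, on a made-up three-row frame, shows the parenthesized form next to the unparenthesized one, which Python reads as the chained comparison `Age < (5 & Pclass) == 2`.

```python
import pandas as pd

# Toy data (made-up values).
toy = pd.DataFrame({"Age": [3.0, 4.0, 30.0], "Pclass": [2, 3, 2]})

# Parenthesized: two boolean Series combined element-wise.
mask = (toy["Age"] < 5) & (toy["Pclass"] == 2)

# Unparenthesized: parsed as  Age < (5 & Pclass) == 2, a chained comparison
# that needs the truth value of a whole Series, which is ambiguous.
try:
    toy["Age"] < 5 & toy["Pclass"] == 2
    raised = False
except ValueError:
    raised = True

print(mask.tolist(), raised)
```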

`.iloc`, in contrast, is explicitly positional and can combine both row and column positions (and upper bounds are always exclusive):

In [None]:
titanic_train.iloc[:2, 3]  # Note resulting series name: Pandas preserves column name

In [None]:
titanic_train.iloc[:2, 3:5]

You cannot mix positional and label-based indexing:

In [None]:
titanic_train.iloc[:2, "Name"]

But you still can use filtering:

In [None]:
titanic_train.iloc[(titanic_train.Age < 10).values, 2]  # titanic_train.iloc[titanic_train.Age < 10, 2] won't work
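A small sketch of the same trick on made-up data: `.iloc` accepts a plain boolean array, so the mask is stripped of its index first (here with `.to_numpy()`, which plays the same role as `.values`).

```python
import pandas as pd

# Toy frame (made-up values) with non-trivial row labels.
toy = pd.DataFrame({"Name": ["A", "B", "C"], "Age": [4.0, 40.0, 8.0]},
                   index=[10, 20, 30])

mask = toy["Age"] < 10

# Positional selection: boolean array for rows, column position 0 for "Name".
young_by_position = toy.iloc[mask.to_numpy(), 0]

# Label-based selection accepts the boolean Series directly.
young_by_label = toy.loc[mask, "Name"]

print(young_by_position)
```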

## Performance

But how is an index useful? (Note the filtering notation.)

In [None]:
titanic_train = pd.read_csv('train.csv')

In [None]:
%timeit titanic_train[titanic_train.PassengerId==400]

In [None]:
titanic_train = pd.read_csv('train.csv', index_col='PassengerId')

In [None]:
%timeit titanic_train.loc[400]
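The two lookups being timed can be compared on synthetic data. This is only a sketch: the frame below is generated, not the Titanic table, and `set_index` is used here in place of re-reading the CSV with `index_col`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Titanic table (made-up values).
n = 1000
flat = pd.DataFrame({"PassengerId": np.arange(1, n + 1),
                     "Fare": np.linspace(5.0, 100.0, n)})
indexed = flat.set_index("PassengerId")

# Boolean filter: compares every PassengerId and returns a one-row DataFrame.
by_scan = flat[flat["PassengerId"] == 400]

# Index lookup: locates the label through the index and returns a Series.
by_index = indexed.loc[400]

print(by_scan.shape, by_index["Fare"])
```

Both expressions reach the same row, but they return different types, which is worth keeping in mind when swapping one for the other.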

## Combining dataframes

In [None]:
pd.concat([titanic_train, titanic_test], ignore_index=True)

Note how Pandas filled the `Survived` column (which is not even present in `titanic_test`!) with missing values. A better way to combine dataframes when the index has actual meaning:

In [None]:
titanic = pd.concat([titanic_train, titanic_test])

In [None]:
titanic
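A minimal sketch of both `pd.concat` variants on two made-up frames shaped like the train/test split (the second one lacks `Survived`):

```python
import pandas as pd

# Toy stand-ins for the train and test tables (made-up values).
train = pd.DataFrame({"Survived": [0, 1], "Age": [22.0, 38.0]},
                     index=pd.Index([1, 2], name="PassengerId"))
test = pd.DataFrame({"Age": [34.5, 47.0]},
                    index=pd.Index([892, 893], name="PassengerId"))

# ignore_index=True throws the labels away and renumbers rows from 0.
renumbered = pd.concat([train, test], ignore_index=True)

# Without it, the original row labels are kept.
combined = pd.concat([train, test])

print(renumbered.index.tolist())
print(combined.index.tolist())
print(combined["Survived"].tolist())
```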

# Indexing `pd.Series` in depth

In [None]:
np.random.seed(983456)

N_ELEMS = 20

s = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
              index=list(string.ascii_lowercase)[:N_ELEMS],
              name='randint_series')
s

## Indexing with `[]`

In [None]:
s

In [None]:
s['i']  # But there's a caveat: it may be series or just an element

In [None]:
s[['i']]

Label-based slicing may not work the way you expect (both bounds are inclusive):

In [None]:
s['a':'f']

Indexing with a list of labels works as well:

In [None]:
s[['k', 'q', 'a', 'r']]

In [None]:
s.index

Note that positional slicing works as well:

In [None]:
s[0:5]

In [None]:
s[5:3:-1]

## Indexing with `.loc`

In [None]:
np.random.seed(983456)

s_int_idx = pd.Series(np.random.randint(20, size=(N_ELEMS,)),
                      index=np.random.choice(N_ELEMS, N_ELEMS, replace=False),
                      name='randint_series')
s_int_idx

We have an integer index. What if we use slicing here? Will it be positional or will it use the row index?

In [None]:
s_int_idx[2:15]

Surprising. But that's the way Pandas works and you'll love it over time (its API is strongly tailored to the most common operations, making them more concise).

In [None]:
s_int_idx[2]  # label

In [None]:
s_int_idx[2:5]  # position

Boolean mask? Sure.

In [None]:
s_int_idx[s_int_idx.index.isin(range(2,6))]

In [None]:
s_int_idx

Again, `[]` may often be ambiguous. Use `.loc` or `.iloc` to make your code readable and clean:

In [None]:
s_int_idx.loc[2:15]  # label

In [None]:
s_int_idx.iloc[2:15]  # position
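The same contrast on a four-element Series with a shuffled integer index (made-up values), small enough to check by eye:

```python
import pandas as pd

# Toy Series (made-up values) with a shuffled, unique integer index.
s_toy = pd.Series([10, 20, 30, 40], index=[3, 1, 2, 0])

# Label slice: from the row labelled 1 through the row labelled 0, inclusive.
by_label = s_toy.loc[1:0]

# Positional slice: positions 1 and 2, upper bound excluded.
by_position = s_toy.iloc[1:3]

print(by_label.tolist())
print(by_position.tolist())
```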

What if we take some random upper bound? It won't work generally:

In [None]:
s_int_idx.loc[2:456]

Because of this:

In [None]:
s_int_idx.index.is_monotonic_increasing  # `is_monotonic` was removed in pandas 2.0

But we can make it work (or rather you now know when it works and when it doesn't):

In [None]:
s_int_idx.sort_index().loc[2:234]

Because:

In [None]:
s_int_idx.sort_index().index.is_monotonic_increasing
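A compact sketch of the rule on a made-up Series: a slice bound that is missing from the index raises `KeyError` when the index is unsorted, and is accepted once the index is monotonic.

```python
import pandas as pd

# Toy Series (made-up values); 456 is not among the labels.
s_toy = pd.Series([10, 20, 30, 40], index=[3, 1, 2, 0])

try:
    s_toy.loc[1:456]
    unsorted_ok = True
except KeyError:
    unsorted_ok = False

sorted_toy = s_toy.sort_index()
tail = sorted_toy.loc[1:456]  # every label from 1 up to (at most) 456

print(unsorted_ok, sorted_toy.index.is_monotonic_increasing)
print(tail.tolist())
```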

We'll see why this works a bit later. We can do complex filtering/masking/boolean indexing as well:

In [None]:
s_int_idx[s_int_idx.index!=11]

In [None]:
s_int_idx[(s_int_idx>15) | (s_int_idx<5)]

In [None]:
s_int_idx.loc[s_int_idx!=14]

# Indexing `pd.DataFrame`

In [None]:
np.random.seed(983456)

df = pd.DataFrame(np.arange(20).reshape((5,4)),
                  columns=['d', 'c', 'b', 'a'],
                  index=np.random.choice(list(string.ascii_lowercase), 5, replace=False))
df

Ok, so `[]` (without `loc` or `iloc`) probably is positional?

In [None]:
df[2:5]

In [None]:
df['o'] # Nope, it doesn't work that way

But here's the thing: **the same** `[]` notation works differently if you're using column labels:

In [None]:
df['a']

In [None]:
df

Note, that this one returns a dataframe:

In [None]:
df[['b']]

... and this one returns `pd.Series`:

In [None]:
df['b']

In [None]:
df

In [None]:
df.columns

In [None]:
df.columns[2:]

In [None]:
df[df.columns[2:]]

In [None]:
df.iloc[:, 2:]

In [None]:
df

So, `[]` is positional. Is it?

In [None]:
df[:'g'] # Surprising!

In [None]:
df['a':'u'] # Not really surprising

In [None]:
df.sort_index()['a':'u']

But neither `a` nor `u` is even in the row index!

In [None]:
df

In [None]:
df['d':]

In [None]:
df["x":]

But:

In [None]:
df.sort_index()['d':]

In [None]:
df

In [None]:
df['k':'z'] # No, that won't work

In [None]:
df.sort_index()['k':'zjyyf']

Conceptually, slicing relies on the ordering (rank) of the index labels:

In [None]:
df.index.to_series().rank()

If the index is monotonic, you can slice with labels that are not in the index:

In [None]:
df.sort_index().index.to_series().rank()

In [None]:
df.sort_index()['b':'m']  # Pandas can unambiguously set 'b' to be less than 'd' and 'm' to be between 'l' and 'o'

In [None]:
df

In [None]:
df.loc['o':'x', 'c'] = 5

In [None]:
df_sub = df['o':'x']
df_sub['c'] = 5 # Not a very good idea

In [None]:
df[(df['a']>12) | (df['b']<3)]

In [None]:
df

## Indexing with `.loc`

The general rule (for readability and to avoid subtle bugs):

- use `[]` when accessing columns by label,
- use `.loc` when accessing both rows and columns by label,
- use `.iloc` for positional indexing.
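The three rules above can be sketched on a tiny made-up frame:

```python
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "z"])

column = toy["a"]                      # [] for a column by label
cell = toy.loc["y", "b"]               # .loc for row and column labels
block = toy.loc["x":"y", ["a", "b"]]   # label slices include both bounds
first_row_b = toy.iloc[0, 1]           # .iloc for positions

print(column.tolist(), cell, block.shape, first_row_b)
```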

In [None]:
df

In [None]:
df.loc['o']

In [None]:
df.loc['o', 'b']

In [None]:
df.loc['o':, 'b']

In [None]:
df.loc['g':, 'b':]

In [None]:
df

In [None]:
df.loc['g':, 'c':'d'].shape

In [None]:
df

Column index is still an index and works in a similar manner.

In [None]:
df.columns.to_series().rank()

In [None]:
df.loc['x':, 'a'::-2]

In [None]:
df.loc['x':, 'c':'d']

In [None]:
df.sort_index(axis='columns').loc['x':, 'c':'d']

In [None]:
df.loc[:, ["a", "b"]]

`.loc` can contain a mask (Pandas will align it for you):

In [None]:
df.loc[df['c']>10, 'c']

In [None]:
df.loc[:, df.columns[2:]]

In [None]:
df.loc[[1,2], 'c'] # This won't work: .loc cannot use a mix

## `SettingWithCopyWarning`

In [None]:
df

Each indexing operation returns either a copy of the dataframe or a view into it and, in contrast to NumPy, Pandas does not guarantee which one you get.

In [None]:
df.loc[df['a']>10, 'c']

In [None]:
df.__setitem__?

An assignment like this works the same way as in NumPy and the original dataframe is modified (under the hood it is a single `__setitem__` call on `df.loc`):

In [None]:
df.loc[df['a']>10, 'c'] = 10

In [None]:
df

This one, however, contains two chained `__getitem__` calls:

In [None]:
df.loc[df['a']>10]['c']

The following assignment generates a warning (it's unknown if `df.loc[df['a']>10]` is a view or a copy):

In [None]:
df.loc[df['a']>10]['c'] = 20.

In [None]:
df

Let's decompose it:

In [None]:
df_1 = df.loc[df['a']>10]

In [None]:
df_1

In [None]:
df_1['c'] = 25

In [None]:
df_1

In [None]:
df
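Two patterns avoid the ambiguity, sketched here on made-up data: assign through a single `.loc` call when the original should change, and take an explicit `.copy()` when an independent subset is wanted.

```python
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({"a": [5, 11, 15], "c": [1, 2, 3]})

# One .loc call selects and assigns in a single step: the original changes.
toy.loc[toy["a"] > 10, "c"] = 10

# An explicit copy is an independent frame: writing to it leaves toy alone.
subset = toy.loc[toy["a"] > 10].copy()
subset["c"] = 25

print(toy["c"].tolist())
print(subset["c"].tolist())
```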

# Dataframe arithmetic

In [None]:
df_1 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'b', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_1

In [None]:
df_2 = pd.DataFrame(np.arange(40).reshape(10,4),
                    columns=['a', 'e', 'c', 'd'],
                    index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))
df_2

In [None]:
# A lot of missing values
df_1 + df_2

We can provide a fill value for missing **operands**:

In [None]:
df_1.add(df_2, fill_value=0)
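What `fill_value` does is easier to see on two tiny made-up frames that share only one row label. Without it, any label present on just one side produces `NaN`. With it, the missing operand is replaced before adding.

```python
import pandas as pd

# Toy frames (made-up values) that overlap only on row "y".
left = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])
right = pd.DataFrame({"a": [10, 20]}, index=["y", "z"])

plain = left + right                      # NaN wherever one side is missing
filled = left.add(right, fill_value=0)    # the missing side is treated as 0

print(plain)
print(filled)
```

If a cell is missing on both sides, the result is still `NaN` even with `fill_value`.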

Operations between dataframes and series are aligned on column labels by default:

In [None]:
s_1 = pd.Series(np.arange(10),
                name='f',
                index=np.random.choice(list(string.ascii_lowercase), 10, replace=False))

In [None]:
s_1

In [None]:
df_1

In [None]:
df_1 + s_1

In [None]:
s_1 + df_1

The default can be changed:

In [None]:
df_1.add(s_1, axis='index')
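A sketch of both alignment modes on made-up data. The Series index is matched against column labels by default and against row labels when `axis='index'` is passed. If no labels match, the result is all `NaN`.

```python
import pandas as pd

# Toy frame and Series (made-up values).
toy = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
per_column = pd.Series({"a": 10, "b": 100})
per_row = pd.Series({"x": 10, "y": 100})

by_columns = toy + per_column               # Series index vs. column labels
by_rows = toy.add(per_row, axis="index")    # Series index vs. row labels
mismatch = toy + per_row                    # no label in common

print(by_columns)
print(by_rows)
print(mismatch)
```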

Such alignment (on column labels) allows many common operations to be written concisely. For example, to standardize each column, we just do:

In [None]:
(df_1 - df_1.mean()) / df_1.std()

In [None]:
df_1.mean()
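Because `df_1.mean()` is a Series indexed by column labels, the expression above works column by column. A minimal check on a made-up frame follows. The row-wise variant with `sub(..., axis='index')` is added here for contrast and is not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3),
                   columns=["a", "b", "c"])

# Column-wise standardization: each column ends up with mean 0 and std 1.
standardized = (toy - toy.mean()) / toy.std()

# Row-wise centering needs explicit alignment on the row index.
row_centered = toy.sub(toy.mean(axis=1), axis="index")

print(standardized.mean().round(6).tolist())
print(standardized.std().round(6).tolist())
```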

# Applying functions to dataframes

In [None]:
df

The main method for applying a function over rows or columns:

In [None]:
df.apply(lambda row: np.sqrt(row.d), axis=1)

This one is faster, though:

In [None]:
%timeit df['d'].apply(lambda x: np.sqrt(x))

Pandas lets you use NumPy functions directly:

In [None]:
np.sqrt(df['d'])

Which is faster:

In [None]:
%timeit np.sqrt(df['d'])

In [None]:
# Better way
%timeit np.sqrt(df['d'].values)

In [None]:
np.sqrt(df['d'].values)

Often, replacing `apply` altogether is the best option:

In [None]:
df_copy = df.copy()
df_copy["d_sqrt"] = np.sqrt(df['d'].values)
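The four variants compared above compute the same numbers and differ only in overhead. A small sketch on made-up data that checks the results agree (timings are left out, since they depend on the machine and the pandas version):

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({"d": [0.0, 1.0, 4.0, 9.0]})

via_row_apply = toy.apply(lambda row: np.sqrt(row["d"]), axis=1)  # per row
via_col_apply = toy["d"].apply(np.sqrt)                           # per element
via_ufunc = np.sqrt(toy["d"])               # vectorized, returns a Series
via_array = np.sqrt(toy["d"].to_numpy())    # vectorized, returns an ndarray

print(via_array)
```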

Note that we can often mix Pandas and NumPy:

In [None]:
df

In [None]:
df.values

In [None]:
df.values.sum(axis=1)

In [None]:
df.apply(lambda x: x.sum(), axis=1)

In [None]:
np.sum(df, axis=1)

In [None]:
df.sum(axis=1)

Pandas is smart enough to combine the result in a proper manner:

In [None]:
dfm = df.apply(lambda x: pd.Series({'sum': x.sum(),
                                    'sqrt': np.sqrt(x['d'])}),
               axis=1)

In [None]:
dfm

# Dataframe statistics

In [None]:
titanic['Pclass'].value_counts()

In [None]:
titanic.SibSp.value_counts()

In [None]:
titanic.Embarked.value_counts() # S = Southampton, C = Cherbourg, Q = Queenstown

In [None]:
titanic.Sex.value_counts()

In [None]:
print("Average age: %2.2f" % titanic['Age'].mean())
print("STD of age: %2.2f" % titanic['Age'].std())
print("Minimum age: %2.2f" % titanic['Age'].min())
print("Maximum age: %2.2f" % titanic['Age'].max())

In [None]:
print("Average number of siblings/spouse: %2.2f" % titanic['SibSp'].mean())
print("Average number of siblings/spouse in class 1: %2.2f" % titanic.loc[titanic.Pclass==1, 'SibSp'].mean())
print("Average number of siblings/spouse in class 2: %2.2f" % titanic.loc[titanic.Pclass==2, 'SibSp'].mean())
print("Average number of siblings/spouse in class 3: %2.2f" % titanic.loc[titanic.Pclass==3, 'SibSp'].mean())

In [None]:
print("Minimum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].min())
print("Maximum age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].max())
print("Mean age (not survived): %2.2f" % titanic.loc[titanic.Survived==0, 'Age'].mean())

In [None]:
print("Minimum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].min())
print("Maximum age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].max())
print("Mean age (survived): %2.2f" % titanic.loc[titanic.Survived==1, 'Age'].mean())

# Replacing and renaming

In [None]:
titanic.replace(22, 122).head()

In [None]:
import re
titanic.replace(re.compile(r'\(.*\)'), '').head()

In [None]:
titanic.rename(lambda x: x.lower(), axis=1).head()

In [None]:
titanic.rename({'SibSp':'siblings_spouses'}, axis=1).head()

# String operations

In [None]:
titanic.head()

In [None]:
titanic.replace(re.compile(r'\(.*\)'), '').Name.str.split(",", expand=True)

In [None]:
(titanic
 .replace(re.compile(r'\(.*\)'), '')
 .Name.str
 .split(',', expand=True)
 .rename({0:'family_name', 1:'first_name'}, axis=1)
 .head())
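The same pipeline on two made-up names makes each step visible. This sketch uses `Series.str.replace(..., regex=True)` instead of `DataFrame.replace` and adds a `str.strip()` at the end, which the cells above do not do.

```python
import pandas as pd

# Made-up names following the "Family, Title First (Alias)" pattern.
names = pd.Series(["Doe, Mr. John", "Roe, Mrs. Jane (Janet Smith)"],
                  name="Name")

# Drop the parenthesized part, then split on the comma into two columns.
cleaned = names.str.replace(r"\(.*\)", "", regex=True)
parts = (cleaned.str
         .split(",", expand=True)
         .rename({0: "family_name", 1: "first_name"}, axis=1))

# Remove the whitespace left around the second part.
parts["first_name"] = parts["first_name"].str.strip()

print(parts)
```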

# Cleaning data

`isnull` is a very convenient method:

In [None]:
titanic.isnull().head()

The resulting dataframe can now be used to determine if there are any missing values (by column or by row):

In [None]:
titanic.isnull().any()

In [None]:
titanic.isnull().any(axis=1).head()

Or calculate how many missing values are in a dataframe (by row or by column):

In [None]:
titanic.isnull().sum()

In [None]:
titanic.head(15)

Pandas is smart enough to fill missing values by column:

In [None]:
fill_values = titanic[['Age', 'Fare']].mean()

In [None]:
fill_values

In [None]:
titanic[titanic.Fare.isnull()]

In [None]:
titanic.fillna(fill_values).head(15)
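A sketch of per-column filling on a made-up frame: `fillna` matches the index of `fill_values` against column labels, so only `Age` and `Fare` are filled and other columns keep their gaps.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with gaps in three columns.
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0],
                    "Fare": [10.0, 30.0, np.nan],
                    "Cabin": ["C1", None, None]})

fill_values = toy[["Age", "Fare"]].mean()   # Series indexed by column name
filled = toy.fillna(fill_values)

print(fill_values)
print(filled)
```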

# Getting indicators and dummy variables

In [None]:
pd.get_dummies(titanic, columns=['Pclass', 'Sex', 'Embarked']).head()
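On a small made-up frame the effect of `pd.get_dummies` is easy to read off: each listed column is replaced by one indicator column per distinct value, named `<column>_<value>`, while the other columns are left untouched. The dtype of the indicator columns depends on the pandas version, so the sketch casts to `int` before printing.

```python
import pandas as pd

# Toy frame (made-up values).
toy = pd.DataFrame({"Sex": ["male", "female", "male"],
                    "Age": [22.0, 38.0, 26.0]})

dummies = pd.get_dummies(toy, columns=["Sex"])

print(dummies.columns.tolist())
print(dummies["Sex_male"].astype(int).tolist())
```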