In [1]:
import sys; sys.path.append("..")
from utils import count_down

# Pandas -- Series and DataFrames

Main source: https://github.com/jakevdp/PythonDataScienceHandbook/tree/master/notebooks

#### Pandas is a library for fast and efficient computation on large datasets. As in NumPy, many operations in Pandas are vectorized and therefore fast.


Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks (-> relational algebra) and spreadsheet programs.

As we saw, NumPy's ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (e.g., attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (e.g., groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of "data munging" tasks that occupy much of a data scientist's time.

In [2]:
# Just as we import numpy usually as np, we import pandas under the alias of pd. 
# We'll import numpy as well, because we'll need it often when using pandas
import numpy as np
import pandas as pd

## The Pandas Series Object
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [None]:
# np.nan marks a missing value
data = pd.Series([0.25, 0.5, np.nan, 1.0])
data

In [None]:
type(data)

In [None]:
data.values, type(data.values)

In [None]:
#The index is an array-like object of type pd.Index
data.index, type(data.index), list(data.index)

Like with a NumPy array, data can be accessed by the associated index via the familiar Python square-bracket notation:

In [None]:
data[1:3]

In [None]:
type(data[1])

In [None]:
print(dir(data))

### Series as generalized NumPy array

From what we've seen so far, it may look like the Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

This explicit index definition gives the Series object additional capabilities. For example, the index need not be an integer, but can consist of values of any desired type. For example, if we wish, we can use strings as an index:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'd', 'c'])
data

In [None]:
data.index = list("AbCD")
data

In [None]:
data["b"] == data.iloc[1]  # access by label vs. access by position

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[3, 7, 3, 4])
data

When an explicit index is present, it is preferred! (*as long as we don't slice!*)

In [None]:
data[3]

In [None]:
type(data[3])

### Series as specialized dictionary

In this way, you can think of a Pandas Series a bit like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.
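
To make the "typed" part concrete, here is a minimal sketch with made-up data (the names `prices`, `ser` and `doubled` are only illustrative). A dictionary accepts any mix of value types, whereas a Series records a single dtype for its values, which is what allows vectorized arithmetic:

```python
import pandas as pd

# Hypothetical example data, not part of the lecture dataset
prices = {"apple": 3, "pear": 5, "plum": 2}
ser = pd.Series(prices)

# All values share one dtype
print(ser.dtype)         # int64

# Arithmetic is applied to every value at once, the labels are kept
doubled = ser * 2
print(doubled["pear"])   # 10
```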

The Series-as-dictionary analogy can be made even more clear by constructing a Series object directly from a Python dictionary:

In [None]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

In [None]:
population['Texas']

Unlike a dictionary, though, the Series also supports array-style operations such as slicing:

In [None]:
population['California':'Illinois']
# note that Illinois is included!

### Constructing Series objects

In [None]:
# data can be a scalar
pd.Series(5, index=[100, 200, 300])

In [None]:
# data can be a dictionary, 
ser = pd.Series({2:'a', 1:'b', 3:'c'})
ser

In [None]:
ser.to_dict()

## The Pandas DataFrame Object

The next fundamental structure in Pandas is the DataFrame. Like the Series object, it can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.

### DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.



To demonstrate this, let's first construct a new Series listing the area of each of the five states discussed in the previous section:

In [None]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [None]:
states = pd.DataFrame({'population': population,
                       'area': area,
                       'country': 'USA'})
print(states.dtypes)
states

This looks like a generalized dictionary! The keys are the names of the states, and each value is like a list [population, area, country].

In [None]:
states.sort_values(by="population", ascending=False)

In [None]:
states['population'], type(states['population'])

In [None]:
states["population"].idxmax()  # index label of the row where "population" is largest

In [None]:
states.loc[states["population"].idxmax()]  # returns the row at that label as a Series

In [None]:
states['California']  # raises KeyError: square brackets look up columns, not rows

In [None]:
states.loc['California']

In [None]:
states.index

In [None]:
states.columns

In [None]:
states.values

In [None]:
type(states.values)

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

### DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' key returns the Series object containing the areas we saw earlier:

In [None]:
states["area"]
# note that indexing a DataFrame with square brackets gets the *column*!

In [None]:
type(states["area"])

### Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways:

#### From a single Series object

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

In [None]:
population

In [None]:
pd.DataFrame(population, columns=['population'])

#### From multiple Series

In [None]:
s1 = pd.Series(['100', '200', 'python', '300.12', '400'])
s2 = pd.Series(['10', '20', 'php', '30.12', '40'])
df = pd.concat([s1, s2], axis=1)
df

#### From a list of dicts 

Any list of dictionaries can be made into a DataFrame. Even if some keys are missing from some of the dictionaries, Pandas will fill them in with NaN (i.e., "not a number") values:

In [None]:
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}], index=["first_dict", "second_dict"])
df

As every single column must have a consistent dtype and np.nan is a float, some of the numbers get coerced into floats:

In [None]:
df['a']

In [None]:
df['b']

In [None]:
type(np.nan)

In [None]:
df.dtypes
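
If the coercion to float is unwanted, one option (not used in this lecture, and assuming a pandas version that provides the nullable `Int64` extension dtype) is to request that dtype explicitly. A small sketch with made-up values:

```python
import pandas as pd

# Default inference: the missing value forces the column to float64
as_float = pd.Series([1, None, 3])

# Nullable integer dtype: the integers stay integers, the gap becomes <NA>
as_int = pd.Series([1, None, 3], dtype="Int64")

print(as_float.dtype)  # float64
print(as_int.dtype)    # Int64
```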

If we select a row instead, Pandas has to fit values from several columns into a single Series, so they are coerced to a common dtype:

In [None]:
df

In [None]:
df.loc['first_dict']
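
The same effect can be sketched with a smaller frame (the column names `count` and `ratio` are hypothetical). Each column keeps its own dtype, but a single row has to fit into one Series, so the integer is upcast:

```python
import pandas as pd

frame = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 1.5]})

# Column-wise, the dtypes differ: int64 and float64
print(frame.dtypes)

# Row-wise, both values end up in one float64 Series
row = frame.loc[0]
print(row.dtype)   # float64
```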

#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:


In [None]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df

### Exercise

Create a DataFrame from the given dictionary as well as the 'qualifies' list, with the given indices:

In [None]:
exam_data = {
    'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    }
qualifies = ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

In [None]:
count_down(3)

In [None]:
df = pd.DataFrame(exam_data , index=labels)
df['qualifies'] = qualifies
df

In [None]:
df = pd.DataFrame({**exam_data, **{'qualifies': qualifies}} , index=labels)
df

## The Pandas Index Object

We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of as an immutable array:

In [None]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

In [None]:
ind[0] = 1  # raises TypeError: an Index is immutable

In [None]:
sr = pd.Series(0, index=ind)
sr

Index objects have a name:

In [None]:
ind.names = ['indexx']
ind

In [None]:
sr = pd.Series(np.zeros_like(ind), index=ind)
sr

Index objects also have many of the attributes familiar from NumPy arrays:

In [None]:
ind.size, ind.shape, ind.ndim, ind.dtype

While viewing an Index as an immutable list is natural, Index objects also support set operations:

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB)  # labels contained in both indices

In [None]:
indA.symmetric_difference(indB)  # labels contained in exactly one of the two indices
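
As a hedged summary, here are the four set operations as method calls. This assumes a recent pandas version, where the operator forms `&`, `|` and `^` on an Index are no longer set operations. The names `left` and `right` are only illustrative:

```python
import pandas as pd

left = pd.Index([1, 3, 5, 7, 9])
right = pd.Index([2, 3, 5, 7, 11])

common = left.intersection(right)             # labels in both
either = left.union(right)                    # labels in at least one
only_left = left.difference(right)            # labels in left but not in right
exclusive = left.symmetric_difference(right)  # labels in exactly one

print(list(common))     # [3, 5, 7]
print(list(only_left))  # [1, 9]
```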

# Data Indexing and Selection

From the NumPy lecture, we already know about indexing, slicing, masking, and fancy indexing:

In [None]:
a = np.arange(16).reshape(4,4)
a

In [None]:
a[:, [1, 3]][a[:, [1, 3]] % 3 == 0]
# Takes those values of the second and fourth column that are divisible by 3

Here we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. The corresponding patterns in Pandas are very similar to those of NumPy, though there are a few quirks to be aware of.

We'll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.

## Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

### Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values, which means most of the corresponding functions work just as well for them:

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

In [None]:
data.__contains__('b')

In [None]:
'b' in data

In [None]:
np.array_equal(data.keys(), data.index)

In [None]:
data

In [None]:
list(data.items())

In [None]:
data['e'] = 1.25
data

### Series as one-dimensional array

Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing. Examples of these are as follows:

In [None]:
# slicing by explicit index
data['a':'c']

In [None]:
# slicing by implicit integer index
data[0:2] 
# Note that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, 
# while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.

In [None]:
(data > 0.3) & (data < 0.8)

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

In [None]:
# fancy indexing
data[['a', 'e']]

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[1, 2, 3, 4])
data

In [None]:
data[1:3]

**If your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.**

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

In [None]:
# explicit index when indexing
data[1]

In [None]:
# implicit index when slicing
data[1:3]

The **loc** attribute allows indexing and slicing that always references the explicit index:

In [None]:
data.loc[1]

In [None]:
data.loc[1:3]

The **iloc** attribute allows indexing and slicing that always references the implicit Python-style index:

In [None]:
data.iloc[1]

In [None]:
data.iloc[1:3]

Save yourself the pain and always be explicit about what you mean: use ``.loc`` and ``.iloc``.

In [None]:
%%bash
python -c "import this" | grep "Explicit"
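
The difference between the two indexers can be sketched side by side, using the same kind of integer-labelled Series as above (the variable names are only illustrative):

```python
import pandas as pd

ser = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = ser.loc[1:3]      # labels 1 through 3, the end label is included
by_position = ser.iloc[1:3]  # positions 1 and 2, the end position is excluded

print(list(by_label))     # ['a', 'b']
print(list(by_position))  # ['b', 'c']
```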

## Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop, 'T':'T'})
data

Note that if we index a DataFrame, we index the **column**!!

In [None]:
# Dictionary-style indexing results in a Series....
print(type(data["area"]))
data["area"]

In [None]:
# Attribute-style access also works, but not if the column name collides with an existing
# DataFrame attribute or method (e.g. data.pop is a method, not the 'pop' column)
data.area

In [None]:
type(data.values)

With this picture in mind, many familiar array-like operations can be performed on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

In [None]:
data.T

For array-style indexing, Pandas again uses the loc and iloc indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the DataFrame index and column labels are maintained in the result:

In [None]:
data.values[:3, :2]

In [None]:
data.iloc[:3, :2]

In [None]:
data

In [None]:
data.loc[:'Illinois', :'pop']

In [None]:
data.loc[:,['area','pop']]

So, this is how we get a row!

In [None]:
data.loc["California", :]

In [None]:
# adding a new column.. (vectorized calculations!)
data['density'] = data['pop'] / data['area']
# we can combine masking with fancy indexing
data.loc[data.density > 100, ['pop', 'density']]

While indexing refers to columns, slicing refers to rows:

In [None]:
data['area']

In [None]:
data['Florida':'Illinois']

Again, it is better to be explicit about your indexing to save yourself a lot of confusion.

In [None]:
data['area':'pop']

In [None]:
data.loc[:, 'area':'pop']

Fast access to a single member using **at**

In [None]:
%%timeit
data.loc['Florida', 'pop']

In [None]:
%%timeit
data.at['Florida', 'pop']

### Boolean Indexing

In [None]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df['E'] = ["one", "two", "three"] * 2
df

In [None]:
df['E'].isin(['one','two'])

In [None]:
df[df['E'].isin(['one','two'])] = np.nan
df

In [None]:
pd.isna(df).any(axis=1)

In [None]:
df[~df.isna().any(axis=1)]

In [None]:
df.dropna(how="any")

### Exercise

Write a Pandas snippet to get the names and scores of the people where the number of attempts in the examination is greater than 2 as a dict.


In [None]:
exam_data = {'name': ['Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin', 'Jonas'],
    'score': [12.5, 9, 16.5, np.nan, 9, 20, 14.5, np.nan, 8, 19],
    'attempts' : [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
    'qualify': ['yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(exam_data , index=labels)
df

In [None]:
count_down(3)

In [None]:
df[df['attempts'] > 2][['name', 'score']]

In [None]:
df[df['attempts'] > 2].set_index('name')['score'].to_dict()

## Pandas indexing

Older versions of Pandas provided `Panel` objects that natively handled three-dimensional and four-dimensional data, but these have since been removed. A far more common pattern in practice is to make use of `hierarchical indexing` (also known as `multi-indexing`) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In [None]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

In [None]:
index = pd.MultiIndex.from_tuples(index)
index

In [None]:
index.names = ['state', 'year']

In [None]:
pop = pop.reindex(index)
pop

In [None]:
pop['California', 2000], pop['California', 2010]

In [None]:
pop.iloc[0], pop.iloc[1]
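
Beyond looking up single entries, a MultiIndex also allows selecting on one level only. The sketch below uses a subset of the population numbers from above; the name `pop_small` is only illustrative:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("Texas", 2000), ("Texas", 2010)],
    names=["state", "year"],
)
pop_small = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# All years for one state: select on the outer level
california = pop_small.loc["California"]

# One year for all states: take a cross-section on the inner level
year_2010 = pop_small.xs(2010, level="year")

print(california)
print(year_2010)
```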

### MultiIndex as extra dimension: stack() and unstack()

You might notice something else here: we could easily have stored the same data using a simple ``DataFrame`` with index and column labels.
In fact, Pandas is built with this equivalence in mind. The ``unstack()`` method will quickly convert a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:

In [None]:
pop.unstack()

In [None]:
index.names = [None, None]
pop = pop.reindex(index)
pop

In [None]:
pop.unstack()

In [None]:
pop.index.names = [None, None]
pop.unstack().T

In [None]:
popdf = pop.unstack(level=0)
popdf

In [None]:
popdf.stack()
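
That `stack()` and `unstack()` undo each other can be checked on a tiny example. This is a sketch with made-up values; the round trip is only this clean when the data has no missing values:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["a", "b"], [1, 2]])
ser = pd.Series([10, 20, 30, 40], index=index)

wide = ser.unstack()     # rows: a, b; columns: 1, 2
restored = wide.stack()  # back to a Series with a two-level index

print(wide.shape)         # (2, 2)
print(restored.tolist())  # [10, 20, 30, 40]
```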

### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the ``reset_index`` method.
Calling this on the population Series will result in a ``DataFrame`` with a *state* and *year* column holding the information that was formerly in the index.
For clarity, we can optionally specify the name of the data for the column representation:

In [None]:
pop

In [None]:
pop.index.names = ['state', 'year']
print(type(pop))
pop

In [None]:
pop_flat = pop.reset_index(name='population')
pop_flat

Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.
This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``:

In [None]:
pop_df = pop_flat.set_index(['state', 'year'])
pop_df

In [None]:
pop_df.rename_axis([None, None])

In [None]:
asdf = pop_df.rename_axis([None, None]).unstack()
asdf

In [None]:
asdf.columns

In [None]:
asdf["area"] = 999
asdf

In [None]:
asdf.columns

In [None]:
print(type(asdf["area"]))
asdf["area"]

In [None]:
print(type(asdf["population"]))
asdf["population"]

In [None]:
pop_flat

In [None]:
pop_df2 = pop_flat.set_index('state').rename_axis(None)
pop_df2

In [None]:
pop_df

In [None]:
pop_df.reset_index()

# Reading Series and DataFrames

In [None]:
%%bash
head Pokemon.csv

In [101]:
df = pd.read_csv("Pokemon.csv")

Imagine somebody gave you a random dataset and you don't know anything about its contents. What are the first steps you take?

In [None]:
df.head()

In [None]:
df.describe()

In [None]:
df["Type 1"].value_counts()

In [None]:
df["Legendary"].value_counts()

In [None]:
df = pd.read_csv("Pokemon.csv", index_col=0)
df.tail()

In [None]:
df.reset_index().tail()

In [None]:
df.reset_index().drop_duplicates(subset="#").tail()

In [None]:
df.reset_index().drop_duplicates(subset="#").reset_index().drop('index', axis=1).tail()

In [None]:
df = df[df['Name'] != 'Volcanion']
df.tail()

In [102]:
no_duplicates = df.reset_index().drop_duplicates(subset="#").reset_index().drop("index", axis=1)
no_duplicates.tail()

Unnamed: 0,level_0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
716,793,717,Yveltal,Dark,Flying,680,126,131,95,131,98,99,6,True
717,794,718,Zygarde50% Forme,Dragon,Ground,600,108,100,121,81,95,95,6,True
718,795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
719,797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
720,799,721,Volcanion,Fire,Water,600,80,110,120,130,90,70,6,True


In [103]:
no_duplicates.set_index("#").to_csv('Pokemon_no_duplicates.csv')
#no_duplicates.to_excel('Pokemon_no_duplicates.xlsx', sheet_name='Sheet1')

In [None]:
%%bash
head Pokemon_no_duplicates.csv

In [105]:
gen_one = no_duplicates[no_duplicates["Generation"] == 1].set_index("#")
gen_one.tail()

#,level_0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
147,159,Dratini,Dragon,,300,41,64,45,50,50,50,1,False
148,160,Dragonair,Dragon,,420,61,84,65,70,70,70,1,False
149,161,Dragonite,Dragon,Flying,600,91,134,95,100,100,80,1,False
150,162,Mewtwo,Psychic,,680,106,110,90,154,90,130,1,True
151,165,Mew,Psychic,,600,100,100,100,100,100,100,1,False


In [106]:
first_gen_dict = gen_one["Name"].to_dict()


[str(key)+" : "+str(val) for index, (key, val) in enumerate(first_gen_dict.items()) if index < 9]

['1 : Bulbasaur',
 '2 : Ivysaur',
 '3 : Venusaur',
 '4 : Charmander',
 '5 : Charmeleon',
 '6 : Charizard',
 '7 : Squirtle',
 '8 : Wartortle',
 '9 : Blastoise']

**End of Tuesday-Lecture**

**Addendum: Renaming Columns**

In [3]:
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 5, 7, 8]})
df

Unnamed: 0,a,b
0,1,2
1,2,5
2,3,7
3,4,8


In [5]:
df = df.rename({'b': 'c'}, axis='columns')
df

Unnamed: 0,a,c
0,1,2
1,2,5
2,3,7
3,4,8


# Ufuncs and Aggregation

## Aggregation in Pandas

Aggregations are functions that collapse one or more dimensions of the data into a single value, such as `max`, `sum` or `mean`.

Statistical operations generally *exclude* missing data.
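
A minimal sketch of what excluding missing data means in practice, with made-up values. The `skipna` parameter switches the behaviour off:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0])

print(ser.mean())              # 2.0, the NaN is skipped
print(ser.count())             # 2, only non-missing values are counted
print(ser.mean(skipna=False))  # nan, the NaN propagates
```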

### For Series

In [6]:
a = np.arange(7)
ser = pd.Series(a**2, index=a)
ser

0     0
1     1
2     4
3     9
4    16
5    25
6    36
dtype: int64

In [7]:
ser.sum()
#mean(), median(), min(), max(), ...

91

### For DataFrames

In [8]:
df = pd.DataFrame({'A': a**2,
                   'B': a**3})
df

Unnamed: 0,A,B
0,0,0
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216


In [9]:
df.mean()

A    13.0
B    63.0
dtype: float64

In [10]:
df.mean(axis=0)

A    13.0
B    63.0
dtype: float64

In [11]:
df.mean(axis='rows')

A    13.0
B    63.0
dtype: float64

In [12]:
df.mean(axis=1)

0      0.0
1      1.0
2      6.0
3     18.0
4     40.0
5     75.0
6    126.0
dtype: float64

In [13]:
df.mean(axis='columns')

0      0.0
1      1.0
2      6.0
3     18.0
4     40.0
5     75.0
6    126.0
dtype: float64

The following table summarizes some other built-in Pandas aggregations:

| Aggregation              | Description                     |
|--------------------------|---------------------------------|
| ``count()``              | Total number of items           |
| ``first()``, ``last()``  | First and last item             |
| ``mean()``, ``median()`` | Mean and median                 |
| ``min()``, ``max()``     | Minimum and maximum             |
| ``std()``, ``var()``     | Standard deviation and variance |
| ``mad()``                | Mean absolute deviation         |
| ``prod()``               | Product of all items            |
| ``sum()``                | Sum of all items                |

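Most of these are methods of both ``DataFrame`` and ``Series`` objects. Two caveats: ``first()`` and ``last()`` act as aggregations on grouped data (``GroupBy`` objects), and ``mad()`` was removed in pandas 2.0.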
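
Several aggregations can also be computed in one call with `agg`, which accepts a list of aggregation names. The sketch below rebuilds the small frame from above under the illustrative name `frame`:

```python
import numpy as np
import pandas as pd

a = np.arange(7)
frame = pd.DataFrame({"A": a**2, "B": a**3})

# One row per aggregation, one column per original column
summary = frame.agg(["min", "max", "mean"])
print(summary)
```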

## Ufuncs


We already know ufuncs from NumPy: they are vectorized functions that operate on all values of an array at once.

Pandas does the same, with a nice twist: for unary operations like negation and trigonometric functions, these ufuncs will *preserve index and column labels* in the output, and for binary operations such as addition and multiplication, Pandas will automatically *align indices* when passing the objects to the ufunc.


This means that keeping the context of data and combining data from different sources –both potentially error-prone tasks with raw NumPy arrays– become essentially foolproof ones with Pandas.

In [15]:
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,5,0,3,3
1,7,9,3,5
2,2,4,7,6


In [16]:
np.exp(df)

Unnamed: 0,A,B,C,D
0,148.413159,1.0,20.085537,20.085537
1,1096.633158,8103.083928,20.085537,148.413159
2,7.389056,54.59815,1096.633158,403.428793


### UFuncs: Index Alignment

For binary operations on two ``Series`` or ``DataFrame`` objects, Pandas will align indices in the process of performing the operation.
This is very convenient when working with incomplete data.

In [17]:
area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
                  'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127}, name='population')
area

Alaska        1723337
Texas          695662
California     423967
Name: area, dtype: int64

In [18]:
population

California    38332521
Texas         26448193
New York      19651127
Name: population, dtype: int64

In [19]:
area.index.intersection(population.index)

Index(['Texas', 'California'], dtype='object')

In [28]:
area/population

Alaska             NaN
California    0.011060
New York           NaN
Texas         0.026303
dtype: float64

In [22]:
"divide" in dir(pd.DataFrame)

True

In [25]:
popdens = area.divide(population, fill_value=0)
popdens

Alaska             inf
California    0.011060
New York      0.000000
Texas         0.026303
dtype: float64

In [26]:
popdens = popdens.replace([np.inf, -np.inf], np.nan)
popdens.dropna()

California    0.011060
New York      0.000000
Texas         0.026303
dtype: float64

In [29]:
(area/population).dropna()

California    0.011060
Texas         0.026303
dtype: float64

In [30]:
A = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                 columns=list('AB'))
A

Unnamed: 0,A,B
0,12,1
1,6,7


In [31]:
B = pd.DataFrame(rng.randint(0, 20, (3, 3)),
                 columns=list('ABC'))
B

Unnamed: 0,A,B,C
0,14,17,5
1,13,8,9
2,19,16,19


In [32]:
A+B

Unnamed: 0,A,B,C
0,26.0,18.0,
1,19.0,15.0,
2,,,


In [33]:
A.add(B, fill_value=0)

Unnamed: 0,A,B,C
0,26.0,18.0,5.0
1,19.0,15.0,9.0
2,19.0,16.0,19.0


### More Index-Alignment

In [35]:
df = pd.DataFrame({'a': np.random.randint(3, size=10)}, index=np.arange(1, 20, 2))
df

Unnamed: 0,a
1,0
3,0
5,1
7,1
9,0
11,1
13,0
15,1
17,1
19,2


Let's add a new column to this DataFrame!

In [36]:
tmp = pd.Series([0]*len(df.index))
tmp

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64

In [37]:
# df['new'] = tmp would modify df in place
df.assign(new=tmp)  # returns a new DataFrame and leaves df unchanged

Unnamed: 0,a,new
1,0,0.0
3,0,0.0
5,1,0.0
7,1,0.0
9,0,0.0
11,1,
13,0,
15,1,
17,1,
19,2,


In [38]:
old_aligned, new_aligned = df.align(tmp, axis=0)
old_aligned

Unnamed: 0,a
0,
1,0.0
2,
3,0.0
4,
5,1.0
6,
7,1.0
8,
9,0.0


In [40]:
new_aligned

0     0.0
1     0.0
2     0.0
3     0.0
4     0.0
5     0.0
6     0.0
7     0.0
8     0.0
9     0.0
11    NaN
13    NaN
15    NaN
17    NaN
19    NaN
dtype: float64

In [44]:
tmp = pd.Series([0]*len(df.index), index=df.index)
tmp

1     0
3     0
5     0
7     0
9     0
11    0
13    0
15    0
17    0
19    0
dtype: int64

In [45]:
df['new'] = tmp
df

Unnamed: 0,a,new
1,0,0
3,0,0
5,1,0
7,1,0
9,0,0
11,1,0
13,0,0
15,1,0
17,1,0
19,2,0
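
If the new values are meant to be matched by position rather than by label, one way to sidestep alignment is to pass a plain NumPy array instead of a Series. A sketch with made-up data (the names `frame` and `new_values` are only illustrative):

```python
import pandas as pd

frame = pd.DataFrame({"a": [10, 20, 30]}, index=[1, 3, 5])
new_values = pd.Series([7, 8, 9])  # default index 0, 1, 2

# A Series is matched by label: only label 1 exists in both
aligned = frame.assign(new=new_values)

# A NumPy array carries no labels and is matched by position
positional = frame.assign(new=new_values.to_numpy())

print(aligned)
print(positional)
```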


## apply()

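While some ufuncs (like cumsum or exp) are pre-defined by pandas, the method `apply` can be used to run an arbitrary function on a Series or DataFrame. For a DataFrame, the function is called once per column (or once per row with `axis=1`); for a Series, it is called once per element.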

In [46]:
a = np.arange(7)
df = pd.DataFrame({'A': a**2,
                   'B': a**3})
df

Unnamed: 0,A,B
0,0,0
1,1,1
2,4,8
3,9,27
4,16,64
5,25,125
6,36,216


In [47]:
df.cumsum()

Unnamed: 0,A,B
0,0,0
1,1,1
2,5,9
3,14,36
4,30,100
5,55,225
6,91,441


In [48]:
df["A_cumsum"] = df.cumsum()["A"]
df["B_cumsum"] = df.apply(np.cumsum)["B"]
df

Unnamed: 0,A,B,A_cumsum,B_cumsum
0,0,0,0,0
1,1,1,1,1
2,4,8,5,9
3,9,27,14,36
4,16,64,30,100
5,25,125,55,225
6,36,216,91,441


Using lambda functions, we can pass arbitrary functions to `apply`. Note that for a DataFrame the argument of the function is an entire column by default.

In [49]:
df.sum()

A            91
B           441
A_cumsum    196
B_cumsum    812
dtype: int64

In [51]:
df.apply(lambda x: print(x.sum()))

91
441
196
812


A           None
B           None
A_cumsum    None
B_cumsum    None
dtype: object

In [53]:
df

Unnamed: 0,A,B,A_cumsum,B_cumsum
0,0,0,0,0
1,1,1,1,1
2,4,8,5,9
3,9,27,14,36
4,16,64,30,100
5,25,125,55,225
6,36,216,91,441


In [54]:
df.apply(lambda x: x.max() - x.min())

A            36
B           216
A_cumsum     91
B_cumsum    441
dtype: int64
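
With `axis=1` the function receives one row at a time instead of one column. A short sketch on a smaller, made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [0, 1, 4], "B": [0, 1, 8]})

# Default (axis=0): the lambda sees each column, giving one value per column
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the lambda sees each row, giving one value per row
row_range = frame.apply(lambda row: row.max() - row.min(), axis=1)

print(col_range.to_dict())  # {'A': 4, 'B': 8}
print(row_range.tolist())   # [0, 0, 4]
```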

Note that `apply` works for both DataFrames and Series! For a Series, the function receives one element at a time:

In [56]:
df["A"].apply(lambda x: print(x))

0
1
4
9
16
25
36


0    None
1    None
2    None
3    None
4    None
5    None
6    None
Name: A, dtype: object

In [58]:
df["A_normed"] = df["A"].apply(lambda x: x/df["A"].max())
df

Unnamed: 0,A,B,A_cumsum,B_cumsum,A_normed
0,0,0,0,0,0.0
1,1,1,1,1,0.027778
2,4,8,5,9,0.111111
3,9,27,14,36,0.25
4,16,64,30,100,0.444444
5,25,125,55,225,0.694444
6,36,216,91,441,1.0


We can even use dictionaries with the apply-function!

In [60]:
z_moves = {"Normal": "Breakneck Blitz", "Fighting": "All-Out Pummeling", "Flying": "Supersonic Skystrike", "Poison": "Acid Downpour", "Ground": "Tectonic Rage", "Rock": "Continental Crush", "Bug": "Savage Spin-Out", "Ghost": "Never-Ending Nightmare",
"Steel": "Corkscrew Crash", "Fire": "Inferno Overdrive", "Water": "Hydro Vortex", "Grass": "Bloom Doom", "Electric": "Gigavolt Havoc", "Psychic": "Shattered Psyche", "Ice": "Subzero Slammer", "Dragon": "Devastating Drake", "Dark": "Black Hole Eclipse", "Fairy": "Twinkle Tackle"}
df = pd.read_csv("Pokemon.csv")
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False


In [62]:
df["Z-Move"] = df["Type 1"].apply(lambda x:z_moves[x])
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Z-Move
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False,Bloom Doom
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False,Bloom Doom
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False,Bloom Doom
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False,Bloom Doom
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False,Inferno Overdrive
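A dictionary lookup like the one above can also be written with `Series.map`, which accepts the dictionary directly. The sketch below uses a hypothetical three-row Series instead of `Pokemon.csv`; the type `"Shadow"` is made up to show what happens when a key is missing.

```python
import pandas as pd

# Hypothetical excerpt; "Shadow" is a made-up type that is not in the dict
types = pd.Series(["Grass", "Fire", "Shadow"], name="Type 1")
z_moves = {"Grass": "Bloom Doom", "Fire": "Inferno Overdrive"}

# map looks each value up in the dict; unknown keys become NaN instead of raising
mapped = types.map(z_moves)
print(mapped)

# the lambda-based lookup raises a KeyError on the unknown type
try:
    types.apply(lambda x: z_moves[x])
except KeyError as err:
    print("KeyError:", err)
```

Which behaviour is preferable depends on the data: a `KeyError` surfaces unexpected categories early, while `map` keeps going and marks them as missing.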


Using `apply`, we can also convert a Series of lists into a DataFrame: `pd.Series` turns each list into a Series, and each of these becomes one row of the result (shorter lists are padded with missing values):

In [63]:
s = pd.Series([ ['Red', 'Green', 'White'], ['Red', 'Black'], ['Yellow']]) 
print(type(s))
s

<class 'pandas.core.series.Series'>


0    [Red, Green, White]
1           [Red, Black]
2               [Yellow]
dtype: object

In [64]:
df = s.apply(pd.Series)
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1,2
0,Red,Green,White
1,Red,Black,
2,Yellow,,


### Exercise

Write a Pandas program to convert this Series of lists to one flat Series.

In [76]:
s = pd.Series([ ['Red', 'Green', 'White'], ['Red', 'Black'], ['Yellow']])
s

0    [Red, Green, White]
1           [Red, Black]
2               [Yellow]
dtype: object

In [66]:
count_down(3)





In [67]:
s.apply(pd.Series).stack()

0  0       Red
   1     Green
   2     White
1  0       Red
   1     Black
2  0    Yellow
dtype: object

In [77]:
s = s.apply(pd.Series).stack().reset_index(drop=True) 
s

0       Red
1     Green
2     White
3       Red
4     Black
5    Yellow
dtype: object
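As an alternative to the `apply(pd.Series).stack()` route, `Series.explode` (available since pandas 0.25) flattens list-like entries directly. This is a sketch of one possible solution, not the only one; the name `flat` is chosen here for illustration.

```python
import pandas as pd

s = pd.Series([['Red', 'Green', 'White'], ['Red', 'Black'], ['Yellow']])

# explode gives every list element its own row (repeating the original index);
# reset_index(drop=True) then renumbers the rows from 0
flat = s.explode().reset_index(drop=True)
print(flat.tolist())
```

Unlike the `stack()` version, this never builds the intermediate DataFrame padded with missing values.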

# Group-By

In [82]:
df = pd.read_csv("Pokemon.csv")
df.groupby('Type 1')['Name'].count()

Type 1
Bug          69
Dark         31
Dragon       32
Electric     44
Fairy        17
Fighting     27
Fire         52
Flying        4
Ghost        32
Grass        70
Ground       32
Ice          24
Normal       98
Poison       28
Psychic      57
Rock         44
Steel        27
Water       112
Name: Name, dtype: int64

## Split-Apply-Combine

Simple aggregations such as `sum()` or `mean()` are built into pandas, but often we want to compute them (or a custom operation) separately for each group of rows. This is what **group-by** is for. The group-by operation can be described as having the following steps:

* **Splitting** the data into groups based on some criteria (breaking up and grouping depending on the value of a key)
* **Applying** a function to each group independently (aggregation, transformation, filtering, ...)
* **Combining** the results into a data structure

A typical example, in which the *apply* step is a summation aggregation, is illustrated here:

![](split-apply-combine.png)
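To make the three steps concrete, they can be spelled out by hand. The sketch below is only meant as an illustration of the idea (the names `pieces`, `sums` and `manual` are made up); pandas does not implement `groupby` this way internally.

```python
import pandas as pd

df = pd.DataFrame({"key": list("ABCABC"), "data": range(1, 7)})

# Split: one sub-frame per distinct key
pieces = {key: df[df["key"] == key] for key in df["key"].unique()}

# Apply: reduce each piece to a single number
sums = {key: piece["data"].sum() for key, piece in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(sums, name="data").rename_axis("key")

# groupby performs the same three steps in a single call
grouped = df.groupby("key")["data"].sum()
print(manual.to_dict() == grouped.to_dict())
```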

In [83]:
tmp = np.array([list("ABCABC"), np.arange(1,7)]).T
tmp

array([['A', '1'],
       ['B', '2'],
       ['C', '3'],
       ['A', '4'],
       ['B', '5'],
       ['C', '6']], dtype='<U21')

In [84]:
df = pd.DataFrame(tmp, columns=["key", "data"])
df["data"] = pd.to_numeric(df["data"])
df

Unnamed: 0,key,data
0,A,1
1,B,2
2,C,3
3,A,4
4,B,5
5,C,6


In [85]:
df.groupby("key")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fc01e82f748>

Note that what is returned is not a set of `DataFrame`s, but a `DataFrameGroupBy` object. This object is where the magic is: you can think of it as a special view of the `DataFrame`, which is poised to dig into the groups but does no actual computation until the aggregation is applied. This "lazy evaluation" approach means that common aggregates can be implemented very efficiently in a way that is almost transparent to the user.

To produce a result, we can apply an aggregate to this `DataFrameGroupBy` object, which will perform the appropriate apply/combine steps to produce the desired result:

In [90]:
df.groupby("key").sum().reset_index()

Unnamed: 0,key,data
0,A,5
1,B,7
2,C,9


In [92]:
df.groupby("key")["data"].sum()
# we can do column indexing just like on a normal DataFrame

key
A    5
B    7
C    9
Name: data, dtype: int64

### Iteration over groups

The ``GroupBy`` object supports direct iteration over the groups, yielding a ``(key, group)`` pair for each group, where the group is a ``Series`` or ``DataFrame``:

In [93]:
df

Unnamed: 0,key,data
0,A,1
1,B,2
2,C,3
3,A,4
4,B,5
5,C,6


In [95]:
for (key, _) in df.groupby("key"):
    print(key)
    
print()
for (_, group) in df.groupby("key"):
    print(group, "\n")

A
B
C

  key  data
0   A     1
3   A     4 

  key  data
1   B     2
4   B     5 

  key  data
2   C     3
5   C     6 
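If only one group is of interest, a loop is not necessary. The following sketch rebuilds the small example frame and uses `get_group`, `ngroups` and `size()` to inspect the split; the variable name `grouped` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": list("ABCABC"), "data": range(1, 7)})
grouped = df.groupby("key")

# get_group returns the rows of a single group without looping
print(grouped.get_group("B"))

# ngroups and size() give a quick overview of the split
print(grouped.ngroups)
print(grouped.size())
```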



### Dispatch methods

Any method not explicitly implemented by the ``GroupBy`` object will be passed through and called on the groups, whether they are ``DataFrame`` or ``Series`` objects.
For example, you can use the ``describe()`` method of ``DataFrame``s to perform a set of aggregations that describe each group in the data:

In [96]:
df.describe()

Unnamed: 0,data
count,6.0
mean,3.5
std,1.870829
min,1.0
25%,2.25
50%,3.5
75%,4.75
max,6.0


In [97]:
df.groupby("key").describe()

Unnamed: 0_level_0,data,data,data,data,data,data,data,data
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
A,2.0,2.5,2.12132,1.0,1.75,2.5,3.25,4.0
B,2.0,3.5,2.12132,2.0,2.75,3.5,4.25,5.0
C,2.0,4.5,2.12132,3.0,3.75,4.5,5.25,6.0


In [107]:
df = pd.read_csv("Pokemon_no_duplicates.csv", index_col=0)
df.groupby('Generation')["Name"].nunique()

Generation
1    151
2    100
3    135
4    107
5    156
6     72
Name: Name, dtype: int64

### Exercise

The given dataset contains a column `Region` as well as a column `Pop. Density`. Write a snippet that takes the DataFrame containing all the countries as its argument and returns a `Series` mapping each region to the average population density of its countries.

In [110]:
countries = pd.read_csv('countries.csv', index_col=0)
countries.head()

Unnamed: 0,Country,Subcontinent,Region,In EU,Population,Area,Pop. Density,Coastline,Net migration,Infant mortality,...,Phones,Arable,Crops,Other,Climate,Birthrate,Deathrate,Agriculture,Industry,Service
0,Afghanistan,West & Central Asia,ASIA (EX. NEAR EAST),False,31056997,1677019.0,47.96,0.0,23.06,163.07,...,3.22,12.13,0.22,87.65,1.0,46.6,20.34,0.38,0.24,0.38
1,Albania,Europe,EASTERN EUROPE,False,3581655,74457.03,124.59,1.26,-4.93,21.52,...,71.2,21.09,4.42,74.49,3.0,15.11,5.22,0.232,0.188,0.579
2,Algeria,North Africa,NORTHERN AFRICA,False,32930091,6168683.0,13.83,0.04,-0.39,31.0,...,78.1,3.22,0.25,96.53,1.0,17.14,4.61,0.101,0.6,0.298
3,American Samoa,Oceania,OCEANIA,False,57794,515.408,290.42,58.29,-20.71,9.27,...,259.54,10.0,15.0,75.0,2.0,22.46,3.27,,,
4,Andorra,Europe,WESTERN EUROPE,False,71201,1212.115,152.14,0.0,6.6,4.05,...,497.18,2.22,0.0,97.78,3.0,8.71,6.25,,,


In [109]:
count_down(5)





In [112]:
countries.groupby("Region")["Pop. Density"].mean().rename_axis(None).sort_values(ascending=False)

ASIA (EX. NEAR EAST)    1264.819286
WESTERN EUROPE           952.042857
NEAR EAST                427.078750
NORTHERN AMERICA         260.872000
LATIN AMER. & CARIB      136.191778
OCEANIA                  131.182857
EASTERN EUROPE           100.890833
SUB-SAHARAN AFRICA        92.259020
C.W. OF IND. STATES       56.700833
BALTICS                   39.833333
NORTHERN AFRICA           38.935000
Name: Pop. Density, dtype: float64
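Since the exercise asks for something that takes the DataFrame as an argument, the one-liner can also be wrapped in a function. This is a sketch: the function name `mean_density_by_region` is made up, and the toy rows below are invented rather than taken from `countries.csv`.

```python
import pandas as pd

def mean_density_by_region(countries):
    """Return a Series mapping each region to the mean 'Pop. Density' of its countries."""
    return countries.groupby("Region")["Pop. Density"].mean()

# Invented toy data, not taken from countries.csv
toy = pd.DataFrame({
    "Country": ["Aland", "Borduria", "Cascadia", "Dinotopia"],
    "Region": ["NORTH", "NORTH", "SOUTH", "SOUTH"],
    "Pop. Density": [10.0, 30.0, 5.0, 15.0],
})
result = mean_density_by_region(toy)
print(result)
```

Note that this is the mean over countries, not the density of the region as a whole: every country counts equally, regardless of its area or population.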

## Aggregate, filter, transform, apply

So far, we focused on aggregation for the combine operation, but there are more options available.
In particular, ``GroupBy`` objects have ``aggregate()``, ``filter()``, ``transform()``, and ``apply()`` methods that efficiently implement a variety of useful operations before combining the grouped data.

For the purpose of the following subsections, we'll use this ``DataFrame``:

In [113]:
def create_df():
    rng = np.random.RandomState(0)
    df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                       'data1': range(6),
                       'data2': rng.randint(0, 10, 6)},
                       columns = ['key', 'data1', 'data2'])
    return df
    
df = create_df()
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


### Aggregation

We have already used individual aggregation methods such as `sum()` and `count()`; the `aggregate` method (alias `agg`) is the general, explicit form.  
It can take a string, a function, or a list thereof, and compute all the aggregates at once.

In [114]:
df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [115]:
# string names select pandas' built-in implementations of each aggregate
df.groupby('key').aggregate(['min', 'median', 'max'])

Unnamed: 0_level_0,data1,data1,data1,data2,data2,data2
Unnamed: 0_level_1,min,median,max,min,median,max
key,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
A,0,1.5,3,3,4.0,5
B,1,2.5,4,0,3.5,7
C,2,3.5,5,3,6.0,9


Another useful pattern is to pass a dictionary mapping column names to operations to be applied on that column:

In [116]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0,5
B,1,7
C,2,9
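A related pattern is *named aggregation* (pandas 0.25 and later), where each keyword argument names an output column and maps it to a `(column, aggregation)` pair. The sketch below hard-codes the values of the example frame; the output names `data1_min` and `data2_max` are chosen freely for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# keyword = (column, aggregation): the keyword becomes the output column name
summary = df.groupby('key').agg(
    data1_min=('data1', 'min'),
    data2_max=('data2', 'max'),
)
print(summary)
```

This avoids the hierarchical column index that a list of aggregates produces and gives descriptive names in one step.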


### Filtering

A filtering operation allows you to drop data based on the group properties.
For example, we might want to keep all groups in which the standard deviation is larger than some critical value:

In [117]:
def filter_func(x):
    return x['data2'].std() > 4

df

Unnamed: 0,key,data1,data2
0,A,0,5
1,B,1,0
2,C,2,3
3,A,3,3
4,B,4,7
5,C,5,9


In [118]:
df.groupby('key').std()

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,2.12132,1.414214
B,2.12132,4.949747
C,2.12132,4.242641


In [119]:
df.groupby('key').filter(filter_func)
# note that this is not an aggregate: the result keeps the columns and index labels of the original DataFrame, just with the rows of the rejected groups left out!

Unnamed: 0,key,data1,data2
1,B,1,0
2,C,2,3
4,B,4,7
5,C,5,9
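The same selection can be expressed as ordinary boolean indexing, which may help to see what `filter` does. In this sketch (with the example values hard-coded and illustrative names `kept`, `group_std`, `masked`), `transform('std')` broadcasts each group's standard deviation back to the rows of that group.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': [5, 0, 3, 3, 7, 9]})

# filter: the function returns one boolean per group
kept = df.groupby('key').filter(lambda g: g['data2'].std() > 4)

# same rows via a mask: one std value per row, taken from the row's group
group_std = df.groupby('key')['data2'].transform('std')
masked = df[group_std > 4]

print(kept.index.tolist(), masked.index.tolist())
```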


### The apply() method

The ``apply()`` method lets you apply an arbitrary function to the group results.
The function should take a ``DataFrame``, and return either a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar; the combine operation will be tailored to the type of output returned.

First, remember our ``apply`` from before:

In [121]:
df = create_df()
df["data1"] = df["data1"].apply(lambda x: x/df["data1"].max())
df

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.2,0
2,C,0.4,3
3,A,0.6,3
4,B,0.8,7
5,C,1.0,9


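Keep in mind that ``groupby`` only returns a *view of the original DataFrame*, so the groups it yields should not be modified in place without care.  
We now normalize the first column by the sum of the second within each group, first with an explicit loop over the groups and then with ``apply()``: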

In [None]:
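import warnings
warnings.filterwarnings('ignore')  # silence warnings in the following cells

# start from a clean state in case the next cell was run before
try:
    del newdf
except NameError:
    pass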

In [122]:
sums = df.groupby('key')["data2"].sum()
print(sums, '\n\n\n')
parts = []  # collect the normalized groups (DataFrame.append was removed in pandas 2.0)
for key, group in df.groupby('key'):
    group["data1"] /= sums[key]
    parts.append(group)
    newdf = pd.concat(parts)  # everything processed so far
    print(newdf, '\n')

newdf

key
A     8
B     7
C    12
Name: data2, dtype: int64 



  key  data1  data2
0   A  0.000      5
3   A  0.075      3 

  key     data1  data2
0   A  0.000000      5
3   A  0.075000      3
1   B  0.028571      0
4   B  0.114286      7 

  key     data1  data2
0   A  0.000000      5
3   A  0.075000      3
1   B  0.028571      0
4   B  0.114286      7
2   C  0.033333      3
5   C  0.083333      9 



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.


Unnamed: 0,key,data1,data2
0,A,0.0,5
3,A,0.075,3
1,B,0.028571,0
4,B,0.114286,7
2,C,0.033333,3
5,C,0.083333,9


In [123]:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x['data1'] /= x['data2'].sum()
    return x

df

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.2,0
2,C,0.4,3
3,A,0.6,3
4,B,0.8,7
5,C,1.0,9


In [124]:
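# group_keys=False keeps the original row index instead of adding the group key as an index level
df.groupby('key', group_keys=False).apply(norm_by_data2)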

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.028571,0
2,C,0.033333,3
3,A,0.075,3
4,B,0.114286,7
5,C,0.083333,9
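The introduction to this section also lists ``transform()``, which returns a result with the same index as the input. As a sketch under that reading (example values hard-coded, names `group_sums` and `normed` made up), the same normalization can be written without a custom function and without modifying the groups in place:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                   'data2': [5, 0, 3, 3, 7, 9]})

# transform('sum') returns, for every row, the data2 sum of that row's group
group_sums = df.groupby('key')['data2'].transform('sum')

# assign builds a new frame, so the original df is left unchanged
normed = df.assign(data1=df['data1'] / group_sums)
print(normed)
```

Because nothing is changed in place, this variant does not depend on whether the groups are views or copies of the original data.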


## Specifying the split key

In the simple examples presented before, we split the ``DataFrame`` on a single column name.
This is just one of many options by which the groups can be defined, and we'll go through some other options for group specification here.

### A list, array, series, or index providing the grouping keys

The key can be any series or list with a length matching that of the ``DataFrame``. For example:

In [125]:
df

Unnamed: 0,key,data1,data2
0,A,0.0,5
1,B,0.2,0
2,C,0.4,3
3,A,0.6,3
4,B,0.8,7
5,C,1.0,9


In [126]:
L = [0, 1, 0, 1, 2, 0]
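# numeric_only=True leaves out the string column 'key', which cannot be summed meaningfully
df.groupby(L).sum(numeric_only=True)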

Unnamed: 0,data1,data2
0,1.4,17
1,0.8,3
2,0.8,7
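The key does not have to be written out by hand; a Series computed from the data works just as well. A small sketch, with the example values hard-coded and the names `is_large` and `counts` chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                   'data2': [5, 0, 3, 3, 7, 9]})

# a boolean Series derived from the data serves as the grouping key
is_large = (df['data2'] > 4).rename('data2 > 4')
counts = df.groupby(is_large)['data1'].count()
print(counts)
```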


### A dictionary or series mapping index to group

Another method is to provide a dictionary that maps index values to the group keys:

In [127]:
df2 = df.set_index('key')
df2

Unnamed: 0_level_0,data1,data2
key,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0.0,5
B,0.2,0
C,0.4,3
A,0.6,3
B,0.8,7
C,1.0,9


In [129]:
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
df2.groupby(mapping).sum()

Unnamed: 0,data1,data2
consonant,2.4,19
vowel,0.6,8


Grouping by multiple keys produces a hierarchical index:

In [130]:
df2.groupby([mapping, "key"]).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,data1,data2
Unnamed: 0_level_1,key,Unnamed: 2_level_1,Unnamed: 3_level_1
consonant,B,1.0,7
consonant,C,1.4,12
vowel,A,0.6,8
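Besides lists and dictionaries, `groupby` also accepts a Python function, which is called on each index value and whose return value becomes the group key. The sketch below rebuilds `df2` with hard-coded values; the name `lowered` is made up.

```python
import pandas as pd

df2 = pd.DataFrame({'data1': [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
                    'data2': [5, 0, 3, 3, 7, 9]},
                   index=pd.Index(['A', 'B', 'C', 'A', 'B', 'C'], name='key'))

# str.lower is called on every index value; its return value is the group key
lowered = df2.groupby(str.lower).sum()
print(lowered)
```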


Video tutorial from PyCon 2015:

In [None]:
from IPython.display import YouTubeVideo
YouTubeVideo('5JnMutdy6Fw')