# Data Manipulation with Pandas

The following consists of code snippets and general notes from the Python Data Science Handbook by Jake VanderPlas.

We will focus on Series and DataFrame objects.

In [1]:
import pandas
pandas.__version__

'0.24.2'

In [2]:
import pandas as pd
import numpy as np

## Introducing Pandas Objects

**Series**, **DataFrame**, **Index**

### The Pandas Series Object

**one-dimensional array of indexed data**

In [3]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [4]:
data.values  # a familiar NumPy array

array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
data.index  # a pd.Index object

RangeIndex(start=0, stop=4, step=1)

In [6]:
data[1:3]

1    0.50
2    0.75
dtype: float64

### Series as generalized NumPy Array

The difference between a `Series` object and a one-dimensional NumPy array is the presence of the index.

- The NumPy array has an ***implicitly defined*** integer index
- The `Series` object has an ***explicitly defined*** index associated with the values; the index need not be integers and can use any desired type, such as strings


In [7]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [8]:
data['b']

0.5

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [10]:
data[5]

0.5

### Series as specialized dictionary

- "A dictionary is a structure that maps arbitrary keys to a set of arbitrary values"
- "A `Series` is a structure that maps ***typed keys*** to a set of ***typed values***."

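This typing is what makes `Series` objects more efficient than Python dictionaries for certain operations, in the same way that the type-specific compiled code behind a NumPy array makes it more efficient than a Python list.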
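As a rough illustration of what "typed" means here, the sketch below shows that a `Series` stores its values under a single dtype and supports vectorized arithmetic, whereas a dictionary needs an explicit loop. The names `prices_dict` and `prices` and their values are made up for this example, and no claim is made about the size of any speed difference.

```python
import pandas as pd

# Hypothetical example data, not taken from the notes above
prices_dict = {'apple': 1.5, 'pear': 2.0, 'plum': 0.5}
prices = pd.Series(prices_dict)

# All values of the Series share one dtype
values_dtype = prices.dtype

# Vectorized arithmetic on the Series ...
doubled_series = prices * 2

# ... versus an explicit Python-level loop over the dictionary
doubled_dict = {key: value * 2 for key, value in prices_dict.items()}

print(values_dtype)
print(doubled_series)
```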

In [11]:
population_dict = {'California': 38332521,
                  'Texas': 26448193,
                  'New York': 19651127,
                  'Florida': 19552860,
                  'Illinois': 12882135}
population = pd.Series(population_dict)
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

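By default, a `Series` built from a dictionary draws its index from the dictionary keys. In the pandas version used here the keys keep their insertion order, as the output above shows; older releases (before 0.23, or on Python older than 3.6) sorted them instead.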
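If a sorted index is wanted regardless of how pandas ordered the dictionary keys, `sort_index` produces one explicitly. A minimal sketch, reusing the population figures from above (the name `population_sorted` is introduced only for this example):

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# sort_index returns a new Series ordered by label;
# the original Series is left unchanged
population_sorted = population.sort_index()
print(population_sorted)
```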

In [12]:
population['California']

38332521

In [13]:
population['California':'Illinois']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

### Constructing Series Objects

- `pd.Series(data)`
- `pd.Series(data, index=index)`
- `pd.Series(dictionary)`

In [14]:
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [15]:
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [16]:
pd.Series({2:'a', 1:'b', 3:'c'})  # keys keep insertion order here (older pandas sorted them)

2    a
1    b
3    c
dtype: object

In [17]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])  # choosing indices

3    c
2    a
dtype: object

### The Pandas DataFrame Object

Like the `Series`, a `DataFrame` can be thought of either as a generalization of a NumPy array or as a specialization of a Python dictionary.

#### DataFrame as a generalized NumPy array

- a `DataFrame` is analogous to a 2-D array with both flexible row indices and column names
- a `DataFrame` is like a sequence of aligned `Series` objects
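To make "aligned" concrete, the sketch below builds a `DataFrame` from two small `Series` whose labels only partly overlap. The names `s1`, `s2`, and `frame` and their values are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series with partly overlapping labels
s1 = pd.Series({'a': 1, 'b': 2})
s2 = pd.Series({'b': 20, 'c': 30})

# The columns are aligned on their labels: the row index is the
# union of both indexes, and missing entries are filled with NaN
frame = pd.DataFrame({'s1': s1, 's2': s2})
print(frame)
```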

In [18]:
area_dict = {'California': 423967,
            'Texas': 695662,
            'New York': 141297,
            'Florida': 170312,
            'Illinois': 149995}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [19]:
states = pd.DataFrame({'population': population,
                      'area': area})
states

|  | population | area |
| --- | --- | --- |
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


In [20]:
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

In [21]:
states.columns

Index(['population', 'area'], dtype='object')

In [22]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

#### Constructing DataFrame Objects

In [23]:
# From a single Series Object
pd.DataFrame(population, columns=['population'])

|  | population |
| --- | --- |
| California | 38332521 |
| Texas | 26448193 |
| New York | 19651127 |
| Florida | 19552860 |
| Illinois | 12882135 |


In [24]:
# From a list of dicts
data = [{'a': i, 'b': 2 * i}
       for i in range(3)]
pd.DataFrame(data)

|  | a | b |
| --- | --- | --- |
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |


In [25]:
# Keys missing from a dict are filled in with NaN
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

|  | a | b | c |
| --- | --- | --- | --- |
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |


In [26]:
# From a dictionary of Series objects
pd.DataFrame({'population': population,
             'area': area})

|  | population | area |
| --- | --- | --- |
| California | 38332521 | 423967 |
| Texas | 26448193 | 695662 |
| New York | 19651127 | 141297 |
| Florida | 19552860 | 170312 |
| Illinois | 12882135 | 149995 |


In [27]:
# From a 2-D NumPy array
pd.DataFrame(np.random.rand(3, 2),
            columns=['foo', 'bar'],
            index=['a', 'b', 'c'])

|  | foo | bar |
| --- | --- | --- |
| a | 0.068902 | 0.254353 |
| b | 0.582596 | 0.049694 |
| c | 0.221211 | 0.951991 |


In [28]:
# From a NumPy structured array
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])
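The cell above only creates the structured array. A minimal sketch of the remaining step, passing it to `pd.DataFrame`, might look like this (the name `df` is introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Structured array with an integer field 'A' and a float field 'B'
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])

# Each field of the structured array becomes a DataFrame column
df = pd.DataFrame(A)
print(df)
print(df.dtypes)
```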

### The Pandas Index Object

- Can be thought of either as an ***immutable array*** or as an ***ordered set*** (strictly a ***multiset***, since an `Index` may contain repeated values)

In [29]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

#### Index as an immutable array

In [30]:
ind[1]

3

In [31]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [32]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [33]:
ind[1] = 0

TypeError: Index does not support mutable operations

#### Index as ordered set

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA & indB  # intersection

In [None]:
indA | indB  # union

In [None]:
indA ^ indB  # symmetric difference
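The operator forms above reflect the pandas version used in these notes; newer pandas releases deprecated and later changed what `&`, `|`, and `^` mean on an `Index`. The method forms below are a sketch of the same three set operations that does not depend on that operator behavior. The names `common`, `either`, and `only_one` are introduced just for this example.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # values present in both
either = indA.union(indB)                   # values present in at least one
only_one = indA.symmetric_difference(indB)  # values present in exactly one

print(common)
print(either)
print(only_one)
```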

## Data Indexing and Selection

### Data Selection in Series

#### Series as Dictionary

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                index=['a', 'b', 'c', 'd'])
data

In [None]:
'a' in data

In [None]:
data.keys()

In [None]:
list(data.items())

In [None]:
data['e'] = 1.25
data

#### Series as 1-D Array

In [None]:
data['a':'c']

In [None]:
data[0:2]

In [None]:
data[(data > 0.3) & (data < 0.8)]

In [None]:
data[['a', 'e']]
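Since the cells in this subsection are shown without output, here is a small self-contained sketch of the four access patterns. It rebuilds the same `Series`, including the added `'e'` entry, and uses `iloc` for the positional slice so the intent is explicit. The result names (`by_label`, `by_position`, `masked`, `fancy`) are introduced only for this example.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data['e'] = 1.25  # extend the Series by assigning to a new label

by_label = data['a':'c']       # label slice: the end label 'c' is included
by_position = data.iloc[0:2]   # positional slice: position 2 is excluded
masked = data[(data > 0.3) & (data < 0.8)]  # boolean mask
fancy = data[['a', 'e']]       # fancy indexing with a list of labels

print(by_label)
print(by_position)
print(masked)
print(fancy)
```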

### Indexers: loc, iloc, and ix

In [34]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [35]:
# explicit index when indexing
data[1]

'a'

In [36]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

The `loc` attribute allows indexing and slicing that always references the explicit index

In [37]:
data.loc[1]

'a'

In [38]:
data.loc[1:3]

1    a
3    b
dtype: object

The `iloc` attribute allows indexing and slicing that references the implicit Python-style index

In [39]:
data.iloc[1]

'b'

In [40]:
data.iloc[1:3]

3    b
5    c
dtype: object

### Data Selection in DataFrame

#### DataFrame as a Dictionary

In [42]:
area = pd.Series({'California': 423967,
                 'Texas': 695662,
                 'New York': 141297,
                 'Florida': 170312,
                 'Illinois': 149995})
pop = pd.Series({'California': 38332521,
                'Texas': 26448193,
                'New York': 19651127,
                'Florida': 19552860,
                'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})
data

|  | area | pop |
| --- | --- | --- |
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


In [43]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [44]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
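Attribute-style access is a convenience that does not work for every column. A column whose name collides with a `DataFrame` method is shadowed by that method, as the sketch below shows for `pop` (a shortened two-row frame is used here for brevity; the variable names are introduced only for this example).

```python
import pandas as pd

data = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193]},
                    index=['California', 'Texas'])

# 'area' is not a DataFrame method, so attribute access returns the column
same_values = data.area.equals(data['area'])

# 'pop' is a DataFrame method, so data.pop is the method, not the column
pop_is_method = callable(data.pop)
pop_is_series = isinstance(data.pop, pd.Series)

print(same_values, pop_is_method, pop_is_series)
```

Dictionary-style access such as `data['pop']` avoids this ambiguity, which is why the next cell uses it.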

In [46]:
data['density'] = data['pop'] / data['area']
data

|  | area | pop | density |
| --- | --- | --- | --- |
| California | 423967 | 38332521 | 90.413926 |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


#### DataFrame as 2-D Array

In [47]:
data.values

array([[4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
       [6.95662000e+05, 2.64481930e+07, 3.80187404e+01],
       [1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
       [1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
       [1.49995000e+05, 1.28821350e+07, 8.58837628e+01]])

In [48]:
data.T

|  | California | Texas | New York | Florida | Illinois |
| --- | --- | --- | --- | --- | --- |
| area | 423967.0 | 695662.0 | 141297.0 | 170312.0 | 149995.0 |
| pop | 38332520.0 | 26448190.0 | 19651130.0 | 19552860.0 | 12882140.0 |
| density | 90.41393 | 38.01874 | 139.0767 | 114.8061 | 85.88376 |


In [49]:
data.values[0]  # access a row

array([4.23967000e+05, 3.83325210e+07, 9.04139261e+01])

In [50]:
data['area']  # access a column

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [53]:
data.iloc[:3, :2]  # access like a NumPy array

|  | area | pop |
| --- | --- | --- |
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |


In [54]:
data.loc[:'Illinois', :'pop']  # access by names

|  | area | pop |
| --- | --- | --- |
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |
| Florida | 170312 | 19552860 |
| Illinois | 149995 | 12882135 |


In [56]:
data.ix[:3, :'pop']  # combination of the two; .ix is deprecated (removed in pandas 1.0)

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


|  | area | pop |
| --- | --- | --- |
| California | 423967 | 38332521 |
| Texas | 695662 | 26448193 |
| New York | 141297 | 19651127 |


In [57]:
data.loc[data.density > 100, ['pop', 'density']]

|  | pop | density |
| --- | --- | --- |
| New York | 19651127 | 139.076746 |
| Florida | 19552860 | 114.806121 |


In [58]:
data.iloc[0, 2] = 90
data

|  | area | pop | density |
| --- | --- | --- | --- |
| California | 423967 | 38332521 | 90.0 |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


#### Additional Indexing Conventions

- ***indexing*** refers to columns, while ***slicing*** and ***masking*** refer to rows

In [59]:
data['Florida':'Illinois']

|  | area | pop | density |
| --- | --- | --- | --- |
| Florida | 170312 | 19552860 | 114.806121 |
| Illinois | 149995 | 12882135 | 85.883763 |


In [60]:
data[1:3]

|  | area | pop | density |
| --- | --- | --- | --- |
| Texas | 695662 | 26448193 | 38.01874 |
| New York | 141297 | 19651127 | 139.076746 |


In [61]:
data[data.density > 100]

|  | area | pop | density |
| --- | --- | --- | --- |
| New York | 141297 | 19651127 | 139.076746 |
| Florida | 170312 | 19552860 | 114.806121 |
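The shorthand forms in this subsection each have a more explicit `loc` or `iloc` counterpart. The sketch below rebuilds the same data and checks, under that assumption, that each pair selects the same rows. All result names are introduced only for this example.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
data['density'] = data['pop'] / data['area']

# Label slice on the frame selects rows, like loc
rows_by_label = data['Florida':'Illinois']
rows_by_label_loc = data.loc['Florida':'Illinois']

# Integer slice on the frame selects rows by position, like iloc
rows_by_position = data[1:3]
rows_by_position_iloc = data.iloc[1:3]

# Boolean mask on the frame selects rows, like loc with a mask
dense_rows = data[data.density > 100]
dense_rows_loc = data.loc[data.density > 100]

print(rows_by_label.equals(rows_by_label_loc))
print(rows_by_position.equals(rows_by_position_iloc))
print(dense_rows.equals(dense_rows_loc))
```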


## Operating on Data in Pandas