# Data Manipulation with Pandas

We've been looking at NumPy and its `ndarray` object

- This provides efficient storage and manipulation of dense typed arrays

Pandas is built on NumPy and provides an efficient implementation of a `DataFrame`

- Convenient storage interface for labelled data.

- Provides powerful data operations familiar to users of database and spreadsheet programs.

NumPy is useful for providing essential features for data organization.

It is, however, limited where flexibility is required:

- attaching labels to data

- working with missing data

- attempting operations that do not map well to element-wise broadcasting, e.g. grouping

Pandas helps with these "data munging" tasks that occupy much of a data scientist's time.
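As a rough illustration of the limitations listed above, the sketch below uses a small made-up `Series` (the labels and readings are invented for illustration) to show labelled data, a missing value being skipped, and a group-wise mean computed by label:

```python
import numpy as np
import pandas as pd

# Hypothetical readings; the labels "a" and "b" are invented group names
readings = pd.Series([20.0, np.nan, 22.0, 18.0], index=["a", "b", "a", "b"])

# Count the missing entries
n_missing = readings.isna().sum()

# Group the rows that share an index label and average each group;
# NaN values are skipped by default
group_means = readings.groupby(level=0).mean()

print(n_missing)
print(group_means)
```

Doing the same with a plain `ndarray` would need separate bookkeeping for the labels and a mask for the missing value.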

In [1]:
# Importing pandas
import pandas as pd

In [2]:
pd.__version__

'1.4.2'

In [3]:
import numpy as np

## The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

In [4]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

As we see in the output, the Series wraps both a sequence of values and a sequence of indices, which we can access with the `values` and `index` attributes.

In [5]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [6]:
# gives an array-like object of type pd.Index
data.index

RangeIndex(start=0, stop=4, step=1)

In [7]:
# Like a NumPy array, data can be accessed via associated index
data[1]

0.5

In [8]:
data[1:3]

1    0.50
2    0.75
dtype: float64

## `Series` as a generalised NumPy array

A `Series` object is basically a 1-D NumPy array.

The difference lies in the index:

- for NumPy it is implicitly defined

- for the Pandas `Series` it is explicitly defined

This means the index doesn't have to be an integer...

In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [10]:
data['b']

0.5

In [11]:
# the index need not be sequential
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

In [12]:
data[2]

0.25
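With an integer index like this, `data[2]` looks up the *label* 2, not the third element. A minimal sketch of making the intent explicit, assuming the `loc` (label-based) and `iloc` (position-based) indexers, which are part of Pandas but are not introduced in the text above:

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

# Label-based lookup: the entry whose index label is 2
by_label = data.loc[2]

# Position-based lookup: the third entry, whatever its label
by_position = data.iloc[2]

print(by_label, by_position)
```

Using the explicit indexers avoids ambiguity whenever the index labels are themselves integers.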

## `Series` as specialised dictionary

A Pandas `Series` is a bit like a Python dictionary.

Dictionaries map arbitrary keys to a set of arbitrary values.

A `Series` maps *typed* keys to a set of *typed* values.

This typing is what makes a `Series` efficient, in the same way that typed data makes NumPy arrays more efficient than Python lists.

In [13]:
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population      # the keys become the index

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [14]:
# dictionary-style item access; the index keeps the dictionary's key order
population['California']

38332521

In [15]:
# unlike a dictionary, slicing is supported (the end label is included)
population['California':'Florida']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
dtype: int64
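The dictionary analogy can be pushed a little further. The sketch below, using a shortened copy of the population data, shows a few dictionary-style operations alongside the label slice, whose end label is included:

```python
import pandas as pd

population = pd.Series({"California": 38332521,
                        "Texas": 26448193,
                        "New York": 19651127})

# Membership tests check the index labels, like dictionary keys
has_texas = "Texas" in population

# keys() returns the index, in the original key order
keys = list(population.keys())

# Label-based slicing includes both endpoints
label_slice = population["California":"Texas"]

print(has_texas, keys, len(label_slice))
```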

## Constructing Series objects

Constructing a `Series` is generally a variation of:

`>>> pd.Series(data, index=index)`

- index is an optional argument
- data can be one of many entities...

In [16]:
# data can be a list
pd.Series([2, 4, 6])

0    2
1    4
2    6
dtype: int64

In [17]:
# data can be a scalar, which gets repeated
pd.Series(5, index=[100, 200, 300])

100    5
200    5
300    5
dtype: int64

In [18]:
# data can be a dictionary
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In [19]:
# the index can be set explicitly to select (and reorder) keys
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object

Note: the last `Series` is populated only with the explicitly identified keys.
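The reverse case is worth a quick check too. In this sketch the label `4` is a hypothetical key chosen because it is absent from the dictionary; Pandas keeps the label and fills its value with `NaN`:

```python
import pandas as pd

# 3 is a key of the dictionary, 4 is not
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 4])

found = s.loc[3]            # value taken from the dictionary
missing = pd.isna(s.loc[4])  # no matching key, so the entry is NaN

print(s)
```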

# The Pandas `DataFrame` Object

Like the `Series` object discussed in the previous section, the `DataFrame` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

## DataFrame as a generalised NumPy array

Recall: a `Series` is an analog of a one-dimensional array with flexible indices

Given this, a `DataFrame` is an analog of a two-dimensional array with both flexible row indices and flexible column names

In [20]:
# first let's construct a new Series
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995, 'UAE': 183310}
area = pd.Series(area_dict)
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
UAE           183310
dtype: int64

Let's use that population data from before...

In [21]:
states = pd.DataFrame({'population': population,
                       'area': area})
states

            population    area
California  38332521.0  423967
Florida     19552860.0  170312
Illinois    12882135.0  149995
New York    19651127.0  141297
Texas       26448193.0  695662
UAE                NaN  183310
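Note that the population values are shown as floats while the areas stay integers. A small sketch with a shortened copy of the data suggests why: aligning the two `Series` leaves a missing population for `UAE`, and the `NaN` that fills it forces that column to a floating-point dtype:

```python
import pandas as pd

population = pd.Series({"California": 38332521, "Texas": 26448193})
area = pd.Series({"California": 423967, "Texas": 695662, "UAE": 183310})

# The rows are the union of both indices; gaps are filled with NaN
states = pd.DataFrame({"population": population, "area": area})

pop_dtype = str(states["population"].dtype)  # upcast because of the NaN
area_dtype = str(states["area"].dtype)       # no gaps, stays integer
n_missing = states["population"].isna().sum()

print(states.dtypes)
```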


In [22]:
# let's check the index labels
states.index

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas', 'UAE'], dtype='object')

In [23]:
# DataFrames also have a columns attribute
states.columns

Index(['population', 'area'], dtype='object')

In [24]:
states.values

array([[38332521.,   423967.],
       [19552860.,   170312.],
       [12882135.,   149995.],
       [19651127.,   141297.],
       [26448193.,   695662.],
       [      nan,   183310.]])

So we have a generalisation of a 2D NumPy array, with generalised row and column indices

## DataFrame as specialised dictionary

Recall: dictionaries map keys to values

A `DataFrame` maps a column name to a `Series` of column data

In [25]:
states['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
UAE           183310
Name: area, dtype: int64
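The dictionary view also suggests how to add data. As a sketch on a shortened copy of the data, a new column can be assigned by name just like a new dictionary key (the column name `density` is chosen here purely for illustration):

```python
import pandas as pd

population = pd.Series({"California": 38332521, "Texas": 26448193})
area = pd.Series({"California": 423967, "Texas": 695662})
states = pd.DataFrame({"population": population, "area": area})

# Looking up a column name returns a Series
col = states["area"]

# Assigning to a new name adds a column, computed row by row
states["density"] = states["population"] / states["area"]

print(states)
```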

## Constructing a DataFrame object

There are a number of ways to construct a `DataFrame`.

1. From a collection of `Series` objects.

2. A single-column `DataFrame` can be constructed from a single `Series`:

In [26]:
pd.DataFrame(population, columns=['population'])

            population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135


3. It can also be constructed from a __list of dictionaries__

In [27]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


In [28]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0


If some keys are missing, Pandas fills these with `NaN`

4. A `DataFrame` can be constructed from a __dictionary of Series objects__

In [29]:
pd.DataFrame({'population': population,
              'area': area})

            population    area
California  38332521.0  423967
Florida     19552860.0  170312
Illinois    12882135.0  149995
New York    19651127.0  141297
Texas       26448193.0  695662
UAE                NaN  183310


5. You can create a `DataFrame` from a __two-dimensional array__, with specified column and index names. (If omitted, an integer index will be used for each.)

In [30]:
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.822596  0.040374
b  0.593579  0.654690
c  0.702189  0.633921
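For comparison, here is a short sketch of the same construction with the labels left out (using zeros rather than random numbers so the result is reproducible); both the rows and the columns fall back to integer labels:

```python
import numpy as np
import pandas as pd

# No columns= or index= given, so integer labels are generated
df = pd.DataFrame(np.zeros((3, 2)))

row_labels = list(df.index)
column_labels = list(df.columns)

print(row_labels, column_labels)
```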


6. From a __NumPy structured array__

In [31]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [32]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


# The Pandas Index Object

Both the `Series` and `DataFrame` objects contain an explicit *index* that lets you reference and modify data.

The `Index` object is an immutable array

In [33]:
ind = pd.Index([2, 3, 5, 7, 11])
ind

Int64Index([2, 3, 5, 7, 11], dtype='int64')

`Index` operates like an array: we can index an index, and also slice:

In [34]:
ind[1]

3

In [35]:
ind[::2]

Int64Index([2, 5, 11], dtype='int64')

In [36]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


In [37]:
# what happens when attempting to change a value of an Index?
ind[1] = 0      # Cannot change index

TypeError: Index does not support mutable operations

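Why is this immutable? Immutability makes it safer to share an index between multiple `Series` and `DataFrame` objects, without side effects from an inadvertent modification.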
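The `TypeError` above applies to element-wise changes only. As a small sketch (the `Series` and its labels are made up for illustration), replacing the index of an object as a whole is still allowed:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])

# Changing one element of the Index in place is rejected
try:
    s.index[0] = "z"
    in_place_failed = False
except TypeError:
    in_place_failed = True

# Assigning a complete new set of labels is fine
s.index = ["z", "b", "c"]

print(in_place_failed, list(s.index))
```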

## Index as an ordered set

Pandas is designed to facilitate operations such as joins across datasets, which rely on set arithmetic.

`Index` objects can be combined using the union, intersection and (symmetric) difference operations

In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

In [None]:
indA.intersection(indB)  # intersection

Int64Index([3, 5, 7], dtype='int64')

In [None]:
indA.union(indB)

Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

In [None]:
indA.symmetric_difference(indB)

Int64Index([1, 2, 9, 11], dtype='int64')
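Two further points can be sketched with the same indices: the one-sided `difference` method, and the way arithmetic between two `Series` aligns them on the union of their indices. The constant values 1 and 10 are arbitrary and chosen only to make the result easy to read:

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Labels that are in indA but not in indB
only_in_a = indA.difference(indB)

# Adding two Series aligns them on the union of their indices;
# labels present in only one of them produce NaN
sA = pd.Series(1, index=indA)
sB = pd.Series(10, index=indB)
total = sA + sB

print(list(only_in_a))
print(total)
```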

# Summary

We've looked at the following during this lecture:

- The `Series` object
- The `DataFrame` object
- The `Index` object