# Pandas Series
* Similar to a one-dimensional NumPy array (i.e. a "column vector"), with an index that labels each element
* A Pandas DataFrame is composed of one or more Pandas Series representing its columns
* Series support vectorized operations

This notebook will demonstrate a few of the most common operations that can be performed on Series.

Terminology: this notebook uses "Series", "Column Vector" and "Column" interchangeably, depending on context.

[Pandas Series Documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html)
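
To make the relationship between a DataFrame and its columns concrete, here is a minimal sketch. The DataFrame `df` and its column labels are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical two-column DataFrame, built only for illustration
df = pd.DataFrame({"Col A": [1, 2, 3], "Col B": [10.0, 20.0, 30.0]})

col_a = df["Col A"]                     # selecting one column returns a Series
print(type(col_a))
print(col_a.name)                       # the Series name is the column label
print(col_a.dtype, df["Col B"].dtype)   # each column carries its own dtype
```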

In [1]:
import pandas as pd
print("Pandas Version {}".format(pd.__version__))
import numpy as np
print("Numpy Version {}".format(np.__version__))

Pandas Version 0.22.0
Numpy Version 1.13.3


#### Aside: Python 3.6 f-String Formatting

In [2]:
print(f'Pandas Version {pd.__version__}')
print(f'Numpy Version {np.__version__}')

Pandas Version 0.22.0
Numpy Version 1.13.3


### Series Construction Example

In [3]:
# Construct a series
x = pd.Series(data=range(5), dtype="int64", name="Col A")
print(type(x))

<class 'pandas.core.series.Series'>


In [4]:
print(x)

0    0
1    1
2    2
3    3
4    4
Name: Col A, dtype: int64


In [5]:
# Demonstrate vector arithmetic
x += 10  # in-place modification of x
x

0    10
1    11
2    12
3    13
4    14
Name: Col A, dtype: int64
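
Vector arithmetic also works between two Series, not only between a Series and a scalar. The sketch below uses two small hypothetical Series, `a` and `b`, that share the default index.

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([10, 20, 30])

total = a + b     # element-wise addition of two Series
scaled = a * 2    # the scalar is applied to every element

print(total.tolist())
print(scaled.tolist())
```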

In [6]:
# This series has 5 rows and is implicitly a single column
x.shape

(5,)

In [8]:
# number of rows
len(x)

5

In [9]:
# Number of non-null values
x.count()

5

In [10]:
# Get the values in x
x.values

array([10, 11, 12, 13, 14])

In [11]:
# the values are a NumPy array
type(x.values)

numpy.ndarray

In [12]:
# what dtype do the values in the NumPy array have?
y = x.values
print(y.dtype)
print(x.values.dtype) # same, chaining the attribute access
print(x.dtype) # same using the Series attribute

int64
int64
int64


In [13]:
# Get the index in x
x.index

RangeIndex(start=0, stop=5, step=1)

### Vector Operations can Generate New Series

In [14]:
# An operation on a series can produce a series of a different type
x % 2 == 0

0     True
1    False
2     True
3    False
4     True
Name: Col A, dtype: bool

In [15]:
bool_series = (x % 2 == 0)
print(bool_series)

0     True
1    False
2     True
3    False
4     True
Name: Col A, dtype: bool


In [16]:
# Logical Operation: element-wise OR
y = pd.Series(data=[True, True, False, False, False])
bool_series | y

0     True
1     True
2     True
3    False
4     True
dtype: bool

In [17]:
# Logical Operation: element-wise AND
y = pd.Series(data=[True, True, False, False, False])
bool_series & y

0     True
1    False
2    False
3    False
4    False
dtype: bool

In [18]:
# Python's logical operators (and, or, not) are not vectorized
# The following raises a ValueError
try:
    bool_series and y
except ValueError:
    print(ValueError)  # prints the exception class, not its message
else:
    print("No Problem")

<class 'ValueError'>
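
As a short sketch of the vectorized alternatives, `&`, `|` and `~` operate element-wise, while `.any()` and `.all()` reduce a boolean Series to a single truth value. The Series `a` and `b` are hypothetical.

```python
import pandas as pd

a = pd.Series([True, False, True])
b = pd.Series([True, True, False])

both = a & b       # element-wise AND
either = a | b     # element-wise OR
negated = ~a       # element-wise NOT

print(both.tolist())
print(either.tolist())
print(negated.tolist())
print(both.any())  # True if at least one element is True
print(both.all())  # True only if every element is True
```

When a single True/False answer is needed, for example in an `if` statement, reducing with `.any()` or `.all()` avoids the ValueError shown above.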


### Series Indexing

In [19]:
# can index a series using a boolean series
x[bool_series]

0    10
2    12
4    14
Name: Col A, dtype: int64

In [20]:
# can do the above in 1 step
x[x % 2 == 0]

0    10
2    12
4    14
Name: Col A, dtype: int64

In [21]:
# can index a series by integer; with the default RangeIndex, labels and positions coincide
x[0]

10

In [22]:
# can index a series positionally using a list
x[[0,2,4]]

0    10
2    12
4    14
Name: Col A, dtype: int64

In [23]:
# Index defaults to numeric range, but it can be anything
idx = "aa bb cc dd ee".split()
x.index = idx
x

aa    10
bb    11
cc    12
dd    13
ee    14
Name: Col A, dtype: int64

In [24]:
# Index by index value
# Analogous to: return the row which has a "rowname" of 'aa'
x['aa']

10

In [25]:
x[['aa','cc','ee']]

aa    10
cc    12
ee    14
Name: Col A, dtype: int64

In [26]:
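# Can still index positionally: with a non-integer index, integer keys
# fall back to positions in this pandas version (newer releases
# deprecate this fallback in favor of x.iloc[0])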
x[0]

10

In [27]:
x[[0, 2, 4]]

aa    10
cc    12
ee    14
Name: Col A, dtype: int64
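
Because plain `[]` mixes label-based and position-based lookup, it can help to state the intent explicitly with `.loc` (labels) and `.iloc` (positions). The sketch below rebuilds a Series like the one above under the hypothetical name `s`.

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13, 14],
              index=["aa", "bb", "cc", "dd", "ee"], name="Col A")

print(s.loc["aa"])                    # label-based lookup
print(s.iloc[0])                      # position-based lookup
print(s.loc[["aa", "cc"]].tolist())   # list of labels
print(s.iloc[[0, 2]].tolist())        # list of positions
print(s.loc["aa":"cc"].tolist())      # label slices include the end label
print(s.iloc[0:2].tolist())           # position slices exclude the end position
```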

### Series Method Examples

In [28]:
x.sum()

60

In [29]:
x.max()

14

### Series Apply Examples

In [30]:
x *= -1
x

aa   -10
bb   -11
cc   -12
dd   -13
ee   -14
Name: Col A, dtype: int64

In [31]:
# Apply a numpy function to each element in the series
x.apply(np.abs)

aa    10
bb    11
cc    12
dd    13
ee    14
Name: Col A, dtype: int64

In [32]:
# note that the above was *not* performed in place
x

aa   -10
bb   -11
cc   -12
dd   -13
ee   -14
Name: Col A, dtype: int64

In [33]:
# to change the series itself, assign the result back
x = x.apply(np.abs)
x

aa    10
bb    11
cc    12
dd    13
ee    14
Name: Col A, dtype: int64

In [34]:
# apply our own function to each value
def my_add2(x):
    return x+2

x.apply(my_add2)

aa    12
bb    13
cc    14
dd    15
ee    16
Name: Col A, dtype: int64

In [35]:
# same using lambda notation
x.apply(lambda x: x+2)

aa    12
bb    13
cc    14
dd    15
ee    16
Name: Col A, dtype: int64
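
For simple arithmetic like this, `apply` is usually not needed: the same result can be written with vectorized Series operations, which avoid calling a Python function once per element. A minimal sketch, using a hypothetical Series `s`:

```python
import pandas as pd

s = pd.Series([-10, -11, -12])

via_apply = s.apply(lambda v: abs(v) + 2)   # calls the function once per element
vectorized = s.abs() + 2                    # operates on the whole Series at once

print(via_apply.tolist())
print(vectorized.tolist())
```

`apply` remains useful when the per-element logic has no vectorized equivalent.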

### Common Data Science Operations

In [36]:
# Create a new series
x = pd.Series(data=[4,1,2,1,2,4], index="a b c d e f".split())
x

a    4
b    1
c    2
d    1
e    2
f    4
dtype: int64

In [37]:
x.value_counts()

4    2
2    2
1    2
dtype: int64

In [38]:
x.sort_values()

b    1
d    1
c    2
e    2
a    4
f    4
dtype: int64
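
Both methods accept options that are often handy in exploratory work. The sketch below shows relative frequencies and descending sort order on a hypothetical copy `s` of the Series above, along with `nunique()` for the number of distinct values.

```python
import pandas as pd

s = pd.Series([4, 1, 2, 1, 2, 4], index="a b c d e f".split())

freqs = s.value_counts(normalize=True)      # relative frequencies instead of counts
descending = s.sort_values(ascending=False) # largest values first

print(freqs)
print(descending)
print(s.nunique())                          # number of distinct values
```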

### Null (or Missing) Values
Missing values are represented by `np.nan` (also spelled `np.NaN` in NumPy versions before 2.0).

In [39]:
x.iloc[[3,4]] = np.nan  # the np.NaN alias was removed in NumPy 2.0; np.nan works in every version
x

a    4.0
b    1.0
c    2.0
d    NaN
e    NaN
f    4.0
dtype: float64

In [40]:
x.isnull()

a    False
b    False
c    False
d     True
e     True
f    False
dtype: bool
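
Note that the dtype changed from int64 to float64 once missing values were introduced: NaN is a floating-point value, so a default integer Series cannot hold it. The sketch below illustrates this with two hypothetical Series and shows two common ways to handle the gaps, `dropna()` and `fillna()`. Choosing 0 as the fill value is an assumption made only for the example.

```python
import numpy as np
import pandas as pd

ints = pd.Series([4, 1, 2])
with_gap = pd.Series([4, np.nan, 2])

print(ints.dtype)       # integer dtype when nothing is missing
print(with_gap.dtype)   # float dtype because NaN is a float

dropped = with_gap.dropna()    # remove the missing entries
filled = with_gap.fillna(0)    # or replace them with a chosen value

print(dropped.tolist())
print(filled.tolist())
```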

In [48]:
x.describe()

count    4.00
mean     2.75
std      1.50
min      1.00
25%      1.75
50%      3.00
75%      4.00
max      4.00
dtype: float64

### Method Chaining

In [41]:
# How many null values are there in total
x.isnull().sum()

2

In [42]:
# What fraction of the values are null
x.isnull().sum() / x.size

0.33333333333333331

In [43]:
# Null values are ignored by statistical operations such as sum()
x.sum() == x[x.notnull()].sum()

True

In [44]:
# x.size is length of column vector including null values
print("size:  ", x.size)

# x.count() is count of non-null values
print("count: ", x.count())

size:   6
count:  4


In [45]:
# mean is sum of non-null values divided by count of non-null values
x.mean() == x.sum() / x.count()

True

In [46]:
# trickier expression to validate the above statements
# don't do this in practice; it is only here to verify understanding
# mean = sum of non-null values / count of non-null values
x.mean() == x[x.notnull()].sum() / x[x.notnull()].size

True

In [47]:
# Floating-point values should not be compared with == because of
# rounding error; compare them within a tolerance instead.
# np.isclose returns True here, and it stays reliable for more
# complicated expressions where an exact == comparison could return False
np.isclose(x.mean(), x.sum() / x.count())

True
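
A classic case where exact comparison fails is shown below. The variable names are hypothetical, and `np.isclose` is called with its default tolerances.

```python
import numpy as np

a = 0.1 + 0.2
b = 0.3

exact = (a == b)           # False: the two values differ in the last bits
close = np.isclose(a, b)   # True: equal within the default tolerance

print(exact)
print(close)
```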