In [None]:
from __future__ import print_function, division

# 2.1 Series
Vectorized operations and array support are provided in Python via the `pandas` library. In this section, we will learn about the `Series` and how to manipulate it.

A `Series` is an indexed data structure which stores data in an array. It is analogous to a single column in Excel. Data stored in a `Series` can be accessed quickly using its index, and a `Series` allows us to perform "vectorized" operations.

__Content:__
  - 2.1.0 Series Structure
  - 2.1.1 Basic Operations
  - 2.1.2 Indexing and Selecting Data
  - 2.1.3 Conditional Selection
  - 2.1.4 Missing Values
  - 2.1.5 Exercises
  
The `numpy` (short for **num**erical **py**thon) and `pandas` (short for **pan**el **da**ta) libraries provide us with array support and the `Series` data structure.

In [1]:
#import statement for the numpy and pandas libraries
import numpy as np
import pandas as pd

## 2.1.0 Series 
In base Python, we can store data using a list, which stores it in an array-like structure. We access each item we want by using a *positional* index. But sometimes a positional index can be inconvenient. It is desirable to have the flexibility of defining a custom index so that retrieving data is much more convenient.

At this juncture, it is sufficient to understand a series as a list (usually of objects with the same `dtype`) given together with an *index*.

To create a series, we call the `Series` class and create an instance by passing an iterable (a `list` or a numpy `array`) upon initialization.

In [24]:
# Create a series from a list
s = pd.Series([2,3,5,7,11])
# Display the series
s

0     2
1     3
2     5
3     7
4    11
dtype: int64
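A series can be built from other inputs as well. The following is a minimal sketch, with illustrative variable names that do not appear elsewhere in this section, constructing one series from a numpy array and another from a dictionary. In the dictionary case, the keys become the index labels.

```python
import numpy as np
import pandas as pd

# Build a series from a numpy array; the default index is 0, 1, 2, ...
primes_from_array = pd.Series(np.array([2, 3, 5, 7, 11]))

# Build a series from a dictionary; the keys become the index labels
primes_from_dict = pd.Series({"a": 2, "b": 3, "c": 5})

print(primes_from_array.tolist())
print(list(primes_from_dict.index))
```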

The contents of the series are stored as an *attribute* of the series and are accessed with `s.values`. `s.values` is a numpy *array* and not a normal list of objects. The index itself can be accessed with `s.index`.

In [25]:
# Get the series values 
print(s.values)

# Get the series index
print(s.index)

[ 2  3  5  7 11]
RangeIndex(start=0, stop=5, step=1)
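To confirm what kind of objects these attributes are, one can inspect their types. The sketch below uses a fresh series named `primes` (an illustrative name) so that it runs on its own.

```python
import numpy as np
import pandas as pd

primes = pd.Series([2, 3, 5, 7, 11])

# .values exposes the underlying data as a numpy array, not a Python list
values_are_array = isinstance(primes.values, np.ndarray)
print(values_are_array)

# The index is a separate object; it can be converted to a plain list
index_as_list = list(primes.index)
print(index_as_list)
```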


In [26]:
# Series index may be reassigned
s.index = ["a", "b", "c", "d", "e"]

# Display
s

a     2
b     3
c     5
d     7
e    11
dtype: int64

### Accessing series contents: slicing
We may use basic slicing operations to access objects stored in series. 

In [27]:
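# Accessing series by position (recent pandas versions deprecate this
# fallback for integer keys; s.iloc[0] is the explicit positional form)
s[0]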

2

In [28]:
# Accessing series by index
s["b"]

3

In this regard, we may think of a series as a dictionary. However, there is an important difference: dictionaries do not support slicing. With a series, we can slice using the index entries.

In [10]:
# Slicing using index labels. Label-based slicing is right inclusive.
s["a": "d"]

a    2
b    3
c    5
d    7
dtype: int64

In [12]:
# Contrast this with slicing using positional entries. 
s[0: 3]

a    2
b    3
c    5
dtype: int64
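The contrast with a dictionary can be checked directly. The sketch below uses illustrative names (`prices`, `series`) and assumes nothing beyond pandas itself. The exact exception raised by the dictionary depends on the Python version, so both possibilities are caught.

```python
import pandas as pd

prices = {"a": 2, "b": 3, "c": 5, "d": 7}
series = pd.Series(prices)

# A dictionary cannot be sliced by a range of keys. Depending on the
# Python version this raises TypeError or KeyError.
dict_slicing_failed = False
try:
    prices["a":"c"]
except (TypeError, KeyError):
    dict_slicing_failed = True

# A series accepts the same slice; the right endpoint "c" is included
label_slice = series["a":"c"]

print(dict_slicing_failed)
print(label_slice.tolist())
```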

### `.loc` and `.iloc` selection based slicers
If the index is assigned numerical labels, we can recover the right-inclusive behaviour using the `.loc` property.

In [29]:
s.index = range(1,6)
s[1:4] # Series defaults to positional slicing when integers are used to construct a slice.

2    3
3    5
4    7
dtype: int64

In contrast, using the `.loc` property instructs `pandas` to use the numerical labels to access the array instead. Note the right-inclusive behaviour.

In [17]:
s.loc[1:4]

1    2
2    3
3    5
4    7
dtype: int64

Take note of a common "gotcha" when using numerical indices. When using `s[n]` where `n` is intended to be the positional index, pandas interprets it as a numerical label instead. To remove the ambiguity, use `.loc` to clearly indicate label-based selection and `.iloc` for position-based selection.

In [30]:
s

1     2
2     3
3     5
4     7
5    11
dtype: int64

In [None]:
# Note that s[0] is an error and s[1] is not 3 but 2
s[0]

In [32]:
s[1]

2

Instead use `iloc` to clearly indicate that positional based selection is intended. 

In [33]:
# s.iloc[1] means take the entry at the 2nd position 
s.iloc[1]

3

In [34]:
# s.loc[1] and s[1] mean the same thing
s.loc[1] == s[1]

True
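The difference between the two selectors can be summarised side by side. This is a small sketch on a fresh series `s_demo` (an illustrative name) whose labels run from 1 to 5, mirroring the series used above.

```python
import pandas as pd

s_demo = pd.Series([2, 3, 5, 7, 11], index=range(1, 6))

by_label = s_demo.loc[2]      # entry whose label is 2
by_position = s_demo.iloc[2]  # entry at the third position

label_slice = s_demo.loc[2:4]      # labels 2, 3 and 4 (right inclusive)
position_slice = s_demo.iloc[2:4]  # positions 2 and 3 (right exclusive)

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```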

## 2.1.1 Basic Operations
Given a list `l = [1,2,3]`, what do we do in order to multiply each entry by 2? Operations of this sort are common in numerical algorithms and are known as "vectorized" operations. In base Python, we need an explicit loop, such as a `for` statement, to perform this operation.

In [18]:
l=[1,2,3]

Notice that 

In [19]:
2*l 

[1, 2, 3, 1, 2, 3]

doesn't exactly give the output we desire. Instead use a for loop,

In [20]:
for i in range(len(l)):
    l[i] *= 2
print(l)

[2, 4, 6]


However, if we had declared `l` as a series, this very same operation can be carried out by the following code

In [23]:
v = pd.Series([1,2,3])
v

0    1
1    2
2    3
dtype: int64

In [24]:
# Multiply each entry by 2
2*v

0    2
1    4
2    6
dtype: int64

While the syntactic benefits are clear, what is more important is that such pandas routines are implemented in highly optimised C code and hence run much faster on longer series.
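One way to get a feel for the speed difference is to time both approaches. The following is a rough sketch only: the series length `n` and the repetition count are arbitrary choices, and the timings will vary from machine to machine.

```python
import timeit
import pandas as pd

n = 100000  # arbitrary length chosen for illustration
values = list(range(n))
series = pd.Series(values)

# Both approaches compute the same numbers
loop_result = [2 * x for x in values]
vector_result = 2 * series

# Time ten repetitions of each approach
loop_time = timeit.timeit(lambda: [2 * x for x in values], number=10)
vector_time = timeit.timeit(lambda: 2 * series, number=10)

print("list comprehension: {:.4f}s".format(loop_time))
print("vectorized series:  {:.4f}s".format(vector_time))
```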

We may also perform arithmetic operations entry-wise. What is important to note is that arithmetic operations are carried on entries that *match on the index*. 

In [30]:
v1, v2 = pd.Series([1,2,3], index=["a", "b", "c"]), pd.Series([5,7,9], index=["b", "a", "c"])
v1

a    1
b    2
c    3
dtype: int64

In [27]:
v1+v1

a    2
b    4
c    6
dtype: int64

Note that if we perform `v1+v2` the expected answer *is not* `[6, 9, 12]`

In [29]:
v2

b    5
a    7
c    9
dtype: int64

In [28]:
v1+v2

a     8
b     7
c    12
dtype: int64

That is because the actual operation carried out was `[1+7, 2+5, 3+9]`, which is indeed `[8, 7, 12]`. For this reason, vectorized operations on series which do not have matching indices will result in `NaN` entries.

In [33]:
v3 = pd.Series([1,2,3,4], index=["a", "c", "d", "g"])
v3

a    1
c    2
d    3
g    4
dtype: int64

In [34]:
v1+v3

a    2.0
b    NaN
c    5.0
d    NaN
g    NaN
dtype: float64

The `NaN` entries arise because the index of `v3` matches the index of `v1` only at `"a"` and `"c"`. Matching on labels also works when an index contains repeated labels, so the following is valid.

In [36]:
v4 = pd.Series([3,6,9]*2, index=["a", "b","c"]*2) 
v4

a    3
b    6
c    9
a    3
b    6
c    9
dtype: int64

In [37]:
v1+v4

a     4
a     4
b     8
b     8
c    12
c    12
dtype: int64

Here, entries that share the same alphabetic label are added, and the result is *broadcast* to match the length of the longer index.
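When the `NaN` entries produced by non-matching labels are not wanted, one option is the `.add()` method with its `fill_value` argument, which treats a label missing from one side as the given value. The sketch below recreates `v1` and `v3` so that it runs on its own; choosing 0 as the fill value is an assumption that suits addition.

```python
import pandas as pd

v1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
v3 = pd.Series([1, 2, 3, 4], index=["a", "c", "d", "g"])

# The plain operator leaves NaN wherever a label is missing on one side
plain_sum = v1 + v3

# .add() with fill_value=0 substitutes 0 for the missing side instead
filled_sum = v1.add(v3, fill_value=0)

print(plain_sum.isnull().sum())
print(filled_sum.tolist())
```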

### Vectorized mathematical functions
In base Python, mathematical functions are accessed via the `math` library. However, the functions there do not work with iterables.

In [39]:
import math

w = pd.Series([-1, -0.5, -0.25])

# The exponential function is given by math.exp
math.exp(-1)

0.36787944117144233

So what if we want to exponentiate each entry in `w`?

In [40]:
# Notice that the following does not work
math.exp(w)

TypeError: cannot convert the series to <class 'float'>

The error is raised because `math.exp` accepts only a single number, not a series. In order to exponentiate each entry, we instead use a *ufunc* (universal function) available in the `numpy` library.

In [41]:
np.exp(w)

0    0.367879
1    0.606531
2    0.778801
dtype: float64
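For comparison, the same result can be obtained by applying `math.exp` to one entry at a time with the series method `.apply()`, although this gives up the vectorized implementation. The sketch below also chains two other numpy functions; the variable names are illustrative.

```python
import math
import numpy as np
import pandas as pd

w = pd.Series([-1, -0.5, -0.25])

# numpy function applied to the whole series at once
vectorized = np.exp(w)

# math.exp called once per entry; same values, but not vectorized
elementwise = w.apply(math.exp)

# numpy functions can be chained and still return a series
sqrt_abs = np.sqrt(np.abs(w))

print(np.allclose(vectorized, elementwise))
print(sqrt_abs.tolist())
```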

### Statistical functions on series
Everyday statistical functions are available as method calls to the Series object. 

In [42]:
# scipy.stats provides random number generators for common distributions
import scipy.stats as stats


In [45]:
s = pd.Series(stats.norm.rvs(loc=23.4, scale=5, size=100, random_state=1234567))
s

0     20.826876
1     21.152228
2     32.073107
3     26.616901
4     23.530697
5     23.801905
6     19.413055
7     20.259568
8     21.668863
9     28.240424
10    26.928087
11    12.616516
12    28.152997
13    26.090943
14    21.146034
15    26.154162
16    19.602183
17    29.518800
18    22.600897
19    16.120767
20    29.442618
21    14.792363
22    23.370517
23    27.751472
24    17.072270
25    16.375855
26    21.902323
27    28.213427
28    24.011318
29    30.411446
        ...    
70    17.092509
71    20.726803
72    22.741059
73    23.762847
74    19.301517
75    24.651997
76    22.915576
77    23.239738
78    31.741743
79    27.511096
80    26.720920
81    23.227832
82    30.362909
83    30.046219
84    22.156567
85    24.422340
86    21.645164
87    23.687439
88    20.440059
89    17.498987
90    18.142480
91    33.386666
92    22.194500
93    28.919044
94    26.751023
95    28.979267
96    25.500641
97    19.515959
98    15.150009
99    15.723663
Length: 100, dtype: float64

In [46]:
# Basic statistical functions are available as method calls to the series
s.mean() 

23.529730920049616

In [47]:
# Std Dev
s.std()

4.9212328933141363

In [48]:
# Median (Q1 and Q3 are available via s.quantile(0.25) and s.quantile(0.75))
s.median() # same as s.quantile(0.5)

23.071704191258213

In [49]:
# Max/Min value
s.max()

37.554177833875727

In [50]:
# Which index has the max/min value?
s.idxmax()

40
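The quartiles and a combined summary can be obtained in the same way. The following sketch uses a small hand-made series `scores` (an illustrative name) so that the results are easy to check by eye.

```python
import pandas as pd

scores = pd.Series([10, 20, 30, 40, 50])

q1 = scores.quantile(0.25)      # first quartile
median = scores.median()        # same as scores.quantile(0.5)
q3 = scores.quantile(0.75)      # third quartile
lowest_label = scores.idxmin()  # index label of the smallest value

# describe() collects count, mean, std, min, quartiles and max in one series
summary = scores.describe()

print(q1, median, q3, lowest_label)
print(summary["mean"])
```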

## 2.1.3 Conditional Selection

Besides index and positional based slicing of a series, we can extract data from series using condition or logical based selection. 

In [41]:
# Create a series with a custom index
s = pd.Series([65, 90, 101, -7, 125], index=["aa", "ab", "cd", "ce", "ag"])
# Display the series
s

aa     65
ab     90
cd    101
ce     -7
ag    125
dtype: int64

Let's search for all entries which are even numbers

In [3]:
s % 2 ==0 

aa    False
ab     True
cd    False
ce    False
ag    False
dtype: bool

The above is known as a *mask*. Think of it as a filter by which we sift out entries corresponding to `False` and retain only those which are `True`. The above indicates that the entry corresponding to index `"ab"` is an even number. 

In [4]:
# This displays the series consisting only of even numbers
s[s%2==0]

ab    90
dtype: int64

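A range-based selection criterion like $\min<x<\max$ must be implemented using the "bitwise" logical operators, because Python's `and` and chained comparisons do not operate element-wise on a series. That means a criterion like $50 < s < 100$ is coded as `(s > 50) & (s < 100)`; the parentheses are required because `&` binds more tightly than the comparisons.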

In [5]:
# Select values which satisfy 50 < x < 100.
s[(s > 50) & (s < 100)]

aa    65
ab    90
dtype: int64

Use the bitwise "or" operator, `|`, for range-based selection of the form $x < a$ or $x > b$.

In [6]:
s[(s < 25) | (s > 100)]

cd    101
ce     -7
ag    125
dtype: int64
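A mask can also be inverted with the bitwise "not" operator, `~`. The sketch below, on a copy of the series named `s_demo` for illustration, inverts a range mask and also shows that Python's plain `and` raises an error when given two masks.

```python
import pandas as pd

s_demo = pd.Series([65, 90, 101, -7, 125], index=["aa", "ab", "cd", "ce", "ag"])

# Element-wise "and" of two masks
in_range = (s_demo > 50) & (s_demo < 100)

# ~ flips every True to False and vice versa
outside_range = s_demo[~in_range]

# Plain `and` needs a single True/False value, which a mask cannot provide
plain_and_failed = False
try:
    (s_demo > 50) and (s_demo < 100)
except ValueError:
    plain_and_failed = True

print(outside_range.tolist())
print(plain_and_failed)
```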

The method `.isin()` allows us to select entries that are contained in a given list.

In [42]:
# The list is passed to the method as an argument. Note that not every entry in the list needs to be present in the series.
s.isin([-7, 65, 90, 100])

aa     True
ab     True
cd    False
ce     True
ag    False
dtype: bool

In [43]:
s[s.isin([-7, 65, 90, 100])]

aa    65
ab    90
ce    -7
dtype: int64

## 2.1.4 Missing Values
NaN (Not a Number) is the standard missing marker used in pandas. There are methods for us to identify missing values, the indices corresponding to missing values and how to filter missing values. 

In [36]:
# Create a new series
s1 = pd.Series([13,np.nan,19])
s1

0    13.0
1     NaN
2    19.0
dtype: float64

In [37]:
# Check for missing value
s1.isnull()

0    False
1     True
2    False
dtype: bool

In [38]:
# Drop missing values
s1.dropna()

0    13.0
2    19.0
dtype: float64

In [39]:
# Replace missing values
s1.fillna(17)

0    13.0
1    17.0
2    19.0
dtype: float64
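The mask returned by `.isnull()` can be used like any other boolean mask, for instance to count the missing values or to recover the index labels where they occur. The sketch below uses a new series `readings` with made-up labels; filling with the mean is just one possible choice of replacement value.

```python
import numpy as np
import pandas as pd

readings = pd.Series([13, np.nan, 19, np.nan], index=["w", "x", "y", "z"])

missing_mask = readings.isnull()

# True counts as 1, so summing the mask counts the missing values
n_missing = int(missing_mask.sum())

# Index labels at which the values are missing
missing_labels = list(readings[missing_mask].index)

# mean() skips NaN by default, so it can serve as a fill value
filled_with_mean = readings.fillna(readings.mean())

print(n_missing, missing_labels)
print(filled_with_mean.tolist())
```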

## 2.1.5 Exercises

Q1: Create a new series `t` where the values are marks and the indices are names.

- names  `['Alvin','Ben','Carl','Danny','Ella','Fang','Gil','Han','Irene','Jane','Ken','Lim','Mark','Ng','Ong','Peng','Quek','Roy','Sam']`
- marks  `[48,62,66,26,72,74,72,55,70,80,62,66,'TX',93,65,30,75,58,51]`

In [38]:
# Answer should return a series object
names = ['Alvin','Ben','Carl','Danny','Ella','Fang','Gil','Han','Irene','Jane','Ken','Lim','Mark','Ng','Ong','Peng','Quek','Roy','Sam']
marks = [48,62,66,26,72,74,72,55,70,80,62,66,'TX',93,65,30,75,58,51]
t = pd.Series(marks, index = names)

Q2: How many students are there in this class?

In [39]:
# Answer should return a number.
len(t)

19

Q3(a): What is the average score of the test?

In [40]:
# t.mean() will return an error message

Q3(b): What should you replace 'TX' with? Then, answer Q3(a) again.

In [42]:
# Answer should return a number
t1 = t.replace('TX',np.nan)
t1.mean()

62.5
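An alternative to replacing a known marker such as `'TX'` is to convert the whole series with `pd.to_numeric` and `errors="coerce"`, which turns any entry that cannot be parsed as a number into `NaN`. The sketch below uses a shortened, illustrative version of the marks rather than the full class list.

```python
import pandas as pd

raw_marks = pd.Series([48, "TX", 93], index=["Alvin", "Mark", "Ng"])

# Entries that cannot be parsed as numbers become NaN
numeric_marks = pd.to_numeric(raw_marks, errors="coerce")

# mean() skips the NaN entry
average = numeric_marks.mean()

print(numeric_marks.isnull().sum())
print(average)
```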

Q4: Who scored the highest mark?

In [43]:
# Answer should return a string (name)
t1.idxmax()

'Ng'

Q5: Who failed the test (<50)? 

In [45]:
# Answer should return a list of names
t1[t1 < 50].index

Index(['Alvin', 'Danny', 'Peng'], dtype='object')

Q6: What is the percentage of B+ and above (>=75 / number of students who took MST)? Leave your answer correct to 2 d.p.

In [46]:
# Answer should return a number correct to 2 d.p.
x = t1[t1>=75]
# t1.count() counts the non-missing entries, i.e. the students who took the test
round(100*len(x)/t1.count(), 2)

16.67