# 2.1 Series

__Content:__
  - 2.1.0 Series Structure
  - 2.1.1 Basic Operations
  - 2.1.2 Indexing and Selecting Data
  - 2.1.3 Conditional Selection
  - 2.1.4 Missing Values
  - 2.1.5 Exercises

Python has abundant libraries for various purposes like scientific computing, data analysis, data visualisation and machine learning. In this course, we'll be using **numpy** for numerical computation, **pandas** for data preparation, **matplotlib** for visualisation and **scipy.stats** for statistical analysis.

In [1]:
#import (library) as (give the library a nickname/alias)
import numpy as np
import pandas as pd

## 2.1.0 Series Structure
A Series is a one-dimensional array that can hold any data type, such as int, str or float.

It is similar to a Python list, but each element is labelled with an index.

In [2]:
# Create a series from a list
s = pd.Series([2,3,5,7,11])
# Display the series
s

0     2
1     3
2     5
3     7
4    11
dtype: int64

In [3]:
# Get the series values 
s.values

array([ 2,  3,  5,  7, 11], dtype=int64)

In [4]:
# Get the series index
s.index

RangeIndex(start=0, stop=5, step=1)

In [5]:
# Name the index
s0 = pd.Series([2,3,5,7,11], index=['a','b','c','d','e'])
s0

a     2
b     3
c     5
d     7
e    11
dtype: int64

In [6]:
# Size/Length of series
s.size # same answer as len(s)

5

In [7]:
# Sort values
s.sort_values(ascending=False)

4    11
3     7
2     5
1     3
0     2
dtype: int64
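
As a small illustration of how index labels behave, a Series can also be built from a dictionary, in which case the keys become the index. The variable names `primes` and `descending` are made up for this sketch. It also suggests that `sort_values` returns a new Series and leaves the original untouched.

```python
import pandas as pd

# Illustrative data: dictionary keys become the index labels
primes = pd.Series({'a': 2, 'b': 3, 'c': 5, 'd': 7, 'e': 11})

# sort_values returns a new Series; each label stays attached to its value
descending = primes.sort_values(ascending=False)

print(list(descending.index))  # labels in descending order of value
print(list(primes.index))      # the original Series keeps its order
```

To keep the sorted result, assign it to a variable as done here.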

## 2.1.1 Basic Operations

Series support "vectorised operations": an arithmetic operation is applied to every element at once, without a `for` loop to iterate through the elements of the Series.

In [8]:
s + s

0     4
1     6
2    10
3    14
4    22
dtype: int64

In [9]:
s * 2

0     4
1     6
2    10
3    14
4    22
dtype: int64

In [10]:
s ** 2

0      4
1      9
2     25
3     49
4    121
dtype: int64

In [11]:
np.exp(s)

0        7.389056
1       20.085537
2      148.413159
3     1096.633158
4    59874.141715
dtype: float64
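
To make the contrast with an explicit loop concrete, the sketch below computes the squares both ways and checks that the results agree. The names `squares_loop` and `squares_vec` are chosen only for this illustration.

```python
import pandas as pd

s = pd.Series([2, 3, 5, 7, 11])

# Explicit loop: visit each value and square it
squares_loop = [x ** 2 for x in s]

# Vectorised form: one expression applied element-wise
squares_vec = s ** 2

# Both approaches produce the same values
print(squares_loop == squares_vec.tolist())
```

The vectorised form is shorter and is usually faster on large Series, because the loop runs inside the library rather than in Python.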

Common statistical functions are available as methods of the Series object.

In [12]:
# Sum
s.sum() # same answer as sum(s)

28

In [13]:
# Average
s.mean() # same answer as np.mean(s)

5.6

In [14]:
# Std Dev
s.std()

3.5777087639996634

In [15]:
# Median / Q1 / Q3
s.median() # same as s.quantile(0.5)

5.0

In [16]:
# Max/Min value
s.max()

11

In [17]:
# Which index has the max/min value?
s.idxmax()

4
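
The cells above show the median and the maximum. The quartiles and the minimum follow the same pattern, as in this short sketch, which assumes the default (linear) interpolation of `quantile`.

```python
import pandas as pd

s = pd.Series([2, 3, 5, 7, 11])

q1 = s.quantile(0.25)  # first quartile
q3 = s.quantile(0.75)  # third quartile
iqr = q3 - q1          # interquartile range

print(q1, q3, iqr)
print(s.min(), s.idxmin())  # smallest value and the index label where it occurs
```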

## 2.1.2 Indexing and Selecting Data using .iloc and .loc

How do we access specific elements in a Series? There are two ways: `.iloc`, which selects elements by integer position, and `.loc`, which selects elements by index label.

`.iloc` is integer-location-based selection (positions 0, 1, 2, ... up to length-1 of the axis).

In [18]:
# Select the first value
s.iloc[0]

2

In [19]:
# Select the last value
s.iloc[-1]

11

In [20]:
# Select a contiguous subset of values
s.iloc[1:4]

1    3
2    5
3    7
dtype: int64

In [22]:
# Select the first 3 values
s.iloc[:3]

0    2
1    3
2    5
dtype: int64

`.loc` is label-based selection.

In [21]:
# Select value based on index label
s0.loc['b'] # returns the same answer as s0.iloc[1] or s0['b']

3

In [22]:
# Select a list of values based on labels
s0.loc[['a','c','e']] # return same answer as s0.iloc[[0,2,4]] or s0[['a','c','e']]

a     2
c     5
e    11
dtype: int64
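
One difference between the two indexers is easy to miss when slicing: an `.iloc` slice excludes its end position, as Python lists do, whereas a `.loc` slice includes its end label. A minimal sketch using the same `s0` as above:

```python
import pandas as pd

s0 = pd.Series([2, 3, 5, 7, 11], index=['a', 'b', 'c', 'd', 'e'])

by_position = s0.iloc[1:3]  # positions 1 and 2; position 3 is excluded
by_label = s0.loc['b':'d']  # labels 'b' through 'd'; 'd' is included

print(by_position.tolist())
print(by_label.tolist())
```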

## 2.1.3 Conditional Selection

How do we select elements of a Series based on some criterion? For example, we can select all elements in `s` which are less than `10`.

In [25]:
# Which values are < 10?
s < 10

0     True
1     True
2     True
3     True
4    False
dtype: bool

In [23]:
# Select values based on boolean mask
s[s<10]

0    2
1    3
2    5
3    7
dtype: int64

In [25]:
# How many terms are <10? 
len(s[s<10])

4

In [24]:
# Select values which are 2 < x < 10.
s[(s>2) & (s<10)]

1    3
2    5
3    7
dtype: int64

In [None]:
# Note that the following is not valid: it raises a ValueError

s[2<s<10]
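
The chained form fails because Python evaluates `2 < s < 10` as `(2 < s) and (s < 10)`, and `and` needs a single True/False value rather than a Series of them. The sketch below catches the error and then shows the element-wise operators `&` (and) and `~` (not). The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series([2, 3, 5, 7, 11])

raised = False
try:
    s[2 < s < 10]
except ValueError as err:
    # pandas refuses to reduce a whole Series to one truth value
    raised = True
    print('ValueError:', err)

mask = (s > 2) & (s < 10)  # element-wise "and"; the parentheses are required
in_range = s[mask]
outside = s[~mask]         # "~" negates the mask element-wise
count = mask.sum()         # True counts as 1, so this counts the matches

print(in_range.tolist(), outside.tolist(), count)
```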

## 2.1.4 Missing Values

NaN (Not a Number) is the standard missing-value marker used in pandas.

In [29]:
# Create a new series
s1 = pd.Series([13,np.nan,19])
# Append the new series (Series.append was removed in pandas 2.0; pd.concat replaces it)
s2 = pd.concat([s, s1])
# observe the indices
s2

0     2.0
1     3.0
2     5.0
3     7.0
4    11.0
0    13.0
1     NaN
2    19.0
dtype: float64

In [30]:
# Append the new series, ignoring the original indices
s2 = pd.concat([s, s1], ignore_index=True)
s2

0     2.0
1     3.0
2     5.0
3     7.0
4    11.0
5    13.0
6     NaN
7    19.0
dtype: float64
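
The values above print as `2.0` rather than `2` because NaN is itself a floating-point value, so a Series built from integers and NaN is stored as `float64`. A minimal sketch, with variable names chosen for illustration:

```python
import numpy as np
import pandas as pd

whole = pd.Series([13, 19])             # integers only
with_gap = pd.Series([13, np.nan, 19])  # same data with one missing value

print(whole.dtype)     # an integer dtype
print(with_gap.dtype)  # float64, because NaN is a float
```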

In [31]:
# Check for missing value
s2.isnull()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
7    False
dtype: bool

In [32]:
# How many missing values are there?
s2.isnull().sum()

1

In [33]:
# Drop missing values
s2.dropna()

0     2.0
1     3.0
2     5.0
3     7.0
4    11.0
5    13.0
7    19.0
dtype: float64

In [34]:
# Replace missing values
s2.fillna(17)

0     2.0
1     3.0
2     5.0
3     7.0
4    11.0
5    13.0
6    17.0
7    19.0
dtype: float64
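
It may help to know that the statistical methods skip NaN by default, which is why a missing value does not turn a sum or mean into NaN. The sketch below rebuilds `s2` directly and, as one possible choice rather than a recommendation, fills the gap with the mean of the remaining values.

```python
import numpy as np
import pandas as pd

s2 = pd.Series([2, 3, 5, 7, 11, 13, np.nan, 19])

n_valid = s2.count()             # number of non-missing values
total = s2.sum()                 # NaN is skipped by default
strict = s2.sum(skipna=False)    # NaN propagates when skipping is turned off

filled = s2.fillna(s2.mean())    # replace NaN with the mean of the other values

print(n_valid, total, strict)
print(filled.iloc[6])
```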

In [35]:
# Replace a value by another value
# Eg: Replace 2 by 'even prime'
s2.replace(2,'even prime')

0    even prime
1             3
2             5
3             7
4            11
5            13
6           NaN
7            19
dtype: object

## 2.1.5 Exercises

Q1: Create a new series `t` where the values are marks and the indices are names.

- names  `['Alvin','Ben','Carl','Danny','Ella','Fang','Gil','Han','Irene','Jane','Ken','Lim','Mark','Ng','Ong','Peng','Quek','Roy','Sam']`
- marks  `[48,62,66,26,72,74,72,55,70,80,62,66,'TX',93,65,30,75,58,51]`

In [38]:
# Answer should return a series object
names = ['Alvin','Ben','Carl','Danny','Ella','Fang','Gil','Han','Irene','Jane','Ken','Lim','Mark','Ng','Ong','Peng','Quek','Roy','Sam']
marks = [48,62,66,26,72,74,72,55,70,80,62,66,'TX',93,65,30,75,58,51]
t = pd.Series(marks, index = names)

Q2: How many students are there in this class?

In [39]:
# Answer should return a number.
len(t)

19

Q3(a): What is the average score of the test?

In [40]:
# t.mean() will raise an error because 'TX' is not a number

Q3(b): What should you replace 'TX' with? Then, answer Q3(a) again.

In [42]:
# Answer should return a number
t1 = t.replace('TX',np.nan)
t1.mean()

62.5
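
An alternative to `replace` for Q3(b), sketched here under the assumption that every non-numeric entry should be treated as missing, is `pd.to_numeric` with `errors='coerce'`. The shortened series `t_small` is an illustrative subset of the class list, used only to keep the example brief.

```python
import pandas as pd

# Illustrative subset of the class list, including the non-numeric 'TX' entry
t_small = pd.Series([48, 62, 'TX', 93], index=['Alvin', 'Ben', 'Mark', 'Ng'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
t_num = pd.to_numeric(t_small, errors='coerce')

print(t_num.dtype)           # numeric dtype instead of object
print(t_num.isnull().sum())  # number of marks that became missing
print(t_num.mean())          # NaN is skipped when averaging
```

This also converts the Series from `object` to a numeric dtype, which the later comparisons and statistics rely on.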

Q4: Who scored the highest mark?

In [43]:
# Answer should return a string (name)
t1.idxmax()

'Ng'

Q5: Who failed the test (<50)? 

In [45]:
# Answer should return a list of names
t1[t1 < 50].index

Index(['Alvin', 'Danny', 'Peng'], dtype='object')

Q6: What percentage of the students who took the MST scored B+ and above (>=75)? Leave your answer correct to 2 d.p.

In [46]:
# Answer should return a number correct to 2 d.p.
x = t1[t1>=75]
round(100*len(x)/18,2)

16.67