## Basic Data Processing with Pandas

### Introduction to `Pandas` and Series Data

Reference books: 

* [Python for Data Analysis by McKinney](https://wesmckinney.com/pages/book.html)

* [Learning the Pandas Library by Harrison](https://www.amazon.co.uk/Learning-Pandas-Library-Munging-Analysis/dp/153359824X)

`Pandas` was created by Wes McKinney in 2008.

#### Series Data Structure

A `Series` is one of the core data structures in `Pandas`. It is a cross between a list and a dictionary: the items are stored in order, and each item has a label with which it can be retrieved.

An easy way to visualize this is as a two-column table, where the first column is the special index (like the keys in a dictionary) and the second column is the actual data.

In [1]:
import pandas as pd

In [2]:
students = ['Alice', 'Jack', 'Molly']

pd.Series(students)

0    Alice
1     Jack
2    Molly
dtype: object
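
To make the two-column picture concrete, here is a minimal sketch that pulls the index and the data out of the same Series separately. The variable names `index_labels` and `values` are illustrative only.

```python
import pandas as pd

students = pd.Series(['Alice', 'Jack', 'Molly'])

# "First column": the index. With no index given, Pandas uses 0..n-1.
index_labels = list(students.index)

# "Second column": the data itself.
values = list(students)

print(index_labels)
print(values)
```

Because no index was passed, the labels are just the positions of the items in the list.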

Underneath, `Pandas` stores the values of a series in a typed array using the `NumPy` library, which speeds up data processing compared with plain Python lists.

In [3]:
numbers = [1, 2, 3]

pd.Series(numbers)

0    1
1    2
2    3
dtype: int64
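
One way to see the typed array behind a Series is to ask for it directly. This is a small sketch using `Series.to_numpy()` and `Series.dtype`; the exact integer width (`int64` or `int32`) can depend on the platform and library versions, so the check below only tests for "some integer type".

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, 3])

# to_numpy() returns the values as a NumPy array
arr = numbers.to_numpy()

is_ndarray = isinstance(arr, np.ndarray)
is_integer = np.issubdtype(numbers.dtype, np.integer)

print(type(arr), numbers.dtype)
```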

*Handling missing data*:

Python has the `None` type to indicate a lack of data.

Underneath, `Pandas` does some type conversion. For a list of strings in which one element is `None`, `Pandas` inserts it as `None` and uses the type `object` for the underlying array.

However, for a list of numbers (`int` or `float`) with a `None` element, `Pandas` automatically converts the `None` into a special floating point value designated as `NaN`, which stands for "Not a Number".

In [4]:
students = ['Alice', 'Jack', None]

pd.Series(students)

0    Alice
1     Jack
2     None
dtype: object

In [5]:
numbers = [1, 2, None]

pd.Series(numbers)

0    1.0
1    2.0
2    NaN
dtype: float64

Notice that:

* `NaN` is a different value from `None`.

* `Pandas` sets the `dtype` of this series to floating point numbers instead of object or ints $\rightarrow$ `NaN` is itself a floating point value, and `Pandas` can do this because integers can be typecast to floats.

* Both `None` and `NaN` can represent missing data. But, underneath, `Pandas` doesn't represent them in the same way. `NaN` is NOT equivalent to `None`.

In [6]:
import numpy as np

np.nan == None

False

In [7]:
np.nan == np.nan

False

In [8]:
# NaN never compares equal to anything, including itself, so test for it with np.isnan()
np.isnan(np.nan)

True
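
Since `None` and `NaN` are different values that both mean "missing", it can be convenient to test for either with a single call. The sketch below uses `pd.isna()` and `Series.isna()`, which treat both as missing. It is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

# pd.isna() accepts scalars and reports both None and NaN as missing
none_missing = bool(pd.isna(None))
nan_missing = bool(pd.isna(np.nan))

students = pd.Series(['Alice', 'Jack', None])
numbers = pd.Series([1, 2, None])

# Series.isna() returns an element-wise boolean mask
student_mask = students.isna().tolist()
number_mask = numbers.isna().tolist()

print(student_mask, number_mask)
```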

In [10]:
# Create a Series using dictionaries
student_scores = {'Alice': 'Physics',
                  'Jack': 'Chemistry',
                  'Molly': 'English'}

s = pd.Series(student_scores) # Keys are considered as index
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [11]:
s.index

Index(['Alice', 'Jack', 'Molly'], dtype='object')

In [12]:
students = [('Alice', 'Brown'), ('Jack', 'White'), ('Molly', 'Green')]

pd.Series(students) # dtype: object is not just for strings but for arbitrary objects

0    (Alice, Brown)
1     (Jack, White)
2    (Molly, Green)
dtype: object

In [13]:
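# Separate the index from the data by passing the index as a list.
# When the data is a dictionary and an index is also given, Pandas keeps only the
# index values that you provide: keys that are not in the index are ignored, and
# index labels with no matching key get None or NaN (see the next cell)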
s = pd.Series(['Physics', 'Chemistry', 'English'], index=['Alice', 'Jack', 'Molly'])
s

Alice      Physics
Jack     Chemistry
Molly      English
dtype: object

In [14]:
student_scores = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}

s = pd.Series(student_scores, index=['Alice', 'Molly', 'Sam'])

In [15]:
s

Alice    Physics
Molly    English
Sam          NaN
dtype: object
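
The alignment between dictionary keys and an explicit index can be checked programmatically. This sketch repeats the example above and inspects which labels survived. The names `has_jack`, `sam_missing` and `present` are illustrative.

```python
import pandas as pd

student_classes = {'Alice': 'Physics', 'Jack': 'Chemistry', 'Molly': 'English'}
s = pd.Series(student_classes, index=['Alice', 'Molly', 'Sam'])

# 'Jack' was a dictionary key but is not in the requested index, so it is dropped
has_jack = 'Jack' in s.index

# 'Sam' is in the requested index but has no dictionary key, so the value is missing
sam_missing = bool(pd.isna(s.loc['Sam']))

# dropna() keeps only the entries that actually have data
present = s.dropna().tolist()

print(has_jack, sam_missing, present)
```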

#### Querying a Series

In [16]:
student_classes = {'Alice': 'Physics',
                   'Jack': 'Chemistry',
                   'Molly': 'English',
                   'Sam': 'History'}

s = pd.Series(student_classes)
s

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [17]:
# Use `iloc[]` to check entries by integer position (0-based)
s.iloc[3]

'History'

In [18]:
# Use `loc[]` to check entries by index label/name
s.loc['Molly']

'English'

`Pandas` tries to make the code readable and provides a smart syntax: you can use the indexing operator directly on the series itself, and `Pandas` infers whether you mean `iloc` or `loc`.

However, you need to be careful when using the indexing operator on a Series whose index is made of integers: are you querying by index position or by label?

In [19]:

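# Positional fallback on a non-integer index; recent Pandas versions deprecate this in favour of s.iloc[3]
s[3]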

'History'

In [20]:
s['Molly']

'English'

In [21]:
# Safer option is to be explicit
s.iloc[3]

'History'

In [22]:
class_code = {99: 'Physics',
              100: 'Chemistry',
              101: 'English',
              102: 'History'}

s = pd.Series(class_code)

In [25]:
# s[0] would raise a KeyError: with an integer index, 0 is treated as a label, and there is no label 0
s.iloc[0]

'Physics'
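
The ambiguity with integer labels can be shown side by side. In this sketch, the same Series is queried by label with `loc`, by position with `iloc`, and with the bare indexing operator, which falls back to a label lookup and fails.

```python
import pandas as pd

s = pd.Series({99: 'Physics', 100: 'Chemistry', 101: 'English', 102: 'History'})

by_label = s.loc[99]      # label 99
by_position = s.iloc[0]   # first position

# With an integer index, s[0] is a label lookup, and 0 is not a label here
zero_is_label = 0 in s.index
try:
    s[0]
    raised = False
except KeyError:
    raised = True

print(by_label, by_position, zero_is_label, raised)
```

Being explicit with `loc` and `iloc` avoids having to remember which interpretation the indexing operator will pick.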

**Vectorization with `Numpy`**

In [26]:
# Perform operations with Series (slow procedure)
grades = pd.Series([90, 80, 70, 60])

total = 0
for grade in grades:
    total += grade

print(total/len(grades))

75.0


In [27]:
# Pandas and numpy libraries support a method of computation called "Vectorization" (faster)
import numpy as np

total = np.sum(grades)
print(total/len(grades))

75.0


In [28]:
# Demonstrating whether one method is slower than the other
# Create a big series of numbers
numbers = pd.Series(np.random.randint(0, 1000, 10000))

# Check the first items in Series
numbers.head()

0    120
1    380
2    918
3    291
4    547
dtype: int32

In [29]:
len(numbers)

10000

In [30]:
# The IPython interpreter provides magic functions (prefixed with % or %%)
# %%timeit runs the cell several times and reports the average time per loop
# (-n sets the number of loops per run; without it, the number is chosen automatically)
# It needs to be the first line in the cell

In [32]:
%%timeit -n 100

total = 0
for number in numbers:
    total += number

total/len(numbers)

3.69 ms ± 1.23 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [33]:
%%timeit -n 100

total = np.sum(numbers)
# Vectorization: the operation is applied to the whole array in compiled code,
# which lets the computer process many values at once instead of one per Python loop step
total/len(numbers)

The slowest run took 4.88 times longer than the fastest. This could mean that an intermediate result is being cached.
296 µs ± 128 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
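
`%%timeit` is only available inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard-library `timeit` module can be used instead. The helper names `loop_mean` and `vectorized_mean` are hypothetical, and the timings printed will vary from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

numbers = pd.Series(np.random.randint(0, 1000, 10000))

def loop_mean(series):
    """Average computed with an explicit Python loop."""
    total = 0
    for number in series:
        total += number
    return total / len(series)

def vectorized_mean(series):
    """Average computed by the vectorized Series.mean() method."""
    return series.mean()

# Both approaches should give the same answer
same_result = bool(np.isclose(loop_mean(numbers), vectorized_mean(numbers)))

# Time 20 calls of each approach
loop_seconds = timeit.timeit(lambda: loop_mean(numbers), number=20)
vector_seconds = timeit.timeit(lambda: vectorized_mean(numbers), number=20)

print(f"loop: {loop_seconds:.4f} s, vectorized: {vector_seconds:.4f} s")
```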


**Broadcasting**

In [34]:
# Broadcasting --> Applying an operation to every value in the series
numbers.head()

0    120
1    380
2    918
3    291
4    547
dtype: int32

In [35]:
# For example, adding 2 to all the values in the Series
numbers += 2
numbers.head()

0    122
1    382
2    920
3    293
4    549
dtype: int32

In [43]:
# However, you can also iterate through all the items using `items()`
for label, value in numbers.items():
    numbers.at[label] = value + 2

numbers.head()

0    124
1    384
2    922
3    295
4    551
dtype: int32

Whenever you find yourself *iterating* in `Pandas`, consider whether it is really the best way to do the operation: prefer *Vectorization* and *Broadcasting* instead.

In [44]:
%%timeit -n 10

s = pd.Series(np.random.randint(0, 1000, 10000))

for label, value in s.items():
    s.at[label] = value + 2

890 ms ± 475 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [45]:
%%timeit -n 10

s = pd.Series(np.random.randint(0, 1000, 10000))

s += 2

The slowest run took 4.39 times longer than the fastest. This could mean that an intermediate result is being cached.
869 µs ± 483 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
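
Broadcasting is not limited to `+=`. As a brief sketch, the same idea applies to other arithmetic and comparison operators, each returning a new Series and leaving the original untouched. The names `curved`, `scaled` and `passed` are illustrative.

```python
import pandas as pd

grades = pd.Series([90, 80, 70, 60])

curved = grades + 5     # add 5 to every value
scaled = grades / 100   # divide every value by 100
passed = grades >= 70   # compare every value, giving a boolean Series

print(curved.tolist())
print(scaled.tolist())
print(passed.tolist())
```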


The `.loc[]` indexer lets you modify data in place and, if the label does not exist yet, add a new entry.

In [46]:
s = pd.Series([1, 2, 3])

s.loc['History'] = 102

In [47]:
s

0            1
1            2
2            3
History    102
dtype: int64

In [48]:
s.index

Index([0, 1, 2, 'History'], dtype='object')

In [49]:
s.values

array([  1,   2,   3, 102], dtype=int64)
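
The two behaviours of `.loc[]` assignment, overwriting an existing label and adding a new one, can be seen together in this small sketch. Note how adding a string label to an integer index leaves the index with mixed types.

```python
import pandas as pd

s = pd.Series([1, 2, 3])

s.loc[0] = 10            # label 0 exists: the value is overwritten
s.loc['History'] = 102   # label 'History' is new: an entry is added

values = s.tolist()
labels = list(s.index)
index_dtype = str(s.index.dtype)

print(values, labels, index_dtype)
```

Mixing integer and string labels works, but it is usually easier to reason about a Series whose index has a single type.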

In [50]:
# Showing a case where index values are not unique
student_classes = pd.Series({'Alice': 'Physics', 
                             'Jack': 'Chemistry', 
                             'Molly': 'English',
                             'Sam': 'History'})

student_classes

Alice      Physics
Jack     Chemistry
Molly      English
Sam        History
dtype: object

In [51]:
# Now imagine a Series with Kelly's classes
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'], index=['Kelly', 'Kelly', 'Kelly'])
kelly_classes

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

In [52]:
# We want to combine all of the data into a new Series
# (Series.append() was removed in pandas 2.0; pd.concat() is the replacement)
all_student_classes = pd.concat([student_classes, kelly_classes])
all_student_classes

Alice       Physics
Jack      Chemistry
Molly       English
Sam         History
Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object

Important notes:

* `Pandas` will take the Series and try to infer the best data types to use when concatenating them with `pd.concat()`.

* `pd.concat()` doesn't change the underlying Series but returns a new one made up of the inputs joined together. The original Series won't change.

* When we query the combined Series for Kelly, we get all the values stored under that index label.

In [53]:
all_student_classes.loc['Kelly']

Kelly    Philosophy
Kelly          Arts
Kelly          Math
dtype: object
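
Whether a label lookup returns a single value or a whole Series depends on whether the label is unique. The sketch below combines two small Series with `pd.concat()` (an assumption made here so that the example runs on current Pandas versions) and checks uniqueness with `Index.is_unique`.

```python
import pandas as pd

student_classes = pd.Series({'Alice': 'Physics', 'Jack': 'Chemistry'})
kelly_classes = pd.Series(['Philosophy', 'Arts', 'Math'],
                          index=['Kelly', 'Kelly', 'Kelly'])

combined = pd.concat([student_classes, kelly_classes])

unique_before = student_classes.index.is_unique
unique_after = combined.index.is_unique

single = combined.loc['Alice']     # unique label: a scalar
multiple = combined.loc['Kelly']   # repeated label: a Series of all matches

print(unique_before, unique_after)
print(single)
print(multiple.tolist())
```

Checking `is_unique` before a lookup is one way to avoid being surprised by the return type.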