
# Intro to Data Analysis with Pandas & Numpy
---
by: John Muchovej \([@ionlights](github.com/ionlights/)\), on 12 Sep 2018

In [None]:
def dataset(path):
    import os
    from pathlib import Path
    datadir = Path(os.environ["DATA_DIR"])
    return Path(datadir.joinpath(path))

---

## What's NumPy?

> NumPy is the fundamental package for scientific computing with Python. It contains among other things:
> 
> - a powerful N-dimensional array object
> - sophisticated (broadcasting) functions
> - tools for integrating C/C++ and Fortran code
> - useful linear algebra, Fourier transform, and random number capabilities
>
> Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly and speedily integrate with a wide variety of databases.
>
> &ndash; according to http://numpy.org

### So what's that actually mean, for us?
You'll see more tonight, but effectively, `numpy` is a library that lets us work with linear algebra, be lazy, and perform array operations on a much larger (and more efficient) scale than Python's `list` allows for! :D

#### Let's import `numpy`, so we can bask in its glory

In [None]:
import numpy as np

### Why do arrays matter?

As we'll see throughout the semester, arrays (of various degrees) are crucial to almost everything we can accomplish in machine learning, whether in research or industry.

We'll start out by looking at speed differences, as this is one of the primary selling points of `numpy`.

In [None]:
rows = 10000
cols = 10000

In [None]:
# assigning to `_` keeps the notebook from exploding due to the size of the output
%time _ = np.zeros((rows, cols))

In [None]:
# assigning to `_` keeps the notebook from exploding due to the size of the output
%time _ = [[0 for _ in range(cols)] for _ in range(rows)]

In [None]:
from IPython.display import Markdown
Markdown(f"As you can see, just to generate a matrix of ({rows}, {cols}) is significantly faster using `numpy`.")

We can also use commands like `np.ones` and `np.full` to generate these sorts of matrices with `1` or `<custom-value>` &ndash; which makes creation of arrays not only convenient, but also low-cost operations.
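
As a quick sketch of those two helpers (the shape `(2, 3)` and the fill value `7` are arbitrary choices for illustration):

```python
import numpy as np

# a 2x3 array filled with ones (float64 by default)
ones = np.ones((2, 3))

# a 2x3 array filled with a custom value; dtype is inferred from the value unless given
sevens = np.full((2, 3), 7)

print(ones)
print(sevens)
```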

#### Tangent: `numpy` knows what it's holding

In [None]:
temp = np.full((rows, cols, 3, 5), 42, dtype=np.double)

print("temp.shape = " + str(temp.shape))
print("temp.ndim  = " + str(temp.ndim) + " this is also referred to as RANK")
print("temp.dtype = " + str(temp.dtype))
print("temp.size  = " + str(temp.size))

---

### Let's generate some "functional" data to play with

In [None]:
rand_np = None
%time rand_np = np.random.rand(rows, cols)

In [None]:
import random

In [None]:
rand_list = None
%time rand_list = [[random.random() for _ in range(cols)] for _ in range(rows)]

As you can see, although both are slower than before, `numpy` still wipes the floor with Python's `list`. This is extremely advantageous when we begin dealing with large datasets on which we need to perform lots of repeated operations.

We'll actually get to doing that later tonight.
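
In the meantime, here's a rough sketch of what a "repeated operation" looks like in each world. The sizes are deliberately small and arbitrary so it runs anywhere, and the names `small_rows`, `small_cols`, `data_np`, and `data_list` are made up for this example.

```python
import numpy as np

# small, arbitrary sizes so this runs quickly anywhere
small_rows, small_cols = 200, 300

data_np = np.random.rand(small_rows, small_cols)
data_list = data_np.tolist()  # the same values as nested Python lists

# doubling every element: one expression in numpy...
doubled_np = data_np * 2

# ...versus an explicit nested comprehension for lists
doubled_list = [[value * 2 for value in row] for row in data_list]

# both approaches produce the same numbers
print(np.allclose(doubled_np, doubled_list))
```

In a notebook, prefixing either of the two "doubling" lines with `%timeit` should show the gap for yourself.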

---

### Where'd it go? – Indexing

Indexing, and slicing, are how we extract information from Python `list`s as well as `np.ndarray`s. Their abilities are quite different and `numpy` tends to come out on top, in terms of "intuitive" slicing.

In [None]:
index_example = None # generate

In [None]:
for row in index_example:
    print(row)

Let's try printing the first 3 rows of our `list`, the way we typically do with C, Java, etc.

In [None]:
# print first 3 values without slicing

Now, let's try doing this by Python slicing.

In [None]:
# print first 3 rows with slicing
print(index_example[0:3])

Alright, so printing rows works wonders, but this is a `2D` array, which means it also has columns. How might we do that??? (Let's do something a bit simpler – print out a sub-matrix, so 3 rows, and 3 columns.)

In [None]:
# print first 3 columns of the first 3 rows, using Python's list

Hmm... getting information from arrays like this seems pretty cumbersome, especially if we want a sub-matrix (a `2D` array). Can `numpy` do it better? (Hint: YES.)

In [None]:
np_slice = np.asarray(index_example)

In [None]:
# repeat row from python, but in np

In [None]:
# repeat submatrix from python, but in np

In [None]:
# try np slicing, but in python lists
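
One way the blank cells above might be filled in, sketched with a hypothetical 5×5 grid standing in for `index_example` (the names `grid` and `np_grid`, and the values, are made up for illustration):

```python
import numpy as np

# hypothetical stand-in for index_example: 5 rows of 5 consecutive integers
grid = [[row * 5 + col for col in range(5)] for row in range(5)]
np_grid = np.asarray(grid)

# first 3 rows: the same slice syntax works for both
print(grid[0:3])
print(np_grid[0:3])

# 3x3 sub-matrix: lists need a comprehension, numpy takes one slice per axis
sub_list = [row[0:3] for row in grid[0:3]]
sub_np = np_grid[0:3, 0:3]
print(sub_list)
print(sub_np)

# numpy-style indexing on a plain list fails
try:
    grid[0:3, 0:3]
except TypeError as err:
    print("lists can't do this:", err)
```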

#### A bit more nuance to selecting values from `np.ndarray`s

In [None]:
full_slice = np.random.rand(4, 6)
full_slice

Let's try to print out the 2nd column of `full_slice`.

In [None]:
full_slice[:, 1]

In [None]:
full_slice[:, 1].shape

Let's try to print out the 2nd row of `full_slice`.

In [None]:
full_slice[1, :]

In [None]:
full_slice[1, :].shape
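
Both results come back as 1-D arrays, so the "column" no longer looks like a column. If you need to keep two dimensions, one option is to slice instead of using a bare integer. A small sketch, with `demo` as a hypothetical stand-in for `full_slice`:

```python
import numpy as np

demo = np.random.rand(4, 6)

# an integer index drops that axis
print(demo[:, 1].shape)    # (4,)
print(demo[1, :].shape)    # (6,)

# a length-1 slice keeps the axis
print(demo[:, 1:2].shape)  # (4, 1)
print(demo[1:2, :].shape)  # (1, 6)
```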

---

### Being lazy &ndash; Broadcasting

This is going to be a quick example. It's something we'll see more commonly later in the semester, but for now we'll go with a somewhat boring case: adding two vectors whose shapes are mismatched.

In [None]:
broadcasting = np.arange(3)
print(broadcasting.reshape((3, 1)))
print(broadcasting)
broadcasting.reshape((3, 1)) + broadcasting

Although this is a random example, the point is that `numpy` does have built-in abilities to handle mis-matching shapes of vectors (and matrices).

What happens here is that `numpy`, internally, treats `broadcasting` and `broadcasting.reshape((3, 1))` as if each had been stretched to a common shape, `(3, 3)`, without actually copying the data.

Basically, you'll end up with...
```python
[[0,0,0]  and... [[0,1,2]
 [1,1,1]  and...  [0,1,2]
 [2,2,2]] and...  [0,1,2]]
```
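
To see those stretched arrays explicitly, here's a small sketch using `np.broadcast_to`; the names `row` and `column` are just for illustration.

```python
import numpy as np

row = np.arange(3)            # shape (3,)
column = row.reshape((3, 1))  # shape (3, 1)

# what each operand looks like once stretched to the common shape (3, 3)
print(np.broadcast_to(column, (3, 3)))
print(np.broadcast_to(row, (3, 3)))

# adding the stretched views gives the same result as letting numpy broadcast
explicit = np.broadcast_to(column, (3, 3)) + np.broadcast_to(row, (3, 3))
implicit = column + row
print(np.array_equal(explicit, implicit))
```

`np.broadcast_to` returns a read-only view rather than a copy, which is roughly how `numpy` avoids paying for the extra memory.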

---
---

## Pandas, not the bamboo consumers

### What is it?

> pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
> 
> &ndash; according to https://pandas.pydata.org/

### Core components
`pandas` has two core components – `pandas.Series` and `pandas.DataFrame`.
- `pandas.Series` is roughly equivalent to a 1D `np.array`, or a spreadsheet's column
- `pandas.DataFrame` is equivalent to a spreadsheet
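
A tiny sketch of both, using made-up names and grades:

```python
import pandas as pd

# a Series: one labelled column of values
grades = pd.Series([91, 78, 85], name="grade")

# a DataFrame: several columns sharing one index, like a spreadsheet
students = pd.DataFrame({
    "name": ["Ada", "Grace", "Linus"],
    "grade": [91, 78, 85],
})

print(grades)
print(students)

# selecting one column of a DataFrame gives back a Series
print(type(students["grade"]))
```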

In [None]:
import pandas as pd

Whether you're working in industry, research, or on your own projects – the majority of your time doing data science and machine learning will be spent collecting and **_cleaning_** data. Cleaning is massively facilitated by `pandas` &ndash; it tends to involve dealing with missing values, inconsistent formatting, malformed records, or nonsensical outliers.

### Dropping Columns in a `pandas.DataFrame`

Sometimes, not all the data we have as part of our dataset is useful.
> **a trivial example**
> *you want to analyze a student's grades, but you're given: &lt;name&gt;, &lt;address&gt;, &lt;grades&gt;, &lt;parent1-name&gt;, &lt;parent2-name&gt;, &lt;PID&gt;, ...*
>
> With this, everything except &lt;grades&gt; and &lt;PID&gt; is effectively useless, which means we can get rid of the rest

It's always a Good Idea&trade; to dump the data you don't need, as this will typically free up memory and may accelerate runtimes, too.

Thankfully, `pandas` provides this functionality for us through their [`pandas.DataFrame.drop()`][pddrop] (which can drop columns or rows).

[pddrop]: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
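
Before touching the real dataset, here's a minimal sketch of `drop()` on a made-up frame that mirrors the trivial example above (the `students` frame and its values are hypothetical):

```python
import pandas as pd

# hypothetical student records, mirroring the trivial example above
students = pd.DataFrame({
    "PID": [101, 102],
    "name": ["Ada", "Grace"],
    "address": ["1 Main St", "2 Oak Ave"],
    "grades": [91, 78],
})

# drop() returns a new DataFrame; `students` itself is left untouched
trimmed = students.drop(columns=["name", "address"])

print(trimmed.columns.tolist())
print(students.columns.tolist())
```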

Now let's take a look at our first dataset: `britishlib-flickr-images-books.csv`.

In [None]:
bl = pd.read_csv(dataset("britishlib-flickr-images-books.csv"))

In [None]:
# head

`pd.DataFrame.head()` will, by default, show the first 5 rows of a `pd.DataFrame`; if we take a look at the columns (next cell), we'll see that quite a few don't actually provide much information describing the books themselves.

In [None]:
# info

In [None]:
lets_drop = ["Edition Statement", "Corporate Author", 
             "Corporate Contributors", "Former owner", 
             "Engraver", "Contributors", 
             "Issuance type", "Shelfmarks"]
# drop

In [None]:
# info

### Changing the Index in a `pandas.DataFrame`

Indices in `pandas` allow for more versatile slicing and labeling of data within `pd.DataFrame`s. Normally, it's quite useful to have a unique index.

In [None]:
# test Identifier uniqueness

In [None]:
# set to new index
bl.head()

Notice, we used `..., inplace=True)`. By default, a variety of functions leave the original `DataFrame` untouched and return a modified copy, so without `inplace=True` you'd need to write `df_new = df_old.<some_function>` to keep the result.
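
A minimal sketch of the two behaviours on a made-up two-row frame (the `books` frame and its values are hypothetical):

```python
import pandas as pd

books = pd.DataFrame({"Identifier": [206, 216], "Title": ["A", "B"]})

# default: returns a modified copy, the original keeps its 0..n-1 index
indexed = books.set_index("Identifier")
print(books.index.tolist())
print(indexed.index.tolist())

# inplace=True: modifies `books` directly and returns None
result = books.set_index("Identifier", inplace=True)
print(result)
print(books.index.tolist())
```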

With an Index that we've set, we can access rows using `pd.DataFrame.loc[]`; this allows us to look up rows based on the value of the index.

There's also `pd.DataFrame.iloc[]`, which is like `pd.DataFrame.loc[]` but looks rows up by integer position: so `bl.loc[206]` won't necessarily be the same as `bl.iloc[206]`.
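
To make the label-versus-position distinction concrete, here's a sketch on a hypothetical three-row frame (the titles and identifiers are made up):

```python
import pandas as pd

books = pd.DataFrame(
    {"Title": ["First", "Second", "Third"]},
    index=pd.Index([206, 216, 218], name="Identifier"),
)

# .loc looks up by index *label*
print(books.loc[216, "Title"])   # Second

# .iloc looks up by integer *position*
print(books.iloc[1]["Title"])    # Second
print(books.iloc[0]["Title"])    # First

# label 1 doesn't exist in this index, so .loc[1] raises KeyError
try:
    books.loc[1]
except KeyError:
    print("no row labelled 1")
```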

In [None]:
bl.loc[216]

In [None]:
bl.iloc[1]

Now that we've dropped the unnecessary data, and set our `Index` to something more relevant, let's clean-up some of the columns. Doing this will not only enforce a strict format we can exploit later, but it will also involve developing an understanding of the dataset.

In [None]:
# `DataFrame.get_dtype_counts()` was removed in pandas 1.0; this gives the same counts
bl.dtypes.value_counts()

Based on ^, it looks like we've got 6 "objects" &ndash; which are analogous to `str` in Python. `pandas`/`numpy` will apply this `dtype` to anything that doesn't neatly fit into numerical or categorical dtypes.

In [None]:
bl.info()

However, if we look at our columns, "Date of Publication" should be an `integer`, no? Especially since this allows for calculations we may need to do later on.

In [None]:
bl.loc[1808:, "Date of Publication"].head(10)

Some of these look like normal years we'd expect, but the entries at `1929` and `2956`, for instance, definitely don't match the expectation of being a year &ndash; which should be a `float64` (in terms of `np.dtype`).

So, some things we need to do to clean this up:
1. Remove dates in square brackets (e.g. \[1875\])
1. Convert date ranges to their start date (e.g. 1860-63)
1. Completely remove dates we're uncertain about (e.g. \[1904?\])
1. Convert `nan` strings into `np.nan`

Thankfully, there's something called RegEx which allows us to take advantage of the format of years (don't concern yourself with this, RegEx might be a topic for later).

In [None]:
date_extract = r'^(\d{4})'

This regex looks for 4 digits (`\d`) at the beginning of a string &ndash; this should be enough for our cases. We'll gloss over this for now, as it's not our focus &ndash; if you want more info on RegEx, take a look at https://regexr.com/.
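
A small sketch of what this pattern does to a few made-up strings shaped like the messy values listed above (the `samples` values are hypothetical, not rows from the dataset):

```python
import pandas as pd

date_pattern = r'^(\d{4})'

# made-up samples shaped like the problem cases
samples = pd.Series(["1868", "1879 [1878]", "1860-63", "[1897?]"])

# keep only the four leading digits; anything else becomes NaN
years = samples.str.extract(date_pattern, expand=False)
print(years.tolist())

# because of the NaN, the numeric version typically ends up as a float
# dtype rather than an integer one
numeric_years = pd.to_numeric(years)
print(numeric_years.dtype)
```

Note that a bracketed value like `[1897?]` doesn't start with a digit, so it's dropped entirely, which matches the "remove dates we're uncertain about" goal.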

In [None]:
extract = bl["Date of Publication"].str.extract(date_extract, expand=False)
extract.head()

It's still an object! :/ Right &ndash; we didn't do that conversion, but it's a relatively simple fix to do so. Run the `pd.Series` through `pd.to_numeric`.

In [None]:
bl["Date of Publication"] = pd.to_numeric(extract)
bl["Date of Publication"].dtype

In [None]:
bl["Date of Publication"].isnull().sum() / len(bl)

Seems like about 12% of our data is null, awesome!

### Combining `str` Methods with `np` to Clean Columns

Earlier, we used `bl["Date of Publication"].str` &ndash; this is a pretty nifty way to perform string operations in `pd`. Generally, these operations mimic Python's native string methods or compiled RegEx operations &ndash; like `.split()`, `.replace()`, and `.capitalize()`.
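
A quick sketch of those three methods on a made-up `pd.Series` (the `places` values are hypothetical):

```python
import pandas as pd

places = pd.Series(["london; virtue", "new-castle", "oxford"])

# each .str method is applied to every element of the Series
capitalized = places.str.capitalize()
dehyphenated = places.str.replace("-", " ", regex=False)
split_up = places.str.split(";")

print(capitalized.tolist())
print(dehyphenated.tolist())
print(split_up.tolist())
```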

Cleaning up `Place of Publication` is a bit more of a challenge, and to do this, we'll combine `pd.str` with `np.where`; the latter is basically a vectorized if/else statement. (It's dope.)

```python
np.where(condition, then, else)
```

Here, `condition` is either an array-like object or a boolean mask (more on masks in a bit). `then` is the value used where the condition evaluates to `True`, and `else` is the value used otherwise.

`np.where()` takes each element in the object used for the `condition`, checks its "truthiness", and returns an `np.ndarray` containing the corresponding value from `then` or `else`.

We can nest these into compound `if-then` statements, allowing us to compute based on multiple conditions.
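
Here's a minimal sketch of both the single and the nested form, using made-up scores and labels:

```python
import numpy as np

scores = np.array([95, 72, 48, 81])

# single condition: "pass" where score >= 60, otherwise "fail"
passed = np.where(scores >= 60, "pass", "fail")

# nested: if >= 90 then "A", else if >= 60 then "pass", else "fail"
graded = np.where(scores >= 90, "A",
                  np.where(scores >= 60, "pass", "fail"))

print(passed)
print(graded)
```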

In [None]:
bl["Place of Publication"].head(10)

By the looks of it, "London" and "Oxford" seem to be the primary cities, along with some other, mildly identifiable information, which is ultimately not going to serve any purpose for us.

In [None]:
bl.loc[4157862]

In [None]:
bl.loc[4159587]

The joys of cleaning data &ndash; while they were published in the same place, the cities are apparently different, based on the hyphens.

Thankfully, `pd` has `str.contains(...)`, the vectorized counterpart of Python's `in` operator for strings, which allows us to find substrings and build a mask from the matches.

In [None]:
london = bl["Place of Publication"].str.contains("London")

In [None]:
oxford = bl["Place of Publication"].str.contains("Oxford")

Combining them with `np.where`...

In [None]:
bl['Place of Publication'] = np.where(london, 'London',
                                      np.where(oxford, 'Oxford',
                                               bl["Place of Publication"].str.replace('-', ' ')))
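
The same pattern on a few made-up place names, as a self-contained sketch. One assumption worth flagging: if a column contains missing values, `str.contains` returns a missing value for those rows, so this sketch passes `na=False` to treat them as non-matches; whether the real column needs that isn't shown here.

```python
import numpy as np
import pandas as pd

# hypothetical values, including one missing entry
places = pd.Series(["London; Virtue & Yorston", "Oxford", "Newcastle-upon-Tyne", None])

# boolean masks; na=False makes missing entries count as "no match"
is_london = places.str.contains("London", na=False)
is_oxford = places.str.contains("Oxford", na=False)

# London first, then Oxford, otherwise just swap hyphens for spaces
cleaned = np.where(is_london, "London",
                   np.where(is_oxford, "Oxford",
                            places.str.replace("-", " ", regex=False)))
print(cleaned)
```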

In [None]:
bl.head()

**NOTE:** We're far from fully cleaning this dataset; this was just a taste, but it's the general process you'd go through for such a task. :)