# An important note about course materials

As of this week, 
* There are no separate lecture notes. 
* Lecture notes are incorporated directly into your workbooks. 
* This directory contains all student workbooks, listed in the order in which you should complete them.
* Thus, you can work entirely in JupyterHub and never go back to Canvas. 
* This makes it easier to find everything, and will likely continue. 
* Please relay any concerns to me through Piazza. Thanks! 

Prof. Couch

Please run the following cell to show an overview video.

In [1]:
%%html
<iframe id="kaltura_player" src="https://cdnapisec.kaltura.com/p/1813261/sp/181326100/embedIframeJs/uiconf_id/38997502/partner_id/1813261?iframeembed=true&playerId=kaltura_player&entry_id=1_govlzyqa&flashvars[streamerType]=auto&amp;flashvars[localizationCode]=en&amp;flashvars[leadWithHTML5]=true&amp;flashvars[sideBarContainer.plugin]=true&amp;flashvars[sideBarContainer.position]=left&amp;flashvars[sideBarContainer.clickToClose]=true&amp;flashvars[chapters.plugin]=true&amp;flashvars[chapters.layout]=vertical&amp;flashvars[chapters.thumbnailRotator]=false&amp;flashvars[streamSelector.plugin]=true&amp;flashvars[EmbedPlayer.SpinnerTarget]=videoHolder&amp;flashvars[dualScreen.plugin]=true&amp;flashvars[Kaltura.addCrossoriginToIframe]=true&amp;&wid=1_3qf4xznt" width="512" height="288" allowfullscreen webkitallowfullscreen mozAllowFullScreen allow="autoplay *; fullscreen *; encrypted-media *" frameborder="0" title="Kaltura Player"></iframe>

# A recap of where we are

So far, 
* we know our way around Jupyter Notebooks. 
     * creating and saving notebooks. 
     * editing text and programs. 
     * submitting assignments. 
* we know how to code basic things in Python/IPython within notebooks.
     * assignment
     * iteration
     * conditionals
     * functions
* we know the various data structures available in Python/IPython: 
     * lists -- for sequences that can change size.
     * tuples -- for things with positional meaning. 
     * sets -- for collections without duplicates. 
     * dicts -- for things that are filed by keyword. 
     * classes -- for data and functions that work together. 
* and we've been exposed to some deeper concepts in Python and programming in general. 
     * *encapsulation* -- recording sequences for reuse. 
     * *preconditions and postconditions* -- how to reuse cells for new tasks. 
     * *stringification* -- how Python allows classes to control how they're printed.

So far, however, we've been working at a fairly deep level in the guts of Python. The reality is that: 
* Data scientists function in Python at a *higher level of abstraction*. 
* Thus, it is necessary to apply the principles above to *software and patterns written by others.* 
* Survival and effective function thus requires that we adopt a *consumer approach to python programming.*
* *Classes rule.* People have written very powerful classes that utilize the mechanisms I've shown you to raise the level of abstraction in how you program. 
* You will very seldom interact with the primitive data structures we've seen so far. Instead, we'll interact with classes that accomplish tasks more easily. 
* *These classes require a new and different set of skills.*

# A personal note
At this point in the course, I am *straining to avoid* turning you into master Python programmers. Actually, that isn't needed. Instead, I need to turn you into master *consumers* of other people's Python programs. That is -- indeed -- a somewhat distinct skill. 

# Data abstraction

*Data abstraction* refers to the practice of encapsulating ideas about data in classes, and using class methods instead of lower-level Python code to accomplish tasks. In a data abstraction situation, 
* Classes are opaque. You should not need to read the code of a class to use it. 
* Class methods are documented via *preconditions* (also called *prerequisites*) and *postconditions* (also called *results*). 
* Thus, each class enforces a *contract* with you, that if you give it what it needs, it will provide you with what you need. 
* The classes we will use are not built into Python; they live in libraries and must be imported. You only need to import a class once per workbook, in a cell that runs before you use it. 
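
As a minimal sketch of such a contract, consider the class below. `Temperature` is a hypothetical class invented for this illustration (it is not part of any library); its docstrings state what each method needs and what it promises in return.

```python
class Temperature:
    """Hypothetical example class: stores a temperature in Celsius."""

    def __init__(self, celsius):
        """Precondition: celsius is a number >= -273.15.
        Postcondition: the object remembers that temperature."""
        if celsius < -273.15:
            raise ValueError("temperature below absolute zero")
        self._celsius = celsius

    def fahrenheit(self):
        """Precondition: none.
        Postcondition: returns the temperature in degrees Fahrenheit."""
        return self._celsius * 9 / 5 + 32


# We use the class by relying on its documentation, not on its code.
t = Temperature(100)
boiling = t.fahrenheit()
print(boiling)  # 212.0
```

If you give `Temperature` a valid number, it keeps its promise; if you violate the precondition, it is entitled to complain.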


# What you need to know in order to use a class
* (*Installs:* what do you need to install via `pip install`)
* *Imports:* what do you need to import? 
* *Usage:* what are class methods? What do they do? 
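
To make these three questions concrete, here is a small sketch that uses `Fraction` from Python's standard library. This class is chosen only as an illustration and is not otherwise used in this course. *Installs:* nothing, because it ships with Python. *Imports:* one line. *Usage:* create instances and combine them.

```python
# Imports: done once, before the class is first used.
from fractions import Fraction

# Usage: build two instances and add them.
third = Fraction(1, 3)
sixth = Fraction(1, 6)
total = third + sixth

print(total)                               # 1/2
print(total.numerator, total.denominator)  # 1 2
```

Notice that we never looked at how `Fraction` stores its data or finds a common denominator.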

# An example: okpy
* Imports: `from client.api.notebook import Notebook`
* Installs: `pip install okpy`  (I did this for you in https://jupyterhub.cs.tufts.edu)
* Usage: `ok = Notebook('Data abstraction.ok')` and then: 
    * `ok.auth({options})`: authenticate a student. 
    * `ok.grade({assignment})`: grade an assignment. 
    * `ok.submit()`: submit the whole notebook for grading. 

You start using this via: 

In [None]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('Data abstraction.ok')
ok.auth(inline=True)

# How okpy actually works
A. I edit a configuration file `Data abstraction.ok` that tells okpy how to function. Run the cell below to display it. 

In [None]:
# run this cell to display the contents of the file
%pycat Data abstraction.ok

B. I edit tests in the subdirectory `tests`. These have names of the form `qNN.py`, where `NN` is the two-digit number of the question. 
For example, here is the test for question 1: 

In [None]:
# run this cell to display the contents of the file
%pycat tests/q01.py

# The idea of data abstraction

* Remember how to use a thing. 
* Forget about what it actually is. 
* Forget how it does it. 
    
   *Don't pay attention to that entity behind the curtain*, 
   to paraphrase and update The Wizard of Oz. 

# What is a data structure? 
* Data abstraction requires creating *data structures* that obey particular rules. 
* The Python concepts of `list`, `set`, `dict`, `tuple` are all data structures. 
* So are all classes that you or other people might create from them. 
* A data structure includes: 
  * a way of inputting data. 
  * methods for manipulating data. 
* For example, `Frame.py` from your last module is a very primitive data structure.   

# An example: Pandas data types
(This and the following workbooks are based upon and motivated by https://pandas.pydata.org/pandas-docs/version/0.21/dsintro.htm)

One of the most powerful data science packages is *Pandas*. To learn more about it, read https://pandas.pydata.org/ . You can't avoid it. It's ubiquitous. 

Pandas is structured around several relatively opaque data structures that it uses for everything. To use it, you must "wrangle" whatever data you have into its format. This can be trivial or a challenge. What you get for this wrangling is a very well-behaved object with a strong set of properties, methods, etc. From then on, you can do everything you need to do with this object.  

Everything we have done so far in the course, and everything I have shown you about Python, is motivated by building a foundation for understanding the Pandas `DataFrame`. But we have to proceed more slowly than that. There are three topics we need to discuss, in order: 
* `numpy` and the concept of an `ndarray`
* Making a pandas `Series` from an `ndarray`. 
* Making a pandas `DataFrame` from several `Series`. 
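
As a preview of where these three steps lead, here is a rough sketch. The column names and numbers are made up for illustration, and the later workbooks cover each step properly.

```python
import numpy as np
import pandas as pd

# Step 1: an ndarray of made-up measurements.
heights = np.array([1.6, 1.7, 1.8])

# Step 2: a Series built from an ndarray.
height_series = pd.Series(heights, name="height")
weight_series = pd.Series(np.array([55, 65, 80]), name="weight")

# Step 3: a DataFrame built from several Series, one per column.
df = pd.DataFrame({"height": height_series, "weight": weight_series})

print(df.shape)          # (3, 2): three rows, two columns
print(list(df.columns))  # ['height', 'weight']
```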

# The data abstraction `ndarray`
Our first step "up the long ladder" is to understand the `numpy` library and the concept of an `ndarray`. `numpy` is the Python numerical computing library. This workbook is based upon https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html 

As a first approximation, `numpy` is a Python library for *linear algebra*. It is possible to compute many basic linear algebraic functions (e.g., matrix multiplication) using the package. Its data type underlies the `Series` and `DataFrame` concepts in Pandas, so we need to understand it first.  

For example, consider

In [None]:
import numpy as np
x = np.array([1,2,3])
x

In [None]:
y = np.array([4,5,6])
y

In [None]:
x + y

# Whoa there! What just happened? 

The objects `x` and `y` are `array`s in `numpy`. Among other things, `numpy` defines how to add `array`s. The sum of two `array`s of the same size is the `array` obtained by adding each corresponding pair of elements. So, we can also write things like: 

In [None]:
x - y

In [None]:
x * y
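
One point worth a quick sketch: `*` multiplies arrays element by element. It is *not* matrix multiplication. For the linear-algebra product, `numpy` arrays support the `@` operator. The matrix `m` and vector `v` below are made-up values for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

elementwise = x * y   # multiplies matching elements
dot = x @ y           # dot product: 1*4 + 2*5 + 3*6

m = np.array([[1, 0],
              [0, 2]])
v = np.array([3, 4])
mv = m @ v            # matrix times vector

print(elementwise)  # [ 4 10 18]
print(dot)          # 32
print(mv)           # [3 8]
```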

# Beyond the mystery and magic
This behavior is not magic. It is part of Python class capabilities. Among other things, one can teach a class how to add two class instances! 
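
For the curious, here is a minimal sketch of the mechanism. `Vec` is a hypothetical toy class written for this note; the real `numpy` implementation is far more elaborate. Python calls the special method `__add__` whenever it evaluates `a + b`.

```python
class Vec:
    """Hypothetical toy class that knows how to add two of its instances."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Python calls this method when it evaluates `self + other`.
        return Vec(p + q for p, q in zip(self.values, other.values))

    def __repr__(self):
        return f"Vec({self.values})"


a = Vec([1, 2, 3])
b = Vec([4, 5, 6])
c = a + b
print(c)  # Vec([5, 7, 9])
```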

Remember in the Python `class` exercise how I tried to avoid teaching you everything about `class`es? *This is why!* There are a lot of details about `class`es -- including how to do things like this -- that you don't really need to know to be incredibly literate data scientists. It is enough to understand how to use these capabilities. 

The bottom line is that a `numpy` `array` represents a vector or matrix, and can be used 'like a number' in a lot of contexts. For example, a lot of common-sense operations work as you would expect: 

In [None]:
x + 3

In [None]:
x == y

In [None]:
y == x + 3

In [None]:
(y == x + 3).all()

In [None]:
if (y == x + 3).all(): 
    print("they're equal")
else: 
    print("they're not equal")

In [None]:
if (x == y).all(): 
    print("they're equal")
else: 
    print("they're not equal")

In [None]:
z = np.array([1,3,4])
x == z

In [None]:
if (x == z).any(): 
    print("at least one element is equal")
else: 
    print("no elements are equal")

In [None]:
x[0]

From these experiments, we can conclude that: 
* adding a number to an array results in an array. 
* array indexes start at 0. 
* you can use `x[i]` to get to the ith element (from 0). 
* adding an `array` to an `array` of the same size results in an `array` of that size. 
* using `.any()` and `.all()` on an array of True/False values computes the logical `or` and `and` of its elements, respectively. 
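
The "same size" condition matters. As a small sketch (the array `short` is made up for this example), adding a three-element array to a two-element array fails with a `ValueError` rather than guessing what you meant:

```python
import numpy as np

x = np.array([1, 2, 3])
short = np.array([10, 20])   # deliberately the wrong size

try:
    x + short
    outcome = "added"
except ValueError as err:
    outcome = "ValueError"
    print("numpy refused:", err)

print(outcome)  # ValueError
```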

# An aside: there's no `array` in native Python. 
The implementors of Python didn't create a built-in `array` type suited to numerical work. (The standard library does have an `array` module, but it only provides compact one-dimensional storage, not element-by-element arithmetic.) The main reason for this is that the thing `[1,2,3]` in Python isn't implemented as an array in the sense of Java and C, so we call it something akin to how it's implemented, a `list`. However, for the purpose of linear algebra, we need something that acts like an `array`. So `numpy` provides that. 

Let's check your understanding of these concepts. 

In [None]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('Data abstraction.ok')
ok.auth(inline=True)

1. Make up an array of the numbers 1 to 5. Put it into a variable `x`.

In [None]:
# your answer: 
x = np.array([1,2,3,4,5])
x

In [None]:
_ = ok.grade('q01')  # test that your answer is correct 

2. Write code that sets `y` to the vector created by adding 5 to each element of `x`. 

In [None]:
# Your answer: 
y = x + 5
y

In [None]:
_ = ok.grade('q02')  # test that your answer is correct 

# Is the 'for' loop obsolete? 

Sort of. Let's just say that there are very efficient ways to do things in `numpy` without `for` loops. I'm sure that you could tell me whether 7 is a member of `y` via a `for` loop. But you can also do that with `array`s much more simply: 

3. (Advanced) Consider that `y` *is an iterable* and write an expression that is True if 7 is in `y`, and False if not. Put that value into `z`.

In [None]:
# your answer: 
z = (7 in y) 
z

# Whoa there! 
The advanced problem shows that there are things about an `array` that are inherited from its status as something else. E.g., the following also works:

In [None]:
for i in y: 
    print(i)
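
To compare the approaches side by side, here is a sketch that answers the same membership question three ways. It assumes `y` holds the values 6 through 10, as it does after question 2.

```python
import numpy as np

y = np.array([6, 7, 8, 9, 10])   # assumed value of y from question 2

# 1. The explicit for loop.
found = False
for item in y:
    if item == 7:
        found = True
        break

# 2. The `in` operator, which works because y is an iterable.
with_in = 7 in y

# 3. A numpy comparison followed by .any().
with_any = bool((y == 7).any())

print(found, with_in, with_any)  # True True True
```

All three agree; the second and third simply let Python and `numpy` do the looping for you.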

# The treasure hunt
Almost every common thing that you might want to do to an `array` with a `for` loop is easier to do with some `numpy.ndarray` method and/or some combination of those methods and native Python. A very large user community has gone to great expense to make using an `array` as simple as possible! 

What this means -- in practical terms -- is that it is often simpler to look around for a solution in the *numpy user manual* than to code it yourself. Thus, programming with `numpy` requires both knowledge of native Python and "treasure hunting" in the `numpy` documentation! 
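
As a small taste of what such a hunt turns up, here are a few `ndarray` methods, each of which replaces a short `for` loop. The data are made up for illustration.

```python
import numpy as np

x = np.array([5, 6, 7, 8, 9])

largest = x.max()         # largest element
smallest = x.min()        # smallest element
where_largest = x.argmax()  # index of the largest element
running = x.cumsum()      # running total

print(largest)        # 9
print(smallest)       # 5
print(where_largest)  # 4
print(running)        # [ 5 11 18 26 35]
```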

Let's have some fun with a few treasure hunts through https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html 

4. Complete the function below so that it always returns the sum of the one-dimensional array `x` passed to it. Beware: I will test it on multiple arrays `x`!

In [None]:
def mysum(x): 
    # your answer: 
    return x.sum()

In [None]:
# run this to check your code
mysum(np.array([1,2,3,4,5]))

In [None]:
_ = ok.grade('q04')  # test that your answer is correct

5. In the function below, return a normalized set of data whose mean is 0.0, by subtracting the current mean from x. 

In [None]:
def renorm(x): 
    # your answer: 
    return x - x.mean()

In [None]:
# run this to check your code
x = np.array([5, 6, 7, 8, 9])
renorm(x)
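
One way to convince yourself that the result really has mean 0.0 is sketched below. It repeats the `renorm` definition from above so that it runs on its own, and it uses `np.isclose` instead of `==` because floating-point means are not always exactly zero.

```python
import numpy as np

def renorm(x):
    # Same definition as in the answer cell above.
    return x - x.mean()

data = np.array([5, 6, 7, 8, 9])
centered = renorm(data)

print(centered)                           # [-2. -1.  0.  1.  2.]
print(np.isclose(centered.mean(), 0.0))   # True
```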

In [None]:
_ = ok.grade('q05')  # test that your answer is correct

6. (Advanced) What happens if you try the same operations on lists that you performed on arrays? 

___Your answer:___

# When you are done with this notebook, 
* Save and checkpoint. 
* Change `ready` to `True` in the cell below. 
* Run the cell below. 

In [None]:
ready = False  # change to True when ready to submit
if not ready: 
    raise Exception("change ready to True when ready to submit")
_ = ok.submit()