# Multi-dimensional arrays

We _happen_ to be working on this section after having covered 03-03-data-ingest, so this notebook combines elements of both: multi-dimensional arrays and data ingest.

## 1. Obtain the data 

For this exercise, we'll be downloading data ourselves. The [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) is a widely used resource that hosts over 500 data sets; we will be using the [Dow Jones Index Data Set](https://archive.ics.uci.edu/ml/datasets/dow+jones+index). Its landing page is typical for a dataset:
* It describes the dataset and provides key metrics about it, such as Number of Instances (rows) and Number of Attributes (columns).
* Near the top, it has two links: _Download:_ Data Folder, Data Set Description.

Clicking on [Data Folder](https://archive.ics.uci.edu/ml/machine-learning-databases/00312/) on the landing page brings you to a page that provides a download link for the data.

1. Click on the zip file link to download the data to your machine.
2. Unzip the downloaded file.
3. Drag and drop the `dow_jones_index.data` file into your Jupyter folder.
4. Take a look at the downloaded file by executing the next cell.

If the next cell shows the data, you have succeeded in downloading it to the correct location. If not, move the file within your workspace until the cell shows it. _Drag and drop_ is your friend!

In [None]:
!more dow_jones_index.data

Observe the following:
* There are 16 columns in the data.
* Some columns are integers; others are strings, dates, or dollar amounts (starting with `$`).

We will be reading the data into our notebook step by step.

First, instead of having to type column names into your code, it's good practice to copy-and-paste from the output of the previous command, like so:

In [None]:
# Copy-paste the header from the output of the previous cell into the next line...

colnames = 'column names from output of previous cell'.split(',')
print(colnames)

### Python's concept of dates and times

The core Python language has no built-in date or time type. The [datetime module](https://docs.python.org/3/library/datetime.html) in the standard library provides them, through several classes that represent related concepts: dates, times of day, combined date-times, and durations.
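
As a minimal sketch of how those classes relate, the example below uses `date`, `datetime`, and `timedelta` from the standard library. The variable names and the 9:30 time of day are illustrative assumptions, not values from the data set.

```python
from datetime import date, datetime, timedelta

# A calendar date with no time of day
first_friday = date(2011, 1, 7)

# A date combined with a time of day (the 9:30 is made up for illustration)
market_open = datetime(2011, 1, 7, 9, 30)

# A duration; adding it to a date gives another date
one_week = timedelta(weeks=1)
second_friday = first_friday + one_week

# Parsing text yields a datetime; .date() drops the time-of-day part
parsed = datetime.strptime('1/14/2011', '%m/%d/%Y')
gap = parsed.date() - first_friday

print(second_friday, parsed, gap.days)
```

Subtracting two dates gives a `timedelta`, which is why `gap.days` is available in the last line.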

In addition, NumPy provides a data type called “datetime64”, so named because “datetime” is already taken by the datetime module included in Python. See [Numpy Datetimes and Timedeltas](https://numpy.org/doc/stable/reference/arrays.datetime.html) for more details.

In this notebook we will focus on Python's native **datetime**.

### 2. Read just the first three columns

We will define a variable `cols_to_take` as 3 at first, then progressively increase it to process more and more of the data. Execute the next two cells.

In [None]:
import numpy as np

In [None]:
cols_to_take = 3
# Formats: a 4-byte integer, then two fixed-width byte strings (8 and 10 bytes)
dtype = {'names': tuple(colnames[:cols_to_take]),
         'formats': ('i4', 'S8', 'S10')}
dji_1 = np.loadtxt('dow_jones_index.data', dtype=dtype, comments='#', delimiter=',',
                   skiprows=1, usecols=list(range(cols_to_take)))
dji_1[:5]

### 2a. What do `b'AA'` or `b'1/7/2011'` mean?

`b'whatever'` looks like a string, but it is a *byte string* (`bytes`): a sequence of raw bytes rather than Unicode characters. They appear here because the two columns were declared with the `S` format, which stores fixed-width byte strings. We won't delve into much detail here except to show the following example, where `print(dt.strptime(begin_date, '%m/%d/%Y'))` fails because `strptime` expects a `str`. We have to explicitly decode the byte string into a character (Unicode) string using the `'utf-8'` encoding scheme. [More about character encoding](https://www.w3.org/International/questions/qa-what-is-encoding) in case you are curious!

In [None]:
from datetime import datetime as dt
begin_date = b"1/7/2011"
try:
    print(dt.strptime(begin_date, '%m/%d/%Y'))
except TypeError:
    # strptime accepts only str, so decode the bytes before parsing
    print('Failed type check')
    print(dt.strptime(str(begin_date, 'utf-8'), '%m/%d/%Y'))

In [None]:
dt.strptime('1/14/2011', '%m/%d/%Y')
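
To round out the picture of byte strings versus character strings, here is a minimal sketch of how the two types differ and how to move between them. It uses only built-in operations; the variable names are illustrative.

```python
raw = b'AA'                       # a byte string, as shown in the array output
text = raw.decode('utf-8')        # the corresponding character string

print(type(raw).__name__, type(text).__name__)

# Indexing bytes yields an integer byte value; indexing str yields a character
first_byte = raw[0]
first_char = text[0]

# Encoding the text with the same scheme gives back the original bytes
round_trip = text.encode('utf-8')
print(first_byte, first_char, round_trip == raw)
```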

### 3. Reload the first three columns, with column 2 as a datetime.

We will read the date field and store it as a [UNIX time](https://en.wikipedia.org/wiki/Unix_time): the number of seconds that have elapsed since the Unix epoch, 00:00:00 UTC on 1 January 1970 (an arbitrary date), not counting leap seconds.
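
The definition can be checked directly with a short sketch. It attaches `timezone.utc` to each datetime, which is an assumption made here so the numbers are the same on every machine. A datetime without a time zone is interpreted as local time by `.timestamp()`, so its value depends on the time zone of the computer running the notebook.

```python
from datetime import datetime, timezone

# The epoch itself maps to zero seconds
epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
print(epoch.timestamp())

# Midnight UTC on 14 January 2011, converted to seconds and back again
week_end = datetime(2011, 1, 14, tzinfo=timezone.utc)
seconds = week_end.timestamp()
restored = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(seconds, restored)
```

Converting back with the same time zone that produced the timestamp recovers the original date and time.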

Before reading the file, it's worth checking that the conversion works in both directions, so that we can recover the date from the stored value when needed.

The next cell creates a datetime object from the bytes `b"1/14/2011"`, converts it to a UNIX timestamp, converts that back to a datetime object, and displays it.

In [None]:
from datetime import datetime as dt

date_in = b"1/14/2011"

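def convert_date(date_value):
    # Accept str as well as bytes, since the caller may supply either
    if isinstance(date_value, bytes):
        date_value = str(date_value, 'utf-8')
    return dt.strptime(date_value, '%m/%d/%Y').timestamp()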

dt.fromtimestamp(convert_date(date_in))

We are now ready to read in the first three columns of the data. Notice that column 2 (the third column) is declared as a float.
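
One caveat about storing timestamps in a 4-byte float (`'f4'`): it keeps only about seven significant decimal digits, so a value near 1.29 billion seconds is rounded to the nearest multiple of 128. The sketch below imitates that rounding with the standard `struct` module. The helper `as_float32` is a hypothetical name introduced for illustration, and `exact` is an illustrative timestamp of the same magnitude as the dates in this data set.

```python
import struct

def as_float32(value):
    """Round a Python float to the nearest 4-byte float, as an 'f4' field would."""
    return struct.unpack('<f', struct.pack('<f', value))[0]

exact = 1294376400.0              # an illustrative timestamp in January 2011
stored = as_float32(exact)        # what survives in 4 bytes
error_seconds = stored - exact

print(stored, error_seconds)
```

The stored value is off by less than a minute here, which does not change the calendar date in this case. If exact seconds matter, an 8-byte float (`'f8'`) is one option, since it represents whole-second timestamps of this size exactly.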

In [None]:
dtype={'names': tuple(colnames[:cols_to_take]),
       'formats': ('i4', 'S8', 'f4')}

dji_2 = np.loadtxt('dow_jones_index.data', dtype=dtype, ... fill in ...)
dji_2[:5]

#### 3.1 Validate date values in column 2.

Take one of the values from column 2 of the output above and convert it back to a date, to confirm that the stored number represents the date you expect.

In [None]:
dt.fromtimestamp(1.2943764e+09)

### 4. Convert the stock price values

The stock prices are prefixed with a `$` sign.
Write a converter that strips it from the string and stores the price **as an integer number of cents**.
Rewrite the code above to read the prices in addition to the first three columns.
You will now have 7 columns in your output.

Why do we convert prices from dollars to pennies? Binary floating point numbers cannot represent most decimal fractions exactly, so they don't always accurately represent dollar values. You may have seen output showing a price as 39.000000001 or 40.999999999. To ensure that we don't run into this problem, we convert the dollars to pennies and store the result as an integer.
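
The pitfall is easy to reproduce, and it also affects the conversion itself: multiplying a float by 100 and truncating can lose a cent. The sketch below shows that failure and one possible way around it using `decimal.Decimal` from the standard library. The function name `dollars_to_cents` and the sample prices are illustrative assumptions, and the function assumes prices have at most two decimal places.

```python
from decimal import Decimal

# 4.35 has no exact binary representation, so the product falls just below 435
truncated = int(4.35 * 100)

def dollars_to_cents(price):
    """Convert text such as '$4.35' (str or bytes) to an integer number of cents."""
    if isinstance(price, bytes):
        price = price.decode('utf-8')
    # Decimal parses the digits exactly, so no binary rounding takes place
    return int(Decimal(price.strip().lstrip('$')) * 100)

print(truncated, dollars_to_cents('$4.35'), dollars_to_cents(b'$15.82'))
```

Rounding the float product with `round()` before converting to `int` is another common approach; the `Decimal` version simply avoids floats altogether.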

### 5. Validate volume data

Three attributes are related: `'volume'` (v<sub>this week</sub>), `'previous_weeks_volume'` (v<sub>last week</sub>), and `'percent_change_volume_over_last_wk'` (&Delta;v expressed as a fraction).

* For some records, the previous week's figures are not available because those records fall at the beginning of a period. Skip those comparisons.
* For all other records, show the records where
    v<sub>this week</sub> &ne; ( v<sub>last week</sub> &times; ( 1 + &Delta;v)). 
They should be identical, right?

Hint: use a technique similar to the exercise in the [original notebook](03-02-multi-dimensional-data.ipynb) to examine all rows whose elements are more than 1 standard deviation from the mean for the respective columns.
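
Because the change is a rounded floating point value, an exact `!=` comparison may flag rows that differ only by rounding. The sketch below shows one way to do the check with a tolerance, using `math.isclose` on a few made-up rows. The rows, the helper `is_consistent`, and the tolerance are all illustrative assumptions, and &Delta;v is taken to be a fraction as described above. For arrays, NumPy's `np.isclose` plays the same role.

```python
import math

# Made-up rows: (volume, previous_weeks_volume, fractional change); None = missing
rows = [
    (1000, None, None),    # start of a period: nothing to compare against
    (1100, 1000, 0.10),
    (1210, 1100, 0.10),
    (1500, 1210, 0.20),    # inconsistent: 1210 * 1.20 is 1452, not 1500
]

def is_consistent(v_now, v_prev, delta, rel_tol=1e-6):
    """True when v_now matches v_prev * (1 + delta) within a relative tolerance."""
    return math.isclose(v_now, v_prev * (1 + delta), rel_tol=rel_tol)

# Skip rows with missing figures, then keep only those that fail the check
mismatches = [row for row in rows
              if row[1] is not None and row[2] is not None
              and not is_consistent(*row)]

print(mismatches)
```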

# When you're done, submit the notebook

1. **Run all the cells in order.**

2. Submit the notebook by saving it as PDF. 
    * In the cluster environment, it's File | Print (Save as PDF) and submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>, 
    * On other versions, it may be File | Download As (PDF) and then submit to [Gradescope](https://www.gradescope.com/courses/182658)<sup>&dagger;</sup>.

<sup>&dagger;</sup>To submit to Gradescope, log into the website, add course 9W7PW3 (if not already added) and submit. The assignment name should match the name of this notebook.