##### <img src="../SDSS-Logo.png" style="display:inline; width:500px" />


## Learning Objectives
- Use and manipulate NumPy arrays
- Use the functions that come with the NumPy library



### Continuing our study of different data types in Python
- There are several data types in Python that can hold data collections.
- We've covered strings and lists, and are now moving on to NumPy arrays.
- NumPy arrays are the backbone of a lot of Data Science work in Python.

### NumPy arrays:
 * Build on the capabilities of strings and lists, but they
   * Are faster
   * Handle only homogeneous elements (all elements of the same type)
   * Have multiple ways to be initialized
   * Allow Boolean selection of elements (this is important, and highly useful)
   
### Some NumPy facts:
- NumPy is an external library and not a part of the standard Python library.
- Find out more about NumPy at [NumPy](https://www.numpy.org/) and [PLYMI-Module 3](https://www.pythonlikeyoumeanit.com/module_3.html)
 

## Motivation

Arrays are another FUNDAMENTAL data type.  You can think of an array as a spreadsheet, or a table, or more generally, a collection of things arranged in rows and columns. N-dimensional arrays generalize this concept.

We will often use arrays to hold, access, and manipulate collections of data that are related to interesting problems, or hold intermediate forms of problem solutions.  

Thus, we need to be able to manipulate arrays in a facile fashion, and that requires us to learn all of the details of how to slice and dice these 2D structures.

In [None]:
import comp116
import numpy as np


import pickle

import json
import pathlib 
import os
import sys

EMISS = []
# Each line of the file is one JSON record; the with-block closes the file when done
with open('Unit-5-2-EMISS.txt') as f:
    for line in f:
        EMISS.append(json.loads(line))
description_prefix = 'Total carbon dioxide emissions from all sectors, '

years = []
us_coal = []
us_petro = []
us_nat_gas = []
for d in EMISS:
    if description_prefix + 'coal, United States' in d['name']:
        for row in d['data']:
            years.append(int(row[0]))
            us_coal.append(row[1])
    if description_prefix + 'petroleum, United States' in d['name']:
        for row in d['data']:
            us_petro.append(row[1])
    if description_prefix + 'natural gas, United States' in d['name']:
        for row in d['data']:
            us_nat_gas.append(row[1])
            
us_coal = np.array(us_coal)
us_petro = np.array(us_petro)
us_nat_gas = np.array(us_nat_gas)
us_co2 = np.array([us_coal, us_nat_gas, us_petro])
us_co2 = np.transpose(us_co2)

assert len(years) == len(us_co2)


with open('Unit-5-2-Numpy.data.pickle', 'wb') as f:
    pickle.dump((years, us_coal, us_petro, us_nat_gas ), f)
del row, us_co2, us_coal, us_petro, us_nat_gas, years
#End

# Read in some data that we will use later
with open('Unit-5-2-Numpy.data.pickle', 'rb') as f:
    years, us_coal, us_petro, us_nat_gas = pickle.load(f)


### Since NumPy is an external library, or in Python parlance, a *module*, we need to import it before we can use it.
- The statement `import numpy as np` in the first code cell imports the NumPy library.
- Notice that we import numpy as np. We rename it because we don't like to type long names!
- This means that we can refer to NumPy functions using just `np` instead of having to type `numpy` every time.
- `import numpy as np` is a widely used convention.
- Generally speaking, modules contain valuable ready-made definitions and functions.
- You can use these to save time and also be reasonably confident that the function will work correctly.
- You can also import just the pieces you need, which is useful if you only need a few functions and want shorter names in your code, e.g., `from math import factorial`.
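As a minimal sketch of the import styles described above (the variable names are just for illustration), the following compares importing a whole module, importing a single name, and importing under an alias:

```python
import math                   # import the whole module
from math import factorial    # import a single name from a module
import numpy as np            # import a module under a shorter alias

whole_module = math.factorial(5)   # refer to the function through the module name
single_name = factorial(5)         # refer to the function directly
alias_sqrt = np.sqrt(16)           # refer to a NumPy function through the alias

print(whole_module, single_name, alias_sqrt)
```

All three styles reach the same underlying functions; they differ only in how much you have to type at the call site.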


### range and numpy arange functions
- You can generate a range of values for a Python list using the `range()` function.
- `range(start, stop, step)` generates integers starting from `start` and going up to (but not including) `stop`, with increments of `step` between numbers.
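A short sketch of how the three arguments of `range()` behave (the variable names here are illustrative only):

```python
# start=0, stop=10, step=2: stop itself is never included
evens = list(range(0, 10, 2))

# A negative step counts down; again, stop (0) is excluded
countdown = list(range(5, 0, -1))

# With two arguments, step defaults to 1
one_to_four = list(range(1, 5))

print(evens)        # [0, 2, 4, 6, 8]
print(countdown)    # [5, 4, 3, 2, 1]
print(one_to_four)  # [1, 2, 3, 4]
```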


<br />

In [None]:
# Use range to get a range of 0 to 9
iX = range(10)
# What type of an object is range?
print(iX)
print("iX is of type ", type(iX))
# We can use the list function to convert the range to a list
iL = list(iX)
print(iL)

### numpy arange()
- The NumPy `arange(start, stop, step)` function generates a NumPy array, starting at `start` and going up to (but not including) `stop` in steps of `step`.
- Check out the <a target="_blank" href='https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html'>`arange()`</a> function.   
- Python's built-in `len` function returns the number of elements in the array.
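As a small sketch, here is `np.arange` with explicit `start`, `stop`, and `step` values. Unlike `range()`, it also accepts a non-integer step. The variable names are illustrative:

```python
import numpy as np

# Odd numbers from 1 up to (but not including) 10
odds = np.arange(1, 10, 2)

# A fractional step produces an array of floats
halves = np.arange(0, 2, 0.5)

print(odds)         # [1 3 5 7 9]
print(halves)       # [0.  0.5 1.  1.5]
print(len(odds))    # 5
```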

<br />
<br />

In [None]:
# Initialize a NumPy array
import numpy as np
arr = np.arange(10)  # Creates an nd-array [ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 ]
print("arr=", arr)

# You can get the length of an nd-array
print("len(arr)=", len(arr))

# You can modify elements of an nd-array
arr[5] = 36
print("After arr[5] = 36, arr=", arr)

### Once the `import numpy as np` statement is run:

* All NumPy functions are called as `np.xxx()` in our code, where `xxx` is the specific NumPy function.

* Essentially, we import all of NumPy's functionality and then tell the Python interpreter which function we want by prefixing it with `np.`, for example `np.average()`.

* In this way, Python knows we want to use NumPy's `average()` function.



## NumPy array elements are all of the same type

Since all members of a NumPy array are required to be the same type,
you can be assured that if the first element of array `arr` is an integer then they all are.
This is different from Python lists, where each element can be of a different type.
NumPy arrays are **homogeneous** types.
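One way to see this homogeneity, sketched here with made-up values, is to inspect an array's `dtype` attribute. When an array is built from a list that mixes integers and floats, NumPy picks a single type that can hold every element:

```python
import numpy as np

mixed_list = [1, 2.5, 3]            # a list can mix ints and floats
from_mixed = np.array(mixed_list)   # the array stores every element as a float
ints_only = np.array([1, 2, 3])     # all ints, so the array has an integer dtype

print([type(x).__name__ for x in mixed_list])  # ['int', 'float', 'int']
print(from_mixed, from_mixed.dtype)            # [1.  2.5 3. ] float64
print(ints_only.dtype)                         # an integer type; the exact width depends on the platform
```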

Since `arr[0]` is an integer 0, what do you think 
happens when we assign a float value to one element?

## Update the middle element of a NumPy array

Set the middle element of `arr` to the value 36.5 and then print out `arr`.

What are the steps to do this:
1. Find out how many elements are in the array
1. Find the middle element of an array
1. Set the middle element of the array to 36.5


In [None]:

arr[ len(arr) // 2 ] = 36.5
   
print('Arr is', arr)

### Assigning a value that can't be converted

The above worked because 36.5 could be converted to an integer: NumPy dropped the fractional part and stored 36.
What happens if we assign a value that can't be converted to an integer?
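A minimal sketch of the conversion, using a fresh array named `demo` (a hypothetical name, so that `arr` is left untouched). It also shows one way to keep the fractional part, by converting the array to floats first with `astype`:

```python
import numpy as np

demo = np.arange(5)    # integer array [0 1 2 3 4]
demo[2] = 7.9          # the fractional part is dropped (not rounded), so 7 is stored

as_float = np.arange(5).astype(float)   # a float copy of the same values
as_float[2] = 7.9                       # now the value is stored unchanged

print(demo)      # [0 1 7 3 4]
print(as_float)  # [0.  1.  7.9 3.  4. ]
```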

Assign the character string `'ar'` to `arr[0]`.


In [None]:
# Assign the value 'ar' to arr[0]
try:
    arr[0] = 'ar'
except ValueError as err:
    # 'ar' cannot be converted to an integer, so NumPy raises a ValueError
    print('Exception occurred:', err)


## Create a NumPy array of zeros

* There's a NumPy function, `zeros`, that creates an array of zeros.
* Print out `zero_arr` to see what its value is.
* Print out the length of `zero_arr`.


In [None]:
# Create an array of zeros
zero_arr = np.zeros(10)

print('The length of zero_arr is', len(zero_arr))
print('The contents of zero_arr is', zero_arr)


### NumPy Zeros takes an optional parameter

* The optional `dtype` parameter sets the element type; with `dtype=bool`, the zeros become Boolean `False` values.

* Variable `five_falses` is a NumPy array of five Falses.

* You can also initialize a NumPy array using a Python list.
* In fact, there are so many ways to initialize an array, there's a whole web page on [NumPy array creation](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.creation.html).
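As a sketch of a few of those creation routes (the values and variable names are made up for illustration):

```python
import numpy as np

from_list = np.array([3, 1, 4])        # from a Python list
ones = np.ones(3)                      # three 1.0 values
sevens = np.full(4, 7)                 # four copies of the value 7
evenly_spaced = np.linspace(0, 1, 5)   # five evenly spaced values from 0 to 1, both ends included

print(from_list)      # [3 1 4]
print(ones)           # [1. 1. 1.]
print(sevens)         # [7 7 7 7]
print(evenly_spaced)  # [0.   0.25 0.5  0.75 1.  ]
```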


In [None]:
five_falses = np.zeros(5, dtype=bool)

print('five_falses is', five_falses)


another_five_falses = np.array([False] * 5)

print('Element-by-element, five_falses equals another_five_falses:', 
      five_falses == another_five_falses)


## What did == do?

For a NumPy array, ==
tested if **each** element was equal.
This will come in handy later.

Some NumPy functions that are widely used:

 - <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.any.html?highlight=any#numpy.any" target="_blank">np.any()</a> will tell you if *any* element of a NumPy array **evaluates** to true.
 - <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.all.html?highlight=all#numpy.all" target="_blank">np.all()</a> will tell you if *all* elements of a NumPy array **evaluate** to true.
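A small sketch with two made-up arrays, `left` and `right`, shows how `==`, `np.any`, and `np.all` work together, and how this differs from comparing two Python lists:

```python
import numpy as np

left = np.array([1, 2, 3])
right = np.array([1, 0, 3])

matches = left == right        # element-by-element comparison
print(matches)                 # [ True False  True]
print(np.any(matches))         # True: at least one pair matches
print(np.all(matches))         # False: not every pair matches

# For plain lists, == compares the lists as a whole and gives a single bool
lists_equal = [1, 2, 3] == [1, 0, 3]
print(lists_equal)             # False
```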
   
   
Set variable `any_five_equal` to the value of whether any element in `five_falses` is equal to the corresponding element in `another_five_falses`.

Set variable `all_five_equal` to the value of whether all elements in `five_falses` are equal to the corresponding elements in `another_five_falses`.


In [None]:
print(five_falses, another_five_falses)
any_five_equal = np.any(five_falses == another_five_falses)
print('any_five_equal is', any_five_equal)

all_five_equal = np.all(five_falses == another_five_falses)

print('all_five_equal is', all_five_equal)


## Let's read in some data

The original data is taken from the <a href="https://www.eia.gov/" target="_blank">U.S. Energy Information Administration</a> which has the mission, authorized by U.S. Congress, to _"Collect, analyze, and disseminate independent and impartial energy information to promote sound policymaking, efficient markets, and public understanding of energy and its interaction with the economy and the environment."_

If you are interested in this data, you can download the actual data from their <a href="https://www.eia.gov/opendata/bulkfiles.php" target="_blank">latest bulk download site</a>.

Essentially, `us_coal` holds the number of millions of metric tons of $CO_2$ emitted each year by the United States from coal.
 

In [None]:
with open('Unit-5-2-Numpy.data.pickle', 'rb') as f:
    years, us_coal, us_petro, us_nat_gas = pickle.load(f)
print('United States CO2 emissions from coal starting in the year', years[-1])

# Use a function created by Majikes at UNC to pretty-print this data.
comp116.array_to_html(us_coal, row_names=years, col_names=['Coal CO2'],
                      title='United States CO2 emissions from coal in millions of metric tons')


## Did the United States ever produce more than two billion metric tons of $CO_2$ from coal?

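Looking at the data, you can see whether there ever was a year in which the U.S. produced more than two billion metric tons of $CO_2$ from coal. Since the data is in millions of metric tons, two billion metric tons corresponds to a value of 2000.
But how would you do it with NumPy?


Remember how `np.any` takes an array of Booleans, such as the result of a comparison?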
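Here is a minimal sketch of that idea on made-up numbers (`demo_emissions` and `demo_years` are hypothetical, not the EIA data). The comparison produces a Boolean array, which can be passed to `np.any` or used to select the matching elements:

```python
import numpy as np

# Hypothetical values in millions of metric tons, newest year first
demo_emissions = np.array([1800, 2100, 1950, 2200])
demo_years = np.array([2003, 2002, 2001, 2000])

mask = demo_emissions > 2000      # one Boolean per element
print(mask)                       # [False  True False  True]
print(np.any(mask))               # True

# Boolean selection: keep only the elements where the mask is True
print(demo_emissions[mask])       # [2100 2200]
print(demo_years[mask])           # [2002 2000]
```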

In [None]:
# First let's just see what the comparison creates

us_coal > 2000


In [None]:
# Let's use np.any to see if ANY values are True
over_two_billion_tons = np.any( us_coal > 2000)

print('Since 1980 the United States has, during some years, produced over two billion metric tons of CO2 from coal is a', 
      over_two_billion_tons, 'statement')

## Has the U.S. always produced at least 2 billion metric tons of $CO_2$ from coal?

Set the variable `always_over_two_billion_tons` to the Boolean of whether the U.S. has always (since 1980) produced 
at least two billion metric tons of $CO_2$ from coal.

Use the NumPy `all` function to test whether all elements of the comparison are True.

In [None]:

always_over_two_billion_tons = np.all(us_coal >= 2000)

print('The U.S., since 1980, has always produced at least 2 billion metric tons of CO2 from coal is a',
     always_over_two_billion_tons, 'statement')

### Various NumPy functions

NumPy provides many functions that summarize the contents of an array.
Below is a list of some that are quite useful.
 * [np.sum()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.sum.html) takes the sum of the items in the array.
 * [np.mean()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.mean.html) takes the mean of the items in the array.   Note that there is also an [np.average()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.average.html), which can additionally compute a weighted average.
 * [np.std()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.std.html) takes the standard deviation of the items in the array.
 * [np.max()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.amax.html) finds the maximum value element in the array.
 * [np.min()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.amin.html) finds the minimum value element in the array. 
 * [np.prod()](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.prod.html) computes the product of all the elements in the array.
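To make the list concrete, here is a sketch that applies each function to a small made-up array named `values`:

```python
import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # illustrative data only

total = np.sum(values)       # 40
mean = np.mean(values)       # 5.0
spread = np.std(values)      # 2.0 (population standard deviation)
largest = np.max(values)     # 9
smallest = np.min(values)    # 2
product = np.prod(values)    # 201600

# np.average matches np.mean unless weights are given
weighted = np.average(np.array([1.0, 3.0]), weights=[3, 1])   # (3*1 + 1*3) / 4 = 1.5

print(total, mean, spread, largest, smallest, product, weighted)
```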

 

## What is the minimum US emission of $CO_2$ from coal?

Set variable `min_coal` to the minimum of United States emission of $CO_2$ from coal since 1980.

In [None]:

min_coal = np.min(us_coal)

print('Since 1980 the United States minimum annual production of CO2 from coal is,' ,
     min_coal)

print('The offset of the minimum within the "us_coal" array is', np.argmin(us_coal))  #see below!

## Which year was the minimum?

Notice that we have a sequence `years` that matches `us_coal` in size:
`len(years) == len(us_coal)` is True.
So if the first year (the year with offset 0) is 2016, what year was the minimum?

 * [np.argmax](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.argmax.html) returns the offset of the maximum value
 * [np.argmin](https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.argmin.html) returns the offset of the minimum value
 
Set variable `min_offset` to the offset of the minimum United States emission of $CO_2$ from coal since 1980.  
Set the variable `min_year` to the year that had the minimum emission.  

1. Find the offset of the minimum amount of `us_coal` and assign it to `min_offset`.
2. Find the element in `years` that is at the same offset.
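The two steps above can be sketched on made-up data (`demo_years` and `demo_values` are hypothetical stand-ins for `years` and `us_coal`):

```python
import numpy as np

demo_years = [2003, 2002, 2001, 2000]              # newest first, like the real data
demo_values = np.array([1900, 1750, 2100, 2050])   # illustrative values only

low_offset = np.argmin(demo_values)    # offset of the smallest value
high_offset = np.argmax(demo_values)   # offset of the largest value

# The same offset indexes the matching entry in the parallel list
print(low_offset, demo_years[low_offset])     # 1 2002
print(high_offset, demo_years[high_offset])   # 2 2001
```

Indexing `demo_years` with the offset works regardless of how the years are ordered or spaced, because the two sequences are lined up position by position.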


In [None]:
print('The number of elements in variable years matches the number elements in us_coal is a',
      len(years) == len(us_coal), 'statement.')

min_offset = np.argmin(us_coal)
print(min_offset)

# In this data min_offset happens to be 0, so years[0] - min_offset gives the same answer as years[min_offset].
# That shortcut only works when the years are consecutive and in descending order.
# What is the RIGHT way to do this?
min_year = years[0] - min_offset
#print(min_year)
#min_year = years[min_offset]
#print(min_year)


print('Since 1980 the United States minimum annual production of CO2 from coal was,' ,
     min_coal, 'in', min_year)

### You can use min_offset to get to the year from the `years` array.


In [None]:
min_year = years[min_offset]
print('The minimum year,', min_year, ', the US emitted', min_coal, 'million metric tons of CO2 from coal')

print('You can also reference the year directly using years[min_offset] =', years[min_offset])

## Counting items

What if you wanted to know the number of years in which the U.S. emitted at least two billion metric tons of $CO_2$ from coal?

NumPy has a function for that.
[np.count_nonzero](https://docs.scipy.org/doc/numpy/reference/generated/numpy.count_nonzero.html?highlight=count#numpy.count_nonzero) will return the number of values that are non-zero.

To use this, you have to remember that False counts as the integer 0 and True counts as the integer 1.

Print out `us_coal >= 2000`.
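A brief sketch on made-up values (`demo_values` is hypothetical) shows the counting in action. Because Booleans behave like 0 and 1, `np.sum` on the Boolean array gives the same count:

```python
import numpy as np

demo_values = np.array([1900, 2000, 2100, 2050])   # illustrative values only

at_least_2000 = demo_values >= 2000
print(at_least_2000)                       # [False  True  True  True]

count = np.count_nonzero(at_least_2000)    # counts the True entries
count_by_sum = np.sum(at_least_2000)       # True adds 1, False adds 0

print(count, count_by_sum)                 # 3 3
```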

In [None]:
print(us_coal >= 2000)

### Count up the emissions that are 2 billion metric tons or greater

Set variable `num_2_billion` to the number of years that were equal to or greater than two billion

In [None]:

num_2_billion = np.count_nonzero(us_coal >= 2000)

print('The United States emitted at least two billion metric tons of CO2 from coal for', num_2_billion,
     'years.')

## What is the maximum US emission of $CO_2$ from coal?

Set variable `max_coal` to the max of United States emission of $CO_2$ from coal since 1980.

In [None]:

max_coal = np.max(us_coal)

print('Since 1980 the maximum annual United States production of CO2 from coal was',
     max_coal, 'million metric tons.')

## What is the average US emission of $CO_2$ from coal?

Set variable `average_coal` to the average of United States emission of $CO_2$ from coal since 1980.

In [None]:

average_coal = np.average(us_coal)

print('Since 1980 the United States produced, on average,' ,
     average_coal, 'million metric tons annually.')