# Dataquest.io NumPy Learning Module


## What is NumPy?

NumPy is a library that combines the flexibility and ease-of-use of Python with the speed of C. In this mission, we'll start by getting familiar with the core NumPy data structure and then build up to using NumPy to work with the dataset <code>world_alcohol.csv</code>, which contains data on how much alcohol is consumed per capita in each country.

## Getting started

In [62]:
import numpy as np

### The core data structure of NumPy is an <code>ndarray</code> object
<code>ndarray</code> stands for *N-dimensional array*.

*N-dimensional* refers to the number of indices needed to select individual values from the object.

#### E.G.

<img src="https://s3.amazonaws.com/dq-content/6/numpy_ndimensional.svg">


A 1-dimensional array is often referred to as a <code>vector</code>, while a 2-dimensional array is often referred to as a <code>matrix</code>. Both terms are borrowed from a branch of mathematics called linear algebra.
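As a minimal sketch of "number of indices needed", the snippet below builds a small vector and matrix and selects one value from each. The `ndim` attribute is not used elsewhere in this module; it is shown here only to report the number of dimensions.

```python
import numpy as np

vector = np.array([2, 3, 4])
matrix = np.array([[2, 3, 4], [4, 5, 6], [7, 8, 9]])

# A vector needs one index to reach a single value.
second_value = vector[1]

# A matrix needs two indices: row first, then column.
row1_col2 = matrix[1, 2]

print(vector.ndim, second_value)  # 1 3
print(matrix.ndim, row1_col2)     # 2 6
```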

### Constructing <code>ndarray</code> objects

1-dimensional/vector:

In [63]:
vector = np.array([2,3,4])
vector

array([2, 3, 4])

2-dimensional/matrix:

In [64]:
matrix  = np.array([[2,3,4],[4,5,6],[7,8,9]])
matrix

array([[2, 3, 4],
       [4, 5, 6],
       [7, 8, 9]])

Arrays have a certain number of elements. The array below has 5 elements:

In [65]:
five_el = np.array([2,3,4,5,6])
five_el

array([2, 3, 4, 5, 6])

Matrices are instead described by their *rows* and *columns*.

The matrix below has 3 rows and 5 columns, often referred to as a 3 by 5 matrix. In standard Python, the analogue is a list containing 3 nested lists with 5 elements each.

**E.G.**

In [66]:
from IPython.display import HTML, display

data = [[1,2,3,5,6],
        [4,5,6,5,6],
        [7,8,9,5,6],
        ]

display(HTML(
    '<table><tr>{}</tr></table>'.format(
        '</tr><tr>'.join(
            '<td>{}</td>'.format('</td><td>'.join(str(_) for _ in row)) for row in data)
        )
))



1,2,3,5,6
4,5,6,5,6
7,8,9,5,6


#### Output the shape of the <code>ndarray</code> object

It's often useful to know how many elements an array contains. The <code>ndarray.shape</code> property reports how many elements the array has along each dimension.

For vectors, the <code>shape</code> property contains a <code>tuple</code> with <code>1</code> element. A <code>tuple</code> is a kind of list where the elements can't be changed.

For matrices, the <code>shape</code> property contains a <code>tuple</code> with <code>2</code> elements.

In [67]:
print(vector.shape)
print(matrix.shape)

(3,)
(3, 3)
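The sketch below applies `shape` to the 3 by 5 matrix shown earlier. The name `wide_matrix` is hypothetical, and the `size` attribute (total element count) is an addition not used elsewhere in this module.

```python
import numpy as np

vector = np.array([2, 3, 4])
# Hypothetical name for the 3 by 5 example matrix from above.
wide_matrix = np.array([[1, 2, 3, 5, 6],
                        [4, 5, 6, 5, 6],
                        [7, 8, 9, 5, 6]])

print(vector.shape)       # (3,)
print(wide_matrix.shape)  # (3, 5)

# The shape tuple unpacks into rows and columns for a matrix.
n_rows, n_cols = wide_matrix.shape

# size is the total number of elements, here rows * columns.
print(wide_matrix.size)   # 15
```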


### Reading in Datasets

We can read in datasets using the <code>numpy.genfromtxt()</code> function. Our dataset, <code>world_alcohol.csv</code>, is a comma-separated values (CSV) file, so we specify the comma as the separator using the <code>delimiter</code> parameter:

In [68]:
world_alcohol = np.genfromtxt("/Users/rksquared/projects/pyeicon/five_thirty_eight_tstdata/world_alcohol.csv", delimiter=",")
print(type(world_alcohol))

<class 'numpy.ndarray'>


In [69]:
print(world_alcohol)

[[             nan              nan              nan              nan
               nan]
 [             nan   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00]
 [             nan   8.90000000e+01   1.32000000e+02   5.40000000e+01
    4.90000000e+00]
 [             nan   2.50000000e+01   0.00000000e+00   1.40000000e+01
    7.00000000e-01]
 [             nan   2.45000000e+02   1.38000000e+02   3.12000000e+02
    1.24000000e+01]
 [             nan   2.17000000e+02   5.70000000e+01   4.50000000e+01
    5.90000000e+00]
 [             nan   1.02000000e+02   1.28000000e+02   4.50000000e+01
    4.90000000e+00]
 [             nan   1.93000000e+02   2.50000000e+01   2.21000000e+02
    8.30000000e+00]
 [             nan   2.10000000e+01   1.79000000e+02   1.10000000e+01
    3.80000000e+00]
 [             nan   2.61000000e+02   7.20000000e+01   2.12000000e+02
    1.04000000e+01]
 [             nan   2.79000000e+02   7.50000000e+01   1.91000000e+02
    9.70000000e+00]
 [        

For reference, the column labels are:

* `Year` -- the year the data in the row is for.
* `WHO Region` -- the region in which the country is located.
* `Country` -- the country the data is for.
* `Beverage Types` -- the type of beverage the data is for.
* `Display Value` -- the number of liters, on average, of the beverage type a citizen of the country drank in the given year.

### The Limitations of NumPy

**Each value in a NumPy array has to have the same data type.** NumPy data types are similar to Python data types, but have slight differences. Here are some of the common ones:

* `bool`: Boolean.
    Can be `True` or `False`.
    
* `int`: Integer values.
    Can be `int16`, `int32`, or `int64`. The suffix 16, 32, or 64 indicates the number of bits.
    
* `float`: Floating point values.
    Can be `float16`, `float32`, or `float64`. The suffix 16, 32, or 64 indicates the number of bits used to store the value; more bits allow greater precision.
    
* `string`: String values.
    Can be `string` or `unicode`, which are two different ways a computer can store text.
    
NumPy will automatically figure out an appropriate data type when reading in data or converting lists to arrays. You can check the data type of a NumPy array using the `dtype` property.
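To illustrate the single-data-type rule, here is a small sketch with made-up lists (the variable names are hypothetical). When a list mixes types, NumPy picks one type that can hold every value.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])   # an int next to a float
mixed_text = np.array([1, "two", 3.0])  # numbers next to a string

# The exact integer width (int32 or int64) depends on the platform.
print(ints.dtype.kind)        # i
print(mixed_numbers.dtype)    # float64
# Everything was converted to a unicode string.
print(mixed_text.dtype.kind)  # U
```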

In [70]:
world_alcohol_dtype = world_alcohol.dtype
world_alcohol_dtype

dtype('float64')

Here's how NumPy represents the first few rows of the dataset:

```
array([[             nan,              nan,              nan,              nan,              nan],
       [  1.98600000e+03,              nan,              nan,              nan,   0.00000000e+00],
       [  1.98600000e+03,              nan,              nan,              nan,   5.00000000e-01]])
```

There are a few concepts we haven't been introduced to yet that we'll dive into:

* Many items in `world_alcohol` are `nan`, including the entire first row. `nan`, which stands for "not a number", is a special floating point value used to represent missing values. Some of the numbers are written like `1.98600000e+03`.
* The data type of `world_alcohol` is `float`. Because all of the values in a NumPy array have to have the same data type, NumPy attempted to convert all of the columns to floats when they were read in. The `numpy.genfromtxt()` function will attempt to guess the correct data type of the array it creates.

In this case, the `WHO Region`, `Country`, and `Beverage Types` columns are actually strings, and couldn't be converted to floats. When NumPy can't convert a value to a numeric data type like float or integer, it uses a special `nan` value that stands for "not a number". When reading into a float array, `numpy.genfromtxt()` also fills in `nan` where a value is missing from the file altogether. In both cases, `nan` marks missing data. We'll dive more into how to deal with missing data in later missions.

The whole first row of `world_alcohol.csv` is a header row that contains the names of each column. This is not actually part of the data, and consists entirely of strings. Since the strings couldn't be converted to floats properly, NumPy uses `nan` values to represent them.
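A brief sketch of how `nan` behaves, using a made-up array. `np.isnan()` is not introduced in this module; it is used here only to locate the missing values.

```python
import numpy as np

# Hypothetical float array with one missing value.
values = np.array([1986.0, np.nan, 0.5])

print(values.dtype)  # float64, since nan is itself a float

# nan never compares equal to anything, including itself,
# so np.isnan() is the reliable way to find it.
print(np.nan == np.nan)  # False
missing = np.isnan(values)
print(missing.tolist())  # [False, True, False]
```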

If you haven't seen scientific notation before, you might not recognize numbers like `1.98600000e+03`. Scientific notation is a way to condense how very large or very precise numbers are displayed. We can represent `100` in scientific notation as `1e+02`. The `e+02` indicates that we should multiply what comes before it by 10 ^ 2 (10 to the power 2, or 10 squared). This results in 1 * 100, or 100. Thus, `1.98600000e+03` is actually 1.986 * 10 ^ 3, or 1986. 1000000000000000 can be written as `1e+15`.

In this case, `1.98600000e+03` is actually longer than `1986`, but NumPy displays numeric values in scientific notation by default to account for larger or more precise numbers.
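The arithmetic above can be checked in plain Python, which accepts the same notation. This is only a quick sketch; the variable names are hypothetical.

```python
# Python parses scientific notation directly.
year = float("1.98600000e+03")
print(year)  # 1986.0

# The examples from the text.
print(1e+02 == 100)               # True
print(1e+15 == 1000000000000000)  # True

# Going the other way: format 1986 with 8 digits after the decimal point.
as_scientific = f"{1986:.8e}"
print(as_scientific)  # 1.98600000e+03
```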

### Customizing the data input

When reading in the data using the `numpy.genfromtxt()` function, we can use parameters to customize how we want the data to be read in. While we're at it, we can also specify that we want to skip the `header` row of `world_alcohol.csv`.

To specify the data type for the entire NumPy array, we use the keyword argument `dtype` and set it to `"U75"`. This specifies that we want to read in each value as a `unicode` string of up to `75` characters. We'll dive more into `unicode` and *bytes* later on, but for now, it's enough to know that this will read in our data properly.

To skip the header when reading in the data, we use the `skip_header` parameter. The `skip_header` parameter accepts an integer value, specifying the number of lines from the top of the file we want NumPy to ignore.
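To try these parameters without the CSV file on disk, the sketch below feeds `numpy.genfromtxt()` a small in-memory file. The `csv_text` string is a hypothetical stand-in for the real file, built from the column labels and two rows that appear in this module.

```python
import io

import numpy as np

# Hypothetical stand-in for world_alcohol.csv: a header plus two rows.
csv_text = (
    "Year,WHO Region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

# genfromtxt accepts file-like objects as well as file paths.
sample = np.genfromtxt(
    io.StringIO(csv_text),
    dtype="U75",      # read every value as a unicode string
    skip_header=1,    # ignore the first line (the column names)
    delimiter=",",
)

print(sample.shape)  # (2, 5)
print(sample[1, 4])  # 0.5
```

Because the header line is skipped and every value is kept as a string, no `nan` values appear.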

In [71]:
world_alcohol = np.genfromtxt("/Users/rksquared/projects/pyeicon/five_thirty_eight_tstdata/world_alcohol.csv", dtype='U75', skip_header=1, delimiter=",")
world_alcohol

array([['Afghanistan', '0', '0', '0', '0.0'],
       ['Albania', '89', '132', '54', '4.9'],
       ['Algeria', '25', '0', '14', '0.7'],
       ['Andorra', '245', '138', '312', '12.4'],
       ['Angola', '217', '57', '45', '5.9'],
       ['Antigua & Barbuda', '102', '128', '45', '4.9'],
       ['Argentina', '193', '25', '221', '8.3'],
       ['Armenia', '21', '179', '11', '3.8'],
       ['Australia', '261', '72', '212', '10.4'],
       ['Austria', '279', '75', '191', '9.7'],
       ['Azerbaijan', '21', '46', '5', '1.3'],
       ['Bahamas', '122', '176', '51', '6.3'],
       ['Bahrain', '42', '63', '7', '2.0'],
       ['Bangladesh', '0', '0', '0', '0.0'],
       ['Barbados', '143', '173', '36', '6.3'],
       ['Belarus', '142', '373', '42', '14.4'],
       ['Belgium', '295', '84', '212', '10.5'],
       ['Belize', '263', '114', '8', '6.8'],
       ['Benin', '34', '4', '13', '1.1'],
       ['Bhutan', '23', '0', '0', '0.4'],
       ['Bolivia', '167', '41', '8', '3.8'],
       ['Bosnia-Herz

Now that the data is in the right format, let's learn how to explore it.

### Indexing an `ndarray` in NumPy

Recall that a matrix in NumPy has 2 dimensions: rows and columns. We can access individual values as shown below.

**E.G.** We'll assign the amount of alcohol Uruguayans drank in other beverages per capita in `1986` to `uruguay_other_1986`. This is the second row and fifth column. Then we'll assign the country in the third row to `third_country`. `Country` is the third column.

In [72]:
uruguay_other_1986 = world_alcohol[1,4]

third_country = world_alcohol[2,2]

print(uruguay_other_1986)
print(third_country)

4.9
0
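As a self-contained sketch of the same indexing, the snippet below uses a hypothetical three-row array named `sample`, copied from the first rows of the dataset as printed later in this module. Indices start at 0, so the second row is index `1` and the fifth column is index `4`.

```python
import numpy as np

# Hypothetical sample: the first three data rows of the dataset.
sample = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1985", "Africa", "Cte d'Ivoire", "Wine", "1.62"],
])

# Second row, fifth column.
uruguay_other_1986 = sample[1, 4]
# Third row, third column.
third_country = sample[2, 2]

print(uruguay_other_1986)  # 0.5
print(third_country)       # Cte d'Ivoire
```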


### Slicing across `ndarray`s

As with lists, slicing a vector selects from the first index up to but not including the second index. Matrix slicing is a bit more complex, and has four forms:

* When we want to select one entire dimension, and a single element from the other.
* When we want to select one entire dimension, and a slice of the other.
* When we want to select a slice of one dimension, and a single element from the other.
* When we want to slice both dimensions.
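Before turning to matrices, here is a minimal sketch of vector slicing on a made-up vector. The results are converted with `tolist()` only to make the printed output easy to read.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

# Indices 0, 1, and 2; index 3 is not included.
first_three = vector[0:3]
# Leaving out the end index runs to the end of the vector.
from_second = vector[1:]
# Leaving out the start index begins at the first element.
first_two = vector[:2]

print(first_three.tolist())  # [5, 10, 15]
print(from_second.tolist())  # [10, 15, 20]
print(first_two.tolist())    # [5, 10]
```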

We'll dive into the first form in the next cell. When we want to select one whole dimension, and an element from the other, we can do this:

**E.G**
We'll assign the whole third column from `world_alcohol` to the variable `countries`. Then we'll assign the whole fifth column from `world_alcohol` to the variable `alcohol_consumption`.

The expression `world_alcohol[:, 2]` selects all of the rows, but only the column with index `2` (the third column). The colon by itself `:` specifies that the entirety of a single dimension should be selected. Think of the colon as selecting from the first element in a dimension up to and including the last element.

In [73]:
countries = world_alcohol[:, 2]

alcohol_consumption = world_alcohol[:, 4]

print(countries)
print(alcohol_consumption)

['0' '132' '0' '138' '57' '128' '25' '179' '72' '75' '46' '176' '63' '0'
 '173' '373' '84' '114' '4' '0' '41' '173' '35' '145' '2' '252' '7' '0' '1'
 '56' '65' '1' '122' '2' '1' '124' '192' '76' '3' '1' '254' '87' '87' '137'
 '154' '170' '0' '3' '81' '44' '286' '147' '74' '4' '69' '0' '0' '194' '3'
 '35' '133' '151' '98' '0' '100' '117' '3' '112' '438' '69' '0' '31' '302'
 '326' '98' '215' '61' '114' '1' '0' '3' '118' '69' '42' '97' '202' '21'
 '246' '22' '34' '0' '97' '0' '216' '55' '29' '152' '0' '244' '133' '15'
 '11' '4' '0' '1' '100' '0' '0' '31' '68' '50' '0' '189' '114' '6' '18' '1'
 '3' '0' '6' '88' '79' '118' '2' '5' '200' '71' '16' '0' '63' '104' '39'
 '117' '160' '186' '215' '67' '42' '16' '226' '122' '326' '2' '205' '315'
 '221' '18' '0' '38' '5' '1' '131' '25' '3' '12' '293' '51' '11' '0' '76'
 '157' '104' '13' '178' '2' '60' '100' '35' '15' '258' '27' '1' '2' '21'
 '156' '3' '22' '71' '41' '9' '237' '135' '126' '6' '158' '35' '101' '18'
 '100' '2' '0' '19' '18']
['0.0' '4

When we want to select one whole dimension, and a slice of the other, we pair the colon with a slice such as `:2` or `:10`.

**E.G**
* Assign all the rows and the first 2 columns of `world_alcohol` to `first_two_columns`.
* Assign the first 10 rows and the first column of `world_alcohol` to `first_ten_years`.
* Assign the first 10 rows and all of the columns of `world_alcohol` to `first_ten_rows`.

In [74]:
first_two_columns = world_alcohol[:,:2]
print(first_two_columns)
first_ten_years = world_alcohol[:10,0]
print(first_ten_years)
first_ten_rows = world_alcohol[:10,:]
print(first_ten_rows)

[['Afghanistan' '0']
 ['Albania' '89']
 ['Algeria' '25']
 ['Andorra' '245']
 ['Angola' '217']
 ['Antigua & Barbuda' '102']
 ['Argentina' '193']
 ['Armenia' '21']
 ['Australia' '261']
 ['Austria' '279']
 ['Azerbaijan' '21']
 ['Bahamas' '122']
 ['Bahrain' '42']
 ['Bangladesh' '0']
 ['Barbados' '143']
 ['Belarus' '142']
 ['Belgium' '295']
 ['Belize' '263']
 ['Benin' '34']
 ['Bhutan' '23']
 ['Bolivia' '167']
 ['Bosnia-Herzegovina' '76']
 ['Botswana' '173']
 ['Brazil' '245']
 ['Brunei' '31']
 ['Bulgaria' '231']
 ['Burkina Faso' '25']
 ['Burundi' '88']
 ["Cote d'Ivoire" '37']
 ['Cabo Verde' '144']
 ['Cambodia' '57']
 ['Cameroon' '147']
 ['Canada' '240']
 ['Central African Republic' '17']
 ['Chad' '15']
 ['Chile' '130']
 ['China' '79']
 ['Colombia' '159']
 ['Comoros' '1']
 ['Congo' '76']
 ['Cook Islands' '0']
 ['Costa Rica' '149']
 ['Croatia' '230']
 ['Cuba' '93']
 ['Cyprus' '192']
 ['Czech Republic' '361']
 ['North Korea' '0']
 ['DR Congo' '32']
 ['Denmark' '224']
 ['Djibouti' '15']
 ['Domin

We can also slice along both dimensions simultaneously. 

**E.G.**
Assign the first `20` rows of the columns at index `1` and `2` of `world_alcohol` to `first_twenty_regions`.

In [75]:
first_twenty_regions = world_alcohol[:20, 1:3]
print(first_twenty_regions)

[['0' '0']
 ['89' '132']
 ['25' '0']
 ['245' '138']
 ['217' '57']
 ['102' '128']
 ['193' '25']
 ['21' '179']
 ['261' '72']
 ['279' '75']
 ['21' '46']
 ['122' '176']
 ['42' '63']
 ['0' '0']
 ['143' '173']
 ['142' '373']
 ['295' '84']
 ['263' '114']
 ['34' '4']
 ['23' '0']]
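The four slicing forms can be compared side by side on a small made-up matrix. This is only a sketch; the variable names are hypothetical.

```python
import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45],
])

# 1. One entire dimension, a single element from the other.
second_column = matrix[:, 1]
# 2. One entire dimension, a slice of the other.
first_two_columns = matrix[:, :2]
# 3. A slice of one dimension, a single element from the other.
first_two_rows_of_first_column = matrix[:2, 0]
# 4. A slice of both dimensions.
lower_left_block = matrix[1:3, 0:2]

print(second_column.tolist())                   # [10, 25, 40]
print(first_two_columns.tolist())               # [[5, 10], [20, 25], [35, 40]]
print(first_two_rows_of_first_column.tolist())  # [5, 20]
print(lower_left_block.tolist())                # [[20, 25], [35, 40]]
```

Selecting a single element along a dimension drops that dimension, so forms 1 and 3 return vectors, while forms 2 and 4 return matrices.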


## Computation with NumPy

So far, we worked with `world_alcohol.csv`, which records per capita alcohol consumption for each country. We'll use the same dataset in this mission, and eventually determine which country consumes the most alcohol per capita.

First, let's reload the file so that we start with a clean `world_alcohol` variable.

In [76]:
world_alcohol = np.genfromtxt("/Users/rksquared/projects/pyeicon/five_thirty_eight_tstdata/npcomp_world_alcohol.csv", dtype='U75', skip_header=1, delimiter=",")
print(world_alcohol)

[['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ['1985' 'Africa' "Cte d'Ivoire" 'Wine' '1.62']
 ..., 
 ['1986' 'Europe' 'Switzerland' 'Spirits' '2.54']
 ['1987' 'Western Pacific' 'Papua New Guinea' 'Other' '0']
 ['1986' 'Africa' 'Swaziland' 'Other' '5.15']]


Here's what the first few rows look like:
<img src="https://i.imgur.com/auWXna9.png">

Each row specifies how many liters of a type of alcohol the average citizen of a particular country drank in a given year. The first row, for example, shows how many liters of wine the typical Vietnamese citizen drank in `1986`.

Here's a description of each column in the dataset:

* `Year`  -- The year the data in the row is for
* `WHO Region` -- The region in which the country is located
* `Country` -- The country the data is for
* `Beverage Types` -- The type of beverage
* `Display Value` -- The average number of liters drunk per capita

One of the most powerful aspects of the NumPy module is the ability to make comparisons across an entire array. These comparisons result in Boolean values.

If you'll recall from an earlier mission, the double equals sign (`==`) compares two values. When used with a NumPy vector, it compares the second value to each element in the vector. Where the values are equal, the result is `True`; otherwise, it is `False`. NumPy stores the Boolean results in a new vector.

With this basic understanding, let's try it out. 

The variable `world_alcohol` already contains the data set we're working with.

* Extract the third column in `world_alcohol`, and compare it to the string `Canada`. Assign the result to `countries_canada`.
* Extract the first column in `world_alcohol`, and compare it to the string `1984`. Assign the result to `years_1984`.

In [77]:
countries_canada = world_alcohol[:, 2] == 'Canada'
print(countries_canada)

years_1984 = world_alcohol[:,0] == '1984'
print(years_1984)

[False False False ..., False False False]
[False False False ..., False False False]
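Because the real output above is abbreviated with `...`, a smaller sketch may make the result easier to see. The `countries` vector here is made up for illustration.

```python
import numpy as np

# Hypothetical column of country names.
countries = np.array(["Canada", "Algeria", "Canada", "Uruguay"])

# The string is compared to every element, one result per element.
is_canada = countries == "Canada"

print(is_canada.tolist())  # [True, False, True, False]
print(is_canada.dtype)     # bool
```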


We mentioned that comparisons are very powerful, but it may not have been obvious why in the last cell. Comparisons give us the power to select elements in arrays using Boolean vectors. This allows us to conditionally select certain elements in vectors, or certain rows in matrices.

**E.G.**
With a simple vector --

    vector = np.array([5, 10, 15, 20])
    equal_to_ten = (vector == 10)

    print(vector[equal_to_ten])

The code above:

* Creates `vector`.
* Compares `vector` to the value `10`, which generates a Boolean vector `[False, True, False, False]`. It assigns the result to `equal_to_ten`.
* Uses `equal_to_ten` to only select elements in vector where `equal_to_ten` is `True`. This results in the vector `[10]`.

We can use the same principle to select rows in matrices:

    matrix = np.array([
                [5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]
             ])
    second_column_25 = (matrix[:,1] == 25)
    print(matrix[second_column_25, :])

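The code above:

* Creates `matrix`.
* Compares the second column of `matrix` to the value `25`, which generates the Boolean vector `[False, True, False]`. It assigns the result to `second_column_25`.
* Uses `second_column_25` to select any rows in `matrix` where `second_column_25` is `True`.

We end up with this matrix: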

    [
    [20, 25, 30]
    ]

We selected a single row from `matrix`, which was returned in a new *matrix*.

#### Now let's try it with some real data:

* Compare the third column of `world_alcohol` to the string `Algeria`.
* Assign the result to `country_is_algeria`.
* Select only the rows in `world_alcohol` where `country_is_algeria` is `True`.
* Assign the result to `country_algeria`.

In [78]:
country_is_algeria = (world_alcohol[:, 2] == "Algeria")

country_algeria = world_alcohol[country_is_algeria, :]

print(country_is_algeria)
print(country_algeria)

[False False False ..., False False False]
[['1984' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1987' 'Africa' 'Algeria' 'Beer' '0.17']
 ['1987' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1986' 'Africa' 'Algeria' 'Wine' '0.1']
 ['1984' 'Africa' 'Algeria' 'Other' '0']
 ['1989' 'Africa' 'Algeria' 'Beer' '0.16']
 ['1989' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1989' 'Africa' 'Algeria' 'Wine' '0.23']
 ['1986' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1984' 'Africa' 'Algeria' 'Wine' '0.12']
 ['1985' 'Africa' 'Algeria' 'Beer' '0.19']
 ['1985' 'Africa' 'Algeria' 'Other' '0']
 ['1986' 'Africa' 'Algeria' 'Beer' '0.18']
 ['1985' 'Africa' 'Algeria' 'Wine' '0.11']
 ['1986' 'Africa' 'Algeria' 'Other' '0']
 ['1989' 'Africa' 'Algeria' 'Other' '0']
 ['1987' 'Africa' 'Algeria' 'Other' '0']
 ['1984' 'Africa' 'Algeria' 'Beer' '0.2']
 ['1985' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1987' 'Africa' 'Algeria' 'Wine' '0.1']]


So far, we have made comparisons based on a single condition. We can also perform comparisons with multiple conditions by specifying each one separately, then joining them with an ampersand (`&`). When constructing a comparison with multiple conditions, it's critical to put each one in parentheses, because Python evaluates `&` before `==`.

Here's an example of how we would do this with a vector:

    vector = np.array([5, 10, 15, 20])
    equal_to_ten_and_five = (vector == 10) & (vector == 5)
    
In the above statement, we have two conditions, `(vector == 10)` and `(vector == 5)`. We use the ampersand (`&`) to indicate that both conditions must be True for the final result to be True. The statement returns `[False, False, False, False]`, because none of the elements can be `10` and `5` at the same time. Here's a diagram of the comparison logic:

<img src="https://i.imgur.com/BOPXDTx.png" style="height:600px;">

We can also use the pipe symbol (`|`) to specify that either one condition `or` the other should be `True`:

    vector = np.array([5, 10, 15, 20])
    equal_to_ten_or_five = (vector == 10) | (vector == 5)

The code above will result in `[True, True, False, False]`.
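Both operators can be run together in a short, self-contained sketch. It also uses the `|` result to select elements, as we did with single conditions.

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

# & requires both conditions to be True for the same element.
equal_to_ten_and_five = (vector == 10) & (vector == 5)
# | requires at least one condition to be True.
equal_to_ten_or_five = (vector == 10) | (vector == 5)

print(equal_to_ten_and_five.tolist())  # [False, False, False, False]
print(equal_to_ten_or_five.tolist())   # [True, True, False, False]

# The combined Boolean vector selects elements like any other.
selected = vector[equal_to_ten_or_five]
print(selected.tolist())               # [5, 10]
```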

#### Now let's try it with some real data:

* Perform a comparison with multiple conditions, and join the conditions with `&`.
    * Compare the first column of `world_alcohol` to the string `1986`.
    * Compare the third column of `world_alcohol` to the string `Algeria`.
    * Enclose each condition in parentheses, and join the conditions with `&`.
    * Assign the result to `is_algeria_and_1986`.
* Use `is_algeria_and_1986` to select rows from `world_alcohol`.
* Assign the rows that `is_algeria_and_1986` selects to `rows_with_algeria_and_1986`.

In [80]:
is_algeria_and_1986 = (world_alcohol[:,0]=='1986') & (world_alcohol[:,2]=='Algeria')

rows_with_algeria_and_1986 = world_alcohol[is_algeria_and_1986, :]

print(rows_with_algeria_and_1986)

[['1986' 'Africa' 'Algeria' 'Wine' '0.1']
 ['1986' 'Africa' 'Algeria' 'Spirits' '0.01']
 ['1986' 'Africa' 'Algeria' 'Beer' '0.18']
 ['1986' 'Africa' 'Algeria' 'Other' '0']]
