## 1. NumPy

In the previous notebooks, we used nested lists in Python to represent datasets. Python lists offer a few advantages when representing data:

- lists can contain mixed types
- lists can shrink and grow dynamically

Using Python lists to represent and work with data also has a few key disadvantages:

- to support their flexibility, lists tend to consume lots of memory
- they struggle to work with medium and larger sized datasets

There are many ways to classify programming languages, but one distinction that matters for performance is between **low-level** and **high-level** languages. Python is a high-level programming language that allows us to quickly write, prototype, and test our logic. The C programming language, on the other hand, is a low-level programming language: code written in it runs very fast, but it takes much longer to write and test.

<span style="background-color: #F9EBEA; color:##C0392B">NumPy</span> is a library that combines the flexibility and ease-of-use of Python with the speed of C. In this mission, we'll start by getting familiar with the core NumPy data structure and then build up to using NumPy to work with the dataset <span style="background-color: #F9EBEA; color:##C0392B">world_alcohol.csv</span>, which contains data on how much alcohol is consumed per capita in each country.
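As a rough illustration of that combination, the sketch below performs the same calculation with a plain list and with a NumPy array. The variable names and sample values are made up for this example and are not taken from the dataset:

```python
import numpy as np

# Hypothetical sample values, used only for illustration
liters = [0.5, 1.62, 2.54, 5.15]

# Plain Python: loop over the list element by element
doubled_list = [value * 2 for value in liters]

# NumPy: the multiplication is applied to every element at once
doubled_array = np.array(liters) * 2

print(doubled_list)
print(doubled_array)
```

Both versions produce the same numbers; the NumPy version simply expresses the whole operation in one step instead of an explicit loop.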


### 1.1 Creating Arrays

The core data structure in NumPy is the <span style="background-color: #F9EBEA; color:##C0392B">ndarray</span> object, which stands for **N-dimensional array**. An **array** is a collection of values, similar to a list. **N-dimensional** refers to the number of indices needed to select individual values from the object.

<img width="500" alt="creating a repo" src="https://drive.google.com/uc?export=view&id=0BxhVm1REqwr0X0VuT3NoZGZ0UlU">

A **1-dimensional** array is often referred to as a **vector**, while a **2-dimensional** array is often referred to as a **matrix**. Both of these terms are borrowed from a branch of mathematics called linear algebra. They're also often used in data science literature, so we'll use these words throughout this course.

To use <span style="background-color: #F9EBEA; color:##C0392B">NumPy</span>, we first need to import it into our environment. NumPy is commonly imported using the alias <span style="background-color: #F9EBEA; color:##C0392B">np</span>:

>```python
import numpy as np
```

We can directly construct arrays from lists using the <span style="background-color: #F9EBEA; color:##C0392B">numpy.array()</span> function. To construct a vector, we pass in a single list (with no nesting). To construct a matrix, we pass in a list of lists, where each inner list becomes one row:

>```python
vector = np.array([10, 20, 30])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])
```

In [1]:
import numpy as np

vector = np.array([10, 20, 30])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

print(vector[0])
print(matrix[0])
print(matrix[0][1])

10
[ 5 10 15]
10
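The link between the number of dimensions and the number of indices can be checked with the `ndim` attribute. This is a small sketch that reuses the same sample values as the cell above:

```python
import numpy as np

vector = np.array([10, 20, 30])
matrix = np.array([[5, 10, 15], [20, 25, 30], [35, 40, 45]])

# ndim reports how many indices are needed to reach a single value
print(vector.ndim)   # 1
print(matrix.ndim)   # 2

# One index selects a value from a vector, two from a matrix
print(vector[2])     # 30
print(matrix[0, 1])  # 10, same value as matrix[0][1]
```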


### 1.2 Array shape

It's often useful to know how many elements an array contains. The [ndarray.shape](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.ndarray.shape.html) property returns a tuple with the number of elements along each dimension: for a vector, its length, and for a matrix, the number of rows followed by the number of columns.



In [2]:
vector = np.array([1, 2, 3, 4])
print(vector.shape)

matrix = np.array([[5, 10, 15], [20, 25, 30]])
print(matrix.shape)

(4,)
(2, 3)
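If the total number of elements is what we are after, the `size` attribute gives it directly. A short sketch using the same matrix as above:

```python
import numpy as np

matrix = np.array([[5, 10, 15], [20, 25, 30]])

# shape is a tuple, so it can be unpacked into rows and columns
rows, columns = matrix.shape
print(rows, columns)  # 2 3

# size is the total number of elements (rows * columns here)
print(matrix.size)    # 6

# len() only counts along the first dimension (the rows)
print(len(matrix))    # 2
```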


### 1.3 Using numpy

We can read in datasets using the [numpy.genfromtxt()](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html) function. Our dataset, <span style="background-color: #F9EBEA; color:##C0392B">world_alcohol.csv</span>, is a comma-separated values (CSV) file, so each value in a row is separated from the next by a comma. We tell NumPy which separator to use through the <span style="background-color: #F9EBEA; color:##C0392B">delimiter</span> parameter:

>```python
import numpy
data = numpy.genfromtxt("data.csv", delimiter=",")
```

**"world_alcohol.csv"**

Here's what each column represents:

-  <span style="background-color: #F9EBEA; color:##C0392B">Year</span> -- the year the data in the row is for.
-  <span style="background-color: #F9EBEA; color:##C0392B">WHO Region</span> -- the region in which the country is located.
-  <span style="background-color: #F9EBEA; color:##C0392B">Country</span> -- the country the data is for.
-  <span style="background-color: #F9EBEA; color:##C0392B">Beverage Types</span> -- the type of beverage the data is for.
-  <span style="background-color: #F9EBEA; color:##C0392B">Display Value</span> -- the number of liters, on average, of the beverage type a citizen of the country drank in the given year.



In [3]:
import numpy as np
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=',')
print(type(world_alcohol))

<class 'numpy.ndarray'>


Each value in a NumPy array has to have the same data type. NumPy data types are similar to Python data types, but have slight differences. You can find a full list of NumPy data types [here](http://docs.scipy.org/doc/numpy-1.10.1/user/basics.types.html). 

In [4]:
print(world_alcohol.dtype)

float64
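To see the single-data-type rule in action, the sketch below builds arrays from lists that mix types. The array names are made up for this example; NumPy picks one type that can hold every value:

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed_numbers = np.array([1, 2.5, 3])
mixed_text = np.array([1, "two", 3])

print(ints.dtype)             # an integer type; the exact width depends on the platform
print(mixed_numbers.dtype)    # float64: the integers were converted to floats
print(mixed_text.dtype.kind)  # U: every value was converted to a Unicode string
```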


### 1.4 Inspecting the data

Here's how NumPy represents the first few rows of the dataset:

>```python
array([[             nan,              nan,              nan,              nan,              nan],
       [  1.98600000e+03,              nan,              nan,              nan,   0.00000000e+00],
       [  1.98600000e+03,              nan,              nan,              nan,   5.00000000e-01]])
```

In [11]:
world_alcohol

array([[             nan,              nan,              nan,
                     nan,              nan],
       [  1.98600000e+03,              nan,              nan,
                     nan,   0.00000000e+00],
       [  1.98600000e+03,              nan,              nan,
                     nan,   5.00000000e-01],
       ..., 
       [  1.98600000e+03,              nan,              nan,
                     nan,   2.54000000e+00],
       [  1.98700000e+03,              nan,              nan,
                     nan,   0.00000000e+00],
       [  1.98600000e+03,              nan,              nan,
                     nan,   5.15000000e+00]])

There are a few concepts here that we haven't been introduced to yet:

- Many items in <span style="background-color: #F9EBEA; color:##C0392B">world_alcohol</span> are <span style="background-color: #F9EBEA; color:##C0392B">nan</span>, including the entire first row. <span style="background-color: #F9EBEA; color:##C0392B">nan</span>, which stands for **"not a number"**, is a special floating-point value used to represent missing values.
- Some of the numbers are written like <span style="background-color: #F9EBEA; color:##C0392B">1.98600000e+03</span>. This is scientific notation: it means 1.986 multiplied by 10 to the power of 3, which is 1986.

The data type of <span style="background-color: #F9EBEA; color:##C0392B">world_alcohol</span> is float. Because all of the values in a **NumPy array have to have the same data type**, NumPy attempted to convert all of the columns to floats when they were read in. Unless we say otherwise through its dtype parameter, the <span style="background-color: #F9EBEA; color:##C0392B">numpy.genfromtxt()</span> function tries to read every value as a float.

In this case, the **WHO Region**, **Country**, and **Beverage Types** columns are actually <span style="background-color: #F9EBEA; color:##C0392B">strings</span>, and couldn't be converted to <span style="background-color: #F9EBEA; color:##C0392B">floats</span>. When NumPy can't convert a value to a numeric data type like float, it uses a special <span style="background-color: #F9EBEA; color:##C0392B">nan</span> value that stands for **"not a number"**. In a float array, <span style="background-color: #F9EBEA; color:##C0392B">nan</span> also stands in for a value that is absent from the file altogether, so both cases show up as missing data. We'll dive more into how to deal with missing data in later classes.

The whole first row of <span style="background-color: #F9EBEA; color:##C0392B">world_alcohol.csv</span> is a header row that contains the names of each column. This is not actually part of the data, and consists entirely of strings. Since the strings couldn't be converted to floats properly, NumPy uses nan values to represent them.
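The behaviour described above can be reproduced without the real file. The sketch below is only an illustration: it feeds `numpy.genfromtxt()` a small in-memory stand-in (the `csv_text` string, built from the header and two rows shown in this notebook) instead of world_alcohol.csv:

```python
import io
import numpy as np

# In-memory stand-in for the first lines of the file (assumed layout)
csv_text = (
    "Year,WHO Region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

# genfromtxt accepts a file-like object as well as a filename
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(sample)
print(np.isnan(sample))          # True wherever a value could not be converted
print(1.98600000e+03 == 1986.0)  # True: scientific notation for the same number
```

The header row and the three text columns come back as nan, while the year and the display value survive as floats.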

### 1.5 Reading the data correctly

When reading in the data using the [numpy.genfromtxt()](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.genfromtxt.html) function, we can use parameters to customize how we want the data to be read in. While we're at it, we can also specify that we want to skip the header row of <span style="background-color: #F9EBEA; color:##C0392B">world_alcohol.csv</span>.

To specify the data type for the entire NumPy array, we use the keyword argument dtype and set it to <span style="background-color: #F9EBEA; color:##C0392B">"U75"</span>. This specifies that we want to read in each value as a Unicode string of up to 75 characters. We'll dive more into unicode and bytes later on, but for now, it's enough to know that this will read in our data properly.

To skip the header when reading in the data, we use the skip_header parameter. The <span style="background-color: #F9EBEA; color:##C0392B">skip_header</span> parameter accepts an integer value, specifying the number of lines from the top of the file we want NumPy to ignore.


In [12]:
world_alcohol = np.genfromtxt("world_alcohol.csv", delimiter=",", dtype="U75", skip_header=1)
print(world_alcohol)

[['1986' 'Western Pacific' 'Viet Nam' 'Wine' '0']
 ['1986' 'Americas' 'Uruguay' 'Other' '0.5']
 ['1985' 'Africa' "Cte d'Ivoire" 'Wine' '1.62']
 ..., 
 ['1986' 'Europe' 'Switzerland' 'Spirits' '2.54']
 ['1987' 'Western Pacific' 'Papua New Guinea' 'Other' '0']
 ['1986' 'Africa' 'Swaziland' 'Other' '5.15']]
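The effect of the two parameters can also be checked on a tiny in-memory sample. This is a sketch under the assumption that the file starts with the header and rows shown in this notebook; `csv_text` and `as_text` are names made up for the example:

```python
import io
import numpy as np

# In-memory stand-in for the first lines of the file (assumed layout)
csv_text = (
    "Year,WHO Region,Country,Beverage Types,Display Value\n"
    "1986,Western Pacific,Viet Nam,Wine,0\n"
    "1986,Americas,Uruguay,Other,0.5\n"
)

as_text = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    dtype="U75",     # every value becomes a string of up to 75 characters
    skip_header=1,   # ignore the first line (the column names)
)

print(as_text)
print(as_text.dtype.kind)  # U, meaning a Unicode string type
print(as_text.shape)       # (2, 5): the header row is gone
```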


In [13]:
#slicing
print(world_alcohol[1,:])
print(world_alcohol[1,0:2])

['1986' 'Americas' 'Uruguay' 'Other' '0.5']
['1986' 'Americas']
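Slicing works along both dimensions, so whole columns can be selected in the same way as rows. The sketch below uses three rows copied from the output above rather than the full file:

```python
import numpy as np

# Three rows taken from the printed output of world_alcohol
sample = np.array([
    ["1986", "Western Pacific", "Viet Nam", "Wine", "0"],
    ["1986", "Americas", "Uruguay", "Other", "0.5"],
    ["1986", "Europe", "Switzerland", "Spirits", "2.54"],
])

countries = sample[:, 2]      # every row, third column
first_two = sample[0:2, 0:2]  # first two rows, first two columns

print(countries)
print(first_two)
```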


### 1.6 Array Comparisons

One of the most powerful aspects of the NumPy module is the ability to make comparisons across an entire array. These comparisons result in Boolean values.




In [14]:
vector = np.array([5, 10, 15, 20])
vector == 10

array([False,  True, False, False], dtype=bool)

In [15]:
matrix = np.array([[5, 10, 15], 
                   [20, 25, 30],
                   [35, 40, 45]]
                 )
matrix == 25

array([[False, False, False],
       [False,  True, False],
       [False, False, False]], dtype=bool)
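Equality is only one of the available comparisons. As a brief sketch, the other comparison operators work element by element too, and the resulting Boolean array can be summarized:

```python
import numpy as np

vector = np.array([5, 10, 15, 20])

greater_than_ten = vector > 10
print(greater_than_ten)        # [False False  True  True]

# True counts as 1, so sum() gives the number of matches
print(greater_than_ten.sum())  # 2

# all() is True only if every comparison is True
print((vector != 10).all())    # False, because one element equals 10
```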

### 1.7 Selecting elements

We mentioned that comparisons are very powerful, but it may not have been obvious why in the previous section. Comparisons give us the power to select elements in arrays using Boolean vectors. This allows us to conditionally select certain elements in vectors, or certain rows in matrices.



In [16]:
vector = np.array([5, 10, 15, 20])
equal_to_ten = (vector == 10)

print(vector[equal_to_ten])

[10]
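The same idea selects rows in a matrix. In this sketch, we compare a single column against a value, which yields one Boolean per row, and then use that Boolean vector as the row index:

```python
import numpy as np

matrix = np.array([[5, 10, 15],
                   [20, 25, 30],
                   [35, 40, 45]])

# Compare the second column against 25 to get one Boolean per row
second_column_25 = (matrix[:, 1] == 25)
print(second_column_25)             # [False  True False]

# Use the Boolean vector to keep only the matching rows
print(matrix[second_column_25, :])  # [[20 25 30]]
```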


In [None]:
vector = np.array([5, 10, 15, 20])
# & means "and": no element can equal both 10 and 5, so every result is False
equal_to_ten_and_five = (vector == 10) & (vector == 5)

print(equal_to_ten_and_five)

In [17]:
vector = np.array([5, 10, 15, 20])
equal_to_ten_or_five = (vector == 10) | (vector == 5)
vector[equal_to_ten_or_five] = 50
print(vector)

[50 50 15 20]


### 1.8 Computing with NumPy

Once an array consists of numeric values, we can perform computations on it. NumPy has a few built-in methods that operate on arrays. You can view all of them in the documentation. For now, here are a few important ones:


-  <span style="background-color: #F9EBEA; color:##C0392B">sum()</span> -- Computes the sum of all the elements in a vector, or the sum along a dimension in a matrix
-  <span style="background-color: #F9EBEA; color:##C0392B">mean()</span> -- Computes the average of all the elements in a vector, or the average along a dimension in a matrix
-  <span style="background-color: #F9EBEA; color:##C0392B">max()</span> -- Identifies the maximum value among all the elements in a vector, or the maximum along a dimension in a matrix

Here's an example of how we'd use one of these methods on a vector:

In [18]:
vector = np.array([5, 10, 15, 20])
vector.sum()

50

With a matrix, we can specify an additional keyword argument, axis. The axis dictates which dimension we perform the operation on. <span style="background-color: #F9EBEA; color:##C0392B">1</span> means that we want to perform the operation on each <span style="background-color: #F9EBEA; color:##C0392B">row</span>, and <span style="background-color: #F9EBEA; color:##C0392B">0</span> means on each <span style="background-color: #F9EBEA; color:##C0392B">column</span>. If we leave axis out, the method uses every element of the matrix. The example below performs an operation across each row:




In [19]:
matrix = np.array([
                [5, 10, 15], 
                [20, 25, 30],
                [35, 40, 45]
             ])
matrix.sum(axis=1)

array([ 30,  75, 120])
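For comparison, here is a short sketch of the other cases on the same matrix: a column-wise sum with `axis=0`, a row-wise mean, and a maximum over every element:

```python
import numpy as np

matrix = np.array([
    [5, 10, 15],
    [20, 25, 30],
    [35, 40, 45]
])

column_sums = matrix.sum(axis=0)  # one sum per column
row_means = matrix.mean(axis=1)   # one average per row
overall_max = matrix.max()        # no axis: looks at every element

print(column_sums)  # [60 75 90]
print(row_means)    # [10. 25. 40.]
print(overall_max)  # 45
```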

### 1.9 NumPy Strengths And Weaknesses
You should now have a good foundation in NumPy, and in handling issues with your data. NumPy is much easier to work with than lists of lists, because:

- It's easy to perform computations on data.
- Data indexing and slicing is faster and easier.
- We can convert data types quickly.

Overall, NumPy makes working with data in Python much more efficient. It's widely used for this reason, especially for <span style="background-color: #F9EBEA; color:##C0392B">machine learning</span>.
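As an example of the quick type conversion mentioned in the list above, the `astype()` method returns a copy of an array with a different data type. This is a minimal sketch; the sample strings mimic the Display Value column and are not read from the file:

```python
import numpy as np

# Strings that look like numbers, as produced by reading with dtype="U75"
display_values = np.array(["0", "0.5", "1.62"])

# astype() converts every element and returns a new array
as_floats = display_values.astype(float)

print(as_floats.dtype)  # float64
print(as_floats.max())  # 1.62
```

After the conversion, numeric methods such as `sum()` and `mean()` can be used on the values.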

You may have noticed some limitations with NumPy as you worked through the past two missions, though. For example:

- All of the items in an array must have the **same data type**. For many datasets, this can make arrays cumbersome to work with.
- Columns and rows must be **referred to by number**, which gets confusing when you go back and forth from column name to column number.

In the next few missions, we'll learn about the Pandas library, one of the most popular data analysis libraries. **Pandas builds on NumPy, but does a better job addressing the limitations of NumPy**.



<br>
<div class="alert alert-info">
<b>Exercise Start.</b>
</div>

**Complete the script to produce the output shown**: 

1. Script
>```python
a = [2.22, 8.52, 2.47]
b = ['t', 'v', 'c']
c = [5, 2, 4]
print(___)
```
Output:
>```python
[[2.22, 8.52, 2.47], ['t', 'v', 'c'], [5, 2, 4]]
```



2. Script
>```python
import numpy as np
store = np.array([4, 5, 2, 2, 2, 7])
cost  = np.array([71, 82, 65, 87, 89, 71])
np_cols = np.column_stack((___, ___))
print(np_cols)
```
Output:
>```python
[[ 4 71]
 [ 5 82]
 [ 2 65]
 [ 2 87]
 [ 2 89]
 [ 7 71]]
```
3. Script
>```python
import numpy as np
x = np.array([[4, 6, 9],
              [4, 9, 6]])
for i in _____(x):
    print(i)
```
Output:
>```python
4
6
9
4
9
6
```
4. Script
>```python
stocks = {'Microsoft': 'MSFT', 'Facebook': 'FB'}
for x, y in stocks_______
    print('The ticker for ' + x + ' is ' + y)
```
Output:
>```python
The ticker for Microsoft is MSFT
The ticker for Facebook is FB
```
5. Script
>```python
import numpy as np
y = np.array([[1, 2, 3], 
              [14, 15, 16]])
_______(y)
```
Output:
>```python
array([[ 1, 14],
       [ 2, 15],
       [ 3, 16]])
```
6. Script
>```python
profits = [[21, 21, 30], [15, 30, 7], [24, 30, 7]]
for x in profits:
    print(____)
```
Output:
>```python
21
15
24
```
7. Script
>```python
import numpy as np
costs = [3, 17, 7, 8]
print(______(costs) <= 4)
```
Output:
>```python
[ True False False False]
```