## About This Notebook 
In this **Working with NumPy** notebook, we will learn:
- How to use `numpy.genfromtxt()` to read in an ndarray.
- What a boolean array is, and how to create one.
- How to use boolean indexing to filter values in one and two-dimensional ndarrays.
- How to assign one or more new values to an ndarray based on their locations.
- How to assign one or more new values to an ndarray based on their values.
***
## 1. Reading CSV files with NumPy

In this chapter, we will learn a technique called **boolean indexing**. Before we dig deeper into this topic, let's first learn how to read files into NumPy ndarrays with `np.genfromtxt()`. Below is the simplified syntax of the function, as well as an explanation of its two parameters:

````python
np.genfromtxt(filename, delimiter=None)
````

- ``filename``: A positional argument, usually a string representing the path to the text file to be read.
- ``delimiter``: A named argument, specifying the string used to separate each value.

In our case, the data is stored in a CSV file, therefore the delimiter is a comma ",".
So this is how we can read in a file named ``data.csv``:


````python
data = np.genfromtxt('<path_to_>/data.csv', delimiter = ',')
````
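
To try the call without a file on disk, the sketch below feeds `np.genfromtxt()` an in-memory text buffer. The `io.StringIO` buffer and the sample values are assumptions made for illustration; they stand in for a real CSV file.

```python
import io
import numpy as np

# Hypothetical CSV content held in memory; stands in for a file on disk.
csv_text = "1,2,3\n4,5,6\n"

# genfromtxt accepts a file-like object as well as a path.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(data)        # 2D float array with 2 rows and 3 columns
print(data.shape)  # (2, 3)
```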

### Task 3.2.1:
Now let's try to read our nyc_taxis.csv file into NumPy.

1. Import the NumPy library and assign to the alias ``np``.
2. Use the `np.genfromtxt()` function to read the nyc_taxis.csv file into NumPy. Assign the result to `taxi`. Don't forget to pass the `delimiter` argument, as shown above.
3. Use the ``ndarray.shape`` attribute to assign the shape of taxi to ``taxi_shape``.

In [1]:
# Start your code here:

import numpy as np
taxi = np.genfromtxt('../../../Data/csv/nyc_taxis.csv', delimiter = ',')
taxi_shape = taxi.shape

print(taxi_shape)

(89561, 15)


## 2. Reading CSV files with NumPy Continued

In the previous exercise, we used the `numpy.genfromtxt()` function to read the ``nyc_taxis.csv`` file into NumPy.

To refresh your memory, in the previous mission we did something like this instead:

````python
import csv
import numpy as np

# import nyc_taxis.csv as a list of lists
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)

taxi = np.array(converted_taxi_list)
````

Have you noticed that we converted all the values to floats before we converted the list of lists to an ndarray?
> The reason is that a NumPy ndarray can contain only **one datatype**.

This conversion step was not needed in the previous exercise because `numpy.genfromtxt()` handles it for us: by default, it converts every value it reads to a float, and any value it cannot convert (such as the text in a header row) becomes `nan`.

To see which datatype an ndarray holds, use the `ndarray.dtype` attribute like this:
````python
print(taxi.dtype)
````
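
A small sketch of why the header row matters, assuming a made-up two-column CSV held in memory (the column names and values are illustrative only, not the real taxi data). With its default settings, `genfromtxt()` produces floats, and text it cannot convert shows up as `nan`.

```python
import io
import numpy as np

# Hypothetical CSV with a header row; not the real taxi data.
csv_text = "pickup_month,fare\n1,10.5\n2,7.0\n"

with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")
print(with_header.dtype)   # float64
print(with_header[0])      # header text could not be converted: [nan nan]

# skip_header takes the number of lines to skip at the top of the file.
no_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
print(no_header.shape)     # (2, 2)
```
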
### Task 3.2.2:
1. Use the `numpy.genfromtxt()` function to again read the nyc_taxis.csv file into NumPy, but this time, skip the first row. Assign the result to `taxi`.
2. Assign the shape of `taxi` to `taxi_shape`.

In [2]:
# Start your code here:

import numpy as np
# skip_header expects the number of lines to skip, so pass 1 for the header row
taxi = np.genfromtxt('../../../Data/csv/nyc_taxis.csv', delimiter=',', skip_header=1)
taxi_shape = taxi.shape

## 3. Boolean Arrays

In this section, we're going to focus on the boolean array.

Recall that the boolean (or bool) type is a built-in Python type that can have one of two unique values:

- True
- False

We've also used boolean values when working with Python comparison operators like:
- ``==`` equal
- ``>`` greater than
- ``<`` less than
- ``!=`` not equal

Here are a couple of simple boolean operations to refresh your memory:

In [3]:
print(type(3.5) == float)

True


In [4]:
print(5 > 6)

False


In the previous notebook, where we explored vector operations, we learned that the result of an operation between an ndarray and a single value is a new ndarray:

In [5]:
import numpy as np
print(np.array([2,4,6,8]) + 10)

#The + 10 operation is applied to each value in the array

[12 14 16 18]


Guess what happens when we perform a **boolean operation** between an ndarray and a single value:

In [6]:
import numpy as np
print(np.array([2,4,6,8]) < 5)

[ True  True False False]
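
The same idea holds for the other comparison operators listed earlier. The sketch below uses a made-up array named `values` (an illustrative name, not part of the taxi data) and shows that each comparison is applied to every item and returns an ndarray of booleans.

```python
import numpy as np

values = np.array([2, 4, 6, 8])

print(values == 4)   # [False  True False False]
print(values != 4)   # [ True False  True  True]
print(values > 4)    # [False False  True  True]

# The result is itself an ndarray, and its dtype is bool.
print((values > 4).dtype)
```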


## 4. Boolean Indexing with 1D ndarrays

In the last exercise, we learned how to create boolean arrays using vectorized boolean operations. Now let's look at **boolean indexing**: using a boolean array to index (select) values from another array.
Here is an example from the previous notebook:

In [7]:
c = np.array([80.0, 103.4, 96.9, 200.3])
c_bool = c > 100
print(c_bool)

[False  True False  True]


How do we index using our new boolean array? All we need to do is to use the square brackets like this:

In [8]:
result = c[c_bool]
print(result)

[103.4 200.3]


The boolean array acts as a filter: the values that correspond to **True** become part of the result, and the values that correspond to **False** are left out.

How can we apply boolean indexing to our data set?
For example, to confirm the number of taxi rides from the month of January, we can do this:

In [9]:
# First, select just the pickup_month column (second column in the ndarray with column index 1)
pickup_month = taxi[:,1]

# use a boolean operation to make a boolean array, where the value 1 corresponds to January
january_bool = pickup_month == 1

# use the new boolean array to select only the items from pickup_month that have a value of 1
january = pickup_month[january_bool]

# use the .shape attribute to find out how many items are in our january ndarray
january_rides = january.shape[0]
print(january_rides)

13481
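
As an alternative way to count matches, the sketch below works on made-up month values rather than the taxi data. Since `True` counts as 1 and `False` as 0, summing a boolean array gives the number of `True` entries, and `np.count_nonzero()` gives the same answer. The names `months`, `by_shape`, `by_sum`, and `by_count` are illustrative.

```python
import numpy as np

# Hypothetical pickup months; not the real taxi column.
months = np.array([1, 2, 1, 3, 1, 2])

january_bool = months == 1

# Three equivalent ways to count the January entries.
by_shape = months[january_bool].shape[0]
by_sum = january_bool.sum()
by_count = np.count_nonzero(january_bool)

print(by_shape, by_sum, by_count)  # 3 3 3
```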


### Task 3.2.4:

1. Calculate the number of rides in the taxi ndarray that are from **February**:
    - Create a boolean array, ``february_bool``, that evaluates whether the items in ``pickup_month`` are equal to ``2``.
    - Use the ``february_bool`` boolean array to index ``pickup_month``. Assign the result to ``february``.
    - Use the ``ndarray.shape`` attribute to find the number of items in `february`. Assign the result to ``february_rides``.

In [10]:
# Start your code below:

pickup_month = taxi[:,1]
february_bool = pickup_month == 2
february = pickup_month[february_bool]
february_rides = february.shape[0]

## 5. Boolean Indexing with 2D ndarrays

Now it is time to use boolean indexing with ``2D ndarrays``. 
> One thing to keep in mind is that the boolean array must have the same length as the dimension you're indexing. This is one of the constraints when we work with 2D ndarrays.

In [11]:
arr = np.array([
                [1,2,3],
                [4,5,6],
                [7,8,9],
                [10,11,12]
])

print(arr)

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


In [12]:
bool_1 = [True, False, 
        True, True]
print(arr[bool_1])

[[ 1  2  3]
 [ 7  8  9]
 [10 11 12]]


In [13]:
print(arr[:, bool_1])

IndexError: boolean index did not match indexed array along dimension 1; dimension is 3 but corresponding boolean dimension is 4

`bool_1` has 4 items, but `arr`'s second axis has length 3, so `bool_1` can't be used to index the columns and NumPy raises an error.

In [15]:
bool_2 = [False, True, True]
print(arr[:,bool_2])

[[ 2  3]
 [ 5  6]
 [ 8  9]
 [11 12]]


`bool_2`'s shape (3) is the same as the shape of `arr`'s second axis (3), so this selects the 2nd and 3rd columns.
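
In practice, the boolean array is usually computed from the data rather than typed by hand. The sketch below rebuilds the same 4 x 3 array and derives one mask per axis; the names `row_mask` and `col_mask` are illustrative. It shows how each mask's length lines up with the axis it indexes.

```python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [10, 11, 12]])

# Row mask: one entry per row (length 4), built from the first column.
row_mask = arr[:, 0] > 3
print(row_mask.shape)      # (4,)
print(arr[row_mask])       # the rows starting with 4, 7 and 10

# Column mask: one entry per column (length 3), built from the first row.
col_mask = arr[0] > 1
print(col_mask.shape)      # (3,)
print(arr[:, col_mask])    # the 2nd and 3rd columns
```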

Now let's apply what we have learned to our data set. This time we will analyze the average speed of trips. Recall that we calculated the average travel speed as follows:
````python
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)
````

Next, how do we check for trips with an average speed greater than 20,000 mph?

In [16]:
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

In [17]:
# create a boolean array for trips with average
# speeds greater than 20,000 mph
trip_mph_bool = trip_mph > 20000

# use the boolean array to select the rows for
# those trips, and the pickup_location_code,
# dropoff_location_code, trip_distance, and
# trip_length columns
trips_over_20000_mph = taxi[trip_mph_bool,5:9]

print(trips_over_20000_mph)

[[ 2.   2.  23.   1. ]
 [ 2.   2.  19.6  1. ]
 [ 2.   2.  16.7  2. ]
 [ 3.   3.  17.8  2. ]
 [ 2.   2.  17.2  2. ]
 [ 3.   3.  16.9  3. ]
 [ 2.   2.  27.1  4. ]]
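
The same steps can be tried on a tiny made-up array. The sketch below assumes a hypothetical two-column layout (distance in miles, duration in seconds), which is not the real column layout of the taxi data; the names `trips`, `mph`, and `too_fast` are illustrative.

```python
import numpy as np

# Hypothetical trips: column 0 is distance in miles, column 1 is duration in seconds.
trips = np.array([[ 2.0, 600.0],
                  [23.0,   1.0],
                  [ 5.5, 900.0]])

# Dividing the seconds by 3600 gives hours, so miles / hours is miles per hour.
mph = trips[:, 0] / (trips[:, 1] / 3600)

# Boolean array marking implausibly fast trips.
too_fast = mph > 20000

print(mph)
print(trips[too_fast])  # only the 23-mile, 1-second trip remains
```

Rows like these usually point to data entry errors rather than real trips, which is why filtering on a derived value is a handy sanity check.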


### Task 3.2.5 (HARD):
1. Create a boolean array, ``tip_bool``, that determines which rows have values for the `tip_amount` column of more than 50.<br>
Hint: You might have to examine the original nyc_taxis.csv file to find the index of the desired column.
2. Use the ``tip_bool`` array to select all rows from `taxi` with tip amounts of more than 50, and the columns from indexes `5` to `13` inclusive. Assign the resulting array to ``top_tips``.

In [18]:
#Start your code below

tip_amount = taxi[:,12]
tip_bool = tip_amount > 50
top_tips = taxi[tip_bool, 5: 14]

## 6. Assigning Values in ndarrays (OPTIONAL)

Now that we know how to retrieve data from ndarrays, we will use the same indexing techniques to modify values within an ndarray. The syntax looks like this:

````python
ndarray[location_of_values] = new_value
````

With a 1D array, all we need to do is specify one index location like this:

In [19]:
a = np.array(['red','blue','black','blue','purple'])
a[0] = 'orange'
print(a)

['orange' 'blue' 'black' 'blue' 'purple']


Or multiple values can be assigned at once:

In [20]:
a[3:] = 'pink'
print(a)

['orange' 'blue' 'black' 'pink' 'pink']
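
One thing to watch for when assigning strings: because an ndarray holds a single datatype, a string array has a fixed maximum length per item, set when the array is created. The sketch below is an illustration with a made-up replacement value (`'turquoise'`); it shows that a longer string is silently cut to fit.

```python
import numpy as np

a = np.array(['red', 'blue', 'black', 'blue', 'purple'])

# Fixed-width string dtype: each item holds at most 6 characters,
# the length of the longest original item ('purple').
print(a.dtype)

a[0] = 'turquoise'    # 9 characters, longer than the array allows
print(a[0])           # turquo
```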


With a 2D ndarray, just like with a 1D ndarray, we can assign one specific index location:

In [21]:
ones = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 1]])
ones[1,2] = 99
print(ones)

[[ 1  1  1  1  1]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]


Or we can assign a whole row:

In [22]:
ones[0] = 42
print(ones)

[[42 42 42 42 42]
 [ 1  1 99  1  1]
 [ 1  1  1  1  1]]


Or a whole column:

In [23]:
ones[:,2] = 0
print(ones)

[[42 42  0 42 42]
 [ 1  1  0  1  1]
 [ 1  1  0  1  1]]


## 7. Assignment Using Boolean Arrays (OPTIONAL)

Boolean arrays become extremely powerful when used for assignment, like this:

In [24]:
a2 = np.array([1, 2, 3, 4, 5])

a2_bool = a2 > 2

a2[a2_bool] = 99

print(a2)

[ 1  2 99 99 99]


The boolean array controls which values the assignment applies to; the other values remain unchanged.

We can also skip the intermediate variable and write the comparison directly inside the square brackets:

In [25]:
a = np.array([1, 2, 3, 4, 5])

a[a > 2] = 99

print(a)

[ 1  2 99 99 99]
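
Keep in mind that this kind of assignment changes the array in place. The sketch below, using made-up values and the illustrative names `readings` and `cleaned`, shows one way to keep the original data by working on a copy.

```python
import numpy as np

readings = np.array([3.5, -1.0, 7.2, -4.4, 0.0])  # hypothetical values

cleaned = readings.copy()      # work on a copy so the original stays intact
cleaned[cleaned < 0] = 0       # only the negative entries are overwritten

print(readings)   # unchanged
print(cleaned)    # negatives replaced by 0
```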


## 8. Assignment Using Boolean Arrays Continued (OPTIONAL)

Now let's take a look at an example of assignment using a boolean array with two dimensions:

In [26]:
b = np.array([
                [1,2,3],
                [4,5,6],
                [7,8,9]           
])

b[b > 4] = 99
print(b)

# The b > 4 boolean operation produces a 2D boolean array 
# which then controls the values that the assignment applies to.

[[ 1  2  3]
 [ 4 99 99]
 [99 99 99]]


We can also use a 1D boolean array to perform assignment on a 2D array:

In [27]:
c = np.array([
                [1,2,3],
                [4,5,6],
                [7,8,9]           
])

c[c[:,1] > 2, 1] = 99

print(c)

[[ 1  2  3]
 [ 4 99  6]
 [ 7 99  9]]


The code above selects the second column (column index 1), builds a boolean array from the comparison `c[:,1] > 2`, and assigns 99 only to the matching rows of that column. All other values remain unchanged.

The pattern can be written as pseudocode. First, with an intermediate variable:

````python
# named bool_array rather than bool, which would shadow Python's built-in type
bool_array = array[:, column_for_comparison] == value_for_comparison
array[bool_array, column_for_assignment] = new_value
````

And now all in one line:

````python
array[array[:, column_for_comparison] == value_for_comparison, column_for_assignment] = new_value
````
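
A concrete, runnable instance of the one-line pattern, using a made-up array: the comparison column and the assignment column do not have to be the same. The column meanings here are hypothetical.

```python
import numpy as np

# Hypothetical data: column 0 is a category code, column 1 is a value to fill in.
array = np.array([[1, 0],
                  [2, 0],
                  [1, 0],
                  [3, 0]])

# Wherever column 0 equals 1, set column 1 to 9.
array[array[:, 0] == 1, 1] = 9

print(array)
```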