# NumPy tutorial
- We will explore fundamental NumPy functions in Python.
- To make the tutorial more practical, we will use a real dataset, nyc_taxis.csv, for demonstration purposes.

### NumPy tutorial structure
- Compare the NumPy n-dimensional array (ndarray) with a basic list of lists to understand the benefits of using NumPy
- Study Boolean indexing with NumPy
- Study some basic NumPy built-in methods (functions) such as np.max(), np.min(), etc.

In [1]:
import numpy as np
import csv

In [2]:
data_ndarray = np.array([10,20,30])

In [3]:
type(data_ndarray)

numpy.ndarray

In [4]:
# import nyc_taxis.csv as a list of lists
dataset_loc = 'D:/Dataquest/Dataquest 2022 Learning/Datasets/'
# open the file in a with block so it is closed automatically after reading
with open(dataset_loc + 'nyc_taxis.csv', 'r') as f:
    taxi_list = list(csv.reader(f))


In [5]:
# remove the header row
taxi_list = taxi_list[1:]

# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)


In [6]:
# convert list of lists to 2-dimensional array using numpy.array()
taxi = np.array(converted_taxi_list)
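
To make the benefit over a list of lists concrete, here is a minimal sketch on a hypothetical three-row miniature dataset (the values and the two-column layout are assumptions for illustration, not the real file). With a list of lists, each row has to be visited explicitly; with an ndarray, whole columns can be selected and combined in one expression.

```python
import numpy as np

# Hypothetical miniature dataset: each row is [fare_amount, fees_amount]
rows = [[52.0, 0.8], [45.0, 1.3], [36.5, 1.3]]

# List of lists: every row must be visited explicitly
totals_list = [row[0] + row[1] for row in rows]

# ndarray: select whole columns and add them in a single vectorized expression
arr = np.array(rows)
totals_arr = arr[:, 0] + arr[:, 1]

print(totals_list)
print(totals_arr)
```

Both approaches give the same totals; the ndarray version avoids the Python-level loop, which is where most of NumPy's speed and readability advantage comes from.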


In [7]:
# let's practice selecting one row, multiple rows and single items from our taxi ndarray
print(f'the first row of ndarray {taxi[0]}')
print(f'from row 391 to row 500 of ndarray {taxi[391:501]}')
print(f'select a single element at row 21 and column 5 {taxi[21,5]}')

the first row of ndarray [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
 2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
 1.000e+00]
from row 391 to row 500 of ndarray [[2.016e+03 1.000e+00 2.000e+00 ... 0.000e+00 2.630e+01 2.000e+00]
 [2.016e+03 1.000e+00 2.000e+00 ... 3.000e+00 3.030e+01 1.000e+00]
 [2.016e+03 1.000e+00 2.000e+00 ... 6.670e+00 4.001e+01 1.000e+00]
 ...
 [2.016e+03 1.000e+00 4.000e+00 ... 0.000e+00 5.534e+01 2.000e+00]
 [2.016e+03 1.000e+00 4.000e+00 ... 3.090e+00 1.339e+01 1.000e+00]
 [2.016e+03 1.000e+00 4.000e+00 ... 4.000e+00 2.680e+01 1.000e+00]]
select a single element at row 21 and column 5 4.0
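
The same selection patterns are easier to inspect on a small array. The sketch below uses a hypothetical 4x3 array named demo (not part of the taxi data) to show one row, a range of rows, a single item, and a whole column.

```python
import numpy as np

# Hypothetical 4x3 array standing in for the much larger taxi ndarray
demo = np.arange(12).reshape(4, 3)

first_row = demo[0]        # one row, returned as a 1D array
middle_rows = demo[1:3]    # rows 1 and 2 (the stop index is excluded)
single_item = demo[2, 1]   # the value at row 2, column 1
one_column = demo[:, 1]    # column 1 from every row

print(first_row)
print(middle_rows)
print(single_item)
print(one_column)
```

Note that the stop index of a slice is excluded, which is why taxi[391:501] ends at row 500.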


In [8]:
# use vector addition to add fare_amount and fees_amount. Assign the result to fare_and_fees.
fare_amount = taxi[:,9]
fees_amount = taxi[:,10]
fare_and_fees = fare_amount + fees_amount
print(f'first 6 values of fare_amount {fare_amount[:6]}')
print(f'first 6 values of fees_amount {fees_amount[:6]}')
print(f'first 6 values of fare_and_fees {fare_and_fees[:6]}')

first 6 values of fare_amount [52.  45.  36.5 26.  17.5 52. ]
first 6 values of fees_amount [0.8 1.3 1.3 1.3 1.3 0.8]
first 6 values of fare_and_fees [52.8 46.3 37.8 27.3 18.8 52.8]


In [9]:
# use vector division to divide trip_distance_miles by trip_length_hours. Assign the result to trip_mph.
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour
trip_mph = trip_distance_miles / trip_length_hours

print(f'first 6 values of trip_distance_miles {trip_distance_miles[:6]}')
print(f'first 6 values of trip_length_hours {trip_length_hours[:6]}')
print(f'first 6 values of trip_mph {trip_mph[:6]}')

first 6 values of trip_distance_miles [21.   16.29 12.7   8.7   5.56 21.45]
first 6 values of trip_length_hours [0.56583333 0.42222222 0.40611111 0.33611111 0.21083333 0.55666667]
first 6 values of trip_mph [37.11340206 38.58157895 31.27222982 25.88429752 26.3715415  38.53293413]
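
As a self-contained check of the calculation, the sketch below repeats it on the distance and duration values of the first three trips, copied by hand from the rows printed in this tutorial. The array names mirror the ones used above; nothing here reads the CSV file.

```python
import numpy as np

# Distance (miles) and duration (seconds) of the first three trips,
# copied from the rows printed in this tutorial
trip_distance_miles = np.array([21.0, 16.29, 12.7])
trip_length_seconds = np.array([2037.0, 1520.0, 1462.0])

# Dividing an array by a scalar applies the division to every element
trip_length_hours = trip_length_seconds / 3600

# Dividing two arrays of the same shape pairs the elements up by position
trip_mph = trip_distance_miles / trip_length_hours

print(trip_mph)
```

The result should agree with the first three trip_mph values shown in the output above.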


### NumPy ndarrays have methods for many different calculations. Here are a few of the key methods:
- ndarray.min() to calculate the minimum value
- ndarray.max() to calculate the maximum value
- ndarray.mean() to calculate the mean or average value
- ndarray.sum() to calculate the sum of the values

In [10]:
print(f'maximum value of trip_mph is {trip_mph.max()}')
print(f'minimum value of trip_mph is {trip_mph.min()}')
print(f'average of trip_mph is {trip_mph.mean()}')

maximum value of trip_mph is 82800.0
minimum value of trip_mph is 0.0
average of trip_mph is 169.98315083655157


### If we use the ndarray.max() method on a 2D ndarray without any additional parameters, it returns a single value, just as it does for a 1D ndarray

![This is a image](https://raw.githubusercontent.com/tongNJ/Dataquest-Online-Courses-2022/main/Pictures/numpy-pic-1.PNG)

### But what if we want to find the maximum value of each row? We need to use the axis parameter and specify a value of 1 to indicate that we want to calculate the maximum value for each row.

![This is a image2](https://raw.githubusercontent.com/tongNJ/Dataquest-Online-Courses-2022/main/Pictures/numpy-pic-2.PNG)
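
The effect of the axis parameter can be sketched on a small hypothetical 2x3 array (the values are made up for illustration). Without axis, max() collapses the whole array to one value; axis=1 gives one maximum per row, and axis=0 gives one maximum per column.

```python
import numpy as np

# Hypothetical 2x3 array used only to illustrate the axis parameter
demo = np.array([[1, 9, 3],
                 [7, 2, 8]])

overall_max = demo.max()      # single value for the whole array: 9
row_max = demo.max(axis=1)    # one value per row: [9 8]
col_max = demo.max(axis=0)    # one value per column: [7 9 8]

print(overall_max)
print(row_max)
print(col_max)
```

The other methods listed earlier (min(), mean(), sum()) accept the same axis parameter.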

### Instead of using csv to import data, we can also use the numpy.genfromtxt() function to read files into NumPy ndarrays.

In [18]:
# skip_header expects the number of lines to skip, so pass 1 for the single header row
taxi = np.genfromtxt(dataset_loc + 'nyc_taxis.csv', delimiter=',', skip_header=1)
print(taxi[:3])
print(taxi.shape)
print(taxi.dtype)

[[2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 4.000e+00
  2.100e+01 2.037e+03 5.200e+01 8.000e-01 5.540e+00 1.165e+01 6.999e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 1.000e+00
  1.629e+01 1.520e+03 4.500e+01 1.300e+00 0.000e+00 8.000e+00 5.430e+01
  1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 5.000e+00 0.000e+00 2.000e+00 6.000e+00
  1.270e+01 1.462e+03 3.650e+01 1.300e+00 0.000e+00 0.000e+00 3.780e+01
  2.000e+00]]
(2013, 15)
float64
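
To see why the header row is skipped, here is a small sketch that reads an in-memory CSV instead of the real file. The column names and values in csv_text are hypothetical. Because numpy.genfromtxt() converts every field to float by default, header text that cannot be converted ends up as nan unless the row is skipped.

```python
import io
import numpy as np

# Hypothetical two-column CSV held in memory instead of a file on disk
csv_text = "pickup_year,fare_amount\n2016,52.0\n2016,45.0\n"

# Without skip_header, the header strings cannot be parsed as floats
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# skip_header=1 drops the first line before parsing
demo = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header)
print(demo)
print(demo.shape, demo.dtype)
```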


### Now, let's focus on arguably the most powerful method to index data: the Boolean array. A Boolean array, as the name suggests, is an array of Boolean values. We sometimes call Boolean arrays Boolean vectors or Boolean masks.

You may recall that the Boolean (or bool) type is a built-in Python type that can be one of two unique values:
- True
- False

Now, let's look at what happens when we perform a Boolean operation between an ndarray and a single value:
print(np.array([2,4,6,8]) < 5)
[ True  True False False]

The comparison is applied element by element: each value in the array is compared to five. If the value is less than five, True is returned; otherwise, False is returned.
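
A Boolean array like this becomes useful once it is used as an index. The following is a minimal sketch, reusing the same four values; the variable names values, mask, small and count are chosen here for illustration.

```python
import numpy as np

values = np.array([2, 4, 6, 8])

# Comparing an array with a scalar produces a Boolean array (a mask)
mask = values < 5

# Indexing with the mask keeps only the elements where the mask is True
small = values[mask]

# True counts as 1, so summing the mask counts the matching elements
count = mask.sum()

print(mask)
print(small)
print(count)
```

The mask must have the same length as the dimension it indexes, which holds automatically when it is built from the array itself.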


