# Data Analysis using NumPy and Pandas

NumPy, short for "Numerical Python," is a fundamental library for scientific computing in Python. It's a favorite among programmers because it makes complex numerical tasks simple.

This is what we will do in this section:

- Speeding up our code with vectorized operations
- Selecting data from NumPy ndarrays
- Analyzing data using NumPy methods

## Loading data from CSV and converting it to a NumPy array: some basic properties

The data is stored in data/nyc_taxis.csv.

To get you started, we've used Python's csv module to load the nyc_taxis.csv file and converted it to a list of lists containing float values. The results have been saved to converted_taxi_list.

Add a single line of code using the numpy.array() constructor to convert the converted_taxi_list variable to a NumPy ndarray, and assign the result to the variable name taxi.

In [4]:
# List the directory tree to see where the files are stored
!tree 

.
├── data
│   ├── nyc_taxis.csv
│   └── nyc_taxis.csv:Zone.Identifier
├── data-analysis.ipynb
├── notebook_101.ipynb
└── README.md

1 directory, 5 files


In [6]:
# Importing the required libraries 
import csv
import numpy as np


In [17]:
# Load the data in a list without headers
with open("data/nyc_taxis.csv", "r") as csv_file:
    csv_reader = csv.reader(csv_file)
    csv_list = list(csv_reader)
    csv_list = csv_list[1:]         # This is a list without headers

# Convert all values to float

converted_taxi_list = []
for row in csv_list:
    converted_row = []
    for element in row:
        converted_row.append(float(element))
    converted_taxi_list.append(converted_row)

# Convert the list into an array called taxi
taxi = np.array(converted_taxi_list)
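
As an aside, NumPy can also parse a CSV file directly. The sketch below is an alternative to the csv-module loop above, not the approach this notebook uses. It calls np.genfromtxt() on a made-up three-column sample held in memory (the column names and values are hypothetical), so it runs without the data file.

```python
import io
import numpy as np

# Hypothetical three-column sample standing in for data/nyc_taxis.csv.
sample_csv = io.StringIO(
    "pickup_year,trip_distance,fare_amount\n"
    "2016,21.0,52.0\n"
    "2016,16.29,45.0\n"
)

# skip_header=1 drops the header row; the remaining fields are parsed as floats.
sample = np.genfromtxt(sample_csv, delimiter=",", skip_header=1)

print(sample.shape)  # (2, 3)
print(sample.dtype)  # float64
```

With the real file, the first argument would presumably be the path "data/nyc_taxis.csv" instead of the in-memory sample.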

In [18]:
# Printing an array
print(taxi)

[[2.016e+03 1.000e+00 1.000e+00 ... 1.165e+01 6.999e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 8.000e+00 5.430e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 0.000e+00 3.780e+01 2.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 5.000e+00 6.334e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 8.950e+00 4.475e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 0.000e+00 5.484e+01 2.000e+00]]


In [19]:
print(taxi.shape)

(2013, 15)


In [20]:
taxi.ndim

2
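
To see how shape and ndim relate, here is a minimal sketch on a small made-up list of lists (small_list is a hypothetical stand-in for converted_taxi_list):

```python
import numpy as np

# Hypothetical 3x2 list of lists standing in for converted_taxi_list.
small_list = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
small = np.array(small_list)

print(small.shape)  # (3, 2): 3 rows, 2 columns
print(small.ndim)   # 2: one axis for rows, one for columns
print(small.size)   # 6: total number of elements
print(small.dtype)  # float64
```

The length of the shape tuple always equals ndim, which is why taxi, with shape (2013, 15), reports 2.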

In [21]:
# Slicing the array
row_0 = taxi[0]                     # Select the first row
rows_391_to_500 = taxi[391:501]     # Select rows 391 to 500 (inclusive); the stop index is excluded
row_21_column_5 = taxi[21, 5]       # Select one specific element

In [22]:
# slicing for columns 

# Select every row for the columns at indices 1, 4, and 7
columns_1_4_7 = taxi[:, [1,4,7]]

# Select the columns at indices 5 to 8 inclusive for the row at index 99
row_99_columns_5_to_8 = taxi[99,5:9]

# Select the rows at indices 100 to 200 inclusive for the column at index 14
rows_100_to_200_column_14 = taxi[100:201, 14]
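
The stop index of a slice is excluded, which is why the cells above use 501, 9, and 201 to reach indices 500, 8, and 200. A small sketch on a made-up array (grid is hypothetical, with each value encoding its own position) makes this easier to check by eye:

```python
import numpy as np

# Hypothetical 4x5 array: each value is row * 10 + column.
grid = np.array([[r * 10 + c for c in range(5)] for r in range(4)])

print(grid[1:3])        # rows 1 and 2; the stop index 3 is excluded
print(grid[:, [0, 4]])  # every row, columns 0 and 4
print(grid[2, 1:4])     # row 2, columns 1 to 3 inclusive: [21 22 23]
print(grid[3, 4])       # a single element: 34
```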

## Vector operations 

NumPy ndarrays not only make selecting data much easier, they also allow us to perform vectorized operations more efficiently. Vectorized operations apply to multiple data points at once, making them faster than traditional loops.
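
As a rough illustration of the difference, the sketch below adds two small arrays with a Python loop and then with a single vectorized expression. The values are made up for the example.

```python
import numpy as np

# Hypothetical fare and fee values, not read from the taxi data.
fares = np.array([52.0, 45.0, 36.5])
fees = np.array([0.8, 1.3, 1.3])

# Loop version: one Python-level addition per element.
loop_totals = []
for fare, fee in zip(fares, fees):
    loop_totals.append(fare + fee)

# Vectorized version: a single expression adds the arrays element-wise.
vector_totals = fares + fees

print(np.allclose(loop_totals, vector_totals))  # True
```

Both versions give the same numbers; the vectorized one hands the loop to NumPy's compiled code, which is where the speedup on large arrays comes from.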

From the provided taxi ndarray:

1. Slice the taxi array to extract all rows and the 10th column only. Assign the result to a new variable called fare_amount.

1. Slice the taxi array to extract all rows and the 11th column only. Assign the result to a new variable called fees_amount.

1. Add the fare_amount and fees_amount arrays element-wise. Assign the result to a new variable called fare_and_fees.

1. Print the fare_and_fees variable.

In [23]:
fare_amount = taxi[:, 9]                        # 10th column (index 9)
fees_amount = taxi[:, 10]                       # 11th column (index 10)
fare_and_fees = fare_amount + fees_amount       # Element-wise sum of the two columns

print(fare_and_fees)    

[52.8 46.3 37.8 ... 52.8 35.8 49.3]


**Some more vector operations including [NumPy broadcasting!](https://numpy.org/doc/stable/user/basics.broadcasting.html)** 
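
Broadcasting is what lets an array be combined with a scalar, or with a smaller array, without writing a loop. A minimal sketch with made-up values (seconds, amounts, and scale are hypothetical names):

```python
import numpy as np

# Hypothetical trip durations in seconds.
seconds = np.array([1800.0, 3600.0, 5400.0])

# Scalar broadcasting: 3600 is treated as if it were repeated for each element.
hours = seconds / 3600
print(hours)  # [0.5 1.  1.5]

# Row broadcasting: a shape (2,) array is stretched across each row of a (3, 2) array.
amounts = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0]])
scale = np.array([2.0, 10.0])
print(amounts * scale)
```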

From the provided taxi ndarray, do the following:

1. Create the variables trip_distance_miles (the column at index 7), trip_length_seconds (the column at index 8), and trip_length_hours (trip_length_seconds divided by 3600).

1. Calculate the average speed in miles per hour for each trip (miles traveled divided by hours elapsed), and assign the result to a new variable called trip_mph.

1. Print the first 10 values of the trip_mph array.

In [26]:
# 1. Create the variables trip_distance_miles, trip_length_seconds, and trip_length_hours
trip_distance_miles = taxi[:, 7]
trip_length_seconds = taxi[:, 8]
trip_length_hours = trip_length_seconds / 3600

# 2. Calculate the average speed in miles per hour for each trip
trip_mph = trip_distance_miles / trip_length_hours
# 3. Print the first 10 values
print(trip_mph[:10])

[37.11340206 38.58157895 31.27222982 25.88429752 26.3715415  38.53293413
 32.81553398 35.95075239 51.00702576 33.20207254]


## Calculating Statistics for 1D Ndarrays

In the previous section, we created trip_mph, a 1D ndarray of the average speed in miles per hour of each trip in our taxi dataset. Based on the first ten values of trip_mph, those NYC taxi drivers are fast; most average over 30 mph!

Now let's dive deeper into our data by determining the minimum, maximum, and mean values for our newly created trip_mph 1D ndarray.

In [27]:
# Calculating the minimum and maximum speeds from a 1D array
mph_min = trip_mph.min()
mph_max = trip_mph.max()

print(f"The minimum speed is {mph_min}")
print(f"The max speed is {mph_max}")

The minimum speed is 0.0
The max speed is 82800.0


In [29]:
mph_avg = trip_mph.mean()
print(f"The average speed is {round(mph_avg,2)}")

The average speed is 169.98


In [34]:
# Calculating the median: ndarray has no median() method,
# so we use the module-level function np.median() instead
mph_median = np.median(trip_mph)
round(mph_median,2)

24.18
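
The mean (169.98) sits far above the median (24.18) because a few implausible values, such as the 82800 mph maximum, pull the mean upward. The sketch below reproduces the effect on made-up speeds; the 100 mph cutoff is an arbitrary assumption for illustration, not a rule taken from the dataset.

```python
import numpy as np

# Hypothetical speeds in mph; the last value mimics a bad record
# (a long distance logged against a trip lasting only a few seconds).
speeds = np.array([20.0, 22.0, 24.0, 26.0, 28.0, 82800.0])

print(speeds.mean())      # pulled far upward by the single outlier
print(np.median(speeds))  # 25.0, the midpoint of the two middle values

# One possible filter: keep only speeds below an assumed 100 mph cutoff.
plausible = speeds[speeds < 100]
print(plausible.mean())   # 24.0
```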

## Calculating Statistics for 2D Ndarrays

When working with a 2D ndarray, calling the ndarray.max() method without any additional parameters returns a single value, just as it does for a 1D ndarray: the overall maximum.

- But what if we want to find the maximum value of each row? We can use the **axis parameter and set it to 1** to find the maximum value for each row.

- Similarly, we set **axis to 0 to find the maximum value of each column**.
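
A minimal sketch of the three cases on a small made-up array (values is a hypothetical name):

```python
import numpy as np

# Hypothetical 2x3 array.
values = np.array([[1, 9, 4],
                   [7, 2, 8]])

print(values.max())        # 9: maximum over the whole array
print(values.max(axis=1))  # [9 8]: one maximum per row
print(values.max(axis=0))  # [7 9 8]: one maximum per column
```

One way to remember it: the axis you pass is the one that gets collapsed, so axis=1 collapses the columns and leaves one value per row.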


**Exercise:** 
<br>
**fare amount + fees amount + tolls amount + tip amount = total amount**

In [35]:
# extract the first 5 rows only
taxi_first_five = taxi[:5]
# select columns: fare_amount, fees_amount, tolls_amount, and tip_amount
fare_components = taxi_first_five[:, 9:13]

1. Use the ndarray.sum() method to calculate the sum of each row in fare_components. Assign the result to fare_sums.
1. Extract the 14th column in taxi_first_five. Assign it to fare_totals.
1. Print fare_totals and fare_sums or use the variable inspector to compare the results and make sure they match.

In [36]:
fare_sums = fare_components.sum(axis = 1)
fare_totals = taxi_first_five[:,13]

print(fare_totals)
print(fare_sums)

[69.99 54.3  37.8  32.76 18.8 ]
[69.99 54.3  37.8  32.76 18.8 ]
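
Comparing the printed arrays by eye works for five rows. For a larger selection, one option is np.allclose(), which checks that two arrays match within a small floating-point tolerance. A sketch with made-up component values (components and totals are hypothetical):

```python
import numpy as np

# Hypothetical components (fare, fees, tolls, tip) and recorded totals.
components = np.array([[52.0, 0.8, 5.54, 11.65],
                       [45.0, 1.3, 0.0, 8.0]])
totals = np.array([69.99, 54.3])

sums = components.sum(axis=1)

# allclose tolerates tiny rounding differences that an exact == comparison may not.
print(np.allclose(sums, totals))  # True
```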
