The dataset for this assignment is available from the [NYC TLC trip record data page](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page).<br>
Data Description <br/>
<br>`pickup_year`: The year of the trip.<br/>
<br>`pickup_month`: The month of the trip (January is 1, December is 12).<br/>
<br>`pickup_day`: The day of the month of the trip.<br/>
<br>`pickup_location_code`: The airport or borough where the trip started.<br/>
<br>`dropoff_location_code`: The airport or borough where the trip finished.<br/>
<br>`trip_distance`: The distance of the trip in miles.<br/>
<br>`trip_length`: The length of the trip in seconds.<br/>
<br>`fare_amount`: The base fare of the trip, in dollars.<br/>
<br>`total_amount`: The total amount charged to the passenger, including all fees, tolls and tips.<br/>

In [27]:
import csv
import numpy as np

# Read nyc_taxis.csv as a list of lists; the with block closes the file for us
with open("nyc_taxis.csv", "r") as f:
    taxi_list = list(csv.reader(f))

# Header row
header = taxi_list[0]
print(header)
# Drop the header row so only data rows remain
taxi_list = taxi_list[1:]


# convert all values to floats
converted_taxi_list = []
for row in taxi_list:
    converted_row = []
    for item in row:
        converted_row.append(float(item))
    converted_taxi_list.append(converted_row)


taxi = np.array(converted_taxi_list)

['pickup_year', 'pickup_month', 'pickup_day', 'pickup_dayofweek', 'pickup_time', 'pickup_location_code', 'dropoff_location_code', 'trip_distance', 'trip_length', 'fare_amount', 'fees_amount', 'tolls_amount', 'tip_amount', 'total_amount', 'payment_type']
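
The nested conversion loop above can also be left to NumPy itself. The following is a minimal sketch, assuming every field is a numeric string; it uses a small in-memory CSV with made-up rows (`sample_csv`, `sample_taxi` and the three columns are illustrative names, not the real file).

```python
import csv
import io

import numpy as np

# Hypothetical three-column sample standing in for nyc_taxis.csv
sample_csv = io.StringIO(
    "pickup_year,pickup_month,trip_distance\n"
    "2016,1,21.0\n"
    "2016,1,16.29\n"
)

rows = list(csv.reader(sample_csv))
sample_header, sample_rows = rows[0], rows[1:]

# Passing dtype=float makes NumPy parse the numeric strings in one step
sample_taxi = np.array(sample_rows, dtype=float)

print(sample_taxi.shape)
print(sample_taxi.dtype)
```

This only works when every value can be parsed as a number; a single non-numeric field raises a `ValueError`.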


In [12]:
print(taxi)

[[2.016e+03 1.000e+00 1.000e+00 ... 1.165e+01 6.999e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 8.000e+00 5.430e+01 1.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 0.000e+00 3.780e+01 2.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 5.000e+00 6.334e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 8.950e+00 4.475e+01 1.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 0.000e+00 5.484e+01 2.000e+00]]


In [13]:
#Volume of the data records
print("Taxi shape :",taxi.shape)

Taxi shape : (89560, 15)


In [14]:
# Fare and fees amount
fare_amount = taxi[:,9]
fees_amount = taxi[:,10]
fare_and_fees = fare_amount + fees_amount

print(fare_and_fees)

[52.8 46.3 37.8 ... 52.8 35.8 49.3]


In [32]:
trip_distance_miles = taxi[:,7]
trip_length_seconds = taxi[:,8]

trip_length_hours = trip_length_seconds / 3600 # 3600 seconds is one hour

trip_mph = trip_distance_miles/trip_length_hours
print(trip_mph)

[37.11340206 38.58157895 31.27222982 ... 22.29907867 42.41551247
 36.90473407]


In [16]:
mph_min = trip_mph.min()
print(mph_min)
mph_max = trip_mph.max()
print(mph_max)
mph_mean = trip_mph.mean()
print(mph_mean)

0.0
82800.0
32.24258580925573


In [17]:
# we'll compare against the first 5 rows only
taxi_first_five = taxi[:5]
# select these columns: fare_amount, fees_amount, tolls_amount, tip_amount
fare_components = taxi_first_five[:,9:13]
fare_sums = fare_components.sum(axis = 1)
fare_totals = taxi_first_five[:,13]
print(fare_sums)
print(fare_totals)

[69.99 54.3  37.8  32.76 18.8 ]
[69.99 54.3  37.8  32.76 18.8 ]
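
Comparing the two printed arrays by eye works for five rows but not for the whole dataset. As a sketch of how the same check could be automated, the example below uses `np.isclose`, which tolerates floating-point rounding. The rows are illustrative: the totals echo the output above, but the split between fare and fees is assumed.

```python
import numpy as np

# Hypothetical rows: fare, fees, tolls, tip, total
sample = np.array([
    [52.0, 0.8, 5.54, 11.65, 69.99],
    [45.0, 1.3, 0.0, 8.0, 54.3],
    [36.5, 1.3, 0.0, 0.0, 37.8],
])

component_sums = sample[:, 0:4].sum(axis=1)  # add the four components per row
totals = sample[:, 4]

# isclose compares within a small tolerance instead of requiring exact equality
matches = np.isclose(component_sums, totals)
print(matches.all())
```

On the real array, the rows where `matches` is `False` would be the ones worth inspecting.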


NumPy Way

In [24]:
import numpy as np
taxi_np = np.genfromtxt('nyc_taxis.csv', delimiter = ',', skip_header=1)
taxi_np.shape

(89560, 15)

In [25]:
print(taxi_np.dtype)

float64


With the plain Python approach, we had to write an explicit nested loop to convert every value in the taxi dataset to a **float**. With NumPy, the same result takes only a couple of lines. We did not convert the values ourselves, yet the array is `float64`: a NumPy ndarray can hold only one data type.

The explicit conversion step was unnecessary because `numpy.genfromtxt()` parses every field as a float by default (its `dtype` parameter defaults to `float`). Passing `dtype=None` instead makes it infer a type for each column from the values. This is also why we pass `skip_header=1`: the column names cannot be parsed as floats and would otherwise be read in as a row of `nan` values.
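
To see how `numpy.genfromtxt()` treats the header row, here is a small sketch that reads a made-up CSV from an in-memory buffer (the `csv_text` contents are illustrative, not the real file), once with and once without `skip_header=1`.

```python
import io

import numpy as np

# Hypothetical miniature CSV with a header row
csv_text = (
    "pickup_year,pickup_month,trip_distance\n"
    "2016,1,21.0\n"
    "2016,2,16.29\n"
)

# Without skip_header, the text header becomes a row of nan values
with_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

# With skip_header=1, only the numeric rows are read
without_header = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)

print(with_header.shape)
print(np.isnan(with_header[0]).all())
print(without_header.shape)
print(without_header.dtype)
```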

In [31]:
#Calculate the number of rides in Jan and Feb
pickup_month = taxi_np[:,1]
jan_bool = pickup_month  == 1
print(type(jan_bool))
jan = pickup_month[jan_bool]
jan_ride  = jan.shape[0]
print(jan_ride)

feb_bool = pickup_month == 2
feb = pickup_month[feb_bool]
feb_ride = feb.shape[0]
print(feb_ride)

<class 'numpy.ndarray'>
13481
13333
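
When only the count is needed, the Boolean array can be counted directly without building the filtered array first. A minimal sketch on made-up month values (`months` is a hypothetical stand-in for the `pickup_month` column):

```python
import numpy as np

# Hypothetical pickup months for six rides
months = np.array([1, 1, 2, 1, 3, 2])

# True counts as 1, so counting the True values gives the number of rides
jan_rides = np.count_nonzero(months == 1)
feb_rides = (months == 2).sum()

print(jan_rides, feb_rides)
```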


In [33]:
# Calculate trip_mph a second time for analysis

trip_mph = taxi[:,7]/ (taxi[:,8]/ 3600)

print(trip_mph.max())

82800.0


A speed of 82,800 mph is physically impossible, so we inspect the distance and duration of the offending records to look for a pattern. Using Boolean indexing, we select the records whose speed exceeds 20,000 mph.

In [36]:
trip_mph_bool = trip_mph > 20000

result = taxi[trip_mph_bool, 5:9]
print(header[5:9] ,"\n")
print(result)

['pickup_location_code', 'dropoff_location_code', 'trip_distance', 'trip_length'] 

[[ 2.   2.  23.   1. ]
 [ 2.   2.  19.6  1. ]
 [ 2.   2.  16.7  2. ]
 [ 3.   3.  17.8  2. ]
 [ 2.   2.  17.2  2. ]
 [ 3.   3.  16.9  3. ]
 [ 2.   2.  27.1  4. ]]


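Every one of these records has a distance of more than 16 miles but a duration of under 5 seconds, which is impossible for a real trip. Each one also starts and ends at the same location code, which suggests that these are recording errors rather than genuine trips.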
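
The same Boolean mask can select the suspicious rows or, when inverted with `~`, drop them. The sketch below uses four made-up trips (the `trips` array and its two-column layout are assumptions for illustration):

```python
import numpy as np

# Hypothetical trips: trip_distance (miles), trip_length (seconds)
trips = np.array([
    [23.0, 1.0],
    [21.0, 2037.0],
    [16.7, 2.0],
    [9.2, 1041.0],
])

mph = trips[:, 0] / (trips[:, 1] / 3600)  # miles divided by hours

suspicious = mph > 20000     # Boolean mask, one entry per row
print(trips[suspicious])     # rows with impossible speeds

kept = trips[~suspicious]    # ~ inverts the mask to keep the plausible rows
print(kept)
```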

In [41]:
tip_amount = taxi[:,12]

tip_bool = tip_amount > 50
top_tips = taxi[tip_bool,5:14]

print(header[5:14])
print(top_tips)

['pickup_location_code', 'dropoff_location_code', 'trip_distance', 'trip_length', 'fare_amount', 'fees_amount', 'tolls_amount', 'tip_amount', 'total_amount']
[[4.0000e+00 2.0000e+00 2.1450e+01 2.0040e+03 5.2000e+01 8.0000e-01
  0.0000e+00 5.2800e+01 1.0560e+02]
 [3.0000e+00 4.0000e+00 9.2000e+00 1.0410e+03 2.7000e+01 1.3000e+00
  5.5400e+00 6.0000e+01 9.3840e+01]
 [2.0000e+00 0.0000e+00 1.9800e+01 1.6710e+03 5.2500e+01 1.3000e+00
  5.5400e+00 5.9340e+01 1.1868e+02]
 [4.0000e+00 2.0000e+00 1.8420e+01 2.9680e+03 5.2000e+01 8.0000e-01
  5.5400e+00 8.0000e+01 1.3834e+02]
 [3.0000e+00 6.0000e+00 4.9000e-01 1.5800e+02 3.5000e+00 1.8000e+00
  0.0000e+00 7.0000e+01 7.5300e+01]
 [2.0000e+00 2.0000e+00 2.7000e+00 3.8100e+02 9.5000e+00 8.0000e-01
  0.0000e+00 6.0000e+01 7.0300e+01]
 [3.0000e+00 4.0000e+00 9.5400e+00 1.2100e+03 2.7500e+01 8.0000e-01
  5.5400e+00 5.5000e+01 8.8840e+01]
 [2.0000e+00 4.0000e+00 1.7600e+01 3.2510e+03 5.2000e+01 8.0000e-01
  5.5400e+00 6.5000e+01 1.2334e+02]
 [4.0000e+

In [43]:
zeros = np.zeros([taxi_np.shape[0], 1])
print(zeros)
taxi_modified = np.concatenate([taxi_np, zeros], axis=1)
print(taxi_modified)

[[0.]
 [0.]
 [0.]
 ...
 [0.]
 [0.]
 [0.]]
[[2.016e+03 1.000e+00 1.000e+00 ... 6.999e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 5.430e+01 1.000e+00 0.000e+00]
 [2.016e+03 1.000e+00 1.000e+00 ... 3.780e+01 2.000e+00 0.000e+00]
 ...
 [2.016e+03 6.000e+00 3.000e+01 ... 6.334e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 4.475e+01 1.000e+00 0.000e+00]
 [2.016e+03 6.000e+00 3.000e+01 ... 5.484e+01 2.000e+00 0.000e+00]]


In [46]:
taxi_modified[taxi_modified[:, 5] == 2, 15] = 1 #JFK
taxi_modified[taxi_modified[:, 5] == 3, 15] = 1 #LGA
taxi_modified[taxi_modified[:, 5] == 5, 15] = 1 #EWR
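
The three assignments above could also be collapsed into a single membership test. This is a sketch using `np.isin` on made-up pickup codes; the code-to-airport mapping (2, 3, 5) is taken from the comments above, and `pickup_codes` is a hypothetical stand-in for column 5.

```python
import numpy as np

# Hypothetical pickup_location_code values
pickup_codes = np.array([2, 0, 3, 4, 5, 1])

# Codes flagged as airports in the cell above: JFK, LGA, EWR
airport_codes = [2, 3, 5]

# isin returns True wherever the code is one of the airport codes
is_airport = np.isin(pickup_codes, airport_codes).astype(float)
print(is_airport)
```

On the real array, the result could be assigned to the new column, for example `taxi_modified[:, 15] = np.isin(taxi_modified[:, 5], airport_codes)`.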

In [47]:
# Count trips by dropoff location (column 6): 2 = JFK, 3 = LaGuardia, 5 = Newark
jfk = taxi[taxi[:,6] == 2,6]
jfk_count = jfk.shape[0]

laguardia = taxi[taxi[:,6] == 3, 6]
laguardia_count = laguardia.shape[0]

newark = taxi[taxi[:,6]==5 ,6]
newark_count = newark.shape[0]

In [48]:
# Remove bad data (speeds of 100 mph or more) and calculate statistics on the cleaned data
trip_mph = taxi[:,7] / (taxi[:,8] / 3600)

cleaned_taxi = taxi[trip_mph < 100]

mean_distance = cleaned_taxi[:,7].mean()
mean_length = cleaned_taxi[:,8].mean()
mean_total_amount = cleaned_taxi[:,13].mean()

In [53]:
print("Average Distance :",mean_distance,"\n","Average Length in seconds :",mean_length,"\n","Average Total Amount :",mean_total_amount)

Average Distance : 12.666396599932893 
 Average Length in seconds : 2239.503657309026 
 Average Total Amount : 48.98131853260262
