# Analysing NYC Taxi Trip Dataset Using Numpy

There are a variety of approaches for exploring large datasets. Here we load the data into a NumPy array, inspect its basic attributes, and then answer a series of questions about the rides using boolean masks and aggregations.

I'll use part of the well-studied NYC Taxi trip database, with the locations of NYC taxi pickups and dropoffs from 2016 (this extract covers January through June). Although we know what the data is, let's approach it as if we were doing data mining, and see what it takes to understand the dataset from scratch.

In [4]:
import numpy as np
# Raw string, so the backslashes in the Windows path are not read as escape sequences
my_data=np.genfromtxt(r"C:\Users\ridva\OneDrive\Masaüstü\Notes\B47-DS-TR-main\14-Office Hours document\Office Hours 3-Numpy\nyc_taxis1.csv",delimiter=',',skip_header=1,dtype='int32')
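
The loading step can be tried without the file on disk. The sketch below is only an illustration: `csv_text` and its three column names are made up, and stand in for the real CSV.

```python
import io
import numpy as np

# Hypothetical three-column CSV standing in for the real file.
csv_text = "year,month,tip\n2016,1,11\n2016,2,8\n"

# skip_header=1 drops the header row; dtype='int32' parses every field as an integer.
sample = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1, dtype="int32")

print(sample.shape)
print(sample.dtype)
```

Because every field is forced to `int32`, any decimal values in the real file would need a float dtype instead.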

In [5]:
np.__version__

'1.20.1'

In [6]:
my_data

array([[2016,    1,    1, ...,   11,   69,    1],
       [2016,    1,    1, ...,    8,   54,    1],
       [2016,    1,    1, ...,    0,   37,    2],
       ...,
       [2016,    6,   30, ...,    5,   63,    1],
       [2016,    6,   30, ...,    8,   44,    1],
       [2016,    6,   30, ...,    0,   54,    2]])

In [7]:
my_data.ndim

2

In [8]:
my_data.shape

(89560, 15)

In [9]:
my_data.size

1343400

In [10]:
my_data.dtype

dtype('int32')

In [11]:
my_data.astype('int64')

array([[2016,    1,    1, ...,   11,   69,    1],
       [2016,    1,    1, ...,    8,   54,    1],
       [2016,    1,    1, ...,    0,   37,    2],
       ...,
       [2016,    6,   30, ...,    5,   63,    1],
       [2016,    6,   30, ...,    8,   44,    1],
       [2016,    6,   30, ...,    0,   54,    2]], dtype=int64)
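
Note that `astype` returns a new array and leaves the original untouched, which is why `my_data.dtype` would still report `int32` after the cell above. A minimal sketch with a made-up two-row array:

```python
import numpy as np

original = np.array([[2016, 1], [2016, 2]], dtype="int32")

# astype builds a converted copy; `original` keeps its dtype.
converted = original.astype("int64")

print(original.dtype, converted.dtype)
```

To keep the wider type, the result has to be assigned to a name, for example `my_data = my_data.astype('int64')`.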

In [12]:
my_data.itemsize

4

In [13]:
my_data.nbytes

5373600
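
The attributes shown so far are tied together by simple arithmetic: `size` is the product of `shape`, and `nbytes` is `size * itemsize`. The sketch below checks this on a zero-filled array with the same shape and dtype as `my_data` (the array itself is a stand-in, not the real data).

```python
import numpy as np

# Stand-in with the same shape and dtype as my_data.
demo = np.zeros((89560, 15), dtype="int32")

rows, cols = demo.shape
print(rows * cols == demo.size)                 # True
print(demo.size * demo.itemsize == demo.nbytes) # True
print(demo.nbytes)                              # 5373600
```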

In [14]:
my_data.data

<memory at 0x00000161DDB706C0>

In [15]:
np.isnan(my_data)

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])
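
An integer array cannot hold NaN at all, so `np.isnan` on `my_data` can only return `False`. The check becomes informative once the data is stored as floats. A small sketch with made-up values:

```python
import numpy as np

ints = np.array([1, 2, 3], dtype="int32")
floats = np.array([1.0, np.nan, 3.0])

# .any() collapses the boolean array into a single answer.
print(np.isnan(ints).any())    # False
print(np.isnan(floats).any())  # True
```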

# ... Mean speed of all rides

In [16]:
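# Column 8 is trip_length in seconds, so /3600 converts it to hours;
# column 7 is taken here to be the trip distance
speed=my_data[:,7]/(my_data[:,8]/3600)
np.sort(speed)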

array([    0.,     0.,     0., ..., 30600., 68400., 82800.])

In [17]:
print(speed.mean())

30.664805016170458


In [18]:
print(speed.max())

82800.0
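
A maximum of 82800 suggests that a few rides with very short durations dominate the result. One possible way to guard against this, sketched on made-up numbers, is to divide only where the duration is positive and then keep speeds under a cutoff. The arrays and the cutoff of 100 are assumptions for illustration, not values taken from the dataset.

```python
import numpy as np

# Hypothetical distances and durations (seconds); not taken from the dataset.
distance = np.array([10, 5, 23, 0])
duration_s = np.array([1800, 600, 1, 0])

hours = duration_s / 3600

# Divide only where the duration is positive; the other slots stay NaN.
speed = np.full(distance.shape, np.nan)
np.divide(distance, hours, out=speed, where=hours > 0)

# Assumed cutoff of 100 units per hour; NaN compares as False and is dropped too.
plausible = speed[speed < 100]
print(plausible.mean())
```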


# ... Number of rides taken in February

In [19]:
rides_feb=my_data[my_data[:,1] ==2,1]
print(rides_feb.shape[0])

13333
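
Counting rows does not require building the filtered array first. Since a boolean mask is `True` exactly where the condition holds, counting its `True` entries gives the same number. A sketch on a made-up month column:

```python
import numpy as np

months = np.array([1, 2, 2, 3, 2])  # hypothetical month column

mask = months == 2
count_a = np.count_nonzero(mask)  # counts the True entries
count_b = mask.sum()              # True counts as 1, False as 0

print(count_a, count_b)  # 3 3
```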


# ... Number of rides where the tip was more than $50

In [20]:
print(my_data[my_data[:,12]>50,12])

[ 52  60  59  80  70  60  55  65  80  62 100  58  62  75  60  70]


In [21]:
print(my_data[my_data[:,12]>50,12].shape[0])

16


In [22]:
a=my_data[my_data[:,12]>50,12]
np.sum(a)

1068

# ... Number of rides where the dropoff was JFK (6) airport

In [24]:
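# Dropoffs at location code 6, all months
print(my_data[my_data[:,6] == 6,6].shape[0])
# Same condition restricted to February: both masks must be True for a row to be kept
a=np.all([my_data[:,6] == 6 , my_data[:,1] == 2],axis=0)
print(my_data[a,0:7])
my_data[a,0:7].shape[0]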

9000
[[2016    2    1 ...    0    3    6]
 [2016    2    1 ...    0    2    6]
 [2016    2    1 ...    0    3    6]
 ...
 [2016    2   29 ...    5    3    6]
 [2016    2   29 ...    5    3    6]
 [2016    2   29 ...    5    3    6]]


1361
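
Stacking two masks and calling `np.all(..., axis=0)` is equivalent to combining them with the element-wise `&` operator, which is the more common spelling. A small sketch on a made-up three-column array (year, month, dropoff code):

```python
import numpy as np

# Hypothetical rows: year, month, dropoff location code.
rows = np.array([[2016, 2, 6],
                 [2016, 1, 6],
                 [2016, 2, 4]])

by_all = np.all([rows[:, 2] == 6, rows[:, 1] == 2], axis=0)

# The parentheses are required because & binds tighter than ==.
by_and = (rows[:, 2] == 6) & (rows[:, 1] == 2)

print(by_all)
print(by_and)
```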

# ... trip_lengths in February

In [28]:
a=my_data[my_data[:,1] ==2,8]
print(a,a.sum(),sum(a), a.shape[0])

[1214  676 1175 ...  111  995  932] 27206595 27206595 13333


# Delete a column

In [26]:
my_data1=np.delete(my_data,0,axis=1)
my_data1

array([[ 1,  1,  5, ..., 11, 69,  1],
       [ 1,  1,  5, ...,  8, 54,  1],
       [ 1,  1,  5, ...,  0, 37,  2],
       ...,
       [ 6, 30,  4, ...,  5, 63,  1],
       [ 6, 30,  4, ...,  8, 44,  1],
       [ 6, 30,  4, ...,  0, 54,  2]])

In [27]:
my_data1.shape

(89560, 14)

# Undo the delete: re-insert the year column

In [29]:
# A scalar value is broadcast down the whole new column
my_data2=np.insert(my_data1,0,2016,axis=1)

In [30]:
my_data2.shape

(89560, 15)
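
The delete and insert steps form a round trip, which can be checked on a tiny array. The values below are made up. The column is saved before deletion so that it can be restored even when it is not constant.

```python
import numpy as np

table = np.array([[2016, 1, 5],
                  [2016, 6, 4]])  # hypothetical rows

first_col = table[:, 0]                    # keep the column before removing it
without_first = np.delete(table, 0, axis=1)
restored = np.insert(without_first, 0, first_col, axis=1)

print(without_first.shape, restored.shape)
print(np.array_equal(restored, table))  # True
```

Both `np.delete` and `np.insert` return new arrays, so `table` itself is never modified.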

# Sum of Columns

In [31]:
my_data.sum(axis=0)

array([180552960,    323712,   1405513,    344030,    276084,    265087,
          302645,   1092372, 200254468,   3423875,     47311,    286609,
          495798,   4339989,    115572])
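
The `axis` argument names the dimension that is collapsed: `axis=0` sums down the rows and leaves one total per column, while `axis=1` leaves one total per row. A sketch on a small made-up grid:

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

col_totals = grid.sum(axis=0)  # one value per column
row_totals = grid.sum(axis=1)  # one value per row

print(col_totals)  # [5 7 9]
print(row_totals)  # [ 6 15]
print(grid.sum())  # 21
```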

# .... Find total trip_length for each month

In [32]:
for i in range(1,7):
    # Column 8 holds trip_length; sum it over the rides of month i
    a=my_data[my_data[:,1] ==i,8]
    print('Month {}   {}'.format(i,a.sum()))

Month 1   28444724
Month 2   27206595
Month 3   34091293
Month 4   33562056
Month 5   39051229
Month 6   37898571
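
The per-month loop can also be written as a single grouped sum. One option, shown here on made-up data, is `np.bincount` with `weights`, which adds up the weights of all entries that share the same non-negative integer label. This is an alternative sketch, not the method used in the cell above.

```python
import numpy as np

months = np.array([1, 2, 1, 3, 2])         # hypothetical month labels
values = np.array([100, 200, 50, 10, 5])   # hypothetical per-ride values

# Index k of the result holds the sum of values where months == k.
# Index 0 is unused here because the months start at 1.
totals = np.bincount(months, weights=values)

print(totals)  # [  0. 150. 205.  10.]
```

The result is a float array, so an `astype` call is needed if integer totals are required.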


# Tips by location

In [33]:
for i in range(1,7):
    # For dropoff location i: total of column 12 (tips) and of column 7
    a=my_data[my_data[:,6] ==i,12]
    b=my_data[my_data[:,6] ==i,7]
    print('Location {}   {}   {}'.format(i,a.sum(),b.sum()))
my_data[:,12]

Location 1   47780   126097
Location 2   79177   183876
Location 3   91054   157528
Location 4   254339   539439
Location 5   971   2091
Location 6   18571   65701


array([11,  8,  0, ...,  5,  8,  0])
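
The six tip totals printed above add up to 491892, while the column sums shown earlier report 495798 for column 12, so some rides appear to have a dropoff code outside 1 to 6. A way to avoid hard-coding the range, sketched on made-up data, is to take the codes from the column itself with `np.unique`.

```python
import numpy as np

codes = np.array([0, 4, 4, 6, 0, 1])  # hypothetical dropoff codes
tips = np.array([3, 10, 2, 5, 1, 4])  # hypothetical tips

# np.unique returns the sorted distinct codes that actually occur.
totals = {int(code): int(tips[codes == code].sum()) for code in np.unique(codes)}

for code, total in totals.items():
    print('Location {}   {}'.format(code, total))
```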