# Lecture 3: August 11th, 2023

__Reminders:__ 
* Keep working on HW1 and HW2
    * Anthony plans to go over homework on Friday and Monday; definitely attend if you'd like some help

__Recap:__ In the last lecture we were thinking about how to represent the following data in Python. We were also tasked with computing the average of the "Rating" column. We saw that a list of lists was pretty clunky for this...

![csv-image.png](csv-image.png)

In [None]:
#from last lecture
mylist = [
    [1.99,17,"Alice",4.9],
    [10.45,10,"Bob",2.5],
    [19.99,5,"Eve",4.1]
]

### Why pandas? Wrong Approach 2

We know that NumPy is very good at dealing with arrays. What if we tried to store our data with NumPy instead?

In [None]:
import numpy as np

In [None]:
mylist

[[1.99, 17, 'Alice', 4.9], [10.45, 10, 'Bob', 2.5], [19.99, 5, 'Eve', 4.1]]

In [None]:
#first, let's convert this list into a NumPy array
arr = np.array(mylist)
arr

array([['1.99', '17', 'Alice', '4.9'],
       ['10.45', '10', 'Bob', '2.5'],
       ['19.99', '5', 'Eve', '4.1']], dtype='<U32')

If you look carefully, you might realize something is "off" about `arr`...

In [None]:
type(arr)

numpy.ndarray

Recall: our original goal was to compute the average of the ratings column. So let's try to get that done here.

In [None]:
# get all rows in the last column
arr[:,-1]

array(['4.9', '2.5', '4.1'], dtype='<U32')

Notice that this kind of multi-dimensional slicing does not work on a plain Python list:

In [None]:
#Same slicing fails for a list
mylist[:,-1]

TypeError: list indices must be integers or slices, not tuple

In [1]:
arr[:,-1].mean()

UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('<U32'), dtype('<U32')) -> None

What's going wrong: NumPy wants all of the entries in an array to have __the same__ data type. 

In many datasets, having multiple different data types makes sense. Even in this small example, we have floats, ints, and strings, and they all make sense based on the column that they're in.

In [2]:
mylist

[[1.99, 17, 'Alice', 4.9], [10.45, 10, 'Bob', 2.5], [19.99, 5, 'Eve', 4.1]]

When we convert `mylist` to `arr`, NumPy stores every entry as a string, because the entries from the Seller column are strings and a string type is the only one that can hold all of the values.

That being said, we can still compute the average pretty easily. Here's how:

In [3]:
#interpret as floats
arr[:,-1].astype(float)

array([4.9, 2.5, 4.1])

In [4]:
arr[:,-1].astype(float).mean()

3.8333333333333335

To deal with specific elements we could cast to a float.

In [5]:
float(arr[0,-1])

4.9

In both of these conversion examples, we're converting data on an as-needed basis. There's no way of setting the dtype for just one column of an ordinary NumPy array like `arr`.
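
Here is a minimal sketch of that last point, reusing `mylist` from above. The name `ratings` is introduced only for this illustration. Converting the last column gives a separate float array, but writing those floats back into `arr` just turns them into strings again.

```python
import numpy as np

mylist = [
    [1.99, 17, "Alice", 4.9],
    [10.45, 10, "Bob", 2.5],
    [19.99, 5, "Eve", 4.1],
]
arr = np.array(mylist)

# Converting the last column produces a separate float array
ratings = arr[:, -1].astype(float)
print(ratings.dtype)
print(ratings.mean())

# Writing the floats back does not change the dtype of arr:
# the values are converted back to strings on assignment
arr[:, -1] = ratings
print(arr.dtype.kind)  # 'U' is NumPy's code for unicode strings
```

So the float version only lives in `ratings`; `arr` itself stays an array of strings.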

### Why pandas? Right Approach

In this section, we'll formally introduce pandas. For Math 10, pandas is the most important Python library.
* pandas is like the Python equivalent of Excel.

* Convert `mylist` to a pandas DataFrame

In [6]:
#pd is a standard naming convention
#always use it in Math 10
import pandas as pd

In [8]:
mylist

[[1.99, 17, 'Alice', 4.9], [10.45, 10, 'Bob', 2.5], [19.99, 5, 'Eve', 4.1]]

In [9]:
pd.DataFrame(mylist)

Unnamed: 0,0,1,2,3
0,1.99,17,Alice,4.9
1,10.45,10,Bob,2.5
2,19.99,5,Eve,4.1


This is already a huge improvement over NumPy: each column keeps its own data type, which we can check with the `dtypes` attribute (used later in this lecture).
* What does `object` mean? `object` is the data type pandas uses when a column is not one of the specialized numerical types that it recognizes. You'll see it a lot with strings, lists, etc.
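
As a quick check of the per-column data types, here is a small sketch. The names `df_from_list` and `df_named` are made up for this example, and the column names are copied from the csv shown in the image above.

```python
import pandas as pd

mylist = [
    [1.99, 17, "Alice", 4.9],
    [10.45, 10, "Bob", 2.5],
    [19.99, 5, "Eve", 4.1],
]

# Each column gets its own dtype; the column of names is typically
# reported as object
df_from_list = pd.DataFrame(mylist)
print(df_from_list.dtypes)

# Column names can be supplied by hand with the columns argument
df_named = pd.DataFrame(mylist, columns=["Cost", "Quantity", "Seller", "Rating"])
print(df_named.dtypes)
```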

In [10]:
#Question: will the following work?
pd.mylist

AttributeError: module 'pandas' has no attribute 'mylist'

`pd.` looks for something defined inside the pandas module. In this example, pandas is asked for something called `mylist`, but `mylist` is our own variable and not part of pandas, so we get an error.

* Read in the `sample-data.csv` directly using the pandas function `read_csv`. Save the resulting DataFrame using the variable name `df`.

This is the most common way we'll get data into pandas.
* We'll mostly work with csv (comma separated values) files
* If you're interested in working with Excel files, the process is almost exactly the same, but there are a few extra steps; in most cases, it's easier to convert your file first to csv and then import.

In [11]:
df = pd.read_csv("sample-data.csv")
df

Unnamed: 0,Cost,Quantity,Seller,Rating
0,1.99,17,Alice,4.9
1,10.45,10,Bob,2.5
2,19.99,5,Eve,4.1


Among all the methods we've seen so far, this is the first one that gives us the names of the columns!
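
If you don't have `sample-data.csv` at hand, the following sketch reproduces the same idea from a string. The text in `csv_text` is an assumption about what the file contains, based on the output above, and `df_demo` is a name used only here.

```python
import io
import pandas as pd

# Stand-in for the contents of sample-data.csv (assumed layout)
csv_text = """Cost,Quantity,Seller,Rating
1.99,17,Alice,4.9
10.45,10,Bob,2.5
19.99,5,Eve,4.1
"""

# read_csv accepts a file path or a file-like object such as StringIO
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)

# The first line of the csv became the column names
print(df_demo.columns.tolist())
```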

* Evaluate the `dtypes` attribute of `df`

In [None]:
# DataFrame object
type(df)

pandas.core.frame.DataFrame

In [None]:
df.dtypes

Cost        float64
Quantity      int64
Seller       object
Rating      float64
dtype: object

* Define `col` to be equal to the "Rating" column

In [None]:
df

Unnamed: 0,Cost,Quantity,Seller,Rating
0,1.99,17,Alice,4.9
1,10.45,10,Bob,2.5
2,19.99,5,Eve,4.1


In [None]:
#first example of indexing, here's how we can get a column
col = df["Rating"]
col

0    4.9
1    2.5
2    4.1
Name: Rating, dtype: float64

Notice, `col` is a different data type from `df`

In [None]:
type(col)

pandas.core.series.Series

`col` is what's called a pandas Series. You can think of Series as columns of DataFrames.

To recap: the two most important data types in pandas are `DataFrames` and `Series`.
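
One place where the difference shows up, sketched on a small stand-in DataFrame (`df_demo` and `one_col_df` are names made up for this example): single brackets with a column name give a Series, while a list of column names gives a DataFrame, even if the list has only one name.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Cost": [1.99, 10.45, 19.99],
    "Quantity": [17, 10, 5],
    "Seller": ["Alice", "Bob", "Eve"],
    "Rating": [4.9, 2.5, 4.1],
})

col = df_demo["Rating"]            # a Series
one_col_df = df_demo[["Rating"]]   # a DataFrame with one column

print(type(col).__name__, col.shape)
print(type(one_col_df).__name__, one_col_df.shape)
```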

In [None]:
#Great question asked: will the following give the same thing?
#Not quite, but we'll see how to make it work in the next section
df[:,3]

TypeError: '(slice(None, None, None), 3)' is an invalid key

* How many rows and columns are in `df`?

We can gather this information using the `shape` attribute

In [None]:
df.shape

(3, 4)

This is telling us we have 3 rows and 4 columns.

* Compute the average of the "Rating" column

In [None]:
col.mean()

3.8333333333333335

Appreciate how elegant this computation was compared to when we used NumPy and just lists. One thing to notice: pandas does not actually solve any of the precision issues we brought up last time.

### Two ways to index in pandas

Here are two ways to index in pandas:
* Label-based (the names that things are given): `loc` (location)
* Integer-position-based (the index, start counting from 0): `iloc` (integer location)

In [None]:
df

Unnamed: 0,Cost,Quantity,Seller,Rating
0,1.99,17,Alice,4.9
1,10.45,10,Bob,2.5
2,19.99,5,Eve,4.1


Let's say I wanted to get Bob's name...

In [None]:
#label-based indexing
df.loc[1,"Seller"]

'Bob'

In [None]:
#Question from the chat!
#This is returning all of row 1
df.loc[1]

Cost        10.45
Quantity       10
Seller        Bob
Rating        2.5
Name: 1, dtype: object

In [None]:
type(df.loc[1])

pandas.core.series.Series

In [None]:
df.loc[1]["Seller"]

'Bob'

In [None]:
#Notice, we get the same as df.loc[1]
df.loc[1,:]

Cost        10.45
Quantity       10
Seller        Bob
Rating        2.5
Name: 1, dtype: object

What makes the above example slightly confusing is that the row names and integer positions happen to coincide. E.g. integer position 2 is the same as label 2, in this case.

In [None]:
#integer-based indexing
df.iloc[1,2]

'Bob'
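
To see the difference between labels and positions more clearly, here is a small sketch where the row labels are letters instead of integers. The DataFrame `df_demo` and its labels `"a"`, `"b"`, `"c"` are made up for this illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Seller": ["Alice", "Bob", "Eve"], "Rating": [4.9, 2.5, 4.1]},
    index=["a", "b", "c"],  # row labels that are not integers
)

# loc uses the names: row labeled "b", column labeled "Seller"
by_label = df_demo.loc["b", "Seller"]

# iloc uses positions: second row, first column
by_position = df_demo.iloc[1, 0]

print(by_label, by_position)
```

Both lines pick out the same entry, but only `loc` needs to know the names.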

In [None]:
df

Unnamed: 0,Cost,Quantity,Seller,Rating
0,1.99,17,Alice,4.9
1,10.45,10,Bob,2.5
2,19.99,5,Eve,4.1


In [None]:
#How could we get the left-most column
df.iloc[:,0]

0     1.99
1    10.45
2    19.99
Name: Cost, dtype: float64

In [None]:
#returns the column names (as a pandas Index object)
df.columns

Index(['Cost', 'Quantity', 'Seller', 'Rating'], dtype='object')

### Boolean indexing in pandas

Boolean Series are another way of indexing in pandas! This should remind you a lot of NumPy!

In [None]:
df

Unnamed: 0,Cost,Quantity,Seller,Rating
0,1.99,17,Alice,4.9
1,10.45,10,Bob,2.5
2,19.99,5,Eve,4.1


In [None]:
df["Quantity"] < 12

0    False
1     True
2     True
Name: Quantity, dtype: bool

* Get the sub-DataFrame containing all rows where the Quantity is smaller than 12.

In [None]:
sub_df = df[df["Quantity"] < 12]
sub_df

Unnamed: 0,Cost,Quantity,Seller,Rating
1,10.45,10,Bob,2.5
2,19.99,5,Eve,4.1


Warning!! Notice here that the row labels no longer match the integer positions!

In [None]:
#Row labeled 1, the column labeled "Cost"
sub_df.loc[1,"Cost"]

10.45

In [None]:
#Index 0 row, index 0 column
sub_df.iloc[0,0]

10.45

In [None]:
sub_df.iloc[1,0]

19.99

In [None]:
sub_df.shape

(2, 4)
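
The warning above can be sketched on a stand-in copy of the data (`df_demo` is a name used only here): after filtering, the rows keep their original labels, so asking `loc` for label 0 fails even though `iloc` position 0 exists.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Cost": [1.99, 10.45, 19.99],
    "Quantity": [17, 10, 5],
    "Seller": ["Alice", "Bob", "Eve"],
    "Rating": [4.9, 2.5, 4.1],
})
sub_df = df_demo[df_demo["Quantity"] < 12]

# The row labels are inherited from df_demo
print(sub_df.index.tolist())

# Position 0 is the first remaining row
print(sub_df.iloc[0, 2])

# There is no longer a row with label 0
try:
    sub_df.loc[0, "Seller"]
except KeyError:
    print("no row labeled 0 in sub_df")
```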


***

That's it for EDA Unit 1! The focus of EDA Unit 2 will be on exploring new data and techniques for getting a feel of what the data is all about.

### Exploring the taxis dataset

In today's lecture we'll be using a dataset from the Seaborn library. Typically we'd import this data from Seaborn, but when we import it this way Seaborn takes care of a number of "data cleaning" steps for us. Today, we'll just import the csv directly. 

* Load the taxis csv file and store it with the variable name `df`.

In [None]:
df = pd.read_csv("taxis.csv")

Get a sense for the contents of `df` using the following:
* The shape attribute

In [None]:
df.shape

(6433, 14)

* The head method to view the first few rows

In [None]:
#Give the first three rows
df.head(3)

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan


In [None]:
#If I don't specify a number, it will give me the first 5 rows
df.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


`head()` is good for seeing what kinds of data are included in our DataFrame. However, we might be worried that certain values are all clustered in the beginning, leading us to make incorrect assumptions about the data. What if we wanted a more random sampling?

* The `sample` method; similar to `head`, but returns a random selection of rows (could be out of order)

In [None]:
df.sample(3)

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
5092,2019-03-06 10:00:44,2019-03-06 10:32:14,3,6.79,24.5,0.0,0.0,27.8,yellow,cash,Upper West Side North,Battery Park City,Manhattan,Manhattan
1642,2019-03-11 14:51:12,2019-03-11 15:08:55,6,1.74,11.5,2.96,0.0,17.76,yellow,credit card,Central Park,Clinton East,Manhattan,Manhattan
4949,2019-03-01 21:08:12,2019-03-01 21:22:33,6,2.38,11.0,2.96,0.0,17.76,yellow,credit card,Times Sq/Theatre District,Gramercy,Manhattan,Manhattan


* `info` method

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   pickup           6433 non-null   object 
 1   dropoff          6433 non-null   object 
 2   passengers       6433 non-null   int64  
 3   distance         6433 non-null   float64
 4   fare             6433 non-null   float64
 5   tip              6433 non-null   float64
 6   tolls            6433 non-null   float64
 7   total            6433 non-null   float64
 8   color            6433 non-null   object 
 9   payment          6389 non-null   object 
 10  pickup_zone      6407 non-null   object 
 11  dropoff_zone     6388 non-null   object 
 12  pickup_borough   6407 non-null   object 
 13  dropoff_borough  6388 non-null   object 
dtypes: float64(5), int64(1), object(8)
memory usage: 703.7+ KB


Missing values and how to deal with them is a large problem in data science. It will also be a topic of further study in Math 10.

From this data, I can infer that there are 44 missing values in the "payment" column.

In [None]:
6433 - 6389

44
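
Instead of subtracting by hand, pandas can count the missing values directly. Below is a minimal sketch on a tiny made-up DataFrame (`demo` and its values are invented for illustration and are not the taxis data).

```python
import pandas as pd

demo = pd.DataFrame({
    "payment": ["cash", None, "credit card", None],
    "fare": [5.0, 7.5, 9.0, 6.5],
})

# isna marks missing entries with True; sum counts the Trues per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Same idea as the subtraction above: total rows minus non-null count
print(len(demo) - demo["payment"].count())
```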

* The method `describe`; similar to `info`, but it gives information about the distribution of numbers in the numeric columns.

In [None]:
df.describe()

Unnamed: 0,passengers,distance,fare,tip,tolls,total
count,6433.0,6433.0,6433.0,6433.0,6433.0,6433.0
mean,1.539251,3.024617,13.091073,1.97922,0.325273,18.517794
std,1.203768,3.827867,11.551804,2.44856,1.415267,13.81557
min,0.0,0.0,1.0,0.0,0.0,1.3
25%,1.0,0.98,6.5,0.0,0.0,10.8
50%,1.0,1.64,9.5,1.7,0.0,14.16
75%,2.0,3.21,15.0,2.8,0.0,20.3
max,6.0,36.7,150.0,33.2,24.02,174.82


Here, we can see that the average fare was about 13, while the median fare (the 50% row) was 9.5.
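
A gap between the mean and the median like this usually comes from a few very large values. The sketch below uses a short made-up Series, `fares`, to show the effect; the numbers are not taken from the taxis data.

```python
import pandas as pd

# Mostly small values with one very large one
fares = pd.Series([6.5, 7.0, 9.5, 15.0, 150.0])

print(fares.mean())               # pulled up by the large value
print(fares.median())             # the middle value
print(fares.describe()["50%"])    # describe reports the median as "50%"
```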

* How many different values are in the "pickup_borough" column? First, get a pandas Series containing the column.

In [None]:
ser = df["pickup_borough"]
ser

0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

In [None]:
#Each unique value that appears in ser
ser.unique()

array(['Manhattan', 'Queens', nan, 'Bronx', 'Brooklyn'], dtype=object)

Here, we see the value `nan`, which stands for "not a number". pandas uses it to represent missing data.

In pandas, there are methods specifically for Series. `unique` is one of them; if we try to use it on a DataFrame it will not work.

In [None]:
df.unique()

AttributeError: 'DataFrame' object has no attribute 'unique'
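
To answer the "how many different values" question with a number, here is a small sketch using two other Series methods, `nunique` and `value_counts`. The Series `ser_demo` is a short made-up stand-in for the column.

```python
import numpy as np
import pandas as pd

ser_demo = pd.Series(
    ["Manhattan", "Queens", np.nan, "Bronx", "Brooklyn", "Manhattan"]
)

# unique includes nan as one of the values
print(len(ser_demo.unique()))

# nunique skips missing values by default
print(ser_demo.nunique())

# value_counts shows how often each value appears;
# dropna=False keeps the missing values in the count
counts = ser_demo.value_counts(dropna=False)
print(counts)
```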

### Indexing the taxis dataset

__Motivation:__ Get the "pickup_borough" column using `loc` or `iloc`.

By far, the easiest way to get this column is how we did it above:

In [None]:
df["pickup_borough"]

0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

In [None]:
#First thing you might try, but doesn't work
#It's looking for a row with this name
df.loc["pickup_borough"]

KeyError: 'pickup_borough'

In [None]:
df.loc[:,"pickup_borough"]

0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object

If we now want to use `iloc`, we'll need to know the integer position of "pickup_borough"...

In [None]:
df.columns

Index(['pickup', 'dropoff', 'passengers', 'distance', 'fare', 'tip', 'tolls',
       'total', 'color', 'payment', 'pickup_zone', 'dropoff_zone',
       'pickup_borough', 'dropoff_borough'],
      dtype='object')

We see that it is integer location 12

In [None]:
df.iloc[:,12]

0       Manhattan
1       Manhattan
2       Manhattan
3       Manhattan
4       Manhattan
          ...    
6428    Manhattan
6429       Queens
6430     Brooklyn
6431     Brooklyn
6432     Brooklyn
Name: pickup_borough, Length: 6433, dtype: object
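
Rather than counting the position by eye, we could ask the column Index for it with `get_loc`. This is a sketch on an empty stand-in DataFrame (`df_demo`) that only has the same column names as the taxis data.

```python
import pandas as pd

columns = [
    "pickup", "dropoff", "passengers", "distance", "fare", "tip", "tolls",
    "total", "color", "payment", "pickup_zone", "dropoff_zone",
    "pickup_borough", "dropoff_borough",
]
df_demo = pd.DataFrame(columns=columns)

# get_loc returns the integer position of a label
pos = df_demo.columns.get_loc("pickup_borough")
print(pos)

# That position can then be passed to iloc
print(df_demo.iloc[:, pos].name)
```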

Some more (potentially) confusing examples

* If I index with a slice, it takes rows

In [None]:
df[:3]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan


* If I index with a single value instead of a slice, pandas looks for a column with that name

In [None]:
#There's no column called "3" in this case
df[3]

KeyError: 3

Question asked in the chat: how could we pick rows 234 to 236?

In [None]:
df[234:237]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
234,2019-03-06 21:56:51,2019-03-06 22:03:35,1,1.1,6.5,1.54,0.0,11.84,yellow,credit card,Gramercy,West Village,Manhattan,Manhattan
235,2019-03-20 23:19:55,2019-03-20 23:46:00,1,4.82,19.5,4.66,0.0,27.96,yellow,credit card,Union Sq,Financial District South,Manhattan,Manhattan
236,2019-03-15 19:33:40,2019-03-15 19:54:59,1,3.3,15.5,2.96,0.0,22.76,yellow,credit card,Upper East Side South,Gramercy,Manhattan,Manhattan


In [None]:
df.iloc[234:237,:]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
234,2019-03-06 21:56:51,2019-03-06 22:03:35,1,1.1,6.5,1.54,0.0,11.84,yellow,credit card,Gramercy,West Village,Manhattan,Manhattan
235,2019-03-20 23:19:55,2019-03-20 23:46:00,1,4.82,19.5,4.66,0.0,27.96,yellow,credit card,Union Sq,Financial District South,Manhattan,Manhattan
236,2019-03-15 19:33:40,2019-03-15 19:54:59,1,3.3,15.5,2.96,0.0,22.76,yellow,credit card,Upper East Side South,Gramercy,Manhattan,Manhattan


By slice, we mean indexing using `:` to get a large number of rows, for example, without having to type them all out by hand.

How could we get row 234?

In [None]:
#Cheat!
df[234:235]

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
234,2019-03-06 21:56:51,2019-03-06 22:03:35,1,1.1,6.5,1.54,0.0,11.84,yellow,credit card,Gramercy,West Village,Manhattan,Manhattan


In [None]:
df.iloc[234,:]

pickup             2019-03-06 21:56:51
dropoff            2019-03-06 22:03:35
passengers                           1
distance                           1.1
fare                               6.5
tip                               1.54
tolls                              0.0
total                            11.84
color                           yellow
payment                    credit card
pickup_zone                   Gramercy
dropoff_zone              West Village
pickup_borough               Manhattan
dropoff_borough              Manhattan
Name: 234, dtype: object

What if we wanted a couple different columns?

In [None]:
df.iloc[:,[5,9]]

Unnamed: 0,tip,payment
0,2.15,credit card
1,0.00,cash
2,2.36,credit card
3,6.15,credit card
4,1.10,credit card
...,...,...
6428,1.06,credit card
6429,0.00,credit card
6430,0.00,cash
6431,0.00,credit card


### Working with dates in pandas

In [None]:
df.head(3)

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan


* Convert the pickup and dropoff columns to datetime values using the pandas function `to_datetime`. Save the results as new columns in `df` named "picktime" and "droptime".

In [None]:
df["pickup"] #notice, just strings

0       2019-03-23 20:21:09
1       2019-03-04 16:11:55
2       2019-03-27 17:53:01
3       2019-03-10 01:23:59
4       2019-03-30 13:27:42
               ...         
6428    2019-03-31 09:51:53
6429    2019-03-31 17:38:00
6430    2019-03-23 22:55:18
6431    2019-03-04 10:09:25
6432    2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: object

In [None]:
pd.to_datetime(df["pickup"])

0      2019-03-23 20:21:09
1      2019-03-04 16:11:55
2      2019-03-27 17:53:01
3      2019-03-10 01:23:59
4      2019-03-30 13:27:42
               ...        
6428   2019-03-31 09:51:53
6429   2019-03-31 17:38:00
6430   2019-03-23 22:55:18
6431   2019-03-04 10:09:25
6432   2019-03-13 19:31:22
Name: pickup, Length: 6433, dtype: datetime64[ns]

* Notice, the dtype has changed
* Haven't actually made any changes to `df` yet...
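
Both bullet points can be checked on a tiny stand-in DataFrame. In this sketch, `demo` and `converted` are names made up for the example, and the two dates are copied from the first rows above.

```python
import pandas as pd

demo = pd.DataFrame({
    "pickup": ["2019-03-23 20:21:09", "2019-03-04 16:11:55"],
})

# to_datetime returns a new Series; it does not modify demo
converted = pd.to_datetime(demo["pickup"])

print(converted.dtype)        # a datetime dtype
print(demo["pickup"].dtype)   # the original column is unchanged
```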

Smaller examples to see how this conversion works:

In [None]:
today = "August 11, 2023"

In [None]:
type(today)

str

In [None]:
ts = pd.to_datetime(today)
ts

Timestamp('2023-08-11 00:00:00')

Once we have this object, there are all kinds of methods and attributes available to us...

In [None]:
ts.day_name()

'Friday'

In [None]:
ts.day_of_year

223

In [None]:
#Check all attributes and methods available...
dir(ts)

['__add__',
 '__array_priority__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__radd__',
 '__reduce__',
 '__reduce_cython__',
 '__reduce_ex__',
 '__repr__',
 '__rsub__',
 '__setattr__',
 '__setstate__',
 '__setstate_cython__',
 '__sizeof__',
 '__str__',
 '__sub__',
 '__subclasshook__',
 '__weakref__',
 '_date_repr',
 '_repr_base',
 '_round',
 '_short_repr',
 '_time_repr',
 'asm8',
 'astimezone',
 'ceil',
 'combine',
 'ctime',
 'date',
 'day',
 'day_name',
 'day_of_week',
 'day_of_year',
 'dayofweek',
 'dayofyear',
 'days_in_month',
 'daysinmonth',
 'dst',
 'floor',
 'fold',
 'freq',
 'freqstr',
 'fromisocalendar',
 'fromisoformat',
 'fromordinal',
 'fromtimestamp',
 'hour',
 'is_leap_year',
 'is_month_end',
 'is_month_start',
 'is_quarter_end',
 'is_quarter_start',
 ...

In [None]:
df["picktime"] = pd.to_datetime(df["pickup"])
df["droptime"] = pd.to_datetime(df["dropoff"])

In [None]:
#notice the new columns are all the way at the end
df

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough,picktime,droptime
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.60,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan,2019-03-23 20:21:09,2019-03-23 20:27:24
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.00,0.0,9.30,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan,2019-03-04 16:11:55,2019-03-04 16:19:00
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan,2019-03-27 17:53:01,2019-03-27 18:00:25
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.70,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan,2019-03-10 01:23:59,2019-03-10 01:49:51
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.10,0.0,13.40,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan,2019-03-30 13:27:42,2019-03-30 13:37:14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6428,2019-03-31 09:51:53,2019-03-31 09:55:27,1,0.75,4.5,1.06,0.0,6.36,green,credit card,East Harlem North,Central Harlem North,Manhattan,Manhattan,2019-03-31 09:51:53,2019-03-31 09:55:27
6429,2019-03-31 17:38:00,2019-03-31 18:34:23,1,18.74,58.0,0.00,0.0,58.80,green,credit card,Jamaica,East Concourse/Concourse Village,Queens,Bronx,2019-03-31 17:38:00,2019-03-31 18:34:23
6430,2019-03-23 22:55:18,2019-03-23 23:14:25,1,4.14,16.0,0.00,0.0,17.30,green,cash,Crown Heights North,Bushwick North,Brooklyn,Brooklyn,2019-03-23 22:55:18,2019-03-23 23:14:25
6431,2019-03-04 10:09:25,2019-03-04 10:14:29,1,1.12,6.0,0.00,0.0,6.80,green,credit card,East New York,East Flatbush/Remsen Village,Brooklyn,Brooklyn,2019-03-04 10:09:25,2019-03-04 10:14:29
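
Once the columns hold datetime values, the Timestamp tools from above are available for the whole column through the `.dt` accessor. The sketch below uses a two-row stand-in DataFrame (`demo`) with values copied from the first two rows; the `duration` and `minutes` names are made up for this example.

```python
import pandas as pd

demo = pd.DataFrame({
    "pickup": ["2019-03-23 20:21:09", "2019-03-04 16:11:55"],
    "dropoff": ["2019-03-23 20:27:24", "2019-03-04 16:19:00"],
})
demo["picktime"] = pd.to_datetime(demo["pickup"])
demo["droptime"] = pd.to_datetime(demo["dropoff"])

# .dt applies Timestamp methods to every entry of the column
day_names = demo["picktime"].dt.day_name()
print(day_names)

# Subtracting two datetime columns gives the length of each trip
duration = demo["droptime"] - demo["picktime"]
minutes = duration.dt.total_seconds() / 60
print(minutes)
```

Neither computation would work on the original string columns, which is the main reason for converting them.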

