### Introduction to Pandas 

Pandas is one of the most powerful and beginner-friendly Python packages for data analysis and manipulation. It provides data structures and functions that cover a wide range of analytical tasks. To use them, Pandas has to be loaded into our script with the **import** command, as shown below. We also define an alias for Pandas (we use pd), since we are going to refer to it many times in our code.

<img src="https://pandas.pydata.org/docs/_static/pandas.svg" width="200"/>

A little note about Pandas is that it uses object-oriented notation. There is no need to worry if you are not familiar with this concept, but a high-level intuition about object-oriented programming (OOP) helps in understanding how Pandas works. All data structures that store our data in the program are Pandas objects, each defined by a blueprint called a class. For example, a DataFrame (discussed below) is a data structure that stores data in a format similar to a spreadsheet. We can create a DataFrame object to store a table of our dataset, and this object comes with attributes that describe it and methods that operate on it.

Pandas usually comes pre-installed with Jupyter environments like the one you are using here. If Pandas is installed, you can check its current version with the ```pd.__version__``` command after the import command below:

*Note: pd is an alias*

In [1]:
import pandas as pd
pd.__version__

'1.4.2'
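
To make the object-oriented note above a little more concrete, here is a minimal sketch. The variable name `planets` is only illustrative; the point is the difference between an attribute (no parentheses) and a method (called with parentheses).

```python
import pandas as pd

# A Series is an object created from the pd.Series class.
planets = pd.Series([35, 67, 93], index=["Mercury", "Venus", "Earth"])

print(type(planets))   # the class the object was built from
print(planets.shape)   # attribute: stored information, no parentheses
print(planets.sum())   # method: a function attached to the object, called with ()
```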

### Data Structures in Pandas

We will learn the 2 fundamental data structures in Pandas: **Series** and **DataFrames**. 

<img src="series.png" width="300"/>

**Series** is a one-dimensional labelled array which can hold data of any type (integers, floats, strings, Python objects). An index labels each element along its axis. Index labels do not have to be unique, although unique labels make selection more predictable. A Series can be created from a list, a Python dictionary or even a scalar value. A pandas Series has a single dtype, which is `object` when the values are of mixed types.

Below, we will go through 2 of the ways to create a Pandas Series. The method below creates a Series out of a Python dictionary.

In [2]:
sample_dictionary = {'Mercury': 35, 'Venus': 67, 'Earth': 93}

sample_series = pd.Series(sample_dictionary)

In [3]:
# Print the series
sample_series

Mercury    35
Venus      67
Earth      93
dtype: int64

Note: if we do not specify the index, the default numeric range will be given.

We will now create a Series from a list, without giving the index. In the cell after that, we will provide the index while we create a Series out of a list.

In [4]:
sample_list = [35, 67, 93]

sample_series = pd.Series(sample_list)

sample_series

0    35
1    67
2    93
dtype: int64

In [5]:
sample_series = pd.Series(sample_list, index = ['Mercury', 'Venus', 'Earth'])

sample_series

Mercury    35
Venus      67
Earth      93
dtype: int64
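
For completeness, the sketch below shows the scalar case, which is not covered by the cells above, and what happens when index labels repeat. The names `constant_series` and `repeated` are made up for this illustration.

```python
import pandas as pd

# A scalar is repeated once for every label in the index.
constant_series = pd.Series(0, index=["Mercury", "Venus", "Earth"])

# pandas accepts repeated index labels; selecting one returns every match.
repeated = pd.Series([1, 2, 3], index=["a", "b", "a"])
matches = repeated["a"]

print(constant_series)
print(matches)
print(repeated.index.is_unique)
```

Selecting a repeated label gives back a Series rather than a single value, which is one reason unique labels are usually easier to work with.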

<hr style="border:2px solid gray">

Unlike Series, **DataFrames** are two-dimensional, consisting of rows and columns. They are similar to a spreadsheet or a SQL table. Pandas provides the functionality and methods to deal with data in a DataFrame. Each row in a DataFrame is labeled with an index, as in a Series, and each column has a label as well. In one of the upcoming sections, we will also have a look at multi-level indexing for DataFrames.

<img src="dataframe.png" width="500"/>

There are several ways to create or form a DataFrame. A DataFrame can be made of one or multiple series combined. It can also be formed out of lists and dictionaries. In the first cell, we will create a DataFrame using a Dictionary of lists. 

In the examples below, we are creating data for site views (in thousands) on a website per browser every year.

In [6]:
df = pd.DataFrame({
    'Chrome': [67, 74, 89],
    'Safari': [44, 58, 70],
    'Firefox': [8, 14, 16]
}, index = [2018, 2019, 2020])

df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,44,8
2019,74,58,14
2020,89,70,16


Below we will create a DataFrame using a list of dictionaries.

In [7]:
df = pd.DataFrame(
    [{"Chrome": 67, "Safari": 44, "Firefox":8 }, 
     {"Chrome": 74, "Safari": 58, "Firefox": 14},
    {"Chrome": 89, "Safari": 70, "Firefox": 16}]
)

df

Unnamed: 0,Chrome,Safari,Firefox
0,67,44,8
1,74,58,14
2,89,70,16


Note: If we do not provide the index, pandas defaults the index list to a range of integers beginning from 0.

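In the following cell we will create a DataFrame using a 2 dimensional list. Each inner list is read as one row, so with the lists written as below the result is the transpose of the previous tables: the first list becomes the 2018 row rather than the 'Chrome' column.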

In [8]:
df = pd.DataFrame(
    [[67, 74, 89],
     [44, 58, 70],
     [8, 14, 16]],
    index = [2018, 2019, 2020], columns = ["Chrome", "Safari", "Firefox"]
)

df

Unnamed: 0,Chrome,Safari,Firefox
2018,67,74,89
2019,44,58,70
2020,8,14,16
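
The text above mentions that a DataFrame can also be made of one or more Series combined. A minimal sketch of that route, reusing the site-view numbers from the earlier cells (the variable names are illustrative):

```python
import pandas as pd

years = [2018, 2019, 2020]
chrome = pd.Series([67, 74, 89], index=years)
safari = pd.Series([44, 58, 70], index=years)

# Each Series becomes one column; rows are aligned on the shared index labels.
browsers = pd.DataFrame({"Chrome": chrome, "Safari": safari})
print(browsers)
```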


### Files in Pandas

The pandas I/O API is a set of top level reader functions accessed like pandas.read_csv() that generally return a pandas object. 

Pandas offers reader functions which can read in a data file and return it as a pandas object. The most useful one, `read_csv()`, reads a text file and, with default arguments, converts it to a DataFrame. Pandas also offers `read_json()` and `read_excel()` to read JSON and Excel files, respectively, 2 of the most common file types for data.

We will use the `read_csv()` function to read the Chicago Police Traffic Stops data.

In [9]:
stops_df = pd.read_csv('idot/IDOT_2021.csv')

The [documentation](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) describes in detail the arguments that the `read_csv()` function accepts. Two of the most common ones are:

**filepath_or_buffer**: Either a path to a file or URL.
<br>**sep**: Delimiter to use. The default delimiter is `','` (for comma-separated files). Use `'\t'` for tab-separated files (tsv).
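
As a small illustration of both arguments, the sketch below reads tab-separated text from an in-memory buffer instead of a file on disk. The three-column sample is hypothetical and only borrows a few column names from the stops data.

```python
import io
import pandas as pd

# Hypothetical tab-separated data standing in for a .tsv file on disk.
tsv_text = "DATESTOP\tDURATION\tCITY_I\n1/1/21\t5\tCHICAGO\n1/1/21\t4\tCHICAGO\n"

# read_csv accepts a path, a URL, or a file-like buffer such as StringIO.
sample_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(sample_df)
```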

Typing and running the DataFrame name on a Jupyter cell displays the dataset with rows and columns truncated.

In [10]:
stops_df

Unnamed: 0,DATESTOP,TIMESTOP,DURATION,OFFNAME,OFFBDGE,CITY_I,STATE,VEHMAKE,VEHYEAR,YRBIRTH,...,DOGALERT,DOGALERTSRCH,DOGALERTSRCHCONTRA,DOGDRUG,DOGPARA,DOGALC,DOGWEAP,DOGSTOLPROP,DOGOTHER,DOGDRAMT
0,1/1/21,0:33,5,FIDEL LEGORRETA,5902.0,CHICAGO,IL,CHEVROLET,2017.0,1993,...,0,0,0,0,0,0,0,0,0,0
1,1/1/21,1:50,4,VICTOR PEREZ,7383.0,CHICAGO,IL,FORD,2012.0,1957,...,0,0,0,0,0,0,0,0,0,0
2,1/1/21,8:50,4,STEPHANIE ORTIGARA,18302.0,CHICAGO,IL,FORD,2007.0,1967,...,0,0,0,0,0,0,0,0,0,0
3,1/1/21,12:41,6,JASON ARROYO,14502.0,CHICAGO,IL,BMW,1998.0,1990,...,0,0,0,0,0,0,0,0,0,0
4,1/1/21,13:51,5,MONTY OWENS,11975.0,CHICAGO,IL,TOYOTA,2002.0,1945,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
377894,12/30/21,23:40,2,MATTHEW DRINNAN,13585.0,CHICAGO,IL,HYUNDAI,2020.0,1964,...,0,0,0,0,0,0,0,0,0,0
377895,12/31/21,20:35,5,EMMANUEL GARCIA,19038.0,CHICAGO,IL,NISSAN,2007.0,1973,...,0,0,0,0,0,0,0,0,0,0
377896,12/31/21,20:31,21,LUIS NUNEZ,18229.0,CHICAGO HEIGHTS,IL,HONDA,2007.0,1980,...,0,0,0,0,0,0,0,0,0,0
377897,12/31/21,21:33,2,GUSTAVO DOMINGUEZ,15235.0,CHICAGO,IL,KIA,2014.0,1982,...,0,0,0,0,0,0,0,0,0,0


This dataset is huge! Pandas also offers methods to view a small sample of a Series or DataFrame object. The `head()` and `tail()` methods enable you to do so showing you the first set of rows and last set of rows, respectively. By default, they display 5 rows. The argument to these methods can be changed to the number of rows needed to be displayed.

In [11]:
# View first 5 rows
stops_df.head()

Unnamed: 0,DATESTOP,TIMESTOP,DURATION,OFFNAME,OFFBDGE,CITY_I,STATE,VEHMAKE,VEHYEAR,YRBIRTH,...,DOGALERT,DOGALERTSRCH,DOGALERTSRCHCONTRA,DOGDRUG,DOGPARA,DOGALC,DOGWEAP,DOGSTOLPROP,DOGOTHER,DOGDRAMT
0,1/1/21,0:33,5,FIDEL LEGORRETA,5902.0,CHICAGO,IL,CHEVROLET,2017.0,1993,...,0,0,0,0,0,0,0,0,0,0
1,1/1/21,1:50,4,VICTOR PEREZ,7383.0,CHICAGO,IL,FORD,2012.0,1957,...,0,0,0,0,0,0,0,0,0,0
2,1/1/21,8:50,4,STEPHANIE ORTIGARA,18302.0,CHICAGO,IL,FORD,2007.0,1967,...,0,0,0,0,0,0,0,0,0,0
3,1/1/21,12:41,6,JASON ARROYO,14502.0,CHICAGO,IL,BMW,1998.0,1990,...,0,0,0,0,0,0,0,0,0,0
4,1/1/21,13:51,5,MONTY OWENS,11975.0,CHICAGO,IL,TOYOTA,2002.0,1945,...,0,0,0,0,0,0,0,0,0,0


In [12]:
# View last 3 rows
stops_df.tail(3)

Unnamed: 0,DATESTOP,TIMESTOP,DURATION,OFFNAME,OFFBDGE,CITY_I,STATE,VEHMAKE,VEHYEAR,YRBIRTH,...,DOGALERT,DOGALERTSRCH,DOGALERTSRCHCONTRA,DOGDRUG,DOGPARA,DOGALC,DOGWEAP,DOGSTOLPROP,DOGOTHER,DOGDRAMT
377896,12/31/21,20:31,21,LUIS NUNEZ,18229.0,CHICAGO HEIGHTS,IL,HONDA,2007.0,1980,...,0,0,0,0,0,0,0,0,0,0
377897,12/31/21,21:33,2,GUSTAVO DOMINGUEZ,15235.0,CHICAGO,IL,KIA,2014.0,1982,...,0,0,0,0,0,0,0,0,0,0
377898,12/31/21,21:20,4,FRANK GIANAKAKIS,6934.0,CHICAGO,IL,TOYOTA,2015.0,1987,...,0,0,0,0,0,0,0,0,0,0
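
The same two methods work on a Series. A minimal sketch on made-up data (the integers 0 to 99):

```python
import pandas as pd

numbers = pd.Series(range(100))  # hypothetical data: the integers 0 to 99

first_two = numbers.head(2)    # first 2 elements
last_three = numbers.tail(3)   # last 3 elements
print(first_two.tolist(), last_three.tolist())
```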


### Basic metadata

Pandas enables you to access basic metadata for its objects through a few attributes. The important ones we are going to learn here are **shape**, **dtypes** and the axis labels.

**Shape** gives the axis dimensions of the object. For a DataFrame, it will give the number of rows and columns.

In [13]:
stops_df.shape

(377899, 55)

**dtypes** will give the data type for each column in a DataFrame. We are going to learn more about data types in the upcoming sections.

In [14]:
stops_df.dtypes

DATESTOP               object
TIMESTOP               object
DURATION                int64
OFFNAME                object
OFFBDGE               float64
CITY_I                 object
STATE                  object
VEHMAKE                object
VEHYEAR               float64
YRBIRTH                 int64
DRSEX                   int64
DRRACE                float64
REASSTOP              float64
TYPEMOV               float64
RESSTOP                 int64
BEAT_I                  int64
VEHCONSREQ              int64
VEHCONSGIV              int64
VEHSRCHCOND             int64
VEHSRCHCONDBY           int64
VEHCONTRA               int64
VEHDRUGS                int64
VEHPARA                 int64
VEHALC                  int64
VEHWEAP                 int64
VEHSTOLPROP             int64
VEHOTHER                int64
VEHDRAMT                int64
DRCONSREQ               int64
DRCONSGIV               int64
DRVSRCHCOND             int64
DRVSRCHCONDBY           int64
PASSCONSREQ             int64
PASSCONSGI

Pandas objects have axis labels. We have seen that each element in a Series is labeled through an index. A DataFrame, in turn, has one set of labels for its rows and another set for its columns. We can access these labels for a DataFrame using the `.index` and `.columns` attributes.

In [15]:
stops_df.columns

Index(['DATESTOP', 'TIMESTOP', 'DURATION', 'OFFNAME', 'OFFBDGE', 'CITY_I',
       'STATE', 'VEHMAKE', 'VEHYEAR', 'YRBIRTH', 'DRSEX', 'DRRACE', 'REASSTOP',
       'TYPEMOV', 'RESSTOP', 'BEAT_I', 'VEHCONSREQ', 'VEHCONSGIV',
       'VEHSRCHCOND', 'VEHSRCHCONDBY', 'VEHCONTRA', 'VEHDRUGS', 'VEHPARA',
       'VEHALC', 'VEHWEAP', 'VEHSTOLPROP', 'VEHOTHER', 'VEHDRAMT', 'DRCONSREQ',
       'DRCONSGIV', 'DRVSRCHCOND', 'DRVSRCHCONDBY', 'PASSCONSREQ',
       'PASSCONSGIV', 'PASSSRCHCOND', 'PASSSRCHCONDBY', 'PASSDRVCONTRA',
       'PASSDRVDRUGS', 'PASSDRVPARA', 'PASSDRVALC', 'PASSDRVWEAP',
       'PASSDRVSTOLPROP', 'PASSDRVOTHER', 'PASSDRVDRAMT', 'DOGPERFSNIFF',
       'DOGALERT', 'DOGALERTSRCH', 'DOGALERTSRCHCONTRA', 'DOGDRUG', 'DOGPARA',
       'DOGALC', 'DOGWEAP', 'DOGSTOLPROP', 'DOGOTHER', 'DOGDRAMT'],
      dtype='object')

In [16]:
stops_df.index

RangeIndex(start=0, stop=377899, step=1)

As you can notice, since the dataset did not provide an index, Pandas set the default index to a range of integers from 0 up to, but not including, the length of the DataFrame.
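
The attributes from this section can be tried out together on a small table. In this sketch the `Share` column holds made-up values and exists only to show a second dtype.

```python
import pandas as pd

views = pd.DataFrame(
    {"Chrome": [67, 74, 89], "Share": [0.56, 0.51, 0.51]},  # Share is hypothetical
    index=[2018, 2019, 2020],
)

n_rows, n_cols = views.shape    # shape is a (rows, columns) tuple
print(n_rows, n_cols)
print(views.dtypes)             # one dtype per column
print(views.index.tolist())     # row labels
print(views.columns.tolist())   # column labels
```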

### Selecting and Subsetting

Pandas offers a few ways that we can use to select a set of data which can include a cell, row, column or a subset of the entire dataframe. The 3 ways that we will go through are selection using `[]`, `.loc` (label based indexing) and `.iloc` (position based indexing).

Using `[]` is the most basic way of indexing and mimics the way we index dictionaries in Python. In Pandas, it selects a lower-dimensional piece of the object. For the Series below, we use an index label to select a single element.

In [17]:
sample_series['Mercury']

35

Similarly, we can use this notation to index a column in a DataFrame. This will result in a Series.

In [18]:
df['Chrome']

2018    67
2019    44
2020     8
Name: Chrome, dtype: int64

While you may still find this notation being used for indexing DataFrames and Series, it can be ambiguous, because the same brackets select columns in some cases and rows in others. The purpose of bringing it up here was to provide familiarity with this notation. We will therefore focus on `.loc` for selection, which is a more powerful and explicit indexer.

In the cell below, the integer 2 is an index label, so `.loc` returns the row whose label is 2.

In [19]:
stops_df.loc[2]

DATESTOP                          1/1/21
TIMESTOP                            8:50
DURATION                               4
OFFNAME               STEPHANIE ORTIGARA
OFFBDGE                          18302.0
CITY_I                           CHICAGO
STATE                                 IL
VEHMAKE                             FORD
VEHYEAR                           2007.0
YRBIRTH                             1967
DRSEX                                  1
DRRACE                               2.0
REASSTOP                             3.0
TYPEMOV                              NaN
RESSTOP                                3
BEAT_I                               332
VEHCONSREQ                             2
VEHCONSGIV                             0
VEHSRCHCOND                            2
VEHSRCHCONDBY                          0
VEHCONTRA                              0
VEHDRUGS                               0
VEHPARA                                0
VEHALC                                 0
VEHWEAP         

We can also print out a set of rows using slices. `:2` is shorthand for `0:2` and refers to the index labels from 0 to 2. Unlike ordinary Python slices, a `.loc` slice includes the end label.

Note that the index labels are integers in this DataFrame. This method also works well with a non-integer index. For example, we can also use slicing with string labels. In that case, pandas will return all the rows between the rows whose labels are specified in the slice, including the specified rows as well.

In [20]:
stops_df.loc[:2]

Unnamed: 0,DATESTOP,TIMESTOP,DURATION,OFFNAME,OFFBDGE,CITY_I,STATE,VEHMAKE,VEHYEAR,YRBIRTH,...,DOGALERT,DOGALERTSRCH,DOGALERTSRCHCONTRA,DOGDRUG,DOGPARA,DOGALC,DOGWEAP,DOGSTOLPROP,DOGOTHER,DOGDRAMT
0,1/1/21,0:33,5,FIDEL LEGORRETA,5902.0,CHICAGO,IL,CHEVROLET,2017.0,1993,...,0,0,0,0,0,0,0,0,0,0
1,1/1/21,1:50,4,VICTOR PEREZ,7383.0,CHICAGO,IL,FORD,2012.0,1957,...,0,0,0,0,0,0,0,0,0,0
2,1/1/21,8:50,4,STEPHANIE ORTIGARA,18302.0,CHICAGO,IL,FORD,2007.0,1967,...,0,0,0,0,0,0,0,0,0,0
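
The string-label case mentioned above can be sketched on a small table. The labels `a` to `d` and the fourth value are hypothetical and serve only to show that both endpoints are included.

```python
import pandas as pd

views = pd.DataFrame(
    {"Chrome": [67, 74, 89, 95]},   # the last value is made up
    index=["a", "b", "c", "d"],     # hypothetical string labels
)

# Label slices include both endpoints: rows 'b', 'c' and 'd' are returned.
subset = views.loc["b":"d"]
print(subset)
```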


Using the `.loc` attribute, we can also select the second dimension (column for a DataFrame). In the cell below, we are selecting all rows and the 'OFFNAME' column.

In [21]:
stops_df.loc[: ,'OFFNAME']

0            FIDEL LEGORRETA
1               VICTOR PEREZ
2         STEPHANIE ORTIGARA
3               JASON ARROYO
4                MONTY OWENS
                 ...        
377894       MATTHEW DRINNAN
377895       EMMANUEL GARCIA
377896            LUIS NUNEZ
377897     GUSTAVO DOMINGUEZ
377898      FRANK GIANAKAKIS
Name: OFFNAME, Length: 377899, dtype: object

By specifying a single value in both the dimensions, we can select the specific cell as well.

In [22]:
stops_df.loc[5 ,'OFFNAME']

'MONTY OWENS'

Or select a range of rows over multiple columns.

In [23]:
stops_df.loc[55:60, ['OFFNAME','DRSEX']]

Unnamed: 0,OFFNAME,DRSEX
55,ESTEBAN PEREZ,1
56,BRENDAN CIMAGLIA,1
57,RAUL GONZALEZ,1
58,HAMZEH SUWI,2
59,JESSICA BAUTISTA,2
60,NAWACHUD BURAKETRACHAKUL,2


The selection can also be a list of indices rather than a range.

In [24]:
stops_df.loc[[55,78,903], ['OFFNAME','DRSEX']]

Unnamed: 0,OFFNAME,DRSEX
55,ESTEBAN PEREZ,1
78,MICHAEL MANNION,1
903,LUIS HUESCA,2


Another useful approach that comes in handy when manipulating data is selection using boolean conditions. In the cell below, the expression before the comparison operator selects all rows of the 'DURATION' column. We then compare each 'DURATION' value and return `True` if the value is strictly less than 5. This returns a boolean Series.

In [25]:
stops_df.loc[:, 'DURATION'] < 5

0         False
1          True
2          True
3         False
4         False
          ...  
377894     True
377895    False
377896    False
377897     True
377898     True
Name: DURATION, Length: 377899, dtype: bool

The above comparison can then be used as a filter that keeps only the rows for which it is True. In the cell below, we use the same comparison to return the rows of the DataFrame that have a stop duration of less than 5. Notice that we then subset the result to provide only the officers' names for those rows.

In [26]:
stops_df.loc[stops_df.loc[:,'DURATION'] < 5, 'OFFNAME']

1               VICTOR PEREZ
2         STEPHANIE ORTIGARA
5                MONTY OWENS
8                CARLOS MOTA
9            MARCUS LYLES JR
                 ...        
377891         JOSEPH ALFARO
377892       KRYSTAL TROTTER
377894       MATTHEW DRINNAN
377897     GUSTAVO DOMINGUEZ
377898      FRANK GIANAKAKIS
Name: OFFNAME, Length: 198861, dtype: object
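
Boolean Series like the one above can also be combined. The sketch below uses a hypothetical four-row stand-in for the stops data (the names and values are made up) and the element-wise operators `&` (and) and `|` (or).

```python
import pandas as pd

# Hypothetical miniature version of the stops data.
mini_stops = pd.DataFrame({
    "OFFNAME": ["A", "B", "C", "D"],
    "DURATION": [5, 4, 2, 21],
    "DRSEX": [1, 2, 1, 2],
})

short_stop = mini_stops.loc[:, "DURATION"] < 5
sex_is_1 = mini_stops.loc[:, "DRSEX"] == 1

# & keeps rows where both conditions hold, | keeps rows where at least one holds.
both = mini_stops.loc[short_stop & sex_is_1, "OFFNAME"]
either = mini_stops.loc[short_stop | sex_is_1, "OFFNAME"]
print(both.tolist(), either.tolist())
```

When the conditions are written inline rather than stored in variables, each comparison needs its own parentheses, because `&` and `|` bind more tightly than `<` and `==`.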

The last form of selection mentioned above is selection by position using `.iloc`. This indexer uses integers that mark the position of a row or column, counting from 0.

In [27]:
stops_df.iloc[8]

DATESTOP                        1/1/21
TIMESTOP                         17:26
DURATION                             2
OFFNAME                    CARLOS MOTA
OFFBDGE                         3548.0
CITY_I                GLENDALE HEIGHTS
STATE                               IL
VEHMAKE                          LEXUS
VEHYEAR                         2014.0
YRBIRTH                           1980
DRSEX                                1
DRRACE                             5.0
REASSTOP                           1.0
TYPEMOV                            6.0
RESSTOP                              3
BEAT_I                            1824
VEHCONSREQ                           2
VEHCONSGIV                           0
VEHSRCHCOND                          2
VEHSRCHCONDBY                        0
VEHCONTRA                            0
VEHDRUGS                             0
VEHPARA                              0
VEHALC                               0
VEHWEAP                              0
VEHSTOLPROP              

In this dataset, the index labels are sorted integers starting at 0, so each label equals the row's position. In this specific case, selecting a single row with either `.loc` or `.iloc` therefore returns the same row. Slices still differ: `.loc` includes the end label, while `.iloc` excludes the end position. The columns, however, have string labels, so with `.iloc` we have to provide integers that mark the position of the column in the DataFrame. Providing a column name or any other non-integer value to `.iloc` will raise an error.

In the cell below, we are selecting the rows at positions 2 through 5 (the end position 6 is excluded) and the column at position 4. Because positions are counted from 0, this is the 'OFFBDGE' column.
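
The difference between labels and positions is easier to see when the index is not a default range. A minimal sketch, reusing the site-view numbers from earlier with years as row labels:

```python
import pandas as pd

views = pd.DataFrame(
    {"Chrome": [67, 74, 89], "Safari": [44, 58, 70]},
    index=[2018, 2019, 2020],
)

by_label = views.loc[2019, "Safari"]   # row labeled 2019, column labeled 'Safari'
by_position = views.iloc[1, 1]         # second row, second column
print(by_label, by_position)

# views.loc[1] would raise a KeyError here, because no row is labeled 1.
```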

In [28]:
stops_df.iloc[2:6, 4]

2    18302.0
3    14502.0
4    11975.0
5    11975.0
Name: OFFBDGE, dtype: float64

We can also select a range for rows as well as for columns.

In [29]:
stops_df.iloc[2:6, 4:8]

Unnamed: 0,OFFBDGE,CITY_I,STATE,VEHMAKE
2,18302.0,CHICAGO,IL,FORD
3,14502.0,CHICAGO,IL,BMW
4,11975.0,CHICAGO,IL,TOYOTA
5,11975.0,CHICAGO,IL,HONDA


Note: Out-of-range slice indexes are handled gracefully, just as in Python. Pandas does not raise an error but returns whatever part of the slice is in range. If the entire slice is out of range, the result is an empty Series or DataFrame.
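
A short sketch of this behaviour on a three-row table (the variable names are illustrative):

```python
import pandas as pd

views = pd.DataFrame({"Chrome": [67, 74, 89]})

partly_out = views.iloc[1:10]   # only positions 1 and 2 exist
fully_out = views.iloc[5:10]    # no such positions: empty DataFrame
print(len(partly_out), len(fully_out))

# A single out-of-range position is different: views.iloc[5] raises IndexError.
```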


### Data types and conversion

We saw above how the attribute `dtypes` provides us with the data type for each column in a DataFrame. Pandas has a set of data types that it understands and manipulates. It also supports extension types, which pandas itself and third-party libraries can define.

We will touch upon the most basic and common ones that you saw in the result returned from running the command `stops_df.dtypes`. In brief, they include:

- integer: the default integer type is int64, which cannot hold missing values; the nullable variant is spelled Int64
- float: the default float type is float64
- boolean: stores boolean data (True/False); the nullable `boolean` type can also hold missing values
- date/time: datetime64[ns] format timestamp
- object:  holds any Python object, including strings
- string: dedicated to strings
- category: stores limited, fixed number of possible values
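
Several of the types in this list have to be requested explicitly. The sketch below builds a small, entirely hypothetical table with the nullable integer, nullable boolean, string and category types and prints the resulting dtypes.

```python
import pandas as pd

typed = pd.DataFrame({
    "count": pd.Series([1, 2, None], dtype="Int64"),          # nullable integer
    "flag": pd.Series([True, None, False], dtype="boolean"),  # nullable boolean
    "name": pd.Series(["a", "b", None], dtype="string"),      # dedicated string type
    "group": pd.Series(["x", "y", "x"], dtype="category"),    # fixed set of values
})
print(typed.dtypes)
```

Without the explicit `dtype`, the first column would become float64 (the missing value forces a float) and the next two would become `object`.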

We can print the data types for the stops DataFrame again and see that, by default, Pandas doesn't always choose the data types that are most useful for us. For example, DATESTOP and TIMESTOP are object types (stored as strings), which prevents us from performing several temporal operations. (We will cover datetime in detail in the following section.)

Similarly, DRRACE and DRSEX hold codes that indicate the category of the driver's race and sex, respectively. However, since these codes are stored as float and integer, respectively, they can be mistaken for quantities, which can mislead our analysis and any prediction we might want to make.

In [30]:
stops_df.iloc[:, :15].dtypes

DATESTOP     object
TIMESTOP     object
DURATION      int64
OFFNAME      object
OFFBDGE     float64
CITY_I       object
STATE        object
VEHMAKE      object
VEHYEAR     float64
YRBIRTH       int64
DRSEX         int64
DRRACE      float64
REASSTOP    float64
TYPEMOV     float64
RESSTOP       int64
dtype: object

The `astype` method enables us to convert data types to the desired ones. A convenient way of performing this operation is to pass a dictionary with the column name whose type needs to be changed as key and the desired data type as value. We are then assigning the returned result to the original DataFrame to replace it with these new changes.

In [31]:
stops_df = stops_df.astype({'OFFBDGE': 'object', 'DRRACE': 'category'})

Let's use `dtypes` to check the new assigned data types

In [33]:
stops_df.iloc[:, :15].dtypes

DATESTOP      object
TIMESTOP      object
DURATION       int64
OFFNAME       object
OFFBDGE       object
CITY_I        object
STATE         object
VEHMAKE       object
VEHYEAR      float64
YRBIRTH        int64
DRSEX          int64
DRRACE      category
REASSTOP     float64
TYPEMOV      float64
RESSTOP        int64
dtype: object
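
Once a column is a category, its distinct values can be inspected through the `.cat` accessor. A minimal sketch on a hypothetical four-row table (the codes are made up and only stand in for a coded column such as DRRACE):

```python
import pandas as pd

# Hypothetical codes standing in for a coded column such as DRRACE.
mini = pd.DataFrame({"DRRACE": [2.0, 5.0, 2.0, 1.0], "DURATION": [5, 4, 2, 21]})

mini = mini.astype({"DRRACE": "category"})

print(mini.dtypes)
print(mini["DRRACE"].cat.categories.tolist())   # the distinct values, sorted
print(mini["DRRACE"].value_counts().to_dict())  # how often each category occurs
```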

### Datetime in Pandas

Temporal data (consisting of date and time stamps) is highly useful in data analytics and can result in useful features. Pandas offers the ability to parse time series information from various sources and formats. This can enable us to perform several operations on datetime values such as sorting, predicting or categorizing data by certain time periods.

One of the most convenient datetime concepts in Pandas is the date time: a specific date and time, with optional timezone support. Its dtype is displayed as datetime64[ns].

Series and DataFrame have extended data type support and functionality for datetime. To convert object (string) values in a DataFrame to date-like objects, we will use the `to_datetime` function.

In [35]:
pd.to_datetime("2010/11/12", format="%Y/%m/%d")

Timestamp('2010-11-12 00:00:00')

Applying it to the 'DATESTOP' column of the stops DataFrame gives the following result. Note that the cell below only displays the converted Series; to keep it, the result has to be assigned back to the column.

In [39]:
pd.to_datetime(stops_df['DATESTOP'])

0        2021-01-01
1        2021-01-01
2        2021-01-01
3        2021-01-01
4        2021-01-01
            ...    
377894   2021-12-30
377895   2021-12-31
377896   2021-12-31
377897   2021-12-31
377898   2021-12-31
Name: DATESTOP, Length: 377899, dtype: datetime64[ns]

In addition to the required datetime string, a format argument can be passed to ensure specific parsing. This could also potentially speed up the conversion considerably.
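
A sketch of the `format` argument on made-up date strings written in the same month/day/two-digit-year style as DATESTOP. The `errors="coerce"` option is an addition not used elsewhere in this notebook; it turns values that cannot be parsed into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical date strings in the same style as the DATESTOP column.
raw_dates = pd.Series(["1/1/21", "12/31/21", "not a date"])

# %m/%d/%y spells out month, day and two-digit year.
# errors="coerce" turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y", errors="coerce")
print(parsed)
```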

There are several time/date properties that one can access from a Timestamp or a collection of timestamps such as a DatetimeIndex. For a Series of datetimes, these properties are reached through the `.dt` accessor. We will use it to extract the month from our dataset.

In [41]:
pd.to_datetime(stops_df['DATESTOP']).dt.month

0          1
1          1
2          1
3          1
4          1
          ..
377894    12
377895    12
377896    12
377897    12
377898    12
Name: DATESTOP, Length: 377899, dtype: int64
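
To keep the converted dates and derived values, they can be assigned to columns. The sketch below does this on a hypothetical three-row table; the `MONTH` and `WEEKDAY` column names are made up for the illustration.

```python
import pandas as pd

mini = pd.DataFrame({"DATESTOP": ["1/1/21", "6/15/21", "12/31/21"]})  # hypothetical rows

# Assign the converted column back so later cells can reuse it.
mini["DATESTOP"] = pd.to_datetime(mini["DATESTOP"], format="%m/%d/%y")

mini["MONTH"] = mini["DATESTOP"].dt.month
mini["WEEKDAY"] = mini["DATESTOP"].dt.day_name()
print(mini)
```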