# <p style="padding:10px;background-color:#85BB65;margin:0;color:white;font-family:newtimeroman;font-size:150%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:500">Parsing Dates</p>


In [2]:
# modules we'll use
import pandas as pd
import numpy as np
import seaborn as sns
import datetime

In [4]:
df = pd.read_csv("earthquakes.csv")
df.head()

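| Unnamed: 0 | Date | Time | Latitude | Longitude | Type | Depth | Depth Error | Depth Seismic Stations | Magnitude | Magnitude Type | ... | Magnitude Seismic Stations | Azimuthal Gap | Horizontal Distance | Horizontal Error | Root Mean Square | ID | Source | Location Source | Magnitude Source | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/02/1965 | 13:44:18 | 19.246 | 145.616 | Earthquake | 131.6 | NaN | NaN | 6.0 | MW | ... | NaN | NaN | NaN | NaN | NaN | ISCGEM860706 | ISCGEM | ISCGEM | ISCGEM | Automatic |
| 1 | 01/04/1965 | 11:29:49 | 1.863 | 127.352 | Earthquake | 80.0 | NaN | NaN | 5.8 | MW | ... | NaN | NaN | NaN | NaN | NaN | ISCGEM860737 | ISCGEM | ISCGEM | ISCGEM | Automatic |
| 2 | 01/05/1965 | 18:05:58 | -20.579 | -173.972 | Earthquake | 20.0 | NaN | NaN | 6.2 | MW | ... | NaN | NaN | NaN | NaN | NaN | ISCGEM860762 | ISCGEM | ISCGEM | ISCGEM | Automatic |
| 3 | 01/08/1965 | 18:49:43 | -59.076 | -23.557 | Earthquake | 15.0 | NaN | NaN | 5.8 | MW | ... | NaN | NaN | NaN | NaN | NaN | ISCGEM860856 | ISCGEM | ISCGEM | ISCGEM | Automatic |
| 4 | 01/09/1965 | 13:32:50 | 11.938 | 126.427 | Earthquake | 15.0 | NaN | NaN | 5.8 | MW | ... | NaN | NaN | NaN | NaN | NaN | ISCGEM860890 | ISCGEM | ISCGEM | ISCGEM | Automatic |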


**We'll be working with the "Date" column from the dataframe. Let's make sure it actually looks like it contains dates.**



In [3]:
# print the first few rows of the date column
print(df['Date'].head())

0    01/02/1965
1    01/04/1965
2    01/05/1965
3    01/08/1965
4    01/09/1965
Name: Date, dtype: object


 #### **<mark style="background-color:#85BB65;color:white;border-radius:5px;opacity:1.0">Notice that</mark>**
**At the bottom of the output of head(), the data type of this column is listed as "object". That means pandas is storing the entries as generic Python objects (here, strings) rather than as dates. So, let's fix that.**
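
A quick way to confirm this, sketched on a small hand-made Series rather than the real dataframe (the values below are illustrative stand-ins for the "Date" column):

```python
import pandas as pd

# Toy stand-in for the "Date" column (hypothetical values).
dates = pd.Series(["01/02/1965", "01/04/1965"], name="Date")

# Ask pandas whether the column holds datetime values.
is_datetime = pd.api.types.is_datetime64_any_dtype(dates)
print(is_datetime)

# Each entry is still a plain Python string.
first_type = type(dates.iloc[0])
print(first_type)
```

Until the column is parsed, pandas has no notion of day, month, or year for these entries.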

# **<span style='color:#85BB65'> Convert our date columns to datetime: </span>**

**Most of the entries in the "Date" column follow the same format: "month/day/four-digit year". However, the entry at index 3378 follows a completely different pattern.**

**This appears to be a data entry issue: ideally, all entries in the column would have the same format. We can get an idea of how widespread the issue is by checking the length of each entry in the "Date" column.**

In [4]:
date_lengths = df['Date'].str.len()
date_lengths.value_counts()

10    23409
24        3
Name: Date, dtype: int64

**It looks like two more rows, besides the one at index 3378, have a date in a different format. Let's find the indices of those rows and print the data.**


In [5]:
# positions of the entries whose length is 24 instead of 10
indices = np.where(date_lengths == 24)[0]
print('Indices with corrupted data:', indices)
df.loc[indices]

Indices with corrupted data: [ 3378  7512 20650]


| Unnamed: 0 | Date | Time | Latitude | Longitude | Type | Depth | Depth Error | Depth Seismic Stations | Magnitude | Magnitude Type | ... | Magnitude Seismic Stations | Azimuthal Gap | Horizontal Distance | Horizontal Error | Root Mean Square | ID | Source | Location Source | Magnitude Source | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3378 | 1975-02-23T02:58:41.000Z | 1975-02-23T02:58:41.000Z | 8.017 | 124.075 | Earthquake | 623.0 | NaN | NaN | 5.6 | MB | ... | NaN | NaN | NaN | NaN | NaN | USP0000A09 | US | US | US | Reviewed |
| 7512 | 1985-04-28T02:53:41.530Z | 1985-04-28T02:53:41.530Z | -32.998 | -71.766 | Earthquake | 33.0 | NaN | NaN | 5.6 | MW | ... | NaN | NaN | NaN | NaN | 1.3 | USP0002E81 | US | US | HRV | Reviewed |
| 20650 | 2011-03-13T02:23:34.520Z | 2011-03-13T02:23:34.520Z | 36.344 | 142.344 | Earthquake | 10.1 | 13.9 | 289.0 | 5.8 | MWC | ... | NaN | 32.3 | NaN | NaN | 1.06 | USP000HWQP | US | US | GCMT | Reviewed |


In [6]:
df.loc[3378, "Date"] = "02/23/1975"
df.loc[7512, "Date"] = "04/28/1985"
df.loc[20650, "Date"] = "03/13/2011"

**Now all entries follow the same pattern.**
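
Typing the corrected values by hand works for three rows. With more of them, a programmatic rewrite may be easier. The following is a minimal sketch on toy data, assuming every odd entry is an ISO 8601 timestamp whose first 10 characters are the date; the Series and variable names are illustrative, not part of the notebook:

```python
import pandas as pd

# Toy stand-in for the "Date" column: two regular entries and one ISO 8601 entry.
dates = pd.Series(["01/02/1965", "1975-02-23T02:58:41.000Z", "01/09/1965"])

# Flag the entries that are 24 characters long instead of 10.
is_iso = dates.str.len() == 24

# Keep only the date part (first 10 characters) of the flagged entries and parse it.
iso_parsed = pd.to_datetime(dates[is_iso].str[:10], format="%Y-%m-%d")

# Write the flagged entries back as month/day/year strings.
dates.loc[is_iso] = iso_parsed.dt.strftime("%m/%d/%Y")
print(dates.tolist())
```

After this step every entry in the toy Series has the same 10-character layout, which is the state the manual fix produces for the real column.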

Now that we know that our date column isn't being recognized as a date, it's time to convert it so that it is recognized as a date. This is called **<span style='color:#85BB65'> parsing dates </span>** because we're taking in a string and identifying its component parts.

The basic idea is that you need to point out which parts of the date are where and what punctuation is between them. There are lots of possible parts of a date, but the most common are `%d` for day, `%m` for month, `%y` for a two-digit year and `%Y` for a four-digit year.

#### **<mark style="background-color:#85BB65;color:white;border-radius:5px;opacity:1.0">examples</mark>**


* **1/17/07 has the format "%m/%d/%y"**
* **17-1-2007 has the format "%d-%m-%Y"**
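
As a small check of the two examples above, both strings can be parsed with `pd.to_datetime` and compared. This is only a sketch on single strings, not on the earthquake data:

```python
import pandas as pd

# Month/day/two-digit year, separated by slashes.
a = pd.to_datetime("1/17/07", format="%m/%d/%y")

# Day-month-four-digit year, separated by hyphens.
b = pd.to_datetime("17-1-2007", format="%d-%m-%Y")

# Both strings describe the same calendar day.
same_day = a == b
print(a, b, same_day)
```

The format string has to mirror the punctuation of the input exactly, which is why the first uses `/` and the second uses `-`.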

In [7]:
# create a new column, date_parsed, with the parsed dates
df['date_parsed'] = pd.to_datetime(df['Date'], format="%m/%d/%Y")

In [8]:
# print the first few rows
df['date_parsed'].head()

0   1965-01-02
1   1965-01-04
2   1965-01-05
3   1965-01-08
4   1965-01-09
Name: date_parsed, dtype: datetime64[ns]

**You can see that the dtype is now datetime64, and the dates have been rearranged to fit the default order for datetime objects (year-month-day).**

### **<span style='color:#85BB65'> What if I run into an error with multiple date formats? </span>**

Sometimes you'll run into an error when a single column contains multiple date formats. In that case, you can have pandas try to infer what the right date format should be, like so:

`df['date_parsed'] = pd.to_datetime(df['Date'], infer_datetime_format=True)`
 
### **<span style='color:#85BB65'> Why don't you always use infer_datetime_format = True? </span>**

There are two big reasons:

* **pandas won't always be able to figure out the correct date format.**
* **It's much slower than specifying the exact format of the dates.**
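
When the formats in a column are known, one alternative to inference is to parse each format explicitly and combine the results. This also sidesteps the fact that `infer_datetime_format` is deprecated in pandas 2.0 and later. The sketch below uses toy data and illustrative variable names, and assumes only two formats occur:

```python
import pandas as pd

# Toy stand-in for a column with two known formats (hypothetical values).
dates = pd.Series(["01/02/1965", "1975-02-23T02:58:41.000Z", "01/09/1965"])

# Pass 1: month/day/year. Entries that do not match become NaT instead of raising.
us_style = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")

# Pass 2: the date part (first 10 characters) of the ISO 8601 entries.
iso_style = pd.to_datetime(dates.str[:10], format="%Y-%m-%d", errors="coerce")

# Use pass 1 where it succeeded, otherwise fall back to pass 2.
date_parsed = us_style.fillna(iso_style)
print(date_parsed)
```

Any entry that matches neither format stays `NaT`, so unexpected patterns remain visible instead of being silently guessed.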

In [9]:
# get the day of the month from the date_parsed column
day_of_month = df['date_parsed'].dt.day
day_of_month.head()

0    2
1    4
2    5
3    8
4    9
Name: date_parsed, dtype: int64

<div style="border-radius:10px;border:#85BB65 solid;padding: 15px;background-color:#ffffff00;font-size:100%;text-align:left">
`.dt.day` does not know how to deal with a column with the dtype "object". Even though our dataframe has dates in it, we have to parse them before we can interact with them in a useful way.</div>

#### **<mark style="background-color:#85BB65;color:white;border-radius:5px;opacity:1.0">Notice that</mark>**

**If we tried to get the same information from the original `Date` column, we would get an error: `AttributeError: Can only use .dt accessor with datetimelike values`.**
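
The difference can be reproduced on a small example. This sketch uses a toy Series in place of the real column and catches the error so the cell keeps running:

```python
import pandas as pd

# Toy stand-in for the unparsed "Date" column (hypothetical values).
raw = pd.Series(["01/02/1965", "01/04/1965"], name="Date")

# The .dt accessor is not available on unparsed strings.
try:
    raw.dt.day
    error_message = None
except AttributeError as err:
    error_message = str(err)
print(error_message)

# After parsing, the same accessor works.
parsed = pd.to_datetime(raw, format="%m/%d/%Y")
days = parsed.dt.day.tolist()
print(days)
```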

