# Appendix B - Pandas Refresher

This appendix provides a real-world example of data processing and can be used to test or refresh your Pandas knowledge. The name Pandas is a portmanteau of **"panel data"**, a term used in econometrics.

The FDA publishes a dataset of medical device recalls on a regular basis. The dataset has a field called "event_date_initiated".
This is the date on which a medical device recall was initiated. We extracted this field and saved it to the file event_date_initiated.csv for exploration and processing.

This example uses basic Python concepts together with the Pandas library, imported below:

In [20]:
import pandas as pd

## 1. Basic Data Exploration

**Read the file into a Pandas dataframe**

In [21]:
df = pd.read_csv("../data/event_date_initiated.csv")

**Find out the data type of the event_date_initiated column**

The column has dtype "object", which in Pandas usually means the values are stored as strings (text) rather than as parsed dates.

In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43749 entries, 0 to 43748
Data columns (total 1 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   event_date_initiated  43749 non-null  object
dtypes: object(1)
memory usage: 341.9+ KB
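
If you would rather not depend on type inference, one option is to ask `read_csv` to keep every value as text. The sketch below assumes a tiny in-memory CSV (`csv_text` is a made-up stand-in for the real file) and only checks the Python type of one value.

```python
import io

import pandas as pd

# Hypothetical stand-in for event_date_initiated.csv: a header plus three dates.
csv_text = "event_date_initiated\n2002-12-26\n0012-12-06\n2020-08-20\n"

# dtype=str tells pandas to keep the values as text instead of guessing a type.
sample_df = pd.read_csv(io.StringIO(csv_text), dtype=str)

first_value = sample_df["event_date_initiated"].iloc[0]
print(type(first_value).__name__, first_value)  # str 2002-12-26
```

Keeping the dates as text is what makes the string slicing used later in this appendix possible.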


**Find out the number of rows**

In [23]:
df.shape

(43749, 1)

**Display the first five rows**

In [24]:
df.head()

Unnamed: 0,event_date_initiated
0,2002-12-26
1,2003-03-25
2,2003-03-25
3,2004-01-27
4,2003-12-10


**Display the last five rows**

In [25]:
df.tail()

Unnamed: 0,event_date_initiated
43744,2020-09-09
43745,2020-07-06
43746,2020-01-17
43747,2020-05-28
43748,2020-08-20


**Display five random rows**

In [26]:
df.sample(5)

Unnamed: 0,event_date_initiated
29159,2012-05-17
38192,2015-02-09
7983,2016-11-16
34603,2016-01-05
103,2014-11-17
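
Because `sample()` draws different rows on every call, the output above will not match yours. If you need a repeatable draw, `sample()` accepts a `random_state` seed. A minimal sketch, assuming a made-up 20-row frame (`toy_df`) in place of the recall data:

```python
import pandas as pd

# Hypothetical toy data: 20 distinct dates standing in for the real column.
toy_df = pd.DataFrame(
    {"event_date_initiated": [f"2020-01-{day:02d}" for day in range(1, 21)]}
)

# The same seed yields the same five rows on every call.
first_draw = toy_df.sample(5, random_state=42)
second_draw = toy_df.sample(5, random_state=42)

print(first_draw.equals(second_draw))  # True
```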


## 2. Wrangle event_date_initiated

**Extract year from the date**

In [27]:
df["year"] = df["event_date_initiated"].str[:4]
df.head()

Unnamed: 0,event_date_initiated,year
0,2002-12-26,2002
1,2003-03-25,2003
2,2003-03-25,2003
3,2004-01-27,2004
4,2003-12-10,2003
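
Slicing the first four characters only yields a year if every value follows the `YYYY-MM-DD` layout. One way to check that assumption first is a regular-expression match. This is a sketch on made-up values, not a check that was run on the recall data:

```python
import pandas as pd

# Hypothetical values: three in YYYY-MM-DD layout and one in a different layout.
dates = pd.Series(["2002-12-26", "0012-12-06", "12/26/2002", "2020-08-20"])

# fullmatch is True only when the entire string fits the pattern.
is_iso = dates.str.fullmatch(r"\d{4}-\d{2}-\d{2}")

print(dates[~is_iso].tolist())  # ['12/26/2002']

# Slice the year only from rows that passed the layout check.
years = dates[is_iso].str[:4]
print(years.tolist())  # ['2002', '0012', '2020']
```

Note that a layout check does not catch implausible years such as `0012`. Those still have to be found by inspecting the values.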


**Get the unique years using the set() function**

In [28]:
year_set = set(df["year"])
len(year_set)

26

In [29]:
year_set

{'0010',
 '0012',
 '0013',
 '1997',
 '1998',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020'}
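
Pandas also has built-in alternatives to `set()`: `nunique()` counts distinct values, `unique()` lists them, and `value_counts()` shows how often each one occurs. A short sketch on a made-up `years` Series:

```python
import pandas as pd

# Hypothetical year strings, including a suspicious "0012".
years = pd.Series(["2003", "2003", "0012", "2004", "0012", "2003"])

print(years.nunique())         # 3
print(sorted(years.unique()))  # ['0012', '2003', '2004']

# value_counts() shows how many rows carry each year; sort_index() orders by year.
counts = years.value_counts().sort_index()
print(counts.to_dict())        # {'0012': 2, '2003': 3, '2004': 1}
```

The counts are useful here because they show how many rows are affected by each suspicious year before you decide how to fix them.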

**Replace the incorrect years 0010, 0012, 0013 with 2010, 2012, 2013**

In [30]:
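def transform_init_date(init_date):
    """Return init_date with a leading "00" replaced by "20"."""
    if init_date.startswith("00"):
        # count=1 limits the replacement to the first "00", i.e. the prefix.
        return init_date.replace("00", "20", 1)
    return init_date


# Keep the original column and store the corrected dates in a new one.
df["event_date_initiated2"] = df["event_date_initiated"].apply(transform_init_date)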
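
The same correction can also be written without a helper function, using the vectorized `Series.str.replace` with a regular expression anchored at the start of the string. This is an alternative sketch on made-up dates, not the approach used in this appendix:

```python
import pandas as pd

# Hypothetical dates: two with a "00" prefix and two that should stay unchanged.
dates = pd.Series(["0012-12-06", "2003-03-25", "0013-05-16", "2000-01-01"])

# "^00" matches "00" only at the beginning of each string.
fixed = dates.str.replace(r"^00", "20", regex=True)

print(fixed.tolist())
# ['2012-12-06', '2003-03-25', '2013-05-16', '2000-01-01']
```

The `^` anchor matters: without it, the "00" inside "2000-01-01" would also be rewritten.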

## 3. How do you know the changes were successful?

**Method one: recompute the set of unique years**

In [31]:
df["year2"] = df["event_date_initiated2"].str[:4]
year_set2 = set(df["year2"])
len(year_set2)

23

In [32]:
year_set2

{'1997',
 '1998',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020'}

**Method two: list the rows whose date changed**

In [33]:
df[df["event_date_initiated"] != df["event_date_initiated2"]]

Unnamed: 0,event_date_initiated,year,event_date_initiated2,year2
2290,0012-12-06,0012,2012-12-06,2012
2344,0013-11-26,0013,2013-11-26,2013
2432,0012-11-30,0012,2012-11-30,2012
5267,0013-05-16,0013,2013-05-16,2013
6045,0013-03-05,0013,2013-03-05,2013
6801,0013-05-16,0013,2013-05-16,2013
19636,0012-12-13,0012,2012-12-13,2012
19910,0013-04-12,0013,2013-04-12,2013
25453,0013-03-05,0013,2013-03-05,2013
27812,0013-11-25,0013,2013-11-25,2013
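
Both checks above rely on reading the output by eye. They can also be turned into assertions that fail loudly if something is off. A minimal sketch, assuming a made-up three-row frame (`toy_df`) with the same two column names:

```python
import pandas as pd

# Hypothetical miniature of the original and corrected columns.
toy_df = pd.DataFrame(
    {
        "event_date_initiated": ["0012-12-06", "2003-03-25", "0013-05-16"],
        "event_date_initiated2": ["2012-12-06", "2003-03-25", "2013-05-16"],
    }
)

# Boolean mask of the rows where the correction changed the value.
changed = toy_df["event_date_initiated"] != toy_df["event_date_initiated2"]
print(int(changed.sum()))  # 2

# No corrected date should still start with "00".
still_bad = toy_df["event_date_initiated2"].str.startswith("00")
assert not still_bad.any()

# Every changed row should have started with "00" originally.
assert toy_df.loc[changed, "event_date_initiated"].str.startswith("00").all()
```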


## 4. Save the corrected data to a file

In [34]:
df["event_date_initiated2"].to_csv("event_date_initiated_corrected.csv", index=False)
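
To confirm that the file was written as expected, you can read it back and compare it with the data in memory. The sketch below assumes a made-up two-row Series and writes to a temporary directory rather than the working directory:

```python
import os
import tempfile

import pandas as pd

# Hypothetical corrected column with the same name as in the example above.
corrected = pd.Series(["2012-12-06", "2003-03-25"], name="event_date_initiated2")

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "event_date_initiated_corrected.csv")
    # index=False omits the row labels; the Series name becomes the header line.
    corrected.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path, dtype=str)

print(reloaded.columns.tolist())  # ['event_date_initiated2']
print(reloaded["event_date_initiated2"].tolist() == corrected.tolist())  # True
```

Note that the saved column keeps the name `event_date_initiated2`, so downstream code has to read it under that name.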