# A Python Introduction by Example
A dataset provided by the FDA has a column "event_date_initiated".
This is the date on which a medical device recall was initiated.
The file event_date_initiated.csv contains just this column.
1. Read the file and assign the dates to a list object
2. Find out the number of rows/elements
3. Find out the years this dataset covers
4. Notice the data quality issue (years 0010, 0012, and 0013)
5. Correct the problematic years
6. Save the corrected data to a new file


## Read the file and assign all lines to a list

In [15]:
with open("event_date_initiated.csv") as f:
    date_list = f.readlines() 

## Find out what type of object date_list is

In [None]:
type(date_list)

## Find out the number of rows/elements

In [13]:
len(date_list)

43749

## Display the first 5 dates

In [5]:
date_list[:5]

['event_date_initiated\n',
 '2002-12-26\n',
 '2003-03-25\n',
 '2003-03-25\n',
 '2004-01-27\n']

## Ignore the first element

In [18]:
date_list = date_list[1:]
date_list[:5]

['2002-12-26\n',
 '2003-03-25\n',
 '2003-03-25\n',
 '2004-01-27\n',
 '2003-12-10\n']

## Remove the trailing newline "\n"

In [19]:
date_list = [x.strip("\n") for x in date_list]
date_list[:5]

['2002-12-26', '2003-03-25', '2003-03-25', '2004-01-27', '2003-12-10']
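
The three steps above (read the lines, drop the header, strip the newline) can also be done in one pass. The following is a minimal sketch rather than the notebook's own method: `io.StringIO` stands in for the real file so the example runs on its own, and the names `sample_file`, `header`, and `dates` are hypothetical.

```python
import io

# Stand-in for open("event_date_initiated.csv"); the rows are copied from
# the first lines of output shown above.
sample_file = io.StringIO(
    "event_date_initiated\n2002-12-26\n2003-03-25\n2003-03-25\n2004-01-27\n"
)

with sample_file as f:
    # splitlines() drops the line endings, and the starred assignment
    # separates the header row from the data rows.
    header, *dates = f.read().splitlines()

print(header)  # event_date_initiated
print(dates)   # ['2002-12-26', '2003-03-25', '2003-03-25', '2004-01-27']
```

With the real file, replacing `sample_file` with `open("event_date_initiated.csv")` should give the same `date_list` as the cells above.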

## Extract year from the date

In [20]:
year_list = [x.split("-")[0] for x in date_list]
year_list[:5]

['2002', '2003', '2003', '2004', '2003']

In [21]:
len(year_list)

43749

## Get unique years using the set() function

In [22]:
year_set = set(year_list)
len(year_set)

26

In [23]:
year_set

{'0010',
 '0012',
 '0013',
 '1997',
 '1998',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020'}
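
A set shows which years occur but not how often. As a possible next step, `collections.Counter` from the standard library counts the rows per year, which makes it easier to judge how many rows the odd years affect. The sample list and the 1900 cutoff below are assumptions for illustration; in the notebook the input would be `year_list`.

```python
from collections import Counter

# Hypothetical sample; in the notebook this would be year_list.
sample_years = ["2012", "2013", "0013", "2013", "0012", "2013", "0010"]

year_counts = Counter(sample_years)

# Assumed cutoff: any year before 1900 is treated as implausible here.
suspicious = {year: n for year, n in year_counts.items() if int(year) < 1900}

print(sorted(suspicious.items()))  # [('0010', 1), ('0012', 1), ('0013', 1)]
```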

## Replace the years 0010, 0012, 0013 with 2010, 2012, 2013

In [24]:
date_list2 = []

for init_date in date_list:
    if init_date.startswith("00"):
        date_list2.append(init_date.replace("00", "20", 1))
    else:
        date_list2.append(init_date)
        
date_list2[:10]

['2002-12-26',
 '2003-03-25',
 '2003-03-25',
 '2004-01-27',
 '2003-12-10',
 '2004-01-27',
 '2003-03-20',
 '2003-08-08',
 '2000-11-16',
 '2002-10-31']
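
The same correction can be written as a small function plus a list comprehension. This is only a sketch: `fix_year` is a hypothetical helper name, and it assumes, as the loop above does, that every date starting with "00" should really start with "20".

```python
def fix_year(date_str):
    """Return date_str with a leading '00' replaced by '20'."""
    if date_str.startswith("00"):
        # Slicing touches only the first two characters of the string.
        return "20" + date_str[2:]
    return date_str


# Hypothetical sample; in the notebook this would be date_list.
sample_dates = ["2002-12-26", "0012-12-06", "0013-11-26", "0010-08-17"]
fixed = [fix_year(d) for d in sample_dates]

print(fixed)  # ['2002-12-26', '2012-12-06', '2013-11-26', '2010-08-17']
```

Putting the rule in a function keeps it in one place, so it can be reused and checked on a few known dates before it is applied to the whole list.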

## How do you know the changes were successful?

### Method one

In [25]:
year_list2 = [x.split("-")[0] for x in date_list2]
set(year_list2)

{'1997',
 '1998',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017',
 '2018',
 '2019',
 '2020'}
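
Inspecting the set by eye works for a couple of dozen years. A more automatic variant of the same check is sketched below: it parses each string with `datetime.date.fromisoformat` (Python 3.7+) and reports anything outside an expected range. The function name `find_implausible` and the bounds 1990 to 2020 are assumptions chosen to bracket the years seen above.

```python
from datetime import date


def find_implausible(dates, min_year=1990, max_year=2020):
    """Return (position, text) pairs for dates that fail to parse
    or whose year is outside the assumed min_year..max_year range."""
    bad = []
    for i, text in enumerate(dates):
        try:
            parsed = date.fromisoformat(text)
        except ValueError:
            # Not a valid YYYY-MM-DD date at all.
            bad.append((i, text))
            continue
        if not min_year <= parsed.year <= max_year:
            bad.append((i, text))
    return bad


# Hypothetical sample; in the notebook this would be date_list2.
sample = ["2002-12-26", "0013-03-05", "2013-03-05", "not-a-date"]
print(find_implausible(sample))  # [(1, '0013-03-05'), (3, 'not-a-date')]
```

An empty list for `date_list2` would suggest that every row is a valid date within the expected years.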

### Method two

In [34]:
date_list2 = []

# enumerate() yields each position together with its value
for i, init_date in enumerate(date_list):
    if init_date.startswith("00"):
        print(f"found and corrected problematic date {init_date} at position {i}")
        date_list2.append(init_date.replace("00", "20", 1))
    else:
        date_list2.append(init_date)

found and corrected problematic date 0012-12-06 at position 2290
found and corrected problematic date 0013-11-26 at position 2344
found and corrected problematic date 0012-11-30 at position 2432
found and corrected problematic date 0013-05-16 at position 5267
found and corrected problematic date 0013-03-05 at position 6045
found and corrected problematic date 0013-05-16 at position 6801
found and corrected problematic date 0012-12-13 at position 19636
found and corrected problematic date 0013-04-12 at position 19910
found and corrected problematic date 0013-03-05 at position 25453
found and corrected problematic date 0013-11-25 at position 27812
found and corrected problematic date 0013-12-13 at position 27855
found and corrected problematic date 0013-04-11 at position 29256
found and corrected problematic date 0013-03-05 at position 32382
found and corrected problematic date 0013-04-12 at position 33046
found and corrected problematic date 0010-08-17 at position 36031
found and corrected problematic ...

In [27]:
date_list[41681]

'0013-03-05'

In [28]:
date_list2[41681]

'2013-03-05'

In [31]:
for i in range(len(date_list)):
    if date_list[i] != date_list2[i]:
        print(date_list[i], "->", date_list2[i])

0012-12-06 -> 2012-12-06
0013-11-26 -> 2013-11-26
0012-11-30 -> 2012-11-30
0013-05-16 -> 2013-05-16
0013-03-05 -> 2013-03-05
0013-05-16 -> 2013-05-16
0012-12-13 -> 2012-12-13
0013-04-12 -> 2013-04-12
0013-03-05 -> 2013-03-05
0013-11-25 -> 2013-11-25
0013-12-13 -> 2013-12-13
0013-04-11 -> 2013-04-11
0013-03-05 -> 2013-03-05
0013-04-12 -> 2013-04-12
0010-08-17 -> 2010-08-17
0013-03-05 -> 2013-03-05
0013-03-05 -> 2013-03-05
0013-03-05 -> 2013-03-05
0013-03-05 -> 2013-03-05
0013-03-05 -> 2013-03-05
0013-03-05 -> 2013-03-05
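
The comparison above can also be written with `zip()`, which pairs the two lists element by element so no index lookups are needed. This is a sketch on a hypothetical four-row sample; `before` and `after` stand in for `date_list` and `date_list2`.

```python
# Hypothetical stand-ins for date_list and date_list2.
before = ["2002-12-26", "0012-12-06", "2003-03-25", "0013-03-05"]
after = ["2002-12-26", "2012-12-06", "2003-03-25", "2013-03-05"]

# Keep the position and both values for every row that differs.
changes = [
    (i, old, new)
    for i, (old, new) in enumerate(zip(before, after))
    if old != new
]

for i, old, new in changes:
    print(f"{i}: {old} -> {new}")
print(len(changes), "rows changed")
# 1: 0012-12-06 -> 2012-12-06
# 3: 0013-03-05 -> 2013-03-05
# 2 rows changed
```

Collecting the changes in a list also gives a count that can be compared with the number of problematic rows found earlier.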


## Save the corrected data to a new file

In [38]:
with open("event_date_initiated_corrected.csv", "wt") as f:
    f.write("\n".join(date_list2))

In [39]:
with open("event_date_initiated_corrected.csv") as f:
    corrected_date_list = f.readlines() 
    
corrected_date_list[:5]

['2002-12-26\n',
 '2003-03-25\n',
 '2003-03-25\n',
 '2004-01-27\n',
 '2003-12-10\n']
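
The file written above contains only the dates: the header row was removed earlier and is not written back, and the last line has no trailing newline. If the corrected file should match the layout of the original, one option is sketched below. It writes to a temporary directory so it does not touch real files, and the file name `corrected_sample.csv` and the sample rows are hypothetical.

```python
import os
import tempfile

header = "event_date_initiated"
# Hypothetical sample; in the notebook this would be date_list2.
sample_dates = ["2002-12-26", "2003-03-25", "2012-12-06"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corrected_sample.csv")

    with open(path, "w") as f:
        f.write(header + "\n")
        # One date per line, each terminated by a newline.
        f.writelines(d + "\n" for d in sample_dates)

    # Read the file back to confirm what was written.
    with open(path) as f:
        round_trip = f.read().splitlines()

print(round_trip)
# ['event_date_initiated', '2002-12-26', '2003-03-25', '2012-12-06']
```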

# The End!