In [40]:
import pandas as pd
import datetime as dt
import numpy as np
import re
import datetime
%config IPCompleter.greedy=True

import helper

### A few ideas that come to mind:
* Change the Date Opened column entries to datetime objects
* Make the entries of the recall number integers, be smart about how to come up with unique integers
* Make the entries of the Pounds Recalled column integers
* Make the entries of the Recall Class column more descriptive String objects
* Make a table where I have the recall reasons all as one df and by year

## I. Load the data

In [2]:
raw_df_dict = {}
root = 'data'
filenames = ['recalls_2005.csv', 'recalls_2006.csv', 'recalls_2007.csv', 'recalls_2008.csv', 'recalls_2009.csv', 'recalls_2010.csv', 'recalls_2011.csv', 'recalls_2012.csv', 'recalls_2013.csv', 'recalls_2014.csv', 'recalls_2015.csv', 'recalls_2016.csv', 'recalls_2017.csv', 'recalls_2018.csv']
year_pattern = r'\d{4}'

# Load the data using a predefined helper function
raw_df_dict = helper.load_data(root, filenames, year_pattern)
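
The `helper` module is not shown in this notebook. As a rough idea of what such a loader could look like, here is a minimal sketch that reads each file into a dictionary keyed by the year found in its filename. The name `load_data_sketch` and its behavior are assumptions for illustration, not the actual `helper.load_data` implementation.

```python
import os
import re
import tempfile

import pandas as pd


def load_data_sketch(root, filenames, year_pattern):
    """Read each CSV under `root` into a dict keyed by the year in its filename."""
    frames = {}
    for name in filenames:
        match = re.search(year_pattern, name)
        if match is None:
            # Skip files whose name carries no recognizable year
            continue
        frames[int(match.group())] = pd.read_csv(os.path.join(root, name))
    return frames


# Small self-contained demonstration with two throwaway files (made-up rows)
with tempfile.TemporaryDirectory() as tmp:
    for year in (2005, 2006):
        with open(os.path.join(tmp, 'recalls_{}.csv'.format(year)), 'w') as fh:
            fh.write('Recall Number,Pounds Recalled\n001-{},100\n'.format(year))
    demo = load_data_sketch(tmp, ['recalls_2005.csv', 'recalls_2006.csv'], r'\d{4}')

print(sorted(demo))
```

Keying the dictionary by an integer year matches how the dataframes are looked up later in the notebook (for example `df_dict[2011]`).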

Let's take a peek at a sample of the data dictionary just created.

In [3]:
# Turn the dictionary items into a list so they can be indexed by position
recalls_ls = list(raw_df_dict.items())
# Get a random index of the items in the dictionary
idx = np.random.randint(len(recalls_ls))
# Get one of the recalls data in the dictionary
year, df = recalls_ls[idx]
num = 5
print('\n\nThese are the first {} rows of the recalls data of {}:\n'.format(num, year))
df.head(num)



These are the first 5 rows of the recalls data of 2016:



Unnamed: 0,Recall Number,Open Date,Class,Pounds Recalled,Product,Problem Type
0,001-2016,4-Jan-16,2,89568,Beef products,Extraneous Material
1,002-2016,5-Jan-16,1,14,Cajun Hickory Smoked Pork Tasso,Listeria monocytogenes
2,003-2016,5-Jan-16,1,1125,Chicken products,Other
3,004-2016,6-Jan-16,1,7687,"Beef, Pork, and Chicken Products",Other
4,005-2016,8-Jan-16,2,4040,Pork Sausage,Undeclared Substance


The data was correctly loaded and correctly indexed in the dictionary.

## II. Investigate the data

### 1. Investigate column names and positions

While taking a peek at the loaded data we could see some inconsistencies between column names. Let's investigate this a bit further.

In [4]:
df_dict = raw_df_dict.copy()

In [5]:
cols_names_by_df = helper.display_columns_by_df(df_dict)
cols_names_by_df

Unnamed: 0,0,1,2,3,4,5
2005,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2006,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2007,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2008,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2009,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2010,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2011,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2012,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2013,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2014,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled


In [6]:
col_names_groups = helper.extract_col_names_groups(df_dict)
col_names_groups

{1: [2005, 2006, 2007, 2008, 2009],
 2: [2010, 2011, 2012, 2013, 2014],
 3: [2015],
 4: [2016, 2017, 2018]}

There appear to be 4 groups of dataframes when it comes to column naming. The first group, with uniform column names and positions, covers the years 2005 through 2009.
The second group covers the years 2010 to 2014.
The third group is the year 2015.
And the fourth group goes from 2016 to 2018.
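
The grouping above comes from a helper that is not shown here. One way to obtain this kind of grouping, sketched under the assumption that two years belong to the same group when their ordered column names are identical, is to use the tuple of column names as a dictionary key. The function name `group_years_by_columns` is hypothetical, and the column lists below are truncated to three columns for brevity.

```python
from collections import defaultdict


def group_years_by_columns(columns_by_year):
    """Group years that share exactly the same ordered column names."""
    years_by_layout = defaultdict(list)
    for year in sorted(columns_by_year):
        years_by_layout[tuple(columns_by_year[year])].append(year)
    # Number the groups in order of first appearance, starting at 1
    return {i: years for i, years in enumerate(years_by_layout.values(), start=1)}


layouts = {
    2009: ['Date Opened', 'Recall Number', 'Recall Class'],
    2010: ['Recall Date', 'Recall Number', 'Recall Class'],
    2011: ['Recall Date', 'Recall Number', 'Recall Class'],
    2015: ['Recall Number', 'Date Opened', 'Recall Class'],
}
groups = group_years_by_columns(layouts)
print(groups)
```

With the notebook's data, the input could be built as `{year: list(df.columns) for year, df in df_dict.items()}`.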

Next, let's see how the naming of columns differ within those groups.

In [7]:
# Randomly pick one year from each of the 4 column-naming groups
years = helper.get_samples_from_groups(col_names_groups)
years

[2008, 2013, 2015, 2016]

In [8]:
# Create a new dataframe with the column names in the dataframe of each of those years to compare the columns naming across all groups

cols_df = pd.DataFrame(data = [df_dict[years[0]].columns, df_dict[years[1]].columns, df_dict[years[2]].columns, df_dict[years[3]].columns], 
                      index = ['Group 1', 'Group 2', 'Group 3', 'Group 4'], 
                      columns = ['Pos ' + str(i+1) for i in range(6)]
                     )

cols_df

Unnamed: 0,Pos 1,Pos 2,Pos 3,Pos 4,Pos 5,Pos 6
Group 1,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
Group 2,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
Group 3,Recall Number,Date Opened,Recall Class,Pounds Recalled,Product,Problem Type
Group 4,Recall Number,Open Date,Class,Pounds Recalled,Product,Problem Type


### Remarks (columns names)

1. There are 3 different names for the column with the date the recall was initiated: Date Opened, Recall Date and Open Date. 
2. There are 2 different column names for the class of the recall: Recall Class and Class.
3. There are 2 different column names for the reason the recall was initiated: Reason for Recall and Problem Type.

### Solution

1. The column with the date of the recall will be renamed Recall Date for the dataframes of groups 1, 3 and 4
2. The column with the class of the recall will be renamed Risk Level across all the dataframes
3. The reason for the recall column will be renamed Recall Reason across all the dataframes

### Remarks (columns positions)

1. The column with the class of the recall is always the third column for all dataframes
2. The date the recall was initiated is the 1st column of the dataframes of groups 1 and 2 but the 2nd column for groups 3 and 4
3. The identifying number of the recall is the 2nd column for the dataframes of groups 1 and 2 but the 1st column for groups 3 and 4
4. The product column is the 4th column of the dataframes of groups 1 and 2 but the 5th of dataframes of groups 3 and 4
5. The Pounds Recalled column is the 4th column of the dataframes of groups 3 and 4 but the 6th of dataframes of groups 1 and 2
6. The reason for the recall column is the 5th column of the dataframes of groups 1 and 2 but the 6th of dataframes of groups 3 and 4

### Solution

The columns will be reorganized across all dataframes to be in this order: Recall Number, Recall Date, Risk Level, Product, Recall Reason, Pounds Recalled

### 2. Fix column names and positions

In [9]:
def fix_col_name_and_pos(year, names_changes):    
    new_cols = ['Recall Number', 'Recall Date', 'Risk Level', 'Product', 'Recall Reason', 'Pounds Recalled']
    
    # Rename the columns that need to be renamed
    df_dict[year] = df_dict[year].rename(names_changes, axis=1)
    
    # Specify the position that each column must occupy
    df_dict[year] = df_dict[year][new_cols]

In [10]:
cols_names_changes_dict = {1: {'Date Opened': 'Recall Date', 'Reason for Recall': 'Recall Reason', 'Recall Class': 'Risk Level'},
                           2: {'Reason for Recall': 'Recall Reason', 'Recall Class': 'Risk Level'},
                           3: {'Date Opened': 'Recall Date', 'Problem Type': 'Recall Reason', 'Recall Class': 'Risk Level'},
                           4: {'Open Date': 'Recall Date', 'Class': 'Risk Level', 'Problem Type': 'Recall Reason'}
                          }

for group, years in col_names_groups.items():
    for year in years:
        fix_col_name_and_pos(year, cols_names_changes_dict[group])

In [11]:
cols_names_by_df = helper.display_columns_by_df(df_dict)
cols_names_by_df

Unnamed: 0,0,1,2,3,4,5
2005,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2006,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2007,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2008,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2009,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2010,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2011,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2012,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2013,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2014,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled


The column names and positions are now uniform across all the dataframes.

### 3. Investigate Columns Data Types

In [12]:
cols_dtype_by_df = helper.display_columns_by_df(df_dict, dtype=True)
cols_dtype_by_df

Unnamed: 0,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2005,object,object,object,object,object,object
2006,object,object,object,object,object,object
2007,object,object,object,object,object,object
2008,object,object,object,object,object,object
2009,object,object,object,object,object,object
2010,object,object,object,object,object,object
2011,object,object,object,object,object,object
2012,object,object,object,object,object,object
2013,object,object,object,object,object,object
2014,object,object,object,object,object,object


There are some inconsistencies in the data types of the Risk Level and Pounds Recalled columns (the output above lists only 2005 through 2014, where every column is stored as object), so we will need to investigate those columns a bit further and settle on an appropriate uniform data type.

However, from the get-go we would much rather have the entries of the Pounds Recalled column be integers, since they represent quantities that we might want to sum at some point.

The Recall Date entries are presumably stored as strings because our files were .csv files. We would prefer the entries of that column to be datetime objects, to ease our workflow in case we ever need to do some operations on those dates.

### 4. Fix Columns Data Types

* Fix Pounds Recalled column data type

We will start by fixing the data type of the Pounds Recalled column. While looking at our data earlier we saw that its entries, though stored as string objects, used a comma as a thousands separator to make them easier to read. We will have to remember this as we convert those entries to int values.

In [13]:
recalls = df_dict[2011].copy()
recalls.head()

Unnamed: 0,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
0,001-2011,Jan 03 2011,III,Frozen Chicken Mushroom Pies,Undeclared Substance,600
1,002-2011,Jan 06 2011,I,Frozen Meat and Poultry Tamale Products,Undeclared Allergen,144633
2,003-2011,Jan 10 2011,II,Ground Beef Products,Other,247800
3,004-2011,Jan 11 2011,III,"Breakfast Stackers Sausage, Egg & Cheese",Undeclared Substance,101629
4,005-2011,Jan 14 2011,II,Beef Trim,Other,2234


In [14]:
number_pattern = r'\d+,?\d*'
entries_type_dict = {}
# working_dict = df_dict.copy()
for year, df in df_dict.items():
    entries_type_dict[year] = helper.get_column_entries_groups(df, 'Pounds Recalled', number_pattern)

In [15]:
entries_type_dict

{2005: {'Number': 52, 'Undetermined': 1},
 2006: {'Number': 34},
 2007: {'Number': 58},
 2008: {'Number': 52, 'Undetermined': 2},
 2009: {'Number': 68, 'Undetermined': 1},
 2010: {'Number': 69, 'Undetermined': 2},
 2011: {'Number': 97, 'Undetermined': 6},
 2012: {'Number': 81, 'Undetermined': 1},
 2013: {'Number': 73, 'Undetermined': 2},
 2014: {'Number': 94},
 2015: {'Number': 146, 'Undetermined': 4},
 2016: {'Number': 122},
 2017: {'Number': 131},
 2018: {'Number': 125}}
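
The counts above come from a helper whose code is not part of this notebook. A minimal sketch of that kind of tally, assuming an entry counts as a number only when the whole string matches the pattern, could look like the following. The function name `count_entry_kinds` and the sample values are illustrative assumptions.

```python
import re
from collections import Counter


def count_entry_kinds(values, number_pattern):
    """Count entries fully matching `number_pattern`; others are tallied as-is."""
    counts = Counter()
    for value in values:
        text = str(value).strip()
        if re.fullmatch(number_pattern, text):
            counts['Number'] += 1
        else:
            counts[text] += 1
    return dict(counts)


sample = ['600', '144,633', 'Undetermined', '2,234']
kinds = count_entry_kinds(sample, r'\d+,?\d*')
print(kinds)
```

Note that the pattern `\d+,?\d*` allows at most one comma, so under a full-match rule a value with two separators such as `1,234,567` would not be counted as a number.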

In [16]:
for year, df in df_dict.items():
    # Strip thousands separators and record 'Undetermined' amounts as 0 before casting to int
    df['Pounds Recalled'] = (df['Pounds Recalled'].astype(str)
                             .str.replace(',', '')
                             .str.replace('Undetermined', '0')
                             .astype('int64'))

In [17]:
cols_dtype_by_df = helper.display_columns_by_df(df_dict, dtype=True)
cols_dtype_by_df

Unnamed: 0,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2005,object,object,object,object,object,int64
2006,object,object,object,object,object,int64
2007,object,object,object,object,object,int64
2008,object,object,object,object,object,int64
2009,object,object,object,object,object,int64
2010,object,object,object,object,object,int64
2011,object,object,object,object,object,int64
2012,object,object,object,object,object,int64
2013,object,object,object,object,object,int64
2014,object,object,object,object,object,int64


Done! The data type of the Pounds Recalled column has now been fixed in all dataframes.
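
Recording 'Undetermined' as 0 makes those recalls indistinguishable from recalls of zero pounds. If that distinction matters, one alternative (a sketch, not what this notebook does) is to let pandas mark non-numeric entries as missing and store the column with the nullable `Int64` dtype. The sample values below are illustrative.

```python
import pandas as pd

pounds = pd.Series(['600', '144,633', 'Undetermined', '2,234'])

# Strip thousands separators, then let pandas turn non-numeric text into NaN
cleaned = pd.to_numeric(pounds.str.replace(',', '', regex=False), errors='coerce')

# The nullable Int64 dtype keeps whole numbers while allowing missing values
as_int = cleaned.astype('Int64')

print(as_int.sum())         # missing values are skipped by default
print(as_int.isna().sum())  # number of undetermined amounts
```

Sums still work as before, and the undetermined amounts remain countable through `isna()`.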

* Fix Risk Level column data type and entries

We will now look into the Risk Level column's data types. Its entries are mostly stored as object, except for the data of 2015 and 2016 where the entries are stored as integers.

In any case, since we want to use descriptive strings for that column, its entries will end up being string objects.

For now, as we already know that the entries here are categorical values, let's look at the unique values we have in each dataframe of that column to ease the process of changing them to descriptive strings.

In [18]:
unique_entries = {}
for year, df in df_dict.items():
    unique_entries[year] = list(df['Risk Level'].unique())

In [19]:
unique_entries

{2005: ['I', 'III', 'II'],
 2006: ['II', 'I', 'III'],
 2007: ['I', 'III', 'II'],
 2008: ['I', 'II'],
 2009: ['I', 'II', 'III'],
 2010: ['I', 'II', 'III'],
 2011: ['III', 'I', 'II'],
 2012: ['I', 'II', 'III'],
 2013: ['I', 'III', 'II'],
 2014: ['I', 'II', 'III'],
 2015: [1, 3, 2],
 2016: [2, 1, 3],
 2017: ['III', 'I', 'II'],
 2018: ['I', 'III', 'II']}

We can see that most dataframes use the Roman numerals I, II and III to describe the risk level, while the dataframes storing the entries as integers use the numbers 1, 2 and 3.

A quick investigation at the origin of the data tells us that 1 corresponds to I, 2 to II and 3 to III. We also learn that 1/I represents the highest risk level while 3/III represents the lowest.

The change can be pretty straightforward then.

In [20]:
def fix_risk_level(entry):
    if re.search(r'\b3\b|\bIII\b', entry):
        return 'Low'
    if re.search(r'\b2\b|\bII\b', entry):
        return 'Medium'
    if re.search(r'\b1\b|\bI\b', entry):
        return 'High'

In [21]:
for year, df in df_dict.items():
    df['Risk Level'] = df['Risk Level'].astype(str).apply(fix_risk_level)
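
Since there are only six possible codes, the same relabeling can also be sketched with a plain lookup table and `Series.map` instead of regular expressions. This is an alternative shown for comparison, and the `RISK_LABELS` name is an assumption for illustration.

```python
import pandas as pd

RISK_LABELS = {'1': 'High', 'I': 'High',
               '2': 'Medium', 'II': 'Medium',
               '3': 'Low', 'III': 'Low'}

risk = pd.Series(['I', 'III', 2, 1, 'II'])

# Cast to str so integer and Roman-numeral codes share one lookup table;
# codes missing from the table become NaN, which makes them easy to spot
labels = risk.astype(str).str.strip().map(RISK_LABELS)

print(labels.tolist())
```

An unexpected code shows up as a missing value here, whereas `fix_risk_level` returns `None` for it. Either way it is worth checking the unique values afterwards.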

Let's check the unique entries we now have in each dataframe.

In [22]:
unique_entries = {}
for year, df in df_dict.items():
    unique_entries[year] = list(df['Risk Level'].unique())

In [23]:
unique_entries

{2005: ['High', 'Low', 'Medium'],
 2006: ['Medium', 'High', 'Low'],
 2007: ['High', 'Low', 'Medium'],
 2008: ['High', 'Medium'],
 2009: ['High', 'Medium', 'Low'],
 2010: ['High', 'Medium', 'Low'],
 2011: ['Low', 'High', 'Medium'],
 2012: ['High', 'Medium', 'Low'],
 2013: ['High', 'Low', 'Medium'],
 2014: ['High', 'Medium', 'Low'],
 2015: ['High', 'Low', 'Medium'],
 2016: ['Medium', 'High', 'Low'],
 2017: ['Low', 'High', 'Medium'],
 2018: ['High', 'Low', 'Medium']}

And checking the data type of the Risk Level column, we can see that the data type is now uniform across all the dataframes.

In [24]:
cols_dtype_by_df = helper.display_columns_by_df(df_dict, dtype=True)
cols_dtype_by_df

Unnamed: 0,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2005,object,object,object,object,object,int64
2006,object,object,object,object,object,int64
2007,object,object,object,object,object,int64
2008,object,object,object,object,object,int64
2009,object,object,object,object,object,int64
2010,object,object,object,object,object,int64
2011,object,object,object,object,object,int64
2012,object,object,object,object,object,int64
2013,object,object,object,object,object,int64
2014,object,object,object,object,object,int64


Done! The data types of the entries of the Risk Level column as well as the actual entries have been changed to make the data easier to understand.

* Fix Recall Date column data type - make the data type datetime objects

Looking at the dataframes showed us that the date format is uniform in each dataframe but varies greatly from one dataframe to another.

Let's take a look at all the different formats we have for the date in the dataframes right now. As the format is the same in all rows of any given dataframe, we can just look at a single row to know the format used throughout that dataframe.

In [32]:
date_format_dict = {}
for year, df in df_dict.items():
    date_format_dict[year] = df['Recall Date'].iloc[0]

In [33]:
date_format_dict

{2005: 'Jan 05 2005',
 2006: 'Jan 05 2006',
 2007: 'Jan 03 2007',
 2008: 'Jan 05 2008',
 2009: 'Jan 03 2009',
 2010: 'Jan 09 2010',
 2011: 'Jan 03 2011',
 2012: 'Jan 14 2012',
 2013: 'Jan 15 2013',
 2014: 'Jan 10 2014',
 2015: '1/2/2015',
 2016: '4-Jan-16',
 2017: 'Jan 5, 2017',
 2018: 'Jan 4, 2018'}

It seems we have 4 different date formats right now: the format for the years 2005-2014, then the format for 2015, the format for 2016, and another format for 2017 and 2018.

For simplicity's sake, we will keep the format of 2005-2014, as it is the format used in most dataframes. We then have to fix the format for only 4 years.

In [34]:
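# Matches dates such as '1/2/2015' (month/day/four-digit year).
# A character class like [0-12] matches single characters, not the range 0 to 12,
# so explicit digit counts are used instead.
pattern = r'^\d{1,2}/\d{1,2}/\d{4}$'
months_dict = {'1': 'Jan', '2': 'Feb', '3': 'Mar', '4': 'Apr', '5': 'May', '6': 'Jun', '7': 'Jul', '8': 'Aug', '9': 'Sep', '10': 'Oct', '11': 'Nov', '12': 'Dec'}
def fix_date_format(entry):
    if re.search(pattern, entry):
        # '1/2/2015' -> 'Jan 2 2015'
        month, day, year = entry.split('/')
        return months_dict[str(int(month))] + ' ' + day + ' ' + year
    elif '-' in entry:
        # '4-Jan-16' -> 'Jan 4 16' (the two-digit year is kept as is)
        day, month, year = entry.split('-')
        return month + ' ' + day + ' ' + year
    else:
        # 'Jan 5, 2017' -> 'Jan 5 2017'
        return entry.replace(',', '')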

In [35]:
for year, df in df_dict.items():
    df['Recall Date'] = df['Recall Date'].apply(fix_date_format)

Checking the date formats we now have across the dataframes gives the dict below.

In [38]:
date_format_dict = {}
for year, df in df_dict.items():
    date_format_dict[year] = df['Recall Date'].iloc[0]

In [39]:
date_format_dict

{2005: 'Jan 05 2005',
 2006: 'Jan 05 2006',
 2007: 'Jan 03 2007',
 2008: 'Jan 05 2008',
 2009: 'Jan 03 2009',
 2010: 'Jan 09 2010',
 2011: 'Jan 03 2011',
 2012: 'Jan 14 2012',
 2013: 'Jan 15 2013',
 2014: 'Jan 10 2014',
 2015: 'Jan 2 2015',
 2016: 'Jan 4 16',
 2017: 'Jan 5 2017',
 2018: 'Jan 4 2018'}

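The entries now share the same month, day, year order across all dataframes, although the 2016 entries still carry a two-digit year and the days from 2015 onward are not zero-padded. With that in mind, we can move on to changing those entries to datetime objects.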
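
As a minimal sketch of that conversion, the function below tries the four-digit-year format first and falls back to the two-digit-year format seen in 2016. The name `parse_recall_date` is hypothetical, and the sketch assumes English month abbreviations, which `%b` expects under the default C locale.

```python
from datetime import datetime


def parse_recall_date(entry):
    """Parse 'Jan 05 2005'-style text, falling back to a two-digit year."""
    for fmt in ('%b %d %Y', '%b %d %y'):
        try:
            return datetime.strptime(entry, fmt)
        except ValueError:
            continue
    raise ValueError('Unrecognized date format: {!r}'.format(entry))


for text in ('Jan 05 2005', 'Jan 2 2015', 'Jan 4 16'):
    print(text, '->', parse_recall_date(text).date())
```

It could then be applied per dataframe with `df['Recall Date'].apply(parse_recall_date)`. Raising on unrecognized entries is a deliberate choice in this sketch so that an unexpected format is noticed rather than silently skipped.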