In [1]:
import pandas as pd
import datetime as dt
import numpy as np
%config IPCompleter.greedy=True

import helper

### A few ideas that come to mind:
* Change the Date Opened column entries to datetime objects
* Make the entries of the recall number integers, be smart about how to come up with unique integers
* Make the entries of the Pounds Recalled column integers
* Make the entries of the Recall Class column more descriptive String objects
* Make a table where I have the recall reasons all as one df and by year
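
The second idea (unique integer recall numbers) could be sketched as follows. This is only a sketch: it assumes every recall number looks like `NNN-YYYY` (as in `001-2016`) and that no year has more than 999 recalls. The helper name `recall_number_to_int` is hypothetical.

```python
import pandas as pd

def recall_number_to_int(recall_number):
    """Turn 'NNN-YYYY' into the integer YYYYNNN (hypothetical helper)."""
    # Assumption: the part before the dash is a sequence number below 1000
    seq, year = str(recall_number).strip().split('-')
    return int(year) * 1000 + int(seq)

sample = pd.Series(['001-2016', '002-2016', '005-2016'])
recall_ids = sample.map(recall_number_to_int)
print(recall_ids.tolist())  # [2016001, 2016002, 2016005]
```

Putting the year first keeps the integers unique across years and sortable in chronological order.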

## I. Load the data

In [2]:
df_dict = {}
root = 'datasets'
filenames = ['recalls_2005.csv', 'recalls_2006.csv', 'recalls_2007.csv', 'recalls_2008.csv', 'recalls_2009.csv', 'recalls_2010.csv', 'recalls_2011.csv', 'recalls_2012.csv', 'recalls_2013.csv', 'recalls_2014.csv', 'recalls_2015.csv', 'recalls_2016.csv', 'recalls_2017.csv', 'recalls_2018.csv']

# Define a function to load all the files into dataframes and place them into a dictionary: the key is the year and the value the data as a pandas dataframe
def load_data(filenames):
    for filename in filenames:
        key = filename.split('.')[0].split('_')[1]
        year = int(key)
        df_dict[year] = pd.read_csv(root + '/' + filename)

In [3]:
load_data(filenames)
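
As a possible alternative to filling the global `df_dict`, the loader could return the dictionary and the file list could be generated from the range of years. The names `year_from_filename`, `load_recalls` and `filenames_sketch` below are hypothetical and are not used elsewhere in this notebook.

```python
import os
import pandas as pd

def year_from_filename(filename):
    """Extract the year from a name like 'recalls_2005.csv'."""
    stem = os.path.splitext(os.path.basename(filename))[0]
    return int(stem.split('_')[1])

def load_recalls(root, filenames):
    """Return {year: DataFrame} without touching any global state."""
    return {year_from_filename(name): pd.read_csv(os.path.join(root, name))
            for name in filenames}

# Same 14 file names as the hand-written list, built from the year range
filenames_sketch = ['recalls_{}.csv'.format(y) for y in range(2005, 2019)]
```

With this variant the call would read `df_dict = load_recalls('datasets', filenames_sketch)`.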

Let's take a peek at a sample of the data dictionary just created.

In [4]:
# Turn the dictionary into a list of (year, dataframe) pairs
recalls_ls = list(df_dict.items())
# Pick a random index into that list
idx = np.random.randint(len(recalls_ls))
# Unpack the year and the recalls dataframe at that index
year, df = recalls_ls[idx]
num = 5
print('\n\nThese are the first {} rows of the recalls data of {}:\n'.format(num, year))
df.head(num)



These are the first 5 rows of the recalls data of 2016:



Unnamed: 0,Recall Number,Open Date,Class,Pounds Recalled,Product,Problem Type
0,001-2016,4-Jan-16,2,89568,Beef products,Extraneous Material
1,002-2016,5-Jan-16,1,14,Cajun Hickory Smoked Pork Tasso,Listeria monocytogenes
2,003-2016,5-Jan-16,1,1125,Chicken products,Other
3,004-2016,6-Jan-16,1,7687,"Beef, Pork, and Chicken Products",Other
4,005-2016,8-Jan-16,2,4040,Pork Sausage,Undeclared Substance


The data was correctly loaded and correctly indexed in the dictionary.

## II. Investigate the data

### 1. Investigate column names and positions

While taking a peek at the loaded data, we could see some inconsistencies between column names. Let's investigate this a bit further.

In [5]:
cols_names_by_df = helper.display_columns_by_df(df_dict)
cols_names_by_df

Unnamed: 0,0,1,2,3,4,5
2005,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2006,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2007,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2008,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2009,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2010,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2011,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2012,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2013,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
2014,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled


In [6]:
col_names_groups = helper.extract_col_names_groups(df_dict)
col_names_groups

{1: [2005, 2006, 2007, 2008, 2009],
 2: [2010, 2011, 2012, 2013, 2014],
 3: [2015],
 4: [2016, 2017, 2018]}

There appear to be 4 groups of dataframes when it comes to column naming. The first group, with uniform column names and positions, covers the years 2005 through 2009.
The second group covers the years 2010 to 2014.
The third group is the year 2015.
And the fourth group goes from 2016 to 2018.
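
The source of `helper.extract_col_names_groups` is not shown here, so the following is only a guess at how such a grouping could be computed: years are put in the same group when their dataframes have the same column names in the same order. The function name `group_years_by_columns` and the toy data are hypothetical.

```python
import pandas as pd

def group_years_by_columns(frames):
    """Group years whose dataframes share the same ordered column names."""
    groups = {}  # tuple of column names -> list of years
    for year in sorted(frames):
        signature = tuple(frames[year].columns)
        groups.setdefault(signature, []).append(year)
    # Number the groups 1, 2, ... in order of first appearance
    return {i: years for i, years in enumerate(groups.values(), start=1)}

toy = {
    2009: pd.DataFrame(columns=['Date Opened', 'Recall Number']),
    2010: pd.DataFrame(columns=['Recall Date', 'Recall Number']),
    2011: pd.DataFrame(columns=['Recall Date', 'Recall Number']),
}
print(group_years_by_columns(toy))  # {1: [2009], 2: [2010, 2011]}
```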

Next, let's see how the naming of columns differs across those groups.

In [7]:
# Randomly pick one year from each of the 4 column-naming groups
years = helper.get_samples_from_groups(col_names_groups)
years

[2008, 2010, 2015, 2016]

In [8]:
# Create a new dataframe with the column names in the dataframe of each of those years to compare the columns naming across all groups

cols_df = pd.DataFrame(data = [df_dict[years[0]].columns, df_dict[years[1]].columns, df_dict[years[2]].columns, df_dict[years[3]].columns], 
                      index = ['Group 1', 'Group 2', 'Group 3', 'Group 4'], 
                      columns = ['Col ' + str(i) for i in range(6)]
                     )

cols_df

Unnamed: 0,Col 0,Col 1,Col 2,Col 3,Col 4,Col 5
Group 1,Date Opened,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
Group 2,Recall Date,Recall Number,Recall Class,Product,Reason for Recall,Pounds Recalled
Group 3,Recall Number,Date Opened,Recall Class,Pounds Recalled,Product,Problem Type
Group 4,Recall Number,Open Date,Class,Pounds Recalled,Product,Problem Type


### Remarks (columns names)

1. There are 3 different names for the column with the date the recall was initiated: Date Opened, Recall Date and Open Date.
2. There are 2 different column names for the class of the recall: Recall Class and Class.
3. There are 2 different column names for the reason the recall was initiated: Reason for Recall and Problem Type.

### Solution

1. The column with the date of the recall will be renamed Recall Date for the dataframes of groups 1, 3 and 4
2. The column with the class of the recall will be renamed Risk Level across all the dataframes
3. The reason for the recall column will be renamed Recall Reason across all the dataframes

### Remarks (columns positions)

1. The column with the class of the recall is always the third column for all dataframes
2. The date the recall was initiated is the 1st column of the dataframes of groups 1 and 2 but the 2nd column for groups 3 and 4
3. The identifying number of the recall is the 2nd column for the dataframes of groups 1 and 2 but the 1st column for groups 3 and 4
4. The product column is the 4th column of the dataframes of groups 1 and 2 but the 5th of dataframes of groups 3 and 4
5. The Pounds Recalled column is the 4th column of the dataframes of groups 3 and 4 but the 6th of dataframes of groups 1 and 2
6. The reason for the recall column is the 5th column of the dataframes of groups 1 and 2 but the 6th of dataframes of groups 3 and 4

### Solution

The columns will be reorganized across all dataframes to be in this order: Recall Number, Recall Date, Risk Level, Product, Recall Reason, Pounds Recalled

### 2. Fixing column names and positions

In [9]:
def fix_col_name_and_pos(year, names_changes):    
    new_cols = ['Recall Number', 'Recall Date', 'Risk Level', 'Product', 'Recall Reason', 'Pounds Recalled']
    
    # Rename the columns that need to be renamed
    df_dict[year] = df_dict[year].rename(names_changes, axis=1)
    
    # Specify the position that each column must occupy
    df_dict[year] = df_dict[year][new_cols]

In [10]:
cols_names_changes_dict = {1: {'Date Opened': 'Recall Date', 'Reason for Recall': 'Recall Reason', 'Recall Class': 'Risk Level'},
                2: {'Reason for Recall': 'Recall Reason', 'Recall Class': 'Risk Level'},
                3: {'Date Opened': 'Recall Date', 'Problem Type': 'Recall Reason', 'Recall Class': 'Risk Level'},
                4: {'Open Date': 'Recall Date', 'Class': 'Risk Level', 'Problem Type': 'Recall Reason'}}

for group, years in col_names_groups.items():
    for year in years:
        fix_col_name_and_pos(year, cols_names_changes_dict[group])

In [11]:
cols_names_by_df = helper.display_columns_by_df(df_dict)
cols_names_by_df

Unnamed: 0,0,1,2,3,4,5
2005,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2006,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2007,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2008,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2009,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2010,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2011,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2012,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2013,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled
2014,Recall Number,Recall Date,Risk Level,Product,Recall Reason,Pounds Recalled


The column names and positions are now uniform across all the dataframes.
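
Rather than relying on reading the table, the uniformity could also be checked programmatically. A minimal sketch, using toy dataframes and a hypothetical `columns_are_uniform` helper:

```python
import pandas as pd

def columns_are_uniform(frames, expected):
    """True if every dataframe has exactly the expected columns, in order."""
    return all(list(df.columns) == list(expected) for df in frames.values())

expected_cols = ['Recall Number', 'Recall Date', 'Risk Level',
                 'Product', 'Recall Reason', 'Pounds Recalled']
# Toy stand-ins for the real dataframes held in df_dict
toy = {2005: pd.DataFrame(columns=expected_cols),
       2016: pd.DataFrame(columns=expected_cols)}
print(columns_are_uniform(toy, expected_cols))
```

In the notebook the call would be `columns_are_uniform(df_dict, expected_cols)`, which also covers years that the displayed table may not show.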

### 3. Investigate Columns Data Types

The entries of the Pounds Recalled column would be better off being of type integer.
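
One way the conversion could look, as a sketch only: it assumes some entries may be strings with thousands separators or non-numeric placeholders, which has not been verified against the actual files. The helper `pounds_to_int` and the sample values are hypothetical.

```python
import pandas as pd

def pounds_to_int(series):
    """Convert a Pounds Recalled column to nullable integers."""
    # Assumption: some entries may be strings such as '89,568'
    cleaned = series.astype(str).str.replace(',', '', regex=False).str.strip()
    # Anything that is not a number becomes missing instead of raising
    numeric = pd.to_numeric(cleaned, errors='coerce')
    # Round first so the cast to the nullable Int64 dtype is always safe
    return numeric.round().astype('Int64')

sample = pd.Series(['89,568', '14', 1125, 'Undetermined'])
pounds = pounds_to_int(sample)
print(pounds)
```

The nullable `Int64` dtype is used here so that missing values do not force the column back to floats.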

In [12]:
class_names_meaning = {'I':'High', 'II':'Medium', 'III':'Low'}
class_names_meaning_2016 = {1:'High', 2:'Medium', 3:'Low'}

In [13]:
for year, df in df_dict.items():
    # The keys of df_dict are integers, so compare against 2016, not '2016'
    if year == 2016:
        df_dict[year]['Risk Level'] = df_dict[year]['Risk Level'].map(class_names_meaning_2016)
    else:
        df_dict[year]['Risk Level'] = df_dict[year]['Risk Level'].map(class_names_meaning)

In [14]:
# The keys of df_dict are integer years (see load_data);
# the string key '2016' raises KeyError: '2016'
df_dict[2016].info()

In [None]:
df_dict[2016].head()

From the two cells below I conclude that I don't need to know the exact product that was recalled. I am, however, extremely interested in the type of product it is. Beef? Pork? Poultry? Sorting the products into those categories is going to require some substantial investigation.
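
A first attempt at that categorization could be a simple keyword match on the product description. This is a sketch only: the keyword sets below are hypothetical and would almost certainly need extending after looking at the real `Product` values.

```python
import re

# Hypothetical keyword sets; the real data will likely need more terms
CATEGORY_KEYWORDS = {
    'Beef': {'beef', 'veal'},
    'Pork': {'pork', 'bacon'},
    'Poultry': {'chicken', 'turkey', 'poultry'},
}

def categorize_product(description):
    """Return the categories whose keywords appear as whole words."""
    words = set(re.findall(r'[a-z]+', str(description).lower()))
    found = [cat for cat, keys in CATEGORY_KEYWORDS.items() if words & keys]
    return ' / '.join(found) if found else 'Other'

print(categorize_product('Beef, Pork, and Chicken Products'))  # Beef / Pork / Poultry
print(categorize_product('Cajun Hickory Smoked Pork Tasso'))   # Pork
```

Applied to a dataframe, this would be something like `df['Product'].map(categorize_product)`. Matching on whole words avoids false hits from substrings inside longer words.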

In [None]:
df = df_dict[2005]

In [None]:
df['Product'].value_counts()

In [None]:
fr = df_dict[2006].dtypes
fr

## Before doing this, work on the dtypes of the columns!
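
For the Recall Date column, the conversion to datetime objects could be sketched as below. The format string `'%d-%b-%y'` is an assumption based on the 2016 sample (`4-Jan-16`); other years may store dates differently and would need their own format. The helper `parse_recall_dates` is hypothetical.

```python
import pandas as pd

def parse_recall_dates(series, fmt='%d-%b-%y'):
    """Parse strings such as '4-Jan-16'; unparseable entries become NaT."""
    return pd.to_datetime(series, format=fmt, errors='coerce')

sample = pd.Series(['4-Jan-16', '5-Jan-16', 'not a date'])
dates = parse_recall_dates(sample)
print(dates)
```

Using `errors='coerce'` makes it easy to count how many entries failed to parse (`dates.isna().sum()`) before deciding how to handle them.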

In [None]:
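rcl_rsn_dict = {}

def get_rcl_rsn_vc(df_dict):
    # Build one single-row dataframe of recall-reason counts per year
    for year, df in df_dict.items():
        # year is an integer key, so convert it before concatenating
        key = 'rcl_rsn_' + str(year)
        counts = df['Recall Reason'].value_counts()
        # Name the counts after the year so the transposed row is labelled by it
        rcl_rsn_dict[key] = counts.rename(year).to_frame().transpose()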

In [None]:
get_rcl_rsn_vc(df_dict)

In [None]:
vc_dfs = list(rcl_rsn_dict.values())

In [None]:
recall_reasons_combined = pd.concat(vc_dfs, sort=False)

In [None]:
recall_reasons_combined

In [None]:
recall_reasons_combined = recall_reasons_combined.fillna(0)

In [None]:
recall_reasons_combined = recall_reasons_combined.astype('int64')

In [None]:
recall_reasons_combined
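
The same kind of table (years as rows, recall reasons as columns) could also be built in one step with `pd.crosstab`. This is an alternative to the cells above, not the approach they use. The function name `reasons_by_year` and the toy data are hypothetical.

```python
import pandas as pd

def reasons_by_year(frames):
    """Rows are years, columns are recall reasons, cells are counts."""
    combined = pd.concat(
        [df.assign(Year=year) for year, df in frames.items()],
        ignore_index=True,
    )
    # crosstab fills missing combinations with 0 and keeps integer counts
    return pd.crosstab(combined['Year'], combined['Recall Reason'])

toy = {
    2005: pd.DataFrame({'Recall Reason': ['Other', 'Other', 'E. coli']}),
    2006: pd.DataFrame({'Recall Reason': ['Other']}),
}
table = reasons_by_year(toy)
print(table)
```

This avoids the separate `fillna(0)` and `astype('int64')` steps, at the cost of concatenating all the yearly dataframes first.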

In [None]:
# recall_2005 is never defined and the column was renamed to Risk Level
df_dict[2005]['Risk Level'].value_counts()