### Import Packages and Datasets

To begin working with the FDA recall data, first import the necessary general packages and set your directory paths.

In [1]:
import os
import pandas as pd
import csv
import re

In [2]:
wd = os.getcwd()
data_dir = os.path.join(wd,"..","data")
github_data_dir = os.path.join(wd,"..","github_data")
code_dir = os.path.join(wd,"..","code")

Next, load our functions for processing UPCs. (This will change to an import once the functions are moved into modules.)

In [4]:
run "../code/FDA_Preprocess.py"

Finally, import the FDA Recall Press Release data and the FDA Recall Enforcement Report data, reading each into its own Pandas DataFrame.

In [4]:
press = pd.read_csv(os.path.join(data_dir, "FDA_recalls.csv"),skip_blank_lines = True,encoding='ISO-8859-1')
enforce = pd.read_csv(os.path.join(data_dir,"FDA_food_enforcements_2012-06_to_2016-07.csv"), encoding = 'ISO-8859-1')

Each row in a DataFrame corresponds to an individual FDA recall. Each recall is associated with a single company and reason for recall, but may be associated with more than one product. The columns describing the recalls differ between the two datasets, with the `enforce` data containing more detailed and FDA-specific information than the `press` data.

In [5]:
press.iloc[0]

DATE                                      Fri, 01 May 2015 20:41:00 -0400
BRAND_NAME                                               Sun Rich, Subway
PRODUCT_DESCRIPTION     Apple slices, apple slices with Dip, Sunshine ...
REASON                                             Listeria monocytogenes
COMPANY                                              Sun Rich Fresh Foods
COMPANY_RELEASE_LINK      http://www.fda.gov/Safety/Recalls/ucm445391.htm
PHOTOS_LINK             \r\t\t\thttp://www.fda.gov/Safety/Recalls/ucm4...
Name: 0, dtype: object

In [6]:
enforce.iloc[0]

Product.Type                                                                                     Food
Event.ID                                                                                        66563
Status                                                                                      Completed
Recalling.Firm                                                               Reser's Fine Foods, Inc.
Address1                                                                        15570 SW Jenkins Road
Address2                                                                                          NaN
City                                                                                        Beaverton
State.Province                                                                                     OR
Postal.Code                                                                                     97006
Country                                                                           

## Extract UPCs from text fields

The first step in processing the FDA data is to extract the UPCs from the text fields that contain them, using the `makeUPCCol` function in the `FDA_Preprocess` module (documentation can be found [HERE]). The function operates in one of two ways depending on what is passed in for `string_list`. The two FDA datasets illustrate the two modes: `press` uses `link=True` and `enforce` uses `link=False`.

### Press Release Dataset (`link = True`)

In the Press Release Dataset, the text containing the product information, including UPCs, resides in the page linked to by the URL in the `COMPANY_RELEASE_LINK` column; the text itself is not contained in the table.

To use the `makeUPCCol` function with this dataset, you must pass `True` for the `link` parameter. This indicates that the `string_list` parameter is a list of URLs rather than a list of strings. The `makeUPCCol` function calls the `makeUPCList` function for each row in the dataset, and uses Beautiful Soup with an HTML parser to pull out the text contained within the page. The value passed for `link` also determines the default regex pattern used for finding UPCs within the text. If `link=True` is passed, the default for the `pattern` parameter is `UPC_PATTERN_PAGE`. This regex pattern is ideal when the text being searched comes from an HTML or XML page because it is slightly stricter than the `UPC_PATTERN_TEXT` pattern, and is less likely to pull out unrelated number clusters or partial UPCs.
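The page-based extraction described above can be sketched roughly as follows. This is a minimal illustration, not the code in `FDA_Preprocess`: it swaps Beautiful Soup for the standard library's `html.parser` so that it runs without extra packages, and `STRICT_UPC`, `TextExtractor`, and `upcs_from_html` are hypothetical names. The actual `UPC_PATTERN_PAGE` regex is not shown in this notebook, so the pattern here is only an assumption about what "strict" might mean.

```python
import re
from html.parser import HTMLParser

# Hypothetical strict pattern (an assumption, not UPC_PATTERN_PAGE):
# exactly 12 digits that are not part of a longer run of digits.
STRICT_UPC = re.compile(r"(?<!\d)\d{12}(?!\d)")


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping script/style blocks."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def upcs_from_html(html):
    """Return the unique 12-digit codes found in the page text, in first-seen order."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # dict.fromkeys drops duplicates while keeping the original order
    return list(dict.fromkeys(STRICT_UPC.findall(text)))


# Made-up page content for illustration only
sample_html = """
<html><body>
<script>var id = 123456789012;</script>
<p>Apple slices, UPC 060243004531</p>
<p>Apple slices with dip, UPC 060243004647 (lot 1234567)</p>
<p>Repeat: 060243004531</p>
</body></html>
"""
page_upcs = upcs_from_html(sample_html)
print(page_upcs)
```

In this sketch the 12-digit number inside the `<script>` block and the 7-digit lot number are both ignored, which is the kind of unrelated number cluster a stricter page pattern is meant to avoid.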

Create and append a new column called `upc` to the `press` DataFrame that contains the list of UPCs associated with each recall:

In [56]:
press["upc"] = makeUPCCol(press["COMPANY_RELEASE_LINK"], link = True)

0 rows processed
500 rows processed
1000 rows processed
1500 rows processed
2000 rows processed
2500 rows processed
3000 rows processed
3148 rows processed : COMPLETE


In [57]:
press[0:1]

Unnamed: 0,DATE,BRAND_NAME,PRODUCT_DESCRIPTION,REASON,COMPANY,COMPANY_RELEASE_LINK,PHOTOS_LINK,upc
0,"Fri, 01 May 2015 20:41:00 -0400","Sun Rich, Subway","Apple slices, apple slices with Dip, Sunshine ...",Listeria monocytogenes,Sun Rich Fresh Foods,http://www.fda.gov/Safety/Recalls/ucm445391.htm,http://www.fda.gov/Safety/Recalls/ucm445392.htm,"[060243004531, 060243004647, 060243012963, 060..."


### Enforcement Dataset (`link = False`)

In the Enforcement data, the product information is contained in strings within the `Code.Info` or `Product.Description` columns. The default parameter `link=False` is the correct value for this dataset, indicating that the `string_list` parameter is a list of strings containing the UPCs. When `link=False`, the default for `pattern`, `UPC_PATTERN_TEXT`, is used. This is the optimal pattern when searching strings directly: it is slightly more liberal in the patterns it looks for, and can capture more number clusters without fear of false matches or incomplete UPCs.
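To make the idea of a more liberal text pattern concrete, here is a small sketch. The real `UPC_PATTERN_TEXT` is not shown in this notebook, so `LIBERAL_UPC` and `upcs_from_text` are hypothetical, and the rule used (12 digits that may be split by single spaces or hyphens) is only an assumption about the kind of formatting such a pattern might tolerate.

```python
import re

# Hypothetical liberal pattern (an assumption, not UPC_PATTERN_TEXT):
# 12 digits, each optionally preceded by one space or hyphen.
LIBERAL_UPC = re.compile(r"(?<!\d)\d(?:[ -]?\d){11}(?!\d)")


def upcs_from_text(text):
    """Return unique 12-digit codes found in free text, with separators removed."""
    found = []
    for match in LIBERAL_UPC.findall(text):
        digits = re.sub(r"[ -]", "", match)
        if digits not in found:
            found.append(digits)
    return found


# Made-up product description for illustration only
description = (
    "Potato salad, 16 oz, UPC 0 71117 00216 4; "
    "also sold as UPC 0-71117-00163-1."
)
text_upcs = upcs_from_text(description)
print(text_upcs)
```

A pattern this permissive is reasonable for short description strings, but on a full web page it would be more likely to stitch unrelated digits together, which is one plausible reason for keeping two separate defaults.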

Create and append a new column called `upcs_pd` to the `enforce` DataFrame that contains the list of UPCs for each recall found in the `Product.Description` column:

In [58]:
enforce['upcs_pd'] = makeUPCCol(enforce["Product.Description"], verbose = False)

Create and append a new column called `upcs_ci` to the `enforce` DataFrame that contains the list of UPCs for each recall found in the `Code.Info` column:

In [59]:
enforce['upcs_ci'] = makeUPCCol(enforce["Code.Info"], verbose = False)

Processing the `enforce` data is much faster, and there are many more rows, so we pass `verbose=False` to keep the function from printing out status messages.

We next combine the UPCs found in the `Product.Description` and `Code.Info` columns and obtain a list containing the unique UPCs for each recall.

Append a column called `upc` to the `enforce` DataFrame that contains the list of UPCs associated with each recall:

In [60]:
union_col = [list(set(enforce.upcs_pd[rownum]).union(set(enforce.upcs_ci[rownum]))) for rownum in range(enforce.shape[0])]
enforce['upc'] = pd.Series(union_col)
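The row-wise union can also be written without indexing by row number, which avoids relying on the DataFrame having a default integer index. The sketch below uses plain lists so it runs on its own; `union_rows` is a hypothetical helper, and sorting the result is an added choice so that the order of UPCs is reproducible (sets do not guarantee an order).

```python
def union_rows(col_a, col_b):
    """Return one sorted, de-duplicated list per row from two parallel columns of lists."""
    return [sorted(set(a) | set(b)) for a, b in zip(col_a, col_b)]


# Small stand-ins for the upcs_pd and upcs_ci columns
upcs_pd = [["071117002164", "071117001631"], []]
upcs_ci = [["071117001631", "074865208369"], []]

combined = union_rows(upcs_pd, upcs_ci)
print(combined)
```

With the notebook's DataFrame, the equivalent call would be `enforce["upc"] = union_rows(enforce["upcs_pd"], enforce["upcs_ci"])`, since `zip` pairs the two columns positionally.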

Additionally, use the `makeEventUPCCol` function to create a list of all of the 12-digit UPCs from a recall event, to be used for pattern matching in the UPC processing stage:

In [61]:
enforce["event_upc12"] = makeEventUPCCol(enforce["upc"], enforce["Event.ID"])
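As a rough picture of what an event-level UPC list looks like, the sketch below groups rows by event ID and gives every row the 12-digit UPCs found anywhere in the same event. It is an assumption about the behaviour of `makeEventUPCCol`, whose implementation is not shown here; `event_upc12_col` is a hypothetical name and the sample rows are made up.

```python
from collections import defaultdict


def event_upc12_col(upc_col, event_ids):
    """For each row, list every 12-digit UPC found in any row of the same event."""
    by_event = defaultdict(list)
    for upcs, event in zip(upc_col, event_ids):
        for upc in upcs:
            # Keep only complete 12-digit codes, without duplicates
            if len(upc) == 12 and upc.isdigit() and upc not in by_event[event]:
                by_event[event].append(upc)
    return [list(by_event[event]) for event in event_ids]


# Made-up rows: two belong to one event, one to another
upc_col = [["071117002164", "7111700163"], ["071117001648"], ["052548517571"]]
event_ids = [66563, 66563, 70000]

event_col = event_upc12_col(upc_col, event_ids)
print(event_col)
```

Note that the 10-digit fragment in the first row is left out of the event list: only complete 12-digit codes are useful as references when trying to complete partial ones.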

In [62]:
enforce.iloc[0:1]

Unnamed: 0,Product.Type,Event.ID,Status,Recalling.Firm,Address1,Address2,City,State.Province,Postal.Code,Country,...,Reason.for.Recall,Recall.Initiation.Date,Center.Classification.Date,Termination.Date,Report.Date,Code.Info,upcs_pd,upcs_ci,upc,event_upc12
0,Food,66563,Completed,"Reser's Fine Foods, Inc.",15570 SW Jenkins Road,,Beaverton,OR,97006,United States,...,The recalled products are potentially contamin...,10/22/2013,12/24/2013,,01/01/2014,Use by dates: 10/21/13-12/11/13.,"[071117002164, 071117001631, 052548517571, 071...",[],"[071117002164, 074865208369, 071117001631, 071...","[071117001648, 079453469252, 758108301566, 071..."


## DataFrame Import/Export

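Python lists stored in Pandas DataFrames do not survive a round trip through a CSV file: they are written out as strings and read back in as strings rather than lists. We therefore created a set of functions to convert between these types more easily.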

In [5]:
run "../code/DataFrame_io.py"

### Write Data Out

First remove escape sequence tokens from the `PHOTOS_LINK` column in the `press` DataFrame:

In [64]:
photo_col = list()
for p_link in press["PHOTOS_LINK"]:
    match = re.findall("(http.+htm)", p_link)
    if match:
        photo_col.append(match[0])
    else:
        photo_col.append("")
press["PHOTOS_LINK"] = pd.Series(photo_col)

Next, convert the lists of UPCs in both DataFrames to semicolon-delimited strings:

In [65]:
press["upc"] = pd.Series(listToStringCol(press["upc"]))

In [66]:
enforce["upc"] = pd.Series(listToStringCol(enforce["upc"]))
enforce["event_upc12"] = pd.Series(listToStringCol(enforce["event_upc12"]))

Finally, write both DataFrames to CSVs:

In [74]:
# Skip this cell if you wish to keep the columns showing the UPCs extracted from each source column
enforce.drop(["upcs_pd", "upcs_ci"], axis = 1, inplace = True)

In [77]:
press.to_csv("../github_data/press_upc.csv")
enforce.to_csv("../github_data/enforce_upc.csv")

### Read Data In

Read CSVs into DataFrames:

In [173]:
press = pd.read_csv("../github_data/press_upc.csv", index_col = 0).fillna("")
enforce = pd.read_csv("../github_data/enforce_upc.csv", index_col = 0).fillna("")

In [174]:
press["upc"]=pd.Series(stringToListCol(press["upc"]), index = press.index)

In [175]:
enforce["upc"] = pd.Series(stringToListCol(enforce["upc"]))
enforce["event_upc12"] = pd.Series(stringToListCol(enforce["event_upc12"]))
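The conversion in both directions is simple enough to sketch. The functions below are hypothetical stand-ins for `listToStringCol` and `stringToListCol` (whose implementations are not shown in this notebook) and assume a semicolon delimiter, as described above.

```python
def lists_to_strings(col, sep=";"):
    """Join each list of UPC strings into one delimited string."""
    return [sep.join(items) for items in col]


def strings_to_lists(col, sep=";"):
    """Split each delimited string back into a list of UPC strings."""
    # "".split(";") would give [""], so map empty strings back to empty lists
    return [s.split(sep) if s else [] for s in col]


original = [["060243004531", "060243004647"], []]

as_text = lists_to_strings(original)
restored = strings_to_lists(as_text)

print(as_text)
print(restored == original)
```

The empty-string case is why the notebook calls `.fillna("")` after reading the CSVs: an empty cell comes back from `read_csv` as `NaN`, which has to become an empty string before it can be turned back into an empty list.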

## Process UPCs and look up ASINs

Next, process the UPCs using the `makeUPCProcessedList` function in the `UPC_ASIN_Process` module to determine the correct 12-digit UPC, or list of possible 12-digit UPCs, for each UPC extracted. This function accepts an optional `event_upc12_list` parameter: a list of potentially similar UPCs that, through various pattern-matching strategies, can be used to determine the correct 12-digit UPC. Then look up the ASIN for each 12-digit UPC using the `getASIN` function. When processing the UPCs in an entire DataFrame, it is more efficient to use the `makeUPCProcessedASINTuples` function, which processes the UPCs and does the ASIN lookup at the same time. The function returns a list of tuples in the format `(row_number, upc_processed_nestedlist, asin_nestedlist)`.  
Given the large number of UPCs and the API rate limit of 1 query per second, this process takes a significant amount of time. We therefore recommend passing `pickle=True` (as well as `verbose=True`, to keep track of the function's progress). The function will then use the Python pickling process (https://docs.python.org/3/library/pickle.html) to update and save the list of tuples to the file `data_pickle` every 200 rows.  
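The save-as-you-go idea can be illustrated with a short, self-contained loop. This is a sketch of the general technique, not the code inside `makeUPCProcessedASINTuples`: `process_with_checkpoints` and `fake_lookup` are hypothetical, the API call is replaced by a stand-in, and writing to a temporary file before renaming is an added safeguard so that an interrupted save does not corrupt the previous checkpoint.

```python
import os
import pickle
import tempfile
import time


def process_with_checkpoints(rows, lookup, path, every=200, delay=1.0):
    """Process rows one by one, pickling all results so far every `every` rows."""
    results = []
    total = len(rows)
    for i, row in enumerate(rows):
        results.append((i, row, lookup(row)))
        if delay:
            time.sleep(delay)  # stay under the rate limit (1 query per second)
        if (i + 1) % every == 0 or i + 1 == total:
            tmp_path = path + ".tmp"
            with open(tmp_path, "wb") as fh:
                pickle.dump(results, fh)
            os.replace(tmp_path, path)  # swap in the new checkpoint in one step
    return results


def fake_lookup(upc):
    # Stand-in for the real API query, which is not reproduced here
    return "UPCNOTFOUND"


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "data_pickle")
    demo = process_with_checkpoints(
        ["060243004531", "060243004647", "060243012963"],
        fake_lookup,
        checkpoint,
        every=2,
        delay=0,
    )
    with open(checkpoint, "rb") as fh:
        saved = pickle.load(fh)

print(len(saved), saved[0])
```

Because each tuple carries its own row index, a run that stops partway through can be resumed from the first row that is missing from the saved file.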

In [10]:
import pickle

In [117]:
run "../code/UPC_ASIN_Process.py"

First, process the press data, passing the name of the DataFrame, the name of the column containing the UPCs, and both `pickle` and `verbose` as `True`. The parameter `event_upc12_colname` is left at the default value of `None`, indicating that the list of potentially similar UPCs to be used for pattern matching should be constructed from all of the 12-digit UPCs from the same FDA recall. The `rowrange` parameter is also left at the default, indicating that the entire DataFrame should be processed.

In [32]:
#Console output not shown
press_tuple_list = makeUPCProcessedASINTuples(press, "upc", pickle = True, verbose = True)

When the process is complete (the last line printed should read: `3154 rows processed & saved`), the processed tuples should be stored in `press_tuple_list` as well as in the file `data_pickle`. MAKE SURE TO RENAME THE PICKLE FILE IMMEDIATELY SO THAT IT WILL NOT BE OVERWRITTEN.

In [198]:
os.rename('data_pickle', 'press_pickle')

Open the pickled file and check the first tuple and the length of the list to ensure that the process went smoothly:

In [13]:
with open('press_pickle', 'rb') as file:
    press_tuples = pickle.load(file)

In [56]:
press_tuples[0]

(0,
 [['060243004531'],
  ['060243004647'],
  ['060243012963'],
  ['060243012932'],
  ['060243005088'],
  ['060243004586']],
 [['B00D4KXTAG'],
  ['UPCNOTFOUND'],
  ['B01785QBMU'],
  ['B00P150LIA'],
  ['B00BTGKHC0'],
  ['B00TIYSSTE']])

In [15]:
len(press_tuples)

3154
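One more sanity check that can be run on processed codes is the standard UPC-A check digit: the twelfth digit of a valid code is determined by the first eleven. Whether `makeUPCProcessedList` uses this rule internally is not stated here, so treat the sketch below as an independent check rather than a description of that function; `upc_check_digit` and `is_valid_upc12` are hypothetical helper names.

```python
def upc_check_digit(first11):
    """Compute the UPC-A check digit from the first 11 digits."""
    digits = [int(c) for c in first11]
    odd_sum = sum(digits[0::2])   # positions 1, 3, ..., 11
    even_sum = sum(digits[1::2])  # positions 2, 4, ..., 10
    return (10 - (3 * odd_sum + even_sum) % 10) % 10


def is_valid_upc12(upc):
    """True if upc is 12 digits and its last digit matches the check digit."""
    return (
        len(upc) == 12
        and upc.isdigit()
        and upc_check_digit(upc[:11]) == int(upc[11])
    )


# The six processed UPCs from the first press tuple shown above
first_row_upcs = [
    "060243004531",
    "060243004647",
    "060243012963",
    "060243012932",
    "060243005088",
    "060243004586",
]
all_valid = all(is_valid_upc12(u) for u in first_row_upcs)
print(all_valid)
```

A code that fails this test is either mistyped in the source text or not a complete UPC-A code, which makes the check a cheap filter before spending an API query on it.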

Then repeat the process with the enforce data. This time, pass the `event_upc12_colname` as `event_upc12`, which was prepared in the FDA Preprocessing section above. 

In [36]:
#Console output not shown
enforce_tuple_list = makeUPCProcessedASINTuples(enforce, "upc", event_upc12_colname = "event_upc12", pickle = True, verbose = True)

If your notebook stops running, or an error is thrown by the API, the pickling and saving process ensures that the work is not lost and you can more or less pick up where you left off. The last save point can be determined either from the last status printed to the console (e.g. `3200 rows processed & saved`) or by reading in the pickle file and checking the length or indices. Indices are included in the tuples to make processing the data in pieces and later recombining them simpler and easier to check. Again, be sure to rename the pickle file before proceeding with the rest of the data processing.

In [61]:
with open('data_pickle', 'rb') as file:
    enforce_tuples_1 = pickle.load(file)

In [62]:
enforce_tuples_1[-1]

(3199,
 [['053000011149', '753000011148', '853000011145', '653000011141']],
 [['B000TQHTPE', 'UPCNOTFOUND', 'UPCNOTFOUND', 'UPCNOTFOUND']])

In [24]:
os.rename('data_pickle', 'enforce_pickle_0_3200')

Pass the `rowrange` parameter as a `(first_row, last_row)` tuple in order to continue from the last save point:

In [38]:
enforce_tuple_list = makeUPCProcessedASINTuples(enforce, "upc", event_upc12_colname = "event_upc12", rowrange = (3200, enforce.shape[0]), pickle = True, verbose = True)

In [48]:
os.rename('data_pickle', 'enforce_pickle_3200_end')

Open the pickle file(s) and combine the lists:

In [50]:
with open('enforce_pickle_3200_end', 'rb') as file:
    enforce_tuples_2 = pickle.load(file)

In [51]:
enforce_tuples = enforce_tuples_1+enforce_tuples_2
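Since every tuple starts with its row number, both the resume point and the correctness of the recombined list can be checked mechanically. The helpers below are a hypothetical sketch of those two checks, using small placeholder tuples in place of the real pickled data.

```python
def next_row(tuples):
    """Row at which to resume: one past the last saved row index."""
    return tuples[-1][0] + 1 if tuples else 0


def rows_are_contiguous(tuples, start=0):
    """True if the row indices run start, start + 1, ... with no gaps or repeats."""
    return [t[0] for t in tuples] == list(range(start, start + len(tuples)))


# Placeholder tuples shaped like (row_number, upc_processed, asin)
part_1 = [(i, [], []) for i in range(3)]
part_2 = [(i, [], []) for i in range(3, 5)]

resume_at = next_row(part_1)
combined_ok = rows_are_contiguous(part_1 + part_2)
gap_detected = not rows_are_contiguous(part_1 + part_2[1:])

print(resume_at, combined_ok, gap_detected)
```

Applied to the run above, the last saved tuple has index 3199, so the same rule gives 3200 as the first row of the `rowrange` for the second pass.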

Next read the tuple lists into Pandas DataFrames:

In [70]:
press_codes_df = pd.DataFrame.from_records(press_tuples, columns = ["row", "upc_processed", "asin"])
enforce_codes_df = pd.DataFrame.from_records(enforce_tuples, columns = ["row", "upc_processed", "asin"])

Make sure that the index and `row` columns match; this is a good check that recombining the lists worked correctly. The redundant `row` columns can then be dropped.

In [73]:
enforce_codes_df.iloc[0:1]

Unnamed: 0,row,upc_processed,asin
0,0,"[[071117002164], [074865208369], [071117001631...","[[B00JVD5VMY], [B00V6WFALA], [B01EX9EHUM], [B0..."


In [74]:
press_codes_df.drop(["row"], axis = 1, inplace = True)
enforce_codes_df.drop(["row"], axis = 1, inplace = True)

Unfortunately, for a variety of reasons, querying the API for the ASIN does not always work correctly, and the response can be a long error code rather than the 10-character ASIN or `UPCNOTFOUND`. Instead of querying the API again whenever it returns an "error string", it is far more efficient to look for and fix these after the initial processing is complete.  
Use the `fixASINErrors` function on both DataFrames. This should fix all of the error strings, but this can be double checked by running the function again and checking that `0 Error Strings found and fixed` is printed.

In [119]:
press_codes_df["asin"] = fixASINErrors(press_codes_df["asin"], press_codes_df["upc_processed"])

3 Error Strings found and fixed


In [121]:
enforce_codes_df["asin"] = fixASINErrors(enforce_codes_df["asin"], enforce_codes_df["upc_processed"])

65 Error Strings found and fixed
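To see what counts as an error string, it can help to scan the nested ASIN lists directly. The sketch below flags any entry that is neither `UPCNOTFOUND` nor a 10-character code; it only locates suspicious entries and does not repeat the API query, and it is not the implementation of `fixASINErrors`. The helper names are hypothetical, the alphanumeric test is an assumption about ASIN format, and the error text in the sample is a placeholder rather than a real API response.

```python
def is_error_string(value):
    """True for anything that is neither a 10-character ASIN nor the UPCNOTFOUND marker."""
    looks_like_asin = len(value) == 10 and value.isalnum()
    return not (value == "UPCNOTFOUND" or looks_like_asin)


def find_errors(asin_col):
    """Return (row, group, position) for each suspicious entry in a column of nested lists."""
    return [
        (r, g, p)
        for r, groups in enumerate(asin_col)
        for g, group in enumerate(groups)
        for p, value in enumerate(group)
        if is_error_string(value)
    ]


# Sample column: the second row contains a placeholder error string
asin_col = [
    [["B00D4KXTAG"], ["UPCNOTFOUND"]],
    [["B000TQHTPE", "ERROR: placeholder response text"]],
]

errors = find_errors(asin_col)
print(errors)
```

Each location returned can be matched to the UPC at the same position in `upc_processed`, which is the code that would need to be queried again.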
