### Import Packages and Datasets

To begin working with the FDA recall data, first import the necessary general packages and set your directory paths.

In [1]:
import os
import pandas as pd
import csv
import re

In [2]:
wd = os.getcwd()
data_dir = os.path.join(wd,"..","data")
github_data_dir = os.path.join(wd,"..","github_data")
code_dir = os.path.join(wd,"..","code")

Next, import our functions for processing UPCs. [Insert Links?] [This will be changed to an import when functions are put in modules]

In [3]:
%run "../code/FDA_Preprocess.py"

Finally, import the FDA Recall Press Release Data and the FDA Recall Enforcement Report Data, reading each into its own pandas DataFrame. [Give background information or download links here?]

In [20]:
press = pd.read_csv(os.path.join(data_dir, "FDA_recalls.csv"), skip_blank_lines=True, encoding="ISO-8859-1")
enforce = pd.read_csv(os.path.join(data_dir, "FDA_food_enforcements_2012-06_to_2016-07.csv"), encoding="ISO-8859-1")

Each row in a DataFrame corresponds to an individual FDA recall. Each recall is associated with a single company and a single reason for recall, but may be associated with more than one product. The columns describing the recalls differ between the two datasets, with the `enforce` data containing more detailed and FDA-specific information than the `press` data.

In [21]:
press.iloc[0]

DATE                                      Fri, 01 May 2015 20:41:00 -0400
BRAND_NAME                                               Sun Rich, Subway
PRODUCT_DESCRIPTION     Apple slices, apple slices with Dip, Sunshine ...
REASON                                             Listeria monocytogenes
COMPANY                                              Sun Rich Fresh Foods
COMPANY_RELEASE_LINK      http://www.fda.gov/Safety/Recalls/ucm445391.htm
PHOTOS_LINK             \r\t\t\thttp://www.fda.gov/Safety/Recalls/ucm4...
Name: 0, dtype: object

In [6]:
enforce.iloc[0]

Product.Type                                                                                     Food
Event.ID                                                                                        66563
Status                                                                                      Completed
Recalling.Firm                                                               Reser's Fine Foods, Inc.
Address1                                                                        15570 SW Jenkins Road
Address2                                                                                          NaN
City                                                                                        Beaverton
State.Province                                                                                     OR
Postal.Code                                                                                     97006
Country                                                                           

## Extract UPCs from text fields

The first step in processing the FDA data is to extract the UPCs from the text fields that contain them, using the `makeUPCCol` function in the `FDA_Preprocess` module (documentation can be found [HERE]). The function operates in one of two ways, depending on what is passed in for `string_list`. The two FDA datasets, `press` and `enforce`, illustrate these two modes, with the `link` parameter set to `True` and `False` respectively.

### Press Release Dataset (`link = True`)

In the Press Release Dataset, the text containing the product information, including UPCs, resides in the page linked to by the URL in the `COMPANY_RELEASE_LINK` column; the text itself is not contained in the table.

To use the `makeUPCCol` function with this dataset, you must pass `True` for the `link` parameter. This indicates that the `string_list` argument is a list of URLs rather than a list of strings. The `makeUPCCol` function calls the `makeUPCList` function for each row in the dataset and uses Beautiful Soup with an HTML parser to pull out the text contained within the page. The value passed for `link` also determines the default regex pattern used for finding UPCs within the text. If `link=True` is passed, the default for the `pattern` parameter is `UPC_PATTERN_PAGE`. This regex pattern is well suited to text extracted from XML because it is slightly stricter than the `UPC_PATTERN_TEXT` pattern and is less likely to pull out unrelated number clusters or partial UPCs.
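The page-scraping step described above can be sketched roughly as follows. This is not the `FDA_Preprocess` code: the helper `extract_upcs_from_html` and the pattern `HYPOTHETICAL_PAGE_PATTERN` are assumptions made for illustration, and the standard-library `html.parser` stands in for Beautiful Soup so that the snippet has no third-party dependencies. It also takes an HTML string directly rather than fetching a URL.

```python
import re
from html.parser import HTMLParser

# Hypothetical strict pattern: a 12-digit number that closely follows "UPC".
# The real UPC_PATTERN_PAGE is not shown in this notebook and may differ.
HYPOTHETICAL_PAGE_PATTERN = re.compile(r"UPC\D{0,10}?(\d{12})\b", re.IGNORECASE)


class _TextCollector(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def extract_upcs_from_html(html, pattern=HYPOTHETICAL_PAGE_PATTERN):
    """Return the unique UPC strings found in the text of `html`."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    text = " ".join(collector.chunks)
    # dict.fromkeys drops duplicates while keeping first-seen order
    return list(dict.fromkeys(pattern.findall(text)))


sample_page = """
<html><body>
  <p>Apple slices, 2 oz bag, UPC: 060243001234</p>
  <p>Lot 123456789012345 is a lot code.</p>
  <p>Apple slices with dip, UPC 060243005678</p>
</body></html>
"""
print(extract_upcs_from_html(sample_page))
```

Because the sketched pattern requires the digits to follow the label "UPC", the 15-digit lot code on the sample page is skipped, which is the kind of unrelated number cluster a stricter pattern is meant to avoid.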

Create and append a new column called `upc` to the `press` DataFrame that contains the list of UPCs associated with each recall:

In [7]:
press['upc'] = makeUPCCol(press["COMPANY_RELEASE_LINK"], link = True)

0 rows processed
500 rows processed
1000 rows processed
1500 rows processed
2000 rows processed
2500 rows processed
3000 rows processed
3148 rows processed : COMPLETE


In [25]:
press[0:1]

| | DATE | BRAND_NAME | PRODUCT_DESCRIPTION | REASON | COMPANY | COMPANY_RELEASE_LINK | PHOTOS_LINK |
|---|---|---|---|---|---|---|---|
| 0 | Fri, 01 May 2015 20:41:00 -0400 | Sun Rich, Subway | Apple slices, apple slices with Dip, Sunshine ... | Listeria monocytogenes | Sun Rich Fresh Foods | http://www.fda.gov/Safety/Recalls/ucm445391.htm | \r\t\t\thttp://www.fda.gov/Safety/Recalls/ucm4... |


### Enforcement Dataset (`link = False`)

In the Enforcement data, the product information is contained in strings within the `Code.Info` or `Product.Description` columns. The default, `link=False`, is the correct value for this dataset; it indicates that the `string_list` parameter is a list of strings containing the UPCs. When `link=False`, the default value for `pattern`, `UPC_PATTERN_TEXT`, is used. This is the better pattern when searching strings directly: it is slightly more liberal in what it matches and can capture more number clusters without much risk of false matches or incomplete UPCs.
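To make the trade-off between a stricter and a more liberal pattern concrete, here is a small sketch. Both patterns are hypothetical stand-ins, not the actual `UPC_PATTERN_PAGE` and `UPC_PATTERN_TEXT` from `FDA_Preprocess`, and the sample string is invented for illustration.

```python
import re

# Hypothetical stand-ins; the real patterns live in FDA_Preprocess and may differ.
STRICT_PATTERN = re.compile(r"\b\d{12}\b")               # 12 contiguous digits only
LIBERAL_PATTERN = re.compile(r"\b\d(?:[ -]?\d){11}\b")   # also allows spaces or hyphens


def find_upcs(text, pattern):
    """Return matches with separators stripped, so every hit is 12 digits."""
    return [re.sub(r"\D", "", match) for match in pattern.findall(text)]


code_info = "UPC 0 71117 00163 1; also sold as 071117002164, use by 10/21/13"
print(find_upcs(code_info, STRICT_PATTERN))
print(find_upcs(code_info, LIBERAL_PATTERN))
```

In this sample the strict stand-in finds only the contiguous code, while the liberal one also recovers the UPC that was written with spaces between its digit groups. Neither picks up the date, because `/` is not an allowed separator.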

Create and append a new column called `upcs_pd` to the `enforce` DataFrame that contains the list of UPCs for each recall found in the `Product.Description` column:

In [9]:
enforce['upcs_pd'] = makeUPCCol(enforce["Product.Description"], verbose = False)

Create and append a new column called `upcs_ci` to the `enforce` DataFrame that contains the list of UPCs for each recall found in the `Code.Info` column:

In [10]:
enforce['upcs_ci'] = makeUPCCol(enforce["Code.Info"], verbose = False)

Processing the `enforce` data is much faster, and there are many more rows, so we pass `verbose=False` to keep the function from printing out status messages.

We next combine the UPCs found in the `Product.Description` and `Code.Info` columns and obtain a list containing the unique UPCs for each recall.

Append a column called `upc` to the `enforce` DataFrame that contains the list of UPCs associated with each recall:

In [11]:
# Merge the two UPC lists row by row, dropping duplicates
union_col = [list(set(pd_upcs) | set(ci_upcs))
             for pd_upcs, ci_upcs in zip(enforce["upcs_pd"], enforce["upcs_ci"])]
enforce['upc'] = union_col
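Note that a `set` union does not keep the UPCs in the order in which they appeared in the text. If a stable order is useful downstream, one possible alternative is sketched below. The helper `ordered_union` is hypothetical and not part of `FDA_Preprocess`, and the two short lists are toy data.

```python
def ordered_union(first, second):
    """Merge two lists of UPC strings, dropping duplicates but keeping first-seen order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(list(first) + list(second)))


upcs_pd = ["052548517571", "071117001631"]
upcs_ci = ["071117001631", "074865208369"]

# Same members as the set-based union, shown sorted because set order is arbitrary
print(sorted(set(upcs_pd).union(upcs_ci)))
# Same members, in the order they were first seen
print(ordered_union(upcs_pd, upcs_ci))
```

Both approaches produce the same set of UPCs for each recall; they differ only in the ordering of the resulting list.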

Additionally, we use the `makeEventUPCCol` function to create a list of all of the 12-digit UPCs from a recall event, to be used for pattern matching in the UPC processing stage.

In [12]:
enforce["event_upc12"] = makeEventUPCCol(enforce["upc"], enforce["Event.ID"])
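The notebook does not show how `makeEventUPCCol` works internally. One plausible reading of the description above is that it pools the UPCs from every row sharing an `Event.ID`, keeps the 12-digit ones, and gives each row its event's pooled list. The sketch below follows that reading; `event_upc12_column` is a hypothetical name, the inputs are toy data, and the real function may behave differently.

```python
from collections import defaultdict


def event_upc12_column(upc_lists, event_ids):
    """For each row, return the 12-digit UPCs pooled across its whole recall event."""
    pooled = defaultdict(list)
    for upcs, event_id in zip(upc_lists, event_ids):
        for upc in upcs:
            # keep only 12-digit codes, and only the first copy of each
            if len(upc) == 12 and upc.isdigit() and upc not in pooled[event_id]:
                pooled[event_id].append(upc)
    # every row of an event gets a copy of that event's pooled list
    return [list(pooled[event_id]) for event_id in event_ids]


upc_lists = [["071117001631"], ["071117002164", "0711170"], []]
event_ids = [66563, 66563, 70001]
print(event_upc12_column(upc_lists, event_ids))
```

Under this reading, the first two rows belong to the same event and so share the same pooled list, while the partial code `0711170` is dropped because it is not 12 digits long.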

In [17]:
enforce.iloc[0:1]

| | Product.Type | Event.ID | Status | Recalling.Firm | Address1 | Address2 | City | State.Province | Postal.Code | Country | ... | Reason.for.Recall | Recall.Initiation.Date | Center.Classification.Date | Termination.Date | Report.Date | Code.Info | upcs_pd | upcs_ci | upc | event_upc12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Food | 66563 | Completed | Reser's Fine Foods, Inc. | 15570 SW Jenkins Road | | Beaverton | OR | 97006 | United States | ... | The recalled products are potentially contamin... | 10/22/2013 | 12/24/2013 | | 01/01/2014 | Use by dates: 10/21/13-12/11/13. | [052548517571, 071117001631, 071117002164, 071... | [] | [071117001631, 074865208369, 071117646078, 071... | [758108301610, 071117181500, 071117141771, 071... |
