In [1]:
import pandas as pd

From FDA website: (https://open.fda.gov/apis/food/enforcement/searchable-fields/)
--

State and Country:

- The state and country where the recalling firm is located.

Classification:

- Numerical designation (I, II, or III) that is assigned by FDA to a particular product recall that indicates the relative degree of health hazard.

    - Class I = Dangerous or defective products that predictably could cause serious health problems or death. Examples include: food found to contain botulinum toxin, food with undeclared allergens, a label mix-up on a lifesaving drug, or a defective artificial heart valve.

    - Class II = Products that might cause a temporary health problem, or pose only a slight threat of a serious nature. Example: a drug that is under-strength but that is not used to treat life-threatening situations.

    - Class III = Products that are unlikely to cause any adverse health reaction, but that violate FDA labeling or manufacturing laws. Examples include: a minor container defect and lack of English labeling in a retail food.
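
The FDA text describes the classes with Roman numerals, while the column summary later in this notebook shows `classification` stored as an integer. A minimal sketch of how such labels could be converted, assuming the raw labels are exactly "Class I", "Class II", and "Class III" (the names `CLASS_TO_INT` and `classification_to_int` are made up for illustration and are not part of either source):

```python
import pandas as pd

# Hypothetical mapping from label to severity rank (1 = most serious).
CLASS_TO_INT = {"Class I": 1, "Class II": 2, "Class III": 3}

def classification_to_int(labels: pd.Series) -> pd.Series:
    """Map 'Class I'/'Class II'/'Class III' to 1/2/3; unknown labels become NaN."""
    return labels.str.strip().map(CLASS_TO_INT)

# Made-up example values, not taken from the dataset.
example = pd.Series(["Class II", "Class I", "Class III", "Class II"])
example_ints = classification_to_int(example)
print(example_ints.tolist())
```

With this encoding a lower number means a more serious recall, which matters when reading any per-state average of the class.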


Reason for recall: (not used directly; see "Reason for recall simplified" below)

- Information describing how the product is defective and violates the FD&C Act or related statutes.



From Kaggle dataset creator: (https://www.kaggle.com/datasets/chiyucheng/fda-food-enforcement-20082022)
--

Reason for recall simplified:

- "We then created new tables, one for each of the four categories with simplified reasons attached to the original data. For example, for any event whose reason for recall contains the key word E. coli, we attach E. coli. to it at a separate column named as reason_for_recall_simplified."
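
The keyword matching described in that quote can be sketched roughly as follows. This is only an illustration of the idea: the keyword list, the `"Other"` fallback, the function name `simplify_reason`, and the toy rows are all assumptions, since the dataset creator's actual list and code are not shown here.

```python
import pandas as pd

# Hypothetical keyword list; the dataset creator's real list is not given.
KEYWORDS = ["E. coli", "Salmonella", "Listeria"]

def simplify_reason(reason: str, keywords=KEYWORDS, default="Other") -> str:
    """Return the first keyword found in the reason text (case-insensitive)."""
    lowered = reason.lower()
    for kw in keywords:
        if kw.lower() in lowered:
            return kw
    return default  # assumed fallback when no keyword matches

# Made-up recall reasons for illustration only.
toy = pd.DataFrame({"reason_for_recall": [
    "Product may be contaminated with E. coli O157:H7.",
    "Potential Salmonella contamination.",
    "Mislabeled net weight.",
]})
toy["reason_for_recall_simplified"] = toy["reason_for_recall"].apply(simplify_reason)
print(toy["reason_for_recall_simplified"].tolist())
```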


In [2]:
# loading the dataset downloaded from Kaggle into a pandas DataFrame object:
df = pd.read_csv('./combined.csv')

In [3]:
df['country'].value_counts()
# Most of the records come from the United States, so the rest of the analysis uses US data only.

United States               21062
Canada                        123
Israel                         86
France                         13
Taiwan                         12
Mexico                         10
Korea (the Republic of)         8
Chile                           7
Ireland                         5
Italy                           4
Belgium                         4
China                           3
Egypt                           3
Costa Rica                      3
India                           3
Germany                         2
Netherlands                     1
Vietnam                         1
Thailand                        1
Poland                          1
Guatemala                       1
United Kingdom                  1
Dominican Republic (the)        1
Sweden                          1
Australia                       1
Name: country, dtype: int64
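
From the counts above, the United States accounts for 21062 of the 21357 listed rows, or about 98.6%. The same share can be read off directly with `value_counts(normalize=True)`. The sketch below uses a small made-up series in place of `df['country']` so that it runs on its own:

```python
import pandas as pd

# Toy stand-in for df['country']; the values are made up for illustration.
country = pd.Series(["United States"] * 8 + ["Canada", "Israel"], name="country")

# normalize=True returns fractions of the non-missing rows instead of counts.
shares = country.value_counts(normalize=True)
us_share = shares["United States"]
print(f"US share: {us_share:.1%}")
```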

In [4]:
# Getting data just from the US and making it a new dataframe:
us_df = df[df['country']=='United States'] 

In [5]:
# Checking how the states are encoded:
us_df['state'].unique()
# The states are two-letter postal codes, which will matter when choosing a GeoJSON file later

array(['VA', 'GA', 'NC', 'MN', 'CT', 'NJ', 'SC', 'NY', 'IN', 'AL', 'MI',
       'MA', 'WA', 'WI', 'KS', 'FL', 'OR', 'CA', 'AR', 'IL', 'UT', 'TX',
       'MD', 'WV', 'ID', 'MO', 'RI', 'ME', 'NH', 'OH', 'PA', 'SD', 'VT',
       'ND', 'CO', 'WY', 'NV', 'DE', 'AZ', 'LA', 'OK', 'DC', 'MT', 'PR',
       'NM', 'IA', 'AK', 'KY', 'MS', 'NE', 'TN', 'HI'], dtype=object)
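
The array above holds 52 codes: the 50 states plus DC and PR. Before joining against a GeoJSON file it can help to confirm that every value really is a two-letter uppercase code. A minimal sketch, using a made-up series in place of `us_df['state']` (the malformed value `'Tex'` is included on purpose):

```python
import pandas as pd

# Toy stand-in for us_df['state']; 'Tex' is a deliberately malformed value.
states = pd.Series(["VA", "GA", "PR", "DC", "Tex"])

# True where the value is exactly two uppercase ASCII letters.
is_two_letter = states.str.fullmatch(r"[A-Z]{2}")
bad_codes = states[~is_two_letter].tolist()
print(bad_codes)
```

An empty `bad_codes` list on the real column would mean every row can be matched by code, provided the chosen GeoJSON file also includes DC and PR.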

In [6]:
# Making a list of the information I plan to use:
important_info = ['state','classification','reason_for_recall_simplified']

In [7]:
# Making a new dataframe with just the information I want to use:
data = us_df[important_info]

In [8]:
def show_missing(df):
    '''Return a pandas dataframe describing the contents of a source dataframe including missing values.'''
    
    column = []
    dtype = []
    count = []
    unique = []
    missing = []
    
    for col in df.columns:
        column.append(col)
        dtype.append(df[col].dtype)
        count.append(len(df[col]))
        unique.append(len(df[col].unique()))
        missing.append(df[col].isna().sum())

    output = pd.DataFrame({
        'column': column, 
        'dtype': dtype,
        'count': count,
        'unique': unique,
        'missing': missing, 
    })    
        
    return output

In [9]:
data_check = show_missing(data)
data_check # we see that there are no missing values!

                         column   dtype  count  unique  missing
0                         state  object  21062      52        0
1                classification   int64  21062       3        0
2  reason_for_recall_simplified  object  21062      20        0
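
The same summary can also be built from pandas built-ins without the explicit loop. This is a sketch of an alternative, not a replacement for `show_missing`; the name `summarize_columns` and the toy frame are made up for illustration. `nunique(dropna=False)` is used so that a missing value counts as a distinct value, matching `len(df[col].unique())`.

```python
import pandas as pd

def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, row count, distinct values, and missing values per column."""
    return pd.DataFrame({
        "column": df.columns,
        "dtype": df.dtypes.values,
        "count": len(df),  # scalar, broadcast to every row
        "unique": df.nunique(dropna=False).values,
        "missing": df.isna().sum().values,
    })

# Made-up frame with one missing state, for illustration only.
toy = pd.DataFrame({"state": ["VA", "GA", None], "classification": [1, 2, 2]})
summary = summarize_columns(toy)
print(summary)
```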


In [10]:
# aggregating the data by state:

lst_data = data.values.tolist() # put dataframe to list
state_info = {} # empty dictionary to put aggregated data with states as keys
for x in lst_data: # looping through data
    state = x[0]
    c = x[1]
    r = x[2]
    if state not in state_info:
        state_info[state] = {'classes': [c], 'recalls': [r]} # making value a dictionary with two keys, with lists as values
    else:
        state_info[state]['classes'].append(c)
        state_info[state]['recalls'].append(r)
        
def most_frequent(lst):
    '''Return the most frequently occurring item in a list.'''
    return max(set(lst), key = lst.count)

final_state_info = []
for state, vals in state_info.items():
    classes = vals['classes']
    avg_class = sum(classes)/len(classes) # aggregating the classification by taking the mean
    
    recalls = vals['recalls']
    top_recall = most_frequent(recalls) # getting the most frequently occurring recall reason
    
    recall_count = len(recalls) # getting the number of recalls
    
    final_state_info.append([state, avg_class, top_recall, recall_count])

In [11]:
final_data = pd.DataFrame(final_state_info, columns = ['state', 'avg_class', 'top_recall', 'recall_count'])
final_data.to_csv('./final_data.csv', index=False) # index=False avoids writing the row index as an extra unnamed column
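
For comparison, the per-state aggregation can be expressed with `groupby` and named aggregation. This is a sketch of an equivalent approach on made-up rows, not the code that produced `final_data.csv`. The helper `most_common` is hypothetical; it uses `collections.Counter`, which breaks ties in favor of the value seen first, whereas `max(set(lst), key=lst.count)` breaks ties in an unspecified order.

```python
from collections import Counter

import pandas as pd

def most_common(values: pd.Series):
    """Most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]

# Made-up rows with the same three columns as `data`.
toy = pd.DataFrame({
    "state": ["VA", "VA", "VA", "GA", "GA"],
    "classification": [1, 2, 2, 3, 1],
    "reason_for_recall_simplified": [
        "Listeria", "Allergen", "Listeria", "Salmonella", "Salmonella",
    ],
})

summary = (
    toy.groupby("state", sort=False)  # sort=False keeps first-seen state order
       .agg(
           avg_class=("classification", "mean"),
           top_recall=("reason_for_recall_simplified", most_common),
           recall_count=("reason_for_recall_simplified", "size"),
       )
       .reset_index()
)
print(summary)
```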