# Data Preparation and Merging 

Creating a comprehensive dataset for use in machine learning (ML)


Prepared by: Becky L, last updated: 10/28/24

## Data Sources:

### EPA - Excess Food Opportunities Map.  Restaurants, retailers (grocery), and foodbank locations (rolled up to State level) were used. 

https://experience.arcgis.com/experience/793a7912cb184f7792fc02a9bac4192b

> __About the U.S. EPA Excess Food Opportunities Map:__ The U.S. EPA Excess Food Opportunities Map supports nationwide diversion of excess food from landfills. The interactive map identifies and displays facility-specific information about potential generators and recipients of excess food in the industrial, commercial and institutional sectors and also provides estimates of excess food by generator type.  The map displays the locations of nearly 950,000 potential excess food generators.

> For this project, only the following sources were used: food banks, food wholesale and retail, and restaurants and food services.  Although the locations were provided by address, the data were pivoted and summarized at the state level.  A secondary dataset at the ZIP code level was also obtained for these sources, but it was not part of the combined dataset.

> __Food Wholesale and Retail, US and Territories, 2023, EPA Region 9:__ Abstract: This table contains features that represent food wholesalers and retailers (supermarkets, grocery stores, and supercenters) represented by 24 unique NAICS codes. Establishment-specific information except the annual excess food estimate was licensed to the EPA from D&B Hoovers in 2021 (https://www.dnb.com/).  Calculations used to estimate annual excess food estimates are described in EPA’s 2023 publication: EPA Excess Food Opportunities Map Version 3 - Technical Methodology. The dataset contains 197,455 facilities.

> __Restaurants and Food Services, US and Territories, 2023, EPA Region 9:__ Abstract: This table contains features that represent restaurants and caterers represented by six unique NAICS codes. Establishment-specific information except the annual excess food estimate was licensed to the EPA from D&B Hoovers in 2021 (https://www.dnb.com/). Calculations used to estimate annual excess food estimates are described in EPA’s 2023 publication: EPA Excess Food Opportunities Map Version 3 - Technical Methodology. The dataset contains 451,092 facilities. 

> __Food Banks:__ Abstract: This layer contains point features that represent facilities that recover excess food to feed people across the US. Food bank information was collected in 2015 from Feeding America (www.feedingamerica.org), a national organization for food banks, and includes food banks as well as Partner Distribution Organizations (PDO) and Regional Distribution Organizations (RDO). Sources of annual excess food weight are described in EPA’s 2019 publication: Technical Methodology for the EPA Excess Food Opportunities Map. The dataset contains 316 facilities. This layer is not authoritative as the data is from 2015; this layer is planned to be updated and replaced in 2024.

![epa.png](images/epa.png)



### ReFED: summary and detail csv files for US and States on food surplus.
https://insights-engine.refed.org/food-waste-monitor?view=overview&year=2022

> __ReFED Insights Engine:__ An online hub for data and solutions featuring the most comprehensive examination of food loss and waste in the United States – includes the Food Waste Monitor, Solutions Database, Solution Provider Directory, Impact Calculator, Capital Tracker, and Policy Finder.

![title](images/refed.png)

### Census: population estimates by State 
https://www.census.gov/data/tables/time-series/demo/popest/2020s-state-total.html

> __Census.gov State Population Totals and Components of Change: 2020-2023__ This page features files containing state population totals and components of change for years 2020 to 2023.  For the most recent data available, please refer to the Vintage 2023 data. Vintage 2023 is the most recent completed vintage and consistent set of estimates.

![title](images/census.png)


### USDA: prevalence of food insecurity average 2021-2023 by state 
https://www.ers.usda.gov/topics/food-nutrition-assistance/food-security-in-the-u-s/key-statistics-graphics/

> __State-level Prevalence of Food Insecurity__
> Prevalence rates of food insecurity varied considerably from State to State. Data for 3 years, 2021–2023, were combined to provide more reliable statistics at the State level. Estimated prevalence rates of food insecurity during this 3-year period ranged from 7.4 percent in New Hampshire to 18.9 percent in Arkansas; estimated prevalence rates of very low food security ranged from 3.2 percent in Iowa, Massachusetts, New Hampshire, New Jersey, and North Dakota to 7.0 percent in South Carolina.
Referencing: Household Food Security in the United States in 2023 https://www.ers.usda.gov/publications/pub-details/?pubid=109895

![title](images/usda.png)



## Dataset Design:

From EPA data
* State = US State (note: DC has limited data)
* Excess_Food_Low_Tons = estimated excess food from grocery and restaurants combined (low estimate)
* Excess_Food_High_Tons = estimated excess food from grocery and restaurants combined (high estimate)

From ReFED data (limited to restaurant and grocery sources), by state
* Tons_Beverages_Don = donated beverages (ready-to-drink) in tons
* Tons_Bread_Don = donated bread/bakery in tons
* Tons_Dairy_Eggs_Don = donated dairy/eggs in tons
* Tons_Dry_Don = donated dry goods in tons
* Tons_Frozen_Don = donated frozen foods in tons
* Tons_Meat_Don = donated meat (incl seafood) in tons
* Tons_Prepared_Don = donated prepared food in tons
* Tons_Produce_Don = donated produce in tons

From USDA data
* households_avg_21_23 = number of households averaged across 2021-2023
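* food_insecurity_percent = percentage of households that were food insecure in that state (2021-2023 average)
* very_low_food_security_percent = percentage of households with very low food security in that state (2021-2023 average)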

From Census data (estimated for July 1 of each year by state)
* pop2020 = population estimate for 2020
* pop2021 = population estimate for 2021
* pop2022 = population estimate for 2022
* pop2023 = population estimate for 2023

From EPA data (above) for food banks

* FoodBank_recd_TonYear = donations received at food banks in tons per year (totaled for each state)
* foodbanks = number of food banks per state
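
The column list above can be turned into a quick schema check on the combined file. The following is a minimal sketch in pandas, not part of the original workflow: the `check_schema` helper and the sample frame are hypothetical, and the column names are taken from the lists above.

```python
import pandas as pd

# Column names as listed in the Dataset Design section.
EXPECTED_COLUMNS = [
    "State", "Excess_Food_Low_Tons", "Excess_Food_High_Tons",
    "Tons_Beverages_Don", "Tons_Bread_Don", "Tons_Dairy_Eggs_Don",
    "Tons_Dry_Don", "Tons_Frozen_Don", "Tons_Meat_Don",
    "Tons_Prepared_Don", "Tons_Produce_Don",
    "households_avg_21_23", "food_insecurity_percent",
    "very_low_food_security_percent",
    "pop2020", "pop2021", "pop2022", "pop2023",
    "FoodBank_recd_TonYear", "foodbanks",
]


def check_schema(df):
    """Return (missing, unexpected) column names relative to the design."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    unexpected = [c for c in df.columns if c not in EXPECTED_COLUMNS]
    return missing, unexpected


# Hypothetical frame with one misspelled header, for illustration only.
sample = pd.DataFrame(columns=[c for c in EXPECTED_COLUMNS if c != "foodbanks"] + ["food_banks"])
missing, unexpected = check_schema(sample)
```

Running the check after the merge would flag renamed or dropped headers before the data reach a model.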

## Processing Steps

* Individual CSV and XLSX files were evaluated for data quality and data errors.
* Some files contained state names, while others contained state 2-letter abbreviations.
* All full state names were converted to 2-letter abbreviations.
* Records for Puerto Rico (PR) and the U.S. Virgin Islands (VI) were removed.
* Data containing multiple records per state were summed by state.
* Data were joined on state abbreviations.
* Column headers were changed to descriptive names.
* All data processing occurred in Microsoft Excel.
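
The processing itself was done in Excel, so the following is only a rough pandas equivalent of the steps above, offered for anyone who wants to script them. The helper names (`to_abbrev`, `prepare`, `combine`), the partial name-to-abbreviation map, and all sample values are assumptions made for illustration; they are not taken from the source files.

```python
import pandas as pd

# Partial map for illustration; a real run needs every state plus DC.
STATE_ABBREV = {"California": "CA", "Texas": "TX", "Puerto Rico": "PR"}
EXCLUDED = {"PR", "VI"}  # territories removed from the combined dataset


def to_abbrev(value):
    """Map a full state name to its 2-letter code; pass existing codes through."""
    value = str(value).strip()
    if len(value) == 2:
        return value.upper()
    return STATE_ABBREV.get(value, value)


def prepare(df, state_col="State"):
    """Normalize state labels, drop PR and VI, and sum numeric columns by state."""
    out = df.copy()
    out[state_col] = out[state_col].map(to_abbrev)
    out = out[~out[state_col].isin(EXCLUDED)]
    return out.groupby(state_col, as_index=False).sum(numeric_only=True)


def combine(frames):
    """Join the prepared frames on the state abbreviation."""
    merged = frames[0]
    for frame in frames[1:]:
        merged = merged.merge(frame, on="State", how="outer")
    return merged


# Made-up sample rows: one source uses full names, the other abbreviations.
epa = pd.DataFrame({
    "State": ["California", "California", "Texas", "Puerto Rico"],
    "Excess_Food_Low_Tons": [10.0, 5.0, 7.0, 1.0],
})
census = pd.DataFrame({"State": ["CA", "TX", "VI"], "pop2023": [100, 200, 3]})

combined = combine([prepare(epa), prepare(census)])
```

An outer join is used here so that a state missing from one source still appears in the result with empty cells, which makes gaps such as the limited DC data visible rather than silently dropped.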