---
title: "Data Collection"
format:
    html: 
        code-fold: false
---

Data for this project was collected primarily through APIs and public databases, covering two key datasets: nutritional data for food items and food waste data. The nutritional data was fetched with the `requests` library; in the code on this page, the food waste data is read from a ReFED CSV file stored under `data/raw-data/`.

The food data was obtained from the USDA FoodData Central API, which provides details about food items, including nutritional attributes (e.g., protein, fat, carbohydrates, vitamins, minerals) and serving sizes. The food waste data was sourced from ReFED, which reports food surplus and waste across food categories, along with their environmental and economic impact.

The collected data was then processed with Python libraries such as pandas: irrelevant columns were removed, missing values were handled, and formats were standardized. The resulting datasets support analysis of food waste patterns, nutritional content, and related metrics.
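
The cleaning steps described above could look roughly like the sketch below. It is only an illustration: the helper `clean_frame` and the sample column names (`Food Name`, `Tons Surplus`, `Notes`) are hypothetical and are not taken from the project's actual cleaning code.

```python
import pandas as pd


def clean_frame(df, drop_cols=(), fill_values=None):
    """Drop unneeded columns, normalize column names, and handle missing values."""
    # Remove irrelevant columns (silently ignoring any that are absent).
    cleaned = df.drop(columns=[c for c in drop_cols if c in df.columns])
    # Standardize column names to snake_case.
    cleaned.columns = [c.strip().lower().replace(" ", "_") for c in cleaned.columns]
    # Drop rows with no data at all, then fill remaining gaps per column.
    cleaned = cleaned.dropna(how="all")
    if fill_values:
        cleaned = cleaned.fillna(fill_values)
    return cleaned.reset_index(drop=True)


# Hypothetical sample data, for illustration only.
sample = pd.DataFrame({
    "Food Name": ["Apples", None, "Bread"],
    "Tons Surplus": [10.0, None, None],
    "Notes": ["x", None, "y"],
})
cleaned_sample = clean_frame(
    sample, drop_cols=["Notes"], fill_values={"tons_surplus": 0.0}
)
```

Filling missing numeric values with zero is a modeling choice; whether that is appropriate depends on what a missing value means in each dataset.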


{{< include overview.qmd >}} 

## Data Source Information

- Food Waste Data: The food waste data comes from the ReFED Food Surplus Database, which is publicly available. It includes detailed food waste information, such as the food category, food type, and surplus amounts. The data can be accessed from the following [link](https://insights-engine.refed.org/food-waste-monitor?break_by=destination&indicator=tons-surplus&view=detail&year=2018).

- Nutrient Data: Nutrient data is fetched using the USDA FoodData Central API. The API provides detailed information on food products, including nutrient content (e.g., calories, protein, fat), serving size, food category, and other attributes. The API documentation can be found [here](https://fdc.nal.usda.gov/api-guide).

- Original Data: The IPNI Estimates of Nutrient Uptake and Removal dataset is extracted from a PDF document using the pdfplumber library. The extracted data is then processed and saved as CSV files for further analysis.
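
The PDF extraction step is not shown in the code on this page. A minimal sketch of how tables might be pulled out of a PDF with pdfplumber and written to CSV is given below. The helper names and file paths are hypothetical, and the real IPNI document may need page-specific handling that this sketch does not attempt.

```python
import csv


def extract_tables(pdf_path):
    """Return every table pdfplumber finds in the PDF as a list of rows."""
    # Imported here so the pure-Python helpers below work without pdfplumber.
    import pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_tables() yields a list of tables; each table is a list
            # of rows, and each cell is a string or None.
            tables.extend(page.extract_tables())
    return tables


def clean_rows(table):
    """Strip whitespace, replace missing cells with '', drop fully empty rows."""
    cleaned = []
    for row in table:
        cells = [(cell or "").strip() for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def save_table(table, csv_path):
    """Write one cleaned table to a CSV file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(clean_rows(table))


# Hypothetical usage (paths are placeholders, not the project's real files):
# for i, table in enumerate(extract_tables("ipni_estimates.pdf")):
#     save_table(table, f"ipni_table_{i}.csv")
```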

## Data Collection Methods

- API Use: The USDA FoodData Central API is used to gather detailed nutritional information for food items, including food name, serving size, brand, and nutrients (protein, fat, carbohydrates, vitamins, minerals, etc.). A Python script queries the API's search endpoint once for each unique food name in the `foodwaste` DataFrame, passing the food name as the `query` parameter along with the API key and a `pageSize` limit.
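
To make the request structure concrete, here is a small sketch of a single search call. The endpoint and the `query`, `api_key`, and `pageSize` parameters match the code on this page. The helpers `build_search_params` and `search_foods`, and the injectable `http_get` argument (useful for testing without network access), are assumptions added for illustration.

```python
BASE_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"


def build_search_params(food_name, api_key, page_size=12):
    """Assemble the query-string parameters for one food search."""
    return {"query": str(food_name), "api_key": api_key, "pageSize": page_size}


def search_foods(food_name, api_key, page_size=12, http_get=None):
    """Return the list of matching foods, or an empty list on a non-200 reply."""
    if http_get is None:
        # Default to requests.get; imported lazily so a stub can be used instead.
        import requests
        http_get = requests.get

    response = http_get(
        BASE_URL,
        params=build_search_params(food_name, api_key, page_size),
        timeout=30,
    )
    if response.status_code != 200:
        return []
    return response.json().get("foods", [])
```

Keeping the parameter construction in its own function makes it easy to check what is sent to the API before making any real requests.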


## Relevance to the Project

The ReFED Food Surplus Data is key to understanding food waste patterns across food categories and helps identify the impact of food waste on the environment and the economy. The USDA FoodData Central data provides the nutritional content of food items, enabling analysis of the relationship between food waste and nutrition. The IPNI Nutrient Data further supports the project by offering estimates of nutrient uptake and removal, which can be linked to food surplus and waste. Together, these datasets help analyze food waste, nutrients, and the overall environmental and health impact of wasted food.
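
Linking these datasets requires matching food names that are spelled differently in each source. The code on this page imports several matching tools (rapidfuzz, TF-IDF with cosine similarity, and `difflib`), but it does not show which one is used for the join. The sketch below uses `difflib.SequenceMatcher` as a stand-in; the helper `best_match` and the 0.6 threshold are assumptions.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, threshold=0.6):
    """Return the candidate most similar to `name`, or None if none is close enough."""
    best, best_score = None, 0.0
    for candidate in candidates:
        # ratio() is a similarity score in [0, 1]; comparison is case-insensitive.
        score = SequenceMatcher(None, name.lower(), candidate.lower()).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None
```

A threshold avoids forcing a match when no candidate is genuinely similar; the right cutoff would need to be tuned against the actual food names.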

## Code

import requests
import pandas as pd
import json
import fitz  
import pdfplumber
from rapidfuzz import process
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from difflib import SequenceMatcher


with open('technical-details/data-collection/config.json') as f:
    keys = json.load(f)
API_KEY = keys['fdaapi']

BASE_URL = "https://api.nal.usda.gov/fdc/v1/foods/search"

# Source: https://insights-engine.refed.org/food-waste-monitor?break_by=destination&indicator=tons-surplus&view=detail&year=2018
foodwaste = pd.read_csv("data/raw-data/ReFED_US_Food_Surplus_Detail.csv")
foodwaste.columns
foodwaste["food_category"].nunique()

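# Use food_category as the search term, falling back to food_type when the category is "Not Applicable".
foodwaste['food_name'] = foodwaste['food_category'].where(foodwaste['food_category'] != "Not Applicable", foodwaste['food_type'])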

def fetch_food_data(df, page_size=12):
    """Query the food search endpoint once per unique food name and collect the raw results."""
    all_results = []
    if 'food_name' not in df.columns or df['food_name'].empty:
        raise ValueError("DataFrame must contain a non-empty 'food_name' column")

    # Skip missing names so that NaN is not sent to the API as the query "nan".
    for food_name in df['food_name'].dropna().unique():
        params = {
            'query': str(food_name),
            'api_key': API_KEY,
            'pageSize': page_size
        }

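        try:
            # A timeout keeps one stalled request from hanging the whole loop.
            response = requests.get(BASE_URL, params=params, timeout=30)
        except requests.RequestException as exc:
            print(f"Request failed for {food_name}: {exc}")
            continue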

        if response.status_code == 200:
            results = response.json()
            if 'foods' in results:
                all_results.extend(results['foods'])
        else:
            print(f"Error for {food_name}: {response.status_code}")

    return all_results


def process_food_data(data):
    """Flatten raw search results into a DataFrame; foods without a serving size unit are skipped."""
    food_items = []

    if data:
        for food in data:
            if 'servingSizeUnit' in food and food['servingSizeUnit']:
                food_info = {
                    'food_name': food.get('description', ''),
                    'fdc_id': food.get('fdcId', ''),
                    'brand': food.get('brandOwner', ''),
                    'food_category': food.get('foodCategory', ''),
                    'market_country': food.get('marketCountry', ''),
                    'serving_size': food.get('servingSize', 0),
                    'serving_size_unit': food.get('servingSizeUnit', ''),
                }

                for nutrient in food.get('foodNutrients', []):
                    nutrient_name = nutrient.get('nutrientName', '')
                    nutrient_value = nutrient.get('value', 0)

                    if nutrient_name == 'Energy':  # Calories
                        food_info['calories'] = nutrient_value
                    elif nutrient_name == 'Protein':
                        food_info['protein'] = nutrient_value
                    elif nutrient_name == 'Total lipid (fat)':  # Fat
                        food_info['fat'] = nutrient_value
                    elif nutrient_name == 'Carbohydrate, by difference':  # Carbs
                        food_info['carbs'] = nutrient_value
                    elif nutrient_name == 'Fiber, total dietary':  # Fiber
                        food_info['fiber'] = nutrient_value
                    elif nutrient_name == 'Sugars, total':  # Sugar
                        food_info['sugar'] = nutrient_value
                    elif nutrient_name == 'Vitamin A, IU':  # Vitamin A (IU)
                        food_info['vitamin_a_iu'] = nutrient_value
                    elif nutrient_name == 'Vitamin C, total ascorbic acid':  # Vitamin C (mg)
                        food_info['vitamin_c_mg'] = nutrient_value
                    elif nutrient_name == 'Cholesterol':  # Cholesterol (mg)
                        food_info['cholesterol_mg'] = nutrient_value
                    elif nutrient_name == 'Fatty acids, total saturated':  # Saturated Fat (g)
                        food_info['saturated_fat_g'] = nutrient_value
                    elif nutrient_name == 'Calcium, Ca':  # Calcium (mg)
                        food_info['calcium'] = nutrient_value
                    elif nutrient_name == 'Iron, Fe':  # Iron (mg)
                        food_info['iron'] = nutrient_value
                    elif nutrient_name == 'Sodium, Na':  # Sodium (mg)
                        food_info['sodium'] = nutrient_value
                    elif nutrient_name == 'Potassium, K':  # Potassium (mg)
                        food_info['potassium'] = nutrient_value
                    elif nutrient_name == 'Magnesium, Mg':  # Magnesium (mg)
                        food_info['magnesium'] = nutrient_value
                    elif nutrient_name == 'Phosphorus, P':  # Phosphorus (mg)
                        food_info['phosphorus'] = nutrient_value

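                    # Note: this is overwritten on every iteration, so it ends up
                    # holding the percentDailyValue of the last nutrient only.
                    food_info['percent_daily_value'] = nutrient.get('percentDailyValue', 0)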

                food_info['microbes'] = ', '.join(str(microbe) for microbe in food.get('microbes', [])) if food.get('microbes') else 'No microbes listed'
                
                food_info['allergens'] = ', '.join(str(allergen) for allergen in food.get('allergens', ['None'])) if food.get('allergens') else 'None'

                food_info['additives'] = ', '.join(str(additive) for additive in food.get('additives', [])) if food.get('additives') else 'None'

                food_info['labels'] = ', '.join(food.get('labels', ['None'])) if food.get('labels') else 'None'

                food_info['nutrient_group'] = food.get('nutrientGroup', 'Unknown')
                food_info['brand_name'] = food.get('brandOwner', 'Unknown')
                food_info['food_labels'] = ', '.join(food.get('labels', ['None'])) if food.get('labels') else 'None'
                food_info['package_size'] = food.get('packageSize', 'Not available')
                food_info['food_safety_info'] = food.get('foodSafetyInfo', 'No safety info available')
                food_info['expiration_date'] = food.get('expirationDate', 'Not available')
                food_info['country_of_origin'] = food.get('countryOfOrigin', 'Not specified')

                food_items.append(food_info)

    if food_items:
        food_df = pd.DataFrame(food_items)
    else:
        food_df = pd.DataFrame()

    return food_df

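# Fetch search results for every unique food name, then flatten them into a DataFrame.
raw_results = fetch_food_data(foodwaste)
food_data = process_food_data(raw_results)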



food_data.head()
food_data.shape
foodwaste.head()
foodwaste.shape

print(food_data)
print(foodwaste)
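
The long `if`/`elif` chain in `process_food_data` maps API nutrient names to column names. As an alternative formulation, the same mapping can be kept in a dictionary, which makes it easier to add or rename nutrients. This is a sketch only: `NUTRIENT_COLUMNS` and `extract_nutrients` are hypothetical names, while the nutrient names and column names are copied from the code above.

```python
# API nutrient name -> output column name (taken from process_food_data above).
NUTRIENT_COLUMNS = {
    'Energy': 'calories',
    'Protein': 'protein',
    'Total lipid (fat)': 'fat',
    'Carbohydrate, by difference': 'carbs',
    'Fiber, total dietary': 'fiber',
    'Sugars, total': 'sugar',
    'Vitamin A, IU': 'vitamin_a_iu',
    'Vitamin C, total ascorbic acid': 'vitamin_c_mg',
    'Cholesterol': 'cholesterol_mg',
    'Fatty acids, total saturated': 'saturated_fat_g',
    'Calcium, Ca': 'calcium',
    'Iron, Fe': 'iron',
    'Sodium, Na': 'sodium',
    'Potassium, K': 'potassium',
    'Magnesium, Mg': 'magnesium',
    'Phosphorus, P': 'phosphorus',
}


def extract_nutrients(food):
    """Return {column_name: value} for the nutrients listed in NUTRIENT_COLUMNS."""
    nutrients = {}
    for nutrient in food.get('foodNutrients', []):
        column = NUTRIENT_COLUMNS.get(nutrient.get('nutrientName', ''))
        if column is not None:
            # Nutrients not in the mapping are ignored, as in the if/elif chain.
            nutrients[column] = nutrient.get('value', 0)
    return nutrients
```

With this helper, the inner loop of `process_food_data` could be reduced to `food_info.update(extract_nutrients(food))`.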

{{< include closing.qmd >}} 