<a href="https://colab.research.google.com/github/wbendinelli/pneumonia_classification/blob/main/etl_phl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What are the main factors that determine post-harvest losses of grains?

### Hypotheses to be tested

H1. Macroeconomic conditions influence post-harvest losses (PHL) of grains.

H2. Increases in food production and increases in PHL per capita are positively correlated.

H3. A higher level of economic development in a country can lead to a lower level of PHL.

H4. Lack of food storage and food marketing infrastructure can lead to higher PHL.

### Article abstract
Reducing post-harvest losses (PHL) improves food security, safety, and profits for actors in the food supply chain. Despite its importance, most published studies on PHL have been qualitative due to data limitations. This paper aims to address this gap by investigating the impact of macroeconomic factors on PHL in developing countries using a quantitative approach. The study will contribute to the existing body of knowledge by providing empirical evidence on the relationship between macroeconomic factors and PHL, which can inform policy and practice aimed at reducing PHL and improving food security and safety.

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
#libraries used in the study
import pandas as pd
import os
from zipfile import ZipFile
import numpy as np

## Extracting

In [None]:
# Define the list of files to keep
keep_files = ["FoodBalanceSheetsHistoric_E_All_Data_(Normalized).csv",
              "Food_Security_Data_E_All_Data_(Normalized).csv",
              "Trade_Indices_E_All_Data_(Normalized).csv",
              "Prices_E_All_Data_(Normalized).csv"]

# Get the path to the folder containing the files
folder_path = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/fao_database/'

# Change the current working directory to the folder
# (os.chdir returns None, so the path is kept in its own variable)
os.chdir(folder_path)
current_dir = folder_path

# Iterate through all files in the current directory
for file in os.listdir(current_dir):
    # Check if the file is a ZIP file
    if file.endswith(".zip"):
        # Open the ZIP file
        with ZipFile(file, 'r') as zip_ref:
            # Extract all files from the ZIP file
            zip_ref.extractall(current_dir)

# Remove every file that is neither a ZIP archive nor in the keep list
for filename in os.listdir(current_dir):
    if not filename.endswith(".zip") and filename not in keep_files:
        file_path = os.path.join(folder_path, filename)
        os.remove(file_path)

# Print a success message
print("All non-zip files and files not in the keep list have been removed.")

All non-zip files and files not in the keep list have been removed.
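
The extract-then-delete approach above could also be written so that only the wanted members are ever written to disk. The following is a minimal sketch, assuming the CSV files sit at the top level of each archive; the helper name `extract_only` and the throwaway demo archive are hypothetical and not part of the original pipeline.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile


def extract_only(folder, keep_files):
    """Extract only the members listed in keep_files from every ZIP in folder."""
    folder = Path(folder)
    extracted = []
    for zip_path in sorted(folder.glob("*.zip")):
        with ZipFile(zip_path) as zip_ref:
            for member in zip_ref.namelist():
                # Skip everything that is not explicitly wanted
                if member in keep_files:
                    zip_ref.extract(member, folder)
                    extracted.append(member)
    return extracted


# Demonstration on a temporary folder with a made-up archive
with tempfile.TemporaryDirectory() as tmp:
    with ZipFile(Path(tmp) / "demo.zip", "w") as zf:
        zf.writestr("wanted.csv", "a,b\n1,2\n")
        zf.writestr("unwanted.csv", "x\n0\n")
    extracted = extract_only(tmp, ["wanted.csv"])
    remaining = sorted(p.name for p in Path(tmp).iterdir())
```

With this variant nothing has to be deleted afterwards, so files that were already in the folder for other reasons are left untouched.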


In [None]:
#importing lists for the study
path1 = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/auxiliary_files/countries_list.xlsx'
path2 = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/auxiliary_files/fao_item_list.xlsx'

countries_list = pd.read_excel(path1, sheet_name = 'list_economies', header=[2])
fao_item_list = pd.read_excel(path2, sheet_name = 'item_description', header=[2])

#prerequisite: download the FAOSTAT data files in .zip format into the fao_database folder

path1 = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/fao_database/FoodBalanceSheetsHistoric_E_All_Data_(Normalized).csv'
fbs_raw_data = pd.read_csv(path1, encoding = "ISO-8859-1")

path2 = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/fao_database/Food_Security_Data_E_All_Data_(Normalized).csv'
fsi_raw_data = pd.read_csv(path2, encoding = "ISO-8859-1")

path3 = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/fao_database/Trade_Indices_E_All_Data_(Normalized).csv'
trade_raw_data = pd.read_csv(path3, encoding = "ISO-8859-1")

path4 = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/fao_database/Prices_E_All_Data_(Normalized).csv'
prices_raw_data = pd.read_csv(path4, encoding = "ISO-8859-1")



## Transforming

### Food balance sheets data transforming

In [None]:
#renamed columns
fbs_data = fbs_raw_data.rename(columns={
    'Area Code': 'area_code',
    'Area Code (M49)': 'area_code_m49',
    'Area': 'area',
    'Item Code': 'item_code',
    'Item Code (FBS)': 'item_code_fbs',
    'Item': 'item',
    'Element Code': 'element_code',
    'Element': 'element',
    'Year Code': 'year_code',
    'Year': 'year',
    'Unit': 'unit',
    'Value': 'value',
    'Flag': 'flag'})


#filter conditions
cond1 = fbs_data.item_code.isin([2511, 2514, 2555, 2805]) #keep only the crop items (wheat, maize, soybeans, rice)
cond2 = fbs_data.area_code < 5000 #keep country-level balance sheets (assumption: area codes from 5000 upward are aggregates)
cond3 = fbs_data.year_code >= 2000
#cond4 = fbs_data.year_code <= 2010
cond5 = fbs_data.element_code.isin([645, 664, 674, 684]) #elements excluded below through ~cond5
condf = cond1 & cond2 & cond3 & ~cond5 #& cond4

#explicit copy, so the assignment below does not raise SettingWithCopyWarning
fbs_data_filter_one = fbs_data.loc[condf].copy()

#renaming rows in element column
fbs_data_filter_one['element'] = fbs_data_filter_one['element'].replace(
    {'Production': 'fbs_production',
     'Import Quantity': 'fbs_import_quantity',
     'Stock Variation': 'fbs_stock_variation',
     'Domestic supply quantity': 'fbs_domestic_supply_quantity',
     'Seed': 'fbs_seed',
     'Losses': 'fbs_losses',
     'Food': 'fbs_food',
     'Export Quantity': 'fbs_export_quantity',
     'Feed': 'fbs_feed',
     'Other uses (non-food)': 'fbs_other_uses',
     'Processing': 'fbs_processing'})

#adding populations
cond1 = fbs_data.item_code.isin([2501])
condf = cond1

fbs_data_filter_two = fbs_data.loc[condf].rename(columns={'value': 'fbs_population'})
fbs_data_filter_two = fbs_data_filter_two[['area_code', 'year_code', 'fbs_population']]

#pivoting table (values are summed over the selected crop items)
fbs_pivot = fbs_data_filter_one.pivot_table(values = 'value', index = ['area_code', 'year_code'], columns = ['element'], aggfunc='sum')
fbs_pivot = pd.DataFrame(fbs_pivot.to_records())
fbs_pivot_merged = pd.merge(fbs_pivot, fbs_data_filter_two, on=['area_code', 'year_code'], how="inner")
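
The pivot step turns the normalized (long) FAOSTAT layout into one row per country-year with one column per element. The toy sketch below uses invented numbers rather than FAOSTAT data, and shows that the `sum` aggregation adds the selected crops together within each country-year.

```python
import pandas as pd

# Long layout: one row per (area, year, item, element); values are made up
long_df = pd.DataFrame({
    "area_code": [1, 1, 1, 1],
    "year_code": [2000, 2000, 2000, 2000],
    "item_code": [2511, 2514, 2511, 2514],
    "element": ["fbs_production", "fbs_production", "fbs_losses", "fbs_losses"],
    "value": [100.0, 50.0, 5.0, 2.0],
})

# Wide layout: one row per (area, year), one column per element
wide_df = (
    long_df.pivot_table(values="value", index=["area_code", "year_code"],
                        columns="element", aggfunc="sum")
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover 'element' axis label
```

Because the item dimension disappears in the pivot, crop-level detail is no longer available afterwards; keeping `item_code` in the index would preserve it.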


### Food Security data transforming


In [None]:
fsi_data = fsi_raw_data.rename(columns={
    'Area Code': 'area_code',
    'Area': 'area',
    'Area Code (M49)': 'area_code_m49',
    'Item Code': 'item_code',
    'Item': 'item',
    'Element Code': 'element_code',
    'Element': 'element',
    'Year Code': 'year_code',
    'Year': 'year',
    'Unit': 'unit',
    'Value': 'value',
    'Flag': 'flag',
    'Note': 'note'})

#filter conditions
cond1 = fsi_data.element_code.isin([6121]) #getting main values and excluding confidence interval
condf = cond1

#explicit copy, so the assignment below does not raise SettingWithCopyWarning
fsi_data_filter_one = fsi_data.loc[condf].copy()

#renaming variables
path1 = '/content/drive/MyDrive/ia_ml_projects/post_harvest_loss/auxiliary_files/fbi_item_features_renamed.xlsx'
fsi_features_renamed = pd.read_excel(path1)
fsi_features_renamed = fsi_features_renamed[['item_code', 'item_renamed']]

#merging dataset with new features renamed
fsi_data_filter_one['item_code'] = fsi_data_filter_one['item_code'].astype(str)
fsi_features_renamed['item_code'] = fsi_features_renamed['item_code'].astype(str)

fsi_merged = pd.merge(fsi_data_filter_one, fsi_features_renamed, on=['item_code'], how="left")


In [None]:
def get_last_or_first(x):
  """
  Returns the last element of a list of year strings.

  For a multi-year label split into ['2000', '2002'] this is the final year;
  for a single-year label such as ['2015'] it is that year.

  Args:
      x: A non-empty list, as produced by str.split.

  Returns:
      The last element of the list (the only element if it has just one).
  """
  return x[-1]


In [None]:
#multi-year averages are labelled as a range of years; keep the last year of each range
fsi_merged['year'] = fsi_merged['year'].str.split('-')
fsi_merged['year_code'] = fsi_merged.year.apply(get_last_or_first)

#pivoting table
fsi_pivot = fsi_merged.pivot(values = 'value', index = ['area_code', 'year_code'], columns = ['item_renamed'])
fsi_pivot = fsi_pivot.reset_index()
fsi_pivot['year_code'] = fsi_pivot['year_code'].astype('int64')
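
The year handling above maps labels such as `2000-2002` to the last year of the range and leaves single-year labels unchanged. A compact equivalent using only pandas string methods is sketched below on made-up labels.

```python
import pandas as pd

# Made-up year labels: two ranges and one single year
years = pd.Series(["2000-2002", "2015", "2019-2021"])

# Split on '-' and take the last piece, which is the only piece for single years
year_code = years.str.split("-").str[-1].astype("int64")
```

Assigning an average to the last year of its window is a convention; the middle year would be an equally defensible choice and only changes the `.str[-1]` step.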

In [None]:
def remove_less_than_symbol(df):
  """
  Removes the '<' symbol from the string values of every object-dtype column.

  Args:
      df: The pandas dataframe to process (modified in place).

  Returns:
      The same dataframe, with the '<' symbol removed from all string columns.
  """

  for col in df.columns:
    # Check if the column contains string values
    if df[col].dtype == "object":
      # Remove the literal '<' symbol from string values
      df[col] = df[col].str.replace("<", "", regex=False)

  return df

# Apply the function to the fsi_pivot dataframe, then convert every column to numeric
fsi_pivot = remove_less_than_symbol(fsi_pivot)
fsi_pivot = fsi_pivot.apply(pd.to_numeric)
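
Stripping the `<` symbol means that a reported bound such as `<2.5` is treated as the exact value 2.5. The toy sketch below, with invented values, shows the effect on a single column.

```python
import pandas as pd

# Invented indicator values: a bound, a plain number and a missing entry
raw = pd.DataFrame({"area_code": [1, 2, 3], "indicator": ["<2.5", "7.1", None]})

clean = raw.copy()
clean["indicator"] = pd.to_numeric(
    clean["indicator"].str.replace("<", "", regex=False)
)
```

If the distinction matters for the analysis, a separate boolean column marking which rows were bounds could be created before the symbol is removed.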

### Trade indices data transformation

In [None]:
trade_data = trade_raw_data.rename(columns={
    'Area Code': 'area_code',
    'Area': 'area',
    'Item Code': 'item_code',
    'Item': 'item',
    'Element Code': 'element_code',
    'Element': 'element',
    'Year Code': 'year_code',
    'Year': 'year',
    'Unit': 'unit',
    'Value': 'value',
    'Flag': 'flag'})

#filter conditions
cond1 = trade_data.element_code.isin([64, 65, 94, 95])
#cond2 = trade_data.year_code >= 2000
#cond3 = trade_data.year_code <= 2010
condf = cond1 #& cond2 & cond3

#explicit copy, so the assignment below does not act on a slice of the original dataframe
trade_data = trade_data.loc[condf].copy()

#renaming rows in element column
trade_data['element'] = trade_data['element'].replace(
    {'Import Value Base Period Quantity': 'trade_import_value_qtty',
     'Import Value Base Period Price': 'trade_import_value_price',
     'Export Value Base Quantity': 'trade_export_value_qtt',
     'Export Value Base Price': 'trade_export_value_price'})

#pivoting table
trade_pivot = trade_data.pivot_table(values = 'value', index = ['area_code', 'year_code'], columns = ['element'], aggfunc='sum')
trade_pivot = pd.DataFrame(trade_pivot.to_records())

### Merging data

In [None]:
#merging tables
data_merged = pd.merge(fbs_pivot,countries_list.rename(columns={'code_faostat': 'area_code'}), on=['area_code'])
data_merged = pd.merge(data_merged,fsi_pivot, on=['area_code', 'year_code'])
data_merged = pd.merge(data_merged,trade_pivot, on=['area_code', 'year_code'])

#filtering the same countries as the study: GUSTAVSSON, J. et al. Global food losses and food waste – Extent, causes and prevention. Rome, FAO, 2011.
cond1 = data_merged.gustavsson == 1
condf = cond1

data_merged = data_merged.loc[condf]
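
The merges above use the default inner join, so any country-year missing from one of the tables is dropped without notice. One way to see how many rows would be lost is an outer merge with `indicator=True`, sketched here on two made-up tables.

```python
import pandas as pd

# Two toy tables that only share the country-year (2, 2000)
left = pd.DataFrame({"area_code": [1, 2], "year_code": [2000, 2000], "a": [1.0, 2.0]})
right = pd.DataFrame({"area_code": [2, 3], "year_code": [2000, 2000], "b": [5.0, 6.0]})

# The _merge column records where each row came from
check = left.merge(right, on=["area_code", "year_code"], how="outer", indicator=True)
counts = check["_merge"].value_counts().to_dict()
```

Rows counted under `left_only` or `right_only` are the ones an inner join would discard.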

## Balancing and filling gaps

### Removing columns with many missing values

In [None]:
#Drop columns with all null values
#data_merged.dropna(axis=1, how="all", inplace=True)

# Calculate the percentage of missing values in each column
missing_values = data_merged.isna().sum() / len(data_merged)

# Select columns with more than 35% missing values
columns_to_drop = missing_values[missing_values > 0.35].index.tolist()
print(columns_to_drop)

# Drop the selected columns
data_merged = data_merged.drop(columns=columns_to_drop)

['fbi_children_affected_wasting_perc', 'fbi_infants_breastfeeding_perc', 'fbi_people_undernourished_perc', 'fbi_severe_food_insecure_famele_people_perc', 'fbi_severe_food_insecure_famele_people_perc_mm', 'fbi_severe_food_insecure_male_people_perc', 'fbi_severe_food_insecure_male_people_perc_mm', 'fbi_severe_food_insecure_rural_people_perc', 'fbi_severe_food_insecure_semi_dense_people_perc', 'fbi_severe_food_insecure_urban_people_perc', 'fbi_severe_moderate_food_insecure_famele_people_perc', 'fbi_severe_moderate_food_insecure_famele_people_perc_mm', 'fbi_severe_moderate_food_insecure_male_people_perc', 'fbi_severe_moderate_food_insecure_male_people_perc_mm', 'fbi_severe_moderate_food_insecure_people_perc', 'fbi_severe_moderate_food_insecure_people_perc_mm', 'fbi_severe_moderate_food_insecure_rural_people_perc', 'fbi_severe_moderate_food_insecure_semi_dense_people_perc', 'fbi_severe_moderate_food_insecure_urban_people_perc', 'fbi_share_severe_food_insecure_pop_perc', 'fbi_share_severe_fo
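
The threshold rule can be checked on a small made-up dataframe. In this sketch the column names are hypothetical and the 0.35 cut-off matches the cell above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_present": [1.0, 2.0, 3.0, np.nan],        # 25% missing
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan],  # 75% missing
})

threshold = 0.35
# isna().mean() gives the share of missing values per column
missing_share = df.isna().mean()
kept = df.loc[:, missing_share <= threshold]
```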

### Creating a summary table for the region to fill in the largest gaps

In [None]:
# Define the columns to drop
columns_to_drop = ["country", "code_c", "region", "income_group", "code_g"]

# Drop the specified columns from the dataframe
data_merged_drop = data_merged.drop(columns=columns_to_drop)
data_region_median = data_merged_drop.groupby(['code_r', 'year_code']).median().reset_index()

In [None]:
def fill_gaps_with_group_mean(df, variable_list, groupby_column):
  """
  Fills missing values in specified columns of a dataframe with the mean value within each group.

  Args:
      df: The pandas dataframe to process (modified in place).
      variable_list: A list of column names where missing values should be filled.
      groupby_column: The name of the column to group by.

  Returns:
      The same dataframe, with missing values filled with the mean value within each group.
  """

  # Iterate through each variable in the list
  for variable in variable_list:
    # Fill missing values in the variable using the groupby mean
    df[variable] = df[variable].fillna(df.groupby(groupby_column)[variable].transform('mean'))

  # Return the updated dataframe
  return df

# Define the list of variables with gaps and the groupby column
nan_cols = [i for i in data_region_median.columns if data_region_median[i].isnull().any()]

variable_list = nan_cols
groupby_column = "code_r"

# Apply the function to the regional summary table (data_region_median is updated in place)
data_region_median = fill_gaps_with_group_mean(data_region_median, variable_list, groupby_column)
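
The group-mean fill relies on `groupby(...).transform('mean')`, which returns a series aligned with the original rows, so it can be passed straight to `fillna`. A toy illustration with invented values:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "code_r": ["A", "A", "A", "B", "B"],
    "x": [1.0, np.nan, 3.0, np.nan, 10.0],
})

# Each gap is replaced by the mean of the non-missing values in its own group
df["x_filled"] = df["x"].fillna(df.groupby("code_r")["x"].transform("mean"))
```

A group with no observed values at all keeps its gaps, because its mean is itself missing.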

### Filling in missing values in the main table

#### Validating FBS data

In [None]:
#This idea was abandoned: the panel is unbalanced because some country-years cannot exist (countries that were created, or that only started reporting statistics, during the period), not because data are missing!


#Identify all countries and years present
#countries = data_merged['area_code'].unique()
#years = range(data_merged['year_code'].min(), data_merged['year_code'].max() + 1)

#Create a full dataFrame with all combinations of country and year
#data_balanced = pd.MultiIndex.from_product([countries, years], names=['area_code', 'year_code']).to_frame(index=False)

#data_balanced = pd.merge(data_balanced,data_merged, on=['area_code', 'year_code'], how='left')

In [None]:
def fill_missing_with_zero(df, column_substring):
  """
  Fills missing values in all columns of a dataframe whose names contain a specific substring with 0.

  Args:
      df: The pandas dataframe to process (modified in place).
      column_substring: The substring to search for in column names.

  Returns:
      The same dataframe, with missing values filled with 0 in the specified columns.
  """

  # Identify columns that contain the substring
  columns_to_fill = [col for col in df.columns if column_substring in col]

  # Fill missing values in the selected columns with 0
  for col in columns_to_fill:
    df[col] = df[col].fillna(0)

  # Return the updated dataframe
  return df

# Define the substring to search for
column_substring = "fbs"

# Fill missing values with 0 in the specified columns
data_merged_filled = fill_missing_with_zero(data_merged, column_substring)

**FBS basic identities according to FAO**

As many countries do not collect data on stock levels for the majority of products, absolute opening and closing stock levels are replaced by an estimate of the change in stock levels during the reference period.

**a) Domestic supply = Domestic utilization**

Production + Imports – Exports – ΔStocks = Food + Feed + Seed + Tourist Food + Industrial Use + Loss + Residual Use

**b) Total supply = Total utilization**

Production + Imports – ΔStocks = Exports + Food + Feed + Seed + Tourist Food + Industrial Use + Loss + Residual Use


In [None]:
data_merged_filled['domestic_supply'] = (  data_merged_filled['fbs_production']
                                         + data_merged_filled['fbs_import_quantity']
                                         - data_merged_filled['fbs_export_quantity']
                                         + data_merged_filled['fbs_stock_variation'])

data_merged_filled['domestic_utilization'] = (  data_merged_filled['fbs_food']
                                              + data_merged_filled['fbs_feed']
                                              + data_merged_filled['fbs_seed']
                                              + data_merged_filled['fbs_processing']
                                              + data_merged_filled['fbs_other_uses']
                                              + data_merged_filled['fbs_losses'])

data_merged_filled['diff_identities'] = data_merged_filled['domestic_supply'] - data_merged_filled['domestic_utilization']
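
The identity check can be tried on a single made-up row. Note that the identities as written above subtract ΔStocks, while the code adds `fbs_stock_variation`, so the sign convention of the stock term should be confirmed against the dataset documentation. This sketch, with invented numbers, computes the residual under both conventions.

```python
import pandas as pd

# One invented country-year; the quantities are not FAOSTAT data
row = pd.DataFrame({
    "fbs_production": [100.0], "fbs_import_quantity": [20.0],
    "fbs_export_quantity": [10.0], "fbs_stock_variation": [-5.0],
    "fbs_food": [60.0], "fbs_feed": [25.0], "fbs_seed": [5.0],
    "fbs_processing": [5.0], "fbs_other_uses": [5.0], "fbs_losses": [5.0],
})

utilization = row[["fbs_food", "fbs_feed", "fbs_seed", "fbs_processing",
                   "fbs_other_uses", "fbs_losses"]].sum(axis=1)
net_trade = (row["fbs_production"] + row["fbs_import_quantity"]
             - row["fbs_export_quantity"])

# Residual when the stock term is added (as in the code cell above)
residual_plus = net_trade + row["fbs_stock_variation"] - utilization
# Residual when the stock term is subtracted (as in the written identity)
residual_minus = net_trade - row["fbs_stock_variation"] - utilization
```

Whichever convention leaves residuals close to zero across most country-years in the real data is likely the one the dataset follows.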

#### Filling gaps in FBI data

In [None]:
def fill_fbi_first_two_rows_with_area_mean(df):
  """
  Fills missing values in the first two rows of FBI columns for each area code with the mean of the area code.

  Args:
      df: The pandas DataFrame.

  Returns:
      A new DataFrame with missing values filled.
  """

  # Identify FBI columns.
  fbi_columns = [col for col in df.columns if col.startswith('fbi')]

  # Create a new DataFrame to store the transformed data.
  df_transformed = df.copy()

  # Group by area code.
  grouped_df = df.groupby('area_code')

  # Iterate over area codes.
  for area_code, group in grouped_df:
    # Get a copy of the first two rows of the current area code,
    # so the assignment below does not act on a slice of the group.
    first_two_rows = group.head(2).copy()

    # Check if any of the FBI columns have NaN values in the first two rows.
    if first_two_rows[fbi_columns].isna().any().any():
      # Calculate the area code mean for each FBI column.
      area_means = group[fbi_columns].mean().to_dict()

      # Fill missing values in the first two rows for each FBI column.
      for col in fbi_columns:
        first_two_rows.loc[first_two_rows[col].isna(), col] = area_means[col]

      # Update the transformed DataFrame with the filled first two rows.
      df_transformed.loc[first_two_rows.index] = first_two_rows

  return df_transformed

data_merged_filled = fill_fbi_first_two_rows_with_area_mean(data_merged_filled)

#### Filling gaps by region median

In [None]:
# Identify the columns shared by both tables
common_columns = list(set(data_merged_filled.columns) & set(data_region_median.columns))

# Exclude the grouping columns from the shared columns, if present
grouping_columns = ['code_r', 'year_code']
common_columns = [col for col in common_columns if col not in grouping_columns]

# Group data_region_median by 'code_r' and 'year_code' and compute the median
region_median_grouped = data_region_median.groupby(['code_r', 'year_code'])[common_columns].median().reset_index()

# Fill the missing values in data_merged_filled with the region-year median
for col in common_columns:
    data_merged_filled[col] = data_merged_filled.apply(
        lambda row: region_median_grouped[
            (region_median_grouped['code_r'] == row['code_r']) &
            (region_median_grouped['year_code'] == row['year_code'])
        ][col].values[0] if pd.isna(row[col]) else row[col], axis=1
    )
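
The row-wise `apply` above looks up the region-year median once per row and per column. A possibly faster alternative, sketched here on toy data and not taken from the original notebook, is to merge the medians in once and then use `fillna`; the `_median` suffix is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Toy country table with gaps and a toy region-year median table
countries = pd.DataFrame({
    "area_code": [1, 2, 3],
    "code_r": ["A", "A", "B"],
    "year_code": [2000, 2000, 2000],
    "x": [np.nan, 4.0, np.nan],
})
region_median = pd.DataFrame({
    "code_r": ["A", "B"],
    "year_code": [2000, 2000],
    "x": [4.0, 9.0],
})

keys = ["code_r", "year_code"]
# Bring each row's region-year median alongside its own value
merged = countries.merge(region_median, on=keys, how="left", suffixes=("", "_median"))
for col in ["x"]:
    merged[col] = merged[col].fillna(merged[f"{col}_median"])
filled = merged.drop(columns=["x_median"])
```

Unlike the `.values[0]` lookup, a left merge does not raise an error when a region-year has no median row; the gap simply stays missing.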

In [None]:
data_region_median.to_excel('data_region_median.xlsx')
from google.colab import files
files.download('data_region_median.xlsx')


## Loading

To do: create the dummy variables for region and PHL, foreign trade openness, etc.
