# Part 1

# Necessary Lists & Functions

In [None]:
import pandas as pd

In [None]:
asean_list = ['VNM','LAO','THA','KHM','MYS','SGP','MMR','PHL','BRN','IDN']
south_asia_list = ['BGD','IND','PAK','NPL','LKA','BTN']
all_countries_list = ['VNM','LAO','THA','KHM','MYS','SGP','MMR','PHL','BRN','IDN','BGD','IND','PAK','NPL','LKA','BTN']

# Data Cleaning
def cleanUp(data):
    data = data[data['code'].isin(all_countries_list)]
    data['year'] = pd.to_datetime(data['year'], format='%Y').dt.year
    data = data.drop(data[data['year'] <= 2010].index)
    data = data.set_index(['code','year'])
    return data

def basic_stats(df):
    countries = list(df['country'].unique())
    n_countries = len(countries) 
    time_series = f"{df.year.min()} ~ {df.year.max()}"
    missing_data_summary = df.drop(['country','year'], axis=1).notna().sum()
    
    return {
        'countries (first 5)': countries[:5],
        'country_count': n_countries,
        'time_series': time_series,
        'observations_bycolumn': missing_data_summary
    }

# 1. Data sets

## 1.1 import all dataset

In [None]:
df_access = pd.read_csv("../data/processed/access_merged.csv")
df_controls = pd.read_csv("../data/processed/control_var.csv")
df_agriculture = pd.read_csv("../data/processed/agriculture_merged.csv")
df_staple = pd.read_csv("../data/processed/StapleFoodStability_adjusted.csv")
df_findex = pd.read_csv("../data/processed/cleaned_output_2011_2022.csv")
df_mobileTransaction = pd.read_csv('../Data/processed/mobile_transact.csv')


## 1.1 Agri (Agriculture Output)
- Source: World Bank FAO

In [None]:
df_agriculture_clean = cleanUp(df_agriculture)
df_agriculture_clean.head(2)

Unnamed: 0,country,year,FarmCredit,ICTPolicy,ProductionValue,ProcessingValue,Fertilizer
11,BGD,2011,3271.100803,,1649775.0,,
12,BGD,2012,3213.392664,,1854820.0,249808.744409,


In [31]:
basic_stats(df_agri)

{'countries (first 5)': ['BGD', 'BRN', 'BTN', 'IDN', 'IND'],
 'country_count': 16,
 'time_series': '2011 ~ 2023',
 'observations_bycolumn': FarmCredit         42
 ICTPolicy          34
 ProductionValue    55
 ProcessingValue    82
 Fertilizer          0
 dtype: int64}

- In this section, we examine agricultural output using data from the World Bank FAO. The dataset includes key indicators such as farm credit, ICT policy presence, agricultural production value, food and beverage processing value, and fertilizer usage. Our summary shows data coverage for 16 countries from 2000 to 2023. 

- Among these indicators, production and processing values have the most complete records, with 99 and 111 observations respectively. This suggests relatively strong data availability for measuring economic contributions of agricultural and food sectors, while access to farm credit and ICT policy information is more limited across countries and years.

## 1.2 Con (Control Variables)
- Source: WDI

In [None]:
df_control = df_control.drop(['country'], axis = 'columns')
df_control_clean = cleanUp(df_control)
df_control_clean.head(2)

Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),Rural population (% of total population),Urban population (% of total population)
51,BGD,2011,2.309,4352.426434,68.499,35.4,1179.926834,0.908526,153591076.0,,68.775,31.225
52,BGD,2012,2.263,4592.048409,68.989,34.1,1191.289091,0.958356,155070101.0,,68.007,31.993


In [None]:
basic_stats(df_control_clean)

{'countries (first 5)': ['BGD', 'BTN', 'BRN', 'KHM', 'IND'],
 'country_count': 16,
 'time_series': '2011 ~ 2023',
 'observations_bycolumn': Fertility rate, total (births per woman)                               208
 GDP per capita, PPP (constant 2021 international $)                    207
 Life expectancy at birth, total (years)                                208
 Mortality rate, infant (per 1,000 live births)                         208
 Population density (people per sq. km of land area)                    192
 Population growth (annual %)                                           208
 Population, total                                                      208
 Poverty headcount ratio at national poverty lines (% of population)     52
 Rural population (% of total population)                               208
 Urban population (% of total population)                               208
 dtype: int64}

- In this section, we collect 10 control variables from WDI, covering economic development, population estimatesand composition, poverty prevalence, and vital statistics. Our summary shows data coverage for 16 countries from 1960 to 2023.

- Among these indicators, except poverty headcount ratio has many missing values, all other 9 variables have a considerable amount of data to serve as background variables. 

## 1.3 GFI (Global Financial Inclusion)
- Source: Findex

In [None]:
df_findex_clean = cleanUp(df_findex)
df_findex_clean.head(2)

Unnamed: 0,country,year,Account (% age 15+),Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+)","Account ownership at a financial institution or with a mobile-money-service provider, poorest 40% (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, primary education or less (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, richest 60% (% of population ages 15+)",...,"Used a mobile phone or the internet to check account balance(% with a financial institution account, age 15+)","Used a mobile phone or the internet to pay bills, female (% age 15+)","Used a mobile phone or the internet to pay bills, male (% age 15+)","Used a mobile phone or the internet to pay bills, rural (% age 15+)","Used a mobile phone or the internet to pay bills, urban (% age 15+)",Used a mobile phone or the internet to send money (% age 15+),"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)"
0,BGD,2011,31.74,31.74,26.01,37.29,36.76,19.06,21.06,40.18,...,10.63,1.54,3.36,8.75,11.08,17.58,7.64,28.28,19.96,16.83
1,BGD,2012,31.49,31.49,26.156667,36.65,36.32,20.383333,22.34,38.883333,...,10.63,1.54,3.36,8.75,11.08,17.58,7.64,28.28,19.96,16.83


In [None]:
basic_stats(df_findex_clean)

{'countries (first 5)': ['BGD', 'BRN', 'BTN', 'IDN', 'IND'],
 'country_count': 17,
 'time_series': '2011 ~ 2022',
 'observations_bycolumn': Account (% age 15+)                                                                                                              192
 Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)                  192
 Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)          192
 Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+)            192
 Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+)    192
                                                                                                                                 ... 
 Used a mobile phone or the internet to send money (% age

For the Global Financial Inclusion (GFI) dataset from Findex, we collect account ownership and digital financial service usage across 17 countries from 2011 to 2022. The dataset includes over 100 indicators detailing ownership by demographic breakdowns (e.g., gender, income, education), as well as behaviors such as using mobile phones or the internet to send money.

## 1.4 Staple (Staple Food Output)
- Source: FAO

In [None]:
df_staple_clean = cleanUp(df_staple)
df_staple_clean = df_staple_clean.rename(columns={'rolling_std': 'foodSupply_stability'})
df_staple.head(2)

Unnamed: 0,country,year,Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Food supply quantity (kg/capita/yr)_Starchy Roots,foodSupply_stability
1,AFG,2011,197.29,5.72,
2,AFG,2012,190.31,6.4,


In [None]:
basic_stats(df_staple_clean)

{'countries (first 5)': ['AFG', 'BGD', 'BTN', 'KHM', 'IND'],
 'country_count': 18,
 'time_series': '2011 ~ 2022',
 'observations_bycolumn': Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer    208
 Food supply quantity (kg/capita/yr)_Starchy Roots               208
 foodSupply_stability                                            205
 dtype: int64}

For the staple food dataset, we collect 18 countries from 2010-2022 over three indicators, food supply quantity for cereals, food supply quantity for starch, standard deviations. 

## 1.5 Access (to digital finance)
- Source: WDI

In [None]:
df_access_clean = cleanUp(df_access)
df_access_clean.head(2)

Unnamed: 0,country,year,Rural Access to Electricity(Percent of Population),Mobile Cellular Subscriptions (per 100 people),Fixed Broadband Subsciptions (per 100 people)
0,BGD,2024,,,
1,BGD,2023,99.6,,7.51447


In [None]:
basic_stats(df_access_clean)

{'countries (first 5)': ['BGD', 'BTN', 'IND', 'LKA', 'NPL'],
 'country_count': 16,
 'time_series': '2011 ~ 2024',
 'observations_bycolumn': Rural Access to Electricity(Percent of Population)    204
 Mobile Cellular Subscriptions (per 100 people)        190
 Fixed Broadband Subsciptions (per 100 people)         201
 dtype: int64}

For the access to digital finance from WDI, we collect 16 countries from 1960-2024, covering variables like rural access to electricity, mobile subscriptions, and fixed broadband subscriptions. Among which, mobile sucscription has the largest amount of available data for us to use.

## 1.6 Mobile (Mobile Money Transactions)
- Source: Financial Access Survey, IMF

In [None]:
df_mobile = pd.read_csv('../Data/processed/mobile_transact.csv')
df_mobileTransaction = df_mobileTransaction[['country', 'year', 'mobile_money_transactions']]
df_mobileTransaction_clean = cleanUp(df_mobileTransaction)
df_mobileTransaction_clean.head(2)


Unnamed: 0,country,year,mobile_money_transactions
80,IDN,2016,7063689.0
81,IDN,2017,12375470.0


In [None]:
basic_stats(df_mobileTransaction_clean)

{'countries (first 5)': ['IDN', 'MMR', 'BGD', 'PAK', 'PHL'],
 'country_count': 19,
 'time_series': '2016 ~ 2023',
 'observations_bycolumn': mobile_money_transactions    152
 dtype: int64}

For the mobile money transactions dataset, we collected data from 19 countries in South and Southeast Asia spanning the years 2016 to 2023. The indicator used is the total number of mobile money transactions per year, resulting in 152 valid observations. This subset helps us capture recent trends in digital financial inclusion across the region.

## 1.7 Fertilizer use (Nitrogen application per farmland)
- Source: Global data on fertilizer use by crop and by country, DRYAD(https://datadryad.org/dataset/doi:10.5061/dryad.2rbnzs7qh)

# 2. Merge Datasets

In [42]:
import pandas as pd

def merge_and_count(df_1, df_2, how = 'inner'):
    country_col = 'country'
    date_col = 'year'
    rows_df_1, rows_df_2 = len(df_1), len(df_2)
    merged_df = pd.merge(df_1, df_2, how=how, on=[country_col, date_col])

    countries_df_1 = df_1[country_col].unique()
    countries_merged = merged_df[country_col].unique()

    dropped_countries = set(countries_df_1) - set(countries_merged)

    rows_merged = len(merged_df)
    countries_df_1_count, countries_merged_count = (
        len(countries_df_1),
        len(countries_merged)
    )

    print(f"Rows in df_1: {rows_df_1}, Countries in df_1: {countries_df_1_count}")
    print(f"Rows in df_2: {rows_df_2}, Countries in df_2: {countries_merged_count}")
    print(f"Rows in merged DataFrame: {rows_merged}, Countries in merged DataFrame: {countries_merged_count}")
    print(f"Dropped countries from df_1: {list(dropped_countries)}")

    return merged_df

## 2.1 Merge con and agri

In [43]:
df_merge_1 = merge_and_count(
    df_1 = df_con,
    df_2 = df_agri,
    how = 'outer'
)
df_merge_1.head(2)


Rows in df_1: 208, Countries in df_1: 16
Rows in df_2: 151, Countries in df_2: 16
Rows in merged DataFrame: 236, Countries in merged DataFrame: 16
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),Rural population (% of total population),Urban population (% of total population),FarmCredit,ICTPolicy,ProductionValue,ProcessingValue,Fertilizer
0,BGD,2011,2.309,4352.426434,68.499,35.4,1179.926834,0.908526,153591076.0,,68.775,31.225,3271.100803,,1649775.0,,
1,BGD,2012,2.263,4592.048409,68.989,34.1,1191.289091,0.958356,155070101.0,,68.007,31.993,3213.392664,,1854820.0,249808.744409,


## 2.2 Merge df_merge1 and GFI

In [44]:
df_merge_2 = merge_and_count(
    df_1 = df_merge_1,
    df_2 = df_gfi,
    how = 'outer'
)
df_merge_2.head(2)

Rows in df_1: 236, Countries in df_1: 16
Rows in df_2: 204, Countries in df_2: 17
Rows in merged DataFrame: 248, Countries in merged DataFrame: 17
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),...,"Used a mobile phone or the internet to check account balance(% with a financial institution account, age 15+)","Used a mobile phone or the internet to pay bills, female (% age 15+)","Used a mobile phone or the internet to pay bills, male (% age 15+)","Used a mobile phone or the internet to pay bills, rural (% age 15+)","Used a mobile phone or the internet to pay bills, urban (% age 15+)",Used a mobile phone or the internet to send money (% age 15+),"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)"
0,BGD,2011,2.309,4352.426434,68.499,35.4,1179.926834,0.908526,153591076.0,,...,10.63,1.54,3.36,8.75,11.08,17.58,7.64,28.28,19.96,16.83
1,BGD,2012,2.263,4592.048409,68.989,34.1,1191.289091,0.958356,155070101.0,,...,10.63,1.54,3.36,8.75,11.08,17.58,7.64,28.28,19.96,16.83


## 2.3 Merge df_merge2 and staple

In [45]:
df_merge_3 = merge_and_count(
    df_1 = df_merge_2,
    df_2 = df_staple,
    how = 'outer'
)
df_merge_3.head(2)

Rows in df_1: 248, Countries in df_1: 17
Rows in df_2: 208, Countries in df_2: 21
Rows in merged DataFrame: 296, Countries in merged DataFrame: 21
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),...,"Used a mobile phone or the internet to pay bills, rural (% age 15+)","Used a mobile phone or the internet to pay bills, urban (% age 15+)",Used a mobile phone or the internet to send money (% age 15+),"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)",Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Food supply quantity (kg/capita/yr)_Starchy Roots,foodSupply_stability
0,AFG,2011,,,,,,,,,...,,,,,,,,197.29,5.72,
1,AFG,2012,,,,,,,,,...,,,,,,,,190.31,6.4,


## 2.4 Merge df_merge3 and access

In [46]:
df_merge_4 = merge_and_count(
    df_1 = df_merge_3,
    df_2 = df_access,
    how = 'outer'
)
df_merge_4.head(2)


Rows in df_1: 296, Countries in df_1: 21
Rows in df_2: 224, Countries in df_2: 21
Rows in merged DataFrame: 312, Countries in merged DataFrame: 21
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),...,"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)",Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Food supply quantity (kg/capita/yr)_Starchy Roots,foodSupply_stability,Rural Access to Electricity(Percent of Population),Mobile Cellular Subscriptions (per 100 people),Fixed Broadband Subsciptions (per 100 people)
0,AFG,2011,,,,,,,,,...,,,,,197.29,5.72,,,,
1,AFG,2012,,,,,,,,,...,,,,,190.31,6.4,,,,


## 2.5 Merge df_merge4 and mobile

In [47]:
df_merge_5 = merge_and_count(
    df_1 = df_merge_4,
    df_2 = df_mobile,
    how = 'outer'
)
df_merge_5.head(2)
df_merge_5.to_csv('../Data/processed/merged_5.csv', index=False)

Rows in df_1: 312, Countries in df_1: 21
Rows in df_2: 296, Countries in df_2: 21
Rows in merged DataFrame: 470, Countries in merged DataFrame: 21
Dropped countries from df_1: []
