# Milestone 2
# Part 1: 
- Merge 6 datasets covering digital access, agriculture output, food stability, financial inclusion, mobile money transaction, and demography in 10 ASEAN countries, and 6 South Asian countries.
- Goal: Provide a comprehensive for other researchers for their future research. The final merged dataset has over 130 variables. We thought this merged dataset would be helpful for researchers who are interested in analyzing the relationship between Fintech and agricultural poor in Asia. 

# Part 2:
- We refine our research scope for Milestone 2 and process, merge, and use only limited number variables to get descriptive statistics, visualizations, and regression analysis.


______________________________



Research Question: What is the impact of mobile cellular subscriptions on food supply in ASEAN and South Asian countries?


Empirical Model: 
$$
\log(\text{Food Supply})_{it} = \beta_0 + \beta_1 \log(\text{Mobile Subscriptions})_{it} + \beta_2 X_{it} + \epsilon_{it}
$$


**1.	Variables:**

-	Dependent Variable: Food supply quantity 

-	Independent Variables: Mobile cellular subscription

-	Control Variables: total population, urban population, rural population, life expectancy, mortality rate, population density, poverty headcount ratio, population growth, fertility rate, GDP per capita
 
-	Time: 2011-2024
-	Region: ASEAN and South Asia
-	Number of Country: 16

    -	ASEAN: 'VNM','LAO','THA','KHM','MYS','SGP','MMR','PHL','BRN','IDN'
    -	South Asia = 'BGD','IND','PAK','NPL','LKA','BTN'

-	Note: (Which variable) is collected only every 4 year

**2.	Data Source (Detailed link is listed in [Readme](https://github.com/Graspp-25-Spring/graspp_2025s_fintech/blob/review_milestone2/README.md)) and Download Method:** 

- Dependent Variable: 
    -	Source: Credit to Agriculture, Value Added (Agriculture, Forestry and Fishing), Value Added (Manufacture of food and beverages), Fertilizer consumption are from World Bank FAO; Use of Financial Services, mobile money transactions is from Financial Access Survey, IMF
    -	Method: API and manually download csv files

- Independent Variables: 
    -	World Development Indicators (WDI) and Findex
    -	Method: construct the download workbank data function to fetch data


- Control Variables: 
    -	Source: All from World Development Indicators (WDI), World Bank 
    -	Method: construct the download workbank data function to fetch data




# Part 1

# 1. Data sets

In [68]:
import pandas as pd

In [69]:
def basic_stats(df):
    countries = list(df['country'].unique())
    n_countries = len(countries) 
    time_series = f"{df.year.min()} ~ {df.year.max()}"
    missing_data_summary = df.drop(['country','year'], axis=1).notna().sum()
    
    return {
        'countries (first 5)': countries[:5],
        'country_count': n_countries,
        'time_series': time_series,
        'observations_bycolumn': missing_data_summary
    }


## 1.1 Agri (Agriculture Output)
- Source: World Bank FAO

In [70]:
df_agri = pd.read_csv('../data/processed/Agriculture_data.csv')
df_agri = df_agri.rename(columns={'code': 'country'})
df_agri.head(2)

Unnamed: 0,country,year,FarmCredit,ICTPolicy,ProductionValue,ProcessingValue,Fertilizer
0,BGD,2000,,,583661.0,,158.108
1,BGD,2001,,,590372.0,,174.59


In [71]:
basic_stats(df_agri)

{'countries (first 5)': ['BGD', 'BRN', 'BTN', 'IDN', 'IND'],
 'country_count': 16,
 'time_series': '2000 ~ 2023',
 'observations_bycolumn': FarmCredit          68
 ICTPolicy           34
 ProductionValue     99
 ProcessingValue    111
 Fertilizer          64
 dtype: int64}

- In this section, we examine agricultural output using data from the World Bank FAO. The dataset includes key indicators such as farm credit, ICT policy presence, agricultural production value, food and beverage processing value, and fertilizer usage. Our summary shows data coverage for 16 countries from 2000 to 2023. 

- Among these indicators, production and processing values have the most complete records, with 99 and 111 observations respectively. This suggests relatively strong data availability for measuring economic contributions of agricultural and food sectors, while access to farm credit and ICT policy information is more limited across countries and years.

## 1.2 Con (Control Variables)
- Source: WDI

In [72]:
df_con = pd.read_csv('../data/processed/control_var.csv')
df_con = df_con.drop(['country'], axis = 'columns').rename(columns={'code': 'country'})
df_con.head(2)

Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),Rural population (% of total population),Urban population (% of total population)
0,BGD,1960,6.742,,43.98,178.6,,,51828660.0,,94.865,5.135
1,BGD,1961,6.78,,44.887,173.3,409.544042,2.818718,53310348.0,,94.722,5.278


In [73]:
basic_stats(df_con)

{'countries (first 5)': ['BGD', 'BTN', 'BRN', 'KHM', 'IND'],
 'country_count': 16,
 'time_series': '1960 ~ 2023',
 'observations_bycolumn': Fertility rate, total (births per woman)                               1024
 GDP per capita, PPP (constant 2021 international $)                     543
 Life expectancy at birth, total (years)                                1024
 Mortality rate, infant (per 1,000 live births)                          970
 Population density (people per sq. km of land area)                     992
 Population growth (annual %)                                           1008
 Population, total                                                      1024
 Poverty headcount ratio at national poverty lines (% of population)      89
 Rural population (% of total population)                               1024
 Urban population (% of total population)                               1024
 dtype: int64}

- In this section, we collect 10 control variables from WDI, covering economic development, population estimatesand composition, poverty prevalence, and vital statistics. Our summary shows data coverage for 16 countries from 1960 to 2023.

- Among these indicators, except poverty headcount ratio has many missing values, all other 9 variables have a considerable amount of data to serve as background variables. 

## 1.3 GFI (Global Financial Inclusion)
- Source: Findex

In [74]:
df_gfi = pd.read_csv('../Data/processed/cleaned_output_2011_2022.csv')
df_gfi = df_gfi.rename(columns={'code': 'country'})
df_gfi.head(2)

Unnamed: 0,country,year,Account (% age 15+),Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+),"Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+)","Account ownership at a financial institution or with a mobile-money-service provider, poorest 40% (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, primary education or less (% of population ages 15+)","Account ownership at a financial institution or with a mobile-money-service provider, richest 60% (% of population ages 15+)",...,"Used a mobile phone or the internet to check account balance(% with a financial institution account, age 15+)","Used a mobile phone or the internet to pay bills, female (% age 15+)","Used a mobile phone or the internet to pay bills, male (% age 15+)","Used a mobile phone or the internet to pay bills, rural (% age 15+)","Used a mobile phone or the internet to pay bills, urban (% age 15+)",Used a mobile phone or the internet to send money (% age 15+),"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)"
0,BGD,2011,31.74,31.74,26.01,37.29,36.76,19.06,21.06,40.18,...,10.63,1.54,3.36,8.75,11.08,17.58,7.64,28.28,19.96,16.83
1,BGD,2012,31.49,31.49,26.156667,36.65,36.32,20.383333,22.34,38.883333,...,10.63,1.54,3.36,8.75,11.08,17.58,7.64,28.28,19.96,16.83


In [75]:
basic_stats(df_gfi)

{'countries (first 5)': ['BGD', 'BRN', 'BTN', 'IDN', 'IND'],
 'country_count': 17,
 'time_series': '2011 ~ 2022',
 'observations_bycolumn': Account (% age 15+)                                                                                                              192
 Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+)                  192
 Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+)          192
 Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+)            192
 Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+)    192
                                                                                                                                 ... 
 Used a mobile phone or the internet to send money (% age

For the Global Financial Inclusion (GFI) dataset from Findex, we collect account ownership and digital financial service usage across 17 countries from 2011 to 2022. The dataset includes over 100 indicators detailing ownership by demographic breakdowns (e.g., gender, income, education), as well as behaviors such as using mobile phones or the internet to send money.

## 1.4 Staple (Staple Food Output)
- Source: FAO

In [76]:
df_staple = pd.read_csv('../Data/processed/StapleFoodStability_adjusted.csv')
df_staple = df_staple.rename(columns={'code': 'country'})
df_staple.head(2)

Unnamed: 0,country,year,Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Food supply quantity (kg/capita/yr)_Starchy Roots,rolling_std
0,AFG,2010,202.73,6.69,
1,AFG,2011,197.29,5.72,


In [None]:
basic_stats(df_staple)

For the staple food dataset, we collect 18 countries from 2010-2022 over three indicators, food supply quantity for cereals, food supply quantity for starch, standard deviations. 

## 1.5 Access (to digital finance)
- Source: WDI

In [77]:
df_access = pd.read_csv('../Data/processed/access_merged.csv')
df_access['year'] = pd.to_datetime(df_access['year']).dt.year
df_access = df_access.rename(columns={'code': 'country'})
df_access.head(2)


Unnamed: 0,country,year,Rural Access to Electricity(Percent of Population),Mobile Cellular Subscriptions (per 100 people),Fixed Broadband Subsciptions (per 100 people)
0,BGD,2024,,,
1,BGD,2023,99.6,,7.51447


In [78]:
basic_stats(df_access)

{'countries (first 5)': ['BGD', 'BTN', 'IND', 'LKA', 'NPL'],
 'country_count': 16,
 'time_series': '1960 ~ 2024',
 'observations_bycolumn': Rural Access to Electricity(Percent of Population)    446
 Mobile Cellular Subscriptions (per 100 people)        808
 Fixed Broadband Subsciptions (per 100 people)         333
 dtype: int64}

For the access to digital finance from WDI, we collect 16 countries from 1960-2024, covering variables like rural access to electricity, mobile subscriptions, and fixed broadband subscriptions. Among which, mobile sucscription has the largest amount of available data for us to use.

## 1.6 Mobile (Mobile Money Transactions)
- Source: Financial Access Survey, IMF

In [79]:
df_mobile = pd.read_csv('../Data/processed/mobile_transact.csv')
# Filter for specific countries in South Asia and Southeast Asia
country_list = ['AFG', 'BGD', 'BRN', 'BTN', 'IDN', 'IND', 'IRN', 'KHM', 'LAO',
       'LKA', 'MDV', 'MMR', 'MYS', 'NPL', 'PAK', 'PHL', 'SGP', 'THA',
       'TLS', 'VNM', 'WLD']
df_mobile = df_mobile[df_mobile['code'].isin(country_list)]

df_mobile = df_mobile.rename(columns={'code': 'country'})
df_mobile = df_mobile[['country', 'year', 'mobile_money_transactions']]
df_mobile.head(2)


Unnamed: 0,country,year,mobile_money_transactions
80,IDN,2016,7063689.0
81,IDN,2017,12375470.0


In [80]:
basic_stats(df_mobile)

{'countries (first 5)': ['IDN', 'MMR', 'BGD', 'PAK', 'PHL'],
 'country_count': 19,
 'time_series': '2016 ~ 2023',
 'observations_bycolumn': mobile_money_transactions    152
 dtype: int64}

For the mobile money transactions dataset, we collected data from 19 countries in South and Southeast Asia spanning the years 2016 to 2023. The indicator used is the total number of mobile money transactions per year, resulting in 152 valid observations. This subset helps us capture recent trends in digital financial inclusion across the region.

# 2. Merge Datasets

In [81]:
import pandas as pd

def merge_and_count(df_1, df_2, how = 'inner'):
    country_col = 'country'
    date_col = 'year'
    rows_df_1, rows_df_2 = len(df_1), len(df_2)
    merged_df = pd.merge(df_1, df_2, how=how, on=[country_col, date_col])

    countries_df_1 = df_1[country_col].unique()
    countries_merged = merged_df[country_col].unique()

    dropped_countries = set(countries_df_1) - set(countries_merged)

    rows_merged = len(merged_df)
    countries_df_1_count, countries_merged_count = (
        len(countries_df_1),
        len(countries_merged)
    )

    print(f"Rows in df_1: {rows_df_1}, Countries in df_1: {countries_df_1_count}")
    print(f"Rows in df_2: {rows_df_2}, Countries in df_2: {countries_merged_count}")
    print(f"Rows in merged DataFrame: {rows_merged}, Countries in merged DataFrame: {countries_merged_count}")
    print(f"Dropped countries from df_1: {list(dropped_countries)}")

    return merged_df

## 2.1 Merge con and agri

In [82]:
df_merge_1 = merge_and_count(
    df_1 = df_con,
    df_2 = df_agri,
    how = 'outer'
)
df_merge_1.head(2)


Rows in df_1: 1024, Countries in df_1: 16
Rows in df_2: 270, Countries in df_2: 16
Rows in merged DataFrame: 1061, Countries in merged DataFrame: 16
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),Rural population (% of total population),Urban population (% of total population),FarmCredit,ICTPolicy,ProductionValue,ProcessingValue,Fertilizer
0,BGD,1960,6.742,,43.98,178.6,,,51828660.0,,94.865,5.135,,,,,
1,BGD,1961,6.78,,44.887,173.3,409.544042,2.818718,53310348.0,,94.722,5.278,,,,,


## 2.2 Merge df_merge1 and GFI

In [83]:
df_merge_2 = merge_and_count(
    df_1 = df_merge_1,
    df_2 = df_gfi,
    how = 'outer'
)
df_merge_2.head(2)

Rows in df_1: 1061, Countries in df_1: 16
Rows in df_2: 204, Countries in df_2: 17
Rows in merged DataFrame: 1073, Countries in merged DataFrame: 17
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),...,"Used a mobile phone or the internet to check account balance(% with a financial institution account, age 15+)","Used a mobile phone or the internet to pay bills, female (% age 15+)","Used a mobile phone or the internet to pay bills, male (% age 15+)","Used a mobile phone or the internet to pay bills, rural (% age 15+)","Used a mobile phone or the internet to pay bills, urban (% age 15+)",Used a mobile phone or the internet to send money (% age 15+),"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)"
0,BGD,1960,6.742,,43.98,178.6,,,51828660.0,,...,,,,,,,,,,
1,BGD,1961,6.78,,44.887,173.3,409.544042,2.818718,53310348.0,,...,,,,,,,,,,


## 2.3 Merge df_merge2 and staple

In [84]:
df_merge_3 = merge_and_count(
    df_1 = df_merge_2,
    df_2 = df_staple,
    how = 'outer'
)
df_merge_3.head(2)

Rows in df_1: 1073, Countries in df_1: 17
Rows in df_2: 225, Countries in df_2: 21
Rows in merged DataFrame: 1125, Countries in merged DataFrame: 21
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),...,"Used a mobile phone or the internet to pay bills, rural (% age 15+)","Used a mobile phone or the internet to pay bills, urban (% age 15+)",Used a mobile phone or the internet to send money (% age 15+),"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)",Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Food supply quantity (kg/capita/yr)_Starchy Roots,rolling_std
0,AFG,2010,,,,,,,,,...,,,,,,,,202.73,6.69,
1,AFG,2011,,,,,,,,,...,,,,,,,,197.29,5.72,


## 2.4 Merge df_merge3 and access

In [85]:
df_merge_4 = merge_and_count(
    df_1 = df_merge_3,
    df_2 = df_access,
    how = 'outer'
)
df_merge_4.head(2)


Rows in df_1: 1125, Countries in df_1: 21
Rows in df_2: 1040, Countries in df_2: 21
Rows in merged DataFrame: 1141, Countries in merged DataFrame: 21
Dropped countries from df_1: []


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),...,"Used a mobile phone or the internet to send money, female (% age 15+)","Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)",Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Food supply quantity (kg/capita/yr)_Starchy Roots,rolling_std,Rural Access to Electricity(Percent of Population),Mobile Cellular Subscriptions (per 100 people),Fixed Broadband Subsciptions (per 100 people)
0,AFG,2010,,,,,,,,,...,,,,,202.73,6.69,,,,
1,AFG,2011,,,,,,,,,...,,,,,197.29,5.72,,,,


## 2.5 Merge df_merge4 and mobile

In [86]:
df_merge_5 = merge_and_count(
    df_1 = df_merge_4,
    df_2 = df_mobile,
    how = 'outer'
)
df_merge_5.head(2)
df_merge_5.to_csv('../Data/processed/merged_5.csv', index=False)

Rows in df_1: 1141, Countries in df_1: 21
Rows in df_2: 296, Countries in df_2: 21
Rows in merged DataFrame: 1299, Countries in merged DataFrame: 21
Dropped countries from df_1: []


# Part 2

# 3. Descriptive

In [87]:
df_descriptive = df_merge_5.describe(include='all')
df_descriptive
#df_descriptive.to_csv('../reports/merged_all_descriptive.csv')


Unnamed: 0,country,year,"Fertility rate, total (births per woman)","GDP per capita, PPP (constant 2021 international $)","Life expectancy at birth, total (years)","Mortality rate, infant (per 1,000 live births)",Population density (people per sq. km of land area),Population growth (annual %),"Population, total",Poverty headcount ratio at national poverty lines (% of population),...,"Used a mobile phone or the internet to send money, male (% age 15+)","Used a mobile phone or the internet to send money, rural (% age 15+)","Used a mobile phone or the internet to send money, urban (% age 15+)",Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Food supply quantity (kg/capita/yr)_Starchy Roots,rolling_std,Rural Access to Electricity(Percent of Population),Mobile Cellular Subscriptions (per 100 people),Fixed Broadband Subsciptions (per 100 people),mobile_money_transactions
count,1299,1299.0,1183.0,701.0,1183.0,1129.0,1137.0,1167.0,1183.0,121.0,...,302.0,183.0,202.0,374.0,374.0,370.0,601.0,951.0,488.0,168.0
unique,21,,,,,,,,,,...,,,,,,,,,,
top,IND,,,,,,,,,,...,,,,,,,,,,
freq,93,,,,,,,,,,...,,,,,,,,,,
mean,,1997.162433,3.719742,18222.361459,63.592063,54.588751,507.194331,1.837808,129413800.0,18.327273,...,24.874106,8.292131,20.336535,217.895856,54.362193,11.071854,78.340939,45.265395,5.10529,18741750.0
std,,19.796513,1.736222,28862.561149,10.395211,46.522259,1341.852289,1.154691,288686400.0,12.193311,...,16.450165,5.993485,13.638558,42.429393,61.071679,11.911301,26.460133,55.344925,6.933406,66919260.0
min,,1960.0,0.97,759.282764,11.295,1.7,4.935514,-8.403645,88347.0,4.3,...,5.86,2.0,7.29,129.68,5.72,0.682473,0.6,0.0,0.000389,0.04496658
25%,,1980.0,2.187,4371.385718,57.237,19.3,58.811943,1.19163,6688529.0,9.4,...,9.71,3.94,10.48,184.47,18.76,3.121202,63.5,0.0,0.615923,2.473611
50%,,2000.0,3.315,6843.54655,66.232,40.2,119.497005,1.85418,30758810.0,16.0,...,28.28,6.25,15.43,218.65,29.46,5.657648,90.0,3.737725,1.9488,179.6953
75%,,2016.0,5.3985,13491.879417,70.6855,81.5,285.568271,2.43172,95256070.0,22.5,...,33.38,15.98,19.28,254.995,53.13,14.212558,99.5,93.220269,7.103368,2101688.0


## Select Variables

Analyze the relationship between Food supply quantity (cereals) as the proxy of agricultural output and mobile cellular subscription as the proxy of mobile network penetration and the population's access to communication infrastructure

## 3.1 Independent Variable (mobile cellular subscription )

In [88]:
print(df_descriptive['Mobile Cellular Subscriptions (per 100 people)'])

count     951.000000
unique           NaN
top              NaN
freq             NaN
mean       45.265395
std        55.344925
min         0.000000
25%         0.000000
50%         3.737725
75%        93.220269
max       181.767026
Name: Mobile Cellular Subscriptions (per 100 people), dtype: float64


The variable **"Mobile Cellular Subscriptions (per 100 people)"** has 951 observations with a mean of 45 people. The values range from 0 to 181 people, indicating a high variation across countries and years. The standard deviation is 55 people. 

## 3.2 Outcome variables

For now, we use 
as a preliminary assessment, 6 variables were selectd 
1) (<u>financial access</u>) Financial institution account, female (% age 15+)
2) (<u>e-commerce</u>) Made a digital in-store merchant payment: using a mobile phone (% age 15+)
3) (<u>mobile transaction</u>) Use a mobile phone or the internet to make payments, buy things, or to send or receive money using a financial institution account (% age 15+)
4) (<u>saving</u>) Saved at a financial institution or using a mobile money account (% age 15+)
5) (<u>Food productivity</u>) Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer
6) (<u>Resilience on food production</u>) rolling_std (let's change variable name as "food supply stability")



In [89]:
print(df_descriptive[['Financial institution account, female (% age 15+)', 'Made a digital in-store merchant payment: using a mobile phone (% age 15+)', 'Saved at a financial institution or using a mobile money account (% age 15+)', 'Store money using a financial institution or a mobile money account (% age 15+)', 'Use a mobile phone or the internet to make payments, buy things, or to send or receive money using a financial institution account (% age 15+)', 'Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer', 'rolling_std']])

        Financial institution account, female (% age 15+)  \
count                                          321.000000   
unique                                                NaN   
top                                                   NaN   
freq                                                  NaN   
mean                                            47.764553   
std                                             26.232088   
min                                              2.950000   
25%                                             28.017500   
50%                                             37.850000   
75%                                             76.640000   
max                                             98.210000   

        Made a digital in-store merchant payment: using a mobile phone (% age 15+)  \
count                                          302.000000                            
unique                                                NaN                            
top      

### Based on the descriptive statistics, we decided to use **"Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer"** as outcome variable for now due its high data availability. 

In [90]:
print(df_descriptive['Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer'])

count     374.000000
unique           NaN
top              NaN
freq             NaN
mean      217.895856
std        42.429393
min       129.680000
25%       184.470000
50%       218.650000
75%       254.995000
max       290.380000
Name: Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer, dtype: float64


The variable **"Food supply quantity (kg/capita/yr)_Cereals – Excluding Beer"** has 374 observations with a mean of approximately 217.90 kg per capita per year. The values range from 129.68 to 290.38, indicating variation in cereal supply across countries and years. The standard deviation is 42.43, showing moderate dispersion from the mean.

## 3.3 Control variables

add demography

In [93]:
print(df_descriptive[['Fertility rate, total (births per woman)', 
                      'GDP per capita, PPP (constant 2021 international $)', 
                      'Life expectancy at birth, total (years)', 
                      'Mortality rate, infant (per 1,000 live births)', 
                      'Population density (people per sq. km of land area)', 
                      'Population growth (annual %)', 
                      'Population, total',
                      'Poverty headcount ratio at national poverty lines (% of population)',
                      'Rural population (% of total population)',
                      'Urban population (% of total population)'
                      ]])

        Fertility rate, total (births per woman)  \
count                                1183.000000   
unique                                       NaN   
top                                          NaN   
freq                                         NaN   
mean                                    3.719742   
std                                     1.736222   
min                                     0.970000   
25%                                     2.187000   
50%                                     3.315000   
75%                                     5.398500   
max                                     7.322000   

        GDP per capita, PPP (constant 2021 international $)  \
count                                          701.000000     
unique                                                NaN     
top                                                   NaN     
freq                                                  NaN     
mean                                         18222.361459   

# 4. Graph (Please see the [Visualization_one](https://github.com/Graspp-25-Spring/graspp_2025s_fintech/blob/main/notebooks/Visualizations_One.ipynb) for ALL the visualization and regression)

In [None]:
import seaborn as sns
sns.scatterplot(
    data = df_merge_5,
    x = 'Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer',
    y = 'Mobile Cellular Subscriptions (per 100 people)'
)

We can see that in general there's a positive correlation between Mobile Cellular Subscriptions and Food Supply Quantity, despite some outliers exist. 

# 5. Regression

## 5.1 Transform

In [None]:
df_feat = df_merge_5.query("year == 2020")[[
    'Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer',
    'Mobile Cellular Subscriptions (per 100 people)'
]]
df_feat.head(5)


In [102]:
import numpy as np
df_feat = df_feat.assign(
    food_log = df_feat['Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer'].apply(np.log),
    mobile_log = df_feat['Mobile Cellular Subscriptions (per 100 people)'].apply(np.log)
)
df_feat.head(5)

Unnamed: 0,Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer,Mobile Cellular Subscriptions (per 100 people),food_log,mobile_log
14,189.06,,5.242064,
15,189.06,,5.242064,
89,278.99,105.291163,5.631176,4.656729
90,278.99,105.291163,5.631176,4.656729
91,278.99,105.291163,5.631176,4.656729


## 5.2 Regression

In [66]:
import statsmodels.api as sm


def regression(df, x, y):
    cols = x + [y]
    data = df[cols].dropna()
    X = data[x]
    X = sm.add_constant(X) # Capital X for convention
    Y = data[y] # Capital Y for convention

    # Create and fit the OLS model
    model = sm.OLS(Y, X)
    results = model.fit()

    return results.summary()


regression(
    df=df_feat,
    x=['mobile_log'],
    y='food_log'
)

0,1,2,3
Dep. Variable:,food_log,R-squared:,0.118
Model:,OLS,Adj. R-squared:,0.088
Method:,Least Squares,F-statistic:,3.895
Date:,"Mon, 26 May 2025",Prob (F-statistic):,0.058
Time:,00:59:39,Log-Likelihood:,6.9504
No. Observations:,31,AIC:,-9.901
Df Residuals:,29,BIC:,-7.033
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,4.1777,0.638,6.549,0.000,2.873,5.482
mobile_log,0.2665,0.135,1.974,0.058,-0.010,0.543

0,1,2,3
Omnibus:,3.305,Durbin-Watson:,0.984
Prob(Omnibus):,0.192,Jarque-Bera (JB):,2.668
Skew:,-0.599,Prob(JB):,0.263
Kurtosis:,2.205,Cond. No.,87.7


There is a positive elasticity between mobile cellular subscriptions and food supply (measured as cereal supply per capita). A 10% increase in mobile subscriptions is associated with roughly a 2.7% increase in food supply, all else equal. However, the relationship is only marginally statistically significant, and the model explains a modest portion of the variation in food supply across countries.

## Regression with control variables (not successful)

In [106]:
import statsmodels.api as sm

def ols(df, y_data, y_feat, x_data, x_feat, controls=None):
    y_col = f"{y_data}_{y_feat}" if y_feat else y_data
    x_col = f"{x_data}_{x_feat}" if x_feat else x_data

    # Include controls in column selection if provided
    cols = [y_col, x_col] + (controls if controls else [])
    data = df[cols].dropna()

    y = data[y_col]
    X = data[[x_col] + (controls if controls else [])]
    X = sm.add_constant(X)  # Add intercept

    model = sm.OLS(y, X)
    results = model.fit()
    print(results.summary())


In [111]:
 controls=[
        'Fertility rate, total (births per woman)', 
                      'GDP per capita, PPP (constant 2021 international $)', 
                      'Life expectancy at birth, total (years)', 
                      'Mortality rate, infant (per 1,000 live births)', 
                      'Population density (people per sq. km of land area)', 
                      'Population growth (annual %)', 
                      'Population, total',
                      'Poverty headcount ratio at national poverty lines (% of population)',
                      'Rural population (% of total population)',
                      'Urban population (% of total population)'
    ]

In [114]:

ols(
    df=df_descriptive,
    y_data='Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer',
    y_feat='log',  # apply log to Y
    x_data='Mobile Cellular Subscriptions (per 100 people)',
    x_feat='log',  # apply log to X
   controls=controls
)


KeyError: "['Food supply quantity (kg/capita/yr)_Cereals - Excluding Beer_log', 'Mobile Cellular Subscriptions (per 100 people)_log'] not in index"