# 2D Project

## Student's Name:
- Boey Sze Min, Jeanelle (1006037)
- Joash Tan Jia Le (1005862)
- Keith Chua Dian-Xun (1005880)
- Wong Wei Jin Justin (1006001)

## Contributions:
- Boey Sze Min, Jeanelle (1006037)
    - sdiifs
- Joash Tan Jia Le (1005862)
- Keith Chua Dian-Xun (1005880)
- Wong Wei Jin Justin (1006001)

# TO DO

### Overview About the Problem

Describe here the problem you are trying to solve.

## What is Food Security? 

According to the United Nations' (UN) Committee on World Food Security, food security is defined as people, "at all times", having "Physical, Social, and Economic" access to "sufficient, safe and nutritious food" that meets their "dietary needs for an active and healthy life" (FAO, 1996). 

 

## Problem Statement 

In line with the definition of Food security provided by UN, we have decided to find out how physical, social, and economic factors have affected the dietary requirements for an active and healthy lifestyle of the population of each country. As a result, we have come up with the following problem statement: “How might we predict whether the dietary needs for individuals are met for each country, considering a country’s economy, and its citizen’s social and physical environment?” 

### Datasets

We have obtained the following raw data for all the countries over a few years:
1. GDP per capita, adjusted for PPP (USD):   
   Link: https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD   
   CSV file name: "DDW_GDP per capita adjusted.csv"   

2. Amount of agricultural land allocated per countries ($km^{2}$):   
   Link: https://data.worldbank.org/indicator/AG.LND.AGRI.K2?name_desc=false    
   CSV file name: "DDW_Agricultural Land.csv"   

3. Percentage of population with access to basic drinking water (%):   
   Link: https://www.kaggle.com/datasets/utkarshxy/who-worldhealth-statistics-2020-complete?select=basicDrinkingWaterServices.csv    
   CSV file name: "DDW_Basic Water Drinking Services.csv"   

4. $CO_{2}$ emitted in each country (million metric tonnes):   
   Link: https://github.com/owid/co2-data   
   CSV file name: "DDW_CO2.csv"   

5. Percentage of population with eating disorder (%):    
   Link: https://ourworldindata.org/grapher/share-with-an-eating-disorder    
   CSV file name: "DDW_Eating Disorder.csv"   

6. Number of people employed in agriculture, forestry and fisheries:   
   Link: https://www.fao.org/faostat/en/#data/OEA    
   CSV file name: "DDW_Employment In Agriculture.csv"   

7. Population of each country:   
   Link: https://github.com/owid/co2-data    
   CSV file name: "DDW_Population.csv"   

8. Daily calories supplied (kcal/capita day):   
   Link: https://ourworldindata.org/food-supply   
   CSV file name: "DDW_Food Supply.csv"   

9. Minimum dietary calories requirement (kcal/capita day):   
   Link: https://www.fao.org/faostat/en/#data/FS   
   CSV file name: "DDW_Min Cal Intake.csv"   

10. Fat supplied (g/capita day):   
   Link: https://ourworldindata.org/food-supply   
   CSV file name: "DDW_Fat and Protein Supply.csv"   

11. Protein supplied (g/capita day):   
   Link: https://ourworldindata.org/food-supply   
   CSV file name: "DDW_Fat and Protein Supply.csv"   

12. Percent of population malnourished:     
   Link: https://ourworldindata.org/hunger-and-undernourishment#:~:text=across%20the%20world.-,Summary,million%20people%20globally%20are%20undernourished   
   CSV file name: "DDW_Prevalence of Undernourishment.csv"   

The code below extracts the relevant data from each file and processses it to match our 6 factors.

For **model 1**, the 6 factors that we are using to predict food security are:
1. GDP per capita, adjusted for PPP (USD)
2. Agricultural land per capita ($km^{2}$ per capita)
3. Percentage of population with basic water service (%)
4. Percentage of population with eating disorder (%)
5. Percentage of population employed in agriculture forestry fishery (%)
6. $CO_{2}$ emitted per agricultural land area (kg/$m^{2}$)   

For our dependent value we will be calculating whether the calorie supplied per capita per day is greater or less than the minimum calories required per capita per day, we will be naming this our _**Y Ratio**_.
$$
 \text{Y Ratio} = \frac{\text{Daily calories supplied (kcal/capita day)}}{\text{Minimum calories required (kcal/capita day)}}
$$

If _Y ratio_ ${<}$ 1, we say that the daily calories supplied is not sufficient to meet the minimum calories required.   
If _Y ratio_ ${>}$ 1, we say that the daily calories supplied is sufficient to meet the minimum calories required.



In [46]:
# put Python code to read and describe your data

## Imports

In [47]:
# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Cleaning the data for $1^{st}$ model

Store the specified year that we want from the csv as get_year.

In [48]:
# set the year to extract
get_year = 2017

Extracting the code requires us to look at the csv beforehand and understand how the csv is structured.   
Cleaning up the csv and obtaining what we wanted follows the folloiwng similar steps.
1. Read the csv
2. drop NaN values
3. Extract values from the specified year in _get_year_
4. Extract values that we want
5. Rename the columns for easy understanding
6. Reset the index to 0
7. Find unique countries for use later
8. Print the dataframe for us to get a preview

### Extract calories supplied data for the specified year

In [49]:
# read calories supply csv
df_calories = pd.read_csv("DDW_food Supply.csv")
# extract values only when item is "Grand Total"
df_calories = df_calories[(df_calories["Item"] == "Grand Total")]
# extract specified year values
df_calories = df_calories[(df_calories["Year"]==get_year)]
# extract year, country and calories supply
df_calories = df_calories[["Year", "Area", "Value"]]
df_calories = df_calories.rename(columns = {"Area": "Country", "Value": "Calories_supply"})
# set index from 0
df_calories = df_calories.reset_index(drop=True)
# find unique countries in calories supply data
df_calories_countries = df_calories.Country.unique()
# print dataframe
df_calories

Unnamed: 0,Year,Country,Calories_supply
0,2017,Afghanistan,2303
1,2017,Albania,3326
2,2017,Algeria,3383
3,2017,Angola,2441
4,2017,Antigua and Barbuda,2446
...,...,...,...
209,2017,Least Developed Countries,2402
210,2017,Land Locked Developing Countries,2539
211,2017,Small Island Developing States,2685
212,2017,Low Income Food Deficit Countries,2505


### Extract minimum calorie intake required data for the specified year

In [50]:
# read minimum calorie intake csv
df_min_cal = pd.read_csv("DDW_Min Cal Intake.csv")
# extract values that are not missing
df_min_cal = df_min_cal[df_min_cal["Value"].notna()]
# extract specified year values
df_min_cal = df_min_cal[(df_min_cal["Year"]==get_year)]
# extract year, country and values
df_min_cal = df_min_cal[["Year", "Area", "Value"]]
df_min_cal = df_min_cal.rename(columns = {"Area": "Country", "Value":"Mininum_calorie_intake"})
# set index from 0
df_min_cal = df_min_cal.reset_index(drop=True)
# find unique countries in miniumum calorie intake required data
df_min_cal_countries = df_min_cal.Country.unique()
# print dataframe
df_min_cal

Unnamed: 0,Year,Country,Mininum_calorie_intake
0,2017,Afghanistan,1676.0
1,2017,Albania,1911.0
2,2017,Algeria,1781.0
3,2017,Angola,1659.0
4,2017,Antigua and Barbuda,1888.0
...,...,...,...
183,2017,Vanuatu,1695.0
184,2017,Venezuela (Bolivarian Republic of),1817.0
185,2017,Viet Nam,1785.0
186,2017,Yemen,1703.0


### Extract GDP, adjusted for PPP, per capita data for the specified year

In [51]:
# read GDP, adjusted for PPP, per capita csv
df_GDP = pd.read_csv("DDW_GDP per capita adjusted.csv")
# extract values that are not missing
df_GDP = df_GDP[(df_GDP["Value"].notna())]
# extract specified year values
df_GDP = df_GDP[(df_GDP["Year"]==get_year)]
# extract year, country and values
df_GDP = df_GDP[["Year","Area","Value"]]
df_GDP = df_GDP.rename(columns = {"Area": "Country", "Value":"GDP"})
# set index from 0
df_GDP = df_GDP.reset_index(drop=True)
# find unique countries in GDP per capita data
df_GDP_countries = df_GDP.Country.unique()
# print data frame
df_GDP

Unnamed: 0,Year,Country,GDP
0,2017,Afghanistan,2058.4
1,2017,Albania,12771.0
2,2017,Algeria,11737.4
3,2017,Angola,7310.9
4,2017,Antigua and Barbuda,19840.3
...,...,...,...
182,2017,Uzbekistan,6840.7
183,2017,Vanuatu,3081.5
184,2017,Viet Nam,8996.4
185,2017,Zambia,3485.0


### Extract total population data for the specified year

In [52]:
# read total population employed csv
df_pop = pd.read_csv("DDW_Population.csv")
# extract specified year values
df_pop = df_pop[(df_pop["year"]==get_year)]
# extract values that are not missing
df_pop = df_pop[df_pop["population"].notna()]
# extract year, country and population value
df_pop = df_pop[["year", "country", "population"]]
df_pop = df_pop.rename(columns = {"year": "Year", "country": "Country", "population":"Population"})
# set index from 0
df_pop = df_pop.reset_index(drop=True)
# find unique countries in total population data
df_pop_coutries = df_pop.Country.unique()
# print dataframe
df_pop

Unnamed: 0,Year,Country,Population
0,2017,Afghanistan,3.629611e+07
1,2017,Africa,1.244222e+09
2,2017,Albania,2.884169e+06
3,2017,Algeria,4.138918e+07
4,2017,Andorra,7.699700e+04
...,...,...,...
226,2017,Wallis and Futuna,1.189400e+04
227,2017,World,7.548173e+09
228,2017,Yemen,2.783481e+07
229,2017,Zambia,1.685361e+07


### Extract land area used for agriculture data for the specified year

In [53]:
# read agricultural land csv
df_agri = pd.read_csv("DDW_Agricultural Land.csv")
# extract countries and specified year
df_agri = df_agri[["Country Name", str(get_year)]]
# remove rows with values missing
df_agri = df_agri.dropna()
# insert new column at index 0 with name "Year" and value: get_year
df_agri.insert(0, "Year", get_year)
df_agri = df_agri.rename(columns = {str(get_year): "Agri_land", "Country Name": "Country"})
# find unique countries in agricultural land
df_agri_countries = df_agri.Country.unique()
# print dataframe
df_agri

Unnamed: 0,Year,Country,Agri_land
0,2017,Aruba,20.00
1,2017,Africa Eastern and Southern,6538552.75
2,2017,Afghanistan,379100.00
3,2017,Africa Western and Central,3589797.00
4,2017,Angola,563974.30
...,...,...,...
260,2017,Samoa,620.00
262,2017,"Yemen, Rep.",234520.00
263,2017,South Africa,963410.00
264,2017,Zambia,238360.00


### Extract $CO_{2}$ (in million metric tonnes) data for the specified year

In [54]:
# read CO2 csv
df_co2 = pd.read_csv("DDW_CO2.csv")
# extract specified year values
df_co2 = df_co2[(df_co2["year"]==get_year)]
# extract values that are not missing
df_co2 = df_co2[df_co2["co2"].notna()]
# extract year, country and values
df_co2 = df_co2[["year","country","co2"]]
df_co2 = df_co2.rename(columns = {"year": "Year", "country": "Country", "co2": "CO2"})
# set index from 0
df_co2 = df_co2.reset_index(drop=True)
# find unique countries in total co2 data
df_co2_countries = df_co2.Country.unique()
# print dataframe
df_co2

Unnamed: 0,Year,Country,CO2
0,2017,Afghanistan,6.860
1,2017,Africa,1384.372
2,2017,Albania,5.302
3,2017,Algeria,154.936
4,2017,Andorra,0.465
...,...,...,...
232,2017,Wallis and Futuna,0.026
233,2017,World,35925.738
234,2017,Yemen,9.951
235,2017,Zambia,6.517


### Extract basic water drinking services data for the specified year

In [55]:
# read Basic Water Drinking Services csv
df_water = pd.read_csv("DDW_Basic Water Drinking Services.csv")
# extract values that are not missing
df_water = df_water[df_water["Value"].notna()]
# extract specified year values
df_water = df_water[(df_water["Year"]==get_year)]
# extract year, country and values
df_water = df_water[["Year", "Area", "Value"]]
df_water = df_water.rename(columns = {"Area": "Country", "Value": "Basic_water"})
# set index from 0
df_water = df_water.reset_index(drop=True)
# find unique countries in total water data
df_water_countries = df_water.Country.unique()
# print dataframe
df_water

Unnamed: 0,Year,Country,Basic_water
0,2017,Afghanistan,66.8
1,2017,Albania,94.1
2,2017,Algeria,93.8
3,2017,American Samoa,99.0
4,2017,Andorra,99.0
...,...,...,...
232,2017,Small Island Developing States,83.1
233,2017,Low income economies,56.5
234,2017,Lower-middle-income economies,86.5
235,2017,High-income economies,99.0


### Extract Eating Disorder data for the specified year

In [56]:
# read eating disorder csv
df_disorder = pd.read_csv("DDW_Eating Disorder.csv")
# extract values that are not missing
df_disorder = df_disorder[df_disorder["Prevalence - Eating disorders - Sex: Both - Age: Age-standardized (Percent)"].notna()]
df_disorder = df_disorder.rename(columns = {"Entity": "Country", "Prevalence - Eating disorders - Sex: Both - Age: Age-standardized (Percent)":"Prevalence"})
# extract specified year values
df_disorder = df_disorder[(df_disorder["Year"]==get_year)]
# extract year, country and values
df_disorder = df_disorder[["Year", "Country", "Prevalence"]]
df_disorder = df_disorder.rename(columns = {"Prevalence": "Eating_disorder"})
# set index from 0
df_disorder = df_disorder.reset_index(drop=True)
# find unique countries in eating disorder data
df_disorder_countries = df_disorder.Country.unique()
# print dataframe
df_disorder

Unnamed: 0,Year,Country,Eating_disorder
0,2017,Afghanistan,0.12
1,2017,African Region (WHO),0.11
2,2017,Albania,0.14
3,2017,Algeria,0.22
4,2017,American Samoa,0.14
...,...,...,...
223,2017,World Bank Lower Middle Income,0.13
224,2017,World Bank Upper Middle Income,0.17
225,2017,Yemen,0.14
226,2017,Zambia,0.12


### Extract number of people employed in agriculture data for the specified year

In [57]:
# read employment csv
df_employment = pd.read_csv("DDW_Employment In Agriculture.csv")
# extract values that are not missing
df_employment = df_employment[df_employment["Value"].notna()]
# extract specified year values
df_employment = df_employment[(df_employment["Year"]==get_year)]
# extract year, country and values
df_employment = df_employment[["Year", "Area", "Value"]]
# multiply by 1000 to find actual number employed (units was in 1000 people)
df_employment["Value"] = df_employment["Value"]*1000
df_employment = df_employment.rename(columns = {"Area": "Country", "Value":"Employed_num"})
# set index from 0
df_employment = df_employment.reset_index(drop=True)
# find unique countries in eating disorder data
df_employment_countries = df_employment.Country.unique()
# print dataframe
df_employment

Unnamed: 0,Year,Country,Employed_num
0,2017,Afghanistan,2740235.0
1,2017,Albania,453779.0
2,2017,Algeria,1102072.0
3,2017,Argentina,6840.0
4,2017,Armenia,317111.0
...,...,...,...
111,2017,Uruguay,143717.0
112,2017,Uzbekistan,3671300.0
113,2017,Venezuela (Bolivarian Republic of),1166489.0
114,2017,Viet Nam,21564822.0


### Create list of common countries

Find the countries that we have all the data for as we do not want to use countries with missing values.

In [58]:
# cc represent common countries
cc1 = set(df_calories_countries)
cc2 = set(df_min_cal_countries)
cc3 = set(df_GDP_countries)
cc4 = set(df_pop_coutries)
cc5 = set(df_agri_countries)
cc6 = set(df_co2_countries)
cc7 = set(df_water_countries)
cc8 = set(df_disorder_countries)
cc9 = set(df_employment_countries)

cc = list(cc1 & cc2 & cc3 & cc4 & cc5 & cc6 & cc7 & cc8 & cc9)
print("Common Countries", cc, "\n", "Length", len(cc))

Common Countries ['Peru', 'El Salvador', 'Rwanda', 'Belize', 'Cyprus', 'Honduras', 'Bulgaria', 'Latvia', 'Switzerland', 'Estonia', 'Bangladesh', 'Belarus', 'Italy', 'Uruguay', 'Hungary', 'Serbia', 'Dominican Republic', 'Nepal', 'Paraguay', 'Azerbaijan', 'Uganda', 'Denmark', 'Jamaica', 'Portugal', 'Myanmar', 'Burundi', 'Indonesia', 'Poland', 'Mauritius', 'Armenia', 'Mali', 'South Africa', 'United Arab Emirates', 'Malaysia', 'Jordan', 'Afghanistan', 'Finland', 'Israel', 'Philippines', 'Australia', 'Iceland', 'Lithuania', 'Albania', 'Ecuador', 'Colombia', 'Austria', 'North Macedonia', 'France', 'Tunisia', 'Germany', 'New Zealand', 'Chile', 'Djibouti', 'Czechia', 'Uzbekistan', 'Ukraine', 'Togo', 'Ireland', 'Sweden', 'Sri Lanka', 'Greece', 'Canada', 'Panama', 'Ghana', 'Georgia', 'Seychelles', 'Mexico', 'Kazakhstan', 'Costa Rica', 'Guatemala', 'Japan', 'Bosnia and Herzegovina', 'Algeria', 'Brazil', 'Thailand', 'Mauritania', 'Montenegro', 'Mongolia', 'Slovenia', 'Samoa', 'Cambodia', 'Luxembou

## Calculating and Combining our dataframes according to our combined countries

We will start by calculating the necessary dependent values first before we fully combine.

### Calculate Y ratio 
$$
\frac{\text{Daily calories supplied (kcal/capita day)}}{\text{Minimum calories required (kcal/capita day)}}
$$


In [59]:
# extract values from countries in combined countries list
df_calories_cc = df_calories[df_calories["Country"].isin(cc)]
df_min_cal_cc = df_min_cal[df_min_cal["Country"].isin(cc)]

df_y_ratio = df_calories_cc.copy()
# extract minimum calorie value
df_y_ratio["Mininum_calorie_intake"] = list(df_min_cal_cc["Mininum_calorie_intake"])
# calculate Y ratio based on stated equation
df_y_ratio["y_ratio"] = df_y_ratio["Calories_supply"]/df_y_ratio["Mininum_calorie_intake"]
# set index from 0
df_y_ratio = df_y_ratio.reset_index(drop=True)
# print dataframe
df_y_ratio

Unnamed: 0,Year,Country,Calories_supply,Mininum_calorie_intake,y_ratio
0,2017,Afghanistan,2303,1676.0,1.374105
1,2017,Albania,3326,1911.0,1.740450
2,2017,Algeria,3383,1781.0,1.899495
3,2017,Armenia,3072,1875.0,1.638400
4,2017,Australia,3404,1911.0,1.781266
...,...,...,...,...,...
85,2017,Uganda,2030,1693.0,1.199055
86,2017,Ukraine,3062,1908.0,1.604822
87,2017,United Arab Emirates,3074,2045.0,1.503178
88,2017,Uruguay,3158,1861.0,1.696937


### Calculate Agriculture Land per capita ($\textrm{m}^2$ per capita)

In [60]:
# extract values from countries in combined countries list
df_agri_cc = df_agri[df_agri["Country"].isin(cc)]
df_pop_cc = df_pop[df_pop["Country"].isin(cc)]

df_agri_pop = df_agri_cc.copy()
# extract population value
df_agri_pop["Total_population"] = list(df_pop_cc["Population"])
# calculate agriculture land per capita, multiply by 1,000,000 to convert km^2 to m^2
df_agri_pop["Agri_land_cap"] = (df_agri_pop["Agri_land"]/df_agri_pop["Total_population"])*1000000
# set index from 0
df_agri_pop = df_agri_pop.reset_index(drop=True)
# print dataframe
df_agri_pop

Unnamed: 0,Year,Country,Agri_land,Total_population,Agri_land_cap
0,2017,Afghanistan,379100.00,36296108.0,10444.646021
1,2017,Albania,11742.81,2884169.0,4071.470847
2,2017,United Arab Emirates,3838.00,41389176.0,92.729558
3,2017,Armenia,16761.00,2944789.0,5691.749052
4,2017,Australia,3718370.00,24584620.0,151247.812657
...,...,...,...,...,...
85,2017,Ukraine,414890.00,41166588.0,10078.318854
86,2017,Uruguay,142229.00,44487708.0,3197.040405
87,2017,Uzbekistan,255332.00,9487206.0,26913.297761
88,2017,Samoa,620.00,3436645.0,180.408509


### Calculate percentage of population employed in the agriculture industry (%)

In [61]:
# extract values from countries in combined countries list
df_employment_cc = df_employment[df_employment["Country"].isin(cc)]
df_pop_cc = df_pop[df_pop["Country"].isin(cc)]


df_percent_employed = df_employment_cc.copy()
# extract population value
df_percent_employed["Total_population"] = list(df_pop_cc["Population"])
# calculate percent employed; (employed number/total population) for each country
df_percent_employed["Employed_%"] = (df_percent_employed["Employed_num"]/df_percent_employed["Total_population"])*100
# set index from 0
df_percent_employed = df_percent_employed.reset_index(drop=True)
# print dataframe
df_percent_employed

Unnamed: 0,Year,Country,Employed_num,Total_population,Employed_%
0,2017,Afghanistan,2740235.0,36296108.0,7.549666
1,2017,Albania,453779.0,2884169.0,15.733440
2,2017,Algeria,1102072.0,41389176.0,2.662706
3,2017,Armenia,317111.0,2944789.0,10.768547
4,2017,Australia,318393.0,24584620.0,1.295090
...,...,...,...,...,...
85,2017,Uganda,4095242.0,41166588.0,9.947975
86,2017,Ukraine,2489400.0,44487708.0,5.595703
87,2017,United Arab Emirates,65857.0,9487206.0,0.694166
88,2017,Uruguay,143717.0,3436645.0,4.181898


### Calculate $CO_{2}$ per agricultural land (kg per $m^{2}$)

In [62]:
# extract values from countries in combined countries list
df_co2_cc = df_co2[df_co2["Country"].isin(cc)]
df_agri_cc = df_agri[df_agri["Country"].isin(cc)]


df_co2_land = df_co2_cc.copy()
# extract Agricultural land in km^2
df_co2_land["Agri_land"] = list(df_agri_cc["Agri_land"])
# calculate CO2 per agricultral land; CO2 in MMT, 1 MMT = 1,000,000,000 kg
# 1 km^2 = 1,000,000 m^2; result in kg per m^2
df_co2_land["CO2_agri"] = (df_co2_land["CO2"]/df_co2_land["Agri_land"])*1000
# set index from 0
df_co2_land = df_co2_land.reset_index(drop=True)
# print dataframe
df_co2_land

Unnamed: 0,Year,Country,CO2,Agri_land,CO2_agri
0,2017,Afghanistan,6.860,379100.00,0.018095
1,2017,Albania,5.302,11742.81,0.451510
2,2017,Algeria,154.936,3838.00,40.368942
3,2017,Armenia,5.537,16761.00,0.330350
4,2017,Australia,414.751,3718370.00,0.111541
...,...,...,...,...,...
85,2017,Uganda,5.374,414890.00,0.012953
86,2017,Ukraine,223.085,142229.00,1.568492
87,2017,United Arab Emirates,168.831,255332.00,0.661221
88,2017,Uruguay,6.163,620.00,9.940323


### Extracting the rest of the dataframes needed by countries that are within the combined countries list

In [63]:
# extract GDP value from countries in combined countries list
df_GDP_cc = df_GDP[df_GDP["Country"].isin(cc)]
# set index from 0
df_GDP_cc = df_GDP_cc.reset_index(drop=True)

# extract water value from countries in combined countries list
df_water_cc = df_water[df_water["Country"].isin(cc)]
# set index from 0
df_water_cc = df_water_cc.reset_index(drop=True)

# extract eating disorder value from countries in combined countries list
df_disorder_cc = df_disorder[df_disorder["Country"].isin(cc)]
# set index from 0
df_disorder_cc = df_disorder_cc.reset_index(drop=True)

# print dataframe to check if need be
# df_GDP_cc
# df_water_cc
# df_disorder_cc

### Combining all the variables into one dataframe

In [64]:
# add country and Y ratio column
dfyears_combined = df_y_ratio.loc[:, ["Country", "y_ratio"]]
# add GDP column
dfyears_combined["GDP"] = df_GDP_cc.loc[:, "GDP"]
# add agriculture per land column
dfyears_combined["Agri_land_cap"] = df_agri_pop.loc[:, "Agri_land_cap"]
# add percent basic water service column
dfyears_combined["Basic_water"] = df_water_cc.loc[:, "Basic_water"]
# add percent eating disorder column
dfyears_combined["Eating_disorder"] = df_disorder_cc.loc[:, "Eating_disorder"]
# add percent employed in agriculture column
dfyears_combined["Employed_%"] = df_percent_employed.loc[:, "Employed_%"]
# add co2 per agricultural land column
dfyears_combined["CO2_agri"] = df_co2_land.loc[:, "CO2_agri"]
# add specified year in the first column
dfyears_combined.insert(0, "Year", get_year)
# print dataframe
dfyears_combined

# export dataframe
# dfyears_combined.to_csv(f"df{get_year}_combined.csv")

Unnamed: 0,Year,Country,y_ratio,GDP,Agri_land_cap,Basic_water,Eating_disorder,Employed_%,CO2_agri
0,2017,Afghanistan,1.374105,2058.4,10444.646021,66.8,0.12,7.549666,0.018095
1,2017,Albania,1.740450,12771.0,4071.470847,94.1,0.14,15.733440,0.451510
2,2017,Algeria,1.899495,11737.4,92.729558,93.8,0.22,2.662706,40.368942
3,2017,Armenia,1.638400,12115.1,5691.749052,99.0,0.13,10.768547,0.330350
4,2017,Australia,1.781266,48398.5,151247.812657,99.0,1.11,1.295090,0.111541
...,...,...,...,...,...,...,...,...,...
85,2017,Uganda,1.199055,2074.7,10078.318854,51.0,0.10,9.947975,0.012953
86,2017,Ukraine,1.604822,11860.6,3197.040405,93.8,0.13,5.595703,1.568492
87,2017,United Arab Emirates,1.503178,67183.6,26913.297761,99.0,0.31,0.694166,0.661221
88,2017,Uruguay,1.696937,23009.9,180.408509,99.0,0.38,4.181898,9.940323


## Combinining the dataframes for years 2013 to 2017 into one dataframe

*Year 2013, 2014, 2015, 2016, 2017 was done on seperate occassions and exported to one csv file for each year.

In [65]:
# read the csv for each year
df2013_combined = pd.read_csv("df2013_combined_model1.csv")
df2014_combined = pd.read_csv("df2014_combined_model1.csv")
df2015_combined = pd.read_csv("df2015_combined_model1.csv")
df2016_combined = pd.read_csv("df2016_combined_model1.csv")
df2017_combined = pd.read_csv("df2017_combined_model1.csv")

In [66]:
# combine all five dataframes into one
dfallyears_combined_1 = pd.concat([df2013_combined, df2014_combined, df2015_combined, df2016_combined, df2017_combined])
# drop the extra column caused due to concat function
dfallyears_combined_1 = dfallyears_combined_1.drop(columns = "Unnamed: 0")
# set index from 0
dfallyears_combined_1 = dfallyears_combined_1.reset_index(drop=True)
# print dataframe
dfallyears_combined_1

# export dataframe
# dfallyears_combined.to_csv("dfallyears_combined.csv")

Unnamed: 0,Year,Country,y_ratio,GDP,Agri_land_cap,Basic_water,Eating_disorder,Employed_%,CO2_agri
0,2013,Albania,1.718685,11361.3,4.088797e+03,92.6,0.14,15.527752,0.415059
1,2013,Algeria,1.914206,11319.1,3.223308e+04,93.0,0.22,2.991599,0.107726
2,2013,Argentina,1.728202,24424.1,3.988526e+02,98.9,0.35,0.159302,11.297267
3,2013,Armenia,1.607238,10691.3,1.283048e+06,99.0,0.13,14.568022,0.001489
4,2013,Australia,1.790601,46744.6,1.168097e+03,99.0,1.10,1.286984,14.473973
...,...,...,...,...,...,...,...,...,...
433,2017,Uganda,1.199055,2074.7,1.007832e+04,51.0,0.10,9.947975,0.012953
434,2017,Ukraine,1.604822,11860.6,3.197040e+03,93.8,0.13,5.595703,1.568492
435,2017,United Arab Emirates,1.503178,67183.6,2.691330e+04,99.0,0.31,0.694166,0.661221
436,2017,Uruguay,1.696937,23009.9,1.804085e+02,99.0,0.38,4.181898,9.940323


### Removal of outliers

In [67]:
# write function to return the upper limit and lower limit based on IQR
# df = dataframe; i = the name of the column
def identify_upper_lower (df, i):
    q3 = df[i].quantile (q = 0.75)
    q1 = df[i].quantile (q = 0.25)
    IQR = q3 - q1
    upper_limit = q3 + 1.5 * IQR
    lower_limit = q1 - 1.5 * IQR
    return upper_limit, lower_limit

### List of countries left after removol of countries with outliers

In [68]:
# identifier limit for each factor
GDP_limit = identify_upper_lower(dfallyears_combined_1, "GDP")
Agri_cap_limit = identify_upper_lower(dfallyears_combined_1, "Agri_land_cap")
Water_limit = identify_upper_lower(dfallyears_combined_1, "Basic_water")
Eating_limit = identify_upper_lower(dfallyears_combined_1, "Eating_disorder")
Employed_limit = identify_upper_lower(dfallyears_combined_1, "Employed_%")
CO2_limit = identify_upper_lower(dfallyears_combined_1, "CO2_agri")

# removal of rows with outliers
dfall_clean_1 = dfallyears_combined_1.copy()
# remove GDP outlier
dfall_clean_1 = dfall_clean_1.loc[(dfall_clean_1["GDP"] <= GDP_limit[0]) & (dfall_clean_1["GDP"] >= GDP_limit[1]), :]
# remove Agriculture land per capita outlier
dfall_clean_1 = dfall_clean_1.loc[(dfall_clean_1["Agri_land_cap"] <= Agri_cap_limit[0]) & (dfall_clean_1["Agri_land_cap"] >= Agri_cap_limit[1]), :]
# remove basic water outlier
dfall_clean_1 = dfall_clean_1.loc[(dfall_clean_1["Basic_water"] <= Water_limit[0]) & (dfall_clean_1["Basic_water"] >= Water_limit[1]), :]
# remove eating disorder outlier
dfall_clean_1 = dfall_clean_1.loc[(dfall_clean_1["Eating_disorder"] <= Eating_limit[0]) & (dfall_clean_1["Eating_disorder"] >= Eating_limit[1]), :]
# remove percentage employed outlier
dfall_clean_1 = dfall_clean_1.loc[(dfall_clean_1["Employed_%"] <= Employed_limit[0]) & (dfall_clean_1["Employed_%"] >= Employed_limit[1]), :]
# remove CO2 emitted per agricultural land outlier
dfall_clean_1 = dfall_clean_1.loc[(dfall_clean_1["CO2_agri"] <= CO2_limit[0]) & (dfall_clean_1["CO2_agri"] >= CO2_limit[1]), :]
# print dataframe
dfall_clean_1

Unnamed: 0,Year,Country,y_ratio,GDP,Agri_land_cap,Basic_water,Eating_disorder,Employed_%,CO2_agri
0,2013,Albania,1.718685,11361.3,4088.797116,92.6,0.14,15.527752,0.415059
1,2013,Algeria,1.914206,11319.1,32233.078561,93.0,0.22,2.991599,0.107726
5,2013,Austria,1.907645,52997.8,5574.676863,99.0,0.61,2.195136,1.420584
6,2013,Azerbaijan,1.660085,14651.7,1424.010121,89.8,0.15,17.872206,2.634867
7,2013,Bangladesh,1.384876,3965.4,607.352349,96.8,0.11,17.144373,0.665898
...,...,...,...,...,...,...,...,...,...
430,2017,Thailand,1.467374,17423.0,3281.326073,99.0,0.14,17.025430,1.259181
432,2017,Tunisia,1.901540,11234.5,12607.756302,96.0,0.22,4.459936,0.202782
434,2017,Ukraine,1.604822,11860.6,3197.040405,93.8,0.13,5.595703,1.568492
435,2017,United Arab Emirates,1.503178,67183.6,26913.297761,99.0,0.31,0.694166,0.661221


### Standardization

In [69]:
def normalize_z(df):
    dfout = (df - df.mean(axis = 0)) / df.std(axis = 0)
    return dfout

In [70]:
dfall_normalized_1 = dfall_clean_1.copy()
# standardize GDP
dfall_normalized_1["GDP_normalized"] = normalize_z(dfall_normalized_1["GDP"])
# standardize agricultural land per capita
dfall_normalized_1["Agri_land_cap_normalized"] = normalize_z(dfall_normalized_1["Agri_land_cap"])
# standardize basic water service
dfall_normalized_1["Basic_water_normalized"] = normalize_z(dfall_normalized_1["Basic_water"])
# standardize eating disorder
dfall_normalized_1["Eating_disorder_normalized"] = normalize_z(dfall_normalized_1["Eating_disorder"])
# standardize percentage employed
dfall_normalized_1["Employed_%_normalized"] = normalize_z(dfall_normalized_1["Employed_%"])
# standardize CO2 emitted per agricultural land
dfall_normalized_1["CO2_agri_normalized"] = normalize_z(dfall_normalized_1["CO2_agri"])
# dfall_normalized_1.to_csv("dfallyears_normalized_model1.csv")
dfall_normalized_1

Unnamed: 0,Year,Country,y_ratio,GDP,Agri_land_cap,Basic_water,Eating_disorder,Employed_%,CO2_agri,GDP_normalized,Agri_land_cap_normalized,Basic_water_normalized,Eating_disorder_normalized,Employed_%_normalized,CO2_agri_normalized
0,2013,Albania,1.718685,11361.3,4088.797116,92.6,0.14,15.527752,0.415059,-0.835531,-0.545290,-1.440541,-0.914102,1.985003,-0.587407
1,2013,Algeria,1.914206,11319.1,32233.078561,93.0,0.22,2.991599,0.107726,-0.837990,1.389634,-1.300914,-0.379971,-0.609040,-0.793541
5,2013,Austria,1.907645,52997.8,5574.676863,99.0,0.61,2.195136,1.420584,1.591304,-0.443136,0.793489,2.223916,-0.773848,0.087017
6,2013,Azerbaijan,1.660085,14651.7,1424.010121,89.8,0.15,17.872206,2.634867,-0.643746,-0.728495,-2.417929,-0.847336,2.470129,0.901459
7,2013,Bangladesh,1.384876,3965.4,607.352349,96.8,0.11,17.144373,0.665898,-1.266610,-0.784640,0.025542,-1.114401,2.319522,-0.419165
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
430,2017,Thailand,1.467374,17423.0,3281.326073,99.0,0.14,17.025430,1.259181,-0.482217,-0.600804,0.793489,-0.914102,2.294910,-0.021239
432,2017,Tunisia,1.901540,11234.5,12607.756302,96.0,0.22,4.459936,0.202782,-0.842921,0.040390,-0.253712,-0.379971,-0.305205,-0.729785
434,2017,Ukraine,1.604822,11860.6,3197.040405,93.8,0.13,5.595703,1.568492,-0.806428,-0.606598,-1.021660,-0.980868,-0.070186,0.186221
435,2017,United Arab Emirates,1.503178,67183.6,26913.297761,99.0,0.31,0.694166,0.661221,2.418141,1.023899,0.793489,0.220926,-1.084436,-0.422302


## Features and Targets preperation

### Training 1st Model

Defining functions that we have used in class

In [71]:
def get_features_targets(df, feature_names, target_names):
    df_feature = df.loc[:,feature_names]
    df_target = df.loc[:,target_names]
    return df_feature, df_target

def prepare_feature(df_feature):
    feature = df_feature.to_numpy().reshape(-1, len(df_feature.columns))
    X = np.concatenate((np.ones((feature.shape[0], 1)),feature), axis = 1)
    return X

def prepare_target(df_target):
    target = df_target.to_numpy().reshape(-1, len(df_target.columns))
    return target

def predict(df_feature, beta):
    X = prepare_feature(df_feature)
    return np.matmul(X, beta)

def calc_linear(X, beta):
    return np.matmul(X, beta)

def split_data(df_feature, df_target, random_state=None, test_size=0.5):
    indexes = df_feature.index
    if random_state != None:
        np.random.seed(random_state)
    
    k = int(test_size*len(indexes))
    test_index = np.random.choice(indexes, k, replace = False)
    indexes = set(indexes)
    test_index = set(test_index)
    train_index = indexes - test_index
    
    df_feature_train = df_feature.loc[train_index,:]
    df_feature_test = df_feature.loc[test_index,:]
    
    df_target_train = df_target.loc[train_index,:]
    df_target_test = df_target.loc[test_index,:]
    return df_feature_train, df_feature_test, df_target_train, df_target_test

def r2_score(y, ypred):
    ymean = np.mean(y)
    diff = y - ymean
    sstot = np.matmul(diff.T, diff)
    error = y - ypred
    ssres = np.matmul(error.T,error)
    return 1 - ssres/sstot

def mean_squared_error(target, pred):
    n = target.shape[0]
    error = target - pred
    return (1/n)*np.matmul(error.T, error)[0][0]

def compute_cost(X, y, beta):
    J = 0
    m = X.shape[0]
    error = calc_linear(X,beta)-y
    error_sq = np.matmul(error.T,error)
    J = (1/(2*m)) * error_sq
    return J

def gradient_descent(X, y, beta, alpha, num_iters):
    m = X.shape[0]
    J_storage = np.zeros((num_iters,1))
    for n in range(num_iters):
        deriv = np.matmul(X.T,calc_linear(X,beta)-y)
        beta = beta - (alpha * (1/m)) * deriv
        J_storage[n] = compute_cost(X,y,beta)
    return beta, J_storage

### Considering Possible Removal of Features

We will be using another metric to verify whether there is a possible relationship between between our features and the target. We decided to use Mean Absolute Error (MAE).
$$\large \textrm{MAE} = \frac{1}{n} \sum |y_i - x_i| %$$

In [72]:
def mean_absolute_error(df,feature,target):
    x = df[feature].to_numpy()
    y = df[target].to_numpy()
    n = df.shape[0]
    error = np.abs((y-x))
    return (1/n) * (error.sum())

In [73]:
def mean_absolute_percentage_error(df, feature, target):
    x = df[feature].to_numpy()
    y = df[target].to_numpy()
    n = df.shape[0]
    error = np.abs((y-x)/y)
    return ((1/n) * (error.sum())) / 100

# KEITH PART BELOW IS CLEANING OF MODEL 2 AND 3

## Cleaning the data for $2^{nd}$ Model

### Extract calories supplied data for the specified year

In [74]:
# read food supply csv
df_undernourishment = pd.read_csv("DDW_Prevalence of Undernourishment.csv")
# extract values from the specified year
df_undernourishment = df_undernourishment[(df_undernourishment["Year"]==get_year)]
# extract year, country and percent undernourished
df_undernourishment = df_undernourishment[["Year", "Entity", "Prevalence of undernourishment (% of population)"]]
df_undernourishment = df_undernourishment.rename(columns = {"Entity": "Country", "Prevalence of undernourishment (% of population)": "Undernourishment"})
# set index from 0
df_undernourishment = df_undernourishment.reset_index(drop=True)
# find unique countries in percent undernourished
df_undernourishment_countries = df_undernourishment.Country.unique()
# print dataframe
df_undernourishment

Unnamed: 0,Year,Country,Undernourishment
0,2017,Afghanistan,23.000000
1,2017,Albania,4.100000
2,2017,Algeria,2.700000
3,2017,Angola,15.400000
4,2017,Argentina,3.100000
...,...,...,...
171,2017,Vanuatu,9.600000
172,2017,Venezuela,22.200001
173,2017,Vietnam,7.200000
174,2017,World,8.200000


### Extract GDP, adjusted for PPP, per capita data for the specified year

In [85]:
# done in Model 1 part
# unique countries
df_GDP_countries
# print dataframe
df_GDP

Unnamed: 0,Year,Country,GDP
0,2017,Afghanistan,2058.4
1,2017,Albania,12771.0
2,2017,Algeria,11737.4
3,2017,Angola,7310.9
4,2017,Antigua and Barbuda,19840.3
...,...,...,...
182,2017,Uzbekistan,6840.7
183,2017,Vanuatu,3081.5
184,2017,Viet Nam,8996.4
185,2017,Zambia,3485.0


### Extract total population data for the specified year

In [86]:
# done in Model 1 part
# unique countries
df_pop_coutries
# print dataframe
df_pop

Unnamed: 0,Year,Country,Population
0,2017,Afghanistan,3.629611e+07
1,2017,Africa,1.244222e+09
2,2017,Albania,2.884169e+06
3,2017,Algeria,4.138918e+07
4,2017,Andorra,7.699700e+04
...,...,...,...
226,2017,Wallis and Futuna,1.189400e+04
227,2017,World,7.548173e+09
228,2017,Yemen,2.783481e+07
229,2017,Zambia,1.685361e+07


### Extract land area used for agriculture data for the specified year

In [87]:
# done in Model 1 part
# unique countries
df_agri_countries
# print dataframe
df_agri

Unnamed: 0,Year,Country,Agri_land
0,2017,Aruba,20.00
1,2017,Africa Eastern and Southern,6538552.75
2,2017,Afghanistan,379100.00
3,2017,Africa Western and Central,3589797.00
4,2017,Angola,563974.30
...,...,...,...
260,2017,Samoa,620.00
262,2017,"Yemen, Rep.",234520.00
263,2017,South Africa,963410.00
264,2017,Zambia,238360.00


### Extract $CO_{2}$ (in million metric tonnes) data for the specified year

In [88]:
# done in Model 1 part
# unique countries
df_co2_countries
# print dataframe
df_co2

Unnamed: 0,Year,Country,CO2
0,2017,Afghanistan,6.860
1,2017,Africa,1384.372
2,2017,Albania,5.302
3,2017,Algeria,154.936
4,2017,Andorra,0.465
...,...,...,...
232,2017,Wallis and Futuna,0.026
233,2017,World,35925.738
234,2017,Yemen,9.951
235,2017,Zambia,6.517


### Extract basic water drinking services data for the specified year

In [89]:
# done in Model 1 part
# unique countries
df_water_countries
# print dataframe
df_water

Unnamed: 0,Year,Country,Basic_water
0,2017,Afghanistan,66.8
1,2017,Albania,94.1
2,2017,Algeria,93.8
3,2017,American Samoa,99.0
4,2017,Andorra,99.0
...,...,...,...
232,2017,Small Island Developing States,83.1
233,2017,Low income economies,56.5
234,2017,Lower-middle-income economies,86.5
235,2017,High-income economies,99.0


### Extract Eating Disorder data for the specified year

In [90]:
# done in Model 1 part
# unique countries
df_disorder_countries
# print dataframe
df_disorder

Unnamed: 0,Year,Country,Eating_disorder
0,2017,Afghanistan,0.12
1,2017,African Region (WHO),0.11
2,2017,Albania,0.14
3,2017,Algeria,0.22
4,2017,American Samoa,0.14
...,...,...,...
223,2017,World Bank Lower Middle Income,0.13
224,2017,World Bank Upper Middle Income,0.17
225,2017,Yemen,0.14
226,2017,Zambia,0.12


### Extract number of people employed in agriculture data for the specified year

In [91]:
# done in Model 1 part
# unique countries
df_employment_countries
# print dataframe
df_employment

Unnamed: 0,Year,Country,Employed_num
0,2017,Afghanistan,2740235.0
1,2017,Albania,453779.0
2,2017,Algeria,1102072.0
3,2017,Argentina,6840.0
4,2017,Armenia,317111.0
...,...,...,...
111,2017,Uruguay,143717.0
112,2017,Uzbekistan,3671300.0
113,2017,Venezuela (Bolivarian Republic of),1166489.0
114,2017,Viet Nam,21564822.0


### Create list of common countries

Find the countries that we have all the data for as we do not want to use countries with missing values.

In [95]:
# cc2 represent common countries for model 2
set1_2 = set(df_undernourishment_countries)
set2_2 = set(df_GDP_countries)
set3_2 = set(df_pop_coutries)
set4_2 = set(df_agri_countries)
set5_2 = set(df_co2_countries)
set6_2 = set(df_water_countries)
set7_2 = set(df_disorder_countries)
set8_2 = set(df_employment_countries)

cc2 = countries = list(set1_2 & set2_2 & set3_2 & set4_2 & set5_2 & set6_2 & set7_2 & set8_2)
print("Common Countries", cc2, "\n", "Length", len(cc2))

Common Countries ['Peru', 'El Salvador', 'Rwanda', 'Belize', 'Cyprus', 'Honduras', 'Bulgaria', 'Latvia', 'Switzerland', 'Estonia', 'Bangladesh', 'Belarus', 'Italy', 'Uruguay', 'Hungary', 'Serbia', 'Dominican Republic', 'Nepal', 'Paraguay', 'Azerbaijan', 'Denmark', 'Jamaica', 'Portugal', 'Myanmar', 'Indonesia', 'Poland', 'Mauritius', 'Armenia', 'Mali', 'South Africa', 'United Arab Emirates', 'Malaysia', 'Jordan', 'Afghanistan', 'Finland', 'Israel', 'Philippines', 'Australia', 'Iceland', 'Lithuania', 'Albania', 'Ecuador', 'Colombia', 'Austria', 'North Macedonia', 'France', 'Tunisia', 'Germany', 'New Zealand', 'Chile', 'Djibouti', 'Czechia', 'Uzbekistan', 'Ukraine', 'Togo', 'Ireland', 'Sweden', 'Sri Lanka', 'Greece', 'Canada', 'Panama', 'Ghana', 'Georgia', 'Mexico', 'Kazakhstan', 'Costa Rica', 'Guatemala', 'Japan', 'Bosnia and Herzegovina', 'Algeria', 'Brazil', 'Thailand', 'Mauritania', 'Montenegro', 'Mongolia', 'Slovenia', 'Samoa', 'Cambodia', 'Luxembourg', 'Romania', 'Barbados', 'Kuwait

### Features and Target Preparation

Describe here what are the features you use and why these features. Put any Python codes to prepare and clean up your features. 

Do the same thing for the target. Describe your target and put any codes to prepare your target.

In [76]:
# put Python code to prepare your featuers and target

### Building Model

Describe your model. Is this Linear Regression or Logistic Regression? Put any other details about the model. Put the codes to build your model.

In [77]:
# put Python code to build your model

### Evaluating the Model

Describe your metrics and how you want to evaluate your model. Put any Python code to evaluate your model. Use plots to have a visual evaluation.

In [78]:
# put Python code to evaluate the model and to visualize its accuracy

### Improving the Model

Discuss any steps you can do to improve the models. Put any python codes. You can repeat the steps above with the codes to show the improvement in the accuracy. 

### Dataset for the improved Model

Describe here your data set. Put the link to the sources of your dataset. Describe your data and what are the columns.

Put some Python codes here to describe and visualize your data.

In [79]:
# put Python code to read and describe your data

### Features and Target Preparation for the improved Model

Describe here what are the features you use and why these features. Put any Python codes to prepare and clean up your features. 

Do the same thing for the target. Describe your target and put any codes to prepare your target.

In [80]:
# put Python code to prepare your featuers and target

### Building the improved Model

Describe your model. Is this Linear Regression or Logistic Regression? Put any other details about the model. Put the codes to build your model.

In [81]:
# put Python code to build your model

### Evaluating the improved Model

Describe your metrics and how you want to evaluate your model. Put any Python code to evaluate your model. Use plots to have a visual evaluation.

In [82]:
# put Python code to evaluate the model and to visualize its accuracy

### Discussion and Analysis

Discuss your model and accuracy in solving the problem. Analyze the results of your metrics. Put any conclusion here.