# 2D Design Template

# Overview

The purpose of this project is for you to apply what you have learnt in this course. This includes working with data and visualizing it, create model of linear regression or logistic regression, as well as using metrics to measure the accuracy of your model. 

Please find the project handout description in the following link:
- [DDW-MU-Humanities Handout](https://sutdapac-my.sharepoint.com/:w:/g/personal/franklin_anariba_sutd_edu_sg/ESlibHS4e3hDtOyWQ8noqdsBrC8UXO3wwMmTszX6vFIVVg?e=rjIslZ)
- [DDW-MU-SocialStudies Handout](https://sutdapac-my.sharepoint.com/:w:/g/personal/franklin_anariba_sutd_edu_sg/EQ8CAm4PPupOlXqv9zTNkQYBpD_yGdwMWytBYpJTi9dzew?e=beEbac)


## Deliverables

You need to submit this Jupyter notebook together with the dataset into Vocareum. Use the template in this notebook to work on this project.

## Students Submission

Student's Name:
- Boey Sze Min, Jeanelle (1006037)
- Joash Tan Jia Le (1005862)
- Keith Chua Dian-Xun (1005880)
- Wong Wei Jin Justin (1006001)

### Overview About the Problem

Describe here the problem you are trying to solve.

### Dataset

Describe here your data set. Put the link to the sources of your dataset. Describe your data and what are the columns.

Put some Python codes here to describe and visualize your data.

We have obtained the following raw data for all the countries over a few years:
1. GDP per capita, adjusted for PPP (USD):   
   Link: https://data.worldbank.org/indicator/NY.GDP.PCAP.PP.CD   
   CSV file name: "DDW_GDP per capita adjusted.csv"   

2. Amount of agricultural land allocated per countries ($km^{2}$):   
   Link: https://data.worldbank.org/indicator/AG.LND.AGRI.K2?name_desc=false    
   CSV file name: "DDW_Agricultural Land.csv"   

3. Percentage of population with access to basic drinking water (%):   
   Link: https://www.kaggle.com/datasets/utkarshxy/who-worldhealth-statistics-2020-complete?select=basicDrinkingWaterServices.csv    
   CSV file name: "DDW_Basic Water Drinking Services.csv"   

4. $CO_{2}$ emitted in each country (million metric tonnes):   
   Link: https://github.com/owid/co2-data   
   CSV file name: "DDW_CO2.csv"   

5. Percentage of population with eating disorder (%):    
   Link: https://ourworldindata.org/grapher/share-with-an-eating-disorder    
   CSV file name: "DDW_Eating Disorder.csv"   

6. Number of people employed in agriculture, forestry and fisheries:   
   Link: https://www.fao.org/faostat/en/#data/OEA    
   CSV file name: "DDW_Employment In Agriculture.csv"   

7. Population of each country:   
   Link: https://github.com/owid/co2-data    
   CSV file name: "DDW_Population.csv"   

8. Daily calories supplied (kcal/capita day):   
   Link: https://ourworldindata.org/food-supply   
   CSV file name: "DDW_Food Supply.csv"   

9. Minimum dietary calories requirement (kcal/capita day):   
   Link: https://www.fao.org/faostat/en/#data/FS   
   CSV file name: "DDW_Min Cal Intake.csv"   

10. Fat supplied (g/capita day):   
   Link: https://ourworldindata.org/food-supply   
   CSV file name: "DDW_Fat and Protein Supply.csv"   

11. Protein supplied (g/capita day):   
   Link: https://ourworldindata.org/food-supply   
   CSV file name: "DDW_Fat and Protein Supply.csv"   

12. Percent of population malnourished:     
   Link: https://ourworldindata.org/hunger-and-undernourishment#:~:text=across%20the%20world.-,Summary,million%20people%20globally%20are%20undernourished   
   CSV file name: "DDW_Prevalence of Undernourishment.csv"   

The code below extracts the relevant data from each file and processses it to match our 6 factors.

For **model 1**, the 6 factors that we are using to predict food security are:
1. GDP per capita, adjusted for PPP (USD)
2. Agricultural land per capita ($km^{2}$ per capita)
3. Percentage of population with basic water service (%)
4. Percentage of population with eating disorder (%)
5. Percentage of population employed in agriculture forestry fishery (%)
6. $CO_{2}$ emitted per agricultural land area (kg/$m^{2}$)   

For our dependent value we will be calculating whether the calorie supplied per capita per day is greater or less than the minimum calories required per capita per day, we will be naming this our _**Y Ratio**_.
$$
 \text{Y Ratio} = \frac{\text{Daily calories supplied (kcal/capita day)}}{\text{Minimum calories required (kcal/capita day)}}
$$

If _Y ratio_ ${<}$ 1, we say that the daily calories supplied is not sufficient to meet the minimum calories required.   
If _Y ratio_ ${>}$ 1, we say that the daily calories supplied is sufficient to meet the minimum calories required.



In [None]:
# put Python code to read and describe your data

## Imports

In [47]:
# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

## Cleaning the data in csv files

Store the specified year that we want from the csv as get_year.

In [104]:
# set the year to extract
get_year = 2017

Extracting the code requires us to look at the csv beforehand and understand how the csv is structured.   
Cleaning up the csv and obtaining what we wanted follows the folloiwng similar steps.
1. Read the csv
2. drop NaN values
3. Extract values from the specified year in _get_year_
4. Extract values that we want
5. Rename the columns for easy understanding
6. Reset the index to 0
7. Find unique countries for use later
8. Print the dataframe for us to get a preview

### Extract calories supplied data for the specified year

In [76]:
# read calories supply csv
df_calories = pd.read_csv("DDW_food Supply.csv")
# extract values only when item is "Grand Total"
df_calories = df_calories[(df_calories["Item"] == "Grand Total")]
# extract specified year values
df_calories = df_calories[(df_calories["Year"]==get_year)]
# extract year, country and calories supply
df_calories = df_calories[["Year", "Area", "Value"]]
df_calories = df_calories.rename(columns = {"Area": "Country", "Value": "Calories_supply"})
# set index from 0
df_calories = df_calories.reset_index(drop=True)
# find unique countries in calories supply data
df_calories_countries = df_calories.Country.unique()
# print dataframe
df_calories

Unnamed: 0,Year,Country,Calories_supply
0,2017,Afghanistan,2303
1,2017,Albania,3326
2,2017,Algeria,3383
3,2017,Angola,2441
4,2017,Antigua and Barbuda,2446
...,...,...,...
209,2017,Least Developed Countries,2402
210,2017,Land Locked Developing Countries,2539
211,2017,Small Island Developing States,2685
212,2017,Low Income Food Deficit Countries,2505


### Extract minimum calorie intake required data for the specified year

In [77]:
# read minimum calorie intake csv
df_min_cal = pd.read_csv("DDW_Min Cal Intake.csv")
# extract values that are not missing
df_min_cal = df_min_cal[df_min_cal["Value"].notna()]
# extract specified year values
df_min_cal = df_min_cal[(df_min_cal["Year"]==get_year)]
# extract year, country and values
df_min_cal = df_min_cal[["Year", "Area", "Value"]]
df_min_cal = df_min_cal.rename(columns = {"Area": "Country", "Value":"Mininum_calorie_intake"})
# set index from 0
df_min_cal = df_min_cal.reset_index(drop=True)
# find unique countries in miniumum calorie intake required data
df_min_cal_countries = df_min_cal.Country.unique()
# print dataframe
df_min_cal

Unnamed: 0,Year,Country,Mininum_calorie_intake
0,2017,Afghanistan,1676.0
1,2017,Albania,1911.0
2,2017,Algeria,1781.0
3,2017,Angola,1659.0
4,2017,Antigua and Barbuda,1888.0
...,...,...,...
183,2017,Vanuatu,1695.0
184,2017,Venezuela (Bolivarian Republic of),1817.0
185,2017,Viet Nam,1785.0
186,2017,Yemen,1703.0


### Extract GDP, adjusted for PPP, per capita data for the specified year

In [87]:
# read GDP, adjusted for PPP, per capita csv
df_GDP = pd.read_csv("DDW_GDP per capita adjusted.csv")
# extract values that are not missing
df_GDP = df_GDP[(df_GDP["Value"].notna())]
# extract specified year values
df_GDP = df_GDP[(df_GDP["Year"]==get_year)]
# extract year, country and values
df_GDP = df_GDP[["Year","Area","Value"]]
df_GDP = df_GDP.rename(columns = {"Area": "Country", "Value":"GDP"})
# set index from 0
df_GDP = df_GDP.reset_index(drop=True)
# find unique countries in GDP per capita data
df_GDP_countries = df_GDP.Country.unique()
# print data frame
df_GDP

Unnamed: 0,Year,Country,GDP
0,2017,Afghanistan,2058.4
1,2017,Albania,12771.0
2,2017,Algeria,11737.4
3,2017,Angola,7310.9
4,2017,Antigua and Barbuda,19840.3
...,...,...,...
182,2017,Uzbekistan,6840.7
183,2017,Vanuatu,3081.5
184,2017,Viet Nam,8996.4
185,2017,Zambia,3485.0


### Extract total population data for the specified year

In [88]:
# read total population employed csv
df_pop = pd.read_csv("DDW_Population.csv")
# extract specified year values
df_pop = df_pop[(df_pop["year"]==get_year)]
# extract values that are not missing
df_pop = df_pop[df_pop["population"].notna()]
# extract year, country and population value
df_pop = df_pop[["year", "country", "population"]]
df_pop = df_pop.rename(columns = {"year": "Year", "country": "Country", "population":"Population"})
# set index from 0
df_pop = df_pop.reset_index(drop=True)
# find unique countries in total population data
df_pop_coutries = df_pop.Country.unique()
# print dataframe
df_pop

Unnamed: 0,Year,Country,Population
0,2017,Afghanistan,3.629611e+07
1,2017,Africa,1.244222e+09
2,2017,Albania,2.884169e+06
3,2017,Algeria,4.138918e+07
4,2017,Andorra,7.699700e+04
...,...,...,...
226,2017,Wallis and Futuna,1.189400e+04
227,2017,World,7.548173e+09
228,2017,Yemen,2.783481e+07
229,2017,Zambia,1.685361e+07


### Extract land area used for agriculture data for the specified year

In [89]:
# read agricultural land csv
df_agri = pd.read_csv("DDW_Agricultural Land.csv")
# extract countries and specified year
df_agri = df_agri[["Country Name", str(get_year)]]
# remove rows with values missing
df_agri = df_agri.dropna()
# insert new column at index 0 with name "Year" and value: get_year
df_agri.insert(0, "Year", get_year)
df_agri = df_agri.rename(columns = {str(get_year): "Agri_land", "Country Name": "Country"})
# find unique countries in agricultural land
df_agri_countries = df_agri.Country.unique()
# print dataframe
df_agri

Unnamed: 0,Year,Country,Agri_land
0,2017,Aruba,20.00
1,2017,Africa Eastern and Southern,6538552.75
2,2017,Afghanistan,379100.00
3,2017,Africa Western and Central,3589797.00
4,2017,Angola,563974.30
...,...,...,...
260,2017,Samoa,620.00
262,2017,"Yemen, Rep.",234520.00
263,2017,South Africa,963410.00
264,2017,Zambia,238360.00


### Extract $CO_{2}$ (in million metric tonnes) data for the specified year

In [90]:
# read CO2 csv
df_co2 = pd.read_csv("DDW_CO2.csv")
# extract specified year values
df_co2 = df_co2[(df_co2["year"]==get_year)]
# extract values that are not missing
df_co2 = df_co2[df_co2["co2"].notna()]
# extract year, country and values
df_co2 = df_co2[["year","country","co2"]]
df_co2 = df_co2.rename(columns = {"year": "Year", "country": "Country", "co2": "CO2"})
# set index from 0
df_co2 = df_co2.reset_index(drop=True)
# find unique countries in total co2 data
df_co2_countries = df_co2.Country.unique()
# print dataframe
df_co2

Unnamed: 0,Year,Country,CO2
0,2017,Afghanistan,6.860
1,2017,Africa,1384.372
2,2017,Albania,5.302
3,2017,Algeria,154.936
4,2017,Andorra,0.465
...,...,...,...
232,2017,Wallis and Futuna,0.026
233,2017,World,35925.738
234,2017,Yemen,9.951
235,2017,Zambia,6.517


### Extract basic water drinking services data for the specified year

In [91]:
# read Basic Water Drinking Services csv
df_water = pd.read_csv("DDW_Basic Water Drinking Services.csv")
# extract values that are not missing
df_water = df_water[df_water["Value"].notna()]
# extract specified year values
df_water = df_water[(df_water["Year"]==get_year)]
# extract year, country and values
df_water = df_water[["Year", "Area", "Value"]]
df_water = df_water.rename(columns = {"Area": "Country", "Value": "Basic_water"})
# set index from 0
df_water = df_water.reset_index(drop=True)
# find unique countries in total water data
df_water_countries = df_water.Country.unique()
# print dataframe
df_water

Unnamed: 0,Year,Country,Basic_water
0,2017,Afghanistan,66.8
1,2017,Albania,94.1
2,2017,Algeria,93.8
3,2017,American Samoa,99.0
4,2017,Andorra,99.0
...,...,...,...
232,2017,Small Island Developing States,83.1
233,2017,Low income economies,56.5
234,2017,Lower-middle-income economies,86.5
235,2017,High-income economies,99.0


### Extract Eating Disorder data for the specified year

In [92]:
# read eating disorder csv
df_disorder = pd.read_csv("DDW_Eating Disorder.csv")
# extract values that are not missing
df_disorder = df_disorder[df_disorder["Prevalence - Eating disorders - Sex: Both - Age: Age-standardized (Percent)"].notna()]
df_disorder = df_disorder.rename(columns = {"Entity": "Country", "Prevalence - Eating disorders - Sex: Both - Age: Age-standardized (Percent)":"Prevalence"})
# extract specified year values
df_disorder = df_disorder[(df_disorder["Year"]==get_year)]
# extract year, country and values
df_disorder = df_disorder[["Year", "Country", "Prevalence"]]
df_disorder = df_disorder.rename(columns = {"Prevalence": "Eating_disorder"})
# set index from 0
df_disorder = df_disorder.reset_index(drop=True)
# find unique countries in eating disorder data
df_disorder_countries = df_disorder.Country.unique()
# print dataframe
df_disorder

Unnamed: 0,Year,Country,Eating_disorder
0,2017,Afghanistan,0.12
1,2017,African Region (WHO),0.11
2,2017,Albania,0.14
3,2017,Algeria,0.22
4,2017,American Samoa,0.14
...,...,...,...
223,2017,World Bank Lower Middle Income,0.13
224,2017,World Bank Upper Middle Income,0.17
225,2017,Yemen,0.14
226,2017,Zambia,0.12


### Extract number of people employed in agriculture data for the specified year

In [93]:
# read employment csv
df_employment = pd.read_csv("DDW_Employment In Agriculture.csv")
# extract values that are not missing
df_employment = df_employment[df_employment["Value"].notna()]
# extract specified year values
df_employment = df_employment[(df_employment["Year"]==get_year)]
# extract year, country and values
df_employment = df_employment[["Year", "Area", "Value"]]
# multiply by 1000 to find actual number employed (units was in 1000 people)
df_employment["Value"] = df_employment["Value"]*1000
df_employment = df_employment.rename(columns = {"Area": "Country", "Value":"Employed_num"})
# set index from 0
df_employment = df_employment.reset_index(drop=True)
# find unique countries in eating disorder data
df_employment_countries = df_employment.Country.unique()
# print dataframe
df_employment

Unnamed: 0,Year,Country,Employed_num
0,2017,Afghanistan,2740235.0
1,2017,Albania,453779.0
2,2017,Algeria,1102072.0
3,2017,Argentina,6840.0
4,2017,Armenia,317111.0
...,...,...,...
111,2017,Uruguay,143717.0
112,2017,Uzbekistan,3671300.0
113,2017,Venezuela (Bolivarian Republic of),1166489.0
114,2017,Viet Nam,21564822.0


### Create list of common countries

Find the countries that we have all the data for as we do not want to use countries with missing values.

In [94]:
# cc represent common countries
cc1 = set(df_calories_countries)
cc2 = set(df_min_cal_countries)
cc3 = set(df_GDP_countries)
cc4 = set(df_pop_coutries)
cc5 = set(df_agri_countries)
cc6 = set(df_co2_countries)
cc7 = set(df_water_countries)
cc8 = set(df_disorder_countries)
cc9 = set(df_employment_countries)

cc = list(cc1 & cc2 & cc3 & cc4 & cc5 & cc6 & cc7 & cc8 & cc9)
print("Common Countries", cc, "\n", "Length", len(cc))

Common Countries ['Bulgaria', 'Germany', 'Uganda', 'Ukraine', 'Latvia', 'Uruguay', 'Belize', 'Canada', 'Czechia', 'Ireland', 'Luxembourg', 'Netherlands', 'Seychelles', 'Spain', 'Bangladesh', 'Portugal', 'Australia', 'Mauritius', 'Algeria', 'Thailand', 'Georgia', 'Italy', 'Sweden', 'France', 'Israel', 'Rwanda', 'Togo', 'Paraguay', 'Azerbaijan', 'Belgium', 'Samoa', 'Finland', 'Philippines', 'Tunisia', 'Costa Rica', 'United Arab Emirates', 'Cambodia', 'Denmark', 'Ecuador', 'North Macedonia', 'Albania', 'Cyprus', 'Panama', 'Hungary', 'Dominican Republic', 'South Africa', 'Ghana', 'Switzerland', 'Honduras', 'Mauritania', 'Norway', 'Mali', 'Chile', 'Slovenia', 'Djibouti', 'Jordan', 'Greece', 'Estonia', 'Mexico', 'El Salvador', 'Myanmar', 'Malaysia', 'Brazil', 'Uzbekistan', 'Jamaica', 'Nepal', 'Iceland', 'Kuwait', 'Mongolia', 'Burundi', 'Malta', 'Barbados', 'Austria', 'Serbia', 'Kazakhstan', 'Afghanistan', 'Peru', 'Romania', 'Montenegro', 'Poland', 'Sri Lanka', 'Colombia', 'Japan', 'Bosnia an

## Calculating and Combining our dataframes according to our combined countries

We will start by calculating the necessary dependent values first before we fully combine.

### Calculate Y ratio 
$$
\frac{\text{Daily calories supplied (kcal/capita day)}}{\text{Minimum calories required (kcal/capita day)}}
$$


In [96]:
# extract values from countries in combined countries list
df_calories = df_calories[df_calories["Country"].isin(cc)]
df_min_cal = df_min_cal[df_min_cal["Country"].isin(cc)]

df_y_ratio = df_calories.copy()
# extract minimum calorie value
df_y_ratio["Mininum_calorie_intake"] = list(df_min_cal["Mininum_calorie_intake"])
# calculate Y ratio based on stated equation
df_y_ratio["y_ratio"] = df_y_ratio["Calories_supply"]/df_y_ratio["Mininum_calorie_intake"]
# set index from 0
df_y_ratio = df_y_ratio.reset_index(drop=True)
# print dataframe
df_y_ratio

Unnamed: 0,Year,Country,Calories_supply,Mininum_calorie_intake,y_ratio
0,2017,Afghanistan,2303,1676.0,1.374105
1,2017,Albania,3326,1911.0,1.740450
2,2017,Algeria,3383,1781.0,1.899495
3,2017,Armenia,3072,1875.0,1.638400
4,2017,Australia,3404,1911.0,1.781266
...,...,...,...,...,...
85,2017,Uganda,2030,1693.0,1.199055
86,2017,Ukraine,3062,1908.0,1.604822
87,2017,United Arab Emirates,3074,2045.0,1.503178
88,2017,Uruguay,3158,1861.0,1.696937


### Calculate Agriculture Land per capita ($\textrm{m}^2$ per capita)

In [97]:
# extract values from countries in combined countries list
df_agri = df_agri[df_agri["Country"].isin(cc)]
df_pop = df_pop[df_pop["Country"].isin(cc)]

df_agri_pop = df_agri.copy()
# extract population value
df_agri_pop["Total_population"] = list(df_pop["Population"])
# calculate agriculture land per capita, multiply by 1,000,000 to convert km^2 to m^2
df_agri_pop["Agri_land_cap"] = (df_agri_pop["Agri_land"]/df_agri_pop["Total_population"])*1000000
# set index from 0
df_agri_pop = df_agri_pop.reset_index(drop=True)
# print dataframe
df_agri_pop

Unnamed: 0,Year,Country,Agri_land,Total_population,Agri_land_cap
0,2017,Afghanistan,379100.00,36296108.0,10444.646021
1,2017,Albania,11742.81,2884169.0,4071.470847
2,2017,United Arab Emirates,3838.00,41389176.0,92.729558
3,2017,Armenia,16761.00,2944789.0,5691.749052
4,2017,Australia,3718370.00,24584620.0,151247.812657
...,...,...,...,...,...
85,2017,Ukraine,414890.00,41166588.0,10078.318854
86,2017,Uruguay,142229.00,44487708.0,3197.040405
87,2017,Uzbekistan,255332.00,9487206.0,26913.297761
88,2017,Samoa,620.00,3436645.0,180.408509


### Calculate percentage of population employed in the agriculture industry (%)

In [98]:
# extract values from countries in combined countries list
df_employment = df_employment[df_employment["Country"].isin(cc)]
df_pop = df_pop[df_pop["Country"].isin(cc)]


df_percent_employed = df_employment.copy()
# extract population value
df_percent_employed["Total_population"] = list(df_pop["Population"])
# calculate percent employed; (employed number/total population) for each country
df_percent_employed["Employed_%"] = (df_percent_employed["Employed_num"]/df_percent_employed["Total_population"])*100
# set index from 0
df_percent_employed = df_percent_employed.reset_index(drop=True)
# print dataframe
df_percent_employed

Unnamed: 0,Year,Country,Employed_num,Total_population,Employed_%
0,2017,Afghanistan,2740235.0,36296108.0,7.549666
1,2017,Albania,453779.0,2884169.0,15.733440
2,2017,Algeria,1102072.0,41389176.0,2.662706
3,2017,Armenia,317111.0,2944789.0,10.768547
4,2017,Australia,318393.0,24584620.0,1.295090
...,...,...,...,...,...
85,2017,Uganda,4095242.0,41166588.0,9.947975
86,2017,Ukraine,2489400.0,44487708.0,5.595703
87,2017,United Arab Emirates,65857.0,9487206.0,0.694166
88,2017,Uruguay,143717.0,3436645.0,4.181898


### Calculate co2 per agricultural land (kg per $m^{2}$)

In [99]:
# extract values from countries in combined countries list
df_co2 = df_co2[df_co2["Country"].isin(cc)]
df_agri = df_agri[df_agri["Country"].isin(cc)]


df_co2_land = df_co2.copy()
# extract Agricultural land in km^2
df_co2_land["Agri_land"] = list(df_agri["Agri_land"])
# calculate CO2 per agricultral land; CO2 in MMT, 1 MMT = 1,000,000,000 kg
# 1 km^2 = 1,000,000 m^2; result in kg per m^2
df_co2_land["CO2_agri"] = (df_co2_land["CO2"]/df_co2_land["Agri_land"])*1000
# set index from 0
df_co2_land = df_co2_land.reset_index(drop=True)
# print dataframe
df_co2_land

Unnamed: 0,Year,Country,CO2,Agri_land,CO2_agri
0,2017,Afghanistan,6.860,379100.00,0.018095
1,2017,Albania,5.302,11742.81,0.451510
2,2017,Algeria,154.936,3838.00,40.368942
3,2017,Armenia,5.537,16761.00,0.330350
4,2017,Australia,414.751,3718370.00,0.111541
...,...,...,...,...,...
85,2017,Uganda,5.374,414890.00,0.012953
86,2017,Ukraine,223.085,142229.00,1.568492
87,2017,United Arab Emirates,168.831,255332.00,0.661221
88,2017,Uruguay,6.163,620.00,9.940323


### Extracting the rest of the dataframes needed by countries that are within the combined countries list

In [100]:
# extract GDP value from countries in combined countries list
df_GDP = df_GDP[df_GDP["Country"].isin(cc)]
# set index from 0
df_GDP = df_GDP.reset_index(drop=True)

# extract water value from countries in combined countries list
df_water = df_water[df_water["Country"].isin(cc)]
# set index from 0
df_water = df_water.reset_index(drop=True)

# extract eating disorder value from countries in combined countries list
df_disorder = df_disorder[df_disorder["Country"].isin(cc)]
# set index from 0
df_disorder = df_disorder.reset_index(drop=True)

# print dataframe to check if need be
# df_GDP
# df_water
# df_disorder

### Combining all the variables into one dataframe

In [109]:
# add country and Y ratio column
dfyears_combined = df_y_ratio.loc[:, ["Country", "y_ratio"]]
# add GDP column
dfyears_combined["GDP"] = df_GDP.loc[:, "GDP"]
# add agriculture per land column
dfyears_combined["Agri_land_cap"] = df_agri_pop.loc[:, "Agri_land_cap"]
# add percent basic water service column
dfyears_combined["Basic_water"] = df_water.loc[:, "Basic_water"]
# add percent eating disorder column
dfyears_combined["Eating_disorder"] = df_disorder.loc[:, "Eating_disorder"]
# add percent employed in agriculture column
dfyears_combined["Employed_%"] = df_percent_employed.loc[:, "Employed_%"]
# add co2 per agricultural land column
dfyears_combined["CO2_agri"] = df_co2_land.loc[:, "CO2_agri"]
# add specified year in the first column
dfyears_combined.insert(0, "Year", get_year)
# print dataframe
dfyears_combined

# export dataframe
# dfyears_combined.to_csv(f"df{get_year}_combined.csv")

## Combinining the dataframes for years 2013 to 2017 into one dataframe

*Year 2013, 2014, 2015, 2016, 2017 was done on seperate occassions and exported to one csv file for each year.

In [108]:
# read the csv for each year
df2013_combined = pd.read_csv("df2013_combined.csv")
df2014_combined = pd.read_csv("df2014_combined.csv")
df2015_combined = pd.read_csv("df2015_combined.csv")
df2016_combined = pd.read_csv("df2016_combined.csv")
df2017_combined = pd.read_csv("df2017_combined.csv")

In [110]:
# combine all five dataframes into one
dfallyears_combined = pd.concat([df2013_combined, df2014_combined, df2015_combined, df2016_combined, df2017_combined])
# drop the extra column caused due to concat function
dfallyears_combined = dfallyears_combined.drop(columns = "Unnamed: 0")
# set index from 0
dfallyears_combined = dfallyears_combined.reset_index(drop=True)
# print dataframe
dfallyears_combined

# export dataframe
# dfallyears_combined.to_csv("dfallyears_combined.csv")

Unnamed: 0,Year,Country,y_ratio,GDP,Agri_land_cap,Basic_water,Eating_disorder,Employed_%,CO2_agri
0,2013,Albania,1.718685,11361.3,4.088797e+03,92.6,0.14,15.527752,0.415059
1,2013,Algeria,1.914206,11319.1,3.223308e+04,93.0,0.22,2.991599,0.107726
2,2013,Argentina,1.728202,24424.1,3.988526e+02,98.9,0.35,0.159302,11.297267
3,2013,Armenia,1.607238,10691.3,1.283048e+06,99.0,0.13,14.568022,0.001489
4,2013,Australia,1.790601,46744.6,1.168097e+03,99.0,1.10,1.286984,14.473973
...,...,...,...,...,...,...,...,...,...
433,2017,Uganda,1.199055,2074.7,1.007832e+04,51.0,0.10,9.947975,0.012953
434,2017,Ukraine,1.604822,11860.6,3.197040e+03,93.8,0.13,5.595703,1.568492
435,2017,United Arab Emirates,1.503178,67183.6,2.691330e+04,99.0,0.31,0.694166,0.661221
436,2017,Uruguay,1.696937,23009.9,1.804085e+02,99.0,0.38,4.181898,9.940323


### Features and Target Preparation

Describe here what are the features you use and why these features. Put any Python codes to prepare and clean up your features. 

Do the same thing for the target. Describe your target and put any codes to prepare your target.

In [None]:
# put Python code to prepare your featuers and target

### Building Model

Describe your model. Is this Linear Regression or Logistic Regression? Put any other details about the model. Put the codes to build your model.

In [1]:
# put Python code to build your model

### Evaluating the Model

Describe your metrics and how you want to evaluate your model. Put any Python code to evaluate your model. Use plots to have a visual evaluation.

In [None]:
# put Python code to evaluate the model and to visualize its accuracy

### Improving the Model

Discuss any steps you can do to improve the models. Put any python codes. You can repeat the steps above with the codes to show the improvement in the accuracy. 

### Dataset for the improved Model

Describe here your data set. Put the link to the sources of your dataset. Describe your data and what are the columns.

Put some Python codes here to describe and visualize your data.

In [None]:
# put Python code to read and describe your data

### Features and Target Preparation for the improved Model

Describe here what are the features you use and why these features. Put any Python codes to prepare and clean up your features. 

Do the same thing for the target. Describe your target and put any codes to prepare your target.

In [2]:
# put Python code to prepare your featuers and target

### Building the improved Model

Describe your model. Is this Linear Regression or Logistic Regression? Put any other details about the model. Put the codes to build your model.

In [None]:
# put Python code to build your model

### Evaluating the improved Model

Describe your metrics and how you want to evaluate your model. Put any Python code to evaluate your model. Use plots to have a visual evaluation.

In [None]:
# put Python code to evaluate the model and to visualize its accuracy

### Discussion and Analysis

Discuss your model and accuracy in solving the problem. Analyze the results of your metrics. Put any conclusion here.