# Countries of the World

This notebook explores the world data and looks for relationships between the given attributes.

### Contents of the notebook:

- **Looking at Data**
- **Visualizing Data**
- **Finding relationship between different Attributes** 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import geopandas
from scipy import stats

import seaborn as sns
sns.set(style="darkgrid")

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
# Load the data
world_data = pd.read_csv("../input/countries of the world.csv", na_values='NaN')

# Looking at the Data

In [None]:
# Get the information about the different attributes in the data
world_data.info()

From the information above, we see that there are a lot of attributes in the data. They are:

- **Country:** This entry includes the country's name approved by the US Board on Geographic Names
- **Region:** Region in which the country lies
- **Population:** Total number of people in the country
- **Area (sq. mi.):** Area of country in square miles
- **Pop. Density (per sq. mi.):** Density of population in number of people per square mile
- **Coastline (coast/area ratio):** Ratio of the total length of the boundary between the land area (including islands) and the sea to the area of the country.
- **Net migration:** This entry includes the figure for the difference between the number of persons entering and leaving a country during the year per 1,000 persons (based on midyear population). An excess of persons entering the country is referred to as net immigration (e.g., 3.56 migrants/1,000 population); an excess of persons leaving the country as net emigration (e.g., -9.26 migrants/1,000 population). The net migration rate indicates the contribution of migration to the overall level of population change. The net migration rate does not distinguish between economic migrants, refugees, and other types of migrants nor does it distinguish between lawful migrants and undocumented migrants.
- **Infant mortality (per 1000 births):** Number of infant deaths per 1000 births
- **GDP ($ per capita):** the total value of goods produced and services provided in a country per person during one year
- **Literacy (%):** Total number of literate persons in a given age group, expressed as a percentage of the total population in that age group. 
- **Phones (per 1000):** Number of phones per 1000 people
- **Arable (%):** Percentage of land used or suitable for growing crops
- **Crops (%):** Percentage of land that is under crop irrigation
- **Other (%):** Percentage of land that is not arable or under crop
- **Climate:** Brief description of typical weather regimes throughout the year
- **Birthrate:** the number of live births per thousand of population per year
- **Deathrate:** the number of deaths per thousand of population per year.
- **Agriculture:** Share of the GDP contributed by the agriculture sector
- **Industry:** Share of the GDP contributed by the industry sector
- **Service:** Share of the GDP contributed by the service sector
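
The net migration definition above can be sketched as a small calculation. This is only an illustration: the function name `net_migration_rate` and the input figures are made up, chosen so that the result matches the 3.56 example quoted in the attribute description.

```python
def net_migration_rate(immigrants, emigrants, midyear_population):
    """Net migrants per 1,000 persons of midyear population (hypothetical helper)."""
    return (immigrants - emigrants) / midyear_population * 1000


# Made-up country: 1,000,000 people, 10,000 arrivals and 6,440 departures in a year
rate = net_migration_rate(10_000, 6_440, 1_000_000)
print(round(rate, 2))

# More people leaving than arriving gives a negative rate (net emigration)
emigration_rate = net_migration_rate(1_000, 10_260, 1_000_000)
print(round(emigration_rate, 2))
```

A positive value corresponds to net immigration and a negative value to net emigration.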

In [None]:
# Get 5 random row entries
world_data.sample(5)

In [None]:
# Get names of regions
world_data['Region'].unique()

# I don't know what the 'NEAR EAST' region is. It is very vague and I can't pinpoint to what it represents.

In [None]:
# Clean Region names (remove whitespace) and set them as index
world_data['Region'] = world_data['Region'].str.strip()
world_data['Country'] = world_data['Country'].str.strip()
world_data['Region'].unique()
world_data.set_index('Region', inplace=True)
world_data.sample(5)

In [None]:
# Replace ',' with '.' in columns with numerical values
for column in world_data.columns:
    if (column != 'Country') and (world_data[column].dtype == 'object'):
        # regex=False treats ',' as a literal character on every pandas version
        world_data[column] = world_data[column].str.replace(',', '.', regex=False)
        # np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
        world_data[column] = world_data[column].replace('NaN', np.nan)
        world_data[column] = pd.to_numeric(world_data[column])
world_data.info()
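
As an alternative to the loop above, `pd.read_csv` can parse comma decimals at load time through its `decimal` parameter. The sketch below uses a made-up two-row sample rather than the real file, so the country names and values are placeholders.

```python
import io

import pandas as pd

# Made-up sample in the same style as the dataset: quoted numbers with comma decimals
sample_csv = io.StringIO(
    'Country,Region,Birthrate,Deathrate\n'
    'Country A ,EUROPE ,"12,5","9,1"\n'
    'Country B ,EUROPE ,"10,25",\n'
)

# decimal=',' tells the parser that ',' is the decimal separator inside numeric fields
sample = pd.read_csv(sample_csv, decimal=',')
print(sample.dtypes)
```

With this option the numeric columns arrive as floats and empty fields become `NaN`, so no per-column string replacement is needed.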

In [None]:
# Get summary statistics for rows with no missing values
complete_data = world_data.dropna()
complete_data.describe()

## Visualizing Data

Now we will try to chart some of the attributes and see how these change over different parts of the world.

In [None]:
# Get the number of countries in different regions of the world
countries_per_region = pd.DataFrame(world_data.groupby(level=[0])['Country'].count())
countries_per_region = countries_per_region.reset_index()

# Plot Number of countries in different regions of the world
plt.figure(figsize = (12,6))
ax = sns.barplot(x='Region', y='Country', data=countries_per_region)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_ylabel('Number of Countries');      # ; to suppress the output for this line

So from the above graph, we can see that the largest number of countries lies in `Sub-Saharan Africa`, followed by `Latin America & Caribbean`. The `Baltics` region has the fewest countries.

In [None]:
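# Get locations for different countries from geopandas library
# (geopandas.datasets was deprecated in geopandas 0.14 and removed in 1.0; on newer
# versions, download the Natural Earth data and pass its path to read_file instead)
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))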
world.info()

In [None]:
# Plot world map using geopandas
world = world.rename(columns={'name':'Country', 'geometry':'borders'}).set_geometry('borders');
ax = world.plot(figsize=(14,8));

In [None]:
# Merge original dataset with geopandas dataset
merged_data = world.merge(world_data, how='inner',on='Country').set_geometry('borders');
merged_data.head(5)

In [None]:
# Get the shape of the merged data frame
merged_data.shape

**NOTE:** geopandas has data for about `177` countries, while our original dataset has about `227`. After taking the intersection, the combined dataset shrinks even further, to `151` countries. That is not enough countries for analysis, but it is fine for plotting: we do see some empty spaces on the world maps, but we still get an overall idea.
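
One way to see which countries drop out of such an intersection is an outer merge with `indicator=True`, which labels every row by the table it came from. The sketch below uses tiny stand-in tables; the country names and values are illustrative only and are not taken from either dataset.

```python
import pandas as pd

# Toy stand-ins for the attribute table and the map table (illustrative names only)
dataset = pd.DataFrame({'Country': ['France', 'Korea, South', 'Aruba'],
                        'Population': [1, 2, 3]})
map_table = pd.DataFrame({'Country': ['France', 'South Korea', 'Antarctica']})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
check = dataset.merge(map_table, on='Country', how='outer', indicator=True)

only_in_dataset = check.loc[check['_merge'] == 'left_only', 'Country'].tolist()
only_in_map = check.loc[check['_merge'] == 'right_only', 'Country'].tolist()
print("Only in dataset:", only_in_dataset)
print("Only in map table:", only_in_map)
```

Rows that appear on only one side are either genuinely absent from the other table or spelled differently there, and the latter can be recovered by renaming before the merge.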

In [None]:
num_total_countries = 227
num_geopandas_countries = 177
num_intersection_countries = 151

In [None]:
# Calculate percentage for the available data
perc_data_geopandas = (num_geopandas_countries / num_total_countries) * 100
perc_data_avail = (num_intersection_countries / num_total_countries) * 100

print("% of data in geopandas: ", perc_data_geopandas)
print("% of data available after intersection of geopandas with given dataset: ", perc_data_avail)

So, in reality we are losing about `33%` of our original dataset (`151` of `227` countries remain) when we plot on the world map using geopandas. To my knowledge, most of the lost data belongs to much smaller countries. Data is still being plotted for the majority of large countries.

### Population
Population in this dataset refers to the total number of people in a country.

In [None]:
# Plot world population
merged_data.plot(column='Population', cmap='OrRd', figsize=(14,8));

In [None]:
# Plot population for different regions
reset_world_data = world_data.reset_index()

plt.figure(figsize = (12,6))
ax = sns.barplot(x='Region', y='Population', data=reset_world_data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);  

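Note that `sns.barplot` plots the mean of `Population` within each region by default, so the above graph shows the average population per country in each region rather than the regional total. On that measure, the `Asian` region is the highest, which is driven by `India` and `China` (shown in the next graph). The region with the next highest average is `Northern America`.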
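
Since seaborn's `barplot` aggregates with the mean unless told otherwise, a per-region mean and a per-region total can rank regions differently. The sketch below shows the difference on made-up numbers; the region labels `A` and `B` are placeholders.

```python
import pandas as pd

# Made-up data: region A has many small countries, region B has one large country
toy = pd.DataFrame({
    'Region': ['A', 'A', 'A', 'A', 'B'],
    'Population': [10, 10, 10, 10, 30],
})

# Compare the average per country with the regional total
summary = toy.groupby('Region')['Population'].agg(['mean', 'sum'])
print(summary)
```

Here region `B` has the higher mean while region `A` has the higher total. To plot totals directly, `sns.barplot` accepts an `estimator` argument, for example `estimator=sum`.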

In [None]:
# Show 10 countries with highest population
reset_world_data = reset_world_data.sort_values(by='Population', ascending=False)

plt.figure(figsize = (12,6))
ax = sns.barplot(x='Country', y='Population', data=reset_world_data[:10])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);  

The above graph shows that `China` and `India` each have more than 3 times the population of the `United States`, which ranks 3rd. The data on the graph differs a little from the latest data given at: https://www.prb.org/wp-content/uploads/2018/08/2018_WPDS.pdf. In the latest data, `Nigeria` has a larger population than `Bangladesh` and `Russia`. Also, `Mexico` has entered the top 10 most populated countries.

### Birthrate
In the given dataset, birthrate is the number of live births per thousand of population per year.

In [None]:
# Get a dataframe with entries that have non null birthrate values
birthrate_df = merged_data.loc[merged_data['Birthrate'].notnull(), :]
birthrate_df.sample(5)

In [None]:
# Plot birthrates on world map
birthrate_df.plot(column='Birthrate', cmap='OrRd', figsize=(16,16));

Wait, `India` and `China`, despite being the world's most populated countries, don't have the highest birthrate? At least that's what the data is saying. It shows that the birthrate is highest in most countries in `Sub-Saharan Africa` (really dark colors) and somewhat high in most `South American` and `Asian` countries.

In [None]:
# Plot the spread of birthrates in different regions of the world 
reset_world_data = world_data.reset_index()

plt.figure(figsize = (12,6))
ax = sns.boxplot(x='Region', y='Birthrate', data=reset_world_data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);  

From the above graph, we can clearly see that the range of birthrate is highest in `Sub-Saharan Africa`. `Asia`, despite being the most populated, doesn't have the highest birthrate (however, the highest birthrate in Asia is still competitive). This observation may come as a surprise to some people.
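
The spread that a boxplot shows can also be read off numerically with a grouped aggregation. This is a minimal sketch on made-up values; the region labels `X` and `Y` are placeholders.

```python
import pandas as pd

# Made-up birthrates: region X is widely spread, region Y is tightly clustered
toy = pd.DataFrame({
    'Region': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
    'Birthrate': [10.0, 30.0, 50.0, 18.0, 20.0, 22.0],
})

# Minimum, median and maximum per region, plus the range between the extremes
spread = toy.groupby('Region')['Birthrate'].agg(['min', 'median', 'max'])
spread['range'] = spread['max'] - spread['min']
print(spread)
```

Sorting such a table by `range` or `median` makes it easy to state which region is widest or highest without reading values off the chart.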

In [None]:
# Plot the highest 10 birthrates by countries
reset_world_data = reset_world_data.sort_values(by='Birthrate', ascending=False)

plt.figure(figsize = (12,6))
ax = sns.barplot(x='Country', y='Birthrate', data=reset_world_data[:10])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90); 

In [None]:
reset_world_data[(reset_world_data['Country'] == 'China') | (reset_world_data['Country'] == 'India')][['Country','Birthrate']]

So, `Niger` has the highest birthrate, at about 50, followed by `Mali` and `Uganda`. `India` and `China`, which have the largest populations, have less than half the birthrate of many countries in `Africa`. What could be a reason for this? Population control policies implemented by the governments of India and China may be one reason for the lower birthrates, and that was indeed the case for China.

## Finding relationship between different Attributes

In [None]:
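# Calculate correlations between the numeric attributes
# (the non-numeric 'Country' column is dropped first; recent pandas versions raise an error otherwise)
corr_matrix = world_data.drop(columns='Country').corr()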
corr_matrix

In [None]:
plt.figure(figsize = (13,13))

# Reverse the color map to see positive relations as darker colors
cmap = sns.cm.rocket_r

# Plot the correlation heatmap
sns.heatmap(data=corr_matrix, cmap=cmap, annot=True, cbar=True, square=True, fmt='.2f');

So we can see certain very dominant correlations here (either very dark colors or very light colors). Some of them are listed here:

- **Population:**
    - Positive Correlation with Area
- **GDP:**
    - Positive Correlation with Literacy, Phones, Service
    - Negative Correlation with Birthrate, Agriculture
- **Literacy:**
    - Positive Correlation with Phones, Service
    - Negative Correlation with Birthrate, Agriculture
- **Phones:**
    - Positive Correlation with Service
    - Negative Correlation with Agriculture
- **Service:**
    - Negative Correlation with Agriculture, Birthrate
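
The dominant pairs listed above were read off the heatmap by eye. They could also be extracted programmatically by flattening the upper triangle of the correlation matrix and sorting by absolute value. The sketch below runs on synthetic columns `a` to `d`, which are placeholders and not attributes of the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data with known relationships
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'a': x,
    'b': 2 * x + rng.normal(scale=0.1, size=200),  # strongly positive with a
    'c': -x + rng.normal(scale=0.5, size=200),     # negative with a
    'd': rng.normal(size=200),                     # unrelated noise
})
corr = toy.corr()

# Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Order the pairs by absolute strength, strongest first
strongest = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(strongest.head(3))
```

Sorting by absolute value keeps strong negative correlations near the top alongside strong positive ones.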

Let's remove the unnecessary attributes to get a much clearer heatmap.

In [None]:
# Calculate correlation matrix for selected attributes only
columns_to_process = ['Population', 'Infant mortality (per 1000 births)','Area (sq. mi.)','GDP ($ per capita)', 
                      'Literacy (%)', 'Phones (per 1000)', 'Birthrate', 'Agriculture', 'Service']

new_corr_matrix = world_data[columns_to_process].corr()
new_corr_matrix

In [None]:
# Plot the heatmap for correlation matrix

plt.figure(figsize = (12,12));
# Reverse the color map to see positive relations as darker colors
cmap = sns.cm.rocket_r
# Plot the correlation heatmap
ax = sns.heatmap(data=new_corr_matrix, cmap=cmap, annot=True, cbar=True, fmt='.2f', square=True,);

Now, it is much easier to look at the relation between different attributes. In this notebook, I will be only working to show the correlation of Population and GDP with other attributes. In the next commit, I will be adding the analysis of other attributes as well.

## Population

Population is the total number of people living in a country. From the correlation heatmap, we see that there is a positive relation between the population and the area of a country. Intuitively, it makes sense that a bigger country can have a larger population. For example, it is very likely that the USA will have a larger population than Colombia. There are exceptions, of course: India has a larger population than the USA, which is bigger in area. However, we are trying to find a general trend in the data.

### Correlation with Area

In [None]:
# Plot population vs area
sns.relplot(x='Area (sq. mi.)', y='Population', hue='Region', height=5, aspect=2, data=reset_world_data);

In [None]:
# Sort the world data by area
sorted_area_world_data = reset_world_data.sort_values(by='Area (sq. mi.)', ascending=False)
sorted_area_world_data.sample(5)

In [None]:
# Look at the countries in the small patch at the bottom left of the previous graph
# by dropping the 7 largest countries by area
countries_in_small_patch = sorted_area_world_data[7:]

# relplot is a figure-level function and creates its own figure, so no plt.figure() call is needed
sns.relplot(x='Area (sq. mi.)', y='Population', hue='Region', height=5, aspect=2, data=countries_in_small_patch);

In [None]:
# Try to fit a linear regression model

slope, intercept, r_value, p_value, std_err = stats.linregress(reset_world_data["Area (sq. mi.)"], reset_world_data["Population"])
print("Linear Regression - world data, r-2 value: ", r_value**2)

slope, intercept, r_value, p_value, std_err = stats.linregress(countries_in_small_patch["Area (sq. mi.)"], countries_in_small_patch["Population"])
print("Linear Regression - small patch, r-2 value: ", r_value**2)

So, from linear regression, area explains only about 22% of the variance in population (the r-squared value), which is not great, and it decreases when we remove the extremes. But if we keep that aside for a moment, what we can see is that our initial idea was right that the `population generally increases with area`.
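
To make the r-squared figure concrete: for a straight-line least-squares fit, it is the fraction of the variance in `y` that the line accounts for, and it equals the squared correlation coefficient. The sketch below checks this on synthetic data using NumPy only; none of the numbers come from the dataset.

```python
import numpy as np

# Synthetic data: a linear trend plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3 * x + rng.normal(scale=5, size=100)

# Least-squares straight line
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# r-squared = 1 - (unexplained variation / total variation)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For simple linear regression this matches the squared Pearson correlation
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```

An r-squared of 0.22 therefore means the fitted line leaves about 78% of the variation unexplained.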

In [None]:
# Plot population,area and a linear regression model fit for the entire world
plt.figure(figsize = (12,6))
ax = sns.regplot(x='Area (sq. mi.)', y='Population', data=reset_world_data, ci=50)

In [None]:
# Plot population,area and a linear regression model fit for the countries in the small patch
plt.figure(figsize = (12,6))
ax = sns.regplot(x='Area (sq. mi.)', y='Population', data=countries_in_small_patch)

## GDP

In the given data set, GDP is the total value of goods produced and services provided in a country per person during one year. One may expect that the GDP value will be higher in the developed countries and lower in developing countries. Let's try to look for the trend in GDP and its impact.

In [None]:
# Plot GDP for countries on world map
merged_data.plot(column='GDP ($ per capita)', figsize=(16,8), legend=True);

The world map shows the distribution of GDP per capita across the world. As you can see, the `USA` has one of the highest values in the world.

In [None]:
# Plot range of GDP per region
plt.figure(figsize=(12,6))

ax = sns.boxplot(x='Region', y='GDP ($ per capita)', data=reset_world_data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);  

In [None]:
# Sort the world data according to GDP
sorted_gdp_world_data = reset_world_data.sort_values(by='GDP ($ per capita)', ascending=False)
sorted_gdp_world_data

### Correlation with Literacy

In [None]:
# Plot a country's literacy rate against GDP
# (relplot sizes the figure with `height`; `size` is a semantic mapping, not a figure dimension)
sns.relplot(x='GDP ($ per capita)', y='Literacy (%)', data=reset_world_data, kind='line', height=5, aspect=2, ci=0, legend=False);

In [None]:
# Plot a linear regression fit for a country's literacy rate against GDP
plt.figure(figsize=(12,6));
sns.regplot(x='GDP ($ per capita)', y='Literacy (%)', data=reset_world_data, order=1, robust=True, ci=None);

Hence, as the `GDP ($ per capita)` increases, so does the literacy rate. One reason I can come up with is that countries with a higher GDP per capita can spend more on education. Countries with a lower GDP will find it hard to pay the costs associated with education, such as teachers' salaries and school supplies.

### Correlation with Phones

In [None]:
# Get GDP and Phone data without nan values and plot it
phone_data_without_nans = reset_world_data[['GDP ($ per capita)','Phones (per 1000)']].dropna()
sns.relplot(x='GDP ($ per capita)', y='Phones (per 1000)', data=phone_data_without_nans, kind='line', height=4, aspect=2, ci=0, legend=False);

In [None]:
# Plot a linear fit for GDP and Phones data
plt.figure(figsize=(12,6));
sns.regplot(x='GDP ($ per capita)', y='Phones (per 1000)', data=phone_data_without_nans, order=1, ci=None, robust=True);

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(phone_data_without_nans['GDP ($ per capita)'], phone_data_without_nans['Phones (per 1000)'])
print("Linear Regression - world data, slope: ", slope)
print("Linear Regression - world data, r-2 value: ", r_value**2)

We can see that there is a positive relation, with a slope of 0.0189, and that the linear regression line fits the data reasonably well, with an r-squared value of about 0.7.  <br>
This means that as the GDP increases, the number of phones in a country increases. Intuitively, people will buy a mobile phone only if they can afford it. But there is a strange observation in the line graph: the increasing trend continues until a GDP of about $35,000 and then falls. Luxembourg has the highest GDP but fewer phones per 1000 people. What could be the reason for it?  <br>
According to Deloitte (https://www2.deloitte.com/lu/en/pages/technology-media-and-telecommunications/articles/global-mobile-consumer-survey-2017-press-release.html), most people in Luxembourg don't use their cellphones for calling but for messaging and writing emails. I would assume that most of the phone usage is for professional purposes, and hence that the majority of the non-adult population doesn't use phones. I would like to point out that this is just speculation based on the data. It needs to be validated against the actual situation and further research.

### Correlation with Birthrate

In [None]:
# Get GDP, Birthrate, Infant Mortality data without nans and plot it
birthrate_data_without_nans = reset_world_data[['GDP ($ per capita)','Birthrate', 'Infant mortality (per 1000 births)']].dropna()

plt.figure(figsize=(12,6));
sns.lineplot(x='GDP ($ per capita)', y='Birthrate', data=birthrate_data_without_nans, ci=0, legend=False);

In [None]:
# Plot a linear regression fit for GDP and Birthrate
plt.figure(figsize=(12,6));
sns.regplot(x='GDP ($ per capita)', y='Birthrate', data=birthrate_data_without_nans, order=1, ci=None, robust=True);

Both graphs show a downward trend in Birthrate as GDP increases. This is a well-known trend. According to the Federal Reserve Bank of St. Louis (https://www.stlouisfed.org/on-the-economy/2016/december/link-fertility-income), there are several possible reasons for the higher birthrate in countries with low GDP:

- Time is relatively cheap in poor countries, so spending time away from work to take care of a child is not as costly as in a rich country. If this effect is strong enough, it can (and probably does) offset the fact that it is difficult to afford a child on a low income.
- A child may require more education to be successful in a rich country. Thus, a child may be more costly there, so families may opt to have fewer, more educated children.
- Infant mortality can play a role. More births might be needed to achieve a desired number of surviving children when infant mortality is high, as it tends to be in poor countries (shown below).

In [None]:
# Plot Birthrate and infant Mortality together against GDP
plt.figure(figsize=(14,6))

# add birthrate on primary y-axis
ax = sns.lineplot(x='GDP ($ per capita)', y='Birthrate', data=birthrate_data_without_nans, ci=0, label='Birthrate')
ax.legend(loc='center right', bbox_to_anchor=(0.65, 0.955))

# add Infant mortality on secondary y-axis
ax2 = ax.twinx()
sns.lineplot(ax=ax2, x='GDP ($ per capita)', y='Infant mortality (per 1000 births)', data=birthrate_data_without_nans, color='green',label='Infant mortality (per 1000 births)', ci=0)
ax2.lines[0].set_linestyle("--")

# Change the plot settings for easy viewing
ax.yaxis.grid(which="major", linewidth=1)
ax2.yaxis.grid(which="major",linewidth=0.5, linestyle='--')

# To get the same scale, uncomment the below line and run again
# ax.set_yticks(np.linspace(0, ax2.get_yticks()[-1], len(ax2.get_yticks())));

### Correlation with Service, Industry and Agriculture

In [None]:
# Get data for GDP, Service, Industry, Agriculture without any nan entries
eco_data_without_nans = reset_world_data[['GDP ($ per capita)','Service', 'Industry', 'Agriculture']].dropna()
eco_data_without_nans.sample(5)

In [None]:
plt.figure(figsize=(13,6))

# Plot the Service values
ax = sns.lineplot(x='GDP ($ per capita)', y='Service', data=eco_data_without_nans, ci=0, label="Service");

In [None]:
plt.figure(figsize=(13,6))

# Plot the Agriculture values
sns.lineplot(x='GDP ($ per capita)', y='Agriculture', data=eco_data_without_nans, ci=0, label='Agriculture');

In [None]:
plt.figure(figsize=(13,6))

# Plot the Industry values
sns.lineplot(x='GDP ($ per capita)', y='Industry', data=eco_data_without_nans, ci=0, label='Industry');

In [None]:
plt.figure(figsize=(13,6))

# Plot the Service values
ax = sns.lineplot(x='GDP ($ per capita)', y='Service', data=eco_data_without_nans, ci=0, label="Service")

ax2 = ax.twinx()
# plot the Industry values
sns.lineplot(ax=ax2, x='GDP ($ per capita)', y='Industry', color="r", data=eco_data_without_nans, ci=0, label="Industry")
# plot the Agriculture values
sns.lineplot(ax=ax2, x='GDP ($ per capita)', y='Agriculture', color="g", data=eco_data_without_nans, ci=0, label="Agriculture")

# Set legend
ax2.legend(loc='center right')
ax.legend(loc='center right', bbox_to_anchor=(1, 0.6))

# Set plot settings for easy viewing
ax.yaxis.grid(which="major", linewidth=1)
ax2.yaxis.grid(which="major",linewidth=0.5, linestyle='--')
ax2.set_yticks(np.linspace(0, ax.get_yticks()[-1], len(ax.get_yticks())));

In [None]:
# Plot linear regression fit for GDP and Service values
plt.figure(figsize=(12,6));
sns.regplot(x='GDP ($ per capita)', y='Service', data=eco_data_without_nans, order=1, ci=None, robust=True);

In [None]:
# Plot linear regression fit for GDP and Industry values
plt.figure(figsize=(12,6))
sns.regplot(x='GDP ($ per capita)', y='Industry', data=eco_data_without_nans, ci=None, truncate=True);

In [None]:
slope, intercept, r_value, p_value, std_err = stats.linregress(eco_data_without_nans['GDP ($ per capita)'], eco_data_without_nans['Industry'])
print("Linear Regression - world data, slope: ", slope)
print("Linear Regression - world data, r-2 value: ", r_value**2)

In [None]:
# Plot linear regression fit for GDP and Agriculture values
plt.figure(figsize=(12,6))
sns.regplot(x='GDP ($ per capita)', y='Agriculture', data=eco_data_without_nans, ci=None, truncate=True);

From the above graphs, we can conclude that GDP tends to be higher when the majority of a country's economy depends on the Service sector rather than on agriculture. A good example of this from the given data is shown below:

- `Liberia` and `Somalia` have a high percentage of their economy dependent on agriculture. The GDP for both countries is $1000 or less, which is quite low.

- On the other hand, `Jersey` and `Cayman Islands` have a high percentage of their economy dependent on the service sector. The resulting GDP in these countries is about $25,000 or higher. These values may not be the highest, but they are comparatively very high.

Also, it seems the industry sector does not play a significant role in GDP. Rather, it has a slight downward slope, indicating an inverse relation, but we can also see that the linear regression doesn't fit the data well.

In [None]:
col_to_print = ['Region','Country', 'GDP ($ per capita)', 'Agriculture', 'Industry', 'Service']

In [None]:
# get the countries with highest Agriculture sector contributions to the GDP
print("Countries with highest Agriculture:")
reset_world_data.sort_values('Agriculture', ascending=False)[col_to_print].head(5)

In [None]:
# get the countries with highest Service sector contributions to the GDP
print("Countries with highest Service:")
reset_world_data.sort_values('Service', ascending=False)[col_to_print].head(5)

In [None]:
# get the countries with highest Industry sector contributions to the GDP
print("Countries with highest Industry:")
reset_world_data.sort_values('Industry', ascending=False)[col_to_print].head(5)

From above tables, we can conclude that:

- GDP is positively related to the contribution from the Service sector and inversely related to the contribution from the Agriculture sector.
- From the table above with the countries having the highest Industry contribution, it is hard to find an exact relation to GDP, since higher industry contributions come with erratic GDP values. In some cases, both the Industry contribution and the GDP are high (e.g. `Qatar`), but there are cases where the contribution is high and the GDP is low (e.g. `Equatorial Guinea`).
- A much more sensible measure would be to combine the Industry and Service sectors, check them against Agriculture, and then try to find their impact on GDP. I believe that the higher the combined Service and Industry value, the higher the GDP will be, and that a lower combined value will go with a lower GDP. This could result in much more accurate measurements.
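
The combined Industry and Service measure suggested in the last point can be sketched as follows. The numbers are made up for illustration, the column name `Non-agriculture` is hypothetical, and the three sector shares are assumed to sum to one for every country.

```python
import pandas as pd

# Made-up sector shares; each row sums to 1
toy = pd.DataFrame({
    'GDP ($ per capita)': [800, 2500, 9000, 21000, 30000],
    'Agriculture': [0.60, 0.35, 0.15, 0.05, 0.02],
    'Industry':    [0.10, 0.25, 0.40, 0.30, 0.28],
    'Service':     [0.30, 0.40, 0.45, 0.65, 0.70],
})

# Hypothetical combined measure: everything that is not agriculture
toy['Non-agriculture'] = toy['Industry'] + toy['Service']

# Correlation of each sector measure with GDP
sectors = ['Agriculture', 'Industry', 'Service', 'Non-agriculture']
corr_with_gdp = toy[sectors].corrwith(toy['GDP ($ per capita)'])
print(corr_with_gdp)
```

Under the sum-to-one assumption the combined share is just one minus the Agriculture share, so its correlation with GDP is the exact mirror image of Agriculture's. The combination adds new information only where the shares in the real data do not sum to one.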

Given that the highest Service sector values do not yield the best GDP, what is a good combination of Service, Agriculture and Industry contributions? What values could result in a better GDP? What proportion of a country's economy should be in each sector for maximum development? I shall leave that for future work.