# **Countries of the World Data Analysis**

In this kernel I will use Pandas and Seaborn to analyze this dataset about the countries of the world. First I will clean the data, then use data visualization to gain some insights.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
import missingno as missin

In [None]:
data = pd.read_csv('../input/countries-of-the-world/countries of the world.csv')
data.sample(5)

In [None]:
data.info()

After loading the data we can see that there are some missing values. We will also have to convert the numbers from the 0,00 format to the 0.00 format and fix the data types.
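As an aside, the comma decimals could probably also be handled at load time with the `decimal` parameter of `pd.read_csv`. The snippet below is only a minimal sketch on a made-up two-row sample (the country names and values are hypothetical, not taken from the dataset); it is not what the rest of this kernel does.

```python
import io
import pandas as pd

# Hypothetical two-row sample written in the same comma-decimal style as the dataset.
sample_csv = io.StringIO(
    'Country,Birthrate,Deathrate\n'
    '"Country A","46,6","20,34"\n'
    '"Country B","15,11","5,22"\n'
)

# decimal=',' tells the parser to treat the comma as the decimal separator,
# so the quoted values are read as floats instead of text.
sample = pd.read_csv(sample_csv, decimal=',')
print(sample.dtypes)
```

With this approach the affected columns should arrive as floats already, so a separate conversion step would not be needed.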

In [None]:
fig,ax = plt.subplots(figsize=(8,8))
missin.matrix(data,ax=ax,sparkline=False)
plt.show()

I'm using the missingno library to better understand how the missing values relate to each other.
It seems that most of the missing data come from the same few countries, especially in the Agriculture, Industry, and Service columns.

I decided to fill the missing values with the means of each column.

In [None]:
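# Only the columns that are already numeric have a mean at this point;
# the comma-formatted columns are still stored as text.
data.fillna(data.mean(numeric_only=True), inplace=True)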
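One thing worth checking with a mean fill is the order of the steps: a column that is still stored as text has no mean, so it is not filled. Here is a small sketch on a hypothetical toy frame (the column values are invented for illustration) that shows the difference between filling before and after the conversion.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one numeric column and one comma-formatted text column.
toy = pd.DataFrame({
    'GDP': [1000.0, np.nan, 3000.0],
    'Birthrate': ['10,5', np.nan, '20,5'],
})

# The mean can only be computed for columns that are already numeric,
# so only 'GDP' is filled here and 'Birthrate' keeps its missing value.
filled_early = toy.fillna(toy.mean(numeric_only=True))
print(filled_early)

# Converting first makes the mean available for 'Birthrate' as well.
converted = toy.copy()
converted['Birthrate'] = (
    converted['Birthrate'].str.replace(',', '.', regex=False).astype(float)
)
filled_late = converted.fillna(converted.mean(numeric_only=True))
print(filled_late)
```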

In [None]:
def value_converter(cols):
    # Replace the decimal comma with a dot and cast each column to float.
    for c in cols:
        data[c] = data[c].astype(str).str.replace(',', '.', regex=False).astype(float)

cols = ['Pop. Density (per sq. mi.)', 'Coastline (coast/area ratio)', 'Net migration', 'Infant mortality (per 1000 births)', 'Literacy (%)', 'Phones (per 1000)', 'Arable (%)', 'Crops (%)', 'Other (%)', 'Climate', 'Birthrate', 'Deathrate', 'Agriculture', 'Industry', 'Service']

value_converter(cols)

# These columns were still text when the means were filled in above,
# so their missing values are filled now that they are numeric.
data[cols] = data[cols].fillna(data[cols].mean())

I created a value converter function which replaces the commas with dots and changes the data types to float. Because these columns were still text when the means were filled in earlier, their missing values are filled only now, after the conversion. After running the function let's check our data again.

In [None]:
data.info()

In [None]:
data['Region'] = data.Region.str.strip()
data['Country'] = data.Country.str.strip()

I also realized that in the country and region columns there are spaces before and after some names. 
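A quick way to spot this kind of padding is to compare a column with its stripped version. The sketch below uses a hypothetical Series of region-style labels (invented for illustration) rather than the real columns.

```python
import pandas as pd

# Hypothetical labels with the kind of padding described above.
regions = pd.Series(['OCEANIA ', ' OCEANIA', 'BALTICS   ', 'BALTICS'])

# Count the values that change when the surrounding whitespace is removed.
padded = (regions != regions.str.strip()).sum()
print(padded)

# Before stripping, the padded variants count as separate categories.
unique_before = regions.nunique()
unique_after = regions.str.strip().nunique()
print(unique_before, unique_after)
```

Padded labels matter mostly for grouping and filtering, since a comparison against the clean name would silently miss them.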

After fixing the spaces, it looks like our data is clean, so it's time for the exploratory data analysis.



In [None]:
region = data['Region'].value_counts()
plt.figure(figsize=(10,7))
sns.barplot(y=region.index, x=region.values, palette='rocket')
plt.title('Number of Countries by Region')
plt.xlabel('Counts')
plt.ylabel('Regions')
plt.show()

First I wanted to know the number of countries in each region. It looks like the Sub-Saharan Africa and the Latin America & Caribbean regions contain the most nations.

The next viz shows the distribution of GDP in each region.

In [None]:
plt.figure(figsize=(10,7))
sns.boxenplot(data=data, x='Region', y='GDP ($ per capita)')
plt.title('GDP per Region')
plt.xticks(rotation=45)
plt.show()

As expected, Northern America and Western Europe have the highest medians, but there are also some outliers in the Latin America & Caribbean region.

Just out of curiosity, let's check which country has the highest GDP there:

In [None]:
data[data['Region'] == 'LATIN AMER. & CARIB'].nlargest(1, 'GDP ($ per capita)')
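The medians that are read off the boxen plot can also be computed directly with a groupby. This is a minimal sketch on hypothetical toy values (the numbers are invented, only the column and region names follow the dataset's style); on the real data the same chain would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy values; the region labels follow the dataset's style.
toy = pd.DataFrame({
    'Region': ['WESTERN EUROPE', 'WESTERN EUROPE', 'OCEANIA', 'OCEANIA', 'OCEANIA'],
    'GDP ($ per capita)': [30000, 20000, 2000, 29000, 5000],
})

# Median GDP per region, sorted from highest to lowest.
median_gdp = (
    toy.groupby('Region')['GDP ($ per capita)']
    .median()
    .sort_values(ascending=False)
)
print(median_gdp)
```

The median is a reasonable summary here because a single very rich country, like the 29000 value in the toy OCEANIA group, barely moves it.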

Next I want to see how the columns correlate with each other.

In [None]:
fig,ax = plt.subplots(figsize=(10,10))
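# Correlations are only defined for numeric columns, so the text columns (Country, Region) are left out.
sns.heatmap(data.select_dtypes('number').corr(),linewidths=1,annot=True,linecolor="Black",fmt=".1f",ax=ax)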
plt.title("Correlation Map",fontsize=15)
plt.show()
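Instead of scanning the heatmap by eye, the correlations with a single column can be pulled out and sorted. The following is a sketch on a hypothetical four-row frame (all values are invented), using the same column names as the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset.
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'GDP ($ per capita)': [1000, 5000, 20000, 30000],
    'Phones (per 1000)': [10, 120, 400, 650],
    'Birthrate': [40, 25, 14, 10],
})

# Keep the numeric columns, take the GDP column of the correlation matrix,
# drop GDP's correlation with itself, and sort from most negative to most positive.
gdp_corr = (
    toy.select_dtypes('number')
    .corr()['GDP ($ per capita)']
    .drop('GDP ($ per capita)')
    .sort_values()
)
print(gdp_corr)
```

The two ends of the sorted result are the candidates for the strongest negative and positive relationships.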

Next, I want to look more closely at the relation between GDP and some of the fields that correlate most strongly with it, both positively and negatively.

In [None]:
x = data.loc[:,["Region","GDP ($ per capita)","Infant mortality (per 1000 births)","Birthrate","Phones (per 1000)","Literacy (%)","Service"]]
sns.pairplot(x, hue="Region",palette='Paired', diag_kind='hist')
plt.show()

The strongest correlation is with Phones (per 1000), so I decided to create another plot with these two columns only. I decided to use hexbin plots, which are among the most satisfying charts in the whole seaborn library in my opinion :)

In [None]:
sns.jointplot(x='GDP ($ per capita)', y='Phones (per 1000)', kind='hex', data=data)
plt.show()

It also caught my eye that there is a strong relationship between Birthrate and Infant mortality. Let's take a look at those two as well:

In [None]:
sns.jointplot(x='Infant mortality (per 1000 births)', y='Birthrate', kind='hex', data=data)
plt.show()

I also wanted to analyze how the different sectors are distributed in each region, and which are the top countries in each of the 3 sectors (Agriculture, Service, Industry).

In [None]:
fig = plt.figure(figsize=(20,15))
ax1 = fig.add_subplot(311)
ax2 = fig.add_subplot(312)
ax3 = fig.add_subplot(313)

sns.barplot(data=data, x='Agriculture', y='Region', ax=ax1)
sns.barplot(data=data, x='Service', y='Region', ax=ax2)
sns.barplot(data=data, x='Industry', y='Region', ax=ax3)

ax1.set_xlabel('Agriculture', fontsize=20)
ax2.set_xlabel('Service', fontsize=20)
ax3.set_xlabel('Industry', fontsize=20)

plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(data = data.nlargest(15, 'Service'), y = 'Country', x = 'Service', palette='mako')
plt.title("TOP15 Countries with the highest Service %", size=16)
plt.xlabel(xlabel='Service', fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(data = data.nlargest(15, 'Industry'), y = 'Country', x = 'Industry', palette='rocket')
plt.title("TOP15 Countries with the highest Industry %", size=16)
plt.xlabel(xlabel='Industry', fontsize=14)
plt.show()

In [None]:
plt.figure(figsize=(12,8))
sns.barplot(data = data.nlargest(15, 'Agriculture'), y = 'Country', x = 'Agriculture', palette='magma')
plt.title("TOP15 Countries with the highest Agriculture %", size=16)
plt.xlabel(xlabel='Agriculture', fontsize=14)
plt.show()

At the end of the analysis it occurred to me that I had read about how infant mortality is a great indicator of a country's development in the book Factfulness by Hans Rosling (which was one of the best books I read in 2020, I strongly recommend it :) ).

In [None]:
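# lmplot is a figure-level function that creates its own figure, so the size is set with height and aspect.
sns.lmplot(x='GDP ($ per capita)', y='Infant mortality (per 1000 births)', data=data, fit_reg=False, hue='Region', height=10, aspect=1.5)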
plt.show()

Indeed, Hans Rosling was right: it's easy to see that the vast majority of the countries with high GDP keep the infant mortality rate very low.
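To put a number on a relationship like this, one option is a rank correlation, which only asks whether one column consistently falls as the other rises. This is a sketch on hypothetical toy values (invented for illustration, not results from the dataset); it computes the Spearman correlation as the Pearson correlation of the ranks.

```python
import pandas as pd

# Hypothetical toy values where infant mortality falls as GDP rises.
toy = pd.DataFrame({
    'GDP ($ per capita)': [800, 2500, 9000, 27000, 35000],
    'Infant mortality (per 1000 births)': [95.0, 60.0, 20.0, 5.0, 4.0],
})

# The Pearson correlation of the ranks is the Spearman rank correlation.
rho = toy['GDP ($ per capita)'].rank().corr(
    toy['Infant mortality (per 1000 births)'].rank()
)
print(rho)
```

On this strictly decreasing toy data the value is -1 up to floating-point rounding. A rank correlation suits this pair better than a plain linear one because the scatter plot is strongly curved.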

Thank you for your attention!
This is my first ever kernel, so it's more than possible that I made some mistakes. Any feedback or recommendation is highly appreciated.