# EDA for World Data

In this exploratory data analysis (EDA), we look for insights about countries from around the globe. This EDA answers the following questions:

* Correlation:
    1.      <a href="#corr1">Is there any correlation between suicide and urbanization rate?</a>
    2.      <a href="#corr2">Do countries with a higher median age have a higher suicide rate?</a>
    3.      <a href="#corr3">Does a higher fertility rate mean higher population growth?</a>
    4.      <a href="#corr4">Is there any correlation between fertility rate and meat consumption?</a>
    5.      <a href="#corr5">Do rich countries have higher urbanization rate?</a>
    6.      <a href="#corr6">Is there any correlation between life expectancy and meat consumption level?</a>
    
    
* <a href="#top_suicide_world">Which country has the highest suicide rate?</a>
* <a href="#top_suicide_asean">Which Southeast Asian country has the highest (and lowest) suicide rate?</a>
* <a href="#top_meat_region">Which region has the highest meat consumption?</a>
* <a href="#top_suicide_region">Which region has the highest suicide rate?</a>

Before we get into the analysis side of things, let's import the modules needed for this EDA.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns # for visualization purposes
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt # for visualization purposes
%matplotlib inline

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

There are 9 datasets. These datasets originally came from [Wikipedia's list of countries](https://bit.ly/2YG48Ks). 

The cell below will read all of the datasets.

In [None]:
life_data = pd.read_csv('../input/world-data-by-country-2020/Life expectancy.csv')
urban_data = pd.read_csv('../input/world-data-by-country-2020/Urbanization rate.csv')
median_data = pd.read_csv('../input/world-data-by-country-2020/Median age.csv')
growth_data = pd.read_csv('../input/world-data-by-country-2020/Population growth.csv')
sex_data = pd.read_csv('../input/world-data-by-country-2020/Sex-ratio.csv')
suicide_data = pd.read_csv('../input/world-data-by-country-2020/Suicide rate.csv')
meat_data = pd.read_csv('../input/world-data-by-country-2020/Meat consumption.csv')
gdp_data = pd.read_csv('../input/world-data-by-country-2020/GDP per capita.csv')
fertility_data = pd.read_csv('../input/world-data-by-country-2020/Fertility.csv')

Each variable name corresponds to its dataset.

***life_data*** refers to ***Life Expectancy*** dataset.

***meat_data*** refers to ***Meat Consumption*** dataset.

***urban_data*** refers to ***Urbanization Rate*** dataset.

***median_data*** refers to ***Median Age*** dataset.

***growth_data*** refers to ***Population Growth*** dataset.

***sex_data*** refers to ***Sex Ratio*** dataset.

***suicide_data*** refers to ***Suicide Rate*** dataset.

***gdp_data*** refers to ***GDP per Capita*** dataset.

***fertility_data*** refers to ***Fertility Rate*** dataset.

Each dataset contains 3 columns. Two of them, ***Country*** and ***ISO-Code***, are the same in every dataset. The third column holds the dataset's own measure. For example, ***meat_data*** has a ***Meat consumption*** column, and so on.

Here is an example:


In [None]:
meat_data.head()

> A more detailed description of the datasets is available [here](https://www.kaggle.com/daniboy370/world-data-by-country-2020).

With all of the datasets ready, now we can begin our analysis.

## Correlation



We can answer these questions with 2 methods.

**The first** is a **regression plot** drawn with the seaborn module, where each country is represented by a single point in a scatter plot.

**The second** is calculating the **correlation coefficient**. In this EDA, we use **PPMCC** (Pearson's product-moment correlation coefficient), which measures the strength of the linear correlation between 2 variables. The coefficient (*r*) always lies in the interval *-1 ≤ r ≤ 1*.

The two methods complement each other: the first visualizes the fitted linear regression line, and the second summarizes the strength of the linear relationship as a single number.
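
To make the coefficient less of a black box, here is a minimal sketch of how *r* can be computed by hand. The function name `pearson_r` and the toy numbers are made up for illustration and do not come from the datasets. pandas' `Series.corr`, which the cells below use, computes the Pearson coefficient by default.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values, not taken from the datasets
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_up = [2.0, 4.0, 6.0, 8.0, 10.0]    # rises in a straight line with x
y_down = [10.0, 8.0, 6.0, 4.0, 2.0]  # falls in a straight line with x

r_up = pearson_r(x, y_up)      # perfect positive linear relationship
r_down = pearson_r(x, y_down)  # perfect negative linear relationship
print(r_up, r_down)
```

The two toy series sit at the two ends of the interval; real data, like the countries in this EDA, usually lands somewhere in between.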

> Note : Correlation is not causation.


<a id="corr1"></a>
### Is there any correlation between suicide and urbanization rate?

In [None]:
# Check if there's a correlation between suicide rate and urbanization rate.

plt.figure(figsize=(16,7))

df_1 = pd.merge(suicide_data, urban_data, on = 'Country')  # Merge the datasets
sns.regplot(x=df_1['Urbanization rate'], y=df_1['Suicide rate']) 

# x-axis will be urbanization rate and y-axis will be suicide rate


In [None]:
# Calculate the correlation coefficient between suicide and urbanization rate

column_1 = df_1["Suicide rate"]
column_2 = df_1["Urbanization rate"]
correlation = column_1.corr(column_2)
correlation

As the two outputs above show, the correlation is very weak.

In the plot, most points lie below 15 on the suicide rate *y-axis*.

Meanwhile, the *r* value is **-0.045**, which is almost 0. The negative sign indicates a negative linear correlation, meaning the regression line slopes slightly downward. A correlation is strongest at -1 or 1 and weakest at 0, so we can conclude that there is almost no linear correlation between suicide rate and urbanization rate.
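
One thing to keep in mind: `pd.merge` performs an inner join by default, so a country that appears in only one of the two datasets (or is spelled differently in them) silently drops out of the merged dataframe. A rough way to check for this is an outer merge with `indicator=True`. The two small frames below contain made-up values and only illustrate the idea.

```python
import pandas as pd

# Toy frames with made-up values; only the column names mirror the real datasets
left = pd.DataFrame({
    'Country': ['Japan', 'Chile', 'Kenya'],
    'Suicide rate': [14.0, 9.0, 5.0],
})
right = pd.DataFrame({
    'Country': ['Japan', 'Kenya', 'Tonga'],
    'Urbanization rate': [90.0, 30.0, 25.0],
})

# An outer join keeps every country; the '_merge' column says where each row came from
check = pd.merge(left, right, on='Country', how='outer', indicator=True)

# Rows that an inner join would have dropped
unmatched = check.loc[check['_merge'] != 'both', ['Country', '_merge']]
print(unmatched)
```

If `unmatched` is not empty for the real datasets, the correlation is computed on fewer countries than either dataset contains.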


<a id="corr2"></a>
### Do countries with a higher median age have a higher suicide rate?

In [None]:
# Check if there's a correlation between suicide rate and median age.

plt.figure(figsize=(16,7))

df_2 = pd.merge(suicide_data, median_data, on = 'Country')  # Merge the datasets
sns.regplot(x=df_2['Median age'], y=df_2['Suicide rate'])

# x-axis will be median age and y-axis will be suicide rate

In [None]:
# Calculate the correlation coefficient between suicide rate and median age

column_1 = df_2["Suicide rate"]
column_2 = df_2["Median age"]
correlation = column_1.corr(column_2)
correlation

The outputs are similar to those of the previous comparison between suicide rate and urbanization rate.

Most points lie below 15 on the suicide rate *y-axis*.

Meanwhile, the *r* value is also very close to 0 (**0.022**). We can conclude that there is almost no linear correlation between suicide rate and median age.


<a id="corr3"></a>

### Does a higher fertility rate mean higher population growth?

In [None]:
print(fertility_data[fertility_data['Country'].duplicated()])

# Drop 'Russia' with index 149 from the dataframe (because of duplicates)
fertility_data = fertility_data.drop(149)


> Note: We do not need to remove ***Guinea***, because one of the two Guinea rows is actually ***Guinea-Bissau***. It's OK to keep both for now. We will rename it later, in a different part.

> We do need to remove one of the Russia rows, because the original source lists Russia only once.
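
Dropping row 149 works here, but the hard-coded index depends on the exact file. As a sketch of an alternative, duplicates can be inspected and removed by name. The toy frame below uses made-up values; only the column names mirror the fertility dataset.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    'Country': ['Russia', 'Guinea', 'Russia', 'Japan'],
    'Fertility': [1.6, 4.7, 1.6, 1.4],
})

# keep=False marks every copy of a repeated name, so the rows can be compared side by side
dupes = toy[toy['Country'].duplicated(keep=False)]
print(dupes)

# Keep the first occurrence of each country name and drop the later ones
deduped = toy.drop_duplicates(subset='Country', keep='first').reset_index(drop=True)
print(deduped)
```

A name-based drop should only be applied after looking at the duplicates: in the real data it would also remove the second ***Guinea*** row, which is actually Guinea-Bissau.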


In [None]:
# Check if there's a correlation between fertility rate and population growth.

plt.figure(figsize=(16,7))

df_3 = pd.merge(fertility_data, growth_data, on = 'Country')  # Merge the datasets
sns.regplot(x=df_3['Population growth'], y=df_3['Fertility'])

# x-axis will be population growth and y-axis will be fertility rate

In [None]:
# Calculate the correlation coefficient between fertility rate and population growth.

column_1 = df_3['Population growth']
column_2 = df_3["Fertility"]
correlation = column_1.corr(column_2)
correlation

The plot shows that most points lie below 4 on the *y-axis* (fertility rate) and below 2 on the *x-axis* (population growth).

Meanwhile, the *r* value is **0.625**, which indicates a moderate positive correlation. Countries with higher fertility rates tend to have higher population growth, but the relationship is not a strong one.


<a id="corr4"></a>
### Is there any correlation between fertility rate and meat consumption?

In [None]:
# Check if there's a correlation between fertility rate and meat consumption.

plt.figure(figsize=(16,7))

df_4 = pd.merge(fertility_data, meat_data, on = 'Country')  # Merge the datasets
sns.regplot(x=df_4['Meat consumption'], y=df_4['Fertility'])

# x-axis will be meat consumption level and y-axis will be fertility rate

In [None]:
# Calculate the correlation coefficient between fertility rate and meat consumption.

column_1 = df_4['Meat consumption']
column_2 = df_4["Fertility"]
correlation = column_1.corr(column_2)
correlation

The plot shows a descending regression line, which is consistent with the negative *r* value: countries with lower meat consumption tend to have higher fertility rates.

However, this correlation is only moderate (**-0.606**).


<a id="corr5"></a>

### Do rich countries have higher urbanization rate?

In [None]:
# Check if there's a correlation between GDP/capita level and urbanization rate.

plt.figure(figsize=(16,7))

df_5 = pd.merge(gdp_data, urban_data, on = 'Country')   # Merge the datasets
sns.regplot(x=df_5['Urbanization rate'], y=df_5['GDP per capita'])

# x-axis will be urbanization rate and y-axis will be GDP per Capita

In [None]:
# Calculate the correlation coefficient between GDP/capita level and urbanization rate.

column_1 = df_5['GDP per capita']
column_2 = df_5["Urbanization rate"]
correlation = column_1.corr(column_2)
correlation

There's a moderate positive correlation between GDP per capita and urbanization rate (*r* = **0.634**).

This tells us that richer countries tend to have a higher urbanization rate, although the trend is not a strong one.


<a id="corr6"></a>

### Is there any correlation between life expectancy and meat consumption?

In [None]:
# Check if there's a correlation between life expectancy and meat consumption level.

plt.figure(figsize=(16,7))

df_6 = pd.merge(life_data, meat_data, on = 'Country')    # Merge the datasets
sns.regplot(x=df_6['Life expectancy'], y=df_6['Meat consumption'])

# x-axis will be life expectancy level and y-axis will be meat consumption level

In [None]:
# Calculate the correlation coefficient between life expectancy and meat consumption level.

column_1 = df_6['Life expectancy']
column_2 = df_6["Meat consumption"]
correlation = column_1.corr(column_2)
correlation

The correlation coefficient is **0.69**. That's quite high, but not high enough to reach the strong correlation interval (*0.8 ≤ r ≤ 1*).

This correlation is quite interesting: it is a moderate positive correlation, so countries with higher meat consumption tend to have higher life expectancy.
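
The verbal labels used throughout this section can be written down as a small helper. This is only a sketch: the 0.8 threshold for "strong" comes from this notebook, while the 0.4 and 0.2 cut-offs and the function name `describe_correlation` are assumptions chosen for illustration. Different textbooks draw these lines differently.

```python
def describe_correlation(r):
    """Return (strength, direction) labels for a Pearson coefficient r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")

    size = abs(r)
    if size >= 0.8:      # threshold used in this notebook
        strength = "strong"
    elif size >= 0.4:    # assumed cut-off
        strength = "moderate"
    elif size >= 0.2:    # assumed cut-off
        strength = "weak"
    else:
        strength = "very weak"

    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        direction = "none"
    return strength, direction

# The six coefficients reported in this section
for r in [-0.045, 0.022, 0.625, -0.606, 0.634, 0.69]:
    print(r, describe_correlation(r))
```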


<a id="top_suicide_world"></a>

## Top 10 countries with the highest suicide rate in the world

Now, let's find out which countries have the highest suicide rate.


In [None]:
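# Keep the full rows of the 10 countries with the highest suicide rate,
# so that the x and y values passed to the plot come from the same rows
top_10_suicide = suicide_data.sort_values('Suicide rate',ascending=False).head(10)
plt.figure(figsize=(15,5))

sns.barplot(x = top_10_suicide['Country'], y = top_10_suicide['Suicide rate'])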


Guyana is ranked no. 1, followed by Lesotho and Russia. The remaining ranks can be read from the visualization.

Guyana has a suicide rate of **30.2**, followed by Lesotho (**28.9**) and Russia (**26.5**). If you're wondering, South Korea has a suicide rate of **20.2**.
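
As a side note, sorting and then taking the first rows can also be written with `DataFrame.nlargest`. The sketch below uses a toy frame with placeholder country names and made-up values.

```python
import pandas as pd

# Toy frame with placeholder names and made-up values
toy = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E'],
    'Suicide rate': [5.0, 12.5, 30.0, 8.1, 20.2],
})

# The 3 rows with the highest values, already sorted from highest to lowest
top_3 = toy.nlargest(3, 'Suicide rate')
print(top_3)
```

Because `nlargest` returns whole rows, the country names and their rates stay together and can be passed straight to a bar plot.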


<a id="top_suicide_asean"></a>

## ASEAN countries ranked by suicide rate

Now, let's see the ranking of Southeast Asian countries by their suicide rates.

> Note : ASEAN is an association of Southeast Asian countries.

In [None]:
rank_asean_suicide = suicide_data.loc[suicide_data['Country'].isin(['Malaysia','Myanmar','Philippines','Vietnam','Brunei' ,'Indonesia', 'Singapore', 'Thailand', 'Laos', 'Cambodia', 'East Timor'])].sort_values('Suicide rate',ascending=False).head(11)

plt.figure(figsize=(15,5))

sns.barplot(y = rank_asean_suicide['Suicide rate'], x = rank_asean_suicide['Country'])

As we can see here, Thailand is ranked no. 1 with a suicide rate of **12.9**, followed by Laos (**9.3**) and Myanmar (**8.1**).

Meanwhile, the Philippines, Indonesia, and Brunei have the lowest suicide rates in the ASEAN region. The Philippines and Indonesia have the same suicide rate (**3.7**), and Brunei has **4.5**.

<a id="top_meat_region"></a>
<a id="top_suicide_region"></a>

## Which region has the highest meat consumption and suicide rate?

This time, we will find out which region is the answer to the question above.

But what is a **'region'**?

In this context, a region is a group of countries that are located close to one another. For example, Brazil, Argentina, Chile, and Suriname belong to the South America region because they share the same continent.



Before we begin the analysis, we rename the country at index 66 of the suicide data and at index 69 of the meat data from ***Guinea*** to ***Guinea-Bissau*** (if this isn't done, there will be two rows named Guinea).


In [None]:
# Rename only the row at the given index label.
# (Series.replace matches by value, so it would rename every 'Guinea' row.)
suicide_data.loc[66, 'Country'] = 'Guinea-Bissau'
print(suicide_data.Country[66])

meat_data.loc[69, 'Country'] = 'Guinea-Bissau'
print(meat_data.Country[69])
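
Hard-coded row numbers such as 66 and 69 only hold for these exact files. If the ISO code column distinguishes the two rows, the rename could instead be keyed on the code. This is a sketch under assumptions: the column name `ISO-code` and the presence of the code `GNB` in the real files have not been checked here, and the values in the toy frame are made up.

```python
import pandas as pd

# Toy frame with made-up rates; 'ISO-code' is an assumed column name
toy = pd.DataFrame({
    'Country': ['Guinea', 'Guinea', 'Ghana'],
    'ISO-code': ['GIN', 'GNB', 'GHA'],
    'Suicide rate': [1.0, 2.0, 3.0],
})

# GNB is the ISO 3166-1 alpha-3 code of Guinea-Bissau (GIN is Guinea)
toy.loc[toy['ISO-code'] == 'GNB', 'Country'] = 'Guinea-Bissau'
print(toy)
```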

The code below collects the names of the countries in each region. Some regions have 2 variables, ***(region_name)_1*** and ***(region_name)_2***, because the suicide rate and meat consumption datasets do not cover exactly the same countries.

For example:

The Middle East, North Africa, and Greater Arabia region (shortened to *'Menaga'* for convenience) has only the ***'me_na_ga_1'*** variable. This means that both datasets cover the same Menaga countries, so ***'me_na_ga_1'*** is used for both the suicide rate and the meat consumption data.

Meanwhile, the Asia region has 2 variables: ***asia_1*** is used for the suicide data and ***asia_2*** for the meat consumption data. You can see the difference in the elements of the two lists.
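
One way to find out whether a region needs a second list is to compare the region's country names with the names actually present in a dataset. The sketch below shows the idea with a set difference. The toy frame and its values are made up; only the column names mirror the meat consumption dataset.

```python
import pandas as pd

# Toy stand-in for the meat consumption dataset (made-up values)
toy_meat = pd.DataFrame({
    'Country': ['Japan', 'India', 'China'],
    'Meat consumption': [46.0, 4.0, 58.0],
})

region = ['Japan', 'India', 'Brunei', 'China']

# Countries listed for the region that the dataset does not contain
available = set(toy_meat['Country'])
missing = sorted(set(region) - available)

# The region list restricted to countries that are in the dataset
present = [country for country in region if country in available]

print(missing)
print(present)
```

A check like this also catches spelling differences between a list and a dataset, since a misspelled name shows up as missing.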


In [None]:
# Lists ending in _1 are used for the suicide rate data
# (and for the meat consumption data when the region has no _2 list).
# Lists ending in _2 are used for the meat consumption data only.
# The countries available in the suicide and meat consumption datasets are different.


# Asia
asia_1 = ['East Timor','Vietnam','Uzbekistan','Turkmenistan','Thailand','Tajikistan','Sri Lanka','South Korea','Singapore','Philippines','North Korea','Nepal','Myanmar','Mongolia','Malaysia','Maldives','Laos','Kyrgyzstan','Kazakhstan','Japan','Indonesia','India','Bangladesh', 'Bhutan', 'Brunei', 'Cambodia', 'China']
asia_2 = [country for country in asia_1 if country not in ['Brunei']] 

# Middle East, North Africa, Greater Arabia
me_na_ga_1 = ['Yemen','United Arab Emirates','Turkey','Tunisia','Syria','Somalia','Saudi Arabia','Qatar','Pakistan','Oman','Morocco','Libya','Lebanon','Kuwait','Jordan','Israel','Iraq','Iran','Egypt','Afghanistan', 'Algeria','Bahrain', 'Azerbaijan','Cyprus']

# Europe (Norway belongs here, not in the Middle East list)
europe_1 = ['United Kingdom','Ukraine','Switzerland','Sweden','Spain','Slovenia','Slovakia','Serbia','Russia','Romania','Portugal','Poland','Norway','North Macedonia','Netherlands','Montenegro','Moldova','Malta','Luxembourg','Lithuania','Latvia','Italy','Ireland','Iceland','Hungary','Greenland','Greece','Germany','Georgia','France','Finland','Estonia','Denmark','Czech Republic','Albania', 'Armenia', 'Austria', 'Belarus', 'Belgium', 'Bosnia and Herzegovina','Bulgaria', 'Croatia']
europe_2 = [country for country in europe_1 if country not in ['Albania', 'Montenegro', 'Serbia']] + ['FR Yugoslavia']

# North America
n_america_1 = ['United States','Mexico','Canada']

# Central America and Caribbean
ca_carib_1 = ['Trinidad and Tobago','The Bahamas','Saint Vincent and the Grenadines','Saint Lucia','Panama','Nicaragua','Jamaica','Haiti','Honduras','Guatemala','Grenada','El Salvador','Dominican Republic','Antigua and Barbuda', 'Barbados', 'Belize','Costa Rica', 'Cuba']
ca_carib_2 = ca_carib_1 + ['Dominica', 'Guadeloupe', 'Martinique', 'Saint Kitts and Nevis', 'Virgin Islands']

# South America
s_america_1 = ['Venezuela','Uruguay','Suriname','Peru','Paraguay','Guyana','Argentina', 'Bolivia', 'Brazil', 'Chile', 'Colombia', 'Ecuador']
s_america_2 = s_america_1 + ['French Guiana']

# Sub-saharan Africa
ss_africa_1 = ['Zimbabwe','Zambia','Uganda','Togo','The Gambia','Tanzania','Sudan','South Sudan','South Africa','Sierra Leone','Seychelles','Senegal','Rwanda','Republic of the Congo','Nigeria','Niger','Namibia','Mozambique','Mauritius','Mauritania','Mali','Malawi','Madagascar','Liberia','Lesotho','Kenya','Ivory Coast','Guinea-Bissau','Guinea','Ghana','Gabon','Ethiopia','Eswatini','Eritrea','Equatorial Guinea','Djibouti','Democratic Republic of the Congo','Angola', 'Benin', 'Botswana', 'Burkina Faso', 'Burundi', 'Cameroon', 'Cape Verde', 'Central African Republic', 'Chad', 'Comoros']
ss_africa_2 = [country for country in ss_africa_1 if country not in ['Eswatini', 'Equatorial Guinea']] + ['Sao Tome and Principe']

# Australia and Oceania
au_oce_1 = ['Vanuatu','Tonga','Solomon Islands','Samoa','Papua New Guinea','New Zealand','Kiribati','Fiji','Australia']
au_oce_2 = [country for country in au_oce_1 if country not in ['Tonga']] + ['American Samoa', 'French Polynesia', 'Guam', 'New Caledonia']


Now that the countries are classified by region, we calculate the mean (average) suicide rate and meat consumption level for each region.

The regions are:
1. Asia
2. Middle East, North Africa, Greater Arabia
3. Europe
4. North America
5. Central America and Caribbean
6. South America
7. Sub-Saharan Africa
8. Australia and Oceania

That's 8 regions. We calculate the mean for each region, for both the suicide rate and the meat consumption level, which gives 16 averages in total.

We will use 3 functions, so that the same code can be reused for every region.

The first function is ***'get_index'***.

It takes 2 parameters (data and countries). ***'data'*** is the dataset to search, and ***'countries'*** is the list of countries in a region. The function returns the row indices of those countries in ***'data'***.


The second function is ***'get_rate'***. 

It takes 3 parameters (data, column, and index). ***'data'*** is the dataset, ***'column'*** is the column that holds the suicide rate or meat consumption level, and ***'index'*** is the list of indices returned by the previous function. The function returns the suicide rate or meat consumption level of each country at those indices.


The third function is ***'get_average'***. 

It takes 1 parameter (rates), the list of suicide rates or meat consumption levels of one region, and returns their average.


In [None]:
# Get the index
def get_index(data, countries):
    indices = []
    for country in countries:
        target_column = data.isin([country])
        series = target_column.any()
        columns = list(series[series == True].index)
        
        for column in columns:
            rows = list(target_column[column][target_column[column] == True].index)
            
            for row in rows:  
                indices.append(row)
                
    return indices

# Get the suicide rate and meat consumption level for each country
def get_rate(data, column, index):
    rate = []
    for ix in index:
        # get_index returns index labels, so look the rows up by label
        rate.append(data[column].loc[ix])
    return rate

# Get the average
def get_average(rates):
    return sum(rates)/len(rates)

# Asia
index_asia_1 = get_index(suicide_data, asia_1)
suicide_asia = get_rate(suicide_data, 'Suicide rate', index_asia_1)
avg_asia_1 = get_average(suicide_asia)

index_asia_2 = get_index(meat_data, asia_2)
meat_asia = get_rate(meat_data, 'Meat consumption', index_asia_2)
avg_asia_2 = get_average(meat_asia)

# Middle East, North Africa, and Greater Arabia
index_me_na_ga_1 = get_index(suicide_data, me_na_ga_1)
suicide_me_na_ga = get_rate(suicide_data, 'Suicide rate', index_me_na_ga_1)
avg_me_na_ga_1 = get_average(suicide_me_na_ga)

index_me_na_ga_2 = get_index(meat_data, me_na_ga_1)
meat_me_na_ga = get_rate(meat_data, 'Meat consumption', index_me_na_ga_2)
avg_me_na_ga_2 = get_average(meat_me_na_ga)

# Europe
index_europe_1 = get_index(suicide_data, europe_1)
suicide_europe = get_rate(suicide_data, 'Suicide rate', index_europe_1)
avg_europe_1 = get_average(suicide_europe)

index_europe_2 = get_index(meat_data, europe_2)
meat_europe = get_rate(meat_data, 'Meat consumption', index_europe_2)
avg_europe_2 = get_average(meat_europe)

# North America
index_n_america_1 = get_index(suicide_data, n_america_1)
suicide_n_america = get_rate(suicide_data, 'Suicide rate', index_n_america_1)
avg_n_america_1 = get_average(suicide_n_america)

index_n_america_2 = get_index(meat_data, n_america_1)
meat_n_america = get_rate(meat_data, 'Meat consumption', index_n_america_2)
avg_n_america_2 = get_average(meat_n_america)

# Central America and Caribbean
index_ca_carib_1 = get_index(suicide_data, ca_carib_1)
suicide_ca_carib = get_rate(suicide_data, 'Suicide rate', index_ca_carib_1)
avg_ca_carib_1 = get_average(suicide_ca_carib)

index_ca_carib_2 = get_index(meat_data, ca_carib_2)
meat_ca_carib = get_rate(meat_data, 'Meat consumption', index_ca_carib_2)
avg_ca_carib_2 = get_average(meat_ca_carib)

# South America
index_s_america_1 = get_index(suicide_data, s_america_1)
suicide_s_america = get_rate(suicide_data, 'Suicide rate', index_s_america_1)
avg_s_america_1 = get_average(suicide_s_america)

index_s_america_2 = get_index(meat_data, s_america_2)
meat_s_america = get_rate(meat_data, 'Meat consumption', index_s_america_2)
avg_s_america_2 = get_average(meat_s_america)

# Sub-Saharan Africa
index_ss_africa_1 = get_index(suicide_data, ss_africa_1)
suicide_ss_africa = get_rate(suicide_data,'Suicide rate', index_ss_africa_1)
avg_ss_africa_1 = get_average(suicide_ss_africa)

index_ss_africa_2 = get_index(meat_data, ss_africa_2)
meat_ss_africa = get_rate(meat_data, 'Meat consumption', index_ss_africa_2)
avg_ss_africa_2 = get_average(meat_ss_africa)

# Australia and Oceania
index_au_oce_1 = get_index(suicide_data, au_oce_1)
suicide_au_oce = get_rate(suicide_data, 'Suicide rate', index_au_oce_1)
avg_au_oce_1 = get_average(suicide_au_oce)

index_au_oce_2 = get_index(meat_data, au_oce_2)
meat_au_oce = get_rate(meat_data, 'Meat consumption', index_au_oce_2)
avg_au_oce_2 = get_average(meat_au_oce)
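
For comparison, the same per-region average can be sketched with a boolean mask instead of collecting indices first. The helper name `region_average`, the `regions` dictionary, and the values in the toy frame are all made up for illustration. This is not the code that produced the averages above.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the suicide dataset
toy = pd.DataFrame({
    'Country': ['Japan', 'India', 'France', 'Spain'],
    'Suicide rate': [14.0, 16.0, 12.0, 6.0],
})

# Hypothetical region lists, much shorter than the real ones
regions = {
    'Asia': ['Japan', 'India'],
    'Europe': ['France', 'Spain'],
}

def region_average(data, column, countries):
    """Mean of `column` over the rows whose 'Country' is in `countries`."""
    mask = data['Country'].isin(countries)
    return data.loc[mask, column].mean()

averages = {
    name: region_average(toy, 'Suicide rate', countries)
    for name, countries in regions.items()
}
print(averages)
```

One difference to be aware of: `Series.mean` skips missing values by default, while `sum(rates)/len(rates)` returns `nan` as soon as one rate is missing.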

All of the averages are now stored in variables named after their region. Each region has 2 averages: ***avg_(region_name)_1*** is the average suicide rate and ***avg_(region_name)_2*** is the average meat consumption level.

We need to combine these variables into dataframes. We will create 2 dataframes:


In [None]:
# Create two dataframes for the average suicide rate and meat consumption level per region

# Region labels, in the same order as the averages below
regions = ['Asia', 'Europe', 'North America', 'South America', 'Central America and Caribbean', 'Middle East, North Africa, and Greater Arabia', 'Sub-Saharan Africa', 'Australia and Oceania']

data_suicide = {'Region': regions, 'Average suicide rate': [avg_asia_1, avg_europe_1, avg_n_america_1, avg_s_america_1, avg_ca_carib_1, avg_me_na_ga_1, avg_ss_africa_1, avg_au_oce_1]}

data_meat = {'Region': regions, 'Average meat consumption': [avg_asia_2, avg_europe_2, avg_n_america_2, avg_s_america_2, avg_ca_carib_2, avg_me_na_ga_2, avg_ss_africa_2, avg_au_oce_2]}

df_data_suicide = pd.DataFrame(data_suicide)
df_data_meat = pd.DataFrame(data_meat)

Here are the resulting dataframes:


In [None]:
df_data_meat.sort_values(by='Average meat consumption', ascending = False)

In [None]:
df_data_suicide.sort_values(by='Average suicide rate', ascending = False)

It's kind of boring to look at the numbers in a dataframe, isn't it? Let's visualize them!

In [None]:
# Sort the dataframe and assign it to a new variable
meat_sorted = df_data_meat.sort_values(by='Average meat consumption', ascending = False)


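# Rounded averages taken from the sorted dataframe, in the same order as meat_regions
meat_values = meat_sorted['Average meat consumption'].round().astype(int).tolist()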
meat_regions = meat_sorted['Region'].values.tolist()   # Get the values from 'Region' column

fig, ax = plt.subplots(figsize=(10,5))   # Create frame and axis

# Make a horizontal bar graph
ax.barh(meat_regions,meat_values, color = 'red', edgecolor = 'black')

# Add gridlines to the frame
ax.grid(True, color ='black', linestyle='-.',linewidth = 1, alpha =0.4)

plt.title('Average Meat Consumption Level Across Regions', fontsize = 25, loc = 'left')
plt.ylabel('')

# Add the value as text next to the end of each bar
for value in ax.patches:
    plt.text(value.get_width() + 0.2, value.get_y()+0.5, str(round((value.get_width()),2)), fontsize=15, fontweight='bold',color='black')

There. 

That's better, isn't it?
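
As a side note, Matplotlib 3.4 and later provide `Axes.bar_label`, which can write the values next to the bars without a manual `plt.text` loop. The sketch below uses three placeholder regions and made-up values, so it only shows the call and does not reproduce the chart above.

```python
import matplotlib.pyplot as plt

# Placeholder region names and made-up values
names = ['Region A', 'Region B', 'Region C']
values = [97, 73, 16]

fig, ax = plt.subplots(figsize=(10, 5))

# barh returns a container holding the bars, which bar_label needs
bars = ax.barh(names, values, color='red', edgecolor='black')

# One text label per bar, placed 3 points beyond the end of the bar
labels = ax.bar_label(bars, fmt='%.0f', padding=3, fontsize=15, fontweight='bold')

label_texts = [label.get_text() for label in labels]
print(label_texts)
plt.close(fig)
```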

Thank you for reading this EDA to the end. The analyses here are still very basic. It's my first EDA ever :).
