<a id="top"></a> 
### <center><font color='red'>SUICIDE HOTLINES - INTERNATIONAL</font></center><center>www.opencounseling.com/suicide-hotlines</center>

# <center>Suicides - Data Visualization and World Maps</center>

This notebook contains an overview of the datasets, data visualizations, correlations and maps related to suicide rates in various countries.  The following datasets are used:

*  **Suicide Rates Overview 1985 to 2016** dataset compares socio-economic information with suicide rates by year and country and contains suicide data of 101 countries spanning 32 years. 
*  **World capitals GPS** contains countries, continents and latitude/longitude information and is combined with the above dataset to look at **suicides per continents** and to create a *Folium World* map.
*  **World Countries** is a JSON file with country geographical shape information and is used to create the *Choropleth World* map.
<br>

### Table of Contents
1.  [Data Preparation](#prep)<br>
1.1  [Data Collection](#prep_coll)<br>
1.2  [Data Cleaning](#prep_clean)<br>
1.3  [Final Check](#prep_check)<br>
1.4  [Data Attributes  ](#prep_attr)<br>
1.5  [Data Overview  ](#prep_over)<br>


2.  [Data Visualization (EDA)](#eda)<br>
2.1  [Suicide Rates per Country](#eda_country)<br>
2.2  [Suicide Rates per Continent](#eda_cont)<br>
2.3  [Suicide Trends and Population](#eda_pop)<br>
2.4  [Suicide Trends and Age Group](#eda_age)<br>
2.5  [Suicide Trends and Human Development Index (HDI)](#eda_hdi)<br>
2.6  [Suicide Trends and GDP](#eda_gdp)<br>


3.  [Correlations](#corr)<br>
3.1  [Encoding and Normalization](#corr_encode)<br>
3.2  [Overall Correlation](#corr_over)<br>
3.3  [Correlation - Male & Female](#corr_sex)<br>


4.  [World Maps](#maps)<br>
4.1  [Folium World Map](#maps_folium)<br>
4.2  [Choropleth World Map](#maps_chor)<br>
4.3  [Choropleth World Map - Animated](#maps_chorANI)<br>   

<br>

**LIMITATION:**  The dataset is an excellent resource for data visualizations and creating maps, but it covers only around half of the countries in the world, so it is far from complete.

**Please upvote if you found this helpful :-)**

###  Import Python Libraries

Note:  to install/import some packages (such as plotly), verify the following in your Kaggle notebook's "Settings" (on the right, under Data):
1.  Environment is set to "latest"
2.  "Internet" toggle is "ON"

In [None]:
#  Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

#  world maps
import folium
from folium.plugins import MarkerCluster
import plotly.express as px  #  plotly express (bundled with plotly >= 4)

# Kaggle directories
import os
print(os.listdir("../input"))
from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

---
#  1.  Data Preparation <a id="prep"></a>
Dataset preparation requires loading, cleaning and understanding the data.

##  1.1  Data Collection <a id="prep_coll"></a>

1.1.1  [Load Datasets](#prep_load)<br>
1.1.2  [Join Datasets](#prep_join)<br>
1.1.3  [Drop un-needed columns](#prep_drop)<br>

###  1.1.1  Load Datasets <a id="prep_load"></a>
Load datasets **suicide-rates-overview-1985-to-2016** and **world-capitals-gps**.

In [None]:
df  = pd.read_csv("../input/suicide-rates-overview-1985-to-2016/master.csv")  # suicides
gps = pd.read_csv("../input/world-capitals-gps/concap.csv")   # world GPS

###  1.1.2   Join Datasets <a id="prep_join"></a>
Verify country names are consistent in **df** and **gps** dataframes.
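The same consistency check can be sketched without an explicit loop. This is a minimal illustration on toy data: `df_demo` and `gps_demo` are hypothetical stand-ins for the notebook's `df` and `gps` frames, and the country values are made up for the example.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `gps` frames (illustrative values only).
df_demo = pd.DataFrame({"country": ["Albania", "Russian Federation", "Republic of Korea", "Albania"]})
gps_demo = pd.DataFrame({"CountryName": ["Albania", "Russia", "South Korea"]})

# Country names in df_demo that have no exact match in gps_demo.
names = pd.Series(sorted(df_demo["country"].unique()))
missing = names[~names.isin(gps_demo["CountryName"])].tolist()
print("missing:", missing)

# Rename to the gps spelling, then re-check with a set difference.
fixed = df_demo.replace({"Russian Federation": "Russia", "Republic of Korea": "South Korea"})
still_missing = sorted(set(fixed["country"]) - set(gps_demo["CountryName"]))
print("still missing:", still_missing)
```

Running the check again after the rename is a cheap way to confirm that the join key is complete before joining.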

In [None]:
count = 0
for i in sorted(df.country.unique()):
    if len(gps.CountryName[gps.CountryName == i].values) == 0:
        print('MISSING:  df: {:<35}gps:{}'.format(i,gps.CountryName[gps.CountryName == i].values))
        count = count + 1
print('check complete:  {} missing'.format(count)) 

#  update names in df to match the gps file
df.replace({'Cabo Verde':'Cape Verde','Republic of Korea':'South Korea','Russian Federation':'Russia','Saint Vincent and Grenadines':'Saint Vincent and the Grenadines'},inplace=True)

Join the datasets using the country names as the key.

In [None]:
df = df.join(gps.set_index('CountryName'), on='country')
print(df.shape)

###  1.1.3  Drop un-needed columns <a id="prep_drop"></a>

In [None]:
#  drop un-needed columns
df = df.drop([' gdp_for_year ($) ', 'country-year', 'CountryCode', 'CapitalName'], axis=1)
# sort dataframe by country and year
df = df.sort_values(['country','year'])

print(df.shape)

[go to top of section](#prep)  

##  1.2  Data Cleaning <a id="prep_clean"></a>
Check for null, missing and duplicate values and take corrective actions.

1.2.1  [Check for Null, Missing and '0' values](#prep_miss)<br>
1.2.2  [Data Clean:  'suicides_no' and 'suicides/100k pop'](#prep_sui)<br>
1.2.3  [Data Clean:  'HDI for year' and '2016' data](#prep_hdi)<br>

###  1.2.1  Check for Null, Missing and '0' values <a id="prep_miss"></a>

In [None]:
plt.title('null, missing and \'0\' values heatmap')
sns.heatmap((df.isnull()) | (df == 0), cmap = 'mako')
plt.show()

print('\nDUPLICATE VALUE COUNT:  ', df.duplicated().sum())

`suicides_no`, `suicides/100k pop` and `HDI for year` have missing values and will be corrected in the following sections.  There are no duplicate values.
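The heatmap shows where values are missing. A numeric summary of the same mask can make the extent easier to compare across columns. The sketch below uses a small made-up frame (`demo` is hypothetical, not the notebook's data) and treats a cell as missing when it is null or equal to 0, as the heatmap does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the notebook.
demo = pd.DataFrame({
    "suicides_no": [0, 12, 5, 0],
    "suicides/100k pop": [0.0, 3.1, 1.2, 0.0],
    "HDI for year": [np.nan, 0.70, np.nan, 0.71],
})

# Same mask the heatmap draws: a cell counts as "missing" if it is null or 0.
mask = demo.isnull() | (demo == 0)

# Mean of a boolean column is the fraction of True values.
share = mask.mean() * 100
print(share.round(1))

# Number of fully duplicated rows.
n_dupes = int(demo.duplicated().sum())
print("duplicates:", n_dupes)
```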

###  1.2.2  Data Clean:  'suicides_no' and 'suicides/100k pop' <a id="prep_sui"></a>
Rows with no values ("null" or equal to "0") will be dropped.
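As a rough sketch of this step, the drop can be expressed as a boolean mask. The frame `demo` below is a hypothetical toy example. Note the assumption built into this cleaning rule: a rate of 0 is treated as missing, although in small population groups it could also be a genuine zero.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the notebook.
demo = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [0, 12, 0, 5],
    "suicides/100k pop": [0.0, 3.1, 0.0, 1.2],
})

# Rows whose rate is exactly 0 are treated as "no value" (an assumption of this cleaning rule).
zero_rate = demo["suicides/100k pop"] == 0
pct_before = zero_rate.mean() * 100

# Keep only rows with a non-zero rate.
cleaned = demo.loc[~zero_rate].reset_index(drop=True)
pct_after = (cleaned["suicides/100k pop"] == 0).mean() * 100

print("before: {:.1f}%  after: {:.1f}%  rows kept: {}".format(pct_before, pct_after, len(cleaned)))
```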

In [None]:
#  percentage missing for 'suicides/100k pop' - before
for i in ['suicides_no', 'suicides/100k pop']:
    missing_before = len(df[df[i] == 0])*100/len(df)
    print(' before: {:>20} == 0:  {:>8.4f}%'.format(i,missing_before))
    
#  BEFORE - missing 'suicides_no' and 'suicides/100k pop'
sui_before = df[['suicides_no', 'suicides/100k pop']]

    
#  drop rows where 'suicides/100k pop' == 0
df.drop(df[df['suicides/100k pop'] == 0].index, inplace = True)

#  percentage missing for 'suicides/100k pop' - after
for i in ['suicides_no', 'suicides/100k pop']:
    missing_after = len(df[df[i] == 0])*100/len(df)
    print(' after:  {:>20} == 0:  {:>8.2f}%'.format(i,missing_after))

#  AFTER - missing 'suicides_no' and 'suicides/100k pop'
sui_after = df[['suicides_no', 'suicides/100k pop']]

In [None]:
#  heatmaps for missing 'suicides_no' and 'suicides/100k pop'
plt.figure(figsize=(10,4))
plt.subplot(121)
sns.heatmap((sui_before.isnull()) | (sui_before == 0), cmap = 'binary_r')
plt.title('BEFORE: \'suicides_\' missing')
plt.subplot(122)
sns.heatmap((sui_after.isnull()) | (sui_after == 0), cmap = 'binary_r')
plt.title('AFTER: \'suicides_\' missing')
plt.show()

Missing values for 'suicides_no' and 'suicides/100k pop' were in corresponding rows - two birds, one stone :-)

###  1.2.3  Data Clean:  'HDI for year' and '2016' data <a id="prep_hdi"></a>
**HDI for year:**  (forward fill)<br>
`HDI for year` has a large share of missing values.  However, the Human Development Index (HDI) is a slow-moving index, so replacing empty values with the last known HDI value should not significantly impact the analysis.

Null values of `HDI for year` will be filled with the previous year's value.
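One detail worth illustrating: when the rows are sorted by country and year, a plain forward fill can carry the last value of one country into the first rows of the next. The sketch below, on a hypothetical toy frame, contrasts a plain fill with a per-country fill using `groupby(...).ffill()`. The per-country variant is shown as an alternative, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, sorted by country and year like the notebook's df.
demo = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [1990, 1991, 1992, 1990, 1991],
    "HDI for year": [0.60, np.nan, np.nan, np.nan, 0.80],
}).sort_values(["country", "year"])

# Plain forward fill: country B's 1990 gap is filled with country A's value.
plain = demo["HDI for year"].ffill()

# Per-country forward fill: gaps are filled only from earlier years of the same country.
per_country = demo.groupby("country")["HDI for year"].ffill()

print(pd.DataFrame({"plain": plain, "per_country": per_country}))
```

With the per-country variant, a country's leading gaps stay empty and would be removed by a later null drop.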

**2016 data:**  (drop)<br>
The data profile shows that the number of observations for 2016 is significantly lower (~140) than for other years (>700).  2016 data will be dropped.
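A sparse year like this can also be flagged programmatically. The sketch below counts rows per year on a hypothetical toy frame and marks years with fewer than half the median row count. The 0.5 threshold is an arbitrary assumption chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data: two full years and one sparse year.
demo = pd.DataFrame({"year": [2014] * 6 + [2015] * 6 + [2016] * 2})

# Observations per year, in chronological order.
rows_per_year = demo["year"].value_counts().sort_index()

# Flag years with fewer than half the median number of rows (threshold is an assumption).
sparse_years = rows_per_year[rows_per_year < 0.5 * rows_per_year.median()].index.tolist()

# Drop the flagged years.
trimmed = demo[~demo["year"].isin(sparse_years)]
print(rows_per_year.to_dict(), sparse_years, len(trimmed))
```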

**forward fill 'HDI for year':**

In [None]:
#  percentage missing for 'HDI for year'
print('HDI for year == NaN:  {:>8.4f}%\t(before)'.format( len(df[df['HDI for year'].isnull()])*100/len(df)))

#  before forward fill
hdi_before = df['HDI for year'].groupby(df['year']).sum()

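#  fill 'HDI for year' nulls with the previous row's value
#  note: rows are sorted by country and year, so a fill can carry over from the previous country
df['HDI for year'] = df['HDI for year'].ffill()

#  drop remaining 'HDI for year' nulls (leading rows with no earlier value)
df.drop(df[df['HDI for year'].isnull()].index, inplace = True)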

#  after forward fill
hdi_after = df['HDI for year'].groupby(df['year']).sum()

#  percentage missing for 'HDI for year'
print('HDI for year == NaN:  {:>8.4f}%\t(after)'.format( len(df[df['HDI for year'].isnull()])*100/len(df)))

**Plot BEFORE and AFTER profiles for 'HDI for year' and '2016 data':**

In [None]:
sr = df['year'].value_counts().sort_index()  # count rows per year, in year order
yr = df['year'].sort_values().unique()   #  years

#  plot HDI and 2016
plt.figure(figsize=(10,4))

plt.subplot(121)
plt.plot(yr,hdi_before)
plt.plot(yr,hdi_after)
plt.title('\'HDI for year\' BEFORE & AFTER fill', fontsize=14)
plt.xlabel('Years', fontsize=12)
plt.ylabel('HDI for year  (rows)', fontsize=12)
plt.legend(['before','after'])

#  plot data profile
plt.subplot(122)
plt.plot(yr,sr)
plt.title('Data Profile: Observations per 1985-2016\n(mean = {:.0f})'.format(sr.mean()), fontsize=14)
plt.xlabel('Years', fontsize=12)
plt.ylabel('Observations', fontsize=12)
plt.axhline(y = sr.mean(), color = 'gray', ls = '--')
plt.axvline(x = 2015, color = 'red', ls = '--')
for a,b in zip(yr, sr): 
    if a % 4 == 0:
        plt.text(a, b, str(b))
plt.show()

**Drop 2016 rows:**

In [None]:
#  drop year = 2016
print('before 2016 data drop:  ',df.shape)
df.drop(df[df['year'] == 2016].index, inplace = True)
print('after  2016 data drop:  ',df.shape)

[go to top of section](#prep)     

##  1.3  Final Check <a id="prep_check"></a>
This is a final re-check for NULL and duplicate values in the dataset.

In [None]:
plt.title('null, missing and \'0\' values heatmap')
sns.heatmap((df.isnull()) | (df == 0), cmap = 'mako')
plt.show()

print('\nDUPLICATE COUNT:  ', df.duplicated().sum())

**There are no nulls or duplicates in the dataset.**

[go to top of section](#prep)     

##  1.4  Data Attributes   <a id="prep_attr"></a>

| COLUMN | type | description |
| :--- | :--- | :--- |
| **country** | categorical | country name |
| **year** | numerical | year of data |
| **sex** | categorical |sex of suicide victim |
| **age** | categorical |age range of suicide victim |
| **suicides_no** | numerical | number of suicides |
| **population** | numerical | population of country |
| **suicides/100k pop** | numerical | suicides per 100k of population |
| **HDI for year** | numerical | Human Development Index (HDI) for that year |
| **gdp_per_capita ($)** | numerical | Gross Domestic Product per person |
| **generation** | categorical | people born between certain years |
| **CapitalLatitude** | numerical | latitude (for maps) |
| **CapitalLongitude** | numerical | longitude (for maps) |
| **ContinentName** | categorical |continent |

<br>   
**Dataset contains information on suicides from 1985-2015 and is now ready for analysis.**

[go to top of section](#prep)     

##  1.5  Data Overview   <a id="prep_over"></a>
The **Suicide Rates Overview 1985 to 2016** dataset contains data for roughly half the countries on Earth, and some of the most populous countries are not included in the dataset.

source:  https://www.worldometers.info/geography/how-many-countries-are-there-in-the-world/

In [None]:
#  Top 10 most populous countries in the world
top10 = ['China','India','United States','Indonesia','Brazil','Pakistan','Nigeria','Bangladesh','Russia','Mexico']
in_set = df.country[df.country.str.contains('|'.join(top10))].unique().tolist()

print('Out of the top 10 most populous countries:\n{}\n\nOnly the following {} are present:\n{}'.format(top10,len(in_set),in_set))

#  dataset
print('\n\nDataset has', len(df['country'].unique()),'countries on' ,len(df['ContinentName'].unique()),'continents spanning' ,len(df['year'].unique()),'years.')

**Create Continent DataFrame**

In [None]:
#  SET UP CONTINENT DATAFRAME
#  population, population percent and country count per continent
df_cont = df.groupby('ContinentName')['population'].sum().reset_index()
df_cont['Population Percentage'] = df_cont['population']*100/df_cont['population'].sum()

#  countries count per continent
cntCont = df['country'].groupby(df['ContinentName']).nunique()

#  add countryCount column to df_cont
df_cont['countryCount'] = cntCont[df_cont['ContinentName']].values 

**Plot Population and Country Count:**

In [None]:
#  PLOTs - Population AND Country Count per Continent 
plt.figure(figsize=(10,4))
plt.subplot(121)
ax = sns.barplot(data = df_cont, x = 'Population Percentage', y='ContinentName')
plt.title('Population', fontsize=14)
plt.xlim([0,45])
plt.ylabel("")
for p in ax.patches:
    ax.annotate("{:,.1f}%".format(p.get_width()), (p.get_x() + p.get_width(), p.get_y()+.4), ha='left', va='center')

plt.subplot(122)
ax = sns.barplot(data = df_cont, x = 'ContinentName', y='countryCount')
plt.title('Country Count\n({} countries)'.format(df_cont.countryCount.sum()), fontsize=14)
plt.xlabel('')
plt.xticks(rotation = 90)
plt.ylim([0,50])
for p in ax.patches:
    ax.annotate("%.0f" % p.get_height(), (p.get_x() + p.get_width()/2, p.get_height()+2), ha='center', va='center')
    
plt.show()

**DATA OVERVIEW:**
-  6 of the 10 most populous countries are missing
-  **China** and **India** account for almost 30% of the global population and are missing 
-  most **African** countries are missing 
-  **Europe** has 43 of the 98 countries in the data and accounts for 38% of the population
-  The analysis is therefore likely to be skewed towards European countries

[go to top of document](#top)     

---
<a id="eda"></a>
#  2.  Data Visualization (EDA) 
Various plots are utilized to visualize the data.

**Set up ordered list for plotting:**
<br>
Sort values for:
- Continent Names
- Continent Names + 'Global'
- Age Groups

In [None]:
#  create an ordered list for plotting:
cont_list = df.ContinentName.sort_values().unique()
cont_glob = np.insert(cont_list, 0, 'Global')  # add Global as 1st
age_list = ['5-14 years', '15-24 years', '25-34 years', '35-54 years',  '55-74 years', '75+ years']

## 2.1  Suicide Rates per Country <a id="eda_country"></a>
Plot the average suicide rates per country and compare with the global mean.
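It may help to be explicit about what "mean" refers to here. The notebook averages the `suicides/100k pop` value over rows, so every row counts equally regardless of population. A population-weighted (pooled) rate can differ. The sketch below uses hypothetical numbers to show the two side by side. It is an illustration of the distinction, not a replacement for the notebook's calculation.

```python
import pandas as pd

# Hypothetical toy data: a small country with higher rates and a large one with lower rates.
demo = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [10, 30, 50, 150],
    "population": [100_000, 100_000, 1_000_000, 1_000_000],
})
demo["suicides/100k pop"] = demo["suicides_no"] / demo["population"] * 100_000

# Unweighted mean over rows (the approach used in this notebook).
row_mean = demo["suicides/100k pop"].mean()

# Pooled rate: total suicides divided by total population.
pooled = demo["suicides_no"].sum() / demo["population"].sum() * 100_000

print("row mean: {:.2f}   pooled: {:.2f}".format(row_mean, pooled))
```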

In [None]:
#  suicide rate average: overall, male, female
suicide_mean = df['suicides/100k pop'].mean()
sr_m = df['suicides/100k pop'][df['sex'] == 'male'].mean()
sr_f = df['suicides/100k pop'][df['sex'] == 'female'].mean()

#  dataframe with suicide averages for each country
df_suicideRate = df['suicides/100k pop'].groupby(df['country']).mean().sort_values(ascending=False).reset_index()

#  plot suicide rates per country
plt.figure(figsize=(10,20))
ax = sns.barplot(data = df_suicideRate, x = 'suicides/100k pop', y='country')
plt.title('Suicide Rates per Country', fontsize=14)
#plt.xlabel('suicides/100k pop\n\noverall mean: {:>18.2f}\nmale mean (BLUE): {:>8.2f}\nfemale mean (RED): {:>18.2f}'.format(suicide_mean, sr_m, sr_f))
plt.text(27,95,'SUICIDE RATES:\nglobal mean (BLACK):  {:>10.4f}\nmale mean (BLUE): {:>15.4f}\nfemale mean (RED): {:>14.4f}'.format(suicide_mean, sr_m, sr_f))
plt.axvline(x= suicide_mean, color = 'black')#, ls = '--')
plt.axvline(x= sr_m, color = 'blue', ls = '--')
plt.axvline(x= sr_f, color = 'red', ls = '--')
plt.ylabel("")
for p in ax.patches:
    ax.annotate("%.2f" % p.get_width(), (p.get_x() + p.get_width() +2, p.get_y()+.4), ha='center', va='center')
plt.show()

- The global mean **suicide rate** is 15.17 per 100k of the population
- males are **about 3.4 times as likely** to commit suicide as females (22.96 vs 6.68 per 100k)
- **European countries**, such as Latvia, Hungary and Belarus, have significantly higher suicide rates

[go to top of section](#eda)  

## 2.2  Suicide Rates per Continent<a id="eda_cont"></a>
Plot the **population** and the **average suicide rates** for each continent.

In [None]:
#  mean suicides/100k pop per continent
df_cont['suicides/100k pop'] = df.groupby('ContinentName')['suicides/100k pop'].mean().values

#  plot Population and Suicide Rates
plt.figure(figsize=(10,4))
plt.subplot(121)
ax = sns.barplot(data = df_cont, x = 'Population Percentage', y='ContinentName')
plt.title('Population', fontsize=14)
plt.xlim([0,45])
plt.ylabel("")
for p in ax.patches:
    ax.annotate("{:,.1f}%".format(p.get_width()), (p.get_x() + p.get_width(), p.get_y()+.4), ha='left', va='center')

plt.subplot(122)
ax = sns.barplot(data = df_cont, x = 'suicides/100k pop', y='ContinentName')
plt.title('Suicide Rates', fontsize=14)
plt.xlim([0,22])
plt.ylabel("")
plt.yticks([])
plt.axvline(x= suicide_mean, color = 'gray', ls = '--')
for p in ax.patches:
    ax.annotate("%.2f" % p.get_width(), (p.get_x() + p.get_width() + 1.5, p.get_y()+.4), ha='center', va='center')
    
plt.show()

-  **Europe** has the most countries in the data and is skewing the global suicide average.

[go to top of section](#eda)  

## 2.3  Suicide Trends and Population<a id="eda_pop"></a>
- Plot **population** and **mean suicide rates** for male and female per continent, 
- and plot the **population** and **suicide rates** for male and female **over time**.

###  Mean Suicide Rates and Ratios per Continent

\begin{align*} 
\text{Suicide Rate Ratio} = \frac{\text{Male Suicide Rate}}{\text{Female Suicide Rate}}
\end{align*}
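The ratio can be sketched compactly with a pivot table instead of a row loop. The frame `demo` below is a hypothetical toy example with made-up rates. Only the column names follow the notebook.

```python
import pandas as pd

# Hypothetical toy data (rates are invented for the example).
demo = pd.DataFrame({
    "ContinentName": ["Europe", "Europe", "Asia", "Asia"],
    "sex": ["male", "female", "male", "female"],
    "suicides/100k pop": [24.0, 6.0, 15.0, 7.5],
})

# Mean rate per continent and sex: one row per continent, one column per sex.
by_sex = demo.pivot_table(index="ContinentName", columns="sex",
                          values="suicides/100k pop", aggfunc="mean")

# Suicide rate ratio = male mean rate / female mean rate.
by_sex["ratio"] = by_sex["male"] / by_sex["female"]
print(by_sex)
```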

In [None]:
#  add male and female suicide means
for i, row in df_cont.iterrows():
    c = df_cont.loc[i]['ContinentName']
    df_cont.loc[i, 'male'] = df['suicides/100k pop'][(df['ContinentName'] == c) & (df['sex'] == 'male')].mean()
    df_cont.loc[i, 'female'] = df['suicides/100k pop'][(df['ContinentName'] == c) & (df['sex'] == 'female')].mean()

#  add male and female ratio of suicide means
df_cont['ratio'] = df_cont['male']/df_cont['female']


#  PLOTs - suicide rates per continent AND male/female ratio
plt.figure(figsize=(12,6))
plt.subplot(121)
ax = sns.barplot(data = df, x = 'ContinentName', order = cont_list, y = 'suicides/100k pop', hue = 'sex', ci=None, palette = 'ocean')
plt.title('Male & Female Suicide Rates', fontsize=14)
plt.xlabel("")
plt.xticks(rotation = 90)
plt.ylim([0,30])
plt.legend(title = "")
for p in ax.patches:
    ax.annotate("%.1f" % p.get_height(), (p.get_x() + p.get_width()/2, p.get_height()+1), ha='center', va='center')
    
plt.subplot(122)
ax = sns.barplot(x = df_cont['ContinentName'], y = df_cont['ratio'], palette = 'winter')  # keyword x/y works across seaborn versions
plt.title('Male vs. Female Suicide Ratio', fontsize=14)
plt.xlabel("")
plt.xticks(rotation = 90)
plt.ylabel('Male/Female Suicide Ratio')
plt.ylim([1,5])
for p in ax.patches:
    ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width()/2, p.get_height()+.1), ha='center', va='center')

plt.show()

-  **Central American** countries have a disproportionately higher share of male suicides relative to female suicides.  Overall, males are more likely to commit suicide than females
-  **Asian** countries have a significantly higher incidence of female suicides than other continents

###  Mean Suicide Rates per Continent Over Time
Following figures cover the Global and Continent suicide rates over time.

In [None]:
#  plot Suicide Rates & Population (male & female) over Time
plt.figure(figsize=(12,24))
a = 4   # subplot rows
b = 2   # subplot columns
c = 1   # subplot counter

for i in cont_glob:
    
    if i == 'Global':
        dfx = df   # GLOBAL PLOT
    else:
        dfx = df[df['ContinentName'] == i]

    #  subplots        
    plt.subplot(a,b,c)
    tw1 = sns.lineplot(data = dfx, x='year', y='population', ci = None, hue = 'sex', palette = 'hsv')
    tw2 = tw1.twinx()
    tw2 = sns.lineplot(data = dfx, x='year', y='suicides/100k pop', color ='grey', ci = None, style = True, dashes = [(2,2)])
    tw2 = plt.legend('',frameon=False)
    tw2 = plt.xlim([1985,2015])
    plt.title('{}:  Suicide Rates & Population'.format(i.upper()))
    c = c + 1
    
plt.show()

- **Globally**, the overall suicide trend has been sharply declining since 1995, even though the population has been increasing significantly in recent years
- **Africa** saw a sharp increase in suicides in 2005-2006.  Suicide rates have been erratic in recent years.
- **Asia** is seeing an uptick in suicides in recent years
- **Australia**, **North America** and **South America** are seeing an uptick in suicides, however, rates are still much lower relative to the population growth
-  **Central America** and **Europe** are both seeing an increase in population and a decrease in suicide rates

[go to top of section](#eda)  

## 2.4  Suicide Trends and Age Group<a id="eda_age"></a>
Plot the suicide trends from 1985-2015 for the various age groups.

In [None]:
print('\nGLOBAL SUICIDE RATES PER AGE GROUP\n')
print(f"{'Age Group' : <10}{'MALE': >12}{'FEMALE': >8}")
print('-'*30)
for i in age_list:
    a = df['suicides/100k pop'][(df['age'] == i) & (df['sex'] == 'male')].mean()
    b = df['suicides/100k pop'][(df['age'] == i) & (df['sex'] == 'female')].mean()
    print('{:<14}\t{:>6.2f}\t{:>6.2f}'.format(i,a,b))

In [None]:
#  plot Suicide Rates & Age Group over Time
plt.figure(figsize=(12,60))
a = 8   # subplot rows
b = 2   # subplot columns
c = 1   # subplot counter

for i in cont_glob:
    
    if i == 'Global':
        dfx = df   # GLOBAL PLOT
    else:
        dfx = df[df['ContinentName'] == i]

    #  subplot 1 - male  
    plt.subplot(a,b,c)  # male
    tw1 = sns.lineplot(data = dfx[dfx['sex'] == 'male'], x='year',    y='suicides/100k pop', hue='age', hue_order = age_list, ci = None)
    plt.title('{}:  Suicide Rates & Age (MALE)'.format(i.upper()))
    plt.legend('',frameon=False)
    plt.xlim([1985,2015])
    
    tw2 = tw1.twinx()
    tw2 = sns.lineplot(data = dfx[dfx['sex'] == 'male'], x='year',  y='suicides/100k pop', color ='grey', ci = None, style = True, dashes = [(2,2)])
    plt.legend('',frameon=False)
    plt.xlim([1985,2015])
    plt.ylabel("")
    c = c + 1
    
    #  subplot 2 - female
    plt.subplot(a,b,c)  # female
    tw1 = sns.lineplot(data = dfx[dfx['sex'] == 'female'], x='year',  y='suicides/100k pop', hue='age', hue_order = age_list, ci = None)
    plt.legend(bbox_to_anchor =(1.1, 1))
    plt.xlim([1985,2015])
    plt.ylabel("")
    
    tw2 = tw1.twinx()
    tw2 = sns.lineplot(data = dfx[dfx['sex'] == 'female'], x='year',    y='suicides/100k pop', color ='grey', ci = None, style = True, dashes = [(2,2)])
    plt.title('{}:  Suicide Rates & Age (FEMALE)'.format(i.upper()))
    plt.legend('',frameon=False)
    plt.xlim([1985,2015])
    plt.ylabel("")
    c = c + 1
    
plt.show()

**GLOBAL:** The rate of suicides has been declining for both males and females; however, there has been a significant uptick in rates for females in age groups 55+ in recent years.

- after a spike in **1995**, suicides rates globally have been on a steady decline for all age groups
- there is some indication of an **uptick in suicides** from 2013 in older people
- females in the **5-14 years** age group have a very shallow uptick in suicides for the past few decades
- people within the **50-year span** of 25-74 yrs have a steady incidence of suicides
- People that are **75+ years** have a significantly higher incidence of suicides, for both males and females
<br><br>
- **Africa** saw a significant increase in male suicide in 2005-2006, however there has been a steady decline in suicides for both sexes
- **Asia** has had a steady decline in suicides; however, the trend for males over 75 yrs is increasing
- **Australia** is seeing a significant increase in suicides in people 35-54 yrs old.  Females in age groups 5-14 yr and 55-74 yr were committing suicides at alarming rates in the mid-1990s.  There is an uptrend in suicides in recent years.
- **Central America** overall trend is declining, however there is a significant increase in female 5-14 yr committing suicides in recent years
- **Europe** is seeing an overall decline in suicides, but trend for females 55 and older is increasing
- **North America** is seeing an uptick in males 35 yr and older committing suicides.  The general trend is increasing in recent years
- **South America** has a rapidly declining suicide rate.  However, the trends are very volatile

[go to top of section](#eda)  

## 2.5  Suicide Trends and Human Development Index (HDI)<a id="eda_hdi"></a>
Plot Human Development Index (HDI) for all continents over time.

In [None]:
#  plot Suicide Rates & HDI over Time
plt.figure(figsize=(12,24))
a = 4   # subplot rows
b = 2   # subplot columns
c = 1   # subplot counter

for i in cont_glob:
    
    if i == 'Global':
        dfx = df   # GLOBAL PLOT
    else:
        dfx = df[df['ContinentName'] == i]

    #  subplots        
    plt.subplot(a,b,c)
    tw1 = sns.lineplot(data = dfx, x='year', y='HDI for year')
    
    tw2 = tw1.twinx()
    tw2 = sns.lineplot(data = dfx, x='year', y='suicides/100k pop', color ='grey', ci = None, style = True, dashes = [(2,2)])
    tw2 = plt.legend('',frameon=False)
    tw2 = plt.xlim([1985,2015])
    plt.title('{}:  Suicide Rates & HDI'.format(i.upper()))
    plt.grid()
    c = c + 1
    
plt.show()

- **Human Development Index (HDI):**  Overall, HDI has increased and the suicide rate has been steadily going down.  This suggests an inverse relationship between HDI and suicides; however, there is also a lag between HDI going up and suicides going down

[go to top of section](#eda)  

## 2.6  Suicide Trends and Gross Domestic Product per Capita (GDPc)<a id="eda_gdp"></a>
Plot Gross Domestic Product per Capita (GDPc) for all continents over time.

In [None]:
#  plot Suicide Rates & GDP over Time
plt.figure(figsize=(12,24))
a = 4   # subplot rows
b = 2   # subplot columns
c = 1   # subplot counter

for i in cont_glob:
    
    if i == 'Global':
        dfx = df   # GLOBAL PLOT
    else:
        dfx = df[df['ContinentName'] == i]

    #  subplots        
    plt.subplot(a,b,c)
    tw1 = sns.lineplot(data = dfx, x='year', y='gdp_per_capita ($)')
    
    tw2 = tw1.twinx()
    tw2 = sns.lineplot(data = dfx, x='year', y='suicides/100k pop', color ='grey', ci = None, style = True, dashes = [(2,2)])
    tw2 = plt.legend('',frameon=False)
    tw2 = plt.xlim([1985,2015])
    plt.title('{}:  Suicide Rates & GDP'.format(i.upper()))
    plt.grid()
    c = c + 1
    
plt.show()

- **Gross Domestic Product per Capita (GDPc):**  Overall, GDP per capita has increased and the suicide rate has been steadily going down.  This suggests an inverse relationship between GDP per capita and suicides

[go to top of document](#top)     

---
# 3.  Correlations<a id="corr"></a>
Correlation is a statistical metric for measuring to what extent different variables are interdependent.  In the analysis, we will look at the overall correlation, as well as the correlations based on male/female.

In order to perform correlation, we need to first take care of two very important processes:

  *  Encoding categorical attributes with numerical values
  *  Normalization of the data
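To make the metric concrete, the Pearson correlation coefficient (the default in pandas' `corr`) can be computed by hand and compared with the library result. The two series below are hypothetical numbers chosen only for the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy the deviations from the mean.
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas uses the Pearson method by default.
r_pandas = x.corr(y)

print("manual: {:.4f}   pandas: {:.4f}".format(r_manual, r_pandas))
```

Values close to +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.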

## 3.1  Encoding and Normalization<a id="corr_encode"></a>
### Encoding
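Correlation (like most machine learning algorithms) cannot process categorical or text data until it has been converted to numbers.  One common approach, one-hot encoding, represents each category as a binary vector that is all zeros except for a 1 at the index of that category.  This notebook uses the simpler integer encoding instead, which maps each category to a single integer.  The integers carry an order, which is natural for `age` but arbitrary for `ContinentName`, so correlations with that column should be read with caution.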

Categorical attributes will be manually encoded with numeric values.  The steps involved are:

1.  rearrange column names
2.  encode with numerical values
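As a small sketch of the difference between the two encodings, the example below applies an integer mapping (the approach taken in this notebook) and, for comparison, a one-hot encoding via `pd.get_dummies`. The frame `demo` is a hypothetical toy example.

```python
import pandas as pd

# Hypothetical toy rows using the notebook's category labels.
demo = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": ["5-14 years", "75+ years", "15-24 years"],
})

# Integer encoding: each category becomes one integer; age groups keep their natural order.
age_order = ["5-14 years", "15-24 years", "25-34 years",
             "35-54 years", "55-74 years", "75+ years"]
encoded = demo.assign(
    sex=demo["sex"].map({"female": 0, "male": 1}),
    age=demo["age"].map({label: i for i, label in enumerate(age_order)}),
)
print(encoded)

# One-hot encoding, for comparison: one indicator column per category.
one_hot = pd.get_dummies(demo["sex"], prefix="sex")
print(one_hot)
```

Integer encoding keeps a single column per attribute, which keeps the correlation matrix small. One-hot encoding avoids implying an order between categories.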

In [None]:
# 1.  rearrange column name so "suicides/100k pop" is first
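#     .copy() makes df_corr independent of df, so the column assignments below do not touch a view
df_corr = df[['suicides/100k pop', 'sex', 'age', 'population', 'HDI for year', 'gdp_per_capita ($)', 'generation', 'ContinentName']].copy()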

# 2.  encode with numerical values
df_corr['sex'] = df_corr['sex'].map({'female':0,'male':1})
df_corr['age'] = df_corr['age'].map({
        '5-14 years':0,'15-24 years':1,'25-34 years':2,
        '35-54 years':3,'55-74 years':4,'75+ years':5})
df_corr['generation'] = df_corr['generation'].map({
        'Generation Z':0,'Millenials':1,'Generation X':2,
        'Boomers':3,'Silent':4,'G.I. Generation':5})
df_corr['ContinentName'] = df_corr['ContinentName'].map({
        'Africa':0,'Asia':1,'Australia':2,'Central America':3,
        'Europe':4,'North America':5,'South America':6})

df_corr.info()

###  Normalization
Normalization rescales the data from its original range so that all values fall within a fixed range, typically between 0 and 1.  Putting features on a common scale matters for many machine learning models.  The Pearson correlation itself is unchanged by this kind of linear rescaling, so the step mainly prepares the data for any later modelling.

Data in **df_corr** will be normalized, and the **df_corr** data frame will be replaced with the encoded and normalized data.
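Min-max scaling maps each column with x' = (x - min) / (max - min), which is what `MinMaxScaler` does with its default range of 0 to 1. The sketch below applies the formula directly in pandas on hypothetical toy values and checks that the correlation between two columns is the same before and after scaling.

```python
import pandas as pd

# Hypothetical toy values on very different scales.
demo = pd.DataFrame({
    "population": [1_000.0, 5_000.0, 9_000.0, 3_000.0],
    "rate": [2.0, 6.0, 9.0, 5.0],
})

# Column-wise min-max scaling: x' = (x - min) / (max - min).
scaled = (demo - demo.min()) / (demo.max() - demo.min())
print(scaled)

# Correlation before and after scaling.
r_raw = demo["population"].corr(demo["rate"])
r_scaled = scaled["population"].corr(scaled["rate"])
print("raw: {:.6f}   scaled: {:.6f}".format(r_raw, r_scaled))
```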

In [None]:
from sklearn.preprocessing import MinMaxScaler

df_norm = MinMaxScaler().fit_transform(df_corr)
df_corr = pd.DataFrame(df_norm, index = df_corr.index, columns = df_corr.columns)

[go to top of section](#corr)  

## 3.2  Overall Correlation<a id="corr_over"></a>

In [None]:
#  Correlations - OVERALL
dataCorr = df_corr.corr()

#  print correlation (a bare expression in the middle of a cell is not displayed)
print(dataCorr['suicides/100k pop'].sort_values(ascending=False))

#  plot heatmap
plt.figure(figsize=(8,8))
plt.title('Suicide Correlation', fontsize=14)
sns.heatmap(dataCorr, annot=True, fmt='.2f', square=True, cmap = 'Greens_r')

- **Age**, **generation** and **sex** are significant factors in determining suicide rates.

[go to top of section](#corr)  

## 3.3  Correlation - Male & Female<a id="corr_sex"></a>

In [None]:
#  Correlation MALE - filter dataframe for male/female
df_male   = df_corr[(df_corr['sex'] == 1)]              # male
df_maleCorr = df_male.drop(["sex"], axis=1).corr()      # male corr

#  Correlation FEMALE - filter dataframe for male/female
df_female = df_corr[(df_corr['sex'] == 0)]              # female
df_femaleCorr = df_female.drop(["sex"], axis=1).corr()  # female corr

In [None]:
#  Correlation heatmaps for FEMALE/MALE
fig = plt.figure(figsize=(14,10))
fig.add_subplot(121)
plt.title('Suicide Correlation - MALE', fontsize=14)
sns.heatmap(df_maleCorr, annot=True, fmt='.2f', square=True, cmap = 'Blues_r')
fig.add_subplot(122)
plt.title('Suicide Correlation - FEMALE ', fontsize=14)
sns.heatmap(df_femaleCorr, annot=True, fmt='.2f', square=True, cmap = 'Reds_r')
plt.show()

In [None]:
#  Correlation - sorted for both male/female
corrM = df_maleCorr['suicides/100k pop'].sort_values(ascending=False)
corrF = df_femaleCorr['suicides/100k pop'].sort_values(ascending=False)

corrALL = pd.DataFrame(columns = ['MALE','correlation-m','FEMALE','correlation-f'])
corrALL['MALE']   = corrM.index
corrALL['correlation-m'] = corrM.values
corrALL['FEMALE'] = corrF.index
corrALL['correlation-f'] = corrF.values
print(corrALL)

- As noted above, **age** and **generation** are significant factors in determining suicide rates for both males and females (`sex` is excluded here because each subset contains a single sex).

[go to top of document](#top)     

---
# 4.  World Maps<a id="maps"></a>
Maps are a great way to visually represent geographical data.  The following three world maps are presented:

-  Folium World Map
-  Choropleth World Map
-  Choropleth World Map - Animated

## 4.1  Folium World Map<a id="maps_folium"></a>

Folium map showing the **number of suicides** per country in the dataset.

In [None]:
#  create dataframe for mapping: total suicides and capital coordinates per country
mapdf = (df.groupby('country')
           .agg(suicides_no=('suicides_no', 'sum'),
                lat=('CapitalLatitude', 'first'),
                lon=('CapitalLongitude', 'first'))
           .reset_index())

#  lat/lon must be "float"
mapdf[['lat', 'lon']] = mapdf[['lat', 'lon']].astype(float)


#  make map - popup displays country and suicide count
world_map = folium.Map(location=[mapdf.lat.mean(),mapdf.lon.mean()],zoom_start=2)
marker_cluster = MarkerCluster().add_to(world_map)

#  one marker per country
for i in range(len(mapdf)):
    label = '{}:\t{} suicides'.format(mapdf.country[i].upper(),mapdf.suicides_no[i])
    label = folium.Popup(label, parse_html=True)
    folium.Marker(location=[mapdf.lat[i],mapdf.lon[i]],
            popup = label,
            icon = folium.Icon(color='green')
    ).add_to(marker_cluster)


#  save to the notebook's working directory
world_map.save("world_mapFolium.html")
world_map         #  display map

[go to top of section](#maps) 

## 4.2  Choropleth World Map<a id="maps_chor"></a>
Choropleth map showing the **suicide rates** of the countries in the dataset.

In [None]:
#  create dataframe with Country and mean of Suicide rates per 100k Population
df_choro = df[['suicides/100k pop','country']].groupby(['country']).mean().sort_values(by='suicides/100k pop').reset_index()

#  Update US name to match JSON file
df_choro.replace({'United States':'United States of America'},inplace=True)

#  https://www.kaggle.com/ktochylin/world-countries
world_geo = r'../input/world-countries/world-countries.json'
world_choropelth = folium.Map(location=[0, 0], tiles='Cartodb Positron',zoom_start=1)

#  folium.Choropleth replaces the deprecated Map.choropleth() method
folium.Choropleth(
    geo_data=world_geo,
    data=df_choro,
    columns=['country','suicides/100k pop'],
    key_on='feature.properties.name',
    fill_color='PuBu',  # YlGn
    fill_opacity=0.7, 
    line_opacity=0.2,
    legend_name='Suicide Rates per 100k Population').add_to(world_choropelth)
 
# display map
world_choropelth

[go to top of section](#maps)  

## 4.3  Choropleth World Map - Animated<a id="maps_chorANI"></a>
Animated choropleth map showing the **suicide rates** of the countries over **time**.

In [None]:
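#  one row per country and year (mean over sex and age groups), so each frame has a single value per country
df_anim = df.groupby(['year','country'], as_index=False)['suicides/100k pop'].mean()
world_choropelth_animated = px.choropleth(df_anim.sort_values(by='year'),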
    locations ="country", 
    locationmode = 'country names',
    color ="suicides/100k pop", 
    hover_name ="country",  
    color_continuous_scale = 'PuBu', #px.colors.sequential.Reds, 
    color_continuous_midpoint = 13,
    scope ="world", 
    animation_frame ="year") 
world_choropelth_animated.show()

---
[go to top of document](#top)


###  END