### Preface

This is the first notebook I'm publishing on Kaggle.
Even though I've been learning ML for some time now, and am excited about the theory behind this algorithm family, I've never had the patience to bother with proper data exploration & visualization.
So my purpose here is to dabble with some of the stuff around ML - mainly data visualization & exploration - and to try to find key insights from the data with these methods. The important thing for me here is to learn by experience which insights are relevant and how to visualize them nicely.

I'll update the notebook gradually, and am eager to get feedback and ideas from you guys (on everything - notebook organization & readability, code efficiency, visualization, data exploration). **Any input you give would be greatly appreciated!**

Hope you'll find my learning experience entertaining! Let's start!

# Suicide Rates - Data Visualization

I've picked this dataset because, 1st, it's structured and seems nicely processed, and 2nd, this topic is dear to me as I'm struggling with depression and had a friend who was suicidal.

Let's load the dataset, and import some useful libraries:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import pandas as pd # tabular data processing
import geopandas as gpd # geospatial data processing
import seaborn as sns # easy plotting
sns.set_style("darkgrid")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
print("Available files in kaggle/input:")
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Load suicide data:
suicide_data = pd.read_csv("/kaggle/input/suicide-rates-overview-1985-to-2016/master.csv")

### Data Exploration

First, let's see some of the entries to get a feel for the dataset:

In [None]:
# Display some data
print('There are {} entries in the dataset.'.format(len(suicide_data)))
display(suicide_data.head())

So, basically, the dataset contains the number of suicides partitioned by:
* Country
* Year
* Sex
* Age Group

Also, for each group there's data about:
* The generation the group belongs to.
* The total population of the group (together with suicides_no, this gives the suicide rate in the group).
* The yearly GDP of the country the group belongs to.
* The yearly GDP per capita of the country the group belongs to.
* The HDI ([Human Development Index](https://en.wikipedia.org/wiki/Human_Development_Index)) for the country & year the group belongs to.
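
To make the population note concrete, here is a minimal sketch of turning a suicide count and a population into a rate per 100,000 people. The helper name `rate_per_100k` and the numbers are hypothetical, chosen only for illustration.

```python
def rate_per_100k(suicides_no, population):
    """Suicides per 100,000 people; returns None when the population is zero or missing."""
    if not population:
        return None
    return suicides_no / population * 100_000

# Hypothetical group: 21 suicides in a population of 312,900.
example_rate = rate_per_100k(21, 312_900)
print(round(example_rate, 2))
```

Scaling to 100,000 only makes the numbers easier to read; the plain ratio of suicides to population carries the same information.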

We will now begin exploring the features in the data. We'll start with **country**:

In [None]:
# Explore countries:
countries = suicide_data.country.unique()
print("There are {} different countries in the dataset.".format(len(countries)))

For the fun of it, let's plot the country data on a map. Maybe we can do a color-coded map later, when we want to visualize countries by suicide rate, but for now let's start with a simple plot of the countries in the dataset:

In [None]:
# Show on map
world_data = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
countries_data = world_data.loc[world_data['name'].isin(countries)]
ax = world_data.plot(figsize=(20,20), color='whitesmoke', edgecolor='black', zorder=1)
countries_data.plot(color='lightblue', edgecolor='black', ax=ax)

Some of the countries aren't shown on the map, probably due to a mismatch between the geospatial dataset and the suicide dataset.
Let's check which countries are missing because of this mismatch:

In [None]:
# 'Bad names' - for data cleaning later (might be unnecessary)
print("Countries in the dataset that aren't shown on the map:")
country_not_on_map = sorted(set(countries) - set(world_data['name']))  # sorted for ease of reading
print(country_not_on_map)
print("A total of {} countries are in the dataset but aren't shown on the map.".format(len(country_not_on_map)))

In [None]:
# Just in case, let's not trust the naturalearth_lowres dataset to contain every country;
# check the other direction as well.
print('Countries in the geospatial dataset but not in the suicide dataset:')
country_not_in_data = sorted(set(world_data['name']) - set(countries))  # sorted for ease of reading
print(country_not_in_data)
print("A total of {} countries are in the geospatial dataset but aren't in the suicide dataset.".format(len(country_not_in_data)))

Ok, so we can sort the missing countries into 3 categories:
1. Countries that are missing from the suicide dataset (such as Chad). These countries might have a non-transparent policy regarding the data, or simply weren't added to the dataset.
2. Countries that are missing from the geospatial dataset (such as Malta). These are probably missing due to the low resolution of the geospatial dataset.
3. Countries that are in both datasets but are named differently (such as Bosnia and Herzegovina).

For group 1 we can try adding data from different sources (but that can lead to another problem, as each source might have different criteria for 'suicide').<br/><br/>
For group 2 we can search for a different geospatial dataset. This isn't as important though, as the small countries would be hard to notice anyway (sorry Malta :/).<br/><br/>
For group 3 we can clean the data (simply aligning the names). We'll do so now as it shouldn't take too long.

In [None]:
# Align mismatched names in the geospatial data:
world_data.loc[world_data['name'] == 'Bosnia and Herz.', 'name'] = 'Bosnia and Herzegovina'
world_data.loc[world_data['name'] == 'Czechia', 'name'] = 'Czech Republic'
world_data.loc[world_data['name'] == 'South Korea', 'name'] = 'Republic of Korea'
world_data.loc[world_data['name'] == 'Russia', 'name'] = 'Russian Federation'
world_data.loc[world_data['name'] == 'United States of America', 'name'] = 'United States'

# Replot map:
countries_data = world_data.loc[world_data['name'].isin(countries)]
ax = world_data.plot(figsize=(20,20), color='whitesmoke', edgecolor='black', zorder=1)
countries_data.plot(color='lightblue', edgecolor='black', ax=ax)
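
The renames in the cell above can also be kept in a single lookup table, which is easier to extend when more mismatches turn up. This is only a sketch: the dictionary name `NAME_FIXES`, the helper `align_name`, and the sample list are hypothetical, while the name pairs themselves are the ones used above.

```python
# Geospatial name -> suicide-dataset name, taken from the renames in the cell above.
NAME_FIXES = {
    'Bosnia and Herz.': 'Bosnia and Herzegovina',
    'Czechia': 'Czech Republic',
    'South Korea': 'Republic of Korea',
    'Russia': 'Russian Federation',
    'United States of America': 'United States',
}

def align_name(name):
    """Return the suicide-dataset spelling for a geospatial name, or the name unchanged."""
    return NAME_FIXES.get(name, name)

# Small hypothetical sample of geospatial names:
geo_names = ['Czechia', 'Russia', 'Chad']
aligned = [align_name(n) for n in geo_names]
print(aligned)
```

With pandas, the same dictionary could be passed to `Series.replace`, for example `world_data['name'] = world_data['name'].replace(NAME_FIXES)`.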

We now move from the **country** feature to the **year** feature.

In [None]:
# Explore year feature in the dataset (one bar per year; histplot replaces the deprecated distplot):
sns.histplot(suicide_data.year, discrete=True)

We see that the data isn't evenly spread across years. There are missing values, especially pre-1995. This may be because, for some countries, data for some years isn't available in our dataset.
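
One way to check that guess is to count how many distinct countries have entries in each year. The sketch below uses plain Python on hypothetical `(country, year)` pairs; the function name `countries_per_year` and the sample rows are made up for illustration.

```python
from collections import defaultdict

def countries_per_year(rows):
    """Count distinct countries that have at least one entry in each year."""
    seen = defaultdict(set)
    for country, year in rows:
        seen[year].add(country)
    return {year: len(names) for year, names in sorted(seen.items())}

# Hypothetical (country, year) pairs standing in for the real rows:
sample_rows = [
    ('Albania', 1987), ('Albania', 1988),
    ('Austria', 1987), ('Austria', 1988), ('Austria', 1989),
]
coverage = countries_per_year(sample_rows)
print(coverage)
```

On the real frame, `suicide_data.groupby('year')['country'].nunique()` should give the same kind of count.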

I have no further ideas for exploring the year values right now, so let's move on to sex:

In [None]:
# Explore sex:
print('Unique genders in our dataset: ' + str(suicide_data.sex.unique()))
print('Statistics: ')
display(suicide_data.sex.value_counts())

Male and female counts are the same, each exactly half of all the entries. This suggests that each male group has its female counterpart (and vice versa), so the data is balanced with regard to sex.
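
Equal counts are consistent with perfect pairing but don't strictly prove it, so here is a small sketch of a direct check. It assumes each row can be reduced to `(country, year, sex, age)` and that the sex labels are `'male'` and `'female'`; the function name and sample rows are hypothetical.

```python
def unpaired_groups(rows):
    """Return (country, year, age) keys that appear for one sex but not the other."""
    male = {(c, y, a) for c, y, sex, a in rows if sex == 'male'}
    female = {(c, y, a) for c, y, sex, a in rows if sex == 'female'}
    return male ^ female  # symmetric difference: present in exactly one of the two sets

# Hypothetical rows; the last one has no female counterpart.
sample_rows = [
    ('Albania', 1987, 'male', '15-24 years'),
    ('Albania', 1987, 'female', '15-24 years'),
    ('Albania', 1987, 'male', '75+ years'),
]
missing = unpaired_groups(sample_rows)
print(missing)
```

An empty result on the real data would confirm that every group comes in a male/female pair.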

Forward, to age groups!

In [None]:
# Explore age:
print('Unique age groups in our dataset: ' + str(suicide_data.age.unique()))
print('Statistics: ')
display(suicide_data.age.value_counts())

The 5-14 age group is slightly under-represented. The missing entries can mean that no suicides happened in some countries for this age group, or that the data is simply missing.  
I'm renaming this group to '05-14' so that the default string ordering matches the order of the age groups.

In [None]:
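# Replace 5-14 by 05-14 in order to get the right ordering
# (assign back instead of inplace=True, which is unreliable on a selected column in recent pandas):
suicide_data['age'] = suicide_data['age'].replace(to_replace='5-14 years', value='05-14 years')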
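
Why the zero-padding matters can be seen with plain `sorted`. In this sketch, only `'5-14 years'` is taken from the cell above; the other age labels are assumed to follow the same pattern.

```python
# Age labels; every label except '5-14 years' is an assumption about the dataset.
age_groups = ['15-24 years', '25-34 years', '35-54 years',
              '5-14 years', '55-74 years', '75+ years']

# Plain string sorting compares character by character, so '5-14 years'
# lands between '35-54 years' and '55-74 years'.
before = sorted(age_groups)

# After zero-padding, the youngest group sorts first.
padded = ['05-14 years' if g == '5-14 years' else g for g in age_groups]
after = sorted(padded)

print(before)
print(after)
```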

Lastly, generation:

In [None]:
# Explore generation:
print('Unique generations in our dataset: ' + str(suicide_data.generation.unique()))

In [None]:
# Sorted generations in our dataset:
sorted_generations = ['G.I. Generation', 'Silent', 'Boomers', 'Generation X', 'Millenials', 'Generation Z']

In [None]:
chart = sns.catplot(x="generation", kind="count", order=sorted_generations, palette="rocket", data=suicide_data)
chart.set_xticklabels(
    rotation=90, 
    fontweight='light',
    fontsize='large')

There is an understandable under-representation of the 2 extreme generations (the time frame 1985 - 2016 spans ~4 age groups, but for the G.I. and Z generations there are probably only 1 or 2 age groups that overlap this time frame).  
There is also a slight under-representation of the Boomers, though this is presumably due to missing data.

I'm finished for now with exploring the feature values. Let's start visualizing some interesting questions about the data.
(I'll mainly be interested in the suicide rate, as this is the easiest 'fair' indication of suicide risk.)

# Year

In [None]:
# Are suicide rates growing each year?
# Rearrange dataframe using aggregate:
suicide_data_year_reaarange = suicide_data.groupby('year').agg({'suicides_no' : 'sum', 'population' : 'sum'})
# Use year as regular column (and not index)
suicide_data_year_reaarange = suicide_data_year_reaarange.reset_index()
# Re-create the rate
suicide_data_year_reaarange['suicide_rate'] = suicide_data_year_reaarange['suicides_no'] / suicide_data_year_reaarange['population']
suicide_data_year_reaarange.head()
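
The cell above sums suicides and population first and divides afterwards. A small sketch with hypothetical numbers shows why that differs from averaging per-group rates: the pooled rate weights each group by its population, while a plain mean treats a small group like a large one.

```python
# Hypothetical groups: (suicides_no, population)
groups = [(10, 100_000), (300, 1_000_000)]

# Pooled rate: total suicides over total population (what sum / sum computes).
pooled_rate = sum(s for s, _ in groups) / sum(p for _, p in groups)

# Unweighted mean of per-group rates: ignores how large each group is.
mean_of_rates = sum(s / p for s, p in groups) / len(groups)

print(pooled_rate, mean_of_rates)
```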

In [None]:
sns.regplot(x=suicide_data_year_reaarange['year'],
            y=suicide_data_year_reaarange['suicide_rate'])

The linear regression clearly doesn't fit here. There is a spike in the suicide rate in 1995, and then a gradual decline. A 2-piece linear model would capture this behavior better.

I don't have an explanation for this behavior in mind right now; we might need to explore more of the data to answer this. (It can be a real phenomenon, or a bias of our dataset.)
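
As a rough sketch of the 2-piece idea, the code below fits one least-squares line up to a split year and another from that year onward. Everything here is an assumption made for illustration: the split at 1995 comes from the spike noted above, the series is synthetic, and the two segments are fitted independently (nothing forces them to meet at the split).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def two_piece_fit(xs, ys, split_year):
    """Fit one line for x <= split_year and another for x >= split_year."""
    left = [(x, y) for x, y in zip(xs, ys) if x <= split_year]
    right = [(x, y) for x, y in zip(xs, ys) if x >= split_year]
    return fit_line(*zip(*left)), fit_line(*zip(*right))

# Synthetic series shaped like the description: rise until 1995, then decline.
years = list(range(1985, 2017))
rates = [10 + 0.3 * (y - 1985) if y <= 1995 else 13 - 0.1 * (y - 1995) for y in years]

(rise_slope, _), (fall_slope, _) = two_piece_fit(years, rates, split_year=1995)
print(round(rise_slope, 3), round(fall_slope, 3))
```

On real data the split year would itself need choosing, for example by trying several candidates and keeping the one with the smallest total squared error.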

# Sex

In [None]:
# Let's do the same, but with regard to sex:
# Rearrange dataframe using aggregate:
suicide_data_year_sex_reaarange = suicide_data.groupby(['year', 'sex']).agg({'suicides_no' : 'sum', 'population' : 'sum'})
# Use year as regular column (and not index)
suicide_data_year_sex_reaarange = suicide_data_year_sex_reaarange.reset_index()
# Re-create the rate
suicide_data_year_sex_reaarange['suicide_rate'] = suicide_data_year_sex_reaarange['suicides_no'] / suicide_data_year_sex_reaarange['population']
suicide_data_year_sex_reaarange.head()

In [None]:
sns.lmplot(x='year',
           y='suicide_rate',
           hue='sex',
           data=suicide_data_year_sex_reaarange)

The linear model fits the female suicide rates well. I'm happy to see it's a gradual decline (though there are oscillations around the linear trend line).  

We can clearly see that the shape of the overall suicide plot is dictated mainly by the shape of the male graph. This is understandable because there are roughly the same number of males and females, and suicide rates for males are substantially higher than for females.

In [None]:
sns.swarmplot(x=suicide_data_year_sex_reaarange['sex'],
              y=suicide_data_year_sex_reaarange['suicide_rate'])

The swarmplot conveys that males have higher suicide rates than females. Nevertheless, this kind of plot isn't very good here, as we lose the ordering of the year feature.

In [None]:
# TODO: Find a better idea for a plot that focuses solely on males vs females suicide rates.
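
One possible direction for the TODO above, offered as a suggestion rather than a settled choice: reduce the comparison to a single number per year, the male-to-female rate ratio, and plot that against year. The sketch below only computes the ratio; the function name and the sample rates are hypothetical.

```python
def male_female_ratio(rates):
    """Map year -> male rate / female rate, skipping years with a missing or zero female rate."""
    ratios = {}
    for year, by_sex in sorted(rates.items()):
        male, female = by_sex.get('male'), by_sex.get('female')
        if male is not None and female:
            ratios[year] = male / female
    return ratios

# Hypothetical rates per person, keyed by year and then by sex:
sample_rates = {
    1990: {'male': 0.00020, 'female': 0.00005},
    1991: {'male': 0.00021, 'female': 0.00006},
    1992: {'male': 0.00022},  # no female value, so this year is skipped
}
ratios = male_female_ratio(sample_rates)
print(ratios)
```

Drawing these ratios as a single line over the years would keep the year ordering that the swarmplot loses.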

# Age

In [None]:
# Let's do the same, but with regard to age:
# Rearrange dataframe using aggregate:
suicide_data_year_age_reaarange = suicide_data.groupby(['year', 'age']).agg({'suicides_no' : 'sum', 'population' : 'sum'})
# Use year as regular column (and not index)
suicide_data_year_age_reaarange = suicide_data_year_age_reaarange.reset_index()
# Re-create the rate
suicide_data_year_age_reaarange['suicide_rate'] = suicide_data_year_age_reaarange['suicides_no'] / suicide_data_year_age_reaarange['population']
suicide_data_year_age_reaarange.head()

In [None]:
sns.lmplot(x='year',
           y='suicide_rate',
           hue='age',
           data=suicide_data_year_age_reaarange)

This is the most interesting plot I've generated yet. It visualizes a number of insights nicely:
* Understandably, the older the age group, the higher the risk of suicide. (Older people may be less hopeful that their lives will change for the better, can expect to live less due to aging and sickness, and objectively face a lot more physical hardship.)
* The older the age group, the more steeply the trend line declines (the 5-14 age group even sees a rise in suicide rates in recent years). This might be explained by technological advancement that betters the lives of the old but doesn't do much for the hardships associated with other age groups (financial strain, social hardships, etc.).

In [None]:
sns.swarmplot(x=suicide_data_year_age_reaarange['age'],
              y=suicide_data_year_age_reaarange['suicide_rate'])

Again, the swarmplot conveys which age-group has a higher suicide rate, disregarding the ordering of the year.

In [None]:
# Refer only to age
suicide_data_age_reaarange = suicide_data_year_age_reaarange.groupby('age').agg({'suicides_no' : 'sum', 'population' : 'sum'})
# Use age as regular column (and not index)
suicide_data_age_reaarange = suicide_data_age_reaarange.reset_index()
# Re-create the rate
suicide_data_age_reaarange['suicide_rate'] = suicide_data_age_reaarange['suicides_no'] / suicide_data_age_reaarange['population']
sns.scatterplot(x=suicide_data_age_reaarange['age'],
                y=suicide_data_age_reaarange['suicide_rate'])
# TODO: How to add a trend line, as regplot seemingly doesn't work with a non-numeric x-axis?
# TODO: Use the same coloring here
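
A possible way around the trend-line TODO above is to map the ordered age groups to integer positions and fit the line on those positions. This is a sketch under stated assumptions: the age labels and the rates are hypothetical, and treating the groups as equally spaced ignores that they cover different numbers of years.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Ordered age labels (assumed) and made-up rates per 100k, one per group.
age_order = ['05-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']
rates = [0.5, 9.0, 12.0, 15.0, 16.0, 24.0]

# Replace each category by its position in the ordering: 0, 1, 2, ...
positions = list(range(len(age_order)))
slope, intercept = fit_line(positions, rates)
trend = [intercept + slope * p for p in positions]

for label, value in zip(age_order, trend):
    print(label, round(value, 2))
```

For plotting, the numeric `positions` could go on the x-axis with the tick labels then set to `age_order`, so the axis still shows the group names.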

## Notes to self / further ideas / tasks

Interesting Questions About the Data:
* Are suicide rates growing each year?
* Does the trend differ between culture groups?
* Does the rate differ with age?
* Does the rate differ with generation?
* Does the rate differ with sex?
* Does the rate differ with HDI?
* Does the rate differ with GDP?
* Does the rate differ with population? (should try to eliminate the year factor as much as possible)
* By country

Tasks:
* Explore missing values
* Heatmap for suicides