# Introduction

In this data story, I show long-term trends in the number of suicides and the differences between countries.

The dataset and its description are available on [Kaggle](https://www.kaggle.com/szamil/who-suicide-statistics).

# Data preparation

In [None]:
# import pandas library
import pandas as pd

In [None]:
# read the worldwide suicide statistics data
df = pd.read_csv('../input/who-suicide-statistics/who_suicide_statistics.csv')


View the head and tail of the data.

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# select the rows recorded for 2016
invalid_data = df[df.year == 2016]
invalid_data.head(20)

Drop the 2016 rows, since the dataset contains only a small number of suicide records for that year.
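
One way to check that a year is sparsely populated is to compare the number of rows per year. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset), and the 0.5 threshold is an arbitrary assumption.

```python
import pandas as pd

# Toy data standing in for the real table; the values are made up.
toy = pd.DataFrame({
    "year": [2014, 2014, 2014, 2015, 2015, 2015, 2016],
    "suicides_no": [10, 12, 9, 11, 13, 8, 2],
})

# Number of rows recorded for each year, ordered by year.
rows_per_year = toy.groupby("year").size()
print(rows_per_year)

# Years whose row count falls below half of the median count
# (the 0.5 factor is an arbitrary choice for illustration).
sparse_years = rows_per_year[rows_per_year < 0.5 * rows_per_year.median()]
print(list(sparse_years.index))
```

On the real table, the same `groupby("year").size()` call applied to `df` would show how 2016 compares with the other years.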

In [None]:
# drop 2016 year data
df.drop(invalid_data.index, axis = 0, inplace = True)

In [None]:
df.info()

Drop rows with NaN values and check for duplicate rows.

In [None]:
# drop NaNs
df.dropna(axis=0,inplace =True)
df.isnull().sum()

In [None]:
import numpy as np
# finding duplicates
duplicate = df.duplicated()
np.unique(duplicate)
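
The cell above only reports whether duplicates exist. If any were found, they could be removed as in the following minimal sketch, which uses a made-up frame with one repeated row.

```python
import pandas as pd

# Toy frame with one exact duplicate row; the values are made up.
toy = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2000, 2001],
    "suicides_no": [5.0, 5.0, 7.0],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(len(deduped))
```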

In [None]:
# reset index from 0
df = df.reset_index(drop=True)
df.head()

In [None]:
# check total rows in the table
df.tail()

Since the sex and age attributes are object types, I encode them as integers for further visualization, even though I do not use these two attributes in this storytelling assignment.

In [None]:
# Labeling by using LabelEncoder
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
df.sex = le.fit_transform(df.sex) # female:0 , male:1
df.age = le.fit_transform(df.age) # 15-24: 0, 25-34:1, 35-54:2 , 5-14:3, 55-74:4, 75+:5
df.head(20)
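
Because the same `LabelEncoder` is fitted again for `age`, the mapping for `sex` is no longer stored on `le` afterwards. One way to keep a mapping visible is to build it explicitly. The sketch below uses `pd.Categorical` instead of scikit-learn (a swapped-in technique, not the method used above); with alphabetically sorted categories it assigns codes in the same order as the comment above. The age labels are abbreviated as in that comment and may differ from the exact strings in the dataset.

```python
import pandas as pd

# Age-group labels as abbreviated in the notebook's comment.
ages = pd.Series(["5-14", "15-24", "75+", "35-54", "25-34", "55-74", "5-14"])

# Sorted unique labels play the role of the encoder's class list.
categories = sorted(ages.unique())
codes = pd.Categorical(ages, categories=categories).codes.tolist()

# Explicit label -> integer mapping that can be kept for later decoding.
age_mapping = {label: code for code, label in enumerate(categories)}
print(age_mapping)
print(codes)
```

Note that the sort is alphabetical, so `5-14` lands after `35-54`; the integer codes therefore do not follow the natural order of the age groups.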

In [None]:
df.info()

The data is now prepared.

Next, I visualize the data using four types of storytelling.

# Visualization: 4 types of storytelling

## 1. Change over time

When we think about suicide worldwide, a natural first question is how many suicides have occurred in recent years.

I use the change-over-time type to explore the question: "How does the global number of suicides change from year to year?"

In [None]:
# import the matplotlib library for visualization
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
# The total number of suicides worldwide each year
df_year = df.groupby("year")[['suicides_no']].sum()
df_year = df_year.sort_index()
plt.figure(figsize=(16,8))
plt.bar(np.arange(len(df_year)), df_year.suicides_no)
plt.xticks(np.arange(len(df_year)), (df_year.index),rotation=90)
plt.title("Number of Suicides Worldwide by Year")
plt.show()

- The bar plot above shows that the total number of suicides recorded worldwide is quite high throughout 1979 to 2015.
- The totals for 1983 and 1984 are lower, and from 1988 to 2003 the total grows year by year.
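
Observations such as "grows year by year" can be checked numerically rather than read off the bars. The sketch below computes the change against the previous year with `Series.diff()`; the totals are made up, and on the real data the series would be `df_year.suicides_no`.

```python
import pandas as pd

# Made-up yearly totals; the real values come from df_year in the notebook.
totals = pd.Series([100, 90, 120, 150], index=[1983, 1984, 1985, 1986],
                   name="suicides_no")

# Absolute change versus the previous year (the first year has no predecessor).
yearly_change = totals.diff()

# Years in which the total grew compared with the year before.
growing_years = yearly_change[yearly_change > 0].index.tolist()
print(growing_years)
```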

## 2. Drilling down

From the worldwide totals alone, it is hard to guess the numbers for a specific country, such as the USA.

Next, I drill down into the data so that it includes only the USA, and look at how the number changes year by year.

First, I prepare a table that contains only the US data.

In [None]:
df_usa = df.loc[df['country']=='United States of America'].groupby("year")[['suicides_no']].sum()
df_usa = df_usa.sort_index().reset_index()

In [None]:
df_usa.head()

Import the seaborn library.

In [None]:
# import the seaborn library to visualize
import seaborn as sns

In [None]:
plt.figure(figsize=(16,8))
usa_year_s = sns.barplot(x='year',y='suicides_no',data=df_usa, palette='Blues')
                
usa_year_s.set_xticklabels(df_usa.year, rotation=90)
usa_year_s.set_title('Number of Suicides per Year: USA')

The bar chart above shows that the number of suicides in the USA trends upward, especially from 2003 to 2015.
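
The chart plots raw counts, which can rise simply because the population grows. A rate per 100,000 people separates the two effects. The sketch below is a minimal illustration on made-up rows; it assumes the table has a `population` column giving the population of each row's group, which is not shown in this notebook.

```python
import pandas as pd

# Made-up rows; `population` is an assumed column name.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "suicides_no": [50, 150, 60, 180],
    "population": [1_000_000, 1_000_000, 1_000_000, 1_400_000],
})

# Sum counts and population per year, then scale to a per-100,000 rate.
per_year = toy.groupby("year")[["suicides_no", "population"]].sum()
per_year["rate_per_100k"] = (
    per_year["suicides_no"] / per_year["population"] * 100_000
)
print(per_year)
```

In this toy example the count rises from 200 to 240 while the rate stays at about 10 per 100,000, so a rising count alone does not imply a rising rate.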

## 3. Zooming out

However, so far we can only see the trend in the number of suicides for the USA. What about other countries?

Next, I compare the total number of suicides in each country on a world map.

First, I prepare the data by summing the number of suicides for each country.

In [None]:
count_max_sui = df.groupby('country')['suicides_no'].sum().reset_index()
count_max_sui


Import the plotly library for further visualization.

In [None]:
# import the plotly library and enable offline rendering in the notebook
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

In [None]:
count = [ dict(
        type = 'choropleth',
        locations = count_max_sui['country'],
        locationmode='country names',
        z = count_max_sui['suicides_no'],
        text = count_max_sui['country'],
        colorscale = 'Viridis',
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
        colorbar = dict(
            title = 'Suicides Country-based'),
      ) ]
layout = dict(
    title = 'Suicides happening across the Globe',
    geo = dict(
        showframe = True,
        showcoastlines = True,
        projection = dict(
            type = 'mercator'
        )
    )
)
fig = dict( data=count, layout=layout )
iplot( fig, validate=False, filename='d3-world-map' )

From the interactive plotly chart above, we can see that Russia has the highest total number of suicides, while the situation in the USA looks less serious by comparison.
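
A colour scale makes it hard to read off an exact ranking, so a sorted table is a useful cross-check on the map. The sketch below ranks made-up totals with `DataFrame.nlargest`; the country names are placeholders, and on the real data the frame would be `count_max_sui`.

```python
import pandas as pd

# Made-up totals with placeholder country names.
totals = pd.DataFrame({
    "country": ["Country A", "Country B", "Country C", "Country D"],
    "suicides_no": [1200, 300, 4500, 800],
})

# The n rows with the largest totals, in descending order.
top = totals.nlargest(2, "suicides_no").reset_index(drop=True)
print(top)
```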

## 4. Intersection

Then we may wonder: does Russia have the highest number of suicides every year?

To find out, I compare the number of suicides in the USA and Russia for every year.

In [None]:
df_russia = df.loc[df['country']=='Russian Federation'].groupby("year")[['suicides_no']].sum()
df_russia = df_russia.sort_index().reset_index()

In [None]:
df_russia.head()

Import the plotly graph objects for the line chart.

In [None]:
# import graph objects as "go"
import plotly.graph_objs as go

In [None]:
# Creating trace1
trace1 = go.Scatter(
                    x = df_usa.year,
                    y = df_usa.suicides_no,
                    mode = "lines",
                    name = "USA",
                    marker = dict(color = 'rgba(16, 112, 2, 0.8)'),
                    text= df_usa.suicides_no,
                    )
# Creating trace2
trace2 = go.Scatter(
                    x = df_russia.year,
                    y = df_russia.suicides_no,
                    mode = "lines",
                    name = "Russia",
                    marker = dict(color = 'rgba(80, 26, 80, 0.8)'),
                    text= df_russia.suicides_no,
                    )
data = [trace1, trace2]
layout = dict(title = 'Suicides over the years: Russia vs USA',
              xaxis= dict(title= 'Year',ticklen= 5,zeroline= False),
              yaxis= dict(title= 'Number of Suicides')
             )
fig = dict(data = data, layout = layout)
iplot(fig)

From the line chart above, we can tell that the number of suicides in Russia is higher than in the USA from 1980 to 2009. However, the number in Russia has been decreasing since 1999. In 2010, the number of suicides in Russia falls below that of the USA, and it continues to decrease year by year.
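
The year in which the two lines cross can also be found programmatically. The sketch below merges two per-year tables on `year` and picks the first year in which one series exceeds the other; the numbers are made up, and on the real data the inputs would be `df_usa` and `df_russia`.

```python
import pandas as pd

# Made-up yearly totals for two countries.
a = pd.DataFrame({"year": [2008, 2009, 2010, 2011],
                  "suicides_no": [30, 32, 35, 36]})
b = pd.DataFrame({"year": [2008, 2009, 2010, 2011],
                  "suicides_no": [40, 36, 33, 30]})

# Align on year; the suffixes distinguish the two count columns.
merged = a.merge(b, on="year", suffixes=("_a", "_b"))

# First year in which series a exceeds series b, or None if it never does.
crossed = merged.loc[merged["suicides_no_a"] > merged["suicides_no_b"], "year"]
first_cross = int(crossed.iloc[0]) if not crossed.empty else None
print(first_cross)
```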

# Conclusion

By using the four types of storytelling, we can present the dataset in a sequence that reads as a story, which helps draw the audience's attention to the data.

# Reference

https://www.kaggle.com/kanncaa1/plotly-tutorial-for-beginners

https://plotly.com/python/maps/