# WHO Data Analysis and Visualization

This article is a translation of the Kaggle notebook "WHO Suicide Analysis": https://www.kaggle.com/roshansharma/who-suicide-analysis. I would like to express my deep respect to Roshan Sharma, who wrote it.

<img src="https://www.thenewsminute.com/sites/default/files/styles/news_detail/public/Jaseem_Rep_image_Hanging_Stool_Rope_Suicide_Death_750x500.jpg?itok=YulbRMPQ" width="800px">

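### Background:


* The World Health Organization (WHO) estimates that about one million people die by suicide every year. This corresponds to a mortality rate of 16 per 100,000 people, or one death every 40 seconds. By 2020, the rate is predicted to increase to one death every 20 seconds.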

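### WHO further reports that:

* Over the past 45 years, suicide rates have increased by 60% worldwide. Suicide is now one of the three leading causes of death among people aged 15-44 (both men and women). Suicide attempts are up to 20 times more frequent than completed suicides.

* Suicide rates have traditionally been highest among elderly men, but rates among young people have risen to the point that they are now the group at highest risk in a third of all countries.

* Mental disorders (particularly depression and substance abuse) are associated with more than 90% of all cases of suicide.

* However, suicide results from many complex sociocultural factors and is more likely to occur during periods of socioeconomic, family, and personal crisis (for example, loss of a loved one, unemployment, sexual orientation, difficulties in developing one's identity, separation from one's community or other social or belief group, and honour).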

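### WHO also states that:

* In Europe, particularly Eastern Europe, the highest suicide rates are reported for both men and women.

* The Eastern Mediterranean Region and the Central Asian republics have the lowest suicide rates.

* Nearly 30% of all suicides worldwide occur in India and China.

* Suicides worldwide by age are distributed as follows: 55% are aged 15-44 and 45% are aged 45 and over.

* Youth suicide is increasing at the greatest rate.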

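#### In the US, the Centers for Disease Control and Prevention reports that:

* Overall, suicide is the eleventh leading cause of death for all Americans and the third leading cause of death for young people aged 15-24.

* Although suicide is a serious problem among young people and adults, death rates are highest among older adults aged 65 and over.

* Men are four times more likely than women to die by suicide. However, women are more likely than men to attempt suicide.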

# Installing the Libraries

In [None]:
!pip install bubbly

In [None]:
# for basic operations
import numpy as np
import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# for interactive visualizations
import plotly.offline as py
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import plotly.figure_factory as ff
init_notebook_mode(connected=True)

from bubbly.bubbly import bubbleplot

# for providing path
import os
print(os.listdir('../input/'))

**Importing the dataset**

In [None]:
data = pd.read_csv('../input/who_suicide_statistics.csv')

data = data.sort_values(['year'], ascending = True)

print(data.shape)

In [None]:
# let's check the total number of countries' data available for suicidal analysis

print("No. of Countries available for analysis :", data['country'].nunique())


In [None]:
# checking the head of the table

dat = ff.create_table(data.head())
py.iplot(dat)

In [None]:
# let's describe the data

dat = ff.create_table(data.describe())
py.iplot(dat)


In [None]:
# renaming the columns

data.rename({'sex' : 'gender', 'suicides_no' : 'suicides'}, inplace = True, axis = 1)

data.columns

In [None]:
# checking the null values in the dataset

data.isnull().sum()

In [None]:
# filling missing values
# (assigning the result back avoids relying on in-place updates of a column)

data['suicides'] = data['suicides'].fillna(0)
# the fill value was presumably taken from data['population'].mean()
data['population'] = data['population'].fillna(1664090)

# checking if there is any null value left
print(data.isnull().sum().sum())

# converting these attributes into integer format
data['suicides'] = data['suicides'].astype(int)
data['population'] = data['population'].astype(int)
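
The population fill value above is typed in by hand. A minimal sketch of deriving it from the data instead, using a small invented frame (only the column names mirror the dataset; the numbers are made up for illustration):

```python
import numpy as np
import pandas as pd

# toy frame with invented values; only the column names mirror the dataset
toy = pd.DataFrame({
    'suicides':   [10.0, np.nan, 4.0, np.nan],
    'population': [1000.0, 3000.0, np.nan, 2000.0],
})

# missing suicide counts are treated as zero, as in the cell above
toy['suicides'] = toy['suicides'].fillna(0)

# compute the mean from the observed values instead of typing it in
pop_mean = toy['population'].mean()
toy['population'] = toy['population'].fillna(pop_mean)

# both columns can now be stored as integers
toy = toy.astype({'suicides': int, 'population': int})
print(toy)
```

Filling a missing population with a global mean is a blunt choice, since the mean of all countries and age groups may be far from the true value for any single row. Dropping such rows is another option worth considering.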

## Data Visualization

In [None]:
import warnings
warnings.filterwarnings('ignore')

figure = bubbleplot(dataset = data, x_column = 'suicides', y_column ='population', 
    bubble_column = 'country',  color_column = 'country', 
    x_title = "Number of Suicides", y_title = "Population", title = 'Population vs Suicides',
    x_logscale = False, scale_bubble = 3, height = 650)

py.iplot(figure, config={'scrollZoom': True})


* Looking at the plot above, we can conclude that the number of suicides in African and Asian regions is much higher than in American and European regions.

In [None]:
# visualising the different countries distribution in the dataset

# note: on Matplotlib >= 3.6 this style is named 'seaborn-v0_8-dark'
plt.style.use('seaborn-dark')
plt.rcParams['figure.figsize'] = (15, 9)

x = pd.DataFrame(data.groupby(['country'])['suicides'].sum().reset_index())
x.sort_values(by = ['suicides'], ascending = False, inplace = True)

# plot only the ten countries with the highest totals
sns.barplot(x = 'country', y = 'suicides', data = x.head(10), palette = 'winter')
plt.title('Top 10 Countries in Suicides', fontsize = 20)
plt.xlabel('Name of Country')
plt.xticks(rotation = 90)
plt.ylabel('Count')
plt.show()


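* Here we can see the top 10 countries by total number of suicides, summed over all years in the dataset. Russia and the United States, which are among the most powerful countries in terms of GDP, employment, growth, economy, and luxury and are rated among the most livable in the world, also top the list in terms of suicides.

* Possible reasons include unemployment, given the very high cost of living in these countries, as well as drugs, relationship problems, and family-related issues.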
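
Because the ranking above uses absolute counts, populous countries tend to rise to the top. A rough sketch of a rate per 100,000 people, the measure quoted in the background section, is shown here with invented numbers and hypothetical country names `A` and `B`:

```python
import pandas as pd

# invented rows: one row per country and demographic group
toy = pd.DataFrame({
    'country':    ['A', 'A', 'B', 'B'],
    'suicides':   [300, 100, 50, 30],
    'population': [2_000_000, 2_000_000, 100_000, 100_000],
})

# sum counts and populations per country, then express deaths per 100,000 people
totals = toy.groupby('country')[['suicides', 'population']].sum()
totals['rate_per_100k'] = totals['suicides'] / totals['population'] * 100_000

ranked = totals.sort_values('rate_per_100k', ascending=False)
print(ranked)
```

In this toy example country `A` has five times as many suicides as `B`, yet `B` has the higher rate. On the real data, rows whose population was filled with a placeholder value would distort such a rate and may need to be excluded first.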

In [None]:
# visualising the different year distribution in the dataset

plt.style.use('seaborn-dark')
plt.rcParams['figure.figsize'] = (18, 9)

x = pd.DataFrame(data.groupby(['year'])['suicides'].sum().reset_index())
x.sort_values(by = ['suicides'], ascending = False, inplace = True)

sns.barplot(x = 'year', y = 'suicides', data = x, palette = 'cool')
plt.title('Distribution of suicides from the year 1985 to 2016', fontsize = 20)
plt.xlabel('year')
plt.xticks(rotation = 90)
plt.ylabel('count')
plt.show()

In [None]:

color = plt.cm.Blues(np.linspace(0, 1, 2))
data['gender'].value_counts().plot.pie(colors = color, figsize = (10, 10), startangle = 75)

plt.title('Gender', fontsize = 20)
plt.axis('off')
plt.show()

In [None]:
# visualising the distribution of suicides by gender

plt.style.use('seaborn-dark')
plt.rcParams['figure.figsize'] = (18, 9)

x = pd.DataFrame(data.groupby(['gender'])['suicides'].sum().reset_index())
x.sort_values(by = ['suicides'], ascending = False, inplace = True)

sns.barplot(x = 'gender', y = 'suicides', data = x, palette = 'afmhot')
plt.title('Distribution of suicides wrt Gender', fontsize = 20)
plt.xlabel('gender')
plt.xticks(rotation = 90)
plt.ylabel('count')
plt.show()

## Geospatial Analysis of Suicides

In [None]:

suicide = pd.DataFrame(data.groupby(['country','year'])['suicides'].sum().reset_index())

count_max_sui=pd.DataFrame(suicide.groupby('country')['suicides'].sum().reset_index())

count = [ dict(
        type = 'choropleth',
        locations = count_max_sui['country'],
        locationmode='country names',
        z = count_max_sui['suicides'],
        text = count_max_sui['country'],
        colorscale = 'Cividis',
        autocolorscale = False,
        reversescale = True,
        marker = dict(
            line = dict (
                color = 'rgb(180,180,180)',
                width = 0.5
            ) ),
)]
layout = dict(
    title = 'Suicides happening across the Globe',
    geo = dict(
        showframe = True,
        showcoastlines = True,
        projection = dict(
            type = 'orthographic'
        )
    )
)
fig = dict( data=count, layout=layout )
iplot(fig, validate=False, filename='d3-world-map')

In [None]:
# looking at the Suicides in USA.

data[data['country'] == 'United States of America'].sample(20)

In [None]:
# replacing categorical values in the age column

data['age'] = data['age'].replace('5-14 years', 0)
data['age'] = data['age'].replace('15-24 years', 1)
data['age'] = data['age'].replace('25-34 years', 2)
data['age'] = data['age'].replace('35-54 years', 3)
data['age'] = data['age'].replace('55-74 years', 4)
data['age'] = data['age'].replace('75+ years', 5)

#data['age'].value_counts()

# suicides in different age groups

x1 = data[data['age'] == 0]['suicides'].sum()
x2 = data[data['age'] == 1]['suicides'].sum()
x3 = data[data['age'] == 2]['suicides'].sum()
x4 = data[data['age'] == 3]['suicides'].sum()
x5 = data[data['age'] == 4]['suicides'].sum()
x6 = data[data['age'] == 5]['suicides'].sum()

x = pd.DataFrame([x1, x2, x3, x4, x5, x6])
x.index = ['5-14', '15-24', '25-34', '35-54', '55-74', '75+']
x.plot(kind = 'bar', color = 'grey')

plt.title('suicides in different age groups')
plt.xlabel('Age Group')
plt.ylabel('count')
plt.show()
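
The six `replace` calls and six separate sums above can also be expressed with one mapping and a `groupby`. A minimal sketch on invented rows (the age labels are the strings used in the cell above; the counts are made up):

```python
import pandas as pd

# ordinal codes for the age-group labels used in the dataset
age_codes = {
    '5-14 years': 0, '15-24 years': 1, '25-34 years': 2,
    '35-54 years': 3, '55-74 years': 4, '75+ years': 5,
}

toy = pd.DataFrame({
    'age':      ['15-24 years', '75+ years', '15-24 years', '5-14 years'],
    'suicides': [12, 7, 3, 1],
})

# translate each label to its code in a single pass
toy['age'] = toy['age'].map(age_codes)

# total suicides per encoded age group, ordered by code
by_age = toy.groupby('age')['suicides'].sum().sort_index()
print(by_age)
```

One difference to keep in mind: `map` turns any label missing from the dictionary into `NaN`, whereas `replace` leaves unknown labels unchanged.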

## Finding Suicide Trends by Year

In [None]:
df = data.groupby(['country', 'year'])['suicides'].mean()
df = pd.DataFrame(df)

# looking at the suicides trends for any 3 countries
plt.rcParams['figure.figsize'] = (20, 30)
plt.style.use('dark_background')

plt.subplot(3, 1, 1)
color = plt.cm.hot(np.linspace(0, 1, 40))
df['suicides']['United States of America'].plot.bar(color = color)
plt.title('Suicides Trends in USA wrt Year', fontsize = 30)

plt.subplot(3, 1, 2)
color = plt.cm.spring(np.linspace(0, 1, 40))
df['suicides']['Russian Federation'].plot.bar(color = color)
plt.title('Suicides Trends in Russian Federation wrt Year', fontsize = 30)

plt.subplot(3, 1, 3)
color = plt.cm.PuBu(np.linspace(0, 1, 40))
df['suicides']['Japan'].plot.bar(color = color)
plt.title('Suicides Trends in Japan wrt Year', fontsize = 30)

plt.show()

## Finding Suicide Trends by Age Group

In [None]:
df2 = data.groupby(['country', 'age'])['suicides'].mean()
df2 = pd.DataFrame(df2)

# looking at the suicides trends for any 3 countries
plt.rcParams['figure.figsize'] = (20, 30)

plt.subplot(3, 1, 1)
df2['suicides']['United States of America'].plot.bar()
plt.title('Suicides Trends in USA wrt Age Groups', fontsize = 30)
plt.xticks(rotation = 0)

plt.subplot(3, 1, 2)
color = plt.cm.jet(np.linspace(0, 1, 6))
df2['suicides']['Russian Federation'].plot.bar(color = color)
plt.title('Suicides Trends in Russian Federation wrt Age Groups', fontsize = 30)
plt.xticks(rotation = 0)

plt.subplot(3, 1, 3)
color = plt.cm.Wistia(np.linspace(0, 1, 6))
df2['suicides']['Japan'].plot.bar(color = color)
plt.title('Suicides Trends in Japan wrt Age Groups', fontsize = 30)
plt.xticks(rotation = 0)

plt.show()

In [None]:

plt.rcParams['figure.figsize'] = (15, 7)
plt.style.use('dark_background')

sns.stripplot(x = data['year'], y = data['suicides'], palette = 'cool')
plt.title('Year vs Suicides', fontsize = 20)
plt.xticks(rotation = 90)
plt.show()

In [None]:
# gender vs suicides

plt.rcParams['figure.figsize'] = (15, 7)


sns.stripplot(x = data['gender'], y = data['suicides'], palette = 'Wistia')
plt.title('Gender vs Suicides', fontsize = 20)
plt.grid()
plt.show()

In [None]:
# label encoding for gender

from sklearn.preprocessing import LabelEncoder

# creating an encoder
le = LabelEncoder()
data['gender'] = le.fit_transform(data['gender'])

data['gender'].value_counts()
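
`LabelEncoder` assigns integer codes following the sorted order of the labels, so it helps to inspect which code stands for which label. A small sketch on an invented list of labels, assuming the column holds the strings `female` and `male`:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['male', 'female', 'female', 'male'])

# classes_ holds the original labels; the position of each label is its code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(codes)
print(mapping)
```

The encoded values can be turned back into labels with `le.inverse_transform(codes)`.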

In [None]:
# deleting unnecessary column

data = data.drop(['country'], axis = 1)

data.columns

In [None]:
#splitting the data into dependent and independent variables

x = data.drop(['suicides'], axis = 1)
y = data['suicides']

print(x.shape)
print(y.shape)

In [None]:
# splitting the dataset into training and testing sets

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 45)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

In [None]:
# min max scaling

import warnings
warnings.filterwarnings('ignore')

# importing the min max scaler
from sklearn.preprocessing import MinMaxScaler

# creating a scaler
mm = MinMaxScaler()

# scaling the independent variables
x_train = mm.fit_transform(x_train)
x_test = mm.transform(x_test)
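
Min-max scaling maps each feature with `x' = (x - min) / (max - min)`, where `min` and `max` come from the training data only. The test set is transformed with those same training statistics, so its values can fall outside the range 0 to 1. A small sketch with invented numbers, compared against `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# one invented feature column for each split
train = np.array([[0.0], [5.0], [10.0]])
test = np.array([[2.5], [12.0]])

# statistics come from the training data only
lo, hi = train.min(axis=0), train.max(axis=0)
manual_test = (test - lo) / (hi - lo)

# the scaler learns the same statistics in fit() and reuses them in transform()
mm = MinMaxScaler()
mm.fit(train)
sk_test = mm.transform(test)

print(manual_test.ravel())
print(np.allclose(manual_test, sk_test))
```

Fitting on the training split and only transforming the test split, as the cell above does, keeps information about the test data out of the preprocessing step.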

## Models for Predicting Suicides

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# creating the model
model = LinearRegression()

# feeding the training data into the model
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)

# calculating the mean squared error
mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error
rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score
r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)
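
The three metrics printed above can be checked by hand. A short sketch with invented targets and predictions, where R2 is computed as 1 - SS_res / SS_tot and compared with scikit-learn's own functions:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# invented targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# mean squared error and its square root
mse_manual = np.mean((y_true - y_hat) ** 2)
rmse_manual = np.sqrt(mse_manual)

# R2 compares the residual error with the error of always predicting the mean
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MSE :", mse_manual, mean_squared_error(y_true, y_hat))
print("RMSE :", rmse_manual)
print("r2_score :", r2_manual, r2_score(y_true, y_hat))
```

RMSE is expressed in the same units as the target (here, a number of suicides), while R2 is unitless, which is why the two are reported side by side.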


### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor

# creating the model
model = RandomForestRegressor()

# feeding the training data into the model
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)

# calculating the mean squared error
mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error
rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score
r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)


### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

# creating the model
model = DecisionTreeRegressor()

# feeding the training data into the model
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)

# calculating the mean squared error
mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error
rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score
r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)


### AdaBoostRegressor Model

In [None]:
from sklearn.ensemble import AdaBoostRegressor

# creating the model
model = AdaBoostRegressor()

# feeding the training data into the model
model.fit(x_train, y_train)

# predicting the test set results
y_pred = model.predict(x_test)

# calculating the mean squared error
mse = np.mean((y_test - y_pred)**2)
print("MSE :", mse)

# calculating the root mean squared error
rmse = np.sqrt(mse)
print("RMSE :", rmse)

#calculating the r2 score
r2 = r2_score(y_test, y_pred)
print("r2_score :", r2)
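
The four model cells repeat the same fit, predict, and score steps. One way to reduce the repetition is a small helper applied to each model in turn. The sketch below uses synthetic data from `make_regression` rather than the suicide dataset, and the helper name `evaluate` is hypothetical:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, x_train, x_test, y_train, y_test):
    """Fit the model and return (rmse, r2) measured on the test split."""
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))
    return rmse, r2_score(y_test, y_pred)


# synthetic stand-in for the real features and target
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=45)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=45)

models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(random_state=45),
    'Decision Tree': DecisionTreeRegressor(random_state=45),
    'AdaBoost': AdaBoostRegressor(random_state=45),
}

# one (rmse, r2) pair per model name
results = {name: evaluate(model, x_train, x_test, y_train, y_test)
           for name, model in models.items()}

for name, (rmse, r2) in results.items():
    print(f"{name}: RMSE = {rmse:.2f}, r2_score = {r2:.3f}")
```

A dictionary like `results` could also feed comparison charts directly, instead of typing the scores in by hand. Setting `random_state` on the tree-based models makes the numbers reproducible between runs.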


## Comparison of Results

**R2 Score for All Four Models**

In [None]:
# named r2_scores so that sklearn's r2_score function is not shadowed
r2_scores = np.array([0.385, 0.851, 0.745, 0.535])
labels = np.array(['Linear Regression', 'Random Forest', 'Decision Tree', 'AdaBoost Tree'])
indices = np.argsort(r2_scores)
color = plt.cm.rainbow(np.linspace(0, 1, 9))

plt.style.use('seaborn-talk')
plt.rcParams['figure.figsize'] = (18, 7)
plt.bar(range(len(indices)), r2_scores[indices], color = color)
plt.xticks(range(len(indices)), labels[indices])
plt.title('R2 Score', fontsize = 30)
plt.grid()
plt.tight_layout()
plt.show()

**RMSE for All Four Models**

In [None]:
rmse = np.array([600, 295, 388, 521])
labels = np.array(['Linear Regression', 'Random Forest', 'Decision Tree', 'AdaBoost Tree'])
indices = np.argsort(rmse)
color = plt.cm.spring(np.linspace(0, 1, 9))

plt.style.use('seaborn-talk')
plt.rcParams['figure.figsize'] = (18, 7)

plt.bar(range(len(indices)), rmse[indices], color = color)
plt.xticks(range(len(indices)), labels[indices])
plt.title('RMSE', fontsize = 30)

plt.grid()
plt.tight_layout()
plt.show()