# Analysis of the global suicide data from 1985 to 2016

In [35]:
import pandas as pd
import plotly.express as px
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [36]:
df = pd.read_csv("master.csv")

In [37]:
print(df)

          country  year     sex          age  suicides_no  population  \
0         Albania  1987    male  15-24 years           21      312900   
1         Albania  1987    male  35-54 years           16      308000   
2         Albania  1987  female  15-24 years           14      289700   
3         Albania  1987    male    75+ years            1       21800   
4         Albania  1987    male  25-34 years            9      274300   
...           ...   ...     ...          ...          ...         ...   
27815  Uzbekistan  2014  female  35-54 years          107     3620833   
27816  Uzbekistan  2014  female    75+ years            9      348465   
27817  Uzbekistan  2014    male   5-14 years           60     2762158   
27818  Uzbekistan  2014  female   5-14 years           44     2631600   
27819  Uzbekistan  2014  female  55-74 years           21     1438935   

       suicides/100k pop    country-year  HDI for year  gdp_for_year ($)   \
0                   6.71     Albania1987      

First, let's look at the overall development of the global suicide statistics. We use the relative rate, calculated as suicides_no / population, because absolute counts would not be comparable across years. The raw ratios are very small numbers, so we multiply them by 100,000 to get suicides per 100,000 inhabitants.
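As a quick sanity check of the formula, here is a minimal sketch that applies it to a single row. The two input values are copied from the first row of the printed DataFrame (Albania, 1987, male, 15-24 years); the variable names are only illustrative.

```python
# Values copied from row 0 of the printed DataFrame above.
suicides_no = 21
population = 312900

# Suicides per 100,000 inhabitants for this single row.
rate_per_100k = suicides_no / population * 100_000
print(round(rate_per_100k, 2))
```

The rounded result can be compared with the `suicides/100k pop` column of the same row.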

In [38]:
# Sum suicides and population per year, then compute suicides per 100,000 inhabitants.
totals = df.groupby("year")[["suicides_no", "population"]].sum()
ratePerYear = (totals["suicides_no"] / totals["population"] * 100000).to_dict()
print(ratePerYear)

{1985: 11.507335921444685, 1986: 11.716562160100937, 1987: 11.583429836497425, 1988: 11.481514107696295, 1989: 13.075652716124424, 1990: 13.184123141364283, 1991: 13.290036494673775, 1992: 13.473570250445576, 1993: 14.477430013643854, 1994: 14.983896309854323, 1995: 15.302227830617989, 1996: 14.842675800066186, 1997: 14.136594182299039, 1998: 14.46752249294399, 1999: 14.418166650163379, 2000: 14.218987981593715, 2001: 14.277564783001951, 2002: 14.054529230030578, 2003: 13.929009921042416, 2004: 13.80097220678577, 2005: 13.5093490704526, 2006: 12.676401748404448, 2007: 12.551757061993971, 2008: 12.654216998237864, 2009: 12.320792687174695, 2010: 11.951250148595177, 2011: 11.86357323019787, 2012: 12.032546293695317, 2013: 11.808460557589012, 2014: 11.661993547495705, 2015: 11.474887431996668, 2016: 11.811336909199245}


Now the fun part begins. We draw the data as a scatter plot using plotly.

In [39]:
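# Put the years on the x-axis and the rate on the y-axis.
px.scatter(x=list(ratePerYear.keys()), y=list(ratePerYear.values()), labels={"x": "year", "y": "suicides per 100k"})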

We see that the rate was low at the start, rose until 1995, and has declined fairly steadily since then, with a few small year-to-year increases. Maybe we should also look at the distribution of the data using a boxplot.
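The reading of the plot can be checked directly on the numbers. The following is a small sketch, assuming the yearly rates are rounded to two decimals from the printed output above; `rates`, `peak_year` and `increases_after_peak` are illustrative names, not part of the notebook.

```python
# Yearly rates (suicides per 100k), rounded from the printed output above.
rates = {
    1985: 11.51, 1986: 11.72, 1987: 11.58, 1988: 11.48, 1989: 13.08, 1990: 13.18,
    1991: 13.29, 1992: 13.47, 1993: 14.48, 1994: 14.98, 1995: 15.30, 1996: 14.84,
    1997: 14.14, 1998: 14.47, 1999: 14.42, 2000: 14.22, 2001: 14.28, 2002: 14.05,
    2003: 13.93, 2004: 13.80, 2005: 13.51, 2006: 12.68, 2007: 12.55, 2008: 12.65,
    2009: 12.32, 2010: 11.95, 2011: 11.86, 2012: 12.03, 2013: 11.81, 2014: 11.66,
    2015: 11.47, 2016: 11.81,
}

# Year with the highest rate.
peak_year = max(rates, key=rates.get)

# Years after the peak in which the rate went up compared to the previous year.
increases_after_peak = [
    year for year in rates
    if year > peak_year and rates[year] > rates[year - 1]
]

print(peak_year)
print(increases_after_peak)
```

The second list shows where the decline after the peak is interrupted.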

In [40]:
px.box(y=list(ratePerYear.values()))

I don't think the boxplot is that interesting for this specific problem, but we are able to see that there are a few very high values.
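To see how the high values relate to the rest of the distribution, one option is to compute the quartiles and the usual 1.5 × IQR fences by hand. This is only a sketch: it uses rates rounded from the printed output above, and `np.percentile` uses linear interpolation by default, so the quartiles may differ slightly from the ones plotly draws.

```python
import numpy as np

# Yearly rates (suicides per 100k) for 1985-2016, rounded from the printed output above.
rates = np.array([
    11.51, 11.72, 11.58, 11.48, 13.08, 13.18, 13.29, 13.47,
    14.48, 14.98, 15.30, 14.84, 14.14, 14.47, 14.42, 14.22,
    14.28, 14.05, 13.93, 13.80, 13.51, 12.68, 12.55, 12.65,
    12.32, 11.95, 11.86, 12.03, 11.81, 11.66, 11.47, 11.81,
])

# Quartiles and interquartile range.
q1, median, q3 = np.percentile(rates, [25, 50, 75])
iqr = q3 - q1

# Conventional boxplot fences: points outside them are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = rates[(rates < lower_fence) | (rates > upper_fence)]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
print(f"fences: [{lower_fence:.2f}, {upper_fence:.2f}]")
print("outliers:", outliers)
```

On these rounded values nothing falls outside the fences, so the high values show up as a long upper whisker rather than as separate outlier points.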

We could use linear regression to predict the suicide rate in the future. The curve as a whole is clearly not linear, but the part after the peak is, so I suggest removing everything before 1995 from the dataset.

In [41]:
model = LinearRegression()
rates = ratePerYear.values()
print(ratePerYear)

{1985: 11.507335921444685, 1986: 11.716562160100937, 1987: 11.583429836497425, 1988: 11.481514107696295, 1989: 13.075652716124424, 1990: 13.184123141364283, 1991: 13.290036494673775, 1992: 13.473570250445576, 1993: 14.477430013643854, 1994: 14.983896309854323, 1995: 15.302227830617989, 1996: 14.842675800066186, 1997: 14.136594182299039, 1998: 14.46752249294399, 1999: 14.418166650163379, 2000: 14.218987981593715, 2001: 14.277564783001951, 2002: 14.054529230030578, 2003: 13.929009921042416, 2004: 13.80097220678577, 2005: 13.5093490704526, 2006: 12.676401748404448, 2007: 12.551757061993971, 2008: 12.654216998237864, 2009: 12.320792687174695, 2010: 11.951250148595177, 2011: 11.86357323019787, 2012: 12.032546293695317, 2013: 11.808460557589012, 2014: 11.661993547495705, 2015: 11.474887431996668, 2016: 11.811336909199245}


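In the scatter plot above we see a good fit for a linear function if we remove the first 10 values. We also need to hold out some other values to test our predictions. Note that the cell below still splits the full 1985 to 2016 series; the first 10 values are not actually removed.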
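Here is a sketch of what fitting only the years from 1995 onward could look like. It is not what the notebook cells do: the rates are rounded from the printed output above, and `rates_from_1995` and the fixed `random_state` are assumptions added to keep the example self-contained and reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Yearly rates for 1995-2016 only (first 10 years dropped), rounded from the output above.
rates_from_1995 = {
    1995: 15.30, 1996: 14.84, 1997: 14.14, 1998: 14.47, 1999: 14.42, 2000: 14.22,
    2001: 14.28, 2002: 14.05, 2003: 13.93, 2004: 13.80, 2005: 13.51, 2006: 12.68,
    2007: 12.55, 2008: 12.65, 2009: 12.32, 2010: 11.95, 2011: 11.86, 2012: 12.03,
    2013: 11.81, 2014: 11.66, 2015: 11.47, 2016: 11.81,
}

# scikit-learn expects a 2-D feature matrix, so the years become a single column.
X = np.array(list(rates_from_1995.keys())).reshape(-1, 1)
y = np.array(list(rates_from_1995.values()))

# random_state is fixed only to make the split reproducible (assumption, not in the notebook).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print("slope per year:", model.coef_[0])
print("intercept:", model.intercept_)
```

The slope is the estimated change in the rate per year over the restricted period.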

In [57]:
X = list(ratePerYear.keys())
y = list(ratePerYear.values())
X = np.array(X).reshape(-1, 1)
y = np.array(y).reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
print(X_train)
print("##################################################")
print(y_train)

[[1995]
 [2000]
 [2004]
 [2010]
 [1994]
 [2001]
 [1991]
 [1992]
 [2009]
 [1985]
 [2002]
 [2014]
 [2015]
 [1997]
 [1993]
 [1987]
 [1990]
 [2012]
 [1998]
 [1999]
 [2003]
 [1989]
 [2011]
 [2016]]
##################################################
[[15.30222783]
 [14.21898798]
 [13.80097221]
 [11.95125015]
 [14.98389631]
 [14.27756478]
 [13.29003649]
 [13.47357025]
 [12.32079269]
 [11.50733592]
 [14.05452923]
 [11.66199355]
 [11.47488743]
 [14.13659418]
 [14.47743001]
 [11.58342984]
 [13.18412314]
 [12.03254629]
 [14.46752249]
 [14.41816665]
 [13.92900992]
 [13.07565272]
 [11.86357323]
 [11.81133691]]


In [55]:
model.fit(X_train, y_train)

In [61]:
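# Note: x_test (lowercase) is not the X_test from the split above and is not defined
# in the cells shown here, so the 10 predictions do not line up with the 8 values in y_test.
x_test = np.array(x_test).reshape(-1, 1)
print(model.predict(x_test))
print("#############################")
print(y_test)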

[[13.42489433]
 [13.36798138]
 [13.31106842]
 [13.25415547]
 [13.19724251]
 [13.14032956]
 [13.0834166 ]
 [13.02650364]
 [12.96959069]
 [12.91267773]]
#############################
[[13.50934907]
 [11.71656216]
 [12.55175706]
 [11.80846056]
 [12.67640175]
 [12.654217  ]
 [14.8426758 ]
 [11.48151411]]


The prediction goes in the right direction, but there is not much else positive to say about it. The problem is the model's simplicity: if you take a closer look at the data, you will realise that no linear function fits it exactly, because the real world is just too complex.
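One way to put a number on how well the line fits is to predict on the held-out years and compare against their true rates. The sketch below assumes rates rounded from the printed output above and a fixed `random_state`, so its split differs from the one in the notebook; `mean_absolute_error` and `r2_score` come from `sklearn.metrics`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Yearly rates (suicides per 100k), rounded from the printed output above.
rates = {
    1985: 11.51, 1986: 11.72, 1987: 11.58, 1988: 11.48, 1989: 13.08, 1990: 13.18,
    1991: 13.29, 1992: 13.47, 1993: 14.48, 1994: 14.98, 1995: 15.30, 1996: 14.84,
    1997: 14.14, 1998: 14.47, 1999: 14.42, 2000: 14.22, 2001: 14.28, 2002: 14.05,
    2003: 13.93, 2004: 13.80, 2005: 13.51, 2006: 12.68, 2007: 12.55, 2008: 12.65,
    2009: 12.32, 2010: 11.95, 2011: 11.86, 2012: 12.03, 2013: 11.81, 2014: 11.66,
    2015: 11.47, 2016: 11.81,
}

X = np.array(list(rates.keys())).reshape(-1, 1)
y = np.array(list(rates.values()))

# Fixed seed for reproducibility (assumption); the notebook uses a random split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Predict on the same held-out years that y_test belongs to.
y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)  # average absolute error in suicides per 100k
r2 = r2_score(y_test, y_pred)              # 1.0 is a perfect fit; 0 or below is no better than the mean

print("MAE:", mae)
print("R^2:", r2)
```

Predicting on `X_test` keeps predictions and true values aligned one to one, which makes the comparison meaningful.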

But there is still hope. Maybe we could build a more complex machine learning model using TensorFlow to find patterns in this data, but that is a task for another notebook.