# Introduction
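This notebook analyses the Suicide Rates Overview 1985 to 2016 dataset, with missing HDI values filled in from a separate Human Development Index dataset.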

<font color = "blue">
Content: 
    
1. [Load and Check Data](#1)
1. [Variable Description](#2)
    * [Univariate Variable Analysis](#3)
        * [Categorical Variable](#4)
        * [Numerical Variable](#5)
1. [Basic Data Analysis](#6)
1. [Outlier Detection](#7)
1. [Missing Value](#8)
    * [Finding And Dropping Missing Value](#9)
1. [Visualisation](#10)
    * [Bar Plot](#11)
    * [Horizontal Bar Plot](#12)
    * [Point Plot](#13)
    * [Joint Plot](#14)
    * [Pie Plot](#15)
    * [Lm Plot](#16)
    * [Kde Plot](#17)
    * [Violin Plot](#18)
    * [Heatmap Correlation](#19)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
from collections import Counter
import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

<a id = "1"></a><br>
# Load and Check Data

In [None]:
data = pd.read_csv("/kaggle/input/suicide-rates-overview-1985-to-2016/master.csv")
data_hdi = pd.read_csv("/kaggle/input/human-development-index-hdi/human-development-index.csv")

In [None]:
data.sample(5)

In [None]:
data.rename(columns={"suicides/100k pop":"suicides_100k_pop","country-year":"country_year","HDI for year":"hdi_for_year"," gdp_for_year ($) ": "gdp_for_year_$","gdp_per_capita ($)":"gdp_per_capita_$"}, inplace =True)

In [None]:
data.columns

In [None]:
data.info()

In [None]:
data_hdi.head(10)

In [None]:
data_hdi.columns

In [None]:
data_hdi.rename(columns={" ((0-1; higher values are better))": "hdi_for_year","Entity":"country","Year":"year"}, inplace=True)
data_hdi.drop(["Code"],inplace=True, axis = 1)
data_hdi.sample(5)

In [None]:
data_hdi["year"] = data_hdi["year"].astype(str)
data_hdi["country_year"] = data_hdi["country"] + data_hdi["year"]

In [None]:
data.sample(5)

In [None]:
data_hdi.tail(5)

In [None]:
data = data.set_index("country_year").fillna(data_hdi.set_index("country_year")).reset_index()
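
The cell above fills missing values in `data` from `data_hdi` by aligning both tables on `country_year`. A minimal sketch of the same idea on toy data, restricted to the `hdi_for_year` column, is shown below. The frames `main` and `lookup` and all numbers are made up for illustration, and mapping a key to a value Series is an alternative to the index-aligned `fillna` used above, not the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the two tables (values are invented for illustration).
main = pd.DataFrame({
    "country_year": ["Albania1987", "Albania1987", "Albania1990"],
    "hdi_for_year": [np.nan, np.nan, 0.61],
})
lookup = pd.DataFrame({
    "country_year": ["Albania1987", "Albania1990"],
    "hdi_for_year": [0.60, 0.99],
})

# Build a key -> value Series from the lookup table.
hdi_by_key = lookup.set_index("country_year")["hdi_for_year"]

# Fill only the missing entries; existing values (0.61) are left untouched.
main["hdi_for_year"] = main["hdi_for_year"].fillna(
    main["country_year"].map(hdi_by_key)
)

print(main)
```

Only the rows that were missing receive a value from the lookup table, which is why the last row keeps its original HDI.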

In [None]:
data.info()

<a id = "2"></a><br>
# Variable Description
1.  Country: Country where the suicides were recorded
2.  Year: Year in which the suicides were recorded
3.  Sex: Sex of the people who died by suicide
4.  Age: Age range of the group
5.  Suicides_no: Number of suicides per country, year, sex and age group
6.  Population: Population per country, year, sex and age group
7.  Country-year: Country and year joined into a single key
8.  HDI for year: Human Development Index of the country in that year
9.  GDP for year (dollar): Gross Domestic Product of the country in that year
10. GDP per capita (dollar): Gross Domestic Product per capita of the country in that year
11. Generation: Generation of the people in the group
12. Suicides_100k_pop: Number of suicides per 100k population

In [None]:
data.info()

* float64(2): suicides_100k_pop and hdi_for_year
* int64(4): year, suicides_no, population, gdp_per_capita (dollar)
* object(6): country, sex, age, country-year, gdp_for_year (dollar), generation
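
Since `gdp_for_year (dollar)` is listed as `object`, it is presumably stored as text. A minimal sketch of converting such a column to integers follows, assuming the values are strings with comma thousands separators. The sample strings are invented for illustration, and the real column may need different handling.

```python
import pandas as pd

# Invented sample values in the assumed "1,234,567" text format.
raw = pd.Series(["2,156,624,900", "63,067,077,179"], name="gdp_for_year_$")

# Strip the separators, then let pandas pick a numeric dtype.
gdp_numeric = pd.to_numeric(raw.str.replace(",", "", regex=False))

print(gdp_numeric.dtype)
print(gdp_numeric.tolist())
```

After a conversion like this, a histogram bins the values numerically instead of treating every distinct string as its own category.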

<a id = "3"></a><br>
# Univariate Variable Analysis
* Categorical Variable: country, sex, country-year, generation
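* Numerical Variable: year, suicides_no, population, gdp_per_capita (dollar), age, gdp_for_year (dollar), suicides_100k_pop, HDI for year. Note that age is stored as an age-range label and gdp_for_year (dollar) as text (both have dtype object above), so their histograms show counts per distinct value.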

In [None]:
data.columns

In [None]:
def bar_plot(variable):
    """
    input: variable, e.g. "sex"
    output: bar plot & value counts
    
    """
    #get feature
    var = data[variable]
    #count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    #visualize
    plt.figure(figsize = (10,10))
    plt.bar(varValue.index,varValue)
    plt.xticks(varValue.index,varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))
    

<a id = "4"></a><br>
## Categorical Variable
*  Sex and Generation Distribution with bar plot.

In [None]:
category1 = ["sex", "generation"]
for c in category1:
    bar_plot(c)

<a id = "5"></a><br>
## Numerical Variable
*  Year, Suicide No., Population, GDP_per_capita, Age, GDP_for_year, Suicides_100k_pop distribution with histogram.

In [None]:
def plot_hist(variable):
    plt.figure(figsize=(9,3))
    plt.hist(data[variable],bins = 20)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numericVar = ["year", "suicides_no", "population", "gdp_per_capita_$","age", "gdp_for_year_$", "suicides_100k_pop"]
for n in numericVar:
    plot_hist(n)

<a id = "6"></a><br>
# Basic Data Analysis
* Sex - Suicide No
* Generation - Suicide No
* Age - Suicide No
* Country - Suicide No

In [None]:
data.head(5)

### Sex - Suicide No
* Total suicide number by sex (sorted descending).

In [None]:
data[["sex","suicides_no"]].groupby(["sex"], as_index = False).sum().sort_values(by="suicides_no",ascending = False)

### Generation - Suicide No
* Total suicide number by generation (sorted descending).

In [None]:
data[["generation","suicides_no"]].groupby(["generation"], as_index = False).sum().sort_values(by="suicides_no",ascending = False)

### Age - Suicide No
* Total suicide number by age (sorted descending).

In [None]:
data[["age","suicides_no"]].groupby(["age"], as_index = False).sum().sort_values(by="suicides_no",ascending = False)

### Country - Suicide No
* Total suicide number by country (sorted descending).

In [None]:
data[["country","suicides_no"]].groupby(["country"], as_index = False).sum().sort_values(by="suicides_no",ascending = False)
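
The four cells above repeat the same group, sum and sort pattern with a different key each time. A minimal sketch of wrapping that pattern in one function is shown below. The helper name `total_by` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def total_by(df, key, value="suicides_no"):
    """Return the sum of `value` per `key`, largest total first."""
    return (
        df.groupby(key, as_index=False)[value]
        .sum()
        .sort_values(by=value, ascending=False)
        .reset_index(drop=True)
    )

# Invented toy data with the same column names as the notebook.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "age": ["15-24 years", "15-24 years", "35-54 years", "35-54 years"],
    "suicides_no": [10, 3, 20, 5],
})

print(total_by(toy, "sex"))
print(total_by(toy, "age"))
```

With the real data, calls such as `total_by(data, "generation")` would be expected to give the same tables as the cells above, apart from the reset row index.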

<a id = "7"></a><br>
# Outlier Detection
* Outliers will be detected and if necessary, they will be removed.

In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        #1st quartile
        Q1 = np.percentile(df[c],25)
        #3rd quartile
        Q3 = np.percentile(df[c],75)
        #IQR
        IQR = Q3 - Q1
        #Outlier step
        outlier_step = IQR*1.5
        #detect outlier and their indices
        outlier_list_col = df[(df[c]<Q1-outlier_step) | (df[c]> Q3 + outlier_step)].index
        #store indices
        outlier_indices.extend(outlier_list_col)
        
    # count how many features flagged each row
    outlier_indices = Counter(outlier_indices)
    # keep only rows flagged in more than two features
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    
    return multiple_outliers

In [None]:
data.loc[detect_outliers(data,["suicides_no","population","gdp_per_capita_$"])]

In [None]:
#data = data.drop(detect_outliers(data,["suicides_no","population","gdp_per_capita_$"]),axis = 0).reset_index(drop=True)
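
`detect_outliers` applies the 1.5 × IQR rule to each feature and returns only the rows flagged in more than two features, which means all three features in the call above. A small worked example of the per-feature rule on made-up numbers may make the thresholds easier to follow.

```python
import numpy as np

# Invented sample with one obviously extreme value.
values = np.array([1, 2, 3, 4, 5, 6, 7, 100])

# Quartiles with NumPy's default linear interpolation.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences: anything outside [lower, upper] counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(q1, q3, iqr)      # quartiles and their spread
print(lower, upper)     # lower and upper fences
print(outliers)         # values outside the fences
```

Here Q1 is 2.75 and Q3 is 6.25, so the fences sit at -2.5 and 11.5 and only the value 100 falls outside them.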

<a id = "8"></a><br>
# Missing Value
<a id = "9"></a><br>
## Finding And Dropping Missing Value
* Missing Values will be detected and if necessary they will be dropped.

In [None]:
data.isnull().any()

In [None]:
data.isnull().sum()

In [None]:
data[data["hdi_for_year"].isnull()]

In [None]:
# drop=True discards the old index instead of adding it as an extra "index" column
data = data.dropna().reset_index(drop=True)

In [None]:
data.info()
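
Before dropping rows, it can help to see what share of each column is missing, since `dropna()` removes every row that has at least one missing value. A minimal sketch on an invented toy frame:

```python
import numpy as np
import pandas as pd

# Invented toy data: half of the HDI values are missing.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "hdi_for_year": [0.7, np.nan, 0.8, np.nan],
})

# Fraction of missing values per column (mean of a boolean mask).
missing_share = toy.isnull().mean()

# Drop incomplete rows and renumber the remaining rows from zero.
cleaned = toy.dropna().reset_index(drop=True)

print(missing_share)
print(cleaned)
```

Passing `drop=True` to `reset_index` keeps the old row labels from being added back as a new column.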

<a id = "10"></a><br>
# Visualisation
<a id = "11"></a><br>
## Bar Plot

* **<span style="color:crimson;">50 Countries with Highest Suicide Rates</span>** will be demonstrated as **<span style="color:crimson;">bar plot</span>**.

In [None]:
len(data["country"].unique())

In [None]:
country_list = list(data['country'].unique())
country_suicide_ratio = []
for i in country_list:
    x = data[data["country"] == i]
    country_suicide_rate = sum(x.suicides_100k_pop)/len(x)
    country_suicide_ratio.append(country_suicide_rate)
df = pd.DataFrame({"country_list": country_list, "country_suicide_ratio": country_suicide_ratio})
new_index = (df["country_suicide_ratio"].sort_values(ascending=False)).index.values
sorted_data = df.reindex(new_index)
# keep the 50 highest rates regardless of how many countries remain
sorted_data2 = sorted_data.head(50)

# visualization
plt.figure(figsize=(30,10))
sns.barplot(x=sorted_data2['country_list'], y=sorted_data2['country_suicide_ratio'],palette="rocket")
plt.xticks(rotation= 45)
plt.xlabel('Countries')
plt.ylabel('Suicide Rate')
plt.title('Suicide Rate Given Countries')
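
The loop above computes, for each country, the unweighted mean of `suicides_100k_pop` over its rows and then keeps the highest values. A minimal sketch of the same calculation with `groupby` and `nlargest` is shown below on invented toy data. It is an alternative formulation, not the notebook's own code.

```python
import pandas as pd

# Invented toy data with the notebook's column names.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C", "C"],
    "suicides_100k_pop": [10.0, 20.0, 1.0, 3.0, 30.0, 50.0],
})

# Mean rate per country, then the two largest means in descending order.
top = toy.groupby("country")["suicides_100k_pop"].mean().nlargest(2)

print(top)
```

On the real data, `nlargest(50)` would play the role of the sort and truncate steps. Because this is a plain mean of group rates, small and large demographic groups count equally.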

* **<span style="color:crimson;">Total Suicides Per Year in the United States between 1985-2015</span>**  will be demonstrated as **<span style="color:crimson;">bar plot</span>**.

In [None]:
x = data[data["country"] == "United States"]
year_list = list(x.year.unique())
for i in year_list:
    q = x[x.year == i] 
    sum_of_suicide_yearly = sum(q.suicides_no)
    sum_of_suicide_yearly

In [None]:
x = data[data["country"] == "United States"]
year_list = list(x.year.unique())
sum_of_suicide_yearly = []
for i in year_list:
    q = x[x.year == i] 
    sum_of_suicide_yearbase = sum(q.suicides_no)
    sum_of_suicide_yearly.append(sum_of_suicide_yearbase)
df = pd.DataFrame({"Total Suicide per Year" : sum_of_suicide_yearly, "Years" : year_list} )

# visualization
plt.figure(figsize=(10,10))
sns.barplot(x=df['Years'], y=df['Total Suicide per Year'],palette = sns.cubehelix_palette(27))
plt.xticks(rotation= 45)
plt.xlabel('Years')
plt.ylabel('Total Suicide per Year')
plt.title('Total Suicide per Year in U.S')                   


<a id = "12"></a><br>
## Horizontal Bar Plot

* **<span style="color:crimson;">50 Countries with Highest Suicide Rates with Suicide Percentages According To Generation</span>** will be demonstrated as **<span style="color:crimson;">horizontal bar plot</span>**

In [None]:
data.generation.unique()

In [None]:
country_list = list(data['country'].unique())


for i in country_list:
    x = data[data["country"] == i]
    generation_list = list(x["generation"].unique())
    for j in generation_list:
        y = x[x["generation"] == j]

In [None]:
country_list = list(data['country'].unique())
country_suicide_ratio = []
for i in country_list:
    x = data[data["country"] == i]
    country_suicide_rate = sum(x.suicides_100k_pop)/len(x)
    country_suicide_ratio.append(country_suicide_rate)
df = pd.DataFrame({"country_list": country_list, "country_suicide_ratio": country_suicide_ratio})
new_index = (df["country_suicide_ratio"].sort_values(ascending=False)).index.values
sorted_data = df.reindex(new_index)
# keep the 50 highest rates regardless of how many countries remain
sorted_data2 = sorted_data.head(50)
country_list_new = list(sorted_data2["country_list"])

country_Boomers_suicide_percentage = []
country_Generation_X_suicide_percentage = []
country_Silent_suicide_percentage = []
country_Millenials_suicide_percentage = []
country_G_I_Generation_suicide_percentage = []
country_Generation_Z_suicide_percentage = []

for i in country_list_new:
    x = data[data["country"] == i]
    b = x[x["generation"] == "Boomers"]
    gen_X = x[x["generation"] == "Generation X"]
    s = x[x["generation"] == "Silent"]
    m = x[x["generation"] == "Millenials"]
    g_i = x[x["generation"] == "G.I. Generation"]
    gen_Z = x[x["generation"] == "Generation Z"]
    z = sum(x.suicides_no)
    

        
    country_Boomers_suicide_percentage.append((sum (b.suicides_no))*100/ sum(x.suicides_no))
    country_Generation_X_suicide_percentage.append((sum (gen_X.suicides_no))*100/ sum(x.suicides_no))
    country_Silent_suicide_percentage.append((sum (s.suicides_no))*100/ sum(x.suicides_no))
    country_Millenials_suicide_percentage.append((sum (m.suicides_no))*100/ sum(x.suicides_no))
    country_G_I_Generation_suicide_percentage.append((sum (g_i.suicides_no))*100/ sum(x.suicides_no))
    country_Generation_Z_suicide_percentage.append((sum (gen_Z.suicides_no))*100/ sum(x.suicides_no))                                                


#visualization
f,ax = plt.subplots(figsize = (15,10))
sns.barplot(x=country_Boomers_suicide_percentage,y=country_list_new,color='green',alpha = 0.5,label='Boomers' )
sns.barplot(x=country_Generation_X_suicide_percentage,y=country_list_new,color='blue',alpha = 0.7,label='Generation X')
sns.barplot(x=country_Silent_suicide_percentage,y=country_list_new,color='cyan',alpha = 0.6,label='Silent')
sns.barplot(x=country_Millenials_suicide_percentage,y=country_list_new,color='yellow',alpha = 0.6,label='Millenials')
sns.barplot(x=country_G_I_Generation_suicide_percentage,y=country_list_new,color='orange',alpha = 0.6,label='G.I. Generation')
sns.barplot(x=country_Generation_Z_suicide_percentage,y=country_list_new,color='red',alpha = 0.6,label='Generation Z')

ax.legend(loc='lower right',frameon = True)
ax.set(xlabel='Percentage of Generations', ylabel='50 Countries with the Highest Suicide Rates',title = "Percentage of Each Country's Suicides According to Generations")
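
The six percentage lists above are each generation's share of a country's total suicides. A minimal sketch of computing all shares at once with a pivot table follows. The toy data is invented, and this is an alternative to the explicit loop rather than the notebook's own approach.

```python
import pandas as pd

# Invented toy data with the notebook's column names.
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "generation": ["Boomers", "Silent", "Boomers", "Boomers", "Generation X"],
    "suicides_no": [30, 10, 10, 5, 15],
})

# Rows: countries, columns: generations, cells: summed suicides (0 if absent).
totals = toy.pivot_table(
    index="country",
    columns="generation",
    values="suicides_no",
    aggfunc="sum",
    fill_value=0,
)

# Divide each row by its row total to get percentages that sum to 100.
shares = totals.div(totals.sum(axis=1), axis=0) * 100

print(shares)
```

Each row of `shares` sums to 100, so a generation missing from a country simply shows up as 0.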

<a id = "13"></a><br>
## Point Plot

* **<span style="color:crimson;">The relationship between Country Suicide Ratio</span>** and **<span style="color:crimson;">Yearly GDP Per Capita</span>** will be demonstrated as **<span style="color:crimson;">point plot</span>**.

In [None]:
data.head()

In [None]:
x = data[data["country"] == "United States"]
year_list = x.year.unique()
yearly_suicide_ratio = []
yearly_gdp_p_capita = []
for i in year_list:
    s = x[x["year"] == i]
    yearly_suicide_rate = sum(s.suicides_100k_pop)/len(s)
    yearly_suicide_ratio.append(yearly_suicide_rate)
    country_yearly_gdp_p_capita = sum(s["gdp_per_capita_$"])/len(s)
    yearly_gdp_p_capita.append(country_yearly_gdp_p_capita)
    
df = pd.DataFrame({"year_list": year_list, "yearly_suicide_ratio": yearly_suicide_ratio, "yearly_gdp_p_capita": yearly_gdp_p_capita })
df.yearly_suicide_ratio = df.yearly_suicide_ratio / max(df.yearly_suicide_ratio)
df.yearly_gdp_p_capita = df.yearly_gdp_p_capita / max(df.yearly_gdp_p_capita)

#visualization
f,ax1 = plt.subplots(figsize =(20,10))
sns.pointplot(x='year_list',y='yearly_suicide_ratio',data=df,color='lime',alpha=0.8)
sns.pointplot(x='year_list',y='yearly_gdp_p_capita',data=df,color='red',alpha=0.8)
plt.text(22,0.6,'yearly_gdp_p_capita',color='red',fontsize = 13,style = 'italic')
plt.text(22,0.65,'yearly_suicide_ratio',color='lime',fontsize = 14,style = 'italic')
plt.xlabel('Year',fontsize = 15,color='blue')
plt.ylabel('Values',fontsize = 15,color='blue')
plt.title('country_suicide_ratio  VS  yearly_gdp_p_capita',fontsize = 20,color='blue')
plt.grid()
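
Both series are divided by their own maximum before plotting so that they share a 0 to 1 scale. A minimal sketch of that scaling on invented numbers is given below. The column names `rate` and `gdp` are placeholders.

```python
import pandas as pd

# Invented values on very different scales.
df_toy = pd.DataFrame({
    "rate": [10.0, 12.5, 25.0],
    "gdp": [20000.0, 30000.0, 40000.0],
})

# Divide every column by its own maximum; the largest value becomes 1.0.
scaled = df_toy / df_toy.max()

print(scaled)
```

This keeps the shape of each curve but discards the units, so the plotted values only support comparing trends, not magnitudes.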

<a id = "14"></a><br>
## Joint Plot

* **<span style="color:crimson;">Yearly Suicide Ratio</span>** will be demonstrated as **<span style="color:crimson;">joint plot</span>**.

In [None]:
# recent seaborn versions require x/y as keywords and use `height` instead of `size`
g = sns.jointplot(x=df.yearly_suicide_ratio, y=df.yearly_gdp_p_capita, kind="kde", height=7)
plt.savefig('graph.png')
plt.show()

* **<span style="color:crimson;">The relationship between Yearly Suicide Ratio and Yearly GDP Per Capita</span>** will be demonstrated as **<span style="color:crimson;">joint plot</span>**.

In [None]:
g = sns.jointplot(x="yearly_suicide_ratio", y="yearly_gdp_p_capita", data=df, height=5, ratio=3, color="r")

<a id = "15"></a><br>
## Pie Plot

* **<span style="color:crimson;">Suicide Percentages According to Generations</span>** will be demonstrated as **<span style="color:crimson;">pie plot</span>**.

In [None]:
a = data[["generation","suicides_no"]].groupby(["generation"], as_index = False).sum().sort_values(by="suicides_no",ascending = False).set_index("generation")
labels = a.index
colors = ['slategrey','tab:blue','brown','yellow','seagreen','cyan']
explode = [0,0,0,0,0,0]
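# plt.pie expects a 1-D sequence, so select the column instead of the 2-D frame values
values = a["suicides_no"].values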

# visual
plt.figure(figsize = (7,7))
plt.pie(values, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', textprops={'color':"navy"})
plt.title('Suicide Percentages According to Generation',color = 'maroon',fontsize = 15)


* **<span style="color:crimson;">Suicide Percentages According to Age Ranges</span>** will be demonstrated as **<span style="color:crimson;">pie plot</span>**.

In [None]:
a = data[["age","suicides_no"]].groupby(["age"], as_index = False).sum().sort_values(by="suicides_no",ascending = False).set_index("age")
labels = a.index
colors = ['slategrey','tab:blue','brown','yellow','seagreen','cyan']
explode = [0,0,0,0,0,0]
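# plt.pie expects a 1-D sequence, so select the column instead of the 2-D frame values
values = a["suicides_no"].values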

# visual
plt.figure(figsize = (7,7))
plt.pie(values, explode=explode, labels=labels, colors=colors, autopct='%1.1f%%', textprops={'color':"navy"})
plt.title('Suicide Percentages According to Age Ranges',color = 'maroon',fontsize = 15)

<a id = "16"></a><br>
## Lm Plot

* **<span style="color:crimson;">The relationship between Yearly Suicide Ratio and Yearly GDP Per Capita</span>** will be demonstrated as **<span style="color:crimson;">Lm plot</span>**.


In [None]:
sns.lmplot(x="yearly_suicide_ratio", y="yearly_gdp_p_capita", data=df)
plt.show()

<a id = "17"></a><br>
## Kde Plot

* **<span style="color:crimson;">The relationship between Yearly Suicide Ratio and Yearly GDP Per Capita</span>** will be demonstrated as **<span style="color:crimson;">Kde plot</span>**.


In [None]:
# recent seaborn versions require x/y as keywords and use `fill` instead of `shade`
sns.kdeplot(x=df.yearly_suicide_ratio, y=df.yearly_gdp_p_capita, fill=True, cut=3)
plt.show()

<a id = "18"></a><br>
## Violin Plot

* **<span style="color:crimson;">The relationship between Yearly Suicide Ratio and Yearly GDP Per Capita</span>** will be demonstrated as **<span style="color:crimson;">Violin plot</span>**.


In [None]:
df.head()

In [None]:
df1 = df[["yearly_suicide_ratio","yearly_gdp_p_capita"]]

In [None]:
pal = sns.cubehelix_palette(2, rot=-.5, dark=.3)
sns.violinplot(data=df1, palette=pal, inner="points")
plt.show()

<a id = "19"></a><br>
## Heatmap Correlation

* **<span style="color:crimson;">The relationship between Yearly Suicide Ratio and Yearly GDP Per Capita</span>** will be demonstrated as **<span style="color:crimson;">Heatmap Correlation</span>**.

In [None]:
data.head()

In [None]:
df1.corr()

In [None]:
f,ax = plt.subplots(figsize=(5, 5))
sns.heatmap(df1.corr(), annot=True, linewidths=0.5,linecolor="red", fmt= '.2f',ax=ax)
plt.show()
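
The heatmap shows the values returned by `df1.corr()`, which by default are Pearson correlation coefficients. A minimal sketch of computing that coefficient by hand and comparing it with pandas is given below. The helper `pearson` and the toy columns `a` and `b` are hypothetical.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # centre both series on zero
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Invented toy data.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 5.0, 9.0]})

r_manual = pearson(toy["a"], toy["b"])
r_pandas = toy.corr().loc["a", "b"]

print(r_manual, r_pandas)
```

The two numbers should agree up to floating-point rounding. The coefficient is unaffected by the max-scaling applied to `df` earlier, because rescaling a series by a positive constant does not change its correlation with another series.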