# Introduction
Suicide is a familiar concept, but where, in which years, and in which age ranges it occurs most often is less well known. This notebook explores the "Suicide Rates" dataset and looks for correlations in it.

<font color = "green"/>
Content:

1. [Load and Check Data](#1)
1. [Variables Description](#2)
    * [Categorical Variables](#3)
    * [Numerical Variables](#4)
1. [Outlier Detection](#5)
1. [Missing Value](#6)
    * [Find Missing Value](#7)
    * [Fill Missing Value](#8)
1. [Basic Data Analysis](#9)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting
from collections import Counter
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load and Check Data <a id ="1"></a>

In [None]:
data = pd.read_csv("../input/suicide-rates-overview-1985-to-2016/master.csv") # Load the dataset

In [None]:
data.columns # List the column names

In [None]:
data.info() # Summary of dtypes and non-null counts

In [None]:
data.head() # Preview the first rows to get a feel for the data

# Variables Description <a id ="2"></a>

* Country : Country in which the suicides occurred
* Year : Year in which the suicides occurred
* Sex : Sex of the people
* Age : Age range of the people who died
* Suicides no : Number of suicides
* Population : Population
* Suicides/100k pop : (Suicides no / Population) * 100,000
* Country-year : Country and year written together
* HDI for year : Human Development Index for the year
* GDP for year : Gross Domestic Product for the year (in dollars)
* GDP per capita : Gross Domestic Product / Population (in dollars)
* Generation : Generation
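
The `Suicides/100k pop` formula in the list above can be checked by hand. This is a minimal sketch on made-up numbers, not rows from `master.csv`; `rate_per_100k` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

def rate_per_100k(suicides_no, population):
    """Return the number of suicides per 100,000 people."""
    return suicides_no / population * 100_000

# Toy values chosen for easy arithmetic (not taken from the dataset)
toy = pd.DataFrame({
    "suicides_no": [21, 0, 150],
    "population": [312900, 50000, 1000000],
})
toy["rate"] = rate_per_100k(toy["suicides_no"], toy["population"]).round(2)
print(toy)
```

Comparing a column computed this way with the `suicides/100k pop` column would be one way to confirm how the dataset defines the rate.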

In [None]:
# data.info() shows how many columns there are of each dtype

* float64(2) : suicides/100k pop, HDI for year
* int64(4): year, suicides_no, population, gdp_per_capita
* object(6): country, sex, age, country-year, gdp_for_year, generation

## Categorical Variables <a id ="3"></a>
Categorical Variables: country, year, sex, age, country-year, generation

In [None]:
def bar_plot(variable):
    """
    input: variable, e.g. "sex"
    output: bar plot & value counts
    """
    var = data[variable] # Select the column to plot
    varValue = var.value_counts() # Count how often each category occurs
    
    plt.figure(figsize = (9,3)) # Size of the bar plot
    plt.bar(varValue.index,varValue.values) # Categories on the x-axis, counts on the y-axis
    plt.ylabel("Frequency") # Label for the y-axis
    plt.title(variable) # Title of the plot
    plt.show() # Render the figure
    print("{}: \n {}".format(variable,varValue)) # Print the same counts as text

In [None]:
categoricalVar = ["year","sex","age","generation"] # Some variables are left out because their plots do not help; some of the plotted ones are not very informative either
for b in categoricalVar:
    bar_plot(b)

In [None]:
categoricalVar2 = ["country","country-year"]
for c in categoricalVar2:
    print("{} \n".format(data[c].value_counts()))

## Numerical Variables <a id ="4"></a>

* Numerical Variables: suicides/100k pop, HDI for year, suicides_no, population, gdp_per_capita, gdp_for_year

In [None]:
def plot_hist(variable):
    plt.figure(figsize=(9,3)) # Size of the figure
    plt.xlabel(variable) # Label for the x-axis
    plt.ylabel("Frequency") # Label for the y-axis
    plt.title("Hist of {}".format(variable)) # Title of the plot
    plt.hist(data[variable],bins=66) # Draw the histogram with 66 bins
    plt.show() # Render the figure

In [None]:
numericalVar = ["suicides/100k pop","HDI for year","suicides_no","population"]
for n in numericalVar:
    plot_hist(n)

In [None]:
numericalVar2 = ["gdp_per_capita ($)"," gdp_for_year ($) "] # Histograms of these two columns are hard to interpret, so only their value counts are printed
for n in numericalVar2:
    print("{}: \n".format(data[n].value_counts()))
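
The `gdp_for_year` column is listed earlier as `object`, which is one reason its histogram is hard to read. A minimal sketch of converting it to integers, under the assumption that the values are strings with comma thousands separators (worth checking with `data[" gdp_for_year ($) "].head()` first). The toy values are made up, and stripping the padded column name is also an assumption about what is wanted.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up
toy = pd.DataFrame({" gdp_for_year ($) ": ["1,000,000", "2,500,000,000"]})

# Strip the padding spaces around the column names
toy.columns = toy.columns.str.strip()

# Remove the commas, then cast the strings to 64-bit integers
toy["gdp_for_year ($)"] = (
    toy["gdp_for_year ($)"].str.replace(",", "", regex=False).astype("int64")
)
print(toy.dtypes)
```

Once the column is numeric it can be passed to `plot_hist` like the other numerical variables.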

# Outlier Detection <a id ="5"></a>

In [None]:
def detect_outliers(df,features):
    outlier_indices = list()
    for f in features:
        Q1 = np.percentile(df[f],25) # 1st quartile (25%)
        Q3 = np.percentile(df[f],75) # 3rd quartile (75%)
        IQR = Q3-Q1 # Interquartile range
        outlierStep = 1.5 * IQR # Distance of the fences from the quartiles
        # Indices of the rows that fall outside the fences for this feature
        outlier_list_col = df[(df[f] < Q1 - outlierStep) | (df[f] > Q3 + outlierStep)].index
        outlier_indices.extend(outlier_list_col) # Collect the indices across all features

    outlier_indices = Counter(outlier_indices) # Count how many features flagged each row
    # Keep only the rows flagged as outliers in more than two features
    multiple_indices = list(i for i , v in outlier_indices.items() if v>2)

    return multiple_indices
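
The quartile-and-fence rule used inside `detect_outliers` can be tried on a single toy array. This is a sketch on made-up values with one planted outlier; it covers only the per-feature step, not the counting across features.

```python
import numpy as np

# Made-up values; 95 is the planted outlier
values = np.array([10, 12, 11, 13, 12, 11, 95])

q1 = np.percentile(values, 25)  # 1st quartile
q3 = np.percentile(values, 75)  # 3rd quartile
iqr = q3 - q1                   # interquartile range
step = 1.5 * iqr                # distance of the fences from the quartiles

lower, upper = q1 - step, q3 + step
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)  # 8.75 14.75 [95]
```

Because `detect_outliers` keeps an index only when its count is greater than 2, a row is returned only if at least three of the features passed in flag it; with the three features used in this notebook, that means all of them.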

In [None]:
data.loc[detect_outliers(data,["suicides_no","population","gdp_per_capita ($)"])]

In [None]:
data = data.drop(detect_outliers(data,["suicides_no","population","gdp_per_capita ($)"]), axis = 0).reset_index(drop=True)

In [None]:
data.loc[detect_outliers(data,["suicides_no","population","gdp_per_capita ($)"])]

# Missing Value <a id ="6"></a>
* A "missing value" is an entry for which no data was recorded.

## Find Missing Value <a id ="7"></a>

In [None]:
data.columns[data.isnull().any()] # Columns that contain at least one null value

In [None]:
data.isnull().sum()

## Fill Missing Value <a id ="8"></a>

In [None]:
data[data["HDI for year"].isnull()] # This column is mostly empty; how to fill it is not decided yet, so it is left as is
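
One possible way to fill `HDI for year`, shown only as a sketch and not as the approach taken in this notebook, is to use the median HDI of the same country. The toy frame below is made up, and `HDI filled` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# Made-up rows: country A has some HDI values, country B has none
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "HDI for year": [0.70, np.nan, 0.74, np.nan, np.nan],
})

# Median HDI per country, broadcast back to every row of that country
country_median = toy.groupby("country")["HDI for year"].transform("median")

# Fill the gaps with the country median; countries with no HDI at all stay NaN
toy["HDI filled"] = toy["HDI for year"].fillna(country_median)
print(toy)
```

Countries without any recorded HDI remain empty under this scheme, so a second decision (dropping them or leaving the gaps) would still be needed.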

In [None]:
data.info()

# Basic Data Analysis <a id ="9"></a>
* age - suicides_no
* generation - suicides_no
* population - suicides_no
* HDI for year - suicides_no

In [None]:
data[["age","suicides_no"]].groupby(["age"], as_index = False).mean().sort_values(by = "suicides_no", ascending = False)

In [None]:
data[["generation","suicides_no"]].groupby(["generation"], as_index = False).mean().sort_values(by = "suicides_no", ascending = False)

In [None]:
data[["population","suicides_no"]].groupby(["population"], as_index = False).mean().sort_values(by = "suicides_no", ascending = False)

In [None]:
data[["HDI for year","suicides_no"]].groupby(["HDI for year"], as_index = False).mean().sort_values(by = "suicides_no", ascending = False)