# Possible Features Associated With Suicide Rates

## Exploratory Data Analysis
### Contents
1. Introduction
2. General View of the Data
3. Data Cleaning
4. Exploring the Data
5. Feature Engineering
6. Results

### 1- Introduction
In this study, we examine suicide rate data from 1985 to 2016. What are the suicide rates worldwide?
Are there features that affect suicidal behavior, and if so, what are they? The purpose of this study is to answer these questions with the data we have. To that end, we examine both the worldwide data and the data of some selected countries.

### 2- General View Of Data
#### Variables
- country: Name of the country.
- year: Year of the data.
- sex: Sex of the group.
- age: Age group.
- suicides_no: Number of suicides in that group.
- population: Number of people in that group.
- suicides/100k pop: Number of suicides per 100,000 people.
- country-year: Combination of the country and year columns.
- HDI for year: Human Development Index (HDI) for that year.
- gdp_for_year (\\$): Gross Domestic Product (GDP) of the country for that year.
- gdp_per_capita (\\$): GDP per capita for that year.
- generation: Generation of the age group.
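
To make the `suicides/100k pop` definition concrete, here is a minimal sketch of how such a rate can be derived from `suicides_no` and `population`. The toy rows and the `rate_per_100k` column name are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the numbers are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "suicides_no": [21, 0, 150],
    "population": [312_900, 50_000, 1_000_000],
})

# Rate per 100,000 people = suicides / population * 100,000
toy["rate_per_100k"] = (toy["suicides_no"] / toy["population"] * 100_000).round(2)
print(toy)
```

Because the rate is scaled by population, it allows groups of very different sizes to be compared, which raw suicide counts do not.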

For more information about the data, visit this page: https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

#### Importing the useful libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy.stats.mstats import winsorize
from scipy import stats  # the private scipy.stats.stats module is deprecated
from scipy.stats import zscore
from scipy.stats import jarque_bera
from scipy.stats import normaltest
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
import warnings

warnings.filterwarnings("ignore")
%matplotlib inline
sns.set(style="whitegrid")
title_font= {"family": "arial", "weight": "bold", "color": "darkred", "size": 13}
label_font= {"family": "arial", "weight": "bold", "color": "darkblue", "size": 10}


#### Viewing data

In [None]:
df= pd.read_csv("../input/suicide-rates-overview-1985-to-2016/master.csv")
df.head() # We only display the first five lines


In [None]:
df.info()

The data consists of 12 columns and 27,820 rows.

##### Categorical Variables
- country
- sex
- age
- country-year
- gdp_for_year (\\$)
- generation
##### Continuous Variables
- year
- suicides_no
- population
- suicides/100k pop
- HDI for year
- gdp_per_capita (\\$)

The gdp_for_year column looks categorical because its values are stored as text, but it contains numerical values. We will convert it to a continuous variable in a later step.

 

#### Making column names easy to use


In [None]:
df.columns = ['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides/100k_rate', 'country-year', 'HDI_for_year',
       'gdp_for_year', 'gdp_per_capita', 'generation']
df.columns

#### Summary statistics of the data

In [None]:
pd.options.display.float_format= "{:.6f}".format
df.describe()

### 3- Data Cleaning

Are there any missing data?

In [None]:
df.isnull().sum()*100/df.shape[0]

Only HDI_for_year contains null values; no other column has any. We will drop the null values when we examine HDI_for_year. Are there any incorrect values in the year, sex, age, country and generation columns? Let's analyze these columns.
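
As a minimal sketch of the two steps mentioned here (measuring the share of missing values and dropping them only for the column under study), the following uses a hypothetical four-row frame rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: three of the four HDI values are missing
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "HDI_for_year": [0.70, np.nan, np.nan, np.nan],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean() * 100
print(missing_pct)

# Drop rows only where the column of interest is missing
hdi_only = toy.dropna(subset=["HDI_for_year"])
print(len(hdi_only))
```

Passing `subset` keeps rows that are complete in every other column, so the rest of the analysis does not lose data because of a single sparse column.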



 


In [None]:
df.nunique()

In [None]:
columns_names = ["year", "sex", "age", "generation", "country"]
for col in columns_names:
    print("{} unique values:\n {}".format(col, df[col].unique()))

No incorrect values appear. But remember, gdp_for_year must be continuous. Let's convert gdp_for_year to a continuous variable.


In [None]:
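# Assign the result back; str.strip() returns a new Series and does not modify the column in place
df["gdp_for_year"] = df["gdp_for_year"].str.strip()
df["gdp_for_year"]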

In [None]:
df["gdp_for_year"] = df["gdp_for_year"].replace(",", "", regex=True)
df["gdp_for_year"]

In [None]:
df["gdp_for_year"]= df["gdp_for_year"].astype("int64")

In [None]:
print("incorrect values for {}: ".format("gdp_for_year"))
for value in df["gdp_for_year"]:
    try:
        float(value)
    except (TypeError, ValueError):
        # Print only the values that cannot be interpreted as numbers
        print(value)

In [None]:
df.info()

Great! As we can see, gdp_for_year is now continuous.
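
An alternative way to do the same conversion is sketched below, assuming the column holds comma-separated number strings. The sample values (including the invalid `"n/a"`) are hypothetical. `pd.to_numeric` with `errors="coerce"` converts and flags invalid entries in one pass.

```python
import pandas as pd

# Hypothetical raw values in the style of the gdp_for_year column
raw = pd.Series(["2,156,624,900", " 63,067,077,179 ", "n/a"])

# Remove surrounding spaces and thousands separators
cleaned = raw.str.strip().str.replace(",", "", regex=False)

# Invalid entries become NaN instead of raising an error
numeric = pd.to_numeric(cleaned, errors="coerce")

# Original strings that could not be converted
bad = raw[numeric.isna()]
print(numeric)
print(bad.tolist())
```

Checking `bad` before casting to an integer type avoids a failed `astype` call on a dirty column.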


#### Study of outliers

In [None]:
plt.figure(figsize=(18,10))
columns_name = ["suicides_no", "population","suicides/100k_rate", "gdp_per_capita", "gdp_for_year" ]
for i in range(5):
    plt.subplot(2,3,i+1)
    plt.boxplot(df[columns_name[i]])
    plt.title("{} box graph".format(columns_name[i]), fontdict= title_font)
    

In [None]:
plt.figure(figsize=(18,10))
columns_name = ["suicides_no", "population","suicides/100k_rate", "gdp_per_capita", "gdp_for_year" ]
for i in range(5):
    plt.subplot(2,3,i+1)
    plt.hist(df[columns_name[i]])
    plt.title("{} histogram graph".format(columns_name[i]), fontdict=title_font)
    
    

As we can see, outliers appear in all five columns: suicides_no, population, suicides/100k_rate, gdp_per_capita and gdp_for_year.

#### We will review the z-scores for suicides_no, population, suicides/100k_rate, gdp_per_capita, gdp_for_year.

In [None]:
columns_name = ["suicides_no", "population","suicides/100k_rate", "gdp_per_capita", "gdp_for_year" ]
for name in columns_name:
    zscores = zscore(df[name])
    # For each threshold, count the values whose z-score exceeds it (upper tail only)
    zscorelist = [(threshold, int((zscores > threshold).sum()))
                  for threshold in np.arange(0, 5, 0.1)]
    # Build the table once per column, after all thresholds are evaluated
    df_outliers = pd.DataFrame(zscorelist, columns=["threshold", "outliers"])
    plt.plot(df_outliers.threshold, df_outliers.outliers)
    plt.title(name, fontdict=title_font)
    plt.show()
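
The idea behind the z-score review can be sketched on synthetic data. The sample, the two planted outliers and the helper name `count_outliers` are assumptions for illustration. Unlike the cell above, this sketch counts both tails by using the absolute z-score.

```python
import numpy as np
from scipy.stats import zscore

# Synthetic sample: 1000 ordinary values plus two planted outliers
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, size=1000), [150.0, 200.0]])

z = zscore(values)

def count_outliers(z_values, threshold):
    """Count the values whose absolute z-score exceeds the threshold."""
    return int((np.abs(z_values) > threshold).sum())

for threshold in (1, 2, 3, 5):
    print(threshold, count_outliers(z, threshold))
```

The count can only shrink as the threshold grows, which is why the curves in the plots above fall from left to right.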
    

#### We will review with Tukey's Method.

In [None]:
columns_name = ["suicides_no", "population","suicides/100k_rate", "gdp_per_capita", "gdp_for_year" ]
for col in columns_name:
    q75, q25 = np.percentile(df[col], [75,25])
    iqr = q75-q25  # interquartile range
    rows = []
    for threshold in np.arange(1,5,0.5):
        min_value= q25- (iqr*threshold)
        max_value= q75+ (iqr*threshold)
        outliers= int(((df[col]>max_value) | (df[col]<min_value)).sum())
        rows.append({col: col, "threshold": threshold, "outliers": outliers})
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
    comparison = pd.DataFrame(rows, columns= [col, "threshold", "outliers"])
    display(comparison)
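
Tukey's method flags a value as an outlier when it lies more than `k` interquartile ranges below the first quartile or above the third quartile. Below is a minimal sketch on a hypothetical sample; the function name `tukey_outlier_count` is an assumption, not part of any library.

```python
import numpy as np

def tukey_outlier_count(values, k=1.5):
    """Count the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    lower = q25 - k * iqr
    upper = q75 + k * iqr
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical sample with one obvious outlier
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(tukey_outlier_count(sample))
print(tukey_outlier_count(sample[:-1]))
```

A larger `k` widens the fences, so fewer points are flagged. This is the same effect that the threshold column in the tables above explores.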
    
        

#### We will get rid of outliers by winsorization.


In [None]:
df["winsorize_suicides_no"] = winsorize(df["suicides_no"], (0, 0.11))
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.boxplot(df["winsorize_suicides_no"], whis=2.5)
plt.title("winsorize suicides_no", fontdict=title_font)

plt.subplot(122)
plt.boxplot(df["suicides_no"])
plt.title("suicides_no", fontdict=title_font)
plt.show()
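
To see what winsorization does to individual values, consider this small sketch with hypothetical numbers. With `limits=(0, 0.2)`, the top 20% of the values (here a single value) are replaced by the largest remaining value, and the lower tail is left untouched.

```python
import numpy as np
from scipy.stats.mstats import winsorize

# Hypothetical values with one extreme observation
values = np.array([1, 2, 3, 4, 100])

# limits=(lower fraction, upper fraction) of the data to cap
capped = winsorize(values, limits=(0, 0.2))
print(np.asarray(capped))
```

The extreme value 100 is capped at 4 rather than removed, so the number of rows stays the same.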

In [None]:
df["winsorize_suicides/100k_rate"] = winsorize(df["suicides/100k_rate"], (0,0.05))
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.boxplot(df["winsorize_suicides/100k_rate"], whis=2.5)
plt.title("winsorize suicides/100k_rate", fontdict=title_font)

plt.subplot(122)
plt.boxplot(df["suicides/100k_rate"])
plt.title("suicides/100k_rate", fontdict=title_font)
plt.show()

In [None]:
df["winsorize_gdp_per_capita"] = winsorize(df["gdp_per_capita"], (0, 0.03))
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.boxplot(df["winsorize_gdp_per_capita"], whis=2.0)
plt.title("winsorize gdp_per_capita", fontdict=title_font)

plt.subplot(122)
plt.boxplot(df["gdp_per_capita"])
plt.title("gdp_per_capita", fontdict=title_font)
plt.show()

In [None]:
df["winsorize_gdp_for_year"] = winsorize(df["gdp_for_year"], (0, 0.11))
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.boxplot(df["winsorize_gdp_for_year"], whis=2.5)
plt.title("winsorize gdp_for_year", fontdict=title_font)

plt.subplot(122)
plt.boxplot(df["gdp_for_year"])
plt.title("gdp_for_year", fontdict=title_font)
plt.show()

In [None]:
df["winsorize_population"]= winsorize(df["population"], (0,0.09))
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.boxplot(df["winsorize_population"], whis=3.0)
plt.title("winsorize_population", fontdict=title_font)

plt.subplot(122)
plt.boxplot(df["population"])
plt.title("population", fontdict=title_font)
plt.show()

In [None]:
columns_name = ["suicides_no", "winsorize_suicides_no", "suicides/100k_rate", 
                "winsorize_suicides/100k_rate",  "gdp_per_capita",
                "winsorize_gdp_per_capita","gdp_for_year", "winsorize_gdp_for_year", "population", "winsorize_population" ]
plt.figure(figsize=(30,12))
# Each row of the 5x2 grid shows a raw column (left) next to its winsorized version (right)
for i, name in enumerate(columns_name):
    plt.subplot(5,2,i+1)
    plt.hist(df[name])
    plt.title(name, fontdict=title_font)

plt.show()  
        
    

#### We should apply a logarithmic transformation to bring the data closer to a normal distribution.

In [None]:
columns_name= ["suicides_no", "population", "suicides/100k_rate",
               "gdp_per_capita", "gdp_for_year"]
for name in columns_name:
    plt.figure(figsize=(15,6))
    plt.subplot(2,2,1)
    plt.hist(df[name])
    plt.title(name, fontdict=title_font)
        
    plt.subplot(2,2,2)
    plt.hist(np.log(df[name]+1))
    plt.title(name+ " (log transformation)", fontdict=title_font)
    plt.show()
    
        

As we can see, the data come closer to a normal distribution after the logarithmic transformation.
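
One way to quantify this effect, beyond reading the histograms, is to compare skewness before and after the transformation. The sketch below uses a synthetic right-skewed sample (an assumption, not the dataset) and `np.log1p`, which computes log(x + 1) and therefore stays finite when x is 0.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample standing in for a column such as population
rng = np.random.default_rng(42)
values = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

skew_raw = skew(values)
skew_log = skew(np.log1p(values))  # log1p(x) = log(x + 1)

print("skewness before:", skew_raw)
print("skewness after: ", skew_log)
```

A skewness closer to 0 indicates a more symmetric distribution, which is what the transformed histograms show visually.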

In [None]:
columns_name= ["suicides_no", "population", "suicides/100k_rate",
               "gdp_per_capita", "gdp_for_year"]
for name in columns_name:
    # Use the same log(x + 1) transform for the quartiles and for the comparison;
    # np.log(x) alone would give -inf for zero values such as suicides_no == 0
    log_values = np.log(df[name]+1)
    q75_log, q25_log = np.percentile(log_values, [75,25])
    iqr_log= q75_log-q25_log
    q75, q25 = np.percentile(df[name], [75,25])
    iqr= q75-q25
    rows = []
    for threshold in np.arange(1,5,0.5):
        max_value_log = q75_log+ (iqr_log*threshold)
        min_value_log = q25_log- (iqr_log*threshold)
        max_value = q75+ (iqr*threshold)
        min_value = q25- (iqr*threshold)
        outliers_log = int(((log_values>max_value_log) | (log_values<min_value_log)).sum())
        outliers = int(((df[name]>max_value) | (df[name]<min_value)).sum())
        rows.append({"threshold": threshold, "outliers {}".format(name): outliers,
                     "outliers_log": outliers_log})
    # DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
    comparison = pd.DataFrame(rows, columns= ["threshold", "outliers {}".format(name), "outliers_log"])
    display(comparison)   
    



The number of outliers also decreases with the logarithmic transformation.

### 4- Exploring The Data
We will examine how several features relate to suicide rates.

### 4.1- The relationship between the economic situation and suicide rates


We will create a new dataset with the country-level averages and study it.


In [None]:
# Select the numeric columns before averaging; in pandas 2.0+ mean() raises on text columns
df1 = df.groupby("country")[["winsorize_gdp_per_capita", "winsorize_suicides/100k_rate"]].mean()
df1.head()

We will categorize the economic situation as very low, low, medium, high and very high.

In [None]:
def economy_convert(value):
    if value<5000:
        return "very low"
    elif value<10000:
        return "low"
    elif value<20000:
        return "medium"
    elif value<30000:
        return "high"
    else:
        return "very high"
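
The same binning can be sketched with `pd.cut`, which assigns the categories to a whole column at once. The GDP values below are hypothetical. Setting `right=False` makes each bin include its lower edge, which matches the `<` comparisons in `economy_convert`.

```python
import numpy as np
import pandas as pd

# Hypothetical GDP per capita values, including values exactly on bin edges
gdp = pd.Series([1200, 5000, 19999.9, 25000, 30000], name="gdp_per_capita")

bins = [-np.inf, 5000, 10000, 20000, 30000, np.inf]
labels = ["very low", "low", "medium", "high", "very high"]

# right=False gives the intervals [lower, upper), e.g. 5000 falls into "low"
category = pd.cut(gdp, bins=bins, labels=labels, right=False)
print(category.tolist())
```

The result is an ordered categorical, so sorting or plotting by category follows the economic order rather than the alphabetical one.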


In [None]:
df1["category"]= df1.winsorize_gdp_per_capita.apply(economy_convert)
df1

In [None]:
df1.groupby("category")["winsorize_suicides/100k_rate"].mean()

In [None]:
plt.figure(figsize=(10,6))
# seaborn 0.12+ no longer accepts x and y as positional arguments
sns.barplot(data=df1, x="category", y="winsorize_suicides/100k_rate", order= ["very low", "low", "medium", "high", "very high"])
plt.title("Economic Status And Suicide Rate", fontdict=title_font)
plt.xlabel("Economic Category", fontdict=label_font)
plt.ylabel("Suicide Rate", fontdict=label_font)
plt.xticks(color= "black")
plt.yticks(color= "black")
plt.show()


t-test for economic categories.

In [None]:
# kategoriler = categories, karsilastirma = comparison
kategoriler = df1["category"].unique()
pd.options.display.float_format= "{:.6f}".format
rows = []
for i in range(0,len(kategoriler)):
    for j in range(i+1, len(kategoriler)):
        # Independent two-sample t-test between each pair of categories
        ttest= stats.ttest_ind(df1[df1["category"]==kategoriler[i]]["winsorize_suicides/100k_rate"],
                               df1[df1["category"]==kategoriler[j]]["winsorize_suicides/100k_rate"])
        rows.append({"category_1": kategoriler[i], "category_2": kategoriler[j],
                     "statistics": ttest[0], "p_value": ttest[1]})
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
karsilastirma = pd.DataFrame(rows, columns= ["category_1", "category_2", "statistics", "p_value"])

display(karsilastirma)   

Only the difference between the high and very high categories is statistically significant.
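
The pairwise comparison pattern used in this section can be sketched more compactly with `itertools.combinations`. The data below are synthetic and were generated so that only the `high` group has a different mean; the column name `rate` is an assumption.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Synthetic groups: "low" and "medium" share a mean, "high" is shifted upwards
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "category": np.repeat(["low", "medium", "high"], 30),
    "rate": np.concatenate([rng.normal(10, 2, 30),
                            rng.normal(10, 2, 30),
                            rng.normal(15, 2, 30)]),
})

rows = []
for first, second in combinations(toy["category"].unique(), 2):
    result = stats.ttest_ind(toy.loc[toy["category"] == first, "rate"],
                             toy.loc[toy["category"] == second, "rate"])
    rows.append({"category_1": first, "category_2": second,
                 "statistics": result.statistic, "p_value": result.pvalue})

comparison = pd.DataFrame(rows)
print(comparison)
```

When many pairs are tested, some p-values fall below 0.05 by chance alone, so a stricter cutoff or a multiple-comparison correction may be worth considering.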


### 4.2- Do suicide numbers differ significantly between age groups?

We will create a new dataset to study this question.

In [None]:
# Total (winsorized) number of suicides per age group
df1 = df.groupby("age")["winsorize_suicides_no"].sum().reset_index()
df1

In [None]:
orderlist= ["5-14 years", "15-24 years", "25-34 years", "35-54 years", "55-74 years", "75+ years"]

In [None]:
plt.figure(figsize=(12,5))
# seaborn 0.12+ no longer accepts x and y as positional arguments; the bars show group means
sns.barplot(data=df, x="age", y="winsorize_suicides_no", order=orderlist)
plt.title("Suicide Numbers by Age Groups", fontdict=title_font)
plt.xlabel("Age Groups", fontdict=label_font)
plt.ylabel("Mean Suicide Numbers",  fontdict=label_font)
plt.xticks(color= "black")
plt.yticks(color= "black")
plt.show()  

#### Test for Age Groups

In [None]:
# yaslar = age groups, karsilastirma = comparison
yaslar = df1["age"].unique()
pd.options.display.float_format= "{:.6f}".format
rows = []
for i in range(0,len(yaslar)):
    for j in range(i+1,len(yaslar)):
        # Independent two-sample t-test between each pair of age groups
        ttest = stats.ttest_ind(df[df["age"]==yaslar[i]]["winsorize_suicides_no"],
                                df[df["age"]==yaslar[j]]["winsorize_suicides_no"])
        rows.append({"group_1": yaslar[i], "group_2": yaslar[j],
                     "statistics": ttest[0], "p_value": ttest[1]})
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
karsilastirma = pd.DataFrame(rows, columns= ["group_1", "group_2", "statistics", "p_value"])
display(karsilastirma)

### 4.3- How are suicide numbers distributed across age groups by gender?


In [None]:
# Select the numeric column before averaging; in pandas 2.0+ mean() raises on text columns
df1= df.groupby(["sex", "age"])["winsorize_suicides_no"].mean().reset_index()


plt.figure(figsize=(20,5))
sns.barplot(data=df1, x="age", y="winsorize_suicides_no", hue="sex", order=orderlist)
plt.title("Mean Suicide Numbers By Gender And Age Group", fontdict=title_font)
plt.xlabel("Age Groups", fontdict=label_font)
plt.ylabel("Mean Suicide Numbers", fontdict=label_font)

plt.show()  


As we can see, the mean number of suicides among men is higher than among women in every age group.


### 4.4- How are suicide rates distributed by year and gender?

In [None]:
df1 = pd.DataFrame()
df1["year"]= df["year"].astype("object")
df1["sex"]= df["sex"]
df1["winsorize_suicides/100k_rate"]= df["winsorize_suicides/100k_rate"]
df1.head()

In [None]:
plt.figure(figsize=(18,7))
sns.barplot(data=df1, x="year", y="winsorize_suicides/100k_rate", hue="sex")
plt.title("Suicide Rates by Years", fontdict=title_font)
plt.xlabel("Year", fontdict=label_font)
plt.ylabel("Suicide Rates", fontdict=label_font)
plt.xticks(color= "black")
plt.yticks(color= "black")
plt.show() 

t-test for years

In [None]:
# yıllar = years, karsilastirma = comparison
yıllar = df["year"].unique()
pd.options.display.float_format= "{:.6f}".format
rows = []
for i in range(0,len(yıllar)):
    for j in range(i+1,len(yıllar)):
        # Independent two-sample t-test between each pair of years
        ttest= stats.ttest_ind(df[df["year"]==yıllar[i]]["winsorize_suicides/100k_rate"],
                               df[df["year"]==yıllar[j]]["winsorize_suicides/100k_rate"])
        rows.append({"group_1": yıllar[i], "group_2": yıllar[j],
                     "statistics": ttest[0], "p_value": ttest[1]})
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
karsilastirma= pd.DataFrame(rows, columns= ["group_1", "group_2", "statistics", "p_value"])
# Show only the pairs of years with p_value < 0.005
display(karsilastirma[karsilastirma["p_value"]<0.005])

### 4.5- Is there any relationship between HDI (Human Development Index) and suicide rates?

In [None]:
df.head()

Not all countries have HDI data. That's why we will only study the countries that have HDI data.

In [None]:
# groupby().mean() skips missing HDI values; countries without any HDI data are dropped below
df1 = df.groupby("country")[["HDI_for_year", "winsorize_suicides/100k_rate"]].mean()
df1 = df1.dropna()
df1.head()

In [None]:
plt.figure(figsize=(12,6))
plt.scatter(df1["HDI_for_year"], df1["winsorize_suicides/100k_rate"])
plt.title("HDI And Suicide Rates", fontdict=title_font)
plt.xlabel("HDI", fontdict=label_font)
plt.ylabel("Suicide Rates", fontdict=label_font)
plt.show()  

No meaningful relationship appears in the scatter plot. We will calculate the correlation between HDI and suicide rates by using the "corr()" method. 

In [None]:
df1.corr() 

In [None]:
korelasyon_matrisi_df1= df1.corr()
plt.figure(figsize=(14,6))
sns.heatmap(korelasyon_matrisi_df1, square=True, annot=True, linewidth=.5, vmin=0, vmax=1, cmap="Greens")
plt.title("Suicide Rate And HDI Correlation Matrix", fontdict=title_font)
plt.show()

Again, the relationship does not seem meaningful.
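
Besides `corr()`, a correlation coefficient can be paired with a p-value to judge whether it differs from zero. The sketch below uses `scipy.stats.pearsonr` on synthetic data; the two arrays are generated independently of each other and are not taken from the dataset.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic stand-ins for country-level HDI and suicide rate, generated independently
rng = np.random.default_rng(7)
hdi = rng.uniform(0.5, 0.95, size=80)
rate = rng.normal(12, 5, size=80)

# r is the Pearson correlation coefficient, p the two-sided p-value for r = 0
r, p = pearsonr(hdi, rate)
print("r = {:.3f}, p = {:.3f}".format(r, p))
```

Pearson's r only measures linear association, so a value near zero does not rule out a non-linear relationship.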

### 4.6- Study of suicide rates by specific countries.

In [None]:
df = df.dropna()
target = "winsorize_suicides/100k_rate"
country_groups = {
    "Developing Countries": ["Brazil", "Mexico", "Turkey"],
    "South America": ["Argentina", "Chile", "Ecuador"],
    "Strong Economic Europe": ["Belgium", "France", "Germany"],
    "North Europe": ["Norway", "Finland", "Denmark"],
    "East Europe": ["Serbia", "Ukraine", "Bulgaria"],
    "Countries with advanced technology": ["United States", "United Kingdom", "Japan"],
}

# One plotly line chart per group: yearly mean suicide rate of each country.
# plt.figure() is not needed because plotly does not draw on matplotlib figures.
for group_name, countries in country_groups.items():
    subset = df[df["country"].isin(countries)]
    yearly = subset.groupby(["country", "year"])[target].mean().reset_index()
    fig = px.line(yearly, x="year", y=target, color="country", title=group_name)
    fig.show()




### 5- Feature Engineering  

#### 5.1- We can use Jarque-Bera and Normal Tests to see if the columns follow the normal distribution. 

In [None]:
columns_name= ["suicides_no", "population", "suicides/100k_rate",
               "gdp_per_capita", "gdp_for_year"]
for name in columns_name:
    plt.figure(figsize=(15,6))
    plt.subplot(2,2,1)
    plt.hist(df[name])
    plt.title(name, fontdict=title_font)
        
    plt.subplot(2,2,2)
    plt.hist(np.log(df[name]+1))
    plt.title(name+ " (log transformation)", fontdict=title_font)
    plt.show()
    

The distributions of our features look like this. Let's use the Jarque-Bera and Normal Tests to check whether the columns follow the normal distribution.

In [None]:
columns_name = ["suicides_no", "suicides/100k_rate", "population", "gdp_per_capita", "gdp_for_year"]
pd.options.display.float_format = "{:.5f}".format

rows = []
for names in columns_name:
    # Both tests are applied to the log(x + 1) transformed column
    jb_stats= jarque_bera(np.log(df[names]+1))
    normal_stats= normaltest(np.log(df[names]+1))
    rows.append({"attribute": names, 
                 "jarque_bera_stats": jb_stats[0], "jarque_bera_p_value": jb_stats[1],
                 "normaltest_stats": normal_stats[0], "normaltest_p_value": normal_stats[1]})
# DataFrame.append was removed in pandas 2.0, so the rows are collected in a list first
distribution_test = pd.DataFrame(rows, columns= ["attribute", "jarque_bera_stats", "jarque_bera_p_value",
                                                 "normaltest_stats", "normaltest_p_value"])
display(distribution_test)



A p-value of 0.00 means that we reject the null hypothesis of normality: the variables do not follow a normal distribution, even after the logarithmic transformation.
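
To show how these tests behave on data that is clearly not normal, here is a minimal sketch with a synthetic exponential sample (an assumption for illustration). Both tests use normality as the null hypothesis, so a small p-value is evidence against it.

```python
import numpy as np
from scipy.stats import jarque_bera, normaltest

# Synthetic, strongly right-skewed sample
rng = np.random.default_rng(3)
skewed = rng.exponential(scale=2.0, size=2000)

jb_stat, jb_p = jarque_bera(skewed)   # based on skewness and kurtosis
k2_stat, k2_p = normaltest(skewed)    # D'Agostino-Pearson test

print("Jarque-Bera: stat={:.2f}, p={:.5f}".format(jb_stat, jb_p))
print("normaltest:  stat={:.2f}, p={:.5f}".format(k2_stat, k2_p))
```

With samples as large as this dataset, even small departures from normality produce very small p-values, so the histograms are a useful complement to the tests.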

####  5.2- Normalization And Standardization

We will use normalization and standardization because some machine learning models assume that all features have a similar range of values.

Normalization: Rescaling a variable to the interval [0,1]


Standardization: Rescaling a variable to a mean of 0 and a standard deviation of 1.
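
Both definitions can be written out directly with NumPy. This is a minimal sketch on four hypothetical values, not the notebook's own implementation.

```python
import numpy as np

# Hypothetical values
values = np.array([2.0, 4.0, 6.0, 10.0])

# Normalization (min-max scaling): (x - min) / (max - min), result lies in [0, 1]
normalized = (values - values.min()) / (values.max() - values.min())

# Standardization (z-score): (x - mean) / std, result has mean 0 and std 1
standardized = (values - values.mean()) / values.std()

print(normalized)
print(standardized.mean(), standardized.std())
```

Note that `np.std` divides by n, while pandas `Series.std()` divides by n - 1 by default, so the standard deviation that pandas reports for a standardized column can differ slightly from 1.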



In [None]:
#normalization (min-max scaling to the interval [0, 1]):
# sklearn's normalize() rescales each row to unit norm, which maps every single-value row
# to 0 or 1; minmax_scale performs the rescaling described above
from sklearn.preprocessing import minmax_scale

df["norm_winsorize_suicides_no"]= minmax_scale(df["winsorize_suicides_no"])
df["norm_winsorize_suicides/100k_rate"]= minmax_scale(df["winsorize_suicides/100k_rate"])
df["norm_winsorize_gdp_per_capita"]= minmax_scale(df["winsorize_gdp_per_capita"])
df["norm_winsorize_gdp_for_year"]= minmax_scale(df["winsorize_gdp_for_year"])
df["norm_winsorize_population"]= minmax_scale(df["winsorize_population"])

norm_features= ["winsorize_suicides_no", "norm_winsorize_suicides_no",
                "winsorize_suicides/100k_rate", "norm_winsorize_suicides/100k_rate",
                "winsorize_gdp_per_capita", "norm_winsorize_gdp_per_capita",
                "winsorize_gdp_for_year", "norm_winsorize_gdp_for_year",
                "winsorize_population", "norm_winsorize_population"]
print("---Minimum Values---\n ")
print(df[norm_features].min())
print("---Maximum Values---\n ")
print(df[norm_features].max())

As we can see, the minimum of the normalized values is 0.00 and the maximum is 1.00.

In [None]:
# standardization
df["scale_suicides_no"]= scale(df["winsorize_suicides_no"])
df["scale_suicides/100k_rate"]= scale(df["winsorize_suicides/100k_rate"])
df["scale_gdp_per_capita"]= scale(df["winsorize_gdp_per_capita"])
df["scale_gdp_for_year"]= scale(df["winsorize_gdp_for_year"])
df["scale_population"]= scale(df["winsorize_population"])

scale_features= ["winsorize_suicides_no", "scale_suicides_no", 
                 "winsorize_suicides/100k_rate", "scale_suicides/100k_rate", 
                 "winsorize_gdp_per_capita", "scale_gdp_per_capita", 
                 "winsorize_gdp_for_year", "scale_gdp_for_year", 
                 "winsorize_population", "scale_population"]
print("---Standard Deviation--- \n")
print(df[scale_features].std())
print("---Means--- \n")
print(df[scale_features].mean())


As we can see, the standard deviation of the standardized values is 1.00 and their mean is 0.00.

### 6- Results

This study examined possible features that affect suicide rates.
- We compared suicide rates with the economic conditions of the countries, categorizing the economic situation by GDP per capita. According to the results of the analysis, we did not see a consistent, significant relationship between suicide rates and the economic situation; only the high and very high categories differed significantly.
- We studied the distribution of suicides across age groups. Suicide numbers are highest in the 35-54 age group.
- We checked the distribution of suicide rates by gender. Suicide rates were higher in men than in women, and the gap between male and female rates does not change much over the years.
- We examined the Human Development Index (HDI) together with suicide rates. We did not see a meaningful relationship between HDI and suicide rates.
- We compared the suicide rates of some specific countries, grouped as Developing Countries, South America, European Countries With Strong Economy, North Europe, East Europe, and Countries With Advanced Technology. These comparisons are shown in the graphs.

Many features have an impact on the suicide rates of countries. For example, geographical features, people's religious beliefs, the prevalence of substance use, and rates of sexual and physical abuse can affect suicide rates. This may be why we did not see an impact of the economic situation and HDI on suicide rates.