# Suicide Rates Overview 1985 to 2016

In this Exploratory Data Analysis project, I will visualize and analyze the data on suicide rates from 1985 to 2016. I will try to understand the relationship each of the variables may have with the target variable, i.e., suicide rates.

**Research Question:** Which factors lead to higher suicide rates?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

plt.style.use('fivethirtyeight')
sns.set_style('white')

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv('/kaggle/input/suicide-rates-overview-1985-to-2016/master.csv')

df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include='O')

In [None]:
df.isna().sum()

In [None]:
df.nunique()

We have data on **101 countries**, spanning over **32 years** between _1985_ and _2016_. For each country, there is information about **6 age groups**, **6 generations**, and **2 genders**.

In [None]:
for index in df.nunique().index:
    if df.nunique()[index] < 10:
        print("Unique values in {} column:".format(index))
        print(df[index].unique(), "\n")

## Data Cleaning:

- '**gdp_for_year (\$)**' column has whitespaces in its column name.
- '**gdp_for_year (\$)**' column has the datatype object, since a comma (',') is used to separate the numbers. It can be converted into integer datatype after the commas are removed.
- '**HDI for year**' column has too many NULL values. Instead, I will use the HDI data from the UNDP website, which needs some transformation before it can be merged with the suicide data: http://hdr.undp.org/en/indicators/137506
- '**country-year**' column looks unnecessary; I can recreate it if I need it again.
- '**age**' and '**generation**' columns are categorical, and each has 6 categories. They can be converted into ordinal numeric variables to check their correlation with other columns (they will be highly correlated with each other, but we can ignore that).

In [None]:
df.rename({' gdp_for_year ($) ': 'gdp_for_year ($)'}, axis=1, inplace=True)

In [None]:
df['gdp_for_year ($)'] = df['gdp_for_year ($)'].apply(lambda x: ''.join(x.split(','))).astype('int64')
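
The same conversion can also be written with pandas' vectorized string methods instead of a Python-level lambda. This is only a sketch on a toy frame; the two values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the raw column; the numbers are illustrative only.
toy = pd.DataFrame({'gdp_for_year ($)': ['2,156,624,900', '63,067,077,179']})

# Strip the thousands separators, then cast to a 64-bit integer.
toy['gdp_for_year ($)'] = (
    toy['gdp_for_year ($)']
    .str.replace(',', '', regex=False)
    .astype('int64')
)

print(toy.dtypes)
```

Both approaches should give the same result; the vectorized form is mostly a readability choice.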

### HDI Column

In [None]:
hdi = pd.read_csv('/kaggle/input/human-development-index-hdi/Human Development Index (HDI).csv', skiprows=5, engine='python')

In [None]:
#for i in range(1990, 2020):
#    print("'" + str(i) + "', ", end='')

In [None]:
hdi = hdi[['HDI Rank', 'Country', '1990', '1991','1992', '1993', '1994', '1995',
           '1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003', '2004',
           '2005', '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
           '2014', '2015', '2016', '2017', '2018', '2019']]

In [None]:
hdi_t = hdi[['1990', '1991','1992', '1993', '1994', '1995', '1996', '1997',
     '1998', '1999', '2000', '2001', '2002', '2003', '2004', '2005',
     '2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013',
     '2014', '2015', '2016', '2017', '2018', '2019']].transpose()

In [None]:
hdi_t.columns = hdi['Country']

In [None]:
hdi_new = hdi_t.T.reset_index()

In [None]:
hdi_new.replace('..', np.nan, inplace=True)

In [None]:
hdi_new['Country'] = hdi_new['Country'].apply(lambda x: x.strip())

In [None]:
hdi_final = pd.melt(hdi_new, id_vars=['Country']).rename({'Country':'country', 'variable': 'year', 'value': 'HDI'}, axis=1)

In [None]:
hdi_final['year'] = hdi_final['year'].astype(np.int64)
hdi_final['HDI'] = hdi_final['HDI'].astype(float)

In [None]:
df = pd.merge(df, hdi_final, how='left', on=['country', 'year'])
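
To make the reshaping steps above easier to follow, here is a small sketch of the same wide-to-long idea on toy data. The frames `wide` and `suicides` and all their values are hypothetical. The sketch also uses `pd.to_numeric(..., errors='coerce')` to turn the `'..'` placeholders into NaN, which is a swapped-in technique rather than the `replace` call used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical wide HDI table: one row per country, one column per year.
wide = pd.DataFrame({
    'Country': [' Albania', 'Austria '],
    '1990': ['0.650', '0.803'],
    '1991': ['..', '0.807'],
})
wide['Country'] = wide['Country'].str.strip()

# Wide -> long: one row per (country, year) pair.
hdi_long = (
    wide.melt(id_vars=['Country'], var_name='year', value_name='HDI')
        .rename(columns={'Country': 'country'})
)
hdi_long['year'] = hdi_long['year'].astype('int64')
# '..' cannot be parsed as a number, so it becomes NaN.
hdi_long['HDI'] = pd.to_numeric(hdi_long['HDI'], errors='coerce')

# Hypothetical slice of the suicide data.
suicides = pd.DataFrame({
    'country': ['Albania', 'Albania', 'Austria'],
    'year': [1990, 1991, 1991],
    'suicides_no': [10, 12, 30],
})

# A left merge keeps every suicide row, even when no HDI value is available.
merged = suicides.merge(hdi_long, how='left', on=['country', 'year'])
print(merged)
```

The left merge is what leaves NaN values in the HDI column for country-year pairs that the HDI table does not cover.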

In [None]:
df.drop('HDI for year', axis=1, inplace=True)
df.drop('country-year', axis=1, inplace=True)

#### Imputation for HDI column

If you look closely at the HDI data below, you will observe that for each country, HDI values follow an upward, roughly **linear** trend. Therefore, I think it is reasonable to fill the null values using linear **interpolation**.

In [None]:
hdi.iloc[:60, :20]

In [None]:
df['HDI'].isna().sum()

In [None]:
df['HDI'] = df['HDI'].interpolate(method = 'linear', limit_direction='both')

In [None]:
df['HDI'].isna().sum()
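
One caveat: interpolating the whole column at once can fill a country's missing values from a neighbouring country's rows. A possible variant interpolates within each country instead. The sketch below uses a made-up toy frame, and the column names `HDI_global` and `HDI_by_country` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: two countries, each with gaps in HDI (values are illustrative).
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year': [1990, 1991, 1992, 1990, 1991, 1992],
    'HDI': [0.60, np.nan, 0.70, np.nan, 0.90, np.nan],
})

# Whole-column interpolation: B's missing 1990 value is computed
# from A's 1992 value and B's 1991 value.
toy['HDI_global'] = toy['HDI'].interpolate(method='linear', limit_direction='both')

# Per-country interpolation: each country only uses its own values.
toy['HDI_by_country'] = (
    toy.sort_values(['country', 'year'])
       .groupby('country')['HDI']
       .transform(lambda s: s.interpolate(method='linear', limit_direction='both'))
)

print(toy)
```

With this variant, a country that has no HDI values at all would keep its NaN values, so the null count afterwards may not be zero.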

### Age group & Generation group

I will create numerical **age_group** and **generation_group** columns from categorical **age** and **generation** columns.

In [None]:
list(df.groupby(['generation', 'age'])['generation'].count().index)

In [None]:
df['age'].unique()

In [None]:
df['generation'].unique()

In [None]:
def age_group(age):
    if age == '5-14 years':
        return 0
    elif age == '15-24 years':
        return 1
    elif age == '25-34 years':
        return 2
    elif age == '35-54 years':
        return 3
    elif age == '55-74 years':
        return 4
    else:
        return 5
    
def generation_group(generation):
    if generation == 'Generation Z':
        return 0
    elif generation == 'Millenials':
        return 1
    elif generation == 'Generation X':
        return 2
    elif generation == 'Boomers':
        return 3
    elif generation == 'Silent':
        return 4
    else:
        return 5

In [None]:
df['age_group'] = df['age'].apply(age_group)
df['generation_group'] = df['generation'].apply(generation_group)
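
A dictionary lookup with `Series.map` is one possible alternative to the if/elif chains above. The sketch below is an illustration only: `AGE_ORDER` is a hypothetical name, and `'75+ years'` is the assumed label for the oldest group, which the function above handles in its `else` branch.

```python
import pandas as pd

# Ordinal codes mirroring age_group(), from youngest to oldest.
# '75+ years' is an assumed label for the oldest group.
AGE_ORDER = {
    '5-14 years': 0,
    '15-24 years': 1,
    '25-34 years': 2,
    '35-54 years': 3,
    '55-74 years': 4,
    '75+ years': 5,
}

# Toy column, including one label that is not in the mapping.
toy = pd.DataFrame({'age': ['75+ years', '5-14 years', '35-54 years', 'unknown']})

# Labels missing from the dictionary become NaN instead of falling into code 5.
toy['age_group'] = toy['age'].map(AGE_ORDER)

print(toy)
```

The main difference is that an unexpected label shows up as NaN rather than being silently assigned to the last category.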

## Feature Engineering:

### Suicide, Population & Suicide Rates

I will create three new columns:
- 'suicides_no_total',
- 'population_total', and
- 'suicide_rates_total'

These columns hold totals per country and year, so they do not depend on age group, gender, generation, etc.

In the rest of this notebook, '**suicide counts**' refers to the total number of suicides, whereas '**suicide rates**' refers to total suicides divided by total population (multiplied by 100, i.e., expressed as a percentage).

In [None]:
df = df.merge(df.groupby(['country', 'year'])['suicides_no'].sum(), on=['country', 'year'], suffixes=('', '_total'))
df = df.merge(df.groupby(['country', 'year'])['population'].sum(), on=['country', 'year'], suffixes=('', '_total'))
df['suicide_rates_total'] = df['suicides_no_total']*100 / df['population_total']
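
As a small worked example of what these three columns contain, the sketch below builds the same totals on a toy frame using `groupby(...).transform('sum')`, which is a swapped-in alternative to the merge-based approach above. All values are made up.

```python
import pandas as pd

# Toy data: two rows (e.g. two sexes) per country-year.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'year': [2000, 2000, 2000, 2000],
    'suicides_no': [5, 15, 1, 3],
    'population': [10_000, 30_000, 2_000, 2_000],
})

keys = ['country', 'year']

# transform('sum') broadcasts each group's total back onto every row of the group.
toy['suicides_no_total'] = toy.groupby(keys)['suicides_no'].transform('sum')
toy['population_total'] = toy.groupby(keys)['population'].transform('sum')

# Rate as a percentage of the total population, as in the cell above.
toy['suicide_rates_total'] = toy['suicides_no_total'] * 100 / toy['population_total']

print(toy)
```

Every row of a given country-year ends up with the same total and the same rate, which is why later cells can take `.max()` per group to recover one value per country-year.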

### Final version of the data

In [None]:
df.head()

In [None]:
df.columns

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include='O')

In [None]:
df.isna().sum()

# Univariate Analysis

## Distributions of Quantitative Variables

I will first check the distributions of quantitative variables as they are in the raw data. However, since most of the values are separated according to _year_, _age groups_, _generation_ and _gender_, I will then check the distributions again with appropriate groupbys. I think the latter will provide a better insight.

In [None]:
sns.set_style('white')
plt.style.use('bmh')
plt.figure(figsize=(24, 12))

plt.subplot(2, 4, 1)
plt.hist(df['suicides_no'])
plt.title('suicides_no')

plt.subplot(2, 4, 2)
plt.hist(df['suicides/100k pop'])
plt.title('suicides/100k pop')

plt.subplot(2, 4, 3)
plt.hist(df['population'])
plt.title('population')

plt.subplot(2, 4, 4)
plt.hist(df['gdp_per_capita ($)'])
plt.title('gdp_per_capita ($)')

plt.subplot(2, 4, 5)
plt.boxplot(df['suicides_no'])
plt.title('suicides_no')

plt.subplot(2, 4, 6)
plt.boxplot(df['suicides/100k pop'])
plt.title('suicides/100k pop')

plt.subplot(2, 4, 7)
plt.boxplot(df['population'])
plt.title('population')

plt.subplot(2, 4, 8)
plt.boxplot(df['gdp_per_capita ($)'])
plt.title('gdp_per_capita ($)')

plt.show()

In [None]:
sns.set_style('white')
plt.style.use('bmh')
plt.figure(figsize=(24, 12))

plt.subplot(2, 4, 1)
plt.hist(df.groupby(['country', 'year'])['suicide_rates_total'].max(), bins=30)
plt.title('Distribution of Suicide Rates')

plt.subplot(2, 4, 2)
plt.hist(df.groupby(['country', 'year'])['population_total'].max(), bins=30)
plt.title('Distribution of Population')

plt.subplot(2, 4, 3)
plt.hist(df.groupby(['country', 'year'])['gdp_for_year ($)'].max(), bins=30)
plt.title('Distribution of GDP for Year')

plt.subplot(2, 4, 4)
plt.hist(df.groupby(['country', 'year'])['gdp_per_capita ($)'].max(), bins=30)
plt.title('Distribution of GDP per Capita')

plt.subplot(2, 4, 5)
plt.boxplot(df.groupby(['country', 'year'])['suicide_rates_total'].max())
plt.title('Distribution of Suicide Rates')

plt.subplot(2, 4, 6)
plt.boxplot(df.groupby(['country', 'year'])['population_total'].max())
plt.title('Distribution of Population')

plt.subplot(2, 4, 7)
plt.boxplot(df.groupby(['country', 'year'])['gdp_for_year ($)'].max())
plt.title('Distribution of GDP for Year')

plt.subplot(2, 4, 8)
plt.boxplot(df.groupby(['country', 'year'])['gdp_per_capita ($)'].max())
plt.title('Distribution of GDP per Capita')

plt.show()

As you can see from the graphs, most of these variables are **_right-skewed_**. Also, **Suicide Rates** do not have many outliers, as opposed to the **Population** and **GDP per Capita** data.
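
Skewness and outliers can also be checked numerically rather than only by eye. The sketch below uses a made-up series and the 1.5 × IQR rule, which is also the default whisker rule of matplotlib's boxplot. It is an illustration, not a result from this dataset.

```python
import pandas as pd

# Toy right-skewed data: most values are small, one is very large.
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

# A positive sample skewness indicates a longer right tail.
skew = s.skew()

# Flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = s.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(f"skewness: {skew:.2f}")
print("outliers:", outliers.tolist())
```

Applying the same two checks to the grouped series plotted above would give a rough numeric counterpart to the histograms and boxplots.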

## Countries

In [None]:
countries = df['country'].unique()

print("Countries in the dataset are:")
print(*countries, sep=', ', end='.')

In [None]:
print('Countries sorted by suicide counts:')
display(pd.DataFrame(df.groupby(['country'])['suicides_no'].sum().sort_values(ascending=False)).reset_index().head(25))

In [None]:
plt.figure(figsize=(14, 6))

plt.bar(df.groupby('country')['suicides_no'].sum().sort_values(ascending=False).index[:50],
        df.groupby('country')['suicides_no'].sum().sort_values(ascending=False).values[:50])
plt.title('Suicide Counts by Country (Top 50)', fontsize=20)
plt.xlabel('Countries', fontsize=15)
plt.ylabel('Sum of Suicides', fontsize=15)
plt.xticks(rotation=90, fontsize=15)
plt.yticks(fontsize=10)

plt.show()

In [None]:
print('Top 25 Countries according to Suicide Rates (suicide_no/population):')
display(pd.DataFrame(df.groupby('country')['suicide_rates_total'].max().sort_values(ascending=False).head(25)).reset_index())

In [None]:
plt.figure(figsize=(14, 6))

plt.bar(df.groupby('country')['suicide_rates_total'].max().sort_values(ascending=False).index[:50],
        df.groupby('country')['suicide_rates_total'].max().sort_values(ascending=False).values[:50])
plt.title('Suicide Rates by Country (Top 50)', fontsize=20)
plt.xlabel('Countries', fontsize=15)
plt.ylabel('Rates of Suicides', fontsize=15)
plt.xticks(rotation=90, fontsize=15)
plt.yticks(fontsize=10)

plt.show()

In [None]:
plt.figure(figsize=(14, 6))

plt.bar(df.groupby('country')['suicides/100k pop'].sum().sort_values(ascending=False).index[:50],
        df.groupby('country')['suicides/100k pop'].sum().sort_values(ascending=False).values[:50])
plt.title('Sum of suicides/100k pop by Country (Top 50)', fontsize=20)
plt.xlabel('Countries', fontsize=15)
plt.ylabel('Sum of suicides/100k pop', fontsize=15)
plt.xticks(rotation=90, fontsize=15)
plt.yticks(fontsize=10)

plt.show()

In [None]:
plt.figure(figsize=(50, 150))

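countries = df['country'].unique()
# Compute the grouped rates once instead of on every loop iteration
rates_by_country_year = df.groupby(['country', 'year'])['suicide_rates_total'].max()

# The 26 x 4 grid has room for all countries, so none needs to be skipped
for i, country in enumerate(countries):
    plt.subplot(26, 4, i+1)
    plt.plot(rates_by_country_year[country],
             lw=6, color='r', marker='o', ms=10, mfc='k')
    plt.title(str(country), fontsize=50)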
    plt.xlim(1985, 2016)
    plt.ylim(0, 0.051020)
    
plt.show()

### Year

In [None]:
plt.figure(figsize=(14, 8))

plt.plot(df.groupby('year')['suicides_no'].sum()[:-1])
plt.title('Suicide Counts by Year (Worldwide)', fontsize=25)
plt.xlabel('Years', fontsize=20)
plt.ylabel('Suicide Counts', fontsize=20)
plt.xticks(range(1985, 2016), rotation=45)

plt.show()

- From **1988** to **1990**, suicide counts increase from around 120,000 to nearly 200,000.
- From **1990** to **1996**, the increase continues, reaching around 250,000.

### Gender

In [None]:
plt.figure(figsize=(20, 10))

plt.subplot(1, 2, 1)
plt.bar(df.groupby('sex')['suicides_no'].sum().index,
        df.groupby('sex')['suicides_no'].sum().values, color=['m', 'c'])
plt.title('Suicide Counts by Gender (Worldwide)', fontsize=25)
plt.xlabel('Sex', fontsize=20)
plt.ylabel('Suicide Counts', fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

plt.subplot(1, 2, 2)
plt.pie(df.groupby('sex')['suicides_no'].sum(), labels=['female', 'male'], explode=(0.03, 0.03),
        autopct='%1.1f%%', shadow=True, colors=['m', 'c'], textprops={'fontsize':25}, startangle=180)
plt.title('Suicide Counts by Gender (Worldwide)', fontsize=25)

plt.show()

### Gender by Countries

In [None]:
top25_countries = df.groupby(['country'])['suicides_no'].sum().sort_values(ascending=False)[:25].index

plt.figure(figsize=(12, 10))

sns.barplot(x='country', y='suicides_no', hue='sex',
            data=df[df['country'].isin(top25_countries)],
            ci=None, order=top25_countries)
plt.title('Suicide Counts by Gender for 25 Countries with the Most Suicide Counts', fontsize=23)
plt.xlabel('Countries', fontsize=20)
plt.ylabel('Suicide Counts', fontsize=20)
plt.xticks(rotation=90)

plt.show()

In [None]:
top25_countries_r = df.groupby('country')['suicide_rates_total'].max().sort_values(ascending=False)[:25].index

plt.figure(figsize=(12, 8))

sns.barplot(x='country', y='suicides/100k pop', hue='sex',
            data=df[df['country'].isin(top25_countries_r)],
            ci=None, order=top25_countries_r)
plt.title('Suicide Rates by Gender for 25 Countries with the Most Suicide Rates', fontsize=23)
plt.xlabel('Countries', fontsize=20)
plt.ylabel('Suicide Rates', fontsize=20)
plt.xticks(rotation=90)

plt.show()

### Generation & Age Columns

In [None]:
list(df.groupby(['generation', 'age'])['generation'].count().index)

This shows that some age groups appear in more than one generation. For instance, the '25-34 years' age group can belong to the 'Boomers', 'Generation X', or 'Millenials' generation. Therefore, I will analyze these columns separately.

In [None]:
plt.figure(figsize=(20, 10))

plt.subplot(1, 2, 1)
plt.bar(df.groupby('generation')['suicides_no'].sum().sort_values(ascending=False).index,
        df.groupby('generation')['suicides_no'].sum().sort_values(ascending=False).values,
        color=['blue', 'red', 'purple', 'green', 'magenta', 'yellow'])
plt.title('Suicide Counts by Generation (Worldwide)', fontsize=25)
plt.xlabel('Generations', fontsize=20)
plt.ylabel('Total Count of Suicides', fontsize=20)
plt.xticks(rotation=45, fontsize=15)
plt.yticks(fontsize=15)

plt.subplot(1, 2, 2)
plt.pie(df.groupby('generation')['suicides_no'].sum().sort_values(ascending=False),
        labels=df.groupby('generation')['suicides_no'].sum().sort_values(ascending=False).index,
        explode=[0.01]*6, autopct='%1.1f%%', shadow=True, textprops={'fontsize':20}, startangle=180,
        colors=['blue', 'red', 'purple', 'green', 'magenta', 'yellow'])
plt.title('Suicide Counts by Generation (Worldwide)', fontsize=25)

plt.show()

In [None]:
plt.figure(figsize=(20, 10))

plt.subplot(1, 2, 1)
plt.bar(df.groupby('age')['suicides_no'].sum().sort_values(ascending=False).index,
        df.groupby('age')['suicides_no'].sum().sort_values(ascending=False).values,
        color=['blue', 'red', 'purple', 'green', 'magenta', 'yellow'])
plt.title('Suicide Counts by Age Group (Worldwide)', fontsize=25)
plt.xlabel('Age Groups', fontsize=20)
plt.ylabel('Total Count of Suicides', fontsize=20)
plt.xticks(rotation=45, fontsize=15)
plt.yticks(fontsize=15)

plt.subplot(1, 2, 2)
plt.pie(df.groupby('age')['suicides_no'].sum().sort_values(ascending=False),
        labels=df.groupby('age')['suicides_no'].sum().sort_values(ascending=False).index,
        explode=[0.01]*6, autopct='%1.1f%%', shadow=True, textprops={'fontsize':20, 'color':'k'}, startangle=180,
        colors=['blue', 'red', 'purple', 'green', 'magenta', 'yellow'])
plt.title('Suicide Counts by Age Group (Worldwide)', fontsize=25)

plt.show()

Now, I will create a line plot of each age group across the years. When I first did this, I saw that every line dropped sharply in 2016, which means that the values for 2016 were considerably smaller than those for earlier years (probably because the data for that year were not yet complete). So, I have filtered out the year 2016 from this plot.

In [None]:
year_age_df = pd.DataFrame(df.groupby(['year', 'age'])['suicides_no'].sum()).reset_index()
year_age_df = year_age_df[year_age_df['year'] != 2016]

In [None]:
plt.figure(figsize=(12,8))

sns.lineplot(x='year', y='suicides_no', hue='age', data=year_age_df)
plt.title('Suicide Counts by Age Groups over Years', fontsize=25)
plt.xlabel('Years', fontsize=20)
plt.ylabel('Suicide Counts', fontsize=20)
plt.xticks(range(1985, 2016), rotation=45)

plt.show()

In [None]:
year_generation_df = pd.DataFrame(df.groupby(['year', 'generation'])['suicides_no'].sum()).reset_index()
year_generation_df = year_generation_df[year_generation_df['year'] != 2016]

In [None]:
plt.figure(figsize=(12,8))

sns.lineplot(x='year', y='suicides_no', hue='generation', data=year_generation_df)
plt.title('Suicide Counts by Generation over Years', fontsize=25)
plt.xlabel('Years', fontsize=20)
plt.ylabel('Suicide Counts', fontsize=20)
plt.xticks(range(1985, 2016), rotation=45)

plt.show()

Looking at this data in terms of generation also allows us to see the years that each generation has lived. For instance, you will notice that data on **G.I. Generation** stops around the year _2000_, while the data of **Generation Z** starts around year _2007_.

### Suicide Counts and Rates of Countries by Year

In [None]:
country_year_group = pd.DataFrame(df.groupby(['country', 'year'])['suicides_no'].sum()).reset_index()
country_year_group = pd.concat([country_year_group, pd.Series(df.groupby(['country', 'year'])['suicide_rates_total'].max().values, name='suicide_rates_total')], axis=1)

sns.set_style('whitegrid')

plt.figure(figsize=(24, 16), dpi=300)

sns.lineplot(x='year', y='suicides_no', hue='country', linewidth=3,
             data=country_year_group[country_year_group['country'].isin(top25_countries[:10])])

plt.title('Suicide Counts by Year (Top 10 Countries)', fontsize=30)
plt.xlabel('Years', fontsize=25)
plt.ylabel('Sum of Suicides', fontsize=25)
plt.xticks(range(1985, 2016), fontsize=20, rotation=45)
plt.yticks(fontsize=15)

plt.show()

- For the **Russian Federation**, we see a _significant increase_ in suicide counts from **1991 to 1994**. These years correspond to the aftermath of the **collapse of the Soviet Union**, a politically unstable period in Russia.

- For **Japan**, there is an _increase_ in suicide counts from **1997 to 1998**. The **Asian Financial Crisis** of 1997 affected Japan severely.

- For the **United States**, we see a _steady increase_ from 2000 onwards.

In [None]:
sns.set_style('whitegrid')

plt.figure(figsize=(24, 16))

sns.lineplot(x='year', y='suicide_rates_total', hue='country', linewidth = 3,
             data=country_year_group[country_year_group['country'].isin(top25_countries_r[:10])])

plt.title('Suicide Rates by Year (Top 10 Countries)', fontsize=30)
plt.xlabel('Years', fontsize=25)
plt.ylabel('Rates of Suicides', fontsize=25)
plt.xticks(range(1985, 2016), fontsize=20, rotation=45)
plt.yticks(fontsize=15)

plt.show()

When suicide rates (i.e., total suicides / total population) are examined, we observe:

- A _dramatic increase_ in suicide rates in the **Republic of Korea** after **1991**.
- For the **rest of the countries**, on the other hand, there is a _steady decrease_ in suicide rates, especially after **2000**.

### HDI (Human Development Index)

In [None]:
print("Top 25 countries according to their HDI (Human Development Index):")
display(pd.DataFrame(df.groupby('country')['HDI'].mean().sort_values(ascending=False)[:25]))

In [None]:
sns.set_style('white')
plt.style.use('bmh')

plt.figure(figsize=(6, 10))

plt.barh(y=df.groupby('country')['HDI'].mean().sort_values(ascending=False).index[:25],
         width=df.groupby('country')['HDI'].mean().sort_values(ascending=False)[:25])
plt.title('Average HDI by Country (Top 25)', fontsize=20)
plt.xlabel('HDI', fontsize=15)
plt.ylabel('Countries', fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=10)

plt.show()

## Multivariate Analysis

### HDI & Suicide

In [None]:
hdi_suicide = pd.DataFrame(df.groupby(['country'])[['HDI', 'suicide_rates_total']].mean().reset_index())

In [None]:
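# Select the numeric columns explicitly; the 'country' column is text
hdi_suicide[['HDI', 'suicide_rates_total']].corr()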

In [None]:
sns.set_style('white')
plt.figure(figsize=(10, 6))

sns.scatterplot(x='HDI', y='suicide_rates_total', data=hdi_suicide)
plt.title('HDI vs Suicide Rates', fontsize=25)
plt.xlabel('HDI', fontsize=20)
plt.ylabel('Suicide Rates', fontsize=20)

plt.show()

There doesn't seem to be a correlation between Human Development Index (HDI) and Suicide Rates.
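
A scatter plot mainly reveals linear patterns, so it can help to compare the Pearson coefficient with a rank-based one such as Spearman's, which also picks up monotonic but non-linear relationships. The sketch below uses made-up numbers, not the values from `hdi_suicide`.

```python
import pandas as pd

# Toy country-level averages (illustrative values only).
toy = pd.DataFrame({
    'HDI': [0.55, 0.62, 0.70, 0.78, 0.85, 0.91],
    'suicide_rates_total': [0.012, 0.004, 0.015, 0.006, 0.011, 0.008],
})

# Pearson measures linear association; Spearman works on ranks.
pearson = toy.corr(method='pearson').loc['HDI', 'suicide_rates_total']
spearman = toy.corr(method='spearman').loc['HDI', 'suicide_rates_total']

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

If both coefficients are close to zero on the real data, that would support the reading of the scatter plot above.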

### GDP & Suicide

In [None]:
country_year_gdp = pd.DataFrame(df.groupby(['country', 'year', 'gdp_for_year ($)'])['suicides_no'].sum()).reset_index()
country_year_gdp = pd.concat([country_year_gdp, pd.Series(df.groupby(['country', 'year'])['suicide_rates_total'].max().values, name='suicide_rates_total')], axis=1)

country_year_gdp.head()

In [None]:
top25_gdp = country_year_gdp.groupby('country')['gdp_for_year ($)'].max().sort_values(ascending=False).index[:25]

Notice that for the following two scatter plots, I am limiting the x-axis with plt.xlim() (and, in the first plot, the y-axis with plt.ylim()), so that the plots are not dominated by the outliers from the Russian Federation and the United States.

In [None]:
plt.figure(figsize=(12, 8))

for country in top25_gdp:
    plt.scatter(country_year_gdp[country_year_gdp['country'] == country]['gdp_for_year ($)'],
                country_year_gdp[country_year_gdp['country'] == country]['suicides_no'],
                label=country)

plt.title('GDP & Suicide Counts (Top 25 GDP)', fontsize=25)
plt.xlabel('GDP per Year ($)', fontsize=15)
plt.ylabel('Suicide Counts', fontsize=15)
plt.xlim(0, 5_000_000_000_000)
plt.ylim(0, 30000)

plt.legend(fontsize=12)
plt.show()

In [None]:
sns.set_style('white')
plt.figure(figsize=(12, 8))

for country in top25_gdp:
    plt.scatter(country_year_gdp[country_year_gdp['country'] == country]['gdp_for_year ($)'],
                country_year_gdp[country_year_gdp['country'] == country]['suicide_rates_total'],
                label=country)

plt.title('GDP & Suicide Rates (Top 25 GDP)', fontsize=25)
plt.xlabel('GDP per Year ($)', fontsize=15)
plt.ylabel('Suicide Rates', fontsize=15)
plt.xlim(0, 5_000_000_000_000)

plt.legend(fontsize=12)
plt.show()

## Correlation Analysis

In [None]:
df.columns

In [None]:
plt.figure(figsize=(10, 8))

# Keep numeric columns only; text columns cannot be correlated
sns.heatmap(df.drop(['year'], axis=1).select_dtypes(include='number').corr(), annot=True, fmt='.3f')
plt.show()

Unfortunately, the relationship between **suicides_no_total** (or **suicide_rates_total**) and **gdp_for_year (\$)** is not as strong as I was expecting it to be.

The **age_group** and **generation_group** columns I created did not result in a meaningful correlation either.

However, there is a **_strong negative correlation_** between the '**gdp_per_capita (\$)**' and '**suicides/100k pop**' (suicide rates for each year, age group, gender, and generation) columns, and a positive correlation between the '**gdp_for_year (\$)**' and '**population**' columns. Since our target variable is the number/rate of suicides, we are not particularly interested in the relationship between GDP and population. There is also a strong correlation between **suicides_no_total** and **population_total**, but that is expected.

Let's see the relationship between the '**gdp_per_capita (\$)**' and '**suicides/100k pop**' columns then!

In [None]:
plt.figure(figsize=(8, 6))

plt.scatter(df['gdp_per_capita ($)'], df['suicides/100k pop'])
plt.title('GDP per Capita vs Suicide Rates')
plt.xlabel('GDP per Capita')
plt.ylabel('Suicide Rates')

plt.show()

## Conclusion

Overall, we can conclude that:

- In terms of **gender**, **_men_** commit suicide much more than **_women_**, both worldwide and for each country.
- The **age group** with the highest suicide counts is the **_35-54 years_** age group, while the **generation** with the highest suicide counts is **_Boomers_**.
- Worldwide, there is a sharp increase in suicide counts between the **years** **_1988_** and **_2000_**.
- **Human Development Index (HDI)** does not seem to have a high correlation with **suicide rates**.
- **Gross Domestic Product (GDP) per capita** has a **_strong correlation_** with **suicide rates**, especially when the suicide rates are broken down by year, gender and age group (i.e., the 'suicides/100k pop' column).