## Objective of the Kernel :
To analyse the suicide rate w.r.t. the various variables present in the data set:
1. Which countries have the highest total suicide numbers.
2. Which countries show an increasing trend in suicides.
3. The effect of population and GDP on the suicide rates.
4. At which age people are most susceptible to suicide.
5. Whether men or women commit more suicides.

### Please upvote if you like the Kernel, and comment if you have any feedback.

In [None]:
#Filtering warnings
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Data Understanding and cleaning

In [None]:
suicide_df = pd.read_csv('../input/suicide-rates-overview-1985-to-2016/master.csv')
suicide_df.head()

In [None]:
suicide_df.shape

In [None]:
suicide_df.describe(percentiles=[0.1,0.25,0.4,0.5,0.6,0.75,0.9,0.99,1])

In [None]:
count1 = len(suicide_df.loc[suicide_df['suicides_no'].between(np.percentile(suicide_df.suicides_no,0),
                                                     np.percentile(suicide_df.suicides_no,99))])
count2 = len(suicide_df.loc[suicide_df['suicides_no'].between(np.percentile(suicide_df.suicides_no,99),
                                                     np.percentile(suicide_df.suicides_no,100))])
print("Suicide Numbers between 0 and 99 percentile :",count1)
print("Suicide Numbers between 99 and 100 percentile :",count2)

#### Note :
We can see from the `describe` output, as well as from the row counts calculated above, that there is a large gap between the 99th and 100th percentile values (from 3993 to 22338).

There are 279 rows in that top 1 percentile.

We will analyse it further during the univariate analysis.
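
To make the gap between the bulk of the rows and the top percentile concrete, here is a minimal sketch on made-up numbers. The `toy` series is an assumption for illustration, not the real `suicides_no` column. It shows how a single percentile threshold splits the rows:

```python
import numpy as np
import pandas as pd

# Toy stand-in for suicides_no: mostly small counts plus one extreme value
toy = pd.Series([0, 1, 2, 3, 4, 5, 6, 7, 8, 100])

p99 = np.percentile(toy, 99)   # linear interpolation between the two largest values
below = (toy <= p99).sum()     # rows at or below the 99th percentile
above = (toy > p99).sum()      # rows strictly above it

print("99th percentile:", round(p99, 2))
print("rows at or below:", below, "| rows above:", above)
```

Unlike `Series.between`, which is inclusive at both ends by default, the strict `>` keeps a row that sits exactly on the threshold from being counted in both buckets.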

In [None]:
suicide_df.info()

In [None]:
#Categorical variables
suicide_df.select_dtypes(include=[object]).head()

In [None]:
#Numerical variables
suicide_df.select_dtypes(exclude=[object]).head()

In [None]:
suicide_df.isnull().sum()

In [None]:
#Deleting 'HDI for year' column as most of the values are NaN
#Dropping column country-year as it is redundant
suicide_df.drop(['HDI for year','country-year'], axis=1, inplace=True)
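
One hedged way to back the decision to drop a column is to look at the share of missing values per column. The frame below is toy data with made-up values; only the column names mirror the data set, and the 50% threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'HDI for year': [np.nan, np.nan, 0.8, np.nan],
    'suicides_no': [1, 2, 3, 4],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns where more than half of the values are missing
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
print(mostly_missing)
```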

In [None]:
#Checking columns with only one value throughout all the rows
suicide_df.loc[:,suicide_df.nunique()==1].columns

In [None]:
suicide_df.columns

In [None]:
#renaming columns for better readability and usability
suicide_df.rename(columns={'suicides/100k pop':'Suicides100kPop', ' gdp_for_year ($) ':'GDPForYear',
                          'gdp_per_capita ($)':'GDPPerCapita'}, inplace=True)

In [None]:
suicide_df.head()

In [None]:
suicide_df.GDPForYear = suicide_df.GDPForYear.apply(lambda x : x.replace(",", ""))
suicide_df.GDPForYear = suicide_df.GDPForYear.astype('int64')

In [None]:
#Changing GDPForYear into million $
suicide_df.GDPForYear = ((suicide_df.GDPForYear) / (1000000))
suicide_df.head()

## Visual Analysis

### Target Variable : 

#### 1. SUICIDE NUMBER and SUICIDE NUMBER PER 100K PEOPLE

In [None]:
plt.figure(figsize=(14,4))

plt.subplot(121)
plt.title('Suicide Number')
#kdeplot draws a density curve like the deprecated distplot(..., hist=False)
sns.kdeplot(suicide_df['suicides_no'])

plt.subplot(122)
plt.title('Suicide Number per 100k population')
sns.kdeplot(suicide_df['Suicides100kPop'])
plt.tight_layout()

#### Inference :

1. The plot is right-skewed: the median (25) is much lower than the mean (243).
2. 99% of the values are below 5000. The variance in the suicide numbers is very high; let us explore why that is the case.
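
The gap between mean and median can be checked numerically as well as visually. This is a minimal sketch on toy values (not the real column), using `Series.skew` as one possible measure of asymmetry:

```python
import pandas as pd

# Toy right-skewed sample: many small values and a few very large ones
toy = pd.Series([1, 2, 3, 4, 5, 10, 20, 50, 200, 2000])

mean, median = toy.mean(), toy.median()
skewness = toy.skew()   # positive for a right-skewed distribution

print(f"mean={mean:.1f}, median={median:.1f}, skew={skewness:.2f}")
```

A mean far above the median together with a positive skew is the same pattern the density plot shows.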

### Feature Variables :

#### 1. COUNTRY

In [None]:
lat_long = pd.read_csv('../input/country-geo/country_data.csv')
lat_long.rename(columns={'country':'countrycode','name':'country'},inplace=True)
lat_long.head()

In [None]:
#Checking whether both DataFrames hold exactly the same country names (same values, same order)
temp1 = pd.DataFrame(suicide_df.country.unique())
temp2 = pd.DataFrame(lat_long.country.unique())
temp2.equals(temp1)

In [None]:
#Checking the values which are not present in lat_long
df = suicide_df.copy()
df = df.merge(lat_long, how = 'left', on = 'country')
df.loc[df.countrycode.isnull()].country.unique()

In [None]:
#Correcting the country names in our data set, then merging the dataset with lat_long
#(the mapping is applied to suicide_df's own column rather than a mask built from the merged copy df)
suicide_df['country'] = suicide_df['country'].replace({
    'Cabo Verde': 'Cape Verde',
    'Republic of Korea': 'South Korea',
    'Russian Federation': 'Russia',
    'Saint Vincent and Grenadines': 'Saint Vincent and the Grenadines'})
suicide_df = suicide_df.merge(lat_long[['latitude','longitude','country']], how = 'left', on = 'country')
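
An alternative sketch for spotting mismatched country names uses the `indicator` parameter of `merge`, followed by `Series.replace` with a mapping for the fix. The two small frames are toy data with made-up coordinates, and `name_fixes` is a hypothetical helper name.

```python
import pandas as pd

left = pd.DataFrame({'country': ['Japan', 'Russian Federation', 'Cabo Verde']})
right = pd.DataFrame({'country': ['Japan', 'Russia', 'Cape Verde'],
                      'latitude': [36.2, 61.5, 16.0],      # made-up values
                      'longitude': [138.3, 105.3, -24.0]})  # made-up values

# indicator=True adds a _merge column telling where each row was found
check = left.merge(right, how='left', on='country', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'country'].tolist()
print(unmatched)

# Map the unmatched names to the spelling used in the lookup table
name_fixes = {'Russian Federation': 'Russia', 'Cabo Verde': 'Cape Verde'}
left['country'] = left['country'].replace(name_fixes)
fixed = left.merge(right, how='left', on='country')
print(fixed['latitude'].isnull().sum())
```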

In [None]:
#Adding the column suicide_country with total suicides in a country value
temp = suicide_df.copy()
table = temp.groupby(['country'])['suicides_no'].sum()
temp = temp.merge(table.reset_index(), how='left',on='country')
suicide_df['suicide_country'] = temp['suicides_no_y']
suicide_df.head()
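
The per-country total can also be attached without a merge. A minimal sketch, on toy data, uses `groupby(...).transform('sum')`, which returns a result aligned to the original rows:

```python
import pandas as pd

toy = pd.DataFrame({'country': ['A', 'A', 'B', 'B', 'B'],
                    'suicides_no': [10, 20, 1, 2, 3]})

# transform('sum') broadcasts each group's total back to every row of that group
toy['suicide_country'] = toy.groupby('country')['suicides_no'].transform('sum')
print(toy)
```

Because the result keeps the original index, it does not depend on the row order surviving a merge.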

In [None]:
#As we had seen earlier, the last percentile had a significant difference. 
#Looking at the countries included with the last percentile of suicides_no value.
suicide_df['country'].loc[suicide_df['suicides_no'].between(np.percentile(suicide_df.suicides_no,99),
                                                     np.percentile(suicide_df.suicides_no,100))].unique()

In [None]:
#Visualizing the top 10 countries with highest total suicide numbers
df = suicide_df.groupby(['country'])['suicides_no'].sum().sort_values(ascending=False).head(10)
df.plot.bar(figsize=(15,8))

#### Inference :

1. We can see that the top three countries with highest suicide count are - `Russia`, `United States` and `Japan`.

In [None]:
#Visualizing the top 5 countries by total suicides between 1985 and 2016, split by gender
#(df.plot.bar creates its own figure, so no separate plt.figure call is needed)
top5 = ['Russia', 'United States', 'Japan', 'France', 'Ukraine']
df = suicide_df.loc[suicide_df.country.isin(top5)].groupby(['country','sex'])['suicides_no'].sum().unstack(fill_value=0)
df.plot.bar(figsize=(15,8))


#### Inference :

1. We can see that in the top five countries by total suicides, the number of `male` suicides is consistently higher than the number of female ones.

In [None]:
#Visualizing the countries with total suicide counts on a map
from mpl_toolkits.basemap import Basemap

#Series.min()/max() skip NaN values, unlike the built-in min()/max()
lat_min = suicide_df['latitude'].min()
lat_max = suicide_df['latitude'].max()
lon_min = suicide_df['longitude'].min()
lon_max = suicide_df['longitude'].max()

m = Basemap(
    projection='merc', 
    llcrnrlat=lat_min, 
    urcrnrlat=lat_max, 
    llcrnrlon=lon_min, 
    urcrnrlon=lon_max,
    resolution='l'
)
# Draw the components of the map

longitudes = suicide_df['longitude'].tolist()
latitudes = suicide_df['latitude'].tolist()
suicide_count = suicide_df['suicide_country'].values
fig = plt.figure(figsize=(30,30))
ax = fig.add_subplot(1,1,1)
ax = m.drawcountries()
ax = m.drawcoastlines(linewidth=0.1, color="white")
ax = m.fillcontinents(color='grey', alpha=0.6, lake_color='grey')
ax = m.drawmapboundary(fill_color='#A6CAE0', linewidth=0)
ax = m.scatter(longitudes, latitudes, c=suicide_count,s=500, zorder = 1,linewidth=1,latlon=True, edgecolors='yellow',cmap='YlOrRd'
               ,alpha=1)
plt.title('Suicide Number - Countrywise', fontsize=30)

#### Inference :

Looking at the map, we can see that data is present for only some of the countries, so we cannot derive any global inference from our analysis, nor can we generalize the data by aggregating the countries into continents.

Most of the countries we have data for are in Europe and the Americas.

In [None]:
#Top 5 countries by total suicides, split by age group
top5 = ['Russia', 'United States', 'Japan', 'France', 'Ukraine']
df = suicide_df.loc[suicide_df.country.isin(top5)].groupby(['country','age'])['suicides_no'].sum().unstack(fill_value=0)
df.plot.bar(figsize=(15,8))

#### 2. YEAR

In [None]:
#Year against the total suicides in that year (line plot)
plt.figure(figsize=(10, 6))


df_time = suicide_df.groupby(["year"]).suicides_no.sum()
sns.lineplot(data = df_time)
plt.xlabel("Year")
plt.ylabel("Total Suicide Count")
plt.show()

#Year against the total suicide count of the year (bar plot)
df = suicide_df.groupby(['year'])['suicides_no'].sum()
df.plot(kind='bar',legend=True,figsize=(8,6),colormap='Pastel2')

In [None]:
print("Percent rows of 2015 :",round((len(suicide_df.loc[suicide_df.year==2015])/len(suicide_df.index))*100,2),"%")
print("Countries recorded in 2015 : ", len(suicide_df['country'].loc[suicide_df.year==2015].unique()))
print("Percent rows of 2016 :",round((len(suicide_df.loc[suicide_df.year==2016])/len(suicide_df.index))*100,2),"%")
print("Countries recorded in 2016 : ",len(suicide_df['country'].loc[suicide_df.year==2016].unique()))

#### Inference :

1. Year - Suicide count :

    - The total suicide count seems to increase between 1985 and 2000 and then decrease slightly up to 2015.
    - The sharp drop between 2015 and 2016 is probably due to fewer records in 2016 rather than any significant change in suicides.


2. Year - Avg GDP :

    - Avg GDP per year seems to have an increasing trend from 1985 to 2015 (plotted in the GDP section).
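
Whether a drop in a yearly total reflects fewer reporting countries can be checked by counting the distinct countries per year. This is a minimal sketch on toy data with made-up values; the `total_per_country` column is a hypothetical helper, not part of the analysis above.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'A', 'B', 'C', 'A'],
    'year':    [2014, 2014, 2014, 2015, 2015, 2015, 2016],
    'suicides_no': [100, 200, 300, 110, 210, 310, 120],
})

per_year = toy.groupby('year').agg(
    countries=('country', 'nunique'),   # how many countries reported
    total=('suicides_no', 'sum'),       # total count that year
)
# The average per reporting country is less sensitive to missing countries
per_year['total_per_country'] = per_year['total'] / per_year['countries']
print(per_year)
```

In the toy data the total collapses in 2016 while the per-country average does not, which is the pattern one would expect when records are simply missing.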


In [None]:
# Seeing the total number of suicides in a country in a particular year
temp = suicide_df.copy()
table = temp.groupby(['country','year'])['suicides_no'].sum()
temp = temp.merge(table.reset_index(), how='left',on=['country','year'])
temp = temp.sort_values(by='suicides_no_y',ascending = False)
temp[['country','year','suicides_no_y']].drop_duplicates(keep='last').head(50)

#### Inference :

Clearly, Russia, the United States and Japan top the list.

In [None]:
#Finding out the countries with an increasing yearly suicide count trend
def trend(countries,df):
    # For each country, count over its 15 most recent recorded years how often
    # the yearly total exceeded the previous calendar year's total
    lst = []
    num = []
    for i in countries:
        cnt = 0
        rows = df.loc[df['country']==i]
        # One total per year (the merged column repeats it on every row of that year)
        totals = rows.drop_duplicates('year').set_index('year')['suicides_no_y']
        years = sorted(totals.index, reverse=True)
        for j in years[:15]:
            # Skip years whose previous year is missing instead of comparing empty arrays
            if (j-1) in totals.index and totals.loc[j] > totals.loc[j-1]:
                cnt+=1
        # Keep countries that increased in at least 11 of those years
        if(cnt>=11):
            lst.append(i)
            num.append(cnt)
    trend_up = pd.DataFrame({'Count': num, 'Country': lst})
    return trend_up.sort_values(by='Count',ascending=False)
                    
countries = temp['country'].unique()
df = temp[['country','year','suicides_no_y']]
lst = trend(countries,df)
lst

In [None]:
#Visualising the top five countries with an increasing suicide count trend over the past 15 years.
rising = ['United States', 'Brazil', 'South Korea', 'Mexico', 'Netherlands']
df = suicide_df.loc[suicide_df.country.isin(rising)].groupby(['country','year'])['suicides_no'].sum().unstack(fill_value=0)
df.plot.bar(figsize=(15,8),legend=False,colormap='Accent')


#### Inference :

While `Brazil`, `Mexico` and `United States` clearly show the increasing trend, `South Korea` had a peak in the 1990s but still shows a clearly increasing trend overall. For `Netherlands`, the count seems mostly constant, with only an insignificant increase in the number of suicides.
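
One hedged way to put a number on "increasing trend" is the slope of a least-squares line through the yearly totals. The sketch below uses `np.polyfit` on toy data; `yearly_slope` is a hypothetical helper and a different technique from the year-over-year counting used above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'country': ['A'] * 4 + ['B'] * 4,
    'year': [2000, 2001, 2002, 2003] * 2,
    'suicides_no': [10, 12, 14, 16, 20, 18, 16, 14],
})

def yearly_slope(group):
    # Slope (change in yearly count per year) of a degree-1 least-squares fit
    slope, _intercept = np.polyfit(group['year'], group['suicides_no'], 1)
    return slope

slopes = {country: yearly_slope(g) for country, g in toy.groupby('country')}
print(slopes)
```

A positive slope indicates a rising count and a negative slope a falling one; the magnitude is in suicides per year, so it is not comparable between countries of very different size.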

In [None]:
#Seeing the countries with maximum number of suicides in 2015 
suicide_df[(suicide_df.year==2015)].groupby(['year','country'])['suicides_no'].sum().sort_values(ascending = False).head(10)

#### 3. Age

In [None]:
# A simple view of total number of suicides per age category in all the years from 1985 to 2016
df = suicide_df.groupby(['age'])['suicides_no'].sum().sort_values(ascending=False)
df.plot(kind='bar',legend=True,figsize=(8,6),colormap='Pastel1')

#### Note :

Clearly, the age categories `35-54` and `55-74` account for more suicides than the other age groups.

(On a side note, mid life crisis does seem to be between 45 years to 64 years approximately. Could this be a reason?)

In [None]:
# Seeing the sex wise categorization of age in suicide numbers
df = suicide_df.groupby(['age','sex'])['suicides_no'].sum().unstack(fill_value=0)
df.plot(kind='bar',legend=True,figsize=(8,6),colormap='Pastel2')

#### NOTE :

In every age category, `male` suicide numbers are higher than `female` ones. Let's look at the population to confirm that this is an indicative feature rather than a reflection of a disproportion in the population.

#### 4. Population

#### NOTE :

We have already seen that the average population per year has increased with time. Let us look at the population with respect to other parameters.

In [None]:
sns.set(style="whitegrid")
ax = sns.violinplot(x=suicide_df["population"])

Population seems to be dense around the median.

In [None]:
#In 2015 - most populous countries in the dataset
suicide_df[(suicide_df.year==2015)].groupby(['country'])['population'].sum().sort_values(ascending = False).head(10)

In [None]:
plt.figure(figsize=(15,5))
ax = sns.violinplot(x="age", y="population", data=suicide_df)

#### NOTE :
As we can see here, the population in the 35-54 age category is larger than in any other category. This might be one reason why the suicide count is higher for this category.

In [None]:
plt.figure(figsize=(15,5))
ax = sns.barplot(x="age", y="population", hue="sex", data=suicide_df, palette="muted")

#### Note :

The sex ratio in every age category seems to be almost the same. Hence the far higher number of male suicides in each age category is not explained by a difference in population.
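
The comparison behind this note can be made explicit by putting each sex's share of the population next to its share of the suicides. This is a minimal sketch on toy numbers, not values from the data set:

```python
import pandas as pd

toy = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female'],
    'age': ['15-24 years', '15-24 years', '35-54 years', '35-54 years'],
    'population': [1000, 1000, 2000, 2000],
    'suicides_no': [30, 10, 90, 30],
})

totals = toy.groupby('sex')[['population', 'suicides_no']].sum()
# Share of each sex in the population and in the suicide counts
shares = totals / totals.sum()
print(shares)
```

If the population shares are close to equal while the suicide shares are not, the difference cannot be attributed to population size alone.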


In [None]:
plt.figure(figsize=(15,5))
ax = sns.barplot(y="generation", x="population",data=suicide_df)


#### Note :

`Boomers` and `Generation X` seem to have larger populations than the other generations. Note that this plot shows population, not suicide counts.

#### 5. Suicides100kPop

Since total suicide numbers are skewed by the population of a country, suicides per hundred thousand people seems to be a better measure for deciding which country has the higher suicide rate.
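
Per-group rates cannot simply be summed across age and sex groups to get a country-level rate. Here is a minimal sketch of a population-weighted rate on toy data, assuming the per-group rate is `suicides_no / population * 100000`; the `rate_100k` and `summed_group_rates` columns are hypothetical names.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'population': [100000, 300000, 200000, 200000],
    'suicides_no': [10, 6, 8, 8],
})
toy['rate_100k'] = toy['suicides_no'] / toy['population'] * 100000

by_country = toy.groupby('country')[['suicides_no', 'population']].sum()
# Population-weighted rate: total suicides over total population
by_country['rate_100k'] = by_country['suicides_no'] / by_country['population'] * 100000
# Summing the group rates gives a different number that depends on group sizes
by_country['summed_group_rates'] = toy.groupby('country')['rate_100k'].sum()
print(by_country)
```

In the toy data both countries have the same weighted rate, yet the summed group rates differ, so a ranking based on the sum can be misleading.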

In [None]:
suicide_df[(suicide_df.year==2015)].groupby(['year','country'])['Suicides100kPop'].sum().sort_values(ascending = False).head(20)

#### NOTE :

`South Korea` seems to have the maximum number of suicides w.r.t. its population in 2015. And as we have seen in the yearly trend analysis as well, South Korea seems to have an increasing trend of suicides.


In [None]:
suicide_df.groupby(['year','country'])['Suicides100kPop'].sum().sort_values(ascending = False).head(20)

#### 6. GDP :


Exploring the GDP of the countries which have high suicide rates, as well as an increasing trend.
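
A hedged first look at the link between GDP per capita and the suicide rate is a correlation coefficient at country-year level. The sketch below uses toy data and the Pearson correlation from `Series.corr`; the `cy` frame and its column names are hypothetical, and a correlation on its own does not establish an effect.

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'year': [2000, 2000, 2001, 2001] * 2,
    'population': [100000] * 8,
    'suicides_no': [5, 5, 6, 6, 2, 2, 1, 1],
    'GDPPerCapita': [1000, 1000, 1100, 1100, 3000, 3000, 3300, 3300],
})

# One row per country-year: total suicides, total population, GDP per capita
cy = toy.groupby(['country', 'year']).agg(
    suicides=('suicides_no', 'sum'),
    population=('population', 'sum'),
    gdp_pc=('GDPPerCapita', 'first'),   # repeated on every row of a country-year
)
cy['rate_100k'] = cy['suicides'] / cy['population'] * 100000

# Pearson correlation between GDP per capita and the suicide rate
r = cy['gdp_pc'].corr(cy['rate_100k'])
print(round(r, 3))
```

Aggregating to one row per country-year first avoids counting the same GDP value once for every age and sex group.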

In [None]:
plt.figure(figsize=(15,6))

plt.subplot(121)
df_time = suicide_df.groupby(["year"]).GDPForYear.mean()
sns.lineplot(data = df_time)
plt.xlabel("Year")
plt.ylabel("Average GDP For Year")

plt.subplot(122)
df_time = suicide_df.groupby(["year"]).GDPPerCapita.mean()
sns.lineplot(data = df_time)
plt.xlabel("Year")
plt.ylabel("Average GDPPerCapita")
plt.tight_layout()
plt.show()


In [None]:
#Visualising the GDPPerCapita trend for the five countries with an increasing suicide trend.
#GDPPerCapita is repeated on every age/sex row of a country-year, so take the mean rather than the sum
rising = ['United States', 'Brazil', 'South Korea', 'Mexico', 'Netherlands']
df = suicide_df.loc[suicide_df.country.isin(rising)].groupby(['country','year'])['GDPPerCapita'].mean().unstack(fill_value=0)
df.plot.bar(figsize=(15,8),legend=False,colormap='Accent')


In [None]:
suicide_df.drop(['latitude', 'longitude', 'suicide_country'], axis=1,inplace=True)

In [None]:
sns.pairplot(suicide_df, hue="age")
plt.show()