## WHO Suicide Statistics

#### The aim of this notebook is to analyze the data to determine the age group and country with the most suicides, and how suicide numbers have changed over the years

#### The evaluation will take place in 3 stages
#### > Data Cleaning
#### > Data Visualization
#### > Apply Machine Learning for prediction

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df=pd.read_csv('../input/who_suicide_statistics.csv')
df.head()

In [None]:
new_df=df.copy()
new_df.head()

In [None]:
df.info()

In [None]:
df.fillna(0,inplace=True)
df.head()

In [None]:
df.info()
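Filling every missing value with 0 treats "not reported" as "no suicides", which lowers any average computed later. The sketch below uses a small made-up frame (the column names match the dataset, the values are invented) to show how the missing counts could be inspected first and how the two choices differ. It is an illustration under those assumptions, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with the dataset's column names; all values are made up.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "suicides_no": [10.0, np.nan, 3.0, np.nan],
    "population": [1000.0, 1200.0, np.nan, 800.0],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Option 1: fill with 0 (what the notebook does).
filled = toy.fillna(0)

# Option 2: keep only rows where a suicide count was actually reported.
reported = toy.dropna(subset=["suicides_no"])

# The mean differs because the zeros from option 1 count as real observations.
filled_mean = filled["suicides_no"].mean()
reported_mean = reported["suicides_no"].mean()
print(filled_mean, reported_mean)
```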

### Data Cleaning

In [None]:
df.country.unique()

In [None]:
df.year.unique()

In [None]:
df.age.unique()

In [None]:
df.suicides_no.unique()

In [None]:
df=df.replace(['male','female'],[0,1])
df=df.replace(['5-14 years','15-24 years','25-34 years','35-54 years','55-74 years','75+ years'],[0,1,2,3,4,5])
df.head()
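The same encoding can be written with explicit dictionaries, which keeps the code-to-label mapping available for plot labels later. This is a minimal sketch on a made-up frame; the names `SEX_CODES`, `AGE_CODES` and `AGE_LABELS` are hypothetical and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical lookup tables that mirror the codes used in the replace() calls.
SEX_CODES = {"male": 0, "female": 1}
AGE_CODES = {
    "5-14 years": 0,
    "15-24 years": 1,
    "25-34 years": 2,
    "35-54 years": 3,
    "55-74 years": 4,
    "75+ years": 5,
}

# Toy rows with the dataset's column names; values are made up.
toy = pd.DataFrame({
    "sex": ["male", "female", "male"],
    "age": ["75+ years", "5-14 years", "35-54 years"],
})

# map() only touches the named column, so other columns cannot be changed by accident.
encoded = toy.assign(
    sex=toy["sex"].map(SEX_CODES),
    age=toy["age"].map(AGE_CODES),
)

# Invert the dictionary to turn the codes back into readable labels.
AGE_LABELS = {code: label for label, code in AGE_CODES.items()}
decoded_age = encoded["age"].map(AGE_LABELS)
print(encoded)
```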

In [None]:
df['population']=df['population'].astype(int)
df.head()

In [None]:
df['suicides_no']=df['suicides_no'].astype(int)
df.head()

In [None]:
df.suicides_no.unique()

### Data Visualization

In [None]:
# pd.scatter_matrix was removed from the top-level namespace; it now lives in pd.plotting
pd.plotting.scatter_matrix(df,color='red',figsize=(10,10),diagonal='hist')

#### The scatter matrix shows the pairwise relationships between the numeric features, with a histogram of each feature on the diagonal

In [None]:
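# barplot draws the mean by default; estimator=sum gives the totals named in the title
sns.barplot(x='age',y='suicides_no',data=df,estimator=sum,ci=None)
plt.title("Total suicides by age group")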

#### Age codes: 0 = 5-14 years, 1 = 15-24 years, 2 = 25-34 years, 3 = 35-54 years, 4 = 55-74 years, 5 = 75+ years

#### Most suicides are committed in the 35-54 years age group
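seaborn's `barplot` draws the mean of `y` per category unless an `estimator` is passed, so it helps to check totals and means side by side. The sketch below does this with `groupby` on a made-up frame that uses the notebook's encoded age column; the numbers are invented for illustration.

```python
import pandas as pd

# Toy rows: age codes 0 (5-14 years) and 3 (35-54 years); counts are made up.
toy = pd.DataFrame({
    "age": [0, 0, 3, 3, 3],
    "suicides_no": [1, 3, 10, 20, 30],
})

# Total and mean suicides per age group in one table.
by_age = toy.groupby("age")["suicides_no"].agg(["sum", "mean"])
print(by_age)

# Age code with the highest total.
top_age = by_age["sum"].idxmax()
```

The `sum` column answers "how many suicides in total", while the `mean` column answers "how many per row of the table", which depends on how many country and year records exist for each group.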

In [None]:
# estimator=sum plots totals instead of the default mean
sns.barplot(x='sex',y='suicides_no',data=df,estimator=sum,ci=None)
plt.title("Total suicides by sex")


#### Sex codes: 0 = male, 1 = female

#### It can be observed that more suicides are committed by males than by females

In [None]:
# estimator=sum plots totals instead of the default mean
sns.barplot(x='age',y='suicides_no',data=df,hue='sex',palette='spring',estimator=sum,ci=None)
plt.title("Total suicides by age group and sex")

In [None]:
# estimator=sum plots totals instead of the default mean
sns.barplot(x='year',y='suicides_no',data=df,estimator=sum,ci=None)
plt.xticks(rotation=90)
plt.title("Total suicides over the years")

<a href="https://imgur.com/cs2OeWx"><img src="https://i.imgur.com/cs2OeWx.png" title="source: imgur.com" /></a>

#### Russia and the United States are the countries with the highest number of suicides

<a href="https://imgur.com/uF3gxgn"><img src="https://i.imgur.com/uF3gxgn.png" title="source: imgur.com" /></a>

#### Countries with the highest number of suicides, in decreasing order
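The country charts are included as images, so the code that produced them is not shown. One way such a ranking could be computed is a `groupby` followed by a descending sort. The sketch below uses placeholder country names and invented counts; it is not the code behind the images.

```python
import pandas as pd

# Toy rows with the dataset's column names; countries and counts are placeholders.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "suicides_no": [5, 7, 20, 1, 3],
})

# Total suicides per country, largest first.
totals = (
    toy.groupby("country")["suicides_no"]
    .sum()
    .sort_values(ascending=False)
)

# Keep only the top entries for a readable chart.
top_two = totals.head(2)
print(totals)
```

Calling `totals.plot(kind="bar")` would then draw the ranking as a bar chart.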

<a href="https://imgur.com/frD0dZZ"><img src="https://i.imgur.com/frD0dZZ.png" title="source: imgur.com" /></a>

#### Trends by age group over the years:
#### > 5-14 years: suicides are almost nil
#### > 15-24 years: the number of suicides is almost constant
#### > 25-34 years: the number is almost constant, with small increases and decreases over the years
#### > 35-54 years: suicides increased until 2000 and then decreased
#### > 55-74 years: suicides increased until 1994 and then decreased
#### > 75+ years: the number is constant until 2015 and then drops in 2016
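A per-age-group trend like the one described above can be tabulated with `pivot_table`, giving one row per year and one column per age code. This is a sketch on made-up numbers and is not the code behind the image.

```python
import pandas as pd

# Toy rows: two years, age codes 0 and 3; counts are made up.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001, 2001],
    "age": [0, 3, 0, 3, 3],
    "suicides_no": [1, 10, 2, 12, 8],
})

# Rows are years, columns are age codes, cells are total suicides.
trend = toy.pivot_table(
    index="year",
    columns="age",
    values="suicides_no",
    aggfunc="sum",
    fill_value=0,
)
print(trend)
```

Calling `trend.plot()` would draw one line per age group across the years.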

### K-Means Clustering

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn import preprocessing
from scipy.spatial.distance import cdist

In [None]:
df.head()

In [None]:
cluster=df[['year','sex','age','population']]

In [None]:
cluster_s=cluster.copy()

In [None]:
cluster_s['population']=preprocessing.scale(cluster_s['population'].astype('float64'))
cluster_s['year']=preprocessing.scale(cluster_s['year'].astype('float64'))
cluster_s['sex']=preprocessing.scale(cluster_s['sex'].astype('float64'))
cluster_s['age']=preprocessing.scale(cluster_s['age'].astype('float64'))
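`preprocessing.scale` standardizes each column to zero mean and unit variance using statistics from the whole frame, that is, before the train/test split. An alternative, sketched below on synthetic data, is to fit a `StandardScaler` on the training rows only and reuse it for the test rows. The array `X` and its four columns are stand-ins for year, sex, age and population, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four clustering features (values are made up).
rng = np.random.default_rng(7)
X = rng.normal(
    loc=[2000, 0.5, 2.5, 1e6],
    scale=[10, 0.5, 1.7, 5e5],
    size=(200, 4),
)

# Split first, so the test rows do not influence the scaling statistics.
X_train, X_test = train_test_split(X, test_size=0.3, random_state=7)

# Learn mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

print(X_train_s.mean(axis=0).round(3), X_train_s.std(axis=0).round(3))
```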

In [None]:
cluster_train,cluster_test=train_test_split(cluster_s,test_size=0.3,random_state=7)

In [None]:
cluster_train.shape,cluster_test.shape

In [None]:
cluster=range(1,11)

In [None]:
mean_dist=[]

In [None]:
for k in cluster:
    model=KMeans(n_clusters=k)
    model.fit(cluster_train)
    mean_dist.append(sum(np.min(cdist(cluster_train,model.cluster_centers_,'euclidean'),axis=1))/cluster_train.shape[0])

In [None]:
plt.figure(figsize=(10,8))
plt.plot(cluster,mean_dist)
plt.xlabel("Number of clusters")
plt.ylabel("Average distance")
plt.title("Elbow Curve")
plt.xticks(range(1,11))

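#### As expected, the average distance to the nearest centroid falls as the number of clusters increases. This decrease alone does not validate the model; the useful signal is the bend (elbow) in the curve, beyond which extra clusters bring little further reduction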
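scikit-learn also exposes the quantity that K-Means minimizes as `inertia_`, the sum of squared distances to the nearest centroid. The sketch below builds an elbow curve from it on synthetic blobs. Note that this swaps in a different measure from the mean Euclidean distance used above, and that `make_blobs` data stands in for the notebook's features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups; it stands in for the scaled training set.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=7)

ks = range(1, 8)
inertias = []
for k in ks:
    # random_state makes the curve reproducible between runs.
    model = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    inertias.append(model.inertia_)

# Reduction in inertia gained by each additional cluster.
drops = -np.diff(inertias)
for k, drop in zip(list(ks)[1:], drops):
    print(f"k={k}: inertia reduced by {drop:.1f}")
```

The elbow is the value of `k` after which the printed reductions become small compared with the earlier ones.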

### Checking validity at the bend in the elbow curve

In [None]:
from sklearn.decomposition import PCA

In [None]:
model1=KMeans(n_clusters=2)

In [None]:
model1.fit(cluster_train)

In [None]:
pca_2=PCA(2)

In [None]:
plt.figure(figsize=(10,6))
plot_columns=pca_2.fit_transform(cluster_train)
# color each point by its K-Means cluster label
plt.scatter(x=plot_columns[:,0],y=plot_columns[:,1],c=model1.labels_)
plt.xlabel("Principal component 1")
plt.ylabel("Principal component 2")
plt.title("First 2 principal components for 2 clusters")
plt.show()

#### Checking the bend of the elbow curve (around 2 to 3 clusters) with a 2-cluster model, the points within each cluster appear densely packed in the PCA projection, which suggests that the variance within each cluster is low
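A visual check of the PCA plot can be complemented with a numeric one. The silhouette score, which ranges from -1 to 1, is higher when points sit close to their own cluster and far from the others. The sketch below compares a few values of `k` on synthetic blobs; it introduces a measure the notebook does not use, and the data are made up.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data standing in for the scaled training set.
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=0.6, random_state=7)

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    # The silhouette score needs at least 2 clusters, so k starts at 2.
    scores[k] = silhouette_score(X, labels)

# The k with the highest score is the best separated by this measure.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
```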