**The following is the package list. Please check that you have installed all of the packages below:**
- Matplotlib, Pandas, NumPy, Seaborn
- pycountry, geopandas
- wordcloud, plotly, cufflinks
- scikit-learn, SciPy
- gap_statistic, gapstat_rs

#  EDA and clustering analysis of Suicides Ratio


Introduction:   
The content is divided into two parts. The first part is an exploratory data analysis (EDA) of suicide rates around the world, which presents the data from several points of view. The second part applies the K-means algorithm to cluster the samples and visualizes the clustering result.

<span class="mark">-------------------------------------------------------------------------------------------</span>

In this project, we analyze how suicide rates differ between regions and years and look for possible reasons behind the differences. The source code follows.

In [None]:
# Libraries used in this project
import pandas as pd                      # builds and manipulates the DataFrames
import numpy as np                       # vectorized numerical calculations
import matplotlib.pyplot as plt          # basic chart drawing
import seaborn as sns                    # statistical charts built on top of matplotlib
import pycountry                         # country names and codes
import geopandas as gpd                  # drawing map charts
import plotly
import plotly.graph_objects as go
import plotly.express as px              # interactive charts
from plotly.offline import iplot, init_notebook_mode
import cufflinks as cf                   # adds the interactive DataFrame.iplot method
from scipy import stats                  # statistical functions, used here for linear regression
import sklearn
import warnings
warnings.filterwarnings('ignore')        # suppress warning messages
%matplotlib inline

cf.go_offline()                          # use cufflinks in offline mode
init_notebook_mode(connected=True)       # initialize plotly's notebook mode so that figures render inline

The following shows basic information about the dataset we chose.

In [None]:
data=pd.read_csv('/kaggle/input/suicide-rates-overview-1985-to-2016/master.csv')
# read the CSV file into the DataFrame `data`
data.info()
# info() prints each column with its non-null count and data type (it returns None, so no print is needed)

As the output shows, the dataset has 12 columns. Several column names contain spaces and symbols, which makes them awkward to reference in later analysis, so we rename them.

In [None]:
data=data.rename(columns={'country':'Country','year':'Year','sex':'Sex','age':'Age',
                          'suicides_no':'Suicidesno','population':'Population','suicides/100k pop':'SuicidesPer',
                          'country-year':'CountryYear','HDI for year':'HDIForYear',
                          ' gdp_for_year ($) ':'GdpYear','gdp_per_capita ($)':'GdpCapita',
                          'generation':'Generation'})
# use the rename function to change the names of the columns

In [None]:
print(data.columns)    #check whether the names of the columns have been successfully changed

In [None]:
# the dataset may contain missing values
data.isnull().any()
# isnull() followed by any() shows, for each column, whether it contains at least one missing value

According to the dataset information, only 8364 of the 27820 values in the "HDIForYear" column are non-null, so roughly 70% of the column is missing. Since there are too many null values, we drop the whole column.

In [None]:
data=data.drop(['HDIForYear'],axis=1)       # drop removes rows or columns; axis=1 selects columns
print(data.columns)                         # check the columns after the drop

In [None]:
data.head(8)        # have a brief look at the first 8 rows of the dataset

Next, we take a brief look at the distributions of a few variables.

In [None]:
Copydata = data[['Suicidesno','Population','SuicidesPer','GdpCapita']]   # choose a few columns to inspect
Copydata.iplot(kind='hist',              # draw histograms
           subplots=True,               # one subplot per column
           horizontal_spacing=.1,       # set the horizontal spacing between subplots
           fill=True,
           subplot_titles=True,        # show a title above each subplot
           title='Data Distribution')
# use the cufflinks package to plot an interactive chart of the distributions

In [None]:
# create an outlier-detection function based on the 1.5 * IQR rule
from collections import Counter              # counts how often each row index is flagged

def outliers_detection(df, columns):
    outliers_indices = []                    # row indices flagged as outliers, one entry per flagged column
    for column in columns:
        Q1 = np.percentile(df[column], 25)   # 25th percentile
        Q3 = np.percentile(df[column], 75)   # 75th percentile
        IQR = Q3 - Q1                        # interquartile range
        threshold = IQR * 1.5                # the fences lie 1.5 * IQR beyond the quartiles
        lower, upper = Q1 - threshold, Q3 + threshold
        outliers = df[(df[column] < lower) | (df[column] > upper)].index
        outliers_indices.extend(outliers)
    outlier_counts = Counter(outliers_indices)
    # keep only the rows that are outliers in more than one column
    multiple_outliers = [i for i, v in outlier_counts.items() if v > 1]
    # return the first listed column with those rows removed
    return df[~df.index.isin(multiple_outliers)][columns[0]]
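The function above is defined but not called in this notebook, so here is a minimal sketch of the 1.5 × IQR rule on a small made-up list. It uses only the standard library; the helper name `iqr_fences` and the sample values are assumptions for illustration. `method="inclusive"` interpolates linearly between data points, the same convention as the default of `np.percentile`.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    # method="inclusive" interpolates linearly between the data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = k * (q3 - q1)
    return q1 - spread, q3 + spread

sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]   # made-up values, not taken from the dataset
lower, upper = iqr_fences(sample)
flagged = [v for v in sample if v < lower or v > upper]
print(lower, upper, flagged)
```

Note that `outliers_detection` is stricter than this sketch: it removes a row only when the row is flagged in more than one of the listed columns.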

With the basic processing of the dataset done, we can now visualize the data.

## The Correlation between different variables

In [None]:
# to get a brief view of the correlation between the variables,
# we visualize the correlation matrix as a heatmap
plt.figure(figsize=(8,5),dpi=100)
# corr() is only defined for numeric columns, so select them explicitly
Heatmap = sns.heatmap(data.select_dtypes(include='number').corr(), annot = True)

As the heatmap shows, the correlations between the variables are low, except for the correlation between population and GDP for the year.

## Part1: The visualization of the global trend

### Aggregating the suicide counts and the population by year

In [None]:
GlobalTrend=data.groupby("Year")[["Suicidesno","Population"]].sum()
# group by the Year column and sum the suicide counts and the population

GlobalPercent=GlobalTrend["Suicidesno"]/GlobalTrend["Population"]
# a new Series holding the global ratio of suicides to population for each year

plt.figure(figsize=(12,3),dpi=100)
# create a 12 x 3 inch canvas at 100 dpi

sns.set()                       # switch to the seaborn default style
plt.title("The global trend of suicides ratio")   # add a title to the plot
plt.xlabel("Year")            # label the x-axis
plt.ylabel("Overall ratio")   # label the y-axis

sns.lineplot(x=GlobalPercent.index,y=GlobalPercent.values,marker='o',alpha=0.5)
# draw the line plot with the seaborn library
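The global ratio above is a pooled ratio: all suicides in a year divided by the whole population of that year. Later cells instead average the per-group `SuicidesPer` values, which weights every group equally regardless of its size. The sketch below, on made-up rows, shows that the two numbers can differ and how the pooled ratio converts to a rate per 100k. All values and variable names are assumptions for illustration.

```python
from collections import defaultdict

# made-up rows of (year, suicides, population); not taken from the dataset
rows = [
    (2000, 30, 200_000),
    (2000, 10, 300_000),
    (2001, 45, 500_000),
]

# sum the suicides and the population per year, like groupby("Year").sum()
totals = defaultdict(lambda: [0, 0])
for year, suicides, population in rows:
    totals[year][0] += suicides
    totals[year][1] += population

# pooled rate per 100k: total suicides over total population
pooled_per_100k = {year: s * 100_000 / p for year, (s, p) in totals.items()}

# unweighted mean of the per-row rates for the year 2000
rates_2000 = [s * 100_000 / p for year, s, p in rows if year == 2000]
mean_rate_2000 = sum(rates_2000) / len(rates_2000)

print(pooled_per_100k)
print(mean_rate_2000)
```

In this toy example the pooled rate for 2000 is lower than the mean of the two group rates, because the larger group has the lower rate.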


In [None]:
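fig = go.Figure()
# the x-axis below has type "date", so convert the integer years to datetime values first
fig.add_trace(go.Scatter(x=pd.to_datetime(GlobalPercent.index.astype(str), format="%Y"),
                         y=list(GlobalPercent.values)))

# Set title
fig.update_layout(
    title_text="Global trend over time"
)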

# Add range slider
fig.update_layout(
    xaxis=dict(
        rangeselector=dict(
            buttons=list([
                dict(count=1,
                     label="1m",
                     step="month",
                     stepmode="backward"),
                dict(count=6,
                     label="6m",
                     step="month",
                     stepmode="backward"),
                dict(count=1,
                     label="1y",
                     step="year",
                     stepmode="backward"),
                dict(step="all")
            ])
        ),
        rangeslider=dict(
            visible=True
        ),
        type="date"
    )
)

fig.show()

The figure above provides a few buttons and a range slider for choosing a time interval, which makes it easier to inspect part of the trend.


### Using the GeoPandas world dataset to add a continent label to the dataset

In [None]:
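world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
# load the world geometry dataset
# (gpd.datasets is deprecated since GeoPandas 0.14 and removed in 1.0; with newer versions,
#  download the Natural Earth data and pass the local file path to gpd.read_file instead)
cities = gpd.read_file(gpd.datasets.get_path('naturalearth_cities'))
# load the dataset of cities in the world

world.head()       # have a brief look at the first 5 rows of the dataset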

In [None]:
CountryData=data[["Country","Year","Suicidesno","Population","SuicidesPer"]]
LabelingData=world[["continent","name","iso_a3"]].rename(columns={"name":"Country"})
# create two new dataframes with the needed columns;
# "name" is renamed to "Country" so that both dataframes share the merge key
LabelingData.head()

In [None]:
MergeData = pd.merge(CountryData,LabelingData,on='Country')
# merging the countrydata and the labelingdata together 
MergeData.head()

Now each country is labeled with its continent and its ISO code.
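One caveat: `pd.merge` performs an inner join by default, so every row whose `Country` value has no identical name in the map dataset is dropped without any message. The sketch below shows how such mismatches can be listed with plain sets. The two name sets are hypothetical stand-ins; with the real data they would come from `CountryData["Country"]` and `LabelingData["Country"]`.

```python
# hypothetical name lists, for illustration only
dataset_names = {"Brazil", "Japan", "Russian Federation", "United States"}
map_names = {"Brazil", "Japan", "Russia", "United States of America"}

matched = sorted(dataset_names & map_names)   # names that survive the inner join
dropped = sorted(dataset_names - map_names)   # names that the inner join would drop

print(matched)
print(dropped)
```

If the second list is not empty, the affected countries are missing from every continent-level chart, so it may be worth mapping their names by hand before merging.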

### Plotting with the continent data

In [None]:
ContinentData=MergeData.groupby(["continent","Year"])[["Suicidesno","Population"]].sum()
# aggregate the data by continent and year

ContinentPercent=ContinentData["Suicidesno"]/ContinentData["Population"]
# create a new Series to store the ratio of suicides to population

ContinentPercent=ContinentPercent.to_frame()
# convert the Series to a DataFrame
ContinentPercent.columns=["Percentage"]
# change the name of the column
ContinentPercent.head()

#### Plotting the line chart

In [None]:
plt.figure(figsize=(10,3),dpi=200)             # create the canvas
sns.set()

plt.title("Continents' suicides ratio over the years")     # set the title of the chart
plt.xlabel("Year")
plt.ylabel("Suicides Ratio")

# continent and Year are index levels, so turn them into columns before plotting one line per continent
sns.lineplot(x="Year",y="Percentage",hue="continent",data=ContinentPercent.reset_index())

#### Plotting the pie chart

In [None]:
Comparison=ContinentPercent.groupby("continent")["Percentage"].mean()
# create a Series with the mean suicides ratio of each continent
Per=Comparison.values/Comparison.values.sum()
# normalize the means so that they sum to 1, giving the share of each continent in the pie chart

In [None]:
plt.figure(figsize=(3,3),dpi=100)    # create the canvas

plt.title("Comparison between different continents")
labels = Comparison.index            # take the labels from the data so that they always match the slices
explode = [0.1 if i % 2 else 0 for i in range(len(labels))]  # pull out every second slice to highlight it

plt.pie(x=Per,labels=labels,explode=explode)  # plot the pie chart
plt.show()

#### Plotting the Lollipop chart

In [None]:
fig, ax = plt.subplots(figsize=(8,4), dpi= 80)
# set the figure size of the canvas and the resolution

ax.set_title('Suicides Rate in different continents', fontdict={'size':15})
# set the text and the font size of the title
ax.set_ylabel('Suicides Rate')
# set the label of the y-axis

ax.vlines(x=Comparison.index, ymin=0, ymax=Comparison.values,  alpha=0.7, linewidth=2)
# draw the stems of the lollipops

ax.scatter(x=Comparison.index, y=Comparison, s=75,  alpha=0.7)
# draw the heads of the lollipops

plt.show()

## The visualization of the country data analysis

### Integrate the country data

In [None]:
CountryData=MergeData.groupby("Country")["SuicidesPer"].mean()
# aggregate the data by country, calculating the mean suicide rate per 100k
CountryData=CountryData.to_frame()  # convert the Series object to a DataFrame

LowSuicideCountry=CountryData.sort_values("SuicidesPer").index            # countries ordered from the lowest rate up
HighSuicideCountry=CountryData.sort_values("SuicidesPer",ascending = False).index     # countries ordered from the highest rate down

print(LowSuicideCountry[0:10])   # the 10 countries with the lowest rates
print(HighSuicideCountry[0:10])  # the 10 countries with the highest rates



#### Word-cloud visualization

To get a more intuitive view of these countries, we represent them with word clouds.

In [None]:
#pip install wordcloud
# uncomment the line above to install the package used to create the word clouds

In [None]:
from wordcloud import WordCloud
plt.figure(figsize=(10,5),dpi=100)                # create a canvas that can hold two word clouds

plt.subplot(1,2,1)  # create a 1 x 2 grid of subplots and select the first one
plt.title("Countries with lowest suicides ratio")

low_cloud = WordCloud(background_color='white',width=512,height=384).generate(" ".join(LowSuicideCountry[0:20]))
# create a word cloud from the 20 countries with the lowest suicide rates

plt.imshow(low_cloud)
plt.axis('off')     # hide the axes

plt.subplot(1,2,2)  # select the second subplot
plt.title("Countries with highest suicides ratio")

high_cloud = WordCloud(background_color='white',width=512,height=384).generate(" ".join(HighSuicideCountry[0:20]))
# create a word cloud from the 20 countries with the highest suicide rates

plt.imshow(high_cloud)
plt.axis('off')     # hide the axes

plt.show()   # display the result

#### Barchart visualization

In [None]:
LowSuicideRate=CountryData.sort_values("SuicidesPer").values.ravel()
HighSuicideRate=CountryData.sort_values("SuicidesPer",ascending=False).values.ravel()
# the values taken from the one-column CountryData form a 2-D array of shape (n, 1),
# so we use the ravel function to flatten it into one dimension

plt.figure(figsize=(20,6))   # create a canvas to accommodate the two subplots 

plt.subplot(1,2,1)
sns.barplot(x=LowSuicideRate[0:10],y=LowSuicideCountry[0:10])
#take the top 10 lowest suicidesrate countries to make the barchart
plt.xlabel("ratio of suicide")
plt.ylabel("country")
plt.title("Top 10 lowest suicides rate countries")  #add the title 

plt.subplot(1,2,2)
sns.barplot(x=HighSuicideRate[0:10],y=HighSuicideCountry[0:10])
#take the top 10 highest suicidesrate countries to make the barchart
plt.xlabel("ratio of suicide")
plt.ylabel("country")
plt.title("Top 10 highest suicides rate countries")  #add the title 

plt.show()

### Geographic visualization

To get a more intuitive view of where those countries are located, we can draw the suicide rate data on a world map with the help of the GeoPandas package. The code follows.



In [None]:
CountryData.index.name="name"   
#change the name of the index of country data in order to merge the suicides rate and the geometric information
GeoData = pd.merge(world, CountryData,on = "name")
# using the merge function to aggregate the data

GeoData.head()  #have a look at the result

In [None]:
Mapplot=GeoData.plot(figsize=(25,8),                  # set the ratio of the length and the width to 25:8
                     column="SuicidesPer",            # set the color according to the suicides rate 
                     legend=True,                     # show the legend 
                     legend_kwds={'label': "Suicides Rate by country"})   # set the label of the legend 

# create a geopandas form of world map,
Mapplot.set_title("World map of countries with different suicide rate")
#set the title of the world map

### Plot a dynamic world map using the Plotly package

With the help of the Plotly package, we can add basic interactivity to the map, such as hovering and zooming.

In [None]:
data_map = data.groupby(by=['Country']).agg({"Suicidesno": ['sum']})
# create a new dataframe with the total number of suicides per country
data_map.columns = ['WorldWide_Suicides']  # change the name of the column
data_map=data_map.reset_index()  # turn the Country index back into a column

fig = px.choropleth(data_map, locations="Country", locationmode='country names',
                    color='WorldWide_Suicides', hover_name="Country", color_continuous_scale='sunset')
# create an interactive figure

fig.update_layout(
    title="Total Number of Suicides by Country (interactive)",
    font=dict(size=15,color="RebeccaPurple")
)  # set the title and its font

fig.show()   # show the chart

### Plot an animation dynamic chart by the change of time

In [None]:
# Suicide rate over time
map_data = data.sort_values("Year")   # sort so that the animation frames run in chronological order
fig = px.choropleth(map_data, locations='Country',
                   locationmode='country names',color=np.log1p(map_data["SuicidesPer"]),
                    # color by log(1 + rate): the log compresses the range, and log1p stays finite when the rate is 0
                   animation_frame='Year',
                   color_continuous_scale='matter') #px.colors.sequential.matter

fig.update_layout(
    title="The change of suicides ratio per 100k over time",
    font=dict(size=15,color="RebeccaPurple")
)  # set the title and its font

fig.show()
#fig.write_html("map.html")

With the interactive chart above, we can read the value for a country directly by pointing at it on the map. You can also box-select or lasso-select a region with the tools at the top of the figure.
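Coloring a map by the logarithm of a rate needs some care when the rate can be exactly 0, because log(0) is not a finite number: `np.log` returns `-inf` (with a warning), and `math.log` raises an error. A common workaround is log(1 + x). The sketch below uses the standard library on made-up rates; the values are assumptions, not taken from the dataset.

```python
import math

rates = [0.0, 0.5, 12.0, 150.0]   # made-up rates per 100k

for rate in rates:
    # math.log(0.0) raises ValueError, so the zero case is handled explicitly
    plain = math.log(rate) if rate > 0 else float("-inf")
    shifted = math.log1p(rate)     # log(1 + rate), defined for rate = 0
    print(rate, plain, round(shifted, 3))
```

The shifted version maps a rate of 0 to 0 and keeps the ordering of the rates, so it remains usable as a color scale.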

### The trend of some typical countries

Here we take the 10 countries with the highest and the 10 countries with the lowest suicide rates and look at how their rates change over time.

In [None]:
CountryTrend=data.groupby(["Country","Year"])["SuicidesPer"].mean()
# aggregate the data by country and year
plt.figure(figsize=(10,8),dpi=100)  # create the canvas

plt.subplot(2,1,1)
for country in LowSuicideCountry[0:10]:   # loop over the countries with the lowest suicide rates
    plt.plot(CountryTrend[country].index,CountryTrend[country].values,label=country,marker='o')
    # CountryTrend is a Series, so we use its index as the x-axis and its values as the y-axis, with 'o' as the marker

plt.title("Suicide rate trends of the top 10 lowest and highest countries")
plt.legend()                       # show the legend
plt.ylabel("suicides ratio/100k")  # set the y-axis label

plt.subplot(2,1,2)
for country in HighSuicideCountry[0:10]:   # loop over the countries with the highest suicide rates
    plt.plot(CountryTrend[country].index,CountryTrend[country].values,label=country,marker='o')
    # same layout as above: years on the x-axis, rates on the y-axis

plt.legend()
plt.xlabel("Year")                 # set the x-axis label
plt.ylabel("suicides ratio/100k")  # set the y-axis label

plt.show()

## The visualization relative to gender and age

In the following part we analyze how the suicide rate varies with sex and age.

### Integrate the data by gender

In [None]:
male_population = data.loc[data.loc[:, 'Sex']=='male',:]
female_population = data.loc[data.loc[:, 'Sex']=='female',:]
# create two new dataframes to store the data of male and female 

MalePer=male_population.groupby("Year")["SuicidesPer"].mean()
FemalePer=female_population.groupby("Year")["SuicidesPer"].mean()
#create two new Series to store the data of suicides rate according to sex

### Plot the overall trend of data according to gender

In [None]:
plt.figure(figsize=(6,3),dpi=100)  #create a new canvas

plt.plot(MalePer.index,MalePer.values,label="Male")
plt.plot(FemalePer.index,FemalePer.values,label="Female")
# use the year as the x-axis and the suicide rate as the y-axis, with one line per gender

plt.legend()           #show the legend 
plt.title("Suicides Rate of different gender")    #set the title 
plt.xlabel("Year")     #set the xlabel
plt.ylabel("Suicides Rate/100k")   #set the ylabel

plt.show()  #show the chart

### Using the dynamic chart to better show the ratio of different genders' suicides rate

In [None]:
temp=pd.concat([MalePer,FemalePer],axis=1,keys=["Male","Female"])
# put the two Series side by side, indexed by Year, with one column per gender

layout = cf.tools.getLayout(title='The ratio of Gender Suicides over Years')

(temp[['Male','Female']]).iplot(kind='ratio',colors=['green','red'],layout=layout)
# plot the interactive chart to show the ratio of the suicide rates between the genders more explicitly

### Integrate the data by age

In [None]:
Agedata=data.groupby(["Age","Year"])["SuicidesPer"].mean()
AgeDis=data.groupby("Age")["SuicidesPer"].mean()
# create two Series: the mean rate by age group and year, and the mean rate by age group

agelist=AgeDis.index  # the age-group labels

print(agelist)
AgeDis.head()

#### Plot the bar chart 

In [None]:
plt.figure(figsize=(7,3),dpi=100)  #create the canvas

sns.barplot(x=AgeDis.sort_values().index,y=AgeDis.sort_values().values)
# plot a bar chart of the age groups, ordered by suicide rate from low to high

plt.title("Suicides rate among people with different ages")   # set the title of the chart
plt.xlabel("Age group")  # set the x-axis label
plt.ylabel("Suicides Rate/100k")       # set the y-axis label

plt.show()         #present the chart 

#### Plot the line chart 

In [None]:
plt.figure(figsize=(15,4),dpi=100)

colors=['r','skyblue','y','yellowgreen','coral','royalblue']   # one color per age group

# draw one subplot per age group on a 2 x 3 grid
for i,(age,color) in enumerate(zip(agelist,colors)):
    plt.subplot(2,3,i+1)
    plt.xlabel(" ")       # hide the x-label title
    plt.plot(Agedata[age].index,Agedata[age].values,label=age,marker='o',color=color)
    # use the index of Agedata (the years) as the x-axis and its values as the y-axis, set the color and marker
    plt.legend()

plt.show()

### Integrate the data of gender and age for further analysis 

Next, we combine sex and age to see how the two factors interact.

In [None]:
AgeGender=data.groupby(["Age","Sex"])["SuicidesPer"].mean()
# Aggregate the data by virtue of the Age and Sex columns
AgeGender=AgeGender.to_frame()
# Transfer the Series object to dataframe object
AgeGender=AgeGender.reset_index()
# Reset the index of the dataframe 
AgeGender=AgeGender.sort_values("SuicidesPer")
# Sort the data according to the suicides percentage of 100k

AgeGender  # Have a brief view of the dataframe

#### Plot a bar plot to visualize the data

In [None]:
plt.figure(figsize=(8,6),dpi=100)   # create a canvas with size of 8*6 and resolution of 100dpi

sns.barplot(x="Age",y="SuicidesPer",hue="Sex",data=AgeGender,palette="Set1")
# Using the seaborn package to plot the barplot 

plt.title("Composition of suicides")  #set the title 
plt.xlabel("Age")    # set the title of x-axis
plt.ylabel("Suicides Per 100K")   # set the title of y-axis

plt.show()   # show the picture

### Take a few countries as examples to analyze the possible reasons

In [None]:
plt.figure(figsize=(10,10),dpi=80)         # create a canvas 

plt.subplot(2,1,1)       # choose the first part of the canvas 
plt.title("United States")      # set the title of the first chart 
sns.boxplot(x='Age', y='Suicidesno', hue='Sex',data=data[data["Country"]=="United States"],palette='Set2')
# using the seaborn package to plot the box plot 

plt.subplot(2,1,2)       # choose the second part of the canvas
plt.title("Brazil")                                          
sns.boxplot(x='Age', y='Suicidesno', hue='Sex',data=data[data["Country"]=="Brazil"],palette='Set3')

plt.show()    # show the subplot chart 

## The correlation between GDP and suicides rate

In the next step, we analyze whether there is any relationship between GDP and the suicide rate.

In [None]:
# First, we take a brief look at the overall trend of GDP per capita
g = sns.jointplot(x="Year", y="GdpCapita", data=data, kind="reg", height=10)
# jointplot creates its own figure; kind="reg" also draws the regression line

g.set_axis_labels("Year", "GDP per capita")   # set the labels of the x-axis and the y-axis
plt.show()  # show the joint chart

### Plot the trend of countries' GDP

Here we visualize the GDP trend of the 10 countries with the highest and the 10 countries with the lowest suicide rates.

In [None]:
plt.figure(figsize = (20,6),dpi=200)
GDPdata = data.groupby(["Country","Year"])['GdpCapita'].mean()
# create a new Series with the GDP per capita of each country and year

plt.subplot(1,2,1)                    

for country in HighSuicideCountry[0:10]:
    plt.plot(GDPdata[country].index,GDPdata[country].values, label=country, marker='o')
# plot the gdp trend of top 10 highest suicide rate country

plt.xlabel("year")   #set the title of x-axis
plt.ylabel("GDP per Capita")   # set the title of y-axis
plt.legend()    # show the label of different countries


plt.subplot(1,2,2)

for country in LowSuicideCountry[0:10]:
    plt.plot(GDPdata[country].index,GDPdata[country].values, label=country, marker='o')
# plot the gdp trend of top 10 lowest suicide rate country

plt.xlabel("year")   # set the title of x-axis
plt.ylabel("GDP per Capita")   #set the title of y-axis
plt.legend()    # show the label of different countries

plt.show()

In [None]:
# create two new Series to store the GDP per capita and the suicide rate of each country and year
df_gdp=data.groupby(["Country","Year"])["GdpCapita"].mean()
df_total=data.groupby(["Country","Year"])["SuicidesPer"].mean()

### Do the regression of top 10 countries

In [None]:
plt.figure(figsize = (9,6))

for country in HighSuicideCountry[:10]:
    sns.regplot(x=df_gdp[country].values, y=df_total[country].values, label = country)
# plot the regression chart 

plt.xlabel("GDP per capita")
plt.ylabel("Suicides Rate")
plt.title("Regression of GDP and Suicides Rate")
plt.legend()
plt.show()

corr_eff = {}   # create a dictionary to store the correlation coefficient r of each country
for country in HighSuicideCountry[:10]:
    slope, intercept, r_value, p_value, std_err = stats.linregress(df_gdp[country].values,df_total[country].values)
    corr_eff[country] = float(r_value)   # convert the NumPy scalar into a plain Python float
    
sns.barplot(x=list(corr_eff.keys()), y=list(corr_eff.values()), palette = "YlOrRd")
# plot a barplot by virtue of the seaborn library

plt.xticks(rotation = 90)    # rotate the x-axis label
plt.xlabel("Country")        
plt.ylabel("Correlation Coefficient")
plt.title("GDP vs suicides")

plt.show()

In [None]:
plt.figure(figsize = (9,6))

for country in LowSuicideCountry[:10]:
    sns.regplot(x=df_gdp[country].values, y=df_total[country].values, label = country)
# plot the regression chart 

plt.xlabel("GDP per capita")
plt.ylabel("Suicides Rate")
plt.title("Regression of GDP and Suicides Rate")
plt.legend()
plt.show()

corr_eff = {}   # create a dictionary to store the correlation coefficient r of each country
for country in LowSuicideCountry[:10]:
    slope, intercept, r_value, p_value, std_err = stats.linregress(df_gdp[country].values,df_total[country].values)
    corr_eff[country] = float(r_value)   # convert the NumPy scalar into a plain Python float
    
sns.barplot(x=list(corr_eff.keys()), y=list(corr_eff.values()), palette = "YlOrRd")
# plot a barplot by virtue of the seaborn library

plt.xticks(rotation = 90)    # rotate the x-axis label
plt.xlabel("Country")        
plt.ylabel("Correlation Coefficient")
plt.title("GDP vs suicides")

plt.show()

As the two charts above show, in the countries with high suicide rates the suicide rate appears to be closely related to GDP per capita, while the countries with low suicide rates do not show the same pattern.
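The `r_value` returned by `stats.linregress` is the Pearson correlation coefficient between the two inputs. The sketch below computes it by hand on made-up numbers to show what the bar charts measure; the function name `pearson_r` and the sample values are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

gdp = [1000, 2000, 3000, 4000, 5000]      # made-up GDP per capita values
rate = [30.0, 28.0, 27.0, 24.0, 21.0]     # made-up suicide rates per 100k

r = pearson_r(gdp, rate)
print(round(r, 3), round(r * r, 3))       # r and the share of variance explained by a line
```

The sign of r gives the direction of the linear relationship and its magnitude the strength. A value close to 1 or -1 describes an association over time within one country; it does not show that GDP causes the change.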

## Analyze the influence of Generation

In [None]:
GenerationData=data.groupby("Generation")["SuicidesPer"].mean()
# create a new Series to store the mean suicide rate of each generation

In [None]:
GenerationData=GenerationData.to_frame()     # transform the Series into a DataFrame

In [None]:
GenerationData=GenerationData.reset_index()  # reset the index of the generation data
GenerationData

In [None]:
layout = cf.tools.getLayout(title='Pie Plot for Suicides Per 100k by Generation')

GenerationData.iplot(kind='pie',labels="Generation" ,textinfo='label+percent',world_readable=True,hole=.4,
           values='SuicidesPer',
           layout=layout)
# plot an interactive pie chart to show the share of each generation in the mean suicide rates
#fig.write_html("file.html")


# Using machine learning to realize the clustering of data

By clustering the samples, we can explore the inner structure of the dataset.

## Data preprocessing part 

In [None]:
df=data.copy()  # create a copy of the dataset 
df.sample(5)    # randomly choose five samples to have a brief view

In [None]:
df.fillna(df.mean(numeric_only=True), inplace=True)   # fill NaN values in the numeric columns with the column mean
df.drop("CountryYear", axis=1, inplace=True)     # CountryYear is only a combination of country and year, so we can drop it
df.head()

In [None]:
df.select_dtypes(include="object").columns              # list the columns of dtype object

In [None]:
## Turn the object columns into the category dtype
df[["Country","Age","Sex","Generation"]] = df[["Country","Age","Sex","Generation"]].astype("category")
## Convert number strings with commas into floats
df['GdpYear'] = df['GdpYear'].str.replace(",", "").astype("float")
# the line above first removes the commas and then converts the remaining string into a float

columns = df.select_dtypes(['category']).columns              # select the columns of dtype category
df[columns] = df[columns].apply(lambda fx: fx.cat.codes)      # replace each category with its integer code
df.dtypes

The code above converts every object column to an integer or float column.
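To make the encoding concrete, here is a small standard-library sketch of what the two conversions do. It assumes that pandas sorts string categories lexically before numbering them, and it uses the six age-group labels of this dataset together with a made-up GDP string.

```python
ages = ["5-14 years", "15-24 years", "25-34 years",
        "35-54 years", "55-74 years", "75+ years"]

# string categories are sorted lexically and then numbered from 0
codes = {label: code for code, label in enumerate(sorted(ages))}
print(codes)

gdp_text = "2,156,624,900"                    # made-up value in the format of the GdpYear column
gdp_value = float(gdp_text.replace(",", ""))  # remove the commas, then convert
print(gdp_value)
```

Under this assumption "5-14 years" receives code 3 rather than 0, so the integer codes do not follow the natural age order. K-means treats the codes as ordinary numbers, which is worth keeping in mind when the clusters are interpreted.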

## Calculate the value of K with the best performance 

We use the sum of squared errors (SSE) to evaluate the K-means clustering. The SSE is the sum of the squared distances between each point and the center of the cluster it is assigned to. Because the SSE keeps decreasing as K grows, we plot it against the number of clusters and look for the "elbow", the K value after which the decrease levels off.
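The quantity that scikit-learn reports as `inertia_` is the sum of squared distances from each sample to its closest cluster center. The sketch below computes it by hand for four made-up points, once with one center and once with two; the function name `sse` and all coordinates are assumptions for illustration.

```python
def sse(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        # squared Euclidean distance to the closest center
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]   # made-up points
one_center = [(5.0, 1.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]

print(sse(points, one_center), sse(points, two_centers))
```

Going from one center to two removes most of the error here because the points form two groups. Adding further centers would lower the SSE only slightly, which is the flattening that the elbow plot looks for.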

Visualize the trend according to the K value.

In [None]:
from sklearn.cluster import KMeans
inertia  = []                                               # create a list to store the inertia value(SSE) 
for i in range(1, 11):                                     
    kmeans = KMeans(n_clusters = i)                               
    kmeans.fit(df)                                  
    inertia.append(kmeans.inertia_)                         # store the SSE cost with different K 


plt.figure(figsize=(6,4))        # create a canvas
plt.plot(range(1, 11), inertia)      # plot the line chart 
plt.title('SSE evaluation of K value', fontsize = 12)    # set the title and the size of the title 
plt.xlabel('The value of K')                             # set the label of the x-axis
plt.ylabel('Inertia value')                              # set the label of the y-axis
plt.grid(True)
plt.show()         #represent the figure 

The following code calculates the gap value with the help of the gap_statistic and gapstat_rs packages.
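For reference, the gap statistic of Tibshirani, Walther and Hastie compares the within-cluster dispersion $W_k$ of the data with the dispersion expected for reference data without cluster structure. The notation below follows that paper; whether the gap_statistic package applies exactly this selection rule is an assumption that was not checked here.

```latex
\mathrm{Gap}_n(k) = \mathbb{E}^{*}_{n}\left[\log W_k\right] - \log W_k ,
\qquad
\hat{k} = \min \left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\}
```

Here $\mathbb{E}^{*}_{n}$ is the expectation over samples of size $n$ drawn from the reference distribution, and $s_{k+1}$ is the standard error of the simulated $\log W_{k+1}$ values. A larger gap means that the data are clustered more tightly than the reference data for that $k$.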

In [None]:
# GitHub no longer serves the unauthenticated git:// protocol, so install over https
!pip install git+https://github.com/milesgranger/gap_statistic.git
!pip install gapstat_rs
from gap_statistic import OptimalK
import gapstat_rs

optimalK = OptimalK(parallel_backend='rust')
optimalK                                     # inspect the OptimalK object

In [None]:
X=df.values                                  # transform the data into an array
n_clusters = optimalK(X, cluster_array=np.arange(1, 5))   # evaluate 1 to 4 clusters (np.arange excludes the end value)

In [None]:
plt.figure(figsize=(6,4))                    # create the canvas
plt.plot(optimalK.gap_df.n_clusters, optimalK.gap_df.gap_value, linewidth=3)   
# plot the line chart of the Gap value 
plt.scatter(optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].n_clusters,
            optimalK.gap_df[optimalK.gap_df.n_clusters == n_clusters].gap_value, s=250, c='r')
# mark the point with the best performance 
plt.grid(True)
plt.xlabel('Cluster Count') 
plt.ylabel('Gap Value')
plt.title('Gap Values by Cluster Count')
plt.show()


Based on the chart above, K = 3 gives the better result, so we use K = 3 in the following steps.

## Applying K-means Clustering model to the dataset 

In [None]:
from sklearn.cluster import MiniBatchKMeans
# the sklearn package provides a convenient API for unsupervised learning

kmeans = MiniBatchKMeans(n_clusters=3,                      # cluster the samples into three groups
                          random_state=0,                   # fix the random seed for reproducibility
                          batch_size=10)                    # set the size of each mini-batch

y_pred = kmeans.fit_predict(df[['Sex','Generation','Age','Population','SuicidesPer','GdpCapita']])
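K-means measures similarity with the Euclidean distance, so a feature with a much larger numeric range than the others (here `Population`, compared with `Sex` or `SuicidesPer`) contributes most of the distance. A common remedy is to standardize each column to zero mean and unit variance before clustering, which is what `sklearn.preprocessing.StandardScaler` does. The sketch below shows the idea with the standard library on made-up values; whether it would change the clusters found in this notebook was not checked.

```python
from statistics import mean, pstdev

def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma for v in column]

population = [100_000, 500_000, 2_000_000, 40_000_000]   # made-up values
rate = [2.0, 8.0, 15.0, 30.0]                            # made-up rates per 100k

z_pop = standardize(population)
z_rate = standardize(rate)

# raw ranges differ by several orders of magnitude, standardized ranges are comparable
print(max(population) - min(population), max(rate) - min(rate))
print(round(max(z_pop) - min(z_pop), 2), round(max(z_rate) - min(z_rate), 2))
```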

In [None]:
df['k_means_clusters'] = y_pred                             # add the predicted cluster labels as a new column
df

## The visualization of the clusters

Let's visualize the clustering result.

In [None]:
px.scatter(data_frame=df,x='GdpCapita',y='Suicidesno',color='k_means_clusters')
# use the plotly package to plot an interactive scatter chart of the two variables above

In [None]:
px.scatter(data_frame=df,x='Population',y='Suicidesno',color='k_means_clusters')
# use the plotly package to plot an interactive scatter chart of the two variables above

## Visualize the clusters in three dimensions

In [None]:
px.scatter_3d(data_frame=df,x='GdpYear',y='Generation',z='Suicidesno',color='k_means_clusters')
# use the plotly package to plot a 3-D chart


In [None]:
px.scatter_3d(data_frame=df,x='GdpCapita',y='GdpYear',z='Population',color='k_means_clusters')

# use the plotly package to plot a 3-D chart

----END
