# A look at demographics, causes and means of suicide in India

In [3]:
%pylab inline
import pandas as pd

In [16]:
import seaborn as sns
#sns.set(style="white", palette="muted")
sns.set_style("darkgrid")

current_palette = sns.color_palette()
sns.set(font_scale=1.5) 

In [73]:
df = pd.read_csv('../input/Suicides in India 2001-2012.csv') 
# Let's look at the data...
df.head(2)

## Which ___age-groups___ are more likely to commit suicide?   

In [44]:

new_df = df[df['Total']>0]
age_cnt = dict(new_df['Age_group'].value_counts())
tot = sum(list(age_cnt.values()))
# Share of records with a non-zero total per age group, in percent
# (value_counts counts rows; it does not sum the 'Total' column)
age_perc = {k: 100.0 * v / tot for k, v in age_cnt.items()}

key_set = ['0-14', '15-29', '30-44', '45-59', '60+', '0-100+'] 
age_grps = [age_perc[k] for k in key_set]
X = arange(len(age_grps))
fig = figure(figsize=(6, 4))

bar(X, age_grps, align='center', width=0.5, color='green', alpha=0.6)
xticks(X, key_set)

xlabel('Age groups')
ylabel('Percent')
show()

__Adolescents and adults__, the three middle age groups (15-29, 30-44 and 45-59), are more likely to commit suicide.
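
The cell above counts records per age group. As a rough cross-check, the shares can also be weighted by the `Total` column. The sketch below uses a made-up toy frame with the notebook's column names; the helper `age_share_by_total` and the choice to restrict to a single `Type_code` (so that the same cases are not counted once per category family) are my assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for the real data: column names follow the notebook,
# the numbers are invented for illustration only.
toy = pd.DataFrame({
    'Type_code': ['Causes', 'Causes', 'Causes', 'Means_adopted'],
    'Age_group': ['0-14', '15-29', '15-29', '15-29'],
    'Total':     [10, 30, 60, 100],
})

def age_share_by_total(frame, type_code='Causes'):
    """Percentage of suicides per age group, weighted by 'Total'.

    Restricting to one Type_code is an assumption made here to avoid
    counting the same cases several times.
    """
    subset = frame[frame['Type_code'] == type_code]
    totals = subset.groupby('Age_group')['Total'].sum()
    return totals / totals.sum() * 100

age_share = age_share_by_total(toy)
print(age_share)
```

On the real data this would be called as `age_share_by_total(df)`, provided the chosen `Type_code` is recorded per age group.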

### Who is more susceptible: males or females?

In [8]:
grpby_gender = {k:list(v) for k,v in df.groupby('Gender')['Total']}
gender_cnt = {k:sum(v) for k,v in grpby_gender.items()}
gender_df = pd.DataFrame(list(gender_cnt.items()), columns=['Gender', 'Counts'])
# Percentage of suicides committed by each gender
gender_df['Percent'] = gender_df['Counts'].apply(lambda l: float(l)/sum(gender_df['Counts'])*100)
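
The same percentages can be computed without the intermediate dictionaries. This is only a sketch on a toy frame with invented numbers; it assumes nothing beyond the `Gender` and `Total` columns used above.

```python
import pandas as pd

# Toy data with the notebook's column names; values are made up.
toy = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female'],
    'Total':  [30, 20, 30, 20],
})

# Sum the totals per gender, then convert to a percentage of the grand total.
gender_counts = toy.groupby('Gender')['Total'].sum()
gender_pct = (gender_counts / gender_counts.sum() * 100).rename('Percent').reset_index()
print(gender_pct)
```

The resulting frame has the `Gender` and `Percent` columns that the bar plot expects.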

### Plotting the bar chart

In [15]:

color_code = ["#e74c3c", "#3498db"]
fig = figure(figsize=(6, 4))
ax = fig.gca()
sns.barplot(x='Gender',y='Percent',data=gender_df, palette=sns.color_palette(color_code))
ax.set(ylabel="Percent");


Let's look at the different categories under __Type_code__:

In [25]:
print('Different type codes:', df['Type_code'].unique())

As an example, let's look at the different types included in __Education_Status__:

In [28]:
type_dict = {k:list(v) for k,v in df.groupby('Type_code')['Type']} 
educ_uniq = set(type_dict['Education_Status'])
print('Categories in Education_Status', educ_uniq)
educ_level_df = df[df['Type'].isin(educ_uniq)]

In [29]:
soc_uniq = set(type_dict['Social_Status'])
soc_stat_df = df[df['Type'].isin(soc_uniq)]

prof_uniq = set(type_dict['Professional_Profile'])
prof_stat_df = df[df['Type'].isin(prof_uniq)]

means_uniq = set(type_dict['Means_adopted'])
means_df = df[df['Type'].isin(means_uniq)]
print('Categories in Means_adopted:', means_uniq)
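
The per-category frames above are selected by matching `Type` labels with `isin`. A possible alternative, sketched here on a toy frame, is to split directly on `Type_code`; this would also keep the groups apart if the same label ever appeared under two type codes. The name `frames_by_code` is hypothetical.

```python
import pandas as pd

# Toy data with the notebook's column names; values are made up.
toy = pd.DataFrame({
    'Type_code': ['Causes', 'Means_adopted', 'Causes'],
    'Type':      ['Dowry Dispute', 'By Fire/Self Immolation', 'Dowry Dispute'],
    'Total':     [1, 2, 3],
})

# One independent sub-frame per type code.
frames_by_code = {code: sub.copy() for code, sub in toy.groupby('Type_code')}
print(sorted(frames_by_code))
```

With the real data, `frames_by_code['Means_adopted']` would play the role of `means_df`.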

__By Other means__ and __By Other means (please specify)__ are essentially the same category, so they can be merged. Let's write a function to replace such similar categories.

In [30]:
def replace_similar_types(col, similar_pairs):
    '''
    Takes a list of (keep, replace) pairs and returns `keep` when `col`
    equals `replace`; otherwise returns `col` unchanged.
    '''
    for sim in list(similar_pairs):
        if col == sim[1]:
            col = sim[0]
    return col

In [31]:
# list of tuples that need to be replaced
sim_means = [('By Other means', 'By Other means (please specify)')]

In [32]:
# Work on an explicit copy so the assignment below does not write into a view of df
means_df = means_df.copy()
means_df['Type'] = means_df['Type'].apply(lambda l: replace_similar_types(l, sim_means))
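
The same merge can probably be expressed with pandas' built-in `Series.replace` and a mapping built from the pairs. This is a minimal sketch on a toy Series, not the method used in the rest of the notebook; `mapping` is a name introduced here.

```python
import pandas as pd

# Pairs are (label to keep, label to replace), as in the notebook.
sim_means = [('By Other means', 'By Other means (please specify)')]

# Build {label to replace: label to keep}.
mapping = {old: new for new, old in sim_means}

types = pd.Series(['By Other means (please specify)',
                   'By Fire/Self Immolation',
                   'By Other means'])
merged = types.replace(mapping)
print(merged.tolist())
```

Labels that do not appear in the mapping are left untouched.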

In [35]:
causes_uniq = set(type_dict['Causes'])
print('Categories in Causes:',causes_uniq)
causes_df = df[df['Type'].isin(causes_uniq)]

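It seems that __Bankruptcy or Sudden change in Economic__ and __Bankruptcy or Sudden change in Economic Status__ are the same category. __Other Causes (Please Specity)__ and __Causes Not known__ both describe unknown causes, so we can merge these two as well. The two spellings of __Not having Children (Barrenness/Impotency__ differ only by a space and are merged too.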

In [38]:
sim_causes = [('Bankruptcy or Sudden change in Economic' , 'Bankruptcy or Sudden change in Economic Status'), \
             ('Causes Not known', 'Other Causes (Please Specity)'), \
             ('Not having Children (Barrenness/Impotency', 'Not having Children(Barrenness/Impotency')]

In [39]:
# Work on an explicit copy so the assignment below does not write into a view of df
causes_df = causes_df.copy()
causes_df['Type'] = causes_df['Type'].apply(lambda l: replace_similar_types(l, sim_causes))

Distribution of suicides by __education level__:

In [65]:
g = sns.catplot(x='Type', y='Total', kind='bar', data=educ_level_df, estimator=sum, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');

# Same plot, color-coded by Gender
g = sns.catplot(x='Type', y='Total', hue='Gender', kind='bar',
                data=educ_level_df, palette=sns.color_palette(color_code),
                estimator=sum, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');

Suicide is mostly committed by individuals in the lower educational brackets, and the distribution is consistent across both genders.

Next, let's look at the distribution of __Social status__ in the data:

In [67]:
g = sns.catplot(x='Type', y='Total', kind='bar', data=soc_stat_df, estimator=sum, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');
# Same plot, color-coded by Gender
g = sns.catplot(x='Type', y='Total', hue='Gender', kind='bar',
                data=soc_stat_df, palette=sns.color_palette(color_code),
                estimator=sum, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');

Married individuals account for the most suicides. To tell whether they are actually more likely to commit suicide, these counts would need to be compared against the share of each marital status in the population.
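
If population shares per marital status were available, the comparison could look roughly like the sketch below. Every number and label here is HYPOTHETICAL and does not come from the dataset or from any census; `representation_ratio` is a name I made up for the share of suicides divided by the share of the population.

```python
import pandas as pd

# HYPOTHETICAL values, for illustration only.
suicide_counts = pd.Series({'Married': 700, 'Never married': 200, 'Widowed': 100})
population_share = pd.Series({'Married': 0.50, 'Never married': 0.45, 'Widowed': 0.05})

# Share of all suicides falling into each marital status.
suicide_share = suicide_counts / suicide_counts.sum()

# Values above 1 mean the group is over-represented among suicides
# relative to its share of the population.
representation_ratio = suicide_share / population_share
print(representation_ratio)
```

A ratio close to 1 for married individuals would suggest that their high count mostly reflects how common that status is.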

Next, we look at the distribution of the number of suicides by __professional profile__:

In [69]:
g = sns.catplot(x='Type', y='Total', kind='bar', data=prof_stat_df, estimator=sum, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');

# Same plot, color-coded by Gender
g = sns.catplot(x='Type', y='Total', hue='Gender', kind='bar',
                data=prof_stat_df, palette=sns.color_palette(color_code),
                estimator=sum, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');

The most frequently reported category is __house wife__, followed by __Farming/Agri__.
As before, the distribution is similar for both genders, except (not surprisingly) for the house wife category, which is female only.

In [70]:
g = sns.catplot(x='Type', y='Total', kind='bar', data=causes_df, estimator=sum, height=6, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');
# Same plot, color-coded by Gender
g = sns.catplot(x='Type', y='Total', hue='Gender', kind='bar',
                data=causes_df, palette=sns.color_palette(color_code),
                estimator=sum, height=6, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');

The leading cause of suicide is family problems, followed by __Prolonged illness__ and __Insanity/Mental Illness__. The first is in line with the earlier observation about married individuals.

Males and females differ in the ranking of causes only for __Dowry Dispute__, __Suspected/Illicit Relation__ and __Cancellation/Non-Settlement of Marriage__, where female suicides outnumber male suicides.
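
Rather than reading this off the bar chart, the causes where women outnumber men could be listed directly. A minimal sketch on toy data follows; the numbers are invented and `female_majority` is a hypothetical name. It assumes both genders occur in the data.

```python
import pandas as pd

# Toy data with the notebook's column names; values are made up.
toy = pd.DataFrame({
    'Type':   ['Dowry Dispute', 'Dowry Dispute', 'Prolonged illness', 'Prolonged illness'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Total':  [30, 10, 40, 60],
})

# Rows: cause, columns: gender, cells: summed totals.
by_gender = toy.pivot_table(index='Type', columns='Gender', values='Total',
                            aggfunc='sum', fill_value=0)

# Fraction of each cause's total that is attributed to women.
by_gender['Female_share'] = by_gender['Female'] / (by_gender['Female'] + by_gender['Male'])

female_majority = by_gender.index[by_gender['Female_share'] > 0.5].tolist()
print(female_majority)
```

Applied to `causes_df`, the same steps would give the list of causes in which female totals exceed male totals.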

Finally, the distribution of __Means_adopted__:

In [72]:
g = sns.catplot(x='Type', y='Total', kind='bar', data=means_df, estimator=sum, height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');
# Same plot, color-coded by Gender
g = sns.catplot(x='Type', y='Total', hue='Gender', kind='bar',
                data=means_df, estimator=sum, palette=sns.color_palette(color_code), height=4, aspect=2)
g.set_xticklabels(rotation=45, horizontalalignment='right');

For every means of committing suicide, men account for the larger share, except for __By Fire/Self Immolation__.

### Future work
The data has a lot more to offer. As a next step, I would like to run some state-level analysis.
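
A first pass at the state-level analysis might start from per-state totals. The sketch below assumes the data has a `State` column, which is not shown anywhere in this notebook, and uses made-up values. If that column also contained aggregate rows, they would need to be dropped first.

```python
import pandas as pd

# Toy data; the 'State' column name is an assumption, values are made up.
toy = pd.DataFrame({
    'State': ['A', 'A', 'B', 'C'],
    'Total': [5, 7, 20, 1],
})

# Total per state, largest first.
state_totals = toy.groupby('State')['Total'].sum().sort_values(ascending=False)
top_states = state_totals.head(2)
print(top_states)
```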