# Data visualisation on the cleaned OSMI Mental Health dataset.

### All warnings are suppressed to produce more concise output.

In [57]:
import pandas as pd  # data processing, CSV file I/O
survey = pd.read_csv("OSMIcleaned.csv")


import plotly.express as px 
import cufflinks as cf
import plotly.graph_objects as go
from plotly.subplots import make_subplots
cf.go_offline()
%matplotlib inline
from matplotlib.axes._axes import _log as matplotlib_axes_logger
matplotlib_axes_logger.setLevel('ERROR')


Prior versions of plotly contained functionality for creating figures in both “online” and “offline” modes. In “online” mode, figures were uploaded to an instance of Plotly’s Chart Studio service and then displayed, whereas in “offline” mode figures were rendered locally. This duality has been a common source of confusion for several years, and so in version 4 we are making some important changes to help clear this up.

Starting with this version, the only supported mode of operation in the plotly package is “offline” mode, which requires no internet connection, no account, no authentication tokens, and no payment of any kind. Support for “online” mode has been moved into a separately-installed package called chart-studio.

### Let's get to know our data

In [8]:
survey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4218 entries, 0 to 4217
Data columns (total 20 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   Family History of Mental Illness  4218 non-null   object 
 1   Company Size                      3714 non-null   object 
 2   year                              4218 non-null   int64  
 3   Age                               4218 non-null   int64  
 4   Age-Group                         4218 non-null   object 
 5   Gender                            4218 non-null   object 
 6   Sought Treatment                  4218 non-null   int64  
 7   Describe Past Experience          945 non-null    object 
 8   Prefer Anonymity                  1523 non-null   float64
 9   Rate Reaction to Problems         1523 non-null   object 
 10  Negative Consequences             4216 non-null   object 
 11  Location                          4216 non-null   object 
 12  Access

Let's observe how visualization makes it easy to spot data that needs cleaning.

In [20]:
survey["Age"].iplot()
survey["Age"].describe()

count    4218.000000
mean       33.765055
std         8.326657
min         0.000000
25%        28.000000
50%        33.000000
75%        38.000000
max        99.000000
Name: Age, dtype: float64

In [21]:
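df = pd.read_csv("OSMIcleaned.csv")
# Flag implausible ages (below 20 or above 75)
outliers = (df["Age"] > 75) | (df["Age"] < 20)
# Cast to float so the mean can be stored, then replace the flagged ages
# with the mean age, computed once before any value is overwritten
df["Age"] = df["Age"].astype(float)
df.loc[outliers, "Age"] = df["Age"].mean()
df["Age"].iplot()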
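
The threshold-based cleaning above can also be written with `Series.where`. The following is a minimal sketch on made-up ages (not the survey data). It assumes the same 20 to 75 plausibility range and, as an illustrative variation only, fills the gaps with the median of the plausible ages instead of the mean.

```python
import pandas as pd

# Toy ages for illustration only; 0 and 99 mimic the implausible entries above.
ages = pd.Series([0, 25, 31, 38, 99, 42], name="Age")

# Keep ages inside the assumed plausible range [20, 75]; mark the rest as missing.
plausible = ages.where(ages.between(20, 75))

# Fill the gaps with the median of the plausible ages (an assumption of this
# sketch; the notebook itself uses the mean).
cleaned = plausible.fillna(plausible.median())

print(cleaned.tolist())
```

The median is less sensitive to the extreme values being removed, which is one reason it might be preferred here.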

### The distribution of data with respect to age and gender throughout the years can be seen below. 

In [22]:
df=survey.groupby(["year","Age-Group","Gender"]).size().to_frame(name="Count").reset_index()
fig = px.bar(df, x="Age-Group", y="Count", color="Gender", barmode="group",facet_col="year",title="General Statistics on Survey Data Participants")
fig.show()


Conclusions:

1. Fewer people have participated in the survey over the years.

2. Men make up the majority of respondents.

3. Most respondents are between the ages of 20 and 40.
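
These observations can also be checked numerically rather than read off the chart. The sketch below uses a small made-up frame with the same `year` and `Gender` column names; the values are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Toy respondents for illustration only (not the OSMI data).
toy = pd.DataFrame({
    "year":   [2014, 2014, 2014, 2014, 2016, 2016, 2016, 2019],
    "Gender": ["Male", "Male", "Male", "Female", "Male", "Female", "Male", "Male"],
})

# Respondents per survey year: a falling series would support the first conclusion.
per_year = toy.groupby("year").size()

# Share of each gender within every year (each row sums to 1).
gender_share = pd.crosstab(toy["year"], toy["Gender"], normalize="index")

print(per_year)
print(gender_share.round(2))
```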

In [23]:
fig = go.Figure()
fig.add_trace(go.Violin(x=survey['year'][ survey['Gender'] == 'Male' ],
                        y=survey['Age'][ survey['Gender'] == 'Male' ],
                        legendgroup='Yes', scalegroup='Yes', name='Male',
                        side='negative',
                        line_color='blue'))
fig.add_trace(go.Violin(x=survey['year'][ survey['Gender'] == 'Female' ],
                        y=survey['Age'][ survey['Gender'] == 'Female' ],
                        legendgroup='Yes', scalegroup='Yes', name='Female',
                        side='positive',
                        line_color='red'))
fig.update_layout(title="Age Distribution of Participants")

In [24]:
df1=survey.groupby(["year","Gender","Family History of Mental Illness","Discuss Mental Health Problems"]).size().to_frame(name="Count").reset_index()
g1=df1[df1['Family History of Mental Illness']=='Yes']
fig = px.line(g1, x='year', y='Count', color='Discuss Mental Health Problems', symbol="Gender",title="With Family History of Mental Illness ")
fig.show()
g2=df1[df1['Family History of Mental Illness']=='No']
fig = px.line(g2, x='year', y='Count', color='Discuss Mental Health Problems', symbol="Gender",title="Without Family History of Mental Illness ")
fig.show()

Conclusions:

1. People with a family history of mental illness are more willing to reach out for help.

2. Employees without a family history of mental illness show more reservations about opening up.

3. Men were found to be more willing to discuss their mental health.

4. Over the years, the tendency to discuss mental illness openly has decreased.

In [208]:
df1=survey.groupby(["year","Age-Group","Family History of Mental Illness","Sought Treatment"]).size().to_frame(name="Count").reset_index()
g1=df1[df1['Family History of Mental Illness']=='Yes']
g2=df1[df1['Family History of Mental Illness']=='No']

In [210]:
fig = px.histogram(g1, x="Age-Group", y="Count", color="Sought Treatment",barmode='group',facet_col="year",title="With Family History of Mental Illness")
fig.show()
fig = px.histogram(g2, x="Age-Group", y="Count", color="Sought Treatment",barmode='group',facet_col="year",title="Without Family History of Mental Illness")
fig.show()

Conclusions:

1. People with a family history of mental illness are more willing to seek treatment.

2. Over the years, fewer people are opting for treatment.

3. People without a family history of mental illness are less willing to seek treatment.

4. Middle-aged people are more likely to seek treatment.
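
Statements about willingness are easier to defend with rates than with raw counts, because the groups differ in size. The following is a minimal sketch on made-up data; it assumes `Sought Treatment` is coded 0/1, as in the `survey.info()` output, so the group mean equals the treatment rate.

```python
import pandas as pd

# Toy data for illustration only; column names mirror the survey's.
toy = pd.DataFrame({
    "Family History of Mental Illness":
        ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"],
    "Sought Treatment":
        [1, 1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Because Sought Treatment is coded 0/1, its group mean is the treatment rate.
rate = toy.groupby("Family History of Mental Illness")["Sought Treatment"].mean()

print(rate)
```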

In [211]:
df1=survey.groupby(["year","Age-Group","Prefer Anonymity","Negative Consequences"]).size().to_frame(name="Count").reset_index()
g1=df1[df1['Prefer Anonymity']==1]
g2=df1[df1['Prefer Anonymity']==0]

In [212]:
fig = px.histogram(g1, x="Prefer Anonymity", y="Count", color="Negative Consequences",barmode='group',facet_col="year",title="Prefers anonymity")
fig.show()
fig = px.histogram(g2, x="Prefer Anonymity", y="Count", color="Negative Consequences",barmode='group',facet_col="year",title="Does not prefer anonymity")
fig.show()

Conclusions:

1. People who think that anonymity is not necessary tend to believe there might not be negative consequences.

2. Over the years, people who prefer anonymity have started to believe that there might be negative consequences.

In [93]:
df=survey.groupby(["year","Location"]).size().to_frame(name="Count").reset_index()
print(df)

     year                Location  Count
0    2014               Australia     22
1    2014                 Austria      3
2    2014            Bahamas, The      1
3    2014                 Belgium      6
4    2014  Bosnia and Herzegovina      1
..    ...                     ...    ...
213  2019                   Spain      3
214  2019             Switzerland      4
215  2019                  Turkey      3
216  2019                     USA    205
217  2019          United Kingdom     32

[218 rows x 3 columns]


In [109]:
fig = px.bar(df, x="Location", y="Count",
                     hover_name="Location",
                     animation_frame="year")

fig["layout"].pop("updatemenus")
# optional, drop animation buttons
fig['layout']['sliders'][0]['pad']=dict(l=10,r= 10, t= 150)
fig.show()

Conclusion: Most of the respondents are from the USA; there is little to no information on other parts of the world.
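
To quantify how dominant one location is, each count can be divided by the total for its year. This is a sketch on made-up counts shaped like the grouped frame above; the `Share` column is a name introduced here for illustration.

```python
import pandas as pd

# Toy counts in the same shape as the grouped frame above (values are made up).
counts = pd.DataFrame({
    "year":     [2014, 2014, 2014, 2019, 2019],
    "Location": ["USA", "United Kingdom", "Canada", "USA", "United Kingdom"],
    "Count":    [60, 30, 10, 80, 20],
})

# Divide each count by the total for its year to get the within-year share.
counts["Share"] = counts["Count"] / counts.groupby("year")["Count"].transform("sum")

print(counts)
```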

In [215]:
df=survey.groupby(["Location","Gender"]).size().to_frame(name="Count").reset_index()
# Keep locations with more than 20 respondents of each gender, sorted by count
m=df[(df["Gender"]=='Male') & (df["Count"]>20)].sort_values(by='Count')
f=df[(df["Gender"]=='Female') & (df["Count"]>20)].sort_values(by='Count')
# Negate the male counts so their bars extend to the left of the shared axis
m=m.assign(Count=-m["Count"])

print(m)
print(f)


           Location Gender  Count
78      New Zealand   Male    -21
111     Switzerland   Male    -22
108          Sweden   Male    -24
15           Brazil   Male    -33
57          Ireland   Male    -39
52            India   Male    -42
37           France   Male    -47
5         Australia   Male    -58
75      Netherlands   Male    -83
41          Germany   Male   -118
20           Canada   Male   -139
122  United Kingdom   Male   -393
118             USA   Male  -1777
           Location  Gender  Count
19           Canada  Female     55
121  United Kingdom  Female     80
117             USA  Female    782


In [216]:
fig = make_subplots(rows=1, cols=2,
                    shared_yaxes=True,
                    horizontal_spacing=0)

fig.append_trace(go.Bar(
                 y = m['Location'],
                 x = m['Count'],
                 text = m['Count'],
                 textposition = 'outside',
                 name = 'Male responses',
                 orientation = 'h'),
                 row=1, col=1
                 )

fig.append_trace(go.Bar(
                 y = f['Location'],
                 x = f['Count'],
                 text = f['Count'],
                 textposition = 'outside',
                 name = 'Female responses',
                 orientation = 'h'),
                 row=1, col=2)

fig.update_layout(
                  font_family   = 'monospace',
                  title         = dict(text = 'Gender of the survey respondents across Countries', x = 0.525),
                  margin        = dict(t=80, b=0, l=70, r=40),
                  hovermode     = "y unified",
                  font          = dict(color='black'),
                  legend        = dict(orientation="h",
                                       yanchor="bottom", y=1,
                                       xanchor="center", x=0.5),
                  hoverlabel    = dict(font_size=13, 
                                      font_family="Monospace"))


fig.show()

Conclusion: Only three locations (USA, United Kingdom and Canada) have more than 20 female respondents, so female participation is concentrated in a few highly developed countries.

In [128]:
# Work on a copy so the edits below do not touch (or warn about) `survey`
df = survey.iloc[:,[0,1,6,8,10,12,14,15,16,18]].copy()
print(df)
# Columns 2, 3, 5 and 9 hold 0/1 flags; map them to 'No'/'Yes'
flag_cols = df.columns[[2,3,5,9]]
df[flag_cols] = df[flag_cols].replace({1:'Yes', 0:'No'})
print(df)
    

     Family History of Mental Illness    Company Size  Sought Treatment  \
0                                  No          Jun-25                 1   
1                                  No  More than 1000                 0   
2                                  No          Jun-25                 0   
3                                 Yes          26-100                 1   
4                                  No         100-500                 0   
...                               ...             ...               ...   
4213                              Yes          Jun-25                 1   
4214                              Yes         100-500                 1   
4215                     I don't know         100-500                 1   
4216                     I don't know          Jun-25                 0   
4217                     I don't know  More than 1000                 1   

      Prefer Anonymity Negative Consequences  Access to information Diagnosis  \
0                 

In [129]:
buttons = []
i = 0
vis = [False] * 10

for col in df.columns:
    vis[i] = True
    buttons.append({'label' : col,
             'method' : 'update',
             'args'   : [{'visible' : vis},
             {'title'  : col}] })
    i+=1
    vis = [False] * 10

fig = go.Figure()

for col in df.columns:
    fig.add_trace(go.Pie(
             values = df[col].value_counts(),
             labels = df[col].value_counts().index,
             title = dict(text = 'Distribution of {}'.format(col),
                          font = dict(size=18, family = 'monospace'),
                          ),
             hole = 0.6,
             hoverinfo='label+percent',))

fig.update_traces(hoverinfo='label+percent',
                  textinfo='label+percent',
                  textfont_size=12,
                  opacity = 0.8,
                  showlegend = False,
                  )
              

fig.update_layout(margin=dict(t=0, b=0, l=0, r=0.5),
                  updatemenus = [dict(
                        type = 'dropdown',
                        x = 1.15,
                        y = 0.85,
                        showactive = True,
                        active = 0,
                        buttons = buttons)],
                 annotations=[
                             dict(text = "<b>Choose Column</b> : ",
                             showarrow=False,
                             x = 1.06, y = 0.92, yref = "paper", align = "left")])

# Show only the first pie initially; the dropdown toggles the others
for i in range(1, len(fig.data)):
    fig.data[i].visible = False

fig.show()
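
The dropdown above relies on one list of booleans per button, with `True` only at the position of the trace that button should reveal. The following pure-Python sketch builds those lists; `visibility_masks` and its `n_groups` parameter are hypothetical names introduced here, and the sketch assumes traces are added group by group when a figure repeats the same columns for several groups.

```python
def visibility_masks(n_columns, n_groups=1):
    """Return one list of booleans per dropdown button.

    Traces are assumed to be added group by group, so column k owns
    traces k, k + n_columns, k + 2 * n_columns, and so on.
    """
    masks = []
    for k in range(n_columns):
        mask = [False] * (n_columns * n_groups)
        for g in range(n_groups):
            mask[g * n_columns + k] = True
        masks.append(mask)
    return masks


# One group of three traces: each button reveals exactly one trace.
masks = visibility_masks(n_columns=3)
print(masks)
```

Building a fresh list for every button matters: reusing and mutating a single list would make every button share the same mask.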

Conclusions:

1. A large number of respondents report having a mental health disorder or another health condition.

2. Respondents are still unsure whether bringing up a mental health issue would be a good idea.

3. A large number of respondents have sought help for mental health issues, but many are not yet open to discussing them or are not comfortable doing so.

In [137]:
# Work on a copy so the edits below do not touch (or warn about) `survey`
df = survey.iloc[:,[0,1,5,6,8,10,12,14,15,16,18]].copy()
# Columns 3, 4, 6 and 10 hold 0/1 flags; map them to 'No'/'Yes'
flag_cols = df.columns[[3,4,6,10]]
df[flag_cols] = df[flag_cols].replace({1:'Yes', 0:'No'})
# Split by gender and drop the now-constant Gender column
m=df[df["Gender"]=='Male'].drop(columns=["Gender"])
f=df[df["Gender"]=='Female'].drop(columns=["Gender"])


In [143]:
buttons = []
i = 0
vis = [False] * 10

for col in m.columns:
    vis[i] = True
    buttons.append({'label' : col,
             'method' : 'update',
             'args'   : [{'visible' : vis},
             {'title'  : col}] })
    i+=1
    vis = [False] * 10

fig = make_subplots(rows=1, cols=2,
                    specs=[[{'type':'domain'}, {'type':'domain'}]])     #domain: Subplot type for traces that are individually positioned. pie, parcoords, parcats, etc.

for col in m.columns:
    fig.add_trace(go.Pie(
             values = m[col].value_counts(),
             labels = m[col].value_counts().index,
             title = dict(text = 'Male distribution<br>of {}'.format(col),
                          font = dict(size=18, family = 'monospace'),
                          ),
             hole = 0.5,
             hoverinfo='label+percent',),1,1)


for col in f.columns:
    fig.add_trace(go.Pie(
             values = f[col].value_counts(),
             labels = f[col].value_counts().index,
             title = dict(text = 'Female distribution<br>of {}'.format(col),
                          font = dict(size=18, family = 'monospace')
                          ),
             hole = 0.5,
             hoverinfo='label+percent',),1,2)

fig.update_traces(hoverinfo='label+percent',
                  textinfo='label+percent',
                  textfont_size=12,
                  opacity = 0.8,
                  showlegend = False)

fig.update_traces(row=1, col=2, hoverinfo='label+percent',
                  textinfo='label+percent',
                  textfont_size=12,
                  opacity = 0.8,
                  showlegend = True)
              

fig.update_layout(margin=dict(t=0, b=0, l=0, r=0),
                  font_family   = 'monospace',
                  updatemenus = [dict(
                        type = 'dropdown',
                        x = 0.62,
                        y = 0.91,
                        showactive = True,
                        active = 0,
                        buttons = buttons)],
                 annotations=[
                             dict(text = "<b>Choose Column</b> : ",
                                  font = dict(size = 14),
                             showarrow=False,
                             x = 0.5, y = 1, yref = "paper", align = "left")])
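# Initially show only the first column's pies: trace 0 (male) and trace n (female)
n = len(m.columns)
for i, trace in enumerate(fig.data):
    trace.visible = (i % n == 0)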

fig.show()

Conclusions:

1. More females report a family history of mental illness.

2. A past diagnosis or the presence of a disorder heightens the probability that a respondent has sought help for mental health issues.

In [181]:
df = survey.iloc[:,[6,8,12,18]]
co=df.corr()
print(co)

                       Sought Treatment  Prefer Anonymity  \
Sought Treatment               1.000000          0.259566   
Prefer Anonymity               0.259566          1.000000   
Access to information          0.129782         -0.002078   
Disorder                       0.506193          0.230537   

                       Access to information  Disorder  
Sought Treatment                    0.129782  0.506193  
Prefer Anonymity                   -0.002078  0.230537  
Access to information               1.000000  0.111771  
Disorder                            0.111771  1.000000  


In [179]:
fig = px.imshow(co, text_auto=True)
fig.show()

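Conclusions:

1. Access to information and Prefer Anonymity are essentially uncorrelated (about -0.002), so this matrix does not show that people with less knowledge about mental health are more likely to prefer anonymity.

2. Sought Treatment and Disorder have a correlation of about 0.51, the strongest in the matrix: respondents who sought treatment are more likely to report a disorder.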
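
A correlation coefficient measures the strength of a linear association; it is not a probability. The sketch below uses made-up 0/1 indicators, chosen so that the coefficient comes out at 0.5, to show how it is computed and what its square means. Note also that pandas computes correlations on pairwise complete rows, so columns with many missing values (such as `Prefer Anonymity`) are compared on fewer respondents.

```python
import pandas as pd

# Toy 0/1 indicators for illustration only, chosen so that r is 0.5.
toy = pd.DataFrame({
    "Sought Treatment": [1, 1, 0, 0, 1, 0, 1, 0],
    "Disorder":         [1, 1, 0, 0, 0, 1, 1, 0],
})

# Pearson correlation between the two columns (the pandas default method).
r = toy["Sought Treatment"].corr(toy["Disorder"])

# r squared is the share of variance one variable linearly explains in the other.
print(round(r, 3), round(r ** 2, 3))
```

With r around 0.5, only about a quarter of the variance is shared, so the relationship is moderate rather than decisive.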