In [1]:
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.preprocessing import OrdinalEncoder

In [2]:
data = pd.read_csv("mental-heath-in-tech-2016_20161114.csv")

In [3]:
data.head()

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",...,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
3,1,,,,,,,,,,...,Sometimes,Sometimes,43,male,United Kingdom,,United Kingdom,,Supervisor/Team Lead,Sometimes
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,...,Sometimes,Sometimes,43,Female,United States of America,Illinois,United States of America,Illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,Sometimes


Drop rows where the respondent is self-employed, since we are primarily concerned with people in the tech workforce.

In [4]:
data['Are you self-employed?'].value_counts()

0    1146
1     287
Name: Are you self-employed?, dtype: int64

In [5]:
data = data[data["Are you self-employed?"]==0]

Look for columns that a majority of the respondents left unanswered.

In [6]:
drop_cols = []
for col in data.columns:
    if (sum(pd.isnull(data[col]))>len(data) / 2):
        drop_cols.append(col)
drop_cols

['Is your primary role within your company related to tech/IT?',
 'Do you have medical coverage (private insurance or state-provided) which includes treatment of \xa0mental health issues?',
 'Do you know local or online resources to seek help for a mental health disorder?',
 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?',
 'If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?',
 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?',
 'If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?',
 'Do you believe your productivity is ever affected by a mental health issue?',
 'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?',
 'Ha
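
The majority-missing check above can also be written without an explicit loop. This is only a minimal sketch on a made-up frame (`toy` and its column names are hypothetical, not part of the survey): `isna().mean()` gives the fraction of missing values per column, which is then compared against 0.5.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the survey data.
toy = pd.DataFrame({
    "q1": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "q2": ["a", "b", None, "d"],           # 25% missing
    "q3": [np.nan, np.nan, 1.0, 2.0],      # exactly 50% missing
})

# Fraction of missing answers per column.
missing_frac = toy.isna().mean()

# Strictly more than half missing, matching the `>` used in the loop.
mostly_missing = missing_frac[missing_frac > 0.5].index.tolist()
print(mostly_missing)
```

A column that is exactly half empty is kept, as in the loop version.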

In [7]:
data[drop_cols[0]].value_counts().plot(kind='bar')

[Figure: bar chart of value counts for 'Is your primary role within your company related to tech/IT?']

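Some other columns are not useful for this analysis (location details, free-text answers, and the self-employment flag, which is now constant), so add them to the drop list.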

In [8]:
useless_col = ["Are you self-employed?",
            "What US state or territory do you work in?",
           "What US state or territory do you live in?",
           "What country do you live in?",
           "Why or why not?",
           "Why or why not?.1"]
drop_cols.extend(useless_col)

In [9]:
drop_cols

['Is your primary role within your company related to tech/IT?',
 'Do you have medical coverage (private insurance or state-provided) which includes treatment of \xa0mental health issues?',
 'Do you know local or online resources to seek help for a mental health disorder?',
 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to clients or business contacts?',
 'If you have revealed a mental health issue to a client or business contact, do you believe this has impacted you negatively?',
 'If you have been diagnosed or treated for a mental health disorder, do you ever reveal this to coworkers or employees?',
 'If you have revealed a mental health issue to a coworker or employee, do you believe this has impacted you negatively?',
 'Do you believe your productivity is ever affected by a mental health issue?',
 'If yes, what percentage of your work time (time performing primary or secondary job functions) is affected by a mental health issue?',
 'Ha

Drop the columns collected in drop_cols.

In [10]:
data2 = data.copy()
df3 = data2.drop(drop_cols, axis=1)

In [11]:
data2.head()

Unnamed: 0,Are you self-employed?,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Is your primary role within your company related to tech/IT?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",...,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you live in?,What US state or territory do you live in?,What country do you work in?,What US state or territory do you work in?,Which of the following best describes your work position?,Do you work remotely?
0,0,26-100,1.0,,Not eligible for coverage / N/A,,No,No,I don't know,Very easy,...,Not applicable to me,Not applicable to me,39,Male,United Kingdom,,United Kingdom,,Back-end Developer,Sometimes
1,0,6-25,1.0,,No,Yes,Yes,Yes,Yes,Somewhat easy,...,Rarely,Sometimes,29,male,United States of America,Illinois,United States of America,Illinois,Back-end Developer|Front-end Developer,Never
2,0,6-25,1.0,,No,,No,No,I don't know,Neither easy nor difficult,...,Not applicable to me,Not applicable to me,38,Male,United Kingdom,,United Kingdom,,Back-end Developer,Always
4,0,6-25,0.0,1.0,Yes,Yes,No,No,No,Neither easy nor difficult,...,Sometimes,Sometimes,43,Female,United States of America,Illinois,United States of America,Illinois,Executive Leadership|Supervisor/Team Lead|Dev ...,Sometimes
5,0,More than 1000,1.0,,Yes,I am not sure,No,Yes,Yes,Somewhat easy,...,Not applicable to me,Often,42,Male,United Kingdom,,United Kingdom,,DevOps/SysAdmin|Support|Back-end Developer|Fro...,Sometimes


In [12]:
data2.columns

Index(['Are you self-employed?',
       'How many employees does your company or organization have?',
       'Is your employer primarily a tech company/organization?',
       'Is your primary role within your company related to tech/IT?',
       'Does your employer provide mental health benefits as part of healthcare coverage?',
       'Do you know the options for mental health care available under your employer-provided coverage?',
       'Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?',
       'Does your employer offer resources to learn more about mental health concerns and options for seeking help?',
       'Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?',
       'If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:',
       'Do you think that dis

In [13]:
df4 = df3.copy()

###
rp_col = "How many employees does your company or organization have?"
# fill na
# df4[cdf4[rp_col]] = df4[cdf4[rp_col]].fillna(-1)
# replace each size-range label with a representative integer (roughly its lower bound)
rp_dt = {'1-5':1,
        '6-25':6,
        '26-100':26,
        '100-500':101,
        '500-1000':501,
        'More than 1000':1001}

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
# rp_col = "Is your primary role within your company related to tech/IT?"
# df4[rp_col] = df4[rp_col].fillna(-1) #for NA

rp_col = "Does your employer provide mental health benefits as part of healthcare coverage?"
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        "I don't know":2, # responses in increasing negativity will be 2 onwards
        'No':3,
        'Not eligible for coverage / N/A':-1
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you know the options for mental health care available under your employer-provided coverage?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'I am not sure':2, # responses in increasing negativity will be 2 onwards
        'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        "I don't know":2, # responses in increasing negativity will be 2 onwards
        'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Does your employer offer resources to learn more about mental health concerns and options for seeking help?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        "I don't know":2, # responses in increasing negativity will be 2 onwards
        'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        "I don't know":2, # responses in increasing negativity will be 2 onwards
        'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {"Very easy":1, # positive/yes response to qn will be 1
        "Somewhat easy":2, # responses in increasing negativity will be 2 onwards
        "Neither easy nor difficult":3,
         "I don't know":3,
         "Somewhat difficult":4,
         "Very difficult":5
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you think that discussing a mental health disorder with your employer would have negative consequences?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you think that discussing a physical health issue with your employer would have negative consequences?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Would you feel comfortable discussing a mental health disorder with your coworkers?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Would you feel comfortable discussing a mental health disorder with your direct supervisor(s)?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you feel that your employer takes mental health as seriously as physical health?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        "I don't know":2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Have you heard of or observed negative consequences for co-workers who have been open about mental health issues in your workplace?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'No':2, # responses in increasing negativity will be 2 onwards,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you have previous employers?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {0:2 # replace 0 (no) with 2 for consistency
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Have your previous employers provided mental health benefits?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, they all did':1, # positive/yes response to qn will be 1
        'Some did':2, # responses in increasing negativity will be 2 onwards,
        "I don't know":3,
         'No, none did':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Were you aware of the options for mental health care provided by your previous employers?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, I was aware of all of them':1, # positive/yes response to qn will be 1
        'I was aware of some':2, # responses in increasing negativity will be 2 onwards,
        'No, I only became aware later':3,
         'N/A (not currently aware)':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Did your previous employers ever formally discuss mental health (as part of a wellness campaign or other official communication)?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, they all did':1, # positive/yes response to qn will be 1
        'Some did':2, # responses in increasing negativity will be 2 onwards,
        "I don't know":3,
         'None did':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Did your previous employers provide resources to learn more about mental health issues and how to seek help?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, they all did':1, # positive/yes response to qn will be 1
        'Some did':2, # responses in increasing negativity will be 2 onwards,
        "I don't know":3,
         'None did':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, always':1, # positive/yes response to qn will be 1
        'Sometimes':2, # responses in increasing negativity will be 2 onwards,
        "I don't know":3,
         'No':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you think that discussing a mental health disorder with previous employers would have negative consequences?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, all of them':1, # positive/yes response to qn will be 1
        'Some of them':2, # responses in increasing negativity will be 2 onwards,
        "I don't know":3,
         'None of them':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you think that discussing a physical health issue with previous employers would have negative consequences?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, all of them':1, # positive/yes response to qn will be 1
        'Some of them':2, # responses in increasing negativity will be 2 onwards,
         'None of them':3
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Would you have been willing to discuss a mental health issue with your previous co-workers?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, at all of my previous employers':1, # positive/yes response to qn will be 1
        'Some of my previous employers':2, # responses in increasing negativity will be 2 onwards,
         'No, at none of my previous employers':3
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Would you have been willing to discuss a mental health issue with your direct supervisor(s)?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, at all of my previous employers':1, # positive/yes response to qn will be 1
        'Some of my previous employers':2, # responses in increasing negativity will be 2 onwards,
         "I don't know":3,
         'No, at none of my previous employers':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Did you feel that your previous employers took mental health as seriously as physical health?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, they all did':1, # positive/yes response to qn will be 1
        'Some did':2, # responses in increasing negativity will be 2 onwards,
        "I don't know":3,
         'None did':4
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workplaces?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, all of them':1, # positive/yes response to qn will be 1
        'Some of them':2, # responses in increasing negativity will be 2 onwards,
         'None of them':3
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Would you be willing to bring up a physical health issue with a potential employer in an interview?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Would you bring up a mental health issue with a potential employer in an interview?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you feel that being identified as a person with a mental health issue would hurt your career?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, it has':1, # positive/yes response to qn will be 1
        'Yes, I think it would':2, # responses in increasing negativity will be 2 onwards,
        'Maybe':3,
         "No, I don't think it would":4,
         'No, it has not':5
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you think that team members/co-workers would view you more negatively if they knew you suffered from a mental health issue?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, they do':1, # positive/yes response to qn will be 1
         'Yes, I think they would':2, # responses in increasing negativity will be 2 onwards,
        'Maybe':3,
         "No, I don't think they would":4,
         'No, they do not':5
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "How willing would you be to share with friends and family that you have a mental illness?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Very open':1, # positive/yes response to qn will be 1
         'Somewhat open':2, # responses in increasing negativity will be 2 onwards,
        'Neutral':3,
         'Somewhat not open':4,
         'Not open at all':5,
         'Not applicable to me (I do not have a mental illness)':-1
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes, I experienced':1, # positive/yes response to qn will be 1
         'Yes, I observed':2, # responses in increasing negativity will be 2 onwards,
        'Maybe/Not sure':3,
         'No':4,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
# rp_col = "Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?"
# # nan values is 55.41%; unsure what is the cause of nan values
# # drop column
# df4 = df4.drop([rp_col],axis=1)

###
rp_col = "Do you have a family history of mental illness?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        "I don't know":2, # responses in increasing negativity will be 2 onwards
        'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Have you had a mental health disorder in the past?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you currently have a mental health disorder?"
# potential target column or key X column
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'Maybe':2, # responses in increasing negativity will be 2 onwards,
         'No':3,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Have you been diagnosed with a mental health condition by a medical professional?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Yes':1, # positive/yes response to qn will be 1
        'No':2, # responses in increasing negativity will be 2 onwards,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Have you ever sought treatment for a mental health issue from a mental health professional?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {1:1, # positive/yes response to qn will be 1
        0:2, # responses in increasing negativity will be 2 onwards,
        }

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Often':1, # positive/yes response to qn will be 1
        'Sometimes':2, # responses in increasing negativity will be 2 onwards,
        'Rarely':3,
        'Never':4,
        'Not applicable to me':-1}

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Often':1, # positive/yes response to qn will be 1
        'Sometimes':2, # responses in increasing negativity will be 2 onwards,
        'Rarely':3,
        'Never':4,
        'Not applicable to me':-1}

df4[rp_col] = df4[rp_col].replace(rp_dt)

###
rp_col = "Do you work remotely?"
df4[rp_col] = df4[rp_col].fillna(-1) #for NA
rp_dt = {'Always':1, # positive/yes response to qn will be 1
        'Sometimes':2, # responses in increasing negativity will be 2 onwards,
        'Never':3,
       }

df4[rp_col] = df4[rp_col].replace(rp_dt)

#####

# prepare replacement lists
male_ls = ['Male','male', 'Male ', 'M', 'm', 'man', 'Cis male',
           'Male.', 'Male (cis)', 'Man', 'Sex is male',
           'cis male', 'Malr', 'Dude', "I'm a man why didn't you make this a drop down question. You should of asked sex? And I would of answered yes please. Seriously how much text can this take? ",
           'mail', 'M|', 'male ', 'Cis Male', 'Male (trans, FtM)',
           'cisdude', 'cis man', 'MALE']
# FYI: cisgender: describes a person who identifies as the same gender assigned at birth
female_ls = ['Female','female', 'I identify as female.', 'female ',
             'Female assigned at birth ', 'F', 'Woman', 'fm', 'f',
             'Cis female', 'Transitioned, M2F', 'Female or Multi-Gender Femme',
             'Female ', 'woman', 'female/woman', 'Cisgender Female', 
             'mtf', 'fem', 'Female (props for making this a freeform field, though)',
             ' Female', 'Cis-woman', 'AFAB', 'Transgender woman',
             'Cis female ']
# FYI: AFAB: assigned female at birth
other_ls = ['Bigender', 'non-binary,', 'Genderfluid (born female)',
            'Other/Transfeminine', 'Androgynous', 'male 9:1 female, roughly',
            'nb masculine', 'genderqueer', 'Human', 'Genderfluid',
            'Enby', 'genderqueer woman', 'Queer', 'Agender', 'Fluid',
            'Genderflux demi-girl', 'female-bodied; no feelings about gender',
            'non-binary', 'Male/genderqueer', 'Nonbinary', 'Other', 'none of your business',
            'Unicorn', 'human', 'Genderqueer']

# replace gender values with numeric labels (1 = male, 2 = female, 3 = other / missing)
df4["What is your gender?"] = df4["What is your gender?"].replace(male_ls,1)
df4["What is your gender?"] = df4["What is your gender?"].replace(female_ls,2)
df4["What is your gender?"] = df4["What is your gender?"].replace(other_ls,3)
df4["What is your gender?"] = df4["What is your gender?"].fillna(3)
df4["What is your gender?"].unique()

df4.describe(include='all')

Unnamed: 0,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",Do you think that discussing a mental health disorder with your employer would have negative consequences?,Do you think that discussing a physical health issue with your employer would have negative consequences?,...,Do you currently have a mental health disorder?,Have you been diagnosed with a mental health condition by a medical professional?,Have you ever sought treatment for a mental health issue from a mental health professional?,"If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?","If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?",What is your age?,What is your gender?,What country do you work in?,Which of the following best describes your work position?,Do you work remotely?
count,1146.0,1146.0,1146.0,1146.0,1146.0,1146.0,1146.0,1146.0,1146.0,1146.0,...,1146.0,1146.0,1146.0,1146.0,1146.0,1146.0,1146.0,1146,1146,1146.0
unique,,,,,,,,,,,...,,,,,,,,44,180,
top,,,,,,,,,,,...,,,,,,,,United States of America,Back-end Developer,
freq,,,,,,,,,,,...,,,,,,,,716,238,
mean,288.216405,0.770506,1.505236,1.692845,2.508726,2.205934,1.794066,2.750436,2.189354,2.69459,...,1.991274,1.505236,1.426702,1.177138,0.681501,33.655323,1.28534,,,2.088133
std,401.014511,0.420691,1.028192,1.236236,0.807396,0.823985,0.557129,1.240026,0.734615,0.532856,...,0.882589,0.500191,0.494814,1.878176,1.311271,11.703277,0.502999,,,0.677846
min,1.0,0.0,-1.0,-1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,-1.0,-1.0,3.0,1.0,,,1.0
25%,26.0,1.0,1.0,1.0,2.0,1.0,1.0,2.0,2.0,2.0,...,1.0,1.0,1.0,-1.0,-1.0,28.0,1.0,,,2.0
50%,101.0,1.0,1.0,2.0,3.0,2.0,2.0,3.0,2.0,3.0,...,2.0,2.0,1.0,2.0,1.0,32.0,1.0,,,2.0
75%,501.0,1.0,2.0,3.0,3.0,3.0,2.0,4.0,3.0,3.0,...,3.0,2.0,2.0,3.0,2.0,38.0,2.0,,,3.0
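
The cell above repeats the same three steps for almost every question: fill missing answers with -1, define a label-to-integer dictionary, and apply it. One way to factor that out is sketched below. `encode_ordinal` is a hypothetical helper, not something the notebook defines, and the toy frame is made up. The helper also raises on labels missing from the mapping, which the plain `replace` calls would leave in place silently.

```python
import numpy as np
import pandas as pd

def encode_ordinal(df, col, mapping, na_value=-1):
    """Return df[col] with NaN set to na_value and labels mapped to integers.

    Hypothetical helper: raises if a label is absent from `mapping`,
    so a typo in the mapping does not pass unnoticed.
    """
    # Equivalent to fillna(na_value) for this purpose.
    filled = df[col].where(df[col].notna(), na_value)
    unknown = set(filled.unique()) - set(mapping) - {na_value}
    if unknown:
        raise ValueError(f"unmapped labels in {col!r}: {sorted(map(str, unknown))}")
    return filled.map(lambda v: mapping.get(v, v))

# Made-up example column using one of the answer scales from the cell above.
toy = pd.DataFrame({"benefits": ["Yes", "No", np.nan, "I don't know"]})
yes_dk_no = {"Yes": 1, "I don't know": 2, "No": 3}
toy["benefits"] = encode_ordinal(toy, "benefits", yes_dk_no)
print(toy["benefits"].tolist())
```

With such a helper, each question would need only its column name and its dictionary, and questions that share an answer scale could share one dictionary.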


In [14]:
df4['Does your employer provide mental health benefits as part of healthcare coverage?'].value_counts()

 1    531
 2    319
 3    213
-1     83
Name: Does your employer provide mental health benefits as part of healthcare coverage?, dtype: int64
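
Checking one column with `value_counts()` confirms that its labels were replaced. To scan every column at once for answers that were not covered by a mapping, something like the following could be used. This is a sketch on made-up data; `leftover_labels` is a hypothetical helper.

```python
import pandas as pd

# Made-up frame: the second column still holds one unmapped label.
toy = pd.DataFrame({
    "benefits": [1, 2, 3, -1],
    "leave": [1, "Very difficult", 3, -1],
})

def leftover_labels(df):
    """Map each column name to the string values still present in it."""
    out = {}
    for col in df.columns:
        strings = sorted({v for v in df[col] if isinstance(v, str)})
        if strings:
            out[col] = strings
    return out

print(leftover_labels(toy))
```

Run on `df4` at this point, such a check would also report the country and work-position columns, since those are still text and are only encoded in later cells.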

In [179]:
df4['Which of the following best describes your work position?'].unique()
positions = ['Front-end Developer', 'Back-end Developer', 'DevOps/SysAdmin', 'Dev Evangelist/Advocate', 'One-person shop', 'Executive Leadership', 'Supervisor/Team Lead', 'Other', 'HR', 'Sales', 'Support']
work_col = 'Which of the following best describes your work position?'
for pos in positions:
    # 1 if the answer string mentions this position, else 0
    df4[pos] = df4[work_col].str.contains(pos, regex=False).astype(int)
df4

Unnamed: 0,How many employees does your company or organization have?,Is your employer primarily a tech company/organization?,Does your employer provide mental health benefits as part of healthcare coverage?,Do you know the options for mental health care available under your employer-provided coverage?,"Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)?",Does your employer offer resources to learn more about mental health concerns and options for seeking help?,Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer?,"If a mental health issue prompted you to request a medical leave from work, asking for that leave would be:",Do you think that discussing a mental health disorder with your employer would have negative consequences?,Do you think that discussing a physical health issue with your employer would have negative consequences?,...,Back-end Developer,DevOps/SysAdmin,Dev Evangelist/Advocate,One-person shop,Executive Leadership,Supervisor/Team Lead,Other,HR,Sales,Support
0,26,1.0,-1,-1,3,3,2,1,3,3,...,1,0,0,0,0,0,0,0,0,0
1,6,1.0,3,1,1,1,1,2,3,3,...,1,0,0,0,0,0,0,0,0,0
2,6,1.0,3,-1,3,3,2,3,2,3,...,1,0,0,0,0,0,0,0,0,0
4,6,0.0,1,1,3,3,3,3,1,2,...,1,1,1,0,1,1,0,0,0,1
5,1001,1.0,1,2,3,1,1,2,1,1,...,1,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1426,101,1.0,2,2,3,2,2,2,2,3,...,0,0,0,0,0,0,1,0,0,0
1427,501,1.0,1,3,3,3,1,2,3,3,...,0,0,0,0,0,0,0,0,0,1
1430,101,1.0,1,1,1,1,2,4,2,2,...,1,0,0,0,0,0,0,0,0,0
1431,101,0.0,2,2,3,1,2,4,2,3,...,0,1,0,0,0,0,0,0,0,0
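
Because the work-position answers are pipe-separated lists, pandas can also build the indicator columns directly with `Series.str.get_dummies`. A minimal sketch on made-up rows (the `position` column name is a stand-in for the long question text):

```python
import pandas as pd

toy = pd.DataFrame({
    "position": [
        "Back-end Developer",
        "Back-end Developer|Front-end Developer",
        "Supervisor/Team Lead",
    ]
})

# Split each answer on '|' and build one 0/1 column per distinct label.
flags = toy["position"].str.get_dummies(sep="|")
toy = toy.join(flags)
print(toy.drop(columns="position"))
```

Unlike a substring test, this matches whole labels only, so the two approaches can disagree if one label happens to be contained in another. It also creates columns only for labels that actually occur in the data.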


In [322]:
df5 = df4.copy()
country_rp_dt = {}
for idx, name in enumerate(df5['What country do you work in?'].unique()):
    country_rp_dt[name] = idx
df5['What country do you work in?'] = df5['What country do you work in?'].replace(country_rp_dt)
df5 = df5.drop(columns='Which of the following best describes your work position?')
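
The loop above assigns each country an integer code in order of first appearance. As a side note, `pd.factorize` produces the same kind of encoding in one call. The sketch below uses a small made-up Series (`countries` is a hypothetical stand-in for the survey column, not the real data).

```python
import pandas as pd

# Hypothetical stand-in for df5['What country do you work in?']
countries = pd.Series([
    "United Kingdom",
    "United States of America",
    "United Kingdom",
    "Canada",
])

# factorize returns integer codes (numbered by first appearance) and the unique values
codes, uniques = pd.factorize(countries)

# The same mapping built by hand, mirroring the enumerate-over-unique loop
manual_mapping = {name: idx for idx, name in enumerate(countries.unique())}
manual_codes = countries.map(manual_mapping)

print(codes.tolist())
print(manual_codes.tolist())
```

Either way the codes are arbitrary labels, so models that treat the column as numeric will read an ordering into it that does not exist.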


In [323]:
X = df5.loc[:, df5.columns != 'Do you currently have a mental health disorder?']
y = df5['Do you currently have a mental health disorder?']

In [324]:
from sklearn.model_selection import KFold
from sklearn import preprocessing

normalized_X = pd.DataFrame(preprocessing.normalize(X))
normalized_X.columns = X.columns
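
Note that `preprocessing.normalize` rescales each row (sample) to unit norm rather than each column (feature). Whether that is the intended behaviour here is not clear from the notebook. The sketch below, on a made-up array, contrasts it with per-column standardisation via `StandardScaler`, which is an alternative and not what the notebook uses.

```python
import numpy as np
from sklearn import preprocessing

# Toy data: the second row is the first row multiplied by 2
toy = np.array([[3.0, 4.0],
                [6.0, 8.0],
                [1.0, 0.0]])

# normalize() works row by row (default norm='l2', axis=1)
row_normalized = preprocessing.normalize(toy)

# StandardScaler works column by column (zero mean, unit variance per feature)
col_standardized = preprocessing.StandardScaler().fit_transform(toy)

print(np.linalg.norm(row_normalized, axis=1))  # length of each row after normalize()
print(col_standardized.mean(axis=0))           # mean of each column after scaling
```

After row normalisation the first two rows become identical, because they differ only in overall scale.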

In [325]:
# With features selected with PCA
PCA_X = df5.iloc[:, [15, 18, 26, 19, 23, 20, 22, 16, 25, 17, 24, 21, 43, 12, 40]]

In [326]:
PCA_X.columns

Index(['Have your previous employers provided mental health benefits?',
       'Did your previous employers provide resources to learn more about mental health issues and how to seek help?',
       'Would you be willing to bring up a physical health issue with a potential employer in an interview?',
       'Was your anonymity protected if you chose to take advantage of mental health or substance abuse treatment resources with previous employers?',
       'Would you have been willing to discuss a mental health issue with your direct supervisor(s)?',
       'Do you think that discussing a mental health disorder with previous employers would have negative consequences?',
       'Would you have been willing to discuss a mental health issue with your previous co-workers?',
       'Were you aware of the options for mental health care provided by your previous employers?',
       'Did you hear of or observe negative consequences for co-workers with mental health issues in your previous workpl

In [327]:
normalized_PCA_X = pd.DataFrame(preprocessing.normalize(PCA_X))
normalized_PCA_X.columns = PCA_X.columns

In [328]:
# Unshuffled K-fold; random_state only applies when shuffle=True, so it is omitted
kfold = KFold(n_splits=4)
kfold.get_n_splits(normalized_X)

4
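
As a small illustration of how `KFold` splits rows, the sketch below uses a made-up array of 8 samples. It assumes nothing about the survey data. Without shuffling the test folds are contiguous blocks; with `shuffle=True` and an integer `random_state` the folds are randomised but reproducible.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(8).reshape(-1, 1)

# Without shuffling, the test folds are contiguous blocks of rows
plain = KFold(n_splits=4)
plain_tests = [test.tolist() for _, test in plain.split(toy_X)]
print(plain_tests)

# With shuffling and a fixed seed, repeated calls to split() give the same folds
shuffled = KFold(n_splits=4, shuffle=True, random_state=42)
first_pass = [test.tolist() for _, test in shuffled.split(toy_X)]
second_pass = [test.tolist() for _, test in shuffled.split(toy_X)]
print(first_pass == second_pass)
```

If the rows of the dataset are ordered in some meaningful way, the unshuffled variant can put systematically different samples into each fold.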

In [329]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
from keras.models import Sequential
from keras.layers import Dense

# Classical baselines
rfc = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
svc = SVC(kernel='linear', C=1)
knn = KNeighborsClassifier(n_neighbors=7)

# Small feed-forward network: 53 input features, 3 softmax outputs.
# Note that the variable name `keras` shadows the imported package name.
keras = Sequential()
keras.add(Dense(53, input_dim=53, activation='relu'))
keras.add(Dense(20, activation='relu'))
keras.add(Dense(3, activation='softmax'))
keras.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [330]:
# Without dropping any columns
for train_index, test_index in kfold.split(normalized_X):
    X_train, X_test = normalized_X.iloc[train_index, :], normalized_X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    rfc.fit(X_train, y_train)
    svc.fit(X_train, y_train)
    knn.fit(X_train, y_train)

    predict_rfc = rfc.predict(X_test)
    predict_svc = svc.predict(X_test)
    predict_knn = knn.predict(X_test)
    
    # accuracy_score takes (y_true, y_pred); the value is the same either way
    print("rfc:", accuracy_score(y_test, predict_rfc))
    print("svc:", accuracy_score(y_test, predict_svc))
    print("knn:", accuracy_score(y_test, predict_knn))
    print()

rfc: 0.759581881533101
svc: 0.5574912891986062
knn: 0.6550522648083623

rfc: 0.7177700348432056
svc: 0.5505226480836237
knn: 0.6202090592334495

rfc: 0.6433566433566433
svc: 0.4755244755244755
knn: 0.5244755244755245

rfc: 0.7237762237762237
svc: 0.4755244755244755
knn: 0.6398601398601399
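
The per-fold numbers are easier to compare as a mean and spread per model. The sketch below only aggregates the accuracies printed above; the variable names (`fold_scores`, `summary`) are introduced here for illustration.

```python
import numpy as np

# Per-fold accuracies copied from the output above
fold_scores = {
    "rfc": [0.759581881533101, 0.7177700348432056, 0.6433566433566433, 0.7237762237762237],
    "svc": [0.5574912891986062, 0.5505226480836237, 0.4755244755244755, 0.4755244755244755],
    "knn": [0.6550522648083623, 0.6202090592334495, 0.5244755244755245, 0.6398601398601399],
}

# Mean and standard deviation across the four folds for each model
summary = {name: (float(np.mean(s)), float(np.std(s))) for name, s in fold_scores.items()}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```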



In [358]:
normalized_X.iloc[:, [36, 37, 34, 33, 30, 35, 32]].columns

Index(['If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?',
       'If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?',
       'Have you been diagnosed with a mental health condition by a medical professional?',
       'Have you had a mental health disorder in the past?',
       'How willing would you be to share with friends and family that you have a mental illness?',
       'Have you ever sought treatment for a mental health issue from a mental health professional?',
       'Do you have a family history of mental illness?'],
      dtype='object')

In [362]:
# Drop the first three columns listed above
newnew_X = normalized_X.drop(columns=[
    'If you have a mental health issue, do you feel that it interferes with your work when being treated effectively?',
    'If you have a mental health issue, do you feel that it interferes with your work when NOT being treated effectively?',
    'Have you been diagnosed with a mental health condition by a medical professional?',
])

In [363]:
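from sklearn.feature_selection import SelectFromModel

# With prefit=True the selector reuses the already-fitted rfc, so rfc must have been
# fitted on the same columns that are passed to transform(); otherwise sklearn raises
# "ValueError: X has a different shape than during fitting."
rfc.fit(newnew_X, y)
model = SelectFromModel(rfc, prefit=True)
X_new = model.transform(newnew_X)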
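
For reference, a minimal sketch of `SelectFromModel` with `prefit=True` on synthetic data. The dataset and the names `forest`, `selector` and `X_selected` are made up for illustration. The point is that the estimator is fitted on exactly the matrix that is later passed to `transform`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 10 features, only a few of them informative
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# Fit the forest on the same columns that will be transformed
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# prefit=True: reuse the fitted forest instead of fitting a clone
selector = SelectFromModel(forest, prefit=True)
X_selected = selector.transform(X_demo)

# Boolean mask of the columns that were kept
mask = selector.get_support()
print(X_demo.shape, "->", X_selected.shape)
```

By default the selector keeps features whose importance is at least the mean importance, so the number of retained columns depends on the fitted model.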

In [355]:
from sklearn.model_selection import GridSearchCV
n_estimators = [200, 300]
max_depth = [5, 8]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 5, 10]

# 2 * 2 * 3 * 4 = 48 parameter combinations
hyperF = dict(n_estimators=n_estimators,
              max_depth=max_depth,
              min_samples_split=min_samples_split,
              min_samples_leaf=min_samples_leaf)

gridF = GridSearchCV(rfc, hyperF, cv = 3, verbose = 1, n_jobs = -1)
bestF = gridF.fit(X_new, y)

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
exception calling callback for <Future at 0x1394fb610 state=finished raised TerminatedWorkerError>
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/joblib/externals/loky/_base.py", line 625, in _invoke_callbacks
    callback(self)
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 340, in __call__
    self.parallel.dispatch_next()
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 768, in dispatch_next
    if not self.dispatch_one_batch(self._original_iterator):
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 834, in dispatch_one_batch
    self._dispatch(tasks)
  File "/usr/local/lib/python3.7/site-packages/joblib/parallel.py", line 753, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/usr/local/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 543, in apply_async
    

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6), SIGABRT(-6), SIGABRT(-6), SIGABRT(-6)}
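
The traceback does not say why the workers died, although the message itself names excessive memory use as one possibility. One thing to try, as an assumption rather than a confirmed fix, is running the search in a single process with `n_jobs=1`. The sketch below does that on a small synthetic dataset with a reduced, made-up grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real notebook would pass X_new and y instead
X_demo, y_demo = make_classification(
    n_samples=150, n_features=8, n_informative=3, random_state=0
)

# Deliberately tiny grid so the example runs quickly
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

# n_jobs=1 keeps everything in the current process (no worker pool)
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, n_jobs=1)
search.fit(X_demo, y_demo)

# The refitted best model can be used directly instead of copying parameters by hand
best_forest = search.best_estimator_
print(search.best_params_)
```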

In [336]:
bestF.best_params_
rfc_best = RandomForestClassifier(n_estimators=200, max_depth = 5, min_samples_leaf=2, min_samples_split=2, random_state=42)

In [364]:
for train_index, test_index in kfold.split(X_new):
    X_train, X_test = X_new[train_index], X_new[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    rfc.fit(X_train, y_train)
    svc.fit(X_train, y_train)
    knn.fit(X_train, y_train)
    rfc_best.fit(X_train, y_train)

    predict_rfc = rfc.predict(X_test)
    predict_svc = svc.predict(X_test)
    predict_knn = knn.predict(X_test)
    predict_rfc_best = rfc_best.predict(X_test)
    
    # accuracy_score takes (y_true, y_pred); the value is the same either way
    print("rfc:", accuracy_score(y_test, predict_rfc))
    print("svc:", accuracy_score(y_test, predict_svc))
    print("knn:", accuracy_score(y_test, predict_knn))
    print("rfc_best:", accuracy_score(y_test, predict_rfc_best))
    print()

rfc: 0.686411149825784
svc: 0.4250871080139373
knn: 0.6167247386759582
rfc_best: 0.7003484320557491

rfc: 0.710801393728223
svc: 0.44947735191637633
knn: 0.6097560975609756
rfc_best: 0.6759581881533101

rfc: 0.6538461538461539
svc: 0.34965034965034963
knn: 0.5804195804195804
rfc_best: 0.6468531468531469

rfc: 0.6748251748251748
svc: 0.2867132867132867
knn: 0.6188811188811189
rfc_best: 0.6538461538461539



In [357]:
rfc.feature_importances_

array([0.05005887, 0.05636559, 0.05473147, 0.05205788, 0.05360729,
       0.05456182, 0.04756871, 0.05769249, 0.07867339, 0.05580263,
       0.06679116, 0.19921321, 0.10717573, 0.06569975])
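
A bare array of importances is hard to read without the column names. The sketch below pairs importances with feature names and sorts them. It uses synthetic data with hypothetical names (`feature_0` ... `feature_5`); this only works directly when the forest was fitted on a matrix whose column names are known.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data wrapped in a DataFrame so the columns have names
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Importances labelled by column name, largest first
ranked = pd.Series(forest.feature_importances_, index=X_demo.columns).sort_values(ascending=False)
print(ranked)
```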

In [365]:
for train_index, test_index in kfold.split(newnew_X):
    X_train, X_test = newnew_X.iloc[train_index, :], newnew_X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    rfc.fit(X_train, y_train)
    svc.fit(X_train, y_train)
    knn.fit(X_train, y_train)
    rfc_best.fit(X_train, y_train)

    predict_rfc = rfc.predict(X_test)
    predict_svc = svc.predict(X_test)
    predict_knn = knn.predict(X_test)
    predict_rfc_best = rfc_best.predict(X_test)
    
    # accuracy_score takes (y_true, y_pred); the value is the same either way
    print("rfc:", accuracy_score(y_test, predict_rfc))
    print("svc:", accuracy_score(y_test, predict_svc))
    print("knn:", accuracy_score(y_test, predict_knn))
    print("rfc_best:", accuracy_score(y_test, predict_rfc_best))
    print()

rfc: 0.7038327526132404
svc: 0.4425087108013937
knn: 0.5400696864111498
rfc_best: 0.6933797909407665

rfc: 0.6829268292682927
svc: 0.4738675958188153
knn: 0.5331010452961672
rfc_best: 0.6445993031358885

rfc: 0.6258741258741258
svc: 0.40559440559440557
knn: 0.46153846153846156
rfc_best: 0.6398601398601399

rfc: 0.6678321678321678
svc: 0.32517482517482516
knn: 0.534965034965035
rfc_best: 0.6538461538461539



In [368]:
(newnew_X.iloc[:, [12,20,27,30,31,32,33,34, 36]]).columns

Index(['Do you feel that your employer takes mental health as seriously as physical health?',
       'Do you think that discussing a mental health disorder with previous employers would have negative consequences?',
       'Would you bring up a mental health issue with a potential employer in an interview?',
       'How willing would you be to share with friends and family that you have a mental illness?',
       'Have you observed or experienced an unsupportive or badly handled response to a mental health issue in your current or previous workplace?',
       'Do you have a family history of mental illness?',
       'Have you had a mental health disorder in the past?',
       'Have you ever sought treatment for a mental health issue from a mental health professional?',
       'What is your gender?'],
      dtype='object')

In [338]:
for train_index, test_index in kfold.split(normalized_X):
    X_train, X_test = normalized_X.iloc[train_index, :], normalized_X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    # sparse_categorical_crossentropy with a 3-unit softmax expects labels in [0, 3);
    # the target is coded 1..3, so shift it down by one before fitting.
    keras.fit(X_train, y_train - 1)
    predict = keras.predict(X_test)
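
`sparse_categorical_crossentropy` expects integer labels from 0 to n_classes - 1, while the target here takes the values 1, 2 and 3. As a minimal sketch of one way to remap them, `LabelEncoder` maps the sorted distinct labels onto 0, 1, 2 and can map predictions back afterwards. The label values below are the first few taken from the error message; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# First few label values from the error message above
labels = np.array([3, 3, 1, 1, 1, 3, 3, 1, 3, 1, 2])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # sorted classes: 1 -> 0, 2 -> 1, 3 -> 2
decoded = encoder.inverse_transform(encoded)   # back to the original 1..3 coding

print(encoded.tolist())
print(encoder.classes_.tolist())
```

The encoded array can then be passed as the target to a network with a 3-unit softmax output, and `inverse_transform` applied to the predicted class indices.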
