
# **Mental Health in Tech Industry**

---



## **1. Business Understanding**


In this project, I measure employers' attitudes towards mental health in tech workplaces and examine the frequency of mental health disorders among tech workers. The dataset comes from the [2014 Mental Health in Tech](https://osmihelp.org/research) Survey.
 
This project aims to answer the following questions:<br>
1. In 2014, did employers in the US tech industry recognize the importance of mental health?<br>
2. How often does mental health affect employees' work?<br>
3. What attitudes towards mental health are common in the tech industry?<br>
4. How do family history and age affect mental health?<br>
5. What does an individual's gender tell us about their mental health condition?<br>

I will also build five models that predict whether an employee will seek medical treatment for mental health issues, and measure the performance of each model.

In [0]:
# Importing library for data processing
import pandas as pd



---



## **2. Data Understanding**

In [0]:
# Reading the dataset
initial_data = pd.read_csv('/survey.csv')

In [5]:
# Look at the data
initial_data.head()

Unnamed: 0,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
0,2014-08-27 11:29:31,37,Female,United States,IL,,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,
1,2014-08-27 11:29:37,44,M,United States,IN,,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,
2,2014-08-27 11:29:44,32,Male,Canada,,,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,
3,2014-08-27 11:29:46,31,Male,United Kingdom,,,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,
4,2014-08-27 11:30:22,31,Male,United States,TX,,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,


**This dataset contains the following columns:**

- *Timestamp*

- *Age*

- *Gender*

- *Country*

- *state*: If you live in the United States, which state or territory do you live in?

- *self_employed*: Are you self-employed?

- *family_history*: Do you have a family history of mental illness?

- *treatment*: Have you sought treatment for a mental health condition?

- *work_interfere*: If you have a mental health condition, do you feel that it interferes with your work?

- *no_employees*: How many employees does your company or organization have?

- *remote_work*: Do you work remotely (outside of an office) at least 50% of the time?

- *tech_company*: Is your employer primarily a tech company/organization?

- *benefits*: Does your employer provide mental health benefits?

- *care_options*: Do you know the options for mental health care your employer provides?

- *wellness_program*: Has your employer ever discussed mental health as part of an employee wellness program?

- *seek_help*: Does your employer provide resources to learn more about mental health issues and how to seek help?

- *anonymity*: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources?

- *leave*: How easy is it for you to take medical leave for a mental health condition?

- *mental_health_consequence*: Do you think that discussing a mental health issue with your employer would have negative consequences?

- *phys_health_consequence*: Do you think that discussing a physical health issue with your employer would have negative consequences?

- *coworkers*: Would you be willing to discuss a mental health issue with your coworkers?

- *supervisor*: Would you be willing to discuss a mental health issue with your direct supervisor(s)?

- *mental_health_interview*: Would you bring up a mental health issue with a potential employer in an interview?

- *phys_health_interview*: Would you bring up a physical health issue with a potential employer in an interview?

- *mental_vs_physical*: Do you feel that your employer takes mental health as seriously as physical health?

- *obs_consequence*: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace?

- *comments*: Any additional notes or comments

In [6]:
# Shape of the data
initial_data.shape

(1259, 27)

In [7]:
initial_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1259 entries, 0 to 1258
Data columns (total 27 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Timestamp                  1259 non-null   object
 1   Age                        1259 non-null   int64 
 2   Gender                     1259 non-null   object
 3   Country                    1259 non-null   object
 4   state                      744 non-null    object
 5   self_employed              1241 non-null   object
 6   family_history             1259 non-null   object
 7   treatment                  1259 non-null   object
 8   work_interfere             995 non-null    object
 9   no_employees               1259 non-null   object
 10  remote_work                1259 non-null   object
 11  tech_company               1259 non-null   object
 12  benefits                   1259 non-null   object
 13  care_options               1259 non-null   object
 14  wellness

In [8]:
initial_data.describe()

Unnamed: 0,Age
count,1259.0
mean,79428150.0
std,2818299000.0
min,-1726.0
25%,27.0
50%,31.0
75%,36.0
max,100000000000.0
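
The summary above shows a minimum of -1726 and a maximum far beyond any human age, so the mean is dominated by a handful of bad entries. A minimal sketch of how such entries could be listed, using a small made-up frame; the `toy` data and the bounds `MIN_AGE` and `MAX_AGE` are assumptions for illustration, not values taken from the survey.

```python
import pandas as pd

# Hypothetical toy data mimicking the kinds of Age entries seen in the survey
toy = pd.DataFrame({"Age": [37, 44, -29, 329, 31, 99999999999, 5]})

# Assumed plausibility bounds for a working-age respondent (not from the source)
MIN_AGE, MAX_AGE = 18, 100

# Rows whose Age falls outside the assumed bounds (between() is inclusive)
implausible = toy[~toy["Age"].between(MIN_AGE, MAX_AGE)]
print(implausible["Age"].tolist())

# The median is far less sensitive to these outliers than the mean
print(toy["Age"].median())
```

Comparing the median with the mean is a quick way to spot columns where a few extreme values distort the summary.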




---


## **3. Data Preparation**

## 3.1 Cleaning the Dataset

### 3.1.1 Updating Age Column Values

In [0]:
# Some values in this column don't make sense as ages
initial_data['Age'].unique()

array([         37,          44,          32,          31,          33,
                35,          39,          42,          23,          29,
                36,          27,          46,          41,          34,
                30,          40,          38,          50,          24,
                18,          28,          26,          22,          19,
                25,          45,          21,         -29,          43,
                56,          60,          54,         329,          55,
       99999999999,          48,          20,          57,          58,
                47,          62,          51,          65,          49,
             -1726,           5,          53,          61,           8,
                11,          -1,          72])

In [0]:
# Removing implausible ages that cannot be corrected
values = [-1726, 329, 99999999999, -1, -29]
initial_data = initial_data[~initial_data['Age'].isin(values)]

In [0]:
# Shape after updating
initial_data.shape

(1254, 27)

In [10]:
initial_data.Country.count()

1259
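
Before dropping `Country`, it can help to look at how concentrated the column is. The following is a minimal sketch on a made-up series (`toy_country` is hypothetical and much smaller than the real column); it uses `value_counts(normalize=True)` to express each country as a share of all respondents.

```python
import pandas as pd

# Hypothetical toy column; the real survey has many more countries and rows
toy_country = pd.Series(
    ["United States", "United States", "United States", "Canada", "United Kingdom"],
    name="Country",
)

# Share of respondents per country, largest first
country_share = toy_country.value_counts(normalize=True)
print(country_share)
```

A single country holding a large share of the rows is the kind of imbalance that motivates dropping the column.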

In [0]:
# Dropping non-essential columns.
# 751 of the Country values are from the United States alone,
# so keeping the column in the dataset would create bias.
initial_data = initial_data.drop(["comments", "Timestamp",
                                  "Country", "state"], axis=1)

### 3.1.2 Dealing with Null values

In [0]:
# Checking for missing values
null_values = initial_data.isnull().sum()
null_values[null_values > 0]

self_employed      18
work_interfere    263
dtype: int64
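
Raw counts are easier to judge as a share of all rows. A small sketch on made-up data (the `toy` frame is an assumption): `isnull().mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two of its three columns
toy = pd.DataFrame({
    "self_employed": ["No", np.nan, "Yes", "No"],
    "work_interfere": ["Often", np.nan, np.nan, "Never"],
    "treatment": ["Yes", "No", "No", "Yes"],
})

# Fraction of missing entries per column, keeping only columns with gaps
missing_share = toy.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)
print(missing_share)
```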

In [0]:
# Assigning a default value for columns with missing values
defaultString = 'NaN'  # Both columns consist of string values

# List of the columns in which null values are replaced
stringFeatures = ['self_employed', 'work_interfere']

# Getting consistent NaN placeholders
for feature in stringFeatures:
    initial_data[feature] = initial_data[feature].fillna(defaultString)

In [0]:
# Most values in the self_employed column are 'No', so NaN values will be changed to 'No'
(initial_data['self_employed'] == 'No').sum()

1092

In [0]:
# Replacing the NaN placeholder in the self_employed column with 'No'
initial_data['self_employed'] = initial_data['self_employed'].replace([defaultString], 'No')

# Unique values in self_employed column after replacement
initial_data['self_employed'].unique()

array(['No', 'Yes'], dtype=object)

In [0]:
# Replacing the NaN placeholder in the work_interfere column with a separate "Don't know" category
initial_data['work_interfere'] = initial_data['work_interfere'].replace([defaultString], "Don't know")
print(initial_data['work_interfere'].unique())

['Often' 'Rarely' 'Never' 'Sometimes' "Don't know"]
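
The two-step approach above (fill with a string placeholder, then replace the placeholder) could also be written as a single `fillna` call with one fill value per column. This is a sketch on made-up data, not the notebook's actual cell; `toy` and `fill_values` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing entry per column
toy = pd.DataFrame({
    "self_employed": ["No", np.nan, "Yes"],
    "work_interfere": ["Often", np.nan, "Never"],
})

# Column-specific fill values, matching the choices made in the cells above
fill_values = {"self_employed": "No", "work_interfere": "Don't know"}
toy_filled = toy.fillna(value=fill_values)
print(toy_filled)
```

Passing a dictionary keeps the chosen default for each column in one place and avoids the intermediate `'NaN'` string.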


### 3.1.3 Making Gender Column Consistent

In [12]:
# The Gender column contains many inconsistent values
print(initial_data["Gender"].unique())

['Female' 'M' 'Male' 'male' 'female' 'm' 'Male-ish' 'maile' 'Trans-female'
 'Cis Female' 'F' 'something kinda male?' 'Cis Male' 'Woman' 'f' 'Mal'
 'Male (CIS)' 'queer/she/they' 'non-binary' 'Femake' 'woman' 'Make' 'Nah'
 'All' 'Enby' 'fluid' 'Genderqueer' 'Female ' 'Androgyne' 'Agender'
 'cis-female/femme' 'Guy (-ish) ^_^' 'male leaning androgynous' 'Male '
 'Man' 'Trans woman' 'msle' 'Neuter' 'Female (trans)' 'queer'
 'Female (cis)' 'Mail' 'cis male' 'A little about you' 'Malr' 'p' 'femail'
 'Cis Man' 'ostensibly male, unsure what that really means']


In [0]:
# Changing all values to lowercase
initial_data["Gender"] = initial_data["Gender"].str.lower()

In [0]:
# Making gender groups to classify all values (entries are lowercase because the column was lowercased above)
male_str = ["male", "m", "male-ish", "maile", "mal", "male (cis)", "make", "male ", "man", "msle", "mail", "malr", "cis man", "cis male"]
trans_str = ["trans-female", "something kinda male?", "queer/she/they", "non-binary", "nah", "all", "enby", "fluid", "genderqueer", "androgyne", "agender", "male leaning androgynous", "guy (-ish) ^_^", "trans woman", "neuter", "female (trans)", "queer", "ostensibly male, unsure what that really means"]
female_str = ["cis female", "f", "female", "woman", "femake", "female ", "cis-female/femme", "female (cis)", "femail"]

In [0]:
# Changing groups into three categories with a single value-to-category mapping
gender_map = {
    **dict.fromkeys(male_str, 'male'),
    **dict.fromkeys(female_str, 'female'),
    **dict.fromkeys(trans_str, 'trans'),
}
initial_data['Gender'] = initial_data['Gender'].replace(gender_map)

# Getting rid of unnecessary values
stk_list = ['a little about you', 'p']
initial_data = initial_data[~initial_data['Gender'].isin(stk_list)]

In [0]:
# Values after cleaning
initial_data["Gender"].value_counts()

male      988
female    247
trans      18
Name: Gender, dtype: int64
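
The lists above need separate entries such as `"male "` and `"male"` because trailing spaces survive lowercasing. A minimal sketch of one way to reduce that, assuming a much shorter, made-up mapping (`toy_gender` and `gender_map` here are illustrative, not the notebook's lists): strip whitespace before mapping, and let unmapped entries surface as missing values.

```python
import pandas as pd

# Hypothetical toy column with mixed case, trailing space, and one junk entry
toy_gender = pd.Series(["Male ", "M", "female", "F", "Enby", "p"])

# Lowercase and strip whitespace so "Male " and "male" share one key
normalized = toy_gender.str.strip().str.lower()

# Hypothetical mapping, far shorter than the lists used above
gender_map = {"male": "male", "m": "male", "female": "female", "f": "female", "enby": "trans"}

# Entries missing from the mapping become NaN and can be inspected or dropped
mapped = normalized.map(gender_map)
print(mapped.value_counts(dropna=False))
```

With `map`, anything not covered by the mapping is flagged explicitly instead of passing through unchanged.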

In [0]:
# Data after cleaning
initial_data.head()

Unnamed: 0,Age,Gender,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence
0,37,female,No,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No
1,44,male,No,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No
2,32,male,No,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No
3,31,male,No,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes
4,31,male,No,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No


### 3.1.4 Saving Clean Data

In [0]:
# Making copy of cleaned data to use for visualizations in next notebook
data_cleaned = initial_data.copy()

# Saving to csv file
data_cleaned.to_csv('clean_data.csv', index=False)
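
A quick round-trip check can confirm that the saved file reads back as expected. This sketch writes to an in-memory buffer rather than `clean_data.csv`, and `toy_clean` is a made-up stand-in for the cleaned data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the cleaned data
toy_clean = pd.DataFrame({"Age": [37, 44], "Gender": ["female", "male"]})

# Write to an in-memory buffer instead of disk so the sketch has no side effects
buffer = io.StringIO()
toy_clean.to_csv(buffer, index=False)
buffer.seek(0)

# Read it back and check that the round trip preserved the data
reloaded = pd.read_csv(buffer)
print(reloaded.equals(toy_clean))
```

Writing with `index=False` matters here: without it, the row index is saved as an extra unnamed column that reappears on reload.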

In the next notebook, I will visualize the data.



---

