# Eveline Srinivasan - 10801751

## Research question/interests

### Is there a significant age and gender disparity in the mental health of tech employees?

Many news articles and other media highlight gender disparities in metrics such as income, position, and representation. Using this dataset, we seek to determine whether the impact of such disparities is reflected in the mental health of women and other under-represented groups within the tech industry.



In [25]:
#Importing
import pandas as pd
import numpy as np

## Milestone 2
---
Importing data from file:

In [26]:
# Importing Data 
rawData = pd.read_csv('../data/raw/dataRaw.csv')

## Milestone 3
---
### Task 1: Conduct Exploratory Data Analysis (EDA) on your dataset.

Columns available in the data set along which analysis can be performed:

In [27]:
#Printing columns of Data set.
print(rawData.columns)
len(rawData.columns)


Index(['Timestamp', 'Age', 'Gender', 'Country', 'state', 'self_employed',
       'family_history', 'treatment', 'work_interfere', 'no_employees',
       'remote_work', 'tech_company', 'benefits', 'care_options',
       'wellness_program', 'seek_help', 'anonymity', 'leave',
       'mental_health_consequence', 'phys_health_consequence', 'coworkers',
       'supervisor', 'mental_health_interview', 'phys_health_interview',
       'mental_vs_physical', 'obs_consequence', 'comments'],
      dtype='object')


27

These 27 columns can be interpreted by comparing them with the questionnaire provided with the [data source](https://www.kaggle.com/datasets/osmi/mental-health-in-tech-survey):

1. Timestamp of the individual's survey
2. Age of the individual
3. Gender of the individual
4. Country of origin of the individual
5. US state of origin of the individual if applicable
6. Whether or not the individual is self-employed
7. Whether the individual has a family history of mental illness
8. Whether or not the individual has sought treatment for mental health
9. Whether the individual believes their mental health interferes with their work
10. The number of employees at the individual's company
11. Whether the individual works remotely (outside of an office) at least 50% of the time
12. Whether the individual's employer is a primarily tech company
13. Whether the individual's employer provides mental health benefits
14. Whether the individual is aware of their employer's mental health care options
15. Whether the individual's employer has ever discussed a mental health wellness program
16. Whether the individual's employer provides resources on how to seek help for mental health
17. Whether the individual is able to use company resources for mental health anonymously
18. The difficulty of taking mental health leaves at the individual's company
19. Whether the individual thinks discussing mental health with their employer will have negative consequences
20. Whether the individual thinks discussing physical health with their employer will have negative consequences
21. Whether the individual would be willing to discuss mental health with their coworkers
22. Whether the individual would be willing to discuss mental health with their direct supervisor
23. Whether the individual would bring up mental health issues during an interview with a possible employer
24. Whether the individual would bring up physical health issues during an interview with a possible employer
25. Whether the individual believes their employer takes mental health as seriously as physical health
26. Whether the individual has heard of or observed any negative consequences with mental health conditions in their workplace
27. Any additional comments

### From here the columns of primary importance are:

1. Age of the individual
1. Gender of the individual
1. Family history of mental illness

### Systemic factors of interest: 

#### Support Systems available:

1. Whether the individual's employer provides mental health benefits
1. Whether the individual's employer has ever discussed a mental health wellness program
1. Whether the individual's employer provides resources on how to seek help for mental health

#### Workplace Culture Factors:

1. Whether the individual has heard of or observed any negative consequences with mental health conditions in their workplace
1. Whether the individual thinks discussing mental health with their employer will have negative consequences
1. Whether the individual thinks discussing physical health with their employer will have negative consequences

### Outcomes of Interest:

1. Whether or not the individual has sought treatment for mental health
1. Whether the individual believes their mental health interferes with their work
1. Whether the individual would be willing to discuss mental health with their coworkers
1. Whether the individual would be willing to discuss mental health with their direct supervisor
1. Whether the individual has heard of or observed any negative consequences with mental health conditions in their workplace

We are particularly interested in these outcomes in relation to gender: we want to determine whether there are disparities between gender groups.

Since we are only conducting an exploratory analysis, we will ignore the systemic factors for now and see whether we can identify patterns of interest within our data. First, we trim our data set to only the columns of interest:

In [28]:
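# .copy() makes skimmedData an independent frame, so later edits do not trigger SettingWithCopyWarning
skimmedData = rawData[['Age', 'Gender', 'family_history', 'treatment', 'work_interfere', 'coworkers', 'supervisor', 'obs_consequence']].copy()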
skimmedData.head()

Unnamed: 0,Age,Gender,family_history,treatment,work_interfere,coworkers,supervisor,obs_consequence
0,37,Female,No,Yes,Often,Some of them,Yes,No
1,44,M,No,No,Rarely,No,No,No
2,32,Male,No,No,Rarely,Yes,Yes,No
3,31,Male,Yes,Yes,Often,Some of them,No,Yes
4,31,Male,No,No,Never,Some of them,Yes,No
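
Because the research question also concerns age, one possible next step is to group `Age` into bands before comparing outcomes. The following is a minimal sketch on a toy frame, not the analysis itself: the band edges, the labels, and the `age_band` column name are assumptions chosen for illustration.

```python
import pandas as pd

# Toy stand-in for skimmedData; the real frame is built from dataRaw.csv.
toy = pd.DataFrame({"Age": [23, 31, 37, 44, 58]})

# Hypothetical band edges; right=False makes each band include its lower edge.
edges = [18, 25, 35, 45, 120]
labels = ["18-24", "25-34", "35-44", "45+"]
toy["age_band"] = pd.cut(toy["Age"], bins=edges, labels=labels, right=False)

# sort=False keeps the bands in their natural order rather than by frequency.
band_counts = toy["age_band"].value_counts(sort=False)
print(band_counts)
```

Ages that fall outside the chosen edges become `NaN` under `pd.cut`, so the edges should be checked against the actual range of `Age` before use.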


Now that we have only the relevant columns, we notice that gender is a write-in field, so we list its distinct values to investigate further.

In [29]:
skimmedData.Gender.unique()

array(['Female', 'M', 'Male', 'male', 'female', 'm', 'Male-ish', 'maile',
       'Trans-female', 'Cis Female', 'F', 'something kinda male?',
       'Cis Male', 'Woman', 'f', 'Mal', 'Male (CIS)', 'queer/she/they',
       'non-binary', 'Femake', 'woman', 'Make', 'Nah', 'All', 'Enby',
       'fluid', 'Genderqueer', 'Female ', 'Androgyne', 'Agender',
       'cis-female/femme', 'Guy (-ish) ^_^', 'male leaning androgynous',
       'Male ', 'Man', 'Trans woman', 'msle', 'Neuter', 'Female (trans)',
       'queer', 'Female (cis)', 'Mail', 'cis male', 'A little about you',
       'Malr', 'p', 'femail', 'Cis Man',
       'ostensibly male, unsure what that really means'], dtype=object)

Respondents have written their genders in many different ways, including misspellings that need to be addressed. We will handle those during the data cleaning phase.
For now, let us investigate the relationships among the remaining columns, excluding gender.
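
As a preview of that cleaning step, here is a rough sketch of one way the write-in values could be normalised. The function name `normalize_gender`, the two spelling sets, and the three output labels are all assumptions made for illustration. The sets contain only spellings that appear in the output above, and every other entry falls into a single "Other" group, which mixes non-binary identities with unparseable answers and would need a more careful decision in the real cleaning.

```python
import pandas as pd

# Hypothetical spelling lists, built only from values seen in the unique() output.
MALE = {"m", "male", "maile", "mal", "make", "man", "msle", "mail", "malr",
        "cis male", "male (cis)", "cis man"}
FEMALE = {"f", "female", "femake", "woman", "femail", "cis female",
          "female (cis)"}

def normalize_gender(values: pd.Series) -> pd.Series:
    """Map raw write-in answers to 'Male', 'Female', or 'Other'."""
    # Strip stray whitespace and ignore case before matching.
    cleaned = values.astype(str).str.strip().str.lower()
    result = pd.Series("Other", index=values.index)
    result[cleaned.isin(MALE)] = "Male"
    result[cleaned.isin(FEMALE)] = "Female"
    return result

sample = pd.Series(["Female ", "M", "maile", "Cis Female", "Enby", "Malr"])
print(normalize_gender(sample).tolist())
```

Matching on exact, lower-cased spellings keeps the mapping auditable: any value not listed stays visible in the "Other" group instead of being guessed at.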

We start with the overall distribution of family history and of each outcome column in the dataset.

In [39]:
print(skimmedData.family_history.value_counts())
print(skimmedData.treatment.value_counts())
print(skimmedData.work_interfere.value_counts())
print(skimmedData.coworkers.value_counts())
print(skimmedData.supervisor.value_counts())



No     767
Yes    492
Name: family_history, dtype: int64
Yes    637
No     622
Name: treatment, dtype: int64
Sometimes    465
Never        213
Rarely       173
Often        144
Name: work_interfere, dtype: int64
Some of them    774
No              260
Yes             225
Name: coworkers, dtype: int64
Yes             516
No              393
Some of them    350
Name: supervisor, dtype: int64
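
The `work_interfere` counts above sum to 995, while every other column sums to 1259. This suggests that some respondents left that question blank, since `value_counts` drops missing values by default. The sketch below uses a small toy frame (not the survey data) to show how `dropna=False` makes the gap visible, and how a row-normalised `pd.crosstab` could express one column as shares within another. The variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy stand-in for skimmedData with one missing work_interfere answer.
toy = pd.DataFrame({
    "family_history": ["Yes", "Yes", "No", "No", "No", "Yes"],
    "treatment":      ["Yes", "Yes", "No", "Yes", "No", "No"],
    "work_interfere": ["Often", "Sometimes", np.nan, "Rarely", "Never", "Sometimes"],
})

# dropna=False keeps NaN as its own row instead of silently dropping it.
interfere_counts = toy["work_interfere"].value_counts(dropna=False)
print(interfere_counts)

# normalize="index" turns each row into shares that sum to 1,
# i.e. the share seeking treatment within each family_history group.
treatment_share = pd.crosstab(toy["family_history"], toy["treatment"], normalize="index")
print(treatment_share)
```

Shares are easier to compare than raw counts when the groups differ in size, as the "Yes" and "No" family history groups do here.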
