# Exploration & Preprocessing: Mental Health in Tech 2016

## 1. Data Exploration

In [1]:
import pandas as pd

# Load data
df = pd.read_csv('../data/mental-heath-in-tech-2016_20161114.csv')

# Basic information
print(f"Participants (rows): {df.shape[0]}")
print(f"Variables (columns): {df.shape[1]}")

Participants (rows): 1433
Variables (columns): 63


## 2. Data Type Analysis

In [16]:
# Display data types of all columns as DataFrame
dtype_df = pd.DataFrame({
    'Data Type': df.dtypes.values,
    'Column Name': df.columns
})

display(dtype_df)

| | Data Type | Column Name |
|---|---|---|
| 0 | int64 | Are you self-employed? |
| 1 | object | How many employees does your company or organization have? |
| 2 | float64 | Is your employer primarily a tech company/organization? |
| 3 | float64 | Is your primary role within your company related to tech/IT? |
| 4 | object | Does your employer provide mental health benefits as part of healthcare coverage? |
| 5 | object | Do you know the options for mental health care available under your employer-provided coverage? |
| 6 | object | Has your employer ever formally discussed mental health (for example, as part of a wellness campaign or other official communication)? |
| 7 | object | Does your employer offer resources to learn more about mental health concerns and options for seeking help? |
| 8 | object | Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources provided by your employer? |
| 9 | object | If a mental health issue prompted you to request a medical leave from work, asking for that leave would be: |
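
The `float64` entries in the listing are worth a second look: pandas stores a numeric column as `float64` as soon as it contains missing values, because `NaN` is a float. `Is your employer primarily a tech company/organization?` has 287 missing answers (see Section 5), which would explain its dtype. A minimal sketch on made-up data (the series name `flag` is hypothetical, and treating the survey column as a 0/1 flag is an assumption) shows the effect and the optional nullable `Int64` dtype:

```python
import numpy as np
import pandas as pd

# Toy data, not taken from the survey: a 0/1 flag with one missing answer.
flags = pd.Series([1, 0, np.nan, 1], name="flag")
print(flags.dtype)  # float64, because NaN forces a float column

# The nullable integer dtype keeps whole numbers and marks the gap as <NA>.
flags_nullable = flags.astype("Int64")
print(flags_nullable.dtype)  # Int64
```

Whether converting is worthwhile depends on the later modelling steps; some libraries expect plain NumPy dtypes.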


## 3. Outlier Analysis

In [26]:
# Age column analysis
age_col = 'What is your age?'
ages = df[age_col]

# Find suspicious values
print("\nSuspicious values (age < 18 or age > 70):")
suspicious = df[(ages < 18) | (ages > 70)][age_col]
print(suspicious.values)


Suspicious values (age < 18 or age > 70):
[17 15 74]


In [27]:
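# Remove unrealistic age values (3, 99, 323); the other flagged ages (15, 17, 74) are kept.
# .copy() can avoid a SettingWithCopyWarning when columns are reassigned in later cells.
df = df[~df['What is your age?'].isin([3, 99, 323])].copy()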
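
The cell above removes three hard-coded values. An alternative is to filter by a plausible range, so that any new implausible value is caught without editing the list. A minimal sketch on toy data, where the bounds `MIN_AGE` and `MAX_AGE` and the column name `age` are illustrative assumptions rather than choices made in this notebook:

```python
import pandas as pd

# Toy data, not the survey itself.
toy = pd.DataFrame({"age": [29, 3, 41, 99, 323, 17, 74]})

# Hypothetical bounds; adjust to whatever counts as plausible for the study.
MIN_AGE, MAX_AGE = 15, 75

# between() is inclusive on both ends by default.
plausible = toy["age"].between(MIN_AGE, MAX_AGE)
toy_clean = toy[plausible].copy()
print(toy_clean["age"].tolist())
```

Note that `between` evaluates to `False` for missing values, so rows without an age would also be dropped unless the mask is combined with `toy["age"].isna()`.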

## 4. Data Consistency

In [30]:
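# Gender: show all unique values
# (the output below appears to have been captured after the normalization cell ran, hence only 3 values)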
gender_col = 'What is your gender?'
print(f"Gender - {df[gender_col].nunique()} unique values:\n")
print(df[gender_col].value_counts())

Gender - 3 unique values:

What is your gender?
Male      1055
Female     338
Other       34
Name: count, dtype: int64


In [29]:
import numpy as np

# Normalize gender: 70 variants -> 3 categories (Male / Female / Other)
male_variants = [
    'male', 'm', 'man', 'male ', 'male.', 'malr', 'mail', 'm|', 'dude',
    'cis male', 'cis man', 'cisdude', 'male (cis)', 'sex is male',
    "i'm a man why didn't you make this a drop down question. you should of asked sex? and i would of answered yes please. seriously how much text can this take?"
]

female_variants = [
    'female', 'f', 'woman', 'female ', ' female', 'fem', 'fm',
    'female/woman', 'cis female ', 'cis female', 'cisgender female', 
    'cis-woman', 'female assigned at birth ', 'i identify as female.',
    'female (props for making this a freeform field, though)', 'afab'
]

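def normalize_gender(val):
    """Map a free-text gender answer to 'Male', 'Female', or 'Other'; keep NaN."""
    if pd.isna(val):
        return np.nan
    # Input is lowercased and stripped before lookup, so list entries that carry
    # leading/trailing spaces (e.g. 'female assigned at birth ') can never match.
    val_lower = str(val).lower().strip()
    if val_lower in male_variants:
        return 'Male'
    elif val_lower in female_variants:
        return 'Female'
    else:
        return 'Other'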

df[gender_col] = df[gender_col].apply(normalize_gender)

print("Gender normalized: 70 -> 3 categories\n")
print(df[gender_col].value_counts())

Gender normalized: 70 -> 3 categories

What is your gender?
Male      1055
Female     338
Other       34
Name: count, dtype: int64
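
Because `normalize_gender` strips the input before the lookup, a list entry that itself ends in a space (such as `'female assigned at birth '`) cannot match and the answer falls through to `'Other'`. One way around this is to clean the variant lists with the same function that cleans the input. A minimal sketch with shortened, hypothetical variant sets (`MALE`, `FEMALE`, and `normalize_gender_sketch` are illustrative names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical, shortened variant sets; one entry deliberately keeps a trailing space.
MALE = {"male", "m", "man", "cis male"}
FEMALE = {"female", "f", "woman", "female assigned at birth "}

def _clean(text):
    """Apply the same cleaning to lookup entries and to survey answers."""
    return str(text).lower().strip()

MALE_CLEAN = {_clean(v) for v in MALE}
FEMALE_CLEAN = {_clean(v) for v in FEMALE}

def normalize_gender_sketch(val):
    if pd.isna(val):
        return np.nan
    key = _clean(val)
    if key in MALE_CLEAN:
        return "Male"
    if key in FEMALE_CLEAN:
        return "Female"
    return "Other"

# Toy answers, not taken from the survey.
raw = pd.Series(["Male ", "F", "Female assigned at birth ", "nonbinary", None])
result = raw.apply(normalize_gender_sketch)
print(result.tolist())
```

Applying this to the real data would move any affected answers from `Other` to `Female`, so the counts could differ slightly from the output shown above.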


## 5. Missing Values

In [40]:
# Missing values per column
missing = df.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
missing_pct = (missing / len(df) * 100).round(1)

print(f"Columns with missing values: {len(missing)}\n")
for col in missing.index:
    print(f"{missing[col]:4} {missing_pct[col]:5.1f}% - {col}")

Columns with missing values: 34

 863  60.3% - If yes, what condition(s) have you been diagnosed with?
 775  54.2% - Have your observations of how another individual who discussed a mental health disorder made you less likely to reveal a mental health issue yourself in your current workplace?
 721  50.4% - If so, what condition(s) were you diagnosed with?
 592  41.4% - What US state or territory do you live in?
 581  40.6% - What US state or territory do you work in?
 420  29.4% - Do you know the options for mental health care available under your employer-provided coverage?
 338  23.6% - Why or why not?
 307  21.5% - Why or why not?.1
 287  20.1% - Is your employer primarily a tech company/organization?
 287  20.1% - How many employees does your company or organization have?
 287  20.1% - Would you feel comfortable discussing a mental health disorder with your coworkers?
 287  20.1% - Does your employer provide mental health benefits as part of healthcare coverage?
 287  20.1% - Has y

In [41]:
# Remove columns with >70% missing values
threshold = 0.70
missing_frac = df.isnull().mean()  # fraction (0-1) of missing values per column
cols_to_drop = missing_frac[missing_frac > threshold].index.tolist()

df = df.drop(columns=cols_to_drop)
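
To see what the threshold rule does, it can help to run it on a small frame where the missing fractions are obvious. A minimal sketch on made-up data (all column names here are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with known missing fractions per column.
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "half_missing": [np.nan, np.nan, "a", "b"],       # 50% missing
    "complete": [1, 2, 3, 4],                         # 0% missing
})

threshold = 0.70
missing_frac = toy.isnull().mean()  # fraction of missing values per column
to_drop = missing_frac[missing_frac > threshold].index.tolist()
toy_reduced = toy.drop(columns=to_drop)

print(to_drop)
print(toy_reduced.columns.tolist())
```

Only the column above the 70% threshold is removed; the half-empty column stays and would need a separate decision (imputation, an explicit "unknown" category, or removal).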

## 6. Free Text Handling

## 7. Multi-Value Columns
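
One possible way to handle columns that hold several answers in a single cell, such as the diagnosed-condition columns, is to expand them into one indicator column per answer. A minimal sketch, assuming the answers are separated by `|` (an assumption; the separator is not shown in this notebook) and using made-up values:

```python
import numpy as np
import pandas as pd

# Toy data; both the values and the '|' separator are assumptions.
conditions = pd.Series([
    "Anxiety Disorder|Mood Disorder",
    "Mood Disorder",
    np.nan,  # respondent gave no answer
])

# One 0/1 column per distinct answer; missing rows become all zeros.
indicators = conditions.str.get_dummies(sep="|")
print(indicators)
```

Rows with a missing answer end up as all zeros, which cannot be told apart from "none of these"; keeping a separate missing-value flag may be useful if that distinction matters.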

## 8. Save Data