# Notes
- Use pip freeze to generate requirements.txt
- Group data for each year (SQL select) into a dataframe

# Requirements

* Query the dataset using sqlite. Only load the final dataset into a dataframe.

* Give an overview of the respondents of the survey. What is the sample size?
* What are the sociodemographic features of the respondents? Do you see any evidence of sampling bias?
* Perform exploratory data analysis. This should include creating statistical summaries and charts, checking for correlations and other relationships between variables, as well as other EDA elements.
* In a plot, report the prevalence rate of at least three mental diseases. (https://en.wikipedia.org/wiki/Prevalence)
* Make sure to plot the confidence interval and provide its interpretation.
* Your notebook should be readable as a standalone document. In Markdown cells inform the reader of the questions you are trying to answer, and provide an interpretation of your results.
* Provide suggestions about how your analysis can be improved.

# Questions to answer (general)

- What are the main types and subtypes of data?
- What are the main metrics of location? What are their main characteristics?
- What is variability? What are the main metrics of variability and their characteristics?
- What is a confidence interval? Why do we need it? Why is it not sufficient to just report the point estimates?
- What is correlation? How do we use it to analyze data?
- What is a contingency table?

# Plan of action

- Import the data into a single, coherent dataframe (one that makes sense on inspection)
- Review the data
- Clean the data
- Perform exploratory data analysis (the main goal)

Because we are interested in trends over time, questions that are present in all survey years are the most useful.
The query below loads every answer together with its question text and survey description. Columns are aliased to lowercase snake_case names, and SurveyID is exposed as year because that is more intuitive.

In [27]:
import sqlite3
import pandas as pd
import plotly.express as px
import numpy as np
import helpers

conn = sqlite3.connect('mental_health.sqlite')

query = """
SELECT 
    s.SurveyID as year,  -- Renamed in the query itself
    s.Description as survey_description,
    a.UserID as user_id,
    a.QuestionID as question_id,
    q.QuestionText as question_text,
    a.AnswerText as answer_text
FROM Answer a
JOIN Question q ON a.QuestionID = q.QuestionID
JOIN Survey s ON a.SurveyID = s.SurveyID
"""

# Create initial dataframe
df = pd.read_sql_query(query, conn)

# Close connection
conn.close()

# Convert all column names to lowercase
df.columns = df.columns.str.lower()

df.head()

| | year | survey_description | user_id | question_id | question_text | answer_text |
|---|---|---|---|---|---|---|
| 0 | 2014 | mental health survey for 2014 | 1 | 1 | What is your age? | 37 |
| 1 | 2014 | mental health survey for 2014 | 2 | 1 | What is your age? | 44 |
| 2 | 2014 | mental health survey for 2014 | 3 | 1 | What is your age? | 32 |
| 3 | 2014 | mental health survey for 2014 | 4 | 1 | What is your age? | 31 |
| 4 | 2014 | mental health survey for 2014 | 5 | 1 | What is your age? | 31 |
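
The goal of keeping only questions that appear in every survey year can be expressed in SQL, so that only the final dataset reaches pandas. The following is a minimal sketch rather than the notebook's actual query: it builds a tiny in-memory stand-in for the `Answer` table (the rows are made up for illustration) and selects the `QuestionID` values that occur in all survey years present in that table.

```python
import sqlite3

# Hypothetical miniature of the Answer table; the column names follow the query above,
# the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER);
INSERT INTO Answer VALUES
    ('37',  2014, 1, 1),
    ('44',  2016, 2, 1),
    ('Yes', 2016, 2, 33);
""")

# A question is "common" when the number of distinct surveys it appears in
# equals the total number of distinct surveys.
common_questions_sql = """
SELECT QuestionID
FROM Answer
GROUP BY QuestionID
HAVING COUNT(DISTINCT SurveyID) = (SELECT COUNT(DISTINCT SurveyID) FROM Answer)
"""
common_ids = [row[0] for row in conn.execute(common_questions_sql)]
conn.close()

print(common_ids)  # question 1 appears in both years, question 33 only in 2016
```

In the real database the same `HAVING` clause could be used as a subquery (`WHERE a.QuestionID IN (...)`) inside the main query, assuming `SurveyID` identifies the survey year as it does above.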


In [28]:
# Number of unique respondents per year
yearly_respondents = df.groupby('year')['user_id'].nunique()

# Show results directly (no print needed in PyCharm)
yearly_respondents

year
2014    1260
2016    1433
2017     756
2018     417
2019     352
Name: user_id, dtype: int64

In [29]:
# Clean age data by removing impossible values
age_df = df[df['question_text'] == 'What is your age?']
clean_age = pd.to_numeric(age_df['answer_text'], errors='coerce')
clean_age = clean_age[
    (clean_age >= 16) & (clean_age <= 80)]  # Reasonable age range, ignoring outlier ages like 99 and -1
clean_age_stats = clean_age.describe()

clean_age_stats

count    4203.000000
mean       33.855817
std         8.068257
min        17.000000
25%        28.000000
50%        33.000000
75%        38.000000
max        74.000000
Name: answer_text, dtype: float64

In [30]:

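# Let's also look at age distribution by year.
# Note: this cell keeps ages 18-100, a different range from the 16-80 used above,
# so the 2016 maximum of 99 is retained here.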
age_by_year = age_df.copy()
age_by_year['clean_age'] = pd.to_numeric(age_by_year['answer_text'], errors='coerce')
age_by_year = age_by_year[(age_by_year['clean_age'] >= 18) & (age_by_year['clean_age'] <= 100)]
yearly_age_stats = age_by_year.groupby('year')['clean_age'].describe()

yearly_age_stats

| year | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 2014 | 1252.0 | 32.083866 | 7.289722 | 18.0 | 27.0 | 31.0 | 36.0 | 72.0 |
| 2016 | 1429.0 | 34.131561 | 8.263825 | 19.0 | 28.0 | 33.0 | 39.0 | 99.0 |
| 2017 | 754.0 | 34.988064 | 8.338051 | 18.0 | 29.0 | 34.0 | 40.0 | 67.0 |
| 2018 | 417.0 | 34.916067 | 8.047047 | 19.0 | 29.0 | 34.0 | 39.0 | 67.0 |
| 2019 | 351.0 | 35.595442 | 8.891819 | 19.0 | 29.0 | 34.0 | 41.0 | 64.0 |


Interpretation of the cleaned age statistics (ages 16-80, all years pooled):

### Sample Size and Distribution:
- Count: 4203 valid responses
- Mean: 33.86 years
- Median (50%): 33 years
- The mean and median being close suggests a relatively symmetric distribution


### Age Spread:

- Standard Deviation: 8.07 years
- IQR: 38 years (75th) - 28 years (25th) = 10 years
- Range: 17 years (min) to 74 years (max)

### Evidence of Sampling Bias:

- Age concentration: the middle 50% of respondents are between 28 and 38 years old
- Likely underrepresentation of:
    - Senior tech workers (40+ years): at most 25% of respondents are older than 38
    - Early-career professionals (< 25 years): at most 25% of respondents are younger than 28
- The small standard deviation (8.07 years) suggests limited age diversity

Let's analyze other demographics to get a fuller picture.

In [31]:
# Apply the categorization
gender_df = df[df['question_text'] == 'What is your gender?'].copy()
gender_df['category'] = gender_df['answer_text'].apply(helpers.categorize_gender)

# Create the distribution
gender_distribution = gender_df.groupby(['year', 'category']).size().unstack(fill_value=0)

# Calculate percentages
gender_distribution_pct = gender_distribution.div(gender_distribution.sum(axis=1), axis=0) * 100

# Show both counts and percentages
gender_distribution

| year | Female | Male | Other |
|---|---|---|---|
| 2014 | 247 | 991 | 22 |
| 2016 | 336 | 1057 | 40 |
| 2017 | 218 | 502 | 36 |
| 2018 | 125 | 266 | 26 |
| 2019 | 98 | 228 | 26 |


In [32]:
gender_distribution_pct

| year | Female | Male | Other |
|---|---|---|---|
| 2014 | 19.603175 | 78.650794 | 1.746032 |
| 2016 | 23.447313 | 73.76134 | 2.791347 |
| 2017 | 28.835979 | 66.402116 | 4.761905 |
| 2018 | 29.976019 | 63.788969 | 6.235012 |
| 2019 | 27.840909 | 64.772727 | 7.386364 |
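
The `helpers.categorize_gender` function used above is not shown in this notebook. As an illustration of the kind of mapping it would need to perform, here is a minimal sketch of a categorizer that maps free-text answers to the three categories in the tables. The function body and the accepted spellings are assumptions, not the helper's actual rules.

```python
def categorize_gender(answer: str) -> str:
    """Map a free-text gender answer to 'Male', 'Female', or 'Other'.

    Hypothetical stand-in for helpers.categorize_gender; the accepted
    spellings below are assumptions made for illustration.
    """
    text = str(answer).strip().lower()
    if text in {"male", "m", "man", "cis male", "cis man"}:
        return "Male"
    if text in {"female", "f", "woman", "cis female", "cis woman"}:
        return "Female"
    # Everything else, including placeholders such as "-1", falls into "Other".
    return "Other"


for raw in [" Male ", "F", "non-binary", "-1"]:
    print(repr(raw), "->", categorize_gender(raw))
```

Because the fallback is "Other", missing or placeholder answers would inflate that category under this sketch; a separate "Unknown" category may be preferable if such answers are common.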


In [33]:

# Location/Country distribution: top 10 countries across all years
location_dist = helpers.get_responses_by_question(df, 'What country do you live in?')
location_dist.sum().nlargest(10)

answer_text
United States of America    1853
United States                751
United Kingdom               482
Canada                       199
Germany                      136
Netherlands                   98
Australia                     73
France                        51
Ireland                       51
India                         50
dtype: int64
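
The output above lists the same country under two labels ("United States of America" and "United States"). One way to merge such variants before counting is to map them onto a canonical label and re-aggregate. The sketch below works on a few counts copied from the output; the `canonical` mapping is an assumption and would need to be extended after inspecting all distinct labels.

```python
import pandas as pd

# Counts copied from the top-10 output above (subset for brevity).
country_counts = pd.Series({
    "United States of America": 1853,
    "United States": 751,
    "United Kingdom": 482,
    "Canada": 199,
})

# Assumed mapping from variant spellings to one canonical label.
canonical = {"United States": "United States of America"}

# Relabel the index, then sum the rows that now share a label.
merged = (
    country_counts.rename(index=canonical)
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)
print(merged)
```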

In [34]:
# Company size distribution
company_size_dist = helpers.get_responses_by_question(df, 'How many employees does your company or organization have?')
company_size_dist

| year | -1 | 1-5 | 100-500 | 26-100 | 500-1000 | 6-25 | More than 1000 |
|---|---|---|---|---|---|---|---|
| 2014 | 0 | 162 | 176 | 289 | 61 | 290 | 282 |
| 2016 | 287 | 60 | 248 | 292 | 80 | 210 | 256 |
| 2017 | 113 | 20 | 203 | 128 | 48 | 86 | 158 |
| 2018 | 56 | 5 | 81 | 70 | 31 | 69 | 105 |
| 2019 | 48 | 7 | 80 | 45 | 27 | 34 | 111 |
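
Raw counts are hard to compare across years because the number of respondents changes so much. A minimal sketch of converting the table above into row percentages, using the same `div` pattern as the gender cell (the counts are copied from the output; the columns are reordered by company size for readability):

```python
import pandas as pd

# Counts copied from the company size table above.
counts = pd.DataFrame(
    {
        "-1": [0, 287, 113, 56, 48],
        "1-5": [162, 60, 20, 5, 7],
        "6-25": [290, 210, 86, 69, 34],
        "26-100": [289, 292, 128, 70, 45],
        "100-500": [176, 248, 203, 81, 80],
        "500-1000": [61, 80, 48, 31, 27],
        "More than 1000": [282, 256, 158, 105, 111],
    },
    index=pd.Index([2014, 2016, 2017, 2018, 2019], name="year"),
)

# Share of each size band within its survey year, in percent.
shares = counts.div(counts.sum(axis=1), axis=0) * 100
print(shares.round(1))
```

Each row of `shares` sums to 100, which makes shifts in the mix of company sizes visible independently of the falling response counts.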


# Overview of the respondents of the survey. What is the sample size? What are the sociodemographic features of the respondents? Do you see any evidence of sampling bias?

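Sample size: 4,218 respondents in total (the sum of the yearly counts; there is no 2015 survey).

- 2014: 1,260 respondents
- 2016: 1,433 respondents
- 2017: 756 respondents
- 2018: 417 respondents
- 2019: 352 respondents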

## Evidence of Several Types of Sampling Bias:

### Age Bias:

- Mean age of about 33.9 years
- Strong concentration in the 28-38 year range
- Likely underrepresentation of:
    - Senior professionals (40+)
    - Early-career professionals (<25)

### Participation Bias:

- Sharp decline in participation, from 1,433 respondents in 2016 to 352 in 2019
- Could affect the reliability of trend analysis
- More recent years might not be as representative


The country distribution shows clear geographic sampling biases.

### Strong US Dominance:

- United States total: 2,604 respondents (1,853 + 751, recorded under two different labels)
- This is roughly 62% of the 4,218 respondents


### English-Speaking Countries Bias:

Top English-speaking countries:

- USA: 2,604 respondents
- UK: 482 respondents
- Canada: 199 respondents
- Australia: 73 respondents
- Ireland: 51 respondents

### Western Europe Representation:

Moderate representation from:

- Germany: 136 respondents
- Netherlands: 98 respondents
- France: 51 respondents




### Underrepresentation:

- Only one Asian country appears in the top 10 (India: 50 respondents)
- No country in the top 10 from:
    - South America
    - Africa
    - Asia, apart from India
    - Eastern Europe

### Sampling Biases to Consider:

- Language barrier: the survey was likely conducted in English
- Distribution channels: the survey might have been distributed through US-centric networks
- Tech industry concentration: the sample reflects established tech hubs but might miss emerging markets

This geographic distribution limits our ability to make global generalizations about:

- Mental health in tech globally
- Cultural differences in mental health approaches
- Regional workplace practices


The company size distribution across years shows the following patterns.

### Overall Distribution Pattern:

- Every size band from 1-5 to more than 1000 employees is represented
- Strong representation at both ends in 2014:
    - Small companies (1-5, 6-25 employees)
    - Large enterprises (More than 1000 employees)
- Good representation of mid-sized companies (26-100, 100-500)

### Company Size Bias:

- A range of company sizes is represented
- The mix might not match the actual tech industry distribution

### Missing Data:

- "-1" values (apparently a placeholder for a missing answer) make up roughly 13-20% of responses in 2016-2019; none occur in 2014
- Could affect analysis reliability

### Changes Over Time:

- Absolute counts fall in most size bands after 2016 as participation declines
- Proportions shift rather than stay constant: the 1-5 band falls from about 13% of responses in 2014 to about 2% in 2019, while "More than 1000" rises from about 22% to about 32%


# Mental Health Conditions Prevalence Analysis

Let's analyze the prevalence of mental health conditions in the tech industry using three key questions:
1. Past history: "Have you had a mental health disorder in the past?"
2. Current status: "Do you currently have a mental health disorder?"
3. Diagnosis: "Have you ever been diagnosed with a mental health disorder?"

We'll calculate prevalence rates with 95% confidence intervals using the Wilson score interval method.
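
The implementation of `helpers.calculate_prevalence_ci` is not shown here. For reference, the following is a minimal sketch of the Wilson score interval for a proportion, written from the standard formula; the function name and signature are assumptions. The example call uses 414 positives out of 1,433, figures taken from the depression count reported later in this notebook, where the denominator is inferred from the reported percentage.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    Returns (lower, upper) as proportions between 0 and 1.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(414, 1433)
print(f"{414 / 1433:.1%} (95% CI: {low:.1%} - {high:.1%})")
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves reasonably for proportions near 0 or 1, which matters for rare conditions.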


In [35]:
questions = [
    "Have you had a mental health disorder in the past?",
    "Do you currently have a mental health disorder?",
    "Have you ever been diagnosed with a mental health disorder?"
]

# Create data for plotting
plot_data = []
for question in questions:
    prev, lower, upper = helpers.calculate_prevalence_ci(df, question)
    plot_data.append({
        'Condition': question.replace("Have you ", "").replace("?", ""),
        'Prevalence': prev,
        'CI_lower': lower,
        'CI_upper': upper
    })

# Create DataFrame for plotting
plot_df = pd.DataFrame(plot_data)

# Error bar lengths are distances from the point estimate to each CI bound
plot_df['err_upper'] = plot_df['CI_upper'] - plot_df['Prevalence']
plot_df['err_lower'] = plot_df['Prevalence'] - plot_df['CI_lower']

# Create the plot using Plotly Express (error_y / error_y_minus take column names)
fig = px.bar(plot_df,
             x='Condition',
             y='Prevalence',
             error_y='err_upper',
             error_y_minus='err_lower',
             title='Mental Health Conditions Prevalence in Tech Industry (2014-2019)')

# Update layout
fig.update_layout(
    xaxis_title="",
    yaxis_title="Prevalence (%)",
    yaxis_range=[0, 100],
    showlegend=False,
    title_x=0.5,
    xaxis_tickangle=-45,
    template='plotly_white'
)

# Add hover template
fig.update_traces(
    hovertemplate="<br>".join([
        "<b>%{x}</b>",
        "Prevalence: %{y:.1f}%",
        "95% CI: (%{customdata[0]:.1f}% - %{customdata[1]:.1f}%)",
        "<extra></extra>"
    ]),
    customdata=plot_df[['CI_lower', 'CI_upper']]
)

fig.show()

# Render the interpretation as Markdown, filling in the computed values
from IPython.display import Markdown, display

interpretation = """
### Interpretation of Mental Health Prevalence Results:

1. Past Mental Health Disorders:
   - Prevalence: {:.1f}% (95% CI: {:.1f}% - {:.1f}%)
   - Highest prevalence among the three measures

2. Current Mental Health Disorders:
   - Prevalence: {:.1f}% (95% CI: {:.1f}% - {:.1f}%)
   - Lower than past disorders, which may reflect recovery or management

3. Diagnosed Mental Health Disorders:
   - Prevalence: {:.1f}% (95% CI: {:.1f}% - {:.1f}%)
   - A high diagnosis rate may reflect good access to healthcare, but the survey alone cannot confirm this

Key Observations:
1. High Overall Prevalence: All three measures are above 40%
2. Narrow Confidence Intervals: Each interval spans less than 4 percentage points, so the estimates are precise for this sample. If the survey were repeated many times with samples drawn the same way, about 95% of such intervals would contain the true prevalence. The intervals capture sampling error only, not the sampling biases described above
3. Past vs. Current Gap: The difference between past and current prevalence is consistent with recovery or successful treatment, though the survey cannot establish the cause
4. Possible Underreporting: Some may not seek diagnosis or disclose conditions
""".format(
    plot_df.iloc[0]['Prevalence'], plot_df.iloc[0]['CI_lower'], plot_df.iloc[0]['CI_upper'],
    plot_df.iloc[1]['Prevalence'], plot_df.iloc[1]['CI_lower'], plot_df.iloc[1]['CI_upper'],
    plot_df.iloc[2]['Prevalence'], plot_df.iloc[2]['CI_lower'], plot_df.iloc[2]['CI_upper']
)

display(Markdown(interpretation))

### Interpretation of Mental Health Prevalence Results:

1. Past Mental Health Disorders:
   - Prevalence: 47.9% (95% CI: 46.1% - 49.7%)
   - Highest prevalence among the three measures

2. Current Mental Health Disorders:
   - Prevalence: 41.8% (95% CI: 40.1% - 43.6%)
   - Lower than past disorders, which may reflect recovery or management

3. Diagnosed Mental Health Disorders:
   - Prevalence: 46.1% (95% CI: 44.3% - 47.9%)
   - A high diagnosis rate may reflect good access to healthcare, but the survey alone cannot confirm this

Key Observations:
1. High Overall Prevalence: All three measures are above 40%
2. Narrow Confidence Intervals: Each interval spans less than 4 percentage points, so the estimates are precise for this sample. If the survey were repeated many times with samples drawn the same way, about 95% of such intervals would contain the true prevalence. The intervals capture sampling error only, not the sampling biases described above
3. Past vs. Current Gap: The difference between past and current prevalence is consistent with recovery or successful treatment, though the survey cannot establish the cause
4. Possible Underreporting: Some may not seek diagnosis or disclose conditions

In [36]:
# # Calculate yearly prevalence for each question
# questions = [
#     "Have you had a mental health disorder in the past?",
#     "Do you currently have a mental health disorder?",
#     "Have you ever been diagnosed with a mental health disorder?"
# ]
# 
# yearly_plot_data = []
# for question in questions:
#     yearly_plot_data.extend(helpers.calculate_yearly_prevalence_ci(df, question))
# 
# # Convert to DataFrame
# yearly_df = pd.DataFrame(yearly_plot_data)
# 
# # Create the trend plot
# fig = px.scatter(yearly_df,
#                  x='Year',
#                  y='Prevalence',
#                  color='Question',
#                  error_y=yearly_df.apply(lambda row: {'array': [row['CI_upper'] - row['Prevalence']]}, axis=1),
#                  error_y_minus=yearly_df.apply(lambda row: {'array': [row['Prevalence'] - row['CI_lower']]}, axis=1),
#                  title='Mental Health Conditions Prevalence by Survey Year')
# 
# # Add lines connecting points within each group
# for question in yearly_df['Question'].unique():
#     question_data = yearly_df[yearly_df['Question'] == question]
#     fig.add_scatter(x=question_data['Year'],
#                     y=question_data['Prevalence'],
#                     mode='lines',
#                     showlegend=False,
#                     line=dict(dash='dash'),
#                     hoverinfo='skip')
# 
# # Update layout
# fig.update_layout(
#     xaxis_title="Survey Year",
#     yaxis_title="Prevalence (%)",
#     yaxis_range=[0, max(yearly_df['CI_upper'].max() * 1.1, 100)],
#     title_x=0.5,
#     template='plotly_white',
#     hovermode='x unified',
#     xaxis=dict(
#         tickmode='array',
#         tickvals=[2014, 2016, 2017, 2018, 2019],  # Specify exact years
#         ticktext=['2014', '2016', '2017', '2018', '2019']
#     )
# )
# 
# # Add hover template
# fig.update_traces(
#     hovertemplate="<br>".join([
#         "Year: %{x}",
#         "Prevalence: %{y:.1f}%",
#         "95% CI: (%{customdata[0]:.1f}% - %{customdata[1]:.1f}%)",
#         "<extra>%{name}</extra>"
#     ]),
#     customdata=yearly_df[['CI_lower', 'CI_upper']],
#     selector=dict(mode='markers')  # Only apply to scatter points
# )
# 
# fig.show()

In [37]:
# First, let's see what conditions people report
diagnosis_responses = df[df['question_text'] == "If yes, what condition(s) have you been diagnosed with?"]

# We might need to clean and standardize the responses as they might be free text
conditions_plot_data = []

# Common conditions to look for
conditions = [
    "depression",
    "anxiety",
    "bipolar disorder",
    "ptsd",
    "adhd"
]

# Calculate prevalence for each condition
for condition in conditions:
    # Rows whose answer mentions this condition (case-insensitive substring match)
    matches = diagnosis_responses[
        diagnosis_responses['answer_text'].str.lower().str.contains(condition, na=False)
    ]

    # Count unique users who reported this condition
    condition_count = matches['user_id'].nunique()

    # Calculate prevalence and CI, treating every matching answer string as positive
    prev, lower, upper = helpers.calculate_prevalence_ci(
        df,
        "If yes, what condition(s) have you been diagnosed with?",
        positive_responses=matches['answer_text'].unique()
    )

    conditions_plot_data.append({
        'Condition': condition.upper(),
        'Prevalence': prev,
        'CI_lower': lower,
        'CI_upper': upper,
        'Count': condition_count
    })

# Convert to DataFrame
conditions_df = pd.DataFrame(conditions_plot_data)

# Error bar lengths are distances from the point estimate to each CI bound
conditions_df['err_upper'] = conditions_df['CI_upper'] - conditions_df['Prevalence']
conditions_df['err_lower'] = conditions_df['Prevalence'] - conditions_df['CI_lower']

# Create the plot (error_y / error_y_minus take column names)
fig = px.bar(conditions_df,
             x='Condition',
             y='Prevalence',
             error_y='err_upper',
             error_y_minus='err_lower',
             title='Prevalence of Specific Mental Health Conditions in Tech Industry')

# Update layout
fig.update_layout(
    xaxis_title="Condition",
    yaxis_title="Prevalence (%)",
    title_x=0.5,
    template='plotly_white',
    showlegend=False,
    xaxis_tickangle=-45
)

# Add hover template
fig.update_traces(
    hovertemplate="<br>".join([
        "<b>%{x}</b>",
        "Prevalence: %{y:.1f}%",
        "95% CI: (%{customdata[0]:.1f}% - %{customdata[1]:.1f}%)",
        "Count: %{customdata[2]}",
        "<extra></extra>"
    ]),
    customdata=conditions_df[['CI_lower', 'CI_upper', 'Count']]
)

fig.show()

# Print detailed statistics
print("\nDetailed Statistics:")
for _, row in conditions_df.iterrows():
    print(f"\n{row['Condition']}:")
    print(f"Prevalence: {row['Prevalence']:.1f}%")
    print(f"95% CI: ({row['CI_lower']:.1f}% - {row['CI_upper']:.1f}%)")
    print(f"Count: {row['Count']}")


Detailed Statistics:

DEPRESSION:
Prevalence: 28.9%
95% CI: (26.6% - 31.3%)
Count: 414

ANXIETY:
Prevalence: 24.1%
95% CI: (22.0% - 26.4%)
Count: 346

BIPOLAR DISORDER:
Prevalence: 28.8%
95% CI: (26.5% - 31.1%)
Count: 412

PTSD:
Prevalence: 0.1%
95% CI: (0.0% - 0.4%)
Count: 1

ADHD:
Prevalence: 0.1%
95% CI: (0.0% - 0.4%)
Count: 1
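
These results deserve caution. The counts for DEPRESSION (414) and BIPOLAR DISORDER (412) are almost identical, and PTSD and ADHD each match a single respondent. That pattern would be expected if the answers are broad category labels that name several conditions at once and spell out names instead of using abbreviations. The counts and percentages also imply a denominator of about 1,433, which matches the 2016 respondent count, so the question appears to have been asked in one survey year only. The sketch below illustrates the matching problem on hypothetical labels (the actual `answer_text` values should be inspected first) and shows one pattern per reporting category as a possible alternative.

```python
import pandas as pd

# Hypothetical answer labels, assumed for illustration only.
answers = pd.Series([
    "Mood Disorder (Depression, Bipolar Disorder, etc)",
    "Anxiety Disorder (Generalized, Social, Phobia, etc)",
    "Post-traumatic Stress Disorder",
    "Attention Deficit Hyperactivity Disorder",
])

# Naive keywords as used in the cell above: two keywords hit the same label,
# and the abbreviations never match the spelled-out names.
naive = {
    keyword: int(answers.str.lower().str.contains(keyword, regex=False).sum())
    for keyword in ["depression", "bipolar disorder", "ptsd", "adhd"]
}

# One regular expression per reporting category, covering abbreviations
# and spelled-out names.
patterns = {
    "Mood disorder": r"mood disorder|depress|bipolar",
    "Anxiety disorder": r"anxiety",
    "PTSD": r"ptsd|post-traumatic stress",
    "ADHD": r"adhd|attention deficit",
}
by_category = {
    name: int(answers.str.contains(pattern, case=False, regex=True).sum())
    for name, pattern in patterns.items()
}

print(naive)
print(by_category)
```

Under these assumptions, depression and bipolar disorder cannot be separated from the data and are better reported together as one mood disorder category.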
