# Notes
- Use pip freeze to generate requirements.txt
- Group data for each year (SQL select) into a dataframe

# Requirements

* Query the dataset using sqlite. Only load the final dataset into a dataframe.

* Give an overview of the respondents of the survey. What is the sample size?
* What are the sociodemographic features of the respondents? Do you see any evidence of sampling bias?
* Perform exploratory data analysis. This should include creating statistical summaries and charts, checking for correlations and other relationships between variables, as well as other EDA elements.
* In a plot, report the prevalence rate of at least three mental diseases. (https://en.wikipedia.org/wiki/Prevalence)
* Make sure to plot the confidence interval and provide its interpretation.
* Your notebook should be readable as a standalone document. In Markdown cells inform the reader of the questions you are trying to answer, and provide an interpretation of your results.
* Provide suggestions about how your analysis can be improved.

# Questions to answer (general)

- What are the main types and subtypes of data?
- What are the main metrics of location? What are their main characteristics?
- What is variability? What are the main metrics of variability and their characteristics?
- What is a confidence interval? Why do we need it? Why is it not sufficient to just report the point estimates?
- What is correlation? How do we use it to analyze data?
- What is a contingency table?
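
As a small illustration of the last question, a contingency table cross-tabulates two categorical variables and counts how often each combination occurs. The sketch below uses made-up answers (the `toy` dataframe and its column names are hypothetical, not taken from the survey) together with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical toy data: one row per respondent (not real survey answers)
toy = pd.DataFrame({
    "gender": ["female", "male", "male", "female", "male", "male"],
    "diagnosed": ["yes", "no", "yes", "yes", "no", "no"],
})

# Rows: gender, columns: diagnosed, cells: number of respondents
table = pd.crosstab(toy["gender"], toy["diagnosed"])
print(table)

# Row-normalized version: share of each answer within a gender
shares = pd.crosstab(toy["gender"], toy["diagnosed"], normalize="index")
print(shares.round(2))
```

The row-normalized table is usually the easier one to compare across groups of different sizes.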

# Plan of action

- Import data into a single dataframe, that is coherent (it makes sense looking at it)
- Review the data
- Clean the data
- Perform exploratory data analysis (the main goal)

Let's keep only the questions that are present in all years, as we are interested in trends over time.
Let's also clean the data by converting the column names to lowercase, removing spaces, and renaming `SurveyID` to `year`, which is more intuitive.
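
One way to keep only the questions asked in every survey year is to do the filtering in SQL before anything is loaded into a dataframe. The sketch below is an assumption about how this could look, not the query used in this notebook: it builds a tiny in-memory database that reuses the `Answer` table and column names, and the rows in it are made up.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for mental_health.sqlite with made-up rows
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Answer (AnswerText TEXT, SurveyID INTEGER, UserID INTEGER, QuestionID INTEGER);
INSERT INTO Answer VALUES
    ('37', 2014, 1, 1), ('yes', 2014, 1, 2),
    ('44', 2016, 2, 1), ('no', 2016, 2, 3);
""")

# Keep a question only if it appears in as many surveys as exist in total
query = """
SELECT a.SurveyID AS year, a.UserID AS user_id,
       a.QuestionID AS question_id, a.AnswerText AS answer_text
FROM Answer a
WHERE a.QuestionID IN (
    SELECT QuestionID
    FROM Answer
    GROUP BY QuestionID
    HAVING COUNT(DISTINCT SurveyID) = (SELECT COUNT(DISTINCT SurveyID) FROM Answer)
)
"""
common = pd.read_sql_query(query, conn)
conn.close()
print(common)
```

In the toy data only question 1 is asked in both years, so only its two answers survive the filter.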

In [182]:
import sqlite3
import plotly.express as px
import numpy as np
import helpers
import pandas as pd
from scipy import stats

conn = sqlite3.connect('mental_health.sqlite')

query = """
SELECT 
    s.SurveyID as year,
    s.Description as survey_description,
    a.UserID as user_id,
    a.QuestionID as question_id,
    q.QuestionText as question_text,
    a.AnswerText as answer_text
FROM Answer a
JOIN Question q ON a.QuestionID = q.QuestionID
JOIN Survey s ON a.SurveyID = s.SurveyID
"""

df = pd.read_sql_query(query, conn)

conn.close()

df.columns = df.columns.str.lower()

df['answer_text'] = df['answer_text'].str.lower()

df.head()

Unnamed: 0,year,survey_description,user_id,question_id,question_text,answer_text
0,2014,mental health survey for 2014,1,1,What is your age?,37
1,2014,mental health survey for 2014,2,1,What is your age?,44
2,2014,mental health survey for 2014,3,1,What is your age?,32
3,2014,mental health survey for 2014,4,1,What is your age?,31
4,2014,mental health survey for 2014,5,1,What is your age?,31


In [183]:
yearly_respondents = df.groupby('year')['user_id'].nunique()

yearly_respondents

year
2014    1260
2016    1433
2017     756
2018     417
2019     352
Name: user_id, dtype: int64

In [184]:
# Clean age data by removing impossible values
age_df = df[df['question_text'] == 'What is your age?']
clean_age = pd.to_numeric(age_df['answer_text'], errors='coerce')
# Reasonable age range, ignoring outlier ages like 99 and -1
# TODO: do I really need to clean this?
clean_age = clean_age[(clean_age >= 16) & (clean_age <= 80)]
clean_age_stats = clean_age.describe()

clean_age_stats

count    4203.000000
mean       33.855817
std         8.068257
min        17.000000
25%        28.000000
50%        33.000000
75%        38.000000
max        74.000000
Name: answer_text, dtype: float64

Let's also look at age distribution by year

In [185]:
age_by_year = age_df.copy()
age_by_year['clean_age'] = pd.to_numeric(age_by_year['answer_text'], errors='coerce')
# Note: this cell keeps ages 18-100, while the previous cell keeps ages 16-80
age_by_year = age_by_year[(age_by_year['clean_age'] >= 18) & (age_by_year['clean_age'] <= 100)]
yearly_age_stats = age_by_year.groupby('year')['clean_age'].describe()

# Recreate the yearly age statistics in a format better suited for plotting
yearly_stats_df = yearly_age_stats.reset_index()

# Create traces for different statistics
fig = px.line(yearly_stats_df,
              x='year',
              y=['mean', '25%', '50%', '75%'],
              title='Age Distribution Statistics Over Years',
              labels={
                  'value': 'Age',
                  'year': 'Year',
                  'variable': 'Statistic'
              },
              markers=True)

# Update line names to be more readable
fig.update_traces(
    name='Mean Age',
    selector=dict(name='mean')
)
fig.update_traces(
    name='25th Percentile',
    selector=dict(name='25%')
)
fig.update_traces(
    name='Median (50th)',
    selector=dict(name='50%')
)
fig.update_traces(
    name='75th Percentile',
    selector=dict(name='75%')
)

# Customize layout
fig.update_layout(
    xaxis=dict(
        tickmode='array',
        ticktext=yearly_stats_df['year'].astype(int).astype(str),
        tickvals=yearly_stats_df['year']
    ),
    yaxis_title='Age',
    hovermode='x unified',
    legend_title='Age Metrics',
    template='plotly_white'
)

# Add hover template
fig.update_traces(
    hovertemplate="<br>".join([
        "Year: %{x}",
        "%{name}: %{y:.1f} years",
        "<extra></extra>"
    ])
)

# Add count information as text annotations
annotations = []
for idx, row in yearly_stats_df.iterrows():
    annotations.append(
        dict(
            x=row['year'],
            y=row['mean'],
            text=f"n={row['count']:.0f}",
            showarrow=False,
            yshift=20,
            font=dict(size=10)
        )
    )
fig.update_layout(annotations=annotations)

# Show statistics
print("\nYearly Age Statistics:")
print(yearly_stats_df.round(2).to_string(index=False))

# Show the plot
fig.show()


Yearly Age Statistics:
 year  count  mean  std  min  25%  50%  75%  max
 2014 1252.0 32.08 7.29 18.0 27.0 31.0 36.0 72.0
 2016 1429.0 34.13 8.26 19.0 28.0 33.0 39.0 99.0
 2017  754.0 34.99 8.34 18.0 29.0 34.0 40.0 67.0
 2018  417.0 34.92 8.05 19.0 29.0 34.0 39.0 67.0
 2019  351.0 35.60 8.89 19.0 29.0 34.0 41.0 64.0


### Sample Size and Distribution:
- Count: 4203 valid responses
- Mean: 33.86 years
- Median (50%): 33 years
- The mean and median being close suggests a relatively symmetric distribution


### Age Spread:

- Standard Deviation: 8.07 years
- IQR: 38 years (75th) - 28 years (25th) = 10 years
- Range: 17 years (min) to 74 years (max)
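
To judge whether a hard-coded age window is reasonable, the quartiles above can be turned into Tukey's fences (1.5 times the IQR beyond each quartile). This is only a rule-of-thumb sketch, and the helper name `tukey_fences` is made up for illustration.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return (lower, upper) outlier fences from the first and third quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported in the summary statistics above
lower, upper = tukey_fences(28, 38)
print(f"IQR fences: {lower:.0f} to {upper:.0f} years")
```

With these quartiles the fences are 13 and 53 years, so the rule would flag plausible ages above 53. For age data a domain-based window may therefore be the better choice.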

### Evidence of Sampling Bias:

- Age concentration: 50% of respondents are between 28 and 38 years old
- Likely underrepresentation of:
    - Senior tech workers (40+ years): the 75th percentile is 38, so at most about a quarter of respondents are 40 or older
    - Early career professionals (< 25 years): the 25th percentile is already 28
- The small standard deviation (8.07 years) suggests limited age diversity

Let's analyze other demographics to get a fuller picture.

In [186]:
# Apply the categorization
gender_df = df[df['question_text'] == 'What is your gender?'].copy()
gender_df['category'] = gender_df['answer_text'].apply(helpers.categorize_gender)

# Create the distribution
gender_distribution = gender_df.groupby(['year', 'category']).size().unstack(fill_value=0)

# Calculate percentages
gender_distribution_pct = gender_distribution.div(gender_distribution.sum(axis=1), axis=0) * 100

# Show both counts and percentages
gender_distribution

year,Female,Male,Other
2014,247,991,22
2016,336,1057,40
2017,218,502,36
2018,125,266,26
2019,98,228,26


In [187]:
gender_distribution_pct

year,Female,Male,Other
2014,19.603175,78.650794,1.746032
2016,23.447313,73.76134,2.791347
2017,28.835979,66.402116,4.761905
2018,29.976019,63.788969,6.235012
2019,27.840909,64.772727,7.386364
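
`helpers.categorize_gender` is not shown in this notebook. A minimal sketch of what such a function might look like is given below; the name `categorize_gender_sketch` and the label sets are assumptions, and the real helper may group answers differently.

```python
import pandas as pd

# Assumed label sets (hypothetical; extend after inspecting the raw answers)
FEMALE_LABELS = {"female", "f", "woman"}
MALE_LABELS = {"male", "m", "man"}


def categorize_gender_sketch(answer):
    """Map a free-text gender answer to 'Female', 'Male' or 'Other'."""
    if not isinstance(answer, str):
        return "Other"
    cleaned = answer.strip().lower()
    if cleaned in FEMALE_LABELS:
        return "Female"
    if cleaned in MALE_LABELS:
        return "Male"
    return "Other"


answers = pd.Series(["male", " Female ", "m", "non-binary", "-1"])
print(answers.apply(categorize_gender_sketch).value_counts())
```

Exact matching after trimming is deliberately conservative: anything not recognized, including missing-value codes such as `-1`, falls into `Other` instead of being guessed.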


In [188]:

# Location/Country distribution
location_dist = helpers.get_responses_by_question(df, 'What country do you live in?')
# Top 10 countries by number of responses across all years
location_dist.sum().nlargest(10)

answer_text
united states of america    1853
united states                751
united kingdom               482
canada                       199
germany                      136
netherlands                   98
australia                     73
france                        51
ireland                       51
india                         50
dtype: int64

In [189]:
# Company size distribution
company_size_dist = helpers.get_responses_by_question(df, 'How many employees does your company or organization have?')
company_size_dist

year,-1,1-5,100-500,26-100,500-1000,6-25,more than 1000
2014,0,162,176,289,61,290,282
2016,287,60,248,292,80,210,256
2017,113,20,203,128,48,86,158
2018,56,5,81,70,31,69,105
2019,48,7,80,45,27,34,111


# Overview of the respondents of the survey. What is the sample size? What are the sociodemographic features of the respondents? Do we see any evidence of sampling bias?

Sample size by survey year (4,218 respondents in total):

- 2014: 1,260 respondents
- 2016: 1,433 respondents
- 2017: 756 respondents
- 2018: 417 respondents
- 2019: 352 respondents

## Evidence of Several Types of Sampling Bias:

### Age Bias:

- Mean age of about 33.9 years
- Strong concentration in the 28-38 year range
- Likely underrepresentation of:
    - Senior professionals (40+)
    - Early career professionals (<25)

### Participation Bias:

- Sharp decline in participation from 2016 (1,433 respondents) to 2019 (352 respondents)
- This could affect the reliability of trend analysis
- Estimates for more recent years are less precise, and those samples might not be as representative
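
The size of the decline can be quantified directly from the yearly respondent counts reported earlier. The snippet below is a small sketch that hard-codes those counts instead of recomputing them from `df`.

```python
import pandas as pd

# Unique respondents per survey year, as reported earlier in this notebook
respondents = pd.Series({2014: 1260, 2016: 1433, 2017: 756, 2018: 417, 2019: 352})

# Relative change versus the previous survey, in percent
change_pct = respondents.pct_change() * 100
print(change_pct.round(1))

# Overall drop from the 2016 peak to 2019
drop_from_peak = (1 - respondents.loc[2019] / respondents.loc[2016]) * 100
print(f"Drop from 2016 to 2019: {drop_from_peak:.1f}%")
```

Note that there was no 2015 survey, so the first change spans two years.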


Looking at the country distribution, we can identify clear geographic sampling biases:

### Strong US Dominance:

- United States total: 2,604 respondents (1,853 + 751, the same country under two different labels)
- This represents approximately 62% of all 4,218 respondents
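
Because the United States appears under two labels, the labels have to be merged before shares are computed. The sketch below does this for a few of the counts shown in the output above; the `aliases` mapping is an assumption and would need extending if other spellings occur in the data.

```python
import pandas as pd

# Counts copied from the country output above (first four entries only)
country_counts = pd.Series({
    "united states of america": 1853,
    "united states": 751,
    "united kingdom": 482,
    "canada": 199,
})

# Assumed alias map: alternative spelling -> canonical label
aliases = {"united states": "united states of america"}

# Relabel, then add up the rows that now share a label
merged = country_counts.rename(index=aliases).groupby(level=0).sum()

total_respondents = 1260 + 1433 + 756 + 417 + 352  # sum of the yearly counts
us_share = merged["united states of america"] / total_respondents * 100
print(merged.sort_values(ascending=False))
print(f"US share: {us_share:.1f}%")
```

Doing the same relabeling on `answer_text` before grouping would keep the top-10 list free of duplicate countries.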


### English-Speaking Countries Bias:

Top English-speaking countries:

- USA: 2,604 respondents
- UK: 482 respondents
- Canada: 199 respondents
- Australia: 73 respondents
- Ireland: 51 respondents

### Western Europe Representation:

Moderate representation from:

- Germany: 136 respondents
- Netherlands: 98 respondents
- France: 51 respondents

### Underrepresentation:

- Only one Asian country in the top 10 (India: 50 respondents)
- Little or no representation in the top 10 from:
    - South America
    - Africa
    - Most of Asia
    - Eastern Europe

### Sampling Biases to Consider:

- Language barrier: the survey was likely conducted in English
- Distribution channels: the survey might have been distributed through US-centric networks
- Tech industry concentration: the sample reflects established tech hubs but might miss emerging markets

This geographic distribution limits our ability to make global generalizations about:

- Mental health in tech globally
- Cultural differences in mental health approaches
- Regional workplace practices


Looking at the company size distribution across years, let's analyze the patterns:

### Overall Distribution Pattern:

- Broad representation from small to large companies
- Strong representation from both ends:
    - Small companies (1-5, 6-25 employees)
    - Large enterprises (more than 1000 employees)
- Good representation of mid-sized companies (26-100, 100-500)

### Company Size Bias:

- Good representation of different company sizes
- But the mix might not match the actual tech industry distribution

### Missing Data:

- Significant "-1" values in 2016-2019 (none in 2014)
- This could affect analysis reliability

### Changes Over Time:

- Counts decline in nearly every company size category after 2016
- Proportions are not constant: the share of 1-5 employee companies falls from about 13% (162 of 1,260) in 2014 to about 2% (7 of 352) in 2019, while the share of companies with more than 1000 employees rises from about 22% to about 32%
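
How the company size mix changes over time can be checked by dividing each year's counts by that year's total. The sketch below re-enters the counts from the table above by hand instead of reusing `company_size_dist`, so it runs on its own.

```python
import pandas as pd

# Counts copied from the company size table above (one row per survey year)
sizes = pd.DataFrame(
    {
        "-1": [0, 287, 113, 56, 48],
        "1-5": [162, 60, 20, 5, 7],
        "6-25": [290, 210, 86, 69, 34],
        "26-100": [289, 292, 128, 70, 45],
        "100-500": [176, 248, 203, 81, 80],
        "500-1000": [61, 80, 48, 31, 27],
        "more than 1000": [282, 256, 158, 105, 111],
    },
    index=pd.Index([2014, 2016, 2017, 2018, 2019], name="year"),
)

# Divide each row by its total to get the share of each size category per year
shares_pct = sizes.div(sizes.sum(axis=1), axis=0) * 100
print(shares_pct.round(1))
```

Comparing shares instead of raw counts separates a real shift in the respondent mix from the overall drop in participation.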


# Mental Health Conditions Prevalence Analysis

Let's analyze the prevalence of mental health conditions in the tech industry using three key questions:
1. Past history: "Have you had a mental health disorder in the past?"
2. Current status: "Do you currently have a mental health disorder?"
3. Diagnosis: "Have you ever been diagnosed with a mental health disorder?"

We'll calculate prevalence rates with 95% confidence intervals using the Wilson score interval method.
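
`helpers.calculate_prevalence_ci` is defined outside this notebook. As a hedged sketch of the Wilson score interval mentioned above, the function below (the name `wilson_interval` is an assumption, not the helper's actual code) computes the bounds for `k` positive answers out of `n`.

```python
import math


def wilson_interval(k, n, z=1.96):
    """Wilson score interval for the proportion k/n; returns (lower, upper) as proportions."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = k / n
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denominator
    margin = z * math.sqrt((p * (1 - p) + z**2 / (4 * n)) / n) / denominator
    # Clamp to [0, 1] to guard against tiny floating-point overshoots
    return max(0.0, center - margin), min(1.0, center + margin)


# Example: 345 positive answers among 4218 respondents
lower, upper = wilson_interval(345, 4218)
print(f"{345 / 4218:.1%} (95% CI: {lower:.1%} - {upper:.1%})")
```

Unlike the simple normal approximation, this interval stays inside the 0-100% range and still gives a non-zero upper bound when no positive answers are observed.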


In [190]:
questions = [
    "Have you had a mental health disorder in the past?",
    "Do you currently have a mental health disorder?",
    "Have you ever been diagnosed with a mental health disorder?"
]

# Create data for plotting
plot_data = []
for question in questions:
    prev, lower, upper = helpers.calculate_prevalence_ci(df, question)
    plot_data.append({
        'Condition': question.replace("Have you ", "").replace("?", ""),
        'Prevalence': prev,
        'CI_lower': lower,
        'CI_upper': upper
    })

# Create DataFrame for plotting
plot_df = pd.DataFrame(plot_data)

# Create the plot using Plotly Express
fig = px.bar(plot_df,
             x='Condition',
             y='Prevalence',
             # Error bars take array-likes of bar-to-bound distances, not dicts
             error_y=plot_df['CI_upper'] - plot_df['Prevalence'],
             error_y_minus=plot_df['Prevalence'] - plot_df['CI_lower'],
             title='Mental Health Conditions Prevalence in Tech Industry (2014-2019)')

# Update layout
fig.update_layout(
    xaxis_title="",
    yaxis_title="Prevalence (%)",
    yaxis_range=[0, 100],
    showlegend=False,
    title_x=0.5,
    xaxis_tickangle=-45,
    template='plotly_white'
)

# Add hover template
fig.update_traces(
    hovertemplate="<br>".join([
        "<b>%{x}</b>",
        "Prevalence: %{y:.1f}%",
        "95% CI: (%{customdata[0]:.1f}% - %{customdata[1]:.1f}%)",
        "<extra></extra>"
    ]),
    customdata=plot_df[['CI_lower', 'CI_upper']]
)

fig.show()

# Create a markdown cell for interpretation
"""
### Interpretation of Mental Health Prevalence Results:

1. Past Mental Health Disorders:
   - Prevalence: {:.1f}% (95% CI: {:.1f}% - {:.1f}%)
   - Highest prevalence among the three measures

2. Current Mental Health Disorders:
   - Prevalence: {:.1f}% (95% CI: {:.1f}% - {:.1f}%)
   - Lower than past disorders, suggesting possible recovery or management

3. Diagnosed Mental Health Disorders:
   - Prevalence: {:.1f}% (95% CI: {:.1f}% - {:.1f}%)
   - High diagnosis rate indicates good healthcare access

Key Observations:
1. High Overall Prevalence: All measures show rates >35%
2. Narrow Confidence Intervals: Indicates precise estimates
3. Treatment Gap: Difference between past and current prevalence suggests successful interventions
4. Possible Underreporting: Some may not seek diagnosis or disclose conditions
""".format(
    plot_df.iloc[0]['Prevalence'], plot_df.iloc[0]['CI_lower'], plot_df.iloc[0]['CI_upper'],
    plot_df.iloc[1]['Prevalence'], plot_df.iloc[1]['CI_lower'], plot_df.iloc[1]['CI_upper'],
    plot_df.iloc[2]['Prevalence'], plot_df.iloc[2]['CI_lower'], plot_df.iloc[2]['CI_upper']
)

'\n### Interpretation of Mental Health Prevalence Results:\n\n1. Past Mental Health Disorders:\n   - Prevalence: 0.0% (95% CI: 0.0% - 0.1%)\n   - Highest prevalence among the three measures\n\n2. Current Mental Health Disorders:\n   - Prevalence: 0.0% (95% CI: 0.0% - 0.1%)\n   - Lower than past disorders, suggesting possible recovery or management\n\n3. Diagnosed Mental Health Disorders:\n   - Prevalence: 0.0% (95% CI: 0.0% - 0.1%)\n   - High diagnosis rate indicates good healthcare access\n\nKey Observations:\n1. High Overall Prevalence: All measures show rates >35%\n2. Narrow Confidence Intervals: Indicates precise estimates\n3. Treatment Gap: Difference between past and current prevalence suggests successful interventions\n4. Possible Underreporting: Some may not seek diagnosis or disclose conditions\n'

In [194]:
import plotly.express as px
import pandas as pd
import numpy as np

# Focus on the mental health condition responses
diagnosis_responses = df[df['question_text'] == "If yes, what condition(s) have you been diagnosed with?"]

# Common conditions to look for (keeping just three major ones for clarity)
conditions = [
    "Anxiety Disorder",
    "Mood Disorder",
    "Attention Deficit Hyperactivity Disorder"
]

conditions_plot_data = []
total_respondents = df['user_id'].nunique()

for condition in conditions:
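    # Count unique users who reported this condition.
    # Caveat: answer_text was lowercased when the data was loaded, so this
    # case-sensitive match on a capitalized name finds no rows (count = 0).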
    condition_count = diagnosis_responses[
        diagnosis_responses['answer_text'].str.contains(condition, na=False)
    ]['user_id'].nunique()

    # Calculate prevalence
    prevalence = (condition_count / total_respondents) * 100

    # Calculate confidence intervals
    z = 1.96  # 95% confidence level
    n = total_respondents
    p = prevalence / 100

    ci_lower = ((p + z * z / (2 * n) - z * np.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)) * 100
    ci_upper = ((p + z * z / (2 * n) + z * np.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)) * 100

    conditions_plot_data.append({
        'Condition': condition,
        'Prevalence': prevalence,
        'CI_lower': ci_lower,
        'CI_upper': ci_upper,
        'Count': condition_count
    })

# Convert to DataFrame
conditions_df = pd.DataFrame(conditions_plot_data)

# Create the plot
fig = px.bar(conditions_df,
             x='Condition',
             y='Prevalence',
             # Error bars take array-likes of bar-to-bound distances, not dicts
             error_y=conditions_df['CI_upper'] - conditions_df['Prevalence'],
             error_y_minus=conditions_df['Prevalence'] - conditions_df['CI_lower'],
             title='Prevalence of Mental Health Conditions in Tech Industry (2014-2019)')

# Update layout
fig.update_layout(
    xaxis_title="Condition",
    yaxis_title="Prevalence (%)",
    title_x=0.5,
    template='plotly_white',
    showlegend=False,
    xaxis_tickangle=-45
)

# Add hover template
fig.update_traces(
    hovertemplate="<br>".join([
        "<b>%{x}</b>",
        "Prevalence: %{y:.1f}%",
        "95% CI: (%{customdata[0]:.1f}% - %{customdata[1]:.1f}%)",
        "Count: %{customdata[2]}",
        "<extra></extra>"
    ]),
    customdata=conditions_df[['CI_lower', 'CI_upper', 'Count']]
)

# Show plot
fig.show()

# Print detailed statistics
print("\nDetailed Statistics:")
for _, row in conditions_df.iterrows():
    print(f"\n{row['Condition']}:")
    print(f"Prevalence: {row['Prevalence']:.1f}%")
    print(f"95% CI: ({row['CI_lower']:.1f}% - {row['CI_upper']:.1f}%)")
    print(f"Count: {row['Count']}")


Detailed Statistics:

Anxiety Disorder:
Count: 0
Prevalence: 0.0%
95% CI: (0.0% - 0.3%)

Mood Disorder:
Count: 0
Prevalence: 0.0%
95% CI: (0.0% - 0.3%)

Attention Deficit Hyperactivity Disorder:
Count: 0
Prevalence: 0.0%
95% CI: (0.0% - 0.3%)


In [192]:
import plotly.express as px
import pandas as pd
import numpy as np

# Focus on the mental health condition responses
diagnosis_responses = df[df['question_text'] == "If yes, what condition(s) have you been diagnosed with?"]

# Common conditions to look for (keeping just three major ones for clarity)
conditions = [
    "Anxiety Disorder",
    "Mood Disorder",
    "Attention Deficit Hyperactivity Disorder"
]

conditions_plot_data = []
total_respondents = df['user_id'].nunique()

for condition in conditions:
    # Count unique users who reported this condition
    condition_count = diagnosis_responses[
        diagnosis_responses['answer_text'].str.contains(condition, na=False, case=False)
    ]['user_id'].nunique()

    # Calculate prevalence
    prevalence = (condition_count / total_respondents) * 100

    # Calculate confidence intervals
    z = 1.96  # 95% confidence level
    n = total_respondents
    p = prevalence / 100

    ci_lower = ((p + z * z / (2 * n) - z * np.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)) * 100
    ci_upper = ((p + z * z / (2 * n) + z * np.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)) * 100

    conditions_plot_data.append({
        'Condition': condition,
        'Prevalence': prevalence,
        'CI_lower': ci_lower,
        'CI_upper': ci_upper,
        'Count': condition_count
    })

# Convert to DataFrame
conditions_df = pd.DataFrame(conditions_plot_data)

# Create the plot
fig = px.bar(conditions_df,
             x='Condition',
             y='Prevalence',
             # Error bars take array-likes of bar-to-bound distances, not dicts
             error_y=conditions_df['CI_upper'] - conditions_df['Prevalence'],
             error_y_minus=conditions_df['Prevalence'] - conditions_df['CI_lower'],
             title='Prevalence of Mental Health Conditions in Tech Industry (2014-2019)')

# Update layout
fig.update_layout(
    xaxis_title="Condition",
    yaxis_title="Prevalence (%)",
    title_x=0.5,
    template='plotly_white',
    showlegend=False,
    xaxis_tickangle=-45
)

# Add hover template
fig.update_traces(
    hovertemplate="<br>".join([
        "<b>%{x}</b>",
        "Prevalence: %{y:.1f}%",
        "95% CI: (%{customdata[0]:.1f}% - %{customdata[1]:.1f}%)",
        "Count: %{customdata[2]}",
        "<extra></extra>"
    ]),
    customdata=conditions_df[['CI_lower', 'CI_upper', 'Count']]
)

# Show plot
fig.show()

# Print detailed statistics
print("\nDetailed Statistics:")
for _, row in conditions_df.iterrows():
    print(f"\n{row['Condition']}:")
    print(f"Prevalence: {row['Prevalence']:.1f}%")
    print(f"95% CI: ({row['CI_lower']:.1f}% - {row['CI_upper']:.1f}%)")
    print(f"Count: {row['Count']}")


Detailed Statistics:

Anxiety Disorder:
Prevalence: 8.2%
95% CI: (7.4% - 9.0%)
Count: 345

Mood Disorder:
Prevalence: 9.8%
95% CI: (8.9% - 10.7%)
Count: 412

Attention Deficit Hyperactivity Disorder:
Prevalence: 2.9%
95% CI: (2.4% - 3.4%)
Count: 121


In [None]:
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import numpy as np
from scipy import stats


def create_age_distribution_plot(df):
    """Create an age histogram with a marginal violin plot"""
    age_df = df[df['question_text'] == 'What is your age?'].copy()
    age_df['clean_age'] = pd.to_numeric(age_df['answer_text'], errors='coerce')
    age_df = age_df[age_df['clean_age'].between(16, 80)]  # Filter reasonable ages

    # Create the histogram
    fig = px.histogram(
        age_df, x='clean_age',
        title='Age Distribution in Tech Industry',
        labels={'clean_age': 'Age', 'count': 'Frequency'},
        marginal='violin',  # Add violin plot on the margin
        nbins=30,
        color_discrete_sequence=['#636EFA']
    )

    # Update layout
    fig.update_layout(
        title_x=0.5,
        bargap=0.1,
        showlegend=False,
        template='plotly_white'
    )

    return fig


def create_time_series_analysis(df):
    """Create time series analysis of age statistics by year"""
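    # Prepare yearly age statistics
    age_by_year = df[df['question_text'] == 'What is your age?'].copy()
    age_by_year['clean_age'] = pd.to_numeric(age_by_year['answer_text'], errors='coerce')
    age_by_year = age_by_year[age_by_year['clean_age'].between(16, 80)]  # Same filter as the age distribution plot
    yearly_stats = age_by_year.groupby('year')['clean_age'].agg(['mean', 'std', 'count']).reset_index()

    # 95% confidence interval for each yearly mean (normal approximation).
    # The standard error uses that year's sample size, not the length of the whole dataframe.
    standard_error = yearly_stats['std'] / np.sqrt(yearly_stats['count'])
    yearly_stats['ci_upper'] = yearly_stats['mean'] + 1.96 * standard_error
    yearly_stats['ci_lower'] = yearly_stats['mean'] - 1.96 * standard_error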

    # Create the plot
    fig = px.line(
        yearly_stats, x='year', y='mean',
        title='Average Age Trend Over Years',
        labels={'mean': 'Average Age', 'year': 'Year'},
        error_y=yearly_stats['ci_upper'] - yearly_stats['mean'],
        error_y_minus=yearly_stats['mean'] - yearly_stats['ci_lower']
    )

    # Update layout
    fig.update_layout(
        title_x=0.5,
        template='plotly_white',
        showlegend=False
    )

    return fig


def create_gender_distribution_plot(df):
    """Create gender distribution plot over years"""
    gender_df = df[df['question_text'] == 'What is your gender?'].copy()
    # Group the free-text answers with the same helper used earlier in the notebook
    gender_df['category'] = gender_df['answer_text'].apply(helpers.categorize_gender)
    gender_counts = gender_df.groupby(['year', 'category']).size().reset_index(name='count')

    fig = px.bar(
        gender_counts,
        x='year',
        y='count',
        color='category',
        title='Gender Distribution Over Years',
        labels={'count': 'Number of Respondents', 'category': 'Gender'},
        barmode='group'
    )

    # Update layout
    fig.update_layout(
        title_x=0.5,
        template='plotly_white',
        xaxis_tickmode='linear'
    )

    return fig


def create_mental_health_correlation_heatmap(df):
    """Create correlation heatmap for mental health questions"""
    # List of mental health related questions
    mh_questions = [
        "Have you had a mental health disorder in the past?",
        "Do you currently have a mental health disorder?",
        "Have you ever been diagnosed with a mental health disorder?"
    ]

    # Create pivot table
    pivot_df = df[df['question_text'].isin(mh_questions)].pivot_table(
        index='user_id',
        columns='question_text',
        values='answer_text',
        aggfunc='first'
    )

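    # Calculate correlations on a yes/no encoding: 1 = yes, 0 = no.
    # Other answers (for example "-1") become NaN and are ignored pairwise.
    # Factorizing instead would assign arbitrary integer codes, which makes the coefficients hard to interpret.
    corr_matrix = pivot_df.apply(lambda col: col.map({'yes': 1, 'no': 0})).corr()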

    # Create heatmap
    fig = px.imshow(
        corr_matrix,
        title='Correlation Between Mental Health Questions',
        labels=dict(color="Correlation Coefficient"),
        color_continuous_scale='RdBu',
        aspect='auto'
    )

    # Update layout
    fig.update_layout(
        title_x=0.5,
        template='plotly_white'
    )

    return fig


def create_company_size_trend(df):
    """Create company size distribution trend"""
    company_df = df[df['question_text'] == 'How many employees does your company or organization have?'].copy()
    size_counts = company_df.groupby(['year', 'answer_text']).size().reset_index(name='count')

    fig = px.bar(
        size_counts,
        x='year',
        y='count',
        color='answer_text',
        title='Company Size Distribution Over Years',
        labels={'count': 'Number of Respondents', 'answer_text': 'Company Size'},
        barmode='stack'
    )

    # Update layout
    fig.update_layout(
        title_x=0.5,
        template='plotly_white',
        xaxis_tickmode='linear'
    )

    return fig


def create_mental_health_prevalence_by_company_size(df):
    """Create mental health prevalence by company size visualization"""
    # Get mental health responses
    mh_df = df[df['question_text'] == 'Do you currently have a mental health disorder?'].copy()

    # Get company sizes for the same users
    company_sizes = df[df['question_text'] == 'How many employees does your company or organization have?']
    company_sizes = company_sizes[['user_id', 'answer_text']].rename(columns={'answer_text': 'company_size'})

    # Merge the data
    merged_df = mh_df.merge(company_sizes, on='user_id')

    # Calculate prevalence by company size
    prevalence = merged_df.groupby('company_size')['answer_text'].value_counts(normalize=True).unstack()
    prevalence = prevalence.reset_index()

    # Create the visualization
    fig = px.bar(
        prevalence,
        x='company_size',
        y=['yes', 'no'],  # answer_text was lowercased when the data was loaded
        title='Mental Health Prevalence by Company Size',
        labels={'value': 'Proportion', 'company_size': 'Company Size'},
        barmode='group'
    )

    # Update layout
    fig.update_layout(
        title_x=0.5,
        template='plotly_white',
        xaxis_tickangle=-45
    )

    return fig


# Example usage:
age_dist_plot = create_age_distribution_plot(df)
age_dist_plot.show()
# 
time_series_plot = create_time_series_analysis(df)
time_series_plot.show()
# 
gender_plot = create_gender_distribution_plot(df)
gender_plot.show()
# 
correlation_plot = create_mental_health_correlation_heatmap(df)
correlation_plot.show()
# 
company_size_plot = create_company_size_trend(df)
company_size_plot.show()
# 
mental_health_by_size_plot = create_mental_health_prevalence_by_company_size(df)
mental_health_by_size_plot.show()

In [196]:
import plotly.express as px
import pandas as pd
import numpy as np

# Filter for diagnosis responses
diagnosis_responses = df[df['question_text'] == "If yes, what condition(s) have you been diagnosed with?"]

# Common conditions to look for
conditions = [
    "Anxiety Disorder",
    "Mood Disorder",
    "Attention Deficit Hyperactivity Disorder"
]

# Calculate total number of respondents who answered the diagnosis question
total_respondents = len(diagnosis_responses[diagnosis_responses['answer_text'] != '-1'])

conditions_plot_data = []

for condition in conditions:
    # Count users with each condition (excluding -1 responses)
    condition_count = diagnosis_responses[
        (diagnosis_responses['answer_text'].str.contains(condition, na=False, case=False)) &
        (diagnosis_responses['answer_text'] != '-1')
        ]['user_id'].nunique()

    # Calculate prevalence
    prevalence = (condition_count / total_respondents) * 100

    # Calculate Wilson score intervals
    z = 1.96  # 95% confidence level
    n = total_respondents
    p = prevalence / 100

    ci_lower = ((p + z * z / (2 * n) - z * np.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)) * 100
    ci_upper = ((p + z * z / (2 * n) + z * np.sqrt((p * (1 - p) + z * z / (4 * n)) / n)) / (1 + z * z / n)) * 100

    conditions_plot_data.append({
        'Condition': condition,
        'Prevalence': prevalence,
        'CI_lower': ci_lower,
        'CI_upper': ci_upper,
        'Count': condition_count
    })

# Convert to DataFrame
conditions_df = pd.DataFrame(conditions_plot_data)

# Create the plot
fig = px.bar(conditions_df,
             x='Condition',
             y='Prevalence',
             # Error bars take array-likes of bar-to-bound distances, not dicts
             error_y=conditions_df['CI_upper'] - conditions_df['Prevalence'],
             error_y_minus=conditions_df['Prevalence'] - conditions_df['CI_lower'],
             title='Prevalence of Mental Health Conditions with 95% Confidence Intervals')

# Update layout
fig.update_layout(
    xaxis_title="Condition",
    yaxis_title="Prevalence (%)",
    title_x=0.5,
    template='plotly_white',
    showlegend=False,
    xaxis_tickangle=-45
)

# Add hover template
fig.update_traces(
    hovertemplate="<br>".join([
        "<b>%{x}</b>",
        "Prevalence: %{y:.1f}%",
        "95% CI: (%{customdata[0]:.1f}% - %{customdata[1]:.1f}%)",
        "Count: %{customdata[2]}",
        "<extra></extra>"
    ]),
    customdata=conditions_df[['CI_lower', 'CI_upper', 'Count']]
)

# Show plot
fig.show()

# Print detailed statistics
print("\nDetailed Statistics:")
for _, row in conditions_df.iterrows():
    print(f"\n{row['Condition']}:")
    print(f"Count: {row['Count']}")
    print(f"Prevalence: {row['Prevalence']:.1f}%")
    print(f"95% CI: ({row['CI_lower']:.1f}% - {row['CI_upper']:.1f}%)")


Detailed Statistics:

Anxiety Disorder:
Count: 345
Prevalence: 28.6%
95% CI: (26.1% - 31.2%)

Mood Disorder:
Count: 412
Prevalence: 34.1%
95% CI: (31.5% - 36.9%)

Attention Deficit Hyperactivity Disorder:
Count: 121
Prevalence: 10.0%
95% CI: (8.5% - 11.8%)
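
The same counts give very different prevalence figures depending on the denominator: all respondents, or only those with a usable answer to the diagnosis question. A small sketch with made-up rows (the `toy` dataframe is hypothetical) shows the difference; which denominator is appropriate depends on the population the estimate is meant to describe.

```python
import pandas as pd

# Hypothetical long-format answers: users 1-3 answered, users 4-5 skipped ("-1")
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 4, 5],
    "answer_text": ["anxiety disorder", "mood disorder", "mood disorder",
                    "anxiety disorder", "-1", "-1"],
})

# Unique users who reported the condition
has_condition = toy["answer_text"].str.contains("mood disorder", na=False)
cases = toy.loc[has_condition, "user_id"].nunique()

# Two candidate denominators, both counted in unique users
all_users = toy["user_id"].nunique()
answered_users = toy.loc[toy["answer_text"] != "-1", "user_id"].nunique()

print(f"Among all respondents: {cases / all_users:.0%}")
print(f"Among those who answered: {cases / answered_users:.0%}")
```

Counting unique users instead of rows matters when one respondent can list several conditions, as user 1 does here.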


In [45]:
def calculate_population_estimates(df):
    """
    Calculate population parameter estimates with confidence intervals
    """
    # 1. Age Statistics with Confidence Intervals
    age_df = df[df['question_text'] == 'What is your age?'].copy()
    age_data = pd.to_numeric(age_df['answer_text'], errors='coerce')
    age_data = age_data[(age_data >= 16) & (age_data <= 80)]  # Filter reasonable ages

    # Calculate mean with 95% CI
    age_mean = np.mean(age_data)
    age_std = np.std(age_data, ddof=1)  # ddof=1 for sample standard deviation
    age_n = len(age_data)

    # t-interval for the mean (95% confidence)
    confidence = 0.95
    degrees_of_freedom = age_n - 1
    t_value = stats.t.ppf((1 + confidence) / 2, degrees_of_freedom)
    margin_of_error = t_value * (age_std / np.sqrt(age_n))
    age_ci = (age_mean - margin_of_error, age_mean + margin_of_error)

    # 2. Gender Proportion Estimates
    gender_df = df[df['question_text'] == 'What is your gender?'].copy()
    total_responses = len(gender_df)

    gender_props = {}
    for gender in gender_df['answer_text'].unique():
        count = len(gender_df[gender_df['answer_text'] == gender])
        prop = count / total_responses
        # Wilson score interval for proportions
        z = stats.norm.ppf(0.975)  # 95% confidence
        denominator = 1 + z ** 2 / total_responses
        center = (prop + z ** 2 / (2 * total_responses)) / denominator
        margin = z * np.sqrt((prop * (1 - prop) + z ** 2 / (4 * total_responses)) / total_responses) / denominator

        gender_props[gender] = {
            'proportion': prop,
            'ci_lower': max(0, center - margin),
            'ci_upper': min(1, center + margin)
        }

    # 3. Mental Health Prevalence Estimates
    mh_questions = [
        "Have you had a mental health disorder in the past?",
        "Do you currently have a mental health disorder?",
        "Have you ever been diagnosed with a mental health disorder?"
    ]

    mh_estimates = {}
    for question in mh_questions:
        responses = df[df['question_text'] == question]['answer_text']
        positive_responses = responses[responses.str.lower() == 'yes']
        positive_count = len(positive_responses)
        total = len(responses)
        proportion = positive_count / total if total > 0 else 0

        # Wilson score interval
        if total > 0:
            z = stats.norm.ppf(0.975)
            denominator = 1 + z ** 2 / total
            center = (proportion + z ** 2 / (2 * total)) / denominator
            margin = z * np.sqrt((proportion * (1 - proportion) + z ** 2 / (4 * total)) / total) / denominator
            ci_lower = max(0, center - margin)
            ci_upper = min(1, center + margin)
        else:
            ci_lower = 0
            ci_upper = 0

        mh_estimates[question] = {
            'prevalence': proportion,
            'ci_lower': ci_lower,
            'ci_upper': ci_upper,
            'sample_size': total
        }

    return {
        'age_estimates': {
            'mean': age_mean,
            'ci_lower': age_ci[0],
            'ci_upper': age_ci[1],
            'std': age_std,
            'sample_size': age_n
        },
        'gender_estimates': gender_props,
        'mental_health_estimates': mh_estimates
    }


# Create a formatted markdown report
def create_statistical_report(estimates):
    report = """
    # Statistical Inference Report
    
    ## Age Distribution in Tech Industry
    - **Population Mean Age Estimate:** {:.1f} years
    - **95% Confidence Interval:** ({:.1f} - {:.1f}) years
    - **Sample Size:** {:,} respondents
    
    *Interpretation:* If the respondents were a random sample of tech workers, an interval built this way would contain the true population mean age in 95% of repeated samples. For this sample it spans {:.1f} to {:.1f} years.
    
    ## Gender Distribution
    """.format(
        estimates['age_estimates']['mean'],
        estimates['age_estimates']['ci_lower'],
        estimates['age_estimates']['ci_upper'],
        estimates['age_estimates']['sample_size'],
        estimates['age_estimates']['ci_lower'],
        estimates['age_estimates']['ci_upper']
    )

    # Add gender proportions
    # `est` rather than `stats`, to avoid shadowing the scipy.stats module
    for gender, est in estimates['gender_estimates'].items():
        report += f"""
    **{gender}:**
    - Estimated Proportion: {est['proportion']:.1%}
    - 95% CI: ({est['ci_lower']:.1%} - {est['ci_upper']:.1%})
        """

    # Add mental health estimates
    report += "\n## Mental Health Prevalence Estimates\n"
    for question, est in estimates['mental_health_estimates'].items():
        report += f"""
    **{question}**
    - Prevalence: {est['prevalence']:.1%}
    - 95% CI: ({est['ci_lower']:.1%} - {est['ci_upper']:.1%})
    - Sample Size: {est['sample_size']:,}
        """

    return report


estimates = calculate_population_estimates(df)
report = create_statistical_report(estimates)

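# Summary tables for display. The values are hard-coded (percentages rounded
# to one decimal), so they must be re-entered if the underlying data changes.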
age_df = pd.DataFrame({
    'Metric': ['Mean Age', 'CI Lower', 'CI Upper', 'Sample Size'],
    'Value': [33.9, 33.6, 34.1, 4203]
})

# Gender distribution for selected raw answers. Labels are not case-normalized,
# so 'Male' and 'male' appear as separate rows, and '-1' falls below 1%.
gender_df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'male', 'female', '-1'],
    'Proportion': [67.1, 21.7, 5.0, 2.6, 0.6],
    'CI_Lower': [65.7, 20.5, 4.4, 2.2, 0.4],
    'CI_Upper': [68.5, 22.9, 5.7, 3.1, 0.8]
})

mental_health_df = pd.DataFrame({
    'Condition': [
        'Past mental health disorder',
        'Current mental health disorder',
        'Diagnosed mental health disorder'
    ],
    'Prevalence': [47.9, 41.8, 46.1],
    'CI_Lower': [46.1, 40.1, 44.3],
    'CI_Upper': [49.7, 43.6, 47.9],
    'Sample_Size': [2958, 2958, 2958]
})

# Display the age summary
age_df

Unnamed: 0,Metric,Value
0,Mean Age,33.9
1,CI Lower,33.6
2,CI Upper,34.1
3,Sample Size,4203.0
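
The summary tables in the cell above are typed in by hand, which makes them easy to get out of sync with the computed estimates. As a minimal sketch, assuming the nested dictionary shape returned by `calculate_population_estimates`, the rows could be derived instead. The helper name `prevalence_rows`, its `labels` argument, and the example input are hypothetical and not part of the original notebook.

```python
def prevalence_rows(mh_estimates, labels=None):
    """Flatten the nested estimates dict into rows, in percent, rounded to 1 decimal."""
    labels = labels or {}
    rows = []
    for question, est in mh_estimates.items():
        rows.append({
            # Fall back to the full question text when no short label is given
            "Condition": labels.get(question, question),
            "Prevalence": round(est["prevalence"] * 100, 1),
            "CI_Lower": round(est["ci_lower"] * 100, 1),
            "CI_Upper": round(est["ci_upper"] * 100, 1),
            "Sample_Size": est["sample_size"],
        })
    return rows


# Hypothetical input with the same shape as estimates['mental_health_estimates']
question = "Do you currently have a mental health disorder?"
example = {
    question: {"prevalence": 0.25, "ci_lower": 0.2, "ci_upper": 0.3, "sample_size": 400}
}
rows = prevalence_rows(example, {question: "Current mental health disorder"})
print(rows)
```

In the notebook, `pd.DataFrame(prevalence_rows(estimates['mental_health_estimates']))` would then give a table that always reflects the current data.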


In [43]:
gender_df

Unnamed: 0,year,survey_description,user_id,question_id,question_text,answer_text,category
1260,2014,mental health survey for 2014,1,2,What is your gender?,female,Female
1261,2014,mental health survey for 2014,2,2,What is your gender?,male,Male
1262,2014,mental health survey for 2014,3,2,What is your gender?,male,Male
1263,2014,mental health survey for 2014,4,2,What is your gender?,male,Male
1264,2014,mental health survey for 2014,5,2,What is your gender?,male,Male
...,...,...,...,...,...,...,...
204288,2019,mental health survey for 2019,4214,2,What is your gender?,male,Male
204289,2019,mental health survey for 2019,4215,2,What is your gender?,male,Male
204290,2019,mental health survey for 2019,4216,2,What is your gender?,male,Male
204291,2019,mental health survey for 2019,4217,2,What is your gender?,female,Female
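
The hard-coded gender table lists 'Male' and 'male' as separate categories, which suggests the raw answers were counted without normalizing case. A minimal sketch of one way to normalize labels before computing proportions is shown below. Treating '-1' as a missing answer is an assumption, and the function name `normalized_proportions` and the sample list are hypothetical. The `category` column in the output above appears to hold an already normalized label and may be the simpler choice.

```python
from collections import Counter


def normalized_proportions(answers, missing=("-1", "")):
    """Lower-case and strip answers, drop missing markers, return proportions."""
    cleaned = [a.strip().lower() for a in answers if a is not None]
    # Assumption: '-1' and empty strings encode a missing answer
    cleaned = [a for a in cleaned if a not in missing]
    counts = Counter(cleaned)
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders categories from most to least frequent
    return {label: n / total for label, n in counts.most_common()}


# Hypothetical sample of raw answers
sample = ["Male", "male", "MALE ", "Female", "female", "-1"]
props = normalized_proportions(sample)
print(props)
```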


In [44]:
mental_health_df

Unnamed: 0,Condition,Prevalence,CI_Lower,CI_Upper,Sample_Size
0,Past mental health disorder,47.9,46.1,49.7,2958
1,Current mental health disorder,41.8,40.1,43.6,2958
2,Diagnosed mental health disorder,46.1,44.3,47.9,2958
