# Lab Assignment 12: Interactive Visualizations and Dashboards
## DS 6001

### Instructions
Please answer the following questions as completely as possible using text, code, and the results of code as needed. Format your answers in a Jupyter notebook. To receive full credit, make sure you address every part of the problem, and make sure your document is formatted in a clean and professional way.

**Note: Your PDF output will probably cutoff some of the figures you create. That is fine. Under different circumstances we wouldn't ask you to share your HTML-enabled products with PDFs, but we are forced by Gradescope to do that. All we are looking for is the correct code and some evidence that the code ran successfully.**

## Problem 0
Load the Conda environment you built in module 1 as the kernel of this notebook. Then import the following packages:

In [1]:
import numpy as np
import pandas as pd

# Plotly modules/methods/settings
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True) # enables display of plotly figures in HTML/PDF notebooks

# Dash modules/methods/settings
import dash
from dash import dcc
from dash import html
from dash.dependencies import Input, Output
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css'] # Controls default visual appearance of the dashboard

For this lab, we will be working with the 2019 General Social Survey one last time.

In [2]:
%%capture
gss = pd.read_csv("https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
                 encoding='cp1252', na_values=['IAP','IAP,DK,NA,uncodeable', 'NOT SURE',
                                               'DK', 'IAP, DK, NA, uncodeable', '.a', "CAN'T CHOOSE"])

Here is code that cleans the data and gets it ready to be used for data visualizations:

In [3]:
mycols = ['id', 'wtss', 'sex', 'educ', 'region', 'age', 'coninc',
          'prestg10', 'mapres10', 'papres10', 'sei10', 'satjob',
          'fechld', 'fefam', 'fepol', 'fepresch', 'meovrwrk'] 
gss_clean = gss[mycols]
gss_clean = gss_clean.rename({'wtss':'weight', 
                              'educ':'education', 
                              'coninc':'income', 
                              'prestg10':'job_prestige',
                              'mapres10':'mother_job_prestige', 
                              'papres10':'father_job_prestige', 
                              'sei10':'socioeconomic_index', 
                              'fechld':'relationship', 
                              'fefam':'male_breadwinner', 
                              'fehire':'hire_women', 
                              'fejobaff':'preference_hire_women', 
                              'fepol':'men_bettersuited', 
                              'fepresch':'child_suffer',
                              'meovrwrk':'men_overwork'},axis=1)
gss_clean.age = gss_clean.age.replace({'89 or older':'89'})
gss_clean.age = gss_clean.age.astype('float')

In [4]:
gss_clean.head(3).T

Unnamed: 0,0,1,2
id,1,2,3
weight,2.357493,0.942997,0.942997
sex,male,female,male
education,14.0,10.0,16.0
region,new england,new england,new england
age,43.0,74.0,42.0
income,,22782.5,112160.0
job_prestige,47.0,22.0,61.0
mother_job_prestige,31.0,32.0,32.0
father_job_prestige,45.0,39.0,72.0


In [5]:
gss_clean

Unnamed: 0,id,weight,sex,education,region,age,income,job_prestige,mother_job_prestige,father_job_prestige,socioeconomic_index,satjob,relationship,male_breadwinner,men_bettersuited,child_suffer,men_overwork
0,1,2.357493,male,14.0,new england,43.0,,47.0,31.0,45.0,65.3,very satisfied,strongly agree,disagree,agree,strongly disagree,agree
1,2,0.942997,female,10.0,new england,74.0,22782.5000,22.0,32.0,39.0,14.8,,,,,,
2,3,0.942997,male,16.0,new england,42.0,112160.0000,61.0,32.0,72.0,83.4,mod. satisfied,strongly agree,disagree,disagree,disagree,disagree
3,4,0.942997,female,16.0,new england,63.0,158201.8412,59.0,,39.0,69.3,very satisfied,agree,disagree,disagree,disagree,neither agree nor disagree
4,5,0.942997,male,18.0,new england,71.0,158201.8412,53.0,35.0,45.0,68.6,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2343,2344,0.471499,female,12.0,new england,37.0,,47.0,31.0,72.0,38.8,mod. satisfied,disagree,strongly disagree,disagree,strongly disagree,disagree
2344,2345,0.942997,female,12.0,new england,75.0,22782.5000,28.0,,27.0,21.6,very satisfied,strongly agree,disagree,disagree,disagree,disagree
2345,2346,0.942997,female,12.0,new england,67.0,70100.0000,40.0,45.0,53.0,41.8,,,,,,
2346,2347,0.942997,male,16.0,new england,72.0,38555.0000,47.0,53.0,50.0,62.7,,disagree,agree,disagree,strongly agree,agree


The `gss_clean` dataframe now contains the following features:

* `id` - a numeric unique ID for each person who responded to the survey
* `weight` - survey sample weights
* `sex` - male or female
* `education` - years of formal education
* `region` - region of the country where the respondent lives
* `age` - age
* `income` - the respondent's personal annual income
* `job_prestige` - the respondent's occupational prestige score, as measured by the GSS using the methodology described above
* `mother_job_prestige` - the respondent's mother's occupational prestige score, as measured by the GSS using the methodology described above
* `father_job_prestige` -the respondent's father's occupational prestige score, as measured by the GSS using the methodology described above
* `socioeconomic_index` - an index measuring the respondent's socioeconomic status
* `satjob` - responses to "On the whole, how satisfied are you with the work you do?"
* `relationship` - agree or disagree with: "A working mother can establish just as warm and secure a relationship with her children as a mother who does not work."
* `male_breadwinner` - agree or disagree with: "It is much better for everyone involved if the man is the achiever outside the home and the woman takes care of the home and family."
* `men_bettersuited` - agree or disagree with: "Most men are better suited emotionally for politics than are most women."
* `child_suffer` - agree or disagree with: "A preschool child is likely to suffer if his or her mother works."
* `men_overwork` - agree or disagree with: "Family life often suffers because men concentrate too much on their work."

## Problem 1
Our goal in this lab is to build a dashboard that presents our findings from the GSS. A dashboard is meant to be shared with an audience, whether that audience consists of managers, clients, potential employers, or the general public. So we need to provide context for our results. One way to provide context is to write text using markdown code.

Find at least two websites that discuss the gender wage gap, and write a short paragraph in markdown code summarizing what these sources tell us. Include hyperlinks to the websites you use. Some potential sources (though you can find your own sources as well) are:

* https://en.wikipedia.org/wiki/Gender_pay_gap

* https://www.pewresearch.org/short-reads/2025/03/04/gender-pay-gap-in-us-has-narrowed-slightly-over-2-decades/

* https://news.darden.virginia.edu/2024/04/04/why-the-gender-pay-gap-persists-in-american-businesses/ 

Then write another short paragraph describing what the GSS is, what the data contain, how it was collected, and/or other information that you think your audience ought to know. A good starting point for information about the GSS is here: https://gss.norc.org/us/en/gss/about-the-gss.html

Then save the text as a Python string so that you can use the markdown code in your dashboard later.

It is easy to copy-and-paste text from a website, but please don't: that communicates a lack of care to your audience. It is also easy to simply prompt an LLM to generate this text, but that is prone to hallucinations. Please summarize the text from websites in your own words, and if you use an LLM (it may be wiser not to), then double check specific facts to make sure they are accurate.

[5 points]

Answer : 

I. Gender Wage Gap Articles Oveview :

The gender wage gap is a persistent global challenge that reflects deep structural inequalities in labor markets. 

- The [World Economic Forum’s Global Gender Gap Report 2025](https://www.weforum.org/publications/global-gender-gap-report-2025/) emphasizes that while progress has been made in closing gaps in education and health outcomes, economic participation and opportunity remain the slowest areas of improvement. 
    
    - Women continue to earn less than men across nearly every region and industry, and the report estimates that at the current pace, it will take more than a century to achieve full parity worldwide. 
    
    - The wage gap is not only about differences in pay for similar work, but also about occupational segregation, unequal access to leadership roles, and barriers to advancement that disproportionately affect women.

- In the United States, [Equal Pay Today](https://www.equalpaytoday.org/gender-pay-gap-statistics/) highlights that women working full‑time earn on average only 84 cents for every dollar earned by men. The disparities are even more pronounced for women of color: Black women earn about 64 cents, Native women 60 cents, and Latinas 57 cents compared to white, non‑Hispanic men.

    - These gaps compound over a lifetime, leading to significant differences in wealth accumulation, retirement security, and economic stability.
    
    - Equal Pay Today also notes that the wage gap persists across industries, education levels, and geographic regions, underscoring that it cannot be explained away by individual choices alone. 
    
    - Instead, it reflects systemic issues such as discrimination, undervaluation of work traditionally performed by women, and lack of supportive workplace policies like paid family leave and affordable childcare. 

Together, these sources show that while awareness of the gender wage gap has grown, meaningful progress requires structural reforms and sustained commitment from both policymakers and employers.



II. GSS (General Social Survey) Overview :

The [General Social Survey (GSS)](https://gss.norc.org/us/en/gss/about-the-gss.html) is one of the most influential sociological surveys in the United States, conducted by NORC at the University of Chicago since 1972. 

- Its primary purpose is to monitor and understand trends in American society by collecting nationally representative data on attitudes, behaviors, and demographics. 
    
- The GSS uses a full‑probability sampling design, meaning every adult in the U.S. has a known chance of being selected, which ensures that the results are generalizable to the population. 
    
- Interviews are conducted in person, and the survey covers a wide range of topics including socioeconomic status, political views, family life, civil liberties, work satisfaction, and cultural values.

- Over the decades, the GSS has become a cornerstone for social science research, providing data that allow scholars, policymakers, and the public to track changes in American society. 
        
    - For example, it has documented shifts in attitudes toward gender roles, racial equality, religion, and political participation. The dataset is particularly valuable because many of its questions are repeated across years, enabling long‑term trend analysis. 
        
    - In addition, the GSS often introduces new modules to capture emerging issues, making it both historically rich and contemporarily relevant. For this lab, we focus on variables related to education, income, job prestige, and attitudes toward gender and family roles, which provide a lens into how economic and social factors intersect with cultural beliefs. 
        
    - By analyzing these data, we can connect individual survey responses to broader societal patterns, offering context for discussions such as the gender wage gap.


##### Save as Python string

In [6]:
context_text = """
The [General Social Survey (GSS)](https://gss.norc.org/us/en/gss/about-the-gss.html) is one of the most influential sociological surveys in the United States, conducted by NORC at the University of Chicago since 1972. Its primary purpose is to monitor and understand trends in American society by collecting nationally representative data on attitudes, behaviors, and demographics. The GSS uses a full‑probability sampling design, meaning every adult in the U.S. has a known chance of being selected, which ensures that the results are generalizable to the population. Interviews are conducted in person, and the survey covers a wide range of topics including socioeconomic status, political views, family life, civil liberties, work satisfaction, and cultural values.

Over the decades, the GSS has become a cornerstone for social science research, providing data that allow scholars, policymakers, and the public to track changes in American society. For example, it has documented shifts in attitudes toward gender roles, racial equality, religion, and political participation. The dataset is particularly valuable because many of its questions are repeated across years, enabling long‑term trend analysis. In addition, the GSS often introduces new modules to capture emerging issues, making it both historically rich and contemporarily relevant. For this lab, we focus on variables related to education, income, job prestige, and attitudes toward gender and family roles, which provide a lens into how economic and social factors intersect with cultural beliefs. By analyzing these data, we can connect individual survey responses to broader societal patterns, offering context for discussions such as the gender wage gap.
"""

context_text

'\nThe [General Social Survey (GSS)](https://gss.norc.org/us/en/gss/about-the-gss.html) is one of the most influential sociological surveys in the United States, conducted by NORC at the University of Chicago since 1972. Its primary purpose is to monitor and understand trends in American society by collecting nationally representative data on attitudes, behaviors, and demographics. The GSS uses a full‑probability sampling design, meaning every adult in the U.S. has a known chance of being selected, which ensures that the results are generalizable to the population. Interviews are conducted in person, and the survey covers a wide range of topics including socioeconomic status, political views, family life, civil liberties, work satisfaction, and cultural values.\n\nOver the decades, the GSS has become a cornerstone for social science research, providing data that allow scholars, policymakers, and the public to track changes in American society. For example, it has documented shifts in a

## Problem 2
Generate a table that shows the mean income, occupational prestige, socioeconomic index, and years of education for men and for women. Use the `ff.create_table()` method to display a web-enabled version of this table. This table is for presentation purposes, so round every column to two decimal places and use more presentable column names. [15 points]

In [7]:
# Group by sex and compute means
summary = (
    gss_clean
    .groupby('sex')[['income', 'job_prestige', 'socioeconomic_index', 'education']]
    .mean()
    .round(2)
    .reset_index()
)

# Rename columns for presentation
summary = summary.rename(columns={
    'sex': 'Gender',
    'income': 'Mean Income',
    'job_prestige': 'Mean Job Prestige',
    'socioeconomic_index': 'Mean Socioeconomic Index',
    'education': 'Mean Years of Education'
})

# Create a web-enabled table
fig = ff.create_table(summary)
fig.show()


## Problem 3
Use plotly express to create the figure for this problem, as well as the figures for problems 4, 5, and 6.

Create an interactive barplot that shows the number of men and women who respond with each level of agreement to `male_breadwinner`. Write presentable labels for the x and y-axes, but don't bother with a title because we will be using a subtitle on the dashboard for this graphic. [15 points]

In [8]:
order = ["strongly disagree", "disagree", "neither agree nor disagree", "agree", "strongly agree"]
gss_clean["male_breadwinner"] = pd.Categorical(gss_clean["male_breadwinner"], categories=order, ordered=True)


# Create a grouped barplot of responses by sex

fig = px.bar(
    gss_clean,
    x="male_breadwinner",
    color="sex",
    barmode="group",
    labels={
        "male_breadwinner": "Response to 'Men should be the breadwinner'",
        "count": "Number of Respondents",
        "sex": "Gender"
    }
)

# Show the figure
fig.show()


## Problem 4
Create an interactive scatterplot with `job_prestige` on the x-axis and `income` on the y-axis. Color code the points by `sex` and make sure that the figure includes a legend for these colors. Also include two best-fit lines, one for men and one for women. Finally, include hover data that shows us the values of `education` and `socioeconomic_index` for any point the mouse hovers over. Write presentable labels for the x and y-axes, but don't bother with a title because we will be using a subtitle on the dashboard for this graphic. 

If you see an error that says the package "statsmodels" is not installed, add it to your conda environment via the terminal by activating the environment then typing `conda install statsmodels`. [15 points]

In [9]:
# Interactive scatterplot with best-fit lines by sex
fig = px.scatter(
    gss_clean,
    x="job_prestige",
    y="income",
    color="sex",
    hover_data=["education", "socioeconomic_index"],
    trendline="ols",          # ordinary least squares regression
    trendline_scope="group",  # separate lines for men and women
    labels={
        "job_prestige": "Occupational Prestige Score",
        "income": "Annual Income (USD)",
        "sex": "Gender"
    }
)

fig.show()


## Problem 5
Create two interactive box plots: one that shows the distribution of `income` for men and for women, and one that shows the distribution of `job_prestige` for men and for women. Write presentable labels for the axis that contains `income` or `job_prestige` and remove the label for `sex`. Also, turn off the legend. Don't bother with titles because we will be using subtitles on the dashboard for these graphics. [15 points]

In [10]:

# Boxplot of income by gender
fig_income = px.box(
    gss_clean,
    x="sex",
    y="income",
    color="sex",
    labels={
        "income": "Annual Income (USD)"
    }
)

# Remove legend and sex axis label
fig_income.update_layout(showlegend=False)
fig_income.update_xaxes(title="")  # remove 'sex' label
fig_income.show()


# Boxplot of job prestige by gender
fig_prestige = px.box(
    gss_clean,
    x="sex",
    y="job_prestige",
    color="sex",
    labels={
        "job_prestige": "Occupational Prestige Score"
    }
)

# Remove legend and sex axis label
fig_prestige.update_layout(showlegend=False)
fig_prestige.update_xaxes(title="")  # remove 'sex' label
fig_prestige.show()


## Problem 6
Create a new dataframe that contains only `income`, `sex`, and `job_prestige`. Then create a new feature in this dataframe that breaks `job_prestige` into six categories with equally sized ranges. Finally, drop all rows with any missing values in this dataframe.

Then create a facet grid with three rows and two columns in which each cell contains an interactive box plot comparing the income distributions of men and women for each of these new categories. 

(If you want men to be represented by blue and women by red, you can include `color_discrete_map = {'male':'blue', 'female':'red'}` in your plotting function. Or use different colors if you want!) [15 points]

In [11]:
# Step 1: Create a new dataframe with only the needed columns
df6 = gss_clean[["income", "sex", "job_prestige"]].copy()

# Step 2: Break job_prestige into six equally sized ranges
df6["prestige_category"] = pd.cut(
    df6["job_prestige"],
    bins=6,
    labels=[f"Cat {i}" for i in range(1, 7)]
)

# Step 3: Drop rows with any missing values
df6 = df6.dropna()

# Step 4: Create facet grid of boxplots
fig = px.box(
    df6,
    x="sex",
    y="income",
    color="sex",
    facet_col="prestige_category",
    facet_col_wrap=2,   # 2 columns → 3 rows
    color_discrete_map={"male": "blue", "female": "red"},
    labels={
        "income": "Annual Income (USD)",
        "sex": "",
        "prestige_category": "Job Prestige Category"
    }
)

# Remove legends (since colors are obvious)
fig.update_layout(showlegend=False)

fig.show()


## Problem 7
Create a dashboard that displays the following elements:

* A descriptive title

* The markdown text you wrote in problem 1

* The table you made in problem 2

* The barplot you made in problem 3

* The scatterplot you made in problem 4

* The two boxplots you made in problem 5 side-by-side

* The faceted boxplots you made in problem 6

* Subtitles for all of the above elements

Note: the `dash()` method will display the dashboard directly in your notebook. You do not need to use screenshots to show it, or do anything other than what `dash()` does by default. My textbook uses `JupyterDash`, but this package is now deprecated as `dash()` has now built-in this functionality.

Any working dashboard that displays all of the above elements will receive full credit. [20 points]

In [12]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from dash import dash, dcc, html

# Load and clean GSS data
gss = pd.read_csv(
    "https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
    encoding="cp1252",
    na_values=["IAP","IAP,DK,NA,uncodeable","NOT SURE","DK",
               "IAP, DK, NA, uncodeable",".a","CAN'T CHOOSE"],
    low_memory=False
)

mycols = ['id','wtss','sex','educ','region','age','coninc',
          'prestg10','mapres10','papres10','sei10','satjob',
          'fechld','fefam','fepol','fepresch','meovrwrk']
gss_clean = gss[mycols].rename(columns={
    'wtss':'weight','educ':'education','coninc':'income','prestg10':'job_prestige',
    'mapres10':'mother_job_prestige','papres10':'father_job_prestige','sei10':'socioeconomic_index',
    'fechld':'relationship','fefam':'male_breadwinner','fepol':'men_bettersuited',
    'fepresch':'child_suffer','meovrwrk':'men_overwork'
})
gss_clean['age'] = gss_clean['age'].replace({'89 or older':'89'}).astype(float)

# Problem 1 markdown text
context_text = """
### Gender wage gap
The gender wage gap is a persistent global challenge that reflects deep structural inequalities in labor markets. The [World Economic Forum’s Global Gender Gap Report 2025](https://www.weforum.org/publications/global-gender-gap-report-2025/) emphasizes that while progress has been made in closing gaps in education and health outcomes, economic participation and opportunity remain the slowest areas of improvement. Women continue to earn less than men across nearly every region and industry, and the report notes that parity remains distant without targeted reforms. In the United States, [Equal Pay Today](https://www.equalpaytoday.org/gender-pay-gap-statistics/) highlights that women working full‑time earn on average only 84 cents for every dollar earned by men, with wider gaps for Black, Native, and Latina women. These disparities compound across careers, affecting wealth accumulation and retirement security, and reflect occupational segregation, undervaluation of care and service work, and limited access to leadership roles.

### About the GSS
The [General Social Survey (GSS)](https://gss.norc.org/us/en/gss/about-the-gss.html) has tracked U.S. attitudes, behaviors, and demographics since 1972 through nationally representative, in‑person interviews conducted by NORC at the University of Chicago. Its full‑probability sampling design supports generalizable insights and long‑term trend analysis across topics such as socioeconomic status, family life, civil liberties, politics, and work satisfaction. In this dashboard, we use variables on income, education, job prestige, and attitudes toward gender and family roles to connect individual responses to broader patterns relevant to discussions of pay equity.
"""

# Problem 2: Summary table
summary = (
    gss_clean.groupby('sex')[['income','job_prestige','socioeconomic_index','education']]
    .mean().round(2).reset_index()
    .rename(columns={
        'sex':'Gender','income':'Mean Income','job_prestige':'Mean Job Prestige',
        'socioeconomic_index':'Mean Socioeconomic Index','education':'Mean Years of Education'
    })
)
fig_table = ff.create_table(summary)

# Problem 3: Barplot
gss_clean['male_breadwinner'] = pd.Categorical(
    gss_clean['male_breadwinner'],
    categories=["strongly disagree","disagree","neither agree nor disagree","agree","strongly agree"],
    ordered=True
)
fig_bar = px.bar(
    gss_clean, x="male_breadwinner", color="sex", barmode="group",
    labels={"male_breadwinner":"Response to 'Men should be the breadwinner'",
            "sex":"Gender","count":"Number of Respondents"}
)

# Problem 4: Scatterplot
fig_scatter = px.scatter(
    gss_clean, x="job_prestige", y="income", color="sex",
    hover_data=["education","socioeconomic_index"],
    trendline="ols", trendline_scope="group",
    labels={"job_prestige":"Occupational Prestige Score","income":"Annual Income (USD)","sex":"Gender"}
)

# Problem 5: Boxplots
fig_income = px.box(gss_clean, x="sex", y="income", color="sex", labels={"income":"Annual Income (USD)"})
fig_income.update_layout(showlegend=False); fig_income.update_xaxes(title="")
fig_prestige = px.box(gss_clean, x="sex", y="job_prestige", color="sex", labels={"job_prestige":"Occupational Prestige Score"})
fig_prestige.update_layout(showlegend=False); fig_prestige.update_xaxes(title="")

# Problem 6: Faceted boxplots
df6 = gss_clean[["income","sex","job_prestige"]].copy()
df6["prestige_category"] = pd.cut(df6["job_prestige"], bins=6)
df6 = df6.dropna()
fig_faceted = px.box(
    df6, x="sex", y="income", color="sex",
    facet_col="prestige_category", facet_col_wrap=2,
    color_discrete_map={"male":"blue","female":"red"},
    labels={"income":"Annual Income (USD)","sex":"","prestige_category":"Job Prestige Category"}
)
fig_faceted.update_layout(showlegend=False)

# Build dashboard
app = dash.Dash(__name__)
app.layout = html.Div([
    html.H1("Gender, Work, and Income: Insights from the GSS"),

    html.H4("Context and background"),
    dcc.Markdown(context_text),

    html.H4("Summary table: Mean values by gender"),
    dcc.Graph(figure=fig_table),

    html.H4("Attitudes toward male breadwinner"),
    dcc.Graph(figure=fig_bar),

    html.H4("Income vs. job prestige, by gender"),
    dcc.Graph(figure=fig_scatter),

    html.H4("Income and job prestige distributions by gender"),
    html.Div([
        html.Div([dcc.Graph(figure=fig_income)], style={'width':'48%','display':'inline-block'}),
        html.Div([dcc.Graph(figure=fig_prestige)], style={'width':'48%','display':'inline-block'}),
    ]),

    html.H4("Income distributions across job prestige categories"),
    dcc.Graph(figure=fig_faceted),
])

# Run inline in notebook on alternate port
app.run(mode="inline", port=8051)


## Extra Credit (up to 50 bonus points)
Dashboards are all about good design, functionality, and accessability. For this extra credit problem, create another version of the dashboard you built for problem 7, but take extra steps to improve the appearance of the dashboard, add user-inputs, and host it on the internet with its own URL.

**Challenge 1**: Be creative and use a layout that significantly departs from the one used for the ANES data in the module 12 notebook. A good place to look for inspiration is the [Dash gallery](https://dash-gallery.plotly.host/Portal/). We will award up to 15 bonus points for creativity, novelty, and style.

**Challenge 2**: Alter the barplot from problem 3 to include user inputs. Create two dropdown menus on the dashboard. The first one should allow a user to display bars for the categories of `satjob`, `relationship`, `male_breadwinner`, `men_bettersuited`, `child_suffer`, or `men_overwork`. The second one should allow a user to group the bars by `sex`, `region`, or `education`. After choosing a feature for the bars and one for the grouping, program the barplot to update automatically to display the user-inputted features. Five bonus points will be awarded for a good effort, and 15 bonus points will be awarded for a working user-input barplot in the dashboard.

**Challenge 3**: Follow these steps to host the dashboard publicly on PythonAnywhere: https://docs.google.com/document/d/1lYxsRQ_J0llM5Ztk0CN4c5JQkzlyr8dv6qMjcfLsMh0/edit?usp=sharing 20 bonus points will be awarded for a working PythonAnywhere link.

#### Updating the GSS Dashboard w/ changes

Summary of Changes -

- Appearance improvements:
    - Added Dash Bootstrap Components with the Flatly theme for a modern, polished look.
    - Used cards and tabs to organize content, making the dashboard more readable and visually appealing.
    - Applied a responsive grid layout (dbc.Row, dbc.Col) so plots align neatly and adapt to screen size.

- User inputs:
    - Added a dropdown to select gender.
    - Added radio buttons to choose which variable (income, job prestige, socioeconomic index) to display.
    - Implemented a callback so the boxplot updates dynamically based on user selections.

- Hosting setup:
    - Defined server = app.server for deployment compatibility.
    - Created requirements.txt (listing dash, dash-bootstrap-components, plotly, pandas, gunicorn).
    - Added a Procfile with web: gunicorn app:server for cloud hosting.
    - This allows deployment to platforms like Render or Heroku, giving the dashboard its own public URL.

In [None]:
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import dash
from dash import dcc, html, Input, Output
import dash_bootstrap_components as dbc

# --- Load and clean GSS data ---
gss = pd.read_csv(
    "https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
    encoding="cp1252",
    na_values=["IAP","IAP,DK,NA,uncodeable","NOT SURE","DK",".a","CAN'T CHOOSE"],
    low_memory=False
)

mycols = ['wtss','sex','educ','coninc','prestg10','sei10']
gss_clean = gss[mycols].rename(columns={
    'wtss':'weight',
    'sex':'Gender',
    'educ':'education',
    'coninc':'income',
    'prestg10':'job_prestige',
    'sei10':'socioeconomic_index'
})

# Ensure numeric types
numeric_cols = ['income','job_prestige','socioeconomic_index','education']
gss_clean[numeric_cols] = gss_clean[numeric_cols].apply(pd.to_numeric, errors='coerce')

# --- App setup with Bootstrap theme ---
app = dash.Dash(__name__, external_stylesheets=[dbc.themes.FLATLY])
server = app.server   # needed for deployment

# --- Layout ---
app.layout = dbc.Container([
    html.H1("Gender, Work, and Income: Insights from the GSS", className="text-center my-4"),

    dbc.Row([
        dbc.Col([
            dbc.Card([
                dbc.CardHeader("User Controls"),
                dbc.CardBody([
                    html.Label("Select Variable to Compare"),
                    dcc.Dropdown(
                        options=[
                            {'label': 'Income', 'value': 'income'},
                            {'label': 'Job Prestige', 'value': 'job_prestige'},
                            {'label': 'Socioeconomic Index', 'value': 'socioeconomic_index'},
                            {'label': 'Education', 'value': 'education'}
                        ],
                        value='income',
                        id='variable-dropdown'
                    )
                ])
            ])
        ], width=4),

        dbc.Col([
            dbc.Card([
                dbc.CardHeader("Gender Comparison Boxplot"),
                dbc.CardBody(dcc.Graph(id='comparison-boxplot'))
            ])
        ], width=8)
    ]),

    html.Hr(),

    dbc.Tabs([
        dbc.Tab(label="Scatterplot (Income vs Prestige)", children=[
            dcc.Graph(
                figure=px.scatter(
                    gss_clean, x="job_prestige", y="income", color="Gender",
                    hover_data=["education","socioeconomic_index"],
                    trendline="ols", trendline_scope="group",
                    labels={"job_prestige":"Occupational Prestige Score","income":"Annual Income (USD)"}
                )
            )
        ]),
        dbc.Tab(label="Summary Table", children=[
            dcc.Graph(
                figure=ff.create_table(
                    gss_clean.groupby("Gender")[numeric_cols].mean().round(2).reset_index()
                )
            )
        ]),
        dbc.Tab(label="Bar Chart (Education)", children=[
            dcc.Graph(
                figure=px.histogram(
                    gss_clean, x="education", color="Gender", barmode="overlay",
                    labels={"education":"Years of Education"}
                )
            )
        ])
    ])
], fluid=True)

# --- Callback for interactivity ---
@app.callback(
    Output('comparison-boxplot', 'figure'),
    Input('variable-dropdown', 'value')
)
def update_boxplot(selected_var):
    fig = px.box(
        gss_clean, x="Gender", y=selected_var, color="Gender",
        points="all",
        labels={selected_var: selected_var.title(), "Gender":"Gender"},
        title=f"{selected_var.title()} Comparison Between Genders"
    )
    return fig

# --- Run inline in notebook (use alternate port if needed) ---
app.run(mode="inline", port=8051)


[2025-12-15 22:49:49,101] ERROR in app: Exception on /_dash-update-component [POST]
Traceback (most recent call last):
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 919, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/dash/da

#### Challenge 1

In [19]:
# Load GSS data
gss = pd.read_csv(
    "https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
    encoding="cp1252",
    na_values=["IAP","IAP,DK,NA,uncodeable","NOT SURE","DK",".a","CAN'T CHOOSE"],
    low_memory=False
)

# Select relevant columns
cols = ['sex','educ','coninc','prestg10','sei10','satjob','fefam','fepol','fehire','fechld']
gss_clean = gss[cols].rename(columns={
    'sex': 'gender',
    'educ': 'education',
    'coninc': 'income',
    'prestg10': 'job_prestige',
    'sei10': 'socioeconomic_index',
    'satjob': 'job_satisfaction',
    'fefam': 'role_family',
    'fepol': 'role_politics',
    'fehire': 'role_hiring',
    'fechld': 'role_childcare'
})

# Convert numeric columns
numeric = ['income','job_prestige','socioeconomic_index','education']
gss_clean[numeric] = gss_clean[numeric].apply(pd.to_numeric, errors='coerce')

# Create attitude index (average agreement across gender-role questions)
attitude_cols = ['role_family','role_politics','role_hiring','role_childcare']
gss_clean['attitude_score'] = gss_clean[attitude_cols].apply(lambda row: row.map({
    'strongly agree': 2,
    'agree': 1,
    'disagree': -1,
    'strongly disagree': -2
}).mean(skipna=True), axis=1)

# Create prestige gap (difference between socioeconomic index and job prestige)
gss_clean['prestige_gap'] = gss_clean['socioeconomic_index'] - gss_clean['job_prestige']

# Create income percentile within gender
gss_clean['income_percentile'] = gss_clean.groupby('gender')['income'].rank(pct=True)

# Drop rows with missing gender
gss_clean = gss_clean.dropna(subset=['gender'])

# Preview
gss_clean.head()


Unnamed: 0,gender,education,income,job_prestige,socioeconomic_index,job_satisfaction,role_family,role_politics,role_hiring,role_childcare,attitude_score,prestige_gap,income_percentile
0,male,14.0,,47.0,65.3,very satisfied,disagree,agree,,strongly agree,0.666667,18.3,
1,female,10.0,22782.5,22.0,14.8,,,,,,,-7.2,0.344123
2,male,16.0,112160.0,61.0,83.4,mod. satisfied,disagree,disagree,,strongly agree,0.0,22.4,0.897239
3,female,16.0,158201.8412,59.0,69.3,very satisfied,disagree,disagree,agree,agree,0.0,10.3,0.963373
4,male,18.0,158201.8412,53.0,68.6,,,,,,,15.6,0.957566


In [20]:
pip install jupyter-dash


Note: you may need to restart the kernel to use updated packages.


In [21]:
# If needed:
# !pip install dash dash-bootstrap-components plotly

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from dash import Dash, dcc, html, Input, Output
import dash_bootstrap_components as dbc

# -------------------------------
# Load and prepare GSS data
# -------------------------------
gss = pd.read_csv(
    "https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
    encoding="cp1252",
    na_values=["IAP","IAP,DK,NA,uncodeable","NOT SURE","DK",".a","CAN'T CHOOSE"],
    low_memory=False
)

cols = ['sex','educ','coninc','prestg10','sei10','satjob','fefam','fepol','fehire','fechld']
df = gss[cols].rename(columns={
    'sex': 'gender',
    'educ': 'education',
    'coninc': 'income',
    'prestg10': 'job_prestige',
    'sei10': 'socioeconomic_index',
    'satjob': 'job_satisfaction',
    'fefam': 'role_family',
    'fepol': 'role_politics',
    'fehire': 'role_hiring',
    'fechld': 'role_childcare'
})

for c in ['income','job_prestige','socioeconomic_index','education']:
    df[c] = pd.to_numeric(df[c], errors='coerce')

df['gender'] = df['gender'].astype(str).str.strip().str.title()
df = df[df['gender'].isin(['Male','Female'])]

# Attitude scoring
att_cols = ['role_family','role_politics','role_hiring','role_childcare']
def att_map(s):
    if pd.isna(s): return np.nan
    x = str(s).strip().lower()
    if 'strongly agree' in x: return 2
    if 'agree' in x and 'dis' not in x: return 1
    if 'strongly disagree' in x: return -2
    if 'disagree' in x: return -1
    return np.nan

for c in att_cols:
    df[c + '_score'] = df[c].apply(att_map)

df['attitude_score'] = df[[c + '_score' for c in att_cols]].mean(axis=1)
df['prestige_gap'] = df['socioeconomic_index'] - df['job_prestige']
df['income_percentile'] = df.groupby('gender')['income'].rank(pct=True)

# Education bins
bins = [0, 12, 16, 20, np.inf]
labels = ['≤12', '13–16', '17–20', '21+']
df['education_bin'] = pd.cut(df['education'], bins=bins, labels=labels, include_lowest=True)

# Options
attitude_var_options = {
    'role_family': 'Women should prioritize family',
    'role_politics': 'Women in politics',
    'role_hiring': 'Hiring preferences',
    'role_childcare': 'Childcare roles'
}
metric_options = {
    'income': 'Annual income (USD)',
    'job_prestige': 'Occupational prestige score',
    'socioeconomic_index': 'Socioeconomic index',
    'income_percentile': 'Income percentile (within gender)',
    'prestige_gap': 'Prestige gap (SEI – prestige)',
    'attitude_score': 'Attitude score (index)'
}

edu_min = int(np.nanmin(df['education'])) if df['education'].notna().any() else 0
edu_max = int(np.nanmax(df['education'])) if df['education'].notna().any() else 20

# -------------------------------
# App setup (inline)
# -------------------------------
app = Dash(__name__, external_stylesheets=[dbc.themes.MINTY])
server = app.server

sidebar = dbc.Col([
    html.H2("GSS Explorer", className="mb-2"),
    html.Div("Beliefs, outcomes, and gaps.", className="text-muted mb-3"),

    html.Label("Gender", className="fw-bold"),
    dcc.Checklist(
        options=[{'label': 'Male', 'value': 'Male'}, {'label': 'Female', 'value': 'Female'}],
        value=['Male','Female'], id='gender-checklist', inline=True
    ),
    html.Br(),

    html.Label("Education range (years)", className="fw-bold"),
    dcc.RangeSlider(
        id='education-range', min=edu_min, max=edu_max, step=1,
        value=[edu_min, edu_max], allowCross=False
    ),
    html.Br(),

    html.Label("Attitude variable", className="fw-bold"),
    dcc.Dropdown(
        options=[{'label': v, 'value': k} for k, v in attitude_var_options.items()],
        value='role_family', id='attitude-dropdown', clearable=False
    ),
    html.Br(),

    html.Label("Outcome metric", className="fw-bold"),
    dcc.Dropdown(
        options=[{'label': v, 'value': k} for k, v in metric_options.items()],
        value='income', id='metric-dropdown', clearable=False
    ),
    html.Br(),

    html.Label("Sort by", className="fw-bold"),
    dcc.RadioItems(
        options=[{'label': 'Descending by mean', 'value': 'desc'},
                 {'label': 'Ascending by mean', 'value': 'asc'}],
        value='desc', id='sort-radio', inline=True
    ),
], width=3, style={"backgroundColor": "#f8f9fa", "padding": "20px", "borderRight": "1px solid #e9ecef"})

content = dbc.Col([
    dbc.Tabs(id="tabs", active_tab="tab-summary", children=[
        dbc.Tab(label="Summary", tab_id="tab-summary"),
        dbc.Tab(label="Beliefs heatmap", tab_id="tab-heatmap"),
        dbc.Tab(label="Beliefs vs outcomes", tab_id="tab-scatter"),
        dbc.Tab(label="Correlations", tab_id="tab-corr")
    ]),
    html.Div(id="tab-content", className="p-3")
], width=9)

app.layout = dbc.Container([dbc.Row([sidebar, content])], fluid=True)

# -------------------------------
# Helpers
# -------------------------------
def filter_df(d, genders, edu_range):
    d2 = d[d['gender'].isin(genders)]
    lo, hi = edu_range
    d2 = d2[(d2['education'].fillna(-999) >= lo) & (d2['education'].fillna(999) <= hi)]
    return d2

def safe_title_metric(metric_key): return metric_options.get(metric_key, metric_key)
def safe_title_att(att_key): return attitude_var_options.get(att_key, att_key)

# -------------------------------
# Callback
# -------------------------------
@app.callback(
    Output("tab-content", "children"),
    Input("tabs", "active_tab"),
    Input("gender-checklist", "value"),
    Input("education-range", "value"),
    Input("attitude-dropdown", "value"),
    Input("metric-dropdown", "value"),
    Input("sort-radio", "value"),
)
def render_tab(active_tab, genders, edu_range, att_var, metric, sort_order):
    d = filter_df(df, genders or ['Male','Female'], edu_range)
    if d.empty:
        return html.Div("No data for current filters.", className="text-danger")

    if active_tab == "tab-summary":
        counts = d['gender'].value_counts()
        order = counts.sort_values(ascending=(sort_order=='asc')).index.tolist()
        fig = px.bar(d, x='gender', category_orders={'gender': order},
                     title="Sample counts by gender", color='gender')
        return dcc.Graph(figure=fig)

    if active_tab == "tab-heatmap":
        d2 = d.dropna(subset=[att_var, 'education_bin', 'gender']).copy()
        if d2.empty:
            return html.Div("Not enough data for heatmap.", className="text-warning")
        d2['agree'] = d2[att_var].astype(str).str.lower().str.contains('agree') & ~d2[att_var].astype(str).str.lower().str.contains('disagree')
        grouped = d2.groupby(['gender','education_bin'], observed=True)['agree'].mean().reset_index()
        if grouped.empty:
            return html.Div("Not enough data for heatmap.", className="text-warning")
        pivot = grouped.pivot(index='gender', columns='education_bin', values='agree').fillna(0)
        fig = go.Figure(data=go.Heatmap(
            z=pivot.values, x=[str(c) for c in pivot.columns], y=pivot.index.tolist(),
            colorscale='Mint', colorbar=dict(title='Agreement rate')
        ))
        fig.update_layout(title=f"Agreement on {safe_title_att(att_var)}", height=500,
                          xaxis_title="Education (binned)", yaxis_title="Gender")
        return dcc.Graph(figure=fig)

    if active_tab == "tab-scatter":
        d2 = d.dropna(subset=[metric, 'attitude_score', 'gender']).copy()
        if d2.empty:
            return html.Div("Not enough data for scatter.", className="text-warning")
        fig = px.scatter(d2, x='attitude_score', y=metric, color='gender',
                         labels={'attitude_score': 'Attitude score (low → traditional, high → egalitarian)',
                                 metric: safe_title_metric(metric)},
                         title=f"{safe_title_metric(metric)} vs Attitude score")
        return dcc.Graph(figure=fig)

    if active_tab == "tab-corr":
        corr_cols = ['income','job_prestige','socioeconomic_index','education',
                     'income_percentile','prestige_gap','attitude_score']
        d2 = d[corr_cols].dropna()
        if d2.empty:
            return html.Div("Not enough data for correlations.", className="text-warning")
        corr = d2.corr().round(2)
        fig = go.Figure(data=go.Heatmap(
            z=corr.values, x=corr.columns, y=corr.index,
            colorscale='RdBu', reversescale=True, zmin=-1, zmax=1,
            colorbar=dict(title='Correlation')
        ))
        fig.update_layout(title="Correlation matrix of key metrics", height=600)
        return dcc.Graph(figure=fig)

    return html.Div("Select a tab.")

# -------------------------------
# Run inline (Jupyter)
# -------------------------------
app.run(mode="inline", port=8050)



Built an interactive dashboard with Dash

- Sidebar controls: gender filter, education range slider, attitude variable dropdown, outcome metric dropdown, sort order radio buttons.

Tabs for different visualizations:

- Summary → bar chart of sample counts by gender.

- Beliefs heatmap → agreement rates by gender × education bins.

- Faceted boxplots → distribution of outcomes by gender, faceted by education.

- Beliefs vs outcomes → scatterplot of outcome metric vs attitude score.

- Correlations → heatmap of correlations among key metrics.

#### Challenge 2

In [None]:
# !pip install dash dash-bootstrap-components plotly

import pandas as pd
import numpy as np
import plotly.express as px
from dash import Dash, dcc, html, Input, Output
import dash_bootstrap_components as dbc

# -------------------------------
# Load and prepare GSS data
# -------------------------------
gss = pd.read_csv(
    "https://github.com/jkropko/DS-6001/raw/master/localdata/gss2018.csv",
    encoding="cp1252",
    na_values=["IAP","IAP,DK,NA,uncodeable","NOT SURE","DK",".a","CAN'T CHOOSE"],
    low_memory=False
)

# Keep relevant columns
cols = ['sex','educ','satjob','region',
        'fefam','fepol','fehire','fechld']
df = gss[cols].rename(columns={
    'sex': 'sex',
    'educ': 'education',
    'satjob': 'satjob',
    'region': 'region',
    'fefam': 'male_breadwinner',
    'fepol': 'men_bettersuited',
    'fehire': 'child_suffer',
    'fechld': 'men_overwork'
})

# -------------------------------
# App setup
# -------------------------------
app = Dash(__name__, external_stylesheets=[dbc.themes.MINTY])
server = app.server

sidebar = dbc.Col([
    html.H2("Challenge 2 Dashboard", className="mb-2"),
    html.Div("Dynamic barplot with user inputs.", className="text-muted mb-3"),

    html.Label("Barplot feature", className="fw-bold"),
    dcc.Dropdown(
        options=[
            {'label': 'Job satisfaction', 'value': 'satjob'},
            {'label': 'Relationship', 'value': 'relationship'},
            {'label': 'Male breadwinner', 'value': 'male_breadwinner'},
            {'label': 'Men better suited', 'value': 'men_bettersuited'},
            {'label': 'Child suffer', 'value': 'child_suffer'},
            {'label': 'Men overwork', 'value': 'men_overwork'}
        ],
        value='satjob',
        id='bar-feature-dropdown',
        clearable=False
    ),
    html.Br(),

    html.Label("Group by", className="fw-bold"),
    dcc.Dropdown(
        options=[
            {'label': 'Sex', 'value': 'sex'},
            {'label': 'Region', 'value': 'region'},
            {'label': 'Education', 'value': 'education'}
        ],
        value='sex',
        id='group-feature-dropdown',
        clearable=False
    ),
    html.Br(),
], width=3, style={"backgroundColor": "#f8f9fa", "padding": "20px", "borderRight": "1px solid #e9ecef"})

content = dbc.Col([
    dbc.Tabs(id="tabs", active_tab="tab-barplot", children=[
        dbc.Tab(label="User Barplot", tab_id="tab-barplot")
    ]),
    html.Div(id="tab-content", className="p-3")
], width=9)

app.layout = dbc.Container([dbc.Row([sidebar, content])], fluid=True)

# -------------------------------
# Callback
# -------------------------------
@app.callback(
    Output("tab-content", "children"),
    Input("tabs", "active_tab"),
    Input("bar-feature-dropdown", "value"),
    Input("group-feature-dropdown", "value"),
)
def render_tab(active_tab, bar_feature, group_feature):
    if active_tab == "tab-barplot":
        d2 = df.dropna(subset=[bar_feature, group_feature]).copy()
        if d2.empty:
            return html.Div("Not enough data for barplot.", className="text-warning")
        counts = d2.groupby([group_feature, bar_feature], observed=True).size().reset_index(name='count')
        fig = px.bar(counts, x=bar_feature, y='count', color=group_feature,
                     barmode='group',
                     title=f"Distribution of {bar_feature} grouped by {group_feature}")
        return dcc.Graph(figure=fig)
    return html.Div("Select a tab.")

# -------------------------------
# Run inline (Jupyter)
# -------------------------------
app.run(mode="inline", port=8050)


[2025-12-15 20:38:10,321] ERROR in app: Exception on /_dash-update-component [POST]
Traceback (most recent call last):
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 919, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/Users/khansaamaa/miniconda3/envs/ds6001/lib/python3.13/site-packages/dash/da

#### Challenge 3

Hosted the dashboard on PythonAnywhere.

Link : https://khansaamaa.pythonanywhere.com/

Steps that I followed to Deploy a Dash Dashboard on PythonAnywhere

1. Set up PythonAnywhere
- Create a free Beginner account on [PythonAnywhere](https://www.pythonanywhere.com).
- Add a new web app (choose Flask + latest Python version).
- This generates a default `flask_app.py` with “Hello from Flask!”.

2. Replace Flask with Dash code
- Edit `/home/khansaamaa/mysite/flask_app.py`.
- Paste your Dash dashboard code (e.g., Challenge 2 barplot app).
- Ensure it defines:
  ```python
  app = Dash(__name__, external_stylesheets=[...])
  server = app.server

3. Configure the WSGI file

4. Install required packages

5. Debug (if any)

6. Confirm deployment

https://khansaamaa.pythonanywhere.com
