# GENDER AND STOCK MARKET PARTICIPATION
_Karl Stavem_

## Table of Contents
1.  [Motivation and Problem Statement](#motivation)
1.  [Research Questions and Hypotheses](#rq)
1.  [Background and Related Work](#brr)
1.  [Data](#dd)
1.  [A Note on Terminology](#terminology)
1.  [Methodology](#methodology)
1.  [Data Preprocessing](#dpp)
1.  [Findings](#findings)
1.  [Discussion](#discussion)
1.  [Conclusion](#conclusion)

<span id="motivation"/>
    
### Motivation and Problem Statement
This study looks at the stock market and the ways in which participation varies by gender.  Because there is a significant and well-established [gender pay gap](https://www.pewresearch.org/fact-tank/2019/03/22/gender-pay-gap-facts/), it is vital to understand the ways this may lead to an exponential disparity in accumulated wealth and retirement savings.

<span id="rq"/>

### Research Questions and Hypotheses

There are two specific questions that I will address in this analysis:

- _Q 1:  Do women participate in the stock market at a lower rate than men?_
- _Q 2.1:  What are the key factors that affect stock market participation?_  
- _Q 2.2:  Do these key factors disproportionately affect women?_

My initial hypothesis is that women participate in the stock market at a lower rate than men.   I expect that this drop in participation is driven less by differences in attitudes or perceptions and more.


<span id="brr" />

### Background and Related Work
Many financial advisory firms have conducted independent research into these issues in the past.   For example, Northwestern Mutual has routinely [published findings](https://news.northwesternmutual.com/planning-and-progress-2019) outlining general attitudes and behaviors around money and investing.   Additionally, FINRA publishes results from the National Financial Capability Study on [their website](https://www.usfinancialcapability.org/results.php?region=US) every three years.  However, most of these studies do not specifically focus on gender.

<span id="dd" />

### Data


#### Description
The primary dataset used to address these questions is the _2018 National Financial Capability Study (NFCS)_, funded by the FINRA Investor Education Foundation.  The NFCS is a longitudinal survey that has been conducted across the United States every three years since 2009.   The goal of the survey is to benchmark key indicators of financial capability in U.S. households and evaluate how these indicators vary by region, attitude, and demographics.  My analysis focuses exclusively on the 2018 dataset.  Although the FINRA Investor Education Foundation publishes its own findings on this data, it has also made the datasets, questionnaires, and documents available to outside researchers.

#### Dimensions
The 2018 NFCS contains responses from roughly 27,000 households across all 50 states.  Each respondent provided answers to 127 different questions.   All responses have been numerically encoded into a single .csv file.


#### Access

The 2018 NFCS dataset is freely available online and is subject to FINRA's [terms of use](https://www.usfinancialcapability.org/terms.php).  Full details and descriptions can be found on the [Data and Downloads](https://www.usfinancialcapability.org/downloads.php) page of the US Financial Capability website.  The primary dataset used in this exploration is titled _2018 State-by-State Survey — Respondent-Level Data, Comma delimited Excel file (.csv)_, and it may be downloaded in a .zip file directly from FINRA's website using [this link](https://www.usfinancialcapability.org/downloads/NFCS_2018_State_by_State_Data_Excel.zip).  This project already contains the full 2018 NFCS dataset, which can be viewed directly here: ["NFCS 2018 State Data 190603.csv"](raw_data/NFCS%202018%20State%20Data%20190603.csv).

<span id="terminology" />

### A Note on Terminology

It is important to acknowledge the usage of specific terms like _gender_ and _sex_ in this analysis.  Regrettably, the NFCS dataset uses the terms _gender_ and _sex_ interchangeably, as well as the terms _woman_ and _female_.   Additionally, all survey respondents were required to self-identify strictly as either _male_ or _female_, with no alternative for non-binary designations.   As a result, this study is bound by the same conventions: all subsequent analysis uses these words interchangeably, and all findings are based on the assumption that respondents fall neatly into one of two distinct gender groups, either male or female.

<span id="methodology" />

### Methodology




<span id="dpp"/>

### Data Preprocessing

Each respondent in this survey was asked 127 individual questions; however, not all of the questions posed are useful to this analysis.   The following section will download the raw data from the website and create a refined dataframe to work with.

First, import all necessary libraries for processing and analyzing the data.

In [2]:
# import libraries for data acquisition and processing
import pandas as pd
import numpy as np
import requests, zipfile, io

# import libraries for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# import libraries for statistical tests
from statsmodels.stats.proportion import proportions_ztest

Since the data on the FINRA website is all in zipped form, I have created a function to extract the raw data into the _raw_data_ directory.  Additionally, since there are multiple datasets available on the website, this will make future analysis easier.

In [50]:
def get_data(zip_file_url):
    """
    Input: URL of the zip file to download.
    Output: Extracted data in the './raw_data' folder.
    """
    r = requests.get(zip_file_url)
    if not r.ok:
        # stop here rather than trying to unzip an error page
        print('Error submitting request.')
        return
    print('Request successful.')

    # unzip the response in memory and write its contents to disk
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall('./raw_data')

Now we can call the function to save the files in the appropriate place.

In [51]:
#%%capture

# download the state-by-state data into the raw_data folder
BASE_URL = 'https://www.usfinancialcapability.org/downloads/'

# populate the list of data sets to download from the site
file_list = ['NFCS_2018_State_by_State_Data_Excel.zip']

# call the function for each file in the list
for filename in file_list:
    get_data(zip_file_url = BASE_URL + filename)

Request successful.
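
The in-memory unzip step inside `get_data` can be tried without touching the network.  The sketch below is only an illustration: it builds a tiny archive in memory (the file name `example.csv` and its contents are made up) and extracts it into a temporary folder using the same `zipfile.ZipFile(io.BytesIO(...))` pattern.

```python
import io
import tempfile
import zipfile
from pathlib import Path

# build a small zip archive in memory; the file name and contents are hypothetical
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, mode='w') as archive:
    archive.writestr('example.csv', 'NFCSID,A3\n2018010001,2\n')

# treat the raw bytes the same way get_data treats r.content
payload = buffer.getvalue()

with tempfile.TemporaryDirectory() as target_dir:
    with zipfile.ZipFile(io.BytesIO(payload)) as z:
        z.extractall(target_dir)
    extracted_names = sorted(p.name for p in Path(target_dir).iterdir())
    extracted_text = (Path(target_dir) / 'example.csv').read_text()

print(extracted_names)
```

Because the archive never has to be written to disk as a .zip file, only the extracted contents end up in the target folder.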


We will store this data as a pandas dataframe.

In [52]:
# read the newly acquired data into a dataframe
raw_df = pd.read_csv('raw_data/NFCS 2018 State Data 190603.csv')

# Check the import results.
display(raw_df.head())

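| | NFCSID | STATEQ | CENSUSDIV | CENSUSREG | A3 | A3Ar_w | A3B | A4A_new_w | A5_2015 | A6 | ... | M42 | M6 | M7 | M8 | M31 | M9 | M10 | wgt_n2 | wgt_d2 | wgt_s3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018010001 | 48 | 9 | 4 | 2 | 5 | 11 | 1 | 5 | 4 | ... | NaN | 1 | 3 | 98 | 98 | 98 | 1 | 0.683683 | 0.519642 | 1.095189 |
| 1 | 2018010002 | 10 | 5 | 3 | 2 | 2 | 8 | 1 | 6 | 1 | ... | NaN | 1 | 3 | 98 | 3 | 1 | 98 | 0.808358 | 2.516841 | 0.922693 |
| 2 | 2018010003 | 44 | 7 | 3 | 2 | 2 | 8 | 1 | 6 | 1 | ... | NaN | 1 | 1 | 98 | 98 | 1 | 98 | 1.021551 | 1.896192 | 0.671093 |
| 3 | 2018010004 | 10 | 5 | 3 | 2 | 1 | 7 | 1 | 6 | 2 | ... | 7.0 | 98 | 98 | 4 | 4 | 2 | 98 | 0.808358 | 2.516841 | 0.922693 |
| 4 | 2018010005 | 13 | 8 | 4 | 1 | 2 | 2 | 1 | 6 | 1 | ... | NaN | 1 | 3 | 98 | 2 | 1 | 98 | 0.448075 | 0.614733 | 1.232221 |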


Now we can verify the size of the file.   Since there are about 27,000 respondents answering 127 questions each, we should see a dataset of roughly this size.

In [53]:
raw_df.shape

(27091, 128)

Since there are 127 questions in the survey and none of them have meaningful names, the function below will assign more informative headers to key columns in the dataset.  Since not all 127 questions are relevant to this research, the extraneous columns can be dropped.

In [54]:
def set_column_names(c):
    switcher = {
        'NFCSID':'id', 
        'A3':'gender', 
        'A5_2015':'education', 
        'A8':'income', 
        'A9':'work_status',
        'A14':'investment_knowledge',
        'J1':'financial_satisfaction_score',
        'J2':'risk_tolerance',
        'J3':'income_to_debt',
        'J4':'difficult_to_pay_bills',
        'J8':'know_retirement_amount',
        'J20':'confidence_paying_unexpected_need',
        'J32':'current_credit_record',
        'J33_1':'worry_about_retirement',
        'J33_40':'finances_makes_me_anxious',
        'J33_41':'finances_makes_me_stressed',
        'B2':'savings_account',
        'B40':'comfort_asking_questions',
        'C1_2012':'employer_provided_retirement_account',
        'C4_2012':'self_retirement_account',
        'B14':'other_investment_accounts',
        'F2_3':'minimum_credit_card_payement_only',
        'G23':'too_much_debt',
        'M1_1':'good_at_daily_financial_matters',
        'M1_2':'good_at_math',
        'M4':'financial_knowledge',
        'M40':'taken_financial_education',
        'M41':'how_many_hours_of_financial_education'
    }
    return switcher.get(c, "column_to_drop")

# create a new copy of the dataframe to manipulate (a plain assignment would only alias raw_df)
df = raw_df.copy()

# call the function above to rename columns
df.rename(columns=set_column_names, inplace=True)

Now we are free to drop the remaining columns that will not be used in this analysis.

In [55]:
df.drop(['column_to_drop'], axis=1, inplace=True)
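
The rename-then-drop step relies on the fact that every unwanted column ends up with the same label, so a single `drop` removes all of them.  A minimal sketch of that behavior on a made-up frame (the columns `X1` and `X2` and the short `keep` mapping are hypothetical stand-ins for the full survey):

```python
import pandas as pd

# toy frame with two columns we care about and two we do not (made-up values)
toy = pd.DataFrame({'NFCSID': [1, 2], 'A3': [1, 2], 'X1': [5, 6], 'X2': [7, 8]})

# keep-list in the same spirit as set_column_names; everything else gets a sentinel name
keep = {'NFCSID': 'id', 'A3': 'gender'}
toy = toy.rename(columns=lambda c: keep.get(c, 'column_to_drop'))

# both unwanted columns now share the label 'column_to_drop'
n_sentinel = int((toy.columns == 'column_to_drop').sum())

# dropping by label removes every column that carries it
toy = toy.drop(columns=['column_to_drop'])
print(list(toy.columns))
```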

There are several blank entries in the dataframe.   It is easier to work with NaN, so we will edit those cells.

In [57]:
df = df.replace(r'^\s*$', np.nan, regex=True)

Since all the data is numerically encoded, we will convert all columns into integer data types.   Since 0 is not a valid answer for any of the survey questions, we will fill in all blank and missing data with a 0.

In [65]:
# convert missing data to zeros
df = df.fillna(0)

# coerce all columns in the df to integer
cols =['id', 'gender', 'education', 'income', 'work_status','investment_knowledge',
      'financial_satisfaction_score','risk_tolerance','income_to_debt','difficult_to_pay_bills',
      'know_retirement_amount','confidence_paying_unexpected_need','current_credit_record','worry_about_retirement',
      'finances_makes_me_anxious','finances_makes_me_stressed','savings_account','comfort_asking_questions',
      'employer_provided_retirement_account','self_retirement_account','other_investment_accounts','minimum_credit_card_payement_only',
      'too_much_debt','good_at_daily_financial_matters','good_at_math','financial_knowledge','taken_financial_education',
      'how_many_hours_of_financial_education']

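# assign back to the same set of columns; fill any values that fail to parse before casting
df[cols] = df[cols].apply(lambda x: pd.to_numeric(x, errors='coerce').fillna(0).astype(int), axis=0)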

# check to ensure all columns are integers now
df.dtypes

id                                       int64
gender                                   int64
education                                int64
income                                   int64
work_status                              int64
investment_knowledge                     int64
financial_satisfaction_score             int64
risk_tolerance                           int64
income_to_debt                           int64
difficult_to_pay_bills                   int64
know_retirement_amount                   int64
confidence_paying_unexpected_need        int64
current_credit_record                    int64
worry_about_retirement                   int64
finances_makes_me_anxious                int64
finances_makes_me_stressed               int64
savings_account                          int64
comfort_asking_questions                 int64
employer_provided_retirement_account     int64
self_retirement_account                  int64
other_investment_accounts                int64
minimum_credi
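
Filling gaps with 0 works here because 0 is never a valid answer code, but it does mean that later statistics treat a missing answer as a real value.  One possible alternative, shown as a sketch on made-up values rather than as the approach used in this notebook, is the pandas nullable `Int64` dtype, which keeps integers and missing entries side by side.

```python
import numpy as np
import pandas as pd

# made-up answers: two valid codes, one blank cell, one true missing value
answers = pd.Series(['1', ' ', '98', np.nan])

# blank strings cannot be parsed, so errors='coerce' turns them into NaN
numeric = pd.to_numeric(answers, errors='coerce')

# nullable integer dtype: integers stay integers, gaps stay <NA>
nullable = numeric.astype('Int64')
print(nullable)
```

With this dtype, methods such as `mean()` and `corr()` skip the missing entries instead of counting them as zeros.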

Preview the results before moving forward.

In [61]:
df.head()

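| | id | gender | education | income | work_status | investment_knowledge | financial_satisfaction_score | risk_tolerance | income_to_debt | difficult_to_pay_bills | ... | employer_provided_retirement_account | self_retirement_account | other_investment_accounts | minimum_credit_card_payement_only | too_much_debt | good_at_daily_financial_matters | good_at_math | financial_knowledge | taken_financial_education | how_many_hours_of_financial_education |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018010001 | 2 | 5 | 1 | 6 | 0 | 1 | 1 | 3 | 1 | ... | 1 | 2 | 0 | 0 | 7 | 1 | 6 | 2 | 2 | 0 |
| 1 | 2018010002 | 2 | 6 | 5 | 2 | 1 | 4 | 3 | 3 | 2 | ... | 1 | 1 | 1 | 2 | 1 | 7 | 6 | 5 | 2 | 0 |
| 2 | 2018010003 | 2 | 6 | 4 | 2 | 3 | 1 | 1 | 2 | 1 | ... | 1 | 2 | 2 | 1 | 7 | 1 | 1 | 2 | 2 | 0 |
| 3 | 2018010004 | 2 | 6 | 5 | 2 | 3 | 8 | 10 | 2 | 3 | ... | 98 | 98 | 2 | 98 | 98 | 6 | 6 | 98 | 2 | 3 |
| 4 | 2018010005 | 1 | 6 | 3 | 2 | 1 | 6 | 5 | 3 | 3 | ... | 1 | 2 | 2 | 2 | 4 | 7 | 4 | 4 | 2 | 0 |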


<span id="findings" />
          
          
### Findings

If we look at an initial breakdown of the data, we see slightly more female respondents than male respondents.   About 56% of respondents are women vs 44% men.

In [None]:
# create a clean categorical column for gender
df['cat_gender'] = df['gender'].apply(lambda x: 'Male' if x == 1 else 'Female')

# plot the totals in a quick pie chart
g1 = df['cat_gender'].value_counts().plot.pie(autopct="%.1f%%")
fig = g1.get_figure()
fig.savefig('./figures/g1.png')

# print total counts for each group
print(df['cat_gender'].value_counts())

#### R1:   Do Women Participate in the Stock Market at a Lower Rate Than Men?
This analysis begins by addressing the question of whether women participate in the stock market at a lower rate than men.  Since all of these responses are numerically encoded, it may be useful to create cleaner categorical columns with more informative entries.  The following code performs this task.
          

In [None]:
# who has access to an investment account

# create a clean column with labels
def set_investment_accounts(score):
    if score in (1, '1'):
        return 'Yes'
    elif score in (2, '2'):
        return 'No'
    else:
        return "Unknown"

# who has employer-provided account, other retirement, or other investments        
df['cat_employer_provided_retirement_account'] = df['employer_provided_retirement_account'].apply(set_investment_accounts)
df['cat_self_retirement_account'] = df['self_retirement_account'].apply(set_investment_accounts)
df['cat_other_investment_accounts'] = df['other_investment_accounts'].apply(set_investment_accounts)

In order to be considered a participant in the stock market, a respondent must meet _at least one_ of the following criteria:  
- Hold an employer-provided retirement account.
- Hold a retirement account not provided by an employer.
- Hold any investments in stocks, bonds, mutual funds, or other securities outside of a retirement account.

We can create a categorical column that tracks this information.

In [None]:
# create a clean column with labels
def set_participant(account1, account2, account3):
    if 'Yes' in {account1, account2, account3}:
        return 'Yes'
    else:
        return "No"
        
df['cat_market_participant'] = df.apply(lambda x: set_participant(x['cat_employer_provided_retirement_account'], x['cat_self_retirement_account'], x['cat_other_investment_accounts']),axis=1)
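
The "at least one of the three accounts" rule can also be written without a row-wise `apply`.  The following is a sketch on made-up labels, assuming the same 'Yes' / 'No' / 'Unknown' scheme as above; it is an alternative formulation, not the code used to produce the results in this notebook.

```python
import pandas as pd

# made-up labels in the same 'Yes' / 'No' / 'Unknown' scheme used above
toy = pd.DataFrame({
    'cat_employer_provided_retirement_account': ['Yes', 'No', 'Unknown', 'No'],
    'cat_self_retirement_account': ['No', 'No', 'Yes', 'Unknown'],
    'cat_other_investment_accounts': ['No', 'No', 'No', 'No'],
})

account_cols = list(toy.columns)

# True when at least one of the three account columns is 'Yes'
has_any_account = toy[account_cols].eq('Yes').any(axis=1)

# map the boolean flag back to the same labels set_participant returns
toy['cat_market_participant'] = has_any_account.map({True: 'Yes', False: 'No'})
print(toy['cat_market_participant'].tolist())
```

Note that under this rule a respondent with only 'No' and 'Unknown' answers is counted as a non-participant, which matches the behavior of `set_participant`.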


We can now graph this data and examine the relationship.

In [None]:
# plot the gender distribution
g2 = sns.catplot(x="cat_market_participant", 
            kind="count",
            hue='cat_gender',
            data=df,
            order=["Yes", "No"],
            palette=("Blues"))

# set title
g2.fig.suptitle('Do you participate in the market in some way?') 

# save output
g2.savefig('./figures/g2.png')

A higher number of female respondents participate in the market.   However, there are more females in our respondent pool, so these results must be normalized against the total population in each group.

In [None]:
print(df[(df['cat_gender'] =='Female')][['cat_gender','cat_market_participant']].value_counts(normalize=True))
print(df[(df['cat_gender'] =='Male')][['cat_gender','cat_market_participant']].value_counts(normalize=True))
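
The same within-group rates can be read from a single table.  The sketch below uses `pd.crosstab` with `normalize='index'` on made-up respondents, so the numbers it prints are illustrative only and are not the survey results.

```python
import pandas as pd

# made-up respondents, only to show the shape of the result
toy = pd.DataFrame({
    'cat_gender': ['Female', 'Female', 'Female', 'Female', 'Male', 'Male'],
    'cat_market_participant': ['Yes', 'No', 'Yes', 'Yes', 'Yes', 'No'],
})

# normalize='index' makes each row (each gender) sum to 1
rates = pd.crosstab(toy['cat_gender'], toy['cat_market_participant'], normalize='index')
print(rates)
```

Each row is one gender and each column one answer, which makes the two participation rates directly comparable.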

We can see that roughly 73% of male respondents participate in the market in some way, while only about 63% of female respondents participate.  To examine whether this difference is statistically significant, we will run a simple two-sample z-test for proportions.  In this case our null hypothesis is that the proportion is equal for the two groups.  Our alternative hypothesis is that the proportion of market participation is lower for women than it is for men.

In [None]:
# set significance level
alpha = 0.025


# compare market participation in both populations
female_participants, female_respondents = (9501, 15135)
male_participants, male_respondents = (8678, 11956)

successes = np.array([female_participants, male_participants])
samples = np.array([female_respondents, male_respondents])

# run our test
stat, p_val = proportions_ztest(count=successes, nobs=samples,  alternative='smaller')

# check results
print('z_stat: %0.3f, p_value: %0.3f' % (stat, p_val))
if p_val < alpha:
    print ("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
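
For readers who want to see what the test computes, the statistic can be reproduced by hand.  The sketch below uses the counts from the cell above and the pooled-proportion form of the two-sample z-test; it should agree with `proportions_ztest`, assuming that function's default pooled variance.

```python
import math

# counts copied from the cell above
female_participants, female_respondents = 9501, 15135
male_participants, male_respondents = 8678, 11956

p_female = female_participants / female_respondents
p_male = male_participants / male_respondents

# pooled proportion under the null hypothesis of equal rates
p_pooled = (female_participants + male_participants) / (female_respondents + male_respondents)

# standard error of the difference under the null
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / female_respondents + 1 / male_respondents))

z = (p_female - p_male) / se

# lower-tail p-value from the standard normal CDF, written with erfc
p_lower = 0.5 * math.erfc(-z / math.sqrt(2))

print('z = %.2f, one-sided p = %.3g' % (z, p_lower))
```

A negative z here means the female participation rate is below the male rate, which is the direction the `alternative='smaller'` argument tests for.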


It is clear from the results that there is a significant difference between the two proportions.   Based on this sample, it appears women participate in the market at a lower rate than men.

#### R2: 1.   What Are the Key Factors That Affect Stock Market Participation?


In [None]:
for name in df.columns:
    print(name)

In [None]:
df.describe()

In [64]:
df[df.columns[1:]].corr()['gender'][:].sort_values()

income                                  -0.149901
education                               -0.098227
good_at_math                            -0.032530
minimum_credit_card_payement_only       -0.025315
financial_satisfaction_score            -0.019560
good_at_daily_financial_matters         -0.017195
risk_tolerance                          -0.004872
comfort_asking_questions                -0.003697
difficult_to_pay_bills                   0.007248
taken_financial_education                0.007811
how_many_hours_of_financial_education    0.008242
financial_knowledge                      0.014366
finances_makes_me_anxious                0.014756
savings_account                          0.017264
know_retirement_amount                   0.018075
too_much_debt                            0.018204
finances_makes_me_stressed               0.018814
current_credit_record                    0.021596
employer_provided_retirement_account     0.022847
worry_about_retirement                   0.030526


In [None]:
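# restrict to numeric columns; the cat_* label columns hold strings and cannot be correlated
corrMatrix = df.corr(numeric_only=True)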
sns.heatmap(corrMatrix, annot=True)
plt.show()

#### R2: 2. Do these factors disproportionately affect women?



<span id="discussion" />

### Discussion

<span id="conclusion" />

### Conclusion

<span id="questions"/>

