# DATA 512 FINAL PROJECT

A successful report will take the form of a well-written, well-executed research study document (plus supplementary resources, see the full requirements below) contained in a folder within your GitHub repository, the same folder that holds your final project plan (Assignment 5).

Your previous deliverables for the final project proposal and plan (Assignments 4 and 5) are part of this report: you are expected to build your report by adding more text (and, obviously, code!) to your existing project plan Jupyter Notebook.

## Structure of the report

The report should be well structured, with headings and, where useful, sub-headings. Most reports should have a structure similar to this one:

- Introduction
- Background or Related Work
- Research questions or hypotheses
- Methodology
- Findings
- Discussion (including Limitations and Implications)
- Conclusion
- References

## Introduction / Motivation

Start with an “introduction” or “motivation” section that describes what the study is about and why it is important or interesting, and that notes any existing research in this area and other related work, such as articles by journalists.

Describe the dataset and its license or terms of use (unless these are described in the README).

## Background or Related Work

## Research questions or hypotheses

State explicit research questions or hypotheses.

## Methodology

Describe the methods used, and explain why they are appropriate for the research questions.


## Data Exploration

In [1]:
# change the current working directory - use relative references later
import os
os.chdir('/home/jovyan/data-512/data-512-final/data-512-final')

Since multiple files will be downloaded from the internet, we'll define a function to retrieve the data.

In [2]:
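def get_data(zip_file_url):
    """
    Download a zip archive and extract its contents.

    Input:  URL of the zip file to download.
    Output: Extracted data in the './raw_data' folder.
    """
    import io
    import zipfile

    import requests

    r = requests.get(zip_file_url)
    if r.ok:
        print('Request Successful.')
    else:
        # stop here: the body of a failed response is not a valid zip archive
        print('Error submitting request.')
        r.raise_for_status()

    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall('./raw_data')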

Call the function once per file to save the extracted contents in the `raw_data` folder.

In [4]:
%%capture
import pandas as pd
import numpy as np

# download state by state data, store this in a data frame
BASE_URL = 'https://www.usfinancialcapability.org/downloads/'

file_list = ['NFCS_2018_State_by_State_Data_Excel.zip', 'NFCS_2018_Inv_Data_Excel.zip']

for filename in file_list:
    get_data(zip_file_url = BASE_URL + filename)
        
# read the state by state data into a dataframe
df_sbs = pd.read_csv('raw_data/NFCS 2018 State Data 190603.csv')

# read the investor data into a dataframe
df_inv = pd.read_csv('raw_data/NFCS 2018 Investor Data 191107.csv')

Request Successful.
Request Successful.
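
The extraction half of `get_data` can be checked without touching the network. The sketch below is a minimal illustration rather than part of the analysis: `extract_zip_bytes` is a hypothetical helper, and the small archive built in memory stands in for the downloaded `r.content`.

```python
import io
import tempfile
import zipfile
from pathlib import Path


def extract_zip_bytes(payload, out_dir):
    """Extract an in-memory zip archive into out_dir and return member names."""
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        archive.extractall(out_dir)
        return archive.namelist()


# Build a tiny archive in memory (a stand-in for the downloaded bytes).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "NFCSID,A3\n1,2\n")

# Extract into a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    extracted_names = extract_zip_bytes(buffer.getvalue(), tmp)
    extracted_text = (Path(tmp) / "example.csv").read_text()

print(extracted_names)
```

Listing the member names this way is also a quick check that the file names passed to `pd.read_csv` match what the archive actually contains.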


Now we can check our import results.

In [5]:
# Check the import results.
display(df_sbs.head())
display(df_inv.head())

Unnamed: 0,NFCSID,STATEQ,CENSUSDIV,CENSUSREG,A3,A3Ar_w,A3B,A4A_new_w,A5_2015,A6,...,M42,M6,M7,M8,M31,M9,M10,wgt_n2,wgt_d2,wgt_s3
0,2018010001,48,9,4,2,5,11,1,5,4,...,,1,3,98,98,98,1,0.683683,0.519642,1.095189
1,2018010002,10,5,3,2,2,8,1,6,1,...,,1,3,98,3,1,98,0.808358,2.516841,0.922693
2,2018010003,44,7,3,2,2,8,1,6,1,...,,1,1,98,98,1,98,1.021551,1.896192,0.671093
3,2018010004,10,5,3,2,1,7,1,6,2,...,7.0,98,98,4,4,2,98,0.808358,2.516841,0.922693
4,2018010005,13,8,4,1,2,2,1,6,1,...,,1,3,98,2,1,98,0.448075,0.614733,1.232221


Unnamed: 0,NFCSID,A1,A2,A3,B2_1,B2_2,B2_3,B2_4,B2_5,B2_7,...,G12,G13,H2,H3,WGT1,S_Gender,S_Age,S_Ethnicity,S_Education,S_Income
0,2018010042,2,1,1,98,98,1,1,98,2,...,3,4,1,2,0.910655,2,3,1,1,2
1,2018010047,1,1,1,1,1,1,1,1,2,...,2,2,1,1,1.566608,1,1,2,2,2
2,2018010050,2,1,1,1,1,2,2,2,2,...,98,98,1,2,0.609443,2,3,1,2,2
3,2018010051,1,1,1,2,2,1,2,1,2,...,1,4,1,2,0.609443,1,3,1,2,2
4,2018010053,1,1,1,1,1,1,1,2,1,...,3,4,1,1,0.609443,1,3,1,2,3
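
The preview shows the value 98 in many columns, and the labeling code in this notebook treats 98 and 99 as unknown responses. Assuming that convention holds for the columns of interest (something to confirm against the survey codebook), a minimal sketch of marking those codes as missing could look like this. The toy frame and the `NON_RESPONSE_CODES` name are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a few survey columns; the values are made up.
toy = pd.DataFrame({"M6": [1, 98, 1], "M7": [3, 98, 99]})

# Assumed non-response codes (check the codebook before relying on this).
NON_RESPONSE_CODES = [98, 99]

# Replace the assumed non-response codes with NaN.
cleaned = toy.replace(NON_RESPONSE_CODES, np.nan)

# Share of missing answers per column.
missing_share = cleaned.isna().mean()
print(missing_share)
```

Reporting the missing share per question makes it easier to judge how much weight a chart built on that question can bear.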


In [None]:
import pandas as pd
import numpy as np
import os

# import libraries for visualizations
import matplotlib.pyplot as plt
import seaborn as sns

# magic code for viewing plots using jupyter notebooks:
%matplotlib inline

---
# DEMOGRAPHICS
A higher number of respondents identified as female.

In [None]:
# create a nicer column for gender
df_sbs['Gender'] = df_sbs['A3'].apply(lambda x: 'Male' if x == 1 else 'Female')

# plot the gender distribution
sns.catplot(x="Gender", 
            kind="count", 
            data=df_sbs,
            palette=("Blues"))
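
The lambda above labels every value other than 1 as 'Female', including any missing or unexpected code. A more defensive sketch, assuming code 2 means female (implied by the else-branch above but not confirmed against the codebook), maps known codes explicitly and leaves everything else missing. The toy data and the `GENDER_LABELS` name are illustrative.

```python
import pandas as pd

# Toy stand-in for the A3 column; 7 plays the role of an unexpected code.
toy = pd.DataFrame({"A3": [1, 2, 2, 1, 2, 7]})

# Assumed code-to-label mapping.
GENDER_LABELS = {1: "Male", 2: "Female"}

# Codes absent from the mapping become NaN instead of 'Female'.
toy["Gender"] = toy["A3"].map(GENDER_LABELS)

# Share of each label among respondents with a known code.
gender_share = toy["Gender"].value_counts(normalize=True)
print(gender_share)
```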

---
# PARTICIPATION
At first glance, a large share of female respondents appear to hold an investment account.

In [None]:
# who has access to an investment account

# create a clean column with labels
def investment_accounts(score):
    if score == 1:
        return 'Yes'
    elif score == 2:
        return 'No'
    else:
        return "Don't know"
        
df_sbs['Hold_Investment_Account'] = df_sbs['C1_2012'].apply(investment_accounts)

# plot account holding by gender
g3 = sns.catplot(x="Hold_Investment_Account", 
            kind="count",
            hue='Gender',
            data=df_sbs,
            order=["Yes", "No", "Don't know"],
            palette=("Blues"))
g3.fig.suptitle('Do you hold an investment account?') # can also get the figure from plt.gcf()

About 53.5% of female respondents and 61.2% of male respondents hold an account.

It is difficult to tell whose accounts these actually are. Much of the data is unknown: the partner's gender is not recorded, and many respondents gave an unknown answer or no answer to this question.
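
Shares like these can be computed per gender with a group-by. The sketch below uses made-up rows, not the survey data, and shows both an unweighted share and a weighted one. Using `wgt_n2` as the weight is an assumption: the column appears in the data preview, but which weight suits this comparison should be confirmed in the survey documentation. "Don't know" answers stay in the denominator here.

```python
import pandas as pd

# Made-up rows that mimic the columns used in this notebook.
toy = pd.DataFrame({
    "Gender": ["Female", "Female", "Female", "Female", "Male", "Male"],
    "Hold_Investment_Account": ["Yes", "No", "Yes", "Don't know", "Yes", "No"],
    "wgt_n2": [1.0, 2.0, 1.0, 1.0, 0.5, 1.5],
})

# True where the respondent reports holding an account.
holds_account = toy["Hold_Investment_Account"] == "Yes"

# Unweighted share of "Yes" answers within each gender.
unweighted_share = holds_account.groupby(toy["Gender"]).mean()

# Weighted share: weight of "Yes" answers over total weight, per gender.
weighted_yes = (toy["wgt_n2"] * holds_account).groupby(toy["Gender"]).sum()
total_weight = toy["wgt_n2"].groupby(toy["Gender"]).sum()
weighted_share = weighted_yes / total_weight

print(unweighted_share)
print(weighted_share)
```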

In [None]:
# create a clean column with labels
def who_owns_account(gender, owner):
    # owner codes are compared as strings with ==;
    # `owner in ('3')` would be a substring test on '3', not tuple membership
    if owner == '3' or (gender == 'Female' and owner == '1'):
        # note: code '3' gets this label regardless of the respondent's gender
        return 'Female Respondent Owns Account'
    elif gender == 'Female' and owner == '2':
        return "Female's Partner Owns Account"
    elif gender == 'Male' and owner == '2':
        # the partner's gender is not recorded
        return 'Unknown'
    elif gender == 'Male' and owner == '1':
        return 'Male Owns Account'
    elif owner in ('98', '99', ' '):
        return 'Unknown'

df_sbs['Account_Owner'] = df_sbs.apply(lambda x: who_owns_account(x['Gender'], x['C2_2012']), axis=1)

# plot account ownership for respondents who hold an account
g3 = sns.catplot(y="Account_Owner",
            kind="count",
            data=df_sbs[df_sbs['C1_2012'] == 1],
            order=["Female Respondent Owns Account", 'Male Owns Account', "Female's Partner Owns Account", "Unknown"],
            palette=("Blues"))
g3.fig.suptitle('Who owns the account?')

In [None]:
df_sbs[(df_sbs['Gender'] =='Female')][['Gender','Account_Owner']].value_counts(normalize=True)

---
# ATTITUDES

In [None]:

# create a clean column with labels
def financial_stress(score):
    if score <= 3:
        return 'Disagree'
    elif score == 4:
        return 'Neutral'
    elif score <= 7:
        return "Agree"
    else:
        return "Don't know."
        
df_sbs['Stressed'] = df_sbs['J33_41'].apply(financial_stress)

# plot stress responses by gender
g2 = sns.catplot(x="Stressed", 
            kind="count",
            hue='Gender',
            data=df_sbs,
            palette=("Blues"))
g2.fig.suptitle('Discussing My Finances Makes Me Feel Stressed') # can also get the figure from plt.gcf()

41% of male respondents report feeling stressed when discussing their finances, compared with 49% of female respondents.
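
Percentages like these can be derived by binning the scores and cross-tabulating by gender. The sketch below is one possible approach on made-up rows, assuming the scores are integers from 1 to 7 plus non-response codes such as 98. It uses `pd.cut` in place of the hand-written `financial_stress` function, with the same cut points.

```python
import pandas as pd

# Made-up rows; 98 plays the role of a non-response code.
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "J33_41": [2, 6, 7, 4, 98],
})

# Bins are right-closed: (0, 3] Disagree, (3, 4] Neutral, (4, 7] Agree.
binned = pd.cut(toy["J33_41"], bins=[0, 3, 4, 7],
                labels=["Disagree", "Neutral", "Agree"])

# Scores outside the bins (such as 98) come back as NaN; label them explicitly.
toy["Stressed"] = (binned.cat.add_categories(["Don't know."])
                         .fillna("Don't know.")
                         .astype(str))

# Row-normalized table: share of each answer within each gender.
stress_share = pd.crosstab(toy["Gender"], toy["Stressed"], normalize="index")
print(stress_share)
```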

## Findings

## Discussion (including Limitations and Implications)

Summarize the findings and what they mean (implications for business, design, further research, public policy, etc.). Explicit “Discussion”, “Conclusion”, and “Limitations” sections are great, but there are other ways to organize this information too. It just needs to be there in some form.

## Conclusion

## References