 # <font color=black>2015-2017 National Survey of Family Growth Male Questionnaire: Report 2 Analysis</font> #

### <font color=blue>Sophia Atik, Applied Data Science, Spring 2019</font> ###

The dataset I will be using is the Male Questionnaire data from the CDC's 2015 to 2017 National Survey of Family Growth (NSFG). [__[1](https://www.cdc.gov/nchs/data/nsfg/NSFG_2015_2017_UserGuide_MainText.pdf)__ __[2](https://www.cdc.gov/nchs/data/nsfg/NSFG_2015-2017_MaleCAPIlite_forPUF.pdf)__ __[3](ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Datasets/NSFG/)__] The data contains demographic information about the men who participated, along with when and where they learned about birth control, whether they had a vasectomy and where they had the procedure, how many sexual partners they have, what their relationship is with those partners, what form of birth control they use, what form of birth control their partners use, etc. The dataset also contains information about the family and children in these men’s lives. However, I am most interested in the use of birth control by men with their partners.

Men have only two options when it comes to birth control: condoms and vasectomies. I am looking at this dataset to get a better understanding of what men in relationships choose to use as a form of birth control. What is great about this survey is that it also asks about the birth control that their partner is using, which makes it possible to check whether men simply rely on their female partners for contraception.[__[2](https://www.cdc.gov/nchs/data/nsfg/NSFG_2015-2017_MaleCAPIlite_forPUF.pdf)__] I am also hoping this dataset will show me the situations in which men are most likely to get a vasectomy (i.e., what the family dynamic is at that time, what the relationship status is, how many children he has, etc.).

To complete my analysis, I will use the answers from the 2015 to 2017 National Survey of Family Growth Male Questionnaire that paint the best picture of where each man is in life and how he and his partner decided which contraceptive method(s) to use.


#### <font color=blue>Importing My Dataset</font> ####

I had many problems loading my dataset. As you can see below, it technically did load. However, the numbers in the displayed table are spaced very oddly, with several values fused together into long strings of digits. I double-checked them against the questionnaire response dictionary provided, and the values in the table do not line up with it. Therefore, I knew something was wrong.

In [2]:
import pandas as pd    
import numpy as np
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt

with open('2015_2017_MaleData.dat','r') as f:
    next(f) # skip first row
    df1 = pd.DataFrame(l.rstrip().split() for l in f)

df1.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,269,270,271,272,273,274,275,276,277,278
0,70626518141818181112,66,50,1,13512015,12,5,18155151,11,3,...,,,,,,,,,,
1,70629123532323235,1,36,50,1,18512011,12,122015,19155111,32,...,,,,,,,,,,
2,70631517531717175,1,46,50,1,115,11,16155111,43,5,...,,,,,,,,,,
3,70636537533737375,1,26,50,1,1412,12121996122003,1721151213141,5,135,...,,,,,,,,,,
4,70640549524949495,1,41,11115,12111985,12,113255111,31,3,5,...,,,,,,,,,,


I kept trying alternative ways of importing the data so that it would produce an accurate and interpretable table. The NSFG also provides SAS and Stata files. However, they seem to be setup files only, and I could not get pandas to load them and display a table.

I also went to a different computer lab across campus to use the Stata program installed on the school's computers. My hope was that I could read the files in Stata and then export them as a .csv to be read here. However, I had no luck on that front either.

In [13]:

df = pd.read_sas('2015_2017_MaleSetup.sas')
df.head()

ValueError: unable to infer format of SAS file

In [18]:
df = pd.read_stata('2015_2017_MaleData.dat')

ValueError: Version of given Stata file is not 104, 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), or 118 (Stata 14)
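
Both errors are consistent with how these readers work: `pd.read_sas` only reads SAS data files (`.sas7bdat` or `.xpt`), and `pd.read_stata` only reads Stata `.dta` files, so neither can open a setup script or a raw `.dat` file. The fused digits in the first table suggest that the `.dat` file is fixed-width (each variable occupies fixed character positions, with no separator), in which case the setup files would be the place where those positions are listed. Below is a minimal sketch of reading such a file with `pd.read_fwf`. The three-column layout, the column names, and the toy rows are all made up for illustration; the real positions would have to be copied from the setup file.

```python
import io
import pandas as pd

# Toy stand-in for a fixed-width file: no delimiters, every field sits at fixed positions.
# Hypothetical layout: caseid = characters 0-4, age = characters 5-6, marstat = character 7.
# A blank in a field means the answer is missing.
raw = io.StringIO(
    "70626181\n"
    "7062923 \n"
    "70631175\n"
)

colspecs = [(0, 5), (5, 7), (7, 8)]    # half-open [start, end) character ranges
names = ["caseid", "age", "marstat"]   # hypothetical variable names

toy = pd.read_fwf(raw, colspecs=colspecs, names=names, header=None)
print(toy)
```

Splitting each line on whitespace, as in the first attempt, cannot recover these fields, because adjacent values and blank (missing) answers leave no reliable separator between columns.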

According to the resources provided at __[the NSFG website](https://www.cdc.gov/nchs/nsfg/nsfg_2015_2017_puf.htm)__, it seemed like the data might need to be weighted. I was able to load the weight file. However, I do not know how to interpret it and therefore cannot determine what my next step would be.

In [26]:
with open('2013_2017_2011_2017_Malewgt.dat','r') as f:
    next(f) # skip first row
    df2 = pd.DataFrame(l.rstrip().split() for l in f)

df2.head()

Unnamed: 0,0,1,2
0,50003,1573.80827829814,
1,50006,9889.59443179315,
2,50007,2409.20267649558,
3,50009,16521.6660364234,
4,50010,1304.93333475325,
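
One possible reading of the table above, not confirmed against the NSFG documentation here, is that column 0 is a respondent ID and column 1 is that respondent's sampling weight (roughly, how many people in the population that respondent stands for). Under that assumption, a weighted estimate could be sketched as follows. The column names `caseid`, `weight`, and `vasectomy`, and all of the numbers, are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical weight table: one row per respondent.
weights = pd.DataFrame({
    "caseid": [1, 2, 3, 4],
    "weight": [1500.0, 10000.0, 2500.0, 16000.0],
})

# Hypothetical survey answers: 1 = had a vasectomy, 0 = did not.
answers = pd.DataFrame({
    "caseid": [1, 2, 3, 4],
    "vasectomy": [1, 0, 0, 1],
})

# Attach each respondent's weight to his answers using the shared ID.
merged = answers.merge(weights, on="caseid", how="left")

# Unweighted share: every respondent counts once.
unweighted = merged["vasectomy"].mean()

# Weighted share: every respondent counts in proportion to his weight.
weighted = np.average(merged["vasectomy"], weights=merged["weight"])

print(unweighted, weighted)
```

With these toy numbers the two shares differ (0.5 unweighted versus about 0.58 weighted), which is why the choice to weight matters for any population-level claim.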


### <font color=blue>Creating A Dictionary</font> ###

In order to properly decode the data, I will use a dictionary to clarify the value definitions. The NSFG provided a dictionary file that I am attempting to import as a .txt file.

In [28]:
import pandas as pd

df3 = pd.read_csv('Dictionary.txt', header = None)

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 2
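
This is the error `pd.read_csv` raises when a line contains more comma-separated fields than the first line led it to expect; here, line 3 of the file presumably contains a comma. A minimal sketch, using made-up stand-in text rather than the real `Dictionary.txt`, that reproduces the error and then falls back to reading the text as plain lines:

```python
import io
import pandas as pd

# Made-up stand-in for a free-text dictionary file; line 3 contains a comma.
text = "VARIABLE LIST\nfirst entry\nsecond entry, with a comma\n"

# read_csv treats commas as column separators, so line 3 looks like it has two fields.
try:
    pd.read_csv(io.StringIO(text), header=None)
    parse_failed = False
except pd.errors.ParserError:
    parse_failed = True

# Reading the text line by line avoids any assumption about separators.
lines = io.StringIO(text).read().splitlines()
raw_lines = pd.DataFrame({"line": lines})

print(parse_failed)
print(raw_lines)
```

Once the lines are in a DataFrame, they can be inspected to work out how variable names and value labels are actually laid out in the file before writing a real parser.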


(Below are my experiments with various snippets of code to try to create a dictionary.)

In [29]:
df.to_dict('series')
{'col1': row1    1
         row2    2
Name: col1, dtype: int64,
'col2': row1    0.50
        row2    0.75
Name: col2, dtype: float64}

NameError: name 'Dictionary' is not defined

In [11]:
dic_custom = {"key": 0, "another": 1}

In [3]:
type(dic_custom)

dict

In [4]:
dic_custom1 = {
    "0": "age",
    "1": "gender",
}

In [5]:
type(dic_custom1)

dict
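
The same kind of dictionary can be put to work in two ways: one dictionary to give the numbered columns real names, and one to translate coded answers into labels. This is only a sketch. The column names and the codes (1 = "Yes", 5 = "No") are placeholders and would need to be checked against the NSFG dictionary file.

```python
import pandas as pd

# Toy table with numbered columns, like the one produced by the import step.
coded = pd.DataFrame([[101, 1], [102, 5], [103, 5]])

# Hypothetical mapping from column position to variable name.
column_names = {0: "caseid", 1: "vasectomy"}

# Hypothetical mapping from answer code to readable label.
value_labels = {1: "Yes", 5: "No"}

# rename() applies the column dictionary; map() applies the value dictionary.
decoded = coded.rename(columns=column_names)
decoded["vasectomy_label"] = decoded["vasectomy"].map(value_labels)

print(decoded)
```

Any code that is missing from `value_labels` becomes `NaN` after `map()`, which makes unexpected codes easy to spot.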

### <font color=blue>Analysis Of The Dataset</font> ###

Below are the code and descriptions of the types of analysis I would run on my dataset. The goal is to understand why and when men elect to get vasectomies. Further analysis will be completed once I am able to load the dataset in an understandable way.

In [None]:
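# Basic statistics about my dataset.
# Python names cannot start with a digit, so this uses df1, the DataFrame from the import step.
# "vasectomy" is a placeholder column name until the data loads with real variable names.
pd.DataFrame(df1.vasectomy.describe())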

In [None]:
# A boxplot showing the age distribution of men with and without a vasectomy.
# "vasectomy" and "age" are placeholder column names; df1 is the DataFrame from the import step.
f, ax = plt.subplots(figsize=(6.5, 6.5))
sns.boxplot(x="vasectomy", y="age", data=df1, fliersize=0.5, linewidth=0.75, ax=ax)

In [None]:
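# A bar chart showing how many men rely on the different types of birth control.
# Birth control type is a category, so the answers are counted rather than binned like a histogram.
# "birthcontrol" is a placeholder column name; df1 is the DataFrame from the import step.
df1['birthcontrol'].value_counts().plot(kind='bar')
plt.xlabel('Types of Birth Control')
plt.ylabel('Number of Men')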
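
Until the real data loads, the analysis steps can be rehearsed on a small made-up table. The sketch below summarizes age by vasectomy status and counts the birth control methods. The column names and every value in it are invented for illustration and say nothing about the actual survey results.

```python
import pandas as pd

# Invented toy data; not taken from the NSFG.
toy = pd.DataFrame({
    "age": [24, 31, 38, 42, 45, 29],
    "vasectomy": ["No", "No", "Yes", "Yes", "Yes", "No"],
    "birthcontrol": ["condom", "partner pill", "vasectomy",
                     "vasectomy", "vasectomy", "condom"],
})

# Summary statistics of age, computed separately for each vasectomy group.
age_by_group = toy.groupby("vasectomy")["age"].describe()

# Number of men reporting each birth control method.
method_counts = toy["birthcontrol"].value_counts()

print(age_by_group)
print(method_counts)
```

The grouped summary shows the same information a boxplot would draw, and `method_counts` holds the numbers a bar chart of birth control types would display.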

# Merging Datasets #

Because I could not properly load my main dataset, I cannot actually perform the merge. However, I found data that I would merge it with. I would use __[data](https://www.kaggle.com/census/census-bureau-usa)__ from the US census to see whether the men who elect to get vasectomies tend to come from similar locations. My "foreign key" would be age (15-49 year old males). I would then plot the men in that age range who have had a vasectomy on a spatial visualization map to analyze any trends.


In [None]:
import bq_helper
from bq_helper import BigQueryHelper
# https://www.kaggle.com/sohier/introduction-to-the-bq-helper-package
census_data = bq_helper.BigQueryHelper(active_project="bigquery-public-data",
                                   dataset_name="census_bureau_usa")

### <font color=blue>Deploying A Right Join Merge Of The Two Datasets</font> ###

In [None]:
# census_df is a placeholder for the census table as a DataFrame; 'age' assumes both tables have an age column.
df1.merge(census_df, how='right', on='age')
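
To make the right join concrete, here is a small sketch on invented tables. The names `survey` and `census`, their columns, and all values are hypothetical. A right join keeps every row of the right-hand table (the census) and fills in survey columns where an age matches.

```python
import pandas as pd

# Invented survey rows: two respondents aged 25, one aged 40.
survey = pd.DataFrame({"age": [25, 25, 40], "vasectomy": [0, 1, 1]})

# Invented census rows: one population count per age.
census = pd.DataFrame({"age": [25, 40, 49], "population": [1000, 900, 800]})

# how="right" keeps all census ages; indicator=True adds a _merge column
# recording whether each row was matched or came from the right table only.
merged = survey.merge(census, how="right", on="age", indicator=True)

print(merged)
```

Age 49 has no survey respondent, so its `vasectomy` value is missing and `_merge` reads `right_only`. Because age is not unique per respondent, each census row is repeated once for every man of that age, which is worth keeping in mind when counting rows after the merge.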

(I would then look for code examples to figure out the exact syntax I would need for my project to draw a map of the USA that highlights the areas where a high number of vasectomy procedures are performed.)

In [None]:
import geopandas as gpd
import pandas as pd
import pickle
import matplotlib.pyplot as plt

# Overall Conclusions #

(Once I am able to import my data set, I will have a more robust conclusion to describe here.)

This information could be very valuable to someone who needs to know men’s attitudes toward birth control. This information would be important for someone marketing a new form of birth control to men and knowing which demographic and psychographic group of men to target. With the current contraceptive market being dominated by contraceptive methods for women, companies developing a contraceptive for men will need as much insight as possible to obtain significant market share. 

*My apologies for not having more to submit. I spent a huge amount of time trying to load my data with no luck. Although I have limited experience in coding, I have enjoyed and learned a lot from my experience so far (even though I have encountered lots of trouble). Please let me know what you recommend as my next steps.*