# Introduction to Jupyter Notebook

This is a markdown cell. It lets me wrap text around my code and output.

In [17]:
# This is a code cell. Since this is a Python notebook, I can write Python code here, hit the play button above,
# and the notebook will run it. Any output the code produces appears below the cell.
x = 2
y = 3
z = x + y
print(z)

5


## Jupyter for the CIELO/HINTS tool

The CIELO tool will serve two functions:
1. A searchable repository of the work the HSR team has done to replicate papers that use the HINTS dataset.
2. An execution environment where researchers can run the HSR team's replications on their own HINTS-related data.

Jupyter Notebook is a possible platform for the second function. 

### Using STATA in a Jupyter Notebook
A user-created package called *ipystata* lets the user write and execute STATA code in a Python notebook.

In [5]:
# import python STATA package
import ipystata
from ipystata.config import config_stata 
# The package calls a locally installed copy of STATA to run STATA code. The config_stata function
# tells the notebook where to find that installation.
config_stata('/Applications/Stata/StataMP.app/Contents/MacOS/StataMP')

IPyStata is loaded in batch mode.
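
Because `config_stata` is given a hard-coded path, it can help to confirm that the file exists before configuring the package. The following is a minimal sketch, not part of ipystata: the helper `first_existing_path` and the second candidate path are hypothetical, and only the first path comes from the cell above.

```python
import os


def first_existing_path(candidates):
    """Return the first candidate path that exists on disk, or None."""
    for path in candidates:
        if os.path.exists(path):
            return path
    return None


# The first entry is the path used in the cell above.
# The second entry is a hypothetical alternative install location.
candidates = [
    "/Applications/Stata/StataMP.app/Contents/MacOS/StataMP",
    "/Applications/Stata/StataSE.app/Contents/MacOS/StataSE",
]

stata_path = first_existing_path(candidates)
if stata_path is None:
    print("No STATA executable found at any candidate path.")
else:
    print("STATA found at", stata_path)
    # In a notebook where ipystata is imported, the next step would be:
    # config_stata(stata_path)
```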


With ipystata installed, I can now open a STATA session from the notebook using the cell magic *%%stata*. Any code below this call in the same code cell is run in the STATA session. The STATA log and graphic output appear as output in the notebook.

In [23]:
%%stata -gr
set obs 20
gen x = runiform()
gen y = runiform()
mean x y
scatter x y


number of observations (_N) was 0, now 20

Mean estimation                   Number of obs   =         20

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
           x |   .4687793   .0651381      .3324437    .6051148
           y |   .5744606   .0701596      .4276149    .7213062
--------------------------------------------------------------
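
The `Std. Err.` and `[95% Conf. Interval]` columns above can be checked by hand. The sketch below assumes that the standard error is the sample standard deviation divided by the square root of n, and that the interval uses a Student's t critical value with n - 1 = 19 degrees of freedom (about 2.093). Both are assumptions about how the table is computed, although they do reproduce the printed interval for `x`. Python's random number generator differs from STATA's, so the simulated draws will not match the numbers above.

```python
import math
import random
import statistics

# Two-sided 95% critical value of Student's t with 19 degrees of freedom.
T_CRIT_19 = 2.093


def mean_and_se(values):
    """Return the sample mean and its standard error (sd / sqrt(n))."""
    n = len(values)
    mean = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(n)
    return mean, se


def conf_interval(mean, se, t_crit):
    """Return the (lower, upper) bounds of mean +/- t_crit * se."""
    return mean - t_crit * se, mean + t_crit * se


# Simulate 20 uniform draws, as the STATA cell does with runiform().
random.seed(0)
x = [random.random() for _ in range(20)]
sim_mean, sim_se = mean_and_se(x)
print("simulated:", sim_mean, sim_se, conf_interval(sim_mean, sim_se, T_CRIT_19))

# Rebuild the interval for x from the mean and Std. Err. printed above.
lower, upper = conf_interval(0.4687793, 0.0651381, T_CRIT_19)
print("rebuilt from the table:", round(lower, 4), round(upper, 4))
```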


## Limitations
### Limitations of ipystata
- STATA can only be run in a Jupyter Notebook through a user-created package, and no support is available for this package.
- To run STATA, the notebook calls a locally installed instance of STATA. We want users to be able to use STATA online.
- Every time the Notebook sees %%stata, a new STATA session is opened, and any STATA process started in the previous code cell is terminated. As a consequence, all code for a STATA session must be written in a single cell. This severely limits the usefulness of Jupyter Notebook's features.
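
One possible way to live with the single-cell constraint is to keep the STATA code in separate Python strings and join them into the text of one `%%stata` cell. This is only a sketch of a workaround, not a feature of ipystata: `build_stata_cell` is a hypothetical helper, and the `-gr` flag is copied from the earlier cell without further assumptions about what it does.

```python
def build_stata_cell(fragments, graphs=False):
    """Join STATA code fragments into the text of a single %%stata cell.

    fragments: list of strings, each holding one piece of STATA code.
    graphs:    if True, use the "%%stata -gr" form seen in the earlier cell.
    """
    magic = "%%stata -gr" if graphs else "%%stata"
    # Drop empty fragments and trim the surrounding blank lines of the rest.
    body = "\n".join(f.strip() for f in fragments if f.strip())
    return magic + "\n" + body


setup = """
set obs 20
gen x = runiform()
gen y = runiform()
"""
analysis = "mean x y"

# The printed text can be pasted into one code cell so that every
# fragment runs in the same STATA session.
print(build_stata_cell([setup, analysis]))
```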

### Limitations of Jupyter Notebooks
- It is possible to host interactive Jupyter Notebooks online, but the feasibility of doing so for this project is unknown.
- A code cell can only be run as a whole; there is no way to run part of a cell.

In [32]:
%%stata 
local two  2
display `red'

global two 2
display $red

%%stata
display `red'


red not found
r(111);

end of do-file
r(111);


In [28]:
%%stata 
mean x


no variables defined
r(111);

end of do-file
r(111);


In [29]:
%%stata
display $red





## Replication analysis
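
The replication cell that follows relies heavily on STATA's `recode`, with rules such as `(1=1 "Male") (2=2 "Female") (-9/0 = .)`. As a rough illustration of how such a rule list behaves, the Python sketch below applies the first matching rule, treats ranges as inclusive, and leaves unmatched values unchanged. The function `recode` and the rule format are hypothetical stand-ins, and `None` stands in for STATA's missing value `.`.

```python
def recode(value, rules):
    """Apply the first matching (low, high, new_value) rule to value.

    Ranges are inclusive at both ends. Values that match no rule are
    returned unchanged, and missing input (None) stays missing.
    """
    if value is None:
        return None
    for low, high, new_value in rules:
        if low <= value <= high:
            return new_value
    return value


# Mirrors: recode genderc (1=1 "Male") (2=2 "Female") (-9/0 = .), generate(sex)
SEX_RULES = [(1, 1, 1), (2, 2, 2), (-9, 0, None)]
SEX_LABELS = {1: "Male", 2: "Female"}

raw = [1, 2, -9, -5, 0, 2]
sex = [recode(v, SEX_RULES) for v in raw]
print(sex)  # [1, 2, None, None, None, 2]
print([SEX_LABELS.get(v, "missing") for v in sex])
```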

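Phase 4 of the same cell builds a staggered set of replicate weights: with 50 replicates per year and k years, each observation gets 50 × k replicate weights, using its own replicates in its year's block and its final weight everywhere else. The sketch below reproduces that indexing on a toy scale. The function names are hypothetical, and the variance function assumes the usual replicate-weight jackknife form (squared deviations from the full-sample estimate, scaled by the multiplier), which is an assumption about what `vce(jackknife) mse` computes rather than something stated in this notebook.

```python
def staggered_replicates(year_index, final_weight, own_replicates, n_years):
    """Return the combined replicate weights for one observation.

    year_index:     1-based position of the observation's survey year
                    (nvals2 in the STATA cell).
    final_weight:   the observation's full-sample weight (nfinalwt).
    own_replicates: its replicate weights within its own year (nfwgt*).
    n_years:        number of survey years kept in the analysis (k).
    """
    per_year = len(own_replicates)
    # Start with the final weight in every replicate slot ...
    combined = [final_weight] * (per_year * n_years)
    # ... then overwrite the block that belongs to the observation's year.
    start = (year_index - 1) * per_year
    combined[start:start + per_year] = own_replicates
    return combined


def jackknife_variance(full_estimate, replicate_estimates, multiplier):
    """Assumed form: multiplier * sum of squared deviations from the full estimate."""
    return multiplier * sum((r - full_estimate) ** 2 for r in replicate_estimates)


# Toy example: 2 replicates per year instead of 50, and 3 survey years.
obs = staggered_replicates(year_index=2, final_weight=10.0,
                           own_replicates=[9.5, 10.5], n_years=3)
print(obs)  # [10.0, 10.0, 9.5, 10.5, 10.0, 10.0]
print(jackknife_variance(1.0, [1.1, 0.9], 0.5))
```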
In [26]:
%%stata
cd "/Users/sethscarborough/Desktop/ HINTS/"
use fullhintsdestring, replace
rename *, lower
//log using STATATemp/12,replace     // ACTIVATE LOG

****************************************************************************
*** Phase 1: STANDARDIZE ALL THE VARIABLES
*** The paper may use variables that don't exactly align to the way the data
*** was collected. The first step is to ensure that variables are presented
*** in the same manner as they were in the original paper
****************************************************************************
*** These lines take existing variables (in this case genderc) to create new
*** variables (in this case sex).
***
***V0: Example - Sex
***    recode gendern (1=1 "Male") (2=2 "Female") (-9/0 = .), generate(sex)
***    replace sex = genderc if genderc != . // when gender is fixed, this line goes away
***    label variable sex "Sex"
***    tab sex datayr
***
*** RECODE simultaneously recodes and creates a new variable, keeping the original variable intact. 
*** Note the recode has (-9/0 = .) which means from "-9 to 0, mark those as missing"
*** 
*** REPLACE is used, when necessary, when identical variables are called different things.
*** This is a workaround while we wait to repair the dataset and will be deleted eventually.
***
*** LABEL is used to relabel the variable - otherwise it will show as RECODE of varname
*** 
*** TAB is used as a final tabulation to make sure that there are no other variable definitions out of range
****************************************************************************
***
*** BASED ON Blake, Ottenbacher, Finney Rutten, Grady, Kobrin, Jacobson, and Hesse (2015)
*** Paper Number: 12
*** Using HINTS4c3
*** 

***V1: Sex
recode genderc (1=1 "Male") (2=2 "Female") (-9/0 = .), generate(sex)
label variable sex "Sex"
tab sex datayr

***V2: Age Replication
recode selfage (18/34=1 "18-34") (35/49=2 "35-49") (50/65=3 "50-64") (66/10000 = 4 ">65") (-9/0 = 0), gen (age3)
tab age3 datayr
	***V2.1: Age Adjusted
	recode age (18/34=1 "18-34") (35/49=2 "35-49") (50/65=3 "50-64") (66/10000 = 4 ">65") (-9/0 = .), gen (age2)
	label variable age2 "Age"
	tab age2 datayr
	
***V3: Education
recode educa (1=1 "Less than High School") (2=2 "12 years or completed High School") (3=3 "Technical, vocational, or some college") (4=4 "College Graduate or postgraduate") (nonmissing=.), generate(educ)
label variable educ "Education"
tab educ datayr

***V4: Race (Race_cat2 is not in HINTS 2 and 3, but other race variables should be able to be used. For this article, race_cat2 is definitely the variable used)
generate racecattmp = hd09race1
replace racecattmp=99 if hd09race1!=. & hd09race2!=.
recode racecattmp (11=1 "White") (12=2 "Black") (13/15=3 "Other") (99=4 "Multiple races") (-9/0=.), gen(racecat)
recode race_cat2 (11=1 "White") (12=2 "Black") (14 31/54=3 "Other") (16=4 "Multiple races") (-9/0=.), gen(race_cat3)
replace racecat = race_cat3 if race_cat3!=.
tab racecat datayr

***V5:Hispanic (nothisp is not in HINTS 2 and 3, but other race variables should be able to be used. For this article, nothisp is definitely the variable used)
recode nothisp (-9=.) (1=1 "Not Hispanic") (2=2 "Hispanic"), gen (hisp)
label variable hisp "Hispanic ethnicity"
recode hispanic (1=2) (2=1) (-9/0 9=.)
replace hisp = hispanic if hispanic!=.
tab hisp datayr

***V6: Children in Household
recode childreninhh (-9/-1 98=.) (0=0 "0") (1/30=1 ">=1"), gen (children)
label variable children "Children under 18 in household"
tab children datayr

***V7: Income
recode incomeranges (-9/-1=.) (1/3=1 "0-$19,999") (4/5=2 "$20,000-$49,999") (6=3 "$50,000-$74,999") (7/9=4 ">=$75,000"), gen (income2)
label variable income2 "Income"
tab income2

***V8: Rurality
recode ruc2003 (4/9 = 1 "Rural") (1/3=2 "Urban"), gen(metro)
label variable metro "Metropolitan area"
tab metro datayr

***V9: Health Insurance
gen Healthinstmp=healthinsurance
replace Healthinstmp=. if (datayr==2011|datayr==2013)&healthinsurance==2&(hccoverage_insurance!=2|hccoverage_private!=2|hccoverage_medicare!=2 ///
|hccoverage_medicaid!=2|hccoverage_tricare!=2|hccoverage_va!=2|hccoverage_ihs!=2|hccoverage_other!=2)
recode Healthinstmp (1=1 "yes")(2=2 "no")(-9/0 9=.), gen (Healthins)
drop *tmp
tab Healthins datayr

***V10: useinternet
recode useinternet (-9/0 9=.) (1=2 "Yes") (2=1 "No"), gen(intuse)
label variable intuse "Internet use"
tab intuse datayr

***V11: heardhpv
recode heardhpv (-9/0=.) (9=.) (1=2 "Yes") (2=1 "No"), gen(heard)
label variable heard "Heard of HPV"
tab heard datayr
recode heard (1 = 0) (2 = 1),gen(heard01)

***V12: heardhpvvaccine
recode heardhpvvaccine2 (-9/0=.) (9=.) (1=2 "Yes") (2=1 "No"), gen(heardvac)
label variable heardvac "Heard of HPV vaccine"
tab heardvac datayr
recode heardvac (1 = 0) (2 = 1),gen(heardvac01)

***V13: hpvcausecancer_cervical
recode hpvcausecancer_cervical (-9/0=.) (1=2 "Yes") (2=1 "No") (3=3 "Not sure"), gen(hpvcervical)
label variable hpvcervical "HPV causes cancer"
tab hpvcervical datayr
recode hpvcervical (3=.), gen(hpvcervical2)
recode hpvcervical (1 3= 0) (2 = 1),gen(hpvcervical01)
*recode hpvcervical2 (1= 0) (2= 1),gen(hpvcervical02)

***V14: hpvstd
recode hpvstd (-9/0=.) (1=2 "Yes") (2=1 "No") (3=3 "Not sure"), gen(std)
label variable std "HPV is STD"
tab std datayr
recode std (3=.), gen(std2)
recode std (1 3= 0) (2 = 1),gen(std01)
*recode std2 (1= 0) (2 = 1),gen(std02)

***V15: hpvgoaway
replace hpvgoaway = br70hpvgoawaymail if br70hpvgoawaymail!=.
recode hpvgoaway (-9/0=.) (1=2 "Yes") (2=1 "No") (9 3=3 "Not sure"), gen(goaway)
label variable goaway "HPV will go away on own"
tab goaway datayr
recode goaway (3=.), gen(goaway2)
recode goaway (1 3= 0) (2 = 1),gen(goaway01)
*recode goaway2 (1= 0) (2 = 1),gen(goaway02)
****************************************************************************
*** Phase 2: SURVEY WEIGHTING DIAGNOSTIC (HINTS 2007 transition) 
*** NOTE: THIS CAN ONLY BE RUN IF THE VARIABLE WAS COLLECTED IN 2007
****************************************************************************
*** In 2002 and 2005, data was collected using telephone surveys
*** In 2007, data was collected using both telephone and mail surveys
*** From 2010 to present, data is only collected using mail surveys
*** 
*** To merge across the 2007 divide, we must determine whether individuals respond to
***	the mail survey the same way they answered questions on the phone.
*** 
*** This is done in 3 steps according to Moser et al. (2013)
***
*** PHASE 1, Step 1: Examine whether there are differences between phone and mail samples for HINTS 3
***

gen evalwt = mwgt0 if sampflag == 1 

forvalues i = 1/50 {
	gen evalwgt`i' = mwgt`i' if sampflag==1
}

forvalues i = 51/100 {
	local j=`i' - 50
	gen evalwgt`i' = mwgt0 if sampflag==1
}


replace evalwt = rwgt0 if sampflag == 2

forvalues i = 1/50 {
	replace evalwgt`i' = rwgt0 if sampflag==2
}

forvalues i = 51/100 {
	local j=`i' - 50
	replace evalwgt`i' = rwgt`j' if sampflag==2
}

svyset _n [pweight=evalwt], jkrweight(evalwgt*,multiplier(0.98)) vce(jackknife) mse
***   _n:  _n to indicate that individuals (instead of clusters) were randomly sampled
***   pweight = denote the inverse of the probability that the observation is included (use final)

/*svy: tab sex sampflag, stubw(30)
svy: tab age2 sampflag, stubw(30)
svy: tab educ sampflag, stubw(30)
svy: tab racecat sampflag, stubw(30)
svy: tab hisp sampflag, stubw(30)
svy: tab children sampflag, stubw(30)
svy: tab income2 sampflag, stubw(30)
svy: tab metro sampflag, stubw(30)
svy: tab Healthins sampflag, stubw(30)
svy: tab intuse sampflag, stubw(30)
svy: tab heard sampflag, stubw(30)
svy: tab heardvac sampflag, stubw(30)
svy: tab hpvcervical sampflag, stubw(30)
svy: tab std sampflag, stubw(30)
svy: tab goaway sampflag, stubw(30)
* DEACTIVATE LOG AND CREATE PDF*/

********************************************************************************
*** PHASE 1, Step 2: Evaluate the tables above to determine if there was a difference
*** The GENERAL RULE is: If no differences at 0.01, then set HINTS 3 flag = 1
*** otherwise set HINTS 3 flag = 2
*** Bring tables to Dr. Huerta to evaluate
********************************************************************************
*** 1 means that a composite is ok
*** 2 means that you should use the phone data only
*** 3 means that you should use the mail data only
gen HINTS3_flag = 3 if datayr == 2007
********************************************************************************
*** PHASE 3: PUT ALL THE CORRECT WEIGHTS INTO A SINGLE VARIABLE SET
***	2003 and 2005 use fwgt and fwgt*
***
*** 2007 needs to replace fwgt & fwgt* with the appropriate sampling strategy
*** based on the analysis above (HINTS3_flag)
***
*** 2010 and on use person_finwt*
*** 
*** This protocol standardizes on nfinalwt and nfwgt*
***
*** Step 1: 2003 & 2005 recode. No need to 
***

gen nfinalwt = fwgt // 2003 and 2005 recode, blank in other years
 
forvalues i = 1/50 {
	generate nfwgt`i' = fwgt`i'
}
tab datayr, summarize(nfinalwt)
***
*** Step 2: 2007 recode.
***

replace nfinalwt = cwgt0 if HINTS3_flag == 1 // 2007 recode - merged mail and phone
replace nfinalwt = rwgt0 if HINTS3_flag == 2 // 2007 recode - phone only
replace nfinalwt = mwgt0 if HINTS3_flag == 3 // 2007 recode - mail only
tab datayr, summarize(nfinalwt)

forvalues i = 1/50 {
	replace nfwgt`i' = cwgt`i' if HINTS3_flag == 1
	replace nfwgt`i' = rwgt`i' if HINTS3_flag == 2
	replace nfwgt`i' = mwgt`i' if HINTS3_flag == 3
}

***
*** Step 3: 2010 and beyond - recode
***

replace nfinalwt = person_finwt0 if person_finwt0 != . // 2010 and forward recode

forvalues i = 1/50 {
	replace nfwgt`i' = person_finwt`i' if person_finwt0 != .
}

tab datayr, summarize(nfinalwt)
********************************************************************************
**  DROP DATAYRS THAT ARE OUT OF SCOPE - they are dropped if they DON'T have an
**  asterisk. ANNOTATE WHY here; THIS IS GENERALLY BASED ON PHASE 1
** 
** The var "seekcancerinfo" is only available in years 2003, 2012 and 2014
**
** Most variables are only in HINTS 2,3 and 4c3
**
**
********************************************************************************
***
*** The data year is dropped if there is NO ASTERISK
***
********************************************************************************

drop if datayr == 2003 //heardvac heard children goaway std
drop if datayr == 2005 //heardvac
*drop if datayr == 2007
drop if datayr == 2011 //heardvac heard goaway std
drop if datayr == 2012 //heardvac heard goaway std
*drop if datayr == 2013
*drop if datayr == 2014 //goaway
drop if sampflag!=1 & datayr==2007
********************************************************************************
*** PHASE 4: CREATE A COMPOSITE WEIGHTING STRUCTURE ACROSS MULTIPLE YEARS
*** Based on Moser et al 2013, Chapter 2, table 2-1
***
*** Sets create a staggered weighting set with 50 replicates per year
*** e.g., when you use three years, you get 150 replicates
*** All observations use their replicate weights when analyzing their year
*** In all other years, they use their final replicate weight
*** 

by datayr, sort: gen nvals = _n == 1
gen nvals2 = sum(nvals)
sum nvals
return list
local k = r(sum)
display `k'


forvalues j = 1/`k' {
	forvalues i = 1/50 {
		local l = (50 * `j') - 50 + `i' 
		gen recfwgt`l' = nfinalwt
	}
}

forvalues j = 1/`k' {
	forvalues i = 1/50 {
		local l = (50 * `j') - 50 + `i' 
		replace recfwgt`l' = nfwgt`i' if nvals2 == `j'
	}
}




********************************************************************************
svyset _n [pweight=nfinalwt], jkrweight(recfwgt*,multiplier(0.98)) vce(jackknife) mse
********************************************************************************
***  PHASE 5: CONDUCT THE ANALYSIS ACROSS YEARS
***  Table 1 - Replicate
********************************************************************************
/*tab sex if datayr==2013
tab age3 if datayr==2013
tab educ if datayr==2013
tab racecat if datayr==2013
tab hisp if datayr==2013
tab children if datayr==2013
tab income2 if datayr==2013
tab metro if datayr==2013
tab Healthins if datayr==2013
tab intuse if datayr==2013
svy: tab sex if datayr==2013, col
svy: tab age3 if datayr==2013, col
svy: tab educ if datayr==2013, col
svy: tab racecat if datayr==2013, col
svy: tab hisp if datayr==2013, col
svy: tab children if datayr==2013, col
svy: tab income2 if datayr==2013, col
svy: tab metro if datayr==2013, col
svy: tab Healthins if datayr==2013, col
svy: tab intuse if datayr==2013, col*/
********************************************************************************
***  Table 1 - Adjust/Extend
********************************************************************************
tab sex datayr
tab age2 datayr
tab educ datayr
tab racecat datayr
tab hisp datayr
tab children datayr
tab income2 datayr
tab metro datayr
tab Healthins datayr
tab intuse datayr
svy: tab sex datayr, col
svy: tab age2 datayr, col
svy: tab educ datayr, col
svy: tab racecat datayr, col
svy: tab hisp datayr, col
svy: tab children datayr, col
svy: tab income2 datayr, col
svy: tab metro datayr, col
svy: tab Healthins datayr, col
svy: tab intuse datayr, col

log close
*translate STATATemp/12.smcl STATAResults/12.pdf, replace




/Users/sethscarborough/Desktop/ HINTS
. ***V1: Sex
(322 differences between genderc and sex)

           |                                    datayr
       Sex |      2003       2005       2007       2011    