**Program**: 1a_NSCH_family_data<br>
**Class**: Fall 2025, Machine Learning, Project<br>
**Member**: Vanessa Thorsten<br>
**Description**: This program reads in the NSCH database which includes<br>
individual family-level data. Analysis variables are created.<br>
**Outputs**: NSCH_fam.csv file of the family-level data
  
**Program History/Modifications**:<br>
09/04/2025    Initial Version

In [1]:
import pandas as pd
#import pyreadstat
import numpy as np

## NSCH Data
Public access to NSCH data is available after signing a data use agreement.
The instructions indicate that every collaborator should download their
own copy of the data. As such the data have not been loaded on the GitHub site.  

In [3]:
#The SAS file had different variables from the documentation. 
#Use the CSV file instead
NSCH_fam_all = pd.read_csv("NSCH_2023e_Topical_CAHMI_DRC.csv")
NSCH_fam_all.head()

Unnamed: 0,HEIGHT,FIPSST,STRATUM,HHID,FORMTYPE,TOTKIDS_R,TENURE,HHLANGUAGE,SC_AGE_YEARS,SC_SEX,...,nomAnxietyDep12to17_23,nomSystCareCSHCN_23,nomFlrish6mto5_23,nomFlrish6to17_23,nomFlrish6to17CSHCN_23,nomACE2more_23,smSmoking_23,smPhysAct12to17_23,smAdeqIns_23,smForgoneHC_23
0,9999.0,6,1,23043707,2,1,3,1,11,1,...,95,2,90,2,2,1,1,95,1,2
1,149.86,6,1,23120547,3,2,1,1,14,2,...,2,2,90,1,1,2,2,2,1,2
2,157.47,6,1,23197456,3,1,1,2,17,2,...,2,95,90,1,95,2,2,2,1,2
3,157.47,6,1,23197458,3,2,1,1,12,2,...,2,95,90,1,95,2,2,2,1,2
4,149.86,6,1,23235909,2,2,3,1,10,2,...,95,2,90,1,1,2,2,95,2,2


In [5]:
# Print all column names.
# The case of the column names is inconsistent with the documentation,
# which makes it difficult to select the variables to keep.

#for col in NSCH_fam_all.columns:
#    print(col)
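
One way to work around the case mismatch noted above is to match requested names against the real column names while ignoring case. The following is a minimal sketch only: `resolve_columns` and the toy frame are hypothetical and are not part of the original program.

```python
import pandas as pd

def resolve_columns(df, wanted):
    """Map each requested name to the actual column name, ignoring case.

    Raises KeyError listing any requested names that have no match.
    """
    # Lookup from lower-cased column name to the real column name.
    # If two columns differ only by case, the later one wins.
    lookup = {col.lower(): col for col in df.columns}
    missing = [name for name in wanted if name.lower() not in lookup]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return [lookup[name.lower()] for name in wanted]

# Toy frame standing in for NSCH_fam_all (values are made up for illustration).
toy = pd.DataFrame({"FIPSST": [6, 6], "age3_23": [2, 3], "WIC_23": [2, 1]})
resolved = resolve_columns(toy, ["fipsst", "AGE3_23", "wic_23"])
print(resolved)
```

The resolved list could then be used to subset the full data, for example `NSCH_fam_all[resolved]`, with a clear error if a documented name is absent from the file.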

### Identifiers
FIPSST	State FIPS Code
STRATUM	Sampling Stratum
HHID	Unique Household ID
FORMTYPE	Form Type
TOTKIDS_R	Number of Children in Household

FWC	Selected Child Weight

### Features of interest

**Demographics and Family Characteristics**<br>
SC_AGE_YEARS	Age of Selected Child - In Years<br>
SC_SEX	Sex of Selected Child<br>
age3_23	Age in 3 groups<br>
age5_23	U.S. children in 5 age groups<br>
sex_23	Sex of child<br>
hispanic_23	Hispanic origin of the child<br>
race4_23	Race and ethnicity distribution of the child population<br>
raceASIA_23	Race and ethnicity distribution of the child population<br>
race7_23	Race and ethnicity distribution of children in the US population<br>
PlacesLived_23	Number of places child has lived in the past 12 months<br>
HousingInstab_23	Indicator 6.29: Children who experienced housing instability in the past year<br>
FoodCash_23	Indicator 6.27: Someone in the family received food or cash assistance at any time during the past 12 months<br>
WIC_23	Someone in the family received benefits from the WIC Program at any time during the past 12 months<br>
povlev4_23	Income level based on family poverty level status, imputed<br>
povSCHIP_23	Income level based on family poverty level status (SCHIP), imputed<br>
AdultEduc_23	Highest level of education of any adult in the household<br>
BornUSA_23	Children who were born in the United States<br>
FamCount_23	Number of family members in the child's household<br>

**Child Characteristics**<br>
K2Q01	General Health<br>
ChHlthSt_23	Indicator 1.1: Children's overall health status<br>
nomChHlthSt_23	NOM: Percent of children, ages 0 through 17, in excellent or very good health<br>
MedCare_23	Indicator 4.1: Children who received any type of medical care during the past 12 months<br>
PrevMed_23	Indicator 4.1a: Children who had one or more preventive medical care visits during past 12 months<br>
PrevDent_23	Indicator 4.2a: Children who had one or more preventive dental care visits during the past 12 months, age 1-17 years<br>
MedDentCare_23	Indicator 4.3: Children received both preventive medical and dental care visits in past 12 months<br>
K4Q01	Place Usually Goes Sick<br>
GOWHENSICK	Place Usually Goes Sick - Where<br>
UsualSck_23	Indicator 4.12b: Medical Home Component: Usual source for sick care<br>
smAdeqIns_23	Standardized Measure: Percent of children, ages 0 through 17, who are continuously and adequately insured<br>
smForgoneHC_23	Standardized Measure: Percent of children, ages 0 through 17, who were unable to obtain needed health care in the past year<br>
<br>
nomAnxietyDep12to17_23	NOM: Percent of adolescents, ages 12 through 17, who have depression or anxiety<br>
nomFlrish6mto5_23	NOM: Percent of children, ages 6 months through 5 years, who are flourishing<br>
nomFlrish6to17_23	NOM: Percent of children, ages 6 through 17, who are flourishing<br>
ScreenTime_23	Indicator 6.10: Time spent in front of a TV, computer, cellphone or other electronic device watching programs, playing games, accessing the internet or using social media, not including schoolwork on most weekdays<br>
HrsSleep_23	Indicator 6.25: Child slept recommended age-appropriate hours during an average day/on most weeknights, age 4 months - 1

In [6]:
#Keep state, weights, and features of interest
NSCH_fam_0 = pd.DataFrame(NSCH_fam_all[['FIPSST', 'STRATUM',
'HHID','FORMTYPE', 'TOTKIDS_R', 'FWC', 'SC_AGE_YEARS', 'SC_SEX', 
'age3_23', 'age5_23', 'sex_23','hispanic_23', 'race4_23', 'raceASIA_23', 'race7_23',
'PlacesLived_23', 'HousingInstab_23', 'FoodCash_23',
'WIC_23', 'povlev4_23', 'povSCHIP_23', 'AdultEduc_23',
'BornUSA_23', 'FamCount_23', 'K2Q01', 'ChHlthSt_23',
'nomChHlthSt_23', 'MedCare_23', 'PrevMed_23', 'PrevDent_23',
'MedDentCare_23', 'K4Q01', 'GOWHENSICK', 'UsualSck_23',
'smAdeqIns_23', 'smForgoneHC_23', 'nomAnxietyDep12to17_23',
'nomFlrish6mto5_23', 'nomFlrish6to17_23', 'ScreenTime_23','HrsSleep_23']])

In [7]:
NSCH_fam_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55162 entries, 0 to 55161
Data columns (total 41 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   FIPSST                  55162 non-null  int64  
 1   STRATUM                 55162 non-null  int64  
 2   HHID                    55162 non-null  int64  
 3   FORMTYPE                55162 non-null  int64  
 4   TOTKIDS_R               55162 non-null  int64  
 5   FWC                     55162 non-null  float64
 6   SC_AGE_YEARS            55162 non-null  int64  
 7   SC_SEX                  55162 non-null  int64  
 8   age3_23                 55162 non-null  int64  
 9   age5_23                 55162 non-null  int64  
 10  sex_23                  55162 non-null  int64  
 11  hispanic_23             55162 non-null  int64  
 12  race4_23                55162 non-null  int64  
 13  raceASIA_23             55162 non-null  int64  
 14  race7_23                55162 non-null

In [8]:
#The NSCH identifies states by their FIPS code
#Read in state names and codes
State_Codes = pd.read_csv("State_FIPS.csv")
State_Codes.head()

Unnamed: 0,FIPSST,STATE,STATE_NAME
0,1,AL,Alabama
1,2,AK,Alaska
2,4,AZ,Arizona
3,5,AR,Arkansas
4,6,CA,California


In [9]:
State_Codes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   FIPSST      51 non-null     int64 
 1   STATE       51 non-null     object
 2   STATE_NAME  51 non-null     object
dtypes: int64(1), object(2)
memory usage: 1.3+ KB


## Merge Data Files

In [10]:
# Merge the state abbreviation and state name onto the family-level data using the shared FIPSST code
NSCH_fam_0 = pd.merge(NSCH_fam_0, State_Codes, how = 'left',  on = 'FIPSST')

# Move 'FIPSST', 'STATE', and 'STATE_NAME' to the first columns
cols = ['FIPSST', 'STATE', 'STATE_NAME'] + [col for col in NSCH_fam_0.columns if col not in ['FIPSST', 'STATE', 'STATE_NAME']]
NSCH_fam_0 = NSCH_fam_0[cols]

NSCH_fam_0.head()

Unnamed: 0,FIPSST,STATE,STATE_NAME,STRATUM,HHID,FORMTYPE,TOTKIDS_R,FWC,SC_AGE_YEARS,SC_SEX,...,K4Q01,GOWHENSICK,UsualSck_23,smAdeqIns_23,smForgoneHC_23,nomAnxietyDep12to17_23,nomFlrish6mto5_23,nomFlrish6to17_23,ScreenTime_23,HrsSleep_23
0,6,CA,California,1,23043707,2,1,1318.47684,11,1,...,2,95,2,1,2,95,90,2,4,2
1,6,CA,California,1,23120547,3,2,978.499881,14,2,...,1,1,1,1,2,2,90,1,5,1
2,6,CA,California,1,23197456,3,1,904.191765,17,2,...,1,99,99,1,2,2,90,1,4,1
3,6,CA,California,1,23197458,3,2,1092.097256,12,2,...,1,1,1,1,2,2,90,1,4,1
4,6,CA,California,1,23235909,2,2,586.38787,10,2,...,2,95,2,2,2,95,90,1,5,1
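
A left merge like the one above can silently add rows if the lookup table repeats a key, or leave state names empty if a code is absent. The sketch below shows one way to guard against both, using the `validate` and `indicator` options of `pd.merge`. The toy frames `fam` and `states` are made-up stand-ins for `NSCH_fam_0` and `State_Codes`.

```python
import pandas as pd

# Toy stand-ins for NSCH_fam_0 and State_Codes (values are illustrative only).
fam = pd.DataFrame({"FIPSST": [6, 6, 1], "HHID": [101, 102, 103]})
states = pd.DataFrame({"FIPSST": [1, 6], "STATE": ["AL", "CA"],
                       "STATE_NAME": ["Alabama", "California"]})

# validate="many_to_one" raises MergeError if FIPSST repeats in `states`;
# indicator=True adds a _merge column showing where each row was matched.
merged = pd.merge(fam, states, how="left", on="FIPSST",
                  validate="many_to_one", indicator=True)

# Rows flagged "left_only" had no matching state code.
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"rows before: {len(fam)}, after: {len(merged)}, unmatched: {unmatched}")

# Drop the helper column once the check is done.
merged = merged.drop(columns="_merge")
```

In the output shown above, the row count stays at 55162 and `STATE` has no nulls, which is consistent with every record having matched.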


In [11]:
NSCH_fam_0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55162 entries, 0 to 55161
Data columns (total 43 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   FIPSST                  55162 non-null  int64  
 1   STATE                   55162 non-null  object 
 2   STATE_NAME              55162 non-null  object 
 3   STRATUM                 55162 non-null  int64  
 4   HHID                    55162 non-null  int64  
 5   FORMTYPE                55162 non-null  int64  
 6   TOTKIDS_R               55162 non-null  int64  
 7   FWC                     55162 non-null  float64
 8   SC_AGE_YEARS            55162 non-null  int64  
 9   SC_SEX                  55162 non-null  int64  
 10  age3_23                 55162 non-null  int64  
 11  age5_23                 55162 non-null  int64  
 12  sex_23                  55162 non-null  int64  
 13  hispanic_23             55162 non-null  int64  
 14  race4_23                55162 non-null

## Missing Data

In [12]:
#Replace missing codes (95 and 99) with NaN
NSCH_fam_1 = NSCH_fam_0.replace([95, 99], np.nan)
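
The replacement above is applied to every column. Identifiers and weights are unlikely to equal exactly 95 or 99, but limiting the replacement to the coded survey variables removes that possibility. This is a sketch under the assumption that `HHID` and `FWC` are the columns to protect. The toy values are made up, and the real list of protected columns would need to be chosen from the codebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for NSCH_fam_0; the values are made up for illustration.
toy = pd.DataFrame({
    "HHID": [95, 200, 99],          # identifier that happens to equal a code
    "FWC": [99.0, 850.5, 1020.25],  # weight that happens to equal a code
    "WIC_23": [1, 95, 2],
    "K2Q01": [99, 1, 3],
})

# Assumption: these columns are never coded with 95/99 and must stay intact.
protected = ["HHID", "FWC"]
coded = [col for col in toy.columns if col not in protected]

# Replace the missing codes only in the coded columns.
cleaned = toy.copy()
cleaned[coded] = cleaned[coded].replace([95, 99], np.nan)
print(cleaned)
```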

In [13]:
#List variables with missing values
missing_only = NSCH_fam_1.isnull().sum()
missing_only = missing_only[missing_only > 0]
print(missing_only)

PlacesLived_23             1637
HousingInstab_23           1204
FoodCash_23                1231
WIC_23                     1504
BornUSA_23                  398
FamCount_23                1918
K2Q01                       131
ChHlthSt_23                 131
nomChHlthSt_23              131
MedCare_23                   86
PrevMed_23                  457
PrevDent_23                2175
MedDentCare_23              622
K4Q01                       219
GOWHENSICK                 9777
UsualSck_23                 580
smAdeqIns_23                348
smForgoneHC_23              265
nomAnxietyDep12to17_23    36819
nomFlrish6mto5_23          1170
nomFlrish6to17_23           296
ScreenTime_23               966
HrsSleep_23                1263
dtype: int64


In [14]:
#Drop all records missing any key variable of this
#analysis, as the number missing relative to the
#total is small.
#- overall health of the child (missing 131),
#- preventative care (missing 457),
#- place where the child is taken when sick (missing 219)
#- insurance (missing 348)
#- poverty level is not missing, as the source data are imputed

NSCH_fam_1 = NSCH_fam_1.dropna(subset=['K2Q01', 'PrevMed_23', 'K4Q01', 'smAdeqIns_23'])

#Other variables that are missing more often are WIC_23,
#ScreenTime_23, and PlacesLived_23. Analyses of these variables
#are limited to non-missing responses, as the missing
#values are spread across all states and appear to be missing
#at random.
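
The four per-variable counts listed above sum to 1,155 (131 + 457 + 219 + 348), which is more than the number of records actually removed, because some records are missing more than one key variable. The sketch below shows how `dropna(subset=...)` behaves in that situation and how the number of dropped rows can be reported. The toy frame and `key_vars` list are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for NSCH_fam_1 before the drop (made-up values).
toy = pd.DataFrame({
    "K2Q01": [1, np.nan, 2, np.nan],
    "PrevMed_23": [1, 1, np.nan, np.nan],  # last row misses both key variables
    "WIC_23": [np.nan, 2, 2, 1],           # not a key variable, so NaN is kept
})

key_vars = ["K2Q01", "PrevMed_23"]

# A row is dropped if ANY of the key variables is missing.
kept = toy.dropna(subset=key_vars)
n_dropped = len(toy) - len(kept)

# Per-variable missing counts can overlap, so their sum may exceed n_dropped.
per_var_missing = int(toy[key_vars].isna().sum().sum())
print(f"dropped {n_dropped} of {len(toy)} rows; per-variable total {per_var_missing}")
```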

In [15]:
#List variables with missing values
missing_only = NSCH_fam_1.isnull().sum()
missing_only = missing_only[missing_only > 0]
print(missing_only)

PlacesLived_23             1420
HousingInstab_23           1019
FoodCash_23                1022
WIC_23                     1272
BornUSA_23                  240
FamCount_23                1784
MedCare_23                    7
PrevDent_23                2035
MedDentCare_23              114
GOWHENSICK                 9289
UsualSck_23                 331
smForgoneHC_23              138
nomAnxietyDep12to17_23    36326
nomFlrish6mto5_23          1085
nomFlrish6to17_23           166
ScreenTime_23               749
HrsSleep_23                1056
dtype: int64


In [16]:
NSCH_fam_1.info()

#54159/55162 records
#1,003 removed due to missing a key variable.

<class 'pandas.core.frame.DataFrame'>
Index: 54159 entries, 0 to 55161
Data columns (total 43 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   FIPSST                  54159 non-null  int64  
 1   STATE                   54159 non-null  object 
 2   STATE_NAME              54159 non-null  object 
 3   STRATUM                 54159 non-null  int64  
 4   HHID                    54159 non-null  int64  
 5   FORMTYPE                54159 non-null  int64  
 6   TOTKIDS_R               54159 non-null  int64  
 7   FWC                     54159 non-null  float64
 8   SC_AGE_YEARS            54159 non-null  int64  
 9   SC_SEX                  54159 non-null  int64  
 10  age3_23                 54159 non-null  int64  
 11  age5_23                 54159 non-null  int64  
 12  sex_23                  54159 non-null  int64  
 13  hispanic_23             54159 non-null  int64  
 14  race4_23                54159 non-null  int

In [17]:
#Check state missing WIC
missing_WIC = NSCH_fam_1[NSCH_fam_1['WIC_23'].isnull()]
missing_WIC.value_counts('STATE')

STATE
CA    130
LA     67
KS     53
OH     42
MN     40
CO     37
GA     34
PA     34
IL     31
NM     30
AR     30
TN     29
HI     28
NY     27
NE     27
FL     25
DE     25
MI     25
VA     24
NC     24
AL     24
RI     22
IN     22
MS     22
OK     20
SC     19
WV     19
WA     19
KY     19
WI     18
TX     18
OR     17
NV     17
WY     17
MA     17
CT     17
ME     17
ND     16
AZ     16
VT     16
IA     15
NJ     15
MO     15
NH     14
DC     14
AK     12
UT     11
SD     11
MD     11
ID     10
MT     10
Name: count, dtype: int64

In [18]:
#Check state missing ScreenTime_23
missing_screen = NSCH_fam_1[NSCH_fam_1['ScreenTime_23'].isnull()]
missing_screen.value_counts('STATE')

STATE
CA    65
LA    36
KS    30
IL    24
PA    22
MN    20
OH    20
GA    18
AR    18
NY    18
KY    18
NE    17
CO    17
RI    17
IN    17
WI    16
TN    16
VA    16
ME    15
WY    15
NV    15
FL    14
NM    14
HI    13
MI    13
AL    13
DE    12
OR    12
NC    12
OK    12
WV    12
AZ    12
NJ    12
TX    11
AK    11
CT    11
ND    10
MO    10
SC     9
MA     9
UT     9
VT     9
WA     9
MS     9
NH     8
ID     8
MD     7
MT     6
DC     6
SD     3
IA     3
Name: count, dtype: int64

In [19]:
#Check state missing PlacesLived_23
missing_places = NSCH_fam_1[NSCH_fam_1['PlacesLived_23'].isnull()]
missing_places.value_counts('STATE')

STATE
CA    147
LA     73
KS     61
MN     47
OH     45
PA     39
AR     39
NM     34
IL     34
TN     34
GA     33
CO     32
MI     31
NJ     31
NE     29
DE     28
VA     27
RI     26
TX     26
WI     25
MS     25
WY     25
FL     25
KY     25
AZ     25
CT     24
AL     24
NY     24
NV     24
OK     23
NC     22
HI     22
IN     22
WV     21
ND     19
NH     18
WA     18
SC     17
MA     16
UT     16
ID     15
VT     15
MO     15
ME     14
SD     14
OR     14
DC     13
MD     13
MT     11
IA     10
AK     10
Name: count, dtype: int64
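
The raw counts above largely track how many records each state contributes (California has both the most missing values and the most records), so the share missing within each state may be easier to compare. A minimal sketch on made-up data follows. `check_cols` is a hypothetical name, and the same expression could be applied to `NSCH_fam_1`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for NSCH_fam_1; the values are made up for illustration.
toy = pd.DataFrame({
    "STATE": ["CA", "CA", "CA", "CA", "KS", "KS"],
    "WIC_23": [1, np.nan, 2, 2, np.nan, 2],
    "ScreenTime_23": [3, 4, np.nan, 5, 2, 1],
})

check_cols = ["WIC_23", "ScreenTime_23"]

# isna() gives True/False per cell; the mean within a state is the share missing.
missing_rate = toy[check_cols].isna().groupby(toy["STATE"]).mean()
print(missing_rate)
```

Roughly similar rates across states would be consistent with the note that the missing values are spread across all states, although this alone does not establish that they are missing at random.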

## Check unweighted counts for the variables

In [20]:
#Child Health
NSCH_fam_1['K2Q01'].value_counts()

K2Q01
1.0    35822
2.0    13879
3.0     3796
4.0      596
5.0       66
Name: count, dtype: int64

In [21]:
#Child Health excellent or very good
NSCH_fam_1['nomChHlthSt_23'].value_counts()

nomChHlthSt_23
1.0    49701
2.0     4458
Name: count, dtype: int64

In [22]:
#Preventative health visit
NSCH_fam_1['PrevMed_23'].value_counts()

PrevMed_23
1.0    45287
2.0     8872
Name: count, dtype: int64

In [23]:
#Place to take child when sick
NSCH_fam_1['K4Q01'].value_counts()

K4Q01
1.0    45201
2.0     8958
Name: count, dtype: int64

In [24]:
#Adequate health insurance
NSCH_fam_1['smAdeqIns_23'].value_counts()

smAdeqIns_23
1.0    35649
2.0    18510
Name: count, dtype: int64

In [25]:
#WIC Participation
NSCH_fam_1['WIC_23'].value_counts()

WIC_23
2.0    49114
1.0     3773
Name: count, dtype: int64

In [26]:
#Screen time
NSCH_fam_1['ScreenTime_23'].value_counts()

ScreenTime_23
3.0    15309
5.0    10010
2.0     9747
4.0     9494
1.0     8850
Name: count, dtype: int64

In [27]:
#Places lived
NSCH_fam_1['PlacesLived_23'].value_counts()

PlacesLived_23
1.0    51632
2.0     1107
Name: count, dtype: int64

In [28]:
#Poverty levels
NSCH_fam_1['povlev4_23'].value_counts()

povlev4_23
4    23004
3    16119
2     8464
1     6572
Name: count, dtype: int64

In [29]:
NSCH_fam_1['povSCHIP_23'].value_counts()

povSCHIP_23
4    23004
1    15036
2     8826
3     7293
Name: count, dtype: int64

### Output CSV file

In [30]:
NSCH_fam_1.to_csv("NSCH_fam.csv")
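
By default `to_csv` also writes the DataFrame index, which comes back as an extra `Unnamed: 0` column when the file is read again. Whether later programs expect that column is not stated here, so the sketch below only illustrates the difference that `index=False` makes, using an in-memory buffer and made-up data.

```python
import io
import pandas as pd

# Toy frame with a non-contiguous index, as left behind by dropna().
toy = pd.DataFrame({"STATE": ["CA", "KS"], "WIC_23": [1.0, 2.0]}, index=[0, 5])

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```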

In [31]:
NSCH_fam_1.value_counts('STATE')

STATE
CA    4552
KS    2751
MN    2363
LA    2048
OH    1707
PA    1506
IL    1466
WI    1430
CO    1428
WY    1291
GA    1233
NE    1135
NM    1115
TN    1008
MD     874
NY     863
AL     840
AR     838
MI     832
UT     826
IA     821
CT     819
NH     814
OR     813
VA     812
SD     808
OK     805
IN     805
HI     803
AK     798
ME     798
VT     796
TX     792
MT     792
MO     788
AZ     788
ID     787
WV     778
NC     771
NJ     759
DC     759
SC     755
WA     747
ND     744
FL     740
RI     740
MA     740
KY     730
DE     728
MS     727
NV     696
Name: count, dtype: int64