# Exploring the Youth Risk Behavior Surveillance System Datasets

The [Youth Risk Behavior Surveillance System (YRBSS)](https://chronicdata.cdc.gov/Youth-Risk-Behaviors/DASH-Youth-Risk-Behavior-Surveillance-System-YRBSS/svam-8dhg) is a project run by the Centers for Disease Control and Prevention (CDC) that monitors six categories of health-related behaviors that contribute to the leading causes of death and disability among youth and adults:

- Behaviors that contribute to unintentional injuries and violence
- Sexual behaviors related to unintended pregnancy and sexually transmitted diseases, including HIV infection
- Alcohol and other drug use
- Tobacco use
- Unhealthy dietary behaviors
- Inadequate physical activity

YRBSS includes a national school-based survey conducted by CDC and state, territorial, tribal, and local surveys conducted by state, territorial, and local education and health agencies and tribal governments.

Some of the survey data go back as far as 1991. The survey includes demographic and geographic variables such as race and ethnicity, sex, grade, and the location where the data were collected.

In [1]:
import numpy as np 
import pandas as pd 
import plotly
from plotly import graph_objs as go
import plotly.express as px  # the standalone plotly_express package was merged into plotly

In [4]:
alcohol = pd.read_csv('data/Alcohol and Other Drug Use.csv')
#dietary = pd.read_csv('data/Dietary Behaviors.csv')
#obesity = pd.read_csv('data/Obesity Overweight and Weight Control.csv')
#physical = pd.read_csv('data/Physical Activity.csv')
#sexual = pd.read_csv('data/Sexual Behaviors.csv')
#tobacco = pd.read_csv('data/Tobacco Use.csv')

In [5]:
import jupyterthemes

## Start with one of the datasets - alcohol use
The YRBSS database has six datasets. Each focuses on a different topic, but they all share a similar structure. We start with one of them to see what the data look like and what we can do with them.

In [6]:
# Ensure that pandas will show all the columns
pd.set_option('display.max_columns', 50)

print(alcohol.shape)
alcohol.head()

(1176120, 35)


Unnamed: 0,YEAR,LocationAbbr,LocationDesc,DataSource,Topic,Subtopic,ShortQuestionText,Greater_Risk_Question,Description,Data_Value_Symbol,Data_Value_Type,Greater_Risk_Data_Value,Greater_Risk_Data_Value_Footnote_Symbol,Greater_Risk_Data_Value_Footnote,Greater_Risk_Low_Confidence_Limit,Greater_Risk_High_Confidence_Limit,Lesser_Risk_Question,Lesser_Risk_Data_Value,Lesser_Risk_Data_Value_Footnote_Symbol,Lesser_Risk_Data_Value_Footnote,Lesser_Risk_Low_Confidence_Limit,Lesser_Risk_High_Confidence_Limit,Sample_Size,Sex,Race,Grade,GeoLocation,TopicId,SubTopicID,QuestionCode,LocationId,StratID1,StratID2,StratID3,StratificationType
0,2005,MM,"Miami-Dade County, FL",YRBSS,Alcohol and Other Drug Use,Other Drug Use,Illegal drugs at school,"Were offered, sold, or given an illegal drug o...",during the 12 months before the survey,%,Percentage,,,,,,"Were not offered, sold, or given an illegal dr...",,,,,,3,Total,Multiple Race,9th,"(25.551603, -80.632692)",C03,C15,H58,108.0,S1,R16,G2,Local
1,2017,SA,"San Diego, CA",YRBSS,Alcohol and Other Drug Use,Alcohol Use,Current binge drinking,Reported current binge drinking,four or more drinks of alcohol in a row (if th...,%,Percentage,,,,,,Did not report current binge drink,,,,,,12,Total,American Indian or Alaska Native,Total,"(32.715738, -117.161084)",C03,C14,H44,103.0,S1,R10,G1,Local
2,1995,HO,"Houston, TX",YRBSS,Alcohol and Other Drug Use,Other Drug Use,Ever cocaine use,Ever used cocaine,"any form of cocaine, such as powder, crack, or...",%,Percentage,4.9938,,,3.3047,7.4793,Never used cocaine,95.0062,,,92.5207,96.6953,335,Female,Hispanic or Latino,Total,"(29.760427, -95.369803)",C03,C15,H49,128.0,S7,R13,G1,Local
3,2017,CK,Cherokee Nation,YRBSS,Alcohol and Other Drug Use,Other Drug Use,Ever cocaine use,Ever used cocaine,"any form of cocaine, such as powder, crack, or...",%,Percentage,,,,,,Never used cocaine,,,,,,11,Female,Multiple Race,12th,,C03,C15,H49,,S7,R16,G5,Other
4,2013,WY,Wyoming,YRBSS,Alcohol and Other Drug Use,Alcohol Use,Current alcohol use,Currently drank alcohol,"at least one drink of alcohol, on at least 1 d...",%,Percentage,,,,,,Did not currently drink alcohol,,,,,,13,Female,Black or African American,Total,"(43.23554134300048, -108.10983035299967)",C03,C14,H42,56.0,S7,R12,G1,State


In [7]:
alcohol = alcohol.sort_values(by=['YEAR', 'StratificationType', 'LocationAbbr', 'ShortQuestionText', 'Grade'])
alcohol = alcohol.reset_index(drop=True)
alcohol.head()

Unnamed: 0,YEAR,LocationAbbr,LocationDesc,DataSource,Topic,Subtopic,ShortQuestionText,Greater_Risk_Question,Description,Data_Value_Symbol,Data_Value_Type,Greater_Risk_Data_Value,Greater_Risk_Data_Value_Footnote_Symbol,Greater_Risk_Data_Value_Footnote,Greater_Risk_Low_Confidence_Limit,Greater_Risk_High_Confidence_Limit,Lesser_Risk_Question,Lesser_Risk_Data_Value,Lesser_Risk_Data_Value_Footnote_Symbol,Lesser_Risk_Data_Value_Footnote,Lesser_Risk_Low_Confidence_Limit,Lesser_Risk_High_Confidence_Limit,Sample_Size,Sex,Race,Grade,GeoLocation,TopicId,SubTopicID,QuestionCode,LocationId,StratID1,StratID2,StratID3,StratificationType
0,1991,CH,"Chicago, IL",YRBSS,Alcohol and Other Drug Use,Alcohol Use,Current alcohol use,Currently drank alcohol,"at least one drink of alcohol, on at least 1 d...",%,Percentage,,,,,,Did not currently drink alcohol,,,,,,18,Total,White,10th,"(41.878114, -87.629798)",C03,C14,H42,112.0,S1,R15,G3,Local
1,1991,CH,"Chicago, IL",YRBSS,Alcohol and Other Drug Use,Alcohol Use,Current alcohol use,Currently drank alcohol,"at least one drink of alcohol, on at least 1 d...",%,Percentage,35.7308,,,27.0135,45.5073,Did not currently drink alcohol,64.2692,,,54.4927,72.9865,148,Female,Total,10th,"(41.878114, -87.629798)",C03,C14,H42,112.0,S7,R1,G3,Local
2,1991,CH,"Chicago, IL",YRBSS,Alcohol and Other Drug Use,Alcohol Use,Current alcohol use,Currently drank alcohol,"at least one drink of alcohol, on at least 1 d...",%,Percentage,,,,,,Did not currently drink alcohol,,,,,,13,Female,Asian,10th,"(41.878114, -87.629798)",C03,C14,H42,112.0,S7,R11,G3,Local
3,1991,CH,"Chicago, IL",YRBSS,Alcohol and Other Drug Use,Alcohol Use,Current alcohol use,Currently drank alcohol,"at least one drink of alcohol, on at least 1 d...",%,Percentage,,,,,,Did not currently drink alcohol,,,,,,29,Female,Hispanic or Latino,10th,"(41.878114, -87.629798)",C03,C14,H42,112.0,S7,R13,G3,Local
4,1991,CH,"Chicago, IL",YRBSS,Alcohol and Other Drug Use,Alcohol Use,Current alcohol use,Currently drank alcohol,"at least one drink of alcohol, on at least 1 d...",%,Percentage,,,,,,Did not currently drink alcohol,,,,,,3,Female,White,10th,"(41.878114, -87.629798)",C03,C14,H42,112.0,S7,R15,G3,Local


In [8]:
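# Convert Sample_Size to integers (the strings may contain thousands separators),
# then count the records with sample sizes below 30. Nothing is deleted here.
try:
    alcohol['Sample_Size'] = alcohol['Sample_Size'].str.replace(',', '').astype(int)
except AttributeError:
    # The column is already numeric (e.g. the cell was re-run), so .str is unavailable
    pass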

alcohol[alcohol['Sample_Size']<30].shape

(634579, 35)
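
The count above means that more than half of the rows rest on fewer than 30 respondents. As a minimal sketch of one way to act on that, the snippet below converts `Sample_Size` with `pd.to_numeric` and keeps only rows at or above the threshold. The `toy` frame, its values, and the `MIN_SAMPLE` name are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the real alcohol DataFrame.
toy = pd.DataFrame({
    "Sample_Size": ["1,204", "12", "335", "29", "30"],
    "Greater_Risk_Data_Value": [41.2, None, 4.99, None, 17.5],
})

MIN_SAMPLE = 30  # assumed threshold, taken from the notebook's comment

# Strip thousands separators, then convert; unparseable entries become NaN.
toy["Sample_Size"] = pd.to_numeric(
    toy["Sample_Size"].astype(str).str.replace(",", "", regex=False),
    errors="coerce",
)

# Keep only rows backed by at least MIN_SAMPLE respondents.
toy_reliable = toy[toy["Sample_Size"] >= MIN_SAMPLE].reset_index(drop=True)
print(toy_reliable)
```

Casting to `str` first lets the same line run whether the column was read as text or as numbers, so the `try`/`except` guard is not needed in this variant.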

In [9]:
# how many unique values does each column have (mainly for the categorical columns)
for column in alcohol.columns:
    print(column)
    print(len(alcohol[column].unique()))
    print(alcohol[column].unique())
    print()

YEAR
14
[1991 1993 1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015 2017]

LocationAbbr
91
['CH' 'DA' 'FT' 'MM' 'PH' 'SA' 'XX' 'AL' 'GA' 'ID' 'NE' 'NM' 'SC' 'SD'
 'UT' 'PR' 'BO' 'SE' 'HI' 'IL' 'MA' 'MS' 'MT' 'NC' 'NH' 'NV' 'OH' 'TN'
 'VT' 'WI' 'WV' 'AS' 'DN' 'HO' 'NO' 'AK' 'AR' 'ME' 'MO' 'ND' 'WY' 'GU'
 'DT' 'LO' 'NYC' 'SF' nan 'CT' 'IA' 'KY' 'LA' 'MI' 'NY' 'RI' 'PB' 'DE'
 'PW' 'OL' 'SB' 'FL' 'NJ' 'TX' 'MEM' 'ML' 'NYG' 'NYH' 'NYI' 'NYJ' 'NYK'
 'AZB' 'IN' 'OK' 'MH' 'MP' 'BA' 'CM' 'CO' 'KS' 'MD' 'DCB' 'DU' 'PA' 'NZ'
 'VA' 'CE' 'DKC' 'FW' 'OA' 'CA' 'ST' 'CK']

LocationDesc
91
['Chicago, IL' 'Dallas, TX' 'Broward County, FL' 'Miami-Dade County, FL'
 'Philadelphia, PA' 'San Diego, CA' 'United States' 'Alabama' 'Georgia'
 'Idaho' 'Nebraska' 'New Mexico' 'South Carolina' 'South Dakota' 'Utah'
 'Puerto Rico' 'Boston, MA' 'Seattle, WA' 'Hawaii' 'Illinois'
 'Massachusetts' 'Mississippi' 'Montana' 'North Carolina' 'New Hampshire'
 'Nevada' 'Ohio' 'Tennessee' 'Vermont' 'Wisconsin' 'West Virg

20
['H42' 'H48' 'H40' 'H49' 'H46' 'H55' 'H41' 'H47' 'H58' 'H50' 'H57' 'H51'
 'H52' 'H53' 'QNHALLUCDRUG' 'H43' 'H45' 'H54' 'H44' 'H56']

LocationId
89
[112. 126. 106. 108. 124. 103.  59.   1.  13.  16.  31.  35.  45.  46.
  49.  72. 114. 129.  15.  17.  25.  28.  30.  37.  33.  32.  39.  47.
  50.  55.  54.  60. 105. 128.  nan   2.   5.  23.  29.  38.  56.  66.
 115. 100. 121. 104. 201.   9.  19.  21.  22.  26.  36.  44. 110.  10.
 204. 109. 102.  12.  34.  48. 125. 130. 116. 117. 118. 119. 120.   4.
  18.  40. 200. 203. 113. 122.   8.  20.  24.  11. 107.  42. 202.  51.
 123. 111. 127. 101.   6.]

StratID1
3
['S1' 'S7' 'S8']

StratID2
8
['R15' 'R1' 'R11' 'R13' 'R12' 'R10' 'R14' 'R16']

StratID3
5
['G3' 'G4' 'G5' 'G2' 'G1']

StratificationType
5
['Local' 'National' 'State' 'Territory' 'Other']
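
The loop above prints a lot of text for wide frames. A more compact overview, sketched here on a made-up frame, is `DataFrame.nunique`, which returns the distinct-value count for every column at once. The `toy` frame is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same kind of categorical columns.
toy = pd.DataFrame({
    "YEAR": [1991, 1991, 1993, 1993],
    "LocationAbbr": ["CH", "XX", "CH", None],
    "StratificationType": ["Local", "National", "Local", "State"],
})

# Count distinct values per column. dropna=False counts a missing value
# as its own category, matching len(Series.unique()) in the loop above.
summary = toy.nunique(dropna=False).sort_values()
print(summary)
```

With the default `dropna=True`, missing values are ignored, which would give a count one lower for any column that contains them.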



In [10]:
alcohol['value'] = alcohol['Greater_Risk_Data_Value'] 
alcohol_sub = alcohol[['YEAR','LocationAbbr','LocationDesc', 'ShortQuestionText', 'value',
             'Sample_Size', 'Race', 'Sex', 'Grade', 'StratificationType']]

### National numbers and trends

In [32]:
# start with national-level numbers to see the overall stats and trends
alcohol_nat = alcohol_sub[alcohol_sub['StratificationType']=='National']



In [34]:
# how many rows have non-null values
print(alcohol_nat.value.shape)
print(alcohol_nat.value[alcohol_nat.value.notnull()].shape)

(24000,)
(14453,)
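
So about 60% of the national rows (14,453 of 24,000) carry a value. To see where the gaps are concentrated, one option is to compute the populated share per question. The sketch below does this on a small made-up frame; the `toy` data are assumptions, not real YRBSS figures.

```python
import pandas as pd

# Hypothetical toy data: some rows have no reported value.
toy = pd.DataFrame({
    "ShortQuestionText": ["Current alcohol use"] * 3 + ["Ever heroin use"] * 3,
    "value": [48.7, 47.8, None, None, None, 2.3],
})

# notna() yields booleans; their mean is the fraction of populated rows.
overall_share = toy["value"].notna().mean()

# The same fraction, computed separately for each question.
share_by_question = toy.groupby("ShortQuestionText")["value"].agg(
    lambda s: s.notna().mean()
)
print(overall_share)
print(share_by_question)
```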


In [35]:
alcohol_nat.head()

Unnamed: 0,YEAR,LocationAbbr,LocationDesc,ShortQuestionText,value,Sample_Size,Race,Sex,Grade,StratificationType
5760,1991,XX,United States,Current alcohol use,48.7006,341,Hispanic or Latino,Male,10th,National
5761,1991,XX,United States,Current alcohol use,47.7908,2970,Total,Total,10th,National
5762,1991,XX,United States,Current alcohol use,,22,American Indian or Alaska Native,Total,10th,National
5763,1991,XX,United States,Current alcohol use,,0,Native Hawaiian or Other Pacific Islander,Female,10th,National
5764,1991,XX,United States,Current alcohol use,,64,Asian,Female,10th,National


In [15]:
#means = sub.groupby(['YEAR','ShortQuestionText']).mean().value
#means

In [36]:
pd.pivot_table(alcohol_nat, values = 'value', index = 'ShortQuestionText', columns = 'YEAR')

ShortQuestionText,1991,1993,1995,1997,1999,2001,2003,2005,2007,2009,2011,2013,2015,2017
Current alcohol use,48.272835,45.338728,49.376132,45.806648,47.639736,44.735399,43.37647,41.097906,42.762447,38.947018,37.495126,34.258943,31.376258,27.929478
Current binge drinking,,,,,,,,,,,,,,11.849543
Current marijuana use,13.797586,17.107999,25.804305,25.131977,26.196428,24.354324,22.497611,20.149984,20.035806,20.20759,24.304609,25.159975,22.006251,20.249546
Ever alcohol use,80.222114,78.937932,78.986392,76.102275,80.021381,77.202886,74.843451,74.27349,74.436541,70.269501,70.434763,66.190121,61.781597,58.774269
Ever cocaine use,5.626271,5.272438,7.824703,7.941023,9.194349,9.458049,8.284093,7.497116,6.939616,6.04827,6.778555,5.680292,5.117083,4.513083
Ever ecstasy use,,,,,,10.485265,10.76964,6.58932,5.676429,6.633539,8.566608,6.862593,5.091031,3.916817
Ever hallucinogenic drug use,,,,,,11.459846,9.440884,7.622448,7.021697,6.979835,8.226324,6.646395,6.044342,5.797269
Ever heroin use,,,,,2.257449,3.178323,3.362931,2.220167,2.388924,2.594609,3.006753,2.332038,2.189501,1.918578
Ever inhalant use,,,18.609003,13.892768,12.853209,13.042931,11.255869,11.458399,12.446866,11.39146,11.572799,8.822635,7.008413,6.001342
Ever marijuana use,30.038825,31.589109,43.767749,45.73619,47.286521,43.964569,40.511539,39.157533,39.012579,36.31897,41.931222,42.518849,39.404568,36.350057
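
Note that `pd.pivot_table` aggregates with the mean by default, so each cell above is an unweighted average over every race, sex, and grade row for that question and year, including the `Total` rows. If the aim is the published overall estimate rather than an average of subgroup estimates, one option is to filter to the `Total` strata before pivoting. The sketch below shows the difference on made-up numbers; the `toy` frame and its values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: one overall row and two subgroup rows per year.
toy = pd.DataFrame({
    "YEAR": [1991, 1991, 1991, 2017, 2017, 2017],
    "ShortQuestionText": ["Current alcohol use"] * 6,
    "Race": ["Total"] * 6,
    "Sex": ["Total", "Female", "Male"] * 2,
    "Grade": ["Total"] * 6,
    "value": [50.0, 48.0, 53.0, 30.0, 32.0, 27.0],
})

# Default behaviour: mean over all rows, mixing overall and subgroup estimates.
averaged = pd.pivot_table(
    toy, values="value", index="ShortQuestionText", columns="YEAR"
)

# Alternative: keep only the rows that describe the whole population.
is_total = (
    (toy["Race"] == "Total") & (toy["Sex"] == "Total") & (toy["Grade"] == "Total")
)
totals_only = pd.pivot_table(
    toy[is_total], values="value", index="ShortQuestionText", columns="YEAR"
)

print(averaged)
print(totals_only)
```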


In [37]:
pd.pivot_table(alcohol_nat, values = 'value', index = ['ShortQuestionText', 'Sex'], columns = 'YEAR')

ShortQuestionText,Sex,1991,1993,1995,1997,1999,2001,2003,2005,2007,2009,2011,2013,2015,2017
Current alcohol use,Female,46.721695,43.538548,46.88421,44.130909,46.211482,42.991309,44.52355,40.88765,42.749817,39.962439,37.333942,35.195032,32.390332,31.427133
Current alcohol use,Male,51.177371,49.647067,51.8556,49.655748,50.751532,46.563986,43.015282,41.797891,43.29457,39.265759,38.452009,33.697791,30.651541,25.576843
Current alcohol use,Total,47.037126,43.312908,49.388586,44.115692,46.444965,44.731144,42.684884,40.666944,42.302896,37.99805,36.909968,33.98399,31.164063,26.967119
Current binge drinking,Female,,,,,,,,,,,,,,13.198079
Current binge drinking,Male,,,,,,,,,,,,,,10.897663
Current binge drinking,Total,,,,,,,,,,,,,,11.55205
Current marijuana use,Female,11.676948,13.987905,21.306619,20.488423,21.990373,19.994213,18.974782,17.817105,17.375296,17.168987,21.100067,23.590904,20.647335,21.216717
Current marijuana use,Male,16.446586,21.274633,30.390276,30.731186,31.492218,28.845574,26.615141,22.729273,22.950861,23.851468,27.583852,26.535259,24.529678,20.447637
Current marijuana use,Total,13.31517,16.262715,25.716019,24.5387,25.423068,24.246604,21.994404,19.933144,19.810631,19.886419,24.352558,25.354387,21.142258,19.3756
Ever alcohol use,Female,79.807457,79.352543,77.296457,76.281086,80.82145,76.803527,76.420127,74.792464,75.240652,71.277362,71.123429,68.302513,64.276926,62.538433


In [39]:
# plot national totals (all races, sexes, and grades combined) by year
fig = go.Figure()
alcohol_nat_total = alcohol_nat[(alcohol_nat['Race']=='Total') & (alcohol_nat['Sex']=='Total') & 
                                   (alcohol_nat['Grade']=='Total')]
for question in alcohol_nat['ShortQuestionText'].unique():
    fig.add_trace(go.Scatter(
        x=alcohol_nat_total[alcohol_nat_total['ShortQuestionText']==question]['YEAR'],
        y=alcohol_nat_total[alcohol_nat_total['ShortQuestionText']==question]['value'],
        name=question))
fig.show()

Nationwide, alcohol use among youth has declined substantially. In the early 1990s, roughly half of respondents reported currently drinking alcohol and about 80% reported having ever consumed it; by 2017, fewer than 30% reported current use and about 60% reported ever having used alcohol.
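
The size of the change can be quantified directly. As a rough sketch, the snippet below takes the 1991 and 2017 cells for three questions from the first pivot table above (these are strata-averaged means, not the plotted totals) and computes the difference in percentage points. The `trend` and `change` names are illustrative.

```python
import pandas as pd

# 1991 and 2017 values copied from the first pivot table above
# (unweighted means across strata, not official overall estimates).
trend = pd.DataFrame(
    {
        1991: [48.272835, 80.222114, 13.797586],
        2017: [27.929478, 58.774269, 20.249546],
    },
    index=["Current alcohol use", "Ever alcohol use", "Current marijuana use"],
)

# Change in percentage points, largest decline first.
change = (trend[2017] - trend[1991]).rename("change_pct_points").sort_values()
print(change)
```

On these figures both alcohol measures fall by roughly 20 percentage points, while current marijuana use is higher in 2017 than in 1991.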

### Geographic distributions and trends

In [40]:
# how many states are in the data
print(len(alcohol_sub[alcohol_sub['StratificationType']=='State']['LocationAbbr'].unique()))

47
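
For mapping, the full `alcohol` frame also carries a `GeoLocation` column that stores coordinates as text such as `(25.551603, -80.632692)`. A minimal sketch of splitting that text into numeric latitude and longitude columns follows. The `toy` frame and the `lat`/`lon` column names are assumptions, and the sketch takes the first number to be latitude, which is consistent with the sample rows shown earlier.

```python
import pandas as pd

# Hypothetical toy data using GeoLocation strings in the format shown earlier.
toy = pd.DataFrame({
    "LocationAbbr": ["MM", "WY", "CK"],
    "GeoLocation": [
        "(25.551603, -80.632692)",
        "(43.23554134300048, -108.10983035299967)",
        None,  # some locations have no coordinates
    ],
})

# Capture the two numbers inside the parentheses; missing strings yield NaN.
coords = toy["GeoLocation"].str.extract(
    r"\(\s*(-?[\d.]+)\s*,\s*(-?[\d.]+)\s*\)"
)
toy["lat"] = coords[0].astype(float)
toy["lon"] = coords[1].astype(float)
print(toy[["LocationAbbr", "lat", "lon"]])
```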
