In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('dark_background')

# Exploratory Intro to NEETs

In this chapter, we'll take an exploratory look at NEETs (young people not in education, employment, or training), using the CPS monthly samples from 2024. To give this study the proper treatment, we'll need the following variables (all available variables can be [found here](https://cps.ipums.org/cps-action/variables/group)):

- Demographic variables: `AGE`, `SEX`, `RACE`, `HISPAN`, `NCHILD`, `DIFFANY`

- Geographic variables: `STATEFIP`

- Socioeconomic variables: `EMPSTAT`, `LABFORCE`, `SCHLCOLL`

These variables will be enough for our current purposes.

## Load in the data

Using `ipumspy` and a helper function (`get_CPS`) to simplify the process, we can get the data fairly easily (though it may take some time). Remember to assign your IPUMS API key to an environment variable named `IPUMS_API_KEY`. In bash, for example:

    export IPUMS_API_KEY=2477c3178c3178247
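
Before requesting an extract, it can help to confirm that the key is actually visible to Python. A minimal sketch, assuming only that the key lives in `IPUMS_API_KEY` as described above; `require_api_key` is a hypothetical helper and not part of `ipumspy` or `get_CPS`:

```python
import os

def require_api_key(var_name="IPUMS_API_KEY"):
    """Return the key stored in the environment variable, or raise a clear error."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a readable message instead of a confusing API error later
        raise RuntimeError(f"Set the {var_name} environment variable before requesting an extract.")
    return key
```

Calling `require_api_key()` at the top of the notebook would surface a missing key before any long-running extract request is made.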


In [3]:
import sys
import os
from ipumspy import readers

sys.path.append('..')

from scripts.clean_ipums import get_CPS

In [3]:
my_vars = ['AGE', 'SEX', 'RACE', 'HISPAN', 'NCHILD', 
           'DIFFANY', 'EMPSTAT', 'LABFORCE', 'SCHLCOLL', 'STATEFIP'] 

get_CPS(years=2024, 
        vars=my_vars, 
        filename='exploratory_NEETs_2024', 
        filepath='../datasets' 
        ) # extracting a fwf & xml, named '../datasets/exploratory_NEETs_2024.[dat.gz/xml]'

Let's load our dataset in now.

In [4]:
%%capture

ddi_cps24 = readers.read_ipums_ddi('../datasets/exploratory_NEETs_2024.xml')
cps24 = readers.read_microdata(ddi=ddi_cps24, filename='../datasets/exploratory_NEETs_2024.dat.gz')

We'll be pooling monthly estimates to get a larger sample. For now, we won't worry about repeated survey respondents or how to deal with our weights when collecting multiple months, but it's something to remember.
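
To make the weighting caveat a bit more concrete: each monthly sample's weights are meant to represent the population on their own, so summing `WTFINL` across twelve months counts everyone roughly twelve times. Rates are ratios, so that common factor cancels, but weighted counts need rescaling. A minimal sketch on made-up numbers (the toy frame and the `pooled_wt` column are illustrative assumptions, not part of the extract, and this does nothing about repeated respondents):

```python
import pandas as pd

# Toy pooled sample: two months, two respondents per month (made-up weights)
toy = pd.DataFrame({
    "MONTH":  [1, 1, 2, 2],
    "WTFINL": [1000.0, 3000.0, 1500.0, 2500.0],
    "NEET":   ["neet", "not_neet", "neet", "not_neet"],
})

n_months = toy["MONTH"].nunique()

# Divide by the number of pooled months so totals read as an average month
toy["pooled_wt"] = toy["WTFINL"] / n_months

raw_total = toy["WTFINL"].sum()        # counts the population once per month
pooled_total = toy["pooled_wt"].sum()  # average monthly population

def neet_share(df, wt_col):
    """Weighted share of rows flagged 'neet', in percent."""
    neet_wt = df.loc[df["NEET"] == "neet", wt_col].sum()
    return neet_wt / df[wt_col].sum() * 100

# The rate is unchanged by the rescaling because the factor cancels
rate_raw = neet_share(toy, "WTFINL")
rate_pooled = neet_share(toy, "pooled_wt")
```

This is why the NEET rates later in the chapter can use the raw pooled `WTFINL`, while any weighted head count would need the rescaled weight.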

Now, we can take a brief look at the data, but there's one more pesky thing. When we extracted, we also extracted the Annual Social and Economic Supplement [(ASEC)](https://www.census.gov/data/datasets/time-series/demo/cps/cps-asec.html) data, which we don't need. Let's remove that.

In [5]:
cps24 = cps24[(cps24['ASECFLAG'] == 2)  | (cps24['ASECFLAG'].isnull())] # either == 2 (March basic) or NA (all other months)
cps24['MONTH'].value_counts().sort_index()

MONTH
1      99135
2      99250
3      95311
4      99245
5      99781
6      98697
7      99138
8     100288
9      99421
10     99045
11     99063
12     98982
Name: count, dtype: Int64

We can now look at some basic characteristics of our data, and clean it up a bit.

In [6]:
cps24.shape

(1187356, 22)

In [7]:
cps24.columns

Index(['YEAR', 'SERIAL', 'MONTH', 'HWTFINL', 'CPSID', 'ASECFLAG', 'ASECWTH',
       'STATEFIP', 'PERNUM', 'WTFINL', 'CPSIDP', 'CPSIDV', 'ASECWT', 'AGE',
       'SEX', 'RACE', 'NCHILD', 'HISPAN', 'EMPSTAT', 'LABFORCE', 'SCHLCOLL',
       'DIFFANY'],
      dtype='object')

Let's look at the (unweighted) gender split and median age in our data.

In [8]:
print('Share Male:' , np.mean(cps24['SEX'] == 1))

Share Male: 0.4885038691007583


In [27]:
print('Median Age:', np.median(cps24['AGE']))

Median Age: 42.0


## Overall

Our key variables here are `EMPSTAT` and `SCHLCOLL`, as those will let us define whether an individual is employed or in school. The codes for each are as follows:

`EMPSTAT`:
- `0 == NIU`
- `1 == Armed Forces`
- `10,12 == Employed`
- `20-36 == Unemployed or Not In Labor Force`

`SCHLCOLL`:
- `0 == NIU`
- `1,2,3,4 == in High School or College`
- `5 == Not in High School or College`

(Note that `SCHLCOLL` is only applicable for individuals aged 16-24. Technically, ASEC has it available for 16-54, but let's limit to 16-24.)

Let's get the NEET rate (weighted individuals neither in school nor employed / all weighted individuals):
- by gender
- by age ranges
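
The definition above can be sketched on a toy table before applying it to the full dataset. The rows and weights below are made up for illustration; only the `EMPSTAT` and `SCHLCOLL` codes follow the lists above:

```python
import numpy as np
import pandas as pd

# Five made-up respondents with their codes and weights
toy = pd.DataFrame({
    "EMPSTAT":  [10, 21, 34, 32, 12],
    "SCHLCOLL": [5, 5, 3, 5, 1],
    "WTFINL":   [2000.0, 1000.0, 1500.0, 500.0, 1000.0],
})

# NEET = unemployed or not in the labor force (codes 20-36) AND not enrolled (code 5)
not_employed = toy["EMPSTAT"].between(20, 36)
not_in_school = toy["SCHLCOLL"] == 5
toy["NEET"] = np.where(not_employed & not_in_school, "neet", "not_neet")

# Weighted NEET rate: NEET weight over total weight, in percent
neet_weight = toy.loc[toy["NEET"] == "neet", "WTFINL"].sum()
neet_rate = neet_weight / toy["WTFINL"].sum() * 100
```

Here the third respondent is out of the labor force but enrolled, so they are not counted as a NEET; both conditions have to hold.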


In [6]:
cps24['SCHLCOLL'].value_counts()

cps24 = cps24.query('EMPSTAT > 0 and SCHLCOLL > 0') # remove NIU observations
print('num_obs:', cps24.shape[0])

num_obs: 560541


In [7]:
gender_codes = {1 : 'men', 2 : 'women'} # map gender codes
cps24['sex'] = cps24['SEX'].map(gender_codes)

cps24['NEET'] = 'not_neet'
cps24.loc[(cps24['EMPSTAT'].isin(range(20,37))) & (cps24['SCHLCOLL'] == 5), 'NEET'] = 'neet' # NEET conditions

age_ranges = [range(16,25), range(16,21), range(20,25)] # three age ranges to test

for i in age_ranges:
    filtered_df = cps24.query('AGE in @i').copy()
    agg_wt = filtered_df.groupby(['sex', 'NEET'])['WTFINL'].sum().unstack() # get weighted sum by gender and NEET status
    agg_wt= agg_wt.eval('neet_rate = neet / (neet + not_neet) * 100')
    print('age', str(i), sep='_')
    print('Men: {0}%'.format(agg_wt['neet_rate'].loc['men'].round(2)))
    print('Women: {0}%\n'.format(agg_wt['neet_rate'].loc['women'].round(2)))

age_range(16, 25)
Men: 13.92%
Women: 14.5%

age_range(16, 21)
Men: 13.13%
Women: 12.3%

age_range(20, 25)
Men: 15.15%
Women: 16.85%



This is about in line with my expectations, and similar to what [CEPR](https://cepr.net/publications/are-young-men-falling-behind-young-women-the-neet-rate-helps-shed-light-on-the-matter/) found: across the full 16-24 range, men are slightly less likely to be NEETs (13.9% of men were not employed or in school in 2024, versus 14.5% of women), although the gap reverses for 16-20 year-olds, where the male rate is higher. NEET rates for both genders are also higher in the 20-24 group than in the 16-20 group.

## By race

Next, let's look at NEET rates by race and gender.

In [8]:
race_codes = [
    ((cps24['RACE'] == 100) & (cps24['HISPAN'] == 0)),
    ((cps24['RACE'] == 200) & (cps24['HISPAN'] == 0)),
    ((cps24['RACE'].isin(range(650, 653))) & (cps24['HISPAN'] == 0)),
    ((cps24['HISPAN'] > 0) & (cps24['HISPAN'] < 902))
]

race_choices = ['white', 'black', 'asian', 'hispanic']

cps24['race_cat'] = np.select(race_codes, race_choices, default = 'other')

filtered_df = cps24.query("AGE >= 16 and AGE <= 24")
agg_wt = filtered_df.groupby(['race_cat', 'sex', 'NEET'])['WTFINL'].sum().unstack() # get weighted sum by race, gender, and NEET status
agg_wt= agg_wt.eval('neet_rate = neet / (neet + not_neet) * 100')
print('NEET rates by race and gender for 16-24 year-olds')
print(agg_wt['neet_rate'].round(2).unstack())

NEET rates by race and gender for 16-24 year-olds
sex         men  women
race_cat              
asian     11.59  11.78
black     19.90  17.55
hispanic  14.93  17.03
other     16.66  16.02
white     11.94  12.58


Within each racial group, NEET rates are higher for women among White, Hispanic, and (narrowly) Asian 16-24 year-olds, but lower for women among Black 16-24 year-olds and those in the 'other' category. By race and gender, Black men have the highest NEET rate, with about one-in-five Black men aged 16-24 not in education or employment; Asian men, Asian women, and White men have the lowest rates, at about 12%.
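
The within-group comparison is easier to read as a gap in percentage points. A small sketch using the rates printed in the table above (the `gap_pp` column name is just an illustrative choice):

```python
import pandas as pd

# NEET rates (%) copied from the table above
rates = pd.DataFrame(
    {"men": [11.59, 19.90, 14.93, 16.66, 11.94],
     "women": [11.78, 17.55, 17.03, 16.02, 12.58]},
    index=["asian", "black", "hispanic", "other", "white"],
)

# Positive gap: women have the higher NEET rate; negative gap: men do
rates["gap_pp"] = (rates["women"] - rates["men"]).round(2)

# Groups where the female rate exceeds the male rate
women_higher = rates.index[rates["gap_pp"] > 0].tolist()
```

Sorting `rates` by `gap_pp` would rank the groups from the largest male excess to the largest female excess.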

## Geography

To get NEET rates by state, we'll be using `STATEFIP`, which has the state FIPS code for each individual in our set. 

`STATEFIP in range(1,57)` includes the fifty states and Washington D.C. (the codes occasionally skip a number). To get the corresponding names, we'll use the `us` library.
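
The lookup relies on two-character, zero-padded FIPS strings, so integer codes need padding before mapping. A minimal sketch of that step, using a hand-written, deliberately incomplete dictionary as a stand-in for the real lookup (the dictionary and the toy codes are illustrative assumptions):

```python
import pandas as pd

# Illustrative subset only; a real lookup would cover every state
fips_to_name = {
    "01": "Alabama",
    "02": "Alaska",
    "04": "Arizona",
    "11": "District of Columbia",
}

# Integer codes as they might appear in the data; 3 is one of the skipped numbers
codes = pd.Series([1, 2, 4, 11, 3])

# Pad to two characters so 1 becomes '01', then map to names
padded = codes.astype(str).str.zfill(2)
names = padded.map(fips_to_name)
```

Codes with no entry in the dictionary come back as missing values, which is a quick way to spot unmapped states after the merge.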

In [9]:
import us 

state_names = us.states.mapping('fips', 'name') # dictionary with STATEFIP as key
state_names['11'] = 'District of Columbia' # DC isn't included, so manually add it
cps24['state'] = cps24['STATEFIP'].astype(str).str.zfill(2) 
cps24['state'] = cps24['state'].map(state_names)

filtered_df = cps24.query('AGE >= 16 and AGE <= 24')
agg_wt = filtered_df.groupby(['state', 'sex', 'NEET'])['WTFINL'].sum().unstack() # get weighted sum by state, gender, and NEET status
agg_wt= agg_wt.eval('neet_rate = neet / (neet + not_neet) * 100')
agg_wt2 = agg_wt['neet_rate'].round(2).unstack()
print('NEET rates by state and gender for 16-24 year-olds')
print(agg_wt2)


NEET rates by state and gender for 16-24 year-olds
sex                     men  women
state                             
Alabama               13.04  17.46
Alaska                18.75  17.82
Arizona               15.25  13.71
Arkansas              15.86  18.90
California            14.05  14.85
Colorado              11.70  15.10
Connecticut           11.62   8.79
Delaware              12.17  13.91
District of Columbia  16.41  15.16
Florida               13.29  15.06
Georgia               15.42  14.45
Hawaii                11.21  13.32
Idaho                 14.93  12.78
Illinois              14.61  12.89
Indiana               13.35  15.01
Iowa                  11.69  10.40
Kansas                12.11  10.71
Kentucky              13.99  19.22
Louisiana             14.68  18.18
Maine                  9.04  11.02
Maryland              12.42  16.03
Massachusetts         12.25  10.13
Michigan              17.04  16.96
Minnesota              8.12   8.33
Mississippi           18.72  18.70
Miss

Let's look at the states with the maximum/minimum male/female NEET rates, and states where the NEET rate is higher/lower for men/women.

In [13]:
max_st_male = agg_wt2['men'].sort_values(ascending=False)
max_st_female = agg_wt2['women'].sort_values(ascending=False)

print('State with highest male NEET rate: {0} -- {1}%'.format(max_st_male.index[0], max_st_male.iloc[0]))
print('State with highest female NEET rate: {0} -- {1}%'.format(max_st_female.index[0], max_st_female.iloc[0]))

State with highest male NEET rate: New Mexico -- 20.45%
State with highest female NEET rate: Missouri -- 20.55%


In [14]:
min_st_male = agg_wt2['men'].sort_values(ascending=True)
min_st_female = agg_wt2['women'].sort_values(ascending=True)

print('State with lowest male NEET rate: {0} -- {1}%'.format(min_st_male.index[0], min_st_male.iloc[0]))
print('State with lowest female NEET rate: {0} -- {1}%'.format(min_st_female.index[0], min_st_female.iloc[0]))

State with lowest male NEET rate: Minnesota -- 8.12%
State with lowest female NEET rate: Wisconsin -- 7.76%
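
An alternative to sorting is to ask pandas for the extreme labels directly with `idxmax` and `idxmin`. A short sketch on four rows copied from the outputs in this section (the `subset` and `summary` names are illustrative):

```python
import pandas as pd

# A few rows copied from the state outputs in this section (NEET rates, %)
subset = pd.DataFrame(
    {"men": [8.12, 10.86, 20.45, 10.85],
     "women": [8.33, 20.55, 14.33, 7.76]},
    index=["Minnesota", "Missouri", "New Mexico", "Wisconsin"],
)

highest = subset.idxmax()  # state label of the column-wise maximum
lowest = subset.idxmin()   # state label of the column-wise minimum

# One row per gender, with the extreme states and their rates side by side
summary = pd.DataFrame({
    "highest_state": highest,
    "highest_rate": subset.max(),
    "lowest_state": lowest,
    "lowest_rate": subset.min(),
})
```

This avoids keeping four separately sorted series around when only the top and bottom entries are needed.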


In [15]:
higher_with_men = agg_wt2[agg_wt2['men'] > agg_wt2['women']].sort_values(by='men', ascending=False)
print('States with higher male NEET rates (including {0} states) :\n {1}'.format(higher_with_men.shape[0] , higher_with_men))

States with higher male NEET rates (including 25 states) :
 sex                     men  women
state                             
New Mexico            20.45  14.33
Nevada                20.15  12.87
Alaska                18.75  17.82
Mississippi           18.72  18.70
West Virginia         17.61  16.70
Michigan              17.04  16.96
District of Columbia  16.41  15.16
Rhode Island          15.61  10.42
New York              15.47  14.20
Georgia               15.42  14.45
Arizona               15.25  13.71
Pennsylvania          15.19  11.99
Idaho                 14.93  12.78
Illinois              14.61  12.89
Ohio                  14.31  13.67
Vermont               13.32   9.29
Utah                  12.34  11.73
Massachusetts         12.25  10.13
Kansas                12.11  10.71
Oregon                11.92  10.36
Iowa                  11.69  10.40
Connecticut           11.62   8.79
New Hampshire         10.98   9.17
Wisconsin             10.85   7.76
Nebraska               9.95   

In [16]:
higher_with_women = agg_wt2[agg_wt2['men'] < agg_wt2['women']].sort_values(by='women', ascending=False)
print('States with lower male NEET rates (including {0} states) :\n {1}'.format(higher_with_women.shape[0] , higher_with_women))

States with lower male NEET rates (including 26 states) :
 sex               men  women
state                       
Missouri        10.86  20.55
Tennessee       16.45  20.36
Wyoming         13.68  19.33
Kentucky        13.99  19.22
Arkansas        15.86  18.90
Oklahoma        14.45  18.47
Louisiana       14.68  18.18
Alabama         13.04  17.46
Washington      12.96  16.14
Maryland        12.42  16.03
Texas           14.29  15.91
North Carolina  14.27  15.83
Colorado        11.70  15.10
Florida         13.29  15.06
Indiana         13.35  15.01
Virginia        13.86  14.86
California      14.05  14.85
South Carolina  14.66  14.78
North Dakota    12.00  14.53
Delaware        12.17  13.91
Hawaii          11.21  13.32
Montana         12.97  13.00
New Jersey      10.06  11.63
South Dakota     9.72  11.31
Maine            9.04  11.02
Minnesota        8.12   8.33


## Other variables (disability and caretaking)

Okay, we now have a general sense of NEET rates in the US. We know that (at least based on 2024):

- **around 14% of young (16-24 y.o.) men and women are not employed or in education.**
- **men are slightly less likely to be NEETs, but this depends on age.**
- **among groups by race and gender, Asian men and women and White men are the least likely to be NEETs, and Black men are the most likely.**
- **states are split almost evenly between higher female NEET rates (26) and higher male NEET rates (25, counting D.C.), and NEET rates vary considerably by state, for both men and women.**

Now we might be curious about *why* one young man or woman is a NEET and another isn't. We obviously can't fully answer that here, but we can look at some accompanying characteristics. 

Two interesting correlates, based on other research, are:

- having a disability
- raising a child

where someone having one of these characteristics would be more likely to be a NEET. We can look at these relationships using variables in our dataset:
- `DIFFANY` : This variable indicates whether a surveyed individual has any [physical or cognitive disability](https://cps.ipums.org/cps-action/variables/DIFFANY#description_section), and is based on responding 'Yes' to any of six other specific disability variables. Research finds that CPS disability prevalence estimates are lower than other, similar data sources, but the directionality should be the same.
- `NCHILD` : This variable indicates the number of the surveyed individual's own children present in their household, including step-children and adopted children. 
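
To illustrate the "any of six" logic behind `DIFFANY`, here is a hedged sketch that rebuilds such a flag from six component columns. The component names (`DIFFHEAR`, `DIFFEYE`, `DIFFREM`, `DIFFPHYS`, `DIFFMOB`, `DIFFCARE`) and their coding (1 = no, 2 = yes) are assumptions, since those variables are not part of this extract, and the toy rows are made up:

```python
import pandas as pd

# Assumed component columns and coding (1 = no, 2 = yes); not part of this extract
components = ["DIFFHEAR", "DIFFEYE", "DIFFREM", "DIFFPHYS", "DIFFMOB", "DIFFCARE"]

# Three made-up respondents: no difficulty, one difficulty, two difficulties
toy = pd.DataFrame(
    [[1, 1, 1, 1, 1, 1],
     [1, 2, 1, 1, 1, 1],
     [2, 1, 1, 2, 1, 1]],
    columns=components,
)

# 2 ('has difficulty') if any component is 2, otherwise 1 ('no difficulty')
any_yes = toy[components].eq(2).any(axis=1)
toy["diffany_rebuilt"] = any_yes.map({True: 2, False: 1})
```

Because the flag is an "any" condition, it says nothing about the number or type of difficulties; that detail would require the component variables themselves.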

While `DIFFANY` should apply equally no matter how we define our age range to calculate the NEET rate, `NCHILD` won't (e.g., a twenty-nine year-old should be more likely to have a kid than a twenty-two year-old). Given our dataset, we'll stick with 16-24 while having this in the back of our heads. 

We'll try out a few different queries, so let's make a basic function that lets us calculate NEET rates repeatedly, based on certain restrictions.

In [10]:
def get_NEET(df = cps24, query_str = '', group_by = []):
    """
    calculates the weighted NEET rate (%) based on query and grouping options 

    **df**
    - dataframe, defaults to cps24; needs 'WTFINL' and 'NEET' columns

    **query_str**
    - query string to be placed in pandas query method
    - ex. query_str = 'AGE == 25 and SEX == 1'
    - an empty string skips filtering

    **group_by**
    - grouping columns; 'NEET' must come last so it can be unstacked into columns
    - ex. group_by = ['sex', 'NEET'] or ['race_cat', 'sex', 'NEET']
    """
    filtered_df = df.query(query_str) if query_str else df # df.query('') would raise, so skip it
    agg_wt = filtered_df.groupby(group_by)['WTFINL'].sum().unstack() # weighted sum per group, one column per NEET status
    agg_wt = agg_wt.eval('neet_rate = neet / (neet + not_neet) * 100')
    
    if len(group_by) > 2:
        # spread the second-to-last grouping column across the columns of the result
        agg_wt2 = agg_wt['neet_rate'].round(2).unstack()
        return agg_wt2
    else:
        return agg_wt['neet_rate']

Let's briefly look at `DIFFANY` and `NCHILD` before getting our rates.

In [18]:
cps24.loc[:, ['DIFFANY', 'NCHILD']].describe()

         DIFFANY     NCHILD
count   560541.0   560541.0
mean    1.067157   0.846129
std     0.250293   1.194972
min          1.0        0.0
25%          1.0        0.0
50%          1.0        0.0
75%          1.0        2.0
max          2.0        9.0


`DIFFANY` equaling 1 means 'no difficulty', with 2 meaning 'has difficulty.' `NCHILD` is an integer variable, with 0 meaning no kids, and 9 meaning 9 or more kids. Before calculating rates, let's map each of these to a binary variable.

In [11]:
child_dict = {i : 'has_child' for i in range(10) if i > 0}
child_dict[0] = 'no_child'

disability_dict = {1 : 'no_dis', 2 : 'has_dis'}

cps24['kids'] = cps24['NCHILD'].map(child_dict)
cps24['dis'] = cps24['DIFFANY'].map(disability_dict)

First, the difference in NEET rates between people with a disability versus those without, by gender.

In [20]:
by_disability = get_NEET(cps24, 'AGE >= 16 and AGE <= 24', ['sex', 'dis', 'NEET'])
print(by_disability)

dis    has_dis  no_dis
sex                   
men      37.92   12.44
women    33.00   13.49


Wow, disabled men are about three times as likely to be NEETs as men without a disability (37.9% versus 12.4%), almost four-in-ten. The gap is somewhat smaller for women, at about two and a half times (33.0% versus 13.5%). What's the sample size for disabled men and women in our data?

In [21]:
dis_counts = cps24['dis'].sort_values().value_counts()
print(dis_counts)
male_dis_counts = cps24.query("sex == 'men'")['dis'].sort_values().value_counts()
female_dis_counts = cps24.query("sex == 'women'")['dis'].sort_values().value_counts()
print(male_dis_counts)

dis
no_dis     522897
has_dis     37644
Name: count, dtype: int64
dis
no_dis     258481
has_dis     18609
Name: count, dtype: int64


Okay, let's look at kids now. How do the NEET rates of those with kids differ from those without kids?

In [22]:
by_kids = get_NEET(cps24, 'AGE >= 16 and AGE <= 24', ['sex', 'kids', 'NEET'])
print(by_kids)

kids   has_child  no_child
sex                       
men        14.97     13.90
women      40.89     12.74


There's a small gap between men with kids and men without, but a much larger gap for women. 41% of women aged 16-24 with kids are NEETs, compared to 13% of women without any kids. 
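
To put numbers on "small" and "much larger", the gaps can be expressed both as ratios and as percentage-point differences. A quick sketch using the rates from the output above (the `ratio` and `diff_pp` column names are illustrative):

```python
import pandas as pd

# NEET rates (%) copied from the output above
by_kids = pd.DataFrame(
    {"has_child": [14.97, 40.89], "no_child": [13.90, 12.74]},
    index=["men", "women"],
)

# Ratio > 1 means parents are more likely to be NEETs than non-parents
by_kids["ratio"] = (by_kids["has_child"] / by_kids["no_child"]).round(2)

# Difference in percentage points
by_kids["diff_pp"] = (by_kids["has_child"] - by_kids["no_child"]).round(2)
```

On these figures, young mothers are a bit more than three times as likely to be NEETs as young women without kids, while the ratio for men is close to one.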

Well, how about the intersections? What is the NEET rate for men and women who are not disabled and don't have children?

In [23]:
no_kids_no_diff = get_NEET(cps24, 
                           "AGE >= 16 and AGE <= 24 and kids == 'no_child' and dis == 'no_dis' ", 
                           ['sex', 'NEET'])
print(no_kids_no_diff)

sex
men      12.405762
women    11.694938
Name: neet_rate, dtype: float64


Another finding: while the overall NEET rate (for 16-24 year-olds) is higher for women than men, the NEET rate for those with no kids and no disability is slightly higher for men (12.4% versus 11.7%). The two rates are close, but the direction flips.

## Up next

We've made some progress on this question. But we've only looked at 2024. The natural next question is:

**'How have NEET rates changed over time in the US, and how has this change (if at all) varied across demographic groups?'**

In [12]:
# getting our results
overall = get_NEET(cps24, 'AGE >= 16 and AGE <= 24', ['sex', 'NEET'])
byrace = get_NEET(cps24, 'AGE >= 16 and AGE <= 24', ['race_cat', 'sex', 'NEET'])
bystate = get_NEET(cps24, 'AGE >= 16 and AGE <= 24', ['state', 'sex', 'NEET'])
bydis = get_NEET(cps24, 'AGE >= 16 and AGE <= 24', ['dis', 'sex', 'NEET'])
bykids = get_NEET(cps24, 'AGE >= 16 and AGE <= 24', ['kids', 'sex', 'NEET'])
nodisnokids = get_NEET(cps24, "AGE >= 16 and AGE <= 24 and kids == 'no_child' and dis == 'no_dis'", ['sex', 'NEET'])

dfs = [overall, byrace, bystate, bydis, bykids, nodisnokids]
names = ['overall', 'race', 'state', 'dis', 'kids', 'nodisnokids']
os.makedirs('../datasets/results', exist_ok=True) # make sure the output folder exists
for name, df in zip(names, dfs):
    fpath = '../datasets/results/' + 'by_' + name + '_NEET.csv'
    df.to_csv(fpath)