In [7]:
import requests
import json
import pandas as pd
import altair as alt

# Final Project Exploratory Visualization: Bureau of Labor Statistics Data
**Steven Hewitt - UC Berkeley MIDS W209 - 06/18/2022**

The primary purpose for selecting the BLS dataset is to explore inflation as measured by the CPI-U. This index can be used to show how the price of goods has changed over time. Many individual products and services factor into the CPI-U calculation, based on weights that the BLS changes over time. My goal for this exploratory visualization is to focus on some of the market sectors that contribute at least 2% to the CPI-U (based on 2021 weights), and then highlight the differences in the rate of price changes by sector. I will focus on the time period from 2000 to 2021.

In [2]:
# Import list of targeted CPI-U categories.
targets = pd.read_excel("Targets.xlsx", header=0)
targets

| | Category | CPI-U Weight | Series ID |
|---|---|---|---|
| 0 | Food | 13.37 | CUSR0000SAF1 |
| 1 | Shelter | 32.946 | CUSR0000SAH1 |
| 2 | Apparel | 2.458 | CUSR0000SAA |
| 3 | Medical care | 8.487 | CUSR0000SAM |
| 4 | Education | 2.677 | CUSR0000SAE1 |
| 5 | New and used motor vehicles | 9.218 | CUSR0000SETA |
| 6 | Energy | 7.348 | CUSR0000SA0E |
| 7 | Recreation | 5.108 | CUSR0000SAR |
| 8 | Communication | 3.728 | CUSR0000SAE2 |


These target sectors impact all areas of daily life. Food, shelter, and apparel are the basic means of survival. Medical care and education are key services that promote class mobility. Motor vehicles and energy allow people the geographic mobility that gives them flexibility in where they live, work, and shop. Recreation and communication are quality-of-life concerns chosen for their relatively high CPI weights.

The following cells show how to get data from the API, prepare it for future graphing use, and save a backup of the data.

In [3]:
def API_call(series_ids, start_year, end_year):
    '''
    Calls the BLS API to return data. Returns a DataFrame with the combined results.
    '''
    
    # Build message to send to API.
    headers = {'Content-type': 'application/json'}
    data = json.dumps({"seriesid": series_ids,"startyear":start_year, "endyear":end_year})
    p = requests.post('https://api.bls.gov/publicAPI/v2/timeseries/data/', data=data, headers=headers)
    p.raise_for_status()  # Fail early on HTTP errors instead of parsing an error page.

    # Parse results: one DataFrame per series, tagged with its series ID.
    j = p.json()
    dfs = []
    for s in j["Results"]['series']:
        t_df = pd.DataFrame(s['data'])
        t_df['series'] = s['seriesID']
        dfs.append(t_df)
    df = pd.concat(dfs)
    
    return df
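
The parsing step can be exercised without touching the network. The sketch below is a minimal illustration, not the notebook's actual code: `series_to_records` and the `sample` payload are hypothetical, and the top-level `status` field with the value `REQUEST_SUCCEEDED` is an assumption about the response format that should be checked against a real reply.

```python
def series_to_records(payload):
    """Flatten a BLS-style JSON payload into one dict per observation."""
    # Assumption: success is reported through a top-level "status" field.
    status = payload.get("status")
    if status != "REQUEST_SUCCEEDED":
        raise RuntimeError(f"BLS request failed: {status} {payload.get('message')}")

    records = []
    for series in payload["Results"]["series"]:
        for row in series["data"]:
            # Copy the observation and tag it with the series it came from.
            records.append({**row, "series": series["seriesID"]})
    return records


# Hypothetical payload that mimics the structure parsed by API_call.
sample = {
    "status": "REQUEST_SUCCEEDED",
    "message": [],
    "Results": {"series": [
        {"seriesID": "CUSR0000SAF1", "data": [
            {"year": "2009", "period": "M12", "periodName": "December", "value": "217.904"},
        ]},
        {"seriesID": "CUSR0000SA0E", "data": []},  # a series with no observations
    ]},
}

records = series_to_records(sample)
```

Passing `records` to `pd.DataFrame` should give the same column layout that `API_call` returns, with `value` still stored as a string.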

In [20]:
# Get a list of series IDs.
series = targets['Series ID'].tolist()

# The API limits requests to spans of 10 years, so I will need to send multiple requests.
year_ranges = [(2000,2009),(2010,2019),(2020,2021)]

# Send API requests and combine into a single DataFrame.
df = pd.concat([API_call(series, start, end) for start, end in year_ranges])
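
The year ranges above are written out by hand. As a sketch of how they could be generated for any window, the hypothetical helper below splits an inclusive range into chunks of at most `span` years. The default of 10 mirrors the limit mentioned in the comment above and is an assumption about the API tier in use.

```python
def split_years(start_year, end_year, span=10):
    """Split an inclusive year range into chunks of at most `span` years."""
    if start_year > end_year:
        raise ValueError("start_year must not exceed end_year")
    chunks = []
    for first in range(start_year, end_year + 1, span):
        # The last chunk is clipped so it never runs past end_year.
        last = min(first + span - 1, end_year)
        chunks.append((first, last))
    return chunks


year_ranges = split_years(2000, 2021)
```

For the 2000 to 2021 window this reproduces the three ranges used in the cell above.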

In [80]:
# Merge series names into results from API pull.
series_names = targets.set_index('Series ID').to_dict()['Category']
df['Category'] = df['series'].map(series_names)

# Convert month and year to a datetime column.
df['date'] = pd.to_datetime(df.year.astype(str) + '/' + df.period.str[1:] + '/01')
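
The date conversion relies on every `period` value being a monthly code such as `M01` to `M12`. The sketch below is a hypothetical standalone version of that step that rejects anything else (for example an annual-average code), so an unexpected period fails loudly instead of producing a bad date. The function name is illustrative only.

```python
from datetime import date


def period_to_date(year, period):
    """Convert a year and a monthly period code like 'M09' to the first of that month."""
    if not (len(period) == 3 and period[0] == "M" and period[1:].isdigit()):
        raise ValueError(f"Unexpected period code: {period!r}")
    month = int(period[1:])
    if not 1 <= month <= 12:
        # Codes outside M01..M12 are not calendar months.
        raise ValueError(f"Period {period!r} is not a calendar month")
    return date(int(year), month, 1)


first_of_month = period_to_date("2009", "M12")
```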

In [5]:
# Make sure values are stored as numbers and not as strings.
df['value'] = df['value'].astype(float)

# Save DataFrame as pickle just in case.
df.to_pickle("backup_data.pkl")

# Load DataFrame from pickle.
#df = pd.read_pickle("backup_data.pkl")

# Check to see that format looks correct.
df.head()

| | year | period | periodName | value | footnotes | series | Category | date |
|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | M12 | December | 217.904 | [{}] | CUSR0000SAF1 | Food | 2009-12-01 |
| 1 | 2009 | M11 | November | 217.581 | [{}] | CUSR0000SAF1 | Food | 2009-11-01 |
| 2 | 2009 | M10 | October | 217.452 | [{}] | CUSR0000SAF1 | Food | 2009-10-01 |
| 3 | 2009 | M09 | September | 217.258 | [{}] | CUSR0000SAF1 | Food | 2009-09-01 |
| 4 | 2009 | M08 | August | 217.376 | [{}] | CUSR0000SAF1 | Food | 2009-08-01 |


## Hypotheses to Explore:
##### 1 - Shelter prices have increased more rapidly than any of my other target categories since 2010.
##### 2 - Energy prices fluctuate more than my other target categories and will therefore have more periods of negative inflation than all of the other categories combined.
##### 3 - Medical care inflation will be more rapid from 2019-2021 than at any other two-year period from 2000-2021, possibly due to the COVID-19 pandemic.

<HR>

Following exploration, I want to highlight my process with energy prices first before moving on to the other hypotheses.

<HR>
    
### Hypothesis: Energy prices fluctuate more than my other target categories and will therefore have more periods of negative inflation than all of the other categories combined.

In [8]:
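# Shorten a category name, a change carried over from work on the other hypotheses.
# Assigning the result back avoids relying on an in-place edit of a column selection.
df['Category'] = df['Category'].replace({'New and used motor vehicles':'Motor vehicles'})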

# First chart - raw data by category.
alt.Chart(df[['date','value','Category']],).mark_line().encode(
    x = 'date',
    y = 'value',
    color = 'Category'
)

**What's informative about this view:** This first chart definitively shows the fluctuations in energy prices, supporting my hypothesis. 

**What could be improved about this view:** To better align this chart with the task at hand, I want to change the units. Month-over-month or year-over-year percentage changes might better illustrate the point.
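
As a small illustration of the month-over-month calculation proposed here, the sketch below uses made-up index values (not BLS data). It also shows why the change should be computed within each category: an ungrouped `pct_change` compares the first row of one category against the last row of the previous one.

```python
import pandas as pd

# Hypothetical index values for two categories over two months.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-01", "2020-02-01"]),
    "Category": ["Energy", "Energy", "Food", "Food"],
    "value": [100.0, 90.0, 200.0, 202.0],
})
toy = toy.sort_values(["Category", "date"]).reset_index(drop=True)

# Percentage change computed separately within each category.
toy["grouped"] = toy.groupby("Category")["value"].pct_change()

# Percentage change computed down the whole column, ignoring categories.
toy["ungrouped"] = toy["value"].pct_change()
```

In the grouped column the first month of each category is missing, as it should be. In the ungrouped column the first Food row is compared against the last Energy row and reports a large change that never happened.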

In [9]:
# There are multiple ways to get this data in order. These are a few different techniques.

# Build new dataframe showing month-over-month changes in a new column.
df2 = df.set_index('date').sort_values('date')
df2['pct_change'] = df2.groupby('Category').value.pct_change()

# Build month-over-month changes after reindexing by date and grouping by category.
# pct_change is applied within each category so the first month of one category
# is not compared against the last month of the previous one.
df_m = df.set_index('date').groupby('Category')['value'].resample('M').mean().groupby(level='Category').pct_change()

# Build year-over-year changes after reindexing by date and grouping by category.
df_y = df.set_index('date').groupby('Category')['value'].resample('Y').mean().groupby(level='Category').pct_change()

# Second chart - Use circle marks to show the monthly changes.
alt.Chart(df2,title='Monthly Price Movements by Category from 2000 to 2021').mark_circle().encode(
    x = alt.X('pct_change', title='Percentage Change', axis=alt.Axis(format='%')),
    y = alt.Y('Category'),
    color = 'Category'
)

**What's informative about this view:** This graph shows the relatively extreme fluctuations in Energy prices.

**What could be improved about this view:** There are too many points crowded in the middle to make sense of that region of the chart. How does this look with hollow circles?

In [10]:
# Third chart - Use hollow circle marks to show the monthly changes.
alt.Chart(df2,title='Monthly Price Movements by Category from 2000 to 2021').mark_point().encode(
    x = alt.X('pct_change', title='Percentage Change', axis=alt.Axis(format='%')),
    y = alt.Y('Category'),
    color = 'Category'
)

**What's informative about this view:** Similar to the last view. No great improvements.

**What could be improved about this view:** Still difficult to parse datapoints in the middle. How about looking at the yearly changes?

In [12]:
# Fourth chart - Use hollow marks to show the yearly changes.
alt.Chart(pd.DataFrame(df_y).reset_index(),title='Yearly Price Movements by Category from 2000 to 2021').mark_point().encode(
    x = alt.X('value', title='Percentage Change', axis=alt.Axis(format='%')),
    y = alt.Y('Category'),
    color = 'Category'
)

**What's informative about this view:** There is greater separation in the middle, allowing the user to better parse the crowded data in there. 

**What could be improved about this view:** This chart still doesn't answer our question. Let's try a histogram instead.

In [13]:
# Fifth chart - Histogram
alt.Chart(pd.DataFrame(df_y).reset_index(),title='Yearly Price Movements by Category from 2000 to 2021').mark_bar().encode(
    x = alt.X('value', title='Percentage Change', bin=alt.Bin(extent=[-100, 100], step=100)),
    y = alt.Y('count()'),
    color = 'Category'
)

**What's informative about this view:** Now we can explicitly see the negative and positive changes by category without having to parse slopes or points.

**What could be improved about this view:** There are a few problems to tackle here. The labels on the X axis should show that we're binning positive vs. negative changes, and we also want to unstack the categories. We want to focus on the categories as entities rather than one big heap.

In [15]:
# Sixth chart - Histogram of positive or negative changes
alt.Chart(pd.DataFrame(df_y).reset_index().dropna(),title='Yearly Price Movements by Category from 2000 to 2021').mark_bar().encode(
    x = alt.X('Category', title=None),
    y = alt.Y('count()', title='Count'),
    color = 'Category',
    column = alt.Column('value', bin=alt.Bin(extent=[-100, 100], step=100), title=None, header=alt.Header(labels=False))
)

**What's informative about this view:** I can see that my hypothesis is wrong for year-by-year groupings. The categories are now properly separated and able to be discerned. This presentation much more closely aligns with answering our question than anything before.

**What could be improved about this view:** I'm having trouble labeling both sides of this column chart with "Negative" and "Positive". In the next step I'm going to try another method to slice the data while also looking at the monthly changes. I'll look for changes that are 1% or more in either direction, as these would be more significant. Currently even a 0.00001% change is being counted.
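
The 1% rule described here can be stated as a small function. This is a sketch only: `classify_change` is a hypothetical helper, the labels match the ones used in the cells that follow, and the sample values are made up.

```python
def classify_change(pct_change, threshold=0.01):
    """Label a fractional price change as an increase, a decrease, or negligible."""
    if pct_change >= threshold:
        return "Price Increase"
    if pct_change <= -threshold:
        return "Price Decrease"
    # Anything strictly between -threshold and +threshold is ignored.
    return "Negligible"


# Hypothetical monthly changes, expressed as fractions (0.01 means 1%).
labels = [classify_change(x) for x in [0.025, -0.01, 0.004, -0.00001]]
```

A function like this could also be applied to a DataFrame column with `Series.map`, which keeps the threshold in one place when the same labeling is repeated for monthly and yearly data.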

In [16]:
df3 = pd.DataFrame(df_m).reset_index().dropna()
# Flag changes of at least 1% in either direction; everything else is negligible.
df3['1_pct_grow'] = df3['value'] >= 0.01
df3['1_pct_drop'] = df3['value'] <= -0.01
df3['change'] = 'Negligible'
df3.loc[df3['1_pct_grow'], 'change'] = 'Price Increase'
df3.loc[df3['1_pct_drop'], 'change'] = 'Price Decrease'
df3.loc[df3['Category'] != 'Energy', 'Category'] = 'Non-Energy'

# Seventh chart - combining the other categories and limiting to changes of +/- 1%.
alt.Chart(df3[df3['change'] != 'Negligible'],title='Monthly Price Changes of 1% or Greater 2000-2021').mark_bar().encode(
    x = alt.X('Category'),
    y = alt.Y('count()', title='Count'),
    color = 'Category',
    column = alt.Column('change', title=None)
).configure_axis(
    titleFontSize = 16,
    labelFontSize = 12
).configure_title(
    fontSize=20,
    anchor='middle'
).configure_legend(
    titleFontSize=16,
    labelFontSize=12,
).properties(height=300, width=200)

**What's informative about this view:** By throwing out changes less than 1% either way, the data aligns more closely with my initial hypothesis, but is this a faithful representation of the truth? The chart is clean and easy to read, and directly answers the main question.

**What could be improved about this view:** I don't like how the "Price Decrease" and "Price Increase" subheaders look, because I can't figure out how to resize them; if I can't resize them, I shouldn't make the other text so large. Let's see how the same chart looks grouped by year, with some small format changes.

In [17]:
df3 = pd.DataFrame(df_y).reset_index().dropna()
# Flag changes of at least 1% in either direction; everything else is negligible.
df3['1_pct_grow'] = df3['value'] >= 0.01
df3['1_pct_drop'] = df3['value'] <= -0.01
df3['change'] = 'Negligible'
df3.loc[df3['1_pct_grow'], 'change'] = 'Price Increase'
df3.loc[df3['1_pct_drop'], 'change'] = 'Price Decrease'
df3.loc[df3['Category'] != 'Energy', 'Category'] = 'Non-Energy'

# Eighth chart - 1% annual changes by category.
alt.Chart(df3[df3['change'] != 'Negligible'],title='Yearly Price Changes of 1% or Greater 2000-2021').mark_bar().encode(
    x = alt.X('Category', title=None),
    y = alt.Y('count()', title='Count'),
    color = 'Category',
    column = alt.Column('change', title=None)
).configure_title(
    anchor='middle'
).properties(height=300, width=200)

**What's informative about this view:** Like the previous chart, this one is clear and directly answers the question, but now on a yearly change basis.

**What could be improved about this view:** By making this yearly, most years end up counting as 1% +/- for every category. Lumping the other 8 categories into "Non-Energy" here makes Energy look insignificant. Basically, almost everything seems to go up or down by 1% per year whereas very few things were going up and down by 1% per month. For my final chart I want to go back to the monthly version, but make sure to note how I've framed the data to support my hypothesis.

In [20]:
df3 = pd.DataFrame(df_m).reset_index().dropna()
# Flag changes of at least 1% in either direction; everything else is negligible.
df3['1_pct_grow'] = df3['value'] >= 0.01
df3['1_pct_drop'] = df3['value'] <= -0.01
df3['change'] = 'Negligible'
df3.loc[df3['1_pct_grow'], 'change'] = 'Price Increase'
df3.loc[df3['1_pct_drop'], 'change'] = 'Price Decrease'
df3.loc[df3['Category'] != 'Energy', 'Category'] = 'Non-Energy'

# Ninth chart - 1% monthly changes by category.
alt.Chart(df3[df3['change'] != 'Negligible'],title='Monthly Price Changes of 1% or Greater 2000-2021 for select CPI-U Categories').mark_bar().encode(
    x = alt.X('Category', title=None),
    y = alt.Y('count()', title='Count'),
    color = 'Category',
    column = alt.Column('change', title=None)
).configure_title(
    anchor='middle'
).properties(height=300, width=200)

**What's informative about this view:** I made the title more explicit about what I'm showing to reduce confusion.

**What could be improved about this view:** Lost in the noise here is what exactly makes up the 'Non-Energy' block. I need to find a way to break it out for this chart to be meaningful to an audience that isn't familiar with all of the steps above.

In [37]:
df3 = pd.DataFrame(df_m).reset_index().dropna()
# Flag changes of at least 1% in either direction; everything else is negligible.
df3['1_pct_grow'] = df3['value'] >= 0.01
df3['1_pct_drop'] = df3['value'] <= -0.01
df3['change'] = 'Negligible'
df3.loc[df3['1_pct_grow'], 'change'] = 'Price Increase'
df3.loc[df3['1_pct_drop'], 'change'] = 'Price Decrease'
df3['Category2'] = df3['Category']
df3.loc[df3['Category2'] != 'Energy', 'Category2'] = 'Non-Energy'

# Tenth and final chart - 1% monthly changes by category, with the categories providing coloring and hover tooltip.
alt.Chart(df3[df3['change'] != 'Negligible'],title='Monthly Price Changes of 1% or Greater 2000-2021 for select CPI-U Categories').mark_bar().encode(
    x = alt.X('Category2', title=None),
    y = alt.Y('count()', title='Count'),
    color = 'Category',
    column = alt.Column('change', title=None),
    tooltip=[alt.Tooltip('Category', title="Category"), alt.Tooltip('count()', title="Count")]
).configure_title(
    anchor='middle'
).properties(height=300, width=200)

**What's informative about this view:** All of the categories in Non-Energy have been broken out and stacked to show how they contribute to the total. I added a hover tooltip to let users get counts and category labels for each slice of the stack. Now it effectively illustrates that energy experiences far more monthly increases or decreases of at least 1% than apparel, communication, education, food, medical care, motor vehicles, recreation, and shelter combined. Not only does it answer our hypothesis about price decreases, but we find that it holds true for increases when sliced this way.

**What could be improved about this view:** I like this chart, but there are definitely other refinements that could be made if desired. To further underline my hypothesis I could add more categories to the non-energy stack. I could also focus on only the price decrease side to stick to my hypothesis.


### Conclusion: The hypothesis is partially true. 

Energy prices do fluctuate more than other categories, but there is an ambiguity to measuring "more periods of negative inflation". The data can be presented to support this statement, but there are also ways to display the data that don't support the statement. Ultimately, the hypothesis was not defined rigorously enough to be fully answered with the data available.

A more specific hypothesis like "Energy prices fluctuate more than my other target categories, and will therefore have more months with negative inflation of 1% or greater than all of the other categories combined." would have been fully supported by the data, as evidenced in the chart above.
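
The ambiguity can be made concrete with a toy example. The numbers below are invented purely for illustration (they are not BLS data), and `count_decreases` is a hypothetical helper. The same set of monthly changes rejects the hypothesis when any decline counts as negative inflation, and supports it when only declines of 1% or more count.

```python
# Hypothetical monthly changes as fractions (0.01 means 1%).
changes = {
    "Energy": [-0.030, 0.025, -0.015, 0.040, -0.020, 0.010],
    "Food": [-0.001, 0.002, -0.002, 0.003, -0.001, 0.002],
    "Apparel": [-0.004, -0.012, 0.003, -0.002, 0.001, -0.003],
}


def count_decreases(values, threshold=0.0):
    """Count declines that are at least `threshold` in size (any decline when 0)."""
    return sum(1 for v in values if v < 0 and v <= -threshold)


def energy_vs_rest(threshold):
    """Return (energy count, combined count of all other categories)."""
    energy = count_decreases(changes["Energy"], threshold)
    rest = sum(
        count_decreases(values, threshold)
        for name, values in changes.items()
        if name != "Energy"
    )
    return energy, rest


any_decline = energy_vs_rest(threshold=0.0)
one_percent = energy_vs_rest(threshold=0.01)
```

With these made-up values, Energy has fewer declining months than the other categories combined under the loose definition, and more under the 1% definition.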

<HR>
    
### Next Hypothesis: Shelter prices have increased more rapidly than any of my other target categories since 2010.

In [38]:
# Normalize values to % change within each category from the start of the date window.
baseline_df = df[['date','value','Category']][df['date'] == '1/1/2010']
baseline_dict = baseline_df.set_index('Category').to_dict()['value']
df['baseline'] = df['Category'].map(baseline_dict) 
df['change'] = df['value']/df['baseline'] - 1

# First real chart - show normalized data.
alt.Chart(df[['date','change','Category']][df['date'] >= '1/1/2010'], title='Price Change since 2010 by Category').mark_line().encode(
    x = alt.X('date', title = 'Year'),
    y = alt.Y('change', title='Price change since 2010', axis=alt.Axis(format='%')),
    color = 'Category'
)

Even in this first viz, we can clearly see that the hypothesis is wrong. Shelter did not increase as much as education or medical care. Still, I pressed on, refining the chart into something that better fit the hypothesis, ending up with the version below:

In [39]:
# Final chart - With improved labels and resized.
alt.Chart(df[['date','change','Category']][df['date'] >= '12/1/2021'],title='Price Change since 2010 by Category').mark_bar().encode(
    x = alt.X('Category', sort='y'),
    y = alt.Y('change', title='Price change since 2010', axis=alt.Axis(format='%')),
    color = 'Category'
).configure_axis(
    titleFontSize = 16,
    labelFontSize = 12
).configure_title(
    fontSize=20
).configure_legend(
    titleFontSize=16,
    labelFontSize=12,
).properties(height=300, width=400)

### Conclusion: The hypothesis is incorrect.

Education and medical care experienced greater increases in prices from 2010-2021 than Shelter. These three categories saw relatively similar increases in price. Shelter did increase in price more than apparel, communication, energy, food, motor vehicles, and recreation.
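
The comparison behind this conclusion is a baseline normalization followed by a ranking. The sketch below shows the idea on invented index values (not the BLS figures). It assumes each category's first row is the start of the window, so `transform("first")` can stand in for the baseline lookup used in the cell above.

```python
import pandas as pd

# Hypothetical index values at the start and end of the window.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2021-12-01"] * 3),
    "Category": ["Shelter", "Shelter", "Education", "Education", "Apparel", "Apparel"],
    "value": [100.0, 130.0, 200.0, 280.0, 120.0, 126.0],
})
toy = toy.sort_values(["Category", "date"])

# Baseline is each category's value at the start of the window.
baseline = toy.groupby("Category")["value"].transform("first")
toy["change"] = toy["value"] / baseline - 1

# Rank categories by their cumulative change at the latest date.
latest = toy[toy["date"] == toy["date"].max()]
ranking = latest.set_index("Category")["change"].sort_values(ascending=False)
```

Printing `ranking` gives the same ordering a sorted bar chart would show, which is a quick way to confirm which category leads before styling a figure.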

<HR>

### Next Hypothesis: Medical care inflation was more rapid from 2019-2021 than at any other two-year period from 2000-2021, possibly due to the COVID-19 pandemic.

In [40]:
# We only need to look at the medical care category.
df = df[df['Category'] == 'Medical care']

# Build new dataframe showing year-over-year changes after reindexing by date.
# Only the numeric 'value' column is averaged.
df_y = df.set_index('date')[['value']].resample('Y').mean().pct_change()
df_y.reset_index(inplace=True)
df_y['year'] = pd.DatetimeIndex(df_y['date']).year

# First chart - bar chart of yearly inflation rates.
alt.Chart(pd.DataFrame(df_y).dropna(),title='Yearly Price Changes for Medical Care from 2000 to 2021').mark_bar().encode(
    x = alt.X('year:O', title=None, sort='y'),
    y = alt.Y('value', title='Change', axis=alt.Axis(format='%'))
)

Yet another incorrect hypothesis. With 2021 showing the lowest inflation rate, and 2020 being somewhere in the middle, there is no way the hypothesis is true. 2001 and 2002 had the highest two inflation rates, so 2000-2002 seems to be the period with the most inflation. Still, the first graph does not tell this story explicitly, so there is work to be done. Here is the final chart I settled on:

In [45]:
# Use months to help space out the index, and add hover labels.
df_m = df.set_index('date').sort_values('date')
df_m['pct_change'] = df_m.value.pct_change(periods=24)
df_m.reset_index(inplace=True)
df_m['year'] = pd.DatetimeIndex(df_m['date']).year
df_m['label'] = df_m['periodName'].astype(str) + ' ' + (df_m['year'] -2).astype(str) + ' to ' + df_m['year'].astype(str)
df_m['month'] = pd.DatetimeIndex(df_m['date']).month
df_m['spacing'] = df_m['year'] + (df_m['month'] / 12)
df_m['mouseover'] = df_m['label'] + ': ' + (df_m['pct_change'] * 100).round(2).astype(str) + '%'

# Sixth chart - scatter of 24-month rolling changes with red highlight.
alt.Chart(pd.DataFrame(df_m).dropna(),title='Rolling 24-Month Price Changes for Medical Care since 2002 with 2021 in Red').mark_point().encode(
    y = alt.Y('pct_change', title='Change', axis=alt.Axis(format='%')),
    x = alt.X('spacing', title='Year', axis = alt.Axis(format='0'), scale=alt.Scale(domain=[2002, 2022])),
    tooltip=('mouseover'),
    color=alt.condition(
        alt.datum.year == 2021,
        alt.value('red'),
        alt.value('lightgrey')
    )
).interactive(
).properties(height=300, width=400)

### Conclusion: The hypothesis is false.

Medical expense inflation was surprisingly low during 2019-2021, despite the ongoing COVID-19 pandemic. Paradoxically, this was one of the lowest periods of medical expense inflation since 2000. Medical care frequently topped 8% inflation during the first decade of the millennium. Later years saw annualized inflation hovering in the range of ~3% to ~8%.
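
The final chart plots 24-month changes, while the figures quoted here are described as annualized. The two are related by compounding, and the sketch below shows one way to convert between them. It uses a made-up index that grows at a constant monthly rate, and `annualize` is a hypothetical helper that assumes compound growth.

```python
def annualize(total_change, years):
    """Convert a cumulative fractional change over `years` into a per-year rate."""
    return (1 + total_change) ** (1 / years) - 1


# Hypothetical index growing 0.25% per month for 24 months.
monthly_rate = 0.0025
index = [100 * (1 + monthly_rate) ** m for m in range(25)]

# 24-month change, the quantity plotted in the final chart.
change_24m = index[24] / index[0] - 1

# Equivalent yearly rate under compounding.
annual = annualize(change_24m, years=2)
```

Because of compounding, the annualized rate is slightly less than half of the 24-month change, so the two should not be read off the same axis interchangeably.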