In [5]:
import pandas as pd
import altair as alt
df = pd.read_pickle('backup_data.pkl')

### Hypothesis: Shelter prices have increased more rapidly than any of my other target categories since 2010.

In [6]:
# Preview chart - raw data by category.
alt.Chart(df[['date','value','Category']]).mark_line().encode(
    x = 'date',
    y = 'value',
    color = 'Category'
)

We don't need to show this whole date range to test our hypothesis. Let's narrow it down.

In [7]:
# Second preview chart - cut to target date range.
alt.Chart(df[['date','value','Category']][df['date'] >= '1/1/2010']).mark_line().encode(
    x = 'date',
    y = 'value',
    color = 'Category'
)

Another problem is that these values aren't all in the same units. Comparing each category's percentage change over the target time period makes for a fair comparison. Now is also a good time to add better titles and labels.

In [10]:
# Normalize values to % change for each category from the start of the date window.
baseline_df = df[['date','value','Category']][df['date'] == '1/1/2010']
baseline_dict = baseline_df.set_index('Category').to_dict()['value']
df['baseline'] = df['Category'].map(baseline_dict) 
df['change'] = df['value']/df['baseline'] - 1

# First real chart - show normalized data.
alt.Chart(df[['date','change','Category']][df['date'] >= '1/1/2010'], title='Price Change since 2010 by Category').mark_line().encode(
    x = alt.X('date', title = 'Year'),
    y = alt.Y('change', title='Price change since 2010', axis=alt.Axis(format='%')),
    color = 'Category'
)
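
The normalization step can be sanity-checked on a small frame before trusting the chart. The sketch below uses made-up numbers and hypothetical category names (`A`, `B`), not the notebook's data, and assumes `date` is a datetime column:

```python
import pandas as pd

# Toy data (made-up numbers, not the notebook's dataset).
# The two categories are deliberately on different scales.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2011-01-01', '2010-01-01', '2011-01-01']),
    'Category': ['A', 'A', 'B', 'B'],
    'value': [100.0, 110.0, 50.0, 60.0],
})

# Baseline = each category's value on the first date of the window.
start = pd.Timestamp('2010-01-01')
baseline = toy[toy['date'] == start].set_index('Category')['value']

# Map each row to its category's baseline, then compute fractional change.
toy['baseline'] = toy['Category'].map(baseline)
toy['change'] = toy['value'] / toy['baseline'] - 1

print(toy[['date', 'Category', 'change']])
```

Each category starts at 0 on the baseline date, so series with different units become directly comparable.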

We can clearly see that our hypothesis is wrong: shelter did not increase as much as education or medical care. This chart looks a little busy, though. Is there another way to present this data?

In [11]:
# Second chart - let's see how the normalized data looks as a centered stacked area plot.
alt.Chart(df[['date','change','Category']][df['date'] >= '1/1/2010'],title='Price Change since 2010 by Category').mark_area().encode(
    x = alt.X('date'),
    y = alt.Y('change', stack='center', axis=None),
    color = 'Category'
)

This chart looks cool, but it is less effective at answering our hypothesis. The encoding makes it difficult to judge the relative value of each category. Maybe making it interactive and allowing users to zoom in on specific areas will help?

In [12]:
# Third chart - the same stacked area plot, made interactive (pan and zoom).
alt.Chart(df[['date','change','Category']][df['date'] >= '1/1/2010'],title='Price Change since 2010 by Category').mark_area().encode(
    x = alt.X('date'),
    y = alt.Y('change', stack='center', axis=None),
    color = 'Category'
).interactive()

Adding zoom makes it easier to compare adjacent series, but as we zoom we miss out on the big picture. The most important thing to visualize is the total change during the time period, so a bar chart seems like a cleaner and better tool for that task.

In [13]:
# If all that matters is the final value, let's graph as a bar chart to make it easier to read.
# Fourth chart - show normalized data as a bar chart.
alt.Chart(df[['date','change','Category']][df['date'] >= '12/1/2021'],title='Price Change since 2010 by Category').mark_bar().encode(
    x = alt.X('Category'),
    y = alt.Y('change', title='Price Change', axis=alt.Axis(format='%')),
    color = 'Category'
)
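
The filter `df['date'] >= '12/1/2021'` appears to assume the data ends in December 2021. If later months were ever present, each category would contribute more than one row to the bar chart. A minimal sketch of the difference, using made-up toy data and unambiguous ISO dates (the extra 2022 rows are hypothetical):

```python
import pandas as pd

# Made-up toy data; the 2022 rows simulate the source being refreshed later.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2021-12-01', '2022-01-01', '2021-12-01', '2022-01-01']),
    'Category': ['A', 'A', 'B', 'B'],
    'change': [0.20, 0.22, 0.35, 0.36],
})

end = pd.Timestamp('2021-12-01')

# Open-ended filter: keeps every row on or after the end date.
open_ended = toy[toy['date'] >= end]

# Exact match on the end of the window: one row per category.
end_of_window = toy[toy['date'] == end]

print(len(open_ended), len(end_of_window))
```

Matching the end date exactly keeps the bar heights equal to the final change regardless of how far the data extends.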

One last tweak to better fit our task is to sort the bars by the change since 2010.

In [14]:
# Fifth chart - same bar chart, with bars sorted by change since 2010.
alt.Chart(df[['date','change','Category']][df['date'] >= '12/1/2021'],title='Price Change since 2010 by Category').mark_bar().encode(
    x = alt.X('Category', sort='y'),
    y = alt.Y('change', title='Price change since 2010', axis=alt.Axis(format='%')),
    color = 'Category'
)

"New and used motor vehicles" is a bit long for a label. It looks weird. Let's change it to "Motor vehicles".

In [15]:
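# Assign the result back: replace(inplace=True) on a selected column may not update df in newer pandas.
df['Category'] = df['Category'].replace({'New and used motor vehicles':'Motor vehicles'})
# Final chart - with improved labels and resized.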
alt.Chart(df[['date','change','Category']][df['date'] >= '12/1/2021'],title='Price Change since 2010 by Category').mark_bar().encode(
    x = alt.X('Category', sort='y'),
    y = alt.Y('change', title='Price change since 2010', axis=alt.Axis(format='%')),
    color = 'Category'
).configure_axis(
    titleFontSize = 16,
    labelFontSize = 12
).configure_title(
    fontSize=20
).configure_legend(
    titleFontSize=16,
    labelFontSize=12,
).properties(height=300, width=400)

### Conclusion: The hypothesis is incorrect.

Education and medical care saw greater price increases than shelter from 2010 to 2021, although all three categories rose by relatively similar amounts. Shelter did increase in price more than apparel, communication, energy, food, motor vehicles, and recreation.
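
A ranking like this can also be checked as a table rather than read off the chart. The sketch below is illustrative only: the numbers are placeholders, the category names `X`, `Y`, `Z` are hypothetical, and `tops_ranking` is a helper introduced here, not part of the notebook.

```python
import pandas as pd

# Placeholder final-period changes (made up, not results from the notebook).
final = pd.DataFrame({
    'Category': ['X', 'Y', 'Z'],
    'change': [0.30, 0.45, 0.10],
})

def tops_ranking(frame, category):
    """Return True if `category` has the largest `change` in `frame`."""
    ranked = frame.sort_values('change', ascending=False)
    return ranked['Category'].iloc[0] == category

# Full ranking, largest change first.
print(final.sort_values('change', ascending=False))

# A hypothesis of the form "X increased the most" reduces to one check.
print(tops_ranking(final, 'X'))
```

With the notebook's data, the same check would be run on the rows for the final date, passing the shelter category's label.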