### Baby Boomers Through Time
#### Demographics Report     

**Name: Vaishnavi Sathiyamoorthy**      

**UT EID: vs25229**  

**Date: 10/08/2023**

#### Homework 6

##### Creating Interactive Charts to Visualize Population Shifts over Time with Altair

Baby boomers (often shortened to boomers) are the demographic cohort following the Silent Generation and preceding Generation X. The generation is generally defined as people born from 1946 to 1964, during the post–World War II baby boom. The term is also used outside the United States, but the dates, the demographic context, and the cultural identifiers may vary. The baby boom has been described variously as a "shockwave" and as "the pig in the python." Baby boomers are often parents of late Gen Xers and Millennials. (Source: [Wikipedia](https://en.wikipedia.org/wiki/Baby_boomers).)

Let us explore this "shockwave" by examining the US Census data available via the vega datasets package. We'll start with some data engineering to add a column to our population data that denotes generational membership. Then we will juxtapose the sex distribution of the population using the brushing and linking technique we studied in the lab. Finally, we will add a slider to animate the transition through time.

In [29]:
# Import the necessary libraries and data
# Altair was updated to version 5.2
import altair as alt
import pandas as pd

df_pop = pd.read_json('population.json')

In [30]:
df_pop.head()

|  | year | age | sex | people |
|---|---|---|---|---|
| 0 | 1850 | 0 | 1 | 1483789 |
| 1 | 1850 | 0 | 2 | 1450376 |
| 2 | 1850 | 5 | 1 | 1411067 |
| 3 | 1850 | 5 | 2 | 1359668 |
| 4 | 1850 | 10 | 1 | 1260099 |


In [31]:
df_pop.shape

(570, 4)

In [32]:
df_pop.describe()

|  | year | age | sex | people |
|---|---|---|---|---|
| count | 570.0 | 570.0 | 570.0 | 570.0 |
| mean | 1927.333333 | 45.0 | 1.5 | 3428937.0 |
| std | 46.726717 | 27.410182 | 0.500439 | 3101098.0 |
| min | 1850.0 | 0.0 | 1.0 | 5259.0 |
| 25% | 1880.0 | 20.0 | 1.0 | 674284.0 |
| 50% | 1930.0 | 45.0 | 1.5 | 2548015.0 |
| 75% | 1970.0 | 70.0 | 2.0 | 5466410.0 |
| max | 2000.0 | 90.0 | 2.0 | 11635650.0 |


In [33]:
df_pop.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 570 entries, 0 to 569
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    570 non-null    int64
 1   age     570 non-null    int64
 2   sex     570 non-null    int64
 3   people  570 non-null    int64
dtypes: int64(4)
memory usage: 17.9 KB


##### Q1 - Add in the "Boomer" label

As we can see from inspecting the dataframe, our data only gives us information for:
  - year
  - age
  - sex
  - people
  
But we want to be able to highlight just the people born between 1946 and 1964 as a separate group. To accomplish this, we want to create a new categorical attribute: `Generation`.

Using pandas data manipulation techniques, add a new column to `df_pop` named `Generation` that either has the value `Baby Boomer` or `Other`.

In [34]:
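def generation(year, age):
    # Approximate birth year is the census year minus the age
    birth_year = year - age
    if 1946 <= birth_year <= 1964:
        return "Baby Boomer"
    return "Other"

# Label every row by applying the helper to its year and age
df_pop['Generation'] = df_pop.apply(lambda row: generation(row['year'], row['age']), axis=1)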

df_pop.head()

|  | year | age | sex | people | Generation |
|---|---|---|---|---|---|
| 0 | 1850 | 0 | 1 | 1483789 | Other |
| 1 | 1850 | 0 | 2 | 1450376 | Other |
| 2 | 1850 | 5 | 1 | 1411067 | Other |
| 3 | 1850 | 5 | 2 | 1359668 | Other |
| 4 | 1850 | 10 | 1 | 1260099 | Other |
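
The row-wise `apply` in the cell above can also be written as a vectorized comparison, which tends to be faster on larger frames. The following is a minimal sketch on a small hypothetical sample rather than the real `df_pop`; the helper name `label_generation` and its parameters are assumptions made for illustration. Because `age` appears to be recorded in five-year steps, `year - age` is only an approximate birth year.

```python
import numpy as np
import pandas as pd

def label_generation(df, first_birth_year=1946, last_birth_year=1964):
    """Return a Series of generation labels (hypothetical helper, not part of the assignment)."""
    # Approximate birth year: census year minus age
    birth_year = df["year"] - df["age"]
    # between() is inclusive on both ends by default
    is_boomer = birth_year.between(first_birth_year, last_birth_year)
    return pd.Series(np.where(is_boomer, "Baby Boomer", "Other"),
                     index=df.index, name="Generation")

# Hypothetical sample rows, not taken from the census file
sample = pd.DataFrame({
    "year": [1850, 1960, 1990, 2000, 2000],
    "age":  [0, 10, 30, 35, 60],
})
sample["Generation"] = label_generation(sample)
print(sample)
```

The same call on the full data would be `df_pop["Generation"] = label_generation(df_pop)`.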


##### Q2 - Change the encoding for `sex`

As in our lab in class, the sex "Male" is encoded as the number `1` and "Female" is encoded as `2`. Modify the dataframe `df_pop` to replace the numeric encoding with strings so that the legend comes out correctly when we create our plots (note that you can also map numbers to labels in Altair).

In [35]:
def encode(x):
    # The raw data codes Male as 1; every other value is labeled Female
    if str(x) == '1':
        return "Male"
    return "Female"

df_pop['sex'] = df_pop['sex'].apply(encode)
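
The helper above labels every value other than `1` as "Female", so an unexpected code would pass unnoticed. One possible alternative, sketched here on hypothetical sample data, is a dictionary lookup with `Series.map` that raises on unknown codes. The names `SEX_LABELS` and `label_sex` are assumptions introduced for this example.

```python
import pandas as pd

# Assumed coding, as described in the prompt: 1 = Male, 2 = Female
SEX_LABELS = {1: "Male", 2: "Female"}

def label_sex(codes):
    """Map numeric sex codes to labels and fail loudly on anything unexpected."""
    labels = codes.map(SEX_LABELS)  # codes missing from the dict become NaN
    unknown = codes[labels.isna()].unique().tolist()
    if unknown:
        raise ValueError(f"Unexpected sex codes: {unknown}")
    return labels

# Hypothetical sample rows
sample = pd.DataFrame({"sex": [1, 2, 1, 2]})
sample["sex"] = label_sex(sample["sex"])
print(sample)
```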

In [36]:
df_pop.head()

|  | year | age | sex | people | Generation |
|---|---|---|---|---|---|
| 0 | 1850 | 0 | Male | 1483789 | Other |
| 1 | 1850 | 0 | Female | 1450376 | Other |
| 2 | 1850 | 5 | Male | 1411067 | Other |
| 3 | 1850 | 5 | Female | 1359668 | Other |
| 4 | 1850 | 10 | Male | 1260099 | Other |


##### Q3 Juxtapose Bar Charts Horizontally

Create a bar chart of the population distribution in the year 1960, and horizontally juxtapose the bar chart of the population distribution for the year 1990. Plot the total number of people (ignoring the `sex` attribute).

Note: you can slice the data to a given year with pandas before you pass it to Altair (you can also do this in Altair with filters).

Encode membership in the Baby Boomer generation with color, using "#7D3C98" (purple) for boomers and "#F4D03F" (gold) for `Other`.

Fix the y axis so it is equal in both plots.

In [37]:
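# Keep only the two census years we want to compare
df = df_pop[df_pop['year'].isin([1960, 1990])]

# Faceting by year places the charts side by side; facets share the y scale by default
alt.Chart(df).mark_bar().encode(
    x=alt.X('age:N', title='Age'),
    # Sum over sex so each bar shows the total population of the age group
    y=alt.Y('sum(people):Q', title='People'),
    # An explicit domain pins purple to boomers regardless of the data order
    color=alt.Color('Generation:N',
                    scale=alt.Scale(domain=["Baby Boomer", "Other"], range=["#7D3C98", "#F4D03F"])),
    column=alt.Column('year:O', title='Year')
)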
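
Because the data holds one row per sex, "ignoring the `sex` attribute" means adding the male and female counts for each age group. A minimal pandas sketch of that slice-and-sum step follows, using hypothetical sample rows; the helper name `totals_by_age` is an assumption for illustration.

```python
import pandas as pd

def totals_by_age(df, years):
    """Total people per year and age group, summed over sex (hypothetical helper)."""
    subset = df[df["year"].isin(years)]
    return subset.groupby(["year", "age", "Generation"], as_index=False)["people"].sum()

# Hypothetical sample rows, not taken from the census file
sample = pd.DataFrame({
    "year":   [1960, 1960, 1960, 1960, 1990, 1990, 2000],
    "age":    [0, 0, 20, 20, 30, 30, 0],
    "sex":    ["Male", "Female", "Male", "Female", "Male", "Female", "Male"],
    "people": [10, 9, 7, 8, 6, 5, 4],
    "Generation": ["Baby Boomer", "Baby Boomer", "Other", "Other",
                   "Baby Boomer", "Baby Boomer", "Other"],
})

totals = totals_by_age(sample, [1960, 1990])
print(totals)
```

Passing a pre-aggregated frame like `totals` to Altair is one option; letting Altair aggregate with `sum(people)` is another, and both should draw the same bars.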

##### Q5 - Show the Population Change Over Time with a Slider

Now we have a snapshot of two different years next to each other, but what about creating a crude animation by controlling the year displayed with a slider?

Create a slider using [this example](https://altair-viz.github.io/gallery/us_population_over_time.html) to help guide you. Our plot will look similar, except we have not split our bar chart up by `sex` yet. Name the slider 'Select Year:' (this is controlled in `binding_range`, not in the selection parameters).

Encode membership in the Baby Boomer generation with color, using "#7D3C98" (purple) for boomers and "#F4D03F" (gold) for `Other`.

Start the slider at 1900.

In [64]:
# Point selection bound to a slider; the visible label comes from binding_range's `name`
select_year = alt.selection_point(
    name="Year",
    fields=["year"],
    bind=alt.binding_range(min=1900, max=2000, step=10, name="Select Year:"),
    value={"year": 1900},
)

alt.Chart(df_pop).mark_bar().encode(
    alt.X("age:N").title("Age"),
    alt.Y("sum(people):Q").title("People"),
    alt.Color("Generation:N").scale(domain=["Baby Boomer", "Other"], range=["#7D3C98", "#F4D03F"]),
).properties(
    title="Population by Generation and Age"
).add_params(
    select_year
).transform_filter(
    select_year  # show only the rows for the year chosen on the slider
)
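
The slider only shows data when its `min`, `max`, and `step` line up with years that actually occur in the `year` column; a position with no matching rows would leave the chart empty. The sketch below derives those three values from a list of years instead of hard-coding them. The helper name `slider_bounds` and the sample years are assumptions made for illustration.

```python
import pandas as pd

def slider_bounds(years, start=1900):
    """Return min, max, and step for a slider covering the years from `start` on."""
    available = sorted({int(y) for y in years if y >= start})
    # Gaps between consecutive years; a usable slider needs exactly one gap size
    steps = {later - earlier for earlier, later in zip(available, available[1:])}
    if len(steps) != 1:
        raise ValueError(f"Years are not evenly spaced: {available}")
    return {"min": available[0], "max": available[-1], "step": steps.pop()}

# Hypothetical year column with an uneven gap before the starting year
sample_years = pd.Series([1870, 1880, 1900, 1910, 1920, 1930, 1940])
bounds = slider_bounds(sample_years, start=1900)
print(bounds)
```

With the real data, `slider_bounds(df_pop["year"])` could be unpacked into `alt.binding_range(**bounds, name="Select Year:")`.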

##### Q6 - Linking

Let us take a closer look at just the year 2000 data, and find what the distribution of sex is for each individual age grouping.  Plot the distribution of ages as a bar chart for just the year 2000, and link a histogram that will plot the distribution of sex for the current selection.  It should default to no age group selected.  The histogram for the sex distribution should appear below the year 2000 data (vertically concatenated). When a bar on the top chart is selected, indicate its selection by turning the other bars light gray.  The histogram of the sex distribution below it should be a horizontal bar chart.

Encode membership in the Baby Boomer generation with color, using "#7D3C98" (purple) for boomers and "#F4D03F" (gold) for `Other`.

In [39]:
# Clicking a bar selects its age group; with empty=True nothing is filtered until a bar is clicked
selector = alt.selection_point(fields=['age'], empty=True)
df_2000 = df_pop[df_pop['year'] == 2000]

main = alt.Chart(df_2000).mark_bar().encode(
    x=alt.X('age:N', title='Age'),
    y=alt.Y('sum(people):Q', title='People'),
    # Selected bars keep their generation color; the others turn light gray
    color=alt.condition(
        selector,
        alt.Color("Generation:N",
                  scale=alt.Scale(domain=["Baby Boomer", "Other"], range=["#7D3C98", "#F4D03F"])),
        alt.value('lightgray')
    )
).properties(
    width=250,
    height=250
).add_params(selector)

sub = alt.Chart(df_2000).transform_filter(
    selector
).transform_aggregate(
    TotalPeople='sum(people)',
    groupby=['sex']
).mark_bar().encode(
    # Quantity on x and category on y gives horizontal bars
    x=alt.X('TotalPeople:Q', title='Total People'),
    y=alt.Y('sex:N', title='Sex'),
    color=alt.Color('sex:N', scale=alt.Scale(domain=['Male', 'Female'], range=['#0000FF', '#FFC0CB']), title='Sex')
).properties(
    title='Sex Distribution in 2000',
    width=250
)

# `&` concatenates vertically, placing the sex distribution below the age chart
main & sub
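
To sanity-check what the linked chart should display, the same filter-then-aggregate logic can be reproduced in pandas: keep one year, optionally keep one age group, and sum `people` by `sex`. This is a minimal sketch on hypothetical rows; `sex_totals` is an assumed helper name, and `age=None` stands in for the "nothing selected" state.

```python
import pandas as pd

def sex_totals(df, year, age=None):
    """Sum people by sex for one year, optionally restricted to one age group."""
    subset = df[df["year"] == year]
    if age is not None:  # None mirrors an empty selection: all ages are kept
        subset = subset[subset["age"] == age]
    totals = subset.groupby("sex")["people"].sum()
    return {sex: int(count) for sex, count in totals.items()}

# Hypothetical sample rows, not taken from the census file
sample = pd.DataFrame({
    "year":   [2000, 2000, 2000, 2000, 1990, 1990],
    "age":    [0, 0, 5, 5, 0, 0],
    "sex":    ["Male", "Female", "Male", "Female", "Male", "Female"],
    "people": [10, 9, 8, 7, 100, 100],
})

print(sex_totals(sample, 2000))         # all age groups in 2000
print(sex_totals(sample, 2000, age=5))  # only the age-5 group
```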



##### Q7 - Combine Q5 and Q6 to One Chart

In question 6, we linked the distribution of sex to the age selection for just the year 2000.  Let us visualize all the data by incorporating the year selection slider from question 5 so that you can select which year of data you are viewing. Retain the ability to just select one age group for the sex distribution, and default to no age group selected.

Add a tooltip so you can see exactly how many people are in the age range for the top "Distribution of Ages for the Selected Year" histogram.

Encode membership in the Baby Boomer generation with color, using "#7D3C98" (purple) for boomers and "#F4D03F" (gold) for `Other`.

In [52]:
# Click a bar to pick an age group; empty=True leaves the sex chart unfiltered until then
select_age = alt.selection_point(fields=['age'], empty=True)

# Slider for the census year, starting at 1900
select_year = alt.selection_point(
    fields=['year'],
    bind=alt.binding_range(min=1900, max=2000, step=10, name='Select Year:'),
    value={"year": 1900}
)

main = alt.Chart(df_pop).mark_bar().encode(
    alt.X('age:O', title='Age Group'),
    alt.Y('sum(people):Q', title='Total People'),
    alt.Color("Generation:N",
        scale=alt.Scale(domain=["Baby Boomer", "Other"], range=["#7D3C98", "#F4D03F"]),
        title='Generation'
    ),
    opacity=alt.condition(select_age, alt.value(1), alt.value(0.3)),
    # Tooltip shows the exact head count for the hovered age group
    tooltip=[alt.Tooltip('age:O', title='Age Group'),
             alt.Tooltip('sum(people):Q', title='Total People', format=',')]
).properties(
    title='Distribution of Ages for the Selected Year'
).add_params(select_age)

sub = alt.Chart(df_pop).transform_filter(
    select_age
).transform_aggregate(
    TotalPeople='sum(people)',
    groupby=['sex']
).mark_bar().encode(
    alt.X('sex:N', title='Sex'),
    alt.Y('TotalPeople:Q', title='Total People'),
    alt.Color('sex:N', scale=alt.Scale(domain=['Male', 'Female'], range=['#0000FF', '#FFC0CB']), title='Sex')
).properties(
    title='Sex Distribution in Selected Year'
)

# Stack the charts vertically and filter both by the slider's year
(main & sub).add_params(select_year).transform_filter(select_year)