# Smoking in the US

The following visualizations explore patterns of cigarette and vape use alongside demographic and behavioral factors. The charts are based on survey data collected over several years and cover tobacco product use, variation by age, gender, and race, and socioeconomic factors. Each chart focuses on a different aspect of tobacco use.

In [None]:
import pandas as pd
import altair as alt
from pathlib import Path

In [None]:
df = pd.read_csv("../data/smoking.csv")
df2021 = pd.read_csv("../data/smoking_20-21.csv")

In [None]:
# Recode data
df['IRPINC3'] = df['IRPINC3'].replace({1: '1: Less than $10,000',2: '2: $10,000 - $19,999', 3: '3: $20,000 - $29,999',
                                      4: '4: $30,000 - $39,999', 5: '5: $40,000 - $49,999', 6: '6: $50,000 - $74,999',7: '7: $75,000 or more' })
df['IRSEX'] = df['IRSEX'].replace({1: 'Male', 2:'Female'})
df['NEWRACE2'] = df['NEWRACE2'].replace({1: 'White', 2: 'Black/African American', 3:'Native American Alaska Native', 
                                        4: 'Native Hawaiian/Other Pacific Islanders', 5: 'Asian', 6:'More than One Race', 7:'Hispanic'})
df['CIGEVER'] = df['CIGEVER'].replace({1: 'Yes', 2:'No'})
df['CATAG3'] = df['CATAG3'].replace({1: '12-17 Years Old', 2: '18-25 Years Old', 3: '26-34 Years Old', 4: '35-49 Years Old',
                                    5: '50 or Older'})
y_labels = {
        1: 'Less than one cigarette per day',
        2: '1 cigarette per day',
        3: '2 to 5 cigarettes per day',
        4: '6 to 15 cigarettes per day',
        5: '16 to 25 cigarettes per day',
        6: '26 to 35 cigarettes per day',
        7: 'More than 35 cigarettes per day'
    }
df['cig30av_label'] = df['CIG30AV'].map(y_labels)

In [None]:
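# Build the subsets used by the visualizations
# Respondents who have ever smoked a cigarette
smoke = df.loc[df['CIGEVER'] == 'Yes']
# Keep rows with CIGTRY <= 900 (larger values are presumably non-response codes, not ages)
smoke = smoke.loc[smoke['CIGTRY'] <= 900]
# Ever-smokers surveyed in 2021
smoke_2021 = smoke.loc[smoke['year'] == 2021]
# Ever-smokers with CIGREC == 1 (treated as current smokers)
current_smoker = smoke.loc[smoke['CIGREC'] == 1]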

In [None]:
df2021_youth = df2021.loc[df2021['CATAG3'] == 1]
df2021_youth = df2021_youth[['CIGEVER', 'VAPANYEVR', 'year']].loc[df2021_youth['VAPANYEVR'] < 80]
df2021_youth['CIGEVER'] = df2021_youth['CIGEVER'].replace({2:0})
df2021_youth['VAPANYEVR'] = df2021_youth['VAPANYEVR'].replace({2:0})

def prepare_pie_data(df, var):
    # Group by year and the variable (CIGEVER or VAPANYEVR)
    pie_data = df.groupby(['year', var]).size().reset_index(name='count')
    pie_data['Usage'] = pie_data[var].map({1: 'Ever Used', 0: 'Never Used'})
    pie_data['Product'] = var
    return pie_data

cig_data = prepare_pie_data(df2021_youth, 'CIGEVER')
vap_data = prepare_pie_data(df2021_youth, 'VAPANYEVR')

# Combine both datasets
pie_data = pd.concat([cig_data, vap_data])

In [None]:
alt.data_transformers.disable_max_rows()

def year_everuse(df):
    stacked_chart = alt.Chart(df).mark_bar().encode(
        x=alt.X('year:N', title='Year', axis=alt.Axis(labelAngle=0)),
        y=alt.Y('count():Q', stack='normalize', title='Proportion'),
        color=alt.Color('CIGEVER:N').scale(scheme="set2"),
        order=alt.Order('CIGEVER:N', sort='descending')
    ).properties(
        title='Ever Smoked Cigarette by Year',
        width=400  # Adjust the width here
    )
    return stacked_chart

In [None]:
year_everuse(df)

This stacked bar chart shows, for each year, the proportion of respondents who have ever smoked a cigarette. The proportion shows a slight downward trend over the years, which suggests that fewer people are starting to smoke, possibly because of effective public health campaigns or shifting societal attitudes toward smoking.
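To make the "Proportion" axis concrete: `stack='normalize'` rescales each year's bar so that its segments sum to 1. The following is a minimal pandas sketch of the same computation. The `toy` frame and its values are made up for illustration and are not survey data.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey frame; values are made up.
toy = pd.DataFrame({
    "year":    [2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016],
    "CIGEVER": ["Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "No"],
})

# Row-normalized cross-tabulation: each year's shares sum to 1,
# which is the quantity a normalized stacked bar displays.
shares = pd.crosstab(toy["year"], toy["CIGEVER"], normalize="index")
print(shares)
```

Inspecting a table like this next to the chart is a quick way to check the size of the year-to-year change rather than judging it by eye.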

In [None]:
def year_firstage(df):
    line_first = alt.Chart(df).mark_line(point=True).encode(
        alt.Y("mean(CIGTRY)"),
        alt.X("year:N", axis=alt.Axis(labelAngle=0)),
        alt.Color("IRSEX:N", scale=alt.Scale(
            domain=['Female', 'Male'],  # Ensure correct mapping to gender
            range=['#8624f5', '#1fc3aa']))
    ).properties(
        title='Average Age When First Smoked Cigarette by Year',
        width=600  # Adjust the width here
    )
    return line_first

In [None]:
year_firstage(smoke)

This line chart shows the average age when individuals first smoked a cigarette (CIGTRY) by gender (IRSEX) over the years 2015 to 2021. The average age of first cigarette use remains relatively stable across genders and years, hovering around 16 years old. Both males and females show similar trends, indicating minimal gender difference in the age of smoking initiation.
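The mean in this chart is only meaningful because `smoke` was restricted to `CIGTRY <= 900` beforehand. The sketch below illustrates why, under the assumption that values above 900 are non-response codes rather than ages. The `toy` frame and the `991` code are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; 991 stands in for an assumed non-response code.
toy = pd.DataFrame({
    "year":   [2020, 2020, 2020, 2021, 2021, 2021],
    "IRSEX":  ["Male", "Male", "Female", "Male", "Female", "Female"],
    "CIGTRY": [15, 991, 17, 16, 14, 18],
})

# Same filter as in the notebook: keep plausible ages only.
valid = toy.loc[toy["CIGTRY"] <= 900]

# Mean age of first use per year and sex, with and without the filter.
naive_mean = toy.groupby(["year", "IRSEX"])["CIGTRY"].mean()
clean_mean = valid.groupby(["year", "IRSEX"])["CIGTRY"].mean()

print(naive_mean)
print(clean_mean)
```

A single unfiltered code is enough to push a group's mean far outside any plausible age range, so the filter has to be applied before aggregating.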

In [None]:
def vap_cig(pie_data):
    pie_chart = alt.Chart(pie_data).mark_arc().encode(
        theta=alt.Theta('count:Q', title='Proportion'),
        color=alt.Color('Usage:N', title='Ever Used').scale(scheme="tealblues")       
    ).facet(
        row=alt.Row('Product:N', title=None),
        column=alt.Column('year:N', title='Year'))
    return pie_chart

In [None]:
vap_cig(pie_data)

These pie charts show the proportion of 12-17 year-olds who have ever used versus never used cigarettes (CIGEVER) and vape products (VAPANYEVR) in 2020 and 2021. The proportion of "Ever Used" increases slightly from 2020 to 2021 for both cigarettes and vape products. This could reflect a rising trend in vape usage among younger populations.

In [None]:
def race_smoke(df):
    total_by_race = df.groupby('NEWRACE2').size().reset_index(name='Total')
    
    # Calculate number of current smokers (CIGREC == 1) in each race
    smokers_by_race = df[df['CIGREC'] == 1].groupby('NEWRACE2').size().reset_index(name='Smokers')
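    race_smoker_data = pd.merge(total_by_race, smokers_by_race, on='NEWRACE2', how='left')
    # A group with no current smokers has no match in the left merge; count it as 0
    race_smoker_data['Smokers'] = race_smoker_data['Smokers'].fillna(0)
    race_smoker_data['Smoker_Percentage'] = (race_smoker_data['Smokers'] / race_smoker_data['Total']) * 100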

    current = race_smoker_data[['NEWRACE2', 'Total', 'Smokers', 'Smoker_Percentage']]
    
    color_scale = alt.Scale(
        domain=['White', 'Black/African American', 'Asian', 'Hispanic', 
                'More than One Race', 'Native American Alaska Native', 'Native Hawaiian/Other Pacific Islanders'],
        range=['#93a1a1', '#4b4b4b', '#d0a585', '#a58fa5', '#f7e09c', '#c6dbef', '#a6a475']  # Adjusted colors
    )    
    chart = alt.Chart(current).mark_bar().encode(
        x=alt.X('NEWRACE2:N', title='Race'),
        y=alt.Y('Smoker_Percentage:Q', title='Percentage of Current Smokers'),
        color=alt.Color('NEWRACE2:N', title='Race', scale=color_scale)
    ).properties(
        title='Percentage of Current Smokers by Race'
    )
    return chart

In [None]:
race_smoke(df)

This bar chart displays the percentage of current smokers among various racial groups. Native American/Alaska Native individuals have the highest percentage of current smokers, while Asian populations show the lowest smoking rates. The data reveals racial disparities in smoking behavior, potentially influenced by socioeconomic and cultural factors.
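The percentage plotted here is the number of current smokers (`CIGREC == 1`) divided by all respondents in each group. A compact way to sketch the same calculation is to take the mean of a boolean flag per group. The group labels `A`, `B`, `C` and the `CIGREC` values below are placeholders, not survey data.

```python
import pandas as pd

# Hypothetical toy data; group labels and codes are placeholders.
toy = pd.DataFrame({
    "NEWRACE2": ["A", "A", "A", "A", "B", "B", "C"],
    "CIGREC":   [1, 2, 1, 3, 1, 2, 2],
})

# Flag current smokers, then average the flag within each group.
# The mean of a boolean column is the share of True values.
pct = (
    toy.assign(is_current=toy["CIGREC"].eq(1))
       .groupby("NEWRACE2")["is_current"]
       .mean()
       .mul(100)
)
print(pct)
```

With this approach a group that has no current smokers gets 0 rather than a missing value, because every group appears in the denominator.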

In [None]:
def violin_frequency(df, var):
    violin = alt.Chart(df).transform_density(
        var,
        as_=[var, 'density'],
        groupby=['CATAG3']
    ).mark_area(orient='horizontal').encode(
        alt.X('density:Q')
            .stack('center')
            .impute(None)
            .title(None)
            .axis(labels=False, values=[0], grid=False, ticks=True),
        alt.Y(var),
        alt.Color('CATAG3:N').scale(scheme="tealblues"),
        alt.Column('CATAG3:N')
            .spacing(0)
            .header(titleOrient='bottom', labelOrient='bottom', labelPadding=0)
    ).configure_view(
        stroke=None
    )
    return violin

In [None]:
violin_frequency(smoke_2021.loc[smoke_2021['CIG30USE']<90], 'CIG30USE')

This violin plot shows the distribution of cigarette use across age groups. The 26-34 age group has the broadest distribution, with higher frequencies of both light and heavy smoking. The 18-25 group has a narrower, lighter use pattern, while the 50+ group shows more sporadic but heavier smoking.

In [None]:
def stacked_cigarette_use(df):
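    # Map CIG30AV codes to readable labels; the numeric prefix keeps the legend in order
    label_mapping = {
        1: '1: Less than one cigarette per day',
        2: '2: 1 cigarette per day',
        3: '3: 2 to 5 cigarettes per day',
        4: '4: 6 to 15 cigarettes per day',
        5: '5: 16 to 25 cigarettes per day',
        6: '6: 26 to 35 cigarettes per day',
        7: '7: More than 35 cigarettes per day'
    }
    # Work on a copy so the caller's DataFrame (often a slice) is not modified
    df = df.copy()
    df['CIG30AV_label'] = df['CIG30AV'].map(label_mapping)
    # Create the chart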
    stacked_bar = alt.Chart(df).mark_bar().encode(
        x=alt.X('CATAG3:N', title='Age Group', axis=alt.Axis(labelAngle=0)),
        y=alt.Y('count()', stack='normalize', title='Proportion of Smokers'),
        color=alt.Color('CIG30AV_label:N', title='Average Cigarettes Smoked Per Day',
                        scale=alt.Scale(scheme='tealblues'))
    ).properties(
        title='Cigarette Use Distribution by Age Group',
        width=600,
        height=400
    )

    return stacked_bar

In [None]:
stacked_cigarette_use(smoke_2021.loc[smoke_2021['CIG30AV']<90])

This stacked bar chart shows the proportion of smokers across different age groups, categorized by the average number of cigarettes smoked per day. Younger age groups tend to smoke fewer cigarettes per day, while middle-aged individuals (35-49 years old) have a higher proportion of heavy smokers. The trend reveals a clear gradation where older age groups are more likely to smoke larger quantities of cigarettes daily.

In [None]:
def tobacco_use(df):
    # Calculate percentage of ever use for each product by year
    aggregated_data = df.groupby('year').agg(
        cigever=('CIGEVER', lambda x: (x == 'Yes').mean() * 100),  # Percentage for CIGEVER
        smklssevr=('SMKLSSEVR', lambda x: (x == 1).mean() * 100),  # Percentage for SMKLSSEVR
        cigarevr=('CIGAREVR', lambda x: (x == 1).mean() * 100),  # Percentage for CIGAREVR
        pipever=('PIPEVER', lambda x: (x == 1).mean() * 100)  # Percentage for PIPEVER
    ).reset_index()

    # Melt the data into long format for plotting
    melted_data = aggregated_data.melt(id_vars='year', 
                                       value_vars=['cigever', 'smklssevr', 'cigarevr', 'pipever'],
                                       var_name='Product', value_name='Percentage')
    # Create the line plot
    line_chart = alt.Chart(melted_data).mark_line(point=True).encode(
        x=alt.X('year:O', title='Year', axis=alt.Axis(labelAngle=0)),  # X-axis: Year
        y=alt.Y('Percentage:Q', title='Percentage of Ever Use (%)'),  # Y-axis: Percentage of ever use
        color=alt.Color('Product:N', title='Product').scale(scheme="set2")
    ).properties(
        title='Percentage of Ever Use of Cigarettes, Smokeless Tobacco, Cigars, and Pipes by Year',
        width=700,
        height=400
    )

    return line_chart


In [None]:
tobacco_use(df)

This line chart tracks the percentage of ever use of various tobacco products (CIGEVER, CIGAREVR, SMKLSSEVR, PIPEVER) from 2015 to 2021. Cigarette use (CIGEVER) is consistently the highest, though it shows a steady decline. Use of cigars and smokeless tobacco remains stable with a slight decrease, while pipe use is relatively rare and shows minimal change.

In [None]:
def income_cigarette(df):
    # Copy so the caller's DataFrame is not modified, then bin the days of use
    df = df.copy()
    df['CIG30USE_grouped'] = pd.cut(df['CIG30USE'], bins=[0, 5, 10, 20, 30],
                                    labels=['1: 1-5 days', '2: 6-10 days', '3: 11-20 days', '4: 21-30 days'],
                                    right=True)
    heatmap = alt.Chart(df).mark_rect().encode(
        x=alt.X('IRPINC3:N', title='Income Category', axis=alt.Axis(labelAngle=0)),
        y=alt.Y('CIG30USE_grouped:N', title='Cigarette Use (Grouped Days)'),
        color=alt.Color('count():Q', title='Number of People').scale(scheme="tealblues"),
    ).properties(
        title='Heatmap of Income vs Cigarette Use (Days)',
        width=800
    )


    return heatmap

In [None]:
income_cigarette(smoke_2021.loc[smoke_2021['CIG30USE']<90])

This heatmap shows the relationship between income category and cigarette use, grouped by the number of days smoked. Darker cells contain more respondents. Lower income categories show more frequent cigarette use, and use tends to decrease as income increases, with a noticeable drop-off in the higher income brackets. Because the color encodes raw counts, differences in the number of respondents per income category also affect the shading.
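One way to check the income pattern independently of group size is to compare shares within each income category instead of counts. The sketch below bins days of use with the same `pd.cut` edges as the heatmap and then normalizes by column. The `toy` frame, the `low`/`high` labels, and the `days_grouped` column name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; income labels and day counts are made up.
toy = pd.DataFrame({
    "IRPINC3":  ["low", "low", "low", "low", "high", "high"],
    "CIG30USE": [30, 25, 28, 3, 2, 30],
})

# Same bin edges as the heatmap: (0, 5], (5, 10], (10, 20], (20, 30].
toy["days_grouped"] = pd.cut(
    toy["CIG30USE"],
    bins=[0, 5, 10, 20, 30],
    labels=["1-5 days", "6-10 days", "11-20 days", "21-30 days"],
    right=True,
)

# Column-normalized table: within each income category the shares sum to 1.
within_income = pd.crosstab(
    toy["days_grouped"], toy["IRPINC3"], normalize="columns"
)
print(within_income)
```

Comparing the "21-30 days" row across columns then shows whether frequent smoking is more common in one income category than another, regardless of how many respondents each category has.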