
# **Hands-On Activity 14 | Telling the Truth with Data Visualization**





---



Name : Pios, Joshua Paul B. | Reyes, Jazmin C. <br>
Course Code and Title : CPE 031 | Visualizations and Data Analysis <br>
Date Submitted : November 13, 2025<br>
Instructor : Engr. Maria Rizette Sayo


---



**1. Objectives:**

This activity aims to demonstrate students' ability to visualize data truthfully and ethically. Students will identify missing or biased data, correct misleading visualizations, and apply techniques to ensure integrity in data presentation.

**2. Intended Learning Outcomes (ILOs):**

By the end of this activity, students should be able to:

1. Analyze datasets to detect missing values, errors, and biases.

2. Evaluate the accuracy and fairness of different data visualization designs.

3. Create ethical and truthful charts by correcting deceptive visualizations.

**3. Discussions:**

Telling the truth with data visualization means ensuring that every visual accurately represents the data and context without distortion.
Misleading charts can manipulate interpretation through poor scaling, selective data, or biased representation.

Missing Data and Data Errors:
Missing values or outliers can lead to incorrect conclusions if ignored. Visualizations should either mark missing data explicitly or handle it with a stated method such as interpolation, imputation, or removal.
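
As a rough illustration of how suspicious values might be flagged before charting, the sketch below applies the common 1.5 × IQR rule. The sales figures are hypothetical and the threshold is a convention, not a requirement of this activity; flagged points still need human judgment before they are removed or corrected.

```python
import pandas as pd

# Hypothetical sales figures; 250 is a deliberately planted suspicious value.
sales = pd.Series([100, 102, 98, 101, 99, 250, 103, 97], name="Sales")

# Interquartile range (IQR) fences: values more than 1.5 * IQR
# outside the quartiles are flagged as potential outliers.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (sales < lower_fence) | (sales > upper_fence)
print(sales[is_outlier])
```

Flagging rather than silently deleting keeps the decision visible, which fits the goal of showing the audience what was changed.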

Biased Data:
Data can be biased through selection bias (only a subset of the data is collected or shown) or survivorship bias (failures or dropouts are excluded). Identifying these biases prevents misleading visuals.

Adjusting for Inflation:
When comparing values over time (e.g., prices, income), data should be adjusted for inflation to reflect real value changes.
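
A minimal sketch of the idea, assuming inflation is stored as a year-over-year factor (1.10 means 10% inflation) and that the first year serves as the base year. The prices and rates are hypothetical, chosen so that the nominal increase is entirely explained by inflation.

```python
import pandas as pd

# Hypothetical nominal prices and year-over-year inflation factors.
df = pd.DataFrame({
    "Year": [2021, 2022, 2023],
    "Price": [100.0, 110.0, 121.0],
    "InflationRate": [1.00, 1.10, 1.10],
})

# Build a cumulative price index, then rebase it so the first year equals 1.0.
index = df["InflationRate"].cumprod()
index = index / index.iloc[0]

# Real (constant base-year) prices: nominal value divided by the price index.
df["RealPrice"] = df["Price"] / index
print(df)
```

Here the nominal price rises by 21% while the real price stays flat, which is the kind of difference an unadjusted chart would hide.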

Deceptive Design:
Visualization design choices such as truncated axes, dual-axis charts, or selective time frames can distort perception. Ethical visualization maintains consistent scales and transparency.
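
To make the effect of a truncated axis concrete, the following sketch compares how tall two bars are drawn relative to each other when the axis starts at zero versus at a higher baseline. The helper `bar_height_ratio` and the example values are hypothetical, introduced only for illustration.

```python
def bar_height_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights for values a and b
    when the value axis starts at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("both values must lie above the axis baseline")
    return (a - baseline) / (b - baseline)

# Two illustrative values: 320 is 1.6 times as large as 200.
true_ratio = bar_height_ratio(320, 200)         # axis starts at 0
drawn_ratio = bar_height_ratio(320, 200, 150)   # axis starts at 150
print(true_ratio, drawn_ratio)
```

With the axis starting at 150, the larger bar is drawn 3.4 times as tall as the smaller one even though the underlying value is only 1.6 times as large.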

**4. Procedures:**

Step 1: Import Libraries

In [None]:
!pip install pandas plotly numpy
import pandas as pd
import numpy as np
import plotly.express as px



Step 2: Create a Sample Dataset

This dataset simulates product prices, sales, and inflation across years.

In [None]:
# Sample data
years = np.arange(2015, 2025)
data = {
    "Year": years,
    "Sales": [120, 130, 150, 170, 200, np.nan, 240, 260, 290, 320],
    "Price": [50, 52, 55, 57, 60, 63, 65, 70, 75, 78],
    "InflationRate": [1.02, 1.03, 1.01, 1.05, 1.04, 1.03, 1.02, 1.03, 1.02, 1.02]
}

df = pd.DataFrame(data)
df.head()

   Year  Sales  Price  InflationRate
0  2015  120.0     50           1.02
1  2016  130.0     52           1.03
2  2017  150.0     55           1.01
3  2018  170.0     57           1.05
4  2019  200.0     60           1.04


Step 3: Identify Missing Data and Errors

In [None]:
# Check missing and invalid data
print("Missing Data per Column:")
print(df.isna().sum())

# Fill or interpolate missing sales values
df["Sales"] = df["Sales"].interpolate()
df

Missing Data per Column:
Year             0
Sales            1
Price            0
InflationRate    0
dtype: int64


   Year  Sales  Price  InflationRate
0  2015  120.0     50           1.02
1  2016  130.0     52           1.03
2  2017  150.0     55           1.01
3  2018  170.0     57           1.05
4  2019  200.0     60           1.04
5  2020  220.0     63           1.03
6  2021  240.0     65           1.02
7  2022  260.0     70           1.03
8  2023  290.0     75           1.02
9  2024  320.0     78           1.02
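
Once a gap is interpolated, the filled value looks like any other observation. One possible way to keep it traceable, sketched here under the assumption that an extra boolean column is acceptable, is to record which rows were missing before filling them. The column name `Sales_Imputed` is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Year": np.arange(2015, 2025),
    "Sales": [120, 130, 150, 170, 200, np.nan, 240, 260, 290, 320],
})

# Record which rows were missing BEFORE filling them,
# so that information is not lost after interpolation.
df["Sales_Imputed"] = df["Sales"].isna()
df["Sales"] = df["Sales"].interpolate()  # linear interpolation by default

# Show only the rows whose Sales value was estimated rather than observed.
print(df[df["Sales_Imputed"]])
```

The flag can then be used to mark the estimated point on the chart, for example with a different marker or an annotation.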


Step 4: Adjust for Inflation

In [None]:
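# Adjust sales for inflation: InflationRate is a year-over-year factor,
# so its running product (cumprod) acts as a cumulative price index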
df["Adjusted_Sales"] = df["Sales"] / df["InflationRate"].cumprod()
fig = px.line(df, x="Year", y=["Sales", "Adjusted_Sales"],
              title="Sales Over Time (Adjusted for Inflation)",
              labels={"value": "Sales", "variable": "Metric"})
fig.show()

Step 5: Demonstrate Deceptive Design

Bad Example (Truncated Axis):

In [None]:
bad_chart = px.bar(df, x="Year", y="Sales", title="Deceptive Chart (Truncated Axis)")
bad_chart.update_yaxes(range=[150, 350])  # starts too high
bad_chart.show()

Good Example (Honest Axis):

In [None]:
good_chart = px.bar(df, x="Year", y="Sales", title="Truthful Chart (Proper Scale)")
good_chart.update_yaxes(range=[0, 350])
good_chart.show()

**Task 1:** Handling Missing and Erroneous Data

Identify missing or inconsistent data points in your own dataset (or this one).

Apply at least one correction method (interpolation, imputation, or exclusion).

Visualize the corrected dataset.

In [None]:

# Sample dataset with missing Sales data
years = np.arange(2015, 2025)
data = {
    "Year": years,
    "Sales": [120, 130, 150, 170, 200, np.nan, 240, 260, 290, 320],  # Missing value in 2020
    "Price": [50, 52, 55, 57, 60, 63, 65, 70, 75, 78],
    "InflationRate": [1.02, 1.03, 1.01, 1.05, 1.04, 1.03, 1.02, 1.03, 1.02, 1.02]
}

df = pd.DataFrame(data)

# Display the first few rows and check for missing values
print("Original Dataset:")
print(df.head())

print("\nMissing Data per Column:")
print(df.isna().sum())

# Handling missing data: Interpolate missing 'Sales'
df["Sales"] = df["Sales"].interpolate()

print("\nDataset after Interpolation:")
print(df)

# Adjust sales for inflation
df["Adjusted_Sales"] = df["Sales"] / df["InflationRate"].cumprod()

# Visualize Sales over time (line chart)
fig = px.line(df, x="Year", y=["Sales", "Adjusted_Sales"],
              title="Sales Over Time (Adjusted for Inflation)",
              labels={"value": "Sales", "variable": "Metric"})
fig.show()

# Deceptive chart example (truncated y-axis)
bad_chart = px.bar(df, x="Year", y="Sales", title="Deceptive Chart (Truncated Axis)")
bad_chart.update_yaxes(range=[150, 350])  # misleading start
bad_chart.show()

# Truthful chart example (full y-axis)
good_chart = px.bar(df, x="Year", y="Sales", title="Truthful Chart (Proper Scale)")
good_chart.update_yaxes(range=[0, 350])
good_chart.show()


Original Dataset:
   Year  Sales  Price  InflationRate
0  2015  120.0     50           1.02
1  2016  130.0     52           1.03
2  2017  150.0     55           1.01
3  2018  170.0     57           1.05
4  2019  200.0     60           1.04

Missing Data per Column:
Year             0
Sales            1
Price            0
InflationRate    0
dtype: int64

Dataset after Interpolation:
   Year  Sales  Price  InflationRate
0  2015  120.0     50           1.02
1  2016  130.0     52           1.03
2  2017  150.0     55           1.01
3  2018  170.0     57           1.05
4  2019  200.0     60           1.04
5  2020  220.0     63           1.03
6  2021  240.0     65           1.02
7  2022  260.0     70           1.03
8  2023  290.0     75           1.02
9  2024  320.0     78           1.02


**Task 2:** Detecting and Correcting Bias

Create or simulate a biased dataset (e.g., only showing top-performing products or regions).

1. Visualize the biased data.

2. Then, include the full dataset and create a truthful comparison chart.

3. Briefly explain how bias affected interpretation.

In [None]:
years = np.arange(2015, 2025)
np.random.seed(42)

# Step 1: Simulate sales for multiple regions
data = {
    "Year": np.tile(years, 4),  # 4 regions
    "Region": ["North"]*10 + ["South"]*10 + ["East"]*10 + ["West"]*10,
    "Sales": np.random.randint(50, 300, size=40)  # Random sales
}

df_full = pd.DataFrame(data)

print("Full Dataset:")
print(df_full.head(12))

# Step 2: Create biased dataset
# Only show top-performing sales (Sales > 200)
df_biased = df_full[df_full["Sales"] > 200]

print("\nBiased Dataset:")
print(df_biased.head(12))

# Step 3: Visualize biased data
fig_biased = px.bar(df_biased, x="Year", y="Sales", color="Region",
                    title="Biased Data: Only Top-Performing Sales")
fig_biased.show()

# Step 4: Visualize full dataset
fig_full = px.bar(df_full, x="Year", y="Sales", color="Region",
                  title="Truthful Data: All Sales Included")
fig_full.show()

Full Dataset:
    Year Region  Sales
0   2015  North    152
1   2016  North    229
2   2017  North    142
3   2018  North     64
4   2019  North    156
5   2020  North    121
6   2021  North    238
7   2022  North     70
8   2023  North    152
9   2024  North    171
10  2015  South    260
11  2016  South    264

Biased Dataset:
    Year Region  Sales
1   2016  North    229
6   2021  North    238
10  2015  South    260
11  2016  South    264
13  2018  South    252
18  2023  South    201
24  2019   East    285
25  2020   East    207
28  2023   East    241
29  2024   East    237
31  2016   West    210
32  2017   West    253


By only showing top-performing sales, the chart gives the impression that all regions and years are doing extremely well. In reality, the full dataset shows the actual range of sales, including lower-performing years and regions. This comparison highlights how selective data presentation can mislead interpretation.
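
The size of the distortion can also be expressed numerically. The sketch below rebuilds the same simulated data (same seed and filter as in the task code) and compares row counts and mean sales for the full and filtered sets. The `summary` table is an illustrative addition, not part of the original activity.

```python
import numpy as np
import pandas as pd

# Rebuild the simulated dataset with the same seed as the task code.
np.random.seed(42)
df_full = pd.DataFrame({
    "Year": np.tile(np.arange(2015, 2025), 4),
    "Region": ["North"] * 10 + ["South"] * 10 + ["East"] * 10 + ["West"] * 10,
    "Sales": np.random.randint(50, 300, size=40),
})

# Selection bias: keep only the top-performing records.
df_biased = df_full[df_full["Sales"] > 200]

# Compare how many rows survive the filter and how the average shifts.
summary = pd.DataFrame({
    "rows": [len(df_full), len(df_biased)],
    "mean_sales": [df_full["Sales"].mean(), df_biased["Sales"].mean()],
}, index=["full", "biased"])
print(summary)
```

Because every retained record exceeds 200, the filtered mean must sit above the full mean, so any chart built from the filtered set overstates typical performance.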

**Task 3:** Deceptive vs. Truthful Visualization

Create one misleading chart using axis manipulation or selective data range.

Create a corrected version that shows the same data honestly.

Explain the difference in interpretation between the two visuals.

In [None]:
# Step 1: Create dataset
years = np.arange(2015, 2025)
data = {
    "Year": years,
    "Sales": [120, 130, 150, 170, 200, 220, 240, 260, 290, 320]
}

df = pd.DataFrame(data)
print("Dataset:")
print(df)

# Step 2: Create deceptive chart
# Misleading: y-axis starts at 150 (truncated) to exaggerate growth
deceptive_chart = px.bar(df, x="Year", y="Sales", title="Misleading Chart (Truncated Y-axis)")
deceptive_chart.update_yaxes(range=[150, 350])  # Starts too high
deceptive_chart.show()

# Step 3: Create truthful chart
# Honest: y-axis starts at 0 to reflect true growth
truthful_chart = px.bar(df, x="Year", y="Sales", title="Truthful Chart (Proper Y-axis)")
truthful_chart.update_yaxes(range=[0, 350])
truthful_chart.show()

Dataset:
   Year  Sales
0  2015    120
1  2016    130
2  2017    150
3  2018    170
4  2019    200
5  2020    220
6  2021    240
7  2022    260
8  2023    290
9  2024    320


The deceptive chart exaggerates the increase in sales by truncating the y-axis,
making the growth appear dramatic. The truthful chart starts the y-axis at zero,
showing the actual scale of sales growth. Misleading visuals can distort perception,
while truthful charts provide an accurate understanding of the data.



---


**5. Supplementary Activity:**

Visual Truth Challenge

Create a small project where you visualize a real-world dataset (e.g., population, income, environmental data).

1. Detect and correct at least two forms of distortion (missing data, bias, or misleading scaling).

2. Annotate your charts with titles and labels explaining your corrections.

3. Reflect on how ethical visualization improves trust and understanding.

In [None]:

# Step 1: Create dataset
years = np.arange(2015, 2025)
regions = ["North", "South", "East", "West"]
np.random.seed(42)

# Simulated population data with one missing value and one artificially inflated value
data = []
for region in regions:
    for year in years:
        pop = np.random.randint(50000, 150000)
        # Introduce missing data for South in 2018
        if region == "South" and year == 2018:
            pop = np.nan
        # Introduce biased high values for North in 2022
        if region == "North" and year == 2022:
            pop += 50000
        data.append([region, year, pop])

df = pd.DataFrame(data, columns=["Region", "Year", "Population"])

print("Original Dataset (with distortions):")
print(df.head(12))

# Step 2: Detect and correct distortions
# 2a. Handle missing data: interpolate population per region
df["Population"] = df.groupby("Region")["Population"].transform(lambda x: x.interpolate())

# 2b. Correct bias: replace the inflated North 2022 value with a manually chosen cap (120000)
df.loc[(df["Region"] == "North") & (df["Year"] == 2022), "Population"] = 120000

print("\nCorrected Dataset:")
print(df.head(12))

# Step 3: Visualize corrected data
fig = px.line(df, x="Year", y="Population", color="Region",
              title="Population Over Time (Corrected for Missing Data and Bias)",
              labels={"Population": "Population Count", "Year": "Year"},
              markers=True)

# Annotate corrections
fig.add_annotation(x=2018, y=df.loc[(df['Region'] == 'South') & (df['Year'] == 2018), 'Population'].values[0],
                   text="Interpolated missing South data",
                   showarrow=True, arrowhead=2, yshift=10)
fig.add_annotation(x=2022, y=120000, text="Capped North population to remove bias",
                   showarrow=True, arrowhead=2, yshift=10)

fig.show()


Original Dataset (with distortions):
   Region  Year  Population
0   North  2015     65795.0
1   North  2016     50860.0
2   North  2017    126820.0
3   North  2018    104886.0
4   North  2019     56265.0
5   North  2020    132386.0
6   North  2021     87194.0
7   North  2022    187498.0
8   North  2023     94131.0
9   North  2024    110263.0
10  South  2015     66023.0
11  South  2016     91090.0

Corrected Dataset:
   Region  Year  Population
0   North  2015     65795.0
1   North  2016     50860.0
2   North  2017    126820.0
3   North  2018    104886.0
4   North  2019     56265.0
5   North  2020    132386.0
6   North  2021     87194.0
7   North  2022    120000.0
8   North  2023     94131.0
9   North  2024    110263.0
10  South  2015     66023.0
11  South  2016     91090.0


Correcting missing data and removing exaggerated biases ensures that charts accurately
represent reality. Adding clear labels and annotations builds trust and understanding
for the audience. Ethical visualizations prevent misleading conclusions and support
data-driven decisions based on truth rather than distortion.

**6. Conclusion/Learnings/Analysis:**

This activity showed us how important it is to visualize data honestly and ethically. We practiced finding missing or biased data and learned how to correct misleading charts. Techniques like using unbiased samples, proper axis scales, and data interpolation helped us present information accurately. We also learned to add clear explanations on our charts to show what changes we made. This activity taught us that careful and honest data visualization builds trust, avoids confusion, and helps people make better and more reliable decisions.