# Data Visualization & Tidying Lab

This notebook is split into **two parts**:

1. **Core skills tutorial** – short walkthroughs that demonstrate standard Python data-visualization techniques with `matplotlib`, `seaborn`, and `pandas`.  
2. **Applied challenges** – five messy, simulated data sets accompanied by stakeholder-style questions that someone might ask you to answer. Your task is to tidy each data set and write a brief data story for your audience with visuals.



## Part 1 – Core Visualization Skills 

### 1. Line, scatter, bar – the classics

In [None]:
import seaborn as sns, matplotlib.pyplot as plt, pandas as pd

# Load example
fmri = sns.load_dataset('fmri')

# LINE PLOT — average signal over time for each event type
plt.figure(figsize=(7,4))
sns.lineplot(data=fmri, x='timepoint', y='signal', hue='event') 
plt.title('Line plot: fMRI signal over time')
plt.show()



In [None]:
# SCATTER PLOT — flipper vs body mass
penguins = sns.load_dataset('penguins').dropna()
sns.scatterplot(data=penguins, x='flipper_length_mm', y='body_mass_g', hue='species')
plt.title('Scatter plot: flipper length vs body mass')
plt.show()



In [None]:
# BAR PLOT — mean total bill by day
tips = sns.load_dataset('tips')
sns.barplot(data=tips, x='day', y='total_bill', errorbar='sd')
plt.title('Bar plot: mean total bill by day')
plt.show()

These three basic plots cover **a quantitative variable over time**, **the relationship between two numeric variables**, and **comparisons across categories**. Always label axes and provide context in titles or captions.

### 2. Distributions – histograms, KDEs, box/violin

Use **histograms/KDEs** for a single distribution and **box/violin** plots for comparing distributions across groups.

In [None]:
# Histogram + KDE overlay for 'total_bill'
plt.figure(figsize=(6,4))
sns.histplot(tips['total_bill'], kde=True, bins=20)
plt.title('Histogram + KDE: total bill')
plt.show()


#### What is a KDE and why would you ever want one? 

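A KDE (kernel density estimate) is a smooth estimate of the data's probability density: a small bump (the kernel, usually Gaussian) is centered on every observation, and the bumps are averaged into one curve. Unlike a histogram, its shape does not depend on where the bin edges fall, which makes skew or multiple peaks easier to see. It does depend on the bandwidth (the width of each bump): too wide hides detail, too narrow shows noise.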
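To make the idea concrete, here is a minimal sketch of a Gaussian KDE computed by hand with NumPy. The helper name `gaussian_kde_1d`, the toy data, and the bandwidth of 2.5 are assumptions chosen for illustration; seaborn picks its own bandwidth automatically.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on `grid` (illustrative helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the curve integrates to 1
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
toy_bills = rng.gamma(shape=4.0, scale=5.0, size=200)  # made-up, right-skewed values
grid = np.linspace(toy_bills.min() - 15, toy_bills.max() + 15, 500)
density = gaussian_kde_1d(toy_bills, grid, bandwidth=2.5)

# Riemann-sum check: the area under a density curve should be close to 1
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

Re-running with a much smaller or larger `bandwidth` is a quick way to see how the curve moves between noisy and over-smoothed.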

In [None]:
# Box & violin plots side-by-side
fig, ax = plt.subplots(1,2, figsize=(10,4))
sns.boxplot(data=tips, x='day', y='tip', ax=ax[0])
ax[0].set_title('Boxplot: tip by day')
sns.violinplot(data=tips, x='day', y='tip', ax=ax[1])
ax[1].set_title('Violin plot: tip by day')
plt.tight_layout()
plt.show()

#### Why would you want to use a boxplot over a violin plot and vice versa?

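A violin plot draws a KDE for each group, so it reveals the shape of the distribution (for example, skew or more than one peak). A box plot reduces each group to its median and quartiles and marks outliers as individual points beyond the "whiskers", so it is more compact and better suited to spotting outliers and comparing medians.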
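By default, seaborn and matplotlib extend each whisker to the most extreme data point within 1.5 × IQR of the quartiles and draw anything beyond that as a separate point. The sketch below applies that rule to made-up tip values; the helper name `tukey_fences` is hypothetical.

```python
import numpy as np

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) limits beyond which points count as outliers."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

toy_tips = np.array([1.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 5.0, 10.0])  # made-up values
lower, upper = tukey_fences(toy_tips)

# Points outside the fences are the ones a box plot would draw as dots
outliers = toy_tips[(toy_tips < lower) | (toy_tips > upper)]
print(lower, upper, outliers)  # 0.25 6.25 [10.]
```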

### 3. Multi-dimensional encodings – color, size & facets

In [None]:
# Bubble plot: GDP vs life expectancy, bubble size = population
gap_url = 'https://raw.githubusercontent.com/plotly/datasets/master/gapminderDataFiveYear.csv'
gap = pd.read_csv(gap_url)
year2007 = gap[gap.year == 2007]

plt.figure(figsize=(8,5))
sns.scatterplot(data=year2007, x='gdpPercap', y='lifeExp',
                size='pop', hue='continent', sizes=(40,400), alpha=0.7, legend=False)
plt.xscale('log')
plt.title('Bubble plot: GDP vs Life Expectancy (2007)')
plt.show()

# Facet grid
g = sns.relplot(data=gap, x='gdpPercap', y='lifeExp',
                col='continent', col_wrap=3,
                kind='scatter', height=3, alpha=0.6, facet_kws={'sharex':False})
g.set(xscale='log')

Color, point size, and faceting let you incorporate **additional variables** without clutter.

### 4. Time series & multiple lines

In [None]:
# Flights example
flights = sns.load_dataset('flights')
pivot = flights.pivot(index='month', columns='year', values='passengers')

pivot.plot(figsize=(10,5))
plt.title('Monthly air passengers (1949-1960)')
plt.ylabel('Passengers')
plt.legend(loc='upper left', ncol=4, title='Year')
plt.show()

Pivoting long-format data wide can make multi-line time-series plots straightforward.
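As a small illustration of that reshaping step, the sketch below pivots a made-up long table (the values are invented) so that each year becomes its own column. Note that plain string month labels are sorted alphabetically by `pivot`, so the example restores calendar order explicitly; the `flights` data set avoids this because its `month` column is an ordered categorical.

```python
import pandas as pd

# Long format: one row per (year, month) observation
long_df = pd.DataFrame({
    'year': [2020, 2020, 2020, 2021, 2021, 2021],
    'month': ['Jan', 'Feb', 'Mar', 'Jan', 'Feb', 'Mar'],
    'passengers': [10, 12, 15, 11, 14, 18],
})

month_order = ['Jan', 'Feb', 'Mar']
# Wide format: one row per month, one column per year
wide_df = (
    long_df.pivot(index='month', columns='year', values='passengers')
           .loc[month_order]  # restore calendar order after the alphabetical sort
)
print(wide_df)
# Calling wide_df.plot() would now draw one line per year
```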

### 5. Customising aesthetics

In [None]:
# Global Seaborn style
sns.set_theme(style='whitegrid', context='notebook', font_scale=1.1)

plt.figure(figsize=(6,4))
sns.barplot(data=tips, x='day', y='total_bill', hue='sex', palette='Set2')
plt.title('Mean total bill by day & biological sex')
plt.xlabel('Day of week')
plt.ylabel('Mean total bill ($)')
sns.despine()
plt.show()

Small touches (despine, style, context, custom palettes) go a long way toward professional-looking figures.

## Part 2 – Applied Challenges

Below are five *realistic* messy data sets.  
For **each**:

1. **Run** the *Generate the data* cell to create a DataFrame `df`.  
2. Inspect & **tidy** it into a clean, analysis-ready form (remember *Tidy ≡ one variable per column, one observation per row*).  
3. **Answer the questions** in a concise written report (use the provided headings).  
4. Support your conclusions with **at least two visualizations** (feel free to create more).
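As a reminder of what "tidy" means in practice, here is a minimal sketch on a made-up table (the city names and temperatures are invented for illustration). The year is hidden in the column names, so it is moved into its own column with `DataFrame.melt()`.

```python
import pandas as pd

# Messy: the variable "year" is encoded in the column names
messy = pd.DataFrame({
    'city': ['Oslo', 'Lima'],
    'temp_2023': [6.1, 19.4],
    'temp_2024': [6.5, 19.9],
})

# Tidy: one row per (city, year) observation, one column per variable
tidy = messy.melt(id_vars='city', var_name='year', value_name='temp_c')
tidy['year'] = tidy['year'].str.replace('temp_', '', regex=False).astype(int)
print(tidy)
```

Once the data are in this shape, seaborn arguments such as `x`, `y`, and `hue` can each point at a single column.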

### Report template (copy for each dataset)
- **Context** – restate the stakeholder’s objective in 1-2 sentences.  
- **Tidying steps** – bullet list of wrangling operations applied.  
- **Findings** – describe what the visuals show.  
- **Recommendations** – actionable insights for the stakeholder.


### Challenge 1: Global Gadget Co. sales data (messy wide)

*Stakeholder*: **VP of Sales**  
> “We need to understand how each product sold across regions over the year and spot any seasonality.”

**Key questions**
1. Which region and month generated the highest revenue for *Gizmo*?
2. Do *Widget* and *Doohickey* follow similar seasonal patterns?

In [None]:
# --- Generate the messy data (RUN THIS) ---
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt

def _simulate():

    np.random.seed(0)
    months = list(range(1,13))
    regions = ['North', 'South', 'East', 'West']
    data = {}
    for r in regions:
        for m in months:
            col = f"{r[:2]}_{m}"
            data[col] = np.random.poisson(lam=2000 + 100*m + 400*regions.index(r), size=3)
    df = pd.DataFrame(data)
    df['Product'] = ['Gizmo', 'Widget', 'Doohickey']
    df = df.sample(frac=1, axis=1).reset_index(drop=True)
    return df


df = _simulate()
print("Shape:", df.shape)
df.head()

#### Your analysis below

In [None]:
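# Reshape wide -> long: one row per (product, region, month)
df_tidy = df.melt(id_vars='Product', var_name='Region_Month', value_name='Sales')
# Column names look like 'No_3': a two-letter region code, an underscore, and the month number
df_tidy[['Region', 'Month']] = df_tidy['Region_Month'].str.extract(r"^([A-Z][a-z])_(\d+)$")
# Expand the region codes so plot legends show full region names
df_tidy['Region'] = df_tidy['Region'].map({'No': 'North', 'So': 'South', 'Ea': 'East', 'We': 'West'})
df_tidy['Month'] = df_tidy['Month'].astype(int)
df_tidy = df_tidy.drop(columns='Region_Month')
df_tidy.head()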

In [None]:
plt.figure(figsize=(10, 6))
sns.lineplot(data=df_tidy[df_tidy['Product'] == 'Gizmo'], x='Month', y='Sales', hue='Region', marker='o')
plt.title('Gizmo sales per month, separated by region')
plt.xticks(range(1, 13))
plt.tight_layout()
plt.show()

In [None]:
df_widget_doohickey = df_tidy[df_tidy['Product'].isin(['Widget', 'Doohickey'])].copy()
sns.lineplot(data=df_widget_doohickey.groupby(['Product', 'Month'], as_index=False)['Sales'].sum(),
             x='Month', y='Sales', hue='Product', marker='o')
plt.title('Sales of Widget and Doohickey by month')
plt.xticks(range(1, 13))
plt.tight_layout()
plt.show()

#### Report

**Context**

The VP of Sales wants a clear understanding of how each product sold across regions and months to identify seasonal trends and to assess whether different products follow similar sales patterns.

**Tidying Steps**

1. Apply `pandas.melt()` to reshape the DataFrame from wide to long format.
2. Extract `Region` and `Month` information using regular expressions.
3. Convert `Month` data to integers for correct plotting.


**Findings**

1. The highest Gizmo sales occurred in the West region, which had higher sales than all other regions in every month. Monthly sales consistently increased in all regions from January to December.
2. Doohickey and Widget sales share extremely similar monthly sales patterns.

**Recommendations**

Increase prices for all products in later months to take advantage of higher demand.


### Challenge 2: IoT greenhouse sensor logs

*Stakeholder*: **Facility engineer**  
> “Our sensors embed temperature and humidity in one field. I suspect humidity spikes at night – can you confirm?”

**Key questions**
1. At what hours does humidity exceed 60 % most frequently?
2. Is there any correlation between temperature and humidity?

In [None]:
# --- Generate the messy data (RUN THIS) ---
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt

def _simulate():
    
    np.random.seed(0)

    times = pd.date_range('2025-01-01', periods=48, freq='H')
    sensors = [f"S{i}" for i in range(1,6)]
    rows = []
    for t in times:
        row = {'timestamp': t}
        for s in sensors:
            temp = np.random.normal(20,3)
            hum = np.random.uniform(30,70)
            row[s] = f"{temp:.1f}|{hum:.0f}"
        rows.append(row)
    return pd.DataFrame(rows)


df = _simulate()
print("Shape:", df.shape)
df.head()

#### Your analysis below

In [None]:
df_tidy = df.melt(id_vars='timestamp', var_name='sensor', value_name='data')
df_tidy['temp_celsius'] = df_tidy['data'].str.split('|').str[0].astype(float)
df_tidy['humidity'] = df_tidy['data'].str.split('|').str[1].astype(int)
df_tidy = df_tidy.drop('data', axis=1)
df_tidy['hour'] = pd.to_datetime(df_tidy['timestamp']).dt.hour
print(df_tidy)

In [None]:
# Keep only readings strictly above 60 %; discrete=True gives each hour (0-23) its own bar
sns.histplot(data=df_tidy[df_tidy['humidity'] > 60], x='hour', discrete=True)
plt.title('Number of readings above 60% humidity by hour of day')
plt.xticks(range(0, 24, 2))
plt.show()

In [None]:
corr = df_tidy['temp_celsius'].corr(df_tidy['humidity'])
print(corr)

#### Report

**Context**

The facility engineer wants to understand how humidity in the greenhouses varies over the day and whether temperature is related to humidity.

**Tidying Steps**

1. Apply `DataFrame.melt()` to reshape the DataFrame from wide to long format.
2. Extract `temp_celsius` and `humidity` information using `Series.str.split()`.
3. Add `hour` column to indicate hours past midnight.
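The splitting step can also be done in a single pass with `expand=True`, which returns one column per piece. This is a sketch on made-up readings in the same `temperature|humidity` format, not the only way to do it.

```python
import pandas as pd

# Made-up readings in the same "temperature|humidity" format
readings = pd.Series(['20.5|45', '18.2|63', '23.9|38'], name='data')

# expand=True returns a DataFrame with one column per piece
parts = readings.str.split('|', expand=True)
parsed = pd.DataFrame({
    'temp_celsius': parts[0].astype(float),
    'humidity': parts[1].astype(int),
})
print(parsed)
```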


**Findings**

1. There appears to be a small spike in high-humidity readings in the morning and a larger one at night.
2. The correlation between temperature and humidity is very weak.

**Recommendations**

Recalibrate your sensors.

### Challenge 3: Developer tools preference survey

*Stakeholder*: **Product manager**  
> “We surveyed devs about their favorite tools. Can you tell if age group influences tool choice and satisfaction?”

**Key questions**
1. Which tools are most popular in the 18-24 vs 45+ brackets?
2. Does reported satisfaction differ by primary tool?

In [None]:
# --- Generate the messy data (RUN THIS) ---
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt

def _simulate():
    
    np.random.seed(0)

    n = 200
    choices = ['Python', 'R', 'MATLAB', 'JavaScript']
    data = {
        'respondent_id': range(1,n+1),
        'age_group': np.random.choice(['18-24','25-34','35-44','45+'], n),
        'tools_used': [', '.join(np.random.choice(choices, size=np.random.randint(1,4), replace=False)) for _ in range(n)],
        'satisfaction_1-5': np.random.randint(1,6, n)
    }
    return pd.DataFrame(data)


df = _simulate()
print("Shape:", df.shape)
df.head()

#### Your analysis below

In [None]:
df_tidy = df.copy()
df_tidy['tools_used'] = df_tidy['tools_used'].str.split(', ')
df_tidy = df_tidy.explode('tools_used').reset_index(drop=True)
df_tidy

In [None]:
sns.countplot(data=df_tidy, x='tools_used', hue='age_group')
plt.title('Tool usage by age group')
plt.tight_layout()
plt.show()


In [None]:
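# Mean satisfaction per tool (a respondent who lists several tools counts once per tool)
avg_satisfaction = df_tidy.groupby('tools_used', as_index=False)['satisfaction_1-5'].mean()
print(avg_satisfaction)
sns.barplot(data=avg_satisfaction, x='tools_used', y='satisfaction_1-5', errorbar=None)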
plt.title('Average satisfaction by tool')
plt.tight_layout()
plt.show()

#### Report

**Context**

The product manager wants to learn whether tool usage varies by age group and whether average satisfaction varies by primary tool.

**Tidying Steps**

1. Split each `tools_used` string into a list with `Series.str.split()`, then apply `DataFrame.explode()` so that each row holds one respondent-tool pair.
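The split-then-explode step can be sketched on a three-row toy survey (the respondents and answers are made up):

```python
import pandas as pd

# Toy survey: several tools packed into one comma-separated string
survey = pd.DataFrame({
    'respondent_id': [1, 2, 3],
    'tools_used': ['Python, R', 'MATLAB', 'R, JavaScript, Python'],
})

long_survey = (
    survey.assign(tools_used=survey['tools_used'].str.split(', '))  # string -> list
          .explode('tools_used')                                    # one row per list item
          .reset_index(drop=True)
)
tool_counts = long_survey['tools_used'].value_counts()
print(long_survey)
print(tool_counts)
```

Because every other column is repeated for each tool, a respondent who lists three tools contributes three rows, which is worth keeping in mind when averaging per-respondent values such as satisfaction.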

**Findings**

1. The 18-24 age group is least likely to use R and most likely to use Python. The 45+ age group is slightly more likely to use R, but the usage is roughly even.
2. Average satisfaction is nearly the same for every tool; no formal significance test was run.

**Recommendations**

Survey more people and ask about other languages as well.

### Challenge 4: Blood pressure drug trial

*Stakeholder*: **Principal Investigator**  
> “We ran a cross-over trial with three conditions. Summarise efficacy in reducing BP and highlight best performer.”

**Key questions**
1. What is the average BP reduction (post-minus-pre) for each drug?
2. Is there significant individual variability?

In [None]:
# --- Generate the messy data (RUN THIS) ---
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt

def _simulate():
    np.random.seed(100)

    subjects = [f"Subj_{i:03d}" for i in range(1,51)]
    conditions = ['placebo','drugA','drugB']
    df = pd.DataFrame({'subject': subjects})
    for c in conditions:
        df[f"bp_pre_{c}"] = np.random.normal(120,10, len(subjects))
        df[f"bp_post_{c}"] = df[f"bp_pre_{c}"] - np.random.normal(5,2, len(subjects)) + (0 if c=='placebo' else  -10 + 5*conditions.index(c))
    return df


df = _simulate()
print("Shape:", df.shape)
df.head()

#### Your analysis below

In [None]:
df_tidy = df.melt(id_vars='subject', var_name='bp_when_type', value_name='blood_pressure')
df_tidy['when'] = df_tidy['bp_when_type'].str.split('_').str[1]
df_tidy['type'] = df_tidy['bp_when_type'].str.split('_').str[2]
df_tidy = df_tidy.drop('bp_when_type', axis=1)
df_tidy

In [None]:
df_wide = df_tidy.pivot_table(index=['subject', 'type'], columns='when', values='blood_pressure').reset_index()
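# Change = post - pre, so negative values mean blood pressure went down
df_wide['reduction'] = df_wide['post'] - df_wide['pre']
avg_reduction = df_wide.groupby('type')['reduction'].mean()
print(avg_reduction)
sns.violinplot(data=df_wide, x='type', y='reduction')
plt.title('Blood pressure change (post - pre) by condition')
plt.ylabel('Change in blood pressure (post - pre)')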
plt.show()

#### Report

**Context**

The PI wants to determine the average blood pressure reduction for each drug and determine the variability of the outcomes.

**Tidying Steps**

1. Apply `DataFrame.melt()` to reshape the DataFrame from wide to long format.
2. Extract `when` (pre- or post-) and `type` (placebo, drug A, or drug B) information using `Series.str.split()`.

**Findings**

Drug A appears to be effective at reducing blood pressure, while drug B makes little difference beyond the placebo. Individual responses vary, but the spread is modest: the interquartile range (75th percentile minus 25th percentile) of the change is roughly 2.5.
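One way to put numbers on both the average effect and the individual variability is to compute the mean change and the interquartile range per condition. The sketch below uses a made-up four-subject table with a hypothetical condition `drugX`; the column names are assumptions, not the trial data.

```python
import pandas as pd

# Made-up cross-over data: each subject is measured under both conditions
trial = pd.DataFrame({
    'subject': ['S1', 'S2', 'S3', 'S4'] * 2,
    'condition': ['placebo'] * 4 + ['drugX'] * 4,
    'pre': [120.0, 130.0, 110.0, 125.0, 121.0, 129.0, 112.0, 124.0],
    'post': [116.0, 124.0, 106.0, 119.0, 110.0, 121.0, 100.0, 116.0],
})
trial['change'] = trial['post'] - trial['pre']  # negative = blood pressure went down

grouped = trial.groupby('condition')['change']
summary = pd.DataFrame({
    'mean_change': grouped.mean(),
    'iqr': grouped.quantile(0.75) - grouped.quantile(0.25),  # spread of individual responses
})
print(summary)
```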

**Recommendations**

Devote more time to investigating the effects of drug A. Drug B is likely ineffective.

### Challenge 5: Social media campaign analytics

*Stakeholder*: **Marketing lead**  
> “Our views are recorded as strings like ‘1.2k’. Clean this up and evaluate platform performance.”

**Key questions**
1. Which platform achieved the highest median daily *views* and *like rate*?
2. Are weekends different from weekdays?

In [None]:
# --- Generate the messy data (RUN THIS) ---
import pandas as pd, numpy as np, seaborn as sns, matplotlib.pyplot as plt

def _simulate():
    
    np.random.seed(0)

    dates = pd.date_range('2024-07-01', '2024-12-31', freq='D')
    platforms = ['TikTok','Instagram','YouTube']
    rows = []
    for d in dates:
        for p in platforms:
            views = np.random.randint(1000, 100000)
            likes = int(views * np.random.uniform(0.05, 0.2))
            rows.append({'date': d, 'platform': p, 'views': f"{views/1000:.1f}k", 'likes': likes if np.random.rand()>0.05 else np.nan})
    df = pd.DataFrame(rows)
    dup = df.sample(200)
    df = pd.concat([df, dup], ignore_index=True).reset_index(drop=True)
    return df


df = _simulate()
print("Shape:", df.shape)
df

#### Your analysis below

In [None]:
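# Drop exact duplicate rows so repeated records are not counted twice
df_tidy = df.drop_duplicates().copy()
# Views look like '12.3k': extract the number (note the escaped dot) and scale by 1,000
df_tidy['views'] = df_tidy['views'].str.extract(r"(\d+\.\d+)k", expand=False).astype(float) * 1000
# Drop rows with missing values (here, missing like counts)
df_tidy = df_tidy.dropna()
df_tidy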

In [None]:
df_tidy['like_rate'] = df_tidy['likes'] / df_tidy['views']

print(df_tidy.groupby('platform')[['like_rate', 'views']].median())

sns.boxplot(data=df_tidy, x='platform', y='views')
plt.title('Daily Views per Platform')
plt.ylabel('Views')
plt.xlabel('Platform')
plt.show()

sns.boxplot(data=df_tidy, x='platform', y='like_rate')
plt.title('Daily Like Rate per Platform')
plt.ylabel('Like Rate')
plt.xlabel('Platform')
plt.show()


In [None]:
df_tidy['date'] = pd.to_datetime(df_tidy['date'])
df_tidy['dayofweek'] = df_tidy['date'].dt.dayofweek
df_tidy['is_weekend'] = df_tidy['dayofweek'] >= 5

sns.boxplot(data=df_tidy, x='is_weekend', y='views')
plt.title('Views: Weekend vs Weekday')
plt.xticks([0, 1], ['Weekday', 'Weekend'])
plt.xlabel('')
plt.ylabel('Views')
plt.show()

sns.boxplot(data=df_tidy, x='is_weekend', y='like_rate')
plt.title('Like Rate: Weekend vs Weekday')
plt.xticks([0, 1], ['Weekday', 'Weekend'])
plt.xlabel('')
plt.ylabel('Like Rate')
plt.show()


#### Report

**Context**

The marketing lead is evaluating the effectiveness of content on various social media platforms.

**Tidying Steps**

1. Apply `Series.str.extract()` to extract the numerical part of the view count.
2. Multiply the extracted number by 1,000 to convert it to a raw view count.
3. Use `DataFrame.dropna()` to remove rows with missing data.
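If the view strings were less uniform than in this data set, a slightly more defensive parser might help. The sketch below is an assumption-laden generalisation: the helper name `parse_abbreviated_counts` is hypothetical, and the plain-number and `M` suffix cases do not occur in the simulated data. Unparseable entries become `NaN` instead of raising an error.

```python
import pandas as pd

def parse_abbreviated_counts(text):
    """Convert strings such as '1.2k' or '3M' to numbers; unparseable entries become NaN."""
    multipliers = {'': 1, 'k': 1_000, 'm': 1_000_000}
    # Capture the numeric part and an optional k/m suffix (case-insensitive via lower())
    parts = text.str.strip().str.lower().str.extract(r'^(\d+(?:\.\d+)?)([km]?)$')
    number = pd.to_numeric(parts[0], errors='coerce')
    factor = parts[1].map(multipliers)
    return number * factor

raw = pd.Series(['1.2k', '87.5k', '950', '3M', 'n/a'])  # made-up examples
views = parse_abbreviated_counts(raw)
print(views)
```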

**Findings**

YouTube has a slightly higher median view count and like rate than TikTok, which in turn is slightly ahead of Instagram. However, all three platforms are roughly the same; the differences are too small to be meaningful.

**Recommendations**

Continue to post media on all three social media platforms.