In [4]:
from IPython.display import Markdown, display

display(Markdown(r"""
# HAVK Data Visualization Sprint
**Date:** Monday, November 3, 2025
**Time:** 5:30 PM – 9:00 PM
**Location:** Woodward 335, UNC Charlotte
**Hosted by:** HAVK

---

## Purpose of This Notebook
This notebook is provided **for reference and documentation only**.
It contains setup instructions, library notes, and code templates designed to help participants replicate or expand upon their sprint work after the event.
During the competition, all analysis, coding, and interpretation must be performed independently within the official work period.

The goal is to support post-event learning, reproducibility, and portfolio development—not to serve as a shortcut or solution file.

---

## Overview
The HAVK Data Visualization Sprint is a focused, research-oriented challenge centered on **AI, Cybersecurity, Data Science, and Engineering** applications of data analysis.
Participants explore newly released datasets, extract meaningful insights, and communicate their results clearly and logically.

This event emphasizes practical reasoning and real analytical workflows used in modern research, security, and technical environments.

---

## Schedule

| Time | Activity |
|:--:|:--|
| **5:15 PM** | **Dataset Publication.** Official datasets are made public. Participants may review dataset summaries but may not begin work until 5:30 PM. |
| **5:30 – 8:00 PM** | **Work Period.** Teams analyze, visualize, and interpret assigned datasets using their chosen tools. HAVK mentors and officers will be available for technical questions. |
| **8:00 – 9:00 PM** | **Presentations and Judging.** Teams present findings to the judging panel. Winners are announced immediately following the final presentation. |

---

## Dataset Assignment
- Datasets will be **released at 5:15 PM** and assigned at **5:30 PM** when all teams are seated.
- Each team receives a **team number** at check-in. Selection order is determined by a random draw.
- Once selected, a dataset is locked for the duration of the sprint.
- Datasets are chosen to represent real-world contexts in **AI research, cybersecurity monitoring, and scientific or engineering analytics**.
- Accessing or analyzing datasets before 5:15 PM is not permitted.

---

## Team Structure
- Teams may consist of **one or two members**.
- Collaboration between separate teams is not permitted once assignments are made.
- Mentors may provide clarification and troubleshooting help but will not perform analysis or generate visualizations.

---

## Technical Environment and Libraries
You may use **any language, framework, or visualization tool** that supports your analysis.
This includes but is not limited to:

- **Python:** `pandas`, `numpy`, `matplotlib`, `seaborn`, `plotly`, `scikit-learn`
- **R:** `tidyverse`, `ggplot2`, `dplyr`, `shiny`
- **JavaScript:** `D3.js`, `Chart.js`, `TensorFlow.js`
- **C++ / Java:** for algorithmic or systems-based approaches
- **Tableau, Power BI, Excel:** for data visualization and dashboard creation

Participants are encouraged to experiment with techniques relevant to **machine learning, anomaly detection, cybersecurity telemetry, and statistical pattern analysis**, as long as they can explain their process and reasoning.

This reference notebook includes notes on commonly used libraries for participants who are newer to coding or visualization.

---

## Use of Generative AI and External Resources
**Generative AI tools** such as ChatGPT, Gemini, Copilot, or Claude may be used under the following guidelines:

**Allowed:**
- Syntax help, debugging suggestions, or library documentation
- Formatting assistance for code or visualization setup
- Writing markdown explanations or summaries after you have completed your analysis

**Not Allowed:**
- Generating or performing full analyses of the provided dataset
- Submitting AI-generated interpretations as original work
- Uploading or exposing any competition dataset to third-party AI platforms that retain data

All visualizations, insights, and conclusions must be your own.
External datasets may not be merged or cross-referenced without mentor approval.

---

## Objective
Your task is to analyze your dataset and extract meaningful, technically sound insights.
You are expected to:
1. Identify a relevant question, pattern, or anomaly.
2. Use data-driven reasoning to support your findings.
3. Communicate results clearly through concise visuals and logical explanation.

Projects should reflect real-world problem-solving in areas such as **AI model behavior, cybersecurity threat detection, or scientific data exploration**.

---

## Presentations and Judging
Presentations will take place between **8:00 PM and 9:00 PM**.
Each team will have approximately **3 minutes** to present their findings in any preferred format:
- Live walkthrough of your notebook
- Short slide deck or dashboard demonstration
- Direct explanation of plots and analysis

Judging criteria:

| Criterion | Description |
|:--|:--|
| **Insight** | Depth and originality of findings. Does the work reveal something meaningful about system behavior, trends, or risk? |
| **Clarity** | Logical flow and quality of explanation. Can your reasoning be followed by a technical audience? |
| **Visualization** | Readability, structure, and effectiveness of plots or figures in supporting your conclusion. |

Awards will be presented for:
- **Best Overall Project**
- **Best Visualization**
- **Most Insightful Finding**

---

## Expectations
- Bring a reliable laptop and any necessary cables or tools.
- Save work regularly and maintain version control when possible.
- Label all figures, variables, and outputs for readability.
- Maintain professionalism, teamwork, and academic integrity.
- This event prioritizes **critical thinking, accuracy, and communication** over volume of code.

---

**Once your dataset has been assigned, proceed to Section 1 – Setup below.**
This notebook exists to guide your workflow, clarify library use, and serve as a record of your analytical process for future reference.
"""))






In [6]:
from IPython.display import Markdown, display

display(Markdown(r"""
# Section 1 — How to Think Like an Analyst Here
This section is a guide. You do **not** have to follow it exactly.
If you already know what you're doing, skip ahead.

The goal of this sprint is not "make a pretty plot."
The goal is "find something real and explain why it matters."

Below is the standard workflow people use in AI, cybersecurity, data science, and engineering.

---

## Step 0. Understand the dataset
Before you write any code, answer this in plain English:
- What does one row represent?
  Example: one app listing, one employee record, one purchase event, one security alert, one lift attempt.
- What does each column approximately mean?
  (timestamp, rating, department, severity, price, etc.)

If you can't answer what a single row means, stop here and figure that out.
If you can't say what the columns are measuring, you cannot defend any conclusion later.

---

## Step 1. Load the data
Most teams will be handed a CSV file.

A CSV is just a table saved as plain text:
- Comma-separated values.
- Each line is one row.
- First line usually has column names.

In Python, people usually load CSV into a pandas DataFrame (a table-like object you can filter, group, summarize).

In Excel / Tableau / Power BI, you'd import the CSV as a sheet or data source.

More details on loading a CSV are in a later cell.

---

## Step 2. Inspect it
Do not jump straight to graphs.
You have to see what you're working with.

Questions to ask:
- How many rows are there?
- What are the column names?
- Which columns are numbers? Which are categories? Which are timestamps?
- Are there missing values?

If you're using pandas, people commonly call:
- `df.head()` → first 5 rows
- `df.info()` → column names and data types
- `df.describe()` → summary stats

If you're not using Python, you still do the same thinking:
scroll the first ~20 rows and write down what's normal and what's weird.

---

## Step 3. Pick ONE question
You are not trying to analyze everything at once.
Pick a single question that matters and stick to it.

Good questions:
- Which category or group is actually dominating?
- Did something spike (traffic, attrition, cost, severity) at a certain time?
- Are two variables clearly related (for example: price vs rating, weight vs lift total)?
- Is there a meaningful difference between groups (for example: department A vs department B)?

Tie it to reality:
- AI / model behavior
- security / anomaly / incident
- performance / reliability
- business / behavior / usage

If you can't state your question in one clean sentence, you are not ready to plot.

---

## Step 4. Build evidence
Now you create visual proof.

This is where plotting comes in. Typical useful plots:
- Bar chart comparing counts across categories
- Line plot over time
- Scatter plot showing the relationship between two numeric variables
- Heatmap of correlations

Each of these is shown in a later cell as a template.

Important:
- Your plot is not decoration.
- Your plot is supposed to prove or disprove your question.

---

## Step 5. Write the finding in normal language
Bad: "Here is a bar chart."
Good: "Apps in the 'Tools' category have way more installs than most other categories, but ratings are lower, suggesting adoption without satisfaction."

You must be able to say what happened, why it matters, and how you know.

---

## Step 6. Present it
When you present, judges want:
1. What data you got
2. What question you asked
3. The one plot that answers it
4. Why that answer matters in a real setting (AI, cyber, operations, product, etc.)

This is what they're scoring.

You are not graded on having the fanciest model.
You are graded on whether you can think like someone who is trusted with data.
"""))






In [7]:
from IPython.display import Markdown, display

display(Markdown(r"""
# Section 2 — Team Setup and Plan
Fill this out for yourselves. This becomes the spine of your final 3-minute presentation.
You are allowed to change answers throughout the sprint. This is a working area.

## 2.1 Team Information
- Team number: `______`
- Team members: `______`
- Dataset description (high-level, not confidential name):
  `__________________________________________`

What does one row represent in your dataset?
Example answers:
- "One row = one app in the Android store."
- "One row = one employee and their HR profile."
- "One row = one purchase event from a shopping weekend."
- "One row = one powerlifting meet attempt with lifter bodyweight and lift total."
- "One row = car sales numbers for a specific model in a specific region and year."

Write your version here:
`______________________________________________________________`

If you cannot answer that, solve that now before doing anything else.

---

## 2.2 Where are you working?
Which environment are you using to analyze this data?

Choose / edit:
- Google Colab with Python
- Jupyter / VS Code locally
- R / RStudio
- Excel / Power BI / Tableau
- Custom script / other

Write it here:
`______________________________________________________________`

This matters because you will be asked to walk us through your process.

---

## 2.3 Core question you plan to answer
You get one main question. Not five. One.

Examples:
- "Which category of app actually dominates installs, and does that match rating quality?"
- "Which department is losing people the fastest, and what pattern shows up with attrition?"
- "What factor most strongly predicts higher lift totals at meets?"
- "How did sales change over time for a specific model, and when did it jump or crash?"

Your question:
`______________________________________________________________`

Why this matters in a real AI / cyber / engineering / ops context:
`______________________________________________________________`

You will literally say this sentence to judges.

---

## 2.4 What columns (fields) matter for that question?
Pick 2–5 columns from the dataset that are relevant. For each, explain what it actually means in plain English.

Example:
- `Installs` – approximate popularity / reach of the app
- `Rating` – user satisfaction score (1–5)
- `Category` – app type (Tools, Game, Finance, etc.)
- `Price` – whether the app is free or paid

Your columns:
1. `_________` means `________________________________________`
2. `_________` means `________________________________________`
3. `_________` means `________________________________________`

If you cannot explain your columns, you cannot interpret your result.

---

## 2.5 Hypothesis (this can be wrong)
Before you analyze, take a position. Call your shot.

Examples:
- "Paid apps have higher ratings than free apps, but way fewer installs."
- "Attrition is concentrated in high-travel roles."
- "Sales dipped in 2020, recovered in 2021."
- "Lifters with lower bodyweight but high total are clustered in specific federations."

Your hypothesis:
`______________________________________________________________`

This is useful because:
- If you're right, you found something.
- If you're wrong, you found something else (the opposite).
Either way, you have a story.

---

## 2.6 Planned visual
Name one plot you think will answer your question.

Examples:
- bar chart comparing counts
- line chart over time
- scatter plot with color group
- correlation heatmap

Your planned visual and what you expect to prove with it:
`______________________________________________________________`

This is likely the figure you will show the judges.
"""))





In [9]:
# Section 3 — Loading Your CSV (Reference Only)

# Most teams in this sprint will receive a `.csv` file.

# A `.csv` file:
# - Is just plain text
# - Each row is one record
# - Columns are separated by commas
# - The first row usually contains column names

# Below are common ways to load a CSV depending on your tool.
# These are examples, not solutions — copy/paste them later when you’re ready.

## Option A: Python (Google Colab)

# **Step 1 – Upload your file into the runtime.**
# The `google.colab` module only exists inside Colab. In local Jupyter, use Option B instead.
from google.colab import files
uploaded = files.upload()  # you'll be prompted to choose your CSV file

# After you upload, Colab will show you the filename (for example: `BlackFriday.csv` or `spotify.csv`).
# Now you can load the data into a pandas DataFrame.
import pandas as pd

# `uploaded` maps each uploaded filename to its contents, so read the name from it
# instead of typing a placeholder (a wrong name raises FileNotFoundError).
filename = next(iter(uploaded))
df = pd.read_csv(filename)

# Check the first few rows to confirm it's loaded correctly.
df.head()

# At this point, `df` is your working table (DataFrame).

# - Each **column** represents one variable or feature.
# - Each **row** represents one record.

# If you see errors loading the CSV:
# - Check the filename carefully (it is case-sensitive and must include the `.csv` extension).
# - Make sure the file isn’t a zipped archive (`.zip`) or an Excel workbook (`.xlsx`).
# - Try re-uploading.






In [None]:
## Option B: Python with a Direct Path

# If you’ve already got the CSV in your environment, skip the upload step.
import pandas as pd

# If your file is already in your Colab environment or local folder:
df = pd.read_csv("/content/your_file.csv")  # or "./your_file.csv"
df.head()



In [None]:
## Option C: Excel / Power BI / Tableau

# 1. Open your tool.
# 2. Select **Import Data → From Text/CSV**.
# 3. Choose the file you were given.
# 4. Verify that columns are typed correctly:
   # - Numbers → numeric
   # - Dates → datetime
   # - Text → string


In [None]:
## Option D: R / RStudio (not recommended if you are a beginner)

# ```r
# data <- read.csv("your_file.csv", header = TRUE, stringsAsFactors = FALSE)
# head(data)
# ```


In [None]:
### After loading, stop and confirm:
# - Do you see the columns you expected?
# - Does each row look like what you think it represents?
# - Are there blank columns, repeated headers, or broken rows?

# If this step is wrong, every step after this will be wrong.
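
The checks above can be roughed out in pandas. This is a minimal sketch on a made-up CSV (the `raw` text, the `sample` name, and the column names are all hypothetical), showing one way to spot a header row pasted into the data and a column that is entirely blank. Swap in your own DataFrame once it is loaded.

```python
import io

import pandas as pd

# Hypothetical messy CSV: the header row appears twice and "notes" is never filled in.
raw = io.StringIO(
    "name,score,notes\n"
    "a,1,\n"
    "name,score,notes\n"
    "b,2,\n"
)
sample = pd.read_csv(raw)

# Rows whose values simply repeat the column names (a header pasted into the data).
header = list(sample.columns)
repeated_header = sample.astype(str).apply(lambda row: list(row) == header, axis=1)

# Drop those rows first, then look for columns where every remaining value is missing.
cleaned = sample[~repeated_header]
blank_cols = [col for col in cleaned.columns if cleaned[col].isna().all()]

print("Repeated header rows:", int(repeated_header.sum()))
print("Blank columns:", blank_cols)
```

If either check reports something, fix the file or the loading step before moving on to Section 4.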

In [None]:
# Section 4 — Inspecting and Cleaning Your Data

# Before visualizing, always **inspect and clean** your dataset.
# Think of this as your data “pre-flight checklist” — making sure everything looks right before analysis.

# Ask yourself:
# - What does each column represent?
# - Are there missing values?
# - Are any values duplicated or formatted incorrectly?
# - What kind of data types (numeric, categorical, date/time) am I working with?

# You don’t have to answer all of these now — but you should know **how** to find out.


In [None]:
# Look at the first few rows of your dataset
df.head()

# Hint: Try df.tail() to view the bottom rows.


In [None]:
# Now, check the overall structure with `df.info()`.
# It gives you a summary of columns, data types, and how many non-null entries each column has.

# Think:
# - Are any columns mostly empty?
# - Do the column names make sense for what’s inside them?


In [None]:
# Check columns, data types, and null counts
df.info()

# Bonus: Try df.shape to see how many rows and columns your dataset has.


In [None]:
# Missing values and duplicates are common in real-world data.
# Finding them early helps prevent misleading charts or broken analyses.


In [None]:
# Count missing (NaN) values per column
df.isnull().sum()

# Optional challenge:
# Can you write code to show only columns with missing values?
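
One possible answer to the optional challenge, shown on a small made-up table (the `sample` DataFrame and its columns are hypothetical). It filters the per-column missing counts down to the columns that have at least one missing value. Try it yourself before reading on.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in two of the three columns.
sample = pd.DataFrame({
    "rating": [4.5, np.nan, 3.9, np.nan],
    "category": ["Tools", "Game", None, "Finance"],
    "installs": [1000, 500, 250, 100],
})

# Count missing values per column, then keep only the columns with at least one.
missing = sample.isnull().sum()
missing_only = missing[missing > 0].sort_values(ascending=False)

print(missing_only)
```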


In [None]:
# See how many duplicate rows exist
df.duplicated().sum()

# Optional:
# Try displaying a few duplicate rows if they exist.
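
A minimal sketch of one way to look at the duplicated rows themselves, again on a hypothetical `sample` table. `keep=False` marks every copy of a repeated row, not just the later ones, so you can see the full group before deciding whether to drop anything.

```python
import pandas as pd

# Hypothetical data: the ("a", 10) row appears twice.
sample = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b"],
    "amount": [10, 20, 10, 30, 25],
})

# Show every row that has an exact copy elsewhere in the table.
dupes = sample[sample.duplicated(keep=False)].sort_values(list(sample.columns))
print(dupes)

# Dropping duplicates keeps the first copy of each repeated row.
deduped = sample.drop_duplicates()
print(len(sample), "rows before,", len(deduped), "rows after")
```

Whether duplicates are errors depends on what one row represents. Two identical purchase events may be real, while two identical employee records probably are not.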


In [None]:
# Different columns require different treatments.
# Numerical columns can be averaged or plotted directly.
# Categorical columns might need grouping or counting.

# Before visualizing, identify **which are which**.
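
One way to sort columns into "which are which" is `select_dtypes`. The sketch below uses a hypothetical `sample` table. With real data, remember that a column can be numeric in meaning but stored as text, so treat the result as a starting point.

```python
import pandas as pd

# Hypothetical data with numeric, text, and datetime columns.
sample = pd.DataFrame({
    "price": [0.0, 1.99, 4.99],
    "installs": [1000, 50, 10],
    "category": ["Tools", "Game", "Finance"],
    "released": pd.to_datetime(["2024-01-05", "2024-03-10", "2024-06-01"]),
})

numeric_cols = sample.select_dtypes(include="number").columns.tolist()
datetime_cols = sample.select_dtypes(include="datetime").columns.tolist()
# Everything else is treated as categorical or free text here.
other_cols = [c for c in sample.columns if c not in numeric_cols + datetime_cols]

print("Numeric:", numeric_cols)
print("Datetime:", datetime_cols)
print("Categorical / text:", other_cols)
```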


In [None]:
# At this point, you should have a sense of:
# - What’s inside your data
# - What might be broken
# - Which columns are worth visualizing

# If something looks wrong (e.g., prices listed as text, missing entries, or nonsense values),
# try fixing just one issue and re-running your summary.

# You’ll learn more by **experimenting** than following a script.
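
As an illustration of "fixing just one issue", here is a sketch for prices stored as text. The `sample` table and the `price_num` column name are hypothetical. `errors="coerce"` turns anything that cannot be parsed into `NaN` instead of raising an error, so you can count what failed and decide how to handle it.

```python
import pandas as pd

# Hypothetical column where prices were read in as text.
sample = pd.DataFrame({"price": ["$4.99", "0", "Free", "$1.50"]})

# Strip the currency symbol, then convert; unparseable values become NaN.
stripped = sample["price"].str.replace("$", "", regex=False)
sample["price_num"] = pd.to_numeric(stripped, errors="coerce")

print(sample)
print("Values that could not be parsed:", int(sample["price_num"].isna().sum()))
```

Keeping the original column next to the converted one makes it easy to inspect the rows that failed. Here "Free" may deserve to be treated as 0 rather than as missing.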


In [None]:
# Section 5 — Exploring Patterns and Building Visuals (Optional / Next Steps)

# If you’ve cleaned your data and confirmed it looks correct — great.
# Now it’s time to **explore, experiment, and visualize.**

# You can use *any* language, library, or method you like — Python, R, Excel, Power BI, etc.
# The following examples are purely optional and meant to help you **think** about what to try.


## Step 1 — Ask Questions Before Plotting

Good data visualization starts with a question.  
Ask yourself:

- What story does this dataset tell?  
- Are there trends, peaks, or correlations worth highlighting?  
- Which variables interact in interesting ways?  
- What could this mean for AI, cybersecurity, biology, or engineering?

Write down 1–2 questions before you start plotting.  
For example:
- “Do apps with more reviews tend to have higher ratings?”  
- “Are certain categories of Netflix content watched longer than others?”  
- “Does crime frequency vary by time of day in Charlotte?”


## Step 2 — Choose Visualization Tools

You can use any tools. Here are some common Python options:

- **matplotlib** – basic, fast plotting  
- **seaborn** – quick, beautiful statistical visuals  
- **plotly** – interactive, dynamic plots  
- **pandas** – built-in plotting from dataframes  

If you’re using another language or platform, that’s completely fine.  
These examples are just for reference.


In [None]:
# Import visualization libraries (if you need them)
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

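# Example setup: comment out any import above that you don't use
# The 'seaborn-v0_8-*' style names require matplotlib 3.6+, so apply the style only if it exists
if 'seaborn-v0_8-darkgrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-darkgrid')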


## Step 3 — Explore Relationships Between Variables

Start simple:
- Compare two numeric columns (e.g., price vs rating, downloads vs reviews)
- Look at averages by category
- Find outliers or unusual spikes
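
For the outlier bullet, one common rule of thumb is the 1.5 × IQR fence. The sketch below uses it on invented numbers; it is only one of several reasonable definitions of "outlier", and the series name `downloads` is hypothetical.

```python
import pandas as pd

# Toy numeric column with one obvious spike (invented values)
values = pd.Series([10, 12, 11, 13, 12, 95, 11, 10], name="downloads")

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

A flagged point is not automatically an error; it may be the most interesting part of the dataset, so look at it before removing it.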


In [None]:
# Example only — modify with your own column names
sns.scatterplot(data=df, x="ColumnA", y="ColumnB")

# Try swapping ColumnA and ColumnB to see different perspectives


## Step 4 — Summarize and Group

You can also group data to look for patterns:
- Mean sales per year
- Average rating per category
- Total users by country


In [None]:
# Example grouping
df.groupby("CategoryColumn")["NumericColumn"].mean().sort_values(ascending=False).head(10)
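
A group mean can mislead when some groups contain only a handful of rows. The sketch below, on a made-up `demo` frame, reports the row count next to the mean so small groups are visible.

```python
import pandas as pd

# Toy frame (invented values); replace the column names with your own
demo = pd.DataFrame({
    "category": ["game", "game", "tool", "tool", "tool", "news"],
    "rating": [4.0, 5.0, 3.0, 4.0, 5.0, 2.0],
})

# Compute the mean and the number of rows behind it for each category
summary = (
    demo.groupby("category")["rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```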


## Step 5 — Think Beyond Visualization

The best teams will interpret their visuals like researchers, not just show charts.  
For inspiration:

- **AI / Data Science:** Could you predict one variable using another?  
- **Cybersecurity:** Can you detect unusual spikes or anomalies?  
- **Biology:** Are there environmental or behavioral patterns emerging?  
- **Engineering:** Which metrics show system performance or reliability?

You don’t have to model — but showing **why** your pattern matters is powerful.
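
As a small illustration of the "predict one variable using another" idea, the sketch below measures correlation and fits a straight line with NumPy. The columns `reviews` and `rating` and their values are invented; a linear fit is just the simplest possible choice, not a recommended model.

```python
import numpy as np
import pandas as pd

# Toy data (invented values) with a roughly linear relationship
demo = pd.DataFrame({
    "reviews": [10, 20, 30, 40, 50],
    "rating": [3.0, 3.4, 3.9, 4.1, 4.6],
})

# Pearson correlation: how strongly the two columns move together
r = demo["reviews"].corr(demo["rating"])

# Least-squares line: rating is approximately slope * reviews + intercept
slope, intercept = np.polyfit(
    demo["reviews"].to_numpy(), demo["rating"].to_numpy(), deg=1
)

print(f"correlation: {r:.3f}")
print(f"rating = {slope:.3f} * reviews + {intercept:.2f}")
```

A strong correlation does not show that one variable causes the other, which is worth saying out loud when presenting.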


## Step 6 — Presenting Your Findings

When presenting to judges:
1. State your dataset and your main question.  
2. Show 1–2 clean visuals that answer it.  
3. Explain what the patterns might mean.  
4. If you used AI, anomaly detection, clustering, or anything advanced — describe it clearly.

You can present however you like:
- Colab notebook  
- Slides  
- Short verbal explanation  
- Simple dashboard  


## Key Reminder

This notebook is for reference — not a checklist.  
You’re free to go in any direction that fits your dataset and story.

Be creative. Think critically.  
Visuals are just tools — the real insight comes from how *you interpret* them.


# Section 6 — Presentation & Judging Tips

You’ve reached the final stage of the sprint.  
Now it’s time to **present your findings** — not just your charts, but your reasoning and insight.  

This section will help you prepare for the 8–9 PM judging session.


## Step 1 — What to Present

Each team will have a short presentation window to explain their analysis.

You can present however you like:
- Directly in your notebook (Colab or Jupyter)
- Google Slides
- Short verbal walkthrough
- Or even a single dashboard or chart view

The key is **clarity and storytelling** — not code length.

When presenting:
1. **State your dataset** (what it represents)
2. **Share your main question or goal**
3. **Show 1–2 visuals** that directly support your point
4. **Explain what your data reveals**
5. End with **a short interpretation** — why does it matter?


### Example Presentation Flow

1. “We analyzed the Spotify Tracks dataset.”
2. “Our question was: Do older songs have lower popularity scores?”
3. “We plotted average popularity by release year.”
4. “We found that average popularity drops steadily for songs released before 2015, possibly because of streaming algorithm bias.”
5. “This could relate to how recommendation systems amplify newer content.”

That’s it — simple, direct, and meaningful.
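
The "average popularity by release year" step in that flow could be computed roughly as follows. The column names `release_year` and `popularity` are assumptions, and the numbers are invented; they do not reflect the actual Spotify Tracks dataset.

```python
import pandas as pd

# Toy stand-in for a tracks table (invented values, hypothetical column names)
tracks = pd.DataFrame({
    "release_year": [2010, 2010, 2015, 2015, 2020, 2020],
    "popularity": [40, 50, 55, 65, 70, 80],
})

# Average popularity per release year, in chronological order
avg_by_year = tracks.groupby("release_year")["popularity"].mean().sort_index()
print(avg_by_year)
```

The resulting series can be passed directly to a line chart in whichever plotting tool your team uses.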


## Step 2 — Judging Criteria

Projects will be judged by HAVK officers and future officers based on:

- **Insight:** Did you find something meaningful or surprising?  
- **Clarity:** Is your visualization readable and your story easy to follow?  
- **Creativity:** Did you explore your data in an original way?  
- **Technical skill:** Did you use your tools effectively (Python, AI, etc.)?  
- **Communication:** Did you clearly explain your reasoning?

Each category carries roughly equal weight — technical complexity alone does not guarantee a win.


## Step 3 — Common Mistakes to Avoid

- Too many visuals with no clear takeaway  
- Spending all your time cleaning data without exploring it  
- Showing models or charts you can’t explain  
- Ignoring outliers or odd patterns instead of discussing them  
- Focusing only on aesthetics instead of interpretation

Remember: **Judges care more about your insight than your syntax.**


## Step 4 — Advanced or Thematic Tie-ins

If you have time, connect your findings to your field:

- **AI / Data Science:** Could your dataset train a model or inform one?  
- **Cybersecurity:** Are there trends that could represent risk or anomaly detection?  
- **Biology / Health:** Are there patterns in data that mirror biological behavior?  
- **Engineering:** Could this help optimize or predict performance?

These connections show real interdisciplinary thinking — exactly what top researchers and recruiters look for.


## Step 5 — Closing Thoughts

Whether you used AI, statistics, or just a few charts, the purpose of this sprint is **insight** — not perfection.  

Show us how you think.  
Show us what your data revealed.  
Show us that you can connect numbers to meaning.

Good luck, and thank you for participating in the HAVK Data Visualization Sprint.
