# Data Science 101 — Getting Started with Jupyter & Pandas
**Date:** 2025-09-03

Welcome! This notebook introduces:
- How Jupyter notebooks work (cells, markdown, code, running).
- Essential Python + pandas for data analysis.
- Analyzing a few **simple datasets** (grades, sales, weather).
- Plotting with matplotlib.
- A short **mini‑project** at the end.

> Keep the CSV files (`students_grades.csv`, `mini_sales.csv`, `tiny_weather.csv`) in the **same folder** as this notebook for easy loading.


## 1) How to use this notebook
- A notebook is made of **cells**. There are two main types:
  - **Markdown cells** (like this one) for text and instructions.
  - **Code cells** (Python code you can run).
- To **run a cell**: Click inside it and press **Shift+Enter** (or use the ▶️ button in the toolbar).
- You can **edit** any cell. If something breaks, re‑run the earlier cells in order, since later cells depend on variables defined above them.
- Tip: Use **Undo** (Cmd/Ctrl+Z) if you change something by accident.


## 2) Quick Python refresh

In [None]:
# Variables
message = "Hello, Data Science!"
year = 2025
print(message, year)

# Lists and dicts
fruits = ["apple", "banana", "cherry"]
facts = {"pi": 3.1416, "e": 2.718}
print(fruits[1], facts["pi"])

In [None]:
# Loops & conditions
nums = [3, 7, 2, 9, 4]
total = 0
for n in nums:
    if n % 2 == 1:
        total += n
total
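
The loop above adds up the odd numbers in `nums`. As a sketch of a more compact alternative, the same total can be computed with `sum()` and a generator expression. The name `odd_total` is introduced here only for illustration.

```python
nums = [3, 7, 2, 9, 4]

# Keep only the odd numbers (remainder 1 when divided by 2) and add them up.
odd_total = sum(n for n in nums if n % 2 == 1)
print(odd_total)  # 3 + 7 + 9 = 19
```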

## 3) Meet pandas (DataFrames)
**pandas** is the most common Python library for working with tables (spreadsheets/CSV).
- We'll also use **matplotlib** for charts.
- Run the setup cell below; it imports both libraries and installs them first if they are missing.

In [None]:
# Setup: import pandas and matplotlib (installs them if missing)
try:
    import pandas as pd
    import matplotlib.pyplot as plt
except ImportError:
    # %pip installs into the environment of the running kernel
    %pip install -q pandas matplotlib
    import pandas as pd
    import matplotlib.pyplot as plt

pd.__version__
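
Before loading any data, it can help to confirm that the CSV files are where the notebook expects them. The check below is a minimal sketch using only the standard library. It assumes the notebook's working directory is the folder that contains the notebook, which is usually the case, and the names `expected_files` and `missing_files` are illustrative.

```python
from pathlib import Path

# File names taken from the note at the top of this notebook.
expected_files = ["students_grades.csv", "mini_sales.csv", "tiny_weather.csv"]

# Relative paths are resolved against the current working directory.
missing_files = [name for name in expected_files if not Path(name).exists()]

if missing_files:
    print("Missing:", ", ".join(missing_files))
else:
    print("All data files found.")
```

If a file is reported missing, move it next to the notebook or pass its full path to `pd.read_csv`.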

## 4) Dataset #1 — Student grades
File: `students_grades.csv`

**Goal:** Explore averages, distributions, and simple relationships (e.g., absences vs scores).


In [None]:
import pandas as pd
grades = pd.read_csv("students_grades.csv")
grades.head()

In [None]:
# Basic info & summary stats
grades.info()
grades.describe()  # summarizes the numeric columns by default

In [None]:
# Simple computed columns
grades["avg_score"] = grades[["math","science","english"]].mean(axis=1)
grades[["name","avg_score","absences"]].head()
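
The `axis=1` argument in the cell above tells pandas to average across the columns of each row, giving one value per student. The sketch below contrasts it with `axis=0` on a tiny made-up table; the values in `axis_demo` are hypothetical and are not taken from `students_grades.csv`.

```python
import pandas as pd

# Hypothetical scores for two students (illustration only).
axis_demo = pd.DataFrame({
    "math": [80, 60],
    "science": [90, 70],
    "english": [70, 80],
})

row_means = axis_demo.mean(axis=1)  # one average per row (per student)
col_means = axis_demo.mean(axis=0)  # one average per column (per subject)

print(row_means)
print(col_means)
```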

In [None]:
# Plot: avg_score distribution (histogram)
import matplotlib.pyplot as plt
grades["avg_score"].plot(kind="hist", bins=5, title="Average Score Distribution")
plt.xlabel("avg_score")
plt.show()

In [None]:
# Relationship: absences vs average score (scatter)
grades.plot(kind="scatter", x="absences", y="avg_score", title="Absences vs Avg Score")
plt.show()

**Think about it**
- Do more absences correlate with lower scores?
- What other features might influence performance?
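
One way to put a number on the first question is the Pearson correlation coefficient, available as `Series.corr` in pandas. The sketch below uses made-up values in a hypothetical `corr_demo` table, so the result says nothing about the real grades data.

```python
import pandas as pd

# Made-up values for illustration only; they are not from students_grades.csv.
corr_demo = pd.DataFrame({
    "absences": [0, 1, 2, 4, 6],
    "avg_score": [92, 88, 85, 74, 65],
})

# Pearson correlation: near -1 is a strong negative linear relationship,
# near 0 is little linear relationship, near +1 is a strong positive one.
r = corr_demo["absences"].corr(corr_demo["avg_score"])
print(round(r, 2))
```

On the real data, `grades["absences"].corr(grades["avg_score"])` gives the corresponding number, assuming both columns are numeric. A correlation on its own does not show that absences cause lower scores.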


## 5) Dataset #2 — Mini sales
File: `mini_sales.csv`

**Goal:** Compute revenue, summarize by region and product, and visualize totals.


In [None]:
sales = pd.read_csv("mini_sales.csv", parse_dates=["date"])
sales["revenue"] = sales["units"] * sales["unit_price"]
sales.head()

In [None]:
# Total revenue by product
by_product = sales.groupby("product")["revenue"].sum().reset_index().sort_values("revenue", ascending=False)
by_product

In [None]:
# Plot: revenue by product (bar)
by_product.plot(kind="bar", x="product", y="revenue", title="Revenue by Product")
plt.ylabel("Revenue")
plt.show()

In [None]:
# Revenue by region and product (pivot table)
pivot = sales.pivot_table(index="region", columns="product", values="revenue", aggfunc="sum")
pivot
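
A pivot table with `aggfunc="sum"` is closely related to a two-key `groupby` followed by `unstack`. The sketch below shows both on a small hypothetical table (`sales_demo`) that only mimics the column names used above; its rows are invented for illustration.

```python
import pandas as pd

# Hypothetical rows (not the contents of mini_sales.csv).
sales_demo = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "product": ["A", "B", "A", "A", "B"],
    "revenue": [100.0, 50.0, 30.0, 20.0, 70.0],
})

# Option 1: pivot table, regions as rows and products as columns.
pivot_demo = sales_demo.pivot_table(
    index="region", columns="product", values="revenue", aggfunc="sum"
)

# Option 2: group by both keys, sum, then move "product" into the columns.
grouped_demo = (
    sales_demo.groupby(["region", "product"])["revenue"].sum().unstack("product")
)

print(pivot_demo)
print(grouped_demo)
```

Both tables hold the same totals here. If a region never sold a given product, that cell would appear as `NaN`.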

In [None]:
# Time series: revenue per day (line)
by_day = sales.groupby("date")["revenue"].sum().reset_index()
by_day.plot(kind="line", x="date", y="revenue", title="Revenue by Day")
plt.ylabel("Revenue")
plt.show()

## 6) Dataset #3 — Tiny weather
File: `tiny_weather.csv`

**Goal:** Work with dates and simple aggregations, then visualize temperature and precipitation.


In [None]:
weather = pd.read_csv("tiny_weather.csv", parse_dates=["date"])
weather.head()
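
Because `parse_dates` turns the `date` column into real datetimes, the `.dt` accessor becomes available for pulling out parts of each date. The sketch below demonstrates this on a hypothetical three-row table (`weather_demo`); the temperatures are invented, and the `weekday` column is an illustrative addition.

```python
import pandas as pd

# Hypothetical readings (not the contents of tiny_weather.csv).
weather_demo = pd.DataFrame({
    "date": pd.to_datetime(["2025-09-01", "2025-09-02", "2025-09-03"]),
    "temp_c": [21.5, 19.0, 23.2],
})

# .dt exposes datetime properties and methods for the whole column at once.
weather_demo["weekday"] = weather_demo["date"].dt.day_name()
print(weather_demo)
```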

In [None]:
# Summary stats (numeric columns only)
weather.select_dtypes(include="number").describe()

In [None]:
# Average temp and total precip for the week
avg_temp = weather["temp_c"].mean()
total_precip = weather["precip_mm"].sum()
avg_temp, total_precip

In [None]:
# Plot: temperature over time (line)
weather.plot(kind="line", x="date", y="temp_c", title="Temperature Over Time")
plt.ylabel("°C")
plt.show()

In [None]:
# Plot: precipitation by day (bar)
weather.plot(kind="bar", x="date", y="precip_mm", title="Daily Precipitation (mm)")
plt.ylabel("mm")
plt.show()

## 7) Data cleaning quick hits
- Handling missing values
- Converting data types
- Renaming columns


In [None]:
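# Example: introduce a missing value, then fill it with the column median
grades_with_na = grades.copy()
grades_with_na["math"] = grades_with_na["math"].astype("float64")  # floats can hold NaN
grades_with_na.loc[0, "math"] = float("nan")
grades_with_na["math_filled"] = grades_with_na["math"].fillna(grades_with_na["math"].median())
grades_with_na.head()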

In [None]:
# Renaming columns and type conversion
sales2 = sales.rename(columns={"unit_price":"unit_price_usd"}).copy()
sales2["units"] = sales2["units"].astype("int64")
sales2.dtypes
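
Type conversion with `astype` fails if a column contains text that is not a number. As a sketch of one common workaround, `pd.to_numeric` with `errors="coerce"` turns unparseable entries into `NaN`, which can then be counted with `isna()`. The `raw_demo` table and its `"n/a"` entry are hypothetical.

```python
import pandas as pd

# Hypothetical column read in as text, with one unparseable entry.
raw_demo = pd.DataFrame({"units": ["3", "5", "n/a", "2"]})

# Entries that cannot be parsed as numbers become NaN instead of raising an error.
raw_demo["units_num"] = pd.to_numeric(raw_demo["units"], errors="coerce")

# Count how many values are missing after the conversion.
missing_count = int(raw_demo["units_num"].isna().sum())
print(raw_demo)
print("Missing values:", missing_count)
```

The converted column is floating point because `NaN` cannot be stored in a plain integer column.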

## 8) Mini‑project (choose one)
**A. Grades challenge**  
- Create a column `passed_math` (True/False) that is True when the math score is ≥ 70.  
- Compute pass rates for each subject.  
- Plot pass rate per subject.

**B. Sales challenge**  
- Which region generated the highest revenue?  
- Which product sells best in each region?  
- Plot revenue by region.

**C. Weather challenge**  
- Which day was warmest/coldest?  
- Plot a combined chart that shows temperature and precipitation.  
- What patterns do you notice?


In [None]:
# 👉 Your work here (duplicate this cell as needed)
pass

## 9) Data ethics & reproducibility (quick checklist)
- **Responsible:** Use datasets appropriately; respect privacy and licenses.
- **Equitable:** Look for bias (e.g., class imbalance); consider subgroup performance.
- **Traceable:** Record versions, parameters, seeds, and the steps you took.
- **Reliable:** Validate results; consider failure modes and edge cases.
- **Governable:** Make it easy to review, fix, or roll back changes.
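
For the **Traceable** item, one lightweight approach is to record the Python and library versions and to fix a random seed at the top of the analysis. The sketch below is only one way to do it; the helper `package_version`, the dictionary `run_info`, and the seed value 42 are illustrative choices.

```python
import random
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a placeholder if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


SEED = 42  # arbitrary value; what matters is recording it

# Seeding the generator makes the random draws repeatable.
random.seed(SEED)
first_draw = random.random()
random.seed(SEED)
second_draw = random.random()

run_info = {
    "python": sys.version.split()[0],
    "pandas": package_version("pandas"),
    "matplotlib": package_version("matplotlib"),
    "seed": SEED,
}
print(run_info)
print("Repeatable draw:", first_draw == second_draw)
```

Note that `random.seed` only affects Python's built-in `random` module; other libraries have their own seeding mechanisms.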

**Pro tip:** Add a short markdown cell at the end of your analysis explaining decisions and next steps.


## 10) Wrap‑up
You’ve learned the basics of Jupyter, pandas, and matplotlib, and you explored three tiny datasets.  
Next steps:
- Try a larger dataset (e.g., Kaggle).  
- Learn scikit‑learn for modeling (classification/regression).  
- Practice telling a story with your charts and summaries.
