# Lesson 7: Introduction to pandas 🐼

**Welcome back!** Today you'll learn to:
- Use **pandas**, the most popular Python library for working with data
- Create and explore **DataFrames** (think: spreadsheets in Python)
- Plot charts directly from your data with one line of code
- Load a **real CSV file** from the internet
- Select columns and filter rows to answer questions about data

---

## 🌍 What is Data Science?

Data science is about **finding answers and patterns in data**. It's used everywhere:
- 🎮 Game companies analysing player behaviour to balance difficulty
- 📱 Social media apps deciding what to show in your feed
- 🏥 Hospitals predicting which patients need urgent care
- ⚽ Football clubs scouting players using match statistics

**pandas** is one of the most widely used tools in Python data science. By the end of today, you'll be using it too!

---

## 🤔 What is a DataFrame?

A **DataFrame** is like a spreadsheet inside Python. It has rows and columns, and keeps everything neatly organised.

![DataFrame Diagram](https://pandas.pydata.org/docs/_images/01_table_dataframe.svg)

Each **column** has a name (like "Name", "Age", "City") and each **row** represents one record. pandas automatically adds **row numbers** (0, 1, 2...) on the left; these are called the **index**.

In previous lessons, we stored data in separate lists: one for names, one for hours, one for ranks. A DataFrame keeps all the information about each person **together in one row**, so nothing can get out of sync.

---

## Part 1: DataFrames - Creating Your First Table 📊

### Importing pandas

Every time we use pandas, we start with this import line:

```python
import pandas as pd
```

We write `as pd` so we can type `pd` instead of `pandas` every time. It's just a shortcut that everyone uses.

Run this import once, near the top of your notebook, before running any of the cells that follow.

### Creating a DataFrame from a Dictionary

Remember dictionaries from Lesson 6? We can create a DataFrame from a dictionary where:
- Each **key** becomes a **column name**
- Each **value** is a **list** of data for that column

```python
import pandas as pd

data = {
    "student": ["Ali", "Fatima", "Omar", "Zara", "Hassan", "Maryam"],
    "game": ["Minecraft", "Valorant", "FIFA 24", "Fortnite", "GTA V", "Minecraft"],
    "hours_per_week": [14, 22, 8, 18, 12, 6],
    "rank": ["Gold", "Diamond", "Silver", "Platinum", "Gold", "Bronze"],
    "platform": ["PC", "PC", "PlayStation", "Xbox", "PC", "Switch"]
}

df = pd.DataFrame(data)
df
```

🎉 **Look at that!** A beautiful, organised table, with no more juggling separate lists.
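
One thing worth knowing: every list in the dictionary needs the same number of items, otherwise pandas cannot line the rows up. The sketch below uses a made-up `bad_data` dictionary (not part of the lesson data) to show what happens when one list is too short. It uses `try`/`except` only so the cell finishes without stopping on the error.

```python
import pandas as pd

# Hypothetical example data: "hours_per_week" is missing one value
bad_data = {
    "student": ["Ali", "Fatima", "Omar"],
    "hours_per_week": [14, 22],
}

try:
    pd.DataFrame(bad_data)
    built = True
except ValueError as error:
    # pandas refuses to build a table whose columns have different lengths
    built = False
    print("Could not build the DataFrame:", error)
```

If you see an error like this with your own data, count the items in each list first.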

---

## Exploring Your DataFrame 🔍

pandas gives us handy tools to quickly understand our data. Let's try them out!

### `.head()` and `.tail()` - Peek at Your Data

```python
df.head(3)
```

```python
df.tail(2)
```
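
The number in the brackets is how many rows you want. As a small sketch, using a shortened copy of the six-student table so the cell runs on its own, leaving the brackets empty gives the pandas default of five rows:

```python
import pandas as pd

# Shortened copy of the Part 1 table (two columns only)
df = pd.DataFrame({
    "student": ["Ali", "Fatima", "Omar", "Zara", "Hassan", "Maryam"],
    "hours_per_week": [14, 22, 8, 18, 12, 6],
})

first_rows = df.head()    # no number given, so pandas returns the first 5 rows
last_rows = df.tail(2)    # the last 2 rows: Hassan and Maryam

print(first_rows)
print(last_rows)
```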

### `.shape` - How Big is the Data?

```python
df.shape
```
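
`.shape` gives back a pair of numbers: rows first, then columns. It has no round brackets after it because it is a property rather than a method. A minimal sketch, rebuilding the Part 1 table so it runs by itself (the names `rows` and `columns` are just for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Ali", "Fatima", "Omar", "Zara", "Hassan", "Maryam"],
    "game": ["Minecraft", "Valorant", "FIFA 24", "Fortnite", "GTA V", "Minecraft"],
    "hours_per_week": [14, 22, 8, 18, 12, 6],
    "rank": ["Gold", "Diamond", "Silver", "Platinum", "Gold", "Bronze"],
    "platform": ["PC", "PC", "PlayStation", "Xbox", "PC", "Switch"],
})

# Unpack the pair into two separate variables
rows, columns = df.shape
print(rows, "rows and", columns, "columns")
```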

### `.info()` - What's Inside?

```python
df.info()
```

### `.describe()` - Instant Statistics! ✨

This is the **wow moment**. Remember in Lesson 4, you wrote code to calculate the average, find the maximum, and find the minimum? Watch this:

```python
df.describe()
```

☝️ **One line of code** just replaced about 15 lines you would have written manually!

Compare this to what you did in Lesson 4:

```python
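# The manual way (Lesson 4)
hours_per_week = [14, 22, 8, 18, 12, 6]

# Average: add everything up, then divide by how many values there are
average = sum(hours_per_week) / len(hours_per_week)

# Maximum: start with the first value, then check every value in turn
max_hours = hours_per_week[0]
for i in range(len(hours_per_week)):
    if hours_per_week[i] > max_hours:
        max_hours = hours_per_week[i]
# ... and so on for min, count, etc.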
```

pandas does all of this with just `df.describe()` 🚀
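
If you only want one of those numbers, each statistic is also available on its own. A short sketch using the `hours_per_week` values from Part 1; the variable name `hours` is just for this example, and picking out a column with square brackets is covered later in the lesson.

```python
import pandas as pd

df = pd.DataFrame({"hours_per_week": [14, 22, 8, 18, 12, 6]})
hours = df["hours_per_week"]   # pull out the one column we care about

print("Average:", hours.mean())
print("Highest:", hours.max())
print("Lowest:", hours.min())
print("Count:", hours.count())
```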

---

## Plotting from a DataFrame 📈

In Lesson 4, you created bar charts using `plt.bar()`. pandas makes this even easier: you can plot directly from the DataFrame!

```python
import matplotlib.pyplot as plt

df.plot(x="student", y="hours_per_week", kind="bar", legend=False)
plt.title("Weekly Gaming Hours by Student")
plt.ylabel("Hours per Week")
plt.show()
```

Compare this to the Lesson 4 approach:

```python
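# Lesson 4 way
import matplotlib.pyplot as plt

students = ["Ali", "Fatima", "Omar", "Zara", "Hassan", "Maryam"]
hours_per_week = [14, 22, 8, 18, 12, 6]

plt.bar(students, hours_per_week)
plt.title("Weekly Gaming Hours by Student")
plt.ylabel("Hours per Week")
plt.show()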
```

Both work! pandas plotting sits **on top of** matplotlib; they work together, not against each other.
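
One way to see this, sketched below on the assumption that matplotlib is installed alongside pandas: `df.plot()` hands back a matplotlib Axes object (called `ax` here by convention), and the usual matplotlib methods work on it.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Shortened copy of the Part 1 table (two columns only)
df = pd.DataFrame({
    "student": ["Ali", "Fatima", "Omar", "Zara", "Hassan", "Maryam"],
    "hours_per_week": [14, 22, 8, 18, 12, 6],
})

# df.plot() returns the matplotlib Axes that the bars were drawn on
ax = df.plot(x="student", y="hours_per_week", kind="bar", legend=False)

# Plain matplotlib methods work on that same Axes object
ax.set_title("Weekly Gaming Hours by Student")
ax.set_ylabel("Hours per Week")
plt.show()
```

Calling `plt.title(...)` and `ax.set_title(...)` ends up changing the same chart, so use whichever style you find easier to read.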

### Horizontal Bar Chart

```python
df.plot(x="student", y="hours_per_week", kind="barh", legend=False, color="steelblue")
plt.title("Weekly Gaming Hours by Student")
plt.xlabel("Hours per Week")
plt.show()
```

---

## Part 2: Real Data - Loading a CSV File 🌐

So far we built our DataFrame by hand. But real data scientists load data from **files**.

A **CSV file** (Comma-Separated Values) is one of the most common formats for data. It's just a text file where each line is a row and values are separated by commas.
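
To make that concrete, here is a small sketch that reads CSV text straight from a Python string instead of a file. The three rows are made up for illustration, and `io.StringIO` (from Python's standard library) simply lets `pd.read_csv` treat the string as if it were a file.

```python
import io
import pandas as pd

# Made-up CSV text: the first line holds the column names
csv_text = """student,game,hours_per_week
Ali,Minecraft,14
Fatima,Valorant,22
Omar,FIFA 24,8
"""

mini = pd.read_csv(io.StringIO(csv_text))
print(mini)
print(mini.shape)   # (rows, columns)
```

Each line of text became one row, and each comma marked the start of a new column.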

### Loading CSV from GitHub

Our course has a larger gaming survey with **25 students** stored as a CSV file on GitHub. Let's load it with **one line of code**:

```python
url = "https://raw.githubusercontent.com/simunaqv2/python-jupyter-course/main/lessons/lesson07/gaming_survey.csv"
survey = pd.read_csv(url)
survey.shape
```

Let's see the first few rows:

```python
survey.head()
```

🎉 **25 students, 9 columns, loaded in one line!**

Notice this has more columns than our hand-built DataFrame: `genre`, `years_playing`, and `would_recommend` are new.

Let's explore it:

```python
survey.describe()
```

```python
survey.columns.tolist()
```

---

## Selecting Columns 📋

Often you don't need all the columns, just the ones relevant to your question.

### Selecting a Single Column

Use **square brackets** with the column name. This gives you a **Series**, a single column pulled out of the table:

![Series Diagram](https://pandas.pydata.org/docs/_images/01_table_series.svg)

```python
survey["game"]
```

### Selecting Multiple Columns

Use **double square brackets** with a list of column names:

![Selecting Columns](https://pandas.pydata.org/docs/_images/03_subset_columns.svg)

```python
survey[["student", "game", "hours_per_week"]]
```

☝️ Notice the double brackets `[[ ]]`: the outer brackets select from the DataFrame, and the inner brackets create a list of column names.
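
The number of brackets also decides what kind of object comes back. A quick sketch on a tiny made-up table (the name `mini` and its three rows are only for this example):

```python
import pandas as pd

mini = pd.DataFrame({
    "student": ["Ali", "Fatima", "Omar"],
    "game": ["Minecraft", "Valorant", "FIFA 24"],
    "hours_per_week": [14, 22, 8],
})

one_column = mini["game"]                  # single brackets -> Series
as_table = mini[["game"]]                  # double brackets -> DataFrame with one column
two_columns = mini[["student", "game"]]    # double brackets -> DataFrame with two columns

print(type(one_column).__name__)
print(type(as_table).__name__)
print(two_columns.shape)
```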

### Quick Tip: Useful Column Methods

Count how many times each game appears:

```python
survey["game"].value_counts()
```

Get the average hours per week:

```python
survey["hours_per_week"].mean()
```
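
`.value_counts()` lists the most common value first, so the top row answers "which one is most popular?". As a hedged sketch on the six-student platform data from Part 1, `.idxmax()` pulls out just that top label:

```python
import pandas as pd

df = pd.DataFrame({
    "platform": ["PC", "PC", "PlayStation", "Xbox", "PC", "Switch"],
})

counts = df["platform"].value_counts()   # sorted from most to least common
print(counts)

most_common = counts.idxmax()            # the label with the highest count
print("Most common platform:", most_common)
```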

---

## Filtering Rows 🔎

Filtering lets you find rows that match a condition, like "show me only students who play more than 15 hours per week".

![Filtering Rows](https://pandas.pydata.org/docs/_images/03_subset_rows.svg)

### Filtering by Number

```python
heavy_gamers = survey[survey["hours_per_week"] > 15]
heavy_gamers
```

**How does this work?**

1. `survey["hours_per_week"] > 15` creates a Series of `True`/`False` values, one for each row
2. `survey[...]` keeps only the rows where the value is `True`

Let's see step by step:

```python
survey["hours_per_week"] > 15
```
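
Here is the same idea on the small six-student table from Part 1, rebuilt in shortened form so the cell runs on its own. The variable name `mask` is just a common convention for the `True`/`False` column, not something pandas requires.

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Ali", "Fatima", "Omar", "Zara", "Hassan", "Maryam"],
    "hours_per_week": [14, 22, 8, 18, 12, 6],
})

mask = df["hours_per_week"] > 15   # one True/False value per row
print(mask)

print(mask.sum())                  # True counts as 1, so this counts the matching rows
print(df[mask])                    # only the rows where mask is True
```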

### Filtering by Text

```python
pc_gamers = survey[survey["platform"] == "PC"]
pc_gamers
```

### Combining Two Conditions

Use `&` for AND and `|` for OR. **Important:** wrap each condition in its own round brackets `( )`!

You can also combine row filtering with column selection to get a specific subset of your data:

![Selecting Rows and Columns](https://pandas.pydata.org/docs/_images/03_subset_columns_rows.svg)

```python
hardcore_pc = survey[(survey["platform"] == "PC") & (survey["hours_per_week"] > 15)]
hardcore_pc
```
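
The code above shows AND. As a sketch of the other two ideas mentioned in this section, the example below uses the six-student table from Part 1 to try OR with `|`, and then `.loc` to pick rows and columns in one step. `.loc[rows, columns]` is a standard pandas feature that this lesson has not introduced, so treat it as a preview.

```python
import pandas as pd

df = pd.DataFrame({
    "student": ["Ali", "Fatima", "Omar", "Zara", "Hassan", "Maryam"],
    "hours_per_week": [14, 22, 8, 18, 12, 6],
    "platform": ["PC", "PC", "PlayStation", "Xbox", "PC", "Switch"],
})

# OR: PC players, or anyone playing more than 15 hours per week
pc_or_heavy = df[(df["platform"] == "PC") | (df["hours_per_week"] > 15)]
print(pc_or_heavy)

# Rows and columns together: heavy gamers, but only two of the columns
heavy_names = df.loc[df["hours_per_week"] > 15, ["student", "hours_per_week"]]
print(heavy_names)
```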

---

## 🏆 Mini Project: Answer Questions with pandas

Using the `survey` DataFrame (the 25-student CSV), answer these three questions using **pandas only, no loops allowed!**

**Question 1:** How many students play Shooter games?

**Question 2:** What is the average `hours_per_week` for students who would recommend their game?

**Question 3:** Which platform has the most students? (Hint: use `.value_counts()`)

```python
# Question 1: How many students play Shooter games?

```

```python
# Question 2: Average hours for students who would recommend their game?

```

```python
# Question 3: Which platform has the most students?

```

<details>
<summary>💡 Click for solutions</summary>

**Question 1:**
```python
shooters = survey[survey["genre"] == "Shooter"]
len(shooters)
```

**Question 2:**
```python
recommenders = survey[survey["would_recommend"] == "Yes"]
recommenders["hours_per_week"].mean()
```

**Question 3:**
```python
survey["platform"].value_counts()
```
</details>

---

## ⚠️ Common Mistakes

| Mistake | Example | Fix |
|---------|---------|-----|
| Forgetting `pd.` | `DataFrame(data)` | `pd.DataFrame(data)` |
| Single brackets for multiple columns | `df["col1", "col2"]` | `df[["col1", "col2"]]` |
| Forgetting brackets in combined filters | `df[df["a"] > 1 & df["b"] < 5]` | `df[(df["a"] > 1) & (df["b"] < 5)]` |
| Using `=` instead of `==` for filtering | `df[df["game"] = "FIFA"]` | `df[df["game"] == "FIFA"]` |
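
Why do the brackets in the third row matter? In Python, `&` is evaluated before comparisons like `>` and `==`, so without brackets the condition is split up in the wrong place. A minimal sketch with made-up numbers (the table `mini` and its `years_playing` values are invented for this example):

```python
import pandas as pd

mini = pd.DataFrame({
    "hours_per_week": [14, 22, 8, 18],
    "years_playing": [2, 6, 1, 4],
})

try:
    # Python reads this as: hours > (15 & years) > 3, which pandas cannot answer
    result = mini[mini["hours_per_week"] > 15 & mini["years_playing"] > 3]
    worked = True
except ValueError as error:
    worked = False
    print("Without brackets:", error)

# With brackets, each comparison is finished before & combines them
fixed = mini[(mini["hours_per_week"] > 15) & (mini["years_playing"] > 3)]
print(fixed)
```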

---

## 📚 Key Vocabulary

| Term | Definition |
|------|------------|
| **pandas** | A Python library for working with data in tables |
| **DataFrame** | A 2D table of data with labelled rows and columns |
| **Series** | A single column from a DataFrame |
| **CSV** | Comma-Separated Values, a common file format for data |
| **Filter** | Selecting only rows that match a condition |
| **`.describe()`** | A method that gives summary statistics for numeric columns |

---

## 📝 Homework

Complete the following tasks using the `survey` DataFrame loaded from the CSV.

**Remember:** Make sure you've run the cell that loads the CSV first!

---

### Task 1: Basic Data Questions ⭐

**Goal:** Answer these questions using pandas methods (no loops!).

**Your program should find:**
1. The average hours per week across all 25 students
2. The most popular platform (the one with the most students)
3. How many students said "Yes" to `would_recommend`

**Helpful code snippets:**

Getting the mean of a column:
```python
survey["column_name"].mean()
```

Counting values in a column:
```python
survey["column_name"].value_counts()
```

Filtering and counting rows:
```python
filtered = survey[survey["column"] == "some_value"]
len(filtered)
```

```python
# HOMEWORK Task 1: Basic Data Questions
# Load the data (run this if you haven't already)
import pandas as pd
url = "https://raw.githubusercontent.com/simunaqv2/python-jupyter-course/main/lessons/lesson07/gaming_survey.csv"
survey = pd.read_csv(url)

# Question 1: Average hours per week


# Question 2: Most popular platform


# Question 3: How many students would recommend their game?

```

<details>
<summary>💡 Click for solution</summary>

```python
import pandas as pd
url = "https://raw.githubusercontent.com/simunaqv2/python-jupyter-course/main/lessons/lesson07/gaming_survey.csv"
survey = pd.read_csv(url)

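# Question 1
# (print is needed here: a cell only displays its last line automatically)
print(survey["hours_per_week"].mean())

# Question 2: the platform at the top of the list is the most popular
print(survey["platform"].value_counts())

# Question 3
recommenders = survey[survey["would_recommend"] == "Yes"]
print(len(recommenders))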
```
</details>

---

### Task 2: Genre Analysis ⭐⭐

**Goal:** Filter the dataset by genre and create a bar chart for each genre showing hours per student.

**Your program should:**
1. Find all unique genres in the dataset
2. Filter the data to show only **Shooter** players
3. Create a bar chart showing hours per week for Shooter players only
4. Repeat for one other genre of your choice

**Helpful code snippets:**

Getting unique values:
```python
survey["genre"].unique()
```

Filtering and plotting:
```python
filtered = survey[survey["genre"] == "SomeGenre"]
filtered.plot(x="student", y="hours_per_week", kind="bar", legend=False)
plt.title("Your Title Here")
plt.ylabel("Hours per Week")
plt.show()
```

```python
# HOMEWORK Task 2: Genre Analysis
import matplotlib.pyplot as plt

# Step 1: Find all unique genres


# Step 2: Filter for Shooter players


# Step 3: Create a bar chart for Shooter players


# Step 4: Pick another genre and create its bar chart

```

<details>
<summary>💡 Click for solution</summary>

```python
import matplotlib.pyplot as plt

# Step 1: Find all unique genres
# (print is needed because this is not the last line of the cell)
print(survey["genre"].unique())

# Step 2: Filter for Shooter players
shooters = survey[survey["genre"] == "Shooter"]

# Step 3: Bar chart for Shooters
shooters.plot(x="student", y="hours_per_week", kind="bar", legend=False, color="crimson")
plt.title("Shooter Players: Hours per Week")
plt.ylabel("Hours per Week")
plt.show()

# Step 4: Another genre
sims = survey[survey["genre"] == "Simulation"]
sims.plot(x="student", y="hours_per_week", kind="bar", legend=False, color="teal")
plt.title("Simulation Players: Hours per Week")
plt.ylabel("Hours per Week")
plt.show()
```
</details>

---

### 🌟 Stretch Challenge: Leaderboard ⭐⭐⭐

**Goal:** Use `df.sort_values()` to build a ranked leaderboard of the top 10 students by hours played, displayed as a horizontal bar chart.

**Hints:**
- `df.sort_values("column_name", ascending=False)` sorts from highest to lowest
- `.head(10)` gets the top 10 rows
- You can chain them: `df.sort_values(...).head(10)`

**Target output:** A horizontal bar chart showing the top 10 gamers ranked by hours per week.

```python
# Example of sort_values
sorted_df = survey.sort_values("hours_per_week", ascending=False)
sorted_df.head()
```

```python
# STRETCH CHALLENGE: Gaming Leaderboard
import matplotlib.pyplot as plt

# Step 1: Sort by hours_per_week (highest first) and get top 10


# Step 2: Create a horizontal bar chart of the top 10

```

<details>
<summary>💡 Click for solution</summary>

```python
import matplotlib.pyplot as plt

# Step 1: Sort and get top 10
top_10 = survey.sort_values("hours_per_week", ascending=False).head(10)

# Step 2: Horizontal bar chart (reversed so #1 is at the top)
top_10_reversed = top_10.sort_values("hours_per_week", ascending=True)
top_10_reversed.plot(x="student", y="hours_per_week", kind="barh", legend=False, color="steelblue")
plt.title("Top 10 Gamers by Hours per Week")
plt.xlabel("Hours per Week")
plt.show()
```
</details>

---

## 🎯 What's Next?

Now you can load, explore, and filter real data with pandas. Next lesson, you'll learn to **create new columns** from calculations, **group data** by categories, and build more complex analyses. That's the point where pandas starts feeling genuinely powerful! 📊🚀