# Lesson 2: Understanding Your Data

Understanding your data is like plotting key positions on a map for a journey. How do we do it? Through statistical quantities like **mean**, **median**, and **mode**, which tell us more about our data. Today, we will learn about these quantities and how to calculate them in **pandas**.

---

## Basic Statistical Quantities

In data analysis, **mean**, **median**, and **mode** help us understand the **central tendency** of our data:  
- **Mean**: The average.  
- **Median**: The middle value when data is sorted.  
- **Mode**: The most frequent value.  

**Standard deviation** and **variance** measure how far the data spreads around the mean: variance is the average squared deviation from the mean, and standard deviation is its square root. Additionally, we use:  
- **Minimum (min)** and **maximum (max)**: The smallest and largest values, which bound the data's range.  
- **Quantiles**: Cut points that divide the sorted data into equal-sized groups, which helps describe its distribution.

**Quantiles** include quartiles, which split data into **four equal-sized groups**. For example, the **first quartile** is the value below which 25% of the data falls.
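
To make the definitions of variance and standard deviation concrete, here is a minimal sketch that computes both by hand and compares them with pandas. The numbers in `values` are made up for illustration. Note that pandas divides by n − 1 by default (sample statistics), so the by-hand version does the same.

```python
import math
import pandas as pd

# Made-up sample values, used only for illustration
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

mean = values.sum() / len(values)

# Sample variance: squared deviations from the mean, divided by n - 1
squared_deviations = (values - mean) ** 2
variance = squared_deviations.sum() / (len(values) - 1)

# Standard deviation is the square root of the variance
std_dev = math.sqrt(variance)

print('Mean:', mean)
print('Variance (by hand):', variance, '| pandas:', values.var())
print('Std dev (by hand):', std_dev, '| pandas:', values.std())

# Quartiles: the 25%, 50%, and 75% cut points of the sorted data
print(values.quantile([0.25, 0.5, 0.75]))
```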

---

## Calculation of Statistical Quantities in Pandas

Knowing our destinations, let's see how to reach them using **pandas**! You can compute these quantities for a `DataFrame` or `Series` object using the following methods:  
- `mean = data.mean()`
- `median = data.median()`
- `mode = data.mode()`
- `std = data.std()` (standard deviation)
- `variance = data.var()`
- `min = data.min()`
- `max = data.max()`
- `quantile = data.quantile(q)` (where `q` is the quantile, e.g., `0.25`, `0.5`, etc.)
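
When these methods are called on a whole `DataFrame` rather than a single column, the result is one value per column. The sketch below uses a small hypothetical table (the column names and values are invented) and passes `numeric_only=True` so that the text column is skipped.

```python
import pandas as pd

# Hypothetical data mixing a text column with numeric columns
data = pd.DataFrame({
    'name': ['Ann', 'Bo', 'Cy'],
    'height_cm': [160, 170, 180],
    'weight_kg': [55, 65, 90],
})

# numeric_only=True skips the text column; each result is a Series
# indexed by column name
column_means = data.mean(numeric_only=True)
column_medians = data.median(numeric_only=True)

print(column_means)
print(column_medians)

# Calling the same method on a single column returns one number
print(data['height_cm'].mean())
```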

### Example

```python
import pandas as pd

# DataFrame creation
data = pd.DataFrame({
    'friends': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'scores': [93, 89, 82, 88, 94],
    'age': [20, 21, 20, 19, 21]
})

# Print statistics
print('Mean:', data['scores'].mean())               # 89.2
print('Median:', data['scores'].median())           # 89.0
print('Mode:', data['scores'].mode()[0])            # 82 (all scores are unique, so [0] picks the smallest)
print('Standard Deviation:', data['scores'].std())  # ≈ 4.764452
print('Variance:', data['scores'].var())            # 22.7
print('Min:', data['scores'].min())                 # 82
print('Max:', data['scores'].max())                 # 94
print('25% Quantile:', data['scores'].quantile(0.25))  # 88.0
```
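
`mode()` returns a `Series` rather than a single number because several values can tie for most frequent. As a small sketch, the `age` values from the example above contain such a tie:

```python
import pandas as pd

# The age values from the example: 20 and 21 both appear twice
ages = pd.Series([20, 21, 20, 19, 21])

modes = ages.mode()      # a Series, because ties are possible
print(modes.tolist())    # [20, 21]
print(modes[0])          # 20, the smallest of the tied values

# value_counts() shows the frequencies behind the result
print(ages.value_counts())
```

Indexing with `[0]` is convenient, but it silently hides any other tied values, so it is worth checking the full result first.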

---

## DataFrame `describe()` Function

pandas provides the `describe()` function, which computes the count, mean, standard deviation, minimum, quartiles, and maximum for each numeric DataFrame column (the text column `friends` is skipped by default). Here's an example:

```python
# Describe function usage
print(data.describe())
```

Output:

```
          scores       age
count   5.000000   5.00000
mean   89.200000  20.20000
std     4.764452   0.83666
min    82.000000  19.00000
25%    88.000000  20.00000
50%    89.000000  20.00000
75%    93.000000  21.00000
max    94.000000  21.00000
```
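
`describe()` also accepts a `percentiles` argument, and it can summarize a text column as well. The following sketch reuses the example data; the choice of the 10th and 90th percentiles is arbitrary.

```python
import pandas as pd

data = pd.DataFrame({
    'friends': ['Alice', 'Bob', 'Charlie', 'Dave', 'Eve'],
    'scores': [93, 89, 82, 88, 94],
    'age': [20, 21, 20, 19, 21]
})

# Ask for the 10th, 50th, and 90th percentiles instead of the quartiles
custom = data.describe(percentiles=[0.1, 0.5, 0.9])
print(custom)

# For a text column, describe() reports count, unique, top, and freq
text_summary = data['friends'].describe()
print(text_summary)
```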

---

## Lesson Summary and Practice

Today, you learned about:  
- **Statistical quantities**: Mean, median, mode, standard deviation, variance, min, max, and quantiles.  
- How to calculate these using **pandas**.  
- Using `describe()` to compute most of these at once for every numeric column.

### Next Steps

Learning is solidified through practice! Explore your data and try out these methods on your datasets. Happy exploring! 🎉


## Student Math Scores Statistical Summary

Imagine you are a school administrator seeking a quick overview of students' mathematical abilities. The provided code offers a statistical summary of their math scores, including measures like the mean, median, minimum, and maximum. What does this data reveal about the group's performance? Click Run to uncover insights concerning the students' math scores!

```python
import pandas as pd

student_data = pd.DataFrame({
    'Student': ['Anna', 'Ben', 'Cara', 'Don', 'Ella'],
    'Math_Scores': [78, 82, 89, 94, 91],
    'Age': [15, 16, 15, 14, 16]
})

print(student_data['Math_Scores'].describe())
```

Running the provided code will give you a statistical summary of the students' math scores using the `describe()` function in pandas. Here's an explanation of the insights you can expect to uncover from the output:

### Output Breakdown:
The `describe()` function generates the following statistical measures for the `Math_Scores` column:

1. **Count**: The total number of students.  
   - Insight: Verifies the number of math scores recorded.

2. **Mean**: The average score.  
   - Insight: Gives a sense of the overall performance level of the group.

3. **Standard Deviation (std)**: The spread or variability of scores.  
   - Insight: Indicates how consistent the scores are—lower values suggest students perform similarly.

4. **Min**: The lowest score.  
   - Insight: Identifies the weakest performance.

5. **25% (1st Quartile)**: The score below which 25% of the students fall.  
   - Insight: Provides the boundary of the lower quartile group.

6. **50% (Median)**: The middle score.  
   - Insight: Shows the central performance when scores are ordered.

7. **75% (3rd Quartile)**: The score below which 75% of the students fall.  
   - Insight: Marks the boundary of the higher quartile group.

8. **Max**: The highest score.  
   - Insight: Highlights the top-performing student.

---

### Example Insights:
Suppose the output is:

```
count      5.000000
mean      86.800000
std        6.610598
min       78.000000
25%       82.000000
50%       89.000000
75%       91.000000
max       94.000000
```

Key takeaways:
- The **mean score** of 86.8 is the group's average; assuming a 100-point scale, that is a strong overall result.  
- The **range** (max - min) is \(94 - 78 = 16\), indicating some variability in performance.  
- A **standard deviation** of ~6.61 suggests moderate variation in scores.  
- The **median** (89) means at least half of the students scored 89 or higher. It sits above the mean because the two lowest scores (78 and 82) pull the mean down.  
- The **quartiles** (82 and 91) show that the middle half of the scores lies within a 9-point band.
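
The range and the width of the middle half can be read straight off the summary, because `describe()` returns a `Series` whose entries can be looked up by label. A minimal sketch, where the variable names `summary`, `score_range`, and `iqr` are chosen here for illustration:

```python
import pandas as pd

student_data = pd.DataFrame({
    'Student': ['Anna', 'Ben', 'Cara', 'Don', 'Ella'],
    'Math_Scores': [78, 82, 89, 94, 91],
    'Age': [15, 16, 15, 14, 16]
})

summary = student_data['Math_Scores'].describe()

# Each statistic is available under its row label
score_range = summary['max'] - summary['min']
iqr = summary['75%'] - summary['25%']   # interquartile range

print('Range:', score_range)            # 16.0
print('Interquartile range:', iqr)      # 9.0
```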

---

### Conclusion:
This data reveals that the students as a group are performing well in math, with scores fairly close to each other. It highlights both the top performers and areas where additional support might benefit lower-scoring students.

## Exploring the Median of Student Math Scores

Stellar Navigator, it's time to shift our focus! You've seen the max() function at work. Can you use your newfound knowledge to find the middle ground? That's right — change the code to display the median math score. This time, look for the median() function.

```python
import pandas as pd

# DataFrame representing students' performance
performance_data = pd.DataFrame({
    'student': ['Ana', 'Ben', 'Cathy', 'Dan', 'Eli'],
    'math_scores': [85, 76, 95, 63, 88]
})

# Print the max math score among the students
print('Highest math score:', performance_data['math_scores'].max())
```

To find the **median math score**, replace the `max()` function with the `median()` function. Here's the updated code:

```python
import pandas as pd

# DataFrame representing students' performance
performance_data = pd.DataFrame({
    'student': ['Ana', 'Ben', 'Cathy', 'Dan', 'Eli'],
    'math_scores': [85, 76, 95, 63, 88]
})

# Print the median math score among the students
print('Median math score:', performance_data['math_scores'].median())
```

---

### Explanation:
- The `median()` function calculates the middle value of the sorted data.
- For this dataset (`math_scores` = [85, 76, 95, 63, 88]), sorting gives [63, 76, 85, 88, 95]. The **median** is the third value, **85**.

When you run the updated code, the output will be:
```
Median math score: 85.0
``` 

This reveals the **middle ground** of the students' performance in math!
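
The sort-and-pick-the-middle procedure described above can be sketched in plain Python and checked against pandas. The helper `median_by_hand` is hypothetical and written only for this illustration; it is not part of pandas.

```python
import pandas as pd

def median_by_hand(values):
    """Return the middle value of the sorted data (hypothetical helper)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single middle value
        return float(ordered[middle])
    # Even count: average the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2

math_scores = [85, 76, 95, 63, 88]
print(sorted(math_scores))                            # [63, 76, 85, 88, 95]
print('By hand:', median_by_hand(math_scores))        # 85.0
print('pandas: ', pd.Series(math_scores).median())    # 85.0
```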

## Calculating Statistics in a DataFrame

You're doing stellar work! Now, compute the crucial statistics for your data. Complete the code to calculate and report the average and middle values of math scores in the DataFrame.

```python
import pandas as pd

# Student Performance and Grades DataFrame
grades = pd.DataFrame({
    'student': ['Anna', 'Ben', 'Chloe', 'Dylan'],
    'math_scores': [88, 92, 75, 85]
})

# TODO: Calculate and display the mean math score
print('Mean Math Score:', ___)

# TODO: Calculate and display the median math score
print('Median Math Score:', ___)
```

Here's the completed code to calculate and report the **mean** and **median** of the math scores:

```python
import pandas as pd

# Student Performance and Grades DataFrame
grades = pd.DataFrame({
    'student': ['Anna', 'Ben', 'Chloe', 'Dylan'],
    'math_scores': [88, 92, 75, 85]
})

# Calculate and display the mean math score
print('Mean Math Score:', grades['math_scores'].mean())

# Calculate and display the median math score
print('Median Math Score:', grades['math_scores'].median())
```

---

### Explanation:
- **`mean()`**: Calculates the average of all math scores:  
  \[
  \text{Mean} = \frac{88 + 92 + 75 + 85}{4} = 85
  \]

- **`median()`**: Finds the middle value when the data is sorted.  
  Sorted `math_scores` = [75, 85, 88, 92]. The median is the average of the two middle values:
  \[
  \text{Median} = \frac{85 + 88}{2} = 86.5
  \]

### Expected Output:
```
Mean Math Score: 85.0
Median Math Score: 86.5
``` 

This code provides key insights into the students' overall performance and central tendency in math! 🎉
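
As an alternative sketch, both statistics can be requested in one call with `agg()`, which takes a list of statistic names and returns a labeled `Series`. The variable name `stats` is chosen here for illustration.

```python
import pandas as pd

grades = pd.DataFrame({
    'student': ['Anna', 'Ben', 'Chloe', 'Dylan'],
    'math_scores': [88, 92, 75, 85]
})

# agg() computes each named statistic and labels the results
stats = grades['math_scores'].agg(['mean', 'median'])

print('Mean Math Score:', stats['mean'])      # 85.0
print('Median Math Score:', stats['median'])  # 86.5
```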

## Creating a DataFrame and Summarizing Statistics in Pandas

You've explored student grades and their statistical insights, Stellar Navigator! Now let's encapsulate your knowledge. Create a DataFrame with students' names and their math scores. Then, using pandas, generate a summary of statistics for the math scores. Good luck!

```python
import pandas as pd

# TODO: Create a DataFrame with two columns: 'student' and 'math_score',
# fill it with five students' names and their corresponding math scores.

# TODO: Generate and print a summary of statistics for the math scores.
```


Here's how you can create a DataFrame with students' names and math scores, then generate a statistical summary for the math scores:

```python
import pandas as pd

# Create a DataFrame with students' names and math scores
student_data = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Cara', 'Dave', 'Eve'],
    'math_score': [90, 85, 78, 92, 88]
})

# Generate and print a summary of statistics for the math scores
print(student_data['math_score'].describe())
```

---

### Explanation:
1. **DataFrame Creation**:
   - The DataFrame contains two columns: `student` (names of students) and `math_score` (their math grades).
   - Five rows are added for sample data.

2. **Summary of Statistics**:
   - The `describe()` function calculates:
     - **Count**: Number of entries.
     - **Mean**: Average of the scores.
     - **Standard deviation (std)**: Spread of the data.
     - **Min/Max**: Smallest and largest values.
     - **25%, 50%, 75% Quartiles**: Key percentiles to understand distribution.

---

### Expected Output:
If the provided data is used, the output will be:
```
count     5.000000
mean     86.600000
std       5.458938
min      78.000000
25%      85.000000
50%      88.000000
75%      90.000000
max      92.000000
```

This gives a clear statistical overview of the students' performance in math! 🎓
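
To tie the summary back to the individual methods from the lesson, this short sketch checks that each row of `describe()` agrees with the matching method call on the same column. The variable names `scores` and `summary` are chosen here for illustration.

```python
import pandas as pd

student_data = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Cara', 'Dave', 'Eve'],
    'math_score': [90, 85, 78, 92, 88]
})

scores = student_data['math_score']
summary = scores.describe()

# Each row of the summary matches the corresponding individual method
print(summary['mean'], scores.mean())
print(summary['std'], scores.std())
print(summary['50%'], scores.median())
print(summary['75%'], scores.quantile(0.75))
```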