# Exploring Descriptive Statistics

## **Introduction**
Descriptive statistics provide essential tools to summarize and understand data. They include measures of central tendency, variability, and distribution shape. In this notebook, we'll explore these concepts using the "Students Performance in Exams" dataset.

### **Key Concepts**:
- **Mean**: The average of a set of numbers, calculated as:

  $$
  \text{Mean} = \frac{\sum_{i=1}^n x_i}{n}
  $$
  where $x_i$ are the data points and $n$ is the number of data points.

- **Median**: The middle value when the numbers are sorted. If $n$ is odd, it is the middle value; if $n$ is even, it is the average of the two middle values.

- **Mode**: The most frequently occurring value in a dataset.

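- **Variance**: The average squared deviation from the mean, measuring spread. The population form is:
  $$
  \text{Variance} = \frac{\sum_{i=1}^n (x_i - \mu)^2}{n}
  $$
  where $\mu$ is the mean. When the data are a sample from a larger population, the sum is usually divided by $n - 1$ instead of $n$ (the sample variance).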

- **Standard Deviation**: The square root of variance, representing spread in the same units as the data:
  $$
  \text{Standard Deviation} = \sqrt{\text{Variance}}
  $$

- **Percentiles and Quartiles**: Values dividing the sorted data into 100 or 4 equal parts, respectively. For example, the 25th percentile (Q1) is the value below which 25% of the data lie.

- **Skewness**: A measure of data asymmetry (positive when the right tail is longer, negative when the left tail is longer):
  $$
  \text{Skewness} = \frac{\sum_{i=1}^n (x_i - \mu)^3}{n \cdot \sigma^3}
  $$
  where $\sigma$ is the standard deviation.

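- **Kurtosis**: A measure of whether data tails are heavy or light compared to a normal distribution. The formula below gives the excess kurtosis, which is 0 for a normal distribution because of the $-3$ term:
  $$
  \text{Kurtosis} = \frac{\sum_{i=1}^n (x_i - \mu)^4}{n \cdot \sigma^4} - 3
  $$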
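
The formulas above can be checked by hand on a small list. The following is a minimal sketch in plain Python; the `scores` values are hypothetical and are not taken from the dataset. It uses the population forms (division by $n$) exactly as written in the definitions.

```python
import math
from collections import Counter

# Hypothetical toy scores, chosen so the arithmetic is easy to follow.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(scores)

# Mean: sum of the values divided by their count.
mean = sum(scores) / n

# Median: middle value of the sorted data (average of the two middle values if n is even).
ordered = sorted(scores)
mid = n // 2
median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

# Mode: the most frequent value (this toy list has a single mode).
mode = Counter(scores).most_common(1)[0][0]

# Population variance and standard deviation (divide by n).
variance = sum((x - mean) ** 2 for x in scores) / n
std_dev = math.sqrt(variance)

# Skewness and excess kurtosis, following the formulas above.
skewness = sum((x - mean) ** 3 for x in scores) / (n * std_dev ** 3)
excess_kurtosis = sum((x - mean) ** 4 for x in scores) / (n * std_dev ** 4) - 3

print(mean, median, mode, variance, std_dev, skewness, excess_kurtosis)
```

For this list the mean is 5.0, the median 4.5, the mode 4, the variance 4.0, and the standard deviation 2.0. The positive skewness reflects the longer right tail created by the values 7 and 9.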


## **Dataset**
We'll use the **"Students Performance in Exams"** dataset available on Kaggle.

**Dataset Link**: [https://www.kaggle.com/spscientist/students-performance-in-exams](https://www.kaggle.com/spscientist/students-performance-in-exams)

**Columns of Interest**:
- **`math score`**: Numerical, representing the score obtained by students in mathematics.
- **`reading score`**: Numerical, representing the score obtained by students in reading.
- **`writing score`**: Numerical, representing the score obtained by students in writing.
- **`gender`**: Categorical, representing the gender of the student (e.g., male, female).
- **`parental level of education`**: Categorical, indicating the highest level of education achieved by the student’s parents.


## **Loading the Dataset**

In [1]:
import pandas as pd
import numpy as np
from scipy.stats import skew, kurtosis
import matplotlib.pyplot as plt
import seaborn as sns


In [10]:
# Load the dataset from a local copy of the Kaggle CSV
data_url = "./datasets/StudentsPerformance.csv"
df = pd.read_csv(data_url)

In [11]:
# Display the first few rows of the dataset
df.head()

| | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |


## **Exercises**

### **Exercise 1: Data Overview**
**Question**: 
What are the data types of the columns? How many unique values are present in each categorical column?
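
One possible starting point, sketched on a small hypothetical frame named `toy` (a stand-in for the `df` loaded earlier, with made-up rows): `dtypes` lists the column types, and `nunique` counts distinct values per column.

```python
import pandas as pd

# Hypothetical stand-in rows; with the real data, use the `df` loaded earlier.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
    "math score": [72, 69, 47, 76],
})

# Data type of every column.
print(toy.dtypes)

# Keep only the non-numeric columns, then count distinct values in each.
categorical = toy.select_dtypes(exclude="number")
unique_counts = categorical.nunique()
print(unique_counts)
```

Selecting with `exclude="number"` rather than a specific text dtype is a design choice here: it avoids depending on how a given pandas version labels string columns.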


### **Exercise 2: Mean, Median, Mode**
**Scenario**: 
A teacher wants to understand how her class performed in the math exam.

**Question**: 
- What are the mean, median, and mode of the `math score`?
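
A minimal sketch of the three measures using pandas, on a hypothetical `math_scores` series rather than the real column. Note that `Series.mode()` returns a Series, because a dataset can have more than one mode.

```python
import pandas as pd

# Hypothetical scores with two equally frequent values (69 and 72).
math_scores = pd.Series([72, 69, 90, 47, 76, 69, 72], name="math score")

mean_score = math_scores.mean()
median_score = math_scores.median()

# mode() returns every value tied for the highest frequency, in sorted order.
modes = math_scores.mode()

print(mean_score, median_score, modes.tolist())
```

With the real data, the same three calls would be applied to `df["math score"]`.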


### **Exercise 3: Variance and Standard Deviation**
**Scenario**: 
The school principal wants to know how consistent students are in their reading scores.

**Question**: 
- Calculate the variance and standard deviation for `reading score`.
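
One detail worth checking before answering: pandas computes the sample variance (dividing by $n - 1$) by default, while the formula in the Key Concepts section divides by $n$. The sketch below, on a hypothetical `reading` series, shows both through the `ddof` argument.

```python
import pandas as pd

# Hypothetical reading scores, not taken from the dataset.
reading = pd.Series([72, 90, 95, 57, 78], name="reading score")

# ddof=0 divides by n (population form, matches the formula in Key Concepts).
population_var = reading.var(ddof=0)
population_std = reading.std(ddof=0)

# The pandas default is ddof=1, which divides by n - 1 (sample form).
sample_var = reading.var()
sample_std = reading.std()

print(population_var, sample_var)
print(population_std, sample_std)
```

On a large dataset the two versions are close, but it helps to state which one is being reported.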


### **Exercise 4: Percentiles and Quartiles**
**Scenario**: 
The school counselor wants to identify the top 25% of students based on their writing scores.

**Questions**: 
- What are the 25th, 50th (median), and 75th percentiles for `writing score`?
- How would you identify students in the top 25%?
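
As a hedged sketch of one approach, `Series.quantile` returns the requested percentiles, and a boolean mask against the 75th percentile selects the top group. The `writing` values below are hypothetical.

```python
import pandas as pd

# Hypothetical writing scores, not taken from the dataset.
writing = pd.Series([74, 88, 93, 44, 75, 78, 92, 39], name="writing score")

# quantile() interpolates linearly between data points by default.
q1, q2, q3 = writing.quantile([0.25, 0.50, 0.75])

# Students at or above the 75th percentile form the top quarter.
top_quarter = writing[writing >= q3]

print(q1, q2, q3)
print(top_quarter.tolist())
```

When many students share the score at the cutoff, `>=` can select somewhat more than 25% of the rows, so the comparison operator is a judgment call.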


### **Exercise 5: Skewness and Kurtosis**
**Scenario**: 
The head of the education board wants to understand the shape of the distribution of students' total scores.

**Questions**: 
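- The dataset has no `total score` column, so first create one as the sum of `math score`, `reading score`, and `writing score`. Then calculate its skewness and kurtosis.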
- Is the distribution symmetric, skewed, or heavy/light-tailed?
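
A minimal sketch, assuming `total score` means the row-wise sum of the three subject scores. The frame `toy` and its values are hypothetical. With their default arguments, `scipy.stats.skew` and `scipy.stats.kurtosis` follow the population formulas from the Key Concepts section, and `kurtosis` returns the excess kurtosis.

```python
import pandas as pd
from scipy.stats import skew, kurtosis

# Hypothetical rows; with the real data, use the `df` loaded earlier.
toy = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})
score_columns = ["math score", "reading score", "writing score"]

# Assumption: the total is the sum of the three subject scores for each student.
toy["total score"] = toy[score_columns].sum(axis=1)

# Defaults: bias=True (population formulas) and fisher=True (normal distribution gives 0).
total_skew = skew(toy["total score"])
total_kurt = kurtosis(toy["total score"])

print(total_skew, total_kurt)
```

The pandas methods `Series.skew()` and `Series.kurt()` apply bias corrections, so their results differ slightly from the SciPy defaults, especially on small samples.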


### **Exercise 6: Visualizations**
**Question**: 
Create visualizations to explore the distribution of `math score`, `reading score`, and `writing score`. Which distribution appears the most skewed?


### **Exercise 7: Gender Analysis**
**Scenario**: 
The school wants to explore if gender impacts performance.

**Questions**: 
- What is the average `math score`, `reading score`, and `writing score` for each gender?
- Which gender performs better on average in each subject?
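
A group-wise mean is one way to approach this. The sketch below uses a hypothetical frame `toy`, so its numbers say nothing about the real dataset. `groupby` splits the rows by category, and `idxmax` reports which group has the highest mean in each column.

```python
import pandas as pd

# Hypothetical rows; with the real data, use the `df` loaded earlier.
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})
score_columns = ["math score", "reading score", "writing score"]

# Mean of each score column within each gender group.
means_by_gender = toy.groupby("gender")[score_columns].mean()

# For each subject, the group label with the highest mean.
best_gender = means_by_gender.idxmax()

print(means_by_gender)
print(best_gender)
```

The same pattern works for any categorical column, such as `parental level of education`.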


### **Exercise 8: Parental Education Analysis**
**Scenario**: 
The school wants to understand the influence of parental education on student performance.

**Questions**: 
- What is the average score in each subject for different levels of parental education?
- Which education level is associated with the highest average scores?


### **Exercise 9: Outlier Detection**
**Question**: 
Are there any outliers in the `math score`? Use a boxplot to visualize and calculate the Interquartile Range (IQR).
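
A sketch of the IQR calculation, assuming the common 1.5 × IQR fence convention for flagging outliers (the exercise itself does not fix a threshold). The `math_scores` values are hypothetical.

```python
import pandas as pd

# Hypothetical scores with one unusually low value.
math_scores = pd.Series([72, 69, 90, 47, 76, 71, 88, 8], name="math score")

q1 = math_scores.quantile(0.25)
q3 = math_scores.quantile(0.75)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = math_scores[(math_scores < lower_fence) | (math_scores > upper_fence)]

print(iqr, lower_fence, upper_fence)
print(outliers.tolist())
```

The same 1.5 factor is the default whisker length in matplotlib and seaborn boxplots, so points flagged here should generally match the points drawn beyond the whiskers.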


### **Exercise 10: Correlation Analysis**
**Question**: 
Is there a correlation between `math score`, `reading score`, and `writing score`? Visualize using a heatmap and calculate correlation coefficients.
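
A minimal sketch of the coefficient calculation on a hypothetical frame `toy`. `DataFrame.corr` computes pairwise Pearson correlations by default, and the resulting matrix can be passed to a heatmap function.

```python
import pandas as pd

# Hypothetical rows; with the real data, use the `df` loaded earlier.
toy = pd.DataFrame({
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})
score_columns = ["math score", "reading score", "writing score"]

# Pairwise Pearson correlation coefficients (the default method).
corr_matrix = toy[score_columns].corr()
print(corr_matrix)

# To visualize, with seaborn imported as sns:
# sns.heatmap(corr_matrix, annot=True)
```

Each coefficient lies between -1 and 1, and the diagonal is always 1 because every column correlates perfectly with itself.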


## **Bonus**
Try to get the overall descriptive statistics with a single function call.
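
One candidate is `DataFrame.describe`, sketched here on a hypothetical frame `toy`. By default it summarizes only the numeric columns.

```python
import pandas as pd

# Hypothetical rows; with the real data, use the `df` loaded earlier.
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male"],
    "math score": [72, 69, 90, 47, 76],
    "reading score": [72, 90, 95, 57, 78],
    "writing score": [74, 88, 93, 44, 75],
})

# Count, mean, std, min, quartiles, and max for each numeric column.
summary = toy.describe()
print(summary)
```

The `std` row uses the sample form (division by $n - 1$). Passing `include="all"` adds summaries for categorical columns as well.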

## **Conclusion**
Descriptive statistics provide valuable insights into datasets by summarizing key properties. By practicing these exercises, you should now be more familiar with fundamental statistical techniques and their applications to real-world data.
