### **Descriptive Statistics**

Before making predictions or finding patterns, we need to understand what the data is telling us. Descriptive statistics is the first step in data analysis. It helps us summarize, explore, and describe the main features of a dataset in a clear way.

In [101]:
"""
Run this cell before continuing
""" 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import skew

In [None]:
"""
Measuring Central Tendency: Mean
""" 

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])
print("The mean of the grades is", grades.mean())

grades_with_outliers = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# TODO: Calculate mean of grades_with_outliers
# TODO: Explain how outliers affect the mean
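
As a sketch of the exercise above, the two means can be compared side by side. The names `mean_clean` and `mean_outlier` are illustrative and not part of the original notebook.

```python
import numpy as np

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])
grades_with_outliers = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# The mean adds up every value, so a single extreme grade pulls it strongly
mean_clean = grades.mean()
mean_outlier = grades_with_outliers.mean()

print("Mean without the outlier:", mean_clean)
print("Mean with the outlier:", mean_outlier)
print("Shift caused by one value:", mean_outlier - mean_clean)
```

Replacing one grade of 10 with 100 moves the mean from 9.0 to 18.0, even though the other nine grades are unchanged.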

In [None]:
"""
Measuring Central Tendency: Median
""" 

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])
print("The median of the grades is", np.median(grades))

grades_with_outliers = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# TODO: Calculate median of grades_with_outliers
# TODO: Explain how outliers affect the median
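
One possible way to work through the exercise above is to compute both medians and compare them. The variable names are assumptions made for this sketch.

```python
import numpy as np

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])
grades_with_outliers = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# The median depends only on the middle of the sorted data,
# so changing the largest value does not move it here
median_clean = np.median(grades)
median_outlier = np.median(grades_with_outliers)

print("Sorted with outlier:", np.sort(grades_with_outliers))
print("Median without the outlier:", median_clean)
print("Median with the outlier:", median_outlier)
```

Both medians are 9.0 for this data, which is why the median is often described as a robust measure of central tendency.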

In [None]:
"""
Measuring Central Tendency: Mode
"""

grades = np.array([2, 10, 10, 10, 10, 10, 10, 10, 10, 10])
print("The mode of the grades is", stats.mode(grades)[0])

# TODO: What happens when the outliers are removed?
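
The question above can be explored with a small NumPy-only sketch. The helper `mode_of` is hypothetical (it is not a NumPy or SciPy function). It counts occurrences with `np.unique` and picks the most frequent value.

```python
import numpy as np

grades = np.array([2, 10, 10, 10, 10, 10, 10, 10, 10, 10])

def mode_of(values):
    """Return the most frequent value; ties resolve to the smallest value."""
    unique_values, counts = np.unique(values, return_counts=True)
    return unique_values[np.argmax(counts)]

# Drop the single low grade and compare
without_outlier = grades[grades != 2]

print("Mode with the outlier:", mode_of(grades))
print("Mode without the outlier:", mode_of(without_outlier))
```

The mode is 10 in both cases, because a value that appears only once cannot outvote the most frequent value.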

In [None]:
"""
Measuring Dispersion: Range
"""

short_range_grades = np.array([9, 10, 9, 9, 9, 9, 10, 10, 9, 10])
print("The mean of the short range grades is", short_range_grades.mean())
print("The range of the short range grades is", (np.max(short_range_grades) - np.min(short_range_grades)))

wide_range_grades = np.array([2, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# TODO: Calculate mean of wide_range_grades
# TODO: Calculate range of wide_range_grades
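
A minimal sketch of the two calculations requested above, using the same arrays. The result names are illustrative.

```python
import numpy as np

short_range_grades = np.array([9, 10, 9, 9, 9, 9, 10, 10, 9, 10])
wide_range_grades = np.array([2, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# Range = largest value minus smallest value
short_range = np.max(short_range_grades) - np.min(short_range_grades)
wide_range = np.max(wide_range_grades) - np.min(wide_range_grades)
wide_mean = wide_range_grades.mean()

print("Short range:", short_range)
print("Wide range:", wide_range)
print("Mean of the wide range grades:", wide_mean)
```

The range uses only the two most extreme values, so it grows from 1 to 98 here even though most grades still sit between 7 and 10.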

In [None]:
"""
Measuring Dispersion: Interquartile Range
"""

grades = np.array([9, 8, 8, 9, 10, 9, 9, 10, 10, 12, 13, 6])
q3 = np.percentile(grades, 75)
q1 = np.percentile(grades, 25)

iqr = q3 - q1
print("The interquartile range of the grades is", iqr)

# Outlier thresholds
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Detect outliers
outliers = grades[(grades < lower_bound) | (grades > upper_bound)]
print("Outliers:", outliers)

# Create a box plot
plt.boxplot(grades)

# Add title and labels
plt.title('Box Plot of Grades')
plt.ylabel('Grades')

# Show plot
plt.show()


# TODO: Identify the outliers
# TODO: Remove the outliers and re-execute the cell
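
The removal step above could be sketched as follows. The helper `iqr_bounds` is hypothetical and simply wraps the same 1.5 × IQR rule used in the cell; plotting is left out to keep the example short.

```python
import numpy as np

grades = np.array([9, 8, 8, 9, 10, 9, 9, 10, 10, 12, 13, 6])

def iqr_bounds(values):
    """Return (lower, upper) outlier thresholds using the 1.5 * IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = iqr_bounds(grades)
is_outlier = (grades < lower) | (grades > upper)

outliers = grades[is_outlier]
cleaned = grades[~is_outlier]

print("Bounds:", lower, upper)
print("Outliers:", outliers)
print("Cleaned grades:", cleaned)

# The thresholds are recomputed on the cleaned data, so check again
new_lower, new_upper = iqr_bounds(cleaned)
remaining = cleaned[(cleaned < new_lower) | (cleaned > new_upper)]
print("Outliers after cleaning:", remaining)
```

With this data the thresholds are 6.875 and 11.875, so 6, 12, and 13 are flagged. After removing them, no further values fall outside the recomputed thresholds.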

In [None]:
"""
Measuring Dispersion: Variance
"""

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])
print("The mean of the grades is", grades.mean())
print("The variance of the grades is", np.var(grades))

grades_with_outliers = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# TODO: Calculate mean of grades_with_outliers
# TODO: Calculate variance of grades_with_outliers
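
A sketch of the comparison asked for above. Note that `np.var` computes the population variance by default (`ddof=0`); the sample variance shown at the end uses `ddof=1`. The variable names are illustrative.

```python
import numpy as np

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])
grades_with_outliers = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 100])

# Variance = average squared distance from the mean
var_clean = np.var(grades)
var_outlier = np.var(grades_with_outliers)

print("Variance without the outlier:", var_clean)
print("Variance with the outlier:", var_outlier)

# Sample variance divides by n - 1 instead of n
sample_var_clean = np.var(grades, ddof=1)
print("Sample variance without the outlier:", sample_var_clean)
```

Because deviations are squared, the single grade of 100 raises the variance from 0.8 to about 747.8, a much larger jump than the one seen in the mean.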

In [None]:
"""
Measuring Dispersion: Standard Deviation
"""

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])
print("The variance of the grades is", np.var(grades))
print("The standard deviation of the grades is", np.std(grades))
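
The standard deviation is the square root of the variance, which puts it back in the same unit as the grades. The sketch below checks that relationship and counts how many grades fall within one standard deviation of the mean; the names `lower`, `upper`, and `within_one_std` are assumptions for illustration.

```python
import numpy as np

grades = np.array([9, 7, 8, 9, 9, 9, 10, 10, 9, 10])

mean = grades.mean()
std = np.std(grades)

# Same value, computed from the variance
std_from_var = np.sqrt(np.var(grades))

# Fraction of grades inside the band [mean - std, mean + std]
lower, upper = mean - std, mean + std
within_one_std = np.mean((grades >= lower) & (grades <= upper))

print("Standard deviation:", std)
print("Square root of variance:", std_from_var)
print("Band:", lower, "to", upper)
print("Fraction of grades within one standard deviation:", within_one_std)
```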

In [None]:
"""
Measuring Asymmetry: Skewness
"""
grades = np.array([1, 2, 5, 2, 3, 3, 4, 4, 5, 20])

# Skewness value
print(skew(grades))

# Histogram to show distribution
plt.hist(grades, bins=10, edgecolor='black')
plt.title("Distribution of Grades")
plt.xlabel("Grade")
plt.ylabel("Frequency")
plt.show()

# TODO: Change the outlier values until the skewness falls within the range [-0.5, 0.5]
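
To see what the skewness value measures, it can be computed by hand from the second and third central moments. The function `skewness` below is a hypothetical helper that implements the biased Fisher-Pearson coefficient, which is what `scipy.stats.skew` computes with its default settings. The `adjusted` array is one possible edit of the data, not the only valid answer.

```python
import numpy as np

def skewness(values):
    """Biased Fisher-Pearson skewness: m3 / m2**1.5 from central moments."""
    values = np.asarray(values, dtype=float)
    deviations = values - values.mean()
    m2 = np.mean(deviations ** 2)  # second central moment (variance)
    m3 = np.mean(deviations ** 3)  # third central moment
    return m3 / m2 ** 1.5

grades = np.array([1, 2, 5, 2, 3, 3, 4, 4, 5, 20])
adjusted = np.array([1, 2, 5, 2, 3, 3, 4, 4, 5, 6])  # hypothetical: 20 replaced by 6

skew_original = skewness(grades)
skew_adjusted = skewness(adjusted)

print("Skewness with the value 20:", skew_original)
print("Skewness after replacing 20 with 6:", skew_adjusted)
```

The long right tail created by the value 20 gives a skewness well above 0.5. Once that value is brought in line with the rest, the distribution is symmetric around its mean and the skewness drops to about 0.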

In [None]:
"""
Data Relationship: Covariance & Correlation
"""

sample_data = {
    'name': ['John', 'Alia', 'Ananya', 'Steve', 'Ben'],
    'age': [22, 24, 26, 28, 30],
    'maturity_score': [60, 70, 80, 85, 90],
    'memorization_score': [90, 85, 75, 65, 60],
    'hobbies_score': [70, 40, 30, 60, 70]
}

df = pd.DataFrame(sample_data)
 
# Calculate covariance
df.cov(numeric_only=True)
# TODO: Covariance depends on the units of each variable, so its magnitude is hard to interpret.
#       Use df.corr(numeric_only=True) to get a unit-free measure between -1 and 1.

# TODO: Visualize the data relationship
# # --- Scatter Plots ---
# plt.figure(figsize=(12, 4))

# plt.subplot(1, 3, 1)
# sns.scatterplot(x='age', y='maturity_score', data=df)
# plt.title('Age vs Maturity')

# plt.subplot(1, 3, 2)
# sns.scatterplot(x='age', y='memorization_score', data=df)
# plt.title('Age vs Memorization')

# plt.subplot(1, 3, 3)
# sns.scatterplot(x='age', y='hobbies_score', data=df)
# plt.title('Age vs Hobbies')

# plt.tight_layout()
# plt.show()

# # --- Correlation Heatmap ---
# plt.figure(figsize=(6, 4))
# corr = df.drop(columns=['name']).corr()
# sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt=".2f", linewidths=0.5, vmin=-1, vmax=1)
# plt.title("Correlation Matrix")
# plt.show()
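
The difference between covariance and correlation can be illustrated by rescaling one column. In the sketch below, `df_months` is a hypothetical copy of the data with age expressed in months instead of years; the `name` column is omitted so that every column is numeric.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [22, 24, 26, 28, 30],
    'maturity_score': [60, 70, 80, 85, 90],
    'memorization_score': [90, 85, 75, 65, 60],
    'hobbies_score': [70, 40, 30, 60, 70],
})

# Hypothetical rescaling: the same ages expressed in months
df_months = df.assign(age=df['age'] * 12)

cov_years = df.cov().loc['age', 'maturity_score']
cov_months = df_months.cov().loc['age', 'maturity_score']
corr_years = df.corr().loc['age', 'maturity_score']
corr_months = df_months.corr().loc['age', 'maturity_score']

print("Covariance (years):", cov_years, "| (months):", cov_months)
print("Correlation (years):", corr_years, "| (months):", corr_months)

# Correlation of age with every column
age_corr = df.corr().loc['age'].round(2)
print(age_corr)
```

The covariance grows by a factor of 12 when the unit changes, while the correlation stays the same. For this data, age correlates at about 0.98 with `maturity_score`, about -0.99 with `memorization_score`, and about 0.17 with `hobbies_score`.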

In [None]:
"""
Class Activity: Understanding the data
"""
# TODO: Convert the data to a DataFrame
# TODO: Compute descriptive statistics to understand the data
# TODO: Calculate and visualize the relationships as in the previous example
# TODO: Draw a conclusion and present it

data = {
    'name': ['John', 'Alia', 'Ananya', 'Steve', 'Ben'],
    'age': [22, 24, 26, 28, 30],
    'math_score': [85, 92, 75, 88, 82],
    'english_score': [90, 85, 70, 75, 78],
    'science_score': [75, 80, 80, 70, 85],
    'communication_score': [88, 80, 85, 90, 85]
}
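
As a possible starting point for the activity, the dictionary can be loaded into a DataFrame and summarized with `describe()`. This is only a sketch of the first two steps; the visualization and the conclusion are left open. The names `summary` and `correlations` are illustrative.

```python
import pandas as pd

data = {
    'name': ['John', 'Alia', 'Ananya', 'Steve', 'Ben'],
    'age': [22, 24, 26, 28, 30],
    'math_score': [85, 92, 75, 88, 82],
    'english_score': [90, 85, 70, 75, 78],
    'science_score': [75, 80, 80, 70, 85],
    'communication_score': [88, 80, 85, 90, 85]
}

df = pd.DataFrame(data)

# Count, mean, standard deviation, min, quartiles, and max per numeric column
summary = df.describe()
print(summary)

# Pairwise correlations between the numeric columns
correlations = df.corr(numeric_only=True).round(2)
print(correlations)
```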

### **Reflection**
Take a moment to reflect on what we've learned so far. What insights have you gained? Write your thoughts in your own words.

(answer here)

### **Exploration**
You now have a basic understanding of the statistical concepts behind data analytics. Next, we'll learn about data wrangling, the crucial preparation step before analysis begins.
- https://pandas.pydata.org/docs/user_guide/merging.html
- https://www.kdnuggets.com/10-essential-data-cleaning-techniques-explained-in-12-minutes