# Pandas Data Summary

## What is Data Summary?

A data summary is a concise statistical overview of a dataset. It reports measures of central tendency (mean, median, mode) and of variability (range, variance, standard deviation), which help users understand how the data is distributed, identify patterns, and make informed decisions.

Data Summary is commonly used in various fields, including business, finance, healthcare, and social sciences, to analyze and interpret data from surveys, experiments, and other sources.

- **Mean** : The average value of a set of numbers. It is calculated by adding up all the numbers and then dividing by the number of values in the set.

- **Median** : The middle value of a set of numbers when they are arranged in order. If there is an even number of values, the median is the average of the two middle values.

- **Mode** : The value that appears most frequently in a set of numbers.

- **Standard Deviation** : 
    - A measure of how spread out the values in a set of numbers are. It is the square root of the variance, so it is expressed in the same units as the data.

    - A low standard deviation means that the values are close to the mean, while a high standard deviation means that they are far from it.

- **Variance** : The average of the squared differences between each value and the mean. Like the standard deviation, it measures how spread out the values are. Note that pandas `var()` and `std()` compute the sample versions by default (`ddof=1`), dividing by n - 1 rather than n.
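The definitions above can be checked by hand against pandas. The following is a minimal sketch, assuming a small illustrative Series of five scores; the variable names (`scores`, `population_var`, `sample_var`) are chosen for this example only.

```python
import math
import pandas as pd

# Five scores used purely for illustration
scores = pd.Series([85, 92, 78, 88, 95])
n = len(scores)

# Mean: the sum of the values divided by how many there are
mean = scores.sum() / n

# Variance: the average squared difference from the mean
squared_diffs = (scores - mean) ** 2
population_var = squared_diffs.sum() / n      # divides by n
sample_var = squared_diffs.sum() / (n - 1)    # divides by n - 1

# Standard deviation: the square root of the variance
sample_std = math.sqrt(sample_var)

# pandas uses the sample versions by default (ddof=1)
print(mean, scores.mean())
print(sample_var, scores.var())
print(population_var, scores.var(ddof=0))
print(sample_std, scores.std())

# Median and mode come straight from pandas; mode() returns a Series
# because several values can tie for most frequent
print(scores.median())
print(pd.Series([1, 2, 2, 3]).mode().tolist())
```

Each `print` shows the hand-computed value next to the pandas result, so the two columns should agree. Passing `ddof=0` switches pandas to the population formula.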

## Properties of Data Summary

- Simple
- Insightful
- Comparative
- Informative

## Creating Data Summaries with Pandas

In [1]:
# Step 1 : Importing Pandas
import pandas as pd

In [3]:
# Step 2 : Loading Data
data = {
    'Name': ['Aarav', 'Isha', 'Rahul', 'Sneha', 'Ananya'],
    'Math': [85, 92, 78, 88, 95],
    'Science': [89, 94, 82, 91, 87],
    'English': [78, 85, 88, 90, 92]
}
students_df = pd.DataFrame(data)
students_df

| | Name | Math | Science | English |
|---|---|---|---|---|
| 0 | Aarav | 85 | 89 | 78 |
| 1 | Isha | 92 | 94 | 85 |
| 2 | Rahul | 78 | 82 | 88 |
| 3 | Sneha | 88 | 91 | 90 |
| 4 | Ananya | 95 | 87 | 92 |


In [5]:
# Step 3 : Descriptive Statistics
summary = students_df.describe()
summary

| | Math | Science | English |
|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 |
| mean | 87.6 | 88.6 | 86.6 |
| std | 6.580274 | 4.505552 | 5.458938 |
| min | 78.0 | 82.0 | 78.0 |
| 25% | 85.0 | 87.0 | 85.0 |
| 50% | 88.0 | 89.0 | 88.0 |
| 75% | 92.0 | 91.0 | 90.0 |
| max | 95.0 | 94.0 | 92.0 |
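By default `describe()` reports only the numeric columns, which is why `Name` does not appear in the output. The sketch below shows two common variations: describing a text column on its own and requesting custom percentiles. The variable names `text_summary` and `decile_summary` are illustrative.

```python
import pandas as pd

students_df = pd.DataFrame({
    'Name': ['Aarav', 'Isha', 'Rahul', 'Sneha', 'Ananya'],
    'Math': [85, 92, 78, 88, 95],
    'Science': [89, 94, 82, 91, 87],
    'English': [78, 85, 88, 90, 92]
})

# Default behaviour: numeric columns only, so 'Name' is skipped
numeric_summary = students_df.describe()
print(numeric_summary.columns.tolist())

# A text column gets count, unique, top, and freq instead
text_summary = students_df['Name'].describe()
print(text_summary)

# Custom percentiles replace the default 25% / 50% / 75% rows
decile_summary = students_df.describe(percentiles=[0.1, 0.5, 0.9])
print(decile_summary)
```

The `50%` row is the median, so it matches what `median()` returns for each column.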


In [6]:
# Step 4 : Custom Aggregations
custom_summary = students_df.agg({
    'Math': ['mean', 'median'],
    'Science': ['mean', 'median'],
    'English': ['mean', 'median']
})
custom_summary

| | Math | Science | English |
|---|---|---|---|
| mean | 87.6 | 88.6 | 86.6 |
| median | 88.0 | 89.0 | 88.0 |
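`agg()` is not limited to the same functions for every column. As a rough sketch, the example below applies one list of functions to all score columns, derives a range row from the result, and then requests different statistics per column. The names `stats` and `mixed` are illustrative.

```python
import pandas as pd

students_df = pd.DataFrame({
    'Name': ['Aarav', 'Isha', 'Rahul', 'Sneha', 'Ananya'],
    'Math': [85, 92, 78, 88, 95],
    'Science': [89, 94, 82, 91, 87],
    'English': [78, 85, 88, 90, 92]
})
scores = students_df[['Math', 'Science', 'English']]

# One list of functions applied to every selected column
stats = scores.agg(['min', 'max', 'mean', 'median'])

# Derive the range (max - min) from the rows already computed
stats.loc['range'] = stats.loc['max'] - stats.loc['min']
print(stats)

# Different functions per column; cells that were not requested are NaN
mixed = students_df.agg({
    'Math': ['min', 'max'],
    'English': ['mean']
})
print(mixed)
```

Selecting the score columns first keeps `Name` out of the numeric aggregations.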


In [11]:
# Step 5 : Grouped Summaries
students_df['Performance'] = ['Good', 'Excellent', 'Average', 'Good', 'Excellent']
grouped_summary = students_df.groupby('Performance').mean(numeric_only=True)
grouped_summary

| Performance | Math | Science | English |
|---|---|---|---|
| Average | 78.0 | 82.0 | 88.0 |
| Excellent | 93.5 | 90.5 | 88.5 |
| Good | 86.5 | 90.0 | 84.0 |
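A grouped mean hides how many rows each group contains, and with only five students a group can be a single row. The following sketch, under the same data, adds group sizes and several statistics per group. The output column names `math_mean` and `best_science` are hypothetical labels chosen for this example.

```python
import pandas as pd

students_df = pd.DataFrame({
    'Name': ['Aarav', 'Isha', 'Rahul', 'Sneha', 'Ananya'],
    'Math': [85, 92, 78, 88, 95],
    'Science': [89, 94, 82, 91, 87],
    'English': [78, 85, 88, 90, 92]
})
students_df['Performance'] = ['Good', 'Excellent', 'Average', 'Good', 'Excellent']

grouped = students_df.groupby('Performance')

# Number of students in each group
group_sizes = grouped.size()
print(group_sizes)

# Several statistics for one column
math_by_group = grouped['Math'].agg(['count', 'mean', 'min', 'max'])
print(math_by_group)

# Named aggregation: new_column=(source_column, function)
named = grouped.agg(
    math_mean=('Math', 'mean'),
    best_science=('Science', 'max'),
)
print(named)
```

Here the `Average` group holds one student, so its mean, minimum, and maximum are all the same value.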


## Attributes and Methods of Pandas for Data Summary

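- `mean()`: arithmetic mean of the values
- `median()`: middle value of the sorted values
- `mode()`: most frequent value or values; every tied value is returned
- `std()`: sample standard deviation (`ddof=1` by default)
- `var()`: sample variance (`ddof=1` by default)
- `sum()`: total of the values
- `min()` and `max()`: smallest and largest values
- `count()`: number of non-missing values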
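These methods work on a whole DataFrame as well as on a single column, and the `axis` argument controls the direction of the summary. A brief sketch, assuming the same exam data; the `Total` and `Average` column names are illustrative.

```python
import pandas as pd

students_df = pd.DataFrame({
    'Name': ['Aarav', 'Isha', 'Rahul', 'Sneha', 'Ananya'],
    'Math': [85, 92, 78, 88, 95],
    'Science': [89, 94, 82, 91, 87],
    'English': [78, 85, 88, 90, 92]
})
scores = students_df[['Math', 'Science', 'English']]

# Default axis=0: one value per column (per subject)
print(scores.mean())
print(scores.min())
print(scores.max())
print(scores.count())

# mode() returns a DataFrame because a column can have several modes
print(scores.mode())

# axis=1: one value per row (per student)
students_df['Total'] = scores.sum(axis=1)
students_df['Average'] = scores.mean(axis=1)
print(students_df[['Name', 'Total', 'Average']])
```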

## Example : Analyzing Exam Scores

In [12]:
import pandas as pd

In [14]:
data = {
    'Name': ['Aarav', 'Isha', 'Rahul', 'Sneha', 'Ananya'],
    'Math': [85, 92, 78, 88, 95],
    'Science': [89, 94, 82, 91, 87],
    'English': [78, 85, 88, 90, 92]
}
students_df = pd.DataFrame(data)
students_df

| | Name | Math | Science | English |
|---|---|---|---|---|
| 0 | Aarav | 85 | 89 | 78 |
| 1 | Isha | 92 | 94 | 85 |
| 2 | Rahul | 78 | 82 | 88 |
| 3 | Sneha | 88 | 91 | 90 |
| 4 | Ananya | 95 | 87 | 92 |


In [15]:
# Descriptive statistics
summary = students_df.describe()
# Print the label on its own line so the column headers stay aligned
print(f"Descriptive Statistics:\n{summary}")

Descriptive Statistics:
            Math    Science    English
count   5.000000   5.000000   5.000000
mean   87.600000  88.600000  86.600000
std     6.580274   4.505552   5.458938
min    78.000000  82.000000  78.000000
25%    85.000000  87.000000  85.000000
50%    88.000000  89.000000  88.000000
75%    92.000000  91.000000  90.000000
max    95.000000  94.000000  92.000000


In [16]:
# Custom aggregations
custom_summary = students_df.agg({
    'Math': ['mean', 'median'],
    'Science': ['mean', 'median'],
    'English': ['mean', 'median']
})
# Print the label on its own line so the column headers stay aligned
print(f"Custom Aggregations:\n{custom_summary}")

Custom Aggregations:
        Math  Science  English
mean    87.6     88.6     86.6
median  88.0     89.0     88.0


In [17]:
# Grouped summary
students_df['Performance'] = ['Good', 'Excellent', 'Average', 'Good', 'Excellent']
grouped_summary = students_df.groupby('Performance').mean(numeric_only=True)
# Print the label on its own line so the column headers stay aligned
print(f"Grouped Summary by Performance:\n{grouped_summary}")

Grouped Summary by Performance:
             Math  Science  English
Performance                        
Average      78.0     82.0     88.0
Excellent    93.5     90.5     88.5
Good         86.5     90.0     84.0


## Applications of Data Summary

- Data Preprocessing
- Feature Engineering
- Model Evaluation
- Decision Making
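As one possible illustration of the first two applications, summary statistics can fill in missing values and rescale a column. The sketch below is not part of the exam example: the missing score is invented for the illustration, and the names `math_scores`, `filled`, and `z_scores` are hypothetical.

```python
import pandas as pd

# Hypothetical scores with one missing entry, invented for this illustration
math_scores = pd.Series([85, None, 78, 88, 95], name='Math')

# Data preprocessing: fill the gap with the median of the observed values
median_score = math_scores.median()   # missing values are skipped by default
filled = math_scores.fillna(median_score)
print(filled.tolist())

# Feature engineering: rescale to z-scores using the mean and standard deviation
z_scores = (filled - filled.mean()) / filled.std()
print(z_scores.round(3).tolist())
```

After standardization the column has a mean of approximately 0 and a sample standard deviation of 1, which puts features measured on different scales on a comparable footing.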