Mean, Median, and Mode

These are three important measures of central tendency in statistics, used to describe the "typical" or "average" value in a dataset.

1. Mean

Definition: The sum of all values in a dataset divided by the number of values.

Calculation:

Mean = (Sum of all values) / (Number of values)

Example: For the dataset {2, 5, 8, 11, 15}, the mean is (2 + 5 + 8 + 11 + 15) / 5 = 41 / 5 = 8.2.
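
The calculation above can be reproduced in a few lines of Python. This is a minimal sketch that uses the standard-library `statistics` module rather than pandas; the variable names are chosen only for illustration.

```python
from statistics import mean

values = [2, 5, 8, 11, 15]

# Manual calculation: sum of the values divided by how many there are
manual_mean = sum(values) / len(values)

# The standard library computes the same quantity
library_mean = mean(values)

print(manual_mean)   # 8.2
print(library_mean)  # 8.2
```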

2. Median

Definition: The middle value when the dataset is arranged in order (ascending or descending).

Calculation:

If the number of values is odd: The median is the middle value.
If the number of values is even: The median is the average of the two middle values.

Example:

For the dataset {2, 5, 8, 11, 15}, the median is 8.
For the dataset {2, 5, 8, 11, 15, 18}, the median is (8 + 11) / 2 = 9.5.
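
The odd and even cases can be sketched as a small function. `median_by_hand` is a hypothetical helper written for this illustration; `statistics.median` from the standard library is shown alongside it as a cross-check.

```python
from statistics import median

def median_by_hand(values):
    """Return the median by sorting and picking the middle value(s)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [2, 5, 8, 11, 15]
even_data = [2, 5, 8, 11, 15, 18]

print(median_by_hand(odd_data))   # 8
print(median_by_hand(even_data))  # 9.5
print(median(odd_data), median(even_data))  # 8 9.5
```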

3. Mode

Definition: The value that appears most frequently in the dataset.

Calculation:

Count how often each value occurs and identify the value(s) with the highest count.

Example:

For the dataset {2, 5, 5, 8, 11, 15}, the mode is 5.

A dataset can have more than one mode (bimodal, trimodal, etc.), or no mode at all if every value occurs with the same frequency.
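
The counting approach, including the multi-mode and no-mode cases, might look like the sketch below. `modes` is a hypothetical helper, and returning an empty list when several distinct values all share the same frequency is an assumption that follows the convention stated above. The standard library's `statistics.multimode` takes a different convention and returns every value in that situation.

```python
from collections import Counter

def modes(values):
    """Return a list of the most frequent values, or [] if there is no mode."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Several distinct values that all occur equally often: treat as "no mode"
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Keep every value that reaches the highest count (handles bimodal data)
    return [value for value, c in counts.items() if c == highest]

print(modes([2, 5, 5, 8, 11, 15]))  # [5]
print(modes([1, 1, 2, 2, 3]))       # [1, 2]
print(modes([1, 2, 3]))             # []
```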

In [1]:
import pandas as pd

# Sample DataFrame
data = {'Department': ['HR', 'IT', 'Sales', 'Marketing', 'HR', 'IT', 'Sales', 'Marketing', 'HR'],
        'Salary': [50000, 60000, 70000, 80000, 55000, 65000, 75000, 85000, 52000],
        'Age': [30, 35, 40, 45, 28, 32, 38, 42, 31]}
df = pd.DataFrame(data)
print(df)

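# Group by Department and apply several aggregations per column
grouped_df = df.groupby('Department').agg({'Salary': ['mean', 'max', 'min'],
                                          'Age': ['mean', 'count']})

# Group by Department and Age; every (Department, Age) pair is unique in this
# sample, so each group holds one row and the means equal the raw values
grouped_df2 = df.groupby(['Department', 'Age']).agg({'Salary': ['mean'], 'Age': ['mean']})

print('✔🧮', grouped_df2)
print(grouped_df)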

  Department  Salary  Age
0         HR   50000   30
1         IT   60000   35
2      Sales   70000   40
3  Marketing   80000   45
4         HR   55000   28
5         IT   65000   32
6      Sales   75000   38
7  Marketing   85000   42
8         HR   52000   31
✔🧮                  Salary   Age
                   mean  mean
Department Age               
HR         28   55000.0  28.0
           30   50000.0  30.0
           31   52000.0  31.0
IT         32   65000.0  32.0
           35   60000.0  35.0
Marketing  42   85000.0  42.0
           45   80000.0  45.0
Sales      38   75000.0  38.0
           40   70000.0  40.0
                  Salary                      Age      
                    mean    max    min       mean count
Department                                             
HR          52333.333333  55000  50000  29.666667     3
IT          62500.000000  65000  60000  33.500000     2
Marketing   82500.000000  85000  80000  43.500000     2
Sales       72500.000000  75000  70000  39.000000     2

Group by Multiple Columns

In [5]:
import pandas as pd

# Sample DataFrame
data = {'Region': ['North', 'South', 'North', 'South', 'West', 'West', 'North','North'],
        'City': ['New York', 'Miami', 'Chicago', 'Houston', 'Los Angeles', 'San Francisco', 'Boston','Boston'],
        'Sales': [1000, 500, 800, 600, 1200, 900, 700,700]}
df = pd.DataFrame(data)

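# Three equivalent ways to sum 'Sales' for each (Region, City) pair
# 1. Select the column and call sum(): returns a Series
grouped_df = df.groupby(['Region', 'City'])['Sales'].sum()
# 2. Pass a list to agg(): returns a DataFrame with a 'sum' column
grouped_df2 = df.groupby(['Region', 'City'])['Sales'].agg(['sum'])
# 3. Named aggregation: returns a DataFrame with a custom column name
grouped_df3 = df.groupby(['Region', 'City']).agg(SumOfSales=('Sales', 'sum'))
print('✔🧮', grouped_df3)
print('✔🚀', grouped_df2)
print(grouped_df)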

✔🧮                       SumOfSales
Region City                     
North  Boston               1400
       Chicago               800
       New York             1000
South  Houston               600
       Miami                 500
West   Los Angeles          1200
       San Francisco         900
✔🚀                        sum
Region City               
North  Boston         1400
       Chicago         800
       New York       1000
South  Houston         600
       Miami           500
West   Los Angeles    1200
       San Francisco   900
Region  City         
North   Boston           1400
        Chicago           800
        New York         1000
South   Houston           600
        Miami             500
West    Los Angeles      1200
        San Francisco     900
Name: Sales, dtype: int64


In [4]:
import pandas as pd

# Sample DataFrame
data = {'AgeGroup': ['18-24', '18-24', '25-34', '25-34', '35-44', '35-44', '18-24'],
        'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
        'AnnualIncome': [30000, 25000, 40000, 35000, 50000, 45000, 28000],
        'CreditScore': [700, 680, 750, 720, 800, 780, 690]}
df = pd.DataFrame(data)

# Group by 'AgeGroup' and 'Gender', and calculate multiple aggregations
grouped_df = df.groupby(['AgeGroup', 'Gender']).agg(
    AvgIncome=('AnnualIncome', 'mean'),
    AvgCreditScore=('CreditScore', 'mean')
)

print(grouped_df)

                 AvgIncome  AvgCreditScore
AgeGroup Gender                           
18-24    Female    25000.0           680.0
         Male      29000.0           695.0
25-34    Female    35000.0           720.0
         Male      40000.0           750.0
35-44    Female    45000.0           780.0
         Male      50000.0           800.0


In [5]:
import pandas as pd

# Sample DataFrame
data = {
    'Department': ['HR', 'IT', 'Sales', 'Marketing', 'HR', 'IT', 'Sales', 'Marketing', 'HR'],
    'Salary': [50000, 60000, 70000, 80000, 55000, 65000, 75000, 85000, 52000],
    'Age': [30, 35, 40, 45, 28, 32, 38, 42, 31],
    'Region': ['North', 'South', 'North', 'South', 'West', 'West', 'North', 'South', 'North'], 
    'City': ['New York', 'Miami', 'Chicago', 'Houston', 'Los Angeles', 'San Francisco', 'Boston', 'Dallas', 'New York'],
    'AgeGroup': ['25-34', '35-44', '35-44', '45-54', '25-34', '35-44', '35-44', '45-54', '25-34'],
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'AnnualIncome': [50000, 60000, 70000, 80000, 55000, 65000, 75000, 85000, 52000],
    'CreditScore': [700, 750, 800, 780, 720, 760, 820, 790, 710],
    'Sales': [1000, 800, 1200, 900, 1500, 1100, 1300, 1000, 900] 
}

df = pd.DataFrame(data)

# Syntax 1: Grouping by 'AgeGroup' and 'Gender' with multiple aggregations
grouped_df1 = df.groupby(['AgeGroup', 'Gender']).agg(
    AvgIncome=('AnnualIncome', 'mean'),
    AvgCreditScore=('CreditScore', 'mean')
)

# Syntax 2: Grouping by 'Department' with multiple aggregations on different columns
grouped_df2 = df.groupby('Department').agg({'Salary': ['mean', 'max', 'min'], 
                                          'Age': ['mean', 'count']})

# Syntax 3: Grouping by 'Region' and 'City' with a single aggregation
grouped_df3 = df.groupby(['Region', 'City'])['Sales'].sum() 

print("Syntax 1:\n", grouped_df1)
print("\nSyntax 2:\n", grouped_df2)
print("\nSyntax 3:\n", grouped_df3)

Syntax 1:
                     AvgIncome  AvgCreditScore
AgeGroup Gender                              
25-34    Male    52333.333333           710.0
35-44    Female  62500.000000           755.0
         Male    72500.000000           810.0
45-54    Female  82500.000000           785.0

Syntax 2:
                   Salary                      Age      
                    mean    max    min       mean count
Department                                             
HR          52333.333333  55000  50000  29.666667     3
IT          62500.000000  65000  60000  33.500000     2
Marketing   82500.000000  85000  80000  43.500000     2
Sales       72500.000000  75000  70000  39.000000     2

Syntax 3:
 Region  City         
North   Boston           1300
        Chicago          1200
        New York         1900
South   Dallas           1000
        Houston           900
        Miami             800
West    Los Angeles      1500
        San Francisco    1100
Name: Sales, dtype: int64
