### Pandas - Part 2: Handling Missing Values, Aggregations & Sorting


## 1. Handling Missing Values

### **Concept & Why Required**

Missing values (NaN/NA) can distort analysis and break machine learning models. They can:
- Skew statistical calculations (mean, median)
- Cause errors during model training
- Lead to incorrect conclusions




### **Methods to Handle Missing Values**
- **Imputation**: Fill missing values with mean/median/mode
- **Deletion**: Remove rows/columns with missing values (use cautiously)
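
The two approaches above can be compared on a small toy frame. This is a minimal sketch: the `toy` DataFrame and its values are hypothetical and are not taken from `data_clean.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the tutorial dataset
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, 18.0],
    "Wind": [7.4, 8.0, np.nan, 11.5],
})

# Deletion: drop every row that contains at least one missing value
deleted = toy.dropna()

# Imputation: fill each column's gaps with that column's median
imputed = toy.fillna(toy.median())

print("rows before:", len(toy))
print("rows after deletion:", len(deleted))
print("rows after imputation:", len(imputed))
```

Deletion shrinks the data, while imputation keeps every row but replaces the gaps with an estimate.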

### **When to Use:**
| Method      | When to Use                                                                              |
|-------------|------------------------------------------------------------------------------------------|
| **Mean**    | Numerical data with a roughly normal (symmetric) distribution                            |
| **Median**  | Numerical data with a skewed distribution or outliers                                    |
| **Mode**    | Categorical data (e.g., 'Weather' in our dataset)                                        |
| **Delete**  | Less than 5% missing data and no meaningful pattern (MCAR: Missing Completely At Random) |
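
One way to apply the table to a numeric column is to look at its skewness before choosing a fill value. The sketch below uses a hypothetical Series, and the `abs(skewness) > 1` cut-off is only a common rule of thumb assumed here, not something prescribed by pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed column with one large outlier and one gap
s = pd.Series([10.0, 12.0, 11.0, 13.0, 200.0, np.nan])

missing_share = s.isna().mean()  # fraction of missing entries
skewness = s.skew()              # sample skewness, NaN values are skipped

# Assumed rule of thumb: |skew| > 1 means strongly skewed, so prefer the median
fill_value = s.median() if abs(skewness) > 1 else s.mean()
filled = s.fillna(fill_value)

print("missing share:", missing_share)
print("skewness:", skewness)
print("fill value:", fill_value)
```

Here the outlier pulls the mean far above the typical values, so the median is the safer fill.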



### **Example 1: Loading Data & Finding Missing Values**

In [1]:

import pandas as pd

# Load dataset (replace with your path)
df = pd.read_csv('data_clean.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Ozone,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
0,1,41.0,190.0,7.4,67,5,1,2010,67,S
1,2,36.0,118.0,8.0,72,5,2,2010,72,C
2,3,12.0,149.0,12.6,74,5,3,2010,74,PS
3,4,18.0,313.0,11.5,62,5,4,2010,62,S
4,5,,,14.3,56,5,5,2010,56,S


In [5]:
df.isnull().sum()

Unnamed: 0     0
Ozone         38
Solar.R        7
Wind           0
Temp C         0
Month          0
Day            0
Year           0
Temp           0
Weather        3
dtype: int64

In [3]:
print("Missing Values:\n", df.isnull().sum())

Missing Values:
 Unnamed: 0     0
Ozone         38
Solar.R        7
Wind           0
Temp C         0
Month          0
Day            0
Year           0
Temp           0
Weather        3
dtype: int64
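
Raw counts are easier to judge as a percentage of the rows, for example against the 5% guideline for deletion. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the tutorial dataset
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, np.nan, 18.0],
    "Weather": ["S", "C", None, "S"],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

# isna() returns a boolean frame; the mean of booleans is the share of True
missing_pct = toy.isna().mean() * 100

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```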


### **Example 2: Imputation with Mean/Median/Mode**

In [10]:
z=df['Ozone'].median() # before imputation
print(z)

30.5


In [12]:
# Fill 'Ozone' with median (skewed data)

ozone_median = df['Ozone'].median()
print(ozone_median)


30.5


In [13]:
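# Assign the result back; inplace=True on a selected column triggers a warning in recent pandas versions
df['Ozone'] = df['Ozone'].fillna(ozone_median)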

In [14]:
df['Ozone'] # after imputation

0      41.0
1      36.0
2      12.0
3      18.0
4      30.5
       ... 
153    41.0
154    30.0
155    30.5
156    14.0
157    18.0
Name: Ozone, Length: 158, dtype: float64

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,Ozone,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
0,1,41.0,190.0,7.4,67,5,1,2010,67,S
1,2,36.0,118.0,8.0,72,5,2,2010,72,C
2,3,12.0,149.0,12.6,74,5,3,2010,74,PS
3,4,18.0,313.0,11.5,62,5,4,2010,62,S
4,5,30.5,,14.3,56,5,5,2010,56,S


In [21]:
df['Solar.R'].isnull().sum() # before imputation

7

In [17]:
# Fill 'Solar.R' with mean (assumes a roughly normal distribution)

solar_mean = df['Solar.R'].mean()
print(solar_mean)


185.40397350993376


In [18]:
# Assign the result back instead of using inplace=True on a selected column
df['Solar.R'] = df['Solar.R'].fillna(solar_mean)

In [19]:
df['Solar.R'] # after imputation

0      190.000000
1      118.000000
2      149.000000
3      313.000000
4      185.403974
          ...    
153    190.000000
154    193.000000
155    145.000000
156    191.000000
157    131.000000
Name: Solar.R, Length: 158, dtype: float64

In [22]:
df['Weather'].isna().sum()  # before imputation

3

In [17]:
# Fill 'Weather' with mode (categorical)
weather_mode = df['Weather'].mode()[0]
weather_mode

'S'

In [18]:
# Assign the result back instead of using inplace=True on a selected column
df['Weather'] = df['Weather'].fillna(weather_mode)

In [20]:
df['Weather'].isna().sum()  # after imputation

0

In [28]:
print("\nAfter Imputation:\n", df.head())


After Imputation:
    Unnamed: 0  Ozone     Solar.R  Wind Temp C Month  Day  Year  Temp Weather
0           1   41.0  190.000000   7.4     67     5    1  2010    67       S
1           2   36.0  118.000000   8.0     72     5    2  2010    72       C
2           3   12.0  149.000000  12.6     74     5    3  2010    74      PS
3           4   18.0  313.000000  11.5     62     5    4  2010    62       S
4           5   30.5  185.403974  14.3     56     5    5  2010    56       S
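
The three separate fills above can also be done in one call by passing a dictionary of per-column fill values to `DataFrame.fillna`. A minimal sketch on a hypothetical toy frame that only reuses the column names from the dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that reuses the tutorial's column names
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, 18.0],
    "Solar.R": [190.0, 118.0, np.nan, 313.0],
    "Weather": ["S", None, "S", "C"],
})

# One fill value per column: median, mean, and mode respectively
fill_values = {
    "Ozone": toy["Ozone"].median(),
    "Solar.R": toy["Solar.R"].mean(),
    "Weather": toy["Weather"].mode()[0],
}

# fillna returns a new frame; the original toy frame is left unchanged
filled = toy.fillna(value=fill_values)
print(filled)
```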


### Quiz time

### **Quiz 1**
### Q: Fill missing 'Temp' values with the median and show the first 5 rows


In [22]:

# A:
temp_median = df['Temp'].median()
df['Temp'] = df['Temp'].fillna(temp_median)  # assign back instead of inplace=True
print(df[['Temp']].head())



   Temp
0    67
1    72
2    74
3    62
4    56


## 2. Aggregations (groupby, mean)


### **Concept & Definition**
- **GroupBy**: Split data into groups based on criteria
- **Aggregation**: Compute statistics (mean, sum) for each group
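
The split-then-aggregate idea can be sketched on a few rows. The `toy` frame below is hypothetical; `agg` with a list of function names computes several statistics per group in one pass.

```python
import pandas as pd

# Hypothetical toy data, not the tutorial dataset
toy = pd.DataFrame({
    "Weather": ["S", "C", "S", "PS", "C"],
    "Temp": [67, 72, 74, 62, 56],
})

# Split rows by 'Weather', then compute three statistics of 'Temp' per group
summary = toy.groupby("Weather")["Temp"].agg(["mean", "max", "count"])
print(summary)
```

The result has one row per group and one column per statistic.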



### **Example: Group by 'Weather' and Find Mean Temperature**

In [31]:
weather_group = df.groupby('Weather')

print(weather_group['Temp'].mean())


Weather
C     77.734694
PS    76.872340
S     78.067797
Name: Temp, dtype: float64


### Q: Group by 'Month' and find max 'Wind' speed


In [25]:
month_group = df.groupby('Month')
print(month_group['Wind'].max())

Month
5      20.1
6      20.7
7      14.9
8      15.5
9      16.6
May    12.0
Name: Wind, dtype: float64
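
The separate `May` group in the output above suggests that the 'Month' column mixes numeric labels with a text label, so May rows end up in two groups. A minimal sketch of one possible clean-up, assuming the column is stored as strings and that `'May'` should be treated as month 5 (the toy values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data with the same kind of mixed labels
toy = pd.DataFrame({
    "Month": ["5", "May", "6", "6"],
    "Wind": [20.1, 12.0, 20.7, 9.2],
})

# Replace the text label with its number, then convert to a numeric dtype
month_clean = pd.to_numeric(toy["Month"].replace({"May": "5"}))

# Group on the cleaned column so all May rows fall into one group
result = toy.assign(Month=month_clean).groupby("Month")["Wind"].max()
print(result)
```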


### Quiz time

### Read the file Salaries.csv and display the first 6 records

In [28]:
df_sal = pd.read_csv("Salaries.csv")  # read the CSV file
df_sal.head(6)

Unnamed: 0,rank,discipline,phd,service,sex,salary
0,Prof,B,56,49,Male,186960
1,Prof,A,12,6,Male,93000
2,Prof,A,23,20,Male,110515
3,Prof,A,40,31,Male,131205
4,Prof,B,20,18,Male,104800
5,Prof,A,20,20,Male,122400


### Display the total salary for each rank (AssocProf, AsstProf, Prof)

In [29]:
obj=df_sal.groupby('rank')

In [30]:
obj.sum(numeric_only=True)  # sum only the numeric columns, as in the output below

rank,phd,service,salary
AssocProf,196,147,1193221
AsstProf,96,42,1545893
Prof,1245,985,5686741


In [None]:
### The same result can also be written in a single line, as follows.

In [31]:
df_sal.groupby('rank').sum(numeric_only=True)


rank,phd,service,salary
AssocProf,196,147,1193221
AsstProf,96,42,1545893
Prof,1245,985,5686741


In [None]:
# Quiz

In [None]:
### Write a groupby that displays the maximum value of every column for each sex (gender)

In [32]:

df_sal.groupby("sex").max()


sex,rank,discipline,phd,service,salary
Female,Prof,B,39,36,161101
Male,Prof,B,56,51,186960
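
Note that `max()` works column by column, so each row of the output above combines the largest value of every column and does not describe a single person. When only specific statistics of one column are needed, named aggregation keeps the result focused. A minimal sketch on a hypothetical toy frame (the two larger salaries reuse maxima shown above, the other rows are made up):

```python
import pandas as pd

# Hypothetical toy data that reuses the Salaries.csv column names
toy = pd.DataFrame({
    "rank": ["Prof", "Prof", "AsstProf", "AssocProf"],
    "sex": ["Male", "Female", "Female", "Male"],
    "salary": [186960, 161101, 72500, 103000],
})

# Named aggregation: output_column=(input_column, function)
stats = toy.groupby("sex").agg(
    max_salary=("salary", "max"),
    mean_salary=("salary", "mean"),
    n=("salary", "size"),
)
print(stats)
```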


## 3. Sorting with Pandas

### **Concept & Definition**
- `sort_values()`: Order data by column values
- Key parameters: `by` (column), `ascending` (True/False)
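
Both parameters also accept lists, which allows sorting by several columns with a different direction for each. The sketch below uses a hypothetical toy frame and additionally shows `na_position`, which controls where missing values are placed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the tutorial dataset
toy = pd.DataFrame({
    "Month": [5, 6, 5, 6],
    "Temp": [67.0, np.nan, 74.0, 93.0],
})

# Sort by Month ascending, then by Temp descending within each month;
# rows with a missing Temp go to the end of their month
ordered = toy.sort_values(
    by=["Month", "Temp"],
    ascending=[True, False],
    na_position="last",
).reset_index(drop=True)
print(ordered)
```

`sort_values` returns a new DataFrame, so `toy` itself keeps its original order.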



### **Example: Sort by Temperature (Descending)**

In [34]:
sorted_df = df.sort_values(by='Temp', ascending=False)
sorted_df

Unnamed: 0.1,Unnamed: 0,Ozone,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
119,120,76.0,203.000000,9.7,97,8,28,2010,97,S
121,122,84.0,237.000000,6.3,96,8,30,2010,96,S
120,121,118.0,225.000000,2.3,94,8,29,2010,94,S
122,123,85.0,188.000000,6.3,94,8,31,2010,94,C
41,42,30.5,259.000000,10.9,93,6,11,2010,93,C
...,...,...,...,...,...,...,...,...,...,...
14,15,18.0,65.000000,13.2,58,5,15,2010,58,C
24,25,30.5,66.000000,16.6,57,5,25,2010,57,PS
26,27,30.5,185.403974,8.0,57,5,27,2010,57,PS
17,18,6.0,78.000000,18.4,57,5,18,2010,57,C


In [35]:
print(sorted_df[['Temp', 'Month']].head())

     Temp Month
119    97     8
121    96     8
120    94     8
122    94     8
41     93     6


### Quiz

In [None]:
# Q: Sort by 'Solar.R' in ascending order and show first 3 rows

In [36]:
sorted_solar = df.sort_values(by='Solar.R')
print(sorted_solar[['Solar.R']].head(3))


    Solar.R
81      7.0
20      8.0
27     13.0


## **Summary Cheat Sheet**
| Task                | Code Example                                     |
|---------------------|--------------------------------------------------|
| Find missing values | `df.isnull().sum()`                              |
| Fill with mean      | `df['col'] = df['col'].fillna(df['col'].mean())` |
| Group by & mean     | `df.groupby('col')['target'].mean()`             |
| Sort values         | `df.sort_values(by='col', ascending=False)`      |

