#DATAFRAME PART - 3 (LAST PART) -> Missing Values And Example

We've seen a preview of how pandas handles missing values using the None type and NumPy's NaN value. Missing values are common in data cleaning, and they can be there for any number of reasons; I just want to touch on a few of those here.

For instance, if you're running a survey and a respondent didn't answer a question, the missing value is actually an omission. This kind of missing data is called missing at random if there are other variables that might be used to predict the variable which is missing. In my work, when I deliver surveys, I often find that missing data, say interest in being involved in a follow-up study, has some correlation with other data such as gender or ethnicity. If there's no relationship to other variables, then we call the data missing completely at random.

These are just two examples of missing data, and there are many more. For instance, data might be missing because it wasn't collected, either because the process responsible for collecting the data, such as a researcher, made a mistake, or because it wouldn't make sense to collect it. This last case is extremely common when you start joining DataFrames together from multiple sources, such as joining a list of people at a university with a list of offices in the university. Students don't generally have offices, but they're still people at the university.

So let's take a look at some ways of handling missing data in pandas, starting by importing pandas as pd.
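
Before turning to the student records, the join case mentioned above can be sketched with two small tables. The names, roles and office codes below are hypothetical and only illustrate how a left join leaves NaN where one side has no matching row.

```python
import pandas as pd

# Hypothetical data: only staff members have an office.
people = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen"],
    "role": ["staff", "student", "student"],
})
offices = pd.DataFrame({
    "name": ["Asha"],
    "office": ["B-204"],
})

# A left join keeps every person; people without an office get a missing value.
joined = people.merge(offices, on="name", how="left")
print(joined)

# isnull flags the rows where the office is missing.
office_missing = joined["office"].isnull().tolist()
print(office_missing)
```

Here the missing offices are not errors in the data; they simply reflect that the question "which office?" does not apply to students.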

In [1]:
import pandas as pd
df = pd.read_csv('students_records_updated.csv')
df


Unnamed: 0,Name,Roll No,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
0,John Doe,101,Yes,12,,10,55
1,Jane Smith,102,Yes,14,Yes,13,60
2,Michael Johnson,103,,10,,8,52
3,Emily Davis,104,,9,Yes,12,58
4,Chris Lee,105,Yes,11,,9,48
5,Olivia Brown,106,,8,Yes,7,50
6,Sophia Wilson,107,Yes,15,Yes,13,65
7,James Miller,108,,9,,8,55
8,Isabella Anderson,109,Yes,13,Yes,12,62
9,Daniel Thomas,110,,10,,10,53


In [2]:
#Let's use isnull to create a boolean mask of the DataFrame. It checks every cell and marks the missing ones as True.
is_null_mask = df.isnull()
is_null_mask

Unnamed: 0,Name,Roll No,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
0,False,False,False,False,True,False,False
1,False,False,False,False,False,False,False
2,False,False,True,False,True,False,False
3,False,False,True,False,False,False,False
4,False,False,False,False,True,False,False
5,False,False,True,False,False,False,False
6,False,False,False,False,False,False,False
7,False,False,True,False,True,False,False
8,False,False,False,False,False,False,False
9,False,False,True,False,True,False,False
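
A boolean mask like this is usually summarised rather than read cell by cell. The sketch below assumes a small stand-in DataFrame (the first three rows of the table above, rebuilt by hand, since the CSV file itself is not available here) and counts the missing values per column.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the first three student records.
records = pd.DataFrame({
    "Name": ["John Doe", "Jane Smith", "Michael Johnson"],
    "Assignment 1 Submitted": ["Yes", "Yes", np.nan],
    "Assignment 2 Submitted": [np.nan, "Yes", np.nan],
    "Final Exam Marks": [55, 60, 52],
})

mask = records.isnull()

# True counts as 1, so summing the mask gives the missing count per column.
missing_per_column = mask.sum()
print(missing_per_column)

# any(axis=1) is True for each row that has at least one missing value.
rows_with_missing = mask.any(axis=1)
print(records[rows_with_missing])
```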


In [4]:
#Now we can use dropna to drop every row that has at least one missing value.
#Only the assignment columns have missing values here, so what remains are the students who submitted both assignments.
df.dropna()

Unnamed: 0,Name,Roll No,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
1,Jane Smith,102,Yes,14,Yes,13,60
6,Sophia Wilson,107,Yes,15,Yes,13,65
8,Isabella Anderson,109,Yes,13,Yes,12,62
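
Dropping every incomplete row is often stricter than needed. As a minimal sketch, assuming a hand-built stand-in for four of the records above, the `subset` and `how` parameters of dropna narrow down which missing values count.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for four of the student records.
records = pd.DataFrame({
    "Name": ["John Doe", "Jane Smith", "Michael Johnson", "Emily Davis"],
    "Assignment 1 Submitted": ["Yes", "Yes", np.nan, np.nan],
    "Assignment 2 Submitted": [np.nan, "Yes", np.nan, "Yes"],
})

# Keep students who submitted assignment 1, whatever happened with assignment 2.
submitted_first = records.dropna(subset=["Assignment 1 Submitted"])
print(submitted_first)

# how="all" drops a row only if every column in the subset is missing,
# so this keeps students who submitted at least one assignment.
assignment_cols = ["Assignment 1 Submitted", "Assignment 2 Submitted"]
submitted_any = records.dropna(how="all", subset=assignment_cols)
print(submitted_any)
```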


We can also set all the missing values to a single value, called a scalar, with fillna. If you want to change the original DataFrame rather than get a modified copy back, pass inplace=True.

In [3]:
df.fillna("Not Given",inplace=True)
df

Unnamed: 0,Name,Roll No,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
0,John Doe,101,Yes,12,Not Given,10,55
1,Jane Smith,102,Yes,14,Yes,13,60
2,Michael Johnson,103,Not Given,10,Not Given,8,52
3,Emily Davis,104,Not Given,9,Yes,12,58
4,Chris Lee,105,Yes,11,Not Given,9,48
5,Olivia Brown,106,Not Given,8,Yes,7,50
6,Sophia Wilson,107,Yes,15,Yes,13,65
7,James Miller,108,Not Given,9,Not Given,8,55
8,Isabella Anderson,109,Yes,13,Yes,12,62
9,Daniel Thomas,110,Not Given,10,Not Given,10,53
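
A single scalar is not always the right fill for every column. The sketch below assumes a small stand-in DataFrame with one missing mark added for illustration (the real records have no missing marks), and the fill values "No" and 0 are hypothetical choices. It passes a dictionary to fillna so each column gets its own value, and it returns a new DataFrame instead of using inplace.

```python
import numpy as np
import pandas as pd

# Stand-in data; the missing mark for Jane Smith is made up for this example.
records = pd.DataFrame({
    "Name": ["John Doe", "Jane Smith", "Michael Johnson"],
    "Assignment 1 Submitted": ["Yes", "Yes", np.nan],
    "Sessional 1 Marks": [12.0, np.nan, 10.0],
})

# A dict maps each column name to the value used for its missing entries.
filled = records.fillna({
    "Assignment 1 Submitted": "No",
    "Sessional 1 Marks": 0,
})
print(filled)

# Without inplace=True the original DataFrame still has its missing values.
print(records.isnull().sum())
```

Keeping numeric columns numeric matters for later steps: filling a marks column with a string such as "Not Given" would stop it from being summed or averaged.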


In [4]:
#List the columns first. We will set Roll No as the index next; since the roll numbers are already sorted, we don't need to use sort_index.
cols=df.columns
cols

Index(['Name', 'Roll No', 'Assignment 1 Submitted', 'Sessional 1 Marks',
       'Assignment 2 Submitted', 'Sessional 2 Marks', 'Final Exam Marks'],
      dtype='object')

In [6]:
#Set Roll No as the index


In [5]:
df=df.set_index('Roll No')
df

Roll No,Name,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
101,John Doe,Yes,12,Not Given,10,55
102,Jane Smith,Yes,14,Yes,13,60
103,Michael Johnson,Not Given,10,Not Given,8,52
104,Emily Davis,Not Given,9,Yes,12,58
105,Chris Lee,Yes,11,Not Given,9,48
106,Olivia Brown,Not Given,8,Yes,7,50
107,Sophia Wilson,Yes,15,Yes,13,65
108,James Miller,Not Given,9,Not Given,8,55
109,Isabella Anderson,Yes,13,Yes,12,62
110,Daniel Thomas,Not Given,10,Not Given,10,53
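
With Roll No as the index, rows can be looked up by roll number through loc. A minimal sketch, assuming a hand-built stand-in for the first three records:

```python
import pandas as pd

# Hand-built stand-in for the first three student records.
students = pd.DataFrame({
    "Roll No": [101, 102, 103],
    "Name": ["John Doe", "Jane Smith", "Michael Johnson"],
    "Final Exam Marks": [55, 60, 52],
}).set_index("Roll No")

# A single label and a column name return one cell.
name_102 = students.loc[102, "Name"]
print(name_102)

# A list of labels returns the matching rows as a DataFrame.
selected = students.loc[[101, 103]]
print(selected)

# The index is already in increasing order, so sort_index is not needed.
print(students.index.is_monotonic_increasing)
```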


In [12]:
#Let's check the columns that are left before sorting according to sessional marks.
df.columns


Index(['Name', 'Assignment 1 Submitted', 'Sessional 1 Marks',
       'Assignment 2 Submitted', 'Sessional 2 Marks', 'Final Exam Marks'],
      dtype='object')
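
The sort itself can be sketched with sort_values. The example assumes a hand-built stand-in for five of the records above and sorts by Sessional 1 Marks, using Sessional 2 Marks to break ties; choosing descending order is an assumption made for illustration.

```python
import pandas as pd

# Hand-built stand-in for five of the student records.
marks = pd.DataFrame({
    "Name": ["John Doe", "Jane Smith", "Michael Johnson",
             "Emily Davis", "Daniel Thomas"],
    "Sessional 1 Marks": [12, 14, 10, 9, 10],
    "Sessional 2 Marks": [10, 13, 8, 12, 10],
})

# Sort by the first column; rows that tie there are ordered by the second.
ranked = marks.sort_values(
    ["Sessional 1 Marks", "Sessional 2 Marks"], ascending=False
)
print(ranked)
```

Michael Johnson and Daniel Thomas both have 10 in Sessional 1, so their relative order is decided by Sessional 2.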

In [6]:
df.reset_index()

Unnamed: 0,Roll No,Name,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
0,101,John Doe,Yes,12,Not Given,10,55
1,102,Jane Smith,Yes,14,Yes,13,60
2,103,Michael Johnson,Not Given,10,Not Given,8,52
3,104,Emily Davis,Not Given,9,Yes,12,58
4,105,Chris Lee,Yes,11,Not Given,9,48
5,106,Olivia Brown,Not Given,8,Yes,7,50
6,107,Sophia Wilson,Yes,15,Yes,13,65
7,108,James Miller,Not Given,9,Not Given,8,55
8,109,Isabella Anderson,Yes,13,Yes,12,62
9,110,Daniel Thomas,Not Given,10,Not Given,10,53


You can see that Sessional 1 Marks are often the same for two students, so they cannot identify a row uniquely. Instead, we can set Roll No and Name together as the index using multi-level indexing.

In [7]:
df=df.reset_index()
df

Unnamed: 0,Roll No,Name,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
0,101,John Doe,Yes,12,Not Given,10,55
1,102,Jane Smith,Yes,14,Yes,13,60
2,103,Michael Johnson,Not Given,10,Not Given,8,52
3,104,Emily Davis,Not Given,9,Yes,12,58
4,105,Chris Lee,Yes,11,Not Given,9,48
5,106,Olivia Brown,Not Given,8,Yes,7,50
6,107,Sophia Wilson,Yes,15,Yes,13,65
7,108,James Miller,Not Given,9,Not Given,8,55
8,109,Isabella Anderson,Yes,13,Yes,12,62
9,110,Daniel Thomas,Not Given,10,Not Given,10,53


In [8]:
#So now we can set Roll No and Name together as a multi-level index
df=df.set_index(['Roll No','Name'])
df

Roll No,Name,Assignment 1 Submitted,Sessional 1 Marks,Assignment 2 Submitted,Sessional 2 Marks,Final Exam Marks
101,John Doe,Yes,12,Not Given,10,55
102,Jane Smith,Yes,14,Yes,13,60
103,Michael Johnson,Not Given,10,Not Given,8,52
104,Emily Davis,Not Given,9,Yes,12,58
105,Chris Lee,Yes,11,Not Given,9,48
106,Olivia Brown,Not Given,8,Yes,7,50
107,Sophia Wilson,Yes,15,Yes,13,65
108,James Miller,Not Given,9,Not Given,8,55
109,Isabella Anderson,Yes,13,Yes,12,62
110,Daniel Thomas,Not Given,10,Not Given,10,53
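
Rows in a multi-level index are addressed by a tuple with one label per level. A minimal sketch, assuming a hand-built stand-in for the first three records:

```python
import pandas as pd

# Hand-built stand-in for the first three student records.
students = pd.DataFrame({
    "Roll No": [101, 102, 103],
    "Name": ["John Doe", "Jane Smith", "Michael Johnson"],
    "Final Exam Marks": [55, 60, 52],
}).set_index(["Roll No", "Name"])

# A full tuple (one label per level) plus a column name selects one cell.
jane_final = students.loc[(102, "Jane Smith"), "Final Exam Marks"]
print(jane_final)

# A label for the first level alone returns the rows under it,
# indexed by the remaining level (Name).
roll_101 = students.loc[101]
print(roll_101)

# xs selects on an inner level by name, here without knowing the roll number.
by_name = students.xs("Jane Smith", level="Name")
print(by_name)
```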


In [10]:
#Now we will learn how to replace data in a dataframe
grades = pd.DataFrame({'Shaurya Mittal':['A','B+','A','B'],
                       'Sachin Chauhan':['B','A','B','A'],
                       'Madhur Kaushik':['B+','B','B','A'],
                      })
grades

Unnamed: 0,Shaurya Mittal,Sachin Chauhan,Madhur Kaushik
0,A,B,B+
1,B+,A,B
2,A,B,B
3,B,A,A


In [11]:
#Let's replace all B+ with A-. replace returns a new DataFrame, so grades itself is left unchanged.
grades.replace('B+','A-')


Unnamed: 0,Shaurya Mittal,Sachin Chauhan,Madhur Kaushik
0,A,B,A-
1,A-,A,B
2,A,B,B
3,B,A,A


In [12]:
grades

Unnamed: 0,Shaurya Mittal,Sachin Chauhan,Madhur Kaushik
0,A,B,B+
1,B+,A,B
2,A,B,B
3,B,A,A


In [14]:
#Replace two values at once. The two lists are matched pairwise: B+ becomes A- and B becomes B+.
grades.replace(['B+','B'],['A-','B+'])

Unnamed: 0,Shaurya Mittal,Sachin Chauhan,Madhur Kaushik
0,A,B+,A-
1,A-,A,B+
2,A,B+,B+
3,B+,A,A
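
The same pairwise replacement can also be written with a dictionary, which keeps each old value next to its new one. The sketch below rebuilds the grades DataFrame from above. The replacements are applied in one pass, so a B+ that becomes A- is not touched again, and a B that becomes B+ is not then turned into A-.

```python
import pandas as pd

grades = pd.DataFrame({
    "Shaurya Mittal": ["A", "B+", "A", "B"],
    "Sachin Chauhan": ["B", "A", "B", "A"],
    "Madhur Kaushik": ["B+", "B", "B", "A"],
})

# Keys are the values to find, dict values are what they are replaced with.
updated = grades.replace({"B+": "A-", "B": "B+"})
print(updated)

# replace returns a new DataFrame; assign the result to keep the change.
print(grades)
```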


#END OF DATAFRAME PART - 3