<a href="https://colab.research.google.com/github/urness/CS167Fall2025/blob/main/Day05_Missing_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CS167: Day05 (part 2)
## Missing Data

#### CS167: Machine Learning, Fall 2025


## Before we get started, let's load in our datasets:
Make sure you change the path to match your Google Drive.
- Also, go ahead and download the `vehicles.csv` file from Blackboard and put it in your Google Drive.

In [None]:
import pandas as pd

# The first step is to mount your Google Drive to your Colab account.
# You will be asked to authorize Colab to access your Google Drive; follow the prompts.

from google.colab import drive
drive.mount('/content/drive')

In [None]:
#import the data:
#make sure the path on the line below corresponds to the path where you put your dataset.

iris_df = pd.read_csv('/content/drive/MyDrive/CS167/datasets/irisData.csv')

titanic_df = pd.read_csv('/content/drive/MyDrive/CS167/datasets/titanic.csv')

# Missing Data:
Most datasets you will work with will not be in perfect shape--you'll need to "clean" the data before you can run any machine learning algorithms on it.

Missing data is a pretty common thing--so much so that there's a special value for missing data: `NaN`, or not a number.
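
One practical consequence is worth seeing in code: `NaN` is a floating-point value that never compares equal to anything, including itself, so `== np.nan` cannot be used to find it. The sketch below uses a small made-up Series (`ages` is a hypothetical name for illustration, not part of the Titanic data):

```python
import numpy as np
import pandas as pd

# Toy data for illustration only (not from the Titanic dataset).
ages = pd.Series([22.0, np.nan, 35.0])

# NaN is a float, and it is not equal to itself.
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = (np.nan == np.nan)

# Comparing with == finds nothing; isna() finds the missing entry.
found_with_equals = (ages == np.nan).sum()
found_with_isna = ages.isna().sum()

print(nan_is_float, nan_equals_itself)
print(found_with_equals, found_with_isna)
```

This is why the rest of the notebook relies on dedicated pandas functions rather than equality checks.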

The steps of cleaning data normally include:
1. Identifying which columns have missing data
2. Determining how much data is missing in each column
3. Deciding what to do with the missing data: drop it, fill it, let it be

Notice, in the `deck` column, there are 3 instances of `NaN` we can see...

But what about the other 800 or so rows? Do we have to go through and find them manually? Gross.

In [None]:
titanic_df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## Step 1: Identify Missing Data

In order to ID missing data, we will use a combination of three pandas functions:
- `isna()`, `notna()`, and `any()`

## Using `isna()` and `notna()` to find missing data:
- `isna()` returns a boolean Series (or DataFrame) that is True wherever the element is `NaN`.
- `notna()` returns a boolean Series (or DataFrame) that is True wherever the element is __not__ `NaN`.
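
As a minimal sketch of how the two functions relate, the example below builds a small made-up Series (`deck` here is toy data, not the real column) and shows that the two masks are opposites, and that a mask can be used to keep only the known values:

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
deck = pd.Series(["C", np.nan, "E", np.nan])

missing_mask = deck.isna()    # True where the value is missing
present_mask = deck.notna()   # True where the value is present

# The two masks are exact opposites of each other.
masks_are_opposites = (missing_mask == ~present_mask).all()

# A boolean mask can be used to keep only the non-missing entries.
known_decks = deck[present_mask]

print(missing_mask.tolist())
print(known_decks.tolist())
```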


Let's call `isna()` on the first 5 rows of Titanic, and see what we get as an output:

In [None]:
titanic_df.loc[:4].isna()
#look at the 'deck' column...

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False


Calling `any()` on the result of `isna()` tells us, for each column, whether it contains at least one missing value:

In [None]:
titanic_df.isna().any()

Unnamed: 0,0
survived,False
pclass,False
sex,False
age,True
sibsp,False
parch,False
fare,False
embarked,True
class,False
who,False
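
When a dataset has many columns, the True/False listing can be long. One possible way to pull out just the names of the affected columns is sketched below on a toy DataFrame (`toy_df` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy DataFrame for illustration only; column names mimic the Titanic data.
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0],
    "fare": [7.25, 71.28, 7.93],
    "deck": [np.nan, "C", np.nan],
})

# One True/False per column: does the column contain any NaN?
has_missing = toy_df.isna().any()

# Use that boolean Series to select just the names of the affected columns.
columns_with_missing = toy_df.columns[has_missing].tolist()

print(columns_with_missing)
```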


## Step 2: How much data is missing?
To decide how to handle our missing data, it's important to know how much missing data each column has.
- If only a small proportion of the rows are missing data, we can choose to drop those rows completely from the dataset.
- However, if most of the rows are missing data for a specific column, maybe it's a sign that we don't need to use that column.

There are multiple ways of doing this, but one of the quickest/easiest is using `value_counts()`



In [None]:
#how many missing values are in the deck column?
titanic_df.deck.value_counts(dropna=False)
#688 missing values

Unnamed: 0_level_0,count
deck,Unnamed: 1_level_1
,688
C,59
B,47
D,33
E,32
A,15
F,13
G,4


In [None]:
titanic_df.age.value_counts(dropna=False)
#177 missing values

Unnamed: 0_level_0,count
age,Unnamed: 1_level_1
,177
24.00,30
22.00,27
18.00,26
28.00,25
...,...
24.50,1
0.67,1
0.42,1
34.50,1


In [None]:
titanic_df.embarked.value_counts(dropna=False)
#2 missing values

Unnamed: 0_level_0,count
embarked,Unnamed: 1_level_1
S,644
C,168
Q,77
,2


In [None]:
titanic_df.embark_town.value_counts(dropna=False)
#2 missing values

Unnamed: 0_level_0,count
embark_town,Unnamed: 1_level_1
Southampton,644
Cherbourg,168
Queenstown,77
,2


So, here are our results:

| **Column**    | **Num Rows Missing** |
|:--------------|----------------------|
| `deck`        | 688                  |
| `age`         | 177                  |
| `embarked`    | 2                    |
| `embark_town` | 2                    |
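
`value_counts()` inspects one column at a time. As an alternative sketch, summing the boolean mask from `isna()` counts the missing values in every column at once, and taking its mean gives the fraction missing. The example uses a made-up `toy_df`, so the numbers are not the Titanic ones:

```python
import numpy as np
import pandas as pd

# Toy DataFrame for illustration only; column names mimic the Titanic data.
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10],
})

# True counts as 1, so summing the boolean mask counts missing values per column.
missing_counts = toy_df.isna().sum()

# Taking the mean instead gives the fraction of rows that are missing.
missing_fraction = toy_df.isna().mean()

print(missing_counts)
print(missing_fraction)
```

The fraction is often the more useful number when deciding between dropping rows and dropping a whole column.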

Now with this new information, it's up to us to decide what to do with these missing values.

## Step 3: Decide how to handle missing data

There are 3 main options here:
- drop the missing data from the dataset (either col or row)
- fill the missing data with a suitable replacement
- let it be and cross our fingers

### Option 1: Drop it using `dropna()`

If there isn't much missing data, and/or you have a very large dataset, dropping the rows that include missing data is a viable option.

In [None]:
print("before: ", titanic_df.shape)
titanic_df.dropna()
print("after: " , titanic_df.shape)

before:  (891, 15)
after:  (891, 15)


**huh... that's weird.** We know that there's missing data, why didn't the shape change?

Pandas is trying to protect you: rather than dropping the rows "in place", it returns a new dataframe with the rows dropped. As written, we're just not saving its return value. There are two ways to fix this:
- save what `dropna()` is returning in a variable (see below)
- add the parameter `inplace=True` to the function call, and it will drop the rows in the original dataset (be careful with this one)

In [None]:
print("before: ", titanic_df.shape)
no_missing_data = titanic_df.dropna()
#titanic_df.dropna(inplace=True)
print("after: " , no_missing_data.shape)

before:  (891, 15)
after:  (182, 15)


`embarked` and `embark_town` don't have many rows missing... let's use `dropna()` to drop them in place:
- the parameter `subset` allows us to provide a list of columns that we want any missing data to be dropped from.

In [None]:
print("before: ", titanic_df.shape)
titanic_df.dropna(inplace=True, subset=['embarked', 'embark_town'])
print("after: ", titanic_df.shape)

before:  (891, 15)
after:  (889, 15)
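
Rows are not the only thing that can be dropped. For a column like `deck`, where most rows are missing, dropping the column itself may be the better choice. The sketch below shows two ways this could be done on a made-up `toy_df`; neither call modifies the original because `inplace` is not set:

```python
import numpy as np
import pandas as pd

# Toy DataFrame for illustration only; column names mimic the Titanic data.
toy_df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0],
    "deck": [np.nan, "C", np.nan, np.nan],
    "embarked": ["S", "C", np.nan, "S"],
})

# Drop one named column outright.
without_deck = toy_df.drop(columns=["deck"])

# Or drop every column that contains at least one NaN (axis=1 means columns).
complete_columns_only = toy_df.dropna(axis=1)

print(without_deck.columns.tolist())
print(complete_columns_only.columns.tolist())
print(toy_df.shape)  # the original is unchanged
```

Note that `dropna(axis=1)` is aggressive: it also removes `embarked`, which is missing only one value, so naming the column explicitly is usually safer.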


### Option 2:  Fill it using `fillna()`

If dropping every row with missing data would leave your dataset too small, consider filling the missing values with something else.

What do you think we should use to fill in the missing data in the `age` column?
- we probably don't want to throw off our statistics...

The `fillna()` function allows `NaN` values to be filled with a given value like so:

In [None]:
## calculate the average age, and any missing age gets filled with this value
print("before: ", titanic_df['age'].isna().any())
age_mean = titanic_df['age'].mean()
titanic_df.fillna({"age":age_mean}, inplace=True) ## this is new!
print("after: ", titanic_df['age'].isna().any())
titanic_df.head(7)

before:  True
after:  False


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,29.699118,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
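
The mean is only one possible fill value. As a hedged sketch of two common alternatives, the example below fills a numeric column with its median (less affected by outliers) and a text column with its most frequent value. `toy_df` and its values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy DataFrame for illustration only; column names mimic the Titanic data.
toy_df = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 100.0],
    "embarked": ["S", "S", np.nan, "C"],
})

# The median is less sensitive to outliers than the mean.
age_median = toy_df["age"].median()

# mode() returns a Series (there can be ties), so take the first entry.
most_common_port = toy_df["embarked"].mode()[0]

# Without inplace=True, fillna returns a new DataFrame and leaves toy_df alone.
filled_df = toy_df.fillna({"age": age_median, "embarked": most_common_port})

print(filled_df)
```

Which replacement is "suitable" depends on the column and on what the model will do with it, so treat these as options to compare rather than a rule.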


### Option 3: Let it be ❄️

What's so bad about missing data? Why do we care if some data is missing?

What happens if we try to do math with `NaN`? Try it out for yourself:

In [None]:
import numpy as np
a = np.nan

In [None]:
#try out some addition/subtraction

In [None]:
#try out some multiplication/division

In [None]:
#what about taking something to the power? (**)

In [None]:
# what happens if you take the average of this list of numbers?
my_series = pd.Series([2,2,3,np.nan,3])
my_series.mean()
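
Once you have tried the cells above, the following sketch is one way to compare plain arithmetic with pandas aggregation. It assumes the default pandas behavior, in which aggregations such as `mean()` skip `NaN` unless `skipna=False` is passed:

```python
import numpy as np
import pandas as pd

my_series = pd.Series([2, 2, 3, np.nan, 3])

# Plain arithmetic with NaN propagates: the result is NaN again.
nan_plus_one = np.nan + 1

# pandas skips NaN by default when aggregating, so this averages 2, 2, 3, 3...
mean_skipping_nan = my_series.mean()

# ...unless told not to, in which case the NaN propagates.
mean_with_nan = my_series.mean(skipna=False)

print(nan_plus_one, mean_skipping_nan, mean_with_nan)
```

Letting missing values be can therefore quietly change what a statistic means: the average above is over four values, not five.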
