# Assignment 30: Pandas Data Cleaning #

### Goals for this Assignment ###

By the time you have completed this assignment, you should be able to:

- Use the `isna` method to determine if a given datum is missing
- Use the `dropna` method to remove rows and/or columns containing missing data
- Use the `drop_duplicates` method to remove possibly partially-duplicated rows
- Use the `replace` method to replace values with other values

## Step 1: Use the `isna` Method to Find Missing Data ##

### Background: Data Cleaning ###

For analysis purposes, a data set should ideally be complete.
However, raw data has a tendency to be imperfect, especially if any part of the data entry is performed manually.
For example:

- Individual cells might not contain a value.  This could happen for a number of reasons: temporary disconnection from a sensor, failing data storage, someone just plain forgot to include it, etc.
- Whole rows or columns may be missing values, in the same manner as the prior bullet.  Perhaps the data is incomplete, and some experiment was started but never finished.  Perhaps there were some placeholders that were intended to be filled-in later, but we never got around to it, or something happened where they couldn't be filled-in.
- Conversely, data may be duplicated in a data set.  For example, if the data was represented originally as a spreadsheet, someone may have been moving data around and accidentally did two pastes after a single copy.  Especially on large data sets, if there is any sort of manual transcription required, it's easy to accidentally redo some work; this is especially true considering that redoing work just means lost time, whereas forgetting to do something means lost data.
- There may be contradictory values in the data set which are not internally consistent.  For example, consider a data set containing customer information, including both the city and country the customer resides in.  If the listed city was New York City, but the listed country was Portugal, this would not be internally consistent.
- Building on the prior point, there could be dramatic outliers, to the point where they seem more likely to be errors than actual data points.  For example, if a data set records details of how employees commute to the office, it would be more than a little odd if someone says they commute by foot, live 7 miles away, and it takes them 30 minutes to get to work.  This isn't impossible, but the pacing rivals Olympic-level athletes.  It's far more likely that the employee is overreporting distance, underreporting time, or indicated the wrong means of transportation.
- There is a strange pattern in the data, where individual data points make sense, but as a collection they don't.  While this could indicate a meaningful pattern, it might also be due to what ultimately amounts to an error.  For example, in many locations it would not be strange to occasionally see a wind speed sensor read 0 mph.  However, a long string of 0 mph readings more likely indicates that the sensor got stuck, especially if you only see 0 mph readings after a certain point.  Faulty sensors can lead to all sorts of strange, bogus readings, and detecting failure may not be straightforward.

A data set that is free of these sorts of problems is said to be _clean_.
Similarly, the process of transforming a data set to be rid of these problems is referred to as _data cleanup_.
In practice, [a significant amount of time is spent on data cleanup](https://blog.ldodds.com/2020/01/31/do-data-scientists-spend-80-of-their-time-cleaning-data-turns-out-no/), though the amount of time varies wildly depending on the particular data set (that source, which cites many other sources, ranges from 15% to 90% of the time on data cleanup).
A big chunk of this time is spent in understanding what the data is showing, identifying issues, and devising strategies and solutions for fixing these issues.

Fixing problems can be as simple as just throwing out entire rows, or perhaps even entire columns, when a problem is found involving a given row or column.
However, this could lead to an unacceptable amount of data being discarded.
Furthermore, if there is reason to believe that data which would be discarded is somehow different from other data beyond just the issue, this would lead to a systematic bias getting thrown into the analysis.
For example, for the situation where the one employee had an Olympic-level running pace, a potential solution is to drop all rows where employees reported they commute by foot.
However, this not only discards data, but it systematically discards only employees who commute by foot.
For example, if we wanted to determine the average employee distance to the office, discarding everyone who commutes by foot would very likely increase the average distance to the office, perhaps significantly so if many people commute by foot.
All this to say: data cleaning is not always straightforward, and we need to be cognizant of the analyses we plan to perform.
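
To make the bias concrete, here is a minimal sketch using a small, entirely hypothetical commute table (the `commutes` name and every value in it are invented for illustration, not taken from any real data set).
It compares the average distance over all employees against the average after discarding everyone who commutes by foot.

```python
import pandas as pd

# Hypothetical commute data, invented purely for illustration.
# The 7.0-mile walker plays the role of the suspicious outlier.
commutes = pd.DataFrame({"mode": ["foot", "foot", "foot", "car", "car", "bus"],
                         "miles": [0.5, 1.0, 7.0, 10.0, 12.0, 6.0]})

# Average distance using every row.
mean_all = commutes["miles"].mean()

# Average distance after dropping every employee who commutes by foot,
# including the two whose data looked perfectly reasonable.
mean_without_foot = commutes[commutes["mode"] != "foot"]["miles"].mean()

print(mean_all)           # roughly 6.08
print(mean_without_foot)  # roughly 9.33
```

In this made-up data, removing one suspicious row by discarding a whole category raises the average by more than three miles.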

> From my own experiences, my [MS thesis](https://repository.rit.edu/theses/4078/) dealt with a primarily manually-entered data set.
> This data set had originally been represented as an Excel spreadsheet spanning 10 columns and 4,348 rows, and had been edited over the course of several years by a multitude of people.
> Missing cells were common, as were internal inconsistencies, along with partial or complete duplications of rows.
> I worked with this data set for multiple years and spent approximately one year cleaning it, though in my case I got a bit distracted and built general-purpose tools for data cleanup instead of focusing solely on the task at hand.
> Looking back on it now, Pandas would have dramatically simplified this, and would have likely made the programming side of the problem so easy that I couldn't justify building my own tooling.
> This effort started in 2010, and Pandas supposedly had been open source for two years at that point, but I didn't know about it and no one I knew was talking about it.
> Python similarly was nowhere near as popular as it is now, and I even knew people who actively discouraged learning it because they thought it was a fad.

### Background: Missing Data in Pandas ###

In Pandas, missing data can be represented in different ways depending on the type of the column.
[`NaN` (Not a Number)](https://en.wikipedia.org/wiki/NaN) is the representation used for a number of column types, including floating-point columns.
To see this in action, consider the next cell, which was taken from step 3 of assignment 28.
In this case, an outer join of `hardware_inventory` and `hardware_purchases` was made on the `"product"` column.
However, the product `"jigsaw"` in `hardware_inventory` had no corresponding entries in `hardware_purchases`, and similarly the product `"table saw"` in `hardware_purchases` had no corresponding entries in `hardware_inventory`.
This led to `NaN` being used in some entries in the table.

In [1]:
import pandas as pd

hardware_inventory = pd.DataFrame({"product": ["hammer", "wrench", "screws (20)", "jigsaw"],
                                   "price": [20, 30, 2.5, 40],
                                   "quantity": [19, 15, 150, 12],
                                   "location": ["tools", "tools", "hardware", "tools"]})
hardware_purchases = pd.DataFrame({"customer_name": ["alice", "alice", "bob", "joe", "carl"],
                                   "product": ["hammer", "wrench", "screws (20)", "hammer", "table saw"],
                                   "when": ["1/2/2025", "1/2/2025", "3/4/2025", "5/6/2025", "5/12/2015"]})

combined_hardware = hardware_inventory.merge(hardware_purchases, on="product", how="outer")
print(combined_hardware)

       product  price  quantity  location customer_name       when
0       hammer   20.0      19.0     tools         alice   1/2/2025
1       hammer   20.0      19.0     tools           joe   5/6/2025
2       jigsaw   40.0      12.0     tools           NaN        NaN
3  screws (20)    2.5     150.0  hardware           bob   3/4/2025
4    table saw    NaN       NaN       NaN          carl  5/12/2015
5       wrench   30.0      15.0     tools         alice   1/2/2025


In Pandas, the preferred way to check if some data is missing is with the `isna` method on `Series` objects.
`isna` yields a new `Series` object of the same length, containing `True` for each entry that was missing and `False` for every other entry.
Importantly, `isna` understands all the different representations of missing data for all types in Pandas, not just `NaN`; this means we can apply `isna` to any `Series` object no matter what its `dtype` is, and `isna` will work consistently.
We can see this in the cell below.

In [2]:
print(combined_hardware["price"].isna())

0    False
1    False
2    False
3    False
4     True
5    False
Name: price, dtype: bool


As shown above, row `4` contains `True`, which follows from the fact that the `"price"` column contained a `NaN` at this position.
In contrast, all other positions had actual values, and therefore `isna` returned `False` for those positions.

As another example, let's look at the `"customer_name"` column:

In [3]:
print(combined_hardware["customer_name"].isna())

0    False
1    False
2     True
3    False
4    False
5    False
Name: customer_name, dtype: bool


Here, row `2` is `True`, which follows from the fact that `"customer_name"` has `NaN` at this position.
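
As a rough sketch of `isna` working across different representations of missing data, the next example builds three small `Series` objects.
The variable names and values are hypothetical; the point is only that `NaN`, `None`, and `NaT` (the missing-value marker for dates and times) are all reported as missing.

```python
import numpy as np
import pandas as pd

# Three hypothetical Series, each using a different missing-value representation.
floats = pd.Series([1.5, np.nan, 3.0])                   # NaN in a float column
names = pd.Series(["alice", None, "bob"])                # None in a column of strings
dates = pd.Series(pd.to_datetime(["2025-01-02", None]))  # NaT in a datetime column

print(floats.isna().tolist())  # [False, True, False]
print(names.isna().tolist())   # [False, True, False]
print(dates.isna().tolist())   # [False, True]

# True counts as 1 when summed, so this counts the missing entries.
print(floats.isna().sum())     # 1
```

Because the result of `isna` is a Boolean `Series`, summing it is a quick way to count how many entries are missing.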

### Background (and Warning): `NaN` is Weird ###

You may (rightfully!) think that you could do what `isna` is doing above yourself, at least specifically for checking for `NaN`.
For example, it seems like you should be able to use the vector operation `==` with `NaN` to gather the same information (i.e., `some_series == NaN`).
Here comes the first weird caveat: to do this comparison, you need an instance of `NaN`, and that isn't as straightforward as you might think.
One way to do this is by using `float` to parse the _string_ `"NaN"`, as with:

In [4]:
print(float("NaN")) # prints nan, a floating-point value

nan


Another way (and the way we will use here, moving forward) is to use `nan`'s definition from NumPy, like so:

In [5]:
import numpy as np

print(np.nan)

nan


Now that we have a `NaN` instance, we can look for it as desired.
The next cell effectively tries to do `combined_hardware["price"].isna()`, but without using `isna`:

In [6]:
print(combined_hardware["price"] == np.nan)

0    False
1    False
2    False
3    False
4    False
5    False
Name: price, dtype: bool


Look specifically at row `4`: it's now `False`, not `True`, _even though_ this row contains `NaN`!
This weird behavior follows from how equality over `NaN` works, as shown in the next cell:

In [7]:
print(np.nan == np.nan) # prints False
print(np.nan != np.nan) # prints True

False
True


That is, counter to almost everything imaginable, `NaN` is **not** equal to itself.
This is not a Pandas, NumPy, or Python bug, but rather a consequence of the [floating point standard itself](https://en.wikipedia.org/wiki/IEEE_754).
That is, the standard itself says that `NaN` is not equal to itself, and anything that follows the standard must follow this same behavior; it'd be a bug to say that `NaN` is equal to itself.

Using NumPy, the correct way to check for `NaN` is the `isnan` function, like so:

In [8]:
print(np.isnan(np.nan)) # prints True
print(np.isnan(3.14)) # prints False

True
False


If you are specifically checking for missing data, it is **strongly** recommended to use the `isna` method as opposed to trying to check for `NaN` yourself.
Not only will `isna` work correctly for `NaN` without extra work, it will also work for other data types which use representations other than `NaN` to represent missing data.

> General warning: if you ever need to check if two floating point numbers are equal, you should be cognizant of this behavior with `NaN`.
> Something as innocent-looking as `a == b` won't do what you expect if `a` and `b` are both `NaN`.
> This also applies if you have a data structure that contains a floating-point value, and you want to compare that data structure itself to another data structure containing floating-point values.
> (I've lost multiple hours of my life to this issue before.)
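
If you do need an equality check that treats two `NaN` values as equal, one possible approach is sketched below.
The `floats_equal` helper is hypothetical (it is not part of Python, NumPy, or Pandas); the `equals` method on `Series` objects, on the other hand, is provided by Pandas and considers `NaN` values in matching positions to be equal.

```python
import math

import numpy as np
import pandas as pd

def floats_equal(a, b):
    """Hypothetical helper: like a == b, but two NaN values count as equal."""
    return a == b or (math.isnan(a) and math.isnan(b))

print(floats_equal(np.nan, np.nan))  # True
print(floats_equal(1.5, 1.5))        # True
print(floats_equal(1.5, np.nan))     # False

left = pd.Series([1.0, np.nan])
right = pd.Series([1.0, np.nan])

# Element-wise == compares the NaN positions as unequal...
print((left == right).all())  # False
# ...whereas equals treats NaN values in the same position as equal.
print(left.equals(right))     # True
```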

### Try this Yourself ###

The next cell defines a `Series` object bound to variable `has_missing` which contains some missing values (with `np.nan`).
Use `isna` along with `print` to print out which elements are missing (or not).

In [28]:
has_missing = pd.Series([3.14, 1.44, 2.9, np.nan, 9.0, np.nan, 4.5],
                        index=["a", "b", "c", "d", "e", "f", "g"])

# Call and print the result of isna.
print(has_missing.isna())

# This should print the following:
# a    False
# b    False
# c    False
# d     True
# e    False
# f     True
# g    False
# dtype: bool

a    False
b    False
c    False
d     True
e    False
f     True
g    False
dtype: bool


## Step 2: Use `dropna` to Remove Rows and/or Columns with Missing Data ##

### Background: Logical Not, And, and Or over `Series` Objects ###

We've previously seen Python's `not`, `and`, and `or`, which compute logical NOT, AND, and OR, respectively.
A quick recap of these operations is in the cell below:

In [10]:
print(not True) # prints False
print(not False) # prints True

print()
print(True and True) # prints True
print(True and False) # prints False
print(False and True) # prints False
print(False and False) # prints False

print()
print(True or True) # prints True
print(True or False) # prints True
print(False or True) # prints True
print(False or False) # prints False

False
True

True
False
False
False

True
True
True
False


When it comes to arithmetic operations, both Pandas and Python use the same operators.
That is, `+` means addition, `-` means subtraction, and so on.
However, this is **not** true for logical operators.
For example, let's say we want to compute logical not over a `Series` of Boolean values.
Let's see what happens if we try this with `not`:

In [11]:
booleans = pd.Series([True, False, False, True],
                     index=["a", "b", "c", "d"])
print(not booleans)

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

This throws a `ValueError` exception, saying that the "truth value of a `Series` is ambiguous."
To explain this error, `not`, along with `and` and `or`, are hardwired into Python itself, and **always** refer to the operations which work strictly over Booleans.
Python therefore is attempting to convert `booleans` itself into a single Boolean value, but `Series` objects don't inherently work this way.
There _are_ methods on `Series` objects which can give back a Boolean, including `any()` and `all()`, along with the others shown in the error message.
In this case, Python (and Pandas) thinks you want a single Boolean value, not a `Series` object holding Boolean values.
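
For the cases where a single Boolean really is what you want, here is a brief sketch of `any()` and `all()` on the same kind of `Series`.

```python
import pandas as pd

booleans = pd.Series([True, False, False, True],
                     index=["a", "b", "c", "d"])

print(booleans.any())  # True: at least one element is True
print(booleans.all())  # False: not every element is True

# Because any() gives back a single Boolean, it is safe to use with if, not, and, and or.
if booleans.any() and not booleans.all():
    print("some, but not all, elements are True")
```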

> The specific feature that Python has which allows operations like `+` to work over `Series` objects is [_operator overloading_](https://en.wikipedia.org/wiki/Operator_overloading).
> This is fairly easy to do in Python by defining methods on a custom class with specific names, detailed [in the official Python documentation](https://docs.python.org/3/reference/datamodel.html#special-method-names).
> Exactly how to do this is beyond our scope.

To apply these logical operations to a `Series` object as a vector operation, we need to use the following:

- `~`: logical NOT
- `&`: logical AND
- `|`: logical OR

Examples are shown in the next cell.

In [12]:
booleans1 = pd.Series([True, False, False, True],
                      index=["a", "b", "c", "d"])
booleans2 = pd.Series([True, False, True, False],
                      index=["a", "b", "c", "d"])
print("~booleans1")
print(~booleans1)

print()
print("booleans1 & booleans2")
print(booleans1 & booleans2)

print()
print("booleans1 | booleans2")
print(booleans1 | booleans2)

~booleans1
a    False
b     True
c     True
d    False
dtype: bool

booleans1 & booleans2
a     True
b    False
c    False
d    False
dtype: bool

booleans1 | booleans2
a     True
b    False
c     True
d     True
dtype: bool
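
One practical caveat when combining comparisons with these operators: in Python, `&` and `|` bind more tightly than comparison operators like `<` and `>`, so each comparison should be wrapped in parentheses.
The sketch below uses a hypothetical `values` `Series` to show the parenthesized form.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
values = pd.Series([1, 5, 10, 15], index=["a", "b", "c", "d"])

# Parentheses are needed around each comparison, because & has
# higher precedence than > and <.
in_range = (values > 2) & (values < 12)
print(in_range.tolist())          # [False, True, True, False]

# Masking with the combined condition keeps only the True positions.
print(values[in_range].tolist())  # [5, 10]
```

Leaving out the parentheses (as in `values > 2 & values < 12`) makes Python group the expression differently, which typically ends in the same "truth value of a Series is ambiguous" error discussed earlier.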


### Background: `dropna` ###

Using `isna` with masking, you can remove missing data.
However, because masking takes rows which are `True`, and `isna` gives `True` for missing values, you need to logically negate the result from `isna`.
An example of this is shown in the next cell, using the `combined_hardware` data from before.

In [13]:
# copied for convenience
hardware_inventory = pd.DataFrame({"product": ["hammer", "wrench", "screws (20)", "jigsaw"],
                                   "price": [20, 30, 2.5, 40],
                                   "quantity": [19, 15, 150, 12],
                                   "location": ["tools", "tools", "hardware", "tools"]})
hardware_purchases = pd.DataFrame({"customer_name": ["alice", "alice", "bob", "joe", "carl"],
                                   "product": ["hammer", "wrench", "screws (20)", "hammer", "table saw"],
                                   "when": ["1/2/2025", "1/2/2025", "3/4/2025", "5/6/2025", "5/12/2015"]})

combined_hardware = hardware_inventory.merge(hardware_purchases, on="product", how="outer")
print(combined_hardware[~combined_hardware["price"].isna()])

       product  price  quantity  location customer_name      when
0       hammer   20.0      19.0     tools         alice  1/2/2025
1       hammer   20.0      19.0     tools           joe  5/6/2025
2       jigsaw   40.0      12.0     tools           NaN       NaN
3  screws (20)    2.5     150.0  hardware           bob  3/4/2025
5       wrench   30.0      15.0     tools         alice  1/2/2025


As shown, this ended up removing all rows where the `"price"` column was missing.
However, row `2` still contains missing data (in `"customer_name"` and `"when"`); it was kept because its `"price"` was not missing.

We could address this issue by adjusting the mask, as with:

In [14]:
print(combined_hardware[~(combined_hardware["price"].isna() | combined_hardware["customer_name"].isna())])

       product  price  quantity  location customer_name      when
0       hammer   20.0      19.0     tools         alice  1/2/2025
1       hammer   20.0      19.0     tools           joe  5/6/2025
3  screws (20)    2.5     150.0  hardware           bob  3/4/2025
5       wrench   30.0      15.0     tools         alice  1/2/2025


However, in general, we'd have to do this sort of adjustment over every column of the `DataFrame`.
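
One way to build such a mask without naming each column is to call `isna` on the whole `DataFrame` and then collapse each row with `any(axis="columns")`.
The following is only a sketch; the `table` variable is a hypothetical, cut-down stand-in for `combined_hardware`.

```python
import numpy as np
import pandas as pd

# Hypothetical small table, standing in for combined_hardware.
table = pd.DataFrame({"product": ["hammer", "jigsaw", "table saw"],
                      "price": [20.0, 40.0, np.nan],
                      "customer_name": ["alice", None, "carl"]})

# isna on a DataFrame yields a DataFrame of Booleans with the same shape.
# any(axis="columns") then gives one Boolean per row:
# True if anything in that row is missing.
row_has_missing = table.isna().any(axis="columns")
print(row_has_missing.tolist())  # [False, True, True]

# Negate the mask to keep only the rows with no missing values.
print(table[~row_has_missing])
```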

It turns out that this is such a common operation that Pandas offers a much simpler solution, just for this problem: `DataFrame`'s `dropna` method.
`dropna` will remove rows and/or columns containing missing values, like so:

In [15]:
print(combined_hardware.dropna())

       product  price  quantity  location customer_name      when
0       hammer   20.0      19.0     tools         alice  1/2/2025
1       hammer   20.0      19.0     tools           joe  5/6/2025
3  screws (20)    2.5     150.0  hardware           bob  3/4/2025
5       wrench   30.0      15.0     tools         alice  1/2/2025


As shown in the prior cell, any rows which contained missing values have been stripped away in the resulting `DataFrame`.
This effectively did the same thing as what we had with the mask, but it is much shorter.

You can also strip away any columns containing missing values by passing `axis="columns"` as a keyword argument to `dropna`.
An example of this with `combined_hardware` is shown in the next cell:

In [16]:
print(combined_hardware) # refresher of what the original data was
print()
print(combined_hardware.dropna(axis="columns"))

       product  price  quantity  location customer_name       when
0       hammer   20.0      19.0     tools         alice   1/2/2025
1       hammer   20.0      19.0     tools           joe   5/6/2025
2       jigsaw   40.0      12.0     tools           NaN        NaN
3  screws (20)    2.5     150.0  hardware           bob   3/4/2025
4    table saw    NaN       NaN       NaN          carl  5/12/2015
5       wrench   30.0      15.0     tools         alice   1/2/2025

       product
0       hammer
1       hammer
2       jigsaw
3  screws (20)
4    table saw
5       wrench


There are some additional parameters that `dropna` can take to customize its behavior, though they are beyond our scope.
You may wish to consult the [official Pandas documentation for `dropna`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) in order to learn more.
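
As a small taste of those parameters, the sketch below tries `subset`, `how`, and `thresh` on a hypothetical `readings` `DataFrame` (the data is invented for illustration).
Only the index labels of the surviving rows are printed, to keep the output short.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration.
readings = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                         "b": [1.0, 2.0, np.nan, np.nan],
                         "c": [1.0, 2.0, np.nan, 4.0]})

# subset: only consider column "a" when deciding which rows to drop.
print(readings.dropna(subset=["a"]).index.tolist())  # [0, 3]

# how="all": drop a row only if every value in it is missing.
print(readings.dropna(how="all").index.tolist())     # [0, 1, 3]

# thresh=3: keep only rows with at least 3 non-missing values.
print(readings.dropna(thresh=3).index.tolist())      # [0]
```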

### Try this Yourself ###

The next cell defines a `DataFrame` and binds it to variable `some_missing`.
Using `dropna`, initialize variables `without_missing_rows` and `without_missing_columns` to hold `DataFrame` objects based on `some_missing`.

In [30]:
some_missing = pd.DataFrame({"foos": [3.3, 2.1, np.nan, 7.0, 7.1],
                             "bars": [2.2, np.nan, 7.6, 7.7, 2.1],
                             "blahs": [1.1, 2.2, 3.3, 4.4, 5.5]})

# Write your code here defining without_missing_rows.
# This should be a DataFrame where any rows containing missing
# information are stripped away.

without_missing_rows = some_missing.dropna()

print(without_missing_rows)
# the above statement should print:
#    foos  bars  blahs
# 0   3.3   2.2    1.1
# 3   7.0   7.7    4.4
# 4   7.1   2.1    5.5

# Write your code here defining without_missing_columns.
# This should be a DataFrame where any columns containing missing
# information are stripped away.
without_missing_columns = some_missing.dropna(axis=1)
print()
print(without_missing_columns)
# the above statement should print:
#    blahs
# 0    1.1
# 1    2.2
# 2    3.3
# 3    4.4
# 4    5.5

   foos  bars  blahs
0   3.3   2.2    1.1
3   7.0   7.7    4.4
4   7.1   2.1    5.5

   blahs
0    1.1
1    2.2
2    3.3
3    4.4
4    5.5


## Step 3: Use `drop_duplicates` to Remove Possibly Partially-Duplicated Rows ##

### Background: (Partial) Duplications in Data ###

Sometimes a data set contains duplicate rows, or partially-duplicated rows.
For example, consider the variant of the previously seen hardware inventory data shown in the next cell:

In [31]:
dups_hardware_inventory = pd.DataFrame({"product": ["hammer", "wrench", "screws (20)", "jigsaw", "hammer", "jigsaw"],
                                        "price": [20, 30, 2.5, 40, 20, 35],
                                        "quantity": [19, 15, 150, 12, 19, 14],
                                        "location": ["tools", "tools", "hardware", "tools", "tools", "tools"]})
print(dups_hardware_inventory)

       product  price  quantity  location
0       hammer   20.0        19     tools
1       wrench   30.0        15     tools
2  screws (20)    2.5       150  hardware
3       jigsaw   40.0        12     tools
4       hammer   20.0        19     tools
5       jigsaw   35.0        14     tools


As shown, rows `0` and `4`, other than their numeric index, share completely identical information; these two rows are total duplicates of each other.
Rows `3` and `5` have some data in common (namely `"product"` and `"location"`), though the remaining values are not shared (`"price"` and `"quantity"`); in this sense, rows `3` and `5` are partially duplicated.

For some data sets, partial duplicates or even total duplicates may be expected, and not considered to be inherently indicative of a problem.
For example, within this hardware inventory information, we'd expect that multiple rows would share a value in the `"location"` column.
If anything, if `"location"` was fully unique across rows, that in and of itself would likely indicate a problem, because it would mean that at most one item is stocked per location.
As another example, a car dealership may track the year, make, and model of each car on the lot, where each row in the data set represents one vehicle.
If the only things tracked are year, make, and model, it would not be out of the realm of possibility to see completely duplicate rows, representing the fact that two or more cars of the same year, make, and model are on the lot.

For other data sets, duplicates, even partial duplicates, may indicate a problem.
For example, a data set tracking wind speed over time may include a timestamp and a particular reading at that time.
If we have only a single wind speed sensor, we would likely not expect to see multiple entries for wind speed at the exact same time, regardless of the values in the rest of the columns.
This all illustrates the importance of understanding the data set before you analyze it: you need to make sure it's clean first, and if not, perform corrective actions of some sort.

> From my own experiences with my MS thesis, duplications were some of the most challenging issues to fix.
> Total duplicates happened more often than you might think, and tended to be clustered together.
> What likely had happened was that someone years prior had started to manually rearrange the data and copy/pasted things around, and accidentally pasted the same data multiple times.
> Certain partial duplications were the worst, especially when all entries were internally consistent; these occasionally required consultation with domain experts to resolve.

### Background: `DataFrame`'s `drop_duplicates` Method ###

If you have come to the conclusion that it's appropriate to strip out rows containing duplicate data, Pandas `DataFrame`s offer a method just for this purpose: `drop_duplicates()`.
The simplest form takes no parameters, and will return a new `DataFrame` without any completely duplicate rows.
This is shown in the next cell with the `dups_hardware_inventory` `DataFrame`:

In [19]:
print(dups_hardware_inventory) # reminder of what this data is
print()
print(dups_hardware_inventory.drop_duplicates())

       product  price  quantity  location
0       hammer   20.0        19     tools
1       wrench   30.0        15     tools
2  screws (20)    2.5       150  hardware
3       jigsaw   40.0        12     tools
4       hammer   20.0        19     tools
5       jigsaw   35.0        14     tools

       product  price  quantity  location
0       hammer   20.0        19     tools
1       wrench   30.0        15     tools
2  screws (20)    2.5       150  hardware
3       jigsaw   40.0        12     tools
5       jigsaw   35.0        14     tools


Looking at the output of the prior cell, the resulting `DataFrame` from `drop_duplicates` lacks a second row where `"hammer"` is the product, reflecting the fact that the fully duplicated row has been removed.
It's also possible to specify which specific `"hammer"`-containing row should be preserved; the [official Pandas documentation for `drop_duplicates`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) has more information.
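
As a brief sketch of that option, the `keep` parameter controls which member of each group of duplicates survives.
The `items` data below is hypothetical; only the index labels of the remaining rows are printed.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are complete duplicates of each other.
items = pd.DataFrame({"product": ["hammer", "wrench", "hammer"],
                      "price": [20, 30, 20]})

# Default (keep="first"): the first occurrence survives.
print(items.drop_duplicates().index.tolist())             # [0, 1]

# keep="last": the last occurrence survives instead.
print(items.drop_duplicates(keep="last").index.tolist())  # [1, 2]

# keep=False: every row that has a duplicate is removed.
print(items.drop_duplicates(keep=False).index.tolist())   # [1]
```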

Looking again at the output of the prior cell, the rows containing `"jigsaw"` are both still present in the output, reflecting the fact that the data is not fully duplicated across the rows.
Specifically, the two rows list different values for `"price"` and `"quantity"`.
The `drop_duplicates` method can still handle this sort of situation, with an additional keyword parameter: `subset`.
The `subset` parameter can take a list of column names, and only those columns will be used to determine if two rows are identical.
For example, if we want two rows to be considered identical if their `"product"` and `"location"` columns are identical, we can do the following in the next cell:

In [20]:
print(dups_hardware_inventory) # reminder of what this data is
print()
print(dups_hardware_inventory.drop_duplicates(subset=["product", "location"]))

       product  price  quantity  location
0       hammer   20.0        19     tools
1       wrench   30.0        15     tools
2  screws (20)    2.5       150  hardware
3       jigsaw   40.0        12     tools
4       hammer   20.0        19     tools
5       jigsaw   35.0        14     tools

       product  price  quantity  location
0       hammer   20.0        19     tools
1       wrench   30.0        15     tools
2  screws (20)    2.5       150  hardware
3       jigsaw   40.0        12     tools
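
Since dropping rows discards data, it can be worth looking at the duplicates before removing them.
One way to do this, sketched here with a hypothetical `inventory` `DataFrame`, is the `duplicated` method, which yields a Boolean `Series` suitable for masking and accepts the same `subset` and `keep` parameters.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
inventory = pd.DataFrame({"product": ["hammer", "wrench", "jigsaw", "hammer", "jigsaw"],
                          "price": [20, 30, 40, 20, 35]})

# With no arguments, duplicated marks rows that completely repeat an earlier row.
print(inventory.duplicated().tolist())  # [False, False, False, True, False]

# With subset and keep=False, it marks every row that shares a "product"
# with some other row, including the first occurrence.
shares_product = inventory.duplicated(subset=["product"], keep=False)
print(shares_product.tolist())          # [True, False, True, True, True]

# Mask with the result to inspect the suspicious rows side by side.
print(inventory[shares_product].sort_values("product"))
```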


### Try this Yourself ###

The next cell defines a `DataFrame` and binds it to variable `dups_data`.
This is intended to represent daily average temperatures in Fahrenheit, recorded from a somewhat faulty weather station somewhere on the planet.
Using `drop_duplicates`:

1. Initialize the `without_dup_rows` variable to be `dups_data` without the fully duplicate rows.
2. Initialize the `without_dup_when` variable to be `dups_data` without duplicated `"when"` information.

In [33]:
dups_data = pd.DataFrame({"when": ["1/1/2015", "1/2/2015", "1/2/2015", "1/3/2015", "1/4/2015", "1/5/2015", "1/5/2015"],
                          "temp": [25, 23, 23, 30, 40, 21, 15]})
print(dups_data)

print()
# Define your code here to initialize without_dup_rows.
without_dup_rows = dups_data.drop_duplicates()
print(without_dup_rows)
# The above statement should print:
#        when  temp
# 0  1/1/2015    25
# 1  1/2/2015    23
# 3  1/3/2015    30
# 4  1/4/2015    40
# 5  1/5/2015    21
# 6  1/5/2015    15

print()
# Define your code here to initialize without_dup_when.
without_dup_when = dups_data.drop_duplicates(subset=["when"])

print(without_dup_when)
# The above statement should print:
#        when  temp
# 0  1/1/2015    25
# 1  1/2/2015    23
# 3  1/3/2015    30
# 4  1/4/2015    40
# 5  1/5/2015    21

       when  temp
0  1/1/2015    25
1  1/2/2015    23
2  1/2/2015    23
3  1/3/2015    30
4  1/4/2015    40
5  1/5/2015    21
6  1/5/2015    15

       when  temp
0  1/1/2015    25
1  1/2/2015    23
3  1/3/2015    30
4  1/4/2015    40
5  1/5/2015    21
6  1/5/2015    15

       when  temp
0  1/1/2015    25
1  1/2/2015    23
3  1/3/2015    30
4  1/4/2015    40
5  1/5/2015    21


## Step 4: Use `DataFrame`'s `replace` Method to Replace Fixed Values with Other Fixed Values ##

### Background: `replace` Method ###

Consider a modified version of the `dups_data` `DataFrame` from the prior step, shown in the next cell:

In [22]:
weather = pd.DataFrame({"when": ["1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015", "1/5/2015", "1/6/2015", "1/7/2015", "1/8/2015", "1/9/2015"],
                        "temp": [25, 23, 23, 30, 40, 9999, 15, 18, 9998]})
print(weather)

       when  temp
0  1/1/2015    25
1  1/2/2015    23
2  1/3/2015    23
3  1/4/2015    30
4  1/5/2015    40
5  1/6/2015  9999
6  1/7/2015    15
7  1/8/2015    18
8  1/9/2015  9998


As shown, at rows `5` and `8`, we have recorded temperatures of `9999` and `9998`, respectively.
Given that this is supposed to be data from a terrestrial weather station, these numbers are impossibly high and must be incorrect.
Moreover, these numbers don't appear to be random, given how close they are to 10,000, an exact power of 10.
This suggests that these numbers were very intentionally put here, and were intended to stand out.
While we don't know exactly what these numbers mean without further context (perhaps specific error codes?), we can be sure that these are **not** temperatures.
With this in mind, `"temp"` for these rows is effectively a missing value.
However, unlike with `NaN`, these are nonetheless valid numbers, and Pandas has no understanding of what any of this data means.
For example, Pandas will still happily compute the mean of `"temp"`, if you ask for it:

In [23]:
print(weather["temp"].mean()) # prints 2241.222222222222

2241.222222222222


...though the result is pretty useless, because we can confidently say that the average temperature over these nine days was **not** in excess of 2,000 degrees Fahrenheit.

This is another form of invalid data slipping in, though in this case it's more problematic than just missing data: it's missing data that, at a quick glance, looks like real data.
This is something else we need to clean.
Towards that end, Pandas `DataFrame`s offer a `replace` method, which takes a list of values to replace, along with a corresponding list of values to replace them with.
For example, let's say we want to replace both `9998` and `9999` with `NaN`, and truly make them missing values.
This is shown in the following cell:

In [24]:
print(weather.replace([9998, 9999], [np.nan, np.nan]))

       when  temp
0  1/1/2015  25.0
1  1/2/2015  23.0
2  1/3/2015  23.0
3  1/4/2015  30.0
4  1/5/2015  40.0
5  1/6/2015   NaN
6  1/7/2015  15.0
7  1/8/2015  18.0
8  1/9/2015   NaN


From here, we can use `dropna` as before to get rid of these rows entirely, as with:

In [25]:
print(weather.replace([9998, 9999], [np.nan, np.nan]).dropna())

       when  temp
0  1/1/2015  25.0
1  1/2/2015  23.0
2  1/3/2015  23.0
3  1/4/2015  30.0
4  1/5/2015  40.0
6  1/7/2015  15.0
7  1/8/2015  18.0
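
Once the placeholder values are truly missing, summary statistics become meaningful again.
The sketch below repeats the `weather` data for convenience and recomputes the mean; it relies on the fact that `mean` skips missing values by default, so the result is the same with or without `dropna`.
The `cleaned` variable name is just for illustration.

```python
import numpy as np
import pandas as pd

# copied for convenience
weather = pd.DataFrame({"when": ["1/1/2015", "1/2/2015", "1/3/2015", "1/4/2015", "1/5/2015",
                                 "1/6/2015", "1/7/2015", "1/8/2015", "1/9/2015"],
                        "temp": [25, 23, 23, 30, 40, 9999, 15, 18, 9998]})

# A single replacement value can be given for a whole list of values to replace.
cleaned = weather.replace([9998, 9999], np.nan)

# mean ignores NaN by default, so these two results agree.
print(cleaned["temp"].mean())           # roughly 24.86
print(cleaned.dropna()["temp"].mean())  # roughly 24.86
```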


Alternatively, perhaps we want to replace `9999` with `30`, and `9998` with `25`.
This would look as follows:

In [26]:
print(weather.replace([9998, 9999], [25, 30]))

       when  temp
0  1/1/2015    25
1  1/2/2015    23
2  1/3/2015    23
3  1/4/2015    30
4  1/5/2015    40
5  1/6/2015    30
6  1/7/2015    15
7  1/8/2015    18
8  1/9/2015    25


We are only scratching the surface of the possibilities with `replace`; you may wish to look at the [official Pandas documentation for `replace`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.replace.html) for more information.
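
Two of those possibilities are sketched below using a hypothetical `stations` `DataFrame`, in which `9999` is a placeholder in `"temp"` but a legitimate value in `"station_id"`.
A dictionary maps each value to its replacement, and a nested dictionary restricts the replacement to a particular column.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 9999 is a placeholder in "temp",
# but a real identifier in "station_id".
stations = pd.DataFrame({"station_id": [9999, 1234, 9999],
                         "temp": [25, 9999, 30]})

# Dictionary form: each key is replaced with its value, in every column.
# This also wipes out the legitimate station identifiers.
everywhere = stations.replace({9999: np.nan})
print(everywhere)

# Nested dictionary form: only look in the "temp" column.
only_temp = stations.replace({"temp": {9999: np.nan}})
print(only_temp)
```

The second form is the safer choice whenever a placeholder value could also appear as real data in another column.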

### Try this Yourself ###

In the next cell, a variant of the `some_missing` `DataFrame` is defined, and bound to the variable `new_some_missing`.
In this case, instead of using `np.nan` for missing data, the values `400` and `401` were used.
Using `replace`, create a new `DataFrame` where all instances of `400` are replaced with `3.5`, and all instances of `401` are replaced with `4.1`.
Bind your resulting `DataFrame` to the `replaced` variable.

In [34]:
new_some_missing = pd.DataFrame({"foos": [3.3, 2.1, 400, 7.0, 7.1],
                                 "bars": [2.2, 401, 7.6, 7.7, 2.1],
                                 "blahs": [1.1, 2.2, 3.3, 4.4, 400]})
print(new_some_missing)
print()

# Define your code here to define and initialize variable replaced

replaced = new_some_missing.replace(to_replace={400: 3.5, 401: 4.1})

print(replaced)
# The above statement should print:
#    foos  bars  blahs
# 0   3.3   2.2    1.1
# 1   2.1   4.1    2.2
# 2   3.5   7.6    3.3
# 3   7.0   7.7    4.4
# 4   7.1   2.1    3.5

    foos   bars  blahs
0    3.3    2.2    1.1
1    2.1  401.0    2.2
2  400.0    7.6    3.3
3    7.0    7.7    4.4
4    7.1    2.1  400.0

   foos  bars  blahs
0   3.3   2.2    1.1
1   2.1   4.1    2.2
2   3.5   7.6    3.3
3   7.0   7.7    4.4
4   7.1   2.1    3.5


## Step 5: Submit via Canvas ##

Be sure to **save your work**, then log into [Canvas](https://canvas.csun.edu/).  Go to the COMP 502 course, and click "Assignments" on the left pane.  From there, click "Assignment 30".  From there, you can upload your `30_pandas_data_cleaning.ipynb` file.

You can turn in the assignment multiple times, but only the last version you submitted will be graded.