In [2]:
## PREPROCESSING
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# load_boston was removed in scikit-learn 1.2 and is not used in this lesson
# from sklearn.datasets import load_boston

# show the output of every expression in a cell, not just the last one.
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<!-- TITLE -->
# 1.0 Read in the CSV Data

<!-- INTRODUCTION -->
## Working with incorrect data and missing values

While the Boston dataset we loaded from the sklearn datasets is great for practicing using pandas, it's not likely you'll find data that's perfectly clean in the real world. In this lesson we'll review some methods and techniques for working with incomplete or missing data.

<!-- TASK 1.1 -->
## Read in the new csv file

In the example below we will read in the same dataset with some manufactured issues in it to learn how to clean a dataset. First we will read in the csv using the pandas `.read_csv()` function. When reading a file from your local computer you have to specify the path to the file. If the file is located in the same directory as your notebook, the path would look something like `'./my_interesting_file.csv'`, where the `.` represents the current directory.
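As a minimal sketch of the call described above, the snippet below reads a tiny in-memory CSV instead of a file on disk. The `sample_csv` buffer and its contents are made up for illustration; only `pd.read_csv()` and its `index_col` parameter come from the lesson. It shows why a saved row index turns up as an extra `Unnamed: 0` column unless `index_col=0` is set.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the first column is a saved row index.
sample_csv = io.StringIO(",CRIM,ZN\n0,0.00632,18.0\n1,0.02731,0.0\n")

# Without index_col, the unnamed first column is read in as 'Unnamed: 0'.
with_extra_col = pd.read_csv(sample_csv)

# Rewind the buffer, then tell pandas to use the first column as the index instead.
sample_csv.seek(0)
sample_df = pd.read_csv(sample_csv, index_col=0)

print(list(with_extra_col.columns))
print(list(sample_df.columns))
```

With a real file you would pass the path string (for example `'./boston_dirty.csv'`) in place of the buffer.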

In [None]:
## STARTER CODE for 1.1
# Read in the csv file to a variable named boston_dirty_df

# Check the columns attribute and print it to see which
# columns have changed

In [3]:
## SOLUTION CODE for 1.1
# Read in the csv file to a variable named boston_dirty_df
# If the dataframe is reading in with a new column 'Unnamed: 0', set
# the index_col parameter to 0
boston_dirty_df = pd.read_csv('./boston_dirty.csv', index_col=0)

# Check the columns attribute and print it to see which
# columns have changed
boston_dirty_df.columns

Index([u'CRIM', u'ZN', u'INDUS', u'CHAS', u'NOX', u'RM', u' ge', u'DIS',
       u'RAD!', u'TAdfX', u'PTRATIO', u'B', u'LSTATa'],
      dtype='object')

<!-- HINT 1.1 -->
Remember to use `pd.read_csv()` and then pass in the filename.

<!-- TASK 1.2 -->
## Clean up the column names

Once we've read the file in, you'll notice that some column names are incorrect. Since we know what the column names are supposed to be from previous examples, we can fix that easily. We'll show two ways to correct this. One way is to assign a full list of names to the dataframe's `columns` attribute, e.g. `boston_df.columns = ['CRIM', 'ZN', 'RM', ...]`. Another way is to selectively rename columns by passing a dictionary to the `columns` parameter of the pandas `.rename()` method, e.g. `boston_df = boston_df.rename(columns={'old_name': 'new_name'})`. Let's give it a try.
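The two renaming approaches can be sketched on a toy dataframe. The dataframe `toy_df` and its values are invented for this example; the point is that `.rename()` returns a new dataframe, while assigning to `.columns` changes the existing one and needs one name per column.

```python
import pandas as pd

# Toy dataframe with two mangled column names (made up for illustration).
toy_df = pd.DataFrame({'CRIM': [0.1], ' ge': [65.2], 'RAD!': [1]})

# Option 1: .rename() returns a new dataframe, so keep the result.
renamed_df = toy_df.rename(columns={' ge': 'AGE', 'RAD!': 'RAD'})

# Option 2: overwrite every column name at once; the list length must match.
toy_df.columns = ['CRIM', 'AGE', 'RAD']

print(list(renamed_df.columns))
print(list(toy_df.columns))
```

The dictionary form is usually the safer choice when only a few names are wrong, because it cannot silently shift names onto the wrong columns.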

In [None]:
## STARTER CODE for 1.2
# Check the columns attribute and print it to see which
# columns have changed


# Use one or both of the methods above to rename the columns appropriately.
# Correct column names for reference [u'CRIM', u'ZN', u'INDUS', u'CHAS', u'NOX',
# u'RM', u'AGE', u'DIS', u'RAD', u'TAX', u'PTRATIO', u'B', u'LSTAT']

In [11]:
## SOLUTION CODE for 1.2
# Use one or both of the methods above to rename the columns appropriately.
# Correct column names for reference [u'CRIM', u'ZN', u'INDUS', u'CHAS', u'NOX',
# u'RM', u'AGE', u'DIS', u'RAD', u'TAX', u'PTRATIO', u'B', u'LSTAT']
boston_dirty_df.columns = [u'CRIM', u'ZN', u'INDUS', u'CHAS', u'NOX', u'RM', u'AGE', u'DIS',
                           u'RAD', u'TAX', u'PTRATIO', u'B', u'LSTAT']
# OR
# note: .rename() returns a new dataframe; assign the result to keep the change
boston_dirty_df.rename(
    columns={u' ge': u'AGE', u'RAD!': u'RAD', u'TAdfX': u'TAX', u'LSTATa': u'LSTAT'})

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33
5,0.02985,0.0,2.18,0.0,0.458,6.430,58.7,6.0622,3,222,18.7,394.12,5.21
6,0.08829,12.5,7.87,0.0,0.524,6.012,,5.5605,5,311,15.2,395.60,
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5,311,15.2,396.90,
8,0.21124,12.5,7.87,0.0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,
9,0.17004,12.5,7.87,0.0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,


<!-- HINT 1.2 -->
Remember to pass a dictionary of old and new name pairs into the `columns` argument if using the `.rename()` method, and a list if assigning directly to the `columns` attribute.

<!-- TITLE -->
# Imputing missing values

<!-- INTRODUCTION -->
## Imputing missing values

In this task we'll look at one of the columns that contains missing values and will work to replace those values with the mean value from the rest of the column. When working with a dataframe the `.isnull()` method is very useful for seeing where there are `NaN` (not a number) or missing values. Calling `.isnull()` on the dataframe itself yields a dataframe of boolean values indicating whether each cell is null or not. We can then call the `.sum()` method to count the number of null values in each column.

<!-- TASK 2.1 -->
In this instance we'll take the mean value of the entire column and use it to replace the `NaN` values. We can also compute the minimum and maximum values in a column using `.min()` or `.max()`, and replace `NaN`s with these instead. Choosing which approach to take requires careful consideration, and the right choice depends on the dataset and problem to be solved. Later, we'll go through other strategies to deal with missing data.
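To make the `.isnull().sum()` pattern concrete, here is a small sketch on a made-up dataframe (`toy_df` and its values are assumptions for illustration). Summing a boolean dataframe counts each `True` as 1, which is why the result is a per-column count of missing values.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in 'CRIM' and none in 'ZN'.
toy_df = pd.DataFrame({'CRIM': [0.1, np.nan, 0.3], 'ZN': [18.0, 0.0, 12.5]})

null_mask = toy_df.isnull()     # boolean dataframe, True where a value is missing
null_counts = null_mask.sum()   # True counts as 1, so this counts nulls per column

print(null_counts)
```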

In [None]:
## STARTER CODE for 2.1
# call .isnull() and .sum() to get the total number of null values per column

In [12]:
## SOLUTION CODE for 2.1
# call .isnull() and .sum() to get the total number of null values per column
boston_dirty_df.isnull().sum()

CRIM         2
ZN           0
INDUS        1
CHAS         2
NOX          2
RM           1
AGE          2
DIS          0
RAD          0
TAX          0
PTRATIO      7
B            0
LSTAT      395
dtype: int64

<!-- HINT 2.1 -->
Remember to call `.isnull()` on the DataFrame first followed by `.sum()` all in one line.

<!-- TASK for 2.2 -->
To impute the missing values, call `.fillna()`, and pass in the mean of the column's values. Then we'll assign this value to the original column.
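A minimal sketch of mean imputation on a toy column, assuming a made-up dataframe `toy_df`. Note that `.mean()` skips `NaN` values by default, so the mean is computed from the observed values only.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'CRIM': [1.0, np.nan, 3.0]})

# .mean() ignores NaN by default, so this is (1.0 + 3.0) / 2.
crim_mean = toy_df['CRIM'].mean()

# .fillna() returns a new column; assign it back to update the dataframe.
toy_df['CRIM'] = toy_df['CRIM'].fillna(crim_mean)

print(toy_df['CRIM'].tolist())
```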

In [318]:
## STARTER CODE for 2.2
# step 1: Use .mean() to find the mean value of the 'CRIM' column
# step 2: The output of this value should be passed through to the .fillna() method
# step 3: Assign the output of both of those calculations to the 'CRIM' column
# or use inplace=True to write directly to the dataframe

# call the same .isnull() and .sum() combination again to see if the operation
# was successful

In [14]:
## SOLUTION CODE for 2.2
# step 1: Use .mean() to find the mean value of the 'CRIM' column
# step 2: The output of this value should be passed through to the .fillna() method
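# step 3: Assign the output of both of those calculations to the 'CRIM' column
boston_dirty_df['CRIM'] = boston_dirty_df['CRIM'].fillna(
    boston_dirty_df['CRIM'].mean())

# OR (note: calling fillna with inplace=True on a selected column may warn or
# have no effect in recent pandas versions; the assignment above is safer)
boston_dirty_df['CRIM'].fillna(boston_dirty_df['CRIM'].mean(), inplace=True)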

# call the same .isnull() and .sum() combination again to see if the operation
# was successful
boston_dirty_df.isnull().sum()

CRIM         0
ZN           0
INDUS        1
CHAS         2
NOX          2
RM           1
AGE          2
DIS          0
RAD          0
TAX          0
PTRATIO      7
B            0
LSTAT      395
dtype: int64

<!-- HINT 2.2 -->
The suggested format for filling in values in the 'CRIM' column: `boston_df.column_name = boston_df.column_name.fillna(boston_df.column_name.mean())`.

<!-- TASK 2.3 -->
You may have noticed in the last task that the LSTAT column has 395 missing values. In some instances you may have so little data that the column is unusable for modeling or analysis. It certainly wouldn't be wise to calculate the mean and fill the 395 missing values from only 100 or so actual data points. In the following exercise we'll delete the LSTAT column.

For illustrative purposes, we'll delete the column in place so you learn how, but it usually makes sense to create a new copy of the dataframe, separate from the original, and drop the bad column there.

One thing to note is that you should only run the cell to delete the column once. If you've already deleted the column then when you run the cell it will try and delete a column that no longer exists on the dataframe and will throw an exception.
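The repeated-deletion problem can be sketched as follows, using a made-up `toy_df`. Deleting a column that no longer exists raises a `KeyError`, so one option is to catch it; another, shown as an alternative rather than the lesson's method, is `.drop()` with `errors='ignore'`, which returns a new dataframe and leaves the original untouched.

```python
import pandas as pd

toy_df = pd.DataFrame({'CRIM': [0.1, 0.2], 'LSTAT': [4.98, None]})

del toy_df['LSTAT']  # first deletion succeeds

# A second `del` on the same column raises KeyError, so guard it.
try:
    del toy_df['LSTAT']
    second_delete_failed = False
except KeyError:
    second_delete_failed = True

# Alternative: .drop() returns a new dataframe and can ignore missing columns.
trimmed_df = toy_df.drop(columns=['LSTAT'], errors='ignore')

print(second_delete_failed, list(trimmed_df.columns))
```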

<!-- TASK 3.1 -->


In [15]:
## STARTER CODE for 2.3
# del df['col_name'] where df and col_name are the DataFrame and column names you're trying to delete
# reoutput the dataframe so you can ensure it worked

In [1]:
## SOLUTION CODE for 2.3
del boston_dirty_df['LSTAT']

boston_dirty_df


<!-- HINT for 2.3 -->
Remember, one way to access a specific column in a dataframe is `df['col_name']`.

<!-- TASK 3.4 -->
## Dropping NaN values

As we saw previously, one approach to dealing with `NaN` values in a column is to replace them with the mean. Another approach is simply to drop the values altogether. Again, choosing the right technique requires some thought, and you shouldn't rush to simply drop data or impute the mean value without considering the implications. It can be important to know when to choose each. To demonstrate, we'll use the `.dropna()` function to drop all the remaining `NaN` values from the dataframe.
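To illustrate why dropping deserves some thought, here is a small sketch on a made-up dataframe (`toy_df` is an assumption for this example). By default `.dropna()` removes every row that has a `NaN` in any column, so two scattered missing values cost two of the three rows.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'NOX': [0.538, np.nan, 0.469],
                       'RM': [6.575, 6.421, np.nan]})

rows_before = len(toy_df)
cleaned_df = toy_df.dropna()   # axis=0 is the default: drop rows containing any NaN
rows_after = len(cleaned_df)

print(rows_before, rows_after)
```

Comparing the row counts before and after is a quick way to see how much data a drop actually removes.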

In [322]:
## STARTER CODE for 3.4
# use .dropna() and set the parameters axis=0 and inplace=True
# check to see if there are any null values left by using .isnull() and .sum()

In [22]:
## SOLUTION CODE for 3.4
boston_dirty_df.isnull().sum()
boston_dirty_df.dropna(inplace=True)
boston_dirty_df.isnull().sum()

CRIM       0
ZN         0
INDUS      1
CHAS       2
NOX        2
RM         1
AGE        2
DIS        0
RAD        0
TAX        0
PTRATIO    7
B          0
dtype: int64

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
dtype: int64

<!-- HINT for 3.4 -->


<!-- TITLE -->
## Converting Data Types

<!-- INTRODUCTION -->
Now that we've removed all the `NaN` values, we need to check the data type of each column with the `.dtypes` attribute to see if they're what we expect. Notice that most of the columns are of type `float64` or `int64`, which is convenient because linear regression requires numeric data. However, a few columns, such as 'NOX', 'TAX', and 'PTRATIO', have the `object` data type. This indicates that the column contains values of incompatible types that could not be combined when reading from the .csv file, so the column gets assigned the most general type, `object`.
We'll practice on the 'CRIM' column first: it should be of type `float`, and `.astype()` converts it without trouble because every value in it is numeric.

Note that `.astype()` returns a new column rather than modifying the dataframe, so the result has to be assigned back to `boston_dirty_df` to take effect. The `copy=False` argument only lets pandas skip an unnecessary copy when no conversion is needed; it does not write the result back for you. In general, you should work on a copy of the original data so that if you accidentally overwrite something, you don't have to re-import from .csv. However, for simplicity, we'll continue overwriting the original dataframe.

<!-- TASK 3.1 -->
Use `.astype()` to convert the 'CRIM' column to `float`. Because `.astype()` returns a new column, you must replace the original column with the result by assigning it back, e.g. `df['col'] = df['col'].astype(float)`. Setting `copy=False` on its own does not update the dataframe.
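A short sketch of the assign-back pattern, using a made-up integer column so the conversion is visible (`toy_df` and its values are assumptions for illustration). The dtype recorded right after the call shows that the dataframe is unchanged until the result is assigned.

```python
import pandas as pd

toy_df = pd.DataFrame({'RAD': [1, 2, 3]})

converted = toy_df['RAD'].astype(float)   # returns a new Series
dtype_after_call = toy_df['RAD'].dtype    # the dataframe itself is unchanged so far

toy_df['RAD'] = converted                 # assigning back is what updates the dataframe

print(dtype_after_call, toy_df['RAD'].dtype)
```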

In [18]:
## STARTER CODE for 3.1
boston_dirty_df.dtypes

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX         object
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX         object
PTRATIO     object
B          float64
dtype: object

In [21]:
## SOLUTION CODE for 3.1
boston_dirty_df.dtypes

boston_dirty_df.CRIM = boston_dirty_df.CRIM.astype(float)
# or, without assigning back (this only returns a Series; the dataframe is NOT updated)
boston_dirty_df.CRIM.astype(float, copy=False) # BE VERY CAREFUL WITH THIS APPROACH

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX         object
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX         object
PTRATIO     object
B          float64
dtype: object

0       0.00632
1       0.02731
2       0.02729
3       0.03237
4       0.06905
5       0.02985
6       0.08829
7       0.14455
8       0.21124
9       0.17004
10      0.22489
11      0.11747
12      0.09378
13      0.62976
14      0.63796
15      0.62739
16      1.05393
17      0.78420
18      0.80271
19      0.72580
20      1.25179
21      0.85204
22      1.23247
23      0.98843
24      0.75026
25      0.84054
26      0.67191
27      0.95577
28      0.77299
29      1.00245
         ...   
476     4.87141
477    15.02340
478    10.23300
479    14.33370
480     5.82401
481     5.70818
482     5.73116
483     2.81838
484     2.37857
485     3.67367
486     5.69175
487     4.83567
488     0.15086
489     0.18337
490     0.20746
491     0.10574
492     0.11132
493     0.17331
494     0.27957
495     0.17899
496     0.28960
497     0.26838
498     0.23912
499     0.17783
500     0.22438
501     0.06263
502     0.04527
503     0.06076
504     0.10959
505     0.04741
Name: CRIM, dtype: float64

<!-- HINT for 3.1 -->

<!-- TASK 3.2 -->
## Using regex to find alphabetic characters in columns

The `.astype()` method is very useful for converting between compatible datatypes, but it has limitations. What happens when you try to convert the string 'a' or 'b' to a `float` or `int`? Is this even possible? If you call `.astype('float')` on a column that contains non-numeric strings, it will raise an exception. Specifically, if you tried this on the 'NOX' column you would get the following error: `ValueError: could not convert string to float: a`. One way around this is to replace the value of those cells with the mean value of the rest of the column. Another would be to just drop the rows with the wrongly formatted data. We'll explore both approaches.

In the example below we'll use Regular Expressions, or regex, to find the alphabetic characters. There are a ton of other interesting and powerful things you can do with regex, but we'll start basic. The code below searches the 'NOX' column for any alphabetic characters, A-Z for uppercase and a-z for lowercase. It returns a boolean array the same length as the 'NOX' column, where `True` means there was a letter (A-Z or a-z) in that cell and `False` means there wasn't. When we pass this array to the original dataframe, we're filtering, or masking, the dataframe to show only the rows where the mask is `True`. The end result is a dataframe that shows which rows had letters in the 'NOX' column.
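Here is a minimal sketch of both ideas on a toy column: the failed conversion and the letter mask. The dataframe `toy_df` is made up, and `dtype=object` is set explicitly only so the example behaves the same across pandas versions.

```python
import pandas as pd

# Toy column mixing numeric strings with stray letters (made up for illustration).
toy_df = pd.DataFrame({'NOX': ['0.538', 'a', '0.469', 'am']}, dtype=object)

# Converting directly fails because 'a' is not a valid number.
try:
    toy_df['NOX'].astype(float)
    conversion_failed = False
except ValueError:
    conversion_failed = True

# True where the cell contains at least one letter.
letter_mask = toy_df['NOX'].str.contains(r'[A-Za-z]', regex=True)

# Boolean indexing keeps only the rows where the mask is True.
rows_with_letters = toy_df[letter_mask]

print(conversion_failed)
print(rows_with_letters)
```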

In [23]:
## STARTER CODE for 3.2

# run once to see output

# search the NOX column for alphabetic characters
letter_mask = boston_dirty_df.NOX.str.contains(r'[A-Za-z]', regex=True)

# pass the boolean array into the original dataframe to return a dataframe
# where the values are True
boston_dirty_df[letter_mask]

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B
18,0.80271,0.0,8.14,0.0,b,5.456,36.6,3.7965,4,307,21.0,288.99
25,0.84054,0.0,8.14,0.0,am,5.599,85.7,4.4546,4,307,21.0,303.42
76,0.10153,0.0,12.83,0.0,a,6.279,74.5,4.0522,5,398,18.7,373.66
79,0.08387,0.0,12.83,0.0,a,5.874,36.6,4.5026,5,398,18.7,396.06
85,0.05735,0.0,4.49,0.0,a,6.63,56.1,4.4377,3,a,18.5,392.3
88,0.0566,0.0,3.41,0.0,a,7.007,86.3,3.4217,2,270,17.8,396.9


In [None]:
## SOLUTION CODE for 3.2

# run once to see output

# search the NOX column for alphabetic characters
letter_mask = boston_dirty_df.NOX.str.contains(r'[A-Za-z]', regex=True)

# pass the boolean array into the original dataframe to return a dataframe
# where the values are True
boston_dirty_df[letter_mask]

<!-- HINT for 3.2 -->
Just run the cell.

<!-- TASK for 3.3 -->
## Create a new list of the index values for our filtered dataframe

Here we will use the `.index` attribute to pull out the index values from our newly created dataframe containing the alphabetic characters. We'll be doing this so that we can reference those values later on and replace the alphabetic characters with the mean value of the column.

In [None]:
## STARTER CODE for 3.3

# search the NOX column for alphabetic characters
letter_mask = boston_dirty_df.NOX.str.contains(r'[A-Za-z]', regex=True)

# pass the boolean array into the original dataframe to return a dataframe
# where the values are True
boston_dirty_df[letter_mask]

# use the index attribute and create a new list with list() of the index values
# save this to a new variable named index_list for later on in the code.


In [25]:
## SOLUTION CODE for 3.3

# search the NOX column for alphabetic characters
letter_mask = boston_dirty_df.NOX.str.contains(r'[A-Za-z]', regex=True)

# pass the boolean array into the original dataframe to return a dataframe
# where the values are True
boston_dirty_df[letter_mask]

# use the index attribute and create a new list with list() of the index values
# save this to a new variable i.e. index_list for later on in the code.
index_list = list(boston_dirty_df[letter_mask].index)
index_list

[18, 25, 76, 79, 85, 88]

<!-- TASK for 3.4 -->
## Use the ~ character to return all the False values

In the task below, put `~` before the `letter_mask` variable to invert the mask, turning every `True` into `False` and vice versa. If we pass `~letter_mask` as a boolean filter to our dataframe, it returns the rows where the original mask was `False`, that is, the rows whose 'NOX' value contains no alphabetic characters. Then specify that we only want the 'NOX' column by passing `'NOX'` in square brackets at the end.
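A small sketch of mask inversion on a made-up column (`toy_df` is an assumption for this example). It also shows an equivalent one-step selection with `.loc`, which is offered as an alternative spelling rather than the lesson's required form.

```python
import pandas as pd

toy_df = pd.DataFrame({'NOX': ['0.538', 'a', '0.469']}, dtype=object)
letter_mask = toy_df['NOX'].str.contains(r'[A-Za-z]', regex=True)

# ~ flips every True to False and vice versa, keeping the letter-free rows.
numeric_only = toy_df[~letter_mask]['NOX']

# Equivalent single-step selection with .loc (row filter, then column label).
numeric_only_loc = toy_df.loc[~letter_mask, 'NOX']

print(numeric_only.tolist())
```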

In [None]:
## STARTER CODE for 3.4

In [None]:
## SOLUTION CODE for 3.4
boston_dirty_df[~letter_mask]['NOX']

<!-- HINT for 3.4 -->

<!-- TASK for 3.5 -->
## Convert the remaining values to float and find the mean

Now that we have a way to filter down to values that should all be numeric, we can convert them to `float` and calculate the mean. In the code below, call the `.astype()` method to convert the values to `float` and then call the `.mean()` method to take the mean of those values.

In [None]:
## STARTER CODE for 3.5

# convert NOX column to a float type and calculate the mean value.
# save this mean value to a new variable nox_mean
nox_mean = boston_dirty_df[~letter_mask]['NOX']

In [28]:
## SOLUTION CODE for 3.5

# convert the filtered NOX values to a float type and calculate the mean value.
# save this mean value to a new variable, e.g. nox_mean
nox_mean = boston_dirty_df[~letter_mask]['NOX'].astype(float).mean()
nox_mean

0.5546348453608252

<!-- TASK for 3.6 -->
## Use the index list we created to replace the strings with the mean value from the column

Now that we have the index locations of the strings we want to replace and the mean value for the 'NOX' column, we can replace those values. Write a simple loop that iterates over the index list (`index_list`) we created earlier and replaces the value at each location (`.loc[]`) with the mean we calculated above, e.g. `for i in index_list: example_df.loc[i, 'col_name'] = mean_value`.
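The whole sequence from the last few tasks (mask, index list, mean of the clean values, replacement loop, final conversion) can be sketched end to end on a toy column. The dataframe and its values are made up, and `dtype=object` is set explicitly so the column can hold both strings and floats during the replacement.

```python
import pandas as pd

toy_df = pd.DataFrame({'NOX': ['0.25', 'a', '0.75', 'b']}, dtype=object)
letter_mask = toy_df['NOX'].str.contains(r'[A-Za-z]', regex=True)

# Row labels of the cells that contain letters.
index_list = list(toy_df[letter_mask].index)

# Mean of the remaining, convertible values.
nox_mean = toy_df[~letter_mask]['NOX'].astype(float).mean()

# Overwrite each bad cell with the mean.
for i in index_list:
    toy_df.loc[i, 'NOX'] = nox_mean

# Every cell is now convertible, so the column can become float.
toy_df['NOX'] = toy_df['NOX'].astype(float)

print(toy_df['NOX'].tolist())
```

The loop mirrors the lesson; the same replacement could also be written without a loop as `toy_df.loc[letter_mask, 'NOX'] = nox_mean`.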

In [32]:
## STARTER CODE for 3.6

# check the values at the index locations to ensure we've replaced them correctly
boston_dirty_df.loc[index_list, 'NOX']

18    0.554635
25    0.554635
76    0.554635
79    0.554635
85    0.554635
88    0.554635
Name: NOX, dtype: object

In [33]:
## SOLUTION CODE for 3.6

for i in index_list:
    boston_dirty_df.loc[i, 'NOX'] = nox_mean
    
# check the values at the index locations to ensure we've replaced them correctly
boston_dirty_df.loc[index_list, 'NOX']

18    0.554635
25    0.554635
76    0.554635
79    0.554635
85    0.554635
88    0.554635
Name: NOX, dtype: object

<!-- HINT for 3.6 -->
Remember the format of a for loop and use the appropriate names: `for i in index_list: example_df.loc[i, 'col_name'] = mean_value`

<!-- TASK for 3.7 -->
## Convert the entire column to a float

Now that we've replaced the string characters with values the `.astype()` method can actually convert, we can go back to our original goal of turning the entire column into floats. Use the `.astype()` method to convert the 'NOX' column to `float`, and make sure to assign the result back to the original dataframe.

In [344]:
## STARTER CODE for 3.7

# once it's run check the dtypes to ensure the NOX column is now all floats
boston_dirty_df.dtypes



In [35]:
## SOLUTION CODE for 3.7
boston_dirty_df['NOX'] = boston_dirty_df.NOX.astype(float)

# once it's run check the dtypes to ensure the NOX column is now all floats
boston_dirty_df.dtypes

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX         object
PTRATIO     object
B          float64
dtype: object

<!-- TITLE -->
## Using replace to remove values

<!-- INTRODUCTION -->
The above example is a lot of work for so little payoff. If we're working with large datasets it can sometimes be much easier and more valuable to just drop the values that can't be converted. Next we'll work through replacing the alphabetic values with the `.replace()` method.

<!-- TASK 4.1 -->
Somewhere in the 'TAX' column there's a string 'a' that's preventing us from just using `.astype()` to convert the entire column to a float. Use the `.replace()` function to replace 'a' with `np.nan` which is just a NaN value from the numpy library. Also set the `inplace` parameter to True. The structure of `.replace()` is `example_df.replace('value_to_replace', 'value_to_substitute_in', inplace=True)`.
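A minimal sketch of exact-value replacement on a made-up column (`toy_df` is an assumption for this example). Note that `.replace('a', ...)` matches only cells equal to `'a'`, so a cell such as `'am'` is left alone.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({'TAX': ['296', 'a', 'am']}, dtype=object)

# Exact-match replacement: only cells equal to 'a' become NaN.
toy_df.replace('a', np.nan, inplace=True)

new_nulls = toy_df.isnull().sum()
print(new_nulls)
print(toy_df['TAX'].tolist())
```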

In [None]:
## STARTER CODE

# run to see if you've created new null values
boston_dirty_df.isnull().sum()

In [44]:
## SOLUTION CODE
boston_dirty_df.replace('a', np.nan, inplace=True)

# run to see if you've created new null values
boston_dirty_df.isnull().sum()

<!-- HINT for 4.1 -->
Use `np.nan` as the value to substitute in.

<!-- TASK 4.2 -->
# Use regex to replace values

The code we previously ran works well when we know exactly which values to replace. In most cases we won't know which values need replacing, or there will be so many that handling them case by case isn't feasible. Then it's usually easiest to use a regex that matches alphabetic characters. In this example, replace the 'a' string with a regex for all alphabetic characters and set the `regex` parameter of `.replace()` to `True`. You should see that we've created a few more `NaN` values in the TAX and PTRATIO columns.
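The regex variant can be sketched the same way on made-up data (`toy_df` is an assumption for this example). When the replacement value is `np.nan` rather than a string, any string cell that contains a match is replaced as a whole, and numeric columns are left untouched.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    'TAX': pd.Series(['296', 'a', 'am'], dtype=object),  # strings, two with letters
    'B': [396.9, 392.83, 394.63],                        # already numeric
})

# Any string cell containing a letter becomes NaN.
toy_df.replace(r'[A-Za-z]', np.nan, regex=True, inplace=True)

tax_nulls = toy_df['TAX'].isnull().sum()
print(toy_df.isnull().sum())
```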

In [None]:
## STARTER CODE

# run to see if you've created new null values
boston_dirty_df.isnull().sum()

In [46]:
## SOLUTION CODE
boston_dirty_df.replace(r'[A-Za-z]', np.nan, regex=True, inplace=True)

# run to see if you've created new null values
boston_dirty_df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        1
PTRATIO    3
B          0
dtype: int64

<!-- HINT for 4.1 -->
The regex for alphabetic characters looks something like this: `r'[A-Za-z]'`. Also don't forget to set `regex=True`.

<!-- TASK 4.3 -->
# Drop the NaN values

As we've done before, now we're going to drop the NaN values we've just created. Use the `.dropna()` function on the entire dataframe to drop all the remaining NaN values. Set `inplace=True` as a parameter to `.dropna()` to keep working with the same dataframe.

In [None]:
## STARTER CODE

# run to see if you've removed all NaN values
boston_dirty_df.isnull().sum()

In [47]:
## SOLUTION CODE
boston_dirty_df.dropna(inplace=True)

# run to see if you've removed all NaN values
boston_dirty_df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
dtype: int64

<!-- HINT for 4.2 -->
Call `.dropna()` on the whole dataframe and don't forget to set `inplace=True`.

<!-- CONCLUSION -->
Congratulations, you've worked through a few very common problems when dealing with datasets!

## Other Helpful Data Cleaning Functions Not Covered Above
* `pd.notnull(s)` - opposite of `pd.isnull(s)`
* `df.dropna(axis=1)` - drop all columns that contain null values
* `s.replace([1,3],['one','three'])` - replace all 1 with 'one' and 3 with 'three'
* `df.rename(columns=lambda x: x + 1)` - mass renaming of columns
* `df.set_index('column_one')` - change the index
* `df.rename(index=lambda x: x + 1)` - mass renaming of index
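Several of the helpers listed above can be tried on toy data. The series `s`, the dataframe `toy_df`, and the lambdas are made up for illustration; the column lambda upper-cases names because `x + 1` only works when the labels are numbers.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3])
toy_df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})

not_null = pd.notnull(toy_df['a'])                        # True where a value is present
no_null_cols = toy_df.dropna(axis=1)                      # drops column 'a', which has a NaN
words = s.replace([1, 3], ['one', 'three'])               # element-wise value replacement
upper_cols = toy_df.rename(columns=lambda name: name.upper())  # rename every column
shifted_index = toy_df.rename(index=lambda i: i + 1)      # rename every index label
reindexed = toy_df.set_index('b')                         # use column 'b' as the index

print(list(no_null_cols.columns), list(upper_cols.columns), list(shifted_index.index))
```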