# Hands-On Assignment 1

In this assignment, you will pre-process and analyze synthetic data (i.e., data produced by an algorithm rather than collected from the real world).

The objective of this assignment is for you to learn about:
 - Data manipulation (selecting, adding, and removing rows and columns).
 - Data exploration (understanding the structure and contents of a dataset).
 - Data selection (filtering rows and columns).
 - Feature engineering (pre-processing data for use with machine learning).
 - Basic data visualization (e.g., plotting data to explore functional relationships between variables).
 - Working with mathematical equations and turning them into code.

Throughout this course, we will be using the [Pandas library](https://pandas.pydata.org/) to manipulate data.
This library is very large and can be quite complex,
but we will cover the basics of Pandas in this lesson.
If you want additional information or practice,
we recommend the official [Pandas Tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html).

The data we will use in this lesson is located in this repository in a file called `synthetic_covid_data.csv`.

## Synthetic Covid-19 Data

*The data/scenario used for the following exercise is entirely fictional and intended for instructional purposes only.*

Sainte Croix University has developed a new, cheap, rapid antigen test for SARS-CoV-2 with potentially high sensitivity
(it can detect even trace amounts of a certain protein of the SARS-CoV-2 virion).
However, the protein that the test detects is also a human **isoantigen**:
it is already present in some subset of humans.
We will call those individuals who have the isoantigen without Covid infection "isoantigenic".

The antigen test yields a **titer** value:
the number of times a serum extracted from a person can be diluted before the antigen is no longer detectable by the test.
The higher the value of the titer, the more prevalent the antigen must be in the serum.
Note that titer values are recorded as integers.

Synthetic data, representing an unbiased sample of the population in Sainte Croix County
who *do not* have current symptoms and were not previously infected with Covid-19 at the time of testing,
has been included in this repository as `synthetic_covid_data.csv`.
This data includes titer values and, if the patient became symptomatic within 14 days of the test, the number of days after the test that symptoms appeared.

The following cell imports Pandas and loads the data into a Pandas DataFrame.

In [None]:
import pandas

# Load the file "synthetic_covid_data.csv" into a pandas DataFrame.
# index_col refers to the column name in the csv (comma separated values)
# file that we will treat as an index (identifier for each example/row).
covid_data = pandas.read_csv('synthetic_covid_data.csv', index_col = 'id')

# Print out the DataFrame.
# Note that it only prints a summary if there are too many rows.
covid_data

## Part 0: Dataframe Manipulation

Throughout this course, you will be heavily using [Pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)
(a dataframe is also frequently just called a "frame").
Running the cell above will print out a summary of the frame we created.
Note that we can also look at it as a [table](https://en.wikipedia.org/wiki/Table_(information)),
with rows of values organized into columns, where each column shares a specific data type and interpretation.

This section provides a small overview of some of the common operations you will be using with Pandas DataFrames.

### Column Selection

To select a single column from a frame, you can just index it like you would a normal Python dict/map.
For example, to select only the `infected` column we would do the following:

In [None]:
covid_data['infected']

When a single column is selected from a DataFrame,
Pandas will return the result as a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) instead of a DataFrame.

You can also select multiple columns at the same time by using a list of column names instead of just a single column name:

In [None]:
covid_data[['infected', 'symptomatic']]

When multiple columns are selected, a DataFrame will be returned.
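
The difference between the two return types can be checked directly. The following is a minimal sketch on a small made-up frame (`toy_frame` and its values are hypothetical, not the assignment's data):

```python
import pandas

# Hypothetical toy data, not the assignment's dataset.
toy_frame = pandas.DataFrame({
    'titer': [0, 12, 40],
    'infected': [False, False, True],
})

# Indexing with a single name returns a Series.
single_column = toy_frame['titer']

# Indexing with a list of names returns a DataFrame,
# even when the list holds only one name.
several_columns = toy_frame[['titer']]

print(type(single_column).__name__)
print(type(several_columns).__name__)
```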

<h4 style="color: darkorange";>★ Task 0.A</h4>

Your task now is to complete the function below.
This function takes two arguments: a frame and a column name;
and returns the column from the frame with the given name.

In [None]:
def select_column(frame, column_name):
    return NotImplemented

print("Selected column 'symptomatic':")
select_column(covid_data, 'symptomatic')

### Row Selection

To select rows from a DataFrame, we will generally use the "indexing" syntax.
You first select a column, and then you make an expression using that column.
Rows where the expression is true are returned.
Most simple Python expressions can be used.

(Note that the [DataFrame.loc()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) method allows for much more complex selection, but it is outside the scope of this assignment.)

For example, the following cell will select all the rows where individuals have a `titer` value of 32.
Note that we referenced `covid_data` twice: once to select the column (`covid_data['titer']`) and again to reference the frame that we want to select data from (the outer use of `covid_data`).

In [None]:
covid_data[covid_data['titer'] == 32]

We can also use more expressions than just equals (`==`):

In [None]:
# You can use inequalities.
print("titer > 32")
print(covid_data[covid_data['titer'] > 32])

print('---')

# You can use more complex expressions with different columns.
# `&` is used for "and", and `|` is used for "or".
# Compare with `==` (or negate with `~`); `is False` does not compare row-by-row.
print("asymptomatic and (titer > 40)")
print(covid_data[(covid_data['symptomatic'] == False) & (covid_data['titer'] > 40)])

print('---')

# You can also use `~` to negate a condition (note the parens).
print("not (titer > 32)")
print(covid_data[~(covid_data['titer'] > 32)])

As you create more complex expressions to select data,
you may find some cases or operators that Pandas does not support (which normal Python does support).
This is because Pandas overrides the normal Python operators (`==`, `<`, `~`, etc.) when a Pandas object (frame/series) is involved, but Python keywords such as `and`, `or`, `not`, and `is` cannot be overridden and so do not work row-by-row.
You should always test your selection expression and may have to play around with it a bit.
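
To see what this means in practice, here is a small sketch on made-up data (`toy_frame` is hypothetical and only shares column names with the assignment's data). It shows that a comparison produces a Series of booleans (often called a mask), that masks combine with `&`, and what happens when the keywords `is` and `and` are used instead:

```python
import pandas

# Hypothetical toy data, not the assignment's dataset.
toy_frame = pandas.DataFrame({
    'titer': [10, 32, 45, 50],
    'symptomatic': [True, False, False, True],
})

# Each comparison yields one boolean per row.
high_titer = toy_frame['titer'] > 40
asymptomatic = ~toy_frame['symptomatic']

# `&` combines the two masks row-by-row.
selected = toy_frame[asymptomatic & high_titer]
print(selected)

# `is` checks object identity, so this is a single plain bool, not a mask.
not_a_mask = toy_frame['symptomatic'] is False
print(type(not_a_mask).__name__, not_a_mask)

# `and` asks for the truth value of a whole Series, which Pandas refuses to guess.
try:
    toy_frame[asymptomatic and high_titer]
    and_failed = False
except ValueError as error:
    and_failed = True
    print("ValueError:", error)
```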

<h4 style="color: darkorange";>★ Task 0.B</h4>

Your task is to complete the function below.
This function takes three arguments: a frame, a column name, and a value;
and returns the rows of the frame where the value in the supplied column matches the given value.

In [None]:
def filter_rows(frame, column_name, value):
    return NotImplemented

print("Filtered rows where 'titer' == 5:")
filter_rows(covid_data, 'titer', 5)

### Creating a new DataFrame

Empty frames can be created simply using the default constructor: `new_frame = pandas.DataFrame()`.

However, the easiest way to create a populated frame is to construct it from a dictionary that already has your data in it.
The keys should be the column names, and the values should be lists of values you want in each of the frame's columns.

For the remaining exercises in this section, we will be creating a new frame (`test_frame`) with some test data.

In [None]:
# Define a dictionary containing the data we want to add.
test_data = {
    'Name': ['Andrew', 'Eriq', 'Reilly', 'Michael'],
    'Surfing Score': [5.3, 5.9, 5.1, 5.2],
    'Qualification': ['MS', 'PhD', 'PhD', 'PhD'],
}

# Create the DataFrame from the test data.
test_frame = pandas.DataFrame(test_data)

# Observe the result.
test_frame

### Adding Columns

To add a column to an existing frame, you can just select the column (even if it does not exist) and assign a list of values to it.
For example:
```
some_frame['column_name'] = [1, 2, 3]
```
This also works for replacing an entire existing column.

Suppose we wanted to add a new column ("State") to our new test frame.
We can use the following code to add the new column.

In [None]:
# Declare a list of the values we want to put in the column (in order).
states = ['California', 'California', 'Arizona', 'Oregon']

# Assign the values into the frame using the new column name ('State').
test_frame['State'] = states

# Observe the result.
test_frame
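
Two details of column assignment are worth trying out: assigning to an existing name replaces that column, and the list must have one value per row. The sketch below uses a made-up frame (`toy_frame` and its values are hypothetical):

```python
import pandas

# Hypothetical toy data.
toy_frame = pandas.DataFrame({'Name': ['Ana', 'Ben', 'Cy']})

# Add a new column, then replace it by assigning to the same name.
toy_frame['Score'] = [1, 2, 3]
toy_frame['Score'] = [10, 20, 30]
print(toy_frame)

# A list with the wrong number of values is rejected.
try:
    toy_frame['Too Short'] = [1, 2]
    length_mismatch_raised = False
except ValueError as error:
    length_mismatch_raised = True
    print("ValueError:", error)
```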

<h4 style="color: darkorange";>★ Task 0.C</h4>

Your task is to now complete the function below.
This function takes three arguments: a frame, a column name, and a list of values;
and returns the modified frame with the new column.

In [None]:
def add_column(frame, column_name, new_list):
    return NotImplemented

print("Added a column 'Patient ID' that is filled with ints:")
add_column(covid_data, 'Patient ID', list(range(len(covid_data))))

### Removing Columns

You can remove a column from a frame using the [DataFrame.pop()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html) method.
Just specify the column that you want to remove, and the column will be removed from the frame and returned to you.

For example, if we wanted to remove the "Surfing Score" column from our test frame, we could do the following:

In [None]:
# Drop the surfing score column.
old_column = test_frame.pop("Surfing Score")

# Observe the result.
test_frame
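
Note that `pop()` changes the frame in place and hands back the removed column as a Series. A minimal sketch on made-up data (`toy_frame` is hypothetical):

```python
import pandas

# Hypothetical toy data.
toy_frame = pandas.DataFrame({
    'Name': ['Ana', 'Ben'],
    'Score': [1.5, 2.5],
})

# The frame itself is modified; the removed column is returned.
removed = toy_frame.pop('Score')

print(list(toy_frame.columns))
print(list(removed))
```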

<h4 style="color: darkorange";>★ Task 0.D</h4>

Your task is to now complete the function below.
This function takes two arguments: a frame and a column name;
and returns the modified frame which no longer has the specified column.

In [None]:
def drop_column(frame, column_name):
    return NotImplemented

print("Removed the 'Patient ID' column added in the previous cell (if it exists):")
if ('Patient ID' in covid_data.columns):
    drop_column(covid_data, 'Patient ID')
covid_data

### Concatenating Frames

Putting together two data frames is a pretty complex task, and there are a lot of things to consider, like:
 - What if there are some columns in one frame and not the other?
 - What if there are duplicate rows?
 - What if there are missing values?
 
As this class progresses, you will become more equipped to handle these different situations.
But for now, we will discuss the most simple case: adding all the rows from two or more frames into a single frame.
To do this, we can use the [pandas.concat()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) function.
Just specify a list of the frames you want to combine as arguments.

For example, the cell below creates a new frame with more test data and adds it to our test frame.

In [None]:
# Additional data that we want to add.
additional_data = {
    'Name': ['Alice', 'Zack'],
    'Qualification': ['BS', 'BS'],
    'State': ["Georgia", "Hawaii"],
}
additional_frame = pandas.DataFrame(additional_data)

# Combine the two frames together.
# Note that our original frame is unchanged.
# The `ignore_index` parameter is used to keep the internal row numbers consistent
# (try without this parameter and see how the result changes).
new_frame = pandas.concat([test_frame, additional_frame], ignore_index = True)

# Observe the result.
new_frame
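
The effect of `ignore_index`, and what happens to a column that exists in only one of the frames, can be seen on a pair of tiny made-up frames (`first` and `second` are hypothetical names used only for this sketch):

```python
import pandas

# Hypothetical toy frames; `second` has no 'Score' column.
first = pandas.DataFrame({'Name': ['A', 'B'], 'Score': [1.0, 2.0]})
second = pandas.DataFrame({'Name': ['C']})

# Without ignore_index, each frame keeps its own row labels.
kept = pandas.concat([first, second])

# With ignore_index, the rows are renumbered from 0.
renumbered = pandas.concat([first, second], ignore_index = True)

print(list(kept.index))
print(list(renumbered.index))

# Rows that came from the frame without 'Score' get a missing value (NaN).
print(renumbered['Score'].isna().tolist())
```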

<h4 style="color: darkorange";>★ Task 0.E</h4>

Your task is to now complete the function below.
This function takes two arguments: a frame and another frame;
and returns a new frame that combines the two passed in frames.

In [None]:
def concat_frames(frame1, frame2):
    return NotImplemented

print("Combined one frame with only uninfected individuals and one with only infected individuals:")
concat_frames(covid_data[~covid_data['infected']], covid_data[covid_data['infected']])

### Iterating Over Rows

You can do a lot in Pandas just by using its built-in functions.
But sometimes, you just need to get the raw data and work with it yourself.
In these cases, it can be useful to iterate over each row in a frame.

Like with many things in Pandas, there are many different ways to iterate over rows.
We recommend either using [DataFrame.index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html) to get the index for each row,
or [DataFrame.iterrows](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) to get each row.

In [None]:
# Get each index and use that to index into each row.
# Note that the column, not the row, is indexed first.
for index in covid_data.index:
    if ((not covid_data['symptomatic'][index]) and (covid_data['titer'][index] > 44)):
        print("Special Row Index: ", index)

print('---')

# Get each index and each row.
for (index, row) in covid_data.iterrows():
    if ((not row['symptomatic']) and (row['titer'] > 44)):
        print("Special Row Index: ", index)
        print("Full Row:")
        print(row)
        print('###')
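
Both approaches visit the same rows, and the `index` values are the frame's row labels (which need not be positions). The sketch below checks this on made-up data with string labels (`toy_frame`, its columns, and its labels are hypothetical):

```python
import pandas

# Hypothetical toy data with string row labels.
toy_frame = pandas.DataFrame(
    {'score': [1, 3, 5]},
    index = ['a', 'b', 'c'],
)

# Approach 1: loop over the index and look up each value by label.
labels_by_index = []
for index in toy_frame.index:
    if (toy_frame['score'][index] > 2):
        labels_by_index.append(index)

# Approach 2: loop over (label, row) pairs.
labels_by_iterrows = []
for (index, row) in toy_frame.iterrows():
    if (row['score'] > 2):
        labels_by_iterrows.append(index)

print(labels_by_index)
print(labels_by_iterrows)
```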

### Useful Functions

The DataFrame class has **MANY** [methods associated with it](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).
Throughout this course, make sure to consult the documentation before implementing new functionality.

Below are a few methods that you may find useful in this and future assignments:
 - Number of Rows -- You can use the builtin function `len()` on a frame the same way that you can use it on a list or dict.
     The related [DataFrame.count()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html) returns the number of non-missing values in each column, which can be smaller than the number of rows.
 - Column Names -- You can get the available column names using [DataFrame.columns](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.columns.html).
 - Basic Frame Info -- [DataFrame.info()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) can be used to output basic information about the structure of the frame.
 - Numeric Aggregates -- The DataFrame class has
     [sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html),
     [min()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.min.html),
     [max()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html),
     [median()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.median.html),
     and [mean()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.mean.html) methods.
 - Plotting Functions -- The DataFrame class has premade methods for plotting (using [matplotlib](https://matplotlib.org/)).
     For example, [DataFrame.hist()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) can be used to make a histogram
     and [DataFrame.plot.scatter()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html) can be used to make a scatter plot.
     These methods are not as full-featured as using matplotlib directly, but they can give you a fast and simple way to visualize a frame.
     For more information, see [this reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).

Below are some examples of these functions in use.

In [None]:
# Get the number of rows in the frame.
print("Number of rows: ", len(covid_data))

# Get the number of rows we selected.
print("Number of selected rows: ", len(covid_data[covid_data['titer'] == 32]))
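
`len()` and `DataFrame.count()` only disagree when values are missing. A minimal sketch on made-up data (`toy_frame` is hypothetical and merely reuses two of the assignment's column names):

```python
import pandas

# Hypothetical toy data; None marks a missing value.
toy_frame = pandas.DataFrame({
    'titer': [3, 8, 21],
    'days_before_symptoms': [None, 4.0, None],
})

# len() counts rows, whether or not they contain missing values.
row_count = len(toy_frame)

# count() reports the non-missing values in each column.
non_missing = toy_frame.count()

print(row_count)
print(non_missing)
```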

In [None]:
# Get the column names.
print(covid_data.columns)

# Iterate over the column names as strings.
print([column_name for column_name in covid_data.columns])

In [None]:
# Get basic information about the frame.
covid_data.info()

In [None]:
# Aggregate over a single column.
print("Average days before symptoms: ", covid_data['days_before_symptoms'].mean())

# When called on a full frame, the aggregate is applied to each column.
# Note that boolean columns are treated as numeric values for aggregation
# (e.g. False -> 0.0 and True -> 1.0).
frame_aggregate = covid_data.mean()
print("Type of result when aggregating a full DataFrame: ", type(frame_aggregate))
frame_aggregate
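
Two behaviors of `mean()` are easy to verify on small made-up Series (the values below are hypothetical): missing values are skipped by default, and booleans are averaged as 0s and 1s.

```python
import pandas

# Hypothetical values; None becomes a missing value (NaN).
days = pandas.Series([2.0, None, 4.0])
flags = pandas.Series([True, False, True, True])

# The missing value is skipped: (2.0 + 4.0) / 2.
days_mean = days.mean()

# Booleans count as 0 and 1: (1 + 0 + 1 + 1) / 4.
flags_mean = flags.mean()

print(days_mean)
print(flags_mean)
```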

In [None]:
# Display a histogram of values for different columns in a frame.
# Note that only numeric columns are shown by default.
covid_data.hist()

In [None]:
# You can also select just the rows and column you want to see.
# Here we first select just the rows with (titer = 32),
# then we take just the 'days_before_symptoms' column and turn it into a histogram.
covid_data[covid_data['titer'] == 32]['days_before_symptoms'].hist()

## Part 1: Data Exploration

Now that we have covered some of the basics of Pandas DataFrames,
we will use those techniques to explore our data.

For all of the following questions, you can assume that the DataFrame your function will be receiving is structured the same (will have the same columns)
as the Covid-19 DataFrame we have been using up to this point (the one created in the first Python cell in this notebook).

<h3 style="color: darkorange";>★ Task 1.A</h3>

Complete the function below that counts the number of **infected** individuals.

In [None]:
def count_infected(frame):
    return NotImplemented

print("Number of infected individuals: ", count_infected(covid_data))

<h3 style="color: darkorange";>★ Task 1.B</h3>

Complete the function below that counts the number of **symptomatic** individuals.

In [None]:
def count_symptomatic(frame):
    return NotImplemented

print("Number of symptomatic individuals: ", count_symptomatic(covid_data))

<h3 style="color: darkorange";>★ Task 1.C</h3>

Complete the function below that computes the mean **days_before_symptoms** for individuals that have had symptoms (you may ignore rows with no value for this column).

In [None]:
def mean_days(frame):
    return NotImplemented

print("Mean number of days before symptoms: ", mean_days(covid_data))

---

## Part 2: Data Selection

In this part, we will now ask more complex questions that require selecting specific collections of rows.
For all exercises, do not round your answers.

<h3 style="color: darkorange";>★ Task 2.A</h3>

Complete the function below that computes the fraction of individuals that are infected.

In [None]:
def fraction_infected(frame):
    return NotImplemented

print("Fraction of infected individuals: ", fraction_infected(covid_data))

<h3 style="color: darkorange";>★ Task 2.B</h3>

Complete the function below that computes the fraction of infected individuals that are also symptomatic.

In [None]:
def fraction_symptomatic(frame):
    return NotImplemented

print("Fraction of infected individuals that are symptomatic: ", fraction_symptomatic(covid_data))

<h3 style="color: darkorange";>★ Task 2.C</h3>

Complete the function below that computes the number of uninfected individuals that have titers between 3 and 13 (exclusive).

In [None]:
def count_special_uninfected(frame):
    return NotImplemented

print("Number of uninfected with titers in (3, 13): ", count_special_uninfected(covid_data))

<h3 style="color: darkorange";>★ Task 2.D</h3>

Complete the function below that computes the fraction of uninfected individuals that are isoantigenic (**not infected** and have a titer value >= 1).

In [None]:
def fraction_isoantigenic(frame):
    return NotImplemented

print("Fraction of uninfected individuals that are isoantigenic: ", fraction_isoantigenic(covid_data))

---

## Part 3: Feature Engineering

[Feature Engineering](https://en.wikipedia.org/wiki/Feature_engineering) is a **HUGE** part of machine learning that involves picking out (or transforming) the data that will be most useful to your algorithms.
In future assignments, we will cover feature engineering in much greater detail.
For now, let's just create a single simple feature.

<h3 style="color: darkorange";>★ Task 3.A</h3>

Complete the function below that adds a new column to the given frame.
This column should be labeled "isoantigenic",
and should contain boolean values describing individuals who are isoantigenic (**not infected** and have a titer value >= 1).

Hint: Although Pandas can do this in one line, it is much easier to first iterate over each row to compute a value and then add these values as a new column.

In [None]:
def add_isoantigenic_column(frame):
    return NotImplemented

add_isoantigenic_column(covid_data)
print("Frame with added column:")
covid_data

---

## Part 4: Plotting

In this section, we will work with some basic visualization capabilities built into Pandas.
For a more complete reference, see [this reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html).
Keep in mind that in most applied settings, you'll probably see more customizable methods using [matplotlib](https://matplotlib.org/) or [seaborn](https://seaborn.pydata.org/).

[Histograms](https://en.wikipedia.org/wiki/Histogram) are a good visualization to start with, since they can help give you a rough idea about specific columns in your data.
Creating a histogram from a frame is very simple:
just select the column you want to examine and call [DataFrame.hist()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html):

In [None]:
covid_data['titer'].hist()

You can also have more fine-grained control over how the data is grouped together and counted (just consult [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.hist.html) for the options you can configure):

In [None]:
covid_data['titer'].hist(bins = 3)

After histograms, [scatter plots](https://en.wikipedia.org/wiki/Scatter_plot) are the next go-to visualization for data.
Histograms are good for giving us an idea about a single dimension/column of the data,
and scatter plots are good for giving us an idea about how two dimensions/columns of the data interact.

For example, we can look at how the `days_before_symptoms` and `titer` columns of our dataset interact with one another:

In [None]:
covid_data.plot.scatter(x = 'days_before_symptoms', y = 'titer')

<h3 style="color: darkorange";>★ Task 4.A</h3>

Complete the function below that takes in a frame and prepares it to be rendered as a scatter plot.
The function takes in a frame, two column names, and two labels for the x and y axis of the scatter plot.
The function should return a frame that can then be turned into the scatter plot we want by calling `.plot.scatter(x = 0, y = 1)` on it.
(If you are not sure what those options are, then make sure to consult [the documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.scatter.html).)

In [None]:
def prep_scatter(frame, x_column, y_column, x_label, y_label):
    return NotImplemented

# Prep the data to be displayed as a scatter plot.
scatter_frame = prep_scatter(covid_data,
                             'days_before_symptoms', 'titer',
                             'Number of Days Before Symptoms', 'Titer Level')

# Display the scatter plot.
scatter_frame.plot.scatter(x = 0, y = 1)

---

## Part 5: Making Sense of Equations

Throughout this course (and many of your CS (or math) courses going forward),
you will be presented with mathematical equations that you will need to understand.
You may have to just read these equations, apply these equations on pen and paper,
or translate these equations into code.
If you are not used to working with equations, that last task may be a bit intimidating.
In this part, we will walk through an example of breaking down an equation which you will then translate into code.

To start, let's first imagine that we have some way of predicting whether someone has Covid-19 based on the data we have been working with in this assignment.
In future assignments, we will dive into much more rigorous, accurate, and cool ways to do this,
but for now let's use the following naive function:

In [None]:
def predict_covid(titer):
    return titer > 20

This function is simple and will return true (indicating the function thinks the patient is infected) if the titer value is greater than 20.
Otherwise, the prediction function will return false (indicating the function thinks the patient is not infected).
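
The behavior at the threshold is worth checking by hand. The sketch below repeats the function so that it runs on its own and applies it to a few made-up titer values (`sample_titers` is hypothetical):

```python
def predict_covid(titer):
    return titer > 20

# Hypothetical titer values chosen to straddle the threshold.
sample_titers = [5, 20, 21, 64]
sample_predictions = [predict_covid(titer) for titer in sample_titers]

# A titer of exactly 20 is predicted as not infected, since 20 > 20 is False.
print(sample_predictions)
```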

Now that we have something that makes predictions (even though it is overly simple),
we can evaluate how good our predictions are.
To evaluate our function's performance, we can use [evaluation metrics](https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers#Single_metrics),
which are numbers that quantify predictive performance (how well our predictions match reality).
There are [dozens of evaluation metrics](https://scikit-learn.org/stable/modules/model_evaluation.html),
but for this example we will use [root mean square error (RMSE)](https://en.wikipedia.org/wiki/Root-mean-square_deviation) (also sometimes called root mean square deviation (RMSD)).

The equation for RMSE is:
$$
\sqrt{  \frac{  \sum\nolimits_{i = 1}^{N} ( \hat{y}_i - y_i )^2 }{ N } }
$$

This equation may look intimidating, but let's break the equation down piece-by-piece.

First, we see a big square root covering everything.
We already know how to use a square root, so let's put that aside for now and simplify the equation:
$$
\frac{  \sum\nolimits_{i = 1}^{N} ( \hat{y}_i - y_i )^2 }{ N }
$$

Now we see a fraction with a summation in the numerator and a single variable ($ N $) as the denominator.
Note that the summation goes from $ i = 1 $ to $ N $, so it is adding up $ N $ different values.
Since we are adding up $ N $ values and $ N $ is also the denominator,
this fraction looks like it is computing the mean of whatever the summation is adding up.
Since we also know how to compute a mean, let's put that part aside and focus on the summation:
$$
\sum\nolimits_{i = 1}^{N} ( \hat{y}_i - y_i )^2
$$

The summation is using $ i $ as its incrementing variable and is going from $ 1 $ to $ N $.
We can also see that the value inside the summation is squared.
We can easily deal with squaring values, so let's simplify again:
$$
\hat{y}_i - y_i
$$

Here, we are finally left with a simple expression, but we have to make sure we understand what these two values represent.
In machine learning equations (as you have already seen in class),
we represent labels/classes with the letter $ y $.
By itself $ y $ usually represents a true label,
and with a hat $ \hat{y} $ usually represents a predicted label.
Therefore, our expression ($ \hat{y}_i - y_i $) is taking the difference between the predicted label ($ \hat{y} $) and the true label ($ y $) for the ith data point.
We often call this (the difference between true and predicted values) the *error* of a prediction.

So to summarize what we discovered starting from the inner-most part of the equation and working out:
 - First, the equation computes the **error** for each prediction.
 - Then, the **square** of that error is computed.
 - The **mean** of all those squared errors is computed using the summation and division.
 - Then finally, the equation takes the square **root** of that mean.

When we phrase it like this,
you can see why this metric is called "root mean square error".
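
To make these four steps concrete, here is a small worked example with made-up numbers: three hypothetical predictions $ \hat{y} = (3, 5, 2) $ and true values $ y = (1, 5, 4) $ (these are not taken from the assignment's data):
$$
\sqrt{ \frac{ (3 - 1)^2 + (5 - 5)^2 + (2 - 4)^2 }{ 3 } }
= \sqrt{ \frac{ 4 + 0 + 4 }{ 3 } }
= \sqrt{ \frac{ 8 }{ 3 } }
\approx 1.633
$$
Note that the third error is negative ($ 2 - 4 = -2 $), but squaring makes it contribute the same amount as the first error.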

By breaking up this equation into smaller chunks,
we were able to understand each part individually and then put them all back together into the full equation.
And now that we understand the equation, we can implement it in code.

Here are some tips to use when translating equations into code:
 - Look for summations; these usually indicate loops.
 - Tend to start with the inside of the equation, since the outer parts of an equation are evaluated last.
 - Don't be afraid to break up your implementation using more code functions (just like how some equations call into other equations/functions).
 - Be careful about order of operations.
 - Even though equations will usually use short variable names, longer and more descriptive names can be more useful in code (e.g. using `predicted_label` is more readable than `y_hat`).
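
As an illustration of the first tip, consider a simpler expression than RMSE, the sum of squares $ \sum\nolimits_{i = 1}^{N} x_i^2 $. The sketch below is one possible translation (the function name `sum_of_squares` and the sample values are made up for this example); the summation becomes a loop and the inner-most term becomes the loop body:

```python
def sum_of_squares(values):
    # The summation becomes a running total...
    total = 0

    # ...and each term of the summation becomes one loop iteration.
    for value in values:
        total += value ** 2

    return total

# Hypothetical sample values: 1 + 4 + 9.
example_total = sum_of_squares([1, 2, 3])
print(example_total)
```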

<h3 style="color: darkorange";>★ Task 5.A</h3>

Complete the function below that takes in two lists and computes the RMSE between them.
The lists will always have the same number of values in them and will never be empty.

*Note: If you want to use the math library (like for `math.sqrt()`), make sure to import it in the first code cell of this notebook (where pandas is also imported).*

In [None]:
def rmse(predicted_values, true_values):
    return NotImplemented

# Test our new function on simple data.
predictions = [1, 1, 0, 0]
labels = [1, 0, 1, 0]

rmse(predictions, labels)

We can also use our new function on our Covid-19 data, but it will require a little work to prepare the data:

In [None]:
# Make predictions for each row in our Covid-19 data.
predictions = [predict_covid(value) for value in covid_data['titer']]

# Get the actual labels (infected status) for our data and convert it into a list.
labels = list(covid_data['infected'])

# Right now, all our predicted and true values are booleans.
# Convert them into ints so we can do math on them (False = 0, True = 1).
predictions = list(map(int, predictions))
labels = list(map(int, labels))

rmse(predictions, labels)

In [None]:
covid_data