# Loading and examining data

Accessing data is the very first step of any analysis. Tabular data, which can be stored in CSV files, is one of the most common types of data, so it is vital that you know how to handle it.

In this exercise, you will load some patient data of body temperatures and heart rates from a CSV file. You used this body temperature data in chapter 2, where you converted it from Fahrenheit to Celsius. Now that you know how to use DataFrames, you can load it yourself!
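The loading steps can be tried without the course file. The sketch below is only an illustration: it writes a small, made-up CSV to a temporary file (the values are hypothetical, and only the column names `bodytemp` and `heartrate` come from the exercises) and then reads it back in two ways.

```julia
using CSV
using DataFrames

# Hypothetical data: write a tiny CSV to a temporary file so the sketch is self-contained
path = tempname() * ".csv"
write(path, "bodytemp,heartrate\n98.6,72\n97.9,80\n99.1,65\n")

# CSV.File parses the file; DataFrame() turns the parsed rows into a table
df_demo = DataFrame(CSV.File(path))

# CSV.read performs both steps in a single call
df_demo2 = CSV.read(path, DataFrame)

println(first(df_demo, 2))
```

Both calls should give the same table, so the choice between them is mostly a matter of taste.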

```
# Import packages
using CSV
using DataFrames

# Load CSV File
file_contents = CSV.File("patients.csv")

# Convert to dataframe
df_patients = DataFrame(file_contents)

# Print the first 5 rows of the DataFrame
println(first(df_patients, 5))
```

# Creating a DataFrame

You are examining the students' grade data, which is available in your environment as `grades_array`. Each array element is a string of 4 letters for a student's mathematics, history, science, and drama grades.

In the last chapter, you wrote a function called `get_gradenumber(grades, n)` to select the nth grade from the grade string `grades`. The function's first argument is the grade string, and the second argument is the position `n`, from 1 to 4. The grades in each string are for mathematics, history, science, and drama, in that order.
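The original definition of `get_gradenumber` is not shown here, so the following is only a guess at a minimal version, with made-up grade strings. It assumes the function simply returns the nth character of the string, and shows how the dot syntax applies it to every student at once.

```julia
# Hypothetical reconstruction of the helper from the previous chapter:
# return the nth character of a grade string
get_gradenumber(grades, n) = grades[n]

# Hypothetical sample data: one 4-letter string per student
grades_array = ["ABCA", "BBAD", "CABB"]

# The dot broadcasts the function over every element of the array
math_grades = get_gradenumber.(grades_array, 1)
drama_grades = get_gradenumber.(grades_array, 4)

println(math_grades)
println(drama_grades)
```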

The `DataFrames` package has been imported for you with the `using` keyword.

```
# Extract the mathematics grades
math_grades = get_gradenumber.(grades_array, 1)

# Create the DataFrame
df_grades = DataFrame(
    mathematics=math_grades
)

println(first(df_grades, 5))
```

```
# Create the DataFrame
df_grades = DataFrame(
    mathematics=get_gradenumber.(grades_array, 1),
    history=get_gradenumber.(grades_array, 2),
    science=get_gradenumber.(grades_array, 3),
    drama=get_gradenumber.(grades_array, 4),
)

println(first(df_grades, 5))
```

# DataFrame properties

One of the benefits of DataFrames is that they allow us to work with big datasets much more easily than with arrays. These datasets can be very long, with many rows, and very wide, with many columns.

In this chapter, you will work with the `"books.csv"` dataset. Each row corresponds to information about a single book and its ratings on a book review website.

In this exercise, you'll explore the dataset to familiarize yourself with its columns and length. You'll analyze it in more depth later.

The `DataFrames` and `CSV` packages have been imported for you with the `using` keyword.

```
# Load the book review data
df_books = DataFrame(CSV.File("books.csv"))

# Print column names
println(names(df_books))

# Find number of rows and columns
println(size(df_books))
```

# Indexing DataFrames

DataFrames are powerful tools for analyzing data, but you need to be able to select the information you need from them. Let's practice selecting data here.

The DataFrames `df_patients`, `df_grades`, and `df_books` are available in your environment.
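As a quick reference, the three indexing patterns used in this exercise can be sketched on a small, made-up DataFrame (the table and its column names are hypothetical, not the course data):

```julia
using DataFrames

# Hypothetical data for illustration
df_demo = DataFrame(title=["Emma", "Dracula", "Ulysses"], year=[1815, 1897, 1922])

# All rows of one column: returns a vector (a copy of the column)
years = df_demo[:, "year"]

# One row, all columns: returns a DataFrameRow
second_row = df_demo[2, :]

# One row and one column: returns a single value
title = df_demo[3, "title"]

println(years)
println(second_row.title)
println(title)
```

Writing `df_demo[!, "year"]` instead of `df_demo[:, "year"]` returns the column itself rather than a copy, which matters only if you plan to modify the result.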

```
# Select the body temperature column
body_temps = df_patients[:, "bodytemp"]

println(body_temps)

# Select the third row of df_grades
third_grades = df_grades[3, :]

println(third_grades)

# Select the title of the book in row 710
book_title = df_books[710, "title"]

println("The book is $book_title")
```

# Slicing DataFrames

You are working with the books DataFrame again. It has 12 columns, which makes any single row hard to print. However, the most important information is contained in the first 6 columns, so you can slice the DataFrame to keep only the parts you want and make processing and printing easier.

The books DataFrame is available in your environment as `df_books`.

```
# Slice the first 6 columns
df_narrow = df_books[:, 1:6]

# Slice the 10th to 20th rows
df_short = df_narrow[10:20, :]

println(df_short)
```

# Sorting patients

You are looking at the patient data again, and you have come to realize just how unruly it is. Let's bring some order to this data using the `sort()` function.
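A small sketch of how `sort()` behaves, using made-up values rather than the course data: it returns a new, sorted DataFrame and leaves the original untouched, and `rev=true` reverses the order.

```julia
using DataFrames

# Hypothetical patient data for illustration
df_demo = DataFrame(heartrate=[80, 65, 72], bodytemp=[97.9, 99.1, 98.6])

# Ascending order by heart rate (the default)
by_hr = sort(df_demo, "heartrate")

# Descending order by body temperature
by_temp_desc = sort(df_demo, "bodytemp", rev=true)

println(by_hr)
println(by_temp_desc)
```

If you would rather reorder the existing DataFrame, `sort!` does the same thing in place.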

The patient data is available in your environment as `df_patients`.

```
# Sort the data by heart rate
df_byheart = sort(df_patients, "heartrate")

# Print the first 5 rows
println(df_byheart[1:5, :])


# Sort the data by body temperature, highest first
df_bytemp = sort(df_patients, "bodytemp", rev=true)

# Print the first 5 rows
println(df_bytemp[1:5, :])
```

# Literary analysis

Let's look at the book rating data again. You've been tasked with finding out more information about this data that you couldn't calculate before. You need to find the total number of ratings used to create this dataset and how old the oldest book is.
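Whether the books data contains missing values is not stated here. If a column does contain them, aggregations such as `sum` and `minimum` return `missing`, and wrapping the column in `skipmissing` is one way around that. A minimal sketch with made-up vectors:

```julia
# Hypothetical columns containing missing entries
ratings_count = [120, missing, 45]
years = [1900, 1813, missing]

# A single missing value makes the whole sum missing
println(sum(ratings_count))

# skipmissing ignores the missing entries
total = sum(skipmissing(ratings_count))
earliest = minimum(skipmissing(years))

println("Total: $total, earliest year: $earliest")
```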

The DataFrame `df_books` is available in your environment.

```
# Find the total number of ratings
total_reviews = sum(df_books.ratings_count)

# Find the earliest publication year
earliest_year = minimum(df_books.original_publication_year)

println("Total number of reviews is $total_reviews")
println("Earliest year of publication is $earliest_year")
```

# Describing patient data

This time, you have been tasked with producing a summary of the patient body temperature and heart rate data. You need to find each column's mean, minimum, and maximum. Luckily, you know a function that can calculate all of these at once.
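By default, `describe()` reports more statistics than the three needed here. As a sketch on made-up data (not the course dataset), passing statistic names restricts the summary to just those columns:

```julia
using DataFrames

# Hypothetical patient data for illustration
df_demo = DataFrame(bodytemp=[98.6, 97.9, 99.1], heartrate=[72, 80, 65])

# Ask only for the mean, minimum, and maximum of each column
stats = describe(df_demo, :mean, :min, :max)

println(stats)
```

The result is itself a DataFrame, with one row per column of the original data.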

The DataFrame `df_patients` is available in your environment.

```
# Summarize the DataFrame
println(describe(df_patients))
```

<center><img src="images/04.11.jpg"  style="width: 400px, height: 300px;"/></center>


# Standardize heart rate

"Standardization" is a data transformation that rescales the values of a variable `x` as:

`standardized_x = (x - x_mean) / std_x`

where `x_mean` is the mean of `x` and `std_x` is its standard deviation.

Standardization can be helpful when you don't know how to interpret an absolute value. For example, what does a heart rate of 78 beats per minute mean? Is it high, low, or average?
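To make the formula concrete, here is a worked example on three made-up heart rates (these are not the course data). Each standardized value says how many standard deviations a reading lies above or below the mean, so in this hypothetical sample a heart rate of 78 is exactly average.

```julia
using Statistics

# Hypothetical heart rates in beats per minute (not the course data)
heartrates = [68.0, 78.0, 88.0]

mean_hr = mean(heartrates)   # 78.0
std_hr = std(heartrates)     # sample standard deviation: 10.0

# Broadcast the formula over every element
standardized = (heartrates .- mean_hr) ./ std_hr

println(standardized)   # [-1.0, 0.0, 1.0]
```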

In this exercise, you'll calculate the mean and standard deviation of the heart rate data and create a new column of standardized heart rates to help answer this question.

The patient data is available in your environment as `df_patients`. The `Statistics` package has been imported for you with the `using` keyword.

```
# Find the mean heart rate
mean_hr = mean(df_patients.heartrate)

# Find the standard deviation of heart rates
std_hr = std(df_patients.heartrate)

# Calculate the standardized array of heart rates
norm_heartrate = (df_patients.heartrate .- mean_hr) ./ std_hr

# Add the standardized heart rates to the DataFrame
df_patients.norm_heartrate = norm_heartrate

println(last(df_patients, 5))
```

# Constructing filters

You are working with a DataFrame of wildfires named `df_wild`. You want to filter the data to examine only the largest fires, with a burn area of over 300 acres. You can do this by keeping only the rows where the `acres` column is greater than 300.

Which of the following code snippets should you use to complete the filter?

`df_bigfires = filter(____, df_wild)`

- `row -> row.acres > 300`
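A runnable sketch of this filter on a made-up wildfire table (the fire names and sizes are hypothetical): the anonymous function is called once per row and keeps the rows for which it returns `true`. The second call shows an equivalent column-based form that DataFrames also accepts.

```julia
using DataFrames

# Hypothetical wildfire data for illustration
df_wild = DataFrame(name=["Fire A", "Fire B", "Fire C"], acres=[120, 450, 980])

# Row-based predicate: keep rows where acres exceeds 300
df_bigfires = filter(row -> row.acres > 300, df_wild)

# Equivalent column-based form: apply the test >(300) to the acres column
df_bigfires2 = filter(:acres => >(300), df_wild)

println(df_bigfires)
```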

# Filtered body temp

Now that you know how to filter data, you can dive deeper into the patient data and perform conditional data analysis.

You are working with the patient data again. You are interested in whether a patient's body temperature depends on their sex. You can use filtering to answer this question.

The patient DataFrame is available in your environment as `df_patients`. The `Statistics` package has been imported for you with the `using` keyword.

```
# Filter to where the sex is female
df_female = filter(row -> row.sex == "female", df_patients)

# Filter to where the sex is male
df_male = filter(row -> row.sex == "male", df_patients)

# Calculate mean body temperature for females
female_temp = mean(df_female.bodytemp)

# Calculate mean body temperature for males
male_temp = mean(df_male.bodytemp)

println("Mean body temperature of females: $female_temp °F")
println("Mean body temperature of males: $male_temp °F")
```

# Classic books

Time to use your new DataFrame skills to analyze the books dataset one last time.

You are working for a publisher that wants to spotlight classic books this month. Can you analyze the DataFrame `df_books` to find the top-rated books published before 1900?

`df_books` is available in your environment.

```
# Filter to books which were published before 1900
df_old_books = filter(row -> row.original_publication_year < 1900, df_books)

# Sort these books by rating
df_old_books_sorted = sort(df_old_books, "average_rating", rev=true)

# Print the 5 top-rated old books
println(first(df_old_books_sorted, 5))
```