# 1 Transforming Data with dplyr

Learn verbs you can use to transform your data, including select, filter, arrange, and mutate. You'll use these functions to modify the counties dataset to view particular observations and answer questions about the data.

# Understanding your data

Take a look at the counties dataset using the glimpse() function.

What is the first value in the income variable?

# Possible answers

( ) 1001

(x) 51281

( ) 50254

( ) 40725
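
As a rough sketch of how this kind of inspection works, the example below runs glimpse() on a small made-up tibble and then pulls out the first value of one column directly. The `toy_counties` data and its values are hypothetical stand-ins, not the real counties dataset.

```r
library(dplyr)

# Hypothetical stand-in for counties; the values are made up for illustration
toy_counties <- tibble(
  state  = c("A", "B", "C"),
  income = c(100, 200, 300)
)

# Print one line per column: name, type, and the first few values
glimpse(toy_counties)

# The first value of a column can also be extracted without reading the printout
first_income <- toy_counties %>%
  pull(income) %>%
  first()
```

In the glimpse() output, the first value listed after a column's type is the value in the first row.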

# Selecting columns

Select the following four columns from the counties variable:

- state
- county
- population
- poverty

You don't need to save the result to a variable.

# Instructions:

- Select the columns listed from the counties variable.

In [None]:
counties %>%
  # Select the columns
  select(state, county, population, poverty)

# Arranging observations

Here you see the counties_selected dataset with a few interesting variables selected. Three of them, private_work, public_work, and self_employed, describe whether people work for private companies, for the government, or for themselves.

In these exercises, you'll sort these observations to find the most interesting cases.

# Instructions:

- Add a verb to sort the observations in descending order of the public_work variable.

In [None]:
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

# Add a verb to sort in descending order of public_work
counties_selected %>%
  arrange(desc(public_work))
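
To make the effect of desc() concrete, here is a minimal sketch on a made-up tibble. The `toy` data is hypothetical and only mirrors the public_work column name used in the exercise.

```r
library(dplyr)

# Hypothetical data; the percentages are made up for illustration
toy <- tibble(
  county      = c("Alpha", "Beta", "Gamma"),
  public_work = c(12.5, 30.1, 20.0)
)

# arrange() sorts in ascending order by default
ascending <- toy %>%
  arrange(public_work)

# Wrapping the column in desc() reverses the order
descending <- toy %>%
  arrange(desc(public_work))
```

With desc(), the county with the largest public_work value ends up in the first row.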

# Filtering for conditions

Use the filter() verb to keep only the observations that match one condition, or several conditions at once.

# Instructions:

- Find only the counties that have a population above one million (1000000).

In [None]:
counties_selected <- counties %>%
  select(state, county, population)

# Filter for counties with a population above 1000000
counties_selected %>%
  filter(population > 1000000)

- Find only the counties in the state of California that also have a population above one million (1000000).

In [None]:
counties_selected <- counties %>%
  select(state, county, population)

# Filter for counties in California with a population above 1000000
counties_selected %>%
  filter(state == "California", population > 1000000)
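
When filter() receives several conditions separated by commas, a row is kept only if all of them hold, so the commas behave like a logical AND. The sketch below illustrates this on a made-up tibble. The `toy` data is hypothetical, not the real counties dataset.

```r
library(dplyr)

# Hypothetical data; county names and populations are made up
toy <- tibble(
  state      = c("California", "California", "Texas"),
  county     = c("Alpha", "Beta", "Gamma"),
  population = c(2000000, 500000, 3000000)
)

# Comma-separated conditions: every condition must be TRUE for a row to be kept
with_comma <- toy %>%
  filter(state == "California", population > 1000000)

# The same filter written with an explicit logical AND
with_and <- toy %>%
  filter(state == "California" & population > 1000000)
```

Both versions keep only the one row that is in California and also has more than one million people.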

# Filtering and arranging

You'll often want to both filter and sort a dataset to focus on the observations of particular interest. Here, you'll find counties that are extreme examples of what fraction of the population works in the private sector.

# Instructions:

- Filter for counties in the state of Texas that have more than ten thousand people (10000), and sort them in descending order of the percentage of people employed in private work.

In [None]:
counties_selected <- counties %>%
  select(state, county, population, private_work, public_work, self_employed)

counties_selected %>%
  # Filter for Texas and more than 10000 people
  filter(state == "Texas",
         population > 10000) %>%
  # Sort in descending order of private_work
  arrange(desc(private_work))

# Calculating the number of government employees

In the video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you'll do the same with another percentage variable: public_work.

The code provided already selects the state, county, population, and public_work columns.

# Instructions:

- Use mutate() to add a column called public_workers to the dataset, with the number of people employed in public (government) work.

In [None]:
counties_selected <- counties %>%
  select(state, county, population, public_work)

counties_selected %>%
  # Add a new column public_workers with the number of people employed in public work
  mutate(public_workers = public_work * population / 100)

- Sort the new column in descending order.

In [None]:
counties_selected <- counties %>%
  select(state, county, population, public_work)

counties_selected %>%
  mutate(public_workers = public_work * population / 100) %>%
  # Sort in descending order of the public_workers column
  arrange(desc(public_workers))
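
The conversion from a percentage to a count is plain arithmetic: multiply the percentage by the population and divide by 100. A minimal sketch with made-up numbers, assuming public_work is stored on a 0 to 100 scale as in the exercise (the `toy` data is hypothetical):

```r
library(dplyr)

# Hypothetical data; public_work is a percentage between 0 and 100
toy <- tibble(
  county      = c("Alpha", "Beta"),
  population  = c(1000, 20000),
  public_work = c(25, 10)
)

with_counts <- toy %>%
  # 25% of 1000 people is 250; 10% of 20000 people is 2000
  mutate(public_workers = public_work * population / 100) %>%
  # Largest count first
  arrange(desc(public_workers))
```

Note that the county with the higher percentage is not necessarily the one with the higher count, because the count also depends on the population.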

# Calculating the percentage of women in a county

The dataset includes columns for the total number (not percentage) of men and women in each county. You can use these, along with the population variable, to compute the fraction of men (or women) within each county.

In this exercise, you'll select the relevant columns yourself.

# Instructions:

- Select the columns state, county, population, men, and women.
- Add a new variable called proportion_women with the fraction of the county's population made up of women.

In [None]:
counties_selected <- counties %>%
  # Select the columns state, county, population, men, and women
  select(state, county, population, men, women)

counties_selected %>%
  # Calculate proportion_women as the fraction of the population made up of women
  mutate(proportion_women = women / population)
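
As a quick sanity check on this kind of calculation, the sketch below computes both proportions on a made-up tibble and confirms that they add up to 1. It assumes that men plus women equals population in every row, which holds for this hypothetical `toy` data by construction but is not verified here for the real dataset.

```r
library(dplyr)

# Hypothetical data, built so that men + women == population in every row
toy <- tibble(
  county     = c("Alpha", "Beta"),
  population = c(1000, 4000),
  men        = c(520, 1900),
  women      = c(480, 2100)
)

proportions <- toy %>%
  mutate(
    proportion_women = women / population,
    proportion_men   = men / population,
    # Should be 1 when the two counts cover the whole population
    total            = proportion_women + proportion_men
  )
```

The result is a fraction between 0 and 1 rather than a percentage; multiply by 100 if a percentage is needed.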

# Mutate, filter, and arrange

In this exercise, you'll put together everything you've learned in this chapter (select(), mutate(), filter(), and arrange()) to find the counties with the highest proportion of men.

# Instructions:

- Use a single verb to add a proportion_men column with the fractional male population and also keep the state, county, and population columns.
- Filter for counties with a population of at least ten thousand (10000).
- Arrange counties in descending order of their proportion of men.

In [None]:
counties %>%
  # Keep state, county, and population, and add proportion_men
  mutate(state, county, population,
         proportion_men = men / population,
         .keep = "none") %>%
  # Filter for population of at least 10,000
  filter(population >= 10000) %>%
  # Arrange proportion of men in descending order
  arrange(desc(proportion_men))
  arrange(desc(proportion_men))
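
To see what the `.keep = "none"` argument does, the sketch below compares the single-verb version with a two-verb version that uses mutate() followed by select(). It assumes a dplyr version that supports the `.keep` argument of mutate() (1.0.0 or later), and the `toy` data is hypothetical.

```r
library(dplyr)

# Hypothetical data; names and counts are made up for illustration
toy <- tibble(
  state      = c("X", "Y"),
  county     = c("Alpha", "Beta"),
  population = c(1000, 4000),
  men        = c(520, 1900),
  women      = c(480, 2100)
)

# One verb: .keep = "none" drops every column not named inside mutate()
one_verb <- toy %>%
  mutate(state, county, population,
         proportion_men = men / population,
         .keep = "none")

# Two verbs: add the column, then select the columns to keep
two_verbs <- toy %>%
  mutate(proportion_men = men / population) %>%
  select(state, county, population, proportion_men)
```

Both approaches end up with the same four columns. The men and women columns are dropped in each case, because they are not named as outputs.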