# Data Type Conversion
Data type conversion starts when you first import data and continues throughout the data wrangling process. As you add categorical variables and your own derived columns, it becomes an ongoing, iterative process.

In this exercise, we will clean up the automotive data from UCI, exposing you to some key new functions such as mutate() and as.factor(). We will walk through the process step by step, looking at the data and determining which data is important and how it should be transformed. The end result is an analysis of the newly cleaned data showing the relationship between diesel and gas engines and their highway gas mileage.



## R Features
* library()
* read_csv()
* glimpse()
* summary()
* mutate()
* as.factor()
* fct_drop()
* drop_na()
* as.integer()
* ggplot()
* geom_jitter()
* geom_smooth()
* filter()
* facet_grid()

## Datasets
* automotive from UCI


# L03-5-Data Type Conversion
## Assignment Instructions
Rename with your name in place of Studentname and make your edits and updates here.


In [None]:
# Load libraries
library(___) # forcats
library(___) # tidyverse

## UCI Automobile Data Set

Manually explore the UCI page and its description of the data in your web browser.

https://archive.ics.uci.edu/ml/datasets/Automobile

Did you notice where the column names were stored? 

In [None]:
# Load the UCI auto data
# store auto.csv in df_auto
___ <- ___("___")

In [None]:
# Glimpse data frame
___

Look at the above information carefully. This is where you can start to determine the following:
* Which columns you are interested in and which you can ignore
* For the columns of interest, what the most appropriate data types are
* For columns whose data type we want to change, whether any data cleaning is needed first
* How many missing values there are and how best to handle them



## select()
I am interested in analyzing fuel economy in conjunction with vehicle and engine size. Columns that come to mind are:
* fuel_type
* drive_wheels
* curb_weight
* engine_type
* num_of_cylinders
* engine_size
* fuel_system
* horsepower
* city_mpg
* highway_mpg

In [None]:
# Create a new df with
# just the columns of interest above
# Use select()
___ <- ___ %>% 
   ___(___, drive_wheels, 
          curb_weight, engine_type, 
          num_of_cylinders, engine_size, 
          fuel_system, horsepower, 
          city_mpg, ___)

# Glimpse result
___

## summary()
Another useful function, similar to glimpse(), is summary(). It gives some statistics and an idea of how many values are missing. These concepts will be covered in the stats lessons in the course. Here we will just take a peek.

In [None]:
# Run summary() on df
___(___)

There is a lot of information in summary(). In the context of data cleaning and data type conversion, I am looking for missing values or unexpected min or max outliers. I don't see much here. The work we have to do is on the character columns, and unfortunately summary() doesn't provide much insight into those.

There are two columns that I want to look into further: num_of_cylinders and horsepower. Both are stored as character data but could potentially be converted to integers.

## as.factor()
Factors are R's way of storing categorical data: each distinct string becomes a level, and the values are stored internally as integer codes that reference those levels.

I like to temporarily convert strings to factors just to see what is in there. Often there are some unexpected values that need to be removed before we can convert the string to a number.

If we combine as.factor() with summary() we can see more.
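As a minimal sketch of that combination, the snippet below uses a small made-up vector rather than the UCI data. The names `cylinders` and `cyl_counts` are hypothetical and exist only for this illustration.

```r
# Toy character vector standing in for a column of cylinder counts
# (made-up values, not taken from the UCI data)
cylinders <- c("four", "six", "four", "eight", "four")

# summary() on a character vector only reports length, class, and mode
summary(cylinders)

# Converting to a factor first makes summary() count each distinct value
cyl_counts <- summary(as.factor(cylinders))
print(cyl_counts)
```

The result is a named vector of counts, one entry per level, which makes rare or unexpected values easy to spot.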



In [None]:
# Use $ to select num of cylinders column of df
# then pipe it to as.factor() 
# then pipe to summary()
df$___ %>% ___() %>% ___()

Now we can see all of the different values of number of cylinders and how many rows each contains. Since I am interested in analyzing the general trend of fuel economy, I would like to stick to the more popular cylinder counts. I don't want any outliers to confuse my analysis.

From the data it looks like I should keep 4 and 6 cylinder engines. Let's filter the data for this.
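Keeping only rows whose value belongs to a set of allowed values is usually done with filter() and the %in% operator. Here is a rough sketch on an unrelated toy data frame; `toy`, `fruit`, and `kept` are hypothetical names used only for this illustration.

```r
library(dplyr)

# Toy data frame (made-up values) with a character column to filter on
toy <- tibble(
  fruit = c("apple", "pear", "kiwi", "apple", "plum"),
  qty   = c(3, 5, 1, 2, 4)
)

# %in% returns TRUE for each row whose value appears in the given vector,
# so filter() keeps only the apple and pear rows
kept <- toy %>%
  filter(fruit %in% c("apple", "pear"))

print(kept)
```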

In [None]:
# Filter df so it contains 
# only 4 and 6 cylinder vehicles
# store back in df
___ <- ___ %>% 
   ___(num_of_cylinders ___ ___("four", "six"))

# Glimpse result
___

Let's rerun the last as.factor() + summary() code and confirm that only 4 and 6 cylinder vehicles remain. Don't be fooled by glimpse(). It only shows the first few values of each column.

In [None]:
# Use $ to select num of cylinders column of df
# then pipe it to as.factor() 
# then pipe to summary()
df$___ %>% ___() %>% ___()

## mutate()

That looks good. Now back to the question of data type conversion. We could convert the strings "four" and "six" to numbers. However, I know that this column is going to be used as a categorical variable and not as a variable that we would perform mathematical operations on. So, leaving the values as strings won't hurt anything.

In either case, we want to convert this column to a factor. This is where mutate() comes in. mutate() is a tidyverse dplyr function that can update existing columns or create new columns in a data frame.
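To illustrate both uses of mutate(), here is a small sketch on a made-up tibble. The names `toy`, `size`, and `id_doubled` are hypothetical and unrelated to the automotive data.

```r
library(dplyr)

# Toy tibble (made-up values)
toy <- tibble(
  id   = 1:3,
  size = c("small", "large", "small")
)

toy <- toy %>%
  mutate(
    size       = as.factor(size),  # update an existing column in place
    id_doubled = id * 2            # create a brand new column
  )

# size is now a factor; levels are sorted alphabetically by default
levels(toy$size)
glimpse(toy)
```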


In [None]:
# Convert cylinders to a factor
df <- df %>% 
   ___(num_of_cylinders = ___(num_of_cylinders))

# View factor levels with levels()
___(df$num_of_cylinders)

# Glimpse result
___

Notice above that the num_of_cylinders data type changed from <chr> to <fct> and the values are no longer quoted.

Let's convert the following to factors:
* fuel_type
* drive_wheels
* engine_type
* fuel_system



In [None]:
# Convert strings to factors
df <- df %>% 
   ___(fuel_type = as.factor(fuel_type),
         ___ = ___,
         ___ = ___,
         ___ = ___)

# Glimpse result
___

# Summary result
___

It is looking much better. Notice all of our data types are either factor or integer, with the exception of horsepower, which we will tackle next. 

Also notice how much more informative the summary() function is for factors than for characters. I can see, for example, that there are only 7 4wd vehicles, far fewer than the other drive types.

Let's filter out 4wd before we tackle horsepower.

In [None]:
df <- df %>% 
   ___(drive_wheels ___ "___")

# Glimpse result
___

# Summary result
___

Notice the number of observations decreased from 183 to 176. Also notice that in the drive_wheels summary, 4wd is 0. This is a situation where I would like to drop that level; otherwise it will always show up with a count of 0.
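The lingering zero-count level can be reproduced on a small scale. The sketch below builds a made-up factor whose levels include a value that no longer occurs, then removes it with fct_drop() from forcats. The names `drive` and `drive_clean` are hypothetical.

```r
library(forcats)

# Made-up factor: "4wd" is a declared level but no element uses it,
# which is what a factor looks like after its rows are filtered out
drive <- factor(c("fwd", "rwd", "fwd"), levels = c("4wd", "fwd", "rwd"))

# summary() still reports the unused level with a count of 0
summary(drive)

# fct_drop() removes levels that have no remaining observations
drive_clean <- fct_drop(drive)
levels(drive_clean)
```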

In [None]:
# Drop drive_wheels unused levels
# using fct_drop
df <- df %>% 
   ___(drive_wheels = fct_drop(___))

# Glimpse result
___

# Summary result
___

Great. drive_wheels looks good. 4wd is gone. Let's explore horsepower. Recall from earlier how we piped the column to as.factor() then onto summary(). Let's do that with horsepower.

In [None]:
# Summary of factorized horsepower
df$___ %>% ___() %>% ___()

Notice the reason why it couldn't be converted to a number? It is the 2 rows with a question mark. I am tempted to just filter out those rows, but for learning purposes, let's see what happens when we leave them in and convert to an integer data type.

In these exercises we are overwriting the columns and the data frame with updated data. In practice, there is a balance between traceability back to the source and information loss. For example, you could store snapshots of the data frame in distinct variables at each stage in the pipeline. You could also keep multiple copies of the same column. I would append '_original' to the name of the raw column and use the regular column name for the 'cleaned' version.
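One possible way to keep the raw values alongside the cleaned ones is sketched below on a made-up tibble. The `_original` suffix follows the naming idea described above; `toy` and its values are hypothetical.

```r
library(dplyr)

# Toy tibble with a character column containing a "?" placeholder
toy <- tibble(horsepower = c("111", "?", "102"))

toy <- toy %>%
  mutate(
    # mutate() evaluates its arguments in order, so this copy is taken
    # before the column is overwritten on the next line
    horsepower_original = horsepower,
    # suppressWarnings() hides the expected "NAs introduced by coercion"
    horsepower = suppressWarnings(as.integer(horsepower))
  )

print(toy)
```

The raw strings stay available for auditing, while the cleaned integer column is the one used for analysis.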

In [None]:
df <- df %>% 
   ___(horsepower = ___(horsepower))

# Glimpse result
___

# Summary result
___

Success. Well, there was a warning, which we expected: 'NAs introduced by coercion'. Notice it doesn't say how many NAs were introduced. Whenever you get this message, you need to check the count. It could 'introduce' NA to *every* row. That has happened to me before.
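A quick way to check that count is to sum the result of is.na(). The sketch below uses a made-up character vector; `hp_raw`, `hp_int`, and `na_count` are hypothetical names.

```r
# Made-up horsepower strings with two "?" placeholders
hp_raw <- c("111", "154", "?", "102", "?")

# Without suppressWarnings(), this call emits the warning
# "NAs introduced by coercion"; it is silenced here because it is expected
hp_int <- suppressWarnings(as.integer(hp_raw))

# is.na() gives TRUE for each NA, and sum() counts the TRUE values
na_count <- sum(is.na(hp_int))
print(na_count)

# Compare against the number of rows to see how much data was affected
print(na_count / length(hp_int))
```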

glimpse() shows we didn't lose any rows or columns. It also indicates horsepower is now an integer.

summary() now shows something useful for horsepower. It has a min of 52, which sure seems low, and a median of 92, which also seems low to me. I guess it is my U.S. muscle car teenage years. Notice the number of NA values. That is where the two question marks ended up.

That is a good thing. I prefer missing values to be stored as R's formal NA value rather than as a custom placeholder in the data. There are functions that operate on NA values, making them easier to deal with.

With that said, let's remove the NA rows to give us a clean data set to work with. We'll use the tidyverse drop_na function from tidyr.

## drop_na()
Drop rows containing missing values. 

drop_na(data, ...)

In [None]:
# Let's open help on drop_na 
?drop_na

We have a choice of specifying columns explicitly or just leaving it blank and rows with NAs in any column will be dropped. 

This, of course, depends on our analysis. Typically we would first fill in the missing values that we want to keep, with the assumption that whatever we didn't fill in would be dropped at the end.

This time let's not specify column names so it will look over the entire data set.
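The difference between the two choices can be seen on a small made-up data frame. In this sketch, `toy`, `dropped_any`, and `dropped_a` are hypothetical names.

```r
library(tidyr)

# Toy data frame with one NA in each column, on different rows
toy <- data.frame(
  a = c(1, NA, 3),
  b = c("x", "y", NA)
)

# No columns named: a row is dropped if ANY column contains NA
dropped_any <- drop_na(toy)

# Column named: only NAs in column a cause a row to be dropped
dropped_a <- drop_na(toy, a)

nrow(dropped_any)
nrow(dropped_a)
```

Naming columns is the more conservative option when other columns have missing values that do not matter for the analysis.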

In [None]:
# Drop all rows containing NAs
df <- ___(___)

# Glimpse result
___

# Summary result
___

Notice above that the row count dropped from 176 to 174 as expected. Also notice that horsepower summary does not contain any NA rows.

We did a good job cleaning the data! Now let's see if that made any difference on our analysis.

Let's plot highway mpg vs engine size colored by number of cylinders, faceted by drive wheels and fuel type. And yes, jitter, alpha blend, and linear trend lines please!

In [None]:
# Create highway mpg vs engine size plot
# include cylinders, drive wheels, and fuel type
# Add jitter, alpha and trend lines
df %>% 
   ggplot(aes(x = ___, y = ___, ___ = ___)) + 
   ___(alpha = 0.3) +
   ___(se = FALSE, ___) +
   ___(drive_wheels ~ ___)

Let's summarize the cleaning process we did in a single code block. This is an exercise in looking back at what you did earlier and recreating it below to reinforce your learning and to use this as a more concise code reference.

In [None]:
# Load libraries
___ # forcats
___ # tidyverse

# Load data
df_auto <- ___("auto.csv")

# Select columns of interest
df <- df_auto %>% 
   ___(fuel_type, drive_wheels, 
          curb_weight, engine_type, 
          num_of_cylinders, engine_size, 
          fuel_system, horsepower, 
          city_mpg, highway_mpg)

# Filter rows as necessary
df <- df %>% 
   ___(num_of_cylinders ___, 
          drive_wheels ___) %>% 
   ___(num_of_cylinders = ___,
          fuel_type = ___,
          drive_wheels = ___,
          engine_type = ___,
          fuel_system = ___,
          horsepower = ___)

# Dropping levels wasn't required
# Drop all rows containing NAs
___

# Glimpse result
___

# Summary result
___

Notice the same number of rows and columns and data types as before. Nicely done!