# L03-3-Importing From Internet
## Assignment Instructions
Rename with your name in place of Studentname and make your edits and updates here.


# Importing From Internet
Now that we know how to import flat files locally, what about importing directly from the Internet? Although manually downloading a file to your local computer is not difficult, there are times when importing directly from the Internet is helpful. To illustrate this, let me introduce the UCI Machine Learning Repository. The University of California, Irvine hosts this repository, which contains many real-world data sets. As luck would have it, they have a fuel economy automotive dataset.

The same tidyverse import function read_csv() can be used to import from a URL. We just replace the local filename with a URL string.
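
As a minimal sketch of that idea, the cell below writes two made-up rows to a temporary file and reads them back with read_csv(), supplying the column names as a character vector. The file contents, the column names, and the toy_ variable names are assumptions for illustration only. A URL string could be passed in place of toy_path; a local file is used here only so the sketch runs offline.

```r
library(readr)

# Hypothetical two-row file with no header row (made-up values)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("3,alfa-romero,21", "1,audi,24"), toy_path)

# col_names takes a character vector; col_types "ici" means
# integer, character, integer
toy_df <- read_csv(
  toy_path,
  col_names = c("symboling", "make", "city_mpg"),
  col_types = "ici"
)

print(toy_df)
```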

In this exercise, we will import this UCI data and then do a bit of visual data analysis on it. I hope you haven't forgotten your ggplot skills. This data is a bit messier, which makes for a great learning opportunity. First, the header data is stored separately. I created an Excel spreadsheet with this info, so we will have to import it and combine the headers from the spreadsheet with the UCI automotive data.

After we have a data frame containing the data, we will analyze highway miles per gallon vs engine displacement. This is an opportunity to reinforce your filtering, piping, and plotting skills. See if you can do it without any hints by looking at the prior exercises and applying it to this new data set. If you need hints, they are also provided later in this notebook and of course the solution notebook is available.



## R Features
* library()
* read_csv()
* read_excel()
* glimpse()
* write_csv()
* print()
* c()
* ggplot()
* geom_point()
* geom_jitter()
* geom_smooth()
* filter()
* facet_grid()

## Datasets
* automotive from UCI


In [None]:
# Load libraries
library(___) # readxl
library(___) # tidyverse

## UCI Automobile Data Set

Explore the UCI page and its description of the data in your web browser.

https://archive.ics.uci.edu/ml/datasets/Automobile

Did you notice where the column names were stored? 

In [None]:
# Let's start by assuming you have the 
# URL for the file you want to import
# Store this text string in a variable named auto_url
___ <- "https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data"

The column names are not in the first row of the data file, but rather in a second description file. I didn't want to force you to type all those column names. I'm sure you have enough busy work like this already. I manually copied the data into Excel, and with some manual work and Excel formulas I got the list of column names into an Excel sheet.

Also, I changed the hyphens in the column names to underscores. This isn't just to be consistent with my snake_case naming convention. R, including ggplot code, interprets a hyphen in an unquoted column name as the subtraction operator. So I avoid wrestling with R and just use underscores instead of hyphens in column names.
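
The sketch below shows both halves of that point using base R only. The hyphenated names and the two numeric variables are hypothetical and exist purely for illustration; gsub() is just one possible way to do the renaming, not necessarily how the spreadsheet was prepared.

```r
# Hypothetical hyphenated names in the style of a raw attribute list
raw_names <- c("highway-mpg", "num-of-doors", "price")

# Replace every hyphen with an underscore
clean_names <- gsub("-", "_", raw_names, fixed = TRUE)
print(clean_names)

# Unquoted, a hyphenated name is parsed as subtraction:
# highway-mpg means "highway minus mpg"
highway <- 30
mpg <- 5
print(highway-mpg)
```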

In [None]:
# Import the column names
# from Excel file: auto_column_names.xlsx
# into df_auto_cols data frame
___ <- read_excel("___", col_names = ___)

# Glimpse the results
# there should be 26 rows
# and one column
___

# Notice the name of the column

We will now use the URL and the column names to import the data from UCI. Notice, however, that our column names are stored in a data frame. That isn't exactly what read_csv() expects for its col_names argument. It expects a vector, which is really a single column of a data frame, not the whole data frame.

Confusingly, our df_auto_cols data frame has only one column, but it is still of class data frame. So we need to reference the specific column in the data frame, and that is what we pass as the col_names parameter in read_csv().

There are many ways to reference columns in a data frame. A common method is to use '\$' followed by the column name. In this case, that is df_auto_cols\$ followed by the column name you noticed in the glimpse() output (X__1 in older readxl versions and ...1 in newer ones).
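
To make the data frame versus vector distinction concrete, here is a small base R sketch. The toy_cols data frame and its name column are made up for illustration and are not the objects used in this exercise.

```r
# Hypothetical one-column data frame of names
toy_cols <- data.frame(
  name = c("make", "price"),
  stringsAsFactors = FALSE
)

# The whole object is still a data frame, even with one column
print(class(toy_cols))

# The $ operator pulls that single column out as a plain vector
name_vector <- toy_cols$name
print(class(name_vector))
print(name_vector)
```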

In [None]:
# Import the UCI auto data using 
# auto_url for the url string
# and the single column of df_auto_cols (the name you noticed above) for the col_names
# Store the result in df
___ <- read_csv(auto_url, col_names = df_auto_cols$___)

# Glimpse the result
# There should be 205 rows
# and 26 columns with appropriate column names
___

In [None]:
# Write a snapshot of the source data
# to auto.csv
___(df, "___")

In [None]:
# If we wanted to specify the column 
# names directly in R, we can build up a 
# vector using the c() combine function
col_names_r <- ___(
"symboling",
"normalized_losses",
"make",
"fuel_type",
"aspiration",
"num_of_doors",
"body_style",
"drive_wheels",
"engine_location",
"wheel_base",
"length",
"width",
"height",
"curb_weight",
"engine_type",
"num_of_cylinders",
"engine_size",
"fuel_system",
"bore",
"stroke",
"compression_ratio",
"horsepower",
"peak_rpm",
"city_mpg",
"highway_mpg",
"price"
)

# We can print the values of a vector using the 
# print command, or just having the variable on a line with nothing else
# I like to be explicit and use print()
print(___)


Notice above how it prints vector values. The values run left to right, then top to bottom. The bracketed number at the start of each printed row is the index position of the value just to its right. So index 25 is highway_mpg and index 26 is price.
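
If you want to check an index position directly rather than counting across the printed rows, square brackets pull out a single element. This is a small sketch with a hypothetical vector of placeholder names, not the real column names.

```r
# Hypothetical vector of 26 placeholder names
toy_names <- paste0("col_", 1:26)

# Each printed row starts with the index of its first value
print(toy_names)

# Square brackets return the value at one position
print(toy_names[25])
```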

Although there is a bit of clean up work that should be done at some point around this data such as changing character strings to numbers, and handling missing values, I always want to visualize the data as soon as I can. There's always time later for data wrangling.
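
For when that later time comes, here is one possible sketch of the clean-up in base R. It assumes missing entries are marked with a "?" string, and the raw_hp values are made up for illustration.

```r
# Made-up values; "?" stands in for a missing entry
raw_hp <- c("111", "?", "154")

# Turn the placeholder into NA first, then convert to numbers
raw_hp[raw_hp == "?"] <- NA
hp <- as.numeric(raw_hp)
print(hp)

# Summary functions need na.rm = TRUE to skip the missing value
print(mean(hp, na.rm = TRUE))
```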

Our goal is to see how highway miles per gallon (y axis) trends against engine size (x axis). The trend should be broken out by the number of doors (two or four), the number of cylinders (four, six, or eight), and the drive wheels (front, rear, or four-wheel drive).

In [None]:
# Create the plot as described above
# This is very challenging to do from scratch
# At least start it and get at least a scatter plot
# and if you need more guidance, look at the below cells
# for hints

# dataframe: df
# plot type: scatter plot with jitter and alpha and linear trend lines
# y axis: highway mpg
# x axis: engine size
# color by cylinder
# facet by number of doors and drive wheels
# filter out any categories that are outliers
___

Don't just skip over the above cell; give it a try. Look back at prior notebooks. I am sure you can at least get a start with a scatter plot. Build your way up one step at a time. I don't do this all at once.

Below is a guided version of the above if you need more hints. There is also the solution file and the help ? if you need them. That is what they are there for. 

In [None]:
# I start with reminding myself 
# the columns of the data
___

In [None]:
# Then I get the initial plot working
___ %>% 
   ___(aes(x = engine_size, y = ___)) + 
   geom_point()

In [None]:
# I'll add jitter and alpha blend as a good practice
___ %>% 
   ___(aes(x = engine_size, y = ___)) + 
   geom____(alpha = ___)

In [None]:
# Next add the trend line
___ %>% 
   ___(aes(x = engine_size, y = ___)) + 
   geom____(alpha = ___) + 
   geom____(se = ___, method = "___")

In [None]:
# Then add more dimensions 
# such as color for number of cylinders
___ %>% 
   ___(aes(x = engine_size, y = ___, color = ___)) + 
   geom____(alpha = ___) + 
   geom____(se = ___, method = "___")

In [None]:
# Filter out the cylinders that I don't want
# I only want 4, 6, and 8
# Here is where there are strings rather than numbers
# I'll use %in% and c() to group all three together
# but an OR of each would work too.

# Since I am piping, I'll just add the filter() to the pipeline
___ %>% 
   ___(num_of_cylinders ___ ___("four", "six", "eight")) %>%
   ___(aes(x = engine_size, y = ___, color = ___)) + 
   geom____(alpha = ___) + 
   geom____(se = ___, method = "___")
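
The comments above mention that %in% with c() and a chain of ORs do the same job. Here is a small base R sketch of that equivalence on a made-up vector of cylinder labels; the cyl, keep_in, and keep_or names are hypothetical.

```r
# Made-up cylinder labels stored as strings
cyl <- c("four", "six", "twelve", "eight", "two")

# Membership test against a vector built with c()
keep_in <- cyl %in% c("four", "six", "eight")

# The same test written as a chain of ORs
keep_or <- cyl == "four" | cyl == "six" | cyl == "eight"

print(keep_in)
print(identical(keep_in, keep_or))
```

The %in% form stays short as the list of wanted values grows, which is why it is handy inside filter().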

In [None]:
# Let me facet by number of doors and drive wheels
___ %>% 
   ___(num_of_cylinders ___ ___("four", "six", "eight")) %>%
   ___(aes(x = engine_size, y = ___, color = ___)) + 
   geom____(alpha = ___) + 
   geom____(se = ___, method = "___") + 
   ___(___ ~ ___)

In [None]:
# Let's filter out the ? for doors
# using filter() and the != operator
___ %>% 
   ___(num_of_cylinders ___ ___("four", "six", "eight"), 
         num_of_doors ___ "___") %>%
   ___(aes(x = engine_size, y = ___, color = ___)) + 
   geom____(alpha = ___) + 
   geom____(se = ___, method = "___") + 
   ___(___ ~ ___)

Nicely done! Check out the trend lines. Do any stand out as different from the others?