#### Let's begin by installing and loading the packages we will be working with.

In [None]:
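# Install the packages that may not be present yet.
install.packages("mlbench", repos = "http://cran.us.r-project.org")
install.packages("DataExplorer", repos = "http://cran.us.r-project.org")
install.packages("corrplot", repos = "http://cran.us.r-project.org")
install.packages("e1071", repos = "http://cran.us.r-project.org")
install.packages("usdm", repos = "http://cran.us.r-project.org")
# Load everything used below. tidyverse, pryr, and moments are assumed to be installed already.
library(tidyverse)     # attaches ggplot2, tibble, dplyr, and friends
library(ggplot2)       # already attached by tidyverse; harmless to repeat
library(pryr)          # object_size()
library(moments)
library(mlbench)
library(DataExplorer)  # plot_histogram(), plot_density(), plot_bar(), plot_boxplot(), plot_scatterplot()
library(corrplot)
library(e1071)
library(usdm)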

#### Then we load the data.

In [None]:
BostonURL <- "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
Boston <- read.csv(url(BostonURL), sep = "", header = FALSE)

#### Let's name the columns, view the data as a tibble, and check its size and whether there are any missing values or duplicated rows.

In [None]:
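# The file has no header row, so assign descriptive column names.
colnames(Boston) <- c('Crime', 'LrgLots', 'IndAcr', 'River', 'NOX', 'Rms', 'OwnOcc', 'DistWork', 'HiwayAcc', 'PropTax', 'EdRat', 'Min', 'LowStatus', 'MedVal')
as_tibble(Boston)      # prints a compact preview; Boston itself stays a data.frame
object_size(Boston)    # memory footprint
sum(is.na(Boston))     # total number of missing values
anyDuplicated(Boston)  # index of the first duplicated row, or 0 if there are none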
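#### The totals above say whether anything is missing or duplicated, but not where. As a minimal sketch, the checks can be broken out per column and per row. The data frame `toy` is a hypothetical stand-in with made-up values, used only so the example runs on its own.

```r
# Toy data frame standing in for Boston (hypothetical values, for illustration only)
toy <- data.frame(
  a = c(1, 2, 2, NA),
  b = c(10, 20, 20, 40)
)

# Missing values counted per column rather than as one grand total
na_per_column <- colSums(is.na(toy))

# Number of rows that repeat an earlier row
n_duplicate_rows <- sum(duplicated(toy))

# Index of the first duplicated row (0 when there are none)
first_duplicate <- anyDuplicated(toy)

print(na_per_column)
print(n_duplicate_rows)
print(first_duplicate)
```

#### Swapping `toy` for `Boston` would give the same breakdown for our data.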

#### Next we will run a statistical summary of the data and examine its structure.

In [None]:
summary(Boston)
str(Boston)

#### We see that two features are stored as integers and the rest as numeric.  The two integer features are really categorical: *River* is a binary indicator and *HiwayAcc* is an index.  Some other features also behave like factors, so we will take care of those in a bit.

#### Next we create an object, *BostonNum*, to work with.  It contains only the numeric features, leaving out *River* and *HiwayAcc*.  We rerun the summary and structure as a sanity check to make sure nothing went wrong.

In [None]:
# Keep only the numeric features (everything except River and HiwayAcc).
BostonNum <- subset(Boston, select = c('Crime', 'LrgLots', 'IndAcr', 'NOX', 'Rms', 'OwnOcc', 'DistWork', 'PropTax', 'EdRat', 'Min', 'LowStatus', 'MedVal'))
summary(BostonNum)
str(BostonNum)
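#### The sanity check can also be made explicit instead of read off the printed output. The sketch below uses two hypothetical toy objects, `full` and `num_only`, in place of *Boston* and *BostonNum*; the values are made up.

```r
# Toy stand-ins for Boston and BostonNum (hypothetical values)
full <- data.frame(
  Crime    = c(0.1, 0.2),
  River    = c(0L, 1L),
  HiwayAcc = c(1L, 24L),
  MedVal   = c(24.0, 21.6)
)
num_only <- subset(full, select = -c(River, HiwayAcc))

# Which columns were dropped?
dropped <- setdiff(names(full), names(num_only))

# Stop with an error if the subset is not what we intended
stopifnot(
  identical(dropped, c("River", "HiwayAcc")),       # only the two integer features are gone
  nrow(num_only) == nrow(full),                     # no rows were lost
  all(vapply(num_only, is.double, logical(1)))      # every remaining column is double
)
```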

#### Things look good, so now we can visualize the features to see what we are working with.  First we plot a histogram of every feature in the data set.

In [None]:
plot_histogram(Boston)

#### Density plots are helpful too.  None of the features look normally distributed, do they?

In [None]:
plot_density(Boston)
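#### One way to put a number on what the density plots suggest is sample skewness, which is near 0 for symmetric data and positive for a long right tail. The helper `sample_skewness` below is a hypothetical base-R sketch of the moment-based formula; it should agree with `skewness()` from the *moments* package, though that is not verified here. The two vectors are simulated, not taken from our data.

```r
# Moment-based sample skewness: m3 / m2^(3/2), with NAs dropped
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(1)
symmetric    <- rnorm(500)  # simulated, roughly symmetric
right_skewed <- rexp(500)   # simulated, long right tail

# Compare the two: the exponential sample should score clearly higher
sapply(list(symmetric = symmetric, right_skewed = right_skewed), sample_skewness)
```

#### For our data, `sapply(BostonNum, sample_skewness)` would give one value per numeric feature.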

#### Let's break out the other features we think are factors and look at them separately using bar plots.  *HiwayAcc* becomes a factor directly, and the others are binned into two equal-width intervals.  Again, things look very skewed.

In [None]:
Boston$HiwayAcc <- as.factor(Boston$HiwayAcc)
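# Bin each feature into two equal-width intervals and store it as a new "<name>_d" factor column.
for (col in c('Crime', 'LrgLots', 'IndAcr', 'Min', 'River')) {
    Boston[[paste0(col, "_d")]] <- as.factor(ggplot2::cut_interval(Boston[[col]], 2))
}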
plot_bar(Boston)
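#### To see what two-interval binning does to a skewed feature, here is a base-R sketch of equal-width binning. The helper `bin_equal_width` is hypothetical and only approximates `ggplot2::cut_interval()`; the interval labels it produces may differ. The input is simulated.

```r
# Split the range of x into n equal-width intervals and return a factor
bin_equal_width <- function(x, n = 2) {
  lo <- min(x)
  hi <- max(x)
  # Interior break points are evenly spaced; the end points are the exact min and max
  breaks <- c(lo, lo + (hi - lo) * seq_len(n - 1) / n, hi)
  cut(x, breaks = breaks, include.lowest = TRUE)
}

set.seed(2)
skewed <- rexp(200)                 # simulated right-skewed values
binned <- bin_equal_width(skewed, 2)

# Most observations land in the lower interval, which is what a lopsided bar plot shows
table(binned)
```

#### Because the intervals have equal width rather than equal counts, a long tail leaves the upper bin nearly empty.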

#### Let's see if we can get a better idea of the nature of the skew and identify any outliers.  We can use boxplots for this.  They follow Tukey's conventions, so the points drawn as outliers meet Tukey's definition: anything more than *1.5 x IQR* below the first quartile or above the third quartile.

In [None]:
plot_boxplot(Boston, by = 'River')
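#### The *1.5 x IQR* rule can also be applied directly to list the outlying values. The helper `tukey_outliers` is a hypothetical sketch using `quantile()` with its default settings; boxplot implementations may compute the quartiles slightly differently, so the flagged points can differ at the margins. The example vector is made up.

```r
# Return the values of x that fall outside Tukey's fences
tukey_outliers <- function(x, k = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75), names = FALSE)  # first and third quartiles
  iqr <- q[2] - q[1]
  lower_fence <- q[1] - k * iqr
  upper_fence <- q[2] + k * iqr
  x[x < lower_fence | x > upper_fence]
}

values <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)  # hypothetical data with one extreme value
tukey_outliers(values)
```

#### For our data, `sapply(BostonNum, function(x) length(tukey_outliers(x)))` would count the flagged points per numeric feature.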

#### Let's also look at the features we identify as numeric to see if there are any linear relationships of interest.  Plotting each one against *MedVal* also gives a sense of the correlations.  Some linear relationships appear to exist, and a few may be strongly correlated, but overall the point clouds are pretty loose.  The plots in the bottom half are not of much interest to us.

In [None]:
plot_scatterplot(subset(Boston, select = -c(Crime, LrgLots, IndAcr, Min, River)), by = 'MedVal', size = 0.5)
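#### The impression from the scatterplots can be checked against Pearson correlations with the target. This is a minimal sketch on simulated data; the column names `x_pos`, `x_neg`, and `target` are hypothetical and do not come from our data set.

```r
set.seed(3)
n <- 200

# Simulated data: target rises with x_pos and falls with x_neg
toy <- data.frame(x_pos = rnorm(n), x_neg = rnorm(n))
toy$target <- 2 * toy$x_pos - 2 * toy$x_neg + rnorm(n)

# Correlation of every predictor with the target
predictors <- setdiff(names(toy), "target")
cors <- sapply(toy[predictors], cor, y = toy$target)

sort(cors, decreasing = TRUE)
```

#### With our data, the same pattern would use the columns of *BostonNum* as predictors and *MedVal* as the target.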

#### Our intuition tells us there is little or no normality in the distribution of the data, but we can run quantile-quantile plots to see this more clearly.  The red line is the theoretical path the data would take were they distributed normally.

In [None]:
par(mfrow = c(3, 4))  # 3 x 4 grid, one panel per numeric feature
num_features <- c('Crime', 'LrgLots', 'IndAcr', 'NOX', 'Rms', 'OwnOcc', 'DistWork', 'PropTax', 'EdRat', 'Min', 'LowStatus', 'MedVal')
for (feature in num_features) {
    qqnorm(Boston[[feature]], main = feature)  # sample quantiles against normal quantiles
    qqline(Boston[[feature]], col = 'red')     # reference line for a normal distribution
}

#### We can see that none of the features follow the red line closely, so we cannot say any of the data are distributed normally.
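#### A formal test can back up the visual read. The sketch below runs the Shapiro-Wilk test from base R on a simulated, clearly non-normal sample; a small p-value is evidence against normality. With several hundred rows the test flags even mild departures, so it is best read alongside the quantile plots rather than in place of them.

```r
set.seed(4)
skewed <- rexp(300)  # simulated right-skewed sample, not from our data

# Null hypothesis: the sample comes from a normal distribution
result <- shapiro.test(skewed)
result$p.value
```

#### For our data, `sapply(BostonNum, function(x) shapiro.test(x)$p.value)` would give one p-value per numeric feature.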