## Formatting Data
My first step is always to look at some basic descriptors of the dataset, confirm that the data loaded correctly, and format the data as necessary.

#### Dataset Description
I will be looking at the Boston Housing dataset, which first appeared in the *Journal of Environmental Economics and Management* in 1978. It provides data on census tracts in the Boston area.
I will get the dataset by loading the R package `mlbench`, a long-established and reliable package. Details about the dataset can be found in the [R documentation](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/BostonHousing).

`mlbench` includes both `BostonHousing` and `BostonHousing2`. `BostonHousing2` adds extra features, such as the official census tract code that the US Census Bureau assigns to each tract. I want to use that code as the row names (the index) of my dataset.
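
One way to see exactly which features the second dataset adds is to compare the column names of the two data frames. The sketch below is only an illustration: the helper name `extra_columns` is my own and not part of `mlbench`, and the comparison only runs if `mlbench` is installed.

```r
# Hypothetical helper (not from mlbench): columns present in `extended`
# but absent from `original`.
extra_columns <- function(extended, original) {
  setdiff(names(extended), names(original))
}

# Only touch the real datasets when mlbench is available.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("BostonHousing", "BostonHousing2", package = "mlbench")
  print(extra_columns(BostonHousing2, BostonHousing))
}
```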

Once I have determined how I want the dataset formatted, I will put those formatting steps in an R file that I load in each notebook. Rather than loading the raw dataset (running `data(BostonHousing2)`), I will load my R file (`source('load_data.r')`).

In [2]:
# install mlbench (only needed once), then attach it
install.packages("mlbench")
library(mlbench)

Loading required package: mlbench
“there is no package called ‘mlbench’”
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [5]:
data(BostonHousing2); bh2 <- BostonHousing2

In [6]:
str(bh2)

'data.frame':	506 obs. of  19 variables:
 $ town   : Factor w/ 92 levels "Arlington","Ashland",..: 54 77 77 46 46 46 69 69 69 69 ...
 $ tract  : int  2011 2021 2022 2031 2032 2033 2041 2042 2043 2044 ...
 $ lon    : num  -71 -71 -70.9 -70.9 -70.9 ...
 $ lat    : num  42.3 42.3 42.3 42.3 42.3 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
 $ cmedv  : num  24 21.6 34.7 33.4 36.2 28.7 22.9 22.1 16.5 18.9 ...
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : int  296 242 242 222 2
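
The check that the data loaded correctly can also be made explicit in code. This is a minimal sketch, assuming the dimensions reported by `str()` above (506 observations of 19 variables) are the ones expected. `check_shape` is a hypothetical helper of my own, not something defined elsewhere in this notebook.

```r
# Hypothetical helper: stop if the data frame does not have the expected
# shape, then report the number of missing values in each column.
check_shape <- function(df, n_rows, n_cols) {
  stopifnot(is.data.frame(df), nrow(df) == n_rows, ncol(df) == n_cols)
  colSums(is.na(df))
}

# With the data loaded as bh2, the call would be:
# check_shape(bh2, 506, 19)
```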

#### Removing Unnecessary Features
Of the features that are not in the original dataset, I only want the census tract code, so I will remove the rest. I will also use tract as my index (the row names) and then drop it as a feature.

In [7]:
# indexing dataframe using Census tract code
rownames(bh2) <- bh2$tract

# removing attributes that are in the updated dataset BostonHousing2 but not in the original BostonHousing.
# tract is removed as well because it is now the dataset's row names and I do not want it as a feature.
bh2 <- subset(bh2, select = -c(cmedv, town, lon, lat, tract))
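
The formatting steps in this section could be collected into `load_data.r` roughly as follows. This is only a sketch of what such a file might contain: the function name `format_boston` and the object name `bh` are my own assumptions, and the uniqueness check on `tract` is an extra safeguard I added because row names must be unique.

```r
# Hypothetical sketch of load_data.r; format_boston and bh are assumed names.
format_boston <- function(df) {
  # Row names must be unique, so stop early if any tract code repeats.
  stopifnot(!anyDuplicated(df$tract))
  rownames(df) <- df$tract
  # Drop the columns BostonHousing2 adds to the original BostonHousing,
  # plus tract, which is now stored in the row names.
  drop_cols <- c("cmedv", "town", "lon", "lat", "tract")
  df[, !(names(df) %in% drop_cols), drop = FALSE]
}

# Only load the real data when mlbench is available.
if (requireNamespace("mlbench", quietly = TRUE)) {
  data("BostonHousing2", package = "mlbench")
  bh <- format_boston(BostonHousing2)
}
```

Each notebook would then start with `source('load_data.r')` and work with the formatted data frame directly.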