## Preliminary Analysis

We will start with some preliminary analysis on our data set.

### Load the dataset using R

First, we load the data set using the R function `read.csv` and assign it to the variable `titanic`. Note that `read.table` and `read.csv` in R are equivalent except for their default arguments. `read.table` defaults to separating on white space and to `header = FALSE`. `read.csv` defaults to separating on commas and to `header = TRUE`.
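
As a small illustration of that equivalence, the sketch below reads the same comma-separated text with both functions. The text and the names `csv_text`, `via_csv`, and `via_table` are made up for this example and are not part of the Titanic data.

```r
# Made-up CSV content, used only for illustration.
csv_text <- "name,age\nAda,36\nAlan,41"

# read.csv already defaults to sep = "," and header = TRUE.
via_csv <- read.csv(text = csv_text)

# read.table needs both arguments spelled out to read the same text.
via_table <- read.table(text = csv_text, sep = ",", header = TRUE)

# Compare the two resulting data frames.
print(identical(via_csv, via_table))
```

For a file on disk, the same comparison would pass a file path as the first argument instead of `text =`.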

#### load the dataset using `read.csv()`

Load the csv file `titanic.csv` into a dataframe called `titanic`.

In [None]:
# Write your code here.


In [None]:
stopifnot(dim(titanic) == c(891,12))

Next, we display the dimension (`dim()`) and the structure (`str()`) of our data frame. This is mostly done as a sanity check. We should have some idea of what the dimension and structure of our data are. By displaying these results immediately after loading the data, we can verify that the data has been loaded as we expect.

#### display the dimension of the data set

In [None]:
dim(titanic)

#### display the structure of the dataframe

In [None]:
str(titanic)

### The `R` Structure Object

I interpret the structure of our data frame in the following way. Each row in the output of `str(titanic)` represents a column in the data frame `titanic`. The value immediately following the `$` is the name of that column. The value immediately following the `:` is the data type of that column. The values following the data type are the first few values of the data in the column itself.

Note that R has made some default decisions about the structure of our data. It has designated five columns as integer columns, five columns as factor columns, and two columns as numeric columns. R made these choices by inferring the type of each column while reading the CSV file, and they may or may not match our own understanding of the data. For example, a reasonable case could be made that neither the `Survived` column nor the `Pclass` column should be an integer. (In R 4.0.0 and later, `read.csv` defaults to `stringsAsFactors = FALSE`, so text columns are read as character vectors rather than factors unless `stringsAsFactors = TRUE` is passed.)
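
If we decide that an integer column is really categorical, one possible way to change it is with `factor()`. The sketch below uses a toy data frame with made-up values; the name `toy` and the labels `"no"` and `"yes"` are assumptions chosen for illustration.

```r
# Toy data frame with made-up values, not the Titanic data.
toy <- data.frame(Survived = c(0L, 1L, 1L), Pclass = c(3L, 1L, 3L))

# Both columns start out as integers.
str(toy)

# Convert Survived to a factor with readable labels (labels are an assumption).
toy$Survived <- factor(toy$Survived, levels = c(0, 1), labels = c("no", "yes"))

# Convert Pclass to a factor, keeping the original values as level names.
toy$Pclass <- factor(toy$Pclass)

# Both columns are now reported as factors.
str(toy)
```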

### Categorical Features In R 

R stores categorical features using a special type of vector called a **factor**. The data is stored as a vector of integers. The factor has an additional attribute, however: a vector of levels. The integers stored as data are indices into that vector of levels. We can think of the data stored in the factor as a mapping to the vector of levels.
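
A minimal sketch of this mapping, using a small hand-made factor rather than the Titanic data (the name `ports` and its values are assumptions for illustration):

```r
# A small hand-made factor, used only for illustration.
ports <- factor(c("S", "C", "S", "Q"))

# The vector of levels; by default the sorted unique values.
levels(ports)

# The underlying integer codes, one per element.
as.integer(ports)

# Indexing the levels by the codes recovers the original labels.
levels(ports)[as.integer(ports)]
```

Here the codes are positions in `levels(ports)`, which is why the last line reproduces the labels the factor was built from.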

#### display the class of the `titanic$Embarked` column

In [None]:
class(titanic$Embarked)

#### display the levels of the `titanic$Embarked` column

In [None]:
levels(titanic$Embarked)

#### display the first few values of the `titanic$Embarked` column

In [None]:
titanic$Embarked[1:5]

### Completely Unique Columns

We can see from the structure of our data frame that it contains two columns that are completely unique. We are attempting to use patterns in our data to make predictions about the survival of passengers during the Titanic disaster. If a column is completely unique, there is no pattern to identify: each passenger has their own value, and there is no immediate way to associate these values with each other. For this reason we will simply remove the completely unique columns. Before doing so, however, we should verify that they are in fact completely unique.

The two columns in question are `PassengerId` and `Name`. We will use the following method to establish that they are both completely unique:

1. We will take a measure of the number of passengers in the data set
2. We will take a measure of the number of unique values in each of the columns in question
3. If the values match we will consider the column safe for removal
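
The comparison in these steps can be sketched on a toy data frame. The helper `is_completely_unique` and the data frame `toy` are hypothetical names with made-up values, introduced only for this illustration.

```r
# Hypothetical helper: TRUE when every value in the vector x is distinct.
is_completely_unique <- function(x) {
    length(unique(x)) == length(x)
}

# Toy data frame with made-up values, not the Titanic data.
toy <- data.frame(
    id = c(101, 102, 103, 104),
    cabin_class = c("first", "third", "third", "second")
)

# Apply the check to every column of the toy data frame.
sapply(toy, is_completely_unique)
```

In this toy example only `id` passes the check, because `cabin_class` repeats the value `"third"`.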

#### store the number of passengers

Store the number of passengers as the variable `number_of_passengers`.

In [None]:
# Write your code here.


In [None]:
# Write your code here.


#### display the number of unique values in `titanic$PassengerId` and `titanic$Name`

In [None]:
for (column in colnames(titanic)) {
    print(paste(column, length(unique(titanic[,column]))))
}

Any column where every value is unique can be safely dropped. This can be done by assigning `NULL` to the named column. For example, we might do the following on a generic data frame and column:

    dataframe$mycolumn = NULL

#### drop the columns with completely unique values

Drop the columns from `titanic` that have completely unique values.

In [None]:
# Write your code here.


In [None]:
### HIDDEN TESTS

### Write the Updated `DataFrame` to a new csv

In [None]:
# Note: by default write.csv also writes the row names as an unnamed first column.
write.csv(titanic, 'titanic-updated.csv')

### Summarize The Data

Finally, having dropped the features deemed not immediately useful, we display the summary statistics of the data frame using the `summary()` function. For numerical features, this function shows the minimum, maximum, and quartile values (including the median) as well as the mean. For factors, it shows the counts of the most frequent levels.

In [None]:
summary(titanic)