# Julia for Data Science

* Data
* **Data processing**
* Visualization

### Data processing: Standard machine learning algorithms in Julia
In what follows, we will see how to use some of the standard machine learning algorithms implemented in Julia.

First, add the `DataFrames` and `CSV` packages.

### Example 1: Kmeans Clustering

Let's start with some data.

The Sacramento real estate transactions file that we download next is a list of 985 real estate transactions in the Sacramento area reported over a five-day period.
Download the file from http://samplecsvs.s3.amazonaws.com/Sacramentorealestatetransactions.csv and read it with `CSV` into a variable named `houses`.

Let's use `Plots` to plot with the `pyplot` backend.
Add the Plots package and create an empty plot.

Now let's create a scatter plot to show the price of a house vs. its square footage.

*Houses with 0 square feet that cost money?*

The square footage seems to not have been recorded in these cases. 

Filtering these houses out is easy to do!
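As a minimal sketch of that filtering step, the snippet below uses a few made-up rows instead of the real file. The column names `:sq__ft` and `:price` are assumptions about the CSV header; adjust them to whatever `names(houses)` reports.

```julia
using DataFrames

# Stand-in data: made-up rows, not the real Sacramento file.
# The column names :sq__ft and :price are assumptions about the CSV header.
houses = DataFrame(
    sq__ft = [836, 0, 1167, 0, 1450],
    price  = [59222, 68212, 68880, 69307, 81900],
)

# Keep only the rows whose recorded square footage is positive.
filter_houses = houses[houses.sq__ft .> 0, :]
```

Indexing with a Boolean mask returns a new `DataFrame`, so `houses` itself is left unchanged.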

This makes sense! The higher the square footage, the higher the price.

We can also group the rows of a `DataFrame` by feature value, using the `by` function. (Newer DataFrames releases replace `by` with `groupby` followed by `combine`.)

Get the mean price for each house type.
`mean()` has been moved into the `Statistics` module in the standard library; you need to first enter `using Statistics` to start using it.
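One way to compute a per-type mean is sketched below with `groupby` and `combine`. The rows are made up, and the column names `:type` and `:price` are assumptions about the data set.

```julia
using DataFrames
using Statistics  # provides mean

# Made-up rows; :type and :price are assumed column names.
houses = DataFrame(
    type  = ["Residential", "Condo", "Residential", "Condo"],
    price = [100_000, 80_000, 200_000, 120_000],
)

# Split the rows by :type, then reduce each group's :price column to its mean.
mean_prices = combine(groupby(houses, :type), :price => mean => :mean_price)
```

The result has one row per house type, with the averaged prices in a new `:mean_price` column.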

Now let's do some kmeans clustering on this data.

First, we can load the `Clustering` package to do this.

Let us see how `Clustering` works with a generic example first.

Make a random dataset with 1000 points, where each point is a 5-dimensional vector.
Then perform K-means over `X`, trying to group the points into 20 clusters with a maximum of 200 iterations.
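A sketch of this generic example, assuming the `Clustering` package is installed. The seed is only there to make the run repeatable, and the names `a`, `c` and `M` are illustrative.

```julia
using Clustering
using Random

Random.seed!(1)      # only to make the sketch repeatable

# 1000 points, each a 5-dimensional vector: one point per COLUMN.
X = rand(5, 1000)

# Group the points into 20 clusters, allowing at most 200 iterations.
R = kmeans(X, 20; maxiter = 200)

a = assignments(R)   # cluster index of every point
c = counts(R)        # number of points in each cluster
M = R.centers        # 5×20 matrix, one cluster center per column
```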

Now, let's get back to the problem at hand and see how this applies to our housing data.
Let's collect the features `:latitude` and `:longitude` so that we can pass them to `kmeans`. First we put the `:latitude` and `:longitude` columns into a new `DataFrame` called `X`,

and then we turn `X` into an `Array` via `X = convert(Array, X)`. (Recent DataFrames releases use `Matrix(X)` for this conversion instead.)

Then we replace missing values in X with the median value.
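A minimal sketch of median imputation on a single column, using only the standard library. The vector `lat` and its values are made up for illustration.

```julia
using Statistics

# Hypothetical column with gaps; the values are made up.
lat = [38.6, missing, 38.5, 38.7, missing]

# Median of the observed entries only.
m = median(skipmissing(lat))

# coalesce returns its first non-missing argument, so every gap becomes m.
lat_filled = coalesce.(lat, m)
```

For a matrix, the same two lines can be applied to each column in turn.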

We now take the transpose of `X` using the `transpose()` function. The transpose is required because `kmeans()` treats each row as a feature and each column as a data point.
`transpose()` returns a lazy `Transpose` wrapper rather than a plain array, so to use the result in `kmeans` we (for now) have to convert it back into an ordinary `Array`. For this, use the `copy` function.
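The difference between the lazy transpose and its materialized copy can be seen on a small made-up matrix (the names `Xt_lazy` and `Xt` are illustrative):

```julia
using LinearAlgebra

# 3 points (rows) × 2 features (columns); the values are made up.
X = [38.63 -121.43;
     38.48 -121.35;
     38.62 -121.44]

Xt_lazy = transpose(X)   # a lazy Transpose wrapper that shares memory with X
Xt      = copy(Xt_lazy)  # a plain 2×3 Matrix{Float64}
```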

As a first pass at guessing how many clusters we might need, let's use the number of zip codes in our data.

(Try changing this to see how it impacts results!)

Now, we can use the `kmeans()` function to do kmeans clustering!

Now let's create a new data frame, `df`, with all the same data as `filter_houses` that also includes a column for the cluster to which each house has been assigned.
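A sketch of attaching the cluster assignments as a new column, assuming the `Clustering` and `DataFrames` packages. The six rows below are a made-up stand-in for `filter_houses`, and the column name `:cluster` is an illustrative choice.

```julia
using Clustering
using DataFrames

# Made-up stand-in for filter_houses: two well-separated groups of houses.
filter_houses = DataFrame(
    latitude  = [38.50, 38.51, 38.52, 38.90, 38.91, 38.92],
    longitude = [-121.50, -121.51, -121.52, -121.10, -121.11, -121.12],
)

# kmeans expects one data point per column, so build a plain 2×6 matrix.
X = copy(transpose(Matrix(filter_houses[:, [:latitude, :longitude]])))

C = kmeans(X, 2)

# Copy the data frame and attach each house's cluster index as a new column.
df = copy(filter_houses)
df.cluster = assignments(C)
```

Because `df` is a copy, `filter_houses` keeps its original columns.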

Let's plot each cluster as a different color.

And now let's try coloring them by zip code.

Let's see the two plots side by side.

Not exactly, but almost: the clusters line up closely with the zip codes. Now we know that ZIP codes are not randomly assigned!

### Example 2: Nearest Neighbor with a KDTree

For this example, let's start by loading the `NearestNeighbors` package.

With this package, we'll look for the `knearest` neighbors of one of the houses, `point`.

Now we can build a `KDTree` and use `knn` to look for `point`'s nearest neighbors!
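A sketch of the tree construction and query, assuming the `NearestNeighbors` package. The coordinates are made up, and plain Euclidean distance on latitude and longitude is used here as a simplification.

```julia
using NearestNeighbors

# Made-up coordinates: one house per COLUMN (row 1 = latitude, row 2 = longitude).
X = [38.60 38.61 38.70 38.62 38.90;
     -121.40 -121.41 -121.30 -121.39 -121.10]

point    = X[:, 1]   # the query house
knearest = 3         # how many neighbors to return

kdtree = KDTree(X)

# The final `true` asks for the results sorted by increasing distance.
idxs, dists = knn(kdtree, point, knearest, true)
```

Since `point` is itself one of the columns of `X`, it comes back as its own nearest neighbor at distance zero.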

We'll first generate a plot with all of the houses in the same color,

and then overlay the data corresponding to the nearest neighbors of `point` in a different color.

There are those nearest neighbors in red!

We can see the cities of the neighboring houses by using the indices, `idxs`, and the feature, `:city`, to index into the `DataFrame` `filter_houses`.

### Example 3: PCA for dimensionality reduction

Let us try to reduce the dimensions of the price/area data from the houses dataset.

We can start by grabbing the square footage and prices of the houses and storing them in an `Array`.

Recall how the data looks when we plot housing prices against square footage.

We can use the `MultivariateStats` package to run PCA

Use `fit` to fit the model

Note that you can choose the maximum dimension of the new space by setting `maxoutdim`, and you can change the method to, for example, `:svd` with the following syntax.

```julia
fit(PCA, F; maxoutdim = 1, method = :svd)
```

It seems like we only get one dimension with PCA! Let's use `transform` to map all of our 2D data in `F` to `1D` data with our model, `M`.

Let's use `reconstruct` to put our now 1D data, `y`, in a form that we can easily overlay (`Xr`) with our 2D data in `F` along the principal direction/component.

And now we create that overlay, where we can see points along the principal component in red. 
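To make the projection and reconstruction steps concrete, here is a hand-rolled sketch using only the standard library. It is not the `MultivariateStats` implementation; it mirrors the same idea with an SVD of the centered data. The names `F`, `y` and `Xr` follow the text, while `mu`, `Fc`, `P` and the data values are illustrative.

```julia
using LinearAlgebra
using Statistics

# Made-up 2×5 data matrix: one observation per column.
F = [1.0 2.0 3.0 4.0 5.0;
     2.1 3.9 6.2 7.8 10.0]

mu = mean(F, dims = 2)   # 2×1 column of feature means
Fc = F .- mu             # centered data

U = svd(Fc).U            # columns are the principal directions
P = U[:, 1:1]            # keep only the first direction (2×1)

y  = P' * Fc             # 1×5: coordinate of each point along P
Xr = P * y .+ mu         # 2×5: the points mapped back into 2D
```

Every column of `Xr` lies on the line through `mu` in the direction `P`, which is why the reconstructed points fall along a single line when overlaid on the original data.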

(Each blue point maps uniquely to some red point!)

### Example 4: Learn how to build a simple multi-layer-perceptron on the MNIST dataset

MNIST from: https://github.com/FluxML/model-zoo/blob/master/mnist/mlp.jl

Let's start by loading `Flux`, importing a few things from `Flux` explicitly, and bringing the `repeated` function into our scope.

We can now store all the MNIST images in `imgs` and take a peek into this vector to see what the data looks like.

Let's look at the type of an individual image.

#### Reorganizing our array of images

We see this is a 2D array that stores `ColorTypes`. To work more easily with this data, let's convert all `ColorTypes` to floating point numbers.

Now we can see what `imgs[3]` looks like as an array of floats, rather than as an array of colors!

**Let's stack the images to create one large 2D array, `X`, that stores the data for each image as a column.**

To do this, we can **first** use `reshape` to unravel each image, creating a 1D array (`Vector`) of floats from a 2D array (`Matrix`) of floats.

(Note that `Vector` is an alias for a 1D `Array`.)

This makes `unraveled_fpt_imgs` a `Vector` of `Vector`s where `imgs[3]` is now

After using `reshape` to get a `Vector` of `Vector`s, we can use `hcat` to build a `Matrix`, `X`, from `unraveled_fpt_imgs` where the `Vector`s stored in `unraveled_fpt_imgs` will become the columns of `X`.

Note that we're using the "splat" operator below, `...`, which passes each element of a collection to a function as a separate argument, rather than passing the collection itself as a single argument.
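The unravel-and-stack step can be sketched on tiny made-up "images" (2×2 instead of 28×28). The names `fpt_imgs` and `second_img` are illustrative.

```julia
# Three made-up 2×2 "images" standing in for the 28×28 MNIST digits.
fpt_imgs = [rand(2, 2) for _ in 1:3]

# Unravel each matrix into a vector (column-major order).
unraveled_fpt_imgs = [reshape(img, :) for img in fpt_imgs]

# Splatting passes each vector as a separate argument: hcat(v1, v2, v3).
X = hcat(unraveled_fpt_imgs...)

# Reshaping a column with the original dimensions undoes the unraveling.
second_img = reshape(X[:, 2], 2, 2)
```

For a very long vector of vectors, `reduce(hcat, unraveled_fpt_imgs)` builds the same matrix without splatting thousands of arguments.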

#### How to go back to images from this 2D `Array`

So now each column in X is an image reshaped to a vector of floating points. Let's pick one column and see what the digit is.

Let's try to view the second image in the original array, `imgs`, by taking the second column of `X`

We'll `reshape` this array to a 2D, 28x28 array,

and finally use `colorview` from the `Images` package to view the handwritten digit.

*Our data is in working order!*

For our machine to learn the digit with which each image is associated, we'll need to train it using correct answers. Therefore we'll make use of the `labels` associated with these images from MNIST.

One-hot-encode the labels with `onehotbatch`

which gives a binary indicator vector for each label
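To show what such an encoding looks like, here is a plain-Julia sketch that builds the indicator matrix with broadcasting. It is an illustration of the idea, not Flux's implementation, and the labels are made up.

```julia
labels = [5, 0, 4, 1, 9]   # made-up digit labels

# Row i corresponds to digit i-1; column j corresponds to the j-th label.
Y = (0:9) .== permutedims(labels)
```

Each column of `Y` contains exactly one `true`, located in the row of the digit it encodes.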

Build the network

Define the loss function and the accuracy
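What these two quantities compute can be sketched in plain Julia. The helpers `xent`, `predicted_class` and `accuracy` below are hypothetical stand-ins written for illustration, not Flux functions, and the probabilities are made up.

```julia
using Statistics

# Cross-entropy between predicted probabilities yhat and one-hot targets y,
# both stored with one sample per column, averaged over the samples.
xent(yhat, y) = -sum(y .* log.(yhat)) / size(y, 2)

# Position of the largest entry in each column.
predicted_class(A) = [argmax(col) for col in eachcol(A)]

# Fraction of samples whose predicted class matches the target class.
accuracy(yhat, y) = mean(predicted_class(yhat) .== predicted_class(y))

# Two made-up samples with three classes each.
yhat = [0.7 0.2;
        0.2 0.5;
        0.1 0.3]
y = [1 0;
     0 0;
     0 1]

l   = xent(yhat, y)
acc = accuracy(yhat, y)
```

Here the first sample is classified correctly and the second is not, so the accuracy is one half.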

Use `X` to create our training data and then declare our evaluation function:

So far, we have defined our training data and our evaluation functions.

Let's take a look at the function signature of Flux.train!

**Now we can train our model and look at the accuracy thereafter.**

Now that we've trained our model, let's create test data, `tX`, 

and run our model on one of the images from `tX`

The largest element of `test_image` is the 8th element. Since the labels run from 0 to 9, the 8th position corresponds to the digit 7, so our model says that `test_image` is a "7".
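The index-to-digit step can be written out explicitly. The output vector below is made up for illustration, and `classes` and `predicted` are illustrative names.

```julia
classes = 0:9

# Made-up model output: one score per digit, in the order 0, 1, ..., 9.
output = [0.01, 0.0, 0.02, 0.01, 0.0, 0.0, 0.0, 0.93, 0.02, 0.01]

# argmax gives the 1-based position of the largest score;
# indexing into classes turns that position into a digit.
predicted = classes[argmax(output)]
```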

Now we can look at the original image.

and there we have it!

### Example 5: Linear regression in Julia (we will write our own Julia code and Python code)

Let's try to find the best line fit of the following data:

We want to fit a line through this data.

Let's write a Julia function to do this.

To fit the line, we just need to find the slope and the y-intercept (a and b).
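One way to write such a function is the closed-form least-squares solution, sketched below. The function name `find_best_fit` and the sample data are illustrative.

```julia
using Statistics

# Least-squares line y ≈ a*x + b.
function find_best_fit(xvals, yvals)
    a = cov(xvals, yvals) / var(xvals)   # slope
    b = mean(yvals) - a * mean(xvals)    # y-intercept
    return a, b
end

x = [1.0, 2.0, 3.0, 4.0]
y = 2 .* x .+ 1          # points that lie exactly on y = 2x + 1

a, b = find_best_fit(x, y)
```

Because the sample points lie exactly on a line, the fit recovers its slope and intercept up to floating-point rounding.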

Then add this fit to the existing plot!

Let's generate a much bigger dataset,

and now we can time how long it takes to find a fit to this data.

Now we will write the same code in Python.

**Let's use the benchmarking package to time these two.**