# Intro to Data Manipulation and Visualization in Julia
In this section, we will learn how to read in data, manipulate it, and visualize it in Julia. This is an important step in solving a real-world optimization problem, as real-world data can be messy and difficult to work with.

## DataFrames
Julia has a tabular data structure similar to data frames in `R`. You will need to load the packages `DataFrames` and `CSV` first:

In [None]:
using DataFrames, CSV

Now let's read in a csv file for the dataset _iris_ using the `CSV.read` function. The first argument is the path to the file, and the second argument, `DataFrame`, tells `CSV.read` what kind of table to build (recent versions of `CSV` require it). The csv file should sit in the same directory as this script. Otherwise, you will need to change the path in the first argument.

In [None]:
iris = CSV.read("iris.csv", DataFrame);
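
If you do not have `iris.csv` at hand, the reading step can be tried on a throwaway file. The following is a minimal sketch that assumes a recent version of `CSV` and uses made-up file contents; the name `toy` is only for illustration.

```julia
using CSV, DataFrames

# Write a tiny, made-up csv to a temporary file so the example is self-contained.
path = tempname() * ".csv"
write(path, "SepalLength,Species\n5.1,setosa\n7.0,versicolor\n")

# The second argument is the "sink": the table type to build from the parsed file.
toy = CSV.read(path, DataFrame)

size(toy)    # (rows, columns) of the result
names(toy)   # column names taken from the header line
```

The header line becomes the column names, and the column types are inferred from the values.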

In [None]:
### If you are unable to read the data, you can uncomment the following code and run it:
# using RDatasets
# iris = dataset("datasets", "iris")

To view the first few rows of the data, you can use `first()`, or index the dataframe as you would in `R`.

To subset rows, pass the row indices in the first dimension. If you are not subsetting to particular columns, pass `:` in the second dimension (as opposed to leaving it blank in `R`).

In [None]:
iris[1:5,:]
first(iris,5)
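
Row subsetting is not limited to ranges. As a small sketch on hypothetical toy data (the data frame `df` and its columns are invented for illustration), rows can also be picked by a list of indices or by a logical condition:

```julia
using DataFrames

# Hypothetical toy data, not part of the iris dataset.
df = DataFrame(id = 1:6, value = [10, 20, 30, 40, 50, 60])

first_three = df[1:3, :]             # rows 1 to 3, all columns
picked      = df[[1, 4], :]          # rows 1 and 4 only
large       = df[df.value .> 30, :]  # rows where `value` exceeds 30
top_two     = first(df, 2)           # same rows as df[1:2, :]
```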

To index a column by name, put a `:` in front of the name to make it a Julia symbol. The same symbol can also be written as `Symbol("SepalLength")`, which is useful when the column name contains spaces.

To select all rows of a column, type either `[:,:columnName]`, which returns a copy of the column, or `[!,:columnName]`, which returns the column itself without copying.

In [None]:
iris[!,:SepalLength]
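
The two ways of selecting all rows differ once you modify the result. A minimal sketch on a made-up one-column data frame (the names `df`, `copied`, and `shared` are only for illustration):

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

copied = df[:, :x]   # an independent copy of the column
shared = df[!, :x]   # the column vector stored inside `df`

copied[1] = 100      # does not touch `df`
shared[2] = 200      # changes `df` as well

df.x                 # now [1, 200, 3]

# A symbol built from a string refers to the same column.
same = df[!, Symbol("x")] === df[!, :x]
```

Use `!` when you want to avoid copying a large column, and `:` when you want to be sure later changes cannot affect the original data frame.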

We often need to join/merge datasets. Let's look at an example first: suppose we have a dataframe that gives the species and the respective price at a flower shop:

In [None]:
species_price = DataFrame(Species = ["setosa", "versicolor", "virginica"],
                        Price = [2.5, 3.1, 3.2])

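To join, call the function named after the kind of join you want (`leftjoin`, `rightjoin`, `innerjoin`, `outerjoin`, etc.) and pass in:
* the two data frames, and
* the shared variable name through the `on` keyword.

Older versions of `DataFrames` used `join(..., kind = :left)` for this; that form has since been removed.

In [None]:
leftjoin(iris, species_price, on = :Species)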
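
To see how the kinds of join differ, here is a small sketch on toy data. It assumes a recent version of `DataFrames`, where each kind of join has its own function; the `flowers` table and its `"unknown"` species are made up for illustration.

```julia
using DataFrames

# Made-up measurements; "unknown" has no price, "versicolor" has no measurement.
flowers = DataFrame(Species = ["setosa", "virginica", "unknown"],
                    SepalLength = [5.1, 6.3, 4.9])
prices = DataFrame(Species = ["setosa", "versicolor", "virginica"],
                   Price = [2.5, 3.1, 3.2])

left  = leftjoin(flowers, prices, on = :Species)   # every row of `flowers`; Price is missing for "unknown"
inner = innerjoin(flowers, prices, on = :Species)  # only species present in both tables
outer = outerjoin(flowers, prices, on = :Species)  # every species from either table
```

Rows without a match are filled with `missing`, so a left or outer join can change a column's element type to allow missing values.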


## Plotting in Julia

Julia also has extensive support for plotting. 

* `Plots.jl` is a powerful and concise tool for plotting. It provides the interface to many other plotting packages with simple and consistent syntax.
* `StatsPlots.jl` adds DataFrames integration to `Plots`. You can pass in a data frame, and map aesthetics to the column names directly.

Using these would be somewhat similar to working with `ggplot2` in `R`. 

Here is a scatter plot of the `iris` data, with `SepalLength` on the x axis, `SepalWidth` on the y axis, and the points grouped (and therefore colored) by `Species`.

In [None]:
using Plots
using StatsPlots
pyplot()
scatter(iris[!,:SepalLength],iris[!,:SepalWidth],group=iris[!,:Species])
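
The DataFrames integration in `StatsPlots` comes through its `@df` macro, which lets you refer to columns by symbol instead of indexing the data frame each time. A minimal sketch on a made-up four-row table (the name `toy` and its values are only for illustration, and the default plotting backend is assumed):

```julia
using DataFrames, StatsPlots

# Made-up rows in the shape of the iris data.
toy = DataFrame(SepalLength = [5.1, 4.9, 7.0, 6.4],
                SepalWidth  = [3.5, 3.0, 3.2, 3.2],
                Species     = ["setosa", "setosa", "versicolor", "versicolor"])

# Inside @df, symbols that match column names are replaced by those columns.
p = @df toy scatter(:SepalLength, :SepalWidth, group = :Species,
                    xlabel = "Length", ylabel = "Width")
```

With the full data, the same call would read `@df iris scatter(:SepalLength, :SepalWidth, group = :Species)`.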

We can make the plot more interesting by adding a few custom settings. For example:
* Give it a title
* Provide xlabel and ylabel
* Change the transparency, shape, and size of the dots
* Change the background color to dark grey

In [None]:
scatter(iris[!,:SepalLength],iris[!,:SepalWidth],group=iris[!,:Species],
        title = "Sepal length vs. width",
        xlabel = "Length", ylabel = "Width",
        m=(0.5, [:cross :hex :star7], 12),
        bg=RGB(.2,.2,.2))

You can also draw a box plot (with a violin plot in the background) grouped by species. Note that the `!` in `boxplot!` draws the new plot on top of the existing one instead of starting a new figure.

In [None]:
violin(iris[!,:Species],iris[!,:SepalLength])
boxplot!(iris[!,:Species],iris[!,:SepalLength], leg=false,
    xlabel = "Species", ylabel = "Sepal Length")

There are many other types of plots and custom options. You can explore more from [the tutorial](https://juliaplots.github.io/tutorial/).

## Exercise: Plotting Icecream data

This time, we are going to read in a dataset directly from the package `RDatasets`. Use the call `dataset("Ecdat", "Icecream")` and save the result as a dataframe called `icecream`.

The dataset records ice cream consumption. The columns are:
* `Cons`: consumption level of ice cream
* `Income`: income level
* `Price`: price of ice cream
* `Temp`: outside temperature at time of measurement

Inspect the first few rows of the data.

In [None]:
using RDatasets
icecream = dataset("Ecdat", "Icecream")
first(icecream,5)

### Task 1:
How is income related to consumption?

In [None]:
scatter(icecream[!,:Income], icecream[!,:Cons],
    xlabel = "Income", ylabel = "Consumption")

### Task 2:
Create the `Revenue` variable as the product between `Price` and `Cons`. 

Do you see a positive relationship between the temperature and revenue?

In [None]:
icecream[!,:Revenue] = icecream[!,:Price] .* icecream[!,:Cons]
scatter(icecream[!,:Temp], icecream[!,:Revenue],
    xlabel = "Temperature", ylabel = "Revenue")

### Task 3:
Create a new variable `IncomeGroup` that assigns a label to each row based on the recorded income (e.g. you could have 'low', 'medium' and 'high' groups).

Plot the distribution of the consumption over the different groups. What do you find?

In [None]:
# Label an income value using the cutoffs 80 and 85.
function get_income_group(x)
    if x < 80
        return "low"
    elseif x < 85
        return "medium"
    else
        return "high"
    end
end

icecream[!,:IncomeGroup] = map(get_income_group,icecream[!,:Income])

In [None]:
boxplot(icecream[!,:IncomeGroup], icecream[!,:Cons], leg=false,
    xlabel = "Income group", ylabel = "Consumption")
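
As a numeric complement to the box plot, the groups can also be summarized with `groupby` and `combine` from `DataFrames`. The sketch below is one possible approach, not part of the exercise solution; it uses made-up numbers rather than the Icecream data, and the names `toy`, `label`, and `by_group` are only for illustration.

```julia
using DataFrames, Statistics

# Made-up values for illustration only; they are not taken from the Icecream data.
toy = DataFrame(Income = [76, 79, 82, 84, 88, 95],
                Cons   = [0.30, 0.32, 0.35, 0.37, 0.42, 0.44])

# Same cutoffs as get_income_group above, written as a one-line function.
label(x) = x < 80 ? "low" : x < 85 ? "medium" : "high"
toy[!, :IncomeGroup] = map(label, toy[!, :Income])

# One row per group: mean consumption and number of observations.
by_group = combine(groupby(toy, :IncomeGroup),
                   :Cons => mean => :MeanCons,
                   nrow => :Count)
```

Applied to `icecream`, the same `combine(groupby(...))` call would give the average consumption behind each box in the plot.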