# Live Training 2023-02-28 Exploratory Data Analysis in Julia for Absolute Beginners - An Analysis of Artisanal French Cheese

Today we are going to explore a dataset on French cheese production. We'll cover:

- Loading Julia packages
- Importing data from TAB-delimited files
- Counting the number of rows for each category of a categorical variable
- Drawing bar plots, histograms, and scatter plots
- Filtering rows of a data frame

The original dataset: [Dataset on the Life Cycle Assessment of 44 artisanally produced French Protected Designation of Origin (PDO) cheeses](https://entrepot.recherche.data.gouv.fr/dataset.xhtml?persistentId=doi:10.15454/JQLIOX)

Journal article describing the dataset: [Adeline Cortesi, Laure Dijoux, Gwenola Yannou-Le Bris, Caroline Pénicaud, Data related to the life cycle assessment of 44 artisanally produced french protected designation of origin (PDO) cheeses, Data in Brief, Volume 43, 2022](https://www.sciencedirect.com/science/article/pii/S235234092200600X)

## Task 0: Install Julia

Colab doesn't natively support Julia, so we'll need to install it.

### Instructions

Run the cell below, and follow the instructions to change the runtime type.

In [None]:
# Installation cell
%%capture
%%shell
if ! command -v julia > /dev/null 2>&1
then
    wget -q 'https://julialang-s3.julialang.org/bin/linux/x64/1.8/julia-1.8.2-linux-x86_64.tar.gz' \
        -O /tmp/julia.tar.gz
    tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
    rm /tmp/julia.tar.gz
fi
echo 'Julia installed'

After you run the first cell (the cell directly above this text), go to Colab's menu bar, select **Edit**, and then select **Notebook settings** from the drop-down. Select *Julia 1.8* as the runtime type. You can change the GPU hardware acceleration to 'None' since we don't need it.

<br/>You should see something like this:

> ![Colab Img](https://raw.githubusercontent.com/Dsantra92/Julia-on-Colab/master/misc/julia_menu.png)


Click on SAVE

Run the next cell to see what version of Julia we are running. If it throws an error, you are probably still stuck using Python.





In [None]:
VERSION

## Task 1: Load the Julia packages

Like Python and R, Julia functionality is spread across many packages. Today we'll use

- `CSV` for reading the tab-delimited files
- `DataFrames` for working with rectangular data
- `Plots` for data visualization

The code pattern for loading packages is

```julia
using PackageName
```

Before we do that, we need to install the packages to Colab. The code for this is provided.

### Instructions

Run the cell to install the Julia packages.

Load the `CSV`, `DataFrames`, and `Plots` packages.

In [None]:
# Install Julia packages
run(`julia -e 'using Pkg; pkg"add CSV DataFrames Plots; precompile;"'`)

In [None]:
# Load the CSV, DataFrames, and Plots packages
using CSV
using DataFrames
using Plots

## Task 2: Import the characterization file

The characterization dataset is in a tab-delimited file. It contains information about each cheese. There are three columns of interest:

- The variety of cheese
- The animal that produced the milk for the cheese
- The technology used to create the cheese

The code pattern for importing tab-delimited files is

```julia
dataset = DataFrame(CSV.File(filename, more_arguments))
```

- `CSV.File()` reads the file and returns a CSV file object
- `DataFrame()` converts the file object to a data frame
- It's a tab-delimited file, not a comma-delimited file, so we need to set the `delim` argument of `CSV.File()` to `"\t"`.
- The first column of the file is blank, so we need to drop it by setting the `drop` argument of `CSV.File()` to `[1]`. The square brackets indicate that we want an array, not a single number.

You can get some quick summary statistics of a data frame using

```julia
describe(dataframe)
```
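To try the import pattern without downloading anything, here is a minimal sketch that reads a small tab-delimited table from a string. The table contents and the name `toy` are made up for illustration; only `CSV.File()`, `DataFrame()`, and `describe()` come from the text above. Wrapping the string in `IOBuffer()` lets `CSV.File()` treat it like a file.

```julia
using CSV, DataFrames

# Hypothetical two-row table. The header of the first column is blank,
# mimicking the layout described above.
raw = "\tVariety\tAnimal\n1\tCheese A\tCow\n2\tCheese B\tSheep\n"

# IOBuffer lets CSV.File read from a string instead of a file on disk.
toy = DataFrame(CSV.File(IOBuffer(raw), delim="\t", drop=[1]))

names(toy)      # the column names that survived the drop
describe(toy)   # one summary row per column
```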

### Instructions

Run the command to download the dataset from GitHub.

Read the `"data_PDOcheeses_characterization.tab"` file and convert it to a DataFrame.

- It is tab-delimited.
- Drop the first column.
- Assign to `characterization`.

Describe the contents of `characterization`.

In [None]:
# Download the dataset from GitHub
run(`wget https://raw.githubusercontent.com/richierocks/live_training_eda_julia_abs_beginners/main/data_PDOcheeses_characterization.tab`)

In [None]:
# Read the "data_PDOcheeses_characterization.tab" file and convert it to a DataFrame
characterization = DataFrame(CSV.File("data_PDOcheeses_characterization.tab", delim="\t", drop=[1]))

In [None]:
# Describe the contents of characterization
describe(characterization) 

## Task 3: Count the cheese by technology

A common task for categorical variables is to count the number of times each category occurs to see which ones are most common. Here we'll count the technologies.

The code pattern for counting categories is:

```julia
counts = combine(groupby(dataframe, :columnname), nrow => :n)
```

- `groupby()` doesn't calculate anything itself; it just tells the next function that calculations should be performed per group, using the values in the specified column.
- The first argument to `groupby()` is the data frame.
- The second argument is the column name (or column names) to group by.
- The `:` before the column name makes it into a `Symbol` object. Essentially, it means "don't treat `columnname` as a variable; use the name directly".
- `combine()` performs the calculation.
- The first argument to `combine()` is a data frame, or in this case, a grouped data frame.
- The second argument to `combine()` specifies the calculation, in this case the function `nrow()` paired with a name for the result.
- `nrow()` returns the number of rows in a data frame, or in this case the number of rows in each group of the data frame.
- The right arrow, `=>`, creates a `Pair` object. It's usually used for populating dictionary objects but here it means "rename the new column of counts to `n`".
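The pieces described above can be tried on a tiny data frame. This is a sketch on made-up data (the `toy` data frame and its values are hypothetical), showing that `:` makes a `Symbol`, that `=>` makes a `Pair`, and that the counting happens in `combine()` rather than in `groupby()`.

```julia
using DataFrames

# Hypothetical toy data: five cheeses and their technologies.
toy = DataFrame(
    Variety = ["A", "B", "C", "D", "E"],
    Technology = ["Soft", "Blue", "Soft", "Pressed", "Soft"],
)

:Technology isa Symbol    # true: the colon creates a Symbol
(nrow => :n) isa Pair     # true: => creates a Pair

grouped = groupby(toy, :Technology)     # a grouped data frame; nothing is counted yet
counts = combine(grouped, nrow => :n)   # one row per technology, with its count in n
```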

### Instructions

Count the number of times each technology was used in the `characterization` dataset.

- Assign to `n_tech`.

In [None]:
# Count the number of times each technology was used in the characterization dataset
n_tech = combine(groupby(characterization, :Technology), nrow => :n)

## Task 4: Clean the technologies

We have two problems with the technologies.

- `"Pressed uncooked cheese "` has a trailing space.
- `"pâte pressée  non cuite "` has extra spaces and hasn't been translated from French.

Both these technologies should be `"Pressed uncooked cheese"`.

The code pattern to transform the values in a column is

```julia
new_dataframe = transform(dataframe, :columnname => ByRow(x -> new_value_of_x))
```

- `transform()` performs calculations on a column of a data frame.
- The first argument to `transform()` is a dataframe.
- The second argument to `transform()` is a Pair.
- The left hand side of the Pair is the name of the column to transform, as a symbol.
- `ByRow()` tells `transform()` that each calculation should be applied to rows one-at-a-time rather than to the whole column.
- `->` denotes an anonymous function. That is, it's a concise syntax for a simple, disposable function that we won't bother to name.
- The `x` to the left of `->` is the variable that we'll use as the input to the function. In this case, it's the original value from `columnname`.
- `new_value_of_x` will be whatever we calculate for the new value of `x`.
- The name of the new column that is created takes the form `columnname_function`.

Other things we'll need here:

- The `in` operator checks for values in an array. That is, `x in ["a", "b", "c"]` returns `true` if `x` has the value `"a"`, `"b"` or `"c"`, and returns `false` otherwise.
- The ternary operator is a concise syntax for if-else. `condition ? yes : no` returns `yes` if the condition is `true` and `no` if the condition is `false`.

Overall we need:

```julia
new_dataframe = transform(dataframe, :columnname => ByRow(x -> x in [bad, values] ? new_x : x))
```
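The `in` check and the ternary operator can be tested on their own, without a data frame. The sketch below uses made-up values (`bad_values` and the animal names are hypothetical) and plain `map()`, which applies a function to each element of an array in much the same way that `ByRow()` applies it to each row of a column.

```julia
# Hypothetical spellings to merge into one clean label.
bad_values = ["goat ", "chèvre"]

# Anonymous function: replace a bad value, otherwise keep the input unchanged.
clean = x -> x in bad_values ? "goat" : x

clean("chèvre")   # in the list, so the replacement is returned
clean("cow")      # not in the list, so it is returned as-is

# Apply the function to every element of an array.
cleaned = map(clean, ["cow", "goat ", "chèvre", "sheep"])
```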

### Instructions

Transform the `Technology` column of `characterization` as follows:

- If the technology is `"Pressed uncooked cheese "` or `"pâte pressée  non cuite "` then return `"Pressed uncooked cheese"`.
- Otherwise return the existing value of `Technology`.

In [None]:
# Transform the Technology column of characterization
clean_characterization = transform(
  characterization, 
  :Technology => ByRow(tech -> tech in ["pâte pressée  non cuite ", "Pressed uncooked cheese "] ? "Pressed uncooked cheese" : tech)
)

## Your turn: Calculate the counts again on the cleaned dataset

Now that the `Technology` column is cleaned up, we can get correct counts. Run the code to count the technologies again.

Hint: The transformation code created a new column named `Technology_function`. Count those values, not the originals!

### Instructions

Count the number of times each technology was used in the `clean_characterization` dataset.

- Use the `Technology_function` column.
- Assign to `n_tech` again.

In [None]:
# Count the number of times each technology was used in the clean_characterization dataset
n_tech = combine(groupby(clean_characterization, :Technology_function), nrow => :n)

## Task 5: Sort the counts from largest to smallest

A bar plot of these counts is easiest to read when the bars run from largest to smallest, so before plotting we need to reorder the rows of the data frame by descending `n`.

The code pattern for this is:

```julia
sort(dataframe, :columnname, other_arguments)
```

- `sort()` sorts the rows of a data frame.
- `:columnname` tells `sort()` which column to sort by.
- By default, `sort()` returns rows from the smallest value of `:columnname` to the largest. For our purposes we want largest to smallest, so we need to reverse this by setting `rev` to `true`.
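Here is a brief sketch of both sort orders on made-up counts (the `toy` data frame is hypothetical). It also shows that `sort()` returns a new data frame and leaves the original alone.

```julia
using DataFrames

toy = DataFrame(Technology = ["Blue", "Soft", "Pressed"], n = [1, 3, 2])  # hypothetical counts

ascending = sort(toy, :n)              # default: smallest n first
descending = sort(toy, :n, rev=true)   # largest n first

descending.Technology
toy.n   # sort() returned a copy, so toy keeps its original order
```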

### Instructions

Sort the rows of `n_tech` in descending order of `n`.

- Assign to `n_tech_sorted`.

In [None]:
# Sort the rows of n_tech in descending order of n
n_tech_sorted = sort(n_tech, :n, rev=true)

## Task 6: Draw a bar plot of technology counts

A common way to visualize counts of categories is with a bar plot.

The code pattern for drawing bar plots is

```julia
bar(categorical_array, count_array, other_arguments)
```

- `bar()` draws the bar plot
- The first argument is the names of the categories.
- The second argument is the counts.
- The category names are quite long. To prevent them from overlapping, rotate the labels by setting `xrotation` to `10`.
- We can also give a more informative y-axis label by setting `ylabel` to `"Count"`.

There are many ways to access the columns in a data frame. The easiest way is to use `dataframe.columnname`.

Overall, the code we want to write looks like

```julia
bar(dataframe.categorycolumn, dataframe.countcolumn, xrotation=n, ylabel="Count")
```
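For reference, here is a sketch of a few of the ways to pull a column out of a data frame, using made-up data (`toy` is hypothetical). The dot syntax is the one used in this tutorial; the indexing forms are shown only for comparison.

```julia
using DataFrames

toy = DataFrame(Technology = ["Soft", "Blue"], n = [3, 1])  # hypothetical counts

by_dot = toy.n            # property syntax, as used in this tutorial
by_symbol = toy[:, :n]    # indexing with a Symbol; returns a copy of the column
by_string = toy[!, "n"]   # indexing with a string; returns the stored column without copying
```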

### Instructions

Draw a bar plot of technology counts (`n`) vs. technologies (`Technology_function`).

- Rotate the x-axis category labels by `10` degrees.
- Set the y-axis label to `"Count"`.

In [None]:
# Draw a bar plot of technology counts vs. technologies
bar(n_tech_sorted.Technology_function, n_tech_sorted.n, xrotation=10, ylabel="Count")

## Task 7: Import the coproducts file

Coproducts are other useful things that are created at the same time as the thing you were trying to create. When you make cheese, you also make whey and cream.

The coproducts dataset contains three columns of interest.

- The variety of cheese
- The quantity of whey produced in kg per kg of cheese produced
- The quantity of cream produced in kg per kg of cheese produced

We use the same code pattern as before, with one change. The first line of the coproducts file contains information about the dataset, but doesn't contain any data. The column headers are on row 2, so we need to set `header` to `2`.

Just as in the characterization file, the first column is blank, so we need to drop it.
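The effect of `header` and `drop` together can be seen on a small made-up table. In this sketch the file contents, cheese names, and numbers are all hypothetical; the layout (a note on line 1, headers on line 2, two blank header cells) is an assumption modeled on the description above.

```julia
using CSV, DataFrames

# Hypothetical file contents: a note on line 1, headers on line 2.
# The first two header cells are blank.
raw = "Some notes about the dataset\n" *
      "\t\tCream produced (kg)\tWhey produced (kg)\n" *
      "1\tCheese A\t0.0\t5.5\n" *
      "2\tCheese B\t0.25\t4.0\n"

toy = DataFrame(CSV.File(IOBuffer(raw), delim="\t", header=2, drop=[1]))

# The blank header in the second position gets a placeholder name.
names(toy)
```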

### Instructions

Run the cell to download the dataset from GitHub.

Read the "data_PDOcheeses_coproducts.tab" file and convert it to a DataFrame.

- It is tab-delimited.
- The column headers are on row `2`.
- Drop the first column.
- Assign to `coproducts`.

Describe the contents of `coproducts`.

In [None]:
# Download the dataset from GitHub
run(`wget https://raw.githubusercontent.com/richierocks/live_training_eda_julia_abs_beginners/main/data_PDOcheeses_coproducts.tab`)

In [None]:
# Read the "data_PDOcheeses_coproducts.tab" file and convert it to a DataFrame
coproducts = DataFrame(CSV.File("data_PDOcheeses_coproducts.tab", delim="\t", header=2, drop=[1]))

In [None]:
# Describe the contents of coproducts
describe(coproducts)

## Task 8: Rename the columns

The first remaining column in `coproducts` had no name in the data file (the header cell was empty), so it was given the placeholder name `Column2`.

The other column names (like `"Cream produced (kg)"`) aren't standard Julia variable names. This is OK, but makes them slightly harder to work with since syntax like `dataframe.columnname` doesn't work.

We need to rename the columns. The code pattern for this is:

```julia
rename!(dataframe, ["existing column 1" => "NewColumn1", "existing column 2" => "NewColumn2"])
```

- `rename()` returns a copy of the data frame with the columns renamed.
- `rename!()` renames the columns of the data frame in place. That is, `rename!(dataframe, ...)` has the same effect as `dataframe = rename(dataframe, ...)`. By convention, Julia functions with a name ending in an exclamation mark modify one of their arguments.
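The difference between the two functions is easiest to see side by side. This sketch uses a made-up one-column data frame (`toy` and its values are hypothetical) and passes a single `Pair` rather than an array of them.

```julia
using DataFrames

toy = DataFrame("Cream produced (kg)" => [0.0, 0.25])   # hypothetical values

# rename() returns a renamed copy; toy itself is unchanged.
renamed = rename(toy, "Cream produced (kg)" => "CreamProducedKg")
names_before = names(toy)

# rename!() changes toy in place, so no assignment is needed.
rename!(toy, "Cream produced (kg)" => "CreamProducedKg")
names_after = names(toy)

toy.CreamProducedKg   # dot syntax works with the new name
```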



### Instructions

Rename the columns in `coproducts` as follows.

- Rename `"Column2"` to `"Variety"`.
- Rename `"Cream produced (kg)"` to `"CreamProducedKg"`.
- Rename `"Whey produced (kg)"` to `"WheyProducedKg"`.


In [None]:
# Rename the columns in coproducts
rename!(coproducts, ["Column2" => "Variety", "Cream produced (kg)" => "CreamProducedKg", "Whey produced (kg)" => "WheyProducedKg"])

## Task 9: Draw a histogram of cream production

A common way of visualizing numeric variables is to draw a histogram of their distribution.

The code pattern to draw a histogram is:

```julia
histogram(numeric_array, other_arguments)
```

- `histogram()` draws the histogram.
- You can change the number of bins with the `bins` argument, but `Plots` is quite good at choosing a sensible default.
- We also want to change the x- and y-axis labels with `xlabel` and `ylabel`.

### Instructions

Draw a histogram of the `CreamProducedKg` variable in `coproducts`.

- Set the x-axis label to `"Cream produced (kg)"`.
- Set the y-axis label to `"Count"`.


In [None]:
# Draw a histogram of the CreamProducedKg variable in coproducts
histogram(coproducts.CreamProducedKg, xlabel="Cream produced (kg)", ylabel="Count")

## Task 10: Filter for cream producing cheeses

The histogram showed that the amount of cream produced was zero for many cheeses. If we want to analyze cream production it might be helpful to only look at cheeses where the cream produced is greater than zero.

The code pattern for filtering rows of a data frame is:

```julia
filter(row -> condition, dataframe)
```

- `filter()` filters the rows of a data frame.
- Notice that this time, the data frame is the second argument.
- The first argument to `filter()` is a function.
- Again we create an anonymous function using `->`.
- Unlike `transform()`, we don't need to use `ByRow()`; the calculations always happen by row.
- The input to the function is a row of the data frame, so `row` is a common name for the input variable.
- The `condition` must resolve to `true` or `false` for each row. In this case we want a condition of the form `row.columnname > n`.
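A quick sketch of the filtering pattern on made-up data (the `toy` data frame and its values are hypothetical):

```julia
using DataFrames

toy = DataFrame(Variety = ["A", "B", "C"], CreamProducedKg = [0.0, 0.3, 0.1])  # hypothetical values

# Keep only the rows where the condition is true; toy itself is left unchanged.
some_cream = filter(row -> row.CreamProducedKg > 0, toy)

some_cream.Variety
nrow(toy)
```

Because `filter()` has no exclamation mark, it returns a new data frame, which is why the result is assigned to a new name.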

### Instructions

Filter `coproducts` for rows where `CreamProducedKg` is greater than `0`.

In [None]:
# Filter coproducts for rows where CreamProducedKg is greater than 0
cream_producing = filter(row -> row.CreamProducedKg > 0, coproducts)

## Your turn: Draw a histogram of cream-producing cheeses

Now that we have a dataset of cream-producing cheeses, we can plot the distribution of cream production again without the zeroes.

### Instructions

Draw a histogram of the `CreamProducedKg` variable in `cream_producing`.

- Set the number of bins to `6`.
- Set the x-axis label to `"Cream produced (kg)"`.
- Set the y-axis label to `"Count"`.

In [None]:
# Draw a histogram of the CreamProducedKg variable in cream_producing
histogram(cream_producing.CreamProducedKg, bins=6, xlabel="Cream produced (kg)", ylabel="Count")

## Your turn: Draw a histogram of whey production

It's also worth taking a look at whey production in cheeses.

### Instructions

Draw a histogram of the `WheyProducedKg` variable in `coproducts`.

- Set the number of bins to `8`.
- Set the x-axis label to `"Whey produced (kg)"`.
- Set the y-axis label to `"Count"`.

In [None]:
# Draw a histogram of the WheyProducedKg variable in coproducts
histogram(coproducts.WheyProducedKg, bins=8, xlabel="Whey produced (kg)", ylabel="Count")

## Task 11: Draw a scatter plot of whey produced vs. cream produced

To examine the relationship between two numeric variables, we typically use a scatter plot.

The code pattern for creating a scatter plot is

```julia
scatter(x_array, y_array, other_arguments)
```

- `scatter()` draws the scatter plot.
- We also want to set the x- and y-axis labels using `xlabel` and `ylabel` as before.

### Instructions

Draw a scatter plot of whey production (y-axis) vs. cream production (x-axis).

- Set the x-axis label to "Cream produced (kg)".
- Set the y-axis label to "Whey produced (kg)".

In [None]:
# Draw a scatter plot of whey production vs. cream production
scatter(coproducts.CreamProducedKg, coproducts.WheyProducedKg, xlabel="Cream produced (kg)", ylabel="Whey produced (kg)")