<a href="https://colab.research.google.com/github/xKDR/Julia-Workshop/blob/main/DataStructuresForSpeed.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <img src="https://github.com/JuliaLang/julia-logo-graphics/raw/master/images/julia-logo-color.png" height="100" /> _Colab Notebook Template_

## Instructions
1. Work on a copy of this notebook: _File_ > _Save a copy in Drive_ (you will need a Google account). Alternatively, you can download the notebook using _File_ > _Download .ipynb_, then upload it to [Colab](https://colab.research.google.com/).
2. If you need a GPU: _Runtime_ > _Change runtime type_ > _Hardware accelerator_ = _GPU_.
3. Execute the following cell (click on it and press Ctrl+Enter) to install Julia, IJulia and other packages (if needed, update `JULIA_VERSION` and the other parameters). This takes a couple of minutes.
4. Reload this page (press Ctrl+R, or ⌘+R, or the F5 key) and continue to the next section.

_Notes_:
* If your Colab Runtime gets reset (e.g., due to inactivity), repeat steps 2, 3 and 4.
* After installation, if you want to change the Julia version or activate/deactivate the GPU, you will need to reset the Runtime: _Runtime_ > _Factory reset runtime_ and repeat steps 3 and 4.

In [None]:
%%shell
set -e

#---------------------------------------------------#
JULIA_VERSION="1.10.4" # any version ≥ 0.7.0
JULIA_PACKAGES="IJulia BenchmarkTools"
JULIA_PACKAGES_IF_GPU="CUDA" # or CuArrays for older Julia versions
JULIA_NUM_THREADS=2
#---------------------------------------------------#

if [ -z `which julia` ]; then
  # Install Julia
  JULIA_VER=`cut -d '.' -f -2 <<< "$JULIA_VERSION"`
  echo "Installing Julia $JULIA_VERSION on the current Colab Runtime..."
  BASE_URL="https://julialang-s3.julialang.org/bin/linux/x64"
  URL="$BASE_URL/$JULIA_VER/julia-$JULIA_VERSION-linux-x86_64.tar.gz"
  wget -nv $URL -O /tmp/julia.tar.gz # -nv means "not verbose"
  tar -x -f /tmp/julia.tar.gz -C /usr/local --strip-components 1
  rm /tmp/julia.tar.gz

  # Install Packages
  nvidia-smi -L &> /dev/null && export GPU=1 || export GPU=0
  if [ $GPU -eq 1 ]; then
    JULIA_PACKAGES="$JULIA_PACKAGES $JULIA_PACKAGES_IF_GPU"
  fi
  for PKG in `echo $JULIA_PACKAGES`; do
    echo "Installing Julia package $PKG..."
    julia -e 'using Pkg; pkg"add '$PKG'; precompile;"' &> /dev/null
  done

  # Install kernel and rename it to "julia"
  echo "Installing IJulia kernel..."
  julia -e 'using IJulia; IJulia.installkernel("julia", env=Dict(
      "JULIA_NUM_THREADS"=>"'"$JULIA_NUM_THREADS"'"))'
  KERNEL_DIR=`julia -e "using IJulia; print(IJulia.kerneldir())"`
  KERNEL_NAME=`ls -d "$KERNEL_DIR"/julia*`
  mv -f $KERNEL_NAME "$KERNEL_DIR"/julia

  echo ''
  echo "Successfully installed `julia -v`!"
  echo "Please reload this page (press Ctrl+R, ⌘+R, or the F5 key) then"
  echo "jump to the 'Checking the Installation' section."
fi

Installing Julia 1.10.4 on the current Colab Runtime...
2024-10-23 05:56:43 URL:https://julialang-s3.julialang.org/bin/linux/x64/1.10/julia-1.10.4-linux-x86_64.tar.gz [173704015/173704015] -> "/tmp/julia.tar.gz" [1]
Installing Julia package IJulia...
Installing Julia package BenchmarkTools...


In [None]:
versioninfo()

Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 2 × Intel(R) Xeon(R) CPU @ 2.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, broadwell)
Threads: 2 default, 0 interactive, 1 GC (on 2 virtual cores)
Environment:
  LD_LIBRARY_PATH = /usr/local/nvidia/lib:/usr/local/nvidia/lib64
  JULIA_NUM_THREADS = 2


In [None]:
using BenchmarkTools

M = rand(2^11, 2^11)

@btime $M * $M;

  555.310 ms (2 allocations: 32.00 MiB)
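
In the benchmark above, the `$` in front of `M` interpolates the value of the global variable into the benchmarked expression, so the measurement is not distorted by the cost of accessing an untyped global. The following is a minimal sketch of the same pattern on a smaller matrix; the name `A` and its size are illustrative assumptions, and `@belapsed` is the BenchmarkTools macro that returns the minimum observed time in seconds instead of printing it.

```julia
using BenchmarkTools

# Small illustrative matrix (hypothetical size, chosen so the benchmark runs quickly)
A = rand(64, 64)

# `$A` splices the value of A into the benchmark, avoiding global-variable overhead
t = @belapsed $A * $A

println("minimum time: ", t, " seconds")
```

Timings depend on the machine and the BLAS configuration, so no particular value should be expected.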


In [None]:
try
    using CUDA
catch
    println("No GPU found.")
else
    run(`nvidia-smi`)
    # Create a new random matrix directly on the GPU:
    M_on_gpu = CUDA.CURAND.rand(2^11, 2^11)
    @btime $M_on_gpu * $M_on_gpu; nothing
end

# Data structures for speed

Julia performs well when it comes to speed of execution for
tabular data manipulation. In this session we will cover the
basics of manipulating tabular data structures with DataFrames.jl
and time-series data with TSFrames.jl.

In [7]:
using Pkg
Pkg.add("DataFrames")
Pkg.add("TSFrames")
Pkg.add("RDatasets")
Pkg.add("CSV")
Pkg.add("MarketData")
Pkg.add("Impute")

   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to

In [3]:
using DataFrames


In [None]:
df = DataFrame()                 # empty data frame
df = DataFrame(a=[1,2], b=[2,3]) # one keyword argument per column

In [None]:
Pkg.add("CSV")
using CSV
aapl_df = CSV.read("aapl.csv", DataFrame)

In [None]:
## Pkg.add("MySQL")
## Pkg.add("JSON")

In [None]:
using RDatasets
iris = dataset("datasets", "iris")

In [None]:
DataFrames.describe(iris)

In [None]:
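first(iris)                           # first row (a DataFrameRow)
first(iris, 10)                       # first 10 rows
last(iris, 10)                        # last 10 rows
iris[1, :]                            # row 1, all columns
iris[:, 1]                            # column 1 as a copy
iris[!, 1]                            # column 1 without copying
iris[!, [1, 2]]                       # columns 1 and 2 as a data frame
iris[!, :SepalLength]                 # column selected by name
iris[!, [:SepalLength, :SepalWidth]]
iris.SepalLength                      # same as iris[!, :SepalLength]
iris.SepalWidth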
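
The choice between `!` and `:` as the row selector in the cell above affects both speed and safety: `!` hands back the stored column without copying, while `:` allocates a copy. A minimal sketch on a small made-up data frame (the name `toy` and its values are illustrative and not part of the workshop data):

```julia
using DataFrames

# Hypothetical toy data, used only for illustration
toy = DataFrame(a = [1, 2, 3], b = [4.0, 5.0, 6.0])

nocopy = toy[!, :a]   # the stored column vector itself
copied = toy[:, :a]   # a newly allocated copy

same_as_property = nocopy === toy.a   # `toy.a` is the same object as `toy[!, :a]`
is_copy = copied !== toy.a            # the copy is a different object

nocopy[1] = 100       # writes through to the data frame
copied[2] = -1        # leaves the data frame untouched
toy
```

Avoiding the copy is cheaper for large columns, but any mutation of the returned vector then changes the data frame as well.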

In [None]:
iris[!, r"Sepal"]

iris[!, Not(r"Sepal")]
iris[!, Not(:SepalLength)]

iris[!, Between(:SepalWidth, :PetalWidth)]
iris[!, Between(2, 4)]

In [None]:
iris[!, Cols(r"Petal", :)]

In [None]:
iris[iris.SepalLength .> 4, :]
iris[iris.Species .== "setosa", :]
iris[(iris.SepalLength .> 4) .& (iris.PetalLength .> 3), :]

DataFrames.subset(iris,
                    :SepalLength => s -> s .> 4,
                    :PetalLength => p -> p .> 3)

DataFrames.subset(iris, :Species => s -> s .== "setosa")

iriscopy = copy(iris)
DataFrames.subset!(iriscopy, :Species => s -> s .== "setosa")
nrow(iris)
nrow(iriscopy)

In [None]:
select(iris, Not(:SepalLength))
select(iris, :SepalLength => s -> s * 2)
select(iris, :SepalLength => s -> s * 2, :SepalWidth)
select(iris, :SepalLength => s -> s * 2, [:SepalLength, :SepalWidth] => ((x, y) -> x .+ y) => :X) ## x and y are whole columns
select(iris, :SepalLength => :S1, :SepalWidth => :S2) ## Rename columns
iris_renamed = copy(iris) ## Work on a copy so that later cells can still use `iris`
select!(iris_renamed, :SepalLength => :S1, :SepalWidth => :S2) ## In place: mutates its argument

In [None]:
transform(iris, Not(:SepalLength)) # transform keeps every original column
transform(iris, Not(:SepalLength)) == select(iris, Not(:SepalLength)) # false: select drops :SepalLength, transform keeps it
transform(iris, :SepalLength => s -> s * 2) # appends a new column named :SepalLength_function
transform(iris, :SepalLength => (s -> s * 2) => :SepalLength2) # appends a new column named :SepalLength2
transform(iris, :SepalLength => s -> s * 2, [:SepalLength, :SepalWidth] => ((x, y) -> x .+ y) => :X)
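
The practical difference between `select` and `transform` is which columns survive: `select` returns only what was requested, while `transform` keeps the original columns and appends the new ones. A minimal sketch on a made-up data frame (the names `toy` and `total` are illustrative):

```julia
using DataFrames

# Hypothetical toy data, used only for illustration
toy = DataFrame(a = [1, 2, 3], b = [10, 20, 30])

# select: the result contains only the computed column
s = select(toy, [:a, :b] => ((x, y) -> x .+ y) => :total)

# transform: the result contains a, b and the computed column
t = transform(toy, [:a, :b] => ((x, y) -> x .+ y) => :total)

println(names(s))
println(names(t))
```

The function receives whole columns, which is why the element-wise sum is written with the broadcasting operator `.+`.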

In [None]:
combine(iris, :SepalLength .=> sum)
combine(iris, Not(:Species) .=> sum)
combine(iris, :SepalLength => x -> sum(x * 10))

In [None]:
df = DataFrame(x=[1, 2, missing], y=[1, missing, missing])
combine(df, All() .=> x -> x * 10)
combine(df, All() .=> x -> sum(x * 10))
combine(df, All() .=> x -> sum(skipmissing(x * 10)))
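
The three calls above differ only in how `missing` propagates. The same behaviour can be seen on a plain vector, without any data frame; the vector `x` below is an illustrative stand-in for one column:

```julia
# A column-like vector containing a missing value (illustrative data)
x = [1, 2, missing]

scaled = x .* 10                        # missing stays missing element-wise
plain_sum = sum(scaled)                 # any missing makes the whole sum missing
clean_sum = sum(skipmissing(scaled))    # skipmissing drops missing values first

println(plain_sum)
println(clean_sum)
```

`skipmissing` returns a lazy iterator, so no intermediate array is allocated for the filtered values.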

In [None]:
gd = groupby(iris, :Species)
combine(gd, :SepalLength => sum)
combine(gd, Not(:Species) .=> sum)
combine(gd, Not(:Species) .=> sum, DataFrames.nrow)
using Statistics
combine(gd, Not(:Species) .=> mean, DataFrames.nrow)

combine(gd, AsTable([:SepalLength, :PetalLength]) => ByRow((x) -> x[1] / x[2]) => :Ratio)
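
The split-apply-combine pattern used in the cell above can be checked by hand on a data frame small enough to verify mentally. This is a sketch on made-up data; the names `toy`, `stats`, `value_mean` and `count` are illustrative:

```julia
using DataFrames, Statistics

# Hypothetical toy data: two groups with two rows each
toy = DataFrame(group = ["a", "b", "a", "b"], value = [1.0, 2.0, 3.0, 4.0])

g = groupby(toy, :group)   # split the rows by group

# apply mean to :value within each group and count rows, then combine the results
stats = combine(g, :value => mean => :value_mean, nrow => :count)
```

Each group contributes one row to the result, holding the group key, the mean of `value` within that group, and the number of rows in it.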

In [None]:
using TSFrames
ts = TSFrame(1:10)
ts = TSFrame(1:10, 2301:2310)

In [None]:
using MarketData
aapl_df = DataFrame(MarketData.yahoo(:AAPL))
aapl_ts = TSFrame(MarketData.yahoo(:AAPL))
aapl_ts = CSV.read("aapl.csv", TSFrame)

In [None]:
nr(aapl_ts)
nc(aapl_ts)
size(aapl_ts)
length(aapl_ts)
names(aapl_ts)
index(aapl_ts)
TSFrames.describe(aapl_ts)

In [None]:
aapl_ts[1]
aapl_ts[2, 1]
aapl_ts[2, [1]]
aapl_ts[[2, 3], [1, 2, 3, 4]]
aapl_ts[[2, 3], [:Open, :High, :Low, :Close]]
aapl_ts.Open


In [None]:
using Dates # Date, Year, Month and Quarter come from the Dates standard library
aapl_ts[Date(2007, 1, 10)]
aapl_ts[Date(2007, 1, 10), [:Open, :High, :Low, :Close]]
aapl_ts[Year(2007), Month(1)]
aapl_ts[Year(2007), Month(1)][:, [:Open, :High, :Low, :Close]]
aapl_ts[Year(2007), Quarter(1)][:, [:Open, :High, :Low, :Close]]

In [None]:
# Pkg.add("Plots")
# using Plots
# plot(aapl_ts, [:AdjClose])

In [None]:
aapl_monthly = apply(aapl_ts, Month(1), last)
aapl_weekly = apply(aapl_ts, Week(1), Statistics.std)
aapl_weekly = apply(aapl_ts, Week(1), Statistics.std, last)
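
Conceptually, converting a series to a lower frequency means bucketing the observations by period and reducing each bucket with a function such as `last`. The following is a hand-written sketch of that idea using only the Dates standard library; it is not how TSFrames implements `apply`, and the dates, values and variable names are made up for illustration:

```julia
using Dates

# Illustrative daily observations spanning two months
dates = [Date(2021, 1, 4), Date(2021, 1, 29), Date(2021, 2, 1), Date(2021, 2, 26)]
vals = [1.0, 2.0, 3.0, 4.0]

# Bucket each date by the first day of its month
buckets = [floor(d, Month(1)) for d in dates]

# For every distinct bucket, keep the last observation that falls into it
monthly_last = [vals[findlast(==(b), buckets)] for b in unique(buckets)]

println(unique(buckets))
println(monthly_last)
```

Swapping `findlast` for a reduction such as a mean or a standard deviation over the bucket gives the other aggregations in the same way.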

In [None]:
ibm_ts = TSFrame(MarketData.yahoo(:IBM))

In [None]:
date_from = Date(2021, 06, 01);
date_to = Date(2021, 12, 31);
ibm = TSFrames.subset(ibm_ts, date_from, date_to)
aapl = TSFrames.subset(aapl_ts, date_from, date_to)

In [None]:
ibm_aapl = TSFrames.join(ibm[:, ["AdjClose"]], aapl[:, ["AdjClose"]]; jointype = :JoinBoth)

In [None]:
TSFrames.rename!(ibm_aapl, [:IBM, :AAPL])

In [None]:
using Impute
ibm_aapl = ibm_aapl |> Impute.locf()
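
LOCF stands for "last observation carried forward": each missing value is replaced by the most recent non-missing value before it. The function below is a hand-written illustration of that rule on a plain vector; it is not Impute.jl's implementation, and the name `locf_sketch` and the sample prices are hypothetical:

```julia
# Hand-written LOCF for illustration only; not Impute.jl's implementation
function locf_sketch(v::AbstractVector)
    out = copy(v)
    for i in 2:length(out)
        if ismissing(out[i])
            # carry the previous (possibly already filled) value forward
            out[i] = out[i - 1]
        end
    end
    return out
end

prices = [10.0, missing, missing, 12.5, missing]
filled = locf_sketch(prices)
println(filled)
```

A missing value at the very start of the series has nothing to carry forward, so this sketch leaves it as `missing`.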

# Need Help?

* Learning: https://julialang.org/learning/
* Documentation: https://docs.julialang.org/
* Questions & Discussions:
  * https://discourse.julialang.org/
  * http://julialang.slack.com/
  * https://stackoverflow.com/questions/tagged/julia

If you ever ask for help or file an issue about Julia, you should generally provide the output of `versioninfo()`.

Add new code cells by clicking the `+ Code` button (or _Insert_ > _Code cell_).

Have fun!

<img src="https://raw.githubusercontent.com/JuliaLang/julia-logo-graphics/master/images/julia-logo-mask.png" height="100" />