# Julia for Data Science
Based on work by [@nassarhuda](https://github.com/nassarhuda)!


In this tutorial, we will discuss why *Julia* is the tool you want to use for your data science applications.

We will cover the following:
* **Data**
* Data processing
* Visualization

### Data: Build a strong relationship with your data.
Every data science task has one main ingredient, the _data_! Most likely, you want to use your data to learn something new. But before the _new_ part, what about the data you already have? Let's make sure you can **read** it, **store** it, and **understand** it before you start using it.

Julia makes this step easy, with data structures and packages for processing data as well as built-in functions that work on your data out of the box.

The goal of this first part is to get you acquainted with some of Julia's tools for managing your data.

First, let's download a CSV file from GitHub that we can work with:
https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv

Note: `download` relies on an external tool such as curl, wget, or fetch, so one of these must be installed.

In [1]:
P = download("https://raw.githubusercontent.com/nassarhuda/easy_data/master/programming_languages.csv","programminglanguages.csv")

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   876  100   876    0     0   2646      0 --:--:-- --:--:-- --:--:--  2654


"programminglanguages.csv"

We can use shell commands like `ls` (on Linux/macOS) in Julia by preceding them with a semicolon.
We can also call the Julia function `readdir()`, which works on all platforms.

In [42]:
;ls

1. Julia for Data Science - Data.ipynb
2. Julia for Data Science - Algorithms.ipynb
3. Julia for Data Science - Plotting.ipynb
houses.csv
julialogo.png
LICENSE.md
programminglanguages.csv
programming_languages_data.txt
README.md
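
For a portable alternative to `;ls`, a minimal sketch using `readdir()` might look like this. The `.csv` filter is only an illustrative choice, not part of the original notebook:

```julia
# List the entries of the current working directory; works on every platform.
files = readdir()

# Keep only the CSV files (the ".csv" filter is just an illustration).
csv_files = filter(name -> endswith(name, ".csv"), files)
```

Because `readdir()` returns an ordinary vector of strings, the usual array functions such as `filter` apply to it directly.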


Add the CSV package to Julia using `Pkg.add()`. `CSV.read()` will automatically take the column headers from the .csv file if we set the `header` argument to `true`.
We could also use the `DelimitedFiles` stdlib and its `readdlm()` function, as shown below.

In [44]:
using Pkg
Pkg.add("CSV") # for CSV.read()

[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
[90m [no changes][39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
[90m [no changes][39m


Read the CSV file and store the dataset in `P` and the header in `H`.

In [48]:
# using CSV
# P = CSV.read("programminglanguages.csv",header=true)
# or
using DelimitedFiles
P,H= readdlm("programminglanguages.csv",',',header=true)

(Any[1951 "Regional Assembly Language"; 1952 "Autocode"; … ; 2012 "Julia"; 2014 "Swift"], AbstractString["year" "language"])

In [49]:
P # stores the dataset

73×2 Array{Any,2}:
 1951  "Regional Assembly Language"
 1952  "Autocode"                  
 1954  "IPL"                       
 1955  "FLOW-MATIC"                
 1957  "FORTRAN"                   
 1957  "COMTRAN"                   
 1958  "LISP"                      
 1958  "ALGOL 58"                  
 1959  "FACT"                      
 1959  "COBOL"                     
 1959  "RPG"                       
 1962  "APL"                       
 1962  "Simula"                    
    ⋮                              
 2003  "Scala"                     
 2005  "F#"                        
 2006  "PowerShell"                
 2007  "Clojure"                   
 2009  "Go"                        
 2010  "Rust"                      
 2011  "Dart"                      
 2011  "Kotlin"                    
 2011  "Red"                       
 2011  "Elixir"                    
 2012  "Julia"                     
 2014  "Swift"                     
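
Since `readdlm` returns an untyped `Array{Any,2}` for mixed data, it can help to convert each column to a concrete type before working with it. The sketch below assumes a tiny stand-in file with two made-up rows in the same layout as `programminglanguages.csv`; the names `path`, `data`, `header`, `years`, and `languages` are hypothetical:

```julia
using DelimitedFiles

# A tiny stand-in for programminglanguages.csv (sample rows, not the real file).
path = tempname()
write(path, "year,language\n2012,Julia\n2014,Swift\n")

# With header=true, readdlm returns the data matrix and the header row separately.
data, header = readdlm(path, ',', header=true)

years = Int.(data[:, 1])          # convert the untyped year column to integers
languages = String.(data[:, 2])   # convert the names to plain strings
```

Typed columns such as `years` make later numeric work (sorting, statistics) more predictable than columns of `Any`.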

In [50]:
H # stores the header names

1×2 Array{AbstractString,2}:
 "year"  "language"

Here we write our first small function. <br>
With it, you can answer questions such as "when was language X created?"
Create a function `language_created_year` that takes `P` and a language name and returns its year of creation.

In [51]:
function language_created_year(P,language::String)
    loc = findfirst(P[:,2].==language)
    return P[loc,1]
end

language_created_year (generic function with 1 method)

Try with "Julia"

In [52]:
language_created_year(P,"Julia")

2012

Now try with "julia"

In [53]:
language_created_year(P,"julia")

ArgumentError: ArgumentError: `nothing` should not be printed; use `show`, `repr`, or custom output instead.

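As expected, this fails: the comparison is case-sensitive, so `findfirst` finds no match and returns `nothing`. Thankfully, string manipulation is really easy in Julia!
You can use `lowercase`. Beware of what it takes as input: `lowercase` works on a single string or character, so use the dot syntax `lowercase.` to apply it to every element of an array.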

In [54]:
function language_created_year_v2(P,language::String)
    loc = findfirst(lowercase.(P[:,2]).==lowercase(language))
    return P[loc,1]
end
language_created_year_v2(P,"julia")

2012
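
A lookup can also be made robust to names that are not in the table at all. The following is a minimal sketch, assuming we prefer to return `nothing` rather than raise an error; `language_created_year_safe` and `P_demo` are hypothetical names introduced only for illustration:

```julia
# Hypothetical variant: return `nothing` instead of erroring when there is no match.
function language_created_year_safe(P, language::AbstractString)
    # Compare case-insensitively, one element at a time.
    loc = findfirst(name -> lowercase(name) == lowercase(language), P[:, 2])
    return loc === nothing ? nothing : P[loc, 1]
end

# Two sample rows in the same layout as `P`.
P_demo = Any[2012 "Julia"; 2014 "Swift"]

language_created_year_safe(P_demo, "JULIA")   # 2012
language_created_year_safe(P_demo, "cobol")   # nothing
```

The caller can then test the result with `=== nothing` before using it.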

**Reading and writing to files is really easy in Julia.** <br>

You can use different delimiters with the function `readdlm`, available in the `DelimitedFiles` standard library. <br>

To write to files, we can use `writedlm`. (The older `readcsv` and `writecsv` shortcuts were removed in Julia 1.0; pass `','` as the delimiter to `readdlm` and `writedlm` instead.) <br>

Let's write this same data to a file with a different delimiter.

In [55]:
writedlm("programming_languages_data.txt", P, '-')

We can now check that this worked using a shell command to glance at the file,

In [56]:
;head -10 programming_languages_data.txt

1951-Regional Assembly Language
1952-Autocode
1954-IPL
1955-"FLOW-MATIC"
1957-FORTRAN
1957-COMTRAN
1958-LISP
1958-ALGOL 58
1959-FACT
1959-COBOL


and also check that we can use `readdlm` to read our new text file correctly.

In [14]:
P_new_delim = readdlm("programming_languages_data.txt", '-');
P == P_new_delim

true
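
The same round trip works with any single-character delimiter. Here is a small self-contained sketch using a tab and a temporary file; the two-row `sample` matrix is made up and merely stands in for `P`:

```julia
using DelimitedFiles

# Hypothetical two-row sample standing in for `P`.
sample = Any[1957 "FORTRAN"; 1958 "LISP"]

path = tempname()                 # scratch file, so the working directory stays untouched
writedlm(path, sample, '\t')      # write with a tab delimiter
restored = readdlm(path, '\t')    # read it back with the same delimiter

roundtrip_ok = (sample == restored)   # compares the two matrices cell by cell
```

Choosing a delimiter that never occurs inside the data (unlike `-` in "FLOW-MATIC" above) avoids the need for quoted fields in the output file.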

### Dictionaries
Let's try to store the above data in a dictionary format!

First, let's initialize an empty dictionary

In [15]:
dict = Dict{Integer,Vector{String}}()

Dict{Integer,Array{String,1}} with 0 entries

Here we told Julia that we want `dict` to only accept integers as keys and vectors of strings as values.

However, we could have initialized an empty dictionary without providing this information (depending on our application).

In [16]:
dict2 = Dict()

Dict{Any,Any} with 0 entries

This dictionary takes keys and values of any type!

Now, let's populate the dictionary with years as keys and vectors that hold all the programming languages created in each year as their values.

In [17]:
for i = 1:size(P,1)
    year,lang = P[i,:]

    if haskey(dict, year)
        push!(dict[year], lang)   # push! mutates the stored vector in place
    else
        dict[year] = [lang]       # first language seen for this year
    end
end
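
The "look up the key, or create it first" pattern in this loop can be written more compactly with `get!`, which returns the stored value and inserts a default when the key is absent. A minimal sketch, using made-up sample rows and the hypothetical name `by_year`:

```julia
# Sample (year, language) pairs standing in for the rows of `P`.
rows = [(2003, "Groovy"), (2003, "Scala"), (2012, "Julia")]

by_year = Dict{Int,Vector{String}}()
for (year, lang) in rows
    # get! returns the vector stored under `year`, inserting an empty one if needed.
    push!(get!(by_year, year, String[]), lang)
end

by_year[2003]   # ["Groovy", "Scala"]
```

This sketch uses the concrete key type `Int` rather than the abstract `Integer`; both work, and the concrete type is the more common choice when every key is an ordinary integer.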

Now you can pick whichever year you want and find which programming languages were invented in that year.

In [18]:
dict[2003]

2-element Array{String,1}:
 "Groovy"
 "Scala" 

### DataFrames! 
*Shout out to R fans!*
One other way to play around with data in Julia is to use a DataFrame.

This requires loading the `DataFrames` package

In [19]:
# Pkg.add("DataFrames")
using DataFrames
df = DataFrame(year = P[:,1], language = P[:,2])

| | year | language |
|---|---|---|
| | Any | Any |
| 1 | 1951 | Regional Assembly Language |
| 2 | 1952 | Autocode |
| 3 | 1954 | IPL |
| 4 | 1955 | FLOW-MATIC |
| 5 | 1957 | FORTRAN |
| 6 | 1957 | COMTRAN |
| 7 | 1958 | LISP |
| 8 | 1958 | ALGOL 58 |
| 9 | 1959 | FACT |
| 10 | 1959 | COBOL |


You can access columns by header name or by column index.

In this case, `df[1]` is equivalent to `df[:year]`.

Note that to access a column by header name, we precede the name with a colon. In Julia, this makes the name a *symbol*, which is how `DataFrames` identifies columns.

In [20]:
df[:year]

73-element Array{Any,1}:
 1951
 1952
 1954
 1955
 1957
 1957
 1958
 1958
 1959
 1959
 1959
 1962
 1962
    ⋮
 2003
 2005
 2006
 2007
 2009
 2010
 2011
 2011
 2011
 2011
 2012
 2014
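
The column above has element type `Any` because it was copied straight from the untyped matrix `P`. A hedged sketch of building the DataFrame with concrete column types follows; `P_demo` and `df_typed` are hypothetical names with made-up sample rows. It uses the property syntax `df.column`, which, along with `df[!, :column]`, is the way to select a column in newer DataFrames releases that no longer accept `df[:year]`:

```julia
using DataFrames

# Hypothetical sample rows standing in for `P` (an untyped `Any` matrix).
P_demo = Any[2012 "Julia"; 2014 "Swift"]

# Converting each column first gives the DataFrame concrete element types.
df_typed = DataFrame(year = Int.(P_demo[:, 1]), language = String.(P_demo[:, 2]))

year_type = eltype(df_typed.year)   # Int rather than Any
names_col = df_typed.language       # property syntax for selecting a column
```

With concrete types, summary functions have more to work with than they do on columns of `Any`.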

**`DataFrames` provides some handy features when dealing with data**

First, it works with Julia's `missing` value, whose type is `Missing`.

In [21]:
a = missing
typeof(a)

Missing

Let's see what happens when we try to add `missing` to a number.

In [22]:
a + 1

missing
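
This propagation applies to comparisons as well as arithmetic, which is easy to trip over. A short sketch of the behavior, using only functions from Base (the variable names are illustrative):

```julia
x = missing

sum_result = x + 1               # missing: arithmetic propagates missing
cmp_result = (x == 1)            # missing, not false: comparisons propagate too
is_missing = ismissing(x)        # true: the reliable way to test for missing
same_value = isequal(x, missing) # true: isequal always returns a Bool
```

Because `x == missing` itself evaluates to `missing`, use `ismissing` or `isequal` in conditions.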

`DataFrames` provides the `describe` function, which gives quick summary statistics for each column of a DataFrame.

In [23]:
describe(df)

| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Nothing | Any | Int64 | Int64 | DataType |
| 1 | year | 1982.99 | 1951 | | 2014 | 45 | 0 | Any |
| 2 | language | | ALGOL 58 | | dBase III | 73 | 0 | Any |


### RDatasets

We can use RDatasets to play around with pre-existing datasets

In [24]:
using Pkg
Pkg.add("RData")
Pkg.add("RDatasets")
Pkg.add("RCall") # should have R installed to build RCall

[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
 [90m [df47a6cb][39m[92m + RData v0.6.0[39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
[90m [no changes][39m
[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
 [90m [ce6b1742][39m[92m + RDatasets v0.6.1[39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
[90m [no changes][39m
[32m[1m Resolving[22m[39m package versions...
[32m[1m Installed[22m[39m RCall ─ v0.13.1
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
 [90m [6f49c342][39m[92m + RCall v0.13.1[39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
 [90m [6f49c342][39m[92m + RCall v0.13.1[39m
 [90m [1b915085][39m[92m + WinReg v0.3.1[39m
[32m[1m  Building[22m[39m RCall → `~/.julia/packages/RCall/29zDq/deps/

In [25]:
using RDatasets
iris = dataset("datasets", "iris")

┌ Info: Recompiling stale cache file /home/raphaelb/.julia/compiled/v1.1/RDatasets/JyIbx.ji for RDatasets [ce6b1742-4840-55fa-b093-852dadbb1d8b]
└ @ Base loading.jl:1184


| | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Categorical… |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 7 | 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 8 | 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 9 | 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 10 | 4.9 | 3.1 | 1.5 | 0.1 | setosa |


Note that data loaded with `dataset` is stored as a DataFrame. 😃

In [26]:
typeof(iris) 

DataFrame

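The summary we get from `describe` on `iris` gives us a lot more information than the summary on `df`! This is likely because the numeric columns of `iris` have the concrete element type `Float64`, whereas the columns of `df` are typed `Any`.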

In [27]:
describe(iris)

| | variable | mean | min | median | max | nunique | nmissing | eltype |
|---|---|---|---|---|---|---|---|---|
| | Symbol | Union… | Any | Union… | Any | Union… | Nothing | DataType |
| 1 | SepalLength | 5.84333 | 4.3 | 5.8 | 7.9 | | | Float64 |
| 2 | SepalWidth | 3.05733 | 2.0 | 3.0 | 4.4 | | | Float64 |
| 3 | PetalLength | 3.758 | 1.0 | 4.35 | 6.9 | | | Float64 |
| 4 | PetalWidth | 1.19933 | 0.1 | 1.3 | 2.5 | | | Float64 |
| 5 | Species | | setosa | | virginica | 3.0 | | CategoricalString{UInt8} |


### Manage missing values

The handling of `missing` values was completely reworked in Julia 1.0; [see the manual for more details](https://docs.julialang.org/en/v1/manual/missing/#Arrays-With-Missing-Values-1).


In [28]:
using Statistics

In [29]:
foods = ["apple", "cucumber", "tomato", "banana"]
calories = [missing,47,22,105]
typeof(calories)

Array{Union{Missing, Int64},1}

In [30]:
mean(calories)

missing

Missing values ruin everything! 😑

Luckily we can ignore them with `skipmissing`!

In [31]:
mean(skipmissing(calories))

58.0
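
`skipmissing` returns a lazy iterator rather than a new array, so it combines with any reduction. A small sketch on the same `calories` vector (the result names are illustrative):

```julia
using Statistics

calories = [missing, 47, 22, 105]

n_missing = count(ismissing, calories)       # how many entries are missing
present = collect(skipmissing(calories))     # materialize the non-missing values
avg = mean(skipmissing(calories))            # (47 + 22 + 105) / 3 = 58.0
```

Counting the missing entries first is a cheap way to see how much data a skipped statistic actually rests on.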

In fact, `describe` will drop these values too:

In [32]:
describe(calories)

Summary Stats:
Length:         4
Type:           Union{Missing, Int64}
Number Unique:  4


Note that `typeof(calories)` is `Array{Union{Missing, Int64},1}`


We can replace all missing values with a value of our choice, e.g. 0.

In [33]:
newcalories = coalesce.(calories,0) # i.e. replace every missing with the value 0

4-element Array{Int64,1}:
   0
  47
  22
 105

In [34]:
prices = [0.85,1.6,0.8,0.6,]

4-element Array{Float64,1}:
 0.85
 1.6 
 0.8 
 0.6 

In [35]:
dataframe_calories = DataFrame(item=foods,calories=calories)

| | item | calories |
|---|---|---|
| | String | Int64⍰ |
| 1 | apple | missing |
| 2 | cucumber | 47 |
| 3 | tomato | 22 |
| 4 | banana | 105 |


In [36]:
dataframe_prices = DataFrame(item=foods,price=prices)

| | item | price |
|---|---|---|
| | String | Float64 |
| 1 | apple | 0.85 |
| 2 | cucumber | 1.6 |
| 3 | tomato | 0.8 |
| 4 | banana | 0.6 |


We can also `join` two DataFrames on a shared column, here `item`.

In [37]:
DF = join(dataframe_calories,dataframe_prices,on=:item)

| | item | calories | price |
|---|---|---|---|
| | String | Int64⍰ | Float64 |
| 1 | apple | missing | 0.85 |
| 2 | cucumber | 47 | 1.6 |
| 3 | tomato | 22 | 0.8 |
| 4 | banana | 105 | 0.6 |
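
Newer DataFrames releases replaced the generic `join` with explicit functions such as `innerjoin` and `leftjoin`. The sketch below assumes such a release is installed; on the version used in this notebook, the `join` call above plays the same role. `complete_rows` is a hypothetical name:

```julia
using DataFrames

foods = ["apple", "cucumber", "tomato", "banana"]
dataframe_calories = DataFrame(item = foods, calories = [missing, 47, 22, 105])
dataframe_prices = DataFrame(item = foods, price = [0.85, 1.6, 0.8, 0.6])

# `innerjoin` keeps the rows whose `item` appears in both tables.
DF = innerjoin(dataframe_calories, dataframe_prices, on = :item)

# `dropmissing` returns a copy without the rows that contain `missing`.
complete_rows = dropmissing(DF)
```

Here every item appears in both tables, so the join keeps all four rows, and `dropmissing` then removes only the `apple` row.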


### FileIO

In [38]:
Pkg.add("ImageMagick")
Pkg.add("FileIO")
using FileIO
julialogo = download("https://avatars0.githubusercontent.com/u/743164?s=200&v=4","julialogo.png")

[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
[90m [no changes][39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
[90m [no changes][39m
[32m[1m Resolving[22m[39m package versions...
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Project.toml`
[90m [no changes][39m
[32m[1m  Updating[22m[39m `~/.julia/environments/v1.1/Manifest.toml`
[90m [no changes][39m


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12791  100 12791    0     0   114k      0 --:--:-- --:--:-- --:--:--  114k


"julialogo.png"

Again, let's check that this download worked!

In [39]:
;ls

1. Julia for Data Science - Data.ipynb
2. Julia for Data Science - Algorithms.ipynb
3. Julia for Data Science - Plotting.ipynb
houses.csv
julialogo.png
LICENSE.md
programminglanguages.csv
programming_languages_data.txt
README.md


Next, let's load the Julia logo, stored as a .png file

In [40]:
X1 = load("julialogo.png")

200×200 Array{RGBA{N0f8},2} with eltype ColorTypes.RGBA{FixedPointNumbers.Normed{UInt8,8}}:
 RGBA{N0f8}(0.0,0.0,0.0,0.0)  …  RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)  …  RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)  …  RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.0,0.0)
 ⋮                            ⋱                             
 RGBA{N0f8}(0.0,0.0,0.0,0.0)     RGBA{N0f8}(0.0,0.0,0.

We see below that Julia stores this logo as an array of colors.

In [41]:
@show typeof(X1);
@show size(X1);

typeof(X1) = Array{ColorTypes.RGBA{FixedPointNumbers.Normed{UInt8,8}},2}
size(X1) = (200, 200)
