# README - RedAmber

This notebook walks through the [README of RedAmber](https://github.com/heronshoes/red_amber/blob/master/README.md).

In [1]:
require 'red_amber' # require 'red-amber' is also OK.
{RedAmber: RedAmber::VERSION, Arrow: Arrow::VERSION}

{:RedAmber=>"0.2.1", :Arrow=>"9.0.0"}

## `RedAmber::DataFrame`

A DataFrame represents a set of data in a 2D shape. The underlying entity is a Red Arrow `Table` object.

![dataframe model of RedAmber](https://github.com/heronshoes/red_amber/raw/master/doc/image/dataframe_model.png)

Download the Penguins dataset from Red Data Tools and create a DataFrame.

In [2]:
require 'datasets-arrow'

arrow = Datasets::Penguins.new.to_arrow
penguins = RedAmber::DataFrame.new(arrow)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,(nil),(nil),(nil),(nil),(nil),2007
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Gentoo,Biscoe,50.4,15.7,222,5750,male,2009
Gentoo,Biscoe,45.2,14.8,212,5200,female,2009
Gentoo,Biscoe,49.9,16.1,213,5400,male,2009


For example, `DataFrame#pick` accepts keys as arguments and returns a sub-DataFrame containing only those columns.

![pick method image](https://github.com/heronshoes/red_amber/raw/master/doc/image/dataframe/pick.png)

In [3]:
penguins.keys

[:species, :island, :bill_length_mm, :bill_depth_mm, :flipper_length_mm, :body_mass_g, :sex, :year]

In [4]:
df1 = penguins.pick(:species, :island, :body_mass_g)

species,island,body_mass_g
Adelie,Torgersen,3750
Adelie,Torgersen,3800
Adelie,Torgersen,3250
Adelie,Torgersen,(nil)
⋮,⋮,⋮
Gentoo,Biscoe,5750
Gentoo,Biscoe,5200
Gentoo,Biscoe,5400


`DataFrame#drop` drops some columns and returns the remainder as a DataFrame.

![drop method image](https://github.com/heronshoes/red_amber/raw/master/doc/image/dataframe/drop.png)

You can specify the columns by keys or by a boolean array of the same size as `n_keys`.

In [5]:
# Same as df1.drop(:species, :island)
df2 = df1.drop(true, true, false)

body_mass_g
3750
3800
3250
(nil)
⋮
5750
5200
5400
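
The relationship between the boolean form and the key form can be modeled in plain Ruby. This is only a sketch on a Hash of columns: `drop_columns` is a hypothetical helper, not a RedAmber method, and the two-row data is made up for illustration.

```ruby
# Plain-Ruby model of dropping columns by key or by boolean mask.
# `drop_columns` is a hypothetical helper, not part of RedAmber.
def drop_columns(columns, *selectors)
  if selectors.all? { |s| s == true || s == false }
    unless selectors.size == columns.size
      raise ArgumentError, 'mask size must equal the number of keys'
    end

    # A true entry marks the key at the same position for removal.
    doomed = columns.keys.zip(selectors).select { |_, flag| flag }.map(&:first)
  else
    doomed = selectors
  end
  # Hash#reject returns a new Hash; the input is left untouched.
  columns.reject { |key, _| doomed.include?(key) }
end

# Illustrative data only, not the full Penguins dataset.
df1 = {
  species: %w[Adelie Gentoo],
  island: %w[Torgersen Biscoe],
  body_mass_g: [3750, 5400]
}

by_mask = drop_columns(df1, true, true, false)
by_keys = drop_columns(df1, :species, :island)
```

In this model both calls leave only the `:body_mass_g` column, and `df1` keeps all three keys.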


Arrow data is immutable, so these methods always return a new object.

`DataFrame#assign` creates new columns or updates existing columns.

![assign method image](https://github.com/heronshoes/red_amber/raw/master/doc/image/dataframe/assign.png)

A new column is created because `:body_mass_kg` is a new key.

In [7]:
df2.assign(:body_mass_kg => df2[:body_mass_g] / 1000.0)

body_mass_g,body_mass_kg
3750,3.75
3800,3.8
3250,3.25
(nil),(nil)
⋮,⋮
5750,5.75
5200,5.2
5400,5.4
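
The "new object, new key" behavior can be pictured with a frozen Hash of columns. This is a sketch under assumptions, not RedAmber's implementation: `assign_columns` is a hypothetical helper and the four values are illustrative.

```ruby
# Plain-Ruby model of assign: merge into a copy, never mutate the receiver.
# `assign_columns` is a hypothetical helper, not part of RedAmber.
def assign_columns(columns, updates)
  # Hash#merge appends unknown keys and replaces the values of existing ones.
  columns.merge(updates).freeze
end

df2 = { body_mass_g: [3750, 3800, nil, 5400] }.freeze

# Missing values stay nil, matching the (nil) row in the output above.
body_mass_kg = df2[:body_mass_g].map { |g| g && g / 1000.0 }

df_kg = assign_columns(df2, body_mass_kg: body_mass_kg)
```

Because `df2` is frozen, any attempt to modify it in place would raise, which mirrors why the method has to hand back a new object.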


`DataFrame#slice` selects rows (observations) to create a sub DataFrame.

![slice method image](https://raw.githubusercontent.com/heronshoes/red_amber/master/doc/image/dataframe/slice.png)

The following returns the first 5 rows and the last 5 rows.

In [8]:
penguins.slice(0...5, -5..-1)

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,39.1,18.7,181,3750,male,2007
Adelie,Torgersen,39.5,17.4,186,3800,female,2007
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,(nil),(nil),(nil),(nil),(nil),2007
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Gentoo,Biscoe,50.4,15.7,222,5750,male,2009
Gentoo,Biscoe,45.2,14.8,212,5200,female,2009
Gentoo,Biscoe,49.9,16.1,213,5400,male,2009
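
The two ranges read like ordinary Ruby indices: `0...5` excludes its end, and negative indices count from the end. As a plain-Ruby analogy (an Array stands in for the rows; this is not how RedAmber is implemented), `Array#values_at` accepts the same arguments:

```ruby
# Twelve stand-in rows, numbered 1 to 12.
rows = (1..12).to_a

# 0...5 covers indices 0 through 4; -5..-1 covers the last five elements.
picked = rows.values_at(0...5, -5..-1)
```

Here `picked` holds ten elements: the first five followed by the last five, with rows 6 and 7 left out.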


`DataFrame#remove` rejects rows (observations) and returns the remainder as a DataFrame.

![remove method image](https://github.com/heronshoes/red_amber/raw/master/doc/image/dataframe/remove.png)

DataFrame manipulating methods like `pick`, `drop`, `slice`, `remove`, `rename` and `assign` accept a block.

The condition for `remove` can be given as a block, in which column keys such as `bill_length_mm` can be referenced directly.

In [9]:
penguins.remove { bill_length_mm < 40 }

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,40.3,18.0,195,3250,female,2007
Adelie,Torgersen,(nil),(nil),(nil),(nil),(nil),2007
Adelie,Torgersen,42.0,20.2,190,4250,(nil),2007
Adelie,Torgersen,41.1,17.6,182,3200,female,2007
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
Gentoo,Biscoe,50.4,15.7,222,5750,male,2009
Gentoo,Biscoe,45.2,14.8,212,5200,female,2009
Gentoo,Biscoe,49.9,16.1,213,5400,male,2009
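
One way such a block could work is to evaluate it with the frame as the receiver, so that a bare key resolves to a column. The sketch below is a toy model under that assumption; `TinyColumn` and `TinyFrame` are hypothetical classes, not RedAmber's. It also reproduces what the output above shows: a row whose comparison result is nil is kept, because only rows marked true are removed.

```ruby
# A column wrapper whose comparison is element-wise and propagates nil.
class TinyColumn
  attr_reader :items

  def initialize(items)
    @items = items
  end

  def <(other)
    TinyColumn.new(items.map { |v| v.nil? ? nil : v < other })
  end
end

# A hypothetical stand-in for a DataFrame; not RedAmber's implementation.
class TinyFrame
  attr_reader :columns

  def initialize(columns)
    @columns = columns
  end

  # Evaluate the block with self as the receiver, then drop rows whose mask is true.
  def remove(&block)
    mask = instance_eval(&block).items
    kept = columns.transform_values do |items|
      items.reject.with_index { |_, i| mask[i] == true }
    end
    TinyFrame.new(kept)
  end

  private

  # Resolve a bare column key inside the block to a TinyColumn.
  def method_missing(name, *args)
    return super unless args.empty? && columns.key?(name)

    TinyColumn.new(columns[name])
  end

  def respond_to_missing?(name, include_private = false)
    columns.key?(name) || super
  end
end

# Illustrative rows only.
frame = TinyFrame.new(
  species: %w[Adelie Adelie Adelie Adelie],
  bill_length_mm: [39.1, 40.3, nil, 36.7]
)
remaining = frame.remove { bill_length_mm < 40 }
```

In this model the mask is `[true, false, nil, true]`, so the second and third rows survive.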


The next example uses a block to update a column.

In [10]:
df3 = RedAmber::DataFrame.new(
  integer: [0, 1, 2, 3, nil],
  float:   [0.0, 1.1,  2.2, Float::NAN, nil],
  string:  ['A', 'B', 'C', 'D', nil],
  boolean: [true, false, true, false, nil])

integer,float,string,boolean
0,0.0,A,true
1,1.1,B,false
2,2.2,C,true
3,,D,false
(nil),(nil),(nil),(nil)


In [11]:
df3.assign do
  vectors.select(&:float?).map { |v| [v.key, -v] }
  # => an array of [key, Vector] pairs; here the only key is :float,
  #    paired with a Vector holding [-0.0, -1.1, -2.2, NaN, nil]
end

integer,float,string,boolean
0,-0.0,A,true
1,-1.1,B,false
2,-2.2,C,true
3,,D,false
(nil),(nil),(nil),(nil)


The next example removes all rows (observations) containing nil.

In [12]:
df3.remove { vectors.map(&:is_nil).reduce(&:|) }

integer,float,string,boolean
0,0.0,A,True
1,1.1,B,False
2,2.2,C,True
3,,D,False


Because this task is needed so often, `remove_nil` provides a much simpler way to do it.

In [13]:
df3.remove_nil # => same result as above

integer,float,string,boolean
0,0.0,A,True
1,1.1,B,False
2,2.2,C,True
3,,D,False
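
The mask built by `vectors.map(&:is_nil).reduce(&:|)` can be traced step by step in plain Ruby. This is a sketch on Arrays rather than Vectors, using the same values as `df3`: each column yields a nil mask, and the masks are OR-ed position by position.

```ruby
columns = {
  integer: [0, 1, 2, 3, nil],
  float:   [0.0, 1.1, 2.2, Float::NAN, nil],
  string:  ['A', 'B', 'C', 'D', nil],
  boolean: [true, false, true, false, nil]
}

# One boolean mask per column: true where the value is nil.
nil_masks = columns.values.map { |items| items.map(&:nil?) }

# Element-wise OR across the masks: true where any column has nil in that row.
any_nil = nil_masks.reduce { |acc, mask| acc.zip(mask).map { |a, b| a | b } }

# Keep only the rows whose combined mask is false.
kept = columns.transform_values do |items|
  items.reject.with_index { |_, i| any_nil[i] }
end
```

Only the last row is flagged. NaN is a float value, not nil, so the fourth row stays, as in the output above.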


`DataFrame#summary` shows summary statistics in a DataFrame.

In [14]:
penguins.summary

variables,count,mean,std,min,25%,median,75%,max
bill_length_mm,342,43.92192982456141,5.459583713926532,32.1,39.225,44.382000000000005,48.5,59.6
bill_depth_mm,342,17.151169590643274,1.9747931568167811,13.1,15.6,17.32,18.7,21.5
flipper_length_mm,342,200.91520467836256,14.061713679356888,172.0,190.0,197.0,213.0,231.0
body_mass_g,342,4201.754385964912,801.9545356980955,2700.0,3550.0,4031.5,4750.0,6300.0
year,344,2008.0290697674416,0.8183559254837041,2007.0,2007.0,2008.0,2009.0,2009.0


In [15]:
puts penguins.summary.to_s(width: 82)

  variables            count     mean      std      min      25%   median      75%      max
  <dictionary>      <uint16> <double> <double> <double> <double> <double> <double> <double>
1 bill_length_mm         342    43.92     5.46     32.1    39.23    44.38     48.5     59.6
2 bill_depth_mm          342    17.15     1.97     13.1     15.6    17.32     18.7     21.5
3 flipper_length_mm      342   200.92    14.06    172.0    190.0    197.0    213.0    231.0
4 body_mass_g            342  4201.75   801.95   2700.0   3550.0   4031.5   4750.0   6300.0
5 year                   344  2008.03     0.82   2007.0   2007.0   2008.0   2009.0   2009.0
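
The `count` column above (342 for the measurements, 344 for `year`) suggests that missing values are excluded before the statistics are computed. A minimal sketch of that idea in plain Ruby follows; `describe` is a hypothetical helper, the sample values are illustrative, and the quantile and standard deviation columns are left out because their exact definitions are not stated here.

```ruby
# Hypothetical helper: basic statistics over the non-nil values of a column.
def describe(values)
  present = values.compact
  {
    count: present.size,
    mean: present.sum / present.size.to_f,
    min: present.min,
    max: present.max
  }
end

# Illustrative column with one missing value.
stats = describe([181, 186, 195, nil, 193])
```

With one nil among five entries, `count` is 4 and the mean is taken over those four values.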


`DataFrame#group` groups rows by key and applies aggregation functions to each group.

In [16]:
starwars = RedAmber::DataFrame.load(URI("https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/starwars.csv"))

unnamed1,name,height,mass,hair_color,skin_color,eye_color,birth_year,sex,gender,homeworld,species
1,Luke Skywalker,172,77.0,blond,fair,blue,19.0,male,masculine,Tatooine,Human
2,C-3PO,167,75.0,,gold,yellow,112.0,none,masculine,Tatooine,Droid
3,R2-D2,96,32.0,,"white, blue",red,33.0,none,masculine,Naboo,Droid
4,Darth Vader,202,136.0,none,white,yellow,41.9,male,masculine,Tatooine,Human
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
85,BB8,(nil),(nil),none,none,black,(nil),none,masculine,,Droid
86,Captain Phasma,(nil),(nil),unknown,unknown,unknown,(nil),,,,
87,Padmé Amidala,165,45.0,brown,light,brown,46.0,female,feminine,Naboo,Human


In [17]:
starwars.group(:species) { [count(:species), mean(:height, :mass)] }
        .slice { count > 1 }

species,count,mean(height),mean(mass)
Human,35,176.6451612903226,82.78181818181818
Droid,6,131.2,69.75
Wookiee,2,231.0,124.0
Gungan,3,208.66666666666666,74.0
⋮,⋮,⋮,⋮
Twi'lek,2,179.0,55.0
Mirialan,2,168.0,53.1
Kaminoan,2,221.0,88.0
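
The same group, aggregate, then filter pipeline can be written with `Enumerable#group_by` as a rough mental model. The rows below are invented for illustration and are not taken from the dataset; the sketch assumes, as the Human row above suggests (35 members, but a mean that is not a multiple of 1/35), that the mean skips missing heights.

```ruby
# Invented rows for illustration only.
rows = [
  { species: 'Human',   height: 170 },
  { species: 'Droid',   height: 160 },
  { species: 'Droid',   height: 100 },
  { species: 'Human',   height: nil },
  { species: 'Wookiee', height: 230 }
]

# Group by species, then compute a count and a nil-skipping mean per group.
summary = rows.group_by { |r| r[:species] }.map do |species, members|
  heights = members.map { |r| r[:height] }.compact
  {
    species: species,
    count: members.size,
    mean_height: heights.sum / heights.size.to_f
  }
end

# Equivalent of `.slice { count > 1 }`: keep groups with more than one member.
frequent = summary.select { |g| g[:count] > 1 }
```

Here `frequent` keeps the Human and Droid groups and drops Wookiee, which has a single member.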


## `RedAmber::Vector`

The class `RedAmber::Vector` represents a series of data in the DataFrame.
The method `RedAmber::DataFrame#[key]` returns the Vector stored under `key`.

In [51]:
penguins[:bill_length_mm]

#<RedAmber::Vector(:double, size=344):0x000000000000f320>
[39.1, 39.5, 40.3, nil, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, 37.8, 37.8, 41.1, ... ]


Vectors accept some [functional methods from Arrow](https://arrow.apache.org/docs/cpp/compute.html).

The following is an element-wise comparison; it returns a boolean Vector of the same size.

![unary element-wise](https://github.com/heronshoes/red_amber/raw/master/doc/image/vector/unary_element_wise.png)

In [52]:
penguins[:bill_length_mm] < 40

#<RedAmber::Vector(:boolean, size=344):0x000000000000f334>
[true, true, false, nil, true, true, true, true, true, false, true, true, false, ... ]


The next example returns an aggregated result.

![unary aggregation](https://github.com/heronshoes/red_amber/raw/master/doc/image/vector/unary_aggregation.png)

In [53]:
penguins[:bill_length_mm].mean

43.92192982456141

## Another Jupyter notebook

[Examples of Red Amber](examples_of_red_amber.ipynb) shows more examples in a Jupyter notebook.