# Manipulating rows of a DataFrame
## Selecting rows

In [None]:
using DataFrames
using Statistics
using Random
Random.seed!(1);

In [None]:
df = DataFrame(rand(4, 5), :auto)

  using `:` as row selector will copy columns

In [None]:
df[:, :]

  this is the same as

In [None]:
copy(df)

  you can get a subset of rows of a data frame without copying using `view` to get a `SubDataFrame`

In [None]:
sdf = view(df, 1:3, 1:3)

  you still have a detailed reference to the parent

In [None]:
parent(sdf), parentindices(sdf)
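
  To make the difference between copying and viewing concrete, here is a small sketch. It uses a hypothetical data frame `demo` (not part of the notebook data) so that `df` above stays untouched: writing through the copy leaves the original alone, while writing through the view changes the parent.

```julia
using DataFrames

# `demo` is an illustrative data frame, separate from `df` above
demo = DataFrame(a=[1, 2, 3], b=[4, 5, 6])

demo_copy = demo[:, :]          # `:` as row selector copies the columns
demo_copy[1, :a] = 100          # changes only the copy

demo_view = view(demo, 1:2, :)  # a SubDataFrame, no copy is made
demo_view[2, :b] = 500          # writes through to the parent

(demo.a[1], demo.b[2])
```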

  selecting a single row returns a `DataFrameRow` object which is also a view

In [None]:
dfr = df[3, :]

In [None]:
parent(dfr), parentindices(dfr), rownumber(dfr)
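
  Because a `DataFrameRow` is a view, assigning to one of its fields should modify the parent data frame. A minimal sketch, using a hypothetical `demo` data frame so that `df` and `dfr` above are not changed:

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(a=[1, 2, 3])

demo_row = demo[2, :]  # a DataFrameRow pointing at row 2 of `demo`
demo_row.a = 20        # assignment goes to the parent

demo.a
```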

  let us add a column to a data frame by broadcasting the assignment of a scalar

In [None]:
df[!, :Z] .= 1

In [None]:
df

  Earlier we used `:` for column selection in a view (`SubDataFrame` and `DataFrameRow`). In this case the view will have all columns of the parent, even after the parent is mutated.

In [None]:
dfr

In [None]:
parent(dfr), parentindices(dfr), rownumber(dfr)
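
  For contrast, a view created with an explicit column selector is expected to keep its original set of columns. The sketch below compares both kinds of views on a hypothetical `demo` data frame (the names `view_all` and `view_fixed` are illustrative only):

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(a=1:3, b=4:6)

view_all = view(demo, 1:2, :)      # `:` column selector
view_fixed = view(demo, 1:2, 1:2)  # explicit column selector

demo[!, :c] .= 0                   # add a column to the parent

(names(view_all), names(view_fixed))
```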

  Note that `parent` and `parentindices` refer to the true source of data for a `DataFrameRow`, while `rownumber` refers to the row number in the direct object that was used to create the `DataFrameRow`.

In [None]:
df = DataFrame(a=1:4)

In [None]:
dfv = view(df, [3, 2], :)

In [None]:
dfr = dfv[1, :]

In [None]:
parent(dfr), parentindices(dfr), rownumber(dfr)

## Reordering rows

We create a random data frame (and hope that `x.x` is not sorted :), which is quite likely with 12 rows)

In [None]:
x = DataFrame(id=1:12, x=rand(12), y=[zeros(6); ones(6)])

  check if a DataFrame or a subset of its columns is sorted

In [None]:
issorted(x), issorted(x, :x)

  we sort x in place

In [None]:
sort!(x, :x)

  now we create a new DataFrame

In [None]:
y = sort(x, :id)

  here we sort by two columns, first is decreasing, second is increasing

In [None]:
sort(x, [:y, :x], rev=[true, false])

In [None]:
sort(x, [order(:y, rev=true), :x]) ## the same as above

now we try some fancier sorting: `by` applies a transformation to the values before they are compared

In [None]:
sort(x, [order(:y, rev=true), order(:x, by=v -> -v)])
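
If you need the permutation that sorts the rows rather than the sorted data frame itself, `sortperm` can be applied to a data frame as well. A minimal sketch on a hypothetical `demo` data frame with fixed values, so the result is easy to check by hand:

```julia
using DataFrames

# illustrative data with known values, not part of the notebook
demo = DataFrame(id=1:5, x=[0.3, 0.1, 0.5, 0.2, 0.4])

perm = sortperm(demo, :x)  # row indices in the order that sorts column x

# indexing rows with the permutation gives the same result as `sort`
demo[perm, :] == sort(demo, :x)
```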

  this is how you can reorder rows (here randomly)

In [None]:
x[shuffle(1:nrow(x)), :]

  it is also easy to swap rows using broadcasted assignment

In [None]:
sort!(x, :id)
x[[1, 10], :] .= x[[10, 1], :]
x

## Merging/adding rows

In [None]:
x = DataFrame(rand(3, 5), :auto)

  merge by rows; the data frames must have the same column names; this is the same as calling `vcat`

In [None]:
[x; x]

  you can efficiently `vcat` a vector of `DataFrame`s using `reduce`

In [None]:
reduce(vcat, [x, x, x])

  create `y` with the column names in reverse order

In [None]:
y = x[:, reverse(names(x))]

  `vcat` is still possible as it does column name matching

In [None]:
vcat(x, y)

but column names must still match

In [None]:
try
    vcat(x, y[:, 1:3])
catch e
    show(e)
end

unless you pass `:intersect`, `:union` or specific column names as keyword argument `cols`

In [None]:
vcat(x, y[:, 1:3], cols=:intersect)

In [None]:
vcat(x, y[:, 1:3], cols=:union)

In [None]:
vcat(x, y[:, 1:3], cols=[:x1, :x5])

`append!` modifies `x` in place

In [None]:
append!(x, x)

here the column names must match as a set (their order may differ, as with `y`) unless the `cols` keyword argument is passed

In [None]:
append!(x, y)

standard `repeat` function works on rows; also `inner` and `outer` keyword arguments are accepted

In [None]:
repeat(x, 2)
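
To illustrate the `inner` and `outer` keyword arguments mentioned above, here is a small sketch on a hypothetical two-row `demo` data frame: `inner` repeats each row in place, `outer` repeats the whole data frame.

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(a=1:2)

rep_inner = repeat(demo, inner=2)          # each row repeated twice in a row
rep_outer = repeat(demo, outer=2)          # whole data frame repeated twice
rep_both = repeat(demo, inner=2, outer=2)  # both combined

(rep_inner.a, rep_outer.a, rep_both.a)
```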

  `push!` adds one row to `x` at the end; one must pass a correct number of values unless `cols` keyword argument is passed

In [None]:
push!(x, 1:5)
x

  also works with dictionaries

In [None]:
push!(x, Dict(:x1 => 11, :x2 => 12, :x3 => 13, :x4 => 14, :x5 => 15))
x

  and `NamedTuple`s via name matching

In [None]:
push!(x, (x2=2, x1=1, x4=4, x3=3, x5=5))

  and `DataFrameRow` also via name matching

In [None]:
push!(x, x[1, :])
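
  As a sketch of what the `cols` keyword argument of `push!` can do, the example below pushes a row that lacks one of the columns. It assumes `cols=:subset`, which fills the absent column with `missing`, and uses a hypothetical `demo` data frame so that `x` is not modified:

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(a=[1, 2], b=["x", "y"])

# the row has no `b` field; with cols=:subset a `missing` is stored instead
push!(demo, (a=3,), cols=:subset)

demo
```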

Please consult the documentation of `push!`, `append!` and `vcat` for allowed values of `cols` keyword argument.
This keyword argument governs the way these functions perform column matching of passed arguments. Also `append!` and `push!` support a `promote` keyword argument that decides if column type promotion is allowed.
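
A minimal sketch of the `promote` keyword argument, on a hypothetical `demo` data frame: pushing a `Float64` value into an `Int` column fails by default, and is expected to succeed with `promote=true`, which widens the column element type. The failed attempt may also log an error message.

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(a=[1, 2])

# without promotion 1.5 cannot be stored in an Int column
failed = try
    push!(demo, (a=1.5,))
    false
catch
    true
end

# with promotion the column is converted to hold Float64 values
push!(demo, (a=1.5,), promote=true)

(failed, eltype(demo.a))
```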

Let us here just give a quick example of how heterogeneous data can be stored in the data frame using these functionalities:

In [None]:
source = [(a=1, b=2), (a=missing, b=10, c=20), (b="s", c=1, d=1)]

In [None]:
df = DataFrame()

In [None]:
for row in source
    push!(df, row, cols=:union) ## if cols is :union then promote is true by default
end

In [None]:
df

  and we see that `push!` dynamically added columns as needed and updated their element types

## Subsetting/removing rows

In [None]:
x = DataFrame(id=1:10, val='a':'j')

by using indexing

In [None]:
x[1:2, :]

a single row selection creates a `DataFrameRow`

In [None]:
x[1, :]

but this is a `DataFrame`

In [None]:
x[1:1, :]

the same but a view

In [None]:
view(x, 1:2, :)

selects columns 1 and 2

In [None]:
view(x, :, 1:2)

indexing by `Bool`; the length of the mask must match the number of rows exactly

In [None]:
x[repeat([true, false], 5), :]

alternatively we can also create a view

In [None]:
view(x, repeat([true, false], 5), :)
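
In practice the Boolean mask is usually computed from a column. The sketch below builds such a mask on a hypothetical `demo` data frame and also checks that a mask of the wrong length is rejected (the exact exception type is not relied upon here):

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(id=1:4, val='a':'d')

mask = demo.id .> 2      # one Bool per row
selected = demo[mask, :]

# a mask with 2 elements does not fit a data frame with 4 rows
rejected = try
    demo[[true, false], :]
    false
catch
    true
end

(selected.id, rejected)
```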

we can delete one row in place

In [None]:
deleteat!(x, 7)

  or a collection of rows, also in place

In [None]:
deleteat!(x, 6:7)

  you can also create a new `DataFrame` when deleting rows using `Not` indexing

In [None]:
x[Not(1:2), :]

In [None]:
x

now we move to row filtering

In [None]:
x = DataFrame([1:4, 2:5, 3:6], :auto)

  create a new `DataFrame` where filtering function operates on `DataFrameRow`

In [None]:
filter(r -> r.x1 > 2.5, x)

In [None]:
filter(r -> r.x1 > 2.5, x, view=true) # the same but as a view

 or

In [None]:
filter(:x1 => >(2.5), x)

in place modification of `x`, an example with `do`-block syntax

In [None]:
filter!(x) do r
    if r.x1 > 2.5
        return r.x2 < 4.5
    end
    r.x3 < 3.5
end
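
The same condition can also be written in the `columns => function` form, where the function receives the values of the listed columns for one row. This is a sketch on a fresh hypothetical `demo` data frame; `keep_row` is an illustrative helper name, and the non-mutating `filter` is used.

```julia
using DataFrames

# illustrative data, same shape as `x` before filtering
demo = DataFrame([1:4, 2:5, 3:6], :auto)

# same logic as the do-block: the test on x1 decides which column is checked
keep_row(x1, x2, x3) = x1 > 2.5 ? x2 < 4.5 : x3 < 3.5

kept = filter([:x1, :x2, :x3] => keep_row, demo)
```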

  A common operation is selection of rows for which a value in a column is contained in a given set. Here are a few ways in which you can achieve this.

In [None]:
df = DataFrame(x=1:12, y=mod1.(1:12, 4))

We select rows for which column `y` has value `1` or `4`.

In [None]:
filter(row -> row.y in [1, 4], df)

In [None]:
filter(:y => in([1, 4]), df)

In [None]:
df[in.(df.y, Ref([1, 4])), :]

  DataFrames.jl also provides a `subset` function that works on whole columns and allows for multiple conditions:

In [None]:
x = DataFrame([1:4, 2:5, 3:6], :auto)

In [None]:
subset(x, :x1 => x -> x .< mean(x), :x2 => ByRow(<(2.5)))

  Similarly an in-place `subset!` function is provided.
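
  A minimal sketch of `subset!`, applying the same two conditions as above to a hypothetical `demo` data frame, which is modified in place:

```julia
using DataFrames
using Statistics

# illustrative data, not part of the notebook
demo = DataFrame([1:4, 2:5, 3:6], :auto)

# first condition works on the whole column, second one row by row
subset!(demo, :x1 => v -> v .< mean(v), :x2 => ByRow(<(2.5)))

demo
```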

## Deduplicating

In [None]:
x = DataFrame(A=[1, 2], B=["x", "y"])
append!(x, x)
x.C = 1:4
x

  keep the first row for each unique combination of values in the given columns (here columns 1 and 2)

In [None]:
unique(x, [1, 2])

  now we look at whole rows

In [None]:
unique(x)

  get indicators of non-unique rows

In [None]:
nonunique(x, :A)

  modify `x` in place

In [None]:
unique!(x, :B)
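
  The indicator vector returned by `nonunique` and the result of `unique` are two views of the same information. The sketch below checks this on a hypothetical `demo` data frame with the same contents as `x` had before `unique!` was called:

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(A=[1, 2, 1, 2], B=["x", "y", "x", "y"], C=1:4)

dup = nonunique(demo, :A)  # true for rows repeating an earlier value of A

# dropping the flagged rows matches unique on the same column
demo[.!dup, :] == unique(demo, :A)
```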

## Extracting one row from a DataFrame into standard collections

In [None]:
x = DataFrame(x=[1, missing, 2], y=["a", "b", missing], z=[true, false, true])

In [None]:
cols = [:y, :z]

you can use a conversion to a `Vector` or an `Array`

In [None]:
Vector(x[1, cols])

In [None]:
Array(x[1, cols]) ## the same

  now you will get a vector of vectors

In [None]:
[Vector(x[i, cols]) for i in axes(x, 1)]

  it is easy to convert a `DataFrameRow` into a `NamedTuple`

In [None]:
copy(x[1, cols])

  or a `Tuple`

In [None]:
Tuple(x[1, cols])
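
  The conversions above can be checked side by side. This sketch uses a hypothetical `demo` data frame with the same contents as `x`; the final `Dict` step goes through the `NamedTuple` and is an addition for illustration, not something shown earlier.

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(x=[1, missing, 2], y=["a", "b", missing], z=[true, false, true])
demo_row = demo[1, [:y, :z]]

as_vector = Vector(demo_row)  # values only
as_tuple = Tuple(demo_row)    # values only, immutable
as_nt = copy(demo_row)        # names and values as a NamedTuple

# a NamedTuple can be turned into a Dict keyed by column name
as_dict = Dict(pairs(as_nt))
```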

## Working with a collection of rows of a data frame

You can use `eachrow` to get a vector-like collection of `DataFrameRow`s

In [None]:
df = DataFrame(reshape(1:12, 3, 4), :auto)

In [None]:
er_df = eachrow(df)

In [None]:
er_df[1]

In [None]:
last(er_df)

In [None]:
er_df[end]

  As a `DataFrameRows` object keeps a connection to the parent data frame, you can get the columns of the parent using `getproperty`

In [None]:
er_df.x1
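
  A typical use of `eachrow` is iterating over rows. The sketch below, on a hypothetical `demo` data frame, first computes a value per row and then assigns through each `DataFrameRow`, which should update the parent since the rows are views.

```julia
using DataFrames

# illustrative data, not part of the notebook
demo = DataFrame(reshape(1:12, 3, 4), :auto)

# read access: one value per row
row_sums = [r.x1 + r.x2 for r in eachrow(demo)]

# write access: each row is a view, so the parent column changes
for r in eachrow(demo)
    r.x4 = 0
end

(row_sums, demo.x4)
```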

## Flattening a data frame

Occasionally you have a data frame in which one column is a vector of collections. You can expand (flatten) such a column using the `flatten` function

In [None]:
df = DataFrame(a='a':'c', b=[[1, 2, 3], [4, 5], 6])

In [None]:
flatten(df, :b)
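
  `flatten` also accepts several columns at once, provided the collections in each row have matching lengths. A minimal sketch on a hypothetical `demo` data frame:

```julia
using DataFrames

# illustrative data: b and c hold collections of equal length in every row
demo = DataFrame(a=1:2, b=[[1, 2], [3, 4]], c=[[5, 6], [7, 8]])

# both columns are expanded together; values of `a` are repeated
flat = flatten(demo, [:b, :c])
```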

## Only one row

`only` from Julia Base is also supported in DataFrames.jl. It succeeds if the data frame has exactly one row, in which case that row is returned as a `DataFrameRow`; otherwise it throws an error.

In [None]:
df = DataFrame(a=1)

In [None]:
only(df)

In [None]:
df2 = repeat(df, 2)

In [None]:
try
    only(df2)
catch e
    show(e)
end