# Introduction to DataFrames
**[Bogumił Kamiński](http://bogumilkaminski.pl/about/), July 16, 2019**

In [1]:
using DataFrames

## Possible pitfalls

### Know what is copied when creating a `DataFrame`

In [2]:
x = DataFrame(rand(3, 5))

| Row | x1 | x2 | x3 | x4 | x5 |
|---|---|---|---|---|---|
| | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 0.492559 | 0.243183 | 0.966703 | 0.693184 | 0.92395 |
| 2 | 0.0928541 | 0.665956 | 0.355291 | 0.249962 | 0.347599 |
| 3 | 0.552664 | 0.403156 | 0.351879 | 0.963344 | 0.760738 |


In [3]:
y = convert(DataFrame, x)
x === y # no copying performed

true

In [4]:
y = copy(x)
x === y # not the same object

false

In [5]:
y = DataFrame(x)
x === y # not the same object: the constructor copies

false

In [6]:
any(x[!, i] === y[!, i] for i in 1:ncol(x)) # no column is shared: check every column, not only the last one

false

In [7]:
x = 1:3; y = [1, 2, 3]; df = DataFrame(x=x,y=y) # vectors passed to the constructor are copied as well

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [8]:
y === df.y # different object

false

In [9]:
typeof(x), typeof(df.x) # range is converted to a vector

(UnitRange{Int64}, Array{Int64,1})

In [10]:
y === df[:, :y] # slicing rows always creates a copy

false

You can avoid copying by passing the `copycols=false` keyword argument to functions that accept it. In particular, `DataFrame!` is shorthand for a non-copying constructor.

In [11]:
df = DataFrame!(x=x,y=y)

| Row | x | y |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 2 | 2 |
| 3 | 3 | 3 |


In [12]:
y === df.y # now it is the same

true

In [13]:
select(df, :y)[!, 1] === y # not the same

false

In [14]:
select(df, :y, copycols=false)[!, 1] === y # the same

true
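
The practical consequence of skipping the copy is that later changes to the source vector show up in the data frame. The following is a minimal sketch of that effect, assuming the keyword constructor accepts `copycols`; the names `v`, `aliased`, and `copied` are illustrative and do not come from the notebook above.

```julia
using DataFrames

v = [1, 2, 3]

# copycols=false: the column is the very same vector as `v` (an alias)
aliased = DataFrame(v=v, copycols=false)

# default behaviour: the column is a fresh copy of `v`
copied = DataFrame(v=v)

# mutate the source vector after both data frames were built
v[1] = 100

aliased.v[1]  # 100, the change is visible through the alias
copied.v[1]   # 1, the copy is unaffected
```

Aliasing saves memory and time for large columns, but it is only safe when nothing else mutates or resizes the source vector afterwards.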

### Do not modify the parent of `GroupedDataFrame` or `view`

In [15]:
x = DataFrame(id=repeat([1,2], outer=3), x=1:6)
g = groupby(x, :id)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 1 |
| 2 | 1 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |


In [16]:
x[1:3, 1] = [2,2,2]
g # the groups are now wrong: g is only a view of x and was not recomputed

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 3 |
| 3 | 1 | 5 |

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 2 |
| 2 | 2 | 4 |
| 3 | 2 | 6 |


In [17]:
s = view(x, 5:6, :)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 1 | 5 |
| 2 | 2 | 6 |


In [18]:
deleterows!(x, 3:6)

| Row | id | x |
|---|---|---|
| | Int64 | Int64 |
| 1 | 2 | 1 |
| 2 | 2 | 2 |


In [19]:
s # error

BoundsError: BoundsError: attempt to access 2-element Array{Int64,1} at index [5:6]
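
One way to sidestep both problems is to materialize the view with `copy` before touching the parent, and to call `groupby` again after any change. This is a sketch under the assumption that an independent snapshot is acceptable for your use case; the names `df`, `snapshot`, and `regrouped` are illustrative.

```julia
using DataFrames

df = DataFrame(id=repeat([1, 2], outer=3), x=1:6)

# copy(...) turns the view into an independent DataFrame
snapshot = copy(view(df, 5:6, :))

# mutate the parent in place after the snapshot was taken
df.id[1:3] .= 2

# the snapshot still holds the values from before the change
snapshot.id  # [1, 2]

# grouping again after the change reflects the current contents
regrouped = groupby(df, :id)
sort([nrow(sdf) for sdf in regrouped])  # [1, 5]
```

The copy costs extra memory, so it is a trade-off: keep the view when the parent is guaranteed not to change, and take a snapshot otherwise.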

### Single column selection for `DataFrame` creates aliases with `!` and `getproperty` syntax and copies with `:`

In [20]:
x = DataFrame(a=1:3)
x.b = x[!, 1] # alias
x.c = x[:, 1] # copy
x.d = x[!, 1][:] # copy
x.e = copy(x[!, 1]) # explicit copy
display(x)
x[1,1] = 100
display(x)

| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |


| Row | a | b | c | d | e |
|---|---|---|---|---|---|
| | Int64 | Int64 | Int64 | Int64 | Int64 |
| 1 | 100 | 100 | 1 | 1 | 1 |
| 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 3 | 3 | 3 | 3 | 3 |
