# II. Working with Dataframes

In this notebook, we'll start doing basic operations on dataframes. Specifically, we'll look at:

* Column operations
* Row operations
* Sorting
* Categorical variables

In [1]:
using DataFrames, Dates, CSV

In [2]:
# load csv file

df = CSV.read("fakedata.csv", DataFrame)



Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF


The `first` and `last` functions can be used to examine the first and last few lines of the dataframe.

In [3]:
first(df,7)

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj


In [4]:
last(df,7)

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,3,-0.39826,a,2015-02-01,2015-07-12,UIVdF2TxWM
2,4,0.52603,a,2011-07-10,2011-08-13,s0ltSZb4fm
3,4,-0.41483,c,2012-03-12,2012-08-01,0OoRG1ly8l
4,5,-0.02946,a,2013-07-16,2013-07-17,uLFMepv4pI
5,2,-0.16647,c,2016-03-14,2016-04-14,TKdJZ2SddY
6,5,0.86743,d,2012-09-10,2012-09-26,0ltwHfLBHf
7,6,-0.85316,d,2018-04-12,2018-05-12,hjkJkeasXG


You might want to extract basic information about the dataframe, such as its number of rows and columns. To get the dataframe size, you can use the `size` function, which tells us there are 20 rows and 6 columns in the dataframe.

In [5]:
nr, nc = size(df)

(20, 6)
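As a small aside, `size` also accepts a dimension argument, so you can ask for the row or column count on its own. The sketch below uses a made-up three-row dataframe named `toy` (an assumption for illustration, not the notebook's data).

```julia
using DataFrames

# Hypothetical toy data, not fakedata.csv
toy = DataFrame(x = 1:3, y = ["a", "b", "c"])

nrows, ncols = size(toy)   # both dimensions as a tuple
rows_only = size(toy, 1)   # number of rows only
cols_only = size(toy, 2)   # number of columns only
```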

The `describe` function can be used to get summary statistics on the dataframe. The object returned is also a dataframe.

In [6]:
describe(df)

Unnamed: 0_level_0,variable,mean,min,median,max,nunique,nmissing,eltype
Unnamed: 0_level_1,Symbol,Union…,Any,Union…,Any,Union…,Nothing,DataType
1,v1,3.35,1,3.5,6,,,Int64
2,v2,0.0646675,-0.85316,0.07843,0.90441,,,Float64
3,v3,,a,,d,4.0,,String
4,v4,,2011-07-10,,2019-10-31,19.0,,Date
5,v5,,2011-08-13,,2019-11-12,19.0,,Date
6,v6,,0OoRG1ly8l,,vFSFbWZBGM,20.0,,String


We can restrict `describe` to certain variables of interest and ask for only certain statistics.

In [7]:
summstats = describe(df[:,[:v1, :v2]], :eltype, :min, :max, :median)

Unnamed: 0_level_0,variable,eltype,min,max,median
Unnamed: 0_level_1,Symbol,DataType,Real,Real,Float64
1,v1,Int64,1.0,6.0,3.5
2,v2,Float64,-0.85316,0.90441,0.07843


Since <i>summstats</i> is itself a dataframe you can work with it as you would any other dataframe.

### Column operations:

We've seen the `size` function to get the dimensions of the dataframe. There is also an `ncol` function if you just want the number of columns:

In [8]:
nc = ncol(df)

6

Add a column using the [ ] notation by providing a name for the column along with its values using the assignment operator. Here we'll add a new column named <i>RandStr</i> to **df** that is just an array of random strings of length 10.

In [9]:
using Random 
df[!,:RandStr] = [randstring(10) for j in 1:nr];

In [10]:
df

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6,RandStr
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String,String
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs,Qd7o4K1ppr
2,2,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk,zP4ZFgdeom
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E,ijy3LjvUzV
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH,nURrlAWJ64
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92,uwg0UEO4b0
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT,qm2W3LtYvz
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj,BcF0hu1v85
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt,lgJymJVbme
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD,wN5lihnRag
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF,zqEMP2kiGq
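One detail worth knowing about the `df[!, :name] = values` form is that it stores the vector you pass without copying it. The following sketch, on a hypothetical dataframe `toy` with made-up column names, contrasts that with storing an explicit copy.

```julia
using DataFrames

toy = DataFrame(id = 1:3)      # hypothetical toy data
v = [10, 20, 30]

toy[!, :alias] = v             # stores v itself, no copy is made
toy[!, :copied] = copy(v)      # stores an independent copy

v[1] = 100                     # mutate the original vector afterwards

first_alias = toy.alias[1]     # sees the change, since it shares memory with v
first_copied = toy.copied[1]   # unaffected by the change
```

If later changes to the source vector should not leak into the dataframe, passing `copy(v)` is one simple way to guard against it.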


You can do operations on existing columns in the dataframe to create new columns. We know that **v4** is a Date type so we can use Julia's <i>Dates</i> package to work with these types.

In [11]:
eltype(df.v4)

Date

Let's use the `day` function to add a new column that calculates the day of the month for each date value in **v4**.

In [12]:
df.v4_day =  day.(df.v4);
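The dot in `day.(...)` broadcasts the function over every element of the column. The same pattern works for the other accessors in <i>Dates</i>; here is a minimal sketch on a plain vector holding two of the dates from **v4**.

```julia
using Dates

dates = [Date(2019, 10, 31), Date(2018, 11, 12)]   # two dates taken from v4

days   = day.(dates)      # day of the month for each date
months = month.(dates)    # month number for each date
years  = year.(dates)     # calendar year for each date
```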

In [13]:
df

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6,RandStr,v4_day
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String,String,Int64
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs,Qd7o4K1ppr,31
2,2,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk,zP4ZFgdeom,12
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E,ijy3LjvUzV,14
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH,nURrlAWJ64,1
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92,uwg0UEO4b0,6
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT,qm2W3LtYvz,3
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj,BcF0hu1v85,28
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt,lgJymJVbme,27
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD,wN5lihnRag,4
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF,zqEMP2kiGq,12


Another way to add columns to your dataframe is `insertcols!`. The first argument to `insertcols!` is the dataframe, the second argument is the column index where you want the new column to be placed, and the last argument gives the new column's name and values. The non in-place version is `insertcols`.

In [14]:
insertcols!(df, 6, :elapsed_days => df.v5 - df.v4)



Unnamed: 0_level_0,v1,v2,v3,v4,v5,elapsed_days,v6,RandStr
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,Day,String,String
1,1,0.1078,a,2019-10-31,2019-11-03,3 days,IZFi329Kfs,Qd7o4K1ppr
2,2,-0.79658,a,2018-11-12,2018-12-01,19 days,ZC1tXGibbk,zP4ZFgdeom
3,4,0.6142,c,2012-05-14,2012-06-14,31 days,BddsrFBO8E,ijy3LjvUzV
4,1,0.39258,a,2017-01-01,2017-08-01,212 days,g7PQAwGDrH,nURrlAWJ64
5,4,-0.29707,a,2012-06-06,2012-06-12,6 days,gxeJaCvj92,uwg0UEO4b0
6,4,0.2167,c,2019-04-03,2019-05-01,28 days,Q23fycpFNT,qm2W3LtYvz
7,6,0.04906,b,2015-07-28,2015-08-20,23 days,QE63406gEj,BcF0hu1v85
8,6,-0.76691,b,2014-09-27,2014-11-27,61 days,GCMwV5msPt,lgJymJVbme
9,1,0.17141,d,2013-10-04,2013-11-14,41 days,92PkSbQonD,wN5lihnRag
10,3,0.35063,d,2014-08-12,2014-09-12,31 days,RUXG3N8RtF,zqEMP2kiGq


__By default Jupyter Notebook will limit the number of rows and columns when displaying a data frame to roughly fit the screen size.__

In [15]:
show(df, allcols=true)

20×9 DataFrame
│ Row │ v1    │ v2       │ v3     │ v4         │ v5         │ elapsed_days │
│     │ Int64 │ Float64  │ String │ Date       │ Date       │ Day          │
├─────┼───────┼──────────┼────────┼────────────┼────────────┼──────────────┤
│ 1   │ 1     │ 0.1078   │ a      │ 2019-10-31 │ 2019-11-03 │ 3 days       │
│ 2   │ 2     │ -0.79658 │ a      │ 2018-11-12 │ 2018-12-01 │ 19 days      │
│ 3   │ 4     │ 0.6142   │ c      │ 2012-05-14 │ 2012-06-14 │ 31 days      │
│ 4   │ 1     │ 0.39258  │ a      │ 2017-01-01 │ 2017-08-01 │ 212 days     │
│ 5   │ 4     │ -0.29707 │ a      │ 2012-06-06 │ 2012-06-12 │ 6 days       │
│ 6   │ 4     │ 0.2167   │ c      │ 2019-04-03 │ 2019-05-01 │ 28 days      │
│ 7   │ 6     │ 0.04906  │ b      │ 2015-07-28 │ 2015-08-20 │ 23 days      │
│ 8   │ 6     │ -0.76691 │ b      │ 2014-09-27 │ 2014-11-27 │ 61 days      │
│ 9   │ 1     │ 0.17141  │ d      │ 2013-10-04 │ 2013-11-14 │ 41 days      │

You can accomplish the same thing using `transform!` or `transform` for the in-place and non in-place operations, respectively.

The first argument is the dataframe; the second argument is a nested pair of the form `source columns => function => new column name`. The first part lists the columns of the dataframe to operate on. The second part uses `ByRow` to apply the anonymous function to each row, with the values of the listed columns passed as its inputs. The result is placed in a new column called __elapsed_days_trans__, as indicated by the last part of the second argument.

In [16]:
transform!(df, [:v5, :v4] => ByRow( (a,b) -> a - b ) => :elapsed_days_trans)
# transform(df, [:v5, :v4] => ByRow( - ) => :elapsed_days_trans)

Unnamed: 0_level_0,v1,v2,v3,v4,v5,elapsed_days,v6,RandStr
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,Day,String,String
1,1,0.1078,a,2019-10-31,2019-11-03,3 days,IZFi329Kfs,Qd7o4K1ppr
2,2,-0.79658,a,2018-11-12,2018-12-01,19 days,ZC1tXGibbk,zP4ZFgdeom
3,4,0.6142,c,2012-05-14,2012-06-14,31 days,BddsrFBO8E,ijy3LjvUzV
4,1,0.39258,a,2017-01-01,2017-08-01,212 days,g7PQAwGDrH,nURrlAWJ64
5,4,-0.29707,a,2012-06-06,2012-06-12,6 days,gxeJaCvj92,uwg0UEO4b0
6,4,0.2167,c,2019-04-03,2019-05-01,28 days,Q23fycpFNT,qm2W3LtYvz
7,6,0.04906,b,2015-07-28,2015-08-20,23 days,QE63406gEj,BcF0hu1v85
8,6,-0.76691,b,2014-09-27,2014-11-27,61 days,GCMwV5msPt,lgJymJVbme
9,1,0.17141,d,2013-10-04,2013-11-14,41 days,92PkSbQonD,wN5lihnRag
10,3,0.35063,d,2014-08-12,2014-09-12,31 days,RUXG3N8RtF,zqEMP2kiGq
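To see the `columns => ByRow(function) => name` pattern in isolation, here is a minimal sketch using the non in-place `transform` on a hypothetical two-row dataframe (`toy`, `start`, `stop`, and `elapsed` are made-up names). It assumes a DataFrames release that provides `transform` and `ByRow`, as the cell above does.

```julia
using DataFrames, Dates

# Hypothetical toy data with two date columns
toy = DataFrame(start = [Date(2020, 1, 1), Date(2020, 2, 1)],
                stop  = [Date(2020, 1, 11), Date(2020, 3, 1)])

# Subtract start from stop row by row; toy itself is left unchanged
out = transform(toy, [:stop, :start] => ByRow(-) => :elapsed)
```

Because `transform` returns a new dataframe, `toy` still has two columns afterwards while `out` has three.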


In [17]:
show(df, allcols=true)

20×10 DataFrame
│ Row │ v1    │ v2       │ v3     │ v4         │ v5         │ elapsed_days │
│     │ Int64 │ Float64  │ String │ Date       │ Date       │ Day          │
├─────┼───────┼──────────┼────────┼────────────┼────────────┼──────────────┤
│ 1   │ 1     │ 0.1078   │ a      │ 2019-10-31 │ 2019-11-03 │ 3 days       │
│ 2   │ 2     │ -0.79658 │ a      │ 2018-11-12 │ 2018-12-01 │ 19 days      │
│ 3   │ 4     │ 0.6142   │ c      │ 2012-05-14 │ 2012-06-14 │ 31 days      │
│ 4   │ 1     │ 0.39258  │ a      │ 2017-01-01 │ 2017-08-01 │ 212 days     │
│ 5   │ 4     │ -0.29707 │ a      │ 2012-06-06 │ 2012-06-12 │ 6 days       │
│ 6   │ 4     │ 0.2167   │ c      │ 2019-04-03 │ 2019-05-01 │ 28 days      │
│ 7   │ 6     │ 0.04906  │ b      │ 2015-07-28 │ 2015-08-20 │ 23 days      │
│ 8   │ 6     │ -0.76691 │ b      │ 2014-09-27 │ 2014-11-27 │ 61 days      │
│ 9   │ 1     │ 0.17141  │ d      │ 2013-10-04 │ 2013-11-14 │ 41 days      │


Note the type of <i>df.elapsed_days</i>. It's an array with elements of type **Day** (not **Int64**). **Day** is a type that is part of the <i>Dates</i> package.

In [18]:
typeof(df.elapsed_days)

Array{Day,1}
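If you need plain integers rather than **Day** values, for example to compute a mean, `Dates.value` extracts the underlying number. A minimal sketch on a standalone vector (the name `elapsed` is just for illustration):

```julia
using Dates

elapsed = [Day(3), Day(19), Day(31)]   # a few Day values like those in elapsed_days

# Broadcast Dates.value to strip the Day wrapper from each element
as_ints = Dates.value.(elapsed)
```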

You can also create a new column by mapping an arbitrary function over another column. Here let's use Julia's `map` function to create a new column called <i>elapsed_hours</i> equal to <i>elapsed_days</i> converted into hours.

In [19]:
df[!, :elapsed_hours] = map(t -> Dates.Hour(t), df[:, :elapsed_days]); 

# You could also do df[!, :elapsed_hours] =  Dates.Hour.(df.elapsed_days);
# You could also do df[!, :elapsed_hours] = map(t -> Dates.Hour(t), df.elapsed_days);

In [20]:
show(df, allcols=true)

20×11 DataFrame
│ Row │ v1    │ v2       │ v3     │ v4         │ v5         │ elapsed_days │
│     │ Int64 │ Float64  │ String │ Date       │ Date       │ Day          │
├─────┼───────┼──────────┼────────┼────────────┼────────────┼──────────────┤
│ 1   │ 1     │ 0.1078   │ a      │ 2019-10-31 │ 2019-11-03 │ 3 days       │
│ 2   │ 2     │ -0.79658 │ a      │ 2018-11-12 │ 2018-12-01 │ 19 days      │
│ 3   │ 4     │ 0.6142   │ c      │ 2012-05-14 │ 2012-06-14 │ 31 days      │
│ 4   │ 1     │ 0.39258  │ a      │ 2017-01-01 │ 2017-08-01 │ 212 days     │
│ 5   │ 4     │ -0.29707 │ a      │ 2012-06-06 │ 2012-06-12 │ 6 days       │
│ 6   │ 4     │ 0.2167   │ c      │ 2019-04-03 │ 2019-05-01 │ 28 days      │
│ 7   │ 6     │ 0.04906  │ b      │ 2015-07-28 │ 2015-08-20 │ 23 days      │
│ 8   │ 6     │ -0.76691 │ b      │ 2014-09-27 │ 2014-11-27 │ 61 days      │
│ 9   │ 1     │ 0.17141  │ d      │ 2013-10-04 │ 2013-11-14 │ 41 days      │


As you likely noticed, not all columns are displayed by default. Calling `show` with `allcols=true`, as above, prints every column.

If you want to drop a column you can use the `select!` function along with the `Not` selector, which inverts a column selection. This is an in-place operation and therefore modifies the dataframe. Use `select` for the non in-place version of this operation.

We'll drop the <i>elapsed_hours</i>, <i>v5</i>, and <i>elapsed_days_trans</i> columns from **df**.

In [21]:
select!(df, Not([:elapsed_hours, :v5, :elapsed_days_trans]));

In [22]:
names(df)

8-element Array{String,1}:
 "v1"          
 "v2"          
 "v3"          
 "v4"          
 "elapsed_days"
 "v6"          
 "RandStr"     
 "v4_day"      
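The difference between dropping with `Not` and keeping an explicit list is easiest to see on a tiny example. This sketch uses the non in-place `select` on a hypothetical three-column dataframe, so the original is left untouched.

```julia
using DataFrames

toy = DataFrame(a = 1:2, b = 3:4, c = 5:6)   # hypothetical toy data

dropped = select(toy, Not(:b))       # everything except column b
kept    = select(toy, [:c, :a])      # only the listed columns, in the listed order
```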

The `select!` command can also be used to keep a specific subset of columns:

In [23]:
select!(df, [:v1, :v2, :v3, :v4, :elapsed_days, :RandStr, :v4_day])

Unnamed: 0_level_0,v1,v2,v3,v4,elapsed_days,RandStr,v4_day
Unnamed: 0_level_1,Int64,Float64,String,Date,Day,String,Int64
1,1,0.1078,a,2019-10-31,3 days,Qd7o4K1ppr,31
2,2,-0.79658,a,2018-11-12,19 days,zP4ZFgdeom,12
3,4,0.6142,c,2012-05-14,31 days,ijy3LjvUzV,14
4,1,0.39258,a,2017-01-01,212 days,nURrlAWJ64,1
5,4,-0.29707,a,2012-06-06,6 days,uwg0UEO4b0,6
6,4,0.2167,c,2019-04-03,28 days,qm2W3LtYvz,3
7,6,0.04906,b,2015-07-28,23 days,BcF0hu1v85,28
8,6,-0.76691,b,2014-09-27,61 days,lgJymJVbme,27
9,1,0.17141,d,2013-10-04,41 days,wN5lihnRag,4
10,3,0.35063,d,2014-08-12,31 days,zqEMP2kiGq,12


To change the name of a column variable you can use the `rename!` function. We'll change the column name of <i>elapsed_days</i> to be <i>elap_days</i> and <i>v3</i> to be <i>Category</i>. The `rename` function is the non in-place version of this operation.

In [24]:
rename!(df, :elapsed_days => :elap_days, :v3 => :Category);

In [25]:
names(df)

7-element Array{String,1}:
 "v1"       
 "v2"       
 "Category" 
 "v4"       
 "elap_days"
 "RandStr"  
 "v4_day"   

`select!` and `select` can be used to reorder the columns in the dataframe. Suppose we want _Category_ to appear first, followed by _elap_days_, with the remaining columns keeping their current order:

In [26]:
select(df, [:Category, :elap_days], :)

Unnamed: 0_level_0,Category,elap_days,v1,v2,v4,RandStr,v4_day
Unnamed: 0_level_1,String,Day,Int64,Float64,Date,String,Int64
1,a,3 days,1,0.1078,2019-10-31,Qd7o4K1ppr,31
2,a,19 days,2,-0.79658,2018-11-12,zP4ZFgdeom,12
3,c,31 days,4,0.6142,2012-05-14,ijy3LjvUzV,14
4,a,212 days,1,0.39258,2017-01-01,nURrlAWJ64,1
5,a,6 days,4,-0.29707,2012-06-06,uwg0UEO4b0,6
6,c,28 days,4,0.2167,2019-04-03,qm2W3LtYvz,3
7,b,23 days,6,0.04906,2015-07-28,BcF0hu1v85,28
8,b,61 days,6,-0.76691,2014-09-27,lgJymJVbme,27
9,d,41 days,1,0.17141,2013-10-04,wN5lihnRag,4
10,d,31 days,3,0.35063,2014-08-12,zqEMP2kiGq,12


The last function we will look at for working with columns is `eachcol`, which lets you iterate over the columns of your dataframe:

`pairs(eachcol(df))`

Iterating over `eachcol(df)` yields the column vectors alone. Wrapping it in `pairs` yields, for each column, a __Pair__ whose first element is the column name (a symbol) and whose second element is the corresponding column values. Older DataFrames releases spelled this `eachcol(df, true)`, a form that has since been deprecated.

We can use `eachcol` to pick out columns (based on their type) and perform an operation on each one. Here we calculate basic summary statistics **only** for columns in the **df** dataframe whose element type is a subtype of <i>Real</i>.

In [27]:
using Statistics

# Each iteration yields name => column; col[1] is the name, col[2] the values
for col in pairs(eachcol(df))
    if (eltype(col[2]) <: Real) 
        println("Variable ", col[1], "\n  ", "Mean: ", round(mean(col[2]), digits=3), 
            "\n  Max: ", round(maximum(col[2]), digits=1),
            "\n  Min: ", round(minimum(col[2]), digits=1),
            "\n  Range: ", round(maximum(col[2]) - minimum(col[2]), digits=1)
        )
    end
end

Variable v1
  Mean: 3.35
  Max: 6.0
  Min: 1.0
  Range: 5.0
Variable v2
  Mean: 0.065
  Max: 0.9
  Min: -0.9
  Range: 1.8
Variable v4_day
  Mean: 13.05
  Max: 31.0
  Min: 1.0
  Range: 30.0
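An alternative to testing `eltype` inside a loop is to ask the dataframe for the names of columns of a given type. The sketch below assumes a DataFrames release recent enough that `names` accepts a type as its second argument; `toy` and its columns are made up for illustration.

```julia
using DataFrames, Statistics

# Hypothetical toy data with two numeric columns and one string column
toy = DataFrame(x = [1, 2, 3], y = [0.5, 1.5, 2.5], label = ["a", "b", "c"])

# Names of the columns whose element type is a subtype of Real
numeric_cols = names(toy, Real)

# Mean of each numeric column, keyed by column name
col_means = Dict(name => mean(toy[!, name]) for name in numeric_cols)
```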


### Row operations:

In [28]:
df = CSV.read("fakedata.csv", DataFrame)

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF


If you want the number of rows in your dataframe there is an `nrow` function for that:

In [29]:
nr = nrow(df)

20

When indexing into a dataframe the first index indicates the rows and the second the columns. **Indexing a dataframe works as you expect**. The colon indicates all elements or items along the dimension.

For example, if we wanted to "subset" all rows and all columns:

In [30]:
df[:, :]

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF


If we only wanted rows with a certain index we would specify that in the first argument. Here we get rows 5 through 9.

In [31]:
df[5:9, :]

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92
2,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
3,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
4,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
5,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD


You can make a copy of the dataframe via the `copy` command:

In [32]:
dfc = copy(df)

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF


You can change the value of an entry by specifying the index position and the new value. Here we change the value for the _v1_ variable in row 2 to 4.

In [33]:
dfc[2, :v1] = 4

4

Note when modifying values the new values need to be of a valid type. For example _v1_ is of type __Int64__ so the assigned value needs to be an __Int64__ or something that can be converted into an __Int64__. Trying to set the value to, for example, a __String__ won't work:

In [34]:
df[2, :v1] = "four"

MethodError: MethodError: Cannot `convert` an object of type String to an object of type Int64
Closest candidates are:
  convert(::Type{T}, !Matched::T) where T<:Number at number.jl:6
  convert(::Type{T}, !Matched::Number) where T<:Number at number.jl:7
  convert(::Type{T}, !Matched::Ptr) where T<:Integer at pointer.jl:23
  ...
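To make "something that can be converted" concrete, the sketch below tries two assignments into an integer column of a hypothetical dataframe `toy`: a float with no fractional part, which converts cleanly, and one with a fractional part, which raises an `InexactError`.

```julia
using DataFrames

toy = DataFrame(v1 = [1, 2, 3])   # hypothetical Int64 column

toy[2, :v1] = 4.0                 # 4.0 converts exactly to the integer 4

# 4.5 has no exact integer representation, so the assignment throws
got_inexact = try
    toy[3, :v1] = 4.5
    false
catch err
    err isa InexactError
end
```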

You can update slices as well, as long as the types and dimensions match.

In [35]:
first(dfc, 10)

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.1078,a,2019-10-31,2019-11-03,IZFi329Kfs
2,4,-0.79658,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,4,0.6142,c,2012-05-14,2012-06-14,BddsrFBO8E
4,1,0.39258,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,4,-0.29707,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF


This will update the first five rows of the _v1_ and _v2_ columns:

In [36]:
dfc[1:5, [:v1, :v2]] = [[1, 2, 3, 6, 5] randn(5,1)];

In [37]:
dfc

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,3,0.343577,c,2012-05-14,2012-06-14,BddsrFBO8E
4,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF


Suppose we wanted to subset based on the value of a column variable? For example, if we wanted all rows of data where <i>v3</i> has the value "a":

In [38]:
dfc[dfc.v3 .== "a", :]

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
4,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
5,2,0.90441,a,2018-11-12,2019-11-12,LE6qR2Q37p
6,1,0.81387,a,2014-06-24,2016-06-24,vFSFbWZBGM
7,3,-0.39826,a,2015-02-01,2015-07-12,UIVdF2TxWM
8,4,0.52603,a,2011-07-10,2011-08-13,s0ltSZb4fm
9,5,-0.02946,a,2013-07-16,2013-07-17,uLFMepv4pI


You can use Julia's regular expression support to match on values of arbitrary character expressions. For example we can get only the rows of data where the _v6_ variable contains the expression "XG". One way to do this is to use the `match` function to find the rows of _v6_ that contain "XG". 

In [39]:
mymatch = match.(r"XG", dfc.v6)

20-element Array{Union{Nothing, RegexMatch},1}:
 nothing         
 RegexMatch("XG")
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 RegexMatch("XG")
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 nothing         
 RegexMatch("XG")

We can see that __mymatch__ is a one dimensional Array and the entries in __mymatch__ corresponding to non-matches of the regular expression "XG" have a value of `nothing`. We can use this to select the rows of the dataframe that correspond to the rows of __mymatch__ that _do not_ equal `nothing` (i.e. match the expression "XG").

In [40]:
dfc[mymatch .!== nothing,:]

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
2,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF
3,6,-0.85316,d,2018-04-12,2018-05-12,hjkJkeasXG
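When you only need a yes/no answer per row, `occursin` may be a simpler choice than `match`, because it returns a boolean directly and needs no comparison against `nothing`. A minimal sketch on three of the _v6_ strings:

```julia
codes = ["ZC1tXGibbk", "BddsrFBO8E", "RUXG3N8RtF"]   # sample values from v6

# true where the regular expression matches somewhere in the string
has_xg = occursin.(r"XG", codes)
```

The resulting boolean vector can be used as a row index in the same way as `mymatch .!== nothing`, for example `dfc[occursin.(r"XG", dfc.v6), :]`.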


In [41]:
dfc

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,3,0.343577,c,2012-05-14,2012-06-14,BddsrFBO8E
4,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
5,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
6,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
7,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
8,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
9,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
10,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF


Here is an example of selecting rows based on a date value. Here we select rows where the value of the date for the _v4_ variable is after 7/27/2015.

In [42]:
dfc[dfc.v4 .> Date(2015,7,27), :]

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
4,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
5,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
6,2,0.90441,a,2018-11-12,2019-11-12,LE6qR2Q37p
7,2,-0.16647,c,2016-03-14,2016-04-14,TKdJZ2SddY
8,6,-0.85316,d,2018-04-12,2018-05-12,hjkJkeasXG


You can select rows based on multiple variables. For example, suppose we want the rows of data where <i>v3</i> has value "a" and <i>v2</i> is greater than 0, and just the columns <i>v1</i>, <i>v2</i>, and <i>v3</i>:

In [43]:
dfc[(dfc.v3 .== "a") .& (dfc.v2 .> 0), [:v1, :v2, :v3]]

Unnamed: 0_level_0,v1,v2,v3
Unnamed: 0_level_1,Int64,Float64,String
1,1,0.680515,a
2,6,0.642345,a
3,5,1.31946,a
4,2,0.90441,a
5,1,0.81387,a
6,4,0.52603,a
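The same kind of multi-condition selection can also be written with `subset`, which combines the conditions for you. This is a sketch assuming DataFrames 1.0 or later, on a hypothetical dataframe `toy` that mimics the column names used here.

```julia
using DataFrames

# Hypothetical toy data mimicking v1, v2, v3
toy = DataFrame(v1 = [1, 2, 3, 4],
                v2 = [0.5, -0.2, 0.7, 0.1],
                v3 = ["a", "a", "b", "a"])

# Keep rows where v3 == "a" AND v2 > 0; each condition is applied row by row
picked = subset(toy, :v3 => ByRow(==("a")), :v2 => ByRow(>(0)))

# Then keep only the columns of interest
result = select(picked, [:v1, :v2])
```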


You can also use Julia's built-in `filter` function to subset data. You can use either `filter` or `filter!`. The former will return a copy of the dataframe with rows that satisfy the filter; the latter will actually modify the dataframe in place __keeping__ only rows that satisfy the filter.

The first argument is the filter itself, almost always expressed as an anonymous function, and the second argument is the dataframe the filter should be applied to.

Here we filter on rows where the value of <i>v1</i> is greater than 1.

In [44]:
filter(row -> row[:v1] > 1, dfc)

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
2,3,0.343577,c,2012-05-14,2012-06-14,BddsrFBO8E
3,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
4,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
5,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
6,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
7,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
8,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF
9,2,0.90441,a,2018-11-12,2019-11-12,LE6qR2Q37p
10,3,0.00197,b,2011-08-12,2011-08-13,B8oAeJRvEf
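`filter` also accepts a `column => predicate` pair, which passes only that column's value to the predicate instead of the whole row. The sketch below shows both the copying and in-place forms on a hypothetical dataframe `toy`.

```julia
using DataFrames

toy = DataFrame(v1 = [1, 2, 3, 1], v2 = [0.1, 0.2, 0.3, 0.4])   # hypothetical toy data

kept = filter(:v1 => >(1), toy)   # returns a new dataframe; toy is unchanged
rows_before = nrow(toy)

filter!(:v1 => >(1), toy)         # modifies toy in place
rows_after = nrow(toy)
```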


If you need to delete rows you can use the `deleteat!` function (named `deleterows!` in older DataFrames releases) and provide the row indices to delete.

In [45]:
deleteat!(dfc, [3, 15])



Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
4,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
5,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
6,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
7,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
8,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
9,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF
10,2,0.90441,a,2018-11-12,2019-11-12,LE6qR2Q37p


A more useful thing might be to get rid of duplicate rows. The first occurrence of the duplicate row is kept in the dataframe while all others are regarded as duplicates and removed.

To this end you can use either the `unique` or `unique!` functions. The former returns a copy of the dataframe with deleted duplicate rows while the latter is an in-place operation.

`unique(df, cols)` <br/>
`unique!(df, cols)`

The **cols** argument is optional. If you specify the **cols** argument the comparison will be made using only the specified columns. If you don't specify the **cols** argument then the comparison will be made using all columns, which means only rows with identical values across all columns are treated as duplicates.

We can see in our dataframe above that when only considering columns <i>v1</i> and <i>v3</i> there are duplicate rows. For example there are multiple rows with <i>v1</i> equal to 1 and <i>v3</i> equal to "a" or with <i>v1</i> equal to 2 and <i>v3</i> equal to "a".

To get rid of duplicate rows on columns <i>v1</i> and <i>v3</i>:

In [46]:
unique(dfc, [:v1, :v3])

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
4,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
5,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
6,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
7,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
8,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF
9,3,0.00197,b,2011-08-12,2011-08-13,B8oAeJRvEf
10,3,-0.39826,a,2015-02-01,2015-07-12,UIVdF2TxWM


The `nonunique` function will tell you if a given row is a duplicate of any row **before** it.

In [47]:
dfc

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
2,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
3,6,0.642345,a,2017-01-01,2017-08-01,g7PQAwGDrH
4,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
5,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
6,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
7,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
8,1,0.17141,d,2013-10-04,2013-11-14,92PkSbQonD
9,3,0.35063,d,2014-08-12,2014-09-12,RUXG3N8RtF
10,2,0.90441,a,2018-11-12,2019-11-12,LE6qR2Q37p


In [48]:
nonunique(dfc, [:v1, :v3])

18-element Array{Bool,1}:
 0
 0
 0
 0
 0
 0
 1
 0
 0
 1
 0
 1
 0
 1
 1
 0
 0
 0

You can use the `findall` function to find the row indices that correspond to duplicates of some previous row. The first argument to `findall` is a function that returns __true__ or __false__, and `findall` returns the indices of the second argument for which that function returns __true__.

In [49]:
findall(row -> row == true, nonunique(dfc, [:v1, :v3]))

5-element Array{Int64,1}:
  7
 10
 12
 14
 15
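Since `nonunique` already returns a vector of booleans, `findall` can also be called on it directly without a predicate function. A minimal sketch on a hypothetical dataframe `toy` with two repeated rows:

```julia
using DataFrames

# Hypothetical toy data: rows 2 and 4 repeat row 1
toy = DataFrame(a = [1, 1, 2, 1], b = ["x", "x", "y", "x"])

dup_flags = nonunique(toy, [:a, :b])   # true for rows that repeat an earlier row
dup_rows  = findall(dup_flags)         # indices of the true entries
deduped   = unique(toy, [:a, :b])      # first occurrences only
```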

Lastly if you need to randomly reorder the rows of data you can use the __shuffle__ function.

In [50]:
using Random

dfc[shuffle(1:nrow(dfc)), :]

Unnamed: 0_level_0,v1,v2,v3,v4,v5,v6
Unnamed: 0_level_1,Int64,Float64,String,Date,Date,String
1,6,-0.85316,d,2018-04-12,2018-05-12,hjkJkeasXG
2,6,-0.76691,b,2014-09-27,2014-11-27,GCMwV5msPt
3,6,0.04906,b,2015-07-28,2015-08-20,QE63406gEj
4,5,1.31946,a,2012-06-06,2012-06-12,gxeJaCvj92
5,3,-0.39826,a,2015-02-01,2015-07-12,UIVdF2TxWM
6,5,-0.02946,a,2013-07-16,2013-07-17,uLFMepv4pI
7,1,0.680515,a,2019-10-31,2019-11-03,IZFi329Kfs
8,2,0.90441,a,2018-11-12,2019-11-12,LE6qR2Q37p
9,2,-0.980425,a,2018-11-12,2018-12-01,ZC1tXGibbk
10,4,0.2167,c,2019-04-03,2019-05-01,Q23fycpFNT
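If the shuffled order needs to be reproducible, one option is to pass an explicitly seeded random number generator to `shuffle`. The seed value 42 below is arbitrary, and the sketch shuffles a plain index range rather than the notebook's dataframe.

```julia
using Random

rng = MersenneTwister(42)      # seeded generator; the seed is an arbitrary choice
perm = shuffle(rng, 1:5)       # a random permutation of the row indices 1..5

# Re-creating the generator with the same seed gives the same permutation
perm_again = shuffle(MersenneTwister(42), 1:5)
```

The permutation can then be used as a row index, as in `dfc[perm, :]` for a dataframe with five rows.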


### Sorting:

Sorting allows you to order the data in different ways. Typically, you'll want to order the data with respect to a variable in the dataframe. The functions you can use for sorting are `sort` or `sort!`. The latter is the in-place version of the former.

In [51]:
df = DataFrame(A = [0, 0, 1, 1, 0, 0, 1], B = [0, 1, 0, 0, 1, 0, 1], 
               C = [0, 1, 1, 1, 0, 1, 1], D = [1, 2, 3, 4, 2, 1, 0])

Unnamed: 0_level_0,A,B,C,D
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,0,0,0,1
2,0,1,1,2
3,1,0,1,3
4,1,0,1,4
5,0,1,0,2
6,0,0,1,1
7,1,1,1,0


To check if a dataframe is sorted by columns you can use the __issorted__ function:

In [52]:
issorted(df)

false

In [53]:
issorted(df, :B) #checks if dataframe is sorted by column B

false

Simply calling sort on your dataframe will sort the dataframe using __all__ columns of data. Specifically, it will sort by the first column, then by the second column, then by the third column, etc. In this case it sorts by <i>A</i>, <i>B</i>, <i>C</i>, and then <i>D</i>. By default the sorting is done in ascending order.

In [54]:
sort(df)

Unnamed: 0_level_0,A,B,C,D
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,0,0,0,1
2,0,0,1,1
3,0,1,0,2
4,0,1,1,2
5,1,0,1,3
6,1,0,1,4
7,1,1,1,0


To sort by specified columns you can pass the column names as parameters to the sort function. For example, here we just sort by column <i>C</i>.

In [55]:
sort(df, :C)

Unnamed: 0_level_0,A,B,C,D
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,0,0,0,1
2,0,1,0,2
3,0,1,1,2
4,1,0,1,3
5,1,0,1,4
6,0,0,1,1
7,1,1,1,0


If we wanted to order first by column <i>C</i> and then by column <i>A</i>, we would specify both variables in an array.

In [56]:
sort(df, [:C, :A])

Unnamed: 0_level_0,A,B,C,D
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,0,0,0,1
2,0,1,0,2
3,0,1,1,2
4,0,0,1,1
5,1,0,1,3
6,1,0,1,4
7,1,1,1,0


As mentioned above, sorting by default is done in ascending order. However, you can change this via the __rev__ keyword argument. For example, if we wanted to sort **df** by column <i>D</i> in descending order we would set `rev` to true.

In [57]:
sort(df, :D, rev=true)

Unnamed: 0_level_0,A,B,C,D
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,0,1,4
2,1,0,1,3
3,0,1,1,2
4,0,1,0,2
5,0,0,0,1
6,0,0,1,1
7,1,1,1,0


When sorting by two or more columns, you can specify the direction for each column separately by giving `rev` one value per column. Here we'll sort by the variables <i>D</i> and <i>C</i>, with <i>D</i> in descending order and <i>C</i> in ascending order.

In [58]:
sort(df, [:D, :C], rev = [true, false])

Unnamed: 0_level_0,A,B,C,D
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,1,0,1,4
2,1,0,1,3
3,0,1,0,2
4,0,1,1,2
5,0,0,0,1
6,0,0,1,1
7,1,1,1,0
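
An alternative way to express a per-column direction is the `order` helper exported by DataFrames, which attaches the sorting option to a single column. The sketch below rebuilds the same data under the hypothetical name `df_demo` so that it runs on its own.

```julia
using DataFrames

# Same values as df above, under a separate (hypothetical) name.
df_demo = DataFrame(A = [0, 0, 1, 1, 0, 0, 1], B = [0, 1, 0, 0, 1, 0, 1],
                    C = [0, 1, 1, 1, 0, 1, 1], D = [1, 2, 3, 4, 2, 1, 0])

# order(:D, rev = true) marks only D as descending;
# C keeps the default ascending order.
res = sort(df_demo, [order(:D, rev = true), :C])
```

This keeps the direction next to the column it applies to, which can be easier to read than a separate `rev` argument when many columns are involved.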


### Categorical variables:

In [59]:
df = DataFrame(A = [0, 0, 1, 1, 0, 0, 1], B = [0, 1, 0, 0, 1, 0, 1], 
               C = [0, 1, 1, 1, 0, 1, 1], D = [1, 2, 3, 4, 2, 1, 0])

Unnamed: 0_level_0,A,B,C,D
Unnamed: 0_level_1,Int64,Int64,Int64,Int64
1,0,0,0,1
2,0,1,1,2
3,1,0,1,3
4,1,0,1,4
5,0,1,0,2
6,0,0,1,1
7,1,1,1,0


Some variables in a dataframe take values that represent categories, for example nominal or ordinal variables.

Let's add a categorical variable called <i>t</i> to the **df** dataframe.

In [60]:
df.t = ["High", "High", "High", "Low", "Low", "Medium", "Medium"];

The <i>t</i> variable is stored as an array with element type **String**, so there is no notion of it having levels or being categorical.

In [61]:
typeof(df.t)

Array{String,1}

In Julia you can work with categorical types via the __CategoricalArrays__ package.

In [62]:
using CategoricalArrays

If you have an existing dataframe and want to convert a variable to a categorical one, you can use the `categorical!` function, passing the dataframe and the name of the variable to convert. The non-in-place version of this operation is `categorical`.

In [63]:
categorical!(df, :t)

Unnamed: 0_level_0,A,B,C,D,t
Unnamed: 0_level_1,Int64,Int64,Int64,Int64,Cat…
1,0,0,0,1,High
2,0,1,1,2,High
3,1,0,1,3,High
4,1,0,1,4,Low
5,0,1,0,2,Low
6,0,0,1,1,Medium
7,1,1,1,0,Medium


In [64]:
typeof(df.t)

CategoricalArray{String,1,UInt32,String,CategoricalValue{String,UInt32},Union{}}
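
Newer DataFrames releases may no longer provide `categorical!`, so it can be useful to know a conversion that relies only on CategoricalArrays: call `categorical` on the column vector and assign the result back. A minimal sketch, assuming a hypothetical dataframe `demo`:

```julia
using DataFrames, CategoricalArrays

# Hypothetical example data with a plain String column.
demo = DataFrame(t = ["High", "Low", "Medium", "Low"])

# Replace the String column with a categorical version of itself.
demo.t = categorical(demo.t)

# The column is now a CategoricalArray, and each distinct value is a level.
is_cat = demo.t isa CategoricalArray
lv = levels(demo.t)
```

For string data the levels are sorted alphabetically by default, which is why "High" comes before "Low" here.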

You can see what the levels of a categorical variable are:

In [65]:
levels(df.t)

3-element Array{String,1}:
 "High"  
 "Low"   
 "Medium"

We can also verify that this categorical variable has no ordering:

In [66]:
isordered(df.t)

false

If we want, we can impose an ordering on the variable <i>t</i>, e.g. Low < Medium < High. By default the ordering follows the current output of the `levels` command, i.e. High < Low < Medium. To get the former, we can reorder the levels with the `levels!` function. The first argument is the categorical variable and the second is the desired order of the levels.

In [67]:
levels!(df.t, ["Low", "Medium", "High"]);

In [68]:
levels(df.t)

3-element Array{String,1}:
 "Low"   
 "Medium"
 "High"  

Now we can use the `ordered!` function to impose an ordering on the <i>t</i> column variable. The resulting ordering follows the order returned by the `levels` command.

In [69]:
ordered!(df.t, true);

In [70]:
isordered(df.t)

true

Now, with respect to the <i>t</i> column variable, Low < Medium < High. The comparison below therefore returns false, since row 1 holds "High" and row 4 holds "Low".

In [71]:
df.t[1]

CategoricalValue{String,UInt32} "High" (3/3)

In [72]:
df.t[4]

CategoricalValue{String,UInt32} "Low" (1/3)

In [73]:
df.t[1] < df.t[4]

false
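
Once the levels are ordered, functions that rely on comparisons follow the level order rather than alphabetical order. The sketch below ties this back to the sorting section, using a hypothetical dataframe `demo` rather than the notebook's **df**.

```julia
using DataFrames, CategoricalArrays

# Hypothetical example data with a categorical column.
demo = DataFrame(id = 1:4, t = categorical(["High", "Low", "Medium", "Low"]))

levels!(demo.t, ["Low", "Medium", "High"])   # set the level order explicitly
ordered!(demo.t, true)                       # allow < and > comparisons

# "Low" (row 2) compares as smaller than "High" (row 1).
low_before_high = demo.t[2] < demo.t[1]

# minimum and sort both use the level order.
lowest = minimum(demo.t)
by_level = sort(demo, :t)
```

Here `by_level` lists the "Low" rows first, then "Medium", then "High", which an alphabetical sort of the strings would not do.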

In this lesson we covered:
* Column operations such as creating new columns, selecting columns, renaming columns, etc.
* Row operations including subsetting rows, filtering, and identifying duplicate rows.
* Sorting dataframes.
* Categorical data.