# II. Working with Dataframes

In this notebook, we'll start doing basic operations on dataframes. Specifically, we'll look at:

* Column operations
* Row operations
* Sorting
* Categorical variables

In [None]:
using DataFrames, Dates, CSV

In [None]:
# load csv file

df = CSV.read("fakedata.csv")

The `first` and `last` functions can be used to examine the first and last few rows of the dataframe.

In [None]:
first(df,7)

In [None]:
last(df,7)

You may want basic information about the dataframe, such as its number of rows and columns. The `size` function returns both; here it tells us there are 20 rows and 5 columns in the dataframe.

In [None]:
nr, nc = size(df)

The `describe` function can be used to get summary statistics on the dataframe. The object returned is also a dataframe.

In [None]:
describe(df)

We can restrict `describe` to certain variables of interest and have it return only certain statistics.

In [None]:
summstats = describe(df[:,[:v1, :v2]], :eltype, :min, :max, :median)

Since <i>summstats</i> is itself a dataframe you can work with it as you would any other dataframe.

### Column operations:

We've seen the `size` function to get the dimensions of the dataframe. There is an `ncol` function if you just want the number of columns:

In [None]:
nc = ncol(df)

Add a column using the [ ] notation and by providing a name for the column along with values using the assignment operator. Here we'll add a new column named <i>RandStr</i> to **df** that is just an array of random strings of length 10.

In [None]:
using Random 
df[!,:RandStr] = [randstring(10) for j in 1:nr];
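
The difference between `!` and `:` as the row index is easy to miss. The sketch below uses a hypothetical toy dataframe (the names `toy`, `labels`, and `col_copy` are placeholders) and assumes a recent DataFrames.jl release: assigning with `!` stores the vector itself, while reading with `:` returns a copy.

```julia
using DataFrames

# Hypothetical toy dataframe; the column names are placeholders.
toy = DataFrame(a = [1, 2, 3])

# Assigning with `!` stores the vector itself as the new column (no copy is made).
labels = ["x", "y", "z"]
toy[!, :label] = labels
same_object = toy.label === labels

# Reading with `:` returns a fresh copy of the column.
col_copy = toy[:, :a]
col_copy[1] = 100      # modifies only the copy, not the dataframe
```
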

In [None]:
df

You can do operations on existing columns in the dataframe to create new columns. We know that **v4** is a Date type so we can use Julia's <i>Dates</i> package to work with these types.

In [None]:
eltype(df.v4)

Let's use the `day` function to add a new column that calculates the day of the month for each date value in **v4**.

In [None]:
df.v4_day =  day.(df.v4);

We can create a new column variable calculated based on other columns in the dataframe and insert this into a specific location in the dataframe using `insertcols!`. This is an in-place operation.

The first argument to `insertcols!` is the name of the data frame, the second argument is the column index number where you want the new column to be placed, and the last argument specifies the data values of the new column. The non in-place version is `insertcols`.

In [None]:
insertcols!(df, 6, elapsed_days = df.v5 - df.v4)
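
Recent DataFrames.jl releases also accept the new column as a `name => values` pair. The following is a minimal sketch on a hypothetical two-row dataframe (`toy`, `start`, and `stop` are made-up names), not the notebook's data:

```julia
using DataFrames, Dates

# Hypothetical start/stop dates standing in for v4 and v5.
toy = DataFrame(start = [Date(2020, 1, 1), Date(2020, 2, 1)],
                stop  = [Date(2020, 1, 11), Date(2020, 3, 1)])

# Insert the new column at position 1, naming it with `name => values`.
insertcols!(toy, 1, :elapsed => toy.stop .- toy.start)

names(toy)   # the new column now comes first
```
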

You can accomplish the same thing using `transform!` or `transform` for the in-place and non in-place operations, respectively.

The first argument is the dataframe; the second argument is a chain of pairs of the form `source columns => function => new column name`. The first part lists the columns of the dataframe to operate on. The second part wraps an anonymous function in `ByRow` so that it is applied row by row, with the values of the listed columns as its inputs. The last part names the column that receives the result, here __elapsed_days_trans__.

In [None]:
transform(df, [:v5, :v4] => ByRow( (a,b) -> a - b ) => :elapsed_days_trans)
# transform(df, [:v5, :v4] => ByRow( - ) => :elapsed_days_trans)
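
To make the `source columns => function => new column name` pattern concrete, here is a small sketch on a hypothetical numeric dataframe (`toy`, `x`, `y`, and `total` are illustrative names). It also shows that `ByRow` can be dropped when the function already works on whole columns.

```julia
using DataFrames

toy = DataFrame(x = [1, 2, 3], y = [10, 20, 30])

# source columns => row-wise function => name of the output column
out = transform(toy, [:x, :y] => ByRow((a, b) -> a + b) => :total)

# Without ByRow the function receives the two columns as whole vectors;
# `+` on two vectors adds them element-wise, so the result is the same.
out2 = transform(toy, [:x, :y] => (+) => :total)
```

Because `transform` is the non in-place version, `toy` still has only its two original columns afterwards.
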

Note the type of <i>df.elapsed_days</i>. It's an array whose elements are of type **Day** (not **Int64**). **Day** is a type provided by the <i>Dates</i> package.

In [None]:
typeof(df.elapsed_days)

You can also create a new column by mapping an arbitrary function over another column. Here let's use Julia's `map` function to create a new column called <i>elapsed_hours</i> equal to <i>elapsed_days</i> converted into hours.

In [None]:
df[!, :elapsed_hours] = map(t -> Dates.Hour(t), df[:, :elapsed_days]);

# You could also do df[!, :elapsed_hours] =  Dates.Hour.(df.elapsed_days);
# You could also do df[!, :elapsed_hours] = map(t -> Dates.Hour(t), df.elapsed_days);
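
The unit conversion itself can be tried without a dataframe. This sketch uses only the `Dates` standard library and hypothetical values; it uses `convert`, one way to express the exact Day-to-Hour conversion, and `Dates.value` to strip the unit when plain integers are needed.

```julia
using Dates

elapsed = [Day(1), Day(3)]          # hypothetical elapsed-day values

hours = convert.(Hour, elapsed)     # Day -> Hour is an exact conversion
raw   = Dates.value.(hours)         # plain Int64 values without the unit
```
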

In [None]:
df

As you likely noticed, not all columns were displayed. You can use the __show__ command to see all column variables for a few rows.

__By default Jupyter Notebook will limit the number of rows and columns when displaying a data frame to roughly fit the screen size.__

In [None]:
show(df, allcols=true)

If you want to drop a column you can use the `select!` function along with the `Not` selector. This is an in-place operation and therefore will modify the dataframe. Use `select` for the non in-place version of this operation.

We'll drop the <i>elapsed_hours</i> and <i>v5</i> columns from **df**.

In [None]:
select!(df, Not([:elapsed_hours, :v5]));

In [None]:
names(df)

The `select!` command can also be used to keep a specific subset of columns:

In [None]:
select!(df, [:v1, :v2, :v3, :v4, :elapsed_days, :RandStr, :v4_day])
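
When experimenting, the non in-place `select` is safer because the original dataframe is left alone. A minimal sketch on a hypothetical three-column dataframe (all names are placeholders):

```julia
using DataFrames

toy = DataFrame(a = 1:2, b = 3:4, c = 5:6)

dropped = select(toy, Not([:b, :c]))   # everything except b and c
kept    = select(toy, [:c, :a])        # only c and a, in that order

names(toy)                             # toy is unchanged
```
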

To change the name of a column variable you can use the `rename!` function. We'll change the column name of <i>elapsed_days</i> to be <i>elap_days</i> and <i>v3</i> to be <i>Category</i>. The `rename` function is the non in-place version of this operation.

In [None]:
rename!(df, :elapsed_days => :elap_days, :v3 => :Category);

In [None]:
names(df)

A simple way to reorder the columns is by indexing: list the column positions (or names) in the desired order. Here we make <i>Category</i>, currently the third column, the first variable in the dataframe. Indexing returns a new dataframe, so **df** itself is unchanged.

In [None]:
df[:, [3, 1, 2, 4, 5, 6, 7]]

In [None]:
df
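
Reordering by position, by name, and with `select` should all give the same result. The sketch below assumes a recent DataFrames.jl release and a hypothetical dataframe `toy`; the `select(toy, :c, :)` form means "column c first, then every remaining column".

```julia
using DataFrames

toy = DataFrame(a = 1:2, b = 3:4, c = 5:6)

by_position = toy[:, [3, 1, 2]]        # positions in the desired order
by_name     = toy[:, [:c, :a, :b]]     # the same order, written with names
front       = select(toy, :c, :)       # c first, then all other columns
```
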

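You can reorder the columns in place using the `permutecols!` function. Calling `select!` with the columns listed in the desired order, as we did above when keeping a subset of columns, also reorders them in place.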

The last function we will look at in regard to working with columns is `eachcol`. This is a function that allows you to iterate over the columns of your dataframe: 

`eachcol(df, bool)`

The first argument to `eachcol` is the name of the dataframe whose columns you want to iterate over and the second is a boolean. If you specify true for the boolean argument then for each column you get a __Pair__ whose first element is the column name (symbol) and second element is the corresponding column values.

We can use `eachcol` to pick out columns (based on their type) and perform an operation on each of them. Here we calculate basic summary statistics **only** for columns in the **df** dataframe whose element type is a subtype of <i>Real</i>.

In [None]:
using Statistics

for col in eachcol(df, true)
    if (eltype(col[2]) <: Real) 
        println("Variable ", col[1], "\n  ", "Mean: ", round(mean(col[2]), digits=3), 
            "\n  Max: ", round(maximum(col[2]), digits=1),
            "\n  Min: ", round(minimum(col[2]), digits=1),
            "\n  Range: ", round(maximum(col[2]) - minimum(col[2]), digits=1)
        )
    end
end
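
In recent DataFrames.jl releases the name/column pairs are obtained with `pairs(eachcol(df))` rather than a boolean argument. The following sketch applies the same type-based selection to a hypothetical dataframe (`toy` and its columns are made up) and collects the means in a dictionary instead of printing them.

```julia
using DataFrames, Statistics

toy = DataFrame(id = ["r1", "r2", "r3"], x = [1.0, 2.0, 6.0], n = [3, 5, 4])

means = Dict{Symbol, Float64}()
for (name, col) in pairs(eachcol(toy))
    # Skip columns whose elements are not real numbers (here: the String column).
    if eltype(col) <: Real
        means[name] = mean(col)
    end
end
```
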

### Row operations:

In [None]:
df = CSV.read("fakedata.csv")

If you want the number of rows in your dataframe there is an `nrow` function for that:

In [None]:
nr = nrow(df)

When indexing into a dataframe the first index indicates the rows and the second the columns. **Indexing a dataframe works as you expect**. The colon indicates all elements or items along the dimension.

For example, if we wanted to "subset" all rows and all columns:

In [None]:
df[:, :]

If we only wanted rows with a certain index we would specify that in the first argument. Here we get rows 5 through 9.

In [None]:
df[5:9, :]

You can make a copy of the dataframe via the `copy` command:

In [None]:
dfc = copy(df)

You can change the value of an entry by specifying the index position and the new value. Here we change the value for the _v1_ variable in row 2 to 4.

In [None]:
dfc[2, :v1] = 4

Note when modifying values the new values need to be of a valid type. For example _v1_ is of type __Int64__ so the assigned value needs to be an __Int64__ or something that can be converted into an __Int64__. Trying to set the value to, for example, a __String__ won't work:

In [None]:
dfc[2, :v1] = "four"  # throws an error: a String cannot be converted to an Int64

You can update slices as well, as long as the types and dimensions match.

In [None]:
first(dfc, 10)

This will update the first five rows of the _v1_ and _v2_ columns:

In [None]:
dfc[1:5, [:v1, :v2]] = [[1, 2, 3, 6, 5] randn(5,1)];

In [None]:
dfc

Suppose we want to subset rows based on the value of a column variable. For example, to get all rows of data where <i>v3</i> has the value "a":

In [None]:
dfc[dfc.v3 .== "a", :]

You can use Julia's regular expression support to match arbitrary patterns in string values. For example, we can get only the rows of data where the _v6_ variable contains "XG". One way to do this is to use the `match` function to find the entries of _v6_ that contain "XG".

In [None]:
mymatch = match.(r"XG", dfc.v6)

We can see that __mymatch__ is a one dimensional Array and the entries in __mymatch__ corresponding to non-matches of the regular expression "XG" have a value of `nothing`. We can use this to select the rows of the dataframe that correspond to the rows of __mymatch__ that _do not_ equal `nothing` (i.e. match the expression "XG").

In [None]:
dfc[mymatch .!== nothing,:]
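
If only a true/false mask is needed, `occursin` gives it directly, without the intermediate `nothing` values. The sketch below compares both approaches on a hypothetical vector of strings (`codes` is a made-up stand-in for a string column).

```julia
codes = ["AXG1", "BBB2", "XGXG", "none"]   # hypothetical string column

# match returns a RegexMatch or nothing, so compare against nothing ...
mask_match = match.(r"XG", codes) .!== nothing

# ... whereas occursin returns a Bool for each entry directly.
mask_occursin = occursin.(r"XG", codes)
```

Either mask can then be used as the row index, as in `dfc[mask, :]`.
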

In [None]:
dfc

Here is an example of selecting rows based on a date value. Here we select rows where the value of the date for the _v4_ variable is after 7/27/2015.

In [None]:
dfc[dfc.v4 .> Date(2015,7,27), :]

You can select rows based on multiple variables. For example, suppose we want the rows of data where <i>v3</i> has the value "a" and <i>v2</i> is greater than 0, keeping just the columns <i>v1</i>, <i>v2</i>, and <i>v3</i>:

In [None]:
dfc[(dfc.v3 .== "a") .& (dfc.v2 .> 0), [:v1, :v2, :v3]]
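
The parentheses around each condition matter: in Julia `&` binds more tightly than comparison operators such as `==` and `>`, so each comparison has to be grouped before the masks are combined. A small sketch with hypothetical vectors standing in for the two columns:

```julia
v3 = ["a", "b", "a", "a"]        # hypothetical categorical-like column
v2 = [0.5, 1.0, -2.0, 3.0]       # hypothetical numeric column

# Each comparison yields a Bool vector; .& combines them element-wise.
mask = (v3 .== "a") .& (v2 .> 0)
```
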

You can also use Julia's built-in `filter` function to subset data. You can use either `filter` or `filter!`. The former will return a copy of the dataframe with rows that satisfy the filter; the latter will actually modify the dataframe in place __keeping__ only rows that satisfy the filter.

The first argument will be the filter itself almost always expressed as an anonymous function and the second argument will just be the name of the dataframe the filter should be applied to.

Here we keep the rows where the value of <i>v1</i> is greater than 1.

In [None]:
filter(row -> row[:v1] > 1, dfc)
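
Recent DataFrames.jl releases also accept a `column => predicate` pair, which passes only that column's values to the predicate instead of a whole row. A sketch on a hypothetical dataframe (`toy` is a placeholder) showing that both forms select the same rows:

```julia
using DataFrames

toy = DataFrame(v1 = [1, 2, 3, 1], v3 = ["a", "b", "a", "c"])

by_row = filter(row -> row.v1 > 1, toy)   # predicate sees each row
by_col = filter(:v1 => >(1), toy)         # predicate sees each v1 value only
```
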

If you need to delete rows you can use the `deleterows!` function and provide the indices of the rows to delete.

In [None]:
deleterows!(dfc, [3, 15])
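
A non in-place way to drop rows is to index with `Not`, which returns a new dataframe without the listed rows. This is a sketch on a hypothetical five-row dataframe, not a replacement for the call above:

```julia
using DataFrames

toy = DataFrame(id = 1:5)

trimmed = toy[Not([2, 4]), :]   # every row except rows 2 and 4
```
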

A more useful thing might be to get rid of duplicate rows. The first occurrence of the duplicate row is kept in the dataframe while all others are regarded as duplicates and removed.

To this end you can use either the `unique` or `unique!` functions. The former returns a copy of the dataframe with deleted duplicate rows while the latter is an in-place operation.

`unique(df, cols)` <br/>
`unique!(df, cols)`

The **cols** argument is optional. If you specify the **cols** argument the comparison will be made using only the specified columns. If you don't specify the **cols** argument then the comparison will be made using all columns which means only rows with identical values across all columns will be matches.

We can see in our dataframe above that when only considering columns <i>v1</i> and <i>v3</i> there are duplicate rows. For example, there are many rows with <i>v1</i> equal to 1 and <i>v3</i> equal to "a", or with <i>v1</i> equal to 2 and <i>v3</i> equal to "a".

To get rid of duplicate rows on columns <i>v1</i> and <i>v3</i>:

In [None]:
unique(dfc, [:v1, :v3])

The `nonunique` function will tell you if a given row is a duplicate of any row **before** it.

In [None]:
dfc

In [None]:
nonunique(dfc, [:v1, :v3])

You can use the `findall` function to find the row indices that correspond to duplicates of some previous row. The first argument to `findall` is a function that returns __true__ or __false__, and `findall` returns the indices of the second argument for which that function returns __true__.

In [None]:
findall(row -> row == true, nonunique(dfc, [:v1, :v3]))
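
How `nonunique`, `findall`, and `unique` fit together is easiest to see on a tiny example. The dataframe below is hypothetical; because `nonunique` already returns a vector of Booleans, `findall` can be applied to it directly.

```julia
using DataFrames

toy = DataFrame(v1 = [1, 2, 1, 2, 1], v3 = ["a", "a", "a", "b", "a"])

dup_mask = nonunique(toy, [:v1, :v3])   # true where an earlier row has the same (v1, v3)
dup_rows = findall(dup_mask)            # indices of the duplicate rows
deduped  = unique(toy, [:v1, :v3])      # first occurrence of each (v1, v3) pair
```
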

Lastly, if you need to randomly reorder the rows of data you can use the __shuffle__ function.

In [None]:
using Random

dfc[shuffle(1:nrow(dfc)), :]
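
If the shuffled order needs to be reproducible, one option is to pass an explicitly seeded random number generator to `shuffle`. The seed value and variable names below are arbitrary choices for illustration.

```julia
using Random

# Two generators created with the same seed produce the same permutation.
perm  = shuffle(MersenneTwister(42), 1:5)
again = shuffle(MersenneTwister(42), 1:5)
```

The resulting permutation can be used as the row index in the same way as above, e.g. `dfc[perm, :]` for a dataframe with five rows.
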

### Sorting:

Sorting allows you to order the data in different ways. Typically, you'll want to order the data with respect to a variable in the dataframe. The functions you can use for sorting are `sort` or `sort!`. The latter is the in-place version of the former.

In [None]:
df = DataFrame(A = [0, 0, 1, 1, 0, 0, 1], B = [0, 1, 0, 0, 1, 0, 1], 
               C = [0, 1, 1, 1, 0, 1, 1], D = [1, 2, 3, 4, 2, 1, 0])

To check if a dataframe is sorted by columns you can use the __issorted__ function:

In [None]:
issorted(df)

In [None]:
issorted(df, :B) #checks if dataframe is sorted by column B

Simply calling `sort` on your dataframe will sort the dataframe using __all__ columns of data. Specifically, it will sort by the first column, then by the second column, then by the third column, etc. In this case it sorts by <i>A</i>, <i>B</i>, <i>C</i>, and then <i>D</i>. By default the sorting is done in ascending order.

In [None]:
sort(df)

To sort by specified columns you can pass the column names as parameters to the sort function. For example, here we just sort by column <i>C</i>.

In [None]:
sort(df, :C)

If we wanted to order first by column <i>C</i> and then by column <i>A</i>, we would specify both variables in an array.

In [None]:
sort(df, [:C, :A])

As mentioned above, sorting by default is done in ascending order. However, you can change this via the __rev__ keyword argument. For example, if we wanted to sort **df** by column <i>D</i> in descending order we would set `rev` to true.

In [None]:
sort(df, :D, rev=true)

If sorting by two or more column variables you can specify how each column should be sorted, i.e. ascending or descending. Here we'll sort by the variables <i>D</i> and <i>C</i>, but <i>D</i> will be sorted in descending order and <i>C</i> in ascending order.

In [None]:
sort(df, (:D, :C), rev = (true, false))
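
Another way to mix sort directions is the `order` helper exported by DataFrames, which attaches options to a single column. The sketch below reuses the <i>C</i> and <i>D</i> values from above in a separate, hypothetically named dataframe `df2`.

```julia
using DataFrames

df2 = DataFrame(C = [0, 1, 1, 1, 0, 1, 1], D = [1, 2, 3, 4, 2, 1, 0])

# D descending first, then C ascending to break ties within equal D values.
sorted = sort(df2, [order(:D, rev = true), :C])
```
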

### Categorical variables:

In [None]:
df = DataFrame(A = [0, 0, 1, 1, 0, 0, 1], B = [0, 1, 0, 0, 1, 0, 1], 
               C = [0, 1, 1, 1, 0, 1, 1], D = [1, 2, 3, 4, 2, 1, 0])

Some of the variables in a dataframe may take on values that represent categories, for example nominal or ordinal variables.

Let's add a categorical variable called <i>t</i> to the **df** dataframe.

In [None]:
df.t = ["High", "High", "High", "Low", "Low", "Medium", "Medium"];

As you can see the <i>t</i> variable is an array with element type **String**. So there is no notion of this variable having levels or being categorical.

In [None]:
typeof(df.t)

In Julia you can work with categorical types via the __CategoricalArrays__ package.

In [None]:
using CategoricalArrays

If you have an existing dataframe and want to convert a variable to be categorical you can use the `categorical!` function passing the name of the dataframe and variable to be converted. The non in-place version of this operation is `categorical`.

In [None]:
categorical!(df, :t)

In [None]:
typeof(df.t)

You can see what the levels of a categorical variable are:

In [None]:
levels(df.t)

We can also verify that this categorical variable has no ordering:

In [None]:
isordered(df.t)

If we want we can impose an ordering on the variable <i>t</i>, e.g. Low < Medium < High. The default ordering is based on the current output of the `levels` command, i.e. the ordering will be High < Low < Medium. If we want the former we can change the order using the `levels!` function. The first argument is the categorical variable and the second is the desired ordering.

In [None]:
levels!(df.t, ["Low", "Medium", "High"]);

In [None]:
levels(df.t)

Now we can use the `ordered!` function to impose an ordering on the <i>t</i> column variable. The result of the `ordered!` function is based on the output of the `levels` command.

In [None]:
ordered!(df.t, true);

In [None]:
isordered(df.t)

Now with respect to the <i>t</i> column variable, Low < Medium < High.

In [None]:
df.t[1]

In [None]:
df.t[4]

In [None]:
df.t[1] < df.t[4]
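
The same steps can be tried on a standalone categorical vector, without a dataframe. This sketch uses only CategoricalArrays and the values from above (the variable names are placeholders); it shows that once the levels are reordered and marked as ordered, both comparisons and sorting follow Low < Medium < High.

```julia
using CategoricalArrays

t = categorical(["High", "High", "High", "Low", "Low", "Medium", "Medium"])
default_levels = copy(levels(t))          # alphabetical by default

levels!(t, ["Low", "Medium", "High"])     # set the desired level order
ordered!(t, true)                         # allow < and > comparisons

low_first = sort(t)                       # sorts by level order, not alphabetically
```
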

In this lesson we covered:
* Column operations such as creating new columns, selecting columns, renaming columns, etc.
* Row operations including subsetting rows, filtering, and identifying duplicate rows.
* Sorting dataframes.
* Categorical data.