-
Notifications
You must be signed in to change notification settings - Fork 2
/
dplyr_parsing.Rmd
117 lines (103 loc) · 3.81 KB
/
dplyr_parsing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
title: "dplyr nettbank parsing"
author: "Tim Knutsen"
date: "22.1.2016"
output: html_document
---
Here we will use dplyr and other packages to expolure Tim's financial expenses.
First we load the packages we need.
All packages are developed by Hadley Wickham https://github.com/hadley
![hadley](https://avatars3.githubusercontent.com/u/4196?v=3&s=400)
```{r load pckgs}
## if not installed, install and load.
if(!require(lubridate)) install.packages("lubridate")
if(!require(stringr)) install.packages("stringr")
if(!require(tidyr)) install.packages("tidyr")
if(!require(ggplot2)) install.packages("ggplot2")
if(!suppressPackageStartupMessages(require(dplyr))) install.packages("dplyr")
```
Read the data with read.table. `fill = TRUE` will fill in blank spaces in the data were needed. Also make data frame into a dplyr data.fram which is easier to work with.
```{r}
felleskonto <- read.delim(file = "dplyr_nettbank/gjensidige.txt",stringsAsFactors = FALSE,
fill = T, header = T, dec = ",", na.strings = "") %>% tbl_df()
## Have a look at the data
head(felleskonto)
## Will first select only the columns we need, and rename them to English while we are at it.
felleskonto_subset <- select(felleskonto,
Date = Bokfort,
Text_code = Tekstkode,
Description = Beskrivelse,
Out = Ut_av_konto,
In = Inn_pa_konto)
## Look again
felleskonto_subset
```
A dplyr data_frame will only print first 10 rows if you print it, and also specify the type of data in each column.
## The data
This is my VISA card data from last year. To take advantage of the information, we need to format, extract and clean the it a bit.
```{r, fig.width=12}
## Reformat the date column to a date format.
felleskonto_subset$Date <- lubridate::dmy(felleskonto_subset$Date)
head(felleskonto_subset)
## Add a month column
felleskonto_subset <- mutate(felleskonto_subset,
Month = month(Date, label = TRUE),
new_date = Date + days(30))
## Set In and Out to numeric.
class(felleskonto_subset$In) <- "numeric"
class(felleskonto_subset$Out) <- "numeric"
## Summarize expenes and incomes per month and gather them in one column
in_out_summary <- felleskonto_subset %>%
group_by(Month) %>%
summarize(expences = sum(-Out, na.rm = T),
incomes = sum(In, na.rm = T)) %>%
gather(key = in_out, value = NOK, -Month)
## Nice and tidy
in_out_summary
## Lets plot it per month!
ggplot(in_out_summary, aes(Month, NOK, fill = in_out)) +
geom_bar(stat="identity", position = position_dodge()) +
theme(axis.text = element_text(size = 12, face = "bold")) +
scale_fill_brewer(type = "qual", palette = "Set1")
```
## Grouping
dplyr lets you summarize the data within groups in the data. Here, we sum yearly incomes and outcomes by grouping by `in_out`
```{r}
group_by(in_out_summary, in_out) %>%
summarize(sum = sum(NOK))
```
We are using more that we earn. :(
dplyr also has a very convenient counting function.
We use our card for buying stuff.
```{r}
count(felleskonto_subset, Text_code, sort = TRUE)
```
But what unique description sums up to the highest expence over the year?
```{r}
group_by(felleskonto_subset, Description) %>%
filter(Text_code == "VARER") %>%
count(Description, sort = T)
```
Need to get rid of the date in the beginning.
```{r}
felleskonto_subset$Description <- str_replace(felleskonto_subset$Description, "^[\\d]{2}.[\\d]{2}\\s", "")
## Try again
felleskonto_subset %>%
filter(Text_code == "VARER") %>%
count(Description, sort = T)
```
How much did I use?
```{r}
felleskonto_subset %>%
filter(Text_code == "VARER") %>%
group_by(Description) %>%
summarize(Sum_out_varer = sum(-Out, na.rm = TRUE))
```
Sort it by biggest sum
```{r}
felleskonto_subset %>%
filter(Text_code == "VARER") %>%
group_by(Description) %>%
summarize(Sum_out_varer = sum(-Out, na.rm = TRUE)) %>%
arrange(desc(Sum_out_varer))
```