# Tidyverse

The tidyverse is a set of packages that expand the core functionalities of R by making data wrangling
and visualization more friendly. 

Its best-known packages include dplyr, tidyr and ggplot2.

In this notebook, we'll try some functionalities of tidyverse, and some other related packages.

In [106]:
library(tidyverse)

### MBTI Dataset

This dataset comes from Kaggle. Each row holds a person's MBTI personality type along with age, gender, education, main interest and scores for four personality traits.

In [109]:
df <- read.csv('mbti_data.csv')
df[1:5,]

  Age Gender Education Introversion.Score Sensing.Score Thinking.Score Judging.Score Interest Personality
1  21 Female         1            5.89208      2.144395        7.32363      5.462224     Arts        ENTP
2  24 Female         1            2.48366      3.206188        8.06876      3.765012  Unknown        INTP
3  26 Female         1            7.02910      6.469302        4.16472      5.454442   Others        ESFP
4  30   Male         0            5.46525      4.179244        2.82487      5.080477   Sports        ENFJ
5  31 Female         0            3.59804      6.189259        5.31347      3.677984   Others        ISFP
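The dotted column names above (`Introversion.Score` and so on) are typical of base `read.csv`, which rewrites headers into syntactic R names. The following is a minimal sketch of the difference from readr's `read_csv`, which is loaded with the tidyverse. It assumes the raw file uses spaces in its headers, which the source does not confirm, and the tiny CSV is made up for illustration.

```r
library(readr)

# Hypothetical two-row file; headers with spaces are an assumption
path <- tempfile(fileext = ".csv")
writeLines(c("Age,Introversion Score,Personality",
             "21,5.89,ENTP",
             "24,2.48,INTP"), path)

base_df <- read.csv(path)                          # check.names = TRUE by default
tidy_df <- read_csv(path, show_col_types = FALSE)  # returns a tibble

names(base_df)  # the space is replaced by a dot
names(tidy_df)  # the header is kept as written

# Non-syntactic names need backticks in later code
tidy_df$`Introversion Score`
```

Either reader works with the rest of this notebook; the choice mostly affects how columns are named and printed.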

In [111]:
df_interest <- df %>%
    group_by(Interest) %>%
    summarise(Thinking = mean(Thinking.Score), Judging=mean(Judging.Score), Sensing=mean(Sensing.Score), Introversion=mean(Introversion.Score))

df_interest

# A tibble: 5 × 5
  Interest   Thinking Judging Sensing Introversion
  <chr>         <dbl>   <dbl>   <dbl>        <dbl>
1 Arts           5.35    5.37    5.77         4.58
2 Others         5.47    5.40    5.78         4.58
3 Sports         5.42    5.42    5.80         4.60
4 Technology     5.45    5.40    5.75         4.64
5 Unknown        5.42    5.38    5.79         4.58

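To check whether the personality scores follow a normal distribution, we apply the Anderson-Darling test (`ad.test` from the nortest package) to each score column. It is a common alternative to the Shapiro-Wilk test for large datasets, since R's `shapiro.test` only accepts between 3 and 5000 observations.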

In [113]:
library(nortest)

# Run the Anderson-Darling normality test on each score column
for (column in c("Thinking.Score", "Judging.Score", "Sensing.Score", "Introversion.Score")) {
  results <- ad.test(df[[column]])
  print(column)
  print(results)
}

[1] "Thinking.Score"

	Anderson-Darling normality test

data:  df[[column]]
A = 618.11, p-value < 2.2e-16

[1] "Judging.Score"

	Anderson-Darling normality test

data:  df[[column]]
A = 801.46, p-value < 2.2e-16

[1] "Sensing.Score"

	Anderson-Darling normality test

data:  df[[column]]
A = 742.06, p-value < 2.2e-16

[1] "Introversion.Score"

	Anderson-Darling normality test

data:  df[[column]]
A = 605.83, p-value < 2.2e-16



Note: the extremely low p-values provide strong evidence against normality for all four scores. We therefore do not treat the distributions as normal and apply non-parametric tests to the data.
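The sample-size limit that motivates the Anderson-Darling test can be checked directly in base R. This is a minimal sketch on simulated normal data (not the MBTI scores): `shapiro.test` accepts 5000 observations and raises an error for 5001.

```r
set.seed(42)
x_small <- rnorm(5000)   # largest sample size shapiro.test accepts
x_large <- rnorm(5001)   # one observation too many

small_result <- shapiro.test(x_small)

# Capture the error message instead of stopping the script
large_result <- tryCatch(
  shapiro.test(x_large),
  error = function(e) conditionMessage(e)
)

small_result$p.value
large_result
```

With samples this large, normality tests tend to flag even small departures from normality, so a visual check such as a Q-Q plot can be a useful complement.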

To check whether the scores differ significantly across interest groups, we apply the Kruskal-Wallis test to each score column, grouped by Interest.

It is a non-parametric test that compares the distribution of a continuous variable across two or more independent groups (it is typically used with three or more), which makes it suitable here since the scores are not normally distributed.

In [114]:
df$Interest <- as.factor(df$Interest)

for (column in c("Thinking.Score", "Judging.Score", "Sensing.Score", "Introversion.Score")) {
  
  print(column)
  
  df[[column]] <- as.numeric(df[[column]])
  
  result <- kruskal.test(as.formula(paste(column, "~ Interest")), data = df)
  print(result)
}


[1] "Thinking.Score"

	Kruskal-Wallis rank sum test

data:  Thinking.Score by Interest
Kruskal-Wallis chi-squared = 8.2643, df = 4, p-value = 0.08236

[1] "Judging.Score"

	Kruskal-Wallis rank sum test

data:  Judging.Score by Interest
Kruskal-Wallis chi-squared = 66.408, df = 4, p-value = 1.299e-13

[1] "Sensing.Score"

	Kruskal-Wallis rank sum test

data:  Sensing.Score by Interest
Kruskal-Wallis chi-squared = 8.0385, df = 4, p-value = 0.09018

[1] "Introversion.Score"

	Kruskal-Wallis rank sum test

data:  Introversion.Score by Interest
Kruskal-Wallis chi-squared = 2.1265, df = 4, p-value = 0.7125



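Note: Judging.Score is the only column with a significant p-value (at the usual 0.05 level). The other three provide only weak evidence that the interest groups differ in their score distributions, especially Introversion.Score. This suggests that, of the four scores, Judging is the most strongly associated with interest and Introversion the least. The group means shown earlier differ only slightly, however, so statistical significance here does not by itself imply a large effect.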
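A significant Kruskal-Wallis result says that at least one group differs, but not which one or by how much. The sketch below shows two common follow-ups from base R: an epsilon-squared effect size, computed as H / (n - 1), and pairwise Wilcoxon tests with Holm correction. The data are simulated (three groups, one shifted slightly), so the numbers say nothing about the MBTI dataset.

```r
# Synthetic data: three groups, one with a small shift (illustrative only)
set.seed(7)
toy <- data.frame(
  Interest = factor(rep(c("Arts", "Sports", "Technology"), each = 200)),
  Judging.Score = c(rnorm(200, 5.0), rnorm(200, 5.0), rnorm(200, 5.4))
)

kw <- kruskal.test(Judging.Score ~ Interest, data = toy)

# Epsilon-squared effect size: H / (n - 1), between 0 and 1
n <- nrow(toy)
epsilon_sq <- unname(kw$statistic) / (n - 1)

# Pairwise Wilcoxon rank-sum tests, Holm-adjusted for multiple comparisons
posthoc <- pairwise.wilcox.test(toy$Judging.Score, toy$Interest,
                                p.adjust.method = "holm")

kw$p.value
epsilon_sq
posthoc$p.value
```

On a large dataset, a tiny p-value paired with a small epsilon-squared would indicate a real but weak association.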

Note: These examples suggest that using statistical functions in R can be more convenient than in Python.

### ggplot2

ggplot2 is a plotting library whose syntax is derived from the book "The Grammar of Graphics". It lets you build plots intuitively by combining components.

In the example below, we build a distribution plot additively, one layer at a time, with ggplot2's syntax.

In [None]:
df %>%
  ggplot() +
  aes(x=Sensing.Score) +
  geom_density(alpha = 0.7) + 
  geom_histogram(aes(y=after_stat(density)), 
                    fill="cyan",
                    bins=9, 
                    alpha=0.8) +
  ggtitle("Distribution plot Example") + 
    theme(panel.background=element_rect("white"),
          plot.title = element_text(hjust = 0.5))
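To make the additive idea concrete, the sketch below stores intermediate plots in variables and counts their layers. It uses random values in place of `Sensing.Score`, and the variable names (`base_plot`, `with_hist`, `full_plot`) are illustrative.

```r
library(ggplot2)

# Synthetic scores standing in for Sensing.Score (random, not the real data)
set.seed(3)
toy <- data.frame(Sensing.Score = runif(500, 0, 10))

# Data and aesthetics only: no layers yet
base_plot <- ggplot(toy, aes(x = Sensing.Score))

# Each `+` returns a new plot object with one more component
with_hist <- base_plot +
  geom_histogram(aes(y = after_stat(density)),
                 bins = 9, fill = "cyan", alpha = 0.8)
full_plot <- with_hist +
  geom_density() +
  ggtitle("Distribution plot example")

length(base_plot$layers)
length(with_hist$layers)
length(full_plot$layers)
```

Because every intermediate object is a complete plot, a common base can be reused for several variants.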

### Further Examples

The examples below were taken from the book "Manual de Análise de Dados: Estatística e Machine Learning com Excel, SPSS, Stata, R e Python" by Luiz Paulo Fávero and Patrícia Belfiore. They are included here because they illustrate R's capabilities well.

In [115]:
library("e1071")
library("questionr")

In [116]:
load(file="Cotações.RData")
head(Cotações)

# A tibble: 6 × 1
  preço
  <dbl>
1  18.7
2  18.3
3  18.4
4  18.7
5  18.8
6  18.8

### Summary Statistical Functions

In [117]:
summary(Cotações$preço)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  16.90   18.18   18.50   18.48   18.80   19.90 

In [118]:
mean(Cotações$preço)

[1] 18.475

In [119]:
median(Cotações$preço)

[1] 18.5

In [120]:
quantile(Cotações$preço, 0.70)

 70% 
18.8 

In [121]:
sd(Cotações$preço)

[1] 0.6323515

In [122]:
var(Cotações$preço)

[1] 0.3998684

In [123]:
skewness(Cotações$preço) #from e1071 library

[1] -0.3118111

In [124]:
kurtosis(Cotações$preço) #from e1071 library

[1] 0.6628825
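For reference, e1071 uses its "type 3" definitions by default, and its `kurtosis` reports excess kurtosis (0 for a normal distribution). The sketch below writes both formulas out in base R; the helper names `skew_type3` and `kurt_type3` and the sample prices are made up for illustration.

```r
# Type 3 definitions (the e1071 default), written out in base R
skew_type3 <- function(x) {
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / sd(x)^3                  # sd() uses the n - 1 denominator
}

kurt_type3 <- function(x) {
  m4 <- mean((x - mean(x))^4)   # fourth central moment
  m4 / sd(x)^4 - 3              # subtract 3 to get excess kurtosis
}

# Made-up prices for illustration (not the Cotações data)
x <- c(16.9, 17.5, 18.1, 18.3, 18.5, 18.7, 18.8, 19.1, 19.9)
skew_type3(x)
kurt_type3(x)

# Cross-check against e1071 when it is installed
if (requireNamespace("e1071", quietly = TRUE)) {
  stopifnot(isTRUE(all.equal(skew_type3(x), e1071::skewness(x))))
  stopifnot(isTRUE(all.equal(kurt_type3(x), e1071::kurtosis(x))))
}
```

A negative skewness, as in the output above, indicates a longer left tail; a positive excess kurtosis indicates heavier tails than a normal distribution.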

### Plots

In [None]:
hist(Cotações$preço)

In [None]:
Cotações %>%
  ggplot(aes(x=preço)) + 

  geom_histogram(aes(y = after_stat(density)),  # `..density..` is deprecated since ggplot2 3.4.0
    color="grey50",
    fill="darkorchid",
    bins=7,
    alpha=0.6) + 
  
  stat_function(fun = dnorm, 
      args=list(mean=mean(Cotações$preço),
                sd = sd(Cotações$preço)),
      aes(color = "Curva Normal Teórica"),
      linewidth=2) +
  
  geom_density(linewidth=2,
    aes(color = "Curva KDE estimada")) +
  
  labs(x="Preço", y="Frequência") + 
  
  theme(panel.background = element_rect("white"),
    panel.grid=element_line("grey95"),
    panel.border=element_rect(NA),
    legend.position="bottom",
    plot.title=element_text(hjust=0.5, 
                            size=15)) +
  
  ggtitle("Histograma de preço com curva normal")

Note that the example above is fairly complex. It shows how flexible ggplot2 is and how many parameters it accepts.

In [125]:
stem(Cotações$preço)


  The decimal point is at the |

  16 | 9
  17 | 59
  18 | 11233455778889
  19 | 119



In [None]:
Cotações %>%
  ggplot(aes(y = preço, x = "")) +
  geom_boxplot(fill = "lightblue",
               alpha = 0.7,
               color = "black",
               outlier.colour = "red",
               outlier.shape = 15,
               outlier.size = 2.5) +
  labs(y = "Preço") +
  theme(panel.background = element_rect("white"),
        panel.grid = element_line("grey95"),
        panel.border = element_rect(NA),
        legend.position="none",
        plot.title = element_text(size=15)
        ) +
  ggtitle("Boxplot de preço com ggplot") +
  xlab("")

Once again, this example seems less convenient to build than its Python (seaborn) equivalent. Keep in mind, however, how many optional parameters it uses. A more practical example follows:

In [None]:
Cotações %>%
  ggplot(aes(y = preço, x = "")) +
  geom_boxplot(fill = "lightblue") +
  theme(panel.background = element_rect("white")) + 
  labs(y = "Preço") +
  ggtitle("Boxplot de preço com ggplot")

Note: seaborn still seems easier to use, especially when the user can combine it with matplotlib's flexibility.
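Part of the verbosity in the ggplot2 examples comes from repeating the same `theme()` settings. One way to reduce it, sketched here, is to store the styling in a variable and add it to each plot. The name `clean_theme` and the sample prices are illustrative, not taken from the book.

```r
library(ggplot2)

# Reusable styling, defined once
clean_theme <- theme(
  panel.background = element_rect(fill = "white"),
  panel.grid = element_line(colour = "grey95"),
  plot.title = element_text(hjust = 0.5, size = 15)
)

# Made-up prices standing in for Cotações$preço
toy <- data.frame(price = c(16.9, 17.5, 18.1, 18.3, 18.5, 18.7, 18.8, 19.1, 19.9))

# The plot itself now only describes data, geometry and labels
box_plot <- ggplot(toy, aes(x = "", y = price)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = NULL, y = "Price", title = "Boxplot of price") +
  clean_theme

box_plot
```

Calling `theme_set()` with a complete theme such as `theme_minimal()` is another option when the same look should apply to every plot in a session.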