In [1]:
data <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/lahman-batting.csv")

## nth observation
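The function nth() complements first() and last(): it returns the value at a given position within each group, where the second argument is the position to return.

For instance, you can retrieve the second yearID recorded for each team. Because nth() works on row position, this is the year found in the team's second row, which is not necessarily the team's second season.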

In [4]:
library(dplyr)
data %>%
group_by(teamID) %>%
summarise(second_game = nth(yearID, 2)) %>%
arrange(second_game)

`summarise()` ungrouping output (override with `.groups` argument)


teamID,second_game
BS1,1871
CH1,1871
CL1,1871
FW1,1871
NY2,1871
PH1,1871
RC1,1871
TRO,1871
WS3,1871
BL1,1872
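
To make the positional behaviour concrete, here is a minimal sketch on a small made-up data frame (the `games` table and its values are hypothetical, not taken from the batting data). It compares first(), nth() and last() with one possible way of getting the second distinct season.

```r
library(dplyr)

# Hypothetical toy data: team AAA has its seasons stored out of order
games <- data.frame(
  teamID = c("AAA", "AAA", "AAA", "BBB", "BBB"),
  yearID = c(1903, 1901, 1902, 1950, 1951)
)

by_position <- games %>%
  group_by(teamID) %>%
  summarise(
    first_row   = first(yearID),
    second_row  = nth(yearID, 2),                # second row as stored
    last_row    = last(yearID),
    second_year = nth(sort(unique(yearID)), 2)   # second distinct season
  )

print(by_position)
```

For team AAA, second_row is 1901 while second_year is 1902, which shows why sorting or de-duplicating first can matter.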


### Distinct number of observations
The function n() returns the number of observations in the current group. A closely related function is n_distinct(), which counts the number of unique values.

In the next example, you count the distinct players each team has fielded over all the seasons in the dataset.

In [5]:
data %>%
group_by(teamID) %>%
summarise(number_player = n_distinct(playerID)) %>%
arrange(desc(number_player))

`summarise()` ungrouping output (override with `.groups` argument)


teamID,number_player
CHN,2051
PHI,2036
SLN,1971
CIN,1912
PIT,1877
CLE,1855
BOS,1783
CHA,1744
DET,1658
NYA,1658
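
The difference between n() and n_distinct() is easiest to see on a tiny example. The sketch below uses a hypothetical `roster` table in which one player appears twice for the same team.

```r
library(dplyr)

# Hypothetical toy data: player "p1" has two rows for team AAA
roster <- data.frame(
  teamID   = c("AAA", "AAA", "AAA", "BBB"),
  playerID = c("p1", "p1", "p2", "p3")
)

counts <- roster %>%
  group_by(teamID) %>%
  summarise(
    rows    = n(),                   # every row in the group
    players = n_distinct(playerID)   # unique players only
  )

print(counts)
```

Team AAA has three rows but only two distinct players, so the two columns differ for that team.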


### Multiple groups
A summary statistic can be computed across several grouping variables at once.

In [6]:
data %>%
group_by(yearID, teamID) %>%
summarise(mean_games = mean(G)) %>%
arrange(desc(teamID), yearID)

`summarise()` regrouping output by 'yearID' (override with `.groups` argument)


yearID,teamID,mean_games
1884,WSU,19.784314
1891,WS9,33.710526
1886,WS8,29.918919
1887,WS8,54.571429
1888,WS8,47.269231
1889,WS8,40.206897
1884,WS7,22.880000
1875,WS6,13.368421
1873,WS5,23.400000
1872,WS4,7.071429


Code Explanation

* group_by(yearID, teamID): Group by year and team
* summarise(mean_games = mean(G)): Compute the mean number of games played per player in each year and team
* arrange(desc(teamID), yearID): Sort by team in descending order, then by year in ascending order; desc() accepts a single column, so each sort key is passed separately
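
The message printed by summarise() says the result is still grouped by yearID. As a sketch of how to control this, the example below uses the `.groups` argument on a hypothetical `stats` table; it assumes dplyr 1.0.0 or later, where `.groups` is available.

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
stats <- data.frame(
  yearID = c(2000, 2000, 2000, 2001),
  teamID = c("AAA", "AAA", "BBB", "AAA"),
  G      = c(10, 20, 30, 40)
)

by_year_team <- stats %>%
  group_by(yearID, teamID) %>%
  summarise(mean_games = mean(G), .groups = "drop") %>%  # return ungrouped data
  arrange(desc(teamID), yearID)                          # team descending, year ascending

print(by_year_team)
print(group_vars(by_year_team))  # empty: no grouping is left
```

With `.groups = "drop"` the result carries no grouping, so later steps operate on the whole table.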

### Filter
You can filter the dataset before running an operation. The dataset starts in 1871, and this analysis only needs the seasons after 1980.

In [7]:
data %>%
filter(yearID > 1980) %>%
group_by(yearID) %>%
summarise(mean_game_year = mean(G))

`summarise()` ungrouping output (override with `.groups` argument)


yearID,mean_game_year
1981,39.97564
1982,56.55444
1983,56.05268
1984,58.07114
1985,57.17735
1986,56.0944
1987,54.67748
1988,54.16329
1989,53.25349
1990,52.1991


Code Explanation

* filter(yearID > 1980): Filter the data to show only the relevant years (i.e. after 1980)
* group_by(yearID): Group by year
* summarise(mean_game_year = mean(G)): Compute the mean number of games played for each year
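
The comparison operator decides whether the boundary year is kept. A small sketch on a hypothetical `seasons` table shows the difference between `>` and `>=`.

```r
library(dplyr)

# Hypothetical toy data around the 1980 boundary
seasons <- data.frame(
  yearID = c(1979, 1980, 1980, 1981, 1981),
  G      = c(100, 50, 70, 20, 40)
)

# Strictly after 1980: the 1980 rows are dropped
after_1980 <- seasons %>%
  filter(yearID > 1980) %>%
  group_by(yearID) %>%
  summarise(mean_game_year = mean(G))

# From 1980 onwards: the 1980 rows are kept
from_1980 <- seasons %>%
  filter(yearID >= 1980) %>%
  group_by(yearID) %>%
  summarise(mean_game_year = mean(G))

print(after_1980)
print(from_1980)
```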

### Ungroup
Finally, remove the grouping with ungroup() when the next computation should run at a different level, for example across the whole table rather than within each group.

In [8]:
data %>%
filter(HR > 0) %>%
group_by(playerID) %>%
summarise(average_HR_game = sum(HR) / sum(G)) %>%
ungroup() %>%
summarise(total_average_homerun = mean(average_HR_game))

`summarise()` ungrouping output (override with `.groups` argument)


total_average_homerun
0.06244721


Code Explanation

* filter(HR > 0): Exclude rows with zero home runs
* group_by(playerID): Group by player
* summarise(average_HR_game = sum(HR)/sum(G)): Compute the average number of home runs per game for each player
* ungroup(): Remove the grouping
* summarise(total_average_homerun = mean(average_HR_game)): Average the per-player rates across all players
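
In the pipeline above the data are grouped by a single variable, so summarise() already returns ungrouped output and ungroup() mainly acts as a safeguard. The sketch below, on a hypothetical `batting` table, shows a case where ungroup() changes the result because one grouping variable is still active. It assumes dplyr 1.0.0 or later for the `.groups` argument.

```r
library(dplyr)

# Hypothetical toy data: players can appear for several teams
batting <- data.frame(
  playerID = c("p1", "p1", "p2", "p2"),
  teamID   = c("AAA", "BBB", "AAA", "AAA"),
  HR       = c(2, 4, 1, 3)
)

# After this step the data are still grouped by playerID
per_player_team <- batting %>%
  group_by(playerID, teamID) %>%
  summarise(total_HR = sum(HR), .groups = "drop_last")

# Without ungroup(): one mean per player
still_grouped <- per_player_team %>%
  summarise(mean_HR = mean(total_HR))

# With ungroup(): a single mean over all player-team rows
overall <- per_player_team %>%
  ungroup() %>%
  summarise(mean_HR = mean(total_HR))

print(still_grouped)
print(overall)
```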

# Aggregating Data
In base R, aggregate() collapses a data frame using one or more grouping (by) variables and a summary function.

In [9]:
attach(mtcars)

The attach() function adds a data frame to R's search path, so its columns can be referenced by name alone (for example cyl instead of mtcars$cyl). The first rows of mtcars look like this:

In [12]:
head(mtcars)

,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1
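
As a sketch of what attach() changes, the snippet below reaches the same column in three ways using the built-in mtcars data. The variable names `m1`, `m2` and `m3` are only illustrative, and the example assumes no other object named `mpg` exists in the workspace.

```r
# Explicit reference: works without attach()
m1 <- mean(mtcars$mpg)

# with(): columns are visible only inside this call
m2 <- with(mtcars, mean(mpg))

# attach(): adds mtcars to the search path so the bare name resolves
attach(mtcars)
m3 <- mean(mpg)
detach(mtcars)   # remove it again to limit name clashes

print(c(m1, m2, m3))
```

All three values are identical. The first two forms avoid leaving a data frame on the search path, which can matter when several attached datasets share column names.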


In [13]:
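# cyl and vs resolve to mtcars columns because mtcars is attached
# Average every column within each combination of cyl and vs
aggdata <- aggregate(mtcars,by=list(cyl,vs),
                    FUN = mean,na.rm=TRUE)
print(aggdata)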

  Group.1 Group.2      mpg cyl   disp       hp     drat       wt     qsec vs
1       4       0 26.00000   4 120.30  91.0000 4.430000 2.140000 16.70000  0
2       6       0 20.56667   6 155.00 131.6667 3.806667 2.755000 16.32667  0
3       8       0 15.10000   8 353.10 209.2143 3.229286 3.999214 16.77214  0
4       4       1 26.73000   4 103.62  81.8000 4.035000 2.300300 19.38100  1
5       6       1 19.12500   6 204.55 115.2500 3.420000 3.388750 19.21500  1
         am     gear     carb
1 1.0000000 5.000000 2.000000
2 1.0000000 4.333333 4.666667
3 0.1428571 3.285714 3.500000
4 0.7000000 4.000000 1.500000
5 0.0000000 3.500000 2.500000


In [14]:
head(aggdata)

Group.1,Group.2,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
4,0,26.0,4,120.3,91.0,4.43,2.14,16.7,0,1.0,5.0,2.0
6,0,20.56667,6,155.0,131.6667,3.806667,2.755,16.32667,0,1.0,4.333333,4.666667
8,0,15.1,8,353.1,209.2143,3.229286,3.999214,16.77214,0,0.1428571,3.285714,3.5
4,1,26.73,4,103.62,81.8,4.035,2.3003,19.381,1,0.7,4.0,1.5
6,1,19.125,6,204.55,115.25,3.42,3.38875,19.215,1,0.0,3.5,2.5
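
The result above labels the grouping columns Group.1 and Group.2. One possible way to get readable names without attach() is to name the list elements or to use the formula interface of aggregate(). The sketch below restricts the summary to mpg and hp, which is an arbitrary choice for illustration.

```r
# Named list elements become the names of the grouping columns
agg_named <- aggregate(
  mtcars[c("mpg", "hp")],
  by  = list(cyl = mtcars$cyl, vs = mtcars$vs),
  FUN = mean
)

# Formula interface: summarised columns on the left, grouping columns on the right
agg_formula <- aggregate(cbind(mpg, hp) ~ cyl + vs, data = mtcars, FUN = mean)

print(agg_named)
print(agg_formula)
```

Both calls return five rows, one per combination of cyl and vs that occurs in the data, with columns named cyl, vs, mpg and hp.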


In [17]:
# drat and wt are continuous, so nearly every group holds a single car
agdata <- aggregate(mtcars,by=list(drat,wt),
                    FUN = mean,na.rm=TRUE)


In [18]:
head(agdata)

Group.1,Group.2,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
3.77,1.513,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
4.93,1.615,30.4,4,75.7,52,4.93,1.615,18.52,1,1,4,2
4.22,1.835,33.9,4,71.1,65,4.22,1.835,19.9,1,1,4,1
4.08,1.935,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1
4.43,2.14,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
4.08,2.2,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1


In [19]:
attach(CO2)

In [20]:
head(CO2)

Plant,Type,Treatment,conc,uptake
Qn1,Quebec,nonchilled,95,16.0
Qn1,Quebec,nonchilled,175,30.4
Qn1,Quebec,nonchilled,250,34.8
Qn1,Quebec,nonchilled,350,37.2
Qn1,Quebec,nonchilled,500,35.3
Qn1,Quebec,nonchilled,675,39.2


In [22]:
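aco <- aggregate(CO2,by =list(uptake,conc),
              FUN = mean, na.rm = TRUE)

Plant, Type and Treatment are factors rather than numbers, so mean() cannot average them and returns NA with the following warning:

"argument is not numeric or logical: returning NA"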

In [23]:
head(aco)

Group.1,Group.2,Plant,Type,Treatment,conc,uptake
7.7,95,,,,95,7.7
9.3,95,,,,95,9.3
10.5,95,,,,95,10.5
10.6,95,,,,95,10.6
11.3,95,,,,95,11.3
12.0,95,,,,95,12.0
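
One way to avoid the warning is to aggregate only the numeric columns and to group by the categorical ones. The sketch below does this for the built-in CO2 data; grouping by Type and Treatment is an assumption about which summary is wanted, not something the original cell does.

```r
# Drop the extra grouped-data classes so plain data frame methods apply
co2_df <- as.data.frame(CO2)

# Average the numeric columns within each Type and Treatment combination
aco_numeric <- aggregate(
  co2_df[, c("conc", "uptake")],
  by  = list(Type = co2_df$Type, Treatment = co2_df$Treatment),
  FUN = mean
)

print(aco_numeric)
```

The result has one row per Type and Treatment combination and contains no NA columns, because only numeric columns are passed to mean().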


In [24]:
attach(iris)

In [25]:
head(iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


In [26]:
aggmean = aggregate(iris[,1:4],by=list(iris$Species),FUN=mean, na.rm=TRUE)
aggmean

Group.1,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
setosa,5.006,3.428,1.462,0.246
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026


In [27]:
agg_sum = aggregate(iris[,1:4],by=list(iris$Species),FUN=sum, na.rm=TRUE)
agg_sum

Group.1,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
setosa,250.3,171.4,73.1,12.3
versicolor,296.8,138.5,213.0,66.3
virginica,329.4,148.7,277.6,101.3


In [28]:
agg_count = aggregate(iris[,1:4],by=list(iris$Species),FUN=length)
agg_count

Group.1,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
setosa,50,50,50,50
versicolor,50,50,50,50
virginica,50,50,50,50


In [29]:
agg_max = aggregate(iris[,1:4],by=list(iris$Species),FUN=max, na.rm=TRUE)
agg_max

Group.1,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
setosa,5.8,4.4,1.9,0.6
versicolor,7.0,3.4,5.1,1.8
virginica,7.9,3.8,6.9,2.5


In [30]:
agg_min = aggregate(iris[,1:4],by=list(iris$Species),FUN=min, na.rm=TRUE)
agg_min

Group.1,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
setosa,4.3,2.3,1.0,0.1
versicolor,4.9,2.0,3.0,1.0
virginica,4.9,2.2,4.5,1.4
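
The iris cells above repeat the same aggregate() call with a different FUN each time. As a compact sketch of the same idea, the calls can be generated from a named list of functions; the names `summary_funs` and `iris_stats` are hypothetical.

```r
# Named list of summary functions to apply per species
summary_funs <- list(mean = mean, sum = sum, min = min, max = max)

# One aggregated data frame per function, stored under the function's name
iris_stats <- lapply(summary_funs, function(f) {
  aggregate(iris[, 1:4], by = list(Species = iris$Species), FUN = f)
})

print(iris_stats$max)
print(iris_stats$min)
```

Each element of iris_stats is a data frame with one row per species, for example iris_stats$mean for the group means.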
