# Classification of NHL game data

## the idea

## theoretical background

As explained in chapter 13 of the book by Zaki and Meira [](#cite-zaki2014data), clustering groups data into sets that share a common central point. The aim of clustering a dataset is to expose structure that remains hidden to other statistical tools. Especially for multivariate datasets, a clustering may reveal relationships between data points that are hard to detect with probabilistic measures such as cross-correlations.
One of the simplest clustering algorithms, which is applied naively below, is the k-means algorithm. As described in algorithm 13.1 in [](#cite-zaki2014data), given the expected number of clusters k, k points in the D-dimensional data space are selected as initial cluster means. Each data point is then assigned to its nearest cluster mean, where nearest is measured by the norm of a distance function. This yields k clusters, and in the next step the mean of each cluster is computed, replacing the initial k means. All points are then reassigned to the updated cluster means. This procedure is iterated until convergence and ideally leads to k clearly distinct clusters of data points.
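
The assignment and update steps can be sketched in a few lines of base R. This is only a minimal illustration on synthetic data, not the implementation used by `fpc` or by [](#cite-zaki2014data); the toy matrix `X`, the seed and the convergence tolerance are assumptions made for the example.

```r
# Minimal k-means sketch on toy data (illustration only)
set.seed(1)
# two well-separated 2-D blobs of 50 points each (synthetic)
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))
k <- 2
# start values: k randomly chosen data points
centers <- X[sample(nrow(X), k), , drop = FALSE]

for (iter in 1:100) {
  # assignment step: squared Euclidean distance of every point to every center
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")  # index of the nearest center
  # update step: each center becomes the mean of the points assigned to it
  newCenters <- t(sapply(seq_len(k), function(j)
    colMeans(X[cluster == j, , drop = FALSE])))
  if (all(abs(newCenters - centers) < 1e-9)) break  # means no longer move
  centers <- newCenters
}
print(centers)
print(table(cluster))
```

The result depends on the start values, so in practice the procedure is usually repeated from several random starts.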

In [None]:
Our goal is to see whether the game data show clustering in multidimensional space.

In [3]:
library(fpc)  # provides kmeansruns, disthclustCBI and pamkCBI
# column names of the NHL data files
DATACAPTIONVEC <- c("ID","SEASON","DATE","TEAM1","TEAM2","WON","SCORE","SHOTS","FACEOFF","TAKEAWAY","GIVEAWAY","PIM","HITS","PPG","ATTENDANCE")

In [4]:
# read the "_sum" and "_delta" tables of every season and stack them
nhlDataSum <- data.frame()
nhlDataDelta <- data.frame()
SeasonVector <- c(2010,2011)#,2012,2014,2015,2016)
NumberOfSeasons <- length(SeasonVector)
for(season in SeasonVector)
{
  # "_sum" file of this season
  tableName <- paste0("../dataSetsNHL/dataFileNhl_",season,"_regular_sum.dat")
  nhlDataS <- read.table(tableName)
  colnames(nhlDataS) <- DATACAPTIONVEC
  nhlDataSum <- rbind(nhlDataSum,nhlDataS)
  # "_delta" file of this season
  tableName <- paste0("../dataSetsNHL/dataFileNhl_",season,"_regular_delta.dat")
  nhlDataS <- read.table(tableName)
  colnames(nhlDataS) <- DATACAPTIONVEC
  nhlDataDelta <- rbind(nhlDataDelta,nhlDataS)
}

## cluster analysis of team data

In [None]:
teams <- nhlDataDelta$TEAM1
LISTOFTEAMS=unique(teams)
print("we have the following teams")
print(LISTOFTEAMS)

In [None]:
getTeamGameStatistics <- function(thisTeam)
    {
    #all games of this team (summed data), its home games and its away games (delta data)
    matchS<-subset(nhlDataSum,nhlDataSum$TEAM1==thisTeam | nhlDataSum$TEAM2==thisTeam)
    matchH<-subset(nhlDataDelta,nhlDataDelta$TEAM1==thisTeam)
    matchA<-subset(nhlDataDelta,nhlDataDelta$TEAM2==thisTeam)
    
    #invert away data so that every delta is taken from this team's point of view
    statCols<-c("SCORE","SHOTS","FACEOFF","TAKEAWAY","GIVEAWAY","PIM","HITS","PPG")
    matchA[,statCols]<- -matchA[,statCols]

    #add delta data
    matchD<-rbind(matchH,matchA)
    #order both tables by season, then date, so that their rows refer to the same games
    matchDOrdered<-matchD[order(matchD$SEASON,matchD$DATE),]
    matchSOrdered<-matchS[order(matchS$SEASON,matchS$DATE),]

    #compute the team values by combining delta and summed data
    teamData<-0.5*(matchDOrdered[,sapply(matchDOrdered,is.numeric)]+matchSOrdered[,sapply(matchSOrdered,is.numeric)])
    return(teamData)
    }
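
The combination step in `getTeamGameStatistics` can be checked on a single made-up game. The sketch assumes that the `_sum` files store home plus away values and the `_delta` files store home minus away values; the file format is not shown here, so this is an assumption, and the numbers below are hypothetical.

```r
# Toy check of the 0.5*(delta+sum) step (hypothetical numbers).
# Assumption: "sum" = home + away, "delta" = home - away.
home <- c(SHOTS = 31, HITS = 20)  # made-up home-team values
away <- c(SHOTS = 25, HITS = 28)  # made-up away-team values
gameSum   <- home + away
gameDelta <- home - away

# home team: half of (delta + sum) gives back the home values
homeRecovered <- 0.5 * (gameDelta + gameSum)
# away team: the delta is inverted first, as in the "invert away data" step
awayRecovered <- 0.5 * (-gameDelta + gameSum)

print(homeRecovered)
print(awayRecovered)
```

Under this assumption the function returns each team's own per-game values, regardless of whether the game was played at home or away.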

In [None]:
teamsDataList <- list()
for(i in seq_along(LISTOFTEAMS))
{
  # as.character guards against the team column having been read as a factor
  team <- as.character(LISTOFTEAMS[i])
  teamsDataList[[i]] <- getTeamGameStatistics(team)
}
#teamsDataList <- mapply(getTeamGameStatistics,unlist(LISTOFTEAMS))
print(paste("Have retrieved a data list of size",length(teamsDataList),"for",length(LISTOFTEAMS),"teams"))
#print(lapply(teamsDataList,dim))

In [None]:
matchNYI<-getTeamGameStatistics("NYI")

In [None]:
  #matchNYIMatrix=as.matrix(cbind(matchNYI$SCORE,matchNYI$SHOTS,matchNYI$FACEOFF))
  #matchNYIMatrix=as.matrix(cbind(matchNYI$WON,matchNYI$SCORE,matchNYI$SHOTS,matchNYI$FACEOFF,matchNYI$TAKEAWAY,matchNYI$PIM))
matchNYIMatrix=as.matrix(cbind(matchNYI$SCORE,matchNYI$SHOTS,matchNYI$FACEOFF,matchNYI$TAKEAWAY))

In [14]:
library(fpc)  
#colnames(matchNYIMatrix)<-c("WON","SCORE","SHOTS","FACEOFF","TAKEAWAY","PIM")
clusterResult=kmeansruns(matchNYIMatrix,krange=2:4,criterion="ch",iter.max=100,runs=1,scaledata=FALSE,alpha=0.001,critout=FALSE,plot=FALSE)
  
print(clusterResult$centers)
print(clusterResult$size)
print(clusterResult$bestk)
print(clusterResult$crit)
  
#clusterResult=kmeansruns(matchNYIMatrix,krange=2,criterion="ch",iter.max=100,runs=1,scaledata=FALSE,alpha=0.001,critout=FALSE,plot=TRUE)

ERROR: Error in as.matrix(data): object 'matchNYIMatrix' not found


NULL
NULL
NULL
[1] "asw"


In [None]:
clusterResult=kmeansruns(matchNYIMatrix,krange=3,criterion="ch",iter.max=100,runs=1,scaledata=FALSE,alpha=0.001,critout=FALSE,plot=TRUE)

## cluster analysis of home game data

In [22]:
# feature matrix from the delta data: SCORE, SHOTS, FACEOFF
dataMatrix <- cbind(nhlDataDelta$SCORE,nhlDataDelta$SHOTS,nhlDataDelta$FACEOFF)
print(paste("dataMatrix with",ncol(dataMatrix),"columns and",nrow(dataMatrix),"rows!"))

[1] "dataMatrix with 3 columns and 7275 rows!"


In [23]:
clusterResult=kmeansruns(dataMatrix,krange=2,criterion="ch",iter.max=150,runs=1,scaledata=FALSE,alpha=0.001,critout=FALSE,plot=FALSE)
print(clusterResult$centers)
print(clusterResult$size)
print(clusterResult$bestk)
print(clusterResult$crit)

       [,1]      [,2]       [,3]
1 0.3625137  9.703348  3.1174533
2 0.2090333 -6.077114 -0.1985679
[1] 3644 3631
[1] 2
[1]    0.000 4987.108
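
The call above passes `scaledata=FALSE`, so features on a larger numeric scale weigh more in the Euclidean distance. The printed centres differ mostly in the second column (SHOTS), which would be consistent with such an effect, although this is not verified here. The following sketch illustrates the effect on synthetic data with base R `kmeans` and `scale`; the toy matrix, the seed and the choice of `nstart` are assumptions for the example.

```r
# Effect of feature scale on k-means (synthetic data, illustration only)
set.seed(7)
n <- 200
group <- rep(1:2, each = n / 2)  # true group of each row
# column 1: small scale, carries the real group structure
# column 2: large scale, pure noise
X <- cbind(c(rnorm(n / 2, 0, 0.2), rnorm(n / 2, 2, 0.2)),
           rnorm(n, 0, 20))

raw    <- kmeans(X, centers = 2, nstart = 10)         # unscaled features
scaled <- kmeans(scale(X), centers = 2, nstart = 10)  # zero mean, unit variance

# cross-tabulate the clusters found against the true groups
print(table(raw$cluster, group))
print(table(scaled$cluster, group))
```

Whether scaling is appropriate for the NHL features is a modelling decision; the sketch only shows that the choice can change the result.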


In [28]:
# feature matrix from the delta data: SCORE, SHOTS, FACEOFF, TAKEAWAY, PPG
dataMatrix <- cbind(nhlDataDelta$SCORE,nhlDataDelta$SHOTS,nhlDataDelta$FACEOFF,nhlDataDelta$TAKEAWAY,nhlDataDelta$PPG)
print(paste("dataMatrix with",ncol(dataMatrix),"columns and",nrow(dataMatrix),"rows!"))

[1] "dataMatrix with 5 columns and 7275 rows!"


In [31]:
clusterResult=kmeansruns(dataMatrix,krange=2:12,criterion="ch",iter.max=150,runs=1,scaledata=FALSE,alpha=0.001,critout=TRUE,plot=FALSE)
print(clusterResult$centers)
print(clusterResult$size)
print(clusterResult$bestk)
print(clusterResult$crit)

2  clusters  4253.764 
3  clusters  3510.085 
4  clusters  3407.732 
5  clusters  3092.962 
6  clusters  2864.851 
7  clusters  2718.037 
8  clusters  2542.459 
9  clusters  2436.43 
10  clusters  2330.348 
11  clusters  2222.217 
12  clusters  2150.145 
       [,1]      [,2]       [,3]     [,4]       [,5]
1 0.2105843 -6.090959 -0.1645535 1.958931 0.06146637
2 0.3608445  9.704140  3.0808884 2.103647 0.14834110
[1] 3628 3647
[1] 2
 [1]    0.000 4253.764 3510.085 3407.732 3092.962 2864.851 2718.037 2542.459
 [9] 2436.430 2330.348 2222.217 2150.145
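
The values listed above are the criterion selected with `criterion="ch"`, the Calinski-Harabasz index, and the number of clusters with the largest value is reported as `bestk`. The leading 0 in `crit` appears to be the placeholder for k = 1, which is not evaluated. As a rough sketch of what the index measures, it can be computed from a base R `kmeans` fit as the between-cluster sum of squares per degree of freedom divided by the within-cluster sum of squares per degree of freedom. The helper `chIndex` and the synthetic data are assumptions for the example and are not part of `fpc`.

```r
# Calinski-Harabasz index from base-R kmeans fits (synthetic data, illustration only)
set.seed(42)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 5), ncol = 2))
n <- nrow(X)

# hypothetical helper: variance ratio for one kmeans fit
chIndex <- function(fit, n) {
  k <- length(fit$size)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
}

ch <- sapply(2:6, function(k) chIndex(kmeans(X, centers = k, nstart = 10), n))
names(ch) <- 2:6
print(round(ch, 1))
print(names(ch)[which.max(ch)])  # number of clusters with the largest index
```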


In [19]:
# disthclustCBI expects a dissimilarity matrix, the number of clusters k and an hclust method name as a string
# "ward.D2" is an assumed choice of linkage; the distance matrix of all games is large
clusterResult=disthclustCBI(dist(dataMatrix),k=2,cut="number",method="ward.D2")


In [13]:
# partitioning around medoids on the scaled data; dataMatrix holds raw observations, not dissimilarities
clusterResult=pamkCBI(dataMatrix,krange=2,k=2,criterion="asw", usepam=TRUE,scaling=TRUE,diss=FALSE)