Today, our goal is to predict, as accurately as possible, the likelihood of each nominee winning an Oscar in a given year, based on various explanatory variables. Winning an Oscar has tremendous consequences, be it for future earnings or fame. For our analysis, only the Best Picture, Best Director, Best Leading Actor and Best Leading Actress awards are considered.

Here, we will consider an approach based on discrete choice models. Because each award has multiple nominees, the binary logistic model that we considered in the previous week is inadequate; we need a multinomial choice logistic model.

In [1]:
oscars <- read.csv("csv/oscars.csv")

# recode Ch from {1,2} to {1,0}, so that 1 marks the winner and 0 a non-winner
oscars$Ch <- 2 - oscars$Ch
str(oscars)
head(oscars)

'data.frame':	1140 obs. of  32 variables:
 $ Year  : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 $ Name  : Factor w/ 715 levels "Abraham","Accidental",..: 30 346 122 144 61 576 538 259 130 17 ...
 $ PP    : int  1 1 1 1 1 0 0 0 0 0 ...
 $ DD    : int  0 0 0 0 0 1 1 1 1 1 ...
 $ MM    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ FF    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Mode  : int  1 2 3 4 5 1 2 3 4 5 ...
 $ Ch    : num  0 0 0 1 0 0 0 0 1 0 ...
 $ Movie : Factor w/ 597 levels "Absence","Accidental",..: 37 281 111 127 54 155 281 111 127 54 ...
 $ Nom   : int  7 4 7 8 8 1 4 7 8 8 ...
 $ Pic   : int  1 1 1 1 1 0 1 1 1 1 ...
 $ Dir   : int  0 1 1 1 1 1 1 1 1 1 ...
 $ Aml   : int  0 0 1 0 1 0 0 1 0 1 ...
 $ Afl   : int  0 1 0 0 0 0 1 0 0 0 ...
 $ PrN   : int  0 0 0 0 0 0 0 0 1 0 ...
 $ PrW   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PrNl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ PrWl  : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Gdr   : int  1 0 0 0 0 0 0 0 0 0 ...
 $ Gmc   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ Gd   

Year,Name,PP,DD,MM,FF,Mode,Ch,Movie,Nom,⋯,Gm2,Gf1,Gf2,PGA,DGA,SAM,SAF,Age,Length,Days
2007,Atonement,1,0,0,0,1,0,Atonement,7,⋯,0,0,0,0,0,0,0,0,130,51
2007,Juno,1,0,0,0,2,0,Juno,4,⋯,0,0,0,0,0,0,0,0,96,61
2007,Clayton,1,0,0,0,3,0,Clayton,7,⋯,0,0,0,0,0,0,0,0,119,135
2007,Country,1,0,0,0,4,1,Country,8,⋯,0,0,0,1,0,0,0,0,122,95
2007,Blood,1,0,0,0,5,0,Blood,8,⋯,0,0,0,0,0,0,0,0,158,44
2007,Schnabel,0,1,0,0,1,0,Diving,1,⋯,0,0,0,0,0,0,0,0,0,0


In [2]:
# a naive comparison: mean number of nominations for non-winners (0) and winners (1)
tapply(oscars$Nom[oscars$PP==1], oscars$Ch[oscars$PP==1], mean)

# however, comparing the means alone is insufficient: if the variances are large,
# the difference in means may not be statistically significant
tapply(oscars$Nom[oscars$PP==1], oscars$Ch[oscars$PP==1], var)

Next, we use a naive one-sided t-test to check whether winners receive significantly more nominations than non-winners.

In [3]:
t.test(oscars$Nom[oscars$PP==1 & oscars$Ch==1],
       oscars$Nom[oscars$PP==1 & oscars$Ch==0],
       alternative="greater")


	Welch Two Sample t-test

data:  oscars$Nom[oscars$PP == 1 & oscars$Ch == 1] and oscars$Nom[oscars$PP == 1 & oscars$Ch == 0]
t = 8.1994, df = 87.361, p-value = 9.479e-13
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 2.188922      Inf
sample estimates:
mean of x mean of y 
 9.526316  6.780702 


In [4]:
# we wish to find out whether Best Picture winners also receive nominations for Best Director
table(oscars$Dir[oscars$PP==1 & oscars$Ch==1])

# what is the sole movie that won despite not having a Best Director nomination?
which(oscars$Dir==0 & oscars$PP==1 & oscars$Ch==1)
oscars[362,]


 0  1 
 1 56 

Unnamed: 0,Year,Name,PP,DD,MM,FF,Mode,Ch,Movie,Nom,⋯,Gm2,Gf1,Gf2,PGA,DGA,SAM,SAF,Age,Length,Days
362,1989,Driving,1,0,0,0,2,1,Driving,9,⋯,0,0,0,1,0,0,0,0,99,59


Do *Best Actor* and *Best Actress* winners also have nominations for movies in the *Best Picture* category?

In [5]:
# best actor
table(oscars$Pic[oscars$MM==1 & oscars$Ch==1])

# best actress
table(oscars$Pic[oscars$FF==1 & oscars$Ch==1])


 0  1 
14 43 


 0  1 
23 35 

Surprisingly, there is one extra winner in the *Best Actress* category (58 rather than 57). This is because the 1968 *Best Actress* award was a tie, with two winners.

In [6]:
subset(oscars, Year==1968 & FF==1 & Ch==1)

Unnamed: 0,Year,Name,PP,DD,MM,FF,Mode,Ch,Movie,Nom,⋯,Gm2,Gf1,Gf2,PGA,DGA,SAM,SAF,Age,Length,Days
796,1968,HepburnK,0,0,0,1,1,1,Lion,7,⋯,0,0,0,0,0,0,0,61,0,0
799,1968,Streisand,0,0,0,1,4,1,Funny,8,⋯,0,0,1,0,0,0,0,26,0,0


Do Golden Globe awards help predict the Oscars?

In [7]:
oscars$G <- oscars$Gdr+oscars$Gmc

# all entries in dataframe
table(oscars$G)

# oscars Best Picture awardees and Golden Globe awardees
table(oscars$G[oscars$PP==1 & oscars$Ch==1])

# Oscars Best Director & Golden Globe awardees
table(oscars$Gd[oscars$DD==1 & oscars$Ch==1])


   0    1 
1051   89 


 0  1 
18 39 


 0  1 
26 31 

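How does an Oscar track record from previous years relate to winning in the current year? As a first look, we cross-tabulate `PrNl` (presumably the number of previous leading-role nominations) against `Ch` for the *Best Actor* nominees.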

In [8]:
table(oscars$PrNl[oscars$MM==1],
      oscars$Ch[oscars$MM==1])

   
      0   1
  0 111  27
  1  43  14
  2  29   3
  3  11   6
  4  14   3
  5   7   2
  6   6   2
  7   5   0
  8   2   0
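Raw counts are hard to compare across rows, so it can help to convert them to row-wise win rates. The sketch below re-enters the counts from the table above by hand (the matrix `counts` and the name `win_rates` are introduced here for illustration only).

```r
# Counts copied from the cross-tabulation above (rows: PrNl, columns: Ch)
counts <- matrix(
  c(111, 27,
     43, 14,
     29,  3,
     11,  6,
     14,  3,
      7,  2,
      6,  2,
      5,  0,
      2,  0),
  ncol = 2, byrow = TRUE,
  dimnames = list(PrNl = as.character(0:8), Ch = c("0", "1"))
)

# Row-wise proportions: share of nominees in each PrNl group who lost (0) or won (1)
win_rates <- prop.table(counts, margin = 1)
round(win_rates[, "1"], 3)
```

In the notebook itself, the same result should be obtainable by wrapping the `table(...)` call in `prop.table(..., margin = 1)`. The groups with many previous nominations are small, so their win rates are noisy.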

To build a model with multiple alternatives for the dependent variable, consider the following. Given $K$ choices (alternatives) indexed by $k \in \{1,2,...,K\}$ and $n$ consumers (observations) indexed by $i \in \{1,2,...,n\}$, the utility of consumer (observation) $i$ for alternative $k$ is:

$$\large U_{ik} = \beta' x_{ik}+\epsilon_{ik}$$

In the Oscars context:

- $i \in \{1,2,...,57\}$ (observations; in this context, one per year)
- $k \in \{1,2,...,5\}$ (choices for the winner)

Observation $i$ selects the alternative with the highest utility:

$$\large P(Y_i=k)=P(U_{ik} \geq U_{il}, \:\forall \:l \neq k)$$

Assuming that all $\epsilon_{ik}$ are independent and identically distributed with the **Gumbel (Type I extreme value) distribution**, it can be shown that ([McFadden 1974](https://eml.berkeley.edu/reprints/mcfadden/zarembka.pdf)):

$$\large P(Y_i=k)=\frac{e^{\beta' x_{ik}}} {\sum_{l=1}^K e^{\beta' x_{il}}}$$
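As a minimal sketch of the choice probability formula above, the snippet below evaluates it for a single year. The attribute values are the `Nom` and `Dir` entries of the five 2007 Best Picture rows shown earlier; the helper `choice_probs` and the coefficient values in `beta` are made up for illustration and are not estimates.

```r
# Hypothetical helper: logit choice probabilities for one choice set
choice_probs <- function(X, beta) {
  v <- as.vector(X %*% beta)   # systematic utilities beta' x for each alternative
  ev <- exp(v - max(v))        # subtracting max(v) avoids overflow, ratios are unchanged
  ev / sum(ev)
}

# Nom and Dir for the five 2007 Best Picture nominees (from the data preview)
X <- cbind(Nom = c(7, 4, 7, 8, 8), Dir = c(0, 1, 1, 1, 1))

# Illustrative coefficients only (assumed, not fitted)
beta <- c(Nom = 0.3, Dir = 1.0)

p <- choice_probs(X, beta)
round(p, 3)
```

The probabilities sum to one within the year, which is what distinguishes this model from fitting a separate binary logistic regression to each nominee.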

This model is known as the conditional logit model (also commonly referred to as the **multinomial logit model**). The error distribution in the utility equation is a modelling assumption rather than something dictated by theory: the Gumbel distribution is the one most widely used because it leads to the closed-form probabilities above and tends to approximate well the actual relationship between the explanatory variables and whether the movie wins or not.

Other error distributions are possible, but the resulting models may not guarantee convergence to a global optimum when they are estimated.

The logit formula above holds when each $\epsilon_{ik}$ has the Gumbel cumulative distribution function: $$\large F(\epsilon_{ik})=e^{-e^{-\epsilon_{ik}}}$$
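The link between Gumbel errors and the logit formula can be checked numerically. The following is a rough Monte Carlo sketch, assuming three alternatives with made-up systematic utilities `v`: it draws Gumbel errors by inverting the distribution function above, picks the utility-maximising alternative in each draw, and compares the simulated choice shares with the closed-form probabilities.

```r
set.seed(1)

v <- c(0.5, 0.0, -0.5)   # made-up systematic utilities for 3 alternatives
n_draws <- 200000

# Inverse-CDF sampling: if u ~ Uniform(0,1), then -log(-log(u)) is Gumbel distributed
eps <- matrix(-log(-log(runif(n_draws * length(v)))), ncol = length(v))

# Total utility = systematic part + random part, one row per draw
U <- sweep(eps, 2, v, "+")

# Index of the utility-maximising alternative in each draw
chosen <- max.col(U, ties.method = "first")

simulated <- tabulate(chosen, nbins = length(v)) / n_draws
closed_form <- exp(v) / sum(exp(v))

round(rbind(simulated, closed_form), 3)
```

The two rows should agree up to simulation noise, which shrinks as `n_draws` grows.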

Alternative-specific constants (ASCs) are analogous to the $\beta_0$ offset (intercept) values. Including them may lead to a better fit, but depending on the context, leaving them out may make more intuitive sense and prevent overfitting. In the context of travel mode choice models, where the alternatives are **Car** and **Public Transport**, it makes sense to treat **Car** as perhaps having an intrinsic advantage over **Public Transport** across all observations, and so to include an ASC. In the context of Oscars prediction, the nominees are different movies every year, so alternative number $k$ has no consistent meaning across years (observations) and we cannot claim that an ASC applies across them (the `-1` in the model formulas below is there to drop these constants).

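**Independence of Irrelevant Alternatives**: IIA says that adding an additional alternative C should not alter the relative preference for choice A over choice B, that is, the ratio $P(A)/P(B)$. The logit model imposes IIA, which can be unrealistic. In the classic example, a choice between car and bus becomes a choice between car, red bus and blue bus: the two buses are near-identical and should mostly split the bus share, yet IIA forces the car share to fall proportionally too. One method of relaxing IIA is to use nested logit models.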
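The red bus / blue bus example can be made concrete with a few lines of arithmetic. This is a minimal sketch assuming all alternatives have the same systematic utility (set to 0 here); the helper `logit_probs` is a name introduced for illustration.

```r
# Hypothetical helper: logit probabilities from a named vector of utilities
logit_probs <- function(v) exp(v) / sum(exp(v))

# Two equally attractive alternatives
two <- logit_probs(c(car = 0, red_bus = 0))

# Add a blue bus that is identical to the red bus
three <- logit_probs(c(car = 0, red_bus = 0, blue_bus = 0))

# The ratio P(car) / P(red_bus) is unchanged by the new alternative (IIA)...
ratio_before <- unname(two["car"] / two["red_bus"])
ratio_after <- unname(three["car"] / three["red_bus"])

# ...but the car share drops from 1/2 to 1/3, although intuitively
# the two buses should just split the original bus share
c(car_before = unname(two["car"]), car_after = unname(three["car"]))
```

A nested logit model would group the two buses in one nest, so that the blue bus mainly takes share from the red bus rather than from the car.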

**Testing fit quality**
- $AIC=-2LL+2(p+1)$, where $p=$ number of parameters
- Likelihood ratio index (McFadden index): $\rho = 1 - \frac{LL(\hat{\beta})}{LL(0)}$. Recall that the log-likelihood is never positive, and the closer it is to zero (that is, the higher it is), the better. With $\beta=0$, every alternative is equally likely, so each observation contributes $\log\frac{1}{\#\:alternatives}$ and $LL(0) = n\log\frac{1}{\#\:alternatives}$, which is both mathematically true and intuitive. Hence, if $\rho=0$, the model does not perform better than the baseline model that picks a winner at random.
- Percent correctly predicted (cross-classification?)
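To make the likelihood ratio index concrete, here is a small sketch of the arithmetic. It uses the 57 years and 5 alternatives mentioned earlier for the baseline; the fitted log-likelihood `ll_fit` is a hypothetical number chosen only for illustration, not a result from this dataset.

```r
n_years <- 57   # number of observations (years)
n_alts <- 5     # number of alternatives per year

# Baseline: with beta = 0 every nominee has probability 1 / n_alts
ll_null <- n_years * log(1 / n_alts)

# Hypothetical fitted log-likelihood (assumed value, for illustration only)
ll_fit <- -40

# McFadden likelihood ratio index
rho <- 1 - ll_fit / ll_null

c(ll_null = ll_null, ll_fit = ll_fit, rho = rho)
```

A fitted log-likelihood equal to `ll_null` gives $\rho = 0$, while a perfect fit (log-likelihood of 0) gives $\rho = 1$.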

In [8]:
# Building a logit model
library(mlogit)
?mlogit.data
oscarsPP <- subset(oscars, PP==1)
oscarsPP
D1 <- mlogit.data(subset(oscarsPP, Year<=2006),
                 choice = "Ch",
                 shape = "long",
                 chid.var = "Year",
                 alt.var = "Mode")
# alternatively, can use "alt.levels = c(1,2,3,4,5)" instead of alt.var

# optimisation via newton's method
# M1 <- mlogit(Ch~Nom+Dir+G+Aml+Afl+PGA+Days+Length-1, data=D1)
# summary(M1)
# ?mlogit.data
# oscarsPP

Unnamed: 0,Year,Name,PP,DD,MM,FF,Mode,Ch,Movie,Nom,⋯,Gm2,Gf1,Gf2,PGA,DGA,SAM,SAF,Age,Length,Days
1,2007,Atonement,1,0,0,0,1,0,Atonement,7,⋯,0,0,0,0,0,0,0,0,130,51
2,2007,Juno,1,0,0,0,2,0,Juno,4,⋯,0,0,0,0,0,0,0,0,96,61
3,2007,Clayton,1,0,0,0,3,0,Clayton,7,⋯,0,0,0,0,0,0,0,0,119,135
4,2007,Country,1,0,0,0,4,1,Country,8,⋯,0,0,0,1,0,0,0,0,122,95
5,2007,Blood,1,0,0,0,5,0,Blood,8,⋯,0,0,0,0,0,0,0,0,158,44
21,2006,Babel,1,0,0,0,1,0,Babel,7,⋯,0,0,0,0,0,0,0,0,142,121
22,2006,Departed,1,0,0,0,2,1,Departed,5,⋯,0,0,0,0,0,0,0,0,151,142
23,2006,Letters,1,0,0,0,3,0,Letters,4,⋯,0,0,0,0,0,0,0,0,140,67
24,2006,Sunshine,1,0,0,0,4,0,Sunshine,4,⋯,0,0,0,1,0,0,0,0,101,214
25,2006,Queen,1,0,0,0,5,0,Queen,6,⋯,0,0,0,0,0,0,0,0,97,148


In [10]:
# fit on the training set D1 (Year <= 2006) built above
M2 <- mlogit(Ch~Nom+Dir+G+PGA-1, data=D1)
summary(M2)


Now, predicting 2007 winning results using model M2. We can see that based on this model, alternative \#4 (No Country for Old Men) has the highest chance of winning.

In [11]:
D2 <- mlogit.data(subset(oscarsPP, Year==2007),
                 choice = "Ch",
                 shape = "long",
                 chid.var = "Year",
                 alt.var = "Mode")
# predicted winning probability for each of the 2007 alternatives
P2 <- predict(M2, newdata = D2)

# showing results
P2
subset(oscarsPP, Year==2007)


Similarly, we can identify surprise winners by comparing our predicted probabilities with the actual winners. In 2004, for instance, the winner was a surprise relative to the model's predictions.

In [None]:
D3 <- mlogit.data(oscarsPP,
                 choice = "Ch",
                 shape = "long",
                 chid.var = "Year",
                 alt.var = "Mode")
M3 <- mlogit(Ch~Nom+Dir+G+PGA-1, data=D3)
P3 <- predict(M3, D3)

# assigning predicted winning probabilities back to the dataframe
# (this assumes the rows of P3 are in the same year order as the rows of oscarsPP)
Pred <- as.vector(t(P3))
oscarsPP$Pred <- Pred

# showing results
print(subset(oscarsPP, Year==2004))

Now, to predict Best Actor winners instead of Best Picture

In [None]:
Fail <- 0
Predict <- NULL
Coefficients <- NULL
oscarsMM <- subset(oscars, MM==1)


# in this loop, we mimic real-life prediction: only data up to year i is used to predict year i+1
for (i in 1960:2006) {
    
    # creating logit model based on training set Year<=i
    D4 <- mlogit.data(subset(oscarsMM, Year<=i),
         choice = "Ch",
         shape = "long",
         chid.var = "Year",
         alt.var = "Mode")
    M4 <- mlogit(Ch~Pic+Gm1+Gm2+PrNl+PrWl-1, data=D4)
    # memoise coefficients results
    Coefficients <- rbind(Coefficients, M4$coeff)
    
    # predicting Year=i+1's results based on past trained model
    D5 <- mlogit.data(subset(oscarsMM, Year==i+1),
         choice = "Ch",
         shape = "long",
         chid.var = "Year",
         alt.var = "Mode")
    P <- predict(M4, newdata = D5)
    # memoise prediction results
    Predict <- rbind(Predict, P)
    
    # count a failure when the most probable nominee is not the actual winner
    Fail <- Fail + (which.max(P) != which.max(subset(oscarsMM, Year==i+1)$Ch))
    
}
Fail
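The failure count is easier to interpret as a hit rate. The sketch below shows the same counting logic on a toy example: the probabilities in `pred` and the winners in `actual_winner` are made-up numbers, not output from the loop above.

```r
# Toy predicted probabilities: one row per year, one column per nominee (made up)
pred <- rbind(c(0.10, 0.60, 0.10, 0.10, 0.10),
              c(0.30, 0.20, 0.25, 0.15, 0.10),
              c(0.05, 0.05, 0.20, 0.50, 0.20))

# Made-up index of the actual winner in each year
actual_winner <- c(2, 3, 4)

# Predicted winner = nominee with the highest probability in each row
predicted_winner <- max.col(pred, ties.method = "first")

# Number of years where the prediction misses, and the share of correct predictions
fail <- sum(predicted_winner != actual_winner)
hit_rate <- 1 - fail / nrow(pred)

c(fail = fail, hit_rate = hit_rate)
```

For the loop above, the analogous quantity would be `1 - Fail / length(1960:2006)`, which can be compared against the 1-in-5 hit rate of guessing at random.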