# Predicting the Baseball World Series Champion

<img src="images/baseball.jpg"/>

Last week, in the Moneyball lecture, we discussed how regular season performance is not strongly correlated with winning the World Series in baseball. In this homework question, we'll use the same data to investigate how well we can predict the World Series winner at the beginning of the playoffs.

To begin, load the dataset baseball.csv into R using the read.csv function, and call the data frame "baseball". This is the same data file we used during the Moneyball lecture, and the data comes from Baseball-Reference.com.

As a reminder, this dataset contains data concerning a baseball team's performance in a given year. It has the following variables:

    Team: A code for the name of the team
    
    League: The Major League Baseball league the team belongs to, either AL (American League) or NL (National League)
    
    Year: The year of the corresponding record
    
    RS: The number of runs scored by the team in that year
    
    RA: The number of runs allowed by the team in that year
    
    W: The number of regular season wins by the team in that year
    
    OBP: The on-base percentage of the team in that year
    
    SLG: The slugging percentage of the team in that year
    
    BA: The batting average of the team in that year
    
    Playoffs: Whether the team made the playoffs in that year (1 for yes, 0 for no)
    
    RankSeason: Among the playoff teams in that year, the ranking of their regular season records (1 is best)
    
    RankPlayoffs: Among the playoff teams in that year, how well they fared in the playoffs. The team winning the World Series gets a RankPlayoffs of 1.
    
    G: The number of games a team played in that year
    
    OOBP: The team's opponents' on-base percentage in that year
    
    OSLG: The team's opponents' slugging percentage in that year

### Load and Explore the Data

In [1]:
base = read.csv("data/baseball.csv")
head(base)

,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
,<fct>,<fct>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
1,ARI,NL,2012,734,688,81,0.328,0.418,0.259,0,,,162,0.317,0.415
2,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4.0,5.0,162,0.306,0.378
3,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5.0,4.0,162,0.315,0.403
4,BOS,AL,2012,734,806,69,0.315,0.415,0.26,0,,,162,0.331,0.428
5,CHC,NL,2012,613,759,61,0.302,0.378,0.24,0,,,162,0.335,0.424
6,CHW,AL,2012,748,676,85,0.318,0.422,0.255,0,,,162,0.319,0.405


In [2]:
str(base)

'data.frame':	1232 obs. of  15 variables:
 $ Team        : Factor w/ 39 levels "ANA","ARI","ATL",..: 2 3 4 5 7 8 9 10 11 12 ...
 $ League      : Factor w/ 2 levels "AL","NL": 2 2 1 1 2 1 2 1 2 1 ...
 $ Year        : int  2012 2012 2012 2012 2012 2012 2012 2012 2012 2012 ...
 $ RS          : int  734 700 712 734 613 748 669 667 758 726 ...
 $ RA          : int  688 600 705 806 759 676 588 845 890 670 ...
 $ W           : int  81 94 93 69 61 85 97 68 64 88 ...
 $ OBP         : num  0.328 0.32 0.311 0.315 0.302 0.318 0.315 0.324 0.33 0.335 ...
 $ SLG         : num  0.418 0.389 0.417 0.415 0.378 0.422 0.411 0.381 0.436 0.422 ...
 $ BA          : num  0.259 0.247 0.247 0.26 0.24 0.255 0.251 0.251 0.274 0.268 ...
 $ Playoffs    : int  0 1 1 0 0 0 1 0 0 1 ...
 $ RankSeason  : int  NA 4 5 NA NA NA 2 NA NA 6 ...
 $ RankPlayoffs: int  NA 5 4 NA NA NA 4 NA NA 2 ...
 $ G           : int  162 162 162 162 162 162 162 162 162 162 ...
 $ OOBP        : num  0.317 0.306 0.315 0.331 0.335 0.319 0.305 0.3

In [3]:
summary(base)

      Team     League        Year            RS               RA        
 BAL    : 47   AL:616   Min.   :1962   Min.   : 463.0   Min.   : 472.0  
 BOS    : 47   NL:616   1st Qu.:1977   1st Qu.: 652.0   1st Qu.: 649.8  
 CHC    : 47            Median :1989   Median : 711.0   Median : 709.0  
 CHW    : 47            Mean   :1989   Mean   : 715.1   Mean   : 715.1  
 CIN    : 47            3rd Qu.:2002   3rd Qu.: 775.0   3rd Qu.: 774.2  
 CLE    : 47            Max.   :2012   Max.   :1009.0   Max.   :1103.0  
 (Other):950                                                            
       W              OBP              SLG               BA        
 Min.   : 40.0   Min.   :0.2770   Min.   :0.3010   Min.   :0.2140  
 1st Qu.: 73.0   1st Qu.:0.3170   1st Qu.:0.3750   1st Qu.:0.2510  
 Median : 81.0   Median :0.3260   Median :0.3960   Median :0.2600  
 Mean   : 80.9   Mean   :0.3263   Mean   :0.3973   Mean   :0.2593  
 3rd Qu.: 89.0   3rd Qu.:0.3370   3rd Qu.:0.4210   3rd Qu.:0.2680  
 Max.   

### Problem 1.1 - Limiting to Teams Making the Playoffs

Each row in the baseball dataset represents a team in a particular year.

How many team/year pairs are there in the whole dataset?

In [4]:
nrow(base)

### Problem 1.2 - Limiting to Teams Making the Playoffs
Though the dataset contains data from 1962 until 2012, we removed several years with shorter-than-usual seasons. Using the table() function, identify the total number of years included in this dataset.

In [5]:
table(base$Year)


1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
  20   20   20   20   20   20   20   24   24   24   24   24   24   24   26   26 
1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
  26   26   26   26   26   26   26   26   26   26   26   26   26   28   28   28 
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
  30   30   30   30   30   30   30   30   30   30   30   30   30   30   30 

In [6]:
length(table(base$Year))
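
As a quick cross-check, the same count can be obtained with `unique()`. The sketch below uses a small made-up vector (`years` is illustrative and is not taken from baseball.csv) to show that both approaches agree:

```r
# Toy vector of season years; the values are made up for illustration.
years <- c(1962, 1962, 1963, 1963, 1963, 1965)

# table() tabulates each distinct value, so its length is the number of distinct years.
n_from_table <- length(table(years))

# unique() gives the same count without building the frequency table.
n_from_unique <- length(unique(years))

print(c(n_from_table, n_from_unique))
```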

### Problem 1.3 - Limiting to Teams Making the Playoffs
Because we're only analyzing teams that made the playoffs, use the subset() function to replace baseball with a data frame limited to teams that made the playoffs (so your subsetted data frame should still be called "baseball"). How many team/year pairs are included in the new dataset?

In [7]:
baseball = subset(base, Playoffs == 1)
head(baseball)

,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG
,<fct>,<fct>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>
2,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4,5,162,0.306,0.378
3,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5,4,162,0.315,0.403
7,CIN,NL,2012,669,588,97,0.315,0.411,0.251,1,2,4,162,0.305,0.39
10,DET,AL,2012,726,670,88,0.335,0.422,0.268,1,6,2,162,0.314,0.402
19,NYY,AL,2012,804,668,95,0.337,0.453,0.265,1,3,3,162,0.311,0.419
20,OAK,AL,2012,713,614,94,0.31,0.404,0.238,1,4,4,162,0.306,0.378


In [8]:
nrow(baseball)

### Problem 1.4 - Limiting to Teams Making the Playoffs
Through the years, different numbers of teams have been invited to the playoffs. Which of the following has been the number of teams making the playoffs in some season? Select all that apply.

In [9]:
table(baseball$Year)


1962 1963 1964 1965 1966 1967 1968 1969 1970 1971 1973 1974 1975 1976 1977 1978 
   2    2    2    2    2    2    2    4    4    4    4    4    4    4    4    4 
1979 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1996 1997 
   4    4    4    4    4    4    4    4    4    4    4    4    4    4    8    8 
1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 
   8    8    8    8    8    8    8    8    8    8    8    8    8    8   10 

Answer: 2, 4, 8, 10.
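
Rather than reading the distinct playoff sizes off the printed table by eye, they could be extracted programmatically. This is a minimal sketch on made-up data (`toy_years` is hypothetical, with one entry per playoff team), not the assignment's required approach:

```r
# Toy playoff years (made up): one entry per playoff team.
toy_years <- c(rep(1962, 2), rep(1975, 4), rep(1990, 4), rep(2005, 8))

# Teams per year, then the distinct playoff sizes that occur.
teams_per_year <- table(toy_years)
distinct_sizes <- sort(unique(as.vector(teams_per_year)))

print(distinct_sizes)
```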

### Problem 2.1 - Adding an Important Predictor
It's much harder to win the World Series if there are 10 teams competing for the championship versus just two. Therefore, we will add the predictor variable NumCompetitors to the baseball data frame. NumCompetitors will contain the number of total teams making the playoffs in the year of a particular team/year pair. For instance, NumCompetitors should be 2 for the 1962 New York Yankees, but it should be 8 for the 1998 Boston Red Sox.

We start by storing the output of the table() function that counts the number of playoff teams from each year:

    PlayoffTable = table(baseball$Year)

You can output the table with the following command:

    PlayoffTable

We will use this stored table to look up the number of teams in the playoffs in the year of each team/year pair.

Just as we can use the names() function to get the names of a data frame's columns, we can use it to get the names of the entries in a table. What best describes the output of names(PlayoffTable)?

In [10]:
PlayoffTable = table(baseball$Year)
names(PlayoffTable)

### Problem 2.2 - Adding an Important Predictor
Given a vector of names, the table will return a vector of frequencies. Which function call returns the number of playoff teams in 1990 and 2001? (HINT: If you are not sure how these commands work, go ahead and try them out in your R console!)

In [11]:
PlayoffTable[c('1990', '2001')]


1990 2001 
   4    8 
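
The lookup works because the names of a table are character strings. The following sketch illustrates this on a small made-up vector (`toy_years` and `toy_table` are hypothetical names used only here):

```r
# Toy playoff years, one entry per playoff team (made-up data).
toy_years <- c(rep(1990, 4), rep(2001, 8))
toy_table <- table(toy_years)

# The names of a table are character strings, not numbers.
print(names(toy_table))

# Indexing with a character vector looks up entries by name.
by_name <- toy_table[c("1990", "2001")]
print(by_name)
```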

### Problem 2.3 - Adding an Important Predictor
Putting it all together, we want to look up the number of teams in the playoffs for each team/year pair in the dataset, and store it as a new variable named NumCompetitors in the baseball data frame. Which of the following function calls accomplishes this? (HINT: Test out the functions if you are not sure what they do.)

In [12]:
baseball$NumCompetitors = PlayoffTable[as.character(baseball$Year)]
head(baseball)

,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG,NumCompetitors
,<fct>,<fct>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<table>
2,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4,5,162,0.306,0.378,10
3,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5,4,162,0.315,0.403,10
7,CIN,NL,2012,669,588,97,0.315,0.411,0.251,1,2,4,162,0.305,0.39,10
10,DET,AL,2012,726,670,88,0.335,0.422,0.268,1,6,2,162,0.314,0.402,10
19,NYY,AL,2012,804,668,95,0.337,0.453,0.265,1,3,3,162,0.311,0.419,10
20,OAK,AL,2012,713,614,94,0.31,0.404,0.238,1,4,4,162,0.306,0.378,10
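
Note that the NumCompetitors column is displayed with type `<table>`, because indexing a table returns another table. This does not affect the answers here, but if a plain integer column is preferred, one optional variation is to wrap the lookup in `as.integer()`. The sketch below uses a made-up data frame (`toy` and its rows are hypothetical):

```r
# Toy playoff data frame (made-up rows, not from baseball.csv).
toy <- data.frame(
  Team = c("A", "B", "C", "D", "E", "F"),
  Year = c(1962, 1962, 1990, 1990, 1990, 1990)
)

# Number of playoff teams per year, keyed by the year as a string.
playoff_table <- table(toy$Year)

# Look up each row's year by name; as.integer() drops the table class and names.
toy$NumCompetitors <- as.integer(playoff_table[as.character(toy$Year)])

print(toy)
str(toy$NumCompetitors)
```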


### Problem 2.4 - Adding an Important Predictor
Add the NumCompetitors variable to your baseball data frame. How many playoff team/year pairs are there in our dataset from years where 8 teams were invited to the playoffs?

In [13]:
eightteamsplayoff = subset(baseball, NumCompetitors==8)
head(eightteamsplayoff)

,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG,NumCompetitors
,<fct>,<fct>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<table>
31,ARI,NL,2011,731,662,94,0.322,0.413,0.25,1,5,4,162,0.316,0.409,8
40,DET,AL,2011,787,711,95,0.34,0.434,0.277,1,4,3,162,0.321,0.396,8
46,MIL,NL,2011,721,638,96,0.325,0.425,0.261,1,3,3,162,0.304,0.385,8
49,NYY,AL,2011,867,657,97,0.343,0.444,0.263,1,2,4,162,0.322,0.399,8
51,PHI,NL,2011,713,529,102,0.323,0.395,0.253,1,1,4,162,0.296,0.361,8
56,STL,NL,2011,762,692,90,0.341,0.425,0.273,1,7,1,162,0.319,0.398,8


In [14]:
nrow(eightteamsplayoff)

### Problem 3.1 - Bivariate Models for Predicting World Series Winner
In this problem, we seek to predict whether a team won the World Series; in our dataset this is denoted with a RankPlayoffs value of 1. Add a variable named WorldSeries to the baseball data frame, by typing the following command in your R console:

    baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)

WorldSeries takes value 1 if a team won the World Series in the indicated year and a 0 otherwise. How many observations do we have in our dataset where a team did NOT win the World Series?

In [15]:
baseball$WorldSeries = as.numeric(baseball$RankPlayoffs == 1)
head(baseball)

,Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG,NumCompetitors,WorldSeries
,<fct>,<fct>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<table>,<dbl>
2,ATL,NL,2012,700,600,94,0.32,0.389,0.247,1,4,5,162,0.306,0.378,10,0
3,BAL,AL,2012,712,705,93,0.311,0.417,0.247,1,5,4,162,0.315,0.403,10,0
7,CIN,NL,2012,669,588,97,0.315,0.411,0.251,1,2,4,162,0.305,0.39,10,0
10,DET,AL,2012,726,670,88,0.335,0.422,0.268,1,6,2,162,0.314,0.402,10,0
19,NYY,AL,2012,804,668,95,0.337,0.453,0.265,1,3,3,162,0.311,0.419,10,0
20,OAK,AL,2012,713,614,94,0.31,0.404,0.238,1,4,4,162,0.306,0.378,10,0


In [16]:
nrow(subset(baseball, WorldSeries==0))
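
An alternative to subsetting is to count the outcomes directly with `table()` or `sum()`. A minimal sketch on made-up ranks (`rank_playoffs` is a hypothetical stand-in for the RankPlayoffs column):

```r
# Toy playoff ranks (made up): 1 marks the World Series winner.
rank_playoffs <- c(1, 2, 3, 3, 1, 2, 4, 4)

# The comparison gives TRUE/FALSE; as.numeric() converts that to 1/0.
world_series <- as.numeric(rank_playoffs == 1)

# table() counts both outcomes at once; sum() counts a single condition.
print(table(world_series))
n_not_winner <- sum(world_series == 0)
print(n_not_winner)
```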

### Problem 3.2 - Bivariate Models for Predicting World Series Winner
When we're not sure which of our variables are useful in predicting a particular outcome, it's often helpful to build bivariate models, which are models that predict the outcome using a single independent variable. Which of the following variables is a significant predictor of the WorldSeries variable in a bivariate logistic regression model? To determine significance, remember to look at the stars in the summary output of the model. We'll define an independent variable as significant if there is at least one star at the end of the coefficients row for that variable (this is equivalent to the probability column having a value smaller than 0.05). Note that you have to build 12 models to answer this question! Use the entire dataset baseball to build the models.

In [17]:
model1 = glm(WorldSeries~Year, data=baseball, family="binomial")
summary(model1)


Call:
glm(formula = WorldSeries ~ Year, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0297  -0.6797  -0.5435  -0.4648   2.1504  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 72.23602   22.64409    3.19  0.00142 **
Year        -0.03700    0.01138   -3.25  0.00115 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 228.35  on 242  degrees of freedom
AIC: 232.35

Number of Fisher Scoring iterations: 4


In [18]:
model2 = glm(WorldSeries~RS, data=baseball, family="binomial")
summary(model2)


Call:
glm(formula = WorldSeries ~ RS, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8254  -0.6819  -0.6363  -0.5561   2.0308  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.661226   1.636494   0.404    0.686
RS          -0.002681   0.002098  -1.278    0.201

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 237.45  on 242  degrees of freedom
AIC: 241.45

Number of Fisher Scoring iterations: 4


In [19]:
model3 = glm(WorldSeries~RA, data=baseball, family="binomial")
summary(model3)


Call:
glm(formula = WorldSeries ~ RA, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9749  -0.6883  -0.6118  -0.4746   2.1577  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.888174   1.483831   1.272   0.2032  
RA          -0.005053   0.002273  -2.223   0.0262 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 233.88  on 242  degrees of freedom
AIC: 237.88

Number of Fisher Scoring iterations: 4


In [20]:
model4 = glm(WorldSeries~W, data=baseball, family="binomial")
summary(model4)


Call:
glm(formula = WorldSeries ~ W, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0623  -0.6777  -0.6117  -0.5367   2.1254  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept) -6.85568    2.87620  -2.384   0.0171 *
W            0.05671    0.02988   1.898   0.0577 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 235.51  on 242  degrees of freedom
AIC: 239.51

Number of Fisher Scoring iterations: 4


In [21]:
model5 = glm(WorldSeries~OBP, data=baseball, family="binomial")
summary(model5)


Call:
glm(formula = WorldSeries ~ OBP, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8071  -0.6749  -0.6365  -0.5797   1.9753  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    2.741      3.989   0.687    0.492
OBP          -12.402     11.865  -1.045    0.296

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 238.02  on 242  degrees of freedom
AIC: 242.02

Number of Fisher Scoring iterations: 4


In [22]:
model6 = glm(WorldSeries~SLG, data=baseball, family="binomial")
summary(model6)


Call:
glm(formula = WorldSeries ~ SLG, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9498  -0.6953  -0.6088  -0.5197   2.1136  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    3.200      2.358   1.357   0.1748  
SLG          -11.130      5.689  -1.956   0.0504 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 235.23  on 242  degrees of freedom
AIC: 239.23

Number of Fisher Scoring iterations: 4


In [23]:
model7 = glm(WorldSeries~BA, data=baseball, family="binomial")
summary(model7)


Call:
glm(formula = WorldSeries ~ BA, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6797  -0.6592  -0.6513  -0.6389   1.8431  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.6392     3.8988  -0.164    0.870
BA           -2.9765    14.6123  -0.204    0.839

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 239.08  on 242  degrees of freedom
AIC: 243.08

Number of Fisher Scoring iterations: 4


In [24]:
model8 = glm(WorldSeries~RankSeason, data=baseball, family="binomial")
summary(model8)


Call:
glm(formula = WorldSeries ~ RankSeason, family = "binomial", 
    data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7805  -0.7131  -0.5918  -0.4882   2.1781  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -0.8256     0.3268  -2.527   0.0115 *
RankSeason   -0.2069     0.1027  -2.016   0.0438 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 234.75  on 242  degrees of freedom
AIC: 238.75

Number of Fisher Scoring iterations: 4


In [25]:
model9 = glm(WorldSeries~OOBP, data=baseball, family="binomial")
summary(model9)


Call:
glm(formula = WorldSeries ~ OOBP, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5318  -0.5176  -0.5106  -0.5023   2.0697  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.9306     8.3728  -0.111    0.912
OOBP         -3.2233    26.0587  -0.124    0.902

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 84.926  on 113  degrees of freedom
Residual deviance: 84.910  on 112  degrees of freedom
  (130 observations deleted due to missingness)
AIC: 88.91

Number of Fisher Scoring iterations: 4


In [26]:
model10 = glm(WorldSeries~OSLG, data=baseball, family="binomial")
summary(model10)


Call:
glm(formula = WorldSeries ~ OSLG, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5610  -0.5209  -0.5088  -0.4902   2.1268  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08725    6.07285  -0.014    0.989
OSLG        -4.65992   15.06881  -0.309    0.757

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 84.926  on 113  degrees of freedom
Residual deviance: 84.830  on 112  degrees of freedom
  (130 observations deleted due to missingness)
AIC: 88.83

Number of Fisher Scoring iterations: 4


In [27]:
model11 = glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")
summary(model11)


Call:
glm(formula = WorldSeries ~ NumCompetitors, family = "binomial", 
    data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9871  -0.8017  -0.5089  -0.5089   2.2643  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     0.03868    0.43750   0.088 0.929559    
NumCompetitors -0.25220    0.07422  -3.398 0.000678 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 226.96  on 242  degrees of freedom
AIC: 230.96

Number of Fisher Scoring iterations: 4


In [28]:
model12 = glm(WorldSeries~League, data=baseball, family="binomial")
summary(model12)


Call:
glm(formula = WorldSeries ~ League, family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6772  -0.6772  -0.6306  -0.6306   1.8509  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.3558     0.2243  -6.045  1.5e-09 ***
LeagueNL     -0.1583     0.3252  -0.487    0.626    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 238.88  on 242  degrees of freedom
AIC: 242.88

Number of Fisher Scoring iterations: 4


Answer: Year, RA, RankSeason, NumCompetitors.
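
Fitting twelve models by hand is repetitive, so the same check could be written as a loop. The following is only a sketch on simulated data: the data frame `sim`, its columns `x1` to `x3`, and the outcome `y` are all made up, and it assumes numeric predictors (for a factor such as League the coefficient row would be named `LeagueNL`, so the lookup by variable name would need adjusting).

```r
# Simulated stand-in for the playoff data; all values here are made up.
set.seed(1)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

# By construction, the outcome depends on x1 only.
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * sim$x1))

predictors <- c("x1", "x2", "x3")

# Fit one logistic regression per predictor and keep that predictor's p-value.
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "y"), data = sim, family = "binomial")
  coef(summary(fit))[v, "Pr(>|z|)"]
})

print(round(p_values, 4))

# Predictors below the 0.05 threshold used in the problem statement.
print(names(p_values)[p_values < 0.05])
```

Here `reformulate()` builds the formula `y ~ x1` (and so on) from the variable name, which avoids pasting formula strings together.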

### Problem 4.1 - Multivariate Models for Predicting World Series Winner
In this section, we'll consider multivariate models that combine the variables we found to be significant in bivariate models. Build a model using all of the variables that you found to be significant in the bivariate models. How many variables are significant in the combined model?

In [29]:
modelmult = glm(WorldSeries ~ Year+RA+RankSeason+NumCompetitors, data=baseball, family="binomial")
summary(modelmult)


Call:
glm(formula = WorldSeries ~ Year + RA + RankSeason + NumCompetitors, 
    family = "binomial", data = baseball)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0336  -0.7689  -0.5139  -0.4583   2.2195  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    12.5874376 53.6474210   0.235    0.814
Year           -0.0061425  0.0274665  -0.224    0.823
RA             -0.0008238  0.0027391  -0.301    0.764
RankSeason     -0.0685046  0.1203459  -0.569    0.569
NumCompetitors -0.1794264  0.1815933  -0.988    0.323

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 239.12  on 243  degrees of freedom
Residual deviance: 226.37  on 239  degrees of freedom
AIC: 236.37

Number of Fisher Scoring iterations: 4


### Problem 4.2 - Multivariate Models for Predicting World Series Winner
Often, variables that were significant in bivariate models are no longer significant in multivariate analysis due to correlation between the variables. Which of the following variable pairs have a high degree of correlation (a correlation greater than 0.8 or less than -0.8)?

In [30]:
cor(baseball[c("Year", "RA", "RankSeason", "NumCompetitors")])

,Year,RA,RankSeason,NumCompetitors
Year,1.0,0.4762422,0.3852191,0.9139548
RA,0.4762422,1.0,0.3991413,0.5136769
RankSeason,0.3852191,0.3991413,1.0,0.4247393
NumCompetitors,0.9139548,0.5136769,0.4247393,1.0


Answer: Year & NumCompetitors.  
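
With more variables, scanning the correlation matrix by eye becomes error-prone. One way to list the highly correlated pairs automatically is sketched below on simulated data (the data frame `d` and its columns are made up; `b` is deliberately built to track `a`):

```r
# Simulated data (made up): b is built to track a closely, z is independent.
set.seed(2)
a <- rnorm(100)
d <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), z = rnorm(100))

cm <- cor(d)

# Keep only the upper triangle so each pair appears once and the diagonal is skipped.
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)

# Translate row/column positions back to variable names.
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)
print(high_pairs)
```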

### Problem 4.3 - Multivariate Models for Predicting World Series Winner
Build all six of the two-variable models listed in the previous problem. Together with the four bivariate models, you should have 10 different logistic regression models. Which model has the best AIC value (the minimum AIC value)?

In [31]:
glm(WorldSeries~Year, data=baseball, family="binomial")$aic

In [32]:
glm(WorldSeries~RA, data=baseball, family="binomial")$aic

In [33]:
glm(WorldSeries~RankSeason, data=baseball, family="binomial")$aic 

In [34]:
glm(WorldSeries~NumCompetitors, data=baseball, family="binomial")$aic

In [35]:
glm(WorldSeries~Year+RA, data=baseball, family="binomial")$aic

In [36]:
glm(WorldSeries~Year+RankSeason, data=baseball, family="binomial")$aic

In [37]:
glm(WorldSeries~Year+NumCompetitors, data=baseball, family="binomial")$aic

In [38]:
glm(WorldSeries~RA+RankSeason, data=baseball, family="binomial")$aic

In [39]:
glm(WorldSeries~RA+NumCompetitors, data=baseball, family="binomial")$aic

In [40]:
glm(WorldSeries~RankSeason+NumCompetitors, data=baseball, family="binomial")$aic

Answer: The best model is the one that uses only NumCompetitors, with AIC = 230.96.
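
The ten fits above could also be generated and ranked in one pass. This is a hedged sketch on simulated data, not the notebook's own method: `sim`, the columns `x1` to `x4`, and `y` are made up, and `combn()` is used to enumerate the six variable pairs.

```r
# Simulated stand-in data; variable names and values are made up.
set.seed(3)
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + sim$x1))

vars <- c("x1", "x2", "x3", "x4")

# All single variables plus all unordered pairs: 4 + 6 = 10 candidate models.
candidates <- c(as.list(vars), combn(vars, 2, simplify = FALSE))

# Fit each candidate and record its AIC under a readable label.
aics <- sapply(candidates, function(v) {
  AIC(glm(reformulate(v, response = "y"), data = sim, family = "binomial"))
})
names(aics) <- sapply(candidates, paste, collapse = " + ")

# Smallest AIC first.
print(sort(aics))
```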