Question 3

Use the dataset available at http://users.stat.ufl.edu/~winner/data/nfl2008_fga.csv

* Use LDA to classify the dataset into a few classes so that at least 90% of the information in the dataset is explained by the new classification. (Hint: model the variable “qtr” on the variables “togo”, “kicker”, and “ydline”.) How many LDs do you choose? Explain your reasoning.
* Apply PCA and identify the important principal components that capture at least 90% of the dataset's variation. Explain your decision strategy. Plot the principal components against their variance.

In [62]:
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
import warnings

In [63]:
nfl_df = pd.read_csv("http://users.stat.ufl.edu/~winner/data/nfl2008_fga.csv")
nfl_df.head()

Unnamed: 0,GameDate,AwayTeam,HomeTeam,qtr,min,sec,kickteam,def,down,togo,...,distance,homekick,kickdiff,timerem,offscore,defscore,season,GOOD,Missed,Blocked
0,20081130,IND,CLE,1,47,2,IND,CLE,4.0,11.0,...,30,0,-3,2822,0,3,2008,1,0,0
1,20081005,IND,HOU,1,54,47,IND,HOU,4.0,3.0,...,46,0,0,3287,0,0,2008,1,0,0
2,20081228,TEN,IND,1,45,20,IND,TEN,4.0,3.0,...,28,1,7,2720,7,0,2008,1,0,0
3,20081012,BAL,IND,1,45,42,IND,BAL,4.0,1.0,...,37,1,14,2742,14,0,2008,1,0,0
4,20080907,CHI,IND,1,50,56,IND,CHI,4.0,21.0,...,39,1,0,3056,0,0,2008,1,0,0


In [64]:
nfl_df["qtr"].unique()

array([1, 2, 3, 4, 5], dtype=int64)

In [65]:
nfl_df.isna().sum()

GameDate    0
AwayTeam    0
HomeTeam    0
qtr         0
min         0
sec         0
kickteam    0
def         0
down        2
togo        2
kicker      0
ydline      0
name        0
distance    0
homekick    0
kickdiff    0
timerem     0
offscore    0
defscore    0
season      0
GOOD        0
Missed      0
Blocked     0
dtype: int64

In [66]:
nfl_df = nfl_df.fillna(0)
nfl_df.isna().sum()

GameDate    0
AwayTeam    0
HomeTeam    0
qtr         0
min         0
sec         0
kickteam    0
def         0
down        0
togo        0
kicker      0
ydline      0
name        0
distance    0
homekick    0
kickdiff    0
timerem     0
offscore    0
defscore    0
season      0
GOOD        0
Missed      0
Blocked     0
dtype: int64

In [67]:
nfl_df = nfl_df.drop(["GameDate","AwayTeam","HomeTeam","kickteam","def", "name"], axis=1)

In [68]:
nfl_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1039 entries, 0 to 1038
Data columns (total 17 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   qtr       1039 non-null   int64  
 1   min       1039 non-null   int64  
 2   sec       1039 non-null   int64  
 3   down      1039 non-null   float64
 4   togo      1039 non-null   float64
 5   kicker    1039 non-null   int64  
 6   ydline    1039 non-null   int64  
 7   distance  1039 non-null   int64  
 8   homekick  1039 non-null   int64  
 9   kickdiff  1039 non-null   int64  
 10  timerem   1039 non-null   int64  
 11  offscore  1039 non-null   int64  
 12  defscore  1039 non-null   int64  
 13  season    1039 non-null   int64  
 14  GOOD      1039 non-null   int64  
 15  Missed    1039 non-null   int64  
 16  Blocked   1039 non-null   int64  
dtypes: float64(2), int64(15)
memory usage: 138.1 KB


In [69]:
X = nfl_df.drop("qtr", axis=1)
y = nfl_df["qtr"]
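The cell above keeps every remaining column as a predictor, while the hint in the question names only `togo`, `kicker`, and `ydline`. Clock columns such as `min` and `timerem` appear to encode the quarter almost directly, which may be why the accuracies reported further down are close to 1.0. A minimal sketch of restricting the predictors to the hinted columns follows; `demo_df` is a synthetic stand-in with made-up values, and `hint_features`, `X_hint`, and `y_hint` are hypothetical names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for nfl_df (assumption: same column names, random values).
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "qtr": rng.integers(1, 6, size=200),
    "togo": rng.integers(1, 20, size=200).astype(float),
    "kicker": rng.integers(1, 40, size=200),
    "ydline": rng.integers(1, 50, size=200),
    "timerem": rng.integers(0, 3600, size=200),  # clock column, left out below
})

# Predictors named in the question's hint.
hint_features = ["togo", "kicker", "ydline"]
X_hint = demo_df[hint_features]
y_hint = demo_df["qtr"]

print(X_hint.shape, y_hint.shape)
```

With the real data, the same selection would be `nfl_df[hint_features]`, and the downstream cells could then be run on that reduced matrix for comparison.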

In [70]:
clfs = {
    "DT" : DecisionTreeClassifier(),
    "LR" : LogisticRegression(),
    "RFC" : RandomForestClassifier(),
    "GB" : GaussianNB()
}

In [71]:
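# Silence convergence warnings before fitting rather than after.
warnings.filterwarnings("ignore")

# random_state is fixed, so repeating the split would reproduce the same
# partition every time; a single split is enough.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

accuracy = {}
y_preds = {}
for key, clf in clfs.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy[key] = accuracy_score(y_test, y_pred)
    y_preds[key] = y_pred  # store the predictions, not the dict itself
pd.DataFrame([accuracy])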

Unnamed: 0,DT,LR,RFC,GB
0,1.0,0.996795,1.0,0.945513
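A single 70/30 split yields one accuracy estimate per model. If the goal is an estimate averaged over several partitions, k-fold cross-validation is one option. The sketch below uses synthetic data from `make_classification` as a stand-in for the NFL features, so the printed scores say nothing about the real dataset; `X_demo`, `y_demo`, and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class problem (assumption: stands in for X and y above).
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# Five-fold cross-validation; for classifiers scikit-learn stratifies the folds.
scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Reporting the mean together with the spread across folds gives a sense of how stable each accuracy figure is.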


#### Using LDA

LDA creates linear combinations of the original features. Each linear discriminant is a weighted sum of the inputs, for example (illustrative coefficients):

`LD1 = 0.5 * x1 + 0.2 * x2 + 0.3 * x3`
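To make the weighted-sum idea concrete, the sketch below fits scikit-learn's `LinearDiscriminantAnalysis` on synthetic data and rebuilds the first discriminant by hand from the fitted `scalings_` and `xbar_` attributes. It assumes the default `solver="svd"`, for which the transform centers the data on the overall mean before applying the weights. All variable names and data here are illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n_per_class, n_features = 60, 3

# Three synthetic classes whose means are shifted apart.
X_demo = np.vstack(
    [rng.normal(loc=shift, size=(n_per_class, n_features)) for shift in (0.0, 1.5, 3.0)]
)
y_demo = np.repeat([0, 1, 2], n_per_class)

lda_demo = LinearDiscriminantAnalysis(n_components=1)  # default solver is "svd"
ld1 = lda_demo.fit_transform(X_demo, y_demo)

# First column of scalings_ holds the weights of the first discriminant.
weights = lda_demo.scalings_[:, 0]
manual_ld1 = (X_demo - lda_demo.xbar_) @ weights

print(weights)
print(np.allclose(manual_ld1, ld1[:, 0]))
```

The entries of `weights` play the role of the coefficients 0.5, 0.2, and 0.3 in the illustrative formula.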

In [72]:
lda_based_model = LinearDiscriminantAnalysis(n_components=1)
lda_X = lda_based_model.fit_transform(X, y)

In [73]:
var_ratio_lda = lda_based_model.explained_variance_ratio_
var_ratio_lda

array([0.99244456])
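Because the model was fitted with `n_components=1`, `explained_variance_ratio_` holds only the first discriminant's share. To see how the variance is distributed across every available discriminant (at most `min(n_classes - 1, n_features)` of them), one option is to fit with `n_components=None`. The sketch below does this on synthetic five-class data, so the numbers it prints are not the NFL values.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n_classes, n_per_class, n_features = 5, 80, 6

# Five synthetic classes with random mean vectors (stand-in for qtr = 1..5).
class_means = rng.normal(scale=2.0, size=(n_classes, n_features))
X_demo = np.vstack(
    [rng.normal(loc=mean, size=(n_per_class, n_features)) for mean in class_means]
)
y_demo = np.repeat(np.arange(1, n_classes + 1), n_per_class)

# n_components=None keeps all min(n_classes - 1, n_features) discriminants.
lda_full = LinearDiscriminantAnalysis(n_components=None)
lda_full.fit(X_demo, y_demo)

ratios = lda_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)
print(ratios)
print(cumulative)
```

The number of LDs to keep is then the first position at which `cumulative` reaches the chosen threshold, such as 0.90.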

#### Creating a function to select n_components

In [75]:
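def n_components(var_ratio_lda, des_var) -> int:
    """Return the smallest number of leading components whose cumulative
    explained-variance ratio reaches des_var.

    If the threshold is never reached, all components are returned.
    """
    variance = 0.0
    n_comp = 0
    for var_ratio in var_ratio_lda:
        variance = variance + var_ratio
        n_comp += 1
        if variance >= des_var:
            break
    return n_comp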

In [77]:
comp = n_components(var_ratio_lda, 0.90)
comp

1
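The same selection rule can also be written with NumPy's cumulative sum. This is a sketch of an equivalent formulation, not a replacement for the function above; `components_for_threshold` is a hypothetical name, and the second ratio vector is made up to show a case where more than one component is needed.

```python
import numpy as np

def components_for_threshold(ratios, target):
    """Smallest k such that the first k ratios sum to at least target."""
    cumulative = np.cumsum(ratios)
    # Index of the first cumulative value >= target.
    first_hit = int(np.searchsorted(cumulative, target, side="left"))
    # If the target is never reached, fall back to all components.
    return min(first_hit + 1, len(cumulative))

# Ratio reported by the fitted model above: one LD already exceeds 90%.
print(components_for_threshold([0.99244456], 0.90))

# Made-up spectrum where the first component alone is not enough.
print(components_for_threshold([0.70, 0.25, 0.04, 0.01], 0.90))
```

Since the first linear discriminant explains about 99.2% of the between-class variance, a single LD satisfies the 90% requirement.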

In [79]:
lda_df = pd.DataFrame(lda_X)
lda_df.columns = ['ldaOne']

In [82]:
# Silence convergence warnings before fitting rather than after.
warnings.filterwarnings("ignore")

# random_state is fixed, so a single split on the LDA-projected data is enough.
X_train, X_test, y_train, y_test = train_test_split(lda_X, y, test_size=0.3, random_state=42)

accuracy = {}
y_preds = {}
for key, clf in clfs.items():
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    accuracy[key] = accuracy_score(y_test, y_pred)
    y_preds[key] = y_pred  # store the predictions, not the dict itself
pd.DataFrame([accuracy])

Unnamed: 0,DT,LR,RFC,GB
0,0.980769,0.971154,0.980769,0.967949
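For the PCA part of the question, one possible approach is to standardize the features and let scikit-learn choose the number of components that retains the requested share of variance. Passing a float between 0 and 1 as `n_components` (with `svd_solver="full"`) does this selection. The sketch below runs on synthetic data, and standardizing first is an assumption made because the NFL columns are on different scales.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic data: 8 observed features driven by 3 latent factors plus noise.
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 8))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(300, 8))

# Standardize so that no feature dominates purely because of its units.
X_scaled = StandardScaler().fit_transform(X_demo)

# Keep as many components as needed to retain at least 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
pca.fit(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_)
print(cumulative)
```

Plotting `pca.explained_variance_ratio_` against the component index (a scree plot) would cover the plotting part of the question; the plot is omitted here to keep the sketch free of extra dependencies.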
