### Homework 8 Guide
<br>
In this guide, we will evaluate classification models and apply feature selection to find the best model for our soccer database. Make sure you have the soccer database downloaded before working through this guide. As always, we'll need some libraries to get started.

In [1]:
import pandas as pd
import sqlite3
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt

To start, we need to connect to the soccer database. Let's do that below.

In [2]:
# Your Code Here

# importing soccer data
conn = sqlite3.connect("database.sqlite")
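Before querying, it can help to confirm which tables the connection exposes. The sketch below uses an in-memory database with a stand-in `Player_Attributes` table (the toy schema and row are assumptions for illustration, not the real soccer data). Against the real file you would query `conn` from the cell above instead.

```python
import sqlite3

# In-memory stand-in for database.sqlite (toy schema, made up for illustration)
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Player_Attributes (strength REAL, stamina REAL, jumping REAL)")
demo_conn.execute("INSERT INTO Player_Attributes VALUES (76.0, 54.0, 58.0)")

# sqlite_master lists every table stored in the database
table_names = [row[0] for row in demo_conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(table_names)
demo_conn.close()
```

Note that `sqlite3.connect()` silently creates an empty database file if the path does not exist, so an empty table list usually means the path is wrong.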


Let's grab the `strength`, `stamina`, and `jumping` columns from the `Player_Attributes` table.

In [3]:
# Reading Player_Attributes table to dataframe

player_attr_df = pd.read_sql("SELECT strength, stamina, jumping FROM Player_Attributes", conn)

# Filling with 11 for all null values
player_attr_df.fillna(11, inplace=True)
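Filling nulls with a constant changes the label distribution, so it is worth counting how many values were missing before the fill. A minimal sketch on a made-up frame (the values are illustrative and not taken from the database):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for player_attr_df (values are made up)
demo_df = pd.DataFrame({
    "strength": [76.0, np.nan, 60.0],
    "stamina": [54.0, 70.0, np.nan],
    "jumping": [58.0, np.nan, 65.0],
})

# Count missing values per column before replacing them
null_counts = demo_df.isna().sum()
print(null_counts)

# fillna returns a new frame here; the guide's inplace=True mutates instead
filled_df = demo_df.fillna(11)
print(filled_df["jumping"].value_counts())
```

Rows filled this way all share the label 11, so they form a class of their own when `jumping` is used as the target.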

Now let's grab our `x` and `y`. Use strength and stamina for `x`, and jumping for `y`.

In [4]:
x = player_attr_df[['strength', 'stamina']].values
y = player_attr_df['jumping'].values  # 1-D target array, as scikit-learn expects
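The shape of `y` matters: selecting with double brackets yields a one-column DataFrame whose `.values` has shape `(n, 1)`, while scikit-learn estimators expect a 1-D target and some of them warn when given a column vector. The sketch below shows the difference on a toy frame (made-up values):

```python
import pandas as pd

# Toy frame standing in for player_attr_df (values are made up)
demo_df = pd.DataFrame({"strength": [76.0, 60.0], "jumping": [58.0, 65.0]})

y_2d = demo_df[["jumping"]].values   # one-column DataFrame -> shape (2, 1)
y_1d = demo_df["jumping"].values     # Series -> shape (2,)
print(y_2d.shape, y_1d.shape)

# ravel() flattens an existing column vector without re-selecting
print(y_2d.ravel().shape)
```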

To get started, we need to split the data. Using `train_test_split()`, hold out 30% of the sample as the test set.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)
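`test_size=0.3` reserves 30% of the rows for testing and leaves the remaining 70% for training, and fixing `random_state` makes the shuffle reproducible. A small sketch on placeholder data (the ten samples are made up):

```python
from sklearn.model_selection import train_test_split

samples = list(range(10))          # ten placeholder samples
labels = [i % 2 for i in samples]  # placeholder labels

train_x, test_x, train_y, test_y = train_test_split(
    samples, labels, test_size=0.3, random_state=0)

# 30% of 10 rows go to the test set, the rest to training
print(len(train_x), len(test_x))
```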

Now we can set up our `DecisionTreeClassifier` with a grid search over its parameters. Run the cell below to set it up. You may get a warning regarding the split. This is okay for completing this guide.

In [7]:
decision_tree_params_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40, 50],
    'splitter': ["best", "random"],
    'random_state': [0, 1, 2, 4, 6, 8, 10, 12, 14, 16, 20, 40, 42],
}
grid_search_decision_tree_classifier = GridSearchCV(DecisionTreeClassifier(), decision_tree_params_grid, cv=10)
grid_search_decision_tree_classifier.fit(X_train, y_train)

print("Decision Tree best grid score: " + str(grid_search_decision_tree_classifier.best_score_))
print("Decision Tree grid test score: " + str(grid_search_decision_tree_classifier.score(X_test, y_test)))

decision_tree_best_params = grid_search_decision_tree_classifier.best_params_
print("Decision Tree best params: " + str(decision_tree_best_params))



Decision Tree best grid score: 0.1615495474917537
Decision Tree grid test score: 0.1622821321158097
Decision Tree best params: {'criterion': 'entropy', 'max_depth': 30, 'random_state': 4, 'splitter': 'random'}
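The grid search can take a while because every combination of parameter values is fitted once per cross-validation fold. The sketch below counts the candidates for the grid used in this guide, using only the standard library:

```python
from itertools import product

# Same values as the guide's parameter grid
param_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 30, 40, 50],
    "splitter": ["best", "random"],
    "random_state": [0, 1, 2, 4, 6, 8, 10, 12, 14, 16, 20, 40, 42],
}

# Every combination of values is one candidate model
n_candidates = len(list(product(*param_grid.values())))
n_folds = 10  # matches cv=10
print(n_candidates, n_candidates * n_folds)
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set, which is what `predict()` and `score()` then use.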


Now we can run `predict()` on our `grid_search_decision_tree_classifier`.

In [8]:
y_pred = grid_search_decision_tree_classifier.predict(X_test)

Let's look at the resulting report. Call `classification_report()` below.

In [9]:
grid_search_decision_tree_classification_report = classification_report(y_test, y_pred)
print("Decision Tree Classification report with whole data")
print(grid_search_decision_tree_classification_report)

Decision Tree Classification report with whole data
              precision    recall  f1-score   support

        11.0       0.82      0.32      0.47       757
        14.0       0.00      0.00      0.00         1
        20.0       0.00      0.00      0.00         1
        21.0       0.00      0.00      0.00         6
        22.0       0.00      0.00      0.00         6
        24.0       0.00      0.00      0.00         3
        25.0       0.30      0.25      0.27        12
        26.0       0.00      0.00      0.00         5
        27.0       0.40      0.30      0.34        20
        28.0       0.40      0.43      0.41        14
        29.0       0.16      0.23      0.19        13
        30.0       0.16      0.07      0.10        54
        31.0       0.37      0.12      0.18        59
        32.0       0.19      0.16      0.17       127
        33.0       0.22      0.30      0.25       172
        34.0       0.14      0.16      0.15       277
        35.0       0.75      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
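Rows of zeros in the report usually mean the classifier never predicted that class (precision is then 0/0, which scikit-learn reports as 0 along with a warning) or never got it right. A hand-rolled sketch of the per-class arithmetic on made-up labels, where the helper `precision_recall` is hypothetical and not part of scikit-learn:

```python
def precision_recall(y_true, y_pred, label):
    """Per-class precision and recall; returns None where the ratio is 0/0."""
    true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    predicted = sum(p == label for p in y_pred)   # TP + FP
    actual = sum(t == label for t in y_true)      # TP + FN
    precision = true_pos / predicted if predicted else None
    recall = true_pos / actual if actual else None
    return precision, recall

# Made-up labels: class 14 occurs once and is never predicted
y_true_demo = [11, 11, 14, 33, 33, 33]
y_pred_demo = [11, 33, 33, 33, 33, 11]

print(precision_recall(y_true_demo, y_pred_demo, 11))
print(precision_recall(y_true_demo, y_pred_demo, 14))
```

In recent scikit-learn versions, `classification_report()` accepts a `zero_division` argument that controls the value reported in the undefined case.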


Now we need to choose which features to use, which we can do with `SelectFromModel()`. Then, let's run `fit()` on `select` and transform both the training and test sets.

In [15]:
# Selecting features using RandomForestClassifier
select = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42), threshold='median')
select.fit(X_train, y_train)
X_train_selected = select.transform(X_train)
X_test_selected = select.transform(X_test)

  self.estimator_.fit(X, y, **fit_params)
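With only two input features, `threshold='median'` places the cutoff midway between the two importances, so (unless they tie) only the more important feature survives. The sketch below illustrates this on synthetic data. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
# Two made-up features: the first determines the label, the second is noise
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

selector = SelectFromModel(
    RandomForestClassifier(n_estimators=20, random_state=42),
    threshold="median",
)
selector.fit(X_demo, y_demo)

print(selector.estimator_.feature_importances_)  # one importance per feature
print(selector.get_support())                    # boolean mask of kept features
print(selector.transform(X_demo).shape)          # columns remaining after selection
```

Checking `get_support()` on `select` shows which of `strength` and `stamina` was kept. If only one column remains, a weaker report after selection would not be surprising, since the tree then sees less information.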


Now let's apply the best params from the grid search. Assign the respective fields from `decision_tree_best_params` to your classifier.

In [16]:
# Applying DecisionTreeClassifier using the best params from the grid search and with selected data
decision_tree_classifier = DecisionTreeClassifier(
    criterion=decision_tree_best_params['criterion'],
    max_depth=decision_tree_best_params['max_depth'],
    random_state=decision_tree_best_params['random_state'],
    splitter=decision_tree_best_params['splitter'])
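Because the keys of `best_params_` match the constructor's keyword arguments, the dictionary can also be unpacked directly. A minimal sketch, with the dictionary copied from the grid search output printed earlier and hypothetical names `best_params_demo` and `tree_demo`:

```python
from sklearn.tree import DecisionTreeClassifier

# Best parameters as printed by the grid search earlier in the guide
best_params_demo = {"criterion": "entropy", "max_depth": 30,
                    "random_state": 4, "splitter": "random"}

# ** unpacks the dictionary into keyword arguments
tree_demo = DecisionTreeClassifier(**best_params_demo)
print(tree_demo.get_params()["max_depth"])
```

Spelling out each field, as the guide does, is more verbose but makes it obvious which parameters are being set.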

We need to run the `fit()` function using `X_train_selected` and `y_train` as parameters. Then, run `predict()` using `X_test_selected`.

In [17]:
decision_tree_classifier.fit(X_train_selected, y_train)
y_pred = decision_tree_classifier.predict(X_test_selected)

Lastly, rerun `classification_report()` and print your results.

In [18]:
# Your Code Here

In [19]:
decision_tree_classification_report = classification_report(y_test, y_pred)
print("Decision Tree Classification report with selected data")
print(decision_tree_classification_report)

Decision Tree Classification report with selected data
              precision    recall  f1-score   support

        11.0       1.00      0.31      0.47       757
        14.0       0.00      0.00      0.00         1
        20.0       0.00      0.00      0.00         1
        21.0       0.00      0.00      0.00         6
        22.0       0.00      0.00      0.00         6
        24.0       0.00      0.00      0.00         3
        25.0       0.00      0.00      0.00        12
        26.0       0.00      0.00      0.00         5
        27.0       0.00      0.00      0.00        20
        28.0       0.00      0.00      0.00        14
        29.0       0.00      0.00      0.00        13
        30.0       0.00      0.00      0.00        54
        31.0       0.00      0.00      0.00        59
        32.0       0.00      0.00      0.00       127
        33.0       0.00      0.00      0.00       172
        34.0       0.00      0.00      0.00       277
        35.0       0.00   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
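To compare the two runs with a single number, `accuracy_score()` (imported at the top of the guide) gives the fraction of correct predictions. The sketch below uses made-up labels and predictions, not results from the soccer data:

```python
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration
y_true_demo = [11, 33, 34, 34, 35]
pred_full = [11, 33, 34, 35, 35]      # stand-in for the model trained on all features
pred_selected = [11, 34, 34, 34, 34]  # stand-in for the model trained on selected features

acc_full = accuracy_score(y_true_demo, pred_full)
acc_selected = accuracy_score(y_true_demo, pred_selected)
print(acc_full, acc_selected)
```

In the guide, the same comparison would use `y_test` against the predictions from each classifier.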
