# Hyperparameter testing

In this section, we fine-tune our model to get better results.

<hr />

Import the cleaned and aggregated data produced by [02_data_cleaning_and_aggregation.ipynb](./02_data_cleaning_and_aggregation.ipynb).

In [2]:
import pandas as pd

df=pd.read_json("./datasets/generated/cleaned_aggregated_data.json")

In [8]:
df

|  | tmID | year | playoff | averageWinRate | averagePoints | averageRebounds | averageAssists | averageSteals | averageBlocks | averageTurnovers | averageFGRatio | averageFTRatio | averageThreeRatio | coachWinRate | numberOfAwardedPlayers |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WAS | 10 | Y | 0.294118 | 189.512143 | 84.572143 | 42.982857 | 19.696786 | 7.073929 | 44.290357 | 0.408655 | 0.717564 | 0.306577 | 0.500000 | 0 |
| 1 | WAS | 9 | N | 0.470588 | 189.708943 | 82.593336 | 36.207412 | 20.147909 | 8.820809 | 38.090105 | 0.402735 | 0.800336 | 0.276818 | 0.500000 | 2 |
| 2 | WAS | 8 | N | 0.529412 | 239.624444 | 95.275556 | 50.680741 | 25.660000 | 10.042222 | 44.559259 | 0.434105 | 0.760861 | 0.303131 | 0.529412 | 2 |
| 3 | WAS | 7 | Y | 0.470588 | 216.456929 | 88.321161 | 42.381086 | 23.959738 | 8.204120 | 39.441011 | 0.422553 | 0.743379 | 0.337256 | 0.470588 | 2 |
| 4 | WAS | 6 | N | 0.500000 | 193.306310 | 92.482727 | 39.100875 | 21.625518 | 11.273146 | 39.394288 | 0.429846 | 0.706172 | 0.334885 | 0.437500 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 121 | CHA | 4 | Y | 0.562500 | 217.947214 | 88.550063 | 49.340595 | 23.384583 | 9.538333 | 38.548387 | 0.413633 | 0.758494 | 0.312949 | 0.500000 | 0 |
| 122 | CHA | 3 | Y | 0.562500 | 167.550824 | 73.576923 | 39.324863 | 17.955357 | 8.519231 | 37.429258 | 0.380137 | 0.774988 | 0.294400 | 0.562500 | 0 |
| 123 | CHA | 2 | Y | 0.250000 | 205.724760 | 77.169872 | 57.090144 | 23.515625 | 7.609375 | 44.546474 | 0.421223 | 0.750754 | 0.273636 | 0.281250 | 0 |
| 124 | ATL | 10 | Y | 0.117647 | 221.436923 | 105.233846 | 44.233846 | 25.540000 | 11.933846 | 44.327692 | 0.431086 | 0.707394 | 0.299977 | 0.117647 | 5 |


## Hyperparameter testing with some attributes

In [3]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split # Import train_test_split function

# Select a subset of feature columns and the target label ('Y' or 'N'),
# then split them into training and test sets
features=['averagePoints','averageRebounds','averageAssists','averageFGRatio', 'averageThreeRatio', 'coachWinRate', 'numberOfAwardedPlayers']
target = 'playoff'
x = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1) # 70% training and 30% test

# Logistic Regression
logreg = LogisticRegression(max_iter=1000)  # raise max_iter so the solver can converge
logreg.fit(X_train, y_train)
logreg_predictions = logreg.predict(X_test)
logreg_accuracy = accuracy_score(y_test, logreg_predictions)
print("Logistic Regression Accuracy:", logreg_accuracy)

# Support Vector Machine
svm = SVC(kernel='linear')  # You can change the kernel as needed (e.g., 'rbf' for radial basis function)
svm.fit(X_train, y_train)
svm_predictions = svm.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)

print("SVM Accuracy:", svm_accuracy)

Logistic Regression Accuracy: 0.7368421052631579
SVM Accuracy: 0.6842105263157895
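
The comment in the cell above mentions that the SVM kernel can be changed. A minimal sketch of how two kernels might be compared is shown below. It is not part of the original notebook: it uses synthetic data from `make_classification` as a stand-in for `df`, adds a `StandardScaler` step (an assumption, since the notebook fits on unscaled features), and uses 5-fold cross-validation instead of a single split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's 7 selected features and playoff labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=125, n_features=7, n_informative=4, random_state=1
)

kernel_scores = {}
for kernel in ("linear", "rbf"):
    # Scale inside the pipeline so each CV fold is scaled using its own training part only
    pipeline = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
    kernel_scores[kernel] = scores.mean()

for kernel, score in kernel_scores.items():
    print(f"{kernel}: mean CV accuracy = {score:.3f}")
```

To try this on the real data, `X_demo` and `y_demo` would be replaced with `df[features]` and `df[target]`.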


## Hyperparameter testing with all attributes

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split # Import train_test_split function

# Select all feature columns and the target label ('Y' or 'N'),
# then split them into training and test sets
features=['averageWinRate', 'averagePoints','averageRebounds','averageAssists', 'averageSteals', 'averageBlocks', 'averageTurnovers', 'averageFGRatio', 'averageFTRatio', 'averageThreeRatio', 'coachWinRate', 'numberOfAwardedPlayers']
target = 'playoff'
x = df[features]
y = df[target]

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1) # 70% training and 30% test

# Logistic Regression
logreg = LogisticRegression(max_iter=1000)  # raise max_iter so the solver can converge
logreg.fit(X_train, y_train)
logreg_predictions = logreg.predict(X_test)
logreg_accuracy = accuracy_score(y_test, logreg_predictions)
print("Logistic Regression Accuracy:", logreg_accuracy)

# Support Vector Machine
svm = SVC(kernel='linear')  # You can change the kernel as needed (e.g., 'rbf' for radial basis function)
svm.fit(X_train, y_train)
svm_predictions = svm.predict(X_test)
svm_accuracy = accuracy_score(y_test, svm_predictions)

print("SVM Accuracy:", svm_accuracy)

Logistic Regression Accuracy: 0.6842105263157895
SVM Accuracy: 0.6842105263157895


## Fine tuning and testing

In [4]:
from itertools import combinations
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import pandas as pd

features = ['averageWinRate', 'averagePoints', 'averageRebounds', 'averageAssists', 
            'averageSteals', 'averageBlocks', 'averageTurnovers', 'averageFGRatio', 
            'averageFTRatio', 'averageThreeRatio', 'coachWinRate', 'numberOfAwardedPlayers']

target = 'playoff'  
X = df[features]
y = df[target]

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

best_accuracy = 0
best_feature_combination = None
best_hyperparameters = None  # initialised so the final print cannot hit an undefined name

# Iterate through all possible feature combinations
for i in range(1, len(features) + 1):
    for subset in combinations(features, i):
        selected_features = list(subset)
        X_train_subset = X_train[selected_features]
        X_test_subset = X_test[selected_features]
        
        # Define hyperparameters and their possible values to search
        param_grid = {
            'C': [0.1, 1, 10],
            'penalty': ['l2']
        }
        
        # Create a Logistic Regression model
        model = LogisticRegression(solver='liblinear')
        
        # Perform grid search with cross-validation
        grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
        grid_search.fit(X_train_subset, y_train)
        
        # Get the best parameters and the best model
        best_params = grid_search.best_params_
        best_model = grid_search.best_estimator_
        
        # Evaluate the model on the test set
        accuracy = best_model.score(X_test_subset, y_test)
        
        # Check if this combination is the best so far
        if accuracy > best_accuracy:
            best_accuracy = accuracy
            best_feature_combination = selected_features
            best_hyperparameters = best_params

print("Best Feature Combination:", best_feature_combination)
print("Best Hyperparameters:", best_hyperparameters)
print("Best Model Accuracy:", best_accuracy)


Best Feature Combination: ['averageWinRate', 'averageRebounds', 'averageAssists', 'averageTurnovers', 'averageFGRatio', 'coachWinRate']
Best Hyperparameters: {'C': 1, 'penalty': 'l2'}
Best Model Accuracy: 0.8157894736842105
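
The loop above chooses the winning feature combination by its accuracy on the test set, so the reported accuracy comes from the same data used for the choice. One possible variation, sketched below, selects the combination by the cross-validated score on the training set (`best_score_` from `GridSearchCV`) and touches the test set only once at the end. This is a sketch under assumptions, not the notebook's method: it runs on synthetic data with 5 features to stay fast, searches only `C`, and uses scikit-learn's default solver rather than `liblinear`.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for df[features] and df[target] (assumption)
X_demo, y_demo = make_classification(
    n_samples=125, n_features=5, n_informative=3, n_redundant=1, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

param_grid = {"C": [0.1, 1, 10]}
best_cv_score, best_columns, best_search = -1.0, None, None

n_features = X_tr.shape[1]
for size in range(1, n_features + 1):
    for columns in combinations(range(n_features), size):
        search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
        search.fit(X_tr[:, list(columns)], y_tr)
        # Compare subsets on the cross-validated training score, not on the test set
        if search.best_score_ > best_cv_score:
            best_cv_score, best_columns, best_search = search.best_score_, columns, search

# The held-out test set is used once, for the final estimate only
test_accuracy = best_search.score(X_te[:, list(best_columns)], y_te)

print("Best columns:", best_columns)
print("Best hyperparameters:", best_search.best_params_)
print("CV accuracy:", round(best_cv_score, 3), "Test accuracy:", round(test_accuracy, 3))
```

With the notebook's 12 features the same loop visits 4095 subsets, so it takes noticeably longer than this 5-feature demo.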


## Conclusion

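Comparing the accuracies reported above, the fine-tuned logistic regression (82%) outperforms the base logistic regression, which reached 74% with the selected attributes and 68% with all attributes. The fine-tuned run uses a different split (`random_state=42` instead of `1`) and picks the feature combination by its test-set accuracy, so the 82% figure is likely optimistic.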
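
To put these accuracies in perspective, they can be converted back into counts of correctly classified test rows. The sketch below assumes the DataFrame has 125 rows (indices 0 to 124 in the display above) and relies on `train_test_split` rounding the test size up, which gives 38 test rows for `test_size=0.3`.

```python
import math

n_rows = 125      # assumption: rows 0..124 in the displayed DataFrame
test_size = 0.3
n_test = math.ceil(test_size * n_rows)  # train_test_split rounds the test size up

# Accuracies copied from the notebook outputs
reported = {
    "logreg, some attributes": 0.7368421052631579,
    "logreg, all attributes": 0.6842105263157895,
    "fine-tuned logreg": 0.8157894736842105,
}

# Recover how many test rows each model classified correctly
correct_counts = {name: round(acc * n_test) for name, acc in reported.items()}

for name, correct in correct_counts.items():
    print(f"{name}: {correct}/{n_test} correct")
```

Under that assumption the three models classify 28, 26 and 31 of the 38 test rows correctly, so the gap between the best base model and the fine-tuned one is three test rows.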