We would like to predict the winner of a basketball game as a function of the data gathered at half-time.

The dataset is stored in project/data/classification/:
- The train and test inputs representing the features are stored in X_train.npy and X_test.npy respectively.
- If the home team wins, the label is 1, and -1 otherwise. The train and test labels are stored in y_train.npy and y_test.npy respectively.
Your objective is to obtain a mean accuracy above 0.84 on the test set.

Remark: The test set must not be used for training, and it should be used only once, for the final score. If you compute the score on the test set several times with different models, you are using it more than once, even if you never call a scikit-learn fit() method on the test set! Note that this strict single use of the test set is not always common practice in companies, but try to apply it for this exercise.

You are free to choose the classification methods, but you must compare at least 2 models. You can compare more than 2, but this is not mandatory for this exercise. Discuss your choice of optimization procedures, solvers, hyperparameters, cross-validation, etc. It is sufficient that 1 of your models reaches the objective score. Several methods might work, including some that we have not explicitly studied in class; do not hesitate to try them.
Indication: a solution, with the correct hyperparameters, exists in scikit-learn among the following classes:
- linear_model.LogisticRegression
- svm.SVC
- neighbors.KNeighborsClassifier
- neural_network.MLPClassifier
Please note that there is no length constraint on your solution notebook; it may be short or long.

In [2]:
import numpy as np

X_train = np.load("classification/X_train.npy")
y_train = np.load("classification/y_train.npy")

X_test = np.load("classification/X_test.npy")
y_test = np.load("classification/y_test.npy")

X_train.shape, X_test.shape

((500, 50), (500, 50))

In [3]:
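from sklearn.model_selection import train_test_split

# Hold out 25% of the training data for validation. Stratifying keeps the
# share of home wins (1) and home losses (-1) the same in both subsets.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train,
    y_train,
    test_size=0.25,
    random_state=42,
    stratify=y_train
)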

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression(
    penalty="l2",
    C=1.0,
    solver="lbfgs",
    max_iter=1000
)

log_reg.fit(X_tr, y_tr)

val_pred_lr = log_reg.predict(X_val)
val_acc_lr = accuracy_score(y_val, val_pred_lr)

val_acc_lr

0.808
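Gradient-based solvers such as lbfgs are sensitive to the scale of the input features, so one variant worth trying is to standardize the features inside a pipeline before fitting. The sketch below is only an illustration: it uses a synthetic stand-in dataset (`X_demo`, `y_demo`, hypothetical names) with the same shape as `X_train`, not the basketball data, so its accuracy says nothing about the real task.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as X_train (500 x 50); NOT the real data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=50, n_informative=10, random_state=0
)
y_demo = 2 * y_demo - 1  # map labels {0, 1} to {-1, 1}, as in the exercise

X_tr_d, X_val_d, y_tr_d, y_val_d = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training subset only, then applied to validation data.
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1.0, solver="lbfgs", max_iter=1000),
)
scaled_log_reg.fit(X_tr_d, y_tr_d)

val_acc_scaled = scaled_log_reg.score(X_val_d, y_val_d)
print(f"validation accuracy (synthetic data): {val_acc_scaled:.3f}")
```

Putting the scaler inside the pipeline means its mean and variance are learned from the training subset alone, which avoids leaking validation statistics into the model.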

In [5]:
from sklearn.svm import SVC

svm = SVC(
    kernel="rbf",
    C=10,
    gamma="scale"
)

svm.fit(X_tr, y_tr)

val_pred_svm = svm.predict(X_val)
val_acc_svm = accuracy_score(y_val, val_pred_svm)

val_acc_svm

0.808

In [6]:
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(
    svm,
    X_train,
    y_train,
    cv=5,
    scoring="accuracy"
)

cv_scores.mean(), cv_scores.std()

(np.float64(0.8460000000000001), np.float64(0.024166091947189165))
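To compare the two models on equal footing, both can be scored with the same cross-validation folds. The following is a minimal sketch under stated assumptions: the data is a synthetic stand-in (`X_demo`, `y_demo`), and `folds`, `candidates`, and `cv_summary` are illustrative names that do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in with the same shape as X_train (500 x 50); NOT the real data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=50, n_informative=10, random_state=0
)

# One fixed splitter so that every model is scored on identical folds.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(C=1.0, max_iter=1000),
    "RBF SVM": SVC(kernel="rbf", C=10, gamma="scale"),
}

cv_summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=folds, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Only the training data enters this comparison, so the test set stays untouched until the final evaluation.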

In [7]:
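# Refit the selected configuration on the full training set
# (training subset + validation subset) before the single test evaluation.
final_model = SVC(
    kernel="rbf",
    C=10,
    gamma="scale"
)

final_model.fit(X_train, y_train)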

SVC parameters:
  C             10
  kernel        'rbf'
  degree        3
  gamma         'scale'
  coef0         0.0
  shrinking     True
  probability   False
  tol           0.001
  cache_size    200
  class_weight  None


In [8]:
from sklearn.metrics import accuracy_score

test_pred = final_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_pred)

test_accuracy

0.862

# 📘 Interpretation: Basketball Game Outcome Prediction

## Problem Overview

The objective is to predict the winner of a basketball game using information available at half-time.  
The task is framed as a binary classification problem:

- Label **1**: home team wins
- Label **-1**: home team loses

The performance metric is accuracy on a held-out test set, which must be used only once.

---

## Methodology and Data Splitting

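The original training set (500 samples, 50 features) was split into a **training subset** (75%) and a **validation subset** (25%), stratified by label.  
This allowed the models to be compared without leaking information from the test set. The selected model was additionally checked with 5-fold cross-validation on the full training set.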

The test set was strictly reserved for a single final evaluation, in accordance with good machine learning practice.
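The effect of a stratified split can be checked directly by comparing label proportions before and after splitting. This is a small sketch on randomly generated labels (an assumption made for illustration; `X_demo` and `y_demo` are not the basketball data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 500 samples, 50 features, roughly 60% of labels equal to 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 50))
y_demo = np.where(rng.random(500) < 0.6, 1, -1)

X_tr_d, X_val_d, y_tr_d, y_val_d = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# With stratify=y_demo, the share of label 1 is nearly identical in every subset.
for name, labels in [("full", y_demo), ("train", y_tr_d), ("validation", y_val_d)]:
    share = np.mean(labels == 1)
    print(f"{name:>10}: n={labels.size}, share of label 1 = {share:.3f}")
```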

---

## Compared Models

### Logistic Regression

Logistic regression was used as a baseline model.  
It assumes a linear relationship between the input features and the log-odds of the outcome.

- Advantages: simplicity, interpretability, fast training
- Limitations: limited ability to model nonlinear relationships
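The linear log-odds assumption can be verified on a fitted scikit-learn model: the decision function is the linear score, and the predicted probability is its sigmoid. The sketch below uses a small synthetic dataset (an assumption for illustration), and `log_odds` is an illustrative name.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem; NOT the basketball data.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

clf = LogisticRegression(C=1.0, max_iter=1000).fit(X_demo, y_demo)

# Log-odds of the positive class: a linear function of the features.
log_odds = X_demo @ clf.coef_.ravel() + clf.intercept_[0]

# The linear score matches decision_function, and its sigmoid matches predict_proba.
matches_decision = np.allclose(log_odds, clf.decision_function(X_demo))
matches_proba = np.allclose(expit(log_odds), clf.predict_proba(X_demo)[:, 1])
print(matches_decision, matches_proba)
```

The predicted class is therefore decided by the sign of a single weighted sum, which is why the decision boundary is a hyperplane.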

---

### Support Vector Machine (SVM)

An SVM with an RBF kernel was used to capture nonlinear interactions between features.

- The parameter **C** is the inverse of the regularization strength: a larger C penalizes training errors more heavily and regularizes less.
- The RBF kernel allows flexible decision boundaries.
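For reference, the RBF kernel is k(x, x') = exp(-gamma * ||x - x'||^2), and scikit-learn's `gamma="scale"` sets gamma to 1 / (n_features * X.var()). The sketch below checks both statements on synthetic data (an assumption for illustration); `gamma_scale`, `K_manual`, and the other names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Small synthetic binary problem; NOT the basketball data.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

# Value used by gamma="scale": 1 / (n_features * variance of all entries of X).
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())

# RBF kernel written out by hand from pairwise squared Euclidean distances.
sq_dists = ((X_demo[:, None, :] - X_demo[None, :, :]) ** 2).sum(axis=-1)
K_manual = np.exp(-gamma_scale * sq_dists)
K_sklearn = rbf_kernel(X_demo, gamma=gamma_scale)

# An SVC with the explicit gamma behaves like one fitted with gamma="scale".
svm_scale = SVC(kernel="rbf", C=10, gamma="scale").fit(X_demo, y_demo)
svm_manual = SVC(kernel="rbf", C=10, gamma=gamma_scale).fit(X_demo, y_demo)
same_decisions = np.allclose(
    svm_scale.decision_function(X_demo), svm_manual.decision_function(X_demo)
)
print(np.allclose(K_manual, K_sklearn), same_decisions)
```

Because `gamma="scale"` depends on the variance of the inputs, rescaling the features changes the effective kernel width.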

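Five-fold cross-validation on the full training set gave this model a mean accuracy of 0.846 (standard deviation 0.024). Logistic regression was not cross-validated, so the two models are directly comparable only on the single validation split.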

---

## Model Selection and Evaluation

On the validation split the two models tied at 0.808, so the SVM was selected on the basis of its 5-fold cross-validation mean of 0.846.  
It was then retrained on the full training set before a **single evaluation** on the test set.

The final test accuracy was **0.862**, above the target threshold of **0.84**, satisfying the objective of the exercise.
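As a rough indication of how much sampling noise a 500-example test set carries, a normal-approximation interval around the reported accuracy can be computed. This is a back-of-the-envelope sketch, assuming independent test examples; it is not part of the original evaluation.

```python
import math

test_accuracy = 0.862  # value reported by the notebook
n_test = 500           # number of rows in X_test

# Standard error of a proportion under the binomial model.
std_err = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Approximate 95% interval (normal approximation).
ci_low = test_accuracy - 1.96 * std_err
ci_high = test_accuracy + 1.96 * std_err
print(f"std err = {std_err:.4f}, 95% interval = [{ci_low:.3f}, {ci_high:.3f}]")
```

The interval runs from roughly 0.83 to 0.89, so the margin above the 0.84 threshold is of the same order as the sampling uncertainty.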

---

## Discussion

This experiment illustrates the importance of:
- Protecting the test set from repeated use
- Using validation or cross-validation for model selection
- Comparing multiple models rather than relying on a single approach

Other models such as neural networks or k-nearest neighbors could also perform well but were not evaluated here; the RBF SVM reached the target with a single hyperparameter setting (C=10, gamma="scale").
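Hyperparameters such as C and gamma can be selected with a cross-validated grid search on the training data, leaving the test set untouched. The sketch below is one possible setup, not the procedure used in this notebook: the data is synthetic, and the grid values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; NOT the basketball data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)

# make_pipeline names each step after its lowercased class name ("svc").
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# Illustrative grid; the values are assumptions, not tuned recommendations.
param_grid = {
    "svc__C": [0.1, 1, 10, 100],
    "svc__gamma": ["scale", 0.001, 0.01],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

After the search, `search.best_estimator_` is already refitted on all the data passed to `fit`, so it can be evaluated once on the test set.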

---

## Conclusion

An RBF-kernel Support Vector Machine (C=10, gamma="scale") predicts basketball game outcomes from half-time data with a test accuracy of 0.862.  
This exercise highlights best practices in supervised learning, especially regarding evaluation methodology and model comparison.
