## Imports

In [1]:
import pandas as pd

from scipy.stats import uniform

from sklearn import set_config; set_config(display='diagram')

from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Tuning Pipeline

👇 Consider the following dataset.

In [2]:
data = pd.read_csv("https://wagon-public-datasets.s3.amazonaws.com/05-Machine-Learning/08-Workflow/tuning_pipeline_data.csv")
data.head()

| | games played | minutes played | points per game | field goals made | field goal attempts | field goal percent | 3 point made | 3 point attempt | 3 point % | free throw made | free throw attempts | free throw % | offensive rebounds | defensive rebounds | rebounds | assists | steals | blocks | turnovers | target_5y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.0 | 27.4 | 7.4 | 2.6 | 7.6 | NaN | 0.5 | 2.1 | 25.0 | 1.6 | 2.3 | 69.9 | 0.7 | 3.4 | 4.1 | 1.9 | 0.4 | 0.4 | 1.3 | 0 |
| 1 | 35.0 | 26.9 | NaN | 2.0 | 6.7 | 29.6 | 0.7 | 2.8 | 23.5 | 2.6 | 3.4 | 76.5 | 0.5 | 2.0 | 2.4 | 3.7 | 1.1 | 0.5 | 1.6 | 0 |
| 2 | NaN | 15.3 | 5.2 | 2.0 | 4.7 | 42.2 | 0.4 | 1.7 | 24.4 | 0.9 | 1.3 | 67.0 | 0.5 | 1.7 | 2.2 | 1.0 | 0.5 | 0.3 | 1.0 | 0 |
| 3 | 58.0 | 11.6 | 5.7 | 2.3 | 5.5 | 42.6 | 0.1 | 0.5 | 22.6 | 0.9 | 1.3 | 68.9 | 1.0 | 0.9 | 1.9 | 0.8 | 0.6 | 0.1 | 1.0 | 1 |
| 4 | 48.0 | 11.5 | 4.5 | 1.6 | 3.0 | 52.4 | 0.0 | 0.1 | 0.0 | 1.3 | 1.9 | 67.4 | 1.0 | 1.5 | 2.5 | 0.3 | 0.3 | 0.4 | 0.8 | 1 |


- Each observation represents a player
- Each column represents a characteristic of a player's performance

The target `target_5y` indicates whether the player's professional career lasted less than 5 years (`0`) or 5 years or more (`1`).

In [4]:
X = data.drop(columns="target_5y")
y = data['target_5y']

## Pipeline

👇 We give you the simple pipeline below:

In [5]:
# Preprocessing pipe
preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaling', MinMaxScaler())
])

# Final pipe
pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('model_svm', SVC())
])

pipe

## Fine-Tuning

Our task is to assist in the recruitment process of promising young players.  
The model should **limit false alarms as much as possible** to avoid recruiting players that will flop.
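
Limiting false alarms means limiting false positives: players predicted to last 5 years or more who do not. The metric that penalizes exactly this is precision, TP / (TP + FP). A minimal sketch on made-up labels (the `y_true` and `y_pred` lists are illustrative, not taken from the dataset):

```python
# Toy labels, invented for illustration only (not from the dataset)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# True positives: predicted 1 and actually 1
tp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 1)
# False positives ("false alarms"): predicted 1 but actually 0
fp = sum(1 for t, p in zip(y_true, y_pred) if p == 1 and t == 0)

precision = tp / (tp + fp)
print(f"TP={tp}, FP={fp}, precision={precision:.2f}")
```

In scikit-learn, this metric corresponds to `scoring='precision'`.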

❓ **Fine-tune this pipeline to maximize your objective**

- Use the `scoring` metric appropriate for the task
- Do a (randomized) search for the optimal
    - imputing `strategy`
    - `kernel`
    - regularization factor `C`
- Store your randomized search results in a `search` variable

In [6]:
pipe.get_params()

{'memory': None,
 'steps': [('preprocessing',
   Pipeline(steps=[('imputer', SimpleImputer()), ('scaling', MinMaxScaler())])),
  ('model_svm', SVC())],
 'transform_input': None,
 'verbose': False,
 'preprocessing': Pipeline(steps=[('imputer', SimpleImputer()), ('scaling', MinMaxScaler())]),
 'model_svm': SVC(),
 'preprocessing__memory': None,
 'preprocessing__steps': [('imputer', SimpleImputer()),
  ('scaling', MinMaxScaler())],
 'preprocessing__transform_input': None,
 'preprocessing__verbose': False,
 'preprocessing__imputer': SimpleImputer(),
 'preprocessing__scaling': MinMaxScaler(),
 'preprocessing__imputer__add_indicator': False,
 'preprocessing__imputer__copy': True,
 'preprocessing__imputer__fill_value': None,
 'preprocessing__imputer__keep_empty_features': False,
 'preprocessing__imputer__missing_values': nan,
 'preprocessing__imputer__strategy': 'mean',
 'preprocessing__scaling__clip': False,
 'preprocessing__scaling__copy': True,
 'preprocessing__scaling__feature_range': (0,
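
The keys listed by `get_params()` follow the `<step>__<parameter>` convention, chained once per nesting level. As a minimal sketch (the pipeline is rebuilt under the hypothetical name `demo_pipe` so the snippet stands alone), a nested parameter can be set and read back like this:

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Same structure as the pipeline defined earlier in the notebook
demo_pipe = Pipeline([
    ('preprocessing', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaling', MinMaxScaler()),
    ])),
    ('model_svm', SVC()),
])

# Outer step name, inner step name, then the parameter, joined by "__"
demo_pipe.set_params(
    preprocessing__imputer__strategy='median',
    model_svm__kernel='linear',
)

print(demo_pipe.get_params()['preprocessing__imputer__strategy'])
print(demo_pipe.named_steps['model_svm'].kernel)
```

The same key names are what a search expects in its parameter grid or distributions.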

In [13]:
set_config(display='text')

params = {
    'preprocessing__imputer__strategy': ['mean', 'median', 'most_frequent', 'constant'],
    'model_svm__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    # uniform(loc, scale) samples C from [loc, loc + scale]
    'model_svm__C': uniform(0.001, 1000),
}

# Precision penalizes false positives (recruiting players who will not last)
search = RandomizedSearchCV(
    pipe, param_distributions=params, scoring='precision',
    n_iter=100, cv=5, n_jobs=-1,
)
search.fit(X, y)

RandomizedSearchCV(cv=5,
                   estimator=Pipeline(steps=[('preprocessing',
                                              Pipeline(steps=[('imputer',
                                                               SimpleImputer()),
                                                              ('scaling',
                                                               MinMaxScaler())])),
                                             ('model_svm', SVC())]),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'model_svm__C': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x7fb6afaff0e0>,
                                        'model_svm__kernel': ['linear', 'poly',
                                                              'rbf',
                                                              'sigmoid'],
                                        'preprocessing__imputer__strategy': ['mean',
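
A linear-scale distribution such as `uniform(0.001, 1000)` yields mostly large values of `C`, since almost all of its range lies above 1. As an alternative sketch (not what the search in this notebook uses), `scipy.stats.loguniform` spreads draws evenly across orders of magnitude, which may be a reasonable choice for a regularization factor. The variable names below are illustrative:

```python
from scipy.stats import loguniform, uniform

linear_C = uniform(0.001, 1000)   # loc=0.001, scale=1000
log_C = loguniform(1e-3, 1e3)     # bounds are given directly as (a, b)

# Support of each distribution: (lower bound, upper bound)
print(linear_C.support())
print(log_C.support())

# Share of draws below 1: tiny on the linear scale, about half on the log scale
linear_draws = linear_C.rvs(size=10_000, random_state=0)
log_draws = log_C.rvs(size=10_000, random_state=0)
print((linear_draws < 1).mean(), (log_draws < 1).mean())
```

Either distribution can be passed as the value for `model_svm__C` in `param_distributions`.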
                                          

In [15]:
search.best_params_

{'model_svm__C': 146.00912066086912,
 'model_svm__kernel': 'linear',
 'preprocessing__imputer__strategy': 'median'}
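
Beyond `best_params_`, a fitted search exposes `best_score_` and the full `cv_results_` table, which helps judge how much the top candidates actually differ. A minimal sketch on synthetic data (the `make_classification` data, the small `n_iter`, the bare `SVC` estimator, and the `demo` names are assumptions made so the snippet runs on its own):

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the real data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = RandomizedSearchCV(
    SVC(),
    param_distributions={'kernel': ['linear', 'rbf'], 'C': uniform(0.1, 10)},
    scoring='precision',
    n_iter=5,
    cv=3,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked first
results = (
    pd.DataFrame(demo_search.cv_results_)
    .sort_values('rank_test_score')
    [['param_kernel', 'param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]
)
print(results.head())
print(demo_search.best_score_)
```

Comparing `mean_test_score` with `std_test_score` across the top rows shows whether the best candidate is clearly ahead or within noise of the others.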

In [14]:
from nbresult import ChallengeResult

result = ChallengeResult(
    'solution',
    scoring = search.scoring,
    cv = search.cv,
    mean_test_score = search.cv_results_['mean_test_score']
)

result.write()
print(result.check())


platform linux -- Python 3.12.9, pytest-8.3.4, pluggy-1.5.0 -- /home/bat/.pyenv/versions/lewagon/bin/python
cachedir: .pytest_cache
rootdir: /home/bat/code/syanrys/05-ML/08-Workflow/data-tuning-pipeline/tests
plugins: dash-3.0.4, anyio-4.8.0, typeguard-4.4.2
collecting ... collected 1 item

test_solution.py::TestSolution::test_cv_results PASSED                   [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master



## Export

Once you have built your optimal pipeline, export it as a pickle file.

In [16]:
import pickle

with open("pipeline.pkl", "wb") as file:
    pickle.dump(search.best_estimator_, file)
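
To confirm the export worked, the pickle can be loaded back and used for prediction. A minimal, self-contained sketch (it uses a small stand-in pipeline fitted on synthetic data and a temporary file, rather than the fitted `search` and the `pipeline.pkl` written by the cell before it):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in for search.best_estimator_, fitted on synthetic data
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
fitted_pipe = Pipeline([('scaling', MinMaxScaler()), ('model_svm', SVC())]).fit(X_demo, y_demo)

# Write to a temporary path so nothing in the working directory is overwritten
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
with open(path, "wb") as file:
    pickle.dump(fitted_pipe, file)

# Load it back and check that it predicts exactly like the original
with open(path, "rb") as file:
    loaded_pipe = pickle.load(file)

same = (loaded_pipe.predict(X_demo) == fitted_pipe.predict(X_demo)).all()
print(same)
```

A pickle should only be loaded from a trusted source, and ideally with the same scikit-learn version that produced it.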

🏁 Congratulations! Don't forget to add, commit, and push your notebook.