# Tuning Pipeline

In [1]:
from sklearn import set_config; set_config(display='diagram')

👇 Consider the following dataset.

In [2]:
import pandas as pd

data = pd.read_csv("data.csv")

data.head()

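| Unnamed: 0 | games played | minutes played | points per game | field goals made | field goal attempts | field goal percent | 3 point made | 3 point attempt | 3 point % | free throw made | free throw attempts | free throw % | offensive rebounds | defensive rebounds | rebounds | assists | steals | blocks | turnovers | target_5y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36.0 | 27.4 | 7.4 | 2.6 | 7.6 | NaN | 0.5 | 2.1 | 25.0 | 1.6 | 2.3 | 69.9 | 0.7 | 3.4 | 4.1 | 1.9 | 0.4 | 0.4 | 1.3 | 0 |
| 1 | 35.0 | 26.9 | NaN | 2.0 | 6.7 | 29.6 | 0.7 | 2.8 | 23.5 | 2.6 | 3.4 | 76.5 | 0.5 | 2.0 | 2.4 | 3.7 | 1.1 | 0.5 | 1.6 | 0 |
| 2 | NaN | 15.3 | 5.2 | 2.0 | 4.7 | 42.2 | 0.4 | 1.7 | 24.4 | 0.9 | 1.3 | 67.0 | 0.5 | 1.7 | 2.2 | 1.0 | 0.5 | 0.3 | 1.0 | 0 |
| 3 | 58.0 | 11.6 | 5.7 | 2.3 | 5.5 | 42.6 | 0.1 | 0.5 | 22.6 | 0.9 | 1.3 | 68.9 | 1.0 | 0.9 | 1.9 | 0.8 | 0.6 | 0.1 | 1.0 | 1 |
| 4 | 48.0 | 11.5 | 4.5 | 1.6 | 3.0 | 52.4 | 0.0 | 0.1 | 0.0 | 1.3 | 1.9 | 67.4 | 1.0 | 1.5 | 2.5 | 0.3 | 0.3 | 0.4 | 0.8 | 1 |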


- Each observation represents a player
- Each column is a characteristic of performance

The target indicates whether the player lasts less than 5 years (`0`) or 5 years or more (`1`) as a professional.

In [3]:
X = data.drop(columns="target_5y")
y = data['target_5y']

## Pipeline

👇 We give you the simple pipeline below.

In [4]:
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer

# Preprocessing pipe
preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaling', MinMaxScaler())
])

# Final pipe
pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('model_svm', SVC())])

pipe
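
Because the preprocessing pipeline is nested inside the final one, its hyperparameters are addressed with double-underscore names of the form `<step>__<substep>__<parameter>`. The sketch below rebuilds the same structure under a separate name (`demo_pipe`, chosen here for illustration) and changes two parameters with `set_params`; the values `"median"` and `0.5` are arbitrary.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Same structure as the pipeline above, rebuilt so the snippet runs on its own
demo_preprocessor = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaling", MinMaxScaler()),
])
demo_pipe = Pipeline([
    ("preprocessing", demo_preprocessor),
    ("model_svm", SVC()),
])

# Nested parameters are addressed as <step>__<substep>__<parameter>
demo_pipe.set_params(
    preprocessing__imputer__strategy="median",
    model_svm__C=0.5,
)

# Read the values back, once through get_params and once through named_steps
imputer_strategy = demo_pipe.get_params()["preprocessing__imputer__strategy"]
svm_c = demo_pipe.named_steps["model_svm"].C
print(imputer_strategy, svm_c)
```

The same names are the keys a hyperparameter search expects in its parameter grid or distributions.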

## Fine Tuning

Our task is to assist the recruitment process of promising young players.  
The model should **limit false alarms as much as possible** to avoid recruiting players that will flop.

❓ **Fine-tune this pipeline to maximize your objective**

- Use the `scoring` metric appropriate for the task
- Run a randomized search for the optimal:
    - imputing `strategy`
    - `kernel`
    - regularization factor `C`...


- Store your random search results in a variable named `search`
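
One way to read the objective: a "false alarm" is a false positive, that is, a player predicted to last 5 years or more who does not. Precision, TP / (TP + FP), is the metric that drops as false positives increase. The sketch below checks this on a handful of hypothetical labels (invented for illustration, not taken from the dataset).

```python
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical labels, for illustration only: 1 = career of 5 years or more
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision counts how many predicted positives are real positives
manual_precision = tp / (tp + fp)
sklearn_precision = precision_score(y_true, y_pred)
print(tp, fp, manual_precision, sklearn_precision)
```

Passing `scoring="precision"` to a scikit-learn search or to `cross_val_score` selects this same metric.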

In [0]:
from sklearn import set_config; set_config(display='text')
# YOUR CODE BELOW

In [0]:
pipe.get_params()

{'memory': None,
 'steps': [('preprocessing',
   Pipeline(steps=[('imputer', SimpleImputer()), ('scaling', MinMaxScaler())])),
  ('model_svm', SVC())],
 'verbose': False,
 'preprocessing': Pipeline(steps=[('imputer', SimpleImputer()), ('scaling', MinMaxScaler())]),
 'model_svm': SVC(),
 'preprocessing__memory': None,
 'preprocessing__steps': [('imputer', SimpleImputer()),
  ('scaling', MinMaxScaler())],
 'preprocessing__verbose': False,
 'preprocessing__imputer': SimpleImputer(),
 'preprocessing__scaling': MinMaxScaler(),
 'preprocessing__imputer__add_indicator': False,
 'preprocessing__imputer__copy': True,
 'preprocessing__imputer__fill_value': None,
 'preprocessing__imputer__missing_values': nan,
 'preprocessing__imputer__strategy': 'mean',
 'preprocessing__imputer__verbose': 0,
 'preprocessing__scaling__clip': False,
 'preprocessing__scaling__copy': True,
 'preprocessing__scaling__feature_range': (0, 1),
 'model_svm__C': 1.0,
 'model_svm__break_ties': False,
 'model_svm__cache_size

In [0]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Random search over the imputing strategy, the SVM kernel and C
search = RandomizedSearchCV(
    pipe,
    param_distributions={
        'preprocessing__imputer__strategy': ['mean', 'median', 'most_frequent'],
        'model_svm__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'model_svm__C': uniform(0.1, 10)},
    cv=5,
    n_iter=50,
    # Precision penalizes false positives (false alarms)
    scoring="precision")

search.fit(X, y)

pipe_tuned = search.best_estimator_
pipe_tuned

Pipeline(steps=[('preprocessing',
                 Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                 ('scaling', MinMaxScaler())])),
                ('model_svm', SVC(C=1.4055696650058025, kernel='poly'))])
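
A note on the `C` distribution: `scipy.stats.uniform(loc, scale)` is defined on `[loc, loc + scale]`, so `uniform(0.1, 10)` draws values between 0.1 and 10.1, not between 0.1 and 10. The sketch below checks the bounds and shows `scipy.stats.loguniform` as one possible alternative when `C` should be explored across orders of magnitude. That alternative is a suggestion, not what the search above uses.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) is supported on [loc, loc + scale]
c_uniform = uniform(0.1, 10)
low, high = c_uniform.support()
print(low, high)

# Alternative (not used in the search above): log-uniform over [0.1, 10]
c_loguniform = loguniform(0.1, 10)
samples = c_loguniform.rvs(size=1000, random_state=0)
print(samples.min(), samples.max())
```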

In [0]:
from sklearn.model_selection import cross_val_score

# Cross-validated precision of the tuned pipeline
scores = cross_val_score(pipe_tuned, X, y, cv=5, scoring="precision")
score = scores.mean()
print(scores)
print(score)

[0.72413793 0.75263158 0.77443609 0.7375     0.75625   ]
0.7489911200414829
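
As a complement to re-scoring the tuned pipeline, a fitted search object already stores the cross-validated score of every sampled candidate in `cv_results_`, and the best mean score in `best_score_`. The sketch below illustrates this on synthetic data with a bare `SVC`. The names `X_demo`, `y_demo` and `demo_search`, and the small `n_iter` and `cv` values, are assumptions chosen to keep the example fast.

```python
import pandas as pd
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption; not the players dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": uniform(0.1, 10), "kernel": ["linear", "rbf"]},
    n_iter=5,
    cv=3,
    scoring="precision",
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

# One row per sampled candidate, best-ranked candidates first
results = pd.DataFrame(demo_search.cv_results_).sort_values("rank_test_score")
print(results[["param_C", "param_kernel", "mean_test_score", "std_test_score"]])
print(demo_search.best_params_, demo_search.best_score_)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the gap between the top candidates is larger than the fold-to-fold variation.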


In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('solution',
    search = search
)
result.write()
print(result.check())


platform darwin -- Python 3.8.12, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/brunolajoie/.pyenv/versions/lewagon3812/bin/python3
cachedir: .pytest_cache
rootdir: /Users/brunolajoie/code/lewagon/data-solutions/05-ML/08-Workflow/03-Tuning-Pipeline
plugins: anyio-3.3.0, dash-2.0.0
collecting ... collected 1 item

tests/test_solution.py::TestSolution::test_cv_results PASSED             [100%]



💯 You can commit your code:

git add tests/solution.pickle

git commit -m 'Completed solution step'

git push origin master


## Export

Once you have built your optimal pipeline, export it as a pickle file.

In [0]:
import pickle

# Export pipeline as pickle file
with open("pipeline.pkl", "wb") as file:
    pickle.dump(pipe_tuned, file)
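
To confirm an export worked, the file can be loaded back and its predictions compared with those of the original object. The sketch below does this round trip with a small stand-in pipeline fitted on synthetic data and written to a temporary directory. The names `demo_pipe`, `loaded_pipe`, `X_demo` and `y_demo` are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Stand-in pipeline and synthetic data (assumptions, for illustration only)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
demo_pipe = Pipeline([("scaling", MinMaxScaler()), ("model_svm", SVC())])
demo_pipe.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    # Write the fitted pipeline, then read it back from disk
    with open(path, "wb") as file:
        pickle.dump(demo_pipe, file)
    with open(path, "rb") as file:
        loaded_pipe = pickle.load(file)

# The reloaded pipeline should predict exactly like the original
same_predictions = np.array_equal(demo_pipe.predict(X_demo), loaded_pipe.predict(X_demo))
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.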

🏁 Congratulations! Don't forget to add, commit and push your notebook.