<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Task" data-toc-modified-id="Task-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Task</a></span><ul class="toc-item"><li><span><a href="#Binary-encoding-and-metrics" data-toc-modified-id="Binary-encoding-and-metrics-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Binary encoding and metrics</a></span></li></ul></li><li><span><a href="#Naive-baseline" data-toc-modified-id="Naive-baseline-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Naive baseline</a></span><ul class="toc-item"><li><span><a href="#Metrics" data-toc-modified-id="Metrics-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Metrics</a></span></li></ul></li><li><span><a href="#Model-that-can-read" data-toc-modified-id="Model-that-can-read-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Model that can read</a></span></li><li><span><a href="#GridSearch-for-model-parameters" data-toc-modified-id="GridSearch-for-model-parameters-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>GridSearch for model parameters</a></span><ul class="toc-item"><li><span><a href="#Manual-search" data-toc-modified-id="Manual-search-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Manual search</a></span></li><li><span><a href="#Automated-search" data-toc-modified-id="Automated-search-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Automated search</a></span></li></ul></li><li><span><a href="#Decision-trees" data-toc-modified-id="Decision-trees-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Decision trees</a></span></li></ul></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('muted')
sns.set_color_codes('muted')
sns.set_style('white')

In [None]:
%config InlineBackend.figure_format = 'retina'

Seminar plan:

1. Grid Search & friends
1. Decision trees

# Task

https://youtrack.jetbrains.com/issues/IDEA

Predict the issue type at the moment when the new issue is reported

Process model:

1. An external user creates a new issue and specifies its summary and description. The author ID and creation date are recorded automatically. For simplicity, we assume that the summary and description cannot be changed afterwards.
1. At some point the issue becomes resolved. We are interested in the value of the Type field at that moment. Again, for simplicity, we assume that the value of the Type field does not change afterwards.

Therefore, all we need are the `id`, `reporter`, `created`, `summary` and `description` of all resolved IDEA issues that were created by an external user.

In [None]:
df = pd.read_json('../data/issues.json.zip', lines=True)

In [None]:
df.sample(5)

In [None]:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import FunctionTransformer

In [None]:
# the code in this cell is a less readable but more reusable analog of:
# df['type'] = df.customFields.map(lambda x: [cf['value']['name'] for cf in x if cf['name'] == 'Type'][0])
# df['reporter'] = df.reporter.map(lambda x: x['login'])
# df['created'] = pd.to_datetime(df.created, unit='ms')

formatter = DataFrameMapper([
    ('customFields', FunctionTransformer(
        lambda c: c.map(lambda x: [cf['value']['name'] for cf in x if cf['name'] == 'Type'][0])
    ), {'alias': 'type'}),
    ('reporter', FunctionTransformer(lambda c: c.map(lambda x: x['login']))),
    ('created', FunctionTransformer(lambda c: pd.to_datetime(c, unit='ms'))),
    ('summary', None),
    ('description', None),
    (['summary', 'description'], 
     FunctionTransformer(lambda x: x.summary.fillna('') + '\n\n' + x.description.fillna('')),
     dict(alias='text')
    ),
    ('idReadable', None)
], input_df=True, df_out=True)
formatter.fit_transform(df).sample(5)

In [None]:
X = formatter.transform(df)[['idReadable', 'summary', 'description', 'text', 'reporter', 'created']]
X.sample(5)

In [None]:
y = formatter.transform(df)['type']

In [None]:
y.value_counts(normalize=True)

## Binary encoding and metrics

There are two ways to encode the target to binary: `y_binary = y == 'Bug'` and `y_binary = y != 'Bug'`. Which one to choose?

It depends on which errors are more critical to us and which metrics we use.

Example: it is more important to decrease the load on support engineers (who handle bugs) $\implies$ we need to detect as many non-bugs as possible $\implies$ we have to choose `y_binary = y != 'Bug'` and look closely at the recall (the share of all non-bugs that were detected).
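
To make the effect of the encoding concrete, here is a minimal sketch on hypothetical toy labels (they are not taken from the IDEA data). The helper `precision_recall` is an illustrative name, not a scikit-learn function. The same predictions yield different precision and recall depending on which class is treated as positive.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for boolean labels; True is the positive class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))        # true positives
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))  # false positives
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical true and predicted issue types (assumption, for illustration only).
types_true = ['Bug', 'Bug', 'Bug', 'Bug', 'Feature', 'Task']
types_pred = ['Bug', 'Bug', 'Bug', 'Feature', 'Bug', 'Task']

# Encoding 1: the positive class is "non-bug" (y != 'Bug').
non_bug_scores = precision_recall(
    [t != 'Bug' for t in types_true], [p != 'Bug' for p in types_pred]
)

# Encoding 2: the positive class is "bug" (y == 'Bug').
bug_scores = precision_recall(
    [t == 'Bug' for t in types_true], [p == 'Bug' for p in types_pred]
)

print('non-bug as positive:', non_bug_scores)
print('bug as positive:    ', bug_scores)
```

With these toy labels the non-bug encoding gives precision and recall of 0.5, while the bug encoding gives 0.75 for both, so the choice of the positive class has to be fixed before metrics are compared.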

In [None]:
binary_transformer = FunctionTransformer(lambda c: c != 'Bug')
y_binary = binary_transformer.fit_transform(y)

In [None]:
y_binary.value_counts(normalize=True)

# Naive baseline

In [None]:
from sklearn.dummy import DummyClassifier

In [None]:
dummy = DummyClassifier(strategy='most_frequent')

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_binary)

In [None]:
dummy.fit(X_train, y_bin_train)

In [None]:
dummy.predict(X_train)# .any()

## Metrics

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

In [None]:
accuracy_score(y_true=y_bin_train, y_pred=dummy.predict(X_train))

In [None]:
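from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
def plot_confusion_matrix(estimator, X, y_true):  # thin wrapper so that later cells keep working
    return ConfusionMatrixDisplay.from_estimator(estimator, X, y_true)
plot_confusion_matrix(estimator=dummy, X=X_train, y_true=y_bin_train)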

In [None]:
precision_score(y_true=y_bin_train, y_pred=dummy.predict(X_train))

In [None]:
recall_score(y_true=y_bin_train, y_pred=dummy.predict(X_train))

# Model that can read

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
preprocessor = DataFrameMapper([
    ('text', TfidfVectorizer(
            min_df=.05, max_df=.5, token_pattern=r'[A-Za-z]{2,}', stop_words='english'
    ))
], input_df=True, df_out=True).fit(X_train)
preprocessor.transform(X_train.sample(5))

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

In [None]:
lr = make_pipeline(preprocessor.set_params(df_out=False), LogisticRegression())
lr.fit(X_train, y_bin_train)

In [None]:
lr.predict(X_train).all()

In [None]:
accuracy_score(y_true=y_bin_train, y_pred=lr.predict(X_train))

In [None]:
plot_confusion_matrix(estimator=lr, X=X_train, y_true=y_bin_train)

In [None]:
precision_score(y_true=y_bin_train, y_pred=lr.predict(X_train))

In [None]:
recall_score(y_true=y_bin_train, y_pred=lr.predict(X_train))

In [None]:
f1_score(y_true=y_bin_train, y_pred=lr.predict(X_train))

# GridSearch for model parameters

In [None]:
from sklearn.model_selection import ParameterGrid

In [None]:
param_grid = dict(min_df=[.05, .1], max_df=[.2, .3, .5])
param_grid

## Manual search

**Task**: code the grid search =)

In [None]:
results = []
for params in ParameterGrid(param_grid):
    print(params)
    ...
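
One possible way to approach the task, sketched under assumptions: the toy corpus and labels below are hypothetical stand-ins for `X_train['text']`, `X_test['text']`, `y_bin_train` and `y_bin_test`, and the result columns `estimator` and `test_recall` are chosen only because later cells refer to them. It is not the reference solution.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline

# Hypothetical toy data (assumption): True means "not a bug".
train_text = [
    'crash on startup exception', 'exception when opening project',
    'editor crash after update', 'freeze exception in debugger',
    'debugger crash on breakpoint', 'please add dark theme',
    'support new language feature', 'add option for theme',
    'feature request add shortcut', 'please support shortcut option',
]
y_train = [False] * 5 + [True] * 5
test_text = ['crash exception on update', 'please add theme option']
y_test = [False, True]

param_grid = dict(min_df=[.05, .1], max_df=[.2, .3, .5])

rows = []
for params in ParameterGrid(param_grid):
    # A fresh pipeline per parameter combination: vectorizer, then classifier.
    model = make_pipeline(TfidfVectorizer(**params), LogisticRegression())
    model.fit(train_text, y_train)
    rows.append({
        **params,
        'train_recall': recall_score(y_train, model.predict(train_text)),
        'test_recall': recall_score(y_test, model.predict(test_text)),
        'estimator': model,  # keep the fitted model for later inspection
    })

results = pd.DataFrame(rows)
best_estimator = results.loc[results['test_recall'].idxmax(), 'estimator']
print(results.drop(columns='estimator'))
```

Note that selecting parameters by the score on the test set, as this sketch does, leaks information from the test set; a separate validation split or cross-validation would avoid that.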

In [None]:
results = pd.DataFrame(results)
results
# results.drop(columns='estimator').sort_values('test_recall').style.bar(vmin=0, vmax=1)

Short reminder: precision and recall depend on the decision threshold, so it is better to use something else for cross-validation.
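
A small sketch of the threshold dependence, using hypothetical labels and scores (assumptions, not model output). ROC AUC is used here only as one common example of a threshold-free metric; the notebook itself does not prescribe a particular alternative.

```python
# Hypothetical true labels and predicted probabilities of the positive class.
y_true = [True, True, True, False, False]
scores = [0.9, 0.6, 0.4, 0.5, 0.1]


def precision_recall_at(threshold):
    """Precision and recall when everything scored >= threshold is called positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(t and p for t, p in zip(y_true, predicted))
    return tp / sum(predicted), tp / sum(y_true)


# The same scores give different precision/recall for different thresholds.
by_threshold = {t: precision_recall_at(t) for t in (0.3, 0.5, 0.7)}
for threshold, (precision, recall) in by_threshold.items():
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}')

# ROC AUC does not need a threshold: it is the share of (positive, negative)
# pairs in which the positive example gets the higher score (no ties here).
positives = [s for s, t in zip(scores, y_true) if t]
negatives = [s for s, t in zip(scores, y_true) if not t]
roc_auc = sum(p > n for p in positives for n in negatives) / (len(positives) * len(negatives))
print(f'ROC AUC = {roc_auc:.3f}')
```

Lowering the threshold from 0.7 to 0.3 moves recall from 1/3 to 1 and precision from 1 to 0.75, while the ranking-based score stays at 5/6 regardless of the threshold.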

In [None]:
best_estimator = results.loc[results.test_recall.idxmax()].estimator

In [None]:
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_bin_test, best_estimator.predict_proba(X_test['text'])[:,1])

In [None]:
recall_score(y_bin_test, best_estimator.predict(X_test['text']))

In [None]:
import plotly.express as px

In [None]:
px.line(y=precision[:-1], x=recall[:-1], text=thresholds, labels=dict(x='recall', y='precision'))

## Automated search

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
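pipe = make_pipeline(
    TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}', stop_words='english'),
    LogisticRegression(penalty='none')  # on scikit-learn >= 1.2 pass penalty=None instead
)
cv = GridSearchCV(
    estimator=pipe,
    # parameters of a pipeline step are addressed as <step name>__<parameter>
    param_grid=dict(tfidfvectorizer__min_df=[.05, .02], tfidfvectorizer__max_df=[.1, .3, .6]),
    # two metrics, so that cv_results_ contains mean_test_precision and mean_test_recall
    scoring=['precision', 'recall'],
    refit=False,
    verbose=5
)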

In [None]:
cv.fit(X_train['text'], y_bin_train)

How to speed up:

- `GridSearchCV(n_jobs=-1)` parallelizes the fitting process across all CPU cores
- a smaller sample decreases the fit time
- a smaller number of parameter combinations (greedy strategy) means fewer models to fit
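
The last point can be quantified with a back-of-the-envelope count. The sketch below assumes 5-fold cross-validation (the `GridSearchCV` default) and a "greedy" strategy that tunes one parameter at a time while the others stay fixed; the larger grid is hypothetical.

```python
from math import prod


def full_grid_fits(param_grid, n_folds=5):
    """Number of model fits for an exhaustive grid search."""
    return prod(len(values) for values in param_grid.values()) * n_folds


def greedy_fits(param_grid, n_folds=5):
    """Upper bound on fits when parameters are tuned one at a time."""
    return sum(len(values) for values in param_grid.values()) * n_folds


small_grid = dict(min_df=[.05, .02], max_df=[.1, .3, .6])
# Hypothetical larger grid: three parameters with five candidate values each.
large_grid = dict(a=range(5), b=range(5), c=range(5))

print(full_grid_fits(small_grid), greedy_fits(small_grid))
print(full_grid_fits(large_grid), greedy_fits(large_grid))
```

The exhaustive search grows multiplicatively with the number of parameters, whereas the greedy strategy grows additively, at the price of possibly missing the best combination.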

In [None]:
pd.DataFrame(cv.cv_results_)[['params', 'mean_test_precision', 'mean_test_recall']]

# Decision trees

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = make_pipeline(
        TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}', stop_words='english', max_df=.3, min_df=.2),
        DecisionTreeClassifier()
    )

In [None]:
dt.fit(X_train['text'], y_bin_train)

In [None]:
plot_confusion_matrix(estimator=dt, X=X_train['text'], y_true=y_bin_train)

In [None]:
precision_score(y_true=y_bin_train, y_pred=dt.predict(X_train['text']))

In [None]:
recall_score(y_true=y_bin_train, y_pred=dt.predict(X_train['text']))

In [None]:
precision_score(y_true=y_bin_test, y_pred=dt.predict(X_test['text']))

In [None]:
recall_score(y_true=y_bin_test, y_pred=dt.predict(X_test['text']))

The model above is simply overfitted. What should we do with it?

In [None]:
dt.named_steps['decisiontreeclassifier'].get_depth()

In [None]:
dt = make_pipeline(
        TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}', stop_words='english', max_df=.3, min_df=.2),
        DecisionTreeClassifier(max_depth=40, min_samples_leaf=10)
    )
dt.fit(X_train['text'], y_bin_train)

In [None]:
plot_confusion_matrix(estimator=dt, X=X_train['text'], y_true=y_bin_train)

In [None]:
precision_score(y_true=y_bin_train, y_pred=dt.predict(X_train['text']))

In [None]:
recall_score(y_true=y_bin_train, y_pred=dt.predict(X_train['text']))

**Task**: run grid search to find the best parameters for the Decision Tree model
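
A possible starting point for this task, sketched under assumptions: the toy corpus is a hypothetical stand-in for `X_train['text']` and `y_bin_train`, the candidate values are arbitrary, and `roc_auc` is just one reasonable threshold-free scoring choice. The step name `decisiontreeclassifier` is the one `make_pipeline` generates from the class name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data (assumption): True means "not a bug".
bug_texts = [
    'crash on startup with exception', 'exception when opening project',
    'editor crash after update', 'debugger freeze and exception',
    'crash when running tests', 'exception after indexing project',
]
other_texts = [
    'please add dark theme', 'support new language feature',
    'add option for theme', 'feature request add shortcut',
    'please support shortcut option', 'add feature for new shortcut',
]
texts = bug_texts + other_texts
labels = [False] * len(bug_texts) + [True] * len(other_texts)

pipe = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {
    'decisiontreeclassifier__max_depth': [2, 5, None],
    'decisiontreeclassifier__min_samples_leaf': [1, 2],
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring='roc_auc', cv=3)
search.fit(texts, labels)

print(search.best_params_)
print(search.best_score_)
```

On the real data the grid would typically also include the vectorizer parameters (for example `tfidfvectorizer__min_df`), and `search.best_estimator_` would then be compared on train and test data to check whether the overfitting has been reduced.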