<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Task" data-toc-modified-id="Task-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Task</a></span></li><li><span><a href="#Dataset" data-toc-modified-id="Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Dataset</a></span></li><li><span><a href="#Naive-baseline" data-toc-modified-id="Naive-baseline-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Naive baseline</a></span><ul class="toc-item"><li><span><a href="#Metrics" data-toc-modified-id="Metrics-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Metrics</a></span></li></ul></li><li><span><a href="#Simple-model" data-toc-modified-id="Simple-model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Simple model</a></span></li><li><span><a href="#Model-that-can-read" data-toc-modified-id="Model-that-can-read-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Model that can read</a></span></li><li><span><a href="#GridSearch-for-model-parameters" data-toc-modified-id="GridSearch-for-model-parameters-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>GridSearch for model parameters</a></span></li></ul></div>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('muted')
sns.set_color_codes('muted')
sns.set_style('white')

In [None]:
%config InlineBackend.figure_format = 'retina'

Seminar plan:
1. Underfitting and overfitting.
2. Quality metrics (precision/recall/F1/...)
3. Grid Search & friends
4. (probably) Time-aware model validation

# Task

https://youtrack.jetbrains.com/issues/IDEA

Predict the issue type at the moment a new issue is reported.

Process model:

1. External users create a new issue and specify its summary and description. The author ID and creation date are recorded automatically. For simplicity, we assume that the summary and description do not change afterwards.
1. At some point the issue becomes resolved. We're interested in the value of the Type field at this moment. Again, for simplicity, we assume that the value of the Type field does not change afterwards.

Therefore, all we need is the `id`, `reporter`, `created`, `summary` and `description` of every resolved IDEA issue that was created by an external user.

# Dataset

Scraped from https://youtrack.jetbrains.com/issues/IDEA

In [None]:
df = pd.read_json('../data/issues.json.zip', lines=True)

In [None]:
df.sample(5)

In [None]:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import FunctionTransformer

In [None]:
# the code in this cell is a less readable but more usable analog of:
# df['type'] = df.customFields.map(lambda x: [cf['value']['name'] for cf in x if cf['name'] == 'Type'][0])
# df['reporter'] = df.reporter.map(lambda x: x['login'])
# df['created'] = pd.to_datetime(df.created, unit='ms')

formatter = DataFrameMapper([
    ('customFields', FunctionTransformer(
        lambda c: c.map(lambda x: [cf['value']['name'] for cf in x if cf['name'] == 'Type'][0])
    ), {'alias': 'type'}),
    ('reporter', FunctionTransformer(lambda c: c.map(lambda x: x['login']))),
    ('created', FunctionTransformer(lambda c: pd.to_datetime(c, unit='ms'))),
    ('summary', None),
    ('description', None),
    (['summary', 'description'], 
     FunctionTransformer(lambda x: x.summary.fillna('') + '\n\n' + x.description.fillna('')),
     dict(alias='text')
    ),
    ('idReadable', None)
], input_df=True, df_out=True)
formatter.fit_transform(df).sample(5)

In [None]:
X = formatter.transform(df)[['idReadable', 'summary', 'description', 'text', 'reporter', 'created']]
X.sample(5)

In [None]:
y = formatter.transform(df)['type']

In [None]:
y.value_counts(normalize=True)

**Q**: Which type of classifier do we need to build: binary or multiclass?

# Naive baseline

In [None]:
from sklearn.dummy import DummyClassifier

A naive baseline is a good place to **settle all your evaluation procedures**.

In [None]:
dummy = DummyClassifier(strategy='most_frequent')

In [None]:
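from sklearn.model_selection import train_test_split
# y_binary is not defined earlier in the notebook; assumption: the binary target is "the issue is a Bug"
y_binary = (y == 'Bug').astype(int)
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_binary)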

In [None]:
dummy.fit(X_train, y_bin_train)

In [None]:
dummy.predict(X_train)# .any()

## Metrics

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score

**Q**: What does the confusion matrix for the DummyClassifier look like?

**Q**: What are the accuracy, precision and recall values for the DummyClassifier?
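A minimal sketch of how these metrics behave for a `most_frequent` baseline. It uses a synthetic target (70 negatives, 30 positives, numbers made up for illustration), not the issue data, so the actual values for the notebook's `y_binary` will differ and depend on which class is the majority there.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Synthetic, imbalanced binary target: 70 negatives and 30 positives (made-up numbers).
y_toy = np.array([0] * 70 + [1] * 30)
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the feature values

toy_dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
y_pred = toy_dummy.predict(X_toy)  # always predicts the majority class (0 here)

cm = confusion_matrix(y_toy, y_pred)  # rows: actual class, columns: predicted class
acc = accuracy_score(y_toy, y_pred)
# zero_division=0 avoids the warning raised when no positive predictions are made
prec = precision_score(y_toy, y_pred, zero_division=0)
rec = recall_score(y_toy, y_pred, zero_division=0)

print(cm)
print(acc, prec, rec)
```

In this toy setting the accuracy equals the share of the majority class, while precision and recall for the positive class are both zero, which is why accuracy alone can be misleading on imbalanced data.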

# Simple model

In [None]:
import calendar

In [None]:
sns.barplot(x=X_train.created.dt.day_name(), y=y_bin_train, order=list(calendar.day_name))

Issues created on Sunday are slightly less likely to be bugs. Maybe this can be encoded in a model?

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
preprocessor = DataFrameMapper([
    ('created', make_pipeline(
        FunctionTransformer(lambda d: d.dt.day_name().to_frame()),
        OneHotEncoder()
    ))
], input_df=True, df_out=True)
preprocessor.fit_transform(X_train)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lr = make_pipeline(preprocessor, LogisticRegression())
lr.fit(X_train, y_bin_train)

**Q**: what is the quality of the day-of-week model?

# Model that can read

We have texts. How can we transform texts into a set of features?
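One common answer is a bag-of-words representation. The sketch below applies `CountVectorizer` and `TfidfVectorizer` to three made-up summaries (not taken from the dataset) to show what the resulting feature matrix looks like; the `token_pattern` is chosen here for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Three made-up issue summaries, used only for illustration.
toy_texts = [
    "IDE crashes on startup",
    "Add dark theme to the IDE",
    "Editor crashes when pasting text",
]

# Bag of words: one column per distinct token, each cell holds a raw count.
counts = CountVectorizer(token_pattern=r'[A-Za-z]{2,}')
X_counts = counts.fit_transform(toy_texts)
vocab = sorted(counts.vocabulary_)  # tokens in column order
print(vocab)
print(X_counts.toarray())

# TF-IDF: same columns, but tokens that occur in many documents are down-weighted
# and every row is scaled to unit length.
tfidf = TfidfVectorizer(token_pattern=r'[A-Za-z]{2,}')
X_tfidf = tfidf.fit_transform(toy_texts).toarray()
print(X_tfidf.round(2))
```

Each document becomes a row and each token a column, so the output can be fed directly to a linear model.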

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
preprocessor = DataFrameMapper([
    ('text', TfidfVectorizer(
            min_df=.05, max_df=.5, token_pattern=r'[A-Za-z]{2,}', stop_words='english'
    ))
], input_df=True, df_out=True).fit(X_train)
preprocessor.transform(X_train.sample(5))

In [None]:
lr = make_pipeline(preprocessor.set_params(df_out=False), LogisticRegression())
lr.fit(X_train, y_bin_train)

**Task**: Evaluate the model quality

Are we interested in how many actual bugs we identified as bugs? Or is it more important not to load the prioritization engine with unrelated stuff? Or, maybe, the ultimate goal is not to miss any suggestion?
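These questions map onto different metrics: recall measures how many actual bugs were identified as bugs, while precision measures how much of what was flagged as a bug really is one. A small sketch of the trade-off follows, using made-up labels and made-up predicted probabilities rather than the notebook's model. It varies the decision threshold (`LogisticRegression.predict` effectively uses 0.5 on `predict_proba`).

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up ground truth (1 = bug) and made-up predicted probabilities of the positive class.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba = np.array([0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1])

rows = []
for threshold in (0.25, 0.5, 0.75):
    # An item is predicted positive when its probability reaches the threshold.
    y_pred = (proba >= threshold).astype(int)
    rows.append(dict(
        threshold=threshold,
        precision=precision_score(y_true, y_pred),
        recall=recall_score(y_true, y_pred),
    ))
    print(rows[-1])
```

On this toy data, lowering the threshold raises recall at the cost of precision, and raising it does the opposite, so the choice depends on which kind of error is more expensive for the task.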

# GridSearch for model parameters

In [None]:
from sklearn.model_selection import ParameterGrid

This first time, we will write the grid search logic manually. Usually you can employ `GridSearchCV` to do it for you. Or maybe you cannot?

In [None]:
param_grid = dict(min_df=[.01, .1], max_df=[.2, .3, .5])
param_grid

In [None]:
results = []
for params in ParameterGrid(param_grid):
    # exercise: fit the model with `params`, then append a dict that includes a 'test_recall' key
    pass

In [None]:
pd.DataFrame(results).sort_values('test_recall').style.bar(vmin=0, vmax=1)
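One possible shape of the manual grid-search loop is sketched below. It is not the seminar's reference solution. It assumes a synthetic corpus built from made-up word lists instead of the issue texts, and it passes the parameters straight to a `TfidfVectorizer` rather than going through the `DataFrameMapper` used above.

```python
import random

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the issue texts: every word list below is made up.
rng = random.Random(0)
bug_words = ["crash", "exception", "error", "freeze", "broken",
             "fails", "hang", "regression", "npe", "corrupt"]
other_words = ["add", "support", "option", "allow", "improve",
               "shortcut", "setting", "plugin", "theme", "export"]
common_words = ["ide", "editor", "project", "file"]


def make_text(own, foreign):
    # Two class-specific words, one word from the other class (noise), two common words.
    words = rng.sample(own, 2) + rng.sample(foreign, 1) + rng.sample(common_words, 2)
    return ' '.join(words)


labels = [i % 2 for i in range(200)]  # 1 = bug, 0 = anything else
texts = [make_text(bug_words, other_words) if label else make_text(other_words, bug_words)
         for label in labels]

texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, random_state=0, stratify=labels)

param_grid = dict(min_df=[.01, .1], max_df=[.2, .3, .5])
results = []
for params in ParameterGrid(param_grid):
    # Refit the whole pipeline for every combination of vectorizer parameters.
    model = make_pipeline(TfidfVectorizer(**params), LogisticRegression())
    model.fit(texts_train, y_train)
    test_pred = model.predict(texts_test)
    results.append(dict(
        params,
        train_recall=recall_score(y_train, model.predict(texts_train)),
        test_recall=recall_score(y_test, test_pred),
        test_precision=precision_score(y_test, test_pred, zero_division=0),
    ))

results_df = pd.DataFrame(results).sort_values('test_recall')
print(results_df)
```

Recording both train and test scores for each parameter set makes it easier to spot overfitting, which a single test metric would hide.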