# 00 - Advanced Categorical Feature Transformation with Likelihood and P-value

## Introduction

In this work we want to estimate the probability that a question is high quality based on the question's tags.

## Imports

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import re

## Data Preprocessing

In [7]:
data = pd.read_csv("data-raw/train.csv", parse_dates=['CreationDate'])
data = data.sort_values("CreationDate")
data.head()

Unnamed: 0,Id,Title,Body,Tags,CreationDate,Y
0,34552656,Java: Repeat Task Every Random Seconds,<p>I'm already familiar with repeating tasks e...,<java><repeat>,2016-01-01 00:21:59,LQ_CLOSE
1,34553034,Why are Java Optionals immutable?,<p>I'd like to understand why Java 8 Optionals...,<java><optional>,2016-01-01 02:03:20,HQ
2,34553174,Text Overlay Image with Darkened Opacity React...,<p>I am attempting to overlay a title over an ...,<javascript><image><overlay><react-native><opa...,2016-01-01 02:48:24,HQ
3,34553318,Why ternary operator in swift is so picky?,"<p>The question is very simple, but I just cou...",<swift><operators><whitespace><ternary-operato...,2016-01-01 03:30:17,HQ
4,34553755,hide/show fab with scale animation,<p>I'm using custom floatingactionmenu. I need...,<android><material-design><floating-action-but...,2016-01-01 05:21:48,HQ


In [8]:
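new_data = []
for ts, tags, y in data[['CreationDate', 'Tags', 'Y']].values:
    # Extract tag names from strings like "<java><repeat>". The character class stops at
    # symbols such as "+" or "#", so a tag like "c++" is captured as "c".
    tags_ = re.findall(r"<([\w\d\-]+)", tags)
    # Binary target: 1 for high-quality (HQ) questions, 0 otherwise
    y_ = 1 if 'HQ' in y else 0

    data_pt = [ts, tags_, y_]
    new_data.append(data_pt)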

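We split the data into three parts. We need to guard against data leakage, that is, the risk of showing the model data we would not yet have in production at that point in time. Because the rows are sorted by `CreationDate`, slicing by position keeps the splits in chronological order: the first 20,000 rows are used to compute the likelihoods, the next 20,000 to train the model, and the rest to test it.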

In [9]:
like = new_data[:20000]
train = new_data[20000:40000]
test = new_data[40000:]

In [10]:
like[0]

[Timestamp('2016-01-01 00:21:59'), ['java', 'repeat'], 0]
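
The leakage concern above can be checked directly. The following is a minimal sketch on made-up rows in the same `[timestamp, tags, label]` layout; the helper name `chronological_split` and the cut points are assumptions for illustration, not part of the notebook.

```python
from datetime import datetime, timedelta


def chronological_split(rows, first_cut, second_cut):
    """Split rows into three consecutive, time-ordered parts by position."""
    ordered = sorted(rows, key=lambda r: r[0])  # sort by timestamp
    return ordered[:first_cut], ordered[first_cut:second_cut], ordered[second_cut:]


# Made-up rows: one question per hour, alternating labels.
start = datetime(2016, 1, 1)
rows = [[start + timedelta(hours=i), ["tag"], i % 2] for i in range(10)]

like_part, train_part, test_part = chronological_split(rows, 4, 7)

# Each part ends before the next one starts, so no future data leaks backwards.
assert like_part[-1][0] < train_part[0][0]
assert train_part[-1][0] < test_part[0][0]
print(len(like_part), len(train_part), len(test_part))
```

Keeping the likelihood statistics on an earlier slice than the training data means the model never sees a feature that was computed from its own labels.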

## Likelihood

How often does a tag appear on a question that turned out to be high quality? In other words, what is the probability that a question is high quality given that it carries a particular tag?

We compute it as the number of positive examples with the tag divided by the total number of examples with the tag.

In [11]:
global_like = 0

tag_like_y = dict()
tag_like_num = dict()

for ts, tags, y  in like:
    global_like += y
    
    for t in tags:
        tag_like_y[t] = tag_like_y.get(t, 0) + y
        tag_like_num[t] = tag_like_num.get(t, 0) + 1

In [12]:
tag_like = dict()

for tag in tag_like_num.keys():
    tag_like[tag] = tag_like_y[tag] / tag_like_num[tag]
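
To make the ratio concrete, here is a small sketch of the same counting on four made-up questions. The rows and the names `toy`, `hq_count`, `total_count`, and `toy_like` are illustrative assumptions.

```python
from collections import Counter

# Made-up rows: (tags, label), where label 1 means HQ.
toy = [
    (["java", "repeat"], 0),
    (["java", "optional"], 1),
    (["python"], 1),
    (["java"], 0),
]

hq_count = Counter()     # HQ questions per tag
total_count = Counter()  # all questions per tag
for tags, y in toy:
    for t in tags:
        hq_count[t] += y
        total_count[t] += 1

toy_like = {t: hq_count[t] / total_count[t] for t in total_count}
print(toy_like)
```

Here `java` appears three times with one HQ question, so its likelihood is 1/3, while `python` and `repeat` each appear once and land on the extremes 1.0 and 0.0. Estimates built from so few occurrences are unreliable, which is what the significance check later in the notebook addresses.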

## Bias Factor

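Compared with the probability for a randomly chosen question, how likely is a question with a specific tag to be high quality? A factor above 1 means the tag is associated with high quality more often than the base rate, and a factor below 1 means less often.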

In [14]:
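# Base rate: share of HQ questions in the likelihood split
global_bias = global_like / len(like)

# Ratio between each tag's likelihood and the base rate
tag_bias_factor = dict()
for tag, likelihood in tag_like.items():
    tag_bias_factor[tag] = likelihood / global_bias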
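
As a quick illustration of how to read the factor, the sketch below uses a made-up base rate of 0.5 and three made-up tag likelihoods; the function name `bias_factor` is an assumption.

```python
def bias_factor(tag_likelihood, base_rate):
    """Ratio of a tag's HQ likelihood to the overall HQ rate."""
    return tag_likelihood / base_rate


base_rate = 0.5  # made-up value for illustration

above = bias_factor(0.75, base_rate)    # tag is HQ more often than average
neutral = bias_factor(0.5, base_rate)   # tag behaves like the average question
below = bias_factor(0.25, base_rate)    # tag is HQ less often than average
print(above, neutral, below)
```

Dividing by the base rate puts every tag on a common scale where 1 is the neutral point, regardless of how balanced the classes are.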

## P-value

Checks whether the likelihood above could have arisen by chance alone, that is, whether it is statistically significant.

In [15]:
from scipy.stats import binom

In [17]:
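tag_pvals = dict()
global_bias = 0.4093  # HQ base rate, hard-coded here

for tag in tag_like_num.keys():
    k = tag_like_y[tag]    # HQ questions with this tag
    n = tag_like_num[tag]  # all questions with this tag
    # Smaller of the two one-sided binomial tails: P(X <= k) and P(X >= k)
    tag_pvals[tag] = np.min([binom.cdf(k=k, n=n, p=global_bias), binom.sf(k=k-1, n=n, p=global_bias)])

tag_pvals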

{'java': 9.777594299743535e-82,
 'repeat': 0.34892649,
 'optional': 0.028682670309242264,
 'javascript': 4.826000894975935e-07,
 'image': 0.22410331283204216,
 'overlay': 0.365442285286,
 'react-native': 3.524430726726627e-67,
 'opacity': 0.65107351,
 'swift': 1.8465129525956367e-11,
 'operators': 0.026341674852645304,
 'whitespace': 0.4749587118995165,
 'ternary-operator': 0.365442285286,
 'android': 0.48507856654588277,
 'material-design': 0.014403741581310844,
 'floating-action-button': 0.365442285286,
 'c': 6.086308533874862e-126,
 'pointers': 2.0150464388807798e-13,
 'data-structures': 0.025141917206355335,
 'jquery': 7.623679431434003e-62,
 'jquery-ui': 0.2962612280941942,
 'html': 1.3208534471773006e-75,
 'css': 1.6174537564414793e-25,
 'twitter-bootstrap': 0.016342419387422202,
 'windows-10': 0.09626227879450157,
 'windows-10-mobile': 0.4093,
 'windows-10-universal': 0.068568592357,
 'vb6': 0.3210782967714346,
 'linux': 0.0026336869669543084,
 'mongodb': 5.961721467962766e-09,
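
To show what the two tails mean, the sketch below recomputes them from the binomial formula using only the standard library. The helper names and the `(k, n)` pairs are assumptions chosen for illustration, not counts read from the data; the lower tail corresponds to `binom.cdf(k, n, p)` and the upper tail to `binom.sf(k - 1, n, p)`.

```python
from math import comb


def binom_pmf(i, n, p):
    """P(X = i) for X ~ Binomial(n, p)."""
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def min_tail_pvalue(k, n, p):
    """Smaller of P(X <= k) and P(X >= k)."""
    lower = sum(binom_pmf(i, n, p) for i in range(0, k + 1))
    upper = sum(binom_pmf(i, n, p) for i in range(k, n + 1))
    return min(lower, upper)


p = 0.4093  # base rate used in the cell above

none_of_two = min_tail_pvalue(0, 2, p)   # 0 HQ out of 2: (1 - p) ** 2
one_of_one = min_tail_pvalue(1, 1, p)    # 1 HQ out of 1: p
many = min_tail_pvalue(40, 50, p)        # 40 HQ out of 50: far in the upper tail
print(none_of_two, one_of_one, many)
```

The first value comes out as roughly 0.3489, the same number shown for `repeat` above, which is consistent with (though it does not prove) that tag having appeared twice with no HQ question. A tag seen only once can never score below `min(p, 1 - p)`, so rare tags cannot look significant under this test.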


## Feature Engineering and Modeling

In [18]:
# Average bias feature

def gen_features(data, tag_dict, return_y=True):
    """Build one feature per question: the mean of tag_dict over the question's tags."""
    feature_col = []
    Y = []

    for ts, tags, y in data:

        feature_row = []

        for tag in tags:
            if tag not in tag_dict:
                # Tag not seen in the likelihood split: fall back to 1
                feature_row.append(1)
                continue
            feature_row.append(tag_dict[tag])

        feature_col.append(np.mean(feature_row))
        Y.append(y)

    feature_col = np.array(feature_col)
    feature_col[np.isnan(feature_col)] = 1.  # questions with no tags yield NaN; set them to 1
    feature_col = feature_col.reshape(-1, 1)  # sklearn expects a 2D array

    Y = np.array(Y)

    if return_y:
        return feature_col, Y
    return feature_col
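
The per-question logic of `gen_features` can be sketched in isolation as follows. The helper `avg_tag_feature` and the dictionary `toy_tag_like` are hypothetical and use made-up values; the fallback of 1 for unseen tags and for empty tag lists mirrors the function above.

```python
from statistics import mean


def avg_tag_feature(tags, tag_dict, default=1.0):
    """Mean of tag_dict over a question's tags, with a default for unseen tags."""
    values = [tag_dict.get(t, default) for t in tags]
    return mean(values) if values else default


toy_tag_like = {"java": 0.25, "optional": 0.75}  # made-up likelihoods

known = avg_tag_feature(["java", "optional"], toy_tag_like)       # both tags known
mixed = avg_tag_feature(["java", "brand-new-tag"], toy_tag_like)  # one unseen tag
empty = avg_tag_feature([], toy_tag_like)                         # no tags at all
print(known, mixed, empty)
```

When the dictionary holds bias factors, 1 is the neutral value. When it holds likelihoods, a default of 1 treats an unseen tag as if it were always HQ, so a different fallback (for example the base rate) may be worth considering in that case.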

In [20]:
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score

In [None]:
feature_col_tag_like_tr, Y_tr = gen_features(train, tag_like)
feature_col_tag_like_ts, Y_ts = gen_features(test, tag_like)

mdl = LGBMClassifier(random_state=0)
mdl.fit(feature_col_tag_like_tr, Y_tr)

raw_roc = roc_auc_score(Y_ts, feature_col_tag_like_ts)
model_roc = roc_auc_score(Y_ts, mdl.predict_proba(feature_col_tag_like_ts)[:,1])
print("ROC AUC - Feature = {} - Model = {}".format(raw_roc, model_roc))
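
The cell above scores the raw feature directly, without a model. That works because ROC AUC depends only on how the scores rank positives against negatives. A minimal sketch of that idea, using a hypothetical `pairwise_auc` helper and made-up labels and scores:

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 1]                  # made-up labels
feature = [0.1, 0.4, 0.35, 0.8]   # made-up feature values

auc_raw = pairwise_auc(y, feature)
# Any strictly increasing transformation leaves the ranking, and the AUC, unchanged.
auc_scaled = pairwise_auc(y, [10 * s + 3 for s in feature])
print(auc_raw, auc_scaled)
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75 in both cases. A model on a single feature can only beat the raw feature's AUC if it reorders the questions, for instance by learning a non-monotonic relationship.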

# End