<h1 align="center"> Applications in Natural Language Processing </h1>
<h2 align="center"> Lecture 07 - Information Extraction (Part 2)</h2>
<h3 align="center"> Prof. Fernando Vieira da Silva MSc.</h3>

<h2>1. Relation Extraction</h2>
<p>Relation extraction consists of identifying how the named entities in a text are connected, which includes stating the type of the link between each pair of entities. Consider the example sentence below.</p>

<p>"Carlos Alberto de Nogueira is the longest-standing resident of Rua Praça da Alegria."</p>

<p>We have the entities:</p>

* Carlos Alberto de Nogueira (PERSON)
* Rua Praça da Alegria (LOCATION)

<p>These entities are related as follows:</p>

[Carlos Alberto de Nogueira (PERSON); longest-standing resident; Rua Praça da Alegria (LOCATION)]


<p>One of the best-known relation extraction systems is [Never-Ending Language Learning (NELL)](http://rtw.ml.cmu.edu/), a project developed at Carnegie Mellon University with participation from Google and also from Brazilian researchers funded by CNPq. The project extracts relations from millions of web pages, building a very large knowledge base.</p>

<h2>2. Methods for identifying relations</h2>

<p>The most common methods for identifying relations between entities are:</p>

* **Hand-coded patterns**: Write patterns, for example with regular expressions, that indicate two entities are related. For instance, "X lives in Y" ("X mora em Y") can be a pattern for the relation (X, mora_em, Y) between an entity X of type PERSON and an entity Y of type LOCATION.
* **Bootstrapping methods**: Starting from little data, search (on Google, for example) for occurrences of two entities whose relation is already known, and use the contexts found as patterns for the same relation between other entities.
* **Supervised methods**: Using a corpus annotated with relations, train models that 1) detect whether a relation exists between two entities and 2) classify the type of relation between them.
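
<p>As an illustration of the hand-coded approach, here is a minimal sketch using Python's <code>re</code> module. The pattern, the helper name <code>extract_lives_in</code> and the example sentence are assumptions made for this sketch; a real system would usually match against the output of a named entity recognizer rather than against capitalized words.</p>

```python
import re

# Hypothetical hand-written pattern: capitalized words, the phrase "lives in",
# then capitalized words again. Capitalization stands in for real NER output.
ENTITY = r"[A-Z]\w+(?: [A-Z]\w+)*"
LIVES_IN = re.compile(rf"(?P<x>{ENTITY}) lives in (?P<y>{ENTITY})")


def extract_lives_in(text):
    """Return one (X, mora_em, Y) triple for every match of the pattern."""
    return [(m.group("x"), "mora_em", m.group("y"))
            for m in LIVES_IN.finditer(text)]


print(extract_lives_in("Maria Silva lives in Porto Alegre."))
```

<p>Patterns like this tend to be precise but cover few phrasings, which is one reason the supervised method is used in the rest of the lecture.</p>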

<p>In this lecture, we will look at a supervised method for classifying the relation between entities, using techniques we have already used in previous lectures.</p>

<p>For this, we will use some of the features most commonly used for the problem, such as:</p>

* Bag of Words/LSA
* Flags indicating the entity types
* Number of words between the two entities
* Flag indicating whether the text of one entity contains the text of the other
* POS tags
* etc.
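
<p>Two of the features listed above (the number of words between the entities and the containment flag) can be computed with plain string operations. The sketch below is only an illustration: the function name <code>pair_features</code> and the example sentence are hypothetical, and it assumes each entity appears verbatim in the sentence.</p>

```python
def pair_features(sentence, e1, e2):
    """Compute two simple features for a pair of entity mentions."""
    i1, i2 = sentence.find(e1), sentence.find(e2)
    if i1 < 0 or i2 < 0:
        return None  # one of the entities was not found verbatim

    # Take the text between the two mentions, whichever comes first
    if i1 <= i2:
        between = sentence[i1 + len(e1):i2]
    else:
        between = sentence[i2 + len(e2):i1]

    return {
        "n_words_between": len(between.split()),
        "one_contains_other": e1 in e2 or e2 in e1,
    }


sentence = "IM CEFTRIAXONE is drug of choice for GONORRHEA"
print(pair_features(sentence, "IM CEFTRIAXONE", "GONORRHEA"))
```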



In [2]:
import nltk

In [4]:
import pandas as pd

df_train = pd.read_csv('../input/figure-eight-medical-sentence-summary/train.csv')
df_test = pd.read_csv('../input/figure-eight-medical-sentence-summary/test.csv')

df_train.head(20)

Unnamed: 0,_unit_id,_created_at,_canary,_id,_started_at,_channel,_trust,_worker_id,_country,_region,_city,_ip,direction,b1,b2,direction_gold,e1,e2,relation,relex_relcos,sent_id,sentence,term1,term2,twrex
0,502808352,7/13/2014 13:48:35,,1321892767,7/13/2014 13:48:14,clixsense,0.9167,27871219,NLD,07,Amsterdam,87.210.207.223,IM CEFTRIAXONE treats URETHRAL OR RECTAL GONOR...,41,128,,69,142,treats,1.0,907845-FS1-2,"For treatment of uncomplicated cervical, URETH...",URETHRAL OR RECTAL GONORRHEA,IM CEFTRIAXONE,RO-may_treat
1,502808352,7/13/2014 13:51:12,,1321894040,7/13/2014 13:51:07,neodev,0.8333,17610000,GBR,I2,Manchester,90.200.140.201,URETHRAL OR RECTAL GONORRHEA treats IM CEFTRIA...,41,128,,69,142,treats,1.0,907845-FS1-2,"For treatment of uncomplicated cervical, URETH...",URETHRAL OR RECTAL GONORRHEA,IM CEFTRIAXONE,RO-may_treat
2,502808352,7/13/2014 16:24:57,,1321961909,7/13/2014 16:24:35,instagc,0.6639,25990856,USA,NV,Las Vegas,68.108.98.78,IM CEFTRIAXONE treats URETHRAL OR RECTAL GONOR...,41,128,,69,142,treats,1.0,907845-FS1-2,"For treatment of uncomplicated cervical, URETH...",URETHRAL OR RECTAL GONORRHEA,IM CEFTRIAXONE,RO-may_treat
3,502808352,7/13/2014 16:33:49,,1321965723,7/13/2014 16:33:31,elite,0.3923,28276268,USA,CA,San Diego,76.88.95.100,URETHRAL OR RECTAL GONORRHEA treats IM CEFTRIA...,41,128,,69,142,treats,1.0,907845-FS1-2,"For treatment of uncomplicated cervical, URETH...",URETHRAL OR RECTAL GONORRHEA,IM CEFTRIAXONE,RO-may_treat
4,502808352,7/13/2014 16:47:27,,1321970904,7/13/2014 16:47:06,neodev,0.6552,27597779,CAN,AB,Calgary,68.146.86.137,IM CEFTRIAXONE treats URETHRAL OR RECTAL GONOR...,41,128,,69,142,treats,1.0,907845-FS1-2,"For treatment of uncomplicated cervical, URETH...",URETHRAL OR RECTAL GONORRHEA,IM CEFTRIAXONE,RO-may_treat
5,502808352,7/13/2014 16:56:13,,1321973849,7/13/2014 16:55:37,clixsense,0.6639,28037714,GBR,I4,Mitcham,94.4.232.118,IM CEFTRIAXONE treats URETHRAL OR RECTAL GONOR...,41,128,,69,142,treats,1.0,907845-FS1-2,"For treatment of uncomplicated cervical, URETH...",URETHRAL OR RECTAL GONORRHEA,IM CEFTRIAXONE,RO-may_treat
6,502808352,7/13/2014 17:14:41,,1321979856,7/13/2014 17:14:06,prodege,0.6151,2422962,USA,IA,Honey Creek,12.73.110.97,IM CEFTRIAXONE treats URETHRAL OR RECTAL GONOR...,41,128,,69,142,treats,1.0,907845-FS1-2,"For treatment of uncomplicated cervical, URETH...",URETHRAL OR RECTAL GONORRHEA,IM CEFTRIAXONE,RO-may_treat
7,502808354,7/13/2014 13:45:15,,1321891302,7/13/2014 13:44:25,clixsense,0.9167,27871219,NLD,07,Amsterdam,87.210.207.223,no_relation,175,203,,187,217,diagnosed by,0.53033,906321-FS1-13,Diagnosis specific malignancies available for ...,OSTEOSARCOMA,RETINOBLASTOMA,RO-has_manifestation
8,502808354,7/13/2014 13:50:45,,1321893871,7/13/2014 13:50:40,neodev,0.8333,17610000,GBR,I2,Manchester,90.200.140.201,OSTEOSARCOMA diagnosed by RETINOBLASTOMA,175,203,,187,217,diagnosed by,0.53033,906321-FS1-13,Diagnosis specific malignancies available for ...,OSTEOSARCOMA,RETINOBLASTOMA,RO-has_manifestation
9,502808354,7/13/2014 14:07:58,,1321902037,7/13/2014 14:07:28,prodege,0.9444,23977248,GBR,B5,Wembley,82.28.55.95,no_relation,175,203,,187,217,diagnosed by,0.53033,906321-FS1-13,Diagnosis specific malignancies available for ...,OSTEOSARCOMA,RETINOBLASTOMA,RO-has_manifestation


<h2>3. Building a Supervised Model</h2>
<p> We will use the [Figure Eight: Medical Sentence Summary](https://www.kaggle.com/kmader/figure-eight-medical-sentence-summary) corpus, which contains many sentences extracted from PubMed, with annotated entities and their relation types.</p>

In [5]:
df_train['relation'].unique()

array(['treats', 'diagnosed by', 'contraindicates', 'causes', 'location',
       'is location of', 'location of', 'is diagnosed by',
       'diagnose_by_test_or_drug'], dtype=object)

<p>We convert the sentences and relation types into NumPy arrays. We also binarize the relation labels so we can use them in our classifier shortly.</p>
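
<p>To make the binarization step concrete, here is a plain-Python sketch of the indicator encoding it produces when there are three or more classes: one row per example and one column per class. The helper name <code>binarize_labels</code> is hypothetical and the three labels are just a subset chosen for the example.</p>

```python
def binarize_labels(labels, classes):
    """One row per label, one column per class, with 1 where they match."""
    return [[1 if label == c else 0 for c in classes] for label in labels]


classes = ["treats", "causes", "location"]
rows = binarize_labels(["causes", "treats", "causes"], classes)

for row in rows:
    print(row)
```

<p>Each column can then be treated as a separate yes/no problem, which is what a one-vs-rest classifier does.</p>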

In [6]:
import numpy as np

# DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement
x_train = df_train['sentence'].to_numpy()
y_train = df_train['relation'].to_numpy()

from sklearn.preprocessing import label_binarize

y_train = label_binarize(y_train, classes=df_train['relation'].unique())

print(x_train[:10])
print(y_train[:10])

['For treatment of uncomplicated cervical, URETHRAL OR RECTAL GONORRHEA CDC and others recommend IM ceftriaxone or oral cefixime; IM CEFTRIAXONE is drug of choice for pharyngeal infections.'
 'For treatment of uncomplicated cervical, URETHRAL OR RECTAL GONORRHEA CDC and others recommend IM ceftriaxone or oral cefixime; IM CEFTRIAXONE is drug of choice for pharyngeal infections.'
 'For treatment of uncomplicated cervical, URETHRAL OR RECTAL GONORRHEA CDC and others recommend IM ceftriaxone or oral cefixime; IM CEFTRIAXONE is drug of choice for pharyngeal infections.'
 'For treatment of uncomplicated cervical, URETHRAL OR RECTAL GONORRHEA CDC and others recommend IM ceftriaxone or oral cefixime; IM CEFTRIAXONE is drug of choice for pharyngeal infections.'
 'For treatment of uncomplicated cervical, URETHRAL OR RECTAL GONORRHEA CDC and others recommend IM ceftriaxone or oral cefixime; IM CEFTRIAXONE is drug of choice for pharyngeal infections.'
 'For treatment of uncomplicated cervical, UR



<p>Since we do not have the entity types, but we know that in most cases they are names of drugs and diseases, we will not use the entity types as features; instead, we will use the POS tags of all the words between the entities. Let's build additional arrays with these features.</p>

<p>For the POS tags, we will do something similar to the chunking suggested in https://courses.cs.washington.edu/courses/cse517/13wi/slides/cse517wi13-RelationExtraction.pdf, but instead of chunking we will build 3-grams of the POS tags to keep things simple.</p>
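
<p>As a small illustration of what a 3-gram of POS tags looks like, the sketch below slides a window of three tags over a tag sequence. The helper name <code>pos_ngrams</code> is hypothetical; the tag sequence is the kind of output a POS tagger returns for the text between two entities.</p>

```python
def pos_ngrams(tags, n=3):
    """Join every run of n consecutive tags into a single feature string."""
    return [" ".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


tags = ['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
trigrams = pos_ngrams(tags)

for gram in trigrams:
    print(gram)
```

<p>Sequences shorter than three tags produce no 3-grams at all, so very short spans contribute nothing to this feature group.</p>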

In [7]:
x_train_sub_list = []

for i, row in df_train.iterrows():
    pos_t1 = row['sentence'].find(row['term1'])
    len_t1 = len(row['term1'])
    
    pos_t2 = row['sentence'].find(row['term2'])    
    
    x_train_sub_list.append(row['sentence'][pos_t1+len_t1:pos_t2])
    

x_train_sub = np.array(x_train_sub_list)

print(x_train_sub[:10])

[' CDC and others recommend IM ceftriaxone or oral cefixime; '
 ' CDC and others recommend IM ceftriaxone or oral cefixime; '
 ' CDC and others recommend IM ceftriaxone or oral cefixime; '
 ' CDC and others recommend IM ceftriaxone or oral cefixime; '
 ' CDC and others recommend IM ceftriaxone or oral cefixime; '
 ' CDC and others recommend IM ceftriaxone or oral cefixime; '
 ' CDC and others recommend IM ceftriaxone or oral cefixime; '
 " Wilms' tumour, " " Wilms' tumour, " " Wilms' tumour, "]


<p>Now let's define two tokenization functions: one that tokenizes for the bag-of-words and another that produces the POS tags.</p>

In [8]:
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import wordnet

def my_tokenizer_pos(doc):
    words = word_tokenize(doc)
    
    pos_tags = pos_tag(words)
    
    return [pos[1] for pos in pos_tags]

# testing our function:

for x in x_train_sub[:10]:
    print(my_tokenizer_pos(x))

['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
['NNP', 'CC', 'NNS', 'VBP', 'NNP', 'NN', 'CC', 'JJ', 'NN', ':']
['NNP', 'POS', 'NN', ',']
['NNP', 'POS', 'NN', ',']
['NNP', 'POS', 'NN', ',']


In [10]:
stopwords_list = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

def my_tokenizer_bow(doc):
    words = word_tokenize(doc)
    
    pos_tags = pos_tag(words)
    
    non_stopwords = [w for w in pos_tags if not w[0].lower() in stopwords_list]
    
    non_punctuation = [w for w in non_stopwords if not w[0] in string.punctuation]
    
    lemmas = []
    for w in non_punctuation:
        if w[1].startswith('J'):
            pos = wordnet.ADJ
        elif w[1].startswith('V'):
            pos = wordnet.VERB
        elif w[1].startswith('N'):
            pos = wordnet.NOUN
        elif w[1].startswith('R'):
            pos = wordnet.ADV
        else:
            pos = wordnet.NOUN
        
        lemmas.append(lemmatizer.lemmatize(w[0], pos))

    return lemmas

<p>Let's reuse the class for feature selection with SVD.</p>

In [11]:
from sklearn.decomposition import TruncatedSVD

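class SVDDimSelect(object):
    """Truncated SVD that keeps enough components to explain about half of the variance."""
    def fit(self, X, y=None):
        try:
            # First pass: fit an SVD with half as many components as features
            self.svd_transformer = TruncatedSVD(n_components=round(X.shape[1]/2))
            self.svd_transformer.fit(X)

            # Count the leading components needed to reach 50% cumulative variance
            # (counting the component that crosses the threshold, so k is never 0)
            cummulative_variance = 0.0
            k = 0
            for var in sorted(self.svd_transformer.explained_variance_ratio_)[::-1]:
                cummulative_variance += var
                k += 1
                if cummulative_variance >= 0.5:
                    break

            # Second pass: keep only those k components
            self.svd_transformer = TruncatedSVD(n_components=k)
        except Exception as ex:
            print(ex)

        self.svd_transformer.fit(X)
        # Return the selector itself, following the scikit-learn fit() convention
        return self

    def transform(self, X, Y=None):
        return self.svd_transformer.transform(X)

    def get_params(self, deep=True):
        return {}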
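
<p>The class above picks the number of SVD components from the cumulative explained variance. The sketch below isolates that selection rule in plain Python, under the assumption that we want the smallest number of components whose cumulative ratio reaches the threshold. The function name <code>choose_k</code> and the example ratios are hypothetical.</p>

```python
def choose_k(variance_ratios, threshold=0.5):
    """Smallest number of top components whose cumulative ratio reaches threshold."""
    cumulative = 0.0
    ordered = sorted(variance_ratios, reverse=True)
    for k, ratio in enumerate(ordered, start=1):
        cumulative += ratio
        if cumulative >= threshold:
            return k
    # Threshold never reached: keep every component
    return len(ordered)


print(choose_k([0.3, 0.25, 0.2, 0.1]))
```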

<p>Now let's build our Pipeline. In short, we will use the TF-IDF vectorizer and our POS tagger in parallel, then join the features and reduce their dimensionality with SVD.</p>

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import scipy

clf = OneVsRestClassifier(LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial'))


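my_pipeline = Pipeline([
    ('union', FeatureUnion([
        # Bag-of-words features weighted by TF-IDF
        ('bow', TfidfVectorizer(tokenizer=my_tokenizer_bow)),
        # TF-IDF over 3-grams of POS tags; lowercase=False keeps the original
        # casing of the text, which the POS tagger relies on
        ('pos', Pipeline([
            ('pos-vect', CountVectorizer(tokenizer=my_tokenizer_pos,
                                         ngram_range=(3, 3), lowercase=False)),
            ('pos-tfidf', TfidfTransformer())])),
    ])),
    ('svd', SVDDimSelect()),
    ('clf', clf)])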

par = {'clf__estimator__C' : np.logspace(-4, 4, 20)}

hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='f1_weighted', n_jobs=1, n_iter=20)
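
<p>The candidate values for <code>C</code> above come from a logarithmic grid. The short sketch below, which assumes only NumPy, shows what that grid looks like: 20 values from 1e-4 to 1e4 with a constant ratio between neighbours. Since the grid has 20 values and the search runs 20 iterations, the randomized search should, as far as we know, end up trying every candidate once.</p>

```python
import numpy as np

# Same call as in the search space: exponents from -4 to 4 in 20 even steps
C_grid = np.logspace(-4, 4, 20)

# On a logarithmic grid, consecutive values differ by a constant factor
ratios = C_grid[1:] / C_grid[:-1]

print(C_grid[0], C_grid[-1], len(C_grid))
print(ratios[0])
```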

<p>Now let's train the model.</p>

In [None]:
print(x_train_sub.shape)
print(y_train.shape)

hyperpar_selector.fit(X=x_train_sub, y=y_train)

(13340,)
(13340, 9)



In [None]:
# DataFrame.as_matrix() was removed from pandas; to_numpy() is the replacement
x_test = df_test['sentence'].to_numpy()
y_test = df_test['relation'].to_numpy()

y_test = label_binarize(y_test, classes=df_train['relation'].unique())

x_test_sub_list = []

for i, row in df_test.iterrows():
    pos_t1 = row['sentence'].find(row['term1'])
    len_t1 = len(row['term1'])
    
    pos_t2 = row['sentence'].find(row['term2'])    
    
    x_test_sub_list.append(row['sentence'][pos_t1+len_t1:pos_t2])
    

x_test_sub = np.array(x_test_sub_list)

In [None]:
# The model was trained on the text between the entities, so predict on the same kind of input
y_predicted = hyperpar_selector.predict(x_test_sub)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_predicted, target_names=df_train['relation'].unique()))
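
<p>The search above was scored with <code>f1_weighted</code>, and the report printed here also includes a weighted average. As a rough illustration for the single-label case, the sketch below computes a weighted F1 in plain Python: the F1 of each class is weighted by the number of true examples of that class. The function name <code>weighted_f1</code> and the toy labels are hypothetical.</p>

```python
def weighted_f1(y_true, y_pred):
    """Average of per-class F1 scores, weighted by each class's support."""
    total = 0.0
    for label in set(y_true):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        if tp == 0:
            f1 = 0.0  # no correct prediction for this class
        else:
            precision = tp / (tp + fp)
            recall = tp / (tp + fn)
            f1 = 2 * precision * recall / (precision + recall)
        total += (tp + fn) * f1  # tp + fn is the support of the class
    return total / len(y_true)


y_true = ["treats", "treats", "causes", "causes"]
y_pred = ["treats", "causes", "causes", "causes"]
print(weighted_f1(y_true, y_pred))
```

<p>Weighting by support means that frequent relations dominate the score, which is worth keeping in mind when some relation types are rare.</p>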


<p><b>Exercise 7:</b> Train a relation extraction model for Portuguese, using the corpus extracted from DBpedia, which is annotated with relations between pairs of entities.</p>

In [None]:
df_dict = {"sentence": [], "term1": [], "term2": [], "relation": []}

# Map each field label in the corpus file to its DataFrame column
campos = {"SENTENCE": "sentence", "ENTITY1": "term1", "ENTITY2": "term2", "REL TYPE": "relation"}

with open("../input/dbpedia-with-entity-relations-in-portuguese/DBpediaRelations-PT-0.2.txt", "r") as train_arquivo:
    for l in train_arquivo:
        if len(l) > 1:
            # Split only on the first colon so colons inside the value are kept
            chave, _, valor = l.partition(':')
            coluna = campos.get(chave.strip())
            if coluna is not None:
                # Strip the surrounding whitespace and newline so that
                # str.find can locate the entities inside the sentence
                df_dict[coluna].append(valor.strip())
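
<p>Appending each field to its own list works as long as every record in the file has all four fields. The sketch below shows one way to guard against incomplete records by grouping the lines into one dictionary per record. It assumes that each record starts with a <code>SENTENCE</code> line; the sample text, the relation name <code>origin</code> and the helper name <code>parse_records</code> are hypothetical and only mimic the field labels used above.</p>

```python
FIELDS = {"SENTENCE": "sentence", "ENTITY1": "term1",
          "ENTITY2": "term2", "REL TYPE": "relation"}


def parse_records(lines):
    """Group 'LABEL : value' lines into complete records, skipping partial ones."""
    records, current = [], {}
    for line in lines:
        key, sep, value = line.partition(":")
        column = FIELDS.get(key.strip())
        if not sep or column is None:
            continue  # blank line or a label we do not use
        if column == "sentence":
            current = {}  # a new record starts; drop any incomplete one
        current[column] = value.strip()
        if len(current) == len(FIELDS):
            records.append(current)
            current = {}
    return records


# Hypothetical sample in the assumed format; the second record is incomplete
sample = """SENTENCE : Machado de Assis nasceu no Rio de Janeiro.
ENTITY1 : Machado de Assis
ENTITY2 : Rio de Janeiro
REL TYPE : origin

SENTENCE : Linha sem as outras entidades.
ENTITY1 : Linha
""".splitlines()

records = parse_records(sample)
print(records)
```

<p>A list of dictionaries like this can be passed directly to <code>pd.DataFrame</code> to obtain the same four columns.</p>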


In [None]:
Df = pd.DataFrame.from_dict(df_dict)

In [None]:
Df

In [None]:
Df['relation'].unique()

In [None]:
x_train_sub_list = []

for i, row in Df.iterrows():
    pos_t1 = row['sentence'].find(row['term1'])
    pos_t2 = row['sentence'].find(row['term2']) 
    if pos_t1 < pos_t2:
        len_t1 = len(row['term1'])
        x_train_sub_list.append(row['sentence'][pos_t1+len_t1:pos_t2])
    else:
        len_t2 = len(row['term2'])
        x_train_sub_list.append(row['sentence'][pos_t2+len_t2:pos_t1])
    
    

x_train_sub = np.array(x_train_sub_list)

print(x_train_sub[:10])

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Full sentences and binarized relation labels
X = Df['sentence'].to_numpy()
y = label_binarize(Df['relation'].to_numpy(), classes=Df['relation'].unique())

# Text between the two entities, whichever of them comes first in the sentence
x_sub_list = []

for i, row in Df.iterrows():
    pos_t1 = row['sentence'].find(row['term1'])
    pos_t2 = row['sentence'].find(row['term2'])
    if pos_t1 < pos_t2:
        len_t1 = len(row['term1'])
        x_sub_list.append(row['sentence'][pos_t1+len_t1:pos_t2])
    else:
        len_t2 = len(row['term2'])
        x_sub_list.append(row['sentence'][pos_t2+len_t2:pos_t1])

x_sub = np.array(x_sub_list)

# Hold out a third of the examples for testing; the three arrays are split
# with the same indices so sentences, spans and labels stay aligned
X_train, X_test, x_train_sub, x_test_sub, y_train, y_test = train_test_split(
    X, x_sub, y, test_size=0.33, random_state=42)

print(x_test_sub[:10])
print(y_test[:10])

In [None]:
stopwords_list = stopwords.words('portuguese')
stemmer = nltk.stem.RSLPStemmer()

def my_tokenizer_bow_pt(doc):
    words = word_tokenize(doc, language='portuguese')

    # Compare whole tokens (not just their first character) against the lists
    non_stopwords = [w for w in words if w.lower() not in stopwords_list]

    non_punctuation = [w for w in non_stopwords if w not in string.punctuation]

    # Reduce each remaining word to its stem with the RSLP stemmer
    return [stemmer.stem(w) for w in non_punctuation]

In [None]:
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import wordnet
import nltk
from nltk.stem import RSLPStemmer
stopwords_list = stopwords.words('portuguese')

import pt_core_news_sm
nlp = pt_core_news_sm.load()

def my_tokenizer_pos_pt(doc):
    tokens = nlp(doc)
    pos_tags = []
    for t in tokens:
        pos_tags.append(t.pos_)
    
    return pos_tags

# testing our function on the text between the entities:

for x in x_train_sub[:10]:
    print(my_tokenizer_pos_pt(x))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import scipy

clf = OneVsRestClassifier(LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial'))


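my_pipeline = Pipeline([
    ('union', FeatureUnion([
        # Bag-of-words features weighted by TF-IDF
        ('bow', TfidfVectorizer(tokenizer=my_tokenizer_bow_pt)),
        # TF-IDF over 3-grams of POS tags; lowercase=False keeps the original
        # casing of the text, which the POS tagger relies on
        ('pos', Pipeline([
            ('pos-vect', CountVectorizer(tokenizer=my_tokenizer_pos_pt,
                                         ngram_range=(3, 3), lowercase=False)),
            ('pos-tfidf', TfidfTransformer())])),
    ])),
    ('svd', SVDDimSelect()),
    ('clf', clf)])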

par = {'clf__estimator__C' : np.logspace(-4, 4, 20)}

hyperpar_selector = RandomizedSearchCV(my_pipeline, par, cv=3, scoring='f1_weighted', n_jobs=1, n_iter=20)

In [None]:
print(x_train_sub.shape)
print(y_train.shape)

hyperpar_selector.fit(X=x_train_sub[:500], y=y_train[:500])

In [None]:
# The test data is used only for evaluation, so the model is not refit on it
print(x_test_sub.shape)
print(y_test.shape)

In [None]:
# The model was trained on the text between the entities, so predict on the same kind of input
y_predicted = hyperpar_selector.predict(x_test_sub)

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_predicted, target_names=Df['relation'].unique()))