<a id="11"></a> <br>
## 1-1 Import

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib
import warnings
import sklearn
import gensim
import scipy
import json
import nltk
import sys
import csv
import os

<a id="12"></a> <br>
## 1-2 Version

In [2]:
print('matplotlib: {}'.format(matplotlib.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

matplotlib: 3.0.3
scipy: 1.1.0
seaborn: 0.9.0
pandas: 0.23.4
numpy: 1.16.2
Python: 3.6.6 |Anaconda, Inc.| (default, Oct  9 2018, 12:34:16) 
[GCC 7.3.0]


<a id="13"></a> <br>
## 1-3 Setup

A few small adjustments to **plot styling** and warning output

In [3]:
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline

<a id="14"></a> <br>
## 1-4 Data set

In [4]:
print(os.listdir("../input/"))

['gendered-pronoun-resolution', 'gapdevelopment']


In [5]:
# @inproceedings{webster2018gap,
#   title =     {Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns},
#   author =    {Webster, Kellie and Recasens, Marta and Axelrod, Vera and Baldridge, Jason},
#   booktitle = {Transactions of the ACL},
#   year =      {2018},
#   pages =     {to appear}
# }

In [6]:
gendered_pronoun_df = pd.read_csv('../input/gapdevelopment/gap-development.tsv', delimiter='\t')
test_df1 = pd.read_csv('../input/gendered-pronoun-resolution/test_stage_1.tsv', delimiter='\t')
test_df2 = pd.read_csv('../input/gendered-pronoun-resolution/test_stage_2.tsv', delimiter='\t')

In [7]:
submission1 = pd.read_csv('../input/gendered-pronoun-resolution/sample_submission_stage_1.csv')
submission2 = pd.read_csv('../input/gendered-pronoun-resolution/sample_submission_stage_2.csv')

In [8]:
gendered_pronoun_df.shape

(2000, 11)

In [9]:
test_df2.shape

(12359, 9)

In [10]:
submission2.shape

(12359, 4)

<a id="15"></a> <br>
## 1-5 Gendered Pronoun Data set Analysis
<img src='https://storage.googleapis.com/kaggle-media/competitions/GoogleAI-GenderedPronoun/PronounResolution.png' width=600 height=600>
**Pronoun resolution** is part of coreference resolution, the task of pairing an expression with the entity it refers to. It is an important task for natural language understanding, and resolving ambiguous pronouns is a longstanding challenge. For more information, see the [competition page](https://www.kaggle.com/c/gendered-pronoun-resolution).
<a id="151"></a> <br>
### 1-5-1 Problem Feature
In this competition, you must identify the target of a pronoun within a text passage. The source text is taken from Wikipedia articles. You are provided with the pronoun and two candidate names to which the pronoun could refer. You must create an algorithm capable of deciding whether the pronoun refers to name A, name B, or neither.

In [11]:
gendered_pronoun_df.head()

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,A-coref,B,B-offset,B-coref,URL
0,development-1,Zoe Telford -- played the police officer girlf...,her,274,Cheryl Cassidy,191,True,Pauline,207,False,http://en.wikipedia.org/wiki/List_of_Teachers_...
1,development-2,"He grew up in Evanston, Illinois the second ol...",His,284,MacKenzie,228,True,Bernard Leach,251,False,http://en.wikipedia.org/wiki/Warren_MacKenzie
2,development-3,"He had been reelected to Congress, but resigne...",his,265,Angeloz,173,False,De la Sota,246,True,http://en.wikipedia.org/wiki/Jos%C3%A9_Manuel_...
3,development-4,The current members of Crime have also perform...,his,321,Hell,174,False,Henry Rosenthal,336,True,http://en.wikipedia.org/wiki/Crime_(band)
4,development-5,Her Santa Fe Opera debut in 2005 was as Nuria ...,She,437,Kitty Oppenheimer,219,False,Rivera,294,True,http://en.wikipedia.org/wiki/Jessica_Rivera


In [12]:
test_df2.head()

Unnamed: 0,ID,Text,Pronoun,Pronoun-offset,A,A-offset,B,B-offset,URL
0,000075809a8e6b062f5fb3c191a8ed52,"For the U.S. Under Secretary of State, see Luc...",she,310,Lucy Benson,59,Kerrie Taylor,160,http://en.wikipedia.org/wiki/Lucy_Benson
1,0005d0f3b0a6c9ffbd31a48453029911,"After this match, she reached her new career h...",she,334,Kudryavtseva,182,Maria Sharapova,259,http://en.wikipedia.org/wiki/Alla_Kudryavtseva
2,0007775c40bedd4147a0573d66dc28f8,In the same way in his Preface of the Books of...,his,298,Ezra,191,Jerome,323,http://en.wikipedia.org/wiki/Development_of_th...
3,001194e3fe1234d00198ef6bba4cc588,Anita's so-called homeless mate Machteld Steen...,she,313,Dian,205,Anita,278,http://en.wikipedia.org/wiki/Dian_Alberts
4,0014bb7085278ef3f9b74f14771caca9,"By March, she was the King's mistress, install...",her,362,Pompadour,262,Jeanne Antoinette,336,http://en.wikipedia.org/wiki/Madame_de_Pompadour


In [13]:
gendered_pronoun_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 11 columns):
ID                2000 non-null object
Text              2000 non-null object
Pronoun           2000 non-null object
Pronoun-offset    2000 non-null int64
A                 2000 non-null object
A-offset          2000 non-null int64
A-coref           2000 non-null bool
B                 2000 non-null object
B-offset          2000 non-null int64
B-coref           2000 non-null bool
URL               2000 non-null object
dtypes: bool(2), int64(3), object(6)
memory usage: 144.6+ KB


<a id="152"></a> <br>
### 1-5-2 Variables

1. ID - Unique identifier for an example (Matches to Id in output file format)
1. Text - Text containing the ambiguous pronoun and two candidate names (about a paragraph in length)
1. Pronoun - The target pronoun (text)
1. Pronoun-offset - The character offset of Pronoun in Text
1. A - The first name candidate (text)
1. A-offset - The character offset of name A in Text
1. B - The second name candidate
1. B-offset - The character offset of name B in Text
1. URL - The URL of the source Wikipedia page for the example
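
The offsets are character positions into `Text`, so slicing the text at an offset should reproduce the corresponding mention. A minimal sketch, using one sentence from the first development example; the offsets here are computed with `str.find` for illustration and are not the dataset's values, which index into the full passage:

```python
# Hypothetical row: the text is a single sentence from development-1 and the
# offsets are computed here rather than read from the dataset.
text = ("Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also "
        "a year 11 pupil in Simon's class.")
row = {
    "Text": text,
    "A": "Cheryl Cassidy",
    "A-offset": text.find("Cheryl Cassidy"),
    "B": "Pauline",
    "B-offset": text.find("Pauline"),
}

def mention_at(row, col):
    """Return the substring of Text that starts at the column's character offset."""
    start = row[col + "-offset"]
    return row["Text"][start:start + len(row[col])]

for col in ("A", "B"):
    print(col, row[col + "-offset"], repr(mention_at(row, col)))
```

If the slice does not equal the name, the offset and the text are out of sync (for example after stripping or re-encoding the text).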

<a id="153"></a> <br>
### 1-5-3 Evaluation
Submissions are evaluated using the multi-class logarithmic loss. Each pronoun has been labeled with whether it refers to A, B, or NEITHER. For each pronoun, you must submit a set of predicted probabilities (one for each class). The formula is:
<img src='http://s8.picofile.com/file/8351608076/1.png'>
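
Assuming the image shows the standard definition, the loss is $-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log p_{ij}$, where $N$ is the number of examples, $M = 3$ is the number of classes, $y_{ij}$ is 1 if example $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability. A minimal sketch in plain Python; the renormalisation and the clipping bound `eps` are assumptions about how such scores are usually computed, not details taken from this notebook:

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices (0 = A, 1 = B, 2 = NEITHER)
    probs:  list of per-class probability lists, one per example
    eps:    clipping bound (an assumed value) that avoids log(0)
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        p = row[label] / sum(row)        # renormalise so the row sums to 1
        p = min(max(p, eps), 1 - eps)    # clip away from 0 and 1
        total += -math.log(p)
    return total / len(y_true)

uniform = [[1 / 3, 1 / 3, 1 / 3]] * 4
print(multiclass_log_loss([0, 1, 2, 0], uniform))  # ln(3) for a uniform guess
```

The uniform prediction used in the sample submission therefore scores about 1.0986, which is a useful baseline for reading the validation losses later on.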

<a id="3"></a> <br>
## spaCy
[**spaCy**](https://spacy.io/) is an industrial-strength natural language processing library for Python.

In [14]:
import spacy

In [15]:
nlp = spacy.load('en')
def doc(row):
    return nlp(row['Text'])

In [16]:
gendered_pronoun_df['doc'] = gendered_pronoun_df.apply(doc,axis=1)

In [17]:
test_df2['doc'] = test_df2.apply(doc,axis=1)

In [18]:
sample= gendered_pronoun_df.loc[0]
# print each sentence of the parsed document
for d in sample.doc.sents:
    print(d)

Zoe Telford -- played the police officer girlfriend of Simon, Maggie.
Dumped by Simon in the final episode of series 1, after he slept with Jenny, and is not seen again.
Phoebe Thomas played Cheryl Cassidy, Pauline's friend and also a year 11 pupil in Simon's class.
Dumped her boyfriend following Simon's advice after he wouldn't have sex with her but later realised this was due to him catching crabs off her friend Pauline.


In [19]:
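# render the dependency parse and print each token's depth in the tree
from spacy import displacy
sample_doc = nlp(sample.Text)  # separate name so the doc() helper is not shadowed
displacy.render(sample_doc, style='dep', jupyter=True, options={'distance': 90})
depth=''
for token in sample_doc:
    token_heads = [_ for _ in token.ancestors]  # heads from the token up to the root
    depth+=token.text+'('+str(len(token_heads))+') '
print(depth)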

Zoe(2) Telford(1) --(2) played(0) the(2) police(3) officer(2) girlfriend(1) of(2) Simon(3) ,(4) Maggie(4) .(1) Dumped(0) by(1) Simon(2) in(1) the(3) final(3) episode(2) of(3) series(4) 1(5) ,(1) after(2) he(2) slept(1) with(2) Jenny(3) ,(2) and(2) is(3) not(3) seen(2) again(3) .(1) Phoebe(2) Thomas(1) played(0) Cheryl(2) Cassidy(1) ,(2) Pauline(3) 's(4) friend(2) and(3) also(3) a(4) year(3) 11(4) pupil(2) in(3) Simon(5) 's(6) class(4) .(1) Dumped(0) her(2) boyfriend(1) following(1) Simon(3) 's(4) advice(2) after(2) he(2) would(2) n't(2) have(1) sex(2) with(2) her(3) but(2) later(3) realised(2) this(4) was(3) due(4) to(5) him(6) catching(5) crabs(6) off(6) her(8) friend(7) Pauline(8) .(1) 
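
The number after each token is the count of its ancestors in the dependency tree, so the root of each sentence has depth 0. A minimal sketch of the same computation without spaCy, on a hand-written head array (the parse below is illustrative, not spaCy output):

```python
# Toy parse of "Phoebe Thomas played Cheryl Cassidy": heads[i] is the index of
# token i's head, and the root points to itself. Written by hand for
# illustration; a real parser may produce a different tree.
tokens = ["Phoebe", "Thomas", "played", "Cheryl", "Cassidy"]
heads = [1, 2, 2, 4, 2]

def depth(i, heads):
    """Count ancestors by following head links until the root is reached."""
    steps = 0
    while heads[i] != i:
        i = heads[i]
        steps += 1
    return steps

print(" ".join(f"{tok}({depth(i, heads)})" for i, tok in enumerate(tokens)))
```

With this toy tree the depths agree with the values printed above for the same five tokens.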


In [20]:
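def offset_to_index(row,col):
    # token index of the last token of the mention in column `col`
    stop = row[col+'-offset']+len(row[col])  # character position just past the mention
    length = len(row['doc'])
    for i in range(length):
        # the text of the first i+1 tokens reaches the end of the mention
        if len(row['doc'][:i+1].text) >= stop:
            return i  
        
def sentence_number(row,col):
    # 1-based number of the sentence that holds the token at `col`-index
    i=0
    count=0
    for d in row.doc.sents:
        count+=len(d)  # running token count up to the end of this sentence
        i+=1
        # note: because of `>=`, the first token of any later sentence
        # is assigned to the sentence before it
        if count>= row[col+'-index']:
            return i

def ancestors(row,col):
    # depth in the dependency tree, root word, and dependency label of the token
    d = row['doc'][int(row[col+'-index'])]
    d_heads = [_ for _ in d.ancestors]
    if d_heads:
        ancestor=d_heads[-1].text  # the last ancestor is the root of the sentence
    else:
        ancestor='Null'  # the token is itself the root
    return pd.Series([len(d_heads),ancestor,d.dep_])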
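
The offset-to-index step can be illustrated without spaCy. The sketch below uses a crude regex tokenizer as a stand-in (an assumption; spaCy's tokenizer follows more elaborate rules) and returns the index of the token that contains the last character of the mention:

```python
import re

def tokenize(text):
    """Split into word and punctuation tokens, keeping character spans.
    A simple regex stand-in for a real tokenizer."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def offset_to_index(text, mention, offset):
    """Index of the token that contains the mention's last character."""
    stop = offset + len(mention)  # character position just past the mention
    for i, (_, _, end) in enumerate(tokenize(text)):
        if end >= stop:
            return i
    return None  # the offset lies beyond the text

text = "Phoebe Thomas played Cheryl Cassidy, Pauline's friend."
idx = offset_to_index(text, "Cheryl Cassidy", text.find("Cheryl Cassidy"))
print(idx, tokenize(text)[idx][0])
```

For a multi-word mention this picks the final word, which is why the later consistency check compares against the last whitespace-separated piece of each name.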

In [21]:
def data_process(df):
    for col in ['Pronoun','A','B']:
        df[col+'-index'] = df.apply(offset_to_index,col=col,axis=1)
        df[col+'-sent'] = df.apply(sentence_number,col=col,axis=1)
        df[[col+'-depth',col+'-ancestor', col+'-dep']] = df.apply(ancestors, col=col, axis=1)
        # create some features from the dependencies
        if col in ['A','B']:
            df[col+'-sentdistance'] = df['Pronoun-sent']-df[col+'-sent']
            df[col+'-depthdistance'] = df['Pronoun-depth']-df[col+'-depth']
            df[col+'-sameancestor'] = (df['Pronoun-ancestor']==df[col+'-ancestor'])*1
            df[col+'-samedep'] = (df['Pronoun-dep']==df[col+'-dep'])*1

In [22]:
data_process(gendered_pronoun_df)

In [23]:
data_process(test_df2)

In [24]:
# check that each computed token index points at the last word of its mention
def consistant(df):
    badlist=[]
    for row_index,row in df.iterrows():
        for col in ['A','B','Pronoun']:
            token = row['doc'][int(row[col+'-index'])]
            if row[col].split()[-1]!=token.text:
                print(col,'---',row[col],'---',token)
                badlist+=[row_index]
        
    return badlist

In [25]:
print(consistant(gendered_pronoun_df))

B --- Bo-lung --- lung
B --- Adele Chatfield-Taylor --- Taylor
B --- Gloria Macapagal-Arroyo --- Arroyo
A --- Gagnon-Tremblay --- Tremblay
B --- Monique Jerome-Forget --- Forget
B --- Graeme Dott. --- .
A --- Paul Pellisson.Il --- Il
B --- Adie (** --- *
B --- Delia --- Delia-
B --- Sarah Fitz-Gerald --- Gerald
A --- Helen Sainton-Dolby --- Dolby
A --- Nancy Banks-Smith --- Smith
B --- 7th Dan Shito-Ryu Karate-Do Sosei-Kai --- Kai
B --- William P --- P.
A --- G-Man --- Man
B --- St*phane Mallarm* --- *
A --- Jay-Z --- Z
A --- Betty Thatcher-Newsinger --- Newsinger
A --- Beyonc* --- *
A --- Co Stomp* --- Stomp*.
B --- TV-presenter --- presenter
A --- Jazmine James --- James-
B --- Mrs --- Mrs.
B --- Richard Rainshaw Rothwell I --- I.
B --- Johnson Aguiyi-Ironsi --- Ironsi
B --- Lepist* --- *
A --- Al-Kind* --- *
B --- Ibn S*n* --- *
A --- Tatjana Juri* --- *
A --- Luis Moreno-Ocampo --- Ocampo
B --- Arch Hall, Sr. --- .
B --- Herv* --- *
B --- Bront* --- *
B --- Jason Aldean --- Aldean(

In [26]:
print(consistant(test_df2))

B --- Margaret Oneill-Prott --- Prott
B --- James Curtis Skakel Sr. --- .
B --- Frank Lloyd Wright Jr --- Jr.
A --- Golding-Kirk --- Kirk
B --- Natalia Cordova-Buckley --- Buckley
B --- Claude Friese-Greene --- Greene
B --- Elizabeth Hamilton Foy. --- .
A --- John Hanbury-Williams --- Williams
B --- Charles V --- V.
B --- Janelle Corlass-Brown --- Brown
B --- Charles S --- S.
B --- Barbara Evadney Reid-Hibbert --- Hibbert
B --- Mrs --- Mrs.
B --- Beth Enright/Beresford --- Beresford
A --- Christopher Trevor-Roberts --- Roberts
A --- John Heath-Stubbs --- Stubbs
A --- Carrieri-Russo --- Russo
B --- Co-Founder --- Founder
A --- Al-Maqaleh --- Maqaleh
B --- William of Limburg-Broich --- Broich
B --- Phyllis --- Phyllis(1975
B --- half-French --- French
A --- Sun Yat-Sen --- Sen
A --- Amanda Winn-Lee --- Lee
A --- D-BOYS --- BOYS
A --- Legge-Bourke --- Bourke
B --- Fanny Blankers-Koen --- Koen
B --- Ann-Margret --- Margret
A --- Leveson-Gower --- Gower
A --- Bauffremont-Courtenay --- Court

In [27]:
gendered_pronoun_df.columns

Index(['ID', 'Text', 'Pronoun', 'Pronoun-offset', 'A', 'A-offset', 'A-coref',
       'B', 'B-offset', 'B-coref', 'URL', 'doc', 'Pronoun-index',
       'Pronoun-sent', 'Pronoun-depth', 'Pronoun-ancestor', 'Pronoun-dep',
       'A-index', 'A-sent', 'A-depth', 'A-ancestor', 'A-dep', 'A-sentdistance',
       'A-depthdistance', 'A-sameancestor', 'A-samedep', 'B-index', 'B-sent',
       'B-depth', 'B-ancestor', 'B-dep', 'B-sentdistance', 'B-depthdistance',
       'B-sameancestor', 'B-samedep'],
      dtype='object')

In [28]:
X = gendered_pronoun_df[['A-dep', 'B-dep','Pronoun-dep', 
                        'A-sentdistance', 'A-depthdistance', 'A-sameancestor','A-samedep',
                        'B-sentdistance', 'B-depthdistance', 'B-sameancestor','B-samedep']]

In [29]:
test_X = test_df2[['A-dep', 'B-dep','Pronoun-dep', 
                        'A-sentdistance', 'A-depthdistance', 'A-sameancestor','A-samedep',
                        'B-sentdistance', 'B-depthdistance', 'B-sameancestor','B-samedep']]

In [30]:
joined_X = pd.concat([X,test_X])

In [31]:
one_hot_encoded_joined_X = pd.get_dummies(joined_X)
print(one_hot_encoded_joined_X.shape)

(14359, 82)
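
Training and test features are concatenated before one-hot encoding so that both end up with the same columns, even when a dependency label occurs in only one of them. A small sketch with toy frames (the values are hypothetical):

```python
import pandas as pd

train = pd.DataFrame({"A-dep": ["nsubj", "dobj"], "A-samedep": [1, 0]})
test = pd.DataFrame({"A-dep": ["pobj", "nsubj"], "A-samedep": [0, 0]})

# Encoding each frame separately gives different column sets.
separate_train = pd.get_dummies(train)
separate_test = pd.get_dummies(test)
print(list(separate_train.columns))
print(list(separate_test.columns))

# Encoding the concatenation, then splitting by position, keeps them aligned.
joined = pd.get_dummies(pd.concat([train, test]))
enc_train, enc_test = joined.iloc[:len(train)], joined.iloc[len(train):]
print(list(enc_train.columns) == list(enc_test.columns))
```

Splitting by position relies on the concatenation order, which is why the training rows come first in the notebook.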


In [32]:
one_hot_encoded_X = one_hot_encoded_joined_X.iloc[:2000]
one_hot_encoded_test_X = one_hot_encoded_joined_X.iloc[2000:]

In [33]:
def build_labels(df):
    # class index per row: 0 = A, 1 = B, 2 = NEITHER
    Ylabel=[]
    for row_index,row in df.iterrows():
        if row['A-coref']:
            Ylabel.append(0)
        elif row['B-coref']:
            Ylabel.append(1)
        else:
            Ylabel.append(2)
    return Ylabel

In [34]:
Y = np.array(build_labels(gendered_pronoun_df))

In [35]:
Y.shape

(2000,)
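
The row-by-row labelling above can also be written as a vectorized expression. A minimal sketch on toy flags standing in for the `A-coref` and `B-coref` columns:

```python
import numpy as np

# Toy flags standing in for the A-coref and B-coref columns.
a_coref = np.array([True, False, False, True])
b_coref = np.array([False, True, False, False])

# 0 = A, 1 = B, 2 = NEITHER, matching the order of the submission columns.
labels = np.where(a_coref, 0, np.where(b_coref, 1, 2))
print(labels)
```

The class order matters because the predicted probability columns are later written to `A`, `B`, and `NEITHER` in that order.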

In [36]:
#Xtrain,XCV,Ytrain,YCV = one_hot_encoded_X[:1500],one_hot_encoded_X[1500:],Y[:1500],Y[1500:]

In [37]:
from sklearn.metrics import accuracy_score
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [38]:
n_splits = 5
splits = list(StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=2019).split(one_hot_encoded_X, np.zeros(shape=(one_hot_encoded_X.shape[0], 1))))
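
Note that the split above passes an all-zero array as the label argument, so every row belongs to a single class and `StratifiedKFold` has nothing to stratify on; it then behaves like a shuffled K-fold. A sketch of stratifying on actual labels, using toy data (in this notebook the label array would be `Y`):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 50/30/20 class mix; X is a placeholder feature matrix.
y = np.array([0] * 50 + [1] * 30 + [2] * 20)
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2019)
fold_counts = [np.bincount(y[val_idx], minlength=3).tolist()
               for _, val_idx in skf.split(X, y)]
print(fold_counts)  # per-class counts in each validation fold
```

Each validation fold keeps the overall class proportions, which makes the per-fold log loss values easier to compare.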

In [39]:
params = {
          "objective" : "multiclass",
          "num_class" : 3,
          "num_leaves" : 30,
          "max_depth": -1,
          "learning_rate" : 0.01,
          "bagging_fraction" : 0.6,  # subsample
          "feature_fraction" : 0.6,  # colsample_bytree
          "bagging_freq" : 5,        # subsample_freq
          "bagging_seed" : 2018,
          "verbosity" : -1 }

preds_test = []

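for idx, (train_idx, val_idx) in enumerate(splits):
    
    print("Beginning fold {}".format(idx+1))
    # split features and labels into this fold's training and validation parts
    Xtrain, Ytrain, XCV, YCV = one_hot_encoded_X.iloc[train_idx], Y[train_idx], one_hot_encoded_X.iloc[val_idx], Y[val_idx]
    lgtrain, lgval = lgb.Dataset(Xtrain,Ytrain), lgb.Dataset(XCV,YCV)
    # train for up to 2000 rounds, stopping early if the validation loss has not improved for 100 rounds
    lgbmodel = lgb.train(params, lgtrain, 2000, valid_sets=[lgtrain, lgval], early_stopping_rounds=100, verbose_eval=200)
    # predict class probabilities for the test set with this fold's model
    pred=lgbmodel.predict(one_hot_encoded_test_X)
    preds_test.append(pred)

# average the per-fold probabilities
preds_test = np.mean(preds_test, axis=0)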
print(preds_test.shape)

Beginning fold 1
Training until validation scores don't improve for 100 rounds.
[200]	training's multi_logloss: 0.655315	valid_1's multi_logloss: 0.72929
[400]	training's multi_logloss: 0.572688	valid_1's multi_logloss: 0.709235
Early stopping, best iteration is:
[420]	training's multi_logloss: 0.566856	valid_1's multi_logloss: 0.708087
Beginning fold 2
Training until validation scores don't improve for 100 rounds.
[200]	training's multi_logloss: 0.658598	valid_1's multi_logloss: 0.724297
[400]	training's multi_logloss: 0.574542	valid_1's multi_logloss: 0.699338
Early stopping, best iteration is:
[425]	training's multi_logloss: 0.566951	valid_1's multi_logloss: 0.698119
Beginning fold 3
Training until validation scores don't improve for 100 rounds.
[200]	training's multi_logloss: 0.671255	valid_1's multi_logloss: 0.692448
[400]	training's multi_logloss: 0.589698	valid_1's multi_logloss: 0.654475
[600]	training's multi_logloss: 0.540126	valid_1's multi_logloss: 0.649287
Early stopping, 

In [40]:
preds_test[:10]

array([[0.84079579, 0.0970818 , 0.06212241],
       [0.87275508, 0.0870051 , 0.04023983],
       [0.12003613, 0.85912486, 0.02083901],
       [0.85418223, 0.0985016 , 0.04731617],
       [0.33897014, 0.60831916, 0.0527107 ],
       [0.93397192, 0.03657491, 0.02945317],
       [0.55842207, 0.41638526, 0.02519267],
       [0.02891246, 0.94789833, 0.02318921],
       [0.52569098, 0.45406974, 0.02023928],
       [0.1312903 , 0.39345758, 0.47525212]])
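
Each row above is the mean of the per-fold probability vectors for one test example. A small sketch with two hypothetical folds shows that averaging over the fold axis keeps the shape and leaves every row summing to 1:

```python
import numpy as np

# Predictions from two hypothetical folds for three examples (rows sum to 1).
fold_preds = [
    np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]),
    np.array([[0.9, 0.05, 0.05], [0.4, 0.5, 0.1], [0.1, 0.4, 0.5]]),
]

mean_preds = np.mean(fold_preds, axis=0)  # average over the fold axis
print(mean_preds.shape)                   # (3, 3)
print(mean_preds.sum(axis=1))             # each row still sums to 1
```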

In [41]:
submission2.head()

Unnamed: 0,ID,A,B,NEITHER
0,000075809a8e6b062f5fb3c191a8ed52,0.33333,0.33333,0.33333
1,0005d0f3b0a6c9ffbd31a48453029911,0.33333,0.33333,0.33333
2,0007775c40bedd4147a0573d66dc28f8,0.33333,0.33333,0.33333
3,001194e3fe1234d00198ef6bba4cc588,0.33333,0.33333,0.33333
4,0014bb7085278ef3f9b74f14771caca9,0.33333,0.33333,0.33333


In [42]:
submission2[['A','B','NEITHER']] = preds_test

In [43]:
submission2.head()

Unnamed: 0,ID,A,B,NEITHER
0,000075809a8e6b062f5fb3c191a8ed52,0.840796,0.097082,0.062122
1,0005d0f3b0a6c9ffbd31a48453029911,0.872755,0.087005,0.04024
2,0007775c40bedd4147a0573d66dc28f8,0.120036,0.859125,0.020839
3,001194e3fe1234d00198ef6bba4cc588,0.854182,0.098502,0.047316
4,0014bb7085278ef3f9b74f14771caca9,0.33897,0.608319,0.052711


In [44]:
submission2.to_csv('submission2.csv', index=False)
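
Before uploading, it can help to sanity-check the file. The sketch below is one possible set of checks, not part of the competition tooling; `check_submission` is a hypothetical helper, and the column names are those of the sample submission shown earlier:

```python
import numpy as np
import pandas as pd

def check_submission(df, n_expected):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(df.columns) != ["ID", "A", "B", "NEITHER"]:
        problems.append("unexpected columns")
        return problems  # the remaining checks need these columns
    if len(df) != n_expected:
        problems.append("unexpected row count")
    probs = df[["A", "B", "NEITHER"]].to_numpy()
    if not np.allclose(probs.sum(axis=1), 1.0, atol=1e-4):
        problems.append("rows do not sum to 1")
    if df["ID"].duplicated().any():
        problems.append("duplicate IDs")
    return problems

# Toy frame with hypothetical IDs.
toy = pd.DataFrame({"ID": ["x1", "x2"], "A": [0.8, 0.1],
                    "B": [0.1, 0.6], "NEITHER": [0.1, 0.3]})
print(check_submission(toy, n_expected=2))  # []
```

For the real file, the expected row count would be the number of rows in the stage 2 test set.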