Gendered Pronoun Resolution
=====

# Objective

In this two-stage competition, Kagglers are challenged to build pronoun resolution systems that perform equally well regardless of pronoun gender. Stage two's final evaluation will use a new dataset following the same format. To encourage gender-fair modeling, the ratio of masculine to feminine examples in the official test data will not be known ahead of time. 

In this competition, you must identify the target of a pronoun within a text passage. The source text is taken from Wikipedia articles. You are provided with the pronoun and two candidate names to which the pronoun could refer. You must create an algorithm capable of deciding whether the pronoun refers to name ```A```, name ```B```, or neither.

# Evaluation

Submissions are evaluated using the multi-class logarithmic loss. Each pronoun has been labeled with whether it refers to A, B, or NEITHER. For each pronoun, you must submit a set of predicted probabilities (one for each class). The formula is then

$\text{logloss} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$

where N is the number of samples in the test set, M is 3, log is the natural logarithm, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.

The submitted probabilities are not required to sum to one because they are rescaled prior to being scored (each row is divided by the row sum). In order to avoid the extremes of the log function, predicted probabilities are replaced with $\max(\min(p, 1 - 10^{-15}), 10^{-15})$.
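
The metric described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the official scorer: the function name `multiclass_log_loss` and the toy arrays are hypothetical, and the order of operations (rescale each row first, then clip) is an assumption based on the description above.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Multi-class log loss with row rescaling and clipping (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # Rescale each row so the probabilities sum to one
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Clip to avoid log(0); assumed to happen after rescaling
    probs = np.clip(probs, eps, 1 - eps)
    # Average the negative log-probability of the true class
    return float(-np.mean(np.sum(y_true * np.log(probs), axis=1)))

# Toy one-hot labels in the column order A, B, NEITHER
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Unnormalised uniform predictions; rescaling turns each row into 1/3, 1/3, 1/3
uniform = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
loss_uniform = multiclass_log_loss(y_true, uniform)

# Fully confident, correct predictions
loss_perfect = multiclass_log_loss(y_true, y_true)
print(loss_uniform, loss_perfect)
```

A uniform guess scores $\log(3)$, which makes a convenient baseline to compare any model against.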

# Data

Unlike many Kaggle challenges, this competition does not provide an explicit labeled training set. Files are also available on the [GAP Dataset Github Repo](https://github.com/google-research-datasets/gap-coreference). **Note that the labels for the test set are available on this page. However, your final score and ranking will be determined in stage 2, against a withheld private test set.**

- ```test_stage_1.tsv``` - the test set data for stage 1

## Columns

- ```ID``` - Unique identifier for an example (Matches to Id in output file format)
- ```Text``` - Text containing the ambiguous pronoun and two candidate names (about a paragraph in length)
- ```Pronoun``` - The target pronoun (text)
- ```Pronoun-offset``` - The character offset of Pronoun in Text
- ```A``` - The first name candidate (text)
- ```A-offset``` - The character offset of name ```A``` in Text
- ```A-coref``` - Whether ```A``` corefers with the pronoun, TRUE or FALSE
- ```B``` - The second name candidate
- ```B-offset``` - The character offset of name ```B``` in Text
- ```B-coref``` - Whether ```B``` corefers with the pronoun, TRUE or FALSE
- ```URL``` - The URL of the source Wikipedia page for the example
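
To make the offset columns concrete, here is a small sketch that checks whether slicing `Text` at each offset returns the expected string. The single-row DataFrame and the helper `span_matches` are made up for illustration; only the column names come from the list above.

```python
import pandas as pd

# Hypothetical one-row example that follows the column layout above
example = pd.DataFrame({
    "Text": ["Alice met Beth before she left."],
    "Pronoun": ["she"],
    "Pronoun-offset": [22],
    "A": ["Alice"],
    "A-offset": [0],
    "B": ["Beth"],
    "B-offset": [10],
})

def span_matches(row, name_col, offset_col):
    """Return True if Text[offset : offset + len(name)] equals the name."""
    start = row[offset_col]
    return row["Text"][start:start + len(row[name_col])] == row[name_col]

# One boolean column per span type (Pronoun, A, B)
checks = pd.DataFrame({
    col: example.apply(span_matches, axis=1, args=(col, f"{col}-offset"))
    for col in ["Pronoun", "A", "B"]
})
print(checks)
```

The same check could be run on the real tables once they are loaded, as a quick validation that offsets are character based.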




In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from pathlib import Path

print(os.listdir("../input"))

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
DATA_ROOT = Path("../input")

# Local Kaggle path, kept for reference; it is overridden by the GitHub URL on the next line
test_path = DATA_ROOT / "test_stage_1.tsv"
test_path = "https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-test.tsv"
dev_path = "https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-development.tsv"
val_path = "https://raw.githubusercontent.com/google-research-datasets/gap-coreference/master/gap-validation.tsv"

In [None]:
devdf = pd.read_csv(dev_path, delimiter="\t")
devdf.shape
devdf.head()

In [None]:
valdf = pd.read_csv(val_path, delimiter="\t")
valdf.shape
valdf.head()

In [None]:
testdf = pd.read_csv(test_path, delimiter="\t")
testdf.shape
testdf.head()

Observations:

- The original GAP dataset from Google AI Language contains 8,908 coreference-labeled pairs: 4,000 each in the test and development sets and 908 in the validation set
- The GitHub files contain 2,000 rows each for test and development and 454 rows for validation; each row carries two candidate names (A and B), so every row accounts for two labeled pairs
- All three tables have the same 11 columns, including the correct labels

We'll combine all three to do the exploratory analysis.


In [None]:
# ignore_index=True gives the combined frame a unique row index
df = pd.concat([devdf, valdf, testdf], ignore_index=True)
df.shape
df.head()

## Labels

In [None]:
filt1 = (df["A-coref"] == True) & (df["B-coref"] == True)
filt2 = (df["A-coref"] == True) | (df["B-coref"] == True)

df["label"] = "NEITHER"
df.loc[df["A-coref"] == True, "label"] = "A"
df.loc[df["B-coref"] == True, "label"] = "B"

In [None]:
print("Cases where both A and B are correct:", filt1.sum())

In [None]:
print("Cases where either A or B is correct:", filt2.sum())

In [None]:
print("Cases where neither A nor B is correct:", df[~filt2].shape[0])

In [None]:
df.label.value_counts()

Observations: 
- There are no cases where both A and B are correct (sanity check)
- There are 490 cases where neither A nor B is the true label
- There are 3964 cases (1985 of B and 1979 of A) where either A or B is the true label
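
The label column above is built by two successive assignments, so a row with both flags set would silently end up as "B". As a hedged alternative, the sketch below uses `np.select` with an explicit "BOTH" bucket so that such rows would be visible. The toy frame and the "BOTH" label are illustrative assumptions and not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy coreference flags; the last row is deliberately inconsistent
toy = pd.DataFrame({
    "A-coref": [True, False, False, True],
    "B-coref": [False, True, False, True],
})

# Conditions are evaluated in order, so "BOTH" is checked first
conditions = [
    toy["A-coref"] & toy["B-coref"],
    toy["A-coref"],
    toy["B-coref"],
]
choices = ["BOTH", "A", "B"]

toy["label"] = np.select(conditions, choices, default="NEITHER")
print(toy["label"].value_counts())
```

On the real data the "BOTH" bucket should stay empty, which matches the sanity check in the observations.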

## Text

In [None]:
df["textlen"] = df.Text.str.len()
df["textwords"] = df.Text.str.split().str.len()

In [None]:
df.textlen.describe()

ax = df.textlen.plot.hist(bins=15, figsize=(10, 5))
_ = ax.set_title("Text length distribution")

Observations:
- Text length ranges from 69 to 1347 characters
- Most texts are between 300 and 500 characters long

Let's check both extremes of the text lengths

In [None]:
df[df.textlen < 70].T.to_dict()

In [None]:
df[df.textlen > 1340].T.to_dict()

In [None]:
df.textwords.describe()

ax = df.textwords.plot.hist(bins=15, figsize=(10, 5))
_ = ax.set_title("Text words histogram")

Observations:

- Word counts range from 12 to 223
- Most texts have between 50 and 100 words

Let's see one example from each extreme

In [None]:
df[df.textwords < 13].T.to_dict()

In [None]:
df[df.textwords > 220].T.to_dict()

In [None]:
ax = df[["textlen", "textwords"]].plot.scatter(x="textlen", y="textwords", figsize=(10, 5))
_ = ax.set_title("text words vs text length")

Observations:
- There is a roughly linear relationship between the word count and the character length of a text.
- The dataset seems to be carefully created, as I've never seen this plot this neat

(at first glance, it surprisingly reminded me of the Milky Way galaxy)
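
One way to put a number on "roughly linear" is the Pearson correlation between the two columns. The sketch below uses made-up texts built from a repeated word, so the relationship is exactly linear by construction; on the real data the value would be lower than 1.

```python
import pandas as pd

# Hypothetical texts: n repetitions of "word " give 5n characters and n words
toy = pd.DataFrame({"Text": ["word " * n for n in [3, 10, 25, 60]]})

# Same feature definitions as in the cells above
toy["textlen"] = toy["Text"].str.len()
toy["textwords"] = toy["Text"].str.split().str.len()

# Pearson correlation between character length and word count
r = toy["textlen"].corr(toy["textwords"])
print(r)
```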

## Pronoun and Gender

In [None]:
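# Lowercase pronoun lists; each pronoun is lowercased before the lookup
male_pro = ["he", "his", "him"]
female_pro = ["she", "her", "hers"]

# Note: any pronoun that is not in male_pro falls through to "female"
df["gender"] = df.Pronoun.apply(lambda x: "male" if x.lower() in male_pro else "female")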

In [None]:
df.gender.value_counts()

In [None]:
df.Pronoun.str.lower().value_counts()

Observations:
- We have a 50-50 split between the genders, as the competition details said, but this won't necessarily be true of the 2nd stage test data
- Among male pronouns we have he, his, him, with "his" being the most frequent (1193)
- Among female pronouns we have she, her, hers, with "her" being the most frequent (1315)
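
The gender assignment above treats every pronoun that is not in the male list as female. A slightly stricter sketch, assuming the same six pronouns, maps each pronoun through a dictionary so that anything unexpected shows up as missing instead of being counted as female. The name `PRONOUN_GENDER` and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical lookup table covering the pronouns seen in the data
PRONOUN_GENDER = {
    "he": "male", "his": "male", "him": "male",
    "she": "female", "her": "female", "hers": "female",
}

# Sample pronouns, including one that is not in the lookup table
pronouns = pd.Series(["He", "her", "His", "hers", "they"])

# Unknown pronouns map to NaN rather than defaulting to a gender
gender = pronouns.str.lower().map(PRONOUN_GENDER)
unmapped = pronouns[gender.isna()].tolist()
print(gender.value_counts(dropna=False))
print(unmapped)
```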

In [None]:
df[df.Pronoun == "hers"]

Observations:

- "hers" is not present in the train or validation set

## Pronoun position

In [None]:
ax = df[["textlen", "Pronoun-offset"]].plot.scatter(x="Pronoun-offset", y="textlen", figsize=(10, 5))
_ = ax.set_title("text len vs pronoun offset")

Observations:
- The empty half of the plot exists because the pronoun offset can't be higher than the text length itself; the points are bounded by the line ```textlen=Pronoun-offset```, which is also a sanity check of our data
- In many texts the selected pronoun is towards the end; this relationship is explored next

In [None]:
(df.textlen / df["Pronoun-offset"]).describe()

In [None]:
df[(df.textlen / df["Pronoun-offset"])>3].shape

In [None]:
((df.textlen / df["Pronoun-offset"])<=2).sum() / df.shape[0]

In [None]:
filt = (df.textlen / df["Pronoun-offset"])<=3
temp = df.loc[filt, ["textlen", "Pronoun-offset"]]
ax = (temp.textlen / temp["Pronoun-offset"]).plot.hist(bins=15, figsize=(10, 5))
_ = ax.set_title("pronoun position to text length ratio distribution")

Observations:

- This ratio never goes below 1 because the pronoun offset cannot be greater than the text length
- From the plot it's apparent that the majority of pronouns are in the 2nd half of the text (96% of the data)
- In only 43 cases (0.9%) is the pronoun in the first third of the text
- This shows that in most cases the entity/person is introduced first and the pronoun is used afterwards
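
An equivalent view is the relative position of the pronoun, `Pronoun-offset / textlen`, which lies between 0 and 1 and avoids dividing by an offset of zero. The sketch below uses made-up numbers; the column name `rel_pos` is an assumption introduced here.

```python
import pandas as pd

# Toy values: four texts of 100 characters with different pronoun offsets
toy = pd.DataFrame({
    "textlen": [100, 100, 100, 100],
    "Pronoun-offset": [0, 30, 60, 90],
})

# 0 means the pronoun starts the text, values near 1 mean it is near the end
toy["rel_pos"] = toy["Pronoun-offset"] / toy["textlen"]

# rel_pos >= 0.5 corresponds to textlen / Pronoun-offset <= 2
second_half_share = (toy["rel_pos"] >= 0.5).mean()
# rel_pos < 1/3 corresponds to textlen / Pronoun-offset > 3
first_third_share = (toy["rel_pos"] < 1 / 3).mean()
print(second_half_share, first_third_share)
```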

Next we'll explore the gap between the person and the pronoun

## Entity Pronoun gap

In [None]:
filt = (df["A-coref"] == True) | (df["B-coref"] == True)

df["label_offset"] = np.nan
df.loc[df["A-coref"] == True, "label_offset"] = df.loc[df["A-coref"] == True, "A-offset"]
df.loc[df["B-coref"] == True, "label_offset"] = df.loc[df["B-coref"] == True, "B-offset"]

In [None]:
df[filt].shape

In [None]:
temp = df.loc[filt, ["Pronoun-offset", "label_offset"]]
temp["label_pronoun_gap"] = temp["label_offset"] - temp["Pronoun-offset"]

temp["label_pronoun_gap"].describe()

ax = temp["label_pronoun_gap"].plot.hist(bins=15, figsize=(10, 5))

Observations:
- The gap between the entity and the pronoun ranges from -502 to 291
- The mean of -57 shows that the entity usually appears before the pronoun, which we also saw previously
- A positive gap marks the cases where the pronoun is used before the entity is mentioned

In [None]:
(temp.label_pronoun_gap > 0).value_counts()
(temp.label_pronoun_gap > 0).value_counts()*100/temp.shape[0]

Observations:

- False means the cases where pronoun comes after entity (81%)
- True means the cases where the entity comes after pronoun (19%)
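
For reference, the label offset and the gap can also be computed in one pass with `np.where`. This is a sketch on toy values, not a replacement for the cells above; the numbers are invented and only the column names follow the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows: A is correct, B is correct, neither is correct
toy = pd.DataFrame({
    "A-coref": [True, False, False],
    "B-coref": [False, True, False],
    "A-offset": [5, 40, 10],
    "B-offset": [20, 45, 50],
    "Pronoun-offset": [30, 25, 60],
})

# Offset of the correct name; NaN when neither candidate is correct
toy["label_offset"] = np.where(
    toy["A-coref"], toy["A-offset"],
    np.where(toy["B-coref"], toy["B-offset"], np.nan),
)

# Negative gap: entity before pronoun; positive gap: pronoun before entity
toy["label_pronoun_gap"] = toy["label_offset"] - toy["Pronoun-offset"]

labeled = toy["label_pronoun_gap"].dropna()
share_pronoun_first = (labeled > 0).mean()
print(toy[["label_offset", "label_pronoun_gap"]])
print(share_pronoun_first)
```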

In [None]:
ax = temp.plot.scatter(x="Pronoun-offset", y="label_pronoun_gap", figsize=(10, 5))

Observations:
- The horizontal line at ```label_pronoun_gap=0``` tells us that the entity and pronoun offsets can never be the same (which makes sense)
- The vertical points at ```Pronoun-offset=0``` are the cases where the pronoun comes at the start of the text and the entities come later
- Anything above the horizontal line at 0 is a case where the pronoun comes before the entity (19% of the data)
- Anything below the horizontal line at 0 is a case where the entity comes before the pronoun (81% of the data)
- The slanted line shows the cases where the entity is at the start of the text: when the entity comes before the pronoun, the largest possible gap is the pronoun offset itself, since an entity can't appear before the text starts (```label_pronoun_gap = -Pronoun-offset```)
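
The bound behind the slanted line can be verified directly: since `label_offset` is never negative, the gap can never be smaller than `-Pronoun-offset`, and equality holds exactly when the entity starts the text. A minimal sketch on invented values:

```python
import pandas as pd

# Toy offsets: entity at the start, entity after the pronoun, pronoun at the start
toy = pd.DataFrame({
    "Pronoun-offset": [30, 25, 0],
    "label_offset": [0, 45, 12],
})
toy["label_pronoun_gap"] = toy["label_offset"] - toy["Pronoun-offset"]

# The gap is bounded below by -Pronoun-offset
bound_holds = bool((toy["label_pronoun_gap"] >= -toy["Pronoun-offset"]).all())

# Points on the slanted line are exactly the rows with label_offset == 0
on_slanted_line = toy["label_pronoun_gap"] == -toy["Pronoun-offset"]
print(bound_holds, on_slanted_line.tolist())
```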

In [None]:
ax = temp.plot.scatter(x="label_offset", y="label_pronoun_gap", figsize=(10, 5))

Observations:
- This plot is very similar to the previous plot, where we plotted label_pronoun_gap vs Pronoun-offset
- The horizontal line at label_pronoun_gap=0 means the entity offset and the pronoun offset can never be the same
- Anything above the horizontal line at 0 is a case where the pronoun comes before the entity (19% of the data)
- Anything below the horizontal line at 0 is a case where the entity comes before the pronoun (81% of the data)
- The vertical line at ```label_offset=0``` shows the cases where the entity is at the start of the text and the pronoun comes later. Since the entity is at the start here, the gap can never be positive. This line is the same as the slanted line from the previous plot.
- The small group of points along a slanted line are the cases where the pronoun is at the start of the text and the entity comes later. This line corresponds to the vertical points at ```Pronoun-offset=0``` in the previous plot