# Settings and utils

## Extract type facts from a Wikipedia file


### === Purpose ===

The goal of this lab is to extract the class to which an entity belongs from Wikipedia.
For example, given the Wikipedia article about Leicester:

    Leicester is a small city in England
    
the goal is to extract:

    Leicester TAB city


### === Provided Data ===

We provide:

1. a preprocessed version of Simple Wikipedia (`wikipedia-first.txt`), which holds each article's title followed by its first sentence, as in the example above
2. a template for your code, `extractor.py`
3. a gold standard sample (`gold-standard-sample.tsv`).
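
The layout of `wikipedia-first.txt` can be inferred from the parsing loop in `Parsy`: a title line, one or more content lines, then a blank line. The following is a minimal sketch of that layout on an in-memory string. The helper name `iter_pages` and the sample text are illustrative assumptions, and unlike `Parsy` this sketch also emits a final page that is not followed by a blank line.

```python
def iter_pages(text):
    """Yield (title, content) pairs from title / content / blank-line blocks."""
    title, content = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line closes the current page, if one is open.
            if title is not None:
                yield title, " ".join(content)
            title, content = None, []
        elif title is None:
            title = line          # first non-blank line of a block is the title
        else:
            content.append(line)  # remaining lines are the content
    if title is not None:
        # Assumption: also emit a last page that lacks a trailing blank line.
        yield title, " ".join(content)


sample = (
    "April\n"
    "April is the fourth month of the year with 30 days.\n"
    "\n"
    "Leicester\n"
    "Leicester is a small city in England\n"
)
pages = list(iter_pages(sample))
```

Here `pages` holds one `(title, content)` tuple per article, which is the same information a `Page` object carries.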


### === Task ===

Complete the `extract_type()` function so that it extracts the type of the article entity from the content.
For example, for the content "Leicester is a beautiful English city in the UK", it should return "city".
Exclude terms that are too abstract ("member of...", "way of..."), and try to extract exactly the noun(s).
You can also skip articles (e.g. `return None`) if you are not sure or if the text does not contain any type.
In order to ensure a fair evaluation, do not use any non-standard Python libraries except `nltk` (`pip install nltk`).

Input:

    April
    April is the fourth month of the year with 30 days.

Output:

    April TAB month
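
The result file is expected to hold one `title TAB type` line per article, with skipped articles left out. A possible sketch of that formatting step follows, assuming the predictions are kept in a dict. The helper `format_predictions` and the sample dict are hypothetical and not part of the provided template.

```python
def format_predictions(predictions):
    """Render {title: type} as tab-separated lines, skipping empty predictions."""
    lines = []
    for title, typ in predictions.items():
        if typ:  # None (or an empty string) means "skip this article"
            lines.append(title + "\t" + typ + "\n")
    return "".join(lines)


tsv = format_predictions({"April": "month", "Ilia": None, "Leicester": "city"})
```

Each emitted line contains exactly one tab, which is what the evaluation code checks for when it reads the file.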


### === Development and Testing ===

We provide a sample of gold-standard annotations for validating your model.
The final grade is an F-beta score, computed with the following equation:

`F_beta = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)`

with `beta = 0.5`, which puts more weight on precision than on recall.
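
As a worked illustration of the equation above, the sketch below evaluates it for two made-up precision/recall pairs. The function name `f_beta` and the input values are assumptions chosen for the example, not results from this lab.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; returns 0.0 when the denominator would be zero."""
    denom = beta * beta * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta * beta) * precision * recall / denom


# Same two numbers, swapped: with beta = 0.5 the high-precision system wins.
high_precision = f_beta(precision=0.9, recall=0.5)
high_recall = f_beta(precision=0.5, recall=0.9)
```

With `beta = 0.5`, `high_precision` comes out larger than `high_recall`, which is why returning `None` for uncertain articles can pay off under this metric.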


In [3]:
from nltk import pos_tag, download
from nltk.tokenize import word_tokenize
from nltk import RegexpParser
from nltk import Tree
download('punkt')
download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [4]:
"""
Don't modify this code.
"""

import sys


class Page:
    '''
    This class is used to store title and content of a wiki page
    '''
    __author__ = "Jonathan Lajus"

    def __init__(self, title, content):
        self.content = content
        self.title = title
        if sys.version_info[0] < 3:
            self.title = title.decode("utf-8")
            self.content = content.decode("utf-8")

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.title == other.title and self.content == other.content

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.title, self.content))

    def __str__(self):
        return 'Wikipedia page: "' + (self.title.encode("utf-8") if sys.version_info[0] < 3 else self.title) + '"'

    def __repr__(self):
        return self.__str__()

    def _to_tuple(self):
        return (self.title, self.content)


class Parsy:
    '''
    Parse a Wikipedia file, return page objects
    '''
    __author__ = "Jonathan Lajus"

    def __init__(self, wikipediaFile):
        self.file = wikipediaFile

    def __iter__(self):
        title, content = None,""
        with open(self.file, encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line and title is not None:
                    yield Page(title, content.rstrip())
                    title, content = None,""
                elif title is None:
                    title = line
                elif title is not None:
                    content += line + " "

    
def eval_f1(gold_file, pred_file, debug=False):

    # Dictionaries
    goldstandard = dict()
    student = dict()

    # Reading first file
    with open(gold_file, 'r', encoding="utf-8") as f:
        for line in f:
            temp = line.split("\t")
            if len(temp) != 2:
                print("The line:", line, "has an incorrect number of tabs")
            else:
                if temp[0] in goldstandard:
                    print(temp[0], " has two solutions")
                goldstandard[temp[0]] = str.lower(temp[1])

    # Reading second file
    with open(pred_file, 'r', encoding="utf-8") as f:
        for line in f:
            temp = line.split("\t")
            if len(temp) != 2:
                if not debug:
                    print("Comment :=>> The line: '", line, "' has an incorrect number of tabs")
                else:
                    print("The line: '", line, "' has an incorrect number of tabs")
            else:
                if temp[0] in student:
                    if not debug:
                        print("Comment :=>>", temp[0], "has two solutions")
                    else:
                        print(temp[0], " has two solutions")
                student[temp[0]] = str.lower(temp[1])

    true_pos = 0
    false_pos = 0
    false_neg = 0

    for key in student:
        if key in goldstandard:
            if student[key] == goldstandard[key]:
                true_pos += 1
            else:
                false_pos += 1
                print("You got", key, "wrong. Expected output: ", goldstandard[key], ",given:", student[key])

    for key in goldstandard:
        if key not in student:
            false_neg += 1
            print("No solution was given for", key)

    if true_pos + false_pos != 0:
        precision = float(true_pos) / (true_pos + false_pos) * 100.0
    else:
        precision = 0.0

    if true_pos + false_neg != 0:
        recall = float(true_pos) / (true_pos + false_neg + false_pos) * 100.0
    else:
        recall = 0.0

    beta = 0.5

    if precision + recall != 0.0:
        f05 = (1 + beta * beta) * precision * recall / (beta * beta * precision + recall)
    else:
        f05 = 0.0

    # grade = 0.75 * precision + 0.25 * recall
    grade = f05

    print("Comment :=>>", "Precision:", precision, "%")
    print("Comment :=>>", "Recall:", recall, "%")
    print("Simulated Grade (F0.5) :=>>", grade, "%")
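
The counting in `eval_f1` has two properties that are easy to miss. A wrong answer for a gold title counts as a false positive and also enlarges the recall denominator, while predictions for titles outside the gold sample are ignored. The sketch below reproduces that counting on in-memory dicts. The function name `count_outcomes` and the sample data are illustrative assumptions.

```python
def count_outcomes(gold, predicted):
    """Return (true_pos, false_pos, false_neg) using the same rules as eval_f1."""
    tp = sum(1 for k, v in predicted.items() if k in gold and gold[k] == v)
    fp = sum(1 for k, v in predicted.items() if k in gold and gold[k] != v)
    fn = sum(1 for k in gold if k not in predicted)
    return tp, fp, fn


# Illustrative data only: one correct, one wrong, one missing, one off-sample.
gold = {"April": "month", "Moat": "water", "Leicester": "city"}
predicted = {"April": "month", "Moat": "body", "Le Locheur": "commune"}

tp, fp, fn = count_outcomes(gold, predicted)
precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
recall = 100.0 * tp / (tp + fn + fp) if tp + fn + fp else 0.0
```

In this example the off-sample prediction for "Le Locheur" affects neither score, while the wrong answer for "Moat" lowers both precision and recall.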


In [5]:
def tag_sent(sent):
    # Lowercase, tokenize and POS-tag the sentence, dropping punctuation tokens.
    pos = pos_tag(word_tokenize(sent.lower()))
    c, p = [], []
    for tup in pos:
        if tup[0] not in punc:
            c.append(tup[0])
            p.append(tup[1])
    return (c, p)

def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))

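def find_from_pattern(pattern, token_sent):
    """Return the last token of the last chunk matched by `pattern`, or None."""
    word = None
    chunker = RegexpParser(pattern)
    chunk_output = chunker.parse(pos_tag(token_sent))
    for child in chunk_output:
        if isinstance(child, Tree) and child.label() == 'NP':
            for num in range(len(child)):
                # Skip an adjective that is directly followed by another adjective.
                # The bounds check avoids an IndexError on the last token.
                is_last = num + 1 == len(child)
                if is_last or not (child[num][1] == 'JJ' and child[num + 1][1] == 'JJ'):
                    word = child[num][0]
    return word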

def get(deb, num):
  fin = deb+num
  data = {}
  i = 0
  for page in Parsy(wiki_file):
    if i>fin:
      break
    if i>deb:
      data[page.title]=page.content
    i+=1

  data_pos = {}
  for title, content in data.items():
    data_pos[title] = tag_sent(content)
  return data, data_pos


In [6]:
# a simplified wiki page document
wiki_file = 'wikipedia-first.txt'
# some gold samples for validation
gold_file = 'gold-standard-sample.tsv'
# predicted results generated by your model
# you are supposed to submit this file
result_file = 'results.tsv'

# Evaluation

In [7]:
def run():
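    '''
    First, extract the type from each page in the wiki file and write it to the result file.
    Next, evaluate the predictions against the gold samples.
    :return: a dict mapping each page title to its extracted type
    '''
    dic = {}
    with open(result_file, 'w', encoding="utf-8") as output:
        for page in Parsy(wiki_file):
            typ = extract_type(page)
            if typ:
                # Write the prediction so that eval_f1 below does not read an empty file.
                output.write(page.title + "\t" + typ + "\n")
                dic[page.title] = typ
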
    # Evaluate on some gold samples for checking your model
    eval_f1(gold_file, result_file)
    return dic

In [8]:
unwanted = ['form', 'kind', 'sort', 'type', 'way', 'part', 'name', 'piece', "asia", 'africa', 'america', 'antarctica', 'europe', 'australia', 'male', 'female', 'man', 'women', 'amount', 'something', 'word' ]
punc=['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', ' ', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
patterns = ["NP: {<.>?(<VBZ>|<VBP>|<VBD>)<DT>?<RB>*(<JJ>+|<JJ>?)(<NN>+|<NNS>+)}", "NP: {<DT>(<VBZ>|<VBP>|<VBD>)<CD>(<NN>+|<NNS>+)}", "NP: {(<NN>+|<NNS>+)(<VBZ>|<VBP>|<VBD>)(<NN>+|<NNS>+)<POS>(<NN>+|<NNS>+)}",
            "NP: {<NN>+(<VBZ>|<VBP>|<VBD>)<DT><JJ>(<NN>+|<NNS>+)}", "NP: {<.>?(<VBZ>|<VBP>|<VBD>)<DT><NN><IN><JJ>?(<NN>+|<NNS>+)}",
            "NP: {<TO>(<WP>|<VB>)<NN>?<VBZ><TO><VB>}", "NP: {<DT>(<NN>+|<NNS>+)(<VBZ>|<VBP>|<VBD>)<DT><``>(<NN>+|<NNS>+)}", "NP: {<DT><NN>?(<VBZ>|<VBP>|<VBD>)<DT>(<NN>+|<NNS>+)}",
            "NP: {(<NN>+|<NNS>+)(<VBZ>|<VBP>|<VBD>)<DT>(<NN>+|<NNS>+)}", "NP: {<DT>(<NN>+|<NNS>+)(<VBZ>|<VBP>|<VBD>)<VBN><TO><VB><DT>*(<NN>+|<NNS>+)}", 
            "NP: {<.>*<VBZ><DT><JJS>(<NN>+|<NNS>+)}", "NP: {<.>?(<VBZ>|<VBP>|<VBD>)<DT><JJ><NN><IN>(<NN>+|<NNS>+)}"
            ]
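
To make the chunk grammars above easier to read, here is a small sketch that applies one pattern to a hand-tagged sentence, so no tagger model is needed. The pattern is a simplified variant written for this illustration, not one of the entries in `patterns`, and the POS tags are assigned by hand as an assumption about what the tagger would produce.

```python
from nltk import RegexpParser, Tree

# Hand-tagged tokens (assumed tags) for "April is the fourth month of the year".
tagged = [
    ("april", "NN"), ("is", "VBZ"), ("the", "DT"), ("fourth", "JJ"),
    ("month", "NN"), ("of", "IN"), ("the", "DT"), ("year", "NN"),
]

# Simplified illustrative pattern: verb, optional determiner, adjectives, noun(s).
parser = RegexpParser("NP: {(<VBZ>|<VBP>|<VBD>)<DT>?<JJ>*(<NN>+|<NNS>+)}")
tree = parser.parse(tagged)

# Collect the words of every chunk labelled NP.
chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in tree
    if isinstance(subtree, Tree) and subtree.label() == "NP"
]
candidate = chunks[0][-1] if chunks else None
```

The chunk spans from the verb to the head noun, and taking its last token yields the type candidate, which mirrors what `find_from_pattern` returns.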

In [9]:
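def extract_type(wiki_page):
    '''
    Extract the type of the entity described by a wiki page.

    :param wiki_page: a Page object holding a title and the first sentence of its wiki page.
    :return: the extracted type as a string, or None if no pattern matches.
    '''
    title = wiki_page.title
    content = wiki_page.content
    print('title: ', title)
    print('content: ', content)
    # Tokenize once; every pattern is applied to the same token list.
    cont_tok = word_tokenize(content.lower())
    possibilities = []
    for pat in patterns:
        poss = find_from_pattern(pat, cont_tok)
        if poss is not None and poss not in unwanted and poss not in possibilities:
            possibilities.append(poss)
    # Keep the candidate that occurs earliest in the sentence
    # (only candidates within the first 100 tokens are considered).
    typ, ind = None, 100
    for poss in possibilities:
        index = cont_tok.index(poss)
        if ind > index:
            typ, ind = poss, index
    print('ouput: ', typ)
    print('\n')

    return typ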

In [10]:
dic=run()

Streaming output truncated to the last 5000 lines.
title:  Le Locheur
content:  Le Locheur is a commune.
ouput:  commune


title:  Le Manoir, Calvados
content:  Le Manoir, Calvados is a commune.
ouput:  commune


title:  Le Marais-la-Chapelle
content:  Le Marais-la-Chapelle is a commune.
ouput:  commune


title:  Le Mesnil-Auzouf
content:  Le Mesnil-Auzouf is a commune.
ouput:  commune


title:  Le Mesnil-Bacley
content:  Le Mesnil-Bacley is a commune.
ouput:  commune


title:  Le Mesnil-Benoist
content:  Le Mesnil-Benoist is a commune.
ouput:  commune


title:  Le Mesnil-Caussois
content:  Le Mesnil-Caussois is a commune.
ouput:  commune


title:  Le Mesnil-Durand
content:  Le Mesnil-Durand is a commune.
ouput:  commune


title:  Le Mesnil-Eudes
content:  Le Mesnil-Eudes is a commune.
ouput:  commune


title:  Le Mesnil-Germain
content:  Le Mesnil-Germain is a commune.
ouput:  commune


title:  Le Mesnil-Guillaume
content:  Le Mesnil-Guillaume is a commune.
ouput:  commu

In [11]:
with open(result_file, 'w', encoding="utf-8") as output:
  for title, typ in dic.items():
    output.write(title + "\t" + typ + "\n")

In [12]:
eval_f1(gold_file, result_file)

The line: 
 has an incorrect number of tabs
You got President of Russia wrong. Expected output:  leader
 ,given: russia

You got Hillary Rodham Clinton wrong. Expected output:  senator
 ,given: states

You got Edip Yuksel wrong. Expected output:  student
 ,given: group

You got Isthmus wrong. Expected output:  land
 ,given: strip

You got Ritual wrong. Expected output:  actions
 ,given: set

You got Nu metal wrong. Expected output:  music
 ,given: style

You got Moat wrong. Expected output:  water
 ,given: body

You got Cerulean wrong. Expected output:  colours
 ,given: range

You got Communication Studies wrong. Expected output:  study
 ,given: college

You got Hittites wrong. Expected output:  people
 ,given: kingdom

No solution was given for Army
No solution was given for Seam ripper
No solution was given for Ilia
No solution was given for North and South
No solution was given for Loreto Region
No solution was given for Pinocchio
No solution was given for Phylum
No solution was giv