<a href="https://colab.research.google.com/github/omdena/WRI/blob/master/notebooks/Task_3_Coreference_Resolution_Unsupervised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# About 
- Coreference resolution of articles: Section 8, ids 2080 to 7567 (rows 838 to 977)
- Output File: [ReferenceChainReplace_ID_2080_7567.csv](https://drive.google.com/file/d/1fM-1zF6hF-TN7j882iNtwC8yZ0XdmCD1/view?usp=sharing)
- File Columns: 
  - ids
  - month
  - class
  - title	
  - text	
  - text_coref

# Installing All Dependencies
- spaCy 2.1.0 (later versions have compatibility issues with neuralcoref)
- neuralcoref 
- news-please

In [0]:
# uninstall the default version (-y skips the confirmation prompt) and install 2.1.0
!pip uninstall -y spacy
!pip install spacy==2.1.0

# download pretrained english model
!python -m spacy download en

# install neuralcoref
!pip install neuralcoref

# pickled articles need this
!pip install news-please
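
Since the notebook depends on an exact spaCy pin, it can help to confirm what is actually installed before loading the pipeline. The following is a minimal sketch, not part of the original notebook: the helper names `installed_version` and `check_pin` are hypothetical, and it assumes Python 3.8+ for `importlib.metadata`.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_pin(package, expected):
    """Return (ok, found); ok is True only when the installed version equals `expected`."""
    found = installed_version(package)
    return found == expected, found


# Report whether the spaCy pin used by this notebook is in effect.
ok, found = check_pin("spacy", "2.1.0")
print(f"spacy installed: {found}; matches pin 2.1.0: {ok}")
```

If the check reports a mismatch right after installing, restarting the Colab runtime is usually needed so the newly installed version is the one that gets imported.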

# Loading Dependencies

In [0]:

import pandas as pd
import pickle
import spacy
import neuralcoref
from newsplease import NewsPlease
from google.colab import drive



In [0]:
# mount google drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Clone WRI GitHub Repo to Google Drive [Skip if already done]

Skip this step if you already have the WRI repo and the new manually_annotated.csv file.

In [0]:
# directory where the WRI repo will be cloned (-p: no error if it already exists)
!mkdir -p '/content/drive/My Drive/WRI'

In [0]:
!git clone 'https://github.com/omdena/WRI.git' '/content/drive/My Drive/WRI'

# Unzip Article Files [Skip if already done]
The article files are extracted to the WRI_DATA folder in Google Drive. This might take 10-15 minutes.

In [0]:
!mkdir -p '/content/drive/My Drive/WRI_DATA'
!unzip -qq '/content/drive/My Drive/WRI/data/texts/*.zip' -d '/content/drive/My Drive/WRI_DATA'
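
Because extraction is slow and meant to be skipped when already done, a Python equivalent that only extracts missing files can make the cell safe to re-run. This is a sketch under assumptions: `extract_missing` is a hypothetical helper, and the archive layout used in the demo (a month folder containing `.pkl` files) is inferred from the paths used later in the notebook, not confirmed.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_missing(zip_dir, out_dir):
    """Extract every .zip in zip_dir into out_dir, skipping files that already exist.

    Returns {archive name: number of members extracted on this call}.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = {}
    for archive in sorted(Path(zip_dir).glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            # Only members with no file at the target path still need extracting.
            missing = [name for name in zf.namelist() if not (out / name).exists()]
            zf.extractall(out, members=missing)
        extracted[archive.name] = len(missing)
    return extracted


# Demo on a throwaway directory; the archive layout here is an assumption.
with tempfile.TemporaryDirectory() as tmp:
    zips = Path(tmp) / "zips"
    zips.mkdir()
    with zipfile.ZipFile(zips / "01.zip", "w") as zf:
        zf.writestr("01/00001.pkl", b"demo")
    first_run = extract_missing(zips, Path(tmp) / "WRI_DATA")
    second_run = extract_missing(zips, Path(tmp) / "WRI_DATA")

print(first_run, second_run)
```

In the notebook, the two arguments would be the `data/texts` folder of the cloned repo and the `WRI_DATA` folder on Drive.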

# Loading Text Data from Google Drive

In [0]:
# path to the manually annotated gold-standard CSV
file_path = '/content/drive/My Drive/WRI/data/gold-standard/manually_annotated.csv'

In [0]:
gold_df = pd.read_csv(file_path)
gold_df.head()

Unnamed: 0,ids,month,class,title
0,1600,1,negative,BJP won't cower down by CPI-M's violence: Amit...
1,3335,1,negative,Supreme court: How India handles the civil dis...
2,3914,1,negative,"Trump admins must list for immigrants: skill, ..."
3,1094,1,positive,"Discard differences, says RSS chief Mohan Bhag..."
4,4077,1,negative,Doctor found dead in flat


In [0]:
# first row of the assigned range: row 838 (ids == 2080)
gold_df[gold_df.ids == 2080]

Unnamed: 0,ids,month,class,title
838,2080,1,positive,Gujarat: Massive fire breaks out in a chemical...


In [0]:
# last row of the assigned range: row 977 (ids == 7567)
gold_df[gold_df.ids == 7567]

Unnamed: 0,ids,month,class,title
977,7567,1,positive,Positively Filipino | Online Magazine for Fili...


In [0]:
# change these to the rows assigned to you (both ends are inclusive)
begin_row, end_row = 838, 977

# Util Functions

In [0]:
# utility function to load the pickled article for a given month and id
# (taken from the WRI repo)
def load_obj(month, idx):
    month = str(month).zfill(2)
    idx = str(idx).zfill(5)
    with open("/content/drive/My Drive/WRI_DATA/{}/{}.pkl".format(month, idx), "rb") as f:
        return pickle.load(f)
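
The path convention inside `load_obj` (two-digit month folder, five-digit zero-padded id) can be checked without unpickling anything. The sketch below is an illustration only: `article_path` and `missing_articles` are hypothetical helpers, and `DATA_ROOT` simply repeats the folder used above.

```python
from pathlib import Path

DATA_ROOT = "/content/drive/My Drive/WRI_DATA"  # same folder load_obj reads from


def article_path(month, idx, root=DATA_ROOT):
    """Build the pickle path load_obj opens: <root>/<MM>/<IIIII>.pkl."""
    return Path(root) / str(month).zfill(2) / f"{str(idx).zfill(5)}.pkl"


def missing_articles(pairs, root=DATA_ROOT):
    """Return the (month, idx) pairs whose pickle file does not exist under `root`."""
    return [(m, i) for m, i in pairs if not article_path(m, i, root).is_file()]


# Month 1, id 2080 maps to .../WRI_DATA/01/02080.pkl
print(article_path(1, 2080).as_posix())
```

Running `missing_articles` over the assigned `(month, ids)` pairs before the long coreference loop would surface any article that failed to unzip.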

# Coref Resolution

In [0]:
output_df = pd.DataFrame() # DataFrame to store the resolved text
nlp = spacy.load('en')
neuralcoref.add_to_pipe(nlp)

<spacy.lang.en.English at 0x7f79dc541ac8>

In [0]:
print(f'starting at {begin_row}th row....')
for row in range(begin_row, end_row+1):
  # load the pickled article and run the spaCy + neuralcoref pipeline on its text
  article = load_obj(gold_df['month'][row], gold_df['ids'][row])
  doc = nlp(article.text)
  # one-row frame: metadata from gold_df, the raw text, and the coref-resolved text
  temp_df = pd.DataFrame([
      {'ids': gold_df['ids'][row],
       'month': gold_df['month'][row],
       'class': gold_df['class'][row],
       'title': gold_df['title'][row],
       'text': article.text,
       'text_coref': doc._.coref_resolved}])
  output_df = pd.concat([output_df, temp_df])

  # report progress every 5 rows
  if row % 5 == 0:
    print(f'Completed Upto {row}th row...')
print('Done')

starting at 838th row....
Completed Upto 840th row...
Completed Upto 845th row...
Completed Upto 850th row...
Completed Upto 855th row...
Completed Upto 860th row...
Completed Upto 865th row...
Completed Upto 870th row...
Completed Upto 875th row...
Completed Upto 880th row...
Completed Upto 885th row...
Completed Upto 890th row...
Completed Upto 895th row...
Completed Upto 900th row...
Completed Upto 905th row...
Completed Upto 910th row...
Completed Upto 915th row...
Completed Upto 920th row...
Completed Upto 925th row...
Completed Upto 930th row...
Completed Upto 935th row...
Completed Upto 940th row...
Completed Upto 945th row...
Completed Upto 950th row...
Completed Upto 955th row...
Completed Upto 960th row...
Completed Upto 965th row...
Completed Upto 970th row...
Completed Upto 975th row...
Done
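
The loop above grows `output_df` with one `pd.concat` per article. An alternative is to collect plain records in a list and build the DataFrame once at the end. The sketch below is not the notebook's code: `resolve_rows` is a hypothetical helper, and the loader and resolver are passed in as callables so the example runs without Drive, spaCy, or neuralcoref.

```python
def resolve_rows(rows, load_article, resolve):
    """Build one output record per input row.

    rows: iterable of dicts with keys ids, month, class, title
    load_article: callable (month, ids) -> article text
    resolve: callable text -> coreference-resolved text
    """
    records = []
    for row in rows:
        text = load_article(row["month"], row["ids"])
        records.append({
            "ids": row["ids"],
            "month": row["month"],
            "class": row["class"],
            "title": row["title"],
            "text": text,
            "text_coref": resolve(text),
        })
    return records


# Stand-ins for the real loader and pipeline, for illustration only.
demo_rows = [{"ids": 1, "month": 1, "class": "positive", "title": "demo"}]
demo = resolve_rows(
    demo_rows,
    load_article=lambda month, ids: "The court said it agreed.",
    resolve=lambda text: text.replace(" it ", " The court "),
)
print(demo[0]["text_coref"])
```

In the notebook, `rows` could come from `gold_df.iloc[begin_row:end_row + 1].to_dict("records")`, with `load_article=lambda m, i: load_obj(m, i).text` and `resolve=lambda t: nlp(t)._.coref_resolved`, followed by a single `pd.DataFrame(records)`. A frame built this way has a 0..n-1 index rather than a repeated 0, so later single-row lookups would need positional access such as `.iloc[0]`.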


In [0]:
output_df.head()

Unnamed: 0,ids,month,class,title,text,text_coref
0,2080,1,positive,Gujarat: Massive fire breaks out in a chemical...,Image Source : ANI Gujarat: Massive fire break...,Image Source : ANI Gujarat: Massive fire break...
0,3651,1,negative,BBMP sets January 20 deadline for details on m...,The civic body’s ongoing war against the menac...,The civic body’s ongoing war against the menac...
0,3439,1,negative,5 Jaish men shot dead on LoC,The security agencies on Monday aborted a “sui...,The security agencies on Monday aborted a “sui...
0,1549,1,negative,"Chidambaram, Sibal attack Modi government over...","New Delhi, Jan 7 (IANS) Congress leader P. Chi...","New Delhi, Jan 7 (IANS) Congress leader P. Chi..."
0,894,1,positive,Wanted: Greater private sector investment in a...,"Last updated on: January 04, 2018 17:28 IST\n'...","Last updated on: January 04, 2018 17:28 IST\n'..."


# Example: For Positive class

In [0]:
# sample one positive-class article (fixed seed) for comparison
p_ex = output_df.loc[output_df['class'] == 'positive'].sample(n=1, random_state=1)

In [0]:
p_ex.title

0    Fresh push for Delhi-Manesar expressway as HC ...
Name: title, dtype: object

In [0]:
print(p_ex['text'].iloc[0])

File photo
GURUGRAM: PWD minister Rao Narbir Singh on Sunday announced that work on the stalled Greater Southern Peripheral Road (G-SPR) project, to connect Delhi-Gurugram border to IMT Manesar, should gain pace, now that Punjab and Haryana high court has vacated a stay order on land acquisition that it had imposed in 2016.Huda had started acquisition for the 40km-long GSPR in 2016, but faced a challenge from a land owner. Some villagers had filed a case before the high court, challenging the acquisition in Naurangpur village and nearby areas, and got a stay from court.With the stay now vacated, the state will have to issue a fresh notification for acquisition, but Huda is raring to start the process.“The high court has vacated the stay over acquisition by dismissing the case filed by land owners, paving the way for construction of G-SPR as per the Master Plan,” said the PWD minister, adding that once complete, this road would offer an alternative route between Delhi-Gurugram border an

In [0]:
print(p_ex['text_coref'].iloc[0])

File photo
GURUGRAM: PWD minister Rao Narbir Singh on Sunday announced that work on the stalled Greater Southern Peripheral Road (G-SPR) project, to connect Delhi-Gurugram border to IMT Manesar, should gain pace, now that Punjab and Haryana high court has vacated a stay order on land acquisition that Punjab and Haryana high court had imposed in 2016.Huda had started acquisition for the 40km-long GSPR in 2016, but faced a challenge from a land owner. Some villagers had filed a case before Punjab and Haryana high court, challenging the acquisition in Naurangpur village and nearby areas, and got a stay from court.With a stay from court now vacated, the state will have to issue a fresh notification for acquisition, but Huda is raring to start Punjab and Haryana high court has vacated the stay over acquisition by dismissing the case filed by land owners, paving the way for construction of G-SPR as per the Master Plan,” said the PWD minister, adding that once complete, this road would offer 
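
Spotting replacements by reading two long printouts side by side is error-prone. A word-level diff can list them directly. This is a rough sketch using the standard-library `difflib`, with a hypothetical helper name `list_replacements`; the two strings are shortened from the article above.

```python
import difflib


def list_replacements(original, resolved):
    """Return (before, after) word-level pairs where `resolved` differs from `original`."""
    a, b = original.split(), resolved.split()
    # autojunk=False keeps frequent words from being ignored in long articles.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return pairs


original = "the court has vacated a stay order that it had imposed"
resolved = "the court has vacated a stay order that the court had imposed"
for before, after in list_replacements(original, resolved):
    print(f"{before!r} -> {after!r}")
```

Applied to `text` and `text_coref` of a row, the resulting pairs give a short list of candidate replacements to review by hand.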

# Original Text
Selected mentions are marked in both the original and the resolved text:
- 🍎 Incorrect replacement
- 🍏 Correct replacement


GURUGRAM: PWD minister Rao Narbir Singh on Sunday announced that work on the stalled Greater Southern Peripheral Road (G-SPR) project, to connect Delhi-Gurugram border to IMT Manesar, should gain pace, now that Punjab and Haryana high court has vacated a stay order on land acquisition that   🍏it  🍏 had imposed in 2016.Huda had started acquisition for the 40km-long GSPR in 2016, but faced a challenge from a land owner. Some villagers had filed a case before the high court, challenging the acquisition in Naurangpur village and nearby areas, and got a stay from court.With the stay now vacated, the state will have to issue a fresh notification for acquisition, but Huda is raring to start the process.“The high court has vacated the stay over acquisition by dismissing the case filed by land owners, paving the way for construction of G-SPR as per the Master Plan,” said the PWD minister, adding that once complete, this road would offer an alternative route between Delhi-Gurugram border and IMT Manesar, which are connected only through Delhi-Gurugram expressway (NH-8) at present, and thus reduce congestion on the NH-8.The minister added the road will branch out from NH-8 at IMT Manesar, pass through Badshapur as  🍎 it  🍎 proceeds towards Gurugram-Faridabad road, cross MG Road and again merge with NH-8 near Sirhaul on the Delhi-Gurugram border from behind Ambience mall. The road is supposed to be constructed as per Gurugram-Manesar Plan 2031.“This road will be more like an ‘outer ring road’ for the city,”   🍏 he 🍏 said, adding  they are trying to convince NHAI to build  🍎it 🍎, for which,   🍏he  🍏 has already met Union minister Nitin Gadkari . The road is expected to benefit villages along the road, while also reducing congestion on NH-8, and thereby pollution in the city.As per sources, NHAI is not keen on building the road, due to high cost of land acquisition. “The initial survey had indicated the project was not viable due to high cost of land acquisition. 
That’s why a fresh feasibility study is now being carried out, in view of Haryana’s new policy, announced in late 2017, of transit oriented development (TOD), under which land along the road will get additional FAR,” said the source.Around 395 acres has to be acquired in eight villages for G-SPR.  They are — Aklimpur (5.95 acre), Tekliki (62.95 acre), Sakatpur (68.90 acre), Sikohour (15.92 acre), Naurangpur (49.40 acre), Bar Gujjar (99.13 acre), Nainwal (59.92 acre) and Manesar (33.15 acre). The road will be 90m-wide with a 30m-green belt on either side. The land will be acquired by Huda and handed over to the developer — which could be NHAI — for construction.

# Text After Coref Resolution
GURUGRAM: PWD minister Rao Narbir Singh on Sunday announced that work on the stalled Greater Southern Peripheral Road (G-SPR) project, to connect Delhi-Gurugram border to IMT Manesar, should gain pace, now that Punjab and Haryana high court has vacated a stay order on land acquisition that   🍏Punjab and Haryana high court  🍏 had imposed in 2016.Huda had started acquisition for the 40km-long GSPR in 2016, but faced a challenge from a land owner. Some villagers had filed a case before Punjab and Haryana high court, challenging the acquisition in Naurangpur village and nearby areas, and got a stay from court.With a stay from court now vacated, the state will have to issue a fresh notification for acquisition, but Huda is raring to start Punjab and Haryana high court has vacated the stay over acquisition by dismissing the case filed by land owners, paving the way for construction of G-SPR as per the Master Plan,” said the PWD minister, adding that once complete, this road would offer an alternative route between Delhi-Gurugram border and IMT Manesar, which are connected only through Delhi-Gurugram expressway (NH-8) at present, and thus reduce congestion on the NH-8.The minister added this road will branch out from NH-8 at IMT Manesar, pass through Badshapur as  🍎Badshapur 🍎 proceeds towards Gurugram-Faridabad road, cross MG Road and again merge with NH-8 near Sirhaul on the Delhi-Gurugram border from behind Ambience mall. this road is supposed to be constructed as per Gurugram-Manesar Plan 2031.“This road will be more like an ‘outer ring road’ for the city,”   🍏the PWD minister  🍏 said, adding they are trying to convince NHAI to build  🍎NHAI 🍎, for which,   🍏the PWD minister  🍏 has already met Union minister Nitin Gadkari . this road is expected to benefit villages along this road, while also reducing congestion on NH-8, and thereby pollution in the city.As per sources, NHAI is not keen on building this road, due to high cost of land acquisition. 
“The initial survey had indicated the project was not viable due to high cost of land acquisition. That’s why a fresh feasibility study is now being carried out, in view of Haryana’s new policy, announced in late 2017, of transit oriented development (TOD), under which land along this road will get additional FAR,” said the source.Around 395 acres has to be acquired in eight villages for G-SPR. They are — Aklimpur (5.95 acre), Tekliki (62.95 acre), Sakatpur (68.90 acre), Sikohour (15.92 acre), Naurangpur (49.40 acre), Bar Gujjar (99.13 acre), Nainwal (59.92 acre) and Manesar (33.15 acre). this road will be 90m-wide with a 30m-green belt on either side. The land will be acquired by Huda and handed over to the developer — which could be NHAI — for construction.
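
The 🍎 cases above follow from how resolved text is produced: each mention is replaced by the main mention of whatever cluster it was assigned to, so a wrong cluster assignment yields a fluent but wrong sentence. The sketch below imitates that replacement step on character spans. It is a simplified illustration, not neuralcoref's implementation; `replace_mentions` and the hand-written clusters are hypothetical.

```python
def replace_mentions(text, clusters):
    """Replace each mention span with its cluster's main mention.

    clusters: list of (main_text, [(start, end), ...]) with character offsets
    into `text`; spans are assumed not to overlap.
    """
    spans = [(start, end, main) for main, mentions in clusters for start, end in mentions]
    # Apply replacements from the end of the text so earlier offsets stay valid.
    for start, end, main in sorted(spans, reverse=True):
        text = text[:start] + main + text[end:]
    return text


text = "The road will pass through Badshapur as it proceeds north."
start = text.index(" it ") + 1
it_span = (start, start + len("it"))

# Same pronoun, two different cluster assignments.
correct = replace_mentions(text, [("the road", [it_span])])
wrong = replace_mentions(text, [("Badshapur", [it_span])])
print(correct)
print(wrong)
```

The second output mirrors the "as Badshapur proceeds" error in the resolved article: the replacement step itself is mechanical, and the quality depends entirely on the cluster the pronoun was linked to.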

# Saving Coref Resolution DataFrame

In [0]:
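# index=False: the repeated 0 index carries no information and is not one of the listed file columns
output_df.to_csv('/content/drive/My Drive/WRI_DATA/ReferenceChainReplace_ID_2080_7567.csv', index=False)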
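
After saving, a quick structural check of the CSV can catch a truncated or mis-shaped file before it is shared. The sketch below uses only the standard library and is an illustration: `check_output_csv` is a hypothetical helper, and the expected column list is taken from the About section at the top.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["ids", "month", "class", "title", "text", "text_coref"]

# Full articles can exceed the csv module's default field size limit.
csv.field_size_limit(10_000_000)


def check_output_csv(path, expected_rows):
    """Return a list of problems found in the saved CSV (empty list means none found)."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        # pandas writes an unnamed index column first unless index=False is passed.
        if header and header[0] == "":
            header = header[1:]
        if header != EXPECTED_COLUMNS:
            problems.append(f"unexpected columns: {header}")
        n_rows = sum(1 for _ in reader)
    if n_rows != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {n_rows}")
    return problems


# Demo on a throwaway file with one article whose text contains a newline.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow([""] + EXPECTED_COLUMNS)
        writer.writerow([0, 2080, 1, "positive", "title", "line one\nline two", "resolved"])
    demo_ok = check_output_csv(demo_path, expected_rows=1)
    demo_bad = check_output_csv(demo_path, expected_rows=140)

print(demo_ok, demo_bad)
```

For the range used here, rows 838 to 977 inclusive, the expected count is 977 - 838 + 1 = 140.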

In [0]:
# if the saved file is slow to appear in Drive, flush pending writes and unmount
drive.flush_and_unmount()