# PROJECT 1: Categorizing news articles

### Your task
* Given a bunch of Reuters news service articles, develop a set of labels for categorizing them
* Labels should be a single word or short phrase. Some articles might fit more than one label, and some might not fit any.
* Aim for about 10–15 labels, give or take
* Use methods from labs so far (keyword analysis, terminology extraction, topic models)
* No specific ‘correct’ answer; the process you use to develop the list is more important than the solution.

### Deliverables
* List of labels
* For each label, the number of articles from the dataset that fit that label
* The number of articles that don't fit any of the labels (ideally this won't be a big number)
* Annotated notebook showing your process
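
As a rough illustration of the counting deliverables, the sketch below tallies articles per label and the number that match no label. The labels and keyword patterns are hypothetical placeholders, not a recommended label set; in practice they would come out of the analysis in this notebook.

```python
import re
from collections import Counter

# Hypothetical labels and keyword patterns, for illustration only.
LABEL_PATTERNS = {
    "airlines": re.compile(r"\b(airline|air cargo|cabin crew)\b", re.I),
    "telecom": re.compile(r"\b(telecom|long-distance|internet access)\b", re.I),
}

def label_article(text):
    """Return the set of labels whose pattern occurs in the text."""
    return {label for label, pat in LABEL_PATTERNS.items() if pat.search(text)}

def tally(texts):
    """Count articles per label, plus articles that match no label."""
    counts = Counter()
    unlabeled = 0
    for text in texts:
        labels = label_article(text)
        if labels:
            counts.update(labels)  # an article may add to several labels
        else:
            unlabeled += 1
    return counts, unlabeled

sample = [
    "Sprint Corp. announced plans to offer Internet access.",
    "The airline will cut its cabin crew and telecom costs.",
    "Planet Hollywood launches credit card.",
]
counts, unlabeled = tally(sample)
print(counts, unlabeled)
```

Because an article can carry more than one label, the per-label counts plus the unlabeled count will generally not sum to the number of articles.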

In [1]:
import pandas as pd
import numpy as np
from cytoolz import concat, tail
import re
from tqdm.auto import tqdm

tqdm.pandas()

In [2]:
df = pd.read_parquet('s3://ling583/project1.parquet', storage_options={'anon':True})

In [3]:
df # load & print data

Unnamed: 0,headline,text,byline,dateline,date
0,Planet Hollywood launches credit card.,If dining at Planet Hollywood made you feel li...,,LOS ANGELES,1996-08-20
1,Sprint to offer consumer Internet access service.,Sprint Corp. Tuesday announced plans to offer ...,Susan Nadeau,CHICAGO,1996-08-20
2,Chains may raise prices after minimum wage hike.,The higher minimum wage signed into law Tuesda...,Patricia Commins,CHICAGO,1996-08-20
3,Sprint to offer consumer Internet access service.,Sprint Corp. Tuesday announced plans to offer ...,,"KANSAS CITY, Mo.",1996-08-20
4,Sprint to offer consumer Internet access service.,Sprint Corp. Tuesday announced plans to offer ...,,"KANSAS CITY, Mo.",1996-08-20
...,...,...,...,...,...
50080,Ryanair says to cease cargo operation.,"Irish independent, low-cost, no-frills airline...",,DUBLIN 1997-08-19,1997-08-19
50081,Teamsters president hails deal with UPS.,"United Parcel Service, the world's largest par...",David Lawsky,WASHINGTON 1997-08-19,1997-08-19
50082,"Teamsters, UPS in tentative deal, sources say.",The Intenational Brotherhood of Teamsters have...,,WASHINGTON 1997-08-18,1997-08-19
50083,Teamsters' Carey to hold news conference.,"The president of the Teamsters union, Ron Care...",,WASHINGTON 1997-08-18,1997-08-19


In [4]:
df = df.drop(['byline', 'dateline','date'], axis=1)

In [5]:
df #just making things look easier to work with

Unnamed: 0,headline,text
0,Planet Hollywood launches credit card.,If dining at Planet Hollywood made you feel li...
1,Sprint to offer consumer Internet access service.,Sprint Corp. Tuesday announced plans to offer ...
2,Chains may raise prices after minimum wage hike.,The higher minimum wage signed into law Tuesda...
3,Sprint to offer consumer Internet access service.,Sprint Corp. Tuesday announced plans to offer ...
4,Sprint to offer consumer Internet access service.,Sprint Corp. Tuesday announced plans to offer ...
...,...,...
50080,Ryanair says to cease cargo operation.,"Irish independent, low-cost, no-frills airline..."
50081,Teamsters president hails deal with UPS.,"United Parcel Service, the world's largest par..."
50082,"Teamsters, UPS in tentative deal, sources say.",The Intenational Brotherhood of Teamsters have...
50083,Teamsters' Carey to hold news conference.,"The president of the Teamsters union, Ron Care..."


In [6]:
import spacy
from spacy.matcher import Matcher

#loading spaCy pipeline, excluding unnecessary features
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'ner', 'lemmatizer', 'attribute_ruler'])

In [7]:
doc_reut=nlp(df['text'].iloc[0])

In [8]:
doc_reut

If dining at Planet Hollywood made you feel like a movie star, now you can spend money like Arnold Schwarzenegger with a new credit card from the themed restaurant chain. The fast growing company, whose outlets are festooned with kitsch movie memorabilia, has teamed up with the William Morris talent agency and MBNA America Bank of Wilmington, Del., to offer a credit card with appropriate Hollywood perks. These include preferential seating in the restaurants, a limited edition T-shirt and discounts on food and merchandise, a statement said. Planet Hollywood joins other pop culture companies such as Rolling Stone magazine that are issuing branded credit cards that make going into debt more fun than usual. Approved applicants don't have to pay an annual fee, and there's a special introductory annual percentage rate of 5.9 percent for balance transfers and cash advance checks. Orlando, Florida-based Planet Hollywood is part of Planet Hollywood International Inc.

In [9]:
matcher = Matcher(nlp.vocab) #matcher imported for phrase matching
matcher.add('Term', [[{'TAG': {'IN': ['JJ', 'NN', 'NNP']}},
                      {'TAG': {'IN': ['JJ', 'NN', 'IN',
                                      'HYPH', 'NNP']}, 'OP': '*'},
                      {'TAG': {'IN': ['NN', 'NNP']}}]])

In [10]:
spans=matcher(doc_reut,as_spans=True)

In [11]:
tuple(tok.norm_ for tok in spans[0])

('dining', 'at', 'planet')
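
The match above may look odd, but it follows from the pattern: an adjective or noun, then any run of adjectives, nouns, prepositions, or hyphens, then a noun. The plain-Python sketch below approximates that pattern over a hand-tagged token list to show which spans qualify. The tags are assigned by hand as an assumption (a real tagger may differ), and this is not how spaCy's `Matcher` is implemented.

```python
FIRST = {"JJ", "NN", "NNP"}
MIDDLE = {"JJ", "NN", "IN", "HYPH", "NNP"}
LAST = {"NN", "NNP"}

def candidate_spans(tagged):
    """Yield every token span whose tag sequence fits FIRST MIDDLE* LAST."""
    for i, (_, first_tag) in enumerate(tagged):
        if first_tag not in FIRST:
            continue
        for j in range(i + 1, len(tagged)):
            tag = tagged[j][1]
            if tag in LAST:
                # the span i..j ends on a noun, so it is a candidate
                yield tuple(word for word, _ in tagged[i:j + 1])
            if tag not in MIDDLE:
                # this token cannot sit inside a longer span, so stop extending
                break

# Hand-assigned tags (an assumption; a real tagger may differ).
tagged = [
    ("if", "IN"), ("dining", "NN"), ("at", "IN"),
    ("planet", "NNP"), ("hollywood", "NNP"), ("made", "VBD"),
]
spans = list(candidate_spans(tagged))
print(spans)
```

Under these tags, "dining at planet" qualifies because "dining" counts as a noun and "at" is an allowed middle token. Overlapping spans are all kept, which is why the nesting discount applied later matters.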

# Getting candidate terms

In [12]:
def get_candidates(text):
    doc_reut=nlp(text)
    spans=matcher(doc_reut,as_spans=True)
    return [tuple(tok.norm_ for tok in span) for span in spans]

In [13]:
candidates_reut=list(concat(df['text'].progress_apply(get_candidates)))

In [14]:
from collections import defaultdict, Counter

freqs=defaultdict(Counter) 
for c in candidates_reut:
    freqs[len(c)][c]+=1

In [15]:
freqs.keys() #sequences up to 42 tokens


dict_keys([3, 2, 4, 5, 6, 7, 9, 10, 8, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42])
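
The keys above are term lengths in tokens: `freqs` maps each length to a `Counter` of the candidates of that length. A toy run with made-up candidates shows the resulting shape:

```python
from collections import Counter, defaultdict

# Made-up candidate terms, for illustration only.
candidates = [
    ("air", "cargo"),
    ("new", "york"),
    ("air", "cargo"),
    ("new", "york", "stock", "exchange"),
]

freqs = defaultdict(Counter)
for c in candidates:
    freqs[len(c)][c] += 1  # outer key: length in tokens; inner key: the term

print(sorted(freqs))            # the lengths that occurred
print(freqs[2].most_common(1))  # most frequent two-token candidate
```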

In [16]:
freqs[4].most_common(5)

[(('air', 'cargo', 'newsroom', 'tel+44'), 1831),
 (('new', 'york', 'stock', 'exchange'), 933),
 (('reuters', 'air', 'cargo', 'newsroom'), 914),
 (('same', 'period', 'last', 'year'), 626),
 (('lloyds', 'shipping', 'intelligence', 'service'), 589)]

In [17]:
df_check = df['text'].str.contains(r"\+") # check whether phone numbers (e.g. +44) would be an issue
print(df_check)
df[df_check] # many articles contain them, so I left the text as is

0        False
1        False
2        False
3        False
4        False
         ...  
50080     True
50081    False
50082    False
50083    False
50084    False
Name: text, Length: 50085, dtype: bool


Unnamed: 0,headline,text
7,Virgin group to expand into South Africa - paper.,Richard Branson's Virgin group is planning to ...
8,BSkyB sees digital TV link with BT.,Satellite broadcaster British Sky Broadcasting...
11,"FINNISH H1 HOTEL, RESTAURANT SALES UP 2.1 PCT ...",The value of sales of the Finnish hotel and re...
14,Europe Online logs off for good after talks fail.,Bankrupt online information service Europe Onl...
16,Jungfraubahn prices new shares at 220 Sfr.,"Jungfraubahn Holding AG, which owns the transp..."
...,...,...
50064,FOCUS-Telecom delivers few surprises in Q1.,Telecom Corp of New Zealand on Tuesday deliver...
50076,CanalSatelite open to share Spain soccer TV ri...,Audiovisual Sport said on Monday that CanalSat...
50078,Austrian Air H1 profit soars into black.,Austrian Airlines announced on Tuesday a huge ...
50079,Austrian Air H1 profit swoops into black.,Austrian Airlines announced on Tuesday a huge ...


In [18]:
from nltk import ngrams

def get_subterms(term):
    k = len(term)  # number of tokens in the term
    for m in range(k-1, 1, -1):  # sub-sequence lengths from k-1 down to 2, e.g. k=4 gives m = 3, 2
        yield from ngrams(term, m)
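
To make the behavior of `get_subterms` concrete, here is a small self-contained sketch. It swaps in a plain-Python `ngrams` helper as a stand-in for `nltk.ngrams` so that it runs without NLTK; the example term is taken from the frequency output above.

```python
def ngrams(seq, n):
    """Plain-Python stand-in for nltk.ngrams: all length-n windows of seq."""
    return zip(*(seq[i:] for i in range(n)))

def get_subterms(term):
    """Yield every contiguous sub-sequence of length len(term)-1 down to 2."""
    for m in range(len(term) - 1, 1, -1):
        yield from ngrams(term, m)

subterms = list(get_subterms(("new", "york", "stock", "exchange")))
for s in subterms:
    print(" ".join(s))
```

A four-token term yields two trigrams and three bigrams; a two-token term yields nothing, since single tokens are never treated as subterms.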

In [19]:
from math import log2

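def c_value(F, theta):
    """Score candidates by C-value; F maps term length -> Counter of term frequencies."""
    termhood = Counter()
    longer = defaultdict(list)  # term -> frequencies of accepted longer terms containing it

    # process the longest candidates first, so nesting information exists before shorter terms are scored
    for k in sorted(F, reverse=True):
        for term in F[k]:
            if term in longer:
                # discount by the mean frequency of the longer terms this one is nested in
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:  # keep only terms scoring above the threshold
                termhood[term] = c
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood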
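
A worked example on made-up frequencies may help show what the score does. The sketch below repeats the function so that it runs on its own, uses a plain-Python `get_subterms` in place of the NLTK-based one, and sets `theta=0`; all numbers are invented for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def get_subterms(term):
    """All contiguous sub-sequences of length len(term)-1 down to 2."""
    for m in range(len(term) - 1, 1, -1):
        for i in range(len(term) - m + 1):
            yield term[i:i + m]

def c_value(F, theta):
    termhood = Counter()
    longer = defaultdict(list)
    for k in sorted(F, reverse=True):  # longest terms first
        for term in F[k]:
            if term in longer:
                discount = sum(longer[term]) / len(longer[term])
            else:
                discount = 0
            c = log2(k) * (F[k][term] - discount)
            if c > theta:
                termhood[term] = c
                for subterm in get_subterms(term):
                    if subterm in F[len(subterm)]:
                        longer[subterm].append(F[k][term])
    return termhood

# Made-up frequencies, for illustration only.
F = defaultdict(Counter)
F[2][("air", "cargo")] = 10
F[2][("cargo", "newsroom")] = 4
F[3][("air", "cargo", "newsroom")] = 4

scores = c_value(F, theta=0)
for term, c in scores.most_common():
    print(f"{c:6.2f} {' '.join(term)}")
```

Here the trigram scores 4 × log2(3) ≈ 6.34. "air cargo" occurs 10 times, 4 of them inside the trigram, so it scores 10 − 4 = 6. "cargo newsroom" never occurs outside the trigram, scores 0, and is dropped. The same arithmetic accounts for the real output in this notebook: "long - distance" has three tokens and no discount, so 2545 × log2(3) ≈ 4033.73, while "hong kong" has a raw frequency of 7319 but a score of 6496 after its nesting discount.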

# c-value cases

In [20]:
terms_reut=c_value(freqs,theta=500)

In [21]:
# top 20 terms: c-value, raw frequency, term
for t,c in terms_reut.most_common(20):
    print(f'{c:8.2f} {freqs[len(t)][t]:4d} {" ".join(t)}') 

 6496.00 7319 hong kong
 6336.50 7495 new york
 5700.00 6326 last year
 4533.20 5881 air cargo
 4348.00 4348 united states
 4033.73 2545 long - distance
 3328.00 3328 percent stake
 3226.00 3226 general cargo
 3169.93 2000 london newsroom +44
 2694.00 2694 net income
 2577.00 2577 co ltd
 2466.00 2466 last week
 2464.00 2464 joint venture
 2282.00 1831 air cargo newsroom tel+44
 2228.46 1406 long - term
 2217.00 2217 first quarter
 2215.00 2215 news conference
 2193.59 1384 new york newsdesk
 2112.00 2112 net profit
 2089.00 2089 france telecom


In [22]:
# bottom 20 terms (just above the threshold): c-value, raw frequency, term
for t,c in tail(20,terms_reut.most_common()):
    print(f'{c:8.2f} {freqs[len(t)][t]:4d} {" ".join(t)}')

  528.00  264 pt indonesian satellite corporation
  527.00 1084     b
  521.45  329 --london newsroom +44
  519.58  201 data above 000s except per share
  516.70  326 long - haul
  516.00  516 internet access
  515.00  515 phone service
  512.00  512 karachi port
  510.36  322 part - time
  510.00  510 fiscal year
  510.00  510 cabin crew
  509.00  509 hongkong telecom
  506.00  506 port authority
  505.00  505 san francisco
  504.00  504 communications inc
  503.00  503 h   
  502.00  502 fuel oil
  501.00  501 singapore newsroom
  501.00  501 interest expense
  501.00  501 paris newsroom


In [23]:
with open('reuters-terms.txt', 'w') as f: #save as new file
    for term in terms_reut:
        print(' '.join(term), file=f)
        
# also reproduced .py file for MWE tokenizer