# Homework #3

## Text Processing
---
### Question #1

1. Modify the code I wrote in lecture 8 with what you have learnt in lecture 9 so that it correctly tokenizes the text on both the word and the sentence level and removes the stopwords. Rewrite the `getSummary` function and all the other functions that it depends on with these corrections.

2. Rewrite the code I wrote for `getKeywords` function making the same corrections.

3. Test your code from parts 1 and 2 on random articles from the Guardian.

4. Rewrite the `getSubjectGuardian` function for another newspaper in English, and test your code from part 1 and 2 on random articles from this new newspaper.

**Importing the necessary modules**

As usual, we start off by importing the libraries needed for the work ahead. Their uses should be familiar this far into the semester, so I will not describe each one.

In [353]:
import requests
import nltk
import regex as re
import numpy as np
import string

from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.sentiment import SentimentIntensityAnalyzer

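# sent_tokenize and word_tokenize also need the 'punkt' tokenizer data
# ('punkt_tab' on recent NLTK releases) if it is not already installed.
nltk.download('stopwords')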

from snowballstemmer import TurkishStemmer
from bs4 import BeautifulSoup

from collections import Counter
from xmltodict import parse

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

[nltk_data] Downloading package stopwords to /Users/uzay/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Copying the desired functions from the lecture notes**

I copied the *getSubjectGuardian* and *getText* functions from the lecture notes so that the changes to the *getSummary* and *getKeywords* functions are easy to test.

In [354]:
def getSubjectGuardian(subject):
    with requests.get(f'https://www.theguardian.com/{subject}/rss') as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

def getText(url):
    with requests.get(url) as link:
        raw = BeautifulSoup(link.content,'html.parser')
    return ' '.join([x.text for x in raw.find_all('p')])

**Writing the *getNewsMixmag* function**

I wrote a function similar to the one you wrote for the Guardian, but this one reads the RSS feed of Mixmag, a UK-based DJing magazine.

In [355]:
def getNewsMixmag():
    with requests.get("https://mixmag.net/rss-category/news") as link:
        raw = parse(link.text)
    return raw['rss']['channel']['item']

**Gathering raw text for testing the functions**

Using the two functions for the RSS feeds of the Guardian and Mixmag, I gathered a total of 5 articles for testing purposes. These articles are also used to test the solutions to Questions 2 and 3.

In [356]:
nba = getSubjectGuardian("sport/nba")
football = getSubjectGuardian("football")
film = getSubjectGuardian("film")
mixmag = getNewsMixmag()

text1 = getText(nba[3]['link'])
text2 = getText(football[2]['link'])
text3 = getText(film[0]['link'])
text4 = getText(mixmag[10]['link'])
text5 = getText(mixmag[15]['link'])
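
To make the shape of the feed data concrete, here is a minimal sketch using only the standard library. It swaps `xmltodict` for `xml.etree.ElementTree` and parses a made-up inline feed, so the `SAMPLE_RSS` string, its URLs and the `parse_items` helper are illustrative assumptions, not the real Guardian or Mixmag feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-item feed; real feeds carry many more fields per item.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item><title>First story</title><link>https://example.com/first</link></item>
    <item><title>Second story</title><link>https://example.com/second</link></item>
  </channel>
</rss>"""

def parse_items(xml_text):
    """Return one dict per <item>, keyed by child tag name."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in item}
            for item in root.findall("./channel/item")]

items = parse_items(SAMPLE_RSS)
print(items[1]["link"])
```

Indexing such as `nba[3]['link']` assumes the feed holds at least four items; checking the length of the list first avoids an `IndexError` on a short feed.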

**Writing the *tokenizeText* function**

Here I wrote a *tokenizeText* function, which takes the raw text as its parameter and returns a dictionary with two keys:
1. *sentences* : a list of the sentences in the text, each lower-cased and returned as a single string with English stop-words and punctuation removed.
2. *words* : a list of the word tokens in the text, lower-cased and with stop-words and punctuation removed.

In [357]:
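swEN = set(stopwords.words('english'))

def tokenizeText(text, sw=swEN):
    # Sentence level: lower-case each sentence and strip every character that is not a word character or whitespace
    sentences = [re.sub(r'[^\w\s]', '', sentence.lower()) for sentence in sent_tokenize(text)]
    # Word level: lower-case each token and drop tokens found in string.punctuation
    words = [word.lower() for word in word_tokenize(text) if word not in string.punctuation]

    # Remove stop-words at both levels
    tokenizedText = {'sentences' : [' '.join([word for word in sentence.split() if word not in sw]) for sentence in sentences],
                    'words' : [word for word in words if word not in sw]}

    return tokenizedText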

text1_tokenized = tokenizeText(text1)
text1_tokenized

{'sentences': ['former la lakers player general manager claims hbo demeaned shock value popular drama basketball team former los angeles lakers player coach general manager jerry west reportedly demanded apology retraction depiction hbo series winning time rise lakers dynasty calling baseless malicious assault',
  'new drama series centred rise la lakers 1980s showtime era west teams general manager',
  'actor jason clarke plays west alongside john c reilly teams owner dr jerry buss',
  'according espn statement sent hbo shows executive producer adam mckay tuesday night wests legal team allege show falsely cruelly portrays mr west outofcontrol intoxicated rageaholic causing great distress jerry family',
  'jerry west integral part lakers nbas success letter continued',
  'travesty hbo knowingly demeaned shock value pursuit ratings',
  'act common decency hbo producers owe jerry public apology least retract baseless defamatory portrayal espn reported 83yearolds lawyers asking damages re
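
One caveat on the word-level filter in *tokenizeText*: `word in string.punctuation` is a substring test against one long string, so it only reliably catches single-character tokens. Multi-character tokens, such as the doubled backtick or doubled apostrophe that `word_tokenize` can emit for quotation marks, are not substrings and slip through. The sketch below compares that test with a per-character check; the `is_punct` helper and the sample tokens are assumptions for illustration.

```python
import string

def is_punct(token):
    """True when the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in string.punctuation for ch in token)

tokens = ["``", "hello", "''", "--", ",", "world", "..."]

# Substring test, as in tokenizeText: multi-character tokens slip through.
kept_substring = [t for t in tokens if t not in string.punctuation]

# Per-character test: every punctuation-only token is dropped.
kept_per_char = [t for t in tokens if not is_punct(t)]

print(kept_substring)
print(kept_per_char)
```

Curly quotes such as “ and ” are not part of `string.punctuation` either, so they would need separate handling under both tests.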

**Modifying the *getSummary* function**

Next, I modify the *getSummary* function to use my *tokenizeText* function instead of *processText*, which relied on regular expressions rather than NLTK's tokenization functions.

In [358]:
def getMatrix(sentences):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(sentences)
    
def getSummary(text,k):
    tokenized_text = tokenizeText(text)
    sentences = tokenized_text['sentences']
    matrix = getMatrix(sentences)
    projection = PCA(n_components=1)
    weights = projection.fit_transform(matrix.toarray())
    # Build (weight, position in text, sentence) triples for every sentence
    res = list(zip(weights.transpose()[0],range(len(sentences)),sentences))
    tmp = sorted(res,key=lambda x: x[0],reverse=True)[:k]
    return sorted(tmp, key=lambda x: x[1])

getSummary(text2, 3)

[(0.8339380155484528,
  9,
  'provision 11 fiveaside well school activations social activities recently joined south african football association safa take part local league season'),
 (0.5517105603256783,
  13,
  'weekend sasol league safas provincial womens league kicked around 3800 players involved'),
 (5.504080096786349,
  39,
  'maybe come home trophy season maybe next season going lose minds brawls goals records fall shortage drama pitch old rivals lyon psg came face face champions league')]
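
The last two lines of *getSummary* do a two-step selection: keep the `k` highest-weighted sentences, then put them back in document order so the summary reads naturally. A minimal sketch with made-up weights (the `top_k_in_order` name and the sample data are assumptions for illustration):

```python
def top_k_in_order(weights, sentences, k):
    """Pick the k highest-weighted sentences, returned in original order."""
    # enumerate supplies each sentence's position, whatever the text length.
    scored = [(w, i, s) for i, (w, s) in enumerate(zip(weights, sentences))]
    best = sorted(scored, key=lambda item: item[0], reverse=True)[:k]
    return sorted(best, key=lambda item: item[1])

weights = [0.2, 0.9, -0.3, 1.5]
sentences = ["alpha", "bravo", "charlie", "delta"]
summary = top_k_in_order(weights, sentences, 2)
print(summary)
```

This is why the weights in the output above are not in descending order: the entries are sorted by their position in the article, not by score.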

**Modifying the *getKeywords* function**

Similarly, I'm going to modify the getKeywords function from Lecture 8 to use my *tokenizeText* function.

In [359]:
def getKeywords(text,sw,k):
    tokenized_text = tokenizeText(text, sw)  # pass the caller's stop-word set through
    sentences = tokenized_text['sentences']

    vectorizer = CountVectorizer()
    matrix = vectorizer.fit_transform(sentences)
    words = vectorizer.get_feature_names_out()

    projection = PCA(n_components=1)
    tmp = projection.fit_transform(matrix.transpose().toarray())
    weights = tmp.transpose()[0]

    return sorted(zip(weights,words),key=lambda x: x[0], reverse=True)[:k]

getKeywords(text2, swEN, 5)

[(1.9320255596544318, 'season'),
 (1.8358248692821177, 'face'),
 (1.8358248692821177, 'maybe'),
 (1.2941397729389434, 'league'),
 (0.9627317629176271, 'pitch')]

**Testing both functions with the articles from Mixmag**

In [360]:
getSummary(text4, 3)

[(0.5387236713642941,
  7,
  'however nme reports new forecast pwc global entertainment media outlook 20212025 report predicts prepandemic rates met 2025'),
 (6.0387449236903015,
  9,
  'read next terrifying figures show many streams artist needs earn minimum wage mark maitland uk head entertainment media pwc said uk consumers rapid migration digital behaviours pandemic become embedded daytoday lives helping sustain overall growth across entertainment media coming five years'),
 (0.3352019607436324,
  10,
  'companies race meet consumers evolving needs new products services experiences em industry become pervasive immersive diverse')]

In [361]:
getKeywords(text4, swEN, 4)

[(2.6450643205143787, 'music'),
 (1.519837245114415, 'live'),
 (1.2316546856709494, 'despite'),
 (0.8933117418002829, 'seen')]

---
### Question #2

Write a function that returns all named entities (proper names, country names, corporation names only) from a URL. The function should take the URL as the input and must return the list of named entities from that URL. Test your code on random articles from the Guardian. Don't use the NLTK NER that I demonstrated during the lecture. Use spaCy's NER function.

**Importing *spacy***

First of all, I import *spacy* and load the *en_core_web_sm* model.

In [362]:
import spacy
from spacy import displacy

ner = spacy.load("en_core_web_sm")

**Some analysis on how NER works**

During the analysis I found that spaCy's NER also returns entity types that are not wanted here, such as cardinals, times and dates. I looked up which labels are relevant for this question and accept only those in my function.

In [363]:
ne = ner(text1)
labels = list(dict.fromkeys([word.label_ for word in ne.ents]))
labels

['ORG', 'PERSON', 'DATE', 'LOC', 'TIME', 'GPE']

In [364]:
explanation = [spacy.explain(label) for label in labels]
label_exp = dict(zip(labels, explanation))
label_exp

{'ORG': 'Companies, agencies, institutions, etc.',
 'PERSON': 'People, including fictional',
 'DATE': 'Absolute or relative dates or periods',
 'LOC': 'Non-GPE locations, mountain ranges, bodies of water',
 'TIME': 'Times smaller than a day',
 'GPE': 'Countries, cities, states'}

**Writing the *getNamedEnts* function**

Writing this function was relatively straightforward, since the team at spaCy does all the heavy lifting (the deep neural network work) behind the scenes. After running spaCy's NER on the text, I return only the names whose entity type is one of those requested in the question.

*Note: I added a touch of fancy using displacy. If you pass the additional argument `style='fancy'` to the function, it displays the named entities highlighted in the text instead of returning a list.*

In [365]:
def getNamedEnts(url, style='simple'):
    text = getText(url)
    rawNE = ner(text)
    
    if style == 'fancy':
        displacy.render(rawNE, style="ent", jupyter=True)
        return

    wanted = ['PERSON', 'ORG', 'GPE', 'FAC']

    ne = list(dict.fromkeys([item.text for item in rawNE.ents if item.label_ in wanted]))
    return ne
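
The filter-then-deduplicate step can be tried without loading a model. The sketch below uses hand-written `(text, label)` pairs in place of the entities a spaCy document exposes, and relies on `dict.fromkeys` keeping first-seen order while dropping repeats. The sample entities, the `WANTED` set and the `filter_entities` helper are assumptions for illustration.

```python
WANTED = {"PERSON", "ORG", "GPE", "FAC"}

def filter_entities(ents, wanted=WANTED):
    """Keep entity texts whose label is wanted, de-duplicated in first-seen order."""
    return list(dict.fromkeys(text for text, label in ents if label in wanted))

# Hypothetical stand-in for the entities found in an article.
sample_ents = [
    ("Jerry West", "PERSON"),
    ("HBO", "ORG"),
    ("Tuesday", "DATE"),
    ("HBO", "ORG"),
    ("Los Angeles", "GPE"),
    ("1980s", "DATE"),
]

names = filter_entities(sample_ents)
print(names)
```

Because the de-duplication compares exact strings, variants such as 'Glastonbury' and 'Glastonbury Festival' remain separate entries, as the Mixmag test output below shows.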

**Testing**

In [366]:
getNamedEnts(nba[4]['link'], style='fancy')

In [367]:
getNamedEnts(mixmag[4]['link'])

['Worthy Farm',
 'Glastonbury Festival',
 'TikTok',
 'Glastonbury',
 'The Glastonbury Free Press',
 'Ukraine',
 'War Child',
 'UK',
 'Pyramid Stage',
 'Silver Hayes',
 'The Lonely Hearts Club',
 'Overmono',
 'Avalon Emerson',
 'Nia Archives',
 'TSHA',
 'Billie Eilish',
 'Paul McCartney',
 'Kendrick Lamar',
 'Diana Ross',
 'Bicep',
 'Caribou',
 'Somerset',
 "Mixmag's Digital Intern"]

---
### Question #3

1. Write a function that returns the most positive and the most negative sentences from a text. The function must take the text as the input and must return a 2-tuple: the first element as the most positive and the second as the most negative sentence with their polarity scores.

2. Test your function on random articles from the Guardian.

**Importing *SentimentIntensityAnalyzer***

In [368]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/uzay/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


**Writing the function**

I wrote a helper function named *tokenizePolarize*, which splits the raw text into sentences and then computes the polarity scores of each sentence. Using this helper I wrote the *getPosNeg* function, which returns the most positive and the most negative sentences in the text as a tuple, each with its polarity scores.

*Note: I pass a `key` function to `max` and `min`. The lambda takes a (sentence, polarity scores) item and returns the compound value of its polarity scores, which is the value that gets compared.*

In [369]:
def tokenizePolarize(text):
    sentences = sent_tokenize(text)
    return [(sentence, analyzer.polarity_scores(sentence)) for sentence in sentences]

def getPosNeg(text):
    pol_scores = tokenizePolarize(text)
    most_pos = max(pol_scores, key=lambda item : item[1]['compound'])
    most_neg = min(pol_scores, key=lambda item : item[1]['compound'])

    return (most_pos, most_neg)
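
How the `key` argument drives the selection can be seen on a toy list. The scores below are made up for illustration and are not real analyzer output; the `compound` helper name is likewise an assumption.

```python
# Hypothetical (sentence, scores) pairs shaped like the analyzer's output.
scored = [
    ("I love this.", {"compound": 0.64}),
    ("It was fine.", {"compound": 0.20}),
    ("I hate that.", {"compound": -0.57}),
]

def compound(item):
    """Comparison key: the compound score of a (sentence, scores) pair."""
    return item[1]["compound"]

most_pos = max(scored, key=compound)
most_neg = min(scored, key=compound)
print(most_pos[0], "|", most_neg[0])
```

On ties, `max` and `min` return the first matching item, and both raise `ValueError` on an empty list, so a text with no sentences would need a guard.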


**Testing**

In [370]:
getPosNeg(text1)

(('“Contrary to the show, the book leaves readers with the true impression of Jerry as a brilliant and thoughtful GM,” they wrote.',
  {'neg': 0.0, 'neu': 0.605, 'pos': 0.395, 'compound': 0.8779}),
 ('I also never saw or heard Jerry go on an angry rant or tirade nor did I ever see or hear Jerry scream or yell at anyone.',
  {'neg': 0.276, 'neu': 0.724, 'pos': 0.0, 'compound': -0.8126}))

In [371]:
getPosNeg(text2)

(('On the pitch, Romaney has her eyes on creating a surprise in their new league: “I would love to see us excel.',
  {'neg': 0.0, 'neu': 0.596, 'pos': 0.404, 'compound': 0.8885}),
 ('They looked to have put their troubles aside when the effervescent Marie-Antoinette Katoto fired them ahead.',
  {'neg': 0.32, 'neu': 0.68, 'pos': 0.0, 'compound': -0.765}))

In [372]:
getPosNeg(text3)

(('You might think that she did this because writing down “The Jamaican tourist board bought me a freebie for no clear reason” doesn’t look that brilliant.',
  {'neg': 0.068, 'neu': 0.648, 'pos': 0.284, 'compound': 0.7906}),
 ('You’d only get to watch sad films about people dying, such as Contagion or Million Dollar Baby.',
  {'neg': 0.289, 'neu': 0.711, 'pos': 0.0, 'compound': -0.7269}))

In [373]:
getPosNeg(text4)

(('Read this next: "Terrifying" figures show how many streams an artist needs to earn minimum wage Mark Maitland, UK head of Entertainment and Media at PwC, said: “UK consumers’ rapid migration to digital behaviours in the pandemic has now become embedded in their day-to-day lives, helping to sustain overall growth across Entertainment and Media for the coming five years.',
  {'neg': 0.0, 'neu': 0.841, 'pos': 0.159, 'compound': 0.8555}),
 ('However, sectors such as live music have struggled to go virtual, as it’s so difficult to replicate the in-person experience online.',
  {'neg': 0.229, 'neu': 0.771, 'pos': 0.0, 'compound': -0.685}))

In [374]:
getPosNeg(text5)

(("Read this next: The Warehouse Project is one of the world's best clubbing experiences — and these pictures prove it WHP22 /// NEW - BONOBO + CARIBOUBonobo makes his Depot Mayfield debut for a special performance, alongside Caribou, who return to WHP for August Bank Holiday to kick off the season.",
  {'neg': 0.0, 'neu': 0.821, 'pos': 0.179, 'compound': 0.8625}),
 ('Caused a bit of confusion, but impressive!"',
  {'neg': 0.275, 'neu': 0.725, 'pos': 0.0, 'compound': -0.2244}))