# Natural Language Processing Applied to Collected News Articles with NLTK

## Members

1. First member: Tarık Kılıç / kilicsem@itu.edu.tr

## Description of the project

This project applies Natural Language Processing techniques, such as keyword extraction, to news articles collected from Turkish media. The articles were collected from the Web Archive and cover a time span of two months (December 2016 to January 2017). The goal of the project is to extract the most important keywords and concepts from this data source.

### Libraries

#### NLTK
NLTK, [the Natural Language Toolkit](http://www.nltk.org/), is a natural language platform for Python that includes a variety of corpora, documentation and modules for frequently used techniques for analysing human language data. It is an open-source, community-driven project.

#### NumPy
[NumPy](https://docs.scipy.org/doc/numpy-dev/user/quickstart.html) is an essential Python module for scientific computing. It provides multi-dimensional arrays and matrices together with a wide range of mathematical operations on these structures.

#### BeautifulSoup
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a Python library that extracts, filters and manipulates data from XML and HTML documents and binds it to other data structures. It also provides navigation and search within XML and HTML documents, and it can work with several parsers available in the Python environment.

#### peewee
[peewee](http://docs.peewee-orm.com/en/latest/) is a small, lightweight Object Relational Mapper library for several database technologies such as MySQL, SQLite and PostgreSQL. Its code is hosted on [GitHub](https://github.com/coleifer/peewee).


### The methods to be used

The data is processed with Latent Semantic Analysis (LSA). In short, LSA is a mathematical technique for extracting and determining the relationships between terms and concepts in a semantic context. It is a method for analysing text documents that reveals the latent, hidden concepts they contain. LSA is able to extract indirect relations between terms: if two words never appear together in the same document but each always appears with a third word, LSA can extract the connection between them. In this project the context, in other words the document data, is the set of news articles. Frequently mentioned terms and techniques of LSA are explained in detail below.

#### Document: 
Every separate body of text is a document. In this project, each news article is referred to as a document.

#### Term-Document Matrix: 
It is a two-dimensional matrix representing the frequency of each term in each document. Rows correspond to terms and columns correspond to documents.
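
As a minimal sketch of this layout, the snippet below builds a term-document matrix for a made-up three-document corpus. The toy documents, the whitespace tokenization and the `build_term_document_matrix` helper are assumptions made for illustration and are not part of the project code.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix) where matrix[i][j] is the count of terms[i] in documents[j]."""
    # Whitespace tokenization is a simplification for this toy example.
    counts = [Counter(doc.lower().split()) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


# Hypothetical toy corpus.
documents = ["ekonomi haber ekonomi", "spor haber", "spor spor gol"]
terms, matrix = build_term_document_matrix(documents)
for term, row in zip(terms, matrix):
    # One row per term, one column per document.
    print(term, row)
```

Each printed row holds the counts of one term across the three documents, which is the orientation the rest of the analysis relies on.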

#### Term Frequency(tf): 
For each term in a document, the number of occurrences is called the term frequency. Note that term frequencies ignore word order, so they are identical for the documents "X is better than Y" and "Y is better than X". Several term frequency weighting schemes exist, such as raw count, term frequency adjusted to document length, and augmented term frequency. This project uses term frequency adjusted to document length for normalization purposes, which limits the effect of long articles on the overall result.
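
The following sketch illustrates length-adjusted term frequency and the fact that word order plays no role. The `term_frequencies` helper and the whitespace tokenization are hypothetical choices for this example only.

```python
from collections import Counter


def term_frequencies(tokens):
    """Length-adjusted term frequency: raw count divided by the number of tokens."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


tf_xy = term_frequencies("x is better than y".split())
tf_yx = term_frequencies("y is better than x".split())

# Both documents contain the same tokens, so their term frequencies are equal.
print(tf_xy == tf_yx)
print(tf_xy["better"])
```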

#### Inverse Document Frequency(idf) : 
In terms of relevancy, a term such as "is" carries the same weight as "better" under term frequency alone, although it should not weigh as much in a semantic context. To solve this problem, inverse document frequency is defined to measure whether a term is rare or common across all documents. It is calculated as \begin{equation*} idf(t,D) = \log\left(\frac{N}{1+\left|\big\{ d \in D : t \in d \big\}\right|}\right) \end{equation*} where N is the total number of documents in the collection and $\left|\big\{ d \in D : t \in d \big\}\right|$ is the number of documents in which term t appears. In any document collection, the idf of a rare term is high and the idf of a common term is low.
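
A small sketch of the formula follows, assuming documents are already available as sets of tokens. The corpus and the `inverse_document_frequency` helper are made up for illustration. Note that, because of the 1 in the denominator, a term that appears in every document receives a slightly negative idf.

```python
import math


def inverse_document_frequency(term, documents):
    """idf(t, D) = log(N / (1 + number of documents containing t))."""
    containing = sum(1 for tokens in documents if term in tokens)
    return math.log(len(documents) / (1 + containing))


# Hypothetical tokenized documents.
documents = [
    {"is", "better", "x", "y"},
    {"is", "news"},
    {"is", "sport"},
    {"is", "economy"},
]

idf_common = inverse_document_frequency("is", documents)    # appears in every document
idf_rare = inverse_document_frequency("better", documents)  # appears in one document
print(idf_common, idf_rare)
```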

#### tf-idf : 
The tf-idf weight is the product of tf and idf. Its behaviour can be summarized as follows:
   * highest when a term occurs many times within a small number of documents.
   * lower when a term occurs fewer times in a document or occurs in many documents.
   * lowest when a term occurs many times in all documents.
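
The behaviour listed above can be checked on a toy corpus. The sketch below combines length-adjusted tf with the idf formula from the previous section; the `tf_idf` helper and the four tokenized documents are assumptions for illustration only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document (length-adjusted tf times idf)."""
    n_docs = len(documents)
    # Number of documents in which each term appears.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return weights


# Hypothetical corpus: "secim" is concentrated in one document, "ve" appears everywhere.
documents = [
    ["secim", "secim", "secim", "ve"],
    ["spor", "ve", "gol", "ve"],
    ["ekonomi", "ve", "borsa", "ve"],
    ["hava", "ve", "yagmur", "ve"],
]
weights = tf_idf(documents)
print(weights[0])
print(weights[1])
```

In this corpus the frequent but concentrated term "secim" gets the largest weight, a term used once in a single document gets a smaller one, and "ve", which is frequent in every document, gets the smallest.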

#### Stop words : 
Stop words are a group of words that are filtered out before the text is processed. There is no standard list of stop words; the list can vary between semantic contexts. The NLTK Data module is used for stop word removal. It provides stop word lists for a variety of languages.

#### Stemming: 
Stemming is the process of reducing inflected or derived words to their root form. It has a direct effect on the result, since LSA is based on the number of occurrences in the corpus. For words that are derived from the same root and have the same meaning in the corpus, stemming yields a corrected weight. The [Snowball Turkish stemming algorithm](http://snowball.tartarus.org/algorithms/turkish/stemmer.html) is used as the stemmer in this project.

#### Tokenization: 
Tokenization is the process of dividing a string into substrings. The NLTK library offers, among other techniques, word tokenization and sentence tokenization. Since we are working with terms, word tokenization is used in this project. The word_tokenize method from the nltk.tokenize module returns a list of word tokens for a given string.
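
The project relies on `word_tokenize` and NLTK's Turkish stop word list, both of which need NLTK data to be downloaded. As a dependency-free sketch of the same two steps, the snippet below swaps in a simple regular-expression tokenizer and a tiny hand-picked stop list. Both are stand-ins chosen for illustration, and unlike `word_tokenize` the regular expression drops punctuation instead of returning it as separate tokens.

```python
import re

# Hypothetical, deliberately tiny stop list; the project uses NLTK's Turkish list instead.
STOP_WORDS = {"ve", "bir", "bu", "ile"}


def tokenize(text):
    """Split text into lowercase word tokens (a rough stand-in for word_tokenize)."""
    return re.findall(r"\w+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop list."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("Bu haber ekonomi ve spor ile ilgili bir haber.")
filtered = remove_stop_words(tokens)
print(filtered)
```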

#### Singular Value Decomposition: 
SVD is a factorization of a real matrix A such that \begin{equation*} A \cong U s V^\intercal \end{equation*} where, for an m $\times$ n matrix A, U is an m $\times$ m matrix, s is an m $\times$ n matrix and V is an n $\times$ n matrix. The diagonal elements of s are the singular values of A and all other elements are zero, so s can be reduced to a vector of its diagonal elements. In LSA, the tf-idf matrix is factorized with SVD in order to obtain U, s and V, each of which represents a different dimension of the analysis. s contains the weights of the features, or topics, in the corpus. U represents the relation between terms and features, and V represents the relation between documents and features. Since the reduced vector s is sorted in descending order, its first element represents the most powerful feature of the corpus, and the first column of U gives a numerical score of how strongly each term is related to that feature. These values are meaningful as a ranking rather than as absolute weights, so keyword extraction with U is sensible when the scores are used to rank terms against each other. The numpy library is used for the SVD calculations. The [svd](https://docs.scipy.org/doc/numpy-1.12.0/reference/generated/numpy.linalg.svd.html) method of the numpy.linalg module returns U, s and the transpose of V, in that order, where s is the reduced vector of singular values of the given matrix.
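
The shapes and the ranking step can be illustrated on a small matrix. In the sketch below the 4 x 3 weight matrix and the term names are made up, and ranking by absolute value is a choice made for this example (the sign of a singular vector is arbitrary), not necessarily what the project code does.

```python
import numpy as np

# Hypothetical 4-term x 3-document weight matrix.
terms = ["secim", "spor", "ekonomi", "hava"]
A = np.array([
    [0.9, 0.8, 0.0],
    [0.1, 0.0, 0.7],
    [0.5, 0.6, 0.1],
    [0.0, 0.1, 0.2],
])

# numpy returns s as a 1-D vector in descending order and the third factor already transposed.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)

# The factors reproduce A up to floating-point error.
print(np.allclose(A, U @ np.diag(s) @ Vt))

# Rank terms by the magnitude of their entry in the first column of U.
ranking = np.argsort(-np.abs(U[:, 0]))
print([terms[i] for i in ranking])
```

With `full_matrices=False` numpy computes only as many columns of U as there are singular values, which is enough when just the leading columns are used for ranking.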

In some cases, the noise introduced by the stemmer turned out to be unimportant for the result. For instance, let x, y and z be derived forms of the same root a that the stemmer failed to reduce. In the results, x, y and z appear together and are weighted as if they were a single term in root form.

### The data

The data set used in this project consists of news articles on all topics from various newspapers. We selected 10 newspapers from Turkish media according to circulation statistics. The data set depends heavily on the Web Archive. The methodology for obtaining the data set has 3 parts:

1. URL Extraction for a Specified Date: Every newspaper has a web site, and the homepage URL of every newspaper is recorded in a relational database (SQLite). For a specified date in the past, the [Memento Protocol](http://www.mementoweb.org/about/) is used, through its [client library for Python](https://github.com/mementoweb/py-memento-client), to get the past version of the URL. If there is no past version of the URL for the specified date, the [Wayback Availability API](https://archive.org/help/wayback_api.php) of the Internet Archive is used to extract the URL.

2. Content-News Article URL Scraping: After obtaining the homepage for the specified date, we need to access the news articles of that date. To achieve that, all anchor tags of the homepage are scraped. This may not capture every news article of that date, but a sufficient majority of articles can be collected this way. Each URL is stored in the URLInformation table together with the information needed to scrape it.

3. Article Scraping from URL: In this step, articles are scraped from the obtained URLs in order to construct the corpus on which the analysis runs. The Beautiful Soup library is used for scraping. After a GET request is made to a URL, the HTML body is passed to the scraper methods if the response is successful. For cleaner scraping, the div element that contains the article data has been specified for every newspaper, and the class and id properties of these HTML elements are stored in the Platform table. In most cases, within a newspaper's domain, the content of a news article is found in specific HTML elements with a certain class and id. Beautiful Soup can navigate and filter an HTML document using class and id information. This is demonstrated in the Code section. Each scraped article is saved to the CollectedArticles table.
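
The Wayback fallback in step 1 can be sketched without network access. In the snippet below the helper names `build_availability_url` and `closest_snapshot` are hypothetical, and the sample payload only mirrors the general shape of the API's JSON response; its URL is the one printed in the Code section, while the other values are illustrative.

```python
from datetime import datetime
from urllib.parse import urlencode

WAYBACK_ENDPOINT = "http://archive.org/wayback/available"


def build_availability_url(url, date):
    """Build the query URL for the Wayback Availability API."""
    query = urlencode({"url": url, "timestamp": date.strftime("%Y%m%d%H%M%S")})
    return "{}?{}".format(WAYBACK_ENDPOINT, query)


def closest_snapshot(payload):
    """Return the closest archived URL from a decoded API response, or None."""
    closest = (payload.get("archived_snapshots") or {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


# Sample payload shaped like the API's JSON response.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20161130190715/http://www.hurriyet.com.tr/",
            "timestamp": "20161130190715",
            "status": "200",
        }
    }
}

query_url = build_availability_url("http://www.hurriyet.com.tr/", datetime(2016, 12, 1))
print(query_url)
print(closest_snapshot(sample))
print(closest_snapshot({"archived_snapshots": {}}))
```

Keeping the response parsing in its own function makes the "no snapshot available" case explicit, since the API answers with an empty `archived_snapshots` object in that situation.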


## Code

In [32]:
# Necessary imports
from urllib.parse import urlparse

import sys
path_of_modules = '/GraduationProject/NewsAnalysis/GraduationProject'
if path_of_modules not in sys.path:
    sys.path.append(path_of_modules)
from Scraper import Scraper
from Utility import Utility
import math
from nltk.tokenize import word_tokenize
import snowballstemmer
from nltk.corpus import stopwords
import numpy

print("Modules are imported")

Modules are imported


This cell demonstrates URL extraction for a specified date. If the Memento Protocol is unable to obtain a past version of the URL, the Wayback Availability API is used.

In [33]:
from memento_client import MementoClient
import datetime
import requests

# Date and url initialization.
memento = MementoClient()
date = datetime.datetime.strptime("01-12-2016", "%d-%m-%Y")
url = "http://www.hurriyet.com.tr/"
wayback_url = "http://archive.org/wayback/available?url={}&timestamp={}"

past_version = None
try:
    # Ask the Memento aggregator for the capture closest to the requested date.
    memento_info = memento.get_memento_info(url, date)
    closest = (memento_info.get("mementos") or {}).get("closest") or {}
    uris = closest.get("uri")
    if uris:
        past_version = uris[0]
except Exception:
    # Any Memento failure is handled by the Wayback fallback below.
    past_version = None

if past_version is None:
    # Fall back to the Wayback Availability API only when Memento returned nothing.
    timestamp = date.strftime("%Y%m%d%H%M%S")
    wayback_get = requests.get(wayback_url.format(url, timestamp))
    if wayback_get.status_code == 200:
        closest = (wayback_get.json().get("archived_snapshots") or {}).get("closest")
        if closest and closest.get("available") is True:
            past_version = closest["url"]

%store past_version
print(past_version)

Stored 'past_version' (str)
http://web.archive.org/web/20161130190715/http://www.hurriyet.com.tr/


After getting the URL of the archived version, the home page of the website needs to be scraped to get the news pages for that date. This is done with the get_anchor_list_for_domain method of the Scraper class. It uses Beautiful Soup to filter the anchor elements in the document and checks that the href of each anchor tag is a valid URL and not, for instance, a social media sharing link.

In [34]:
# getting back past version of homepage url
%store -r past_version

import NewsAnalysisDatabase as db
    
scraper = Scraper()
utility = Utility()
content_list = set()

r_get_homepage = requests.get(past_version)
if r_get_homepage.status_code == 200:
    platform = db.get_platform_by_object_id(1)
    if platform is not None:
        content_list.update(scraper.get_anchor_list_for_domain(r_get_homepage.content, utility.get_fixed_domain(past_version), platform.urldomain))
    
print(str(len(content_list)) + ' URL extracted from homepage')
%store content_list

116 URL extracted from homepage
Stored 'content_list' (set)


In this part, GET requests are made to the extracted URLs and, if the status is OK, the response HTML is scraped for articles. The scraped articles are then saved to the database. For demonstration purposes, 10 articles are scraped in this cell.

In [35]:
%store -r content_list
count = 0

for uri_inf in content_list:
    if count == 10:
        break
    # Gets news text for a given specific url: makes get request, removes unnecessary text from response.
    try:
        r_get = requests.get(uri_inf.content_uri)
    except requests.exceptions.RequestException as e:
        # requests wraps low-level failures such as RemoteDisconnected in its own exception types.
        print(e)
        continue
    if r_get.status_code == 200:
        platform = db.get_platform_by_object_id(1)
        if platform is not None:
            scraped_news = scraper.get_news_text(r_get.content, platform)
            if scraped_news and scraped_news.strip():
                db.create_collected_news(scraped_news, uri_inf.content_uri)
                count += 1
                print("article added: ")
                print(scraped_news.strip())
                print("------------------------------------------")
            else:
                print("Nothing found on: " + uri_inf.content_uri)
                print("------------------------------------------")
                
print("{} collected articles added to database".format(count))

article added: 
Sonbaharda hava kapansa, bulutlar tepemizden eksik olmasa da hala dışarıda egzersiz yapabileceğiniz harika güneşli günler bulunuyor. Bu günlerden birinde bahçeye çıktık ve fitness eğitmeni Zeynep Kapdan ile oyunla egzersizi birleştiren harika bir antrenman yaptık. 4 bölümden oluşan ve yaklaşık 30 dakikanızı alacak antrenman yağ yakmanızı sağlarken core kaslarınızı güçlendirecek, özellikle kalçalarınızı şekillendirecek.BÖLÜM 1 ALIŞTIRMALAR (DRILLS) 3 kez tekrarlayın➤ Yürüyerek lunge: Her bir bacakla 10 tekrar➤ Yüksek tekme: Her bir bacakla 10 tekrar➤ Zıplayarak diz çekme: Her bir bacakla 15 tekrar➤ Diz çekme: Her bir bacakla 15 tekrar➤ Kalçaya tekme: Her bir bacakla 15 tekrarYürüyerek lungeBacaklar kalça hizasında aralı dik durarak başlayın. Sağ bacağınızla ileri adım atın ve her iki bacağınızı dizden 90 derecelik açıyla bükerek alçalın, öndeki bacağın üst kısmı yere paralel, öndeki ayak yere tam basıyor, arkadaki ayağın ise parmak ucuna basıyorsunuz ve diz yere değiyor…

After the corpus is collected, the data needs to be cleaned and prepared for analysis. The corpus is cleaned with regular expressions while the data is fetched from the database via the get_sample_document_list method. The article text is tokenized into words, stop words are removed and the remaining tokens are stemmed.

In [None]:
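stemmer = snowballstemmer.stemmer("turkish")
# Import the corpus reader under its own name so that re-running this cell
# still works after `stopwords` has been rebound to a set.
from nltk.corpus import stopwords as nltk_stopwords
stopwords = set(nltk_stopwords.words("turkish"))
# word_tokenize returns punctuation as separate tokens, so filter those as well.
stopwords.update(['.', ',', '"', "'", "''", '""', '?', '!', ':', ';', '(', ')', '[', ']', '{', '}'])
sample_data, url_list = db.get_sample_document_list()

# Collect the unique stemmed terms over all documents.
terms = set()
for document in sample_data:
    terms.update(stemmer.stemWord(token) for token in word_tokenize(document.lower()) if token not in stopwords)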
    
terms = list(terms)
%store terms   
%store sample_data    

Stored 'terms' (list)
Stored 'sample_data' (list)


After the articles are tokenized, the term-document, term frequency, inverse document frequency and tf-idf matrices are constructed.

In [None]:
%store -r terms
%store -r sample_data
import math

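from collections import Counter

# Tokenize, filter and stem every document once, so that terms are counted per token
# (matching how `terms` was built) instead of as substrings of the raw text.
doc_tokens = [[stemmer.stemWord(token) for token in word_tokenize(doc.lower()) if token not in stopwords]
              for doc in sample_data]
doc_counts = [Counter(tokens) for tokens in doc_tokens]

# Rows are terms, columns are documents.
term_document = [[counts[term] for counts in doc_counts] for term in terms]

# Term frequency adjusted to document length (max() guards against empty documents).
term_freq = [[counts[term] / max(len(tokens), 1) for counts, tokens in zip(doc_counts, doc_tokens)]
             for term in terms]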

inverse_document_freq = [0] * len(terms)
for i in range(len(terms)):
    inverse_document_freq[i] = math.log(len(sample_data) / (1 + len([doc_index for doc_index in range(len(sample_data)) if term_document[i][doc_index] > 0])))

tfidf = []
for i in range(len(terms)):
    tfidf.append([])
    tfidf[i] = [term_frequency * inverse_document_freq[i] for term_frequency in term_freq[i]]
%store tfidf


After the tf-idf matrix is constructed from the corpus, its SVD is calculated. The appropriate column of U is chosen and the terms are sorted according to that column.

In [None]:
%store -r tfidf
%store -r terms

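# Only the leading columns of U are needed, so the reduced SVD avoids building a terms x terms matrix.
# numpy returns the third factor already transposed.
U, s, Vt = numpy.linalg.svd(numpy.array(tfidf), full_matrices=False)
# Score of every term against the strongest feature (first column of U).
token_scores = U[:, 0]
# Indices of the 20 highest-scoring terms.
top_indices = token_scores.argsort()[::-1][:20]
print("Keywords are:")
for index in top_indices:
    print(terms[index])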