# Natural Language Processing Applied to Collected News Articles with NLTK

## Members

1. First member: Tarık Kılıç / kilicsem@itu.edu.tr

## Description of the project

This project applies Natural Language Processing techniques, such as keyword extraction, to news articles collected from Turkish media. The articles are collected from the Web Archive and cover a time span of 2 months (December 2016-January 2017). The goal of the project is to extract the most important keywords and concepts from this data source.

### The methods to be used

While processing the data, Latent Semantic Analysis (LSA) concepts will be implemented and analyzed comparatively. In short, LSA is a mathematical technique for extracting and determining the relationships between terms and concepts in a context. It is a method for analyzing text documents that reveals the latent (hidden) semantic concepts in them. In this project, the context, in other words the document data, is the set of news articles. Frequently mentioned LSA terms and techniques are explained below.

* Document: Every separate body of text is a document. In this project, each news article is referred to as a document.

* Term-Document Matrix: A two-dimensional matrix representing the frequency of each term in each document. Rows correspond to terms and columns correspond to documents.

* Term Frequency (tf): For each term in a document, the number of occurrences is called the term frequency. Note that term frequency ignores word order: the documents "X is better than Y" and "Y is better than X" have identical term frequencies. There are several term frequency weighting schemes, such as raw count, term frequency adjusted for document length, and augmented term frequency. Term frequency adjusted for document length is the tentative choice in this project for normalization purposes. (This choice is not final: augmented term frequency could be better, since the lengths of our news articles vary.)

* Inverse Document Frequency (idf): In term frequency, a term such as "is" carries the same weight as "better", although it should not weigh as much in a semantic context. To solve this problem, inverse document frequency measures whether a term is rare or common across all documents. Idf is calculated as \begin{equation*} idf(t,D) = \log\left(\frac{N}{1+\left|\big\{ d \in D : t \in d \big\}\right|}\right) \end{equation*} where $N$ is the total number of documents in the collection $D$, and $\left|\big\{ d \in D : t \in d \big\}\right|$ is the number of documents in which term $t$ appears. The 1 in the denominator avoids division by zero for a term that appears in no document. In any document collection, the idf of a rare term is high and the idf of a common term is low.

* tf-idf: The tf-idf weight is the product of tf and idf. For a term in a document, the tf-idf weight is
    * highest when the term occurs many times within a small number of documents.
    * lower when the term occurs fewer times in the document or occurs in many documents.
    * lowest when the term occurs in virtually all documents.

* Stop words: Stop words are a group of words that are filtered out before the text is processed. There is no standard list of stop words; the list can vary between semantic contexts.

* Stemming: It is a process of reducing inflected or derived words to their root form.
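
The weighting terms defined above can be illustrated with a minimal sketch in plain Python. It is not the project's implementation: the toy corpus, the whitespace tokenizer, and all function names are assumptions made for illustration. It uses the idf formula given above and shows both the length-adjusted and the augmented term frequency, so the two candidate schemes can be compared.

```python
import math
from collections import Counter

# Toy corpus: each string stands in for one news article (hypothetical data).
documents = [
    "x is better than y",
    "y is better than x",
    "the economy is growing",
]


def tokenize(text):
    """Lowercase and split on whitespace; a real pipeline would also strip punctuation."""
    return text.lower().split()


def length_adjusted_tf(tokens):
    """Raw count of each term divided by the number of tokens in the document."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def augmented_tf(tokens):
    """0.5 + 0.5 * count / max_count, which reduces the bias towards long documents."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    return {term: 0.5 + 0.5 * count / max_count for term, count in counts.items()}


def inverse_document_frequency(tokenized_docs):
    """idf(t, D) = log(N / (1 + df(t))), the smoothed formula from the text."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}


def tf_idf(tokenized_docs, tf_function=length_adjusted_tf):
    """Return one {term: weight} dictionary per document."""
    idf = inverse_document_frequency(tokenized_docs)
    return [
        {term: tf * idf[term] for term, tf in tf_function(tokens).items()}
        for tokens in tokenized_docs
    ]


tokenized = [tokenize(doc) for doc in documents]
weights = tf_idf(tokenized)
```

In this toy corpus the first two documents get identical term frequencies, and in the third document "economy" is weighted above "is", because "is" appears in every document. Passing `tf_function=augmented_tf` switches the weighting scheme without changing the rest of the computation.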

### The data

The data set used in this project consists of news articles on all topics from various newspapers. We selected 10 newspapers from Turkish media according to circulation statistics. The data set depends entirely on the Web Archive. The methodology for obtaining the data set has 2 parts:

1. URL Extraction for a Specified Date: Every newspaper has a web site, and the homepage URL of every newspaper is recorded in a relational database (MySQL). For a specified date in the past, the [Memento Protocol](http://www.mementoweb.org/about/) is used, through its [client library for Python](https://github.com/mementoweb/py-memento-client), to get the past version of the URL. If there is no past version of the URL for the specified date, the [Wayback Availability API](https://archive.org/help/wayback_api.php) of the Internet Archive is used to extract the URL instead.

2. Content-News Article URL Scraping: After obtaining the homepage for the specified date, we need to access the news articles of that date. To do so, all anchor tags of the homepage are scraped. This may not capture every news article from that date, but it collects a sufficient majority of them.
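
The anchor scraping in the second step can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the project's `Scraper` module: `AnchorCollector`, `extract_article_links`, and the sample HTML are hypothetical. The sketch also assumes that links point directly at the newspaper's domain. Archived snapshots usually rewrite links to point at the archive host, which needs extra handling that is not shown here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class AnchorCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Relative links such as "/gundem/haber-1" become absolute URLs.
            self.links.add(urljoin(self.base_url, href))


def extract_article_links(html, base_url, domain):
    """Return the set of absolute links whose host name ends with `domain`."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    return {
        link
        for link in collector.links
        if (urlparse(link).hostname or "").endswith(domain)
    }


# Hypothetical homepage fragment used only to exercise the function.
sample_html = (
    '<a href="/gundem/haber-1">A</a>'
    '<a href="http://www.hurriyet.com.tr/spor/haber-2">B</a>'
    '<a href="http://example.com/ad">C</a>'
    "<a>no href</a>"
)
links = extract_article_links(sample_html, "http://www.hurriyet.com.tr/", "hurriyet.com.tr")
```

Collecting the links in a set removes duplicates, which matters because homepages often link to the same article from several places.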

## Code

In [6]:
# Demonstration Url Extraction of Specified Date
from memento_client import MementoClient
import requests
from urllib.parse import urlparse
import datetime

# Date and url initialization.
memento = MementoClient()
date = datetime.datetime.strptime("01-12-2016", "%d-%m-%Y")
url = "http://www.hurriyet.com.tr/"
wayback_url = "http://archive.org/wayback/available?url={}&timestamp={}"

past_version = None
try:
    # Ask the Memento aggregator for the snapshot closest to the requested date.
    memento_info = memento.get_memento_info(url, date)
    mementos = memento_info.get("mementos") or {}
    closest_uris = (mementos.get("closest") or {}).get("uri")
    if closest_uris:
        past_version = closest_uris[0]
except Exception:
    # Treat any Memento failure as "no snapshot found" and use the fallback below.
    past_version = None

if past_version is None:
    # Fall back to the Wayback Availability API of the Internet Archive.
    timestamp = date.strftime("%Y%m%d%H%M%S")
    wayback_get = requests.get(wayback_url.format(url, timestamp))
    if wayback_get.status_code == 200:
        result = wayback_get.json().get("archived_snapshots")
        if result and result.get("closest") and result["closest"].get("available") is True:
            past_version = result["closest"]["url"]

%store past_version
past_version

'http://web.archive.org/web/20161130190715/http://www.hurriyet.com.tr/'
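
The Wayback fallback used in the cell above can also be isolated into two small functions that are testable without network access. This is a sketch: `build_wayback_query`, `closest_snapshot_url`, and the sample response are hypothetical names and data. The response layout is assumed from the fields the cell already reads (`archived_snapshots`, `closest`, `available`, `url`).

```python
from datetime import datetime
from urllib.parse import urlencode

WAYBACK_ENDPOINT = "http://archive.org/wayback/available"


def build_wayback_query(url, when):
    """Build the availability query for `url` at the datetime `when`."""
    params = {"url": url, "timestamp": when.strftime("%Y%m%d%H%M%S")}
    # urlencode percent-encodes the target URL so it is safe as a query value.
    return WAYBACK_ENDPOINT + "?" + urlencode(params)


def closest_snapshot_url(response_json):
    """Return the closest archived URL from a parsed response, or None if unavailable."""
    snapshots = response_json.get("archived_snapshots") or {}
    closest = snapshots.get("closest") or {}
    if closest.get("available") is True:
        return closest.get("url")
    return None


# Hypothetical response, shaped like the fields read in the cell above.
sample_response = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20161130190715/http://www.hurriyet.com.tr/",
        }
    }
}
query = build_wayback_query("http://www.hurriyet.com.tr/", datetime(2016, 12, 1))
snapshot = closest_snapshot_url(sample_response)
```

Returning `None` for a missing or unavailable snapshot lets the caller decide whether to skip the date or retry with a different timestamp.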

In [61]:
# Demonstration of Content-News Article URL Scraping

#This part is necessary for importing our defined modules such as Scraper and Utility
import sys
path_of_modules = '/GraduationProject/NewsAnalysis/GraduationProject'
if path_of_modules not in sys.path:
    sys.path.append(path_of_modules)

# getting back past version of homepage url
%store -r past_version

from Scraper import Scraper
from Utility import Utility
scraper = Scraper()
utility = Utility()
import NewsAnalysisDatabase as db
content_list = set()


r_get_homepage = requests.get(past_version)
if r_get_homepage.status_code == 200:
    platform = db.get_platform_by_object_id(1)
    if platform is not None:
        content_list.update(scraper.get_anchor_list_for_domain(r_get_homepage.content, utility.get_fixed_domain(past_version), platform.urldomain))
    
%store content_list
# Keep the expression last so the notebook displays it as the cell output.
str(len(content_list)) + ' URL extracted from homepage'

'116 URL extracted from homepage'

In [None]:



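# NOTE: uri_list and Provider are not defined in this notebook; they are assumed to come from the project's own modules.
for uri_inf in uri_list:
    # Gets news text for a given specific url: makes get request, removes unnecessary text from response.
    if uri_inf.iscollected == 0:
        try:
            r_get = requests.get(uri_inf.urltext)
        except requests.RequestException:
            # Skip URLs that cannot be fetched (timeouts, connection errors, invalid URLs).
            continue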
        if r_get.status_code == 200:
            platform = Provider.get_platform_by_object_id(uri_inf.platformid)
            if platform is not None:
                scraped_news = scraper.get_news_text(r_get.content, platform)
                if scraped_news and scraped_news.strip():
                    Provider.create_collected_news(scraped_news, uri_inf.urltext)
                    Provider.mark_as_collected_url(uri_inf.objectid)
                    print("article added")
                else:
                    print("Nothing found on: " + uri_inf.urltext)
