# Read the NEWS

## Introduction

Newspapers and their online formats supply the public with the information we need to understand the events occurring in the world around us. From politics to sports, the news keeps us informed, in the loop, and ready to make decisions about how to act in a rapidly changing world.

**Source:** [Kaggle Dataset of The News International articles](https://www.kaggle.com/asad1m9a9h6mood/news-articles)

In this project we use term frequency-inverse document frequency (tf-idf) to score the terms in each article and surface the ones that best describe it, giving quick insight into the article's topic.

## Import Required Libraries

In [76]:
import nltk, re
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import Counter

import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

## Preprocessing

### Define stop_words and normalizer

In [77]:
stop_words = stopwords.words('english')
normalizer = WordNetLemmatizer()

### Define Function to get Part of Speech

In [78]:
def get_part_of_speech(word):
  probable_part_of_speech = wordnet.synsets(word)
  pos_counts = Counter()
  pos_counts["n"] = len(  [ item for item in probable_part_of_speech if item.pos()=="n"]  )
  pos_counts["v"] = len(  [ item for item in probable_part_of_speech if item.pos()=="v"]  )
  pos_counts["a"] = len(  [ item for item in probable_part_of_speech if item.pos()=="a"]  )
  pos_counts["r"] = len(  [ item for item in probable_part_of_speech if item.pos()=="r"]  )
  most_likely_part_of_speech = pos_counts.most_common(1)[0][0]
  return most_likely_part_of_speech
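
One behaviour of `get_part_of_speech` is worth knowing: when WordNet has no synsets for a token, all four counts are zero and `Counter.most_common(1)` returns the first key inserted, so the function falls back to `"n"` (noun). The sketch below reproduces the counting logic without WordNet. The helper `most_likely_pos` and its list-of-tags input are hypothetical stand-ins for the synset lookup, not part of the notebook.

```python
from collections import Counter

def most_likely_pos(pos_tags):
    """Mirror the counting in get_part_of_speech for a list of POS tags.

    pos_tags stands in for [synset.pos() for synset in wordnet.synsets(word)].
    """
    pos_counts = Counter()
    # Insert keys in the same order as the original function: n, v, a, r.
    for tag in ("n", "v", "a", "r"):
        pos_counts[tag] = sum(1 for item in pos_tags if item == tag)
    # Ties (including the all-zero case) resolve to the earliest inserted key.
    return pos_counts.most_common(1)[0][0]

verb_heavy = most_likely_pos(["v", "v", "n"])  # more verb senses than noun senses
unknown_word = most_likely_pos([])             # no synsets at all
tie = most_likely_pos(["v", "a"])              # one verb sense, one adjective sense
print(verb_heavy, unknown_word, tie)
```

Because ties go to the earliest key, the order in which the four tags are inserted also decides how ambiguous words are lemmatized.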

### Define Function to Preprocess Text

In [79]:
def preprocess_text(text):
  cleaned = re.sub(r'\W+', ' ', text).lower()
  tokenized = word_tokenize(cleaned)
  normalized = " ".join([normalizer.lemmatize(token, get_part_of_speech(token)) for token in tokenized if not re.match(r'\d+',token)])
  return normalized
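
Note that `preprocess_text` never consults the `stop_words` list defined earlier, so common words such as "the" remain in the processed text. A minimal sketch of the same cleaning steps with stop-word filtering added, using only the standard library: `str.split` stands in for `word_tokenize`, lemmatization is left out, and `SAMPLE_STOP_WORDS` is a small hypothetical set rather than the NLTK list.

```python
import re

# Hypothetical stand-in for stopwords.words('english').
SAMPLE_STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}

def clean_without_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    # Replace runs of non-word characters with a space, then lowercase.
    cleaned = re.sub(r"\W+", " ", text).lower()
    # Whitespace splitting is a simplification of word_tokenize.
    tokens = cleaned.split()
    # Drop tokens that start with a digit (as in preprocess_text) and stop words.
    kept = [t for t in tokens if not re.match(r"\d+", t) and t not in stop_words]
    return " ".join(kept)

sample = clean_without_stop_words("The US oil prices slipped below $50 a barrel.")
print(sample)
```

In the real pipeline, the same condition (`token not in stop_words`) could be added to the list comprehension in `preprocess_text`; converting `stop_words` to a `set` first keeps the membership test fast.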

## Load Data

In [80]:
df = pd.read_csv("Articles.csv")
df.head()

| | Article | Date | Heading | NewsType |
|---|---|---|---|---|
| 0 | KARACHI: The Sindh government has decided to b... | 01-01-2015 | sindh govt decides to cut public transport far... | business |
| 1 | HONG KONG: Asian markets started 2015 on an up... | 01-02-2015 | asia stocks up in new year trad | business |
| 2 | HONG KONG: Hong Kong shares opened 0.66 perce... | 01-05-2015 | hong kong stocks open 0.66 percent lower | business |
| 3 | HONG KONG: Asian markets tumbled Tuesday follo... | 01-06-2015 | asian stocks sink euro near nine year | business |
| 4 | NEW YORK: US oil prices Monday slipped below $... | 01-06-2015 | us oil prices slip below 50 a barr | business |


For now we only need the `Article` column, so let's check whether it (or any other column) contains null values.

In [81]:
df.isnull().sum()

Article     0
Date        0
Heading     0
NewsType    0
dtype: int64

## Preprocessing Articles

In [82]:
articles = df['Article']

In [83]:
processed_articles = []
for article in articles:
  processed_articles.append(preprocess_text(article))
#print(processed_articles[5])

## Initializing and Fitting the Vectorizers

In [84]:
# Initialize and fit the CountVectorizer
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(processed_articles)

# convert counts to tf-idf
#transformer = TfidfTransformer(norm = None)
#tfidf_scores_transformed = transformer.fit_transform(counts)

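# initialize and fit TfidfVectorizer (norm=None keeps the raw, unnormalized tf-idf scores);
# this rebinds `vectorizer`, which learns the same vocabulary as the CountVectorizer above
vectorizer = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(processed_articles)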
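
With `norm=None`, each score is the raw term count multiplied by the term's inverse document frequency. Under scikit-learn's default `smooth_idf=True`, the idf is `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the term. A minimal pure-Python sketch of that arithmetic on a toy corpus follows; the documents and the helper `tfidf_unnormalized` are made up for illustration and tokenize by whitespace only.

```python
import math
from collections import Counter

def tfidf_unnormalized(docs):
    """Return one {term: score} dict per document, with no length normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

toy_docs = ["oil price price", "oil oil export", "stock market stock"]
scores = tfidf_unnormalized(toy_docs)
# Highest-scoring term per toy document.
top_terms = [max(doc_scores, key=doc_scores.get) for doc_scores in scores]
print(top_terms)
```

A term that is frequent inside one document but rare across the corpus scores highest, which is why "oil" wins in the second toy document even though it also occurs in the first.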

## Term Frequency - Inverse Document Frequency

In [85]:
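# get vocabulary of terms
try:
  feature_names = vectorizer.get_feature_names_out()
except AttributeError:
  # older scikit-learn releases only provide get_feature_names()
  feature_names = vectorizer.get_feature_names()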

In [86]:
# build a label for each article, used as column names below
article_index = [f"Article {i+1}" for i in range(len(articles))]

### Term Frequency or Word Counts

In [87]:
# create pandas DataFrame with word counts (rows: terms, columns: articles)
df_word_counts = pd.DataFrame(counts.T.toarray(), index=feature_names, columns=article_index)
df_word_counts

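| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 | ... | Article 2683 | Article 2684 | Article 2685 | Article 2686 | Article 2687 | Article 2688 | Article 2689 | Article 2690 | Article 2691 | Article 2692 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| __cf_email__ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| a300 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| a320 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| a321 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| a330 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| zverev | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| zvereva | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| zyl | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| étienne | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |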


### TF-IDF Scores

In [88]:
# create pandas DataFrame with tf-idf scores (rows: terms, columns: articles)
df_tf_idf = pd.DataFrame(tfidf_scores.T.toarray(), index=feature_names, columns=article_index)
df_tf_idf

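| | Article 1 | Article 2 | Article 3 | Article 4 | Article 5 | Article 6 | Article 7 | Article 8 | Article 9 | Article 10 | ... | Article 2683 | Article 2684 | Article 2685 | Article 2686 | Article 2687 | Article 2688 | Article 2689 | Article 2690 | Article 2691 | Article 2692 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| __cf_email__ | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| a300 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| a320 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| a321 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| a330 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| zverev | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| zvereva | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| zyl | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| étienne | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |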


### Highest tf-idf score words

In [89]:
# get highest scoring tf-idf term for each article
# (idxmax returns, for every article column, the term with the largest score)
highest_scores = df_tf_idf.idxmax().to_dict()
highest_scores

{'Article 1': 'fare',
 'Article 2': 'percent',
 'Article 3': 'hong',
 'Article 4': 'the',
 'Article 5': 'oil',
 'Article 6': 'arabia',
 'Article 7': 'kse',
 'Article 8': 'ang',
 'Article 9': 'sugar',
 'Article 10': 'oil',
 'Article 11': 'yen',
 'Article 12': 'hong',
 'Article 13': 'the',
 'Article 14': 'petrol',
 'Article 15': 'price',
 'Article 16': 'petrol',
 'Article 17': 'notification',
 'Article 18': 'percent',
 'Article 19': 'ecc',
 'Article 20': 'king',
 'Article 21': 'rent',
 'Article 22': 'brunei',
 'Article 23': 'the',
 'Article 24': 'litre',
 'Article 25': 'syriza',
 'Article 26': 'the',
 'Article 27': 'islamic',
 'Article 28': 'barrel',
 'Article 29': 'decrease',
 'Article 30': 'mortgage',
 'Article 31': 'oil',
 'Article 32': 'load',
 'Article 33': 'greek',
 'Article 34': 'imf',
 'Article 35': 'engine',
 'Article 36': 'fairly',
 'Article 37': 'the',
 'Article 38': 'greek',
 'Article 39': 'petrol',
 'Article 40': 'robust',
 'Article 41': 'hsbc',
 'Article 42': 'iea',
 'Artic