
# PG AI - Natural Language Processing and Speech Recognition
# Practice Project: Build Your Own News Search Engine.

DESCRIPTION

Use TF-IDF text feature engineering and a few rules to build a first search engine for news articles. For any input query, we’ll present the five most relevant news articles.

#### Problem Statement: 
Reuters Ltd. is an international news agency headquartered in London and a division of Thomson Reuters. The data was originally collected and labeled by Carnegie Group Inc. and Reuters Ltd. in the course of developing the CONSTRUE text categorization system. <br>
An important step before assessing similarity between documents, or between documents and a search query, is choosing the right representation, i.e., correct feature engineering. We’ll build a process that returns the news articles most similar to a given text string (the search query).<br>

#### Domain: News
Analysis to be done: Assess the similarity of documents to a search query using TF-IDF

#### Content: 
#### Dataset: ‘r8-all-terms.txt’
The dataset has no header. Each row contains a label and the article text.

#### Steps to perform:

Organizing and retrieving news is a major problem for any news agency or any site that publishes news. Two major applications of measuring similarity between texts are retrieving news articles for a given query and assessing the similarity between any two documents.<br>

We will use a TF-IDF representation of the text after cleanup (stop word removal, case normalization, lemmatization). With the TF-IDF representation, we will use cosine similarity to measure how similar an article is to a query or to another article.<br>
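
For reference, here is a sketch of the quantities involved, assuming the default settings of scikit-learn's `TfidfVectorizer` (smoothed idf, rows scaled to unit L2 length). The notation is ours, not the dataset's: $n$ is the number of documents, $\mathrm{tf}(t,d)$ is the count of term $t$ in document $d$, and $\mathrm{df}(t)$ is the number of documents that contain $t$.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
w_{t,d} = \mathrm{tf}(t,d)\,\mathrm{idf}(t),
\qquad
\cos(\mathbf{a},\mathbf{b}) = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert_2\,\lVert\mathbf{b}\rVert_2}
```

Under these assumptions each document vector has unit length, so the cosine similarity between two documents reduces to their dot product.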

By Edson Teixeira<br>
teixeiraedson252@gmail.com <br>
December 29th 2021

In [1]:
import pandas as pd
import numpy as np

#### Using pandas, read in the text file
- Use the right delimiter (a tab)
- The file has no header, so supply the column names `label` and `text` while loading

In [2]:
inp_docs = pd.read_table("r8-all-terms.txt", sep="\t", names=['label','text'])
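
The same loading pattern can be tried without the data file. The sketch below reads two made-up rows from an in-memory string. The sample text and the name `sample_docs` are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Two made-up rows in the same layout as the data file: label<TAB>text, no header row.
sample = (
    "earn\tcompany reports higher quarterly profit\n"
    "acq\tfirm completes purchase of rival\n"
)

# Passing names= makes pandas treat the first line as data rather than as a header.
sample_docs = pd.read_csv(io.StringIO(sample), sep="\t", names=["label", "text"])

print(sample_docs.shape)
print(sample_docs.columns.tolist())
```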

In [3]:
inp_docs.label.value_counts()

earn        2840
acq         1596
crude        253
trade        251
money-fx     206
interest     190
ship         108
grain         41
Name: label, dtype: int64

In [4]:
inp_docs.head()

|   | label | text |
|---|-------|------|
| 0 | earn | champion products ch approves stock split cham... |
| 1 | acq | computer terminal systems cpml completes sale ... |
| 2 | earn | cobanco inc cbco year net shr cts vs dlrs net ... |
| 3 | earn | am international inc am nd qtr jan oper shr lo... |
| 4 | earn | brown forman inc bfd th qtr net shr one dlr vs... |


#### Get the text data into an array for easy manipulation

In [5]:
# .values returns a NumPy array of the article strings
articles0 = inp_docs.text.values

In [6]:
# Checking the length of the new array.
len(articles0)

5485

In [7]:
articles0[:3]

array(['champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter ',
       'computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warr

#### Case normalization

In [8]:
articles_lower = [art.lower() for art in articles0]
articles_lower[:3]

['champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter ',
 'computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would b

#### Tokenize the articles
Use NLTK's `word_tokenize` for this

In [9]:
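from nltk.tokenize import word_tokenize

# word_tokenize needs the NLTK Punkt tokenizer data, e.g. nltk.download('punkt')
# (recent NLTK releases may ask for 'punkt_tab' instead).
article_tokens = [word_tokenize(art) for art in articles_lower]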

In [10]:
print(article_tokens[:3])

[['champion', 'products', 'ch', 'approves', 'stock', 'split', 'champion', 'products', 'inc', 'said', 'its', 'board', 'of', 'directors', 'approved', 'a', 'two', 'for', 'one', 'stock', 'split', 'of', 'its', 'common', 'shares', 'for', 'shareholders', 'of', 'record', 'as', 'of', 'april', 'the', 'company', 'also', 'said', 'its', 'board', 'voted', 'to', 'recommend', 'to', 'shareholders', 'at', 'the', 'annual', 'meeting', 'april', 'an', 'increase', 'in', 'the', 'authorized', 'capital', 'stock', 'from', 'five', 'mln', 'to', 'mln', 'shares', 'reuter'], ['computer', 'terminal', 'systems', 'cpml', 'completes', 'sale', 'computer', 'terminal', 'systems', 'inc', 'said', 'it', 'has', 'completed', 'the', 'sale', 'of', 'shares', 'of', 'its', 'common', 'stock', 'and', 'warrants', 'to', 'acquire', 'an', 'additional', 'one', 'mln', 'shares', 'to', 'sedio', 'n', 'v', 'of', 'lugano', 'switzerland', 'for', 'dlrs', 'the', 'company', 'said', 'the', 'warrants', 'are', 'exercisable', 'for', 'five', 'years', 'at'

#### Remove stop words

In [11]:
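from nltk.corpus import stopwords

# Requires the NLTK stop word list, e.g. nltk.download('stopwords').
# A set makes the membership test in del_stop fast for every token.
stop_nltk = set(stopwords.words("english"))

In [12]:
def del_stop(inp_tokens):
    """Return the tokens that are not English stop words."""
    return [term for term in inp_tokens if term not in stop_nltk]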

Applying this to our entire article base using a list comprehension

In [13]:
articles_nostop = [del_stop(art) for art in article_tokens]

In [14]:
print(articles_nostop[:3])

[['champion', 'products', 'ch', 'approves', 'stock', 'split', 'champion', 'products', 'inc', 'said', 'board', 'directors', 'approved', 'two', 'one', 'stock', 'split', 'common', 'shares', 'shareholders', 'record', 'april', 'company', 'also', 'said', 'board', 'voted', 'recommend', 'shareholders', 'annual', 'meeting', 'april', 'increase', 'authorized', 'capital', 'stock', 'five', 'mln', 'mln', 'shares', 'reuter'], ['computer', 'terminal', 'systems', 'cpml', 'completes', 'sale', 'computer', 'terminal', 'systems', 'inc', 'said', 'completed', 'sale', 'shares', 'common', 'stock', 'warrants', 'acquire', 'additional', 'one', 'mln', 'shares', 'sedio', 'n', 'v', 'lugano', 'switzerland', 'dlrs', 'company', 'said', 'warrants', 'exercisable', 'five', 'years', 'purchase', 'price', 'dlrs', 'per', 'share', 'computer', 'terminal', 'said', 'sedio', 'also', 'right', 'buy', 'additional', 'shares', 'increase', 'total', 'holdings', 'pct', 'computer', 'terminal', 'outstanding', 'common', 'stock', 'certain', '
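
The cleanup steps so far (case normalization, tokenization, stop word removal) can be sketched in one self-contained function. This is only an illustration: `str.split()` stands in for `word_tokenize`, on the assumption that the text is already free of punctuation, and `toy_stop` is a hypothetical mini stop word list rather than NLTK's.

```python
# Hypothetical mini stop word list, for illustration only; the notebook uses NLTK's English list.
toy_stop = {"the", "of", "its", "a", "for", "to"}

def clean(text, stop_words=toy_stop):
    """Lowercase the text, split it on whitespace, and drop stop words."""
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in stop_words]

sample = "The board approved a two for one split of its common shares"
print(clean(sample))
```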

### Feature engineering - using TF-IDF to represent each document

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 3000 terms with the highest term frequency across the corpus.
vectorizer = TfidfVectorizer(max_features=3000)

#### Joining the tokens back into a string

In [16]:
articles_string = [" ".join(art) for art in articles_nostop]

In [17]:
articles_string[:3]

['champion products ch approves stock split champion products inc said board directors approved two one stock split common shares shareholders record april company also said board voted recommend shareholders annual meeting april increase authorized capital stock five mln mln shares reuter',
 'computer terminal systems cpml completes sale computer terminal systems inc said completed sale shares common stock warrants acquire additional one mln shares sedio n v lugano switzerland dlrs company said warrants exercisable five years purchase price dlrs per share computer terminal said sedio also right buy additional shares increase total holdings pct computer terminal outstanding common stock certain circumstances involving change control company company said conditions occur warrants would exercisable price equal pct common stock market price time exceed dlrs per share computer terminal also said sold technolgy rights dot matrix impact technology including future improvements woodco inc hou

#### Applying TF-IDF to the data

In [18]:
articles_tfidf = vectorizer.fit_transform(articles_string)

In [19]:
articles_tfidf.shape

(5485, 3000)
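
The shape above is (number of articles, number of kept terms). A small sketch on a made-up corpus shows how `max_features` caps the second dimension. The documents and the `toy_` names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus with 10 distinct terms in total.
toy_docs = [
    "crude oil price rises",
    "oil company raises crude price",
    "bank cuts interest rate",
]

# Cap the vocabulary at 5 terms, chosen by term frequency across the corpus.
toy_vectorizer = TfidfVectorizer(max_features=5)
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

print(toy_tfidf.shape)                     # (documents, kept terms)
print(sorted(toy_vectorizer.vocabulary_))  # the terms that survived the cap
```

The three terms that occur twice ("crude", "oil", "price") are always kept. Which of the single-occurrence terms fill the remaining slots depends on how the library breaks ties.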

In [20]:
tfidf_dense = articles_tfidf.todense()

In [21]:
type(tfidf_dense)

numpy.matrix
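
`todense()` returns a `numpy.matrix`, a legacy type that NumPy no longer recommends for new code. Some recent scikit-learn releases may also refuse it as input, depending on the installed version. A minimal sketch of the difference from `toarray()`, using a small hand-built sparse matrix as a stand-in for `articles_tfidf`:

```python
import numpy as np
from scipy import sparse

# Small stand-in for a TF-IDF matrix (2 documents, 3 terms).
toy = sparse.csr_matrix(np.array([[0.0, 0.6, 0.8], [1.0, 0.0, 0.0]]))

as_matrix = toy.todense()   # numpy.matrix
as_array = toy.toarray()    # plain numpy.ndarray with the same values

print(type(as_matrix).__name__, type(as_array).__name__)
```

If a later step rejects the matrix type, switching to `toarray()` is one option to try. The values are identical either way.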

#### Cosine similarity between any two vectors

In [22]:
from sklearn.metrics.pairwise import cosine_similarity

In [23]:
cosine_similarity(tfidf_dense[3,:], tfidf_dense[4,:])

array([[0.51969816]])
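
To make the measure concrete, here is a hand-rolled cosine similarity in plain Python. It is a sketch for intuition, not a replacement for the scikit-learn function. Returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for all-zero vectors
    return dot / (norm_u * norm_v)

# The vectors share one of their two non-zero components.
print(cosine([1, 0, 1], [1, 1, 0]))
```

The two example vectors overlap in one of two non-zero positions, so the score is one half (up to floating-point rounding). Identical directions score 1, and vectors with no terms in common score 0.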

In [24]:
articles_string[3:5]

['international inc nd qtr jan oper shr loss two cts vs profit seven cts oper shr profit vs profit revs mln vs mln avg shrs mln vs mln six mths oper shr profit nil vs profit cts oper net profit vs profit revs mln vs mln avg shrs mln vs mln note per shr calculated payment preferred dividends results exclude credits four cts nine cts qtr six mths vs six cts cts prior periods operating loss carryforwards reuter',
 'brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter']

#### Define a function that does the following

1. For any given row number, extracts the TF-IDF vector
2. Computes the similarity of this vector with all the others
3. Gets the indices of the top 5 matches
4. Returns the text for the top 5 matches, and the text of the target row
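
The steps listed above can be sketched on a toy matrix before applying them to the real data. Everything here (`toy_vectors`, `toy_texts`, `top_matches`, the parameter `k`) is hypothetical. The rows are already unit length, so a dot product gives the cosine similarity.

```python
import numpy as np

# Hypothetical document vectors (rows), each already scaled to unit length.
toy_vectors = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.0, 0.8],
])
toy_texts = ["doc a", "doc b", "doc c", "doc d"]

def top_matches(row, k=2):
    """Return the target text and the texts of its k most similar documents."""
    target = toy_vectors[row]             # extract the vector for the target row
    sims = toy_vectors @ target           # similarity of the target with every row
    sims[row] = -np.inf                   # exclude the target itself from the ranking
    best = np.argsort(sims)[::-1][:k]     # indices of the k highest scores
    return toy_texts[row], [toy_texts[i] for i in best]

print(top_matches(0))
```

Masking the target's own score, rather than skipping the first sorted entry, keeps the ranking correct even when another document is identical to the target.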

In [26]:
target_row = 4

In [27]:
target_vector = tfidf_dense[target_row,:]

In [28]:
print(articles_string[target_row])

brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter


In [29]:
# cosine_similarity scores the target against every row in one call;
# the result has shape (1, n_documents), so take row 0.
sim_scores = cosine_similarity(target_vector, tfidf_dense)[0]

#### Making a pandas series of similarity scores for easy manipulation

In [30]:
len(sim_scores)

5485

In [31]:
tfidf_dense.shape[0]

5485

In [32]:
similarity = pd.Series(sim_scores)
similarity.head()

0    0.076767
1    0.037874
2    0.619848
3    0.519698
4    1.000000
dtype: float64

In [33]:
# The best match is the target row itself (score 1.0), so take six and skip the first.
top5_scores = similarity.sort_values(ascending=False).head(6).iloc[1:]

In [34]:
top5_index = top5_scores.index.values
top5_index

array([3633, 1526, 3939, 3686,  427])
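
Skipping the first sorted entry assumes the target always ranks first. If another article is identical to the target, both score 1.0 and the order between them is not guaranteed. One possible alternative, sketched on made-up scores, is to drop the target by its label and then take the largest values. The variable names and numbers are illustrative assumptions.

```python
import pandas as pd

# Hypothetical similarity scores; row 2 is the target and row 4 is an exact duplicate of it.
toy_similarity = pd.Series([0.10, 0.75, 1.00, 0.40, 1.00, 0.62])
target_row = 2

# Dropping the target by label keeps genuine duplicates in the ranking.
top3 = toy_similarity.drop(target_row).nlargest(3)

print(top3.index.tolist())
```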

In [35]:
for ind in top5_index:
    print("Similarity score:" + str(round(top5_scores[ind],2)) + "\n" + "Article text: " + articles_string[ind] + "\n")

Similarity score:0.9
Article text: technitrol inc tnl th qtr shr cts vs cts net mln vs revs mln vs mln year shr dlrs vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.88
Article text: vista resources inc vist th qtr net shr dlrs vs one dlr net vs revs mln vs mln mths shr dlrs vs dlrs net vs revs mln vs mln reuter

Similarity score:0.87
Article text: nike inc nike rd qtr feb net shr cts vs cts net vs mln revs mln vs mln nine mths shr cts vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: quick reilly group bqr th qtr feb shr cts vs cts net mln vs mln revs mln vs mln year shr dlrs vs dlr net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: kay jewelers inc kji th qtr net shr dlrs vs dlrs net mln vs revs mln vs mln year shr dlrs vs dlrs net vs revs mln vs mln reuter



In [36]:
def get_top5(target_row):
    """Print the five articles most similar to the article at target_row."""
    target_vector = tfidf_dense[target_row,:]

    # One call scores the target against every article; row 0 holds the scores.
    sim_scores = cosine_similarity(target_vector, tfidf_dense)[0]

    similarity = pd.Series(sim_scores)
    # The best match is the target itself, so take six and skip the first.
    top5_scores = similarity.sort_values(ascending=False).head(6).iloc[1:]
    top5_index = top5_scores.index.values

    for ind in top5_index:
        print("Similarity score:" + str(round(top5_scores[ind],2)) + "\n" + "Article text: " + articles_string[ind] + "\n")

In [37]:
get_top5(4)

Similarity score:0.9
Article text: technitrol inc tnl th qtr shr cts vs cts net mln vs revs mln vs mln year shr dlrs vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.88
Article text: vista resources inc vist th qtr net shr dlrs vs one dlr net vs revs mln vs mln mths shr dlrs vs dlrs net vs revs mln vs mln reuter

Similarity score:0.87
Article text: nike inc nike rd qtr feb net shr cts vs cts net vs mln revs mln vs mln nine mths shr cts vs dlrs net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: quick reilly group bqr th qtr feb shr cts vs cts net mln vs mln revs mln vs mln year shr dlrs vs dlr net mln vs mln revs mln vs mln reuter

Similarity score:0.87
Article text: kay jewelers inc kji th qtr net shr dlrs vs dlrs net mln vs revs mln vs mln year shr dlrs vs dlrs net vs revs mln vs mln reuter



In [39]:
def get_top5_query(qry):
    """Print the five articles most similar to the free-text query qry."""
    # Project the query into the TF-IDF space learned from the articles.
    target_vector = vectorizer.transform([qry])

    # One call scores the query against every article; row 0 holds the scores.
    sim_scores = cosine_similarity(target_vector, tfidf_dense)[0]

    similarity = pd.Series(sim_scores)
    top5_scores = similarity.sort_values(ascending=False).head(5)
    top5_index = top5_scores.index.values

    print("Search query: " + qry + "\n")

    for ind in top5_index:
        print("Similarity score:" + str(round(top5_scores[ind],2)) + "\n" + "Article text: " + articles_string[ind] + "\n")
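
The query path can also be tried end to end on a tiny made-up corpus. In this sketch, `toy_docs`, `toy_vectorizer`, and `search` are hypothetical names. The key point is that the query goes through `transform` on the already fitted vectorizer, so it shares the articles' vocabulary and idf weights.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up corpus standing in for the news articles.
toy_docs = [
    "crude oil price raised by refiner",
    "bank raises interest rate",
    "computer systems maker completes sale",
]
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

def search(query, k=2):
    """Return (index, score) pairs for the k documents most similar to query."""
    query_vec = toy_vectorizer.transform([query])   # reuse the fitted vocabulary
    scores = cosine_similarity(query_vec, toy_tfidf)[0]
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]

print(search("crude oil price"))
```

Words that are absent from the fitted vocabulary contribute nothing to the query vector, so a query made up entirely of unseen words scores 0 against every document.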

In [40]:
get_top5_query("crude oil price")

Search query: crude oil price

Similarity score:0.49
Article text: phillips p raises crude postings cts phillips petroleum said raised contract price grades crude oil cts barrel effective today increase brings phillip posted price west texas intermediate west texas sour grades dlrs bbl phillips last changed crude oil postings march price increase follows similar moves usx x subsidiary marathon oil sun co sun earlier today reuter

Similarity score:0.44
Article text: marathon petroleum reduces crude postings marathon petroleum co said reduced contract price pay grades crude oil one dlr barrel effective today decrease brings marathon posted price west texas intermediate west texas sour dlrs bbl south louisiana sweet grade crude reduced dlrs bbl company last changed crude postings jan reuter

Similarity score:0.43
Article text: diamond shamrock dia cuts crude prices diamond shamrock corp said effective today cut contract prices crude oil dlrs barrel reduction brings posted price west texas

In [41]:
get_top5_query("computer systems")

Search query: computer systems

Similarity score:0.55
Article text: vertex vetx buy computer transceiver stake vertex industries inc computer transceiver systems inc jointly announced agreement vertex acquire pct interest computer completes proposed reorganization computer reorganization proceedings chapter since september companies said agreement would allow computer unsecured creditors debenture holders receive new stock exchange exsiting debt shareholders receive one new share computer stock four shares previously held companies said united states bankruptcy court southern district new york given preliminary approval proposal subject formal approval computer creditors court agreement vertex also said would supply computer dlrs operating funds arrange renegotiation secured bank debt among things reuter

Similarity score:0.53
Article text: aw computer systems inc awcsa year end dec shr cts vs cts net vs revs vs reuter

Similarity score:0.48
Article text: hogan systems hogn acquisition