# Build Your Own News Search Engine

#### DESCRIPTION

Use text feature engineering (TF-IDF) and some rules to make our first search engine for news articles. For any input query, we’ll present the five  most relevant news articles. 

Problem Statement: 

Reuters Ltd. is an international news agency headquartered in London and is a division of Thomson Reuters. The data was originally collected and labeled by Carnegie Group Inc. and Reuters Ltd. in the course of developing the construe text categorization system. 

An important step before assessing similarity between documents, or between documents and a search query, is the right representation i.e., correct feature engineering. We’ll make a process that provides the most similar news articles to a given text string (search query).

Domain: News

Analysis to be done: Document similarity assessment to a search query using Tf-Idf

Content: 

Dataset: ‘r8-all-terms.txt’

Dataset has no header. For each row, it has a  label and the article text.

In [1]:
import pandas as pd
import numpy as np

#### Reading the file

In [2]:
input_doc = pd.read_table('r8-all-terms.txt',sep = '\t', names=['Label','Text'])
input_doc.head()

Unnamed: 0,Label,Text
0,earn,champion products ch approves stock split cham...
1,acq,computer terminal systems cpml completes sale ...
2,earn,cobanco inc cbco year net shr cts vs dlrs net ...
3,earn,am international inc am nd qtr jan oper shr lo...
4,earn,brown forman inc bfd th qtr net shr one dlr vs...


#### 1. Build the list out of the values from Text column for cleanup

In [3]:
articles = list(input_doc['Text'])
articles[:3]

['champion products ch approves stock split champion products inc said its board of directors approved a two for one stock split of its common shares for shareholders of record as of april the company also said its board voted to recommend to shareholders at the annual meeting april an increase in the authorized capital stock from five mln to mln shares reuter ',
 'computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would b

In [4]:
len(articles)

5485

In [5]:
articles[2]

'cobanco inc cbco year net shr cts vs dlrs net vs assets mln vs mln deposits mln vs mln loans mln vs mln note th qtr not available year includes extraordinary gain from tax carry forward of dlrs or five cts per shr reuter '

#### 2. Normalize the case

In [6]:
articles = [article.lower() for article in articles]
articles[1]

'computer terminal systems cpml completes sale computer terminal systems inc said it has completed the sale of shares of its common stock and warrants to acquire an additional one mln shares to sedio n v of lugano switzerland for dlrs the company said the warrants are exercisable for five years at a purchase price of dlrs per share computer terminal said sedio also has the right to buy additional shares and increase its total holdings up to pct of the computer terminal s outstanding common stock under certain circumstances involving change of control at the company the company said if the conditions occur the warrants would be exercisable at a price equal to pct of its common stock s market price at the time not to exceed dlrs per share computer terminal also said it sold the technolgy rights to its dot matrix impact technology including any future improvements to woodco inc of houston tex for dlrs but it said it would continue to be the exclusive worldwide licensee of the technology f

#### 3. Tokenize the articles using Word Tokenize

In [7]:
from nltk.tokenize import word_tokenize

In [8]:
word_tokenized_article = [word_tokenize(article) for article in articles]

In [9]:
print(word_tokenized_article[:2])

[['champion', 'products', 'ch', 'approves', 'stock', 'split', 'champion', 'products', 'inc', 'said', 'its', 'board', 'of', 'directors', 'approved', 'a', 'two', 'for', 'one', 'stock', 'split', 'of', 'its', 'common', 'shares', 'for', 'shareholders', 'of', 'record', 'as', 'of', 'april', 'the', 'company', 'also', 'said', 'its', 'board', 'voted', 'to', 'recommend', 'to', 'shareholders', 'at', 'the', 'annual', 'meeting', 'april', 'an', 'increase', 'in', 'the', 'authorized', 'capital', 'stock', 'from', 'five', 'mln', 'to', 'mln', 'shares', 'reuter'], ['computer', 'terminal', 'systems', 'cpml', 'completes', 'sale', 'computer', 'terminal', 'systems', 'inc', 'said', 'it', 'has', 'completed', 'the', 'sale', 'of', 'shares', 'of', 'its', 'common', 'stock', 'and', 'warrants', 'to', 'acquire', 'an', 'additional', 'one', 'mln', 'shares', 'to', 'sedio', 'n', 'v', 'of', 'lugano', 'switzerland', 'for', 'dlrs', 'the', 'company', 'said', 'the', 'warrants', 'are', 'exercisable', 'for', 'five', 'years', 'at'

#### 4. Remove Stop Words

In [10]:
from nltk.corpus import stopwords

In [126]:
stop_words = set(stopwords.words('english'))
stopword_free_tokenized_article =[]
for article in word_tokenized_article:
    sw_free = [word for word in article if not word in stop_words]
    stopword_free_tokenized_article.append(sw_free)

In [12]:
print(stopword_free_tokenized_article[:2])

[['champion', 'products', 'ch', 'approves', 'stock', 'split', 'champion', 'products', 'inc', 'said', 'board', 'directors', 'approved', 'two', 'one', 'stock', 'split', 'common', 'shares', 'shareholders', 'record', 'april', 'company', 'also', 'said', 'board', 'voted', 'recommend', 'shareholders', 'annual', 'meeting', 'april', 'increase', 'authorized', 'capital', 'stock', 'five', 'mln', 'mln', 'shares', 'reuter'], ['computer', 'terminal', 'systems', 'cpml', 'completes', 'sale', 'computer', 'terminal', 'systems', 'inc', 'said', 'completed', 'sale', 'shares', 'common', 'stock', 'warrants', 'acquire', 'additional', 'one', 'mln', 'shares', 'sedio', 'n', 'v', 'lugano', 'switzerland', 'dlrs', 'company', 'said', 'warrants', 'exercisable', 'five', 'years', 'purchase', 'price', 'dlrs', 'per', 'share', 'computer', 'terminal', 'said', 'sedio', 'also', 'right', 'buy', 'additional', 'shares', 'increase', 'total', 'holdings', 'pct', 'computer', 'terminal', 'outstanding', 'common', 'stock', 'certain', '

#### 5. Using TF-IDF to repesent each document
vocabulary size - 3000

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [115]:
tfidf_vect = TfidfVectorizer(max_features=3000)

#### 6. Create a string from stopword free articles

In [15]:
# Feed the sting to vectorizer as it need a string not single words. 
article_string = [" ".join(article) for article in stopword_free_tokenized_article]

In [24]:
article_string[:3]

['champion products ch approves stock split champion products inc said board directors approved two one stock split common shares shareholders record april company also said board voted recommend shareholders annual meeting april increase authorized capital stock five mln mln shares reuter',
 'computer terminal systems cpml completes sale computer terminal systems inc said completed sale shares common stock warrants acquire additional one mln shares sedio n v lugano switzerland dlrs company said warrants exercisable five years purchase price dlrs per share computer terminal said sedio also right buy additional shares increase total holdings pct computer terminal outstanding common stock certain circumstances involving change control company company said conditions occur warrants would exercisable price equal pct common stock market price time exceed dlrs per share computer terminal also said sold technolgy rights dot matrix impact technology including future improvements woodco inc hou

In [117]:
# Applying TF-IDF to entire article strings 

article_tfidf = tfidf_vect.fit_transform(article_string)
article_tfidf.shape

(5485, 3000)

In [118]:
article_tfidf

<5485x3000 sparse matrix of type '<class 'numpy.float64'>'
	with 212752 stored elements in Compressed Sparse Row format>

#### 7. Convert the Sparse Matrix into Dense Matrix using .todense()

In [28]:
article_tfidf_dense = article_tfidf.todense()
type(article_tfidf_dense)

numpy.matrix

In [29]:
article_tfidf_dense.shape

(5485, 3000)

#### 8. Calculate the Cosin Similarities between any two vectors

In [30]:
from sklearn.metrics.pairwise import cosine_similarity

In [31]:
cosine_similarity(article_tfidf_dense[3,:], article_tfidf_dense[4,:])

array([[0.51969816]])

In [34]:
article_tfidf_dense[3,:]

matrix([[0., 0., 0., ..., 0., 0., 0.]])

In [35]:
article_tfidf_dense[4,:]

matrix([[0., 0., 0., ..., 0., 0., 0.]])

In [36]:
article_string[3:5]

['international inc nd qtr jan oper shr loss two cts vs profit seven cts oper shr profit vs profit revs mln vs mln avg shrs mln vs mln six mths oper shr profit nil vs profit cts oper net profit vs profit revs mln vs mln avg shrs mln vs mln note per shr calculated payment preferred dividends results exclude credits four cts nine cts qtr six mths vs six cts cts prior periods operating loss carryforwards reuter',
 'brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter']

### 9. Search Engine - Find the five most relevant articles for any given query, fetch the text against them

#### First lets run individually to understand each steps. 

First test it individually
1. Get the vector correspoding to the test row number and all columns (3000). In this case row number 4 will be the row that will be used to
   compare it with rest of the document. 

In [37]:

test_row = 4
test_vector = article_tfidf_dense[test_row,:]
test_vector

matrix([[0., 0., 0., ..., 0., 0., 0.]])

2. Check the data present at row number of the list.

In [39]:
article_string[test_row]

'brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter'

In [135]:
articles[test_row]

'brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter '

3. Build a list of similarity scores between test row vector and vectors of rest of the articles.

In [59]:
# 3. Build a list of similarity scores between test row vector and vectors of rest of the articles.
similarity_scores = []

for index, vector in enumerate(article_tfidf_dense):
    
    sim_score = cosine_similarity(test_vector, article_tfidf_dense[index,:])[0][0]
    print(sim_score)
    similarity_scores.append(sim_score)

0.07676665359403674
0.03787444770300154
0.6198477163445897
0.5196981645446619
1.0
0.04254306127417924
0.2573796790506862
0.4413093161758383
0.6012202286033078
0.10560787549613786
0.5444120975197557
0.7518591909380903
0.06517512095078365
0.4576939190405448
0.04148097216911336
0.09471302981379182
0.0009856432934897154
0.021177635653944888
0.149849709400108
0.13951941212510374
0.0004496518287761983
0.09234873056598893
0.7751775103311724
0.6845228568125206
0.023706729848446417
0.14067317089218018
0.6026843354478166
0.15076463171672685
0.06475267514271965
0.1013622179656827
0.1220874744601668
0.14385658519264458
0.12050264614499374
0.4121560579209793
0.037386927764925054
0.557598429032735
0.14205520673447425
0.028562691295164384
0.015348898271138589
0.14279865327516508
0.001256060202667117
0.03317462011824703
0.7087841216890932
0.013126872941123927
0.0025764169662836653
0.025115220020663947
0.16539073854474073
0.04960742312878201
0.009690550660282625
0.006605494956212437
0.5920418282030806


0.25637031800463306
0.38373849606099725
0.4573663263327538
0.4585345207650051
0.02571212419613776
0.06305630789729808
0.018357202692748403
0.06762317040448401
0.7123976225870438
0.0019291593536995381
0.1859673369258247
0.7412739188093178
0.14187792802713461
0.01492892472375079
0.4019387020880133
0.06872441370701568
0.06090174077651267
0.5743673853268976
0.6172013846628763
0.002096938701754764
0.0480696139289562
0.007558373735549352
0.0005574686001389269
0.23249491548192922
0.52220672057367
0.4448535946392592
0.0
0.6577921733530315
0.6466757454062755
0.11640250596342858
0.13728104948278505
0.6309228847500856
0.04566882787342782
0.11850115850343654
0.11207596580938113
0.1190746125433096
0.5013796795169545
0.4564071523686248
0.11555287525785148
0.0
0.039636821442773654
0.1579102150204915
0.06705100361226439
0.005783372893230323
0.02373794744185924
0.010098947483206477
0.03005196774786354
0.03534853713942268
0.32373853328383684
0.04495639645920769
0.13353930392476862
0.4564071523686248
0.1

0.7053789278355679
0.3293832407974693
0.4698066761372015
0.00624876064087879
0.1227882128088469
0.6983906585486133
0.06404702212082755
0.014084194737323618
0.5942706927054605
0.11672242649536507
0.0719484948916019
0.5372090673564698
0.614386946965743
0.4602123214883669
0.498265033196796
0.12690678592528334
0.08220498236442712
0.10799639906876479
0.021249917398649835
0.006063320045608386
0.013086541928629148
0.0023559969044045794
0.06948188329900462
0.5118385511623013
0.12785899714912283
0.11242088673177093
0.1300084928200168
0.046525079419515926
0.01310692603300726
0.10346178586335987
0.5061269477380559
0.37481800214750377
0.0007905663506213861
0.4582220983797392
0.02476009595863554
0.7582011208188237
0.712815270271322
0.01929592895211058
0.2668087817922434
0.0
0.08103348475736698
0.0
0.10374528531689103
0.11042649380891628
0.7417406901761538
0.5970683131898736
0.05560278191363699
0.040033963900437806
0.5352171378208983
0.052648095673666614
0.5011186177733162
0.6453131995606116
0.08032

0.03732671836477265
0.03318037708253408
0.3609372875137683
0.0280388766600738
0.7442274858517195
0.4510730858593692
0.6701118985033374
0.5243911373503078
0.4336798864544126
0.028507648058550678
0.43760103726397737
0.017958829304426736
0.5092312644465818
0.04494052227535002
0.0
0.0009853503634261443
0.09482246860522645
0.0013093141980295062
0.02162537985472351
0.5683889790442742
0.07146815735157312
0.30038149959296245
0.04355533678240524
0.09868541896435315
0.06424212171550002
0.011181931365808643
0.02124857056062572
0.01289367863085917
0.5087280360806319
0.4948405348221537
0.6431351165567165
0.009621115625596734
0.14044033176404763
0.39878768193825515
0.09029726311955072
0.027445351437638193
0.0010376612255442801
0.09087259969110388
0.7552378633006447
0.05058373976367529
0.01838982689787786
0.0068874657230507475
0.019842930116024324
0.011874773306713613
0.06644218284422956
0.08154578301721266
0.0066317230776291826
0.0148961943770923
0.0010147235739497982
0.31891516358430205
0.589339127

0.010834218954126515
0.10023736646751402
0.02021206800773641
0.3178529882205033
0.2191223631494577
0.04607275218411336
0.5341313176156591
0.5380217343962155
0.11403705632202943
0.12245128591146613
0.5261875170574515
0.130191399584176
0.6641403478519895
0.1003775336074243
0.11621359570069759
0.12345881012924854
0.14967750483327807
0.050518583529818986
0.007713805849817566
0.0870493693492102
0.34165526761336146
0.03879976492565256
0.0038250257590551917
0.0007539723574035448
0.0013273616841488936
0.018961521550995467
0.004436093120083004
0.11313819266863356
0.0
0.12576775421047814
0.028696021074527744
0.6617899539112104
0.7446742513251581
0.15489270155569934
0.03682798393813538
0.4462326423117612
0.21413092937665265
0.025986283787534988
0.066506395211313
0.0017562089314040973
0.12229769576609201
0.030972351364847092
0.02233271554769597
0.008693074262048168
0.002955641969169519
0.0003142243932192794
0.03423573893387552
0.44487201874004
0.007310314436323466
0.024708449742531814
0.0738193318

0.5821869237114061
0.43270199730512005
0.0073652214404169684
0.0
0.44146947445647267
0.013796748694516423
0.4914843805885856
0.5375733310293067
0.7234701278055511
0.00796753030298263
0.4137997456349451
0.5359316778151564
0.2997892146140878
0.0016527643272258963
0.42037372583928306
0.0
0.0012676081051763597
0.10872880586384041
0.7437684968755438
0.10557106278802735
0.4917987535517019
0.7202160564306483
0.07623231504545509
0.029028710728640095
0.018921088436621998
0.03865303157183371
0.4062883024997376
0.5072744086487928
0.6206061707378965
0.4017908103334605
0.6063927946065093
0.056905584549709065
0.023534424494570674
0.009308352081928612
0.618316649586065
0.009890067229452851
0.05910580090800776
0.04290399076815828
0.0019179520175779795
0.05612406721577078
0.7598776970633083
0.3367538673018127
0.02482825606928429
0.0
0.052450954662406825
0.13268599081430751
0.01710898289352281
0.10311896854408122
0.0
0.5264502073184413
0.4232621947772624
0.5412576247542749
0.0011592034327682112
0.490645

0.022005590736937353
0.522947143090652
0.4979594616090923
0.06850493094546076
0.1910903198856078
0.00720569529278145
0.3420191985100056
0.019398548128010888
0.6462903321248528
0.0008029487737343488
0.10270771182687309
0.12480295697910712
0.009270698288449099
0.0
0.016392960145503642
0.44833446969916113
0.0016820027498386098
0.6651369818407932
0.10498328365429437
0.10795326076283372
0.6795757501878814
0.15576824543936424
0.47781026712489927
0.001843083092812431
0.08840718573546691
0.07812412125061927
0.12124519348513485
0.06624953267419376
0.3107853758958169
0.001746234765915701
0.016186855355701232
0.5594398493427066
0.010286000107376294
0.3780171640598199
0.019177207058362525
0.0509506433606594
0.0
0.00959451861566513
0.08145552044251662
0.005378566534584807
0.29123863840190867
0.11957569241817627
0.002345472248219658
0.6433406418835768
0.0
0.6062762651834089
0.29202639226117727
0.3014177547282372
0.01155840731274671
0.04605321095600515
0.023996799573414763
0.011907582612712518
0.0814

0.06628337200355015
0.08541952656907317
0.012080669008334433
0.7227431800591648
0.10076477373944781
0.01421158494616994
0.7382551479777927
0.37914278560664066
0.006014054844290866
0.03743682988788569
0.005977021599767323
0.08303859173821454
0.34795848382236183
0.06297301420358986
0.7770205797277192
0.0008923118772287991
0.8063583593819943
0.4481008410406062
0.025307069524354374
0.3667477602654383
0.06655048337368313
0.13045898937782666
0.3096173791759964
0.029339523093539217
0.031756844640598664
0.01852376783828217
0.0863092258375644
0.04530467141655686
0.24696455336734693
0.0007847131915470354
0.2936311247699759
0.015348811719910364
0.688376082860747
0.09558963724730128
0.08159568936890123
0.32368033251942
0.0
0.08142294719040583
0.011193183373360978
0.00534562016263918
0.3801796661970804
0.5428225194844863
0.30775561909259613
0.29627182079711856
0.46615479218608047
0.016212312717346234
0.033500646843971346
0.38176802210189253
0.3885407746950945
0.5293233976409293
0.47051687082662524


0.45300036598790056
0.05019822000204998
0.0013551675583403967
0.0
0.03314412726185327
0.07718977865125085
0.5640498594027914
0.3675661349664785
0.4665480652950152
0.09454015474783103
0.062068963472666765
0.01602810024824213
0.15939465679914822
0.2840732842097604
0.3558121701723894
0.5004591354713981
0.2997892146140878
0.310945700137893
0.38670548169979757
0.6354674303640009
0.16298897952092928
0.5222220791478621
0.09911999583836599
0.05090995154958809
0.404013004070244
0.05385970381048437
0.06560378513100788
0.1404885804686189
0.52883902734759
0.12424600933752351
0.0
0.08674031941436358
0.6322743559900084
0.014689714397086641
0.004302814416596147
0.5535663252018457
0.03475574453419124
0.08321145943116737
0.5454822251184854
0.07987310123128308
0.011024699229583846
0.04613073859837473
0.4873094224958898
0.002072083802220457
0.6407150783453821
0.020111967774936568
0.007707976854387549
0.01259335343008248
0.004091141758680669
0.017584989427849383
0.4458705567460548
0.0010558531136074401
0.

The type of similarity score is a list. We need to convert it to Pandas series so that we can get the Index number
and then use these Index number to get sort and get the top 4 Index and corresponding articles present at those indices. 

In [60]:
# The type of similarity score is a list. We need to convert it to Pandas series so that we can get the Index number
# and then use these Index number to get sort and get the top 4 Index and corresponding articles present at those indices. 
type(similarity_scores)

list

Create  pandas series so we can get the index values corresponding to the similarities scores. 

In [62]:
# Create  pandas series so we can get the index values corresponding to the similarities scores. 
similarities_series = pd.Series(data = similarity_scores)
similarities_series.head()

0    0.076767
1    0.037874
2    0.619848
3    0.519698
4    1.000000
dtype: float64

In [63]:
similarities_series.sort_values(ascending=False,inplace=True)
similarities_series.head()

4       1.000000
3633    0.895294
1526    0.884519
3939    0.873976
3686    0.871784
dtype: float64

Top 4 index excluding the one of itself

In [71]:
# Top 4 index excluding the one of itself
top4_indexes = similarities_series.index.values[1:5]
top4_indexes

array([3633, 1526, 3939, 3686], dtype=int64)

Let's print the top 4 Similarity scores and its corresponding similar lines. 

In [134]:
for index in top4_indexes:
    print("Similarity Score: ",similarities_series[index])
    print("Similar Article : ",article_string[index])
    print()
    print("Original: ---  ",articles[index])
    print()

Similarity Score:  0.8952935059563496
Similar Article :  technitrol inc tnl th qtr shr cts vs cts net mln vs revs mln vs mln year shr dlrs vs dlrs net mln vs mln revs mln vs mln reuter

original: ---   technitrol inc tnl th qtr shr cts vs cts net mln vs revs mln vs mln year shr dlrs vs dlrs net mln vs mln revs mln vs mln reuter 

Similarity Score:  0.8845187608727545
Similar Article :  vista resources inc vist th qtr net shr dlrs vs one dlr net vs revs mln vs mln mths shr dlrs vs dlrs net vs revs mln vs mln reuter

original: ---   vista resources inc vist th qtr net shr dlrs vs one dlr net vs revs mln vs mln mths shr dlrs vs dlrs net vs revs mln vs mln reuter 

Similarity Score:  0.873976409412619
Similar Article :  nike inc nike rd qtr feb net shr cts vs cts net vs mln revs mln vs mln nine mths shr cts vs dlrs net mln vs mln revs mln vs mln reuter

original: ---   nike inc nike rd qtr feb net shr cts vs cts net vs mln revs mln vs mln nine mths shr cts vs dlrs net mln vs mln revs mln v

### Function to put together above individual steps
#### Lets define a Function to accept the row number based on the above individual test steps 

In [95]:
# Lets define a Function to accept the row number based on the above individual test steps 

def get_top5_similar_articles(row_number):
    vector_article_tfidf_dense = article_tfidf_dense[row_number,:]
    print("Article at row {} is --- {}".format(row_number,article_string[row_number]))
    print()
    
    # Build a list of similarity scores based on cosin similarity between the search article vector and 
    # rest of articles vectors
    similarity_scores = []
    for index, vector in enumerate(article_tfidf_dense):
        sim_score = cosine_similarity(vector_article_tfidf_dense, article_tfidf_dense[index,:])[0][0]
        similarity_scores.append(sim_score)
        
    # Build a Pandas Series to get the Index position
    similarities_series = pd.Series(data = similarity_scores)
    
    # Sort the series. Here we get Similarities in descending orders
    similarities_series.sort_values(ascending=False,inplace=True)
    
    # Get top 5 similarities scores indices excluding the one of search row. 
    top5_indexes = similarities_series.index.values[1:6]
    
    for index in top5_indexes:
        print("Similarity Score: ",similarities_series[index])
        print("Similar Article : ",article_string[index])
        print()
    
    
    

In [96]:
get_top5_similar_articles(1)

Article at row 1 is --- computer terminal systems cpml completes sale computer terminal systems inc said completed sale shares common stock warrants acquire additional one mln shares sedio n v lugano switzerland dlrs company said warrants exercisable five years purchase price dlrs per share computer terminal said sedio also right buy additional shares increase total holdings pct computer terminal outstanding common stock certain circumstances involving change control company company said conditions occur warrants would exercisable price equal pct common stock market price time exceed dlrs per share computer terminal also said sold technolgy rights dot matrix impact technology including future improvements woodco inc houston tex dlrs said would continue exclusive worldwide licensee technology woodco company said moves part reorganization plan would help pay current operation costs ensure product delivery computer terminal makes computer generated labels forms tags ticket printers termin

In [97]:
get_top5_similar_articles(4)

Article at row 4 is --- brown forman inc bfd th qtr net shr one dlr vs cts net mln vs mln revs mln vs mln nine mths shr dlrs vs dlrs net mln vs mln revs billion vs mln reuter

Similarity Score:  0.8952935059563496
Similar Article :  technitrol inc tnl th qtr shr cts vs cts net mln vs revs mln vs mln year shr dlrs vs dlrs net mln vs mln revs mln vs mln reuter

Similarity Score:  0.8845187608727545
Similar Article :  vista resources inc vist th qtr net shr dlrs vs one dlr net vs revs mln vs mln mths shr dlrs vs dlrs net vs revs mln vs mln reuter

Similarity Score:  0.873976409412619
Similar Article :  nike inc nike rd qtr feb net shr cts vs cts net vs mln revs mln vs mln nine mths shr cts vs dlrs net mln vs mln revs mln vs mln reuter

Similarity Score:  0.8717835351513126
Similar Article :  quick reilly group bqr th qtr feb shr cts vs cts net mln vs mln revs mln vs mln year shr dlrs vs dlr net mln vs mln revs mln vs mln reuter

Similarity Score:  0.871580278033083
Similar Article :  kay 

In [98]:
get_top5_similar_articles(10)

Article at row 10 is --- computer language research clri th qtr shr loss cts vs loss cts net loss vs loss revs mln vs mln qtly div three cts vs three cts prior year shr profit two cts vs profit cts net profit vs profit revs mln vs mln note dividend payable april one shareholders record march reuter

Similarity Score:  0.8688407453805792
Similar Article :  ciro inc ciri year shr loss three cts vs profit cts net loss vs profit revs mln vs mln reuter

Similarity Score:  0.8583772470886208
Similar Article :  writer corp wrtc th qtr loss shr loss cts vs profit cts net loss vs profit revs mln vs mln year shr loss cts vs profit cts net loss vs profit revs mln vs mln reuter

Similarity Score:  0.8491554807096116
Similar Article :  pse inc pow th qtr shr loss cts vs profit cts net loss mln vs profit revs mln vs mln year shr loss cts vs profit five cts net loss mln vs profit revs mln vs mln reuter

Similarity Score:  0.8472420981077892
Similar Article :  otf equities inc otfe th qtr net shr prof

### Lets define a Function to accept the search string based on the above individual test steps 
Lets define a Function to accept the search string based on the above individual test steps. Here the extra step would be to vectororize the search raw string by using transform to get its vector representation based on the vocabulary and document term frequency learned in fit_transform previously.

##### transform(raw_documents)[source]
Transform documents to document-term matrix.

Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).

Parameters:

raw_documents: iterable

An iterable which yields either str, unicode or file objects.

Returns:

X: sparse matrix of (n_samples, n_features)

Tf-idf-weighted document-term matrix.

In [136]:
# Lets define a Function to accept the search string based on the above individual test steps. 
# Here the extra step would be to vectororize the search raw string by using transform to get its vector representation based
# on the vocabulary and document term frequency learned in fit_transform previously

def get_top5_similar_search_articles(search_string):
    
    vector_search_string = tfidf_vect.transform([search_string])
    
    print("Search String -- ",search_string)
    print("Vector of Search String, \n", vector_search_string)
    print()
    
    # Build a list of similarity scores based on cosin similarity between the search article vector and 
    # rest of articles vectors
    similarity_scores = []
    for index, vector in enumerate(article_tfidf_dense):
        sim_score = cosine_similarity(vector_search_string, article_tfidf_dense[index,:])[0][0]
        similarity_scores.append(sim_score)
        
    # Build a Pandas Series to get the Index position
    similarities_series = pd.Series(data = similarity_scores)
    
    # Sort the series. Here we get Similarities in descending orders
    similarities_series.sort_values(ascending=False,inplace=True)
    
    # Get top 5 similarities scores indices
    top5_indexes = similarities_series.index.values[1:6]
    
    for index in top5_indexes:
        print("Similarity Score: ",similarities_series[index])
        print("Similar Article : ",article_string[index])
        print()


In [137]:
get_top5_similar_search_articles("computer language")

Search String --  computer language
Vector of Search String, 
   (0, 565)	1.0

Similarity Score:  0.5561417712838604
Similar Article :  computer horizons chrz acquisition computer horizons corp said purchased computerknowledge inc software training education company headquartered dallas terms disclosed reuter

Similarity Score:  0.5123917616853629
Similar Article :  imtec imtc gets merger offer imtec inc said shareholders computer identics inc cidn proposed merger two companies company said shareholders previously expressed dissatisfaction computer identics management informed computer identics present board longer support majority shares held said shareholders called resignation one computer identics directors suggested new board pursue merger talks imtec imtec said merger talks havew yet taken place reuter

Similarity Score:  0.4921995247634891
Similar Article :  wavehill international make acquisition wavehill international ventures inc said agreed acquire personal computer rental c

In [123]:
vector_search_string = tfidf_vect.transform(["computer"])
vector_search_string

<1x3000 sparse matrix of type '<class 'numpy.float64'>'
	with 1 stored elements in Compressed Sparse Row format>

In [125]:
get_top5_similar_search_articles("oil news")

Search String --  oil news
Vector of Search String, 
   (0, 1857)	0.6055591964584901
  (0, 1799)	0.7958002636243265

Similarity Score:  0.3595811066046406
Similar Article :  iraq turkey oil pipeline cut landslide turkey oil pipeline near southern town adana cut landslide hurriyet anatolian news agencies said little oil lost landslide friday night taps one mln bpd line switched accident said pipeline carries oil turkey customers iraq kirkuk field yumurtalik terminal turkish mediterranean coast iraq main oil outlet reuter

Similarity Score:  0.33267837990508153
Similar Article :  news corp nws completes purchase newspaper news corp said south china morning post ltd hong kong become wholly owned subsidiary march previously announced reuter

Similarity Score:  0.3310638131611706
Similar Article :  exxon xon sees synfuels role year development costly shale oil liquified coal kinds synthetic fuels halted recent years cheap abundant petroleum supplies become economic world oil prices top dlrs