Our next step is to use tf-idf to vectorize the full text of the articles in our pickled news file, each of which carries an author. Once the text is vectorized, we use nearest neighbors to find the news stories closest to an input text and return the names of the authors who wrote them.

One practical application: if you are running a Kickstarter campaign and want to reach out to journalists to spread the word, it would be useful to shortlist the journos who have written stories similar to your campaign pitch.

We use sklearn to start with, and by Tuesday/Wednesday we will make it more interesting by switching to approximate nearest neighbor approaches.
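Before running this on the real data, here is a minimal sketch of the idea on a toy corpus. The texts, author labels and variable names (`toy_texts`, `toy_authors`, `pitch`) are made up for illustration and are not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy corpus standing in for the real articles (made-up data)
toy_texts = [
    "liverpool beat manchester city in the premier league",
    "congress passes covid relief bill after late night votes",
    "klopp praises liverpool squad after win over city",
    "senate debates stimulus package and relief cheques",
]
toy_authors = ["author a", "author b", "author c", "author d"]

# Turn each story into a tf-idf vector
toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_X = toy_vectorizer.fit_transform(toy_texts)

# Brute-force nearest neighbors under cosine distance
toy_nn = NearestNeighbors(n_neighbors=2, algorithm="brute", metric="cosine")
toy_nn.fit(toy_X)

# Vectorize the pitch with the fitted vectorizer and look up its neighbors
pitch = ["documentary about klopp and liverpool football"]
distances, indices = toy_nn.kneighbors(toy_vectorizer.transform(pitch))
closest_authors = [toy_authors[i] for i in indices[0]]
print(closest_authors)
```

The pitch shares terms only with the two football stories, so those are the neighbors that come back; the rest of the notebook does the same thing at scale.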

In [1]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import ftfy

In [2]:
with open('../Data/news.pkl', 'rb') as f:
    authors = pickle.load(f)
    
with open('../Data/pickled_df.pkl', 'rb') as f:
    articles = pickle.load(f)
    
data = authors.merge(articles, on = 'title')[['author_x', 'article_count', 'site_name_x', 'topic', 
                                              'url', 'title', 'title_sentiment', 'description', 'full_text']]
data = data.rename(columns = {'author_x': 'author', 'site_name_x': 'site_name'})

In [3]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [4]:
def clean_author(author):
    author = author.lower()
    author = re.sub(r'\|.*$', '', author) # remove everything after |
    author = re.sub(r'\,.*$', '', author) # remove everything after ,
    punctuations = r'''!()[]{};:'"\,<>./?#$%^&*_~'''
    for x in punctuations:
        author = author.replace(x, "") # remove punctuation except hyphen and @
    author = re.sub(r'(\d+)', '', author) # remove numbers
    author = re.sub(r'(--)', '', author) # remove double hyphens
    author = re.sub(r'(\w+@\w+)', '', author) # remove emails
    author = ' '.join([i.strip() for i in author.split()]) # collapse repeated whitespace
    author = re.sub(r'(\s+)(and)(\s+)(\w*)(\s*)(\w*)', '', author) # remove anything after and
    keywords = ['sa', 'tulsa', 'editor', 'writer', 'world-herald', 'news', 'world', 'richmond', 'times-dispatch', 'new',
                'hampshire', 'union', 'leader', 'for', 'the', 'state', 'journal', 'correspondant', 'sfgate', 'special',
                'from', 'the', 'gazette', 'times', 'staff', 'senior', 'dr', 'correspondent', 'by', 'editorial board',
                'research', 'wire reports', 'security', 'real', 'estate', 'to', 'post', 'and', 'courier', 'policy',
                'commercial', 'bureau', 'political', 'roanoke', 'college', 'football', 'editorial','democrat-gazette',
                'arizona', 'daily', 'star', '--hamburg', 'column', 'lincoln', 'managing', 'backstage', 'with', 'sports',
                'ii', 'iii', 'capitol', 'media', 'services']
    author = ' '.join([i for i in author.split() if i not in keywords]) # remove other keywords
    author = re.sub(r'(@)', '', author) # remove handles
    return author

In [5]:
data['clean_author'] = data.apply(lambda x: clean_author(x['author']), axis=1)

In [6]:
data

Unnamed: 0,author,article_count,site_name,topic,url,title,title_sentiment,description,full_text,clean_author
0,Aja Styles,14.0,Brisbane Times,entertainment,https://www.brisbanetimes.com.au/national/west...,'Pack Lego': Perth family caught in hard borde...,-9.09,Perth mother Clare has found herself mostly co...,Perth mother Clare* has found herself mostly ...,aja styles
1,Jake Johnson,33.0,Truthout,politics,https://truthout.org/articles/congress-passes-...,Congress Passes COVID Relief With Billions in ...,18.18,The bill’s gifts to the wealthy underscore t...,In late-night votes just hours after nearly 5...,jake johnson
2,Christine Favocci,19.0,The Western Journal,tech,https://www.westernjournal.com/pa-man-facing-c...,PA Man Facing Charges of Unlawful Voting After...,-38.46,It is naive to think that either party is free...,The left has insisted that voter fraud is jus...,christine favocci
3,William Rivers Pitt,14.0,Truthout,politics,https://truthout.org/articles/what-will-trump-...,What Will Trump Attempt in His Last Days? We M...,0.00,What Trump may do in his waning days is only u...,"The endgame being played out by Donald Trump,...",william rivers pitt
4,Amy Goodman,19.0,Truthout,business,https://truthout.org/video/the-insufficient-co...,The Insufficient COVID Stimulus Must Not Be Fo...,-20.00,Critics say the $900 billion relief package do...,As Congress passes a $900 billion coronavirus...,amy goodman
...,...,...,...,...,...,...,...,...,...,...
119957,Olivia Tobin,56.0,Liverpool Echo,tech,https://www.liverpoolecho.co.uk/news/liverpool...,"Boy, five, battling rare brain cancer will be ...",4.55,\n Five-year-old Aaron Wharton had surgery at ...,When you subscribe we will use the informatio...,olivia tobin
119958,Victoria Jones,92.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/how...,How to teach saving and spending to kids as yo...,0.00,Now might be a perfect time to involve childre...,When you subscribe we will use the informatio...,victoria jones
119959,Victoria Jones,92.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/spa...,Space experiment could unlock resources for mi...,0.00,Experimenting on the ISS allows scientists to ...,When you subscribe we will use the informatio...,victoria jones
119960,Nisha Mal,66.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/wom...,Woman's home is in Tier 2 while her garden fal...,-10.00,"'It's all one big conundrum,' says Sheila Herbert",Woman's home is in Tier 2 while her garden fa...,nisha mal


In [7]:
# Clean the text: fix encoding, remove stop words, punctuation and digits, and lemmatize
def clean(doc):
    doc = ftfy.fix_text(doc)
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+","", normalized)
    y = processed.split()
    return ' '.join(y)

In [8]:
data['text'] = ((data['title'] + ' ') * 4) + ((data['description'] + ' ') * 2) + (data['full_text'])
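One thing to watch with this weighting: in pandas, concatenating a string with a missing value gives a missing value, so a row with no description would end up with no `text` at all. The sketch below shows this on a hypothetical two-row frame (`toy` is made up, and whether the real data contains missing fields is not checked here) and one possible way around it.

```python
import pandas as pd

# Hypothetical two-row frame; the second row has no description
toy = pd.DataFrame({
    "title": ["t1", "t2"],
    "description": ["d1", None],
    "full_text": ["body one", "body two"],
})

# Same weighting as above: a missing field makes the whole result missing
naive = ((toy["title"] + " ") * 4) + ((toy["description"] + " ") * 2) + toy["full_text"]

# Filling missing fields with empty strings first keeps the remaining text
filled = toy[["title", "description", "full_text"]].fillna("")
safe = ((filled["title"] + " ") * 4) + ((filled["description"] + " ") * 2) + filled["full_text"]

print(naive.isna().tolist())
print(safe.tolist())
```

Checking `data['text'].isna().sum()` on the real frame would show whether this matters here.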

In [9]:
data

Unnamed: 0,author,article_count,site_name,topic,url,title,title_sentiment,description,full_text,clean_author,text
0,Aja Styles,14.0,Brisbane Times,entertainment,https://www.brisbanetimes.com.au/national/west...,'Pack Lego': Perth family caught in hard borde...,-9.09,Perth mother Clare has found herself mostly co...,Perth mother Clare* has found herself mostly ...,aja styles,'Pack Lego': Perth family caught in hard borde...
1,Jake Johnson,33.0,Truthout,politics,https://truthout.org/articles/congress-passes-...,Congress Passes COVID Relief With Billions in ...,18.18,The bill’s gifts to the wealthy underscore t...,In late-night votes just hours after nearly 5...,jake johnson,Congress Passes COVID Relief With Billions in ...
2,Christine Favocci,19.0,The Western Journal,tech,https://www.westernjournal.com/pa-man-facing-c...,PA Man Facing Charges of Unlawful Voting After...,-38.46,It is naive to think that either party is free...,The left has insisted that voter fraud is jus...,christine favocci,PA Man Facing Charges of Unlawful Voting After...
3,William Rivers Pitt,14.0,Truthout,politics,https://truthout.org/articles/what-will-trump-...,What Will Trump Attempt in His Last Days? We M...,0.00,What Trump may do in his waning days is only u...,"The endgame being played out by Donald Trump,...",william rivers pitt,What Will Trump Attempt in His Last Days? We M...
4,Amy Goodman,19.0,Truthout,business,https://truthout.org/video/the-insufficient-co...,The Insufficient COVID Stimulus Must Not Be Fo...,-20.00,Critics say the $900 billion relief package do...,As Congress passes a $900 billion coronavirus...,amy goodman,The Insufficient COVID Stimulus Must Not Be Fo...
...,...,...,...,...,...,...,...,...,...,...,...
119957,Olivia Tobin,56.0,Liverpool Echo,tech,https://www.liverpoolecho.co.uk/news/liverpool...,"Boy, five, battling rare brain cancer will be ...",4.55,\n Five-year-old Aaron Wharton had surgery at ...,When you subscribe we will use the informatio...,olivia tobin,"Boy, five, battling rare brain cancer will be ..."
119958,Victoria Jones,92.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/how...,How to teach saving and spending to kids as yo...,0.00,Now might be a perfect time to involve childre...,When you subscribe we will use the informatio...,victoria jones,How to teach saving and spending to kids as yo...
119959,Victoria Jones,92.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/spa...,Space experiment could unlock resources for mi...,0.00,Experimenting on the ISS allows scientists to ...,When you subscribe we will use the informatio...,victoria jones,Space experiment could unlock resources for mi...
119960,Nisha Mal,66.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/wom...,Woman's home is in Tier 2 while her garden fal...,-10.00,"'It's all one big conundrum,' says Sheila Herbert",Woman's home is in Tier 2 while her garden fa...,nisha mal,Woman's home is in Tier 2 while her garden fal...


In [10]:
trial = data.head(100000).copy()  # explicit copy avoids pandas' SettingWithCopyWarning
trial['clean_text'] = trial['text'].apply(clean)


In [11]:
trial.clean_text[0]

'pack lego perth family caught hard border crossfire christmas pack lego perth family caught hard border crossfire christmas pack lego perth family caught hard border crossfire christmas pack lego perth family caught hard border crossfire christmas perth mother clare found mostly confined small sydney city apartment sixyearold twoyearold northern beach outbreak take toll perth mother clare found mostly confined small sydney city apartment sixyearold twoyearold northern beach outbreak take toll perth mother clare found mostly confined small sydney city apartment two young daughter northern beach outbreak take toll clare flew cora georgia sydney december could reunite family christmas husband charles working remotely past year perth mother clare daughter cora sydney christmas hard border closure it pretty tough were really small apartment set small kid clare said we worked theory itd fine wed day tourist thing sydney obviously weve tried stay home much can apart grocery shop advertisemen

In [12]:
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(list(trial.clean_text))

In [13]:
vectorizer.get_feature_names()  # removed in scikit-learn 1.2; use get_feature_names_out() on newer versions

['aa',
 'aaa',
 'aaaa',
 'aaaaa',
 'aaaaaa',
 'aaaaaaaaa',
 'aaaaaaallll',
 'aaaaahhhh',
 'aaaaand',
 'aaaargh',
 'aaahhstyle',
 'aaarated',
 'aaaron',
 'aaastable',
 'aac',
 'aacc',
 'aachen',
 'aacps',
 'aacs',
 'aacspca',
 'aacta',
 'aactas',
 'aadhaar',
 'aadhaarlike',
 'aadhar',
 'aadil',
 'aaditi',
 'aaditya',
 'aadmi',
 'aadvantage',
 'aae',
 'aaegi',
 'aafje',
 'aafs',
 'aag',
 'aage',
 'aahana',
 'aahat',
 'aahhhhhh',
 'aahs',
 'aai',
 'aaib',
 'aail',
 'aails',
 'aaioperated',
 'aais',
 'aaja',
 'aajbharatbandhhai',
 'aakanksha',
 'aakash',
 'aal',
 'aalal',
 'aalam',
 'aaliyah',
 'aalla',
 'aallbig',
 'aalo',
 'aalsmeer',
 'aam',
 'aamar',
 'aamc',
 'aamer',
 'aami',
 'aaminah',
 'aamir',
 'aampe',
 'aanchal',
 'aandolan',
 'aang',
 'aanholt',
 'aaon',
 'aap',
 'aapan',
 'aapi',
 'aapke',
 'aapko',
 'aapl',
 'aapla',
 'aapled',
 'aapls',
 'aappa',
 'aapruled',
 'aaps',
 'aapshow',
 'aaqms',
 'aar',
 'aarated',
 'aardman',
 'aarey',
 'aarhus',
 'aari',
 'aarmstrongsays',
 'aa

In [14]:
nn = NearestNeighbors(n_neighbors=5, radius=1.0, algorithm='brute', leaf_size=30, metric='cosine', p=2, metric_params=None, n_jobs=None)
nn.fit(X)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [54]:
input_text = [clean("Man City fan reveals what Jurgen Klopp does better than anyone else at Liverpool")]
new = vectorizer.transform(input_text)
results = nn.kneighbors(new)  # kneighbors accepts the sparse matrix directly; returns (distances, indices)
data['clean_author'][results[1][0][0]]

'ian doyle'

In [53]:
for i in range(5):
    print(data['clean_author'][results[1][0][i]], results[1][0][i])
    print(data.iloc[results[1][0][i]].full_text)
    print()

ian doyle 36583
 Nothing wrong with being a newcomer, so sign up to the Liverpool newsletter When you subscribe we will use the information you provide to send you these newsletters. Sometimes they’ll include recommendations for other related newsletters or services we offer. OurPrivacy Noticeexplains more about how we use your data, and your rights. You can unsubscribe at any time. Related Articles Read More Related Articles And Klopp said: “I actually like the whole game but the start was probably the best period. “Especially against City, you have to use your chances or create even more. We had really top football moments but for the football we played in that period we didn’t have enough, 100%.

sean bradbury 3107
 Get the latest Reds news as Klopp's men push for Christmas and beyond with our email bulletin When you subscribe we will use the information you provide to send you these newsletters. Sometimes they’ll include recommendations for other related newsletters or services we 
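Since the goal is a shortlist of journalists rather than of stories, the neighbors can be collapsed to one entry per author. The helper below is a hypothetical sketch (`shortlist_authors` is not defined anywhere in this notebook), shown on made-up toy data so it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors


def shortlist_authors(text, vectorizer, nn, authors, n_neighbors=5):
    """Return (author, cosine distance) pairs for the closest stories, one per author."""
    query = vectorizer.transform([text])
    distances, indices = nn.kneighbors(query, n_neighbors=n_neighbors)
    shortlist = {}
    for dist, idx in zip(distances[0], indices[0]):
        author = authors[idx]
        # Neighbors come back sorted by distance, so the first hit per author is the closest
        if author not in shortlist:
            shortlist[author] = float(dist)
    return list(shortlist.items())


# Made-up stories and author labels, for illustration only
toy_texts = [
    "klopp praises liverpool after derby win",
    "liverpool documentary follows klopp title season",
    "liverpool city council approves transport budget",
    "snooker champion wins world title again",
]
toy_authors = ["writer one", "writer one", "writer two", "writer three"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_nn = NearestNeighbors(algorithm="brute", metric="cosine")
toy_nn.fit(toy_vectorizer.fit_transform(toy_texts))

result = shortlist_authors("klopp liverpool documentary", toy_vectorizer, toy_nn,
                           toy_authors, n_neighbors=3)
print(result)
```

With the real objects this would be called as `shortlist_authors(clean(pitch), vectorizer, nn, list(trial['clean_author']))`, assuming the positions in that list match the rows used to fit `nn`.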

In [17]:
data.iloc[-220].full_text

" Ronnie O'Sullivan is a six-time and reigning snooker world champion (Image: Getty Images) The Daily Star's FREE newsletter is spectacular! Sign up today for the best stories straight to your inbox When you subscribe we will use the information you provide to send you these newsletters. Sometimes they’ll include recommendations for other related newsletters or services we offer. OurPrivacy Noticeexplains more about how we use your data, and your rights. You can unsubscribe at any time. Few competitors have completely dominated their sport quite like snooker superstar Ronnie O’Sullivan. A six-time world champion and a record seven-time UK champion, his skill and prowess have made him one of the most formidable forces the game has ever witnessed. O’Sullivan is still at the top of his game, winning this year's world title, and statistically the most successful snooker player in the history of the game. Boasting 20 Triple Crown series titles (the most in snooker history), his overall silv

In [51]:
data.iloc[-12].text

"Man City fan reveals what Jurgen Klopp does better than anyone else at Liverpool - Liverpool Echo Man City fan reveals what Jurgen Klopp does better than anyone else at Liverpool - Liverpool Echo Man City fan reveals what Jurgen Klopp does better than anyone else at Liverpool - Liverpool Echo Man City fan reveals what Jurgen Klopp does better than anyone else at Liverpool - Liverpool Echo James Erskine, director of the new Liverpool documentary 'The End Of The Storm', is a Man City fan and has opened up on what makes the Reds so special James Erskine, director of the new Liverpool documentary 'The End Of The Storm', is a Man City fan and has opened up on what makes the Reds so special  When you subscribe we will use the information you provide to send you these newsletters. Sometimes they’ll include recommendations for other related newsletters or services we offer. OurPrivacy Noticeexplains more about how we use your data, and your rights. You can unsubscribe at any time. On July 22,