Our next step is to use tf-idf to vectorize the full text of the articles in our pickled news file, each of which has an author. Once the text is vectorized, we use nearest neighbors to find the news stories closest to an input text and return the names of the authors who wrote them.

A practical application: if you are running a Kickstarter campaign and want to reach out to journos to spread the word, it would be nice to be able to shortlist the relevant journos who have written stories similar to your campaign pitch.

We use sklearn to start with; by Tuesday/Wednesday we will make it interesting and switch to approximate nearest neighbor approaches.
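
As a rough sketch of the plan above, the snippet below fits a tf-idf vectorizer on a few made-up articles, indexes them with scikit-learn's `NearestNeighbors` using cosine distance, and returns the authors of the stories closest to a pitch. The toy corpus, the author names, and the `shortlist_authors` helper are all hypothetical and only stand in for the real pickled data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical toy corpus standing in for the pickled news data
articles = [
    "new smartwatch tracks sleep and heart rate with long battery life",
    "crowdfunded smartwatch startup ships wearable fitness tracker",
    "local football club wins the league after dramatic final match",
    "government announces budget with tax cuts for small businesses",
]
authors = ["alice example", "bob example", "carol example", "dan example"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)

# Brute-force cosine search works directly on the sparse tf-idf matrix
index = NearestNeighbors(n_neighbors=2, metric="cosine", algorithm="brute")
index.fit(X)

def shortlist_authors(pitch, k=2):
    """Return the authors of the k stored articles closest to the pitch."""
    query = vectorizer.transform([pitch])
    _, neighbor_ids = index.kneighbors(query, n_neighbors=k)
    return [authors[i] for i in neighbor_ids[0]]

shortlist = shortlist_authors(
    "our kickstarter is a smartwatch fitness tracker for runners"
)
print(shortlist)
```

Unlike a classifier, this returns one author per neighboring article, so the same name can appear more than once when an author wrote several similar stories.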

In [1]:
import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.model_selection import train_test_split
import string
import re
import ftfy

In [2]:
with open('../Data/news.pkl', 'rb') as f:
    authors = pickle.load(f)
    
with open('../Data/pickled_df.pkl', 'rb') as f:
    articles = pickle.load(f)
    
data = authors.merge(articles, on = 'title')[['author_x', 'article_count', 'site_name_x', 'topic', 
                                              'url', 'title', 'title_sentiment', 'description', 'full_text']]
data = data.rename(columns = {'author_x': 'author', 'site_name_x': 'site_name'})

In [3]:
data[data.article_count >= 100].author.value_counts()

Neil Shaw        523
James Rodger     512
Jack Davis       251
Adam Wells       223
Sophie McCoid    218
                ... 
Molly Pike       101
Chris Beesley    101
KATE FELDMAN     100
Carly Johnson    100
Adam Chapman     100
Name: author, Length: 139, dtype: int64

In [4]:
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

In [5]:
def clean_author(author):
    author = author.lower()
    author = re.sub(r'\|.*$', '', author) # remove everything after |
    author = re.sub(r'\,.*$', '', author) # remove everything after ,
    punctuations = r'''!()[]{};:'"\,<>./?#$%^&*_~''' # raw string so the backslash is literal
    for x in author.lower(): 
        if x in punctuations: author = author.replace(x, "") # remove punctuation except hyphens
    author = re.sub(r'(\d+)', '', author) # remove numbers
    author = re.sub(r'(--)', '', author) # remove double hyphens
    author = re.sub(r'(\w+@\w+)', '', author) # remove emails
    author = ' '.join([i.strip() for i in author.split()]) # collapse repeated whitespace
    author = re.sub(r'(\s+)(and)(\s+)(\w*)(\s*)(\w*)', '', author) # remove "and" plus up to two words after it
    keywords = ['sa', 'tulsa', 'editor', 'writer', 'world-herald', 'news', 'world', 'richmond', 'times-dispatch', 'new',
                'hampshire', 'union', 'leader', 'for', 'the', 'state', 'journal', 'correspondant', 'sfgate', 'special',
                'from', 'the', 'gazette', 'times', 'staff', 'senior', 'dr', 'correspondent', 'by', 'editorial board',
                'research', 'wire reports', 'security', 'real', 'estate', 'to', 'post', 'and', 'courier', 'policy',
                'commercial', 'bureau', 'political', 'roanoke', 'college', 'football', 'editorial','democrat-gazette',
                'arizona', 'daily', 'star', '--hamburg', 'column', 'lincoln', 'managing', 'backstage', 'with', 'sports',
                'ii', 'iii', 'capitol', 'media', 'services']
    author = ' '.join([i for i in author.split() if i not in keywords]) # remove other keywords
    author = re.sub(r'(@)', '', author) # remove handles
    return author

In [6]:
data['clean_author'] = data['author'].apply(clean_author)

In [7]:
# Clean the text so that punctuation marks, stop words and digits are removed
def clean(doc):
    doc = ftfy.fix_text(doc)
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+","",normalized)
    y = processed.split()
    return ' '.join(y)

In [8]:
data

Unnamed: 0,author,article_count,site_name,topic,url,title,title_sentiment,description,full_text,clean_author
0,Aja Styles,14.0,Brisbane Times,entertainment,https://www.brisbanetimes.com.au/national/west...,'Pack Lego': Perth family caught in hard borde...,-9.09,Perth mother Clare has found herself mostly co...,Perth mother Clare* has found herself mostly ...,aja styles
1,Jake Johnson,33.0,Truthout,politics,https://truthout.org/articles/congress-passes-...,Congress Passes COVID Relief With Billions in ...,18.18,The bill’s gifts to the wealthy underscore t...,In late-night votes just hours after nearly 5...,jake johnson
2,Christine Favocci,19.0,The Western Journal,tech,https://www.westernjournal.com/pa-man-facing-c...,PA Man Facing Charges of Unlawful Voting After...,-38.46,It is naive to think that either party is free...,The left has insisted that voter fraud is jus...,christine favocci
3,William Rivers Pitt,14.0,Truthout,politics,https://truthout.org/articles/what-will-trump-...,What Will Trump Attempt in His Last Days? We M...,0.00,What Trump may do in his waning days is only u...,"The endgame being played out by Donald Trump,...",william rivers pitt
4,Amy Goodman,19.0,Truthout,business,https://truthout.org/video/the-insufficient-co...,The Insufficient COVID Stimulus Must Not Be Fo...,-20.00,Critics say the $900 billion relief package do...,As Congress passes a $900 billion coronavirus...,amy goodman
...,...,...,...,...,...,...,...,...,...,...
119957,Olivia Tobin,56.0,Liverpool Echo,tech,https://www.liverpoolecho.co.uk/news/liverpool...,"Boy, five, battling rare brain cancer will be ...",4.55,\n Five-year-old Aaron Wharton had surgery at ...,When you subscribe we will use the informatio...,olivia tobin
119958,Victoria Jones,92.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/how...,How to teach saving and spending to kids as yo...,0.00,Now might be a perfect time to involve childre...,When you subscribe we will use the informatio...,victoria jones
119959,Victoria Jones,92.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/spa...,Space experiment could unlock resources for mi...,0.00,Experimenting on the ISS allows scientists to ...,When you subscribe we will use the informatio...,victoria jones
119960,Nisha Mal,66.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/wom...,Woman's home is in Tier 2 while her garden fal...,-10.00,"'It's all one big conundrum,' says Sheila Herbert",Woman's home is in Tier 2 while her garden fa...,nisha mal


In [9]:
# trial = data.head(10000)
# .copy() makes trial an independent DataFrame, so adding a column
# does not raise pandas' SettingWithCopyWarning
trial = data[data.article_count >= 100].copy()
trial['clean_text'] = trial['full_text'].apply(clean)

In [10]:
trial

Unnamed: 0,author,article_count,site_name,topic,url,title,title_sentiment,description,full_text,clean_author,clean_text
6,David Matthews,113.0,nydailynews.com,sport,https://www.nydailynews.com/coronavirus/ny-cov...,U.S. teen who broke Cayman quarantine gets red...,-21.43,The Cayman Islands has reduced the prison sent...,The Cayman Islands has reduced the prison sen...,david matthews,cayman island reduced prison sentence skylar m...
43,Connie Rusk,117.0,Mail Online,entertainment,https://www.dailymail.co.uk/tvshowbiz/article-...,Michelle Keegan talks about her self doubt in ...,-6.67,"The actress, 33, reveals in Thursday's Jonatha...",'I gave everything to Our Girl': Michelle Kee...,connie rusk,i gave everything girl michelle keegan talk se...
48,Sarah Abraham,129.0,Mail Online,entertainment,https://www.dailymail.co.uk/tvshowbiz/article-...,Jonathan Rhys Meyers and John Malkovich star i...,-11.11,"Jonathan Rhys Meyers, 43, will star alongside ...",Jonathan Rhys Meyers and John Malkovich have ...,sarah abraham,jonathan rhys meyers john malkovich filmed pan...
53,Laura Fox,133.0,Mail Online,entertainment,https://www.dailymail.co.uk/tvshowbiz/article-...,Rita Ora 'faces being STRANDED in Bulgaria for...,-4.76,"The singer, 30, is reportedly concerned she wo...",Rita Ora 'could be STRANDED in Bulgaria for C...,laura fox,rita os could stranded bulgaria christmas uk f...
56,Roxy Simons,148.0,Mail Online,entertainment,https://www.dailymail.co.uk/tvshowbiz/article-...,The Great British Bake Off's Laura Adlington r...,18.75,"The Great British Bake Off star, 31, took to I...",The Great British Bake Off's Laura Adlington ...,roxy simons,great british bake offs laura adlington reveal...
...,...,...,...,...,...,...,...,...,...,...,...
119909,Unzela Khan,115.0,Dailystar.co.uk,tech,https://www.dailystar.co.uk/news/latest-news/d...,Dog's jaw 'glued together' as 'cement-like' pa...,14.29,\n Roxi the Scottish terrier was rushed to the...,When you subscribe we will use the informatio...,unzela khan,subscribe use information provide send newslet...
119919,Simon Duke,214.0,WalesOnline,entertainment,https://www.walesonline.co.uk/whats-on/whats-o...,Viewers blast I'm A Celeb's 'best bits' episod...,6.25,"\n ""What a total let down!""\n","Want the best food, film, music, arts and cul...",simon duke,want best food film music art culture news sen...
119930,Lottie Gibbons,106.0,Liverpool Echo,tech,https://www.liverpoolecho.co.uk/news/uk-world-...,MoneySavingExpert Martin Lewis makes DWP winte...,-25.00,\n You need to check your details urgently\n,When you subscribe we will use the informatio...,lottie gibbons,subscribe use information provide send newslet...
119944,Brett Gibbons,176.0,WalesOnline,tech,https://www.walesonline.co.uk/news/uk-news/how...,This is how the Pfizer vaccine works with roll...,0.00,\n The UK can begin administering the vaccine ...,When you subscribe we will use the informatio...,brett gibbons,subscribe use information provide send newslet...


In [11]:
tfidf = TfidfVectorizer(stop_words='english')

In [12]:
X = tfidf.fit_transform(trial.clean_text)
Y = trial.clean_author

In [13]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, random_state=1)
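
Note that the split above happens after `tfidf.fit_transform`, so the vocabulary and idf weights have already seen the test articles. A possible alternative, sketched here on made-up data, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn `Pipeline`, so that tf-idf is fit on the training rows only. The texts, labels, and `n_neighbors=1` are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four articles for each of two made-up authors
texts = [
    "striker scores twice as club wins derby",
    "manager praises defence after cup win",
    "club signs new striker before deadline",
    "goalkeeper saves penalty in cup final",
    "actor joins cast of new drama series",
    "singer announces tour and new album",
    "drama series finale draws record viewers",
    "singer wins award for debut album",
]
labels = ["sports writer"] * 4 + ["showbiz writer"] * 4

# Split the raw text first, keeping both authors in each side of the split
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1, stratify=labels
)

# The pipeline fits tf-idf on the training text only, then fits k-NN on it
model = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("knn", KNeighborsClassifier(n_neighbors=1)),
])
model.fit(text_train, y_train)

predictions = model.predict(text_test)
train_vocab = set(model.named_steps["tfidf"].vocabulary_)
print(predictions, len(train_vocab))
```

With this layout, `model.predict` also accepts raw pitch text directly, because the pipeline applies the fitted vectorizer before the nearest-neighbor lookup.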

In [14]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')

In [15]:
knn.predict(X_test)

array(['brett gibbons', 'jenna ciccotelli', 'paul tassi', ...,
       'jessica sansome', 'charlie malam', 'simon duke'], dtype=object)

In [18]:
X_test.todense()

matrix([[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]])

In [16]:
knn.score(X_test, Y_test)

0.317937701396348
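
`knn.score` reports top-1 accuracy: the single predicted author has to match exactly. Since the goal is a shortlist of journos, it may be more informative to check how often the true author appears among the authors of the k nearest training articles. The sketch below shows one way to compute that; the `shortlist_hit_rate` helper and the toy vectors are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def shortlist_hit_rate(knn, X_test, y_test, train_labels, k=3):
    """Fraction of test rows whose true label is among the labels
    of their k nearest training rows."""
    _, neighbor_ids = knn.kneighbors(X_test, n_neighbors=k)
    train_labels = np.asarray(train_labels)
    hits = [
        true in train_labels[ids]
        for true, ids in zip(y_test, neighbor_ids)
    ]
    return float(np.mean(hits))

# Hypothetical 2-D points standing in for tf-idf vectors
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [5.0, 5.0]])
y_train = ["ann", "bob", "cat", "ann", "dee"]
X_test = np.array([[0.05, 0.0], [1.05, 1.0], [5.0, 5.1]])
y_test = ["bob", "cat", "bob"]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
hit_rate = shortlist_hit_rate(knn, X_test, y_test, y_train, k=2)
print(hit_rate)
```

On the notebook's objects, a call along the lines of `shortlist_hit_rate(knn, X_test, Y_test, Y_train, k=5)` should work the same way, assuming `Y_train` is in the same row order as the `X_train` used to fit `knn`.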

In [20]:
data.iloc[0].full_text

' Perth mother Clare* has found herself mostly confined to a small Sydney city apartment with her two young daughters as the northern beaches outbreak takes its toll. Clare flew with Cora, 6, and Georgia, 2, to Sydney on December 11 so they could reunite as a family over Christmas with husband Charles, who has been working remotely for the past year. Perth mother Clare and daughter Cora are in Sydney during the Christmas hard border closure. “It’s pretty tough because we’re in a really small apartment which is not set up for small kids,” Clare said. “We worked on the theory that it’d be fine because we’d be out and about during the day doing tourist things in Sydney. But obviously we’ve tried to stay home as much as we can, apart from a grocery shop.” Advertisement It’s also been raining “a lot”, which has put a damper on escaping to the big parks near their suburb of Haymarket that sits adjacent to tourism hotspot Darling Harbour.'