Jupyter notebook for scraping Serious Eats articles and using machine learning to determine the author\
Rob Galloway\
2-1-20

Code flow:
- Use BeautifulSoup to scrape articles from seriouseats.com
- Convert articles to word frequency lists (bag of words)
- Train and test machine learning classifier to identify author using sklearn

Start by importing our packages:

In [3]:
import requests # For fetching URLs
import numpy as np # For data analysis/manipulation
from bs4 import BeautifulSoup # Web scraper
from sklearn.feature_extraction.text import CountVectorizer # Converts strings to word frequency arrays
from sklearn import neighbors # KNN machine learning package
from sklearn.svm import SVC # SVC ML package
from sklearn.tree import DecisionTreeClassifier # Decision Tree ML package
from sklearn.ensemble import RandomForestClassifier # Random Forest ML package



### Web Scraping
- Go to https://www.seriouseats.com/how-tos/cooking-techniques and use BeautifulSoup to grab the article links (iterating through every page)
- Follow each link and use BeautifulSoup to scrape the article text
- Store each article as a single string; the `corpus` variable is a list of article strings

NOTE: This cell takes a long time to run because it scrapes and stores more than 200 articles.


In [4]:
# Initialize variables
corpus = []
authors = []

# This loop iterates through each page
for page in range(1,13):
    
    # Build the URL for this page, fetch it, and convert the HTML to soup
    result = requests.get('https://www.seriouseats.com/how-tos/cooking-techniques?page='
                          + str(page) + '&sort=latest#recipes')
    src = result.content
    soup = BeautifulSoup(src, "lxml") # This is the searchable soup
    
    # Isolate section of page that contains the article links
    block = soup.find('div', class_='block__wrapper')
    
    
    # Loop through every post on the page
    # scrape the article link and follow it to the article
    # scrape the text in each article and store to corpus
    for div_tag in block.find_all('div', class_='metadata'):
        
        link = div_tag.find('a').attrs['href'] # this line scrapes and saves the link        
        
        # scrape text and author from link and store
        newresult = requests.get(link)
        newsrc = newresult.content
        newsoup = BeautifulSoup(newsrc, "lxml") # This is the searchable soup
        
        # scrape the author's name and store
        author = newsoup.find('a', class_='name').get_text()
        authors.append(author) # This is the list of authors
        
        # Isolate main body from soup
        newbody = newsoup.find('div', class_='entry-body')
        
        
        # Combine the text of every paragraph in the body into a single string
        text = ''.join(p_tag.get_text() for p_tag in newbody.find_all('p'))
        
        # Add article text to corpus list
        corpus.append(text) # This is where all of the text is stored
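
The loop above assumes that every page contains the expected tags. If `find` returns `None`, the next method call raises an `AttributeError` and the whole scrape stops. Below is a minimal sketch of a more defensive parsing step. The helper name `extract_article` and the sample HTML are hypothetical; the class names `name` and `entry-body` are copied from the cell above and may not match the live site. The sketch uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup


def extract_article(html):
    """Return (author, text) for one article page, or None if a tag is missing.

    Hypothetical helper: the class names mirror the scraping cell above.
    """
    soup = BeautifulSoup(html, "html.parser")
    author_tag = soup.find("a", class_="name")
    body = soup.find("div", class_="entry-body")
    if author_tag is None or body is None:
        return None  # page layout differs from what the scraper expects
    # Join paragraphs with a space so words at paragraph boundaries do not merge
    text = " ".join(p.get_text() for p in body.find_all("p"))
    return author_tag.get_text().strip(), text


# Made-up HTML that imitates the structure the scraper looks for
sample = """
<html><body>
  <a class="name">Example Author</a>
  <div class="entry-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""
print(extract_article(sample))
print(extract_article("<html><body><p>No article here</p></body></html>"))
```

In the scraping loop, a `None` result could be skipped with `continue`. Passing `timeout=10` to `requests.get` and calling `raise_for_status()` on the response would also surface failed requests instead of parsing an error page.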
        

        


### Data Formatting

- Convert raw strings to word frequency arrays (bag of words)
- scikit-learn has a built-in CountVectorizer class (easy!)
- Authors are left as strings for easy analysis later
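
To make the bag-of-words idea concrete, here is a small pure-Python sketch of roughly what CountVectorizer does with its default settings: lowercase the text, keep tokens of two or more word characters, sort the vocabulary alphabetically, and count occurrences per document. The toy sentences and variable names are made up for illustration, and this is not how scikit-learn implements it internally.

```python
import re
from collections import Counter

# Toy corpus (made up for illustration)
docs = ["Salt the pasta water", "Toast the spices and salt the steak"]

# Lowercase and keep tokens of two or more word characters
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# The vocabulary is every distinct token, sorted alphabetically
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary word
counts = []
for tokens in tokenized:
    tally = Counter(tokens)
    counts.append([tally[word] for word in vocabulary])

print(vocabulary)
for row in counts:
    print(row)
```

Each row records only how often each word appears; word order is discarded, which is where the name "bag of words" comes from.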

In [5]:
vectorizer = CountVectorizer() # Create the vectorizer with default settings
X = vectorizer.fit_transform(corpus) # Convert corpus to a sparse matrix of word counts
freq = X.toarray() # Store as a dense array


### Machine Learning

ML classifiers come in many different flavors:
- K Nearest Neighbors
- Support Vector Classifiers
- Decision Tree/Random Forest

<img src="classifiers.png">

#### K Nearest Neighbors

In [6]:
# Create a K Nearest Neighbors classifier and fit it to the data
clf = neighbors.KNeighborsClassifier(3, weights='distance') # 3 neighbors, closer neighbors weigh more
clf.fit(freq[50:,:], authors[50:]) # Train on every article after the first 50

# Test on the first 50 articles; score returns the fraction classified correctly
clf.score(freq[:50], authors[:50])


0.28000000000000003

A score of 0.28 means that 14 of the 50 held-out articles were attributed to the correct author. To get a feel for what that looks like, we can have the classifier output the predicted authors of some articles and compare them to the actual authors.

In [7]:
print(clf.predict(freq[6:10])) # Predict the authors of articles 6 through 9
print(authors[6:10]) # Print the actual authors of articles 6 through 9

['J. Kenji López-Alt' 'J. Kenji López-Alt' 'Niki Achitoff-Gray'
 'Stella Parks']
['Stella Parks', 'Sasha Marx', 'The Serious Eats Team', 'Stella Parks']
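
`score` reports mean accuracy: the fraction of articles whose predicted author matches the actual one. As a sketch, the same quantity can be computed by hand for the four articles printed above (the variable names are illustrative).

```python
# Predicted and actual authors for articles 6 through 9, copied from the output above
predicted = ['J. Kenji López-Alt', 'J. Kenji López-Alt', 'Niki Achitoff-Gray', 'Stella Parks']
actual = ['Stella Parks', 'Sasha Marx', 'The Serious Eats Team', 'Stella Parks']

# Accuracy is the number of exact matches divided by the number of articles
matches = sum(p == a for p, a in zip(predicted, actual))
accuracy = matches / len(actual)
print(matches, accuracy)  # 1 0.25
```

Only the last of these four articles is attributed correctly.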


#### Support Vector Classifier

In [8]:
clf2 = SVC(kernel='linear',gamma='auto') # gamma has no effect with a linear kernel
clf2.fit(freq[50:,:], authors[50:]) # Train on every article after the first 50

# Test on the first 50 articles; score returns the fraction classified correctly
clf2.score(freq[:50], authors[:50])


0.47999999999999998

In [14]:
print(clf2.predict(freq[6:10])) # Predict the authors of articles 6 through 9
print(authors[6:10]) # Print the actual authors of articles 6 through 9

['Stella Parks' 'Daniel Gritzer' 'Daniel Gritzer' 'Stella Parks']
['Stella Parks', 'Sasha Marx', 'The Serious Eats Team', 'Stella Parks']


#### Decision Tree Classifier

In [10]:
clf3 = DecisionTreeClassifier(random_state=1)
clf3.fit(freq[50:,:], authors[50:]) # Train on a portion of the data

# Test on the first 50 articles; score returns the fraction classified correctly
clf3.score(freq[:50], authors[:50])

0.41999999999999998

In [15]:
print(clf3.predict(freq[6:10])) # Predict the authors of articles 6 through 9
print(authors[6:10]) # Print the actual authors of articles 6 through 9

['Stella Parks' 'J. Kenji López-Alt' 'J. Kenji López-Alt' 'Stella Parks']
['Stella Parks', 'Sasha Marx', 'The Serious Eats Team', 'Stella Parks']


#### Random Forest

In [12]:
clf4 = RandomForestClassifier(random_state=1,n_estimators=10)
clf4.fit(freq[50:,:], authors[50:]) # Train on a portion of the data

# Test on the first 50 articles; score returns the fraction classified correctly
clf4.score(freq[:50], authors[:50])

0.35999999999999999

In [16]:
print(clf4.predict(freq[6:10])) # Predict the authors of articles 6 through 9
print(authors[6:10]) # Print the actual authors of articles 6 through 9

['Stella Parks' 'J. Kenji López-Alt' 'J. Kenji López-Alt' 'Stella Parks']
['Stella Parks', 'Sasha Marx', 'The Serious Eats Team', 'Stella Parks']
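
All four classifiers above are scored on the same fixed slice: the first 50 scraped articles are held out and the rest are used for training. Because the listing was scraped with `sort=latest`, that test set holds only the most recent articles, which may not reflect the overall mix of authors. One alternative is a shuffled, stratified split, sketched below with scikit-learn's `train_test_split`. The toy corpus, the author labels, and the `toy_*` names are made up so that the example runs without scraping; they are stand-ins for `corpus` and `authors`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Made-up stand-ins for the scraped corpus and authors lists
toy_corpus = ["whisk the butter and sugar"] * 10 + ["sear the steak in a hot pan"] * 10
toy_authors = ["Author A"] * 10 + ["Author B"] * 10

toy_freq = CountVectorizer().fit_transform(toy_corpus)

# Shuffle before splitting and keep each author's share similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    toy_freq, toy_authors, test_size=0.25, stratify=toy_authors, random_state=1
)

# Same four model types as above, compared on one shared split
classifiers = {
    "KNN": KNeighborsClassifier(3, weights="distance"),
    "SVC": SVC(kernel="linear"),
    "Decision tree": DecisionTreeClassifier(random_state=1),
    "Random forest": RandomForestClassifier(random_state=1, n_estimators=10),
}

scores = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(name, scores[name])
```

The toy data is trivially separable, so the scores it produces say nothing about the real task; the point is only the shape of the split and the comparison loop.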


### Troubleshooting

- Not enough data
- Vocabulary too large
- Model not behaving as intended

https://youtu.be/cjvS2bB0UVg?t=79
(2:55)
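
For the "vocabulary too large" item, one option worth trying is to prune the vocabulary when building the word counts. The sketch below uses two real CountVectorizer parameters, `stop_words` and `min_df`, on a made-up toy corpus; the specific values are assumptions for illustration, not tuned settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up stand-in for the scraped corpus
toy_corpus = [
    "the steak is seared in the pan",
    "the cake is baked in the oven",
    "the steak is rested and the pan is deglazed",
]

# Default settings keep every token of two or more characters
default_vec = CountVectorizer()
# Drop common English words and words that appear in fewer than 2 documents
pruned_vec = CountVectorizer(stop_words="english", min_df=2)

default_vec.fit(toy_corpus)
pruned_vec.fit(toy_corpus)

print(len(default_vec.vocabulary_))   # 12
print(sorted(pruned_vec.vocabulary_)) # ['pan', 'steak']
```

Function words such as "the" and "and" can carry an author's stylistic signal, so whether removing them helps with author identification is something to test rather than assume.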

In [17]:
print(clf2.predict(freq[:20]))

['J. Kenji López-Alt' 'J. Kenji López-Alt' 'J. Kenji López-Alt'
 'J. Kenji López-Alt' 'J. Kenji López-Alt' 'J. Kenji López-Alt'
 'Stella Parks' 'Daniel Gritzer' 'Daniel Gritzer' 'Stella Parks'
 'J. Kenji López-Alt' 'J. Kenji López-Alt' 'J. Kenji López-Alt'
 'J. Kenji López-Alt' 'J. Kenji López-Alt' 'J. Kenji López-Alt'
 'J. Kenji López-Alt' 'J. Kenji López-Alt' 'Stella Parks' 'Daniel Gritzer']
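
A quick way to see how lopsided these predictions are is to tally them. The sketch below copies the 20 predictions from the output above into a list (the variable names are illustrative) and counts them with `collections.Counter`.

```python
from collections import Counter

# Predictions for the first 20 articles, copied from the output above
predictions = (
    ['J. Kenji López-Alt'] * 6
    + ['Stella Parks', 'Daniel Gritzer', 'Daniel Gritzer', 'Stella Parks']
    + ['J. Kenji López-Alt'] * 8
    + ['Stella Parks', 'Daniel Gritzer']
)

# Count how often each author is predicted and show each author's share
tally = Counter(predictions)
for author, count in tally.most_common():
    print(author, count, count / len(predictions))
```

Fourteen of the 20 predictions name a single author. Running the same tally on `authors[50:]` would show whether the training labels are similarly skewed, in which case always guessing the most common author would be a useful baseline to compare each score against.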
