Author: Zach Hanly
This project takes article information from the New York Times and uses NLP, Random Forests and Logistic Regression to predict the daily Top 20 most shared articles on Facebook.
In order to show how social media popularity for text media can be predicted, metadata on articles from the New York Time's API website was gathered.
Two APIs were used. First, the NYT Archive API, which returns metadata on every article for a given range of months and years. Second, the NYT Most Popular API, which returns metadata on the most popular articles on NYTimes.com based on views on the NYT site, emails, or Facebook shares. For this project Facebook shares was chosen and calls were made to the API once per day to gather that day's top 20 list. Articles in the archive that were listed on a top 20 list were then labeled as a popular article for a modeling target.
Two identical looking pipelines were used. One using a TfidfVectorizer which converted text features to floating point values and the other using a CountVectorizer that converted text to binary values. Each feature then went into its own model, where the TfidfVectorizer pipeline used a Random Forest to turn article features into a probability that the article would be on the top 20 most shared on Facebook list, while the CountVecotized pipeline used a Random Forrest to turn them into a binary class label. Both pipelines end at a Logistic Regression model that outputs the probability that the article with all its converted features will be a top 20 article. This final model is where the word count feature is joined in from its Logistic Regression model, which turned the numeric values into either a probability or a class label for the respective pipelines.
After training and testing the pipelines with a train-test split, the pipelines attempted to predict several day's Top 20 list.
The pipeline using the TfidfVectorizer modeled to probabilities outperformed the pipeline using the CountVectorizer modeled to class labels.
Please review the full analysis in my Jupyter Notebook or my presentation.
├── data <- Sourced externally from APIs and generated from modeling
├── images <- Sourced externally and generated from code
├── notebooks <- building scripts for API calls and functions.py
├── .gitignore <- files github should ignore
├── README.md <- The top-level README for reviewers of this project
├── functions.py <- created functions for project
├── main_notebook.ipynb <- Narrative documentation of modeling
├── presentation.pdf <- PDF version of project
├── request_archive.py <- collects arichive of articles
├── request_most_shared.py <- collects daily top 20 most shared article son facebook
```# nyt_predicting_sharable_article