# 1. Install and Import Baseline Dependencies

In [None]:
!pip install transformers

In [2]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from transformers import pipeline

from bs4 import BeautifulSoup
import requests

# 2. Set Up Summarization Model

In [3]:
model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)


# 4. Building a News and Sentiment Pipeline

In [9]:
# search options #
monitored_tickers = []
n = input("Please enter a Data Science topic of interest: ")
monitored_tickers.append(n)

Please enter a Data Science topic of interest: fastai


In [10]:
monitored_tickers

['fastai']

## 4.1. Search for Data Science News on Medium

In [11]:
def search_meduim_urls(monitored_tickers):
    search_url = "https://medium.com/tag/{}".format(monitored_tickers)
    r = requests.get(search_url)
    soup = BeautifulSoup(r.text, 'html.parser')
    # article links are <a> tags whose "aria-label" attribute is "Post Preview Title" #
    atags = soup.find_all('a', attrs={"aria-label": "Post Preview Title"})
    hrefs = ['https://medium.com'+link['href'] for link in atags]
    return hrefs 
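The list comprehension above prepends `https://medium.com` to every `href`, which only works if every link on the tag page is site-relative. A minimal sketch of a more defensive join using the standard library's `urljoin`; the helper name `absolutize` and the sample hrefs are hypothetical, not part of the notebook:

```python
from urllib.parse import urljoin

BASE_URL = "https://medium.com"

def absolutize(hrefs, base=BASE_URL):
    """Resolve each href against the base; already-absolute hrefs pass through unchanged."""
    return [urljoin(base, href) for href in hrefs]

# Hypothetical hrefs, shaped like the ones the scraper collects.
sample = ["/@someone/a-post-123", "https://example.com/another-post"]
print(absolutize(sample))
```

With plain string concatenation the second sample would become `https://medium.comhttps://example.com/another-post`, so `urljoin` is the safer choice if absolute links ever appear.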

In [12]:
# build a dictionary {topic: [URLs of articles about that topic]} #
raw_urls = {framework:search_meduim_urls(framework) for framework in monitored_tickers}
raw_urls

{'fastai': ['https://medium.com/@msubhaditya/improving-classification-by-visualizing-activations-2afbc73b1f8e?source=topics_v2---------0-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/mlearning-ai/deep-learning-for-bear-image-classification-using-pytorch-fastai-duckduckgo-api-89bb7452c730?source=topics_v2---------1-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/@santiago-paz/create-a-simple-simpson-character-classifier-with-fast-ai-and-google-collab-part-i-6dfec07e0075?source=topics_v2---------2-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/@juancruzalriccortabarria/my-journey-using-fastai-part-ii-95436f8d0130?source=topics_v2---------3-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/latinxinai/my-journey-using-fastai-part-i-13dfec6c4fb1?source=topics_v2---------4-84--------------------74cbbc5d_706c_4147_98be

## 4.2. Strip out unwanted URLs

In [None]:
# not necessary here #

In [13]:
cleaned_urls = raw_urls
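The notebook keeps the URLs unchanged. If cleaning were wanted, one option would be to drop the `?source=...` tracking query that every scraped link carries. The sketch below assumes the articles still resolve without that parameter, which has not been checked here; `strip_query` and `clean_urls` are hypothetical helper names and the sample URLs are made up:

```python
from urllib.parse import urlsplit, urlunsplit

def strip_query(url):
    """Return the URL without its query string or fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def clean_urls(urls_by_topic):
    """Strip queries from every URL and drop duplicates, keeping the original order."""
    cleaned = {}
    for topic, urls in urls_by_topic.items():
        # dict.fromkeys preserves insertion order while removing repeats
        cleaned[topic] = list(dict.fromkeys(strip_query(u) for u in urls))
    return cleaned

# Hypothetical input in the same shape as raw_urls.
example = {
    "fastai": [
        "https://medium.com/@someone/a-post-123?source=topics_v2",
        "https://medium.com/@someone/a-post-123?source=other",
    ]
}
print(clean_urls(example))
```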

In [14]:
cleaned_urls

{'fastai': ['https://medium.com/@msubhaditya/improving-classification-by-visualizing-activations-2afbc73b1f8e?source=topics_v2---------0-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/mlearning-ai/deep-learning-for-bear-image-classification-using-pytorch-fastai-duckduckgo-api-89bb7452c730?source=topics_v2---------1-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/@santiago-paz/create-a-simple-simpson-character-classifier-with-fast-ai-and-google-collab-part-i-6dfec07e0075?source=topics_v2---------2-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/@juancruzalriccortabarria/my-journey-using-fastai-part-ii-95436f8d0130?source=topics_v2---------3-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/latinxinai/my-journey-using-fastai-part-i-13dfec6c4fb1?source=topics_v2---------4-84--------------------74cbbc5d_706c_4147_98be

## 4.3. Search and Scrape Cleaned URLs

In [15]:
def scrape_and_process(URLs):
    ARTICLES = []
    for url in URLs:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # collect the text of every <p> tag; this also picks up page chrome such as footer links
        paragraphs = soup.find_all('p')
        text = [paragraph.text for paragraph in paragraphs]
        # keep only the first 350 space-separated words of each article
        words = ' '.join(text).split(' ')[:350]
        ARTICLE = ' '.join(words)
        ARTICLES.append(ARTICLE)
    return ARTICLES
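The truncation step in `scrape_and_process` splits on a single space, so repeated spaces produce empty strings that count toward the 350-word limit. A minimal sketch of the same step using `split()` with no argument, which collapses runs of whitespace; `first_words` is a hypothetical helper and the sample paragraphs are made up:

```python
def first_words(paragraphs, limit=350):
    """Join paragraph strings and keep at most `limit` whitespace-separated words."""
    # split() with no argument treats any run of spaces or newlines as one separator
    words = " ".join(paragraphs).split()
    return " ".join(words[:limit])

sample = ["Fast.ai  makes deep learning", "more\naccessible."]
print(first_words(sample, limit=4))  # Fast.ai makes deep learning
```

Swapping this into the notebook would change the scraped text slightly, so the outputs shown here would no longer match exactly.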

In [16]:
articles = {ticker:scrape_and_process(cleaned_urls[ticker]) for ticker in monitored_tickers}
articles

{'fastai': ['Aug 19 Note: This is a slightly advanced article. If you are not comfortable with training neural networks, this is probably not for you yet. Start here instead. · Intro ∘ The Objective and Data ∘ Code ∘ Training ∘ Hooks ∘ Plotting Activations ∘ What’s next? ∘ Fin So you want to train a Neural Network to classify images. Woah. That’s awesome! How well did it do? Did you get a good score? Oh? You want to do better? I hear you. What if you could see what the network sees to make the choice? That would help understand how to make it perform better right? Read on! A few years ago, a paper titled “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” by Selvaraju et al. talked about how we could visually see the activation maps of a trained CNN by looking at the gradients in the final layer. This post will show you how to use that for your own needs. Note: We will be using PyTorch and the fast.ai library. But the concepts stay the same, so you should

In [18]:
articles["fastai"][2]

'Aug 5 When you’re starting in Machine Learning (like myself!), one of the most basic tools you can create is a simple classifier. In this case, we’re gonna create a classification model that, given an image, will tell you which character of The Simpsons is, and a metric on how sure the model is that it’s the correct answer. As we know, models are useless without data. So our first step will be to get the correct images and labels. For this step, I’m not going to use any hand-crafted dataset. Instead, I will download around 200 images I found on the Internet for each character I want to predict. We can use the powerful DuckDuckGo Search API to do this process. It’s free and you don’t need any token authorization at the moment I’m writing this article. (Thanks to Joe Dockrill and forums.fast.ai). Of course, first we need to import all the correct packages in a new notebook from Google Colab. To create a new notebook, simply enter to Google Collab site and click on New Notebook. Once we 

In [19]:
articles["fastai"][5]

'Jul 18 Reqs:1. Google Colab or Paperspace2. Fastai Environment Google Colab gives a free GPU so as paper space Goals: We are combining the First two lessons of Practical Deep Learning by Jeremy Howard 2. Import the libraries A contained environment to run scripts within localised area. It is more interactive we can see output for each cell along -- -- Researcher at UCL, UOV Love podcasts or audiobooks? Learn on the go with our new app. Khuyen Tran in Towards Data Science Ryan Sander in Towards Data Science Paul Xiong CSS MCQs Jose Antonio Ribeiro Neto (Zezinho) in XNEWDATA Philipp Dave W Shanahan Arkadiusz Modzelewski AboutHelpTermsPrivacy Researcher at UCL, UOV Britney in The 131217net Sarah Marciano Shivaraj karki Mohammad Faizan Help Status Writers Blog Careers Privacy Terms About Knowable'

## 4.4. Summarise all Articles

In [20]:
def summarize(articles):
    summaries = []
    for article in articles:
        # tokenize, truncating the input to at most 512 tokens
        input_ids = tokenizer.encode(article, return_tensors='pt', max_length=512, truncation=True)
        # beam search with 5 beams; each summary is capped at 56 tokens
        output = model.generate(input_ids, max_length=56, num_beams=5, early_stopping=True)
        summary = tokenizer.decode(output[0], skip_special_tokens=True)
        summaries.append(summary)
    return summaries

In [21]:
# takes about 3 minutes to execute #
summaries = {ticker:summarize(articles[ticker]) for ticker in monitored_tickers}
summaries

{'fastai': ['A few years ago, a paper titled “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” by Selvaraju et al. talked about how we could visually see the activation maps of a trained CNN.',
  "MLearning.ai lets you train a model and then use it to help with data cleaning. It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at",
  'In this article, we’re gonna create a classification model that, given an image, will tell you which character of The Simpsons is. As we know, models are useless without data. So our first step will be to get the correct images and labels.',
  'In this entry, we will try to get better results than what we got on the first model. We are going to try and predict whether a forest is on fire or not. To do this first we need images. We can get those using fastai “',
  'We need images to train our model. So we are going to download images of 

In [22]:
summaries["fastai"]

['A few years ago, a paper titled “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” by Selvaraju et al. talked about how we could visually see the activation maps of a trained CNN.',
 "MLearning.ai lets you train a model and then use it to help with data cleaning. It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at",
 'In this article, we’re gonna create a classification model that, given an image, will tell you which character of The Simpsons is. As we know, models are useless without data. So our first step will be to get the correct images and labels.',
 'In this entry, we will try to get better results than what we got on the first model. We are going to try and predict whether a forest is on fire or not. To do this first we need images. We can get those using fastai “',
 'We need images to train our model. So we are going to download images of a forest and a 

# 5. Adding Sentiment Analysis

In [23]:
# using pipeline #
sentiment = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

In [24]:
# test sentiment analysis with 'summaries["fastai"]' #
sentiment(summaries["fastai"])

[{'label': 'POSITIVE', 'score': 0.9038269519805908},
 {'label': 'NEGATIVE', 'score': 0.5976590514183044},
 {'label': 'NEGATIVE', 'score': 0.9995133876800537},
 {'label': 'NEGATIVE', 'score': 0.9991627931594849},
 {'label': 'NEGATIVE', 'score': 0.9957771301269531},
 {'label': 'POSITIVE', 'score': 0.9954389929771423},
 {'label': 'NEGATIVE', 'score': 0.7657209634780884},
 {'label': 'POSITIVE', 'score': 0.9127288460731506},
 {'label': 'POSITIVE', 'score': 0.9030302166938782},
 {'label': 'NEGATIVE', 'score': 0.9671006798744202}]

In [25]:
scores = {ticker:sentiment(summaries[ticker]) for ticker in monitored_tickers}
scores

{'fastai': [{'label': 'POSITIVE', 'score': 0.9038269519805908},
  {'label': 'NEGATIVE', 'score': 0.5976590514183044},
  {'label': 'NEGATIVE', 'score': 0.9995133876800537},
  {'label': 'NEGATIVE', 'score': 0.9991627931594849},
  {'label': 'NEGATIVE', 'score': 0.9957771301269531},
  {'label': 'POSITIVE', 'score': 0.9954389929771423},
  {'label': 'NEGATIVE', 'score': 0.7657209634780884},
  {'label': 'POSITIVE', 'score': 0.9127288460731506},
  {'label': 'POSITIVE', 'score': 0.9030302166938782},
  {'label': 'NEGATIVE', 'score': 0.9671006798744202}]}

In [26]:
print(summaries['fastai'][3], scores['fastai'][3]['label'], scores['fastai'][3]['score'])

In this entry, we will try to get better results than what we got on the first model. We are going to try and predict whether a forest is on fire or not. To do this first we need images. We can get those using fastai “ NEGATIVE 0.9991627931594849


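Sentiment analysis could be improved by fine-tuning on a data-science-specific dataset. The default model is fine-tuned on SST-2, a movie-review dataset, so a neutral technical summary like the one above can still receive a confident NEGATIVE label.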
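Short of fine-tuning, one cheap stopgap is to treat low-confidence predictions as neutral. The sketch below is an assumption-laden illustration, not part of the original pipeline: the `with_neutral` helper, the `NEUTRAL` label, and the 0.75 threshold are all hypothetical choices.

```python
def with_neutral(results, threshold=0.75):
    """Relabel predictions whose confidence is below `threshold` as NEUTRAL."""
    relabeled = []
    for result in results:
        label = result["label"] if result["score"] >= threshold else "NEUTRAL"
        relabeled.append({"label": label, "score": result["score"]})
    return relabeled

# First three entries of the sentiment output shown earlier.
sample = [
    {"label": "POSITIVE", "score": 0.9038269519805908},
    {"label": "NEGATIVE", "score": 0.5976590514183044},
    {"label": "NEGATIVE", "score": 0.9995133876800537},
]
print([r["label"] for r in with_neutral(sample)])  # ['POSITIVE', 'NEUTRAL', 'NEGATIVE']
```

This only catches uncertain predictions. The third sample stays NEGATIVE at 0.9995 confidence, so confidently wrong labels would still need a better-matched model.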

# 6. Exporting Results to CSV

In [27]:
summaries

{'fastai': ['A few years ago, a paper titled “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” by Selvaraju et al. talked about how we could visually see the activation maps of a trained CNN.',
  "MLearning.ai lets you train a model and then use it to help with data cleaning. It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at",
  'In this article, we’re gonna create a classification model that, given an image, will tell you which character of The Simpsons is. As we know, models are useless without data. So our first step will be to get the correct images and labels.',
  'In this entry, we will try to get better results than what we got on the first model. We are going to try and predict whether a forest is on fire or not. To do this first we need images. We can get those using fastai “',
  'We need images to train our model. So we are going to download images of 

In [28]:
scores

{'fastai': [{'label': 'POSITIVE', 'score': 0.9038269519805908},
  {'label': 'NEGATIVE', 'score': 0.5976590514183044},
  {'label': 'NEGATIVE', 'score': 0.9995133876800537},
  {'label': 'NEGATIVE', 'score': 0.9991627931594849},
  {'label': 'NEGATIVE', 'score': 0.9957771301269531},
  {'label': 'POSITIVE', 'score': 0.9954389929771423},
  {'label': 'NEGATIVE', 'score': 0.7657209634780884},
  {'label': 'POSITIVE', 'score': 0.9127288460731506},
  {'label': 'POSITIVE', 'score': 0.9030302166938782},
  {'label': 'NEGATIVE', 'score': 0.9671006798744202}]}

In [29]:
cleaned_urls

{'fastai': ['https://medium.com/@msubhaditya/improving-classification-by-visualizing-activations-2afbc73b1f8e?source=topics_v2---------0-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/mlearning-ai/deep-learning-for-bear-image-classification-using-pytorch-fastai-duckduckgo-api-89bb7452c730?source=topics_v2---------1-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/@santiago-paz/create-a-simple-simpson-character-classifier-with-fast-ai-and-google-collab-part-i-6dfec07e0075?source=topics_v2---------2-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/@juancruzalriccortabarria/my-journey-using-fastai-part-ii-95436f8d0130?source=topics_v2---------3-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17',
  'https://medium.com/latinxinai/my-journey-using-fastai-part-i-13dfec6c4fb1?source=topics_v2---------4-84--------------------74cbbc5d_706c_4147_98be

In [30]:
range(len(summaries['fastai']))

range(0, 10)

In [31]:
summaries['fastai'][3]

'In this entry, we will try to get better results than what we got on the first model. We are going to try and predict whether a forest is on fire or not. To do this first we need images. We can get those using fastai “'

In [32]:
def create_output_array(summaries, scores, urls):
    output = []
    for ticker in monitored_tickers:
        for counter in range(len(summaries[ticker])):
            output_this = [
                ticker,
                summaries[ticker][counter],
                scores[ticker][counter]['label'],
                scores[ticker][counter]['score'],
                urls[ticker][counter]
            ]
            output.append(output_this)
    return output
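The same rows can be built without an index counter by iterating over the three lists in parallel. A minimal sketch, assuming the lists for each topic are aligned by position; `build_rows` is a hypothetical name, and it takes the topic list as an argument instead of reading the global `monitored_tickers`:

```python
def build_rows(topics, summaries, scores, urls):
    """Return one [topic, summary, label, score, url] row per article."""
    rows = []
    for topic in topics:
        # zip pairs items by position and stops at the shortest list
        for summary, score, url in zip(summaries[topic], scores[topic], urls[topic]):
            rows.append([topic, summary, score["label"], score["score"], url])
    return rows

# Made-up data in the same shape as the notebook's dictionaries.
demo_rows = build_rows(
    ["fastai"],
    {"fastai": ["First summary.", "Second summary."]},
    {"fastai": [{"label": "POSITIVE", "score": 0.9}, {"label": "NEGATIVE", "score": 0.6}]},
    {"fastai": ["https://example.com/a", "https://example.com/b"]},
)
print(demo_rows)
```

One behavioural difference: if the lists have different lengths, `zip` silently drops the extra items, whereas the indexed version raises an `IndexError`.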

In [33]:
final_output = create_output_array(summaries, scores, cleaned_urls)
final_output

[['fastai',
  'A few years ago, a paper titled “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” by Selvaraju et al. talked about how we could visually see the activation maps of a trained CNN.',
  'POSITIVE',
  0.9038269519805908,
  'https://medium.com/@msubhaditya/improving-classification-by-visualizing-activations-2afbc73b1f8e?source=topics_v2---------0-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17'],
 ['fastai',
  "MLearning.ai lets you train a model and then use it to help with data cleaning. It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at",
  'NEGATIVE',
  0.5976590514183044,
  'https://medium.com/mlearning-ai/deep-learning-for-bear-image-classification-using-pytorch-fastai-duckduckgo-api-89bb7452c730?source=topics_v2---------1-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17'],
 ['fastai',
  'In this article, we’

In [34]:
# add the header row #
final_output.insert(0, ['Ticker', 'Summary', 'Label', 'Confidence', 'URL']) 

In [35]:
final_output

[['Ticker', 'Summary', 'Label', 'Confidence', 'URL'],
 ['fastai',
  'A few years ago, a paper titled “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization” by Selvaraju et al. talked about how we could visually see the activation maps of a trained CNN.',
  'POSITIVE',
  0.9038269519805908,
  'https://medium.com/@msubhaditya/improving-classification-by-visualizing-activations-2afbc73b1f8e?source=topics_v2---------0-84--------------------74cbbc5d_706c_4147_98be_eedae6a933b4-------17'],
 ['fastai',
  "MLearning.ai lets you train a model and then use it to help with data cleaning. It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at",
  'NEGATIVE',
  0.5976590514183044,
  'https://medium.com/mlearning-ai/deep-learning-for-bear-image-classification-using-pytorch-fastai-duckduckgo-api-89bb7452c730?source=topics_v2---------1-84--------------------74cbbc5d_706c_4147_98be_eedae6

**Export results**

In [None]:
import csv
with open('assetsummaries.csv', mode='w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerows(final_output)
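The summaries contain commas and quotation marks, so the export relies on the writer's quoting to stay parseable. A minimal round-trip sketch that writes to an in-memory buffer instead of `assetsummaries.csv`; the sample row is made up:

```python
import csv
import io

rows = [
    ["Ticker", "Summary", "Label", "Confidence", "URL"],
    ["fastai", 'A summary with a comma, and a "quoted" phrase.', "POSITIVE", 0.9, "https://example.com/post"],
]

# newline="" leaves line endings to the csv module, as with the file opened above
buffer = io.StringIO(newline="")
csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL).writerows(rows)

# Read the data back to confirm the summary survives intact.
buffer.seek(0)
read_back = list(csv.reader(buffer))
print(read_back[1][1])
```

Note that `csv.reader` returns every field as a string, so the confidence comes back as `'0.9'` and needs converting with `float()` before any numeric use.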