# Screen-Scrape Summaries of Machine Learning Blogs from Analytics Vidhya Website

This Jupyter notebook extracts information about all of the Machine Learning blogs from the [ML blogs archive](https://www.analyticsvidhya.com/blog/category/machine-learning/) of the Analytics Vidhya website.

To run this notebook on Colab, click [here](https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb).

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from functools import reduce

## Set the number of blog archive pages in the cell below

The archive currently has 45 pages (as of 30 September 2020; the count appears at the bottom of [any ML blog archive page](https://www.analyticsvidhya.com/blog/category/machine-learning/page/3/)). Because `range` excludes its upper bound, set that bound to the page count plus one (here, 46) in the last line of the cell below.
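
As an alternative to hard-coding the page count, the loop can run until a page fails to load. The sketch below is a minimal illustration and not part of the original notebook: `scrape_all_pages`, `max_pages`, and `fake_fetch` are hypothetical names. It assumes that requesting a page past the end of the archive yields a non-200 response, so that a fetch function such as `articles_from_page` (defined in the next cell) raises `ValueError`. That assumption has not been verified against the site.

```python
def scrape_all_pages(fetch_page, max_pages=500):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first ValueError.

    fetch_page is any function that returns a list of articles for a page
    number and raises ValueError when the page cannot be loaded.
    max_pages is a safety cap (hypothetical default) against endless loops.
    """
    pages = []
    for page_no in range(1, max_pages + 1):
        try:
            pages.append(fetch_page(page_no))
        except ValueError:
            break  # assumed to mean "past the last archive page"
    return pages


# Stand-in fetcher for demonstration: pretends the archive has 3 pages.
def fake_fetch(page_no):
    if page_no > 3:
        raise ValueError(page_no)
    return [('article-%d' % page_no,)]


demo_pages = scrape_all_pages(fake_fetch)
print(len(demo_pages))  # 3
```

With the notebook's own function this would read `paged_articles = scrape_all_pages(articles_from_page)`. Any non-200 status, including a temporary server error, would also end the loop early, so it is still worth comparing the result with the page count shown on the site.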

In [None]:
url_template = 'https://www.analyticsvidhya.com/blog/category/machine-learning/page/%d/'

def articles_from_page(page_no):
  # Fetch one archive page; any status other than HTTP 200 is treated as an error.
  page = requests.get(url_template % page_no)
  if page.status_code != 200:
    raise ValueError(page)
  html = BeautifulSoup(page.content, 'html.parser')
  # For each <article>: publication time, author <span>, title link, and summary <p>.
  fields = [(art.find('time')['datetime'], art.find('span'), art.find('h3').find('a'), art.find('p'))
      for art in html.find_all('article')]
  # The author <span> and summary <p> may be absent, so those fields fall back to None.
  articles = [(field[0], field[1].find('a').string if field[1] else None,
      field[1].find('a')['href'] if field[1] else None, field[2]['title'], field[2]['href'],
      field[3].string if field[3] else None) for field in fields]
  return articles

paged_articles = [articles_from_page(pn) for pn in range(1, 46)]
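
To make the extraction logic easier to follow, here is a minimal offline sketch that applies the same lookups to a hand-written HTML fragment. The fragment is hypothetical: it only mimics the tags the code above relies on (`article`, `time`, `span`, `h3`, `p`), and the real page markup may differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing only the tags the scraper looks for.
sample_html = """
<article>
  <time datetime="2020-09-16T17:33:43+05:30"></time>
  <span><a href="https://example.com/author/jane/">Jane Doe</a></span>
  <h3><a title="First post" href="https://example.com/first/">First post</a></h3>
  <p>Summary of the first post</p>
</article>
<article>
  <time datetime="2020-09-13T21:03:14+05:30"></time>
  <h3><a title="Second post" href="https://example.com/second/">Second post</a></h3>
</article>
"""

html = BeautifulSoup(sample_html, 'html.parser')
rows = []
for art in html.find_all('article'):
    author = art.find('span')   # None when the article has no author block
    link = art.find('h3').find('a')
    summary = art.find('p')     # None when the article has no summary
    rows.append((
        art.find('time')['datetime'],
        author.find('a').string if author else None,
        author.find('a')['href'] if author else None,
        link['title'],
        link['href'],
        summary.string if summary else None,
    ))

for row in rows:
    print(row)
```

The second sample article has no author or summary, so those fields come out as `None`. This is one plausible explanation for the empty author and summary cells that appear in some rows of the scraped data.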

In [None]:
paged_articles[:2]

[[('2020-09-16T17:33:43+05:30',
   'Guest Blog',
   'https://www.analyticsvidhya.com/blog/author/guest-blog/',
   'Machine Learning in Cyber Security — Malicious Software Installation',
   'https://www.analyticsvidhya.com/blog/2020/09/machine-learning-in-cyber-security-malicious-software-installation/',
   'Introduction Monitoring of user activities performed by local administrators is always a challenge for SOC analysts and security professionals. Most of the security framework … '),
  ('2020-09-13T21:03:14+05:30',
   'Ram Dewani',
   'https://www.analyticsvidhya.com/blog/author/ram_dewani/',
   'How to Build a Sales Forecast using Microsoft Excel in Just 10 Minutes!',
   'https://www.analyticsvidhya.com/blog/2020/09/how-to-build-forecast-excel/',
   'Overview Learn how to build an accurate forecast in Excel – a classic technique to have for any analytics professional We’ll work on a … '),
  ('2020-09-08T00:00:23+05:30',
   'Abhishek Sharma',
   'https://www.analyticsvidhya.com/blog/a

## Data file (CSV) created by cell below

The following code creates a file named `articles.csv` containing information about all the blogs (621 as of 30 September 2020). The file has six columns: _datetime_, _author_, _author_url_, _title_, _url_, and _summary_, and can be read directly with pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for further processing.

In [None]:
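# Flatten the per-page lists into one list of article tuples.
all_articles = reduce(lambda a, b: a + b, paged_articles)
df = pd.DataFrame(all_articles, columns=['datetime', 'author', 'author_url', 'title', 'url', 'summary'])
print(df.shape)
# index=False keeps the row index out of the file, so it holds exactly the six columns listed above.
df.to_csv('articles.csv', index=False)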

(621, 6)
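
As a rough illustration of the further processing mentioned above, the sketch below reads a CSV in the same six-column layout and derives two simple statistics. To stay self-contained it uses a hypothetical two-row sample held in memory instead of the real `articles.csv`; the sample values and the variable names are invented for the example.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same six-column layout as articles.csv.
sample_csv = io.StringIO(
    "datetime,author,author_url,title,url,summary\n"
    "2020-09-16T17:33:43+05:30,Guest Blog,https://example.com/author/a/,"
    "Title A,https://example.com/a/,Summary A\n"
    "2020-09-13T21:03:14+05:30,,,Title B,https://example.com/b/,\n"
)

articles = pd.read_csv(sample_csv)

# Convert the ISO 8601 strings (with +05:30 offset) to timezone-aware UTC timestamps.
articles['datetime'] = pd.to_datetime(articles['datetime'], utc=True)

# Number of posts per calendar year, and number of rows without an author.
posts_per_year = articles['datetime'].dt.year.value_counts().sort_index()
missing_author = int(articles['author'].isna().sum())

print(posts_per_year)
print(missing_author)
```

For the real data, replace `sample_csv` with `'articles.csv'`. If the file was written with the row index included (the `to_csv` default), passing `index_col=0` to `read_csv` avoids an extra `Unnamed: 0` column.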


In [None]:
df.sample(10)

| | datetime | author | author_url | title | url | summary |
|---|---|---|---|---|---|---|
| 360 | 2020-09-28T22:54:05+05:30 | | | What is AWS? Why Every Data Science Profession... | https://www.analyticsvidhya.com/blog/2020/09/w... | |
| 202 | 2018-05-14T02:39:38+05:30 | Tavish Srivastava | https://www.analyticsvidhya.com/blog/author/ta... | An Alternative to Deep Learning? Guide to Hier... | https://www.analyticsvidhya.com/blog/2018/05/a... | Introduction Deep learning has proved its supr... |
| 48 | 2020-02-13T07:39:17+05:30 | Aishwarya Singh | https://www.analyticsvidhya.com/blog/author/ai... | 4 Boosting Algorithms You Should Know – GBM, X... | https://www.analyticsvidhya.com/blog/2020/02/4... | How many boosting algorithms do you know? Can ... |
| 334 | 2020-09-24T22:00:32+05:30 | | | Presenting HackLive – A Guided Community Hacka... | https://www.analyticsvidhya.com/blog/2020/09/h... | |
| 356 | 2017-01-19T04:29:53+05:30 | Faizan Shaikh | https://www.analyticsvidhya.com/blog/author/ja... | Simple Beginner’s guide to Reinforcement Learn... | https://www.analyticsvidhya.com/blog/2017/01/i... | Introduction One of the most fundamental quest... |
| 249 | 2020-09-27T22:17:11+05:30 | | | 10 Statistical Functions in Excel every Analyt... | https://www.analyticsvidhya.com/blog/2020/09/1... | |
| 303 | 2017-05-26T18:26:06+05:30 | Kunal Jain | https://www.analyticsvidhya.com/blog/author/ku... | Launching Analytics Industry Report 2017 – Tre... | https://www.analyticsvidhya.com/blog/2017/05/l... | Introduction Let me start with laying out a re... |
| 432 | 2020-09-24T22:00:32+05:30 | | | Presenting HackLive – A Guided Community Hacka... | https://www.analyticsvidhya.com/blog/2020/09/h... | |
| 19 | 2020-06-30T01:27:11+05:30 | Abhishek Sharma | https://www.analyticsvidhya.com/blog/author/ab... | 4 Simple Ways to Split a Decision Tree in Mach... | https://www.analyticsvidhya.com/blog/2020/06/4... | Overview How do you split a decision tree? Wha... |
| 37 | 2020-04-06T08:57:17+05:30 | Alakh Sethi | https://www.analyticsvidhya.com/blog/author/al... | Supervised Learning vs. Unsupervised Learning ... | https://www.analyticsvidhya.com/blog/2020/04/s... | Introduction “What’s the difference between su... |
