# Screen-Scrape Summaries of Machine Learning Blogs from Analytics Vidhya Website

This Jupyter notebook extracts the URL, title, and summary of every post listed in the [ML blogs archive](https://www.analyticsvidhya.com/blog/category/machine-learning/) of the Analytics Vidhya website.

To run this notebook on Colab, click [here](https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb).

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Set the number of blog archive pages in the cell below

The archive currently has 45 pages (as of 30 September 2020; see the pagination at the bottom of [any ML blog archive page](https://www.analyticsvidhya.com/blog/category/machine-learning/page/3/)). Set the upper limit of the `range` in the last line of the cell below to this number plus one, because `range` excludes its upper bound.

In [None]:
url_template = 'https://www.analyticsvidhya.com/blog/category/machine-learning/page/%d/'

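def articles_from_page(page_no):
  # Fetch one archive page; fail loudly on any non-200 response
  page = requests.get(url_template % page_no)
  if page.status_code != 200:
    raise ValueError('page %d returned HTTP status %d' % (page_no, page.status_code))
  html = BeautifulSoup(page.content, 'html.parser')
  # For each <article>, take the link inside its <h3> and its first <p> (the summary)
  a_and_p_values = [(art.find('h3').find('a'), art.find('p')) for art in html.find_all('article')]
  # Build (url, title, summary) tuples; the summary is None when the <p> is absent
  articles = [(ap[0]['href'], ap[0]['title'], ap[1].string if ap[1] else None) for ap in a_and_p_values]
  #print(page_no, len(articles))
  return articles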

paged_articles = [articles_from_page(pn) for pn in range(1, 46)]
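
As an alternative to hard-coding the page count, the loop can keep requesting pages until one fails. The sketch below is not part of the original notebook: `collect_until_failure` and `fake_fetch` are hypothetical names, and the approach assumes the site answers with a non-200 status (so that `articles_from_page` raises `ValueError`) for page numbers past the end of the archive, which has not been verified here.

```python
def collect_until_failure(fetch_page, max_pages=1000):
    """Call fetch_page(1), fetch_page(2), ... and stop at the first page
    that raises ValueError or returns no articles (hypothetical helper)."""
    pages = []
    for page_no in range(1, max_pages + 1):
        try:
            articles = fetch_page(page_no)
        except ValueError:
            break  # assumed to mean "past the last archive page"
        if not articles:
            break  # an empty page is also treated as the end
        pages.append(articles)
    return pages


# Offline stand-in for articles_from_page, so the sketch runs without network access
def fake_fetch(page_no):
    if page_no > 3:
        raise ValueError('page %d not found' % page_no)
    return [('url-%d' % page_no, 'title-%d' % page_no, None)]


demo_pages = collect_until_failure(fake_fetch)
print(len(demo_pages))
```

In the notebook itself this would be called as `paged_articles = collect_until_failure(articles_from_page)`. The `max_pages` cap is there only to guard against an endless loop if the site never signals the end.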

## Data file (CSV) created by the cell below

The following code creates a file named `articles.csv` containing information about all the blogs (621 as of 30 September 2020). The file has three columns, _url_, _title_, and _summary_, and can be read directly by pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for further processing.

In [None]:
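from functools import reduce
# Flatten the per-page lists into one list of (url, title, summary) tuples
all_articles = reduce(lambda a, b: a + b, paged_articles)
df = pd.DataFrame(all_articles, columns=['url', 'title', 'summary'])
# index=False omits the row index, so the file holds only url, title, and summary
df.to_csv('articles.csv', index=False)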
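
To illustrate the "further processing" step, here is a minimal sketch of a round trip through `read_csv`. It uses made-up sample rows and a temporary file (`articles_sample.csv`, a hypothetical name) rather than the real `articles.csv`, so it runs without scraping anything.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows standing in for scraped (url, title, summary) tuples
sample_rows = [
    ('https://example.com/a', 'First title', 'First summary'),
    ('https://example.com/b', 'Second title', None),  # summary may be missing
]

# Write the sample to a temporary CSV without the row index
csv_path = os.path.join(tempfile.mkdtemp(), 'articles_sample.csv')
pd.DataFrame(sample_rows, columns=['url', 'title', 'summary']).to_csv(csv_path, index=False)

# Read it back; a missing summary is written as an empty field and read as NaN
loaded = pd.read_csv(csv_path)
missing_summaries = int(loaded['summary'].isna().sum())
print(list(loaded.columns), len(loaded), missing_summaries)
```

If a CSV was written with the row index included (the pandas default), passing `index_col=0` to `read_csv` keeps that index from showing up as an extra unnamed column.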

In [None]:
all_articles[:10]

[('https://www.analyticsvidhya.com/blog/2020/09/machine-learning-in-cyber-security-malicious-software-installation/',
  'Machine Learning in Cyber Security — Malicious Software Installation',
  'Introduction Monitoring of user activities performed by local administrators is always a challenge for SOC analysts and security professionals. Most of the security framework … '),
 ('https://www.analyticsvidhya.com/blog/2020/09/how-to-build-forecast-excel/',
  'How to Build a Sales Forecast using Microsoft Excel in Just 10 Minutes!',
  'Overview Learn how to build an accurate forecast in Excel – a classic technique to have for any analytics professional We’ll work on a … '),
 ('https://www.analyticsvidhya.com/blog/2020/09/how-dbscan-clustering-works/',
  'How to Master the Popular DBSCAN Clustering Algorithm for Machine Learning',
  'Overview DBSCAN clustering is an underrated yet super useful clustering algorithm for unsupervised learning problems Learn how DBSCAN clustering works, why you sh