# Screen-Scrape Summaries of Machine Learning Blogs from Analytics Vidhya Website

This Jupyter notebook extracts information about all of the Machine Learning blogs from the [ML blogs archive](https://www.analyticsvidhya.com/blog/category/machine-learning/) of the Analytics Vidhya website.

To run this notebook on Colab, click [here](https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb).

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from functools import reduce

## Set the number of blog archive pages in the cell below

The archive currently has 45 pages (as of 30th Sept 2020; see the bottom of [any ML blog archive page](https://www.analyticsvidhya.com/blog/category/machine-learning/page/3/)). Set the upper limit of the `range` in the last line of the cell below to this number plus one.

In [2]:
url_template = 'https://www.analyticsvidhya.com/blog/category/machine-learning/page/%d/'

def articles_from_page(page_no):
  # Download one archive page and fail loudly on any non-200 response
  page = requests.get(url_template % page_no)
  if page.status_code != 200:
    raise ValueError(page)
  html = BeautifulSoup(page.content, 'html.parser')
  # Per <article>: publication time, author <span>, title link, summary <p>
  # (the author <span> and the summary <p> may be missing, giving None)
  fields = [(art.find('time')['datetime'], art.find('span'), art.find('h3').find('a'), art.find('p'))
      for art in html.find_all('article')]
  # Flatten to (datetime, author, author_url, title, url, summary, page_no)
  articles = [(field[0], field[1].find('a').string if field[1] else None,
      field[1].find('a')['href'] if field[1] else None, field[2]['title'], field[2]['href'],
      field[3].string if field[3] else None, page_no) for field in fields]
  #print(page_no, len(articles))
  return articles

# Upper limit of the range = number of archive pages + 1
paged_articles = [articles_from_page(pn) for pn in range(1, 46)]
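
The page count does not have to be typed in by hand. The sketch below reads it from the pagination links of an archive page. It assumes those links use the same URL pattern as `url_template`, which has not been checked against the live site. The helper name `last_page_number` and the sample HTML are hypothetical.

```python
import re

# Assumption: pagination links embed the page number in the same URL
# pattern that url_template uses (not verified against the live site).
PAGE_LINK = re.compile(r'/blog/category/machine-learning/page/(\d+)/')

def last_page_number(html_text):
    """Return the largest page number linked from html_text, or 1 if none is found."""
    numbers = [int(n) for n in PAGE_LINK.findall(html_text)]
    return max(numbers, default=1)

# Hypothetical fragment standing in for the pagination block of an archive page
sample_html = '''
<a href="https://www.analyticsvidhya.com/blog/category/machine-learning/page/2/">2</a>
<a href="https://www.analyticsvidhya.com/blog/category/machine-learning/page/3/">3</a>
<a href="https://www.analyticsvidhya.com/blog/category/machine-learning/page/45/">45</a>
'''

print(last_page_number(sample_html))  # prints 45
```

On a live page the input would be the text of a downloaded archive page, for example `requests.get(url_template % 1).text`, and the result plus one would become the upper limit of the `range`.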

In [3]:
paged_articles[:2]

[[('2020-10-01T13:22:23+05:30',
   'ananyd36',
   'https://www.analyticsvidhya.com/blog/author/ananyd36/',
   '7 Feature Engineering Techniques in Machine Learning You Should Know',
   'https://www.analyticsvidhya.com/blog/2020/10/7-feature-engineering-techniques-machine-learning/',
   'This article was published as a part of the Data Science Blogathon. Overview Feature engineering techniques are a must know concept for machine learning … ',
   1),
  ('2020-09-16T17:33:43+05:30',
   'Guest Blog',
   'https://www.analyticsvidhya.com/blog/author/guest-blog/',
   'Machine Learning in Cyber Security — Malicious Software Installation',
   'https://www.analyticsvidhya.com/blog/2020/09/machine-learning-in-cyber-security-malicious-software-installation/',
   'Introduction Monitoring of user activities performed by local administrators is always a challenge for SOC analysts and security professionals. Most of the security framework … ',
   1),
  ('2020-09-13T21:03:14+05:30',
   'Ram Dewani',
  

## Data file (CSV) created by cell below

The following code creates a file named `articles.csv` containing information about all the blogs (621 as of 30th September 2020; the run shown below found 622). Besides the row index, the file has seven columns: _datetime_, _author_, <i>author_url</i>, _title_, _url_, _summary_, and <i>page_no</i>. It can be read directly with pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for further processing.

In [4]:
all_articles = reduce(lambda a, b: a + b, paged_articles)
df = pd.DataFrame(all_articles, columns=['datetime', 'author', 'author_url', 'title', 'url', 'summary', 'page_no'])
print(df.shape)
df.to_csv('articles.csv')

(622, 7)
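
As a minimal sketch of the further processing mentioned above, the file can be loaded back with `read_csv`. The in-memory `sample_csv` is a stand-in for `articles.csv` with hypothetical rows. It follows the layout that `df.to_csv` writes: an unnamed index column first, then the seven data columns. Converting the timestamps to UTC is one possible choice, not something the notebook does.

```python
import io
import pandas as pd

# Stand-in for articles.csv: hypothetical rows in the layout df.to_csv writes
sample_csv = io.StringIO(
    ',datetime,author,author_url,title,url,summary,page_no\n'
    '0,2020-10-01T13:22:23+05:30,Author A,https://example.com/a,Title A,https://example.com/1,Summary A,1\n'
    '1,2020-09-16T17:33:43+05:30,,,Title B,https://example.com/2,,1\n'
)

# index_col=0 restores the row index instead of creating an "Unnamed: 0" column
articles = pd.read_csv(sample_csv, index_col=0)

# Parse the ISO timestamps; utc=True normalises the +05:30 offsets to UTC
articles['datetime'] = pd.to_datetime(articles['datetime'], utc=True)

# Example analysis: number of articles per calendar year
per_year = articles['datetime'].dt.year.value_counts()
print(articles.shape)
print(per_year)
```

Empty fields, such as the missing author in the second sample row, come back as `NaN`. With the real file, `io.StringIO(...)` would be replaced by the path `'articles.csv'`.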


In [5]:
df.sample(10)

| | datetime | author | author_url | title | url | summary | page_no |
|---|---|---|---|---|---|---|---|
| 418 | 2020-09-30T13:28:08+05:30 | | | Hypothesis Generation for Data Science Project... | https://www.analyticsvidhya.com/blog/2020/09/h... | | 30 |
| 230 | 2018-03-15T10:19:42+05:30 | Pranav Dar | https://www.analyticsvidhya.com/blog/author/da... | Top 5 Data Science & Machine Learning Reposito... | https://www.analyticsvidhya.com/blog/2018/03/t... | Introduction Continuing our theme of collectin... | 17 |
| 226 | 2018-03-26T09:58:03+05:30 | Pranav Dar | https://www.analyticsvidhya.com/blog/author/da... | AVBytes: AI & ML Developments this week – IBM’... | https://www.analyticsvidhya.com/blog/2018/03/a... | In recent times, one of the more popular theme... | 17 |
| 523 | 2015-11-03T19:07:51+05:30 | Analytics Vidhya | https://www.analyticsvidhya.com/blog/author/av... | Free Resources for Beginners on Deep Learning ... | https://www.analyticsvidhya.com/blog/2015/11/f... | Introduction Machines have already started the... | 38 |
| 381 | 2016-12-05T01:43:31+05:30 | Faizan Shaikh | https://www.analyticsvidhya.com/blog/author/ja... | 45 questions to test Data Scientists on Tree B... | https://www.analyticsvidhya.com/blog/2016/12/d... | Introduction Tree Based algorithms like Random... | 28 |
| 92 | 2019-07-15T19:38:42+05:30 | Pranav Dar | https://www.analyticsvidhya.com/blog/author/da... | Popular Machine Learning Applications and Use ... | https://www.analyticsvidhya.com/blog/2019/07/u... | Overview We are the in middle of a revolution ... | 7 |
| 199 | 2018-06-12T20:51:48+05:30 | Pranav Dar | https://www.analyticsvidhya.com/blog/author/da... | Don’t miss out on these awesome GitHub Reposit... | https://www.analyticsvidhya.com/blog/2018/06/t... | Take a look at the top machine learning and da... | 15 |
| 384 | 2016-11-24T06:02:19+05:30 | Kunal Jain | https://www.analyticsvidhya.com/blog/author/ku... | 25+ websites to find datasets for data science... | https://www.analyticsvidhya.com/blog/2016/11/2... | Introduction If there is one sentence, which s... | 28 |
| 267 | 2017-10-05T09:38:44+05:30 | Ankit Gupta | https://www.analyticsvidhya.com/blog/author/fa... | 25 Questions to test a Data Scientist on Suppo... | https://www.analyticsvidhya.com/blog/2017/10/s... | Introduction You can think of machine learning... | 20 |
| 45 | 2020-03-25T01:56:16+05:30 | Purva Huilgol | https://www.analyticsvidhya.com/blog/author/pu... | 6 Python Libraries to Interpret Machine Learni... | https://www.analyticsvidhya.com/blog/2020/03/6... | The Case for Building Trust in Machine Learnin... | 4 |
