# Screen-Scrape Summaries of Machine Learning Blogs from Analytics Vidhya Website

This Jupyter notebook extracts information about all of the Machine Learning blogs from the [ML blogs archive](https://www.analyticsvidhya.com/blog/category/machine-learning/) of the Analytics Vidhya website.

To run this notebook on Colab, click [here](https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb).

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from functools import reduce

## Fetch List of Blog Records

In [16]:
url_template = 'https://www.analyticsvidhya.com/blog/category/machine-learning/page/%d/'

def articles_from_page(page_no):
  """Fetch one archive page and return (status_code, list of article tuples or None)."""
  page = requests.get(url_template % page_no)
  if page.status_code != 200:
    return (page.status_code, None)
  html = BeautifulSoup(page.content, 'html.parser')
  articles = []
  for art in html.find_all('article'):
    span = art.find('span')  # first <span>; its link carries the author name and URL
    author_link = span.find('a') if span is not None else None
    title_link = art.find('h3').find('a')
    summary = art.find('p')
    articles.append((
        art.find('time')['datetime'],
        author_link.string if author_link is not None else None,
        author_link['href'] if author_link is not None else None,
        title_link['title'],
        title_link['href'],
        summary.string if summary is not None else None,
        page_no))
  return (page.status_code, articles)

paged_articles = []
for page in range(1, 100):
  status, articles = articles_from_page(page)
  if status != 200:
    break
  paged_articles.extend(articles)

print('Got %d articles' % len(paged_articles))

Got 683 articles
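
The loop above walks the archive pages in order and stops at the first response that is not HTTP 200, with a hard cap of 99 pages. The following is a minimal sketch of that stopping rule, runnable without network access. The names `collect_pages` and `fake_fetch` are hypothetical, and `fake_fetch` is only a stand-in for `articles_from_page` that pretends the archive has three pages.

```python
from typing import Callable, List, Optional, Tuple

Record = Tuple[str, int]  # (title, page_no): a simplified stand-in for the article tuples
PageResult = Tuple[int, Optional[List[Record]]]

def collect_pages(fetch: Callable[[int], PageResult], max_pages: int = 99) -> List[Record]:
  """Call fetch(1), fetch(2), ... and stop at the first non-200 status or after max_pages."""
  collected: List[Record] = []
  for page_no in range(1, max_pages + 1):
    status, page_records = fetch(page_no)
    if status != 200:
      break
    collected.extend(page_records or [])
  return collected

def fake_fetch(page_no: int) -> PageResult:
  """Hypothetical stand-in for articles_from_page: three pages exist, then a 404."""
  if page_no > 3:
    return (404, None)
  return (200, [('article %d' % page_no, page_no)])

records = collect_pages(fake_fetch)
print('Got %d articles' % len(records))
```

Passing the fetch function as a parameter keeps the stopping rule separate from the HTTP call, so the loop can be checked against a stub before it is pointed at the real site.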


## Data file (CSV) created by cell below

The following code creates a file named `articles.csv` containing information about all the blogs (683 rows in the run shown here, as of 5th November 2020). The file has seven named columns: _datetime_, _author_, <i>author_url</i>, _title_, _url_, _summary_, and <i>page_no</i>, preceded by an unnamed column holding the row index. It can be read directly by pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for further processing.

In [17]:
df = pd.DataFrame(paged_articles, columns=['datetime', 'author', 'author_url', 'title', 'url', 'summary', 'page_no'])
print(df.shape)
df.to_csv('articles.csv')

(683, 7)
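
A file written this way can be loaded again with `read_csv`. The following is a minimal sketch that uses an in-memory buffer in place of `articles.csv` and two made-up rows in place of the scraped data, so the `example.com` URLs, author names, and titles are assumptions for illustration only. It relies on the `index_col` parameter of `read_csv` and on `pd.to_datetime`.

```python
import io
import pandas as pd

columns = ['datetime', 'author', 'author_url', 'title', 'url', 'summary', 'page_no']
# Made-up rows standing in for paged_articles; not real scraped data.
sample = pd.DataFrame([
  ('2020-01-01T10:00:00+05:30', 'Author A', 'https://example.com/author/a',
   'Title A', 'https://example.com/a', 'Summary A', 1),
  ('2020-01-02T11:30:00+05:30', None, None,
   'Title B', 'https://example.com/b', None, 1),
], columns=columns)

buffer = io.StringIO()
sample.to_csv(buffer)  # same call as in the cell above, but written to memory
buffer.seek(0)

# index_col=0 restores the saved row index instead of adding an 'Unnamed: 0' column.
restored = pd.read_csv(buffer, index_col=0)
# Convert the ISO 8601 strings into timezone-aware timestamps.
restored['datetime'] = pd.to_datetime(restored['datetime'])
print(restored.dtypes)
```

Missing authors and summaries are written as empty fields and come back as `NaN`, so downstream code should test them with `pd.isna` rather than comparing against `None`.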


In [18]:
df.sample(10)

| | datetime | author | author_url | title | url | summary | page_no |
|---|---|---|---|---|---|---|---|
| 390 | 2020-11-04T15:52:47+05:30 | | | Create your Own Image Caption Generator using ... | https://www.analyticsvidhya.com/blog/2020/11/c... | | 28 |
| 670 | 2020-11-04T15:52:47+05:30 | | | Create your Own Image Caption Generator using ... | https://www.analyticsvidhya.com/blog/2020/11/c... | | 48 |
| 222 | 2020-11-04T15:52:47+05:30 | | | Create your Own Image Caption Generator using ... | https://www.analyticsvidhya.com/blog/2020/11/c... | | 16 |
| 391 | 2020-11-04T11:46:16+05:30 | | | Top 5 Statistical Concepts Every Data Scientis... | https://www.analyticsvidhya.com/blog/2020/11/t... | | 28 |
| 153 | 2020-11-04T11:46:16+05:30 | | | Top 5 Statistical Concepts Every Data Scientis... | https://www.analyticsvidhya.com/blog/2020/11/t... | | 11 |
| 529 | 2020-11-04T20:00:20+05:30 | | | Artificial Intelligence in Agriculture : Using... | https://www.analyticsvidhya.com/blog/2020/11/a... | | 38 |
| 575 | 2015-11-26T20:43:34+05:30 | Sunil Ray | https://www.analyticsvidhya.com/blog/author/su... | Simple Methods to deal with Categorical Variab... | https://www.analyticsvidhya.com/blog/2015/11/e... | Introduction Categorical variables are known t... | 42 |
| 524 | 2016-03-22T19:04:32+05:30 | Guest Blog | https://www.analyticsvidhya.com/blog/author/gu... | How to perform feature selection (i.e. pick im... | https://www.analyticsvidhya.com/blog/2016/03/s... | Introduction Variable selection is an importan... | 38 |
| 92 | 2020-05-19T01:07:59+05:30 | LAKSHAY ARORA | https://www.analyticsvidhya.com/blog/author/la... | Running Low on Time? Use PyCaret to Build your... | https://www.analyticsvidhya.com/blog/2020/05/p... | Overview PyCaret is a super useful and low-cod... | 7 |
| 227 | 2018-09-27T20:00:52+05:30 | Aishwarya Singh | https://www.analyticsvidhya.com/blog/author/ai... | A Multivariate Time Series Guide to Forecastin... | https://www.analyticsvidhya.com/blog/2018/09/m... | Vector Auto Regression method for forecasting ... | 17 |
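
In the sample above, the same title and timestamp appear on several archive pages (for example pages 16, 28 and 48) with no author or summary, which suggests that some entries repeat from page to page. Assuming such repeats are unwanted, the following is a minimal sketch of finding and dropping them by `url`. The frame `df_demo` and its `example.com` rows are made up to imitate that pattern and are not scraped data.

```python
import pandas as pd

# Made-up rows imitating the pattern in the sample: one article repeated on several pages.
df_demo = pd.DataFrame({
  'title': ['Repeated post', 'Post A', 'Repeated post', 'Post B', 'Repeated post'],
  'url': ['https://example.com/r', 'https://example.com/a', 'https://example.com/r',
          'https://example.com/b', 'https://example.com/r'],
  'page_no': [1, 1, 2, 2, 3],
})

# Count how often each URL occurs, to see which entries repeat.
counts = df_demo['url'].value_counts()
print(counts[counts > 1])

# Keep the first occurrence of each URL (here, the one with the lowest page number).
unique_articles = df_demo.drop_duplicates(subset='url', keep='first').reset_index(drop=True)
print(unique_articles)
```

Applied to the notebook's `df`, the same `drop_duplicates(subset='url')` call would reduce the row count to the number of distinct URLs. Whether the repeated rows should be dropped or kept depends on what the later analysis needs.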
