<a href="https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Screen-Scrape Summaries of Machine Learning Blogs from Analytics Vidhya Website

This Jupyter notebook extracts information about all of the Machine Learning blogs from the [ML blogs archive](https://www.analyticsvidhya.com/blog/category/machine-learning/) of the Analytics Vidhya website.

To run this notebook on Colab, click [here](https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb).

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

## Fetch List of Blog Records

In [4]:
url_template = 'https://www.analyticsvidhya.com/blog/category/machine-learning/page/%d/'

def articles_from_page(page_no):
  """Fetch one archive page; return (status_code, list of article tuples or None)."""
  page = requests.get(url_template % page_no, timeout=30)
  if page.status_code != 200:
    return (page.status_code, None)
  html = BeautifulSoup(page.content, 'html.parser')
  articles = []
  for art in html.find_all('article'):
    # The author link sits inside the first <span>; some articles have none
    span = art.find('span')
    author = span.find('a') if span else None
    title = art.find('h3').find('a')
    summary = art.find('p')
    articles.append((art.find('time')['datetime'],
        author.string if author else None, author['href'] if author else None,
        title['title'], title['href'],
        summary.string if summary else None, page_no))
  return (page.status_code, articles)

paged_articles = []
for page in tqdm(range(1, 101)):
  status, articles = articles_from_page(page)
  if status != 200:
    break
  paged_articles.extend(articles)

print('\nGot %d articles' % len(paged_articles))

 49%|████▉     | 49/99 [01:10<00:46,  1.08it/s]

Got 683 articles

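The extraction step can be tried without touching the network. The sketch below runs the same tag lookups (`time`, `span`, `h3 > a`, `p`) on a hand-written snippet. The markup, the `example.com` URLs, and the helper name `parse_articles` are all hypothetical; the real archive pages may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it only mimics the tags the scraper looks for
sample_html = """
<article>
  <time datetime="2020-11-05T20:14:37+05:30"></time>
  <span><a href="https://example.com/author/jane/">Jane Doe</a></span>
  <h3><a href="https://example.com/blog/post-1/" title="Post One">Post One</a></h3>
  <p>Short summary.</p>
</article>
<article>
  <time datetime="2020-11-06T12:16:52+05:30"></time>
  <h3><a href="https://example.com/blog/post-2/" title="Post Two">Post Two</a></h3>
</article>
"""

def parse_articles(html_text):
  """Return one (datetime, author, title, url, summary) tuple per <article>."""
  soup = BeautifulSoup(html_text, 'html.parser')
  rows = []
  for art in soup.find_all('article'):
    span = art.find('span')
    author = span.find('a') if span else None  # author block may be missing
    link = art.find('h3').find('a')
    summary = art.find('p')                    # summary may be missing too
    rows.append((
        art.find('time')['datetime'],
        author.string if author else None,
        link['title'],
        link['href'],
        summary.string if summary else None,
    ))
  return rows

rows = parse_articles(sample_html)
for row in rows:
  print(row)
```

The second `<article>` has no author or summary, so those fields come back as `None`, which is how the empty cells in the scraped data arise.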

## Data file (CSV) created by cell below

The following cell writes a file named `articles.csv` containing information about all the blogs (682 as of 5th November 2020). The file has seven columns: _datetime_, _author_, <i>author_url</i>, _title_, _url_, _summary_, and <i>page_no</i>, and can be read directly with pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for further processing.

In [5]:
df = pd.DataFrame(paged_articles, columns=['datetime', 'author', 'author_url', 'title', 'url', 'summary', 'page_no'])
print(df.shape)
# index=False keeps the row index out of the file, so it holds only the named columns
df.to_csv('articles.csv', index=False)

(683, 7)

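As a rough illustration of the "further processing" mentioned above, the sketch below loads the data with `read_csv`, parses the timestamps, and counts articles per year and rows without an author. It reads from a small in-memory stand-in with invented rows so that it runs without the scrape; pointing `source` at `'articles.csv'` is assumed to work the same way.

```python
import io
import pandas as pd

# Hypothetical stand-in for articles.csv (invented rows, same column names);
# replace `source` with 'articles.csv' to use the real file.
source = io.StringIO(
    'datetime,author,author_url,title,url,summary,page_no\n'
    '2018-03-26T03:18:09+05:30,Author A,https://example.com/a/,Title 1,https://example.com/1/,Summary 1,21\n'
    '2020-11-05T20:14:37+05:30,,,Title 2,https://example.com/2/,,24\n'
    '2020-11-06T12:16:52+05:30,,,Title 3,https://example.com/3/,,12\n'
)

articles = pd.read_csv(source)
# Drop a leftover index column if the file was written with the row index included
articles = articles.loc[:, ~articles.columns.str.startswith('Unnamed')]
# Timestamps carry a +05:30 offset; utc=True converts them to one timezone-aware dtype
articles['datetime'] = pd.to_datetime(articles['datetime'], utc=True)

per_year = articles['datetime'].dt.year.value_counts().sort_index()
missing_author = articles['author'].isna().sum()
print(per_year)
print('Articles without an author:', missing_author)
```

Note that converting to UTC shifts each timestamp by five and a half hours, which could move an article published shortly after midnight IST into the previous day.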

In [6]:
df.sample(10)

| | datetime | author | author_url | title | url | summary | page_no |
|---|---|---|---|---|---|---|---|
| 289 | 2018-03-26T03:18:09+05:30 | Tavish Srivastava | https://www.analyticsvidhya.com/blog/author/ta... | Introduction to k-Nearest Neighbors: A powerfu... | https://www.analyticsvidhya.com/blog/2018/03/i... | ArticleAssess Yourself Note: This article was ... | 21 |
| 333 | 2020-11-05T20:14:37+05:30 | | | RDDs vs. Dataframes vs. Datasets – What is the... | https://www.analyticsvidhya.com/blog/2020/11/w... | | 24 |
| 395 | 2017-03-29T14:11:47+05:30 | Yogesh Kulkarni | https://www.analyticsvidhya.com/blog/author/yo... | Extracting information from reports using Regu... | https://www.analyticsvidhya.com/blog/2017/03/e... | Introduction Many times it is necessary to ext... | 29 |
| 513 | 2016-05-03T04:49:55+05:30 | Analytics Vidhya | https://www.analyticsvidhya.com/blog/author/av... | data.table() vs data.frame() – Learn to work o... | https://www.analyticsvidhya.com/blog/2016/05/d... | Introduction R users (mostly beginners) strugg... | 37 |
| 155 | 2019-07-26T09:14:46+05:30 | Abir Mukherjee | https://www.analyticsvidhya.com/blog/author/am... | Introduction to Bayesian Adjustment Rating: Th... | https://www.analyticsvidhya.com/blog/2019/07/i... | Overview Curious how the big product companies... | 12 |
| 146 | 2019-08-19T08:45:55+05:30 | Pulkit Sharma | https://www.analyticsvidhya.com/blog/author/pu... | The Most Comprehensive Guide to K-Means Cluste... | https://www.analyticsvidhya.com/blog/2019/08/c... | Overview K-Means Clustering is a simple yet po... | 11 |
| 124 | 2020-11-05T20:13:18+05:30 | | | Lasso Regression causes sparsity while Ridge R... | https://www.analyticsvidhya.com/blog/2020/11/l... | | 9 |
| 335 | 2020-11-05T15:00:45+05:30 | | | 12 Essential Tips for People starting a Career... | https://www.analyticsvidhya.com/blog/2020/11/t... | | 24 |
| 164 | 2020-11-06T12:16:52+05:30 | | | Summarize Twitter Live data using Pretrained N... | https://www.analyticsvidhya.com/blog/2020/11/s... | | 12 |
| 428 | 2016-12-23T05:11:35+05:30 | Guest Blog | https://www.analyticsvidhya.com/blog/author/gu... | Artificial Intelligence Demystified | https://www.analyticsvidhya.com/blog/2016/12/a... | Introduction Artificial Intelligence has becom... | 31 |
