<a href="https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Screen-Scrape Summaries of Machine Learning Blogs from Analytics Vidhya Website

This Jupyter notebook extracts the publication date, author, title, URL, and summary of every post listed in the [ML blogs archive](https://www.analyticsvidhya.com/blog/category/machine-learning/) of the Analytics Vidhya website.

To run this notebook on Colab, click [here](https://colab.research.google.com/github/sanjaydasgupta/data-mining-of-website-articles/blob/master/analytics-vidhya-ml-blogs.ipynb).

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from tqdm import tqdm

## Fetch List of Blog Records

In [2]:
url_template = 'https://www.analyticsvidhya.com/blog/category/machine-learning/page/%d/'

def articles_from_page(page_no):
  """Return (status_code, rows) for one archive page; rows is None unless the status is 200."""
  page = requests.get(url_template % page_no)
  if page.status_code != 200:
    return (page.status_code, None)
  html = BeautifulSoup(page.content, 'html.parser')
  articles = []
  for art in html.find_all('article'):
    # The first <span> is expected to hold the author link; it and the summary <p> may be absent.
    author_span, title_link, summary = art.find('span'), art.find('h3').find('a'), art.find('p')
    author_link = author_span.find('a') if author_span else None
    articles.append((art.find('time')['datetime'],
        author_link.string if author_link else None,
        author_link['href'] if author_link else None,
        title_link['title'], title_link['href'],
        summary.string if summary else None, page_no))
  return (page.status_code, articles)

paged_articles = []
# Walk up to 100 archive pages, stopping at the first one that does not return HTTP 200.
for page_no in tqdm(range(1, 101)):
  status, articles = articles_from_page(page_no)
  if status != 200:
    break
  paged_articles.extend(articles)

print('\nGot %d articles' % len(paged_articles))

 59%|█████▉    | 59/100 [02:20<01:41,  2.46s/it]


Got 817 articles
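
To see how the field extraction in `articles_from_page` behaves without touching the network, the same selectors can be applied to a small hand-written HTML fragment. The markup below is an assumption invented for illustration: it only mirrors the tags the code looks for (`article`, `time`, `span`, `h3`, `p`), and the real pages may be structured differently. The helper name `parse_articles` is also hypothetical. The second `article` omits the author `span` and the summary `p`, which is one way `None` values can end up in the _author_ and _summary_ columns.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not copied from the website.
SAMPLE_HTML = """
<article>
  <time datetime="2021-01-01T10:00:00+05:30"></time>
  <span><a href="https://example.com/author/jane/">Jane</a></span>
  <h3><a href="https://example.com/post-1/" title="First post">First post</a></h3>
  <p>Short summary.</p>
</article>
<article>
  <time datetime="2021-01-02T10:00:00+05:30"></time>
  <h3><a href="https://example.com/post-2/" title="Second post">Second post</a></h3>
</article>
"""

def parse_articles(html_text):
    """Extract one tuple per <article>, using None for missing author or summary."""
    soup = BeautifulSoup(html_text, 'html.parser')
    rows = []
    for art in soup.find_all('article'):
        span = art.find('span')                      # author container, may be absent
        author = span.find('a') if span else None
        title = art.find('h3').find('a')             # title link carries title and href
        summary = art.find('p')                      # summary paragraph, may be absent
        rows.append((
            art.find('time')['datetime'],
            author.string if author else None,
            author['href'] if author else None,
            title['title'],
            title['href'],
            summary.string if summary else None,
        ))
    return rows

rows = parse_articles(SAMPLE_HTML)
for row in rows:
    print(row)
```

Working from a fixed fragment like this makes it easier to adjust the selectors if the site's layout changes, since no requests are needed while experimenting.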


## Create the Data File (CSV)

The following code creates a file named `articles.csv` containing information about all the blogs (817 as of 7 March 2021). The file has seven named columns: _datetime_, _author_, <i>author_url</i>, _title_, _url_, _summary_, and <i>page_no</i>, preceded by an unnamed column that holds the DataFrame's row index. It can be read directly with pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) for further processing.

In [3]:
df = pd.DataFrame(paged_articles, columns=['datetime', 'author', 'author_url', 'title', 'url', 'summary', 'page_no'])
print(df.shape)
df.to_csv('articles.csv')

(817, 7)
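
As a minimal sketch of the "further processing" step, the file can be loaded back with `read_csv`. The function name `load_articles` is hypothetical. Because `to_csv` was called without `index=False`, the row index is stored as an unnamed first column, so `index_col=0` is used to restore it. Converting the timestamps to UTC is an optional choice made here for convenience, not something the notebook itself does.

```python
import pandas as pd

def load_articles(path='articles.csv'):
    """Read the CSV written by df.to_csv(path) back into a DataFrame."""
    # The first (unnamed) column is the row index that to_csv wrote by default.
    articles = pd.read_csv(path, index_col=0)
    # The timestamps carry a +05:30 offset; utc=True normalizes them to UTC.
    articles['datetime'] = pd.to_datetime(articles['datetime'], utc=True)
    return articles
```

After the cell above has run, `load_articles()` should return a frame with the same seven columns as `df`.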


In [4]:
df.sample(10)

| | datetime | author | author_url | title | url | summary | page_no |
|---|---|---|---|---|---|---|---|
| 75 | 2020-12-16T13:30:51+05:30 | shanthababu | https://www.analyticsvidhya.com/blog/author/sh... | Understand Machine Learning and Its End-to-End... | https://www.analyticsvidhya.com/blog/2020/12/u... | ArticleVideo Book This article was published a... | 6 |
| 652 | 2016-03-28T05:17:08+05:30 | Analytics Vidhya | https://www.analyticsvidhya.com/blog/author/av... | Practical Guide to deal with Imbalanced Classi... | https://www.analyticsvidhya.com/blog/2016/03/p... | ArticleVideo Book Introduction We have several... | 47 |
| 742 | 2015-08-02T19:27:33+05:30 | Tavish Srivastava | https://www.analyticsvidhya.com/blog/author/ta... | Basics of Ensemble Learning Explained in Simpl... | https://www.analyticsvidhya.com/blog/2015/08/i... | ArticleVideo Book Introduction Ensemble modeli... | 54 |
| 582 | 2016-11-18T05:22:29+05:30 | Saurav Kaushik | https://www.analyticsvidhya.com/blog/author/sa... | An Introduction to APIs (Application Programmi... | https://www.analyticsvidhya.com/blog/2016/11/a... | ArticleVideo Book Introduction If you are in t... | 42 |
| 732 | 2015-09-11T03:50:40+05:30 | Tavish Srivastava | https://www.analyticsvidhya.com/blog/author/ta... | Learn Gradient Boosting Algorithm for better p... | https://www.analyticsvidhya.com/blog/2015/09/c... | ArticleVideo Book Introduction The accuracy of... | 53 |
| 752 | 2021-03-05T16:47:13+05:30 | | | Data Scientist’s Guide to Logistic regression | https://www.analyticsvidhya.com/blog/2021/03/l... | | 54 |
| 12 | 2021-03-05T13:44:17+05:30 | | | Data Validation and Data Verification – From D... | https://www.analyticsvidhya.com/blog/2021/03/d... | | 1 |
| 128 | 2020-11-07T16:02:50+05:30 | nandhini97 | https://www.analyticsvidhya.com/blog/author/na... | Handling Imbalanced Data – Machine Learning, C... | https://www.analyticsvidhya.com/blog/2020/11/h... | ArticleVideo Book This article was published a... | 10 |
| 584 | 2021-03-05T16:47:13+05:30 | | | Data Scientist’s Guide to Logistic regression | https://www.analyticsvidhya.com/blog/2021/03/l... | | 42 |
| 802 | 2014-01-18T00:23:00+05:30 | Tavish Srivastava | https://www.analyticsvidhya.com/blog/author/ta... | Framework to build logistic regression model i... | https://www.analyticsvidhya.com/blog/2014/01/l... | ArticleVideo Book Only 531 out of a population... | 58 |
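
The sample above contains the same article twice: rows 584 and 752 share a title and timestamp but come from pages 42 and 54, so the 817 rows may include repeated entries. The following is a minimal sketch for dropping such repeats, assuming the _url_ column identifies an article uniquely (an assumption the notebook does not verify). The helper name `drop_repeated_articles` and the toy frame are invented for illustration.

```python
import pandas as pd

# Toy data for illustration: 'Post A' appears on two different archive pages.
toy = pd.DataFrame({
    'title': ['Post A', 'Post B', 'Post A'],
    'url': ['https://example.com/a/', 'https://example.com/b/', 'https://example.com/a/'],
    'page_no': [42, 42, 54],
})

def drop_repeated_articles(frame):
    """Keep the first row seen for each URL and renumber the index."""
    return frame.drop_duplicates(subset='url', keep='first').reset_index(drop=True)

unique = drop_repeated_articles(toy)
print(len(toy), '->', len(unique))
```

Applied to the notebook's data, this would be `drop_repeated_articles(df)`. Comparing the row counts before and after shows how many listings were repeats.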
