Basic web-scraping practice with `BeautifulSoup`. I was curious to gather the names of all DataCamp courses that are currently live. This is just a starter notebook: it still contains a lot of redundant code, which I will remove as I proceed, and I will keep working on it to improve my scraping skills.

In [1]:
# Dependencies
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime as dt

#### Courses available on the following topics:
- tech:r 
- tech:python 
- tech:sql 
- tech:git 
- tech:shell 
- tech:spreadsheets

In [4]:
# The main URL where all the course names and their descriptions can be found
root_url = 'https://www.datacamp.com/courses/'

In [5]:
# Request headers with a browser-like User-Agent (necessary for scraping)
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) \
Chrome/71.0.3578.98 Safari/537.36'}

In [9]:
# Function to scrape course names by topic
def scrape_courses_by_topic(topic):
    """Return the titles of all courses listed on the page for `topic` (e.g. 'tech:r')."""
    courses = []
    topic_url = root_url + topic
    print('Scrapping ' + topic_url)
    response = requests.get(topic_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Course lists live in <div> elements with this class attribute
    course_lists = soup.find_all('div', attrs={'class': 'courses__explore-list row'})
    for course_list in course_lists:
        # Each course title is an <h4> inside the list
        titles = course_list.find_all('h4', attrs={'class': 'course-block__title'})
        for title in titles:
            courses.append(title.get_text())
    print("Scrapping done for " + topic +"!")
    return courses
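
To check the parsing logic without hitting the network, the same extraction can be run on a small inline HTML string. This is only a minimal sketch: the sample markup and the helper name `extract_course_titles` are made up for illustration, only the class names are taken from the function above, and it uses a CSS selector via `select` in place of the nested `find_all` calls.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; only the class names come from the notebook.
SAMPLE_HTML = """
<div class="courses__explore-list row">
  <div><h4 class="course-block__title">Course A</h4></div>
  <div><h4 class="course-block__title"> Course B </h4></div>
</div>
<h4 class="course-block__title">Not inside a course list</h4>
"""

def extract_course_titles(html):
    """Return the text of every course title found inside a course list."""
    soup = BeautifulSoup(html, 'html.parser')
    # Descendant selector: <h4> titles nested anywhere inside a course-list <div>
    title_tags = soup.select('div.courses__explore-list.row h4.course-block__title')
    # strip=True trims surrounding whitespace from each title
    return [tag.get_text(strip=True) for tag in title_tags]

sample_titles = extract_course_titles(SAMPLE_HTML)
print(sample_titles)
```

Separating the parsing step from the HTTP request like this also makes it easier to notice when the site's markup changes and the selectors stop matching.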

> We start the scraping process now, beginning with **R**.

### R

In [10]:
all_r_courses = scrape_courses_by_topic('tech:r')
print('Total number of live R courses ' + str(len(all_r_courses)))

Scrapping https://www.datacamp.com/courses/tech:r
Scrapping done for tech:r!
Total number of live R courses 138


### Python

In [11]:
all_python_courses = scrape_courses_by_topic('tech:python')
print('Total number of live Python courses - ' + str(len(all_python_courses)))

Scrapping https://www.datacamp.com/courses/tech:python
Scrapping done for tech:python!
Total number of live Python courses - 63


### SQL

In [13]:
all_sql_courses = scrape_courses_by_topic('tech:sql')
print('Total number of live SQL courses - ' + str(len(all_sql_courses)))

Scrapping https://www.datacamp.com/courses/tech:sql
Scrapping done for tech:sql!
Total number of live SQL courses - 3


### Git

In [14]:
all_git_courses = scrape_courses_by_topic('tech:git')
print('Total number of live Git courses - ' + str(len(all_git_courses)))

Scrapping https://www.datacamp.com/courses/tech:git
Scrapping done for tech:git!
Total number of live Git courses - 1


### Shell

In [15]:
all_shell_courses = scrape_courses_by_topic('tech:shell')
print('Total number of live Shell courses - ' + str(len(all_shell_courses)))

Scrapping https://www.datacamp.com/courses/tech:shell
Scrapping done for tech:shell!
Total number of live Shell courses - 3


### Spreadsheets

In [16]:
all_spreadsheets_courses = scrape_courses_by_topic('tech:spreadsheets')
print('Total number of live Spreadsheets courses - ' + str(len(all_spreadsheets_courses)))

Scrapping https://www.datacamp.com/courses/tech:spreadsheets
Scrapping done for tech:spreadsheets!
Total number of live Spreadsheets courses - 4
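
The six sections above repeat the same call with a different topic slug. One possible way to cut the repetition is a loop over a topic mapping. The sketch below is an assumption about how that could look: `TOPICS`, `scrape_all_topics`, and the stand-in `fake_scrape` are hypothetical names, and the scraper is passed in as an argument so the snippet runs without network access. In the notebook, `scrape_courses_by_topic` would be passed instead.

```python
# Display name -> topic slug, taken from the topic list at the top of the notebook
TOPICS = {
    'R': 'tech:r',
    'Python': 'tech:python',
    'SQL': 'tech:sql',
    'Git': 'tech:git',
    'Shell': 'tech:shell',
    'Spreadsheets': 'tech:spreadsheets',
}

def scrape_all_topics(scrape, topics=TOPICS):
    """Call scrape(slug) for each topic and return {display name: course list}."""
    courses_by_topic = {}
    for name, slug in topics.items():
        courses_by_topic[name] = scrape(slug)
        count = len(courses_by_topic[name])
        print('Total number of live ' + name + ' courses - ' + str(count))
    return courses_by_topic

def fake_scrape(slug):
    """Hypothetical stand-in for scrape_courses_by_topic; returns dummy titles."""
    return [slug + ' course 1', slug + ' course 2']

demo = scrape_all_topics(fake_scrape)
```

Because the display name comes from the mapping, the printed label cannot drift out of sync with the topic being scraped.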


> Serializing the course names by topic to an Excel file.

In [17]:
r_courses = pd.DataFrame(data = all_r_courses, columns = ['Course Name'])
r_courses['Topic Name'] = 'R'

In [18]:
python_courses = pd.DataFrame(data = all_python_courses, columns = ['Course Name'])
python_courses['Topic Name'] = 'Python'

In [19]:
sql_courses = pd.DataFrame(data = all_sql_courses, columns = ['Course Name'])
sql_courses['Topic Name'] = 'SQL'

In [20]:
git_courses = pd.DataFrame(data = all_git_courses, columns = ['Course Name'])
git_courses['Topic Name'] = 'Git'

In [24]:
shell_courses = pd.DataFrame(data = all_shell_courses, columns = ['Course Name'])
shell_courses['Topic Name'] = 'Shell'

In [21]:
spreadsheet_courses = pd.DataFrame(data = all_spreadsheets_courses, columns = ['Course Name'])
spreadsheet_courses['Topic Name'] = 'Spreadsheets'

In [25]:
all_courses = pd.concat([r_courses,python_courses,sql_courses,git_courses,shell_courses,spreadsheet_courses])
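
The per-topic DataFrames above could also be built in one pass. A minimal sketch, assuming the course lists are held in a dict keyed by topic name (the dict, the helper `to_records`, and the placeholder titles are hypothetical). It uses plain Python so it runs without pandas; the resulting list of dicts can be handed to `pd.DataFrame(records)` to get the same two columns.

```python
def to_records(courses_by_topic):
    """Flatten {topic: [course, ...]} into rows using the notebook's column names."""
    return [
        {'Course Name': course, 'Topic Name': topic}
        for topic, courses in courses_by_topic.items()
        for course in courses
    ]

# Placeholder titles, not real scrape results
example = {'Git': ['Course X'], 'Shell': ['Course Y', 'Course Z']}
records = to_records(example)
print(len(records))
```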

In [26]:
len(all_courses)

212
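
As a quick sanity check, this total should equal the sum of the per-topic counts printed earlier. The snippet below hard-codes those printed counts (the names `counts` and `total` are only for illustration):

```python
# Per-topic counts as printed by the cells above
counts = {'R': 138, 'Python': 63, 'SQL': 3, 'Git': 1, 'Shell': 3, 'Spreadsheets': 4}
total = sum(counts.values())
print(total)
```

In the notebook itself, comparing `len(all_courses)` against such a sum would catch a topic accidentally left out of the `pd.concat` call.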

In [27]:
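filename = 'DataCamp Courses as of ' + dt.now().strftime('%Y-%m-%d') + '.xlsx'
# The context manager saves and closes the workbook on exit;
# writer.save() is not available in recent pandas releases
with pd.ExcelWriter(filename) as writer:
    all_courses.to_excel(writer, sheet_name='Sheet1', index=False)
print('File saved!')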

File saved!
