# Contents
1. [Description](#Description)<br>
2. [Importing libraries](#Importing-libraries)<br>
3. [Data Science Courses](#Data-Science-Courses)<br>
&emsp;a. [Scraping data](#Scraping-data) for DS courses<br>
&emsp;b. [Analyzing and Visualizing data](#Analyzing-and-Visualizing-data)<br>
4. [Data Analysis Courses](#Data-Analysis-Courses)<br>
&emsp;a. [Scraping data](#Scraping-data) for DA courses<br>
&emsp;b. [Analyzing and Visualizing data](#Analyzing-and-Visualizing-data)<br>

# Description

In this exploratory data analysis notebook, I analyze the rankings of Data Science and Data Analysis courses. The objective of this study is to identify the best courses in these subjects using Classcentral, a MOOC ranking platform.

The study is carried out in two stages: Data Science courses and Data Analysis courses. The main Python libraries used in this notebook are Selenium for loading the Classcentral pages, Beautiful Soup for scraping data from them, Pandas for building dataframes, and Matplotlib, Seaborn, and Plotly for visualizing the data.

The results show that edX and Coursera are the most popular providers of Data Science and Data Analytics courses, while MIT and Johns Hopkins University are the highest ranked institutions that produce these courses.

<strong>NOTE:</strong> Because the study is limited to data from the Classcentral platform, it could be improved substantially by taking a more qualitative approach and by looking closely at individual courses on their respective provider platforms. For instance, on Coursera, IBM's Data Science and Google's Data Analytics are among the highest ranked courses, and each has thousands of reviews. On Classcentral, however, these courses have not yet received enough reviews and thus do not appear in the results section.

## Importing libraries

In [2]:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np

In [3]:
!pip install webdriver_manager --user

Collecting webdriver_manager
  Downloading webdriver_manager-3.4.0-py2.py3-none-any.whl (16 kB)
Collecting crayons
  Downloading crayons-0.4.0-py2.py3-none-any.whl (4.6 kB)
Collecting configparser
  Downloading configparser-5.0.2-py3-none-any.whl (19 kB)
Installing collected packages: crayons, configparser, webdriver-manager
Successfully installed configparser-5.0.2 crayons-0.4.0 webdriver-manager-3.4.0


In [4]:
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())



Current google-chrome version is 89.0.4389
Get LATEST driver version for 89.0.4389
There is no [win32] chromedriver for browser 89.0.4389 in cache
Get LATEST driver version for 89.0.4389
Trying to download new driver from https://chromedriver.storage.googleapis.com/89.0.4389.23/chromedriver_win32.zip
Driver has been saved in cache [C:\Users\User\.wdm\drivers\chromedriver\win32\89.0.4389.23]


In [5]:
driver.get('https://www.classcentral.com/subject/data-science')

In [49]:
# Visualization - Imports:

import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
import plotly.express as px

In [53]:
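import os
import chart_studio
username = 'ulmasovjafarbek'
# Read the API key from the environment rather than hardcoding it in the notebook.
# CHART_STUDIO_API_KEY is a variable name assumed here; export it before running.
api_key = os.environ['CHART_STUDIO_API_KEY']
chart_studio.tools.set_credentials_file(username=username, api_key=api_key)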

# Data Science Courses
## a. Scraping data

In [47]:
# Data Science Courses
Course = []
Institute = []
Start_Date = []
Offered_By = []
No_Of_Reviews = []
Rating = []

In [7]:
content = driver.page_source
soup = BeautifulSoup(content)

In [8]:
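def find_2nd(string, substring):
    # Index of the second occurrence of substring (-1 if there is none)
    return string.find(substring, string.find(substring) + 1)
def find_1st(string, substring):
    # Index of the first occurrence of substring (-1 if there is none)
    return string.find(substring)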
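To make the slicing idiom used in the following cells concrete, here is a small standalone sketch. It redefines the two helpers and applies `b[find_1st(b,'>')+1:find_2nd(b,'<')]` to hand-written sample strings (the markup is illustrative, not copied from Classcentral; `inner_text` is a hypothetical wrapper). It also shows a limitation: the slice assumes the tag has no nested children.

```python
def find_1st(string, substring):
    return string.find(substring)

def find_2nd(string, substring):
    return string.find(substring, string.find(substring) + 1)

def inner_text(tag_html):
    """Slice out the text between the first '>' and the second '<'."""
    return tag_html[find_1st(tag_html, '>') + 1:find_2nd(tag_html, '<')].strip()

# Illustrative markup, shaped like a course-title span
flat = '<span class="text-1 weight-semi line-tight">\n  R Programming\n</span>'
# A nested child tag makes the second '<' arrive before any text
nested = '<span><b>R Programming</b></span>'

print(repr(inner_text(flat)))
print(repr(inner_text(nested)))
```

For the flat span the slice yields the course title; for the nested one it yields an empty string. Beautiful Soup's `get_text(strip=True)` on the tag object may be a more robust alternative when nested markup is possible.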

In [9]:
for i in soup.findAll("span",{'class' : 'text-1 weight-semi line-tight'}):
    b = str(i)
   # print(b[find_1st(b,'>')+1:find_2nd(b,'<')])
    Course.append(b[find_1st(b,'>')+1:find_2nd(b,'<')])

In [14]:
# Getting a list of courses
course = []
for i in Course:
    i = i.strip()
  #  print(i)
    course.append(i)

In [11]:
# Getting a list of institutes:
ins = []
for d in soup.findAll('div', attrs={'class':'truncate'}):
    abc = d.find('a', attrs={'class':'color-charcoal small-down-text-2 text-3'})
    if abc is not None:
        #print(rating.text)
        ins.append(abc.text)
    else:
        ins.append('-1')
        
newIns = []
for i in ins:
    i = i.strip()
    newIns.append(i)
newIns.pop(0)

'-1'

In [15]:
# Getting a list of Providers:
for i in soup.findAll('a',href=True, attrs={'class':'color-charcoal italic'}):
    b = str(i)
    #print(b[find_1st(b,'>')+1:find_2nd(b,'<')])
    Offered_By.append(b[find_1st(b,'>')+1:find_2nd(b,'<')])
provider = []
for i in Offered_By:
    i = i.strip()
    provider.append(i)

In [16]:
# Ratings:
rat = []
for d in soup.findAll('div', attrs={'class':'col border-box text-center nowrap row large-up-text-right padding-horz-small push'}):
    abc = d.find('span', attrs={'class':'xlarge-up-hidden color-charcoal text-center'})
    if abc is not None:
        #print(rating.text)
        rat.append(abc.text)
    else:
        rat.append('-1')

In [43]:
for i in rat:
    i = i.strip()
    #print(i)
    Rating.append(i)
rating = Rating[0:50]

In [19]:
# Num of Reviews
for i in soup.findAll("span",{'class' : 'large-down-hidden block line-tight text-4 color-gray'}):
    b = str(i)
  #  print(b[find_1st(b,'>')+1:find_2nd(b,'<')])
    No_Of_Reviews.append(b[find_1st(b,'>')+1:find_2nd(b,'<')])

In [20]:
num_reviews = []
for i in No_Of_Reviews:
    i = i.strip()
    #print(i)
    num_reviews.append(i) 
    
# Variables: newIns, course, provider, Rating, num_reviews

In [21]:
# Text next to the clock icon (presumably course duration); collected in t but not used below:
t = []
for d in soup.findAll('div', attrs={'class':'small-down-text-2 text-3 row vert-align-middle'}):
    abc = d.find('span', attrs={'class':'hidden medium-up-inline-block small-down-text-2 text-3 large-up-margin-left-xxsmall icon-clock-charcoal icon-left'})
    if abc is not None:
        t.append(abc.text)
    else:
        t.append('-1')
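The lists gathered above are later passed to `pd.DataFrame` as columns, which requires them all to have the same length. A minimal sketch of a pre-check follows; `check_lengths` is a hypothetical helper, not part of the notebook, and the sample lists are made up for illustration.

```python
def check_lengths(**columns):
    """Raise ValueError if the named lists differ in length; otherwise return the shared length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)

# Made-up sample data standing in for the scraped lists
course = ['R Programming', 'The Analytics Edge']
provider = ['Coursera', 'edX']
rating = ['2.8', '4.7', '-1']  # one extra entry, as an untrimmed list might have

print(check_lengths(course=course, provider=provider))
try:
    check_lengths(course=course, provider=provider, rating=rating)
except ValueError as err:
    print(err)
```

A check like this makes it explicit when a list needs trimming (as with `Rating[0:50]`) or needs a leading element dropped (as with `newIns.pop(0)`).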

## b. Analyzing and Visualizing data

In [46]:
dfDS = pd.DataFrame({'course':course,'ratings': rating,'No_of_Reviews':num_reviews,
                  'provider':provider, 'institute':newIns})
dfDS

| | course | ratings | No_of_Reviews | provider | institute |
|---|---|---|---|---|---|
| 0 | R Programming | 2.8 | 245 Reviews | Coursera | Johns Hopkins University |
| 1 | The Data Scientist’s Toolbox | 3.3 | 165 Reviews | Coursera | Johns Hopkins University |
| 2 | Getting and Cleaning Data | 3.5 | 57 Reviews | Coursera | Johns Hopkins University |
| 3 | Computational Social Science | 4.8 | 76 Reviews | Coursera | University of California, Davis |
| 4 | Introduction to Data Science in Python | 2.4 | 46 Reviews | Coursera | University of Michigan |
| 5 | The Analytics Edge | 4.7 | 80 Reviews | edX | Massachusetts Institute of Technology |
| 6 | Probability - The Science of Uncertainty and Data | 4.9 | 32 Reviews | edX | Massachusetts Institute of Technology |
| 7 | Become a Data Analyst | 4.5 | 64 Reviews | Udacity | Kaggle |
| 8 | Statistical Inference | 2.8 | 34 Reviews | Coursera | Johns Hopkins University |
| 9 | Introduction to Big Data | 2.7 | 35 Reviews | Coursera | University of California, San Diego |


In [48]:
# Rating:
dfDS['ratings'] = dfDS['ratings'].astype(float)
dfRating = dfDS.dropna()
dfRating = dfRating[dfRating.ratings != -1]
np.mean(dfRating['ratings'])

3.5799999999999996
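The cleaning step above casts the ratings to float and drops the `-1` placeholder before averaging. The same logic can be sketched in plain Python, using the ten ratings from the rows displayed earlier plus one `'-1'` placeholder added for illustration. Because this is only a subset, the result differs from the notebook's mean, which is computed over the full table.

```python
from statistics import mean

# Ratings of the ten displayed rows, as scraped strings, plus one placeholder
raw = ['2.8', '3.3', '3.5', '4.8', '2.4', '4.7', '4.9', '4.5', '2.8', '2.7', '-1']

ratings = [float(r) for r in raw]
# Drop the sentinel that marks a course without a rating
valid = [r for r in ratings if r != -1]

print(len(valid))
print(round(mean(valid), 2))
```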

In [54]:
# All course ratings

fig = px.histogram(dfRating, x="ratings")
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='DS Ratings')
fig.show()
py.plot(fig)

'temp-plot.html'

In [55]:
# No.of.Reviews:
dfRating['No_of_Reviews'] = dfRating['No_of_Reviews'].str.extract(r'(\d+)', expand=False).astype(int)
# average no of reviews:
np.mean(dfRating['No_of_Reviews'])

31.44
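The review counts arrive as strings such as `245 Reviews`, and the pattern `(\d+)` captures the first run of digits. A small sketch with the standard `re` module shows the same kind of extraction. `parse_review_count` is a hypothetical helper, and the thousands-separator case is an assumption about what larger counts might look like, not something observed in this data.

```python
import re

def parse_review_count(text):
    """Return the leading integer in strings like '245 Reviews', or None if no digits are found."""
    match = re.search(r'\d[\d,]*', text)
    if match is None:
        return None
    return int(match.group().replace(',', ''))

print(parse_review_count('245 Reviews'))
print(parse_review_count('1,234 Reviews'))  # hypothetical format with a separator
print(parse_review_count('No reviews'))
```

A plain `\d+` would stop at the comma and read `1,234` as `1`, so allowing commas inside the match is a cheap safeguard if such strings ever occur.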

In [56]:
# Num of Reviews- all courses
fig = px.histogram(dfRating, x="No_of_Reviews")
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Num of Reviews')
py.plot(fig)
fig.show()

In [57]:
# best courses - highest rated:

bestRated = dfRating.loc[(dfRating['No_of_Reviews'] >= 20) & 
                         (dfRating['ratings'] >= 4)]

In [61]:
bestRated

| | course | ratings | No_of_Reviews | provider | institute |
|---|---|---|---|---|---|
| 3 | Computational Social Science | 4.8 | 76 | Coursera | University of California, Davis |
| 5 | The Analytics Edge | 4.7 | 80 | edX | Massachusetts Institute of Technology |
| 6 | Probability - The Science of Uncertainty and Data | 4.9 | 32 | edX | Massachusetts Institute of Technology |
| 7 | Become a Data Analyst | 4.5 | 64 | Udacity | Kaggle |
| 11 | Python for Data Science | 4.4 | 47 | edX | University of California, San Diego |
| 17 | Mining Massive Datasets | 4.6 | 25 | edX | Stanford University |
| 18 | Introduction to Computational Thinking and Dat... | 4.5 | 31 | edX | Massachusetts Institute of Technology |
| 19 | Digital Marketing Analytics in Practice | 4.2 | 24 | Coursera | University of Illinois at Urbana-Champaign |
| 21 | Spatial Data Science: The New Frontier in Anal... | 5.0 | 41 | Independent | Esri |
| 27 | Whole genome sequencing of bacterial genomes -... | 4.8 | 22 | Coursera | Technical University of Denmark (DTU) |


In [62]:
# best rated course providers:
# pie chart:

dist = bestRated['provider'].value_counts()
colors = ['mediumturquoise', 'darkorange', 'gold', 'lightgreen']
trace = go.Pie(values=(np.array(dist)),labels=dist.index)
layout = go.Layout(title='Best Rated DS Course Provider')
data = [trace]
fig = go.Figure(trace,layout)
fig.update_traces(marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show() 
py.plot(fig)

'temp-plot.html'
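`value_counts()` tallies how often each provider appears among the best-rated courses, and the pie chart shows those tallies as shares. The same tally can be sketched with `collections.Counter`, here using the providers of the ten `bestRated` rows shown above.

```python
from collections import Counter

# Providers of the ten displayed bestRated rows, in table order
providers = ['Coursera', 'edX', 'edX', 'Udacity', 'edX',
             'edX', 'edX', 'Coursera', 'Independent', 'Coursera']

counts = Counter(providers)
total = sum(counts.values())

# Print each provider with its count and share of the total
for name, n in counts.most_common():
    print(f"{name}: {n} ({n / total:.0%})")
```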

In [63]:
# best rated DS institute counts:

institute_count= pd.DataFrame({'institute':bestRated["institute"].value_counts().index, 'counts':bestRated["institute"].value_counts().values}).sort_values("counts")
fig = px.bar(institute_count, x='institute', y='counts')
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Best Rated DS Institute')
py.plot(fig)
fig.show()

In [65]:
# Best Rated Courses:

# Work on a copy so the assignment below does not write into a slice of dfRating
bestRated = bestRated.copy()
# Courses named 'Python for Data Science' come from more than one provider;
# append the provider name so the bars can be told apart
dup = bestRated['course'] == 'Python for Data Science'
bestRated.loc[dup, 'course'] = bestRated.loc[dup, 'course'] + ' ' + bestRated.loc[dup, 'provider']

fig = px.bar(bestRated, x='course', y='ratings')
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Highest Rated DS Courses')
py.plot(fig)
fig.show()



In [66]:
# Best Courses & Num of Reviews

fig = px.bar(bestRated, x='course', y='No_of_Reviews')
fig.update_traces(marker_color="turquoise",marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5)
fig.update_layout(title_text='Best Course Num of Reviews')
py.plot(fig)
fig.show()

# Data Analysis Courses
## a. Scraping data

In [68]:
# Data Analysis Courses:

ratingg = []
reviewNum = []
title = []
providerr = []
institutee = [] 

In [69]:
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.classcentral.com/subject/data-analysis')

content = driver.page_source
soup = BeautifulSoup(content)



Current google-chrome version is 89.0.4389
Get LATEST driver version for 89.0.4389
Driver [C:\Users\User\.wdm\drivers\chromedriver\win32\89.0.4389.23\chromedriver.exe] found in cache


In [70]:
# course title
for i in soup.findAll("span",{'class' : 'text-1 weight-semi line-tight'}):
    b = str(i)
    #print(b[find_1st(b,'>')+1:find_2nd(b,'<')])
    title.append(b[find_1st(b,'>')+1:find_2nd(b,'<')])

In [71]:
coursetitle = []
for i in title:
    i = i.strip()
    coursetitle.append(i)

In [72]:
# Institute:
inst = []
for d in soup.findAll('div', attrs={'class':'truncate'}):
    abc = d.find('a', attrs={'class':'color-charcoal small-down-text-2 text-3'})
    if abc is not None:
        #print(rating.text)
        inst.append(abc.text)
    else:
        inst.append('-1')

In [73]:
newInstitute = []
for i in inst:
    i = i.strip()
    newInstitute.append(i)
newInstitute.pop(0)

'-1'

In [74]:
# Provider:

for i in soup.findAll('a',href=True, attrs={'class':'color-charcoal italic'}):
    b = str(i)
    #print(b[find_1st(b,'>')+1:find_2nd(b,'<')])
    providerr.append(b[find_1st(b,'>')+1:find_2nd(b,'<')])
analysisProv = []
for i in providerr:
    i = i.strip()
    analysisProv.append(i)

In [75]:
# rating:
rat_ = []
for d in soup.findAll('div', attrs={'class':'col border-box text-center nowrap row large-up-text-right padding-horz-small push'}):
    abc = d.find('span', attrs={'class':'xlarge-up-hidden color-charcoal text-center'})
    if abc is not None:
        #print(rating.text)
        rat_.append(abc.text)
    else:
        rat_.append('-1')

In [76]:
ratingg = []
for i in rat_:
    i = i.strip()
    ratingg.append(i)

In [77]:
reviewws = []
for d in soup.findAll('div', attrs={'class':'width-100'}):
    abc = d.find('span', attrs={'class':'large-down-hidden block line-tight text-4 color-gray'})
    if abc is not None:
        #print(rating.text)
        reviewws.append(abc.text)
    else:
        reviewws.append('-1')

In [78]:
revs = reviewws[8:-8]

In [79]:
for i in revs:
    i = i.strip()
    reviewNum.append(i)

## b. Analyzing and Visualizing data

In [84]:
# create dataframe:
dfAnalysis = pd.DataFrame({'course':coursetitle,'ratings': ratingg,'No_of_Reviews':reviewNum[0:50],
                  'provider':analysisProv, 'institute':newInstitute})
dfAnalysis

| | course | ratings | No_of_Reviews | provider | institute |
|---|---|---|---|---|---|
| 0 | Getting and Cleaning Data | 3.5 | 64 Reviews | Coursera | Johns Hopkins University |
| 1 | Become a Data Analyst | 4.5 | 39 Reviews | Udacity | Kaggle |
| 2 | Exploratory Data Analysis | 3.9 | 26 Reviews | Coursera | Johns Hopkins University |
| 3 | Mastering Data Analysis in Excel | 1.8 | 24 Reviews | Coursera | Duke University |
| 4 | Digital Marketing Analytics in Practice | 4.2 | 9 Reviews | Coursera | University of Illinois at Urbana-Champaign |
| 5 | Managing Data Analysis | 2.7 | 5 Reviews | Coursera | Johns Hopkins University |
| 6 | People Analytics | 4.2 | 4 Reviews | Coursera | University of Pennsylvania |
| 7 | Mathematical Biostatistics Boot Camp 2 | 4.0 | 13 Reviews | Coursera | Johns Hopkins University |
| 8 | Data Analysis for Social Scientists | 3.2 | 3 Reviews | edX | Massachusetts Institute of Technology |
| 9 | Causal Diagrams: Draw Your Assumptions Before ... | 5.0 | 3 Reviews | edX | Harvard University |


In [85]:
dfAnalysis['ratings'] = dfAnalysis['ratings'].astype(float)
AnalysisRating = dfAnalysis.dropna()
AnalysisRating = AnalysisRating[AnalysisRating.ratings != -1]
np.mean(AnalysisRating['ratings'])

3.6769230769230767

In [86]:
# Visualize Overall Rating:

fig = px.histogram(AnalysisRating, x="ratings")
fig.update_traces(marker_color="midnightblue",marker_line_color='black',
                  marker_line_width=1.5)
fig.update_layout(title_text='Data Analysis Course Ratings',xaxis=dict(range=[1, 5]))
fig.show()
py.plot(fig)

'temp-plot.html'

In [87]:
# No.of.Reviews:
AnalysisRating['No_of_Reviews'] = AnalysisRating['No_of_Reviews'].str.extract(r'(\d+)', expand=False).astype(int)
# average no of reviews:
np.mean(AnalysisRating['No_of_Reviews'])

9.038461538461538

In [88]:
# Num of Reviews- all courses

fig = px.histogram(AnalysisRating, x="No_of_Reviews")
fig.update_traces(marker_color="midnightblue",marker_line_color='black',
                  marker_line_width=1.5)
fig.update_layout(title_text='Data Analysis: Number of Reviews')
fig.show()
py.plot(fig)

'temp-plot.html'

In [89]:
# best courses - highest rated:

bestAnalysis = AnalysisRating.loc[(AnalysisRating['No_of_Reviews'] >= 5) & 
                         (AnalysisRating['ratings'] >= 3)]

In [90]:
bestAnalysis

| | course | ratings | No_of_Reviews | provider | institute |
|---|---|---|---|---|---|
| 0 | Getting and Cleaning Data | 3.5 | 64 | Coursera | Johns Hopkins University |
| 1 | Become a Data Analyst | 4.5 | 39 | Udacity | Kaggle |
| 2 | Exploratory Data Analysis | 3.9 | 26 | Coursera | Johns Hopkins University |
| 4 | Digital Marketing Analytics in Practice | 4.2 | 9 | Coursera | University of Illinois at Urbana-Champaign |
| 7 | Mathematical Biostatistics Boot Camp 2 | 4.0 | 13 | Coursera | Johns Hopkins University |
| 10 | High-Dimensional Data Analysis | 3.7 | 10 | edX | Harvard University |
| 30 | Practical Time Series Analysis | 4.5 | 6 | Coursera | State University of New York |
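The filter keeps courses with at least 5 reviews and a rating of at least 3. As a quick check, the same thresholds applied in plain Python to the first ten rows of `dfAnalysis` shown earlier reproduce the first five rows of the table above (indices 0, 1, 2, 4, and 7).

```python
# (index, rating, number of reviews) for the first ten dfAnalysis rows
rows = [
    (0, 3.5, 64), (1, 4.5, 39), (2, 3.9, 26), (3, 1.8, 24), (4, 4.2, 9),
    (5, 2.7, 5), (6, 4.2, 4), (7, 4.0, 13), (8, 3.2, 3), (9, 5.0, 3),
]

MIN_REVIEWS = 5
MIN_RATING = 3

# Keep only rows that pass both thresholds
kept = [idx for idx, rating, reviews in rows
        if reviews >= MIN_REVIEWS and rating >= MIN_RATING]

print(kept)
```

Note that these thresholds are looser than the ones used for the Data Science courses (20 reviews and a rating of 4), which is consistent with the much lower average number of reviews in this subject.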


In [91]:
# best rated course providers:
# pie chart:

dist = bestAnalysis['provider'].value_counts()
colors = ['mediumturquoise', 'darkorange', 'gold', 'lightgreen']
trace = go.Pie(values=(np.array(dist)),labels=dist.index)
layout = go.Layout(title='Best Rated Analysis Course Provider')
data = [trace]
fig = go.Figure(trace,layout)
fig.update_traces(marker=dict(colors=colors, line=dict(color='#000000', width=2)))
fig.show() 
py.plot(fig)

'temp-plot.html'

In [92]:
# best rated DA institute counts:

institute_count= pd.DataFrame({'institute':bestAnalysis["institute"].value_counts().index, 'counts':bestAnalysis["institute"].value_counts().values}).sort_values("counts")
fig = px.bar(institute_count, x='institute', y='counts')
fig.update_traces(marker_color="midnightblue",marker_line_color='black',
                  marker_line_width=1.5)
fig.update_layout(title_text='Highest Rated DA Institute')
py.plot(fig)
fig.show()

In [93]:
# Best Rated Courses:

fig = px.bar(bestAnalysis, x='course', y='ratings')
fig.update_traces(marker_color="midnightblue",marker_line_color='black',
                  marker_line_width=3.5)
fig.update_layout(title_text='Highest Rated DA Courses')
py.plot(fig)
fig.show()

In [94]:
# Best Courses & Num of Reviews

fig = px.bar(bestAnalysis, x='course', y='No_of_Reviews')
fig.update_traces(marker_color="midnightblue",marker_line_color='black',
                  marker_line_width=1.5)
fig.update_layout(title_text='Best Analysis Courses: Num of Reviews')
py.plot(fig)
fig.show()