In [None]:
from IPython.display import Image
from IPython.core.display import HTML 

Image(url= "https://findlogovector.com/wp-content/uploads/2018/11/udemy-logo-vector.png")

# Overview
This dataset is a compilation of all the Finance & Accounting related courses (about 13 thousand) available on Udemy's website. Under the Finance & Accounting category, there are courses on Finance, Accounting, Bookkeeping, Compliance, Cryptocurrency & Blockchain, Economics, Investing & Trading, Taxes and more, each with multiple courses under its domain.

The goal of this notebook is to conduct an exploratory data analysis and to demonstrate some data cleaning and preprocessing techniques along the way.

#### Main Question
Which feature(s) affect the ratings of courses on Udemy?

In [None]:
# import relevant packages
import pandas as pd
import numpy as np
import pandas_profiling 

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [None]:
# load dataset
df = pd.read_csv('../input/finance-accounting-courses-udemy-13k-course/udemy_output_All_Finance__Accounting_p1_p626.csv')

In [None]:
df.head()

In [None]:
#get dataset details
df.info()

In [None]:
#counts of missing values in each feature
df.isnull().sum()

In [None]:
# visualize where the missing values fall across rows and columns
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')

In [None]:
# getting a deeper dive into missing cells to understand factors behind their emptiness in order to determine next steps
df1 = df[df.isna().any(axis=1)]
df1

Looking at these rows, it makes sense for the discount and price details to be null when a course is free (where `is_paid` is False), since there is no price to record.
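
This observation can also be checked directly instead of by eye. The sketch below uses a small made-up frame (`toy` and its values are assumptions, not rows from the dataset) with the same column names, and counts how many rows with a missing price belong to paid courses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that mimics the relevant columns of the dataset
toy = pd.DataFrame({
    "is_paid": [True, True, False, False],
    "price_detail__amount": [8640.0, 1280.0, np.nan, np.nan],
    "discount_price__amount": [455.0, np.nan, np.nan, np.nan],
})

# Rows where the list price is missing
missing_price = toy[toy["price_detail__amount"].isna()]

# If the observation holds, no paid course should have a missing price
paid_with_missing_price = int(missing_price["is_paid"].sum())
print("paid courses with missing price:", paid_with_missing_price)

# Count missing discounts separately for free and paid courses;
# in this toy frame one paid course also lacks a discount, which the check surfaces
missing_discount_by_paid = (
    toy["discount_price__amount"].isna().groupby(toy["is_paid"]).sum()
)
print(missing_discount_by_paid)
```

Running the same two checks on `df` (before the nulls are filled) would show whether every missing value really comes from a free course.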

In [None]:
# drop irrelevant and duplicate feature columns 
df = df.drop(['discount_price__currency', 'discount_price__price_string', 
              'price_detail__currency', 'price_detail__price_string'], axis = 1)

In [None]:
# replace nulls with 0 in the price and discount columns (free courses have no price)
df["price_detail__amount"] = df["price_detail__amount"].fillna(0)
df["discount_price__amount"] = df["discount_price__amount"].fillna(0)

In [None]:
#run a check to ensure null values have been taken care of
df.isnull().sum()

In [None]:
# graphical distribution of all numeric features
numerics = ['num_subscribers', 'avg_rating', 'avg_rating_recent', 'rating', 'num_reviews', 'num_published_lectures', 'num_published_practice_tests', 'discount_price__amount', 'price_detail__amount']
df.loc[:,numerics].hist(color='red', figsize=(20,10), edgecolor='white')
plt.show()
display(df[numerics].describe())


In [None]:
# number of unique values (labels) per column, which helps to spot categorical variables
for col in df.columns:
    print(col, ':', len(df[col].unique()), 'labels')

In [None]:
# dropping irrelevant feature columns
df = df.drop(['is_wishlisted'], axis = 1) # dropped because the value appears to be False for every course
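
Columns like this one can also be found programmatically. A minimal sketch, assuming a made-up frame (`toy` and its values are hypothetical): any column with a single distinct value carries no information for the analysis and is a candidate for dropping.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "is_wishlisted": [False, False, False],
    "is_paid": [True, False, True],
    "rating": [4.5, 3.9, 4.1],
})

# Collect columns that hold exactly one distinct value (NaN counted as a value)
constant_cols = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]
print(constant_cols)

# Drop them in one step
toy = toy.drop(columns=constant_cols)
```

Listing the constant columns first, rather than dropping by name, makes it easy to confirm that nothing else in the dataset is constant.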

In [None]:
# pandas-profiling reads the dataset and produces a summarized report
profile = df.profile_report(title = "Dataset Report")

In [None]:
profile

# Visualizations

In [None]:
# check for collinearity between features using a heatmap
# restrict to numeric/boolean columns, since recent pandas versions raise an error on text columns
matrix = df.select_dtypes(include=['number', 'bool']).corr()
mask = np.triu(np.ones_like(matrix, dtype=bool))
plt.figure(figsize = (25,17))

sns.heatmap(matrix, mask=mask, center = 0, annot = True, fmt='.2f', square = True, cmap="RdYlGn")
plt.show()


In [None]:
# scatter plot to check the collinearity between avg_rating and rating
sns.set()
plt.figure(figsize = (16,9))
sns.scatterplot(x=df.loc[:,'rating'], y=df.loc[:,'avg_rating'], palette = 'rocket', hue=df.is_paid)

Based on this graph, it is sensible to drop one of these two features, since they are linearly related and carry similar information.
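
The visual impression can be backed by a number. A minimal sketch, assuming made-up values (`toy`) and an arbitrary cut-off (`THRESHOLD` is an assumption, not a rule from this notebook): compute the Pearson correlation between the two columns and flag the pair as redundant when it is very high.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "rating": [4.1, 4.5, 3.8, 4.9, 2.7],
    "avg_rating": [4.0, 4.6, 3.9, 4.8, 2.9],
})

# Series.corr computes the Pearson correlation coefficient by default
r = toy["rating"].corr(toy["avg_rating"])

# Assumed cut-off above which one of the two features is treated as redundant
THRESHOLD = 0.95
is_redundant = r > THRESHOLD

print(f"Pearson r = {r:.3f}, redundant: {is_redundant}")
```

The same call on `df['rating']` and `df['avg_rating']` gives the value that the heatmap above displays for this pair.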

In [None]:
# drop avg_rating since it is effectively a duplicate of rating
df = df.drop(['avg_rating'], axis = 1)

## Number of Published Lectures vs Rating

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.scatterplot(x=df.loc[:,'rating'], y=df.loc[:,'num_published_lectures'], palette = 'icefire', hue=df.is_paid)

We can deduce that a higher number of published lectures does not necessarily result in better ratings. However, a large number of courses with between 0 and 200 published lectures have above-average ratings.
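
A scatter plot with thousands of points can hide the trend, so a binned summary is a useful complement. The sketch below uses made-up values (`toy`), and the band edges in `bins` are an assumption chosen only for illustration: it groups courses by lecture count and reports the median rating per band.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "num_published_lectures": [12, 45, 150, 180, 320, 610],
    "rating": [4.2, 3.9, 4.4, 4.1, 4.0, 4.3],
})

# Assumed band edges: (0, 50], (50, 200], (200, 1000]
bins = [0, 50, 200, 1000]
labels = ["0-50", "51-200", "201+"]
toy["lecture_band"] = pd.cut(toy["num_published_lectures"], bins=bins, labels=labels)

# Number of courses and median rating in each band
summary = toy.groupby("lecture_band", observed=True)["rating"].agg(["count", "median"])
print(summary)
```

If the medians are similar across bands, that supports the reading that lecture count alone does not drive ratings.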

## Number of Subscribers vs Rating

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.scatterplot(x=df.loc[:,'rating'], y=df.loc[:,'num_subscribers'], hue=df.is_paid, palette='magma')

The graph above shows us the following:
* The number of subscribers does not appear to affect the rating of a course. Most courses are concentrated at the bottom, between 0 and 50,000 subscribers, across the whole range of ratings.
* The few courses with both a very high number of subscribers and high ratings are mostly paid courses.
* A large number of free courses with fewer than 50,000 subscribers have good ratings ranging from 3.5 to 5.0, which I find impressive.
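
The first point can be checked with a rank correlation, which is less sensitive to the few courses with extremely high subscriber counts. A minimal sketch on made-up values (`toy` is hypothetical): the Spearman coefficient is computed here as the Pearson correlation of the ranks, which avoids any dependency beyond pandas.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "num_subscribers": [120, 3500, 48000, 150000, 900, 22000],
    "rating": [4.6, 3.8, 4.1, 4.5, 4.4, 3.9],
})

# Spearman rank correlation = Pearson correlation computed on the ranks
subscriber_ranks = toy["num_subscribers"].rank()
rating_ranks = toy["rating"].rank()
rho = subscriber_ranks.corr(rating_ranks)

print(f"Spearman rho = {rho:.2f}")
```

A coefficient close to 0 on the real data would be consistent with the plot, while a value near 1 or -1 would contradict it.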

In [None]:
# this gives me details on dispersed points on the far top-right of the graph 

df.loc[:,['title', 'is_paid','rating','num_subscribers', 'num_reviews']].sort_values(
    by = 'num_subscribers', ascending = False)

## Number of Reviews vs Rating

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.scatterplot(x=df.loc[:,'rating'], y=df.loc[:,'num_reviews'], hue=df.is_paid, palette='viridis')

The majority of free courses have better ratings (above 3.5) compared to paid courses. Also, the higher the number of reviews, the greater the chance of an above-average rating.
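
The comparison between free and paid courses can be expressed as a share per group. The sketch below runs on made-up values (`toy` is hypothetical) and uses the 3.5 cut-off mentioned above: it computes the fraction of courses rated above 3.5 separately for free and paid courses.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "is_paid": [True, True, True, True, False, False, False],
    "rating": [4.4, 3.2, 4.0, 2.9, 4.1, 3.9, 3.3],
})

# Flag courses rated above the 3.5 cut-off
toy["above_3_5"] = toy["rating"] > 3.5

# The mean of a boolean column is the share of True values in each group
share = toy.groupby("is_paid")["above_3_5"].mean()
print(share)
```

Comparing shares rather than raw counts matters here because there may be far more paid courses than free ones.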

## Published Practice Tests vs Rating

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.histplot(x=df.loc[:,'rating'], y=df.loc[:,'num_published_practice_tests'], cmap='magma') # cmap sets the colors of a 2D histogram; palette only applies together with hue

## Number of Subscribers vs Paid/Free courses

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.barplot(x=df.loc[:,'is_paid'], y=df.loc[:,'num_subscribers'], palette = 'magma')

## Number of Reviews vs Paid/Free courses

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.barplot(x=df.loc[:,'is_paid'], y=df.loc[:,'num_reviews'])

# Data Cleaning Process and more visuals

In [None]:
# convert amounts to USD using a fixed rate of 0.014; the original columns are dropped in a later cell
df["price_amount_usd"] = df["price_detail__amount"] * 0.014
df["discount_amount_usd"] = df["discount_price__amount"] * 0.014
df.head()

In [None]:
#this can serve as a check to verify all non-paying courses are without dollar amounts
sns.set()
plt.figure(figsize = (16,9))
sns.barplot(x=df.loc[:,'is_paid'], y=df.loc[:,'price_amount_usd'])

## Price Amount vs Published Practice Tests

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.histplot(x=df.loc[:,'num_published_practice_tests'], y=df.loc[:,'price_amount_usd'])

## Price Amount vs Rating

In [None]:
sns.set()
plt.figure(figsize = (18,10))
sns.histplot(x=df.loc[:,'rating'], y=df.loc[:,'price_amount_usd'], palette = 'viridis')

The vast majority of courses with good ratings of around 4.0 are priced at around 20 and 120 dollars (the deeper shade of green). There are also some very expensive courses with very poor ratings.
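
The same reading can be tabulated by price band. A minimal sketch, assuming made-up values (`toy`) and band edges chosen only for illustration (`bins` and `labels` are assumptions): it cross-tabulates price bands against whether a course is rated 4.0 or higher.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "price_amount_usd": [0.0, 18.0, 25.0, 110.0, 120.0, 180.0],
    "rating": [4.2, 4.5, 3.6, 4.1, 4.3, 2.8],
})

# Assumed price bands; the lower edge sits just below 0 so free courses are included
bins = [-0.01, 20, 120, float("inf")]
labels = ["<=20", "20-120", ">120"]
toy["price_band"] = pd.cut(toy["price_amount_usd"], bins=bins, labels=labels)

# Label each course by whether its rating reaches 4.0
toy["rating_group"] = np.where(toy["rating"] >= 4.0, "4.0+", "below 4.0")

# Count courses in each combination of price band and rating group
table = pd.crosstab(toy["price_band"], toy["rating_group"])
print(table)
```

A table like this makes it easier to compare bands than a density plot, where sparse regions are hard to read.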

In [None]:
sns.set()
plt.figure(figsize = (16,9))
sns.scatterplot(x=df.loc[:,'num_subscribers'], y=df.loc[:,'price_amount_usd'], color = 'salmon')

A course's price being low or high does not appear to affect its number of subscribers. A good number of subscribers are on courses priced at around 120 dollars.

In [None]:
# dropping other irrelevant/duplicate features
df = df.drop(['id', 'url', 'discount_price__amount', 
              'price_detail__amount', 'avg_rating_recent'], axis = 1)

In [None]:
# further feature engineering and analysis can be carried out on this cleaned version of the dataset below
df.head()

# Key Takeaway

Following this exploratory analysis, a few things stood out about Udemy ratings:
* The majority of free courses with fewer published lectures tend to have better (above-average) ratings compared to paid courses.
* The number of subscribers does not ultimately signify good ratings. There were courses with few subscribers and very good ratings above 3.5. As the saying goes, "quality over quantity".
* From a course instructor's standpoint, it should be noted that the number of published practice tests does not guarantee a very good rating for a course. The vast majority of courses with excellent ratings have 0 tests.
* After analysing the variables affecting ratings, I conclude that the price of a course appears to have an adverse effect on its rating. Courses priced at around 10 to 20 dollars seemed to have more impressive ratings than a handful of more expensive courses. There is a caveat, however: most courses on Udemy are discounted and offered at the 10 to 20 dollar rate, which is consistent with the conclusion above. A reviewer rating a 10 dollar course will have different expectations than one rating a more expensive course. This can bias the overall ratings and the public impression of a course.