# Udemy EDA

In [None]:
from IPython.display import Image

In [None]:
Image("../input/images/download.png")

# Introduction

Udemy, Inc. is an American massive open online course (MOOC) provider aimed at professional adults and students. It was founded in May 2010 by Eren Bali, Gagan Biyani, and Oktay Caglar.

As of Jan 2020, the platform has more than 35 million students and 57,000 instructors teaching courses in over 65 languages. There have been over 400 million course enrollments. Students and instructors come from 180+ countries, and two thirds of the students are located outside the U.S. [https://en.wikipedia.org/wiki/Udemy](https://en.wikipedia.org/wiki/Udemy)

**Purpose of the Notebook**

The purpose of the notebook is to do a basic exploratory data analysis (EDA) on the dataset.

# About

A compilation of all the Finance & Accounting courses (about 13 thousand) available on Udemy's website. Under this category, there are courses on Finance, Accounting, Bookkeeping, Compliance, Cryptocurrency, Blockchain, Economics, Investing & Trading, Taxes and much more, each having multiple courses under its domain.

# Table of Contents
1. [Data Loading & Cleaning](#1.-Data-Loading-&-Cleaning)
2. [Descriptive Analysis](#2.-Descriptive-Analysis)
3. [EDA](#3-EDA)

# 1. Data Loading & Cleaning

In [None]:
# Data manipulation
import pandas as pd
import numpy as np

# Data viz
import matplotlib.pyplot as plt
from matplotlib.colors import TwoSlopeNorm as DivergingNorm  # DivergingNorm was renamed TwoSlopeNorm in matplotlib 3.2
import seaborn as sns
import plotly.express as px

%matplotlib inline
import warnings  # np.warnings is not available in recent NumPy releases
warnings.filterwarnings('ignore')

# Oversampling
#from imblearn.over_sampling import RandomOverSampler

In [None]:
udemy = pd.read_csv('../input/finance-accounting-courses-udemy-13k-course/udemy_output_All_Finance__Accounting_p1_p626.csv')

display(udemy.head(3))

## Dropping Columns, Duplicates & Changing Data Types

The columns below are dropped because they hold repeated or uninformative values.

Duplicate rows are also removed, and 'published_time' is converted to a datetime.

In [None]:
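# Drop columns that hold repeated or uninformative values
udemy.drop(['url', 'discount_price__currency', 'discount_price__price_string', 'price_detail__currency', 'price_detail__price_string', 'created'], axis=1, inplace=True)
# drop_duplicates() returns a new DataFrame, so the result has to be assigned back
udemy = udemy.drop_duplicates()
# Parse the publication timestamp so the .dt accessor can be used later
udemy['published_time'] = pd.to_datetime(udemy['published_time'])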

In [None]:
display(udemy.head(3))
print(udemy.info())

## Missing Values

Let's check where the missing values are located.

In [None]:
sns.heatmap(udemy.isnull())

In [None]:
total_rows = udemy.shape[0]

# Report the share of nulls in each price column
for column in ['discount_price__amount', 'price_detail__amount']:
    nulls = udemy[column].isnull().sum()
    proportion = nulls / total_rows * 100
    print("{} out of {} in column '{}' are nulls: {:.2f}% of total rows".format(nulls, total_rows, column, proportion), '\n')
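The same check can be run over every column at once. The sketch below is a minimal illustration on a made-up toy frame (the values and the name `toy` are assumptions, not rows from the dataset): the column mean of `isnull()` is the share of nulls in that column.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    'discount_price__amount': [455.0, np.nan, 455.0, np.nan],
    'price_detail__amount': [8640.0, 1280.0, np.nan, 3200.0],
    'num_subscribers': [10, 20, 30, 40],
})

# isnull() gives a boolean frame; its column mean is the share of nulls
null_share = toy.isnull().mean().mul(100).round(1)

# Show only the columns that actually contain nulls
print(null_share[null_share > 0])
```

On the real data, replacing `toy` with `udemy` should give the same percentages as the two prints above, plus any other column that has nulls.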

The number of nulls in column 'discount_price__amount' is significant. It is therefore wise to inspect them further before dropping or modifying them.

In [None]:
display(udemy[udemy['discount_price__amount']==0].head())
display(udemy[udemy['price_detail__amount']==0].head())

Since neither column contains any 0 values, it is reasonable to conclude that courses with no discount or no price were recorded as NaN.
The next step is to replace the NaNs with 0s.

In [None]:
udemy.fillna(0, inplace=True)
sns.heatmap(udemy.isnull())

Good to go!

# 2. Descriptive Analysis

In [None]:
udemy.info()

The descriptive analysis is divided into three sections: numbers, booleans and dates.

## 2.1 Numbers

In [None]:
numbers = ['num_subscribers', 'avg_rating', 'avg_rating_recent', 'rating', 'num_reviews', 'num_published_lectures', 'num_published_practice_tests', 'discount_price__amount', 'price_detail__amount']
udemy.loc[:,numbers].hist(color='salmon', figsize=(20,10), edgecolor='white')
plt.show()
display(udemy[numbers].describe())

The histograms and summary table above give some insights into how the data is distributed:
* 'num_subscribers', 'num_reviews', 'num_published_lectures' and 'discount_price__amount' are heavily skewed by outliers
* Half of the courses are priced between 1280 and 8640 (the IQR)
* Ratings tend to be above 3
* The median course has 533 subscribers

The outliers in 'num_subscribers', 'num_reviews', 'num_published_lectures' and 'discount_price__amount' make these distributions hard to interpret. Let's filter out the most popular courses and zoom in on them.

In [None]:
udemy_o = udemy[['num_subscribers', 'num_reviews', 'num_published_lectures', 'discount_price__amount']]
# Keep only courses with fewer than 6000 subscribers to zoom in on the bulk of the data
udemy_o[udemy_o['num_subscribers']<6000].hist(color='salmon', figsize=(10,5), edgecolor='black')

Distributions like these, where most of the values fall in the first bin and the counts then drop off quickly, are really common. A usual feature engineering step in these cases is to apply a log transform, which makes the distribution closer to normal and can help our ML model.

An example with 'num_subscribers':

In [None]:
# log1p(x) = log(1 + x), so courses with 0 subscribers stay defined
log_subscribers = np.log1p(udemy['num_subscribers'])
log_subscribers.hist(color='salmon', edgecolor='black')
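To put a number on what the log transform does, the sketch below compares the skewness before and after on synthetic data. The log-normal sample is an assumption standing in for 'num_subscribers'; it is not drawn from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed counts (log-normal), standing in for 'num_subscribers'
subscribers = pd.Series(rng.lognormal(mean=6, sigma=1.5, size=5000)).round()

# log1p(x) = log(1 + x), so zero counts are handled without producing -inf
log_subscribers = np.log1p(subscribers)

# Skewness near 0 means a roughly symmetric distribution
print('skew before:', subscribers.skew())
print('skew after: ', log_subscribers.skew())
```

The skewness after the transform should be much closer to 0 than before, which is what "more normal" means in practice here.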

## 2.2 Booleans

In [None]:
booleans = ['is_paid', 'is_wishlisted']

fig, ax = plt.subplots(1,2, figsize=(15,5))

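# Pass the column as x= explicitly; recent seaborn versions no longer accept it positionally
sns.countplot(x=udemy['is_paid'], ax=ax[0], palette='Set3')
sns.countplot(x=udemy['is_wishlisted'], ax=ax[1], palette='Set3')
plt.show()

# Counts and percentages of paid vs. free courses
# (the counts column is renamed explicitly because its default name differs across pandas versions)
paid = udemy['is_paid'].value_counts().rename('count').to_frame()
paid['percentage'] = paid['count'] / paid['count'].sum() * 100
paid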

- Since column 'is_wishlisted' has only one value, it can be dropped

- We have highly unbalanced data (496 courses are free and 13112 are paid). This imbalance can make it hard for our ML model to learn the free class. Nevertheless, we are going to try a sampling method to make the data a better fit for the model.

- Since we have so few free courses (496), we are going to use an oversampling technique: RandomOverSampler, which randomly duplicates rows of the minority class until both classes have the same size. An example is kept below, commented out

In [None]:
udemy.drop('is_wishlisted', axis=1, inplace=True)

In [None]:
udemy.head(1)

In [None]:
"""ros = RandomOverSampler(random_state=42)

X = udemy.drop(['is_paid'], axis=1)
y = udemy['is_paid']

x_ros, y_ros = ros.fit_resample(X, y)

fig = px.histogram(data_frame=y_ros)
fig.show()

"""

- This will come in handy when building our ML model
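Since the imblearn cell above is commented out, here is a minimal sketch of the same idea using only pandas. It is not the `RandomOverSampler` implementation, just plain sampling with replacement on a made-up toy frame (the frame `toy` and its values are assumptions).

```python
import pandas as pd

# Toy data with the same kind of imbalance (values are made up)
toy = pd.DataFrame({
    'is_paid': [True] * 8 + [False] * 2,
    'num_subscribers': range(10),
})

majority = toy[toy['is_paid']]
minority = toy[~toy['is_paid']]

# Draw minority rows with replacement until both classes have the same size
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, minority_upsampled], ignore_index=True)

print(balanced['is_paid'].value_counts())
```

When this is used for a model, the oversampling would normally be applied to the training split only, so that duplicated rows do not leak into the evaluation data.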

## 2.3 Dates

In [None]:
from matplotlib.colors import TwoSlopeNorm  # DivergingNorm was renamed TwoSlopeNorm in matplotlib 3.2

fig, ax = plt.subplots(2, 1, figsize=(25, 10))

# One panel per period: yearly ('Y') on top, monthly ('M') below
for axis, freq, rotation in [(ax[0], 'Y', 45), (ax[1], 'M', 90)]:
    # Courses published per period, in chronological order
    counts = udemy['published_time'].dt.to_period(freq).value_counts().sort_index()

    # Shade each bar by its count, centred on the midpoint of the range
    vmin, vmax = counts.min(), counts.max()
    norm = TwoSlopeNorm(vmin=vmin, vcenter=(vmax + vmin) / 2, vmax=vmax)
    colors = [plt.cm.Greens(norm(c)) for c in counts]

    sns.barplot(x=counts.index.astype(str), y=counts.values, palette=colors, ax=axis)
    axis.set_ylabel('published courses')
    axis.tick_params(axis='x', rotation=rotation)

plt.show()


The number of published courses has been increasing over the last decade, with small dips in 2016 and 2019.

# 3 EDA

## 3.1 Categories

Since the dataset doesn't have categories, we are going to create them from the 'rating' column using the pandas function 'cut'.
Then we are going to group the dataset by these categories and look for patterns in the other columns.

In [None]:
udemy['rating_binned'] = pd.cut(udemy['rating'], [0,1,2,3,4,5])
display(udemy.groupby('rating_binned')[['num_subscribers', 'num_reviews', 'num_published_lectures', 'discount_price__amount', 'price_detail__amount', 'num_published_practice_tests']].agg(['mean', 'median','size']))
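One detail of 'cut' is worth keeping in mind: by default the bins are closed on the right, so a rating of exactly 0 falls outside the first bin and becomes NaN. The sketch below shows this on made-up ratings (the values are assumptions); `include_lowest=True` is one way to keep such rows, if the dataset has any.

```python
import pandas as pd

# Made-up ratings, including an exact 0 and values on bin edges
ratings = pd.Series([0.0, 0.5, 3.0, 3.2, 4.7, 5.0])

# Default bins are right-closed: (0, 1], (1, 2], ... so 0.0 lands in no bin
default_bins = pd.cut(ratings, [0, 1, 2, 3, 4, 5])

# include_lowest=True closes the first bin on the left so 0.0 is kept
with_zero = pd.cut(ratings, [0, 1, 2, 3, 4, 5], include_lowest=True)

print(pd.DataFrame({'rating': ratings, 'default': default_bins, 'include_lowest': with_zero}))
```

Rows that end up with a NaN bin are silently left out of a groupby on the binned column, so the group sizes may not add up to the number of courses.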

* There is a huge discrepancy between the mean and the median in the 'num_subscribers' and 'num_reviews' columns. This is because of the outliers: a few courses that are extremely popular
* There is an increasing trend in every column except 'discount_price__amount': the higher the rating, the greater the number of subscribers, reviews and published lectures, and the higher the price
* Since the mean is so severely influenced by the outliers, the results are going to be shown with boxplots, which show the median and the IQR
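The effect of a single very popular course on the mean can be seen in a small sketch. The subscriber counts below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up subscriber counts: nine ordinary courses and one blockbuster
subscribers = pd.Series([120, 300, 450, 500, 533, 600, 700, 900, 1200, 250000])

print('mean:  ', subscribers.mean())
print('median:', subscribers.median())

# Dropping the single outlier barely moves the median but collapses the mean
trimmed = subscribers[subscribers < 100000]
print('mean without outlier:  ', trimmed.mean())
print('median without outlier:', trimmed.median())
```

The median stays in the same range with or without the outlier, which is why it is the safer summary for columns like 'num_subscribers'.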

In [None]:
fig, ax = plt.subplots(1,4,figsize=(25,7))

sns.boxplot(x='rating_binned', y='num_subscribers', data=udemy, showfliers = False, ax=ax[0], palette='Set3')
sns.boxplot(x='rating_binned', y='num_reviews', data=udemy, showfliers = False, ax=ax[1], palette='Set3')
sns.boxplot(x='rating_binned', y='num_published_lectures', data=udemy, showfliers = False, ax=ax[2], palette='Set3')
sns.boxplot(x='rating_binned', y='price_detail__amount', data=udemy, showfliers = False, ax=ax[3], palette='Set3')


## 3.2 Continuous

* Let's get a first glimpse of the continuous variables with the seaborn function 'heatmap'

In [None]:
continuous = ['num_subscribers', 'rating', 'num_reviews', 'num_published_lectures', 'price_detail__amount','num_published_practice_tests']
sns.heatmap(udemy[continuous].corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True), annot=True)

* From the heatmap above we can see the following relationships, from strongest to weakest
    * num_subscribers/num_reviews (0.78). This is the strongest relationship; it makes sense that reviews increase as subscribers increase
    * num_published_lectures/price_detail__amount (0.28). Not a strong relationship, but it shows a tendency for prices to increase as the number of lectures in a course increases
    * num_subscribers/num_published_lectures (0.21). Do courses with more lectures attract more users?

In [None]:
fig, ax = plt.subplots(1,3,figsize=(25,7))

sns.regplot(x='num_subscribers', y='num_reviews', data=udemy, ax=ax[0], marker=',', color='teal')
sns.regplot(x='num_published_lectures', y='price_detail__amount', data=udemy, ax=ax[1], marker=',', color='teal')
sns.regplot(x='num_subscribers', y='num_published_lectures', data=udemy, ax=ax[2], marker=',', color='teal')

### Is there a trend in paid and free courses?

In [None]:
fig, ax = plt.subplots(1,2,figsize=(25,7))

sns.scatterplot(x='num_subscribers', y='num_reviews', data=udemy, ax=ax[0], marker=',', hue='is_paid')
sns.scatterplot(x='num_subscribers', y='num_published_lectures', data=udemy, ax=ax[1], marker=',', hue='is_paid')

* Given the small number of free courses, it is hard to identify any trend. The most evident one is a tendency for free courses to have fewer lectures

## 3.3 Dates

In this section, we are going to explore the year of publication of the courses.

In [None]:
udemy['year'] = udemy['published_time'].dt.to_period('Y').astype('str')

udemy.groupby('year')[['num_subscribers', 'num_reviews', 'num_published_lectures', 'price_detail__amount']].agg(['mean','sum','size'])

In [None]:
fig, ax = plt.subplots(2,2,figsize=(20,6))

subscribers = udemy.groupby('year')['num_subscribers'].sum()
reviews = udemy.groupby('year')['num_reviews'].sum()
lectures = udemy.groupby('year')['num_published_lectures'].sum()
price = udemy.groupby('year')['price_detail__amount'].mean()

ax[0,0].plot(subscribers)
ax[0,0].set_title('Sum of Subscribers')

ax[0,1].plot(reviews)
ax[0,1].set_title('Sum of Reviews')

ax[1,0].plot(lectures)
ax[1,0].set_title('Sum of Lectures')

ax[1,1].plot(price)
ax[1,1].set_title('Avg Price')



From the table and charts above, we can draw some findings:
* As established in section 3.2, subscribers and reviews move in the same direction. In the time series, we can see that both grew steadily until 2017; from then on, the platform has seen a decrease in subscribers and reviews, something to be aware of. Note that the data is grouped by publication year, so part of this drop may reflect that newer courses have had less time to accumulate subscribers and reviews.
* The content of the platform, reflected by the sum of lectures, has grown steadily, except in 2015, 2018 and 2019
* The average price of the courses has also been increasing through the years. In 2018, it started decreasing along with the other variables, maybe an effort to recover subscribers?

## 3.4 Trivia

### Which is the most expensive course?

In [None]:
udemy.sort_values(by = 'price_detail__amount', ascending = False).head(10)

In [None]:
Image("../input/sethgo/Screen Shot 2020-11-02 at 15.52.51.png")

Judging by this list, it looks like transitioning into freelancing is rather expensive.

[https://www.udemy.com/course/seth-godin-freelancer-course/](https://www.udemy.com/course/seth-godin-freelancer-course/)

There are plenty of courses that share the highest price (12800). This might be a price ceiling set by Udemy.
Let's check how many courses share this price.

In [None]:
udemy[udemy['price_detail__amount']==12800.0].shape

Let's narrow this down by calculating the price/rating ratio to find which courses give the best rating for the money.

### Which courses give the best rating for money?

In [None]:
udemy['price/rating'] = udemy['price_detail__amount'] / udemy['rating']
udemy[udemy['is_paid']==True].sort_values(by = 'price/rating', ascending = True).head(10)

It looks like the best deal you can get for a paid course is a price of 1280, since there are plenty of courses with this price and a rating of 5.

In [None]:
Image('../input/speech/Screen Shot 2020-11-02 at 15.58.13.png')

In this list, the course with the best price/rating ratio is 'The basics of delivering a public speech'.

Sounds like a great deal for improving such an essential skill. [https://www.udemy.com/course/the-basics-of-delivering-a-public-speech/](https://www.udemy.com/course/the-basics-of-delivering-a-public-speech/)

### Which is the first published course?

In [None]:
udemy.sort_values(by='published_time').head()

In [None]:
Image('../input/laston/Screen Shot 2020-11-02 at 16.38.32.png')

The first published course is 'Simple Strategy for Swing Trading the Stock Market'.
The stock market has been building hype since Udemy's beginnings.

[https://www.udemy.com/course/swing-trading-the-stock-market/](https://www.udemy.com/course/swing-trading-the-stock-market/)

### Which course has the most subscribers?

In [None]:
udemy.sort_values(by='num_subscribers', ascending = False).head()

Two data courses sneak into the top five but don't make the first position (damn it!). First position goes to an MBA course.

In [None]:
Image('../input/laston/Screen Shot 2020-11-02 at 16.41.00.png')

The most popular course is 'An Entire MBA in 1 Course:Award Winning Business School Prof'.

[https://www.udemy.com/course/an-entire-mba-in-1-courseaward-winning-business-school-prof/](https://www.udemy.com/course/an-entire-mba-in-1-courseaward-winning-business-school-prof/)