Udemy is an online learning platform with 35M learners, 57K instructors, 130K courses, 400M course enrollments, and 110M minutes of video, with courses taught in 65+ languages (https://about.udemy.com/).

It was founded in May 2010 by Eren Bali, Gagan Biyani, and Oktay Caglar, and it has consistently catered to people who want to improve an existing skill or pick up a new one.

In this notebook, we look at courses offered by Udemy between 2011 and 2017 and carry out a number of analyses on the dataset.

Please let me know if this notebook was helpful and do feel free to comment on what aspects can be improved.

Some of the code in this notebook is based on notebooks submitted earlier.

# It's a bunch of open-ended questions:
• What are the best free courses by subject?

• What are the most popular courses?

• What are the most engaging courses?

• How are courses related?

• Which courses offer the best cost benefit?



In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly.express as px

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
udemy_courses = pd.read_csv('/kaggle/input/udemy-courses/udemy_courses.csv')
udemy_courses.head(5)

In [None]:
udemy_courses.describe()

From the above, we can deduce that

1) The minimum content duration is 0 hours and the minimum number of lectures is 0, which is implausible for a real course, so such rows should be dropped

2) The highest number of subscribers is almost 10x the highest number of reviews

3) The average price of a course is approximately $66.05; the minimum price is $0, indicating that some courses are free, and the highest price is $200
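
The zero values noted in point 1 can be counted before deciding what to drop. The following is a minimal sketch on a small made-up frame: the `toy` data is hypothetical, and only the column names `content_duration` and `num_lectures` come from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'course_title': ['A', 'B', 'C', 'D'],
    'content_duration': [1.5, 0.0, 3.0, 2.0],
    'num_lectures': [12, 0, 30, 25],
})

# Count how many rows hold a zero in each column of interest
zero_counts = (toy[['content_duration', 'num_lectures']] == 0).sum()
print(zero_counts)

# Rows where either value is zero, i.e. candidates for removal
suspect_rows = toy[(toy['content_duration'] == 0) | (toy['num_lectures'] == 0)]
print(suspect_rows)
```

Running the same two checks on `udemy_courses` would show whether the zero duration and the zero lecture count belong to the same rows.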

In [None]:
udemy_courses.info()

In [None]:
udemy_courses.isnull().sum()

The url and course_id columns have little to no impact on our analysis, so we can drop them

In [None]:
udemy_courses = udemy_courses.drop(columns = ['course_id', 'url'])

In [None]:
udemy_courses.head(5)

In [None]:
# distplot is deprecated in recent seaborn releases; histplot draws the same histogram
sns.histplot(udemy_courses['price'], kde = False).set_title('Price distribution')

The tallest bar sits around the $25 mark, so most prices are close to $25

In [None]:
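# distplot is deprecated in recent seaborn releases; histplot with kde=True overlays the density curve
sns.histplot(udemy_courses['content_duration'], kde = True).set_title('Histogram of Content duration')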

From the earlier look at the data, we noticed that the duration of some courses is shown as 0. Since this is almost impossible, we drop all rows whose content duration is 0

In [None]:
print (udemy_courses.loc[udemy_courses['content_duration'] == 0])

In [None]:
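# Drop every row whose content duration is 0, not just the first match
# (the name `id` is avoided because it shadows a Python builtin)
zero_duration_idx = udemy_courses[udemy_courses['content_duration'] == 0].index
udemy_courses = udemy_courses.drop(zero_duration_idx)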

In [None]:
udemy_courses.shape

# Question 1: What are the best free courses by subject?

In [None]:
free_courses = udemy_courses[udemy_courses['is_paid'] == 0]
free_courses.shape

There are 310 free courses

In [None]:
paid_courses = udemy_courses[udemy_courses['is_paid'] == 1]
paid_courses.shape

3367 courses are paid courses

In [None]:
subject_unique = udemy_courses['subject'].unique()
subject_unique

In [None]:
for x in subject_unique:
    # Restrict to one subject first so the title lookup cannot match a course from another subject
    subject_courses = free_courses[free_courses['subject'] == x]
    best = subject_courses.loc[subject_courses['num_subscribers'].idxmax()]

    print("The best free course offered by Udemy for {} is \n{} with {} subscribers\n".format(x, best['course_title'], best['num_subscribers']))


In [None]:
for x in subject_unique:
    # Restrict to one subject first so the title lookup cannot match a course from another subject
    subject_courses = paid_courses[paid_courses['subject'] == x]
    best = subject_courses.loc[subject_courses['num_subscribers'].idxmax()]

    print("The best paid course offered by Udemy for {} is \n{} with {} subscribers\n".format(x, best['course_title'], best['num_subscribers']))


# Question 2: What are the most popular courses?

Grouping the most popular courses based on their number of subscribers

In [None]:
udemy_courses[['course_title', 'num_subscribers']].sort_values('num_subscribers', ascending = False).head(5)

Of the top 5 courses by number of subscribers, four are web development courses.
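
The subject mix of a top-N list can also be counted directly instead of being read off the table. This is a minimal sketch on invented data: the `toy` frame is hypothetical, and only the column names match the dataset.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'course_title': ['A', 'B', 'C', 'D', 'E', 'F'],
    'subject': ['Web Development', 'Web Development', 'Musical Instruments',
                'Web Development', 'Business Finance', 'Web Development'],
    'num_subscribers': [900, 800, 700, 600, 100, 500],
})

# Take the five most-subscribed courses, then count how often each subject appears
top5 = toy.nlargest(5, 'num_subscribers')
subject_counts = top5['subject'].value_counts()
print(subject_counts)
```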

# Question 3: What are the most engaging courses?
Total engagement = number of subscribers + number of reviews. We need to create a new column for it

In [None]:
udemy_courses['Engagement'] = udemy_courses['num_subscribers'] + udemy_courses['num_reviews']
udemy_courses.head(5)

In [None]:
udemy_courses[['course_title', 'Engagement']].sort_values('Engagement', ascending = False).head(5)

Compared with the ranking by number of subscribers, the list barely changed.
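
How much two rankings agree can be checked with a set intersection. The sketch below uses invented numbers (the `toy` frame is hypothetical) and a top 3 rather than a top 5 to keep it short. Because subscriber counts are usually much larger than review counts, the sum tends to be dominated by subscribers, which would explain why the two lists look alike.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'course_title': ['A', 'B', 'C', 'D'],
    'num_subscribers': [1000, 900, 500, 450],
    'num_reviews': [50, 40, 10, 80],
})
toy['Engagement'] = toy['num_subscribers'] + toy['num_reviews']

# Titles in the top 3 under each ranking
top_by_subscribers = set(toy.nlargest(3, 'num_subscribers')['course_title'])
top_by_engagement = set(toy.nlargest(3, 'Engagement')['course_title'])

# Courses that appear in both top lists
shared = top_by_subscribers & top_by_engagement
print(sorted(shared))
```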

# Question 4: How are the courses related?

In [None]:
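# numeric_only=True (pandas 1.5+) keeps text columns such as titles and subjects out of the correlation
corr = udemy_courses.corr(numeric_only=True)
f,ax = plt.subplots(figsize=(15, 10))
# Reuse the matrix computed above instead of calling corr() a second time
sns.heatmap(corr, annot=True, fmt= '.1f',ax=ax, cmap="BrBG")
sns.set(font_scale=1.25)
plt.show()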

The correlation map above shows a moderate positive correlation (0.6) between the number of reviews and the number of subscribers a course has, and a strong positive correlation (0.8) between the number of lectures and the content duration. These are the two highest correlations.

On the other hand, there is a negative correlation (-0.3) between the number of subscribers and whether a course is paid
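
For readers unfamiliar with the numbers in a correlation map, each cell is a Pearson correlation coefficient. The sketch below computes one such value two ways on invented data (the `toy` frame is hypothetical): once with pandas and once from the definition, covariance divided by the product of the standard deviations.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names mirror the dataset
toy = pd.DataFrame({
    'num_subscribers': [100, 200, 300, 400, 500],
    'num_reviews': [10, 25, 28, 45, 50],
})

# Pearson r as pandas computes it for a pair of columns
r_pandas = toy['num_subscribers'].corr(toy['num_reviews'])

# The same value from the definition, using deviations from the mean
x = toy['num_subscribers'] - toy['num_subscribers'].mean()
y = toy['num_reviews'] - toy['num_reviews'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(round(r_pandas, 3), round(r_manual, 3))
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one. A correlation on its own does not show that one variable causes the other.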

# Question 5: What courses offer the best cost benefit?
Here, the course with the best cost benefit is taken to be the one that costs little while having the highest number of subscribers. If we had a review score or star rating column, this might have been more accurate
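
A complementary way to look at cost benefit is a simple ratio of subscribers to price. The sketch below is an assumption for illustration, not the method used in this notebook: the `toy` frame and the `subscribers_per_dollar` column are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'course_title': ['A', 'B', 'C', 'D'],
    'price': [20, 200, 50, 0],
    'num_subscribers': [4000, 10000, 5000, 9000],
})

# Free courses are excluded so the ratio never divides by zero
paid = toy[toy['price'] > 0].copy()
paid['subscribers_per_dollar'] = paid['num_subscribers'] / paid['price']

# The course with the highest ratio attracts the most subscribers per dollar charged
best = paid.sort_values('subscribers_per_dollar', ascending=False).iloc[0]
print(best['course_title'], best['subscribers_per_dollar'])
```

A ratio like this rewards cheap courses heavily, so it is best read alongside the raw subscriber counts rather than on its own.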

In [None]:
# Keep courses priced at or below the mean that have at least the mean number of subscribers
affordable = udemy_courses['price'] <= udemy_courses['price'].mean()
popular = udemy_courses['num_subscribers'] >= udemy_courses['num_subscribers'].mean()
cost_benefit = (udemy_courses[affordable & popular]
                .sort_values('num_subscribers', ascending=False)['course_title']
                .head(1).unique())
print("The course which offers the best cost benefit is : \n", cost_benefit)

# Further analysis and visualization
Before further analysis, the price will be grouped into bins: Free (0), Low (1-75), Medium (76-150), and High (151-200). The published_timestamp column is also not in datetime format, so it will be converted to datetime and the year will be extracted
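
Both preparation steps can be tried on a few rows first. This is a minimal sketch on invented data (the `toy` frame is hypothetical) that uses `pd.cut` as one possible way to express the same bin edges, together with `pd.to_datetime` and the `.dt.year` accessor.

```python
import pandas as pd

# Hypothetical rows; the timestamps are invented ISO 8601 strings
toy = pd.DataFrame({
    'price': [0, 20, 75, 76, 150, 200],
    'published_timestamp': ['2011-07-09T05:43:31Z', '2013-01-03T00:55:31Z',
                            '2014-05-12T10:00:00Z', '2015-11-30T23:59:59Z',
                            '2016-02-29T12:00:00Z', '2017-07-06T21:46:30Z'],
})

# Right-closed intervals: (-1, 0] Free, (0, 75] Low, (75, 150] Medium, (150, 200] High
toy['price_bin'] = pd.cut(toy['price'], bins=[-1, 0, 75, 150, 200],
                          labels=['Free', 'Low', 'Medium', 'High'])

# Parse the strings into datetimes, then pull out the year
toy['year'] = pd.to_datetime(toy['published_timestamp']).dt.year

print(toy[['price', 'price_bin', 'year']])
```

Checking the boundary prices (0, 75, 76, 150) on a small frame like this is a quick way to confirm that each edge lands in the intended bin.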

In [None]:
conditions = [
    udemy_courses['price'] == 0,
    ((udemy_courses['price'] > 0) & (udemy_courses['price'] <= 75)),
    ((udemy_courses['price'] > 75) & (udemy_courses['price'] <=150))
]

choices = [
    'Free',
    'Low',
    'Medium'
]


udemy_courses['price_bin'] = np.select(conditions, choices, 'High')

In [None]:
udemy_courses['published_timestamp'] = pd.to_datetime(udemy_courses['published_timestamp'])
udemy_courses['date_published'] = udemy_courses['published_timestamp'].dt.date
udemy_courses['time_published'] = udemy_courses['published_timestamp'].dt.time
udemy_courses['year']=pd.DatetimeIndex(udemy_courses['published_timestamp']).year

In [None]:
udemy_courses.head(5)

We no longer need the published_timestamp column, so it is best to drop it

In [None]:
udemy_courses = udemy_courses.drop('published_timestamp', axis = 1)

# Total number of paid and unpaid courses

In [None]:
udemy_courses["is_paid"].value_counts()

In [None]:
# Name the columns explicitly: the default names from value_counts().reset_index() differ across pandas versions
price_count=udemy_courses['is_paid'].value_counts().rename_axis('is_paid').reset_index(name='count')
fig11=px.bar(price_count, x='is_paid', y='count', text='count', color='count',
             title='count of courses paid and unpaid for',
             labels={'is_paid':'paid/unpaid courses','count':'count of paid/unpaid courses'})
fig11.update_layout(showlegend=False, width=600)
fig11.show()

# Count of courses by price bin

In [None]:
# Name the columns explicitly: the default names from value_counts().reset_index() differ across pandas versions
bin_count=udemy_courses['price_bin'].value_counts().rename_axis('price_bin').reset_index(name='count')
fig2=px.bar(bin_count, x='price_bin', y='count', text='count', color='count',
             title='count of courses according to its price bin',
             labels={'price_bin':'price bins','count':'count of courses based on price'})
fig2.update_layout(showlegend=False, width=800)
fig2.show()

# Breakdown of subjects based on their total number of subscribers

In [None]:
udemy_courses.groupby('subject')['num_subscribers'].sum().sort_values(ascending = False).plot(kind = 'bar')
plt.ylabel('Sum of subscribers')
plt.title('Breakdown of subjects based on total number of subscribers')

As expected from the earlier analysis, web development courses have the highest number of subscribers, with almost 8 million in total. This could be because, as of 2015, web development ranked number 5 among the top 10 IT skills that 194 IT executives said would be in demand (Mary K. Pratt, 2014)

In [None]:
udemy_courses.groupby('subject')['num_subscribers'].sum().sort_values(ascending = False)

In [None]:
udemy_courses.groupby(['subject', 'is_paid'])['num_subscribers'].sum().plot(kind = 'bar')

Of the 7.9 million subscribers to web development courses, over 2 million subscribed to a free web development course
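
The share of subscribers who came through free courses can be computed per subject rather than estimated from the chart. A minimal sketch on invented numbers follows: the `toy` frame and the `kind` helper column are hypothetical.

```python
import pandas as pd

# Hypothetical totals; only the column names mirror the dataset
toy = pd.DataFrame({
    'subject': ['Web Development', 'Web Development',
                'Business Finance', 'Business Finance'],
    'is_paid': [True, False, True, False],
    'num_subscribers': [6000, 2000, 1000, 500],
})

# Readable labels make the unstacked columns easier to select
toy['kind'] = toy['is_paid'].map({True: 'paid', False: 'free'})

# One row per subject, one column each for free and paid subscriber totals
totals = toy.groupby(['subject', 'kind'])['num_subscribers'].sum().unstack('kind')

# Fraction of each subject's subscribers that belong to free courses
free_share = totals['free'] / totals.sum(axis=1)
print(free_share)
```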

In [None]:
udemy_courses.groupby(['subject', 'is_paid'])['num_subscribers'].sum()

# Total Number of courses in each subject

In [None]:
udemy_courses['subject'].value_counts().sort_values(ascending = False).plot(kind = 'bar')
plt.ylabel('value count of subjects')
plt.xlabel('Subject name')
plt.title('Subject count')

Although web development courses have more than 3x as many subscribers as Business Finance courses, Business Finance rivals web development in the number of courses released.
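
Putting course counts and subscriber totals side by side makes this comparison explicit. The sketch below uses invented data (the `toy` frame is hypothetical) and derives an average number of subscribers per course for each subject.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'subject': ['Web Development'] * 3 + ['Business Finance'] * 3,
    'num_subscribers': [3000, 2000, 1000, 800, 700, 500],
})

# Named aggregation: number of courses and total subscribers per subject
summary = toy.groupby('subject')['num_subscribers'].agg(courses='count', subscribers='sum')

# Average subscribers per course shows how far the totals diverge despite equal course counts
summary['subscribers_per_course'] = summary['subscribers'] / summary['courses']
print(summary)
```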

In [None]:
udemy_courses['subject'].value_counts().sort_values(ascending = False)

# Subject breakdown (Paid and Unpaid)

In [None]:
plt.figure(figsize = (10, 5))
# Recent seaborn releases require x to be passed as a keyword argument
sns.countplot(x = 'is_paid', hue = 'subject', data = udemy_courses)

Business Finance has the highest number of paid courses while Web Development has the highest number of free courses
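
The same reading can be backed by a small contingency table. This is a minimal sketch on invented data (the `toy` frame and the `kind` helper column are hypothetical) using `pd.crosstab`.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'subject': ['Business Finance', 'Business Finance', 'Business Finance',
                'Web Development', 'Web Development', 'Web Development'],
    'is_paid': [True, True, True, True, False, False],
})
toy['kind'] = toy['is_paid'].map({True: 'paid', False: 'free'})

# Rows are subjects, columns are free/paid, cells are course counts
counts = pd.crosstab(toy['subject'], toy['kind'])
print(counts)

# Subject with the most courses in each column
print(counts.idxmax())
```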

In [None]:
udemy_courses.groupby('subject')['is_paid'].value_counts().sort_values(ascending = False)

# Breakdown of levels based on whether they are paid for or not

In [None]:
plt.figure(figsize = (10, 5))
# Recent seaborn releases require x to be passed as a keyword argument
sns.countplot(x = 'level', hue = 'is_paid', data = udemy_courses)

It seems that a good education really is expensive: Expert Level courses, although few in number, include no free courses. They all have to be paid for.
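
Levels without any free course can be listed programmatically instead of being spotted in the plot. The sketch below runs on invented rows: the `toy` frame is hypothetical, and the level names are only examples.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'level': ['All Levels', 'Beginner Level', 'Beginner Level',
              'Expert Level', 'Expert Level'],
    'is_paid': [True, False, True, True, True],
})

# Mark free courses with 1, then count them within each level
free_per_level = (~toy['is_paid']).astype(int).groupby(toy['level']).sum()

# Levels whose free-course count is zero
levels_without_free = free_per_level[free_per_level == 0].index.tolist()
print(levels_without_free)
```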

In [None]:
udemy_courses.groupby('level')['is_paid'].value_counts()

# Breakdown of subject by level

In [None]:
plt.figure(figsize = (10,5))
# Recent seaborn releases require x to be passed as a keyword argument
sns.countplot(x = 'subject', hue = 'level', data = udemy_courses)

# Level count

In [None]:
udemy_courses["level"].value_counts().plot(kind="bar")
plt.ylabel('Count of courses')
plt.xlabel('Course Level')
plt.title('Number of courses based on level of expertise')

It seems that as the level increases, fewer courses are put out.

# Content/Course released per year

In [None]:
udemy_courses['year'].value_counts().plot(kind = 'bar')

In [None]:
udemy_courses.groupby('year')['course_title'].count()

# Number of subscribers per year

In [None]:
udemy_courses.groupby('year')['num_subscribers'].sum().sort_values(ascending = False).plot(kind = 'bar')

In [None]:
sns.catplot(x = 'is_paid', col = 'year',
            data = udemy_courses, 
            kind = 'count',
           height = 2.5, 
            aspect = .8
           )

# Correlation/Relationship

In [None]:
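# numeric_only=True (pandas 1.5+) keeps text columns such as titles and price bins out of the correlation
corr = udemy_courses.corr(numeric_only=True)
f,ax = plt.subplots(figsize=(15, 10))
# Reuse the matrix computed above instead of calling corr() a second time
sns.heatmap(corr, annot=True, fmt= '.1f',ax=ax, cmap="BrBG")
sns.set(font_scale=1.25)
plt.show()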


We draw a second correlation map here because the first one could not include the publication date: published_timestamp was not in datetime format at that point. Now that the year has been extracted as a numeric column, it appears in the map.

The correlation between the year a course was released and its number of subscribers is -0.2, a weak negative correlation. This means the publication year has only a weak relationship with the number of subscribers, with newer courses tending to have slightly fewer.
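
A per-year summary is a useful companion to a single correlation value, since it shows the direction of the trend directly. The sketch below uses invented numbers (the `toy` frame is hypothetical). One plausible reading of a negative value, offered here as an assumption, is that older courses have simply had more time to collect subscribers.

```python
import pandas as pd

# Hypothetical stand-in for udemy_courses; the values are invented for illustration
toy = pd.DataFrame({
    'year': [2012, 2012, 2014, 2014, 2016, 2016],
    'num_subscribers': [5000, 3000, 2500, 4000, 1000, 2000],
})

# Median subscribers for courses published in each year
per_year = toy.groupby('year')['num_subscribers'].median()
print(per_year)

# Pearson correlation between publication year and subscriber count
r = toy['year'].corr(toy['num_subscribers'])
print(round(r, 2))
```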