# Introduction

Udemy is a Massive Open Online Course (MOOC) website where you can purchase certified courses on a wide range of topics.


In this notebook, I will be exploring the Udemy Courses dataset. Let's see what kinds of courses are offered, how popular they are, and which features make them so. I will be making various kinds of plots to see the distributions of the variables and to get a better view of the relationships between them.

Just a side note: the data is not cleaned, so you will have to fix up a lot of the columns to make the data usable. Also, the data doesn't look complete to me (or even like a random subset of the full data), as a lot of subjects offered on Udemy are missing.

Also, there is one music and Spanish song course whose data is wrong. Some people have dropped it, but you can see the observation I made and how to fix it here: [https://www.kaggle.com/andrewmvd/udemy-courses/discussion/151353]. For now, I am only using a couple of columns, so I will fix the values as and when I need to.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
data = pd.read_csv('/kaggle/input/udemy-courses/udemy_courses.csv')

In [None]:
data

# Univariate Analysis - Standard Plots

In [None]:
# The values in this column are inconsistent ('TRUE'/'True', 'FALSE'/'False'), so I normalize them to 'True' and 'False'.
# One row also has a Udemy course link in the is_paid column. That must be a bug; I checked and the
# course is paid, so I replace it with 'True'.
data['is_paid'] = data['is_paid'].replace({
    'FALSE': 'False',
    'TRUE': 'True',
    'https://www.udemy.com/learnguitartoworship/': 'True',
})

sns.set(style="darkgrid")
ax = sns.countplot(x="is_paid", data=data)

So we immediately see that most of the courses are paid, which makes sense to me as whenever I browse Udemy, I never see a free course. In fact, I didn't even know they existed!
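
The cell above keeps `is_paid` as strings. If real booleans are more convenient later on, one option is to map every known spelling explicitly. This is only a sketch on a made-up toy column (the `toy` frame and the `paid_map` name are my own, not part of the dataset), and it assumes the URL row should count as paid, as checked above.

```python
import pandas as pd

# Toy stand-in for the real column; these rows are invented for illustration.
toy = pd.DataFrame({
    "is_paid": ["True", "TRUE", "False", "FALSE",
                "https://www.udemy.com/learnguitartoworship/"],
})

# Map every known spelling to a real boolean. The URL entry is mapped to True
# because that course was checked by hand and found to be paid.
paid_map = {
    "true": True,
    "false": False,
    "https://www.udemy.com/learnguitartoworship/": True,
}
normalized = toy["is_paid"].astype(str).str.strip().str.lower().map(paid_map)

# Values missing from the mapping become NaN, so unexpected entries stay visible.
unmapped = int(normalized.isna().sum())
paid_counts = normalized.value_counts().to_dict()
print(paid_counts, "unmapped:", unmapped)
```

Mapping instead of replacing has one advantage: anything I did not anticipate shows up as a missing value rather than silently staying a string.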

Next, let's plot a histogram of the course prices.

In [None]:
# Free courses are listed as 'Free'; treat them as a price of 0.
data['price'] = data['price'].replace('Free', '0')

# One course has 'TRUE' as its price. The quote I found for it is Rs. 12480, which is
# about $178 assuming the other prices are in dollars. The value actually substituted below is 30.
# Alternative (not used): drop the row instead.
#True_sample = data[data['price'] == 'TRUE'].index
#data = data.drop(True_sample)
data['price'] = data['price'].replace('TRUE', '30')

data['price'] = data['price'].astype(int)

plt.hist(data['price'])

So most of the courses are priced between $25 and $50. This is the general price I see for most courses on the website (ignoring the permanent discounts!).
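
Instead of hunting for bad price strings by hand, `pd.to_numeric` with `errors="coerce"` can flag them. Here is a small sketch on invented values (the `raw_price` series is made up; the fill value of 30 simply mirrors the cell above and is not a recommendation).

```python
import pandas as pd

# Invented sample of raw price strings, including the two odd values seen above.
raw_price = pd.Series(["20", "Free", "195", "TRUE", "50"])

# Treat 'Free' as 0, then let pandas turn anything non-numeric into NaN.
price = pd.to_numeric(raw_price.replace("Free", "0"), errors="coerce")

# Rows that failed to parse can be inspected before patching them.
bad_rows = raw_price[price.isna()].tolist()
print(bad_rows)

# Patch in a manually chosen value, then cast to int once nothing is missing.
price = price.fillna(30).astype(int)
print(price.tolist())
```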

Let's look at the distribution of subscribers.

In [None]:
# Pass the column as x explicitly so the box is drawn horizontally.
sns.boxplot(x=data['num_subscribers'])

Oh man. Not a good plot for Udemy course creators. Most courses are not popular at all and have 10,000 subscribers or fewer, so most courses are small. However, there are big hits as well, which you can see from all of the outliers. So it looks like Udemy courses are an all-or-nothing game. Let's look at the subject popularity.
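
To put rough numbers on the skew in subscribers, the 1.5 x IQR rule that box plots use by default for their whiskers can be applied directly. The sketch below runs on invented subscriber counts, not the real column, so the figures only illustrate the method.

```python
import pandas as pd

# Invented subscriber counts: many small courses and two big hits.
subs = pd.Series([10, 50, 120, 300, 450, 800, 1500, 2500, 60000, 250000])

# Quartiles and interquartile range.
q1, q3 = subs.quantile(0.25), subs.quantile(0.75)
iqr = q3 - q1

# Points above the upper fence are what a box plot draws as outliers.
upper_fence = q3 + 1.5 * iqr
outliers = subs[subs > upper_fence]

# A mean far above the median is another sign of a heavy right tail.
print("median:", subs.median(), "mean:", subs.mean())
print("outliers:", outliers.tolist())
```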

In [None]:
plt.hist(data['subject'])

So Business Finance and Web Development are extremely popular, at least by number of courses. Then come Musical Instruments and Graphic Design. No wonder I keep getting those JavaScript course emails!

I don't think the data is anywhere near complete, since a lot of courses on big data and ML have come out. Those don't fit into any category here, so this dataset is very limited.
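
For a categorical column like the subject, exact counts are often easier to read than a histogram. A minimal sketch with `value_counts`, on a made-up `toy` frame that only borrows the column name from the dataset:

```python
import pandas as pd

# Invented rows; only the column name matches the dataset.
toy = pd.DataFrame({
    "subject": ["Web Development", "Business Finance", "Web Development",
                "Graphic Design", "Business Finance", "Web Development"],
})

# Number of courses per subject, most common first.
counts = toy["subject"].value_counts()

# The same information as a share of all courses.
shares = toy["subject"].value_counts(normalize=True).round(2)

print(counts)
print(shares)
```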

Let's look at the level of the courses.

In [None]:
data['level'] = data['level'].replace('52', 'Beginner Level')
plt.hist(data['level'])

Wow, basically everyone lists their course as 'All Levels'. I don't think course makers even bother with this part. They probably just enter 'All Levels' or 'Beginner Level' to attract the herd.

# Multivariate Analysis - Understand 3 variables at a time!

I want to see the relationship between whether a course is paid, the subject, and the number of subscribers. I suppose that some subjects can get a higher turnout even if they are paid, so let's see which ones. We will use a violin plot.

In [None]:
plt.figure(figsize=(20, 6))
ax = sns.violinplot(x="num_subscribers", y="subject", hue="is_paid", data=data, palette="muted")

This plot explains quite a bit. Remember all of those outliers we saw in the number of subscribers? Most of them came from the free courses. That doesn't mean, however, that paid courses can't go big: look at the Web Development section. Maybe this is why this subject is very popular?

We see that courses in Business Finance and Graphic Design get a very low turnout if they are paid, so people take courses in these subjects only if they are free. Musical Instruments has a higher turnout for paid courses, but not by much. Web Development courses can charge money and still get a decent turnout. Also, the free-course distribution for Web Development is extremely skewed, which probably means that everyone wants to learn web development but isn't dedicated enough to actually pay for a course.
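
A violin plot is good for shapes, but a small summary table makes the paid versus free comparison easier to quote. The sketch below groups by subject and `is_paid` on an invented `toy` frame. The numbers are made up and only show the mechanics.

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
toy = pd.DataFrame({
    "subject": ["Web Development"] * 4 + ["Business Finance"] * 4,
    "is_paid": ["True", "True", "False", "False"] * 2,
    "num_subscribers": [3000, 5000, 20000, 40000, 200, 400, 9000, 11000],
})

# Median is used because subscriber counts are heavily skewed.
summary = (
    toy.groupby(["subject", "is_paid"])["num_subscribers"]
       .agg(["count", "median", "max"])
)
print(summary)
```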

Now let's look at the relationship between price, number of subscribers, and number of lectures. We will use a bubble plot.

In [None]:
!pip install bubbly
!pip install chart_studio

This plot may lag quite a bit because it's quite bulky. Anyhow, I think it gives a lot of information.

In [None]:

from bubbly.bubbly import bubbleplot 
from plotly.offline import iplot
import chart_studio.plotly as py



figure = bubbleplot(dataset=data, x_column='price', y_column='num_subscribers', bubble_column='num_lectures', size_column='num_lectures', height=350)

iplot(figure)

In this plot, the x-axis is the price of the course, the y-axis is the number of subscribers, and the bubble size is the number of lectures.

Now we can understand the effect of prices better. As expected, free courses, and many paid ones, have a large number of subscribers (as shown on the y-axis).

One more thing we can see is that as the price of the course increases, the size of the course also increases (you can see that the bubbles on the right are much larger than those on the left). This makes sense, because if you pay more, you would expect more in return.
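
The "pricier courses are longer" impression can also be checked without a plot by binning prices and comparing the typical lecture count per band. This is a sketch on invented rows. The band edges and labels are my own choice for illustration, not something the dataset defines.

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
toy = pd.DataFrame({
    "price": [0, 20, 25, 50, 95, 120, 195, 200],
    "num_lectures": [8, 12, 15, 30, 45, 40, 90, 110],
})

# Hypothetical price bands: free, then three paid ranges (right edge inclusive).
bands = pd.cut(
    toy["price"],
    bins=[-1, 0, 50, 100, 200],
    labels=["free", "1-50", "51-100", "101-200"],
)

# Median number of lectures in each price band.
median_lectures = toy.groupby(bands, observed=True)["num_lectures"].median()
print(median_lectures)
```

If the medians rise from band to band on the real data, that supports what the bubble sizes suggest.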


# Word Cloud

Now, let's see what kind of titles course creators give. We will use a word cloud as it makes word popularity easy to see as well. If you are not familiar with word clouds, essentially, the larger a word is, the more frequently it is used.

In [None]:
from wordcloud import WordCloud

# Join all course titles into one string for the word cloud.
titles = data['course_title'].str.cat(sep=' ')

wordcloud = WordCloud(max_words=200, colormap='Set3', background_color='black').generate(titles)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


So we can see that the main words used in titles are 'Learn' and 'How to'. Many courses even try to make it explicit by mentioning 'for beginner', 'Course' and 'from Scratch'.
Although this is great, I want to look at the titles chosen based on the subject, because you can see that the word 'JavaScript' is bigger than the word 'guitar' purely because there are more JavaScript courses. So, let's split this up into different subjects.

In [None]:
data['subject'].unique()

In [None]:
sub_data = data[data['subject'] == 'Business Finance']
titles = sub_data['course_title'].str.cat(sep=' ')


wordcloud = WordCloud(max_words=200, colormap='Set3', background_color='black').generate(titles)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


So we see the popularity of the general words that describe the category: 'trading', 'accounting', 'stock', etc. We also see lots of generic words like 'how to' and 'learn'.

In [None]:
sub_data = data[data['subject'] == 'Graphic Design']
titles = sub_data['course_title'].str.cat(sep=' ')


wordcloud = WordCloud(max_words=200, colormap='Set3', background_color='black').generate(titles)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Again we see the tools and topics used in graphic design, like Photoshop, Adobe Illustrator, logo design, etc. Again, 'how to' and 'learn' are present in the titles.

In [None]:
sub_data = data[data['subject'] == 'Musical Instruments']
titles = sub_data['course_title'].str.cat(sep=' ')


wordcloud = WordCloud(max_words=200, colormap='Set3', background_color='black').generate(titles)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


As expected, we see the different musical instruments used. Piano and Guitar look like they are the most popular.

In [None]:
sub_data = data[data['subject'] == 'Web Development']
titles = sub_data['course_title'].str.cat(sep=' ')


wordcloud = WordCloud(max_words=200, colormap='Set3', background_color='black').generate(titles)

plt.figure(figsize=(15,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()


Now this is getting boring: we see JavaScript, WordPress (I don't know why there is a course for this), PHP, HTML, and the words 'learn' and 'from scratch'.



In short, Udemy course creators are very consistent with their naming: 'Learn {topic based on subject} for beginners'.
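
The naming pattern can also be checked with plain word counts per subject. The sketch below uses `collections.Counter` on a few invented titles. The `rows` list, the tiny `STOPWORDS` set, and the `top_words_by_subject` helper are all hypothetical, and this simple counting is not the same as what the WordCloud library does internally.

```python
import re
from collections import Counter, defaultdict

# Invented (subject, title) pairs for illustration only.
rows = [
    ("Web Development", "Learn JavaScript from Scratch"),
    ("Web Development", "JavaScript for Beginners"),
    ("Musical Instruments", "Learn Guitar for Beginners"),
    ("Musical Instruments", "Learn Piano from Scratch"),
]

# A deliberately tiny, hypothetical stop-word list.
STOPWORDS = {"for", "from", "to", "the", "a", "and"}


def top_words_by_subject(rows, n=3):
    """Return the n most frequent title words for each subject."""
    counts = defaultdict(Counter)
    for subject, title in rows:
        # Lowercase the title and keep alphabetic runs only.
        words = re.findall(r"[a-z]+", title.lower())
        counts[subject].update(w for w in words if w not in STOPWORDS)
    return {subject: counter.most_common(n) for subject, counter in counts.items()}


top = top_words_by_subject(rows)
for subject, words in top.items():
    print(subject, words)
```

On the real data, the pairs could come from something like `data[['subject', 'course_title']].itertuples(index=False)`.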

# Conclusion

1. Udemy courses either go big or bust (by bust I mean they have very few subscribers). So for content creators, it is almost like an all-or-nothing game.
2. Free courses have a huge number of subscribers (it is kind of obvious I guess).
3. If a course is paid, the price of the course doesn't really affect the number of subscribers much. In fact, among paid courses, the only ones with a high number of subscribers are the expensive ones.
4. Expensive courses are generally longer - I guess this is kind of obvious as well.
5. Udemy courses generally have the title: 'Learn to use {tool or topic specific to subject} from scratch'.
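
Point 3 comes from reading the bubble plot by eye. One way to sanity-check it would be a rank correlation between price and subscribers among paid courses only. The sketch below uses invented rows, so its result says nothing about the real data. It computes Spearman's correlation as the Pearson correlation of the ranks, which avoids extra dependencies.

```python
import pandas as pd

# Invented rows; only the column names match the dataset.
toy = pd.DataFrame({
    "price": [0, 20, 50, 95, 150, 200],
    "num_subscribers": [50000, 1200, 300, 8000, 500, 15000],
})

# Keep paid courses only, so the free-course outliers do not dominate.
paid = toy[toy["price"] > 0]

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
# Ranks are less sensitive to a few huge subscriber counts than raw values.
rho = paid["price"].rank().corr(paid["num_subscribers"].rank())
print(round(rho, 2))
```

A value near 0 would back up the claim that price barely matters for paid courses, while a clearly positive value would fit the observation that the big paid hits are the expensive ones.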


I hope you liked this notebook. I tried using a variety of plots to bring out a better understanding. If you liked the visualizations, please upvote the notebook (this is just a friendly reminder). Do leave feedback in the comments. I'll try to work on it.