# Data Analysis of Udemy Courses

## Library Imports

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
sns.set(style = "whitegrid", font = "sans-serif", palette = "Dark2", font_scale = 1.1)

## Importing the Dataset

In [None]:
udemy_DF = pd.read_csv('../input/udemy-courses/udemy_courses.csv')

# Overview of the Dataset

In [None]:
udemy_DF.shape

There are 3678 rows and 12 columns.

In [None]:
udemy_DF.head(10)

# Data Preparation

In [None]:
udemy_DF.isnull().sum()

There are no NaNs in this dataset.

## Checking for Courses with no Lectures

In [None]:
udemy_DF[udemy_DF['num_lectures'] == 0]

In [None]:
udemy_DF.drop(udemy_DF[udemy_DF['num_lectures'] == 0].index, inplace = True)

One course named "Mutual Funds for Investors in Retirement Accounts" has been removed.

## Extracting the Year from the Date

In [None]:
udemy_DF['published_timestamp'] = pd.to_datetime(udemy_DF['published_timestamp'])
udemy_DF['published_date'] = udemy_DF['published_timestamp'].dt.date
# Read the year straight from the parsed timestamp column.
udemy_DF['published_year'] = udemy_DF['published_timestamp'].dt.year

In [None]:
udemy_DF.head()

# Analyzing All the Features

## Subjects

In [None]:
plt.figure(figsize = (9,4))
sns.countplot(data = udemy_DF, x = 'subject')
plt.show()

Most courses fall under Web Development (32.6%) or Business Finance (32.5%); together these two subjects account for about 65% of the dataset.
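The percentages quoted in this notebook can be reproduced with `value_counts(normalize = True)`. The cell below is a minimal sketch on a small hypothetical `subjects` Series (not the real data); on the actual dataset the same call would presumably be applied to `udemy_DF['subject']`, `udemy_DF['level']` or `udemy_DF['is_paid']`.

```python
import pandas as pd

# Hypothetical stand-in for udemy_DF['subject']; the real column comes from the CSV.
subjects = pd.Series(
    ['Web Development'] * 4 + ['Business Finance'] * 3
    + ['Musical Instruments'] * 2 + ['Graphic Design'],
    name='subject',
)

# normalize=True returns fractions instead of counts; scale them to percentages.
subject_share = (subjects.value_counts(normalize=True) * 100).round(1)
print(subject_share)
```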

## Difficulty Levels

In [None]:
plt.figure(figsize = (9,4))
sns.countplot(data = udemy_DF, x = 'level')
plt.show()

52.4% of all courses are aimed at all difficulty levels. 34.5% of all courses are at a beginner level.

## Paid and Free Courses

In [None]:
plt.figure(figsize = (9,4))
sns.countplot(data = udemy_DF, x = 'is_paid')
plt.show()

91.6% of all Udemy courses are paid.

## Year of Publishing

In [None]:
plt.figure(figsize = (9,4))
sns.countplot(data = udemy_DF, x = 'published_year')
plt.show()

In [None]:
udemy_DF.nlargest(5, 'published_timestamp')

This dataset contains courses published up to 2017. The number of courses published rises substantially every year through 2016, likely reflecting the growing popularity of online courses. The drop in 2017 can be attributed to the fact that the last recorded course was published on 6 July 2017, as visible in the table above, so that year is only partly covered. 32.8% of all courses were published in 2016.

## Course Price

In [None]:
plt.figure(figsize = (10,4))
# histplot replaces the deprecated distplot(kde = False).
sns.histplot(data = udemy_DF, x = 'price', color = "#4c84f5")
plt.show()

In [None]:
plt.figure(figsize = (10,4))
# Pass the column as x explicitly so the box is drawn horizontally in every seaborn version.
sns.boxplot(x = udemy_DF['price'], color = "#4c84f5", linewidth = 2.2)
plt.show()

The maximum price for courses is 200 dollars. The average price for all courses is 66.06 dollars. The middle 50% of the prices lie between 20 and 95 dollars.
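The figures above (maximum, mean and the middle 50%) are the kind of values that `describe()` and `quantile()` return. The cell below is a sketch on hypothetical prices, not the dataset itself; on the real data the Series would presumably be `udemy_DF['price']`.

```python
import pandas as pd

# Hypothetical prices in dollars, standing in for udemy_DF['price'].
prices = pd.Series([0, 20, 20, 50, 95, 200], name='price')

# count, mean, std, min, quartiles and max in one call
summary = prices.describe()

# The interquartile range is the "middle 50%" of the values.
q1, q3 = prices.quantile([0.25, 0.75])

print(summary)
print(f"Middle 50% of prices: {q1} to {q3} dollars")
```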

## Number of Subscribers

In [None]:
plt.figure(figsize = (8,4))
# histplot replaces the deprecated distplot(kde = False).
sns.histplot(data = udemy_DF, x = 'num_subscribers', color = "#c92c26")
plt.show()

In [None]:
plt.figure(figsize = (10,4))
# Pass the column as x explicitly so the box is drawn horizontally in every seaborn version.
sns.boxplot(x = udemy_DF['num_subscribers'], color = "#359c91", linewidth = 2.2)
plt.show()

The box plot shows a large number of outliers in the subscriber counts.
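To put a number on the outliers, one option is to apply the 1.5 × IQR rule that box plots use by default for their whiskers. The cell below is a sketch on hypothetical subscriber counts; the variable names are illustrative, and on the real data the Series would presumably be `udemy_DF['num_subscribers']`.

```python
import pandas as pd

# Hypothetical subscriber counts, standing in for udemy_DF['num_subscribers'].
subscribers = pd.Series([0, 10, 50, 120, 300, 900, 2500, 40000, 250000])

q1, q3 = subscribers.quantile([0.25, 0.75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers by default.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

outliers = subscribers[(subscribers > upper_fence) | (subscribers < lower_fence)]
print(f"{len(outliers)} of {len(subscribers)} values lie outside the whiskers")
```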

## Number of Reviews

In [None]:
plt.figure(figsize = (8,4))
# histplot replaces the deprecated distplot(kde = False).
sns.histplot(data = udemy_DF, x = 'num_reviews', color = "#359c91")
plt.show()

In [None]:
plt.figure(figsize = (10,4))
# Pass the column as x explicitly so the box is drawn horizontally in every seaborn version.
sns.boxplot(x = udemy_DF['num_reviews'], color = "#359c91", linewidth = 2.2)
plt.show()

In [None]:
udemy_DF['num_reviews'].describe()

The highest number of reviews is 27,445. The average number of reviews is 156.

## Number of Lectures

In [None]:
plt.figure(figsize = (8,4))
# histplot replaces the deprecated distplot(kde = False).
sns.histplot(data = udemy_DF, x = 'num_lectures', color = "#94d111")
plt.show()

In [None]:
plt.figure(figsize = (10,4))
# Pass the column as x explicitly so the box is drawn horizontally in every seaborn version.
sns.boxplot(x = udemy_DF['num_lectures'], color = "#94d111", linewidth = 2.2)
plt.show()

In [None]:
udemy_DF['num_lectures'].describe()

The highest number of lectures in a course is 779. The average number of lectures is 40.

## Content Duration

In [None]:
plt.figure(figsize = (8,4))
# histplot replaces the deprecated distplot(kde = False).
sns.histplot(data = udemy_DF, x = 'content_duration', color = "#ff03ea")
plt.show()

In [None]:
plt.figure(figsize = (10,4))
# Pass the column as x explicitly so the box is drawn horizontally in every seaborn version.
sns.boxplot(x = udemy_DF['content_duration'], color = "#ff03ea", linewidth = 2.2)
plt.show()

In [None]:
udemy_DF['content_duration'].describe()

The longest duration for a course is 78.5 hours. The shortest course is 0.133 hours, which equates to approximately 8 minutes.

# Most Subscribed Courses

## Top 10 Most Subscribed Courses

In [None]:
udemy_DF.nlargest(10, 'num_subscribers')

In [None]:
plt.figure(figsize = (10, 4))
sns.barplot(data = udemy_DF.nlargest(10, 'num_subscribers'), 
            x = 'num_subscribers', y = 'course_title')
plt.title(label = "Top 10 Most Subscribed Courses")
plt.show()

## Top 10 Most Subscribed Paid Courses

In [None]:
udemy_DF[udemy_DF['is_paid'] == True].nlargest(10, 'num_subscribers')

In [None]:
plt.figure(figsize = (10, 4))
sns.barplot(data = udemy_DF[udemy_DF['is_paid'] == True].nlargest(10, 'num_subscribers'), 
            x = 'num_subscribers', y = 'course_title')
plt.title(label = "Top 10 Most Subscribed Paid Courses")
plt.show()

## Top 10 Most Subscribed Free Courses

In [None]:
udemy_DF[udemy_DF['is_paid'] == False].nlargest(10, 'num_subscribers')

In [None]:
plt.figure(figsize = (10, 4))
sns.barplot(data = udemy_DF[udemy_DF['is_paid'] == False].nlargest(10, 'num_subscribers'), 
            x = 'num_subscribers', y = 'course_title')
plt.title(label = "Top 10 Most Subscribed Free Courses")
plt.show()

## Top 10 Most Subscribed Courses in terms of Subject 

In [None]:
# Draw one bar chart per subject, in a fixed order.
for subject in ['Business Finance', 'Web Development', 'Musical Instruments', 'Graphic Design']:
    plt.figure(figsize = (10,3))
    sns.barplot(data = udemy_DF[udemy_DF['subject'] == subject].nlargest(10, 'num_subscribers'),
                x = 'num_subscribers', y = 'course_title')
    plt.title(f"Top 10 Most Subscribed {subject} Courses")
    plt.show()

# Most Reviewed Courses

## Top 10 Most Reviewed Courses

In [None]:
udemy_DF.nlargest(10, 'num_reviews')

In [None]:
plt.figure(figsize = (10, 4))
sns.barplot(data = udemy_DF.nlargest(10, 'num_reviews'), 
            x = 'num_reviews', y = 'course_title')
plt.title(label = "Top 10 Most Reviewed Courses")
plt.show()

## Top 10 Most Reviewed Paid Courses

In [None]:
udemy_DF[udemy_DF['is_paid'] == True].nlargest(10, 'num_reviews')

In [None]:
plt.figure(figsize = (10, 4))
sns.barplot(data = udemy_DF[udemy_DF['is_paid'] == True].nlargest(10, 'num_reviews'), 
            x = 'num_reviews', y = 'course_title')
plt.title(label = "Top 10 Most Reviewed Paid Courses")
plt.show()

## Top 10 Most Reviewed Free Courses

In [None]:
udemy_DF[udemy_DF['is_paid'] == False].nlargest(10, 'num_reviews')

In [None]:
plt.figure(figsize = (10, 4))
sns.barplot(data = udemy_DF[udemy_DF['is_paid'] == False].nlargest(10, 'num_reviews'), 
            x = 'num_reviews', y = 'course_title')
plt.title(label = "Top 10 Most Reviewed Free Courses")
plt.show()

## Top 10 Most Reviewed Courses in terms of Subject

In [None]:
# Draw one bar chart per subject, in a fixed order.
for subject in ['Business Finance', 'Web Development', 'Musical Instruments', 'Graphic Design']:
    plt.figure(figsize = (10,3))
    sns.barplot(data = udemy_DF[udemy_DF['subject'] == subject].nlargest(10, 'num_reviews'),
                x = 'num_reviews', y = 'course_title')
    plt.title(f"Top 10 Most Reviewed {subject} Courses")
    plt.show()


# Analyzing the Subjects

## Available Paid and Free Courses for the Subjects

In [None]:
plt.figure(figsize = (12,4))
sns.countplot(data = udemy_DF, x = 'subject', hue = 'is_paid')
plt.show()

In every subject, the majority of courses are paid.

## Difficulty Levels for Each Subject

In [None]:
plt.figure(figsize = (12,4))
sns.countplot(data = udemy_DF, x = 'subject', hue = 'level')
plt.show()

'All Levels' is the most common difficulty level in every subject except Musical Instruments, where beginner level courses are the most common.

## Year of Publication of Subjects

In [None]:
plt.figure(figsize = (18,4))
sns.countplot(data = udemy_DF, x = 'subject', hue = 'published_year')
plt.show()

In each subject, the highest number of courses was published in 2016.

## Prices of Subjects 

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data = udemy_DF, x = 'price', y = 'subject', linewidth = 2.2)
plt.show()

In [None]:
# Select the column before describing so only 'price' is summarized.
udemy_DF.groupby('subject')['price'].describe()

## Number of Subscribers for each Subject

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data = udemy_DF, x = 'num_subscribers', y = 'subject', linewidth = 2.2, showfliers = True)
plt.show()

In [None]:
# Select the column before describing so only 'num_subscribers' is summarized.
udemy_DF.groupby('subject')['num_subscribers'].describe()

## Number of Reviews for each Subject

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data = udemy_DF, x = 'num_reviews', y = 'subject', linewidth = 2.2, showfliers = True)
plt.show()

In [None]:
# Select the column before describing so only 'num_reviews' is summarized.
udemy_DF.groupby('subject')['num_reviews'].describe()

## Content Duration for each Subject

In [None]:
plt.figure(figsize=(10, 5))
sns.boxplot(data = udemy_DF, x = 'content_duration', y = 'subject', linewidth = 2.2, showfliers = True)
plt.show()

In [None]:
# Select the column before describing so only 'content_duration' is summarized.
udemy_DF.groupby('subject')['content_duration'].describe()

# Correlation

In [None]:
udemy_corr = udemy_DF[['price', 'num_subscribers', 'num_reviews', 'num_lectures', 'content_duration', 'published_year']].corr()
plt.figure(figsize=(8,8))
sns.heatmap(udemy_corr, annot = True, linewidths = 1.2, linecolor = 'white')
plt.xticks(rotation = 75)
plt.show()

Content duration and the number of lectures have a strong positive correlation, since more lectures naturally add up to a longer total duration. The number of reviews and the number of subscribers are strongly positively correlated as well: the more subscribers a course has, the more reviews it tends to receive.
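Besides reading the heatmap, the strongest pairs can be listed directly by flattening the correlation matrix. The cell below is a sketch on a hypothetical `toy` frame with invented values; on the real data the matrix would presumably be `udemy_corr`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the real analysis would use the udemy_DF columns.
toy = pd.DataFrame({
    'num_lectures':     [10, 20, 30, 40, 50],
    'content_duration': [1.0, 2.5, 3.0, 4.5, 5.0],
    'price':            [50, 20, 200, 20, 95],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 skips the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (row, column) pairs, drop the masked cells and sort by strength.
pairs = upper.stack().dropna().sort_values(ascending=False)
print(pairs)
```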

# Courses with No Reviews/Subscribers

As visible in the box plots for the number of reviews and subscribers, there are courses that have no reviews and/or no subscribers. Let's create two subsets, one containing courses with no subscribers and the other containing courses with no reviews.

In [None]:
udemy_DF_no_subs = udemy_DF[udemy_DF['num_subscribers'] == 0]
udemy_DF_no_revs = udemy_DF[udemy_DF['num_reviews'] == 0]

## General Description of Courses with No Subscribers

In [None]:
udemy_DF_no_subs[['num_subscribers', 'num_reviews', 'num_lectures', 'content_duration']].describe()

## General Description of Courses with No Reviews

In [None]:
udemy_DF_no_revs[['num_subscribers', 'num_reviews', 'num_lectures', 'content_duration']].describe()

From the above 2 tables, we can say that all the courses that have 0 subscribers have 0 reviews. However, not all unreviewed courses have 0 subscribers. There are 69 courses with 0 subscribers and 288 courses with 0 reviews. This means that there are 219 unreviewed courses that have more than 0 subscribers.
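The relationship between the two subsets can also be checked with boolean masks instead of reading it off the tables. The cell below is a sketch on a hypothetical `toy` frame; the column names mirror the dataset, but the values are invented.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset.
toy = pd.DataFrame({
    'num_subscribers': [0, 0, 15, 4259, 300],
    'num_reviews':     [0, 0, 0, 0, 12],
})

no_subs = toy['num_subscribers'] == 0
no_revs = toy['num_reviews'] == 0

# Every course without subscribers should also have no reviews.
all_unsubscribed_unreviewed = bool(no_revs[no_subs].all())

# Courses that have subscribers but no reviews.
subscribed_but_unreviewed = int((no_revs & ~no_subs).sum())

print(all_unsubscribed_unreviewed, subscribed_but_unreviewed)
```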

In [None]:
udemy_DF_no_revs[udemy_DF_no_revs['num_subscribers'] > 0].nlargest(5, 'num_subscribers')

The most subscribed course that has no reviews is "Effective Personal Website Building and Hosting", with 4259 subscribers.

## Paid/Free Courses, Difficulty Level and Subject

The left column shows courses with no subscribers and the right column shows courses with no reviews.

In [None]:
f, ax1 = plt.subplots(4, 2, figsize = (20, 20))
sns.countplot(data = udemy_DF_no_subs, x = "is_paid", ax = ax1[0,0])
sns.countplot(data = udemy_DF_no_revs, x = "is_paid", ax = ax1[0,1])
sns.countplot(data = udemy_DF_no_subs, x = "level", ax = ax1[1,0])
sns.countplot(data = udemy_DF_no_revs, x = "level", ax = ax1[1,1])
sns.countplot(data = udemy_DF_no_subs, x = "subject", ax = ax1[2,0])
sns.countplot(data = udemy_DF_no_revs, x = "subject", ax = ax1[2,1])
sns.countplot(data = udemy_DF_no_subs, x = "published_year", ax = ax1[3,0])
sns.countplot(data = udemy_DF_no_revs, x = "published_year", ax = ax1[3,1])
plt.show()

In [None]:
udemy_DF_no_revs.groupby(['published_year']).count()

* Paid/Free Courses: All the unsubscribed courses are paid. 97.6% of unreviewed courses are paid.
* Difficulty Level: Beginner courses form the largest group among both unsubscribed and unreviewed courses (50.7% and 43.05% respectively). 40.3% of all unreviewed courses are aimed at all difficulty levels. There are no expert level courses among the unsubscribed courses.
* Subject: There are no Web Development courses with 0 subscribers. Business Finance is the largest subject in both cases (56.5% of unsubscribed and 39.9% of unreviewed courses). Musical Instruments and Graphic Design have an equal number of unreviewed courses; each makes up 29.2% of all unreviewed courses.
* Published Year: In both cases, 2017 accounts for the largest share (69.6% of unsubscribed and 47.9% of unreviewed courses). A possible reason is that users have not yet come across the newer courses. Neither subset contains any course published in 2011, and there are no unsubscribed courses from 2012, 2013 or 2015 either.
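To compare the two subsets side by side, the normalized counts can be combined into one table. The cell below is a sketch with hypothetical `level` values; on the real data the two Series would presumably be `udemy_DF_no_subs['level']` and `udemy_DF_no_revs['level']`.

```python
import pandas as pd

# Hypothetical 'level' values for the two subsets.
no_subs_level = pd.Series(
    ['Beginner Level', 'Beginner Level', 'All Levels', 'Intermediate Level'])
no_revs_level = pd.Series(
    ['Beginner Level', 'All Levels', 'All Levels', 'All Levels', 'Expert Level'])

# One column of percentages per subset; levels missing from a subset become 0.
shares = pd.concat(
    {
        'no_subscribers': no_subs_level.value_counts(normalize=True),
        'no_reviews': no_revs_level.value_counts(normalize=True),
    },
    axis=1,
).fillna(0).mul(100).round(1)

print(shares)
```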

## Price, Number of Lectures and Content Duration

The left column shows courses with no subscribers and the right column shows courses with no reviews.

In [None]:
f, ax2 = plt.subplots(3, 2, figsize = (20, 20))
# histplot replaces the deprecated distplot(kde = False).
sns.histplot(data = udemy_DF_no_subs, x = 'price', ax = ax2[0,0], bins = 20)
sns.histplot(data = udemy_DF_no_revs, x = 'price', ax = ax2[0,1], bins = 20)
sns.histplot(data = udemy_DF_no_subs, x = 'num_lectures', ax = ax2[1,0])
sns.histplot(data = udemy_DF_no_revs, x = 'num_lectures', ax = ax2[1,1])
sns.histplot(data = udemy_DF_no_subs, x = 'content_duration', ax = ax2[2,0])
sns.histplot(data = udemy_DF_no_revs, x = 'content_duration', ax = ax2[2,1], bins = 10)
plt.show()

* All the graphs are right-skewed.
* Most of the prices for unsubscribed and unreviewed courses lie between 0 and 50 dollars. The maximum in both cases is 200 dollars.
* Most unsubscribed courses have around 10 lectures, with 35 being the maximum. Most unreviewed courses have between 0 and 50 lectures, with a maximum of 321. Both subsets have a minimum of 5 lectures.
* The longest unsubscribed course runs 6 hours, whereas the longest unreviewed course runs 31.5 hours. Most unsubscribed courses have a content duration of about 1 hour, while most unreviewed courses run approximately 0.5 to 3 hours.
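Right skew can be quantified rather than judged by eye: a positive sample skewness and a mean above the median both point to a long right tail. The cell below is a sketch on hypothetical lecture counts; on the real data the Series would presumably be a column such as `udemy_DF_no_revs['num_lectures']`.

```python
import pandas as pd

# Hypothetical lecture counts with a long right tail.
lectures = pd.Series([5, 6, 8, 10, 10, 12, 15, 35, 321])

# Sample skewness; a positive value means the right tail is longer.
skewness = lectures.skew()

print(f"skewness = {skewness:.2f}")
print(f"mean = {lectures.mean():.1f}, median = {lectures.median():.1f}")
```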