# **Udemy - Paid or Free?**

![image.png](attachment:image.png)

Benjamin Franklin is credited with saying, 'An investment in knowledge pays the best interest.' This notebook looks at Udemy, an educational platform that aims to educate professional adults and students. It was launched in May 2010 and has since offered numerous courses across a wide range of subjects. Udemy caters not only to English-speaking students but also supports over 65 languages, so that language is not a barrier for enthusiastic students worldwide. In this notebook we look closely at the courses offered by Udemy, the popularity of certain courses and subjects, and the factors that influence course prices.

Kindly upvote this notebook if you found it useful. I would also greatly appreciate any feedback or suggestions for improvement. Thank you :)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
sns.set_style('whitegrid')
import matplotlib.pyplot as plt
%matplotlib inline
import datetime as dt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Data Loading**

In [None]:
train = pd.read_csv('../input/udemy-courses/udemy_courses.csv')
train.head()

In [None]:
row, col = train.shape
print("The dataset has {} rows and {} columns".format(row, col))

# **Data Cleaning**

In [None]:
train.info()

Our dataset has features of type Boolean (1), Float (1), Integer (5) and Object/String (5), for a total of 12 features.

In [None]:
train.isnull().sum()

Since no feature has missing values, we do not need to handle them.

In [None]:
# Parse the timestamps in one vectorized call, then extract the publication year
train['time'] = pd.to_datetime(train['published_timestamp'], format='%Y-%m-%dT%H:%M:%SZ')
train['year'] = train['time'].dt.year
train.head()

In [None]:
train.drop('published_timestamp',axis=1,inplace=True)

# **Data Analysis**

In [None]:

Free = train[train['is_paid'] == 0]
Free.shape


In [None]:
Paid = train[train['is_paid'] == 1]
Paid.shape

In [None]:
columns = train['subject'].unique()
columns

* What are the best free and paid courses by subject?

In [None]:
for x in columns:
    subset = Free[Free['subject'] == x]
    maxr = subset['num_subscribers'].max()
    # Look up the title within the same subject, so an equal count in another subject is not picked up
    course = subset[subset['num_subscribers'] == maxr]['course_title'].unique()
    print("-----------------------------------------------------------------------------------")
    print("The best free course offered by udemy for {} is \n{} with {} subscribers\n".format(x,course[0],maxr))

for x in columns:
    subset = Paid[Paid['subject'] == x]
    maxr = subset['num_subscribers'].max()
    # Same subject-restricted lookup for the paid courses
    course = subset[subset['num_subscribers'] == maxr]['course_title'].unique()
    print("-----------------------------------------------------------------------------------")
    print("The best paid course offered by udemy for {} is \n{} with {} subscribers\n".format(x,course[0],maxr))


* What are the most popular courses?

In [None]:

print("The top 5 most popular courses with respect to subscribers are:\n")
popular = train.sort_values(['num_subscribers'],ascending=False).head()['course_title'].unique()
for title in popular:
    print(title)

* What are the most engaging courses?

In [None]:
# Engagement is defined here as subscribers multiplied by reviews
train['engagment']  = train['num_subscribers'] * train['num_reviews']
print("The top 5 most engaging courses with respect to subscribers and reviews are:\n")
engaging = train.sort_values(['engagment'],ascending=False).head()['course_title'].unique()
for title in engaging:
    print(title)

* Which courses offer the best cost benefit?

In [None]:
priceprefect = train[(train['price']<=train['price'].mean()) & (train['engagment']>=train['engagment'].mean())].sort_values(('engagment'),ascending=False)['course_title'].head(1).unique()[0]
print("The best course that offers cost benefit is",priceprefect)

* Which were the most popular courses according to the year they were published?

In [None]:
years = train['year'].unique()

for x in years:
    subset = train[train['year'] == x]
    maxr = subset['num_subscribers'].max()
    # Look up the title within the same year, so an equal count in another year is not picked up
    course = subset[subset['num_subscribers'] == maxr]['course_title'].unique()
    print("-----------------------------------------------------------------------------------")
    print("The best course offered by udemy in {} was \n{} with {} subscribers\n".format(x,course[0],maxr))
    

# **Data Visualization**

In [None]:
sns.countplot(x='is_paid',data=train)  # keyword x= is required by recent seaborn versions

There are considerably more paid courses than free courses.

**Relationship between Subject, Level and Price**

In [None]:
sns.countplot(x='is_paid',hue='subject',data=train)

Among the free courses, the widest selection is in 'Web Development'. 
Among the paid courses, the widest selection is in 'Business Finance'.

In [None]:
sns.countplot(x='is_paid',hue='level',data=train)
plt.legend(loc='upper left')

Paid courses are available at every level, whereas no Expert Level courses are available for free.

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x='subject',hue='level',data=train)

Every subject offers courses at all levels, although some subjects are dominated by a particular level. 
Business Finance, Graphic Design and Web Development offer many 'All Levels' courses.
Musical Instruments offers mostly Beginner Level courses.
All subjects have very few Expert Level courses. 

In [None]:
train.price.hist(bins=10)
plt.xlabel("Price")
plt.title("Price range and its frequency")

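Judging by the histogram, the most common course price is around $25. Note that the plot counts courses per price range, not purchases.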

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='subject',y='price',hue='level',data=Paid)

Comparing the graphs, we notice that Expert Level courses in many subjects cost considerably more than the other levels of the same subject. This could be one reason why fewer users have purchased Expert Level courses. On the other hand, All Levels and Beginner Level courses are reasonably priced and popular among users.

**Relationship between Number of Subscribers, Number of Reviews, Subject and Level**

In [None]:
sns.lmplot(x='num_subscribers',y='num_reviews',data=train)

There is a positive relationship between the number of subscribers and the number of reviews. Looking closely, we can see some outliers in the graph. It is generally advisable to deal with such outliers before training a predictive model.
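
One common way to flag such outliers is the interquartile-range (IQR) rule. The sketch below is only an illustration on made-up numbers; the helper name `iqr_outlier_mask` and the threshold `k=1.5` are assumptions, not part of the original analysis.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Toy numbers standing in for a column such as train['num_subscribers']
toy = pd.Series([10, 12, 11, 13, 12, 14, 11, 500])
mask = iqr_outlier_mask(toy)
print(toy[mask].tolist())  # [500]
```

On the real data the same mask could be applied to `train['num_subscribers']` or `train['num_reviews']` to inspect the flagged courses before deciding whether to drop them.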

In [None]:
table1 = pd.pivot_table(train, values=['num_subscribers','num_reviews'], index=['is_paid'],aggfunc=np.sum)
table1

In [None]:
table1.plot(kind='bar')

We create a pivot table to calculate the total number of subscribers and reviews for free and paid courses. We observe that paid courses have more subscribers and reviews in total.
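
Because there are many more paid courses than free ones, totals naturally favour the paid group. A per-course average can tell a different story. The sketch below uses made-up numbers (with `is_paid` coded as 0/1 for simplicity) purely to show the difference between the two aggregations.

```python
import pandas as pd

# Toy data: 2 free courses (is_paid=0) and 4 paid courses (is_paid=1)
toy = pd.DataFrame({
    'is_paid': [0, 0, 1, 1, 1, 1],
    'num_subscribers': [900, 1100, 400, 600, 500, 700],
})

# Compare the total, the per-course mean and the group size side by side
summary = toy.groupby('is_paid')['num_subscribers'].agg(['sum', 'mean', 'count'])
print(summary)
```

In this toy example the paid group has the larger total but the smaller mean, which is why it can be worth checking both before drawing conclusions.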

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='subject',y='num_subscribers',hue='level',data=train)

Web Development courses have many subscribers at every level. They are followed by Graphic Design courses for All Levels and then by Beginner Level Business Finance courses.

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='subject',y='num_reviews',hue='level',data=train)

The most reviews were posted for Web Development courses.

**Relationship between Number of Lectures, Content Duration, Subject and Level**

In [None]:
sns.lmplot(x='num_lectures',y='content_duration',data=train)

There is a positive relationship between the number of lectures and the content duration: as the number of lectures increases, the content duration increases as well. However, there are also a number of outliers.
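
The strength of such a linear relationship can be summarised with the Pearson correlation coefficient. The following is a minimal sketch on made-up values; on the real data the same call would be made on `train['num_lectures']` and `train['content_duration']`.

```python
import pandas as pd

# Toy values standing in for the two notebook columns
toy = pd.DataFrame({
    'num_lectures': [10, 20, 30, 40, 50],
    'content_duration': [1.0, 2.5, 2.9, 4.2, 5.1],
})

# Series.corr computes the Pearson correlation by default
r = toy['num_lectures'].corr(toy['content_duration'])
print(round(r, 3))
```

A value close to 1 indicates a strong positive linear relationship; outliers can pull the coefficient noticeably in either direction.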

In [None]:
table2 = pd.pivot_table(train, values=['num_lectures','content_duration'], index=['is_paid'],aggfunc=np.sum)
table2

In [None]:
table2.plot(kind='bar')

We create a pivot table to calculate the total content duration and the total number of lectures for free and paid courses. We observe that paid courses have more lectures in total.

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='subject',y='content_duration',hue='level',data=train)

The most content was posted for Web Development courses. We also observe a very wide error bar for the Expert Level of the Graphic Design subject, which indicates high uncertainty in the mean duration of that group.
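
By default, seaborn's `barplot` draws the mean of each group with an error bar for a bootstrapped 95% confidence interval, so a wide bar usually reflects a small or highly variable group rather than an error in the data. The sketch below uses made-up durations to show how the group size and spread drive the standard error of the mean; the column `sem` is a name chosen here for illustration.

```python
import pandas as pd

# Toy data: a large, consistent group and a small, variable group
toy = pd.DataFrame({
    'level': ['Beginner Level'] * 6 + ['Expert Level'] * 2,
    'content_duration': [2.0, 2.5, 3.0, 2.5, 2.0, 3.0, 1.0, 9.0],
})

stats = toy.groupby('level')['content_duration'].agg(['mean', 'std', 'count'])
# Standard error of the mean: std / sqrt(n)
stats['sem'] = stats['std'] / stats['count'] ** 0.5
print(stats)
```

Running the same aggregation on `train`, grouped by `subject` and `level`, would show how many courses sit behind each bar.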

**Relationship between Year of post, Subject and Number of Subscribers**

In [None]:
sns.countplot(x='year',data=train)

The most courses were posted on Udemy in 2016.

In [None]:
table3 = pd.pivot_table(train, values=['num_subscribers'], index=['year'],columns=['subject'],aggfunc=np.sum)
table3

In [None]:
table3.plot(kind='bar',figsize=(10,5))

In [None]:
table4 = pd.pivot_table(train, values=['num_lectures'], index=['year'],columns=['subject'],aggfunc=np.sum)
table4

In [None]:
table4.plot(kind='bar',figsize=(8,5))

We observe that Web Development courses were the first to appear, in 2011, with a minimal number of subscribers. They grew in popularity in 2015, which may be why the number of lectures increased in 2016. In 2012 two other subjects were introduced, Graphic Design and Musical Instruments; these courses did not gain as much popularity among users, so the number of lectures increased at a slow pace. In 2013 Business Finance courses were introduced; as they received a positive response from users, the number of lectures increased moderately.

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='year',y='price',data=train,estimator=np.sum)

# **Summary of observation**

* Many users have purchased Udemy courses online.
* Business Finance and Web Development are the most popular subjects.
* There are very few Expert Level courses among the available paid courses and none among the free courses.
* Users prefer courses that cater to ‘All Levels’, both because they are affordable and because Udemy offers a wide range of them.
* Paid courses account for the most subscribers, reviews, lectures and duration. The free courses show only 685.33 hours of content for 6639 lectures, roughly 6 minutes per lecture, which looks suspiciously low and may point to a data issue.
* The most lectures and the longest total duration belong to Web Development, All Levels.
* A large number of courses were posted in 2016, many of them in Web Development, in response to its increasing popularity.
* The total of listed course prices was highest for courses published in 2016, across the 7 years covered. This is a sum of prices, not of actual sales.
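
The yearly bar chart sums the listed `price` of the courses published in each year, which is not the same as money earned. A rough proxy for revenue could multiply each course's price by its number of subscribers. The sketch below is only an illustration on made-up numbers: the column name `est_revenue` is hypothetical, and the proxy ignores discounts and price changes.

```python
import pandas as pd

# Toy data standing in for the year, price and num_subscribers columns
toy = pd.DataFrame({
    'year': [2015, 2015, 2016, 2016],
    'price': [200, 20, 50, 0],
    'num_subscribers': [10, 1000, 1000, 5000],
})

# Sum of listed prices per year (what the bar chart shows)
listed = toy.groupby('year')['price'].sum()

# Rough revenue proxy: price times subscribers, summed per year
toy['est_revenue'] = toy['price'] * toy['num_subscribers']
estimated = toy.groupby('year')['est_revenue'].sum()

print(listed)
print(estimated)
```

In this toy example the two measures rank the years differently, so the choice of measure matters for any statement about revenue.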


# **Data Modeling**

For data modeling we need to split our data into a training set and a test set. Once the split is done, we fit several models and check the accuracy of each on the test set. We select the model with the highest accuracy score.
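
The `score` method of scikit-learn classifiers returns accuracy, which is a different quantity from precision. Both can be read off a binary confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. The helper below is a small sketch with made-up counts; the function name is an assumption made for illustration.

```python
def accuracy_and_precision(tn: int, fp: int, fn: int, tp: int):
    """Compute accuracy and precision from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total              # share of all predictions that are correct
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of predicted positives that are correct
    return accuracy, precision

# Made-up counts, not results from this notebook
acc, prec = accuracy_and_precision(tn=50, fp=10, fn=5, tp=35)
print(acc, round(prec, 4))
```

With imbalanced classes, such as many more paid than free courses, accuracy alone can look high even for a weak model, so it is worth reading the confusion matrix as well.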

In [None]:
from sklearn.model_selection import train_test_split 
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
model = train.copy()  # work on a copy so the cleaning steps below do not alter train
model.describe()

We notice that the minimum number of lectures and the minimum content duration are both 0, which is not plausible for a real course, so we need to eliminate that record.

In [None]:
# Index labels of every row with zero lectures (avoids shadowing the built-in id)
zero_idx = model[model['num_lectures'] == 0].index

In [None]:
model.drop(zero_idx,axis=0,inplace=True)

In [None]:
model.info()

The dataset has 1 boolean and 2 categorical features; most models work best with numeric inputs, so we convert these into numeric features. We also do not need features such as the title, URL and time, so we drop them.
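
Integer codes such as 0, 1, 2, 3 imply an order between categories, which subjects do not really have. One-hot encoding is a common alternative. The sketch below shows it on a made-up frame with `pd.get_dummies`; it is offered as an option, not as the approach used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the subject column
toy = pd.DataFrame({
    'subject': ['Business Finance', 'Web Development', 'Graphic Design'],
})

# One indicator column per subject; no artificial ordering is introduced
encoded = pd.get_dummies(toy, columns=['subject'])
print(encoded.columns.tolist())
# ['subject_Business Finance', 'subject_Graphic Design', 'subject_Web Development']
```

Tree-based models are usually less sensitive to the choice of encoding, while Logistic Regression, K-Nearest Neighbour and SVM can be affected by an artificial ordering.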

In [None]:
model.drop(['course_title','url','time'],axis=1,inplace=True)

In [None]:
# Convert the boolean flag to 0 (free) / 1 (paid)
model['is_paid'] = model['is_paid'].astype(int)

In [None]:
def lev(val): 
  
    if val == 'All Levels': 
        return 0
    elif val == 'Intermediate Level': 
        return 1
    elif val == 'Beginner Level': 
        return 2
    else:
        return 3
    
model['level'] = model['level'].apply(lev)

In [None]:
def sub(val): 
  
    if val == 'Business Finance': 
        return 0
    elif val == 'Graphic Design': 
        return 1
    elif val == 'Musical Instruments': 
        return 2
    else:
        return 3
    
model['subject'] = model['subject'].apply(sub)

In [None]:
def year(val): 
  
    if val == 2017: 
        return 6
    elif val == 2016: 
        return 5
    elif val == 2015: 
        return 4
    elif val == 2014: 
        return 3
    elif val == 2013: 
        return 2
    elif val == 2012: 
        return 1
    else:
        return 0
    
model['year'] = model['year'].apply(year)

We convert the year feature into simple values that can be used for prediction. 
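
The same encoding can be written more compactly by subtracting the earliest year, assuming the earliest year in the data is 2011 as the mapping above implies. A minimal sketch on made-up values:

```python
import pandas as pd

# Toy values standing in for model['year']
years = pd.Series([2011, 2013, 2017, 2016])

# Offset from the earliest year: 2011 -> 0, 2012 -> 1, ..., 2017 -> 6
encoded = years - years.min()
print(encoded.tolist())  # [0, 2, 6, 5]
```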

In [None]:
X = model.drop(['course_id','is_paid'],axis=1)
y = model['is_paid']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)

# **1. Logistic Regression**

In [None]:
#Import Packages 
from sklearn.linear_model import LogisticRegression

In [None]:
#Object creation and fitting of training set
lmodel = LogisticRegression()
lmodel.fit(X_train,y_train)

In [None]:
#Creation of a prediction variable
lpredictions = lmodel.predict(X_test)

In [None]:
#Accuracy Matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,lpredictions))

lscore = round((lmodel.score(X_test, y_test)*100),2)
print ("\nModel Score:",lscore,"%")

# **2. K-Nearest Neighbour**

In [None]:
#Import Packages 
from sklearn.neighbors import KNeighborsClassifier

In [None]:
kmodel = KNeighborsClassifier(n_neighbors=3)
kmodel.fit(X_train,y_train)

In [None]:
#Creation of a prediction variable
kpredictions = kmodel.predict(X_test)

In [None]:
#Accuracy Matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,kpredictions))

kscore = round((kmodel.score(X_test, y_test)*100),2)
print ("\nModel Score:",kscore,"%")

# **3. Decision Tree**

In [None]:
#Import Packages 
from sklearn.tree import DecisionTreeClassifier

In [None]:
#Object creation and fitting of training set
dmodel = DecisionTreeClassifier()
dmodel.fit(X_train,y_train)

In [None]:
#Creation of a prediction variable
dprediction = dmodel.predict(X_test)

In [None]:
#Accuracy Matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,dprediction))

dscore = round((dmodel.score(X_test, y_test)*100),2)
print ("\nModel Score:",dscore,"%")

# **4. Random Forest**

In [None]:
#Import Packages 
from sklearn.ensemble import RandomForestClassifier

In [None]:
#Object creation and fitting of training set
rmodel = RandomForestClassifier(n_estimators=100)
rmodel.fit(X_train,y_train)

In [None]:
#Creation of a prediction variable
rprediction = rmodel.predict(X_test)


In [None]:
#Accuracy Matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,rprediction))

rscore = round((rmodel.score(X_test, y_test)*100),2)
print ("\nModel Score:",rscore,"%")

# **5. Support Vector Machine**

In [None]:
#Import Packages 
from sklearn.svm import SVC

In [None]:
#Object creation and fitting of training set
smodel = SVC()
smodel.fit(X_train,y_train)

In [None]:
#Creation of a prediction variable
sprediction = smodel.predict(X_test)

In [None]:
#Accuracy Matrix
print("Confusion Matrix")
print(confusion_matrix(y_test,sprediction))

sscore = round((smodel.score(X_test, y_test)*100),2)
print ("\nModel Score:",sscore,"%")

# **Conclusion**

In [None]:
data = [['Logistic Regression',lscore],['K-Nearest Neighbour',kscore],
        ['Decision Tree',dscore],['Random Forest',rscore],['Support Vector Machine',sscore]]
final = pd.DataFrame(data,columns=['Algorithm','Accuracy (%)'],index=[1,2,3,4,5])  # score() returns accuracy

In [None]:
print("The results of Data Modeling are as follows:\n ")
print(final)

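Among all the algorithms, Decision Tree and Random Forest achieve the highest accuracy, so either of them can be used to predict whether a course is paid or free. Keep in mind that `price` is one of the input features, so a free course can probably be identified from its price alone, which may explain the very high scores.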
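
One quick sanity check is to see how well the trivial rule "paid if price is greater than 0" reproduces `is_paid`. If it matches almost every row, the models are mostly learning that rule. The helper below is a sketch on made-up data; the function name is hypothetical, and it assumes free courses carry a price of 0.

```python
import pandas as pd

def price_rule_agreement(df: pd.DataFrame) -> float:
    """Share of rows where the rule `price > 0` matches `is_paid`."""
    predicted_paid = (df['price'] > 0).astype(int)
    return float((predicted_paid == df['is_paid'].astype(int)).mean())

# Toy data, not the real dataset
toy = pd.DataFrame({'price': [0, 0, 20, 195], 'is_paid': [0, 0, 1, 1]})
print(price_rule_agreement(toy))  # 1.0
```

If the agreement on the real data is close to 1, it may be more informative to drop `price` from `X` and compare the models again.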