<h1 style="text-align: center;">EDA of Udemy Courses and ML to Predict Subscribers</h1>

In [None]:
# common imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# pandas imports
from pandas.plotting import scatter_matrix

# machine learning imports
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn import metrics

# display setup
pd.set_option("display.max_columns", None) # the None parameter displays unlimited columns
sns.set(style="whitegrid") # for plots

# 1. Getting the Data

In [None]:
# read the csv file
df = pd.read_csv("../input/udemy-courses/udemy_courses.csv")

In [None]:
# display the first 5 rows for a quick look
df.head()

In [None]:
# DataFrame shape (rows, columns)
# understand the amount of data we are working with
df.shape

In [None]:
# description of data
df.info()

In [None]:
# check if there are null values
df.isna().sum()

In [None]:
# summary of the numerical attributes
df.describe()

> As shown above, there are no missing values which is excellent!
>
> ##### *It is vital to understand the features we are working with.*
> ### Features in the DataFrame:
>> 1. course_id: Course identification number
>> 2. course_title: Title of course
>> 3. url: Course URL
>> 4. is_paid: True if the course costs money, false if the course is free
>> 5. price: Price of course
>> 6. num_subscribers: Number of subscribers for the course
>> 7. num_reviews: Number of reviews for the course
>> 8. num_lectures: Number of lectures in the course
>> 9. level: Difficulty level of the course
>> 10. content_duration: Duration of all course materials
>> 11. published_timestamp: Course publication date
>> 12. subject: Subject of course

In [None]:
# a histogram plot for each numerical attribute
df.drop("is_paid", axis=1).hist(bins=30, figsize=(20,15))
plt.tight_layout()
plt.show()

> Initial observations from the histograms:
>> 1. Most course durations are between 0 and 5 hours.
>> 2. Most courses have roughly 1-50 lectures.
>> 3. Courses tend to have few reviews. A handful of courses probably have
>> a large number of reviews, since the x-axis extends to 25000 while over 3000
>> instances fall in the first bin.
>> 4. The majority of courses are in the same range of subscribers. The instances farther up
>> the scale were probably more successful, or perhaps covered a trending topic.
>> 5. Assuming the prices are in USD, they range from roughly 0 to 250 dollars.
>> The plot shows that the most common price is roughly $25.

> # Objective
> ## Predicting the number of subscribers for a course.
>> ### Chosen Feature:
>> #### *num_subscribers* column
>>> The column represents how many people have subscribed to each course.
>>> ### Motive:
>>> The number of subscribers reflects course popularity, so predicting it estimates how popular a course is likely to be.

> ### Splitting the Data:
>> Before further analysis, let's split the data into a training set and a testing set.
>> Setting the test set aside now avoids the bias that comes from exploring and modeling the data as a whole.

In [None]:
# use sklearn train_test_split function to split the data
# the random state parameter ensures that data will be shuffled and split the same way in each run
train_set, test_set = train_test_split(df, test_size=0.20, random_state=42)

In [None]:
print("Number of instances in training set: ", len(train_set))
print("Number of instances in testing set: ", len(test_set))

# 2. Understanding and Visualizing the Data
> ##### *The motivation for this section is to gain more insights*

In [None]:
# deep copy of the training set
df2 = train_set.copy()

In [None]:
df2.head(2)

> ## Exploring Attribute Combinations

In [None]:
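# compute the correlation matrix of the numerical columns
# numeric_only=True (pandas 1.5+) skips the text columns, which would raise an error in pandas 2.0+
corr_matrix = df2.corr(numeric_only=True)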

In [None]:
# correlation of each attribute with the num_subscribers feature
corr_matrix["num_subscribers"].sort_values(ascending=False)

In [None]:
# scatter matrix of the attributes most strongly correlated with num_subscribers

attributes = ["num_subscribers", "num_reviews", "num_lectures", "content_duration", "course_id"]

scatter_matrix(df2[attributes], figsize=(12,8))
plt.tight_layout()
plt.show()

In [None]:
# scatter plot of the strongest correlation in the corr matrix
# the alpha is set to show the distribution more clearly
df2.plot(kind="scatter", x="num_reviews", y="num_subscribers", alpha=0.1,
         color='b', figsize=(10,5))
plt.title("Reviews and Subscribers Correlation", size=20)
plt.xlabel("num_reviews", size=15)
plt.ylabel("num_subscribers", size=15)
plt.tight_layout()
plt.show()

> ### Correlations with num_subscribers Attribute: Overview
>> The strongest positive correlations (0.1 or more) are:
>> * num_reviews
>> * num_lectures
>> * content_duration
>>
>> The strongest negative correlations (-0.1 or less) are:
>> * course_id
>> * is_paid
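
> The values in the correlation matrix are Pearson correlation coefficients (the default of `DataFrame.corr`). As a rough, library-free sketch of what that number measures, the helper below computes it by hand. The function name and the example numbers are made up for illustration and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # co-variation of the two sequences around their means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# made-up values, for illustration only
example_reviews = [1, 2, 3, 4, 5]
example_subscribers = [10, 20, 30, 40, 50]  # rises together with reviews
example_ids = [50, 40, 30, 20, 10]          # falls as reviews rise

r_positive = pearson_r(example_reviews, example_subscribers)
r_negative = pearson_r(example_reviews, example_ids)
print(r_positive, r_negative)
```

> A coefficient near 1 means the two attributes rise together, a coefficient near -1 means one falls as the other rises, and a coefficient near 0 means there is no linear relationship.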

> ### Examining Course ID Feature

In [None]:
print("Number of unique course IDs:", df2["course_id"].nunique())
print("Length of DataFrame:", len(df2))

In [None]:
# check the number of unique URLs
# each instance should have its own URL
df2["url"].nunique()

> Since almost every instance has its own course ID, the column is an identifier rather than
> a measurement, so its correlation with num_subscribers was probably coincidental.

In [None]:
# show duplicated listings
df2[df2.duplicated("course_id")]

In [None]:
# remove duplicated listings (rows that repeat a course_id, as shown above)
df2.drop_duplicates("course_id", inplace=True)

In [None]:
# examine changes
df2.shape

> ### Overview:
>> * The course ID is unique for each course.
>> * This column should be removed when training a model in order to generalize better.

> ### Assessing Price Features

In [None]:
# evaluate current values in column
df2["is_paid"].head(10)

In [None]:
# use encoder to convert "is_paid" column to binary outcome
ordinal_encoder = OrdinalEncoder(dtype=int)
df2["is_paid"] = ordinal_encoder.fit_transform(df2[["is_paid"]])

In [None]:
# evaluate changes
df2["is_paid"].head(10)

In [None]:
# 0 is False, 1 is True
ordinal_encoder.categories_

In [None]:
# count number of instances for each outcome
df2["is_paid"].value_counts()

In [None]:
# use groupby for price attribute
price_values = df2.groupby("price")

In [None]:
# check if number of free courses matches when the price is 0
price_values_0 = price_values.get_group(0)
price_values_0.shape

In [None]:
# plot of free and paid courses
plt.figure(figsize=(10,5))
sns.countplot(x=df2["is_paid"])
plt.title("Free and Paid Courses", size=20)
plt.xlabel("is_paid", size = 15)
plt.ylabel("count", size=15)
plt.tight_layout()
plt.show()

In [None]:
# course price values sorted by prices
df2["price"].value_counts().sort_index()

In [None]:
# top ten course price values sorted by value counts
prices_top10 = df2["price"].value_counts().sort_values(ascending=False).head(10)

In [None]:
# calculate percentage of instances per price in data
prices_percent_in_data = []
num_subscribed = []

for i in range(len(prices_top10.index)):
    prices_percent_in_data.append(round((prices_top10.values[i]/len(df2))*100,2))
    num_subscribed.append(price_values.get_group(prices_top10.index[i])["num_subscribers"].sum())

In [None]:
# create a DataFrame with the results
prices_top10_dict = {"price": prices_top10.index, "number_of_instances": prices_top10.values,
                     "% of data": prices_percent_in_data, "num_subscribers": num_subscribed}
prices_top10_df = pd.DataFrame(prices_top10_dict, index=range(1,11))
prices_top10_df

In [None]:
# plot of top 10 common prices by amount of subscribers
plt.figure(figsize=(10,5))
sns.barplot(x=prices_top10_df["price"], y=prices_top10_df["num_subscribers"])
plt.xlabel("price", size=15)
plt.ylabel("num_subscribers\n(millions)", size=15)
plt.title("Top 10 Common Prices by Subscribers", size=20)
plt.tight_layout()
plt.show()

In [None]:
# plot of content duration by free or paid course
plt.figure(figsize=(10,5))
sns.scatterplot(x=df2["content_duration"], y=df2["is_paid"], alpha=0.1)
plt.title("Content Duration by Type of Course Payment", size=20)
plt.xlabel("content_duration", size=15)
plt.ylabel("is_paid", size=15)
plt.tight_layout()
plt.show()

> ### Observations:
>> * Close to the estimate made in the initial observations, $20 is the most common price for a course.
>> * The number of listings with a price of $0 matches the number of instances
>> labeled "False" in the is_paid column.
>> * The listed prices tend to increase in steps of 5 dollars up to the maximum price,
>> which is $200.
>> * Among the 10 most common prices in the data, free courses have the most subscribers.
>> * Content duration tends to be longer for paid courses.

> ### Researching Level and Subject Features

In [None]:
# count number of instances
level_values = df2["level"].value_counts()
level_values

In [None]:
# count number of instances
subject_values = df2["subject"].value_counts()
subject_values

In [None]:
# pie plot of course levels and subjects in data
fig, ax = plt.subplots(1,2, figsize=(10,5))
ax[0].pie(level_values, startangle=180, labels=level_values.index, autopct="%1.1f%%")
ax[0].set_title("Course Levels", size=20)
ax[1].pie(subject_values, startangle=180, labels=subject_values.index, autopct="%1.1f%%")
ax[1].set_title("Course Subjects", size=20)
plt.tight_layout()
plt.show()

In [None]:
# scatter plot of price by course level
plt.figure(figsize=(10,5))
sns.scatterplot(y=df2["level"], x=df2["price"], alpha=0.1)
plt.title("Price by Course Level", size=20)
plt.xlabel("price", size=15)
plt.ylabel("level", size=15)
plt.tight_layout()
plt.show()

In [None]:
# plot subject by number of subscribers and level
# the black lines are error bars (by default, seaborn's confidence interval around the mean)
plt.figure(figsize=(10,5))
sns.barplot(x=df2["subject"], y=df2["num_subscribers"], hue=df2["level"])
plt.title("Subject by Number of Subscribers and Level", size=20)
plt.xlabel("subject", size=15)
plt.ylabel("num_subscribers", size=15)
plt.tight_layout()
plt.show()

> ### Observations:
>> * All Levels is the most common level, representing over 50% of the courses.
>> * Web Development is the most common subject, and Business Finance is second, trailing by
>> approximately 1%.
>> * The plot of price by level also shows that Expert is
>> the least common level in the data. It is also the only level that does not
>> offer free courses. The other levels are spread more densely
>> across the price range.
>> * Web Development courses have significantly more subscribers than the other subjects.
>> Since Business Finance is only slightly behind in number of courses, it is likely that people are
>> simply more interested in studying Web Development.

> ### Analyzing Additional Columns

In [None]:
# examine current shape
df2.shape

In [None]:
# every course has a unique URL
df2["url"].nunique()

In [None]:
# some courses have an identical title
df2["course_title"].nunique()

In [None]:
# find duplicated instances
# keep=False marks every occurrence of a duplicated title as True
title_df = df2[df2.duplicated("course_title", keep=False)].copy()
# show duplicated titles
title_df["course_title"].unique()

In [None]:
# examine number of unique subscribers values
title_df["num_subscribers"].nunique()

In [None]:
# group by course title
title = title_df.groupby("course_title")

In [None]:
# examining one of the duplicated courses
# the courses have the same name and different values for some features
title.get_group("Acoustic Blues Guitar Lessons")

> ### Observations:
>> * The courses with duplicated titles differ in features such as is_paid or published_timestamp.
>> Maybe the course provides the first lessons free of charge, or new content was added later.
>> * These instances can be kept, since they are distinct listings with different values (e.g. each
>> value in the num_subscribers column is unique).

# 3. Data Cleaning

In [None]:
# clean copy of training set
df3 = train_set.copy()

In [None]:
df3.shape

In [None]:
# remove duplicated instances
df3.drop_duplicates("course_id", inplace=True)

In [None]:
# evaluate changes
df3.shape

In [None]:
# separate predictors from target values

# drop creates a copy without changing the training set
X_train = df3.drop("num_subscribers", axis=1)

# create a deep copy of the target values
y_train = df3["num_subscribers"].copy()

> ### Removing the Following Columns:
> These columns are removed so that the model generalizes better.
> Columns such as url and course_id have a unique value for each instance, so they
> provide no information the model can learn from to predict on new data.
>> * course_id
>> * course_title
>> * url
>> * published_timestamp

In [None]:
# list of numerical features
num_features = ["price", "num_reviews", "num_lectures", "content_duration"]

# list of level feature categories
levels = ["All Levels", "Beginner Level", "Intermediate Level", "Expert Level"]

# column transformer:
# features generated by each transformer will be concatenated to form a single feature space
# columns of the original feature matrix that are not specified are dropped
full_pipeline = ColumnTransformer([

# MinMaxScaler normalizes data (rescales between 0-1)
    ("num", MinMaxScaler(), num_features),

# OrdinalEncoder converts categories to integers according to order specified in list
    ("level", OrdinalEncoder(categories=[levels]), ["level"]),

# OrdinalEncoder converts True and False values to integers
# True=1, False=0
    ("is_paid", OrdinalEncoder(dtype=int), ["is_paid"]),

# OneHotEncoder converts categories to a binary dummy array
    ("subject", OneHotEncoder(handle_unknown="ignore"), ["subject"])
])
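
> The three kinds of transformation named in the comments above can be sketched without scikit-learn. This is only an illustration of the arithmetic, not how the library implements it. The function names and the example values are hypothetical. Note that the real transformers learn their minimum, maximum and categories from the training data in `fit` and reuse them in `transform`.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant column: map everything to 0.0 (a choice made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def ordinal_encode(values, categories):
    """Replace each category with its position in the ordered category list."""
    position = {category: i for i, category in enumerate(categories)}
    return [position[v] for v in values]

def one_hot_encode(values, categories):
    """Replace each category with a 0/1 row that has a single 1."""
    return [[1 if v == category else 0 for category in categories] for v in values]

# made-up inputs, for illustration only
scaled_prices = min_max_scale([0, 20, 50, 200])

level_order = ["All Levels", "Beginner Level", "Intermediate Level", "Expert Level"]
encoded_levels = ordinal_encode(["Beginner Level", "All Levels", "Expert Level"], level_order)

subject_names = ["Business Finance", "Web Development"]
encoded_subjects = one_hot_encode(["Web Development", "Business Finance"], subject_names)

print(scaled_prices)
print(encoded_levels)
print(encoded_subjects)
```

> Ordinal encoding produces a single column per feature, while one-hot encoding produces one column per category. This matters later when pairing feature names with feature importances.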

In [None]:
features = num_features+["level", "is_paid", "subject"]

# transform training data using pipeline
X_train_prepared = full_pipeline.fit_transform(X_train)
X_tr_testing = full_pipeline.transform(X_train)

# 4. Training and Evaluating Models

> Chosen evaluation metric:
>
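> The root-mean-square error (RMSE) is the square root of the average squared difference
> between the predicted and actual values. It is expressed in the same units as the target and shows how far
> the predictions are spread around the actual values; when the errors average to zero, it equals their standard deviation.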
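
> As a small worked example of the metric, the sketch below computes the RMSE by hand for a few made-up values. The helper name `rmse` and the numbers are illustrative assumptions, not part of the notebook's pipeline.

```python
import math

def rmse(actual, predicted):
    """Square root of the mean squared difference between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up subscriber counts, for illustration only
example_actual = [100, 200, 300, 400]
example_predicted = [110, 190, 330, 360]

# errors are -10, 10, -30, 40, so the mean squared error is 2700 / 4 = 675
example_rmse = rmse(example_actual, example_predicted)
print(example_rmse)
```

> Because the errors are squared before averaging, a single large miss (here, 40) raises the RMSE more than several small ones.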

In [None]:
# function prints scores, mean and std
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

# function prints evaluation metrics
def display_evaluation(actual, pred):
    mse = metrics.mean_squared_error(actual, pred)
    print("Mean Squared Error:", mse)
    print("Root Mean Squared Error:", np.sqrt(mse))

> The Linear Regression model computes a weighted sum of the input features, and a constant which
> is the bias/intercept term. As the name implies, this is in fact a linear function.
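
> The weighted sum can be written out in a few lines. This is a minimal sketch with hypothetical weights and a hypothetical bias; a fitted `LinearRegression` stores the learned values in its `coef_` and `intercept_` attributes.

```python
def linear_predict(features, weights, bias):
    """Return the weighted sum of the features plus the bias (intercept) term."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# hypothetical weights for two scaled features, for illustration only
example_weights = [5000.0, 1200.0]
example_bias = 300.0

# 5000 * 0.5 + 1200 * 0.25 + 300
example_prediction = linear_predict([0.5, 0.25], example_weights, example_bias)
print(example_prediction)
```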

#### Model 1: Linear Regression

In [None]:
# instantiate model
lr = LinearRegression()

In [None]:
# fit the training data
lr.fit(X_train_prepared, y_train)

In [None]:
# predict using training data
lr_pred = lr.predict(X_tr_testing)

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lr.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# use function to show results
display_evaluation(y_train, lr_pred)

##### Cross Validation for Linear Regression Model

In [None]:
# 10 fold cross validation
lr_scores = cross_val_score(lr, X_train_prepared, y_train, cv=10, scoring="neg_mean_squared_error")

# scoring function returns a negative value for MSE (need to add the minus)
lr_rmse_scores = np.sqrt(-lr_scores)
display_scores(lr_rmse_scores)

In [None]:
# estimate prediction using cross validation
lr_pred = cross_val_predict(lr, X_tr_testing, y_train, cv=10)

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lr.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# use function to show results
display_evaluation(y_train, lr_pred)

> The Random Forest Regressor model is based on many decision trees.
> A decision tree is a non-linear model built by constructing many linear boundaries.
> When training, the random forest gives each tree a random sample of the training instances and considers random subsets of features.
> The final prediction is the average of the predictions made by the individual decision trees.
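
> The sampling-and-averaging idea can be sketched with trivial stand-in "trees". This is a deliberate simplification: each stand-in below only predicts the mean of its own bootstrap sample, whereas a real decision tree also splits on feature values. All names and numbers are hypothetical.

```python
import random

def bootstrap_sample(values, rng):
    """Draw len(values) items with replacement."""
    return rng.choices(values, k=len(values))

def forest_predict(targets, n_trees=100, seed=42):
    """Average the predictions of n_trees trivial stand-in trees."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    tree_predictions = []
    for _ in range(n_trees):
        sample = bootstrap_sample(targets, rng)
        # stand-in tree: predicts the mean of its own bootstrap sample
        tree_predictions.append(sum(sample) / len(sample))
    # forest prediction: average over all trees
    return sum(tree_predictions) / len(tree_predictions)

example_targets = [120, 450, 3000, 75, 980]  # made-up subscriber counts
example_forest_prediction = forest_predict(example_targets)
print(example_forest_prediction)
```

> Averaging over many differently sampled trees smooths out the quirks of any single tree, which is the main reason a forest tends to be more stable than one decision tree.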

#### Model 2: Random Forest Regressor

In [None]:
# instantiate model
rfr = RandomForestRegressor(random_state=42)

In [None]:
# fit the training data
rfr.fit(X_train_prepared, y_train)

In [None]:
# predict using training data
rfr_pred = rfr.predict(X_tr_testing)

In [None]:
# use function to show results
display_evaluation(y_train, rfr_pred)

> The Random Forest Regressor model performed better than the linear regression model,
> even after cross validation. The next step is to find the hyperparameters
> that provide the best results.
>
> For this task we can use GridSearchCV. The grid search tries every combination of the
> parameter values listed in the grid, evaluates each one with cross validation, and then reports the combination with
> the highest score (here the lowest MSE, since the scoring is the negated MSE).
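
> To get a feel for the cost of a grid search, the sketch below enumerates the combinations of a small grid by hand. It assumes the same kind of grid as the one used in this notebook; the helper `expand_grid` is hypothetical and only mimics the enumeration step, not the model fitting.

```python
from itertools import product

def expand_grid(param_grid):
    """Return every combination of the listed parameter values as a dict."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]

example_grid = {
    "n_estimators": [10, 50, 100, 500],
    "max_features": [2, 4, 8],
    "bootstrap": [True, False],
}

combinations = expand_grid(example_grid)
n_folds = 5  # each combination is fitted once per cross-validation fold

print(len(combinations), "combinations")
print(len(combinations) * n_folds, "cross-validation fits")
```

> The number of fits grows multiplicatively with every value added to the grid, so it is worth keeping each list short.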

#### Grid Search Cross Validation 1

In [None]:
# max_features: number of features considered at each split (RandomForestRegressor default: all features)
# bootstrap: whether each tree is trained on a resampled copy of the data (default is True)
# n_estimators: number of decision trees (default is 100)
# parameters for grid search
param_grid = {"n_estimators": [10,50,100,500], "max_features":[2,4,8], "bootstrap": [True, False]}

In [None]:
# instantiate grid search
grid_search = GridSearchCV(rfr, param_grid, cv=5, scoring="neg_mean_squared_error", return_train_score=True)

In [None]:
# fit to the training data
grid_search.fit(X_train_prepared, y_train)

In [None]:
# show the best score
np.sqrt(-grid_search.best_score_)

In [None]:
# show the best parameters
grid_search.best_estimator_

In [None]:
# show results for each iteration
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

#### Model 3: Random Forest Regressor

In [None]:
# use the best estimator found by the grid search
rfr = grid_search.best_estimator_
rfr

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", rfr.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# predict using training data
rfr_pred_2 = rfr.predict(X_tr_testing)

In [None]:
# use function to show results
display_evaluation(y_train, rfr_pred_2)

#### Feature Importance

In [None]:
# the pipeline outputs one column per numerical feature, one for "level" (ordinal),
# one for "is_paid", and one per subject category (one-hot)
subject_encoder = full_pipeline.named_transformers_["subject"]
subject_encoder_attribs = list(subject_encoder.categories_[0])

# feature names in the same order as the pipeline's output columns
features_sub = num_features + ["level", "is_paid"] + subject_encoder_attribs

In [None]:
# pair the feature names with the results from grid search
feature_importance = grid_search.best_estimator_.feature_importances_
sorted(zip(feature_importance,features_sub), reverse=True)

> Next, let's train a model without the features that have a feature importance below 0.05
> and compare the model performances.
>
> In this case, all categorical features will be removed.

In [None]:
# column transformer with numerical attributes only
full_pipeline_2 = ColumnTransformer([
    ("num", MinMaxScaler(), num_features),
])

In [None]:
X_train_prepared_2 = full_pipeline_2.fit_transform(X_train)
X_tr_testing_2 = full_pipeline_2.transform(X_train)

#### Model 4: Random Forest Regressor

In [None]:
# instantiate model
rfr = RandomForestRegressor(random_state=42)

In [None]:
# fit the training data
rfr.fit(X_train_prepared_2, y_train)

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline_2.transform(some_data)
print("Predictions:", rfr.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# predict using training data
rfr_pred_3 = rfr.predict(X_tr_testing_2)

In [None]:
# use function to show results
display_evaluation(y_train, rfr_pred_3)

#### Grid Search Cross Validation 2

In [None]:
# parameters for grid search
param_grid_2 = {"n_estimators": [10,50,100,500], "max_features":[2,3,4], "bootstrap": [True, False]}

In [None]:
# instantiate grid search
grid_search_2 = GridSearchCV(rfr, param_grid_2, cv=5, scoring="neg_mean_squared_error", return_train_score=True)

In [None]:
# fit the training data
grid_search_2.fit(X_train_prepared_2, y_train)

In [None]:
# show the best score of the second grid search
np.sqrt(-grid_search_2.best_score_)

In [None]:
# show the best parameters
grid_search_2.best_estimator_

In [None]:
# show results for each iteration
cvres = grid_search_2.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

#### Model 5: Random Forest Regressor

In [None]:
# use the best estimator found by the second grid search
rfr_2 = grid_search_2.best_estimator_
rfr_2

In [None]:
# test on a few instances from training data
some_data = X_train.iloc[:10]
some_labels = y_train.iloc[:10]
some_data_prepared = full_pipeline_2.transform(some_data)
print("Predictions:", rfr_2.predict(some_data_prepared))
print("Labels:", list(some_labels))

In [None]:
# predict using training data
rfr_pred_4 = rfr_2.predict(X_tr_testing_2)

In [None]:
# use function to show results
display_evaluation(y_train, rfr_pred_4)

#### Dummy Regressor
> The dummy regressor serves as a baseline for comparing model performance.
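
> A mean-predicting baseline is simple enough to reproduce by hand. The sketch below uses made-up targets and a hypothetical helper name to show that, when evaluated on the data it was fitted on, the RMSE of such a baseline equals the population standard deviation of the targets.

```python
import math
import statistics

def mean_baseline_rmse(targets):
    """RMSE of a model that predicts the mean of the targets for every instance."""
    mean_value = statistics.fmean(targets)
    squared_errors = [(t - mean_value) ** 2 for t in targets]
    return math.sqrt(sum(squared_errors) / len(targets))

example_targets = [0, 100, 200, 300, 1400]  # made-up subscriber counts
baseline_rmse = mean_baseline_rmse(example_targets)

# both values describe the spread of the targets around their mean
print(baseline_rmse, statistics.pstdev(example_targets))
```

> Any useful model should score a clearly lower RMSE than this baseline, because the baseline ignores the features entirely.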

In [None]:
# instantiate dummy regressor
# predicts the mean for each instance
dummy = DummyRegressor(strategy="mean")

In [None]:
# fit the training set
dummy.fit(X_train_prepared_2, y_train)

In [None]:
# predict using dummy regressor
dummy_pred = dummy.predict(X_train_prepared_2)

In [None]:
# use function to show results
display_evaluation(y_train, dummy_pred)

> ### Overview:
>> #### Removing the categorical features slightly improved the score.
>> * The RMSE with all features was approximately 2551.
>> * The RMSE with only the numerical features was approximately 2520.
>> * The model is substantially better than the dummy regressor.

# 5. Evaluating the Test Set

In [None]:
# separate test set predictors and labels
X_test = test_set.drop("num_subscribers", axis=1)
y_test = test_set["num_subscribers"].copy()

In [None]:
final_model = grid_search_2.best_estimator_
final_model

In [None]:
# transform test set
X_test_prep = full_pipeline_2.transform(X_test)

In [None]:
# predict test set
final_predictions = final_model.predict(X_test_prep)

In [None]:
# evaluate predictions
display_evaluation(y_test, final_predictions)

> #### Resources:
> 1. Udemy Courses Dataset <a href="https://www.kaggle.com/andrewmvd/udemy-courses" title="Kaggle">link</a>
> 2. Regression Evaluation Metrics Article <a href="https://medium.com/analytics-vidhya/mae-mse-rmse-coefficient-of-determination-adjusted-r-squared-which-metric-is-better-cd0326a5697e" title="medium">link</a>
> 3. Random Forest Article <a href="https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76" title="towardsdatascience">link</a>
