# In this exploratory data analysis, I will analyze courses offered by the online course platform Udemy.

Here is some introductory information about Udemy:

- Udemy, Inc. is an American massive open online course provider aimed at professional adults and students.

- It was founded in May 2010 by Eren Bali, Gagan Biyani, and Oktay Caglar.

- As of Jan 2020, the platform has more than 35 million students and 57,000 instructors teaching courses in over 65 languages.

As usual, first things first: let's import our libraries.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

import plotly.io as pio
pio.renderers.default = 'iframe'

import plotly 
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

# Knowing Dataset

In [None]:
df = pd.read_csv("../input/udemy-courses/udemy_courses.csv")
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().sum()

In [None]:
df.describe()

Based on the outputs above, we can make the following changes and arrangements to our dataset:
- Since 'course_id' and 'url' are not necessary for our analysis, we can drop them.
- 'published_timestamp', which holds each course's publication date, is stored as an object (string) column. We need to convert it to a datetime object so that we can work with it.
- There are no missing values, which means less work in the preparation stage.
- Since the 'level' column is a categorical variable, we can use it to see whether there are significant differences among the levels.
- The numerical variables deserve special attention in further analysis.

# Preparing Dataset

First, let's create a new column called 'date' in datetime format, then drop the unnecessary columns mentioned above.

In [None]:
df["date"] = pd.to_datetime(df["published_timestamp"])

# Alternative: df["date"] = df["published_timestamp"].astype("datetime64[ns]")

In [None]:
df.sample(2)

In [None]:
df = df.drop(["course_id", "url", "published_timestamp"], axis=1).copy()
df.sample(2)

In [None]:
df.info()

Nice and neat. 
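
The preparation steps above can be sketched on a toy frame. The rows below are made up, and the timestamp format (ISO 8601 with a trailing `Z`) is an assumption about how `published_timestamp` is stored in the CSV.

```python
import pandas as pd

# Toy rows in the same shape as the dataset (all values are made up).
toy = pd.DataFrame({
    "course_id": [1, 2],
    "url": ["https://example.com/a", "https://example.com/b"],
    "published_timestamp": ["2017-01-18T20:58:58Z", "2016-03-09T16:34:20Z"],
    "price": [200, 75],
})

# Parse the ISO 8601 strings; the trailing "Z" yields timezone-aware (UTC) values.
toy["date"] = pd.to_datetime(toy["published_timestamp"])

# Drop the columns that are no longer needed.
toy = toy.drop(columns=["course_id", "url", "published_timestamp"])

print(toy.dtypes)
print(toy["date"].dt.year.tolist())
```

Once the column is a real datetime, accessors such as `.dt.year` become available, which is what the yearly breakdowns later in the notebook rely on.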

# Analyzing Dataset

In [None]:
df.describe()

Based on the summary statistics produced by the code above, we can say that:

- While the minimum values are 0, the maximum values are in the hundreds or thousands for the variables in our dataset.
- The differences between mean and median are large: every mean is considerably higher than the corresponding median. This suggests positively skewed distributions with outliers on the high end. Since the median is more resilient to outliers than the mean, we will adopt a median-based approach in our analysis.
- Median value for the price is 45.
- Median value for the number of subscribers for the courses is approximately 912.
- Median value for the number of reviews is 18.
- Median value for the number of lectures is 25.
- Median value for the content duration is 2 hours.
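
To illustrate why the median is the safer summary here, below is a minimal sketch on made-up subscriber counts (not the Udemy data): a single very large course drags the mean far away, while the median barely moves.

```python
import pandas as pd

# Hypothetical subscriber counts: most courses are small, one is a huge outlier.
subscribers = pd.Series([0, 50, 120, 900, 1_000, 2_500, 250_000])

mean_value = subscribers.mean()
median_value = subscribers.median()

# Drop the largest course and recompute both statistics.
trimmed = subscribers.iloc[:-1]

print(f"with outlier:    mean={mean_value:.0f}, median={median_value:.0f}")
print(f"without outlier: mean={trimmed.mean():.0f}, median={trimmed.median():.0f}")
```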

#### Price of Udemy Courses

In [None]:
fig = px.histogram(data_frame=df, x="price",marginal="box",title="Price of Udemy Courses")
fig.show()

According to the histogram above, Udemy has 310 free courses, and 295 courses are priced at $200. As expected, the distribution is highly right-skewed.

We can see the right skewness in a separate boxplot below.

In [None]:
fig = px.box(df, x="price", hover_data=["course_title", "subject"], title="Price of Udemy Courses")
fig.update_traces(quartilemethod="inclusive")
fig.show()
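
The right skew seen in the plots can also be checked numerically. This is a minimal sketch on made-up prices (not the Udemy data), using pandas' `Series.skew` and the common 1.5 × IQR rule for flagging outliers. Note that pandas' default quantile interpolation may differ slightly from the quartile method plotly uses when drawing its boxes.

```python
import pandas as pd

# Made-up prices: a cluster of cheap courses plus a few at the maximum price.
prices = pd.Series([0, 20, 20, 25, 40, 45, 50, 60, 95, 200, 200])

# A positive value indicates a longer tail on the right-hand side.
skewness = prices.skew()

# Interquartile range and the usual upper fence for outliers.
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = prices[prices > upper_fence]

print(f"skewness: {skewness:.2f}")
print(f"upper fence: {upper_fence}")
print(outliers.tolist())
```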

#### Number of Subscribers of UDEMY Courses

In [None]:
fig = px.histogram(data_frame=df, x="num_subscribers", marginal="box", title="Number of Subscribers of UDEMY Courses")
fig.show()

While some courses have no subscribers at all, the most subscribed one has 268,923. As expected, the distribution here is highly right-skewed as well.

In [None]:
fig = px.box(df, x="num_subscribers", hover_data=["course_title", "subject"], title="Number of Subscribers of UDEMY Courses")
fig.update_traces(quartilemethod="inclusive")
fig.show()

#### Number of Reviews of UDEMY Courses

In [None]:
fig = px.histogram(data_frame=df, x="num_reviews", marginal="box", title="Number of Reviews of Udemy Courses")
fig.show()

The number of reviews ranges from 0 to 27,445. Again, the distribution is highly skewed.

In [None]:
fig = px.box(df, x="num_reviews", hover_data=["course_title", "subject"], title="Number of Reviews of Udemy Courses")
fig.update_traces(quartilemethod="inclusive")
fig.show()

#### Number of Lectures of UDEMY Courses

In [None]:
fig = px.histogram(data_frame=df, x="num_lectures", marginal="box", title="Number of Lectures of UDEMY Courses")
fig.show()

Many courses cluster in the range of 20-45 lectures. As expected, the distribution is highly skewed, with some outliers on the right side.

#### Durations of UDEMY Courses

In [None]:
fig = px.histogram(data_frame=df, x="content_duration", marginal="box", title="Durations of UDEMY Courses")
fig.show()

Many courses cluster in the range of 0-3 hours. As expected, the distribution is highly skewed, with some outliers on the right side.

Before moving into further details, let's look at the correlations among the variables in our dataset.

In [None]:
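# numeric_only=True skips text and datetime columns, which recent pandas versions no longer drop silently
df.corr(numeric_only=True)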

In [None]:
fig = px.imshow(df.corr(numeric_only=True), width=1200, height=600)
fig.update_layout(
    margin=dict(l=20, r=20, t=20, b=20),
    paper_bgcolor="LightSteelBlue",
)

fig.show()

In [None]:
plt.figure(figsize=(16, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True);

In [None]:
df.corr(numeric_only=True)

Based on these observations, we can infer that:
- There is a positive but not strong relationship between the number of reviews and the number of subscribers.
- There is hardly any relationship between price and the number of subscribers.
- There is a fairly strong, positive relationship between the number of lectures and content duration.
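
Because the variables are heavily skewed, a rank-based coefficient can be a useful cross-check on the Pearson values. The sketch below is not something this notebook does. It is an optional comparison on made-up numbers, using `select_dtypes` to keep only the numeric columns and `method="spearman"` for the rank correlation.

```python
import pandas as pd

# Made-up data: reviews grow with subscribers, but far from linearly.
toy = pd.DataFrame({
    "subject": ["Web", "Web", "Finance", "Design", "Music", "Music"],
    "num_subscribers": [10, 200, 1_500, 3_000, 40_000, 120_000],
    "num_reviews": [0, 5, 40, 90, 700, 9_000],
    "price": [20, 200, 50, 0, 95, 20],
})

# Keep numeric columns only, so the text column does not break corr().
numeric = toy.select_dtypes(include="number")

pearson = numeric.corr()                     # linear relationship
spearman = numeric.corr(method="spearman")   # monotonic (rank-based) relationship

print(pearson.round(2))
print(spearman.round(2))
```

A large gap between the two coefficients for the same pair of columns is a hint that a few extreme courses dominate the Pearson value.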

Let's dive further into analysis of courses by different variables.

**By Subject**

In [None]:
df.subject.value_counts(normalize=True) * 100

We can see that Web Development and Business Finance are the most common subjects among Udemy courses (not surprising), followed by Musical Instruments (I am surprised that there are so many online musical instrument courses) and Graphic Design.

In [None]:
fig = px.histogram(data_frame=df, x="subject", title="Subjects of Udemy Courses")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
df['year'] = df['date'].dt.year
# Count courses per (year, subject); 'subject' stays in the index so it can be used for coloring
subject_by_year = df.groupby(['year', 'subject']).size().rename('subject count').reset_index(level='year')
subject_by_year

In [None]:
fig = px.line(data_frame=subject_by_year, x='year', y='subject count', color= subject_by_year.index, title='UDEMY Courses By Subject in Each Year')
fig.show()

Until 2015, the number of Web Development and Business Finance courses increased. However, while Web Development courses continued to rise in 2016, Business Finance stagnated. Because the data does not cover the second half of 2017, every subject appears to decline in that year, so let's draw no conclusions about 2017.
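
A yearly breakdown like this one can also be built as a long table with the category as an ordinary column, then pivoted to a wide table for a quick look at the numbers. This is a minimal sketch on made-up rows. The column name `subject count` mirrors the one used in this notebook, and everything else is illustrative.

```python
import pandas as pd

# Made-up courses with a publication year and a subject.
toy = pd.DataFrame({
    "year": [2015, 2015, 2015, 2016, 2016, 2017],
    "subject": ["Web Development", "Web Development", "Business Finance",
                "Web Development", "Business Finance", "Web Development"],
})

# Long format: one row per (year, subject) pair with its course count.
counts = (
    toy.groupby(["year", "subject"])
       .size()
       .reset_index(name="subject count")
)

# Wide format: years as rows, subjects as columns, missing pairs filled with 0.
wide = (
    counts.pivot(index="year", columns="subject", values="subject count")
          .fillna(0)
          .astype(int)
)

print(counts)
print(wide)
```

With the long table, `color="subject"` can be passed to `px.line` by column name instead of passing the index.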

**By Level**

In [None]:
np.round(df["level"].value_counts(normalize=True) * 100,0)

- 52% of the Udemy courses are for learners of all levels.
- Beginner level courses make up 35% of all courses.
- 10% of the courses offered by Udemy are at the intermediate level.
- 2% of the courses offered by Udemy target advanced or expert level learners.
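
Rounded shares do not have to add up to exactly 100, which is why the percentages listed here sum to 99. A minimal sketch on made-up level labels shows the same effect:

```python
import pandas as pd

# Made-up level labels for 12 courses.
levels = pd.Series(
    ["All Levels"] * 6
    + ["Beginner Level"] * 4
    + ["Intermediate Level"]
    + ["Expert Level"],
    name="level",
)

# Share of each level in percent, rounded to whole numbers.
share = (levels.value_counts(normalize=True) * 100).round(0)

print(share)
print("sum of rounded shares:", share.sum())
```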

In [None]:
fig = px.histogram(data_frame=df, x="level", title="Levels of Udemy Courses")
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

In [None]:
# Count courses per (year, level); 'level' stays in the index so it can be used for coloring
level_by_year = df.groupby(["year", "level"]).size().rename("level count").reset_index(level="year")

In [None]:
level_by_year

In [None]:
fig = px.line(data_frame=level_by_year, x='year', y='level count', color= level_by_year.index, title='UDEMY Courses By Level in Each Year')
fig.show()

It can be observed that:
- The number of courses at every level except Expert Level increased consistently each year.

**By Number of Subscribers, Number of Reviews, Number of Lectures**

Let's create a new dataset that contains only those columns we want to work with. 

In [None]:
df_new = df.groupby("year")[['num_subscribers','num_reviews','num_lectures']].sum().reset_index()
df_new

In [None]:
fig = px.line(data_frame=df_new, x='year', y=['num_subscribers','num_reviews','num_lectures'], title='UDEMY Courses By Number of Subscribers, Number of Reviews, Number of Lectures')
fig.show()

Grouped by publication year, the total number of subscribers increased steadily until 2015 and then decreased in 2016. Since the data does not cover all of 2017, it is better not to draw any conclusions for that year.
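
How much of each year the data actually covers can be checked directly from the dates. The sketch below uses made-up dates. The `months_covered` column is a hypothetical helper and only a rough indicator based on the first and last publication month in each year.

```python
import pandas as pd

# Made-up publication dates and subscriber counts.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2015-02-10", "2015-11-30", "2016-01-05",
        "2016-12-20", "2017-03-01", "2017-07-06",
    ]),
    "num_subscribers": [1_000, 3_000, 500, 2_500, 800, 200],
})
toy["year"] = toy["date"].dt.year

# First and last publication date observed in each year.
coverage = toy.groupby("year")["date"].agg(["min", "max"])

# Rough span in months between the first and last publication of the year.
coverage["months_covered"] = coverage["max"].dt.month - coverage["min"].dt.month + 1

print(coverage)
```

A year whose span is clearly shorter than the others is only partially covered, so its totals should not be compared with those of full years.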

**By Price and Courses**

In [None]:
# Count courses per (year, is_paid); 'is_paid' stays in the index so it can be used for coloring
paid_by_year = df.groupby(['year', 'is_paid']).size().rename('paid_free count').reset_index(level='year')
paid_by_year

In [None]:
fig = px.line(data_frame=paid_by_year, x='year', y='paid_free count', color= paid_by_year.index)
fig.show()

The number of paid courses increased considerably between 2011 and 2016 (no assumption on 2017). We can see a slight increase in the number of free courses as well.
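
The same paid/free breakdown can be read as a compact table with `pd.crosstab`. This is a minimal sketch on made-up rows. The labels "paid" and "free" are illustrative and not part of the dataset.

```python
import pandas as pd

# Made-up courses with a publication year and a paid flag.
toy = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016, 2016],
    "is_paid": [True, False, True, True, False],
})

# Replace the boolean flag with readable labels before tabulating.
kind = toy["is_paid"].map({True: "paid", False: "free"})

# Rows are years, columns are "free" and "paid", cells are course counts.
table = pd.crosstab(toy["year"], kind)

print(table)
```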

In [None]:
top_15_paid_courses = df[["course_title", "year", "subject", "num_subscribers"]][df["is_paid"]==True].sort_values(by="num_subscribers", ascending=False)[:15]

In [None]:
# Can be done in the following way, as well.

# top_15_paid_courses = df[df['price']!=0][['course_title','year','subject','num_subscribers']].sort_values(by= 'num_subscribers',ascending=False).head(15)
# top_15_paid_courses

In [None]:
top_15_paid_courses

In [None]:
fig = px.bar(top_15_paid_courses, y='num_subscribers', x='course_title', hover_data=['year', 'subject'], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.update_layout(xaxis = go.layout.XAxis(tickangle = 45))
fig.show()

Among the top paid courses, Web Development appears to be the most popular subject on Udemy.

In [None]:
top_15_non_paid_courses = df[["course_title", "year", "subject", "num_subscribers"]][df["is_paid"]==False].sort_values(by="num_subscribers", ascending=False)[:15]

In [None]:
# Can be done in the following way, as well.

# top_15_non_paid_courses = df[df['price']==0][['course_title','year','subject','num_subscribers']].sort_values(by= 'num_subscribers',ascending=False).head(15)
# top_15_non_paid_courses

In [None]:
top_15_non_paid_courses

In [None]:
fig = px.bar(top_15_non_paid_courses, y='num_subscribers', x='course_title', hover_data=['year', 'subject'], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.update_layout(xaxis = go.layout.XAxis(tickangle = 45))
fig.show()

We see the same picture for free courses: Web Development dominates again.

In [None]:
top_15_price = df[['course_title','year','subject','num_subscribers', 'price']].sort_values(by=['price','num_subscribers'], ascending=False).head(15)
top_15_price

In [None]:
fig = px.bar(top_15_price, y='num_subscribers', x='course_title', hover_data=['price', 'year'], color='subject')
fig.update_layout(xaxis={'categoryorder':'total descending'})
fig.show()

The most expensive courses cost $200, and all of the subject areas appear in the list of the 15 most expensive courses.
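
Since many courses share the maximum price, sorting by price and then by subscribers effectively ranks the most subscribed courses among those priced at the maximum. A minimal sketch on made-up rows:

```python
import pandas as pd

# Made-up courses; three of them share the maximum price.
toy = pd.DataFrame({
    "course_title": ["A", "B", "C", "D", "E"],
    "price": [200, 200, 150, 200, 20],
    "num_subscribers": [50, 900, 10_000, 300, 70_000],
})

# Sort by price first; ties on price are broken by subscriber count.
top_3 = toy.sort_values(by=["price", "num_subscribers"], ascending=False).head(3)

# Number of courses tied at the maximum price.
n_at_max = (toy["price"] == toy["price"].max()).sum()

print(top_3)
print("courses at the maximum price:", n_at_max)
```

When the number of ties at the top price exceeds the size of the list, the second sort key alone decides which courses make it in.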

**Here, we have completed our analysis. Enjoy reading!**