# Udemy Course Data Exploration

Coded by Luna McBride

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

from wordcloud import WordCloud, STOPWORDS #wordcloud 
import matplotlib.pyplot as plt #plotting
%matplotlib inline

plt.rcParams['figure.figsize'] = (15,10) #Set the default figure size
plt.style.use('ggplot') #Set the plotting method

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
udemy = pd.read_csv("../input/finance-accounting-courses-udemy-13k-course/udemy_output_All_Finance__Accounting_p1_p626.csv") #Get the Udemy data
udemy.head() #Take a peek at the data

---

# Check for Null Values

In [None]:
print(udemy.isnull().any()) #Check for null values

In [None]:
print(udemy.loc[udemy["price_detail__amount"].isnull()]) #Check where the price is null

The only null values appear to be prices on rows where the course is free. I will drop several of the price columns and replace the nulls in the remaining one with 0.
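As a rough way to confirm that reading rather than eyeballing the printout, the missing prices can be cross-tabulated against the paid flag. This is only a sketch on a toy frame: the `is_paid` column name and the values are assumptions for illustration, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Udemy data; "is_paid" is an assumed column name
toy = pd.DataFrame({
    "is_paid": [True, True, False, False],
    "price_detail__amount": [455.0, 1280.0, np.nan, np.nan],
})

missing_price = toy["price_detail__amount"].isnull()  # True where the price is missing

# Count rows for each combination of paid flag and missing-price flag
print(pd.crosstab(toy["is_paid"], missing_price))

# True only if every row with a missing price is a free course
nulls_only_on_free = bool((~toy.loc[missing_price, "is_paid"]).all())
print(nulls_only_on_free)
```

If the check comes back False on the real data, some paid courses are missing a price and filling with 0 would misprice them.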

---

# Drop Column/Fix Null

In [None]:
udemy = udemy.drop(columns = ["created", "discount_price__amount", "discount_price__currency", "discount_price__price_string",
                              "price_detail__price_string", "price_detail__currency"]) #Drop several unnecessary columns
udemy.head() #Take a peek at the dataframe

In [None]:
udemy["price_detail__amount"] = udemy["price_detail__amount"].fillna(0) #Change null values to 0
print(udemy.isnull().any()) #Check for null values

In [None]:
udemy["usd"] = udemy["price_detail__amount"] * 0.014 #Add the prices in USD (fixed rate of 0.014 USD per rupee)
udemy.head() #Take a peek at the dataframe
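The conversion is a single multiplication by a fixed rate, so it can be written with a named constant to make the assumption visible. A minimal sketch on made-up prices; the constant name `INR_TO_USD` is mine, and the 0.014 rate is simply the one used above, not a current exchange rate.

```python
import pandas as pd

INR_TO_USD = 0.014  # fixed rate from the cell above; real rates change over time

# Made-up rupee prices, with 0 standing in for a free course
toy = pd.DataFrame({"price_detail__amount": [0.0, 455.0, 1280.0]})

# Vectorized conversion, rounded to cents
toy["usd"] = (toy["price_detail__amount"] * INR_TO_USD).round(2)
print(toy)
```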

---

# Update Column Names

In [None]:
columns = udemy.columns #Take the current column names
#Create a list of the new column names I want to give
newColumns = ["id", "title", "url", "costsMoney", "subCount", "avgRating", "recentRating", "rating", "reviewNum", 
              "wishlisted", "lectureNum", "testNum", "published", "priceRupees", "priceUSD"]

columnChange = dict(zip(columns, newColumns)) #Zip together the column names, then put them into a dictionary of current : new
udemy = udemy.rename(columns = columnChange) #Rename the columns with the dictionary
udemy.head() #Take a peek at the dataframe
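One caveat with the zip approach is that it depends on the column order and silently truncates if the two lists differ in length. A small guard like the one below could catch that. This is a sketch on a toy frame; the original column names in it are hypothetical.

```python
import pandas as pd

# Toy frame; these original column names are hypothetical
toy = pd.DataFrame({"id": [1], "title": ["Intro to Accounting"], "num_subscribers": [100]})
new_names = ["id", "title", "subCount"]

# zip() stops at the shorter input without raising, so check the lengths first
if len(new_names) != len(toy.columns):
    raise ValueError("Expected one new name per existing column")

# Pair each current name with its new name, in order, and rename
renamed = toy.rename(columns=dict(zip(toy.columns, new_names)))
print(list(renamed.columns))
```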

---

# Build a WordCloud from the Titles

Source: https://www.geeksforgeeks.org/generating-word-cloud-python/

In [None]:
titleWords = "" #Make a holder variable for words to make a word cloud
stopwords = set(STOPWORDS) #Get a set of the stopwords to remove
titles = udemy["title"] #Get the titles to look through
  
#For each title, get the lowercase words for the cloud
for title in titles: 
    tokens = title.lower().split() #Lowercase the title and split it into words
    titleWords += " ".join(tokens) + " " #Add the title's words to the text for the cloud

cloud = WordCloud(width = 800, height = 800, #Build a word cloud of size 800x800 
            stopwords = stopwords, #Set the stopwords to remove
            min_font_size = 14).generate(titleWords) #Set the min size and generate the cloud
  
plt.figure(figsize = (10, 10), facecolor = None) #Build a 10x10 figure
plt.imshow(cloud) #Display the cloud
plt.axis("off") #Remove the axis
plt.tight_layout(pad = 0) #Remove the padding from the grid
  
plt.show() #Show the cloud
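The cloud sizes words by how often they occur, so a plain frequency count gives roughly the same picture in numbers. The sketch below uses `collections.Counter` on made-up titles with a tiny stand-in stopword set. It is not how `WordCloud` tokenizes internally, just an approximation of the idea.

```python
from collections import Counter

# Made-up titles for illustration
titles = [
    "The Complete Financial Analyst Course",
    "Financial Modeling for Beginners",
    "Accounting Basics: The Complete Guide",
]
stopwords = {"the", "for"}  # tiny stand-in for wordcloud's STOPWORDS

# Lowercase each title, split on whitespace, and count the non-stopwords
counts = Counter(
    word
    for title in titles
    for word in title.lower().split()
    if word not in stopwords
)
print(counts.most_common(3))
```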

---

# Lecture Number vs Rating

In [None]:
udemy.plot.scatter(x = "lectureNum", y = "rating", title = "Num Lectures vs Rating") #Build a scatterplot comparing rating and lecture number

A low lecture count does not appear to indicate rating. Once a course has 200 or more lectures, however, it is almost guaranteed to be rated above 3.
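The 200-lecture observation comes from reading the scatterplot, but it can be checked by computing the share of courses rated above 3 on each side of the cutoff. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Made-up lecture counts and ratings for illustration
toy = pd.DataFrame({
    "lectureNum": [5, 12, 40, 210, 350, 600],
    "rating": [0.0, 4.6, 2.1, 4.2, 3.9, 4.5],
})

long_courses = toy["lectureNum"] >= 200  # True for courses with 200 or more lectures

# Fraction of courses rated above 3 in each group
share_long = (toy.loc[long_courses, "rating"] > 3).mean()
share_short = (toy.loc[~long_courses, "rating"] > 3).mean()
print(share_long, share_short)
```

On the real data, a share close to 1 for the long courses would back up the "almost guaranteed" wording.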

---

# Test Number vs Rating

In [None]:
udemy.plot.scatter(x = "testNum", y = "rating", title = "Num Tests vs Rating") #Build a scatterplot comparing rating and test number

Courses with more tests appear to have higher ratings, with the exception of some 0 ratings (likely untaken courses).

---

# Price vs Rating

In [None]:
udemy.plot.scatter(x = "priceUSD", y = "rating", title = "Price vs Rating") #Build a scatterplot comparing rating and price

Price does not seem to be a good indicator of quality on Udemy. While there are fewer low-rated courses at high price points, there is a steady number of both high- and low-rated courses across the whole graph.

---

# Review Number/Subscriber Count (Popularity) vs Rating

In [None]:
udemy.plot.scatter(x = "reviewNum", y = "rating", title = "Num Reviews vs Rating") #Build a scatterplot comparing rating and review number

In [None]:
udemy.plot.scatter(x = "subCount", y = "rating", title = "Subscriber Count vs Rating") #Build a scatterplot comparing rating and Sub Count

It appears that once a course gets sufficiently high review and subscriber counts (and is thus more popular), it tends to get a higher overall rating. The turning point appears to be around 5000 reviews and 50000 subscribers.
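One way to test a turning point like this is to bin the courses by review count and compare ratings per bin. The sketch below uses `pd.cut` on made-up numbers. The band edges are my own choice for illustration, with 5000 taken from the estimate above.

```python
import pandas as pd

# Made-up review counts and ratings for illustration
toy = pd.DataFrame({
    "reviewNum": [10, 300, 4000, 6000, 25000, 80000],
    "rating": [0.0, 3.2, 4.8, 4.3, 4.5, 4.6],
})

# Band edges are illustrative; each band includes its upper edge
bins = [0, 1000, 5000, float("inf")]
labels = ["<=1000", "1001-5000", ">5000"]
toy["reviewBand"] = pd.cut(toy["reviewNum"], bins=bins, labels=labels)

# Course count, lowest rating, and median rating within each band
summary = toy.groupby("reviewBand", observed=True)["rating"].agg(["count", "min", "median"])
print(summary)
```

If the minimum rating jumps in the top band, that supports the idea of a threshold. The same approach would work for `subCount`.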

---

# Num Lectures vs Price

In [None]:
udemy.plot.scatter(x = "lectureNum", y = "priceUSD", title = "Num Lectures vs Price") #Build a scatterplot comparing price and lecture number

Price and lecture number appear to have no correlation. I had assumed a lower price would imply less content, but apparently not.
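"No correlation" can be backed up with a coefficient instead of the plot alone. The sketch below computes Pearson's r with `Series.corr`, plus a rank-based (Spearman-style) version by correlating the ranks, which is less sensitive to skewed values. The numbers are made up and chosen so both coefficients come out at zero.

```python
import pandas as pd

# Made-up values, arranged so price does not trend with lecture count
toy = pd.DataFrame({
    "lectureNum": [10, 20, 30, 40],
    "priceUSD": [50.0, 20.0, 20.0, 50.0],
})

# Pearson correlation on the raw values
pearson = toy["lectureNum"].corr(toy["priceUSD"])

# Spearman-style correlation: Pearson correlation of the ranks
spearman = toy["lectureNum"].rank().corr(toy["priceUSD"].rank())

print(round(pearson, 3), round(spearman, 3))
```

Values near 0 on the real columns would agree with the scatterplot. Values near 1 or -1 would mean the plot is hiding a trend.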

---

# The Best Courses

Review count has been shown to be an important metric in showing the best classes. So, which classes rate above 4.5 with more than 20000 reviews?

In [None]:
high = udemy.loc[udemy["reviewNum"] > 20000] #Take rows with review counts higher than 20000 into the "high" dataframe
high = high.loc[high["rating"] > 4.5] #Keep only the courses in the high dataframe rated above 4.5
high #Show all of the best Udemy courses

The best courses seem to be leadership, data analysis, and personal investment courses. This is a financial dataset, so that makes sense. The inclusion of Hadoop and Big Data is really interesting, though, as I pictured that as a heavy data science concept rather than financial analysis.
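The two-step filter above can also be written as a single boolean mask, with the result sorted so the most-reviewed courses come first. A sketch on a made-up frame; the sorting is my addition, not part of the original cell.

```python
import pandas as pd

# Made-up courses for illustration
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "reviewNum": [25000, 30000, 500, 21000],
    "rating": [4.7, 4.2, 4.9, 4.6],
})

# Keep courses with more than 20000 reviews AND a rating above 4.5
mask = (toy["reviewNum"] > 20000) & (toy["rating"] > 4.5)
best = toy.loc[mask].sort_values("reviewNum", ascending=False)
print(best)
```

The parentheses around each comparison are needed because `&` binds more tightly than `>` in Python.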