 Name: Vaughn Mitchell <br>
 LSE ID: 202384798 <br>
 ME204 - Data Engineering for the Social World

# **Notebook 3: Exploratory Data Analysis and Visualization**

Goals:
- Relate course length to price to see which courses offer the most value.
- Generate a data visualization of courses by category, examining which subject is offered most on edX's webpage.
- Explore relationships between price, enrollment, time commitment, and subject.
    - The idea is to have price on the y-axis, time commitment on the x-axis, points sized by enrollment, and colored by subject.

In [1]:
!pip install matplotlib lets-plot numpy cairosvg



Imports:

In [2]:
import pandas as pd
import sqlite3
import matplotlib
import numpy as np
import os
from lets_plot import *
import cairosvg

In [3]:
LetsPlot.setup_html()

In [4]:
os.chdir("../data/clean")

In [5]:
DATA_FOLDER = os.getcwd()

In [6]:
%load_ext sql
%config SqlMagic.autocommit=True

In [7]:
conn = sqlite3.connect(os.path.join(DATA_FOLDER, "edx.db"))

Let's run some queries for EDA. The first query examines how course duration relates to price:

In [8]:
df_course_length_avg_price = pd.read_sql_query('''
SELECT
    course_length_weeks,
    AVG(price_£) as average_price_£
FROM 
    value
GROUP BY course_length_weeks


''', conn)

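# Sort by course length and keep the result (the sort is not applied in place)
df_course_length_avg_price = df_course_length_avg_price.sort_values("course_length_weeks").reset_index(drop=True)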
df_course_length_avg_price

Unnamed: 0,course_length_weeks,average_price_£
0,1,45.142857
1,2,74.137931
2,3,83.683333
3,4,109.178261
4,5,99.25
5,6,111.210191
6,7,110.185185
7,8,119.269231
8,9,122.533333
9,10,120.864407
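
The aggregation above can be reproduced on a small in-memory database. This is a minimal sketch, assuming a stand-in `value` table with made-up rows (the course names and prices here are hypothetical, not taken from `edx.db`). It also adds an explicit `ORDER BY`, since SQL does not guarantee row order from `GROUP BY` alone:

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for the `value` table in edx.db.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    'CREATE TABLE value (course TEXT, course_length_weeks INTEGER, "price_£" REAL)'
)
demo_conn.executemany(
    "INSERT INTO value VALUES (?, ?, ?)",
    [("A", 2, 50.0), ("B", 2, 70.0), ("C", 1, 40.0), ("D", 4, 100.0)],
)

# Same aggregation as above, with an explicit ORDER BY so the row order
# does not depend on how SQLite happens to execute the GROUP BY.
demo_avg = pd.read_sql_query(
    '''
    SELECT course_length_weeks, AVG("price_£") AS "average_price_£"
    FROM value
    GROUP BY course_length_weeks
    ORDER BY course_length_weeks
    ''',
    demo_conn,
)
demo_conn.close()
print(demo_avg)
```

Sorting in the query keeps the DataFrame ordered without a separate pandas step.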


I don't need a visualization for this data, but I think it would be helpful to find a correlation coefficient for these two columns to see if the average price (£) correlates with course length (weeks).

In [9]:
corr = df_course_length_avg_price['course_length_weeks'].corr(df_course_length_avg_price['average_price_£'])
insight = f"The correlation coefficient between course length (weeks) and average price (£) is: {corr}."
if abs(corr) >= 0.8:
    print(insight + " This is a strong correlation!")
elif abs(corr) >= 0.5:
    print(insight + " This is a moderate correlation.")
else:
    print(insight + " This is a weak correlation.")

The correlation coefficient between course length (weeks) and average price (£) is: 0.5048325573088821. This is a moderate correlation.
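
For reference, the coefficient reported above is a Pearson correlation, which is what pandas' `Series.corr` computes by default. The sketch below works it out by hand with NumPy and applies the same 0.5 and 0.8 cut-offs. The helper names `pearson_r` and `describe_strength` and the toy values are illustrative assumptions, not part of the original analysis:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_dev = x - x.mean()
    y_dev = y - y.mean()
    return float((x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum()))

def describe_strength(r):
    """Label a coefficient using the 0.5 and 0.8 cut-offs from the cell above."""
    magnitude = abs(r)
    if magnitude >= 0.8:
        return "strong"
    if magnitude >= 0.5:
        return "moderate"
    return "weak"

# Hypothetical toy values, not taken from edx.db.
weeks = [1, 2, 3, 4, 5]
prices = [40, 75, 60, 110, 95]
r = pearson_r(weeks, prices)
print(round(r, 3), describe_strength(r))
```

Note that the cut-offs are a convention chosen in this notebook. Other sources draw the line between "moderate" and "strong" elsewhere.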


Given this, I want to see which course has the highest duration-to-price ratio: how much course time does a student get relative to the price? I ignore the weekly time commitment because it is only an estimate and, especially for self-paced courses, students who purchase the certificate are not expected to pace themselves at the prescribed rate. For instance, they could save all the work until the end of the course, or complete all of a week's modules on the final day before the next week begins. Let's take a look:

I need to create a different DataFrame that copies the `value` table from the `edx.db` database as I am not working with average prices for this question:

In [10]:
df_duration_to_price_ratio = pd.read_sql_query('''
SELECT
    course,
    course_length_weeks,
    price_£
FROM 
    value


''', conn)

I could technically find the answer to this question two ways: Divide the `price_£` column by the `course_length_weeks` column and find the minimum, or vice versa and find the maximum. I will choose the latter:

In [11]:
df_duration_to_price_ratio.head()

Unnamed: 0,course,course_length_weeks,price_£
0,How to Learn Online,2,54
1,The Science of Happiness,11,131
2,Remote Work Revolution for Everyone,3,116
3,Data Visualization and Building Dashboards wit...,4,77
4,"Six Sigma Part 2: Analyze, Improve, Control",8,147


In [12]:
df_duration_to_price_ratio['value'] = df_duration_to_price_ratio['course_length_weeks']/df_duration_to_price_ratio['price_£']

In [13]:
df_duration_to_price_ratio.head()

Unnamed: 0,course,course_length_weeks,price_£,value
0,How to Learn Online,2,54,0.037037
1,The Science of Happiness,11,131,0.083969
2,Remote Work Revolution for Everyone,3,116,0.025862
3,Data Visualization and Building Dashboards wit...,4,77,0.051948
4,"Six Sigma Part 2: Analyze, Improve, Control",8,147,0.054422


In [14]:
df_duration_to_price_ratio.loc[df_duration_to_price_ratio['value'].idxmax()]


course                 Solid Waste Management
course_length_weeks                         6
price_£                                     4
value                                     1.5
Name: 663, dtype: object

The course that offers the most value is "Solid Waste Management". Six weeks for a £4 certificate is great value, at least if you are studying environmental studies.
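
The two approaches mentioned earlier (maximizing weeks per pound or minimizing pounds per week) pick the same course whenever prices and durations are positive. Here is a small sketch on hypothetical toy rows. The column names `weeks_per_pound` and `pounds_per_week` are introduced only for illustration:

```python
import pandas as pd

# Hypothetical toy rows, not taken from edx.db.
toy = pd.DataFrame({
    "course": ["A", "B", "C"],
    "course_length_weeks": [2, 8, 6],
    "price_£": [54, 147, 40],
})

# Weeks bought per pound: larger is better value.
toy["weeks_per_pound"] = toy["course_length_weeks"] / toy["price_£"]
# Pounds paid per week: smaller is better value.
toy["pounds_per_week"] = toy["price_£"] / toy["course_length_weeks"]

best_by_max = toy.loc[toy["weeks_per_pound"].idxmax(), "course"]
best_by_min = toy.loc[toy["pounds_per_week"].idxmin(), "course"]
print(best_by_max, best_by_min)
```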

## Visualization #1
Thinking of edX at large, which subjects offer the most courses?

Let me load pertinent columns from the `courses` table from `edx.db` into a DataFrame:

In [15]:
df_vis1 = pd.read_sql_query('''
    SELECT 
        COUNT(course) as course_count,
        subject
    FROM
        courses
    GROUP BY subject
''',conn)

In [16]:
df_vis1

Unnamed: 0,course_count,subject
0,10,Architecture
1,10,Art & Culture
2,31,Biology & Life Sciences
3,155,Business & Management
4,11,Chemistry
5,47,Communication
6,214,Computer Science
7,87,Data Analysis & Statistics
8,9,Design
9,54,Economics & Finance


Now, let's plot this:

In [17]:
vis1 = ggplot(df_vis1) + \
geom_bar(mapping=aes(x="subject",y="course_count"), stat="identity", fill="red") + \
ggtitle("Number of Courses per Category") + \
xlab("Subject") + \
ylab("Course Count") + theme_minimal2()

vis1

In [18]:
DATA_FOLDER = os.path.join(os.getcwd(), "docs/figures")
DATA_FOLDER

'/Users/vaughnmitchell/Desktop/LSE/ME204/code/myproject/data/clean/docs/figures'

In [19]:
ggsave(vis1, DATA_FOLDER + "1.svg")

'/Users/vaughnmitchell/Desktop/LSE/ME204/code/myproject/data/clean/docs/figures1.svg'

In [20]:
cairosvg.svg2png(url=DATA_FOLDER + "1.svg", write_to=DATA_FOLDER + "1.png") 
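
The returned paths above end in `figures1.svg`, so the files are written next to the `figures` folder rather than inside it, because plain string concatenation adds no separator. If the intent was to save inside `docs/figures`, joining the path components is one option. This is a minimal sketch using a hypothetical relative folder, not the notebook's actual working directory:

```python
from pathlib import PurePosixPath

# Hypothetical folder, used only to show how the two approaches differ.
figures_dir = PurePosixPath("docs/figures")

# String concatenation, as in the cells above: no separator is inserted.
concatenated = str(figures_dir) + "1.svg"

# Joining with `/` places the file inside the folder.
joined = str(figures_dir / "1.svg")

print(concatenated)  # docs/figures1.svg
print(joined)        # docs/figures/1.svg
```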

## Visualization #2

Now for an exploration of four variables: price, time commitment, enrollment, and subject.

In [21]:
df_vis2 = pd.read_sql_query(''' 
SELECT 
     course,
     price_£,
     time_commitment,
     current_enrollment,
     subject
FROM
     courses                         
                         
''', conn)

In [22]:
df_vis2

Unnamed: 0,course,price_£,time_commitment,current_enrollment,subject
0,How to Learn Online,54,2.5,287998,Education & Teacher Training
1,The Science of Happiness,131,4.5,583920,Social Sciences
2,Remote Work Revolution for Everyone,116,2.5,114666,Business & Management
3,Data Visualization and Building Dashboards wit...,77,2.5,48226,Data Analysis & Statistics
4,"Six Sigma Part 2: Analyze, Improve, Control",147,3.5,85904,Business & Management
...,...,...,...,...,...
948,IT Fundamentals for Business Professionals: Cy...,54,4.5,12980,Computer Science
949,Leadership and Management for PM Practitioners...,193,3.5,18445,Business & Management
950,Découvrir la responsabilité sociétale des entrep…,155,6.0,0,Business & Management
951,Quantitative Research Methods,100,3.5,0,Social Sciences


To size each course by enrollment, I defined a function called `map_size` that lets me size the points by predefined intervals:

In [23]:
def map_size(enrollment):
    # Map an enrollment count to a size tier from 1 (smallest) to 8 (largest)
    if enrollment == 0:
        return 1  # Smallest size
    elif 0 < enrollment < 50000:
        return 2
    elif 50000 <= enrollment < 100000:
        return 3
    elif 100000 <= enrollment < 250000:
        return 4
    elif 250000 <= enrollment < 500000:
        return 5
    elif 500000 <= enrollment < 750000:
        return 6
    elif 750000 <= enrollment < 1000000:
        return 7
    elif 1000000 <= enrollment:
        return 8  # Largest size (includes exactly 1,000,000)

df_vis2['size_category'] = df_vis2['current_enrollment'].apply(map_size)
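
The same tiers can also be expressed with `pd.cut`, which keeps the interval edges in one list. This is a sketch under two assumptions: enrollment counts are non-negative integers, and a course with exactly 1,000,000 enrollments belongs in the top tier. The names `edges` and `tiers` and the sample values are hypothetical:

```python
import numpy as np
import pandas as pd

# Left-closed bins: [0, 1) holds zero enrollment, [1, 50000) the next tier, and so on.
# Assumption: exactly 1,000,000 enrollments belongs in the top tier.
edges = [0, 1, 50_000, 100_000, 250_000, 500_000, 750_000, 1_000_000, np.inf]
tiers = list(range(1, 9))

# Hypothetical enrollment counts chosen to sit on the bin boundaries.
enrollment = pd.Series([0, 1, 49_999, 50_000, 999_999, 1_000_000, 2_000_000])
size_category = pd.cut(enrollment, bins=edges, labels=tiers, right=False).astype(int)
print(size_category.tolist())  # [1, 2, 2, 3, 7, 8, 8]
```

Compared with a chain of `elif` branches, this makes it harder to leave a boundary value uncovered.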

I created this visualization with price on the y-axis, time commitment on the x-axis, points colored by subject and sized by enrollment:

In [24]:
vis2 = (ggplot(df_vis2) +
     aes(y='price_£', x='time_commitment', size='size_category', color='subject') +
     geom_point(alpha=0.75) +
     labs(y='Price (£, log10 scale)', x='Time Commitment (Weekly Hours)', title='Course Prices and Time Commitments, sized by Enrollment and colored by Subject') +
     scale_size(range=[2,10]) +
     scale_y_log10() +
     ggsize(2000,1750) +  
     scale_color_discrete(name='Subject'))
vis2

In [25]:
ggsave(vis2, DATA_FOLDER + "2.svg")

'/Users/vaughnmitchell/Desktop/LSE/ME204/code/myproject/data/clean/docs/figures2.svg'

In [26]:
cairosvg.svg2png(url= DATA_FOLDER + "2.svg", write_to= DATA_FOLDER + "2.png") 

This graph is not especially insightful at first glance, but on closer examination many courses line up along horizontal bands, that is, at the same price across a range of weekly time commitments. This suggests that providers tend to set certificate prices regardless of the weekly time commitment.
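
One way to check this reading numerically is to count how many courses share each price point and how many distinct time commitments appear at each price. The sketch below uses hypothetical toy rows with the same column names as `df_vis2`. It is an illustration, not a result from the real data:

```python
import pandas as pd

# Hypothetical toy rows, not taken from edx.db.
toy = pd.DataFrame({
    "price_£": [54, 54, 54, 116, 116, 147],
    "time_commitment": [2.5, 4.5, 6.0, 2.5, 3.5, 3.5],
})

# How many courses share each price point, most common first.
price_counts = toy["price_£"].value_counts()

# How many distinct weekly time commitments appear at each price point.
spread_per_price = toy.groupby("price_£")["time_commitment"].nunique()

print(price_counts.head())
print(spread_per_price)
```

A few price points that each cover many different time commitments would be consistent with the horizontal bands in the plot.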

## Visualization #3

We have already looked at the number of courses by subject. What about enrollment by subject? Which subject has the most enrolled students?

In [27]:
df_vis3 = pd.read_sql_query('''
    SELECT
        SUM(current_enrollment) as enrollment_count,
        subject
    FROM
        courses
    GROUP BY subject
''', conn)

In [28]:
df_vis3.head()

Unnamed: 0,enrollment_count,subject
0,732558,Architecture
1,263425,Art & Culture
2,2091579,Biology & Life Sciences
3,9035990,Business & Management
4,368207,Chemistry


In [29]:
vis3 = ggplot(df_vis3) + \
geom_bar(mapping=aes(x="subject",y="enrollment_count"), stat="identity", fill="blue") + \
xlab("Subject") + \
ylab("Enrollment Count") + \
ggtitle("Enrollment Count by Subject")

In [30]:
vis3

In [31]:
ggsave(vis3, DATA_FOLDER + "3.svg")

'/Users/vaughnmitchell/Desktop/LSE/ME204/code/myproject/data/clean/docs/figures3.svg'

In [32]:
cairosvg.svg2png(url= DATA_FOLDER + "3.svg", write_to= DATA_FOLDER + "3.png") 

Now we have enrollment count by subject! This visualization shows that, as with the number of courses, Computer Science, Data Analysis & Statistics, and Business & Management have the highest enrollment on edX. This information is useful for curriculum developers seeking an online platform to distribute their course. For instance, they probably wouldn't have much success offering an Ethics course, because Ethics has only 7,952 enrolled students across a single course.
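
Because total enrollment depends on how many courses a subject offers, it can help to look at course count and enrollment side by side. The sketch below does this with a pandas `groupby`. The Ethics figure is the one quoted above, the Computer Science rows are hypothetical, and `enrollment_per_course` is a column introduced only for illustration:

```python
import pandas as pd

# Toy rows: the Ethics figure is quoted above; the Computer Science rows are hypothetical.
toy = pd.DataFrame({
    "subject": ["Computer Science", "Computer Science", "Ethics"],
    "current_enrollment": [120_000, 80_000, 7_952],
})

by_subject = toy.groupby("subject").agg(
    course_count=("current_enrollment", "size"),
    enrollment_count=("current_enrollment", "sum"),
)
# Average enrollment per course separates "many courses" from "popular courses".
by_subject["enrollment_per_course"] = (
    by_subject["enrollment_count"] / by_subject["course_count"]
)
print(by_subject)
```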