# Draft

* I will aim for 15-20 questions for my analysis
* The main goal is to analyze how difficulty, course rating and reviews are correlated
* Use plotly express, more fancy stuff if needed
* Use AI to guide me through the exercise as a tutor
* At the end, make sure you incorporated feedback from the last sprint project

Main goal: identify opportunities for course development, i.e. gaps in the current course offerings or areas with high demand but limited supply.


> Typically you need to complete several courses/modules under the “Specializations” category. These courses are designed with a specific subject in mind, e.g. data analysis.

Questions:


-   What are the top 10 most enrolled courses, and what subjects do they cover?
-   Which subject areas have the highest average enrollment but the fewest course offerings?
-   Are there any difficulty levels (Beginner, Intermediate, Advanced) that are underrepresented in popular subject areas?
-   Which specializations have the highest enrollment-to-course ratio, potentially indicating demand for more courses in that area?
-   What are the emerging trends in course topics based on recent additions to the catalog?
-   Are there any high-rated courses with relatively low enrollment, suggesting potential for growth with better marketing?
-   Which languages, other than English, show high demand but have limited course offerings?
-   Are there any gaps in the difficulty progression (Beginner to Advanced) within popular subject areas?
-   What subject areas have the highest ratings but fewer course options compared to other subjects?
-   Are there any in-demand skills or technologies that are underrepresented in the current course offerings?
-   Which organizations have the highest-rated courses, and are there subject areas they're not covering?
-   What is the distribution of course types (individual courses, specializations, professional certificates) across different subjects, and are there imbalances?
-   Are there any correlations between course characteristics (e.g., duration, difficulty) and enrollment that could inform new course design?
-   What topics are covered in the courses with the highest enrollment-to-rating ratio?
-   Are there any subject areas with high enrollment but lower average ratings, indicating a need for improved course quality?
-   What are the most common prerequisites for advanced courses, and are there gaps in preparing students for these?
-   Are there any successful course formats or structures that could be applied to other subject areas?
-   What is the distribution of course durations, and is there a sweet spot that could inform new course development?
-   Are there any interdisciplinary topics that are currently underrepresented in the course catalog?
-   What skills or knowledge areas are frequently mentioned in course descriptions of popular courses but don't have dedicated courses?


In [11]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from helpers import strip_spaces, lowercase_data, convert_metric_prefix_to_numeric

df = pd.read_csv("coursera_data.csv", index_col=0)

df.head()

| | course_title | course_organization | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled |
|---|---|---|---|---|---|---|
| 134 | (ISC)² Systems Security Certified Practitioner... | (ISC)² | SPECIALIZATION | 4.7 | Beginner | 5.3k |
| 743 | A Crash Course in Causality: Inferring Causal... | University of Pennsylvania | COURSE | 4.7 | Intermediate | 17k |
| 874 | A Crash Course in Data Science | Johns Hopkins University | COURSE | 4.5 | Mixed | 130k |
| 413 | A Law Student's Toolkit | Yale University | COURSE | 4.7 | Mixed | 91k |
| 635 | A Life of Happiness and Fulfillment | Indian School of Business | COURSE | 4.8 | Mixed | 320k |


## Let's clean the data and prepare it for analysis

In [12]:
df = lowercase_data(df)
df = strip_spaces(df)

# Convert numeric values from e.g. '48k' to '48000' whole numbers for `course_students_enrolled`
df['course_students_enrolled'] = convert_metric_prefix_to_numeric(df['course_students_enrolled'])
df.head()

| | course_title | course_organization | course_certificate_type | course_rating | course_difficulty | course_students_enrolled |
|---|---|---|---|---|---|---|
| 134 | (isc)² systems security certified practitioner... | (isc)² | specialization | 4.7 | beginner | 5300.0 |
| 743 | a crash course in causality: inferring causal... | university of pennsylvania | course | 4.7 | intermediate | 17000.0 |
| 874 | a crash course in data science | johns hopkins university | course | 4.5 | mixed | 130000.0 |
| 413 | a law student's toolkit | yale university | course | 4.7 | mixed | 91000.0 |
| 635 | a life of happiness and fulfillment | indian school of business | course | 4.8 | mixed | 320000.0 |
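
The `helpers` module is not shown in this notebook, so the conversion step is opaque. The following is only a sketch of what a helper like `convert_metric_prefix_to_numeric` might do; `metric_to_number` and `convert_metric_sketch` are hypothetical names, and the supported suffixes (`k`, `m`) are an assumption. Values it cannot parse come back as NaN, which is one possible way missing values could appear after conversion.

```python
import pandas as pd

# Hypothetical stand-in for helpers.convert_metric_prefix_to_numeric;
# the real helper is not shown in this notebook.
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}  # assumed suffixes


def metric_to_number(value):
    """Turn strings such as '5.3k' or '1.2m' into floats; return NaN if unparseable."""
    text = str(value).strip().lower()
    if not text:
        return float("nan")
    has_suffix = text[-1] in _MULTIPLIERS
    multiplier = _MULTIPLIERS[text[-1]] if has_suffix else 1
    digits = text[:-1] if has_suffix else text
    try:
        # round() guards against float noise such as 5300.000000000001
        return round(float(digits) * multiplier, 2)
    except ValueError:
        return float("nan")


def convert_metric_sketch(series: pd.Series) -> pd.Series:
    """Apply the scalar parser to every element of a Series."""
    return series.map(metric_to_number)


sample = pd.Series(["5.3k", "17k", "130k", "1.2m", "n/a"])
converted = convert_metric_sketch(sample)
print(converted)
```

Comparing `converted.isna().sum()` before and after the conversion would show whether the parser itself introduces missing values.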


In [13]:
missing_values = pd.Series(df.isnull().sum(), name="Missing Values")
missing_values

course_title                0
course_organization         0
course_certificate_type     0
course_rating               0
course_difficulty           0
course_students_enrolled    4
Name: Missing Values, dtype: int64

In [14]:
duplicate_values = pd.Series(df.duplicated().sum(), name="Duplicate Values")
duplicate_values

0    0
Name: Duplicate Values, dtype: int64

In [15]:
# Name the index 'id' for clarity
df = df.rename_axis('id')

### There are no duplicate rows, but `course_students_enrolled` has 4 missing values (see In [13]) that should be inspected before the enrollment analysis. The other columns look OK.
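
The missing-value count in In [13] reports 4 empty entries in `course_students_enrolled`. A minimal sketch for inspecting and, if appropriate, dropping such rows is shown here on a toy frame; the real affected rows are not shown in this notebook, and dropping them is only one option.

```python
import pandas as pd

# Toy frame standing in for df; the values are illustrative only
toy = pd.DataFrame({
    "course_title": ["course a", "course b", "course c"],
    "course_students_enrolled": [5300.0, None, 17000.0],
})

# Rows where enrollment is missing, for manual inspection
missing_rows = toy[toy["course_students_enrolled"].isna()]
print(missing_rows)

# One option: drop them before computing enrollment statistics
cleaned = toy.dropna(subset=["course_students_enrolled"])
print(len(cleaned))
```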

# Now, let's try to find any gaps in the market by answering the questions below

### What are the top 10 most enrolled courses, and what subjects do they cover?

In [16]:
top_10_enrolled_courses = df[['course_title', 'course_students_enrolled']].nlargest(10, 'course_students_enrolled')
top_10_enrolled_courses


| id | course_title | course_students_enrolled |
|---|---|---|
| 13 | data science | 830000.0 |
| 44 | career success | 790000.0 |
| 175 | english for career development | 760000.0 |
| 40 | successful negotiation: essential strategies a... | 750000.0 |
| 15 | data science: foundations using r | 740000.0 |
| 5 | deep learning | 690000.0 |
| 62 | neural networks and deep learning | 630000.0 |
| 36 | improve your english communication skills | 610000.0 |
| 63 | academic english: writing | 540000.0 |
| 7 | business foundations | 510000.0 |


The top 10 fall into a few broad categories:
- Data Science & AI
- Career
- Business

### Which subject areas have the highest average enrollment but the fewest course offerings?

In [17]:
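# TODO: change calculation. The data has no subject column, so this groups by
# organization as a stand-in and does not yet account for the number of courses.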
subject_areas = pd.Series(df.groupby('course_organization')['course_students_enrolled'].mean().nlargest(10))
subject_areas

course_organization
mcmaster university                             230000.000000
google - spectrum sharing                       210000.000000
ludwig-maximilians-universität münchen (lmu)    192500.000000
école polytechnique                             190000.000000
georgia institute of technology                 181300.000000
deeplearning.ai                                 178962.500000
university of washington                        167400.000000
university of california, irvine                160222.222222
johns hopkins university                        153532.142857
vanderbilt university                           144000.000000
Name: course_students_enrolled, dtype: float64
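
The dataset has no subject column, so answering this question needs some proxy for subject area. One possible approach, sketched here under that assumption, is to tag each title with a subject via keyword matching and then compare mean enrollment against course count. `SUBJECT_KEYWORDS` and `guess_subject` are hypothetical, and the keyword list is purely illustrative; the titles and enrollments are taken from rows shown earlier.

```python
import pandas as pd

# Hypothetical keyword-to-subject map; not part of the dataset
SUBJECT_KEYWORDS = {
    "data science": ["data", "machine learning", "deep learning"],
    "business": ["business", "negotiation", "management"],
    "language": ["english"],
}


def guess_subject(title: str) -> str:
    """Return the first subject whose keyword appears in the title, else 'other'."""
    for subject, keywords in SUBJECT_KEYWORDS.items():
        if any(keyword in title for keyword in keywords):
            return subject
    return "other"


toy = pd.DataFrame({
    "course_title": [
        "data science",
        "deep learning",
        "business foundations",
        "academic english: writing",
        "a law student's toolkit",
    ],
    "course_students_enrolled": [830000.0, 690000.0, 510000.0, 540000.0, 91000.0],
})
toy["subject"] = toy["course_title"].map(guess_subject)

# Few courses first, then highest mean enrollment within the same course count
summary = (
    toy.groupby("subject")["course_students_enrolled"]
    .agg(mean_enrollment="mean", course_count="count")
    .sort_values(["course_count", "mean_enrollment"], ascending=[True, False])
)
print(summary)
```

Substring matching is crude (for example, "data" also matches "database"), so the resulting labels would need a manual check before drawing conclusions.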

### Are there any difficulty levels (Beginner, Intermediate, Advanced) that are underrepresented in popular subject areas?

In [18]:
difficulty_levels = pd.Series(
    df.groupby('course_difficulty')['course_students_enrolled'].sum().astype(int).nlargest(10))
difficulty_levels

course_difficulty
beginner        38421800
mixed           17989400
intermediate    14506300
advanced         1264400
Name: course_students_enrolled, dtype: int64
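
Total enrollment alone mixes two effects: how many courses exist at a level and how popular each one is. A sketch that reports course count and mean enrollment alongside the total, on toy values rather than the real data, could look like this:

```python
import pandas as pd

# Toy values for illustration only
toy = pd.DataFrame({
    "course_difficulty": ["beginner", "beginner", "intermediate", "advanced"],
    "course_students_enrolled": [5300.0, 83000.0, 17000.0, 9800.0],
})

# Named aggregation: one row per difficulty level
difficulty_summary = (
    toy.groupby("course_difficulty")["course_students_enrolled"]
    .agg(course_count="count", total_enrollment="sum", mean_enrollment="mean")
    .sort_values("total_enrollment", ascending=False)
)
print(difficulty_summary)
```

A level with few courses but a high mean enrollment would be a stronger hint of unmet demand than a low total on its own.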

### Which specializations have the highest enrollment-to-course ratio, potentially indicating demand for more courses in that area?

### What are the emerging trends in course topics based on recent additions to the catalog?

In [19]:
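# NOTE: the dataset has no date column. Sorting titles in descending order only
# surfaces non-Latin (here Cyrillic) titles, not recent additions to the catalog.
recent_courses = df.sort_values('course_title', ascending=False).head(10)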
recent_courses

| id | course_title | course_organization | course_certificate_type | course_rating | course_difficulty | course_students_enrolled |
|---|---|---|---|---|---|---|
| 163 | финансовые инструменты для частного инвестора | national research university higher school of ... | specialization | 4.7 | beginner | 38000.0 |
| 875 | русский как иностранный | saint petersburg state university | specialization | 4.6 | intermediate | 9800.0 |
| 545 | разработка интерфейсов: вёрстка и javascript | e-learning development fund | specialization | 4.5 | intermediate | 30000.0 |
| 883 | психолингвистика (psycholinguistics) | saint petersburg state university | course | 4.8 | mixed | 21000.0 |
| 236 | программирование на python | mail.ru group | specialization | 4.5 | intermediate | 52000.0 |
| 889 | погружение в python | moscow institute of physics and technology | course | 4.7 | intermediate | 45000.0 |
| 841 | основы разработки на c++: белый пояс | e-learning development fund | course | 4.9 | intermediate | 41000.0 |
| 703 | основы программирования на python | national research university higher school of ... | course | 4.6 | beginner | 83000.0 |
| 405 | основы digital маркетинга | national research university higher school of ... | specialization | 4.5 | intermediate | 19000.0 |
| 132 | машинное обучение и анализ данных | e-learning development fund | specialization | 4.7 | intermediate | 77000.0 |


### Are there any high-rated courses with relatively low enrollment, suggesting potential for growth with better marketing?
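
One way to approach this, sketched here on the five rows shown by `df.head()`, is to keep courses whose rating is at or above a cutoff while their enrollment is below a cutoff. Using the median for both cutoffs is an assumption made for illustration; other quantiles may suit the full dataset better.

```python
import pandas as pd

# The five rows from df.head() after cleaning
toy = pd.DataFrame({
    "course_title": [
        "(isc)² systems security certified practitioner...",
        "a crash course in causality: inferring causal...",
        "a crash course in data science",
        "a law student's toolkit",
        "a life of happiness and fulfillment",
    ],
    "course_rating": [4.7, 4.7, 4.5, 4.7, 4.8],
    "course_students_enrolled": [5300.0, 17000.0, 130000.0, 91000.0, 320000.0],
})

# Assumed cutoffs: the median of each column
rating_cutoff = toy["course_rating"].quantile(0.5)
enrollment_cutoff = toy["course_students_enrolled"].quantile(0.5)

# High rating combined with below-median enrollment
high_rated_low_enrollment = toy[
    (toy["course_rating"] >= rating_cutoff)
    & (toy["course_students_enrolled"] < enrollment_cutoff)
]
print(high_rated_low_enrollment)
```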

### Which languages, other than English, show high demand but have limited course offerings?

### Are there any gaps in the difficulty progression (Beginner to Advanced) within popular subject areas?

### What subject areas have the highest ratings but fewer course options compared to other subjects?

### Are there any in-demand skills or technologies that are underrepresented in the current course offerings?

### Which organizations have the highest-rated courses, and are there subject areas they're not covering?

### What is the distribution of course types (individual courses, specializations, professional certificates) across different subjects, and are there imbalances?

### Are there any correlations between course characteristics (e.g., duration, difficulty) and enrollment that could inform new course design?
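
The dataset has no duration column, but difficulty, rating, and enrollment can be compared. A possible sketch, on toy values: map difficulty to an ordinal scale (the ordering in `DIFFICULTY_ORDER` is an assumption, and `mixed` is dropped because it has no natural rank), then correlate the ranked columns. Pearson correlation on ranks is equivalent to Spearman rank correlation.

```python
import pandas as pd

# Assumed ordinal scale; 'mixed' is intentionally left out
DIFFICULTY_ORDER = {"beginner": 1, "intermediate": 2, "advanced": 3}

# Toy values for illustration only
toy = pd.DataFrame({
    "course_difficulty": ["beginner", "beginner", "intermediate", "advanced", "mixed"],
    "course_rating": [4.7, 4.6, 4.5, 4.8, 4.5],
    "course_students_enrolled": [38000.0, 83000.0, 52000.0, 9800.0, 130000.0],
})

# Unmapped levels become NaN and are dropped
ordinal = toy.assign(
    difficulty_level=toy["course_difficulty"].map(DIFFICULTY_ORDER)
).dropna(subset=["difficulty_level"])

numeric = ordinal[["difficulty_level", "course_rating", "course_students_enrolled"]]

# Rank each column, then take the default (Pearson) correlation of the ranks
rank_corr = numeric.rank().corr()
print(rank_corr)
```

On the full dataset, the sign and size of the difficulty-enrollment entry would indicate whether harder courses tend to attract fewer students.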

### What topics are covered in the courses with the highest enrollment-to-rating ratio?

### Are there any subject areas with high enrollment but lower average ratings, indicating a need for improved course quality?

### What are the most common prerequisites for advanced courses, and are there gaps in preparing students for these?

### Are there any successful course formats or structures that could be applied to other subject areas?

### What is the distribution of course durations, and is there a sweet spot that could inform new course development?

### Are there any interdisciplinary topics that are currently underrepresented in the course catalog?

In [22]:
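# NOTE: value_counts().head(10) lists the most frequent title words (mostly stop words),
# not underrepresented topics
underrepresented_topics = df['course_title'].str.split().explode().value_counts().head(10)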
underrepresented_topics

course_title
and             222
to              111
for             107
the              90
introduction     75
of               73
data             69
with             64
in               64
management       46
Name: count, dtype: int64
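
The counts above are dominated by stop words such as "and" and "to". A sketch of a cleaner word count, assuming a small hand-written stop-word list (`STOP_WORDS` is illustrative, not exhaustive) and using four titles from the tables above, could look like this:

```python
import pandas as pd

# Small illustrative stop-word list; a real analysis would need a fuller one
STOP_WORDS = {"and", "to", "for", "the", "of", "with", "in", "a"}

titles = pd.Series([
    "a crash course in data science",
    "data science: foundations using r",
    "neural networks and deep learning",
    "deep learning",
])

# Strip punctuation, split into words, one word per row
words = titles.str.replace(r"[^\w\s]", "", regex=True).str.split().explode()

# Drop stop words before counting
topic_words = words[~words.isin(STOP_WORDS)]
word_counts = topic_words.value_counts()

# Words that appear only once are candidates for rarely covered topics
rare_words = word_counts[word_counts == 1].index.tolist()
print(word_counts.head(4))
print(rare_words)
```

Single words are a rough proxy for topics, so rare words would still need manual review before being read as gaps in the catalog.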