# Draft

* I will aim for 15-20 questions for my analysis
* The main goal is to analyze how difficulty, course rating and reviews are correlated
* Use plotly express, more fancy stuff if needed
* Use AI to guide me through the exercise as a tutor
* At the end, make sure you incorporated feedback from the last sprint project

Main goal: identify opportunities for course development, i.e. find gaps in the current course offerings or areas where there is high demand but limited supply.


> Typically you need to complete several courses/modules under the “Specializations” category. These courses are designed with a specific subject in mind, e.g. data analysis.

Questions:


-   What are the top 10 most enrolled courses, and what subjects do they cover?
-   Which subject areas have the highest average enrollment but the fewest course offerings?
-   Are there any difficulty levels (Beginner, Intermediate, Advanced) that are underrepresented in popular subject areas?
-   Which specializations have the highest enrollment-to-course ratio, potentially indicating demand for more courses in that area?
-   What are the emerging trends in course topics based on recent additions to the catalog?
-   Are there any high-rated courses with relatively low enrollment, suggesting potential for growth with better marketing?
-   Which languages, other than English, show high demand but have limited course offerings?
-   Are there any gaps in the difficulty progression (Beginner to Advanced) within popular subject areas?
-   What subject areas have the highest ratings but fewer course options compared to other subjects?
-   Are there any in-demand skills or technologies that are underrepresented in the current course offerings?
-   Which organizations have the highest-rated courses, and are there subject areas they're not covering?
-   What is the distribution of course types (individual courses, specializations, professional certificates) across different subjects, and are there imbalances?
-   Are there any correlations between course characteristics (e.g., duration, difficulty) and enrollment that could inform new course design?
-   What topics are covered in the courses with the highest enrollment-to-rating ratio?
-   Are there any subject areas with high enrollment but lower average ratings, indicating a need for improved course quality?
-   What are the most common prerequisites for advanced courses, and are there gaps in preparing students for these?
-   Are there any successful course formats or structures that could be applied to other subject areas?
-   What is the distribution of course durations, and is there a sweet spot that could inform new course development?
-   Are there any interdisciplinary topics that are currently underrepresented in the course catalog?
-   What skills or knowledge areas are frequently mentioned in course descriptions of popular courses but don't have dedicated courses?


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

from helpers import strip_spaces, lowercase_data, convert_metric_prefix_to_numeric

df = pd.read_csv("coursera_data.csv", index_col=0)

df.head()

| Unnamed: 0 | course_title | course_organization | course_Certificate_type | course_rating | course_difficulty | course_students_enrolled |
|---|---|---|---|---|---|---|
| 134 | (ISC)² Systems Security Certified Practitioner... | (ISC)² | SPECIALIZATION | 4.7 | Beginner | 5.3k |
| 743 | A Crash Course in Causality: Inferring Causal... | University of Pennsylvania | COURSE | 4.7 | Intermediate | 17k |
| 874 | A Crash Course in Data Science | Johns Hopkins University | COURSE | 4.5 | Mixed | 130k |
| 413 | A Law Student's Toolkit | Yale University | COURSE | 4.7 | Mixed | 91k |
| 635 | A Life of Happiness and Fulfillment | Indian School of Business | COURSE | 4.8 | Mixed | 320k |


## Let's clean the data and prepare it for analysis

In [2]:
df = lowercase_data(df)
df = strip_spaces(df)

# Convert abbreviated counts in `course_students_enrolled` (e.g. '48k') to whole numbers (48000)
df['course_students_enrolled'] = convert_metric_prefix_to_numeric(df['course_students_enrolled'])
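The `helpers` module is not shown in this notebook, so the exact behavior of `convert_metric_prefix_to_numeric` is unknown. As a rough illustration only, a conversion of this kind could look like the sketch below. The name `parse_metric`, the supported suffixes, and the choice to return NaN on unparsable input are all assumptions, not the author's implementation.

```python
import math

# Hypothetical stand-in for the notebook's helper; suffixes are assumed.
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def parse_metric(value):
    """Convert strings such as '5.3k' or '3.2m' to whole numbers.

    Returns NaN when the value cannot be parsed, so that problems show
    up later as missing values instead of stopping the conversion.
    """
    text = str(value).strip().lower()
    if not text:
        return math.nan
    suffix = text[-1]
    # Strip the suffix only if it is a known metric prefix
    number = text[:-1] if suffix in _MULTIPLIERS else text
    try:
        return round(float(number) * _MULTIPLIERS.get(suffix, 1))
    except (ValueError, OverflowError):
        return math.nan
```

Applied to a column, this could be used as `df['course_students_enrolled'].map(parse_metric)`. With this design, any value the parser does not understand would appear in a later `isnull()` check, which makes it easy to spot.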

In [3]:
missing_values = pd.Series(df.isnull().sum(), name="Missing Values")
missing_values

course_title                0
course_organization         0
course_certificate_type     0
course_rating               0
course_difficulty           0
course_students_enrolled    4
Name: Missing Values, dtype: int64
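The output above reports 4 missing values in `course_students_enrolled`. A minimal sketch of how such rows could be inspected and, if appropriate, dropped is shown here on a toy frame. The frame `toy` and its values are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy frame standing in for df; values are illustrative only
toy = pd.DataFrame(
    {
        "course_title": ["course a", "course b", "course c"],
        "course_students_enrolled": [5300.0, None, 17000.0],
    }
)

# Boolean mask of rows with a missing enrollment figure
missing_mask = toy["course_students_enrolled"].isna()

# Show the affected rows so they can be compared against the raw CSV
rows_with_missing = toy[missing_mask]
print(rows_with_missing)

# One option: drop them if enrollment is required for the analysis
toy_clean = toy.dropna(subset=["course_students_enrolled"])
```

Looking at the affected rows first, before dropping anything, helps decide whether the values are truly absent or were lost during conversion.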

In [4]:
duplicate_values = pd.Series(df.duplicated().sum(), name="Duplicate Values")
duplicate_values

0    0
Name: Duplicate Values, dtype: int64

In [5]:
# Name the index 'id' for clarity
df = df.rename_axis('id')

### We see that there are no duplicate rows. `course_students_enrolled` has 4 missing values (see the check above), which we should review before analyzing enrollment. Other data looks OK.

# Now, let's try to find any gaps in the market by answering the questions below

### What are the top 10 most enrolled courses, and what subjects do they cover?

In [7]:
# Keep the 10 rows with the highest enrollment; groupby alone only returns a lazy GroupBy object
top_10_enrolled_courses = df.nlargest(10, 'course_students_enrolled')
top_10_enrolled_courses
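The columns listed in the missing-value check include no subject field, so the "what subjects do they cover" part of the question would need an inferred label. One possible approach, sketched below, is simple keyword matching on the course title. The map `SUBJECT_KEYWORDS`, its labels, and the function `infer_subject` are hypothetical and would need tuning against the real titles.

```python
# Hypothetical keyword-to-subject map; keywords and labels are
# assumptions for illustration, not part of the dataset.
SUBJECT_KEYWORDS = {
    "data science": ["data science", "machine learning", "causality"],
    "security": ["security"],
    "law": ["law"],
}


def infer_subject(title):
    """Return the first subject whose keyword appears in the title."""
    lowered = title.lower()
    for subject, keywords in SUBJECT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return subject
    # No keyword matched
    return "other"
```

This could be applied with `df['course_title'].map(infer_subject)`. Plain substring matching is crude (for example, "law" also matches "flaw"), so the results should be spot-checked before drawing conclusions.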