# Coursera Course Offerings Insights

This notebook provides a detailed analysis of the Coursera course dataset. It includes steps for data cleaning, exploratory data analysis (EDA), and further analysis to uncover insights about the courses offered on Coursera.

## Objectives
- Understand the dataset structure and content.
- Clean the dataset to prepare for analysis.
- Perform EDA to identify trends and patterns.
- Conduct further analysis on course ratings, number of reviews, and course duration.
- Draw meaningful conclusions from the analysis.

## Dataset Overview
Dataset Name: Coursera course dataset (file `CourseraDataset-Clean.csv`)
The dataset contains information about various courses offered on Coursera, including course titles, ratings, levels, schedules, learning outcomes, skills gained, and more.

## Tools Used
- Python: For data manipulation and analysis.
- Pandas: For data processing.
- Matplotlib and Seaborn: For creating visualizations.


In [None]:
import pandas as pd

# Load the dataset
file_path = 'CourseraDataset-Clean.csv'
data = pd.read_csv(file_path)
data.head()

## Data Preprocessing
### Cleaning
The dataset was checked for missing values and duplicate rows. Missing values appear in the Modules and Instructor columns; they were noted but not imputed, because these fields are specific to each course and cannot be meaningfully filled in. No duplicate rows were found.

In [None]:
# Check for missing values
missing_values = data.isnull().sum()

# Check for duplicate rows
duplicate_rows = data.duplicated().sum()

# Data type check
data_types = data.dtypes

missing_values, duplicate_rows, data_types

### Transformation
Data types were appropriate and required no adjustments.
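
If a numeric column ever loads as text (for example, after the CSV is re-exported), a guard like the one below could coerce it back. This is a minimal sketch on made-up rows: the helper name `coerce_numeric` and the sample values are illustrative assumptions, and only the column names come from this notebook.

```python
import pandas as pd

# Made-up rows standing in for the real dataset (illustrative only)
sample = pd.DataFrame({
    "Rating": ["4.8", "4.5", "not rated"],
    "Number of Review": ["120", "45", "300"],
})

def coerce_numeric(df, columns):
    """Return a copy of df with the given columns converted to numeric dtypes.

    Values that cannot be parsed become NaN instead of raising an error.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    return out

cleaned = coerce_numeric(sample, ["Rating", "Number of Review"])
print(cleaned.dtypes)
```

Coercing to NaN rather than raising keeps the row available for the columns that did parse, at the cost of needing a follow-up missing-value check.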

## Exploratory Data Analysis (EDA)
Let's perform some EDA to get a better understanding of the dataset. We'll look at the distribution of course ratings, the number of courses per level, and the distribution of courses across different keywords (categories).

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Set the aesthetics for the plots
sns.set(style="whitegrid")

# 1. Distribution of course ratings
plt.figure(figsize=(10, 6))
sns.histplot(data['Rating'], bins=20, kde=True)
plt.title('Distribution of Course Ratings')
plt.xlabel('Rating')
plt.ylabel('Frequency')
plt.show()

The histogram shows a left-skewed distribution, indicating that most courses have high ratings, with a peak around the 4.5 to 5.0 range. This suggests that the majority of courses are well-received by learners.
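
The visual impression of left skew can also be checked numerically. The sketch below uses made-up ratings rather than the real column; with the actual data, the same calls would be applied to `data['Rating']`.

```python
import pandas as pd

# Made-up ratings: a cluster near 5 with a few low outliers (illustrative only)
ratings = pd.Series([4.8, 4.7, 4.9, 4.6, 4.8, 4.5, 4.7, 3.2, 4.9, 2.5])

# Series.skew() returns the sample skewness; a negative value means the
# longer tail lies on the left (low-rating) side
skewness = ratings.skew()

# In a left-skewed sample the mean is typically pulled below the median
mean_rating = ratings.mean()
median_rating = ratings.median()

print(f"skew={skewness:.2f}, mean={mean_rating:.2f}, median={median_rating:.2f}")
```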

In [None]:
# 2. Count of courses by level
sns.set(style="whitegrid")
sns.set_palette("pastel")

plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='Level')
plt.title('Count of Courses by Level')
plt.xlabel('Level')
plt.ylabel('Number of Courses')
plt.xticks(rotation=45)
plt.show()

The majority of courses are targeted at the "Beginner level," followed by "Mixed level," "Intermediate level," and a few "Advanced level" courses. This distribution suggests that Coursera focuses on providing accessible education to a wide audience, with an emphasis on entry-level courses.
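
To put numbers on the ranking that the bar chart shows, the counts can be expressed as shares. The following is a sketch on made-up level labels; with the real data, `value_counts` would be called on `data['Level']`.

```python
import pandas as pd

# Made-up level labels (illustrative only, not the real distribution)
levels = pd.Series(
    ["Beginner level"] * 5 + ["Intermediate level"] * 3 + ["Advanced level"] * 2,
    name="Level",
)

# normalize=True returns each level's share of all courses instead of raw counts;
# the result is sorted from most to least frequent
level_share = levels.value_counts(normalize=True)
print(level_share)
```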

In [None]:
# 3. Distribution of courses across different keywords (categories)
sns.set(style="whitegrid")
sns.set_palette("pastel")

plt.figure(figsize=(12, 8))
sns.countplot(data=data, y='Keyword', order=data['Keyword'].value_counts().index)
plt.title('Distribution of Courses Across Different Keywords')
plt.xlabel('Number of Courses')
plt.ylabel('Keyword')
plt.show()

The courses are spread across various categories, with some keywords having significantly more courses than others. This indicates the diversity of subjects available on Coursera, catering to a wide range of interests and educational needs.

## Further Analysis

### Relationship Between Course Ratings and Number of Reviews
This step checks whether more popular courses (as indicated by the number of reviews) tend to have higher or lower ratings.

In [None]:
# Relationship between course ratings and the number of reviews
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Rating', y='Number of Review', alpha=0.5)
plt.title('Relationship Between Course Ratings and Number of Reviews')
plt.xlabel('Rating')
plt.ylabel('Number of Reviews')
plt.show()

# Calculating Pearson's correlation coefficient between ratings and number of reviews
rating_review_corr = data['Rating'].corr(data['Number of Review'])

rating_review_corr

The scatter plot does not indicate a strong relationship between course ratings and the number of reviews, suggesting that course popularity (as indicated by reviews) does not necessarily correlate with higher ratings. Pearson's correlation coefficient supports this with a value of approximately 0.091, indicating a very weak positive correlation.
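
Pearson's coefficient measures only linear association, so a rank-based check can be a useful complement if review counts turn out to be heavily skewed. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks. The helper name `spearman_corr` and the sample values are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

# Made-up data with a monotonic but strongly non-linear relationship
sample = pd.DataFrame({
    "Rating": [4.1, 4.3, 4.5, 4.7, 4.9],
    "Number of Review": [1, 10, 100, 1000, 10000],
})

def spearman_corr(df, x, y):
    """Spearman rank correlation: Pearson's r computed on the ranks of x and y."""
    pairs = df[[x, y]].dropna()  # keep only rows where both values are present
    return pairs[x].rank().corr(pairs[y].rank())

pearson = sample["Rating"].corr(sample["Number of Review"])
spearman = spearman_corr(sample, "Rating", "Number of Review")
print(f"pearson={pearson:.3f}, spearman={spearman:.3f}")
```

A large gap between the two coefficients would suggest a monotonic but non-linear relationship; similar values would support the conclusion drawn from Pearson's coefficient alone.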

### Analysis of "Skill gain"
This step identifies the most common skills that courses aim to impart, which offers insight into current trends in skills development.

In [None]:
# Analysis of "Skill gain"
# Preprocessing the 'Skill gain' column to count occurrences of each skill
from collections import Counter

skills_series = data['Skill gain'].str.split(', ')
all_skills = [skill for sublist in skills_series.dropna() for skill in sublist]
skill_counts = Counter(all_skills)

# Identifying the most common skills
most_common_skills = skill_counts.most_common(10)

most_common_skills

The most frequently mentioned skills in courses, excluding the 'Not specified' entries, include "Data Analysis," "Python Programming," "Machine Learning," "Communication," "Data Visualization," "Data Science," "Leadership," "Cloud Computing," and "SQL." This reflects a strong emphasis on technical skills, particularly in data science and programming, as well as soft skills like communication and leadership.
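
One way to exclude the 'Not specified' entries programmatically is sketched below. It runs on made-up strings, and the helper name `count_skills` is an assumption; the sketch also strips stray whitespace so that variants of the same skill are counted together.

```python
from collections import Counter

import pandas as pd

# Made-up 'Skill gain' values (illustrative only)
skills_column = pd.Series([
    "Data Analysis, Python Programming",
    "Not specified",
    "Python Programming , SQL",
    None,
])

def count_skills(column, placeholder="Not specified"):
    """Count individual skills, ignoring missing rows and the placeholder value."""
    counts = Counter()
    for entry in column.dropna():
        for skill in entry.split(","):
            skill = skill.strip()  # remove whitespace left over from the split
            if skill and skill != placeholder:
                counts[skill] += 1
    return counts

filtered_counts = count_skills(skills_column)
print(filtered_counts.most_common(3))
```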

### Correlation Between Course Length and Ratings or Number of Reviews
This step checks whether shorter or longer courses tend to be rated higher or to attract more reviews.

In [None]:
# Correlation between course length and ratings/number of reviews
# Scatter plot for course length vs ratings
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Duration to complete (Approx.)', y='Rating', alpha=0.5)
plt.title('Course Length vs Ratings')
plt.xlabel('Duration to Complete (Approx. hours)')
plt.ylabel('Rating')
plt.show()

# Scatter plot for course length vs number of reviews
plt.figure(figsize=(10, 6))
sns.scatterplot(data=data, x='Duration to complete (Approx.)', y='Number of Review', alpha=0.5)
plt.title('Course Length vs Number of Reviews')
plt.xlabel('Duration to Complete (Approx. hours)')
plt.ylabel('Number of Reviews')
plt.show()

# Calculating Pearson's correlation coefficients
length_rating_corr = data['Duration to complete (Approx.)'].corr(data['Rating'])
length_reviews_corr = data['Duration to complete (Approx.)'].corr(data['Number of Review'])

length_rating_corr, length_reviews_corr

The scatter plots for course length versus ratings and number of reviews show a spread that doesn't indicate a strong relationship. However, Pearson's correlation coefficients are slightly positive (0.138 for course length vs. ratings and 0.175 for course length vs. number of reviews), suggesting a weak positive correlation. This might indicate that slightly longer courses have marginally higher ratings and slightly more reviews, but the correlation is not strong enough to draw definitive conclusions.
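
The pairwise coefficients can also be read from a single correlation matrix. The sketch below uses made-up numbers and assumes the duration column is numeric (hours), as the axis labels suggest.

```python
import pandas as pd

# Made-up values (illustrative only); column names match the notebook
sample = pd.DataFrame({
    "Duration to complete (Approx.)": [10, 20, 15, 40, 8],
    "Rating": [4.5, 4.7, 4.6, 4.8, 4.4],
    "Number of Review": [100, 250, 90, 400, 60],
})

# DataFrame.corr computes the coefficient for every pair of numeric columns;
# the diagonal is 1 and the matrix is symmetric
corr_matrix = sample.corr(method="pearson")
print(corr_matrix.round(3))
```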

### Course Level and Learner Engagement
This step checks whether a particular level attracts more learner engagement, measured here by the average number of reviews.

In [None]:
# Course Level and Learner Engagement
# Grouping the data by Level and calculating the average number of reviews
level_engagement = data.groupby('Level')['Number of Review'].mean().sort_values(ascending=False)

level_engagement

Beginner level courses have the highest engagement in terms of reviews, which could reflect broader accessibility or appeal of these courses to a wider audience, followed by courses whose levels are not specified, and then Intermediate level courses. Advanced level courses have the lowest average number of reviews, which might indicate a smaller audience or a higher barrier to entry for these courses.
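
Because a mean can be dominated by a handful of very popular courses, it may be worth comparing it with the median and the group size. This is a sketch on made-up values, not a result from the dataset:

```python
import pandas as pd

# Made-up review counts (illustrative only); one beginner course is an outlier
sample = pd.DataFrame({
    "Level": ["Beginner level"] * 4 + ["Advanced level"] * 3,
    "Number of Review": [10, 20, 30, 5000, 30, 40, 50],
})

# Mean, median, and course count per level, ordered by the median
level_summary = (
    sample.groupby("Level")["Number of Review"]
    .agg(["mean", "median", "count"])
    .sort_values("median", ascending=False)
)
print(level_summary)
```

In this toy example the single outlier puts the beginner group first by mean, while the median ranks the advanced group higher, which is why reporting both can be informative.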

## Conclusions

This analysis of the Coursera course dataset provided insights into course ratings, levels, and the variety of skills targeted. We observed that most courses are highly rated and cater primarily to beginner levels, emphasizing the accessibility of education on Coursera. Further analysis revealed only weak correlations among course ratings, number of reviews, and course duration, suggesting that ratings are not driven primarily by popularity or length.

These findings can inform both learners and educators about prevailing trends in online education and help guide future course development and selection.
