
# Coursera Course Dataset Analysis

## Introduction
This notebook analyzes a dataset of Coursera courses. It covers data cleaning, exploratory data analysis (EDA), and a closer look at course ratings, levels, instructor influence, the text of course descriptions, and the skills that courses teach.


In [None]:

import numpy as np  # needed for np.asarray in the text-analysis cell
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter



## Data Loading
Load the dataset and preview the first rows to check its structure.


In [None]:

data = pd.read_csv('/mnt/data/CourseraDataset-Clean.csv')
data.head()



## Data Cleaning
Count missing values per column and exact duplicate rows to decide what needs handling, and watch for data type discrepancies.


In [None]:

# Check for missing values and duplicates
missing_values = data.isnull().sum()
duplicate_rows = data.duplicated().sum()

missing_values, duplicate_rows
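
The cell above only counts missing values and duplicates. One possible way to act on those counts is sketched below, under the assumption that some ratings may be stored as strings. The helper `clean_courses`, the toy frame, and the `Course Title` column are hypothetical stand-ins and not part of the dataset.

```python
import pandas as pd


def clean_courses(df: pd.DataFrame, rating_col: str = "Rating") -> pd.DataFrame:
    """Drop exact duplicate rows, coerce ratings to numbers, drop unrated rows."""
    out = df.drop_duplicates().copy()
    # Entries that cannot be parsed as numbers become NaN instead of raising.
    out[rating_col] = pd.to_numeric(out[rating_col], errors="coerce")
    # Rows without a usable rating are removed (an assumption; imputing is another option).
    out = out.dropna(subset=[rating_col])
    return out.reset_index(drop=True)


# Toy frame with made-up values standing in for the real dataset.
toy = pd.DataFrame({
    "Course Title": ["A", "A", "B", "C"],
    "Rating": ["4.5", "4.5", "not rated", "3.9"],
})
cleaned = clean_courses(toy)
print(cleaned)
```

On the real data this would be called as `clean_courses(data)`. Whether dropping unrated courses is appropriate depends on how many there are, which the counts above should show.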



## Exploratory Data Analysis (EDA)
Explore how courses are distributed across ratings and levels.


In [None]:

# Distribution of course ratings
plt.figure(figsize=(10, 6))
sns.histplot(data['Rating'], bins=20, kde=False)
plt.title('Distribution of Course Ratings')
plt.xlabel('Rating')
plt.ylabel('Number of Courses')
plt.show()


In [None]:

# Count of courses by level
plt.figure(figsize=(10, 6))
sns.countplot(x='Level', data=data, order=data['Level'].value_counts().index)
plt.title('Count of Courses by Level')
plt.xticks(rotation=45)
plt.show()
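
The two plots above look at ratings and levels separately. A minimal sketch of how the two could be combined is shown below, assuming `Rating` is numeric. The helper name `rating_by_level` and the toy values are illustrative only.

```python
import pandas as pd


def rating_by_level(df: pd.DataFrame,
                    level_col: str = "Level",
                    rating_col: str = "Rating") -> pd.DataFrame:
    """Summarize ratings per level: course count, mean, and median."""
    return (
        df.groupby(level_col)[rating_col]
        .agg(["count", "mean", "median"])
        .sort_values("mean", ascending=False)
    )


# Toy frame with made-up values standing in for the real dataset.
toy = pd.DataFrame({
    "Level": ["Beginner", "Beginner", "Advanced"],
    "Rating": [4.0, 5.0, 4.2],
})
summary = rating_by_level(toy)
print(summary)
```

Reporting the count next to the mean makes it easier to see when a level's average rests on only a few courses.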



## Further Analysis
Dig deeper into the dataset to uncover insights on instructor influence, course descriptions, and skills offered.


In [None]:

# Instructor influence on ratings: top 10 instructors by mean course rating
instructor_ratings = data.groupby('Instructor')['Rating'].mean().sort_values(ascending=False).head(10)
instructor_ratings
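
A ranking by mean rating alone can be dominated by instructors with a single highly rated course. The sketch below shows one way to guard against that by requiring a minimum number of courses. The helper `top_instructors` and the threshold of 3 are assumptions chosen for illustration, not values taken from the dataset.

```python
import pandas as pd


def top_instructors(df: pd.DataFrame,
                    min_courses: int = 3,
                    top_n: int = 10,
                    instructor_col: str = "Instructor",
                    rating_col: str = "Rating") -> pd.DataFrame:
    """Rank instructors by mean rating, keeping only those with enough courses."""
    stats = df.groupby(instructor_col)[rating_col].agg(
        courses="count", mean_rating="mean"
    )
    # Drop instructors whose mean rests on too few courses.
    eligible = stats[stats["courses"] >= min_courses]
    return eligible.sort_values("mean_rating", ascending=False).head(top_n)


# Toy frame with made-up values standing in for the real dataset.
toy = pd.DataFrame({
    "Instructor": ["Ann", "Ann", "Ann", "Bob", "Cy", "Cy", "Cy"],
    "Rating": [4.6, 4.8, 4.7, 5.0, 4.0, 4.2, 4.1],
})
ranked = top_instructors(toy, min_courses=3)
print(ranked)
```

In the toy data, Bob has the highest rating but only one course, so he is excluded from the ranking.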


In [None]:

# Text analysis: the 20 most frequent non-stop-words in the 'What you will learn' column
vectorizer = CountVectorizer(stop_words='english', max_features=20)
X = vectorizer.fit_transform(data['What you will learn'].dropna())
words = vectorizer.get_feature_names_out()
word_counts = np.asarray(X.sum(axis=0)).ravel().tolist()
word_counts_df = pd.DataFrame({'word': words, 'count': word_counts}).sort_values('count', ascending=False)
word_counts_df


In [None]:

# Skill analysis: the 10 most frequent skills across all courses
skills_series = data['Skill gain'].str.split(', ')
all_skills = [skill for sublist in skills_series.dropna() for skill in sublist]
skill_counts = Counter(all_skills)
most_common_skills = skill_counts.most_common(10)
most_common_skills
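
Splitting on `', '` counts `Python` and `python` as different skills and misses entries separated by a bare comma. The sketch below is one way to normalize before counting, assuming skills are comma-separated strings. The helper `count_skills` and the toy values are hypothetical.

```python
from collections import Counter

import pandas as pd


def count_skills(series: pd.Series, sep: str = ",") -> Counter:
    """Count how many courses list each skill, ignoring case and extra spaces."""
    counts = Counter()
    for entry in series.dropna():
        # A set ensures a skill repeated within one course is counted once.
        skills = {s.strip().lower() for s in str(entry).split(sep) if s.strip()}
        counts.update(skills)
    return counts


# Toy series with made-up values standing in for the 'Skill gain' column.
toy_skills = pd.Series(["Python, SQL", "python,Data Analysis", None, "SQL , Python"])
skill_counts_normalized = count_skills(toy_skills)
print(skill_counts_normalized.most_common(2))
```

Lowercasing merges spelling variants at the cost of losing the original capitalization, which may matter if the skill names are shown in a report.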
