## PART: BONUS

Assume labels are not provided, but Udemy can find enough resources to label up to 300 courses. In other words, without looking at the label column, you can select up to 300 courses solely by examining the course_section_lecture_title column and ask Udemy to label it for you. Which 300 observations do you select and why?

**Answer:** We select 300 courses that together span the full range of topics in the dataset, so that the labeled sample represents the whole. Because no labels are available, we group the titles in course_section_lecture_title by similarity and draw an equal number of courses from each group, which ensures every topic group appears in the labeled set.

**Approach: Clustering**
1. **Pre-Processing**
2. **Vectorization**
3. **Clustering**
4. **Selection**

### Importing Required Libraries


In [1]:
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/sharukh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sharukh/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/sharukh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Importing the Data and Removing Label

In [2]:
# Load the data and keep only the ID and title columns (the label column is dropped)
data = pd.read_csv("udemy_ds_algos_exercise (1).csv")
data = data[['courseid','course_section_lecture_title']]
data.head()

| | courseid | course_section_lecture_title |
|---|---|---|
| 0 | 8416 | Beginners - How To Create iPhone And iPad Apps... |
| 1 | 8723 | C Programming: iOS Development Starts Here!, {... |
| 2 | 9287 | Microsoft Excel 2010 Course Beginners/ Interme... |
| 3 | 9463 | Programming Java for Beginners - The Ultimate ... |
| 4 | 10318 | Adobe Photoshop for Photographers, {Color Corr... |


### Pre-Processing
- Lowercasing and tokenizing
- Removing stop words
- Lemmatization

In [3]:
# Text preprocessing
# Build the stop-word set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Lowercase and tokenize, drop stop words, then lemmatize the remaining tokens
    tokens = word_tokenize(text.lower())
    tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]
    return ' '.join(tokens)

data['processed_text'] = data['course_section_lecture_title'].apply(preprocess_text)

### Vectorization
Converting the preprocessed text into TF-IDF vectors, one row per course.

In [5]:
# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['processed_text'])

### Clustering

- Assigning the required number of clusters

Note: The optimal number of clusters can be chosen with the elbow method; here it is fixed at 5.
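
The elbow method mentioned in the note can be sketched as follows. This is an illustration on synthetic data from `make_blobs`, not the notebook's own analysis; the names `points`, `candidate_ks` and `inertias` are assumptions introduced for the example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 well-separated groups, used only to illustrate the idea
points, _ = make_blobs(n_samples=200, centers=4, cluster_std=0.5, random_state=42)

candidate_ks = list(range(1, 9))
inertias = []
for k in candidate_ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42)
    model.fit(points)
    # inertia_ is the within-cluster sum of squared distances
    inertias.append(model.inertia_)

for k, inertia in zip(candidate_ks, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

Inertia drops sharply while k is below the natural number of groups and flattens afterwards; the bend in that curve is the "elbow". In the notebook, the same loop could be run on the TF-IDF matrix `X` in place of `points`.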

In [14]:
# Clustering using K-means
num_clusters = 5
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
kmeans.fit(X)
# Add cluster labels 
data['cluster'] = kmeans.labels_


### Selection

- To diversify the sample, the 300 required rows are split evenly across the 5 clusters: 300 / 5 = 60.
- 60 rows are then sampled at random from each cluster.
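
The even split works cleanly when the budget divides by the number of clusters and every cluster holds at least that many rows. As a sketch of the same arithmetic for the general case, the hypothetical helper `allocate_quota` below (not part of the notebook) spreads the budget as evenly as possible and hands any leftover picks to clusters that still have spare rows. The cluster sizes are invented for illustration.

```python
def allocate_quota(cluster_sizes, budget):
    """Split `budget` picks across clusters as evenly as possible.

    No cluster is asked for more rows than it holds; leftover picks are
    handed to clusters that still have spare rows.
    """
    quotas = [0] * len(cluster_sizes)
    remaining = min(budget, sum(cluster_sizes))
    while remaining > 0:
        # Clusters that can still give at least one more row
        open_ids = [i for i, size in enumerate(cluster_sizes) if quotas[i] < size]
        share = max(1, remaining // len(open_ids))
        for i in open_ids:
            take = min(share, cluster_sizes[i] - quotas[i], remaining)
            quotas[i] += take
            remaining -= take
            if remaining == 0:
                break
    return quotas

# Hypothetical cluster sizes, chosen only to exercise the edge cases
print(allocate_quota([500, 400, 300, 200, 100], 300))  # [60, 60, 60, 60, 60]
print(allocate_quota([500, 400, 300, 200, 20], 300))   # [70, 70, 70, 70, 20]
print(allocate_quota([100] * 7, 300))                  # [43, 43, 43, 43, 43, 43, 42]
```

The first call reproduces the 300 / 5 = 60 split; the other two show how a small cluster or an uneven division is handled while the total stays at 300.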

In [11]:
# Function to select courses from each cluster
def select_courses_from_cluster(cluster_id, num_courses):
    cluster_courses = data[data['cluster'] == cluster_id]
    # Cap the sample size so a cluster with fewer rows than requested does not raise an error
    n = min(num_courses, len(cluster_courses))
    return cluster_courses.sample(n=n, random_state=42)

# Select 300 courses
num_courses_per_cluster = 300 // num_clusters

# Sample from every cluster and combine the results with a fresh index
selected_courses = pd.concat(
    [select_courses_from_cluster(i, num_courses_per_cluster) for i in range(num_clusters)],
    ignore_index=True,
)

# Print the selected course IDs
print("Selected Course IDs:")
print(selected_courses['courseid'].tolist())