## Data preprocessing & EDA

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("output.csv")

In [3]:
len(df)

559

In [4]:
df.head()

Unnamed: 0,title,link,directory
0,"Automotive, IoT & Industrial Solutions | NXP S...",https://www.nxp.com/,work > material_science > companies
1,MediaTek | Home Page,https://www.mediatek.com/,work > material_science > companies
2,Analog | Embedded processing | Semiconductor c...,https://www.ti.com/,work > material_science > companies
3,Taiwan Semiconductor Manufacturing Company Lim...,https://www.tsmc.com/english,work > material_science > companies
4,ASML | The world's supplier to the semiconduct...,https://www.asml.com/en,work > material_science > companies


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      559 non-null    object
 1   link       559 non-null    object
 2   directory  559 non-null    object
dtypes: object(3)
memory usage: 13.2+ KB


In [6]:
df.describe()

Unnamed: 0,title,link,directory
count,559,559,559
unique,550,556,65
top,Google Gemini,http://cslibrary.stanford.edu/108/EssentialPer...,coding > plots
freq,4,2,53


In [7]:
df.describe(include="object")

Unnamed: 0,title,link,directory
count,559,559,559
unique,550,556,65
top,Google Gemini,http://cslibrary.stanford.edu/108/EssentialPer...,coding > plots
freq,4,2,53


In [8]:
df['directory'].nunique()

65

In [9]:
df['leaf_dir'] = df['directory'].str.split('>').str[-1].str.strip()
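
A minimal sketch of what this split does, using three toy paths in the same `a > b > c` format as the `directory` column (the Series below is illustrative, not the real data):

```python
import pandas as pd

# Toy paths in the same "a > b > c" format as the `directory` column.
directory = pd.Series([
    "work > material_science > companies",
    "coding > plots",
    "coding > machineLearning > libraries/tools/models",
])

# Split on '>', keep the last piece, and trim the surrounding spaces.
leaf_dir = directory.str.split(">").str[-1].str.strip()
print(leaf_dir.tolist())
```

Two different parents could in principle share a leaf name. Here `directory` and `leaf_dir` both have 65 unique values, so no such collision occurs in this dataset.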

In [10]:
df.describe()

Unnamed: 0,title,link,directory,leaf_dir
count,559,559,559,559
unique,550,556,65,65
top,Google Gemini,http://cslibrary.stanford.edu/108/EssentialPer...,coding > plots,plots
freq,4,2,53,53


describe() reports statistics like mean, std, min, 25% etc. only for numerical attributes.

Since every column here, including 'directory', is a categorical (object) attribute, describe() falls back to count, unique, top and freq, as in the output above.
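
A small sketch of how describe() treats the two kinds of columns. The `visits` column is a hypothetical numeric attribute added only for contrast; it is not part of the bookmarks data:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["a", "b", "b"],
    "visits": [1, 2, 3],  # hypothetical numeric column, for contrast only
})

# On a mixed frame, describe() summarizes the numeric columns only.
numeric_summary = toy.describe()

# On an all-object frame, it reports count / unique / top / freq instead.
object_summary = toy[["title"]].describe()

print(numeric_summary.index.tolist())
print(object_summary.index.tolist())
```

This is why the outputs above list count, unique, top and freq rather than mean or std.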

In [11]:
df.sample(n=5)

Unnamed: 0,title,link,directory,leaf_dir
390,Helena Zhang,https://www.helenazhang.com/,coding > webDevelopment > selected,selected
460,Periodic Table – TikZ.net,https://tikz.net/periodic-table/,coding > plots,plots
35,Nicola Spaldin - Google Scholar,https://scholar.google.de/citations?user=eUfdZ...,work > material_science > scientists,scientists
165,Weights & Biases: The AI Developer Platform,https://wandb.ai/site/,coding > machineLearning > libraries/tools/models,libraries/tools/models
378,AntfuStyle | Astro,https://astro.build/themes/details/antfustyle-...,coding > webDevelopment > selected,selected


In [16]:
counts = df['leaf_dir'].value_counts()
counts[counts > 10]

leaf_dir
plots                     53
learn                     53
libraries/tools/models    31
articles                  26
people/organizations      26
selected                  24
vegDataset                19
linux / shell             17
AItools                   16
webDevelopment            15
MatSciPaper               15
DFTtools                  14
scientists                14
finance                   13
Name: count, dtype: int64

In [19]:
counts[counts > 10].sum()

336

In [20]:
print(counts)

leaf_dir
plots                     53
learn                     53
libraries/tools/models    31
articles                  26
people/organizations      26
                          ..
physics                    1
github repos               1
others                     1
projectIdeas               1
people                     1
Name: count, Length: 65, dtype: int64


In [21]:
counts[counts > 20].sum()

213

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      559 non-null    object
 1   link       559 non-null    object
 2   directory  559 non-null    object
 3   leaf_dir   559 non-null    object
dtypes: object(4)
memory usage: 17.6+ KB


In [18]:
import matplotlib.pyplot as plt

df.hist()
plt.show()

ValueError: hist method requires numerical or datetime columns, nothing to plot.

Since df has no numerical attributes, no histograms can be plotted.

## Create a new dataframe with only those directories which have at least 10 bookmarks

In [22]:
leaf_counts = df['leaf_dir'].value_counts()
valid_classes = leaf_counts[leaf_counts >= 10].index

df_filtered = df[df['leaf_dir'].isin(valid_classes)]

df_filtered.to_csv("filtered_bookmarks.csv", index=False)
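
Note that the filter uses `>= 10`, while the earlier counts used `> 10`. A toy sketch of the difference, with hypothetical class sizes:

```python
import pandas as pd

# Hypothetical class sizes; "b" has exactly 10 bookmarks.
leaf_counts = pd.Series({"a": 53, "b": 10, "c": 9})

strict = leaf_counts[leaf_counts > 10].index.tolist()      # excludes "b"
inclusive = leaf_counts[leaf_counts >= 10].index.tolist()  # keeps "b"

print(strict, inclusive)
```

In this notebook the distinction matters: 336 bookmarks sit in the 14 directories with more than 10 entries, and the 60 additional rows kept by `>= 10` come from 6 directories with exactly 10 bookmarks each, giving 396 rows and 20 classes.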

In [23]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 396 entries, 6 to 552
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      396 non-null    object
 1   link       396 non-null    object
 2   directory  396 non-null    object
 3   leaf_dir   396 non-null    object
dtypes: object(4)
memory usage: 15.5+ KB


In [24]:
df_filtered.describe()

Unnamed: 0,title,link,directory,leaf_dir
count,396,396,396,396
unique,390,394,20,20
top,Keenan Crane,https://www.fast.ai/posts/2019-09-24-metrics.html,coding > machineLearning > learn,learn
freq,2,2,53,53


## Feature Engineering

In [25]:
df_filtered.head()

Unnamed: 0,title,link,directory,leaf_dir
6,(1) Shyue Ping Ong - Universal Graph Deep Lear...,https://www.youtube.com/watch?v=jnFAiAsaYCM,work > material_science > MLmaterialsScience,MLmaterialsScience
7,Machine-Learning-Assisted Materials Discovery ...,https://pubs.acs.org/doi/10.1021/acs.jcim.4c01329,work > material_science > MLmaterialsScience,MLmaterialsScience
8,Machine Learning for Predicting the Band Gaps ...,https://pubs.acs.org/doi/abs/10.1021/acs.jpcc....,work > material_science > MLmaterialsScience,MLmaterialsScience
9,"Comparing ANI-2x, ANI-1ccx neural networks, fo...",https://www.nature.com/articles/s41598-024-622...,work > material_science > MLmaterialsScience,MLmaterialsScience
10,ANI-1: an extensible neural network potential ...,https://pmc.ncbi.nlm.nih.gov/articles/PMC5414547/,work > material_science > MLmaterialsScience,MLmaterialsScience


In [2]:
import re
from urllib.parse import urlparse

In [3]:
filtered_bookmarks = pd.read_csv('filtered_bookmarks.csv')

In [4]:
df_clean = filtered_bookmarks.copy()

In [5]:
def link_to_tokens(link):
    """Extracts tokens from the URL path and domain."""
    if pd.isna(link):
        return ""
    
    # 1. Parse the URL
    parsed = urlparse(link)
    
    # 2. Get the clean domain (root.com)
    domain = parsed.netloc.replace('www.', '')
    
    # 3. Get path tokens (split on '/', '_', '.' and '-')
    path_tokens = re.split(r'[/_.-]', parsed.path)
    
    # Combine domain and path tokens into a list of words
    all_tokens = [domain] + [token for token in path_tokens if token]
    
    return ' '.join(all_tokens)
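
A quick check of the tokenizer on a few URLs. This is a standalone copy of the function that swaps the `pd.isna` check for a plain type check so it runs without pandas; `link_to_tokens_sketch` and the `user=abc` query string are illustrative names, not part of the notebook:

```python
import re
from urllib.parse import urlparse


def link_to_tokens_sketch(link):
    """Standalone copy of link_to_tokens, without the pandas NaN check."""
    if not isinstance(link, str):
        return ""
    parsed = urlparse(link)
    # Same domain cleanup as above: drop the 'www.' substring.
    domain = parsed.netloc.replace("www.", "")
    # Split the path on '/', '_', '.' and '-'; empty pieces are dropped below.
    path_tokens = re.split(r"[/_.-]", parsed.path)
    return " ".join([domain] + [t for t in path_tokens if t])


wiki = link_to_tokens_sketch("https://en.m.wikipedia.org/wiki/Yann_LeCun")
home = link_to_tokens_sketch("https://www.nxp.com/")
# Hypothetical query string: only netloc and path are used, so it is ignored.
scholar = link_to_tokens_sketch("https://scholar.google.de/citations?user=abc")

print(wiki)
print(home)
print(scholar)
```

The domain stays as a single token (dots are only split in the path), and anything after `?` is discarded because only `netloc` and `path` are read.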

In [6]:
# Create the new combined feature column
df_clean['link_tokens'] = df_clean['link'].apply(link_to_tokens)

# Combine title and link tokens into one text column per bookmark
df_clean['combined_text'] = df_clean['title'] + ' ' + df_clean['link_tokens']

In [8]:
df_clean.sample(n=6)

Unnamed: 0,title,link,directory,leaf_dir,link_tokens,combined_text
175,[blog] on machine learning concepts,https://colah.github.io/,coding > machineLearning > learn,learn,colah.github.io,[blog] on machine learning concepts colah.gith...
273,Astro Academia | Astro,https://astro.build/themes/details/astro-acade...,coding > webDevelopment > selected,selected,astro.build themes details astro academia,Astro Academia | Astro astro.build themes deta...
381,Fresh and Rotten Classification,https://www.kaggle.com/datasets/swoyam2609/fre...,projects > vegDataset,vegDataset,kaggle.com datasets swoyam2609 fresh and stale...,Fresh and Rotten Classification kaggle.com dat...
182,Machine Learning | Google for Developers [co...,https://developers.google.com/machine-learning,coding > machineLearning > learn,learn,developers.google.com machine learning,Machine Learning | Google for Developers [co...
59,Is there any software or Python library for vi...,https://mattermodeling.stackexchange.com/quest...,work > MS_thesis > DFTtools,DFTtools,mattermodeling.stackexchange.com questions 127...,Is there any software or Python library for vi...
104,Yann LeCun - Wikipedia,https://en.m.wikipedia.org/wiki/Yann_LeCun,coding > machineLearning > people/organizations,people/organizations,en.m.wikipedia.org wiki Yann LeCun,Yann LeCun - Wikipedia en.m.wikipedia.org wiki...


### Apply TF-IDF on combined_text of title and link tokens

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),
    stop_words='english',
    lowercase=True,
    sublinear_tf=True
)

In [10]:
X_tfidf = vectorizer.fit_transform(df_clean['combined_text'])

In [12]:
X_tfidf.shape

(396, 4423)

In [13]:
type(X_tfidf)

scipy.sparse._csr.csr_matrix

X_tfidf is my feature matrix.

In [14]:
X_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6879 stored elements and shape (396, 4423)>
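
A back-of-the-envelope look at how sparse this matrix is, using only the numbers printed above:

```python
n_rows, n_cols, n_stored = 396, 4423, 6879

# Fraction of cells that hold a stored (non-zero) value.
density = n_stored / (n_rows * n_cols)

# Average number of stored features per bookmark.
avg_nonzero_per_row = n_stored / n_rows

print(f"density: {density:.2%}")
print(f"stored features per bookmark: {avg_nonzero_per_row:.1f}")
```

So each bookmark is described by roughly 17 non-zero features out of 4423, and there are more than ten times as many features as samples.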

## Model Development

### Target Variable Encoding (y)

I use `leaf_dir` as the target variable for classification.

In [16]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

y = label_encoder.fit_transform(df_clean['leaf_dir'])

In [17]:
label_encoder.classes_

array(['AItools', 'C/C++', 'DFTtools', 'DSA', 'MLmaterialsScience',
       'MatSciConcepts', 'MatSciPaper', 'articles',
       'bandStructurePlotCodes', 'finance', 'learn',
       'libraries/tools/models', 'linux / shell', 'materialsDatabase',
       'people/organizations', 'plots', 'scientists', 'selected',
       'vegDataset', 'webDevelopment'], dtype=object)
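
A short sketch of how the integer codes relate to the class names, on a toy label list. Classes are sorted, with uppercase names ahead of lowercase ones, which matches the order printed above:

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(["plots", "learn", "AItools", "learn"])

# classes_ holds the sorted unique labels; a code is an index into it.
classes = toy_encoder.classes_.tolist()

# inverse_transform maps codes back to the original names.
decoded = toy_encoder.inverse_transform([0, 2]).tolist()

print(classes, codes.tolist(), decoded)
```

The same `inverse_transform` call on `label_encoder` turns model predictions back into directory names.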

In [18]:
y

array([ 4,  4,  4,  4,  4,  4,  4,  4,  4,  4, 16, 16, 16, 16, 16, 16, 16,
       16, 16, 16, 16, 16, 16, 16, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
        8,  8,  8,  8,  8,  8,  8,  8,  8,  8,  5,  5,  5,  5,  5,  5,  5,
        5,  5,  5,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,  2,
        6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6,  6, 14, 14,
       14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
       14, 14, 14, 14, 14, 14, 14, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
       11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
       11, 11, 11, 11,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,
        7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7,  7, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
       10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,  0,  0,
        0,  0,  0,  0,  0

### Split data

In [19]:
from sklearn.model_selection import train_test_split

X_train_val, X_test, y_train_val, y_test = train_test_split(
    X_tfidf, y,
    test_size=0.2,
    random_state=69,
    stratify=y
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val,
    test_size=0.125,  # 10% of total for validation
    random_state=69,
    stratify=y_train_val
)


In [20]:
print(X_train.shape)
print(X_val.shape)
print(X_test.shape)

(276, 4423)
(40, 4423)
(80, 4423)


In [21]:
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)

(276,)
(40,)
(80,)
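
The shapes above follow from the two-stage split. A worked check, assuming `train_test_split` rounds the test share up, which matches the printed sizes:

```python
import math

n_total = 396

# Stage 1: hold out 20% of all rows as the test set.
n_test = math.ceil(0.2 * n_total)
n_train_val = n_total - n_test

# Stage 2: 12.5% of the remaining 80% is about 10% of the total.
n_val = math.ceil(0.125 * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
print(f"{n_train / n_total:.1%} / {n_val / n_total:.1%} / {n_test / n_total:.1%}")
```

With 20 classes, a 40-row validation set leaves only one to five samples per class, so per-class validation metrics are very noisy.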


### Baseline Model: Multinomial Naive Bayes

In [22]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

mnb_model = MultinomialNB()
mnb_model.fit(X_train, y_train)

0,1,2
,alpha,1.0
,force_alpha,True
,fit_prior,True
,class_prior,


In [23]:
y_pred_mnb_val = mnb_model.predict(X_val)

accuracy_mnb_val = accuracy_score(y_val, y_pred_mnb_val)

In [24]:
print(f"MNB Accuracy on Validation Set: {accuracy_mnb_val:.4f}")
print("\nMNB Classification Report (Validation):\n")
print(classification_report(y_val, y_pred_mnb_val, target_names=label_encoder.classes_, zero_division=0))

MNB Accuracy on Validation Set: 0.2000

MNB Classification Report (Validation):

                        precision    recall  f1-score   support

               AItools       0.00      0.00      0.00         2
                 C/C++       0.00      0.00      0.00         1
              DFTtools       0.00      0.00      0.00         1
                   DSA       0.00      0.00      0.00         1
    MLmaterialsScience       0.00      0.00      0.00         1
        MatSciConcepts       0.00      0.00      0.00         1
           MatSciPaper       0.00      0.00      0.00         2
              articles       0.00      0.00      0.00         3
bandStructurePlotCodes       0.00      0.00      0.00         1
               finance       0.00      0.00      0.00         1
                 learn       0.26      1.00      0.42         5
libraries/tools/models       0.00      0.00      0.00         3
         linux / shell       0.00      0.00      0.00         2
     materialsDatabase

### Baseline Model: Logistic Regression

In [26]:
from sklearn.linear_model import LogisticRegression

lr_model = LogisticRegression(solver='lbfgs', C=1.0, random_state=69)
lr_model.fit(X_train, y_train)


0,1,2
,penalty,'l2'
,dual,False
,tol,0.0001
,C,1.0
,fit_intercept,True
,intercept_scaling,1
,class_weight,
,random_state,69
,solver,'lbfgs'
,max_iter,100


In [27]:
y_pred_lr_val = lr_model.predict(X_val)

accuracy_lr_val = accuracy_score(y_val, y_pred_lr_val)

In [28]:
print(f"LR Accuracy on Validation Set: {accuracy_lr_val:.4f}")
print("\nLR Classification Report (Validation):\n")
print(classification_report(y_val, y_pred_lr_val, target_names=label_encoder.classes_, zero_division=0))

LR Accuracy on Validation Set: 0.2750

LR Classification Report (Validation):

                        precision    recall  f1-score   support

               AItools       0.00      0.00      0.00         2
                 C/C++       0.00      0.00      0.00         1
              DFTtools       0.00      0.00      0.00         1
                   DSA       0.00      0.00      0.00         1
    MLmaterialsScience       0.00      0.00      0.00         1
        MatSciConcepts       0.00      0.00      0.00         1
           MatSciPaper       0.00      0.00      0.00         2
              articles       0.00      0.00      0.00         3
bandStructurePlotCodes       0.00      0.00      0.00         1
               finance       0.00      0.00      0.00         1
                 learn       0.31      1.00      0.48         5
libraries/tools/models       0.00      0.00      0.00         3
         linux / shell       0.00      0.00      0.00         2
     materialsDatabase  

In [31]:
df_clean['leaf_dir'].nunique()

20

There are 20 classes in total.
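
To put the validation accuracies in context, here is a rough calculation from the numbers in the reports above. The predicted counts are estimates, because the printed precision values are rounded to two decimals:

```python
n_classes = 20
n_val = 40
learn_support = 5  # validation rows whose true class is `learn`

chance = 1 / n_classes                # uniform random guessing
always_learn = learn_support / n_val  # always predicting `learn`

# precision = true positives / predicted positives, and recall 1.00
# means all 5 `learn` rows are true positives.
predicted_learn_mnb = round(learn_support / 0.26)
predicted_learn_lr = round(learn_support / 0.31)

print(chance, always_learn)
print(predicted_learn_mnb, predicted_learn_lr)
```

Under these assumptions, MNB labels about 19 of the 40 validation rows as `learn` and LR about 16, so both models lean heavily on one of the two largest classes.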

### Insights from Baseline Models:
- Validation accuracy is 0.200 for MNB and 0.275 for LR.
- Both accuracies are very low. In the visible rows of both reports, `learn` reaches recall 1.00 while every other class scores 0.00, which suggests a significant issue with the data or features.
- Instead of moving to more complex models like random forests, I should first try to improve the simpler models through `Hyperparameter tuning` and `Feature Refinement`.
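
One possible way to set up that tuning is sketched below, on toy data. It is not what the notebook has run; the texts, labels and parameter grid are placeholders. Putting the vectorizer inside a `Pipeline` means TF-IDF is fit only on the training part of each fold, whereas above it was fit on all 396 rows before splitting:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy stand-ins for `combined_text` and `leaf_dir`.
texts = [
    "tikz periodic table plot", "matplotlib plot gallery",
    "seaborn plot examples", "gnuplot plot tutorial",
    "machine learning course", "deep learning course notes",
    "learning statistics course", "reinforcement learning course",
]
labels = ["plots"] * 4 + ["learn"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(sublinear_tf=True)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Placeholder grid: one feature setting and one regularization setting.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(
    pipeline,
    param_grid,
    cv=StratifiedKFold(n_splits=2, shuffle=True, random_state=69),
    scoring="accuracy",
)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

On the real data the smallest classes have only 10 bookmarks, so the number of folds would have to stay small enough for every fold to contain each class.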