## Data preprocessing & EDA

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("output.csv")

In [3]:
len(df)

559

In [4]:
df.head()

Unnamed: 0,title,link,directory
0,"Automotive, IoT & Industrial Solutions | NXP S...",https://www.nxp.com/,work > material_science > companies
1,MediaTek | Home Page,https://www.mediatek.com/,work > material_science > companies
2,Analog | Embedded processing | Semiconductor c...,https://www.ti.com/,work > material_science > companies
3,Taiwan Semiconductor Manufacturing Company Lim...,https://www.tsmc.com/english,work > material_science > companies
4,ASML | The world's supplier to the semiconduct...,https://www.asml.com/en,work > material_science > companies


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      559 non-null    object
 1   link       559 non-null    object
 2   directory  559 non-null    object
dtypes: object(3)
memory usage: 13.2+ KB


In [6]:
df.describe()

Unnamed: 0,title,link,directory
count,559,559,559
unique,550,556,65
top,Google Gemini,http://cslibrary.stanford.edu/108/EssentialPer...,coding > plots
freq,4,2,53


In [7]:
df.describe(include="object")

Unnamed: 0,title,link,directory
count,559,559,559
unique,550,556,65
top,Google Gemini,http://cslibrary.stanford.edu/108/EssentialPer...,coding > plots
freq,4,2,53


In [8]:
df['directory'].nunique()

65

In [9]:
df['leaf_dir'] = df['directory'].str.split('>').str[-1].str.strip()

In [10]:
df.describe()

Unnamed: 0,title,link,directory,leaf_dir
count,559,559,559,559
unique,550,556,65,65
top,Google Gemini,http://cslibrary.stanford.edu/108/EssentialPer...,coding > plots,plots
freq,4,2,53,53


describe() reports statistics such as mean, std, min, and 25% only for numerical attributes.

Since every column here, including 'directory' and 'leaf_dir', is categorical (dtype object), describe() falls back to count, unique, top, and freq, as in the output above.
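To make the two behaviours of describe() concrete, here is a minimal sketch on a toy frame. The frame, its values, and the `n_visits` column are hypothetical and not part of the bookmarks data; the sketch only shows which summary rows pandas produces for numeric versus non-numeric columns.

```python
import pandas as pd

# Toy frame (hypothetical data): one text column and one numeric column
toy = pd.DataFrame({
    "directory": ["coding > plots", "coding > plots", "work > finance"],
    "n_visits": [3, 5, 10],
})

# With a numeric column present, describe() summarizes only that column
numeric_summary = toy.describe()
print(numeric_summary)

# With no numeric columns at all, describe() falls back to
# count / unique / top / freq, which is what happens with df above
object_summary = toy[["directory"]].describe()
print(object_summary)
```

The second call mirrors the situation in this notebook: with no numeric columns, the summary is categorical rather than an error.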

In [11]:
df.sample(n=5)

Unnamed: 0,title,link,directory,leaf_dir
390,Helena Zhang,https://www.helenazhang.com/,coding > webDevelopment > selected,selected
460,Periodic Table – TikZ.net,https://tikz.net/periodic-table/,coding > plots,plots
35,Nicola Spaldin - Google Scholar,https://scholar.google.de/citations?user=eUfdZ...,work > material_science > scientists,scientists
165,Weights & Biases: The AI Developer Platform,https://wandb.ai/site/,coding > machineLearning > libraries/tools/models,libraries/tools/models
378,AntfuStyle | Astro,https://astro.build/themes/details/antfustyle-...,coding > webDevelopment > selected,selected


In [16]:
counts = df['leaf_dir'].value_counts()
counts[counts > 10]

leaf_dir
plots                     53
learn                     53
libraries/tools/models    31
articles                  26
people/organizations      26
selected                  24
vegDataset                19
linux / shell             17
AItools                   16
webDevelopment            15
MatSciPaper               15
DFTtools                  14
scientists                14
finance                   13
Name: count, dtype: int64

In [19]:
counts[counts > 10].sum()

336

In [20]:
print(counts)

leaf_dir
plots                     53
learn                     53
libraries/tools/models    31
articles                  26
people/organizations      26
                          ..
physics                    1
github repos               1
others                     1
projectIdeas               1
people                     1
Name: count, Length: 65, dtype: int64


In [21]:
counts[counts > 20].sum()

213

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 559 entries, 0 to 558
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      559 non-null    object
 1   link       559 non-null    object
 2   directory  559 non-null    object
 3   leaf_dir   559 non-null    object
dtypes: object(4)
memory usage: 17.6+ KB


In [18]:
import matplotlib.pyplot as plt

df.hist()
plt.show()

ValueError: hist method requires numerical or datetime columns, nothing to plot.

Since df has no numerical attributes, no histograms can be plotted.
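One possible alternative for categorical data is a bar chart of category frequencies. The sketch below assumes that approach; it uses a small hypothetical Series in place of df['leaf_dir'] and a non-interactive matplotlib backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook this line can be dropped
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for df['leaf_dir']
leaf_dir = pd.Series(
    ["plots", "plots", "plots", "learn", "learn", "finance"], name="leaf_dir"
)

# value_counts() turns the categorical column into numeric frequencies
counts = leaf_dir.value_counts()

# One bar per category, sorted by frequency (the value_counts default)
ax = counts.plot(kind="bar")
ax.set_xlabel("leaf_dir")
ax.set_ylabel("number of bookmarks")
plt.tight_layout()
```

In the notebook itself, the same call on `df['leaf_dir'].value_counts()` followed by `plt.show()` should give the equivalent view of the real data.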

## Create a new dataframe with only those directories which have at least 10 bookmarks

In [22]:
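# Count bookmarks per leaf directory
leaf_counts = df['leaf_dir'].value_counts()
# Keep leaf directories with 10 or more bookmarks (>= includes exactly 10)
valid_classes = leaf_counts[leaf_counts >= 10].index

# Keep only the rows whose leaf directory passed the threshold
df_filtered = df[df['leaf_dir'].isin(valid_classes)]

# Save the filtered bookmarks without the index column
df_filtered.to_csv("filtered_bookmarks.csv", index=False)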

In [23]:
df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Index: 396 entries, 6 to 552
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      396 non-null    object
 1   link       396 non-null    object
 2   directory  396 non-null    object
 3   leaf_dir   396 non-null    object
dtypes: object(4)
memory usage: 15.5+ KB


In [24]:
df_filtered.describe()

Unnamed: 0,title,link,directory,leaf_dir
count,396,396,396,396
unique,390,394,20,20
top,Keenan Crane,https://www.fast.ai/posts/2019-09-24-metrics.html,coding > machineLearning > learn,learn
freq,2,2,53,53
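The filter uses `>= 10`, whereas the earlier exploration used `counts > 10`. That explains why df_filtered has 396 rows and 20 classes while `counts[counts > 10]` covered 336 rows and 14 classes: the difference of 60 rows is consistent with six directories holding exactly 10 bookmarks each. The sketch below illustrates the boundary effect on hypothetical data; the class names and sizes are made up.

```python
import pandas as pd

# Hypothetical leaf directories: class "b" sits exactly on the threshold
toy = pd.DataFrame({"leaf_dir": ["a"] * 12 + ["b"] * 10 + ["c"] * 4})

leaf_counts = toy["leaf_dir"].value_counts()
threshold = 10

# Strict comparison drops classes with exactly `threshold` rows
strict = leaf_counts[leaf_counts > threshold].index
# Inclusive comparison keeps them
inclusive = leaf_counts[leaf_counts >= threshold].index

rows_strict = int(toy["leaf_dir"].isin(strict).sum())
rows_inclusive = int(toy["leaf_dir"].isin(inclusive).sum())
print(rows_strict, rows_inclusive)  # 12 22
```

Which comparison is appropriate depends on how many examples per class the later steps need; the point here is only that the two choices select different subsets.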


## Feature Engineering