In [2]:
import pandas as pd

df = pd.read_csv("./data/medium_articles.csv")
df.head()

Unnamed: 0,title,text,url,authors,timestamp,tags
0,Mental Note Vol. 24,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,https://medium.com/invisible-illness/mental-no...,['Ryan Fan'],2020-12-26 03:38:10.479000+00:00,"['Mental Health', 'Health', 'Psychology', 'Sci..."
1,Your Brain On Coronavirus,Your Brain On Coronavirus\n\nA guide to the cu...,https://medium.com/age-of-awareness/how-the-pa...,['Simon Spichak'],2020-09-23 22:10:17.126000+00:00,"['Mental Health', 'Coronavirus', 'Science', 'P..."
2,Mind Your Nose,Mind Your Nose\n\nHow smell training can chang...,https://medium.com/neodotlife/mind-your-nose-f...,[],2020-10-10 20:17:37.132000+00:00,"['Biotechnology', 'Neuroscience', 'Brain', 'We..."
3,The 4 Purposes of Dreams,Passionate about the synergy between science a...,https://medium.com/science-for-real/the-4-purp...,['Eshan Samaranayake'],2020-12-21 16:05:19.524000+00:00,"['Health', 'Neuroscience', 'Mental Health', 'P..."
4,Surviving a Rod Through the Head,"You’ve heard of him, haven’t you? Phineas Gage...",https://medium.com/live-your-life-on-purpose/s...,['Rishav Sinha'],2020-02-26 00:01:01.576000+00:00,"['Brain', 'Health', 'Development', 'Psychology..."


In [3]:
df.shape

(192368, 6)

The CSV file contains more than `190k` rows!

First, let's drop the unnecessary columns. The `title`, `url`, `authors` and `timestamp` columns are not required for an NLP application that identifies the tags of an article.

In [63]:
usefull_cols = ["text", "tags"]
df = df[usefull_cols]

df.head()

Unnamed: 0,text,tags
0,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,"['Mental Health', 'Health', 'Psychology', 'Sci..."
1,Your Brain On Coronavirus\n\nA guide to the cu...,"['Mental Health', 'Coronavirus', 'Science', 'P..."
2,Mind Your Nose\n\nHow smell training can chang...,"['Biotechnology', 'Neuroscience', 'Brain', 'We..."
3,Passionate about the synergy between science a...,"['Health', 'Neuroscience', 'Mental Health', 'P..."
4,"You’ve heard of him, haven’t you? Phineas Gage...","['Brain', 'Health', 'Development', 'Psychology..."


Before cleaning the data further, let's inspect the entire set of tags. The `tags` column needs some processing first, because each value is stored as a string instead of a list of strings.

In [24]:
sample_tag = df.loc[0, "tags"]
sample_tag

"['Mental Health', 'Health', 'Psychology', 'Science', 'Neuroscience']"

In [25]:
def processTags(tags):
    processed_tags = tags[2:-2].replace("'", "").split(",")
    processed_tags = list(map(str.strip, processed_tags))
    
    return processed_tags

In [26]:
processTags(sample_tag)

['Mental Health', 'Health', 'Psychology', 'Science', 'Neuroscience']
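The slicing approach above works for the sample row, but it assumes that no tag contains a comma or an apostrophe, and it returns `['']` for an empty list string such as `"[]"`. As a minimal sketch of an alternative, the standard library's `ast.literal_eval` can parse the stringified list directly. The name `parse_tags` and the example tags containing a comma and an apostrophe are hypothetical, chosen only to illustrate the edge cases.

```python
import ast


def parse_tags(raw):
    """Parse a stringified Python list such as "['A', 'B']" into a list of strings."""
    parsed = ast.literal_eval(raw)  # safely evaluates Python literals only
    return [str(tag).strip() for tag in parsed]


print(parse_tags("['Mental Health', 'Health', 'Psychology']"))
print(parse_tags("[]"))  # an empty list, not ['']
print(parse_tags("['Food, Drink', \"Women's Health\"]"))  # commas and apostrophes survive
```

If this variant were used in place of `processTags`, the tag counts reported in this notebook could differ slightly, since edge cases would be parsed differently.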

In [35]:
tag_set = set()

for row in df.iterrows():
    row_tags = processTags(row[1]["tags"])
    tag_set.update(row_tags)

In [36]:
len(tag_set)

78636

The dataset contains `78636` distinct tags, which is far too many, so we need to select the most frequently occurring ones.

In [38]:
tag_counts = {}

for row in df.iterrows():
    row_tags = processTags(row[1]["tags"])
    for tag in row_tags:
        tag_counts[tag] = tag_counts.get(tag, 0) + 1

In [61]:
threshold = sorted(tag_counts.values(), reverse=True)[9]

top_10_tags = {key: value for key, value in tag_counts.items() if value >= threshold}
top_10_tags

{'Writing': 5115,
 'Machine Learning': 6055,
 'Life': 5954,
 'Technology': 6384,
 'Data Science': 7410,
 'Programming': 6364,
 'Poetry': 6336,
 'Blockchain': 7534,
 'Cryptocurrency': 6245,
 'Bitcoin': 5800}
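The counting and top-N selection above can also be sketched with `collections.Counter`. One difference is worth noting: the threshold approach keeps every tag tied with the 10th count, so it may return more than ten tags, whereas `most_common(n)` returns exactly `n` entries. The small `tag_lists` input below is made-up illustrative data, not taken from the dataset.

```python
from collections import Counter

# Hypothetical parsed tag lists, one per article.
tag_lists = [
    ["Writing", "Life"],
    ["Life", "Poetry"],
    ["Life", "Writing", "Bitcoin"],
]

# Count every tag occurrence across all articles.
tag_counter = Counter(tag for tags in tag_lists for tag in tags)

# Keep the two most frequent tags as (tag, count) pairs, highest count first.
top_2 = tag_counter.most_common(2)
print(top_2)
```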

In [82]:
def updateTags(tags):
    new_tags = []

    for top_tag in top_10_tags.keys():
        if top_tag in tags:
            new_tags.append(top_tag)
            
    if len(new_tags) == 0:
        return ""
    else:
        return new_tags
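One thing to be aware of: `updateTags` receives the raw string from the `tags` column, so `top_tag in tags` is a substring test. A tag such as `'Life Lessons'` would therefore count as `Life`. The sketch below shows the difference against exact matching on the parsed list. `TOP_TAGS`, `select_top_tags` and the example tag string are hypothetical names and data used only for illustration.

```python
import ast

TOP_TAGS = ["Writing", "Life", "Technology"]  # hypothetical subset of the top tags


def select_top_tags(raw_tags, top_tags=TOP_TAGS):
    """Return the top tags that appear as whole tags in the stringified list."""
    article_tags = set(ast.literal_eval(raw_tags))
    return [tag for tag in top_tags if tag in article_tags]


raw = "['Life Lessons', 'Writing Tips', 'Technology']"

# Substring test on the raw string: matches parts of longer tags.
substring_hits = [tag for tag in TOP_TAGS if tag in raw]

# Exact test on the parsed list: matches whole tags only.
exact_hits = select_top_tags(raw)

print(substring_hits)
print(exact_hits)
```

Whether the looser substring behaviour is acceptable depends on the application. With exact matching, the number of rows that survive the next filtering step would likely be smaller than the figure reported here.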

In [83]:
df["updated_tags"] = df["tags"].apply(updateTags)

df.head()

Unnamed: 0,text,tags,updated_tags
0,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,"['Mental Health', 'Health', 'Psychology', 'Sci...",
1,Your Brain On Coronavirus\n\nA guide to the cu...,"['Mental Health', 'Coronavirus', 'Science', 'P...",
2,Mind Your Nose\n\nHow smell training can chang...,"['Biotechnology', 'Neuroscience', 'Brain', 'We...",
3,Passionate about the synergy between science a...,"['Health', 'Neuroscience', 'Mental Health', 'P...",
4,"You’ve heard of him, haven’t you? Phineas Gage...","['Brain', 'Health', 'Development', 'Psychology...",


Now we have to remove the rows that don't have at least one of the top tags. Rows whose `updated_tags` value is `""` (empty string) have none, so they can be removed.

In [89]:
bool_map = df["updated_tags"] != ""
df = df[bool_map]

df.reset_index(drop=True, inplace=True)
df.head()

Unnamed: 0,text,tags,updated_tags
0,How to Turn Your Popular Blog Series Into a Be...,"['Books', 'Entrepreneurship', 'Writing', 'Mark...",[Writing]
1,Occam’s dice\n\nDistrusting biological metapho...,"['Machine Learning', 'Science', 'Neuroscience'...",[Machine Learning]
2,"A few months ago, I wrote an article sharing s...","['Productivity', 'Writing', 'Fiction', 'Books'...",[Writing]
3,As an architect and the author of a book about...,"['Writing', 'Psychology', 'Interior Design', '...",[Writing]
4,"“Your subconscious mind works continuously, wh...","['Productivity', 'Creativity', 'Motivation', '...",[Life]


In [91]:
df.shape

(58084, 3)

The dataset still contains more than `58k` rows, so we can sample `5k` rows from it.

In [95]:
sample_df = df.sample(n=5000, ignore_index=True)
sample_df.head()

Unnamed: 0,text,tags,updated_tags
0,What is Analytics? Part II: A Longer Definitio...,"['Big Data', 'Data Science', 'Analytics']",[Data Science]
1,Erica Stanford wrote the book Crypto Wars. She...,"['Ethereum', 'Bitcoin', 'Cryptocurrency', 'Sca...","[Cryptocurrency, Bitcoin]"
2,I felt the small walls inside my head crumble\...,"['Depression', 'Mental Health', 'Poetry', 'Sui...",[Poetry]
3,"New Website is Live!\n\nDear Yokai lovers,\n\n...","['Art', 'Token', 'Cryptocurrency', 'Binance Sm...",[Cryptocurrency]
4,React vs. Svelte: The War Between Virtual and ...,"['Trump', 'Babies', 'Mental Health', 'Life', '...",[Life]


Now we have to process the `text` column.

In [106]:
from string import ascii_letters

dataset_char_set = set()

for row in sample_df.iterrows():
    row_text = row[1]["text"]
    dataset_char_set.update(row_text)

dataset_char_set = dataset_char_set - set(ascii_letters)
dataset_char_set = dataset_char_set - set([i for i in range(10)])

In [108]:
len(dataset_char_set)

1861
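A detail in the cell above: `range(10)` yields integers, while the set holds one-character strings, so subtracting it leaves the digit characters in `dataset_char_set`. Since digits are not in the keep list used later, they would then be stripped from the text along with the other characters. If the intent is to exclude digits from the set, a minimal sketch using `string.digits` looks like this (the example string is made up):

```python
from string import ascii_letters, digits

chars = set("Room 42, ok!")

# Subtracting integers has no effect on a set of characters.
without_ints = chars - set(ascii_letters) - set(range(10))

# Subtracting the digit characters '0'..'9' removes them as intended.
without_digits = chars - set(ascii_letters) - set(digits)

print(sorted(without_ints))
print(sorted(without_digits))
```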

In [116]:
import random

random.sample(list(dataset_char_set), 5)

['沿', 'ะ', '況', '🏽', '»']

In [118]:
random.sample(list(dataset_char_set), 5)

['록', 'γ', '𝐅', '改', '無']

In [119]:
random.sample(list(dataset_char_set), 5)

['드', '주', '小', '首', '即']

In [123]:
random.sample(list(dataset_char_set), 5)

['🕵', '練', '👺', '🤔', 'ب']

There are a lot of non-ASCII characters (along with ASCII punctuation) in this set, which most often will not provide much value to the application, so they can be replaced with `""` (empty string). But a few important characters cannot be removed, as they provide meaningful grammatical structure to the article, such as `"\n", " ", "'", '''"''', "-", ":", "(", ")"` etc. Hence, apart from these, the remaining characters will be removed.

In [128]:
remove_chars = dataset_char_set - set(["\n", " ", "'", '''"''', "-", ":", "(", ")", "!", "@", "#", "%"])

def processText(text):
    for char in remove_chars:
        text = text.replace(char, "")
    return text
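Calling `str.replace` once per character scans the text as many times as there are characters to remove. As a sketch of an alternative under the same assumptions, `str.maketrans` can build a deletion table once, and `str.translate` then removes all of the characters in a single pass. The small `remove_chars` set and the name `process_text_fast` are hypothetical stand-ins for illustration.

```python
# Hypothetical stand-in for the much larger remove_chars set.
remove_chars = {".", ",", "?", "»", "🤔"}

# The third argument of str.maketrans lists the characters to delete.
translation_table = str.maketrans("", "", "".join(remove_chars))


def process_text_fast(text):
    """Delete every character listed in remove_chars in one pass."""
    return text.translate(translation_table)


print(repr(process_text_fast("Why? Yes, ok» 🤔.")))
```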

In [131]:
sample_df["processed_text"] = sample_df["text"].apply(processText)

sample_df.head()

Unnamed: 0,text,tags,updated_tags,processed_text
0,What is Analytics? Part II: A Longer Definitio...,"['Big Data', 'Data Science', 'Analytics']",[Data Science],What is Analytics Part II: A Longer Definition...
1,Erica Stanford wrote the book Crypto Wars. She...,"['Ethereum', 'Bitcoin', 'Cryptocurrency', 'Sca...","[Cryptocurrency, Bitcoin]",Erica Stanford wrote the book Crypto Wars She ...
2,I felt the small walls inside my head crumble\...,"['Depression', 'Mental Health', 'Poetry', 'Sui...",[Poetry],I felt the small walls inside my head crumble\...
3,"New Website is Live!\n\nDear Yokai lovers,\n\n...","['Art', 'Token', 'Cryptocurrency', 'Binance Sm...",[Cryptocurrency],New Website is Live!\n\nDear Yokai lovers\n\nO...
4,React vs. Svelte: The War Between Virtual and ...,"['Trump', 'Babies', 'Mental Health', 'Life', '...",[Life],React vs Svelte: The War Between Virtual and R...


Further, `\n` marks the end of a line. Articles often use `\n\n` instead, because the extra spacing improves readability, as is evident from the 4th row in the data. For classifying the article, however, the double newline is not necessary, so it can be replaced with a single `\n`.

In [132]:
sample_df["processed_text"] = sample_df["processed_text"].apply(lambda x: x.replace("\n\n", "\n"))

sample_df.head()

Unnamed: 0,text,tags,updated_tags,processed_text
0,What is Analytics? Part II: A Longer Definitio...,"['Big Data', 'Data Science', 'Analytics']",[Data Science],What is Analytics Part II: A Longer Definition...
1,Erica Stanford wrote the book Crypto Wars. She...,"['Ethereum', 'Bitcoin', 'Cryptocurrency', 'Sca...","[Cryptocurrency, Bitcoin]",Erica Stanford wrote the book Crypto Wars She ...
2,I felt the small walls inside my head crumble\...,"['Depression', 'Mental Health', 'Poetry', 'Sui...",[Poetry],I felt the small walls inside my head crumble\...
3,"New Website is Live!\n\nDear Yokai lovers,\n\n...","['Art', 'Token', 'Cryptocurrency', 'Binance Sm...",[Cryptocurrency],New Website is Live!\nDear Yokai lovers\nOur n...
4,React vs. Svelte: The War Between Virtual and ...,"['Trump', 'Babies', 'Mental Health', 'Life', '...",[Life],React vs Svelte: The War Between Virtual and R...
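Note that a single `replace("\n\n", "\n")` pass only halves each run of newlines, so three or more consecutive newlines are not fully collapsed. If every run should become a single `\n`, a regular expression is one option. This is a minimal sketch, and `collapse_newlines` is a hypothetical helper name.

```python
import re


def collapse_newlines(text):
    """Replace every run of one or more newlines with a single newline."""
    return re.sub(r"\n+", "\n", text)


sample = "Title\n\n\nFirst paragraph\n\n\n\nSecond paragraph"

single_pass = sample.replace("\n\n", "\n")  # leaves "\n\n" behind in longer runs
collapsed = collapse_newlines(sample)

print(repr(single_pass))
print(repr(collapsed))
```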


Further, links are also part of the article text, and they provide no useful information for the classification application, so they can also be removed. Because their structure varies so much, they have to be removed using regular expressions.

In [137]:
import re

pattern = r'https?://(?:www\.)?\S+'

def removeLinks(text):
    text = re.sub(pattern, '', text)
    return text
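The order of the cleaning steps matters here. The pattern looks for `http://` or `https://`, but the character-removal step has already run on `processed_text`. If `/` and `.` are among the removed characters (neither is in the keep list, so this seems likely), the links no longer match the pattern by the time `removeLinks` is applied. The sketch below illustrates this with a hypothetical subset of removed characters and a made-up sentence. `strip_chars` and `remove_links` are illustrative names.

```python
import re

LINK_PATTERN = re.compile(r"https?://\S+")

# Hypothetical subset of the characters removed in the earlier step.
stripped_chars = str.maketrans("", "", "./,?")


def strip_chars(text):
    return text.translate(stripped_chars)


def remove_links(text):
    return LINK_PATTERN.sub("", text)


raw = "Read more at https://example.com/post today."

chars_first = remove_links(strip_chars(raw))  # link is mangled and survives
links_first = strip_chars(remove_links(raw))  # link is removed cleanly

print(repr(chars_first))
print(repr(links_first))
```

Under these assumptions, applying the link removal to the raw `text` column before the character removal would be the safer order.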

In [138]:
sample_df["processed_text"] = sample_df["processed_text"].apply(removeLinks)
sample_df.head()

Unnamed: 0,text,tags,updated_tags,processed_text
0,What is Analytics? Part II: A Longer Definitio...,"['Big Data', 'Data Science', 'Analytics']",[Data Science],What is Analytics Part II: A Longer Definition...
1,Erica Stanford wrote the book Crypto Wars. She...,"['Ethereum', 'Bitcoin', 'Cryptocurrency', 'Sca...","[Cryptocurrency, Bitcoin]",Erica Stanford wrote the book Crypto Wars She ...
2,I felt the small walls inside my head crumble\...,"['Depression', 'Mental Health', 'Poetry', 'Sui...",[Poetry],I felt the small walls inside my head crumble\...
3,"New Website is Live!\n\nDear Yokai lovers,\n\n...","['Art', 'Token', 'Cryptocurrency', 'Binance Sm...",[Cryptocurrency],New Website is Live!\nDear Yokai lovers\nOur n...
4,React vs. Svelte: The War Between Virtual and ...,"['Trump', 'Babies', 'Mental Health', 'Life', '...",[Life],React vs Svelte: The War Between Virtual and R...


In [139]:
final_cols = ["processed_text", "updated_tags"]

final_df = sample_df[final_cols]
final_df.head()

Unnamed: 0,processed_text,updated_tags
0,What is Analytics Part II: A Longer Definition...,[Data Science]
1,Erica Stanford wrote the book Crypto Wars She ...,"[Cryptocurrency, Bitcoin]"
2,I felt the small walls inside my head crumble\...,[Poetry]
3,New Website is Live!\nDear Yokai lovers\nOur n...,[Cryptocurrency]
4,React vs Svelte: The War Between Virtual and R...,[Life]


In [140]:
final_df.to_csv("./data/processed-data.csv", index=False)