# Doctor and Veterinary Classification using NLP

This notebook builds a model that classifies a given set of Reddit users as practicing doctors, practicing veterinarians, or others, based on each user's comments

The dataset for this task is sourced from a database whose link is given as

[postgresql://niphemi.oyewole:W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?statusColor=F8F8F8&env=&name=redditors%20db&tLSMode=0&usePrivateKey=false&safeModeLevel=0&advancedSafeModeLevel=0&driverVersion=0&lazyload=false](postgresql://niphemi.oyewole:W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?statusColor=F8F8F8&env=&name=redditors%20db&tLSMode=0&usePrivateKey=false&safeModeLevel=0&advancedSafeModeLevel=0&driverVersion=0&lazyload=false)

However, trying to access the database with the given link results in errors

Therefore, a modified version of the link is used

Before continuing, the needed libraries are imported below

In [1]:
import re             # for regex operations
import string         # for removing punctuation
import numpy as np    # for mathematical calculations
import pandas as pd    # for working with structured data (dataframes)
from sqlalchemy import create_engine # for connecting to database

The modified link to access the database is defined below

In [8]:
# define the connection link
conn_str = "postgresql://niphemi.oyewole:endpoint=ep-delicate-river-a5cq94ee-pooler;W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?sslmode=require"

# create connection to the database
engine = create_engine(conn_str)
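The change from the original link to the modified one can be sketched as a small helper. This is only an illustration under assumptions: the helper name `add_endpoint_to_password` and the example URL are hypothetical, and the rule it encodes (first hostname label copied into the password field as `endpoint=<label>;`, query string replaced by `sslmode=require`) is read off the two links above rather than taken from any documentation.

```python
from urllib.parse import urlsplit


def add_endpoint_to_password(url: str) -> str:
    """Rebuild a PostgreSQL URL with 'endpoint=<first host label>;' before the password."""
    parts = urlsplit(url)
    # Assumption: the endpoint identifier is the first label of the hostname.
    endpoint = parts.hostname.split(".")[0]
    return (
        f"{parts.scheme}://{parts.username}:endpoint={endpoint};{parts.password}"
        f"@{parts.hostname}{parts.path}?sslmode=require"
    )


# Hypothetical URL with placeholder credentials, not the notebook's real link.
example = (
    "postgresql://some_user:some_password"
    "@ep-example-123456.us-east-2.aws.example.com/somedb?lazyload=false"
)
print(add_endpoint_to_password(example))
```

Building the string programmatically keeps the credentials in one place, which makes it easier to load them from an environment variable later instead of hard-coding them.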

First, let's take a look at the tables in the database

In [3]:
# define sql query for retrieving the tables in the database
sql_for_tables = """
SELECT
    table_schema || '.' || table_name
FROM
    information_schema.tables
WHERE
    table_type = 'BASE TABLE'
AND
    table_schema NOT IN ('pg_catalog', 'information_schema');
"""

In [4]:
# retrieve the tables in a dataframe
tables_df = pd.read_sql_query(sql_for_tables, engine)

In [5]:
tables_df

Unnamed: 0,?column?
0,public.reddit_usernames_comments
1,public.reddit_usernames


There are two tables in the database as shown above

Each table would be saved in a pandas dataframe

In [6]:
sql_for_table1 = """
SELECT
    *
FROM
    public.reddit_usernames_comments;
"""

> Note: The code below may take a while to run. If it fails, reconnect the engine above then rerun the cell

In [9]:
user_comment_df = pd.read_sql_query(sql_for_table1, engine)
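Since the query may fail and need to be rerun, the "reconnect and rerun" step can be wrapped in a retry helper. The sketch below is an assumption about how one might do it, not part of the original notebook: `run_with_retry` is a hypothetical name, and a simulated flaky function stands in for the database call so the example runs without a connection.

```python
import time


def run_with_retry(action, attempts=3, delay_seconds=0.0):
    """Call action() up to `attempts` times (attempts >= 1), re-raising the last error."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # broad on purpose: any failure triggers a retry
            last_error = error
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error


# Simulated query that fails twice before succeeding.
calls = {"count": 0}


def flaky_query():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated dropped connection")
    return "rows"


result = run_with_retry(flaky_query, attempts=5)
print(result, calls["count"])
```

In the notebook, `action` could be a function that recreates the engine and then calls `pd.read_sql_query`, so each retry starts from a fresh connection.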

In [10]:
sql_for_table2 = """
SELECT
    *
FROM
    public.reddit_usernames;
"""

> Note: The code below may take a while to run. If it fails, reconnect the engine above then rerun the cell

In [11]:
user_info_df = pd.read_sql_query(sql_for_table2, engine)

Let's take a look at the tables one after the other

In [12]:
user_comment_df.head()

Unnamed: 0,username,comments
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on..."
1,wahznooski,"As a woman of reproductive age, fuck Texas|As ..."
2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...
3,abarthch,"I see of course there are changing variables, ..."
4,VoodooKing,I have 412+ and faced issues because wireguard...


In [13]:
user_comment_df.shape

(3276, 2)

This table (now dataframe) contains usernames of users and their comments

Let's look at a comment in order to understand how it is structured

In [14]:
# print all comments for first user
user_comment_df["comments"][0]

'Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a wee

In [15]:
# split comments into individual comments
first_comments = user_comment_df["comments"][0].split("|")

# get the number of comments for first user
len(first_comments)

16

In [16]:
# remove repeated comments
unique_comment = []
for comment in first_comments:
    if comment in unique_comment:
        continue
    else:
        unique_comment.append(comment)

In [17]:
print(f"Length of unique comments for first user: {len(unique_comment)}")
print()
print(unique_comment)

Length of unique comments for first user: 1

['Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.']


It can be seen that the comment column contains multiple comments separated with "|"

It can also be seen that there are repeated comments
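The split-and-deduplicate step can also be written more compactly. A minimal sketch, assuming that insertion order should be preserved and that surrounding whitespace is not significant; the function name `unique_comments` and the sample string are made up for illustration.

```python
def unique_comments(raw_comments: str) -> list[str]:
    """Split a '|'-separated string and drop repeated comments, keeping first-seen order."""
    comments = [comment.strip() for comment in raw_comments.split("|")]
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(comments))


sample = "first comment|second comment|first comment| second comment "
print(unique_comments(sample))
```

Unlike the loop over a list, membership checks here are hash-based, which matters when a user has many comments.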

Let's check for missing values

In [18]:
user_comment_df.isna().sum()

username    0
comments    0
dtype: int64

There are no missing values

Let's check if there are duplicate usernames

In [19]:
if user_comment_df["username"].nunique() == user_comment_df.shape[0]:
    print("There are no duplicated usernames")
else:
    print("There are duplicated usernames")

There are no duplicated usernames


Let's explore the second dataframe as well

In [20]:
user_info_df.head()

Unnamed: 0,username,isused,subreddit,created_at
0,LoveAGoodTwist,True,Veterinary,2024-05-02
1,drawntage,True,Veterinary,2024-05-02
2,LinkPast84,True,Veterinary,2024-05-02
3,heatthequestforfire,True,Veterinary,2024-05-02
4,Most-Exit-5507,True,Veterinary,2024-05-02


In [21]:
user_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8259 entries, 0 to 8258
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   username    8259 non-null   object
 1   isused      8259 non-null   bool  
 2   subreddit   8259 non-null   object
 3   created_at  8259 non-null   object
dtypes: bool(1), object(3)
memory usage: 201.8+ KB


From the summary above, we see that there are no missing values, as each feature has exactly 8259 non-null values, which is the total number of entries in the dataset

Let's check if there are duplicate usernames

In [22]:
if user_info_df["username"].nunique() == user_info_df.shape[0]:
    print("There are no duplicated usernames")
else:
    print("There are duplicated usernames")

There are no duplicated usernames


Let's check the unique values in the subreddit feature, as well as the count of each value

In [23]:
subreddit_count = user_info_df['subreddit'].value_counts()
subreddit_count

subreddit
Veterinary          6170
MysteriumNetwork     967
medicine             409
HeliumNetwork        400
orchid               303
vet                   10
Name: count, dtype: int64

In [24]:
subreddit_list = list(subreddit_count.index)

Let's explore each of these subreddit categories, starting from the least common (the bottom)

In [25]:
# get the number of vet subscribers that are in the first dataset

# initialize counter
user_count = 0
# create container for vet subscribers also in the first dataframe
vet_subscribers = []

# for each username who is a subscriber of vet
for user in user_info_df[user_info_df['subreddit'] == "vet"]["username"]:
    # if username is found in table1
    if not user_comment_df[user_comment_df["username"] == user].empty:
        # increment counter by 1
        user_count += 1
        # capture the username
        vet_subscribers.append(user)

print("Vet Subreddit Count")
print("Table1: {}".format(subreddit_count["vet"]))
print(f"Table2: {user_count}")

Vet Subreddit Count
Table1: 10
Table2: 9


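Of the 10 vet subscribers in `user_info_df`, only 9 appear in `user_comment_df`, so one subscriber has no comments. (In the printout above, the count of 10 comes from `user_info_df` and the count of 9 from `user_comment_df`.)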
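The comparison between the two tables amounts to a set difference. A small sketch with invented usernames (the real missing username is not shown in the notebook output, so none is assumed here):

```python
# Hypothetical usernames standing in for the two tables.
info_vet_users = {"vet_user_a", "vet_user_b", "vet_user_c"}   # vet subscribers in the user table
comment_users = {"vet_user_a", "vet_user_c", "other_user"}    # usernames in the comments table

# Subscribers listed in the user table that have no row in the comments table.
missing = sorted(info_vet_users - comment_users)
print(missing)
```

With the notebook's dataframes, the two sets could be built from `user_info_df[user_info_df["subreddit"] == "vet"]["username"]` and `user_comment_df["username"]`.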

At this point it would be better to combine both datasets into one

Let's do that

In [26]:
reddit_user_df = pd.merge(user_comment_df, user_info_df,
                          on="username", how="left")

In [27]:
reddit_user_df.head()

Unnamed: 0,username,comments,isused,subreddit,created_at
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on...",True,Veterinary,2024-05-02
1,wahznooski,"As a woman of reproductive age, fuck Texas|As ...",True,Veterinary,2024-05-02
2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...,True,Veterinary,2024-05-02
3,abarthch,"I see of course there are changing variables, ...",True,MysteriumNetwork,2024-05-02
4,VoodooKing,I have 412+ and faced issues because wireguard...,False,MysteriumNetwork,2024-05-03


In [28]:
reddit_user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   username    3276 non-null   object
 1   comments    3276 non-null   object
 2   isused      3276 non-null   bool  
 3   subreddit   3276 non-null   object
 4   created_at  3276 non-null   object
dtypes: bool(1), object(4)
memory usage: 105.7+ KB


Now let us continue with the subreddits

Starting from the bottom and moving up

In [29]:
subreddit_list

['Veterinary',
 'MysteriumNetwork',
 'medicine',
 'HeliumNetwork',
 'orchid',
 'vet']

Let's check out the comments of the vet subscribers

In [30]:
# get list of vet subscribers
vet_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "vet"]["username"].values

# get the comments made by vet subscribers
vet_sub_comments = reddit_user_df[reddit_user_df["subreddit"] == "vet"]["comments"].values

In [31]:
for i in range(len(vet_sub_list)):
    print(f"{vet_sub_list[i]}: {vet_sub_comments[i]}")

test_vet2: The puppy was brought in for its first round of vaccinations.
test_vet3: The adult horse was treated for laminitis.
test_vet4: The juvenile bird was treated for a wing injury.
test_vet5: The senior cat was brought in for a routine health check-up.
test_vet6: I just performed a neutering procedure on a cat.
test_vet7: The dog’s condition is improving after the deworming treatment.
test_vet8: The X-ray showed a fracture in the bird’s wing.
test_vet9: I prescribed flea prevention medication for the puppy.
test_vet: The horse’s blood test revealed signs of equine infectious anemia.


It would be logical for a practicing veterinarian, or anyone whose work is related to veterinary medicine, to follow the "vet" subreddit. Following it shows closeness to the field but does not guarantee that the user is a veterinarian

It can be seen that each of these users has just one comment

Many wrote in the third person, which makes it hard to say whether they are veterinarians or not

Only test_vet6 and test_vet9 can be confirmed to be practicing veterinarians
The others are placed in the others category, since there must be solid evidence in order to classify a user as a practicing veterinarian

In [32]:
# store usernames of confirmed veterinarians
veterinarians = ["test_vet6", "test_vet9"]
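The evidence rule applied above (a first-person statement about clinical work counts, a third-person description does not) can be approximated with a regular expression. This is a rough sketch under assumptions: the verb list `CLINICAL_VERBS` and the function name `has_first_person_evidence` are hypothetical, and a match is only a hint for manual review, not proof of profession.

```python
import re

# Hypothetical list of verbs that suggest the author did the clinical work personally.
CLINICAL_VERBS = ("performed", "prescribed", "diagnosed", "treated")

# "I", optionally one filler word such as "just", then one of the verbs.
FIRST_PERSON_CLINICAL = re.compile(
    r"\bI\s+(?:\w+\s+)?(?:" + "|".join(CLINICAL_VERBS) + r")\b"
)


def has_first_person_evidence(comments: str) -> bool:
    """Return True if any '|'-separated comment contains a first-person clinical statement."""
    return any(FIRST_PERSON_CLINICAL.search(c) for c in comments.split("|"))


print(has_first_person_evidence("I just performed a neutering procedure on a cat."))
print(has_first_person_evidence("The adult horse was treated for laminitis."))
```

The pattern is deliberately case-sensitive on "I" so that words ending in "i" or lowercase fragments do not trigger it.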

> It is noteworthy that this type of problem is usually solved effectively with a labelled dataset

> However, with unlabelled data such as this, hand engineering may be employed to label enough users to build a model; thereafter the model can predict the rest

> That is the approach I wish to employ for this task

Now, on to the next subreddit (orchid)

In [43]:
# get list of orchid subscribers
orchid_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "orchid"]["username"].values

# get the comments made by orchid subscribers
orchid_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "orchid"]["comments"]

In [44]:
print(f"Orchid subscribers: {orchid_sub_list}")
print(f"Number of orchid subscribers: {len(orchid_sub_list)}")

Orchid subscribers: ['Think_Not_Doer' 'Personal-Escape4283' 'timee_bot' 'erlingspaulsen']
Number of orchid subscribers: 4


Let's check their comments one after the other

Next we need to remove links and symbols

In [50]:
def remove_web_link(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
                              "", text_list[i].strip())
    return " | ".join(text_list)

In [70]:
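def remove_directories(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        # strip path-like tokens such as /r/sub/comments/abc/ plus an optional
        # file extension; the dot is escaped so it only matches a literal "."
        text_list[i] = re.sub(r"(/[a-zA-Z0-9_]+)+(/)*(\.[a-zA-Z_]+)*",
                              "", text_list[i]).strip()
    return " | ".join(text_list)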

In [53]:
def remove_punctuations(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = ''.join([l for l in text_list[i] if l not in string.punctuation])
    return " | ".join(text_list)

In [103]:
def remove_non_alphabets(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"[^a-zA-Z ]", "", text_list[i].strip())
    return " | ".join(text_list)

In [117]:
def remove_unneeded_spaces(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"(\s)+", " ", text_list[i].strip())
    return " | ".join(text_list)

In [55]:
def remove_repeated_sentence(text):
    text_list = text.split("|")
    unique_comment = []
    for comment in text_list:
        if comment.strip() in unique_comment:
            continue
        else:
            unique_comment.append(comment.strip())
    return " | ".join(unique_comment)

In [140]:
def nlp_preprocessing(text):
    text = remove_web_link(text)
    text = remove_directories(text)
    text = remove_punctuations(text)
    text = remove_non_alphabets(text)
    text = remove_unneeded_spaces(text)
    text = remove_repeated_sentence(text)
    return text
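The steps chained in `nlp_preprocessing` can also be written so that the text is split on `|` only once. The following is a compact sketch, not the notebook's implementation: the names `clean_comment` and `preprocess` are hypothetical, the URL pattern is a simplified stand-in (`https?://\S+`), and, unlike the functions above, it drops comments that become empty after cleaning.

```python
import re
import string

URL_RE = re.compile(r"https?://\S+")  # simplified URL pattern (assumption)
PATH_RE = re.compile(r"(?:/[A-Za-z0-9_]+)+/*(?:\.[A-Za-z_]+)*")  # /a/b/c/ plus optional extension
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_comment(comment: str) -> str:
    """Clean one comment: links, paths, punctuation, non-letters, extra spaces."""
    comment = URL_RE.sub("", comment)
    comment = PATH_RE.sub("", comment)
    comment = comment.translate(PUNCT_TABLE)
    comment = re.sub(r"[^a-zA-Z ]", "", comment)
    return re.sub(r"\s+", " ", comment).strip()


def preprocess(text: str) -> str:
    """Split once on '|', clean each comment, drop empties and repeats, then rejoin."""
    cleaned = (clean_comment(c) for c in text.split("|"))
    unique = list(dict.fromkeys(c for c in cleaned if c))
    return " | ".join(unique)


sample = (
    "Great node! See https://example.com/docs|"
    "Great node! See https://example.com/docs|"
    "Paid 20 MYST at /r/MysteriumNetwork/comments/abc/"
)
print(preprocess(sample))
```

Deduplicating after cleaning, as the notebook's pipeline also does, means comments that differ only in links or punctuation collapse into one.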

In [141]:
orchid_sub_comment_list = orchid_sub_comment_list.apply(nlp_preprocessing)

In [145]:
orchid_sub_comment_list.index

Index([205, 241, 306, 1547], dtype='int64')

In [148]:
orchid_sub_comment_list[1547]

'Nice If anyone is hosting a Mysterium node and have a spare hard disk its very easy to run a Storj node alongside Mysterium I do this on my Raspberry Pi | My Raspberry Pi is running the Mysterium Rasbian image Then I just followed the Storj installation guide and installed Docker engine downloaded their Docker image and started the Storj storage node inside its own container I just looked up Presearch and it looks like you can run their node through a Docker container so this should be okay to do alongside Mysterium To install the Docker engine on Rasbian you can follow this guidehttpsinstallusingtheconveniencescript Thanks for bringing it up as I was not aware of this project Might as well try this myself | Doesnt look like Presearch node is supported on Raspberry Pi yet Hardware Specification Docker image is only for x x based CPUs for now We plan to support ARM Raspberry Pi in the future but were currently focused on fixing bugs and making sure the current platform is robust before

Checking the comments of all 4 orchid subscribers shows that none of them is a practicing doctor or a practicing veterinarian

Next subreddit is HeliumNetwork

In [149]:
# get list of HeliumNetwork subscribers
HeliumNetwork_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "HeliumNetwork"]["username"].values

# get the comments made by HeliumNetwork subscribers
HeliumNetwork_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "HeliumNetwork"]["comments"]

In [150]:
print(f"HeliumNetwork subscribers: {HeliumNetwork_sub_list}")
print(f"Number of HeliumNetwork subscribers: {len(HeliumNetwork_sub_list)}")

HeliumNetwork subscribers: ['Best_Bid_9327' 'SoulReaver-SS' 'Imthefatboyb' 'Passi-RVN' 'alexiskef'
 'drunknfoo']
Number of HeliumNetwork subscribers: 6


There are 6 subscribers

Let's preprocess the comments of all users who subscribe to HeliumNetwork in order to view them properly

In [151]:
HeliumNetwork_sub_comment_list = HeliumNetwork_sub_comment_list.apply(nlp_preprocessing)

In [152]:
HeliumNetwork_sub_comment_list.index

Index([93, 442, 458, 632, 670, 1504], dtype='int64')

In [157]:
HeliumNetwork_sub_comment_list[1504]

'Was averaging and MYST After three days of mainnet total about MYST and the other node nothing Not worth bothering for me anymore until something changes'

Everyone who subscribes to HeliumNetwork belongs to the others category

Let's take a look at the medicine subreddit

In [161]:
# get list of medicine subscribers
medicine_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "medicine"]["username"].values

# get the comments made by medicine subscribers
medicine_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "medicine"]["comments"]

In [162]:
print(f"medicine subscribers: {medicine_sub_list}")
print(f"Number of medicine subscribers: {len(medicine_sub_list)}")

medicine subscribers: ['test_doctor2' 'test_doctor3' 'test_doctor4' 'test_doctor5'
 'test_doctor6' 'test_doctor7' 'test_doctor8' 'test_doctor1']
Number of medicine subscribers: 8


There are 8 subscribers

Let's preprocess the comments of all users who subscribe to medicine in order to view them properly

In [163]:
medicine_sub_comment_list = medicine_sub_comment_list.apply(nlp_preprocessing)

In [164]:
medicine_sub_comment_list.index

Index([1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445], dtype='int64')

In [174]:
medicine_sub_comment_list[1445]

'The patients EKG showed signs of a possible heart attack'

In [176]:
Doctors_index = [1441, 1444]
Doctors = medicine_sub_comment_list[Doctors_index]
Doctors

1441        I just performed an appendectomy on a patient
1444    I prescribed antibiotics for the patients bact...
Name: comments, dtype: object

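Only the users at index 1441 and 1444 describe, in the first person, procedures they carried out themselves, so only they can be confirmed as practicing doctors. The other medicine subscribers are placed in the others category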

In [178]:
# Doctors_index holds index labels, so select by label with .loc
Doctors = reddit_user_df.loc[Doctors_index]
Doctors

Unnamed: 0,username,comments,isused,subreddit,created_at
1441,test_doctor5,I just performed an appendectomy on a patient.,True,medicine,2024-05-11
1444,test_doctor8,I prescribed antibiotics for the patient’s bac...,True,medicine,2024-05-11


In [179]:
Doctors = Doctors["username"]
Doctors

1441    test_doctor5
1444    test_doctor8
Name: username, dtype: object
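The hand-labelled users collected so far can be gathered into one lookup that later serves as seed labels for a model. A minimal sketch, assuming the label strings `"veterinarian"`, `"doctor"` and `"other"` (these names are not from the notebook) and listing only a few of the users already judged to belong to the others category:

```python
# Users confirmed by reading their comments (usernames taken from the outputs above).
veterinarians = ["test_vet6", "test_vet9"]
doctors = ["test_doctor5", "test_doctor8"]
# Partial list of users judged to be "others"; extend as more users are reviewed.
others = ["test_vet2", "Think_Not_Doer", "drunknfoo"]

# Hypothetical label names.
seed_labels = {}
seed_labels.update({user: "veterinarian" for user in veterinarians})
seed_labels.update({user: "doctor" for user in doctors})
seed_labels.update({user: "other" for user in others})


def seed_label(username):
    """Return the hand-assigned label, or None if the user has not been reviewed yet."""
    return seed_labels.get(username)


print(seed_label("test_doctor5"), seed_label("test_vet9"), seed_label("unreviewed_user"))
```

Returning `None` for unreviewed users keeps "not yet labelled" distinct from the others category, which matters when the labelled subset is used for training.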

In [69]:
s_text = "My first observation is that Mysterium highlights the intention of splitting up packets to traverse different paths along the VPN network which protects a user from a malicious node. ---> /r/MysteriumNetwork/comments/mrg7bb/just_bought_my_first_myst_tokens_and_feel/guy06xo/"
re.sub(r"(/[a-zA-Z0-9_]+)+(/)*(\.[a-zA-Z_]+)*",
                              "", s_text)

'My first observation is that Mysterium highlights the intention of splitting up packets to traverse different paths along the VPN network which protects a user from a malicious node. ---> '
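In a pattern of this shape, whether the `.` before the extension group is escaped changes the result considerably: an unescaped `.` matches any character, so the final group can keep consuming words that follow the path. A small comparison, using a made-up sentence:

```python
import re

# Same pattern twice; only the dot before the extension group differs.
unescaped_dot = r"(/[a-zA-Z0-9_]+)+(/)*(.[a-zA-Z_]+)*"
escaped_dot = r"(/[a-zA-Z0-9_]+)+(/)*(\.[a-zA-Z_]+)*"

text = "see /r/vet/ for more info"

# With "." unescaped, " for", " more" and " info" each match ".[a-zA-Z_]+".
print(repr(re.sub(unescaped_dot, "", text)))
# With "\." only a literal dot starts an extension, so the trailing words survive.
print(repr(re.sub(escaped_dot, "", text)))
```

The example string in the cell above ends with the path, which is why it gives the same output with either form.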

In [102]:
def remove_non_alphabetic_chars(text):
    """Remove non-alphabetic characters from text."""
    return re.sub(r'[^a-zA-Z ]', '', text)

# Example usage:
text_with_non_alphabetic_chars = "Hello123, world! This has some 1n0n-4lph4b3t1c characters."
cleaned_text = remove_non_alphabetic_chars(text_with_non_alphabetic_chars)
print(cleaned_text)

Hello world This has some nnlphbtc characters


In [108]:
def replace_multiple_spaces(text):
    """Replace multiple spaces with a single space."""
    return re.sub(r'\s+', ' ', text)

# Example usage:
text_with_multiple_spaces = "This    text    has   multiple    spaces."
cleaned_text = replace_multiple_spaces(text_with_multiple_spaces)
print(cleaned_text)

This text has multiple spaces.
