# Doctor and Veterinary Classification using NLP

This notebook is for building a model which will correctly classify a number of given reddit users as practicing doctors, practicng veterinary or others based on each user's comments 

The dataset for this task would be sourced from a databased whose link is given as

[postgresql://niphemi.oyewole:W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?statusColor=F8F8F8&env=&name=redditors%20db&tLSMode=0&usePrivateKey=false&safeModeLevel=0&advancedSafeModeLevel=0&driverVersion=0&lazyload=false](postgresql://niphemi.oyewole:W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?statusColor=F8F8F8&env=&name=redditors%20db&tLSMode=0&usePrivateKey=false&safeModeLevel=0&advancedSafeModeLevel=0&driverVersion=0&lazyload=false)

However, trying to access the database with the given link would result in errors

Therefore, a modified version of the link would be used

## Module Importations and Data Retrieval

Before continuing, needed libraries would be imported below

In [1]:
import re             # for regrex operations
import string         # for removing punctuations
import numpy as np    # for mathematical calculations
import pandas as pd    # for working with structured data (dataframes)
from sqlalchemy import create_engine # for connecting to database
from nltk.tokenize import word_tokenize
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.initializers import Constant
from sklearn.model_selection import train_test_split

The modified link to access the database is defined below

In [8]:
# define the connection link
conn_str = "postgresql://niphemi.oyewole:endpoint=ep-delicate-river-a5cq94ee-pooler;W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?sslmode=allow"

# create connection to the databse
engine =  create_engine(conn_str)

First, lets take a look at the tables in the database

In [3]:
# define sql query for retrieving the tables in the database
sql_for_tables = """
SELECT
    table_schema || '.' || table_name
FROM
    information_schema.tables
WHERE
    table_type = 'BASE TABLE'
AND
    table_schema NOT IN ('pg_catalog', 'information_schema');
"""

In [4]:
# retrieve the tables in a dataframe
tables_df = pd.read_sql_query(sql_for_tables, engine)

In [5]:
tables_df

Unnamed: 0,?column?
0,public.reddit_usernames_comments
1,public.reddit_usernames


There are two tables in the database as shown above

Each table would be saved in a pandas dataframe

In [9]:
sql_for_table1 = """
SELECT
    *
FROM
    public.reddit_usernames_comments;
"""

> Note: The code below may take a while to run. If it fails, reconnect the engine above then rerun the cell

In [11]:
# user_comment_df = pd.read_sql_query(sql_for_table1, engine)

Lets save the table as a csv file

In [12]:
user_comment_df = pd.read_csv("reddit_usernames_comments.csv")

In [15]:
# user_comment_df.to_csv('reddit_usernames_comments.csv')

In [16]:
sql_for_table2 = """
SELECT
    *
FROM
    public.reddit_usernames;
"""

> Note: The code below may take a while to run. If it fails, reconnect the engine above then rerun the cell

In [14]:
# user_info_df = pd.read_sql_query(sql_for_table2, engine)

In [15]:
user_info_df = pd.read_csv("reddit_usernames.csv")

Lets save the table as a csv file

In [16]:
# user_info_df.to_csv('reddit_usernames.csv')

Lets take a look at the tables one after the other

In [17]:
user_comment_df.head()

Unnamed: 0.1,Unnamed: 0,username,comments
0,0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on..."
1,1,wahznooski,"As a woman of reproductive age, fuck Texas|As ..."
2,2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...
3,3,abarthch,"I see of course there are changing variables, ..."
4,4,VoodooKing,I have 412+ and faced issues because wireguard...


In [18]:
user_comment_df.shape

(3276, 3)

## Data Exploration

This table (now dataframe) contains usernames of users and their comments

Lets look at a comment in order to understand how it is structured

In [19]:
# print all comments for first user
user_comment_df["comments"][0]

'Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a wee

In [20]:
# split comments into individual comments
first_comments = user_comment_df["comments"][0].split("|")

# get the number of comments for first user
len(first_comments)

16

In [21]:
# remove repeated comments
unique_comment = []
for comment in first_comments:
    if comment in unique_comment:
        continue
    else:
        unique_comment.append(comment)

In [22]:
print(f"Length of unique comments for first user: {len(unique_comment)}")
print()
print(unique_comment)

Length of unique comments for first user: 1

['Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.']


It can be seen that the comment column contains multiple comments separated with "|"

It can also be seen that there are repeated comments

Lets check for missing values

In [23]:
user_comment_df.isna().sum()

Unnamed: 0    0
username      1
comments      0
dtype: int64

There are no missig values

Let's check if there are duplicate usernames

In [24]:
if user_comment_df["username"].nunique() == user_comment_df.shape[0]:
    print("There are no duplicated usernames")
else:
    print("There are duplicated usernames")

There are duplicated usernames


Lets explore the second dataframe also

In [25]:
user_info_df.head()

Unnamed: 0.1,Unnamed: 0,username,isused,subreddit,created_at
0,0,LoveAGoodTwist,True,Veterinary,2024-05-02
1,1,drawntage,True,Veterinary,2024-05-02
2,2,LinkPast84,True,Veterinary,2024-05-02
3,3,heatthequestforfire,True,Veterinary,2024-05-02
4,4,Most-Exit-5507,True,Veterinary,2024-05-02


In [26]:
user_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8259 entries, 0 to 8258
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  8259 non-null   int64 
 1   username    8258 non-null   object
 2   isused      8259 non-null   bool  
 3   subreddit   8259 non-null   object
 4   created_at  8259 non-null   object
dtypes: bool(1), int64(1), object(3)
memory usage: 266.3+ KB


From the summary above, we se that there are no missing values as each feature has exactly 8259 values which is total entries in the dataset

Let's check if there are duplicate usernames

In [27]:
if user_info_df["username"].nunique() == user_info_df.shape[0]:
    print("There are no duplicated usernames")
else:
    print("There are duplicated usernames")

There are duplicated usernames


At this point lets create a function to preprocess the comments

## Data Preprocessing

Lets define functions to clean the dataset

In [28]:
def remove_web_link(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
                              "", text_list[i].strip())
    return " | ".join(text_list)

In [29]:
def remove_directories(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"(/[a-zA-Z0-9_]+)+(/)*(.[a-zA-Z_]+)*",
                              "", text_list[i]).strip()
    return " | ".join(text_list)

In [30]:
def remove_punctuations(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = ''.join([l for l in text_list[i] if l not in string.punctuation])
    return " | ".join(text_list)

In [31]:
def remove_non_alphabets(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"[^a-zA-Z ]", "", text_list[i].strip())
    return " | ".join(text_list)

In [32]:
def remove_unneeded_spaces(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"(\s)+", " ", text_list[i].strip())
    return " | ".join(text_list)

In [33]:
def remove_repeated_sentence(text):
    text_list = text.split("|")
    unique_comment = []
    for comment in text_list:
        if comment.strip() in unique_comment:
            continue
        else:
            unique_comment.append(comment.strip())
    return " | ".join(unique_comment)

In [34]:
def nlp_preprocessing(text):
    text = remove_web_link(text)
    text = remove_directories(text)
    text = remove_punctuations(text)
    text = remove_non_alphabets(text)
    text = remove_unneeded_spaces(text)
    text = remove_repeated_sentence(text)
    return text

## Hand Engineering

Lets check out the unique values in the subreddit feature as well as the count of each value

In [35]:
subreddit_count = user_info_df['subreddit'].value_counts()
subreddit_count

subreddit
Veterinary          6170
MysteriumNetwork     967
medicine             409
HeliumNetwork        400
orchid               303
vet                   10
Name: count, dtype: int64

In [36]:
subreddit_list = list(subreddit_count.index)

Lets explore each of this subreddit categories starting from the least (the bottom)

In [37]:
# get the number of vet subscribers that are in the first dataset

# initialize counter
user_count = 0
# create container for vet subcribers also in the first dataframe
vet_subscribers = []

# for each username who is a subcriber of vet
for user in user_info_df[user_info_df['subreddit'] == "vet"]["username"]:
    # if username is found in table1
    if not user_comment_df[user_comment_df["username"] == user].empty:
        # increment counter by 1
        user_count += 1
        # capture the username
        vet_subscribers.append(user)

print("Vet Subreddit Count")
print("Table1: {}".format(subreddit_count["vet"]))
print(f"Table2: {user_count}")

Vet Subreddit Count
Table1: 10
Table2: 9


One of the suncribers of vet is not in the first dataset

At this point it would be better to combine both dataset into one

Lets do that

In [38]:
reddit_user_df = pd.merge(user_comment_df, user_info_df,
                          on="username", how="left")

In [39]:
reddit_user_df.head()

Unnamed: 0,Unnamed: 0_x,username,comments,Unnamed: 0_y,isused,subreddit,created_at
0,0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on...",0,True,Veterinary,2024-05-02
1,1,wahznooski,"As a woman of reproductive age, fuck Texas|As ...",7,True,Veterinary,2024-05-02
2,2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...,9,True,Veterinary,2024-05-02
3,3,abarthch,"I see of course there are changing variables, ...",1133,True,MysteriumNetwork,2024-05-02
4,4,VoodooKing,I have 412+ and faced issues because wireguard...,1779,False,MysteriumNetwork,2024-05-03


In [40]:
reddit_user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0_x  3276 non-null   int64 
 1   username      3275 non-null   object
 2   comments      3276 non-null   object
 3   Unnamed: 0_y  3276 non-null   int64 
 4   isused        3276 non-null   bool  
 5   subreddit     3276 non-null   object
 6   created_at    3276 non-null   object
dtypes: bool(1), int64(2), object(4)
memory usage: 156.9+ KB


Now let us continue with the subreddits

Starting from the bottom and moving up

In [41]:
subreddit_list

['Veterinary',
 'MysteriumNetwork',
 'medicine',
 'HeliumNetwork',
 'orchid',
 'vet']

Let's check out the comments of the vet subscribers

In [42]:
# get list of vet subscribers
vet_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "vet"]["username"].values

# get the comments made by vet subscribers
vet_sub_comments = reddit_user_df[reddit_user_df["subreddit"] == "vet"]["comments"].values

In [43]:
for i in range(len(vet_sub_list)):
    print(f"{vet_sub_list[i]}: {vet_sub_comments[i]}")

test_vet2: The puppy was brought in for its first round of vaccinations.
test_vet3: The adult horse was treated for laminitis.
test_vet4: The juvenile bird was treated for a wing injury.
test_vet5: The senior cat was brought in for a routine health check-up.
test_vet6: I just performed a neutering procedure on a cat.
test_vet7: The dog’s condition is improving after the deworming treatment.
test_vet8: The X-ray showed a fracture in the bird’s wing.
test_vet9: I prescribed flea prevention medication for the puppy.
test_vet: The horse’s blood test revealed signs of equine infectious anemia.


It would be logical for a practicing veterinarian or anyone whose work is related to veterinary to follow "vet" subreddit. This shows closeness to veterinary but does not guarantee being a veterinarian

It can be seen that all these people have just one comment each

Many spoke in 3rd person form which is hard to say if they are doctors or not

Initially, I chose only test_vet6 and test_vet9 to be practicing veterinarian but the firther instruction given has clarified that I should include all

In [44]:
# store usernames of confirmed veterinarians
vet = list(vet_sub_list)

> It is noteworthy that this type of problem is usually solved effectively with labelled dataset

> However with unlabelled data as the one here, hand engineering may be employed to some extent enough to build a model, therafter the model can predict the rest

> That is the approach I wish to employ for this task

Now, unto next subreddit (orchid)

In [45]:
# get list of orchid subscribers
orchid_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "orchid"]["username"].values

# get the comments made by orchid subscribers
orchid_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "orchid"]["comments"]

In [46]:
print(f"Orchid sunscrbers: {orchid_sub_list}")
print(f"Number of orchid subscribers: {len(orchid_sub_list)}")

Orchid sunscrbers: ['Think_Not_Doer' 'Personal-Escape4283' 'timee_bot' 'erlingspaulsen']
Number of orchid subscribers: 4


It would be neccesary to preprocess our data first as the comments of users who subscribe to orchid is a bit much

In [47]:
orchid_sub_comment_list = orchid_sub_comment_list.apply(nlp_preprocessing)

In [48]:
orchid_sub_comment_list.index

Index([205, 241, 306, 1547], dtype='int64')

In [49]:
orchid_sub_comment_list[205]

'Yassss Queen | Yea I am excited about this project too I am happy to see their collaboration with Storj I am still reviewing their whitepaper and comparing it to the Orchid Protocol If any one has any cliffs notes Id appreciate it I am interested in using VPNs for increased DAPP security DDoS attacks | My first observation is that Mysterium highlights the intention of splitting up packets to traverse different paths along the VPN network which protects a user from a malicious node | You can buy on Polygon using Quickswap if you can cheaply get funds onto Polygon'

In [50]:
orchid_sub_comment_list[241]

'Got it But why can I only select or worth of MYST Seems pretty strange imo Why cant we just send however much we like Is there a way to just see our address and send whatever amount we choose | Youre a legend bro Wonder why tf they dont make this accessible seems like a no brainer | Same problem here WTF | UPDATEWow so i fixed it guys I deleted the DNS in my WiFi settings and after a new one was generated I hit apply Then all of the sudden I was back online My question is what in the world caused that to happen I love Mysterium but feel very sketched out yet idk if it was even their fault What do you guys think Im skeptical of reinstalling it again I had a some MYST in my account prior to uninstalling I doubt I would get it back if I reinstalled right Not the end of the world Im just happy to be online again although I missed a lot of meetings this morning Really curious how this happened after enabling the kill switch quitting and uninstalling Lmk what you think | Feels like Ive won 

In [51]:
orchid_sub_comment_list[306]

'View in your timezone March PM UTC | View in your timezone at PM UTC | View in your timezone July th at pm UTC | View in your timezone November UTC until November UTC'

In [52]:
orchid_sub_comment_list[1547]

'Nice If anyone is hosting a Mysterium node and have a spare hard disk its very easy to run a Storj node alongside Mysterium I do this on my Raspberry Pi | My Raspberry Pi is running the Mysterium Rasbian image Then I just followed the Storj installation guide and installed Docker engine downloaded their Docker image and started the Storj storage node inside its own container I just looked up Presearch and it looks like you can run their node through a Docker container so this should be okay to do alongside Mysterium To install the Docker engine on Rasbian you can follow this guideinstallusingtheconveniencescriptThanks for bringing it up as I was not aware of this project Might as well try this myself | Doesnt look like Presearch node is supported on Raspberry Pi yet Hardware Specification Docker image is only for x x based CPUs for now We plan to support ARM Raspberry Pi in the future but were currently focused on fixing bugs and making sure the current platform is robust before addit

Checking the comments of all the 4 people who subscribed to orchid shows none of them is either a practicing doctor or a practicing veterinarian

In [53]:
others = list(orchid_sub_list)

Next subreddit is HeliumNetwork

In [54]:
# get list of orchid subscribers
HeliumNetwork_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "HeliumNetwork"]["username"].values

# get the comments made by orchid subscribers
HeliumNetwork_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "HeliumNetwork"]["comments"]

In [55]:
print(f"HeliumNetwork sunscrbers: {HeliumNetwork_sub_list}")
print(f"Number of HeliumNetwork subscribers: {len(HeliumNetwork_sub_list)}")

HeliumNetwork sunscrbers: ['Best_Bid_9327' 'SoulReaver-SS' 'Imthefatboyb' 'Passi-RVN' 'alexiskef'
 'drunknfoo']
Number of HeliumNetwork subscribers: 6


Thereare 6 sunscribers

Lets proprocess the data of all users who subscribe to HeliumNetwork in order to view it properly

In [56]:
HeliumNetwork_sub_comment_list = HeliumNetwork_sub_comment_list.apply(nlp_preprocessing)

In [57]:
HeliumNetwork_sub_comment_list.index

Index([93, 442, 458, 632, 670, 1504], dtype='int64')

In [58]:
HeliumNetwork_sub_comment_list[93]

'Im getting this too | They just told one of the accounts was compromised They told to not click in any links | Get rid of windows'

In [59]:
HeliumNetwork_sub_comment_list[442]

'Whats the solution to Mysterium VPN wrecking my home internet and making sites All the captchas | They tae cut from every node payment settlement if you havent noticed | Consequently its not worth it for me due to the damage it causes to my home internet Even when you ONLY enable BB approved partner vpn thiss a problem'

In [60]:
HeliumNetwork_sub_comment_list[458]

'Just did the update to my Pi node and my earnings dropped off almost completely so maybe fixing this could improve my experience'

In [61]:
HeliumNetwork_sub_comment_list[632]

'i dont get what you wanna say but thx | i hoped you explained it more please i dont know what you mean | sounds great my guess then it will be in the same region or near the previous region if none in the previous region isnt available | i just downloaded it paid dollars and now see that there i NO auto reconnect in the settings in the latest update i have version just a skill switch but no auto reconnect'

In [62]:
HeliumNetwork_sub_comment_list[670]

'What are you guys planning to do marketing wise How are you planning to raise awareness about Mysterium | Obvious scam However I would guess that Mysterium users are surely on a techicall level more than capable of seeing through such bullshit and NOT providing their Private Key or mnemonic phraseSCAM'

In [63]:
HeliumNetwork_sub_comment_list[1504]

'Was averaging and MYST After three days of mainnet total about MYST and the other node nothing Not worth bothering for me anymore until something changes'

Everone who subscribes to HeliumNetwork belongs to the others category (None of them is perceved to be a practicing doctor or veterinarian)

In [64]:
others.extend(list(HeliumNetwork_sub_list))

Lets take alook at the medicine subreddit

In [65]:
# get list of orchid subscribers
medicine_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "medicine"]["username"].values

# get the comments made by orchid subscribers
medicine_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "medicine"]["comments"]

In [66]:
print(f"medicine subscrbers: {medicine_sub_list}")
print(f"Number of medicine subscribers: {len(medicine_sub_list)}")

medicine subscrbers: ['test_doctor2' 'test_doctor3' 'test_doctor4' 'test_doctor5'
 'test_doctor6' 'test_doctor7' 'test_doctor8' 'test_doctor1']
Number of medicine subscribers: 8


Thereare 8 subscribers

Lets proprocess the data of all users who subscribe to medicine in order to view it properly

In [67]:
medicine_sub_comment_list = medicine_sub_comment_list.apply(nlp_preprocessing)

In [68]:
for i in range(len(medicine_sub_list)):
    print(f"{medicine_sub_list[i]}: {medicine_sub_comment_list.values[i]}")

test_doctor2: The elderly man is recovering from hip replacement surgery
test_doctor3: The teenage boy was treated for a sports injury
test_doctor4: The woman is expecting a baby and visited for a prenatal checkup
test_doctor5: I just performed an appendectomy on a patient
test_doctor6: The patients blood pressure is stabilizing after the medication
test_doctor7: The MRI scan revealed a tumor in the patients brain
test_doctor8: I prescribed antibiotics for the patients bacterial infection
test_doctor1: The patients EKG showed signs of a possible heart attack


Just like the vet subreddit, most of the users in this category speak in third person form which makes it hard to say if they are really practicing doctor or a medical practitioner like nurse student

However, the further instruction clarified this and all of these subscribers would be classified as doctors

In [69]:
doctors = list(medicine_sub_list)

Lets take alook at the next subreddit, MysteriumNetwork

In [70]:
# get list of MysteriumNetwork subscribers
MysteriumNetwork_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "MysteriumNetwork"]["username"].values

# get the comments made by MysteriumNetwork subscribers
MysteriumNetwork_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "MysteriumNetwork"]["comments"]

In [71]:
print(f"Number of MysteriumNetwork subscribers: {len(MysteriumNetwork_sub_list)}")

Number of MysteriumNetwork subscribers: 967


There are quite a lot of users who subscribe to MysteriumNetwork

Nevertheless we still need to preprocess the comments

In [72]:
MysteriumNetwork_sub_comment_list = MysteriumNetwork_sub_comment_list.apply(nlp_preprocessing)

Lets take a look at some of the comment made by subscribers of MysteriumNetwork

In [73]:
MysteriumNetwork_sub_comment_list[MysteriumNetwork_sub_comment_list.index[235]]

'As far as I can tell the fees change during the day from oh well to wtf quite easily Just wait a bit if theyre too high settlement fee is indeed unreasonably high | Just tried a withdrawal Myst via Polygon I received xxxSo gas was very low today | My connections are from all over the world Quite fun to see I had never ever a connection that provided more than about MYST The longest was days straight with around MB'

Lets take alook at the last subreddit also, Veterinary

In [74]:
# get list of Veterinary subscribers
Veterinary_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "Veterinary"]["username"].values

# get the comments made by Veterinary subscribers
Veterinary_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "Veterinary"]["comments"]

In [75]:
print(f"Number of Veterinary subscribers: {len(Veterinary_sub_list)}")

Number of Veterinary subscribers: 2282


There are quite a lot of users who subscribe to Veterinary also

Lets first preprocess the comments

In [76]:
Veterinary_sub_comment_list = Veterinary_sub_comment_list.apply(nlp_preprocessing)

In [77]:
Veterinary_sub_comment_list[Veterinary_sub_comment_list.index[3]]

'Contrary to employers belief at will does not actually mean you can be fired for any reason with no consequences It may be worth contacting a lawyer for your case and see if you have any grounds | This And now shes an easy scapegoat because they fired her | I wish I had better advice for you but Im sure youve heard all of the basics like setting a routine going to bed earlier etc Hang in there | Well thats a bit rude There are plenty of people who manage to have healthy routine and outside lives while going through med Why is it wrong to ask for advice to balance your mental health and livestyle under a heavy workload and why does it upset you | So you think people shouldnt ask for advice to make things better for themselves They should just ignore their issues bottle it up expect it all goes away I can see that method has made you a healthy and positive person Please choose to be quiet next time rather than make things worse for someone looking for help We dont need it You arent help

Before building the model, it would be neccesary to also manually label some samples of medical school or vet students as Others so that the model can learn this also and not classify them otherwise

Lets start by searching for medical student

Lets get 10 comments which includes keywords that may show someone is a medical student

In [78]:
search_keys = ["school", "resident", "undergrad"]
returned_username = []
returned_indices = []
count = 0

In [79]:
for comments, ind, username in zip(Veterinary_sub_comment_list, Veterinary_sub_comment_list.index, Veterinary_sub_list):
    found = False
    if count == 10:
        break
        
    for comment in comments.split("|"):
        for word in comment.split(" "):
            if word.lower() in search_keys:
                returned_indices.append(ind)
                returned_username.append(username)
                found = True
                count += 1
                break
        if found == True:
            break

Lets take a look at the 10 comments and label them

In [80]:
i = 0     # first user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
queerofengland

Comments:
Contrary to employers' belief, at will does not actually mean you can be fired for any reason with no consequences. It may be worth contacting a lawyer for your case and see if you have any grounds|This. And now she's an easy scapegoat because they fired her|I wish I had better advice for you, but I'm sure you've heard all of the basics like setting a routine, going to bed earlier, etc. Hang in there!|Well that's a bit rude. There are plenty of people who manage to have healthy routine and outside lives while going through med/vet/etc school. Why is it wrong to ask for advice to balance your mental health and livestyle under a heavy workload, and why does it upset you?|So you think people shouldn't ask for advice to make things better for themselves? They should just ignore their issues, bottle it up, expect it all goes away? I can see that method has made you a healthy and positive person 🙄. 

Please choose to be quiet next time rather than make thi

In [81]:
# label this user as Others
# this user is a data analyst applying are work in a vet school
others.append(returned_username[i])

In [82]:
i = 1     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
paxbanana0

Comments:
Those are long, probably stressful days at work. You probably use up all your energy there and have nothing left over. Can you plan to take a day (or week) off? Even using sick time to rest and recharge may be worthwhile. I also recommend labwork just to be safe if you can afford it.|I can only speak for my clinic, as a GP. I always do all of the above. The most important first step is making sure the ER can take my patient. I hate sitting in hold for 15+ minutes with the ER; the hold music is the worst. But it has to be done.|I worked 8-7 straight today. I feel like we have remained pretty busy with intermittent days that aren’t as slammed. Seems like we’re making as much if not more than last year this time too.|My worst experience was with a human dentist. I recommended a dental with likely extractions for his old small breed dog with Perio 4, and he told me that it’s ridiculous to recommend extractions for teeth with just recession. I told him I’d ne

In [83]:
# label this user as doctor
doctors.append(returned_username[i])

In [84]:
i = 2     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
Most-Exit-5507

Comments:
In high school I found a youtube channel called Vet Ranch that showed a lot of interesting surgeries and procedures. I kind of forced myself to watch the surgeries to get desensitized lol but eventually I stopped flinching and getting squeamish! Also, I learnt that I get grossed out more from human surgeries rather than animal surgeries, maybe cuz I imagine it happening to me idk.|In high school I found a youtube channel called Vet Ranch that showed a lot of interesting surgeries and procedures. I kind of forced myself to watch the surgeries to get desensitized lol but eventually I stopped flinching and getting squeamish! Also, I learnt that I get grossed out more from human surgeries rather than animal surgeries, maybe cuz I imagine it happening to me idk.|In high school I found a youtube channel called Vet Ranch that showed a lot of interesting surgeries and procedures. I kind of forced myself to watch the surgeries to get desensitized lol but even

In [85]:
# label this user as Others
others.append(returned_username[i])

In [86]:
i = 3     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
Frizzyawkward

Comments:
I’m a soon to be veteran looking to get into vet school. I’ll be doing all my prereqs at UTK and hoping for the best when I apply. 
Any tips for getting volunteer hours? I had a bunch in high school about 6 years ago now through my schools veterinary assistance program but I have no contact with anyone that would verify that. Besides maybe the classes on my highschool transcripts I wouldn’t really have “proof”. 
I’ve been active duty these 6 years and I know I’m seriously behind other candidates but don’t want to give up 😭 Any tips in general for a late starter?


In [87]:
# label this user as Others
# this user ia a vet student
others.append(returned_username[i])

In [88]:
i = 4     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
nan

Comments:
[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[removed]|Can I ask a question about really basic vetmed certification? I’m in an area that has a serious shortage of emergency trained vets, so much so that there’s been a pivot to regular vets not doing emergency triage, and not being able to recognize emergencies. 

Is there a basic certification that’s available so that pet owners can know when it’s time for the ER?|[deleted]|[deleted]|I agree with some of the below threads. Pay varies from state and I’ve also found big cities tend to pay more than hospitals in burbs or rural areas. For instance, I’m not certified and as a tech in Boston, MA I make $27/hour but in Chicago, IL I made $23/hour. That being said, I live with my boyfriend and having dual incomes is honestly the only way I can afford to live.

I know moving for a job is a big thing consider but maybe not a bad idea to see what’s out there. I’ve also learned to not be afraid to 

In [89]:
# label this user as others
# this user ia a vet student or graduate from the statement "I am not a vet and I make more than our new vets do" 
others.append(returned_username[i])

In [90]:
i = 5     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
matcha-fiend

Comments:
vet assistant making $19/hr while doing straight up tech work in California. target workers make more than me. I don’t even have time to try to finish school with how stressed I am to pay the bills ): feels so futile sometimes


In [91]:
# label this user as others
# this user ia a vet assistant 
others.append(returned_username[i])

In [92]:
i = 6     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
ARatNamedClydeBarrow

Comments:
As a VA, this makes me so sad for you. Yeah it takes you guys some time to hit your stride, but that’s the same for *everyone*. I would never dream of speaking to a DVM that way. Voicing concerns in a respectful manner and having a discussion is one thing, but to talk to you with such derision is absolutely not okay. It’s far past time for a chat with their supervisor / head doc / PM.

I love my new grad doctors so much! They always like to talk things out with me (I love hearing the trains of thought) and even sometimes ask my opinion if it’s not something they’re super familiar with, but is something I’ve seen before. They love to teach us things they learned in school, it helps solidify concepts for them *and* we probably get to learn something new. I’m losing one of my new grads and I’m super broken up about it, I switched my dog to her when she started and now he doesn’t have a vet.

Some advice for emergencies: even if you’re not comforta

In [93]:
# label this user as doctor
doctors.append(returned_username[i])

In [94]:
i = 7     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
agirlwhowaited

Comments:
As someone who’s worked as an assistant/tech with many vets, I would NEVER call a vet by their first name unless they had specifically asked me to. You worked hard for your title and deserve respect if you are giving it to them. I’ve unfortunately worked with a lot of know-it-all colleagues who think they know more than seasoned vets, most of us see right through it but it doesn’t make them any easier to work with.|Yes this is a good point! I don’t think any students have graduated from the full program yet so I’m curious if they’re on track to finish their PhD in 4 years|Thank you! And yes doing the PhD with residency would be a back up plan, doing the dual degree is preferable for me with the tuition waver as an OOS|I haven’t been accepted to vet school yet! Still waiting to hear! But I’d be happy to chat with you about why I want to go this route. Congrats on your acceptance!|The website actually details that the PhD portion is three years. But th

In [95]:
# label this user as other
# this user is a prospective vet student
others.append(returned_username[i])

In [96]:
i = 8     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
Difficult_Ad_8152

Comments:
If you enjoy it, definitely give it a go! Vet med is difficult and has its ups and downs: try to hold on to the things that bring u up|Tell them you’ll Leave the clinic if you’re not treated with respect and find a good one that treats you with respect if they don’t change… you didn’t spend so much of your life becoming a vet to be: treated poorly or: to quit because of one shit clinic… you’re more resilient than that cause u were capable of getting through vet school


In [97]:
# label this user as other
# not enough info from the comments
others.append(returned_username[i])

In [98]:
i = 9     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

Username:
Cheeztitts

Comments:
I have the same irrational fear and I applied for vet school so it’s totally doable. Don’t let fear control you since it’s going to be very unlikely we’ll run into a rabid animal. I’m still unvaccinated but Ill probably get the rabies vaccine and do yearly titers so that I have peace of mind. My vet said she still has valid rabies titers even after 20+ years after her first vaccine… pretty crazy! I’m glad I found someone else with the same experience as me!!|What are some small animal zoonotic diseases? That way I can properly protect myself! Thanks (:


In [99]:
# label this user as other
# this user is a prospective vet student
others.append(returned_username[i])

Lets get 10 comments (different from the 10 above) which includes keywords that may show someone is a vet student

In [100]:
search_keys = ["school", "resident", "undergrad", "vet"]
vet_returned_username = []
vet_returned_indices = []
count = 0

In [101]:
for comments, ind, username in zip(Veterinary_sub_comment_list, Veterinary_sub_comment_list.index, Veterinary_sub_list):
    found = False
    if count == 10:
        break
        
    for comment in comments.split("|"):
        for word in comment.split(" "):
            if (word.lower() in search_keys) and (username not in returned_username):
                vet_returned_indices.append(ind)
                vet_returned_username.append(username)
                found = True
                count += 1
                break
        if found == True:
            break

Lets take a look at the 10 stes of comments and label them

In [102]:
i = 0     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
Churro_The_fish_Girl

Comments:
what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?|what makes you want to become a vet?


In [103]:
# label this user as other
# no enough info from user comments
others.append(vet_returned_username[i])

In [104]:
i = 1     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
theophania808

Comments:
I've worked in the vet industry for years and almost everyone had tattoos. I'm covered with tattoos and never had a problem only with VCA. If you ever apply there (I recommend you don't, they suck) they have a tattoo policy.


In [105]:
# label this user as vet
vet.append(vet_returned_username[i])

In [106]:
i = 2     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
Shemoose

Comments:
Vet tech since the same and I'm on 20 euro a hr.|I sent this to mu cop friend


In [107]:
# label this user as vet
# this user is a vet tech
others.append(vet_returned_username[i])

In [108]:
i = 3     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
Environmental-Snow29

Comments:
I had a retired surgeon admit to me he "accidentally" killed his daughters hamster, trying to remove a mass with lidocaine and nothing else. 

My sister  is a nurse who worked one of the surgical units at johns hopkins. She reported to me that the surgeons regularly complained about vet costs and freely admit they dont take their dogs for anything unless they have to, and try to treat at home. 


I have limited respect for certain humam medicine people....|I second Ross!!!!! I loved my experience on the island. It's just very overpriced. Clinics was at LSU... i hated it. All i can say is thank god ross sent like 15 of us rossies during my semester. Couldn't have made it through LSU without them. I liked the farm animal team and exotics team at LSU. Equine and internal med made me want to throw myself into the missisippi and consume the oleander they have on the campus. ER LSU peeps were great, too. I will say i did appreciate that they made sur

In [109]:
# label this user as vet
# according to the phrase "Im like actually a pretty good vet. And I mostly enjoy my job"
vet.append(vet_returned_username[i])

In [110]:
i = 4     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
wHetcatfood

Comments:
“All Dogs Go To Kevin” by Dr. Jessica Vogelsang has always been a favorite for me. She narrates her own audiobook too so that’s a good listen as well :) 

Also books I’ve received that I haven’t read fully or are still on my TBR:
- The Vet at Noah’s Ark by Dr. Doug Mader
- What It Takes to Save a Life by Dr. Kwane Stewart
- The Other Family Doctor by Karen Fine, DVM


In [111]:
# label this user as other
others.append(vet_returned_username[i])

In [112]:
i = 5     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
spaghetti000s

Comments:
Steps to becoming a Radiologist:

Vet school:
x. Be within the top 1/3 of your class in Vet school (4 years)
x. Try and get published while in vet school (even just a case report is incredibly helpful)
x. Get to know your radiology department in vet school; if at all possible try to get a student worker position in the department. Ask questions, don't be annoying, read Thrall textbook as often as you can

Post Vet School:
x. Rotating internship first, choose one that has at least 2 board certified radiologists (need at least two letters of rec from them)
x. if you didn't get published in vet school, you need to do it during the rotating internship
x. Read Thrall on your downtime

Post Rotating Internship:
x. Apply for both specialty radiology imaging internships (1yr) and residencies (3-4yrs) in the same Match cycle; if you don't match to a residency first try (quite rare), hopefully you'll match to an imaging internship. Get published during this int

In [113]:
# label this user as other
# this user is most like a student
others.append(vet_returned_username[i])

In [114]:
i = 6     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
almostdonestudent

Comments:
I was a tech on and off for years. I would never dream of talking to a vet like that. Unfortunately I've worked in some toxic clinics and it sounds like you found one. I would go to the higher ups and I would can the technicians out when they say rude things. They aren't your boss, you're the doctor.


In [115]:
# label this user as vet
# this user is most a veterinarian
vet.append(vet_returned_username[i])

In [116]:
i = 7     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
qwertyculous

Comments:
Quizlet was my best friend. I was on that thing all the time. Vague terms like "canine respiratory" will get you things, course numbers will get you past or current student sets (VMED 1672). I hardly ever made my own study guides, I was just a vulture for other people's. 
It might not be on the test you're taking, if it's some guy from Ohio's material, but if you're confused about a concept, someone's probably written it in a better way. 

Also youtube videos on like, arrythmias and stuff. There's always going to be quacks on the internet acting like they've been to school, but actual accomplished people also make videos. https://youtu.be/6dp8mN9pRik?si=ME35hxDme6ZOUAdV You'll have to sort through the bullshit to actually find good videos, but scooting marketing and ego garbage to the side for science is basically half of being a licensed vet.


In [117]:
# label this user as vet
# this user is most likely a veterinarian
vet.append(vet_returned_username[i])

In [118]:
i = 8     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
i-touched-morrissey

Comments:
You might try something else as a career. When you get out into the real world, things might get stressful. My mental health was shit when I was in vet school, only I thought everyone was as stressed out as I was. I didn't get help until my dad committed suicide when I suspected that there was something wrong with me. 

As a practicing vet now, I obsess about what people think, what I said, I worry that if I can't fix something that I suck and no one will want to come see me again. Some days I wish I was a kindergarten teacher, but then I would worry about pissing off someone's parents. I always wanted to be a pathologist, but never made it further than vet school, but it is ideal for an introvert who doesn't feel comfortable speaking to people.|I have been in practice for 31 years and there are definitely times when I wish I was something else. Interacting with people is the most difficult part of my job, and a lot of us vets are high-achieving

In [119]:
# label this user as vet
# this user is a vet
vet.append(vet_returned_username[i])

In [120]:
i = 9     # second user comments cotaining any of the keywords
# the last zero was because the line below would return the comments in a list (having squre barckets) without it
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

Username:
Popular_Hour6343

Comments:
I didn't know. Someone randomly mentioned vet school in convo and I never considered it until then (after 1st year of undergrad). I decided to job shadow that summer (4 months) and get a feel for it. Looking back, I wasn't really exposed to a whole lot but generally I enjoyed my time job shadowing. 

I pursued vet school, got in, started, and thats where i really started to ask myself if this is what I want to do. I started to wonder if I can actually do this, if I'm cut out for it, if I'm actually going to be able to enjoy it, because its one thing to watch a vet work, but a whole other thing to be the vet working. 

Anyways, I just went with the flow of things hoping that I can do it and I would enjoy it, and when i got to my clinical year, that's when I knew. A big part of the doubt was having confidence to do the things that vets do, and basically the more exposure and practice I got, the more my confidence built and the more I was able to enjo

In [121]:
# label this user as other
# this user is about to write NAVLE (North American Veterinary Licensing Examination)
others.append(vet_returned_username[i])

In [122]:
print(vet)
print()
print(others)
print()
print(doctors)

['test_vet2', 'test_vet3', 'test_vet4', 'test_vet5', 'test_vet6', 'test_vet7', 'test_vet8', 'test_vet9', 'test_vet', 'theophania808', 'Environmental-Snow29', 'almostdonestudent', 'qwertyculous', 'i-touched-morrissey']

['Think_Not_Doer', 'Personal-Escape4283', 'timee_bot', 'erlingspaulsen', 'Best_Bid_9327', 'SoulReaver-SS', 'Imthefatboyb', 'Passi-RVN', 'alexiskef', 'drunknfoo', 'queerofengland', 'Most-Exit-5507', 'Frizzyawkward', nan, 'matcha-fiend', 'agirlwhowaited', 'Difficult_Ad_8152', 'Cheeztitts', 'Churro_The_fish_Girl', 'Shemoose', 'wHetcatfood', 'spaghetti000s', 'Popular_Hour6343']

['test_doctor2', 'test_doctor3', 'test_doctor4', 'test_doctor5', 'test_doctor6', 'test_doctor7', 'test_doctor8', 'test_doctor1', 'paxbanana0', 'ARatNamedClydeBarrow']


The usernames labelled in sample labelling file would also be added with their labels

In [123]:
others.append("--solaris--")

vet.extend(["100realtx", "3_Black_Cats"])

The labels would now be added to the dataframe

In [124]:
reddit_user_df["Label"] = None

In [125]:
reddit_user_df.head()

Unnamed: 0,Unnamed: 0_x,username,comments,Unnamed: 0_y,isused,subreddit,created_at,Label
0,0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on...",0,True,Veterinary,2024-05-02,
1,1,wahznooski,"As a woman of reproductive age, fuck Texas|As ...",7,True,Veterinary,2024-05-02,
2,2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...,9,True,Veterinary,2024-05-02,
3,3,abarthch,"I see of course there are changing variables, ...",1133,True,MysteriumNetwork,2024-05-02,
4,4,VoodooKing,I have 412+ and faced issues because wireguard...,1779,False,MysteriumNetwork,2024-05-03,


In [126]:
reddit_user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0_x  3276 non-null   int64 
 1   username      3275 non-null   object
 2   comments      3276 non-null   object
 3   Unnamed: 0_y  3276 non-null   int64 
 4   isused        3276 non-null   bool  
 5   subreddit     3276 non-null   object
 6   created_at    3276 non-null   object
 7   Label         0 non-null      object
dtypes: bool(1), int64(2), object(5)
memory usage: 182.5+ KB


In [127]:
for username in doctors:
    reddit_user_df.loc[reddit_user_df["username"] == username, "Label"] = "Medical Doctor"

In [128]:
for username in vet:
    reddit_user_df.loc[reddit_user_df["username"] == username, "Label"] = "Veterinarian"

In [129]:
for username in others:
    reddit_user_df.loc[reddit_user_df["username"] == username, "Label"] = "Other"

In [130]:
reddit_user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0_x  3276 non-null   int64 
 1   username      3275 non-null   object
 2   comments      3276 non-null   object
 3   Unnamed: 0_y  3276 non-null   int64 
 4   isused        3276 non-null   bool  
 5   subreddit     3276 non-null   object
 6   created_at    3276 non-null   object
 7   Label         49 non-null     object
dtypes: bool(1), int64(2), object(5)
memory usage: 182.5+ KB


Checking the Label, we see that 50 features has been labelled

These will form our training set

Lets extract the training set as a csv in the requested format in order to get feedback

In [131]:
# get indices of labelled data
doctor_indices_mask = reddit_user_df["Label"] == "Medical Doctor"
vet_indices_mask = reddit_user_df["Label"] == "Veterinarian"
other_indices_mask = reddit_user_df["Label"] == "Other"

labelled_indices_mask = doctor_indices_mask + vet_indices_mask + other_indices_mask

train_set_df = reddit_user_df.loc[labelled_indices_mask, ["username", "comments", "Label"]].copy()

In [132]:
train_set_df.head()

Unnamed: 0,username,comments,Label
2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...,Other
5,queerofengland,"Contrary to employers' belief, at will does no...",Other
7,theophania808,I've worked in the vet industry for years and ...,Veterinarian
9,paxbanana0,"Those are long, probably stressful days at wor...",Medical Doctor
12,Most-Exit-5507,In high school I found a youtube channel calle...,Other


In [133]:
train_set_df.columns = ["Reddit Username", "Reddit Comments", "Label"]

In [134]:
train_set_df.head()

Unnamed: 0,Reddit Username,Reddit Comments,Label
2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...,Other
5,queerofengland,"Contrary to employers' belief, at will does no...",Other
7,theophania808,I've worked in the vet industry for years and ...,Veterinarian
9,paxbanana0,"Those are long, probably stressful days at wor...",Medical Doctor
12,Most-Exit-5507,In high school I found a youtube channel calle...,Other


The column feature would need to be preprocessed else the output csv file would not be properly formatted

In [135]:
train_set_df["Reddit Comments"] = train_set_df["Reddit Comments"].apply(nlp_preprocessing)

In [136]:
train_set_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 49 entries, 2 to 3175
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Reddit Username  49 non-null     object
 1   Reddit Comments  49 non-null     object
 2   Label            49 non-null     object
dtypes: object(3)
memory usage: 1.5+ KB


In [137]:
train_set_df.to_csv("training_set.csv", index=False)

At this point is would be better to build the model on the 4 subreddit we have checked so far then predict the category for the remianing users who subscribe to the remaining 2 categories

## Model Building

For the model building, I would be using glove embedding matrix to embed the words

In [138]:
# loading glove word vectors (words embeddings) into dictionary
embedding_index = {}

with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs

FileNotFoundError: [Errno 2] No such file or directory: 'glove.6B.100d.txt'