# Doctor and Veterinary Classification using NLP

This notebook builds a model that classifies a given set of Reddit users as practicing doctors, practicing veterinarians or others, based on each user's comments

The dataset for this task is sourced from a database whose link is given as

[postgresql://niphemi.oyewole:W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?statusColor=F8F8F8&env=&name=redditors%20db&tLSMode=0&usePrivateKey=false&safeModeLevel=0&advancedSafeModeLevel=0&driverVersion=0&lazyload=false](postgresql://niphemi.oyewole:W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?statusColor=F8F8F8&env=&name=redditors%20db&tLSMode=0&usePrivateKey=false&safeModeLevel=0&advancedSafeModeLevel=0&driverVersion=0&lazyload=false)

However, trying to access the database with the given link results in errors

Therefore, a modified version of the link is used

## Module Importations and Data Retrieval

Before continuing, the required libraries are imported below

In [1]:
import re             # for regex operations
import string         # for removing punctuation
import numpy as np    # for mathematical calculations
import pandas as pd    # for working with structured data (dataframes)
from sqlalchemy import create_engine # for connecting to database
from nltk.tokenize import word_tokenize
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding
from tensorflow.keras.initializers import Constant
from sklearn.model_selection import train_test_split

In [90]:
from tensorflow.keras.preprocessing.text import Tokenizer
from autocorrect import Speller

The modified link to access the database is defined below

In [3]:
# # define the connection link
# conn_str = "postgresql://niphemi.oyewole:endpoint=ep-delicate-river-a5cq94ee-pooler;W7bHIgaN1ejh@ep-delicate-river-a5cq94ee-pooler.us-east-2.aws.neon.tech/Vetassist?sslmode=allow"

# # create connection to the database
# engine =  create_engine(conn_str)

First, let's take a look at the tables in the database

In [4]:
# define sql query for retrieving the tables in the database
sql_for_tables = """
SELECT
    table_schema || '.' || table_name
FROM
    information_schema.tables
WHERE
    table_type = 'BASE TABLE'
AND
    table_schema NOT IN ('pg_catalog', 'information_schema');
"""

In [5]:
# # retrieve the tables in a dataframe
# tables_df = pd.read_sql_query(sql_for_tables, engine)

In [6]:
# tables_df

There are two tables in the database as shown above

Each table would be saved in a pandas dataframe

In [7]:
sql_for_table1 = """
SELECT
    *
FROM
    public.reddit_usernames_comments;
"""

> Note: The code below may take a while to run. If it fails, reconnect the engine above then rerun the cell

In [8]:
# user_comment_df = pd.read_sql_query(sql_for_table1, engine)

Let's save the table as a CSV file

In [9]:
user_comment_df = pd.read_csv("reddit_usernames_comments.csv")

In [10]:
# user_comment_df.to_csv('reddit_usernames_comments.csv', index=False)

In [11]:
sql_for_table2 = """
SELECT
    *
FROM
    public.reddit_usernames;
"""

> Note: The code below may take a while to run. If it fails, reconnect the engine above then rerun the cell

In [12]:
# user_info_df = pd.read_sql_query(sql_for_table2, engine)

In [13]:
user_info_df = pd.read_csv("reddit_usernames.csv")

Let's save the table as a CSV file

In [14]:
# user_info_df.to_csv('reddit_usernames.csv', index=False)

Let's take a look at the tables one after the other

In [15]:
user_comment_df.head()

Unnamed: 0.1,Unnamed: 0,username,comments
0,0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on..."
1,1,wahznooski,"As a woman of reproductive age, fuck Texas|As ..."
2,2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...
3,3,abarthch,"I see of course there are changing variables, ..."
4,4,VoodooKing,I have 412+ and faced issues because wireguard...


In [16]:
user_comment_df = user_comment_df.drop(columns="Unnamed: 0")

In [17]:
user_comment_df.head()

Unnamed: 0,username,comments
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on..."
1,wahznooski,"As a woman of reproductive age, fuck Texas|As ..."
2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...
3,abarthch,"I see of course there are changing variables, ..."
4,VoodooKing,I have 412+ and faced issues because wireguard...


In [18]:
user_comment_df.shape

(3276, 2)

In [19]:
user_info_df.head()

Unnamed: 0.1,Unnamed: 0,username,isused,subreddit,created_at
0,0,LoveAGoodTwist,True,Veterinary,2024-05-02
1,1,drawntage,True,Veterinary,2024-05-02
2,2,LinkPast84,True,Veterinary,2024-05-02
3,3,heatthequestforfire,True,Veterinary,2024-05-02
4,4,Most-Exit-5507,True,Veterinary,2024-05-02


In [20]:
user_info_df = user_info_df.drop(columns="Unnamed: 0")

In [21]:
user_info_df.head()

Unnamed: 0,username,isused,subreddit,created_at
0,LoveAGoodTwist,True,Veterinary,2024-05-02
1,drawntage,True,Veterinary,2024-05-02
2,LinkPast84,True,Veterinary,2024-05-02
3,heatthequestforfire,True,Veterinary,2024-05-02
4,Most-Exit-5507,True,Veterinary,2024-05-02


In [22]:
user_info_df.shape

(8259, 4)

## Data Exploration

The first table (now a dataframe) contains usernames and their comments

Let's look at a comment in order to understand how it is structured

In [23]:
# print all comments for first user
user_comment_df["comments"][0]

'Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a wee

In [24]:
# split comments into individual comments
first_comments = user_comment_df["comments"][0].split("|")

# get the number of comments for first user
len(first_comments)

16

In [25]:
# remove repeated comments
unique_comment = []
for comment in first_comments:
    if comment in unique_comment:
        continue
    else:
        unique_comment.append(comment)

In [26]:
print(f"Length of unique comments for first user: {len(unique_comment)}")
print()
print(unique_comment)

Length of unique comments for first user: 1

['Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.']


It can be seen that the `comments` column contains multiple comments separated with "|"

It can also be seen that there are repeated comments
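The de-duplication loop above can also be written more compactly. The following is a minimal sketch, assuming a hypothetical helper name `unique_in_order` and toy input; it relies on the fact that dictionary keys are unique and keep insertion order, and it strips surrounding whitespace before comparing, as `remove_repeated_sentence` does later in this notebook.

```python
def unique_in_order(comments):
    """Drop repeated comments while keeping first-seen order."""
    # dict keys are unique and preserve insertion order (Python 3.7+)
    return list(dict.fromkeys(comment.strip() for comment in comments))


# toy input, not taken from the dataset
raw = "good point|hard no|good point| hard no |thanks"
deduplicated = unique_in_order(raw.split("|"))
print(deduplicated)  # ['good point', 'hard no', 'thanks']
```

This avoids the repeated `in` check against a growing list, which matters little for 16 comments but adds up for users with many comments.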

Let's check for missing values

In [27]:
user_comment_df.isna().sum()

username    1
comments    0
dtype: int64

In [28]:
user_comment_df[user_comment_df["username"].isna()]

Unnamed: 0,username,comments
23,,[deleted]|[deleted]|[deleted]|[deleted]|[delet...


In [29]:
# assign through .loc (row label, column name) to avoid chained assignment
user_comment_df.loc[23, "username"] = "None"

In [30]:
user_comment_df.iloc[23]

username                                                 None
comments    [deleted]|[deleted]|[deleted]|[deleted]|[delet...
Name: 23, dtype: object

In [31]:
user_comment_df.iloc[23]["comments"]

'[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[removed]|Can I ask a question about really basic vetmed certification? I’m in an area that has a serious shortage of emergency trained vets, so much so that there’s been a pivot to regular vets not doing emergency triage, and not being able to recognize emergencies. \n\nIs there a basic certification that’s available so that pet owners can know when it’s time for the ER?|[deleted]|[deleted]|I agree with some of the below threads. Pay varies from state and I’ve also found big cities tend to pay more than hospitals in burbs or rural areas. For instance, I’m not certified and as a tech in Boston, MA I make $27/hour but in Chicago, IL I made $23/hour. That being said, I live with my boyfriend and having dual incomes is honestly the only way I can afford to live.\n\nI know moving for a job is a big thing consider but maybe not a bad idea to see what’s out there. I’ve also learned to not be afraid to advocate for yoursel

In [32]:
user_comment_df[user_comment_df["username"] == "None"]

Unnamed: 0,username,comments
23,,[deleted]|[deleted]|[deleted]|[deleted]|[delet...


There are no missing values left

Let's check if there are duplicate usernames

In [33]:
if user_comment_df["username"].nunique() == user_comment_df.shape[0]:
    print("There are no duplicated usernames")
else:
    print("There are duplicated usernames")

There are no duplicated usernames


Let's explore the second dataframe also

In [34]:
user_info_df.head()

Unnamed: 0,username,isused,subreddit,created_at
0,LoveAGoodTwist,True,Veterinary,2024-05-02
1,drawntage,True,Veterinary,2024-05-02
2,LinkPast84,True,Veterinary,2024-05-02
3,heatthequestforfire,True,Veterinary,2024-05-02
4,Most-Exit-5507,True,Veterinary,2024-05-02


In [35]:
user_info_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8259 entries, 0 to 8258
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   username    8258 non-null   object
 1   isused      8259 non-null   bool  
 2   subreddit   8259 non-null   object
 3   created_at  8259 non-null   object
dtypes: bool(1), object(3)
memory usage: 201.8+ KB


From the summary above, every feature has 8259 non-null values (the total number of entries in the dataset) except username, which has 8258, so exactly one username is missing

Let's check if there are duplicate usernames

In [36]:
if user_info_df["username"].nunique() == user_info_df.shape[0]:
    print("There are no duplicated usernames")
else:
    print("There are duplicated usernames")

There are duplicated usernames


In [37]:
user_info_df["username"].nunique()

8258

In [38]:
user_info_df.shape[0]

8259

In [39]:
user_info_df[user_info_df["username"].duplicated()]

Unnamed: 0,username,isused,subreddit,created_at
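The empty result above suggests that the mismatch between `nunique()` and the row count comes from the missing username rather than from a real duplicate, since `nunique()` ignores missing values by default. The small check below illustrates that behaviour on toy data (the usernames are made up):

```python
import numpy as np
import pandas as pd

# toy data with one missing username and no repeated names
usernames = pd.Series(["alice", "bob", np.nan])

n_unique_default = usernames.nunique()             # NaN is not counted
n_unique_with_nan = usernames.nunique(dropna=False)  # NaN counted as a value
n_duplicated = int(usernames.duplicated().sum())   # repeated entries
n_missing = int(usernames.isna().sum())            # missing entries

print(n_unique_default, n_unique_with_nan, n_duplicated, n_missing)  # 2 3 0 1
```

A duplicate check that is robust to missing values could therefore use `duplicated().any()` directly, or compare against `nunique(dropna=False)`.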


At this point let's create functions to preprocess the comments

## Data Preprocessing

Let's define functions to clean the dataset

In [40]:
def remove_web_link(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
                              "", text_list[i].strip())
    return " | ".join(text_list)

In [41]:
def remove_directories(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"(/[a-zA-Z0-9_]+)+(/)*(.[a-zA-Z_]+)*",
                              "", text_list[i]).strip()
    return " | ".join(text_list)
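One detail of the pattern in `remove_directories` is worth checking: the `.` in the last group is unescaped, so it matches any character, not only a literal dot. The sketch below compares the pattern as written with a variant that escapes the dot. The escaped variant is only an assumption about what may have been intended, and the sample string is adapted from the first user's comment.

```python
import re

# pattern exactly as used in remove_directories
original_pattern = r"(/[a-zA-Z0-9_]+)+(/)*(.[a-zA-Z_]+)*"
# assumption: a literal dot (file extension) was intended in the last group
escaped_pattern = r"(/[a-zA-Z0-9_]+)+(/)*(\.[a-zA-Z_]+)*"

sample = "bonuses/production which was $20k"

with_original = re.sub(original_pattern, "", sample)
with_escaped = re.sub(escaped_pattern, "", sample)

print(repr(with_original))  # 'bonuses $20k'
print(repr(with_escaped))   # 'bonuses which was $20k'
```

With the unescaped dot, the words following a slash-separated token are consumed as well, so some ordinary text is removed together with the path-like fragment.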

In [42]:
def remove_deleted_comments(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"\[deleted\]", "", text_list[i].strip())
    return " | ".join(text_list)

In [43]:
def remove_punctuations(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = "".join([ch if ch not in string.punctuation else " " for ch in text_list[i]])
        #text_list[i] = ''.join([l for l in text_list[i] if l not in string.punctuation])
    return " | ".join(text_list)

In [44]:
def remove_non_alphabets(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"[^a-zA-Z ]", "", text_list[i].strip())
    return " | ".join(text_list)

In [None]:
# def autocorrect_spelling(text):
#     spell = Speller()
#     text_list = text.split("|")
#     for i in range(len(text_list)):
#         text_list[i] = spell(text_list[i])
#     return " | ".join(text_list)

In [45]:
def remove_unneeded_spaces(text):
    text_list = text.split("|")
    for i in range(len(text_list)):
        text_list[i] = re.sub(r"(\s)+", " ", text_list[i].strip())
    return " | ".join(text_list)

In [46]:
def remove_repeated_sentence(text):
    text_list = text.split("|")
    unique_comment = []
    for comment in text_list:
        if comment.strip() in unique_comment:
            continue
        else:
            unique_comment.append(comment.strip())
    return " | ".join(unique_comment)

In [98]:
def nlp_preprocessing(text):
    text = remove_web_link(text)
    text = remove_directories(text)
    text = remove_deleted_comments(text)
    text = remove_punctuations(text)
    text = remove_non_alphabets(text)
    # text = autocorrect_spelling(text)
    text = remove_unneeded_spaces(text)
    text = remove_repeated_sentence(text)
    text = text.lower()
    return text

## Hand Engineering

Let's check out the unique values in the subreddit feature as well as the count of each value

In [48]:
subreddit_count = user_info_df['subreddit'].value_counts()
subreddit_count

subreddit
Veterinary          6170
MysteriumNetwork     967
medicine             409
HeliumNetwork        400
orchid               303
vet                   10
Name: count, dtype: int64

In [49]:
subreddit_list = list(subreddit_count.index)

Let's explore each of these subreddit categories starting from the least (the bottom)

In [50]:
# get the number of vet subscribers that are in the first dataset

# initialize counter
user_count = 0
# create container for vet subscribers also in the first dataframe
vet_subscribers = []

# for each username who is a subscriber of vet
for user in user_info_df[user_info_df['subreddit'] == "vet"]["username"]:
    # if username is found in table1
    if not user_comment_df[user_comment_df["username"] == user].empty:
        # increment counter by 1
        user_count += 1
        # capture the username
        vet_subscribers.append(user)

print("Vet Subreddit Count")
print("Table1: {}".format(subreddit_count["vet"]))
print(f"Table2: {user_count}")

Vet Subreddit Count
Table1: 10
Table2: 9


One of the subscribers of vet is not in the first dataset
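The same membership check can be done without an explicit loop. This is a minimal sketch on toy frames (the usernames and the names `info`, `comments`, `found` and `missing` are made up for illustration), using `Series.isin` to test which subscribers also appear in the comments table:

```python
import pandas as pd

# toy stand-ins for the two tables
info = pd.DataFrame({
    "username": ["u1", "u2", "u3"],
    "subreddit": ["vet", "vet", "orchid"],
})
comments = pd.DataFrame({"username": ["u1", "u3"], "comments": ["a", "b"]})

# usernames subscribed to "vet" in the info table
vet_users = info.loc[info["subreddit"] == "vet", "username"]
# boolean mask: is each of them present in the comments table?
in_comments = vet_users.isin(comments["username"])

found = vet_users[in_comments].tolist()
missing = vet_users[~in_comments].tolist()
print(found, missing)  # ['u1'] ['u2']
```

Besides being shorter, this also returns the subscribers that are absent from the comments table, which the counter-based loop does not report.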

At this point it would be better to combine both datasets into one

Let's do that

In [51]:
reddit_user_df = pd.merge(user_comment_df, user_info_df,
                          on="username", how="left")
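A left merge keeps every row of the comments table, including usernames that have no match in the info table. As a sketch on toy data (the frames and usernames are hypothetical), the `indicator=True` option of `pd.merge` adds a `_merge` column that makes such unmatched rows easy to list:

```python
import pandas as pd

# toy stand-ins for the two tables
comments = pd.DataFrame({"username": ["u1", "u2"], "comments": ["a", "b"]})
info = pd.DataFrame({"username": ["u1", "u3"], "subreddit": ["vet", "orchid"]})

merged = pd.merge(comments, info, on="username", how="left", indicator=True)

# rows that exist only in the left (comments) table
unmatched = merged.loc[merged["_merge"] == "left_only", "username"].tolist()
print(unmatched)  # ['u2']
```

Unmatched rows end up with missing values in the columns coming from the info table, which is consistent with the single missing `subreddit` value in the `info()` summary further down.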

In [52]:
reddit_user_df.head()

Unnamed: 0,username,comments,isused,subreddit,created_at
0,LoveAGoodTwist,"Female, Kentucky. 4 years out. Work equine on...",True,Veterinary,2024-05-02
1,wahznooski,"As a woman of reproductive age, fuck Texas|As ...",True,Veterinary,2024-05-02
2,Churro_The_fish_Girl,what makes you want to become a vet?|what make...,True,Veterinary,2024-05-02
3,abarthch,"I see of course there are changing variables, ...",True,MysteriumNetwork,2024-05-02
4,VoodooKing,I have 412+ and faced issues because wireguard...,False,MysteriumNetwork,2024-05-03


In [53]:
reddit_user_df.iloc[23]["comments"]

'[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[deleted]|[removed]|Can I ask a question about really basic vetmed certification? I’m in an area that has a serious shortage of emergency trained vets, so much so that there’s been a pivot to regular vets not doing emergency triage, and not being able to recognize emergencies. \n\nIs there a basic certification that’s available so that pet owners can know when it’s time for the ER?|[deleted]|[deleted]|I agree with some of the below threads. Pay varies from state and I’ve also found big cities tend to pay more than hospitals in burbs or rural areas. For instance, I’m not certified and as a tech in Boston, MA I make $27/hour but in Chicago, IL I made $23/hour. That being said, I live with my boyfriend and having dual incomes is honestly the only way I can afford to live.\n\nI know moving for a job is a big thing consider but maybe not a bad idea to see what’s out there. I’ve also learned to not be afraid to advocate for yoursel

In [54]:
reddit_user_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   username    3276 non-null   object
 1   comments    3276 non-null   object
 2   isused      3275 non-null   object
 3   subreddit   3275 non-null   object
 4   created_at  3275 non-null   object
dtypes: object(5)
memory usage: 128.1+ KB


In [55]:
reddit_user_df["comments"][0]

'Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a week Jan-June/July then variable in the off season. No limit on PTO - took ~5 weeks last year. One paid conference a year (registration/travel/ 1/2 hotel/ transportation) or online CE program. All licensures & professional group fees covered. Cell phone allowance and mileage reimbursement.|Female, Kentucky.  4 years out. Work equine only private practice. Base salary $85k plus bonuses/production which was $20k 2023. 6 days a wee

In [99]:
reddit_user_df["comments"] = reddit_user_df["comments"].apply(nlp_preprocessing)

In [100]:
reddit_user_df["comments"][0]

'female kentucky years out work equine only private practice base salary k plus bonuses k days a week jan june no limit on pto took weeks last year one paid conference a year registration transportation or online ce program all licensures professional group fees covered cell phone allowance and mileage reimbursement'

### My Approach to Building the Model

The following are the approaches used to solve this problem
<ol>
    <li>All users would be categorized as others unless proven otherwise from the comments</li>
    <li>Comments are independent of each other (meaning a comment is not continued in another comment)</li>
    <li>When a comment indicates the user's category, the other comments do not matter (i.e. when users state that they are doctors in one comment, they are still doctors even if their other comments are unrelated)</li>
    <li>Comments made by a user are split and treated as separate data points to capture the independence among comments</li>
    <li>When a user is a doctor or a veterinarian, at least one word in the comment that reveals the profession must be related to doctor or veterinarian (i.e. when no word in a user's comments is related (similar) to doctor or medicine, the user is automatically not a doctor)</li>
    <li>Any user found by the rule above not to be a doctor or veterinarian is automatically classified as Others</li>
    <li>The model (to be built) is trained only on comments having at least one word related to doctor, medicine, veterinarian, animal, hospital or clinic</li>
</ol>

Split comments

In [103]:
user_separated_comment_dict = {
    "username" : [],
    "comment" : [],
    "subreddit" : []
}

for i in range(reddit_user_df.shape[0]):
    for comment in reddit_user_df.iloc[i]["comments"].split("|"):
        user_separated_comment_dict["username"].append(reddit_user_df.iloc[i]["username"])
        user_separated_comment_dict["comment"].append(comment.strip())
        user_separated_comment_dict["subreddit"].append(reddit_user_df.iloc[i]["subreddit"])
        
user_separated_comment_df = pd.DataFrame(user_separated_comment_dict)
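The row-by-row loop above can be slow on larger frames because of the repeated `iloc` lookups. A possible alternative, shown here as a sketch on toy data (the frame `toy` and its contents are made up), is to split the comment string into a list and let `DataFrame.explode` create one row per comment:

```python
import pandas as pd

toy = pd.DataFrame({
    "username": ["u1", "u2"],
    "comments": ["first | second", "only one"],
    "subreddit": ["vet", "orchid"],
})

exploded = (
    toy.assign(comment=toy["comments"].str.split("|"))  # list of comments per user
       .explode("comment")                              # one row per list element
       .drop(columns="comments")
       .reset_index(drop=True)
)
exploded["comment"] = exploded["comment"].str.strip()
exploded = exploded[["username", "comment", "subreddit"]]
print(exploded)
```

The result has the same three columns as `user_separated_comment_df`, with the username and subreddit repeated for each of a user's comments.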

In [104]:
user_separated_comment_df

Unnamed: 0,username,comment,subreddit
0,LoveAGoodTwist,female kentucky years out work equine only pri...,Veterinary
1,wahznooski,as a woman of reproductive age fuck texas,Veterinary
2,Churro_The_fish_Girl,what makes you want to become a vet,Veterinary
3,abarthch,i see of course there are changing variables b...,MysteriumNetwork
4,abarthch,what do you mean as far as i am aware people c...,MysteriumNetwork
...,...,...,...
11191,Real_Use_3216,i earn production on everything i touch from p...,Veterinary
11192,Real_Use_3216,focus on practicing good medicine and surgery ...,Veterinary
11193,Real_Use_3216,hard no,Veterinary
11194,Real_Use_3216,am crossfit its the first thing i do every wor...,Veterinary


In [105]:
user_separated_comment_df.tail(30)

Unnamed: 0,username,comment,subreddit
11166,daliadeimos,good point,Veterinary
11167,daliadeimos,the clinic i work at collects payment for euth...,Veterinary
11168,daliadeimos,we just euthanized a cat this week for not bei...,Veterinary
11169,Unhappy_Passenger_86,as some one who is also coming from a difficul...,Veterinary
11170,B1u3Chips_,im looking into applying for veterinary nursin...,Veterinary
11171,B1u3Chips_,what could i study in college to do veterinary...,Veterinary
11172,Daktari2018,good for you for sticking to standards of care...,Veterinary
11173,Daktari2018,this is wonderful wanting to know more knowing...,Veterinary
11174,Daktari2018,its tough to come into a tight group esp from ...,Veterinary
11175,Daktari2018,call the company they can tell you length of t...,Veterinary


In [106]:
user_separated_comment_df.iloc[11169]["comment"]

'as some one who is also coming from a difficult situation and trying to pursue vet school i have the greatest amount of empathy for you and your situation i had my son my sophomore year of college and being practically homeless made it really difficult i worked full time took care of my son and just tried to barely survive my end gpa at my undergrad school was so i completely understand that low grades can be hard to come back from i share this because i have gone on to get waitlisted at a vet school and this year i have again gotten interviews and a chance to go look at each vet school before you write yourself off and see what their requirements are and how they calculate gpa i put mine into a google doc so it was easy to read i then looked very critically at my own academic experience and matched all the classes i had taken and their grades with the prereqs of each school a lot of schools not all out more weight on last semester credits quarter credits than on overall gpa if you ca

The next step is to identify comments in which at least one word related to doctor, medicine, veterinarian, animal, hospital or clinic is mentioned

Cosine similarity is used as the similarity index, with a threshold of 0.7

To do this, the words need to be embedded. I will make use of GloVe embeddings

The first thing is to extract the embedding vectors

In [107]:
embeddings_index = dict()

with open("glove.6B.100d.txt", encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

In [108]:
print(f"{len(embeddings_index)} words found")

400000 words found


Next step is to define the vocabulary size

Since we are only concerned with this dataset, the number of unique words in this dataset would form the vocabulary size

In [109]:
words_in_dataset = set()

for comment in user_separated_comment_df["comment"]:
    for word in comment.split():
        words_in_dataset.add(word.lower())

In [110]:
vocabulary_size = len(words_in_dataset)

first_word = list(embeddings_index.keys())[0]
embedding_dim = len(embeddings_index[first_word])

The next step is to tokenize the words

For this the words in Glove embedding would be used as the trainin

In [111]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(words_in_dataset)

In [112]:
vocabulary_size

18474

The next step is to create embedding matrix

In [113]:
embedding_matrix = np.zeros((vocabulary_size, embedding_dim))

for word, index in tokenizer.word_index.items():
    if index <= vocabulary_size:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[index-1] = embedding_vector

In [114]:
print(embedding_matrix.shape)

(18474, 100)
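The Keras `Tokenizer` numbers words from 1, which is why the cell above stores each vector at `index-1`. That works as long as every later lookup applies the same shift. A common alternative convention, sketched below with a toy word index and toy vectors (all names and values here are made up), is to allocate one extra row so that each vector sits at its own index and row 0 stays free for padding:

```python
import numpy as np

# toy stand-ins: a 1-based word index and a tiny embedding lookup
word_index = {"vet": 1, "doctor": 2, "zzz": 3}
toy_embeddings = {
    "vet": np.array([1.0, 0.0]),
    "doctor": np.array([0.0, 1.0]),
}
dim = 2

# one extra row: row 0 is left as zeros for the padding index
matrix = np.zeros((len(word_index) + 1, dim))
for word, index in word_index.items():
    vector = toy_embeddings.get(word)
    if vector is not None:          # words without a vector stay all-zero
        matrix[index] = vector

print(matrix.shape)  # (4, 2)
```

With this layout, integer sequences produced by the tokenizer can be used as row indices without any offset.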


Now let's check the comments containing at least one word that is related to any of doctor, medicine, veterinarian, animal, hospital or clinic, as mentioned earlier
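One way such a check could look is sketched below. It is only an illustration under stated assumptions: the helper names (`cosine_similarity`, `is_related`, `comment_is_relevant`) are hypothetical, the vectors are toy values rather than GloVe vectors, and treating a similarity equal to the threshold as a match is a choice made here, not something the notebook specifies.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0


def is_related(word, base_words, embeddings, threshold=0.7):
    """True if `word` is at least `threshold`-similar to any base word."""
    vector = embeddings.get(word)
    if vector is None:              # word has no embedding
        return False
    for base in base_words:
        base_vector = embeddings.get(base)
        if base_vector is not None and cosine_similarity(vector, base_vector) >= threshold:
            return True
    return False


def comment_is_relevant(comment, base_words, embeddings, threshold=0.7):
    """True if any word of the comment is related to a base word."""
    return any(is_related(w, base_words, embeddings, threshold) for w in comment.split())


# toy 2-d vectors standing in for real embeddings
toy_embeddings = {
    "doctor": np.array([1.0, 0.1]),
    "physician": np.array([0.9, 0.2]),
    "router": np.array([0.0, 1.0]),
}

print(comment_is_relevant("my physician said", ["doctor"], toy_embeddings))  # True
print(comment_is_relevant("my router died", ["doctor"], toy_embeddings))     # False
```

In the notebook, the `embeddings` argument would be `embeddings_index` and the base words would be the six words listed above.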

In [None]:
def check_cosine_similarity(word, base_words):
    # TODO: body not yet implemented
    ...

Now let us continue with the subreddits

Starting from the bottom and moving up

In [75]:
subreddit_list

['Veterinary',
 'MysteriumNetwork',
 'medicine',
 'HeliumNetwork',
 'orchid',
 'vet']

Let's check out the comments of the vet subscribers

In [None]:
# get list of vet subscribers
vet_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "vet"]["username"].values

# get the comments made by vet subscribers
vet_sub_comments = reddit_user_df[reddit_user_df["subreddit"] == "vet"]["comments"].values

In [None]:
for i in range(len(vet_sub_list)):
    print(f"{vet_sub_list[i]}: {vet_sub_comments[i]}")

It would be logical for a practicing veterinarian or anyone whose work is related to veterinary medicine to follow the "vet" subreddit. This shows closeness to veterinary medicine but does not guarantee being a veterinarian

It can be seen that all these people have just one comment each

Many spoke in the third person, which makes it hard to say whether they are doctors or not

Initially, I chose only test_vet6 and test_vet9 as practicing veterinarians, but the further instruction given has clarified that I should include all

In [None]:
# store usernames of confirmed veterinarians
vet = list(vet_sub_list)

> It is noteworthy that this type of problem is usually solved effectively with a labelled dataset

> However, with unlabelled data such as the one here, hand engineering may be employed to label enough samples to build a model; thereafter the model can predict the rest

> That is the approach I wish to employ for this task
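To make the hand-labelled users usable as training data, the per-category username lists need to be turned into a single username-to-label mapping. The sketch below shows one way to do that. The function name `build_label_table`, the label strings and the usernames are all hypothetical, and raising an error when a user appears under two labels is a design choice made here.

```python
def build_label_table(labelled_groups):
    """Map each username to one label; raise if a user appears under two labels."""
    labels = {}
    for label, usernames in labelled_groups.items():
        for username in usernames:
            # a username already stored under a different label is a conflict
            if labels.get(username, label) != label:
                raise ValueError(
                    f"{username} is labelled both {labels[username]} and {label}"
                )
            labels[username] = label
    return labels


# toy usernames, not taken from the dataset
seed_labels = build_label_table({
    "Veterinarian": ["user_a", "user_b"],
    "Doctor": ["user_c"],
    "Others": ["user_d"],
})
print(seed_labels)
```

The conflict check is useful because the lists are filled by hand over many cells, so the same username could accidentally be appended to two of them.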

Now, on to the next subreddit (orchid)

In [None]:
# get list of orchid subscribers
orchid_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "orchid"]["username"].values

# get the comments made by orchid subscribers
orchid_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "orchid"]["comments"]

In [None]:
print(f"Orchid subscribers: {orchid_sub_list}")
print(f"Number of orchid subscribers: {len(orchid_sub_list)}")

It is necessary to preprocess our data first, as the comments of users who subscribe to orchid are quite lengthy

In [None]:
orchid_sub_comment_list = orchid_sub_comment_list.apply(nlp_preprocessing)

In [None]:
orchid_sub_comment_list.index

In [None]:
orchid_sub_comment_list[205]

In [None]:
orchid_sub_comment_list[241]

In [None]:
orchid_sub_comment_list[306]

In [None]:
orchid_sub_comment_list[1547]

Checking the comments of all the 4 people who subscribed to orchid shows none of them is either a practicing doctor or a practicing veterinarian

In [None]:
others = list(orchid_sub_list)

Next subreddit is HeliumNetwork

In [None]:
# get list of HeliumNetwork subscribers
HeliumNetwork_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "HeliumNetwork"]["username"].values

# get the comments made by HeliumNetwork subscribers
HeliumNetwork_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "HeliumNetwork"]["comments"]

In [None]:
print(f"HeliumNetwork subscribers: {HeliumNetwork_sub_list}")
print(f"Number of HeliumNetwork subscribers: {len(HeliumNetwork_sub_list)}")

There are 6 subscribers

Let's preprocess the data of all users who subscribe to HeliumNetwork in order to view it properly

In [None]:
HeliumNetwork_sub_comment_list = HeliumNetwork_sub_comment_list.apply(nlp_preprocessing)

In [None]:
HeliumNetwork_sub_comment_list.index

In [None]:
HeliumNetwork_sub_comment_list[93]

In [None]:
HeliumNetwork_sub_comment_list[442]

In [None]:
HeliumNetwork_sub_comment_list[458]

In [None]:
HeliumNetwork_sub_comment_list[632]

In [None]:
HeliumNetwork_sub_comment_list[670]

In [None]:
HeliumNetwork_sub_comment_list[1504]

Everyone who subscribes to HeliumNetwork belongs to the Others category (none of them is perceived to be a practicing doctor or veterinarian)

In [None]:
others.extend(list(HeliumNetwork_sub_list))

Let's take a look at the medicine subreddit

In [None]:
# get list of medicine subscribers
medicine_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "medicine"]["username"].values

# get the comments made by medicine subscribers
medicine_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "medicine"]["comments"]

In [None]:
print(f"medicine subscribers: {medicine_sub_list}")
print(f"Number of medicine subscribers: {len(medicine_sub_list)}")

There are 8 subscribers

Let's preprocess the data of all users who subscribe to medicine in order to view it properly

In [None]:
medicine_sub_comment_list = medicine_sub_comment_list.apply(nlp_preprocessing)

In [None]:
for i in range(len(medicine_sub_list)):
    print(f"{medicine_sub_list[i]}: {medicine_sub_comment_list.values[i]}")

Just like the vet subreddit, most of the users in this category speak in the third person, which makes it hard to say whether they are really practicing doctors or other people in the medical field, like a nurse or a student

However, the further instruction clarified this, and all of these subscribers are classified as doctors

In [None]:
doctors = list(medicine_sub_list)

Let's take a look at the next subreddit, MysteriumNetwork

In [None]:
# get list of MysteriumNetwork subscribers
MysteriumNetwork_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "MysteriumNetwork"]["username"].values

# get the comments made by MysteriumNetwork subscribers
MysteriumNetwork_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "MysteriumNetwork"]["comments"]

In [None]:
print(f"Number of MysteriumNetwork subscribers: {len(MysteriumNetwork_sub_list)}")

There are quite a lot of users who subscribe to MysteriumNetwork

Nevertheless we still need to preprocess the comments

In [None]:
MysteriumNetwork_sub_comment_list = MysteriumNetwork_sub_comment_list.apply(nlp_preprocessing)

Let's take a look at some of the comments made by subscribers of MysteriumNetwork

In [None]:
MysteriumNetwork_sub_comment_list[MysteriumNetwork_sub_comment_list.index[235]]

Let's take a look at the last subreddit also, Veterinary

In [None]:
# get list of Veterinary subscribers
Veterinary_sub_list = reddit_user_df[reddit_user_df["subreddit"] == "Veterinary"]["username"].values

# get the comments made by Veterinary subscribers
Veterinary_sub_comment_list = reddit_user_df[reddit_user_df["subreddit"] == "Veterinary"]["comments"]

In [None]:
print(f"Number of Veterinary subscribers: {len(Veterinary_sub_list)}")

There are quite a lot of users who subscribe to Veterinary as well

Let's first preprocess the comments

In [None]:
Veterinary_sub_comment_list = Veterinary_sub_comment_list.apply(nlp_preprocessing)

In [None]:
Veterinary_sub_comment_list[Veterinary_sub_comment_list.index[3]]

Before building the model, it is necessary to also manually label some samples of medical or vet school students as Others, so that the model can learn this and not classify them otherwise

Let's start by searching for medical students

Let's get 10 users whose comments include keywords that may show someone is a medical student

In [None]:
search_keys = ["school", "resident", "undergrad"]
returned_username = []
returned_indices = []
count = 0

In [None]:
for comments, ind, username in zip(Veterinary_sub_comment_list, Veterinary_sub_comment_list.index, Veterinary_sub_list):
    found = False
    if count == 10:
        break
        
    for comment in comments.split("|"):
        for word in comment.split(" "):
            if word.lower() in search_keys:
                returned_indices.append(ind)
                returned_username.append(username)
                found = True
                count += 1
                break
        if found == True:
            break
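The same search is run twice in this notebook with slightly different keywords, so it could be wrapped in a function. The block below is only a sketch under stated assumptions: `find_users_with_keywords`, `limit` and `exclude` are hypothetical names that do not appear elsewhere in the notebook, and the toy series stands in for `Veterinary_sub_comment_list`. It also splits on any whitespace rather than only on single spaces, which may match slightly more words than the loop above.

```python
import pandas as pd


def find_users_with_keywords(comments, usernames, search_keys, limit=10, exclude=()):
    """Return (indices, usernames) of the first `limit` users whose comments
    contain any of `search_keys` as a whole word (hypothetical helper)."""
    keys = {key.lower() for key in search_keys}
    excluded = set(exclude)
    found_indices, found_usernames = [], []

    for ind, text, username in zip(comments.index, comments, usernames):
        if len(found_usernames) == limit:
            break
        if username in excluded:
            continue
        # a user's comments are "|"-separated, so treat "|" like whitespace
        words = text.replace("|", " ").lower().split()
        if any(word in keys for word in words):
            found_indices.append(ind)
            found_usernames.append(username)

    return found_indices, found_usernames


# toy data standing in for the preprocessed comments and the username list
toy_comments = pd.Series(
    ["i start vet school soon|so excited", "my dog is cute", "finished undergrad last year"],
    index=[10, 11, 12],
)
toy_usernames = ["user_a", "user_b", "user_c"]

toy_indices, toy_names = find_users_with_keywords(
    toy_comments, toy_usernames, ["school", "resident", "undergrad"]
)
print(toy_indices, toy_names)
```

Passing the usernames found in a first search as `exclude` would give a second, non-overlapping batch of users.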

Let's take a look at the comments of these 10 users and label them

In [None]:
i = 0     # first user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as Others
# this user is a data analyst applying to work in a vet school
others.append(returned_username[i])

In [None]:
i = 1     # second user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as doctor
doctors.append(returned_username[i])

In [None]:
i = 2     # third user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as Others
others.append(returned_username[i])

In [None]:
i = 3     # fourth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as Others
# this user is a vet student
others.append(returned_username[i])

In [None]:
i = 4     # fifth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as others
# this user is a vet student or graduate, judging from the statement "I am not a vet and I make more than our new vets do"
others.append(returned_username[i])

In [None]:
i = 5     # sixth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as others
# this user is a vet assistant
others.append(returned_username[i])

In [None]:
i = 6     # seventh user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as doctor
doctors.append(returned_username[i])

In [None]:
i = 7     # eighth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
# this user is a prospective vet student
others.append(returned_username[i])

In [None]:
i = 8     # ninth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
# not enough info from the comments
others.append(returned_username[i])

In [None]:
i = 9     # tenth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
# this user is a prospective vet student
others.append(returned_username[i])
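The display cell above is repeated once per user with only `i` changing, so it could be folded into a small helper. This is a minimal sketch, not part of the original notebook: `show_user` is a hypothetical name, the toy dataframe stands in for `reddit_user_df`, and the `.loc` lookup assumes the dataframe index is unique so that a single comment string comes back.

```python
import pandas as pd


def show_user(i, usernames, indices, df):
    """Print the i-th returned username together with that user's raw comments."""
    # scalar lookup by index label; assumes the index has no duplicates
    comments = df.loc[indices[i], "comments"]
    print("Username:")
    print(usernames[i])
    print()
    print("Comments:")
    print(comments)
    return comments


# toy dataframe standing in for reddit_user_df
toy_df = pd.DataFrame(
    {"username": ["user_a", "user_b"], "comments": ["first|second", "third"]},
    index=[10, 11],
)
toy_shown = show_user(1, ["user_a", "user_b"], [10, 11], toy_df)
```

With such a helper, each inspection cell would shrink to one call, leaving the labelling decision as the only manual step.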

Let's get 10 more users (different from the 10 above) whose comments include keywords that may show someone is a vet student

In [None]:
search_keys = ["school", "resident", "undergrad", "vet"]
vet_returned_username = []
vet_returned_indices = []
count = 0

In [None]:
for comments, ind, username in zip(Veterinary_sub_comment_list, Veterinary_sub_comment_list.index, Veterinary_sub_list):
    found = False
    if count == 10:
        break
        
    for comment in comments.split("|"):
        for word in comment.split(" "):
            if (word.lower() in search_keys) and (username not in returned_username):
                vet_returned_indices.append(ind)
                vet_returned_username.append(username)
                found = True
                count += 1
                break
        if found == True:
            break

Let's take a look at the 10 sets of comments and label them

In [None]:
i = 0     # first user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
# not enough info from the user's comments
others.append(vet_returned_username[i])

In [None]:
i = 1     # second user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as vet
vet.append(vet_returned_username[i])

In [None]:
i = 2     # third user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
# this user is a vet tech, not a veterinarian
others.append(vet_returned_username[i])

In [None]:
i = 3     # fourth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as vet
# according to the phrase "Im like actually a pretty good vet. And I mostly enjoy my job"
vet.append(vet_returned_username[i])

In [None]:
i = 4     # fifth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
others.append(vet_returned_username[i])

In [None]:
i = 5     # sixth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
# this user is most likely a student
others.append(vet_returned_username[i])

In [None]:
i = 6     # seventh user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as vet
# this user is most likely a veterinarian
vet.append(vet_returned_username[i])

In [None]:
i = 7     # eighth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as vet
# this user is most likely a veterinarian
vet.append(vet_returned_username[i])

In [None]:
i = 8     # ninth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as vet
# this user is a vet
vet.append(vet_returned_username[i])

In [None]:
i = 9     # tenth user whose comments contain any of the keywords
# the trailing [0] is needed because .values returns an array; without it the comments would print inside square brackets
print("Username:")
print(vet_returned_username[i])
print()
print("Comments:")
print(reddit_user_df[reddit_user_df.index == vet_returned_indices[i]]["comments"].values[0])

In [None]:
# label this user as other
# this user is about to write NAVLE (North American Veterinary Licensing Examination)
others.append(vet_returned_username[i])

In [None]:
print(vet)
print()
print(others)
print()
print(doctors)

The usernames labelled in the sample labelling file will also be added with their labels

In [None]:
others.append("--solaris--")

vet.extend(["100realtx", "3_Black_Cats"])
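Because the labels are written to the dataframe one list at a time, a username that ends up in more than one list would silently keep whichever label is applied last. A quick overlap check can catch this before labelling. The block below is a sketch on toy lists (the names in `toy_label_lists` are made up and stand in for `doctors`, `vet` and `others`).

```python
from itertools import combinations

# toy lists standing in for the notebook's doctors / vet / others lists
toy_label_lists = {
    "doctors": ["user_a", "user_b"],
    "vet": ["user_c", "user_b"],
    "others": ["user_d"],
}

# compare every pair of lists and record the usernames they share
toy_overlaps = {}
for (name_1, list_1), (name_2, list_2) in combinations(toy_label_lists.items(), 2):
    shared = set(list_1) & set(list_2)
    if shared:
        toy_overlaps[(name_1, name_2)] = sorted(shared)

print(toy_overlaps)
```

An empty dictionary means every username carries exactly one label.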

The labels will now be added to the dataframe

In [None]:
reddit_user_df["Label"] = None

In [None]:
reddit_user_df.head()

In [None]:
reddit_user_df.info()

In [None]:
for username in doctors:
    reddit_user_df.loc[reddit_user_df["username"] == username, "Label"] = "Medical Doctor"

In [None]:
for username in vet:
    reddit_user_df.loc[reddit_user_df["username"] == username, "Label"] = "Veterinarian"

In [None]:
for username in others:
    reddit_user_df.loc[reddit_user_df["username"] == username, "Label"] = "Other"
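The three loops above could also be expressed as a single lookup. The following is a sketch on a toy dataframe, assuming each username should carry one label; `toy_label_by_user` and the other `toy_` names are hypothetical. Note one difference from the loops: `Series.map` fills unlabelled rows with `NaN` instead of `None`, and it overwrites the whole `Label` column.

```python
import pandas as pd

# toy data standing in for reddit_user_df and the three label lists
toy_users_df = pd.DataFrame({"username": ["user_a", "user_b", "user_c", "user_d"]})
toy_doctors, toy_vet, toy_others = ["user_a"], ["user_b"], ["user_c"]

# build one username -> label dictionary; later lists win on duplicates,
# which mirrors the order in which the loops are run
toy_label_by_user = {}
for label, names in [
    ("Medical Doctor", toy_doctors),
    ("Veterinarian", toy_vet),
    ("Other", toy_others),
]:
    toy_label_by_user.update({name: label for name in names})

# usernames missing from the dictionary become NaN
toy_users_df["Label"] = toy_users_df["username"].map(toy_label_by_user)
print(toy_users_df)
```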

In [None]:
reddit_user_df.info()

Checking the Label column, we see that 50 rows (users) have been labelled

These will form our training set
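Besides the total, it may help to look at how the labelled rows are spread across the three classes, since a very uneven split would affect training. A minimal sketch on a toy series (the values are made up and stand in for `reddit_user_df["Label"]`):

```python
import pandas as pd

# toy labels standing in for the Label column; None marks unlabelled rows
toy_labels = pd.Series(["Medical Doctor", None, "Other", "Veterinarian", "Other", None])

toy_n_labelled = toy_labels.notna().sum()   # number of labelled rows
toy_per_class = toy_labels.value_counts()   # unlabelled rows are dropped by default

print(toy_n_labelled)
print(toy_per_class)
```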

Let's export the training set as a CSV file in the requested format in order to get feedback

In [None]:
# build a boolean mask for each label
doctor_indices_mask = reddit_user_df["Label"] == "Medical Doctor"
vet_indices_mask = reddit_user_df["Label"] == "Veterinarian"
other_indices_mask = reddit_user_df["Label"] == "Other"

# combine the masks with element-wise OR to select every labelled row
labelled_indices_mask = doctor_indices_mask | vet_indices_mask | other_indices_mask

train_set_df = reddit_user_df.loc[labelled_indices_mask, ["username", "comments", "Label"]].copy()

In [None]:
train_set_df.head()

In [None]:
train_set_df.columns = ["Reddit Username", "Reddit Comments", "Label"]

In [None]:
train_set_df.head()

The comments column needs to be preprocessed, otherwise the output CSV file would not be properly formatted

In [None]:
train_set_df["Reddit Comments"] = train_set_df["Reddit Comments"].apply(nlp_preprocessing)

In [None]:
train_set_df.info()

In [None]:
train_set_df.to_csv("training_set.csv", index=False)

At this point, it would be better to build the model on the 4 subreddits we have checked so far, then predict the category for the remaining users, who subscribe to the remaining 2 subreddits

## Model Building

For the model building, I will use pretrained GloVe word vectors to build the embedding matrix that embeds the words
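To make the idea concrete, here is a small self-contained sketch of how a GloVe-style text file maps to an embedding matrix. Everything in it is illustrative: the three "GloVe" lines are made-up 3-dimensional vectors rather than real `glove.6B.100d.txt` content, and `toy_word_index` is a hypothetical word-to-id mapping of the kind a tokenizer would produce, with id 0 reserved for padding.

```python
import io

import numpy as np

# three made-up lines in the GloVe text layout: "<word> <v1> <v2> ... <vN>"
toy_glove_file = io.StringIO(
    "vet 0.1 0.2 0.3\n"
    "doctor 0.4 0.5 0.6\n"
    "school 0.7 0.8 0.9\n"
)

# parse each line into word -> vector
toy_embedding_index = {}
for line in toy_glove_file:
    values = line.split()
    toy_embedding_index[values[0]] = np.asarray(values[1:], dtype="float32")

# hypothetical word -> integer id mapping (id 0 is left for padding)
toy_word_index = {"vet": 1, "doctor": 2, "residency": 3}

embedding_dim = 3
toy_embedding_matrix = np.zeros((len(toy_word_index) + 1, embedding_dim), dtype="float32")
for word, idx in toy_word_index.items():
    vector = toy_embedding_index.get(word)
    if vector is not None:
        # row idx holds the pretrained vector for this word
        toy_embedding_matrix[idx] = vector
    # words missing from the GloVe file keep an all-zero row

print(toy_embedding_matrix)
```

Row `i` of the matrix is the vector for the word with id `i`, which is the layout an `Embedding` layer expects when it is initialised from pretrained weights.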

In [None]:
# loading glove word vectors (words embeddings) into dictionary
embedding_index = {}

with open('glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embedding_index[word] = coefs