## Turmerik ML Take Home Assignment: Sentiment Analysis and Personalized Messaging for Clinical Trial Recruitment
### Objective
The objective of this project is to ethically scrape and analyze Reddit data, apply sentiment analysis, and use AI to personalize communication. The focus is on identifying potential clinical trial participants by analyzing the sentiment expressed in their posts.

### Part 1: Setup and Requirements

PRAW (the Python Reddit API Wrapper) is a Python package that provides simple access to Reddit's API.

To comply with Reddit's API terms of use and to respect user privacy and data use policies, this project follows three practices:

- Adhering to API terms of use such as rate limits, proper use, and content handling.
- Respecting user privacy by collecting only the data I need for my application.
- Following data use policies by being cautious when dealing with sensitive or personally identifiable information.

In this project, I commit to using only publicly available data sourced from Reddit. 

In [53]:
!pip install praw


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [42]:
import praw

### Part 2: Project Overview
#### 2.1: **Data Collection** - Develop a script to scrape Reddit posts and comments from specific subreddits related to health conditions relevant to the clinical trial.

PRAW authenticates through a Reddit application registered under your Reddit account, which provides a client ID and a client secret; you also supply a descriptive user agent string. The application can be created at https://www.reddit.com/prefs/apps.
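
To keep credentials out of the notebook itself, one option is to read them from environment variables. The following is a minimal sketch, not part of the original workflow; the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are assumptions chosen for illustration.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read Reddit API credentials from environment variables.

    The variable names are hypothetical; use whichever names you export.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")
    # Fail early with a clear message instead of sending empty credentials
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Keys match the keyword arguments passed to praw.Reddit in this notebook
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }
```

The returned dictionary could then be unpacked into the constructor, for example `praw.Reddit(**load_reddit_credentials())`.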

In [43]:
client_id = ''
client_secret = ''
user_agent = 'script:reddit_scraper:v1.0 (by /u/)'

# Instantiation of the PRAW Reddit
reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent)

# Define the subreddits to scrape
subreddits = ['health', 'clinicaltrials', 'clinicalresearch', 'medicine']

keywords = ['diabetes', 'hypertension', 'insulin', 'blood pressure', 'clinical trial', 'treatment', 'side effects']

def contains_keyword(text):
    return any(keyword.lower() in text.lower() for keyword in keywords)

# function for scraping posts and comments from the specified subreddits
def scrape_reddit(subreddits, limit=200):
    data = []
    for subreddit in subreddits:
        subreddit_instance = reddit.subreddit(subreddit)
        for submission in subreddit_instance.hot(limit=limit):
            if not contains_keyword(submission.title) and not contains_keyword(submission.selftext):
                continue
            submission_data = {
                'title': submission.title,
                'selftext': submission.selftext,
                'comments': []
            }
            submission.comments.replace_more(limit=0)
            for comment in submission.comments.list():
                if not contains_keyword(comment.body):
                    continue
                submission_data['comments'].append(comment.body)
            data.append(submission_data)
    return data

scraped_data = scrape_reddit(subreddits)
print(len(scraped_data))

96


In [44]:
scraped_data[0].keys()

dict_keys(['title', 'selftext', 'comments'])

#### 2.2: **Sentiment Analysis** - Perform sentiment analysis on the scraped data to determine the general attitude and interest levels regarding clinical trials.

There are many Python packages for natural language processing and sentiment analysis, including NLTK (Natural Language Toolkit), TextBlob, and Transformers. In this project, I use NLTK, a comprehensive library with tools for many natural language processing tasks. It includes VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analyzer designed for the kind of text found on social media.

In [54]:
!pip install nltk


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [45]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/zixuanwang/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

Instantiate an analyzer and score every title and comment

In [46]:
sia = SentimentIntensityAnalyzer()

for data in scraped_data:
    data['title_sentiment'] = sia.polarity_scores(data['title'])
    data['comments_sentiment'] = [sia.polarity_scores(comment) for comment in data['comments']]

Compute the average compound score (range: [-1, 1]) across all titles and comments

In [47]:
total_compound = 0
count = 0

for data in scraped_data:
    total_compound += data['title_sentiment']['compound']
    count += 1
    for comment_sentiment in data['comments_sentiment']:
        total_compound += comment_sentiment['compound']
        count += 1

average_compound = total_compound / count
print(average_compound)

0.10009586776859504
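
The single average above mixes titles and comments. As a hedged sketch of one way to look at them separately, the hypothetical helper below, `average_compound_by_source`, computes one mean per source. It assumes records shaped like `scraped_data` after scoring, and the toy records are hand-written for illustration, not scraped values.

```python
from statistics import mean


def average_compound_by_source(records):
    """Return the mean compound score for titles and for comments separately."""
    title_scores = [r["title_sentiment"]["compound"] for r in records]
    comment_scores = [
        s["compound"] for r in records for s in r["comments_sentiment"]
    ]
    # statistics.mean raises on empty input, so fall back to None
    return {
        "titles": mean(title_scores) if title_scores else None,
        "comments": mean(comment_scores) if comment_scores else None,
    }


# Hand-written toy records in the same shape as scraped_data
toy_records = [
    {"title_sentiment": {"compound": 0.5},
     "comments_sentiment": [{"compound": -0.5}, {"compound": 0.0}]},
    {"title_sentiment": {"compound": 0.0},
     "comments_sentiment": []},
]
print(average_compound_by_source(toy_records))
```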


Find the maximum title compound score

In [48]:
# A single pass over the titles is enough; no outer loop is needed
max_compound = max(data['title_sentiment']['compound'] for data in scraped_data)

print(max_compound)

0.8176


Select posts whose title compound score exceeds 0.3. `positive_sentiment_data` stores the indices of these posts in `scraped_data`.

In [49]:
positive_sentiment_data = [
    i for i, data in enumerate(scraped_data)
    if data['title_sentiment']['compound'] > 0.3
]

print(len(positive_sentiment_data))

24
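
For comparison with the 0.3 threshold used here, the VADER documentation suggests a looser convention: a compound score of at least 0.05 counts as positive, at most -0.05 as negative, and anything in between as neutral. The sketch below applies that convention; `label_compound` and `count_labels` are hypothetical helpers, and the example scores are made up for illustration.

```python
from collections import Counter


def label_compound(compound, pos_cutoff=0.05, neg_cutoff=-0.05):
    """Map a VADER compound score to 'positive', 'negative', or 'neutral'."""
    if compound >= pos_cutoff:
        return "positive"
    if compound <= neg_cutoff:
        return "negative"
    return "neutral"


def count_labels(compounds, **cutoffs):
    """Count how many scores fall into each sentiment label."""
    return Counter(label_compound(c, **cutoffs) for c in compounds)


# Made-up compound scores, not values from the scraped data
example_scores = [0.8176, 0.3, 0.04, 0.0, -0.05, -0.6]
print(count_labels(example_scores))
```

Raising `pos_cutoff` to 0.3 gives a stricter positive class, closer to the selection rule used in this notebook (which uses a strict `>` comparison).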


#### 2.3: **Message Generation** - Use the OpenAI API to generate personalized messages aimed at users who express interest in or could potentially benefit from participating in a clinical trial.

Set up an API key

In [56]:
!pip install openai


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [50]:
from openai import OpenAI

client = OpenAI(
    api_key=''
)

Generate a personalized message for each positive-sentiment title using the OpenAI API

In [51]:
def generate_message_from_title(title):
    messages = [
        {"role": "system", "content": "You are an assistant that generates personalized, supportive, and informative messages related to clinical trial participation."},
        {"role": "user", "content": f"Create a personalized message for a user who might be interested in a clinical trial. The title of the information is: '{title}'. Ensure the message is encouraging and highlights potential benefits in a clear and sensitive manner. Please complete the response within the given token limit."}
    ]
    
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=200
    )

    return response.choices[0].message.content

messages = []

for i in positive_sentiment_data:
    title = scraped_data[i]['title']
    message = generate_message_from_title(title)
    messages.append(message)
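
Because `max_tokens=200` can cut a reply off mid-sentence, it may help to know whether a given reply was actually truncated. In the Chat Completions API, `response.choices[0].finish_reason` is `"length"` when generation stopped at the token limit. The sketch below trims a trailing partial sentence only in that case; `trim_if_truncated` is a hypothetical helper, and it is shown on plain strings so it runs without an API call.

```python
import re


def trim_if_truncated(text, finish_reason):
    """Drop a trailing partial sentence only when the reply hit the token limit."""
    text = text.strip()
    # Leave the text alone if it finished naturally, is empty, or ends cleanly
    if finish_reason != "length" or not text or text[-1] in ".!?":
        return text
    # Split after sentence-ending punctuation and discard the unfinished tail
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return " ".join(sentences[:-1])


print(trim_if_truncated("Thank you for reading. We hope this he", "length"))
```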

Post processing of messages

In [52]:
import re

def clean_incomplete_sentence(text):
    text = text.strip()
    # Guard against empty input, which would otherwise raise IndexError on text[-1]
    if not text:
        return text

    # Split the text into sentences after '.', '!' or '?'
    sentences = re.split(r'(?<=[.!?])\s+', text)

    # If the text does not end with sentence punctuation, drop the unfinished last sentence
    if text[-1] not in '.!?':
        sentences = sentences[:-1]

    return ' '.join(sentences)


# Pair each generated message with the index of the post it was written for
for idx, message in zip(positive_sentiment_data, messages):
    title = scraped_data[idx]['title']

    print(f"Title: \n\t{title}, \nmessage: \n\t{message}\n\n")


Title: 
	New treatment can help those with OCD, 
message: 
	Subject: Exciting Opportunity to Explore New OCD Treatment in a Clinical Trial

Dear [User],

I hope this message finds you well. I wanted to share some encouraging news with you - a new treatment is being researched that holds promising results for individuals struggling with OCD. By participating in a clinical trial for this innovative therapy, you could be at the forefront of potentially life-changing advancements in mental health care.

Clinical trials provide a unique opportunity to access cutting-edge treatments under expert supervision, offering you a chance to explore alternative options that may significantly improve your quality of life. Your participation not only benefits you directly but also contributes to the larger scientific community working towards enhancing treatment options for individuals with OCD.

If you are interested in learning more about this clinical trial and how it could positively impact your jo