[![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/tushar-mahalya/Custom-ChatGPT/blob/master/data/reddit_data.ipynb)

# Data Acquisition
For this project, we will draw on three of the most popular Reddit communities related to Data Science:
* Machine Learning - [r/MachineLearning](https://www.reddit.com/r/MachineLearning/)
* Artificial Intelligence - [r/artificial](https://www.reddit.com/r/Artificial/)
* Data Science - [r/DataScience](https://www.reddit.com/r/DataScience/)

We will extract the required information through the Reddit API using [PRAW](https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html) (the Python Reddit API Wrapper).

In [2]:
# Importing important libraries
import os
import praw
import pandas as pd
import configparser
import datetime as dt

The credentials required to access the API can be obtained from [reddit.com/prefs/apps](https://www.reddit.com/prefs/apps).

For this project, I have saved my credentials in a 'credentials.ini' file so that sensitive values stay out of the notebook.
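
As a rough sketch of what such a file could look like, the snippet below parses an example layout from a string. The section and key names are the ones read later in this notebook; every value is a placeholder, and `EXAMPLE_INI` and `example_config` are hypothetical names used only for illustration.

```python
import configparser

# Hypothetical file contents: the section and key names match the ones read
# in this notebook, but every value is a placeholder.
EXAMPLE_INI = """
[Reddit]
user_agent = my-data-collector/0.1 by u/your_username
client_id = YOUR_CLIENT_ID
client_secret = YOUR_CLIENT_SECRET
redirect_url = http://localhost:8080

[OpenAI]
secret_key = YOUR_OPENAI_KEY
"""

example_config = configparser.ConfigParser()
example_config.read_string(EXAMPLE_INI)

# Values come back as plain strings.
reddit_settings = dict(example_config['Reddit'])
print(sorted(reddit_settings))
```

Keeping the file outside version control (for example by listing it in `.gitignore`) is what actually protects the values.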

In [27]:
# For reading configuration files for Reddit Credentials
config = configparser.ConfigParser()
config.read('/home/studio-lab-user/sagemaker-studiolab-notebooks/Custom ChatGPT/credentials.ini')

# Storing Reddit Credential info in local variables
user_agent = config.get('Reddit', 'user_agent')
client_id = config.get('Reddit', 'client_id')
client_secret = config.get('Reddit', 'client_secret')
redirect_url = config.get('Reddit', 'redirect_url')

In [24]:
# Creating read-only Reddit instance
# (PRAW's keyword is `redirect_uri`; it is not required for read-only access)
reddit = praw.Reddit(user_agent = user_agent,
                    client_id = client_id,
                    client_secret = client_secret,
                    redirect_uri = redirect_url)

## Extracting Top Posts
We will extract the top posts of all time from these subreddits (up to 3,000 in total across the three communities) to create our dataset, along with other useful information: post URL and ID, post title, subreddit, flair, number of comments, time created, upvote ratio, and score.
We will use this information later to analyse the data and draw insights from it.

In [25]:
# Requesting up to 3000 top posts of all time from the combined listing of the three subreddits
posts = reddit.subreddit('MachineLearning+artificial+datascience').top(time_filter = 'all', limit = 3000)

In [26]:
# Creating DataFrame of the top posts along with other attributes for analysis

posts_list = []

for post in posts:
    posts_list.append({
        'post_id' : post.id,
        'post_title' : post.title,
        'subreddit' : post.subreddit.display_name,  # store the name, not the Subreddit object
        'time_created' : post.created_utc,
        'post_url' : post.url,
        'flair_text' : post.link_flair_text,
        'score' : post.score,
        'comments' : post.num_comments,
        'upvote_ratio' : post.upvote_ratio
    })
    
posts_df = pd.DataFrame(posts_list)

In [27]:
# Converting Unix timestamps (seconds, UTC) to datetime values
posts_df['date-time'] = pd.to_datetime(posts_df['time_created'], unit = 's')

# Creating 'Year' column
posts_df['year'] = posts_df['date-time'].dt.year

# Dropping 'time_created' column
posts_df.drop('time_created', axis = 1, inplace = True)
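
The `time_created` values are Unix timestamps, i.e. seconds since 1970-01-01 UTC. Below is a minimal sketch of converting one such value with the standard library, using a made-up timestamp. Passing an explicit time zone is our suggestion here, so that the result does not depend on the local time zone of the machine running the notebook.

```python
from datetime import datetime, timezone

# Hypothetical epoch value (seconds since 1970-01-01 UTC), in the same
# format as PRAW's `created_utc` field.
created_utc = 1677857823.0

# Passing tz=timezone.utc keeps the result in UTC regardless of the
# machine's local time zone.
created_at = datetime.fromtimestamp(created_utc, tz=timezone.utc)

print(created_at.isoformat())  # 2023-03-03T15:37:03+00:00
print(created_at.year)         # 2023
```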

In [28]:
# Saving our posts data in .csv format
posts_df.to_csv("/home/studio-lab-user/sagemaker-studiolab-notebooks/Custom ChatGPT/data/Top_Posts.csv", header = True, index = False)

In [4]:
# Displaying the content of saved Post Data
posts_df = pd.read_csv('/home/studio-lab-user/sagemaker-studiolab-notebooks/Custom ChatGPT/data/Top_Posts.csv')
posts_df.sample(5)

| | post_id | post_title | subreddit | post_url | flair_text | score | comments | upvote_ratio | date-time | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 872 | 11h3p2x | [D] Facebooks LLaMA leaks via torrent file in PR | MachineLearning | https://www.reddit.com/r/MachineLearning/comme... | Discussion | 497 | 164 | 0.98 | 2023-03-03 15:37:03 | 2023 |
| 2389 | i54se2 | Human-like robot hand mimicking demo | artificial | https://www.youtube.com/watch?v=ujZRmFbrCmQ&fe... | My project | 137 | 10 | 0.99 | 2020-08-07 01:35:29 | 2020 |
| 1073 | 6xvnwo | [D] My Neural Network isn't working! What shou... | MachineLearning | http://theorangeduck.com/page/neural-network-n... | Discussion | 442 | 62 | 0.95 | 2017-09-03 20:44:08 | 2017 |
| 1370 | 83mkrz | The Brain Is The Most Important Organ You Have | artificial | https://i.redd.it/8n8r6ze9u4l01.jpg | | 385 | 14 | 0.92 | 2018-03-11 13:02:03 | 2018 |
| 2450 | gx95jt | Researchers make algorithm to generate frontal... | artificial | https://i.imgur.com/ShgRhum.jpg | | 130 | 66 | 0.78 | 2020-06-05 17:39:40 | 2020 |


## Extracting Comments
Using the 'post_id' of each top post, we will extract all of its comments. We will build a separate dataset containing 'post_id' and 'comment' to serve as the textual dataset for our large language model (GPT-3.5-turbo). We will also use this data to analyse the sentiment around different topics and to recognize emotions in the text.

In [None]:
# Creating DataFrame of all the comments available in the Top Posts

comments_list = []

for post_id in posts_df['post_id']:
    submission = reddit.submission(post_id)
    submission.comments.replace_more(limit = None)
    
    for comment in submission.comments.list():
        comments_list.append({
            'post_id' : post_id,
            'comment' : comment.body
        })
        
comments_df = pd.DataFrame(comments_list)
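
Reddit comments form a tree (each comment can have replies), while the loop above works on one flat list per post. The idea of flattening can be sketched with a toy tree. `ToyComment` and `flatten` are hypothetical names, not PRAW classes, and the level-by-level traversal order is a choice made for this illustration rather than a statement about PRAW's internals.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyComment:
    """Hypothetical stand-in for a Reddit comment; not a PRAW class."""
    body: str
    replies: list = field(default_factory=list)


def flatten(top_level):
    """Return every comment in the tree as a flat list, level by level."""
    flat, queue = [], deque(top_level)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)
    return flat


thread = [
    ToyComment("first", [ToyComment("reply to first", [ToyComment("nested reply")])]),
    ToyComment("second"),
]
bodies = [c.body for c in flatten(thread)]
print(bodies)  # ['first', 'second', 'reply to first', 'nested reply']
```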

In [None]:
# Saving our comments data in .csv format
comments_df.to_csv('/home/studio-lab-user/sagemaker-studiolab-notebooks/Custom ChatGPT/data/Top_Posts_Comments.csv', header = True, index = False)

In [5]:
# Displaying the content of our Comments Data
comments_df = pd.read_csv('/home/studio-lab-user/sagemaker-studiolab-notebooks/Custom ChatGPT/data/Top_Posts_Comments.csv')
comments_df.sample(10)

| | post_id | comment |
|---|---|---|
| 123678 | upl33c | I am very surprised someone like Nando would s... |
| 217492 | whg1zi | I will have to check them out. Midjourney crea... |
| 141273 | 404r9m | Deep learning has grown so fast over the last ... |
| 114747 | jc1fp2 | Sorry I don't have time to perform test, you c... |
| 38746 | heiyqq | Please read the article linked in OP. |
| 13267 | umse6v | If you PM I can send it there |
| 42828 | p29bae | lol you had me at "unpaid".. |
| 131394 | skjjvm | Do you think fine-tuning transformers in class... |
| 25050 | o468ms | This is an excellent resource for reviewing ML... |
| 194142 | rvya1w | You give me the numbers and I'll tell you the ... |


In [19]:
print("Shape of Posts Data - {}".format(posts_df.shape))
print("Shape of Comments Data - {}".format(comments_df.shape))

Shape of Posts Data - (2987, 9)
Shape of Comments Data - (223174, 2)


We have successfully extracted ~223K comments from 2,987 top posts in popular subreddits related to Data Science.

This data will be used both to build the text corpus for our large language model and for analysis.

## Training Data
We have extracted the Reddit data we need, and now we'll leverage the [In-Context Learning](http://ai.stanford.edu/blog/understanding-incontext/) ability of large language models.

In short, in-context learning is the ability of a pretrained large language model, such as GPT-3.5-Turbo, to perform a task using information supplied in the prompt, without any update to its weights. To apply it to our Reddit data, we need to put relevant text in front of the model at query time. However, the limited context size (Max Tokens ~ 4,096) makes it impossible to fit the whole dataset into a single prompt. To work around this limitation, we use LLAMA-index to build an index of text embeddings over our Reddit data.
[LLAMA-index](https://gpt-index.readthedocs.io/en/latest/index.html) is a library for connecting external data to large language models: it splits text into chunks, stores an embedding (a compact vector representation) of each chunk, and retrieves the chunks most relevant to a query. The retrieved text is then passed to our GPT-3.5-Turbo model as context, so it can answer questions about the data without running into the token limit.
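
To make the size constraint concrete, here is a minimal sketch of assembling a prompt from text chunks under a fixed budget. It is not how LLAMA-index is implemented: `build_prompt` is a hypothetical helper, the chunks are assumed to be ranked by relevance already, and the budget is counted in characters as a crude stand-in for tokens.

```python
def build_prompt(question, ranked_chunks, max_chars):
    """Add chunks in ranked order until the character budget is used up."""
    header = "Answer the question using the context below.\n\nContext:\n"
    footer = f"\n\nQuestion: {question}\nAnswer:"
    budget = max_chars - len(header) - len(footer)
    selected = []
    for chunk in ranked_chunks:
        cost = len(chunk) + 1  # +1 for the newline separator
        if cost > budget:
            break  # the next chunk would not fit, so stop here
        selected.append(chunk)
        budget -= cost
    return header + "\n".join(selected) + footer


# Made-up chunks, most relevant first.
chunks = ["a" * 50, "b" * 50, "c" * 50]
prompt = build_prompt("What is X?", chunks, max_chars=200)
print(len(prompt))
```

Only the chunks that fit are sent to the model; everything else stays in the index until a different question makes it relevant.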

In [6]:
# Importing important libraries used in
# generating text embeddings

from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex, LLMPredictor, PromptHelper
from langchain.chat_models import ChatOpenAI

#### Comments Aggregation

In [7]:
# Joining Comments with their respective Post ID
comments_posts_merged = posts_df.merge(comments_df, on = 'post_id', how = 'left')

# Deleting rows that don't contain any comment
comments_posts_merged = comments_posts_merged[~comments_posts_merged['comment'].isnull()]

In [8]:
# Combining all relevant textual information
agg_comments_temp = comments_posts_merged[['post_title','flair_text', 'comment']].astype(str)
agg_comments = agg_comments_temp.groupby(['post_title','flair_text'])['comment'].apply('. '.join).reset_index()
agg_comments.sample(5)

| | post_title | flair_text | comment |
|---|---|---|---|
| 1029 | I'm a Senior Data Scientist at Disney and I'm ... | Networking | There seems to be issues with the link in the ... |
| 2033 | [D] Antipatterns in open sourced ML research code | Discussion | [deleted]. Hah, initially I thought I would on... |
| 1124 | It seems a lot of people want to get into the ... | Discussion | I understand people who want to change fields ... |
| 1436 | PyTorch for Beginners - Building Neural Networks | Tutorial | Great for beginners who don't know where to st... |
| 1068 | Imposter Syndrome is a problem for me and I th... | | I once attended an interview and they not only... |
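
The aggregation step can be checked on a tiny, made-up table. The sketch below applies the same `groupby` and join pattern to three hypothetical rows, so the effect on the `comment` column is easy to see.

```python
import pandas as pd

# Made-up rows standing in for the merged posts/comments table.
toy = pd.DataFrame({
    'post_title': ['Post A', 'Post A', 'Post B'],
    'flair_text': ['Discussion', 'Discussion', 'Tutorial'],
    'comment':    ['First comment', 'Second comment', 'Only comment'],
})

# All comments that share a title and flair are joined into one string.
joined = (
    toy.groupby(['post_title', 'flair_text'])['comment']
       .apply('. '.join)
       .reset_index()
)
print(joined)
```

Note that grouping by title and flair (rather than by `post_id`) means two different posts with identical titles and flairs would be merged into one row.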


In [9]:
agg_comments['combined_text'] = agg_comments.astype(str).agg('. '.join, axis = 1)
text_data = ' '.join(agg_comments['combined_text'])
print(text_data[:700])

"Artificial Imagination" - AI generated. My project. Why does everything look familiar but nothing is identifiable. The more you look, the less it makes sense. [deleted]. [deleted]. [deleted]. It's impressive how artificial intelligences are able to make more elaborate and less abstract representations over time. They're evolving in the right direction.. I am currently reading the book "When Brains Dream".  The current theory is that we dream to process the days events, to figure out the meaning and significance of that new information. "Dreams are almost never an accurate replay of daytime events". So similar to AI, our dreams are somehow looking for patterns in our experience to encode int


In [25]:
print('Total No. of Characters in aggregated textual data - {}'.format(len(text_data)))

Total No. of Characters in aggregated textual data - 59605154


We have now aggregated all the relevant textual information from the extracted comments into a single text that can be indexed and supplied as context to our Large Language Model.
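
The character count reported above gives a rough sense of scale. The back-of-the-envelope estimate below assumes about 4 characters per token, which is only a rule of thumb for English text and not a measured value for this dataset; the 4,096-token figure is the limit quoted earlier in this notebook.

```python
import math

total_chars = 59_605_154      # character count reported above
chars_per_token = 4           # ASSUMPTION: rough rule of thumb for English text
context_window = 4_096        # token limit quoted earlier in this notebook

estimated_tokens = total_chars / chars_per_token
windows_needed = math.ceil(estimated_tokens / context_window)

print(f"Estimated tokens: {estimated_tokens:,.0f}")
print(f"Context windows needed to hold everything: {windows_needed:,}")
```

Under that assumption the corpus is thousands of times larger than a single prompt, which is why only a few retrieved chunks can be sent with each query.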

In [10]:
# Saving text data in .txt format (the context manager closes the file even on error)
with open('/home/studio-lab-user/sagemaker-studiolab-notebooks/Custom ChatGPT/data/train_data/train_data.txt', 'w', encoding = 'utf-8') as f:
    f.write(text_data)

#### Generating Text Embeddings/Indexes

We will use the [Facebook AI Similarity Search (Faiss)](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/) library for efficient similarity search between the input prompt and the corpus of data we have collected through the Reddit API. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
We will then create the text embeddings, which we will later use for querying.
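
Conceptually, similarity search means comparing the query vector with every stored vector and keeping the closest ones. The sketch below does this exhaustively in plain Python with cosine similarity. It does not use the Faiss API, the vectors are made up, and `cosine_similarity` and `top_k` are hypothetical helper names. A library such as Faiss exists precisely because this brute-force approach becomes too slow for large collections.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return indices of the k stored vectors most similar to the query."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up 3-dimensional "embeddings"; real ones have many more dimensions.
stored = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.05, 0.0]
print(top_k(query, stored, k=2))  # [0, 1]
```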

In [15]:
# importing index generator function
from src.index_generator import construct_index

In [12]:
# Storing OpenAI Credential info in local variables
openai_key = config.get('OpenAI', 'secret_key')

In [1]:
# Constructing our indexes (ONLY NEED TO RUN ONCE! BE CAREFUL THAT THIS COSTS MONEY)
training_data = 'data/train_data/train_data.txt'
construct_index(training_data, openai_key)

Text Embeddings created Successfully ! 
Stored in 'faiss_index' directory


We have saved the indexes/embeddings of all the textual data we collected from Reddit, which we will use to build a chatbot grounded in that data.