Subreddit level
Sentiment score
Topics in posts
Followers/Members

Post level
Sentiment score
Relevance score
Topics in comments
Total Num Comments
Num Comments per submission
Upvote ratio
Score

Comment level
Sentiment score
Relevance score
Up-vote count
Down-vote count
Controversiality
Total Awards Received
Score
Is_locked, collapsed, submitter?
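
The comment-level engagement fields above could be collected into one flat record per comment. The following is a minimal sketch, not the project's actual loader: the feature names on the left are hypothetical, and the keys on the right (`ups`, `downs`, `controversiality`, `total_awards_received`, `score`, `locked`, `collapsed`, `is_submitter`) are assumed to match the fields in Reddit's comment JSON. Sentiment and relevance scores are left out because they are computed later by the models.

```python
from typing import Any, Dict

# Hypothetical feature names mapped to the keys assumed to be present
# in a raw Reddit comment dictionary.
COMMENT_FIELDS = {
    "upvote_count": "ups",
    "downvote_count": "downs",
    "controversiality": "controversiality",
    "total_awards_received": "total_awards_received",
    "score": "score",
    "is_locked": "locked",
    "is_collapsed": "collapsed",
    "is_submitter": "is_submitter",
}


def comment_features(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Pick the comment-level engagement fields; missing keys become None."""
    return {name: raw.get(key) for name, key in COMMENT_FIELDS.items()}


# Made-up example comment, for illustration only.
example = {
    "ups": 12, "downs": 0, "score": 12, "controversiality": 0,
    "total_awards_received": 1, "locked": False, "collapsed": False,
    "is_submitter": True, "body": "...",
}
features = comment_features(example)
```

Using `dict.get` keeps the record shape stable even when a field is absent, so the records can be loaded into a DataFrame without extra checks.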


In [1]:
from src.features.preprocess import PreProcess
from src.models.relevance import Relevance
from src.models.bertmodels import BertModels

import pandas as pd  # used explicitly below; `imports` may already provide it

from imports import *

[nltk_data] Downloading package punkt to /home/ajz55/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/ajz55/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/ajz55/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/ajz55/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/ajz55/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!

In [2]:
class SubredditAnalysis:
    
    def __init__(self, subreddit='computerscience', sort_order='hot', set_num_posts=500, set_num_comments=500):
        """
        Subreddit analysis constructor.

        Three DataFrames:
            - Subreddit data
            - Post data
            - Comment data

        Steps run on construction:
            - Preprocess posts data
            - Preprocess comments data
            - Retrieve topics for posts and comments,
              and add the topics list to the respective DataFrames
            - Get sentiment score for posts and comments
            - Get relevance score for posts->subreddit and comments->posts,
              and add the scores to the respective DataFrames

        :param subreddit: name of the subreddit to analyze
        :param sort_order: sort order of the submissions retrieved from Reddit
        :param set_num_posts: maximum number of posts to analyze
        :param set_num_comments: maximum number of comments to analyze
        """
        
        
        bertmodels_obj = BertModels(subreddit=subreddit, sort_type=sort_order)
        
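        def topic_extractor(topics_tag_list, topics):
            # Map each document's topic id to its topic name, using the
            # 'Topic' and 'Name' columns of the topic info table.
            topic_mapping = topics_tag_list.set_index('Topic').to_dict()['Name']
            topics_list = list((pd.Series(topics)).map(topic_mapping))
            return topics_list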
        
        #1. posts preprocessing
        self.posts_df = bertmodels_obj.posts_df[:set_num_posts].copy() 
        topic_prep_posts = bertmodels_obj.topic_preprocess(self.posts_df, 'body')
        
        #2. topic modeling for posts
        print('********Topic Modeling for Posts*********')
        bertmodels_obj.topic_modeling(topic_prep_posts, 'body_word_token', visualize=False)
        topic_prep_posts['topics'] = topic_extractor(bertmodels_obj.model.get_topic_info(), bertmodels_obj.topics)
        self.bert_posts = topic_prep_posts.copy()
        print('********DONE: Topic Modeling for Posts*********')
        
        #3. sentiment analysis for posts
        print('********Sentiment Analysis for Posts*********')
        bertmodels_obj.sentiment_preprocess(topic_prep_posts, "body")
        print('********DONE: Sentiment Analysis for Posts*********')
        
        #4. comments preprocessing
        self.comments_df = bertmodels_obj.comments_df[:set_num_comments].copy()
        topic_prep_comments = bertmodels_obj.topic_preprocess(self.comments_df, 'comment')
        
        #5. topic modeling for comments
        print('********Topic Modeling for Comments*********')
        bertmodels_obj.topic_modeling(topic_prep_comments, 'comment_word_token', visualize=False)
        topic_prep_comments['topics'] = topic_extractor(bertmodels_obj.model.get_topic_info(), bertmodels_obj.topics)
        self.bert_comments = topic_prep_comments.copy()
        print('********DONE: Topic Modeling for Comments*********')
        
        #6. sentiment analysis for comments
        print('********Sentiment Analysis for Comments*********')
        bertmodels_obj.sentiment_preprocess(topic_prep_comments, "comment")
        print('********DONE: Sentiment Analysis for Comments*********')
        
        #7. getting relevance score
        print('********Relevance Scores*********')
        relevance = Relevance()
        self.data = topic_prep_posts.merge(topic_prep_comments, on='post_id', how='left')
        print(self.data['post_id'].unique())
        print(self.data['post_id'].nunique())
        # print(self.data['comment'])
        # print(self.data['comment'].unique())
        # print(self.data.columns)
        relevance.generate_relevance(self.data)
        self.res_df = relevance.df
        print('********DONE: Relevance Scores*********')
        # relevance.save_to_file(file_name='relevance_1.csv')

    def get_posts_data(self):
        return self.posts_df
    
    def get_comments_data(self):
        return self.comments_df
    
    def get_processed_posts(self):
        return self.bert_posts
    
    def get_processed_comments(self):
        return self.bert_comments
    
    def get_result_df(self):
        return self.res_df
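
The left merge in step 7 can be checked on toy data. The sketch below uses made-up post and comment frames (the column values are assumptions, only `post_id`, `comment` and `topics` mirror names used above). It shows that a left merge keeps posts without comments as rows with a missing `comment`, and that a column present in both frames, such as `topics`, comes back with pandas' default `_x` and `_y` suffixes.

```python
import pandas as pd

# Made-up stand-ins for the preprocessed posts and comments frames.
posts = pd.DataFrame({
    "post_id": ["p1", "p2"],
    "title": ["First post", "Second post"],
    "topics": ["0_algorithms", "1_careers"],
})
comments = pd.DataFrame({
    "post_id": ["p1", "p1"],
    "comment": ["Nice write-up", "I disagree"],
    "topics": ["0_thanks", "-1_misc"],
})

# Left merge: one row per (post, comment) pair, plus one row for each
# post that has no comments at all.
merged = posts.merge(comments, on="post_id", how="left")

n_posts = merged["post_id"].nunique()
uncommented = merged.loc[merged["comment"].isna(), "post_id"].tolist()
```

Here `merged` has three rows, `uncommented` lists `p2`, and the two `topics` columns are named `topics_x` (posts) and `topics_y` (comments). Passing `suffixes=` to `merge` would give them clearer names, provided downstream code is updated to match.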

In [3]:
# temp = models.model.get_topic_info()
# di = temp.set_index('Topic').to_dict()['Name']
# topics = list((pd.Series(models.topics)).map(di))
# topics
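
The id-to-name mapping sketched in the commented cell above (and used by `topic_extractor`) can be tried without training a model. This is a minimal illustration on a toy table: the topic names are invented, and only the `Topic` and `Name` columns of the table returned by `get_topic_info()` are assumed.

```python
import pandas as pd

# Toy stand-in for the topic info table; topic -1 is the outlier topic.
topic_info = pd.DataFrame({
    "Topic": [-1, 0, 1],
    "Name": ["-1_the_and_of", "0_algorithm_sort_graph", "1_job_degree_career"],
})

# One topic id per document, in document order.
doc_topics = [0, 1, -1, 0]

# Build {topic id: topic name} and look up each document's topic name.
topic_mapping = topic_info.set_index("Topic")["Name"].to_dict()
topic_names = pd.Series(doc_topics).map(topic_mapping).tolist()
```

The result has one name per document, in the same order as `doc_topics`, which is why it can be assigned directly as a new `topics` column.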

In [None]:
sub = SubredditAnalysis('computerscience', 'hot')

********Preprocessing DataFrame for Topic Modeling*********
Fill NaNs
Remove URLs
Expand Contractions
Make Lowercase
Tokenize
Filter Stopwords
Lemmatization
********DONE: Preprocessing for Topic Modeling*********
********Topic Modeling for Posts*********
Number of entries being modeled: 417
Intiailizing model and training


Batches:   0%|          | 0/14 [00:00<?, ?it/s]

2022-02-21 22:20:54,123 - BERTopic - Transformed documents to Embeddings
2022-02-21 22:21:04,366 - BERTopic - Reduced dimensionality with UMAP
2022-02-21 22:21:04,403 - BERTopic - Clustered UMAP embeddings with HDBSCAN


********DONE: Topic Modeling for Posts*********
********Sentiment Analysis for Posts*********
********Preprocessing DataFrame for Sentiment Analysis*********
Fill NaNs
Remove URLs
Expand Contractions
Remove escape characters.
********DONE: Sentiment Analysis for Posts*********
********Preprocessing DataFrame for Topic Modeling*********
Fill NaNs
Remove URLs
Expand Contractions
Make Lowercase
Tokenize
Filter Stopwords
Lemmatization
********DONE: Preprocessing for Topic Modeling*********
********Topic Modeling for Comments*********
Number of entries being modeled: 500
Intiailizing model and training


Batches:   0%|          | 0/16 [00:00<?, ?it/s]

2022-02-21 22:21:32,158 - BERTopic - Transformed documents to Embeddings
2022-02-21 22:21:35,716 - BERTopic - Reduced dimensionality with UMAP
2022-02-21 22:21:35,761 - BERTopic - Clustered UMAP embeddings with HDBSCAN


********DONE: Topic Modeling for Comments*********
********Sentiment Analysis for Comments*********
********Preprocessing DataFrame for Sentiment Analysis*********
Fill NaNs
Remove URLs
Expand Contractions
Remove escape characters.
********DONE: Sentiment Analysis for Comments*********
********Relevance Scores*********
['su3ybe' 'su6m6s' 'su7zf4' 'st08lc' 'stb8pj' 'stdryq' 'ssv8zr' 'st2hhc'
 'ssursj' 'ss0efg' 'ssagax' 'sspvav' 'srqmpy' 'sraf70' 'srw89o' 'srsbj1'
 'srbdx2' 'sqt9uf' 'sr998x' 'squ1lg' 'sr0w0y' 'sq9tve' 'sq0aef' 'spnpvh'
 'spwb7n' 'sp7b3q' 'spay93' 'soq68k' 'sob87b' 'soiieq' 'sofw4w' 'soamn6'
 'so25ut' 'sn86ck' 'sndcd7' 'sndnjm' 'snauwt' 'sn8ppi' 'smanmb' 'sm1pz4'
 'smdequ' 'sm9s5c' 'slcgd1' 'slotas' 'slsxv7' 'sl4tdf' 'skltie' 'skp5fm'
 'sksa9m' 'sk7puv' 'skkkmf' 'sk80jv' 'sjn3gy' 'skhmpy' 'sjasw0' 'sjuhh6'
 'siux7r' 'sj5m8j' 'siry8q' 'sicshh' 'siw69v' 'sj47go' 'shzu1j' 'siaup5'
 'si9m1h' 'sh0rr0' 'shath4' 'sgvyvn' 'shfux4' 'sg4epv' 'sf7cyl' 'sf5rtw'
 'secwcr' 'sf5gj9' 'sf

[0]


In [None]:
sub.posts_df.head()

In [None]:
sub.res_df[['title','comment','relevance']]