CSCI544-Team-7

Comment Summarization System for a Social Media Post

Introduction: Social media platforms and streams, such as Twitter, Facebook, and YouTube, carry user comments that express valuable opinions. Comments vary greatly in quality, and detecting high-quality comments is a subtask of opinion mining and summarization research. At first glance these comments may seem of little use, full of repetition, extreme views, petty arguments, and spam, but when studied closely and analyzed effectively they offer multiple viewpoints and a wide range of experience and knowledge from many different sources. If we can analyze this information correctly, we can exploit this crowd-sourced aggregation to automatically produce a summary that is both brief and indicative of the discussion. Such a summary lets users interact with the data at a higher level, because it gives an overview of the conversation that has taken place. For any particular informative web resource, it is challenging to quickly ascertain the overall themes and thrusts of the mass of user-contributed comments. While some users may be interested in scanning hundreds or thousands of comments, recent years have seen a shift toward guiding users to focus their attention on particular comments. A survey by Potthast et al. (2012) suggests that the most important tasks for understanding the information available in comments are filtering, ranking, and summarizing. Several approaches exist for selectively focusing attention, such as editorial selection, collaborative recommendation, and keyword clouds.
While these and related methods are a first step toward making sense of the large volume of user-contributed comments in social media, the overall goal of this project is to develop the algorithms and methods needed to filter out non-relevant, non-informative content and to present a diversified summary of the relevant content to users. In doing so, we aim to support the continued growth of the social web by improving how users engage with short text in social media.
Related Work: There has been considerable prior research on summarization, on identifying and ranking relevant content, and on diversifying ranking results. Summarization approaches fall into two main kinds: abstractive and extractive. An abstractive summary generates new content from the input text, while an extractive summary picks the segments of the input text that are most representative of the whole, without changing them. Although abstractive methods are desirable for summarization in general, extractive summaries are more appropriate for comment summarization because each summary item can be backed by evidence in the form of an actual user statement. Our proposed algorithm is a non-aspect-based, extractive summarization method. Among related work on extractive summarization, machine learning algorithms such as support vector machines (SVM) and regression models have been used to rank sentences by degree of preference in social communities. However, Wu et al. concluded that combining graph-based algorithms with length yields better results than the SVM regression method. Many of the proposed algorithms are based on graph-based ranking and select the top-K sentences as the summary of the input document(s). Examples include TextRank, MEAD, and LexRank.

3. Method:

3.1 Materials: SenTube is a dataset of user-generated comments on YouTube videos, annotated for information content and sentiment polarity. Its annotations support the development of classifiers for several important NLP tasks: (i) sentiment analysis, (ii) text categorization (relatedness of a comment to the video and/or product), (iii) spam detection, and (iv) prediction of comment informativeness. The corpus covers several languages, such as English, Italian, Spanish, and Dutch. For comment summarization we rely on the Italian corpus, which contains user comments for about 200 YouTube videos. Two product domains are covered: Tablets and Automobiles. For each product, two types of videos are considered: Technical Reviews and Commercials. The corpus includes not only the comments themselves but also links to the corresponding videos, which allows for joint text and multimedia modeling, that is, building combined models of speech, image, video, and text.

3.2 Procedure: We propose to automatically summarize comments by selecting, from a large collection of user-contributed comments, those that are most representative with respect to a resource. At the same time, the selected comments should cover different viewpoints on the associated resource, so that they highlight its various aspects. Our overall approach is to (i) identify groups of thematically related comments by applying topic-based clustering based on Latent Dirichlet Allocation (LDA), and (ii) rank the comments within each cluster using precedence-based ranking, in order to identify the important and informative ones. We define V as the set of all resources in our dataset, V = {v1, v2, ..., vn}. Each resource vi is associated with a set of comments Ci = {c1, c2, ..., cm}, where each cj is a single comment that we treat as a bag of words and m is the total number of comments for that resource. Our goal is to extract a subset SCi ⊂ Ci containing the k most representative comments, SCi = {s1, s2, ..., sk}, based on a ranking of all the comments associated with the resource, where k is a tunable parameter. Since our goal is to summarize a large set of comments for quick understanding, we typically set k ≤ 5, though larger values may be appropriate in some situations.
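The selection of SCi can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the text does not specify how per-cluster rankings are merged into one list of k comments, so the round-robin over clusters used here is a hypothetical choice, as are the names `select_summary`, `cluster_of`, and `score_of`.

```python
from collections import defaultdict


def select_summary(comments, cluster_of, score_of, k=5):
    """Pick up to k comments, cycling over clusters so each topic is represented.

    comments   : list of comment strings (the set C_i for one resource)
    cluster_of : list of cluster ids, one per comment
    score_of   : list of ranking scores, one per comment (higher is better)

    The round-robin merge is an assumption made for this sketch.
    """
    # Group comment indices by cluster, best-scoring first within each cluster.
    by_cluster = defaultdict(list)
    for idx, cluster in enumerate(cluster_of):
        by_cluster[cluster].append(idx)
    for members in by_cluster.values():
        members.sort(key=lambda i: score_of[i], reverse=True)

    # Visit clusters in order of their best comment's score.
    order = sorted(by_cluster, key=lambda c: score_of[by_cluster[c][0]], reverse=True)

    selected = []
    depth = 0  # depth 0 takes each cluster's best comment, depth 1 its second best, ...
    while len(selected) < min(k, len(comments)):
        for cluster in order:
            members = by_cluster[cluster]
            if depth < len(members) and len(selected) < k:
                selected.append(members[depth])
        depth += 1
    return [comments[i] for i in selected]


comments = ["great screen", "screen is sharp", "battery dies fast", "too expensive"]
summary = select_summary(comments, cluster_of=[0, 0, 1, 2],
                         score_of=[0.4, 0.3, 0.2, 0.1], k=3)
print(summary)  # ['great screen', 'battery dies fast', 'too expensive']
```

With k = 3 the second screen comment is skipped in favor of one comment from each of the other two clusters, which is the diversification the paragraph above asks for.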

3.2.1 : Determine clusters of related comments - LDA is a generative model that can be used to identify the underlying topics from which documents are generated. We use LDA to extract a fixed number of topics from the comments associated with a single resource. That is, we have a set of comment “documents” D = {d1, d2, ..., dn} and a set of topics T = {t1, ..., tm}. Any document di can be described by its topic distribution, for example Pr(d1 ∈ t1) = 0.70, Pr(d1 ∈ t2) = 0.20, and so on. We turn the original soft clustering of LDA into a hard clustering by treating each comment c as belonging to a single topic (cluster) r* = argmax_r Pr(t_r | c) = argmax_r Pr(c | t_r) Pr(t_r), where r* is the index of the most probable topic for that comment. The second equality follows from Bayes' rule, since Pr(c) does not depend on r. Hence, the output of the LDA-based topic clustering step is an assignment of each comment to exactly one cluster.
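The soft-to-hard conversion can be illustrated with a short sketch. It assumes the per-comment topic distributions have already been produced by some LDA implementation; the numbers in `doc_topic` are made up for illustration (the first row echoes the 0.70 / 0.20 example above), and the helper names `hard_assign` and `group_by_topic` are hypothetical.

```python
def hard_assign(doc_topic):
    """Map each comment's topic distribution to the index of its most probable topic."""
    # max() returns the first maximal index, so ties go to the lower topic index.
    return [max(range(len(dist)), key=dist.__getitem__) for dist in doc_topic]


def group_by_topic(assignments):
    """Collect comment indices per assigned topic, giving one cluster per topic."""
    clusters = {}
    for doc_idx, topic in enumerate(assignments):
        clusters.setdefault(topic, []).append(doc_idx)
    return clusters


# Illustrative topic distributions Pr(t_r | c) for three comments and three topics.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.15, 0.25, 0.60],
    [0.55, 0.30, 0.15],
]

assignments = hard_assign(doc_topic)
clusters = group_by_topic(assignments)
print(assignments)  # [0, 2, 0]
print(clusters)     # {0: [0, 2], 2: [1]}
```

Topics that win no comment (topic 1 here) simply produce no cluster, so the number of clusters can be smaller than the number of topics.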

3.2.2 : Determining significant in-cluster comments - Our hypothesis is that a comment that references an earlier comment may confer some level of implicit endorsement on it, in essence echoing its ideas. Let the set of short text sentences be S = {s1, s2, ..., sn}, where n is the total number of sentences for a resource. Each si is represented by a bag of words, i.e., si = {t1, t2, ..., tm}, where m is the number of distinct non-stop words. We define a graph G = (V, E) over all the sentences related to a resource, in which the nodes in V (here the node set of the graph, not the set of resources defined earlier) are the sentences and the edges eij ∈ E connect them. There is a link between two nodes if the similarity of the corresponding sentences exceeds a threshold. By running the PageRank algorithm on this graph, we aim to select, from among hundreds of sentences, those that receive the highest number of in-links. We consider a weighted graph in which the edge weights are given by a similarity metric such as the raw number of common terms, the normalized number of common terms, the Jaccard coefficient, or cosine similarity. To calculate the PageRank score PR(si) of a sentence, we sum the scores of all the neighbors pointing to it, each divided by the number of outgoing links of that neighbor. We used 0.85 as our damping factor α.
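The ranking step can be sketched in plain Python as below. This is an illustrative version under several assumptions, not the project's code: it uses the Jaccard coefficient with a threshold of 0.3 (the text does not give a threshold value), tokenizes by whitespace without stop-word removal, treats similarity links as bidirectional, and applies the unweighted update described above, PR(si) = (1 - α)/n + α · Σ PR(sj)/out(sj), with the score of isolated nodes spread uniformly. The function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(sentences, threshold=0.3):
    """Link two sentences when their Jaccard similarity exceeds the threshold."""
    bags = [set(s.lower().split()) for s in sentences]  # no stop-word removal here
    n = len(bags)
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if jaccard(bags[i], bags[j]) > threshold:
                # Similarity is symmetric, so the link counts in both directions.
                neighbors[i].add(j)
                neighbors[j].add(i)
    return neighbors


def pagerank(neighbors, alpha=0.85, iterations=50):
    """Unweighted PageRank by power iteration with damping factor alpha."""
    n = len(neighbors)
    if n == 0:
        return []
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Score held by isolated nodes is redistributed uniformly (an assumption).
        dangling = sum(scores[j] for j in range(n) if not neighbors[j])
        new_scores = []
        for i in range(n):
            incoming = sum(scores[j] / len(neighbors[j]) for j in neighbors[i])
            new_scores.append((1 - alpha) / n + alpha * (incoming + dangling / n))
        scores = new_scores
    return scores


sentences = [
    "the screen is sharp",
    "the screen is sharp and the battery is weak",
    "battery is weak",
]
scores = pagerank(build_graph(sentences, threshold=0.3))
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])  # the screen is sharp and the battery is weak
```

The middle sentence overlaps with both of the others and so collects links from each, while the first and last sentences are not similar enough to link to one another. A weighted variant would split each neighbor's score in proportion to edge similarity instead of evenly.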

vikasnar/CSCI544-Team-7
