<center> <h2><b> Sentiment Analysis on Reddit Vaccine Myths</b></h2></center>

**Observations:**
1. Based on the spaCyTextBlob results, about 40% of titles and roughly half of the body texts are classified as positive.
2. The remaining posts are split nearly equally between negative and neutral.


In [None]:
!pip install spacytextblob
!python -m spacy download en_core_web_sm
!pip install pywaffle

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import missingno as msno
from pywaffle import Waffle
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Here I perform sentiment analysis only on the title and body columns. As part of data cleaning, I have replaced titles that are exactly "Comment" with an empty string, because that value occurs many times.

In [None]:
data = pd.read_csv('../input/reddit-vaccine-myths/reddit_vm.csv')
data = data[['title','body']].fillna('')
data['title'] = data['title'].replace(to_replace='Comment',value='')
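To make the cleaning step above concrete, here is a small sketch using made-up titles (not rows from the dataset). It shows that `Series.replace` only swaps cells whose whole value equals `'Comment'`, whereas `Series.str.replace` would be needed to strip the word from inside a longer title.

```python
import pandas as pd

# Hypothetical titles, for illustration only.
titles = pd.Series(['Comment', 'Comment on vaccines', 'Vaccine myths'])

# Series.replace swaps whole values that are equal to 'Comment'.
whole_value = titles.replace(to_replace='Comment', value='')

# Series.str.replace removes the substring wherever it appears.
substring = titles.str.replace('Comment', '', regex=False)

print(whole_value.tolist())
print(substring.tolist())
```

Whole-value replacement leaves a title such as "Comment on vaccines" untouched, which is the behaviour of the cell above.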

I have used spaCyTextBlob for sentiment classification. It gives us the polarity, subjectivity and assessments for each text it processes.

Polarity is a value in the range [-1, 1]. A polarity below 0 is treated as negative, a polarity above 0 as positive, and a polarity of exactly 0 as neutral.
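The thresholds described above can be sketched as a small function. The name `polarity_label` and the sample polarity values are assumptions made for illustration and do not come from the dataset.

```python
def polarity_label(polarity: float) -> str:
    """Map a polarity in [-1, 1] to a sentiment label."""
    if polarity < 0:
        return 'negative'
    if polarity > 0:
        return 'positive'
    return 'neutral'


# Illustrative polarity values, one per class.
labels = [polarity_label(p) for p in (-0.4, 0.0, 0.7)]
print(labels)
```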

In [None]:
'''import spacy
from spacytextblob.spacytextblob import SpacyTextBlob

# Load the pipeline once; reloading it for every row is very slow.
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('spacytextblob')

polarity = []
subjectivity = []
assessments = []
for i in range(data.shape[0]):
    print(f'Processing row no {i}')
    text = data['body'].iloc[i]
    doc = nlp(text)
    polarity.append(doc._.polarity)
    subjectivity.append(doc._.subjectivity)
    assessments.append(doc._.assessments) '''

I have dumped all the results into an assessments file as JSON. Let us now load the file and visualise the results.
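As a rough sketch of what such a dump and reload can look like, the example below writes a tiny, made-up result set to a temporary file and reads it back. The file name and the sample entries are assumptions. Only the two key names are taken from this notebook.

```python
import json
import os
import tempfile

# Made-up results; each assessment has the shape (words, polarity, subjectivity, label).
results = {
    'assessments_title': [[(['good'], 0.7, 0.6, None)], []],
    'assessments_body': [[(['bad'], -0.7, 0.67, None)]],
}

# Hypothetical file name inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'assessments_example.json')
with open(path, 'w') as fh:
    json.dump(results, fh)

with open(path) as fh:
    loaded = json.load(fh)

# JSON has no tuple type, so each assessment comes back as a nested list.
first = loaded['assessments_title'][0][0]
print(first)
```

Because tuples are turned into lists on the way out, the reloaded entries can be indexed positionally: index 0 holds the words and index 1 holds the polarity.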

In [None]:
import json
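# Use a context manager so the file is closed after loading.
with open('../input/sentimentanalysisresults/assessments_full.json') as file:
    assessments_result = json.load(file)

# Each non-empty entry is a list of assessments; ele[0] is the first one, shaped [words, polarity, subjectivity, ...].
# Below, the first word of that assessment is kept and bucketed by the assessment's polarity.
title_sentiment_words = [ele for ele in assessments_result['assessments_title'] if len(ele)!=0]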
title_positive_words = [ele[0][0][0] for ele in title_sentiment_words if ele[0][1] > 0]
title_negative_words = [ele[0][0][0] for ele in title_sentiment_words if ele[0][1] < 0]
title_neutral_words = [ele[0][0][0] for ele in title_sentiment_words if ele[0][1] == 0]

body_sentiment_words = [ele for ele in assessments_result['assessments_body'] if len(ele)!=0]
body_positive_words = [ele[0][0][0] for ele in body_sentiment_words if ele[0][1] > 0]
body_negative_words = [ele[0][0][0] for ele in body_sentiment_words if ele[0][1] < 0]
body_neutral_words = [ele[0][0][0] for ele in body_sentiment_words if ele[0][1] == 0]

In [None]:
ratio_positive_title = round((len(title_positive_words)/len(title_sentiment_words)),2)
ratio_negative_title = round((len(title_negative_words)/len(title_sentiment_words)),2)
ratio_neutral_title = round((len(title_neutral_words)/len(title_sentiment_words)),2)
percents_df = pd.DataFrame.from_dict({'ratio':'ratio','positive_ratio':[ratio_positive_title],'negative_ratio':[ratio_negative_title],'neutral_ratio':[ratio_neutral_title]})
percents_df = percents_df.set_index('ratio')

ratio_positive_body = round((len(body_positive_words)/len(body_sentiment_words)),2)
ratio_negative_body = round((len(body_negative_words)/len(body_sentiment_words)),2)
ratio_neutral_body = round((len(body_neutral_words)/len(body_sentiment_words)),2)
percents_body_df = pd.DataFrame.from_dict({'ratio':'ratio','positive_ratio':[ratio_positive_body],'negative_ratio':[ratio_negative_body],'neutral_ratio':[ratio_neutral_body]})
percents_body_df = percents_body_df.set_index('ratio')
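The ratio cell above repeats the same three computations for titles and bodies. One way to factor that out is sketched below. The helper name `sentiment_ratios` and the sample entries are hypothetical, and the sketch assumes each entry is a list of `[words, polarity, subjectivity, label]` assessments, as in the loaded JSON.

```python
def sentiment_ratios(assessments):
    """Return rounded (positive, negative, neutral) shares of the non-empty entries."""
    non_empty = [ele for ele in assessments if len(ele) != 0]
    if not non_empty:
        return 0.0, 0.0, 0.0
    # Classify each entry by the polarity of its first assessment.
    first_polarities = [ele[0][1] for ele in non_empty]
    total = len(first_polarities)
    positive = sum(p > 0 for p in first_polarities) / total
    negative = sum(p < 0 for p in first_polarities) / total
    neutral = sum(p == 0 for p in first_polarities) / total
    return round(positive, 2), round(negative, 2), round(neutral, 2)


# Made-up entries: two positive, one negative, one neutral, one empty.
example = [
    [[['good'], 0.7, 0.6, None]],
    [[['bad'], -0.7, 0.67, None]],
    [[['good'], 0.7, 0.6, None]],
    [[['other'], 0.0, 0.0, None]],
    [],
]
ratios = sentiment_ratios(example)
print(ratios)
```

As in the cell above, empty entries are excluded from the denominator, so the shares describe only posts with at least one assessment.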

# **Visualisation from Titles:**

In [None]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(1,1,figsize=(10, 2.5))

ax.barh(percents_df.index, percents_df['positive_ratio'], 
        color='#0000FF', alpha=0.9)
ax.barh(percents_df.index, percents_df['negative_ratio'], left=percents_df['positive_ratio'], 
        color='#221f1f', alpha=0.9)
ax.barh(percents_df.index, percents_df['neutral_ratio'], left=percents_df['positive_ratio']+percents_df['negative_ratio'],
        color='#C0C0C0', alpha=0.9)

# ax.set_xlim(0, 1)
ax.set_xticks([])
ax.set_yticks([])


for i in percents_df.index:
    pos = percents_df['positive_ratio'][i]
    neg = percents_df['negative_ratio'][i]
    neu = percents_df['neutral_ratio'][i]

    # (label, ratio, x midpoint of the bar segment, text colour)
    segments = [('Positives', pos, pos/2, 'white'),
                ('Negatives', neg, pos + neg/2, 'white'),
                ('Neutrals', neu, pos + neg + neu/2, 'black')]

    for label, ratio, x_mid, text_color in segments:
        # round() avoids float truncation such as int(0.29*100) == 28
        ax.annotate(f"{int(round(ratio*100))}%", 
                       xy=(x_mid, i),
                       va = 'center', ha='center',fontsize=40, fontweight='light', fontfamily='sans-serif',
                       color=text_color)
        ax.annotate(label, 
                       xy=(x_mid, -0.25),
                       va = 'center', ha='center',fontsize=15, fontweight='light', fontfamily='sans-serif',
                       color=text_color)


fig.text(0.125,0.92,'Sentiment Distribution from Titles Based on Polarity', fontfamily='sans-serif',fontsize=15, fontweight='bold')  

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)
    

ax.legend().set_visible(False)
plt.show()

fig = plt.figure(figsize=(5, 1),dpi=150,facecolor='#fafafa',
    FigureClass=Waffle,
    rows=1,
    values=[4, 6],
    colors=['#4C8CB5', "lightgray"],
    icons='thumbs-up',
    font_size=20,vertical=False,
)

fig.text(0,1,'Four in every Ten Titles are positive',fontfamily='monospace',fontsize=10)

fig = plt.figure(figsize=(5, 1),dpi=150,facecolor='#fafafa',
    FigureClass=Waffle,
    rows=1,
    values=[3, 7],
    colors=['#FF0000', "lightgray"],
    icons='thumbs-down',
    font_size=20,vertical=False,
)

fig.text(0,1,'Three in every Ten Titles are Negative',fontfamily='monospace',fontsize=10)
plt.show()

> Forty percent of titles are detected as positive, whereas the remaining sixty percent is split nearly equally between negative and neutral.
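The waffle charts above hard-code `values=[4, 6]` and `values=[3, 7]`. A minimal sketch of deriving such pairs from a ratio is given below, assuming ten icons per row. `waffle_values` is a hypothetical helper, and the ratios passed to it are illustrative.

```python
def waffle_values(ratio, icons=10):
    """Split `icons` into [highlighted, remaining] counts for a share in [0, 1]."""
    highlighted = round(ratio * icons)
    return [highlighted, icons - highlighted]


print(waffle_values(0.4))
print(waffle_values(0.3))
```

The returned list could be passed as the `values` argument of the `Waffle` figure, so the icons stay in step with the computed ratios.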

In [None]:
from wordcloud import WordCloud
import random
from PIL import Image
import matplotlib
import requests

def word_cloud_string(words_list):
    # Join the words into one space-separated string for WordCloud.generate
    return ' '.join(words_list)
    
def color_map(colors_list):
    cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", colors_list)
    return cmap

def mask_generation(img_url):
    # Download the image and keep the luminance channel of the 'LA' (grayscale + alpha) conversion
    mask = np.array(Image.open(requests.get(img_url, stream=True).raw).convert('LA'))
    mask = mask[:,:,0]

    # WordCloud masks out pure white (255), so map the near-white value 251 to 255
    maskable_image = np.where(mask == 251, 255, mask).astype(np.int32)
    return maskable_image

# **Title Positive Words WordCloud**

In [None]:
maskable_image = mask_generation('https://raw.githubusercontent.com/vineethbabu/coronavaccine_images/main/positive_review.png')
cmap = color_map(['#4C8CB5','#b20710'])
wordcloud = WordCloud(background_color = 'white', max_font_size=100,relative_scaling=0,colormap=cmap, max_words = 5000, mask = maskable_image, contour_width=3, contour_color='steelblue').generate(word_cloud_string(title_positive_words))

plt.figure( figsize=(8,8))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# **Title Negative Words WordCloud**

In [None]:
maskable_image = mask_generation('https://raw.githubusercontent.com/vineethbabu/coronavaccine_images/main/negative_review.png')
cmap = color_map(['#000000','#b20710'])
wordcloud = WordCloud(background_color = 'white', max_font_size=250,relative_scaling=0,colormap=cmap, max_words = 5000, mask = maskable_image, contour_width=3, contour_color='black').generate(word_cloud_string(title_negative_words))

plt.figure( figsize=(8,8))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# **Title Neutral Words WordCloud**

In [None]:
maskable_image = mask_generation('https://raw.githubusercontent.com/vineethbabu/coronavaccine_images/main/neutral_review.png')
cmap = color_map(['#000000','#b20710'])
wordcloud = WordCloud(background_color = 'white', max_font_size=250,relative_scaling=0,colormap=cmap, max_words = 5000, mask = maskable_image, contour_width=3, contour_color='darkgrey').generate(word_cloud_string(title_neutral_words))

plt.figure( figsize=(8,8))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# **Visualisations from Body**

In [None]:
fig, ax = plt.subplots(1,1,figsize=(10, 2.5))

ax.barh(percents_body_df.index, percents_body_df['positive_ratio'], 
        color='#0000FF', alpha=0.9)
ax.barh(percents_body_df.index, percents_body_df['negative_ratio'], left=percents_body_df['positive_ratio'], 
        color='#221f1f', alpha=0.9)
ax.barh(percents_body_df.index, percents_body_df['neutral_ratio'], left=percents_body_df['positive_ratio']+percents_body_df['negative_ratio'],
        color='#C0C0C0', alpha=0.9)

# ax.set_xlim(0, 1)
ax.set_xticks([])
ax.set_yticks([])


for i in percents_body_df.index:
    pos = percents_body_df['positive_ratio'][i]
    neg = percents_body_df['negative_ratio'][i]
    neu = percents_body_df['neutral_ratio'][i]

    # (label, ratio, x midpoint of the bar segment, text colour)
    segments = [('Positives', pos, pos/2, 'white'),
                ('Negatives', neg, pos + neg/2, 'white'),
                ('Neutrals', neu, pos + neg + neu/2, 'black')]

    for label, ratio, x_mid, text_color in segments:
        # round() avoids float truncation such as int(0.29*100) == 28
        ax.annotate(f"{int(round(ratio*100))}%", 
                       xy=(x_mid, i),
                       va = 'center', ha='center',fontsize=40, fontweight='light', fontfamily='sans-serif',
                       color=text_color)
        ax.annotate(label, 
                       xy=(x_mid, -0.25),
                       va = 'center', ha='center',fontsize=15, fontweight='light', fontfamily='sans-serif',
                       color=text_color)


fig.text(0.125,0.92,'Sentiment Distribution from Body Based on Polarity', fontfamily='sans-serif',fontsize=15, fontweight='bold')  

for s in ['top', 'left', 'right', 'bottom']:
    ax.spines[s].set_visible(False)
    

ax.legend().set_visible(False)
plt.show()

fig = plt.figure(figsize=(5, 1),dpi=150,facecolor='#fafafa',
    FigureClass=Waffle,
    rows=1,
    values=[5, 5],
    colors=['#4C8CB5', "lightgray"],
    icons='thumbs-up',
    font_size=20,vertical=False,
)

fig.text(0,1,'Five in every Ten Body Texts are positive',fontfamily='monospace',fontsize=10)


fig = plt.figure(figsize=(5, 1),dpi=150,facecolor='#fafafa',
    FigureClass=Waffle,
    rows=1,
    values=[3, 7],
    colors=['#FF0000', "lightgray"],
    icons='thumbs-down',
    font_size=20,vertical=False,
)

fig.text(0,1,'Three in every Ten Body Texts are Negative',fontfamily='monospace',fontsize=10)
plt.show()

> Nearly half of the body texts are detected as positive, whereas in the remaining half negatives and neutrals again have a nearly equal share.

# **Body Positive Words - WordCloud**

In [None]:
maskable_image = mask_generation('https://raw.githubusercontent.com/vineethbabu/coronavaccine_images/main/positive_review.png')
cmap = color_map(['#4C8CB5','#b20710'])
wordcloud = WordCloud(background_color = 'white', max_font_size=100,relative_scaling=0,colormap=cmap, max_words = 5000, mask = maskable_image, contour_width=3, contour_color='darkblue').generate(word_cloud_string(body_positive_words))

plt.figure( figsize=(8,8))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# **Body Negative Words - WordCloud**

In [None]:
maskable_image = mask_generation('https://raw.githubusercontent.com/vineethbabu/coronavaccine_images/main/negative_review.png')
cmap = color_map(['#4C8CB5','#b20710'])
wordcloud = WordCloud(background_color = 'white', max_font_size=100,relative_scaling=0,colormap=cmap, max_words = 5000, mask = maskable_image, contour_width=3, contour_color='black').generate(word_cloud_string(body_negative_words))

plt.figure( figsize=(8,8))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()

# **Body Neutral Words - WordCloud**

In [None]:
maskable_image = mask_generation('https://raw.githubusercontent.com/vineethbabu/coronavaccine_images/main/neutral_review.png')
cmap = color_map(['#4C8CB5','#b20710'])
wordcloud = WordCloud(background_color = 'white', max_font_size=100,relative_scaling=0,colormap=cmap, max_words = 5000, mask = maskable_image, contour_width=3, contour_color='darkgrey').generate(word_cloud_string(body_neutral_words))

plt.figure( figsize=(8,8))
plt.imshow(wordcloud, interpolation = 'bilinear')
plt.axis('off')
plt.tight_layout(pad=0)
plt.show()