<a href="https://colab.research.google.com/github/sabudev/CAIF/blob/main/CAIF_Module_4_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Wordclouds

Wordclouds are an excellent way to summarize textual information such as reviews, customer feedback, and documents. The first part of this exercise focuses on creating a word cloud from the text descriptions in the wine dataset that you saw earlier in Part 1.

In [None]:
!pip install wordcloud

In [None]:
# Start with loading all necessary libraries
import numpy as np
import pandas as pd
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from nltk import FreqDist

import matplotlib.pyplot as plt
from nltk.tokenize import RegexpTokenizer

**Load the Data files**

In [None]:
! git clone https://github.com/vibsabhishek/EP290.git

In [None]:
# Load in the dataframe
df = pd.read_csv("EP290/winemag-data-2500.csv", index_col=0)

In [None]:
df.head()

In [None]:
print(df.description[0])

In [None]:
# Start with one review:
text = df.description[0]

# Create and generate a word cloud image:
wordcloud = WordCloud().generate(text)

# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
print(text)

In [None]:
# Save the image to a file in the current working directory:
wordcloud.to_file("first_review.png")

In [None]:
# Combine all the descriptions into one big text variable
text = " ".join(description for description in df.description)
# len(text) would count characters, so split on whitespace to count words
print("There are {} words in the combination of all reviews.".format(len(text.split())))

In [None]:
tokenizer = RegexpTokenizer(r'\w+')
tokens = tokenizer.tokenize(text)
freq = FreqDist(tokens)
freq.plot(50)

In [None]:
plt.rcParams["figure.figsize"] = (10,10)
freq.plot(50)

## Q1. Plot the distribution of the entire corpus

In [None]:
freq.plot()

## Zipf's law

Word frequencies in text follow Zipf's law, a scale-free (power-law) distribution: a word's frequency is roughly inversely proportional to its frequency rank. This is characteristic of any naturally occurring text corpus, irrespective of language.
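
As a rough illustration of the rank-frequency relationship, the sketch below ranks the words of a tiny made-up sentence by frequency. The helper name `rank_frequency`, the sample sentence, and the lowercasing step are assumptions made for this example and are not part of the notebook; a real corpus such as the wine descriptions is needed to see the power-law shape clearly.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Lowercasing is an assumption of this sketch; the notebook does not do it.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Hypothetical sample text, used only for illustration
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)

for rank, word, count in table:
    # Under an ideal Zipf distribution, rank * count stays roughly constant.
    print(rank, word, count, rank * count)
```

On a large corpus, plotting count against rank on log-log axes should give an approximately straight line.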

## Q2. What is the most commonly occurring word?


### Stopwords

Some words, such as "the" and "and", occur frequently but add little value because they are not unique to the context. In addition, we might wish to remove words that are common in a specific context, e.g. "wine" and "drink".

In [None]:
# Create stopword list:
stopwords = set(STOPWORDS)
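
To see how removing stopwords changes the most frequent words, here is a minimal sketch using only the standard library. The stopword set `DEMO_STOPWORDS`, the helper `top_words`, and the sample review are hypothetical and chosen only for illustration; the notebook itself uses the much larger `STOPWORDS` set from the `wordcloud` package.

```python
from collections import Counter
import re

# Hypothetical, hand-picked stopword set for illustration only
DEMO_STOPWORDS = {"the", "and", "a", "of", "with", "is", "this", "wine", "drink"}

def top_words(text, stopwords, n=3):
    """Return the n most common tokens that are not in the stopword set."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [token for token in tokens if token not in stopwords]
    return Counter(kept).most_common(n)

# Made-up review text, not taken from the dataset
review = "This wine is ripe and fruity, a wine with ripe cherry and a hint of oak."

print("Without stopword removal:", top_words(review, set()))
print("With stopword removal:   ", top_words(review, DEMO_STOPWORDS))
```

With the stopwords removed, context-specific but uninformative words such as "wine" no longer dominate the counts.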

## Q3. Add a set of stopwords specific to the wine descriptions and generate another wordcloud

In [None]:
stopwords.update(["wine"])

In [None]:
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white", width=1000, height=800).generate(text)

# Display the generated image:
# the matplotlib way:
plt.rcParams["figure.figsize"] = (20,20)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

## Apply an image mask

In [None]:
wine_mask = np.array(Image.open("EP290/winemask.png"))
wine_mask

In [None]:
# Transform the mask into one that works with WordCloud: pixels equal to 0
# become 255 (white), which WordCloud treats as "masked out"; all other
# values are kept unchanged. np.where does this in one vectorized step.
transformed_wine_mask = np.where(wine_mask == 0, 255, wine_mask).astype(np.int32)

# Check the expected result of your mask
transformed_wine_mask

# Create a word cloud image
wc = WordCloud(background_color="white", mask=transformed_wine_mask,
               stopwords=stopwords, contour_width=1, contour_color='firebrick', width=1000, height=800)

# Generate a wordcloud
wc.generate(text)

# store to file
wc.to_file("wine.png")

# show
plt.figure(figsize=[10,10])
plt.imshow(wc)
plt.axis("off")
plt.show()

## Q4. Create the word cloud for the abstracts in the COVID data (in file: EP290/covid19_small.csv)

In [None]:
df = pd.read_csv("EP290/covid19_small.csv", index_col=0)

In [None]:
df.abstract[1]


# Sentiment analysis

Here we will try to determine the sentiment of various texts, a very common application of NLP.

## Import packages
Make sure you have installed ***sklearn***, ***matplotlib*** and ***numpy*** if you are using your local machine.

In [None]:
!pip install -U textblob
!pip install vaderSentiment
!python -m textblob.download_corpora

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from os import path
import seaborn as sns
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from sklearn.metrics import confusion_matrix, precision_score, precision_recall_curve, recall_score, f1_score, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split 

## **Sentiment Analysis using VADER**

*   For compound >= 0.05 - Positive
*   For compound <= -0.05 - Negative
*   else Neutral
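
The thresholds above can be sketched as a small helper that maps a compound score to a label. The function name `label_from_compound` is an assumption made for this example; the cut-offs are treated as inclusive (>= 0.05 and <= -0.05), following the convention described in the VADER README.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hypothetical compound scores, for illustration only
for score in (0.8, 0.0, -0.6):
    print(score, label_from_compound(score))
```

In the notebook, the input would be the `'compound'` entry of the dictionary returned by `vader.polarity_scores(...)`.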



In [None]:
vader = SentimentIntensityAnalyzer()
text_sentiment = vader.polarity_scores("VADER is amazingly simple to use. What great fun!")
print(text_sentiment)

In [None]:
text_sentiment = vader.polarity_scores("VADER is terrible to use. What a shame!")
print(text_sentiment)

In [None]:
text_sentiment = vader.polarity_scores("He is not going to hate that.")
print(text_sentiment)

## **Sentiment analysis on IMDB review dataset**

In [None]:
df = pd.read_excel("EP290/IMDB_Dataset_small.xls", header=0)

In [None]:
df.head()

In [None]:
len(df)

## Q5. Show the distribution of words contained in the IMDB reviews.

In [None]:
# Some helper functions to measure sentiment
def detect_vader_pos(text):
    return vader.polarity_scores(text)['pos']

def detect_vader_neg(text):
    return vader.polarity_scores(text)['neg']

def detect_vader_comp(text):
    return vader.polarity_scores(text)['compound']

In [None]:
print("Review:", df.review[0],"\n Sentiment:", df.sentiment[0])

### Compute Sentiment for the entire dataset using VADER

In [None]:
vader = SentimentIntensityAnalyzer()
df['vader_pos'] = df.review.apply(detect_vader_pos)
df['vader_neg'] = df.review.apply(detect_vader_neg)
df['vader_comp'] = df.review.apply(detect_vader_comp)

In [None]:
#Visualize the results
plt.figure(figsize=[4,4])

ax = sns.violinplot(x="sentiment", y="vader_comp", data=df)

In [None]:
plt.figure(figsize=[4,4])
ax = sns.violinplot(x="sentiment", y="vader_pos", data=df)

In [None]:
plt.figure(figsize=[4,4])
ax = sns.violinplot(x="sentiment", y="vader_neg", data=df)

In [None]:
v_pred = np.where(df['vader_comp'] > 0.0, "positive", "negative")

In [None]:
print("Confusion Matrix:\n", confusion_matrix(df.sentiment, v_pred))
# Note: with average='micro' on single-label data, the F1 score equals the accuracy
print("F1 score:", f1_score(df.sentiment, v_pred, average='micro'))
print("Accuracy score:", accuracy_score(df.sentiment, v_pred))

## **Creating our own classifier**

## Count Vectorizer
- Run the count vectorizer on the reviews
- Plot histograms of counts, etc.
- Vary the parameters of CountVectorizer and show how the histograms change
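
To build some intuition before using scikit-learn, the following is a pure-Python sketch of the bag-of-words idea behind a count vectorizer. It is not scikit-learn's implementation: the helpers `fit_vocabulary` and `transform` and the three sample documents are hypothetical, and the token pattern (words of two or more characters) is chosen to resemble `CountVectorizer`'s default behavior.

```python
from collections import Counter
import re

# Words of two or more characters (an assumption of this sketch)
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def fit_vocabulary(docs, min_df=1, stop_words=frozenset()):
    """Map each term that appears in at least min_df documents to a column index."""
    doc_freq = Counter()
    for doc in docs:
        # A set, so that each term is counted at most once per document
        tokens = set(TOKEN_PATTERN.findall(doc.lower())) - set(stop_words)
        doc_freq.update(tokens)
    kept = sorted(term for term, freq in doc_freq.items() if freq >= min_df)
    return {term: idx for idx, term in enumerate(kept)}

def transform(docs, vocabulary):
    """Turn each document into a list of term counts, one column per vocabulary term."""
    rows = []
    for doc in docs:
        counts = Counter(TOKEN_PATTERN.findall(doc.lower()))
        rows.append([counts.get(term, 0) for term in vocabulary])
    return rows

# Made-up documents, for illustration only
docs = ["great movie, great cast", "terrible movie", "great fun"]
vocab = fit_vocabulary(docs, min_df=2)
print(vocab)
print(transform(docs, vocab))
```

Raising `min_df` shrinks the vocabulary, which is one of the parameters worth varying when looking at how the count histograms change.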

### Split into train and test datasets
Here, 70% of the original data is used for training the models, and the rest is used for testing.

In [None]:
train_sample = int(len(df)*0.7)
train = df[:train_sample]
# Start the test set at train_sample (not train_sample+1) so that no row is skipped
test = df[train_sample:]
print('train data size:', len(train))
print('test data size:', len(test))
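
A positional split like this one depends on the order of the rows in the file. If the rows happened to be sorted (for example by label), shuffling before splitting would give more representative subsets. The sketch below shows one way to do that with the standard library; the helper `shuffled_split`, the seed value, and the toy data are assumptions for illustration and are not part of the notebook.

```python
import random

def shuffled_split(rows, train_fraction=0.7, seed=42):
    """Shuffle row positions reproducibly, then split them into train and test lists."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = int(len(rows) * train_fraction)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

# Toy data standing in for the review rows
rows = list(range(10))
train_rows, test_rows = shuffled_split(rows)
print("train size:", len(train_rows), "test size:", len(test_rows))
```

Every row ends up in exactly one of the two subsets, and fixing the seed keeps the split reproducible between runs.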

### Create the vector representation of training and testing data

In [None]:
# Encode documents
vectorizer = CountVectorizer(stop_words='english', lowercase=True, min_df=5)
vectorizer.fit(train.review)

#create vector representation
train_vec = vectorizer.transform(train.review)
test_vec = vectorizer.transform(test.review)

In [None]:
print(train.review[2])

In [None]:
print(train_vec[2])

In [None]:
lr_model = LogisticRegression(C=0.1)
lr_model.fit(train_vec, train.sentiment)

In [None]:
lr_pred = lr_model.predict(test_vec)

print("Confusion Matrix:\n", confusion_matrix(test.sentiment, lr_pred))
print("F1 score:", f1_score(test.sentiment, lr_pred, average='micro'))
print("Accuracy:", accuracy_score(test.sentiment, lr_pred))

In [None]:
#Test an example
reviews = [
           "Star Wars! This was the worst movie ever. Total waste of time.",
           "Star Trek! This was the best movie ever. Totally recommended."
]
reviews_vec = vectorizer.transform(reviews)
lr_model.predict(reviews_vec)

## Model analysis: Examine which features are important for positive versus negative

In [None]:
!pip install eli5
!pip install tabulate
!pip install spacy

In [None]:
import eli5

In [None]:
eli5.show_weights(lr_model, top=20, vec=vectorizer)
#eli5.show_prediction(lr_pred, reviews)

In [None]:
eli5.show_prediction(lr_model, df.review[0], vec=vectorizer)



---



In [None]:
!pip install pyfiglet

In [None]:
import pyfiglet
  
result = pyfiglet.figlet_format("That's all folks!!\n\n THANK YOU \nfor being such a sport with the Python notebooks!")
print(result)