# <center> <font size = 24 color = 'steelblue'> <b>Exploration and Visualization

<div class="alert alert-block alert-info">
    
<font size = 4> 

**By the end of this notebook you will be able to:**
- Explore text data
- Visualize text data using a wordcloud

# <a id= 'e0'> 
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#e1)<br>
[2. Download the necessary corpus from NLTK](#e2)<br>
[3. Data extraction](#e3)<br>
>[a. Check the files available in the corpus](#3.a)<br>
>[b. Use the 'Canon_G3.txt'](#3.b)<br>

[4. Data exploration and visualization](#e4)<br>
> [4.1 Basic exploration](#4.1)<br>
> [4.2 Detailed exploration](#4.2)<br>
> [4.3 Word frequency analysis](#4.3)<br>
> [4.4 Wordcloud generation](#4.4)<br>

##### <a id = 'e1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [None]:
import nltk
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud

[top](#e0)

##### <a id = 'e2'>
<font size = 10 color = 'midnightblue'> <b>Download necessary corpus and models from nltk

In [None]:
nltk.download('product_reviews_1')
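nltk.download('punkt')
# Recent NLTK releases (3.9 and later) load the tokenizer data from 'punkt_tab' instead.
nltk.download('punkt_tab')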
nltk.download('stopwords')

<div class="alert alert-block alert-info">
<font size = 4> 
    
<center><b>The product reviews corpus contains multiple reviews for each of several specific products.


[top](#e0)

##### <a id = 'e3'>
<font size = 10 color = 'midnightblue'> <b>Data extraction

<a id = '3.a'>
<font size = 6 color = pwdrblue>  <b>Check the files available in the corpus

In [None]:
file_ids = nltk.corpus.product_reviews_1.fileids()
print(file_ids)

<a id = '3.b'>
<font size = 6 color = pwdrblue>  <b>Use the 'Canon_G3.txt' review file for this exercise

<div class="alert alert-block alert-success">
<font size = 4> 
    
**There are multiple ways to extract text data from the file:**

  - `reviews()` returns the reviews as `Review` objects.
  - `raw()` returns the raw text of the file.
  - `sents()` returns a tokenised list of all sentences across all the reviews.

In [None]:
# Refer to the file by name rather than by its position in file_ids.
canon_file = 'Canon_G3.txt'
reviews = nltk.corpus.product_reviews_1.reviews(canon_file)
raw_text = nltk.corpus.product_reviews_1.raw(canon_file)
review_sentences = nltk.corpus.product_reviews_1.sents(canon_file)

[top](#e0)

##### <a id = 'e4'>
<font size = 10 color = 'midnightblue'> <b>Data Exploration

<a id = '4.1'>
<font size = 6 color = pwdrblue>  <b>Basic exploration

<div class="alert alert-block alert-success">
<font size = 4> 

**Let's start with a simple exploration to check:**
  - The total number of reviews,
  - The total number of review sentences,
  - The title of each review, and
  - The complete text of each review.

In [None]:
print(f"Total no. of reviews available : {len(reviews)}")

In [None]:
print(f"Total review sentences : {len(review_sentences)}")

In [None]:
# Map a label such as "rev_1" to each review's title.
review_titles = {f"rev_{i}" : review.title for i, review in enumerate(reviews, start=1)}
for label, title in review_titles.items():
    print(f"{label} : {title}")

In [None]:
review_texts = []
for review in reviews:
    # Join the tokens of each sentence, then join the sentences with a space
    # so that words at sentence boundaries do not run together.
    sentences = [' '.join(review_line.sent) for review_line in review.review_lines]
    review_texts.append(' '.join(sentences))
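
The assembly step can be tried out without the corpus. The sketch below uses hypothetical stand-in objects (built with `SimpleNamespace`) that only mimic the `review_lines` and `sent` attributes used in this notebook; the helper name `review_to_text` is an assumption made for illustration.

```python
from types import SimpleNamespace

# Hypothetical stand-in for one NLTK review: each review line carries a
# tokenized sentence in `.sent`, mirroring the attribute used in the notebook.
toy_review = SimpleNamespace(
    review_lines=[
        SimpleNamespace(sent=["great", "camera", "."]),
        SimpleNamespace(sent=["battery", "life", "is", "fine", "."]),
    ]
)


def review_to_text(review):
    """Join tokens within each sentence, then join sentences with a space."""
    sentences = (" ".join(line.sent) for line in review.review_lines)
    return " ".join(sentences)


toy_text = review_to_text(toy_review)
print(toy_text)
```

Joining the sentences with a space keeps the last token of one sentence separate from the first token of the next, so a later `split()` counts words correctly.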


In [None]:
print("Complete Review Texts".center(150))
i = 1
for rev in review_texts:
    print(f"Title --> {review_titles[f'rev_{i}']} :\n\n{rev}\n")
    print(f"Summary --> \nLen of review : {len(rev.split())} words.\n")
    i += 1

[top](#e0)

<a id = '4.2'>
<font size = 6 color = pwdrblue>  <b>Detailed exploration

<font size = 5 color = seagreen>  <b>Words per review

In [None]:
# Count whitespace-separated words; len(rev) alone would count characters.
words_per_review = [len(rev.split()) for rev in review_texts]

In [None]:
print(f"Largest review: {max(words_per_review)} words")
print(f"Smallest review: {min(words_per_review)} words")
print(f"Average words per review: {sum(words_per_review)/ len(words_per_review):.0f}")
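
Because each review text is a single string, `len()` on it measures characters, while `len(text.split())` measures words. A minimal sketch of the difference, using made-up review texts rather than corpus data:

```python
from statistics import mean

# Toy review texts (illustrative only, not taken from the corpus).
toy_texts = [
    "great camera .",
    "battery life is fine , zoom is slow .",
    "love it",
]

char_counts = [len(text) for text in toy_texts]          # characters per text
word_counts = [len(text.split()) for text in toy_texts]  # whitespace-separated tokens

print(f"Characters per review: {char_counts}")
print(f"Words per review: {word_counts}")
print(f"Largest: {max(word_counts)}, smallest: {min(word_counts)}, "
      f"mean: {mean(word_counts):.1f}")
```

Note that punctuation tokens such as `.` count as words here, because the texts are built from tokenized sentences.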

<font size = 5 color = seagreen>  <b>Average sentence length analysis

In [None]:
# Each sentence is already a list of tokens, so len() gives its token count.
sent_length = [len(sent) for sent in review_sentences]

In [None]:
print(f"Longest sentence: {max(sent_length)} tokens")
print(f"Shortest sentence: {min(sent_length)} tokens")
print(f"Average sentence length: {sum(sent_length)/ len(sent_length):.0f}")

<font size = 5 color = seagreen>  <b>Review title analysis

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Explore the review titles to understand the sentiment they express.
- For this, we need the titles from the review objects.
- We have already extracted them earlier.

In [None]:
titles = ' '.join(list(review_titles.values()))

In [None]:
# tokenization
tokens = nltk.word_tokenize(titles)

[top](#e0)

<a id = '4.3'>
<font size = 6 color = pwdrblue>  <b>Word frequency distribution

<div class="alert alert-block alert-success">
<font size = 4> 
    
- Analyze the word frequency using the _**FreqDist**_ class of nltk.
- It counts how often each word occurs in the text being analyzed.
- Create a dataframe for ease of analysis.

In [None]:
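# Count each token once and reuse the result for both columns.
token_counts = nltk.FreqDist(tokens)
freq_dist = pd.DataFrame({"words" : list(token_counts.keys()), "freq" : list(token_counts.values())})
freq_dist.sort_values('freq', ascending=False, inplace=True, ignore_index=True)
# Sanity check: the frequencies sum to the total number of tokens.
freq_dist.freq.sum()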
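
`nltk.FreqDist` builds on Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The tokens below are made up for illustration and are not taken from the review titles.

```python
from collections import Counter

# Toy tokens standing in for the tokenized review titles (illustrative only).
toy_tokens = ["great", "camera", ",", "great", "zoom", "the", "camera", "great"]

counts = Counter(toy_tokens)

# (word, frequency) rows sorted by descending frequency, like the sorted dataframe.
rows = counts.most_common()
for word, freq in rows:
    print(f"{word:<8}{freq}")

# The frequencies add up to the number of tokens, which is what freq.sum() checks.
total = sum(counts.values())
print(f"Total tokens: {total}")
```

Punctuation such as `,` is counted like any other token at this stage; it is only filtered out later.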

<font size = 5 color = seagreen> **Stopword analysis**

<div class="alert alert-block alert-success">
<font size = 4> 
    
**Clean the data further to remove the frequently occurring stopwords.**

In [None]:
stp_words = nltk.corpus.stopwords.words('english')

In [None]:
stp_wrd_freq_dist = freq_dist[freq_dist.words.isin(stp_words)].reset_index(drop = True)
stp_wrd_freq_dist[:10]
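
NLTK's English stopword list is lower case, so a capitalized token such as `The` does not match it in a plain membership test. The sketch below shows one way to compare case-insensitively; the stopword set, the frequency pairs, and the helper name `split_by_stopword` are all hypothetical.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
toy_stopwords = {"the", "a", "is", "and", "for"}

# Toy (word, frequency) pairs standing in for rows of freq_dist.
toy_freqs = [("The", 4), ("camera", 3), ("is", 2), ("great", 2), ("a", 1)]


def split_by_stopword(pairs, stopwords):
    """Return (stopword_rows, content_rows), comparing words in lower case."""
    stop_rows, content_rows = [], []
    for word, freq in pairs:
        target = stop_rows if word.lower() in stopwords else content_rows
        target.append((word, freq))
    return stop_rows, content_rows


stop_rows, content_rows = split_by_stopword(toy_freqs, toy_stopwords)
print("Stopwords:", stop_rows)
print("Content words:", content_rows)
```

An alternative with the same effect is to lower-case the text before tokenizing, which also merges counts for `The` and `the`.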

<font size = 5 color = seagreen> <b>Use a bar chart to analyze the most frequent stopwords in the text.
    

In [None]:
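# Plot at most the ten most frequent stopwords.
top_stopwords = stp_wrd_freq_dist.head(10)
plt.figure(figsize = (20,5))
plt.bar(x = top_stopwords.words, height = top_stopwords.freq)
# Label each bar with its frequency.
for i, freq in enumerate(top_stopwords.freq):
    plt.annotate(f"{freq}", xy = (i, freq + 0.15))
# Leave headroom above the tallest bar for the labels.
plt.ylim(0, top_stopwords.freq.max() + 1)
plt.show()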

<font size = 5 color = seagreen> <b>Perform a granular analysis exploring the top 20 words by frequency.

In [None]:
# Despite its name, top_20_words holds every non-stopword token, sorted by frequency.
top_20_words = freq_dist[~(freq_dist.words.isin(stp_words))].reset_index(drop = True)
top_20_words.head(20)

In [None]:
# Plot at most the twenty most frequent non-stopword tokens.
top_words = top_20_words.head(20)
plt.figure(figsize = (20,5))
plt.bar(x = top_words.words, height = top_words.freq)
# Label each bar with its frequency.
for i, freq in enumerate(top_words.freq):
    plt.annotate(f"{freq}", xy = (i, freq + 0.15))
plt.show()

<div class="alert alert-block alert-success">
<font size = 4>

- The plot suggests that terms such as `camera`, `canon`, and `g2`, which simply name the product, are abundant.
- However, they contribute little to the analysis, so they can be removed from the data.


In [None]:
word_list = ['camera', 'canon', 'g2', 'digital']
most_list = top_20_words[~(top_20_words.words.isin(word_list))]

In [None]:
# Keep purely alphabetic tokens; this drops punctuation and tokens containing digits.
most_list = most_list[most_list.words.str.isalpha()]

In [None]:
word_freq = {word : freq for word, freq in zip(most_list.words, most_list.freq)}
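
The three filtering steps (drop product terms, keep alphabetic tokens, build a dictionary) can also be written as a single comprehension. This is a sketch over made-up frequency pairs, not the notebook's dataframe; `toy_rows` and `toy_word_freq` are illustrative names.

```python
# Toy (word, frequency) pairs standing in for the non-stopword rows (illustrative only).
toy_rows = [("camera", 9), ("great", 5), ("canon", 4), ("!", 3), ("g2", 2), ("zoom", 2)]

# Product-name terms to drop, matching the list used in the notebook.
product_terms = {"camera", "canon", "g2", "digital"}

# Keep a word only if it is not a product term and consists solely of letters.
toy_word_freq = {
    word: freq
    for word, freq in toy_rows
    if word not in product_terms and word.isalpha()
}
print(toy_word_freq)
```

`str.isalpha()` already rejects `g2` because it contains a digit, so listing it among the product terms is a safeguard rather than a necessity.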

[top](#e0)

<a id = '4.4'>
<font size = 6 color = pwdrblue>  <b>Wordcloud generation

<div class="alert alert-block alert-success">
<font size = 4>
    
- Use a wordcloud to analyze the frequent terms in the text.
- Wordclouds help identify the theme of the text in use.

In [None]:
wordcloud = WordCloud(width = 800, height = 600, background_color = "white", colormap = 'Accent')
wordcloud.generate_from_frequencies(frequencies=word_freq)
plt.figure()
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
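
To build intuition for how a frequency dictionary drives word sizes, the sketch below ranks the words and scales each frequency relative to the largest one. As far as I can tell this resembles the normalization `generate_from_frequencies` applies before laying out words, but it is an illustration under that assumption, not the library's code; `relative_weights` and its `max_words` parameter are hypothetical names.

```python
def relative_weights(word_freq, max_words=200):
    """Keep the max_words most frequent entries and scale them so the top word is 1.0."""
    if not word_freq:
        raise ValueError("need at least one word to weight")
    ranked = sorted(word_freq.items(), key=lambda item: item[1], reverse=True)[:max_words]
    top = float(ranked[0][1])
    return [(word, freq / top) for word, freq in ranked]


# Toy frequencies (illustrative only).
toy_word_freq = {"great": 5, "zoom": 2, "love": 4}
weights = relative_weights(toy_word_freq)
for word, weight in weights:
    print(f"{word:<6}{weight:.2f}")
```

One practical consequence: if the earlier filtering removes every word, the frequency dictionary is empty and there is nothing to draw, so it is worth checking `len(word_freq)` before generating the cloud.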

[top](#e0)