# Text Analysis - Unit 01 - Analysis

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%201%20-%20Lesson%20Learning%20Outcome.png"> Lesson Learning Outcome

* **The Text Analysis lesson consists of 1 unit**
* By the end of this lesson, you should be able to:
  * Gather text data from Wikipedia
  * Create a WordCloud
  * Learn and use basic functionalities from Text Hero
  * Handle DataFrame with Text data

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%202%20-%20Unit%20Objective.png"> Unit Objectives

  * Gather text data from Wikipedia
  * Create a WordCloud
  * Learn and use basic functionalities from Text Hero
  * Handle DataFrame with Text data



---

Data Science also has incredible applications when dealing with text. As part of this process, you will have to learn more about the most common words for a given subset of your data.

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Question%20mark%20icon.png">
 **Why do we study text analysis?**
  * Because it allows you to analyse words present in different classes of textual data, and quickly reveals significant insights into your project



## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%203%20-%20Additional%20Learning%20Context.png"> Additional Learning Context

* We encourage you to:
  * Add **code cells and try out** other possibilities, play around with parameter values in a function/method, or consider additional function parameters etc.
  * Also, **add your own comment** in the cells. It can help you to consolidate your learning. 


* Parameters in given function/method
  * As you may expect, a given function in a package may contain multiple parameters. 
  * Some of them are mandatory to declare; some have pre-defined values, and some are optional. We will cover the most common parameters used/employed in Data Science for a particular function/method. 
  * However, you can find additional information in the respective package documentation, which explains how to use a given function/method. The packages studied here are open source, so their documentation is public.
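
As a quick way to see which parameters a function requires and which have default values, Python's built-in `inspect` module can list them. The sketch below uses a hypothetical function, `make_cloud()`, that is not part of any library; the helper name `describe_parameters()` is also an assumption made for this example.

```python
import inspect


def make_cloud(text, width=400, height=200, background_color="black"):
    """Hypothetical function used only to illustrate parameter inspection."""
    return f"{len(text)} characters on a {width}x{height} {background_color} canvas"


def describe_parameters(func):
    """Split a function's parameters into mandatory ones and ones with defaults."""
    mandatory, with_defaults = [], {}
    for name, param in inspect.signature(func).parameters.items():
        if param.default is inspect.Parameter.empty:
            mandatory.append(name)  # no default value, so it must be supplied
        else:
            with_defaults[name] = param.default
    return mandatory, with_defaults


mandatory, with_defaults = describe_parameters(make_cloud)
print("Mandatory:", mandatory)
print("With defaults:", with_defaults)
```

In a notebook, `help(make_cloud)` shows similar information together with the docstring.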

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%204%20-%20Import%20Package%20for%20Learning.png"> Import Packages for Learning

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

---

## <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Text Analysis

There might come a moment in the workplace when your dataset may contain text, for example, a long description about a given group or a product review. We will learn how to:
* Collect data from Wikipedia articles
* Create a WordCloud
* Use a text processing library called TextHero
* Handle text data in DataFrames

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Wikipedia

We will use the ``wikipedia`` library to retrieve data from a given Wikipedia page. First, we install the `wikipedia` library for this exercise, for example with `pip install wikipedia`.

We will use `wikipedia.page()` to request data. The argument is the Wikipedia article name.
* In this case, we are requesting the content from "Python_(programming_language)"

import wikipedia
wiki_request = wikipedia.page('Python_(programming_language)')
wiki_request

The returned page object exposes several attributes, such as `.title`, `.url`, `.summary` and `.content`

We are selecting `.content`, so that we can access the article text


text = wiki_request.content
text

Let's check the text length

len(text)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%206%20-%20Warning.png"> Ideally, we should pre-process the text data by removing stopwords (words that do not add much meaning to a sentence, like ‘We’, ‘are’ and ‘the’), extra whitespace, punctuation, etc. We will leave that to the Text Hero section, so that we can focus now on creating the WordCloud

#### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Additional Topics on wikipedia library

If you are running out of ideas for topics to search on Wikipedia, use the function ``wikipedia.random()`` to get random article titles. The argument `pages` is the number of random titles to return, up to a maximum of 10.

wikipedia.random(pages=10)

You may also want to set the language of the collected data. Use the function `wikipedia.set_lang()`. The argument is ``prefix``, the language prefix (usually two letters). The options can be found [here](http://meta.wikimedia.org/wiki/List_of_Wikipedias)
* In this case, we will select `'de'` for German

wikipedia.set_lang(prefix='de')

When we now run random suggestions for pages, we notice that only German language pages are shown

wikipedia.random(pages=10)

We are setting the language back to English

wikipedia.set_lang(prefix='en')

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> WordCloud

We are now interested in plotting a WordCloud for ``text`` to see its most frequent words. The size of each word in the WordCloud reflects its frequency in the text.
* First, we install the wordcloud library, for example with `pip install wordcloud`

We will define a function to display the WordCloud.
* It creates a figure and plots the word cloud image without a grid

def plot_wordcloud(wordcloud):
    plt.figure(figsize=(15, 10))
    plt.axis("off")
    plt.imshow(wordcloud) 
    plt.show()

We use the `WordCloud()` class; the documentation is found [here](https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html). The arguments we consider are:
* `width` and `height` to set the size of the word cloud image
* `background_color`, options include 'white', 'black', 'navy', 'salmon' and others
* `collocations`, set to False so that two-word combinations (bigrams) are not included, which keeps the same word from appearing more than once in the cloud
* `colormap`, follows matplotlib [palette](https://matplotlib.org/stable/tutorials/colors/colormaps.html)
* `stopwords`, the set of words to exclude; wordcloud ships a built-in set called `STOPWORDS`

Once we create the `WordCloud()` object, we chain the method `.generate()` with the text as its argument, and pass the result to `plot_wordcloud()`

from wordcloud import WordCloud, STOPWORDS
wordcloud = WordCloud(width = 800, height = 400, 
                      background_color='salmon', colormap='Pastel1',
                      collocations = False, stopwords = STOPWORDS).generate(text)

plot_wordcloud(wordcloud)
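
To make the link between word size and word frequency concrete, the sketch below counts words in a short sample string with Python's standard `collections.Counter`. The sample sentence and the tiny stopword set are made up for illustration; this is a simplified stand-in for the counting a word cloud is based on, not the library's actual implementation.

```python
import re
from collections import Counter

sample_text = "Python is a language. Python code is readable, and Python is popular."
sample_stopwords = {"is", "a", "and"}  # illustrative only, much smaller than STOPWORDS

# Lowercase the text and extract alphabetic tokens
tokens = re.findall(r"[a-z]+", sample_text.lower())

# Count every token that is not a stopword
word_counts = Counter(token for token in tokens if token not in sample_stopwords)

# The most frequent words are the ones a word cloud would draw largest
print(word_counts.most_common(3))
```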

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> Now we are interested in saving the WordCloud plot. We can use the method `.to_file()` over the variable that contains the wordcloud object. The argument is the path and filename

wordcloud.to_file("wordcloud.png")

We can check the image was saved by inspecting the files in the current directory. Note the file named wordcloud.png

!ls
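
`!ls` relies on a Unix-like shell. As a portable alternative, a short sketch with the standard `pathlib` module can confirm that the file exists. The helper name `file_was_saved()` is an assumption made for this example.

```python
from pathlib import Path


def file_was_saved(filename):
    """Return True if `filename` exists as a file and is not empty."""
    path = Path(filename)
    return path.is_file() and path.stat().st_size > 0


# Prints True only if wordcloud.png is present in the current directory
print(file_was_saved("wordcloud.png"))
```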

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> **PRACTICE**: Now, it is your turn to create a word cloud.

Choose an article from the list of articles below and create a wordcloud 

wikipedia.random(pages=10)

# Write your code here to get the text from wikipedia
text = ...

# write your code here to do a word cloud

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Text Hero

<img src="https://warehouse-camo.ingress.cmh1.psfhosted.org/e8f9b940af405538395be49fba6829ebd39cc8c8/68747470733a2f2f6769746875622e636f6d2f6a6265736f6d692f746578746865726f2f7261772f6d61737465722f6769746875622f6c6f676f2e706e67" width="25%" height="25%" />

Text Hero is a Python library that handles text processing and helps you to understand text data. You can find the documentation [here](https://texthero.org/)

To start, our text will be the IPython page from Wikipedia

txt = wikipedia.page('IPython').content
txt

Let's check its length, that is, how many characters the text has

print(len(txt))

We are interested in applying the function [hero.clean()](https://texthero.org/docs/api/texthero.preprocessing.clean), which runs a series of text preprocessing steps, including:
* lowercase
* remove digits
* remove punctuation
* remove diacritics
* remove stopwords
* remove whitespace
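
To see what each of these steps does, here is a rough standard-library approximation of the same pipeline. It is a sketch for illustration only, not Text Hero's implementation: the function name `simple_clean()` and the three-word stopword set are assumptions, and the output of the real function may differ in detail.

```python
import re
import string
import unicodedata

SAMPLE_STOPWORDS = {"the", "is", "a"}  # illustrative; real stopword lists are much longer

# Translation table that maps every ASCII punctuation character to a space
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def simple_clean(text):
    """Approximate the cleaning steps listed above on a single string."""
    # lowercase
    text = text.lower()
    # remove digits
    text = re.sub(r"\d+", " ", text)
    # remove punctuation
    text = text.translate(PUNCTUATION_TO_SPACE)
    # remove diacritics: split accented letters, then drop the accent marks
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # remove stopwords
    words = [word for word in text.split() if word not in SAMPLE_STOPWORDS]
    # remove extra whitespace by joining the words with single spaces
    return " ".join(words)


print(simple_clean("The café  is OPEN 24 hours, a day!"))
```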


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> The function does not accept a plain string, so we convert ``txt`` to a Pandas Series. The output of `hero.clean()` is also a Pandas Series, so we take the first element with `[0]` to convert the content back to a string


import texthero as hero
txt = pd.Series(txt)
txt = hero.clean(txt)[0]
txt

Let's check the new length

print(len(txt))

We can create a WordCloud using the knowledge from the last section

wordcloud = WordCloud(width = 800, height = 400, 
                      background_color='navy', colormap='Set1',
                      collocations = False, stopwords = STOPWORDS).generate(txt)

plot_wordcloud(wordcloud)

---

As an alternative, you can create a plot to count the most frequent words
* We use ``hero.top_words()`` to count the frequency of the words, and with bracket notation, we slice the top n words
* The data should be a Pandas Series or DataFrame, so in this case, we pass ``txt`` (string) to `pd.Series()`

num_top_words = 10
txt_ser = pd.Series(txt)
hero.top_words(txt_ser)[:num_top_words]

Next, you can create a bar plot using, for example, Pandas' built-in capability (or any other data viz library you prefer)

sns.set_style("whitegrid")
num_top_words = 10
txt_ser = pd.Series(txt)
plt.figure(figsize=(15,5))
hero.top_words(txt_ser)[:num_top_words].plot(kind='bar')
plt.show()

---

### <img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%2010-%20Lesson%20Content.png"> Text in tabular data

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We will load a DataFrame that contains text data. We consider IMDb Reviews, a dataset of 50,000 movie reviews from the Internet Movie Database (IMDb), with positive and negative reviews. We work with a random sample of 1,000 reviews

df = pd.read_csv("https://raw.githubusercontent.com/Code-Institute-Solutions/sample-datasets/main/imdb_reviews.csv")
df = df.sample(n=1000, random_state=1).reset_index(drop=True)
print(df.shape)
df.head()

We will process the text with `hero.clean()`

df['review'] = hero.clean(df['review'])
df.head()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We want to see the word cloud and plot the most frequent words for the negative sentiment on a bar chart.
* We query `sentiment == 0`, subset `'review'` and extract the values as an array. Next, we convert the array into a string so that we can create a WordCloud

text = df.query("sentiment == 0")['review'].values
text = str(text)
wordcloud = WordCloud(width = 800, height = 400, 
                      background_color='blue', colormap='Dark2',
                      collocations = False, stopwords = STOPWORDS).generate(text)

plot_wordcloud(wordcloud)
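
Converting an array with `str()` keeps the brackets and quotes of its printed form, and NumPy abbreviates long arrays (by default, more than 1,000 elements) with `...` when they are printed, so part of the text could be dropped on larger samples. A possible alternative is to join the reviews with a space. The sketch below uses a plain Python list with made-up reviews to show the difference; with the DataFrame above, the same idea would be `" ".join(df.query("sentiment == 0")['review'])`.

```python
reviews = ["great film", "terrible plot", "would watch again"]  # made-up sample data

# str() returns the printed representation, including brackets and quotes
as_repr = str(reviews)

# " ".join() concatenates only the review text itself
as_joined = " ".join(reviews)

print(as_repr)
print(as_joined)
```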

Next, we plot the top words

sns.set_style("whitegrid")
num_top_words= 10
txt_ser = pd.Series(text)
plt.figure(figsize=(15,5))
hero.top_words(txt_ser)[:num_top_words].plot(kind='bar')
plt.show()

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We notice that the word "br" appears very often. It is most likely a remnant of the HTML line-break tag `<br />` in the original reviews.
* We can add a step to our preprocessing workflow that removes common HTML remnants, such as: `['\n','/><br','<br', 'br', '/>']`
* We create a function called `remove_specific_characters()` that takes a string of text and looks for each pattern in the list. If it finds a pattern, it replaces it with a blank space, using `.replace()`
* You may add to this list any specific characters or words that you want to remove from your analysis

def remove_specific_characters(txt):
  """Replace common HTML remnants and newlines in txt with a blank space."""
  for x in ['\n', '/><br', '<br', 'br', '/>']:
    txt = txt.replace(x, ' ')
  return txt
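
One caveat of plain `.replace()` is that it also matches inside longer words, so removing `'br'` would turn a word such as `brilliant` into ` illiant`. A possible refinement, sketched below with the standard `re` module, removes `br` only when it stands alone as a word. The function name `remove_standalone_br()` is an assumption for this example and is not used elsewhere in the notebook.

```python
import re


def remove_standalone_br(txt):
    """Replace the token 'br' with a space, leaving words that contain 'br' intact."""
    # \b marks a word boundary, so 'brilliant' and 'zebra' are not touched
    return re.sub(r"\bbr\b", " ", txt)


sample = "brilliant acting br br but the zebra scene dragged"
cleaned_sample = remove_standalone_br(sample)
print(cleaned_sample.split())
```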

We add this function to our previous WordCloud code; check the difference now.

text = df.query("sentiment == 0")['review'].values
text = str(text)
text = remove_specific_characters(text)  # strip HTML remnants such as "br"
wordcloud = WordCloud(width = 800, height = 400, 
                      background_color='blue', colormap='Dark2',
                      collocations = False, stopwords = STOPWORDS).generate(text)

plot_wordcloud(wordcloud)

<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> You may consider a set of variables in your DataFrame to explore textual insights
* In our exercise, we can use sentiment to create cohorts (or groups), and for each cohort, evaluate the text
* However, you may get a dataset with more variables where you could create more groups of analysis; for example, your data could contain information on review, date, product, customer_segment; so you could mine your data and add granularity to your analysis, per product, per customer_segment, per month, per year etc
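
As a sketch of that idea, the example below groups a small made-up DataFrame by two variables and reports the most frequent word per group. The column names `product`, `customer_segment` and `review`, and all the values, are hypothetical, and `collections.Counter` stands in for the word-frequency step.

```python
from collections import Counter

import pandas as pd

# Hypothetical data: column names and values are invented for illustration
reviews_df = pd.DataFrame({
    "product": ["phone", "phone", "laptop", "laptop"],
    "customer_segment": ["student", "student", "business", "business"],
    "review": ["battery great", "battery weak", "screen sharp", "screen dim screen"],
})

top_word_per_group = {}
for group_key, group in reviews_df.groupby(["product", "customer_segment"]):
    # Join all reviews of the group into one string, then count its words
    words = " ".join(group["review"]).split()
    top_word_per_group[group_key] = Counter(words).most_common(1)[0]

for group_key, (word, count) in top_word_per_group.items():
    print(group_key, "->", word, count)
```

Each extra grouping variable makes the groups smaller, so it is worth checking that every group still contains enough text to be informative.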


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%205%20-%20Practice.png"> In our exercise, we will create a custom function that iterates over the possible sentiments and, for each one, displays a WordCloud and a plot of the most frequent words
* The arguments of `wordplot_from_data_frame()` are: the DataFrame; `cohort_variable`, the variable that defines the distinct groups (in this case, one group with sentiment 0 and one with sentiment 1); and `text_variable`, the variable with the text content
* We redefine the earlier cleaning helper under the name `remove_specific_words()` and add a `plot_top_words()` function to plot the most frequent words in a string

def remove_specific_words(txt):
  """Replace common HTML remnants and newlines in txt with a blank space."""
  for x in ['\n', '/><br', '<br', 'br', '/>']:
    txt = txt.replace(x, ' ')
  return txt

def plot_top_words(text, num_top_words=10):
  sns.set_style("whitegrid")
  txt_ser = pd.Series(text)
  plt.figure(figsize=(15,5))
  hero.top_words(txt_ser)[:num_top_words].plot(kind='bar')
  plt.show()


def wordplot_from_data_frame(df, cohort_variable, text_variable):
  """
  Logic:
  - loops over the levels of a categorical variable
  - for each level, subsets the data, processes the text, and creates a
    wordcloud and a bar plot with the most frequent words

  """
  cohort_levels = df[cohort_variable].unique()

  for level in cohort_levels:
    
    text = df[str(text_variable)].loc[df[str(cohort_variable)]==level].values
    text = str(text)
    text = remove_specific_words(text)

    wordcloud = WordCloud(width = 800, height = 400, 
                      background_color='navy', colormap='Dark2',
                      collocations = False, stopwords = STOPWORDS).generate(str(text))
    print(f"=== {cohort_variable} : {level} ===")
    plot_wordcloud(wordcloud)
    print("\n")
    plot_top_words(text, num_top_words=10)
    print("\n\n")


<img width="3%" height="3%" align="top"  src="https://codeinstitute.s3.amazonaws.com/predictive_analytics/jupyter_notebook_icons/Icon%207-%20Note.png"> We use `cohort_variable='sentiment'` and  `text_variable='review'`

wordplot_from_data_frame(df, cohort_variable='sentiment', text_variable='review')

---