<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

## LAB SESSION :: Wrangling external business data

1. Load text, infer structure
2. Basic pattern matching
3. Basic visualisation

## Load text, infer structure

1. Load the contents of the file called `kaggle-amazon_reviews-first50.txt` into a variable called `rawtext` and then display the contents of the variable.
2. How do we make this unstructured text fit a structure that we can use? How do we separate the reviews?
3. Can we use `split` further to structure each review?

In [None]:
#Open the file and read it into a variable 'rawtext'
file = open("")
rawtext = file.read()
file.close()
rawtext

In [None]:
#Split the document by newlines to give us individual reviews
reviews = rawtext.split("")
if reviews[-1]=='':
    del reviews[-1] #Remove last empty item
reviews
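As a small illustration of the splitting step, the sketch below uses a made-up two-review string in place of the real file. The sample text and its `__label__` prefix are assumptions for illustration only; check them against what `rawtext` actually contains.

```python
# Hypothetical sample standing in for the file contents (assumed format)
sample_text = (
    "__label__2 Great read: I could not put it down.\n"
    "__label__1 Broken: Stopped working after a week.\n"
)

# Splitting on the newline character leaves an empty string after the final newline
sample_reviews = sample_text.split("\n")
if sample_reviews[-1] == "":
    del sample_reviews[-1]  # Remove last empty item

# str.splitlines() gives the same list without the trailing empty item
same_result = sample_reviews == sample_text.splitlines()
print(len(sample_reviews), same_result)
```

For text with simple `\n` line endings, `splitlines()` avoids the clean-up step, although it also splits on other line-boundary characters that `split("\n")` ignores.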

Use a function to convert the plain text list into `HTML` that we can display, which makes the reviews easier to read. The function wraps each element of the list in paragraph `p` tags, joins the paragraphs together and returns the result as `HTML`.

In [None]:
#Import some display software
from IPython.display import display, HTML

#FUNCTION - turn list into HTML
def listToHtml(textList):
    def pTag(text): #function that wraps text in paragraph tags
        return "<p>"+text+"</p>"
    paras = map(pTag,textList) #Apply the wrapping function to the list
    return HTML(''.join(paras)) #Join the paragraphs together and return as HTML

#Create the HTML by calling the function
reviewsAsHtml = listToHtml(reviews)

#Display the HTML
display(reviewsAsHtml)
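Review text can contain characters such as `<` or `&` that a browser would treat as markup. One possible variation, sketched here with a hypothetical helper named `list_to_html_string`, escapes each item with the standard library's `html.escape` before wrapping it. It returns a plain string so the result can be inspected before it is passed to `HTML`.

```python
from html import escape

def list_to_html_string(text_list):
    # Hypothetical helper: escape each item, then wrap it in <p> tags
    paras = ["<p>" + escape(text) + "</p>" for text in text_list]
    return "".join(paras)  # Join the paragraphs into one HTML string

html_string = list_to_html_string(["5 < 6 & true", "plain text"])
print(html_string)
```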

In [None]:
#For each review, we can split further - try with first review
firstReview = reviews[0]
firstReview

In [None]:
firstReviewParts = firstReview.split("")
print("Label, subject >>> ",firstReviewParts[0])
print("Review text    >>> ",firstReviewParts[1])
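If the separator also appears inside the review text, a plain `split` returns more than two parts. One way around this is to limit the number of splits with the `maxsplit` argument. The sketch below assumes, for illustration only, that a colon separates the label and subject from the text.

```python
# Hypothetical review line; the colon separator is an assumption
sample_review = "__label__2 Great read: Chapter one: slow, but the rest is gripping."

# maxsplit=1 splits at the first colon only, so later colons stay in the text
header, body = sample_review.split(":", 1)
print("Label, subject >>> ", header)
print("Review text    >>> ", body.strip())
```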

## Basic pattern matching

In splitting the text we were actually applying a simple pattern matching algorithm that turned the text into a list based on matching a chosen character e.g. `\n`. However, we can manipulate text further by using *regular expressions* or *regex*.

1. Use regex to identify the label and return `positive` or `negative` (hint: the lookbehind `(?<=...)`). See the [Python regex documentation](https://docs.python.org/3/library/re.html) and experiment at [pythex](https://pythex.org).
2. Use a namedtuple to hold label, subject and text
3. Process the list of unstructured reviews into structured form
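A lookbehind `(?<=...)` matches a position that is preceded by the given text, without including that text in the match. The sketch below assumes the label is written as `__label__` followed by digits. That prefix is an assumption, so adapt the pattern to what the data shows.

```python
import re

sample_header = "__label__2 Great read"  # Hypothetical header in the assumed format

# Match one or more digits only where they directly follow "__label__"
match = re.search(r"(?<=__label__)[0-9]+", sample_header)
label_value = match.group(0)
print(label_value)

# Splitting on the same pattern leaves the prefix and the subject on either side
parts = re.split(r"(?<=__label__)[0-9]+", sample_header)
print(parts)
```

The second element of `parts` still carries a leading space, which is why a `strip()` is needed when extracting the subject.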

In [None]:
#Import the Regex library
import re

#Create an expression to pull out the label value
match = re.search(r"(?<=)[0]+",)
match

In [None]:
#Get the first regex match group
match.group(0)

In [None]:
#We can split using the same regex
split = re.split(r"(?<=)[]+",)
split

In [None]:
#The second part gives us the subject, but we need to clean it up
subject = split[1].strip()
subject

Functions enable us to easily repeat a block of code. If we write functions to get the sentiment and the subject of the review, these can be used repeatedly on each of the reviews in our list.

In [None]:
#Create a function to extract the number value as a positive or negative label

def getSentimentLabel(text):
    match = re.search(r"(?<=)[]+",text)
    value = match.group(0)
    if value=='1':
        return 'negative'
    elif value=='2':
        return 'positive'
    
#Test with first review

getSentimentLabel(firstReviewParts[0])

In [None]:
#Create a function to extract the subject

def getSubject(text):
    split = re.split(r"(?<=)[]+",text)
    return split[1].strip()

#Test with first review

getSubject(firstReviewParts[0])
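`re.search` returns `None` when nothing matches, and calling `.group(0)` on `None` raises an `AttributeError`. A guarded variant might look like the sketch below. The function name `get_sentiment_label_safe` and the `__label__` prefix are assumptions for illustration.

```python
import re

def get_sentiment_label_safe(text):
    # Hypothetical variant of getSentimentLabel; the "__label__" prefix is assumed
    match = re.search(r"(?<=__label__)[0-9]+", text)
    if match is None:
        return None  # No label found, so there is nothing to translate
    # Map the numeric value to a sentiment word; unknown values give None
    return {"1": "negative", "2": "positive"}.get(match.group(0))

print(get_sentiment_label_safe("__label__1 Broken"))
print(get_sentiment_label_safe("no label here"))
```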

In [None]:
#Setup a review namedtuple
from collections import namedtuple
Review = namedtuple('Review',['label','subject','text'])

In [None]:
#Create dummy Review to test
rev = Review('','','')
rev.label
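A namedtuple behaves like an ordinary tuple whose positions also have names. The short sketch below uses a separate, hypothetical `SampleReview` type with made-up values to show the common ways of reading one.

```python
from collections import namedtuple

# Hypothetical type and values, for illustration only
SampleReview = namedtuple("SampleReview", ["label", "subject", "text"])
sample_rev = SampleReview(label="positive", subject="Great read", text="I could not put it down.")

print(sample_rev.subject)    # Access by field name
print(sample_rev[0])         # Access by position, like a normal tuple
print(sample_rev._asdict())  # Convert to a dictionary of field names and values
```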

In [None]:
# Create a function to parse a review into a tuple

def parseReview(text):
    textSplit = text.split('')
    text = textSplit[1]  
    subject = getSubject(textSplit[0])
    label = getSentimentLabel(textSplit[0])
    return Review(label,subject,text)

In [None]:
# Test the review function

parseReview(firstReview)

In [None]:
# Process all reviews with the parseReview function

structuredReviews = list(map(parseReview,reviews))
structuredReviews

## Basic visualisation

Create 2 visualisations from the semi-structured data:

1. Print out the reviews with subject in `bold` and coloured according to sentiment label. Print the text normally.
2. Display a graph of the total positive and negative labels.

In [None]:
#Modify the listToHtml function from before
def reviewsToHtml(reviewList):
    def pTag(review): #function that wraps review in tags
        return '<?><? class="'+review.label+'">'+review.subject+"</?>: "+review.text+"</?>"
    paras = map(pTag,reviewList) #Apply the wrapping function to the list
    return HTML(''.join(paras)) #Join the paragraphs together and return as HTML



In [None]:
#Create the HTML by calling the function
structReviewsHtml = reviewsToHtml(structuredReviews)
structReviewsHtml

We now have better structure, but we are still lacking the colour to discriminate between the positive and negative reviews.

In [None]:
#Create the CSS for the positive and negative labels
css = HTML("""
<style>
.positive {
    ;
}
.negative {
    ;
}
</style>
""")

#Display the HTML
display(css,structReviewsHtml)

Before creating the chart, we need a count of the positive reviews and the negative reviews. We can do this by building a list of each type (which could also be useful as separate datasets) and counting how many reviews are in each list.

In [None]:
#Count the positives and negatives
posList = list(filter(lambda review: review.label=='', structuredReviews))
negList = list(filter(lambda review: review.label=='', structuredReviews))
posCount = len(posList)
negCount = len(negList)
print("Number of positive reviews: ",posCount)
print("Number of negative reviews: ",negCount)
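When only the totals are needed and the separate lists are not, `collections.Counter` is an alternative to filtering. The sketch below uses a hypothetical `SampleReview` type and made-up reviews, not the lab data.

```python
from collections import Counter, namedtuple

# Hypothetical type and reviews, for illustration only
SampleReview = namedtuple("SampleReview", ["label", "subject", "text"])
sample_structured = [
    SampleReview("positive", "Great read", "I could not put it down."),
    SampleReview("negative", "Broken", "Stopped working after a week."),
    SampleReview("positive", "Good value", "Does what it says."),
]

# Count how many times each label occurs in a single pass
label_counts = Counter(review.label for review in sample_structured)
print("Number of positive reviews: ", label_counts["positive"])
print("Number of negative reviews: ", label_counts["negative"])
```

A `Counter` returns 0 for a label that never occurs, so looking up a missing label does not raise an error.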

In [None]:
#Import the plotting library
import matplotlib.pyplot as plt

#Setup the data
y = [posCount,negCount]
x = ['','']
colours = ['','']
#Plot the data
plt.bar(x,y, color=colours)

#Label the chart
plt.ylabel('')
plt.xlabel('')
plt.title('')