# tinyurl.com/ANLPcolab1



# Basics of Natural Language Processing (NLP) #

*CoLab Tutorial*

This notebook demonstrates some basic NLP tasks to get you started on text analysis.

Run each cell one by one, in order of occurrence, to get the results. Have a play with the cells by editing the code and examining the new output(s).

# Run this code in the beginning to limit the output size of the cells

This is just an initial setup step to make sure the cell outputs are displayed properly.

In [1]:
from IPython.display import display, Javascript

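# *args absorbs the execution-info argument that newer IPython versions pass to event callbacks
def resize_colab_cell(*args):
    # Change the maxHeight value to change the maximum height of the output
    display(Javascript('google.colab.output.setIframeHeight(0, true, {maxHeight: 400})'))

# Apply the output size to the entire notebook by calling the function before each cell runs.
# This call must sit outside the function, otherwise the callback is never registered.
get_ipython().events.register('pre_run_cell', resize_colab_cell)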

## 1. Input Text

There are many ways we can provide an input text for analysis. We will go through three ways in this notebook:


1. Define input text in a variable
2. Scrape text from a web source
3. Read from a file

Defining input text in a variable and scraping text from a web source will be demonstrated in class.
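The third option, reading from a file, can be sketched as follows. This is a minimal example that assumes a plain-text, UTF-8 encoded file. The file name `sample_text.txt` and its contents are made up, and the file is written first only so that the example runs on its own.

```python
from pathlib import Path

# Hypothetical file name, created here only so the example is self-contained;
# in Colab you would normally upload or mount your own file instead.
sample_path = Path("sample_text.txt")
sample_path.write_text("First paragraph.\n\nSecond paragraph.\n", encoding="utf-8")

# Read the whole file into one string and strip leading/trailing whitespace
file_text = sample_path.read_text(encoding="utf-8").strip()
print(file_text)
```

The resulting string can then be analysed in the same way as a text defined directly in a variable.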

To get started, we provide a sample text for analysis. We analyse the **components of the text** to teach the machine to make sense of it.


In [2]:
# Assign text (in the form of a multiline string) to a variable 'mytext'

mytext = """
Have we reached (hu)man-machine symbiosis where we do not simply use technology but are enmeshed with technology in “socio-cyborgian activity systems,” as Bazerman notes?

While we are certainly in an age where technology plays an increasingly important role in our daily lives, we have not yet fully reached a state of human-machine symbiosis. However, there are certainly elements of symbiosis that can be seen in certain domains.

For example, in some professions such as medicine or aviation, technology plays an integral role in supporting human decision-making and action. In these domains, technology and humans work together to achieve a common goal, and the relationship between the two can be seen as a kind of symbiosis.

However, in other domains such as social media or gaming, the relationship between humans and technology is more complex and often less symbiotic. In these domains, technology can sometimes be seen as a distraction or even a hindrance to human activity, rather than a support.

Overall, while the relationship between humans and technology is undoubtedly becoming more intertwined, it is still evolving and we have not yet reached a full state of symbiosis in all areas of our lives
""".strip()

# Check if the variable contains the intended text
print(mytext)

Have we reached (hu)man-machine symbiosis where we do not simply use technology but are enmeshed with technology in “socio-cyborgian activity systems,” as Bazerman notes?

While we are certainly in an age where technology plays an increasingly important role in our daily lives, we have not yet fully reached a state of human-machine symbiosis. However, there are certainly elements of symbiosis that can be seen in certain domains.

For example, in some professions such as medicine or aviation, technology plays an integral role in supporting human decision-making and action. In these domains, technology and humans work together to achieve a common goal, and the relationship between the two can be seen as a kind of symbiosis.

However, in other domains such as social media or gaming, the relationship between humans and technology is more complex and often less symbiotic. In these domains, technology can sometimes be seen as a distraction or even a hindrance to human activity, rather than a

Check out what happens when you don't use triple quotes (`"""`) for multi-line texts. Learn more about Python strings here and try different inputs: https://www.w3schools.com/python/python_strings.asp
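As a small illustration: a string in ordinary quotes cannot span several lines of source code, so its line breaks have to be written as `\n`. The variable names below are made up for this sketch.

```python
# Triple quotes let the string span several source lines
triple_quoted = """line one
line two"""

# With ordinary quotes, the line break must be written as an escape sequence
single_quoted = "line one\nline two"

# Both variables hold exactly the same characters
print(triple_quoted == single_quoted)
```

Breaking an ordinary quoted string across two source lines without `\n` raises a `SyntaxError` instead.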

### Using web scraping

The following code demonstrates the use of a web scraping library called BeautifulSoup. You may try different web scraping methods from [here](https://realpython.com/python-web-scraping-practical-introduction/) if you are interested; the code below shows a simple example that extracts data from a web page using the page's HTML tags and attributes.

In [3]:
import requests
from bs4 import BeautifulSoup

# Send a GET request to the website
response = requests.get('https://handbook.uts.edu.au/subjects/details/36118.html')

# Parse the HTML content of the page
soup = BeautifulSoup(response.text, 'html.parser')
#The above line creates a BeautifulSoup object from the HTML content of the response. It uses the 'html.parser' to parse the HTML

# Find the required information using HTML tags and attributes
title = soup.find_all('h1')[0].text  # finds all <h1> tags in the HTML and takes the text of the first one (index 0) - assumed to be the title of the page
subject_code = title.split(' ')[0]  # splits the title by spaces and takes the first part, assuming it is the subject code
subject_name = ' '.join(title.split(' ')[1:])  #takes all parts of the title after the first space and joins them back together, assuming this is the subject name
subject_description = soup.h3.next_sibling.next_sibling.next_element.text #This line is more complex - It starts from the first <h3> tag in the document. It then moves to the next sibling twice (.next_sibling.next_sibling). Finally, it gets the next element and its text. This assumes that the subject description is located two siblings after an <h3> tag.

# Print the scraped information
print(f"Page Title: {title}")
print(f"Subject Code: {subject_code}")
print(f"Subject Name: {subject_name}")
print(f"Subject Description: {subject_description}")

SSLError: HTTPSConnectionPool(host='handbook.uts.edu.au', port=443): Max retries exceeded with url: /subjects/details/36118.html (Caused by SSLError(SSLEOFError(8, '[SSL: UNEXPECTED_EOF_WHILE_READING] EOF occurred in violation of protocol (_ssl.c:1007)')))

Note: This code assumes a very specific HTML structure. If the structure changes, the code might break. For more robust parsing, you might want to use more specific selectors (like IDs or classes) if they're available in the HTML, or add error handling to deal with cases where the expected structure isn't found.
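One way to make the parsing less fragile is sketched below, using CSS selectors and a fallback value when an element is missing. The HTML snippet, the `id` and `class` names, and the helper `text_or_default` are all made up for illustration. They do not reflect the real structure of the handbook page.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; NOT the real handbook page structure
sample_html = """
<html><body>
  <h1 id="subject-title">36118 Applied Natural Language Processing</h1>
  <div class="description"><p>An introductory NLP subject.</p></div>
</body></html>
"""

demo_soup = BeautifulSoup(sample_html, "html.parser")

def text_or_default(parsed, css_selector, default="(not found)"):
    """Return the stripped text of the first match, or a default if absent."""
    element = parsed.select_one(css_selector)
    return element.get_text(strip=True) if element is not None else default

demo_title = text_or_default(demo_soup, "h1#subject-title")
# partition splits on the first space only: code before it, name after it
demo_code, _, demo_name = demo_title.partition(" ")
demo_description = text_or_default(demo_soup, "div.description p")
demo_missing = text_or_default(demo_soup, "div.prerequisites")  # no such element

print(demo_code, "|", demo_name, "|", demo_description, "|", demo_missing)
```

Because `select_one` returns `None` when nothing matches, the helper can return a default instead of raising an `AttributeError`.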

### Text representation

Low-level computer representations of text do not equate to simple human understandings of text. A simple illustration is how much difference it makes to distinguish between words, sentences and paragraphs.

*This is a sample sentence I wrote*

Notice that as this is just a **string** of characters, it doesn't include any of the normal formatting that we associate with text.
To make it more readable, we can display the text as HTML in the output of the cell - the browser will parse it in a way that makes it easier to see the whole text. This is basic text visualisation in the browser.

In [None]:
# import display.HTML and use it to display the text as HTML
from IPython.display import HTML
HTML(mytext)


We could make it more readable still by turning the text into a list of paragraphs and displaying each with a space between.

In [None]:
# create a list of paragraphs, skipping the empty strings produced by blank lines

mytext_paras = [para for para in mytext.split('\n') if para.strip()]

# wrap each paragraph in HTML <p> tags and separate them with a <br> tag
html_mytext = '<br>'.join(map(lambda x: '<p>'+x+'</p>', mytext_paras))

HTML(html_mytext)

## 2. Basic Analysis

To perform analysis on text, we generally make use of NLP libraries. Two of the most common libraries for the Python language are `spaCy` and `NLTK`. We will use NLTK for our first analysis.

In [None]:
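# Import the NLTK library and download the tokenizer data
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('punkt')
# Newer NLTK releases may look for 'punkt_tab' instead; if tokenizing raises
# a LookupError, run the nltk.download(...) call suggested in the error message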

### Tokenization

We're now ready to process our text with NLTK. For this exercise, we'll just do simple analysis starting with tokenization.

In [None]:
mytextsents = sent_tokenize(mytext)
mytextsents

In [None]:
mytextwords = word_tokenize(mytext)
mytextwords
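To see why a dedicated tokenizer is useful, it can help to compare with plain whitespace splitting, which leaves punctuation attached to words. The sketch below uses only `str.split()` and a simple regular expression as a rough stand-in. It is not how `word_tokenize` works internally, and its output will differ from NLTK's on harder cases.

```python
import re

sentence = "In these domains, technology and humans work together."

# Naive approach: split on whitespace; punctuation stays glued to the words
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# Rough approximation of a tokenizer: each run of word characters, or each
# single punctuation mark, becomes its own token
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(regex_tokens)
```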

### Parts of Speech (POS) tagging

We can also identify the parts of speech in the text using NLTK's predefined taggers.

In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('tagsets')
nltk.help.upenn_tagset()

mytextpos = nltk.pos_tag(mytextwords)
mytextpos
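`nltk.pos_tag` returns a list of `(word, tag)` tuples, so the result can be summarised with standard Python tools. The sketch below assumes a few hand-written pairs in that shape. They are illustrative only and were not produced by running the tagger.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the same shape as nltk.pos_tag output;
# illustrative only, not real tagger output
sample_tagged = [
    ("technology", "NN"), ("plays", "VBZ"), ("an", "DT"),
    ("important", "JJ"), ("role", "NN"),
]

# Count how often each tag occurs
tag_counts = Counter(tag for _, tag in sample_tagged)
print(tag_counts.most_common())

# Keep only the nouns (Penn Treebank noun tags start with "NN")
nouns = [word for word, tag in sample_tagged if tag.startswith("NN")]
print(nouns)
```

The same two steps can be applied to `mytextpos` to see which tags dominate the sample text.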

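We can also group the tagged words into named-entity chunks with `nltk.ne_chunk` and display the resulting tree. Note that this is a chunk tree, not a dependency tree.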

In [None]:
# svgling is a pure Python package for single-pass rendering of linguistics-style constituent trees into SVG
!pip install svgling
nltk.download("maxent_ne_chunker")
nltk.download("words")
tree = nltk.ne_chunk(mytextpos)
display(tree)

Note that NLTK returns the strings it found as words, but the list also includes punctuation tokens such as ')' and ','.
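If only word-like tokens are wanted, the punctuation can be filtered out afterwards. This is a minimal sketch, assuming a hand-written token list in the shape `word_tokenize` returns (a list of strings); it is not real tokenizer output.

```python
# Hand-written sample in the shape word_tokenize returns (a list of strings)
sample_tokens = ["Have", "we", "reached", "(", "hu", ")", "man-machine", "symbiosis", "?"]

# Keep tokens that contain at least one letter or digit, so that hyphenated
# words survive while pure punctuation is dropped
word_like = [tok for tok in sample_tokens if any(ch.isalnum() for ch in tok)]
print(word_like)
```

The same list comprehension can be applied to `mytextwords`.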

### Named Entity Recognition (NER)

Now let's try NER using the spaCy package, which also has many other linguistic features (see https://spacy.io/usage/linguistic-features for more).

In [4]:
# Load the spacy library and a pre-trained language model for English text

import spacy
nlp = spacy.load("en_core_web_sm")

In [5]:
#Process text
doc = nlp(mytext)

In [6]:
#Extract entities

for entity in doc.ents:
    print(f"Entity: {entity.text}, Label: {entity.label_}")

Entity: Bazerman, Label: PERSON
Entity: two, Label: CARDINAL


In [7]:
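# These entities can be visualised

from IPython.display import HTML
from spacy import displacy

# jupyter=False makes displacy return the markup as a string instead of displaying it directly
ent_render = displacy.render(doc, style="ent", jupyter=False)
HTML(ent_render)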

<IPython.core.display.HTML object>

## 3. Regular expressions

A **regular expression** (or RE) is used to match strings of text such as particular characters, words, or patterns of characters. Regular expressions come in handy for many string manipulation operations. For instance, we can extract the name from an email address, the title from a name, the subject code from a text description, or the components of an address.

There are commonly used wild card patterns in Python that help us extract useful information from texts:

- `^` matches the beginning of the string (or of each line when the `re.MULTILINE` flag is set).
- `$` matches the end of the string (or of each line when the `re.MULTILINE` flag is set).
- `.` matches any single character except a newline.
- `\s` matches a whitespace character.
- `\S` matches a non-whitespace character.
- `\d` matches one digit.
- `*` repeats the preceding character zero or more times. It matches the longest possible string.
- `*?` also repeats the preceding character zero or more times. However, it matches the shortest string that fits the pattern.
- `+` repeats the preceding character one or more times. It matches the longest possible string that fits the pattern.
- `+?` repeats the preceding character one or more times. However, it matches the shortest possible string that fits the pattern.
- `[aeiou]` matches any single character from the given set.
- `[^XYZ]` matches any single character not in the given set.
- `[a-z0-9]` matches any single character in the ranges a-z or 0-9.
- `(` marks the beginning of the part of the match to extract (a capture group).
- `)` marks the end of the part of the match to extract.


Read examples of applications here: https://code.tutsplus.com/tutorials/8-regular-expressions-you-should-know--net-6149, and more examples here: https://developers.google.com/edu/python/regular-expressions
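A few of the patterns above can be tried out as follows. This sketch uses Python's built-in `re` module and a made-up sample string; the email address and tags in it are invented for illustration.

```python
import re

# Made-up sample string for illustration
sample = "Contact: jane.doe@example.com, Subject: <b>36118</b> <b>NLP</b>"

# (\S+)@ captures the run of non-whitespace characters before the '@'
email_names = re.findall(r"(\S+)@", sample)

# Greedy .* runs to the LAST closing tag; non-greedy .*? stops at the first
greedy = re.findall(r"<b>(.*)</b>", sample)
lazy = re.findall(r"<b>(.*?)</b>", sample)

# ^ anchors the match to the start of the string; \d+ matches runs of digits
first_word = re.findall(r"^\w+", sample)
digit_runs = re.findall(r"\d+", sample)

print(email_names)  # ['jane.doe']
print(greedy)       # ['36118</b> <b>NLP']
print(lazy)         # ['36118', 'NLP']
print(first_word)   # ['Contact']
print(digit_runs)   # ['36118']
```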

In [None]:
!pip install regex
import regex as re

test_string = '''
36118 Applied Natural Language Processing
Warning: The information on this page is indicative. The subject outline for a particular session, location and mode of offering is the authoritative source of all information about the subject for that offering. Required texts, recommended texts and references in particular are likely to change. Students will be provided with a subject outline once they enrol in the subject.

Subject handbook information prior to 2024 is available in the Archives.

UTS: Transdisciplinary Innovation
Credit points: 8 cp
Result type: Grade, no marks

Requisite(s): 36100 Data Science for Innovation AND 36103 Statistical Thinking for Data Science AND 36106 Machine Learning Algorithms and Applications
Description
This subject introduces students to the complexities of human language data and the use of Natural Language Processing (NLP) and text mining techniques to analyse them. Students develop both technical and communicative skills to process and interpret unstructured textual data with a range of practical applications. Covering core NLP concepts and the latest developments in large language models, the course equips students with skills for insightful pattern discovery in natural language text, while emphasising ethical considerations.

Subject learning objectives (SLOs)
Upon successful completion of this subject students should be able to:

1.	Understand core concepts of Natural Language Processing (NLP) and computational linguistics including its limitations (CILO 2.2, 2.3)
2.	Evaluate complex challenges for problem solving and build practical NLP applications (CILO 2.3, 4.2)
3.	Apply text mining techniques on unstructured data sets using advanced NLP programming packages (CILOs 1.2, 2.2)
4.	Interpret, extract value and effectively communicate insights from text analysis and create real-world applications suitable to a range of audiences (CILOs 2.4, 3.2, 4.2)
5.	Articulate the strengths, weaknesses and underlying assumptions of NLP and text analysis to apply ethical practices (CILO 5.1, 5.2)
Contribution to the development of graduate attributes
1.2 Explore and test models and generalisations for describing the behaviour of sociotechnical systems and selecting data sources, taking into account the needs and values of different contexts and stakeholders

2.2 Explore, analyse, manipulate, interpret and visualise data using data science techniques, software and technologies to make sense of data rich environments

2.3 Understand and deal critically and openly with the uncertainty, ambiguity and complexity associated with people, systems and data

2.4 Apply and assess data science concepts, theories, practices and tools for designing and managing data discovery investigations in professional environments that draw upon diverse data sources, including efforts to shed light on underrepresented components

3.2 Critically examine the perceived value of data analytics outcomes and clearly articulate implications for different stakeholders and organisations

4.2 Explore and craft interpretative narratives that engage key audiences with data analytics and potential significance for action, at a societal, industrial, organisational, group or individual levels

5.1 Engage in active, reflective practice that supports flexible navigation of assumptions, alternatives and uncertainty in professional data science contexts

5.2 Interrogate and justify ethical responsibilities related to data selection, access, analysis and governance to create a framework for practice

Graduate attributes

GA 1 Sociotechnical systems thinking

GA 2 Creative, analytical and rigorous sense making

GA 3 Create value in problem solving and inquiry

GA 4 Persuasive and robust communication

GA 5 Ethical citizenship

Teaching and learning strategies
Blend of online and face to face activities: The subject is offered through a series of teaching sessions which blend online and face-to-face learning. Students learn through interactive lectures and classroom activities making use of the subject materials on canvas. They also engage in individual and collaborative learning activities to understand and apply text analysis techniques in diverse settings.

Authentic problem based learning: This subject offers a range of authentic data science problems to solve that will help develop students’ text analysis skills. They work on real world data analysis problems for broad areas of interest using unstructured data and contemporary techniques.

Collaborative work: Group activities will enable students to leverage peer-learning and demonstrate effective team participation, as well as learning to work in professional teams with an appreciation of diverse perspectives on data science and innovation.

Future-oriented strategies: Students will be exposed to contemporary learning models using speculative thinking, ethical and human-centered approaches as well as reflection. Electronic portfolios will be used to curate, consolidate and provide evidence of learning and development of course outcomes, graduate attributes and professional evolution. Formative feedback will be offered with all assessment activities for successful engagement.

Content (topics)
• Introduction to unstructured data and natural language text
• Foundations of Natural Language Processing (NLP)
• Text analysis techniques using Python
• Advanced NLP and Deep Learning
• Natural Language Understanding (NLU) and Natural Language Generation (NLG)
• Large Language Models (LLMs)
• Real-world applications of NLP
• Ethical best practices in NLP

Assessment
Assessment task 1: Assessment 1: Text Analysis
Intent:
Assessment 1: Text Analysis

NLP for data analysis (Python code + Markdown report) (Individual, 30%)

Type:	Report
Groupwork:	Individual
Weight:	30%
Assessment task 2: Assessment 2: End-to-end NLP project
Intent:
Assessment 2: End-to-end NLP project

· Part A: Design and development of a NLP application - Group project report and peer review (Group & Individual, 40%)

· Part B: Final Presentation (Group, 10%)

Type:	Report
Groupwork:	Group, group and individually assessed
Weight:	50%
Assessment task 3: Assessment 3: Critical Reflection
Intent:
Critical Reflection on:

Bias and fairness in NLP
Personal learning and portfolio
(Individual, 20%)

A detailed assessment brief will be made available on Canvas once the assignment tasks are released during in-class sessions.

Type:	Reflection
Groupwork:	Individual
Weight:	20%
Minimum requirements
1. Students must participate in all online and face to face requirements
2. Pass all assessment tasks
'''
print(test_string)



36118 Applied Natural Language Processing

Subject handbook information prior to 2024 is available in the Archives.

UTS: Transdisciplinary Innovation
Credit points: 8 cp
Result type: Grade, no marks

Requisite(s): 36100 Data Science for Innovation AND 36103 Statistical Thinking for Data Science AND 36106 Machine Learning Algorithms and Applications
Description
This subject introduces students to the complexities of human language data and the use of Natural Language Processing (NLP) and text mining techniques to analyse them. Students develop both technical and communicative skills to process and interpret unstructured textual data with a range of practical applications. Covering core NLP concepts and the latest developments in large language models, the course equips students with skills for insightful pattern discovery in natural language text, while emphasising ethical considerations.

Subject learning objectives (SLOs)
Upon successful completion of this subject students should be

In the example below, we extract all words that start with the letter 'C':

In [None]:
startswithC = re.findall(r'(C\w+)', test_string)

for txt in startswithC:
    print(txt)

Credit
Covering
CILO
CILO
CILOs
CILOs
CILO
Contribution
Critically
Creative
Create
Collaborative
Content
Critical
Critical
Canvas


In [None]:
# Note that matching is case-sensitive: a lowercase 'c' gives different results
startswithc = re.findall(r'(c\w+)', test_string)

for txt in startswithc:
    print(txt)

cessing
cative
ct
cular
cation
ce
ct
commended
ces
cular
change
ct
ce
ct
ct
chives
ciplinary
cp
cience
cal
cience
chine
cations
cription
ct
ces
complexities
cessing
chniques
chnical
communicative
cess
ctured
ctical
cations
core
concepts
course
covery
cal
considerations
ct
ctives
ccessful
completion
ct
core
concepts
cessing
computational
cs
cluding
complex
challenges
ctical
cations
chniques
ctured
ced
ckages
ct
ctively
communicate
create
cations
ces
culate
cal
ctices
cribing
ciotechnical
cting
ces
ccount
contexts
cience
chniques
chnologies
ch
critically
certainty
complexity
ciated
cience
concepts
ctices
covery
ces
cluding
components
cally
ceived
cs
comes
clearly
culate
cations
craft
ces
cs
cance
ction
cietal
ctive
ctive
ctice
certainty
cience
contexts
cal
ction
ccess
ce
create
ctice
ciotechnical
cal
communication
cal
citizenship
ching
ce
ce
ctivities
ct
ching
ch
ce
ce
ctive
ctures
classroom
ctivities
ct
canvas
collaborative
ctivities
chniques
ct
cience
ctured
contemporary
chniques
ctivi

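Note how it captures word fragments as well (for example, 'cessing' from 'Processing'), because the pattern can start matching in the middle of a word. We need to adjust it to ensure we're only extracting complete words. Let's try the following: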

In [None]:
wordsthatstartwithc = re.findall(r'\b(c\w+)\b', test_string)

for txt in wordsthatstartwithc:
    print(txt)

change
cp
complexities
communicative
core
concepts
course
considerations
completion
core
concepts
computational
complex
challenges
communicate
create
contexts
critically
complexity
concepts
components
clearly
craft
contexts
create
communication
citizenship
classroom
canvas
collaborative
contemporary
contemporary
centered
curate
consolidate
course
code
class


#### Regular Expression Breakdown

r'\b(c\w+)\b'


Let's break down this improved regular expression:

1. `\b`: This is a word boundary anchor. It matches a position rather than a character: the point between a word character and a non-word character, or the start or end of the string next to a word character.
2. `(c\w+)`: This is the main matching group:
    - `c`: Matches the literal character 'c'
    - `\w+`: Matches one or more word characters (letters, digits, or underscores)
3. `\b`: Another word boundary anchor at the end

This pattern will now match complete words that:

- Start with the letter 'c'
- Contain one or more additional word characters
- Are not part of a larger word

Examples of what it will match:

- "cat", "computer", "code", "c123"

Examples of what it won't match:

- "incompatible" (doesn't start with 'c')
- "c-section" (the 'c' is followed by a hyphen, which is not a word character)
- "abc" (doesn't start with 'c')

Exercise: Can you try creating one that captures lower case or upper case characters?
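The match and non-match examples listed in the breakdown can be checked directly. This is a minimal sketch using Python's built-in `re` module. The candidate list repeats the examples from the text and adds "Cat" to show that the pattern is case-sensitive.

```python
import re

pattern = re.compile(r"\b(c\w+)\b")
candidates = ["cat", "computer", "code", "c123", "incompatible", "c-section", "abc", "Cat"]

# Map each candidate to the list of matches found in it (empty list = no match)
results = {word: pattern.findall(word) for word in candidates}

for word, found in results.items():
    print(f"{word!r}: {found}")
```

An empty list means the pattern did not match that candidate.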

Let's write a function that can return matching texts and test it out with RegEx patterns.

In [None]:
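def find_with_regex(regex, text):
    """Print every match of a compiled regex pattern found in text."""
    matches = []
    # find all matching patterns
    for group in regex.findall(text):
        # findall returns a tuple when the pattern has several groups; join it into one string
        matchingtext = ''.join(group)
        matches.append(matchingtext)

    print("All matching texts: ")
    print(matches)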

In [None]:
# Extract every individual digit (each match is a single character)
pattern = re.compile(r'[0-9]')
find_with_regex(pattern, test_string)

All matching texts: 
['3', '6', '1', '1', '8', '2', '0', '2', '4', '8', '3', '6', '1', '0', '0', '3', '6', '1', '0', '3', '3', '6', '1', '0', '6', '1', '2', '2', '2', '3', '2', '2', '3', '4', '2', '3', '1', '2', '2', '2', '4', '2', '4', '3', '2', '4', '2', '5', '5', '1', '5', '2', '1', '2', '2', '2', '2', '3', '2', '4', '3', '2', '4', '2', '5', '1', '5', '2', '1', '2', '3', '4', '5', '1', '1', '1', '3', '0', '3', '0', '2', '2', '2', '4', '0', '1', '0', '5', '0', '3', '3', '2', '0', '2', '0', '1', '2']


In [None]:
# Extract runs of at least 4 and at most 7 digits that are not followed by another digit
pattern = re.compile(r'\d{4,7}(?!\d)')
find_with_regex(pattern, test_string)

All matching texts: 
['36118', '2024', '36100', '36103', '36106']
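One caveat with this pattern: the negative lookahead `(?!\d)` only guards the right-hand edge, so the tail of a longer run of digits can still match. A possible refinement, sketched here with the built-in `re` module and a made-up sample string, adds a negative lookbehind `(?<!\d)` as well.

```python
import re

# Made-up sample string containing a 9-digit run and a 5-digit run
sample = "id 123456789 and code 36118"

# Lookahead only: the last 7 digits of the 9-digit run still match
lookahead_only = re.findall(r"\d{4,7}(?!\d)", sample)

# Lookbehind + lookahead: the whole run must be 4 to 7 digits long
strict = re.findall(r"(?<!\d)\d{4,7}(?!\d)", sample)

print(lookahead_only)  # ['3456789', '36118']
print(strict)          # ['36118']
```

For `test_string` both patterns give the same result, since it contains no run longer than seven digits.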


Note: `match()` only matches if the pattern occurs at the start of the string. `search()` returns the first occurrence of the pattern anywhere in the string. `findall()` scans the whole string and returns all non-overlapping matches of the pattern in a single step.
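The difference between the three functions can be seen on a short string. This sketch uses the built-in `re` module; the sample string reuses two lines from `test_string`.

```python
import re

sample = "Credit points: 8 cp\nWeight: 30%"

# match() looks only at the start of the string
match_digits = re.match(r"\d+", sample)            # None: string starts with 'C'
match_word = re.match(r"Credit", sample).group()   # 'Credit'

# search() returns the first match anywhere in the string
first_number = re.search(r"\d+", sample).group()   # '8'

# findall() returns every non-overlapping match as a list
all_numbers = re.findall(r"\d+", sample)           # ['8', '30']

print(match_digits, match_word, first_number, all_numbers)
```

Since `match()` and `search()` return `None` when nothing is found, check the result before calling `.group()` on real data.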
