# Introduction to Natural Language Processing

### Tutorial 2

---

In our last session, we explored some features of Jupyter notebooks (`?` and `!`), saw how to `import` libraries, and looked at what a `"hello world"` looks like in Python.

Today, we will look at a few more details and learn to extract features from text using Pandas and scikit-learn.

Let's begin...

## Arrays and Hash Maps are called Lists and Dictionaries 

What is commonly known as an array is called a list in Python. Similarly, a dictionary is a data structure that works like a lookup table. One interesting aspect is that both structures can hold values of several different types.

In [24]:
my_string = "TEST ABC"
my_integer = 42
my_slice = my_string[:3]
my_float = 3.5

In [25]:
my_list = [my_string, my_integer, my_slice, my_float]

In [26]:
print(my_list)

['TEST ABC', 42, 'TES', 3.5]


In [27]:
another_list = my_string.split(" ") + my_list

In [28]:
print(another_list)

['TEST', 'ABC', 'TEST ABC', 42, 'TES', 3.5]


In [29]:
my_list.append(my_string.split(" "))

In [30]:
print(my_list)

['TEST ABC', 42, 'TES', 3.5, ['TEST', 'ABC']]


In [31]:
my_dict = {"a": 1, "b":2, "c": 3, "d": my_list}

In [32]:
print(my_dict)

{'a': 1, 'b': 2, 'c': 3, 'd': ['TEST ABC', 42, 'TES', 3.5, ['TEST', 'ABC']]}
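
Values stored in these structures can be read back by position (lists) or by key (dictionaries). The following is a small sketch of both; the names `mixed` and `scores` are made up for illustration and are not part of the cells above.

```python
# A list keeps its elements in order and can mix types.
mixed = ["TEST ABC", 42, 3.5]
first_item = mixed[0]    # access by position
last_item = mixed[-1]    # negative indexes count from the end

# A dictionary maps keys to values, like a lookup table.
scores = {"a": 1, "b": 2}
scores["c"] = 3                 # add a new key
value_b = scores["b"]           # look up by key
missing = scores.get("z", 0)    # fall back to a default instead of raising KeyError

print(first_item, last_item, value_b, missing)
```

Using `get()` with a default is one way to avoid a `KeyError` when a key might not exist.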


## Defining Functions

Task 1: Let's say we want a function that always returns the last letter of each word in a string.

In [33]:
this_string = "Introduction to Natural Language Processing, second tutorial"

In [34]:
def extract_last_letter(any_string):
    # here comes your code
    return letter_list

my_letter_list = extract_last_letter(this_string)

print(my_letter_list)
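
One possible way to approach Task 1 is sketched below. It assumes that "word" means anything separated by whitespace, so punctuation attached to a word counts as part of it. Other solutions are equally valid.

```python
def extract_last_letter(any_string):
    """Return the last character of every whitespace-separated word."""
    letter_list = []
    for word in any_string.split():
        letter_list.append(word[-1])  # index -1 is the last character
    return letter_list


this_string = "Introduction to Natural Language Processing, second tutorial"
print(extract_last_letter(this_string))
# ['n', 'o', 'l', 'e', ',', 'd', 'l']
```

Note that the fifth entry is a comma, because `"Processing,"` ends with one. Stripping punctuation first would be a reasonable refinement.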

Task 2: Create a function that returns every word in a string in lowercase, and another one for uppercase.

**Hint:** explore the functions `upper` and `lower`
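
As a quick illustration of the hint, the snippet below calls both string methods on a made-up sample string (`sample` is a hypothetical name). It is only meant to show what the methods return, not to solve the task.

```python
sample = "Hello NLP"

print(sample.lower())  # hello nlp
print(sample.upper())  # HELLO NLP
print(sample)          # Hello NLP (strings are immutable, the original is unchanged)
```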

In [None]:
def return_lower(any_string):
    # here comes your code
    return lower
    

In [None]:
print(return_lower(this_string))

In [None]:
def return_upper(any_string):
    # here comes your code
    return upper

In [None]:
print(return_upper(this_string))

## Iterating

There are several ways of iterating: you can use the `enumerate()` function, or you can combine `for`, `range()`, and the `len()` of your sequence. Please make sure that you use meaningful variable names here.
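
The two iteration styles can be compared side by side. This is a minimal sketch on a made-up list (`words` is an illustrative name); both loops build the same list of `(index, word)` pairs.

```python
words = ["natural", "language", "processing"]

# Variant 1: range() over the length, then index into the list.
pairs_with_range = []
for index in range(len(words)):
    pairs_with_range.append((index, words[index]))

# Variant 2: enumerate() yields (index, item) pairs directly.
pairs_with_enumerate = []
for index, word in enumerate(words):
    pairs_with_enumerate.append((index, word))

print(pairs_with_range)
print(pairs_with_range == pairs_with_enumerate)  # True
```

`enumerate()` is usually easier to read because it avoids the explicit `words[index]` lookup.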

Task 3: Let's create a function that returns a mapping from each word in a string to its index.

In [35]:
def map_word_to_index(any_string):
    word_list = any_string.split(" ")
    my_mapping = {}
    for index in range(len(word_list)):
        my_mapping[word_list[index]] = index
    return my_mapping

In [36]:
print(map_word_to_index(this_string))

{'Introduction': 0, 'to': 1, 'Natural': 2, 'Language': 3, 'Processing,': 4, 'second': 5, 'tutorial': 6}


Now, explore the `enumerate()` function and write a similar function that returns a list containing only the words at *even* indexes.

In [37]:
def extract_even_words(any_string):
    # here comes your code
    mapping = []
    word_list = any_string.split(" ")
    for index, value in enumerate(word_list):
        if index % 2 == 0:
            mapping.append(value)
    return (mapping)

In [38]:
print(extract_even_words(this_string))

['Introduction', 'Natural', 'Processing,', 'tutorial']


## Reading text files

Next, we want to know how to read files in Python. The `open()` function is the standard way to work with files, and it can be used in several ways.

In [39]:
with open("tweet.txt", 'r', encoding='utf-8') as my_file:
    for line in my_file:
        print(line)
        print("----")

Play Services bittet User um Erlaubnis für:\n\n* Before starting to scan for and broadcast beacons.\n* Before providing user keys to the app for uploading to the […] server once the user has been positively diagnosed with COVID-19.\n\n#CoronaApp dürfte das nicht verhindern können!

----
GPT-4 and its ilk are awesome for rapid prototyping and one-offs, but at the end of the day, enterprises will deploy far smaller distilled models in production.

----
Are you wondering how large language models like ChatGPT and InstructGPT actually work? One of the secret ingredients is RLHF - Reinforcement Learning from Human Feedback.
----
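
Besides iterating line by line, a file object can also be read in one go or as a list of lines. The sketch below writes a small throwaway file first so that it runs on its own. The file name and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small temporary file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("first line\nsecond line\n")

# 1) Read the whole file into a single string.
with open(path, "r", encoding="utf-8") as in_file:
    whole_text = in_file.read()

# 2) Read all lines into a list (each line keeps its trailing newline).
with open(path, "r", encoding="utf-8") as in_file:
    lines = in_file.readlines()

# 3) Iterate line by line and strip the trailing newline.
with open(path, "r", encoding="utf-8") as in_file:
    clean_lines = [line.rstrip("\n") for line in in_file]

print(repr(whole_text))
print(lines)
print(clean_lines)
```

The trailing newline kept by each line is also why `print(line)` produces an extra blank line in the output: `print` adds its own newline on top of the one already in `line`.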


## Tokenization

In [40]:
from urllib.request import urlopen
import re

little_women_url = 'http://www.gutenberg.org/cache/epub/514/pg514.txt'

def read_url(url):
    return re.sub('\\s+', ' ', urlopen(url).read().decode())

text = read_url(little_women_url)

URLError: <urlopen error [Errno -3] Temporary failure in name resolution>

In [None]:
text[:100]

In [None]:
print("length of dataset in characters: ", len(text))

In [None]:
# here are all the unique characters that occur in this text
chars = sorted(list(set(text)))
vocab_size = len(chars)
print(''.join(chars))
print(vocab_size)
print(chars)

In [None]:
s_to_i = { ch:i for i,ch in enumerate(chars) }
i_to_s = { i:ch for i,ch in enumerate(chars) }
encode = lambda s: [s_to_i[c] for c in s]
decode = lambda l: ''.join([i_to_s[i] for i in l])

In [None]:
print(encode("introduction to"))
print(decode(encode("introduction to")))
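
Because the download step depends on network access, the same character-level idea can also be tried on a short in-memory string. This is a minimal sketch with a made-up `sample_text`. It uses plain functions instead of lambdas, but the logic is the same as in the cells above.

```python
sample_text = "hello world"

# The vocabulary is the sorted set of characters that occur in the text.
chars = sorted(set(sample_text))
s_to_i = {ch: i for i, ch in enumerate(chars)}
i_to_s = {i: ch for i, ch in enumerate(chars)}


def encode(s):
    """Map each character to its integer id."""
    return [s_to_i[c] for c in s]


def decode(ids):
    """Map a list of integer ids back to a string."""
    return "".join(i_to_s[i] for i in ids)


ids = encode("hello")
print(chars)        # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
print(ids)          # [3, 2, 4, 4, 5]
print(decode(ids))  # hello
```

A character that never occurs in the source text has no id, so encoding it raises a `KeyError`.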

In [None]:
# let's encode the text dataset and save it into a torch.Tensor
import torch
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])

In [None]:
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")

In [None]:
enc.n_vocab

In [None]:
enc.encode("introduction to")

In [None]:
enc.decode([396, 17158, 311])

In [None]:
enc.encode("introduce To")

## Lemmatization

In [41]:
import nltk
nltk.download('wordnet')
#nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to /home/faris/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [42]:
from nltk.stem import WordNetLemmatizer

In [43]:
lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("worst"))
print(lemmatizer.lemmatize("born"))
print(lemmatizer.lemmatize("was"))

worst
born
wa


## Stemming

In [None]:
from nltk.stem import PorterStemmer

In [None]:
stemmer = PorterStemmer()
stemmer.stem("helping")

## Pandas

Take the survey.csv file as an example ...

In [None]:
import pandas as pd

In [None]:
# here comes your code
# Read the CSV file


## Pandas Series

Pandas has two data structures that we will consider in this class, `Series` and `DataFrame`. Let's take a closer look at `Series`. At first glance, it's like playing around with a `list`. We already know what a list is and why this data structure is relevant. A `Series` is very similar, but Pandas lets us name the index entries, which makes the data much easier to read.

In [None]:
list1 = "This is the first document".split(" ")

In [None]:
print(list1)

In [None]:
my_series = pd.Series(list1)

In [None]:
my_series

In [None]:
my_dict = {"this":0, "is":1, "the":2, "first":3, "document":4}
my_index = [0, 1, 2, 3, 4]

In [None]:
series1 = pd.Series(my_dict)

In [None]:
series1

In [None]:
list2 = "this document is the second document".split(" ")
series2 = pd.Series(data=[0, 1, 2, 3, 4, 5], index = list2)

In [None]:
series2
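
Once a `Series` has a named index, values can be looked up by label as well as by position. This is a small sketch on a made-up series (`word_positions` is an illustrative name) and assumes Pandas is installed.

```python
import pandas as pd

word_positions = pd.Series([0, 1, 2], index=["this", "is", "text"])

by_label = word_positions["is"]        # label-based lookup
by_position = word_positions.iloc[0]   # position-based lookup

print(by_label, by_position)
```

If a label occurs more than once in the index, a label lookup returns all matching entries as a `Series` rather than a single value.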

## DataFrame

Now, let's take a look at what a DataFrame is. Since we already know the Series data structure, we can define a DataFrame as several Series that share the same index. Here, we will use NumPy to create a random matrix, and we also set a fixed seed. Do you know why?

In [None]:
import numpy as np
from numpy.random import randn
np.random.seed(123)

In [None]:
df1 = pd.DataFrame(randn(5,4),index=[0, 1, 2, 3, 4],columns="A B C D".split(" "))

In [None]:
display(df1)
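
To see what the seed does, the sketch below draws random numbers twice with the same seed and compares the results. It assumes NumPy is installed; the variable names are made up for illustration.

```python
import numpy as np
from numpy.random import randn

np.random.seed(123)
first_draw = randn(3)

np.random.seed(123)      # reset the generator to the same seed
second_draw = randn(3)   # the same numbers are drawn again

print(np.array_equal(first_draw, second_draw))  # True
```

With a fixed seed, everyone who runs the notebook gets the same "random" matrix, which makes results reproducible and easier to discuss.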

### Indexing DataFrames

Here things start to look a bit different. To select a single column, we just use its name; to select several columns, we pass their names as a list.

In [None]:
df1["B"]

In [None]:
df1[["C", "D"]]
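
The difference between the two forms also shows up in the type of the result. This is a minimal sketch on a small made-up DataFrame (`df` and its values are illustrative) and assumes Pandas is installed.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "C": [5, 6]})

one_column = df["B"]          # a single name returns a Series
as_frame = df[["B"]]          # a list with one name returns a DataFrame
two_columns = df[["A", "C"]]  # a list with two names returns a DataFrame

print(type(one_column).__name__)  # Series
print(type(as_frame).__name__)    # DataFrame
print(two_columns.shape)          # (2, 2)
```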

In [None]:
df1["E"] = df1["A"] * df1["D"]

In [None]:
df1

In [None]:
df1.drop("E", axis=1, inplace=True)

In [None]:
df1

### Apply

Pandas allows us to apply a function to every element of a `Series`, which can be very useful. Let's take a look.

In [None]:
corpus = ["This is the first document",
           "This document is the second document",
           "And this is the third one", 
           "Is this the first document"]

In [None]:
df2 = pd.DataFrame(corpus, columns=["text"])

In [None]:
df2

Task 4: Create a function which counts the words in a string

In [None]:
def count_words(any_string):
    return 
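
One possible way to fill in Task 4 is sketched below. It assumes that words are separated by whitespace and does not treat punctuation specially. The function is tried on the corpus from above using a plain list comprehension.

```python
def count_words(any_string):
    """Count the whitespace-separated words in a string."""
    return len(any_string.split())


corpus = ["This is the first document",
          "This document is the second document",
          "And this is the third one",
          "Is this the first document"]

print([count_words(document) for document in corpus])  # [5, 6, 6, 5]
```

A function like this can be passed to `apply` on a text column, which calls it once per row.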

In [None]:
# Add a new column to DF >> "count_words"
# Your code comes here
df2["count_words"] = df2["text"].apply(count_words)

In [None]:
df2

### Visualization

In [None]:
# Bar plot
# Your code comes here
df2.plot.bar(x="text", y="count_words")

### Scikit-Learn: Understanding CountVectorizer

The CountVectorizer is specifically used for counting words. The vectorizer part of CountVectorizer is (technically speaking!) the process of converting text into some sort of number-y thing that computers can understand.

In [44]:
from sklearn.feature_extraction.text import CountVectorizer

# Build the text
text = """The CountVectorizer is specifically used for counting words.
The vectorizer part of CountVectorizer is (technically speaking!) the process of converting text into some sort of number-y
thing that computers can understand."""

vectorizer = CountVectorizer()

matrix = vectorizer.fit_transform([text])
matrix

<1x25 sparse matrix of type '<class 'numpy.int64'>'
	with 25 stored elements in Compressed Sparse Row format>

In [45]:
matrix.toarray()

array([[1, 1, 1, 1, 2, 1, 1, 2, 1, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1,
        1, 1, 1]])

In [49]:
print(vectorizer.get_feature_names_out())

['can' 'computers' 'converting' 'counting' 'countvectorizer' 'for' 'into'
 'is' 'number' 'of' 'part' 'process' 'some' 'sort' 'speaking'
 'specifically' 'technically' 'text' 'that' 'the' 'thing' 'understand'
 'used' 'vectorizer' 'words']
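
To get a feeling for what happens inside, the default behaviour can be approximated with the standard library alone: lowercase the text, keep tokens of at least two word characters, and count them. This is only a rough sketch of the defaults (`lowercase=True` and the default `token_pattern`), not the actual scikit-learn implementation. The names `sample_text`, `tokens`, and `token_counts` are made up.

```python
import re
from collections import Counter

sample_text = """The CountVectorizer is specifically used for counting words.
The vectorizer part of CountVectorizer is (technically speaking!) the process of converting text into some sort of number-y
thing that computers can understand."""

# Lowercase, then keep only tokens with two or more word characters.
tokens = re.findall(r"\b\w\w+\b", sample_text.lower())
token_counts = Counter(tokens)
vocabulary = sorted(token_counts)  # alphabetical order

print(len(vocabulary))                                # 25
print(token_counts["the"], token_counts["of"])        # 3 3
print(token_counts["countvectorizer"])                # 2
```

The single-letter token `y` from `number-y` is dropped by the two-character minimum, which is why only `number` appears in the vocabulary.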


In [50]:
import pandas as pd

counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())

In [None]:
counts

In [None]:
# Sort the DF and show the top 10 most common words
counts.T.sort_values(by=0, ascending=False).head(10)

In [51]:
import requests

# Download the book >> Title: Pride and Prejudice
response = requests.get('http://www.gutenberg.org/cache/epub/42671/pg42671.txt')
text = response.text

# Look at some text in the middle
print(text[4101:4600])

the best of the
party."

"My dear, you flatter me. I certainly _have_ had my share of beauty, but
I do not pretend to be any thing extraordinary now. When a woman has
five grown up daughters, she ought to give over thinking of her own
beauty."

"In such cases, a woman has not often much beauty to think of."

"But, my dear, you must indeed go and see Mr. Bingley when he comes into
the neighbourhood."

"It is more than I engage for, I assure you."

"But consider your daughters. Onl


In [55]:
# How often have the words "love" and "hate" been used in the book?
#your code comes here
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform([text])
len(vectorizer.get_feature_names_out())

6719

In [54]:
import pandas as pd

counts = pd.DataFrame(matrix.toarray(),
                      columns=vectorizer.get_feature_names_out())
print(counts['love'])
print(counts['hate'])