# String Operations and NumPy Arrays #

In this lab session we'll go over some Python methods and general approaches for dealing with strings, and then look at how words can be represented with NumPy arrays.


## String Operations ##
We'll start with string operations. Below is an example sentence that we'll use to showcase some of the operations available on string objects in Python.

In [None]:
d = "In the first quarter of fiscal 2018, Walmart reported an astounding figure: 63 percent."

First we'll convert the sentence into a list of words. Python lets us do that with the split() method:

In [None]:
words = d.split()
print(words)
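
As a small aside, here is a sketch of how split() behaves with and without an explicit separator. The strings below are made up for illustration and are not part of the lab data.

```python
# Hypothetical example string with mixed whitespace
raw = "fiscal   2018,\tWalmart\nreported"

# No argument: split on any run of whitespace (spaces, tabs, newlines)
tokens = raw.split()
print(tokens)

# Explicit separator: split only on that exact string
fields = "63,percent,2018".split(",")
print(fields)

# With an explicit separator, repeated separators produce empty strings
gaps = "a,,b".split(",")
print(gaps)
```

Calling split() without arguments is usually the safer choice for free text, since it never returns empty tokens.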

With the sentence stored as a list, we can access each word by its position in the sentence. The position is the list index, starting from 0.

Let's see how we can lowercase a word and remove a punctuation mark:

In [None]:
word = words[7]
word=word.lower()
print(word)

In [None]:
word=words[-1]
print(word)

In [None]:
word=word.replace(".","")
print(word)

In [None]:
word=words[-2]
word

In [None]:
word = words[7]
word

In [None]:
len(word)

With the .isdecimal(), .isalpha() and .isdigit() methods we can check whether a string contains only decimal characters, only alphabetic characters, or only digits, respectively.

In [None]:
word.isdigit()

In [None]:
word=words[6]
print(word)

In [None]:
word.isdigit()
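
Note that these methods test every character of the string, so a token with attached punctuation is not reported as numeric. The sketch below illustrates this on a few sample tokens taken from the sentence; the variable names are chosen only for this example.

```python
# Sample tokens for illustration
samples = ["63", "2018,", "Walmart", "percent."]

# Each method returns True only if *every* character qualifies
checks = {s: (s.isdigit(), s.isdecimal(), s.isalpha()) for s in samples}
for token, (is_digit, is_decimal, is_alpha) in checks.items():
    print(token, is_digit, is_decimal, is_alpha)

# Removing the trailing comma leaves a purely numeric token
cleaned = "2018,".replace(",", "")
print(cleaned, cleaned.isdigit())
```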

A string is a sequence of characters, so we can index and slice it just like a list:

In [None]:
word = words[7]
print(word[:3])

In [None]:
print(word[3:])
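
A few more indexing patterns, as a sketch using the same word ("Walmart"). One difference from lists is worth noting: strings are immutable, so assigning to a position raises a TypeError.

```python
sample = "Walmart"

first = sample[0]        # first character
last = sample[-1]        # negative indices count from the end
tail = sample[-4:]       # last four characters
backwards = sample[::-1] # a step of -1 gives a reversed copy
print(first, last, tail, backwards)

# Strings cannot be modified in place
try:
    sample[0] = "w"
except TypeError as err:
    print("strings are immutable:", err)
```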

A word could be converted into a list of characters:

In [None]:
w = word[:3]
characters = list(w)
print(characters)

We could also use a list comprehension over the characters:

In [None]:
wordlist = [ch for ch in w]
print(wordlist)
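
A list comprehension can also transform or filter the characters as it goes. The following is a small sketch with illustrative variable names:

```python
sample = "Walmart"

# Transform: uppercase every character
upper_chars = [ch.upper() for ch in sample]
print(upper_chars)

# Filter: keep only the vowels
vowels = [ch for ch in sample if ch.lower() in "aeiou"]
print(vowels)

# "".join() turns a list of characters back into a string
rebuilt = "".join(upper_chars)
print(rebuilt)
```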

**[Assignment 1]**
Find a sentence or a paragraph online and explore the different string operations on your own.  
A list of the various string methods can be found here:  
https://docs.scipy.org/doc/numpy/reference/routines.char.html  

**[Solution 1]**

## Word Representations ##
Next we'll look into how we could store and represent words for various NLP approaches using NumPy arrays.

### Words as array indices ###

In [1]:
para="The Federal Reserve considers transparency about the goals, conduct, and stance of monetary policy to be fundamental to the effectiveness of monetary policy. The Federal Reserve Act sets forth the goals of monetary policy."
print(para)

The Federal Reserve considers transparency about the goals, conduct, and stance of monetary policy to be fundamental to the effectiveness of monetary policy. The Federal Reserve Act sets forth the goals of monetary policy.


In most cases words are represented as array indices by assigning each distinct word an integer value. Integers take less memory than strings and are faster to compare and look up. The mapping is built as a dictionary in which each word is a key and the value is the integer index assigned to it. In this representation, sentences become integer arrays.

In [2]:
import numpy as np
words = np.array(para.split())
print(words)

['The' 'Federal' 'Reserve' 'considers' 'transparency' 'about' 'the'
 'goals,' 'conduct,' 'and' 'stance' 'of' 'monetary' 'policy' 'to' 'be'
 'fundamental' 'to' 'the' 'effectiveness' 'of' 'monetary' 'policy.' 'The'
 'Federal' 'Reserve' 'Act' 'sets' 'forth' 'the' 'goals' 'of' 'monetary'
 'policy.']


In [3]:
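dictionary = dict()   # maps each distinct word to an integer index
words_index = list()  # the paragraph as a sequence of word indices
index = 0             # next unused index
for word in words:
    if word not in dictionary:
        # first occurrence: assign the next free index
        dictionary[word] = index
        index += 1
    words_index.append(dictionary[word])

print(dictionary)
print(words_index)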

{'The': 0, 'Federal': 1, 'Reserve': 2, 'considers': 3, 'transparency': 4, 'about': 5, 'the': 6, 'goals,': 7, 'conduct,': 8, 'and': 9, 'stance': 10, 'of': 11, 'monetary': 12, 'policy': 13, 'to': 14, 'be': 15, 'fundamental': 16, 'effectiveness': 17, 'policy.': 18, 'Act': 19, 'sets': 20, 'forth': 21, 'goals': 22}
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 14, 6, 17, 11, 12, 18, 0, 1, 2, 19, 20, 21, 6, 22, 11, 12, 18]
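
Because the indices are assigned in order of first appearance, the mapping can be inverted to decode an integer array back into words. Below is a minimal sketch on a short made-up sentence; the names `vocab`, `encoded` and `index_to_word` are chosen for this example only, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
import numpy as np

# Small illustrative sentence (not the lab paragraph)
tokens = "the goals of the policy".split()

# Build the word -> index dictionary in order of first appearance
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

# Encode the sentence as an integer array
encoded = np.array([vocab[tok] for tok in tokens])
print(encoded)

# Position i of this array holds the word whose index is i
index_to_word = np.array(list(vocab.keys()))

# Fancy indexing decodes the whole sentence at once
decoded = index_to_word[encoded]
print(" ".join(decoded))
```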


In [4]:
for i in range(0,len(words)):
    print (str(words[i])+"\t"+str(words_index[i]))

The	0
Federal	1
Reserve	2
considers	3
transparency	4
about	5
the	6
goals,	7
conduct,	8
and	9
stance	10
of	11
monetary	12
policy	13
to	14
be	15
fundamental	16
to	14
the	6
effectiveness	17
of	11
monetary	12
policy.	18
The	0
Federal	1
Reserve	2
Act	19
sets	20
forth	21
the	6
goals	22
of	11
monetary	12
policy.	18


In [5]:
words.size
len(words)

34

In [6]:
len(dictionary)

23

### Words as binary vectors ###
In NLP applications such as classification, regression and clustering, and for some neural-network-based approaches, it is more convenient to represent words as binary vectors whose number of dimensions is the vocabulary size $|V|$. In this representation, all dimensions of a word's vector are zero except the one at the word's assigned index, which is set to 1. This is also known as a one-hot representation or encoding. For example, let's assume that we have a vocabulary of 3 words, $V$ = {color, money, green}. In the one-hot representation each word gets its own dimension, which yields the following representation:

color = [1,0,0]

money = [0,1,0]

green = [0,0,1]
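
The toy vocabulary can be sketched in NumPy as follows, assuming the indices color=0, money=1 and green=2. The rows of an identity matrix built with np.eye are exactly the one-hot vectors, so indexing that matrix with a sequence of word indices encodes the whole sequence at once.

```python
import numpy as np

# Assumed index assignment for the toy vocabulary
vocab = {"color": 0, "money": 1, "green": 2}

# Row i of the identity matrix is the one-hot vector for index i
identity = np.eye(len(vocab), dtype=int)
one_hot = {word: identity[idx] for word, idx in vocab.items()}
for word, vec in one_hot.items():
    print(word, vec)

# Encode a sequence of indices: green, color, money, green
sequence = [2, 0, 1, 2]
encoded = identity[sequence]
print(encoded.shape)

# argmax along each row recovers the original indices
recovered = np.argmax(encoded, axis=1)
print(recovered)
```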

In [7]:
import numpy as np
dsize = len(dictionary)
para_oh = list()
print(words_index)
for index in words_index:
    temp = np.zeros(dsize)
    temp[index]=1
    para_oh.append(temp)
para_oh = np.array(para_oh)
print(para_oh)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 14, 6, 17, 11, 12, 18, 0, 1, 2, 19, 20, 21, 6, 22, 11, 12, 18]
[[ 1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0.  0.  0.]
 [ 0.

### Words as tuples of frequency counts ###
When word order is not important, text (i.e. documents, paragraphs, sentences, etc.) is represented through the bag of words approach. This approach only considers how often each word occurs in the text. It is typically performed by going over the words $w$ and counting the number of times each word occurs. Once the counts are collected, the text is represented as a set of tuples, each consisting of a word index $w$ and its frequency count $fc(w)$:

($w$, $fc(w)$)

For example, the following sentence:  
"We have 47 prefectures and each prefecture will have a store in Tokyo."  

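would be represented as follows, with words indexed in order of first appearance and "prefectures" and "prefecture" treated as the same word:  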
[(0,1), (1,2), (2,1), (3,2), (4,1), (5,1), (6,1), (7,1), (8,1), (9,1), (10,1)]  

Let's now represent the paragraph using the bag of words approach: 


In [8]:
bow = dict()  # maps word index -> number of occurrences
for index in words_index:
    if index in bow:
        bow[index] += 1
    else:
        bow[index] = 1

print(bow)
# format each (index, count) pair as the string "(index:count)"
para_bow = list()
for index in bow:
    para_bow.append("(" + str(index) + ":" + str(bow[index]) + ")")
print(para_bow)

{0: 2, 1: 2, 2: 2, 3: 1, 4: 1, 5: 1, 6: 3, 7: 1, 8: 1, 9: 1, 10: 1, 11: 3, 12: 3, 13: 1, 14: 2, 15: 1, 16: 1, 17: 1, 18: 2, 19: 1, 20: 1, 21: 1, 22: 1}
['(0:2)', '(1:2)', '(2:2)', '(3:1)', '(4:1)', '(5:1)', '(6:3)', '(7:1)', '(8:1)', '(9:1)', '(10:1)', '(11:3)', '(12:3)', '(13:1)', '(14:2)', '(15:1)', '(16:1)', '(17:1)', '(18:2)', '(19:1)', '(20:1)', '(21:1)', '(22:1)']
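
Since the words are already integer indices, the same counts can also be obtained with NumPy's np.bincount, which returns an array whose position i holds the number of times i occurs. This is an alternative sketch rather than the approach used above, and the short index list is made up for illustration.

```python
import numpy as np

# Made-up index sequence for illustration
sample_index = [0, 1, 2, 1, 3, 0, 1]

# counts[i] is the number of occurrences of index i
counts = np.bincount(sample_index)
print(counts)

# Convert to (index, count) tuples, skipping indices that never occur
pairs = [(i, int(c)) for i, c in enumerate(counts) if c > 0]
print(pairs)
```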


**[Assignment 2]**
At the beginning of this lab session we learned how to lowercase a word and replace a character. Use these two operations to generate a new dictionary in which all words are lowercased and periods ("."), commas (",") and colons (":") are removed. 

**[Solution 2]**