# Creating variables

In this notebook we will look into the concept of variables.

Python, like R, is a dynamically typed language: a variable is just a name, and you can bind it to an object of a different type at any point. This is convenient, but it also means that you cannot rely on a variable keeping the type it started with, so you sometimes have to retrace your steps through the code to see what a variable currently holds. That is especially hard in a notebook like this one, where cells can be executed in any order.
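
As a small illustration of dynamic typing, the sketch below binds one name to values of different types and checks the type before using it. The variable `value` and the helper `describe` are made up for this example.

```python
# The same name can point to objects of different types over time
value = 10
print(type(value))

value = "ten"
print(type(value))

# Hypothetical helper: check the type before relying on it
def describe(x):
    if isinstance(x, str):
        return "text of length " + str(len(x))
    if isinstance(x, (int, float)):
        return "number with value " + str(x)
    return "something else"

print(describe(value))
print(describe(3.5))
```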

## Intro and strings

Let's create a variable:

In [1]:
name = "Edinburgh"
name

'Edinburgh'

This creates a string variable. In a notebook, the value of the last expression in a cell is displayed automatically, but it is safer to use the print function explicitly:

In [2]:
print(name)

Edinburgh


It is also wise to check the type of a variable when you are unsure what it holds:

In [3]:
type(name)

str

This confirms that we are dealing with a string. Strings can be written with either single or double quotes, and they come with a number of useful methods:

In [4]:
name = 'university of edinburgh'
print(name.lower())
print(name.upper())
print(name.title())

university of edinburgh
UNIVERSITY OF EDINBURGH
University Of Edinburgh


We can concatenate strings using +, or pass several arguments to print separated by commas, in which case print inserts a space between them:

In [5]:
print('University', 'of Edinburgh')
print('University' + ' ' + 'of Edinburgh')

University of Edinburgh
University of Edinburgh


Writing print('The University of Edinburgh is '+ 439) will not work, because + cannot combine a string with a number. However, we can convert any object into a string with str():

In [6]:
print('The University of Edinburgh is '+ str(439))

The University of Edinburgh is 439
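
Formatted strings are another way to mix text and numbers without calling str() yourself. A minimal sketch, reusing the number from the cell above under the assumed name `age`:

```python
# Assumed example value, taken from the cell above
age = 439

# An f-string converts the value to text for us
message = f"The University of Edinburgh is {age}"
print(message)

# str.format() is an older alternative with the same result
message2 = "The University of Edinburgh is {}".format(age)
print(message2)
```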


A few other useful tricks:

In [7]:
name = " edinburgh "
print("|"+name.lstrip()+"|")
print("|"+name.rstrip()+"|")
print("|"+name.strip()+"|")

|edinburgh |
| edinburgh|
|edinburgh|


You can use escape sequences such as \t (tab) and \n (new line) as well:

In [8]:
print('Edinburgh\thas a university\nrunning web & social network analytics course')

Edinburgh	has a university
running web & social network analytics course


## Numbers

In [9]:
a = 10
b = -10.1023

#Some operations illustrated (\t stands for a tab)
print("a: \t\t\t" + str(a))
print("b: \t\t\t" + str(b))
print("absolute of b: \t\t" + str(abs(b)))
print("rounded b: \t\t" + str(round(b,3)))
print("square of a: \t\t" + str(pow(a,2)))
print("cube of a: \t\t" + str(a**3))
print("integer part of b: \t" + str(int(b)))

a: 			10
b: 			-10.1023
absolute of b: 		10.1023
rounded b: 		-10.102
square of a: 		100
cube of a: 		1000
integer part of b: 	-10
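
Note that int() simply drops the decimals, which for negative numbers is not the same as rounding down. The sketch below compares it with the floor and ceil functions from the standard math module, and also shows integer division and the remainder operator:

```python
import math

b = -10.1023

truncated = int(b)        # drops the decimals, moving towards zero
floored = math.floor(b)   # largest integer that is not above b
ceiled = math.ceil(b)     # smallest integer that is not below b

print(truncated, floored, ceiled)

# Integer division and remainder
quotient = 10 // 3
remainder = 10 % 3
print(quotient, remainder)
```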


# Flow Control

Control flow statements let you structure your code: they decide which statements run, and how often, through conditions and loops.

## If statements

In [10]:
price = -5

if price < 0:
    print("Price is negative!")
elif price < 1:
    print("Price is too small!")
else:
    print("Price is suitable.")

Price is negative!


Especially in text mining, comparing strings is very important:

In [11]:
#Comparing strings
name1 = "edinburgh"
name2 = "Edinburgh"

if name1 == name2:
    print("Equal")
else:
    print("Not equal")

if name1.lower() == name2.lower():
    print("Equal")
else:
    print("Not equal")

Not equal
Equal
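
In text mining, stray whitespace is often as much of a nuisance as differences in case. A possible way to normalise both before comparing is sketched below; the helper name `same_text` is made up for this example.

```python
def same_text(left, right):
    """Hypothetical helper: compare two strings, ignoring case and outer whitespace."""
    return left.strip().casefold() == right.strip().casefold()

print(same_text("edinburgh", "Edinburgh"))
print(same_text("  Edinburgh ", "EDINBURGH"))
print(same_text("Edinburgh", "Glasgow"))
```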


Using multiple conditions:

In [12]:
number = 9
if number > 1 and not number > 9:
    print("Number is between 1 and 10")
    
number = 9
name = 'johannes'
if number < 5 or 'j' in name:
    print("Number is lower than 5 or the name contains a 'j'")

Number is between 1 and 10
Number is lower than 5 or the name contains a 'j'
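
Range checks like the first one above can also be written as a chained comparison, which reads closer to the mathematical notation. A short sketch, where `in_range` is just an illustrative name:

```python
number = 9

# Chained comparison: the same as (1 < number) and (number < 10)
if 1 < number < 10:
    print("Number is between 1 and 10")

# The result of a comparison is an ordinary boolean value
in_range = 1 < number < 10
print(in_range)
```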


## While loops

In [13]:
number = 4
while number > 1:
    print(number)
    number = number -1

4
3
2


## For loops

For loops allow you to iterate over the elements of a collection, for example a list:

In [14]:
# We'll look into lists in a minute
number_list = [1, 2, 3, 4]
for item in number_list:
    print(item)

1
2
3
4


In [15]:
# Avoid calling a variable 'list', as that would hide the built-in list type
letters = ['a', 'b', 'c']
for item in letters:
    print(item)

a
b
c


Ranges are also useful. Note that the upper bound is not included, and that a third argument sets the step size:

In [16]:
for i in range(1,4):
    print(i)

1
2
3


In [17]:
for i in range(30,100, 10):
    print(i)

30
40
50
60
70
80
90
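
A range produces its numbers one by one rather than storing them all. To inspect one, it can be turned into a list, as in this small sketch (the variable names are illustrative):

```python
# range() is lazy, so wrap it in list() to see its elements
up = list(range(1, 4))
print(up)

# A negative step counts down; the stop value is still excluded
down = list(range(10, 0, -3))
print(down)

# With a single argument, counting starts at 0
from_zero = list(range(3))
print(from_zero)
```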


## Indentation

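Be very careful with indentation: it determines which statements belong to which block. In the example below, the last print statement is indented by one level only, so it runs regardless of the inner if condition: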

In [18]:
number_1 = 3
number_2 = 5

print('No indent (no tabs used)')
if number_1 > 1:
    print('\tNumber 1 higher than 1.')
    if number_2 > 5:
        print('\t\tnumber 2 higher than 5')
    print('\tnumber 2 higher than 5')

number_1 = 3
number_2 = 6

print('No indent (no tabs used)')
if number_1 > 1:
    print('\tNumber 1 higher than 1.')
    if number_2 > 5:
        print('\t\tnumber 2 higher than 5')
    print('\tnumber 2 higher than 5')

No indent (no tabs used)
	Number 1 higher than 1.
	number 2 higher than 5
No indent (no tabs used)
	Number 1 higher than 1.
		number 2 higher than 5
	number 2 higher than 5


# List & Tuple

## Lists

Lists are great for collecting anything. They can contain objects of different types. For example:

In [19]:
names = [5, "Giovanni", "Rose", "Yongzhe", "Luciana", "Imani"]

Mixing types like this is not best practice, though. Let's start with a list of names:

In [20]:
names = ["Johannes", "Giovanni", "Rose", "Yongzhe", "Luciana", "Imani"]

In [21]:
# Loop names
for name in names:
    print('Name: '+name)

# Get 'Giovanni' from list
# Lists start counting at 0
giovanni = names[1]
print(giovanni.upper())

# Get last item
name = names[-1]
print(name.upper())

# Get second to last item
name = names[-2]
print(name.upper())

print("First three: "+str(names[0:3]))
print("First four: "+str(names[:4]))
print("Up until the second to last one: "+str(names[:-2]))
print("Last two: "+str(names[-2:]))

Name: Johannes
Name: Giovanni
Name: Rose
Name: Yongzhe
Name: Luciana
Name: Imani
GIOVANNI
IMANI
LUCIANA
First three: ['Johannes', 'Giovanni', 'Rose']
First four: ['Johannes', 'Giovanni', 'Rose', 'Yongzhe']
Up until the second to last one: ['Johannes', 'Giovanni', 'Rose', 'Yongzhe']
Last two: ['Luciana', 'Imani']
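
Slices also accept a third value, the step. The sketch below uses the same list of names to take every second element and to build a reversed copy:

```python
names = ["Johannes", "Giovanni", "Rose", "Yongzhe", "Luciana", "Imani"]

every_second = names[::2]     # start at index 0, take every second item
reversed_names = names[::-1]  # a reversed copy; names itself is unchanged
middle = names[1:-1]          # everything except the first and last item

print(every_second)
print(reversed_names)
print(middle)
```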


## Enumeration

With enumerate() we can loop over a collection and get the index of every element along with the element itself:

In [22]:
for index, name in enumerate(names):
    print(str(index) , " " , name, " is in the list.")

0   Johannes  is in the list.
1   Giovanni  is in the list.
2   Rose  is in the list.
3   Yongzhe  is in the list.
4   Luciana  is in the list.
5   Imani  is in the list.
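
If a numbering that starts at 1 is more natural, for instance in a report, enumerate() takes a start argument. A minimal sketch with a made-up short list:

```python
short_list = ["Johannes", "Giovanni", "Rose"]

numbered = []
for position, name in enumerate(short_list, start=1):
    numbered.append(str(position) + ". " + name)

for line in numbered:
    print(line)
```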


## Searching and editing

In [23]:
names = ["Johannes", "Giovanni", "Rose", "Yongzhe", "Luciana", "Imani"]

# Finding an element
print(names.index("Johannes"))

# Adding an element
names.append("Kumiko")

# Adding an element at a specific location
names.insert(2, "Roberta")

print(names)

#Removal
fruits = ["apple","orange","pear"]
del fruits[0]
fruits.remove("pear")
print('Fruits: ', fruits)

# Modifying an element
names[5] = "Tom"
print(names)

# Test whether an item is in the list (best do this before removing to avoid raising errors)
print("Tom" in names)

# Length of a list
print("Length of the list: " + str(len(names)))

0
['Johannes', 'Giovanni', 'Roberta', 'Rose', 'Yongzhe', 'Luciana', 'Imani', 'Kumiko']
Fruits:  ['orange']
['Johannes', 'Giovanni', 'Roberta', 'Rose', 'Yongzhe', 'Tom', 'Imani', 'Kumiko']
True
Length of the list: 8
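
Removing an item that is not in the list raises a ValueError. The sketch below shows two ways of guarding against that, plus pop(), which removes an item by position and hands it back. The fruit names are only examples.

```python
fruits = ["apple", "orange", "pear"]

# remove() raises a ValueError when the item is missing, so test first
if "banana" in fruits:
    fruits.remove("banana")

# Alternatively, catch the error
try:
    fruits.remove("kiwi")
except ValueError:
    print("kiwi was not in the list")

# pop() removes the last item (or the item at a given index) and returns it
last = fruits.pop()
print(last, fruits)
```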


Remember that Python starts counting at 0, so names[5] above is the sixth element.

## Sorting and copying

In [24]:
# Temporary sorting:
print(sorted(names))
print(names)

# Make changes permanent
names.sort()
print("Sorted names: " + str(names))
names.sort(reverse=True)
print("Reverse sorted names: " + str(names))

['Giovanni', 'Imani', 'Johannes', 'Kumiko', 'Roberta', 'Rose', 'Tom', 'Yongzhe']
['Johannes', 'Giovanni', 'Roberta', 'Rose', 'Yongzhe', 'Tom', 'Imani', 'Kumiko']
Sorted names: ['Giovanni', 'Imani', 'Johannes', 'Kumiko', 'Roberta', 'Rose', 'Tom', 'Yongzhe']
Reverse sorted names: ['Yongzhe', 'Tom', 'Rose', 'Roberta', 'Kumiko', 'Johannes', 'Imani', 'Giovanni']


In [25]:
# Plain assignment does not copy: both names refer to the same list object
namez = names
namez.remove("Johannes")
print(namez)
print(names)

# Now a real (shallow) copy: a new list holding the same elements
print("After copy")

namez = names.copy()
namez.remove("Giovanni")
print(namez)
print(names)

# Alternative: a full slice also makes a shallow copy
namez = names[:]
print(namez)

['Yongzhe', 'Tom', 'Rose', 'Roberta', 'Kumiko', 'Imani', 'Giovanni']
['Yongzhe', 'Tom', 'Rose', 'Roberta', 'Kumiko', 'Imani', 'Giovanni']
After copy
['Yongzhe', 'Tom', 'Rose', 'Roberta', 'Kumiko', 'Imani']
['Yongzhe', 'Tom', 'Rose', 'Roberta', 'Kumiko', 'Imani', 'Giovanni']
['Yongzhe', 'Tom', 'Rose', 'Roberta', 'Kumiko', 'Imani', 'Giovanni']
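
Both copy() and [:] duplicate only the outer list. The difference with plain assignment, and with copy.deepcopy from the standard library, shows up once a list contains other lists. A sketch with made-up nested data called `groups`:

```python
import copy

# Hypothetical nested data: a list of lists
groups = [["Rose", "Tom"], ["Imani"]]

alias = groups                   # same list object, no copy at all
shallow = groups.copy()          # new outer list, shared inner lists
deep = copy.deepcopy(groups)     # new outer list and new inner lists

groups[0].append("Kumiko")       # change an inner list
groups.append(["Galina"])        # change the outer list

print(alias)
print(shallow)
print(deep)
```

The alias sees both changes, the shallow copy sees only the change to the shared inner list, and the deep copy sees neither.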


## Strings as lists

Strings can be indexed, sliced and searched just like lists, although they cannot be modified in place. This is especially handy in text mining:

In [26]:
course = "Predictive analytics"
print("Last nine letters: "+course[-9:])
print("Analytics in course title? " + str("analytics" in course))
print("Start location of 'analytics': " + str(course.find("analytics")))
print(course.replace("analytics","analysis"))
list_of_words = course.split(" ")
for index, word in enumerate(list_of_words):
    print("Word ", index, ": "+word)

Last nine letters: analytics
Analytics in course title? True
Start location of 'analytics': 11
Predictive analysis
Word  0 : Predictive
Word  1 : analytics
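
One difference with lists is worth seeing in action: a string cannot be changed in place, so methods such as replace() always return a new string. The sketch below also uses join(), the counterpart of split(); the name `shouted` is illustrative.

```python
course = "Predictive analytics"

# Strings cannot be changed in place: item assignment raises a TypeError
try:
    course[0] = "p"
except TypeError:
    print("Strings are immutable")

# Build a new string instead
words = course.split(" ")
shouted = "-".join(word.upper() for word in words)
print(shouted)
```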


## Sets

Sets only contain unique elements. They can be created from any collection using set() and allow for operations such as intersection():

In [27]:
name_set = set(names)
print(name_set)

# Add an element
name_set.add("Galina")
print(name_set)

# Discard an element
name_set.discard("Johannes")
print(name_set)

name_set2 = set(["Rose", "Tom"])
# Difference and intersection
difference = name_set - name_set2
print(difference)
intersection = name_set.intersection(name_set2)
print(intersection)

{'Giovanni', 'Rose', 'Roberta', 'Imani', 'Tom', 'Kumiko', 'Yongzhe'}
{'Giovanni', 'Rose', 'Galina', 'Roberta', 'Imani', 'Tom', 'Kumiko', 'Yongzhe'}
{'Giovanni', 'Rose', 'Galina', 'Roberta', 'Imani', 'Tom', 'Kumiko', 'Yongzhe'}
{'Giovanni', 'Galina', 'Roberta', 'Imani', 'Kumiko', 'Yongzhe'}
{'Rose', 'Tom'}
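
Sets can also be written directly with curly braces. The sketch below shows that notation together with two more operations, union and symmetric difference, using made-up values:

```python
# A set literal uses curly braces; duplicates are dropped
cities = {"Edinburgh", "Glasgow", "Edinburgh"}
print(len(cities))

# Note: {} creates an empty dictionary, so an empty set needs set()
empty = set()

# Removing duplicates from a list (the original order is not kept)
unique_numbers = set([1, 2, 2, 3, 3, 3])

# Union and symmetric difference
left = {1, 2, 3}
right = {3, 4}
print(left | right)
print(left ^ right)
```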


# Dictionary & Function

## Dictionaries

Dictionaries are a great way to store particular data as key-value pairs, which mimics the basic structure of a simple database.

In [28]:
courses = {"Johannes" : "Predictive analytics", "Kumiko" : "Prescriptive analytics", "Luciana" : "Descriptive analytics"}

for organizer in courses:
    print(organizer + " teaches " + courses[organizer])

Johannes teaches Predictive analytics
Kumiko teaches Prescriptive analytics
Luciana teaches Descriptive analytics


We can also write:

In [29]:
for organizer, course in courses.items():
    print(organizer + " teaches " + course)

Johannes teaches Predictive analytics
Kumiko teaches Prescriptive analytics
Luciana teaches Descriptive analytics


In [30]:
# Adding items
courses["Imani"] = "Other analytics"
print(courses)

# Overwrite
courses["Johannes"] = "Business analytics"
print(courses)

{'Johannes': 'Predictive analytics', 'Kumiko': 'Prescriptive analytics', 'Luciana': 'Descriptive analytics', 'Imani': 'Other analytics'}
{'Johannes': 'Business analytics', 'Kumiko': 'Prescriptive analytics', 'Luciana': 'Descriptive analytics', 'Imani': 'Other analytics'}


In [31]:
# Remove
del courses["Johannes"]
print(courses)

{'Kumiko': 'Prescriptive analytics', 'Luciana': 'Descriptive analytics', 'Imani': 'Other analytics'}


In [32]:
# Looping values
for course in courses.values():
    print(course)

Prescriptive analytics
Descriptive analytics
Other analytics


In [33]:
# Sorted output (on keys)
for organizer, course in sorted(courses.items()):
    print(organizer +" teaches " + course)

Imani teaches Other analytics
Kumiko teaches Prescriptive analytics
Luciana teaches Descriptive analytics
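
Looking up a key that does not exist raises a KeyError. A sketch of two common ways around that, using a separate made-up dictionary called `teaching` so the one above is left untouched:

```python
teaching = {"Kumiko": "Prescriptive analytics", "Luciana": "Descriptive analytics"}

# teaching["Johannes"] would raise a KeyError, so check first ...
if "Johannes" in teaching:
    print(teaching["Johannes"])

# ... or use get() with a fallback value
found = teaching.get("Kumiko", "no course")
missing = teaching.get("Johannes", "no course")
print(found)
print(missing)
```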


## Functions

Functions form the backbone of all code. You have already used some, like print(). They can be easily defined by yourself as well.

In [34]:
def my_function(a, b):
    a = a.title()
    b = b.upper()
    print(a+ " "+b)

In [35]:
def my_function2(a, b):
    a = a.title()
    b = b.upper()
    return a + " " + b

In [36]:
my_function("johannes","de smedt")
output = my_function2("johannes","de smedt")
print(output)

Johannes DE SMEDT
Johannes DE SMEDT


Notice how the first function already prints, while the second returns a string we have to print ourselves. Python is dynamically typed, so a function can return values of different types, like in this example:

In [37]:
# Different output type
def calculate_mean(a, b):
    if a > 0:
        return (a+b)/2
    else:
        return "a is not positive"

output = calculate_mean(1,2)
print(output)
output = calculate_mean(0,1)
print(output)

1.5
a is not positive
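
Returning a number in one branch and a string in another forces every caller to check what came back. One common alternative, sketched here under the made-up name `calculate_mean_strict`, is to raise an error for invalid input so the return type stays the same:

```python
def calculate_mean_strict(a, b):
    """Hypothetical variant that always returns a number or raises an error."""
    if a <= 0:
        raise ValueError("a must be positive")
    return (a + b) / 2

print(calculate_mean_strict(1, 2))

# The caller decides how to handle invalid input
try:
    calculate_mean_strict(0, 1)
except ValueError as error:
    print("Could not compute the mean:", error)
```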


## Comprehensions

Comprehensions allow you to build lists and dictionaries concisely:

In [38]:
# Finding even numbers
evens = [i for i in range(1,11) if i % 2 ==0]
print(evens)

[2, 4, 6, 8, 10]


Comprehensions can also loop over several ranges at once and produce tuples, such as pairs:

In [39]:
# Double fun
pairs = [(x,y) for x in range(1,11) for y in range(5,11) if x>y]
print(pairs)

[(6, 5), (7, 5), (7, 6), (8, 5), (8, 6), (8, 7), (9, 5), (9, 6), (9, 7), (9, 8), (10, 5), (10, 6), (10, 7), (10, 8), (10, 9)]


They are also useful to perform some pre-processing, e.g., on strings:

In [40]:
# Operations
names = ["jamal", "maurizio", "johannes"]

titled_names = [name.title() for name in names]
print(titled_names)

j_s = [name.title() for name in names if name.lower()[0] == 'j']
print(j_s)

['Jamal', 'Maurizio', 'Johannes']
['Jamal', 'Johannes']
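
The same notation works for dictionaries and sets when curly braces are used. A short sketch, with `people` as a made-up list so that the names above are not overwritten:

```python
people = ["jamal", "maurizio", "johannes"]

# Dictionary comprehension: map each name to its length
name_lengths = {name.title(): len(name) for name in people}
print(name_lengths)

# Set comprehension: the distinct first letters
first_letters = {name[0] for name in people}
print(first_letters)
```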


# IO & Library

In [42]:
# Download some datasets
# If you are using git, then you don't need to run the following.
# !wget -q https://raw.githubusercontent.com/Magica-Chen/data-analysis-statistics/main/python-fundamentals/data/DM_1.csv
# !mkdir data
# !mv *.csv ./data

## Reading files

In Python, we can open any type of file, although plainly structured text formats such as .txt and .csv are the easiest to work with. You can also open Excel files with appropriate packages, such as pandas (more on this later). Let's read in a .csv file:

In [41]:
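# Open a file for reading ('r'); the with block closes it automatically
with open('data/DM_1.csv','r') as file:
    for line in file:
        # Each line keeps its trailing newline, so print() leaves a blank line after it
        print(line)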

Name,Email,City,Salary

Brent Hopkins,Cum.sociis.natoque@aodiosemper.edu,Mount Pearl,38363

Colt Bender,Vivamus.non.lorem@Proin.org,Castle Douglas,21506

Arthur Hammond,nisl.Maecenas@sed.net,Biloxi,27511

Sean Warner,enim.nisl.elementum@Vivamus.edu,Moere,25201

Tate Greene,velit.justo.nec@aliquetlobortisnisi.edu,Ipswich,35052

Gavin Gibson,cursus.Integer.mollis@Duissitamet.org,Oordegem,37126

Kelly Garza,cursus.non.egestas@antebibendum.ca,Kukatpalle,39420

Zane Preston,sed@Phasellusataugue.com,Neudšrfl,28553

Cole Cunningham,ac.mattis.ornare@inmagna.co.uk,Catemu,27972

Tarik Hendricks,Mauris.vestibulum@sodales.ca,Newbury,39027

Elvis Collier,pede@mattisvelit.org,Paradise,22568

Jackson Huber,eros.nec.tellus@ultricesposuere.edu,Veere,29922

Macaulay Cline,Aliquam@arcuSedeu.edu,Campobasso,24163

Elijah Chase,est.mollis.non@in.net,Grantham,23881

Dennis Anthony,mauris.ut.mi@maurisid.co.uk,Cedar Rapids,27969

Fulton Snyder,enim.mi@egestas.ca,San Pedro,21594

Leo Willis,massa.lobortis@matti

We can store this information in objects and start using it:

In [42]:
# The file object was exhausted by the previous loop, so open the file again
with open('data/DM_1.csv','r') as file:
    # Ignore the header
    next(file)

    # Store names with amount (i.e. columns 1 & 4)
    amount_per_person = {}
    for line in file:
        cells = line.split(",")
        amount_per_person[cells[0]] = int(cells[3])

for person, amount in sorted(amount_per_person.items()):
    if amount > 25000:
        print(person , " has " , amount)

Arthur Hammond  has  27511
Brent Hopkins  has  38363
Cole Cunningham  has  27972
Dennis Anthony  has  27969
Gavin Gibson  has  37126
Jackson Huber  has  29922
Kelly Garza  has  39420
Leo Willis  has  31203
Matthew Hooper  has  33222
Palmer Byrd  has  29045
Sean Warner  has  25201
Tarik Hendricks  has  39027
Tate Greene  has  35052
Zane Preston  has  28553


In [43]:
# Now we use 'w' for write; the with block closes the file when done
with open('data/ordered_amounts_per_person.csv','w') as output_file:
    for person, amount in sorted(amount_per_person.items()):
        # End every record with a newline so each person gets their own row
        output_file.write(person.lower()+","+str(amount)+"\n")
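
Splitting lines on commas by hand breaks as soon as a field itself contains a comma. The standard-library csv module takes care of quoting in both directions. The sketch below writes and reads a small temporary file; the rows and the file name `people.csv` are invented for the example.

```python
import csv
import os
import tempfile

# Made-up rows in the same layout as the file above
rows = [
    ["Name", "Email", "City", "Salary"],
    ["Ada Example", "ada@example.org", "Edinburgh", "30000"],
    ["Bob Example", "bob@example.org", "St. John's, NL", "20000"],
]

# Write to a temporary folder so that no existing data is touched
path = os.path.join(tempfile.mkdtemp(), "people.csv")
with open(path, "w", newline="") as handle:
    csv.writer(handle).writerows(rows)

# DictReader uses the header row as keys and handles quoted commas
salaries = {}
cities = {}
with open(path, newline="") as handle:
    for record in csv.DictReader(handle):
        salaries[record["Name"]] = int(record["Salary"])
        cities[record["Name"]] = record["City"]

print(salaries)
print(cities)
```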

## Libraries

Libraries are imported by using `import`:

In [48]:
import numpy as np

We can import just a few names from a library using `from`, or create aliases using `as`:

In [45]:
import math as m
from math import pi

In [46]:
print(np.add(1, 2))
print(pi)
print(m.sin(1))

3
3.141592653589793
0.8414709848078965
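
The three import styles can be compared side by side on a single module. This sketch uses the standard-library statistics module and a made-up list of values:

```python
import statistics                 # whole module, used with its full name
import statistics as st           # whole module, used with a shorter alias
from statistics import median     # a single function, used without a prefix

values = [1, 2, 3, 4, 10]
print(statistics.mean(values))
print(st.mean(values))
print(median(values))
```

All three names refer to the same module, so the choice is mostly about readability.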


## Numpy

In [49]:
# Create arrays/matrices filled with zeros
empty_array = np.zeros(5)

empty_matrix = np.zeros((5,2))

print('Empty array: \n',empty_array)
print('Empty matrix: \n',empty_matrix)

Empty array: 
 [0. 0. 0. 0. 0.]
Empty matrix: 
 [[0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]
 [0. 0.]]


In [50]:
# Create matrices
mat = np.array([[1,2,3],[4,5,6]])
print('Matrix: \n', mat)
print('Transpose: \n', mat.T)
print('Item 2,2: ', mat[1,1])
print('Item 2,3: ', mat[1,2])
print('rows and columns: ', np.shape(mat))
print('Sum total matrix: ', np.sum(mat))
print('Sum row 1: ' , np.sum(mat[0]))
print('Sum row 2: ', np.sum(mat[1]))
print('Sum column 3: ', np.sum(mat,axis=0)[2])

Matrix: 
 [[1 2 3]
 [4 5 6]]
Transpose: 
 [[1 4]
 [2 5]
 [3 6]]
Item 2,2:  5
Item 2,3:  6
rows and columns:  (2, 3)
Sum total matrix:  21
Sum row 1:  6
Sum row 2:  15
Sum column 3:  9
