# Introduction to Python

> Data Wrangling with Python

Kuo, Yao-Jen from [DATAINPOINT](https://www.datainpoint.com/)

## TL;DR

> In this lecture, we will talk about how to wrangle data with both Python's built-in and third-party data structures.

## The What and Why

## What is wrangling?

![](https://media.giphy.com/media/MnlZWRFHR4xruE4N2Z/giphy.gif)

Source: <https://giphy.com/>

## What is a data structure?

> In computer science, a data structure is a data organization, management, and storage format that enables efficient access and modification. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data.

Source: <https://en.wikipedia.org/wiki/Data_structure>

## Why data structures?

A software engineer's main job is to perform operations on data. We can simplify such an operation into three steps:

1. Take some input
2. Process it
3. Return the output

This is quite similar to the definition of a function: it takes inputs, processes them, and returns outputs.

## To make the process efficient, we need to optimize it via data structure

A data structure determines how and where we store the data to be processed. Choosing a suitable one makes common operations such as lookup, insertion, and deletion more efficient.
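
As a minimal sketch of this point, the snippet below runs the same membership test on a `list` and on a `set` holding identical values. The names `numbers_list`, `numbers_set`, and `target`, as well as the collection size, are illustrative assumptions, and the measured times will differ from machine to machine.

```python
import timeit

# Hypothetical data: the same 100,000 integers stored in two structures
numbers_list = list(range(100_000))
numbers_set = set(numbers_list)
target = 99_999  # the last element, the slowest case for a list scan

# Both structures give the same answer
print(target in numbers_list)
print(target in numbers_set)

# The list is scanned element by element, while the set uses hashing
list_seconds = timeit.timeit(lambda: target in numbers_list, number=100)
set_seconds = timeit.timeit(lambda: target in numbers_set, number=100)
print(f"list: {list_seconds:.6f}s, set: {set_seconds:.6f}s")
```

On typical hardware the set lookup should be noticeably faster, which is the kind of efficiency gain a suitable data structure can provide.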

## We will talk about 4 built-in data structures in Python

- `list`
- `tuple`
- `dict` as in dictionary
- `set`

## We will also talk about 4 data structures from well-known third-party libraries

- `ndarray` from NumPy
- `Index` from Pandas
- `Series` from Pandas
- `DataFrame` from Pandas

## Built-in Data Structures

## Lists

Lists are the basic ordered and mutable data collection type in Python. They can be defined with comma-separated values between square brackets.

In [1]:
primes = [2, 3, 5, 7, 11]
print(type(primes)) # use type() to check type
print(len(primes))  # use len() to check how many elements are stored in the list

<class 'list'>
5


## Lists have a number of useful methods

- `.append()`
- `.pop()`
- `.remove()`
- `.insert()`
- `.sort()`
- ...etc.

We can use `TAB` for completion and `SHIFT + TAB` for documentation prompts in a notebook environment.

In [2]:
primes.append(13) # appending an element to the end of a list
print(primes)
primes.pop() # popping out the last element of a list
print(primes)
primes.remove(2) # removing the first occurrence of an element within a list
print(primes)
primes.insert(0, 2) # inserting certain element at a specific index
print(primes)
primes.sort(reverse=True) # sorting a list, reverse=False => ascending order; reverse=True => descending order
print(primes)

[2, 3, 5, 7, 11, 13]
[2, 3, 5, 7, 11]
[3, 5, 7, 11]
[2, 3, 5, 7, 11]
[11, 7, 5, 3, 2]
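
A detail worth noting is what these methods return. The sketch below uses a separate, hypothetical list named `scores` to show that `.pop()` returns the removed element, that `.sort()` sorts in place and returns `None`, and that the built-in `sorted()` returns a new list instead.

```python
scores = [70, 95, 82, 60]  # hypothetical example data

last = scores.pop()        # .pop() removes AND returns the last element
print(last)                # 60
print(scores)              # [70, 95, 82]

result = scores.sort(reverse=True)  # .sort() works in place and returns None
print(result)              # None
print(scores)              # [95, 82, 70]

ascending = sorted(scores) # sorted() builds a new list, the original is untouched
print(ascending)           # [70, 82, 95]
print(scores)              # [95, 82, 70]
```

A common beginner mistake is writing `scores = scores.sort()`, which replaces the list with `None`.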


## Python provides access to elements in compound types through

- **indexing** for a single element
- **slicing** for multiple elements

## Python uses zero-based indexing

In [3]:
primes.sort()
print(primes[0]) # the first element
print(primes[1]) # the second element

2
3


## Elements at the end of the list can be accessed with negative numbers, starting from -1

In [4]:
print(primes[-1]) # the last element
print(primes[-2]) # the second last element

11
7


## While indexing means fetching a single value from the list, slicing means accessing multiple values in sub-lists

- start (inclusive)
- stop (exclusive)
- step

```python
# slicing syntax
OUR_LIST[start:stop:step]
```

In [5]:
print(primes[0:3:1]) # slicing the first 3 elements
print(primes[-3:len(primes):1]) # slicing the last 3 elements 
print(primes[0:len(primes):2]) # slicing every second element

[2, 3, 5]
[5, 7, 11]
[2, 5, 11]


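## If a part is left out, it defaults to

- start: 0
- stop: the length of the list (that is, through the last element)
- step: 1

These defaults apply to a positive step. Note that writing a stop of -1 explicitly would exclude the last element. So we can do the same slicing with the defaults: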

In [6]:
print(primes[:3]) # slicing the first 3 elements
print(primes[-3:]) # slicing the last 3 elements 
print(primes[::2]) # slicing every second element
print(primes[::-1]) # a negative step walks backwards, so this reverses the list

[2, 3, 5]
[5, 7, 11]
[2, 5, 11]
[11, 7, 5, 3, 2]
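
The following sketch probes a few edge cases of slicing on a hypothetical list named `letters`: omitting the stop runs through the end of the list, an explicit stop of `-1` drops the last element, out-of-range slices do not raise an error, and a slice is a new list rather than a view of the original.

```python
letters = ['a', 'b', 'c', 'd', 'e']  # hypothetical example list

print(letters[:])     # ['a', 'b', 'c', 'd', 'e'] (start and stop omitted)
print(letters[:-1])   # ['a', 'b', 'c', 'd'] (explicit stop of -1)
print(letters[10:])   # [] (out of range, but no IndexError)

copy_of_letters = letters[:]   # a slice creates a new list
copy_of_letters[0] = 'z'       # changing the copy...
print(letters[0])              # a (...leaves the original untouched)
```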


## Tuples

Tuples are in many ways similar to lists, but they are defined with parentheses rather than square brackets.

In [7]:
primes = (2, 3, 5, 7, 11)
print(type(primes)) # use type() to check type
print(len(primes))  # use len() to check how many elements are stored in the tuple

<class 'tuple'>
5


## The main distinguishing feature of tuples is that they are immutable

Once they are created, their size and contents cannot be changed.

In [8]:
primes = [2, 3, 5, 7, 11]
primes[-1] = 13
print(primes)
primes = tuple(primes)
primes[-1] = 11

[2, 3, 5, 7, 13]


TypeError: 'tuple' object does not support item assignment
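
Because a tuple cannot be changed in place, the usual workaround is to build a new tuple. The sketch below, using the hypothetical names `fixed_primes` and `longer_primes`, catches the error shown above and then creates an extended tuple by concatenation.

```python
fixed_primes = (2, 3, 5, 7, 11)

try:
    fixed_primes[-1] = 13      # item assignment is not allowed on a tuple
except TypeError as error:
    print("TypeError:", error)

# Build a new tuple instead of modifying the existing one
longer_primes = fixed_primes + (13,)  # the trailing comma makes a 1-element tuple
print(fixed_primes)    # (2, 3, 5, 7, 11)
print(longer_primes)   # (2, 3, 5, 7, 11, 13)
```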

## Use TAB to see whether tuples offer any methods that modify them

```python
primes.<TAB>
```

## Tuples are often used in Python programs, for example when a function returns multiple values

In [9]:
def get_locale(country, city):
    return country, city

print(get_locale("Taiwan", "Taipei"))
print(type(get_locale("Taiwan", "Taipei")))

('Taiwan', 'Taipei')
<class 'tuple'>


## Multiple return values can also be individually assigned

In [10]:
my_country, my_city = get_locale("Taiwan", "Taipei")
print(my_country)
print(my_city)

Taiwan
Taipei
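
Tuple unpacking works anywhere, not only with function return values. As a small sketch, the hypothetical function `get_min_max()` below returns two values, which are then unpacked, swapped, and finally combined with the starred form that collects the remaining elements into a list.

```python
def get_min_max(values):
    """Return the smallest and largest value as a 2-tuple (hypothetical helper)."""
    return min(values), max(values)

lowest, highest = get_min_max([7, 2, 11, 3, 5])
print(lowest, highest)   # 2 11

lowest, highest = highest, lowest   # swap two names without a temporary variable
print(lowest, highest)   # 11 2

first, *rest = (2, 3, 5, 7, 11)     # the starred name collects what is left
print(first)   # 2
print(rest)    # [3, 5, 7, 11]
```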


## Dictionaries

Dictionaries are extremely flexible mappings of keys to values, and form the basis of much of Python's internal implementation. They can be created via a comma-separated list of `key:value` pairs within curly braces.

In [11]:
the_celtics = {
    'isNBAFranchise': True,
    'isAllStar': False,
    'city': "Boston",
    'altCityName': "Boston",
    'fullName': "Boston Celtics",
    'tricode': "BOS",
    'teamId': "1610612738",
    'nickname': "Celtics",
    'urlName': "celtics",
    'teamShortName': "Boston",
    'confName': "East",
    'divName': "Atlantic"
}

print(type(the_celtics))
print(len(the_celtics))

<class 'dict'>
12


## Elements are accessed through a valid key rather than a zero-based position

In [12]:
print(the_celtics['city'])
print(the_celtics['confName'])
print(the_celtics['divName'])

Boston
East
Atlantic
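
Accessing a key that does not exist raises a `KeyError`. A minimal sketch of two ways to avoid that, using a shortened, hypothetical dictionary named `team`: test membership with `in`, or call `.get()`, which returns `None` or a chosen default when the key is missing.

```python
team = {'city': "Boston", 'nickname': "Celtics"}  # shortened, hypothetical copy

print('city' in team)     # True
print('mascot' in team)   # False

print(team.get('nickname'))            # Celtics
print(team.get('mascot'))              # None
print(team.get('mascot', "unknown"))   # unknown
```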


## A new key:value pair can be added by simple assignment

In [13]:
the_celtics['isMyFavorite'] = True
print(the_celtics)

{'isNBAFranchise': True, 'isAllStar': False, 'city': 'Boston', 'altCityName': 'Boston', 'fullName': 'Boston Celtics', 'tricode': 'BOS', 'teamId': '1610612738', 'nickname': 'Celtics', 'urlName': 'celtics', 'teamShortName': 'Boston', 'confName': 'East', 'divName': 'Atlantic', 'isMyFavorite': True}


## Use `del` to remove a key:value pair from a dictionary

In [14]:
del the_celtics['isMyFavorite']
print(the_celtics)

{'isNBAFranchise': True, 'isAllStar': False, 'city': 'Boston', 'altCityName': 'Boston', 'fullName': 'Boston Celtics', 'tricode': 'BOS', 'teamId': '1610612738', 'nickname': 'Celtics', 'urlName': 'celtics', 'teamShortName': 'Boston', 'confName': 'East', 'divName': 'Atlantic'}


## Common methods called on dictionaries

- `.keys()`
- `.values()`
- `.items()`

In [15]:
print(the_celtics.keys())
print(the_celtics.values())
print(the_celtics.items())

dict_keys(['isNBAFranchise', 'isAllStar', 'city', 'altCityName', 'fullName', 'tricode', 'teamId', 'nickname', 'urlName', 'teamShortName', 'confName', 'divName'])
dict_values([True, False, 'Boston', 'Boston', 'Boston Celtics', 'BOS', '1610612738', 'Celtics', 'celtics', 'Boston', 'East', 'Atlantic'])
dict_items([('isNBAFranchise', True), ('isAllStar', False), ('city', 'Boston'), ('altCityName', 'Boston'), ('fullName', 'Boston Celtics'), ('tricode', 'BOS'), ('teamId', '1610612738'), ('nickname', 'Celtics'), ('urlName', 'celtics'), ('teamShortName', 'Boston'), ('confName', 'East'), ('divName', 'Atlantic')])


## Dictionaries are extremely useful in applications

![Imgur](https://i.imgur.com/0jYRGTd.jpg?1)

## Using an `if-elif-else` statement to print out the lowest living cost per month

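In [16]:
living_area = input("Please enter your area of residence: ")
lowest_living_cost = None
if living_area == "Outside the six special municipalities":
    lowest_living_cost = 12388
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
elif living_area == "Taipei City":
    lowest_living_cost = 17005
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
elif living_area == "New Taipei City":
    lowest_living_cost = 15500
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
elif living_area == "Taoyuan City":
    lowest_living_cost = 15281
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
elif living_area == "Taichung City":
    lowest_living_cost = 14596
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
elif living_area == "Tainan City":
    lowest_living_cost = 12388
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
elif living_area == "Kaohsiung City":
    lowest_living_cost = 13099
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
elif living_area == "Kinmen and Lienchiang Counties":
    lowest_living_cost = 11648
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
else:
    print("I do not recognize the area you entered.")

Please enter your area of residence: Taipei City
The minimum monthly living cost per person is NT$17,005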


## Using a dictionary to print out the lowest living cost per month

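In [17]:
living_cost_dict = {
    'Outside the six special municipalities': 12388,
    'Taipei City': 17005,
    'New Taipei City': 15500,
    'Taoyuan City': 15281,
    'Taichung City': 14596,
    'Tainan City': 12388,
    'Kaohsiung City': 13099,
    'Kinmen and Lienchiang Counties': 11648
}
living_area = input("Please enter your area of residence: ")
try:
    lowest_living_cost = living_cost_dict[living_area]
    print("The minimum monthly living cost per person is NT${:,}".format(lowest_living_cost))
except KeyError: # only a missing key means the area is unknown
    print("I do not recognize the area you entered.")

Please enter your area of residence: Taipei City
The minimum monthly living cost per person is NT$17,005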


## Sets

The fourth basic collection is the set, an unordered collection of unique items. Sets are defined much like lists and tuples, except that they use curly braces.

In [18]:
primes = {2, 3, 5, 7, 11}
odds = {1, 3, 5, 7, 9}
print(type(primes))
print(len(odds))

<class 'set'>
5


## Python's sets have all of the operations like union, intersection, difference, and symmetric difference

## Union: elements appearing in either set

In [19]:
print(primes | odds)      # with an operator
print(primes.union(odds)) # equivalently with a method

{1, 2, 3, 5, 7, 9, 11}
{1, 2, 3, 5, 7, 9, 11}


## Intersection: elements appearing in both

In [20]:
print(primes & odds)             # with an operator
print(primes.intersection(odds)) # equivalently with a method

{3, 5, 7}
{3, 5, 7}


## Difference: elements in primes but not in odds

In [21]:
print(primes - odds)           # with an operator
print(primes.difference(odds)) # equivalently with a method

{2, 11}
{2, 11}


## Symmetric difference: items appearing in only one set

In [22]:
print(sorted((primes - odds) | (odds - primes))) # union two differences
print(primes ^ odds)                             # with an operator
print(primes.symmetric_difference(odds))         # equivalently with a method

[1, 2, 9, 11]
{1, 2, 9, 11}
{1, 2, 9, 11}
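
Two everyday uses of sets follow from the properties above: removing duplicates and checking whether one collection is contained in another. The sketch below uses hypothetical data (`visits`, `small_primes`) to illustrate both.

```python
visits = ["Taipei", "Boston", "Taipei", "Tokyo", "Boston"]  # hypothetical data

unique_cities = set(visits)    # duplicates collapse into unique items
print(len(visits))             # 5
print(len(unique_cities))      # 3
print(sorted(unique_cities))   # ['Boston', 'Taipei', 'Tokyo']

small_primes = {2, 3}
print(small_primes <= {2, 3, 5, 7, 11})   # True (subset test with an operator)
print(small_primes.issubset({1, 3, 5}))   # False (equivalently with a method)
```

Since sets are unordered, `sorted()` is used here to get a predictable order for printing.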


## One of the powerful features of Python's compound objects is that they can contain objects of any type, or even a mix of types

## Take data.nba.net for example

- The [/10s/prod/v1/today.json](https://data.nba.net/10s/prod/v1/today.json) is a compound dictionary that contains other dictionaries as values
- The [/prod/v2/2019/teams.json](https://data.nba.net/prod/v2/2019/teams.json) is a compound dictionary that contains lists of dictionaries as values
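
To make the idea of nested structures concrete, here is a small sketch with a made-up dictionary named `league`. Its shape (a dictionary holding a list of dictionaries) is only an assumption for illustration and is not the actual response of the endpoints listed above.

```python
# Hypothetical, heavily simplified data; NOT the real API response
league = {
    'season': 2019,
    'teams': [
        {'fullName': "Boston Celtics", 'confName': "East"},
        {'fullName': "Los Angeles Lakers", 'confName': "West"},
    ],
}

# Chain keys and indexes to drill down, one level at a time
print(league['teams'][0]['fullName'])    # Boston Celtics
print(len(league['teams']))              # 2
print(league['teams'][-1]['confName'])   # West
```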

## So we need a tool to handle collections of data, and that is where iteration comes in

## Use Iteration to Retrieve Every Element

## The essence of iteration

Like slicing syntax:

- start: when does the iteration start?
- stop: when does the iteration stop?
- step: how does the iteration go from start to stop?

## We can utilize two kinds of iteration

- `while` loop
- `for` loop

## The `while` loop repeats one or more statements as long as the condition evaluates to `True`

```python
i = 0 # start
while CONDITION: # stop
    # repeated statements
    i += 1 # step
```

![Imgur](https://i.imgur.com/KNhPttU.png?1)

Source: [A Beginners Guide to Python 3 Programming](https://www.amazon.com/Beginners-Programming-Undergraduate-Computer-Science-ebook/dp/B07W4THQB6)

## The `for` loop steps through an iterable one element at a time until the end is reached

```python
for i in ITERABLE: # start/stop/step
    # repeated statements
```

![Imgur](https://i.imgur.com/K4MRRcC.png?1)

Source: [A Beginners Guide to Python 3 Programming](https://www.amazon.com/Beginners-Programming-Undergraduate-Computer-Science-ebook/dp/B07W4THQB6)

## Retrieving the first 5 odd numbers and multiplying them by 10

In [23]:
i = 1 # start
while i < 11: # stop
    print(i*10)
    i += 2 # step

10
30
50
70
90


In [24]:
for i in range(1, 10, 2): # start/stop/step => help(range)
    print(i*10)

10
30
50
70
90


## Use `range()` function to create a sequence

In [25]:
help(range)

Help on class range in module builtins:

class range(object)
 |  range(stop) -> range object
 |  range(start, stop[, step]) -> range object
 |  
 |  Return an object that produces a sequence of integers from start (inclusive)
 |  to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
 |  start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
 |  These are exactly the valid indices for a list of 4 elements.
 |  When step is given, it specifies the increment (or decrement).
 |  
 |  Methods defined here:
 |  
 |  __bool__(self, /)
 |      self != 0
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(self, key, /)
 |      Return self[key].
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __hash__(self, /)
 |

## Retrieving the first 5 primes and multiplying them by 10

In [26]:
primes = [2, 3, 5, 7, 11]
i = 0 # start
while i < len(primes): # stop
    print(primes[i]*10)
    i += 1 # step

20
30
50
70
110


In [27]:
for i in primes: # start/stop/step
    print(i*10)

20
30
50
70
110
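
When both the position and the value are needed, the built-in `enumerate()` combines the convenience of the `for` loop with the index that the `while` version tracks by hand. A brief sketch, using the hypothetical names `first_primes` and `pairs`:

```python
first_primes = [2, 3, 5, 7, 11]
pairs = []
for position, prime in enumerate(first_primes):  # yields (index, element) tuples
    pairs.append((position, prime * 10))
    print(position, prime * 10)
```

Each iteration unpacks one `(index, element)` tuple, so no manual counter is required.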


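## `while` versus `for`: which one should we use?

- Use `for` to iterate over lists, dictionaries, and other iterables, where the number of iterations is known in advance
- Use `while` when the number of iterations is not known in advance, for example when our operations involve randomness or uncertainty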
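
As an illustration of a loop whose length is not known beforehand, the sketch below repeatedly halves an even number or triples an odd number and adds one until the value reaches 1. The starting value and the names `n` and `steps` are arbitrary choices for this example.

```python
n = 6        # hypothetical starting value
steps = 0
while n != 1:            # we cannot tell in advance how many rounds this takes
    if n % 2 == 0:
        n = n // 2       # even: halve it
    else:
        n = 3 * n + 1    # odd: triple it and add one
    steps += 1
print(steps)   # 8
```

The sequence here is 6, 3, 10, 5, 16, 8, 4, 2, 1, so the loop body runs 8 times. A `for` loop would need that count up front.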

## Use a code-visualization tool to help you understand the behavior of loops

We can use [pythontutor.com](http://www.pythontutor.com/visualize.html#mode=edit) to explore the execution of our code.

## We've mentioned ITERABLE quite a few times, so what is an iterable?

An iterable is any Python object capable of returning its elements one at a time, permitting it to be iterated over in a `for` loop. Familiar examples of iterables include lists, tuples, and strings.

## Iterate over a `str`

Using built-in functions to explore iterations.

- `iter()`
- `next()`

In [28]:
help(iter)

Help on built-in function iter in module builtins:

iter(...)
    iter(iterable) -> iterator
    iter(callable, sentinel) -> iterator
    
    Get an iterator from an object.  In the first form, the argument must
    supply its own iterator, or be a sequence.
    In the second form, the callable is called until it returns the sentinel.



In [29]:
help(next)

Help on built-in function next in module builtins:

next(...)
    next(iterator[, default])
    
    Return the next item from the iterator. If default is given and the iterator
    is exhausted, it is returned instead of raising StopIteration.



In [30]:
may4th = "Luke, use the Force!"
I = iter(may4th)
print(next(I))
print(next(I))
print(next(I))
print(next(I))

L
u
k
e
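
These two functions are roughly what a `for` loop relies on behind the scenes. The sketch below imitates a `for` loop by hand: it calls `next()` until the iterator raises `StopIteration`. The names `word`, `letter_iterator`, and `collected` are assumptions for this example.

```python
word = "Luke"
letter_iterator = iter(word)
collected = []
while True:
    try:
        letter = next(letter_iterator)   # ask for the next element
    except StopIteration:                # raised once the iterator is exhausted
        break
    collected.append(letter)
    print(letter)

# With a default value, next() returns it instead of raising StopIteration
print(next(letter_iterator, "no more letters"))
```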


## Iterate over an `int`, `float`, or `bool`?

In [31]:
my_int = 5566
I = iter(my_int)

TypeError: 'int' object is not iterable

In [32]:
my_float = 5566.0
I = iter(my_float)

TypeError: 'float' object is not iterable

In [33]:
is_56_the_best = True
I = iter(is_56_the_best)

TypeError: 'bool' object is not iterable

## Iterating over a list or tuple is quite straightforward

In [34]:
primes = [2, 3, 5, 7, 11]
for i in primes:
    print(i)

2
3
5
7
11


## How about iterating over a dictionary?

Use `.keys()`, `.values()`, and `.items()` to help us iterate over a dictionary.

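In [35]:
living_cost_dict = {
    'Outside the six special municipalities': 12388,
    'Taipei City': 17005,
    'New Taipei City': 15500,
    'Taoyuan City': 15281,
    'Taichung City': 14596,
    'Tainan City': 12388,
    'Kaohsiung City': 13099,
    'Kinmen and Lienchiang Counties': 11648
}
print(living_cost_dict.keys())
print(living_cost_dict.values())
print(living_cost_dict.items())

dict_keys(['Outside the six special municipalities', 'Taipei City', 'New Taipei City', 'Taoyuan City', 'Taichung City', 'Tainan City', 'Kaohsiung City', 'Kinmen and Lienchiang Counties'])
dict_values([12388, 17005, 15500, 15281, 14596, 12388, 13099, 11648])
dict_items([('Outside the six special municipalities', 12388), ('Taipei City', 17005), ('New Taipei City', 15500), ('Taoyuan City', 15281), ('Taichung City', 14596), ('Tainan City', 12388), ('Kaohsiung City', 13099), ('Kinmen and Lienchiang Counties', 11648)])


In [36]:
for k in living_cost_dict.keys():
    print(k)

Outside the six special municipalities
Taipei City
New Taipei City
Taoyuan City
Taichung City
Tainan City
Kaohsiung City
Kinmen and Lienchiang Counties


In [37]:
for v in living_cost_dict.values():
    print(v)

12388
17005
15500
15281
14596
12388
13099
11648


In [38]:
for k, v in living_cost_dict.items():
    print("Minimum monthly living cost per person in {}: NT${:,}".format(k, v))

Minimum monthly living cost per person in Outside the six special municipalities: NT$12,388
Minimum monthly living cost per person in Taipei City: NT$17,005
Minimum monthly living cost per person in New Taipei City: NT$15,500
Minimum monthly living cost per person in Taoyuan City: NT$15,281
Minimum monthly living cost per person in Taichung City: NT$14,596
Minimum monthly living cost per person in Tainan City: NT$12,388
Minimum monthly living cost per person in Kaohsiung City: NT$13,099
Minimum monthly living cost per person in Kinmen and Lienchiang Counties: NT$11,648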
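
Iterating over `.items()` also makes it easy to search a dictionary. The sketch below looks for the area with the highest cost. The three values are taken from the living-cost dictionary above, while the dictionary name `monthly_costs` and its shortened English keys are assumptions made for readability.

```python
# Three entries copied from the living-cost data; keys shortened (assumption)
monthly_costs = {'Taipei': 17005, 'New Taipei': 15500, 'Kaohsiung': 13099}

most_expensive_area = None
highest_cost = 0
for area, cost in monthly_costs.items():
    if cost > highest_cost:          # keep track of the largest value seen so far
        most_expensive_area = area
        highest_cost = cost
print(most_expensive_area, highest_cost)   # Taipei 17005

# The built-in max() expresses the same search more compactly
print(max(monthly_costs, key=monthly_costs.get))   # Taipei
```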


## Common tasks using iteration

- Simply `print()`
- Combinations
- Summations/Counts

## We've already done plenty of simple `print()` calls, so let's move on

## Common tasks: combinations

Collect the first five odd numbers that are not primes.

In [39]:
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
odds_not_primes = []
i = 1
while len(odds_not_primes) < 5:
    if i not in primes:
        odds_not_primes.append(i)
    i += 2
print(odds_not_primes)

[1, 9, 15, 21, 25]
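
When an upper bound is known, the same filtering can be sketched with a `for` loop written as a list comprehension. The names `small_primes` and `odd_non_primes` and the bound of 30 are assumptions for this example. Unlike the `while` version, it cannot stop by itself after five matches, so a slice takes the first five.

```python
small_primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]

# Keep every odd number below 30 that is not in the list of primes
odd_non_primes = [n for n in range(1, 30, 2) if n not in small_primes]
print(odd_non_primes)       # [1, 9, 15, 21, 25, 27]
print(odd_non_primes[:5])   # [1, 9, 15, 21, 25]
```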


## Common tasks: summations/counts

In [40]:
summations = 0
counts = 0
for i in odds_not_primes:
    summations += i
    counts += 1
print(summations)
print(counts)

71
5


## Common tasks: summations/counts

Built-in functions `sum()` and `len()` work like a charm.

In [41]:
print(sum(odds_not_primes))
print(len(odds_not_primes))

71
5
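
Counting becomes more interesting when we count per category, which is where iteration and dictionaries work together. A minimal sketch, assuming a hypothetical string `phrase` and a dictionary `letter_counts` that maps each letter to how often it appears:

```python
phrase = "use the force"   # hypothetical input
letter_counts = {}
for letter in phrase:
    if letter == " ":
        continue           # skip spaces
    # .get() returns 0 the first time a letter is seen
    letter_counts[letter] = letter_counts.get(letter, 0) + 1

print(letter_counts['e'])            # 3
print(sum(letter_counts.values()))   # 11
```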
