# A very short Python introduction

Before we begin our Python tutorial, here's one important note about code cells in Jupyter notebooks: if the last line of a code cell is an expression, Jupyter displays its value automatically, so we don't need to use the `print` function.

In [None]:
a = 1
a # this line is the reason the cell prints 1 in the output

But if we have more than one line that should generate output, we need to use `print`; otherwise only the **last** line will produce output.

In [None]:
a, b = 1, 2
a # will not produce output
b # will produce output

In [None]:
a, b = 1, 2
print(a) # will produce output
print(b) # will produce output

If you want to produce output for something that is running inside of a loop, you also need to use `print`.

In [None]:
for i in range(5):
    print(i)

Let's begin by learning about the basic building blocks of Python data structures: **integers, floats, booleans, and strings**.

In [None]:
30 + 10

In [None]:
type(40)

In [None]:
4.5

In [None]:
type(40.0)

In [None]:
float('nan')

In [None]:
float('inf')

### Exercise

Verify that `float('inf')` indeed behaves like infinity. Mathematically, what should happen if you multiply infinity by 2? What about dividing infinity by itself?

### End of exercise

In [None]:
1 == 0

In [None]:
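1 is 1 # identity check; newer Python versions warn about `is` with a literal, so use == to compare values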

In [None]:
1 in [1, 0]

In [None]:
not(1 == 0)

In [None]:
type(1 == 0)

What sets **dynamically typed** programming languages like Python apart from **statically typed** programming languages like Java is that a variable is not tied to a single type: we can change the type of the value it holds at execution time. In statically typed languages we declare the type ahead of time and the compiler enforces it. Python also converts between booleans, integers, and floats in arithmetic, so we can do things like this:

In [None]:
1.0 + False

In [None]:
1 + 1.0

In [None]:
3 - True # this is the kind of thing that you shouldn't do, even if you can
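
To make the idea of dynamic typing concrete, here is a minimal sketch (the variable name `x` is only an illustration). The same name can be rebound to values of different types, and booleans take part in arithmetic because `bool` is a subclass of `int`.

```python
x = 1             # x refers to an int
print(type(x))

x = "one"         # the same name now refers to a str
print(type(x))

# bool is a subclass of int, which is why True and False behave like 1 and 0
print(isinstance(True, int))
print(True + True)
```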

Using these constructs, we can sometimes write conditional statements in a concise way. But be careful doing this, because as popular as **one-liners** are in Python, it's important to write code that is also easy to understand, either to you six months from now, or to someone else reading your code. Here's an example:

In [None]:
statement = (2 > 4)

print(statement * 5 + (1 - statement)*3) # hard to understand

if statement == True: # easier to understand but many lines
    print(5)
else:
    print(3)

if statement: # turns out we don't need the `== True` part, it's implied
    print(5)
else:
    print(3)

if not statement: # `not statement` replaces `statement == False`
    print(3)
else:
    print(5)

print(5 if statement else 3) # both concise and easy to understand, but a little risky

Of course this is both a convenience and an inconvenience. It is a convenience because it means that we can write very short code to do what we need to do. In data science, where we want to quickly analyze our data, create summaries and visualizations, train and test models, etc., this convenience really pays off. The inconvenience is that we are more likely to introduce bugs that we do not catch until later. So we need to be careful and develop best practices when we code in Python.

In [None]:
s = "abc"

In [None]:
s + "def"

In a way we can say that dynamic languages **try to be smart when we are being lazy** by guessing what we mean when we type some code. Of course there is a limit to that, and sometimes we need to be a little more explicit, such as here, where we need to coerce a string into an integer or vice versa for the operation to make sense.

In [None]:
int('1') + 2

In [None]:
str(1) + '2'

From these basic building blocks, we create the next set of fundamental objects: **lists, tuples, sets and dictionaries**.

In [None]:
fruits = ["apple", "pear", "pitaya", "zapote"] # lists
print(fruits[0]) # first element
print(fruits[-1]) # last element
print(fruits[:2]) # first two elements

Lists are **mutable** objects, meaning that we can change their elements, add new ones, or drop elements from them.

In [None]:
fruits[-1] = "nispero" # lists are mutable

In [None]:
fruits + ["guanabana", "papaya"] # returns a new, longer list; fruits itself is unchanged

In [None]:
del fruits[:2] # dropping something from a list is easy
print(fruits)
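
One detail worth seeing in isolation: `+` builds a new list, while the list methods `append` and `extend` change the list in place. The sketch below uses a made-up list called `colors` so it does not interfere with `fruits`.

```python
colors = ["red", "green"]        # hypothetical example list

longer = colors + ["blue"]       # + creates a new list
print(colors)                    # the original is unchanged
print(longer)

colors.append("blue")            # append adds one element in place
colors.extend(["cyan", "pink"])  # extend adds every element of another list in place
print(colors)
```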

We can have lists of booleans, strings, or even lists of mixed data types. Basically, lists are very **flexible** objects, and we use them when we need flexibility. Just know that there's no such thing as a free lunch: when you have more flexibility, you usually have less efficiency.

In [None]:
list("abc")

In [None]:
"-".join(list("abc")) # we can use join to join a list of strings

With a list of booleans, we can use `any` and `all` to check conditions.

In [None]:
any([True, False, False])

In [None]:
all([True, False, False])

Finally, lists can be **nested** too, meaning we can have lists of lists.

In [None]:
nested_list = [[-1, 1, 9], [-2, 3, 4]]
print(nested_list)

There is a really neat way to create lists with Python using what's called **list comprehension**. It's really just a shortcut for creating a list very quickly and in a way that makes the code look easy to follow.

In [None]:
some_list = [] # initialize empty list
for i in range(10):
    if i % 3 == 1: # if i divided by 3 leaves a remainder of 1
        some_list += [i] # add i to the list

print(some_list)

The above snippet is valid, but there's a much easier way of doing it using a list comprehension:

In [None]:
some_list = [i for i in range(10) if i % 3 == 1]
print(some_list)
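
A list comprehension can also transform each element, or loop over a nested list. The following is a small sketch with illustrative names (`squares`, `nested`, `flat`) that are not used elsewhere in this notebook.

```python
# transform each element: square the numbers 0 through 4
squares = [i ** 2 for i in range(5)]
print(squares)

# flatten a list of lists: the loops read left to right, outer loop first
nested = [[-1, 1, 9], [-2, 3, 4]]
flat = [x for inner in nested for x in inner]
print(flat)
```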

Tuples look somewhat similar to lists, but they are much more rigid. Tuples are **immutable**, meaning that once they're created they can't be changed.

In [None]:
tup = tuple([2, 4, 12, "bla"])
print(tup)

### Exercise

Try changing the tuple above by changing one of its elements and see what happens. What about trying to add tuples to each other?

### End of exercise

Tuples are handy when you know what information goes in each place and don't need the flexibility to change it later.
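
Because each position in a tuple has a fixed meaning, we can unpack a tuple into separate variables in one line. The values below are made up for illustration.

```python
book = ("Some Title", "Some Author", 2010)  # made-up values: title, author, year

title, author, year = book  # unpack by position
print(title)
print(year)
```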

We can also have **sets** in Python, which are similar to mathematical sets: elements in a set are unique and the order doesn't matter. Sets have certain **methods** that sound familiar if you remember sets in math, such as `difference`.

In [None]:
my_set = set([2, 2, "hello", "hello", 5])
print(my_set)

In [None]:
my_set.difference(set([2, 4, "hello"]))

If you're not familiar with the term **method**, it refers to a function that belongs to an object: once you create an object, there are certain functions relevant to it that you can call. For example, a set has a method called `difference` so that you can take the difference of one set from another set. To call the method, you type the name of the object, followed by a period, and the name of the method. Tab completion can help you find out which methods an object has.
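
Beyond `difference`, sets have other methods that mirror set operations from math. Here is a small sketch with two made-up sets, `a` and `b`:

```python
a = {1, 2, 3}  # curly braces are another way to create a set
b = {3, 4}

print(a.union(b))         # elements in a or b
print(a.intersection(b))  # elements in both a and b
print(a.difference(b))    # elements in a but not in b
```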

### Exercise

Explore the methods associated with a list and a tuple. Just create a list, then type its name, followed by a dot, and press `TAB` to let auto-complete show you the methods. Then try the same thing for a tuple.

### End of exercise

A Python dictionary is similar to a list, but instead of elements being looked up by their position, the elements are **values** that are paired with **keys**, which makes a dictionary a collection of **key-value pairs**.

In [None]:
my_dict = {"a": 12, "b": "hello", "c": [4, 9], "d": {"one": 1, "two": 2}}
print(my_dict)

In [None]:
my_dict.keys()

In [None]:
my_dict.values()

In [None]:
del my_dict['a']
print(my_dict)
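
The cells above create a dictionary and delete a key, so here is a short sketch of the other common operations: looking up a value by key, supplying a default for a missing key, adding a pair, and looping over pairs. The dictionary `inventory` is a made-up example.

```python
inventory = {"apple": 3, "pear": 0}  # hypothetical example dictionary

print(inventory["apple"])        # look up a value by its key
print(inventory.get("kiwi", 0))  # get returns a default instead of raising KeyError

inventory["kiwi"] = 7            # assigning to a new key adds a pair

for name, count in inventory.items():  # loop over key-value pairs
    print(name, count)
```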

There is one more data type we briefly cover here, although it is not a **native** data type, in the sense that we need to load a library for it: the `numpy` array. The `numpy` library is used for numerical computing, including linear algebra calculations. The most basic `numpy` object is an `array`.

In [None]:
import numpy as np
arr = np.array([2, 4, 1, 9])
print(arr)

The above array looks like a Python list, but a numpy array and a list are very different. Let's look at an example to see what makes them different.

### Exercise

In the following cell, we create a Python list and a `numpy` array. Then we perform the same operation on the list and on the array, but get very different results. Explain what happens in each case.

In [None]:
lst = [2, 4, 9, 7, 1, 9] # this is a list
arr = np.array(lst) # this is an array

print(lst + [1])
print(arr + [1])

Return to the above code and change one of the elements of the list from an integer to a string, then report what happens.

### End of exercise

As we saw in the last exercise, `numpy` arrays cannot contain elements of different types. We say that they are `atomic`. This makes `numpy` arrays ideal for storing columns in tables, because in a table, different columns can be of different types, but for a given column all elements (rows) are expected to be of the same type.

The `numpy` library is meant to make it easy to work with arrays. We saw for example how we can add 1 to every element of the array **without having to write a loop or list comprehension**. Let's look at another example: say we want to filter an array to only keep elements greater than 5. Compare the list implementation and the array implementation:

In [None]:
[i for i in lst if i > 5] # filtering a list

In [None]:
arr[arr > 5] # filtering an array
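
To see why `arr[arr > 5]` works, it can help to look at the intermediate step. The comparison produces an array of booleans (often called a mask), and indexing with that mask keeps only the positions where it is `True`. This sketch recreates the array so that it runs on its own; the name `mask` is just an illustration.

```python
import numpy as np

arr = np.array([2, 4, 9, 7, 1, 9])

mask = arr > 5     # element-wise comparison gives an array of booleans
print(mask)

print(arr[mask])   # keep only the elements where the mask is True
print(mask.sum())  # True counts as 1, so this is how many elements passed the filter
```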

There's a lot more we could be talking about, but this is a data science class, not a class on Python programming. We **strongly** encourage you to take an introductory Python class and get very comfortable with the constructs we covered here. For the remainder of the notebook, we cover a few more things that have relevance to data science and then try to bring these examples home by covering some practical examples.

We often have to print some summaries about data. The `print` function along with the `format` method can be very helpful.

In [2]:
dec, bignum, pct = 3.1415, 4.234**5, 0.89 # you can assign multiple variables at once with this shortcut
print("The three numbers are {}, {}, and {}.".format(dec, bignum, pct))
print("The decimal rounded to 2 digits is {:.2f}".format(dec))
print("The bignum with thousand separators and 2 decimals is {:,.2f}".format(bignum))
print("The percentage number is {:2.0f}%".format(pct * 100))

print(f"The three numbers are {dec}, {bignum}, {pct}")
print(f"Four plus two equals {4+2}")

The three numbers are 3.1415, 1360.6745706140914, and 0.89.
The decimal rounded to 2 digits is 3.14
The bignum with thousand separators and 2 decimals is 1,360.67
The percentage number is 89%
The three numbers are 3.1415, 1360.6745706140914, 0.89
Four plus two equals 6
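
The same format specifications also work inside f-strings, placed after a colon. The sketch below reuses the three numbers from the cell above; the names `line_1`, `line_2` and `line_3` are only there to keep the example short.

```python
dec, bignum, pct = 3.1415, 4.234**5, 0.89

line_1 = f"{dec:.2f}"      # fixed-point with 2 decimals
line_2 = f"{bignum:,.2f}"  # thousands separator and 2 decimals
line_3 = f"{pct:.0%}"      # % multiplies by 100 and appends a percent sign

print(line_1)
print(line_2)
print(line_3)
```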


Sometimes we want to run some code, and in case it fails, run some other code. This can help us **gracefully** (that's a technical term) handle errors in our code. To do that, we use the `try` and `except` statements.

In [None]:
try:
    new_var += 1 # new_var doesn't exist
    print("Found and incremented new_var. The new value is {}.".format(new_var))
except NameError: # raised when new_var has not been defined yet
    new_var = 45
    print("Initialized new_var to 45")

Try running the above cell multiple times to see what happens. The `try` statement can be very helpful when we're traversing data that is not well structured and we expect to run into errors but don't want the errors to stop us mid-stream.
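
It is usually safer to name the error we expect after `except`, so that unrelated bugs are not silently hidden. Here is a minimal sketch using made-up input strings; the optional `else` branch runs only when the `try` block succeeds.

```python
results = []

for text in ["42", "forty-two", "7"]:  # made-up inputs
    try:
        value = int(text)              # raises ValueError for "forty-two"
    except ValueError:
        results.append(None)           # record a placeholder and keep going
    else:
        results.append(value)          # runs only if no error was raised

print(results)
```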

Writing functions in Python is relatively easy:

In [None]:
def my_function(n, m = 1): # name and arguments for the function
    return n + m

print(my_function(3))
print(my_function(n = 3))

In [None]:
print(my_function(3, 4)) # you can match arguments by position, here order matters
print(my_function(n = 3, m = 4)) # you can match arguments by name
print(my_function(m = 4, n = 3)) # if you match arguments by name, order doesn't matter

### Exercise

In the list below, one of the elements is a string by accident. Write a program that multiplies each element of the list by itself. Use `try` and `except` to leave the element as-is when the element is not a number.

There are different ways to solve this. Here are two ways we propose:
- use a loop that iterates over the index of the list using `range(len(my_list))`
- write a function and then use list comprehension to apply it to each element of the list

In [None]:
my_list = [2, 4, 8, "3", 5]

### End of exercise

At this point you might be wondering: am I here to learn data science, or am I here to learn to solve little programming challenges? In other words, how relevant is all of this to doing data science? The answer is very, very, very relevant, because knowing the basics well can help you write clear and concise code to manipulate data. We'll look at some examples here, but the truth is it takes time and practice to come to this realization.

Let's show an example of how the things we learned can be applied to a data science situation. We will go and read in some data from a JSON file, which is an example of what we call **semi-structured data**. 

### Exercise

Before reading the data, go and open it in an editor (the file name is `books.json`). A JSON file is not a Python object, but does it look **similar to** any of the Python objects we've encountered so far?

### End of exercise

Let's now go and read the file into Python. After reading it, we will print its first element to see what kind of data is there. When you print objects that can have nested information, it's helpful to pretty-print them using the `pprint` library so the information is more presentable.

In [None]:
import json
with open('data/books.json', encoding = 'utf-8') as f:
    books_dict = json.load(f)

from pprint import pprint
pprint(books_dict[0]) # print information for the first book

We are now going to extract particular pieces of information from the first book: the title, author, category, and ISBN of the book. However, we run into a problem: books can have multiple authors and multiple categories, and for reasons that will become clear soon we don't want to allow that. Instead we'll do this:

- When there are multiple authors, we will replicate the information once for each author.
- When there are multiple categories, we will just take the first one and ignore the rest.

In [None]:
elem = books_dict[0] # pull out the first element
num_authors = len(elem['authors'])
print("The first book has {} authors.".format(num_authors))

Also for reasons that will become clear soon, we will create a **tuple** for storing the information, so here because we have three authors, we create three tuples called `row_1`, `row_2` and `row_3`.

In [None]:
row_1 = (elem['title'], 
         elem['authors'][0], # first author
         elem['categories'][0], # first category
         elem['isbn'])

row_2 = (elem['title'], 
         elem['authors'][1], # second author
         elem['categories'][0], # first category
         elem['isbn'])

row_3 = (elem['title'], 
         elem['authors'][2], # third author
         elem['categories'][0], # first category
         elem['isbn'])

In [None]:
print(row_1)
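
Writing out `row_1`, `row_2` and `row_3` by hand works for three authors, but the same rows can be built for any number of authors with a list comprehension. The sketch below uses a made-up stand-in dictionary (`example_elem`) with the same four fields, since the real content of `books.json` is not shown here.

```python
# made-up stand-in for one book entry; the real entries come from books.json
example_elem = {
    "title": "Example Book",
    "authors": ["Author A", "Author B", "Author C"],
    "categories": ["Category X", "Category Y"],
    "isbn": "0000000000",
}

# one tuple per author, always using the first category
example_rows = [
    (example_elem["title"], author, example_elem["categories"][0], example_elem["isbn"])
    for author in example_elem["authors"]
]

for row in example_rows:
    print(row)
```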

Now it's time to take the content from above and place it in a tabular datastore. We will use the `sqlite3` library, which gives us access to a lightweight SQL database in Python. We don't actually need to have SQLite installed on our machine. Instead we use `sqlite3.connect(':memory:')` to connect to a "database" in memory and pretend it's a physical database somewhere.

In [None]:
import sqlite3

connection = sqlite3.connect(':memory:') 
cursor = connection.cursor()

cursor.execute('''CREATE TABLE books_long
             (title text, author text, category text, isbn text)''')

rows = [row_1, row_2, row_3]
cursor.executemany('INSERT INTO books_long VALUES (?,?,?,?)', rows)

connection.commit() # save the changes

Notice how in the above snippet, we first created a SQL table with column names and column types matching what we extracted from the JSON file. We then grouped `row_1`, `row_2` and `row_3` into a list and inserted them into the SQL table we created, by using `INSERT INTO`.

### Exercise

Based on what we learned about lists and tuples, see if you can answer the following questions:

1. What is the type of `row_1`? Provide a justification for this choice.
1. What is the type of `rows`? Provide a justification for this choice.

### End of exercise

How do we check that it all worked? We can simply run a `SELECT *` on the data, and use `fetchall()` to grab it from the database and bring it back into Python.

In [None]:
books_table = cursor.execute("SELECT * FROM books_long").fetchall()
books_table
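
`SELECT *` returns everything, but the same `?` placeholders we used for inserting also work for filtering. The following self-contained sketch uses its own in-memory database and a made-up table called `demo_books`, so it does not touch `books_long`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # separate throwaway database
cur = conn.cursor()

cur.execute("CREATE TABLE demo_books (title text, author text)")
cur.executemany(
    "INSERT INTO demo_books VALUES (?, ?)",
    [("Book One", "Author A"), ("Book One", "Author B"), ("Book Two", "Author A")],
)
conn.commit()

# the parameter tuple fills in the ? placeholder
result = cur.execute(
    "SELECT title FROM demo_books WHERE author = ? ORDER BY title",
    ("Author A",),
).fetchall()
print(result)

conn.close()
```

Each row comes back as a tuple, even when we select a single column.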

So let's summarize what we accomplished:
1. We read data from a JSON file into a Python list of dictionaries.
1. We extracted some of the data out of the first dictionary and placed it into a list of tuples.
1. We took the list of tuples and dumped its content into a SQL table.
1. We read the content back from the SQL table and into Python as a list of tuples.

Now that we get the picture, let's loop through the entire data and extract the information. Of course because JSON data doesn't necessarily enforce any sort of schema, we can't be sure that the information we are trying to extract exists for every book entry. So we use `try` and `except` to loop through the data and skip to the next item every time there's an error with reading one of the book entries.

In [None]:
rows = []

for item in books_dict:
    num_authors = len(item.get('authors', [])) # entries without an 'authors' field produce no rows
    for i in range(num_authors):
        try:
            row = (item['title'], 
                   item['authors'][i], 
                   item['categories'][0], 
                   item['isbn'])
            rows.append(row)
        except (KeyError, IndexError): # a field is missing or a list is empty
            pass
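
To see the skipping behavior without the real file, here is a small sketch with three made-up entries, two of which are incomplete. It catches only `KeyError` (missing field) and `IndexError` (empty list), and counts how many rows were skipped, which is a useful sanity check before loading data into a table. All names and values here are hypothetical.

```python
# made-up entries: the second has no categories, the third has no ISBN
sample_books = [
    {"title": "Complete", "authors": ["A", "B"], "categories": ["X"], "isbn": "111"},
    {"title": "No categories", "authors": ["C"], "categories": [], "isbn": "222"},
    {"title": "No ISBN", "authors": ["D"], "categories": ["Y"]},
]

sample_rows = []
skipped = 0

for item in sample_books:
    for author in item.get("authors", []):
        try:
            sample_rows.append(
                (item["title"], author, item["categories"][0], item["isbn"])
            )
        except (KeyError, IndexError):
            skipped += 1  # keep a count instead of silently ignoring the problem

print(sample_rows)
print(skipped)
```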

In [None]:
cursor.executemany('INSERT INTO books_long VALUES (?,?,?,?)', rows)
books_table = cursor.execute("SELECT * FROM books_long").fetchall()
connection.commit() # save the changes
connection.close() # any un-committed changes are lost when closing the connection 

In [None]:
books_table[:5]

SQL tables are not the only way, and definitely not the most straightforward way to store and manipulate data in Python. Later we will learn how we can use Python's `pandas` library to create a `DataFrame`. This library has a lot of functionality that makes it easy to run the common tasks data scientists do with data.

In [None]:
import pandas as pd
books_df = pd.DataFrame(books_table, columns = ['title', 'author', 'category', 'isbn'])

Just as an example, let's see how we can show the first few rows of a `DataFrame` using a method called `head`.

In [None]:
books_df.head()

Remember how earlier we said that `numpy` arrays are ideal for storing columns in tables? A `DataFrame` is built on top of `numpy` arrays. Another way of saying it is that a `DataFrame` is an **abstraction** on top of `numpy` arrays, i.e. a `DataFrame` is a more **high-level** object than a `numpy` array. Here's one easy way to convince you of that:

In [None]:
books_df.values # returns the DataFrame as a numpy array

Now you can judge which object is more "user-friendly". That's one of the things that abstractions allow us to do: build more user-friendly (abstract) objects from less user-friendly (but more fundamental) objects.