<img src="../../img/backdrop-wh.png" alt="Drawing" style="width: 300px;"/>

# Python Basics

* * * 

<div class="alert alert-success">  
    
### Learning Objectives 
    
* Learn what Jupyter notebooks are and how to run them. 
* Learn the basics of Python: variables, structures, functions, loops and conditionals.
* Understand the `Pandas` package and the `DataFrame` object, which can be used for exploratory analysis.
</div>


### Icons Used in This Notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
⚠️ **Warning:** Heads-up about tricky stuff or common mistakes.<br>

### Sections

1. [Python and Jupyter](#python)
2. [Variables and Data Types](#var)
3. [Loops](#loops)
4. [Conditionals](#cond)
5. [Data Frames: Spreadsheets in Python](#df)

<a id='python'></a>

# Python and Jupyter

Jupyter Notebooks are documents that let you write and run Python code alongside explanatory text.

In Jupyter Notebooks, we can write 1) code and 2) markdown cells. Right now, you're reading a markdown cell. 

Code cells run Python under the hood, using a "kernel" (the computational engine). Markdown is text, which we can use to organize our work and research narrative. This text is essential to introduce the reader/audience to the problem or topic being investigated, our research question/hypothesis, materials and methods, results, discussion, and conclusions. 

Jupyter Notebooks can be exported into slideshows and .html and .pdf files for presentation. [Click here to check out the Jupyter Notebooks beginner guide](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/What%20is%20the%20Jupyter%20Notebook.html)

### Run a cell

Every cell has to be <i class="fa-step-forward fa"></i> `Run` to work. Press **Shift + enter** after clicking on a cell to run a cell and advance to the next.

You can insert cells using the `Insert` menu, and move them up and down relative to one another using the <i class="fa-arrow-up fa"></i> and <i class="fa-arrow-down fa"></i> buttons.

Sometimes the kernel breaks or stops functioning. In such cases, you'll want to Restart it. Do this by clicking the `Kernel` menu and clicking the `Restart` button.

## Hello, World!

Even if programming is a [comparatively young art](https://en.wikipedia.org/wiki/History_of_programming_languages), it still has its traditions. One of these traditions is the `Hello, World!` program which is the first program that most people ever write whilst learning how to program.

In the cell below, you will now write your own `Hello, World!` program. Click on the block below this one, then run it using the button marked <i class="fa-step-forward fa"></i> `Run` and see what appears!

In [None]:
print("Hello, World!")

Now, write your own line of code here. Use the `print()` function to print out any line of text (don't forget to use quotation marks!)

In [None]:
# Your code here



## Comments

You'll also find that a lot of cells have comments in them. A comment is a note we leave for ourselves when writing code, explaining what we're thinking or doing. Python ignores these comments entirely, so we can write in human languages.

A comment is anything that starts with the hash (`#`) symbol. A comment can take up an entire line, or just the rest of a line containing code. See the comments below.

In [None]:
# here's a dog
print('Woof!')
print('Meow!')  # and that was a cat
# but let's hide the mouse
# print("Peep!")

<a id='var'></a>

# Variables and Data Types

In Python, we call something we literally write into the code a **literal**. This is as opposed to a variable, which is simply a name or placeholder for something that can take any value.

In Python, we create a variable by **assigning** it a value. We do this using the **assignment operator**, otherwise known as the equals (`=`) symbol. For example, to assign the value `"woof"` to the variable `dog` we would write `dog = "woof"`.

In the cell below, we write code which assigns a text value to a variable, then print the variable.

In [None]:
# YOUR CODE HERE
dog = 'woof'
dog

In the code cell below, type `cat` and see what happens.

In [None]:
# YOUR CODE HERE


Don't worry when you see something like this - it's called a **traceback**, and it shows where the error occurred in your code and what kind of error it is.

Python is raising a `NameError`. That means it's trying to find something, but it can't find the name in its memory. 

This is because the variable `cat` does not exist in memory yet.

## 🥊 Challenge: Overwriting Variables

Cells are run in the order that you execute them, not in the order they appear on the page. The `dog` variable right now refers to the value `"woof"`, but if we execute a cell that assigns a new value to `dog` **from anywhere in the notebook**, the old value will be overwritten.

Create a new cell anywhere in this notebook and assign `dog` to another string, e.g. `"grrr"`. Then call `dog` again. You'll see the variable has been overwritten.

Keep this in mind when you are running notebooks, especially when running cells out of order. If you are getting `NameError`s, this might be simply because some variables are defined somewhere in the notebook that you haven't run!

## Data Types

Variables in python have **types**. So far we have only come across one type: **strings** (which are what we call text, i.e., strings of characters). You can tell that it is a string as it is in between quotation marks--single or double, you choose.

Try to change the output below to something else.

In [None]:
print("woof")

Other types include integers (whole numbers), floats (numbers with a decimal point), and lists (a mutable, or changeable, ordered sequence of elements):

* `str` - strings, or text
* `int` - whole numbers
* `float` - numbers with a decimal point e.g., 1.0 or 1.5
* `list` - mutable, or changeable, ordered sequence of elements

To find out a variable's type, we can use the `type()` function. Try the example below.

In [None]:
print(type("Hello, World!"))

In the cell below there are some variables. Using `print()` statements, print out the type of one of the variables.

In [None]:
lumberjack = 'okay'
my_age = 30
i_deserve = 10000.00
my_list = [1,2,3]

# Put your answer below!



Variables are fundamental to programming and you will use them throughout the course and during the rest of your programming career.

In the next line we are `print`ing multiple items. We can swap out strings for the variables.

In [None]:
pronoun = 'He'
occupation = 'lumberjack'
judgement = 'okay!'
print(pronoun, "is a", occupation, 'and', pronoun, 'is', judgement)

## Using Methods on Datatypes

**Functions** are blocks of code which only run when they're called. `print()` is one such function, and it is built into python.

**Methods** are functions that only work on certain data types. Two much-used methods to work with strings are `lower()`, which turns a given string into lowercase, and `split()`, which splits a string into a list.

In [None]:
"I am a string".lower()

In [None]:
"I-am-a-string".split('-')

## Methods for Lists

One data type you'll use very often is a `list`. It's useful because you can store multiple items in it! Here's an example:

In [None]:
shopping_list = ['bread', 'eggs', 'milk']
print(shopping_list)

Note the **square brackets** when creating the list. The items in the list are separated by commas.

There are methods that only work on lists, such as `append()`, `count()`, `pop()` or `index()`: 

In [None]:
shopping_list.append('cereal')

In [None]:
shopping_list.count('cereal')
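The other two methods mentioned above, `index()` and `pop()`, can be sketched in the same way. The `snacks` list is made up for this illustration, so that `shopping_list` stays untouched.

```python
# A separate, made-up list so we don't change shopping_list
snacks = ['apple', 'nuts', 'cheese', 'nuts']

# index() returns the position of the first matching item
first_nuts = snacks.index('nuts')
print(first_nuts)

# pop() removes the last item from the list and returns it
last_item = snacks.pop()
print(last_item)
print(snacks)
```

Note that `pop()` changes the list itself, just like `append()` does.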

### Indexing Lists

**Indexing** a list means to retrieve a certain item from it. We (confusingly) also use **square brackets** to index a list. 

Python is zero-indexed, meaning indexing starts at 0. 

In [None]:
shopping_list[1] # note that this is the second element from the list, as python starts counting at 0

We can also retrieve multiple items from our list at once by putting a colon between a start position and a stop position (this is called **slicing**):

In [None]:
shopping_list[1:3]
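Here is a small sketch of how the colon behaves, using a made-up list called `letters`. The item at the stop position is not included in the result.

```python
letters = ['a', 'b', 'c', 'd', 'e']  # made-up list for illustration

middle = letters[1:3]    # items at positions 1 and 2; position 3 is left out
first_two = letters[:2]  # leaving out the start means "from the beginning"
rest = letters[2:]       # leaving out the stop means "until the end"

print(middle)
print(first_two)
print(rest)
```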

## `SyntaxError`
So far the text we have used has been quite simple and not contained any special characters like quotation marks. But what if our text is more complicated?

In [None]:
print('Oh, I'm a lumberjack, and I'm okay,')

This error is a `SyntaxError`. It means that there is a problem in the way your code is written. Think of it like a spelling or grammar error in an essay.

Luckily, whenever you encounter an error, it's generally quite easy to find help. In this case, [there's a handy online tutorial which can assist](https://www.digitalocean.com/community/tutorials/how-to-format-text-in-python-3). Read through this then print out the lyrics which are commented out below.

In [None]:
# Oh, I'm a lumberjack, and I'm okay, 
# I sleep all night and I work all day. 

# He's a lumberjack, and he's okay, 
# He sleeps all night and he works all day. 

## Operators

We can do things with variables, and sometimes change their values using operators. So far, we've only covered one operator, the assignment operator (=). Python actually has lots of different operators.

Here are some basic operators:

| Symbol  | Name              | Example  | Used For                                                           |
|---------|-------------------|----------|--------------------------------------------------------------------|
| `=`     | Assignment        | `a = 1`  | Assigning the value on the right to the variable on the left       |
| `+`     | Addition          | `1 + 2`  | Returns the sum of the right and left hand sides                   |
| `-`     | Subtraction       | `3 - 1`  | Returns the left hand side minus the right hand side               |
| `*`     | Multiplication    | `2 * 3`  | Returns the product of the left and right hand sides               |
| `**`    | Power             | `2 ** 2` | Returns the left hand side to the power of the right hand side     |
| `+=`    | In place addition | `a += 1` | Sums the left and right hand sides, assigns sum to left hand side  |

Most of these operators will not change the value of a variable, but rather return a new value. For example, check out the code below.

In [None]:
a = 10 # note: we don't put integers in quotes!
b = 20 
print('a * b =', a * b)
print('a = ', a)
print('b = ', b)

If we want to keep the value returned by an operator like `+` or `*`, we need to use it in conjunction with the assignment operator to assign the value to a new variable, or to an old variable. Check out the examples below.

In [None]:
c = 20
d = c + 10
print('d is', d)
print('and c is still', c)
f = 100
print('f is currently', f)
f = f * d
print('but we multiplied it by d and assigned the product to f, now f is', f)
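The in-place addition operator `+=` from the table combines both steps in one. A minimal sketch, with a made-up variable called `counter`:

```python
counter = 5
counter += 1   # same as writing: counter = counter + 1
print('counter is now', counter)

counter += 10  # we can add any number, not just 1
print('and now counter is', counter)
```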

### What's `'Woof!' * 2`?

So far, we've only used operators on integers. However, you can also use some operators on strings, too.

🔔 **Question**: In the cells below, try using the addition and multiplication operators on strings. What do you expect to happen?

In [None]:
# here is a variable
dog = "Woof!"

# multiply it by two and print the result



## `TypeError`

In the following code cell, we try to multiply the following strings: `"2" * "2"`.

In [None]:
"2" * "2"

We run into a `TypeError`. The program is complaining that it *"can't multiply sequence by non-int of type 'str'"*. What this means is that we've tried to multiply two strings together, and Python does not know how to do that.

You can't multiply `"dog"` by `"cat"`, and in the same way Python can't multiply `"2"` by `"2"`: both are text, not numbers.

🔔 **Question**: Why **can't** python multiply the values `"2"` and `"2"` together? Why **can** python multiply the value `"2"` by `2`?

## Casting: Converting between Types

Luckily, Python has several functions which allow us to convert between different types. The term we use for this is **casting**. So, to fix the error we got above, we need to **cast** the **string** values into **integers** so that we can multiply them together.

It's easy to see how the string `"100"` can be converted to the integer `100`, or how the integer `123` can be converted to the string `"123"`, but how would you convert the string `"woof"` into an integer? Hint - you can't.

If a casting function can't convert a value, we say it **raises** an error. For a string that `int()` can't read as a number, that error is a `ValueError` (not the `TypeError` that was raised before when we tried to multiply two strings together).

Here are some examples of casting: 

* `int('22')` (string to integer) will return `22`
* `int('twenty two')` (invalid string) will raise `ValueError`
* `str(100)` (integer) will return `'100'`
* `str(100.0)` (float) will return `'100.0'`
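The examples above can be tried out in a cell like the sketch below. The variable names are made up for illustration, and `float()` is included as one more casting function alongside `int()` and `str()`.

```python
text_number = '22'

as_int = int(text_number)      # string -> integer
as_float = float(text_number)  # string -> float
as_text = str(100)             # integer -> string

print(as_int * 2)      # now we can do arithmetic with it
print(as_float)
print(as_text + '!')   # and this one behaves like text
print(type(as_int), type(as_float), type(as_text))

# int('twenty two')  # uncomment this line to see the error it raises
```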

<a id='loops'></a>

# Loops

A **[for loop](https://www.w3schools.com/python/python_for_loops.asp)** executes some statements once *for* each value in an iterable (like a list or a string). It says: "*for* each thing in this group, *do* these operations".

## Looping Over Lists

One very common thing to do is loop over a list. It works like this:

In [None]:
tokens = ['Hermeneutics','is','a','fancy','word']

for token in tokens:
    up = token.upper()
    print(up)

Let's look at the syntax of this `for` loop a bit more closely:

<img src="../../img/for.svg" alt="For loop in Python" width="700"/>

Pay attention to the **loop variable** (`token`). It stands for each item in the list (`tokens`) we are iterating through. Loop variables can have any name; if we changed it to `x`, it would still work. Note that in Python the loop variable still exists after the loop has finished: it keeps the last value it was given.
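As a small sketch of both points, here is a loop that uses `x` as its loop variable and runs over a string instead of a list. The names `word` and `shouted` are made up for this illustration.

```python
word = 'cat'
shouted = []  # we'll collect the results here

# x is the loop variable; it stands for each character in turn
for x in word:
    shouted.append(x.upper())
    print(x.upper())

print(shouted)
```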

## 🥊 Challenge: Looping Over Lists

Create a list with some integers, then write some code that will loop over each integer in your list and multiply it by 10. In the body of the loop, `print` each multiplication. 

In [None]:
# YOUR CODE HERE



<a id='cond'></a>

# Conditionals

Conditionals, such as `if` and `else`, rely on **Booleans**. In Python, there is a special type, called `bool`, which can only ever have one of two values: `True` or `False`. Never both, never maybe, just `True` or `False`.

Let's imagine we have a variable called `shop_has_eggs` that records whether a shop has eggs. It can be either `True` or `False`, and we want to do different things depending on its value. This is called **conditional execution**: we write out two blocks of code, and Python executes only one of them, depending on some **condition**.

This is where two keywords come in: `if`, and `else` (you can think of `else` as simply meaning "otherwise"). We use these like below. Run the code and change the value of `shop_has_eggs` to see how the execution differs.

In [None]:
# first, define our shop has eggs variable
shop_has_eggs = True  # change this to False and see what happens when you run it again

# everything above will always execute
print('I am going to the shops')

# now for the conditional execution
if shop_has_eggs:
    # when shop_has_eggs is True, this block executes
    print('The shop has eggs, therefore I will buy 6 eggs.')
else:
    # when it's False, this block executes
    print('The shop has no eggs, therefore I will buy 1 loaf of bread.')
    
# everything below will always execute
print('Now I am walking home.')

Here we can see that `if` takes a boolean value. If the value is `True`, it will execute the block of code below it; if it is `False`, it will skip that block and not execute it. 

Furthermore, `if` can be combined with `else`, so that if the condition passed to `if` is `False`, then the block below `if` won't execute, but the block below `else` will.

**⚠️ Warning:** You can use `if` on its own without `else`, but you can never use `else` on its own without `if`. This is because `if` must take a condition, but `else` never takes one.
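A minimal sketch of an `if` without an `else`, using made-up names (`is_raining`, `bag`). When the condition is `False`, the indented block is simply skipped and nothing happens in its place.

```python
is_raining = True  # change this to False and run the cell again
bag = ['keys']

if is_raining:
    # this block only runs when is_raining is True
    bag.append('umbrella')

# this line always runs
print(bag)
```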

## Comparison Operations

In a real-life database management system, it is much more likely that there would be a variable called something like `egg_count`, which tells us the *number* of eggs in stock, rather than simply *whether* there are eggs or not. If that number is 0, there are no eggs; if it's 1 or more, then there are eggs.

So, we need a way to **compare** values (in this case an integer number of eggs) to evaluate to `True` or `False`. This is where the python [comparison operators](https://www.tutorialspoint.com/python/python_basic_operators.htm) come in.

Comparison operators are like arithmetic operators, in that they take the value on the left, and compare it to the value on the right, and then, depending on the result of the comparison, return `True` or `False`. We can either assign the boolean value to a variable, or, more commonly, just pass the condition to `if`. 

Probably the simplest of all these operators to understand is `==`, the **equality** operator. Note the two equals signs. This is intentional, as it means Python can tell the difference between **assignment** and **equality**.

The code below is an example which uses both the assignment operator and the equality operator.

In [None]:
# add comments in this cell
test_var = 10

if test_var == 10:
    print('test_var is equal to ten')

The equality operator is just one of the python comparison operators:


| Symbol | Name                     | Example: `True`    | Example: `False`   |
|:------:|:------------------------:|:------------------:|:------------------:|
| `==`   | Equality                 | 'woof' == 'woof'   | 23 == 20           |
| `!=`   | Inequality               | 'woof' != 'meow'   | 23 != 23           |
| `>`    | Greater than             | 123 > 12.3         | 100 > 1000         |
| `<`    | Less than                | 1000 < 10000       | 1 < 0.1            |
| `>=`   | Greater than or equal to | 10 >= 10           | 100 >= 1000        |
| `<=`   | Less than or equal to    | 10 <= 100          | 101 <= 100.0       |
| `is`   | Identity                 | 10 is 10           | 10 is 10.0         |


Take a moment to go over these in your head so you are confident you understand them. 
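To tie this back to the eggs, here is a sketch with a made-up `egg_count`. It shows both options mentioned earlier: storing the result of a comparison in a variable, and passing a comparison straight to `if`.

```python
egg_count = 12  # made-up stock number

# option 1: store the result of a comparison in a variable
has_eggs = egg_count > 0
print(has_eggs)

# option 2: pass the comparison straight to if
if egg_count >= 6:
    message = 'I will buy 6 eggs.'
else:
    message = 'Not enough eggs, I will buy bread.'
print(message)
```

Try changing `egg_count` to 3 or to 0 and see which lines change.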

## 🥊 Challenge: Creating a Calculator

With what you've learned, do the following:

1. Create a list with integers between 1 and 20.
2. Multiply the values smaller than 10 by 2.
3. Otherwise, divide the value by 2.
4. Print the values of the calculation.

In [None]:
# YOUR CODE HERE






<a id='df'></a>

# Data Frames: Spreadsheets in Python

**Tabular data** is everywhere. Think of an Excel sheet: each column corresponds to a different feature of each datapoint, while rows correspond to different samples.

In scientific programming, tabular data is often called a **data frame**. In Python, the `pandas` package contains an object called `DataFrame` that implements this data structure.

In [None]:
%pip install gdown

In [None]:
import gdown

gdown.download("https://drive.google.com/uc?id=1x4qElA3OKHH9SWkr78NdYxmAdyIqdpB-", "../../data/aita_top_submissions.csv", quiet=False)

## Importing Packages

Now let's discuss packages. A **package** is a collection of code that someone else wrote and put in a sharable format. Usually it's designed to add specific functionalities to Python. The package we will use in this notebook is called Pandas. Let's try and use it by importing a CSV file.

In [None]:
# read the file
df = pd.read_csv('../../data/aita_top_submissions.csv')

## `NameError`

We've encountered another error: a `NameError`. What does this mean?

Before we can use a package like Pandas, we have to **import** it into the current session.
Importing is done with the `import` keyword. We simply run `import [PACKAGE_NAME]`, and everything inside the package becomes available to use.

For many packages, like `pandas`, we use an **alias**, or nickname, when importing them. This is just done to save some typing when we refer to the package in our code.

Let's import the `pandas` module, and add the alias `pd`.

In [None]:
import pandas as pd

## Reading CSV Files

The red text in the cell below is part of the command we use to load a file using the `read_csv()` function. This function takes the location of the file (called a file path) as its main input.

Let’s break it down:
- `../` means “go up one folder level” from where this notebook is currently located (which is in the “lessons” folder). We use it twice to go back up to the main “DIGHUM160” folder.
- `data/` means “go into a folder named data” inside that main folder.
- `aita_top_submissions.csv` is the name of the file we’re opening from the data folder.

So we’re telling Python:
Go up two levels, then go into the data folder, and open the aita_top_submissions.csv file.

In [None]:
df = pd.read_csv('../../data/aita_top_submissions.csv')

The `.head()` method will show the first five rows of a Data Frame by default. 

💡 **Tip**: Put an integer in between the parentheses to specify a different number of rows. 

In [None]:
df.head()

### More on .csv Files
As data scientists, we'll often be working with these **Comma Separated Values (.csv)** files. 

Comma separated values files are common because they are relatively small and open easily in spreadsheet software. A .csv file is just a text file in which commas (or other separators) mark the breaks between columns.
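To see that a .csv file really is just text, here is a small sketch. It uses Python's built-in `csv` module rather than Pandas, and the data in `csv_text` is made up for illustration.

```python
import csv
import io

# A tiny, made-up CSV "file" written out as plain text
csv_text = "name,animal,age\nRex,dog,3\nTom,cat,5\n"

# io.StringIO lets us treat a string as if it were an open file
rows = list(csv.reader(io.StringIO(csv_text)))

# each line of text becomes a list, split at the commas
for row in rows:
    print(row)
```

Notice that every value comes out as a string, including the ages. Turning them into numbers takes an extra casting step.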

As you see, `pandas` comes with a function [`read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) that makes it really easy to import .csv files.

Let's have a look at our .csv file in our browser!

## Selecting Columns
Now that we have our `DataFrame`, we can select a single column by putting the name of that column in square brackets. This bracket notation is similar to what we use when accessing lists.

Check it out:

In [None]:
df['author']

The data type of this column is a `Series`. It's like a list. You can index a `Series` object just like you can with a list!

In [None]:
aita_author = df['author']
aita_author[0]

Now let's try to retrieve another column.

In [None]:
df['time']

## `KeyError`

We've encountered a `KeyError`. What does this mean?

We are trying to access a column using a label that does not exist in the DataFrame. 

In the case of `df['time']`, the error `KeyError: 'time'` indicates that there is no column labeled 'time' in the DataFrame.


## Using Methods on Columns

`DataFrame` objects come with their own methods, many of which operate on a single column of the DataFrame. 

For example, we can identify the number of unique values in a column by using the `nunique()` method:

In [None]:
df['author'].nunique()

Usually, a package provides **documentation** that explains all of its functionalities. Let's have a look at the documentation for a method called `value_counts()` [online](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html). 

🔔 **Question**: What does `value_counts()` do in the code below?

In [None]:
df['flair_text'].value_counts()

## 🥊 Challenge: Putting Methods in Order

In the following code we want to find the top 3 most frequently occurring flairs in our data. Put the following code fragments in the right order to get this information!

In [None]:
.head(3)
.value_counts()
df['flair_text']

## Attributes 

Packages like Pandas don't only come with methods, but also with so-called **attributes**.

Attributes are like variables: they give you more information about the data that you have. Methods are like functions: they allow you to do something with data.

For instance, we can easily check the column names of our data frame using the `columns` **attribute**.

In [None]:
df.columns

🔔 **Question**: Here's another popular attribute: `shape`. What do you think it does?

In [None]:
df.shape

## Jupyter Autocomplete

Jupyter Notebooks allow for tab completion, just like many text editors. If you begin typing the name of something (such as a variable) that already exists, you can simply hit **Tab** and Jupyter will autocomplete it for you. If there is more than one possibility, it will show them to you and you can choose from there. 

🔔 **Question:** Below we are selecting a column in our `DataFrame`. See what happens when you hit `TAB`! What are you seeing?

In [None]:
# YOUR CODE HERE

df['flair_text'].

## Selecting Rows

Accessing rows in a DataFrame can be done using `.loc`. `loc` is **label-based**: that means it works based off of the **index label** of our DataFrame. In our current DataFrame, the index labels are shown in the leftmost column, which consists of numbers that explicitly identify each row.

In [None]:
df.head()

When you use `.loc`, you are specifying the exact label of the index you want to access. For example, if your DataFrame's index labels are 0, 1, 2, 3, and so forth, using `df.loc[0]` will fetch the row where the index label is 0.
 

In [None]:
df.loc[0]
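Index labels don't have to be numbers. The sketch below builds a tiny, made-up data frame called `toy_df` whose index labels are letters, to show that `.loc` looks up the label itself rather than a position.

```python
import pandas as pd

# A tiny, made-up data frame whose index labels are letters, not numbers
toy_df = pd.DataFrame(
    {'animal': ['dog', 'cat', 'mouse'], 'sound': ['woof', 'meow', 'peep']},
    index=['a', 'b', 'c'],
)

row_b = toy_df.loc['b']  # the row whose index label is 'b'
print(row_b)
print(row_b['sound'])    # a single value from that row
```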

However, if we try to access an index label that does not exist, we will get a `KeyError` again:

In [None]:
df.loc[912039]

## Conditional Selection

What if we wanted to get some rows in our dataset based on some condition? For example, what if we just wanted to select only the rows for which the flair is "Not the A-hole"? Or only posts that have a certain number of comments?

We can use so-called **value comparison operators** for this. For instance, to get only the rows that have a specific flair, we can use `==`.

In [None]:
df['flair_text'] == 'Not the A-hole'

💡 **Tip**: Fancy terminology alert: the above Series is called a **Boolean mask**. It's like a list of True/False labels that we can use to filter our Data Frame for a certain condition!

Here, we create a subset of our Data Frame with the fancy Boolean mask we just created. 

In [None]:
# Getting only the data points with this flair
df[df['flair_text'] == 'Not the A-hole']

Note that the output of this operation is a **new data frame**! We can assign it to a new variable so we can work with this subsetted data frame. Let's do it again:

In [None]:
# Creating a new data frame with only the 'Not the A-hole' posts
nta_df = df[df['flair_text'] == 'Not the A-hole']
nta_df.head()
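On a large data set it can be hard to see what the mask is doing. Here is a sketch on a tiny, made-up data frame (`toy_posts`, with hypothetical `flair` and `score` columns) where every row fits on the screen.

```python
import pandas as pd

# A tiny, made-up data frame with three posts
toy_posts = pd.DataFrame({
    'flair': ['NTA', 'YTA', 'NTA'],
    'score': [10, 25, 40],
})

mask = toy_posts['flair'] == 'NTA'  # one True/False value per row
print(mask)

subset = toy_posts[mask]            # keeps only the rows where the mask is True
print(subset)
```

The first and third rows are kept and the second is dropped, matching the `True` and `False` values in the mask.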

## 🥊 Challenge: Subsetting Data Frames

Besides `==` we can use [other operators](https://www.w3schools.com/python/gloss_python_comparison_operators.asp) to compare values. For instance:
- `<` less than
- `>` greater than

Fill in the code below to subset our data frame to include only posts with more than 500 comments (`num_comments`).

In [None]:
# YOUR CODE HERE
df[df[...] > ...]

## 🥊 Challenge: Subsetting and Calculating the Mean

Let's make use of subsetting to do some calculation! Calculate the **average score** for a flair of your choice. 

This means you will have to:
1. Subset the `flair_css_class` column using a Boolean mask and '==', picking one of the values this column can hold.
2. Take the `score` column from that subset.
3. Apply a Pandas method to get the mean from that column.

You might not know how to get the mean of a column – yet! If that's the case, **use your search engine**.

1. Enter the name of the computer language or package, and your question (for instance: "Pandas calculate mean").
2. Read and compare the results you find.
3. Try 'em out!

In [None]:
# YOUR CODE HERE


# 🎉 Well Done!

This workshop series took us through the basics of data analysis in Python:

- Using Jupyter notebooks.
- Variables, data types, functions.
- Looking through documentation.
- Googling errors and debugging.
- Manipulating data with Pandas.

<div class="alert alert-success">

## ❗ Key Points

* Jupyter notebooks use a kernel to run Python. It can be restarted from the `Kernel` menu when the notebook gets stuck.
* Quitting a notebook means losing all the variables you had saved to memory.
* Variables in Python are assigned using `=`.
* Lists are a key data structure in Python: they can be recognized by square brackets.
* For-loops allow you to loop over iterables like lists or `Series`.
* If-statements execute based on some condition. 
* Import a library into Python using `import <libraryname>`.
* Data frames allow you to work with tabular data (think Excel in Python).
    
</div>