# Introduction to the Python Coding Environment

*This notebook uses content from the [Python section](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/00-Python.html) of Melanie Walsh's Introduction to Cultural Analytics & Python.*

The purpose of this notebook is to give you a very basic introduction to the Python coding environment using Jupyter. We will see a preview of several topics which will be expanded on in later lessons, but keep in mind that **the goal here is simply to get a *feel* for the Python-Jupyter coding experience.**

In this lab, we will cover the following topics:

1. JupyterLab and Markdown
2. Variables and Data Types
3. Operators and Logic
4. Packages, Modules, and Objects

\* *Note: make sure you've had a good look at the [Command Line section](https://melaniewalsh.github.io/Intro-Cultural-Analytics/01-Command-Line/01-The-Command-Line.html) before completing this notebook.*

## Jupyter(Lab) and Markdown

[**Jupyter**](https://docs.jupyter.org/en/latest) provides a convenient environment for coding in Python. *This file* is a Jupyter notebook (.ipynb), in which Python code is divided into individual cells, and the output of each cell is displayed below it when it is run. These cells *can* be run in any order you like, but it is good form to arrange them so that the notebook can always be run from top to bottom without errors. [This blog post](https://florianwilhelm.info/2018/11/working_efficiently_with_jupyter_lab/) provides an in-depth look at some best practices when using Jupyter.

[**Markdown**](https://www.markdownguide.org/cheat-sheet/) renders blocks of text with user-defined formatting. Any cell in a Jupyter notebook can also be a collapsible Markdown cell, and it is highly recommended that these are used in such a way that **every notebook is a clear story told with words (Markdown) and code (Python).** Use the [Markdown Cheat Sheet](https://www.markdownguide.org/cheat-sheet/) as a reference.

<span style="color: red;">PLEASE USE MARKDOWN OFTEN!</span> Make sure it is grammatically correct and well organized. Markdown is essential to a readable, well-written notebook.

### Comments

Throughout this notebook (and your future code), lines that begin with a hash symbol `#` are ignored when the code is executed. You can thus use lines starting with `#` to insert human-language comments directly into the code: notes or instructions to yourself and others. **USE COMMENTS OFTEN!** Again, code should be easy to understand, and comments help this along.

In [117]:
# this is a comment; nothing happens when it runs

## Variables and Data Types

Variables are one of the fundamental building blocks of Python. A variable is like a tiny container where you store values and data, such as filenames, words, numbers, collections of words and numbers, and more.

### Assigning Variables

The variable name will point to a value that you "assign" it. You might think about variable assignment like putting a value "into" the variable, as if the variable is a little box. **You assign variables with an equals `=` sign.** In Python, a single equals sign `=` is the "assignment operator", and you can read it as "... is assigned the value ...". Each variable points to a tiny partition of your computer's memory (RAM). Too many big variables can take up too much RAM, slowing down your computer, so be careful.

In [118]:
# numeric variables
my_integer = 5
my_float = 3.14  # "float" is short for "floating point", a real number

# non-numeric variables
my_boolean = True
my_string = "hello world"

# "iterable" variables
my_list = [1, 2, 3, 4]
my_set = {'a', 'b', 'c'}
my_dictionary = {'a': 1, 'b': 2}
my_tuple = (5, 6, 7)

Variables that can be updated ("mutated") are called "mutable", and those which technically cannot are called "immutable". We will see more on this later, but it suffices to say that we can either update variables (e.g., change the 2nd element of `my_list`), or we can completely overwrite them (e.g., run `my_integer = -1`).

In [119]:
my_float

3.14

In [120]:
my_float = 4.0
my_float

4.0
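The cells above *overwrite* a variable. Mutation is a different operation, and a minimal sketch of the contrast may help. The names `scores` and `point` are hypothetical examples, and `try`/`except` (not covered in this notebook) is used only so that the expected error does not stop the cell.

```python
# Lists are mutable: we can change one element in place.
scores = [10, 20, 30]
scores[1] = 99             # update the 2nd element (index 1)
print(scores)              # [10, 99, 30]

# Tuples are immutable: the same kind of update raises a TypeError.
point = (10, 20, 30)
try:
    point[1] = 99
except TypeError as error:
    print("Cannot mutate a tuple:", error)

# Overwriting always works, because the name is simply pointed at a new value.
point = (10, 99, 30)
print(point)               # (10, 99, 30)
```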

### Data Types

Each value held in a Python variable has an attribute called a "data type". A variable's data type describes the kind of value held in the memory location that variable points to, and it dictates the kinds of things we can do with that variable.

There are [several built-in data types](https://www.w3schools.com/python/python_datatypes.asp) available in Python, but the most common ones are listed above as the variables we just created!

Data Type | Examples | Notes
:---: | --- | ---
integer | `-2`, `5`, `0` |
float | `1.3`, `3.1415927`, `-3.2223` |
boolean | `True`, `False` |
string | `"hello"`, `"32"` | Strings actually act like "lists" of characters
list | `[1, 2, 3]`, `["a", 2, 3.0]` | We can access elements using `[]`, like `my_list[2]`
set | `{"a", 2, True}`, `{None}`, `{2}` |
dictionary | `{"a": 12, "c": 3}`, `{"a": [1, 2], "c": 3}` | Values are looked up by key rather than by position, e.g., `my_dictionary["a"]`
tuple | `(3,)`, `(1, 5)` | These can be indexed by position, like lists. A one-element tuple needs a trailing comma: `(3)` is just the integer `3`.

**Caveats**
* The `None` in the second `set` example is actually a "None Type" data type. It is usually used to represent a missing or non-existent value.
* It is possible to have multiple kinds of data types within a list, set, or tuple.
* Sequences (like lists and tuples) and strings can be sliced using `[start:stop:step]` syntax. Sets and dictionaries cannot.
    - **Python is zero-indexed.** So, the `0`th index is actually the first value.
    - The `stop` value is excluded from the result. So, `[2:4]` results in the 3rd and 4th elements.
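A couple of these points are easy to check for yourself. The sketch below uses hypothetical example values to show that parentheses alone do not make a tuple, that a list can mix data types, and that the `stop` index of a slice is excluded.

```python
not_a_tuple = (3)     # parentheses only group the value, so this is an integer
one_tuple = (3,)      # the trailing comma makes a one-element tuple

print(type(not_a_tuple))   # <class 'int'>
print(type(one_tuple))     # <class 'tuple'>

# a list may hold several data types at once
mixed = ["a", 2, 3.0, None]
print([type(item).__name__ for item in mixed])   # ['str', 'int', 'float', 'NoneType']

# slicing a string: index 2 is included, index 4 is not
letters = "abcdef"
print(letters[2:4])   # cd
```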

In [121]:
"blah"[1]

'l'

In [122]:
my_dictionary['a']

1

In [123]:
my_list[0]

1

In [124]:
None

In [125]:
new_list = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

new_list[2:8:2]

[2, 4, 6]

In [126]:
# we don't need to give all three of start, stop, and step
new_list[:4]

[0, 1, 2, 3]

In [127]:
# a negative step walks backwards, and a negative stop counts from the end
new_list[:-3:-1]

[9, 8]

### Naming Variables

Variable names can be as long or as short as you want, and they can include:
- upper and lower-case letters (A-Z, and a-z)
- digits (0-9)  *... (a variable cannot start with a digit)*
- underscores (_)  *... (the character `_` alone acts as a "temporary" variable)*

Variable names *cannot* include:
- other punctuation (-.!?@)
- spaces ( )
- a reserved Python word

**Use [clear and precisely-named variables](https://gist.github.com/etigui/7600441926e73c3385057718c2fdef8e).** Instead of using `n` to represent the name of a person, use `name`. Likewise, instead of using `b` to represent a person's birth year, just use `birth_year`. Be efficient, but don't worry too much if variable names get a bit long! :)

### Off-Limits Names

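Some names are off-limits because they are reserved by the Python language itself: keywords such as `True`, `if`, and `for`. Trying to assign to one of these raises a `SyntaxError`. Other names, such as `list` and `print`, are built in but not reserved: Python will let you overwrite them, but doing so hides the original, so avoid it. In Jupyter, keywords and built-in names are usually highlighted in a different color (often green), which is a quick hint that a name is already taken.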

In [128]:
# Try running the code below
# True = "the sky is blue"
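If you are unsure whether a name is usable, Python can tell you. The sketch below relies on the standard-library `keyword` module and the string method `isidentifier()`. The candidate names are hypothetical examples.

```python
import keyword

# reserved words cannot be used as variable names
print(keyword.iskeyword("True"))   # True
print(keyword.iskeyword("list"))   # False: built in, but not reserved

# isidentifier() checks the naming rules (letters, digits, underscores)
print("birth_year".isidentifier())   # True
print("2nd_place".isidentifier())    # False: starts with a digit
print("birth-year".isidentifier())   # False: contains a hyphen
```

Note that `isidentifier()` only checks the characters in a name, so a name is safe to use when it passes this check *and* is not a keyword.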

## Operators and Logic

Of course, the power of Python comes from its ability to manipulate and operate on these variables. In particular, in this class, we use arithmetic, comparison, and logical operators.

### Arithmetic Operators

Mathematical operators in Python behave differently depending on the data type they're used on. But, in general, there are 7 varieties of operator:

Operator | ... on Numbers | ... on Strings | ... on (some) Iterables
--- | --- | --- | ---
`+` | Addition | Concatenate | Concatenate (lists)
`-` | Subtraction | - | Difference (sets)
`*` | Multiplication | Replicate | Replicate (lists, tuples); unpack any iterable (e.g., `{*my_set}`)
`/` | Division | - | -
`%` | Modulus | - | -
`**` | Exponentiation | - | Unpack dictionaries (e.g., `{**my_dictionary}`)
`//` | Floor division | - | -

**Numeric Examples**

In [129]:
# addition and subtraction work as expected
4 + 5 + 2

11

In [130]:
# the floor is the "rounded down" integer
8 // 3

2

In [131]:
# the modulus is the remainder after division
5 % 2

1

**List Examples**

In [132]:
# we can join two lists together
[2, 3, 4] + [5, 6, 7]

[2, 3, 4, 5, 6, 7]

In [133]:
# or, we can make values repeat themselves
[1, 2] * 4

[1, 2, 1, 2, 1, 2, 1, 2]

**Set Examples**

In [134]:
# What is in the first set and not the latter
{1, 3, 2} - {2, 19, 2}

{1, 3}

In [135]:
a = {'a', 'b', 'c', 'c'}
b = {'c', 'd', 'f'}

# We can use * to unpack both sets into a new one, which gives their union
{*a, *b}

{'a', 'b', 'c', 'd', 'f'}
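The table also lists string operations and unpacking, which the cells above do not exercise. A minimal sketch follows, using hypothetical example values (`defaults` and `overrides` are made up for illustration).

```python
# + concatenates strings, and * replicates them
greeting = "hello" + " " + "world"
print(greeting)      # hello world
print("ab" * 3)      # ababab

# ** unpacks the key-value pairs of a dictionary;
# when a key appears twice, the later value wins
defaults = {"color": "red", "size": 1}
overrides = {"size": 2}
merged = {**defaults, **overrides}
print(merged)        # {'color': 'red', 'size': 2}

# sets also have a dedicated union operator
print({1, 2} | {2, 3})   # {1, 2, 3}
```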

### Conditional Operators

If we want to compare two values, we can use conditional (also called "comparison") operators:

Operator | Example | Meaning | Result
--- | --- | --- | ---
`==` | `a == b` | Equal to | True if the value of a is equal to the value of b, False otherwise
`!=` | `a != b` | Not equal to | True if a is not equal to b, False otherwise
`<` | `a < b` | Less than | True if a is less than b, False otherwise
`<=` | `a <= b` | Less than or equal to | True if a is less than or equal to b, False otherwise
`>` | `a > b` | Greater than | True if a is greater than b, False otherwise
`>=` | `a >= b` | Greater than or equal to | True if a is greater than or equal to b, False otherwise

In [136]:
# values of different types can be tested for equality, but an integer never equals a string
4 == "4"

False

In [137]:
# strings are compared alphabetically, character by character
"a" > "z"

False
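Two details of these comparisons are worth trying out. The sketch below uses hypothetical example values. As far as string ordering goes, Python compares characters by their Unicode code points, which places all uppercase ASCII letters before the lowercase ones.

```python
# uppercase letters sort before lowercase letters
uppercase_first = "Z" < "a"
print(uppercase_first)        # True

# comparisons can be chained, as in mathematics
in_range = 1 < 2 < 3
print(in_range)               # True

# equality across types is allowed, but ordering an int against a str is not
try:
    4 < "4"
except TypeError as error:
    print("Cannot order these:", error)

# convert first if you want to compare the numeric values
same_number = 4 == int("4")
print(same_number)            # True
```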

### Logical Operators

Lastly, we have logical operators: `not`, `or`, and `and`. These work as you'd expect them to:

In [138]:
not 3 > 8

True

In [139]:
# with or, Python stops at the first true condition (with and, at the first false one)
(2 == 3) or (3 > 7)

False

Here, play around with the `3`, below, to learn about ["truthiness"](https://docs.python.org/3/library/stdtypes.html#truth-value-testing).

In [140]:
# Play around with the "3" here to learn what "truthy" means
3 and (5 <= 10)

True
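As a rough guide to truthiness: zero, empty collections, empty strings, and `None` count as false, and nearly everything else counts as true. The sketch below checks a few hypothetical values with the built-in `bool()` and shows that `and` and `or` return one of their operands rather than always returning `True` or `False`.

```python
# bool() reports whether a value is "truthy" or "falsy"
for value in [0, 3, "", "text", [], [0], None]:
    print(repr(value), "->", bool(value))

# `and` returns the first falsy operand, or the last operand if all are truthy
truthy_result = 3 and (5 <= 10)
falsy_result = 0 and (5 <= 10)
print(truthy_result)   # True
print(falsy_result)    # 0

# `or` returns the first truthy operand
fallback = 0 or "fallback"
print(fallback)        # fallback
```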

### Logic

There are three main logical "phrases" available in Python: `if`, `if-else`, and `if-elif-else`.

*Note: Python 3.10 introduced the handy [`match`-`case` statement](https://learnpython.com/blog/python-match-case-statement/). But, I recommend avoiding this for now, at least until the world catches on.*

**`if` statements**

In [141]:
x = 4  # adjust this value to see what happens

# if statements always have this form
if x < 5:
    # code beneath the colon must be indented
    result = "good stuff"
    
result

'good stuff'

In [142]:
# use parentheses for complex comparisons
if (x < 5) or (x > 20):
    result = "good stuff"
    
result

'good stuff'

**`if-else` statements**

In [143]:
x = 20  # adjust this value to see what happens

# with if-else, exactly one block will be run
if x < 5:
    result = "good stuff"
else:
    result = "different stuff"
    
result

'different stuff'

**`if-elif-else` statements**

`elif` is a sort of hybrid between `if` and `else`. Once a condition is met, Python stops cascading through the rest of the conditions.

In [144]:
x = 8  # adjust this value to see what happens

# with if-elif-else, exactly one block will be run
if x < 5:
    result = "good stuff"
elif x < 10:
    result = "medium stuff"
else:
    result = "different stuff"
    
result

'medium stuff'

## Packages, Modules, and Objects

Python is an [object-oriented programming language](https://realpython.com/python3-object-oriented-programming/), which basically means that everything in Python can be distilled into an "entity" with attributes and/or a "process" with steps to follow, encapsulated in a variable-esque thing called an *object*. Objects can be defined in a .ipynb file (like this one) or in a proper Python "module" (a Python .py file). Modules, in turn, are collected and organized in folders as a package (or "library"). In any code, you can bring in modules or packages using `import` statements.

### Packages and Libraries

When you Google "tenets of programming", you'll find several lists of principles which programmers should follow (and, I agree, they should). Two of the most common acronyms which come up are:

1. **KISS: "Keep It Short and Simple"**. First, keep your code as simple as possible. It should be readable, easy to explain, and everything should have a very clear purpose.
2. **DRY: "Don't Repeat Yourself"**. Secondly, avoid writing repetitive code. Use Python functionality in a clever way, such that any repetition is being done by the computer, not you.

Staying true to these tenets, note that many other developers have written Python code into "packages" or "libraries" that you can `import` into your own code, which will both simplify your code, and keep you from repeating yourself.

In [145]:
import re                          # import the whole package
import math as m                   # rename the package to use later
from collections import Counter    # import just one object from the package
from os import getcwd, listdir     # import multiple objects from a package

*Note: `re` stands for [regular expressions](https://www.computerhope.com/jargon/r/regex.htm), and [this](https://regex101.com/) is a good reference for testing them out.*
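To see what each style of import gives you, here is a minimal sketch that uses one name from each. The input strings are hypothetical, and the imports are repeated so the cell runs on its own.

```python
import re
import math as m
from collections import Counter

# `re` was imported whole, so its functions are reached with `re.`
digits_found = re.findall(r"\d+", "3 cats and 12 dogs")
print(digits_found)        # ['3', '12']

# `math` was renamed, so we use the shorter alias `m.`
root = m.sqrt(16)
print(root)                # 4.0

# `Counter` was imported by name, so no prefix is needed
letter_counts = Counter("banana")
print(letter_counts["a"])  # 3
```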

### Functions and Classes

A [function](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/12-Functions.html) is a Python object which allows us to bundle up code to perform a specific task. In it, we define a set of steps we want Python to take on some input in order to return an output. There are three kinds: built-in functions (those which don't require you to `import` anything), functions which come from other packages, and user-defined functions, which you define yourself.

**Built-In Functions**

One popular built-in function is the `print()` function, which "prints" values to the console (or the space below a Jupyter notebook cell). Another is the `type()` function, which allows you to check for an object's data type.

In [146]:
print("hello world")

hello world


In [147]:
# we can also use "format strings" to place variables in strings
short_sentence = f"My integer is {my_integer}"
print(short_sentence)

My integer is 5


In [148]:
# determine the datatype of something
type(None)

NoneType

In [149]:
type(4.0 ** 3)

float

In [150]:
# sort lists
sorted([2, 5, 4, 1, 10, 37, 12])

[1, 2, 4, 5, 10, 12, 37]

**Packaged Functions**

We can reference stand-alone functions, if they were imported (e.g., `getcwd()`, below), or we can use the period `.` to call functions or variables which are part of a module (or another object).

In [151]:
# prints the current working directory
getcwd()

'/Users/yuritziavila-robledo/Downloads/lab_1 (1)'

In [152]:
# lists the names of the files in the current directory (notice the '.' abbreviation)
listdir('.')

['labutil', 'lab_1.ipynb', 'data']

In [153]:
# `sin` is a function from the `math` module, and `pi` is a variable from the same
m.sin(m.pi / 2)

1.0

**User-defined Functions**

I encourage you to take a look at Walsh's [chapter on functions](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/12-Functions.html), but as an example, we can define functions like the one below.

For this example, we're going to use the text of Lewis Carroll's *Alice in Wonderland* (from Walsh's [book](https://github.com/melaniewalsh/Intro-Cultural-Analytics)). But to do so, we need to access a file in our directory. `with` is a special kind of statement (it sets up a "context manager") that is outside the scope of this class, but it suffices to say that it is a clean way to manage computational resources: here, it makes sure that a file opened with the built-in function `open()` is closed again once we are done with it.

In [154]:
def book_words(filepath, split_pat=r"\W+"):
    '''
    Given a `filepath` to a book, split the text into individual words.
    `split_pat` is the regular expression defining each split.
    '''
    # the r"..." prefix keeps the backslash in the pattern intact
    with open(filepath, "r") as f:
        book = f.read()
    
    # lowercase the text, then split wherever the pattern matches
    words = re.split(split_pat, book.lower())
    
    return words

A few things:
* We define special variables called *arguments* in the parentheses to use within the function. Those with an `=` are optional and have a default value.
* Inside the triple quotations `''' ... '''` is the *docstring*, telling us what the function does and how to use it. **Every function should have a docstring**, even if it is very short. In Jupyter, you can peek at functions' docstrings using **Shift + Tab**.
* The `return` statement dictates the output of the function.
* Variables defined *inside* the function are called **local variables**. These are only "available" within the function. **Global variables** are defined outside the function (or using the `global` keyword), and they are available anywhere in the Python file or wherever the module is imported.
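The points above can be seen together in a much smaller function. This is only a sketch: `greet`, `greeting`, and `full_message` are hypothetical names made up for illustration, and `try`/`except` is used only so the expected error does not stop the cell.

```python
def greet(name, greeting="Hello"):
    '''
    Return a greeting for `name`.
    `greeting` is optional and defaults to "Hello".
    '''
    full_message = f"{greeting}, {name}!"   # a local variable
    return full_message

# the optional argument can be left out or given by name
print(greet("Alice"))                       # Hello, Alice!
print(greet("Alice", greeting="Welcome"))   # Welcome, Alice!

# the docstring is stored on the function itself
print(greet.__doc__)

# the local variable does not exist outside the function
try:
    print(full_message)
except NameError:
    print("full_message only exists inside greet()")
```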

In [155]:
aiw_path = './data/Alice-in-Wonderland_Lewis-Carroll.txt'
aiw_words = book_words(aiw_path)

In [156]:
print(aiw_words[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'alice', 's', 'adventures', 'in', 'wonderland']


**(Using) Classes**

The [*Class*](https://realpython.com/python-classes/) is probably the most common form of Python object. It is an abstract entity with well-defined attributes and behaviors (e.g., a shoe is an entity with a size and color, and it can also be worn). **We will not be creating new Python classes in this course**, but we will be using them.

A class has *attributes* (variables) and *methods* (functions) which are specifically "attached" to the class. Both are accessed with the period `.`. Unlike normal variables/functions, class attributes and methods can interact with one another "behind the scenes", as they are all *part* of the same entity (i.e., the class itself).

> You can find out what kinds of attributes or methods are available for a class by typing `.` after an object's name, then pressing **Tab**.

In [157]:
# in fact, a string is a class, with several methods
"this is a string".split(' ')

['this', 'is', 'a', 'string']

In [158]:
# as are lists
append_list = [1, 2, 4, 5, 7, 8]
append_list.append(10)
append_list

[1, 2, 4, 5, 7, 8, 10]
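The two cells above call *methods*. To see an *attribute* next to a method, here is a minimal sketch using the `date` class from the standard-library `datetime` module. The date itself is an arbitrary example.

```python
from datetime import date

lab_day = date(2024, 1, 15)   # create an instance of the `date` class

# attributes are plain values, so there are no parentheses
print(lab_day.year)           # 2024

# methods are functions attached to the object, so they are called with ()
print(lab_day.isoformat())    # 2024-01-15

# this method builds a new date and leaves the original unchanged
next_year = lab_day.replace(year=2025)
print(next_year)              # 2025-01-15
print(lab_day)                # 2024-01-15
```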

In [159]:
# here, we define a new instance of the `Counter` class (see above)
counter = Counter(aiw_words)

# A Counter behaves like a dictionary: look up a word to get its count
counter['alice']

403

In [160]:
# A method of this class is listing the `n` most common words
counter.most_common(n=10)

[('the', 1832),
 ('and', 944),
 ('to', 810),
 ('a', 695),
 ('of', 635),
 ('it', 607),
 ('she', 549),
 ('i', 524),
 ('you', 469),
 ('said', 460)]

Prepositions, conjunctions, and pronouns like these carry very little interesting information about the book. Words like these are called *[stop words](https://en.wikipedia.org/wiki/Stop_word)*, and we can filter them out before counting, if we like. To do this, we will use a [**list comprehension**](https://melaniewalsh.github.io/Intro-Cultural-Analytics/02-Python/10-Lists-Loops-Part2.html#list-comprehensions). This allows us to iterate through a collection of items, and create a new list of items.

In [161]:
numbers = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# we can use the modulus operator to find even numbers
[n for n in numbers if n % 2 == 0]

[0, 2, 4, 6, 8]

Two special operators are the object "identity" operators, `is` and `is not`, which check whether two names refer to the very same object in memory (as opposed to `==`, which checks whether two values are equal). You will likely only use these to determine if a variable is `None`. Separately, the "membership" operator `in` returns `True` if what precedes it is an element of what follows it (`not in` is its opposite).
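A minimal sketch of these operators, using hypothetical example values. One thing to watch for: when the right-hand side of `in` is a string, Python looks for a *substring* rather than a whole element, so a collection of words should be a list or set, not one long string.

```python
first = [1, 2, 3]
second = [1, 2, 3]
alias = first             # a second name for the same list

print(first == second)    # True: equal values
print(first is second)    # False: two separate objects
print(first is alias)     # True: the very same object

missing = None
print(missing is None)    # True

# membership in a set checks whole elements
print("he" in {"the", "and"})    # False
print("the" in {"the", "and"})   # True

# membership in a string checks for substrings
print("he" in "the and")         # True
```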

In [162]:
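# always use clear, descriptive function names
def most_common_words(words, n_top=None, stop_words=None):
    '''
    Count the `n_top` most common items in `words` (all of them if None),
    skipping anything found in the optional collection `stop_words`.
    '''
    if stop_words is not None:
        words = [w for w in words if w not in stop_words]
        
    counter = Counter(words)
    common = counter.most_common(n=n_top)
    
    return common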

In [163]:
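with open("./data/stop_words_english.txt", "r") as f:
    # split the text into a set of individual words; testing `w not in` against
    # the raw string would check for substrings rather than whole words
    stop_words = set(f.read().split())

most_common_words(aiw_words, n_top=12, stop_words=stop_words)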

[('alice', 403),
 ('gutenberg', 98),
 ('project', 88),
 ('queen', 76),
 ('thought', 74),
 ('turtle', 59),
 ('mock', 57),
 ('began', 57),
 ('tm', 57),
 ('hatter', 56),
 ('gryphon', 55),
 ('rabbit', 53)]

### Modules (optional)

Functions exist as a sort of shorthand: reusable Python structures which help us avoid writing repetitive code blocks. Instead of writing the same block of code multiple times, we simply call a single function, which saves room and keeps things clean. In the same way that functions (and classes) can be imported from third-party (or built-in) packages, they can also be imported from modules we build ourselves! The steps are:

1. Create a folder in this directory, and give it a unique name like "labutil". This will be our [Python package](https://docs.python.org/3/tutorial/modules.html#packages).
2. In it, save a Python file called \_\_init\_\_.py (two underscores on either side of the name).
    - This file can be completely empty.
3. Save another Python file, named appropriately based on the functions you want in there, e.g., "basics.py".
4. Build and test out your functions in Jupyter, and when you feel comfortable with them, *move* them into your Python file. Use Jupyter notebook *as a notebook*, and keep any reusable code (e.g., functions) in the Python file.
5. Whenever you're working on a project, keep open your Jupyter notebook for building and testing code, and keep your IDE (e.g., VS Code) open for updating your functions.
    - [PEP-8](https://pep8.org/) is the Python style guide, and for the most part, IDEs like VS Code naturally guide users to following it. However, if you're ever in doubt about how something should look, refer to PEP-8.
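To make the folder layout from steps 1 to 3 concrete, the sketch below builds a tiny package from Python and imports from it. It is an illustration only: the package name `demo_labutil` and the temporary folder are assumptions chosen so that nothing in your real project is touched, and the contents of `basics.py` are a guess at what a minimal module might look like. In practice you would create these files by hand in your editor.

```python
import sys
import tempfile
from pathlib import Path

# step 1: create the package folder (here, inside a temporary directory)
project_dir = Path(tempfile.mkdtemp())
package_dir = project_dir / "demo_labutil"
package_dir.mkdir()

# step 2: an empty __init__.py marks the folder as a package
(package_dir / "__init__.py").write_text("")

# step 3: a module file holding a reusable function
(package_dir / "basics.py").write_text(
    "def my_func():\n"
    "    '''Print a short message.'''\n"
    "    print('all the stuff')\n"
)

# tell Python where to look, then import with <package>.<module> syntax
sys.path.insert(0, str(project_dir))
from demo_labutil.basics import my_func

my_func()   # all the stuff
```

When the package folder sits next to your notebook, the `sys.path` line is usually unnecessary, because the notebook's own directory is already searched for imports.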

We will use the [autoreload](https://ipython.org/ipython-doc/3/config/extensions/autoreload.html) functionality in Jupyter to make sure updates to our code are reflected in the notebook:

```python
# load the extension: edits to our .py file(s) will automatically update in Jupyter
%load_ext autoreload
# mark the modules (files) we want to "auto" import
%aimport labutil.basics
# mode 1: only autoreload the modules marked with %aimport, above
%autoreload 1
```

In [164]:
%load_ext autoreload
%aimport labutil.basics
%autoreload 1

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [165]:
# we use <package>.<module> syntax, and the "*" means "all"
from labutil.basics import *

In [166]:
# try changing the print statement in this function to see what happens
my_func()

all the stuff


# Exercises

## Exercise 1

Find the largest number in the Alice in Wonderland book. (Remember, we should have `aiw_words`.)

- The regex pattern `"\d+"` will match on any integer (without commas).
- [re.match](https://docs.python.org/3/library/re.html#re.match) will return a "truthy" value if there is a match.
- Recall the built-in function `sorted`. How can you reverse the ordering here?
- Also, you can convert a string of digits to an integer with `int(characters)`.

In [167]:
%pip install word2number


[notice] A new release of pip is available: 23.2.1 -> 24.0
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.


In [168]:
from word2number import w2n

In [173]:
# your code here

# find every run of digits in the text and convert each one to an integer
regex_pattern = r"\d+"
matches = re.finditer(regex_pattern, " ".join(aiw_words))
numbers = [int(match.group()) for match in matches]

largest_number = max(numbers)
print("The largest number in Alice in Wonderland is:", largest_number)



The largest number in Alice in Wonderland is: 6221541


## Exercise 2

Write a function that calculates the average of a list of numbers called `numbers`.
* `sum` is a built-in function, as is `len` (for length).

In [179]:
# your code here
def average(numbers):
    '''Return the average (arithmetic mean) of a list of numbers.'''
    total = sum(numbers)
    count = len(numbers)
    return total / count

In [180]:
random_list_nums = [1, 4, 6, 1]
result_avg = average(random_list_nums)
print(result_avg)

3.0
