# CSS 201.5 - CSS Bootcamp

## Python Programming

### Umberto Mignozzetti (UCSD)

## Indexing into a dictionary

Once you've created a dictionary, you'll want to **access** the items in it.

- An advantage of a `dict` (over a `list`) is that key/value pairings are inherently **structured**.  
- So rather than indexing by *position*, you can index by *key*.

The syntax for indexing is: `dict_name[key_name]`. 

In [None]:
person = {'Name': 'Smarty Student',
          'Occupation': 'UCSD Grad',
          'Location': 'San Diego'}
print(person['Name'])

In [None]:
print(person['Location'])
print(person['Occupation'])

### Check-in

How would you retrieve the value `25` from the dictionary below?

In [None]:
test_dict = {'apple': 25,
            'banana': 37}
### Your code here

### Indexing requires a key

To index into a `dict`, you **need to use the key**.

- The *position* of a value will not work.  
- The *value* itself will also not work.

In [None]:
test_dict[0] ### will throw an error

In [None]:
test_dict[25] ### will throw an error
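
If you are not sure whether a key exists, you can check with `in` first, or use the `.get()` method, which returns `None` (or a fallback you choose) instead of raising a `KeyError`. The sketch below is only an illustration; the `inventory` dictionary and its keys are made up.

```python
# Hypothetical dictionary, used only for illustration
inventory = {'pens': 12, 'notebooks': 5}

# Check whether a key is present before indexing
has_erasers = 'erasers' in inventory

# .get() returns None when the key is missing...
missing = inventory.get('erasers')

# ...or a fallback value, if you supply one as the second argument
fallback = inventory.get('erasers', 0)

print(has_erasers)
print(missing)
print(fallback)
```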

## Updating a `dict`

Once you've created a `dict`, it's not set in stone: there are multiple ways to **modify** that dictionary.

- Adding new entries.  
- Deleting existing entries.  
- Combining two dictionaries.

### Adding new entries

In [None]:
## First, let's create a new dictionary
registrar = {'Mignozzetti': 'POLI', 
             'Trott': 'COGS'}
print(registrar)

We can add a new entry using the `dict_name[key_name] = new_value` syntax.

In [None]:
## Now we add a new entry to the dictionary
registrar['Styler'] = 'LING'
print(registrar)

### Check-in

Add an entry for the price of `"pasta"` to `prices_dict` below using this new syntax. 

In [None]:
prices_dict = {'rice': 4, 'bananas': 3}
### Your code here

### Check-in

What would the `len` of `prices_dict` be after you've added that entry?

In [None]:
### How long is prices_dict after you've added "pasta"?
len(prices_dict)

### Deleting entries

We can also use the `del` statement to delete specific key/value pairs from a dictionary.

In [None]:
## First, we create a new dictionary.
attendance = {'A1': True, 'A2': False}
print(attendance)

In [None]:
## Then, we delete the entry with the "A2" key.
del attendance['A2']
print(attendance)
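
If you also want to keep the value you are removing, the `pop` method of a `dict` deletes the key and hands the value back. A minimal sketch, using a made-up `scores` dictionary:

```python
# Hypothetical dictionary, used only for illustration
scores = {'quiz1': 90, 'quiz2': 75}

# pop removes the key and returns the value that was stored under it
removed = scores.pop('quiz2')

# With a second argument, pop returns that value instead of
# raising a KeyError when the key is absent
absent = scores.pop('quiz3', None)

print(removed)
print(absent)
print(scores)
```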

### Merging dictionaries using `update`

What if we have **two different dictionaries** that we want to combine or *merge*? 

The `update` method of a `dict` can be used to do this.

In [None]:
## First, we create a new dictionary.
registrar = {'Mignozzetti': 'POLI', 
             'Trott': 'COGS'}
print(registrar)

In [None]:
## Now, we define another dictionary with more info.
registrar_other = {'Styler': 'LING',
                   'Mignozzetti': ['POLI', 'CSS'],
                   'Rangel': 'COGS'}
## Finally, we "update" original registrar
registrar.update(registrar_other)

In [None]:
print(registrar)

### Check-in

Recall that a dictionary cannot contain **duplicate keys**. What do you think would happen to `original_dict` if we ran the code below?

In [None]:
original_dict = {'a': 1, 'b': 3}
new_dict = {'a': 2}
original_dict.update(new_dict)
### What happens to original_dict['a']?
original_dict

#### Updating with duplicate keys

If we `update` a dictionary with another dictionary that contains **overlapping keys**, the **new values** replace the old values.

In [None]:
original_dict = {'a': 1, 'b': 3}
new_dict = {'a': 2}
original_dict.update(new_dict)
print(original_dict['a'])
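
Note that `update` changes the original dictionary in place. If you would rather keep the original untouched, one option is to build a *new* dictionary from both. The sketch below uses made-up dictionaries (`defaults`, `overrides`) and the `{**a, **b}` unpacking syntax, which this lecture has not covered:

```python
# Hypothetical dictionaries, used only for illustration
defaults = {'color': 'blue', 'size': 'M'}
overrides = {'size': 'L'}

# Build a new dict; when keys overlap, the later dictionary wins
merged = {**defaults, **overrides}

print(merged)
print(defaults)  # the original is unchanged
```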

## Iterating through a `dict`

Dictionaries are **structured** collections of **key/value pairings**.

As such, there are several ways to iterate (i.e., **loop**) through a `dict`:

- Iterating through the **keys** (`.keys()`).  
- Iterating through the **values** (`.values()`). 
- Iterating through the **key/value** pairs, as `tuples` (`.items()`).

### Looping through keys with `.keys()`

Each dictionary can be thought of as a collection of **keys**; each key in turn maps onto some **value**.

We can retrieve those keys using `dict_name.keys()`.

In [None]:
courses = {'CSS 201': 'Introduction to Computational Social Science',
           'CSS 202': 'Computational Social Science Technical Bootcamp',
           'CSS 296': 'Research in Computational Social Science'}
courses.keys()

This `dict_keys` object can be looped through like a `list`, but it cannot be indexed by position. To index into the keys, convert them first with `list(courses.keys())`.

In [None]:
for abr in courses.keys():
    print(abr)
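
To get a key by its position, convert the keys to a `list` first. The sketch below (with a made-up `capitals` dictionary) also shows one difference between the two: the object returned by `.keys()` keeps tracking the dictionary, while the `list` is a snapshot.

```python
# Hypothetical dictionary, used only for illustration
capitals = {'France': 'Paris', 'Japan': 'Tokyo'}

keys_view = capitals.keys()

# Convert to a list to access keys by position
keys_list = list(keys_view)
first_key = keys_list[0]

# Add a new entry: the view reflects it, the list copy does not
capitals['Peru'] = 'Lima'

print(first_key)
print(len(keys_view))
print(len(keys_list))
```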

#### Check-in

How could we retrieve each **value** of the `dict` using `keys()`?

In [None]:
### Your code here

#### Retrieving values

Because each key maps onto a **value**, we can simply use it to index into `courses`.

In [None]:
for course in courses.keys():
    ## Index into courses
    name = courses[course]
    print(name)

### Looping through values with `.values()`

We can also retrieve the **values** directly using `dict_name.values()`.

In [None]:
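## .values() gives us the values directly
print(courses.values())
## Note: looping over the dict itself gives the keys, not the values
for abr in courses:
    print(abr)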

In [None]:
for course_name in courses.values():
    print(course_name)

### Looping through key/value pairings with `.items()`

Dictionaries are, at their core, a list of **key/value pairings**. 

- We can access each of these using `dict_name.items()`.  
- `items()` returns a `dict_items` object that yields one `tuple` per entry:
  - The first element of each `tuple` is the **key**.
  - The second element of each `tuple` is the **value**.

In [None]:
print(list(courses.keys()))
for key, value in courses.items():
    print(value + ' is abbreviated as ' + key)

#### Assignment "unpacking"

- We can access each element of the `tuple` using indexing, e.g., `item[0]` or `item[1]`.  
- However, sometimes it's more convenient to **unpack** these elements directly in the `for` loop itself.

In [None]:
for code, name in courses.items():
    print(code)
    print(name)
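
To make the comparison concrete, here is a small sketch of both styles side by side. The `pets` dictionary is made up for illustration; both loops build the same strings.

```python
# Hypothetical dictionary, used only for illustration
pets = {'Rex': 'dog', 'Tom': 'cat'}

# Style 1: keep each item as a tuple and index into it
by_index = []
for item in pets.items():
    by_index.append(item[0] + ' is a ' + item[1])

# Style 2: unpack the tuple directly in the for loop
by_unpacking = []
for name, species in pets.items():
    by_unpacking.append(name + ' is a ' + species)

print(by_index)
print(by_index == by_unpacking)
```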

#### Converting back to a `dict`

We can use the `dict` function to convert a list of **items** back to a `dict`.

In [None]:
items = courses.items()
print(items)

In [None]:
course_dict = dict(items)
print(course_dict)
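
The same conversion works for any list of two-element tuples, not just the output of `.items()`. A minimal sketch with made-up data:

```python
# Hypothetical list of (key, value) tuples, used only for illustration
pairs = [('x', 1), ('y', 2)]

# dict() turns each tuple into one key/value entry
coords = dict(pairs)

# Going the other way: items() wrapped in list() gives the tuples back
round_trip = list(coords.items())

print(coords)
print(round_trip)
```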

### Check-in: Looping through values

Use the `.items()` function to loop through `fruits_dict` below. `print` out each item in a formatted string using `format`: 

`{fruit_name}: {price}`. 

In [None]:
fruits_dict = {'apple': 2, 'banana': 3}
### Your code here

### Check-in: Debug

Suppose someone writes a piece of code (see below) to loop through `fruits_dict`. Ultimately, they want to print out the price of each fruit. 

However, they keep running into an error. Can you figure out what they're doing wrong? And further, could you suggest a way to fix it?

In [None]:
### Why is this throwing an error?
for fruit in fruits_dict.values():
    print(fruits_dict[fruit])

## Nested dictionaries

> A **nested dictionary** is a dictionary contained inside another dictionary, i.e., as a **value**.  

In principle, there is no limit on how many nested dictionaries can be contained in a `dict` (besides memory capacity on one's computer).

- A nested dictionary is useful when you want to store **complex information** in each entry.  
- So far, we've dealt mostly with very simple key/value entries.  
- But what if we wanted to represent more complicated information?

For example, for each student in CSS (or COGS, etc.), we might want to store:

- `username`.
- `Name`.
- `Courses` (a `list`).
- `College`.
- `Major`.

### Check-in (conceptual)

What would be a useful `dict` structure to represent information about students? For example, say we wanted to represent:

- `username` (e.g., `sstudent`)
- `Name` (e.g., `Smarty Student`)
- `Courses` (e.g., `['CSS 1', ...]`)
- `College` (e.g., `ERC`)
- `Major` (e.g., `Psychology`)

### A possible implementation

One approach is to use **nested dictionaries**.

- At the top level, each student is represented by their `username`.  
- Each `username` then maps onto a nested dictionary, which contains their `Name`, `Courses`, and any other info we need.

In [None]:
student = {
    'sstudent': {'Name': 'Smarty Student',
                 'Courses': ['COGS 14A', 'CSS 1', 'CSS 2'],
                 'College': 'ERC',
                 'Major': 'Psychology'},
    'jdoe': {'Name': 'John Doe',
             'Courses': ['COGS 18', 'CSS 1'],
             'College': 'Revelle',
             'Major': 'Undeclared'},
    'jlopez': {'Name': 'Jane Lopez',
               'Courses': ['LING 6', 'LING 101'],
               'College': 'Revelle',
               'Major': 'Linguistics'},
}

### Indexing our nested `dict`

We can index into this `dict` as we would normally. Note that now, the **value** is itself a `dict`.

In [None]:
student['jlopez']

#### Check-in

How might we index the `College` of a particular student? I.e., what if we wanted to find out the `College` of `jdoe`?

In [None]:
### Your code here

#### Nested indices

Indexing into a **nested dictionary** follows the same logic: we can *chain together* index statements to retrieve a particular value.

In [None]:
student['jdoe']['College']

In [None]:
student['jlopez']['Courses'][1]
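
Chained indexing also combines naturally with `.items()` when you want to visit every nested entry. The sketch below uses a small made-up `books` dictionary rather than the one above:

```python
# Hypothetical nested dictionary, used only for illustration
books = {
    'b1': {'Title': 'Dune', 'Tags': ['sci-fi', 'classic']},
    'b2': {'Title': 'Emma', 'Tags': ['romance']},
}

summary = []
for book_id, info in books.items():
    # info is itself a dict, so we index it by key
    n_tags = len(info['Tags'])
    summary.append(info['Title'] + ' has ' + str(n_tags) + ' tag(s)')

print(summary)
```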

### Check-in

How would you retrieve the list of `username`s (i.e., keys) in this `dict`?

In [None]:
### Your code here

# Functions

## What is a function?

> A **function** is a re-usable piece of code that performs some operation (typically on some *input*), and then typically returns a result (i.e., an *output*). 

Breaking this down:

- **Input**: a variable defined by the user that is *passed into* a function using the `(input)` syntax.
   - Also called an **argument**.
   - Functions can have multiple **arguments**.
- **Output**: the variable **returned** by a function after this operation is performed.  
   - If a `return` value is not specified, a function will return `None`.

### A very simple function

We'll explore the syntax more in a bit, but this will give you a sense for what we're talking about.

In [None]:
def square(x):
    """Returns the square of X."""
    return x**2

In [None]:
square(1)

In [None]:
square(2)

## Why functions?

In principle, we could just rewrite the same code each time we want to execute that operation. So why bother defining functions at all?

The answer lies in **modular programming**.

- As operations become more and more complex, it becomes unwieldy (and just inefficient) to copy/paste the *same code* again and again.  
- In modular programming, we emphasize building **re-usable chunks of code**.
- Functions (and loops) are ways to re-use chunks of code that solve basic, recurring problems.

Learning to think in a modular way can be hard! But it's a helpful approach to **breaking down a problem into its sub-components**.

### Functions we've encountered

We've already encountered a number of functions in this course.

#### `print`

- Input: something to `print`.  
- Output: technically, `None`.  
- "Side effects": `print`s out input to designated log (by default, the terminal/Jupyter cell).

In [None]:
print("Hello!")
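
The difference between a side effect and a return value is easy to check: store the result of `print` in a variable and look at it. The same holds for any function without a `return` statement. In the sketch below, `greet` is a made-up function for illustration.

```python
# print displays its input (the side effect) but returns None
result = print("Hello!")
print(result)

# A hypothetical function with no return statement also returns None
def greet(name):
    print("Hi, " + name)

greeting = greet("Ana")
print(greeting)
```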

#### `sorted`

- Input: a `list` 
- Output: a sorted `list`.

In [None]:
unsorted = [2, 1, 5]
sorted(unsorted)

## Defining a function

In Python, a new function can be created or **defined** using the `def` keyword, followed by the name of the function.

See the `square` function definition below:

- Function name: `square`. 
- Function arguments: `x`.  
- Function `return`: `x ** 2`.  

In [None]:
def square(x):
    """Returns the square of X."""
    return x**2

### Executing a function

To **execute** a function, we can reference the function name (like a variable), followed by the parentheses `()` and any arguments/input for the function.

In [None]:
## Function name = square
## Input = 2
square(2)

In [None]:
## Function name = square
## Input = 4
square(4)

### What type is a function?

A function belongs to a special `type` in Python, called `function`.

In [None]:
type(square)

### A more complex function

What if we wanted a function that did the following:

- `if` the input `x` is **even**, we square it.  
- `if` the input `x` is **odd**, we just `return` that number.

In [None]:
def square_if_even(x):
    """Squares x if x is even; otherwise return x."""
    if x % 2 == 0: ## check if even
        return x ** 2 ## if so, return square
    else: ## otherwise..
        return x ## just return x

In [None]:
## 2 is even, so square it
square_if_even(2)

In [None]:
## 3 is odd, so just return it
square_if_even(3)

### Another more complex function

So far, our functions have only had a **single argument**. But functions can take in *many* arguments. 

Let's define a function with *two inputs*, which just adds those inputs together.

In [None]:
def add_two_numbers(num1, num2):
    """Adds num1 to num2."""
    return num1 + num2

In [None]:
add_two_numbers(1, 2)

In [None]:
add_two_numbers(5, 3)

### Check-in

What would the function below produce if the input `x` was `25`?

More generally: how would you describe what this function *does*? 

In [None]:
def mystery_func(x):
    if x % 5 == 0:
        return True
    return False

### Check-in

Write a function that takes a `name` as input and `return`s the formatted `str`: `"My name is {name}."`

The code below can get you started:

```
def hello(name):
### your code here
```

In [None]:
# Your code here

## Function arguments: the details

Beyond the basics, there are several other important things to know about the **arguments** for a function:

- It's important to be aware of what `type` your function expects as an argument.
- Arguments can have **default values**.  
- Some arguments can be accessed with a **keyword**, while others are **positional** arguments.

### Argument `type`

Some languages, like Java, require that you specify the `type` of an argument (and variable names, etc.).

Python doesn't require that, but it's still important to be aware of.

- Otherwise, you can run into a `TypeError`.
- If you're interested: Python uses something called [duck typing](https://en.wikipedia.org/wiki/Duck_typing). 

#### Example of a `TypeError`

Here, the `square` function performs an operation with `x` that requires `x` to be a number (e.g., an `int` or a `float`).

In [None]:
def square(x):
    return x ** 2
square("two")

#### How to avoid a `TypeError`?

In practice, the best way to avoid a `TypeError` is to **document your code**. 

- In the `docstring` under a function, you can write details about what the function expects, e.g., whether the input is an `int`, a `str`, etc.

In [None]:
def square(x):
    """
    Parameters
    ----------
    x: int or float
      number to be squared
    
    Returns
    -------
    int or float
      square of x
    """
    return x ** 2

square(0.5)
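
A docstring is not just a comment: Python stores it on the function, so you (or anyone using your code) can read it back later. A minimal sketch, using a made-up function called `half`:

```python
def half(x):
    """Returns half of x, where x is an int or float."""
    return x / 2

# The docstring is stored on the function itself
doc = half.__doc__
print(doc)

# help() displays the function signature along with the docstring
help(half)
```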

#### Check-in

Will the function below result in an error if you called it on the input `"test"`? Why or why not?

In [None]:
def mystery_func(x):
    return x ** 3

### Default values

> A **default value** is the value taken on by an argument *by default*. If no other value is specified, this is the value assumed by the function.

In the function definition, a default value can be specified by setting: `arg_name = default_value`.

- In the example below, `name` is required.
- But `major` has a default value of `"COGS"`.

In [None]:
def my_info(name, major = "COGS"):
    return "My name is {name}, and my major is {major}.".format(name = name, major = major)

Even if we don't specify a value for `major`, the function will run just fine; it just uses the default value.

In [None]:
my_info("Mary")

#### Overriding a default value

A default value can be overridden in the call to the function itself. 

- Note that this can reference the argument name (`major`), or just occupy the correct **position** in the series of arguments. (More on this later.)

In [None]:
my_info("Umberto", major = "LIGN")

In [None]:
my_info(major = "LIGN", name = "Sean")

#### Arguments without a default must be referenced!

If an argument *doesn't* have a default, the function will throw an error if you don't pass in enough arguments.

In [None]:
my_info()

#### Check-in

Why does the following code not throw an error?

In [None]:
my_info("POLI")

### Positional vs. keyword arguments

An argument to a function can be indicated using either:

- Its **position**, i.e., in the list of possible arguments.
- A **keyword**, i.e., the *name* of that argument.

A **positional argument** uses the relative position of the arguments to determine which is which. 

In [None]:
def exponentiate(num, exp):
    return num ** exp

In [None]:
## Raise 2 ^ 3
exponentiate(2, 3)

In [None]:
## Raise 3 ^ 2
exponentiate(3, 2)

A **keyword argument** uses the *name* of the argument to determine which is which. 

- Even if the positions are swapped, the *keyword* will take priority. 
- (Note that the best practice is to keep the order consistent, however.)

In [None]:
## Raise 2 ^ 3
exponentiate(num = 2, exp = 3)

In [None]:
## Raise 2 ^ 3
exponentiate(exp = 3, num = 2)

#### Position before keyword

- Once you've used a keyword argument, you can't rely on **position** for any arguments coming after that keyword. This will throw a `SyntaxError`.
- However, a **positional argument** can come before a **keyword argument**.


In [None]:
## This is incorrect
exponentiate(num = 2, 3)

In [None]:
## This is fine
exponentiate(2, exp = 3)
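
Keywords are especially handy when a function has several default values and you only want to override a later one. A sketch under assumed names (`describe`, `college`, and `year` are made up for illustration):

```python
def describe(name, college="ERC", year=1):
    """Builds a short description; college and year have defaults."""
    return name + " (" + college + ", year " + str(year) + ")"

# Only the required argument: both defaults are used
a = describe("Ana")

# A second positional argument fills the next slot, college
b = describe("Ana", "Revelle")

# A keyword lets us skip college and override only year
c = describe("Ana", year=3)

print(a)
print(b)
print(c)
```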

## Practice problems

### Problem 1

Write a function called `fizzbuzz`. It should take in a single argument, `x`, and follow this behavior:

- If `x` is divisible by both `3` and `5`, return the `str` `"fizzbuzz"`. 
- If `x` is divisible by only `3` (and not `5`), return `"fizz"`.
- If `x` is divisible by only `5` (and not `3`), return `"buzz"`.

Note: this is part of a famous problem in **coding interviews**!

In [None]:
def fizzbuzz(x):
    pass

### Problem 2

Write a function called **product**, which takes a `list` (`lst`) as input, and returns the **product** of every item in the list.

In [None]:
L = [2, 3, 4, 5]

def product(lst):
    pass

## Returning multiple values

Functions can `return` multiple values, or even another function. 

This can be useful when:

- The goal of a `function` can't be distilled into a single value.  
- You want to `return` multiple bits of information about something, e.g., its `len`, its value, and so on.  

Multiple values can be separated with a `,`.

### Multiple `return` values: an example

Suppose we wanted a function that takes two numbers as input, and returns both:

- Their sum.  
- Their product.

In [None]:
def sum_product(a, b):
    sumvar = a + b
    prod = a * b
    L = [sumvar, prod]
    return L, sumvar, prod

In [None]:
l, s, p = sum_product(10, 200)
print(l)
print(s)
print(p)

### Check-in

What do you notice about the `type` of the object that gets returned when a function returns *multiple values*?

In [None]:
sum_product(5, 2)

### `return` and `tuple`s

By default, a `function` will package these multiple values into a `tuple`.

- It's possible to return them in another form, e.g., in a structured dictionary. 
- But if you use the `return a, b` syntax, `a` and `b` will be returned as a `tuple`: `(a, b)`.
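
Both options can be sketched side by side. The function names below (`low_high`, `low_high_dict`) are made up for illustration:

```python
def low_high(a, b):
    """Returns the smaller value first, then the larger."""
    if a < b:
        return a, b
    return b, a

def low_high_dict(a, b):
    """Same idea, but returns the values labeled in a dict."""
    low, high = low_high(a, b)
    return {'low': low, 'high': high}

# The comma-separated values arrive as a single tuple...
pair = low_high(7, 3)
print(type(pair))

# ...which can be unpacked into separate variables
low, high = pair
print(low, high)

# The dict version trades brevity for labeled values
labeled = low_high_dict(7, 3)
print(labeled)
```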

## Namespaces

### What is a namespace?

> A [**namespace**](https://realpython.com/python-namespaces-scope/) is the "space" where a given set of variable names have been *declared*.

Python has several types of namespaces:

1. **Built-in**: Built-in objects within Python (e.g., **Exceptions**, **lists**, and more). These can be accessed from anywhere.  
2. **Global**: Any objects defined in the main program. These can be accessed anywhere in the main program once you've defined them, but not in another Jupyter notebook, etc.
3. **Local**: If you define new variables within a *function*, those variables can only be accessed within the "scope" of that function.

### The global namespace

So far, we've mostly been working with variables defined in the **global namespace**.

- I.e., once we define a variable in a notebook (and run that cell), we can reference it in another cell.

In [None]:
## define global variable
my_var = 2

In [None]:
## reference global variable
print(my_var)

### Functions have their own namespace

If you declare a variable **within** a function definition, that variable does *not* persist outside the scope of that function.

In the function below, we declare a new variable called `answer`, which is eventually `return`ed.

- However, the **variable itself** does not exist outside the function.

In [None]:
def exponentiate(num, exp):
    ### "answer" is a new variable 
    answer = num ** exp
    return answer

In [None]:
exponentiate(3, 2)
### This will throw an error
print(answer)
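
The usual fix is to store the *returned value* in a variable of your own. The sketch below uses made-up names (`area`, `local_area`) and a `try`/`except` block, which this lecture has not covered, so that the error can be shown without stopping the cell:

```python
def area(width, height):
    ## "local_area" exists only while the function is running
    local_area = width * height
    return local_area

## Store the returned value in a global variable
value = area(3, 2)
print(value)

## The local name is not defined out here
try:
    print(local_area)
except NameError as err:
    error_message = str(err)
    print("NameError:", error_message)
```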

### Global variables *can* be referenced inside a function

If you've defined a variable in the global namespace, you *can* reference it inside a function.

- **Word of caution ⚠️**: this can make for confusing code. 

In [None]:
## define global variable
my_var = 2
## define function
def add_two(x):
    ## references my_var
    return x + my_var

add_two(2)
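
One reason this gets confusing: the same function call can give different results depending on what the global variable happens to be at the time. Passing the value in as an argument avoids that. A sketch with made-up names (`offset`, `add_offset`, `add_explicit`):

```python
offset = 2

def add_offset(x):
    ## depends on the global variable "offset"
    return x + offset

def add_explicit(x, amount):
    ## depends only on its own arguments
    return x + amount

first = add_offset(2)

## Changing the global changes what the same call returns
offset = 10
second = add_offset(2)

print(first)
print(second)
print(add_explicit(2, 2))
```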

### Check-in

What would the value of `new_var` be after running the code below?

What about `test_var`?

In [None]:
test_var = 2
def test_func(x):
    test_var = x ** 2
    return test_var

new_var = test_func(5)

### Using `whos`

Remember that you can check which variables are defined using `whos`.

**Warning**: `whos` works in IPython and Jupyter notebooks. If you run a plain Python script on your computer, it will probably not work.

In [None]:
whos

## `lambda` functions

So far, we've focused on creating functions using the `def func_name(...)` syntax.

However, Python also has something called [**lambda functions**](https://www.w3schools.com/python/python_lambda.asp). 

- Syntax: `lambda x: ...`. 
- Main advantage: can be written in a single line, best if you want a **simple function**.  
   - Excellent for passing as *arguments* into other functions, such as `sorted`.

In [None]:
square = lambda x: x ** 2
print(square(2))
print(square(4))

`lambda` functions can also take multiple arguments.

In [None]:
exp = lambda x, y: x ** y
print(exp(2, 3))

### Check-in

Convert the function below into a `lambda` function.

In [None]:
def add_one(x):
    ## Adds 1 to x
    return x + 1

### Your code here

### `lambda`: summary

- `lambda` is an easy, efficient way to define a simple function.  
- In practice, `lambda` is most useful when defining functions "on the fly".
   - As **arguments** to pass into another function.
   - As **nested functions** within another function. 
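
As an example of the first use, `sorted` accepts a `key` argument: a function that is applied to each item to decide the sort order. A minimal sketch with made-up data:

```python
# Hypothetical data, used only for illustration
words = ['banana', 'fig', 'cherry']
pairs = [('b', 3), ('a', 1), ('c', 2)]

# Sort words by their length instead of alphabetically
by_length = sorted(words, key=lambda w: len(w))

# Sort tuples by their second element
by_second = sorted(pairs, key=lambda p: p[1])

print(by_length)
print(by_second)
```

Items with equal keys (here, `'banana'` and `'cherry'`, both of length 6) keep their original relative order.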

## Varying number of arguments

So far, we've assumed that we *know* how many arguments will be passed into a function at any given time. But this isn't always the case.

Fortunately, Python gives us two ways to handle an **arbitrary number** of arguments:

- `*args`: allows a `function` to receive an arbitrary number of (positional) arguments, which can be "unpacked" as needed. The function treats them as a `tuple`. 
- `**kwargs`: allows a `function` to receive a `dictionary` of (keyword) arguments, which can be "unpacked" as needed. 

### `*args` in practice

The `*args` syntax allows you to input an arbitrary number of arguments into a function.

In [None]:
def my_function(*fruits):
    print("The last fruit is " + fruits[-1] + ".")

In [None]:
my_function("strawberry")

In [None]:
my_function("strawberry", "apple")
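
Because the arguments are collected into a single object, you can also loop through them. A sketch with a made-up function, `total`, that adds up however many numbers it receives:

```python
def total(*numbers):
    """Adds up however many numbers are passed in."""
    result = 0
    for n in numbers:
        result = result + n
    return result

print(total())
print(total(1, 2, 3))
print(total(10, 20))
```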

#### Check-in

How exactly is this working? That is, what is `my_function` treating `*fruits` as? 

Try `print`ing out `fruits` to see what's going on.

In [None]:
### Your code here

### `**kwargs` in practice

The `**kwargs` syntax is similar to `*args`, but allows for an arbitrary number of **keyword arguments**.

- These are treated as a `dict` by the function.

In [None]:
def my_bad_function(*fruits):
    print('I have ' + str(fruits[1]) + ' ' + str(fruits[0]))

def my_function(**fruits):
    print('I have ' + str(fruits['amount']) + ' ' + fruits['name'])
    if fruits['ripe']:
        print('And they are ripe!')

In [None]:
### Keyword and value are automatically placed into dictionary
my_function(amount = 5, name = "apple", ripe = False)
my_bad_function(5, "apple")

In [None]:
### Extra keywords (e.g., cost) are accepted, but every key the function looks up must still be supplied
my_function(amount = 3, name = "banana", ripe = True, cost = 10)

#### Why use this?

In general, `**kwargs` is useful when you want **flexibility**.

For example, suppose you have a website, in which people can (optionally) fill out the following information:

- `Name`. 
- `Email`. 
- `Phone number`.
- `Location`.

But because not everyone fills out *every field*, the function you use to store this information needs to be flexible about how many arguments it receives.

In [None]:
def store_user(**info):
    ## For now, this is just a placeholder to demonstrate
    for item in info.items():
        print(item)

In [None]:
store_user(Name = "John", Location = "San Diego", Email = 'john@ucsd.edu')
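
A function can also mix regular arguments with `**kwargs`: the named arguments are filled first, and everything else lands in the dictionary. The sketch below is only an illustration; `make_profile` and its fields are made up.

```python
def make_profile(username, **info):
    """Builds a profile dict from a required username plus optional fields."""
    profile = {'username': username}
    ## info is a dict of whatever keyword arguments were passed
    profile.update(info)
    return profile

p1 = make_profile('jdoe')
p2 = make_profile('sstudent', Name='Smarty Student', Location='San Diego')

print(p1)
print(p2)
```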

## Practice problems

One of the best ways to learn a new concept is to actually practice it. Thus, I'm including a number of practice problems at the end of this lecture, which we'll work through.

### Problem 1: find the maximum number of a `list`

Goal: write a function that takes in a `list` of numbers as input, and finds the **maximum** of the `list`.  

The catch: you can't use the built-in function `max`. 

Things to consider:

- If the input `list` is empty, you should return `None`.  
- Since you can't use `max`, you might consider using a `for` loop, checking the value of each number in turn.

In [None]:
### Your code here

### Problem 2: find the maximum number in a set of `*args`

Goal: write a function that takes in an arbitrary number of arguments (i.e., uses `*args`), and finds the maximum.

The catch: you can't use the built-in function `max`. 

Things to consider:

- If there are no arguments, you should return `None`.  
- Since you can't use `max`, you might consider using a `for` loop, checking the value of each number in turn.

In [None]:
### Your code here

### Problem 3: find the even numbers

Goal: write a function that takes in a `list` of numbers, and prints the even ones.

In [None]:
### Your code here

### Problem 4: find the tallest in a dictionary.

Suppose we want a `function` that takes in a `dict` of `Names` and `Heights`. That is, each *key* is a `Name`, and it maps onto a `Height`.

We want the function to return the `Name` of the person with the largest `Height`, *as well as* the `Height` itself.

In [None]:
## Can't just max...that'll return "Sean"
heights = {'Sean': 67, 'Ben': 72, 'Anne': 66}
### Your code here

# Working with Text Files

## Why read and write files?

Fundamentally, a **file** is just a way to store **data**.

This data could take many forms:

- Unstructured text.  
- [JSON](https://www.json.org/json-en.html), i.e., a kind of `dict`.  
- `.csv`, i.e., like an Excel file.  
- An executable file, like a Python script (`.py`). 

**Computational Social Science** centers around working with data. Thus, it's important to understand how to read and write these files.

### Some common use cases

In CSS research, reading and writing files is pretty much *unavoidable*. It happens almost anytime you want to work with data.

Examples:

- Reading in a [text corpus](https://en.wikipedia.org/wiki/Text_corpus) of Tweets on a particular topic to perform **sentiment analysis**. 
- Reading in a corpus of [song lyrics](https://pudding.cool/2017/02/vocabulary/) to perform analyses about vocabulary, rhythm, and more.
- Reading in [tabular data](https://www.statology.org/tabular-data/#:~:text=In%20statistics%2C%20tabular%20data%20refers,represent%20attributes%20for%20those%20observations.) about Economics to correlate `Economic Connectedness` with `Social Mobility`.  

## So what is a file?

> A **file** is a set of *bytes* used to store some kind of data.

The **format** of this data depends on what you're using it for, but at some level, it is translated into *binary bits* (`1`s and `0`s). 

The file format is usually specified in the **file extension**.  

- `.csv`: comma separated values.  
- `.txt`: a plain text file.  
- `.py`: an executable Python file.  
- `.png`: a portable network graphic file (i.e., an image).

### Where are files?

Files are **stored** somewhere on your computer (or on a server, etc.), typically in a folder (also called a **directory**). Thus, each file has its own **location**.

- We call this **location** of a file its **path**.  
- File paths can be either **absolute** or **relative**.

### Absolute file paths

An **absolute** file path specifies the location of a file relative to some **root** directory.

- On Mac/Linux, the root is `/`, so an absolute path on my computer might start with: `/Users/myusername/...`
- If a file is called `my_file.txt`, the absolute file path would include *every directory* leading up to that file, starting from the root.
- On Mac/Linux, each directory/folder is separated by the `/` notation.
- On Windows, they are separated by the `\` notation.

Example: `/Users/myusername/CSS/css201/my_file.txt`

### Relative file paths

A **relative** file path specifies the location of a file relative to the **current** directory (i.e., the one you're in right now). 

- For example, say our current directory is `css201`. 
- If a file is called `my_file.txt`, the relative file path would tell the computer how to get to `my_file.txt` from `css201`.
- On Mac/Linux, each directory/folder is separated by the `/` notation.
- On Windows, they are separated by the `\` notation.

Example: `my_file.txt` (from inside `css201`), or `css201/my_file.txt` (from the folder that contains `css201`).

#### The `..` syntax

If your target file (e.g., `my_file.txt`) is not stored within your current directory (or one of its subfolders), you'll need to use the `..` syntax.

- This tells your computer to "go up a level".

For example, if we're currently in `css201/lectures/week2`, but we want to get to `css201/my_file.txt`, we'll need to use this notation:

`../../my_file.txt`.
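
You can check how the `..` steps resolve with Python's standard library. This sketch uses `posixpath` (the Mac/Linux flavor of `os.path`) so that the result looks the same on any operating system; the folder names are the ones from the example above, and none of them need to exist on your computer.

```python
import posixpath

# Start in css201/lectures/week2 and go up two levels
start = "css201/lectures/week2"
target = posixpath.join(start, "..", "..", "my_file.txt")

# normpath collapses each ".." together with the folder before it
resolved = posixpath.normpath(target)

print(target)
print(resolved)
```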


### Check-in

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

How would we write the full **absolute path**, including the file name?


In [1]:
### Your response here

#### Solution

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

Absolute path: `/Users/myusername/css/lectures/notes.txt`

### Check-in

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

However, we're currently in the `labs` directory, which is also in the `css` folder.

How would we write the **relative path** leading from our *current directory* to `lectures/notes.txt`?


In [1]:
### Your response here

#### Solution

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

Relative path from `css/labs`: `../lectures/notes.txt`

### File paths: wrap-up

**File paths** can be one of the hardest things to get right.

- Even as a more experienced programmer, I mess file paths up *all the time* (including for this class!). 

A helpful command is `pwd`, which reminds us *where we are*: i.e., what our current directory is.

In [4]:
pwd

'/Users/seantrott/Dropbox/UCSD/Teaching/CSS/css1/css1_book/lectures'
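
`pwd` is an IPython/Jupyter command. In a plain Python script, the `os` module from the standard library gives the same information. A minimal sketch; `notes.txt` is a made-up file name and does not need to exist:

```python
import os

# The current working directory, as a string
current = os.getcwd()
print(current)

# Turn a relative path into an absolute one, based on the current directory
absolute = os.path.abspath("notes.txt")
print(absolute)
```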

## The *how*: interacting with files

Once you've located a file, you probably want to either **read** or **write** it in some way. Both **modes** of interacting with a file will require the built-in `open` function.

In turn, you can `open` a file in one of several **modes**:

- `w`: writing to that file (i.e., creating it, or replacing what's already in it).  
- `r`: reading that file (i.e., reading what's already in it).
- `a`: appending to what's already in the file. 

Let's take these step by step.

### Writing a file

The syntax to `open` a file in the **writing mode** is as follows:

`open("filename.txt", "w")`

Often, we'll use the `with` keyword as in the codeblock below, which allows us to `open` that filename and assign it immediately to a variable.

- Then, we can call `var_name.write("TEXT TO ADD TO FILE")`.
- The advantage of `with` is that it will automatically `close` the file once we're done with the `with` block.

The `with` statement relies on what we call a [**context manager**](https://book.pythontips.com/en/latest/context_managers.html). More on that in CSS 2 and CSS 100.

In [14]:
### Open up a file called `test.txt`
with open("test.txt", "w") as f:
    ### Write string to file
    f.write("This is a file.")

#### Things to be aware of

- `filename.txt` doesn't have to exist when you open a file for **writing**. It will be *created* by calling `open("filename.txt", "w")`.  
- If `filename.txt` *does* already exist, then by default you'll over-write what's there. If you want to just *add* to the file, use the `a` (**append**) mode instead.
- To separate lines in this file, use the `\n` character (*newline*). 
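
To see these points in action, here is a small sketch. It writes to a throwaway temporary directory (an assumption made so the demo never touches your real files), and `demo.txt` is a hypothetical file name.

```python
import os
import tempfile

# Work in a throwaway directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # The file doesn't exist yet: opening in "w" mode creates it
    with open(path, "w") as f:
        f.write("first version\n")

    # Opening in "w" mode again replaces the old contents
    with open(path, "w") as f:
        f.write("line one\nline two\n")  # "\n" separates the lines

    with open(path, "r") as f:
        demo_contents = f.read()

print(demo_contents)
```

After the second `open(..., "w")`, the text `"first version"` is gone; only the two new lines remain.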

### Reading a file

The syntax to `open` a file in the **reading mode** is as follows:

`open("filename.txt", "r")`

Once we've opened the file, we can `read` the contents. The `read` function will return the contents as a `str`.

In [18]:
### Open up a file called `test.txt`
with open("test.txt", "r") as f:
    ### Read the contents
    contents = f.read()

In [19]:
### print out contents
print(contents)

This is a file.


### Check-in

Use the `open` command to create and write a new file called `my_first_file.txt`. Once you've opened it, **write** a series of lines to that file:

- The first line should read: `My name is {NAME}\n`.
- The next 5 lines should read: `This is line {i} of the file.\n`, where `i` refers to the specific line number.

**Hint**: Remember to use the *newline* character to separate each line.

In [49]:
### Your code here

### Check-in

Now use the `open` command to open `my_first_file.txt`. Once you've opened it, **read** the contents of that file into a new variable called `file_contents`.

In [70]:
### Your code here

### File reading, continued

Before, we read in the *entire* file as one big `str`. There are several other ways to interact with and **read** a file, however.

- `.read(n)`, where `n` refers to the number of characters you want to read.  
- `.readlines()`, which returns a `list` of each *line* in the file.

#### `.read(n)`

The `read` function can be **parameterized** by the `n` argument, which tells Python how many characters of the file to read. 

In [73]:
with open("my_first_file.txt", "r") as f:
    n_characters = f.read(10)
print(n_characters)

My name is


In [74]:
with open("my_first_file.txt", "r") as f:
    n_characters = f.read(15)
print(n_characters)

My name is Sean
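
One detail worth knowing: within a single `open`, each call to `.read(n)` continues from where the previous call stopped. The sketch below uses `io.StringIO` as a stand-in for an opened text file (an assumption, to keep the example self-contained); the text inside it is made up to resemble the file above.

```python
import io

# io.StringIO stands in for an opened text file in this sketch;
# it supports the same .read(n) calls as a real file object
f = io.StringIO("My name is Sean.\nThis is line 2 of the file.\n")

first_chunk = f.read(10)   # the first 10 characters
second_chunk = f.read(6)   # the NEXT 6 characters, not the first 6 again
rest = f.read()            # no argument: everything that is left

print(first_chunk)
print(second_chunk)
```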


#### `.readlines()`

The `readlines` function returns a `list`, where each element in the list corresponds to a line in the file.

- *Lines* are defined as being separated by a `\n` character.

In [75]:
with open("my_first_file.txt", "r") as f:
    all_lines = f.readlines()

In [76]:
all_lines

['My name is Sean.\n',
 'This is line 2 of the file.\n',
 'This is line 3 of the file.\n',
 'This is line 4 of the file.\n',
 'This is line 5 of the file.\n']

### Check-in

- Use the `readlines` function to read in all lines from `my_first_file.txt`. 
- Then, use a `for` loop to iterate through each line.  
  - For each line, `replace` the `\n` character with an empty string (i.e., `""`). 
  - Then, `print` out the line.

In [77]:
### Your code here

### Appending a file

If you `open` a pre-existing file in the `w` mode, you will *overwrite* all of its existing content.

If you wish to simply *add* to that file, you can instead open it in the `a` mode: `open("filename.txt", "a")`

In [80]:
## Open in append mode
with open("my_first_file.txt", "a") as f:
    ## Syntax to write is the same.
    f.write("This is new text I'm adding.")

In [82]:
## Now let's check if it worked...
with open("my_first_file.txt", "r") as f:
    file_contents = f.read()
print(file_contents)

My name is Sean.
This is line 2 of the file.
This is line 3 of the file.
This is line 4 of the file.
This is line 5 of the file.
This is new text I'm adding.


### Closing a file

Technically, it is good practice to always `close` a file once you've opened it. 

- If you're using the context manager (the `with` keyword), it will automatically `close` the file once you finish the `with` block.  
- But if you're not, you can `close` a file using `var_name.close()`.
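
As a rough sketch of both options, the code below closes one file by hand and lets `with` close another, then checks the `closed` attribute of each file object. The temporary directory and the file name `closing_demo.txt` are assumptions made for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "closing_demo.txt")  # hypothetical file name

    # Without `with`: we are responsible for closing the file ourselves
    f = open(path, "w")
    try:
        f.write("Remember to close me.\n")
    finally:
        f.close()  # runs even if the write above raised an error
    closed_after_manual_close = f.closed

    # With `with`: the file is open inside the block, closed after it
    with open(path, "r") as g:
        open_inside_with = not g.closed
    closed_after_with = g.closed

print(closed_after_manual_close, open_inside_with, closed_after_with)
```

The `try`/`finally` pattern is one common way to guarantee the `close` call happens; the `with` block does the same job with less code.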

## Finding a target `str`

One common use case is **searching** a large volume of text to find a particular sub-string.

- Where in the text does this sub-string occur?  
- What is the text surrounding one of its occurrences?

Note that this is not too far afield from a **search engine** like [Google](https://www.google.com/)!

### Our sample text

To start, we'll use a `.txt` file of [**Hamlet**](https://en.wikipedia.org/wiki/Hamlet), by [William Shakespeare](https://en.wikipedia.org/wiki/William_Shakespeare). The `.txt` file was retrieved from the [Project Gutenberg Corpus](https://www.gutenberg.org/browse/scores/top) online, and should be credited as such. 

The file is included in the `lectures` GitHub repository under the `data` directory.

First, let's use `readlines()` to extract each **line** of the play as a separate item in a list.

In [1]:
with open("data/hamlet.txt") as f:
    book = f.readlines()

#### Inspecting the text

In [2]:
## This is just the title
book[0]

'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\n'

In [3]:
## Partial list of characters in play
for line in book[5:12]:
    l = line.replace("\n", "")
    print(l)


Dramatis Personae

  Claudius, King of Denmark.
  Marcellus, Officer.
  Hamlet, son to the former, and nephew to the present king.
  Polonius, Lord Chamberlain.


#### Check-in

How could we check how many **lines** are in the `.txt` file?

In [4]:
### Your code here

### Finding a sample `str`

One of the most famous lines in *Hamlet* reads:

> To be, or not to be- that is the question...

Suppose we wanted to **find** the `str` `"that is the question"` in the book, and **return** the line number (at least in this `.txt` file).

How could we go about that?

### Solution: `enumerate`

- Use `enumerate` to iterate through each line of the play.  
- For each line, check if some `target_str` occurs in that line.  
- If it does, use `break` to **stop** iterating, and record which line it is.

In [5]:
target_str = "that is the question"
for index, line in enumerate(book):
    if target_str in line:
        break
print("Line: {x}".format(x = line.replace("\n", "")))
print("Line number: {x}".format(x = index))

Line:   Ham. To be, or not to be- that is the question:
Line number: 2048
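
One caveat about the loop above: if `target_str` never appears, the loop simply runs to the end, and `index` and `line` are left pointing at the *last* line of the file. A possible way to make that case explicit is to wrap the search in a function that returns `None` when nothing is found. This is only a sketch; `find_first_line` and the toy lines are made up for illustration.

```python
def find_first_line(lines, target_str):
    """Return (index, line) for the first line containing target_str, or None."""
    for index, line in enumerate(lines):
        if target_str in line:
            # Returning here stops the loop, just like `break`
            return index, line.replace("\n", "")
    # We only get here if no line contained target_str
    return None

# A tiny stand-in for `book` (not the real file)
toy_lines = ["first line\n",
             "To be, or not to be- that is the question:\n",
             "last line\n"]

found = find_first_line(toy_lines, "that is the question")
missing = find_first_line(toy_lines, "something absent")
print(found)
print(missing)
```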


### Check-in: Finding the next $N$ lines

What if we wanted to return the next $N$ (e.g., `5`) lines *after* this target string? 

- To do this, we just need to add another variable: `keep_lines`, which tells us *how many* additional lines we want to return.  
- Then, once we've retrieved the `index` of our `target_str`, we can **slice** between that `index` and `index + keep_lines`.

Try implementing this algorithm yourself first. 

**Hint**: The code can be *mostly* the same as before (i.e., use `enumerate`, etc.). 

In [6]:
keep_lines = 5 ### New variable to track
### Your code here

### Check-in: What if `target_str` occurs multiple times?

What if we were looking for a more common `target_str`, e.g., one that occurred multiple times?  

1. What problems do you see with our previous approach (e.g., using `break` once we find `target_str`)?
2. How might you solve this problem? 

In [7]:
target_str = "the question"
### Your answer here

### Check-in: Other considerations

These exercises really only scratch the surface of **searching** a file. Here are some other issues for consideration and discussion. 

How might you address:

1. Issues of **case**: e.g., what if *question* is spelled `"Question"`, not `"question"`?
2. Situations where a `target_str` spans multiple *lines*? 
3. Mismatch in punctuation, e.g., a misplaced `,`? 
4. A **partial match**, e.g., if $90\%$ of the characters match?

**Note**: These are challenging issues! And each of them likely has multiple solutions.

## Counting Words

Another very common **use case** is simply **counting** words.

- How many words are there overall?  
- How many *unique words* are used?  
- How many times does *each word* occur?  
- What is the *most frequent word*?

### Caveat: what *is* a word?

The question of what defines a word is surprisingly complex.

- First, languages have very different [**morphological systems**](https://en.wikipedia.org/wiki/Morphological_typology). So even *conceptually*, it's not always clear what makes a word "a word" in a given language.  
- Second, languages have very different [**writing systems**](https://en.wikipedia.org/wiki/Orthography). 
  - Some languages (like English, Spanish, etc.) have *spaces* between words in their written form.  
  - Other languages (like Classical Latin, Chinese, etc.) do [not typically use *spaces* between words](https://en.wikipedia.org/wiki/Scriptio_continua) in their written form.

Many **conceptual definitions** and **tools** for identifying *words* are rooted in English specifically, but those definitions and tools don't always generalize––languages can be very different.

### How *many* words?

The first question that might occur to us is *how many words* are in a book. 

To do this, we could:

- `read` the book in as one long `str`.  
- Use the `split` function to separate this long `str` by **spaces**, into a `list` of words.
- Count the number of items in this list.

#### Using `split`: a review

In [8]:
sentence = "To be or not to be, that is the question"
sentence.split(" ")

['To', 'be', 'or', 'not', 'to', 'be,', 'that', 'is', 'the', 'question']
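
Note that `split(" ")` and `split()` (with no argument) are not quite the same. The small sketch below, using a made-up string, shows the difference: with no argument, `split` treats any run of whitespace (spaces, newlines, tabs) as a single separator.

```python
messy = "To be  or not\nto be"   # two spaces and a newline inside

# Splitting on exactly one space leaves an empty string and a newline behind
by_space = messy.split(" ")
print(by_space)

# Splitting with no argument handles any run of whitespace
by_whitespace = messy.split()
print(by_whitespace)
```

This is why calling `split()` with no argument is often the more convenient choice for text read from a file.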

#### Using `split` for Hamlet

In [9]:
# First, read in as string
with open("data/hamlet.txt", "r") as f:
    book_str = f.read()

In [10]:
# We should also clean up all those *newline* characters.
book_str = book_str.replace("\n", " ")
# To make it easier for later, we can also turn it into lowercase
book_str = book_str.lower()
# Now, use split to separate into words
book_words = book_str.split()
book_words[0:5]

['the', 'tragedy', 'of', 'hamlet,', 'prince']

In [11]:
# How many items in list?
len(book_words)

32724

### How many *unique* words?

Above, we calculated how many word *tokens* were in the book. 

- This means that the word "the" will be counted *every time* it occurs.  
- Instead, let's calculate the number of *unique* word types.

#### Using `set`

The `set` function will turn a `list` into a `set` object, which contains only the *unique elements* in that list.

In [12]:
my_list = ["the", "dog", "is", "the", "best"]
set(my_list)

{'best', 'dog', 'is', 'the'}

#### Check-in

Use the `set` function to calculate how many *unique* words are in this book.

In [13]:
### Your code here

### How many times does each word occur?

We might also want to know *how many times* each word occurs. 

- For example, perhaps "the" occurs $>1000$ times, whereas "question" occurs only ~$10$ times.  
- Ideally, we would store this in a `dict`:
   - Each **key** represents a *word*.  
   - Each **value** represents *how many times* that word occurred in *Hamlet*.

How might we go about this?

#### First pass: counting each word

As a first pass, let's use the following approach:

- First, create a `dict` to store our words.  
- Then, *iterate* through our `list` of words.  
- `if` a given word is not in our `dict`, add an entry for it (and set the value to `1`).  
- `if` a given word *is* in our `dict`, increase its value by `1`.

In [14]:
word_counts = {}
for w in book_words:
    if w not in word_counts:
        word_counts[w] = 1
    else:
        word_counts[w] += 1

In [15]:
# How many times does "the" occur?
word_counts['the']

1095

In [16]:
# How many times does "king" occur?
word_counts['king']

43
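
For comparison, Python's standard library also offers `collections.Counter`, which does this kind of tallying for you. This is an alternative to the loop above, not a replacement for understanding it; the toy word list is made up.

```python
from collections import Counter

toy_words = ["the", "king", "is", "the", "king", "the"]

# Counter builds a dict-like object mapping each item to its count
toy_counts = Counter(toy_words)

print(toy_counts["the"])
print(toy_counts["king"])
# Unlike a plain dict, a missing key gives 0 instead of raising an error
print(toy_counts["queen"])
```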

#### Check-in

Any issues with this **first pass** approach? 

**Hint**: One issue could have to do with punctuation...

Write code that counts a word correctly whether or not it is followed by a `.` in the sentence.

In [17]:
### Your code here

### Which word is most common?

Now that we have a `dict` representing how many times each word occurs, we can calculate **which word** is most common.

**Check-in**: Which word do you think is most frequent in *Hamlet*?

#### Finding the most frequent word

As always, there are multiple ways to do this.

But one simple approach is to:

- Use a `for` loop to iterate through all `items()` in the `dict`.  
- Track the `key_with_highest_value` we've seen so far.  
- Once the `for` loop is done, inspect `key_with_highest_value`.

In [20]:
key_with_highest_value = None
max_count = 0
for word, count in word_counts.items():
    # If this word frequency > max_count
    if count > max_count:
        # Set new "highest word" to this word
        key_with_highest_value = word
        max_count = count


In [21]:
## Now, inspect which word was most frequent
key_with_highest_value

'the'

#### Other approaches

There are *many different approaches* you could take to solving this problem. Some are more generalizable (but also more complicated) than what I've shown here.

- You can `sort` the dictionary by **value** (see the lecture on **dictionary operations**).  
- You could use the `max` function with `dict.get` as your `key` parameter (see below).

In [22]:
# Another approach
max(word_counts, key = word_counts.get)

'the'
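
The sorting approach mentioned above might look something like the sketch below. It uses a small made-up dictionary rather than the real `word_counts`, and the name `ranked` is just for illustration.

```python
toy_counts = {"the": 3, "king": 2, "is": 1, "hamlet": 2}

# Sort the (word, count) pairs by count, largest first
ranked = sorted(toy_counts.items(), key=lambda pair: pair[1], reverse=True)

# The first few entries are the most frequent words
top_two = ranked[:2]
print(top_two)
```

An advantage of sorting is that it gives you the top *several* words, not just the single most frequent one.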

## JSON

## What is a `.json` file?

> A `.json` file is a **file written in the JSON file format**. It allows us to store structured data objects consisting of **key-value** pairs.

### What is JSON?

**JSON** = JavaScript Object Notation.

- Standard format for *representing* and *transmitting* data.  
   - "Standard" = different people/systems agree to use this format to send and receive information.  
- Represents data in **key-value** pairs.

#### Check-in

What else have we seen that represents data in **key-value pairs**?

### A Python `dict` is a collection of key-value pairs

A **dictionary** (`dict`) stores **key-value** pairs.

In [13]:
my_class = {'Code': '1',
           'Department': 'CSS',
           'Instructor': 'Mignozzetti',
            'Prerequisite': True,
           'Enrollment': 120}
print(my_class)

{'Code': '1', 'Department': 'CSS', 'Instructor': 'Mignozzetti', 'Prerequisite': True, 'Enrollment': 120}


In [24]:
my_class['Department']

'CSS'

### JSON and `dict`: an analogy

Conceptually, JSON accomplishes the same goals as a Python `dict`.

- In fact, Python programmers often *convert* a `dict` into a JSON `str` when they want to store it in a file.  
- Similarly, you can **read in** a `.json` file and convert the contents into a `dict`.

**Bottom line**: we're not dealing with a fundamentally new data structure; it's another standardized way to represent **key-value pairs**.
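
The analogy is close but not perfect. As a small sketch (using a made-up `record`), here is what happens when a `dict` is converted to a JSON `str` and back with the standard `json` library: Python's `True` and `None` are written as `true` and `null`, and keys that aren't strings come back as strings.

```python
import json

record = {"Prerequisite": True, "Notes": None, 1: "integer key"}

# Convert the dict to JSON text
as_json = json.dumps(record)
print(as_json)

# Convert the JSON text back to a dict
round_trip = json.loads(as_json)
print(round_trip)
```

In practice this mostly matters if you use non-string keys, which JSON does not allow.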

## Reading in a `.json` file

Reading in a `.json` file shares some similarities with [reading `.txt` files](17-reading-text).  

- Must specify a **file path**.  
- File path can be either *absolute* or *relative*.

But there are also some important differences:

- To **read in** a `.json` file, we'll need to `import` the `json` library.  
- `json.load` will read in a **structured `.json` file** as a `dict`, not a `str`.

### Example: simple file

Here, we will work with a simple `.json` file: `data/restaurant.json`. 

- The file contains a structured representation of a restaurant.  
- We use `json.load(...)` to **load** this representation as a `dict`.

In [2]:
## This imports the json library
import json

In [3]:
## As with normal .txt files, we use "open" to open the target file
with open("data/restaurant.json", "r") as fp:
    ## use json.load to load as dict
    info = json.load(fp)

In [4]:
info

{'Name': 'Plumeria', 'Location': 'University Heights', 'Cuisine': 'Thai'}

### `load` creates a `dict`

Now, we can work with the **contents** of this file as we would any `dict`.

In [6]:
info['Name']

'Plumeria'

In [7]:
info['Location']

'University Heights'

In [8]:
info['Cuisine']

'Thai'

### Check-in

Try reading in another file that's stored in `data`: `data/school.json`. 

What is the value of the **Name** key?

In [11]:
### Your code here
with open('data/school.json', 'r') as f:
    school = json.load(f)
print(school)
print(school['Name'])

{'Name': 'UCSD', 'Location': 'San Diego', 'Affiliation': 'University of California'}
UCSD


## Writing a `.json` file

Often, you'll want to **write** a structured `dict` to a file.  

- Useful for *storing* information, so you can access it later.  
- Useful for *transmitting* information between programs.  

We can use `json.dump(...)` to **write** (or "dump") a `dict` into a `.json` file.

### Simple example: course 

To start out, let's use the `my_class` dict we defined earlier.

In [14]:
my_class['Code']

'1'

To **write** this to a file, we:

- `open` (create) a file with the name we want to call it.  
- Use `json.dump(dict_name, file_object)`, where `file_object` is the opened file (e.g., `fp` below), not the file's name.

In [15]:
with open("course.json", "w") as fp:
    json.dump(my_class, fp)

#### Checking that this worked

In [16]:
with open("course.json", "r") as fp:
    course_info = json.load(fp)
print(course_info)

{'Code': '1', 'Department': 'CSS', 'Instructor': 'Mignozzetti', 'Prerequisite': True, 'Enrollment': 120}
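
By default, `json.dump` writes everything on one line. If you'd like the file to be easier for a human to read, `json.dump` accepts an optional `indent` argument. A minimal sketch, writing to a temporary directory (an assumption, so the demo doesn't clutter your folders) with the hypothetical file name `course_pretty.json`:

```python
import json
import os
import tempfile

course = {"Code": "1", "Department": "CSS", "Instructor": "Mignozzetti",
          "Prerequisite": True, "Enrollment": 120}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "course_pretty.json")  # hypothetical name

    # indent=2 puts each key on its own line, indented by two spaces
    with open(path, "w") as fp:
        json.dump(course, fp, indent=2)

    # Read the raw text back to see what was actually written
    with open(path, "r") as fp:
        pretty_text = fp.read()

print(pretty_text)
```

The indented file contains the same data; `json.load` reads it back exactly as it would the one-line version.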


### Check-in

Create a new `dict` called `my_info`. Add the following keys/values:

- `Name`. 
- `Major`. 

Then, use `json.dump` to **write** this `dict` to a `.json` file called `my_info.json` to your own computer (in whichever directory you prefer).

In [18]:
### Your code here

## JSON files vs. JSON strings

The `load` and `dump` functions can be used to **read** and **write** a `dict` from/to a `.json` file.  

However, Python can also represent JSON as a **`str`**.

- To *read* a `dict` from a JSON `str`, use `loads` (load + *s*tring).  
- To *write* a `dict` into a JSON `str`, use `dumps` (dump + *s*tring).

### `json.dumps`

- Input: a `dict`. 
- Output: a JSON `str`.  

In [38]:
json_str = json.dumps(my_class)
json_str

'{"Code": "1", "Department": "CSS", "Instructor": "Mignozzetti", "Prerequisite": true, "Enrollment": 120}'

In [39]:
type(json_str)

str

### `json.loads`

- Input: a JSON `str`.  
- Output: a `dict`.
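
A minimal sketch of `json.loads`, using a JSON `str` written out by hand to match the course example (the variable names here are just for illustration):

```python
import json

# A JSON string: note the double quotes and the lowercase `true`
json_text = ('{"Code": "1", "Department": "CSS", "Instructor": "Mignozzetti", '
             '"Prerequisite": true, "Enrollment": 120}')

# Convert the JSON string into a Python dict
parsed = json.loads(json_text)

print(type(parsed))
print(parsed["Department"])
# Values come back as ordinary Python objects, so numbers behave like numbers
print(parsed["Enrollment"] + 1)
```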

### Other objects besides `dict`s

- Technically, you can use `dumps`/`loads` for other objects, such as `str`, `list`, and more.
- Though in my experience, a `dict` is the most common format.

In [40]:
json.dumps([1, 2, 3])

'[1, 2, 3]'

In [41]:
json.loads('[1, 2, 3]')

[1, 2, 3]

## Conclusion

There's lots more to working with files (including text files), but this sets the **foundation**. Now you should feel a little more comfortable:

- Understanding how to navigate your computer's **directory structure**.  
  - E.g., knowing "where" a file is located.
- Knowing how to `open` a file in Python.
- Knowing how to **read** or **write** that file.

This will form the basis of working with future file types, such as `.csv` (a very common format for representing tabular data).