# CSS 201.5 - CSS Bootcamp

# Python Programming

## Working with Text Files

## Why read and write files?

Fundamentally, a **file** is just a way to store **data**.

This data could take many forms:

- Unstructured text.  
- [JSON](https://www.json.org/json-en.html), i.e., a kind of `dict`.  
- `.csv`, i.e., like an Excel file.  
- An executable file, like a Python script (`.py`). 

**Computational Social Science** centers around working with data. Thus, it's important to understand how to read and write these files.

### Some common use cases

In CSS research, reading and writing files is pretty much *unavoidable*. It happens almost anytime you want to work with data.

Examples:

- Reading in a [text corpus](https://en.wikipedia.org/wiki/Text_corpus) of Tweets on a particular topic to perform **sentiment analysis**. 
- Reading in a corpus of [song lyrics](https://pudding.cool/2017/02/vocabulary/) to perform analyses about vocabulary, rhythm, and more.
- Reading in [tabular data](https://www.statology.org/tabular-data/#:~:text=In%20statistics%2C%20tabular%20data%20refers,represent%20attributes%20for%20those%20observations.) about Economics to correlate `Economic Connectedness` with `Social Mobility`.  

## So what is a file?

> A **file** is a set of *bytes* used to store some kind of data.

The **format** of this data depends on what you're using it for, but at some level, it is translated into *binary bits* (`1`s and `0`s). 
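To make the idea of "bits" concrete, here is a minimal sketch that turns a short piece of text into bytes and shows each byte as `1`s and `0`s. The sample text `"Hi"`, the UTF-8 encoding, and the variable names are assumptions chosen for illustration.

```python
# Encode a short string into bytes (UTF-8 is assumed here)
text = "Hi"
raw_bytes = text.encode("utf-8")

# Show each byte as 8 binary digits
bits = [format(b, "08b") for b in raw_bytes]
print(raw_bytes)  # b'Hi'
print(bits)       # ['01001000', '01101001']
```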

The file format is usually specified in the **file extension**.  

- `.csv`: comma separated values.  
- `.txt`: a plain text file.  
- `.py`: an executable Python file.  
- `.png`: a portable network graphic file (i.e., an image).
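If you ever need to check an extension in code, one option is `os.path.splitext` from the standard library. The sketch below uses made-up file names; nothing here needs to exist on disk.

```python
import os

# Hypothetical file names, used only for illustration
file_names = ["my_file.txt", "survey.csv", "script.py", "plot.png"]

extensions = []
for name in file_names:
    # splitext splits "my_file.txt" into ("my_file", ".txt")
    stem, ext = os.path.splitext(name)
    extensions.append(ext)

print(extensions)  # ['.txt', '.csv', '.py', '.png']
```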

### Where are files?

Files are **stored** somewhere on your computer (or on a server, etc.), typically in a folder (also called a **directory**). Thus, each file has its own **location**.

- We call this **location** of a file its **path**.  
- File paths can be either **absolute** or **relative**.

### Absolute file paths

An **absolute** file path specifies the location of a file relative to some **root** directory.

- On my computer, the root might be: `/Users/myusername/...`
- If a file is called `my_file.txt`, the absolute file path would include *every directory* leading up to that file, starting from the root.
- On Mac/Linux, each directory/folder is separated by the `/` character.
- On Windows, they are separated by the `\` character.

Example: `/Users/myusername/CSS/css201/my_file.txt`

### Relative file paths

A **relative** file path specifies the location of a file relative to the **current** directory (i.e., the one you're in right now). 

- For example, say our current directory is `css201`. 
- If a file is called `my_file.txt`, the relative file path would tell the computer how to get to `my_file.txt` from `css201`.
- On Mac/Linux, each directory/folder is separated by the `/` character.
- On Windows, they are separated by the `\` character.

Example: `my_file.txt` (from inside `css201`), or `css201/my_file.txt` (from the directory that contains `css201`).

#### The `..` syntax

If your target file (e.g., `my_file.txt`) is not stored within your current directory (or one of its sub-directories), you'll need to use the `..` syntax.

- This tells your computer to "go up a level".

For example, if we're currently in `css201/lectures/week2`, but we want to get to `css201/my_file.txt`, we'll need to use this notation:

`../../my_file.txt`.
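To see how `..` gets resolved, here is a small sketch using `posixpath` (the Mac/Linux flavor of `os.path`, chosen so the result looks the same on any operating system). It only manipulates the path as text; no files are touched.

```python
import posixpath

# Current directory and relative path from the example above
current_dir = "css201/lectures/week2"
relative_path = "../../my_file.txt"

# Join them, then collapse each ".." into "go up one level"
resolved = posixpath.normpath(posixpath.join(current_dir, relative_path))
print(resolved)  # css201/my_file.txt
```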


### Check-in

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

How would we write the full **absolute path**, including the file name?


In [1]:
### Your response here

#### Solution

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

Absolute path: `/Users/myusername/css/lectures/notes.txt`

### Check-in

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

However, we're currently in the `labs` directory, which is also in the `css` folder.

How would we write the **relative path** leading from our *current directory* to `lectures/notes.txt`?


In [1]:
### Your response here

#### Solution

Suppose we want to access a file called `notes.txt`. This is the absolute path leading to that file:

`/Users/myusername/css/lectures`

Relative path from `css/labs`: `../lectures/notes.txt`
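If you want to double-check an answer like this, `posixpath.relpath` can compute a relative path between two locations. This is only a sketch for checking your reasoning; the paths are the ones from the check-in and do not need to exist.

```python
import posixpath

target = "/Users/myusername/css/lectures/notes.txt"
current_dir = "/Users/myusername/css/labs"

# Path to `target`, as seen from `current_dir`
relative = posixpath.relpath(target, start=current_dir)
print(relative)  # ../lectures/notes.txt
```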

### File paths: wrap-up

**File paths** can be one of the hardest things to get right.

- Even as a more experienced programmer, I mess file paths up *all the time* (including for this class!). 

A helpful command is `pwd`, which reminds us *where we are*: i.e., what our current directory is.

In [4]:
pwd

'/Users/seantrott/Dropbox/UCSD/Teaching/CSS/css1/css1_book/lectures'
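`pwd` works in a notebook or terminal. Inside an ordinary Python script, a rough equivalent is `os.getcwd()`. The sketch below also uses `os.path.abspath` to turn a relative path into an absolute one; `notes.txt` is a hypothetical file name and does not need to exist.

```python
import os

# Same information as `pwd`, but available in any Python script
current_dir = os.getcwd()
print(current_dir)

# Turn a relative path into an absolute one, starting from current_dir
absolute_path = os.path.abspath("notes.txt")
print(absolute_path)
```

The printed values depend on where you run the code, so no expected output is shown.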

## The *how*: interacting with files

Once you've located a file, you probably want to either **read** or **write** it in some way. Both ways of interacting with a file require the built-in `open` function.

In turn, you can `open` a file in one of several **modes**:

- `w`: writing to that file (creating it if needed, and replacing anything already in it).  
- `r`: reading that file (i.e., reading what's already in it).
- `a`: appending to what's already in the file. 

Let's take these step by step.

### Writing a file

The syntax to `open` a file in the **writing mode** is as follows:

`open("filename.txt", "w")`

Often, we'll use the `with` keyword as in the codeblock below, which allows us to `open` that filename and assign it immediately to a variable.

- Then, we can call `var_name.write("TEXT TO ADD TO FILE")`
- The advantage of `with` is that it will automatically `close` the file once we're done with the `with` block.

The `with` statement relies on what we call a [**context manager**](https://book.pythontips.com/en/latest/context_managers.html). More on that in CSS 2 and CSS 100.

In [14]:
### Open up a file called `test.txt`
with open("test.txt", "w") as f:
    ### Write string to file
    f.write("This is a file.")

#### Things to be aware of

- `filename.txt` doesn't have to exist when you open a file for **writing**. It will be *created* by calling `open("filename.txt", "w")`.  
- If `filename.txt` *does* already exist, then by default you'll over-write what's there. If you want to just *add* to the file, use the `a` (**append**) mode instead.
- To separate lines in this file, use the `\n` character (*newline*). 
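The points above can be sketched in a few lines. To avoid touching any real files, this example writes into a temporary folder; `demo.txt` is a hypothetical file name.

```python
import os
import tempfile

# Work in a temporary folder so no real files are touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # First write: the file does not exist yet, so it is created
    with open(path, "w") as f:
        f.write("First version.\n")

    # Second write in "w" mode: the old contents are replaced
    with open(path, "w") as f:
        f.write("Line one.\n")
        f.write("Line two.\n")

    with open(path, "r") as f:
        demo_contents = f.read()

print(demo_contents)
```

Only the two lines from the second write remain, each on its own line because of the `\n` characters.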

### Reading a file

The syntax to `open` a file in the **reading mode** is as follows:

`open("filename.txt", "r")`

Once we've opened the file, we can `read` the contents. The `read` function will return the contents as a `str`.

In [18]:
### Open up a file called `test.txt`
with open("test.txt", "r") as f:
    ### Read the contents
    contents = f.read()

In [19]:
### print out contents
print(contents)

This is a file.


### Check-in

Use the `open` command to create and write a new file called `my_first_file.txt`. Once you've opened it, **write** a series of lines to that file:

- The first line should read: `My name is {NAME}\n`.
- The next 5 lines should read: `This is line {i} of the file.\n`, where `i` refers to the specific line number.

**Hint**: Remember to use the *newline* character to separate each line.

In [49]:
### Your code here

### Check-in

Now use the `open` command to open `my_first_file.txt`. Once you've opened it, **read** the contents of that file into a new variable called `file_contents`.

In [70]:
### Your code here

### File reading, continued

Before, we read in the *entire* file as one big `str`. There are several other ways to interact with and **read** a file, however.

- `.read(n)`, where `n` refers to the number of characters you want to read.  
- `.readlines()`, which returns a `list` of each *line* in the file.

#### `.read(n)`

The `read` function can be **parameterized** by the `n` argument, which tells Python how many characters of the file to read. 

In [73]:
with open("my_first_file.txt", "r") as f:
    n_characters = f.read(10)
print(n_characters)

My name is


In [74]:
with open("my_first_file.txt", "r") as f:
    n_characters = f.read(15)
print(n_characters)

My name is Sean


#### `.readlines()`

The `readlines` function returns a `list`, where each element in the list corresponds to a line in the file.

- *Lines* are defined as being separated by a `\n` character.

In [75]:
with open("my_first_file.txt", "r") as f:
    all_lines = f.readlines()

In [76]:
all_lines

['My name is Sean.\n',
 'This is line 2 of the file.\n',
 'This is line 3 of the file.\n',
 'This is line 4 of the file.\n',
 'This is line 5 of the file.\n']

### Check-in

- Use the `readlines` function to read in all lines from `my_first_file.txt`. 
- Then, use a `for` loop to iterate through each line.  
  - For each line, `replace` the `\n` character with an empty character (i.e., `""`). 
  - Then, `print` out the line.

In [77]:
### Your code here

### Appending a file

If you `open` a pre-existing file in the `w` mode, you will *overwrite* all of its existing content.

If you wish to simply *add* to that file, you can instead open it in the `a` mode: `open("filename.txt", "a")`

In [80]:
## Open in append mode
with open("my_first_file.txt", "a") as f:
    ## Syntax to write is the same.
    f.write("This is new text I'm adding.")

In [82]:
## Now let's check if it worked...
with open("my_first_file.txt", "r") as f:
    file_contents = f.read()
print(file_contents)

My name is Sean.
This is line 2 of the file.
This is line 3 of the file.
This is line 4 of the file.
This is line 5 of the file.
This is new text I'm adding.


### Closing a file

Technically, it is good practice to always `close` a file once you've opened it. 

- If you're using the context manager (the `with` keyword), it will automatically `close` the file once you finish the `with` block.  
- But if you're not, you can `close` a file using `var_name.close()`.
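Here is a minimal sketch comparing the two styles. It checks the file object's `closed` attribute to confirm what happened; the temporary folder and the name `demo.txt` are assumptions used to keep the example self-contained.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    # Without `with`: we are responsible for closing the file ourselves
    f = open(path, "w")
    try:
        f.write("This is a file.")
    finally:
        f.close()  # runs even if the write fails
    closed_manually = f.closed

    # With `with`: the file is closed when the block ends
    with open(path, "r") as g:
        open_inside_block = not g.closed
    closed_by_with = g.closed

print(closed_manually, open_inside_block, closed_by_with)  # True True True
```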

## Finding a target `str`

One common use case is **searching** a large volume of text to `return` a particular sub-string.

- Where in the text does this sub-string occur?  
- What is the text surrounding one of its occurrences?

Note that this is not too far afield from a **search engine** like [Google](https://www.google.com/)!

### Our sample text

To start, we'll use a `.txt` file of [**Hamlet**](https://en.wikipedia.org/wiki/Hamlet), by [William Shakespeare](https://en.wikipedia.org/wiki/William_Shakespeare). The `.txt` file was retrieved from the [Project Gutenberg Corpus](https://www.gutenberg.org/browse/scores/top) online, and should be credited as such. 

The file is included in the `lectures` GitHub repository under the `data` directory.

First, let's use `readlines()` to extract each **line** of the play as a separate item in a list.

In [1]:
with open("data/hamlet.txt") as f:
    book = f.readlines()

#### Inspecting the text

In [2]:
## This is just the title
book[0]

'THE TRAGEDY OF HAMLET, PRINCE OF DENMARK\n'

In [3]:
## Partial list of characters in play
for line in book[5:12]:
    l = line.replace("\n", "")
    print(l)


Dramatis Personae

  Claudius, King of Denmark.
  Marcellus, Officer.
  Hamlet, son to the former, and nephew to the present king.
  Polonius, Lord Chamberlain.


#### Check-in

How could we check how many **lines** are in the `.txt` file?

In [4]:
### Your code here

### Finding a sample `str`

One of the most famous lines in *Hamlet* reads:

> To be, or not to be- that is the question...

Suppose we wanted to **find** the `str` `"that is the question"` in the book, and **return** the line number (at least in this `.txt` file).

How could we go about that?

### Solution: `enumerate`

- Use `enumerate` to iterate through each line of the play.  
- For each line, check if some `target_str` occurs in that line.  
- If it does, use `break` to **stop** iterating, and record which line it is.

In [5]:
target_str = "that is the question"
for index, line in enumerate(book):
    if target_str in line:
        break
print("Line: {x}".format(x = line.replace("\n", "")))
print("Line number: {x}".format(x = index))

Line:   Ham. To be, or not to be- that is the question:
Line number: 2048
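One caveat with the loop above: if `target_str` never occurs, the loop simply runs to the end, and `index` and `line` end up pointing at the *last* line of the file. A possible variation, sketched here with a hypothetical helper called `find_first` and a tiny stand-in for `book`, returns `None` when nothing is found.

```python
def find_first(lines, target_str):
    """Return the index of the first line containing target_str, or None."""
    for index, line in enumerate(lines):
        if target_str in line:
            return index
    return None  # the loop ended without finding a match

# A tiny stand-in for `book` (a few lines chosen for illustration)
sample_lines = ["Who's there?\n",
                "To be, or not to be- that is the question:\n",
                "The rest is silence.\n"]

found = find_first(sample_lines, "that is the question")
missing = find_first(sample_lines, "not in the text")
print(found, missing)  # 1 None
```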


### Check-in: Finding the next $N$ lines

What if we wanted to return the next $N$ (e.g., `5`) lines *after* this target string? 

- To do this, we just need to add another variable: `keep_lines`, which tells us *how many* additional lines we want to return.  
- Then, once we've retrieved the `index` of our `target_str`, we can **slice** between that `index` and `index + keep_lines`.

Try implementing this algorithm yourself first. 

**Hint**: The code can be *mostly* the same as before (i.e., use `enumerate`, etc.). 

In [6]:
keep_lines = 5 ### New variable to track
### Your code here

### Check-in: What if `target_str` occurs multiple times?

What if we were looking for a more common `target_str`, e.g., one that occurred multiple times?  

1. What problems do you see with our previous approach (e.g., using `break` once we find `target_str`)?
2. How might you solve this problem? 

In [7]:
target_str = "the question"
### Your answer here

### Check-in: Other considerations

These exercises really only scratch the surface of **searching** a file. Here are some other issues for consideration and discussion. 

How might you address:

1. Issues of **case**: e.g., what if *question* is spelled `"Question"`, not `"question"`?
2. Situations where a `target_str` spans multiple *lines*? 
3. Mismatch in punctuation, e.g., a misplaced `,`? 
4. A **partial match**, e.g., if $90\%$ of the characters match?

**Note**: These are challenging issues! And each of them likely has multiple solutions.
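As a starting point for discussion, here is one possible sketch for items 1 and 4. It uses `str.lower` for case and `difflib.SequenceMatcher` from the standard library for approximate matching. The `0.9` threshold is an assumption, and `ratio()` is a similarity score rather than a literal "percent of characters that match".

```python
from difflib import SequenceMatcher

line = "To be, or not to be- that is the Question:"
target_str = "that is the question"

# 1. Case: compare lowercase versions of both strings
case_insensitive_match = target_str.lower() in line.lower()

# 4. Partial match: ratio() is 1.0 for identical strings, lower otherwise
typo = "that is the questoin"  # deliberately misspelled
similarity = SequenceMatcher(None, target_str, typo).ratio()
close_enough = similarity >= 0.9

print(case_insensitive_match, close_enough)
```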

## Counting Words

Another very common **use case** is simply **counting** words.

- How many words are there overall?  
- How many *unique words* are used?  
- How many times does *each word* occur?  
- What is the *most frequent word*?

### Caveat: what *is* a word?

The question of what defines a word is surprisingly complex.

- First, languages have very different [**morphological systems**](https://en.wikipedia.org/wiki/Morphological_typology). So even *conceptually*, it's not always clear what makes a word "a word" in a given language.  
- Second, languages have very different [**writing systems**](https://en.wikipedia.org/wiki/Orthography). 
  - Some languages (like English, Spanish, etc.) have *spaces* between words in their written form.  
  - Other languages (like Classical Latin, Chinese, etc.) do [not typically use *spaces* between words](https://en.wikipedia.org/wiki/Scriptio_continua) in their written form.

Many **conceptual definitions** and **tools** for identifying *words* are rooted in English specifically, but those definitions and tools don't always generalize––languages can be very different.

### How *many* words?

The first question that might occur to us is *how many words* are in a book. 

To do this, we could:

- `read` the book in as one long `str`.  
- Use the `split` function to separate this long `str` by **spaces**, into a `list` of words.
- Count the number of items in this list.

#### Using `split`: a review

In [8]:
sentence = "To be or not to be, that is the question"
sentence.split(" ")

['To', 'be', 'or', 'not', 'to', 'be,', 'that', 'is', 'the', 'question']

#### Using `split` for Hamlet

In [9]:
# First, read in as string
with open("data/hamlet.txt", "r") as f:
    book_str = f.read()

In [10]:
# We should also clean up all those *newline* characters.
book_str = book_str.replace("\n", " ")
# To make it easier for later, we can also turn it into lowercase
book_str = book_str.lower()
# Now, use split to separate into words
book_words = book_str.split()
book_words[0:5]

['the', 'tragedy', 'of', 'hamlet,', 'prince']

In [11]:
# How many items in list?
len(book_words)

32724

### How many *unique* words?

Above, we calculated how many word *tokens* were in the book. 

- This means that the word "the" will be counted *every time* it occurs.  
- Instead, let's calculate the number of *unique* word types.

#### Using `set`

The `set` function will turn a `list` into a `set` object, which contains only the *unique elements* in that list.

In [12]:
my_list = ["the", "dog", "is", "the", "best"]
set(my_list)

{'best', 'dog', 'is', 'the'}

#### Check-in

Use the `set` function to calculate how many *unique* words are in this book.

In [13]:
### Your code here

### How many times does each word occur?

We might also want to know *how many times* each word occurs. 

- For example, perhaps "the" occurs $>1000$ times, whereas "question" occurs only ~$10$ times.  
- Ideally, we would store this in a `dict`:
   - Each **key** represents a *word*.  
   - Each **value** represents *how many times* that word occurred in *Hamlet*.

How might we go about this?

#### First pass: counting each word

As a first pass, let's use the following approach:

- First, create a `dict` to store our words.  
- Then, *iterate* through our `list` of words.  
- `if` a given word is not in our `dict`, add an entry for it (and set the value to `1`).  
- `if` a given word *is* in a `dict`, increase its value by `1`.

In [14]:
word_counts = {}
for w in book_words:
    if w not in word_counts:
        word_counts[w] = 1
    else:
        word_counts[w] += 1

In [15]:
# How many times does "the" occur?
word_counts['the']

1095

In [16]:
# How many times does "king" occur?
word_counts['king']

43

#### Check-in

Any issues with this **first pass** approach? 

**Hint**: One issue could have to do with punctuation...

Write code that counts a word correctly whether or not it is followed by a `.` in the sentence.

In [17]:
### Your code here

#### Solution

One problem that you might've noticed is that words occurring at the *end of a sentence* don't have a space between the word and a period (e.g., `question.`). 

- This will *under-count* certain words.

To resolve this, we can `replace` all periods with an empty character before adding a word to our `dict`.

In [18]:
word_counts = {}
for w in book_words:
    w_no_period = w.replace(".", "")
    if w_no_period not in word_counts:
        word_counts[w_no_period] = 1
    else:
        word_counts[w_no_period] += 1

In [19]:
# How many times does "king" occur?
word_counts['king']

162
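Periods are not the only punctuation that can stick to a word (e.g., `hamlet,` in the output earlier). One possible extension, sketched below on a tiny made-up list rather than the full book, is to strip *any* punctuation from the ends of each word using `string.punctuation`.

```python
import string

# A tiny stand-in for `book_words` (made up for illustration)
sample_words = ["king.", "king,", "king", "question:", "question"]

sample_counts = {}
for w in sample_words:
    # Remove any punctuation from the start and end of the word
    cleaned = w.strip(string.punctuation)
    if cleaned not in sample_counts:
        sample_counts[cleaned] = 1
    else:
        sample_counts[cleaned] += 1

print(sample_counts)  # {'king': 3, 'question': 2}
```

Note that `strip` only touches the ends of a word, so an apostrophe in the middle (as in `o'er`) is left alone, and a token made up entirely of punctuation would become an empty string.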

### Which word is most common?

Now that we have a `dict` representing how many times each word occurs, we can calculate **which word** is most common.

**Check-in**: Which word do you think is most frequent in *Hamlet*?

#### Finding the most frequent word

As always, there are multiple ways to do this.

But one simple approach is to:

- Use a `for` loop to iterate through all `items()` in the `dict`.  
- Track the `key_with_highest_value` we've seen so far.  
- Once the `for` loop is done, inspect `key_with_highest_value`.

In [20]:
key_with_highest_value = None
max_count = 0
for word, count in word_counts.items():
    # If this word frequency > max_count
    if count > max_count:
        # Set new "highest word" to this word
        key_with_highest_value = word
        max_count = count


In [21]:
## Now, inspect which word was most frequent
key_with_highest_value

'the'

#### Other approaches

There are *many different approaches* you could take to solving this problem. Some are more generalizable (but also more complicated) than what I've shown here.

- You can `sort` the dictionary by **value** (see the lecture on **dictionary operations**).  
- You could use the `max` function with `dict.get` as your `key` parameter (see below).

In [22]:
# Another approach
max(word_counts, key = word_counts.get)

'the'
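Yet another option is `collections.Counter` from the standard library, which builds the counts and ranks them for you. The sketch below uses a tiny made-up list in place of `book_words`, and also shows sorting a plain `dict` by value.

```python
from collections import Counter

# A tiny stand-in for `book_words`
sample_words = ["the", "dog", "is", "the", "best", "dog", "the"]

# Counter builds the word -> count mapping in one step
sample_counts = Counter(sample_words)

# Most frequent words, as (word, count) pairs
top_two = sample_counts.most_common(2)
print(top_two)  # [('the', 3), ('dog', 2)]

# Sorting a plain dict by value puts the same word first
by_value = sorted(dict(sample_counts).items(), key=lambda pair: pair[1], reverse=True)
print(by_value[0])  # ('the', 3)
```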

## JSON

## What is a `.json` file?

> A `.json` file is a **file written in the JSON file format**. It allows us to store structured data objects consisting of **key-value** pairs.

### What is JSON?

**JSON** = JavaScript Object Notation.

- Standard format for *representing* and *transmitting* data.  
   - "Standard" = different people/systems agree to use this format to send and receive information.  
- Represents data in **key-value** pairs.

#### Check-in

What else have we seen that represents data in **key-value pairs**?

### A Python `dict` is a collection of key-value pairs

A **dictionary** (`dict`) stores **key-value** pairs.

In [13]:
my_class = {'Code': '1',
            'Department': 'CSS',
            'Instructor': 'Mignozzetti',
            'Prerequisite': True,
            'Enrollment': 120}
print(my_class)

{'Code': '1', 'Department': 'CSS', 'Instructor': 'Mignozzetti', 'Prerequisite': True, 'Enrollment': 120}


In [24]:
my_class['Department']

'CSS'

### JSON and `dict`: an analogy

Conceptually, JSON accomplishes the same goals as a Python `dict`.

- In fact, Python programmers often *convert* a `dict` into a JSON `str` when they want to store it in a file.  
- Similarly, you can **read in** a `.json` file and convert the contents into a `dict`.

**Bottom line**: we're not dealing with a fundamentally new data structure; it's another standardized way to represent **key-value pairs**.

## Reading in a `.json` file

Reading in a `.json` file shares some similarities with [reading `.txt` files](17-reading-text).  

- Must specify a **file path**.  
- File path can be either *absolute* or *relative*.

But there are also some important differences:

- To **read in** a `.json` file, we'll need to `import` the `json` library.  
- `json.load` will read in a **structured `.json` file** as a `dict`, not a `str`.

### Example: simple file

Here, we will work with a simple `.json` file: `data/restaurant.json`. 

- The file contains a structured representation of a restaurant.  
- We use `json.load(...)` to **load** this representation as a `dict`.

In [2]:
## This imports the json library
import json

In [3]:
## As with normal .txt files, we use "open" to open the target file
with open("data/restaurant.json", "r") as fp:
    ## use json.load to load as dict
    info = json.load(fp)

In [4]:
info

{'Name': 'Plumeria', 'Location': 'University Heights', 'Cuisine': 'Thai'}

### `load` creates a `dict`

Now, we can work with the **contents** of this file as we would any `dict`.

In [6]:
info['Name']

'Plumeria'

In [7]:
info['Location']

'University Heights'

In [8]:
info['Cuisine']

'Thai'

### Check-in

Try reading in another file that's stored in `data`: `data/school.json`. 

What is the value of the **Name** key?

In [11]:
### Your code here
with open('data/school.json', 'r') as f:
    school = json.load(f)
print(school)
print(school['Name'])

{'Name': 'UCSD', 'Location': 'San Diego', 'Affiliation': 'University of California'}
UCSD


### Solution

In [32]:
## As before, we use "open" to open the target file
with open("data/school.json", "r") as fp:
    ## use json.load to load as dict
    school_info = json.load(fp)

In [33]:
## Get name of school
school_info['Name']

'UCSD'

## Writing a `.json` file

Often, you'll want to **write** a structured `dict` to a file.  

- Useful for *storing* information, so you can access it later.  
- Useful for *transmitting* information between programs.  

We can use `json.dump(...)` to **write** (or "dump") a `dict` into a `.json` file.

### Simple example: course 

To start out, let's use the `my_class` dict we defined earlier.

In [14]:
my_class['Code']

'1'

To **write** this to a file, we:

- `open` (create) a file with the name we want to call it.  
- Use `json.dump(dict_name, file_object)`, where `file_object` is the opened file (not the file name).

In [15]:
with open("course.json", "w") as fp:
    json.dump(my_class, fp)

#### Checking that this worked

In [16]:
with open("course.json", "r") as fp:
    course_info = json.load(fp)
print(course_info)

{'Code': '1', 'Department': 'CSS', 'Instructor': 'Mignozzetti', 'Prerequisite': True, 'Enrollment': 120}


### Check-in

Create a new `dict` called `my_info`. Add the following keys/values:

- `Name`. 
- `Major`. 

Then, use `json.dump` to **write** this `dict` to a `.json` file called `my_info.json` to your own computer (in whichever directory you prefer).

In [18]:
### Your code here
my_info = {
    'Name': 'Umberto',
    'Major': 'PoliSci'
}
with open('my_info.json', 'w') as f:
    json.dump(my_info, f)

## JSON files vs. JSON strings

The `load` and `dump` methods can be used to **read** and **write** a `dict` from/to a `.json` file.  

However, Python can also represent JSON as a **`str`**.

- To *read* a `dict` from a JSON `str`, use `loads` (load + *s*tring).  
- To *write* a `dict` into a JSON `str`, use `dumps` (dump + *s*tring).

### `json.dumps`

- Input: a `dict`. 
- Output: a JSON `str`.  

In [38]:
json_str = json.dumps(my_class)
json_str

'{"Code": "1", "Department": "CSS", "Instructor": "Mignozzetti", "Prerequisite": true, "Enrollment": 120}'

In [39]:
type(json_str)

str

### `json.loads`

- Input: a JSON `str`.  
- Output: a `dict`.
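A minimal sketch of `json.loads`, using a shortened version of the JSON string shown above (the exact string here is typed by hand for illustration):

```python
import json

# A JSON string like the one produced by json.dumps above
json_str = '{"Code": "1", "Department": "CSS", "Prerequisite": true, "Enrollment": 120}'

course = json.loads(json_str)
print(type(course))            # <class 'dict'>
print(course["Department"])    # CSS
print(course["Prerequisite"])  # True (JSON `true` becomes Python `True`)
```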

### Other objects besides `dict`s

- Technically, you can use `dumps`/`loads` for other objects, such as `str`, `list`, and more.
- Though in my experience, a `dict` is the most common format.

In [40]:
json.dumps([1, 2, 3])

'[1, 2, 3]'

In [41]:
json.loads('[1, 2, 3]')

[1, 2, 3]

# Numpy

## Computational social science in Python

Python hosts a number of tools (packages) to enable **scientific computing**, including computational social science:

- Packages to perform *vector operations* (e.g., `numpy`). 
- Packages to represent *data tables* (e.g., `pandas`).
- Packages to make *visualizations* (e.g., `matplotlib`). 
- Packages to perform *statistical analyses* (e.g., `scipy`).  

These packages form part of an **ecosystem**.

### Is this different from what we've been doing?

Yes and no.

So far, we've focused on **Python fundamentals**: 

- E.g., variables, loops, functions, and more.  
- These fundamentals are critical––think of them as the **foundation** for everything that comes next.  

But **scientific computing** will involve a heavier focus on:
 - Thinking about *data structures*. 
 - Thinking about *relationships between data*.  
  

### Illustrative example

- Previously, we've relied on *loops* while working with `list` objects.
- E.g., if we wanted to add every *i*th item in one `list` to every *i*th item in another `list`, we have to iterate through each list item-by-item.  
- However, `numpy` is a tool for **vector-wise arithmetic**: 
  - Can add/multiply/divide/etc. entire *vectors* together, rather than having to loop through each item.
  
Onto `numpy`!


## What is `numpy`?

> [**`numpy`**](https://numpy.org/doc/) is a *package* for scientific computing; specifically, it enables fast computation with *vectors* and *matrices*, along with a number of important mathematical operations.

Because `numpy` is a package, it must be *imported*.

In [7]:
# Import statement
import numpy as np

### What can I use `numpy` for?

- `numpy` allows you to work with **homogenous** arrays.
  - A [**homogenous array**](https://numpy.org/doc/stable/user/quickstart.html) is an array with objects all of the same `type`.  
  - E.g., all `int`, or all `bool`, etc.
- The benefit of this is that you can do **computations** very **efficiently**.
  - No more need to *loop*!
- Enables more advanced mathematical operations.

**Note**: `numpy` is a key part of many *advanced machine learning* packages!


### Creating a `numpy.ndarray`

The basic data type of `numpy` is an `ndarray`.

- `ndarray` = **N-dimensional array**.  

A simple way to create an `ndarray` is `np.arange` ("a range").

In [8]:
# Works similar to range(N)
np.arange(1, 4)

array([1, 2, 3])

#### `np.arange` in detail

- By default, `np.arange(start, stop)` returns an array of integers from `start` up to (but not including) `stop`. 
- The `step` parameter allows you to determine the **granularity** of how you "step" between `start` and `stop`.

In [9]:
## step size = 2
np.arange(1, 10, step = 2)

array([1, 3, 5, 7, 9])

In [10]:
## step size = .5
np.arange(1, 4, step = .5)

array([1. , 1.5, 2. , 2.5, 3. , 3.5])

#### Check-in

How would you create an array ranging from `1` to `20`, incrementing with a step size of `.5`? How long would this array be?

In [14]:
### Your code here
np.arange(1, 20, step = 1/3)

array([ 1.        ,  1.33333333,  1.66666667,  2.        ,  2.33333333,
        2.66666667,  3.        ,  3.33333333,  3.66666667,  4.        ,
        4.33333333,  4.66666667,  5.        ,  5.33333333,  5.66666667,
        6.        ,  6.33333333,  6.66666667,  7.        ,  7.33333333,
        7.66666667,  8.        ,  8.33333333,  8.66666667,  9.        ,
        9.33333333,  9.66666667, 10.        , 10.33333333, 10.66666667,
       11.        , 11.33333333, 11.66666667, 12.        , 12.33333333,
       12.66666667, 13.        , 13.33333333, 13.66666667, 14.        ,
       14.33333333, 14.66666667, 15.        , 15.33333333, 15.66666667,
       16.        , 16.33333333, 16.66666667, 17.        , 17.33333333,
       17.66666667, 18.        , 18.33333333, 18.66666667, 19.        ,
       19.33333333, 19.66666667])

#### Solution

In [34]:
np_range = np.arange(1, 20, step = .5)
len(np_range)

38

### Turning a `list` into a `ndarray`

Another way to create an `ndarray` is to pass a `list` into the `np.array(...)` function.

In [16]:
og_list = [1, 2, 3]
type(og_list)

list

In [17]:
myarray = np.array(og_list)
print(type(myarray))

<class 'numpy.ndarray'>


In [18]:
myarray

array([1, 2, 3])

#### Check-in

How would you create a `numpy` array with the elements `[5, 6, 7]`?

In [23]:
### Your code here
L = [5, 6, 7]
L = np.array(L)
L

array([5, 6, 7])

#### Solution

In [51]:
np_array = np.array([5, 6, 7])
np_array

array([5, 6, 7])

#### Check-in

Why is this code throwing an error?

In [25]:
test_array = np.array(1, 2, 3)

TypeError: array() takes from 1 to 2 positional arguments but 3 were given

#### Solution

Make sure you **wrap** the input values in `[]`, so that they are passed as a single `list`.

In [78]:
test_array = np.array([1, 2, 3])
test_array

array([1, 2, 3])

### Indexing into a one-dimensional array

Indexing works just like it does for `list`s.

In [27]:
myarray[0]

1

In [28]:
myarray[1]

2

In [29]:
myarray[2]

3

### Multi-dimensional arrays

- So far, we've just been looking at 1-dimensional arrays.
- But `numpy` is excellent at storing **multi-dimensional arrays**.
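As a quick sketch, a 2-dimensional array can be built from a list of lists, and indexed with a row and a column. The name `grid` and its values are made up for illustration.

```python
import numpy as np

# A 2-D array is built from a list of lists (one inner list per row)
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])

print(grid.ndim)   # 2 (number of dimensions)
print(grid[0, 1])  # 2 (row 0, column 1)
print(grid[1])     # [4 5 6] (the whole second row)
```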


### Checking *attributes* of an array

The `shape` attribute tells you the dimensions of an array.

#### Check-in

What is the **dimensionality** of `md_array`?

In [35]:
### What is dimensionality
md_array = np.array([[1, 2], [3, 3]])
md_array.shape

(2, 2)

#### Solution

You can check this using `md_array.shape`.

In [36]:
md_array.shape

(2, 2)

#### Check-in

What about `md_array2`?

In [37]:
## 2x3 array
md_array2 = np.array([[1, 2, 3], [4, 5, 6]])
md_array2.shape

(2, 3)

#### Solution

You can check this using `md_array2.shape`.

In [88]:
md_array2.shape

(2, 3)

### Checking *attributes* of an array (pt. 2)

The `dtype` attribute tells you the *type* of data in the array.

In [38]:
md_array2.dtype

dtype('int64')

### Homogenous data

As noted earlier, an array is meant to store **homogenous elements**.

- This means that `np.array` will try to **convert** any heterogenous elements to a common `type`.

In [39]:
## Note what happens to 5 and 7!
arr3 = np.array(["a", 5, 7])
arr3

array(['a', '5', '7'], dtype='<U21')

In [40]:
arr3.dtype

dtype('<U21')
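The same kind of conversion happens with numbers. In this sketch (variable names are made up), mixing `int` and `float` values produces an array of floats, and the `dtype` argument can be used to request a type explicitly.

```python
import numpy as np

# Mixing int and float: every element is stored as a float
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64

# You can also request a type explicitly with the dtype argument
as_float = np.array([1, 2, 3], dtype=float)
print(as_float.dtype)  # float64
```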

### Interim summary

- `numpy` is a package that forms the foundation of scientific computing.  
- `numpy` arrays are the cornerstone of `numpy`.
- A `numpy` array is like a `list`, with a few differences:
  - Requires **homogenous elements**.  
  - Better at representing **multi-dimensional arrays**.
  - Can be used for **vector operations** (coming up!). 


## Working with vectors (intro)

- `numpy` vectors make it easier to do all sorts of operations, such as **arithmetic** operations.
- No more need to use `for` loops––can do vector arithmetic the same way we multiply individual numbers.

### The old way: arithmetic with `for` loops and `list`s

Adding one `list` to another requires using a `for` loop. 

In [108]:
list1 = [1, 2, 3]
list2 = [2, 3, 4]

In [111]:
## The "+" operator just combines them
list1 + list2

[1, 2, 3, 2, 3, 4]

In [113]:
## To add them, we must use a for loop
sum_list = []
for index, item in enumerate(list1):
    sum_list.append(item + list2[index])
sum_list

[3, 5, 7]

### The new way: arithmetic with `numpy`

`numpy` makes it *much* easier to do arithmetic operations with vectors.

In [114]:
## First, define some vectors
arr1 = np.array(list1)
arr2 = np.array(list2)

In [115]:
## Can just use "+"!
arr1 + arr2

array([3, 5, 7])

#### Other arithmetic operations

In [117]:
arr1 - arr2

array([-1, -1, -1])

In [118]:
arr1 * arr2

array([ 2,  6, 12])

In [116]:
arr1 / arr2

array([0.5       , 0.66666667, 0.75      ])

#### Vectors vs. scalars

- A **vector** is a list of numbers; a **scalar** is a single number.
- We can multiply (or add, subtract, etc.) an entire *vector* by a single number.

In [123]:
arr1

array([1, 2, 3])

In [125]:
## Multiply all elements by 100
arr1 * 100

array([100, 200, 300])

#### Check-in

What would multiplying the two arrays below return?

```python
a = np.array([2, 4, 5])
b = np.array([2, 2, 3])
a * b
```

In [126]:
### Your code here

#### Check-in

What would happen if we ran the code below?

```python
a = np.array([2, 4, 5])
b = np.array([2, 2])
a * b
```

In [126]:
### Your code here
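
One way to check your answers to the two check-ins above is a sketch like the following. The name `short_b` is introduced here only for illustration.

```python
import numpy as np

a = np.array([2, 4, 5])
b = np.array([2, 2, 3])

# Same length: elements are multiplied pair by pair
print(a * b)  # [ 4  8 15]

# Shapes (3,) and (2,) cannot be paired up element by element
short_b = np.array([2, 2])
try:
    a * short_b
except ValueError as err:
    print("ValueError:", err)
```

Arrays of different lengths generally cannot be combined this way; the exception is when one of them has a single element, as with a scalar.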

### Thinking with vectors

- Using `numpy` for vector arithmetic can sometimes involve a "cognitive shift".  
- We're used to multiplying *individual elements*; now we have to transition to thinking about multiplying *entire vectors*.
- But it's much more efficient!
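
To make the shift concrete, here is a small sketch that converts reaction times from milliseconds to seconds both ways. The values and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical reaction times in milliseconds
rt_ms = [350, 420, 515, 610]

# Element-by-element thinking: loop over a list
rt_sec_loop = []
for rt in rt_ms:
    rt_sec_loop.append(rt / 1000)

# Vector thinking: one operation on the whole array
rt_sec_vec = np.array(rt_ms) / 1000

print(rt_sec_loop)
print(rt_sec_vec)
```

Both approaches give the same numbers; the vectorized version just expresses the whole conversion in a single line.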

#### Check-in

Consider the same two arrays as before. How would you:

- Calculate the **product** of these arrays?  
- Calculate the **sum** of the elements in this new "product" array?

Note that this is also called the [dot product](https://en.wikipedia.org/wiki/Dot_product).

In [14]:
### Your code here
a = np.array([2, 4, 5])
b = np.array([2, 2, 3])

#### Solution

This is also called the [dot product](https://en.wikipedia.org/wiki/Dot_product).

In [15]:
## First, calculate the product
product = a * b
product


array([ 4,  8, 15])

In [16]:
## Then, calculate the sum of this array
product.sum()


27
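
As a side note, `numpy` can also compute the dot product directly. The sketch below compares the two-step approach with `np.dot` and the `@` operator, which should agree for 1-D arrays like these.

```python
import numpy as np

a = np.array([2, 4, 5])
b = np.array([2, 2, 3])

manual = (a * b).sum()   # multiply element-wise, then sum
with_dot = np.dot(a, b)  # same calculation in one call
with_at = a @ b          # the @ operator also computes it for 1-D arrays

print(manual, with_dot, with_at)  # 27 27 27
```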

## Descriptive statistics with `numpy`

> **Descriptive statistics** are ways to summarize and organize data.

A big advantage of `numpy` is that it has **built-in functions** to calculate various descriptive statistics:

- The `sum` of a set of numbers.  
- The `mean` (or "average") of a set of numbers.  
- The `median` (or "middle value") of a set of numbers.


### `sum`

> The **sum** of a set of numbers is simply the result of adding each number together.

The `numpy` package has a `sum` function built in, which makes it easier to calculate the sum of a vector.

In [17]:
## First, create an array with some numbers
v = np.array([5, 9, 10])
v

array([ 5,  9, 10])

In [18]:
## Now calculate the sum
v.sum()

24

### `mean`

> The **mean** of a set of scores is the *sum* of those scores divided by the number of observations.

The `numpy` package also has a `mean` function built in. Two options:

- `array_name.mean()`
- `np.mean(array_name)`

In [19]:
## First, create an array with some numbers
v = np.array([5, 9, 10])
v

array([ 5,  9, 10])

In [20]:
## Now calculate the mean
v.mean()

8.0

In [21]:
## Or do it with numpy.mean
np.mean(v)

8.0
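
To connect this to the definition above, the following sketch computes the mean "by hand" as the sum divided by the number of observations. The name `manual_mean` is just for illustration.

```python
import numpy as np

v = np.array([5, 9, 10])

# Sum of the scores divided by the number of observations
manual_mean = v.sum() / v.size

print(manual_mean)  # 8.0
print(v.mean())     # 8.0
```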

#### Check-in

What would happen if we ran the following code?

```python
v = np.array([1, 5, 9])
np.mean(v)
```

In [22]:
### Your code here

#### Solution

This will also calculate the `mean`. 

In [23]:
v = np.array([1, 5, 9])
np.mean(v)

5.0

#### Check-in

What would happen if we ran the following code?

```python
v = np.array(["a", "b", "a"])
v.mean()
```

In [24]:
### Your code here

#### Solution

It will throw an **error**. 

- You cannot calculate the `mean` of a vector of `str` types.
- The `mean` can only be calculated for **interval/ratio data**.  
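
If you want to see the error without stopping your notebook, one option is to wrap the call in `try`/`except`. This sketch assumes the error is a `TypeError`, which is what the `numpy` versions I'm aware of raise; the exact message varies by version.

```python
import numpy as np

v = np.array(["a", "b", "a"])

try:
    v.mean()
except TypeError as err:
    # The exact wording of the message depends on the numpy version
    print("TypeError:", err)
```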

### `median`

> The **median** of a set of scores is the *middle* score when those scores are arranged from least to greatest.

The `numpy` package also has a `median` function built in.

- Syntax: `np.median(array_name)`. 
- Unlike `mean`, you **cannot use** `array_name.median()`; arrays have no `median` method.

In [25]:
## First, create an array with some numbers
v = np.array([5, 9, 10, 1, 20])
v

array([ 5,  9, 10,  1, 20])

In [26]:
## Now calculate the median
np.median(v)

9.0
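
To see why the answer is `9.0`, it can help to sort the values first and pick out the middle one. A small sketch, using `np.sort` (not otherwise covered in these notes) and a made-up name `middle`:

```python
import numpy as np

v = np.array([5, 9, 10, 1, 20])

# Arrange the scores from least to greatest
sorted_v = np.sort(v)
print(sorted_v)  # [ 1  5  9 10 20]

# With an odd number of elements, the median is the middle one
middle = sorted_v[len(sorted_v) // 2]
print(middle)        # 9
print(np.median(v))  # 9.0
```

Note that `np.median` does this sorting for you; the original array does not need to be in order.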

#### Check-in

What would be the **median** of the following vector?

```python
v = np.array([1, 2, 5, 8])
```

#### Solution

If the vector has an *even* number of elements, the **median** is the *mean* of the middle two elements.

In [27]:
## First, create an array with some numbers
v = np.array([1, 2, 5, 8])
v

array([1, 2, 5, 8])

In [28]:
## Use np.median
np.median(v)

3.5

In [29]:
## Equivalent to *mean* of 2 and 5
(2 + 5) / 2

3.5

### Interim summary

- **Descriptive statistics** are a really useful way to *summarize* data.  
- Very valuable for both basic and applied research (e.g., in industry).  
   - Examples: `median` salary, `mean` sales per fiscal quarter, `mean` reaction time on a psychophysics task, etc.
- `numpy` makes this much easier.
   - Later, we'll discuss how `pandas` (a way to represent data tables) uses these same functions.

## Working with matrices

> A **matrix** is a rectangular array of data (i.e., a **multi-dimensional array**).

`numpy` is *designed* for **representing** and **performing calculations** with matrices.

- A "vector" is just a one-dimensional matrix.
- Many of the same operations we've discussed also apply to working with matrices.

### Creating a matrix

- Matrices can be created just like vectors.  
- The key difference is that they are created from **nested lists**.

In [30]:
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array

array([[1, 2, 5],
       [3, 4, 7]])

In [31]:
## This is a 2 by 3 matrix
md_array.shape

(2, 3)

### Indexing into a matrix

- You can **index** into a matrix, just like with a vector.  
- A key difference is that you use *multiple indices*, for each dimension.
   - `matrix_name[D1_index, D2_index, ...]`

In [32]:
# This just returns the first *row*
md_array[0]

array([1, 2, 5])

In [33]:
# This returns the second element of the first row
md_array[0, 1]

2

#### Check-in

How would you return the *first element* of the *second row* of `md_array`?

```python
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
```

In [None]:
### Your code here
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array

#### Solution

In [None]:
# Can use [1] to get second row
md_array[1]

In [None]:
# Use [1, 0] to get first element of second row
md_array[1, 0]

#### Check-in

How would you retrieve a single **column** of `md_array`?

In [None]:
### Your code here

#### Solution

Retrieving a **column** uses the `[:,column_index]` syntax.

In [None]:
## column 1
md_array[:, 0]

In [None]:
## column 2
md_array[:, 1]

In [None]:
## column 3
md_array[:, 2]

#### Check-in

How would you retrieve the *second* and *third* element of the second row?

```python
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array
```

In [None]:
### Your code here
md_array = np.array([[1, 2, 5], [3, 4, 7]])
md_array[1, 1:3]

#### Solution

In [None]:
## First, get second row with md_array[1]
md_array[1]

In [None]:
## Then, get second/third elements with slicing
## I.e., [1:3] syntax
md_array[1, 1:3]

### Summary statistics with matrices

- When you call `np.sum` (or `mean`, etc.), you can specify which **axis** to calculate that statistic from.  
- `axis = 0`: calculate `sum` (or `mean`, etc.) of each **column**.
- `axis = 1`: calculate `sum` (or `mean`, etc.) of each **row**.

In [None]:
## Calculate sum of each column
print(md_array)
md_array.sum(axis = 0)

In [None]:
## Calculate sum of each row
md_array.sum(axis = 1)

#### Check-in

How would you calculate the `mean` of each **row** of the following matrix?

In [None]:
m = np.array([[5, 10, 2],
            [20, 5, 100]])
### your code here
np.mean(m, axis = 1)

#### Solution

In [None]:
m.mean(axis = 1)

In [None]:
m.mean(axis = 1).shape

#### Check-in

Suppose you have a 5x6 matrix (5 rows, 6 columns). If you calculated the `mean` of each **column**, what would the `shape` be of the resulting vector?

#### Solution

The vector would have a `shape` of `(6,)`, i.e., **six observations**.

- There are six columns, so calculating the mean of each column would result in *six observations*.
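
You can verify this with a quick sketch. It uses `np.ones` (introduced later in these notes) only as a convenient way to build a 5x6 matrix; the variable names are illustrative.

```python
import numpy as np

m = np.ones((5, 6))         # 5 rows, 6 columns

col_means = m.mean(axis=0)  # one mean per column
row_means = m.mean(axis=1)  # one mean per row

print(col_means.shape)  # (6,)
print(row_means.shape)  # (5,)
```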

### Side note: arithmetic with matrices

- You can also perform arithmetic with matrices (e.g., **addition**, **multiplication**, etc.).
- However, note that matrices must have compatible dimensions.  
   - More discussion of this in a [Linear Algebra class](https://en.wikipedia.org/wiki/Linear_algebra).
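
As a rough sketch of what "compatible dimensions" means for element-wise arithmetic, the example below adds and multiplies two matrices of the same shape, then tries a mismatched pair. The matrices `m1`, `m2`, and `m3` are made up for illustration.

```python
import numpy as np

m1 = np.array([[1, 2, 5], [3, 4, 7]])        # shape (2, 3)
m2 = np.array([[10, 20, 30], [40, 50, 60]])  # shape (2, 3)

print(m1 + m2)  # element-wise sum
print(m1 * m2)  # element-wise product (not matrix multiplication)

m3 = np.array([[1, 2], [3, 4]])              # shape (2, 2)
try:
    m1 + m3
except ValueError as err:
    print("ValueError:", err)
```

Note that `*` multiplies matching elements; it is not the matrix product from linear algebra.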

## Identifying the location of an item

Often, you'll need to **search** a vector or matrix for items that meet a certain condition.

- All `scores == 100`.
- All `building_heights` above a certain threshold.  
- All `reaction_times` above a certain cutoff.

You can think of this as **applying a conditional statement** to *search* a vector.

### Using `==`

This will return a vector of `True` or `False`, indicating whether each index/element matches the condition.

In [None]:
## Scores
scores = np.array([100, 95, 100, 85])

In [None]:
## Which scores == 100?
scores == 100

In [None]:
## Select only scores == 100
scores[scores == 100]

### Using `np.where`

By default, this will return the **indices** in the initial array corresponding to the condition.

In [None]:
## Get indices
np.where(scores == 100)

In [None]:
## Applying indices to vector
scores[np.where(scores == 100)]
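
Two details about `np.where` are worth a quick sketch: with only a condition it returns a *tuple* of index arrays, and with three arguments it chooses a value for every element. The labels below are made up for illustration.

```python
import numpy as np

scores = np.array([100, 95, 100, 85])

# With only a condition: a tuple with one index array per dimension
idx = np.where(scores == 100)
print(idx[0])  # [0 2]

# With three arguments: pick one value where True, another where False
labels = np.where(scores == 100, "perfect", "not perfect")
print(labels)
```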

### Check-in

Consider the following array of `building_heights`. How would you find out which buildings are taller than 50 feet?

In [None]:
building_heights = np.array([25, 45, 10, 60, 10, 85, 100])
### Your code here
building_heights[building_heights > 50]

### Solution using a comparison (`>`)

In [None]:
building_heights > 50

In [None]:
building_heights[building_heights > 50]

### Solution using `np.where`

In [None]:
## Get indices
np.where(building_heights > 50)

In [None]:
## Apply indices
building_heights[np.where(building_heights > 50)]

## Other useful functions

`numpy` also has a host of other useful functions. For now, we'll focus on:

- **Generating an array** with either random numbers or `ones` or `zeros`.
- **Reshaping an array** with `reshape`.

### Initializing a random array

`numpy.random.rand(d1, ...)` can be used to initialize an array with **random numbers** and dimensionality `(d1, ...)`.

In [None]:
## Generates a 1-D vector with 10 elements
np.random.rand(10)

In [None]:
## Generates a 2-D array with shape (5, 5)
np.random.rand(5, 5)
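
Two things that may be useful to know: `np.random.rand` draws values between 0 and 1, and you get different numbers on each run unless you fix a **seed**. A minimal sketch, assuming the older `np.random.seed` interface (the seed value `0` is arbitrary):

```python
import numpy as np

np.random.seed(0)  # fixing the seed makes the "random" numbers repeatable
r1 = np.random.rand(2, 3)

np.random.seed(0)  # same seed again
r2 = np.random.rand(2, 3)

print(r1.shape)                         # (2, 3)
print((r1 >= 0).all(), (r1 < 1).all())  # True True
print((r1 == r2).all())                 # True
```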

#### Check-in

Generate a random array with shape (3, 2), then calculate the `mean` of each column.

In [None]:
### Your code here

#### Solution

In [34]:
r = np.random.rand(3, 2)
r

array([[0.18374708, 0.72578191],
       [0.75109432, 0.71579205],
       [0.31648686, 0.98426268]])

In [35]:
r.mean(axis = 0)

array([0.41710942, 0.80861221])

### Initializing an array of `ones` or `zeros`

This is like `np.random.rand`, but every element is `1` (for `np.ones`) or `0` (for `np.zeros`). Note that the shape is passed as a single tuple, e.g., `(2, 2)`.

In [39]:
np.ones((2, 2))

array([[1., 1.],
       [1., 1.]])

In [37]:
np.zeros((2, 2))

array([[0., 0.],
       [0., 0.]])

### Using `numpy.reshape`

Sometimes, a matrix or vector isn't the right **shape** to perform a computation.

- E.g., multiplying by another vector.  
- E.g., using it in a regression equation.

We can use `np.reshape` to reshape that array.

#### Example: turning a vector into a matrix

In [40]:
# Create a (10, ) vector
og_array = np.ones(10)
og_array

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [41]:
# Reshape to (2, 5)
og_array.reshape((2, 5))

array([[1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1.]])

In [42]:
# Reshape to (5, 2)
og_array.reshape((5, 2))

array([[1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.],
       [1., 1.]])

#### Dimensions must be compatible

If you try to `reshape` into a `shape` that's not compatible with the original `size` (i.e., the new dimensions don't multiply out to `size`), you'll get an error.

In [43]:
# Try to reshape to (4, 4)
og_array.reshape((4, 4))

ValueError: cannot reshape array of size 10 into shape (4,4)
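
The rule can be sketched as follows: a new shape works only if its dimensions multiply out to the array's `size`. The example also shows `-1`, which asks `numpy` to infer one dimension for you. The loop and the name `column` are just for illustration.

```python
import numpy as np

og_array = np.ones(10)
print(og_array.size)  # 10

# (2, 5) works because 2 * 5 == 10; (4, 4) fails because 4 * 4 == 16
for new_shape in [(2, 5), (4, 4)]:
    try:
        print(new_shape, "->", og_array.reshape(new_shape).shape)
    except ValueError as err:
        print(new_shape, "->", "ValueError:", err)

# Passing -1 asks numpy to infer that dimension from the size
column = og_array.reshape((-1, 1))
print(column.shape)  # (10, 1)
```

A `(10, 1)` "column" shape like this is a common requirement when passing a single variable to modeling code.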

## Conclusion

There's lots more to working with files (including text files), but this sets the **foundation**. Now you should feel a little more comfortable:

- Understanding how to navigate your computer's **directory structure**.  
  - E.g., knowing "where" a file is located.
- Knowing how to `open` a file in Python.
- Knowing how to **read** or **write** that file.

This will form the basis of working with future file types, such as `.csv` (a very common format for representing tabular data).