# Python: Counting and filtering

- [Overview](#overview)
- [Counting](#counting)
- [Filtering](#filtering)
- [Count if](#count-if)
- [Handling large files](#handling-large-files)

<h2 id="overview">Overview</h2>

So far in our Python study, we've learned some of the language's basic features, including:

  * Basic data types (integers, strings, lists, dicts)
  * Expressions and assignment statements
  * Variables as storage containers
  * Flow control (_for_ loops and _if/elif/else_)
  * Built-in functions such as `len` and `print`

Let's start tying together this knowledge and applying it in practical contexts.

<h2 id="counting">Counting</h2>

Counting is one of the most basic and important operations we need to perform.

One of the simplest and most common ways to count items is to use the built-in [len](https://docs.python.org/3/library/functions.html#len) function to measure the length of a list:



In [None]:
animals = ['cat', 'dog', 'bird']
len(animals)

Another common approach -- one often used when processing data from an external source such as a CSV -- is to use a counter variable.

In [None]:
count = 0
for animal in animals:
    count += 1  # same as writing count = count + 1
print(count)

Above, we "initialized" a variable called `count` to zero, and then used the [augmented assignment operator](https://docs.python.org/3.8/reference/simple_stmts.html#augmented-assignment-statements) `+=` to increment the count by one on each pass through the list of animals.
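Since the counter pattern is often used with CSV data, here is a minimal sketch of that case. The CSV content and the names `csv_text` and `row_count` are made up for illustration, and `io.StringIO` stands in for a real file on disk.

```python
import csv
import io

# Hypothetical CSV content, standing in for a real file on disk
csv_text = "name,legs\ncat,4\ndog,4\nbird,2\n"

row_count = 0
# io.StringIO lets us treat the string like an open file
for row in csv.DictReader(io.StringIO(csv_text)):
    row_count += 1  # one increment per data row; the header is not counted
print(row_count)
```

Because `csv.DictReader` treats the first line as the header, only the three data rows are counted.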

<h2 id="filtering">Filtering</h2>

Filtering data based on some aspect of the information is another common data wrangling task. Below, we use an `if` statement inside a `for` loop to print every animal except `dog`.


In [None]:
for animal in animals:
    if animal != 'dog':
        print(animal)
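The same filter can also be written more compactly. The sketch below uses a list comprehension, a Python feature this lesson does not otherwise cover, so treat it as an optional alternative. The name `not_dogs` is just an illustrative choice.

```python
animals = ['cat', 'dog', 'bird']

# Build a new list containing only the items that pass the test
not_dogs = [animal for animal in animals if animal != 'dog']
print(not_dogs)
```

The loop-and-`if` version above does the same work; the comprehension simply collects the matching items into a list in one line.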

<h2 id="count-if">Count if</h2>

We can now combine the above techniques to count a filtered list of items. Here are a few different approaches.

In [None]:
# Use a simple counter
count = 0
for animal in animals:
    if animal != 'dog':
        count += 1
print(count)

What if we need to keep the filtered results?

For example, say we need both the count *and* the actual list of filtered data for some additional downstream purpose such as saving it to a new file. 

In this case, we can adapt our strategy with the help of a list and `len`.

In [None]:
# Store the filtered items in a new list
noncanines = []
for animal in animals:
    if animal != 'dog':
        noncanines.append(animal)

# Now "count" the filtered list
print(len(noncanines))

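Both of these approaches work. Which one you choose depends on the end goal: the simple counter stores nothing but a number, while the list keeps the filtered items around for later use.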
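As one more possible approach, a "count if" can be expressed in a single line by passing a generator expression to the built-in `sum` function. This is a sketch of an alternative, not a replacement for the loops above.

```python
animals = ['cat', 'dog', 'bird']

# Add up a 1 for every animal that passes the filter
count = sum(1 for animal in animals if animal != 'dog')
print(count)
```

Like the simple counter, this version never builds a list of the filtered items, so it suits cases where only the number matters.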

<h2 id="handling-large-files">Handling large files</h2>

If you're dealing with larger data sets (e.g. millions or tens of millions of rows), you may want to avoid storing all those rows in your computer's memory. Instead, you can read the information as a stream and write out the filtered data as a stream.

Here's a fake example using an imaginary set of files:

```python
# Open a file to write animal names that start with the letter Z
with open("z_animals.txt", "w") as out:
    # Loop through the large file of animal names one line at a time
    # (aka as a stream), so the full data set is never held in memory.
    with open("millions_of_animals.txt") as source:
        for line in source:
            # Each line read from a file ends with a newline; strip it off
            animal = line.strip()
            if animal.lower().startswith("z"):
                # Add a newline to ensure each animal is on a separate row
                out.write(f"{animal}\n")
```        
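To try the streaming pattern without a real multi-million-row file, here is a small runnable sketch that also combines it with the counter from earlier. The sample data, the temporary directory, and the names `source_path`, `target_path` and `z_count` are all assumptions made for illustration.

```python
import os
import tempfile

# Hypothetical sample data, written to a temporary directory
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "animals.txt")
target_path = os.path.join(workdir, "z_animals.txt")

with open(source_path, "w") as f:
    f.write("zebra\ncat\nZebu\ndog\n")

z_count = 0
# Read one line at a time and write matches out immediately
with open(source_path) as source, open(target_path, "w") as out:
    for line in source:
        animal = line.strip()  # drop the trailing newline
        if animal.lower().startswith("z"):
            out.write(f"{animal}\n")
            z_count += 1  # count the matches as they stream past

print(z_count)
```

Only the current line and the running count are held in memory at any point, so the same code should behave the same way on a much larger file.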
