# Week 1 - Iterating through data using Python

## A real data challenge – a massive dataset

We want to explore `NYC EMS data since 2011`. The file is massive: `6+GB`, with `more than 26 million` rows of data. At this size, even Pandas will slow down dramatically.

Our strategy is to `iterate` or `loop` through the data in smaller chunks and analyze each one. We will then combine the pieces to get complete, reconstituted results that represent an analysis of the entire 6+GB of data.

We need to learn a couple of fundamental Python techniques that will help us extend Pandas' abilities.












### 1. ```for loops```... a data journalist's favorite Python expression

We can use a `for loop` to **iterate** (to do the same series of steps in a process over and over again), including:
* running some calculation on each value stored in a list;
* opening and reading a list of files;
* literally an endless series of important tasks.
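
As a minimal sketch of the first bullet, the loop below upper cases each value in a list. The contents of `fav_animals` are made up for illustration; only the list name and the `each_animal` loop variable come from this notebook.

```python
# A hypothetical list of animals; the values are illustrative only.
fav_animals = ["dog", "cat", "parrot"]

# A string has an upper() method but a list does not, so we visit
# each item in turn and upper case it individually.
for each_animal in fav_animals:
    print(each_animal.upper())

# After the loop ends, the loop variable still holds the last item,
# and the original list is unchanged.
print(each_animal)
print(fav_animals)
```

Each pass through the loop assigns the next item in the list to `each_animal` and runs the indented lines once.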

In [None]:
## name dog lucy


In [None]:
## upper case the previous variable 
## you can target an individual item


In [None]:
## run this list


In [None]:
##call our list


In [None]:
## upper case and print each animal in our list
## this will break


In [None]:
## recall that we can slice a list



In [None]:
## we can target an individual item that has been sliced from a list


**We can't do one at a time, but we can iterate through all of them using a `for loop`**

In [None]:
## use a for loop to upper case each animal and print it


### What's happening in a `for loop`:

<img src="https://sandeepmj.github.io/image-host/forloop3.png">


<img src="https://sandeepmj.github.io/image-host/forloop4.png">


<img src="https://sandeepmj.github.io/image-host/forloop5.png">


<img src="https://sandeepmj.github.io/image-host/forloop6.png">



<img src="https://sandeepmj.github.io/image-host/forloop7.png">


<img src="https://sandeepmj.github.io/image-host/forloop8.png">


<img src="https://sandeepmj.github.io/image-host/forloop9.png">


<img src="https://sandeepmj.github.io/image-host/forloop10.png">

# To recap:
<img src="https://sandeepmj.github.io/image-host/forloop6.png">


In [None]:
## rerun a for loop to upper case each animal and print it



In [None]:
## call each_animal


In [None]:
## did our list change? Call the fav_animals list


### 2.  `append()`

The `append()` method lets us add values to the end of a list. Even if the list does not exist yet, we can declare an empty one and then append to it.

```python
    new_list = []
    new_list.append(some_value)
```
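
Putting `append()` inside a `for loop` is how we keep the results of an iteration instead of only printing them. This is a small sketch with made-up animals; `upper_animals` is the name assumed for the new list.

```python
# Hypothetical starting list; the values are illustrative only.
fav_animals = ["dog", "cat", "parrot"]

# Declare an empty list first, then add one transformed item per pass.
upper_animals = []
for each_animal in fav_animals:
    upper_animals.append(each_animal.upper())

print(upper_animals)
```

The original `fav_animals` list stays as it was; the upper-cased values live only in `upper_animals`.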


In [None]:
## We save our iterated data by adding to an empty list


In [None]:
## call upper animals


## Let's take **For Loops** for a test drive:

### Combine different data points together 

#### You scrape some URLs and place them in a list called myURLS (provided below):

In [None]:
## run this cell to activate the list
myURLS = [
    'great-unique-data-1.html',
    'great-unique-data-2.html',
    'great-unique-data-3.html',
    'great-unique-data-4.html',
    'great-unique-data-5.html',
    'great-unique-data-6.html',
    'great-unique-data-7.html',
    'great-unique-data-8.html',
    'great-unique-data-9.html',
    'great-unique-data-10.html',
    'great-unique-data-11.html',
    'great-unique-data-12.html',
    'great-unique-data-13.html',
    'great-unique-data-14.html',
    'great-unique-data-15.html'
]

### * You realize that these URLs are missing the base of "http://www.importantsite.com/"
### * Use a ```for loop``` to join the base URL to every partial URL in your list.
### * Print each FULL URL
It should look like: ```"http://www.importantsite.com/great-unique-data-14.html"``` but with unique numbers.

In [None]:
## for loop and print

### Update myURLS and store full URLS in a new list

#### Instead of just printing the joined URLs, create a new list called ```full_URLS``` that holds the full URLs.
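
One possible approach, sketched on a shortened stand-in list so it runs on its own. `sample_URLS` and `base_url` are hypothetical names; in the notebook you would loop over `myURLS` instead.

```python
# Shortened stand-in for myURLS (illustrative values only).
sample_URLS = [
    "great-unique-data-1.html",
    "great-unique-data-2.html",
    "great-unique-data-3.html",
]
base_url = "http://www.importantsite.com/"

# Join the base to each partial URL and collect the results.
full_URLS = []
for partial_url in sample_URLS:
    full_URLS.append(base_url + partial_url)

print(full_URLS)
```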

In [None]:
## store the updated values


In [None]:
## call the new list


### 3. Counting while iterating

Often we need to increment a number to track progress of an iteration.
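
A minimal sketch of a counter that increments, using a made-up list of file names. The key point is that the counter is created before the loop and increased by one on every pass.

```python
# Hypothetical items to loop over.
files_to_process = ["a.csv", "b.csv", "c.csv"]

# Start the counter outside the loop so it is not reset on each pass.
counter = 0
for file_name in files_to_process:
    counter += 1  # same as counter = counter + 1
    print(f"Processing file {counter}: {file_name}")

print(f"Done. Processed {counter} files.")
```

If `counter += 1` is left out, the counter stays at 0 no matter how many times the loop runs.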

In [None]:
## counter without incrementing


In [None]:
## counter that increments


## Back to our EMS data challenge

For those of you who don't have sufficient space, I have created <a href="https://raw.githubusercontent.com/sandeepmj/datasets/main/ems-excerpt.csv">an excerpt</a> of the `6+GB` dataset that is `25MB` and holds 100,000 rows of data instead of millions of rows. If you are using the excerpt, your strategy will be to take the 100,000-row file and `chunk` it into pieces of 10K rows each.

With the actual `6+GB` file, we'll break it into chunks of 500K rows each.
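
The chunking step can be sketched with the `chunksize` parameter of `pd.read_csv`, which returns an iterator of DataFrames rather than one large DataFrame. To stay runnable anywhere, this sketch builds a tiny 25-row stand-in CSV and uses chunks of 10 rows; the file names and sample values are hypothetical, and with the real data you would point at your own file and use a chunk size of 10,000 or 500,000.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in CSV (made-up data) so the sketch runs without the real file.
tmp_dir = tempfile.mkdtemp()
big_csv = os.path.join(tmp_dir, "ems-sample.csv")
pd.DataFrame({
    "INITIAL_CALL_TYPE": ["INJURY", "DRUG", "VENOM", "SICK", "UNC"] * 5,
    "BOROUGH": ["BRONX", "QUEENS", "BROOKLYN", "MANHATTAN", "BRONX"] * 5,
}).to_csv(big_csv, index=False)

# With chunksize set, read_csv hands back one DataFrame per chunk.
df_chunks = pd.read_csv(big_csv, chunksize=10)

chunk_number = 0
chunk_sizes = []
for df_segment in df_chunks:
    chunk_number += 1
    # Save each segment as its own, more manageable CSV.
    out_path = os.path.join(tmp_dir, f"ems-chunk-{chunk_number}.csv")
    df_segment.to_csv(out_path, index=False)
    chunk_sizes.append(len(df_segment))

print(chunk_number, chunk_sizes)
```

The last chunk is usually smaller than the others because the row count rarely divides evenly by the chunk size.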

In [None]:
## import libraries


In [None]:
## take big csv and chunk it


In [None]:
## take big csv and chunk it


In [None]:
## call df_segment


In [None]:
## we can save a segment to a more manageable csv


In [None]:
## we have to save each one as a file


In [None]:
## we have to save each one as a file


In [None]:
## what is the chunk number if you call it now?


In [None]:
## let's look at one


In [None]:
## get big picture view


In [None]:
## filter to see "FIRST_HOSP_ARRIVAL_DATETIME"


In [None]:
## find all unique categories in 'INITIAL_CALL_TYPE'


In [None]:
## what are the unique injuries


In [None]:
## query and filter for venom and incident dates and boroughs


### 4. `glob`

Our goal is to analyze all the files, not a single subset. 

We need a way to `iterate` through all the files.

A `list` is also known as an `iterable` because it contains items that can be **iterated over**.

We need to take the chunked CSVs and store them in a single list that we can then iterate over.

We use a module called `glob`, part of Python's standard library, that gathers all the files matching a pattern into a list.

Let's see how it works:
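
Here is a small self-contained sketch. It creates a few placeholder files in a temporary folder (the file names are made up) and then uses `glob.glob()` with a wildcard pattern to collect only the ones we want.

```python
import glob
import os
import tempfile

# Create placeholder files so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
for name in ["ems-chunk-1.csv", "ems-chunk-2.csv", "notes.txt"]:
    with open(os.path.join(tmp_dir, name), "w") as f:
        f.write("placeholder\n")

# "*" matches any run of characters, so "*.csv" means every CSV file.
csv_files = glob.glob(os.path.join(tmp_dir, "*.csv"))
all_files = glob.glob(os.path.join(tmp_dir, "*"))

print(len(all_files), "files in total")
print(sorted(os.path.basename(path) for path in csv_files))
```

`glob.glob()` returns an ordinary list of paths, which means we can loop over it like any other list. The order of that list is not guaranteed, so sort it if order matters.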


In [None]:
## import libraries


### The power of ```glob``` comes from its ability to gather any target files we want.


In [None]:
## grab only the csv files


In [None]:
# ## pip install natsort
!pip install natsort



In [None]:
# ## import library and module
from natsort import natsorted
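
Why bother with `natsort`? Python's built-in `sorted()` compares file names character by character, so chunk 10 lands before chunk 2. The sketch below shows the difference using only the standard library: `natural_key` is a hypothetical helper written for illustration, not how `natsorted` is implemented, and the file names are made up.

```python
import re

# Made-up chunk file names.
file_names = ["ems-chunk-10.csv", "ems-chunk-2.csv", "ems-chunk-1.csv"]

def natural_key(name):
    # Split into text and digit runs, turning digit runs into integers
    # so that 2 sorts before 10.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]

lexicographic_order = sorted(file_names)
natural_order = sorted(file_names, key=natural_key)

print(lexicographic_order)
print(natural_order)
```

In the notebook, passing the list of files to `natsorted()` gives the natural ordering without a hand-written key.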



In [None]:
# ## sort in natural order (chunk 2 before chunk 10), not plain lexicographical order


In [None]:
## iterate through all the files and pull out "drug" incidents only
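
The overall pattern is: glob the chunk files, loop through them, filter each one, append the filtered piece to a list, and concatenate the pieces at the end. The sketch below runs that pattern on two tiny made-up chunk files. Filtering on `INITIAL_CALL_TYPE == "VENOM"` is an assumption for illustration; swap in whichever column and value you are after.

```python
import glob
import os
import tempfile

import pandas as pd

# Hypothetical stand-in chunks so the sketch runs without the real EMS files.
tmp_dir = tempfile.mkdtemp()
sample_chunks = [
    [("VENOM", "BRONX"), ("DRUG", "QUEENS"), ("INJURY", "BROOKLYN")],
    [("DRUG", "BRONX"), ("VENOM", "QUEENS"), ("VENOM", "BRONX")],
]
for number, rows in enumerate(sample_chunks, start=1):
    chunk_df = pd.DataFrame(rows, columns=["INITIAL_CALL_TYPE", "BOROUGH"])
    chunk_df.to_csv(os.path.join(tmp_dir, f"ems-chunk-{number}.csv"), index=False)

# Gather the chunk files, then filter each one and keep only the matches.
csv_files = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
venom_list = []
for csv_file in csv_files:
    df = pd.read_csv(csv_file)
    venom_list.append(df[df["INITIAL_CALL_TYPE"] == "VENOM"])

# Reassemble the filtered pieces into one DataFrame covering every chunk.
df_venom = pd.concat(venom_list, ignore_index=True)

print(len(df_venom))
print(df_venom["BOROUGH"].value_counts())
print(df_venom["BOROUGH"].value_counts(normalize=True) * 100)
```

Only one chunk is held in memory at a time, and only the filtered rows are kept, which is what makes the full dataset manageable.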


In [None]:
## how many?


In [None]:
## call a single one


In [None]:
## see a sample, with random source list


In [None]:
## call a sample of 20


In [None]:
## filter venom to show incident date, severity level code and borough
df_venom.filter(["INCIDENT_DATETIME", "INITIAL_SEVERITY_LEVEL_CODE","BOROUGH"])

In [None]:
## value count by borough


In [None]:
## percentage by borough


In [None]:
## iterate through all the files and pull out "final severity levels between 6 and 7 inclusive" incidents only
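
For a range filter like this one, `Series.between()` is one option, since it includes both endpoints by default. The sketch shows the filter on a single made-up DataFrame; inside the loop it would be applied to each chunk before appending. `FINAL_SEVERITY_LEVEL_CODE` is an assumed column name, so check it against your own data.

```python
import pandas as pd

# Made-up rows; FINAL_SEVERITY_LEVEL_CODE is an assumed column name.
df = pd.DataFrame({
    "FINAL_SEVERITY_LEVEL_CODE": [2, 6, 7, 8, 6],
    "BOROUGH": ["BRONX", "QUEENS", "BROOKLYN", "BRONX", "MANHATTAN"],
})

# between() keeps rows where the value is >= 6 and <= 7.
df_severity = df[df["FINAL_SEVERITY_LEVEL_CODE"].between(6, 7)]

print(df_severity)
```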


In [None]:
## see a sample, with random source list
