# Control Structures
The last chapter promised a look at loops as a way around a few limitations of the base Python functions. We were never truly limited, of course! Programming in general relies heavily on iterative code that can repeat, change values, and follow steps of logic written by the programmer. Think of any repetitive task you perform often, and how much easier it would be if you could automate it. Suppose, for example, that one of your responsibilities is to update your school's ACLED conflict data once every quarter. This would mean downloading each region's individual CSV file, merging the files, and cleaning the variables you pay close attention to. That is what you will practice in this chapter: creating loops (and functions!) that repeat these steps as many times as there are CSV files to merge, so your job will be that much easier by next quarter.

It has been our personal experience that many Python lessons and tutorials give abstract examples when teaching learners about looping and conditional statements. Hopefully this chapter will keep the lessons and examples grounded in real data with real tasks you might perform as a social science researcher and analyst. However, you will still need to see some abstract examples to start off each section of this chapter. It is important that you can recognize all the components of a loop before focusing on practical applications. 


In [1]:
# This code cell will be in every one of our chapters in Jupyter Notebook
# This setting lets you see the output of every line in a cell, not just the last one
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

### Basic concepts
When writing loops, you will be repeating (or _iterating_) functions _x_ number of times. The value of that _x_ is _very_ important, and it usually comes from the shape or _length_ of a data object. Length here means the number of columns or the number of rows/observations, and we have the `len()` function to obtain that number. You will often combine `len()` with `range()`, which turns a single number into a sequence of values to loop over; a Python for-loop cannot iterate over a single integer, it needs an iterable object such as a range or a list. You will also become better acquainted with index numbers, which extract specific values from a vector or a column in your data.

Let's start by thinking about how much data we are dealing with. Create the `height` object once again and see what `len()` and `range()` give you.

In [12]:
height = [177, 174, 170, 183, 168, 
          182, 163, 191, 177, 176, 
          173, 186, 174, 168, 184, 
          170, 170, 192, 181, 173]
print('The "height" object is =')
height
print('The "len(height)" object is =')
len(height)
print('The "range(20)" object is =')
range(20)
print('The "range(len(height))" object is =')
range(len(height))

The "height" object is =


[177,
 174,
 170,
 183,
 168,
 182,
 163,
 191,
 177,
 176,
 173,
 186,
 174,
 168,
 184,
 170,
 170,
 192,
 181,
 173]

The "len(height)" object is =


20

The "range(20)" object is =


range(0, 20)

The "range(len(height))" object is =


range(0, 20)
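Because `range(0, 20)` prints as a compact description rather than the numbers themselves, it can help to wrap it in `list()` to see exactly which values a loop will visit. The sketch below is a minimal illustration on a shortened copy of the data; the names `short_height` and `positions` are ours, chosen only for this example.

```python
# A shortened copy of the height data, so the output stays small
short_height = [177, 174, 170, 183, 168]

# list() forces the range to spell out its values
positions = list(range(len(short_height)))
print(positions)          # [0, 1, 2, 3, 4]

# range() stops one short of its end value, so the last position is len - 1
print(positions[-1] == len(short_height) - 1)   # True
```

Note that the values start at 0 and stop at 4, not 5, which is why they line up with Python's index positions.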

## For Loops
A for-loop is a control structure in Python that repeats the code placed after the colon `:` in the looping syntax. It repeats (or loops, or _iterates_) once for each value you give it: "for this many values". If a vector had twenty values, a for-loop that iterated over this vector would run twenty times, until it reached the end of the object. This is why `len()` and `range()` work so well together in for-loops. The basic grammar is:

`for i in x: do y()`

At its very simplest, a for-loop repeats a function _x_ number of times. This grammar means that `i` will take on _each_ value that exists in _x_, one at a time:

In [13]:
for i in range(5): i

0

1

2

3

4

In [15]:
# any list of values after 'in' becomes a value for 'i' in the iteration steps
for i in ('apples','bananas',99,[1,2,3]): i

'apples'

'bananas'

99

[1, 2, 3]

In [14]:
# a list object (in square brackets) can be 'iterable' even if it only has a length of one
for i in [5]: i

# this is true for strings too, which are basically lists of letters
for i in 'apples': i

# but a single-value object after 'in' will give you an error!
for i in (1): i

5

'a'

'p'

'p'

'l'

'e'

's'

TypeError: 'int' object is not iterable
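The error above comes from how Python reads parentheses: `(1)` is simply the integer `1`, while a trailing comma, as in `(1,)`, makes a one-element tuple that a loop can iterate over. A small sketch of the difference (the variable names are hypothetical, and `print()` is used so the example behaves the same outside of Jupyter):

```python
not_a_tuple = (1)      # parentheses alone do not make a tuple
one_tuple = (1,)       # the trailing comma does

print(type(not_a_tuple).__name__)   # int
print(type(one_tuple).__name__)     # tuple

collected = []
for i in one_tuple:    # iterates exactly once
    collected.append(i)
print(collected)       # [1]
```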

We used the letter `i` after 'for' as the object where Python stores the value of the current iteration. The letter `i` is a long-standing convention, often read as short for 'iteration' or 'index', but it does not _need_ to be `i`; any valid variable name works.
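To make the point about naming concrete, here is the same kind of loop written with a different iteration variable. This is only a sketch: `sample`, `h`, and `seen` are names chosen for illustration, and `print()` stands in for the bare `i` that Jupyter displays.

```python
sample = [177, 174, 170]

seen = []
for h in sample:        # 'h' plays exactly the role 'i' played before
    print(h)
    seen.append(h)
```

A descriptive name can make longer loops easier to read, especially once loops are nested.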

<div class="alert alert-block alert-info">Note: as you type the first line "for i in height:" and press the enter key after the colon `:`, Jupyter Notebook automatically indents the next line, adding extra space on the left. This indentation tells Python that the line is not a separate statement but part of the loop's body. You could also keep the function on the same line as the colon without a line break, but an indented block looks neater when you have a lot of code after the `for` declaration.</div>

The function you're calling in these examples is just `i` because we want you to know what `i`, the variable after 'for', means. In the first loop below, the iteration variable `i` takes on the values of the `height` vector; in the second loop, `i` takes on the values of `range(len(height))` (that is, `range(0, 20)`, as we saw in the first code cell).

In [16]:
for i in height:
    i # here, 'i' is the value stored at each position of the vector


for i in range(len(height)):
    i # here, 'i' is a position (index) in the range 0 to 19

177

174

170

183

168

182

163

191

177

176

173

186

174

168

184

170

170

192

181

173

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

Remember how last chapter we needed NumPy to transform `height` into inches? Now we can loop over our vector to calculate the values in inches. With the second loop above, where we iterate over the _range of the length_ of the height vector, we got values from 0 to 19. These are the index positions of each observation in the vector. That means you could use the increasing values of `i` to extract a vector's values in order and perform math operations on each value (`height[i]*x`). Looping over positions is also useful when you care more about the length of the vector (and thus the number of iterations) than the actual values in it.

This gives you two ways to make a math operation work in a loop: you can make `i` take the 0-19 range of positions (the first loop below), or you can make `i` take the actual values within the height vector (the second, shorter loop). We won't tell you which choice is better or worse: both exist, and each is appropriate for different coding contexts.

In [7]:
# Both lines output the same values, but 'i' is different in each loop.
for i in range(len(height)):
    height[i]*0.393701 # here, 'i' represents positions in the range 0-19, so we extract height values by indexing height[i]

for i in height:
    i*0.393701 # here 'i' takes the literal values in height, so we can calculate directly

69.685077

68.503974

66.92917

72.04728300000001

66.141768

71.653582

64.173263

75.19689100000001

69.685077

69.291376

68.110273

73.228386

68.503974

66.141768

72.440984

66.92917

66.92917

75.590592

71.25988100000001

68.110273

69.685077

68.503974

66.92917

72.04728300000001

66.141768

71.653582

64.173263

75.19689100000001

69.685077

69.291376

68.110273

73.228386

68.503974

66.141768

72.440984

66.92917

66.92917

75.590592

71.25988100000001

68.110273

This looks like what we want, but the loop's output didn't actually make a vector / list data type, only 20 separate values for height in inches. One solution is to take advantage of the `append()` method on an empty list. Just make a blank list object called `height_inches`, and `append()` will add the current value onto the end of the list once for each value in `height`:

In [8]:
height_inches=list()
height_inches # this prints a blank list '[]'


for i in height:
    height_inches.append(i*0.393701)

height_inches # inspect. now we have a list data type with 20 values.
type(height_inches)

[]

[69.685077,
 68.503974,
 66.92917,
 72.04728300000001,
 66.141768,
 71.653582,
 64.173263,
 75.19689100000001,
 69.685077,
 69.291376,
 68.110273,
 73.228386,
 68.503974,
 66.141768,
 72.440984,
 66.92917,
 66.92917,
 75.590592,
 71.25988100000001,
 68.110273]

list

## List Comprehension
If you are mainly interested in creating a new vector, however, __list comprehension__ is the preferred loop structure in Python. It looks and behaves like the for-loop, but it has a more succinct syntax and typically runs somewhat faster. The basic pattern for a list comprehension looks like this:

`[do y for i in x]` 

Notice how the list comprehension code is wrapped in square brackets, which is how you write a list of values in Python. Look at some examples of the most basic list comprehension syntax below. The iteration process is similar to the for-loop, but the output is automatically a list.

In [None]:
[i for i in height] # 'i' stands for each value in the vector, so this returns a new list with the same values.

[i for i in range(len(height))] # 'i' stands for the position (index), as opposed to the value. The output is still a list

[i/10 for i in height] # all our operations happen to the left of the 'for'

All that's left to create the height in inches vector (more efficiently this time) is to assign the list comprehension to a new object on the left. There are several differences from our previous for-loop, as you can see below. First, the new list is assigned outside of the loop rather than built up within it. Second, list comprehension does not require a blank list object or the `append()` method, because a list comprehension always produces a list. Lastly, the operation $height \times 0.393701$ happens to the left of the 'for' term.

In [9]:
height_inches = [i*0.393701 for i in height]

type(height_inches)
height_inches

list

[69.685077,
 68.503974,
 66.92917,
 72.04728300000001,
 66.141768,
 71.653582,
 64.173263,
 75.19689100000001,
 69.685077,
 69.291376,
 68.110273,
 73.228386,
 68.503974,
 66.141768,
 72.440984,
 66.92917,
 66.92917,
 75.590592,
 71.25988100000001,
 68.110273]
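As a quick check that the two approaches really are interchangeable, the sketch below builds the inches list both ways on a three-value sample and compares the results. The names `sample`, `CM_TO_INCHES`, `loop_result`, and `comp_result` are ours, and the `round()` call is an optional extra, not part of the original example, that trims floating-point noise such as `72.04728300000001`.

```python
sample = [177, 174, 183]
CM_TO_INCHES = 0.393701   # same conversion factor used in the chapter

# for-loop with append
loop_result = []
for i in sample:
    loop_result.append(i * CM_TO_INCHES)

# list comprehension
comp_result = [i * CM_TO_INCHES for i in sample]

print(loop_result == comp_result)          # True

# optional: round each value to two decimals inside the comprehension
rounded = [round(i * CM_TO_INCHES, 2) for i in sample]
print(rounded)                             # [69.69, 68.5, 72.05]
```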



## Nested Loops

When you place any loop inside another, you are using __nested loops__. Nesting is incredibly common in data analysis and in real world data, where observations are grouped or categorized within other variable values. An example we have seen before in the pandas II chapter is countries within years in our ACLED data. Let's use this as an example of how we can loop over the years and countries in one region. What this example will show you is that data itself is nested, so if you are interested in repeating one analysis for all the countries in the data for one year, then the same analysis for all the countries in the next year, and so on, you would use a nested loop.

Import the ACLED CSV data from East Asia.

In [2]:
import pandas as pd
east_asia = pd.read_csv('../../Data/ACLED/1900-01-01-2022-04-22-East_Asia.csv')

str(east_asia['year'].loc[0])

'2022'

In this case, because we would like to group our dataframe by each year and country, it won't do to iterate over every value of the `year` and `country` variables. Each year and country appears in many rows, so our for-loops would repeat the same work many times over and take a very long time to finish. Instead, we want each _unique_ value of country and year.

In [None]:
east_asia['year'].unique()
east_asia['country'].unique()
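To see why `unique()` matters here, consider a tiny made-up dataframe (the values below are invented for illustration and are not ACLED data). Looping over the raw column would visit every row, while looping over the unique values visits each year and country only once.

```python
import pandas as pd

# Hypothetical toy data: six events across two years and two countries
toy = pd.DataFrame({
    'year':    [2022, 2022, 2022, 2021, 2021, 2021],
    'country': ['Japan', 'Japan', 'China', 'China', 'Japan', 'China'],
})

print(len(toy['year']))            # 6 rows in total
print(toy['year'].unique())        # only the distinct years, in order of appearance

# A nested loop over unique values runs 2 x 2 = 4 times, not 6 x 6 = 36
pairs = []
for y in toy['year'].unique():
    for c in toy['country'].unique():
        pairs.append(str(y) + ' ' + c)
print(pairs)    # ['2022 Japan', '2022 China', '2021 Japan', '2021 China']
```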

Place a for-loop over countries "inside" of a for-loop over years and print a concatenated string of both:

In [8]:
for y in east_asia['year'].unique():
    for c in east_asia['country'].unique():
        str(y)+' '+c

'2022 South Korea'

'2022 Japan'

'2022 Taiwan'

'2022 Mongolia'

'2022 North Korea'

'2022 China'

'2021 South Korea'

'2021 Japan'

'2021 Taiwan'

'2021 Mongolia'

'2021 North Korea'

'2021 China'

'2020 South Korea'

'2020 Japan'

'2020 Taiwan'

'2020 Mongolia'

'2020 North Korea'

'2020 China'

'2019 South Korea'

'2019 Japan'

'2019 Taiwan'

'2019 Mongolia'

'2019 North Korea'

'2019 China'

'2018 South Korea'

'2018 Japan'

'2018 Taiwan'

'2018 Mongolia'

'2018 North Korea'

'2018 China'

Handy! We can use the same loop structure to make subsets and crosstabs of our data. The outer loop subsets each year of data into a placeholder `temp` dataframe. The nested inner loop prints the country and year, then subsets `temp` by country with `temp[temp['country'].str.contains(c)]` and sums fatalities by event type. These are functions you learned in the previous pandas II chapter that we are generalizing to the ever-changing values of `y` and `c`.

In [13]:
for y in east_asia['year'].unique():
    temp=east_asia[east_asia['year']==y]
    for c in east_asia['country'].unique():
        print(c+' '+str(y))
        temp[temp['country'].str.contains(c)].groupby('event_type').aggregate({'fatalities':'sum'})


South Korea 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0


Japan 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0


Taiwan 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0


Mongolia 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0


North Korea 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Explosions/Remote violence,0
Riots,0
Strategic developments,0
Violence against civilians,21


China 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,1


South Korea 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Japan 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Strategic developments,0


Taiwan 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Mongolia 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


North Korea 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,18
Violence against civilians,14


China 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,1
Explosions/Remote violence,5
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,6


South Korea 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,0
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Japan 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Strategic developments,0


Taiwan 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Mongolia 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


North Korea 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Explosions/Remote violence,0
Protests,0
Strategic developments,7
Violence against civilians,1


China 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Explosions/Remote violence,1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,8


South Korea 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Japan 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Taiwan 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Mongolia 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


North Korea 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Riots,1
Strategic developments,0
Violence against civilians,9


China 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,0
Explosions/Remote violence,1
Protests,0
Riots,1
Strategic developments,0
Violence against civilians,10


South Korea 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Japan 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Taiwan 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Mongolia 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Strategic developments,0
Violence against civilians,1


North Korea 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Strategic developments,0
Violence against civilians,0


China 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,3
Protests,3
Riots,5
Strategic developments,0
Violence against civilians,30


Reversing the order of the for-loops groups the same output by country first: the outer loop now subsets by country, and the inner loop subsets by year.

In [14]:
for c in east_asia['country'].unique():
    temp=east_asia[east_asia['country'].str.contains(c)]
    for y in east_asia['year'].unique():
        print(c+' '+str(y))
        temp[temp['year']==y].groupby('event_type').aggregate({'fatalities':'sum'})


South Korea 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0


South Korea 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


South Korea 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,0
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


South Korea 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


South Korea 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Japan 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0


Japan 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Strategic developments,0


Japan 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Strategic developments,0


Japan 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Japan 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Taiwan 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0


Taiwan 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Taiwan 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Taiwan 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Taiwan 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Mongolia 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0


Mongolia 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,0


Mongolia 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Mongolia 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0


Mongolia 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Strategic developments,0
Violence against civilians,1


North Korea 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Explosions/Remote violence,0
Riots,0
Strategic developments,0
Violence against civilians,21


North Korea 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,18
Violence against civilians,14


North Korea 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Explosions/Remote violence,0
Protests,0
Strategic developments,7
Violence against civilians,1


North Korea 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Riots,1
Strategic developments,0
Violence against civilians,9


North Korea 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Strategic developments,0
Violence against civilians,0


China 2022


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,1


China 2021


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,1
Explosions/Remote violence,5
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,6


China 2020


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Explosions/Remote violence,1
Protests,0
Riots,0
Strategic developments,0
Violence against civilians,8


China 2019


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,0
Explosions/Remote violence,1
Protests,0
Riots,1
Strategic developments,0
Violence against civilians,10


China 2018


Unnamed: 0_level_0,fatalities
event_type,Unnamed: 1_level_1
Battles,3
Protests,3
Riots,5
Strategic developments,0
Violence against civilians,30

## While Loops
When the exact number of iterations in your loop depends on a condition, a while-loop lets you keep iterating for as long as a logical true-false condition remains "true". This is called "indefinite iteration", and the while-loop structure lacks any kind of 'for' object to reference. Its basic structure is:

`while i ... x: do(y)`

In this example, the ellipsis `...` stands for any Python operator that produces a true or false result (the kind we introduced in the previous chapter). So we might use a comparison operator, as in `while i < 10`. This means you must set a starting value for `i` before the while-loop and explicitly update the value of `i` inside it:

In [2]:
i=0 # manually set a starting value
while i < 10:
    print(i,' is less than 10')
    i=i+1 # and manually increment the value of 'i' since there is no reference object to follow along

0  is less than 10
1  is less than 10
2  is less than 10
3  is less than 10
4  is less than 10
5  is less than 10
6  is less than 10
7  is less than 10
8  is less than 10
9  is less than 10
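A while-loop earns its keep when you cannot say in advance how many iterations are needed. The sketch below, using an arbitrary starting value of 100, keeps halving a number until it drops below 1 and counts the steps along the way. The names `value` and `steps` are ours.

```python
value = 100      # arbitrary starting value, chosen for illustration
steps = 0

while value >= 1:
    value = value / 2    # the update that eventually makes the condition false
    steps = steps + 1

print(steps)             # 7
print(value)             # 0.78125
```

If the line that updates `value` were removed, the condition would stay true and the loop would never end.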


When you know the exact number of iterations you want to perform, while- and for-loops can do the same job by stepping through an increasing value of `i`. However, you might not know how many cases in your data match some criterion, and in that case a while-loop may be the better tool. We'll illustrate with an example from the ACLED conflict data. Let's load the CSV data from the Central America region.

In [8]:
central_america = pd.read_csv('../../Data/ACLED/1900-01-01-2022-04-22-Central_America.csv')

Now that we have a dataframe for Central America, we can give a working example of how a while-loop can help us with data cleaning and analysis. Imagine your research organization has a specific criterion for when a conflict event can be considered 'catastrophic': when a conflict event had eight fatalities or more. There are _many_ ways you could subset your dataframe in this way, including some you have already learned about in our chapters on pandas! In this section we will try subsetting our data by the criterion `fatalities >= 8` inside a while-loop.

Because a while-loop stops as soon as an evaluated value does _not_ meet the condition, we have to be sure that `fatalities` in our data is sorted in descending order, from most to least fatalities. That way we can be sure the while-loop will not miss a value of eight or greater. We accomplish this with the dataframe's `sort_values()` method and `ascending=False`. The `reset_index()` method is also necessary in our example because sorting shuffles the row index values, and `reset_index()` makes them run in order from zero again. Note: we are doing this to create an artificial or hypothetical dataset; you should normally never need to go through this much trouble to filter out values in a dataframe!

In [11]:
# sort by descending values of fatalities
central_america_sorted = central_america.sort_values('fatalities',ascending=False)

# sort_values does not renumber the index! Let's make it run from 0 upward again.
central_america_sorted.reset_index(drop=True, inplace=True)

You can now start iterating down the decreasing values of `fatalities` inside of a for-loop, with a nested while-loop checking whether our condition `fatalities>=8` is met. In a nutshell, the for-loop gives us the current row `j` as we move from top to bottom, which lets us extract rows using `central_america_sorted.loc[[j]]`. We keep the value of `i` up to date inside the for-loop, then call a while-loop that evaluates the value of `i`. If the while condition is true, we add the row to a new `catastrophic` dataframe, and then (this is critical!) __change the value of `i`__ to be less than eight to break free from the while-loop and return to the for-loop. Otherwise the loop would print the first value of `i` forever!

In [12]:
# make an empty dataframe to take subset data
catastrophic=pd.DataFrame()


for j in range(len(central_america_sorted['fatalities'])):
    i = central_america_sorted.loc[j,'fatalities'] # update `i` with the value for fatalities in row 'j' of central_america_sorted
    while i >= 8:
        print('i =',i)
        catastrophic = pd.concat([catastrophic,central_america_sorted.loc[[j]]])       
        i=0 # reset i to a value less than 8 to break out of the current while loop... or else

catastrophic.head()

i = 20
i = 19
i = 18
i = 17
i = 14
i = 12
i = 12
i = 11
i = 9
i = 9
i = 9
i = 8
i = 8


Unnamed: 0,data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
0,6926012,340,HND1224,1224,20 December 2019,2019,1,Riots,Mob violence,Rioters (Honduras),...,Tela,15.7743,-87.4673,1,La Prensa (Honduras); El Heraldo (Honduras),National,"On 20 December 2019, in Tela, Atlantida, a mob...",20,1618530900,HND
1,6939157,340,HND1225,1225,22 December 2019,2019,1,Battles,Armed clash,MS-13: Mara Salvatrucha,...,El Porvenir,15.666,-86.9105,1,El Heraldo (Honduras),National,"On 22 December 2019, in El Porvenir, Francisco...",19,1618530900,HND
2,7586581,340,HND3157,3157,22 December 2019,2019,1,Riots,Mob violence,Rioters (Honduras),...,Tegucigalpa,14.0818,-87.2068,1,AFP,International,"On 22 December 2019, in Tegucigalpa, Francisco...",18,1618531054,HND
3,7348738,558,NIC377,377,22 April 2018,2018,1,Protests,Excessive force against protesters,Protesters (Nicaragua),...,Managua,12.1328,-86.2504,1,Organization of American States,Other,"On 22 April 2018, seventeen people were killed...",17,1607554574,NIC
4,8828910,320,GTM4489,4489,17 December 2021,2021,1,Battles,Armed clash,Nahuala Communal Militia (Guatemala),...,Chiquix,14.7998,-91.3977,1,El Periodico; Prensa Comunitaria; Prensa Libre...,Local partner-National,"On 17 December 2021, in Chiquix, Solola, membe...",14,1644864568,GTM
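The reason sorting matters for a while-loop can be seen on a plain list. In the sketch below (the fatality counts are invented), the while-loop walks down a descending list and stops at the first value under the threshold; on an unsorted list the same loop stops too early. The names `fatalities_sorted`, `threshold`, `position`, and `found` are ours.

```python
fatalities_sorted = [20, 12, 9, 8, 5, 3, 0]   # made-up values, descending
threshold = 8

catastrophic_values = []
position = 0
# stop at the end of the list or at the first value below the threshold
while position < len(fatalities_sorted) and fatalities_sorted[position] >= threshold:
    catastrophic_values.append(fatalities_sorted[position])
    position = position + 1

print(catastrophic_values)    # [20, 12, 9, 8]

# The same loop on unsorted data stops at the first small value and misses the rest
unsorted = [20, 3, 12, 9]
found = []
position = 0
while position < len(unsorted) and unsorted[position] >= threshold:
    found.append(unsorted[position])
    position = position + 1
print(found)                  # [20]
```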


## Conditional Statements
Although we used the logic of while- and for-loops to iterate over a desired number of steps in our previous examples, we could have employed conditional statements inside of the loops to fine-tune our code.

### If
The first of these is the 'if' statement, the most common and useful of them all. Using the same example as we used when explaining while-loops, you could have made a different for-loop altogether with a simple if statement. 

The basic syntax for if statements is:

`if x ... z: do(y)`

In [42]:
i = 0
if i < 5: 
    print(i,'< 5') # this is true and will print

i = 10
if i < 5: 
    print(i,'< 5') # this will print nothing because it evaluates as false

0 < 5


Now you could replace the while statement from the previous example with an if statement! In fact, 'if' is more generalizable than 'while': it checks the condition at every step instead of stopping at the first failure, so the same loop would work on the unsorted Central America dataframe too. Here we keep iterating over `central_america_sorted` so the output can be compared with the while-loop version. Only when fatalities >= 8 will we call the `concat()` function to add rows to the new dataframe.

In [43]:
# make an empty dataframe to take subset data again
catastrophic=pd.DataFrame()


for i in range(len(central_america_sorted['fatalities'])):
    x = central_america_sorted.loc[i,'fatalities'] # update `x` with the value for fatalities in row 'i' of central_america_sorted
    if x >= 8:
        print('x =',x)
        catastrophic = pd.concat([catastrophic,central_america_sorted.loc[[i]]])       

catastrophic.head()

x = 20
x = 19
x = 18
x = 17
x = 14
x = 12
x = 12
x = 11
x = 9
x = 9
x = 9
x = 8
x = 8


Unnamed: 0,data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
0,6926012,340,HND1224,1224,20 December 2019,2019,1,Riots,Mob violence,Rioters (Honduras),...,Tela,15.7743,-87.4673,1,La Prensa (Honduras); El Heraldo (Honduras),National,"On 20 December 2019, in Tela, Atlantida, a mob...",20,1618530900,HND
1,6939157,340,HND1225,1225,22 December 2019,2019,1,Battles,Armed clash,MS-13: Mara Salvatrucha,...,El Porvenir,15.666,-86.9105,1,El Heraldo (Honduras),National,"On 22 December 2019, in El Porvenir, Francisco...",19,1618530900,HND
2,7586581,340,HND3157,3157,22 December 2019,2019,1,Riots,Mob violence,Rioters (Honduras),...,Tegucigalpa,14.0818,-87.2068,1,AFP,International,"On 22 December 2019, in Tegucigalpa, Francisco...",18,1618531054,HND
3,7348738,558,NIC377,377,22 April 2018,2018,1,Protests,Excessive force against protesters,Protesters (Nicaragua),...,Managua,12.1328,-86.2504,1,Organization of American States,Other,"On 22 April 2018, seventeen people were killed...",17,1607554574,NIC
4,8828910,320,GTM4489,4489,17 December 2021,2021,1,Battles,Armed clash,Nahuala Communal Militia (Guatemala),...,Chiquix,14.7998,-91.3977,1,El Periodico; Prensa Comunitaria; Prensa Libre...,Local partner-National,"On 17 December 2021, in Chiquix, Solola, membe...",14,1644864568,GTM


### Else
The followup to an unmet 'if' condition is 'else'. The else statement is useful when you are interested in the two complementary/opposite outcomes of your if statement. When the if condition is true, you perform function y, but if it is false, you do function b! The syntax takes two lines:

```
if x ... z: do(y)

else: do(b)
```

In [44]:
i = 0
if i < 5: 
    print(i,'< 5') # the condition is true, so this line prints
else: 
    print(i,'>= 5') # the else branch is skipped

# inversely
i = 10
if i < 5: 
    print(i,'< 5') # the condition is false, so this line is skipped
else: 
    print(i,'>= 5') # the else branch runs instead

0 < 5
10 >= 5


We will keep our example simple and repeat the previous if statement, but add an else statement. This means something must happen for every observation where `fatalities` is _less than_ eight, so let's make another dataframe for non-catastrophic events. This loop is unfortunately very inefficient because it calls `concat()` once per row, for more than seventeen thousand rows, and every call builds a new, slightly larger dataframe. So let it run until your kernel status says "Idle", and ponder why Python programmers say you should not loop over pandas dataframes. 

In [45]:
# make two empty dataframes to receive the subsets
catastrophic=[]
catastrophic=pd.DataFrame(catastrophic)
non_catastrophic=[]
non_catastrophic=pd.DataFrame(non_catastrophic)


for i in range(len(central_america_sorted['fatalities'])):
    x = central_america_sorted.loc[i,'fatalities'] # update `x` with the value for fatalities in row 'i' of central_america_sorted
    if x >= 8:
        catastrophic = pd.concat([catastrophic,central_america_sorted.loc[[i]]])
    else: 
        non_catastrophic = pd.concat([non_catastrophic,central_america_sorted.loc[[i]]]) 

In [47]:
catastrophic.shape
non_catastrophic.shape

(13, 31)

(17218, 31)
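
The same two-way split can usually be done without a loop at all. The following is a minimal sketch, assuming a small made-up dataframe (`toy` and its values are hypothetical stand-ins, not ACLED rows): a boolean mask tests every row at once, and the `~` operator flips the mask to give the "else" side.

```python
import pandas as pd

# Hypothetical stand-in for central_america_sorted; the values are made up
toy = pd.DataFrame({'event': ['a', 'b', 'c', 'd', 'e'],
                    'fatalities': [20, 9, 8, 7, 0]})

# One True/False value per row, computed in a single step
is_catastrophic = toy['fatalities'] >= 8

catastrophic_toy = toy[is_catastrophic]       # rows where the "if" condition holds
non_catastrophic_toy = toy[~is_catastrophic]  # every other row, i.e. the "else" side

print(catastrophic_toy.shape, non_catastrophic_toy.shape)
```

Because the comparison runs on the whole column, nothing is concatenated row by row, which is the main reason this style tends to be much faster on large dataframes.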

### Elif
When you have more than two conditions, 'elif' provides the second condition after 'if', but before the catch-all 'else' statement. The syntax is three lines this time:

```
if x ... z: do(y)

elif x ... w: do(a)

else: do(b)
```

In [32]:
i = 0
if i < 0: 
    print(i,'< 0') # this is false and will not print

elif i == 0:
    print(i,'= 0') # this is true and will print
    
else: 
    print(i,'> 0') # never reached, because the elif condition was already met

0 = 0


In total, an 'elif' statement combined with 'if' and 'else' lets you create three conditions for your data. In the ACLED example, we used an arbitrary cutoff where events with eight or more fatalities were categorized as catastrophic. Now let's imagine a new category called 'deadly' for any events with one to seven fatalities. Any other events would be 'non-deadly'. This is a clear use case for the elif statement in our same data-sorting exercise. In our elif condition, we also get to use the `and` operator to create an inclusive bucket of values between 1 and 7.

In [46]:
# make three empty dataframes to receive the subsets
catastrophic=[]
catastrophic=pd.DataFrame(catastrophic)
deadly=[]
deadly=pd.DataFrame(deadly)
non_deadly=[]
non_deadly=pd.DataFrame(non_deadly)


for i in range(len(central_america_sorted['fatalities'])):
    x = central_america_sorted.loc[i,'fatalities'] # update `x` with the value for fatalities in row 'i' of central_america_sorted
    if x >= 8:
        catastrophic = pd.concat([catastrophic,central_america_sorted.loc[[i]]])
    elif x <= 7 and x > 0 :
        deadly =  pd.concat([deadly,central_america_sorted.loc[[i]]])
    else: 
        non_deadly = pd.concat([non_deadly,central_america_sorted.loc[[i]]]) 

In [50]:
# inspect the fatalities ranges for each dataframe
catastrophic['fatalities'].describe()
deadly['fatalities'].describe()
non_deadly['fatalities'].describe()

count    13.000000
mean     12.769231
std       4.380903
min       8.000000
25%       9.000000
50%      12.000000
75%      17.000000
max      20.000000
Name: fatalities, dtype: float64

count    6966.000000
mean        1.328309
std         0.739818
min         1.000000
25%         1.000000
50%         1.000000
75%         1.000000
max         7.000000
Name: fatalities, dtype: float64

count    10252.0
mean         0.0
std          0.0
min          0.0
25%          0.0
50%          0.0
75%          0.0
max          0.0
Name: fatalities, dtype: float64
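
The three-way rule is easier to reuse once it lives in a function. This is a minimal sketch, not part of the original notebook: `severity` is a hypothetical helper name, and the counts in the list are made up rather than taken from ACLED.

```python
def severity(fatalities):
    """Return a category label for one event's fatality count."""
    if fatalities >= 8:
        return 'catastrophic'
    elif fatalities > 0 and fatalities <= 7:
        return 'deadly'
    else:
        return 'non-deadly'

# Made-up fatality counts covering each branch and both cutoffs
counts = [20, 8, 7, 1, 0]
labels = [severity(n) for n in counts]
print(labels)
```

Python checks the conditions from top to bottom and stops at the first one that is true, so each count receives exactly one label. With a real dataframe, a function like this could be passed to a column's `apply()` method to label every row.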

## Break and Continue
There are two loop-specific statements that can interrupt the normal iteration process: __Break__ and __Continue__. The first is an early exit: it terminates whatever kind of loop is running once some logical condition is met, usually one that involves your `i` variable or a value looked up with it. The second skips the remainder of the current iteration and moves on to the next one. In fact, we could use break or continue to perform the same task as our `fatalities >= 8` examples above.

### Break
Using the `central_america` dataframe sorted from most to least fatalities per event, we can tell the loop to iterate down the ordered rows and stop as soon as fatalities drop below eight. This works by adding the following syntax into our loop:

`if x == y: break`

In [10]:
i=0 # manually set a starting value
while i < 10:
    print(i,'is less than 10')
    if i == 4: 
        break
    i=i+1 

0 is less than 10
1 is less than 10
2 is less than 10
3 is less than 10
4 is less than 10


Let's repeat the creation of the `catastrophic` dataframe from the previous cell, but remove the nested while-loop in favor of a break:

In [19]:
# make an empty dataframe again to take subset data
catastrophic=[]
catastrophic=pd.DataFrame(catastrophic)

for i in range(len(central_america['fatalities'])):
    b = central_america.loc[i,'fatalities'] # update `b` with the value for fatalities in row 'i' of central_america
    print('b =',b)
    if b < 8: break
    catastrophic = pd.concat([catastrophic,central_america.loc[[i]]])         

catastrophic.tail() # Look at the last rows to see if the new dataframe picked up seven fatalities

b = 20
b = 19
b = 18
b = 17
b = 14
b = 12
b = 12
b = 11
b = 9
b = 9
b = 9
b = 8
b = 8
b = 7


Unnamed: 0,data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
8,7170132,340,HND1805,1805,19 July 2020,2020,1,Violence against civilians,Attack,Unidentified Armed Group (Honduras),...,Yoro,15.1375,-87.1278,1,El Heraldo (Honduras),National,"On 19 July 2020, in Yoro, Yoro, armed men inte...",9,1618499953,HND
9,7348763,558,NIC498,498,16 June 2018,2018,1,Protests,Excessive force against protesters,Protesters (Nicaragua),...,Managua,12.1328,-86.2504,3,Organization of American States,Other,"On 16 June 2018, nine people were killed durin...",9,1607554574,NIC
10,7348757,558,NIC441,441,30 May 2018,2018,1,Protests,Excessive force against protesters,Protesters (Nicaragua),...,Managua,12.1328,-86.2504,1,Organization of American States,Other,"On 30 May 2018, nine people were killed during...",9,1607554574,NIC
11,7338467,340,HND2803,2803,19 October 2018,2018,1,Violence against civilians,Attack,Unidentified Gang (Honduras),...,Comayaguela,14.1059,-87.2328,1,El Heraldo (Honduras),National,"On 19 October 2018, in Comayaguela, Francisco ...",8,1618531021,HND
12,7994371,340,HND3578,3578,08 May 2021,2021,1,Battles,Armed clash,Unidentified Armed Group (Honduras),...,Saba,15.5214,-86.2238,2,La Tribuna (Honduras),National,"On 8 May 2021, close to Saba, Colon, an armed ...",8,1621892119,HND


An important detail to remember when using `break` is that the loop ends as soon as the condition is met, but any statements placed before the break still run on that final iteration. In the cell above, print() ran for the row with seven fatalities because it sits right above the break statement. The `catastrophic` dataframe, on the other hand, does not include that row because we called concat() _after_ the break. 
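
The effect of statement order around `break` is easy to see on a plain list. This is a minimal sketch with made-up fatality counts (the lists `printed` and `kept` are hypothetical names used only for illustration): one list records what happens above the break check, the other what happens below it.

```python
# Made-up fatality counts, already sorted from most to least
fatalities = [20, 12, 9, 8, 7, 3]

printed = []  # values reached by the statement above the break check
kept = []     # values reached by the statement below the break check

for b in fatalities:
    printed.append(b)  # runs before the check, so 7 is recorded here
    if b < 8:
        break          # leave the loop; nothing below runs for 7
    kept.append(b)     # runs only for values that passed the check

print(printed)
print(kept)
```

The value 3 appears in neither list, because the loop has already ended by the time it would have been reached.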

### Continue
Another solution to the same problem lies in the continue command. Continue works by skipping the rest of the current iteration when some criterion is met, and moving on to the next iteration. The syntax is much the same:

`if x == y: continue`

In [12]:
for i in range(10): 
    if i != 4:  
        continue # skip over every value of 'i' except 4
    print(i,'= 4')

4 = 4


What's quite useful about the continue statement is that you don't need to worry about whether the `fatalities` variable is sorted when you begin to iterate along the data. Reload the Central America object from the CSV file so that the rows are in their original order, with fatalities in no particular order. 

In [3]:
central_america = pd.read_csv('../../Data/ACLED/1900-01-01-2022-04-22-Central_America.csv')

You can once again reuse the for-loop from our two previous examples and replace the segment where we had the while statement and the break statement with a continue statement. Once again, be very aware of which Python functions you place above or below the continue statement; if we had placed print() above the continue statement, it would have printed for every row of the full dataframe. 

In [4]:
# make an empty dataframe again to take subset data
catastrophic=[]
catastrophic=pd.DataFrame(catastrophic)

for i in range(len(central_america['fatalities'])):
    c = central_america.loc[i,'fatalities'] # update `c` with the value for fatalities in row 'i' of central_america
    if c < 8: continue # if fatalities are less than eight, skip this entry, don't print and don't concatenate
    print('c =',c)
    catastrophic = pd.concat([catastrophic,central_america.loc[[i]]])

c = 14
c = 12
c = 8
c = 9
c = 19
c = 18
c = 20
c = 8
c = 11
c = 9
c = 9
c = 17
c = 12


In [5]:
catastrophic.head()

Unnamed: 0,data_id,iso,event_id_cnty,event_id_no_cnty,event_date,year,time_precision,event_type,sub_event_type,actor1,...,location,latitude,longitude,geo_precision,source,source_scale,notes,fatalities,timestamp,iso3
1124,8828910,320,GTM4489,4489,17 December 2021,2021,1,Battles,Armed clash,Nahuala Communal Militia (Guatemala),...,Chiquix,14.7998,-91.3977,1,El Periodico; Prensa Comunitaria; Prensa Libre...,Local partner-National,"On 17 December 2021, in Chiquix, Solola, membe...",14,1644864568,GTM
2259,8544416,558,NIC722,722,23 August 2021,2021,1,Violence against civilians,Attack,Unidentified Communal Militia (Nicaragua),...,Bonanza,14.0309,-84.5929,2,Trinchera (Nicaragua); La Prensa Libre (Costa ...,National-Regional,"On 23 August 2021, near Bonanza, Costa Caribe,...",12,1632159713,NIC
3565,7994371,340,HND3578,3578,08 May 2021,2021,1,Battles,Armed clash,Unidentified Armed Group (Honduras),...,Saba,15.5214,-86.2238,2,La Tribuna (Honduras),National,"On 8 May 2021, close to Saba, Colon, an armed ...",8,1621892119,HND
6861,7170132,340,HND1805,1805,19 July 2020,2020,1,Violence against civilians,Attack,Unidentified Armed Group (Honduras),...,Yoro,15.1375,-87.1278,1,El Heraldo (Honduras),National,"On 19 July 2020, in Yoro, Yoro, armed men inte...",9,1618499953,HND
8552,6939157,340,HND1225,1225,22 December 2019,2019,1,Battles,Armed clash,MS-13: Mara Salvatrucha,...,El Porvenir,15.666,-86.9105,1,El Heraldo (Honduras),National,"On 22 December 2019, in El Porvenir, Francisco...",19,1618530900,HND
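
To see why sorting matters for `break` but not for `continue`, compare the two on the same unsorted values. This is a minimal sketch using a made-up list of fatality counts; `with_continue` and `with_break` are hypothetical names for the two results.

```python
# Made-up fatality counts in no particular order
fatalities = [2, 14, 0, 8, 19, 1]

with_continue = []
for c in fatalities:
    if c < 8:
        continue          # skip this value and move on to the next one
    with_continue.append(c)

with_break = []
for c in fatalities:
    if c < 8:
        break             # the first value (2) already ends the loop
    with_break.append(c)

print(with_continue)
print(with_break)
```

On unsorted data, `continue` still finds every value of eight or more, while `break` stops at the first small value and misses everything after it.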


## Applied Example
Loops don't apply just to iterative processes in data analysis. They are also useful in project management, where you might need to perform repetitive tasks. When working with publicly available data, files often come split into groups of years (Census data, for example) or regions (like the ACLED conflict data). At the beginning of the chapter we shared an example of a task where your organization might require you to keep an up-to-date database of worldwide conflict reports. This means that part of your job would be to regularly download every region's CSV file from the ACLED website, load every CSV file into data management software like this one, and merge each region's data into one large dataset for analysis and reporting. There are 16 ACLED regions, so you would be loading and merging a lot of files periodically. That's a lot of busywork you could automate for yourself with loops!

First consider what exactly you are iterating over. Each time you download the data, you need to unzip 16 files; the `ZipFile` class from Python's `zipfile` module handles this task. Every time you read CSV data into Python, you must call the `read_csv()` function, and then you have to merge the results with pandas' `concat()`. Let's look at what that code would look like without loops for three regions:
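
Here is a minimal sketch of that repetition, written so it can run anywhere. The first block only builds three tiny zipped CSV files in a temporary folder; the short file names and the single made-up row in each file are assumptions for illustration, not real ACLED data. The second block is the part to study: the same steps typed out once per region.

```python
import tempfile
from pathlib import Path
from zipfile import ZipFile

import pandas as pd

# --- Setup: hypothetical stand-in files, not real ACLED downloads ---
data_dir = Path(tempfile.mkdtemp())
for region in ['Europe', 'Oceania', 'Caribbean']:
    csv_name = f'{region}.csv'
    with ZipFile(data_dir / f'{csv_name}.zip', 'w') as z:
        z.writestr(csv_name, 'country,fatalities\nExampleland,0\n')

# --- The repetitive part: unzip and read, once per region ---
with ZipFile(data_dir / 'Europe.csv.zip', 'r') as f:
    f.extractall(data_dir)
europe = pd.read_csv(data_dir / 'Europe.csv')

with ZipFile(data_dir / 'Oceania.csv.zip', 'r') as f:
    f.extractall(data_dir)
oceania = pd.read_csv(data_dir / 'Oceania.csv')

with ZipFile(data_dir / 'Caribbean.csv.zip', 'r') as f:
    f.extractall(data_dir)
caribbean = pd.read_csv(data_dir / 'Caribbean.csv')

# Merge the three regional dataframes into one
acled_three = pd.concat([europe, oceania, caribbean])
print(acled_three.shape)
```

Three regions already take nine nearly identical lines; sixteen regions would take forty-eight.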

Look closely at the lines of code you would have to write over and over, and think about what pieces of code repeat exactly every time. Think about what pieces of code change each time. To unzip the compressed CSV files, we repeat everything except the specific file path (e.g., `path/to/region.csv.zip`), and when reading the CSV files, we also repeat everything except the new object name and the CSV file path: `region = pd.read_csv('/path/to/region.csv')`

That means we are iterating over the actual files in your ACLED data folder. So how can we turn this into a python object that we can iterate along? You could write a list object with every file's full name and path, but that is onerous and can fail you if the file names change in the future (and files are guaranteed to change names at some point). Instead we can use Python's `os` package that lets you look at your local computer files and folders. 

The basic functions we can use are `getcwd()`, which shows our Python project's working directory/folder, and `listdir()`, which returns the full contents of a folder as a list.

In [5]:
# Import the 'os' package that lets you browse your computer directories.
import os
os.getcwd() # current working directory
os.listdir() # every file in the cwd
os.listdir('../../Data/ACLED/') # every file in a specific folder


'/home/fernando/Documents/UCLA/DataX/Python_for_Social_Science/lessons/control_structures'

['.ipynb_checkpoints', 'control_structures.ipynb']

['1900-01-01-2022-04-22-Southern_Africa.csv',
 '1900-01-01-2022-04-22-East_Asia.csv',
 '__MACOSX',
 '1900-01-01-2022-04-22-Southeast_Asia.csv',
 '.ipynb_checkpoints',
 '1900-01-01-2022-04-22-Middle_Africa.csv',
 '1900-01-01-2022-04-22-Caucasus_and_Central_Asia.csv',
 '1900-01-01-2022-04-22-Middle_East.csv.zip',
 '1900-01-01-2022-04-22-South_Asia.csv.zip',
 '1900-01-01-2022-04-22-South_America.csv',
 '1900-01-01-2022-04-22-North_America.csv',
 '1900-01-01-2022-04-22-Northern_Africa.csv',
 '1900-01-01-2022-04-22-Central_America.csv',
 '1900-01-01-2022-04-22-Western_Africa.csv',
 '1900-01-01-2022-04-22-Europe.csv',
 '1900-01-01-2022-04-22-Caribbean.csv',
 '1900-01-01-2022-04-22-Eastern_Africa.csv',
 '1900-01-01-2022-04-22-Oceania.csv']

Looking more closely at the file paths in our code, there is still a constant unchanging segment of text:

```
../../Data/ACLED/1900-01-01-2022-04-22-Southern_Africa.csv
../../Data/ACLED/1900-01-01-2022-04-22-East_Asia.csv
etc..
```

Make a simple string object for the path to the data and you won't need to type this out again: `path = '../../Data/ACLED/'`. We will then iterate over the file names themselves, so the next step is to make a list of file names to iterate over, starting with the compressed zip files, which we can pick out with a list comprehension.

In [12]:
# make a list with the file names of all the zip files in the data folder. we use list comprehension for this task
path = '../../Data/ACLED/'
zip_files = [i for i in os.listdir(path) if i.endswith('.zip')]
zip_files
path+zip_files[0]

['1900-01-01-2022-04-22-Middle_East.csv.zip',
 '1900-01-01-2022-04-22-South_Asia.csv.zip']

'../../Data/ACLED/1900-01-01-2022-04-22-Middle_East.csv.zip'
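
Joining strings with `+` only works because `path` ends with a slash. As a hedged alternative, not used in the rest of this chapter, `os.path.join()` from the standard library adds the separator when it is missing. The variable names below are hypothetical and exist only for this comparison.

```python
import os

path_with_slash = '../../Data/ACLED/'
path_without_slash = '../../Data/ACLED'   # same folder, trailing slash forgotten
file_name = '1900-01-01-2022-04-22-Middle_East.csv.zip'

# os.path.join inserts a separator only when one is needed
joined_a = os.path.join(path_with_slash, file_name)
joined_b = os.path.join(path_without_slash, file_name)

# Plain concatenation silently glues the folder and file names together
broken = path_without_slash + file_name

print(joined_a)
print(joined_b)
print(broken)
```

Both joined paths point to the same file, while the concatenated one points to a file that does not exist.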

Unzip all the files in your list. The syntax of `ZipFile` in our case is `with ZipFile(file, 'r') as f: f.extractall(path)`. In the command, 'file' refers to the complete path to the ZIP file we want to extract, and 'path' is the folder you want to extract into. In our case these values will be `Data/ACLED/region.csv.zip` and `Data/ACLED/`, respectively. Wrap the ZipFile call in a for loop that iterates over the zip files, and inside that loop create a `file_path` object that stores the full path + file name string for each file. Remember that in this loop, `i` stands for each file name stored in zip_files.

In [16]:
# import the zipfile package
from zipfile import ZipFile 

# make sure you have the base path to the data
path = '../../Data/ACLED/'

# unzip it all in a simple for-loop
for i in zip_files:
    file_path=path+i # zipfile will need to know exactly the path and file to unzip
    with ZipFile(file_path, 'r') as f:
        f.extractall(path) # zipfile needs just the path to the data folder to unzip the csv files

Now that you have unzipped your ACLED CSV files, you can make another loop to read all the CSV files into Python. First create a list of all the CSV files in the ACLED directory. We can reuse the `path` object and change the `endswith()` search criterion from our previous list comprehension. 

In [15]:
csv_files = [i for i in os.listdir(path) if i.endswith('.csv')]
csv_files

['1900-01-01-2022-04-22-South_Asia.csv',
 '1900-01-01-2022-04-22-Southern_Africa.csv',
 '1900-01-01-2022-04-22-East_Asia.csv',
 '1900-01-01-2022-04-22-Southeast_Asia.csv',
 '1900-01-01-2022-04-22-Middle_Africa.csv',
 '1900-01-01-2022-04-22-Caucasus_and_Central_Asia.csv',
 '1900-01-01-2022-04-22-South_America.csv',
 '1900-01-01-2022-04-22-North_America.csv',
 '1900-01-01-2022-04-22-Northern_Africa.csv',
 '1900-01-01-2022-04-22-Central_America.csv',
 '1900-01-01-2022-04-22-Western_Africa.csv',
 '1900-01-01-2022-04-22-Europe.csv',
 '1900-01-01-2022-04-22-Caribbean.csv',
 '1900-01-01-2022-04-22-Eastern_Africa.csv',
 '1900-01-01-2022-04-22-Middle_East.csv',
 '1900-01-01-2022-04-22-Oceania.csv']

Now you can wrap `read_csv()` in a for loop that iterates over `csv_files`, just like you did with `ZipFile` before. One difference is that we will be building up a new dataframe (`acled_merge`), so that every iteration adds _x_ new observations to it. So, unlike our previous loop, we need to declare `acled_merge` as an empty object first, then change its data type to a pandas dataframe. Once you create the receiving dataframe, you can use it to receive the temporary data with `concat()`.

In [19]:
import pandas as pd

acled_merge=[]
acled_merge=pd.DataFrame(acled_merge)

for i in csv_files:
    file_path=path+i
    temp=pd.read_csv(file_path, low_memory=False)
    acled_merge=pd.concat([acled_merge, temp])
    print('For csv file "',i , '" there are', len(temp['country'].unique()), 'unique countries') # this is just to keep track of the loop

pandas.core.frame.DataFrame

For csv file " 1900-01-01-2022-04-22-South_Asia.csv " there are 7 unique countries
For csv file " 1900-01-01-2022-04-22-Southern_Africa.csv " there are 8 unique countries
For csv file " 1900-01-01-2022-04-22-East_Asia.csv " there are 6 unique countries
For csv file " 1900-01-01-2022-04-22-Southeast_Asia.csv " there are 11 unique countries
For csv file " 1900-01-01-2022-04-22-Middle_Africa.csv " there are 9 unique countries
For csv file " 1900-01-01-2022-04-22-Caucasus_and_Central_Asia.csv " there are 9 unique countries
For csv file " 1900-01-01-2022-04-22-South_America.csv " there are 14 unique countries
For csv file " 1900-01-01-2022-04-22-North_America.csv " there are 5 unique countries
For csv file " 1900-01-01-2022-04-22-Northern_Africa.csv " there are 6 unique countries
For csv file " 1900-01-01-2022-04-22-Central_America.csv " there are 7 unique countries
For csv file " 1900-01-01-2022-04-22-Western_Africa.csv " there are 16 unique countries
For csv file " 1900-01-01-2022-04-22-E

array(['India', 'Bangladesh', 'Nepal', 'Sri Lanka', 'Pakistan',
       'Maldives', 'Bhutan', 'South Africa', 'Zambia', 'Zimbabwe',
       'Namibia', 'Lesotho', 'eSwatini', 'Botswana',
       'Saint Helena, Ascension and Tristan da Cunha', 'South Korea',
       'Japan', 'Taiwan', 'Mongolia', 'North Korea', 'China', 'Myanmar',
       'Philippines', 'Indonesia', 'Thailand', 'Malaysia', 'Cambodia',
       'Singapore', 'Vietnam', 'East Timor', 'Laos', 'Brunei', 'Chad',
       'Democratic Republic of Congo', 'Cameroon',
       'Central African Republic', 'Angola', 'Gabon', 'Republic of Congo',
       'Equatorial Guinea', 'Sao Tome and Principe', 'Afghanistan',
       'Armenia', 'Kyrgyzstan', 'Kazakhstan', 'Georgia', 'Azerbaijan',
       'Uzbekistan', 'Tajikistan', 'Turkmenistan', 'Colombia', 'Paraguay',
       'Chile', 'Ecuador', 'Argentina', 'Brazil', 'Venezuela', 'Peru',
       'Bolivia', 'French Guiana', 'Suriname', 'Guyana', 'Uruguay',
       'Falkland Islands', 'Mexico', 'United States'

In [20]:
acled_merge.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1399627 entries, 0 to 1734
Data columns (total 31 columns):
 #   Column            Non-Null Count    Dtype  
---  ------            --------------    -----  
 0   data_id           1399627 non-null  int64  
 1   iso               1399627 non-null  int64  
 2   event_id_cnty     1399627 non-null  object 
 3   event_id_no_cnty  1399627 non-null  float64
 4   event_date        1399627 non-null  object 
 5   year              1399627 non-null  int64  
 6   time_precision    1399627 non-null  int64  
 7   event_type        1399627 non-null  object 
 8   sub_event_type    1399627 non-null  object 
 9   actor1            1399627 non-null  object 
 10  assoc_actor_1     529562 non-null   object 
 11  inter1            1399627 non-null  int64  
 12  actor2            717933 non-null   object 
 13  assoc_actor_2     179397 non-null   object 
 14  inter2            1399627 non-null  int64  
 15  interaction       1399627 non-null  int64  
 16  region  
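
The `info()` output above reports 1,399,627 entries but index labels that run only "0 to 1734", because each file's rows keep their own numbering after `concat()`. One common alternative, shown here as a minimal sketch rather than the method used in this chapter, is to collect the dataframes in a list and concatenate once with `ignore_index=True`. The two tiny "files" are made-up strings read through `io.StringIO`, and every name in the block is hypothetical.

```python
import io

import pandas as pd

# Hypothetical stand-ins for two regional CSV files (made-up rows)
fake_files = {
    'Region_A.csv': 'country,fatalities\nAland,0\nAland,3\n',
    'Region_B.csv': 'country,fatalities\nBland,1\n',
}

frames = []  # a plain Python list is cheap to append to
for name, text in fake_files.items():
    temp = pd.read_csv(io.StringIO(text))  # read_csv also accepts file-like objects
    frames.append(temp)

# One concat at the end; ignore_index=True renumbers the rows from 0 to n-1
merged = pd.concat(frames, ignore_index=True)

print(merged.shape)
print(list(merged.index))
```

Concatenating once avoids rebuilding the growing dataframe on every iteration, and the renumbered index means `.loc[i]` refers to exactly one row.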

In [21]:
# clean up: delete the two CSV files that were extracted from the zip archives above
os.remove('../../Data/ACLED/1900-01-01-2022-04-22-South_Asia.csv')
os.remove('../../Data/ACLED/1900-01-01-2022-04-22-Middle_East.csv')

In [None]:
# Save your work here
%whos DataFrame

## To-do
- Errors: logs, try/except, debugging.
    - Debugging is in the syllabus here but seems out of place in this chapter.
    - Zach uses try/except because he enters data line by line, where you always want to catch errors. For logging, he writes each error to a file, and that file is his log. For debugging, he wants heuristics for interpreting an error message.
    - This could be its own chapter. 