# Pandas II: Descriptive Statistics, Merging, and Aggregating

## Summarizing your data
So far you have inspected your data with `.info()`, `.head()`, and `.shape`. These tell you about the overall content of a dataframe, such as the number of rows and columns and the data type of each variable. This chapter introduces more pandas functions for computing simple but specific summary statistics. They will help you describe general tendencies in the data, compare groups, and see differences across time and place. Though simple, these statistics are the foundation for more advanced statistical models, which quite often lead to the same conclusion, only with more confidence or nuance.

The data you will be working with in this chapter comes from the [Databanks International Cross-National Time-Series Data (CNTS)](https://www.cntsdata.com/), which provides demographic and economic data for hundreds of countries from 1815 to the present. If you guessed from the previous chapter's raw data that this data would be messy, you would be right. This chapter challenges you to do some of the cleaning on your own, as a chance to practice what you learned in the previous chapter. The first step is to load the data, which comes in Microsoft Excel's `.xls` format. You will find it in the `Python for Social Science/Data/CrossNationalTimeSeries` directory, under `cnts_data.xls`.

## Loading the Cross-National Time-Series Data
As always in a new notebook or Python script file, we need to import pandas before we can use the `pd.read_excel()` function.

_Note that `read_excel()` may require you to install the `xlrd` package. If you receive an error saying it is missing, open a terminal inside JupyterLab and enter `conda install xlrd`._

In [1]:
# This code cell will be in every one of our chapters in Jupyter Notebook
# The function allows you to see every line of output when the code has multiple lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

import pandas as pd

In [2]:
cnts = pd.read_excel('../../Data/CrossNationalTimeSeries/cnts_data.xls', skiprows=[0])

The data should now be loaded as the `cnts` object. We also added the argument `skiprows=[0]`, which tells `read_excel()` to skip the first row of the Excel sheet. That first row holds variable descriptions, and we want the header row to contain the short variable names instead.
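To see what `skiprows=[0]` does without needing the Excel file, here is a minimal sketch on a made-up three-column table. It uses `pd.read_csv()` on an in-memory string only so that it runs anywhere; `pd.read_excel()` accepts the same `skiprows` argument. The descriptions and values below are invented for illustration.

```python
import io
import pandas as pd

# Toy stand-in for a sheet whose first row holds long variable descriptions
# and whose second row holds the short variable names (invented content).
raw = io.StringIO(
    "Country code,Country name,Year of observation\n"
    "code,country,year\n"
    "10,AFGHANISTAN,1919\n"
    "10,AFGHANISTAN,1920\n"
)

# skiprows=[0] drops only row 0, so the second row becomes the header.
toy = pd.read_csv(raw, skiprows=[0])

print(toy.columns.tolist())
print(toy.dtypes)
```

Because the description row is gone before pandas reads the header, the numeric columns are parsed as numbers rather than as text.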

As a quick review, use some of the functions you learned in the last chapter in the cell below to inspect the data and get a grasp of its size, its variables, and the types of data contained. 

In [3]:
# Review: dimensions, column summary, and first five rows
cnts.shape
cnts.info()
cnts.head()

(15729, 194)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15729 entries, 0 to 15728
Columns: 194 entries, code to vehicle6
dtypes: float64(191), int64(2), object(1)
memory usage: 23.3+ MB


Unnamed: 0,code,country,year,area1,area2,area3,computer1,computer2,computer3,computer4,...,urban07,urban08,urban09,urban10,vehicle1,vehicle2,vehicle3,vehicle4,vehicle5,vehicle6
0,10,AFGHANISTAN,1919,647000.0,250000.0,,,,,,...,90.0,15.0,,,0.0,0.0,0.0,0.0,0.0,0.0
1,10,AFGHANISTAN,1920,647000.0,250000.0,,,,,,...,96.0,15.0,,,0.0,0.0,0.0,0.0,0.0,0.0
2,10,AFGHANISTAN,1921,647000.0,250000.0,,,,,,...,103.0,16.0,,,0.0,0.0,0.0,0.0,0.0,0.0
3,10,AFGHANISTAN,1922,647000.0,250000.0,,,,,,...,109.0,16.0,,,0.0,0.0,0.0,0.0,0.0,0.0
4,10,AFGHANISTAN,1923,647000.0,250000.0,,,,,,...,116.0,16.0,,,0.0,0.0,0.0,0.0,0.0,0.0


As with our ACS data, there are a lot of observations for a lot of countries and, additionally, more than a hundred years of observations per country. Thankfully we have a dictionary in the cross-national time series data folder to help us find the data we want to use. The `cnts_codebook.xls` file comes with descriptive labels and important information on multiplying some numeric variables to get the correct values. 

From your preliminary inspection above, you will be relieved to see that you do not need to convert data types from object to numeric: every variable but one is a `float` or `int` type. On the other hand, you still have a lot of data to clean. Your first task is to build a new dataframe from the variables we are interested in inspecting. Use the cell below to select the following variables into a new dataframe called `workforce`:

- country
- year
- pop1 (population)
- pop2 (population density)
- economics2 (GDP Per Capita)
- industry3 (percent workforce in agriculture)
- industry4 (percent workforce in industry)
- industry5 (percent workforce other)


In [4]:
workforce = cnts[['country','year','pop1','pop2','economics2','industry3','industry4','industry5']]
workforce.info()
workforce.head(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15729 entries, 0 to 15728
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     15729 non-null  object 
 1   year        15729 non-null  int64  
 2   pop1        15662 non-null  float64
 3   pop2        15583 non-null  float64
 4   economics2  7532 non-null   float64
 5   industry3   4548 non-null   float64
 6   industry4   4269 non-null   float64
 7   industry5   4269 non-null   float64
dtypes: float64(6), int64(1), object(1)
memory usage: 983.2+ KB


Unnamed: 0,country,year,pop1,pop2,economics2,industry3,industry4,industry5
0,AFGHANISTAN,1919,5809.0,232.0,,902.0,,
1,AFGHANISTAN,1920,6095.0,243.0,,900.0,,
2,AFGHANISTAN,1921,6381.0,255.0,,897.0,,
3,AFGHANISTAN,1922,6667.0,266.0,,894.0,,
4,AFGHANISTAN,1923,6954.0,278.0,,892.0,,
5,AFGHANISTAN,1924,7241.0,289.0,,889.0,,
6,AFGHANISTAN,1925,7528.0,301.0,,886.0,,
7,AFGHANISTAN,1926,7815.0,312.0,,884.0,,
8,AFGHANISTAN,1927,8102.0,324.0,,881.0,,
9,AFGHANISTAN,1928,8389.0,335.0,,878.0,,


## Missing Values
From the tables above you should notice something important about our new dataframe: there are missing data in the form of `NaN` values. In just the first ten rows, we can see that Afghanistan has no records for `economics2` (GDP per capita), `industry4`, or `industry5` (the percent of the workforce in industry and in other areas). This is an incredibly common feature of large (and not so large) datasets: often some variables simply have no observations. Missing data is a complex subject that we sadly can't delve into too deeply here.

We can, however, offer advice on managing data with missing values. First is how to inspect missing data to decide if you can live without this information in your analyses. The second is how to drop missing values outright at the cost of losing information. To drop missing values, pandas has a function called `.dropna()` which we can append to any pandas dataframe. Below are two examples of how to use `dropna()`:

In [5]:
# Returns the dataframe without any row that has a missing value; `workforce` itself is unchanged.
# Use the equals sign to assign this pruned dataframe to a new object.
workforce.dropna()

# exactly the same process, but assigning the value to a new dataframe when we are selecting our original subset of columns
workforce_no_missing = cnts[['country','year','pop1', 'pop2','economics2','industry3','industry4','industry5']].dropna()
workforce_no_missing

Unnamed: 0,country,year,pop1,pop2,economics2,industry3,industry4,industry5
40,AFGHANISTAN,1965,13899.0,555.0,60.0,775.0,82.0,143.0
133,ALBANIA,1973,2323.0,2111.0,510.0,553.0,347.0,100.0
170,ALGERIA,1962,10920.0,118.0,246.0,535.0,115.0,350.0
171,ALGERIA,1963,11205.0,121.0,230.0,527.0,117.0,356.0
172,ALGERIA,1964,11675.0,126.0,240.0,519.0,119.0,362.0
...,...,...,...,...,...,...,...,...
15568,YUGOSLAVIA,1973,20960.0,2117.0,805.0,420.0,207.0,373.0
15569,YUGOSLAVIA,1974,21153.0,2136.0,1000.0,415.0,209.0,376.0
15570,YUGOSLAVIA,1975,21352.0,2156.0,1161.0,410.0,212.0,378.0
15571,YUGOSLAVIA,1976,21560.0,2177.0,1559.0,405.0,215.0,380.0


Unnamed: 0,country,year,pop1,pop2,economics2,industry3,industry4,industry5
40,AFGHANISTAN,1965,13899.0,555.0,60.0,775.0,82.0,143.0
133,ALBANIA,1973,2323.0,2111.0,510.0,553.0,347.0,100.0
170,ALGERIA,1962,10920.0,118.0,246.0,535.0,115.0,350.0
171,ALGERIA,1963,11205.0,121.0,230.0,527.0,117.0,356.0
172,ALGERIA,1964,11675.0,126.0,240.0,519.0,119.0,362.0
...,...,...,...,...,...,...,...,...
15568,YUGOSLAVIA,1973,20960.0,2117.0,805.0,420.0,207.0,373.0
15569,YUGOSLAVIA,1974,21153.0,2136.0,1000.0,415.0,209.0,376.0
15570,YUGOSLAVIA,1975,21352.0,2156.0,1161.0,410.0,212.0,378.0
15571,YUGOSLAVIA,1976,21560.0,2177.0,1559.0,405.0,215.0,380.0


Both methods drop the same number of rows, and we are left with 2632 observations as opposed to the original 15729. That's a lot of lost information! `dropna()` is a dangerous function to use if you aren't mindful. In the example above, we told pandas to drop __every__ row that had __any__ missing values in any of the columns. This will absolutely lead you to unexpected results and inaccurate statistics. Let's look at how and why using pandas' descriptive statistics functions.
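Before dropping anything, it helps to count how much is missing in each column. The sketch below does that on a tiny invented dataframe (the values are not from the CNTS file) using `.isna()`, which marks each cell as missing or not.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature version of the workforce dataframe.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "economics2": [np.nan, 60.0, 510.0, np.nan],
    "industry3": [902.0, 775.0, 553.0, 535.0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Share of rows that are missing in each column.
missing_share = toy.isna().mean()

print(missing_counts)
print(missing_share)

# A blanket dropna() keeps only the rows that are complete in every column.
print(len(toy), "rows before,", len(toy.dropna()), "rows after dropna()")
```

Here `industry3` has no missing values at all, yet a blanket `dropna()` still throws away half of its observations because `economics2` is missing in those rows.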

## Basic descriptive statistics
We can pull the same variable from the dataframe where we used `.dropna()` and from the original dataframe and see how the two differ. Focus your attention on a single variable: `industry3`, the percentage of workers employed in agriculture. Along with describing a single variable, we will also demonstrate the difference between the full dataset and the pruned `workforce_no_missing` dataframe that you generated in the previous code cell.

Our first step will be to quite literally extract all the values for `industry3` and push them into a single vector. Each vector is multiplied by 0.1, as the codebook instructs, so the values read as percentages. Let's do this three times:
1. From the full dataframe with missing values.
2. From the full dataframe and then pruning missing values just from the `industry3` column.
3. From the dataframe where any and all missing values across all variables were removed.

In [6]:
# 1. The variable with its missing values left in
ag_1 = workforce['industry3']*.1

# 2. The exact same variable but we drop its missing values only on the column's values.
ag_2 = workforce['industry3'].dropna()*.1

# 3. The variable drawn from the dataframe with *all* missing values dropped
ag_3 = workforce_no_missing['industry3']*.1

<div class="alert alert-block alert-info">
<b>Tip:</b> In the cell above, we are selecting the variable `industry3` with single brackets `dataframe['variable']` instead of double brackets `dataframe[['variable']]`. Read what the `info()` function says about either: </div>

In [7]:
workforce['industry3'].info()

workforce[['industry3']].info()

<class 'pandas.core.series.Series'>
RangeIndex: 15729 entries, 0 to 15728
Series name: industry3
Non-Null Count  Dtype  
--------------  -----  
4548 non-null   float64
dtypes: float64(1)
memory usage: 123.0 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15729 entries, 0 to 15728
Data columns (total 1 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   industry3  4548 non-null   float64
dtypes: float64(1)
memory usage: 123.0 KB


<div class="alert alert-block alert-info">
Double brackets return a dataframe, while single brackets return a series (aka a vector). Many Python functions and data operations expect input in the form of a vector, so it is generally better to use single brackets. When displaying output, single- and double-bracket selections work almost interchangeably, although a series looks different from a dataframe: </div>

In [8]:
workforce['industry3'].describe()

workforce[['industry3']].describe()

count    4548.000000
mean      450.887423
std       233.168795
min        11.000000
25%       263.000000
50%       457.000000
75%       626.250000
max       969.000000
Name: industry3, dtype: float64

Unnamed: 0,industry3
count,4548.0
mean,450.887423
std,233.168795
min,11.0
25%,263.0
50%,457.0
75%,626.25
max,969.0
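The difference between the two selections can also be checked directly. This small sketch, on an invented three-row dataframe, prints the type and shape of each.

```python
import pandas as pd

# Invented three-row dataframe with a single column.
toy = pd.DataFrame({"industry3": [902.0, 900.0, 897.0]})

single = toy["industry3"]    # single brackets -> Series (one-dimensional)
double = toy[["industry3"]]  # double brackets -> DataFrame (two-dimensional)

print(type(single).__name__, single.shape)
print(type(double).__name__, double.shape)
```

The series has only a length, while the dataframe has rows and columns, even when there is just one column.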


Now we can compare each version of these variables using simple math functions for sums, average/mean, observations/rows of data, and standard deviation.

In [9]:
# Add all the values
ag_1.sum()
ag_2.sum()
ag_3.sum()

# Average the column values
ag_1.mean()
ag_2.mean()
ag_3.mean()

# Count the observations
ag_1.count()
ag_2.count()
ag_3.count()

# Standard deviation of the column values
ag_1.std()
ag_2.std()
ag_3.std()

205063.60000000003

205063.6

107505.20000000001

45.0887423043096

45.08874230430959

40.845440729483286

4548

4548

2632

23.316879487998317

23.316879487998317

24.21738952983924

The third vector that we created, `ag_3`, is clearly the odd one out. Remember that we drew this data from the dataframe where we applied `.dropna()` on every single column. By dropping any and all missing observations from a dataframe, we actually deleted valid non-missing data. Based on the count function, we actually ignored $4548-2632=1,916$ valid observations! 
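You may have noticed that `ag_1` (missing values kept) and `ag_2` (missing values dropped from the column) gave the same results. That is because these pandas functions skip missing values by default. A minimal sketch on an invented three-value series:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([10.0, np.nan, 30.0])

# By default the missing value is skipped, so both lines agree.
print(s.mean())
print(s.dropna().mean())

# count() reports non-missing values only; len() reports every row.
print(s.count(), len(s))

# With skipna=False the missing value is not ignored and the result is NaN.
print(s.mean(skipna=False))
```

So dropping missing values from a single column before summarizing it changes nothing, while dropping rows based on *other* columns, as in `ag_3`, removes valid values.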

### Pandas describe()
The `.describe()` function in pandas is very handy for looking at measures of central tendency and spread, a fancy way of talking about the mean, the median, and the standard deviation. It pulls together the individual functions from our code above into a single function call and a single table of output. Calling `.describe()` on the entire dataframe shows a table with statistics for all numeric variables.

In [10]:
workforce.describe()

Unnamed: 0,year,pop1,pop2,economics2,industry3,industry4,industry5
count,15729.0,15662.0,15583.0,7532.0,4548.0,4269.0,4269.0
mean,1942.942209,24516.11,3042.403388,2859.449018,450.887423,241.771844,323.593816
std,56.861483,88315.23,16055.824652,5761.634416,233.168795,121.144671,137.282697
min,1815.0,1.0,3.0,18.0,11.0,0.0,17.0
25%,1898.0,1280.0,224.5,231.0,263.0,148.0,224.0
50%,1965.0,4587.5,1015.0,650.5,457.0,235.0,315.0
75%,1990.0,15487.25,2405.0,2187.25,626.25,332.0,416.0
max,2009.0,1338468.0,340000.0,44797.0,969.0,542.0,716.0


The table is a little hard to read, and the numbers aren't at the correct scale yet, plus we definitely don't care about the average year. Thankfully you can easily feed a single column to `.describe()`, and alter this column's values accordingly. We'll use `industry3` again to see the descriptions of the percent of the population working in agriculture in the whole dataset.

According to the codebook, the variable needs to be multiplied by 0.1, which moves the decimal point one place to the left. We can do that _and_ describe it in one line of code; just pay attention to your parentheses.

In [11]:
# First transform the variable values. Output is a pandas series/vector
workforce['industry3']*.1

# Append the .describe() function to it, but wrapping the previous line of code in parentheses so we don't try to describe the number "0.1"
(workforce[['industry3']]*.1).describe()

0        90.2
1        90.0
2        89.7
3        89.4
4        89.2
         ... 
15724     NaN
15725     NaN
15726     NaN
15727     NaN
15728     NaN
Name: industry3, Length: 15729, dtype: float64

Unnamed: 0,industry3
count,4548.0
mean,45.088742
std,23.316879
min,1.1
25%,26.3
50%,45.7
75%,62.625
max,96.9


You now have a tidy table of summary statistics: the number of observations for this variable in `count`, the average percentage of agricultural workers in `mean`, the standard deviation in `std`, the minimum and maximum values, and the quartiles. Once again, the number of observations is 4548, fewer than the 15729 rows in the whole dataset, because `.describe()` automatically omits missing values from its counts. If you summarize the same variable from the `workforce_no_missing` dataframe, you get the same figures as `ag_3` in our code cells further up.

In [12]:
(workforce_no_missing[['industry3']]*.1).describe()

Unnamed: 0,industry3
count,2632.0
mean,40.845441
std,24.21739
min,1.1
25%,18.4
50%,41.1
75%,57.4
max,96.9


These values still differ from those of the full dataframe. Look at the numbers in `count`, the number of rows with valid data: only 2632 of the 4548 valid observations remain, a little more than half. Your main takeaway should be simple: avoid applying `.dropna()` to an entire dataframe. Instead, apply `.dropna()` only to the column that must be free of missing values, and store the result as a new object. Alternatively, just remember that summary statistics from `.describe()` (and the other basic math functions) ignore missing values by default.
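If you do need whole rows without missing values in one particular variable, `.dropna()` has a `subset` argument that limits the check to the columns you name. The sketch below compares it with a blanket `.dropna()` on an invented four-row dataframe.

```python
import numpy as np
import pandas as pd

# Invented data: economics2 is missing more often than industry3.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "economics2": [np.nan, 60.0, np.nan, 510.0],
    "industry3": [902.0, 775.0, np.nan, 553.0],
})

# Drops every row with a missing value in ANY column.
all_complete = toy.dropna()

# Drops only the rows where industry3 itself is missing.
industry3_only = toy.dropna(subset=["industry3"])

print(len(all_complete), "rows after a blanket dropna()")
print(len(industry3_only), "rows after dropna(subset=['industry3'])")
```

The second version keeps the row for country A, whose `industry3` value is valid even though its `economics2` value is missing.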

## Armed Conflict Location & Event Data Project (ACLED)
When the variables in your data are text, aka strings, `.describe()` is no longer as useful for describing the data. The cross-national time-series data we've been using is almost exclusively numeric, so we will use data from the [Armed Conflict Location & Event Data Project (ACLED)](https://acleddata.com/) to show what `.describe()` does with text/string variables.

In [13]:
conflict_west = pd.read_csv('../../Data/ACLED/1900-01-01-2022-04-22-Western_Africa.csv')
conflict_west.info()
conflict_west['event_type'].describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55909 entries, 0 to 55908
Data columns (total 31 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   data_id           55909 non-null  int64  
 1   iso               55909 non-null  int64  
 2   event_id_cnty     55909 non-null  object 
 3   event_id_no_cnty  55909 non-null  int64  
 4   event_date        55909 non-null  object 
 5   year              55909 non-null  int64  
 6   time_precision    55909 non-null  int64  
 7   event_type        55909 non-null  object 
 8   sub_event_type    55909 non-null  object 
 9   actor1            55909 non-null  object 
 10  assoc_actor_1     16644 non-null  object 
 11  inter1            55909 non-null  int64  
 12  actor2            41432 non-null  object 
 13  assoc_actor_2     12366 non-null  object 
 14  inter2            55909 non-null  int64  
 15  interaction       55909 non-null  int64  
 16  region            55909 non-null  object

count                          55909
unique                             6
top       Violence against civilians
freq                           15261
Name: event_type, dtype: object

There are plenty of variables of the `object` data type, which is the default type pandas gives to columns with text values. We chose to look more closely at the variable `event_type`, which records the kind of conflict event ACLED tracks over time. ACLED codes these events under six categories:
- Battles
- Explosions/Remote violence
- Protests
- Riots
- Strategic developments
- Violence against civilians

The output given by `.describe()` above shows six event types in the `unique` row, with the top event type being "Violence against civilians" (15261 of the 55909 events). However, `.describe()` cannot give us more detailed numbers about an `object` variable. We can start describing string variables with a crosstabulation, or crosstab. Crosstabs group two variables together by the frequency with which the values of one occur alongside the values of the other. To put it concretely, we could compare types of events by year to observe trends.
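For a single text variable, a frequency table is a useful first step before a crosstab. A minimal sketch using `.value_counts()` on an invented handful of events (not the ACLED data):

```python
import pandas as pd

# Invented events, only to show the shape of the output.
events = pd.Series(
    ["Protests", "Battles", "Protests", "Riots", "Protests", "Battles"],
    name="event_type",
)

# How often each category occurs, most frequent first.
counts = events.value_counts()

# The same table expressed as shares of all events.
shares = events.value_counts(normalize=True)

print(counts)
print(shares)
```

The first row of `counts` corresponds to what `.describe()` reports as `top` and `freq`, but here you also see every other category.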

## Text variables with Pandas pd.crosstab()

Let's use pandas' `pd.crosstab()` to compare these two variables. Usage is simple: you tell `pd.crosstab()` which two columns to put on the table. The first variable becomes the rows, and the second becomes the columns of the new table. Be aware of a key difference from the other functions we have seen: you do not append `.crosstab()` directly to the dataframe's name as we have been doing. Something like `conflict_west.crosstab(x,y)` would give you an error. Instead, use `pd` as a prefix, and supply two pandas series within the parentheses.

In [14]:
pd.crosstab(conflict_west['year'], conflict_west['event_type'])

year,Battles,Explosions/Remote violence,Protests,Riots,Strategic developments,Violence against civilians
1997,369,24,47,45,518,320
1998,478,7,54,53,930,607
1999,391,10,93,79,511,347
2000,426,28,109,87,293,277
2001,316,2,47,38,172,163
2002,342,6,29,34,13,148
2003,386,13,59,50,22,200
2004,164,3,102,57,4,152
2005,85,0,129,57,7,95
2006,106,11,40,29,8,73


Now you have a table counting the conflict events gathered by ACLED in western Africa for each year since 1997. We can get even more specific tables by subsetting countries. `.unique()` returns the distinct values in a text column, from which we can choose a country of interest. Then we can use the string subsetting function we learned about last chapter and make our crosstab table.

In [15]:
# Countries in the data. 
conflict_west['country'].unique()

# Alternatively, set() also collects the unique values (a set has no order; Jupyter only displays it sorted)
set(conflict_west['country'])

array(['Mali', 'Nigeria', 'Mauritania', 'Ghana', 'Burkina Faso', 'Guinea',
       'Senegal', 'Niger', 'Liberia', 'Benin', 'Ivory Coast', 'Togo',
       'Sierra Leone', 'Cape Verde', 'Guinea-Bissau', 'Gambia'],
      dtype=object)

{'Benin',
 'Burkina Faso',
 'Cape Verde',
 'Gambia',
 'Ghana',
 'Guinea',
 'Guinea-Bissau',
 'Ivory Coast',
 'Liberia',
 'Mali',
 'Mauritania',
 'Niger',
 'Nigeria',
 'Senegal',
 'Sierra Leone',
 'Togo'}
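If you want the unique values in guaranteed alphabetical order, rather than relying on how the notebook happens to display them, you can wrap them in Python's built-in `sorted()`. A small sketch on an invented list of names:

```python
import pandas as pd

# Invented list of country names with one repeat.
countries = pd.Series(["Mali", "Nigeria", "Ghana", "Mali", "Benin"])

# unique() keeps the order in which values first appear.
in_order_seen = countries.unique().tolist()

# sorted() returns a plain list in alphabetical order.
alphabetical = sorted(countries.unique())

print(in_order_seen)
print(alphabetical)
```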

In [16]:
# A subset of one column, where country='Sierra Leone'
conflict_west[conflict_west['country'].str.contains('Sierra Leone')]['event_type']

468                        Protests
1086                       Protests
2108         Strategic developments
2139                       Protests
2594                          Riots
                    ...            
55904        Strategic developments
55905    Violence against civilians
55906                       Battles
55907        Strategic developments
55908        Strategic developments
Name: event_type, Length: 5015, dtype: object
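A word of caution about `.str.contains()`: it matches any value that *contains* the text, not only values equal to it. That makes no difference for "Sierra Leone" in this list of countries, but it would for names such as "Niger" or "Guinea". The sketch below, on an invented four-row dataframe, compares it with an exact match using `==`.

```python
import pandas as pd

# Invented rows using four of the country names listed above.
toy = pd.DataFrame({
    "country": ["Niger", "Nigeria", "Guinea", "Guinea-Bissau"],
    "event_type": ["Riots", "Battles", "Protests", "Riots"],
})

# Substring match: also picks up "Nigeria".
contains_mask = toy["country"].str.contains("Niger")

# Exact match: only rows whose country is exactly "Niger".
exact_mask = toy["country"] == "Niger"

print(toy[contains_mask]["country"].tolist())
print(toy[exact_mask]["country"].tolist())
```

When you want one specific country, the exact comparison is the safer choice.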

In [17]:
# Crosstab two series with the same country subset
pd.crosstab(
    conflict_west[conflict_west['country'].str.contains('Sierra Leone')]['year'],
    conflict_west[conflict_west['country'].str.contains('Sierra Leone')]['event_type']
)

year,Battles,Explosions/Remote violence,Protests,Riots,Strategic developments,Violence against civilians
1997,257,0,0,0,502,260
1998,311,1,0,0,914,509
1999,216,0,0,0,496,243
2000,147,17,0,1,258,74
2001,46,0,0,0,166,12
2002,0,0,0,2,1,2
2003,7,0,4,5,1,0
2004,5,0,1,3,0,5
2005,0,0,1,3,0,1
2006,1,0,0,0,0,0


You can also try subsetting to a new country-specific dataframe first. We added a `margins=True` parameter to show the row and column sums!

In [18]:
conflict_sierra_leone = conflict_west[conflict_west['country'].str.contains('Sierra Leone')]

pd.crosstab(conflict_sierra_leone['year'], conflict_sierra_leone['event_type'], margins=True)

year,Battles,Explosions/Remote violence,Protests,Riots,Strategic developments,Violence against civilians,All
1997,257,0,0,0,502,260,1019
1998,311,1,0,0,914,509,1735
1999,216,0,0,0,496,243,955
2000,147,17,0,1,258,74,497
2001,46,0,0,0,166,12,224
2002,0,0,0,2,1,2,5
2003,7,0,4,5,1,0,17
2004,5,0,1,3,0,5,14
2005,0,0,1,3,0,1,5
2006,1,0,0,0,0,0,1
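Raw counts can be hard to compare when some years have far more events than others. `pd.crosstab()` also has a `normalize` argument that turns counts into proportions. Here is a minimal sketch on invented events, where `normalize='index'` makes each row (year) sum to 1.

```python
import pandas as pd

# Invented events for two years.
toy = pd.DataFrame({
    "year": [1997, 1997, 1997, 1997, 1998, 1998],
    "event_type": ["Battles", "Battles", "Riots", "Protests", "Battles", "Riots"],
})

# Plain counts, as in the tables above.
counts = pd.crosstab(toy["year"], toy["event_type"])

# Share of each event type within a year.
row_shares = pd.crosstab(toy["year"], toy["event_type"], normalize="index")

print(counts)
print(row_shares)
```

With row shares you can ask what fraction of a year's events were battles, regardless of how many events that year had in total.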


# Merging Data
Customizing your own data from pieces of related but independent datasets is a major task for any researcher working with public data. Different organizations collect data on the same units of observation for wholly different reasons, and more often than not these data are not shared, compared, or merged by anyone but the interested researcher. An example would be the data you have been manipulating all through this chapter: The Databanks International Cross-National Time-Series Data (CNTS), and the Armed Conflict Location & Event Data (ACLED). Although they collect different kinds of information, they share the same units of observation: Countries. Thus, if we are interested in investigating relationships and trends between economic and conflict indicators, we can join data from the CNTS and ACLED data sets.

There exist a variety of functions to combine variables together from different data objects in Python in general and Pandas specifically. We'll introduce you to just one of these in this chapter, pandas' `pd.merge()` function. Within the merge function, we'll also demonstrate three possible kinds of unions of our data: left-join, right-join, and inner-join. First off, let's talk about what variables we want to select and merge from each data set.

We have already loaded the two datasets and named them:

In [19]:
%whos DataFrame

Variable                Type         Data/Info
----------------------------------------------
cnts                    DataFrame           code      country <...>15729 rows x 194 columns]
conflict_sierra_leone   DataFrame           data_id  iso event<...>n[5015 rows x 31 columns]
conflict_west           DataFrame           data_id  iso event<...>[55909 rows x 31 columns]
workforce               DataFrame               country  year <...>n[15729 rows x 8 columns]
workforce_no_missing    DataFrame               country  year <...>\n[2632 rows x 8 columns]


You will be merging the `cnts` and `conflict_west` dataframes, but we will only be using a selection of columns from each dataset. Notice that the dataframes have very different dimensions: the `cnts` data, which is worldwide and goes back to the 1800s, has 15729 observations and 194 variables; the `conflict_west` dataset focuses on western Africa since 1997 but has 55909 conflict reports across 31 variables.

From `cnts` we want you to draw the variables
- country
- year
- pop1 (population)
- pop2 (population density)
- economics2 (GDP Per Capita)

From the ACLED `conflict_west` dataframe we want the following four variables:
- country
- year
- event_type
- fatalities

As you may have already observed, both datasets share the `country` and `year` variables. These variables are going to be our keys to merging disparate variables from two differently sized dataframes. First things first, create two new dataframes from the variables we requested and name them `cnts_a` and `conflict_b`. 

In [20]:
cnts_a = cnts[['country','year','pop1','pop2','economics2']]
conflict_b = conflict_west[['country','year','event_type','fatalities']]

Now we need to check whether our keys are comparable. As with anything programming related, values must be identical in order to match in the merge function! So let us compare `year` and `country` in each dataset.

In [21]:
cnts_a['year'].unique()
conflict_b['year'].unique()

cnts_a['country'].unique()
conflict_b['country'].unique()

array([1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929,
       1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1946,
       1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957,
       1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968,
       1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979,
       1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,
       1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001,
       2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 1816, 1817, 1818,
       1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829,
       1830, 1831, 1832, 1833, 1834, 1835, 1836, 1837, 1838, 1839, 1840,
       1841, 1842, 1843, 1844, 1845, 1846, 1847, 1848, 1849, 1850, 1851,
       1852, 1853, 1854, 1855, 1856, 1857, 1858, 1859, 1860, 1861, 1862,
       1863, 1864, 1865, 1866, 1867, 1868, 1869, 1870, 1871, 1872, 1873,
       1874, 1875, 1876, 1877, 1878, 1879, 1880, 18

array([2022, 2021, 2020, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012,
       2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001,
       2000, 1999, 1998, 1997])

array(['AFGHANISTAN', 'ALBANIA', 'ALGERIA', 'ANDORRA', 'ANGOLA',
       'ANTIGUA', 'ARGENTINA', 'ARMENIA', 'AUSTRALIA', 'NAURU',
       'AUST EMPIRE', 'AUST-HUNG', 'AUSTRIA', 'HUNGARY', 'AZERBAIJAN',
       'BAHRAIN', 'BHUTAN', 'BAHAMAS', 'BARBADOS', 'BELARUS', 'BELGIUM',
       'BELIZE', 'BOLIVIA', 'BOPHUTSWANA', 'BOSNIA-HERZ', 'BOTSWANA',
       'BRAZIL', 'BRUNEI', 'BULGARIA', 'BURMA', 'MYANMAR', 'BURUNDI',
       'CAMBODIA', 'KHMER REP', 'KAMPUCHEA', 'CAMEROON', 'CANADA',
       'C VERDE IS', 'CEN AFR REP', 'CEN AFR EMP', 'CEYLON', 'SRI LANKA',
       'CHAD', 'CHILE', 'CHINA', 'CHINA REP', 'CHINA PR', 'TAIWAN',
       'CISKEI', 'COLOMBIA', 'COMORO IS', 'ANJOUAN', 'CONGO (BRA)',
       'CONGO', 'CONGO REP', 'CONGO (KIN)', 'CONGO DR', 'ZAIRE',
       'COSTA RICA', 'CROATIA', 'CUBA', 'CYPRUS', 'CYPRUS GRK',
       'CYPRUS TURK', "CZECHOS'KIA", 'CZECH REP', 'SLOVAK REP', 'DAHOMEY',
       'BENIN', 'DENMARK', 'DJIBOUTI', 'DOMINICA', 'DOMIN REP',
       'TIMOR-LESTE', 'ECUADOR', 'EL SALVA

array(['Mali', 'Nigeria', 'Mauritania', 'Ghana', 'Burkina Faso', 'Guinea',
       'Senegal', 'Niger', 'Liberia', 'Benin', 'Ivory Coast', 'Togo',
       'Sierra Leone', 'Cape Verde', 'Guinea-Bissau', 'Gambia'],
      dtype=object)

The `year` key variable looks fine, but there are some problems with the `country` variable across datasets. The main issue is that the country values are all upper case in `cnts_a`, whereas the `conflict_b` data capitalizes only the first letter of each word. We are going to have to change these string values so that they use the same case in both datasets. Another problem is that the `cnts` data's country names are inconsistent and sometimes abbreviated (for example `C VERDE IS` versus `Cape Verde`). We'll leave that for you as a future exercise and live with imperfect matches for the moment.

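To make the countries' values all the same case, we can choose pandas' `.str.lower()`, `.str.upper()`, or `.str.capitalize()`. For this example we will pick `.str.capitalize()`, which means we only need to alter `country` in the `cnts_a` dataframe to make it match `conflict_b`. Keep in mind that `.str.capitalize()` upper-cases only the first character of the whole string, so multi-word names such as `Sierra Leone` or `Burkina Faso` will still not match; this is one of the imperfect matches we are living with for now.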
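A quick way to find out which names will still fail to match is to compare the two sets of keys after changing the case. The sketch below uses a few invented stand-in names (the upper-case spelling `SIERRA LEONE` is an assumption) and also tries `.str.title()`, a pandas string method that capitalizes every word rather than only the first.

```python
import pandas as pd

# Stand-in names; the upper-case spellings are assumptions for illustration.
cnts_names = pd.Series(["BENIN", "SIERRA LEONE", "C VERDE IS"])
acled_names = pd.Series(["Benin", "Sierra Leone", "Cape Verde"])

# capitalize(): only the first character of the whole string is upper case.
capitalized = cnts_names.str.capitalize()

# title(): the first character of every word is upper case.
titled = cnts_names.str.title()

# ACLED names with no counterpart after each transformation.
unmatched_cap = set(acled_names) - set(capitalized)
unmatched_title = set(acled_names) - set(titled)

print(capitalized.tolist())
print(titled.tolist())
print(unmatched_cap)
print(unmatched_title)
```

Changing the case fixes some mismatches, but abbreviated names such as `C VERDE IS` need to be recoded by hand whichever method you pick.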

In [22]:
# Attach head just to see what happens
cnts_a['country'].str.lower().head()
conflict_b['country'].str.upper().head()
cnts_a['country'].str.capitalize().head()

# Let's go with str.capitalize()
cnts_a['country']=cnts_a['country'].str.capitalize()

0    afghanistan
1    afghanistan
2    afghanistan
3    afghanistan
4    afghanistan
Name: country, dtype: object

0       MALI
1       MALI
2       MALI
3       MALI
4    NIGERIA
Name: country, dtype: object

0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cnts_a['country']=cnts_a['country'].str.capitalize()
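The message above is a pandas warning, not an error: the assignment still went through. It appears because `cnts_a` was created by selecting columns from `cnts`, and pandas cannot be sure whether you meant to modify a separate copy or the original. One common way to make the intent explicit, sketched here on an invented two-row dataframe, is to call `.copy()` when creating the subset.

```python
import pandas as pd

# Invented stand-in for the large dataframe.
big = pd.DataFrame({
    "country": ["MALI", "TOGO"],
    "year": [1997, 1998],
    "pop1": [1.0, 2.0],
})

# .copy() makes the subset an independent dataframe.
subset = big[["country", "year"]].copy()

# Assigning to a column of the copy does not touch the original.
subset["country"] = subset["country"].str.capitalize()

print(subset["country"].tolist())
print(big["country"].tolist())
```

Whether the warning shows up at all depends on your pandas version, so treat this as a habit that keeps your intent clear rather than a required fix.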


## Pandas merge()
The `pd.merge()` function takes four parameters that matter for us: the left dataframe, the right dataframe, `how`, and `on`. The first two are simply the dataframes to combine, here `cnts_a` and `conflict_b`; "left" and "right" refer to the order in which you pass them. `how=` tells the function which dataframe's keys to keep: `how='left'` keeps every row of the left dataframe and attaches matching rows from the right, and `how='right'` does the reverse. `on=` specifies the column/variable names that you will match across dataframes.

### Left join
This is much easier to show than to explain so we will perform a left-join where `cnts_a` is left, and `conflict_b` is right. We'll specify `how='left'` and `on=['country','year']`

In [23]:
left_join = pd.merge(cnts_a, conflict_b, how='left', on=['country','year'])
left_join.info()
left_join[left_join['country'].str.contains('Liberia')] #closer look at a single country

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20415 entries, 0 to 20414
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     20415 non-null  object 
 1   year        20415 non-null  int64  
 2   pop1        20348 non-null  float64
 3   pop2        20269 non-null  float64
 4   economics2  8998 non-null   float64
 5   event_type  4829 non-null   object 
 6   fatalities  4829 non-null   float64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.1+ MB


Unnamed: 0,country,year,pop1,pop2,economics2,event_type,fatalities
9002,Liberia,1847,300.0,69.0,,,
9003,Liberia,1848,304.0,70.0,,,
9004,Liberia,1849,308.0,71.0,,,
9005,Liberia,1850,312.0,72.0,,,
9006,Liberia,1851,316.0,73.0,,,
...,...,...,...,...,...,...,...
9981,Liberia,2009,4117.0,957.0,,Battles,0.0
9982,Liberia,2009,4117.0,957.0,,Battles,0.0
9983,Liberia,2009,4117.0,957.0,,Violence against civilians,0.0
9984,Liberia,2009,4117.0,957.0,,Riots,0.0


### Right join
Repeat the code above in the same order, but specify now that `how='right'`.

In [24]:
right_join = pd.merge(cnts_a, conflict_b, how='right', on=['country','year'])
right_join.info()
right_join[right_join['country'].str.contains('Liberia')] #closer look at a single country

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55909 entries, 0 to 55908
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     55909 non-null  object 
 1   year        55909 non-null  int64  
 2   pop1        4829 non-null   float64
 3   pop2        4829 non-null   float64
 4   economics2  1510 non-null   float64
 5   event_type  55909 non-null  object 
 6   fatalities  55909 non-null  int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 3.0+ MB


Unnamed: 0,country,year,pop1,pop2,economics2,event_type,fatalities
75,Liberia,2022,,,,Protests,0
276,Liberia,2022,,,,Protests,0
277,Liberia,2022,,,,Riots,1
403,Liberia,2022,,,,Protests,0
438,Liberia,2022,,,,Protests,0
...,...,...,...,...,...,...,...
55790,Liberia,1997,2776.0,645.0,282.0,Strategic developments,0
55794,Liberia,1997,2776.0,645.0,282.0,Strategic developments,0
55795,Liberia,1997,2776.0,645.0,282.0,Strategic developments,0
55831,Liberia,1997,2776.0,645.0,282.0,Strategic developments,0


You can see very different outputs when choosing right or left! In the first cell, where we performed a left join, `cnts_a` was on the left and we preserved all `cnts_a` observations, even for years with no conflict data. Remember that data collection for `cnts` began in the 1800s, while ACLED began collecting data on western Africa in 1997. You can see this in the `year` column of either output. There are also tens of thousands more observations in the right join (55,909 rows) than in the left join (20,415 rows), because `conflict_b` has one row per event and often many events per country and year. Left or right, wherever one dataframe has no matching observation, the merge function inserts `NaN` missing values.

In the second cell where we performed a right join, you should see that the years go up to 2022, which is later than the last year available in the `cnts` data. So we kept all observations in the `conflict_b` dataframe from 1997 to 2022, while entering values from `cnts_a` where those country-year keys match. 
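To see the same behavior on a scale you can check by eye, here is a minimal sketch with two tiny, made-up dataframes. The names `left_df` and `right_df` and all of the values are hypothetical stand-ins for `cnts_a` and `conflict_b`:

```python
import pandas as pd

# Hypothetical miniature versions of the two dataframes (values are illustrative)
left_df = pd.DataFrame({'country': ['Liberia', 'Liberia', 'Mali'],
                        'year': [1996, 1997, 1997],
                        'pop1': [2700.0, 2776.0, 9800.0]})
right_df = pd.DataFrame({'country': ['Liberia', 'Liberia', 'Liberia'],
                         'year': [1997, 1997, 2022],
                         'event_type': ['Battles', 'Riots', 'Protests']})

# Left join: every row of left_df survives; unmatched rows get NaN for event_type
toy_left = pd.merge(left_df, right_df, how='left', on=['country', 'year'])

# Right join: every row of right_df survives; unmatched rows get NaN for pop1
toy_right = pd.merge(left_df, right_df, how='right', on=['country', 'year'])

print(toy_left)
print(toy_right)
```

Notice that the left join has four rows even though `left_df` has three: Liberia 1997 matches two events, so that row is repeated once per event.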

### Inner join
The previous two merges, left and right, tell the merge function which dataframe's keys to use. By choosing `how='inner'`, you request a merge on the _intersection_ of your key variables, `year` and `country`. In other words, you keep only the observations whose country-year combinations appear in both datasets:

In [25]:
inner_join = pd.merge(cnts_a, conflict_b, how='inner', on=['country','year'])
inner_join.info()
inner_join[inner_join['country'].str.contains('Liberia')] #closer look at a single country

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4829 entries, 0 to 4828
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     4829 non-null   object 
 1   year        4829 non-null   int64  
 2   pop1        4829 non-null   float64
 3   pop2        4829 non-null   float64
 4   economics2  1510 non-null   float64
 5   event_type  4829 non-null   object 
 6   fatalities  4829 non-null   int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 264.2+ KB


Unnamed: 0,country,year,pop1,pop2,economics2,event_type,fatalities
694,Liberia,1997,2776.0,645.0,282.0,Battles,2
695,Liberia,1997,2776.0,645.0,282.0,Strategic developments,0
696,Liberia,1997,2776.0,645.0,282.0,Violence against civilians,0
697,Liberia,1997,2776.0,645.0,282.0,Strategic developments,0
698,Liberia,1997,2776.0,645.0,282.0,Strategic developments,0
...,...,...,...,...,...,...,...
1534,Liberia,2009,4117.0,957.0,,Battles,0
1535,Liberia,2009,4117.0,957.0,,Battles,0
1536,Liberia,2009,4117.0,957.0,,Violence against civilians,0
1537,Liberia,2009,4117.0,957.0,,Riots,0


The inner join has fewer observations than either the left or the right join. However, it also has fewer missing values and more complete observations!
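If you want to know how many keys matched before committing to one kind of join, `pd.merge()` has an `indicator=True` option that adds a `_merge` column labelling each row as `left_only`, `right_only`, or `both`. The sketch below combines it with `how='outer'`, which this chapter does not otherwise use and which keeps the keys of both dataframes. The data are made up for illustration:

```python
import pandas as pd

# Hypothetical country-year keys (illustrative values only)
left_df = pd.DataFrame({'country': ['Liberia', 'Liberia', 'Mali'],
                        'year': [1996, 1997, 1997]})
right_df = pd.DataFrame({'country': ['Liberia', 'Liberia'],
                         'year': [1997, 2022]})

# Outer join keeps all keys; the _merge column records where each key was found
check = pd.merge(left_df, right_df, how='outer',
                 on=['country', 'year'], indicator=True)

counts = check['_merge'].value_counts()
print(counts)
```

The number of `both` rows is what an inner join would keep, so this is a quick way to spot keys that failed to match, for example because of spelling differences in country names.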

## Merging new observations
Another form of merging adds new observations, or rows. Ideally both datasets have the same columns in the same order. This is something like updating your observations, say by adding a new country or a new year of data. ACLED collects data by region, so each region's file holds completely different observations under the exact same variables, and we are in luck. If the data had different numbers of columns or different column names, you would have to carefully inspect those differences, rename variables, and/or select only the matching variables. Let's inspect southern Africa's data and see whether it is ready to be appended to our previously loaded `conflict_west` dataframe for western Africa.

In [26]:
# Load from our data folder
conflict_south = pd.read_csv('../../Data/ACLED/1900-01-01-2022-04-22-Southern_Africa.csv')

# Compare variable names
conflict_south.columns
conflict_west.columns

# And a check on the different units of observation (countries)
conflict_south['country'].unique()
conflict_west['country'].unique()

Index(['data_id', 'iso', 'event_id_cnty', 'event_id_no_cnty', 'event_date',
       'year', 'time_precision', 'event_type', 'sub_event_type', 'actor1',
       'assoc_actor_1', 'inter1', 'actor2', 'assoc_actor_2', 'inter2',
       'interaction', 'region', 'country', 'admin1', 'admin2', 'admin3',
       'location', 'latitude', 'longitude', 'geo_precision', 'source',
       'source_scale', 'notes', 'fatalities', 'timestamp', 'iso3'],
      dtype='object')

Index(['data_id', 'iso', 'event_id_cnty', 'event_id_no_cnty', 'event_date',
       'year', 'time_precision', 'event_type', 'sub_event_type', 'actor1',
       'assoc_actor_1', 'inter1', 'actor2', 'assoc_actor_2', 'inter2',
       'interaction', 'region', 'country', 'admin1', 'admin2', 'admin3',
       'location', 'latitude', 'longitude', 'geo_precision', 'source',
       'source_scale', 'notes', 'fatalities', 'timestamp', 'iso3'],
      dtype='object')

array(['South Africa', 'Zambia', 'Zimbabwe', 'Namibia', 'Lesotho',
       'eSwatini', 'Botswana',
       'Saint Helena, Ascension and Tristan da Cunha'], dtype=object)

array(['Mali', 'Nigeria', 'Mauritania', 'Ghana', 'Burkina Faso', 'Guinea',
       'Senegal', 'Niger', 'Liberia', 'Benin', 'Ivory Coast', 'Togo',
       'Sierra Leone', 'Cape Verde', 'Guinea-Bissau', 'Gambia'],
      dtype=object)

The dataframes for southern and western Africa do indeed have the same number of columns with the same names. This is enough for us to merge these observations using pandas' `.concat()` function. You should remember this function from the previous chapter, where we added columns by specifying the horizontal axis. By default, `.concat()` uses the vertical axis and merges data by rows, so we don't have to specify the `axis=` argument. As in the previous chapter, we provide a list of dataframes to the concatenate function within square brackets, separated by commas.

In [27]:
conflict_concat = pd.concat([conflict_west, conflict_south])
conflict_concat.info()
conflict_concat['country'].unique()

<class 'pandas.core.frame.DataFrame'>
Index: 81881 entries, 0 to 25971
Data columns (total 31 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   data_id           81881 non-null  int64  
 1   iso               81881 non-null  int64  
 2   event_id_cnty     81881 non-null  object 
 3   event_id_no_cnty  81881 non-null  int64  
 4   event_date        81881 non-null  object 
 5   year              81881 non-null  int64  
 6   time_precision    81881 non-null  int64  
 7   event_type        81881 non-null  object 
 8   sub_event_type    81881 non-null  object 
 9   actor1            81881 non-null  object 
 10  assoc_actor_1     27302 non-null  object 
 11  inter1            81881 non-null  int64  
 12  actor2            54408 non-null  object 
 13  assoc_actor_2     16931 non-null  object 
 14  inter2            81881 non-null  int64  
 15  interaction       81881 non-null  int64  
 16  region            81881 non-null  object 
 17

array(['Mali', 'Nigeria', 'Mauritania', 'Ghana', 'Burkina Faso', 'Guinea',
       'Senegal', 'Niger', 'Liberia', 'Benin', 'Ivory Coast', 'Togo',
       'Sierra Leone', 'Cape Verde', 'Guinea-Bissau', 'Gambia',
       'South Africa', 'Zambia', 'Zimbabwe', 'Namibia', 'Lesotho',
       'eSwatini', 'Botswana',
       'Saint Helena, Ascension and Tristan da Cunha'], dtype=object)
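One detail in the `.info()` output above is the index line, `81881 entries, 0 to 25971`: `.concat()` keeps each dataframe's original row labels, so the labels restart at 0 partway through and are no longer unique. If you would rather have a fresh running index, `pd.concat()` accepts `ignore_index=True`. A minimal sketch with two made-up dataframes (`west` and `south` are hypothetical names):

```python
import pandas as pd

# Toy regional dataframes (illustrative values only)
west = pd.DataFrame({'country': ['Mali', 'Nigeria'], 'fatalities': [3, 10]})
south = pd.DataFrame({'country': ['Zambia', 'Lesotho'], 'fatalities': [0, 1]})

stacked = pd.concat([west, south])                         # row labels: 0, 1, 0, 1
renumbered = pd.concat([west, south], ignore_index=True)   # row labels: 0, 1, 2, 3

print(stacked.index.tolist())
print(renumbered.index.tolist())
```

Duplicate row labels do not affect the merges in this chapter, which match on columns, but they can surprise you later if you select rows with `.loc`.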

With a longer dataset for western and southern Africa, we should update our merged dataframe of ACLED and CNTS data. Just repeat the steps from the previous cells to remake the `inner_join` dataframe:

In [28]:
# Select the relevant variables from the south+west data
conflict_b = conflict_concat[['country','year','event_type','fatalities']]
inner_join = pd.merge(cnts_a, conflict_b, how='inner', on=['country','year'])
inner_join.info()
inner_join[inner_join['country'].str.contains('Lesotho')] #closer look at a single country

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9409 entries, 0 to 9408
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   country     9409 non-null   object 
 1   year        9409 non-null   int64  
 2   pop1        9409 non-null   float64
 3   pop2        9409 non-null   float64
 4   economics2  2211 non-null   float64
 5   event_type  9409 non-null   object 
 6   fatalities  9409 non-null   int64  
dtypes: float64(3), int64(2), object(2)
memory usage: 514.7+ KB


Unnamed: 0,country,year,pop1,pop2,economics2,event_type,fatalities
715,Lesotho,1997,1919.0,1599.0,527.0,Violence against civilians,1
716,Lesotho,1997,1919.0,1599.0,527.0,Riots,1
717,Lesotho,1997,1919.0,1599.0,527.0,Battles,0
718,Lesotho,1997,1919.0,1599.0,527.0,Battles,5
719,Lesotho,1998,1977.0,1647.0,450.0,Battles,0
...,...,...,...,...,...,...,...
772,Lesotho,2007,2642.0,2201.0,,Strategic developments,0
773,Lesotho,2007,2642.0,2201.0,,Violence against civilians,0
774,Lesotho,2008,2741.0,2284.0,,Protests,0
775,Lesotho,2009,2843.0,2369.0,,Battles,4


## Aggregate
With our nice new custom dataset on conflict and economic data, we can prepare more basic descriptive tables with pandas' `.groupby()` and `.aggregate()` functions, which work in tandem. The former, `.groupby()` creates a table with the string/text factors you provide along rows, while `.aggregate()` tells us which kinds of numeric operations to perform. 

Let's look at fatalities in each country. The function requires providing the rows in parentheses and the columns in square brackets. We tell aggregate that we want the total sum of fatalities.

In [29]:
# Rows and columns before the aggregate() function:
inner_join.groupby('country')['fatalities'].aggregate('sum')

# Or specify columns and the math operator after .aggregate(). Has clearer labelling
inner_join.groupby('country').aggregate({'fatalities':'sum'})

country
Benin            11
Botswana          2
Gambia          118
Ghana           301
Guinea         2871
Lesotho         169
Liberia        1127
Mali            456
Mauritania       82
Namibia         256
Niger           591
Nigeria       17302
Senegal        1259
Togo            113
Zambia          148
Zimbabwe        340
Name: fatalities, dtype: int64

Unnamed: 0_level_0,fatalities
country,Unnamed: 1_level_1
Benin,11
Botswana,2
Gambia,118
Ghana,301
Guinea,2871
Lesotho,169
Liberia,1127
Mali,456
Mauritania,82
Namibia,256


Above we showed you two ways to use `.aggregate()`: either after `.groupby('row_variable')['column_variable']`, or with the column variables in curly brackets inside the `.aggregate()` function proper. Not only does the output table look a bit better, but the latter way to write your tables is more flexible. With the aggregate function arguments inside `{}` curly brackets, you can provide any number of column variables with any combination of math operations. The aggregate function has several functions built in: count, sum, mean, median, min and max, standard deviation, variance, skewness, cumulative sums, and more. 
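To illustrate that flexibility, the sketch below applies different operations to different columns of a small made-up dataframe. It shows the dictionary form with a list of operations, and pandas' "named aggregation" form, where you choose the output column names yourself. The output names `total_fatalities`, `worst_event`, and `mean_gdp` are hypothetical labels chosen for this example:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'country': ['Mali', 'Mali', 'Niger', 'Niger'],
                    'fatalities': [2, 4, 0, 6],
                    'economics2': [250.0, 270.0, 190.0, 210.0]})

# Dictionary form: a list of operations per column gives two-level column names
multi = toy.groupby('country').aggregate({'fatalities': ['sum', 'max'],
                                          'economics2': 'mean'})

# Named aggregation: new_column_name=(source_column, operation)
named = toy.groupby('country').aggregate(
    total_fatalities=('fatalities', 'sum'),
    worst_event=('fatalities', 'max'),
    mean_gdp=('economics2', 'mean'),
)

print(multi)
print(named)
```

The named form produces flat, descriptive column names, which can be easier to work with if you plan to save or merge the resulting table.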


Try grouping countries, summing the total number of fatalities, and also averaging GDP per capita (`economics2`) over the twelve years of observations. You can also put the table's values in order, so let's append the `.sort_values()` function to the line.

In [30]:
inner_join.groupby('country').aggregate({'fatalities':'sum','economics2':'mean'}).sort_values(by='fatalities',ascending=False)

Unnamed: 0_level_0,fatalities,economics2
country,Unnamed: 1_level_1,Unnamed: 2_level_1
Nigeria,17302,492.344361
Guinea,2871,431.467181
Senegal,1259,512.568627
Liberia,1127,257.881356
Niger,591,198.551724
Mali,456,248.483871
Zimbabwe,340,412.127389
Ghana,301,367.428571
Namibia,256,1822.130081
Lesotho,169,459.288889


You can get even more specific breakdowns of the information using `groupby()` by giving it additional variables to group into. Let's add the conflict type and year variables into the mix.

In [31]:
inner_join.groupby(['country','event_type','year']).aggregate({'fatalities':'sum','economics2':'mean'})

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,fatalities,economics2
country,event_type,year,Unnamed: 3_level_1,Unnamed: 4_level_1
Benin,Battles,2004,2,
Benin,Battles,2007,0,
Benin,Protests,1997,0,369.0
Benin,Protests,1998,0,387.0
Benin,Protests,1999,0,386.0
...,...,...,...,...
Zimbabwe,Violence against civilians,2005,8,
Zimbabwe,Violence against civilians,2006,1,
Zimbabwe,Violence against civilians,2007,10,
Zimbabwe,Violence against civilians,2008,126,


One more function we can append to this line of code cleans up the confusing blank space between countries and event types. Right now this table looks, well, like a table for visualizing data. But using the `.reset_index()` function, you get the exact same table with the row values repeating, which makes it useful for creating new dataframes without blank values.

In [32]:
inner_join.groupby(['country','event_type','year']).aggregate({'fatalities':'sum','economics2':'mean'}).reset_index()

Unnamed: 0,country,event_type,year,fatalities,economics2
0,Benin,Battles,2004,2,
1,Benin,Battles,2007,0,
2,Benin,Protests,1997,0,369.0
3,Benin,Protests,1998,0,387.0
4,Benin,Protests,1999,0,386.0
...,...,...,...,...,...
650,Zimbabwe,Violence against civilians,2005,8,
651,Zimbabwe,Violence against civilians,2006,1,
652,Zimbabwe,Violence against civilians,2007,10,
653,Zimbabwe,Violence against civilians,2008,126,
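As an alternative to appending `.reset_index()`, `.groupby()` itself takes an `as_index=False` argument that keeps the grouping variables as ordinary columns. The sketch below compares the two on a small made-up dataframe; under these assumptions both routes give the same flat table:

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'country': ['Benin', 'Benin', 'Benin', 'Togo'],
                    'event_type': ['Battles', 'Battles', 'Protests', 'Riots'],
                    'fatalities': [2, 0, 0, 1]})

# Route 1: group, aggregate, then move the row labels back into columns
via_reset = (toy.groupby(['country', 'event_type'])
                .aggregate({'fatalities': 'sum'})
                .reset_index())

# Route 2: ask groupby not to use the grouping variables as row labels at all
flat = (toy.groupby(['country', 'event_type'], as_index=False)
           .aggregate({'fatalities': 'sum'}))

print(via_reset)
print(flat)
```

Which one you use is mostly a matter of taste; `.reset_index()` has the advantage that you can add it to a table you have already built.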


These are some fairly tidy tables, but unfortunately a little long given the number of countries and years in the data. You should be getting the hang of subsetting a dataframe by now. Try creating a subset of the `inner_join` dataframe with one or more countries. You could save the subset as a new dataframe and then group it, or subset and group in a single line without creating a new dataframe.

__Hint:__ look at how you've subset countries before with `str.contains`

In [33]:
# One possible solution: subset two countries, then group and sum fatalities
inner_join[inner_join['country'].str.contains('Lesotho|Nigeria')].groupby(['country','event_type']).aggregate({'fatalities':'sum'}).reset_index()

Unnamed: 0,country,event_type,fatalities
0,Lesotho,Battles,54
1,Lesotho,Protests,1
2,Lesotho,Riots,36
3,Lesotho,Strategic developments,0
4,Lesotho,Violence against civilians,78
5,Nigeria,Battles,8418
6,Nigeria,Explosions/Remote violence,2015
7,Nigeria,Protests,28
8,Nigeria,Riots,2654
9,Nigeria,Strategic developments,0


It's time to save your work. Just like in the previous chapter, you should check your current working directory with `os.getcwd()` (remember to import the `os` package first), check existing dataframes with `%whos DataFrame`, and use `.to_csv()` or `.to_excel()` to save your dataframes inside the "Pandas_II" folder.

In [34]:
# Save your work here
%whos DataFrame

Variable                Type         Data/Info
----------------------------------------------
cnts                    DataFrame           code      country <...>15729 rows x 194 columns]
cnts_a                  DataFrame               country  year <...>n[15729 rows x 5 columns]
conflict_b              DataFrame                country  year<...>n[81881 rows x 4 columns]
conflict_concat         DataFrame           data_id  iso event<...>[81881 rows x 31 columns]
conflict_sierra_leone   DataFrame           data_id  iso event<...>n[5015 rows x 31 columns]
conflict_south          DataFrame           data_id  iso event<...>[25972 rows x 31 columns]
conflict_west           DataFrame           data_id  iso event<...>[55909 rows x 31 columns]
inner_join              DataFrame           country  year     <...>\n[9409 rows x 7 columns]
left_join               DataFrame               country  year <...>n[20415 rows x 7 columns]
right_join              DataFrame                country  year<...>n[

Save the `inner_join` dataframe to a csv file named `conflict_southwest_africa.csv` in the current working directory.

In [None]:
import os  # needed for os.listdir() below

inner_join.to_csv('conflict_southwest_africa.csv', index=False)  # index=False leaves the row index out of the file

os.listdir()  # check that the file was saved in the current directory
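If you want to convince yourself that the file was written correctly, you can read it straight back in. The sketch below does a save-and-reload round trip on a small made-up dataframe. It writes to a temporary folder so that nothing is left behind; the folder and the file name `toy_conflict.csv` are stand-ins for your own "Pandas_II" folder and file:

```python
import os
import tempfile
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({'country': ['Mali', 'Niger'], 'fatalities': [3, 1]})

# A temporary folder stands in for the "Pandas_II" folder here
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'toy_conflict.csv')
    toy.to_csv(path, index=False)    # index=False: do not write the row labels
    saved_files = os.listdir(folder)  # confirm the file exists
    reloaded = pd.read_csv(path)      # read it back while the folder still exists

print(saved_files)
print(reloaded)
```

Because the file was saved with `index=False`, the reloaded dataframe has only the original columns and no extra unnamed index column.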

## To Do

- Subsequent challenge: group by month and year, splitting those values with the `str` code from the previous chapter.
- Show one more function in `.aggregate()` to demonstrate that you can create different columns with different operations per column.
