# Qualitative Analysis of Text Data

Text makes up a large share of the data that social scientists study, and this chapter is dedicated to the tools that help you manage text in Python. Remember that in programming the term 'string' refers to data that comes in the form of text. In contrast to the data in the quantitative analysis chapter, you can think of strings as any non-numeric data. You might be studying survey data in which open-ended questions give nuance and context to closed-ended quantitative questions about respondents' opinions on a given topic. Social media data are predominantly text: users' posts and conversations, hashtags, and place names are all strings, and even the date is often read in as a string variable. It is up to the researcher, then, to parse these data, which are often the least structured and hardest to analyze.

Before introducing the lesson, we want to make it clear that this chapter is not about qualitative methods in social science. That aspect of social science research encompasses interviews, focus groups, ethnography, photography and videography, audio recordings, and a very rigorous and detailed process of coding and describing the data to create meaning around your research question. This chapter cannot teach you those methods. This chapter is about string variables: cleaning strings, formatting strings, and searching and matching strings so that you can better understand and use the text data available to you. This chapter can get you to the point where your qualitative data are ready for analysis, but the analytical methods are beyond the scope of this book.

- suggest further reading? Same for other chapters?

In [12]:
# This code cell will be in every one of our chapters in Jupyter Notebook
# The function allows you to see every line of output when the code has multiple lines
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

# Load packages
import pandas as pd

### Loading the Armed Conflict Location & Event Data

This chapter revisits the Armed Conflict Location & Event Data Project (ACLED). You had some experience with the ACLED data in the second pandas chapter, including a basic introduction to pandas' string search function `.str.contains()`. You will learn more about searching strings and text formatting with this very text-heavy dataset. ACLED data comes with specific regions of the world in different CSV files. We have chosen the South America region for you in the code cell below. We also wanted you to have a more succinct dataframe, so you will use the `usecols` argument again to load specific columns from the CSV file instead of loading the complete data and then subsetting by variable. Below are the columns we chose and their descriptions according to the ACLED [codebook](https://acleddata.com/knowledge-base/codebook/).

|Variable|Description|
|:--|:--|
|event_id_cnty|A unique alphanumeric event identifier by number and country acronym.|
|event_date|The date on which the event took place. Recorded as Year-Month-Day.|
|year|The year in which the event took place.|
|event_type|The type of event; further specifies the nature of the event.|
|sub_event_type|A subcategory of the event type.|
|actor1|One of two main actors involved in the event (does not necessarily indicate the aggressor).|
|actor2|One of two main actors involved in the event (does not necessarily indicate the target or victim).|
|country|The country or territory in which the event took place.|
|location|The name of the location at which the event took place.|
|admin1|The largest sub-national administrative region in which the event took place.|
|fatalities|The number of reported fatalities arising from an event. When there are conflicting reports, the most conservative estimate is recorded.|
|notes|A short description of the event.|


In [10]:
south_america = pd.read_csv('../../Data/ACLED/1900-01-01-2022-04-22-South_America.csv', 
                            usecols= ['event_id_cnty', 'event_date', 'year', 'event_type', 'sub_event_type', 
                                       'actor1', 'actor2', 'country', 'location', 'admin1', 'fatalities', 'notes'])
south_america.info()

Inspecting the `south_america` dataframe with .info() shows that there are just two numeric variables: `year` and `fatalities`. All other variables were read by the read_csv() function as `object` datatypes, which is the default type for string variables. 
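As a minimal sketch of what `usecols` does, the cell below reads a small made-up CSV from memory rather than the ACLED file. The `toy_csv` text and its `extra` column are assumptions for illustration only.

```python
import io
import pandas as pd

# Hypothetical four-column CSV held in memory; it stands in for the ACLED file
toy_csv = io.StringIO(
    "event_id_cnty,year,notes,extra\n"
    "COL1,2022,Armed clash reported,x\n"
    "BRA2,2021,Peaceful protest,y\n"
)

# usecols keeps only the listed columns while the file is being read
toy = pd.read_csv(toy_csv, usecols=['event_id_cnty', 'year', 'notes'])

# year is read as an integer; the text columns are read as strings
toy.dtypes
```

The `extra` column never enters the dataframe. Depending on your pandas version, the text columns may be reported as `object` or as a dedicated string dtype.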

### Separating strings with .str.split()



In [13]:
# use .str.split() to split each value in 'event_date' on the blank space
south_america['event_date'].str.split(' ')

# you get a list of three, and day is in the '0' index position
south_america['event_date'].str.split(' ').str[0]

0           [15, April, 2022]
1           [15, April, 2022]
2           [15, April, 2022]
3           [15, April, 2022]
4           [15, April, 2022]
                 ...         
107233    [01, January, 2018]
107234    [01, January, 2018]
107235    [01, January, 2018]
107236    [01, January, 2018]
107237    [01, January, 2018]
Name: event_date, Length: 107238, dtype: object

0         15
1         15
2         15
3         15
4         15
          ..
107233    01
107234    01
107235    01
107236    01
107237    01
Name: event_date, Length: 107238, dtype: object
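If you would rather get columns than lists, `str.split()` also accepts `expand=True`. The sketch below uses a made-up two-row Series (`dates` is hypothetical) in the same 'day month year' layout as event_date.

```python
import pandas as pd

# Hypothetical dates in the same layout as the event_date column
dates = pd.Series(['15 April 2022', '01 January 2018'])

# expand=True returns a dataframe with one column per piece instead of lists
parts = dates.str.split(' ', expand=True)

# the new columns are numbered 0, 1, 2 by default, so give them names
parts.columns = ['day', 'month', 'year']
parts
```

This is an alternative to chaining `.str[0]` and `.str[1]`; which one reads better is a matter of taste.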

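We use `.insert()` rather than the `df['var'] = ...` method because `.insert()` lets you choose the index position of the new column, whereas bracket assignment always places a new column at the end of the dataframe. The disadvantage of `.insert()` is that it does not let you replace a variable: it raises an error if a column with that name already exists.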

We do not create a year column because the data already has one. Day and month, however, are not in the data as single columns, so we decided to use them as the example.
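A minimal sketch of the difference between bracket assignment and `.insert()`, using a made-up dataframe (`toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical two-column dataframe
toy = pd.DataFrame({'id': ['A1', 'B2'], 'year': [2022, 2018]})

# Bracket assignment appends a new column at the end ...
toy['month'] = ['April', 'January']

# ... and silently overwrites a column that already exists
toy['month'] = ['Apr', 'Jan']

# insert() places the new column at a chosen position (here index 1)
toy.insert(1, 'day', ['15', '01'])

# insert() refuses to replace an existing column and raises a ValueError
try:
    toy.insert(1, 'day', ['16', '02'])
except ValueError as err:
    insert_error = str(err)

list(toy.columns)
```

This is why re-running a cell that calls `.insert()` fails unless you drop the column first.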

In [14]:
# use .insert() to make a new variable 
day = south_america['event_date'].str.split(' ').str[0]
south_america.insert(2, 'day', day)

month = south_america['event_date'].str.split(' ').str[1]
south_america.insert(3, 'month', month)

south_america.head(1)

Unnamed: 0,event_id_cnty,event_date,day,month,year,event_type,sub_event_type,actor1,actor2,country,admin1,location,notes,fatalities
0,COL10764,15 April 2022,15,April,2022,Battles,Armed clash,Gulf Clan,ACSN: Self-Defense Conquerors of Sierra Nevada,Colombia,Magdalena,Cienaga,"On 15 April 2022, in the rural area of Cienaga...",5


In [6]:
south_america.drop(columns='event_date', inplace=True)
south_america.head(1)

Unnamed: 0,event_id_cnty,day,month,year,event_type,sub_event_type,actor1,actor2,country,admin1,location,notes,fatalities
0,COL10764,15,April,2022,Battles,Armed clash,Gulf Clan,ACSN: Self-Defense Conquerors of Sierra Nevada,Colombia,Magdalena,Cienaga,"On 15 April 2022, in the rural area of Cienaga...",5


In [7]:
# convert year to a string type! concatenating different data types is not allowed
south_america['year'] = south_america['year'].astype(str)

event_date = south_america['day']+' '+south_america['month']+' '+south_america['year']

south_america.insert(4, 'event_date', event_date)

south_america.head(1)

Unnamed: 0,event_id_cnty,day,month,year,event_date,event_type,sub_event_type,actor1,actor2,country,admin1,location,notes,fatalities
0,COL10764,15,April,2022,15 April 2022,Battles,Armed clash,Gulf Clan,ACSN: Self-Defense Conquerors of Sierra Nevada,Colombia,Magdalena,Cienaga,"On 15 April 2022, in the rural area of Cienaga...",5


In [15]:
# drop event_date again and re-create it with str.cat()
south_america.drop(columns='event_date', inplace=True)

event_date = south_america['day'].str.cat(south_america[['month','year']], sep=' ')
south_america.insert(4, 'event_date', event_date)

south_america.head(1)

Unnamed: 0,event_id_cnty,day,month,year,event_date,event_type,sub_event_type,actor1,actor2,country,admin1,location,notes,fatalities
0,COL10764,15,April,2022,15 April 2022,Battles,Armed clash,Gulf Clan,ACSN: Self-Defense Conquerors of Sierra Nevada,Colombia,Magdalena,Cienaga,"On 15 April 2022, in the rural area of Cienaga...",5


You can also concatenate with a different separator, such as the forward slash or the hyphen that are common in written dates.

In [33]:
# drop event_date again and re-create it with str.cat(), this time with '/' as the separator
south_america.drop(columns='event_date', inplace=True)

event_date = south_america['day'].str.cat(south_america[['month','year']], sep='/')
south_america.insert(4, 'event_date', event_date)

south_america.head(1)

Unnamed: 0,event_id_cnty,day,month,year,event_date,event_type,sub_event_type,actor1,actor2,country,admin1,location,notes,fatalities
0,COL10764,15,April,2022,15/April/2022,Battles,Armed clash,Gulf Clan,ACSN: Self-Defense Conquerors of Sierra Nevada,Colombia,Magdalena,Cienaga,"On 15 April 2022, in the rural area of Cienaga...",5
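The same two approaches work with any separator. Here is a small sketch with a hyphen, using a made-up dataframe (`toy` is hypothetical) whose columns are already strings:

```python
import pandas as pd

# Hypothetical day, month and year columns, all stored as strings
toy = pd.DataFrame({
    'day': ['15', '01'],
    'month': ['April', 'January'],
    'year': ['2022', '2018'],
})

# str.cat() joins the calling column with the other columns, using sep between them
hyphen_date = toy['day'].str.cat(toy[['month', 'year']], sep='-')

# the + operator gives the same result but repeats the separator by hand
plus_date = toy['day'] + '-' + toy['month'] + '-' + toy['year']

hyphen_date
```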


## Date formatting with pd.to_datetime() and dt.strftime()
For date and time formatting that is a step more advanced, you can use pandas' own date and time functions to transform the event_date variable from a string type into a datetime type. This is important to learn because date variables are logically numeric. You can subtract one date from another to learn the time that passed between them, as well as tell how long ago an event was from the present date. These calculations over spans of time only work if your date variables are in the right format.
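To make the idea of date arithmetic concrete, here is a small sketch with two made-up dates. The `reference` date is a hypothetical fixed value used in place of today's date so that the result does not change from day to day.

```python
import pandas as pd

# Two hypothetical event dates converted to the datetime type
events = pd.to_datetime(pd.Series(['2022-04-15', '2018-01-01']))

# Subtracting one date from another gives a Timedelta; .days is its length in days
gap = events[0] - events[1]
gap_in_days = gap.days

# Subtracting a whole column from a single reference date works element by element
reference = pd.Timestamp('2022-04-22')
days_ago = (reference - events).dt.days

days_ago
```

None of this works while the dates are still strings, which is the motivation for converting them.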

The simplest approach is to convert a string or numeric column that looks like a date into a datetime column with `pd.to_datetime()`. Below you will pass the event_date column to this function. Observe how the column's format changes from our previous format (15/April/2022) to the datetime standard (2022-04-15). The `dtype` reported on the last line of each output shows how the column goes from an 'object' datatype to a 'datetime64' datatype.

In [34]:
south_america['event_date']

pd.to_datetime(south_america['event_date'])

0           15/April/2022
1           15/April/2022
2           15/April/2022
3           15/April/2022
4           15/April/2022
               ...       
107233    01/January/2018
107234    01/January/2018
107235    01/January/2018
107236    01/January/2018
107237    01/January/2018
Name: event_date, Length: 107238, dtype: object

0        2022-04-15
1        2022-04-15
2        2022-04-15
3        2022-04-15
4        2022-04-15
            ...    
107233   2018-01-01
107234   2018-01-01
107235   2018-01-01
107236   2018-01-01
107237   2018-01-01
Name: event_date, Length: 107238, dtype: datetime64[ns]

Depending on where you are in the world, you may want to be more specific about the format of the date string. That is, event_date's format '15/April/2022' might not be a standard preference where you live and work, and other formats might confuse the automatic to_datetime() function. 

There are standard formatting abbreviations we will use, so check this [manual](https://4js.com/online_documentation/fjs-fgl-3.00.05-manual-html/c_fgl_DataConversions_format_datetimes.html) for more formatting alternatives. For example, our current event_date format would be `'%d/%B/%Y'`: lowercase `%d` gives the day of the month as two digits; uppercase `%B` gives the full month name; `%Y` gives four digits for the year. We place the literal forward slash `/` separators in the format string exactly where they appear in the data.

So we can use that format to tell to_datetime() exactly what format the date is in.

In [41]:
pd.to_datetime(south_america['event_date'], format='%d/%B/%Y')

0        2022-04-15
1        2022-04-15
2        2022-04-15
3        2022-04-15
4        2022-04-15
            ...    
107233   2018-01-01
107234   2018-01-01
107235   2018-01-01
107236   2018-01-01
107237   2018-01-01
Name: event_date, Length: 107238, dtype: datetime64[ns]
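To see why an explicit format matters, consider a date written only with numbers. The sketch below uses a made-up value, `'03/04/2022'`, which is a different day depending on which convention you assume. It uses `%m`, the code for a two-digit month number.

```python
import pandas as pd

# Hypothetical all-numeric date: 3 April or 4 March?
ambiguous = pd.Series(['03/04/2022'])

# Day first, as is common in much of the world
day_first = pd.to_datetime(ambiguous, format='%d/%m/%Y')

# Month first, as is common in the United States
month_first = pd.to_datetime(ambiguous, format='%m/%d/%Y')

day_first[0], month_first[0]
```

The same string yields two different dates, so stating the format removes the guesswork.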

Even though to_datetime() had already converted the date admirably without an explicit format argument, it is worthwhile to be careful with dates in case the format is more ambiguous. Now, if you want to transform the default to_datetime format into something more legible in an English-speaking country, you can use the `dt.strftime` function to make a __new__ string variable with the format of your choice. Let's use `.dt.strftime('%B %d, %Y')` as the format.

Beware that strftime removes the datetime datatype and transforms the variable into a string/object again!

In [42]:
pd.to_datetime(south_america['event_date']).dt.strftime('%B %d, %Y')

0           April 15, 2022
1           April 15, 2022
2           April 15, 2022
3           April 15, 2022
4           April 15, 2022
                ...       
107233    January 01, 2018
107234    January 01, 2018
107235    January 01, 2018
107236    January 01, 2018
107237    January 01, 2018
Name: event_date, Length: 107238, dtype: object
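A quick way to confirm the change of type is to check the column before and after `strftime`. This sketch uses two made-up dates and assumes an English locale, since `%B` writes the month name in the locale's language.

```python
import pandas as pd

# Hypothetical dates stored as a datetime column
as_dates = pd.to_datetime(pd.Series(['2022-04-15', '2018-01-01']))

# strftime returns text, so the result is no longer a datetime column
as_text = as_dates.dt.strftime('%B %d, %Y')

was_datetime = pd.api.types.is_datetime64_any_dtype(as_dates)
still_datetime = pd.api.types.is_datetime64_any_dtype(as_text)

# The text can be parsed back into dates by passing the same format
round_trip = pd.to_datetime(as_text, format='%B %d, %Y')

was_datetime, still_datetime
```

Because the formatted column is text again, keep the datetime version around for any calculations.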

For now, assign event_date to be the standard datetime format, and make another variable called str_date to take on the strftime() value.

In [43]:
south_america['event_date'] = pd.to_datetime(south_america['event_date'])
south_america['str_date'] = pd.to_datetime(south_america['event_date']).dt.strftime('%B %d, %Y')

south_america.head(1)

Unnamed: 0,event_id_cnty,day,month,year,event_date,event_type,sub_event_type,actor1,actor2,country,admin1,location,notes,fatalities,str_date
0,COL10764,15,April,2022,2022-04-15,Battles,Armed clash,Gulf Clan,ACSN: Self-Defense Conquerors of Sierra Nevada,Colombia,Magdalena,Cienaga,"On 15 April 2022, in the rural area of Cienaga...",5,"April 15, 2022"


## Search text with str.contains()

You may wish to search through your open-ended string variables for specific content, the way you would search inside a document for a specific word. You have already done this in previous chapters with pandas' `str.contains()` function to subset by a specific country name. You can continue using str.contains() to look for observations with a specific word or phrase. Imagine you wished to analyze records of conflict involving the military. You could spend some time (as we did) skimming the `notes` variable entries in another program that reads CSV files, using its search function to find useful terms for the military. You might notice that the notes use terms such as 'navy', 'army', 'air force', and 'military' (and maybe 'police' if appropriate). You can then employ those terms with `str.contains()` to query observations where the conflict involved the military.

### Search multiple terms with `or` condition: `|` 

In the search string, you'll want to use the symbol `|`, which represents an `OR` condition. This returns a match for _any_ of the four terms. There is no matching `&` symbol inside the search string: to require that several terms all appear in a single observation, you combine separate str.contains() calls with `&`, as covered later in this chapter. You should also assign the resulting dataframe to a new object called `military_conflicts`. This dataframe has 11,411 observations, down from the original 107,238 observations of the `south_america` dataframe, so there were over eleven thousand events where military forces were involved.

In [10]:
military_conflicts = south_america[(south_america['notes'].str.contains('army|navy|air force|military', case=False))]

military_conflicts['fatalities'].count()

11411

In the next code cell, you can see what happens when you use random uppercase letters in the search string. This demonstrates what the `case=False` argument does. In fact, it returns the same number of observations as the search in the previous code cell. Next you can see how many observations match when we make the query case sensitive with `case=True`. If you are looking for general concepts then it is usually a good idea to make your searches _not_ case sensitive. On the other hand, looking for something like a person or place name would likely require a case-sensitive search.

In [33]:
military_conflicts = south_america[(south_america['notes'].str.contains('Army|navY|air fORCe|mIlItArY', case=False))]
military_conflicts['fatalities'].count() #number of observations case insensitive

military_conflicts = south_america[(south_america['notes'].str.contains('Army|navY|air fORCe|mIlItArY', case=True))]
military_conflicts['fatalities'].count() # number of observations case sensitive

11411

396
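The effect of `case` is easy to check on a handful of made-up notes. The Series below is hypothetical, and it includes a missing value to show the `na=False` argument, which tells `str.contains()` to treat missing text as a non-match. Whether the ACLED notes column has missing values is not something we assume here.

```python
import pandas as pd

# Hypothetical notes, including one missing value
notes = pd.Series([
    'The Army arrived at dawn.',
    'army units withdrew.',
    'A peaceful march took place.',
    None,
])

# case=False matches 'Army' and 'army'; na=False turns the missing value into False
insensitive = notes.str.contains('army', case=False, na=False)

# case=True matches only the exact capitalization
sensitive = notes.str.contains('Army', case=True, na=False)

insensitive.sum(), sensitive.sum()
```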

### Search multiple terms with `and` condition `&`

Let's assume your research question was about conflict events where both the military and heavy-duty vehicles belonging to those armed forces were present, e.g. 'Army' and 'tank', or 'Air Force' and 'plane'. Those events might be of interest to you because they denote a level of severity and specificity in conflict that goes beyond a military presence. Let's see how it might go if we are interested in the occurrence of airplanes in a military context.

Note: we are already familiar with the ACLED data and the `notes` column specifically. Therefore, we know beforehand that searching for 'plane' or 'airplane' alone leads to conflicts regarding transportation, not necessarily related to armed forces. Also, a country's Air Force might not be involved or named as such. So you will look for any mention of the term 'plane' __and__ 'military' __or__ 'air force'. Make this search case-insensitive again.

Unfortunately, the `&` condition requires you to specify the search terms in two separate calls of `str.contains()`. So we put the 'plane' search in one str.contains() call, follow it with the `&` symbol, and then add a second call that searches for any of the armed forces terms. Each call is wrapped in its own parentheses, and the whole expression goes inside the dataframe subset brackets of `south_america[...]`.

In [40]:
pd.options.display.max_colwidth = 300 # this option lets you read up to 300 characters per cell in the output.

air_conflicts = south_america[(south_america['notes'].str.contains('plane', case=False)) & (south_america['notes'].str.contains('military|air force', case=False))]
air_conflicts['fatalities'].count() #number of observations
air_conflicts['notes'].head(3)

11

14488    On 19 September 2021, from Angra dos Reis municipality to Rio de Janeiro - West Zone, Rio de Janeiro, two members of the CV hijacked a plane transporting one of its members to the Gericino Complex Penitentiary in an attempt to release the group member from prison. The perpetrators were armed wit...
42916                                                                    Around 15 September 2020 (as reported), in a rural area of Machiques de Perija, Zulia state, military officers seized and afterwards burnt down a plane with American registration, which was suspected of being used for drug trafficking.
45954                                                                                                                                     On 2 August 2020, in Campo Grande, Mato Grosso do Sul, the Brazilian air force seized two planes loaded with cocaine, in a total of 1.1 tons of drugs. A man was arrested.
Name: notes, dtype: object
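It can help to build the two conditions separately and name them before combining them. The sketch below does this on three made-up notes (the `notes` Series is hypothetical):

```python
import pandas as pd

# Hypothetical notes: only the first mentions both a plane and the armed forces
notes = pd.Series([
    'Military officers seized a plane.',
    'A plane was delayed by a strike.',
    'The air force held a parade.',
])

# Each call returns a True/False value per row
has_plane = notes.str.contains('plane', case=False)
has_forces = notes.str.contains('military|air force', case=False)

# & keeps only the rows where both conditions are True
both = notes[has_plane & has_forces]

both
```

Naming the masks also makes it easy to inspect each condition on its own when a search returns fewer rows than you expected.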

# Regular expressions (regex)

For a quick reference to regex syntax, see this [regular expressions cheat sheet](https://cheatography.com/davechild/cheat-sheets/regular-expressions/).

## Searching strings with regex

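You can use a regular expression pattern to run the same military query. In fact, str.contains() already treats its search string as a regular expression by default, which is why the `|` symbol worked in the previous section. The first thing to do is prefix the pattern with `r`, which makes it a raw string so that backslashes such as the one in `\s` reach the regex engine unchanged. Then put your regular expression in quotes (single or double quotes both work): `str.contains(r'pattern')`.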

Regular expressions can ignore upper and lowercase with an inline case-insensitive flag, `(?i)`, instead of a function-specific argument. Place the flag at the very start of the pattern, before the four terms for armed forces we used above. Each term is still separated by the `|` symbol. The term 'air force' has a space in the middle. The regex code for a whitespace character is `\s`. However, the same query with a literal blank instead of `\s` yields the same outcome here. It is just better practice to tell regex exactly what you are looking for.

In [12]:
military_conflicts = south_america[south_america['notes'].str.contains(r'(?i)army|navy|air\sforce|military')]
military_conflicts['fatalities'].count()

# the same search with a literal blank instead of the regex whitespace code '\s'
military_conflicts = south_america[south_america['notes'].str.contains(r'(?i)army|navy|air force|military')]
military_conflicts['fatalities'].count()

11411

11411

The code is similar to the previous searches _without_ regex. We won't claim it's much simpler, however. It comes down to your comfort and familiarity with regex. As with all languages, the more you work with text data (web scraping and mining social media data both benefit greatly from knowing regular expressions), the better you'll get. Regardless of your preference in this example, there will be times when using regex is unavoidable, namely when the text you need to extract follows a pattern rather than matching an exact word.
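One case where the whitespace code earns its place is text with irregular spacing. The sketch below uses made-up notes (hypothetical, not from ACLED) in which 'AIR  FORCE' has two blanks. It also shows `flags=re.IGNORECASE`, an argument of `str.contains()` that does the same job as the inline `(?i)` flag.

```python
import re
import pandas as pd

# Hypothetical notes; the first one has two blanks between 'AIR' and 'FORCE'
notes = pd.Series([
    'The AIR  FORCE responded.',
    'Navy ships docked.',
    'A peaceful march took place.',
])

# A literal single blank does not match the double blank
literal_space = notes.str.contains(r'(?i)army|navy|air force|military')

# \s+ matches one or more whitespace characters, so the double blank is found
flexible_space = notes.str.contains(r'army|navy|air\s+force|military',
                                    flags=re.IGNORECASE)

literal_space.tolist(), flexible_space.tolist()
```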

Searching with regex for 'plane' together with the military or air force requires a somewhat more complicated pattern. It could be made even more complex, but we provide a pattern that works in general for convenience!

We need an instance where _both_ terms are present. However, regex cares a lot about order, because it is a language that processes sequences. Searching for 'plane' followed by 'air force or military' returns only text where the term 'plane' precedes 'air force' or 'military'. So the regex pattern has to repeat the match with the order reversed, joined by `|`. The full pattern is `(?i)(plane.*(air force|military)|(military|air force).*plane)`.

In [41]:
pd.options.display.max_colwidth = 300 # this option lets you read up to 300 characters per cell in the output.

# the inline (?i) flag already makes the search case insensitive, so no case argument is needed
air_conflicts = south_america[south_america['notes'].str.contains(r'(?i)(plane.*(air force|military)|(military|air force).*plane)')]
air_conflicts['fatalities'].count() #number of observations
air_conflicts['notes'].head(3)

11

14488    On 19 September 2021, from Angra dos Reis municipality to Rio de Janeiro - West Zone, Rio de Janeiro, two members of the CV hijacked a plane transporting one of its members to the Gericino Complex Penitentiary in an attempt to release the group member from prison. The perpetrators were armed wit...
42916                                                                    Around 15 September 2020 (as reported), in a rural area of Machiques de Perija, Zulia state, military officers seized and afterwards burnt down a plane with American registration, which was suspected of being used for drug trafficking.
45954                                                                                                                                     On 2 August 2020, in Campo Grande, Mato Grosso do Sul, the Brazilian air force seized two planes loaded with cocaine, in a total of 1.1 tons of drugs. A man was arrested.
Name: notes, dtype: object

In both code cells above, you can corroborate that the regex patterns return the same number of observations as the plain text searches in the previous section: 11,411 matches for any military forces, and eleven observations that mention planes together with the military or air force in the notes.
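Repeating the pattern in reverse gets unwieldy as the number of terms grows. A common alternative, which is our suggestion rather than something the searches above rely on, is the regex lookahead `(?=...)`. It checks that a term appears somewhere ahead without consuming any text, so two lookaheads in a row act like an `AND` regardless of order. The notes below are made up for illustration.

```python
import pandas as pd

# Hypothetical notes: the terms appear in both orders, and one note lacks the armed forces
notes = pd.Series([
    'Military officers burnt down a plane.',
    'A plane was seized by the air force.',
    'A plane was delayed by a strike.',
])

# Each (?=.*term) must succeed from the same starting point, in any order;
# (?:...) groups the alternatives without creating a capture group
pattern = r'(?i)(?=.*plane)(?=.*(?:military|air force))'
both_terms = notes.str.contains(pattern)

both_terms.tolist()
```

Note that `.` does not match line breaks by default, so this sketch assumes each note sits on a single line.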

## Pull regex pattern matches with str.extract()

The event_id_cnty variable holds the three-letter ISO country code plus the five-digit unique event ID. The pattern is very simple to identify: three letters from the alphabet, followed by five digits. Consider the first value in event_id_cnty:

`COL10764`

We could use any of these regex patterns to extract the event ID and ISO. 

- `^.{3}` The first three symbols regardless whether letters or numbers gets country ISO. 
- `[A-Z]+` One or more uppercase letters gets country ISO.
- `[A-Z]{3}` Exactly three uppercase letters gets country ISO.
- `.{5}$` The last five symbols gets event ID.
- `\d+` One or more numbers gets event ID.
- `\d{5}` Exactly five numbers gets event ID. 
- `[^\d]{3}` Three symbols that are __not__ numbers gets country ISO.
- `[^A-Z]{5}` Five symbols that are __not__ uppercase letters gets event ID.
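You can try every pattern in the list on a single value with Python's built-in `re` module before applying it to the whole column. The sketch below uses the identifier `COL10764`, which appears in the outputs earlier in this chapter. The `patterns` list and `matches` dictionary are names we made up for the example.

```python
import re

# One event identifier taken from the outputs shown earlier
event = 'COL10764'

# The eight patterns from the list above
patterns = [
    r'^.{3}', r'[A-Z]+', r'[A-Z]{3}', r'[^\d]{3}',   # country ISO
    r'.{5}$', r'\d+', r'\d{5}', r'[^A-Z]{5}',        # event ID
]

# re.search() finds the first match; .group() returns the matched text
matches = {p: re.search(p, event).group() for p in patterns}

matches
```

On this value the first four patterns return the letters and the last four return the digits. On messier identifiers they would start to disagree, which is a reason to prefer the more specific patterns.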

And there are several more ways we could capture the ISO and event ID strings; the list above is not at all comprehensive. Try a few of these expressions out below. Be careful, however, to place the pattern you want to extract within parentheses. The str.contains() and str.match() functions don't care, but str.extract() pulls out a pattern, and thus expects a capture group that marks what to pull. You specify a capture group with parentheses.

In [44]:
# use str.extract()
south_america['event_id_cnty'].str.extract(r'([A-Z]{3})')

# expand=False returns a flat vector (a Series) instead of a one-column dataframe
south_america['event_id_cnty'].str.extract(r'([A-Z]{3})', expand=False)

Unnamed: 0,0
0,COL
1,COL
2,COL
3,PAR
4,CHI
...,...
107233,VEN
107234,GUY
107235,BRA
107236,BRA


0         COL
1         COL
2         COL
3         PAR
4         CHI
         ... 
107233    VEN
107234    GUY
107235    BRA
107236    BRA
107237    BRA
Name: event_id_cnty, Length: 107238, dtype: object

In the code cell below you can now create two new columns with extract and insert. You could also use str.split(), which you have used before, and provide a pattern as the delimiter. For now, create a vector for both the event ID and the ISO codes with str.extract(). It is important in this case to use the `expand=False` argument to obtain a flat vector to insert. By default the str.extract() function returns a dataframe with one column per capture group, which doesn't play well with the insert() function.

In [45]:
country_iso = south_america['event_id_cnty'].str.extract(r'([A-Z]{3})', expand=False)
event_id = south_america['event_id_cnty'].str.extract(r'(\d{5})', expand=False)

south_america.insert(1, 'country_iso', country_iso)
south_america.insert(2, 'event_id', event_id)

south_america.head(1)

Unnamed: 0,event_id_cnty,country_iso,event_id,event_date,day,month,year,event_type,sub_event_type,actor1,actor2,country,admin1,location,notes,fatalities
0,COL10764,COL,10764,15 April 2022,15,April,2022,Battles,Armed clash,Gulf Clan,ACSN: Self-Defense Conquerors of Sierra Nevada,Colombia,Magdalena,Cienaga,"On 15 April 2022, in the rural area of Cienaga municipality (Magdalena), Gulf Clan members and ACSN clashed in the La Secreta vereda. Residents claimed that the armed clashes have lasted more than 4 hours, and that more than 300 residents have been displaced after being caught in the crossfire. ...",5
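When you want both pieces at once, `str.extract()` can take a pattern with several capture groups, and named groups written as `(?P<name>...)` become the column names. The sketch below is an alternative to the two separate calls above. The identifiers in `ids` are made up, apart from the first.

```python
import pandas as pd

# Hypothetical identifiers in the same letters-then-digits layout
ids = pd.Series(['COL10764', 'PAR2031', 'BRA55120'])

# Two named groups: three uppercase letters, then one or more digits
pieces = ids.str.extract(r'(?P<country_iso>[A-Z]{3})(?P<event_id>\d+)')

pieces
```

The result is a dataframe with one column per group, so here the default `expand=True` behaviour is what you want.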


## done

- str split
- split event date to two day/month columns
- find matches in the notes column for terms regarding the military. use regex and regular string example.
- introduce regex! split event_id_cnty to two columns, event id and country_iso

# To-do

- make a column for intensity where every note with brackets has size after equals (protests and riots); this regex might go beyond the chapter, so it could be homework?




