### Introduction

In this mission, we'll learn a few string cleaning tasks, such as:

- Finding specific strings or substrings in columns
- Extracting substrings from unstructured data
- Removing strings or substrings from a series

We'll work with the __2015 World Happiness Report__ again, along with additional economic data from the __World Bank__. You can find the data set <a href='https://www.kaggle.com/worldbank/world-development-indicators/version/2'>here</a>.

Below are descriptions for the columns we'll be working with:

- `ShortName` - Name of the country
- `Region` - The region the country belongs to
- `IncomeGroup` - The income group the country belongs to, based on Gross National Income (GNI) per capita
- `CurrencyUnit` - Name of country's currency
- `SourceOfMostRecentIncomeAndExpenditureData` - The name of the survey used to collect the income and expenditure data
- `SpecialNotes` - Contains any miscellaneous notes about the data

In [1]:
import pandas as pd

In [2]:
happiness2015 = pd.read_csv('data/World_Happiness_2015.csv')
world_dev = pd.read_csv('data/World_dev.csv')

col_renaming = {'SourceOfMostRecentIncomeAndExpenditureData': 'IESurvey'}

In [3]:
merged = pd.merge(left=happiness2015, right=world_dev, how='left', left_on= happiness2015['Country'], right_on= world_dev['ShortName'])

merged = merged.rename(col_renaming, axis=1)

In [4]:
merged.head()

Unnamed: 0,key_0,Country,Region_x,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,...,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,IESurvey,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
0,Switzerland,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2010,,"Expenditure survey/budget survey (ES/BS), 2004",Yes,2008,2010.0,2013.0,2000.0
1,Iceland,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Integrated household survey (IHS), 2010",Yes,2010,2005.0,2013.0,2005.0
2,Denmark,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Income tax registers (ITR), 2010",Yes,2010,2010.0,2013.0,2009.0
3,Norway,Norway,Western Europe,4,7.522,0.0388,1.459,1.33095,0.88521,0.66973,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Income survey (IS), 2010",Yes,2010,2010.0,2013.0,2006.0
4,Canada,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Labor force survey (LFS), 2010",Yes,2011,2011.0,2013.0,1986.0
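
The `key_0` column in the output above appears to come from passing Series objects as the join keys. As a minimal sketch on made-up toy data (the frames `happiness` and `dev` and all of their values are hypothetical, not taken from the real files), passing the column names instead keeps both key columns without adding an extra one:

```python
import pandas as pd

# Toy stand-ins for the two data sets; the values are invented for illustration.
happiness = pd.DataFrame({'Country': ['Denmark', 'Norway', 'Atlantis'],
                          'Happiness Score': [7.5, 7.4, 6.0]})
dev = pd.DataFrame({'ShortName': ['Denmark', 'Norway'],
                    'CurrencyUnit': ['Danish krone', 'Norwegian krone']})

# Pass column names (not the Series themselves) as the join keys.
toy_merged = pd.merge(left=happiness, right=dev, how='left',
                      left_on='Country', right_on='ShortName')

print(toy_merged.columns.tolist())
# A left join keeps unmatched rows from the left frame and fills in missing values.
print(toy_merged['CurrencyUnit'].isnull().sum())
```

Because this is a left join, countries with no match in the right-hand frame still appear in the result, which is one way missing values can end up in a column such as `CurrencyUnit`.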


### Using Apply to Transform Strings

Let's work with the `CurrencyUnit` column first. Suppose we wanted to extract the unit of currency without the leading nationality. For example, instead of "Danish krone" or "Norwegian krone", we just needed "krone".

If we wanted to complete this task for just one of the strings, we could use Python's `string.split()` method:
```
words = 'Danish krone'

#Use the string.split() method to return the following list: ['Danish', 'krone']
listwords = words.split()

#Use the index -1 to return the last word of the list.
listwords[-1]
```

Now, to repeat this task for each element in the Series, let's return to a concept we learned in the previous mission - the `Series.apply()` method.

In [5]:
def extract_last_word(element):
    return str(element).split()[-1]

merged['Currency Apply'] = merged['CurrencyUnit'].apply(extract_last_word)

merged['Currency Apply'].head()

0     franc
1     krona
2     krone
3     krone
4    dollar
Name: Currency Apply, dtype: object

### Vectorized String Methods Overview

We extracted the last word of each element in the `CurrencyUnit` column using the `Series.apply()` method. However, we also learned that we should use built-in vectorized methods (if they exist) instead of the `Series.apply()` method for performance reasons.

Instead, we could've split each element in the `CurrencyUnit` column into a list of strings with the `Series.str.split()` method, the vectorized equivalent of Python's `string.split()` method:

<img src='_images/Split.png'>

In fact, pandas provides a number of vectorized string methods that perform the same operations on the strings in a Series as Python's built-in string methods.

Below are some common vectorized string methods, but you can find the full list <a href='https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary'>here</a>:

|Method|Description|
|-|-|
|Series.str.split()|Splits each element in the Series.|
|Series.str.strip()|Strips whitespace from each string in the Series.|
|Series.str.lower()|Converts strings in the Series to lowercase.|
|Series.str.upper()|Converts strings in the Series to uppercase.|
|Series.str.get()|Retrieves the ith element of each element in the Series.|
|Series.str.replace()|Replaces a regex or string in the Series with another string.|
|Series.str.cat()|Concatenates strings in a Series.|
|Series.str.extract()|Extracts substrings from the Series matching a regex pattern.|

We access these vectorized string methods by adding `str` between the Series name and the method name:

```
Series.str.method_name()
```

The `str` attribute indicates that each object in the Series should be treated as a string, without us having to explicitly change the type to a string like we did when using the `apply` method.
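
As a small sketch of the `str` accessor in action, the snippet below applies a few of the methods from the table to a made-up Series (the name `currencies` and its values are hypothetical). Note that the missing value needs no special handling:

```python
import pandas as pd

# A toy Series with stray whitespace and one missing value.
# dtype=object is set explicitly so the behavior matches plain text columns.
currencies = pd.Series(['  Swiss franc ', 'Danish krone', None], dtype=object)

stripped = currencies.str.strip()    # remove leading/trailing whitespace
upper = stripped.str.upper()         # convert to uppercase
# regex=False treats 'krone' as a literal string rather than a pattern.
replaced = stripped.str.replace('krone', 'KRONE', regex=False)

print(stripped.tolist())
print(upper.tolist())
print(replaced.tolist())
```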

Note that we can also slice each element in the Series to extract characters, but we'd still need to use the `str` attribute. For example, below we access the first five characters in each element of the `CurrencyUnit` column:
```
merged['CurrencyUnit'].str[0:5]

0    Swiss
1    Icela
2    Danis
3    Norwe
4    Canad
Name: CurrencyUnit, dtype: object
```

It's also good to know that vectorized string methods can be _chained_. For example, suppose we needed to convert each element in the `CurrencyUnit` column to uppercase using the `Series.str.upper()` method and then split it into a list of strings using the `Series.str.split()` method. You can use the following syntax to apply more than one method at once:
```
merged['CurrencyUnit'].str.upper().str.split()
```

However, don't forget to include `str` before each method name, or you'll get an error!

Below are the first five rows of the result:
```
0    [AFGHAN, AFGHANI]
1      [ALBANIAN, LEK]
2    [ALGERIAN, DINAR]
3       [U.S., DOLLAR]
4               [EURO]
```
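
To illustrate why `str` is needed before each method in a chain, here is a minimal sketch on a made-up Series (the name `currencies` is hypothetical). The first chain works; the second omits the second `str` and raises an `AttributeError`, because a Series itself has no `split` method:

```python
import pandas as pd

currencies = pd.Series(['Danish krone', 'Euro'], dtype=object)

# Correct: `str` appears before each string method.
chained = currencies.str.upper().str.split()
print(chained.tolist())

# Incorrect: the second `str` is missing, so pandas looks for Series.split().
raised = False
try:
    currencies.str.upper().split()
except AttributeError as err:
    raised = True
    print('AttributeError:', err)
```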

In [6]:
merged['Currency Vectorized'] = merged['CurrencyUnit'].str.split().str.get(-1)
merged['Currency Vectorized'].head()

0     franc
1     krona
2     krone
3     krone
4    dollar
Name: Currency Vectorized, dtype: object
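
The same split-then-get pattern can be tried on a small made-up Series (the name `currencies` and its values are hypothetical). This sketch also shows that indexing through the `str` accessor with `.str[-1]` gives the same result as `.str.get(-1)`, and that single-word and missing entries cause no errors:

```python
import pandas as pd

currencies = pd.Series(['Danish krone', 'Euro', None, 'U.S. dollar'], dtype=object)

# Split each string on whitespace, then take the last item of each list.
last_get = currencies.str.split().str.get(-1)

# Indexing through the str accessor is an equivalent spelling.
last_idx = currencies.str.split().str[-1]

print(last_get.tolist())
print(last_get.equals(last_idx))
```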

We learned that using vectorized string methods results in:

1. Better performance
2. Code that is easier to read and write

### Exploring Missing Values with Vectorized String Methods

Let's explore another benefit of using vectorized string methods next. Suppose we wanted to compute the length of each string in the `CurrencyUnit` column. If we use the `Series.apply()` method, what happens to the missing values in the column?

First, let's use the `Series.isnull()` method to confirm if there are any missing values in the column:

In [7]:
merged['CurrencyUnit'].isnull().sum()

13

So, we know that the `CurrencyUnit` column has 13 missing values.

Next, let's create a function to return the length of each currency unit and apply it to the `CurrencyUnit` column:

In [8]:
def compute_lengths(element):
    return len(str(element))

lengths_apply = merged['CurrencyUnit'].apply(compute_lengths)

Then, we can check the number of missing values in the result by setting the `dropna` parameter in the `Series.value_counts()` method to False:

In [9]:
lengths_apply.value_counts(dropna=False)

14    21
4     20
12    17
13    14
3     13
15    13
16    12
18     9
17     9
11     8
22     7
25     5
19     3
9      2
26     1
20     1
23     1
10     1
39     1
Name: CurrencyUnit, dtype: int64

Since the original column had 13 missing values and `NaN` doesn't appear in the list of unique values above, we know our function must have treated `NaN` as a string and returned a length of `3` for each `NaN` value. This doesn't make sense - missing values shouldn't be treated as strings. They should instead have been _excluded_ from the calculation.

If we wanted to exclude missing values, we'd have to update our function to something like this:
```
def compute_lengths(element):
    # Return NaN for missing values so they stay missing in the result.
    if pd.isnull(element):
        return float('nan')
    return len(str(element))
lengths_apply = merged['CurrencyUnit'].apply(compute_lengths)
```

In [10]:
lengths = merged['CurrencyUnit'].str.len()
value_counts = lengths.value_counts(dropna=False)

We identified a third benefit of using vectorized string methods - they exclude missing values:

1. Better performance
2. Code that is easier to read and write
3. Automatically excludes missing values
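
The difference in how missing values are treated can be seen on a tiny made-up Series (the name `currencies` and its values are hypothetical). In this sketch, the `apply` version turns `NaN` into the string `'nan'` and reports a length of 3, while `Series.str.len()` leaves the value missing:

```python
import pandas as pd

currencies = pd.Series(['Euro', float('nan'), 'Danish krone'], dtype=object)

# apply + str(): NaN becomes the string 'nan', which has length 3.
naive = currencies.apply(lambda element: len(str(element)))

# Vectorized method: the missing value is passed through as missing.
vectorized = currencies.str.len()

print(naive.tolist())
print(vectorized.isnull().sum())
```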

### Finding Specific Words in Strings

Suppose we needed to parse the elements of a Series to find a string or substring that doesn't appear in the same position in each string. For example, let's look at the `SpecialNotes` column. A number of rows mention "national accounts", but the words appear in different places in each comment:
```
April 2013 database update: Based on IMF data, national accounts data were revised for 2000 onward; the base year changed to 2002.
Based on IMF data, national accounts data have been revised for 2005 onward; the new base year is 2005.
```
If we wanted to determine how many comments contain this phrase, could we split them into lists? Since the formats are different, how could we tell which element contains the "national accounts" phrase?

We can handle problems like this with __regular expressions__, or __regex__ for short. A regular expression is a sequence of characters that describes a search pattern, used to match characters in a string:

<img src='_images/Regular_Expressions.png'>

In pandas, regular expressions are integrated with vectorized string methods to make finding and extracting patterns of characters easier.

In [11]:
import re

In [12]:
pattern = r"[Nn]ational accounts"

national_accounts = merged['SpecialNotes'].str.contains(pattern)
national_accounts.head()

0     NaN
1     NaN
2     NaN
3     NaN
4    True
Name: SpecialNotes, dtype: object

In [13]:
#Return the value counts for each value in the Series, including missing values.
national_accounts.value_counts(dropna=False)

NaN      65
True     54
False    39
Name: SpecialNotes, dtype: int64
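
A result that contains `NaN` cannot be used directly as a boolean mask for indexing. As a minimal sketch on made-up notes (the Series `notes` and its text are hypothetical), the `na` parameter of `Series.str.contains()` sets the value used for missing entries, so `na=False` yields a clean boolean Series:

```python
import pandas as pd

notes = pd.Series(['National accounts data were revised.',
                   'Based on IMF data, national accounts data have been revised.',
                   'Fiscal year end: March 20.',
                   float('nan')], dtype=object)

pattern = r"[Nn]ational accounts"

# Default behavior: the missing note stays missing in the result.
with_nan = notes.str.contains(pattern)

# na=False: the missing note is reported as "does not contain the pattern".
no_nan = notes.str.contains(pattern, na=False)

# The boolean Series can now be used to select the matching rows.
matches = notes[no_nan]

print(with_nan.tolist())
print(no_nan.tolist())
print(len(matches))
```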

In [15]:
national_accounts = merged['SpecialNotes'].str.contains(pattern, na=False)

merged_national_accounts = merged[national_accounts]
merged_national_accounts.head()

Unnamed: 0,key_0,Country,Region_x,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,...,LatestPopulationCensus,LatestHouseholdSurvey,IESurvey,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData,Currency Apply,Currency Vectorized
4,Canada,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,...,2011,,"Labor force survey (LFS), 2010",Yes,2011,2011.0,2013.0,1986.0,dollar,dollar
7,Sweden,Sweden,Western Europe,8,7.364,0.03157,1.33171,1.28907,0.91087,0.6598,...,2011,,"Income survey (IS), 2005",Yes,2010,2010.0,2013.0,2007.0,krona,krona
8,New Zealand,New Zealand,Australia and New Zealand,9,7.286,0.03371,1.25018,1.31967,0.90837,0.63938,...,2013,,,Yes,2012,2010.0,2013.0,2002.0,dollar,dollar
9,Australia,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,...,2011,,"Expenditure survey/budget survey (ES/BS), 2003",Yes,2011,2011.0,2013.0,2000.0,dollar,dollar
14,United States,United States,North America,15,7.119,0.03839,1.39451,1.24711,0.86179,0.54604,...,2010,,"Labor force survey (LFS), 2010",Yes,2012,2008.0,2013.0,2005.0,dollar,dollar


### Extracting Substrings from a Series

Let's continue exploring the versatility of regular expressions while learning a new task - extracting characters from strings.

Suppose we wanted to extract any year mentioned in the `SpecialNotes` column. Notice that the characters in a year follow a specific pattern:

<img src='_images/Years.png'>

The first digit can be either 1 or 2, while the last three digits can be any number between 0 and 9.

With regular expressions, we use the following syntax to indicate a character could be a range of numbers:

```
pattern = r"[0-9]"
```

And we use the following syntax to indicate a character could be a range of letters:
```
#lowercase letters
pattern1 = r"[a-z]"

#uppercase letters
pattern2 = r"[A-Z]"
```
We could also make these ranges more restrictive. For example, if we wanted to find a three-character substring in a column that starts with a number between 1 and 6 and ends with two lowercase letters, we could use the following syntax:
```
pattern = r"[1-6][a-z][a-z]"
```
If we have a pattern that repeats, we can also use curly brackets { and } to indicate the number of times it repeats:
```
#These two patterns are equivalent:
pattern = r"[1-6][a-z][a-z]"
pattern = r"[1-6][a-z]{2}"
```
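
To check that the repeated form and the curly-bracket form behave the same way, here is a short sketch using Python's standard `re` module on a few made-up strings (the `samples` list is hypothetical):

```python
import re

pattern_long = r"[1-6][a-z][a-z]"
pattern_short = r"[1-6][a-z]{2}"

samples = ['3ab', '7ab', '3AB', 'x5qz']

# re.search() returns a match object if the pattern occurs anywhere in the
# string, and None otherwise; bool() turns that into True/False.
results_long = [bool(re.search(pattern_long, text)) for text in samples]
results_short = [bool(re.search(pattern_short, text)) for text in samples]

print(results_long)
print(results_short)
```

Both patterns reject `'7ab'` (7 is outside the range 1-6) and `'3AB'` (the letters are uppercase), and both accept `'x5qz'` because the substring `5qz` matches.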

In [16]:
pattern = r"([1-2][0-9]{3})"

years = merged['SpecialNotes'].str.extract(pattern)

When we used the `Series.str.extract()` method, we enclosed our regular expression in parentheses. The parentheses indicate that only the character pattern matched should be extracted and returned in a series. We call this a __capturing group__.

<img src='_images/Parantheses.png'>

If the capturing group doesn't exist in a row (or there is no match) the value in that row is set to `NaN` instead. As a result, the Series returned looked like this:

<img src='_images/Extracting_Results.png'>

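We can also return the results as a dataframe by setting the `expand` parameter to `True`. Note that in pandas 0.23 and later, `expand=True` is already the default, so you would pass `expand=False` to get a Series back instead.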

In [17]:
years = merged['SpecialNotes'].str.extract(pattern, expand=True)
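
The effect of the `expand` parameter can be seen on a small made-up Series (the name `notes` and its text are hypothetical). This sketch passes `expand` explicitly in both calls so the result does not depend on the default in your pandas version:

```python
import pandas as pd

notes = pd.Series(['Revised for 2000 onward; base year changed to 2002.',
                   'No year mentioned.'], dtype=object)

pattern = r"([1-2][0-9]{3})"

# expand=False with one capturing group returns a Series.
as_series = notes.str.extract(pattern, expand=False)

# expand=True returns a dataframe with one column per capturing group.
as_frame = notes.str.extract(pattern, expand=True)

print(type(as_series).__name__, as_series.tolist())
print(type(as_frame).__name__, as_frame.shape)
```

Only the first match in each string is returned, so the first note yields `2000` and not `2002`; the note with no year yields a missing value.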

### Extracting All Matches of a Pattern from a Series

