# Working with Strings in Pandas

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Reading datasets
happiness2015 = pd.read_csv("datasets/World_Happiness_2015.csv")
world_dev = pd.read_csv("datasets/World_dev.csv")

In [3]:
happiness2015.head(2)

Unnamed: 0,Country,Region,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),Generosity,Dystopia Residual
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,0.29678,2.51738
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,0.4363,2.70201


In [4]:
world_dev.head(2)

Unnamed: 0,CountryCode,ShortName,TableName,LongName,Alpha2Code,CurrencyUnit,SpecialNotes,Region,IncomeGroup,Wb2Code,...,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,SourceOfMostRecentIncomeAndExpenditureData,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
0,AFG,Afghanistan,Afghanistan,Islamic State of Afghanistan,AF,Afghan afghani,Fiscal year end: March 20; reporting period fo...,South Asia,Low income,AF,...,Consolidated central government,General Data Dissemination System (GDDS),1979,"Multiple Indicator Cluster Survey (MICS), 2010/11","Integrated household survey (IHS), 2008",,2013/14,,2013.0,2000.0
1,ALB,Albania,Albania,Republic of Albania,AL,Albanian lek,,Europe & Central Asia,Upper middle income,AL,...,Budgetary central government,General Data Dissemination System (GDDS),2011,"Demographic and Health Survey (DHS), 2008/09",Living Standards Measurement Study Survey (LSM...,Yes,2012,2011.0,2013.0,2006.0


In [5]:
# Merging the two dataframes using pd.merge()

merged = pd.merge(left=happiness2015, 
                  right=world_dev, 
                  how='left', 
                  left_on='Country', 
                  right_on='ShortName')

In [17]:
# Renaming column

col_renaming = {'SourceOfMostRecentIncomeAndExpenditureData': 'IESurvey'}
merged.rename(col_renaming, axis=1, inplace=True)

merged.head(3)

Unnamed: 0,Country,Region_x,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),...,GovernmentAccountingConcept,ImfDataDisseminationStandard,LatestPopulationCensus,LatestHouseholdSurvey,IESurvey,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData
0,Switzerland,Western Europe,1,7.587,0.03411,1.39651,1.34951,0.94143,0.66557,0.41978,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2010,,"Expenditure survey/budget survey (ES/BS), 2004",Yes,2008,2010.0,2013.0,2000.0
1,Iceland,Western Europe,2,7.561,0.04884,1.30232,1.40223,0.94784,0.62877,0.14145,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Integrated household survey (IHS), 2010",Yes,2010,2005.0,2013.0,2005.0
2,Denmark,Western Europe,3,7.527,0.03328,1.32548,1.36058,0.87464,0.64938,0.48357,...,Consolidated central government,Special Data Dissemination Standard (SDDS),2011,,"Income tax registers (ITR), 2010",Yes,2010,2010.0,2013.0,2009.0


Perform the following tasks:

1. Extract the unit of currency from the "CurrencyUnit" column without the leading nationality. For example, instead of "Danish krone", just "krone".
<br>
- We could use Python's str.split() method for the extraction.
<br>
- To repeat this for each element in the Series, we can use the Series.apply() method.

In [22]:
merged["CurrencyUnit"]

0                 Swiss franc
1               Iceland krona
2                Danish krone
3             Norwegian krone
4             Canadian dollar
                ...          
153             Rwandan franc
154    West African CFA franc
155                       NaN
156             Burundi franc
157    West African CFA franc
Name: CurrencyUnit, Length: 158, dtype: object

In [23]:
# Custom function to extract the last word from a string

def extract_last_word(element):
    return str(element).split()[-1]

In [27]:
# Apply it to the entire column using Series.apply()

merged['Currency Apply']=merged['CurrencyUnit'].apply(extract_last_word)
merged['Currency Apply']

0       franc
1       krona
2       krone
3       krone
4      dollar
        ...  
153     franc
154     franc
155       nan
156     franc
157     franc
Name: Currency Apply, Length: 158, dtype: object

However, vectorized methods are generally faster and more concise than Series.apply(). Pandas has many built-in vectorized string methods that perform the same operations on a Series as Python's string methods do on a single string.
<br>
The full list is here: https://pandas.pydata.org/pandas-docs/stable/text.html#method-summary
<br><br>
In this case, we can use:
- Series.str.split() method to split the string into a list of words.
- Series.str.get() method to get the word at a given position.

In [30]:
merged['CurrencyUnit'].str.split()

0                   [Swiss, franc]
1                 [Iceland, krona]
2                  [Danish, krone]
3               [Norwegian, krone]
4               [Canadian, dollar]
                  ...             
153               [Rwandan, franc]
154    [West, African, CFA, franc]
155                            NaN
156               [Burundi, franc]
157    [West, African, CFA, franc]
Name: CurrencyUnit, Length: 158, dtype: object

In [29]:
merged['Currency Vectorized']=merged['CurrencyUnit'].str.split().str.get(-1)
merged['Currency Vectorized']

0       franc
1       krona
2       krone
3       krone
4      dollar
        ...  
153     franc
154     franc
155       NaN
156     franc
157     franc
Name: Currency Vectorized, Length: 158, dtype: object
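
The two outputs above differ at index 155: the apply version produced the text "nan", while the vectorized version kept a real missing value. A minimal sketch of that difference, using a small made-up Series (the sample values and variable names are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up sample values, including one missing entry
currency = pd.Series(["Swiss franc", "West African CFA franc", np.nan])

# Element-by-element: str() turns the missing value into the text "nan"
via_apply = currency.apply(lambda value: str(value).split()[-1])

# Vectorized: missing values are skipped and stay missing
via_str = currency.str.split().str.get(-1)

print(via_apply.tolist())        # ['franc', 'franc', 'nan']
print(via_str.isna().tolist())   # [False, False, True]
```

Keeping NaN as NaN matters later, because methods such as isnull() and value_counts(dropna=False) would not recognize the string "nan" as missing.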

2. Compute the length of each string in the CurrencyUnit column.
<br>

We can use Series.apply() to apply a custom function to find the length of each string in the column.
<br> 
But what would happen to the null values?

In [31]:
# Check if there are any missing values
merged['CurrencyUnit'].isnull().sum()

13

In [32]:
# Custom function to return the length of each currency unit
def compute_lengths(element):
    return len(str(element))

# Apply it to the column
lengths_apply = merged['CurrencyUnit'].apply(compute_lengths)

In [38]:
# Check the number of missing values in the result
lengths_apply.value_counts(dropna=False)

14.0    21
4.0     20
12.0    17
13.0    14
NaN     13
15.0    13
16.0    12
18.0     9
17.0     9
11.0     8
22.0     7
25.0     5
19.0     3
9.0      2
26.0     1
23.0     1
10.0     1
39.0     1
20.0     1
Name: CurrencyUnit, dtype: int64

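With this first version of the function, no missing values survive in the calculated lengths: str() turns NaN into the string "nan", so the function returns length=3 for it. (The output above still shows NaN, apparently because cell In [38] was executed after the function was redefined in In [35] below.)<br>
We have to modify the custom function to ignore the null values.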

In [35]:
def compute_lengths(element):
    # Return NaN for missing values instead of measuring the string "nan"
    if pd.isnull(element):
        return np.nan
    return len(str(element))
lengths_apply = merged['CurrencyUnit'].apply(compute_lengths)
lengths_apply.value_counts(dropna=False)

14.0    21
4.0     20
12.0    17
13.0    14
NaN     13
15.0    13
16.0    12
18.0     9
17.0     9
11.0     8
22.0     7
25.0     5
19.0     3
9.0      2
26.0     1
23.0     1
10.0     1
39.0     1
20.0     1
Name: CurrencyUnit, dtype: int64

Instead of doing this, we can use the vectorized Series.str.len() method to return the length of each element in a column.
<br>
**Note**: Vectorized string methods skip missing values automatically and leave them as NaN in the result!

In [41]:
lengths=merged['CurrencyUnit'].str.len()
lengths.value_counts(dropna=False)

14.0    21
4.0     20
12.0    17
13.0    14
NaN     13
15.0    13
16.0    12
18.0     9
17.0     9
11.0     8
22.0     7
25.0     5
19.0     3
9.0      2
26.0     1
23.0     1
10.0     1
39.0     1
20.0     1
Name: CurrencyUnit, dtype: int64
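
As a quick check that the guarded custom function and the vectorized method agree, here is a small sketch on made-up values (the Series contents and the helper name `length_or_nan` are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up sample values, including one missing entry
currency = pd.Series(["Euro", "Danish krone", np.nan])

def length_or_nan(value):
    # Return NaN for missing input instead of measuring the text "nan"
    return np.nan if pd.isnull(value) else len(value)

via_apply = currency.apply(length_or_nan)   # custom function, element by element
via_str = currency.str.len()                # vectorized equivalent

# Both approaches mark the same rows as missing
print(via_apply.isna().tolist())
print(via_str.isna().tolist())
```

Both results contain the lengths 4 and 12 followed by a missing value; the vectorized call gets there without any explicit null check.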

3. Search for the substring "national accounts" in the "SpecialNotes" column and select only the rows containing it.

- To parse the elements of a Series to find a string that doesn't appear in the same position in each element, we can use regular expressions, or **regex** for short. 
- A regular expression is a sequence of characters that describes a search pattern, used to match characters in a string.
- In pandas, regular expressions are integrated with the vectorized string methods. In this case, we can use the Series.str.contains() method to check whether a specific phrase appears in each element of a Series.


Regex docs: https://docs.python.org/3.4/library/re.html

In [44]:
merged["SpecialNotes"]

0                                                    NaN
1                                                    NaN
2                                                    NaN
3                                                    NaN
4      Fiscal year end: March 31; reporting period fo...
                             ...                        
153    Based on official government statistics, natio...
154                                                  NaN
155                                                  NaN
156                                                  NaN
157    April 2013 database update: Based on IMF data,...
Name: SpecialNotes, Length: 158, dtype: object

In [47]:
# Pattern created using regex
pattern = r"[Nn]ational accounts"

# Matching pattern
national_accounts = merged['SpecialNotes'].str.contains(pattern)
national_accounts

0       NaN
1       NaN
2       NaN
3       NaN
4      True
       ... 
153    True
154     NaN
155     NaN
156     NaN
157    True
Name: SpecialNotes, Length: 158, dtype: object

In [49]:
# Return the value counts for each value in the Series, including missing values.
national_accounts.value_counts(dropna=False)

NaN      65
True     54
False    39
Name: SpecialNotes, dtype: int64

Before we use boolean indexing to get all the rows containing "national accounts", we need to get rid of the NaN values, because a boolean mask should contain only True and False. So, we change them to False.<br>
Set the na parameter to False in Series.str.contains() to return False for the NaN values.

In [52]:
national_accounts=merged['SpecialNotes'].str.contains(pattern, na=False)

In [53]:
national_accounts.value_counts(dropna=False)

False    104
True      54
Name: SpecialNotes, dtype: int64
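
The same idea on a tiny, made-up DataFrame (the country names and notes below are placeholders, not rows from the dataset), as a sketch of how na=False produces a clean True/False mask that can be used directly for boolean indexing:

```python
import numpy as np
import pandas as pd

# Placeholder data for illustration only
notes = pd.DataFrame({
    "Country": ["A", "B", "C", "D"],
    "SpecialNotes": [
        "Based on official national accounts data.",
        "National accounts were revised.",
        "Fiscal year end: March 31.",
        np.nan,
    ],
})

pattern = r"[Nn]ational accounts"

# na=False makes missing notes count as "no match"
mask = notes["SpecialNotes"].str.contains(pattern, na=False)

print(mask.tolist())                        # [True, True, False, False]
print(notes.loc[mask, "Country"].tolist())  # ['A', 'B']
```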

In [54]:
# Use boolean indexing to return only the rows that contain 
# "national accounts" or "National accounts" in the SpecialNotes column
merged[national_accounts]

Unnamed: 0,Country,Region_x,Happiness Rank,Happiness Score,Standard Error,Economy (GDP per Capita),Family,Health (Life Expectancy),Freedom,Trust (Government Corruption),...,LatestPopulationCensus,LatestHouseholdSurvey,IESurvey,VitalRegistrationComplete,LatestAgriculturalCensus,LatestIndustrialData,LatestTradeData,LatestWaterWithdrawalData,Currency Apply,Currency Vectorized
4,Canada,North America,5,7.427,0.03553,1.32629,1.32261,0.90563,0.63297,0.32957,...,2011,,"Labor force survey (LFS), 2010",Yes,2011,2011.0,2013.0,1986.0,dollar,dollar
7,Sweden,Western Europe,8,7.364,0.03157,1.33171,1.28907,0.91087,0.6598,0.43844,...,2011,,"Income survey (IS), 2005",Yes,2010,2010.0,2013.0,2007.0,krona,krona
8,New Zealand,Australia and New Zealand,9,7.286,0.03371,1.25018,1.31967,0.90837,0.63938,0.42922,...,2013,,,Yes,2012,2010.0,2013.0,2002.0,dollar,dollar
9,Australia,Australia and New Zealand,10,7.284,0.04083,1.33358,1.30923,0.93156,0.65124,0.35637,...,2011,,"Expenditure survey/budget survey (ES/BS), 2003",Yes,2011,2011.0,2013.0,2000.0,dollar,dollar
14,United States,North America,15,7.119,0.03839,1.39451,1.24711,0.86179,0.54604,0.1589,...,2010,,"Labor force survey (LFS), 2010",Yes,2012,2008.0,2013.0,2005.0,dollar,dollar
19,United Arab Emirates,Middle East and Northern Africa,20,6.901,0.03729,1.42727,1.12575,0.80925,0.64157,0.38583,...,2010,"World Health Survey (WHS), 2003",,,2012,2010.0,2011.0,2005.0,dirham,dirham
23,Singapore,Southeastern Asia,24,6.798,0.0378,1.52186,1.02,1.02525,0.54252,0.4921,...,2010,"National Health Survey (NHS), 2010",,Yes,,2011.0,2013.0,1975.0,dollar,dollar
31,Uruguay,Latin America and Caribbean,32,6.485,0.04539,1.06166,1.2089,0.8116,0.60362,0.24558,...,2011,"Multiple Indicator Cluster Survey (MICS), 2012/13","Integrated household survey (IHS), 2013",Yes,2011,2009.0,2013.0,2000.0,peso,peso
33,Thailand,Southeastern Asia,34,6.455,0.03557,0.9669,1.26504,0.7385,0.55664,0.03187,...,2010,"Multiple Indicator Cluster Survey (MICS), 2012","Integrated household survey (IHS), 2011",,2013,2006.0,2013.0,2007.0,baht,baht
38,Kuwait,Middle East and Northern Africa,39,6.295,0.04456,1.55422,1.16594,0.72492,0.55499,0.25609,...,2011,"Family Health Survey (FHS), 1996",,Yes,,2011.0,2013.0,2002.0,dinar,dinar


Some more regex patterns:
- A single character that is any digit: r"[0-9]"
- A single character that is any lowercase or uppercase letter: r"[a-z]" or r"[A-Z]"

{n} indicates that the preceding element repeats exactly n times.
- A three-character substring that starts with a digit between 1 and 6 and ends with two lowercase letters: r"[1-6][a-z][a-z]" = r"[1-6][a-z]{2}"
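
These patterns can be tried out with Python's built-in re module before using them in pandas. A minimal sketch with made-up sample strings:

```python
import re

pattern = r"[1-6][a-z]{2}"   # same as r"[1-6][a-z][a-z]"

# Made-up test strings
samples = ["3ab", "7ab", "3AB", "room 5xy"]

# re.search() looks for the pattern anywhere in the string
matches = [bool(re.search(pattern, text)) for text in samples]
print(matches)   # [True, False, False, True]

# [a-z] covers lowercase letters only; re.IGNORECASE relaxes that
print(bool(re.search(pattern, "3AB", flags=re.IGNORECASE)))   # True
```

"7ab" fails because 7 is outside the range 1 to 6, and "3AB" fails because the character set [a-z] does not include uppercase letters.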

4. Match all the years from the "SpecialNotes" column and extract them.

- To extract part of a string that matches a pattern, we can use **regex**.
- In this case, we can use the Series.str.extract() method to extract years from the SpecialNotes column.
- **Capturing group** - the part of the pattern enclosed in parentheses, for example ([1-9][0-9]{3})
- If the capturing group finds no match in an element, the value is set to NaN.
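
A small sketch of this behavior on made-up notes (the sample strings are illustrative). It also shows the expand parameter, which controls whether the result comes back as a DataFrame or a Series:

```python
import numpy as np
import pandas as pd

# Made-up notes: one with a year, one without, one missing
notes = pd.Series([
    "April 2013 database update: Based on IMF data.",
    "Fiscal year end: March 31.",
    np.nan,
])

pattern = r"([1-9][0-9]{3})"

# Default: a DataFrame with one column per capturing group (here, column 0)
years_df = notes.str.extract(pattern)

# expand=False: a Series when there is a single capturing group
years = notes.str.extract(pattern, expand=False)

print(years.iloc[0])           # 2013
print(years.isna().tolist())   # [False, True, True]
```

Note that Series.str.extract() returns only the first match in each element; if a note mentions several years, Series.str.extractall() is the method that returns every match.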

In [58]:
# Pattern to be matched
pattern =r"([1-9][0-9]{3})"

# Match and extract
years=merged['SpecialNotes'].str.extract(pattern)
years

Unnamed: 0,0
0,
1,
2,
3,
4,
...,...
153,2006
154,
155,
156,
