# 3. String Series Methods

## Methods for Series with String Data Types
In this notebook, we focus on methods that work on Series containing string data. Remember that, by default in the version of Pandas used here, strings are not stored with a dedicated string data type. Instead they use the **object** dtype, which can hold any Python object. The vast majority of the time, object columns are composed entirely of strings.

The methods in the previous two notebooks focused on numeric and boolean Series. Many of those methods also work on string Series, but some do not.

For instance, the **`mean`** method will not work for string columns. Let's see this in action by selecting the department column from the City of Houston dataset.

In [1]:
import pandas as pd

emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date', 'job_date'])
emp.head()

Unnamed: 0,title,dept,salary,race,gender,hire_date,job_date
0,ASSISTANT DIRECTOR (EX LVL),Municipal Courts Department,121862.0,Hispanic,Female,2006-06-12,2012-10-13
1,LIBRARY ASSISTANT,Library,26125.0,Hispanic,Female,2000-07-19,2010-09-18
2,POLICE OFFICER,Houston Police Department-HPD,45279.0,White,Male,2015-02-03,2015-02-03
3,ENGINEER/OPERATOR,Houston Fire Department (HFD),63166.0,White,Male,1982-02-08,1991-05-25
4,ELECTRICIAN,General Services Department,56347.0,White,Male,1989-06-19,1994-10-22


In [2]:
dept = emp['dept']
dept.head()

0      Municipal Courts Department
1                          Library
2    Houston Police Department-HPD
3    Houston Fire Department (HFD)
4      General Services Department
Name: dept, dtype: object

### Attempt to take the mean

In [None]:
dept.mean()
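
To see the failure without stopping the notebook, here is a minimal sketch that catches the error. It uses a small made-up Series rather than the employee data, and it assumes a reasonably recent Pandas version, where averaging strings raises a `TypeError`.

```python
import pandas as pd

# Hypothetical stand-in for the dept column
s = pd.Series(['Library', 'Parks', 'Library'])

try:
    s.mean()
    raised = False
except TypeError:
    # Strings cannot be averaged, so pandas refuses the reduction
    raised = True

print(raised)
```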

### Other methods do work
Many of the other methods covered in the previous two notebooks work just fine with string columns. For example, we can find the maximum department, which is the value that sorts last alphabetically.

In [None]:
dept.max()

Calculate number of missing values:

In [None]:
dept.isna().sum()

## Most valuable method for string columns: `value_counts`
The **`value_counts`** method is one of the most valuable methods for string columns. It returns the frequency of each value in the Series and sorts it from most to least common.

In [None]:
dept.value_counts()

## Notice what object is returned
The **`value_counts`** method itself returns a Series: the unique values of the original Series become the index, and their counts become the values.
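
The short sketch below illustrates this on a small made-up Series (the department names are placeholders, not the real data). Because the result is a Series, the usual Series tools such as `.index` and `.loc` apply to it.

```python
import pandas as pd

# Hypothetical data standing in for the dept column
s = pd.Series(['Library', 'Parks', 'Library', 'Library', 'Parks', 'Fire'])
counts = s.value_counts()

print(type(counts))            # the result is itself a Series
print(list(counts.index))      # unique values, most common first
print(counts.loc['Library'])   # look up a count by its label
```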

### Use `normalize=True` for proportion
We can use **`value_counts`** to return the proportion of each occurrence instead of the raw count by setting parameter **`normalize`** to **`True`**. For instance, this tells us that 32% of the employees are members of the police department.

In [None]:
dept.value_counts(normalize=True)
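
As a quick sanity check, the proportions should add up to 1. The sketch below shows this on a tiny made-up Series and converts the proportions to percentages; the data is hypothetical.

```python
import pandas as pd

# Hypothetical data standing in for the dept column
s = pd.Series(['Library', 'Parks', 'Library', 'Library'])
props = s.value_counts(normalize=True)

print(props)            # Library 0.75, Parks 0.25
print(props.sum())      # proportions add up to 1
print(props * 100)      # the same information as percentages
```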

### `value_counts` also works for columns of all types
The **`value_counts`** method works for all columns of all types and not just strings. It's just usually more informative for string columns. Let's use it on the salary column to see if we have common salaries.

In [None]:
emp['salary'].value_counts().head(10)

# Special methods just for object columns
Pandas provides a collection of string-specific methods through the **str accessor**. The str accessor is available only on Series that hold string data (here, columns with the **object** data type); calling it on a numeric Series raises an error. It provides a few dozen methods for string manipulation.

### Access with dot notation
To access these special string methods, append `.str` to the Series, followed by another dot and the name of the specific string method.

### Make each value uppercase
Let's call a simple string method to make each value in the **`dept`** Series uppercase. We will use the **`upper`** method of the str accessor.

In [None]:
dept.str.upper().head()
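
One practical difference from calling Python's own `str.upper` on each value is how missing values are handled. The sketch below, on made-up data, suggests that the accessor simply passes missing values through instead of raising an error.

```python
import pandas as pd

# Hypothetical Series with one missing value
s = pd.Series(['Library', None])
upper = s.str.upper()

print(upper.iloc[0])    # 'LIBRARY'
print(upper.isna())     # the missing value stays missing
```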

### `str` accessor API
Take a look at the [str accessor API][1] in the official documentation.

[1]: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling

### Lots of methods but mostly easy to use
There is quite a lot of functionality to manipulate and probe strings in almost any way you can imagine. Let's work through some examples of the following string methods:

* **`count`**
* **`contains`**
* **`find`**
* **`len`**
* **`split`**
* **`replace`**

### `count` str method
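Returns the number of non-overlapping occurrences of the passed pattern in each string. The pattern is interpreted as a regular expression: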

In [None]:
dept.str.count('a').head()

In [None]:
dept.str.count('Department').head()
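
Because `count` treats its argument as a regular expression, special characters need care. The sketch below uses made-up strings to contrast an unescaped dot, which matches any character, with an escaped dot, which matches only a literal period.

```python
import pandas as pd

# Hypothetical strings chosen to show the regex behavior
s = pd.Series(['a.b.c', 'abc'])

any_char = s.str.count('.')       # '.' matches every character
literal_dot = s.str.count(r'\.')  # escaped: counts only real periods

print(any_char.tolist())
print(literal_dot.tolist())
```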

### `contains` str method
Returns a boolean for each value indicating whether the passed string is contained anywhere within it. Let's determine whether any departments contain the letter **z**.

In [None]:
dept.str.contains('z').head()

In [None]:
dept.str.contains('z').sum()
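
When a Series has missing values, `contains` returns a missing value for those rows, which prevents the result from being used directly as a boolean filter. A minimal sketch on made-up names, using the `na` parameter to treat missing values as `False`:

```python
import pandas as pd

# Hypothetical names with one missing value
s = pd.Series(['Don Johnson', None, 'Tom Hardy'])

# na=False fills missing results with False so the mask is purely boolean
mask = s.str.contains('Johnson', na=False)

print(mask.tolist())
print(s[mask].tolist())   # the mask can now be used for selection
```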

### `find` str method
Returns the lowest index (the integer location) at which the passed string occurs in each value. If the string is not found, it returns -1.

In [None]:
dept.str.find('a').head(10)
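
Here is a small sketch of both cases, found and not found, on made-up strings. Remember that integer locations start at 0.

```python
import pandas as pd

# Hypothetical strings
s = pd.Series(['Library', 'Parks'])

positions = s.str.find('a')   # location of the first 'a' in each string
missing = s.str.find('z')     # 'z' never appears

print(positions.tolist())
print(missing.tolist())
```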

### `len` str method
Returns the length of each string.

In [None]:
dept.str.len().head()

# Exercises

### Problem 1
<span  style="color:green; font-size:16px">Read in the movie dataset and set the title as the index. Assign the actor 1 column to its own Series variable. Make sure to drop missing values from this Series before assigning it.

Which actor 1 has appeared in the most movies? Can you write an expression that returns this actor's name as a string?</span>

In [3]:
movie = pd.read_csv('../data/movie.csv', index_col= 'title')
movie.head(2) 

title,year,color,content_rating,duration,director_name,director_fb,actor1,actor1_fb,actor2,actor2_fb,...,actor3_fb,gross,genres,num_reviews,num_voted_users,plot_keywords,language,country,budget,imdb_score
Avatar,2009.0,Color,PG-13,178.0,James Cameron,0.0,CCH Pounder,1000.0,Joel David Moore,936.0,...,855.0,760505847.0,Action|Adventure|Fantasy|Sci-Fi,723.0,886204,avatar|future|marine|native|paraplegic,English,USA,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,Color,PG-13,169.0,Gore Verbinski,563.0,Johnny Depp,40000.0,Orlando Bloom,5000.0,...,1000.0,309404152.0,Action|Adventure|Fantasy,302.0,471220,goddess|marriage ceremony|marriage proposal|pi...,English,USA,300000000.0,7.1


In [14]:
# your code here
actor1 = movie['actor1'].dropna()
actor1.head()

title
Avatar                                            CCH Pounder
Pirates of the Caribbean: At World's End          Johnny Depp
Spectre                                       Christoph Waltz
The Dark Knight Rises                               Tom Hardy
Star Wars: Episode VII - The Force Awakens        Doug Walker
Name: actor1, dtype: object

In [16]:
actor1.value_counts().head(1)

Robert De Niro    48
Name: actor1, dtype: int64
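
The cell above returns a one-row Series rather than a plain string. One possible way to get just the name is to take the first label of the `value_counts` index, or to use `idxmax`. The sketch below shows both on a tiny made-up Series; the variable `actors` is a hypothetical stand-in for `actor1`.

```python
import pandas as pd

# Hypothetical stand-in for the actor1 Series, including a missing value
actors = pd.Series(['Johnny Depp', 'Tom Hardy', 'Johnny Depp', None]).dropna()

# value_counts sorts by frequency, so the first index label is the top actor
top_by_index = actors.value_counts().index[0]

# idxmax returns the label of the largest count
top_by_idxmax = actors.value_counts().idxmax()

print(top_by_index, top_by_idxmax)
```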

### Problem 2
<span  style="color:green; font-size:16px">What percent of movies have the top 100 most frequent actor 1's appeared in?</span>

In [18]:
# your code here
actor1 = movie['actor1']

In [25]:
vc = actor1.value_counts(normalize = True)
vc.head()

Robert De Niro       0.009778
Johnny Depp          0.007333
Nicolas Cage         0.006519
Matt Damon           0.005908
Denzel Washington    0.005908
Name: actor1, dtype: float64

In [29]:
actor_freq = vc.iloc[:100]  # keep all of the top 100; head is only for display
actor_freq.head()

Robert De Niro       0.009778
Johnny Depp          0.007333
Nicolas Cage         0.006519
Matt Damon           0.005908
Denzel Washington    0.005908
Name: actor1, dtype: float64

In [28]:
actor_freq.sum()  # proportion of movies featuring one of the top 100 actors

### Problem 3
<span  style="color:green; font-size:16px">How many actor 1's have appeared in exactly one movie?</span>

In [9]:
# your code here
actor1 = movie['actor1']
vc = actor1.value_counts()
vc.head()

Robert De Niro       48
Johnny Depp          36
Nicolas Cage         32
Matt Damon           29
Denzel Washington    29
Name: actor1, dtype: int64

In [17]:
filt = vc == 1
vc[filt].count() # cannot do actor1[filt] since filt and actor1 have different sizes

1379

In [33]:
(actor1.value_counts() == 1).head()

Robert De Niro       False
Johnny Depp          False
Nicolas Cage         False
Matt Damon           False
Denzel Washington    False
Name: actor1, dtype: bool

In [31]:
(actor1.value_counts() == 1).sum()

1379

### Problem 4
<span  style="color:green; font-size:16px">How many actor 1's have more than 3 e's in their name? Output a unique array of just these actor names so we can manually verify them.</span>

In [37]:
# your code here
(actor1.str.count('e') > 3).head()

title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                         False
Star Wars: Episode VII - The Force Awakens    False
Name: actor1, dtype: bool

In [39]:
act_with_3e = actor1.str.count('e') > 3
act_with_3e.sum()

101

In [41]:
actor1[act_with_3e].unique()

array(['Jennifer Lawrence', 'Keanu Reeves', 'Seychelle Gabriel',
       'Jeremy Renner', 'Amber Stevens West', 'Peter Greene',
       'Steven Anthony Lawrence', 'Cedric the Entertainer',
       'Sean Pertwee', 'Xander Berkeley', 'Kathleen Freeman',
       'Pierre Perrier', 'Catherine Deneuve', 'George Kennedy',
       'Leighton Meester', 'Steve Guttenberg', 'Emmanuelle Seigner',
       'Jurnee Smollett-Bell', 'Steve Oedekerk',
       'Johannes Silberschneider', 'Bernadette Peters',
       'Jacqueline McKenzie', 'Dee Bradley Baker', 'Jennifer Freeman',
       'Gene Tierney', 'Roscoe Lee Browne', 'Phoebe Legere',
       'Eric Sheffer Stevens', 'Michael Greyeyes', 'Steven Weber',
       'George Newbern', 'Florence Henderson', 'Michelle Simone Miller',
       'Chemeeka Walker', 'Fereshteh Sadre Orafaiy'], dtype=object)

### Problem 5
<span  style="color:green; font-size:16px">Get a unique list of all actors that have the name 'Johnson' as part of their name.</span>

In [42]:
# your code here
act_with_johnson = actor1.str.count('Johnson') >= 1

In [43]:
actor1[act_with_johnson].unique()

array(['Don Johnson', 'Dwayne Johnson', 'Richard Johnson', 'Eric Johnson',
       'Bill Johnson', 'Nicole Randall Johnson', 'R. Brandon Johnson'],
      dtype=object)

In [52]:
actor1.str.contains('Johnson').values

array([False, False, False, ..., False, False, False], dtype=object)

In [59]:
actor1.str.contains('Johnson', na= False).head()

title
Avatar                                        False
Pirates of the Caribbean: At World's End      False
Spectre                                       False
The Dark Knight Rises                         False
Star Wars: Episode VII - The Force Awakens    False
Name: actor1, dtype: bool

In [57]:
actor1[actor1.str.contains('Johnson', na= False)].unique()

array(['Don Johnson', 'Dwayne Johnson', 'Richard Johnson', 'Eric Johnson',
       'Bill Johnson', 'Nicole Randall Johnson', 'R. Brandon Johnson'],
      dtype=object)

### Problem 6
<span  style="color:green; font-size:16px">How many actor 1 names end in 'x'?</span>

In [50]:
# your code here
actor1.str.endswith('x').sum()

28

### Problem 7
<span  style="color:green; font-size:16px">The Pandas string methods overlap with the built-in Python string methods. Find all the public method names that are common to both. Then find the public methods that are unique to each.</span>

In [60]:
# your code here

In [63]:
dir(str)

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']
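
Listing `dir(str)` is only the first step. One possible way to finish the comparison is with Python sets, treating any name that does not start with an underscore as public. This is a sketch of one approach, not the only solution, and the exact names in each set depend on your Python and Pandas versions.

```python
import pandas as pd

# Public names are those that do not start with an underscore
python_methods = {name for name in dir(str) if not name.startswith('_')}

# Use the str accessor of a throwaway Series to list the pandas methods
accessor = pd.Series(['a']).str
pandas_methods = {name for name in dir(accessor) if not name.startswith('_')}

common = python_methods & pandas_methods        # in both
only_python = python_methods - pandas_methods   # only on Python strings
only_pandas = pandas_methods - python_methods   # only on the str accessor

print(sorted(common))
print(sorted(only_python))
print(sorted(only_pandas))
```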

# Explore More `str` Methods and their parameters
In this section, you can learn and practice other methods and their parameters. There are far too many to cover in a single lecture, so the rest are left for you to explore on your own.

### `split` str method
Splits each string into multiple separate strings based on a given separator. By default, it splits on whitespace. The following splits on each space and returns a Series of lists.

In [65]:
dept.str.split().head()

0       [Municipal, Courts, Department]
1                             [Library]
2     [Houston, Police, Department-HPD]
3    [Houston, Fire, Department, (HFD)]
4       [General, Services, Department]
Name: dept, dtype: object

Set the **`expand`** parameter to **`True`** to return a DataFrame:

In [66]:
dept.str.split(expand=True).head()

Unnamed: 0,0,1,2,3
0,Municipal,Courts,Department,
1,Library,,,
2,Houston,Police,Department-HPD,
3,Houston,Fire,Department,(HFD)
4,General,Services,Department,
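
Two other parameters are worth trying: a custom separator as the first argument, and `n` to limit the number of splits. The sketch below uses a single made-up value modeled on the department names above.

```python
import pandas as pd

# Hypothetical single-value Series
s = pd.Series(['Houston Police Department-HPD'])

# Split on a hyphen instead of whitespace
by_dash = s.str.split('-')

# Split on whitespace at most once and expand into DataFrame columns
first_word = s.str.split(n=1, expand=True)

print(by_dash.iloc[0])
print(first_word)
```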


### `replace` str method
You must pass two string arguments to replace - the string you want to replace and its replacement value.

In [67]:
dept.str.replace('Houston', 'H-Town').head()

0     Municipal Courts Department
1                         Library
2    H-Town Police Department-HPD
3    H-Town Fire Department (HFD)
4     General Services Department
Name: dept, dtype: object
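
Whether the first argument is treated as a literal string or as a regular expression is controlled by the `regex` parameter, and its default has differed between Pandas versions, so it can be safer to set it explicitly. A minimal sketch on a made-up value containing parentheses, which are special characters in a regular expression:

```python
import pandas as pd

# Hypothetical single-value Series
s = pd.Series(['Houston Fire Department (HFD)'])

# Literal match: the parentheses are removed along with the text
literal = s.str.replace('(HFD)', 'HFD', regex=False)

# Regex match: the parentheses form a group, so only the letters HFD match
as_regex = s.str.replace('(HFD)', 'HFD', regex=True)

print(literal.iloc[0])
print(as_regex.iloc[0])
```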

### Selecting substrings with the brackets
Selecting a single character of a Python string is simple and accomplished by placing the integer location of the desired character in brackets. Selecting substrings is also quite simple and accomplished by using slice notation in the brackets.

Pandas allows us to perform the exact same operation with its **`str`** accessor to select one or more characters of each string. We simply append the brackets to **`str`** and use the same selection process as we do with Python strings. Let's see some examples.

Select the character with integer location 5 for each value in the Series:

In [None]:
dept.str[5].head()

Select the last 5 characters of each value in the Series:

In [None]:
dept.str[-5:].head()

Select the characters at integer locations 5 through 14 (the stop value 15 is excluded):

In [None]:
dept.str[5:15].head()
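
The sketch below repeats these selections on short made-up strings so the results are easy to check by hand. It also shows what appears to happen when the requested location is past the end of a string: the result is a missing value rather than an error.

```python
import pandas as pd

# Hypothetical strings
s = pd.Series(['Library', 'Parks'])

first_char = s.str[0]      # single character at location 0
last_three = s.str[-3:]    # last three characters
middle = s.str[1:4]        # locations 1, 2 and 3 (stop is excluded)
too_far = s.str[10]        # no string is this long

print(first_char.tolist())
print(last_three.tolist())
print(middle.tolist())
print(too_far.isna().tolist())
```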

# There are dozens of other string methods. Keep practicing below
Use the documentation to read about every parameter in each method.

In [None]:
# your code here