# Intro to Pandas
by Ryan Orsinger

## Module 1: Intro to pandas series

### Pandas Series Part 3: Strings
- Sorting values
- Using pandas built-in string methods
- Assigning results
- Reassigning string series

In [1]:
import pandas as pd

In [4]:
fruits = pd.Series(["apple", "banana", "lemon", "lime", "pineapple", "blueberry", "raspberry", "cranberry"])
fruits

0        apple
1       banana
2        lemon
3         lime
4    pineapple
5    blueberry
6    raspberry
7    cranberry
dtype: object

In [8]:
# .sort_values sorts strings alphabetically and numbers in ascending numeric order
# ascending=True is the default, so fruits.sort_values() is the same as fruits.sort_values(ascending=True)
fruits.sort_values()

0        apple
1       banana
5    blueberry
7    cranberry
2        lemon
3         lime
4    pineapple
6    raspberry
dtype: object

In [9]:
fruits.sort_values(ascending=False)

6    raspberry
4    pineapple
3         lime
2        lemon
7    cranberry
5    blueberry
1       banana
0        apple
dtype: object

In [None]:
# We can reassign the series to hold the sorted values
# For more on .sort_values, see https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html
fruits = fruits.sort_values()
fruits
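
Sorting keeps each value's original index label, which is why the sorted outputs show the labels out of order. The sketch below uses a small illustrative series (not the one above) and assumes a fresh 0, 1, 2, ... index is wanted after sorting; the `ignore_index` parameter should be available in pandas 1.0 and later.

```python
import pandas as pd

# Illustrative data, separate from the notebook's fruits series
fruits = pd.Series(["lime", "apple", "banana"])

# Sorting keeps the original index labels attached to each value
sorted_fruits = fruits.sort_values()
print(sorted_fruits.index.tolist())   # [1, 2, 0]

# ignore_index=True relabels the sorted result 0, 1, 2, ...
renumbered = fruits.sort_values(ignore_index=True)
print(renumbered.tolist())            # ['apple', 'banana', 'lime']
print(renumbered.index.tolist())      # [0, 1, 2]
```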

In [12]:
# .str.capitalize uppercases the first character of each string and lowercases the rest
fruits.str.capitalize()

0        Apple
1       Banana
5    Blueberry
7    Cranberry
2        Lemon
3         Lime
4    Pineapple
6    Raspberry
dtype: object

In [11]:
fruits

0        apple
1       banana
5    blueberry
7    cranberry
2        lemon
3         lime
4    pineapple
6    raspberry
dtype: object

In [20]:
# String methods return a new series and leave the original unchanged, so assign the result to a variable to keep it
capitalized_fruits = fruits.str.capitalize()
capitalized_fruits

0        Apple
1       Banana
5    Blueberry
7    Cranberry
2        Lemon
3         Lime
4    Pineapple
6    Raspberry
dtype: object
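
Because each `.str` method returns a new series, calls can be chained before assigning the final result. A minimal sketch with made-up sample strings (the `names` and `clean_names` variables are illustrative):

```python
import pandas as pd

# Illustrative data with stray whitespace and mixed case
names = pd.Series(["  apple ", "BANANA", "Lemon  "])

# Each .str method returns a new series, so calls can be chained
clean_names = names.str.strip().str.lower()
print(clean_names.tolist())   # ['apple', 'banana', 'lemon']

# The original series still holds the untouched strings
print(names.tolist())         # ['  apple ', 'BANANA', 'Lemon  ']
```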

In [15]:
# .str.contains returns a Boolean series
# String methods live under the .str accessor; calling .contains directly on the series raises an AttributeError
fruits.contains("apple")

AttributeError: 'Series' object has no attribute 'contains'

In [17]:
# Since .contains returns a Boolean series, we can use it to filter our results
fruits[fruits.str.contains("apple")]

0        apple
4    pineapple
dtype: object
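
Real data often has mixed case or missing values, and both affect `.str.contains`. The sketch below uses an illustrative `produce` series and the `case` and `na` parameters so that the mask is safe to use as a filter:

```python
import pandas as pd

# Illustrative data with mixed case and one missing value
produce = pd.Series(["Apple", "pineapple", None, "lemon"])

# case=False ignores letter case; na=False treats missing values as non-matches
mask = produce.str.contains("apple", case=False, na=False)
print(mask.tolist())            # [True, True, False, False]
print(produce[mask].tolist())   # ['Apple', 'pineapple']
```

Without `na=False`, the missing entry would not produce a clean `True`/`False` value, and the mask could not be used directly for filtering.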

In [19]:
# .str.count counts how many times a pattern occurs in each string
fruits.str.count("a")

0    1
1    3
5    0
7    1
2    0
3    0
4    1
6    1
dtype: int64

In [33]:
# Summing up the results of .count
vowel_counts = (
    fruits.str.count("a") + fruits.str.count("e") + fruits.str.count("i")
    + fruits.str.count("o") + fruits.str.count("u")
)
vowel_counts

0    2
1    3
5    3
7    2
2    2
3    2
4    4
6    2
dtype: int64

In [22]:
# Using count with a Regular Expression character class
# Some of the Pandas string methods can utilize regular expressions
fruits.str.count("[aeiou]")

0    2
1    3
5    3
7    2
2    2
3    2
4    4
6    2
dtype: int64

In [27]:
# We can use our new vowel count to filter values from the series
fruits[fruits.str.count("[aeiou]") > 2]

1       banana
5    blueberry
4    pineapple
dtype: object
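
The character class `[aeiou]` only matches lowercase letters, which matters once strings contain capitals. A small sketch, using an illustrative `words` series, of passing `re.IGNORECASE` through the `flags` parameter of `.str.count`:

```python
import re
import pandas as pd

# Illustrative data with capitalized first letters
words = pd.Series(["Apple", "kiwi", "Ugli"])

# Without a flag, the class [aeiou] only matches lowercase vowels
lower_only = words.str.count("[aeiou]")
print(lower_only.tolist())     # [1, 2, 1]

# re.IGNORECASE makes the same pattern match uppercase vowels too
any_case = words.str.count("[aeiou]", flags=re.IGNORECASE)
print(any_case.tolist())       # [2, 2, 2]
```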

In [30]:
# .startswith returns a Boolean series
fruits.str.startswith("l")

0    False
1    False
5    False
7    False
2     True
3     True
4    False
6    False
dtype: bool

In [31]:
fruits[fruits.str.startswith("l")]

2    lemon
3     lime
dtype: object

In [32]:
# .endswith returns a Boolean series
fruits.str.endswith("berry")

0    False
1    False
5     True
7     True
2    False
3    False
4    False
6     True
dtype: bool
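
Boolean series like the ones returned by `.startswith` and `.endswith` can be combined before filtering. A minimal sketch on a shortened, illustrative copy of the fruit list:

```python
import pandas as pd

# Shortened, illustrative copy of the fruit list
fruits = pd.Series(["apple", "lemon", "lime", "blueberry", "raspberry"])

starts_with_l = fruits.str.startswith("l")
ends_with_berry = fruits.str.endswith("berry")

# | is element-wise OR, & is element-wise AND, ~ negates a mask
print(fruits[starts_with_l | ends_with_berry].tolist())
# ['lemon', 'lime', 'blueberry', 'raspberry']
print(fruits[~starts_with_l & ~ends_with_berry].tolist())
# ['apple']
```

Use `|`, `&`, and `~` rather than Python's `or`, `and`, and `not`, which do not work element-wise on a series.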

In [34]:
# .len to get the length of the string
fruits.str.len()

0    5
1    6
5    9
7    9
2    5
3    4
4    9
6    9
dtype: int64
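
String lengths can drive both filtering and sorting. The sketch below uses illustrative data with distinct lengths and assumes pandas 1.1 or later for the `key` parameter of `.sort_values`:

```python
import pandas as pd

# Illustrative data; each string has a different length
fruits = pd.Series(["banana", "lime", "apple"])

# Keep only the strings longer than four characters
print(fruits[fruits.str.len() > 4].tolist())   # ['banana', 'apple']

# Sort by length instead of alphabetically (key= assumes pandas 1.1+)
by_length = fruits.sort_values(key=lambda s: s.str.len())
print(by_length.tolist())                      # ['lime', 'apple', 'banana']
```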

In [35]:
# .lower to lowercase strings
shouts = pd.Series(["PLEASE", "LOWERCASE", "THESE", "STRINGS"])
not_shouts = shouts.str.lower()
not_shouts

0       please
1    lowercase
2        these
3      strings
dtype: object

In [47]:
# Using .replace to replace characters (also used to remove characters)
prices = pd.Series(["€5.99", "€12.25", "€95"])

# Be sure to reassign the variable
prices = prices.str.replace("€", "")

# But the values are still strings, so * 2 repeats each string instead of doubling the number
prices * 2

0      5.995.99
1    12.2512.25
2          9595
dtype: object

In [46]:
# Use .astype to convert strings that hold numbers to a numeric data type
prices = prices.astype(float)
prices * 2

0     11.98
1     24.50
2    190.00
dtype: float64
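
`.astype(float)` raises an error if any string cannot be parsed as a number. As one possible alternative, the sketch below uses `pd.to_numeric` with `errors="coerce"` on an illustrative `raw_prices` series that contains a non-numeric entry:

```python
import pandas as pd

# Illustrative data with one entry that is not a number
raw_prices = pd.Series(["€5.99", "€12.25", "n/a"])

# regex=False treats the pattern as literal text rather than a regular expression
stripped = raw_prices.str.replace("€", "", regex=False)

# errors="coerce" turns unparseable strings into NaN instead of raising
numeric = pd.to_numeric(stripped, errors="coerce")
print(numeric.tolist())   # [5.99, 12.25, nan]
print(numeric.dtype)      # float64
```

Passing `regex=False` explicitly is a defensive choice: characters such as `$` have a special meaning in regular expressions, and the default for `regex` has differed between pandas versions.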

In [51]:
# .upper to uppercase
fruits.str.upper()

0        APPLE
1       BANANA
5    BLUEBERRY
7    CRANBERRY
2        LEMON
3         LIME
4    PINEAPPLE
6    RASPBERRY
dtype: object

## Further Reading
- [Pandas user guide for text](https://pandas.pydata.org/docs/user_guide/text.html)
- [Pandas user guide: essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html)

## Exercises
- Create a series named `vegetables` using the list of strings `["Onion", "cucumber", "Carrot", "squash", "Potato", "Asparagus", "kale", "Broccoli", "spinach"]`.
- Write the code necessary to lowercase all of the vegetables and reassign your series.
- Write the pandas code to sort the strings in alphabetical order. Ensure that the series stores the sorted order.
- Write the pandas code to show only the vegetables that start with a vowel.
- Write the pandas code to show the vegetables that have exactly two vowels.
<br><br>
- Now make a new series named `prices` that holds `["$2.99", "$1,200.25", "$5.99", "$2,350.00"]`.
- Reassign `prices` to hold only a string of numbers. Remove the `$` and `,` characters.
- Reassign `prices` to be a float data type.
- Now multiply your `prices` series by `0.9`.

In [23]:
# Create a series of vegetables ["Onion", "cucumber", "Carrot", "squash", "Potato", "Asparagus", "kale", "Broccoli", "spinach"]

In [None]:
# Write the code necessary to lowercase all of the vegetables and reassign your series.

In [52]:
# Write the pandas code to sort the strings in alphabetical order. Ensure that the series stores the sorted order

In [53]:
# Write the pandas code to show only the vegetables that start with a vowel

In [None]:
# Write the pandas code to show the vegetables that have exactly two vowels

In [54]:
# Make a new series named prices that holds ["$2.99", "$1,200.25", "$5.99", "$2,350.00"]

In [55]:
# Reassign prices to hold only a string of numbers. Remove the $ and , characters.

In [56]:
# Reassign prices to be a float data type

In [None]:
# Multiply your prices series by 0.9