# Intro to Pandas
by Ryan Orsinger

## Module 1: Intro to pandas series

### Pandas Series Part 3: Strings
- Sorting values
- Using pandas built-in string methods
- Assigning and reassigning results
- Using string methods for data cleaning
- Updating data types

In [1]:
import pandas as pd

In [2]:
fruits = pd.Series(["apple", "orange", "banana", "lemon", "lime", "pineapple", "blueberry", "raspberry", "cranberry"])
fruits

0        apple
1       orange
2       banana
3        lemon
4         lime
5    pineapple
6    blueberry
7    raspberry
8    cranberry
dtype: object

In [3]:
# .sort_values sorts strings alphabetically and numbers in numerical order
# ascending=True is the default, so this is the same as fruits.sort_values(ascending=True)
fruits.sort_values()

0        apple
2       banana
6    blueberry
8    cranberry
3        lemon
4         lime
1       orange
5    pineapple
7    raspberry
dtype: object

In [4]:
fruits.sort_values(ascending=False)

7    raspberry
5    pineapple
1       orange
4         lime
3        lemon
8    cranberry
6    blueberry
2       banana
0        apple
dtype: object

In [5]:
# Reassign the series to hold the sorted values (or pass inplace=True to sort the original in place)
# For more on .sort_values, see https://pandas.pydata.org/docs/reference/api/pandas.Series.sort_values.html
fruits = fruits.sort_values()
fruits

0        apple
2       banana
6    blueberry
8    cranberry
3        lemon
4         lime
1       orange
5    pineapple
7    raspberry
dtype: object

In [6]:
# ignore_index=True relabels the sorted values with a fresh index starting at 0
fruits = fruits.sort_values(ignore_index=True)
fruits

0        apple
1       banana
2    blueberry
3    cranberry
4        lemon
5         lime
6       orange
7    pineapple
8    raspberry
dtype: object
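
The sorting options used so far (reassigning the result, `inplace=True`, and `ignore_index=True`) can be compared side by side. This is a minimal sketch; the `colors` series and the variable names are made up for illustration and are not part of the lesson's data.

```python
import pandas as pd

# Hypothetical example data, not the lesson's fruits series
colors = pd.Series(["red", "green", "blue"])

# sort_values returns a new, sorted series; the original keeps its order
sorted_colors = colors.sort_values()

# inplace=True sorts the series itself and returns None
in_place = colors.copy()
result = in_place.sort_values(inplace=True)

# ignore_index=True relabels the sorted values 0, 1, 2, ...
relabeled = colors.sort_values(ignore_index=True)

print(sorted_colors.index.tolist())  # original labels travel with the values
print(relabeled.index.tolist())      # fresh labels
```

Reassigning is usually the easier pattern to read, since `inplace=True` returns `None` and cannot be chained.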

In [7]:
# .capitalize to capitalize
fruits.str.capitalize()

0        Apple
1       Banana
2    Blueberry
3    Cranberry
4        Lemon
5         Lime
6       Orange
7    Pineapple
8    Raspberry
dtype: object

In [8]:
fruits

0        apple
1       banana
2    blueberry
3    cranberry
4        lemon
5         lime
6       orange
7    pineapple
8    raspberry
dtype: object

In [9]:
# String methods return a new series and leave the original intact, so we assign the result to keep it
capitalized_fruits = fruits.str.capitalize()
capitalized_fruits

0        Apple
1       Banana
2    Blueberry
3    Cranberry
4        Lemon
5         Lime
6       Orange
7    Pineapple
8    Raspberry
dtype: object

In [10]:
# .str.contains returns a Boolean series
# String methods live on the .str accessor; calling fruits.contains(...) directly raises an AttributeError
fruits.str.contains("apple")

0     True
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8    False
dtype: bool

In [11]:
# Since .contains returns a Boolean series, we can use it to filter our results
fruits[fruits.str.contains("apple")]

0        apple
7    pineapple
dtype: object
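
Real data often has mixed case or missing values, and depending on the pandas version and dtype a missing value can show up as a missing entry in the mask rather than `False`. The sketch below uses the `case` and `na` parameters of `.str.contains` to get a clean Boolean mask; the `snacks` series is a hypothetical example.

```python
import pandas as pd

# Hypothetical data with mixed case and a missing value
snacks = pd.Series(["Apple pie", "pineapple", None, "cherry"])

# case=False ignores letter case; na=False treats missing values as "no match"
mask = snacks.str.contains("apple", case=False, na=False)

# The mask holds only True/False, so it is safe to use as a filter
matches = snacks[mask]
print(matches)
```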

In [12]:
# .count to count substring occurrences
fruits.str.count("a")

0    1
1    3
2    0
3    1
4    0
5    0
6    1
7    1
8    1
dtype: int64

In [13]:
fruits.str.count("berry")

0    0
1    0
2    1
3    1
4    0
5    0
6    0
7    0
8    1
dtype: int64

In [14]:
# Summing up the results of .count
vowel_counts = fruits.str.count("a") + fruits.str.count("e") + fruits.str.count("i") + fruits.str.count("o") + fruits.str.count("u")
vowel_counts

0    2
1    3
2    3
3    2
4    2
5    2
6    3
7    4
8    2
dtype: int64

In [15]:
# Using .count with a regular expression character class
# .str.count treats its pattern as a regular expression, as do several other pandas string methods
fruits.str.count("[aeiou]")

0    2
1    3
2    3
3    2
4    2
5    2
6    3
7    4
8    2
dtype: int64
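
Because `.str.count` interprets its pattern as a regular expression, characters such as `.` have a special meaning. A small sketch of the difference, using a made-up `versions` series and the standard library's `re.escape` to count a literal dot:

```python
import re

import pandas as pd

# Hypothetical data for illustration
versions = pd.Series(["1.2.3", "10.0", "7"])

# "." is a regex wildcard, so this counts every character
any_char = versions.str.count(".")

# re.escape turns "." into a pattern that matches only a literal dot
literal_dots = versions.str.count(re.escape("."))

print(any_char.tolist())
print(literal_dots.tolist())
```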

In [16]:
# We can use our new vowel count to filter values from the series
fruits[fruits.str.count("[aeiou]") > 2]

1       banana
2    blueberry
6       orange
7    pineapple
dtype: object

In [17]:
# .startswith returns a Boolean series
fruits.str.startswith("l")

0    False
1    False
2    False
3    False
4     True
5     True
6    False
7    False
8    False
dtype: bool

In [18]:
fruits[fruits.str.startswith("l")]

4    lemon
5     lime
dtype: object

In [19]:
# .endswith returns a Boolean series
fruits.str.endswith("berry")

0    False
1    False
2     True
3     True
4    False
5    False
6    False
7    False
8     True
dtype: bool
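
Like `.startswith`, the Boolean series from `.endswith` can be used as a filter, and masks can be combined. The sketch below recreates the sorted `fruits` series so the block runs on its own; the names `berries` and `berries_not_b` are illustrative.

```python
import pandas as pd

# Same values as the sorted fruits series in this lesson
fruits = pd.Series(["apple", "banana", "blueberry", "cranberry", "lemon",
                    "lime", "orange", "pineapple", "raspberry"])

# Filter with .endswith, mirroring the .startswith filter
berries = fruits[fruits.str.endswith("berry")]

# Combine masks with & (and), | (or), and ~ (not)
ends_with_berry = fruits.str.endswith("berry")
starts_with_b = fruits.str.startswith("b")
berries_not_b = fruits[ends_with_berry & ~starts_with_b]

print(berries.tolist())
print(berries_not_b.tolist())
```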

In [20]:
# .len to get the length of the string
fruits.str.len()

0    5
1    6
2    9
3    9
4    5
5    4
6    6
7    9
8    9
dtype: int64
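
Since `.len` returns a numeric series, it supports comparisons and aggregations such as `.max()`. As a rough sketch, this finds the longest fruit names; `fruits` is recreated here so the block is self-contained, and `lengths` and `longest` are illustrative names.

```python
import pandas as pd

# Same values as the sorted fruits series in this lesson
fruits = pd.Series(["apple", "banana", "blueberry", "cranberry", "lemon",
                    "lime", "orange", "pineapple", "raspberry"])

# Length of each string as an integer series
lengths = fruits.str.len()

# Keep only the strings whose length equals the maximum length
longest = fruits[lengths == lengths.max()]
print(longest.tolist())
```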

In [21]:
# .lower to lowercase strings
shouts = pd.Series(["PLEASE", "LOWERCASE", "THESE", "STRINGS"])
not_shouts = shouts.str.lower()
not_shouts

0       please
1    lowercase
2        these
3      strings
dtype: object
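
For data cleaning, lowercasing is often paired with trimming stray whitespace. Each `.str` call returns a new series, so the calls can be chained. This is a minimal sketch with a hypothetical `messy` series:

```python
import pandas as pd

# Hypothetical messy input with extra spaces and inconsistent case
messy = pd.Series(["  Apple ", "BANANA", " lime"])

# .str.strip removes leading and trailing whitespace, then .str.lower lowercases
clean = messy.str.strip().str.lower()
print(clean.tolist())
```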

In [22]:
# Using .replace to replace characters (also used to remove characters)
prices = pd.Series(["€5.99", "€12.25", "€95"])

# Be sure to reassign the variable
prices = prices.str.replace("€", "")

# But our data type is still a string
prices * 2

0      5.995.99
1    12.2512.25
2          9595
dtype: object
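
`.str.replace` can treat its pattern either as a literal string or as a regular expression, and the default for its `regex` parameter has changed between pandas versions. Passing `regex` explicitly avoids surprises. The sketch below uses a made-up `amounts` series:

```python
import pandas as pd

# Hypothetical price strings with a currency symbol and thousands separator
amounts = pd.Series(["€1,250.50", "€95"])

# regex=False: the pattern is matched as plain text
no_symbol = amounts.str.replace("€", "", regex=False)

# regex=True: the character class removes everything except digits and dots
digits_only = amounts.str.replace(r"[^0-9.]", "", regex=True)

print(no_symbol.tolist())
print(digits_only.tolist())
```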

In [23]:
# Use .astype to convert a number in a string to a numeric data type
prices = prices.astype(float)
prices * 2

0     11.98
1     24.50
2    190.00
dtype: float64
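
`.astype(float)` raises an error if any string cannot be parsed as a number. When the data may contain such values, `pd.to_numeric` with `errors="coerce"` is one alternative that turns them into `NaN` instead. A small sketch, assuming a hypothetical `raw` series:

```python
import pandas as pd

# Hypothetical input where one entry is not a number
raw = pd.Series(["5.99", "12.25", "n/a"])

# Unparseable strings become NaN rather than raising an error
numbers = pd.to_numeric(raw, errors="coerce")
print(numbers)
```

Whether silently coercing to `NaN` is appropriate depends on the data; checking `numbers.isna()` afterwards shows which entries failed to convert.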

In [24]:
# .upper to uppercase
fruits.str.upper()

0        APPLE
1       BANANA
2    BLUEBERRY
3    CRANBERRY
4        LEMON
5         LIME
6       ORANGE
7    PINEAPPLE
8    RASPBERRY
dtype: object

## Further Reading
- [Pandas user guide for text](https://pandas.pydata.org/docs/user_guide/text.html)
- [Pandas user guide: essential basic functionality](https://pandas.pydata.org/docs/user_guide/basics.html)

## Exercises
- Create a series named `vegetables` using the list of strings `["Onion", "cucumber", "Carrot", "squash", "Potato", "Asperagus", "kale", "Broccoli", "spinach"]`
- Write the code necessary to lowercase all of the vegetables and reassign your series.
- Write the pandas code to sort the strings in alphabetical order. Ensure that the series stores the sorted order.
- Write the pandas code to show only the vegetables that start with a vowel.
- Write the pandas code to show the vegetables that have exactly two vowels.
<br><br>
- Now make a new series named `prices` that holds `["$2.99", "$1,200.25", "$5.99", "$2,350.00"]`
- Reassign `prices` to hold only a string of numbers. Remove the `$` and `,` characters.
- Reassign `prices` to be a float data type.
- Now multiply your `prices` series by `0.9`.

In [25]:
# Create a series of vegetables ["Onion", "cucumber", "Carrot", "squash", "Potato", "Asperagus", "kale", "Broccoli", "spinach"]


In [26]:
# Write the code necessary to lowercase all of the vegetables and reassign your series.


In [27]:
# Write the pandas code to sort the strings in alphabetical order. Ensure that the series stores the sorted order


In [28]:
# Write the pandas code to show only the vegetables that start with a vowel


In [29]:
# Write the pandas code to show the vegetables that have exactly two vowels


In [30]:
# Make a new series named prices that holds ["$2.99", "$1,200.25", "$5.99", "$2,350.00"]


In [31]:
# Reassign prices to hold only a string of numbers. Remove the $ and , characters.


In [32]:
# Reassign prices to be a float data type


In [33]:
# Multiply your prices series by 0.9
