# Use Regular Expressions to Search Text in a DataFrame

In this notebook, I use regular expressions, a challenging concept introduced in Chapter 2 of Data Science from Scratch, to find out how many of the Peloton classes I've taken featured Hip Hop music.

Regular expressions offer one way to locate part of a string, whether that part is a single character or a whole phrase. To demonstrate the concept, I use pandas to read a CSV containing my Peloton workout metrics.
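Before bringing pandas in, here is a minimal sketch of the idea with the standard-library re module. The sample title is one that appears later in this notebook; the variable names are only for illustration.

```python
import re

# Sample title, copied from the class titles shown later in this notebook
title = "20 min Hip Hop Ride"

# re.search scans the whole string and returns a Match object,
# or None when the pattern does not occur anywhere in it
genre_match = re.search(r"Hip Hop", title)
found = genre_match.group() if genre_match else None

# A capture group pulls out just the part inside the parentheses,
# here the digits that come before " min"
length_match = re.search(r"(\d+) min", title)
minutes = int(length_match.group(1)) if length_match else None

print(found, minutes)
```

The same patterns carry over to pandas, where the .str accessor applies a regular expression to every string in a column at once.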

You'll see that I've got a lot of work to do before I can draw meaning from the information I've teased out. I know how to count the classes I've taken whose titles name a music genre. Next, I'd like to learn how to create a Music Genre column and fill it with the appropriate values, so that I can check whether I perform better or worse to the sounds of a particular genre.

***

First, I will import pandas and re.

In [63]:
import pandas as pd
import re

Next, I will read my CSV and check out the first 5 records using .head().

In [5]:
df = pd.read_csv('C:/Users/snr13/Google Drive/School/Personal Projects/Peloton/Data/snr131_workouts.csv')

In [6]:
df.head()

Unnamed: 0,Workout Timestamp,Live/On-Demand,Instructor Name,Length (minutes),Fitness Discipline,Type,Title,Class Timestamp,Total Output,Avg. Watts,Avg. Resistance,Avg. Cadence (RPM),Avg. Speed (mph),Distance (mi),Calories Burned,Avg. Heartrate,Avg. Incline,Avg. Pace (min/mi)
0,2016-05-23 13:30 (UTC),Live,Alex Toussaint,45,Cycling,Music,45 min Top Hits Ride,2016-05-23 13:15 (UTC),150.0,56.0,35%,63.0,12.35,9.26,347.0,,,
1,2020-04-05 16:46 (UTC),On Demand,Olivia Amato,10,Strength,Bodyweight,10 min Bodyweight Strength,2020-03-10 12:31 (UTC),,,,,,,7.0,,,
2,2020-04-11 17:33 (UTC),On Demand,Becs Gentry,10,Strength,Bodyweight,10 min Bodyweight Strength,2020-03-30 15:17 (UTC),,,,,,,50.0,,,
3,2020-04-12 14:42 (UTC),On Demand,Olivia Amato,10,Strength,Bodyweight,10 min Bodyweight Strength,2020-03-10 12:31 (UTC),,,,,,,50.0,,,
4,2020-05-03 15:56 (EST),On Demand,Andy Speer,30,Strength,Upper Body,30 min Upper Body: Live from Home,2020-05-01 11:25 (EST),,,,,,,148.0,,,
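The 'Unnamed: 0' column in the output above is what pandas shows when the first header cell in a CSV is empty, which usually happens when a DataFrame was saved together with its index. The sketch below assumes that is the case here and uses a small in-memory CSV (made-up for illustration, not my real file) to show how index_col=0 avoids the extra column.

```python
import io
import pandas as pd

# Tiny stand-in CSV whose first header cell is empty, as happens when
# a DataFrame is written out with its index (an assumption about my file)
csv_text = (
    ",Title,Length (minutes)\n"
    "0,45 min Top Hits Ride,45\n"
    "1,10 min Bodyweight Strength,10\n"
)

# Default read: the unnamed first column becomes 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```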


I've got data for cycling classes as well as a bunch of other disciplines.

In [8]:
df['Fitness Discipline'].value_counts()

Strength         165
Cycling           53
Stretching        18
Cardio             9
Bike Bootcamp      8
Yoga               1
Name: Fitness Discipline, dtype: int64

In this notebook, I want to look at only the cycling classes.

In [10]:
cycling_df = df.loc[(df['Fitness Discipline']=='Cycling')]

In [12]:
cycling_df.head()

Unnamed: 0,Workout Timestamp,Live/On-Demand,Instructor Name,Length (minutes),Fitness Discipline,Type,Title,Class Timestamp,Total Output,Avg. Watts,Avg. Resistance,Avg. Cadence (RPM),Avg. Speed (mph),Distance (mi),Calories Burned,Avg. Heartrate,Avg. Incline,Avg. Pace (min/mi)
0,2016-05-23 13:30 (UTC),Live,Alex Toussaint,45,Cycling,Music,45 min Top Hits Ride,2016-05-23 13:15 (UTC),150.0,56.0,35%,63.0,12.35,9.26,347.0,,,
49,2020-09-16 20:20 (EST),On Demand,Alex Toussaint,20,Cycling,Music,20 min Hip Hop Ride,2020-06-06 16:00 (EST),77.0,64.0,33%,77.0,13.39,4.46,106.0,,,
50,2020-09-17 05:37 (EST),On Demand,Alex Toussaint,30,Cycling,Music,30 min Hip Hop Ride,2020-08-31 19:21 (EST),100.0,56.0,32%,75.0,12.45,6.23,138.0,,,
51,2020-09-18 05:21 (EST),On Demand,Alex Toussaint,30,Cycling,Theme,30 min Club Bangers Ride,2020-08-25 11:17 (EST),130.0,73.0,37%,71.0,13.88,6.94,180.0,,,
53,2020-09-24 13:09 (EST),On Demand,Alex Toussaint,20,Cycling,Low Impact,20 min Low Impact Ride,2020-07-27 18:45 (EST),81.0,67.0,31%,85.0,13.61,4.54,111.0,,,
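The filter above works by building a boolean mask and passing it to .loc. Here is a small sketch of the same step on a made-up DataFrame, with one optional addition: calling .copy() on the result. That is my assumption about what would help later, since adding a new column to a filtered slice can trigger pandas' SettingWithCopyWarning.

```python
import pandas as pd

# Made-up rows that mimic two columns of the real data
toy_df = pd.DataFrame({
    "Fitness Discipline": ["Cycling", "Strength", "Cycling"],
    "Title": [
        "20 min Hip Hop Ride",
        "10 min Bodyweight Strength",
        "30 min EDM Ride",
    ],
})

# The comparison produces one True/False value per row
is_cycling = toy_df["Fitness Discipline"] == "Cycling"

# .loc keeps the rows where the mask is True; .copy() makes the result
# an independent DataFrame rather than a view onto toy_df
toy_cycling = toy_df.loc[is_cycling].copy()

print(toy_cycling)
```

Note that the filtered rows keep their original index labels, which is why the cycling rows in the output above jump from 0 to 49.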


The 'Title' column sometimes contains the music genre.

In [13]:
cycling_df['Title'].value_counts()

10 min Low Impact Ride                      4
10 min Climb Ride                           3
30 min Hip Hop Ride                         3
15 min Hip Hop Ride                         3
45 min Power Zone Endurance Ride            2
20 min 90s Pop Ride                         2
30 min EDM Ride                             2
15 min EDM Ride                             2
15 min Low Impact Ride                      2
20 min Hip Hop Ride                         2
30 min Power Zone EDM Ride                  1
45 min Power Zone Ride                      1
20 min EDM Ride                             1
20 min 2000s Hip Hop Ride                   1
30 min Club Bangers Ride                    1
15 min Boss Ride                            1
30 min Power Zone Endurance Ride            1
45 min EDM Ride                             1
15 min 90s Ride                             1
30 min Power Zone Ride                      1
45 min 2010s Ride                           1
30 min 2010s Ride                 

I'll search the strings in the 'Title' column with .str.count and add up the results with .sum to see how many classes I've taken for each music genre named in a title.

In [41]:
cycling_df['Title'].str.count(r'(Hip Hop)').sum()

11
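One thing worth knowing about this approach: .str.count tallies how many times the pattern occurs inside each title, so the sum equals the number of classes only when a genre appears at most once per title. A related method, .str.contains, answers the yes/no question per row instead. The sketch below compares the two on a few titles copied from the list above. The parentheses in my pattern form a capture group that neither method needs, so they are left out here.

```python
import pandas as pd

# A few titles copied from the value_counts output above
titles = pd.Series([
    "30 min Hip Hop Ride",
    "20 min 2000s Hip Hop Ride",
    "20 min 90s Pop Ride",
    "10 min Low Impact Ride",
])

# Total number of times the pattern occurs across all titles
occurrences = titles.str.count(r"Hip Hop").sum()

# Number of titles that contain the pattern at least once
rows_matching = titles.str.contains(r"Hip Hop", regex=True).sum()

print(occurrences, rows_matching)
```

For titles like these the two numbers agree. They would differ only if a single title repeated the genre name.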

In [42]:
cycling_df['Title'].str.count(r'(Pop)').sum()

2

In [43]:
cycling_df['Title'].str.count(r'(EDM)').sum()

7

In [44]:
cycling_df['Title'].str.count(r'(2000s)').sum()

2

In [45]:
cycling_df['Title'].str.count(r'(90s)').sum()

6

In [46]:
cycling_df['Title'].str.count(r'(2010s)').sum()

2

In [47]:
cycling_df['Title'].str.count(r'(Club Bangers)').sum()

1

In [48]:
cycling_df['Title'].str.count(r'(Dance)').sum()

1

In [49]:
cycling_df['Title'].str.count(r'(House)').sum()

1

In [50]:
cycling_df['Title'].str.count(r'(Top Hits)').sum()

1
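As a possible next step toward the Music Genre column I mentioned at the start, here is a minimal sketch using .str.extract, which returns the text matched by a capture group, or NaN when nothing matches. It runs on a small made-up DataFrame rather than my real data. The genre list is simply the set of patterns I searched for above, and sample_df, genres, and pattern are illustrative names.

```python
import re
import pandas as pd

# Small stand-in for cycling_df; titles copied from the list above
sample_df = pd.DataFrame({
    "Title": [
        "30 min Hip Hop Ride",
        "20 min 90s Pop Ride",
        "45 min EDM Ride",
        "10 min Low Impact Ride",
    ]
})

# The genres searched for one at a time in the cells above
genres = ["Hip Hop", "Pop", "EDM", "2000s", "90s", "2010s",
          "Club Bangers", "Dance", "House", "Top Hits"]

# Join the genres into one alternation inside a single capture group;
# re.escape guards against characters with special regex meaning
pattern = "(" + "|".join(re.escape(g) for g in genres) + ")"

# expand=False returns a Series; titles with no genre get NaN.
# Only the first match in each title is kept, so "90s Pop" becomes "90s".
sample_df["Music Genre"] = sample_df["Title"].str.extract(pattern, expand=False)

# One call now replaces the separate counts above
genre_counts = sample_df["Music Genre"].value_counts()

print(sample_df)
print(genre_counts)
```

With a column like this on the real data, something along the lines of cycling_df.groupby('Music Genre')['Total Output'].mean() would be one way to start comparing performance by genre. Titles that name two genres, such as "20 min 2000s Hip Hop Ride", would need a decision about which label to keep.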