<a href="https://colab.research.google.com/github/wojtekgradzinski/WojtekRepo/blob/main/ElonMusk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd

In [4]:
df = pd.read_csv("data/jre_elon_musk.csv")
df.head()

Unnamed: 0,Timestamp,Speaker,Text
0,[00:00:00],Joe Rogan,"Ah, ha, ha, ha. Four, three, two, one, boom. T..."
1,[00:00:09],Elon Musk,You're welcome.
2,[00:00:10],Joe Rogan,It's very good to meet you.
3,[00:00:11],Elon Musk,Nice to meet you too.
4,[00:00:12],Joe Rogan,And thanks for not lighting this place on fire.


In [6]:
# Print the first ten lines of text
print(df.Text.values[0:10])

['Ah, ha, ha, ha. Four, three, two, one, boom. Thank you. Thanks for doing this, man. Really appreciate it.'
 "You're welcome." "It's very good to meet you." 'Nice to meet you too.'
 'And thanks for not lighting this place on fire.'
 "You're welcome. That's coming later."
 "How does one, just in the middle of doing all the things you do, create cars, rockets, all the stuff you're doing,constantly innovating, decide to just make a flamethrower? Where do you have the time for that?"
 "Well, the flame, we didn't put a lot of time into the flamethrower. This was an off-the-cuff thing. It's sort of a hobbycompany called the Boring Company, which started out as a joke, and we decided to make a real, and dig a tunnelunder LA. And then, other people asked us to dig tunnels. And so, we said yes in a few cases."
 'Now, who-'
 'And then, we have a merchandise section that only has one piece of merchandise at a time. And we started off witha cap. And there was only one thing on, which is BoringCom

In [7]:
# Create a function to convert a "[HH:MM:SS]" timestamp into seconds
import re  # used later by remove_punctuation

def convert_timestamp_into_seconds(timestamp):
    # Strip the surrounding brackets, then split into hours, minutes, seconds
    hrs, mins, secs = timestamp.strip('[]').split(':')
    return int(hrs) * 3600 + int(mins) * 60 + int(secs)

In [8]:
df.Timestamp[1].strip('[,]').split(':')[0]

'00'
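
As a cross-check on the parsing logic above, the same conversion can be sketched in vectorized form with `pd.to_timedelta`. The sample timestamps below are made up for illustration; the only assumption is the `[HH:MM:SS]` format shown in the transcript.

```python
import pandas as pd

# Hypothetical sample in the same "[HH:MM:SS]" format as the Timestamp column
sample = pd.Series(["[00:00:00]", "[00:00:09]", "[01:01:01]"])

# Strip the brackets, parse the strings as durations, convert to whole seconds
seconds = pd.to_timedelta(sample.str.strip("[]")).dt.total_seconds().astype(int)
print(seconds.tolist())  # [0, 9, 3661]
```

This avoids a Python-level loop over rows, although for a transcript of this size either approach should be fast enough.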

In [9]:
# Convert the Timestamp column using the function defined above

df["Timestamp"] = df["Timestamp"].apply(convert_timestamp_into_seconds)
df.tail()

Unnamed: 0,Timestamp,Speaker,Text
1826,9401,Joe Rogan,"I believe it's true too. So, thank you."
1827,9403,Elon Musk,You're welcome.
1828,9404,Joe Rogan,"All you assholes out there, be nice. Be nice, ..."
1829,9410,Elon Musk,"All right, thank you."
1830,9410,Joe Rogan,"Good night, everybody. END OF TRANSCRIPTAutoma..."


In [6]:
# Add a column with the number of seconds each row's text lasts.
# For example, the first row lasts 9 seconds, since Elon Musk
# answers at second 9 (hint: use shift with periods=-1).
# Differences below 1 are set to 1 (the minimum interval length is 1 second).
# The last row has no successor: fill_value=1.0 makes its difference negative,
# so it is clipped to 1 as well.
df["Interval"] = (df['Timestamp'].shift(periods=-1, fill_value=1.0) - df['Timestamp']).clip(lower=1)
df["Interval"]

0       9.0
1       1.0
2       1.0
3       1.0
4       1.0
       ... 
1826    2.0
1827    1.0
1828    6.0
1829    1.0
1830    1.0
Name: Interval, Length: 1831, dtype: float64
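
To make the shift logic easier to follow, here is a minimal sketch on a toy frame with hypothetical timestamps. It is an alternative spelling rather than the notebook's exact code: the last row is left as `NaN` and then filled explicitly, instead of relying on `fill_value`.

```python
import pandas as pd

# Hypothetical timestamps in seconds; two rows share the same second (10)
toy = pd.DataFrame({"Timestamp": [0, 9, 10, 10, 12]})

# Next row's timestamp minus this row's; the last row has no successor (NaN)
raw = toy["Timestamp"].shift(-1) - toy["Timestamp"]

# Enforce the 1-second minimum, then give the last row the minimum as well
toy["Interval"] = raw.clip(lower=1).fillna(1)
print(toy["Interval"].tolist())  # [9.0, 1.0, 1.0, 2.0, 1.0]
```

The third row shows why the minimum matters: two utterances logged in the same second would otherwise get an interval of 0.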

In [7]:
# Total seconds spoken by Joe Rogan
df[df['Speaker'] == 'Joe Rogan'].Interval.sum()

4637.0

In [8]:
# Total seconds spoken by Jaime
df[df['Speaker'] == 'Jaime'].Interval.sum()

45.0

In [9]:
# Average speaking interval for each person
df.groupby(by=["Speaker"])["Interval"].mean()

Speaker
Elon Musk    5.583058
Jaime        2.647059
Joe Rogan    5.123757
Name: Interval, dtype: float64
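
The per-speaker totals and averages above can also be collected in a single `groupby` call. The sketch below uses a toy frame with hypothetical speakers and intervals, and adds a `share` column (a name introduced here for illustration) for each speaker's fraction of the total talk time.

```python
import pandas as pd

# Hypothetical data with the same column names as the notebook
toy = pd.DataFrame({
    "Speaker": ["A", "B", "A", "B"],
    "Interval": [9.0, 1.0, 3.0, 5.0],
})

# Total, mean, and number of intervals per speaker in one pass
summary = toy.groupby("Speaker")["Interval"].agg(["sum", "mean", "count"])

# Fraction of the overall talk time attributed to each speaker
summary["share"] = summary["sum"] / summary["sum"].sum()
print(summary)
```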

In [16]:
# Preprocess the data
import spacy

nlp = spacy.load("en_core_web_sm")

# Create a function to remove punctuation from text
# (re was imported in an earlier cell)

def remove_punctuation(text):
    # Drop every character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', text)


# Create a function to count the tokens of a text
# (apply it to text whose punctuation has already been removed)

def count_tokens(text):
    # len() of a spaCy Doc is its number of tokens
    return len(nlp(text))

# Create a function to remove stop words from text

def remove_stopwords(text):
    # Keep only the tokens that spaCy does not flag as stop words
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_stop]
    return ' '.join(tokens)

# Remove punctuation from every row and store the result in a new column
df["TextNoPunct"] = df.Text.apply(remove_punctuation)
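
One side effect of the `[^\w\s]` pattern is that apostrophes are removed too, so contractions are merged into a single word. The sketch below shows this on a line from the transcript, together with a hypothetical variant (`remove_punctuation_keep_apostrophes`, not part of the notebook) that keeps straight apostrophes.

```python
import re

def remove_punctuation(text):
    # Same pattern as the notebook: drop anything that is not a word character or whitespace
    return re.sub(r"[^\w\s]", "", text)

def remove_punctuation_keep_apostrophes(text):
    # Hypothetical variant: also keep straight apostrophes so contractions stay intact
    return re.sub(r"[^\w\s']", "", text)

line = "You're welcome. That's coming later."
print(remove_punctuation(line))                   # Youre welcome Thats coming later
print(remove_punctuation_keep_apostrophes(line))  # You're welcome That's coming later
```

Which form is preferable depends on the tokenizer that runs afterwards, so treat the variant as an option to experiment with rather than a recommendation.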

In [17]:
# Put the number of tokens of each row in a new column
df["n_tokens"] = df.TextNoPunct.apply(count_tokens)

In [18]:
df

Unnamed: 0,Timestamp,Speaker,Text,Interval,TextNoPunct,n_tokens
0,0,Joe Rogan,"Ah, ha, ha, ha. Four, three, two, one, boom. T...",9.0,Ah ha ha ha Four three two one boom Thank you ...,19
1,9,Elon Musk,You're welcome.,1.0,Youre welcome,3
2,10,Joe Rogan,It's very good to meet you.,1.0,Its very good to meet you,6
3,11,Elon Musk,Nice to meet you too.,1.0,Nice to meet you too,5
4,12,Joe Rogan,And thanks for not lighting this place on fire.,1.0,And thanks for not lighting this place on fire,9
...,...,...,...,...,...,...
1826,9401,Joe Rogan,"I believe it's true too. So, thank you.",2.0,I believe its true too So thank you,8
1827,9403,Elon Musk,You're welcome.,1.0,Youre welcome,3
1828,9404,Joe Rogan,"All you assholes out there, be nice. Be nice, ...",6.0,All you assholes out there be nice Be nice bit...,18
1829,9410,Elon Musk,"All right, thank you.",1.0,All right thank you,4


In [19]:

# Compute the speaking velocity (tokens per second) and store it in a new column

df["Velocity"] = df["n_tokens"]/df["Interval"]

In [20]:
# Inspect the avg velocity of each speaker
df.groupby(by=["Speaker"])["Velocity"].mean()

Speaker
Elon Musk    2.814122
Jaime        3.453560
Joe Rogan    2.999045
Name: Velocity, dtype: float64
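
Note that the figures above are means of per-row velocities, so a one-second row counts as much as a long monologue. A possible alternative is to divide each speaker's total tokens by their total seconds. The toy numbers below are hypothetical and only illustrate how the two measures can differ.

```python
import pandas as pd

# Hypothetical rows using the notebook's column names
toy = pd.DataFrame({
    "Speaker": ["A", "A", "B"],
    "n_tokens": [18, 3, 10],
    "Interval": [9.0, 1.0, 4.0],
})
toy["Velocity"] = toy["n_tokens"] / toy["Interval"]

# Mean of per-row velocities (what the notebook computes)
mean_of_rows = toy.groupby("Speaker")["Velocity"].mean()

# Total tokens divided by total seconds per speaker
totals = toy.groupby("Speaker")[["n_tokens", "Interval"]].sum()
overall = totals["n_tokens"] / totals["Interval"]

print(mean_of_rows["A"], overall["A"])  # 2.5 2.1
```

For speaker A, the short three-token row pulls the per-row mean up to 2.5, while the time-weighted rate is 2.1 tokens per second.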