## NLP Class 1 Exercise 2
- Read tweets into a pandas DataFrame
- Separate the RT marker into a different variable (eliminate RT from the `text` variable)
- Eliminate URLs, mentions and newline characters from the `extended_text` variable
- Extract hashtags (create a separate variable that has only hashtags and eliminate hashtags from the `extended_text` variable)
- Review extracted hashtags in a pandas DataFrame

In [1]:
import pandas as pd
import re

### Frequently Used Python Regular Expressions

    a, X, 9, < -- ordinary characters just match themselves exactly.
    . (a period) -- matches any single character except newline '\n'
    \w -- matches a "word" character: a letter, digit or underscore [a-zA-Z0-9_] (for Python 3 str patterns, Unicode letters and digits also match unless re.ASCII is set).
    \W -- matches any non-word character.
    \b -- matches word boundary (in between a word character and a non word character)
    \s -- matches a single whitespace character -- space, newline, return, tab
    \S -- matches any non-whitespace character.
    \t, \n, \r -- tab, newline, return
    \d -- matches any numeric digit [0-9]
    \D -- matches any non-digit character.
    ^ -- matches the beginning of the string
    $ -- matches the end of the string
    \ -- escapes a special character.
    (x|y|z) matches exactly one of x, y or z.
    (x) in general is a remembered group. We can get the value of what matched by using the groups() method of the object returned by re.search.
    x? matches an optional x character (in other words, it matches an x zero or one times).
    x* matches x zero or more times.
    x+ matches x one or more times.
    x{m,n} matches an x character at least m times, but not more than n times.
    (?:x) -- matches x but does not capture it. Non-capturing group.
    (?=x) -- matches if x follows, but excludes it from the match. Positive lookahead.
    a(?=b) will match the "a" in "ab", but not the "a" in "ac"
    In other words, a(?=b) matches the "a" which is followed by the string 'b', without consuming what follows the a.
    (?!x) -- matches if x does not follow. Negative lookahead.
    a(?!b) will match the "a" in "ac", but not the "a" in "ab"
    (?<=x) -- matches if x precedes. Positive lookbehind.
    (?<!x) -- matches if x does not precede. Negative lookbehind.
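
A few of the constructs listed above can be tried on a short string. This is a minimal sketch using only the standard `re` module; the `sample` string is made up for illustration, loosely modeled on the tweets in this dataset.

```python
import re

# Made-up sample string for illustration
sample = "RT @codyfight: Humans and #AI compete! https://t.co/DlDR1K5oBX"

# \w+ after a literal '@' or '#' picks up mentions and hashtags
mentions = re.findall(r'@\w+', sample)
hashtags = re.findall(r'#\w+', sample)

# (x) is a remembered group: groups() returns what each group matched
match = re.search(r'^(RT)\s+@(\w+)', sample)
groups = match.groups() if match else ()

# (?<=#) is a positive lookbehind: match the tag text without the '#'
tag_names = re.findall(r'(?<=#)\w+', sample)

# a(?=b) matches an 'a' only when a 'b' follows, without consuming the 'b'
lookahead_hits = [m.start() for m in re.finditer(r'a(?=b)', "ab ac ab")]

print(mentions, hashtags, groups, tag_names, lookahead_hits)
```

The lookahead example reports the start positions of the two "a" characters that are followed by "b"; the "a" in "ac" is skipped.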


### Load Data

In [2]:
url = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/tweets_ai_ml_nlp.json'
tweets = pd.read_json(url, orient='records', lines=True)
print(f'Dataset records: {tweets.shape[0]}, Dataset columns: {tweets.shape[1]}')

Dataset records: 100043, Dataset columns: 7


In [3]:
pd.set_option('display.max_colwidth', None)

In [4]:
tweets.head(5)

Unnamed: 0,id,lang,date,name,text,extended_text,quoted_text
0,1529094548005064705,en,2022-05-24,odol☘️,"RT @Frank4NC: CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥","CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥",Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn model 📈\n\nPlay on any device 💻📱⌚️\n#Metaverse #NFT $CTOK #GameFi #DeFi #NFTgame\n1/6 🧵❗️ https://t.co/DlDR1K5oBX
1,1529094585942568960,en,2022-05-24,Paijo s'Bejo,"RT @Bakercrypt0: Wonderful day to everybody! ✨🫶\n\nThe trailer is LIVE 🎬 please, spread the word, Tweet, Retweet or Quote Tweet about us and…","Wonderful day to everybody! ✨🫶\n\nThe trailer is LIVE 🎬 please, spread the word, Tweet, Retweet or Quote Tweet about us and our video. The bigger the support, the greater the rewards will be to all of us! Push, push, push!!! 🧑‍🚀 🚀",Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn model 📈\n\nPlay on any device 💻📱⌚️\n#Metaverse #NFT $CTOK #GameFi #DeFi #NFTgame\n1/6 🧵❗️ https://t.co/DlDR1K5oBX
2,1529094709771051013,en,2022-05-24,🍀Ging🍀6️⃣5️⃣🎹,"RT @Frank4NC: CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥","CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥",Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn model 📈\n\nPlay on any device 💻📱⌚️\n#Metaverse #NFT $CTOK #GameFi #DeFi #NFTgame\n1/6 🧵❗️ https://t.co/DlDR1K5oBX
3,1529094719120510976,en,2022-05-24,Ultra mild🗯💫,RT @codyfight: Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn…,Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn model 📈\n\nPlay on any device 💻📱⌚️\n#Metaverse #NFT $CTOK #GameFi #DeFi #NFTgame\n1/6 🧵❗️ https://t.co/DlDR1K5oBX,
4,1529094845393907712,en,2022-05-24,Ohayou🌼,"RT @ninasimonic: Wonderful day to everybody! ✨🫶\n\nTheir trailer is LIVE 🎬 please, spread the word. The bigger the support, the greater the r…","Wonderful day to everybody! ✨🫶\n\nTheir trailer is LIVE 🎬 please, spread the word. The bigger the support, the greater the rewards will be to all of us! Push, push, push!!! 🧑‍🚀",Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn model 📈\n\nPlay on any device 💻📱⌚️\n#Metaverse #NFT $CTOK #GameFi #DeFi #NFTgame\n1/6 🧵❗️ https://t.co/DlDR1K5oBX


In [7]:
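# Only a leading "RT " marks a retweet, so anchor the pattern at the start of the text
rt_pattern = r'^RT\s+'
tweets["reetweet"] = tweets["text"].apply(lambda x: bool(re.search(rt_pattern, x)))
# An unanchored 'RT' would also delete "RT" inside words such as "SMART"
tweets["text"] = tweets["text"].apply(lambda x: re.sub(rt_pattern, '', x))
tweets[["text","reetweet"]].head()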

Unnamed: 0,text,reetweet
0,"@Frank4NC: CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥",True
1,"@Bakercrypt0: Wonderful day to everybody! ✨🫶\n\nThe trailer is LIVE 🎬 please, spread the word, Tweet, Retweet or Quote Tweet about us and…",True
2,"@Frank4NC: CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥",True
3,@codyfight: Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn…,True
4,"@ninasimonic: Wonderful day to everybody! ✨🫶\n\nTheir trailer is LIVE 🎬 please, spread the word. The bigger the support, the greater the r…",True
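
Whether the pattern is anchored matters when "RT" appears inside other words. The sketch below compares an anchored pattern with a bare `RT` on two strings; the helper name `split_retweet` and the second sample are hypothetical and not part of the notebook.

```python
import re

# Anchored pattern: "RT" plus whitespace at the very start of the text
RT_PREFIX = re.compile(r'^RT\s+')

def split_retweet(text):
    """Return (is_retweet, text_without_marker) for one tweet."""
    is_retweet = RT_PREFIX.match(text) is not None
    return is_retweet, RT_PREFIX.sub('', text)

samples = [
    "RT @codyfight: Codyfight is a place where Humans and #AI compete as equals!",
    "SMART models need an RTX card",  # made-up text that is not a retweet
]
results = [split_retweet(s) for s in samples]

# For comparison: a bare 'RT' also removes the letters inside other words
unanchored = re.sub(r'RT', '', samples[1])

for is_rt, cleaned in results:
    print(is_rt, cleaned)
print(unanchored)
```

The second sample is left untouched by the anchored pattern, while the unanchored substitution turns it into "SMA models need an X card". On a DataFrame, a function like this could be applied to the `text` column with `apply`.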


In [16]:
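def remove_content(string):
    """Strip URLs, hashtags and mentions from a tweet's text."""
    # URL: scheme, dotted host, then path/query characters
    url_pattern = r'(http|ftp|https):\/\/([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:\/~+#-]*[\w@?^=%&\/~+#-])'
    # Mentions and hashtags, together with any whitespace directly before them
    mention_pattern = r'[\s]*@[\w]+'
    hashtag_pattern = r'[\s]*#[\w]+'
    string_fixed = re.sub(url_pattern,"",string)
    string_fixed = re.sub(hashtag_pattern,"",string_fixed)
    string_fixed = re.sub(mention_pattern,"",string_fixed)
    # Newlines that do not directly precede a mention or hashtag are left in place
    return string_fixed

tweets["fixed_extended"] = tweets["extended_text"].apply(remove_content)
tweets[["extended_text","fixed_extended"]].head()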

Unnamed: 0,extended_text,fixed_extended
0,"CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥","CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, 💥"
1,"Wonderful day to everybody! ✨🫶\n\nThe trailer is LIVE 🎬 please, spread the word, Tweet, Retweet or Quote Tweet about us and our video. The bigger the support, the greater the rewards will be to all of us! Push, push, push!!! 🧑‍🚀 🚀","Wonderful day to everybody! ✨🫶\n\nThe trailer is LIVE 🎬 please, spread the word, Tweet, Retweet or Quote Tweet about us and our video. The bigger the support, the greater the rewards will be to all of us! Push, push, push!!! 🧑‍🚀 🚀"
2,"CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, @codyfight 💥","CodyFight is a must watch and must EARN! Get ready for some 💸 there. \n\nKudos to you, 💥"
3,Codyfight is a place where Humans and #AI compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable #Create2earn model 📈\n\nPlay on any device 💻📱⌚️\n#Metaverse #NFT $CTOK #GameFi #DeFi #NFTgame\n1/6 🧵❗️ https://t.co/DlDR1K5oBX,Codyfight is a place where Humans and compete as equals! ⚔️\n\nIt rewards creativity and provides a scalable model 📈\n\nPlay on any device 💻📱⌚️ $CTOK\n1/6 🧵❗️
4,"Wonderful day to everybody! ✨🫶\n\nTheir trailer is LIVE 🎬 please, spread the word. The bigger the support, the greater the rewards will be to all of us! Push, push, push!!! 🧑‍🚀","Wonderful day to everybody! ✨🫶\n\nTheir trailer is LIVE 🎬 please, spread the word. The bigger the support, the greater the rewards will be to all of us! Push, push, push!!! 🧑‍🚀"
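
The output above still contains `\n` characters, while the exercise also asks for newlines to be eliminated. One possible variant is sketched below. It assumes a simplified URL pattern (`https?://\S+`, which may also swallow trailing punctuation), and the function name `clean_tweet` and the sample string are made up for illustration.

```python
import re

URL_RE = re.compile(r'https?://\S+')   # simplified URL pattern (assumption)
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')

def clean_tweet(text):
    """Remove URLs, mentions and hashtags, then collapse all whitespace."""
    # Replace each match with a space so neighboring words do not fuse together
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    text = HASHTAG_RE.sub(' ', text)
    # \s+ covers newlines, tabs and repeated spaces
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample modeled on the tweets above
sample = "Kudos to you, @codyfight 💥\n\nPlay on any device\n#Metaverse #NFT https://t.co/DlDR1K5oBX"
cleaned = clean_tweet(sample)
print(cleaned)
```

Substituting a space rather than an empty string and collapsing whitespace afterwards avoids having to match the whitespace in front of each mention or hashtag.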


In [20]:
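def get_hashtag(string):
    """Return every hashtag found in a tweet's text as a list."""
    # The leading [\s]* keeps whitespace that precedes a tag, e.g. '\n#Metaverse'
    hashtag_pattern = r'[\s]*#[\w]+'
    string_fixed = re.findall(hashtag_pattern,string)
    return string_fixed

# One list of hashtags per tweet, expanded to one column per hashtag position
columns_lists = tweets["extended_text"].apply(get_hashtag)
hashtag_table = pd.DataFrame(columns_lists.tolist())
hashtag_table.head()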

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,32,33,34,35,36,37,38,39,40,41
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,#AI,#Create2earn,\n#Metaverse,#NFT,#GameFi,#DeFi,#NFTgame,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
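
Row 3 above includes `\n#Metaverse`, because the pattern also captures whitespace in front of a tag. A sketch of an alternative that keeps only the tag itself and then counts tags across tweets follows; the helper name `extract_hashtags` and the sample texts are made up, and lowercasing before counting is an assumption about how tags should be grouped.

```python
import re
from collections import Counter

HASHTAG_RE = re.compile(r'#\w+')  # no leading whitespace in the match

def extract_hashtags(text):
    """Return the hashtags in one tweet, without surrounding whitespace."""
    return HASHTAG_RE.findall(text)

# Made-up sample texts modeled on the tweets above
texts = [
    "Humans and #AI compete as equals!\n#Metaverse #NFT #GameFi",
    "No tags here",
    "More #AI and #NFT news",
]
per_tweet = [extract_hashtags(t) for t in texts]

# Count tags case-insensitively across all tweets
counts = Counter(tag.lower() for tags in per_tweet for tag in tags)
top = counts.most_common(2)
print(per_tweet)
print(top)
```

Keeping one list per tweet, rather than one column per hashtag position, avoids the mostly empty wide table shown above. In pandas, `tweets["extended_text"].str.findall(r'#\w+')` would give a comparable list-valued column.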


In [21]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Wed, 22 March 2023 20:48:48'