## NLP Class 2 Exercise 1:
- Read tweets into a pandas DataFrame
- Identify the most frequently mentioned AI / ML / NLP technologies in the `extended_text` variable
- Assess the popularity of Python vs. R vs. SAS

**Suggestions:** 
- Remove URLs, mentions, and newline characters
- Separate retweets (RTs)
- Process text of the tweets separately from hashtags
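
The suggestions above can be sketched with the standard `re` module. This is a minimal illustration rather than the exercise's reference solution: the regular expressions, the `split_tweet` helper, and the sample tweets are all assumptions made up for this example, and real tweets may need looser patterns.

```python
import re

# Illustrative patterns (assumptions, not an exhaustive definition of tweet syntax)
RT_PREFIX_RE = re.compile(r'^RT @\w+:\s*')    # leading "RT @user: " marker
URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')


def split_tweet(text):
    """Hypothetical helper: flag retweets, pull out hashtags, and clean the body."""
    is_retweet = bool(RT_PREFIX_RE.match(text))
    # Collect hashtags separately so they can be analysed on their own
    hashtags = [tag.lower() for tag in HASHTAG_RE.findall(text)]
    # Drop the retweet prefix, then URLs, mentions and hashtags from the body
    body = RT_PREFIX_RE.sub('', text)
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        body = pattern.sub(' ', body)
    # str.split() with no arguments also removes newlines and repeated spaces
    clean_text = ' '.join(body.split())
    return {'is_retweet': is_retweet, 'hashtags': hashtags, 'clean_text': clean_text}


# Made-up sample tweets, not taken from the dataset
sample_tweets = [
    'RT @some_user: Great intro to #NLP\nhttps://example.com/post',
    'Trying #Python and #MachineLearning today @friend',
]
parsed = [split_tweet(t) for t in sample_tweets]
for row in parsed:
    print(row)
```

On a DataFrame, the same helper could be applied row by row with `Series.map`, and the `is_retweet` flag used to split retweets from original tweets before counting.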

In [1]:
# One-time setup: uncomment to download the NLTK data (the stopwords corpus is used below)
#import nltk
#nltk.download('popular', halt_on_error=False)
#nltk.download('all', halt_on_error=False)

In [1]:
import nltk
import nltk.corpus
from nltk.text import Text
import pandas as pd
import re
import sys

In [2]:
print(sys.version)

3.11.4 (main, Jul  5 2023, 09:00:44) [Clang 14.0.6 ]


### Load Data

In [3]:
url = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/tweets_ai_ml_nlp.json'
tweets = pd.read_json(url, orient='records', lines=True)

print(f'Dataset records: {tweets.shape[0]}, Dataset columns: {tweets.shape[1]}')

Dataset records: 100043, Dataset columns: 7


In [4]:
tweets.head(10)

,id,lang,date,name,text,extended_text,quoted_text
0,1529094548005064705,en,2022-05-24,odol☘️,RT @Frank4NC: CodyFight is a must watch and mu...,CodyFight is a must watch and must EARN! Get r...,Codyfight is a place where Humans and #AI comp...
1,1529094585942568960,en,2022-05-24,Paijo s'Bejo,RT @Bakercrypt0: Wonderful day to everybody! ✨...,Wonderful day to everybody! ✨🫶\n\nThe trailer ...,Codyfight is a place where Humans and #AI comp...
2,1529094709771051013,en,2022-05-24,🍀Ging🍀6️⃣5️⃣🎹,RT @Frank4NC: CodyFight is a must watch and mu...,CodyFight is a must watch and must EARN! Get r...,Codyfight is a place where Humans and #AI comp...
3,1529094719120510976,en,2022-05-24,Ultra mild🗯💫,RT @codyfight: Codyfight is a place where Huma...,Codyfight is a place where Humans and #AI comp...,
4,1529094845393907712,en,2022-05-24,Ohayou🌼,RT @ninasimonic: Wonderful day to everybody! ✨...,Wonderful day to everybody! ✨🫶\n\nTheir traile...,Codyfight is a place where Humans and #AI comp...
5,1529094949592977408,en,2022-05-24,big Winner,RT @ninasimonic: Wonderful day to everybody! ✨...,Wonderful day to everybody! ✨🫶\n\nTheir traile...,Codyfight is a place where Humans and #AI comp...
6,1529095178426122240,en,2022-05-24,️☘𝙂𝙝𝙚𝙖୧⍤⃝💸☞ Ⓕⓡⓔⓔⓣⓐⓖ,RT @Frank4NC: CodyFight is a must watch and mu...,CodyFight is a must watch and must EARN! Get r...,Codyfight is a place where Humans and #AI comp...
7,1529095330427330560,en,2022-05-24,keju,RT @Bakercrypt0: Wonderful day to everybody! ✨...,Wonderful day to everybody! ✨🫶\n\nThe trailer ...,Codyfight is a place where Humans and #AI comp...
8,1529095454059032576,en,2022-05-24,mrczza 💞🌹,RT @codyfight: Codyfight is a place where Huma...,Codyfight is a place where Humans and #AI comp...,
9,1529098022445281280,en,2022-05-24,pabasbas riego,RT @cryptoreeder: Wonderful day to everybody! ...,Wonderful day to everybody! ✨🫶\n\nOur trailer ...,Codyfight is a place where Humans and #AI comp...


#### Use TweetTokenizer to tokenize Tweets

In [6]:
%%time

top_N = 100

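# Lowercase every tweet, replace pipe characters with spaces, and join all tweets into one string.
# Note: this uses the raw `text` column, so retweet markers ("rt") are counted as tokens.
tweet_text = tweets['text'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')

# Use TweetTokenizer, which keeps hashtags, mentions and emoticons as single tokens
tweet_tokenizer = nltk.tokenize.TweetTokenizer()
words = tweet_tokenizer.tokenize(tweet_text)

# English stopwords as a set for fast membership tests
# (named stop_words to avoid shadowing the nltk.corpus.stopwords module name)
stop_words = set(nltk.corpus.stopwords.words('english'))

# Remove single-character tokens (mostly punctuation)
words = [word for word in words if len(word) > 1]

# Remove numbers
words = [word for word in words if not word.isnumeric()]

# Optionally keep alphabetic tokens only (this would also drop hashtags)
# words = [word for word in words if word.isalpha()]

# Tokens are already lowercase because the text was lowercased before tokenizing

# Remove stopwords
words = [word for word in words if word not in stop_words]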

word_dist = nltk.FreqDist(words)

df_word_dist = pd.DataFrame(word_dist.most_common(top_N),
                    columns=['Word', 'Frequency']).set_index('Word')

df_word_dist.head(10)

CPU times: user 16 s, sys: 144 ms, total: 16.2 s
Wall time: 16.2 s


Word,Frequency
rt,76117
#ai,27036
#python,19949
#machinelearning,16159
#datascience,12820
#bigdata,10172
#100daysofcode,9604
#iot,8272
#javascript,7569
#analytics,7168
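
For the Python vs. R vs. SAS comparison, one possible approach is to count how many tweets mention each language at least once. The sketch below is an assumption-laden illustration, not the exercise's solution: the alias sets in `ALIASES`, the `count_language_mentions` helper, and the sample texts are all hypothetical. A bare `r` token is deliberately excluded because a single letter is too ambiguous, so R is matched only through `#r` and `rstats` variants.

```python
import re
from collections import Counter

# Hypothetical alias sets; extend or tighten them after inspecting the data
ALIASES = {
    'Python': {'python', '#python', '#python3'},
    'R': {'#r', '#rstats', 'rstats'},
    'SAS': {'sas', '#sas'},
}

# A token is an optional '#' followed by word characters
TOKEN_RE = re.compile(r'#?\w+')


def count_language_mentions(texts):
    """Count tweets that mention each language at least once."""
    counts = Counter({name: 0 for name in ALIASES})
    for text in texts:
        tokens = set(TOKEN_RE.findall(text.lower()))
        for name, aliases in ALIASES.items():
            if tokens & aliases:  # any alias present in this tweet
                counts[name] += 1
    return counts


# Made-up sample texts, not taken from the dataset
sample_texts = [
    'Learning #Python for #DataScience',
    'ggplot2 tips #rstats #R',
    'Python vs SAS for analytics',
    'No languages here #AI',
]
language_mentions = count_language_mentions(sample_texts)
print(language_mentions)
```

Counting per tweet rather than per token keeps a single tweet that repeats a hashtag from inflating the totals. Counting retweets separately from original tweets would be a reasonable refinement, since the `rt` token dominates the frequency table.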


In [7]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Thu, 04 August 2022 09:28:18'