## NLP Class 3 Exercise 1:
- Read tweets into a pandas DataFrame
- Identify bigrams and trigrams for the most frequently mentioned AI / ML / NLP technologies

**Suggestions:** 
- Eliminate URLs, mentions, hashtags, RT markers, and newline characters
- Clean up n-grams by removing punctuation, numbers, and stopwords, and by lowercasing the text
- Add custom stopword filters to get more relevant results
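
The first suggestion can be approached with a few regular expressions applied to each tweet before tokenizing. The following is a minimal sketch, not the notebook's own method: `clean_tweet` and the four patterns are hypothetical names and deliberately simple approximations (for example, the URL pattern only catches `http(s)://` and `www.` links).

```python
import re

# Hypothetical, illustrative patterns; real tweets may need stricter rules.
RT_RE = re.compile(r'^\s*RT\s+@\w+:?\s*', flags=re.IGNORECASE)  # leading "RT @user:"
URL_RE = re.compile(r'https?://\S+|www\.\S+')
MENTION_RE = re.compile(r'@\w+')
HASHTAG_RE = re.compile(r'#\w+')


def clean_tweet(text):
    """Strip the RT prefix, URLs, mentions, hashtags and newlines from one tweet."""
    text = RT_RE.sub('', text)
    text = URL_RE.sub(' ', text)
    text = MENTION_RE.sub(' ', text)
    text = HASHTAG_RE.sub(' ', text)
    # Collapse newlines and repeated whitespace into single spaces
    return re.sub(r'\s+', ' ', text).strip()


sample = "RT @codyfight: Codyfight is a place where Humans and #AI compete\nhttps://t.co/abc123"
print(clean_tweet(sample))
```

Assuming the `tweets` DataFrame that this notebook loads, `tweets['extended_text'].fillna('').map(clean_tweet)` would apply the helper to the whole column before tokenization.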

In [1]:
#import nltk
#nltk.download('popular', halt_on_error=False)
#nltk.download('all', halt_on_error=False)

In [2]:
import nltk
import nltk.corpus
from nltk.text import Text
import pandas as pd
import re
import sys

In [3]:
print(sys.version)

3.11.4 (main, Jul  5 2023, 09:00:44) [Clang 14.0.6 ]


### Load Data

In [4]:
url = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/tweets_ai_ml_nlp.json'
tweets = pd.read_json(url, orient='records', lines=True)

print(f'Dataset records: {tweets.shape[0]}, Dataset columns: {tweets.shape[1]}')

Dataset records: 100043, Dataset columns: 7


In [5]:
tweets.head(5)

| | id | lang | date | name | text | extended_text | quoted_text |
|---|---|---|---|---|---|---|---|
| 0 | 1529094548005064705 | en | 2022-05-24 | odol☘️ | RT @Frank4NC: CodyFight is a must watch and mu... | CodyFight is a must watch and must EARN! Get r... | Codyfight is a place where Humans and #AI comp... |
| 1 | 1529094585942568960 | en | 2022-05-24 | Paijo s'Bejo | RT @Bakercrypt0: Wonderful day to everybody! ✨... | Wonderful day to everybody! ✨🫶\n\nThe trailer ... | Codyfight is a place where Humans and #AI comp... |
| 2 | 1529094709771051013 | en | 2022-05-24 | 🍀Ging🍀6️⃣5️⃣🎹 | RT @Frank4NC: CodyFight is a must watch and mu... | CodyFight is a must watch and must EARN! Get r... | Codyfight is a place where Humans and #AI comp... |
| 3 | 1529094719120510976 | en | 2022-05-24 | Ultra mild🗯💫 | RT @codyfight: Codyfight is a place where Huma... | Codyfight is a place where Humans and #AI comp... | |
| 4 | 1529094845393907712 | en | 2022-05-24 | Ohayou🌼 | RT @ninasimonic: Wonderful day to everybody! ✨... | Wonderful day to everybody! ✨🫶\n\nTheir traile... | Codyfight is a place where Humans and #AI comp... |


#### Use TweetTokenizer to tokenize Tweets

In [6]:
%%time

top_N = 100

tweet_text = tweets['extended_text'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')


# Use TweetTokenizer
tweet_tokenizer = nltk.tokenize.TweetTokenizer()
words = tweet_tokenizer.tokenize(tweet_text)

# English stopwords as a set for fast membership tests
stopwords = set(nltk.corpus.stopwords.words('english'))

# Remove single-character tokens (mostly punctuation)
words = [word for word in words if len(word) > 1]

# Remove numbers
words = [word for word in words if not word.isnumeric()]

# Remove punctuation
# words = [word for word in words if word.isalpha()]

# Lowercase all words (the NLTK stopword list is lowercase too)
words = [word.lower() for word in words]

# Remove stopwords
words = [word for word in words if word not in stopwords]

bgs = nltk.bigrams(words)

# Compute the frequency distribution of all bigrams in the text
fdist_2 = nltk.FreqDist(bgs)

fdist_2_df = pd.DataFrame(fdist_2.most_common(),
                    columns=['Word', 'Frequency'])

fdist_2_df.head(n=20)

CPU times: user 27.7 s, sys: 260 ms, total: 28 s
Wall time: 28 s


| | Word | Frequency |
|---|---|---|
| 0 | (#programming, #coding) | 7715 |
| 1 | (#javascript, #reactjs) | 7422 |
| 2 | (#coding, #100daysofcode) | 7088 |
| 3 | (#iot, #iiot) | 7002 |
| 4 | (#bigdata, #analytics) | 6132 |
| 5 | (#ai, #machinelearning) | 5430 |
| 6 | (#python, #rstats) | 4600 |
| 7 | (#datascience, #ai) | 4221 |
| 8 | (#analytics, #datascience) | 3924 |
| 9 | (#tensorflow, #javascript) | 3875 |


In [6]:
top_N = 100

tweet_text = tweets['extended_text'].str.lower().str.replace(r'\|', ' ', regex=True).str.cat(sep=' ')

# Use TweetTokenizer
tweet_tokenizer = nltk.tokenize.TweetTokenizer()
words = tweet_tokenizer.tokenize(tweet_text)

# Set of stopwords
stopwords = set(nltk.corpus.stopwords.words('english'))

# Remove single-character tokens (mostly punctuation)
words = [word for word in words if len(word) > 1]

# Remove numbers
words = [word for word in words if not word.isnumeric()]

# Lowercase all words (the NLTK stopword list is lowercase too)
words = [word.lower() for word in words]

# Remove hashtags
words = [word for word in words if not word.startswith('#')]

# Remove stopwords
words = [word for word in words if word not in stopwords]

bgs = nltk.bigrams(words)

# Compute frequency distribution for all the bigrams in the text
fdist_2 = nltk.FreqDist(bgs)

fdist_2_df = pd.DataFrame(fdist_2.most_common(), columns=['Word', 'Frequency'])

fdist_2_df.head(n=20)

| | Word | Frequency |
|---|---|---|
| 0 | (machine, learning) | 3279 |
| 1 | (artificial, intelligence) | 3069 |
| 2 | (data, science) | 2033 |
| 3 | (free, middlemen) | 1783 |
| 4 | (job, sign) | 1603 |
| 5 | (need, job) | 1546 |
| 6 | (covid, insights) | 1200 |
| 7 | (analytics, team) | 1172 |
| 8 | (insights, analytics) | 1170 |
| 9 | (deep, learning) | 1045 |
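
The exercise also asks for trigrams, and the bigram output above still contains noise pairs such as (need, job) and (job, sign). Below is a minimal sketch of one way to address both, under stated assumptions: it uses only the standard library rather than NLTK (`nltk.trigrams` or `nltk.ngrams` could replace the hypothetical `ngrams` helper), builds n-grams per tweet so they never span two tweets, and filters a hypothetical `custom_stopwords` set. The token lists are toy data, not taken from the dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return an iterator over consecutive n-token tuples."""
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(token_lists, n, extra_stopwords=frozenset()):
    """Count n-grams tweet by tweet, skipping any extra stopwords."""
    counts = Counter()
    for tokens in token_lists:
        kept = [t for t in tokens if t not in extra_stopwords]
        counts.update(ngrams(kept, n))
    return counts


# Toy, hand-written token lists standing in for tokenized tweets (not real data)
toy_tweets = [
    ['natural', 'language', 'processing', 'rocks'],
    ['learn', 'natural', 'language', 'processing'],
    ['need', 'job', 'sign', 'free'],
]

# Hypothetical noise terms, chosen by inspecting the bigram table
custom_stopwords = {'need', 'job', 'sign', 'free'}

bigram_counts = count_ngrams(toy_tweets, 2, custom_stopwords)
trigram_counts = count_ngrams(toy_tweets, 3, custom_stopwords)
print(trigram_counts.most_common(1))
```

Counting per tweet trades a little speed for cleaner results, since concatenating all tweets into one string lets the last word of one tweet pair with the first word of the next.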


In [7]:
import datetime
import pytz

datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")

'Fri, 12 August 2022 16:13:33'