# Table of Contents
 1. About the Dataset
 2. Regex for Cleaning Text Data
 3. Regex for Text Data Extraction
 4. Regex Challenge
 5. spaCy Analysis

  5.1.1 Word Frequency

  5.1.2 Dependency Tags

  5.1.3 Matcher

  5.1.4 TF-IDF


In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [6]:
path="/content/drive/My Drive/Colab Notebooks/trinity/data_trinity/tweets.csv"

## 1. About the Dataset

In [7]:
import pandas as pd
#Loading the dataset
df = pd.read_csv(path, encoding = "ISO-8859-1")

# Printing first 5 rows
df.head()


Unnamed: 0.1,Unnamed: 0,X,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted
0,1,1,RT @rssurjewala: Critical question: Was PayTM ...,False,0,,2016-11-23 18:40:30,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",HASHTAGFARZIWAL,331,True,False
1,2,2,RT @Hemant_80: Did you vote on #Demonetization...,False,0,,2016-11-23 18:40:29,False,,8.014957e+17,,"<a href=""http://twitter.com/download/android"" ...",PRAMODKAUSHIK9,66,True,False
2,3,3,"RT @roshankar: Former FinSec, RBI Dy Governor,...",False,0,,2016-11-23 18:40:03,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",rahulja13034944,12,True,False
3,4,4,RT @ANI_news: Gurugram (Haryana): Post office ...,False,0,,2016-11-23 18:39:59,False,,8.014955e+17,,"<a href=""http://twitter.com/download/android"" ...",deeptiyvd,338,True,False
4,5,5,RT @satishacharya: Reddy Wedding! @mail_today ...,False,0,,2016-11-23 18:39:39,False,,8.014954e+17,,"<a href=""http://cpimharyana.com"" rel=""nofollow...",CPIMBadli,120,True,False


In [None]:
# Keep an independent copy; plain assignment would only create an alias
backup=df.copy()

# Looking at some Tweets
for index, tweet in enumerate(df["text"][10:15]):
    print(index+1,".",tweet)

1 . Many opposition leaders are with @narendramodi on the #Demonetization 
And respect their decision,but support opposition just b'coz of party
2 . RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
3 . @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders.
4 . RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
5 . RT @sona2905: When I explained #Demonetization to myself and tried to put it down in my words which are not laced with any heavy technicalÂ…


## 2. Regex for Cleaning Text Data

In [None]:
import re

### a. Removing `RT`

In [None]:
# Removing RT from a single Tweet
text = "RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r"
clean_text = re.sub('RT ','', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r
Text after:
 @Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r


In [None]:
df=backup.copy()
# Tweets before removal
df['text'].head()

0    RT @rssurjewala: Critical question: Was PayTM ...
1    RT @Hemant_80: Did you vote on #Demonetization...
2    RT @roshankar: Former FinSec, RBI Dy Governor,...
3    RT @ANI_news: Gurugram (Haryana): Post office ...
4    RT @satishacharya: Reddy Wedding! @mail_today ...
Name: text, dtype: object

In [None]:
# Removing RT tweet by tweet; df.loc avoids chained assignment
for cont, tweet in enumerate(df["text"]):
    df.loc[cont, "text"] = re.sub('RT ','',tweet)


In [None]:
df['text'].head()

0    @rssurjewala: Critical question: Was PayTM inf...
1    @Hemant_80: Did you vote on #Demonetization on...
2    @roshankar: Former FinSec, RBI Dy Governor, CB...
3    @ANI_news: Gurugram (Haryana): Post office emp...
4    @satishacharya: Reddy Wedding! @mail_today car...
Name: text, dtype: object

In [None]:
df=backup.copy()
# Removing RT from all the tweets
df['text']=df['text'].apply(lambda x: re.sub('RT ','',x))

In [None]:
# Tweets after removal
df['text'].head()

0    @rssurjewala: Critical question: Was PayTM inf...
1    @Hemant_80: Did you vote on #Demonetization on...
2    @roshankar: Former FinSec, RBI Dy Governor, CB...
3    @ANI_news: Gurugram (Haryana): Post office emp...
4    @satishacharya: Reddy Wedding! @mail_today car...
Name: text, dtype: object
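
The pattern `'RT '` matches anywhere in a tweet, so it can also remove characters from the middle of ordinary text. A minimal sketch of a stricter variant follows, assuming the retweet marker only ever appears at the very start of a tweet; the helper name `strip_retweet_prefix` and the second sample string are illustrative, not part of the dataset.

```python
import re

# Anchor the pattern at the start of the string so that only a leading
# retweet marker is removed (assumption: "RT" is always the first token).
RT_PREFIX = re.compile(r'^RT\s+')

def strip_retweet_prefix(tweet):
    """Remove a leading 'RT ' marker and leave the rest of the tweet untouched."""
    return RT_PREFIX.sub('', tweet)

samples = [
    "RT @Joydas: Question in Narendra Modi App",
    "Please START now",  # made-up text containing 'RT ' inside a word
]
cleaned = [strip_retweet_prefix(s) for s in samples]
print(cleaned)
```

If this behaviour is what you want, the helper can be applied to the whole column with `df['text'].apply(strip_retweet_prefix)`.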

### b. Removing `<U+..>` like symbols

In [None]:
# Removing <U+..> like symbols from a single tweet
text = "@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders"
clean_text = re.sub(r'<U\+[A-Z0-9]+>','', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 @Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>Those who  are protesting #demonetization  are all different party leaders
Text after:
 @Jaggesh2 Bharat band on 28??<ed><ed>Those who  are protesting #demonetization  are all different party leaders


**Note** that although we have removed most of the symbols, `<ed>` is still present. Removing it is left as an exercise for you to try out.

In [None]:
# Removing <U+..> like symbols from all the tweets
df['text']=df['text'].apply(lambda x: re.sub(r'<U\+[A-Z0-9]+>', '', x))
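
If you want to check your answer to the `<ed>` exercise, here is one possible approach: a single alternation that matches either marker. It is only a sketch and assumes the escapes always use uppercase hexadecimal digits, as they do in the sample tweet; the name `DEBRIS` is made up for this example.

```python
import re

# Match either the literal <ed> marker or a <U+XXXX> escape
# (assumption: escapes use uppercase hex digits only).
DEBRIS = re.compile(r'<(?:ed|U\+[0-9A-F]+)>')

text = ("@Jaggesh2 Bharat band on 28??<ed><U+00A0><U+00BD><ed><U+00B8><U+0082>"
        "Those who  are protesting #demonetization  are all different party leaders")
clean_text = DEBRIS.sub('', text)
print(clean_text)
```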

### c. Fixing the `&` and `&amp;`

In [None]:
# Replacing &amp; with & in a single tweet
text = "RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney"
clean_text = re.sub('&amp;','&', text)

print("Text before:\n", text)
print("Text after:\n", clean_text)

Text before:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response &amp; Commitment in fight against Blackmoney
Text after:
 RT @harshkkapoor: #DeMonetization survey results after 24 hours 5Lacs opinions Amazing response & Commitment in fight against Blackmoney


In [None]:
# Replacing &amp; with & in all the tweets (the pattern includes the trailing semicolon)
df['text']=df['text'].apply(lambda x: re.sub('&amp;', '&', x))
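
`&amp;` is only one of several HTML entities that can show up in scraped text. As an alternative to writing one regex per entity, the standard library's `html.unescape` decodes them in one pass. The sketch below uses a shortened version of the tweet above plus a made-up second string to show other entities.

```python
import html

# Shortened version of the sample tweet
text = "Amazing response &amp; Commitment in fight against Blackmoney"
clean_text = html.unescape(text)
print(clean_text)

# Made-up string showing that other entities are decoded as well
other = html.unescape("&lt;b&gt; &quot;cash&quot; &amp; cards")
print(other)
```

For the whole column this would be `df['text'].apply(html.unescape)`.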

## 3. Regex for Text Data Extraction


### a. Extracting platform type of tweets

In [None]:
# Getting number of tweets per platform type
platform_count = df["statusSource"].value_counts()

In [None]:
print(platform_count)
type(platform_count)


<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>      7642
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                        2548
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>        2093
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>        492
<a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>                       263
                                                                                          ... 
<a href="http://pnllg.com" rel="nofollow">PNLLG </a>                                         1
<a href="http://www.toi.in" rel="nofollow">cmssocialservice</a>                              1
<a href="http://sites.google.com/site/yorufukurou/" rel="nofollow">YoruFukurou</a>           1
<a href="https://panel.socialpilot.co/" rel="nofollow">SocialPilot.co</a>                    1
<a href="https://twitter.com/download/android" rel

pandas.core.series.Series

In [None]:
#List platforms that have more than 100 tweets
top_platforms = platform_count.loc[platform_count>100]
top_platforms

<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>    7642
<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>                      2548
<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>      2093
<a href="https://about.twitter.com/products/tweetdeck" rel="nofollow">TweetDeck</a>      492
<a href="https://mobile.twitter.com" rel="nofollow">Twitter Lite</a>                     263
<a href="https://mobile.twitter.com" rel="nofollow">Mobile Web (M5)</a>                  178
<a href="http://www.facebook.com/twitter" rel="nofollow">Facebook</a>                    167
<a href="http://twitter.com/#!/download/ipad" rel="nofollow">Twitter for iPad</a>        165
<a href="http://www.twitter.com" rel="nofollow">Twitter for Windows Phone</a>            139
<a href="http://onlywire.com/" rel="nofollow">OnlyWire / Official App</a>                136
<a href="http://www.twitter.com" rel="nofollow">Twitter for Windows</a

In [None]:
def platform_type(x):
    ser = re.search( r"android|iphone|web|windows|mobile|google|facebook|ipad|tweetdeck|onlywire", x, re.IGNORECASE)
    if ser:
        return ser.group()
    else:
        return None

#keep only the platform strings (the index of the counts series)
top_platforms = pd.Series(top_platforms.index, name="index")

#extract platform types
top_platforms.apply(platform_type)

0       android
1           Web
2        iphone
3     tweetdeck
4        mobile
5        mobile
6      facebook
7          ipad
8       Windows
9      onlywire
10      Windows
11       mobile
12       google
Name: index, dtype: object
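
The keyword search above returns whichever keyword appears first in the HTML, which is why several different clients collapse into `mobile`. A complementary sketch, assuming every `statusSource` value is a single `<a ...>...</a>` tag, captures the visible client name instead. The names `ANCHOR_TEXT` and `client_name` are hypothetical, and the third sample is made up to show the fallback.

```python
import re

# Capture the text between the opening <a ...> tag and the closing </a>
ANCHOR_TEXT = re.compile(r'<a [^>]*>([^<]*)</a>')

def client_name(source):
    """Return the client name shown inside the anchor tag, or None."""
    match = ANCHOR_TEXT.search(source)
    return match.group(1).strip() if match else None

sources = [
    '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>',
    '<a href="http://pnllg.com" rel="nofollow">PNLLG </a>',
    'not an anchor tag',
]
names = [client_name(s) for s in sources]
print(names)
```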

### b. Extracting hashtags from the tweets

In [None]:
# Extract first hashtag from a tweet
text = "RT @Atheist_Krishna: The effect of #Demonetization !!\r\n. https://t.co/A8of7zh2f5"
hashtag = re.search(r'#\w+', text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtag.group())

Tweet:
 RT @Atheist_Krishna: The effect of #Demonetization !!
. https://t.co/A8of7zh2f5
Hashtag:
 #Demonetization


In [None]:
# Extract multiple hashtags from a tweet
text = """RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo"""
hashtags = re.findall(r'#\w+', text)

print("Tweet:\n", text)
print("Hashtag:\n", hashtags)

Tweet:
 RT @kapil_kausik: #Doltiwal I mean #JaiChandKejriwal is "hurt" by #Demonetization as the same has rendered USELESS <ed><U+00A0><U+00BD><ed><U+00B1><U+0089> "acquired funds" No wo
Hashtag:
 ['#Doltiwal', '#JaiChandKejriwal', '#Demonetization']


In [None]:
df['hashtags']=df['text'].apply(lambda x: re.findall(r'#\w+', x))

In [None]:
df[['text','hashtags']].head()

Unnamed: 0,text,hashtags
0,@rssurjewala: Critical question: Was PayTM inf...,[#Demonetization]
1,@Hemant_80: Did you vote on #Demonetization on...,[#Demonetization]
2,"@roshankar: Former FinSec, RBI Dy Governor, CB...",[#Demonetization]
3,@ANI_news: Gurugram (Haryana): Post office emp...,[#demonetization]
4,@satishacharya: Reddy Wedding! @mail_today car...,"[#demonetization, #ReddyWedding]"
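
The extracted hashtags differ only in case in several rows (`#Demonetization` vs `#demonetization`). If you want to count how often each hashtag occurs, it may help to lowercase them first. The sketch below does this with `collections.Counter` on three shortened, illustrative tweets rather than on the full DataFrame.

```python
import re
from collections import Counter

# Shortened sample tweets, for illustration only
tweets = [
    "Did you vote on #Demonetization on Modi survey app?",
    "Post office employees provide cash exchange #demonetization",
    "Reddy Wedding! cartoon #demonetization #ReddyWedding",
]

# Lowercase each hashtag so that case variants are counted together
hashtag_counts = Counter(
    tag.lower()
    for tweet in tweets
    for tag in re.findall(r'#\w+', tweet)
)
print(hashtag_counts.most_common(2))
```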


## 4. Regex Challenge

Now that you have learned the core regex concepts and seen them in action, it's time to solve a challenge on your own. Here are the tasks:




### a. Removing URLs from tweets

**Difficulty - Easy**

A tweet's `text` may contain one or more URLs, and they don't necessarily provide useful information, so we can get rid of them. For example:

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*


We can very well remove the URL as it isn't providing much useful information.

In [1]:
# Your Code Here
#text = '@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r'
# The hyphen goes last in the character class so it is a literal, not a range
#urls = re.findall(r'https?://[A-Za-z0-9./-]+', text)
#urls
#df['text']=df['text'].apply(lambda x: re.sub(r'https?://[A-Za-z0-9./-]+','',x))

### b. Extract Top 100 mentions

**Difficulty - Medium**

Many of the tweets mention people in the form *@username*. For example, see the following tweet:

*@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r*

Here *@Joydas* is a mention. You need to extract mentions from all the tweets and find which are the top 100 usernames.

In [2]:
# Your Code Here
import re

text='@Joydas: Question in Narendra Modi App where PM is taking feedback if people support his #DeMonetization strategy https://t.co/pYgK8Rmg7r'
mentions=re.findall(r'@\w+', text)
mentions

In [None]:
# Extracting the mentions from every tweet
all_mentions=df['text'].apply(lambda x: re.findall(r'@\w+', x))
all_mentions.head()

In [None]:
# Flattening the per-tweet lists into a single list of mentions
mentions_arr=[]

for mentions in all_mentions:
    mentions_arr.extend(mentions)
mentions_arr[:10]

In [None]:
# Getting top 100 mentions
mentions_count=pd.Series(mentions_arr).value_counts().head(100)
mentions_count

## 5. spaCy Analysis




## Word Frequency

In [8]:
import spacy

In [9]:
# Loading model
nlp=spacy.load('en_core_web_sm')

In [11]:
text=df["text"][1]
print(text)

RT @Hemant_80: Did you vote on #Demonetization on Modi survey app?


In [23]:
# Combining tweets into a single string
combined_tweets=' '.join(df.text.values)
print(len(combined_tweets))


2090159


In [24]:
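# Creating a Doc object from a slice of the combined text
# Note: the slice starts at index 1, so the first character ('R' of 'RT') is dropped,
# which is why the first token below is 'T'; use [:100000] to keep it
doc=nlp(combined_tweets[1:100000])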

In [25]:
# Tokens as string
print([token.text for token in doc])

['T', '@rssurjewala', ':', 'Critical', 'question', ':', 'Was', 'PayTM', 'informed', 'about', '#', 'Demonetization', 'edict', 'by', 'PM', '?', 'It', "'s", 'clearly', 'fishy', 'and', 'requires', 'full', 'disclosure', '&', 'amp', ';', '\x85 ', 'RT', '@Hemant_80', ':', 'Did', 'you', 'vote', 'on', '#', 'Demonetization', 'on', 'Modi', 'survey', 'app', '?', 'RT', '@roshankar', ':', 'Former', 'FinSec', ',', 'RBI', 'Dy', 'Governor', ',', 'CBDT', 'Chair', '+', 'Harvard', 'Professor', 'lambaste', '#', 'Demonetization', '.', '\r\n\r\n', 'If', 'not', 'for', 'Aam', 'Aadmi', ',', 'listen', 'to', 'th', '\x85 ', 'RT', '@ANI_news', ':', 'Gurugram', '(', 'Haryana', '):', 'Post', 'office', 'employees', 'provide', 'cash', 'exchange', 'to', 'patients', 'in', 'hospitals', '#', 'demonetization', 'https://t.co/uGMxUP9', '\x85 ', 'RT', '@satishacharya', ':', 'Reddy', 'Wedding', '!', '@mail_today', 'cartoon', '#', 'demonetization', '#', 'ReddyWedding', 'https://t.co/u7gLNrq31F', '@DerekScissors1', ':', 'India\x9

In [None]:
# Lemmatizing the text
[(token.text,token.lemma_) for token in doc]

In [26]:
# Removing stopwords and punctuation
new_tokens=[token for token in doc if not token.is_stop and not token.is_punct]

In [27]:
# Function for generating word frequency
def gen_freq(tokens):

    # Creating a pandas Series with word frequencies:
    # Series.value_counts() returns the count of each unique value,
    # sorted from most to least frequent
    word_freq = pd.Series([token.text for token in tokens]).value_counts()

    # Printing the 20 most frequent words
    print(word_freq[:20])

    return word_freq

In [28]:
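# Getting word frequency
# Note: this counts every token in doc, so stop words and punctuation are included
word_freq=gen_freq(doc)
# Accessing the Series: its index labels, a positional slice, and a single label
word_freq.keys()
word_freq[:3]
word_freq['the']
# Number of distinct tokens
print(word_freq.size)
print(len(word_freq))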

#                 1015
:                  697
RT                 629
Â…                  425
Demonetization     415
.                  375
\r\n               328
on                 262
demonetization     251
is                 238
of                 226
the                225
she                218
and                185
to                 180
in                 164
,                  147
Modi               146
a                  137
\r\n\r\n           136
dtype: int64
1880
1880
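
In the output above, `Demonetization` (415) and `demonetization` (251) are counted as separate words because the counts are case-sensitive. A minimal sketch of merging such variants is shown below. It works on a plain dictionary holding a few counts copied from the output, not on the full Series.

```python
from collections import Counter

# A few counts copied from the output above
word_freq = {"Demonetization": 415, "demonetization": 251, "Modi": 146, "RT": 629}

# Add up the counts of all case variants under the lowercase form
merged = Counter()
for word, count in word_freq.items():
    merged[word.lower()] += count

print(merged.most_common(3))
```

With spaCy tokens, a similar effect could be obtained by counting `token.lower_` instead of `token.text` inside `gen_freq`.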


## Dependency Tags

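In [None]:
from IPython.display import HTML

# Getting dependency tags for a short example sentence
text = "I will wear a white shirt on Monday."
doc = nlp(text)
for token in doc:
    print(token.text,'=>',token.dep_)

# Rendering tokens and tags as a table (stands in for the undefined rep_sentence helper)
table = pd.DataFrame([[token.text for token in doc], [token.dep_ for token in doc]])
HTML(table.to_html(index=False))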

I => nsubj
will => aux
wear => ROOT
a => det
white => amod
shirt => dobj
on => prep
Monday => pobj
. => punct


0,1,2,3,4,5,6,7,8
I,will,wear,a,white,shirt,on,Monday,.
nsubj,aux,ROOT,det,amod,dobj,prep,pobj,punct


In [None]:
# Importing visualizer
from spacy import displacy

In [None]:
# Visualizing dependency tree
displacy.render(doc,jupyter=True)

In [None]:
# Getting head word (parent)
for token in doc:
    print(token.text,'=>',token.head.text)

I => wear
will => wear
wear => wear
a => shirt
white => shirt
shirt => wear
on => wear
Monday => on
. => wear


In [None]:
# Getting immediate children (token.children is a generator, so printing it shows only the generator object)
for token in doc:
    print(token.text,'=>',token.children)

I => <generator object at 0x7f7d25a12370>
will => <generator object at 0x7f7d25a12370>
wear => <generator object at 0x7f7d25a12370>
a => <generator object at 0x7f7d25a12370>
white => <generator object at 0x7f7d25a12370>
shirt => <generator object at 0x7f7d25a12370>
on => <generator object at 0x7f7d25a12370>
Monday => <generator object at 0x7f7d25a12370>
. => <generator object at 0x7f7d25a12370>


In [None]:
# Getting immediate children
for token in doc:
    print(token.text,'=>',[child.text for child in token.children])

I => []
will => []
wear => ['I', 'will', 'shirt', 'on', '.']
a => []
white => []
shirt => ['a', 'white']
on => ['Monday']
Monday => []
. => []
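
To make the relationship between `token.head` and `token.children` concrete, the sketch below rebuilds the children lists from the head-word output shown earlier, using plain Python and no spaCy. It keys tokens by their text, which only works here because every word occurs once in the sentence; with a real `Doc` you would key by `token.i` instead.

```python
from collections import defaultdict

# (token, head) pairs copied from the head-word output above
heads = [
    ("I", "wear"), ("will", "wear"), ("wear", "wear"),
    ("a", "shirt"), ("white", "shirt"), ("shirt", "wear"),
    ("on", "wear"), ("Monday", "on"), (".", "wear"),
]

# A token's children are all tokens that name it as their head
children = defaultdict(list)
for token, head in heads:
    if token != head:  # the root is its own head, so skip it
        children[head].append(token)

for token, _ in heads:
    print(token, '=>', children.get(token, []))
```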


In [None]:
# Getting left and right children
for token in doc:
    print(token.text,'=>',token.lefts,'=>',token.rights)

type(token.lefts)

I => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
will => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
wear => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
a => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
white => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
shirt => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
on => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
Monday => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>
. => <generator object at 0x7f7d25a120f0> => <generator object at 0x7f7d25a12910>


generator

In [None]:
# Getting left children
for token in doc:
    print(token.text,'=>',token.n_lefts,'=>',[left for left in token.lefts])

I => 0 => []
will => 0 => []
wear => 2 => [I, will]
a => 0 => []
white => 0 => []
shirt => 2 => [a, white]
on => 0 => []
Monday => 0 => []
. => 0 => []


In [None]:
# Getting right children
for token in doc:
    print(token.text,'=>',token.n_rights,'=>',[right for right in token.rights])

## Matcher

In [40]:
import spacy

# Import the Matcher
from spacy.matcher import Matcher

# Load a pipeline and create the nlp object
nlp = spacy.load("en_core_web_sm")

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_PATTERN", [pattern])

# Process some text
doc = nlp("Upcoming iPhone X release date leaked")

# Call the matcher on the doc
matches = matcher(doc)

In [41]:
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)

In [45]:
pattern = [
    {"IS_DIGIT": True},
    {"LOWER": "fifa"},
    {"LOWER": "world"},
    {"LOWER": "cup"},
    {"IS_PUNCT": True}
]
matcher.add("FIFA_PATTERN", [pattern])

In [48]:
doc = nlp("2018 FIFA World Cup: France won!")

In [57]:
matches = matcher(doc)
print (type(matches))
#for match_id, start, end in matches:
#    # Get the matched span
#    matched_span = doc[start:end]
#    print(matched_span.text)
len(matches)

<class 'list'>


2

In [58]:
pattern = [
    {"LEMMA": "love", "POS": "VERB"},
    {"POS": "NOUN"}
]
matcher.add("LOVE_NOUN_PATTERN", [pattern])

In [63]:
doc = nlp("I loved dogs but now I love cats more.")
matches = matcher(doc)
print (type(matches))
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
len(matches)

<class 'list'>
loved dogs
love cats


2

## TF-IDF


In [None]:
!pip install spacy
# Download the same model that is loaded in the next cell
!python -m spacy download en_core_web_sm
import spacy


In [None]:
nlp = spacy.load("en_core_web_sm")
def spacy_tokenizer(document):
    tokens = nlp(document)
    # Keep lemmas of tokens that are neither stop words nor punctuation
    tokens = [token.lemma_ for token in tokens if (
        not token.is_stop and
        not token.is_punct and
        token.lemma_.strip() != '')]
    return tokens

In [None]:
example_corpus = [
    "Monsters are bad.",
    "I saw a monster yesterday.",
    "Why are we talking about bad monsters?"]


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the spaCy-based tokenizer instead of scikit-learn's default one
tfidf_vector = TfidfVectorizer(input = 'content', tokenizer = spacy_tokenizer)
result = tfidf_vector.fit_transform(example_corpus)

In [None]:
# get_feature_names() was removed in recent scikit-learn releases
tfidf_vector.get_feature_names_out().tolist()

['bad', 'monster', 'see', 'talk', 'yesterday']

In [None]:
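import pandas as pd

# Convert the sparse TF-IDF matrix to a dense array and label the columns
# Note: this rebinds df, replacing the tweets DataFrame loaded earlier
df = pd.DataFrame(
    result.toarray(), columns=tfidf_vector.get_feature_names_out())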

In [None]:
df


Unnamed: 0,bad,monster,see,talk,yesterday
0,0.789807,0.613356,0.0,0.0,0.0
1,0.0,0.385372,0.652491,0.0,0.652491
2,0.547832,0.425441,0.0,0.720333,0.0
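
To see where these numbers come from, the sketch below recomputes the table by hand with the standard library only. It assumes scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row), and it takes the lemmatized tokens as given instead of running spaCy.

```python
import math
from collections import Counter

# Lemmatized tokens of the three example documents
# (stop words and punctuation already removed)
docs = [
    ["monster", "bad"],
    ["see", "monster", "yesterday"],
    ["talk", "bad", "monster"],
]

n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

tfidf = []
for doc in docs:
    tf = Counter(doc)
    row = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in row))  # L2 norm of the row
    tfidf.append([round(v / norm, 6) for v in row])

print(vocab)
for row in tfidf:
    print(row)
```

Because `monster` appears in all three documents, its IDF is the minimum value of 1, so in each row it scores lower than rarer terms such as `bad`, `see`, or `talk`.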


In [None]:
mean_df = df.mean()
print(mean_df)

bad          0.445880
monster      0.474723
see          0.217497
talk         0.240111
yesterday    0.217497
dtype: float64
