# Data Engineering II Final Project

### 3.1 The Similarity Search
The students will use word embedding models to run similarity searches: the embedding of the search string is compared with the embeddings of the available tweets, using whichever distance measure the students choose (Euclidean distance, for example), and the 20 most similar tweets are returned.
The students are free to use whichever word embedding model they prefer, such as FastText, Doc2Vec, or Word2Vec.
**Note:** remember to handle all cleaning and pre-processing of the text.
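
The retrieval step described above can be sketched independently of any particular embedding model. The snippet below is a minimal illustration, not the project's implementation: the helper names (`euclidean`, `cosine_distance`, `top_k`) and the two-dimensional toy vectors are assumptions made for the example, and real embeddings would come from whichever model is chosen.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def top_k(query_vec, doc_vecs, k=20, distance=euclidean):
    """Return (index, distance) pairs for the k documents closest to the query."""
    scored = [(i, distance(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Hypothetical toy "embeddings" standing in for tweet vectors
toy_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.05]

print(top_k(query, toy_vecs, k=2))
print(top_k(query, toy_vecs, k=2, distance=cosine_distance))
```

Euclidean distance is sensitive to vector length, while cosine distance only compares direction, so the two can rank long and short tweets differently.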


In [3]:
import pandas as pd

In [104]:
df = pd.read_csv("../data/tweets.csv")
df

Unnamed: 0.1,Unnamed: 0,date,id,link,retweet,text,author
0,0,Oct 7,784609194234306560,/realDonaldTrump/status/784609194234306560,False,Here is my statement.pic.twitter.com/WAZiGoQqMQ,DonaldTrump
1,1,Oct 10,785608815962099712,/realDonaldTrump/status/785608815962099712,False,Is this really America? Terrible!pic.twitter.c...,DonaldTrump
2,2,Oct 8,784840992734064640,/realDonaldTrump/status/784840992734064641,False,The media and establishment want me out of the...,DonaldTrump
3,3,Oct 8,784767399442653184,/realDonaldTrump/status/784767399442653184,False,Certainly has been an interesting 24 hours!,DonaldTrump
4,4,Oct 10,785561269571026944,/realDonaldTrump/status/785561269571026946,False,Debate polls look great - thank you!\n#MAGA #A...,DonaldTrump
...,...,...,...,...,...,...,...
17211,17211,12 May 2009,1773561338,/realDonaldTrump/status/1773561338,False,"""My persona will never be that of a wallflower...",DonaldTrump
17212,17212,8 May 2009,1741160716,/realDonaldTrump/status/1741160716,False,New Blog Post: Celebrity Apprentice Finale and...,DonaldTrump
17213,17213,8 May 2009,1737479987,/realDonaldTrump/status/1737479987,False,Donald Trump reads Top Ten Financial Tips on L...,DonaldTrump
17214,17214,4 May 2009,1701461182,/realDonaldTrump/status/1701461182,False,Donald Trump will be appearing on The View tom...,DonaldTrump


In [5]:
df_clear = df.drop(columns=["Unnamed: 0","date","id","retweet","author"])
df_clear

Unnamed: 0,link,text
0,/realDonaldTrump/status/784609194234306560,Here is my statement.pic.twitter.com/WAZiGoQqMQ
1,/realDonaldTrump/status/785608815962099712,Is this really America? Terrible!pic.twitter.c...
2,/realDonaldTrump/status/784840992734064641,The media and establishment want me out of the...
3,/realDonaldTrump/status/784767399442653184,Certainly has been an interesting 24 hours!
4,/realDonaldTrump/status/785561269571026946,Debate polls look great - thank you!\n#MAGA #A...
...,...,...
17211,/realDonaldTrump/status/1773561338,"""My persona will never be that of a wallflower..."
17212,/realDonaldTrump/status/1741160716,New Blog Post: Celebrity Apprentice Finale and...
17213,/realDonaldTrump/status/1737479987,Donald Trump reads Top Ten Financial Tips on L...
17214,/realDonaldTrump/status/1701461182,Donald Trump will be appearing on The View tom...


In [6]:
df_clear = df_clear.dropna()
df_clear

Unnamed: 0,link,text
0,/realDonaldTrump/status/784609194234306560,Here is my statement.pic.twitter.com/WAZiGoQqMQ
1,/realDonaldTrump/status/785608815962099712,Is this really America? Terrible!pic.twitter.c...
2,/realDonaldTrump/status/784840992734064641,The media and establishment want me out of the...
3,/realDonaldTrump/status/784767399442653184,Certainly has been an interesting 24 hours!
4,/realDonaldTrump/status/785561269571026946,Debate polls look great - thank you!\n#MAGA #A...
...,...,...
17211,/realDonaldTrump/status/1773561338,"""My persona will never be that of a wallflower..."
17212,/realDonaldTrump/status/1741160716,New Blog Post: Celebrity Apprentice Finale and...
17213,/realDonaldTrump/status/1737479987,Donald Trump reads Top Ten Financial Tips on L...
17214,/realDonaldTrump/status/1701461182,Donald Trump will be appearing on The View tom...


In [15]:
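import re
import string

def clean(data):
    """Lower-case a tweet and strip digits, ASCII punctuation and link residue."""
    data_clean = re.sub(r"\d+", "", data)                # drop digits
    data_clean = re.sub(r"\n", " ", data_clean)          # newlines -> spaces
    data_clean = data_clean.lower()
    # Delete ASCII punctuation (non-ASCII marks such as ’ and … survive)
    data_clean = data_clean.translate(str.maketrans("", "", string.punctuation))
    data_clean = data_clean.strip()
    data_clean = re.sub("pictwitter", " ", data_clean)   # residue of pic.twitter.com links
    data_clean = re.sub("\xa0", " ", data_clean)         # non-breaking spaces
    # NOTE: ':', '/', '@' and "'" were already deleted above and the text is lower-case,
    # so the URL, t.co, retweet, '@' and quote patterns below cannot match at this point.
    # They would have to run before the punctuation step to take effect.
    data_clean = re.sub(r"(?P<url>https?://\S+)", "", data_clean)
    data_clean = re.sub("http", "", data_clean)
    data_clean = re.sub(r"//t\.co.+", "", data_clean)
    data_clean = re.sub(r"^RT @.+:", "", data_clean)     # retweet prefix
    data_clean = re.sub("@", "", data_clean)
    data_clean = re.sub(r"\s+", " ", data_clean)         # collapse runs of whitespace
    data_clean = re.sub(r"'", "", data_clean)
    return data_clean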
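
Because the punctuation is stripped before the URL patterns run, link fragments such as `comwazigoqqmq` remain in the cleaned text. One possible alternative, sketched here under assumptions, is to remove links while their punctuation is still intact. The function name `clean_urls_first` and the exact patterns in `URL_RE` are hypothetical and would need tuning against the real data.

```python
import re
import string

# Assumed link shapes: full http(s) URLs and pic.twitter.com links,
# which in this data set are often fused to the preceding word.
URL_RE = re.compile(r"https?://\S+|pic\.twitter\.com/\S+")


def clean_urls_first(text):
    """Variant of the cleaning step that removes links before punctuation."""
    text = URL_RE.sub(" ", text)                  # links go first, punctuation intact
    text = re.sub(r"^RT @\w+:\s*", "", text)      # leading retweet marker, if any
    text = text.lower()
    text = re.sub(r"\d+", "", text)               # digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


print(clean_urls_first("Here is my statement.pic.twitter.com/WAZiGoQqMQ"))
print(clean_urls_first("Debate polls look great - thank you!\n#MAGA https://t.co/abc123"))
```

Like the original function, this sketch only deletes ASCII punctuation, so characters such as `’` and `…` would still need separate handling.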



In [16]:
df_clear2 = [clean(x) for x in df_clear['text']]

In [17]:
df_clear2

['here is my statement comwazigoqqmq',
 'is this really america terrible comwiwcpifu',
 'the media and establishment want me out of the race so badly i will never drop out of the race will never let my supporters down maga',
 'certainly has been an interesting hours',
 'debate polls look great thank you maga americafirst compeqsswdz',
 'what they are saying about the clinton campaign’s anticatholic bigotry bitlydcbtvkcrooked',
 'thank you maga americafirst comfgwjlkm',
 'i will be in cincinnati ohio tomorrow night at pm join me ohiovotesearly votetrumppence tickets swwwdonaldjtrumpcomscheduleregistercincinnatioh … comxufugcfg',
 'very little pickup by the dishonest media of incredible information provided by wikileaks so dishonest rigged system',
 'thank you florida a movement that has never been seen before and will never be seen again lets get out votetrumppence on maga comeaalvo',
 'the very foul mouthed sen john mccain begged for my support during his primary i gave he won then dro

#### Tokenization

In [18]:
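import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Requires the NLTK stop-word list and tokenizer data (fetch once with nltk.download)
stop_words = set(stopwords.words('english'))


def tokenize(data):
    """Split a cleaned tweet into tokens and drop English stop words."""
    tokens = word_tokenize(data)
    return [tok for tok in tokens if tok not in stop_words]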

df_clear3 = [tokenize(x) for x in df_clear2]
df_clear3

[['statement', 'comwazigoqqmq'],
 ['really', 'america', 'terrible', 'comwiwcpifu'],
 ['media',
  'establishment',
  'want',
  'race',
  'badly',
  'never',
  'drop',
  'race',
  'never',
  'let',
  'supporters',
  'maga'],
 ['certainly', 'interesting', 'hours'],
 ['debate',
  'polls',
  'look',
  'great',
  'thank',
  'maga',
  'americafirst',
  'compeqsswdz'],
 ['saying',
  'clinton',
  'campaign',
  '’',
  'anticatholic',
  'bigotry',
  'bitlydcbtvkcrooked'],
 ['thank', 'maga', 'americafirst', 'comfgwjlkm'],
 ['cincinnati',
  'ohio',
  'tomorrow',
  'night',
  'pm',
  'join',
  'ohiovotesearly',
  'votetrumppence',
  'tickets',
  'swwwdonaldjtrumpcomscheduleregistercincinnatioh',
  '…',
  'comxufugcfg'],
 ['little',
  'pickup',
  'dishonest',
  'media',
  'incredible',
  'information',
  'provided',
  'wikileaks',
  'dishonest',
  'rigged',
  'system'],
 ['thank',
  'florida',
  'movement',
  'never',
  'seen',
  'never',
  'seen',
  'lets',
  'get',
  'votetrumppence',
  'maga',
  '

{'statement': <gensim.models.keyedvectors.Vocab at 0x1a80a0a72b0>,
 'really': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7370>,
 'america': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7340>,
 'terrible': <gensim.models.keyedvectors.Vocab at 0x1a80a0a73a0>,
 'media': <gensim.models.keyedvectors.Vocab at 0x1a80a0a74c0>,
 'establishment': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7550>,
 'want': <gensim.models.keyedvectors.Vocab at 0x1a80a0a74f0>,
 'race': <gensim.models.keyedvectors.Vocab at 0x1a80a0a73d0>,
 'badly': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7640>,
 'never': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7250>,
 'drop': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7670>,
 'let': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7790>,
 'supporters': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7820>,
 'maga': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7700>,
 'certainly': <gensim.models.keyedvectors.Vocab at 0x1a80a0a7730>,
 'interesting': <gensim.models.key

[('follow', 0.999701738357544),
 ('robert', 0.9996898770332336),
 ('still', 0.9996864795684814),
 ('cont', 0.9996798634529114),
 ('charity', 0.9996767044067383),
 ('beyond', 0.9996752142906189),
 ('buy', 0.9996666312217712),
 ('credit', 0.9996649026870728),
 ('call', 0.999664843082428),
 ('financial', 0.9996641874313354)]

In [59]:
def listToString(data):
    data_clean = ' '.join([str(elem) for elem in data])
    return data_clean

df_clear4 = [listToString(x) for x in df_clear3]
df_clear4    

['statement comwazigoqqmq',
 'really america terrible comwiwcpifu',
 'media establishment want race badly never drop race never let supporters maga',
 'certainly interesting hours',
 'debate polls look great thank maga americafirst compeqsswdz',
 'saying clinton campaign ’ anticatholic bigotry bitlydcbtvkcrooked',
 'thank maga americafirst comfgwjlkm',
 'cincinnati ohio tomorrow night pm join ohiovotesearly votetrumppence tickets swwwdonaldjtrumpcomscheduleregistercincinnatioh … comxufugcfg',
 'little pickup dishonest media incredible information provided wikileaks dishonest rigged system',
 'thank florida movement never seen never seen lets get votetrumppence maga comeaalvo',
 'foul mouthed sen john mccain begged support primary gave dropped locker room remarks',
 'disloyal rs far difficult crooked hillary come sides ’ know win teach',
 'exception cheating bernie nom dems always proven far loyal republicans',
 'nice shackles taken fight america way want',
 'weak ineffective leader pau


### Doc2Vec

In [66]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Tag every cleaned tweet with its position in df_clear4 so it can be looked up later
tagged_data = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)])
               for i, doc in enumerate(df_clear4)]

max_epochs = 100
vec_size = 20
alpha = 0.025

model = Doc2Vec(vector_size=vec_size,
                alpha=alpha,
                min_alpha=0.00025,
                min_count=1,
                dm=1)  # dm=1 selects the distributed-memory (PV-DM) variant

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    print('iteration {0}'.format(epoch))
    # model.epochs replaces the deprecated model.iter attribute
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs)
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

model.save("d2v.model")
print("Model Saved")

iteration 0
iteration 1
iteration 2
iteration 3
iteration 4
iteration 5
iteration 6
iteration 7
iteration 8
iteration 9
Model Saved
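
The manual loop lowers `model.alpha` by a fixed step after every pass, so the number of passes and the step size have to be chosen together. The arithmetic can be checked without gensim. The helper `alpha_schedule` below is a hypothetical illustration that mirrors the values used in the training cell (start 0.025, step 0.0002); it is not part of the gensim API.

```python
def alpha_schedule(start=0.025, step=0.0002, iterations=100):
    """Learning rate in effect during each pass of the manual training loop."""
    return [start - step * k for k in range(iterations)]


rates = alpha_schedule()
print("first pass:", rates[0])
print("last pass: ", rates[-1])
print("all positive:", all(r > 0 for r in rates))

# With the same step, a much longer run would push the rate below zero
longer = alpha_schedule(iterations=200)
print("negative rate in 200 passes:", any(r < 0 for r in longer))
```

With 100 passes the rate falls from 0.025 to roughly 0.0052 and stays positive. Beyond about 125 passes the same step would drive it to zero or below, so the step should shrink if `max_epochs` grows.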


In [122]:
def output_sentences(most_similar, data):
    """Print the most, second-most, median and least similar documents in `data`."""
    for label, index in [('MOST', 0), ('SECOND-MOST', 1), ('MEDIAN', len(most_similar)//2), ('LEAST', len(most_similar) - 1)]:
        print('%s %s: %s\n' % (label, most_similar[index][1], data[int(most_similar[index][0])]))


model = Doc2Vec.load("d2v.model")
# Infer a vector for a query that is not in the training data
test_data = word_tokenize("make america great again".lower())
v1 = model.infer_vector(test_data)

# Retrieve the topn most similar training documents
# (model.docvecs is named model.dv in gensim 4 and later)
topn = 20
similar_doc = model.docvecs.most_similar([v1], topn=topn)

result = []
score = []
for tag, sim in similar_doc:
    # Tags are positions in df_clear4, which was built from df_clear,
    # so look the tweet up by position rather than by the index label of df
    result.append(df_clear["text"].iloc[int(tag)])
    score.append(sim)
print(result)
print(score)

['One of the worst and most boring political pundits on television is @krauthammer. A totally overrated clown who speaks without knowing facts', '“Watch what people are cynical about, and one can often discover what they lack.” -     \nGeneral George S. Patton', '"Without focus, it\'s just impossible to be successful at anything." --Midas Touch', 'Ignorance is inexcusable; it’s the surest way to fail. No acceptable reason exists for not being well informed.', '"Without passion, you don’t have energy, and without energy, you have nothing!" Just one more of my totally brilliant quotes - use it well.', "Success requires 100% of your focus and 100% of your effort. Don't sell yourself short.", 'Being successful requires nothing less than 100% of your concentrated effort. Be totally focused.', '#2. Be totally focused. Being successful requires nothing less than 100% of your concentrated effort.', 'They asked me to dress as Santa Claus to open Miss Universe tonight—I’m thinking about it!', '"

In [None]:
# Pair each retrieved tweet with its similarity score
result = []
for n in range(topn):
    res = (df_clear["text"].iloc[int(similar_doc[n][0])], similar_doc[n][1])
    result.append(res)