
Online forums have long been a great resource for sharing information and other work. More recently, forums such as Reddit have democratized the question of what is interesting by letting the masses rank a post with upvotes, downvotes, or something similar.

But what makes a post interesting? Is there a formula to creating a "front page" post? What do these posts have in common?

Today, we will focus on Hacker News, a forum geared toward tech, science, and professional discussion of interesting topics from around the internet. It is also a place where users can "Show HN" the projects they are working on or "Ask HN" a question they would like help answering. All of these posts are subject to the same democratization seen on Reddit, with one caveat: not everyone can downvote.

This analysis leverages pandas, NumPy, and scikit-learn to assist in our discovery.

In [None]:
import pandas as pd
import sklearn as sk
import numpy as np
import nltk
import re

from sklearn.feature_extraction.text import CountVectorizer

from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
dforig = pd.read_csv('/content/drive/MyDrive/HN_2012_Values.csv')

For this post I found an existing dataset that logged every post along with its poster, post type, upvotes, and comments over a period of time. For the sake of processing time, I narrowed the analysis to posts from 2012, which gives us about 311,000 posts to work with.

In [None]:
print(len(dforig.index))

311107


I also manipulated the data a bit to make it easier to work with. The first thing I did was create the RootURL column. This gets rid of the string after the domain, which lets us work at the domain level instead of sifting through every individual article. I also split the date and time into separate columns.
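The RootURL step itself isn't shown in this notebook. Here is a minimal sketch of how such a column could be derived with the standard library, assuming the raw data holds full URLs; the function name `root_url` and the sample URLs are illustrative and not taken from the dataset:

```python
from urllib.parse import urlparse

def root_url(url):
    """Return only the domain part of a URL, dropping path, query and fragment."""
    # urlparse only fills netloc when a scheme (or a leading //) is present,
    # so prepend // to bare domains such as "example.com/page".
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    netloc = urlparse(url).netloc.lower()
    # Treat "www.example.com" and "example.com" as the same site.
    return netloc[4:] if netloc.startswith("www.") else netloc

# Hypothetical sample URLs
print(root_url("https://www.example.com/blog/2012/01/post.html?x=1"))  # example.com
print(root_url("github.com/user/repo"))                                 # github.com
```

With pandas this could be applied along the lines of `dforig['URL'].map(root_url)`, where `'URL'` is an assumed column name.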

Also, since memory is limited, I want to focus on posts that gained at least some traction, so I will keep only posts that scored above 100 points.

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when columns are reassigned later
dforig = dforig[dforig['Points'] > 100].copy()
print(len(dforig.index))

7106


Make all the text the same case to avoid complications.

In [None]:
dforig['Title'] = dforig['Title'].str.lower()
dforig['Author'] = dforig['Author'].str.lower()
dforig = dforig.dropna(subset=['Title'])

I want to remove the stop words so that only substantial words are compared. Credit: [This Guy](https://gist.github.com/sebleier/554280#gistcomment-2596130)

In [None]:
corpus = dforig['Title'].tolist()

corpus

['best papers in computer science up to 2011',
 'tech’s relationship with depression, suicide and asperger’s',
 'avoid apress',
 'turning off google search results indirection',
 'there\'s no shame in code that is simply "good enough"',
 'ask hn: who is hiring? (january 2012)',
 'ask hn: freelancer? seeking freelancer? (january 2012)',
 'how to date a supermodel (or get dealflow or find cofounders)',
 'open-source dropbox alternative powered by git',
 'show hn: scrollorama',
 'how i wrote and self-published a book: step by step',
 'why 13th chords',
 "occupy portland's dec 3rd tactic to neutralize police",
 'deca - a systems language based on modern pl principles',
 'impress.js - a prezi like implementation using css3 3d transformations',
 'code year',
 "the way people copy each other's linguistic style reveals their pecking order",
 'car-sharing service higear shuts down due to theft of 4 cars worth $400,000',
 'reducing code nesting ',
 "galaxy nexus power analysis: why chargers can'

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

http://jonathansoma.com/lede/foundations/classes/text%20processing/tf-idf/

In [None]:
#def boring_tokenizer(str_input):
#    words = re.sub(r"[^A-Za-z]", " ", str_input).split()
#    return words


#count_vectorizer = CountVectorizer(stop_words='english', tokenizer=boring_tokenizer)
#title_list2 = count_vectorizer.fit_transform(title_list)
#print(count_vectorizer.get_feature_names())

#title_list = count_vectorizer.fit_transform(title_list)


There are quite a few words across these thousands of posts, so I want to condense them as much as possible. I chose to use a stemmer based on the above article's suggestion; however, after some more research I settled on the Snowball stemmer since, as far as I can tell, it's a little better at stemming words without chopping too much off. https://www.nltk.org/howto/stem.html

The regex is in place to leave out numbers. I noticed that there were dozens of numerical strings that were nearly meaningless out of context. On top of that, I am also using CountVectorizer's min_df parameter, which the documentation defines as follows:

"When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None."

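To make the two forms of min_df concrete, here is a small sketch on a made-up corpus of four titles (the titles are illustrative, not from the dataset). An integer is an absolute number of documents, while a float is a share of all documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy titles, made up for illustration
titles = [
    "google releases new api",
    "google maps api update",
    "apple sues google",
    "show hn my weekend project",
]

# Integer min_df: keep terms that appear in at least 2 titles
by_count = CountVectorizer(min_df=2).fit(titles)
# Float min_df: keep terms that appear in at least 50% of titles (0.5 * 4 = 2)
by_share = CountVectorizer(min_df=0.5).fit(titles)

print(sorted(by_count.vocabulary_))  # ['api', 'google']
print(sorted(by_share.vocabulary_))  # ['api', 'google']
```

By the same arithmetic, min_df=0.005 on the 7,106 titles used here works out to 35.53, so a term has to appear in at least 36 titles to stay in the vocabulary.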

In [None]:
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")

def stemming_tokenizer(str_input):
    words = re.sub(r"[^a-zA-Z]{2,}", " ", str_input).lower().split()
    words = [stemmer.stem(word) for word in words]
    return words

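# Stemming variant, left disabled; the active vectorizer below uses the default tokenizer
#count_vectorizer = CountVectorizer(stop_words='english', tokenizer=stemming_tokenizer, min_df=0.005)
count_vectorizer = CountVectorizer(stop_words='english', min_df=0.005)
corpus2 = count_vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(count_vectorizer.get_feature_names_out().tolist())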

['000', '10', '2012', 'amazon', 'android', 'api', 'app', 'apple', 'apps', 'ask', 'best', 'better', 'billion', 'book', 'bootstrap', 'build', 'building', 'chrome', 'code', 'com', 'company', 'computer', 'css', 'data', 'day', 'design', 'developer', 'don', 'email', 'facebook', 'founder', 'free', 'game', 'github', 'google', 'guide', 'hacker', 'hn', 'html5', 'internet', 'introducing', 'ios', 'iphone', 'javascript', 'js', 'just', 'know', 'language', 'learn', 'learning', 'life', 'like', 'linux', 'mac', 'make', 'man', 'maps', 'microsoft', 'million', 'new', 'news', 'old', 'online', 'open', 'os', 'page', 'patent', 'people', 'programmer', 'programming', 'project', 'python', 'real', 'reddit', 'released', 'ruby', 'says', 'search', 'software', 'sopa', 'source', 'startup', 'stop', 'story', 'support', 'things', 'time', 'twitter', 'use', 'users', 'using', 'video', 'want', 'way', 'web', 'windows', 'work', 'world', 'wrong', 'yc', 'year', 'years']


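Here you can see the resulting vocabulary. Note that the stemming tokenizer is commented out in the cell above, so the words appear unstemmed and numeric tokens such as '000', '10', and '2012' are still present.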

In [None]:
df = pd.DataFrame(corpus2.toarray(), columns=count_vectorizer.get_feature_names_out())
df

Unnamed: 0,000,10,2012,amazon,android,api,app,apple,apps,ask,best,better,billion,book,bootstrap,build,building,chrome,code,com,company,computer,css,data,day,design,developer,don,email,facebook,founder,free,game,github,google,guide,hacker,hn,html5,internet,...,online,open,os,page,patent,people,programmer,programming,project,python,real,reddit,released,ruby,says,search,software,sopa,source,startup,stop,story,support,things,time,twitter,use,users,using,video,want,way,web,windows,work,world,wrong,yc,year,years
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7101,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7102,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7103,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
7104,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [None]:
pd.set_option('display.max_rows', 134)
df_sum = df.sum(axis=0)

print(df_sum)

df_sum.to_csv('termfrequency.csv')

000             60
10              60
2012            81
amazon          57
android         56
api             48
app            115
apple          161
apps            39
ask             97
best            56
better          40
billion         40
book            44
bootstrap       45
build           44
building        39
chrome          37
code           106
com             48
company         44
computer        49
css             38
data            86
day             52
design          54
developer       36
don             79
email           37
facebook       143
founder         39
free            95
game            76
github          85
google         298
guide           38
hacker          67
hn             329
html5           41
internet        69
introducing     37
ios             61
iphone          51
javascript      84
js             108
just            60
know            45
language        42
learn           49
learning        49
life            51
like            70
linux       

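Next, I compare every title with every other title using cosine similarity. It measures the angle between two term-count vectors: a score of 1 means two titles contain the same vocabulary terms in the same proportions, and a score of 0 means they share none.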
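For intuition, the cosine similarity of two count vectors is their dot product divided by the product of their lengths. Here is a small sketch in plain Python using made-up three-term vectors (the notebook itself relies on scikit-learn's `cosine_similarity`; the helper `cosine` is only for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A title with no vocabulary terms is an all-zero vector; call that 0 similarity.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up vectors over a vocabulary of three terms, e.g. [best, computer, google]
title_a = [1, 1, 0]
title_b = [0, 1, 0]
title_c = [0, 0, 1]

print(round(cosine(title_a, title_a), 4))  # 1.0
print(round(cosine(title_a, title_b), 4))  # 0.7071
print(round(cosine(title_a, title_c), 4))  # 0.0
```

Values such as 0.707107 and 0.5 in the matrix output are consistent with this arithmetic: they are what you get when two titles with only one or two vocabulary terms each share a single term. An all-zero row also gives 0, which would explain why row 1 shows 0.0 rather than 1.0 on its diagonal.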

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

df2 = pd.DataFrame(cosine_similarity(df, dense_output=True))
df2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,7066,7067,7068,7069,7070,7071,7072,7073,7074,7075,7076,7077,7078,7079,7080,7081,7082,7083,7084,7085,7086,7087,7088,7089,7090,7091,7092,7093,7094,7095,7096,7097,7098,7099,7100,7101,7102,7103,7104,7105
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.408248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
l=[]

# Part 01:
for j,k in enumerate(df2.values):
    for n in range(len(k)):
        l.append([j,n,k[n]])

# Part 02:
qq=[]
for i in range(len(l)):
    if l[i][0]==l[i][1]:
        qq.append([l[i][0],l[i][1],0])
    else:
        qq.append(l[i])
qq[:5]

[[0, 0, 0], [0, 1, 0.0], [0, 2, 0.0], [0, 3, 0.0], [0, 4, 0.0]]

In [None]:
from collections import defaultdict
u=defaultdict(list)

# Part 01:

for i in range(len(qq)):
    u[qq[i][0]].append(qq[i][2])
    
updated_df=pd.DataFrame(u)

# updated_df.max(axis=1)
# max(updated_df[0])
# np.argmax(updated_df[3])
# updated_df[3]

# Part 02:

position_maxVal=[]
for i in range(len(updated_df)):
    position_maxVal.append(np.argmax(updated_df[i]))
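
The loops in the two cells above build tens of millions of Python list entries (7,106 × 7,106). Here is a possible vectorized alternative, sketched on a small made-up matrix; `sim` stands in for `df2.values`, and the other variable names are illustrative:

```python
import numpy as np

# Small symmetric stand-in for the cosine similarity matrix (df2.values)
sim = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.7],
    [0.0, 0.7, 1.0],
])

masked = sim.copy()
# Zero the diagonal so a post is never matched with itself
np.fill_diagonal(masked, 0)

best_match = masked.argmax(axis=1)  # position of the most similar other post, per row
best_score = masked.max(axis=1)     # the corresponding similarity value

print(best_match.tolist())  # [1, 2, 1]
print(best_score.tolist())  # [0.5, 0.7, 0.7]
```

As in the loop version, a row with no similar posts returns position 0, so it is worth filtering on the score afterwards.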

In [None]:
updated_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,7066,7067,7068,7069,7070,7071,7072,7073,7074,7075,7076,7077,7078,7079,7080,7081,7082,7083,7084,7085,7086,7087,7088,7089,7090,7091,7092,7093,7094,7095,7096,7097,7098,7099,7100,7101,7102,7103,7104,7105
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.408248,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.5,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.00000,1.0,0.0,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.707107,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.408248,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.00000,0.0,0.0,0.000000,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.57735,0.0,0.0,0.000000,0.57735,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
sent_comp = []

# For each post, look up the title at the position of its highest similarity score
for j in position_maxVal:
    sent_comp.append(corpus[j])

# Most similar post per row, as a DataFrame
sim_posts = pd.DataFrame(sent_comp, columns=['Top Sim'])

# Highest similarity value per row, rounded to 4 decimal places
sim_value = updated_df.max(axis=1).round(4).to_frame('Cosine Sim')
print(sent_comp)



While a number of posts have a cosine similarity greater than zero, it appears that there might not be a magic formula beyond, perhaps, topic selection. Many of the "similar" posts are fundamentally different despite having very similar subject matter. Still, we can certainly leverage this code to surface subject matter that may interest the reader. For example, if the reader liked the post "Best Papers in Computer Science up to 2011", we can reasonably assume they will also like the post "Best papers from 27 top-tier computer science conferences".

In [None]:

titles_df=pd.DataFrame(corpus,columns=['Titles'])

cos_sim_df=pd.concat([titles_df,sim_posts,sim_value],axis=1)

cos_sim_df = cos_sim_df[cos_sim_df['Cosine Sim'] > 0]

cos_sim_df.head(50)

NameError: ignored

Finding similar posts is great in some use cases, but maybe I don't have a post I like, just a topic. Say I need some help writing this post, so I want to find good articles on "Python web mining". I'll start by working through TF-IDF. TF stands for term frequency, which is exactly what it sounds like: how often a term shows up. IDF stands for inverse document frequency, which gives more weight to words that appear in fewer documents. So, for example, if I hadn't already removed stop words, it would make words like "the" and "and" (which probably appear in most documents) weigh less. scikit-learn makes this easy for us; see below.
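
To make the weighting concrete, here is a small sketch of the IDF part in plain Python. It uses the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term. The toy titles and the helper `idf` are made up for illustration:

```python
import math

# Toy titles, made up for illustration
titles = [
    "the python web framework",
    "the web is changing",
    "the best python book",
    "the rise of data mining",
]

docs = [set(t.split()) for t in titles]
n_docs = len(docs)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["the", "python", "mining"]:
    print(term, round(idf(term), 3))
# the 1.0
# python 1.511
# mining 1.916
```

The word that appears in every title gets the lowest weight, and the rarest word gets the highest. With default settings, TfidfVectorizer then multiplies these weights by the term counts and scales each row to unit length.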

In [None]:
#tfidf_vectorizer = TfidfVectorizer()

#tfidf_vectorizer = TfidfVectorizer(stop_words='english', tokenizer=stemming_tokenizer, min_df=0.005)
tfidf_vectorizer = TfidfVectorizer(stop_words='english', use_idf=True)
corpus2 = tfidf_vectorizer.fit_transform(corpus)


# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
df_tfidf = pd.DataFrame(corpus2.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df_tfidf

Unnamed: 0,00,000,000x,001,002x,00am,04,06,072,080,0s,0x,0x10c,10,100,1000,1000s,100k,100m,100th,101,1024768px,1080,109,10gen,10k,10m,10x,11,110,1110,113,119,12,120,125th,128,129,12bit,12m,...,zach,zap,zappos,zara,ze,zealand,zealanders,zed,zeitgeist,zelda,zen,zencoder,zeolite,zepto,zerg,zero,zerobin,zerocater,zeromq,zerorpc,zerovm,zfs,zigfu,zimbabwe,zip,zipcar,zipcode,ziptastic,zlio,zone,zones,zoom,zsh,zte,zuck,zuckerberg,zurb,zx,zxcvbn,zynga
0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7101,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7102,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7103,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7104,0.0,0.30889,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.433896,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# One column per query term, plus their sum as a simple relevance score
search_df = pd.DataFrame([df_tfidf['python'], df_tfidf['web'], df_tfidf['mining'], df_tfidf['python'] + df_tfidf['web'] + df_tfidf['mining']], index=["Python", "Web", "Mining", "Python + Web + Mining"]).T
# Keep only titles that mention "mining"; this also guarantees a positive total score
search_df = search_df[search_df['Mining'] > 0]

# Rank the remaining titles by total score, highest first
test = search_df.sort_values('Python + Web + Mining', ascending=False)
test




Unnamed: 0,Python,Web,Mining,Python + Web + Mining
5726,0.320637,0.311474,0.467453,1.099565
5778,0.0,0.282949,0.424643,0.707592
2540,0.0,0.250381,0.375766,0.626148
4527,0.0,0.0,0.526513,0.526513
5038,0.0,0.0,0.486433,0.486433
6896,0.0,0.0,0.476674,0.476674
938,0.0,0.0,0.356405,0.356405
6732,0.0,0.0,0.356064,0.356064
2502,0.0,0.0,0.354372,0.354372
2395,0.0,0.0,0.304431,0.304431


In [None]:
corpus[5726]

'pattern - web mining python lib'
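
Summing three columns works for a fixed query, but each new query means editing the code. One possible generalization, sketched here on a toy corpus, is to transform the query string with the fitted vectorizer and rank titles by cosine similarity. The helper name `search` and the toy titles are illustrative; only the first title echoes the top hit above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy titles, made up for illustration (the first one echoes the notebook's top hit)
titles = [
    "pattern - web mining python lib",
    "bitcoin mining hardware explained",
    "a python web framework in 100 lines",
    "why i switched to linux",
]

vectorizer = TfidfVectorizer(stop_words="english")
title_vectors = vectorizer.fit_transform(titles)

def search(query, top_n=3):
    """Return (score, title) pairs for the titles most similar to the query."""
    query_vector = vectorizer.transform([query])
    scores = cosine_similarity(query_vector, title_vectors).ravel()
    ranked = scores.argsort()[::-1][:top_n]
    # Drop titles that share no terms with the query
    return [(round(float(scores[i]), 3), titles[i]) for i in ranked if scores[i] > 0]

for score, title in search("python web mining"):
    print(score, title)
```

In the notebook, the same idea would use `tfidf_vectorizer` and `corpus` in place of the toy objects. Unlike the column sum, this ranking does not require every result to contain "mining".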