## Case Study: Preparing Lobste.rs

In [91]:
import pandas as pd
import requests
from fuzzywuzzy import fuzz
from collections import Counter
from sklearn import preprocessing

### If you'd rather read from the API to get the latest stories, uncomment the details (and comment out the final line)

In [92]:
stories = pd.read_json(r'C:\Users\risha\Documents\KRMU\AIML_assigment\datasets\all_lobsters.json')
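If you do want live data, here is a minimal sketch of what the API route could look like. It assumes that `https://lobste.rs/hottest.json` returns a JSON list of story objects, each with a `short_id` field; the helper names `stories_from_records` and `fetch_stories` are made up for this illustration, and the two sample records are hand-written, not real API output.

```python
import pandas as pd

# Assumed endpoint: the site's front page served as JSON.
HOTTEST_URL = "https://lobste.rs/hottest.json"


def stories_from_records(records):
    """Build a DataFrame indexed by short_id from a list of story dicts."""
    return pd.DataFrame(records).set_index("short_id")


def fetch_stories(url=HOTTEST_URL):
    """Download the current stories (needs network access and requests)."""
    import requests  # imported lazily so the parsing helper works offline

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return stories_from_records(response.json())


# Offline demonstration with two hand-written records.
sample = [
    {"short_id": "aaa111", "title": "First story", "score": 3, "tags": ["crypto", "pdf"]},
    {"short_id": "bbb222", "title": "Second story", "score": -1, "tags": ["law"]},
]
demo = stories_from_records(sample)
print(demo[["title", "score"]])
```

Keeping the parsing step separate from the download makes it easy to try the rest of the notebook on a saved response.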

In [93]:
stories.head()

Unnamed: 0,comment_count,comments_url,created_at,description,downvotes,last_updated,score,short_id_url,submitter_user,tags,title,upvotes,url
09zw7r,0,https://lobste.rs/s/09zw7r/edited_truth,2017-08-08 20:11:09,,0,2017-08-09T11:03:57.014269,3,https://lobste.rs/s/09zw7r,{'avatar_url': 'https://lobste.rs/avatars/trn-...,"[crypto, pdf]",The Edited Truth,3,https://eprint.iacr.org/2017/714.pdf
0bdne7,17,https://lobste.rs/s/0bdne7/rise_social_media_v...,2017-08-08 21:12:38,,9,2017-08-09T11:03:57.014269,-1,https://lobste.rs/s/0bdne7,{'avatar_url': 'https://lobste.rs/avatars/nkhu...,"[law, privacy]",The Rise of The Social Media Vigilante,8,https://medium.com/@nkhumphreys_89452/the-rise...
1bhbod,11,https://lobste.rs/s/1bhbod/tcl_misunderstood_a...,2017-04-30 20:28:52,<p>Did any language end up taking that “highly...,0,2017-05-01T06:29:11.725518,17,https://lobste.rs/s/1bhbod,"{'is_moderator': False, 'is_admin': False, 'us...",[programming],Tcl the misunderstood - antirez,17,http://antirez.com/articoli/tclmisunderstood.html
1xkje1,0,https://lobste.rs/s/1xkje1/interview_4_jonatha...,2017-05-01 02:31:35,<p>Rust’s own Jonathan Turner on his backgroun...,0,2017-05-01T06:29:11.725518,1,https://lobste.rs/s/1xkje1,"{'is_moderator': False, 'is_admin': False, 'us...","[audio, javascript, rust]",🎤🎙 Interview 4 – Jonathan Turner: Part 1/3,1,http://www.newrustacean.com/show_notes/intervi...
2dasvh,19,https://lobste.rs/s/2dasvh/return_hipster_pda,2017-08-08 14:25:29,,0,2017-08-09T11:03:56.287654,20,https://lobste.rs/s/2dasvh,{'created_at': '2017-01-19T14:56:50.000-06:00'...,[practices],The Return of the Hipster PDA,20,http://www.agilesysadmin.net/return-of-the-hip...


In [94]:
stories.dtypes

comment_count              int64
comments_url              object
created_at        datetime64[ns]
description               object
downvotes                  int64
last_updated              object
score                      int64
short_id_url              object
submitter_user            object
tags                      object
title                     object
upvotes                    int64
url                       object
dtype: object

### Let's take a look at the submitter_user field, which appears to hold a dict per row

In [95]:
stories.submitter_user.iloc[3]

{'is_moderator': False,
 'is_admin': False,
 'username': 'chriskrycho',
 'karma': 27,
 'avatar_url': 'https://secure.gravatar.com/avatar/c096ed07142659408dc6651f8320acd3?r=pg&d=identicon&s=100',
 'created_at': '2016-08-15T09:33:28.000-05:00',
 'about': "I'm a husband and father; a theologian, composer, poet, and essayist; a front end developer at [Olo](http://www.olo.com); a [Rust](https://www.rust-lang.org/en-US/) enthusiast host; and the host of the [Winning Slowly](http://www.winningslowly.org), [New Rustacean](http://www.newrustacean.com/), [Sap.py](http://www.sap-py.com), and [Run With Me](http://runwith.chriskrycho.com/) podcasts."}

In [96]:
user_df = stories['submitter_user'].apply(pd.Series)

In [97]:
user_df.head()

Unnamed: 0,avatar_url,created_at,is_admin,username,karma,is_moderator,about,github_username
09zw7r,https://lobste.rs/avatars/trn-100.png,2017-01-19T14:56:50.000-06:00,False,trn,429,False,,
0bdne7,https://lobste.rs/avatars/nkhumphreys-100.png,2014-07-02T06:36:39.000-05:00,False,nkhumphreys,-1,False,Web developer and previously embedded C developer,
1bhbod,https://secure.gravatar.com/avatar/85002353297...,2016-11-30T10:14:24.000-06:00,False,yumaikas,578,False,I blog infrequently at https://junglecoder.com...,
1xkje1,https://secure.gravatar.com/avatar/c096ed07142...,2016-08-15T09:33:28.000-05:00,False,chriskrycho,27,False,"I'm a husband and father; a theologian, compos...",
2dasvh,https://lobste.rs/avatars/trn-100.png,2017-01-19T14:56:50.000-06:00,False,trn,429,False,,
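`apply(pd.Series)` works, but it builds one Series per row, which can get slow on larger frames. A possible alternative, sketched here on a toy frame with made-up rows, is to hand the list of dicts straight to the `DataFrame` constructor and reuse the original index. Keys that are missing from a record come out as NaN, just as `github_username` does above.

```python
import pandas as pd

# Toy frame: one dict per row, with a key missing from the first record.
toy = pd.DataFrame(
    {
        "title": ["A", "B"],
        "submitter_user": [
            {"username": "trn", "karma": 429},
            {"username": "yumaikas", "karma": 578, "about": "I blog"},
        ],
    },
    index=["s1", "s2"],
)

# Build the user frame from the list of dicts and keep the story index,
# so it can later be concatenated column-wise with the stories.
users = pd.DataFrame(toy["submitter_user"].tolist(), index=toy.index)
print(users)
```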


### Can we combine the user data without potential column overlap?

In [98]:
set(user_df.columns).intersection(stories.columns)

{'created_at'}

In [99]:
user_df = user_df.rename(columns={'created_at': 
                                  'user_created_at'})

In [100]:
stories = pd.concat([stories.drop(['submitter_user'], axis=1), 
                     user_df], axis=1)

In [101]:
stories.head()

Unnamed: 0,comment_count,comments_url,created_at,description,downvotes,last_updated,score,short_id_url,tags,title,upvotes,url,avatar_url,user_created_at,is_admin,username,karma,is_moderator,about,github_username
09zw7r,0,https://lobste.rs/s/09zw7r/edited_truth,2017-08-08 20:11:09,,0,2017-08-09T11:03:57.014269,3,https://lobste.rs/s/09zw7r,"[crypto, pdf]",The Edited Truth,3,https://eprint.iacr.org/2017/714.pdf,https://lobste.rs/avatars/trn-100.png,2017-01-19T14:56:50.000-06:00,False,trn,429,False,,
0bdne7,17,https://lobste.rs/s/0bdne7/rise_social_media_v...,2017-08-08 21:12:38,,9,2017-08-09T11:03:57.014269,-1,https://lobste.rs/s/0bdne7,"[law, privacy]",The Rise of The Social Media Vigilante,8,https://medium.com/@nkhumphreys_89452/the-rise...,https://lobste.rs/avatars/nkhumphreys-100.png,2014-07-02T06:36:39.000-05:00,False,nkhumphreys,-1,False,Web developer and previously embedded C developer,
1bhbod,11,https://lobste.rs/s/1bhbod/tcl_misunderstood_a...,2017-04-30 20:28:52,<p>Did any language end up taking that “highly...,0,2017-05-01T06:29:11.725518,17,https://lobste.rs/s/1bhbod,[programming],Tcl the misunderstood - antirez,17,http://antirez.com/articoli/tclmisunderstood.html,https://secure.gravatar.com/avatar/85002353297...,2016-11-30T10:14:24.000-06:00,False,yumaikas,578,False,I blog infrequently at https://junglecoder.com...,
1xkje1,0,https://lobste.rs/s/1xkje1/interview_4_jonatha...,2017-05-01 02:31:35,<p>Rust’s own Jonathan Turner on his backgroun...,0,2017-05-01T06:29:11.725518,1,https://lobste.rs/s/1xkje1,"[audio, javascript, rust]",🎤🎙 Interview 4 – Jonathan Turner: Part 1/3,1,http://www.newrustacean.com/show_notes/intervi...,https://secure.gravatar.com/avatar/c096ed07142...,2016-08-15T09:33:28.000-05:00,False,chriskrycho,27,False,"I'm a husband and father; a theologian, compos...",
2dasvh,19,https://lobste.rs/s/2dasvh/return_hipster_pda,2017-08-08 14:25:29,,0,2017-08-09T11:03:56.287654,20,https://lobste.rs/s/2dasvh,[practices],The Return of the Hipster PDA,20,http://www.agilesysadmin.net/return-of-the-hip...,https://lobste.rs/avatars/trn-100.png,2017-01-19T14:56:50.000-06:00,False,trn,429,False,,
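The check, rename, and concat steps can be folded into one small routine. This is only a sketch on toy frames (the `user_` prefix matches the rename used above; the data is invented): it renames whatever overlaps rather than hard-coding `created_at`.

```python
import pandas as pd

stories_toy = pd.DataFrame(
    {"created_at": ["2017-08-08"], "title": ["A"]}, index=["s1"]
)
users_toy = pd.DataFrame(
    {"created_at": ["2017-01-19"], "username": ["trn"]}, index=["s1"]
)

# Find column names present in both frames.
overlap = set(stories_toy.columns) & set(users_toy.columns)

# Prefix only the overlapping names on the user side.
users_toy = users_toy.rename(columns={name: f"user_{name}" for name in overlap})

# Column-wise concat; rows are aligned on the shared index.
combined = pd.concat([stories_toy, users_toy], axis=1)
print(list(combined.columns))
```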


### Let's check for nulls

In [102]:
stories.shape

(74, 20)

In [103]:
stories.dropna().shape

(8, 20)

In [104]:
stories.dropna(thresh=10, axis=1).shape

(74, 19)

### Exercise: which columns would be dropped?

In [105]:
set(stories.columns) - set(stories.dropna(thresh=10, axis=1).columns)


{'github_username'}
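The `thresh` argument is easy to misread: it is the minimum number of non-null values a column (with `axis=1`) must have to be kept. A small sketch on invented data shows the rule, and how counting non-nulls first tells you which columns will go.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "full": [1, 2, 3, 4],
        "half": [1.0, np.nan, 3.0, np.nan],
        "sparse": [np.nan, np.nan, np.nan, 4.0],
    }
)

# Non-null count per column: full=4, half=2, sparse=1.
non_null = toy.notna().sum()

# Keep only columns with at least 2 non-null values.
kept = toy.dropna(thresh=2, axis=1)

# Columns that did not meet the threshold.
dropped = toy.columns.difference(kept.columns)
print(non_null.to_dict(), list(dropped))
```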

### Let's make the tags easier to use by turning them into feature columns

In [106]:
tag_df = stories.tags.apply(pd.Series)

In [107]:
tag_df.head()

Unnamed: 0,0,1,2,3,4
09zw7r,crypto,pdf,,,
0bdne7,law,privacy,,,
1bhbod,programming,,,,
1xkje1,audio,javascript,rust,,
2dasvh,practices,,,,


In [108]:
pd.unique(tag_df.values.ravel())

array(['crypto', 'pdf', nan, 'law', 'privacy', 'programming', 'audio',
       'javascript', 'rust', 'practices', 'ruby', 'devops', 'web',
       'hardware', 'science', 'reversing', 'security', 'openbsd',
       'windows', 'design', 'compilers', 'haskell', 'c++', 'assembly',
       'games', 'math', 'release', 'event', 'netbsd', 'unix', 'c',
       'linux', 'testing', 'lua', 'job', 'video', 'philosophy', 'android',
       'networking', 'erlang', 'emacs', 'historical', 'browsers',
       'person', 'culture', 'java', 'go', 'book', 'css', 'debugging',
       'education', 'art', 'compsci', 'databases'], dtype=object)

In [109]:
set(tag_df.values.ravel())

{'android',
 'art',
 'assembly',
 'audio',
 'book',
 'browsers',
 'c',
 'c++',
 'compilers',
 'compsci',
 'crypto',
 'css',
 'culture',
 'databases',
 'debugging',
 'design',
 'devops',
 'education',
 'emacs',
 'erlang',
 'event',
 'games',
 'go',
 'hardware',
 'haskell',
 'historical',
 'java',
 'javascript',
 'job',
 'law',
 'linux',
 'lua',
 'math',
 nan,
 'netbsd',
 'networking',
 'openbsd',
 'pdf',
 'person',
 'philosophy',
 'practices',
 'privacy',
 'programming',
 'release',
 'reversing',
 'ruby',
 'rust',
 'science',
 'security',
 'testing',
 'unix',
 'video',
 'web',
 'windows'}

In [110]:
len(pd.unique(tag_df.values.ravel()))

54

In [111]:
# most common tags

Counter(tag_df.values.ravel()).most_common(5)

[(nan, 231),
 ('programming', 13),
 ('hardware', 10),
 ('security', 10),
 ('practices', 8)]
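The NaN entry tops the list because the wide tag frame pads short rows with NaN. One way to count only real tags, sketched here on an invented list column, is to `explode` the lists and use `value_counts`, which skips missing values by default.

```python
import pandas as pd

# Toy stand-in for stories.tags: one list of tags per story.
tags = pd.Series(
    [["crypto", "pdf"], ["law", "privacy"], ["programming"], ["crypto"]],
    index=["s1", "s2", "s3", "s4"],
)

# One row per (story, tag) pair, then count each tag.
counts = tags.explode().value_counts()
print(counts.head())
```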

### Let's create a dummy df with our tags

In [112]:
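# Note: the trailing .sum() collapses the dummies into one total per tag
# (a Series indexed by tag), not one indicator row per story.
# For per-story indicators, .groupby(level=0).sum() would be used instead.
tag_df = pd.get_dummies(tag_df.apply(pd.Series).stack()).sum()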

In [113]:
tag_df.head()

android     1
art         1
assembly    3
audio       1
book        2
dtype: int64

### Now we can add it back to our stories DataFrame

In [114]:
stories = pd.concat([stories.drop('tags', axis=1), 
                     tag_df], axis=1)

In [115]:
stories.head()

Unnamed: 0,comment_count,comments_url,created_at,description,downvotes,last_updated,score,short_id_url,title,upvotes,url,avatar_url,user_created_at,is_admin,username,karma,is_moderator,about,github_username,0
09zw7r,0.0,https://lobste.rs/s/09zw7r/edited_truth,2017-08-08 20:11:09,,0.0,2017-08-09T11:03:57.014269,3.0,https://lobste.rs/s/09zw7r,The Edited Truth,3.0,https://eprint.iacr.org/2017/714.pdf,https://lobste.rs/avatars/trn-100.png,2017-01-19T14:56:50.000-06:00,False,trn,429.0,False,,,
0bdne7,17.0,https://lobste.rs/s/0bdne7/rise_social_media_v...,2017-08-08 21:12:38,,9.0,2017-08-09T11:03:57.014269,-1.0,https://lobste.rs/s/0bdne7,The Rise of The Social Media Vigilante,8.0,https://medium.com/@nkhumphreys_89452/the-rise...,https://lobste.rs/avatars/nkhumphreys-100.png,2014-07-02T06:36:39.000-05:00,False,nkhumphreys,-1.0,False,Web developer and previously embedded C developer,,
1bhbod,11.0,https://lobste.rs/s/1bhbod/tcl_misunderstood_a...,2017-04-30 20:28:52,<p>Did any language end up taking that “highly...,0.0,2017-05-01T06:29:11.725518,17.0,https://lobste.rs/s/1bhbod,Tcl the misunderstood - antirez,17.0,http://antirez.com/articoli/tclmisunderstood.html,https://secure.gravatar.com/avatar/85002353297...,2016-11-30T10:14:24.000-06:00,False,yumaikas,578.0,False,I blog infrequently at https://junglecoder.com...,,
1xkje1,0.0,https://lobste.rs/s/1xkje1/interview_4_jonatha...,2017-05-01 02:31:35,<p>Rust’s own Jonathan Turner on his backgroun...,0.0,2017-05-01T06:29:11.725518,1.0,https://lobste.rs/s/1xkje1,🎤🎙 Interview 4 – Jonathan Turner: Part 1/3,1.0,http://www.newrustacean.com/show_notes/intervi...,https://secure.gravatar.com/avatar/c096ed07142...,2016-08-15T09:33:28.000-05:00,False,chriskrycho,27.0,False,"I'm a husband and father; a theologian, compos...",,
2dasvh,19.0,https://lobste.rs/s/2dasvh/return_hipster_pda,2017-08-08 14:25:29,,0.0,2017-08-09T11:03:56.287654,20.0,https://lobste.rs/s/2dasvh,The Return of the Hipster PDA,20.0,http://www.agilesysadmin.net/return-of-the-hip...,https://lobste.rs/avatars/trn-100.png,2017-01-19T14:56:50.000-06:00,False,trn,429.0,False,,,
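Notice that the result above has a single extra column named `0` and that the integer columns turned into floats: the tag totals are indexed by tag name, so `concat` appended them as new rows instead of new columns. If the goal is one indicator column per tag for each story, a sketch along these lines (toy data, with `explode` swapped in for the wide tag frame) keeps the story index intact.

```python
import pandas as pd

toy = pd.DataFrame(
    {"title": ["A", "B", "C"], "tags": [["crypto", "pdf"], ["law"], ["crypto"]]},
    index=["s1", "s2", "s3"],
)

# One row per (story, tag) pair; the story id is repeated in the index.
long_tags = toy["tags"].explode()

# One dummy column per tag, then collapse back to one row per story.
tag_flags = pd.get_dummies(long_tags).groupby(level=0).sum()

# Both frames share the story index, so this adds columns, not rows.
result = pd.concat([toy.drop("tags", axis=1), tag_flags], axis=1)
print(result)
```

Summing `tag_flags` over its rows gives the same per-tag totals as before, so nothing is lost by keeping the per-story shape.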


### Another potentially useful feature is the post times...

In [116]:
stories['created_hour'] = stories.created_at.map(
    lambda x: x.hour)

In [117]:
stories['created_dow'] = stories.created_at.map(
    lambda x: x.weekday())
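For datetime columns, the `.dt` accessor gives the same features without a lambda. A quick sketch using two of the `created_at` values shown earlier (Monday is 0, Sunday is 6):

```python
import pandas as pd

created = pd.Series(
    pd.to_datetime(["2017-08-08 20:11:09", "2017-04-30 20:28:52"])
)

# Hour of day (0-23) and day of week (Monday=0 ... Sunday=6).
hours = created.dt.hour
dow = created.dt.dayofweek
print(hours.tolist(), dow.tolist())
```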

### Let's analyze some of the correlations in our features so far...

In [118]:
stories[['created_hour', 'score']].corr()

Unnamed: 0,created_hour,score
created_hour,1.0,0.253917
score,0.253917,1.0


In [119]:
stories[['created_dow', 'score']].corr()

Unnamed: 0,created_dow,score
created_dow,1.0,-0.113918
score,-0.113918,1.0


In [120]:
stories[['karma', 'score']].corr()

Unnamed: 0,karma,score
karma,1.0,-0.061921
score,-0.061921,1.0


In [121]:
stories[['comment_count', 'score']].corr()

Unnamed: 0,comment_count,score
comment_count,1.0,0.637632
score,0.637632,1.0


In [123]:
stories[['score']].corr()

Unnamed: 0,score
score,1.0
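Rather than checking one feature at a time, all numeric columns can be correlated with `score` in a single pass. The frame below is invented purely to show the pattern; the column names mirror ours, the values do not.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "score": [1, 2, 3, 4, 5],
        "comment_count": [2, 4, 6, 8, 10],
        "karma": [5, 4, 3, 2, 1],
        "title": list("abcde"),  # non-numeric, excluded below
    }
)

# Pearson correlation of every numeric column with score, strongest first.
corr_with_score = (
    toy.select_dtypes("number")
    .corr()["score"]
    .drop("score")
    .sort_values(ascending=False)
)
print(corr_with_score)
```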


### We might also want to normalize the scores. scikit-learn offers StandardScaler, MinMaxScaler, and Normalizer for this

In [125]:
# Fill missing scores with the mean so the scalers below don't receive NaN
stories['score'] = stories['score'].fillna(stories['score'].mean())

In [126]:
normed_score = preprocessing.normalize(stories[['score']])

In [127]:
normed_score[:5]

array([[ 1.],
       [-1.],
       [ 1.],
       [ 1.],
       [ 1.]])

#### hmm... `normalize` rescales each row to unit length, so with a single column every value collapses to ±1. Maybe a min-max scaler works better for our needs!

In [128]:
scaler = preprocessing.MinMaxScaler()

In [129]:
scaled_score = scaler.fit_transform(stories[['score']])

In [130]:
scaled_score[:5]

array([[0.07272727],
       [0.        ],
       [0.32727273],
       [0.03636364],
       [0.38181818]])
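To see why the two results differ so much, here is the arithmetic behind both written out in NumPy. This is a sketch of what the functions compute with their default settings, not the scikit-learn code itself. The first five scores come from `stories.head()`; the sixth value, 54, is an assumed maximum inferred from the scaled output above, not read from the data.

```python
import numpy as np

# First five scores from stories.head(), plus an assumed maximum of 54.
scores = np.array([[3.0], [-1.0], [17.0], [1.0], [20.0], [54.0]])

# Row-wise L2 normalization: each row is divided by its own length.
# With one column the length is |x|, so only the sign survives.
row_norms = np.linalg.norm(scores, axis=1, keepdims=True)
normed = scores / row_norms

# Min-max scaling: (x - min) / (max - min), computed per column.
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
scaled = (scores - col_min) / (col_max - col_min)

print(normed.ravel())
print(scaled.ravel().round(8))
```

Under that assumption the first five scaled values line up with the `MinMaxScaler` output above, which suggests the scores in this sample range from -1 to 54.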

In [132]:
stories['scaled_score'] = scaled_score[:, 0]
stories['scaled_score']

09zw7r     0.072727
0bdne7     0.000000
1bhbod     0.327273
1xkje1     0.036364
2dasvh     0.381818
             ...   
testing    0.155037
unix       0.155037
video      0.155037
web        0.155037
windows    0.155037
Name: scaled_score, Length: 127, dtype: float64