# Analyzing Last Jedi Reviews
What are reviewers really saying about The Last Jedi? Is it all anti-progressive backlash? How many of the negative reviews are responding to the treatment of Luke? To find out, this notebook scrapes user reviews of The Last Jedi from Rotten Tomatoes and runs some basic text analysis on them.

In [1]:
#load libraries
from bs4 import BeautifulSoup
import urllib.request #a bare "import urllib" does not guarantee that urllib.request is loaded
import pandas as pd

In [203]:
#test out basic page fetching
r = urllib.request.urlopen('https://www.rottentomatoes.com/m/star_wars_the_last_jedi/reviews/?page=1&type=user').read()
soup = BeautifulSoup(r, "html5lib")
print(soup.prettify()[0:1000])

<!DOCTYPE html>
<html lang="en" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns:og="http://opengraphprotocol.org/schema/">
 <head prefix="og: http://ogp.me/ns# flixstertomatoes: http://ogp.me/ns/apps/flixstertomatoes#">
  <script src="//cdn.optimizely.com/js/594670329.js">
  </script>
  <link href="https://d2a5cgar23scu2.cloudfront.net/v/less/?f=/styles/rt_redesign.less" rel="stylesheet" type="text/css"/>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta content="width=device-width,initial-scale=1" name="viewport"/>
  <meta content="VPPXtECgUUeuATBacnqnCm4ydGO99reF-xgNklSbNbc" name="google-site-verification"/>
  <meta content="034F16304017CA7DCF45D43850915323" name="msvalidate.01"/>
  <link href="//d2a5cgar23scu2.cloudfront.net/static/images/icons/apple-touch-icon.png" rel="apple-touch-icon"/>
  <link href="//d2a5cgar23scu2.cloudfront.net/static/images/icons/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
  <link href="//d2a5cgar23scu2.cloudfront.
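
The page URL above is written out by hand, with the page number and review type in the query string. A minimal sketch of building it from its parts, assuming those two query parameters are the only ones needed (the helper name `build_review_url` is hypothetical):

```python
from urllib.parse import urlencode

BASE_URL = "https://www.rottentomatoes.com/m/star_wars_the_last_jedi/reviews/"

def build_review_url(page, review_type="user"):
    """Return the reviews URL for one page (hypothetical helper)."""
    # urlencode handles escaping and joins the pairs with '&'
    query = urlencode({"page": page, "type": review_type})
    return "{}?{}".format(BASE_URL, query)

print(build_review_url(1))
```

Keeping the base URL in one place makes it harder for the test fetch and the full scrape to drift apart.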

In [209]:
#find the reviews
reviews = soup.body.find_all('div', class_="review_table_row") #returns a list of tags

#the stripped_strings approach seems to get the review text itself
for string in reviews[1].stripped_strings:
    print(repr(string))

#reviews[0].previous_sibling

'Jon S'
'½'
'December 27, 2017'
'**SPOILERS** Kathleen Kennedy, JJ Abrams, et al. deliver another nostalgia film that fundamentally misunderstands what makes people nostalgic about the original trilogy. Sorry, but capturing those films\' retro futuristic aesthetic doesn\'t excuse nonsensical plotting or the destruction of beloved characters. The Force Awakens nullified Han and Leia\'s character arcs and had Han regress into a washed up deadbeat dad space loser, then gave him a few mildly redemptive moments before his pointless death. The Last Jedi doesn\'t nullify Luke\'s character arc so much as make him a completely different character. Luke the noble optimist who redeemed his murderously evil dad through a dogged belief in his goodness, and whose greatest strength and weakness was abiding loyalty to his friends, is long gone. In his place, for no comprehensible reason, is a twisted misanthrope who utterly abandoned his friends to die at the hands of the nephew he considered murderin

In [176]:
#get user id and name
import re

def get_user(mytag): #expects a review_table_row div container
    idtag = mytag.find_all(href=re.compile("user/id"))[-1] #get the last user id box
    userid = re.findall("[0-9]+", idtag['href'])
    userid.append([string for string in idtag.stripped_strings])
    return userid #a list of [id, [name strings]], not a tuple - figure out how to tidy this later
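
One possible way to tidy the return value is to build a plain (userid, username) tuple from the link's href and its name strings. This is only a sketch: `parse_user_id` and `make_user` are hypothetical helpers, and it assumes hrefs look like `/user/id/977003942/`, as in the page markup.

```python
import re

def parse_user_id(href):
    """Return the numeric id from an href like '/user/id/977003942/', or None."""
    match = re.search(r"/user/id/(\d+)", href)
    return match.group(1) if match else None

def make_user(href, name_strings):
    """Return a (userid, username) tuple; username is '' when no name was found."""
    username = name_strings[0] if name_strings else ""
    return (parse_user_id(href), username)

print(make_user("/user/id/977003942/", ["Alex V"]))
```

Some reviewers have no visible name, so the empty-string fallback keeps every row the same shape.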

In [89]:
#count stars and get rating
def get_rating(mytag): #expects a review_table_row div container
    whole_stars = mytag.find_all(class_ = "glyphicon-star")
    #see if there is a half star at the end
    if len(whole_stars)>0:
        half_star = whole_stars[-1].next_sibling
    else:
        half_star = 1 #if whole_stars is 0 then the rating must be a half star
    return len(whole_stars) + (0.5 if half_star is not None else 0)

[1.5, 1.5, 2, 1, 1, 2, 4, 1, 4, 0.5, 1, 1.5, 1, 0.5, 2, 5, 0.5, 1.5, 1.5, 0.5]
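
get_rating counts anything that follows the last whole star as a half star, so a whitespace-only sibling would also add 0.5. A stricter sketch of the same arithmetic, assuming the half star always shows up as the character "½" in the text after the stars (the function name is hypothetical):

```python
def stars_to_rating(whole_star_count, trailing_text=None):
    """Combine a whole-star count with an optional half-star marker."""
    # only count a half star when the marker character is really present
    has_half = trailing_text is not None and "½" in trailing_text
    return whole_star_count + (0.5 if has_half else 0.0)

print(stars_to_rating(1, " ½ "))
print(stars_to_rating(4, "\n"))
```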

In [134]:
#get review text
def get_review_text(mytag):
    review_text = mytag.find('div', class_="user_review") #I hope there's only one
    return [string for string in review_text.stripped_strings]

[['A good "any one" movie, a terrible Star Wars piece, absolutely out of tune.'],
 ["This was quite frankly the most disappointing movie I've seen in a very long time.",
  'The inept way the characters were written and the lazy manner in which this picture was edited was incredibly disheartening.',
  'I went into the theater with such high expectations to see Luke Skywalker again after all these years only to have the iconic character cut off at the knees with such carelessness was painful to watch.',
  'Humour was off and misplaced (to say the least) and the antagonist(s) laughably castrated to the point where we lose track of the stakes presumably at hand.',
  'So bad.. So very bad... I easily position this movie at the he bottom of the heap... Even below "Attack of the Clones").',
  'Stay clear folks. Bad movie.'],
 ["This movie effectively ruins Star Wars lore and most beloved Star Wars characters. It's unimaginative, devoid of classic Star Wars charm and a sense of adventure, chao
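
Each review comes back as a list of strings, one per line-break-separated chunk. For text analysis it is usually easier to work with one string per review. A small sketch of collapsing the list, assuming a single space is an acceptable separator (the helper name is hypothetical):

```python
def join_review_text(parts, sep=" "):
    """Join the pieces of one review into a single string, skipping blanks."""
    cleaned = (p.strip() for p in parts if p and p.strip())
    return sep.join(cleaned)

pieces = [
    "How does Rey know how to swim after growing up on a desert planet???",
    "How does the hacker know the plans of the resistance??",
]
print(join_review_text(pieces))
```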

In [210]:
def reviewtbl_constructor(mytag):
    tmp = get_user(mytag)
    tmp.append(get_rating(mytag))
    tmp.append(get_review_text(mytag))
    return tmp
reviewtbl = pd.DataFrame([reviewtbl_constructor(x) for x in reviews], columns = ["userid", "username", "rating", "text"])
reviewtbl

Unnamed: 0,userid,username,rating,text
0,977004518,[Michael F],0.5,[This dumb SJW turd is insulting to folks look...
1,976151900,[Jon S],0.5,"[**SPOILERS** Kathleen Kennedy, JJ Abrams, et ..."
2,927435756,[James W. L],4.5,[After reading so many of the audience reviews...
3,977004521,[Billy W],1.5,[I had high hopes for this movie as I had quit...
4,977003744,[Steve B],2.0,"[I give this a solid ""MEH"". There are some rea..."
5,952657501,[Erik M],0.5,[They killed star Wars with this film.]
6,783260491,[Cale S],1.0,[It's now clear at this point that Disney has ...
7,976955062,[Paul W],0.5,"[For some reason Leia can fly like superman, L..."
8,797804393,[Keith H],0.5,[I loved the hilarious floaty Leia zooming abo...
9,799780991,[Jeff S],5.0,[Loved every second of The Last Jedi Rian John...


In [199]:
#get maximum pages
maxpage = re.findall("[0-9]+$", soup.body.find('span', class_="pageInfo").string)[0]
print(maxpage)

range(0, 1534)
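
The page count is taken from the trailing number of the pageInfo text. A sketch of that step as a function that returns an int and fails loudly when no number is found. The "Page 1 of 1534" string is an assumed format used only for illustration, and the function name is hypothetical.

```python
import re

def last_page_number(page_info):
    """Return the trailing integer of a page-info string as an int."""
    match = re.search(r"(\d+)\s*$", page_info)
    if match is None:
        raise ValueError("no page count found in {!r}".format(page_info))
    return int(match.group(1))

print(last_page_number("Page 1 of 1534"))  # assumed text format
```

Returning an int up front avoids having to remember the int() call when the value is later used in range().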

In [214]:
#now iterate through the pages and get everything - muahahaha!
def rt_iterator(pagenum):
    print('processing page {}'.format(pagenum)) #slows things down, but useful for debugging
    #open page
    r = urllib.request.urlopen("https://www.rottentomatoes.com/m/star_wars_the_last_jedi/reviews/?page={}&type=user".format(pagenum)).read()
    soup = BeautifulSoup(r, "html5lib")
    
    #find the review block
    reviews = soup.body.find_all('div', class_="review_table_row") #returns a list of tags

    #extract info
    return pd.DataFrame([reviewtbl_constructor(x) for x in reviews], columns = ["userid", "username", "rating", "text"])

#go! - test with small num
reviewtbl = pd.concat([rt_iterator(pgnum) for pgnum in range(1, int(maxpage)+1)], ignore_index=True)

processing page 1
processing page 2
processing page 3
processing page 4
processing page 5
processing page 6
processing page 7
processing page 8
processing page 9
processing page 10
processing page 11
processing page 12
processing page 13
processing page 14
processing page 15
processing page 16
processing page 17
processing page 18
processing page 19
processing page 20
processing page 21
processing page 22
processing page 23
processing page 24
processing page 25
processing page 26
processing page 27
processing page 28
processing page 29
processing page 30
processing page 31
processing page 32
processing page 33
processing page 34
processing page 35
processing page 36
processing page 37
processing page 38
processing page 39
processing page 40
processing page 41
processing page 42
processing page 43
processing page 44
processing page 45
processing page 46
processing page 47
processing page 48
processing page 49
processing page 50
processing page 51
processing page 52
processing page 53
...

processing page 417
processing page 418
processing page 419
processing page 420
processing page 421
processing page 422
processing page 423
processing page 424
processing page 425
processing page 426
processing page 427
processing page 428
processing page 429
processing page 430
processing page 431
processing page 432
processing page 433
processing page 434
processing page 435
processing page 436
processing page 437
processing page 438
processing page 439
processing page 440
processing page 441
processing page 442
processing page 443
processing page 444
processing page 445
processing page 446
processing page 447
processing page 448
processing page 449
processing page 450
processing page 451
processing page 452
processing page 453
processing page 454
processing page 455
processing page 456
processing page 457
processing page 458
processing page 459
processing page 460
processing page 461
processing page 462
processing page 463
processing page 464
processing page 465
processing page 466


processing page 827
processing page 828
processing page 829
processing page 830
processing page 831
processing page 832
processing page 833
processing page 834
processing page 835
processing page 836
processing page 837
processing page 838
processing page 839
processing page 840
processing page 841
processing page 842
processing page 843
processing page 844
processing page 845
processing page 846
processing page 847
processing page 848
processing page 849
processing page 850
processing page 851
processing page 852
processing page 853
processing page 854
processing page 855
processing page 856
processing page 857
processing page 858
processing page 859
processing page 860
processing page 861
processing page 862
processing page 863
processing page 864
processing page 865
processing page 866
processing page 867
processing page 868
processing page 869
processing page 870
processing page 871
processing page 872
processing page 873
processing page 874
processing page 875
processing page 876


processing page 1226
processing page 1227
processing page 1228
processing page 1229
processing page 1230
processing page 1231
processing page 1232
processing page 1233
processing page 1234
processing page 1235
processing page 1236
processing page 1237
processing page 1238
processing page 1239
processing page 1240
processing page 1241
processing page 1242
processing page 1243
processing page 1244
processing page 1245
processing page 1246
processing page 1247
processing page 1248
processing page 1249
processing page 1250
processing page 1251
processing page 1252
processing page 1253
processing page 1254
processing page 1255
processing page 1256
processing page 1257
processing page 1258
processing page 1259
processing page 1260
processing page 1261
processing page 1262
processing page 1263
processing page 1264
processing page 1265
processing page 1266
processing page 1267
processing page 1268
processing page 1269
processing page 1270
processing page 1271
processing page 1272
...

In [217]:
print(reviewtbl.shape)
reviewtbl.tail() #shoot, I think this messed up b/c I forgot to ignore the index - well, actually the later pages appear to be blank

(1015, 4)


Unnamed: 0,userid,username,rating,text
15,976995788,[Nick J],0.5,[I'm a lifelong fan and i'm so horribly disapp...
16,977000248,[Daniel T],1.5,[A film full of regretful characters and rehas...
17,973951571,[Graeme P],0.5,[This movie is a complete mess in so many ways...
18,977000306,[Rian%20 J],0.5,[I've been robbed by Disney!! I went to watch ...
19,848291899,[Kevin C],4.0,[Star Wars: The Last Jedi isn't the movie fans...
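
Since only 1015 rows came back (roughly 50 pages at 20 reviews each), most of the requests seem to have hit blank pages. A rough sketch of a loop that stops after a few empty pages in a row and pauses between requests. `fetch_page` stands in for something like rt_iterator that returns a list of rows, and the function name, delay and `max_empty` threshold are all assumptions.

```python
import time

def scrape_until_empty(fetch_page, first_page, last_page,
                       delay_seconds=1.0, max_empty=3):
    """Collect rows page by page, stopping after max_empty blank pages in a row."""
    rows = []
    empty_streak = 0
    for pagenum in range(first_page, last_page + 1):
        page_rows = fetch_page(pagenum)
        if page_rows:
            rows.extend(page_rows)
            empty_streak = 0
        else:
            empty_streak += 1
            if empty_streak >= max_empty:
                break  # the rest is probably blank too
        time.sleep(delay_seconds)  # be polite to the server
    return rows

# stand-in fetcher: pretend only pages 1 and 2 have reviews
def fake_fetch(pagenum):
    return [("review", pagenum)] if pagenum <= 2 else []

print(scrape_until_empty(fake_fetch, 1, 100, delay_seconds=0))
```

Requiring several blank pages before stopping guards against a single page that fails to load for unrelated reasons.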


In [225]:
#save file so we don't have to do this again
#the raw timestamp contains colons and spaces, which Windows does not allow in filenames - likely why the first attempt failed
today = pd.to_datetime('today').strftime('%Y-%m-%d')
reviewtbl.to_pickle("RT_Last_Jedi_{}.pkl".format(today))
reviewtbl.to_csv("RT_Last_Jedi_{}.csv".format(today), index=False, sep='\t')
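
If several scrapes happen on the same day, the filename may also need the time of day. Colons are not allowed in Windows filenames, so the time has to be written without them. A minimal standard-library sketch follows; the helper name and the exact format string are assumptions.

```python
from datetime import datetime

def timestamped_name(stem, ext, now=None):
    """Build 'stem_YYYY-MM-DD_HHMMSS.ext' using only filename-safe characters."""
    now = now or datetime.now()
    return "{}_{}.{}".format(stem, now.strftime("%Y-%m-%d_%H%M%S"), ext)

print(timestamped_name("RT_Last_Jedi", "csv"))
```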

In [223]:
import os
os.getcwd()

'O:\\PDES\\PRISM\\Sullivan\\Untitled Folder'

In [157]:
#put it all together - there should be a way to do this in one loop (See above)
reviewtbl = pd.DataFrame([get_user(x) for x in reviews], columns=["userid", "username"])
reviewtbl["rating"] = [get_rating(x) for x in reviews]
reviewtbl["review text"] = [get_review_text(x) for x in reviews]
reviewtbl

Unnamed: 0,userid,username,rating,review text
0,[977003932],[Antonio C],1.5,"[A good ""any one"" movie, a terrible Star Wars ..."
1,[908539814],[Che C],1.5,[This was quite frankly the most disappointing...
2,[977003938],[Damian B],2.0,[This movie effectively ruins Star Wars lore a...
3,[977003939],[],1.0,"[Absolute disgrace, no plot, no character deve..."
4,[976537463],[dan n],1.0,[Worst. Star Wars movie. Ever.]
5,[783688574],[Jeff G],2.0,[I really wanted to like the Last Jedi but the...
6,[821127572],[Angel E],4.0,"[So I just got home from ""The Last Jedi"" which..."
7,[977003887],[Chris W],1.0,"[The Last Jedi is terrible., The bad rating is..."
8,[801634079],[Jonathan K],4.0,[Great fun movie. I understand the difficulty ...
9,[977003942],[Alex V],0.5,[How does Rey know how to swim after growing u...


In [208]:
#scratch cell
usertag = reviews[0].find_all(href=re.compile("user/id"))
print(usertag)
for string in usertag.stripped_strings:
    print(repr(string))

[]


AttributeError: ResultSet object has no attribute 'stripped_strings'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
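
The error above comes from calling stripped_strings on the list-like result of find_all() rather than on a single tag. A small sketch of the two usual ways around it, run against a trimmed, hand-written snippet modeled on the page markup. It uses the built-in "html.parser" instead of html5lib only to keep the example dependency-light.

```python
import re
from bs4 import BeautifulSoup

html = """
<div class="row review_table_row">
  <a class="bold unstyled articleLink" href="/user/id/977003942/">
    <span>Alex V</span>
  </a>
</div>
"""
row = BeautifulSoup(html, "html.parser").div

# option 1: find_all() returns a list-like ResultSet, so loop over it
links = row.find_all(href=re.compile("user/id"))
names = [s for link in links for s in link.stripped_strings]

# option 2: find() returns a single tag, or None when nothing matches
first = row.find(href=re.compile("user/id"))
first_name = list(first.stripped_strings) if first is not None else []

print(names, first_name)
```

The None check matters here because the scratch cell printed an empty list, meaning the first row had no matching link at all.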

In [83]:
#test grab rows and extract info

reviews = soup.body.find_all('div', class_="row review_table_row") #returns a list of tags
print(reviews[9].prettify())



<div class="row review_table_row">
 <div class="col-xs-8">
  <div class="col-sm-6 col-xs-15 critic_img">
   <img alt="Alex V." class="media_block_image" height="50" src="https://d2a5cgar23scu2.cloudfront.net/static/images/redesign/actor.default.tmb.gif" width="50"/>
  </div>
  <div class="col-sm-7 col-xs-9 top_critic col-sm-push-13 superreviewer">
  </div>
  <div class="col-sm-11 col-xs-24 col-sm-pull-4">
   <a class="bold unstyled articleLink" href="/user/id/977003942/">
    <span style="word-wrap:break-word">
     Alex V
    </span>
   </a>
  </div>
 </div>
 <div class="col-xs-16">
  <span class="fl" style="color:#F1870A">
   ½
  </span>
  <span class="fr small subtle">
   December 27, 2017
  </span>
  <div class="user_review" style="display:inline-block; width:100%">
   <div class="scoreWrapper">
    <span class="05">
    </span>
   </div>
   How does Rey know how to swim after growing up on a desert planet???
   <br/>
   How does the hacker know the plans of the resistance??
   <br