## Shakespeare's <em>Romeo and Juliet</em>

- <strong>Request the text from Project Gutenberg and use BeautifulSoup to parse and print the play</strong>

In [1]:
import requests

In [2]:
RJ = requests.get('https://www.gutenberg.org/cache/epub/1777/pg1777.html')

In [3]:
RJ

<Response [200]>

In [4]:
from bs4 import BeautifulSoup

In [5]:
html_string = RJ.text
document = BeautifulSoup(html_string, "html.parser")

- <strong>Tokenize the words and remove stopwords</strong>

In [6]:
print(document.text)





The Project Gutenberg eBook of Romeo and Juliet, by William Shakespeare

























*******************************************************************
THIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A
TIME WHEN PROOFING METHODS AND TOOLS WERE NOT WELL DEVELOPED. THERE
IS AN IMPROVED EDITION OF THIS TITLE WHICH MAY BE VIEWED AS EBOOK
(#1513) at https://www.gutenberg.org/ebooks/1513
*******************************************************************
This Etext file is presented by Project Gutenberg, in

cooperation with World Library, Inc., from their Library of the

Future and Shakespeare CDROMS.  Project Gutenberg often releases

Etexts that are NOT placed in the Public Domain!!

*This Etext has certain copyright implications you should read!*
<<THIS ELECTRONIC VERSION OF THE COMPLETE WORKS OF WILLIAM

SHAKESPEARE IS COPYRIGHT 1990-1993 BY WORLD LIBRARY, INC., AND IS

PROVIDED BY PROJECT GUTENBERG WITH PERMISSION.  ELECTRONIC AND

MACHINE READABLE CO
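
The printed text opens with Project Gutenberg's notices rather than the play, so those words will be tokenized and counted along with Shakespeare's. Below is a minimal sketch of trimming the front matter first, assuming the text contains some recognizable line that marks where the play begins. The helper name `strip_front_matter` and the marker in the example are hypothetical; a real marker would have to be chosen by inspecting `document.text`.

```python
def strip_front_matter(text, start_marker):
    """Return the part of `text` that follows the first `start_marker`.

    If the marker is not found, the text is returned unchanged so that
    nothing is silently lost.
    """
    position = text.find(start_marker)
    if position == -1:
        return text
    return text[position + len(start_marker):].lstrip()


# Toy example; "BEGIN PLAY" is a made-up marker, not one taken from the eBook.
sample = "License notice...\nBEGIN PLAY\nTwo households, both alike in dignity"
print(strip_front_matter(sample, "BEGIN PLAY"))
```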

In [7]:
import nltk

In [8]:
from nltk.tokenize import word_tokenize, sent_tokenize

In [9]:
sent_RJ = sent_tokenize(document.text) 

In [10]:
sent_RJ[0]

"\n\n\n\nThe Project Gutenberg eBook of Romeo and Juliet, by William Shakespeare\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n*******************************************************************\nTHIS EBOOK WAS ONE OF PROJECT GUTENBERG'S EARLY FILES PRODUCED AT A\r\nTIME WHEN PROOFING METHODS AND TOOLS WERE NOT WELL DEVELOPED."

In [11]:
print(word_tokenize(sent_RJ[0]))

['The', 'Project', 'Gutenberg', 'eBook', 'of', 'Romeo', 'and', 'Juliet', ',', 'by', 'William', 'Shakespeare', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', '*', 'THIS', 'EBOOK', 'WAS', 'ONE', 'OF', 'PROJECT', 'GUTENBERG', "'S", 'EARLY', 'FILES', 'PRODUCED', 'AT', 'A', 'TIME', 'WHEN', 'PROOFING', 'METHODS', 'AND', 'TOOLS', 'WERE', 'NOT', 'WELL', 'DEVELOPED', '.']
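
The first "sentence" shown for `sent_RJ[0]` carries long runs of `\n` and `\r\n`. `word_tokenize` ignores them, but they make individual sentences hard to read. A small sketch for collapsing whitespace before displaying a sentence; `normalize_whitespace` is a hypothetical helper name, and the sample string is a shortened stand-in for the real text.

```python
def normalize_whitespace(text):
    """Collapse every run of whitespace (spaces, newlines, tabs) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(text.split())


raw = "\n\n\nThe Project Gutenberg eBook\n\n\nTHIS EBOOK WAS ONE OF\r\nTIME WHEN"
print(normalize_whitespace(raw))
```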


In [12]:
words = []
for s in sent_RJ:
    for w in word_tokenize(s):
        words.append(w)

In [13]:
print(words)



In [14]:
from nltk.corpus import stopwords
from string import punctuation 
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
extra_stopwords = ['I',"'s",'And',"'d"]

In [16]:
mystopwords = list(punctuation) + stopwords.words('english') + extra_stopwords

In [17]:
RJ_Nostopwords = []
for i in words:
    if i not in mystopwords: 
        RJ_Nostopwords.append(i)
print(RJ_Nostopwords)



- <strong>Find the top 20 most frequent words in the play</strong>

In [18]:
from nltk.probability import FreqDist

In [19]:
freq = FreqDist(RJ_Nostopwords)

In [20]:
freq

FreqDist({'thou': 236, 'Rom': 163, 'Romeo': 154, 'O': 150, 'thy': 145, 'love': 139, 'thee': 135, 'Nurse': 118, 'Jul': 117, 'What': 106, ...})

In [21]:
# Note: [:19] prints 19 entries, as in the output below; use [:20] for a full top 20.
for i in sorted(freq, key=freq.get, reverse=True)[:19]:
    print(i, freq[i])

thou 236
Rom 163
Romeo 154
O 150
thy 145
love 139
thee 135
Nurse 118
Jul 117
What 106
shall 94
'll 90
The 88
To 84
But 83
That 82
Enter 80
Friar 76
night 72
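
`FreqDist` is a subclass of `collections.Counter`, so `most_common(n)` returns the top `n` items directly, without sorting by hand. The sketch below uses a plain `Counter` on a made-up token list so that it runs without NLTK; in the notebook the equivalent call would be `freq.most_common(20)`.

```python
from collections import Counter

# Toy tokens for illustration; not counts from the play.
tokens = ["thou", "love", "thou", "night", "love", "thou"]
counts = Counter(tokens)

# most_common(n) returns (item, count) pairs, highest count first.
for word, count in counts.most_common(2):
    print(word, count)
```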


### My comment: It seems I need to remove more stopwords. But if I always have to remove many stopwords by hand, is the analysis still convincing?
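
One likely reason that words such as 'The', 'To', 'But', 'That' and 'What' survive is that the NLTK English stopword list is all lowercase, while the comparison in In [17] is case-sensitive. A sketch of a case-insensitive filter follows. To stay self-contained it uses a tiny hand-written stopword set as a stand-in for `mystopwords`, and `SPEAKER_LABELS` is a hypothetical extra list for abbreviations like 'Rom' and 'Jul', which look like speaker prefixes rather than dialogue.

```python
# Stand-in for `mystopwords`; in the notebook this would be punctuation plus the NLTK list.
STOPWORDS = {"the", "to", "but", "that", "what", "and", "i", ","}
# Hypothetical list of speaker prefixes to drop.
SPEAKER_LABELS = {"rom", "jul"}


def remove_stopwords(tokens, stopwords, extra=frozenset()):
    """Keep tokens whose lowercase form is in neither `stopwords` nor `extra`."""
    blocked = set(stopwords) | set(extra)
    return [t for t in tokens if t.lower() not in blocked]


tokens = ["Rom", "But", "soft", ",", "what", "light", "The", "night"]
print(remove_stopwords(tokens, STOPWORDS, SPEAKER_LABELS))
```

Lowercasing is a general rule rather than a hand-picked list, so it may address part of the concern about manual removal. Deciding whether to drop archaic pronouns such as 'thou' or 'thy' is still a judgment call.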

## Yelp sentiments: Moon House Chinese Cuisine (Address: 11058 Santa Monica Blvd, Los Angeles, CA 90025)

In [22]:
from nltk.sentiment import vader
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [23]:
sia = vader.SentimentIntensityAnalyzer()

review 1, 5 stars

In [24]:
sia.polarity_scores('We are Chinese and this is our go to for family takeout dinner on the westside. Filet mignon, orange chicken, walnut shrimp, green beans, mu shu, and the lamb are all superb and always consistent. Highly recommend!!')

{'neg': 0.0, 'neu': 0.824, 'pos': 0.176, 'compound': 0.8165}

review 2, 5 stars

In [25]:
sia.polarity_scores('Great good at a decent price. I would definitely return if I am ever having another hankering for Chinese food')

{'neg': 0.0, 'neu': 0.591, 'pos': 0.409, 'compound': 0.8658}

review 3, 5 stars

In [26]:
sia.polarity_scores("Worth the wait! Always such good Chinese food! If you're looking for good Chinese food in West LA this is where to go!")

{'neg': 0.0, 'neu': 0.7, 'pos': 0.3, 'compound': 0.8213}

review 4, 5 stars

In [27]:
sia.polarity_scores("Good Chinese food in Westwood area! We had the family style hot and sour soup which was yummy, the tofu lettuce wraps were SOO good. We had a garlic sauce green beans, kung pao chicken, and fried rice! Everything was super good, my favorite however was the tofu lettuce wrap! Super fresh and delicious compared to other spots I've had.")

{'neg': 0.0, 'neu': 0.627, 'pos': 0.373, 'compound': 0.9831}

review 5, 4 stars

In [28]:
sia.polarity_scores(
'''
Called to place an order for the following:

Pork chop in salt and garlic lunch portion- this comes with hot and sour soup and rice

Salt and Pepper Fish
Snow Pea leaves
Sweet and Sour Pork

Everything was delicious and the food was still hot when I opened them at home.  Serving portion is pretty big.  You can dine in but looks like they're more famous for takeout.

Will order again if I'm in the area on my way home. Trust.
''')

{'neg': 0.0, 'neu': 0.841, 'pos': 0.159, 'compound': 0.9209}

review 6, 4 stars

In [29]:
sia.polarity_scores(
"Moon House has become my go-to for local Chinese food takeout. If it was in the SGV, I'd probably give it three stars, but decent Chinese restaurants are hard to find in West LA. We always get the fish with black bean sauce and chow mein (the noodles are nothing special but my daughter likes them).  Moo shu is pretty standard. Portions are decently sized, so we usually have leftovers Pricing is a little on the high side, but they mail out coupons for 40% off from time to time, which makes a deal! I'm always on the lookout for those.Free parking in the lot.")

{'neg': 0.04, 'neu': 0.886, 'pos': 0.074, 'compound': 0.7008}

review 7, 4 stars

In [30]:
sia.polarity_scores(
'''We were craving Chinese food not too far from our hotel and found this place on Yelp!  It is located in a busy strip mall which houses a Jamba Juice, Mexican food and more.  It doesn't look like much from the outside or inside but they do have indoor dining!  As soon as we walked in and hubby started talking to them in Cantonese, they were friendly and gave us good service.

We ordered the lunch special #28. Fish Fillet Wok Tossed with Vegetables Lunch Special and porridge with preserved egg and pork.  Both were delicious and the next time we come to LA, we're going to try the Peking Duck!''')

{'neg': 0.012, 'neu': 0.819, 'pos': 0.168, 'compound': 0.9706}

review 8, 1 star

In [31]:
sia.polarity_scores('''Do not come here if you're hungry! Wait time is ridiculous! They prioritize deliveries and pick ups. Food was salty.''')

{'neg': 0.14, 'neu': 0.86, 'pos': 0.0, 'compound': -0.4738}

review 9, 1 star

In [32]:
sia.polarity_scores('''I've ordered from this spot a dozen times, foods always been pretty good, no complaints. I ordered a Xiao Long Bao at 8pm, 2 hours later I start feeling nauseous and having abdominal cramps.  Its now 2:15 am and I finally have thrown up the Xiao Long Bao. Thankfully my body was able to expel whatever food poisoning bacteria was in the pork and I can sum this up to a night of nausea instead of a night at the hospital. Gross. Thanks for the food poisoning, won't be ordering ever again!''')


{'neg': 0.181, 'neu': 0.692, 'pos': 0.128, 'compound': -0.7574}

review 10, 1 star

In [33]:
sia.polarity_scores('''Chicken chow mein was flavorless, fried rice was salty, we ordered dumplings but got pot stickers instead which we good. The one star is for the beef and broccoli which was very good. I would not go back. We ended up throwing away two of the four dishes. Also, it was quite expensive for being just ok. I paid for the chow mein and pot stickers, $26.''')

{'neg': 0.016, 'neu': 0.84, 'pos': 0.145, 'compound': 0.8847}

review 11, 2 stars

In [34]:
sia.polarity_scores('''$16 for an order of shrimp fried rice??????? That's friggin ridiculous!!!!!!''')

{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}

review 12, 2 stars

In [35]:
sia.polarity_scores('''This is our fifth time here. Actually I already forgot how many times I decide to choose this restaurant. It's inexpensive I have to say. Every time there are some friends from China come to see me, I will bring them here. So just have a try. But actually they are smart, just don't choose the special deal. You won't get other discount then!''')

{'neg': 0.045, 'neu': 0.864, 'pos': 0.091, 'compound': 0.4597}

review 13, 2 stars

In [36]:
sia.polarity_scores('''No different from Panda Express. Friendly  staff. Food is bad, even in terms of Americanaized Chinise food. Rule of thumb: if majority of customers there are not Chinise, - food probably doesn't taste good.''')

{'neg': 0.201, 'neu': 0.719, 'pos': 0.079, 'compound': -0.6002}

review 14, 3 stars

In [37]:
sia.polarity_scores('''The food was decent. The service was horrible. More like no service aside from ordering, they forgot some of our food, hard to get their attention and got a dirty cut with lipstick still on it with no kinda apology or some kinda acknowledgment at the least.''')

{'neg': 0.281, 'neu': 0.67, 'pos': 0.049, 'compound': -0.8795}

review 15, 3 stars

In [38]:
sia.polarity_scores('''Located in a scary strip mall but the food is good and reliable. Zero atmosphere. Hard working crew though!''')

{'neg': 0.162, 'neu': 0.657, 'pos': 0.181, 'compound': 0.3489}
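
To compare the scores with the star ratings, the compound values can be turned into labels. The sketch below assumes the commonly suggested VADER cutoffs of ±0.05 on the compound score, and it assumes a mapping from stars to labels (4 to 5 stars positive, 3 neutral, 1 to 2 negative) that is my own choice rather than anything Yelp defines. The `(stars, compound)` pairs are copied from the fifteen outputs above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a label using symmetric cutoffs."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def label_from_stars(stars):
    """Assumed mapping: 4-5 stars positive, 3 neutral, 1-2 negative."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


# (stars, compound) for reviews 1 to 15, in order.
results = [
    (5, 0.8165), (5, 0.8658), (5, 0.8213), (5, 0.9831),
    (4, 0.9209), (4, 0.7008), (4, 0.9706),
    (1, -0.4738), (1, -0.7574), (1, 0.8847),
    (2, 0.0), (2, 0.4597), (2, -0.6002),
    (3, -0.8795), (3, 0.3489),
]

matches = sum(label_from_stars(s) == label_from_compound(c) for s, c in results)
print(f"{matches} of {len(results)} reviews agree with their star rating")
```

The disagreements are worth reading individually: review 10 (1 star) scores 0.8847 because most of its sentences praise single dishes, and review 11 (2 stars) scores exactly 0.0.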

## Movie reviews: <em>The Parent Trap</em> (1998) and <em>The Handmaiden</em> (2016)

In [39]:
Reviewlist = ['''At first, I was scared to watch this movie because of the number of rotten tomatoes, but BOY did I love it! Such a comedy, and also lots of drama, which we teens ADORE these days. Lindsay Lohan has to be the best actor in the movie because she was so focused, and she didn't fall behind on the audio, and her acting was perfect. I know the stress of acting like twins, because I've tried it before, and it's hard alright. She was also focused and all her lines were on time.''','This is a remake of the 1961 film of the same name that starred Hayley Mills, this time starring Lindsay Lohan in her first film. The performances of Mills and Lohan are comparable though Mills did have a little more acting experience than Lohan did when filming these films. Also, Lohan doesn’t sing. This new film follows the old one almost beat for beat, with a few changes that make this one slightly better.','''As for this movie, all I can say is WOW! If 11-year-olds are capable of bringing their parents together after divorce, then it should inspire us to try our best to keep our families together no matter what. Who ever thought an 11-year-old girl would turn out to be a barber and ear piercer? I'm even starting to imagine what Hallie looks like with long hair. Hats off to you, Annie James and Hallie Parker,......correction......Annie and Hallie Parker!''','''I give this movie a good 5 stars! It is my top movie I go to every time, and it never gets old! This has been my favorite movie ever since I was a little girl. It is mixed with some drama, action, and love, and my favorite actor Lindsay Lohan is the star of this movie, twice! I love it so much, and it should be rated higher. If you havent seen it yet, make sure to check it out and enjoy!''','''I love the concept of the movie. The twins sperated cuz their parents are seperated. This movie is a good family movie not just for family but for everyone. 
I also love the cast mostly Lindsy Lohan the one who played Hallie Parker and Annie James.''','''You should definitely watch this movie as this creation goes beyond the periphery of societal understanding as well as lies and truth. The genre should be drama and romance as it is a story of love between a noble lady and her handmaiden. The movie is separated in three consecutive chapters and each chapter has its own shocking revelation as well as taste of emotions such as lust, anger, happiness etc.''','''I learned about this movie in my major class. This was one of the movies that I wanted to watch since I was an adolescent. This movie is based on the British novel "Fingersmith", which is more than 500 pages long. Actually when I watch films based on novels, I prefer reading books first and watching movies later.''','''Since I know of the caliber of the Director and cinematographer, this film really shows their perfect chemistry. The narrative style is so exquisite, the screenplay is genuinely beautiful, and the cinematography deserves commendation. The musical score, editing, lighting all magnified the intensity and beauty of the film.''','''This is hands down the best sapphic film I've ever watched. In most, they either are just male-driven sex scenes or depressing punch you in the gut movies. Yet somehow in both, the love interest never seems to end up together. That's what's different about this movie. It has a plot that includes the sex and the two women end up happy together.''','''The word masterpiece gets thrown around so easily nowadays that its lost its meaning. I've heard a few people on the internet say that this movie was a masterpiece and so I decided to give it a watch but what I expected to be a masterpiece turned out to be a predictable plot. If you've seen enough movies as I have, you can easily predict the entire plot and its ending as you keep watching the movie. 
This was one movie which deserved to have a sad ending considering how awful and dis likable the three main lead characters are.''']

In [40]:
len(Reviewlist)

10

In [41]:
Reviewlist

["At first, I was scared to watch this movie because of the number of rotten tomatoes, but BOY did I love it! Such a comedy, and also lots of drama, which we teens ADORE these days. Lindsay Lohan has to be the best actor in the movie because she was so focused, and she didn't fall behind on the audio, and her acting was perfect. I know the stress of acting like twins, because I've tried it before, and it's hard alright. She was also focused and all her lines were on time.",
 'This is a remake of the 1961 film of the same name that starred Hayley Mills, this time starring Lindsay Lohan in her first film. The performances of Mills and Lohan are comparable though Mills did have a little more acting experience than Lohan did when filming these films. Also, Lohan doesn’t sing. This new film follows the old one almost beat for beat, with a few changes that make this one slightly better.',
 "As for this movie, all I can say is WOW! If 11-year-olds are capable of bringing their parents togethe

In [42]:
import pandas as pd
from pathlib import Path
import glob

In [43]:
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords 
from string import punctuation 

In [44]:
mystopwords = list(punctuation) + stopwords.words('english')
print(mystopwords)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', 'i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when'

In [45]:
[w for w in word_tokenize(Reviewlist[0].lower()) if w not in mystopwords]

['first',
 'scared',
 'watch',
 'movie',
 'number',
 'rotten',
 'tomatoes',
 'boy',
 'love',
 'comedy',
 'also',
 'lots',
 'drama',
 'teens',
 'adore',
 'days',
 'lindsay',
 'lohan',
 'best',
 'actor',
 'movie',
 'focused',
 "n't",
 'fall',
 'behind',
 'audio',
 'acting',
 'perfect',
 'know',
 'stress',
 'acting',
 'like',
 'twins',
 "'ve",
 'tried',
 "'s",
 'hard',
 'alright',
 'also',
 'focused',
 'lines',
 'time']

In [46]:
Review_NewWords = []
for i in Reviewlist:
    Review_NewWords.append([w for w in word_tokenize(i.lower()) if w not in mystopwords])

In [47]:
print(Review_NewWords)

[['first', 'scared', 'watch', 'movie', 'number', 'rotten', 'tomatoes', 'boy', 'love', 'comedy', 'also', 'lots', 'drama', 'teens', 'adore', 'days', 'lindsay', 'lohan', 'best', 'actor', 'movie', 'focused', "n't", 'fall', 'behind', 'audio', 'acting', 'perfect', 'know', 'stress', 'acting', 'like', 'twins', "'ve", 'tried', "'s", 'hard', 'alright', 'also', 'focused', 'lines', 'time'], ['remake', '1961', 'film', 'name', 'starred', 'hayley', 'mills', 'time', 'starring', 'lindsay', 'lohan', 'first', 'film', 'performances', 'mills', 'lohan', 'comparable', 'though', 'mills', 'little', 'acting', 'experience', 'lohan', 'filming', 'films', 'also', 'lohan', '’', 'sing', 'new', 'film', 'follows', 'old', 'one', 'almost', 'beat', 'beat', 'changes', 'make', 'one', 'slightly', 'better'], ['movie', 'say', 'wow', '11-year-olds', 'capable', 'bringing', 'parents', 'together', 'divorce', 'inspire', 'us', 'try', 'best', 'keep', 'families', 'together', 'matter', 'ever', 'thought', '11-year-old', 'girl', 'would',
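
The filtered lists still contain contraction fragments such as "n't", "'ve" and "'s", plus the curly apostrophe '’', because none of them appears in `mystopwords`. Here is a sketch of one possible extra filter: keep a token only if it contains a letter or digit and no apostrophe. `is_content_token` is a hypothetical helper, and the rule is an assumption; it would also discard legitimate words that contain an apostrophe.

```python
APOSTROPHES = ("'", "’")  # straight and curly


def is_content_token(token):
    """True if the token has at least one letter/digit and no apostrophe."""
    has_alnum = any(ch.isalnum() for ch in token)
    has_apostrophe = any(mark in token for mark in APOSTROPHES)
    return has_alnum and not has_apostrophe


# Sample tokens taken from the printed lists above.
tokens = ["focused", "n't", "fall", "'ve", "tried", "'s", "’", "11-year-olds", "......"]
print([t for t in tokens if is_content_token(t)])
```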

In [48]:
!pip install gensim



In [49]:
from gensim import corpora, models
import gensim

In [50]:
dictionary = corpora.Dictionary(Review_NewWords)

In [51]:
print(dictionary.token2id)

{"'s": 0, "'ve": 1, 'acting': 2, 'actor': 3, 'adore': 4, 'alright': 5, 'also': 6, 'audio': 7, 'behind': 8, 'best': 9, 'boy': 10, 'comedy': 11, 'days': 12, 'drama': 13, 'fall': 14, 'first': 15, 'focused': 16, 'hard': 17, 'know': 18, 'like': 19, 'lindsay': 20, 'lines': 21, 'lohan': 22, 'lots': 23, 'love': 24, 'movie': 25, "n't": 26, 'number': 27, 'perfect': 28, 'rotten': 29, 'scared': 30, 'stress': 31, 'teens': 32, 'time': 33, 'tomatoes': 34, 'tried': 35, 'twins': 36, 'watch': 37, '1961': 38, 'almost': 39, 'beat': 40, 'better': 41, 'changes': 42, 'comparable': 43, 'experience': 44, 'film': 45, 'filming': 46, 'films': 47, 'follows': 48, 'hayley': 49, 'little': 50, 'make': 51, 'mills': 52, 'name': 53, 'new': 54, 'old': 55, 'one': 56, 'performances': 57, 'remake': 58, 'sing': 59, 'slightly': 60, 'starred': 61, 'starring': 62, 'though': 63, '’': 64, "'m": 65, '......': 66, '11-year-old': 67, '11-year-olds': 68, 'annie': 69, 'barber': 70, 'bringing': 71, 'capable': 72, 'correction': 73, 'divo

In [52]:
corpus = [dictionary.doc2bow(text) for text in Review_NewWords]

In [53]:
print(corpus[9])

[(1, 2), (25, 3), (37, 1), (56, 1), (86, 1), (93, 1), (108, 1), (109, 1), (118, 1), (161, 1), (178, 1), (185, 1), (218, 2), (228, 1), (229, 1), (230, 1), (231, 1), (232, 1), (233, 1), (234, 1), (235, 2), (236, 2), (237, 1), (238, 1), (239, 1), (240, 1), (241, 1), (242, 1), (243, 1), (244, 1), (245, 1), (246, 3), (247, 1), (248, 1), (249, 1), (250, 1), (251, 1), (252, 1), (253, 1), (254, 1), (255, 1)]
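
Each pair in `corpus[9]` is `(token id, count in that review)`. For example, `(25, 3)` says that token 25, which is 'movie' in the printed `token2id`, occurs three times in the tenth review. The sketch below imitates this bag-of-words encoding with the standard library only. It is a simplified stand-in, not gensim's implementation; `build_bow` is a hypothetical name, and assigning ids in sorted order within each document is an assumption based on the ordering visible in the printed `token2id`.

```python
from collections import Counter


def build_bow(documents):
    """Return (token2id, corpus) where corpus holds sorted (id, count) pairs."""
    token2id = {}
    corpus = []
    for tokens in documents:
        # Give unseen tokens the next free ids, in sorted order within the document.
        for token in sorted(set(tokens)):
            if token not in token2id:
                token2id[token] = len(token2id)
        counts = Counter(token2id[t] for t in tokens)
        corpus.append(sorted(counts.items()))
    return token2id, corpus


# Toy documents for illustration.
docs = [["movie", "love", "movie"], ["film", "love"]]
token2id, toy_corpus = build_bow(docs)
print(token2id)
print(toy_corpus)
```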


In [54]:
ldamodel = gensim.models.ldamodel.LdaModel(corpus,
                                           num_topics=2,
                                           id2word = dictionary, 
                                           passes=20)

In [55]:
for i in ldamodel.print_topics(num_topics=2, num_words=20):
    print(i)

(0, '0.031*"movie" + 0.016*"love" + 0.016*"hallie" + 0.013*"also" + 0.013*"annie" + 0.013*"parker" + 0.009*"best" + 0.009*"drama" + 0.009*"together" + 0.009*"acting" + 0.009*"watch" + 0.009*"well" + 0.009*"......" + 0.009*"twins" + 0.009*"focused" + 0.009*"like" + 0.009*"james" + 0.009*"parents" + 0.009*"family" + 0.009*"lohan"')
(1, '0.030*"movie" + 0.018*"film" + 0.016*"lohan" + 0.013*"one" + 0.013*"movies" + 0.010*"\'ve" + 0.010*"watch" + 0.010*"mills" + 0.010*"plot" + 0.010*"masterpiece" + 0.010*"since" + 0.010*"love" + 0.007*"time" + 0.007*"lindsay" + 0.007*"first" + 0.007*"\'s" + 0.007*"ending" + 0.007*"beat" + 0.007*"easily" + 0.007*"old"')


### My comment: Hmm... For <em>The Parent Trap</em>, maybe it is fine. For <em>The Handmaiden</em>, it doesn't help at all.
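
One way to check whether the two topics separate the two films is to look at the dominant topic of each review instead of the top words. In gensim, `ldamodel.get_document_topics(bow)` returns a list of `(topic_id, probability)` pairs for one document. The helper below picks the most probable topic from such a list. `dominant_topic` is a hypothetical name, and the numbers in the example are made up for illustration; they are not output from this notebook's model.

```python
def dominant_topic(topic_probs):
    """Return the topic id with the highest probability.

    `topic_probs` is a list of (topic_id, probability) pairs.
    """
    return max(topic_probs, key=lambda pair: pair[1])[0]


# Made-up distributions for illustration only.
example = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
]
print([dominant_topic(doc) for doc in example])
```

In the notebook this could be applied as `[dominant_topic(ldamodel.get_document_topics(bow)) for bow in corpus]`. If the first five reviews (<em>The Parent Trap</em>) and the last five (<em>The Handmaiden</em>) do not fall into different topics, the model has not separated the films. With only ten short reviews the result can also change between runs; passing `random_state` to `LdaModel` makes a run repeatable.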