# HW5
## MSDS-7337
## Author: Taylor Bonar
---

In [1]:
from platform import python_version
import bs4
import pandas as pd
import requests
import re
import nltk

print(f"""Python Version: {python_version()}
NLTK v.{nltk.__version__}
BeautifulSoup v.{bs4.__version__}
Pandas v.{pd.__version__}
Requests v.{requests.__version__}
Re v. {re.__version__}""")

Python Version: 3.8.12
NLTK v.3.6.5
BeautifulSoup v.4.9.3
Pandas v.1.1.3
Requests v.2.26.0
Re v. 2.2.1




1.	Compile a list of static links (permalinks) to individual user movie reviews from one particular website. This will be your working dataset for this assignment, as well as for assignments 7 and 8.   
    * It does not matter if you use a crawler or if you manually collect the links, but you will need at least 100 movie review links. Note that, as of this writing, the robots.txt file of IMDB.com allows the crawling of user reviews.
    * Each link should be to a web page that has only one user review of only one movie, e.g., the user review permalinks on the IMDB site.
    * Choose reviews of movies that are all in the same genre, e.g., sci-fi, mystery, romance, superhero, etc.  
    * Make sure your collection includes reviews of several movies in your chosen genre and that it includes a mix of negative and positive reviews.  
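
The robots.txt note above can be checked programmatically before crawling. The following is a minimal sketch using the standard-library `urllib.robotparser`. The rules in `SAMPLE_ROBOTS_TXT` are a made-up stand-in, not IMDb's actual file, and `is_crawl_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the real IMDb robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search/
"""


def is_crawl_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


reviews_url = "https://www.imdb.com/title/tt0111161/reviews"
print(is_crawl_allowed(SAMPLE_ROBOTS_TXT, reviews_url))
```

Against the live site, the same parser can load the real file with `parser.set_url(...)` followed by `parser.read()` instead of parsing a string.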

In [2]:
from bs4 import BeautifulSoup

# Reference Tutorial: https://www.geeksforgeeks.org/scrape-imdb-movie-rating-and-details-using-python/
# Download the IMDb Top 250 movies' data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'lxml')

# Extract movie ratings and details via HTML tags
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]

# Create an empty list for storing movie information.
# It is named movie_records to avoid shadowing the built-in `list`.
movie_records = []

# Iterate over the movies to extract each movie's details
for index in range(len(movies)):

    # Separate the movie string into 'place', 'title', and 'year'
    movie_string = movies[index].get_text()
    movie = ' '.join(movie_string.split()).replace('.', '')
    rank = str(index + 1)  # chart position is 1-based, so use its length rather than len(str(index))
    movie_title = movie[len(rank) + 1:-7]
    year = re.search(r'\((.*?)\)', movie_string).group(1)  # raw string avoids invalid escape warnings
    place = movie[:len(rank)]

    data = {"movie_title": movie_title,
            "year": year,
            "place": place,
            "star_cast": crew[index],
            "rating": ratings[index],
            "vote": votes[index],
            "link": links[index]}
    movie_records.append(data)

# Print each movie's details with its rating.
for movie in movie_records:
    print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
          ') -', 'Starring:', movie['star_cast'], movie['rating'])

movies_df = pd.DataFrame(movie_records)

1 - The Shawshank Redemption (1994) - Starring: Frank Darabont (dir.), Tim Robbins, Morgan Freeman 9.239894831729572
2 - The Godfather (1972) - Starring: Francis Ford Coppola (dir.), Marlon Brando, Al Pacino 9.160892820128788
3 - The Dark Knight (2008) - Starring: Christopher Nolan (dir.), Christian Bale, Heath Ledger 8.9928709374644
4 - The Godfather: Part II (1974) - Starring: Francis Ford Coppola (dir.), Al Pacino, Robert De Niro 8.99008065987143
5 - 12 Angry Men (1957) - Starring: Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb 8.950433269319296
6 - Schindler's List (1993) - Starring: Steven Spielberg (dir.), Liam Neeson, Ralph Fiennes 8.939830086357944
7 - The Lord of the Rings: The Return of the King (2003) - Starring: Peter Jackson (dir.), Elijah Wood, Viggo Mortensen 8.927082253753648
8 - Pulp Fiction (1994) - Starring: Quentin Tarantino (dir.), John Travolta, Uma Thurman 8.859137513598528
9 - The Lord of the Rings: The Fellowship of the Ring (2001) - Starring: Peter Jackson (dir

In [4]:
movies_df.to_csv("top_250_movies.csv")

In [5]:
movies_df.head()

| | movie_title | year | place | star_cast | rating | vote | link |
|---|---|---|---|---|---|---|---|
| 0 | The Shawshank Redemption | 1994 | 1 | Frank Darabont (dir.), Tim Robbins, Morgan Fre... | 9.239894831729572 | | /title/tt0111161/ |
| 1 | The Godfather | 1972 | 2 | Francis Ford Coppola (dir.), Marlon Brando, Al... | 9.160892820128788 | | /title/tt0068646/ |
| 2 | The Dark Knight | 2008 | 3 | Christopher Nolan (dir.), Christian Bale, Heat... | 8.9928709374644 | | /title/tt0468569/ |
| 3 | The Godfather: Part II | 1974 | 4 | Francis Ford Coppola (dir.), Al Pacino, Robert... | 8.99008065987143 | | /title/tt0071562/ |
| 4 | 12 Angry Men | 1957 | 5 | Sidney Lumet (dir.), Henry Fonda, Lee J. Cobb | 8.950433269319296 | | /title/tt0050083/ |



2.	Extract noun phrase (NP) chunks from your reviews using the following procedure:
    * In Python, use BeautifulSoup to grab the main review text from each link.  
    * Next run each review text through a tokenizer, and then try to NP-chunk it with a shallow parser. 
    * You probably will have too many unknown words, owing to proper names of characters, actors, and so on that are not in your working dictionary. Make sure the main names that are relevant to the movies in your collection of reviews are added to the working lexicon, and then run the NP chunker again.
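
One way to add the main names to the working lexicon is to override the tagger's output for known names before chunking. The following is a minimal sketch, assuming the tokens have already been tagged (for example by `nltk.pos_tag`). `NAME_LEXICON` and `apply_name_lexicon` are hypothetical names, and the three surnames are taken from the cast list printed earlier in this notebook.

```python
# Hypothetical working lexicon: names that should always be tagged as proper nouns.
NAME_LEXICON = {"darabont", "robbins", "freeman"}


def apply_name_lexicon(tagged_tokens, lexicon=NAME_LEXICON):
    """Retag any token found in the lexicon as NNP; leave the other tags unchanged."""
    return [
        (word, "NNP") if word.lower() in lexicon else (word, tag)
        for word, tag in tagged_tokens
    ]


# Hand-written example tags (an assumption, not real tagger output).
tagged = [("freeman", "NN"), ("steals", "VBZ"), ("every", "DT"), ("scene", "NN")]
retagged = apply_name_lexicon(tagged)
print(retagged)
```

The retagged list can then be passed to the NP chunker, where a rule such as `<NNP>+` groups consecutive proper nouns.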


In [29]:
def get_review_headliner(title_url_link):
    """Return the headline text of the 25 reviews displayed on the reviews page of a given movie title."""
    url = f"https://www.imdb.com{title_url_link}reviews"  # IMDb review-page URL pattern
    response = requests.get(url)  # fetch the reviews page
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    # List comprehension over the review headline links for this title.
    # strip() removes the leading whitespace and trailing newline that IMDb's markup includes.
    return [headline.get_text().strip() for headline in soup.find_all('a', class_='title', href=True)]

shawshank_reviews = get_review_headliner(movies_df["link"].iloc[0])
godfather_reviews = get_review_headliner(movies_df["link"].iloc[1])
dark_knight_reviews = get_review_headliner(movies_df["link"].iloc[2])
godfather_II_reviews = get_review_headliner(movies_df["link"].iloc[3])

print(f"""Web-scraped {len(shawshank_reviews) + len(godfather_reviews) + len(dark_knight_reviews) + len(godfather_II_reviews)} reviews from top 4 movies:
({movies_df.head(4)['movie_title']})""")


Web-scraped 100 reviews from top 4 movies:
(0    The Shawshank Redemption
1               The Godfather
2             The Dark Knight
3      The Godfather: Part II
Name: movie_title, dtype: object)
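
The function above keeps only the headline text, while the assignment asks for one permalink per review. The same `a.title` anchors carry an `href`, which presumably points at the individual review, so those values could be turned into absolute permalinks along these lines. The sample hrefs, including the `rw...` IDs and the `ref_` query string, are invented for illustration, and `to_permalink` is a hypothetical helper.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.imdb.com"


def to_permalink(href, base_url=BASE_URL):
    """Resolve a relative href against base_url and drop the query string and fragment."""
    parts = urlsplit(urljoin(base_url, href))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Invented hrefs for illustration; real values would come from anchor["href"].
sample_hrefs = ["/review/rw0000001/?ref_=tt_urv", "/review/rw0000002/"]
permalinks = [to_permalink(href) for href in sample_hrefs]
print(permalinks)
```

Dropping the query string keeps each link stable, since tracking parameters are not part of the review's address.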


In [10]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [31]:
# Ch. 7.2.3 of nltk.org book
grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN>}  # chunk determiner/possessive (PRP$ in the Penn Treebank tagset used by nltk.pos_tag), adjectives and noun
      {<NNP>+}                # chunk sequences of proper nouns
"""

cp = nltk.RegexpParser(grammar)


english_stopwords = set(stopwords.words('english'))  # build the stopword set once, not once per token


def shallow_np_chunk_parse(review_headliners):
    """Given a list of review headliners, tokenize, filter, tag, and chunk each one,
    then return the list of noun-phrase (NP) subtrees found."""
    noun_phrases = []

    for review in review_headliners:
        review_tokens = nltk.word_tokenize(review)  # tokenize the input text
        # Remove common English stopwords (this also drops determiners such as "the")
        filtered_review_tokens = [token for token in review_tokens if token.lower() not in english_stopwords]
        tagged_tokens = nltk.pos_tag(filtered_review_tokens)  # tag each token with its part of speech
        chunks = nltk.chunk.ne_chunk(tagged_tokens)  # group named entities with NLTK's recommended named-entity chunker
        result = cp.parse(chunks)  # apply the RegexpParser grammar to find noun-phrase subtrees
        for subtree in result.subtrees(filter=lambda t: t.label() == 'NP'):  # keep only the NP subtrees
            noun_phrases.append(subtree)

    return noun_phrases
    

In [32]:
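noun_phrases = shallow_np_chunk_parse(shawshank_reviews)
# Note: append() adds each remaining movie's chunks as one nested list, so
# noun_phrases holds the Shawshank chunks followed by three sub-lists.
# Use extend() instead if a single flat list is wanted.
noun_phrases.append(shallow_np_chunk_parse(godfather_reviews))
noun_phrases.append(shallow_np_chunk_parse(dark_knight_reviews))
noun_phrases.append(shallow_np_chunk_parse(godfather_II_reviews))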



3.	Output all the chunks in a single list for each review, and submit that output for this assignment. Also submit a brief written summary of what you did (describe your selection of genre, your source of reviews, how many you collected, and by what means).
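
For the submitted output, each chunk can be reduced to a plain phrase string. The following is a minimal sketch, assuming each chunk is available as a list of `(word, tag)` pairs, which is the shape `Tree.leaves()` returns for the NP subtrees. `chunk_to_phrase` is a hypothetical helper, and the sample chunks are copied from this notebook's Shawshank results.

```python
def chunk_to_phrase(leaves):
    """Join the words of one chunk's (word, tag) pairs into a single phrase string."""
    return " ".join(word for word, _tag in leaves)


# Three chunks from the Shawshank results, written as plain (word, tag) lists.
sample_chunks = [
    [("great", "JJ"), ("story", "NN")],
    [("convicted", "JJ"), ("murderer", "NN")],
    [("Time", "NNP"), ("Pressure", "NNP")],
]
phrases = [chunk_to_phrase(chunk) for chunk in sample_chunks]
print(phrases)
```

With the real trees, the call would be `chunk_to_phrase(tree.leaves())` for each NP subtree.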


In [35]:
from pprint import pprint
pprint(noun_phrases)

[Tree('NP', [('All-time', 'JJ'), ('prison', 'NN')]),
 Tree('NP', [('film', 'NN')]),
 Tree('NP', [('depth', 'NN')]),
 Tree('NP', [('great', 'JJ'), ('story', 'NN')]),
 Tree('NP', [('incredible', 'JJ'), ('movie', 'NN')]),
 Tree('NP', [('convicted', 'JJ'), ('murderer', 'NN')]),
 Tree('NP', [('sound', 'JJ'), ('financial', 'JJ'), ('planning', 'NN')]),
 Tree('NP', [('hope', 'NN')]),
 Tree('NP', [('Time', 'NNP')]),
 Tree('NP', [('Time', 'NNP'), ('Pressure', 'NNP')]),
 Tree('NP', [('extraordinary', 'JJ'), ('unforgettable', 'JJ'), ('film', 'NN')]),
 Tree('NP', [('bank', 'NN')]),
 Tree('NP', [('veep', 'NN')]),
 Tree('NP', [('prison', 'NN')]),
 Tree('NP', [('genre', 'JJ'), ('picture', 'NN')]),
 Tree('NP', [('Redemption-', 'NNP')]),
 Tree('NP', [('*', 'NNP'), ('*', 'NNP'), ('*', 'NNP')]),
 Tree('NP', [('*', 'NN')]),
 Tree('NP', [('Redemption', 'NNP')]),
 Tree('NP', [('Relentless', 'NNP')]),
 Tree('NP', [('movie', 'NN')]),
 Tree('NP', [('masterpiece', 'NN')]),
 [Tree('NP', [('multi-Oscar-winner', 'N