# Movie review sentiment analysis

*From [Nifty Assignments](http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/)*

**About**

This assignment uses movie reviews from the Rotten Tomatoes database to do some simple sentiment analysis. Students will write programs that use the review text and a manually labeled review score to automatically learn how negative or positive the connotations of a particular word are. This can then be used to predict the sentiment of new text with reasonably good results. For example, student programs will be able to read text like this:

*The film was a breath of fresh air.*

and predict that it is a positive review while predicting negative sentiment for text like this:

*It made me want to poke out my eyeballs.*

The data (with some pre-processing from us) is from a [Sentiment Analysis project at Stanford](https://nlp.stanford.edu/sentiment/) (which used a much more sophisticated algorithm) and has been used for a [Kaggle machine learning competition](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews).

We have provided two examples of projects based on this idea that we have used in a CS 1 course and a CS 2 course, though there are many extensions that could be made for these or other higher-level courses.

**Materials**
- [Movie review data file](http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/movieReviews.txt). We removed all of the partial reviews from the Kaggle data and reformatted it to make it a little easier for students to read into their programs.
- [CS 1 Assignment Handout](http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/CS1Project.doc). In this assignment, students use the data to determine the sentiment of individual words and practice common early CS 1 concepts like control structures, file I/O, accumulators/counters, min/max algorithm, and methods.
- [CS 1 Starter Code](http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/CS1SentimentStarterCode.zip). This code shows how to read the different fields of the movie review data and search for words within reviews. This is short and can be developed live with students or given ahead of time.
- [CS 2 Assignment Handout](http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/CS2Project.doc). In this assignment, students predict the sentiment of larger pieces of text. The assignment requires appropriate data structures (e.g. hash tables, custom classes) to increase the search speed and reduce the need for excessive file access.
- [CS 2 Starter Code](http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/CS2SentimentStarterCode.zip). This code shows how to read the movie review data. It also provides the .h files for the custom class and hash table functions that need to be implemented.

## Movie Review Sentiment Analysis (CS1)

Sentiment Analysis is a Big Data problem which seeks to determine the general attitude of a writer given some text they have written. For instance, we would like to have a program that could look at the text “The film was a breath of fresh air” and realize that it was a positive statement while “It made me want to poke out my eye balls” is negative. 

One algorithm that we can use for this is to assign a numeric value to any given word based on how positive or negative that word is and then score the statement based on the values of the words. But, how do we come up with our word scores in the first place?

That’s the problem that we’ll solve in this assignment. You are going to search through a file containing movie reviews from the Rotten Tomatoes website which have both a numeric score as well as text. You’ll use this to learn which words are positive and which are negative. 

Note that each review starts with a number 0 through 4 with the following meaning:
- 0 : negative
- 1 : somewhat negative
- 2 : neutral
- 3 : somewhat positive
- 4 : positive

1. (30 points) For the base assignment, you will ask the user to enter a word, and then you will search every movie review for that word. If you find it, add the score for that review to the word’s running score total (i.e., an accumulator variable). You also will need to keep track of how many appearances the word made so that you can report the average score of reviews containing that word back to the user.

In [30]:
import urllib.request

def cs11():

    # Input a word
    input_word = input('Enter a word: ')

    # URL to the reviews file
    reviews_url = 'http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/movieReviews.txt'

    # Initialize the word score and counter
    word_score = 0
    word_counter = 0

    # Open the file from the URL
    reviews_file = urllib.request.urlopen(reviews_url)
    
    # Loop over the lines in the file
    for review_line in reviews_file:
        
        # Decode the line
        decoded_line = review_line.decode("utf-8")
        
        # If the word is present in the line (case insensitive)
        if input_word.lower() in decoded_line.lower():

            # Upate the score and counter
            word_score = word_score + int(decoded_line[0])
            word_counter = word_counter + 1

    # If the word was found at least once in the file
    if word_counter != 0:
        
        # Compute the average score
        word_score = word_score/word_counter
        
    else:
        
        # Else, return a NaN
        word_score = float('NaN')

    # Print the results
    print(input_word + ' appears ' + str(word_counter) + ' times.')
    print('The average score for reviews containing the word ' + input_word + ' is ' + str(word_score))

In [32]:
# fantastic
cs11()

Enter a word: fantastic
fantastic appears 14 times.
The average score for reviews containing the word fantastic is 2.9285714285714284


In [3]:
# horrible
cs11()

Enter a word: horrible
horrible appears 12 times.
The average score for reviews containing the word horrible is 0.5833333333333334


In [4]:
# ok
cs11()

Enter a word: ok
ok appears 466 times.
The average score for reviews containing the word ok is 1.9527896995708154


**Explanations**:

This one is pretty straighforward. We open the file from the URL, we loop over the lines, and we check if the word is present in each line; if it is, we update a counter and a score. We finally compute the average score provided the word was found at least once in the file.

2. (10 points) For an additional 10 points, ask the user to give you the name of a file containing a series of words, one-per-line, and compute the score of every word in the file. Report back to the user the average score of the words in the file. This will allow you to predict the overall sentiment of the phrase represented by words in the file. Consider an average word score above 2.01 as an overall positive sentiment and consider average score below 1.99 to have an overall negative sentiment. 

In [70]:
import urllib.request

def cs12():

    # Input a file
    words_file = input('Enter the name of the file with words you want to score: ')
    
    # URL to the reviews file
    reviews_url = 'http://nifty.stanford.edu/2016/manley-urness-movie-review-sentiment/movieReviews.txt'

    # Open the reviews file from the URL
    reviews_file = urllib.request.urlopen(reviews_url)
    
    # Initialize the overall score and counter
    overall_score = 0
    overall_counter = 0
            
    # Open the words file
    with open(words_file, 'r') as words_open:
        
        # Loop over the lines in the file
        for word_line in words_open:
            
            # Initialize the word score and counter
            word_score = 0
            word_counter = 0
            
            # Loop over the lines in the file
            for review_line in reviews_file:
                
                # Decode the line
                review_decoded = review_line.decode("utf-8")
                
                # If the current word is present in the current line (case insensitive)
                # (also remove spaces to the right of the string)
                if word_line.rstrip().lower() in review_decoded.lower():
                    
                    # Update the word score and counter
                    word_score = word_score + int(review_decoded[0])
                    word_counter = word_counter + 1
                
            # If the word was found at least once in the file
            if word_counter != 0:

                # Compute the average word score
                word_score = word_score/word_counter

                # Update the overall score and counter
                overall_score = overall_score + word_score
                overall_counter = overall_counter + 1
    
    # If the words were counted at least once in the file
    if overall_counter != 0:
        
        # Compute the average overall score
        overall_score = overall_score/overall_counter
        
    else:
        
        # Else, return a NaN
        overall_score = float('NaN')

    # Estimate the overall sentiment
    if overall_score > 2.01:
        overall_sentiment = 'positive'
    elif overall_score < 1.99:
        overall_sentiment = 'negative'
    else:
        overall_sentiment = 'neutral'

    # Print the results
    print('The average score of words in ' + words_file + ' is ' + str(overall_score))
    print('The overal sentiment of ' + words_file + ' is ' + overall_sentiment)

In [71]:
# C:\Users\raza7002\Documents\GitHub\Python-Problems\negTest.txt
cs12()

Enter the name of the file with words you want to score: C:\Users\raza7002\Documents\GitHub\Python-Problems\negTest.txt
The average score of words in C:\Users\raza7002\Documents\GitHub\Python-Problems\negTest.txt is 2.014556040756914
The overal sentiment of C:\Users\raza7002\Documents\GitHub\Python-Problems\negTest.txt is positive


**Explanations**:

This one is also pretty straighforward. We open the file, we loop over the words, and we basically repeat the process described in 1. for each word. Each time, we update an overall score using the word score and an overall counter. We finally compute the average score and estimate the overall sentiment.

3. (10 points) For an additional 10 points, ask the user to give you the name of a file containing a series of words, one-per-line, and compute the score of every word in the file. Report back to the user which word was the most positive and which was the most negative. 

In [7]:
def cs13():
    
    # Input file
    words_file = input('Enter the name of the file with words you want to score: ')
    
    # Reviews file
    reviews_file = 'C:/Users/raza7002/Downloads/movieReviews.txt'
    
    # Initialize the overall dictionary
    overall_dictionary = dict()

    # Open the words file
    with open(words_file, 'r') as words_file1:
        
        # Loop over the lines in the file
        for word_line in words_file1:

            # Initialize the word score and counter
            word_score = 0
            word_counter = 0

            # Open the reviews file
            with open(reviews_file, 'r') as reviews_file1:
                
                # Loop over the lines in the file
                for review_line in reviews_file1:
                    
                    # If the current word is found in the current line (case insensitive)
                    # (also remove spaces to the right of the string)
                    if word_line.rstrip().lower()in review_line.lower():
                        
                        # Update the word score and counter
                        word_score = word_score + int(review_line[0])
                        word_counter = word_counter + 1
            
            # If the word was counted at least once in the file
            if word_counter != 0:
                
                # Compute the average word score
                word_score = word_score/word_counter
                
            else:
                
                # Else, return a NaN
                word_score = float('NaN')

            # Store the word score in the dictionary
            overall_dictionary[word_line.rstrip()] = word_score
    
    # Find the most positive and negative words
    positive_word = max(overall_dictionary, key=overall_dictionary.get)
    negative_word = min(overall_dictionary, key=overall_dictionary.get)

    # Print the results
    print('The most positive word, with a score of ' + str(overall_dictionary[positive_word]) + ' is ' + positive_word)
    print('The most negative word, with a score of ' + str(overall_dictionary[negative_word]) + ' is ' + negative_word)

In [8]:
cs13()

Enter the name of the file with words you want to score: C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt
The most positive word, with a score of 3.8333333333333335 is tears
The most negative word, with a score of 0.125 is incoherent


4. (10 points) For an additional 10 points, add functionality that will ask the user to enter a word file like in the previous step, but instead of reporting the best and the worst word, create two files called positive.txt and negative.txt, sorting words that have scores below 1.9 into negative.txt, and words that have scores above 2.1 into positive.txt (and just leave out words in between).

In [9]:
def cs14():
    
    # Input file
    words_file = input('Enter the name of the file with words you want to score: ')
    
    # Reviews file
    reviews_file = 'C:/Users/raza7002/Downloads/movieReviews.txt'
    
    # Initialize the overall dictionary
    overall_dictionary = dict()

    # Open the words file
    with open(words_file, 'r') as words_file1:
        
        # Loop over the lines in the file
        for word_line in words_file1:

            # Initialize the word score and counter
            word_score = 0
            word_counter = 0

            # Open the reviews file
            with open(reviews_file, 'r') as reviews_file1:
                
                # Loop over the lines in the file
                for review_line in reviews_file1:
                    
                    # If the current word is found in the current line (case insensitive)
                    # (also remove spaces to the right of the string)
                    if word_line.rstrip().lower()in review_line.lower():
                        
                        # Update the word score and counter
                        word_score = word_score + int(review_line[0])
                        word_counter = word_counter + 1
            
            # If the word was counted at least once in the file
            if word_counter != 0:
                
                # Compute the average word score
                word_score = word_score/word_counter
                
            else:
                
                # Else, return a NaN
                word_score = float('NaN')

            # Store the word score in the dictionary
            overall_dictionary[word_line.rstrip()] = word_score
    
    # Create positive and negative dictionaries, sorted
    positive_dictionary = dict((key, value) for key, value in overall_dictionary.items() if value >= 2.1)
    positive_dictionary = sorted(positive_dictionary, key=positive_dictionary.get)
    negative_dictionary = dict((key, value) for key, value in overall_dictionary.items() if value <= 1.9)
    negative_dictionary = sorted(negative_dictionary, key=negative_dictionary.get)

    # Print the results
    #print('Sorted positive dictionary (in ascending order) :'+ str(positive_dictionary))
    #print('Sorted negative dictionary (in ascending order) :'+ str(negative_dictionary))
    
    # Write the results
    with open('positive.txt','w') as positive_file:
        print(positive_dictionary, file=positive_file)
    with open('negative.txt','w') as negative_file:
        print(negative_dictionary, file=negative_file)

In [10]:
cs14()

Enter the name of the file with words you want to score: C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt


5. (5 points) Put the code from the above three parts (or two or one, depending on how many you attempted) into their own methods and call them as appropriate.

In [11]:
# Public class without a constructor
class cs1:
    
    # Static methods without a self
    @staticmethod
    def method1():
        cs11()
        
    @staticmethod
    def method2():
        cs12()
        
    @staticmethod
    def method3():
        cs13()
        
    @staticmethod
    def method4():
        cs14()

In [12]:
cs1.method1()

Enter a word: fantastic
fantastic appears 14 times.
The average score for reviews containing the word fantastic is 2.9285714285714284


In [13]:
cs1.method2()

Enter the name of the file with words you want to score: C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt
The average score of words in C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt is 1.9342285798738992
The overal sentiment of C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt is negative


In [14]:
cs1.method3()

Enter the name of the file with words you want to score: C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt
The most positive word, with a score of 3.8333333333333335 is tears
The most negative word, with a score of 0.125 is incoherent


In [15]:
cs1.method4()

Enter the name of the file with words you want to score: C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt


6. (5 points) Create a menu that allows the user to pick the functionality that they want from the choices. When finished with it, present the menu again until the user chooses to exit.

In [16]:
condition = True
while condition:

    # Menu
    print('What would you like to do?')
    print('1: Get the score of a word')
    print('2: Get the average score of words in a file (one word per line)')
    print('3: Find the highest/lowest scoring words in a file')
    print('4: Sort words from a file into positive.txt and negative.txt')
    print('5: Exit the program')
    input_value = input('Enter a number 1:5: ')
    print('')
    
    input_value = int(input_value)
    if input_value == 1:
        cs1.method1()
        print('')
    elif input_value == 2:
        cs1.method2()
        print('')
    elif input_value == 3:
        cs1.method3()
        print('')
    elif input_value == 4:
        cs1.method4()
        print('')
    elif input_value == 5:
        condition = False
        pass
    else:
        print('The input must be a number 1-5.')
        print('')

What would you like to do?
1: Get the score of a word
2: Get the average score of words in a file (one word per line)
3: Find the highest/lowest scoring words in a file
4: Sort words from a file into positive.txt and negative.txt
5: Exit the program
Enter a number 1:5: 1

Enter a word: fantastic
fantastic appears 14 times.
The average score for reviews containing the word fantastic is 2.9285714285714284

What would you like to do?
1: Get the score of a word
2: Get the average score of words in a file (one word per line)
3: Find the highest/lowest scoring words in a file
4: Sort words from a file into positive.txt and negative.txt
5: Exit the program
Enter a number 1:5: 2

Enter the name of the file with words you want to score: C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt
The average score of words in C:/Users/raza7002/Downloads/CS1SentimentStarterCode/wordList.txt is 1.9342285798738992
The overal sentiment of C:/Users/raza7002/Downloads/CS1SentimentStarterCode/word