## Project 2 Code
#### COMP135
#### Sophia Zhang
#### August 14, 2021

Submission details <br>
1. Collaborators information: Submit the usual COLLABORATORS.txt file, containing your
name, amount of time spent on the project, and persons/resources consulted.
2. PDF: Submit the PDF containing all the required figures and discussion.
3. Predictions: Submit text-files containing predictions on the test data to the leaderboard.
### Import required libraries.

In [4]:
import os
import numpy as np
import pandas as pd

import warnings

import sklearn.linear_model
import sklearn.metrics
import sklearn.calibration

from matplotlib import pyplot as plt
import matplotlib
import seaborn as sns
%matplotlib inline
plt.style.use('seaborn') # pretty matplotlib plots (on matplotlib >= 3.6 this style is named 'seaborn-v0_8')

# imports and setup
from numpy.random import default_rng

#from sklearn.datasets import make_classification, load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, log_loss, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import minmax_scale

# For basic tokenizing and counting:
from sklearn.feature_extraction.text import CountVectorizer

# For nlp
import nltk
import re

You have been supplied with several thousand single-sentence reviews, collected from three domains: imdb.com, amazon.com, and yelp.com. Each review consists of a sentence, and has been assigned a binary label indicating the sentiment (1 for positive and 0 for negative) of that sentence. Your goal is to develop binary classifiers that can generate the sentiment-labels for new sentences, automating the assessment process. While the reviews were collected from websites where much of the content is in English, the reviews may well contain slang, spelling errors, foreign characters and the like, all of which make natural language data challenging, albeit fun, to try to classify like this.
The provided data consists of 2,400 training examples in the usual CSV x and y format.
<br>Input data has two columns, for the source-website and review text; outputs are given as binary values, where 1 indicates a positive review. There are also 600 testing inputs, for which no y-values are given; these will be used for validation against the Gradescope leaderboards. The Project download also contains a short script, load_train_data.py, that will give you guidance as to how you might load the data using Pandas (of course, you can load it in other ways as well if you so choose).<br>
Examples of positive reviews include:
<br>•(amazon) #1 It Works - #2 It is Comfortable.
<br>•(imdb) "Gotta love those close-ups of slimy, drooling teeth! " 
<br>• (yelp) Food was so gooodd.
<br>Examples of negative reviews include:
<br>•(amazon) DO NOT BUY DO NOT BUYIT SUCKS
<br>• (imdb) This is not movie-making.
<br>• (yelp) The service was poor and thats being nice.


### Load the training/testing data

In [5]:
#%run -i 'load_train_data.py'
x_test_df = pd.read_csv('data_reviews/x_test.csv') #600 samples

x_train_df = pd.read_csv('data_reviews/x_train.csv') #2400 samples
y_train_df = pd.read_csv('data_reviews/y_train.csv')

tr_text_list = x_train_df['text'].values.tolist()

#for text in tr_text_list:
    #print(text)

len(tr_text_list)

2400

### Notes
- Document how processing is done. Flowchart!
- Make the preprocessing pipeline as clear as possible; outline the steps with their motivation.
- Name the steps in order, describe/explain them, and cite reference sources, so that the work can be replicated.
- Appendix with the list of stopwords.
- Summarize the results for all models in a figure and/or table; describe the differences and similarities and what they mean, e.g. a bar chart of training/test accuracy ordered by test accuracy.
- Highlight/bold the best models/metrics.
- Vary at least 2 parameters and try at least 10 combinations, e.g.:
    - NN: 3 hidden-layer configurations, 2 activation functions (3 × 2 = 6 combinations, so a wider grid is needed to reach about 10)
    - Reg.: 2 solvers, 5 values of C (10 combinations)
    - Regression: 5 values of the elastic-net l1 control parameter, 5 values of C (25 combinations)

### Preprocessing
It is recommended that you preprocess your data, removing punctuation, non-English and non-text characters, and unifying the case (i.e., setting everything to be either upper- or lower-case). You will then investigate feature representations for converting strings of words in sentence form into feature vectors x_n of some common length n, and use those feature representations to build and compare a number of different types of models.
<br>
Source: https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk
<br>
https://realpython.com/python-nltk-sentiment-analysis/

### Steps

1. Tokenization: splitting strings into smaller parts called tokens (single words, bigrams, sentences etc.)
2. Unifying the case (set to lower-case)
3. Removing noise: punctuation, non-English and non-text characters
4. Normalizing: converting a word to its canonical form <br>
    a) Stemming removes affixes (word endings) from a word<br>
    b) Lemmatization reduces a word to its dictionary form (lemma), using its context in the text, such as its part of speech <br>
    We will use lemmatization, which preserves more context/accuracy at the cost of speed <br>
5. Removing stopwords: very common words that add little meaning or information to the data
6. Determining word density: finding the most common words using the FreqDist class of NLTK
7. Removing very rare words: words that occur fewer than 3 times in the training data
8. Finding collocations (bigrams, trigrams, or quadgrams): we will use bigrams, i.e. frequent two-word combinations
<br>
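
The steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual pipeline: the tiny `STOP_WORDS` set, the helper names (`preprocess`, `bigrams`, `build_vocabulary`), and the three example sentences are all assumptions made for the example, and the normalization step (stemming/lemmatization) is left out because it needs NLTK.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the notebook defines a longer one).
STOP_WORDS = {"the", "is", "it", "and", "was", "a", "to"}

def preprocess(sentence):
    """Lowercase, replace non-letters with spaces, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-z\s]", " ", sentence.lower())
    return [tok for tok in cleaned.split() if tok not in STOP_WORDS]

def bigrams(tokens):
    """Return adjacent token pairs, each joined by a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def build_vocabulary(token_lists, min_count=3):
    """Keep only tokens that occur at least `min_count` times across the corpus."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return sorted(tok for tok, n in counts.items() if n >= min_count)

# Hypothetical mini-corpus standing in for the training reviews.
docs = ["The food was GOOD!!", "Good food, good service.", "It is not good."]
tokens = [preprocess(d) for d in docs]
vocab = build_vocabulary(tokens, min_count=3)

print(tokens)
print(bigrams(tokens[1]))
print(vocab)
```

Note that the rare-word threshold is applied to counts over the whole corpus, so it should be computed on the training data only and then reused unchanged for the test data.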


### 1. Tokenization
**Ans** Basic tokenizing produces 4510 features, which is far too many; the number of features needs to be reduced.

In [27]:
from nltk.tokenize import word_tokenize

text = tr_text_list[0]
print(word_tokenize(text))

['Oh', 'and', 'I', 'forgot', 'to', 'also', 'mention', 'the', 'weird', 'color', 'effect', 'it', 'has', 'on', 'your', 'phone', '.']


In [6]:
features = tr_text_list
labels = y_train_df

processed_features = []

for sentence in range(len(tr_text_list)):
    # Remove all the special characters
    processed_feature = re.sub(r'\W', ' ', str(features[sentence]))

    # remove all single characters
    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Substituting multiple spaces with single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    
    # Replace anything that is not a letter or hyphen (digits are removed too). #4534
    processed_feature = re.sub(r"[^A-Za-z\-]", " ", processed_feature)
    
    # Converting to Lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

In [22]:
processed_features

['oh and forgot to also mention the weird color effect it has on your phone ',
 'that one didn work either ',
 'waste of    bucks ',
 'product is useless since it does not have enough charging current to charge the   cellphones was planning to use it with ',
 'none of the three sizes they sent with the headset would stay in my ears ',
 'worst customer service ',
 'the ngage is still lacking in earbuds ',
 'it always cuts out and makes beep beep beep sound then says signal failed ',
 'the only very disappointing thing was there was no speakerphone ',
 'very disappointed in accessoryone ',
 'basically the service was very bad ',
 'bad choice ',
 'the only thing that disappoint me is the infra red port irda ',
 'horrible had to switch   times ',
 'it feels poorly constructed the menus are difficult to navigate and the buttons are so recessed that it is difficult to push them ',
 'don make the same mistake did ',
 'muddy low quality sound and the casing around the wire insert was poorly su

In [23]:
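# Speller is assumed to come from the `autocorrect` package (pip install autocorrect)
from autocorrect import Speller

# Autocorrect each review token by token
autocorrected=[]
spell = Speller()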
for text in processed_features:
    spells = [spell(w) for w in (nltk.word_tokenize(text))]
    autocorrected.append(spells)

len(autocorrected)

2400

In [24]:
# Variant: autocorrect each review as a whole sentence instead of token by token
autocorrected=[]
spell = Speller()
for text in processed_features:
    spells = spell(text)
    autocorrected.append(spells)

len(autocorrected)

2400

In [25]:
autocorrected

['oh and forgot to also mention the weird color effect it has on your phone ',
 'that one didn work either ',
 'waste of    bucks ',
 'product is useless since it does not have enough charging current to charge the   cellphone was planning to use it with ',
 'none of the three sizes they sent with the headset would stay in my ears ',
 'worst customer service ',
 'the engage is still lacking in earbuds ',
 'it always cuts out and makes been been been sound then says signal failed ',
 'the only very disappointing thing was there was no speakerphone ',
 'very disappointed in accessoryone ',
 'basically the service was very bad ',
 'bad choice ',
 'the only thing that disappoint me is the infra red port rida ',
 'horrible had to switch   times ',
 'it feels poorly constructed the menus are difficult to navigate and the buttons are so recessed that it is difficult to push them ',
 'don make the same mistake did ',
 'muddy low quality sound and the casing around the wire insert was poorly su

In [11]:
count_vectorizer = CountVectorizer()
x = count_vectorizer.fit_transform(processed_features)
features = count_vectorizer.get_feature_names()
#print("Number of features: ", len(features))
#print(features) #4434
x.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [47]:
# Stop_word + stemming tokenizer
import re # https://docs.python.org/3/library/re.html

#from nltk.stem.porter import PorterStemmer
#porter_stemmer = PorterStemmer()

#from nltk.corpus import stopwords

def stop_stemming_tokenizer(str_input):
    #removing punctuation, non-English and non-text characters, 
    #and unifying the case (to lower-case)
    # Replace anything that is not a letter or hyphen (digits are removed too).
    # Remove the "\-" to see the effect.
    #words = re.sub(r"[^A-Za-z\-]", " ", str_input).lower().split()
    #words = re.sub(r"[^A-Za]", " ", str_input).lower().split()
    
    
    # Remove all the special characters
    #processed_feature = re.sub(r'\W', ' ', str_input)
    
    #processed_feature = processed_feature.lower().split()
    
    # remove all single characters
    #processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove single characters from the start
    #processed_feature = re.sub(r'\^[a-zA-Z]\s+', ' ', processed_feature) 

    # Substituting multiple spaces with single space
    #processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    
    # Replace anything that is not a letter or hyphen (digits are removed too). #4534
    processed_feature = re.sub(r"[^A-Za-z\-]", " ", str_input)
    
    # remove all single characters #4523
    processed_feature= re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    
    # Converting to Lowercase
    processed_feature = processed_feature.lower().split()
    
    
    # remove stop words 
    #words = [w for w in words if w not in stop_words]
    
    # stem the remaining words after removing stops; 
    # what happens if we do this *before* removing stops?
    #porter_stemmer = PorterStemmer()
    #words = [porter_stemmer.stem(word) for word in words]
    
    return processed_feature #words

count_vectorizer = CountVectorizer(tokenizer=stop_stemming_tokenizer)
x = count_vectorizer.fit_transform(tr_text_list)
features = count_vectorizer.get_feature_names()
print("Number of features: ", len(features))
print(features) #4528, #4452

Number of features:  4523


In [22]:
len(features)

4528

In [None]:
#Change text to lower case

def to_lower(text):
    """
    Converting text to lower case as in, converting "Hello" to  "hello" or "HELLO" to "hello".
    """
    return ' '.join([w.lower() for w in word_tokenize(text)])

In [11]:
# Simplest basic tokenizing and counting:

count_vectorizer = CountVectorizer(stop_words=stop_words)
x = count_vectorizer.fit_transform(tr_text_list)
x
#print("Number of features: ", len(x.toarray()[0]))
print(count_vectorizer.get_feature_names())
#print(x.toarray())
#pd.DataFrame(x.toarray(), columns=count_vectorizer.get_feature_names())



In [None]:
import re, string
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):
        # Raw strings avoid invalid-escape warnings for sequences such as "\("
        token = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'
                       r'(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', token)
        token = re.sub(r"(@[A-Za-z0-9_]+)", "", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

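# `tweet_tokens` is not defined in this notebook; tokenize the first training review instead
print(remove_noise(word_tokenize(tr_text_list[0]), stop_words))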


In [8]:

stop_stemming_tokenizer(tr_text_list[0])

['oh', 'forgot', 'also', 'mention', 'weird', 'color', 'effect', 'phone']

### Part One: Classifying Review Sentiment with Bag-of-Words Features (70 points)
The “Bag-of-Words” (BoW) model of a document (i.e., in this case, a single review) involves determining a known fixed vocabulary, V, in advance, imposing an order on those words, and then representing each document with a vector of length |V| that has a non-zero value at position i if the ith word in V is part of that document, and is 0 otherwise.† 
<br> †You can find some discussion of this in the material on clustering, from the end of the course, which has already been released.
<br> You will build such a representation for your input data (train and test). Your first step will be to make some design decisions with respect to how your BoW model works; questions you will need to answer may include:
<br>• How big is the vocabulary, and what order do you place those words into?
<br>• Do you exclude very rare words (and what does “very rare” mean)?
<br>• Do you exclude very common words (and what does “very common” mean)?
<br>• Do you count the occurrences of a word in the document, or only record if it is there or not (producing a binary vector)?
<br>• Is it worth using something other than word counts, like the inverse document frequency idea described in lecture?
<br>• Do you use single features only, or do you try counting word-pairs instead? What about counting n-tuples of words?
<br>Whatever you decide (and you may want to experiment) you want a representation whereby each feature of the resulting input vector corresponds to a single word (or n-tuples of words, if you go that route). Once you have decided upon your feature representation, you will investigate three distinct classifier models on the data, seeking one that gives you best performance. <br>
Resources: there are several tools available in sklearn for creating BoW features: [https://scikit-learn.org/stable/modules/feature_extraction.html]
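
As a rough illustration of the representation described above, the sketch below builds a count vector and a binary vector for one document over a fixed, ordered vocabulary. The function name `bow_vector`, the four-word vocabulary, and the example sentence are hypothetical; in practice `CountVectorizer` (with or without `binary=True`) does the same job.

```python
from collections import Counter

def bow_vector(tokens, vocabulary, binary=False):
    """Map a tokenized document onto a vector of length |V|.

    Position i holds the count of vocabulary[i] in the document,
    or just 0/1 if `binary` is True.
    """
    counts = Counter(tokens)
    if binary:
        return [1 if counts[w] > 0 else 0 for w in vocabulary]
    return [counts[w] for w in vocabulary]

# Hypothetical fixed vocabulary; sorting imposes the order on the words.
vocabulary = sorted({"good", "food", "service", "poor"})
doc = "good food good service".split()

count_vec = bow_vector(doc, vocabulary)
binary_vec = bow_vector(doc, vocabulary, binary=True)
print(vocabulary)
print(count_vec)
print(binary_vec)
```

Words outside the vocabulary are simply ignored, which is also what happens to unseen test-set words once the vocabulary has been fixed on the training data.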

**1. (10 pts.)** 
In your report, include a paragraph or two that explain the “pipeline” for generating your BoW features. This should include a clear description of any pre-processing you did on the basic text, along with the sorts of decisions you made in generating your final feature vectors. You should present this in complete enough form that someone else (another student, say) could produce a model identical to yours if they wished, based upon reading your report. As we have said before, keep code samples to a minimum; ideally, you should be able to explain what you did in plain language. Your paragraph should also contain some justification for why you made the decisions you did.

**ANS**
Why word counts/binary indicators can be a better fit than tf–idf for short reviews:
"While the tf–idf normalization is often very useful, there might be cases where the binary occurrence markers might offer better features. This can be achieved by using the binary parameter of CountVectorizer. In particular, some estimators such as Bernoulli Naive Bayes explicitly model discrete boolean random variables. Also, very short texts are likely to have noisy tf–idf values while the binary occurrence info is more stable." [https://scikit-learn.org/stable/modules/feature_extraction.html]
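
To make the contrast concrete, here is a small standard-library sketch that computes both representations for very short documents. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn's tf–idf uses by default; the four two-word documents and the helper names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical two-word "reviews", already tokenized.
docs = [["bad", "service"], ["good", "service"], ["good", "food"], ["good", "phone"]]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)
# Document frequency: in how many documents each word appears.
df = Counter(w for d in docs for w in set(d))

def idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[word])) + 1.0

def tfidf_vector(tokens):
    """Term count times idf, then scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf(w) for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

def binary_vector(tokens):
    """1 if the vocabulary word occurs in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

print(vocab)
print(binary_vector(docs[0]))
print([round(v, 3) for v in tfidf_vector(docs[0])])
```

With only two words per document, the tf–idf weights depend entirely on how often each word happens to appear elsewhere in the corpus, whereas the binary vector is unaffected by those corpus statistics.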

### Removing stop-words
There are simple methods of removing stop-words in text.  This will tend to reduce overall number of features by simply ignoring extremely common words.

In [4]:
#Create a list of stopwords
stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',   'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'more',  'other', 'some', 'such', 'own', 'so', 'than', 'too',  's', 't', 'can', 'will', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y']

In [12]:
# Simplest basic tokenizing and counting:

count_vectorizer = CountVectorizer(stop_words=stop_words)
x = count_vectorizer.fit_transform(tr_text_list)
print("Number of features: ", len(x.toarray()[0]))
#print(count_vectorizer.get_feature_names())
#print(x.toarray())
pd.DataFrame(x.toarray(), columns=count_vectorizer.get_feature_names())

Number of features:  4400


Unnamed: 0,00,10,100,11,12,13,15,15g,15pm,17,...,youtube,yucky,yukon,yum,yummy,yun,z500a,zero,zillion,zombie
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2395,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2396,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2397,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2398,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Creating our own tokenizers
We can create our own methods of tokenizing words if there are things beyond the default that we want to do with text.  This can allow us to do specialized data pre-processing first if we like.

In [35]:
# Stop_word + stemming tokenizer
import re # https://docs.python.org/3/library/re.html

from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

#from nltk.corpus import stopwords

def stop_stemming_tokenizer(str_input):
    #removing punctuation, non-English and non-text characters, 
    #and unifying the case (to lower-case)
    # Replace anything that is not a letter or hyphen (digits are removed too).
    # Remove the "\-" to see the effect.
    words = re.sub(r"[^A-Za-z\-]", " ", str_input).lower().split()
    #words = re.sub(r"[^A-Za]", " ", str_input).lower().split()
    
    # remove stop words 
    words = [w for w in words if w not in stop_words]
    
    # stem the remaining words after removing stops; 
    # what happens if we do this *before* removing stops?
    porter_stemmer = PorterStemmer()
    words = [porter_stemmer.stem(word) for word in words]
    
    return words

count_vectorizer = CountVectorizer(tokenizer=stop_stemming_tokenizer)
x = count_vectorizer.fit_transform(tr_text_list)
features = count_vectorizer.get_feature_names()
print("Number of features: ", len(features))
print(features)

Number of features:  3515
['-', '--', '-drink', '-good', '-mi', '-year', 'abandon', 'abhor', 'abil', 'abl', 'abound', 'abroad', 'absolut', 'absolutel', 'absolutley', 'abstrus', 'abysm', 'ac', 'academi', 'accent', 'accept', 'access', 'accessoryon', 'accid', 'accident', 'acclaim', 'accolad', 'accomod', 'accompani', 'accur', 'accus', 'ach', 'achiev', 'achil', 'ackerman', 'acknowledg', 'act', 'acting--even', 'acting-wis', 'action', 'activ', 'actor', 'actress', 'actual', 'ad', 'adapt', 'add', 'addit', 'adhes', 'admin', 'admit', 'ador', 'adrift', 'adventur', 'advertis', 'advis', 'aerial', 'aesthet', 'affect', 'affleck', 'afford', 'afraid', 'africa', 'afternoon', 'age', 'ago', 'agre', 'ahead', 'aimless', 'air', 'airlin', 'akin', 'ala', 'alarm', 'albondiga', 'alexand', 'alik', 'all-star', 'allot', 'allow', 'almond', 'almost', 'alon', 'along', 'alongsid', 'alot', 'alreadi', 'also', 'although', 'aluminum', 'alway', 'amateurish', 'amaz', 'amazingli', 'amazon', 'ambianc', 'ambienc', 'america', 'am

**2. (15 pts.)** Generate a logistic regression model for your feature-data and use it to classify the training data. In your report:
<br>• Give a few sentences describing the model you built, and any decision made about how you set its parameters, trained it, etc.
<br>• Choose at least two hyperparameters that control model complexity and/or its tendency to overfit. Vary those hyperparameters in a systematic way, testing it using a cross-validation methodology (you can use libraries that search through and cross-validate different hyperparameters here if you like). Explain the hyperparameters you chose, the range of values you explored (and why), and describe the cross-validation testing in a clear enough manner that the reader could reproduce its basic form, if desired.
<br>• Produce at least one figure that shows, for at least two tested hyperparameters, performance for at least 5 distinct values—this performance should be plotted in terms of average error for both training and validation data across the multiple folds, for each of the values of the hyperparameter. Include information, either in the figure, or along with it in the report, on the uncertainty in these results.‡ 
<br>‡This can be measured in terms of simply standard deviation across the k-fold cross-validation tests, or in more detail by showing exact performance metrics on each fold. The idea is to help the reader understand if the average performance is typical and stable, or if there is a lot of difference from one cross-validation test to another.
<br>• Give a few sentences analyzing these results. Are there hyperparameter settings for which the classifier clearly does better (or worse)? Is there evidence of over-fitting at some settings?
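
One possible way to set up such a search is sketched below with scikit-learn's `GridSearchCV`. Everything specific here is an assumption made for illustration: the toy corpus, the choice of `C` and `min_df` as the two hyperparameters, the 5-fold split, and the grid values. It is not the configuration used in this project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the real training reviews (hypothetical data).
positive = ["good great food", "loved the service", "works great", "excellent phone"]
negative = ["bad poor food", "hated the service", "does not work", "terrible phone"]
texts = positive * 5 + negative * 5
labels = [1] * 20 + [0] * 20

# Vectorizer and classifier in one pipeline, so each fold refits its own vocabulary.
pipeline = make_pipeline(CountVectorizer(binary=True),
                         LogisticRegression(max_iter=1000))

# Two hyperparameters: inverse regularization strength C (5 values)
# and the minimum document frequency for a word to enter the vocabulary (2 values).
param_grid = {
    "logisticregression__C": np.logspace(-2, 2, 5),
    "countvectorizer__min_df": [1, 2],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy",
                      return_train_score=True)
search.fit(texts, labels)

# Mean error (1 - accuracy) on training and validation folds, plus the fold-to-fold spread.
results = search.cv_results_
for params, train_acc, valid_acc, valid_std in zip(
        results["params"], results["mean_train_score"],
        results["mean_test_score"], results["std_test_score"]):
    print(params, f"train err={1 - train_acc:.3f}",
          f"valid err={1 - valid_acc:.3f} +/- {valid_std:.3f}")
```

The `mean_*` and `std_*` arrays in `cv_results_` are what a figure with error bars would be drawn from; a large gap between training and validation error at high `C` would be one sign of overfitting.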

**3. (15 pts.)** Generate a neural network (or MLP) model for your feature-data. Produce the same sort of description and analysis for it as you did for the previous model, including variation of two or more hyperparameters, cross-validation testing, and at least one figure for each hyperparameter (minimum two) that shows how performance on training and validation data is affected as the hyperparameters change.

**4. (15 pts.)** Generate a third model, of whatever type you choose; you could use, for instance, SVM classifiers, or try ones that we have not yet explored directly (sklearn has its own decision-tree and decision-forest classifiers, for example). Whatever you choose, produce the same analysis as for the prior models, including a description of what you did, how hyperparameter variation affected results, and so forth. Figures are expected showing training/validation performance relative to hyperparameter variation; additional figures are allowed, of course.


**5. (10 pts.)** Summarize which classifier of the three you built performs best overall on your labeled data, and give some reasons why this may be so. Does it have more flexibility? Is it better at avoiding overfitting on this data? <br>
In addition, look at the performance of your best classifier and try to characterize the mistakes that it makes. Are there common features to the sentences that it gets wrong (e.g., are they mostly from one of the three source websites)? Are there other features that you can identify? Can you hypothesize why you see the results you do?
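
A simple starting point for this kind of error analysis is to tabulate the error rate per source website. The sketch below uses made-up `(website, true label, predicted label)` records purely to show the bookkeeping; the real records would come from held-out predictions of the chosen classifier.

```python
from collections import Counter

# Hypothetical records: (source website, true label, predicted label).
records = [
    ("amazon", 1, 1),
    ("amazon", 0, 1),
    ("imdb", 1, 0),
    ("imdb", 0, 0),
    ("imdb", 1, 0),
    ("yelp", 0, 0),
]

# Count all examples and misclassified examples per website.
totals = Counter(site for site, _, _ in records)
errors = Counter(site for site, y_true, y_pred in records if y_true != y_pred)

# Fraction of examples from each website that were misclassified.
error_rate = {site: errors[site] / totals[site] for site in totals}

for site in sorted(error_rate):
    print(f"{site}: {errors[site]}/{totals[site]} wrong ({error_rate[site]:.2f})")
```

The same grouping idea extends to other candidate features, such as sentence length or the presence of negation words.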

**6. (5 pts.)** Apply your best classifier from the previous steps to the text data in x_test.csv file, storing the outcomes as a probabilistic prediction and then submitting them to the leaderboard, as described below. In your report, describe the performance that you see there. How does that match up to the performance you saw during training and cross-validation? If it is as expected, what does that tell us, do you think? If it is not as expected, what does that tell us?


### Part Two: Prediction submissions (5 points)
To test your various classifiers, you can submit the predictions that each makes—on the unlabeled x_test.csv file—to a leaderboard. You can submit output from multiple classifiers, of multiple types, and simply re-submit whichever did the best at the end for your final graded score. The leaderboard code will compare your predictions to known correct examples, scoring them relative to the correct answers.<br>
As for Project 01, the submission should be in the form of a plain text-file, named yproba1_test.txt, containing one probability value (a floating-point number giving the probability of a positive binary label, 1) per example in the test input. Each line will be a single number, and we should be able to load it into a 1-dimensional NumPy array using:
                         <br> np.loadtxt('yproba1_test.txt') <br>
(It would be a good idea to verify that this will work as expected.) These numbers will be thresholded at a probability of 0.5 for scoring purposes.
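
A minimal round-trip check of the submission format might look like the following. The four probabilities are placeholders (in practice they would be the positive-class column of a classifier's `predict_proba` output on the test inputs), the file is written to a temporary directory only to keep the example self-contained, and treating exactly 0.5 as positive is an assumption about how the threshold is applied.

```python
import os
import tempfile

import numpy as np

# Placeholder probabilities of the positive label, one per test example.
yproba1_test = np.array([0.91, 0.08, 0.55, 0.49])

# Write one floating-point number per line.
path = os.path.join(tempfile.mkdtemp(), "yproba1_test.txt")
np.savetxt(path, yproba1_test, fmt="%.6f")

# Verify that the file loads back as a 1-dimensional array of the expected length.
loaded = np.loadtxt(path)
print(loaded.shape)

# Hard labels after thresholding at 0.5 (assumption: 0.5 itself counts as positive).
yhat = (loaded >= 0.5).astype(int)
print(yhat)
```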