<a href="https://colab.research.google.com/github/yonatanrtt/sentiment-analysis/blob/main/P_f4_regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<b><h2>Regression</h2>

In this notebook, the classifier's predicted probability of the positive class is used as a regression-style score, and the reviews at the middle and at the extremes of that score are inspected.

<b>All the tests in this notebook use Random Forest.

In [None]:
import pandas as pd
import re
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

In [None]:
from google.colab import drive
drive.mount("/drive")
data = pd.read_csv("/drive/My Drive/data.csv")

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


<b><h3>First Step:</h3>

The first step of this notebook uses the regression score to look at the reviews closest to 0.5 and find out whether they are ambiguous. <br/>
The inspection is done manually by reading the reviews.
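
As a minimal, dependency-free sketch of this selection, reviews can be ranked by how far their score is from 0.5. The review names and scores below are made up for illustration and do not come from the dataset.

```python
# Hypothetical scores: predicted probability of the positive class per review.
toy_scores = {
    "review A": 0.93,
    "review B": 0.52,
    "review C": 0.08,
    "review D": 0.47,
    "review E": 0.61,
}

middle = 0.5

# Sort review names by |score - 0.5| and keep the three closest to the middle.
most_ambiguous = sorted(
    toy_scores, key=lambda name: abs(toy_scores[name] - middle)
)[:3]

print(most_ambiguous)
```

The reviews that come first in this ordering are the ones the model is least sure about.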

In [None]:
def clean_text(txt):

    # keep line breaks as a word of their own
    txt = re.sub(r'<br ?/>', " enter ", txt)

    # remove all other html tags
    txt = re.sub(r'<.*?>', ' ', txt)

    # set all text as lower case
    txt = txt.strip().lower()

    # surround punctuation with spaces so that word-based algorithms see it as separate tokens
    # (raw string, so the backslash is a literal character and not an escape)
    special_chars = r'!"\/#$%&()*+,-./:;<=>?@[]^_`{|}~' + "'"
    str_map = str.maketrans({c: f" {c} " for c in special_chars})
    txt = txt.translate(str_map)

    # drop pure-digit tokens and collapse repeated whitespace
    words = [word for word in txt.split() if not word.isdigit()]

    return " ".join(words)
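
To make the punctuation step concrete, here is a small standalone sketch of the same `str.maketrans` / `translate` idea. It assumes `string.punctuation` as a stand-in for the hand-written character list, and the sample sentence is invented.

```python
import string

# Map every punctuation character to the same character surrounded by spaces.
# string.punctuation is an assumed stand-in for the hand-written list.
spacing_table = str.maketrans({c: f" {c} " for c in string.punctuation})

sample = "Great movie!It's a must-see."

# Lower-case, pad punctuation with spaces, then split on whitespace.
tokens = sample.lower().translate(spacing_table).split()

print(tokens)
```

Without the padding, "movie!It's" would stay glued together as one whitespace-delimited token. Note that `CountVectorizer`'s default token pattern keeps only tokens of two or more word characters, so the separated punctuation marks themselves do not become features.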

In [None]:
X = data["review"]
# encode the labels as integers: positive -> 1, negative -> 0
y = data["sentiment"].map({"positive": 1, "negative": 0})
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [None]:
vectorizer = CountVectorizer(
    stop_words="english",
    preprocessor=clean_text
)

In [None]:
training_features = vectorizer.fit_transform(X_train)
test_features = vectorizer.transform(X_test)

In [None]:
rf = RandomForestClassifier()
rf.fit(training_features, y_train)
y_pred_proba = rf.predict_proba(test_features)
# column 1 of predict_proba is the probability of class 1 (positive)
df = pd.DataFrame({"reviews": X_test, "pred": y_pred_proba[:, 1]})
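
In scikit-learn, a random forest's `predict_proba` averages the class probabilities predicted by its individual trees, which is why the score can land anywhere between 0 and 1. A simplified sketch, assuming five hypothetical trees that each give a hard 0/1 vote for a single review:

```python
# Hypothetical votes of five trees for one review (1 = positive, 0 = negative).
tree_votes = [1, 0, 1, 1, 0]

# The forest score is the mean of the per-tree outputs.
positive_score = sum(tree_votes) / len(tree_votes)

print(positive_score)
```

A score near 0.5 then means the trees are split roughly evenly, while a score near 0 or 1 means almost all of them agree.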

In [None]:
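middle = 0.5
# positions of the three reviews whose score is closest to 0.5
closest_idx = (df["pred"] - middle).abs().to_numpy().argsort()[:3]
ambiguous_list = list(df.iloc[closest_idx]["reviews"])

for item in ambiguous_list:
    print(item)
    print("\n")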


I recently watched this again and there's another version which is shorter 1999. I get the feeling they are the same movie but I would like to know the difference.<br /><br />One is Japanese and no pikachu short is all I can come up with. Ohtherwise why vote for the same movie twice?? <br /><br />Prof Ivy was rather boring. She sounded as if she was almost asleep, no expression at all with the few lines she had.<br /><br />This was enjoyable enough but there wasn't much to it at all. <br /><br />A collector (whos after Lugia, he has no plan to destroy the world) and the usual characters who try to stop him because trying to capture Lugia causes a lot of destruction.<br /><br />The pokemon movies that follow are slightly better, deoxys (poke 7) is great, with no. 8 almost here.


It's not easy making a movie with 18 different stories in it. Although 18 different international directors took the challenge, not everyone of them is good, some of them even boring. But in his entity, "Paris,

<b><h3>conclusion for the first step</h3>

- As expected, all the inspected reviews are ambiguous. The last one is different in kind, but it is still ambiguous about the movie.

<b><h3>Second Step:</h3>

The second step of this notebook looks at the reviews that get the highest regression score, to find out whether they are clearly positive.

In [None]:
positive_list = list(df.nlargest(3, "pred")["reviews"])      

for item in positive_list:
    print(item)
    print("\n")

Loved today's show!!! It was a variety and not solely cooking (which would have been great too). Very stimulating and captivating, always keeping the viewer peeking around the corner to see what was coming up next. She is as down to earth and as personable as you get, like one of us which made the show all the more enjoyable. Special guests, who are friends as well made for a nice surprise too. Loved the 'first' theme and that the audience was invited to play along too. I must admit I was shocked to see her come in under her time limits on a few things, but she did it and by golly I'll be writing those recipes down. Saving time in the kitchen means more time with family. Those who haven't tuned in yet, find out what channel and the time, I assure you that you won't be disappointed.


Loved today's show!!! It was a variety and not solely cooking (which would have been great too). Very stimulating and captivating, always keeping the viewer peeking around the corner to see what was coming

<b><h3>conclusion for the second step</h3>

- As expected, all the inspected reviews are clearly positive.

- The third review is interesting to inspect. Its positivity is less obvious than in the other two: for me it took some reading to understand its classification, yet the algorithm still ranked it very high.

<b><h3>Third Step:</h3>

The third step of this notebook looks at the reviews that get the lowest regression score, to find out whether they are clearly negative.
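
Selecting the lowest scores mirrors selecting the highest ones. A small sketch with the standard-library `heapq` module shows both selections side by side. The scores are made up for illustration.

```python
import heapq

# Hypothetical scores: predicted probability of the positive class per review.
toy_scores = {
    "review A": 0.93,
    "review B": 0.52,
    "review C": 0.08,
    "review D": 0.47,
    "review E": 0.61,
}

# Two highest-scored and two lowest-scored reviews, ranked by their score.
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)
bottom_two = heapq.nsmallest(2, toy_scores, key=toy_scores.get)

print(top_two)
print(bottom_two)
```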

In [None]:
negative_list = list(df.nsmallest(3, "pred")["reviews"])      

for item in negative_list:
    print(item)
    print("\n")

Talk about rubbish! I can't think of one good thing in this movie. The screenplay was poor, the acting was terrible and the effects, well there were no effects. I can't believe the writer of this movie did Identity, everything in this movie made me sick to start to finish.<br /><br />The front cover of the video box shows a showman with shark like teeth and scary eyes. I looks like a scary villain, but like the old saying "never judge a book by it's cover", the whole villain looked like a cardboard cut out. One part in the film a girl gets killed by a salad tongs, terrible. The setting was bad enough, like they could of set the whole thing in Lapland but no, a tropical island instead.<br /><br />I took this movie as a spoof, which I think they wanted it to be but the only thing that made me laugh in a bad way was the tacky effects. You can argue that I haven't watched the first one, but seeing this I would be safe if I wouldn't attempted it.<br /><br />The biggest joke in this movie is

<b><h3>conclusion for the third step</h3>

- As expected, the lowest-ranked reviews are clearly negative.

- Note that the word "good" appears in the first two reviews, as already observed in note 1.
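
A bag-of-words representation counts each word without its context, so "good" yields the same feature whether or not it is negated. A minimal sketch, with `collections.Counter` standing in for the vectorizer (an assumption for illustration). The first sentence is adapted from the review above and the second is invented.

```python
from collections import Counter

negated = "i can't think of one good thing in this movie"
praising = "this is one good movie"

# Count whitespace-separated tokens, ignoring word order and context.
negated_counts = Counter(negated.split())
praising_counts = Counter(praising.split())

print(negated_counts["good"], praising_counts["good"])
```

Both sentences contribute the same count for "good", so the model has to rely on the other words in the review to rank it as negative.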

<b><h2>general conclusion for this notebook</h2>

At least for the small number of samples inspected in this test, the regression scores near the extremes behaved well and gave the expected results.