# Sentiment analysis of movie reviews


## Tasks

In [93]:
import pandas as pd
import numpy as np
data = pd.read_csv('https://github.com/mbburova/MDS/raw/main/sentiment.csv', index_col = 0)

data.head()

Unnamed: 0,sentiment,review
0,1,With all this stuff going down at the moment w...
1,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,0,The film starts with a manager (Nicholas Bell)...
3,0,It must be assumed that those who praised this...
4,1,Superbly trashy and wondrously unpretentious 8...


**Task 1 (1 point)**
The data seems to contain some unnecessary HTML tags, such as `<br />`.

Find all types of HTML tags (the types of expressions in brackets of the form `<...>`). 


How many different tag types are there in the data? What is the most frequent tag?

Write your answer as a string, separating the tag count and the most popular tag by a space.

**Example answer:** `"3 <p>"`

In [94]:
import re
from collections import Counter
### YOUR SOLUTION
pattern = re.compile('<(.+?)>')  # non-greedy: capture the text between '<' and '>'
tags = []
for review in data['review']:
    tags.extend(pattern.findall(review))

cnt = Counter(tags)
print(cnt)
# Build the answer from the counter: number of tag types, then the most frequent tag in brackets
q1 = f"{len(cnt)} <{cnt.most_common(1)[0][0]}>"

Counter({'br /': 40968, 'SPOILER': 1, '/SPOILER': 1})


In [95]:
#ans: "3 <br />"
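
As a quick sanity check of the counting logic, here is a minimal sketch on two made-up reviews (the `reviews` list is a hypothetical stand-in for `data['review']`). It keeps the angle brackets in the match, so the answer string can be assembled directly.

```python
import re
from collections import Counter

# Toy reviews standing in for data['review'] (hypothetical examples).
reviews = [
    "Great film.<br /><br />Loved it.",
    "<SPOILER>He dies.</SPOILER> Still fun.<br />",
]

# Non-greedy match of everything from '<' to the nearest '>'.
# With no capture group, findall returns the whole tag, brackets included.
tag_pattern = re.compile(r"<.+?>")

tag_counts = Counter(tag for text in reviews for tag in tag_pattern.findall(text))
most_common_tag, _ = tag_counts.most_common(1)[0]

# "<number of tag types> <most frequent tag>"
q1 = f"{len(tag_counts)} {most_common_tag}"
print(q1)
```

On this toy input there are three tag types and `<br />` is the most frequent, which mirrors the shape of the expected answer.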

**Task 2 (1 point)**

Prepare your text. To do this, replace the tags from task 1 with spaces, collapse multiple spaces (which may appear after tag removal), replace backslashes (`\`) with an empty string, lowercase the text, and strip it using `text.strip()`.

What is the mean number of unique characters in a review?

Calculate the number of unique characters in a string using `len(set(string))`.


In [96]:
# data['cleaned_review'] = # YOUR CODE HERE

In [97]:
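def text_prepare(text):
  # Note: the task asks for tags to be replaced by spaces; this version deletes them,
  # which can glue together words that were separated only by a tag.
  new_txt = re.sub('<(.+?)>', '', text)
  # split() + join collapses repeated whitespace and strips both ends
  txt = " ".join(new_txt.split()).replace('\\', '').lower()
  return txt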



In [98]:
text_prepare('Do not like \"Titanic\"')

'do not like "titanic"'

In [99]:
def test_text_prepare():
    examples = ['Best film I have ever seen <SMILE::>',
                'Do not like \"Titanic\"',
                'Can say just    .... Nothing!!! <SAD>']
    answers = ['best film i have ever seen',
                'do not like "titanic"',
                'can say just .... nothing!!!']
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            print(text_prepare(ex))
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'
test_text_prepare()

'Basic tests are passed.'
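
The basic tests only cover tags at the end of a string, so they cannot tell whether a tag was replaced by a space or simply deleted. The sketch below compares the two options on a made-up sentence; both helper names are hypothetical and are not part of the notebook.

```python
import re

TAG_RE = re.compile(r"<.+?>")

def prepare_with_space(text):
    """Replace tags with a space, drop backslashes, collapse whitespace, lowercase."""
    no_tags = TAG_RE.sub(" ", text)
    no_backslashes = no_tags.replace("\\", "")
    return " ".join(no_backslashes.split()).lower()

def prepare_with_empty(text):
    """Same pipeline, but tags are deleted instead of being replaced by a space."""
    no_tags = TAG_RE.sub("", text)
    no_backslashes = no_tags.replace("\\", "")
    return " ".join(no_backslashes.split()).lower()

sample = "Loved it<br /><br />Great cast"
print(prepare_with_space(sample))  # loved it great cast
print(prepare_with_empty(sample))  # loved itgreat cast
```

The unique-character count is unlikely to differ between the two variants, but later token-based tasks can, because deleting a tag may merge two neighbouring words into one token.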

In [100]:
# q2 =### YOUR SOLUTION

data['cleaned_review'] = data['review'].apply(text_prepare)


In [101]:
data['unique'] = data['cleaned_review'].apply(lambda x: len(set(x)))

In [102]:
data.unique.mean()
#ans: 33.7798

33.7798

In [103]:
data

Unnamed: 0,sentiment,review,cleaned_review,unique
0,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...,36
1,1,"\The Classic War of the Worlds\"" by Timothy Hi...","the classic war of the worlds"" by timothy hine...",29
2,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager (nicholas bell)...,39
3,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...,39
4,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,41
...,...,...,...,...
9995,0,I actually saw this movie at a theater. As soo...,i actually saw this movie at a theater. as soo...,31
9996,0,I don't quite get the rating for The Amati Gir...,i don't quite get the rating for the amati gir...,36
9997,0,*Contains some spoilers* This movie is cheesy ...,*contains some spoilers* this movie is cheesy ...,39
9998,0,"Hmm, Hip Hop music to a period western. Modern...","hmm, hip hop music to a period western. modern...",30


**Task 3 (1 point)**

For sentiment analysis, brackets may serve as a useful feature. Create feature counters for the number of positive smileys (closing brackets `)`) and negative smileys (opening brackets `(`) in the reviews. As the answer, give the sum of their averages (`mean_positive + mean_negative`).

In [104]:
data['positive_count'] = data['cleaned_review'].apply(lambda x: x.count(')'))
data['negative_count'] = data['cleaned_review'].apply(lambda x: x.count('('))

In [105]:
# q3 =  ### YOUR SOLUTION

data.positive_count.mean() + data.negative_count.mean()

#ans: 2.9163

2.9163
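
To make the bracket features concrete, here is a small sketch on three invented reviews (the `cleaned_reviews` list is hypothetical). It uses plain `str.count`, as the notebook cells do, and `statistics.mean` in place of the pandas mean.

```python
from statistics import mean

# Toy cleaned reviews (hypothetical examples).
cleaned_reviews = [
    "nice movie :) really :)",
    "boring :( (and too long)",
    "no brackets here",
]

# ')' counts as a positive smiley, '(' as a negative one.
positive_counts = [text.count(")") for text in cleaned_reviews]
negative_counts = [text.count("(") for text in cleaned_reviews]

q3 = mean(positive_counts) + mean(negative_counts)
print(positive_counts, negative_counts, round(q3, 4))
```

Note that an ordinary parenthetical remark such as `(and too long)` adds one to each counter, which is why the two columns are often equal in the table outputs.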

**Task 4 (1 point)**
Now remove all characters which are not English letters (`[a-zA-Z]`) or digits (`[0-9]`) and tokenize the text by splitting it on spaces.

**Example:**
`'mother+father = parents'` -> `[mother, father, parents]`

Then remove stop words using nltk stopwords list for English (see cell below).

What is the mean number of unique tokens in a review?

In [106]:
# !pip install nltk

In [107]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /Users/riya/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [108]:
# data['tokenized'] = ## YOUR CODE HERE

In [109]:
def remove_characters(text):
  # Replace every run of non-alphanumeric characters with a space, then split on whitespace
  new_txt = re.sub('[^a-zA-Z0-9]+', ' ', text)
  return new_txt.split()

In [110]:
remove_characters('*contains some spoilers* this movie is cheesy ')

['contains', 'some', 'spoilers', 'this', 'movie', 'is', 'cheesy']

In [111]:
data['tokenized'] = data['cleaned_review'].apply(remove_characters)

In [112]:
data

Unnamed: 0,sentiment,review,cleaned_review,unique,positive_count,negative_count,tokenized
0,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...,36,1,1,"[with, all, this, stuff, going, down, at, the,..."
1,1,"\The Classic War of the Worlds\"" by Timothy Hi...","the classic war of the worlds"" by timothy hine...",29,0,0,"[the, classic, war, of, the, worlds, by, timot..."
2,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager (nicholas bell)...,39,8,8,"[the, film, starts, with, a, manager, nicholas..."
3,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...,39,3,3,"[it, must, be, assumed, that, those, who, prai..."
4,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,41,1,1,"[superbly, trashy, and, wondrously, unpretenti..."
...,...,...,...,...,...,...,...
9995,0,I actually saw this movie at a theater. As soo...,i actually saw this movie at a theater. as soo...,31,0,0,"[i, actually, saw, this, movie, at, a, theater..."
9996,0,I don't quite get the rating for The Amati Gir...,i don't quite get the rating for the amati gir...,36,0,0,"[i, don, t, quite, get, the, rating, for, the,..."
9997,0,*Contains some spoilers* This movie is cheesy ...,*contains some spoilers* this movie is cheesy ...,39,0,0,"[contains, some, spoilers, this, movie, is, ch..."
9998,0,"Hmm, Hip Hop music to a period western. Modern...","hmm, hip hop music to a period western. modern...",30,0,0,"[hmm, hip, hop, music, to, a, period, western,..."


In [113]:
def remove_stopwords(words):
  # Build a new list instead of removing items from the input list in place
  return [word for word in words if word not in STOPWORDS]
data['tokenized'] = data['tokenized'].apply(remove_stopwords)

In [114]:

data['tokens_num'] = data['tokenized'].apply(lambda x: len(set(x)))

In [115]:
data['tokens_num'].mean()

#ans: 100.0712

100.0712
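
The whole Task 4 pipeline can be checked on the example from the task statement. This is a minimal sketch: `STOPWORDS_SUBSET` is a tiny hypothetical stand-in for the nltk English stop word list, and `tokenize` is an illustrative helper name.

```python
import re

# Small stand-in for nltk's English stop word list (hypothetical subset).
STOPWORDS_SUBSET = {"the", "is", "a", "this", "some", "and"}

def tokenize(text, stopwords=STOPWORDS_SUBSET):
    """Keep only letters and digits, split on whitespace, drop stop words."""
    tokens = re.sub(r"[^a-zA-Z0-9]+", " ", text).split()
    return [token for token in tokens if token not in stopwords]

example_tokens = tokenize("mother+father = parents")
review_tokens = tokenize("*contains some spoilers* this movie is cheesy ")

print(example_tokens)           # ['mother', 'father', 'parents']
print(review_tokens)            # ['contains', 'spoilers', 'movie', 'cheesy']
print(len(set(review_tokens)))  # number of unique tokens in this review
```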

**Task 5 (1 point)**

Using the same preprocessing as in task 4, tokenize the text into 3-grams. 

What is the most common 3-gram?

**Example answer:** `"the cat sat"`.

**Hint:** You may use `data['tokenized']` column and function `ngrams` from `nltk.util`.

In [116]:
from nltk.util import ngrams


data['ngrams'] = data['tokenized'].apply(lambda x: list(ngrams(x, 3)))

In [117]:
data

Unnamed: 0,sentiment,review,cleaned_review,unique,positive_count,negative_count,tokenized,tokens_num,ngrams
0,1,With all this stuff going down at the moment w...,with all this stuff going down at the moment w...,36,1,1,"[stuff, going, moment, mj, started, listening,...",167,"[(stuff, going, moment), (going, moment, mj), ..."
1,1,"\The Classic War of the Worlds\"" by Timothy Hi...","the classic war of the worlds"" by timothy hine...",29,0,0,"[classic, war, worlds, timothy, hines, enterta...",67,"[(classic, war, worlds), (war, worlds, timothy..."
2,0,The film starts with a manager (Nicholas Bell)...,the film starts with a manager (nicholas bell)...,39,8,8,"[film, starts, manager, nicholas, bell, giving...",217,"[(film, starts, manager), (starts, manager, ni..."
3,0,It must be assumed that those who praised this...,it must be assumed that those who praised this...,39,3,3,"[must, assumed, praised, film, greatest, filme...",164,"[(must, assumed, praised), (assumed, praised, ..."
4,1,Superbly trashy and wondrously unpretentious 8...,superbly trashy and wondrously unpretentious 8...,41,1,1,"[superbly, trashy, wondrously, unpretentious, ...",194,"[(superbly, trashy, wondrously), (trashy, wond..."
...,...,...,...,...,...,...,...,...,...
9995,0,I actually saw this movie at a theater. As soo...,i actually saw this movie at a theater. as soo...,31,0,0,"[actually, saw, movie, theater, soon, handed, ...",45,"[(actually, saw, movie), (saw, movie, theater)..."
9996,0,I don't quite get the rating for The Amati Gir...,i don't quite get the rating for the amati gir...,36,0,0,"[quite, get, rating, amati, girls, think, real...",72,"[(quite, get, rating), (get, rating, amati), (..."
9997,0,*Contains some spoilers* This movie is cheesy ...,*contains some spoilers* this movie is cheesy ...,39,0,0,"[contains, spoilers, movie, cheesy, 80s, horro...",206,"[(contains, spoilers, movie), (spoilers, movie..."
9998,0,"Hmm, Hip Hop music to a period western. Modern...","hmm, hip hop music to a period western. modern...",30,0,0,"[hmm, hip, hop, music, period, western, modern...",73,"[(hmm, hip, hop), (hip, hop, music), (hop, mus..."


In [118]:
cnt = Counter()
for _, row in data.iterrows():
    for gram in row['ngrams']:
        cnt[gram] += 1

max(cnt, key=cnt.get)

('movie', 'ever', 'seen')
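
For readers without nltk at hand, the counting step can be sketched with the standard library only. The `trigrams` helper below is a hypothetical stand-in for `ngrams(x, 3)` built on `zip`, and the token lists are invented.

```python
from collections import Counter

def trigrams(tokens):
    """Return consecutive 3-token windows; lists shorter than 3 give no trigrams."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

# Toy tokenized reviews (hypothetical examples).
tokenized_reviews = [
    ["worst", "movie", "ever", "seen"],
    ["best", "movie", "ever", "seen", "twice"],
    ["short", "one"],
]

trigram_counts = Counter()
for tokens in tokenized_reviews:
    trigram_counts.update(trigrams(tokens))

top_trigram, top_count = trigram_counts.most_common(1)[0]
q5 = " ".join(top_trigram)  # answer format: tokens separated by spaces
print(q5, top_count)
```

`Counter.most_common(1)` returns both the 3-gram and its frequency, which is convenient when the count itself is worth reporting.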

**Task 6 (1 point)**
Use `WordPunctTokenizer` from `nltk` library for text tokenization. Apply it to `data['cleaned_review']`, then remove punctuation using `string.punctuation` and stopwords as before.

What are the top-10 most frequent tokens? (Write the tokens in one string separated by spaces.)

**Example answer:** `'mother film cinema two good film even would really story'`

In [160]:
from nltk import WordPunctTokenizer
import string

tk = WordPunctTokenizer()

data['nltk_tokenized'] = data['cleaned_review'].apply(lambda x: tk.tokenize(x))

In [214]:
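def remove_punct_and_stopwords(words):
    # Keep a token only if it is not a stop word and contains no punctuation character.
    # Building a new list avoids removing the same token twice from the input list.
    punct = set(string.punctuation) | {'¨'}
    return [word for word in words
            if word not in STOPWORDS and not any(ch in punct for ch in word)]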
data['nltk_tokenized'] = data['nltk_tokenized'].apply(remove_punct_and_stopwords)

In [215]:
cnt = Counter()
for _,row in data.iterrows():
    cnt.update(row['nltk_tokenized'])
    
cnt.most_common(10)

[('movie', 17578),
 ('film', 16532),
 ('one', 10624),
 ('like', 8224),
 ('good', 6124),
 ('time', 5119),
 ('even', 5076),
 ('would', 4938),
 ('really', 4714),
 ('story', 4679)]

In [217]:
remove_punct_and_stopwords(['¨', 'jurassik', 'scientist', 'park', '.', ').'])

['jurassik', 'scientist', 'park']
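
The sketch below reproduces the Task 6 steps without nltk. It assumes that `WordPunctTokenizer` splits text with the regular expression `\w+|[^\w\s]+` (runs of word characters, or runs of punctuation); `STOPWORDS_SUBSET` and `tokenize_and_filter` are hypothetical names used only for illustration.

```python
import re
import string

# Assumed to match WordPunctTokenizer's behaviour: words or punctuation runs.
WORD_PUNCT_RE = re.compile(r"\w+|[^\w\s]+")

# Small stand-in for nltk's English stop word list (hypothetical subset).
STOPWORDS_SUBSET = {"the", "is", "a", "in"}

def tokenize_and_filter(text):
    """Tokenize, then drop stop words and any token containing punctuation."""
    tokens = WORD_PUNCT_RE.findall(text)
    return [
        token for token in tokens
        if token not in STOPWORDS_SUBSET
        and not any(ch in string.punctuation for ch in token)
    ]

raw_tokens = WORD_PUNCT_RE.findall("park (again).")
filtered = tokenize_and_filter("the scientist is in jurassic park (again).")
print(raw_tokens)  # ['park', '(', 'again', ').']
print(filtered)    # ['scientist', 'jurassic', 'park', 'again']
```

Because punctuation runs such as `').'` come out as a single token, the filter has to look at every character of a token rather than compare the whole token against `string.punctuation`.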

**Task 7 (1 point)** Using `SnowballStemmer` from `nltk.stem.snowball`, stem the first 100 rows of the data (`data.head(100)['nltk_tokenized']`).

What is the number of unique stems?

In [218]:
data_100 = data.head(100)

In [219]:
from nltk.stem.snowball import SnowballStemmer 


In [220]:
stem_words = []
snow_stemmer = SnowballStemmer(language='english')

for _, row in data_100.iterrows():
    for word in row['nltk_tokenized']:
        x = snow_stemmer.stem(word)
        stem_words.append(x)

In [223]:
len(set(stem_words))
# stem_words


3452

**Task 8 (1 point)** Using `nltk.stem.WordNetLemmatizer()`, lemmatize the first 100 rows of the data (`data.head(100)['nltk_tokenized']`).

What is the number of unique lemmas?

In [226]:
import nltk
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

lemm_words = []
for _, row in data_100.iterrows():
    for word in row['nltk_tokenized']:
        x = lemmatizer.lemmatize(word)
        lemm_words.append(x)

[nltk_data] Downloading package omw-1.4 to /Users/riya/nltk_data...


In [227]:
len(set(lemm_words))

4040

### Classification model

Now it's time to solve a text classification task. First, split the data using the cell below (do not change the random state!).

In [228]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data, test_size = 0.2, random_state = 42)
train_df = train_df.copy()
test_df = test_df.copy()
train_df.head()

Unnamed: 0,sentiment,review,cleaned_review,unique,positive_count,negative_count,tokenized,tokens_num,ngrams,nltk_tokenized
9254,0,"Actress Patty Duke wrote an insightful, funny,...","actress patty duke wrote an insightful, funny,...",33,3,3,"[actress, patty, duke, wrote, insightful, funn...",94,"[(actress, patty, duke), (patty, duke, wrote),...","[actress, patty, duke, wrote, insightful, funn..."
1561,1,In answer to the person who made the comment a...,in answer to the person who made the comment a...,31,0,0,"[answer, person, made, comment, film, drags, b...",57,"[(answer, person, made), (person, made, commen...","[answer, person, made, comment, film, drags, b..."
1670,0,Madison is not too bad-if you like simplistic...,madison is not too bad-if you like simplistic...,42,2,2,"[madison, bad, like, simplistic, non, offensiv...",185,"[(madison, bad, like), (bad, like, simplistic)...","[madison, bad, like, simplistic, non, offensiv..."
6087,0,This is a strange sex comedy because there`s v...,this is a strange sex comedy because there`s v...,30,0,0,"[strange, sex, comedy, little, comedy, whole, ...",60,"[(strange, sex, comedy), (sex, comedy, little)...","[strange, sex, comedy, little, comedy, whole, ..."
6669,1,Thats My Bush is first of all a very entertain...,thats my bush is first of all a very entertain...,35,0,0,"[thats, bush, first, entertaining, show, parke...",178,"[(thats, bush, first), (bush, first, entertain...","[thats, bush, first, entertaining, show, parke..."


Compute the following features for `train_df` and `test_df`:
* length of the original review
* length of the text in tokens (use the `nltk_tokenized` column)
* length of the text in 3-grams (use the `ngrams` column)
* number of unique tokens (use the `nltk_tokenized` column)
* number of unique 3-grams (use the `ngrams` column)
* positive_count and negative_count from task 3
* counters for the tokens best, worst, good, bad, excellent, horrible (use the `nltk_tokenized` column and create a separate feature for each of these tokens).

Thus, you obtain the following list of features: 

`features = ['original_length','token_length', '3gram_length', 'token_count', '3gram_count', 'best_count', 'worst_count', 'good_count', 'bad_count', 'excellent_count', 'horrible_count', 'positive_count', 'negative_count']`


**Task 9 (1 point)** 

Compute **absolute** correlation between features and target variable `sentiment` in `train_df`. What is the most correlated variable?

**Hint:** use `np.corrcoef` and do not forget about `abs`.

In [230]:
train_df['original_length'] = train_df['review'].apply(lambda x: len(x))
test_df['original_length'] = test_df['review'].apply(lambda x: len(x))

In [232]:
train_df['token_length'] = train_df['nltk_tokenized'].apply(lambda x: len(x))
test_df['token_length'] = test_df['nltk_tokenized'].apply(lambda x: len(x))

In [233]:
train_df['3gram_length'] = train_df['ngrams'].apply(lambda x: len(x))
test_df['3gram_length'] = test_df['ngrams'].apply(lambda x: len(x))


In [234]:
train_df['token_count'] = train_df['nltk_tokenized'].apply(lambda x: len(set(x)))
test_df['token_count'] = test_df['nltk_tokenized'].apply(lambda x: len(set(x)))


In [235]:
train_df['3gram_count'] = train_df['ngrams'].apply(lambda x: len(set(x)))
test_df['3gram_count'] = test_df['ngrams'].apply(lambda x: len(set(x)))

In [237]:
# One counter column per sentiment-bearing token, created in both splits
for token in ['best', 'worst', 'good', 'bad', 'excellent', 'horrible']:
    for df in (train_df, test_df):
        df[f'{token}_count'] = df['nltk_tokenized'].apply(lambda x, t=token: x.count(t))

In [241]:
from scipy.stats import pearsonr  # public import path; scipy.stats.stats is deprecated
features = ['original_length','token_length', '3gram_length', 'token_count', '3gram_count', 'best_count', 'worst_count', 'good_count', 'bad_count', 'excellent_count', 'horrible_count', 'positive_count', 'negative_count']

for feature in features:
    coef = np.corrcoef(train_df[feature], train_df['sentiment'])
    print(f'Coef for {feature} is {abs(coef)[0][1]}')

Coef for original_length is 0.01631844812397978
Coef for token_length is 0.020055787050218564
Coef for 3gram_length is 0.019990809298198114
Coef for token_count is 0.015035220142316506
Coef for 3gram_count is 0.020045209951380795
Coef for best_count is 0.15141741713861748
Coef for worst_count is 0.23431738493553936
Coef for good_count is 0.020572444958314673
Coef for bad_count is 0.2719072277745534
Coef for excellent_count is 0.1544400755955548
Coef for horrible_count is 0.1466411880678859
Coef for positive_count is 0.03724083117531796
Coef for negative_count is 0.04300851753180165



In [None]:
# q9 = ### YOUR SOLUTION
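
Picking the answer out of the printed coefficients can also be done in code. The following is a minimal sketch on invented arrays (the `sentiment` vector and the `toy_features` dictionary are hypothetical, not the notebook's data); it uses `np.corrcoef` and `abs`, as the hint suggests.

```python
import numpy as np

# Toy target and two toy features (hypothetical values).
sentiment = np.array([1, 0, 1, 0, 1, 0])
toy_features = {
    "bad_count": np.array([0, 2, 0, 1, 0, 3]),
    "token_length": np.array([10, 12, 11, 9, 10, 13]),
}

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the Pearson correlation.
abs_corr = {
    name: abs(np.corrcoef(values, sentiment)[0, 1])
    for name, values in toy_features.items()
}

# The feature with the largest absolute correlation.
q9 = max(abs_corr, key=abs_corr.get)
print(q9, round(abs_corr[q9], 4))
```

Taking the absolute value matters here: a feature such as a count of negative words is expected to correlate negatively with the sentiment label, yet it can still be the most informative one.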

**Task 10 (1 point)**

Scale the data using `StandardScaler` from `sklearn` and train `LogisticRegression` with default parameters from `sklearn.linear_model`.

What is the F1-score on `test_df`? Round your answer to 4 decimal places (`round(score, 4)`).

In [243]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train_df[features])
scaler.transform(train_df[features])
scaler.transform(test_df[features])

array([[ 1.9118237 ,  1.52612939,  1.52815924, ..., -0.18984995,
         1.11983195,  1.15334776],
       [ 0.22713586,  0.29786717,  0.29872198, ..., -0.18984995,
         1.56234396,  1.60325911],
       [ 1.19179545,  1.15656376,  1.15824006, ..., -0.18984995,
         0.23480793,  0.25352505],
       ...,
       [-0.91777839, -0.95213419, -0.95247524, ..., -0.18984995,
        -0.65021608, -0.64629766],
       [ 0.98083806,  1.08047673,  1.08208023, ..., -0.18984995,
         1.11983195,  1.15334776],
       [-0.76030316, -0.69126434, -0.69135582, ..., -0.18984995,
        -0.65021608, -0.64629766]])

In [245]:
from sklearn.linear_model import LogisticRegression

# Fit on the scaled features, as the task requires. Fitting on the raw features
# stopped at the solver's iteration limit (scikit-learn then suggests scaling
# the data or increasing max_iter).
X_train = scaler.transform(train_df[features])
X_test = scaler.transform(test_df[features])

clf = LogisticRegression(random_state=0).fit(X_train, train_df['sentiment'])
y_pred = clf.predict(X_test)
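
The task asks for the F1-score, which the cells above do not compute. As a reminder of what the metric measures, here is a plain-Python sketch for binary labels with positive class `1`; the `f1_binary` helper and the label lists are hypothetical. In the notebook itself the score would more naturally come from `sklearn.metrics.f1_score(test_df['sentiment'], y_pred)`.

```python
def f1_binary(y_true, y_pred):
    """F1 for binary labels: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels and predictions (hypothetical values).
y_true_toy = [1, 0, 1, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 1, 1]

score = round(f1_binary(y_true_toy, y_pred_toy), 4)
print(score)
```

Here there are three true positives, one false positive and one false negative, so precision and recall are both 0.75 and so is their harmonic mean.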
