# Sentiment analysis of movie reviews


## Tasks

In [None]:
import pandas as pd
import numpy as np
data = pd.read_csv('https://github.com/mbburova/MDS/raw/main/sentiment.csv', index_col = 0)

data.head()

**Task 1 (1 point)**
The data seems to contain some unnecessary HTML tags, such as `<br />`.

Find all types of HTML tags (the types of expressions in brackets of the form `<...>`). 


How many different tag types are there in the data? What is the most frequent tag?

Write your answer as a string with the tag count and the most frequent tag separated by a space.

**Example answer:** `"3 <p>"`
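
The counting step can be sketched on toy strings as follows. This is only an illustration under assumptions: a "tag type" is taken to be each distinct `<...>` expression exactly as written, and the sample reviews and the `count_tags` helper are hypothetical, not part of the dataset or the expected solution.

```python
import re
from collections import Counter

# Assumption: a tag is "<", then any characters except ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

def count_tags(texts):
    """Count every <...> expression across an iterable of strings."""
    counter = Counter()
    for text in texts:
        counter.update(TAG_PATTERN.findall(text))
    return counter

# Toy reviews (hypothetical, not taken from the dataset).
sample = ["Great!<br />Loved it.<br />", "So <i>bad</i>..."]
tag_counts = count_tags(sample)
most_common_tag, _ = tag_counts.most_common(1)[0]

# Format: "<number of distinct tags> <most frequent tag>"
answer = f"{len(tag_counts)} {most_common_tag}"
```

Under this reading `<i>` and `</i>` count as two different types; a different definition of "type" would need a different pattern.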

In [None]:
import re
from collections import Counter
### YOUR SOLUTION
q1 = "formatted string with tag count and the most popular tag"

**Task 2 (1 point)**

Prepare your text. To do this, replace the tags from task 1 with spaces, collapse multiple spaces (which may appear after tag removal) into one, replace backslashes (`\`) with an empty string, lowercase the text, and strip it using `text.strip()`. Wrap these steps in a function `text_prepare(text)`, which the test cell below calls.

What is the mean number of unique characters in a review?

Calculate the number of unique characters in a string using `len(set(string))`.
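
One possible shape for the cleaning function, as a sketch rather than the reference solution. The name `text_prepare` comes from the test cell below; the tag pattern `<[^>]+>` and the order of the steps are assumptions that follow the task description.

```python
import re

def text_prepare(text):
    """Clean one review following the steps listed in the task."""
    text = re.sub(r"<[^>]+>", " ", text)   # replace tags with spaces
    text = re.sub(r" {2,}", " ", text)     # collapse repeated spaces
    text = text.replace("\\", "")          # drop backslashes
    return text.lower().strip()            # lowercase, trim both ends

# Unique characters in one cleaned toy string.
cleaned = text_prepare("Can say just    .... Nothing!!! <SAD>")
unique_chars = len(set(cleaned))
```

For a pandas column, the same function can be passed to `.apply`, and the per-review character counts averaged with `.mean()`.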


In [None]:
data['cleaned_review'] = # YOUR CODE HERE

In [None]:
def test_text_prepare():
    examples = ['Best film I have ever seen <SMILE::>',
                'Do not like \"Titanic\"',
                'Can say just    .... Nothing!!! <SAD>']
    answers = ['best film i have ever seen',
                'do not like "titanic"',
                'can say just .... nothing!!!']
    for ex, ans in zip(examples, answers):
        if text_prepare(ex) != ans:
            print(text_prepare(ex))
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'
test_text_prepare()

In [None]:
q2 =### YOUR SOLUTION

**Task 3 (1 point)**

For sentiment analysis, brackets may serve as a useful feature. Create feature counters for the number of positive smileys (closing brackets `)`) and negative smileys (opening brackets `(`) in the reviews. As the answer, give the sum of their averages (`mean_positive + mean_negative`).
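
A minimal sketch of the bracket counters on plain Python strings. The toy reviews and the helper `bracket_counts` are hypothetical; `)` is treated as the positive marker and `(` as the negative one.

```python
from statistics import mean

def bracket_counts(text):
    """Return (number of ')' characters, number of '(' characters)."""
    return text.count(")"), text.count("(")

# Toy reviews (hypothetical).
reviews = ["Loved it :) :)", "Awful :(", "No brackets here"]
positive = [bracket_counts(r)[0] for r in reviews]
negative = [bracket_counts(r)[1] for r in reviews]

# Sum of the two per-review averages.
total = mean(positive) + mean(negative)
```

With pandas, `Series.str.count` interprets its argument as a regular expression, so the brackets would need escaping there (for example `r'\)'`).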

In [None]:
data['positive_count'] = # YOUR CODE HERE
data['negative_count'] = # YOUR CODE HERE

In [None]:
q3 =  ### YOUR SOLUTION



**Task 4 (1 point)**
Now remove all characters that are not English letters (`[a-zA-Z]`) or digits (`[0-9]`), replacing them with spaces as in the example below, and tokenize the text by splitting it on spaces.

**Example:**
`'mother+father = parents'` -> `[mother, father, parents]`

Then remove stop words using the NLTK stop word list for English (see the cell below).

What is the mean number of unique tokens in a review?
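
The tokenization step might look like the sketch below. It assumes the input is already lowercased and uses a tiny hypothetical stop word set, `TOY_STOPWORDS`, in place of the NLTK list so that it runs without downloads.

```python
import re

# Tiny stand-in stop word list (hypothetical); the notebook uses NLTK's STOPWORDS.
TOY_STOPWORDS = {"the", "a", "is", "and"}

def tokenize(text, stopwords=TOY_STOPWORDS):
    """Keep letters and digits, split on whitespace, drop stop words."""
    # Replace anything that is not a letter, digit, or whitespace with a space,
    # so that "mother+father" splits into two tokens instead of merging.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in stopwords]

tokens = tokenize("mother+father = parents")
n_unique = len(set(tokens))
```

Note that the character class is `a-zA-Z`; the range `A-z` would also match `[`, `\`, `]`, `^`, `_` and the backtick.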

In [None]:
!pip install nltk

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
STOPWORDS = set(stopwords.words('english'))

In [None]:
data['tokenized'] = ## YOUR CODE HERE

In [None]:
q4 = ### YOUR SOLUTION

**Task 5 (1 point)**

Using the same preprocessing as in task 4, tokenize the text into 3-grams. 

What is the most common 3-gram?

**Example answer:** `"the cat sat"`.

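**Hint:** You may use the `data['tokenized']` column and the `ngrams` function from `nltk.util`. Store the result in a `3gram` column, since the feature list in the classification part refers to it.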
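
To make the idea concrete, here is a dependency-free sketch that builds 3-grams with `zip`; it is meant to produce the same tuples as `list(nltk.util.ngrams(tokens, 3))`. The token lists are toy data and the helper name `trigrams` is hypothetical.

```python
from collections import Counter

def trigrams(tokens):
    """Return all runs of three consecutive tokens as tuples."""
    return list(zip(tokens, tokens[1:], tokens[2:]))

# Toy token lists (hypothetical), one per review.
token_lists = [["the", "cat", "sat", "down"], ["the", "cat", "sat"]]

# Count 3-grams over all reviews and pick the most frequent one.
counter = Counter(g for toks in token_lists for g in trigrams(toks))
top_gram, top_count = counter.most_common(1)[0]
answer = " ".join(top_gram)
```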

In [None]:
from nltk.util import ngrams
## YOUR CODE HERE

In [None]:
q5 =### YOUR SOLUTION

**Task 6 (1 point)**
Use `WordPunctTokenizer` from the `nltk` library for text tokenization. Apply it to `data['cleaned_review']`, then remove punctuation using `string.punctuation` and remove stop words as before.

What are the 10 most frequent tokens? Write the tokens in one string separated by spaces.

**Example answer:** `'mother film cinema two good film even would really story'`
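
A rough sketch of the filtering logic, written without NLTK so it runs anywhere. The helper `word_punct_tokenize` is a hypothetical stand-in that mirrors the regular expression given in the NLTK documentation for `WordPunctTokenizer` (`\w+|[^\w\s]+`). Dropping every token made up entirely of punctuation characters is one possible reading of "remove punctuation", not necessarily the intended one.

```python
import re
import string
from collections import Counter

def word_punct_tokenize(text):
    """Split into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

def is_punctuation(token):
    """True if every character of the token is in string.punctuation."""
    return all(ch in string.punctuation for ch in token)

tokens = word_punct_tokenize("good film, really good!!!")
filtered = [tok for tok in tokens if not is_punctuation(tok)]

# Most frequent tokens joined into one space-separated string.
top = " ".join(tok for tok, _ in Counter(filtered).most_common(2))
```

The per-character check matters because a multi-character token such as `!!!` is not a substring of `string.punctuation`, so a plain `tok not in string.punctuation` test would keep it.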

In [None]:
from nltk import WordPunctTokenizer
import string

data['nltk_tokenized'] = ## YOUR CODE HERE

In [None]:
q6 =### YOUR SOLUTION

**Task 7 (1 point)** Using `SnowballStemmer` from `nltk.stem.snowball`, stem the tokens in the first 100 rows of the data (`data.head(100)['nltk_tokenized']`).

What is the number of unique stems?
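
Counting unique stems could be sketched like this on toy token lists. The choice of the `"english"` stemmer is an assumption based on the reviews being in English, and the token lists are hypothetical.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Toy token lists (hypothetical), shaped like rows of `nltk_tokenized`.
token_lists = [["films", "running"], ["film", "runs"]]

# Collect stems over all rows into a set so duplicates collapse.
unique_stems = {stemmer.stem(tok) for toks in token_lists for tok in toks}
n_unique = len(unique_stems)
```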

In [None]:
from nltk.stem.snowball import SnowballStemmer 
## YOUR CODE HERE

In [None]:
q7 =### YOUR SOLUTION

**Task 8 (1 point)** Using `nltk.stem.WordNetLemmatizer()`, lemmatize the tokens in the first 100 rows of the data (`data.head(100)['nltk_tokenized']`).

What is the number of unique lemmas?

In [None]:
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

## YOUR CODE HERE

In [None]:
q8 = ### YOUR SOLUTION

### Classification model

Now it's time to solve a text classification task. First, split the data using the cell below (do not change the random state!).

In [None]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(data, test_size = 0.2, random_state = 42)
train_df = train_df.copy()
test_df = test_df.copy()
train_df.head()

Compute the following features for `train_df` and `test_df`:
* length of the original review
* length of the text in tokens (use the `nltk_tokenized` column)
* length of the text in 3-grams (use the `3gram` column with the 3-grams from task 5)
* number of unique tokens (use the `nltk_tokenized` column)
* number of unique 3-grams (use the `3gram` column)
* positive_count and negative_count from task 3
* counters for the tokens best, worst, good, bad, excellent, horrible (use the `nltk_tokenized` column and create a separate feature for each of these tokens).

Thus, you obtain the following list of features: 

`features = ['original_length','token_length', '3gram_length', 'token_count', '3gram_count', 'best_count', 'worst_count', 'good_count', 'bad_count', 'excellent_count', 'horrible_count', 'positive_count', 'negative_count']`
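
The per-row computation behind these features might look as follows. This is a sketch under assumptions: "length of the original review" is read as its length in characters, the `*_length` names are read as total counts and the `token_count` and `3gram_count` names as unique counts, following the order of the list above. The helper `row_features` is hypothetical, and `positive_count` and `negative_count` are left out because they come from task 3.

```python
KEYWORDS = ["best", "worst", "good", "bad", "excellent", "horrible"]

def row_features(review, tokens, grams):
    """Build the feature dict for one review from its tokens and 3-grams."""
    feats = {
        "original_length": len(review),   # characters in the raw review
        "token_length": len(tokens),      # all tokens
        "3gram_length": len(grams),       # all 3-grams
        "token_count": len(set(tokens)),  # unique tokens
        "3gram_count": len(set(grams)),   # unique 3-grams
    }
    for word in KEYWORDS:
        feats[f"{word}_count"] = tokens.count(word)
    return feats

# Toy row (hypothetical).
review = "Good, good film. Bad end!"
tokens = ["good", "good", "film", "bad", "end"]
grams = list(zip(tokens, tokens[1:], tokens[2:]))
feats = row_features(review, tokens, grams)
```

Applied row by row to a DataFrame, the resulting dicts can be expanded into columns, for example with `pd.DataFrame(list_of_dicts, index=df.index)`.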


**Task 9 (1 point)** 

Compute the **absolute** correlation between each feature and the target variable `sentiment` in `train_df`. Which feature is the most correlated with the target?

**Hint:** use `np.corrcoef` and do not forget about `abs`.
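
As an illustration of the hint, the sketch below ranks a few toy columns by absolute Pearson correlation with a binary target. All values are made up; only `np.corrcoef` and `abs` come from the task.

```python
import numpy as np

# Toy feature columns and a binary target (hypothetical values).
toy = {
    "good_count": np.array([3, 0, 2, 0, 1, 0]),
    "bad_count": np.array([0, 2, 0, 2, 0, 1]),
    "token_length": np.array([10, 12, 9, 11, 13, 8]),
}
target = np.array([1, 0, 1, 0, 1, 0])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry [0, 1] is the
# correlation between the two inputs.
abs_corr = {name: abs(np.corrcoef(col, target)[0, 1]) for name, col in toy.items()}
best_feature = max(abs_corr, key=abs_corr.get)
```

In this toy data the winning feature is negatively correlated with the target, which is why the absolute value is needed before comparing.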

In [None]:
from scipy.stats import pearsonr
features = ['original_length','token_length', '3gram_length', 'token_count', '3gram_count', 'best_count', 'worst_count', 'good_count', 'bad_count', 'excellent_count', 'horrible_count', 'positive_count', 'negative_count']

## YOUR CODE HERE

In [None]:
q9 = ### YOUR SOLUTION

**Task 10 (1 point)**

Scale the data using `StandardScaler` from `sklearn` and train `LogisticRegression` from `sklearn.linear_model` with default parameters.

What is the F1-score on `test_df`? Round your answer to 4 decimal places (`round(score, 4)`).
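
The scale, fit and score pipeline can be sketched on synthetic data as below. The arrays are randomly generated stand-ins, not the review features. Fitting the scaler on the training split only is a common convention that the task does not spell out, so treat it as an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: the sign of the first feature defines the label.
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] > 0).astype(int)
X_test = rng.normal(size=(50, 3))
y_test = (X_test[:, 0] > 0).astype(int)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training statistics

model = LogisticRegression()                    # default parameters
model.fit(X_train_scaled, y_train)
score = round(f1_score(y_test, model.predict(X_test_scaled)), 4)
```

For the assignment, `X_train` and `X_test` would be `train_df[features]` and `test_df[features]`, with `sentiment` as the target.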

In [None]:
q10 =### YOUR SOLUTION