<font color = green >

# Text classification: sentiment analysis 

</font>

<font color = green >

## Popular tasks of text classification

</font>

- **Spam detection**: given a message, decide whether it is spam or not
- **Topic identification**: given an article, choose one of the known classes, such as "Sport", "Technology", "Finances"
- **Sentiment analysis**: is the movie review positive or negative?
- **Spelling correction**: which word is more suitable, "weather" or "whether"?


<font color = green >

## Features from Text

</font>

1. The most common words
2. *Stop* words
3. Normalization: lower case / stemming / lemmatizing
4. Capitalization as feature 
5. Part of speech (POS), e.g. "the weather" vs. "whether"
6. Grouping
    - buy, purchase
    - Mr, Ms, Dr
    - Numbers
    - Dates
7. Bigrams, n-grams e.g. "White House"
8. Sub-sequences e.g. "ing", "ion"
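
Several of the features listed above can be sketched in a few lines of plain Python. The stop-word set and the sample sentence below are illustrative assumptions, not part of the original material; a real pipeline would normally use a fuller stop-word list such as the one shipped with NLTK.

```python
import re

# Tiny illustrative stop-word list (an assumption; NLTK ships a fuller one).
STOP_WORDS = {"the", "a", "is", "in", "of", "to"}

def extract_features(text):
    """Return normalized unigrams (stop words removed) and bigrams."""
    tokens = re.findall(r"\w+", text.lower())          # normalization: lower case
    unigrams = [t for t in tokens if t not in STOP_WORDS]
    bigrams = list(zip(tokens, tokens[1:]))            # adjacent word pairs
    return unigrams, bigrams

unigrams, bigrams = extract_features("The President is in the White House")
print(unigrams)
print(bigrams)
```

The bigram list keeps `('white', 'house')` together, which is the kind of signal that single words alone lose.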


<font color = green >

### Text classification of search query 

</font>

- **python**  as snake -> Zoology
- **python**  as programming language -> Computer Science
- **python**  as "monty python" -> Entertainment

Probabilistic model:

#### Bayes Rule

\begin{equation*}
P(y|X) = \frac{P(X| y) \cdot P(y)}{P(X)} 
\quad\quad\quad
Posterior = \frac{ Likelihood \cdot Prior}{Evidence} 
\quad\quad\quad
P(class| python) = \frac{P(python| class) \cdot P(class)}{P(python)} 
\end{equation*}

Since $P(python)$ is the same for all classes, we may compare just the numerators:

\begin{equation*}
P(python| Zoology) \cdot P(Zoology) 
\quad\quad\quad 
P(python|CS) \cdot P(CS) 
\quad\quad\quad
P(python|Entertainment) \cdot P(Entertainment) 
\end{equation*}

In general: 
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad P(y|X) =  \underset{y}{argmax} \quad P(X|y) \cdot P(y)
\end{equation*}

Most probably, the predicted class is <font color = blue>CS</font>.
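
The comparison of numerators can be tried out with a few lines of Python. All priors and likelihoods below are made-up numbers chosen only to illustrate the mechanics; they are not estimated from any data.

```python
# Hypothetical priors P(class) and likelihoods P("python" | class);
# the numbers are made up for illustration only.
priors = {"Zoology": 0.2, "CS": 0.5, "Entertainment": 0.3}
likelihoods = {"Zoology": 0.010, "CS": 0.020, "Entertainment": 0.005}

# P("python") is the same for every class, so comparing numerators is enough.
numerators = {c: likelihoods[c] * priors[c] for c in priors}
predicted = max(numerators, key=numerators.get)

# Dividing by the evidence turns the numerators into proper posteriors.
evidence = sum(numerators.values())
posteriors = {c: numerators[c] / evidence for c in numerators}

print(predicted)
print(posteriors)
```

With these assumed numbers the argmax is `CS`, and the posteriors sum to 1 because the evidence is just the sum of the numerators.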

#### Naive Bayes Classifiers

\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad P(y) \prod_{ i=1 }^{ n }{ P(x_{ i }|\,y) } 
\end{equation*}

If search query = **"python snake"** 
\begin{equation*}
\hat{y} =  \underset{y}{argmax} \quad
P(y)\cdot P(python|\,y) \cdot P(snake|\,y)
\end{equation*}

Now the most probable predicted class is <font color = blue>Zoology</font>, since $P(snake|\,CS)$ is far less than $P(snake|\,Zoology)$.
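
A minimal sketch of this scoring rule, assuming hypothetical priors and per-word likelihoods (the numbers are invented for illustration). It sums logarithms instead of multiplying probabilities, which gives the same argmax and avoids numeric underflow on long queries.

```python
import math

# Hypothetical per-class priors and word likelihoods (illustrative numbers only).
priors = {"Zoology": 0.2, "CS": 0.5, "Entertainment": 0.3}
word_likelihoods = {
    "Zoology":       {"python": 0.010, "snake": 0.0200},
    "CS":            {"python": 0.020, "snake": 0.0001},
    "Entertainment": {"python": 0.005, "snake": 0.0005},
}

def naive_bayes_score(query_words, cls):
    """log P(y) + sum_i log P(x_i | y); logs avoid underflow for long queries."""
    score = math.log(priors[cls])
    for word in query_words:
        score += math.log(word_likelihoods[cls][word])
    return score

def predict(query_words):
    """Return the class with the highest Naive Bayes score."""
    return max(priors, key=lambda cls: naive_bayes_score(query_words, cls))

print(predict(["python"]))
print(predict(["python", "snake"]))
```

Under these assumptions the single word "python" is assigned to `CS`, while adding "snake" moves the prediction to `Zoology`.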

Note: if one of the words never occurs in the training text of a class, its estimated probability is 0 and, as a result,
the whole likelihood is 0 regardless of the other words. Thus it is worth using Laplace smoothing.


<font color = green >

#### Laplace smoothing
 
</font>


$
A : 1 \quad
B : 3\quad
C : 0\quad
D : 6\quad
$

$N= 10\quad K =4$ 
<br>N - number of samples, K - number of classes

\begin{equation*}
P(A) = 0.1\quad\quad\quad\quad
P(B) = 0.3\quad\quad\quad\quad
P(C) = 0.0\quad\quad\quad\quad
P(D) = 0.6\\
\end{equation*}

<font color = blue >

\begin{equation*}
P^{\,L}(x_{i}) =  \frac{count(x_{i})+1}{N+K}
\end{equation*}

</font>




\begin{equation*}
P^{\,L}(A) =  \frac{1+1}{10+4} = 0.14 \quad P^{\,L}(B) =  \frac{3+1}{10+4} = 0.29
\quad P^{\,L}(C) =  \frac{0+1}{10+4} = 0.07 \quad P^{\,L}(D) =  \frac{6+1}{10+4} = 0.5
\end{equation*}
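
The worked example can be reproduced with a short function. This is a sketch of add-one smoothing using the counts from the example; the function name is an assumption made for illustration.

```python
def laplace_smooth(counts):
    """Add-one smoothing: (count + 1) / (N + K)."""
    n = sum(counts.values())   # N: total number of samples
    k = len(counts)            # K: number of distinct classes
    return {label: (c + 1) / (n + k) for label, c in counts.items()}

counts = {"A": 1, "B": 3, "C": 0, "D": 6}
smoothed = laplace_smooth(counts)
for label, p in smoothed.items():
    print(label, round(p, 2))
```

The smoothed values still sum to 1, and class `C` no longer has probability zero.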






<font color = green >

## Sentiment Analysis

</font>


<font color = green >

### Using NLTK

</font>


In [1]:
import pandas as pd
import numpy as np

from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
import nltk
from nltk.tokenize import RegexpTokenizer
import random

In [2]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /home/master/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

<font color = green >

#### Load data

</font>


In [106]:
all_movie_reviews_text= movie_reviews.raw() # it is just all reviews joined into one text e.g.
# this is the ending of first review: " the others ( 9/10 ) - stir of echoes ( 8/10 ) "
# this is the beginning of second review : "the happy bastard's quick movie review "
print(all_movie_reviews_text[3600:5600])

way because someone is apparently assuming that the genre is still hot with the kids . 
it also wrapped production two years ago and has been sitting on the shelves ever since . 
whatever . . . skip 
it ! 
where's joblo coming from ? 
a nightmare of elm street 3 ( 7/10 ) - blair witch 2 ( 7/10 ) - the crow ( 9/10 ) - the crow : salvation ( 4/10 ) - lost highway ( 10/10 ) - memento ( 10/10 ) - the others ( 9/10 ) - stir of echoes ( 8/10 ) 
the happy bastard's quick movie review 
damn that y2k bug . 
it's got a head start in this movie starring jamie lee curtis and another baldwin brother ( william this time ) in a story regarding a crew of a tugboat that comes across a deserted russian tech ship that has a strangeness to it when they kick the power back on . 
little do they know the power within . . . 
going for the gore and bringing on a few action sequences here and there , virus still feels very empty , like a movie going for all flash and no substance . 
we don't know why the crew w

<font color = green >

#### Tools to review data 

</font>



In [107]:
cats =  movie_reviews.categories()
cats

['neg', 'pos']

In [108]:
cat = cats[0]
ids= movie_reviews.fileids(cat)
ids[:10]

['neg/cv000_29416.txt',
 'neg/cv001_19502.txt',
 'neg/cv002_17424.txt',
 'neg/cv003_12683.txt',
 'neg/cv004_12641.txt',
 'neg/cv005_29357.txt',
 'neg/cv006_17022.txt',
 'neg/cv007_4992.txt',
 'neg/cv008_29326.txt',
 'neg/cv009_29417.txt']

In [111]:
id_review = ids[0]
print(movie_reviews.raw(id_review))

plot : two teen couples go to a church party , drink and then drive . 
they get into an accident . 
one of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . 
what's the deal ? 
watch the movie and " sorta " find out . . . 
critique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . 
which is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn't snag this one correctly . 
they seem to have taken this pretty neat concept , but executed it terribly . 
so what are the problems with the movie ? 
well , its main problem is that it's simply too jumbled . 
it starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience member , have no id

<font color = green >

#### Tokenize

</font>


In [112]:
def preprocess(text): # lower-cases the text and removes punctuation
    tokenizer = RegexpTokenizer(r'\w+') # just for demo
    return tokenizer.tokenize(text.lower())

all_words = preprocess(all_movie_reviews_text)
print (len(all_words))
print(all_words[:100])

1336782
['plot', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', 'drink', 'and', 'then', 'drive', 'they', 'get', 'into', 'an', 'accident', 'one', 'of', 'the', 'guys', 'dies', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', 'and', 'has', 'nightmares', 'what', 's', 'the', 'deal', 'watch', 'the', 'movie', 'and', 'sorta', 'find', 'out', 'critique', 'a', 'mind', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', 'mess', 'with', 'your', 'head', 'and', 'such']
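
The pattern `\w+` keeps runs of word characters and drops everything else, which is why `what's` shows up as `'what', 's'` in the output above. A rough standard-library equivalent, offered as a sketch rather than as NLTK's actual implementation (the function name `preprocess_re` is hypothetical):

```python
import re

def preprocess_re(text):
    """Lower-case the text and keep only runs of word characters."""
    return re.findall(r"\w+", text.lower())

sample = "what's the deal ? watch the movie and \" sorta \" find out . . ."
tokens = preprocess_re(sample)
print(tokens)
```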


<font color = green >

#### Build vocabulary

</font>


In [113]:
help(nltk.FreqDist)

Help on class FreqDist in module nltk.probability:

class FreqDist(collections.Counter)
 |  FreqDist(samples=None)
 |
 |  A frequency distribution for the outcomes of an experiment.  A
 |  frequency distribution records the number of times each outcome of
 |  an experiment has occurred.  For example, a frequency distribution
 |  could be used to record the frequency of each word type in a
 |  document.  Formally, a frequency distribution can be defined as a
 |  function mapping from each sample to the number of times that
 |  sample occurred as an outcome.
 |
 |  Frequency distributions are generally constructed by running a
 |  number of experiments, and incrementing the count for a sample
 |  every time it is an outcome of an experiment.  For example, the
 |  following code will produce a frequency distribution that encodes
 |  how often each word occurs in a text:
 |
 |      >>> from nltk.tokenize import word_tokenize
 |      >>> from nltk.probability import FreqDist
 |      >>> sen

In [None]:
all_words=nltk.FreqDist(all_words)
print ('len of vocabulary: {:,}'.format (len(all_words)))
# Use most common words
most_common_words = list(zip(*all_words.most_common()))[0] # [0] means names whereas [1] are frequencies
print (most_common_words[:100])

len of vocabulary: 39,696
('the', 'a', 'and', 'of', 'to', 'is', 'in', 's', 'it', 'that', 'as', 'with', 'for', 'his', 'this', 'film', 'i', 'he', 'but', 'on', 'are', 't', 'by', 'be', 'one', 'movie', 'an', 'who', 'not', 'you', 'from', 'at', 'was', 'have', 'they', 'has', 'her', 'all', 'there', 'like', 'so', 'out', 'about', 'up', 'more', 'what', 'when', 'which', 'or', 'she', 'their', 'some', 'just', 'can', 'if', 'we', 'him', 'into', 'even', 'only', 'than', 'no', 'good', 'time', 'most', 'its', 'will', 'story', 'would', 'been', 'much', 'character', 'also', 'get', 'other', 'do', 'two', 'well', 'them', 'very', 'characters', 'first', 'after', 'see', 'way', 'because', 'make', 'life', 'off', 'too', 'any', 'does', 'really', 'had', 'while', 'films', 'how', 'plot', 'little', 'where')
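
As the help text shows, `FreqDist` subclasses `collections.Counter`, so the `zip(*...most_common())` idiom can be tried on a plain `Counter`. The token list below is a made-up example.

```python
from collections import Counter

tokens = ["the", "film", "is", "the", "best", "film", "of", "the", "year"]
freq = Counter(tokens)

print(len(freq))               # vocabulary size (number of distinct words)
ranked = freq.most_common()    # [(word, count), ...] sorted by count, descending
words, counts = zip(*ranked)   # split the pairs into two parallel tuples
print(words[:2], counts[:2])
```

`zip(*ranked)` transposes the list of pairs, so index `[0]` holds the words and index `[1]` holds their frequencies.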


<font color = green >

#### Get rid of stop words 

</font>


In [115]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/master/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [122]:
def remove_stop_words(words):
    stop_words = set(stopwords.words('english'))
    return [w for w in words if w not in stop_words]
most_common_words_filtered = remove_stop_words(most_common_words)


<font color = green >

#### Select features 

</font>

In [121]:
word_features = most_common_words_filtered [:3000]
print (word_features[:100])

['film', 'one', 'movie', 'like', 'even', 'good', 'time', 'story', 'would', 'much', 'character', 'also', 'get', 'two', 'well', 'characters', 'first', 'see', 'way', 'make', 'life', 'really', 'films', 'plot', 'little', 'people', 'could', 'scene', 'man', 'bad', 'never', 'best', 'new', 'scenes', 'many', 'director', 'know', 'movies', 'action', 'great', 'another', 'love', 'go', 'made', 'us', 'big', 'end', 'something', 'back', 'still', 'world', 'seems', 'work', 'makes', 'however', 'every', 'though', 'better', 'real', 'audience', 'enough', 'seen', 'take', 'around', 'going', 'year', 'performance', 'role', 'old', 'gets', 'may', 'things', 'think', 'years', 'last', 'comedy', 'funny', 'actually', 'long', 'look', 'almost', 'thing', 'fact', 'nothing', 'say', 'right', 'john', 'although', 'played', 'find', 'script', 'come', 'ever', 'cast', 'since', 'star', 'plays', 'young', 'show', 'comes']


<font color = green >

#### Extract documents and labels

</font>


In [126]:
# Note: instead of tokenizing the raw text, this uses the corpus's pre-tokenized words retrieved by file_id.
documents = [(list(movie_reviews.words(file_id)), category) # using the words() method of movie_reviews object
            for category in movie_reviews.categories() # select category - there are two: ['neg', 'pos']
            for file_id in movie_reviews.fileids(category)]# select all file_ids for specified category
len (documents)
# This returns list of tuples (list_of_tokens_of document, label)

2000

In [127]:
print (documents [0]) # (['plot', ':', 'two', 'teen', ... 'echoes', '(', '8', '/', '10', ')'], 'neg')

(['plot', ':', 'two', 'teen', 'couples', 'go', 'to', 'a', 'church', 'party', ',', 'drink', 'and', 'then', 'drive', '.', 'they', 'get', 'into', 'an', 'accident', '.', 'one', 'of', 'the', 'guys', 'dies', ',', 'but', 'his', 'girlfriend', 'continues', 'to', 'see', 'him', 'in', 'her', 'life', ',', 'and', 'has', 'nightmares', '.', 'what', "'", 's', 'the', 'deal', '?', 'watch', 'the', 'movie', 'and', '"', 'sorta', '"', 'find', 'out', '.', '.', '.', 'critique', ':', 'a', 'mind', '-', 'fuck', 'movie', 'for', 'the', 'teen', 'generation', 'that', 'touches', 'on', 'a', 'very', 'cool', 'idea', ',', 'but', 'presents', 'it', 'in', 'a', 'very', 'bad', 'package', '.', 'which', 'is', 'what', 'makes', 'this', 'review', 'an', 'even', 'harder', 'one', 'to', 'write', ',', 'since', 'i', 'generally', 'applaud', 'films', 'which', 'attempt', 'to', 'break', 'the', 'mold', ',', 'mess', 'with', 'your', 'head', 'and', 'such', '(', 'lost', 'highway', '&', 'memento', ')', ',', 'but', 'there', 'are', 'good', 'and', 'b

<font color = green >

#### Shuffle documents 

</font>


In [128]:
# shuffle first
random.shuffle(documents) # shuffles the list in place
documents= documents[:500] # reduce the data set to speed up the demo
len (documents)

500

<font color = green >

#### Vectorize documents 

</font>


In [129]:
def find_features(review_tokens):
    tokens = set(review_tokens) # build the set once instead of once per feature word
    return {w: w in tokens for w in word_features} # feature representation of the document

data_set= [(find_features(review_tokens), category) for (review_tokens, category) in documents]


In [130]:
data_set[0]

({'film': False,
  'one': True,
  'movie': True,
  'like': True,
  'even': True,
  'good': False,
  'time': False,
  'story': True,
  'would': True,
  'much': True,
  'character': True,
  'also': False,
  'get': True,
  'two': True,
  'well': True,
  'characters': False,
  'first': True,
  'see': True,
  'way': True,
  'make': False,
  'life': False,
  'really': True,
  'films': False,
  'plot': True,
  'little': True,
  'people': True,
  'could': True,
  'scene': False,
  'man': False,
  'bad': True,
  'never': False,
  'best': True,
  'new': False,
  'scenes': False,
  'many': True,
  'director': False,
  'know': True,
  'movies': True,
  'action': False,
  'great': False,
  'another': True,
  'love': False,
  'go': True,
  'made': False,
  'us': False,
  'big': True,
  'end': True,
  'something': True,
  'back': False,
  'still': False,
  'world': False,
  'seems': True,
  'work': False,
  'makes': False,
  'however': False,
  'every': False,
  'though': False,
  'better': True,
  '

<font color = green >

#### Split to training and test set

</font>


In [131]:
split_on = int(len(data_set)*.8)
X_y_train= data_set[:split_on]
X_y_test = data_set[split_on:]
print (len(X_y_train))

400


<font color = green >

#### Train model

</font>


In [None]:
clf = nltk.NaiveBayesClassifier.train(X_y_train) # Note: the API differs from sklearn's; train() takes a list of (features, label) pairs
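
To make the training step less of a black box, here is a from-scratch sketch of a Naive Bayes classifier over the same kind of input, a list of `(feature_dict, label)` pairs with boolean features. It uses add-one smoothing; NLTK's own probability estimator and its handling of missing features may differ, so this is an illustration of the idea rather than a reimplementation. The function names and the tiny training set are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_features):
    """labeled_features: list of (dict feature -> bool, label) pairs."""
    label_counts = Counter(label for _, label in labeled_features)
    # true_counts[label][feature] = number of documents of `label` where feature is True
    true_counts = defaultdict(Counter)
    for features, label in labeled_features:
        for name, present in features.items():
            if present:
                true_counts[label][name] += 1
    return label_counts, true_counts

def classify_nb(model, features):
    label_counts, true_counts = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_label in label_counts.items():
        score = math.log(n_label / total)                      # log prior
        for name, present in features.items():
            # Add-one smoothing over the two outcomes (True / False)
            p_true = (true_counts[label][name] + 1) / (n_label + 2)
            score += math.log(p_true if present else 1 - p_true)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

train = [
    ({"wasted": True,  "wonderfully": False}, "neg"),
    ({"wasted": True,  "wonderfully": False}, "neg"),
    ({"wasted": False, "wonderfully": True},  "pos"),
    ({"wasted": False, "wonderfully": True},  "pos"),
]
model = train_nb(train)
print(classify_nb(model, {"wasted": True, "wonderfully": False}))
print(classify_nb(model, {"wasted": False, "wonderfully": True}))
```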

<font color = green >

#### Evaluate model

</font>


In [133]:
nltk.classify.accuracy(clf, X_y_test)*100

85.0

<font color = green >

#### Review most informative features

</font>


In [134]:
clf.show_most_informative_features(15)

Most Informative Features
                  wasted = True              neg : pos    =     11.4 : 1.0
                 details = True              pos : neg    =     11.3 : 1.0
                    alas = True              neg : pos    =     10.1 : 1.0
                anywhere = True              neg : pos    =      9.3 : 1.0
               animation = True              pos : neg    =      8.5 : 1.0
                    gary = True              neg : pos    =      8.5 : 1.0
                   harry = True              neg : pos    =      8.5 : 1.0
             wonderfully = True              pos : neg    =      8.0 : 1.0
               pointless = True              neg : pos    =      8.0 : 1.0
                 unfunny = True              neg : pos    =      7.5 : 1.0
             outstanding = True              pos : neg    =      7.4 : 1.0
                 patrick = True              pos : neg    =      7.4 : 1.0
                   sheer = True              pos : neg    =      6.9 : 1.0

<font color = green >

### Using sklearn

</font>


<font color = green >

#### Load data 

data set ['amazon-reviews-unlocked-mobile-phones'](https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones)
</font>


In [135]:
import os
cwd= os.getcwd() # current working directory
# path = os.path.join(cwd,'data')
fn=  'Amazon_Unlocked_Mobile.csv' # https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
df = pd.read_csv(fn) #
print('len=  {:,}\ncolumns= {}'.format(len(df), list(df)))

# df = df.sample(frac=0.1, random_state=10) # optionally reduce the number of reviews to speed up training, since this is a demo
df.head()

len=  413,840
columns= ['Product Name', 'Brand Name', 'Price', 'Rating', 'Reviews', 'Review Votes']


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


<font color = green >

#### Get rid of records with missing data 

</font>


In [136]:
df.dropna(inplace=True)
print('len=  {:,}'.format(len(df)))

len=  334,328


<font color = green >

#### Label positive and negative 

</font>


In [137]:
df = df[df['Rating'] != 3] # Remove any 'neutral' ratings equal to 3  as uninformative
df['Rating_binary'] = np.where(df['Rating'] > 3, 1, 0) # returns 1 for 4,5 and 0 for 1,2
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes,Rating_binary
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0,1
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0,1
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0,1
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0,1
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0,1


In [138]:
df['Rating_binary'].mean()

0.748269374249846

<font color = green >

#### Split to train and test sets

</font>


In [139]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df['Reviews'],df['Rating_binary'],random_state=42)

<font color = green >

#### Review training sample

</font>


In [146]:
X_train.iloc[0], y_train.iloc[0] # Be careful with querying like X_train[0]: it is a label-based lookup, i.e. X_train.loc[0]

("Best thing I ever ordered. It's in perfect condition, came with battery and charger. It arrived in just a couple days. I Love It?",
 1)

<font color = green >

#### Extract Features 

</font>
The bag-of-words approach is a simple way to represent text for use in machine learning: it ignores structure and only counts how often each word occurs.
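
A minimal bag-of-words sketch in plain Python, assuming whitespace tokenization and two made-up documents. It is meant to show the idea only, not how any particular library implements it.

```python
from collections import Counter

docs = ["great phone great price", "bad phone"]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Each document becomes a vector as long as the vocabulary, which is why real corpora produce very wide and mostly zero (sparse) matrices.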

<font color = green >

#### CountVectorizer vectorizer

</font>
By default, selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator)
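
The effect of this default can be checked with the standard `re` module. The regular expression below is the default `token_pattern` of `CountVectorizer`; the sample sentence is made up, and the lower-casing mirrors the vectorizer's default `lowercase=True`.

```python
import re

# Default token pattern: two or more word characters between word boundaries.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "I Love It? It's a 5-star phone!"
tokens = TOKEN_PATTERN.findall(text.lower())
print(tokens)
```

One-character tokens such as "i", "a", "s" and "5" are dropped, so they never become features.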

In [147]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer().fit(X_train) # Fit the CountVectorizer to the training data
print('features samples:\n{}'.format(vect.get_feature_names_out()[::2000])) # display each 2000-th feature
print ('\nlen of features {:,}'.format(len(vect.get_feature_names_out())))


features samples:
['00' '4gig' 'adoption' 'asp' 'blankets' 'casecredit' 'condemned'
 'deafult' 'documentation' 'esperiencia' 'fixer' 'goodlots' 'howeveri'
 'irc' 'lifethose' 'miamithank' 'niceties' 'ownd' 'political' 'quererme'
 'resolucion' 'sel' 'sometes' 'swetingtherefore' 'tote' 'usd79'
 'willhappen']

len of features 53,415


<font color = green >

#### Transform X_train to feature representation

</font>


In [148]:
X_train_vectorized = vect.transform(X_train) # for each text: indices of the vocabulary words it contains and their counts
X_train_vectorized

<231202x53415 sparse matrix of type '<class 'numpy.int64'>'
	with 6117881 stored elements in Compressed Sparse Row format>

In [149]:
print (X_train_vectorized[0])

  (0, 5026)	1
  (0, 5852)	1
  (0, 7198)	1
  (0, 7649)	1
  (0, 9542)	1
  (0, 10426)	1
  (0, 12020)	1
  (0, 12960)	1
  (0, 13915)	1
  (0, 18259)	1
  (0, 24783)	2
  (0, 26171)	3
  (0, 26795)	1
  (0, 28670)	1
  (0, 33488)	1
  (0, 34779)	1
  (0, 47237)	1
  (0, 52135)	1


<font color = green >

#### Review vectorized training sample

</font>


In [150]:
# review first sample (note: this rebinds the name `df`, so the reviews DataFrame is no longer available under it)
df = pd.DataFrame(X_train_vectorized[0].toarray(), index= ['value']).T
df

Unnamed: 0,value
0,0
1,0
2,0
3,0
4,0
...,...
53410,0
53411,0
53412,0
53413,0


In [151]:
print (list(df[df['value']>0].index))
[vect.get_feature_names_out()[index] for index in df[df['value']>0].index.values]

[5026, 5852, 7198, 7649, 9542, 10426, 12020, 12960, 13915, 18259, 24783, 26171, 26795, 28670, 33488, 34779, 47237, 52135]


['and',
 'arrived',
 'battery',
 'best',
 'came',
 'charger',
 'condition',
 'couple',
 'days',
 'ever',
 'in',
 'it',
 'just',
 'love',
 'ordered',
 'perfect',
 'thing',
 'with']

<font color = green >

#### Train model

</font>


In [153]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import f1_score

In [154]:
clf = LogisticRegression(max_iter=2000).fit(X_train_vectorized, y_train) # Train the model

<font color = green >

#### Evaluate model

</font>


In [155]:
predictions = clf.predict(vect.transform(X_test)) # Predict the transformed test documents
print('f1: ', f1_score(y_test, predictions))
scores = clf.decision_function(vect.transform(X_test))
print('AUC: ', roc_auc_score(y_test, scores))

f1:  0.967568684156166
AUC:  0.9793126282763929
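
To see what the two metrics measure, here is a small pure-Python sketch. The labels and scores are hypothetical, the helper names are assumptions, and the functions skip edge cases such as an empty positive class. The threshold at 0 matches how a binary linear model turns `decision_function` scores into predictions.

```python
def f1_from_predictions(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_from_scores(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0]
scores = [2.1, 0.4, -0.3, 0.1, -1.5]          # hypothetical decision scores
y_pred = [1 if s > 0 else 0 for s in scores]  # threshold the scores at 0

f1 = f1_from_predictions(y_true, y_pred)
auc = auc_from_scores(y_true, scores)
print(f1)
print(auc)
```

F1 depends on the chosen threshold, whereas AUC only depends on how the scores rank positives against negatives.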


<font color = green >

#### Review relevant features 
    
</font>

The smallest coefficients correspond to `Neg` impact, and the largest coefficients correspond to `Pos` impact.

In [156]:
feature_names = np.array(vect.get_feature_names_out())
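sorted_coef_index = clf.coef_[0].argsort() # ascending order; [0] just squeezes shape (1, n) to (n,)
# Note: the slice [-11:-1] shows ten large coefficients but skips the single largest one; [-10:] would include it
clf.coef_.shape, clf.coef_[0].shape, sorted(clf.coef_[0])[:10], sorted(clf.coef_[0])[-11:-1],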

((1, 53415),
 (53415,),
 [-5.153103790597999,
  -3.9374596368268757,
  -3.748686859339774,
  -3.5716910706268763,
  -3.318505971547361,
  -3.315362076693784,
  -3.2701568923671074,
  -3.2558265929200565,
  -3.1714529653832773,
  -3.122313332904471],
 [3.2289376691643654,
  3.2312554160344313,
  3.2357179245971217,
  3.3113626233690456,
  3.326498998919694,
  3.386029031214735,
  3.499818695136082,
  3.6025081700023494,
  4.89920909187138,
  5.002879160589319])

In [45]:
print('Smallest coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest coefs:
['mony' 'worst' 'horribly' 'false' 'blacklist' 'lemon' 'messing' 'junk'
 'worthless' 'mouth']

Largest Coefs: 
['excelent' '4eeeks' 'excelente' 'excellent' 'loving' 'pleasantly'
 'exelente' 'loves' 'lovely' 'buen']
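
The two slices used above, `[:10]` and `[:-11:-1]`, are easier to follow on a tiny example. The coefficients and feature names below are invented, and `sorted(range(...))` stands in for numpy's `argsort`.

```python
coefs = [0.5, -2.0, 3.0, 0.0, -0.1]
names = ["ok", "junk", "excellent", "phone", "slow"]

# Pure-Python stand-in for argsort: indices that sort coefs in ascending order.
order = sorted(range(len(coefs)), key=coefs.__getitem__)

smallest = [names[i] for i in order[:2]]       # most negative coefficients first
largest = [names[i] for i in order[:-3:-1]]    # walk backwards: largest first
print(smallest)
print(largest)
```

A slice `[:-(k+1):-1]` returns the last `k` elements in reverse order, so the largest coefficient comes first.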


<font color="green">

## Term Frequency–Inverse Document Frequency (TF-IDF)

</font>

TF-IDF is a numerical statistic that reflects how important a word is to a document in a collection or corpus. Its value **increases** with the number of times a word appears in a document and **decreases** with the number of documents in the corpus that contain that word.

<div style="float:left;">
<br>

### Term Frequency

The term frequency \(tf(t,d)\) measures how often term \(t\) appears in document \(d\):

$$
tf(t,d) = \frac{k}{n}
$$  

\(d\) — document;  
\(k\) — number of times the word occurs in \(d\);  
\(n\) — total number of words in \(d\).

*Augmented frequency* (to reduce bias toward longer documents):

$$
tf^{\,A}(t,d) = 0.5 + 0.5 \frac{tf(t,d)}
                     {\max_{t' \in d} tf(t',d)}
$$

### Inverse Document Frequency

The inverse document frequency \(idf(t,D)\) measures how much information the word provides:

$$
idf(t,D) = \log \frac{N}{K}
$$  

\(D\) — the entire collection of documents;  
\(K\) — number of documents in \(D\) that contain the word;  
\(N\) — total number of documents in \(D\).

</div>

**Note:** Various approaches can be used for inverse document frequency.

<div style="float:left;">
<table width="500">
  <tr>
    <th style="text-align:center" bgcolor="white">Document&nbsp;1</th>
    <th style="text-align:center" bgcolor="white">Document&nbsp;2</th>
  </tr>
  <tr>
    <td>
      <table>
        <tr><th bgcolor="gainsboro">Term</th><th bgcolor="gainsboro">Count</th></tr>
        <tr><td>this</td><td>1</td></tr>
        <tr><td>is</td><td>1</td></tr>
        <tr><td>a</td><td>2</td></tr>
        <tr><td>sample</td><td>1</td></tr>
      </table>
    </td>
    <td>
      <table>
        <tr><th bgcolor="gainsboro">Term</th><th bgcolor="gainsboro">Count</th></tr>
        <tr><td>this</td><td>1</td></tr>
        <tr><td>is</td><td>1</td></tr>
        <tr><td>another</td><td>2</td></tr>
        <tr><td>example</td><td>3</td></tr>
      </table>
    </td>
  </tr>
</table>
</div>

<div style="float:left;">
<br>

For **“this”**:

$$
tf(\text{"this"}, d_{1}) = \frac{1}{5} = 0.2, \quad
tf(\text{"this"}, d_{2}) = \frac{1}{7} \approx 0.14, \quad
idf(\text{"this"}, D) = \log\frac{2}{2} = 0
$$

$$
tfidf(\text{"this"}, d_{1}, D) = 0.2 \times 0 = 0, \qquad
tfidf(\text{"this"}, d_{2}, D) = 0.14 \times 0 = 0
$$

For **“example”**:

$$
tf(\text{"example"}, d_{1}) = \frac{0}{5} = 0, \quad
tf(\text{"example"}, d_{2}) = \frac{3}{7} \approx 0.43, \quad
idf(\text{"example"}, D) = \log\frac{2}{1} \approx 0.30
$$

$$
tfidf(\text{"example"}, d_{1}, D) = 0 \times 0.30 = 0, \qquad
tfidf(\text{"example"}, d_{2}, D) = 0.43 \times 0.30 \approx 0.129
$$

</div>
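
The worked example can be checked numerically. This sketch follows the definitions in this section and assumes a base-10 logarithm, which is what the value 0.30 for $\log\frac{2}{1}$ implies; the function names are chosen for illustration.

```python
import math

d1 = {"this": 1, "is": 1, "a": 2, "sample": 1}
d2 = {"this": 1, "is": 1, "another": 2, "example": 3}
corpus = [d1, d2]

def tf(term, doc):
    """Relative frequency of `term` in `doc` (k / n)."""
    return doc.get(term, 0) / sum(doc.values())

def idf(term, docs):
    """log10(N / K); assumes the term occurs in at least one document."""
    k = sum(1 for doc in docs if term in doc)
    return math.log10(len(docs) / k)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(tfidf("this", d1, corpus))
print(round(tfidf("example", d1, corpus), 3))
print(round(tfidf("example", d2, corpus), 3))
```

A word that appears in every document, such as "this", gets an idf of 0 and therefore carries no weight.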


<font color = green >

### Sklearn tfidf

</font>


In [157]:
from sklearn.feature_extraction.text import TfidfVectorizer

<font color = green >

#### Compute sklearn tfidf for sample with 2 documents 

</font>


In [None]:
X = np.array(['this is a sample a', 'this is another example another example example'])
tfidf_vectorizer = TfidfVectorizer().fit(X)
X_vectorized= tfidf_vectorizer.transform(X)
print (tfidf_vectorizer.vocabulary_)
X_vectorized.toarray()
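# conclusion: sklearn uses a different tf-idf variant: by default tf is the raw count,
# idf = ln((1 + N) / (1 + df)) + 1, each row is L2-normalized, and one-character tokens such as "a" are dropped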

{'this': 4, 'is': 2, 'sample': 3, 'another': 0, 'example': 1}


array([[0.        , 0.        , 0.50154891, 0.70490949, 0.50154891],
       [0.53428425, 0.80142637, 0.19007382, 0.        , 0.19007382]])
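
The numbers in this output can be reproduced by hand. The sketch below assumes the vectorizer defaults (raw term counts, smoothed idf $\ln\frac{1+N}{1+df}+1$, L2 row normalization, tokens of two or more characters); the helper names are illustrative and this is not the library's code.

```python
import math
import re
from collections import Counter

docs = ["this is a sample a",
        "this is another example another example example"]

# Tokens of 2+ word characters, so the one-letter word "a" is dropped.
tokenized = [re.findall(r"(?u)\b\w\w+\b", d.lower()) for d in docs]
vocab = sorted({t for toks in tokenized for t in toks})
n_docs = len(docs)
doc_freq = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}

def smooth_idf(term):
    """idf(t) = ln((1 + N) / (1 + df(t))) + 1"""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    counts = Counter(tokens)
    raw = [counts[t] * smooth_idf(t) for t in vocab]   # raw count times idf
    norm = math.sqrt(sum(v * v for v in raw))          # L2 normalization
    return [v / norm for v in raw]

rows = [tfidf_row(toks) for toks in tokenized]
print(vocab)
for row in rows:
    print([round(v, 8) for v in row])
```

Because of the `+ 1` in the idf, words that occur in every document ("this", "is") keep a non-zero weight, unlike in the textbook formula.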

<font color = green >

#### Use sklearn tfidf for Amazon_Unlocked_Mobile documents 

</font>


In [159]:
tfidf_vectorizer= TfidfVectorizer(min_df=5)#.fit(X_train)
    # min_df - minimum number of documents that must contain a term for it to be included, default is 1
    # you may also set max_features (int or None) to keep only the most frequent terms across the corpus
X_train_vectorized = tfidf_vectorizer.fit_transform(X_train)
print ('len of features= {:,}'.format(len(tfidf_vectorizer.get_feature_names_out())))
    # Note: min_df=5 yields 17,984 features compared to 53,415 produced by the count vectorizer
    # Note: min_df is also available in the count vectorizer

len of features= 17,984


In [160]:
# X_train_vectorized.shape # (231202, 17984) = (n_documents, n_features)
sorted_tfidf_index = X_train_vectorized.max(axis=0).toarray()[0].argsort()
    # max(axis=0) takes the maximum over all documents, i.e. the largest tf-idf each term reaches in any document
    # [0] - just squeezing
print (np.sort(X_train_vectorized.max(axis=0).toarray()[0]))
sorted_tfidf_index # term indices sorted by their maximum tf-idf, ascending


[0.01989095 0.01989095 0.01989095 ... 1.         1.         1.        ]


array([15230,  1288,  3668, ..., 12144,  9802, 16029])

In [161]:
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
print ('feature_names ',feature_names)
print('Smallest tfidf:\n{}\n'.format(feature_names[sorted_tfidf_index[:10]]))
print('Largest tfidf: \n{}'.format(feature_names[sorted_tfidf_index[:-11:-1]]))


feature_names  ['00' '000' '0000' ... 'тнe' 'աɨtɦ' 'աօʀҡ']
Smallest tfidf:
['storageso' 'aggregration' 'commenter' 'pthalo' 'warmness' '1300'
 'seizing' '401p' 'bigtime' 'a10']

Largest tfidf: 
['thnx' 'luis' 'positive' 'exito' 'stats' 'returned' 'heated' 'bueno'
 'return' 'exellent']


<font color = green >

#### Train model on features extracted by tfidf vectorizer

</font>


In [162]:
clf = LogisticRegression(max_iter=1000).fit(X_train_vectorized, y_train) # Train the model
predictions = clf.predict(tfidf_vectorizer.transform(X_test))
print('f1: ', f1_score(y_test, predictions))
scores = clf.decision_function(tfidf_vectorizer.transform(X_test))
print('AUC: ', roc_auc_score(y_test, scores))

f1:  0.965055851936258
AUC:  0.9827034570487878


#### Conclusion: Performance is not worse, although about 3 times fewer features are used

In [163]:
sorted_coef_index = clf.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['not' 'worst' 'useless' 'waste' 'disappointed' 'terrible' 'return'
 'returning' 'poor' 'horrible']

Largest Coefs: 
['love' 'great' 'excellent' 'amazing' 'perfect' 'awesome' 'loves' 'easy'
 'perfectly' 'best']


<font color = green >

### n-grams

</font>


In [164]:
# the problem: the current model treats the following two reviews the same
targets= [
    "not an issue, phone is working",
    "an issue, phone is not working"
]
print(clf.predict(tfidf_vectorizer.transform(targets)))


[0 0]
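
Why the two reviews are indistinguishable for a single-word model, and how bigrams separate them, can be shown with a small helper. This is a sketch with a hypothetical `ngrams` function using simple `\w+` tokenization.

```python
import re

def ngrams(text, n):
    """Return the list of word n-grams in `text`."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "not an issue, phone is working"
b = "an issue, phone is not working"

same_unigrams = sorted(ngrams(a, 1)) == sorted(ngrams(b, 1))
only_in_a = set(ngrams(a, 2)) - set(ngrams(b, 2))
only_in_b = set(ngrams(b, 2)) - set(ngrams(a, 2))

print(same_unigrams)   # both reviews contain exactly the same words
print(only_in_a)
print(only_in_b)
```

The unigram bags are identical, so any model built on them must give both reviews the same score; only the bigrams, such as "not working", carry the difference.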


In [166]:
count_vectorizer = CountVectorizer(min_df=5, max_features=50000, ngram_range=(1,2)).fit(X_train) # Note: both limits are included
X_train_vectorized = count_vectorizer.transform(X_train)
print('len of features using n-grams vectorizer={:,}'.format(len(count_vectorizer.get_feature_names_out())))


len of features using n-grams vectorizer=50,000


In [171]:
clf= LogisticRegression(max_iter= 2000).fit(X_train_vectorized, y_train)
predictions = clf.predict(count_vectorizer.transform(X_test))
print('f1: ', f1_score(y_test, predictions))
scores = clf.decision_function(count_vectorizer.transform(X_test))
print('AUC: ', roc_auc_score(y_test, scores))

f1:  0.9827553370154267
AUC:  0.9897879145821752


In [172]:
feature_names = np.array(count_vectorizer.get_feature_names_out())
sorted_coef_index = clf.coef_[0].argsort()
print('Smallest Coefs:\n{}\n'.format(feature_names[sorted_coef_index[:10]]))
print('Largest Coefs: \n{}'.format(feature_names[sorted_coef_index[:-11:-1]]))

Smallest Coefs:
['no good' 'worst' 'junk' 'garbage' 'not good' 'horrible' 'not happy'
 'support from' 'looks ok' 'product good']

Largest Coefs: 
['not bad' 'excelent' 'excelente' 'excellent' 'no problems' 'perfect'
 'awesome' 'no issues' 'amazing' 'exelente']


In [173]:
print (targets)
print(clf.predict(count_vectorizer.transform(targets)))

['not an issue, phone is working', 'an issue, phone is not working']
[1 0]


In [174]:

import pandas as pd
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Disable warnings
import warnings
warnings.filterwarnings('ignore')


<font color = green >

## Modern Approaches to Sentiment Analysis (2023-2025)

</font>

Text analysis has evolved significantly in recent years with the introduction of:

1. **Word Embeddings** - Dense vector representations that capture semantic meaning
2. **Transformer Models** - Context-aware models that understand sequences and relationships
3. **Transfer Learning** - Using pre-trained models on large corpora for specific tasks
4. **Few-shot Learning** - Ability to perform a task given only a handful of labeled examples

Let's explore these techniques using our Amazon reviews dataset.


<font color = green >

### Word Embeddings with Word2Vec

</font>

Unlike bag-of-words or TF-IDF, which treat words as discrete atomic units, word embeddings represent words as continuous vectors, so that semantically similar words are mapped to nearby points.

Key advantages:
- Captures semantic relationships between words
- Reduces dimensionality compared to one-hot encoding
- Words with similar meanings have similar vectors
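
"Similar vectors" is usually measured with cosine similarity. A minimal sketch with made-up 3-dimensional vectors (illustrative values only, not trained embeddings; the `toy_vectors` dictionary is hypothetical):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors for illustration; real Word2Vec vectors have ~100+ dimensions
toy_vectors = {
    "excellent": [0.9, 0.8, 0.1],
    "awesome":   [0.8, 0.9, 0.2],
    "horrible":  [-0.7, -0.9, 0.1],
}

sim_close = cosine_similarity(toy_vectors["excellent"], toy_vectors["awesome"])
sim_far = cosine_similarity(toy_vectors["excellent"], toy_vectors["horrible"])
print(f"excellent vs awesome:  {sim_close:.3f}")
print(f"excellent vs horrible: {sim_far:.3f}")
```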


In [None]:
# Install and import necessary libraries
# !pip install gensim

import gensim
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')




[nltk_data] Downloading package punkt to /home/master/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [71]:
# Tokenize and prepare text data for Word2Vec
def tokenize_text(text):
    return word_tokenize(text.lower())

# Prepare sentences for Word2Vec (it needs a list of tokenized sentences)
X_train_tokenized = [tokenize_text(text) for text in X_train]

# Train Word2Vec model
w2v_model = Word2Vec(X_train_tokenized,
                    vector_size=100,     # Dimensionality of word vectors
                    window=5,            # Context window size
                    min_count=5,         # Ignore words with frequency below this
                    workers=4)           # Number of threads
print(f"Vocabulary size: {len(w2v_model.wv.key_to_index)}")

Vocabulary size: 22374


In [72]:
# Examine the model
# Find most similar words to "excellent"
if "excellent" in w2v_model.wv:
    similar_words = w2v_model.wv.most_similar("excellent", topn=10)
    print("Words most similar to 'excellent':")
    for word, score in similar_words:
        print(f"{word}: {score:.4f}")

Words most similar to 'excellent':
outstanding: 0.8038
excelent: 0.7889
awesome: 0.7719
exceptional: 0.7705
amazing: 0.7560
incredible: 0.7472
great: 0.7163
superb: 0.7127
awsome: 0.6956
fantastic: 0.6880


In [73]:
# Function to create document vectors by averaging word vectors
def document_vector(doc, model):
    # Remove out-of-vocabulary words
    doc = [word for word in doc if word in model.wv]
    if len(doc) == 0:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[word] for word in doc], axis=0)

# Create document vectors for train and test sets
X_train_w2v = np.array([document_vector(doc, w2v_model) for doc in X_train_tokenized])
X_test_tokenized = [tokenize_text(text) for text in X_test]
X_test_w2v = np.array([document_vector(doc, w2v_model) for doc in X_test_tokenized])

In [74]:
# Train a logistic regression model on Word2Vec features
from sklearn.linear_model import LogisticRegression

w2v_clf = LogisticRegression(max_iter=1000).fit(X_train_w2v, y_train)
w2v_predictions = w2v_clf.predict(X_test_w2v)
w2v_scores = w2v_clf.predict_proba(X_test_w2v)[:, 1]

print(f"Word2Vec + Logistic Regression:")
print(f"F1 Score: {f1_score(y_test, w2v_predictions):.4f}")
print(f"AUC Score: {roc_auc_score(y_test, w2v_scores):.4f}")

Word2Vec + Logistic Regression:
F1 Score: 0.9437
AUC Score: 0.9588
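
One likely contributor to the gap with the n-gram model is that averaging word vectors discards word order. The sketch below illustrates this with the two target sentences used earlier; the random integer vectors are stand-ins for trained embeddings (an assumption made only to keep the example self-contained), and the regex tokenizer is a simplification of `word_tokenize`.

```python
import random
import re

def toy_embedding(vocab, dim=4, seed=0):
    """Assign each word a fixed random integer vector (stand-in for Word2Vec)."""
    rng = random.Random(seed)
    return {word: [rng.randint(-5, 5) for _ in range(dim)] for word in sorted(vocab)}

def mean_vector(tokens, embedding):
    """Average the vectors of all tokens, as document_vector does."""
    dim = len(next(iter(embedding.values())))
    totals = [0] * dim
    for tok in tokens:
        for i, x in enumerate(embedding[tok]):
            totals[i] += x
    return [t / len(tokens) for t in totals]

targets = ['not an issue, phone is working', 'an issue, phone is not working']
tokenized = [re.findall(r"\w+", t.lower()) for t in targets]
embedding = toy_embedding(set(tokenized[0]) | set(tokenized[1]))

vec_a, vec_b = (mean_vector(toks, embedding) for toks in tokenized)
print(tokenized[0] != tokenized[1])  # the token sequences differ
print(vec_a == vec_b)                # yet the averaged vectors are identical
```

Because both sentences contain the same multiset of words, any averaging scheme maps them to the same point, whatever the embedding values are.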


<font color = green >

### Transformer-Based Models and Transfer Learning

</font>

Transformer models like BERT, RoBERTa, and their variants have revolutionized NLP by capturing contextual information and enabling transfer learning from large pre-trained models.

Key advantages:
- Bidirectional context understanding
- Pre-trained on massive text corpora
- Can be fine-tuned for specific tasks with relatively small datasets
- Captures complex linguistic phenomena


In [75]:
# Install transformers and datasets libraries
# !pip install transformers datasets

from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import Dataset
import torch.nn.functional as F

In [None]:
# Let's use a smaller subset for demonstration purposes
sample_size = 5000  # Adjust based on your computational resources
indices = np.random.choice(len(X_train), sample_size, replace=False)
X_train_sample = X_train.iloc[indices].reset_index(drop=True)
y_train_sample = y_train.iloc[indices].reset_index(drop=True)

# Prepare datasets in the format expected by Hugging Face
train_dataset = Dataset.from_dict({
    'text': X_train_sample,
    'label': y_train_sample
})

test_indices = np.random.choice(len(X_test), min(1000, len(X_test)), replace=False)
X_test_sample = X_test.iloc[test_indices].reset_index(drop=True)
y_test_sample = y_test.iloc[test_indices].reset_index(drop=True)

test_dataset = Dataset.from_dict({
    'text': X_test_sample,
    'label': y_test_sample
})

In [77]:
# Load pre-trained model and tokenizer
model_name = "distilbert-base-uncased"  # A smaller, faster version of BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize data
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=128)

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_test = test_dataset.map(tokenize_function, batched=True)


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|██████████| 5000/5000 [00:00<00:00, 15191.65 examples/s]
Map: 100%|██████████| 1000/1000 [00:00<00:00, 15144.79 examples/s]
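
The `padding="max_length"` and `truncation=True` options make every example exactly `max_length` tokens long. A toy sketch of the idea on plain lists of token ids (the helper `pad_or_truncate` and the pad id `0` are assumptions for illustration; the real tokenizer also handles special tokens such as `[CLS]` and `[SEP]`):

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding="max_length" combined with truncation=True."""
    clipped = token_ids[:max_length]              # truncation: drop the overflow
    n_pad = max_length - len(clipped)
    input_ids = clipped + [pad_id] * n_pad        # padding: fill up to max_length
    attention_mask = [1] * len(clipped) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([5, 6, 7], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padded positions.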


In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    # use_cpu=True  # Force training on CPU (older transformers versions use no_cuda=True)
)

# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
)

# Train the model
trainer.train()

Step,Training Loss
10,0.7422
20,0.6897
30,0.6093
40,0.5102
50,0.4552
60,0.3564
70,0.238
80,0.1904
90,0.148
100,0.1453


TrainOutput(global_step=471, training_loss=0.1721426574968102, metrics={'train_runtime': 383.0194, 'train_samples_per_second': 39.163, 'train_steps_per_second': 1.23, 'total_flos': 496752744960000.0, 'train_loss': 0.1721426574968102, 'epoch': 3.0})
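
The reported `global_step=471` can be checked with simple arithmetic. The sketch below assumes no gradient accumulation and that the last partial batch of each epoch is kept; the helper `total_optimizer_steps` is hypothetical. With 5,000 samples and 3 epochs, 471 steps is consistent with an effective batch size of 32, for example two devices at 16 samples each. That is an inference from the log, not something the notebook states.

```python
import math

def total_optimizer_steps(num_samples, per_device_batch_size, num_devices, num_epochs):
    """Number of optimizer steps for a full training run (no gradient accumulation)."""
    effective_batch = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return steps_per_epoch * num_epochs

one_device = total_optimizer_steps(5000, 16, num_devices=1, num_epochs=3)
two_devices = total_optimizer_steps(5000, 16, num_devices=2, num_epochs=3)
print(one_device, two_devices)
```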

In [79]:
# Evaluate the model
predictions = trainer.predict(tokenized_test)
preds = np.argmax(predictions.predictions, axis=-1)

print(f"Transformer (DistilBERT) model results:")
print(f"F1 Score: {f1_score(y_test_sample, preds):.4f}")
print(f"Accuracy: {accuracy_score(y_test_sample, preds):.4f}")

Transformer (DistilBERT) model results:
F1 Score: 0.9656
Accuracy: 0.9480
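
The earlier models were also scored with AUC, which needs a continuous score per review rather than a hard label. One common way to get such a score from classifier logits is a softmax over the two classes. A minimal sketch with hypothetical logits (the numbers below are made up and are not taken from `predictions.predictions`):

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three reviews, ordered as [negative, positive]
example_logits = [[-2.1, 2.4], [1.7, -1.3], [0.2, 0.1]]
positive_scores = [softmax(row)[1] for row in example_logits]
for row, score in zip(example_logits, positive_scores):
    print(row, f"P(positive) = {score:.4f}")
```

Scores of this kind could then be passed to `roc_auc_score` together with the true labels.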


<font color = green >

### Zero-Shot and Few-Shot Classification

</font>

Modern NLP models can perform tasks with minimal or even zero examples, leveraging their pre-training on diverse corpora.

- **Zero-shot learning**: Using models to classify text without any labeled examples
- **Few-shot learning**: Using only a few examples to adapt to a specific task
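
Zero-shot classifiers built on natural language inference (NLI) models typically turn each candidate label into a hypothesis sentence and ask the model how strongly the text entails it. The sketch below shows only that bookkeeping: the template matches the default documented for the Hugging Face zero-shot pipeline, while the entailment logits are made-up numbers standing in for real model output.

```python
import math

def build_hypotheses(candidate_labels, template="This example is {}."):
    """Turn each label into an NLI hypothesis sentence."""
    return [template.format(label) for label in candidate_labels]

def normalize_scores(entailment_logits):
    """Softmax over the per-label entailment logits so the scores sum to 1."""
    m = max(entailment_logits)
    exps = [math.exp(x - m) for x in entailment_logits]
    total = sum(exps)
    return [e / total for e in exps]

candidate_labels = ["positive", "negative"]
hypotheses = build_hypotheses(candidate_labels)

# Made-up entailment logits, one per hypothesis (a real NLI model would produce these)
fake_entailment_logits = [-1.5, 2.5]
scores = normalize_scores(fake_entailment_logits)
best_score, best_label = max(zip(scores, candidate_labels))
print(hypotheses)
print(best_label, round(best_score, 4))
```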

In [80]:
# Install required libraries
# !pip install "transformers>=4.26.0"  # quotes keep the shell from treating >= as a redirect

from transformers import pipeline

In [81]:
# Zero-shot classification with a pre-trained model
zero_shot_classifier = pipeline("zero-shot-classification",
                                model="facebook/bart-large-mnli")

# Sample a few reviews
sample_reviews = X_test.iloc[:5].tolist()
candidate_labels = ["positive", "negative"]

# Perform zero-shot classification
for review in sample_reviews:
    # Truncate long reviews for demonstration
    short_review = review[:512]
    result = zero_shot_classifier(short_review, candidate_labels)
    print(f"Review: {short_review[:100]}...")
    print(f"Predicted: {result['labels'][0]} with confidence {result['scores'][0]:.4f}")
    print("-" * 50)

Device set to use cuda:0


Review: Didnt came with the pink case offered. 1 week of use and the screen go bad damaged with strips....
Predicted: negative with confidence 0.9798
--------------------------------------------------
Review: buen vendedor lo recomiendo, estoy 100% satisfecho con mi articulo el articulo es como el descrito p...
Predicted: positive with confidence 0.9982
--------------------------------------------------
Review: Phone had horrible service quality for some odd reason then after 2 weeks of using it the phone just...
Predicted: negative with confidence 0.9937
--------------------------------------------------
Review: I took the phone to Brazil and it would only work with one sim card. The sim card that had worked in...
Predicted: negative with confidence 0.9914
--------------------------------------------------
Review: excellent phone. blackberry is the best mark on the market for smart phones. recommended for those w...
Predicted: positive with confidence 0.9959
--------------------------------------------------

<font color = green >

### Emotion Analysis: Beyond Positive/Negative

</font>

Modern sentiment analysis goes beyond binary classification to detect specific emotions and nuances.


In [83]:
# Emotion analysis using a pre-trained model
emotion_classifier = pipeline("text-classification",
                            model="j-hartmann/emotion-english-distilroberta-base",
                            top_k=2)

# Sample a few reviews
sample_reviews = X_test.iloc[:5].tolist()

# Perform emotion analysis
for review in sample_reviews:
    # Truncate long reviews for demonstration
    short_review = review[:512]
    result = emotion_classifier(short_review)
    print(f"Review: {short_review[:100]}...")
    print(f"Detected emotions: {result[0][0]['label']} ({result[0][0]['score']:.4f}), {result[0][1]['label']} ({result[0][1]['score']:.4f})")
    print("-" * 50)

Device set to use cuda:0


Review: Didnt came with the pink case offered. 1 week of use and the screen go bad damaged with strips....
Detected emotions: sadness (0.4625), neutral (0.3992)
--------------------------------------------------
Review: buen vendedor lo recomiendo, estoy 100% satisfecho con mi articulo el articulo es como el descrito p...
Detected emotions: joy (0.4326), neutral (0.4130)
--------------------------------------------------
Review: Phone had horrible service quality for some odd reason then after 2 weeks of using it the phone just...
Detected emotions: disgust (0.2515), surprise (0.1910)
--------------------------------------------------
Review: I took the phone to Brazil and it would only work with one sim card. The sim card that had worked in...
Detected emotions: sadness (0.9642), surprise (0.0170)
--------------------------------------------------
Review: excellent phone. blackberry is the best mark on the market for smart phones. recommended for those w...
Detected emotions: joy (0.5

<font color = green >

### Aspect-Based Sentiment Analysis

</font>

Aspect-based sentiment analysis identifies specific aspects or features mentioned in text and determines the sentiment toward each aspect.

For example, in a phone review:
- "The battery life is excellent, but the camera quality is poor."

We want to extract:
- Aspect: battery life, Sentiment: positive
- Aspect: camera quality, Sentiment: negative
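
To make the target output concrete, here is a deliberately naive rule-based sketch. The aspect list and opinion lexicon are hard-coded, hypothetical stand-ins for what a trained aspect extractor and sentiment model would provide.

```python
import re

# Hypothetical aspect list and opinion lexicon, for illustration only
ASPECTS = ["battery life", "camera quality"]
OPINION_WORDS = {"excellent": "positive", "poor": "negative"}

def toy_absa(review):
    """Pair each known aspect with an opinion word found in the same clause."""
    results = {}
    # Split on commas, optionally followed by "but"
    clauses = re.split(r",\s*(?:but\s+)?", review.lower())
    for clause in clauses:
        for aspect in ASPECTS:
            if aspect not in clause:
                continue
            for word, polarity in OPINION_WORDS.items():
                if re.search(rf"\b{re.escape(word)}\b", clause):
                    results[aspect] = polarity
    return results

absa_result = toy_absa("The battery life is excellent, but the camera quality is poor.")
for aspect, sentiment in absa_result.items():
    print(f"Aspect: {aspect}, Sentiment: {sentiment}")
```

A lexicon approach like this breaks on negation and unseen wording, which is why model-based ABSA is used in practice.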


In [None]:
# Load aspect-based sentiment analysis model
absa_pipeline = pipeline(
    "text-classification",
    model="yangheng/deberta-v3-base-absa-v1.1",
    return_all_scores=True,
    use_fast=False
)

# Example reviews with multiple aspects
example_reviews = [
    "The screen is bright and clear, but the battery drains too quickly.",
    "This phone has an amazing camera, though it's a bit overpriced.",
    "The software is intuitive, but it crashes frequently when multitasking."
]

# Extract aspects and sentiments (simplified approach)
for review in example_reviews:
    print(f"Review: {review}")

    # A real pipeline would first extract aspect terms (e.g. with NER or a dedicated
    # aspect-extraction model) and score each aspect in the context of the full review.
    # As a simplification, each comma-separated clause is treated as one "aspect" here.
    aspects = review.split(", ")
    for aspect in aspects:
        result = absa_pipeline(aspect)
        print(f"  Aspect: {aspect}")
        for score in result[0]:
            if score['score'] > 0.5:  # Only show confident predictions
                print(f"    Sentiment: {score['label']} ({score['score']:.4f})")
    print("-" * 50)

Device set to use cuda:0


Review: The screen is bright and clear, but the battery drains too quickly.
  Aspect: The screen is bright and clear
    Sentiment: Positive (0.9971)
  Aspect: but the battery drains too quickly.
    Sentiment: Negative (0.8436)
--------------------------------------------------
Review: This phone has an amazing camera, though it's a bit overpriced.
  Aspect: This phone has an amazing camera
    Sentiment: Positive (0.9921)
  Aspect: though it's a bit overpriced.
    Sentiment: Negative (0.8434)
--------------------------------------------------
Review: The software is intuitive, but it crashes frequently when multitasking.
  Aspect: The software is intuitive
    Sentiment: Positive (0.9952)
  Aspect: but it crashes frequently when multitasking.
    Sentiment: Negative (0.9440)
--------------------------------------------------




<font color = green >

### Multilingual Sentiment Analysis

</font>


In [98]:
from transformers import pipeline

# Load multilingual sentiment model
multilingual_classifier = pipeline(
    "sentiment-analysis",
    model="nlptown/bert-base-multilingual-uncased-sentiment"
)

# Sample texts in different languages
texts = {
    "English": "This product is amazing!",
    "Spanish": "Este producto es increíble!",
    "French": "Ce produit est incroyable!",
    "German": "Dieses Produkt ist unglaublich!",
    "Italian": "Questo prodotto è incredibile!"
}

# Analyze sentiment across languages
for language, text in texts.items():
    result = multilingual_classifier(text)
    print(f"{language}: {text}")
    print(f"Sentiment: {result[0]['label']} ({result[0]['score']:.4f})")

Device set to use cuda:0


English: This product is amazing!
Sentiment: 5 stars (0.8755)
Spanish: Este producto es increíble!
Sentiment: 5 stars (0.8487)
French: Ce produit est incroyable!
Sentiment: 5 stars (0.8649)
German: Dieses Produkt ist unglaublich!
Sentiment: 5 stars (0.7747)
Italian: Questo prodotto è incredibile!
Sentiment: 5 stars (0.8987)
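
This model returns star ratings rather than positive/negative labels, so comparing it with the binary classifiers above needs a mapping step. A minimal sketch, where the thresholds (1-2 stars negative, 3 neutral, 4-5 positive) are an assumption and not part of the model:

```python
def stars_to_sentiment(label):
    """Map a label such as '5 stars' or '1 star' to a coarse sentiment class."""
    stars = int(label.split()[0])
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

for label in ["1 star", "3 stars", "5 stars"]:
    print(label, "->", stars_to_sentiment(label))
```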




<font color = green >

### Using Large Language Models

</font>


In [None]:
# This would require API access to models like OpenAI's GPT
import openai

# Function to analyze sentiment using GPT
def analyze_sentiment_with_gpt(text):
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a sentiment analysis assistant. Classify the following text as positive, negative, or neutral, and provide a confidence score between 0 and 1."},
            {"role": "user", "content": text}
        ],
        temperature=0
    )
    return response.choices[0].message.content

# Sample usage
reviews = [
    "I absolutely love this product! It's perfect for my needs.",
    "This is the worst purchase I've ever made. Complete waste of money.",
    "The product arrived on time and works as expected."
]

for review in reviews:
    print(f"Review: {review}")
    print(f"GPT Analysis: {analyze_sentiment_with_gpt(review)}")
    print("-" * 50)

Review: I absolutely love this product! It's perfect for my needs.
GPT Analysis: Positive with a confidence score of 0.95
--------------------------------------------------
Review: This is the worst purchase I've ever made. Complete waste of money.
GPT Analysis: Negative with a confidence score of 0.95
--------------------------------------------------
Review: The product arrived on time and works as expected.
GPT Analysis: Positive, 0.95
--------------------------------------------------
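
Note that the replies above do not share one format ("Positive with a confidence score of 0.95" versus "Positive, 0.95"), so free-text LLM output usually needs parsing before it can be scored. A minimal regex-based sketch; the helper `parse_llm_sentiment` is hypothetical and only handles replies shaped like the ones shown:

```python
import re

def parse_llm_sentiment(reply):
    """Extract (label, confidence) from a free-text reply; None where not found."""
    label_match = re.search(r"\b(positive|negative|neutral)\b", reply, re.IGNORECASE)
    score_match = re.search(r"\b([01](?:\.\d+)?)\b", reply)
    label = label_match.group(1).lower() if label_match else None
    score = float(score_match.group(1)) if score_match else None
    return label, score

replies = [
    "Positive with a confidence score of 0.95",
    "Negative with a confidence score of 0.95",
    "Positive, 0.95",
]
parsed = [parse_llm_sentiment(r) for r in replies]
print(parsed)
```

Asking the model for a fixed output format in the system prompt reduces, but does not remove, the need for this kind of defensive parsing.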


<font color = green >

## Conclusion and Best Practices

</font>

### Key Takeaways

1. **Evolution of Sentiment Analysis**:
   - Traditional methods (BoW, TF-IDF) provide baselines and are still useful for simple tasks
   - Word embeddings capture semantic relationships
   - Transformer models provide state-of-the-art performance

2. **Trade-offs**:
   - Computational resources vs. accuracy
   - Training time vs. performance
   - Model size vs. inference speed

3. **Choosing the Right Approach**:
   - For simple applications with limited resources: TF-IDF + classical ML
   - For moderate complexity: Word embeddings + neural networks
   - For maximum accuracy: Fine-tuned transformer models
   - When labeled data is scarce or unavailable: Zero-shot or few-shot learning

4. **Beyond Binary Sentiment**:
   - Emotion detection
   - Aspect-based sentiment analysis
   - Stance detection
   - Sarcasm and irony detection



<font color = green >

## Home Task

</font>

Objective: Apply at least two (2) modern sentiment analysis methods to classify text data and evaluate their performance.




Dataset:

Sentiment Analysis Dataset: https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

Alternative source:
[rt-polaritydata](https://github.com/dennybritz/cnn-text-classification-tf/tree/master/data/rt-polaritydata)

Each line in these two files corresponds to a single snippet (usually containing roughly one single sentence); all snippets are down-cased.

[More info about dataset](https://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt)


- rt-polarity.neg: Contains negative reviews.
- rt-polarity.pos: Contains positive reviews.

### Task Description:

1. Data Loading & Preparation:

    - Load rt-polarity.neg and rt-polarity.pos.
    - Split each file into individual snippets.
    - Assign labels (0 for negative, 1 for positive).
    - Combine into a single dataset.
    - Split the dataset into training and testing sets.
2. Implement and evaluate at least three (3) methods from the lecture, prioritizing modern approaches.
3. For each implemented method, report and compare classification metrics: Accuracy, Precision, Recall, and F1-score.
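
As a reminder of what the four requested metrics measure, the sketch below computes them from scratch for binary labels. In practice `sklearn.metrics` provides the same quantities; the helper `classification_metrics` and the toy labels are illustrative only.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```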

In [None]:
fn = 'rt-polarity.neg'

# errors='ignore' skips the few bytes that are not valid UTF-8
with open(fn, "r", encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_neg = content.splitlines()
print('len of texts_neg = {:,}'.format(len(texts_neg)))
for review in texts_neg[:5]:
    print('\n', review)

In [None]:
fn = 'rt-polarity.pos'

with open(fn, "r", encoding='utf-8', errors='ignore') as f:
    content = f.read()
texts_pos = content.splitlines()
print('len of texts_pos = {:,}'.format(len(texts_pos)))
for review in texts_pos[:5]:
    print('\n', review)
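
A possible starting point for the remaining preparation steps (labeling, combining, splitting) is sketched below on tiny stand-in lists, so that it runs without the data files. The helper names and the 80/20 split are assumptions; `train_test_split` from scikit-learn, ideally with `stratify`, is the more usual choice.

```python
import random

def build_dataset(texts_neg, texts_pos):
    """Label negatives as 0 and positives as 1, then combine into one list."""
    return [(text, 0) for text in texts_neg] + [(text, 1) for text in texts_pos]

def split_dataset(examples, test_fraction=0.2, seed=42):
    """Shuffle reproducibly and split into (train, test)."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in snippets; replace with texts_neg and texts_pos loaded above
demo_neg = [f"bad snippet {i}" for i in range(5)]
demo_pos = [f"good snippet {i}" for i in range(5)]

dataset = build_dataset(demo_neg, demo_pos)
train_set, test_set = split_dataset(dataset)
print(len(dataset), len(train_set), len(test_set))
```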


<font color = green >

## Next lesson: Time Series Forecast  
</font>

