# Sentiment Analysis on Text: MyAnimeList Review Data Set

A collection of methods and models for building a sentiment analysis model. The functions follow Blueprints for Text Analytics by Albrecht et al. (2021), with several adjustments for clarity. There are no ready-to-use wrapper functions: the notebook calls other libraries directly, and wrapping them again would not be convenient because there are many parameters to set. This notebook shows several approaches, including:

- Lexicon-based Approach
- Building features from data, and applying a supervised ML algorithm
- Transfer learning technique + pretrained language models (BERT for instance)

In [9]:
import pandas as pd
import random
import regex as re
import ast

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

from sklearn.svm import LinearSVC

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

### Data Set Information

This project uses data from Kaggle: <a href='https://www.kaggle.com/datasets/marlesson/myanimelist-dataset-animes-profiles-reviews'>Anime Dataset with Reviews - MyAnimeList by Marlesson</a>. For more information about the data set, follow the link.

In [2]:
df = pd.read_csv('/Users/taufiqurrohman/Documents/ds_marketing_portfolio/text_analytics/dataset/activity-8_animereviewDataSet/reviews.csv')

display(df.head(5))
print('length of the dataframe', len(df))

Unnamed: 0,uid,profile,anime_uid,text,score,scores,link
0,255938,DesolatePsyche,34096,\n \n \n \n ...,8,"{'Overall': '8', 'Story': '8', 'Animation': '8...",https://myanimelist.net/reviews.php?id=255938
1,259117,baekbeans,34599,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=259117
2,253664,skrn,28891,\n \n \n \n ...,7,"{'Overall': '7', 'Story': '7', 'Animation': '9...",https://myanimelist.net/reviews.php?id=253664
3,8254,edgewalker00,2904,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '9', 'Animation': '9...",https://myanimelist.net/reviews.php?id=8254
4,291149,aManOfCulture99,4181,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ...",https://myanimelist.net/reviews.php?id=291149


length of the dataframe 192112


In [3]:
df_anime = pd.read_csv('/Users/taufiqurrohman/Documents/ds_marketing_portfolio/text_analytics/dataset/activity-8_animereviewDataSet/animes.csv')

display(df_anime.head(5))
print('length of the dataframe', len(df_anime))

Unnamed: 0,uid,title,synopsis,genre,aired,episodes,members,popularity,ranked,score,img_url,link
0,28891,Haikyuu!! Second Season,Following their participation at the Inter-Hig...,"['Comedy', 'Sports', 'Drama', 'School', 'Shoun...","Oct 4, 2015 to Mar 27, 2016",25.0,489888,141,25.0,8.82,https://cdn.myanimelist.net/images/anime/9/766...,https://myanimelist.net/anime/28891/Haikyuu_Se...
1,23273,Shigatsu wa Kimi no Uso,Music accompanies the path of the human metron...,"['Drama', 'Music', 'Romance', 'School', 'Shoun...","Oct 10, 2014 to Mar 20, 2015",22.0,995473,28,24.0,8.83,https://cdn.myanimelist.net/images/anime/3/671...,https://myanimelist.net/anime/23273/Shigatsu_w...
2,34599,Made in Abyss,The Abyss—a gaping chasm stretching down into ...,"['Sci-Fi', 'Adventure', 'Mystery', 'Drama', 'F...","Jul 7, 2017 to Sep 29, 2017",13.0,581663,98,23.0,8.83,https://cdn.myanimelist.net/images/anime/6/867...,https://myanimelist.net/anime/34599/Made_in_Abyss
3,5114,Fullmetal Alchemist: Brotherhood,"""In order for something to be obtained, someth...","['Action', 'Military', 'Adventure', 'Comedy', ...","Apr 5, 2009 to Jul 4, 2010",64.0,1615084,4,1.0,9.23,https://cdn.myanimelist.net/images/anime/1223/...,https://myanimelist.net/anime/5114/Fullmetal_A...
4,31758,Kizumonogatari III: Reiketsu-hen,After helping revive the legendary vampire Kis...,"['Action', 'Mystery', 'Supernatural', 'Vampire']","Jan 6, 2017",1.0,214621,502,22.0,8.83,https://cdn.myanimelist.net/images/anime/3/815...,https://myanimelist.net/anime/31758/Kizumonoga...


length of the dataframe 19311


    Some Data Cleaning Before Merging

In [4]:
# count duplicate value

df.duplicated().value_counts()

False    130519
True      61593
dtype: int64

In [5]:
# drop the duplicated value 

df = df.drop_duplicates(keep='first')
print('length of the dataframe', len(df))

length of the dataframe 130519


In [6]:
# count duplicate value for anime list

df_anime.duplicated().value_counts()

False    16368
True      2943
dtype: int64

In [7]:
# drop the duplicated value 

df_anime = df_anime.drop_duplicates(keep='first')
print('length of the dataframe', len(df_anime))

length of the dataframe 16368


    Inserting all the features

In [8]:
df = df.merge(df_anime[['uid', 'title']], left_on='anime_uid', right_on='uid', how='left')
df = df.drop_duplicates()
df = df[['uid_x', 'profile', 'anime_uid', 'title', 'text', 'score', 'scores']]
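# note: rename() returns a new DataFrame; the result is not assigned here,
# so the column keeps the name 'uid_x' in the outputs that follow
df.rename(columns={'uid_x': 'uid'})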

display(df.head(5))
print('length of the dataframe', len(df))

Unnamed: 0,uid_x,profile,anime_uid,title,text,score,scores
0,255938,DesolatePsyche,34096,Gintama.,\n \n \n \n ...,8,"{'Overall': '8', 'Story': '8', 'Animation': '8..."
1,259117,baekbeans,34599,Made in Abyss,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ..."
2,253664,skrn,28891,Haikyuu!! Second Season,\n \n \n \n ...,7,"{'Overall': '7', 'Story': '7', 'Animation': '9..."
3,8254,edgewalker00,2904,Code Geass: Hangyaku no Lelouch R2,\n \n \n \n ...,9,"{'Overall': '9', 'Story': '9', 'Animation': '9..."
4,291149,aManOfCulture99,4181,Clannad: After Story,\n \n \n \n ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ..."


length of the dataframe 130519


## Data Cleaning

In [9]:
# check 3 text samples from the review

random_num_list = []

for i in range(3):
    random_num_list.append(random.randint(0, len(df) - 1))  # randint includes both endpoints

for i in random_num_list:
    print('review from', df.loc[i, 'profile'], ':')
    display(df.loc[i, 'text'])
    print('\n')

review from DevilKirito :


'\n           \n         \n           \n             \n           \n         \n         \n           more pics \n         \n       \n         \n       \n         \n           Overall \n           6 \n         \n         \n           Story \n           5 \n         \n                   \n             Animation \n             5 \n           \n           \n             Sound \n             6 \n           \n                 \n           Character \n           4 \n         \n         \n           Enjoyment \n           6 \n         \n       \n     \n\n                    \n    Bad start, the accident besides being cliché, totally disagrees with the style of anime. \r\nSilly and cliche but evolving romance, unlike most animes who are all in one thing always liking each other but never doing anything. \r\nNice ecchi I like, but I think the ecchi makes the work lose a bit of depth, I have not seen any anime with much ecchi that impresses me in history, animation, etc. \r\nThere were completely



review from SuccHunter :


"\n           \n         \n           \n             \n           \n         \n         \n           more pics \n         \n       \n         \n       \n         \n           Overall \n           10 \n         \n         \n           Story \n           9 \n         \n                   \n             Animation \n             10 \n           \n           \n             Sound \n             10 \n           \n                 \n           Character \n           10 \n         \n         \n           Enjoyment \n           10 \n         \n       \n     \n\n                    \n    Story & Characters: \r\n \r\nBarakamon is an absolute gem and deserves every bit of praise and popularity that it gets. The way it builds its world and characters is nothing short of magic, weaving a big but lovable cast of three dimensional humans who are just living out their mundane lives and through doing so, gradually discover new things about themselves. Character interactions and dynamics are extremely ent



review from thxbookthief :


'\n           \n         \n           \n             \n           \n         \n         \n           more pics \n         \n       \n         \n       \n         \n           Overall \n           2 \n         \n         \n           Story \n           2 \n         \n                   \n             Animation \n             8 \n           \n           \n             Sound \n             8 \n           \n                 \n           Character \n           1 \n         \n         \n           Enjoyment \n           5 \n         \n       \n     \n\n                    \n    This review may contain spoilers \r\n \r\n \r\nHarem Online,I mean Sword art online is an anime aired in 2012 based on a light novel series of the same name by Reki Kawahara (he also write the novels of Accel World) and it gained a lot of fans and popularity. \r\nWell, with that introduction let’s start this review \r\n \r\nStory:2 \r\n \r\nThe story of SAOis divided into 2 arcs and it follows the serie of events afte





In [10]:
def clean_text(text):
    pattern_b = r'^more\spics\s\sOverall\d+Story\d+Animation\s\s\d+Sound\s\s\d+\sCharacter\d+Enjoyment\d+'
    pattern_e = r'\bHelpful\b'
    
    text = text.replace('\n', '')
    text = text.replace('\r', '')
    text = text.replace('   ', '')
    
    text = re.sub(pattern_b, '', text)
    text = re.sub(pattern_e, '', text)
    text = re.sub(r'(?<=[.,!])(?=[^\s])', r' ', text)
    
    text = text.strip()
    
    return text
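
The last substitution in `clean_text` is the least obvious one: it inserts a space at every position that follows `.`, `,` or `!` and precedes a non-whitespace character. A minimal sketch of that single step, assuming the standard-library `re` module behaves like the `regex` package for this pattern (the helper name `add_space_after_punctuation` is hypothetical):

```python
import re  # stdlib re; the notebook imports the third-party `regex` package under the same name


def add_space_after_punctuation(text):
    # Zero-width match: preceded by '.', ',' or '!' and followed by a non-space character.
    return re.sub(r'(?<=[.,!])(?=[^\s])', ' ', text)


examples = [
    "Great show!Loved it.",     # missing space after '!'
    "Akame ga kill....",        # an ellipsis gets split into separate dots
    "I rate it 8.5,maybe 9.",   # decimals are split as well
]

for raw in examples:
    print(repr(raw), '->', repr(add_space_after_punctuation(raw)))
```

The side effects on ellipses and decimal numbers are worth keeping in mind, since they change the tokens that later steps see.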

In [11]:
df['text'] = df['text'].map(clean_text)
df.head(5)

Unnamed: 0,uid_x,profile,anime_uid,title,text,score,scores
0,255938,DesolatePsyche,34096,Gintama.,"First things first. My ""reviews"" system is exp...",8,"{'Overall': '8', 'Story': '8', 'Animation': '8..."
1,259117,baekbeans,34599,Made in Abyss,Let me start off by saying that Made in Abyss ...,10,"{'Overall': '10', 'Story': '10', 'Animation': ..."
2,253664,skrn,28891,Haikyuu!! Second Season,"Art 9/10: It is great, especially the actions ...",7,"{'Overall': '7', 'Story': '7', 'Animation': '9..."
3,8254,edgewalker00,2904,Code Geass: Hangyaku no Lelouch R2,Story taking place 1 yr from where season 1 t...,9,"{'Overall': '9', 'Story': '9', 'Animation': '9..."
4,291149,aManOfCulture99,4181,Clannad: After Story,Kyoto Animations greatest strength is being ab...,10,"{'Overall': '10', 'Story': '10', 'Animation': ..."


In [12]:
# convert the stringified dictionary into a real dict using ast.literal_eval,
# which safely evaluates a string that contains only a Python literal

df['scores'] = df['scores'].apply(ast.literal_eval)
df_scores = df['scores'].apply(pd.Series)
df = df.merge(df_scores, left_index=True, right_index=True)
df = df.drop(['scores', 'score'], axis=1)

display(df.head(5))
print('length of the dataframe', len(df))

Unnamed: 0,uid_x,profile,anime_uid,title,text,Overall,Story,Animation,Sound,Character,Enjoyment
0,255938,DesolatePsyche,34096,Gintama.,"First things first. My ""reviews"" system is exp...",8,8,8,10,9,8
1,259117,baekbeans,34599,Made in Abyss,Let me start off by saying that Made in Abyss ...,10,10,10,10,10,10
2,253664,skrn,28891,Haikyuu!! Second Season,"Art 9/10: It is great, especially the actions ...",7,7,9,8,8,8
3,8254,edgewalker00,2904,Code Geass: Hangyaku no Lelouch R2,Story taking place 1 yr from where season 1 t...,9,9,9,10,10,9
4,291149,aManOfCulture99,4181,Clannad: After Story,Kyoto Animations greatest strength is being ab...,10,10,8,9,10,10


length of the dataframe 130519
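
To make the `ast.literal_eval` step concrete, here is a small sketch on a single string shaped like the `scores` column. The sample string is an assumption reconstructed from the first row of the expanded table; the integer cast is optional and not part of the notebook.

```python
import ast

# A string shaped like one cell of the `scores` column (values taken from row 0 above).
raw_scores = ("{'Overall': '8', 'Story': '8', 'Animation': '8', "
              "'Sound': '10', 'Character': '9', 'Enjoyment': '8'}")

parsed = ast.literal_eval(raw_scores)               # dict with string values
as_int = {k: int(v) for k, v in parsed.items()}     # optional cast to integers

# literal_eval only accepts plain literals, so arbitrary expressions are rejected.
rejected = False
try:
    ast.literal_eval("__import__('os').getcwd()")
except ValueError:
    rejected = True

print(parsed['Sound'], type(parsed['Sound']).__name__)
print(as_int['Sound'], type(as_int['Sound']).__name__)
print('expression rejected:', rejected)
```

Unlike `eval`, this cannot run code hidden in the data, which is why it is a reasonable choice for scraped text.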


### DataFrame Checkpoint

In [14]:
# df.to_csv('/Users/taufiqurrohman/Documents/ds_marketing_portfolio/text_analytics/dataset/activity-8_animereviewDataSet/reviews_cleaned.csv')

# Method 1: Lexicon-Based Approaches 

A lexicon is a dictionary of words that has been compiled using expert knowledge. It is collected for a specific purpose and incorporates knowledge about that domain. This analysis uses the Bing Liu lexicon. For other lexicon options, <a href='https://arxiv.org/pdf/1901.08319.pdf#:~:text=The%20bing%20lexicon%20was%20developed,positive%20and%204783%20are%20negative.'>click here</a>.

In [59]:
import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

In [60]:
# check inside the lexicon

print('number of words in Bing Liu opinion lexicon:', len(opinion_lexicon.words()), '\n')
print('example of positive words', opinion_lexicon.positive()[:10])
print('example of negative words', opinion_lexicon.negative()[:10])

number of words in Bing Liu opinion lexicon: 6789 

example of positive words ['a+', 'abound', 'abounds', 'abundance', 'abundant', 'accessable', 'accessible', 'acclaim', 'acclaimed', 'acclamation']
example of negative words ['2-faced', '2-faces', 'abnormal', 'abolish', 'abominable', 'abominably', 'abominate', 'abomination', 'abort', 'aborted']


In [73]:
# create the function for opinion lexicon

word_dict = {}


# giving positive words with +1 value...
# and giving negative words with -1 value

for w in opinion_lexicon.positive():
    word_dict[w] = 1
    
for w in opinion_lexicon.negative():
    word_dict[w] = -1


# define the function

def bing_liu_score(text):
    
    sent_score = 0
    bow = word_tokenize(text.lower())  # tokenize, and lower the word
    for w in bow:
        if w in word_dict:
            sent_score += word_dict[w]  # giving each word score
    
    if not bow:  # empty review: avoid division by zero
        return 0.0

    return sent_score / len(bow)  # dividing by the number of tokens in the review
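
The scoring rule can be tried without downloading the NLTK corpus. The sketch below assumes a tiny hand-made lexicon and plain whitespace tokenization, both of which are stand-ins for `opinion_lexicon` and `word_tokenize`; the names `toy_lexicon` and `lexicon_score` are hypothetical.

```python
# Hypothetical four-word lexicon; the real Bing Liu lexicon has thousands of entries.
toy_lexicon = {'great': 1, 'wonderful': 1, 'worst': -1, 'boring': -1}


def lexicon_score(text, lexicon):
    tokens = text.lower().split()   # simplification: whitespace split instead of word_tokenize
    if not tokens:
        return 0.0                  # an empty text has no sentiment signal
    total = sum(lexicon.get(tok, 0) for tok in tokens)
    return total / len(tokens)      # normalize by length so long reviews are comparable


print(lexicon_score('A great and wonderful story', toy_lexicon))
print(lexicon_score('the worst boring plot ever', toy_lexicon))
print(lexicon_score('', toy_lexicon))
```

Dividing by the token count keeps the score between -1 and 1, which matches the range seen in the ranked outputs.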

In [62]:
# beware: this cell takes a long time to finish

df['bing_liu_score'] = df['text'].apply(bing_liu_score)
df[['title', 'text', 'Overall', 'bing_liu_score']].sample(5)

Unnamed: 0,title,text,Overall,bing_liu_score
60498,Boku no Pico,Boku no pico is the worst anime I have ever wa...,4,-0.021277
71920,Guilty Crown,Guilty Crown is certainly the kind of anime yo...,3,0.000941
61541,Orange,I rarely ever write reviews for anime I've com...,8,0.004049
122234,Mai-Otome,"First off, please don't flame me as this is my...",9,0.016129
82487,Tsubasa Chronicle,"First I saw CCS, then I noticed a new anime ca...",10,-0.015748


In [63]:
# check top 5 ranking from bing_liu_score
df[['title', 'text', 'Overall', 'Story', 'Animation', 'Sound', 'Character', 'Enjoyment', 'bing_liu_score']].\
    sort_values('bing_liu_score').tail(5)

Unnamed: 0,title,text,Overall,Story,Animation,Sound,Character,Enjoyment,bing_liu_score
7569,Code Geass: Hangyaku no Lelouch R2,"Amazing anime with a great plot, smart and int...",10,9,8,10,10,10,0.208333
118386,Mouryou Senki Madara,This Was Actually A Pretty Good OVA Series It...,8,0,0,0,0,0,0.225
79593,Cello Hiki no Gauche (1982),Simple. Wonderful. Beautiful. Comforting like ...,8,0,0,0,0,0,0.264706
34948,Howl no Ugoku Shiro,"Visually striking and magnificent trail, this ...",9,9,10,10,10,9,0.272727
66235,Black Clover,Very underrated. This is my first review in My...,10,10,10,10,10,10,0.294416


In [64]:
# check the bottom 5 rankings from bing_liu_score
df[['title', 'text', 'Overall', 'Story', 'Animation', 'Sound', 'Character', 'Enjoyment', 'bing_liu_score']].\
    sort_values('bing_liu_score').head(5)

Unnamed: 0,title,text,Overall,Story,Animation,Sound,Character,Enjoyment,bing_liu_score
86897,Arslan Senki (TV): Fuujin Ranbu,idiot plot idiot plot idiot plot idiot plot id...,1,1,4,4,2,1,-1.0
83680,Isshuukan Friends.,Crappy ending and a huge disappointment. The ...,1,1,1,1,1,1,-0.130435
22140,Fate/stay night: Unlimited Blade Works,The witch is dead the witch is dead gigamesh ...,10,9,10,7,10,9,-0.125
70145,Jigoku Shoujo,"The most stupid I've ever watched. Basically, ...",1,1,1,1,1,1,-0.123894
99018,Soujuu Senshi Psychic Wars,This is without doubt the WORST anime i have e...,1,1,1,1,1,1,-0.12


In [65]:
# save a copy of the dataframe, since the next method reuses `df`
# (named df_bing_liu so it does not overwrite the bing_liu_score function)

df_bing_liu = df.copy(deep=True)

# Method 2: Supervised Learning Approaches 

Instead of relying on a fixed lexicon, this method builds features from the review text and trains a supervised ML algorithm on labels derived from the Overall score of each review.

In [3]:
# load the dataframe again

df = pd.read_csv('/Users/taufiqurrohman/Documents/ds_marketing_portfolio/text_analytics/dataset/activity-8_animereviewDataSet/reviews_cleaned.csv')

display(df.head(5))
print('length of the dataframe:', len(df))

Unnamed: 0.1,Unnamed: 0,uid_x,profile,anime_uid,title,text,Overall,Story,Animation,Sound,Character,Enjoyment
0,0,255938,DesolatePsyche,34096,Gintama.,"First things first. My ""reviews"" system is exp...",8,8,8,10,9,8
1,1,259117,baekbeans,34599,Made in Abyss,Let me start off by saying that Made in Abyss ...,10,10,10,10,10,10
2,2,253664,skrn,28891,Haikyuu!! Second Season,"Art 9/10: It is great, especially the actions ...",7,7,9,8,8,8
3,3,8254,edgewalker00,2904,Code Geass: Hangyaku no Lelouch R2,Story taking place 1 yr from where season 1 t...,9,9,9,10,10,9
4,4,291149,aManOfCulture99,4181,Clannad: After Story,Kyoto Animations greatest strength is being ab...,10,10,8,9,10,10


length of the dataframe: 130519


In [4]:
# annotate the reviews first
# reviews with an Overall score of 7 and above get 1 (positive)
# reviews with an Overall score below 6 get 0 (negative)
# reviews with a score of exactly 6 are filtered out as neutral

thres = 6

# check the length of each score first
print(f'number of rating above {thres}:', len(df[df['Overall'] > thres]))
print(f'number of rating below {thres}:', len(df[df['Overall'] < thres]))
print(f'number of reviews with score {thres}:', len(df[df['Overall'] == thres]))

number of rating above 6: 91853
number of rating below 6: 26220
number of reviews with score 6: 12446


In [27]:
# assigning the score now

df['sentiment'] = 0
df.loc[df['Overall'] > thres, 'sentiment'] = 1
df.loc[df['Overall'] < thres, 'sentiment'] = 0
df = df[df['Overall'] != thres]

display(df[['text', 'Overall', 'sentiment']].sample(3))
print('length of the dataframe:', len(df))

Unnamed: 0,text,Overall,sentiment
78755,Does a very good job on sticking with the mang...,10,1
67530,Akame ga kill. . . . this series is so amazing...,10,1
74904,You know its kind of petty coming to MAL and h...,3,0


length of the dataframe: 118073
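
The labeling rule is easier to check on a handful of values. This is a plain-Python sketch of the same thresholding, assuming the threshold of 6 used above; `label_from_overall` is a hypothetical helper, not a function from the notebook.

```python
def label_from_overall(overall, threshold=6):
    """Return 1 (positive), 0 (negative) or None (neutral, to be dropped)."""
    if overall > threshold:
        return 1
    if overall < threshold:
        return 0
    return None


scores = [10, 3, 6, 7, 5]
labels = [label_from_overall(s) for s in scores]

# Keep only the reviews that received a label, mirroring the filter on Overall != thres.
kept = [(s, lab) for s, lab in zip(scores, labels) if lab is not None]

print(labels)
print(kept)
```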


    Step 1: Data Preparation    

In [6]:
import fun_nlp_spacy_text as pp_text

In [85]:
import fun_nlp_spacy_text as pp_text

# cleaning all the HTML tags and escapes from the text
# then extracting the lemmatized version and taking only specific pos

df['clean_text'] = df['text'].apply(pp_text.clean)
df['clean_lemmas'] = df['clean_text'].apply(pp_text.extract_lemmas)
df['clean_pos'] = df['clean_text'].apply(pp_text.extract_pos_to_take)

df.sample(5)

In [12]:
# normalize text
# convert to lowercase, remove number, remove punctuation

def normalize_text(text):
    text = re.sub(r'[^\w\s]', '', text)
    text = text.lower()
    text = re.sub(r'\d+', '', text)
    
    return text
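
A quick check of what `normalize_text` does to a review fragment. The sketch repeats the function with the standard-library `re` module so it runs on its own; the sample strings are made up for illustration.

```python
import re  # stdlib re stands in for the `regex` package used in the notebook


def normalize_text(text):
    text = re.sub(r'[^\w\s]', '', text)  # drop punctuation
    text = text.lower()                  # lowercase
    text = re.sub(r'\d+', '', text)      # drop digits
    return text


print(repr(normalize_text('Art 9/10: It is GREAT!')))
print(repr(normalize_text('Season 2 was so-so...')))
```

Removing digits after punctuation leaves double spaces behind, and hyphenated words are merged ("so-so" becomes "soso"). Whether that matters depends on the tokenizer used afterwards.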

In [7]:
# use progress_apply instead of apply to get a tqdm progress bar;
# tqdm.pandas() registers progress_apply on pandas objects

# IMPORTANT: running this cell takes around 3 hours

from tqdm import tqdm

tqdm.pandas()

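df['clean_text'] = df['text'].progress_apply(normalize_text)
# note: the next line recomputes clean_text from the raw text,
# so the normalize_text result above is overwritten
df['clean_text'] = df['text'].progress_apply(pp_text.clean)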
df['clean_lemmas'] = df['clean_text'].progress_apply(pp_text.extract_lemmas)

100%|██████████| 118073/118073 [00:53<00:00, 2187.76it/s]
100%|██████████| 118073/118073 [2:39:06<00:00, 12.37it/s]  


In [28]:
df.sample(5)

Unnamed: 0,uid_x,profile,anime_uid,title,text,Overall,Story,Animation,Sound,Character,Enjoyment,sentiment,clean_text,clean_lemmas
108732,247297,Snuggleophagus,25183,Gangsta.,Overall one of the most enticing anime I've ev...,7,6,7,8,7,7,1,Overall one of the most enticing anime I've ev...,overall one of the most enticing anime i ve ev...
48425,231686,tramontina,1818,Claymore,I like comedy and this anime is virtually devo...,8,9,8,7,8,8,1,I like comedy and this anime is virtually devo...,i like comedy and this anime be virtually devo...
76253,213621,YMegumiShimizu,11111,Another,I found Another to be really creepy and grueso...,9,8,10,9,10,8,1,I found Another to be really creepy and grueso...,i find another to be really creepy and gruesom...
82003,69733,HardyisHere,11285,Black★Rock Shooter (TV),I'll just bullet down what I think about it S...,9,9,10,10,8,10,1,I'll just bullet down what I think about it Sh...,i will just bullet down what i think about it ...
50090,220055,camay1997,28907,"Gate: Jieitai Kanochi nite, Kaku Tatakaeri",This is the kind of anime we should be demandi...,10,9,10,9,10,10,1,This is the kind of anime we should be demandi...,this be the kind of anime we should be demand ...


### DataFrame Checkpoint

In [22]:
df.to_csv('/Users/taufiqurrohman/Documents/ds_marketing_portfolio/text_analytics/dataset/activity-8_animereviewDataSet/reviews_cleaned_lemmas.csv')

## Step 2: Train-Test Split

In [30]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df['clean_lemmas'],
                                                    df['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['sentiment']  # keep the class ratio the same in train and test
                                                    )
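
What `stratify` buys can be illustrated with a small pure-Python sketch. It splits each class separately so both parts keep the same class ratio. This is an illustration of the idea under simple assumptions, not scikit-learn's actual algorithm, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 80 + [0] * 20   # toy data: 80% positive, 20% negative
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in train_idx) / len(train_idx))  # positive share in train
print(sum(labels[i] for i in test_idx) / len(test_idx))    # positive share in test
```

With an imbalanced data set like this one, a plain random split could shift the class ratio between train and test; splitting per class avoids that.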

In [32]:
print ('length of training data ', x_train.shape[0]) 
print ('length of test data ', x_test.shape[0])

print ('\ndistribution of classes in training data :')
print ('positive sentiment ', str(sum(y_train == 1)/ len(y_train) * 100.0)) 
print ('negative sentiment ', str(sum(y_train == 0)/ len(y_train) * 100.0))

print ('\ndistribution of classes in testing data :')
print ('positive sentiment ', str(sum(y_test == 1)/ len(y_test) * 100.0))
print ('negative Sentiment ', str(sum(y_test == 0)/ len(y_test) * 100.0))

length of training data  94458
length of test data  23615

distribution of classes in training data :
positive sentiment  77.79330496093502
negative sentiment  22.20669503906498

distribution of classes in testing data :
positive sentiment  77.79377514291764
negative Sentiment  22.20622485708236


## Step 3: Text Vectorization

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(min_df=10, ngram_range=(1,1))

x_train_tf = tfidf.fit_transform(x_train)
x_test_tf = tfidf.transform(x_test)  # only transform, the model was already fit by train
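
For intuition about what the vectorizer computes, here is a pure-Python sketch of TF-IDF with a `min_df` filter. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is scikit-learn's documented default, but it uses whitespace tokenization instead of the library's token pattern. The helper `tfidf_fit_transform` is hypothetical.

```python
import math
from collections import Counter


def tfidf_fit_transform(docs, min_df=1):
    tokenized = [doc.split() for doc in docs]   # simplification: whitespace tokens
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(t for t, count in doc_freq.items() if count >= min_df)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tok for tok in tokens if tok in idf)
        weights = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, idf, rows


docs = ['good anime good story', 'bad anime', 'good music']
vocab, idf, rows = tfidf_fit_transform(docs, min_df=2)

print(vocab)     # terms that occur in at least 2 documents
print(rows[0])   # sparse row for the first document
```

Terms below `min_df` never enter the vocabulary, which is how `min_df=10` keeps rare tokens and typos out of the feature matrix.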

## Step 4: Training the Model

Using LinearSVC: it works well with a large number of numeric features and is **quite fast** compared to other algorithms.

In [40]:
from sklearn.svm import LinearSVC

model1 = LinearSVC(random_state=42, tol=1e-5)
model1.fit(x_train_tf, y_train)

LinearSVC(random_state=42, tol=1e-05)

This step can be extended by making the model more transparent with the **Text Class Explanation** technique from Activity 5.

In [42]:
# calculating the accuracy of the model

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

    Evaluate the Model

In [46]:
ypred = model1.predict(x_test_tf)

print('accuracy:', accuracy_score(y_test, ypred))
print('roc_score:', roc_auc_score(y_test, ypred))

accuracy: 0.9089561719246242
roc_score: 0.8476003701319955
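
Accuracy alone can be misleading here because about 78% of the test reviews are positive. The sketch below uses toy labels to show the effect. It relies on the fact that, for hard 0/1 predictions, the ROC AUC reduces to the mean of the true positive rate and the true negative rate; the helper names are hypothetical.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc_from_hard_labels(y_true, y_pred):
    # With a single operating point, the area under the ROC curve is (TPR + TNR) / 2.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    positives = sum(t == 1 for t in y_true)
    negatives = len(y_true) - positives
    return (tp / positives + tn / negatives) / 2


y_true = [1] * 8 + [0] * 2     # toy labels: 80% positive
always_positive = [1] * 10     # a "model" that ignores its input

print(accuracy(y_true, always_positive))
print(auc_from_hard_labels(y_true, always_positive))
```

A classifier that always answers "positive" scores well on accuracy but only 0.5 on ROC AUC, which is why both metrics are reported.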


    Compare to the Baseline: Bing Liu Lexicon

In [76]:
def baseline_score(text):
    score = bing_liu_score(text)
    if score > 0:  # positive sentiment
        return 1
    else:  # negative sentiment
        return 0

In [78]:
# evaluation for bing liu model

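# note: scikit-learn metrics expect (y_true, y_pred); the arguments are swapped here,
# which leaves accuracy unchanged but affects the ROC AUC value
ypred_baseline = x_test.apply(baseline_score)
acc_score = accuracy_score(ypred_baseline, y_test)
roc_score = roc_auc_score(ypred_baseline, y_test)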

print('accuracy:', acc_score)
print('roc_score:', roc_score)

accuracy: 0.7783612110946433
roc_score: 0.6877504837771925


    Check with the Real Data    

In [79]:
# checking the model

sample_text = df[['title', 'text', 'clean_lemmas', 'Overall', 'sentiment']].sample(10)

# take the x sample and transform

x_sample = sample_text['clean_lemmas']
x_sample_tf = tfidf.transform(x_sample)

# feed the transformed sample to the trained model and predict

sample_text['sentiment_pred'] = model1.predict(x_sample_tf)
sample_text

Unnamed: 0,title,text,clean_lemmas,Overall,sentiment,sentiment_pred
117861,Hitsugi no Chaika: Avenging Battle,"The original ""Hitsugi no Chaika""was a pleasant...",the original hitsugi no chaikawas a pleasant ...,5,0,0
126404,Vampire Knight,I saw an AMV of Vampire Knight in Youtube and ...,i see an amv of vampire knight in youtube and ...,9,1,0
31130,Dr. Stone,What got me interested in the show was this id...,what get i interested in the show be this idea...,3,0,0
52711,Yowamushi Pedal: New Generation,"background: i'm an avid sports anime fan, i lo...",background i be an avid sport anime fan i lo...,9,1,1
44909,Youjo Senki,"I really was thrown at the end, I thought Tany...",i really be throw at the end i think tanya mi...,9,1,1
89862,Yozakura Quartet: Hana no Uta,What do you get when you mix a weird and refre...,what do you get when you mix a weird and refre...,7,1,1
118215,SoniAni: Super Sonico The Animation,SuperSonico: The Animation started off well en...,supersonico the animation start off well enou...,3,0,0
762,Wolf's Rain,Wolf's Rain is one of its kind and that means ...,wolf s rain be one of its kind and that mean t...,10,1,1
31396,Kimi no Suizou wo Tabetai,I honestly don't care if people say this is ge...,i honestly do not care if people say this be g...,10,1,1
53459,Hourou Musuko,"I enjoyed Hourou Musuko a lot, actually. It wa...",i enjoy hourou musuko a lot actually it be r...,7,1,1


# Method 3: Transfer Learning & Pretrained Language Model

This method uses BERT, a pretrained language model released by Google. BERT was originally developed in TensorFlow and takes tensors as input, so the text has to be converted into tensor format first. The Transformers library by Hugging Face provides the tools for this.

In [29]:
df = pd.read_csv('/Users/taufiqurrohman/Documents/ds_marketing_portfolio/text_analytics/dataset/activity-8_animereviewDataSet/reviews_cleaned_lemmas.csv')
df = df.drop('Unnamed: 0', axis=1)

## Step 1: Loading Models and Tokenization

In [13]:
# the model architecture and weights are downloaded from an AWS S3 bucket hosted by Hugging Face
# finetuning_task is set to 'binary' because we are going to predict a binary sentiment

from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

config = BertConfig.from_pretrained('bert-base-uncased', finetuning_task='binary')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', config=config)  # pass the config so it is actually used

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

    Transforming Data Based on the Architecture

New statement here: **assert**

Assertions are statements that you can use to set sanity checks during the development process. Assertions allow you to test the correctness of your code by checking if some specific conditions remain true, which can come in handy while you’re debugging code. The assertion condition should always be true unless you have a bug in your program. If the condition turns out to be false, then the assertion raises an exception and terminates the execution of your program. For more about assertions, <a href='https://realpython.com/python-assert-statement/#getting-to-know-assertions-in-python'>click here</a>.
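
A minimal sketch of how an assertion behaves, using a made-up helper (`average_score` is hypothetical and not part of the notebook):

```python
def average_score(scores):
    # Sanity check: an empty list would otherwise cause a ZeroDivisionError further down.
    assert len(scores) > 0, 'scores must not be empty'
    return sum(scores) / len(scores)


print(average_score([8, 9, 10]))

message = None
try:
    average_score([])
except AssertionError as err:
    message = str(err)   # the text after the comma in the assert statement

print('assertion failed:', message)
```

Assertions are skipped when Python runs with the `-O` flag, so they suit development-time checks rather than validation of user input.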



In [14]:
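def get_tokens(text, tokenizer, max_seq_length, add_special_tokens=True):
    # encode the text into token ids, padded or truncated to max_seq_length
    input_ids = tokenizer.encode(text,
                                 add_special_tokens=add_special_tokens,
                                 max_length=max_seq_length,
                                 padding='max_length',  # replaces the deprecated pad_to_max_length=True
                                 truncation=True)       # cut reviews longer than max_seq_length
    # 1 for real tokens, 0 for padding (the [PAD] token has id 0 in bert-base-uncased)
    attention_mask = [int(token_id > 0) for token_id in input_ids]
    
    assert len(input_ids) == max_seq_length  # assertion, should be true. else, raise exception
    assert len(attention_mask) == max_seq_length  # assertion, should be true. else, raise exception
    
    return (input_ids, attention_mask)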

In [15]:
ex_text = 'After a difficult debut campaign in Paris, Lionel Messi has netted 18 goals and provided 19 assists this season and has spoken of his happiness at adapting to life in France this term.'

In [21]:
input_ids, attention_mask = get_tokens(ex_text,
                                       tokenizer,
                                       max_seq_length=100,  # set the maximum text length
                                       add_special_tokens=True)

input_tokens = tokenizer.convert_ids_to_tokens(input_ids)  # convert the id of the token to the text itself (token)

print(ex_text, '\n')
print(input_tokens)
print(input_ids)
print(attention_mask)

After a difficult debut campaign in Paris, Lionel Messi has netted 18 goals and provided 19 assists this season and has spoken of his happiness at adapting to life in France this term. 

['[CLS]', 'after', 'a', 'difficult', 'debut', 'campaign', 'in', 'paris', ',', 'lionel', 'mess', '##i', 'has', 'net', '##ted', '18', 'goals', 'and', 'provided', '19', 'assists', 'this', 'season', 'and', 'has', 'spoken', 'of', 'his', 'happiness', 'at', 'adapting', 'to', 'life', 'in', 'france', 'this', 'term', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PA

    Brief Explanation about the Tags

- [CLS] Classification: marks the beginning of the input sequence; its output representation is used for classification
- [SEP] Separator: marks the end of a segment; a single text gets one at the end, and a pair of texts gets one after each text
- [PAD] Padding: ensures shorter sequences have the same length as either the longest sequence in a batch or the maximum length accepted by the model
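
How the three special tokens and the attention mask fit together can be sketched without loading a tokenizer. The ids 0, 101 and 102 are the `[PAD]`, `[CLS]` and `[SEP]` ids in the `bert-base-uncased` vocabulary; the word ids in the example are placeholders, and `pad_and_mask` is a hypothetical helper.

```python
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102   # special-token ids in bert-base-uncased


def pad_and_mask(token_ids, max_seq_length):
    # Reserve two positions for [CLS] and [SEP], truncating the text if needed.
    ids = [CLS_ID] + token_ids[:max_seq_length - 2] + [SEP_ID]
    attention_mask = [1] * len(ids)     # 1 = real token the model should attend to

    n_pad = max_seq_length - len(ids)
    ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad       # 0 = padding the model should ignore
    return ids, attention_mask


# Placeholder word ids, not looked up from the real vocabulary.
ids, mask = pad_and_mask([11, 12, 13], max_seq_length=8)
long_ids, long_mask = pad_and_mask(list(range(200, 220)), max_seq_length=8)

print(ids)
print(mask)
print(long_ids)
```

Every output has exactly `max_seq_length` entries, which is the property the assertions in `get_tokens` check.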

    Split Data into Train-Test Sets

In [33]:
x_train, x_test, y_train, y_test = train_test_split(df['text'],  # for deep learning based model, USE THE ORIGINAL TEXT, not the cleaned one
                                                    df['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['sentiment'])

print ('length of training data ', x_train.shape[0]) 
print ('length of test data ', x_test.shape[0])

print ('\ndistribution of classes in training data :')
print ('positive sentiment ', str(sum(y_train == 1)/ len(y_train) * 100.0)) 
print ('negative sentiment ', str(sum(y_train == 0)/ len(y_train) * 100.0))

print ('\ndistribution of classes in testing data :')
print ('positive sentiment ', str(sum(y_test == 1)/ len(y_test) * 100.0))
print ('negative Sentiment ', str(sum(y_test == 0)/ len(y_test) * 100.0))

length of training data  94458
length of test data  23615

distribution of classes in training data :
positive sentiment  77.79330496093502
negative sentiment  22.20669503906498

distribution of classes in testing data :
positive sentiment  77.79377514291764
negative Sentiment  22.20622485708236


In [None]:
x_train_tokens = x_train.apply(get_tokens,
                               args=(tokenizer,),
                               max_seq_length=100,  # set the maximum text length
                               add_special_tokens=True)  # keyword arguments are forwarded to get_tokens

In [42]:
# tokenize the train and test text by applying get_tokens to each Series
# beware: this cell takes a long time to run

from tqdm import tqdm

tqdm.pandas()

x_train_tokens = x_train.progress_apply(get_tokens, args=(tokenizer, 100, True))
x_test_tokens = x_test.progress_apply(get_tokens, args=(tokenizer, 100, True))

100%|██████████| 94458/94458 [19:46<00:00, 79.62it/s]  
100%|██████████| 23615/23615 [04:29<00:00, 87.64it/s] 


    Converting Data to Tensor
    
Deep learning frameworks operate on **tensors**, which can be moved to a GPU for faster computation, so the tokenized data must be converted to tensors before training. This project uses **PyTorch**.

In [65]:
import torch
from torch.utils.data import TensorDataset

# convert all data into tensor

input_ids_train = torch.tensor(
    [features[0] for features in x_train_tokens.values], dtype=torch.long
)

input_mask_train = torch.tensor(
    [features[1] for features in x_train_tokens.values], dtype=torch.long
)

label_ids_train = torch.tensor(y_train.values, dtype=torch.long)

print(input_ids_train.shape)
print(input_mask_train.shape)
print(label_ids_train.shape)

torch.Size([94458, 100])
torch.Size([94458, 100])
torch.Size([94458])


In [75]:
# wrap the tensors into a TensorDataset
# it pairs the tensors row by row, so each item is (input_ids, attention_mask, label)

train_dataset = TensorDataset(input_ids_train, input_mask_train, label_ids_train)
train_dataset

<torch.utils.data.dataset.TensorDataset at 0x7fdb690c82e0>
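
As a small illustration of what the `TensorDataset` holds, the sketch below builds one from made-up toy values (not the review data) and shows what indexing and batching return. The variable names prefixed with `toy_` are hypothetical, and the value 0 is assumed to be the padding ID.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for tokenizer output: 3 sequences padded to length 4
toy_input_ids = torch.tensor([[101, 7592, 102, 0],
                              [101, 2204, 2143, 102],
                              [101, 102, 0, 0]], dtype=torch.long)
# 1 for real tokens, 0 for padding (assumes padding ID 0)
toy_attention_mask = (toy_input_ids != 0).long()
toy_labels = torch.tensor([1, 1, 0], dtype=torch.long)

toy_dataset = TensorDataset(toy_input_ids, toy_attention_mask, toy_labels)

# Indexing returns a tuple with one row from each tensor
ids, mask, label = toy_dataset[0]
print(ids.tolist(), mask.tolist(), label.item())

# A DataLoader groups those rows into batches; the last batch may be smaller
toy_loader = DataLoader(toy_dataset, batch_size=2, shuffle=False)
batch_shapes = [tuple(batch[0].shape) for batch in toy_loader]
print(batch_shapes)
```

Because every tensor shares the same first dimension, the three pieces of one review always stay aligned, whichever order the sampler draws them in.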

## Step 2: Model Training

- Split the training data into batches, and define the batch size and number of epochs using PyTorch
- Define the optimizer and learning-rate scheduler for the BERT model using PyTorch and the Transformers library
- Wrap all these steps in nested for loops
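
The relationship between data set size, batch size, and epochs can be sketched in plain Python. This is a minimal illustration that assumes one optimizer step per batch (no gradient accumulation) and that the final, smaller batch is kept; `count_optimizer_steps` is a hypothetical helper name.

```python
import math


def count_optimizer_steps(num_examples, batch_size, num_epochs):
    """Return (batches per epoch, total optimizer steps), one step per batch."""
    # The last batch may be smaller than batch_size, so round up
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs


# Values taken from this notebook: 94458 training reviews, batch size 64, 2 epochs
steps_per_epoch, total_steps = count_optimizer_steps(94458, 64, 2)
print(steps_per_epoch, total_steps)  # 1476 2952
```

The total is the number a learning-rate scheduler needs if it should decay over the whole training run.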

In [76]:
# split the training data into batches, and define the batch size and number of epochs

from torch.utils.data import DataLoader, RandomSampler

train_batch_size = 64  # number of samples processed before the model is updated; adjust as needed
num_train_epochs = 2  # number of complete passes through the training dataset; adjust as needed

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset,
                              sampler=train_sampler,
                              batch_size=train_batch_size)

opt_steps = len(train_dataloader) * num_train_epochs  # total optimizer steps = batches per epoch x epochs

print('num of examples', len(train_dataset))
print('num of epochs', num_train_epochs)
print('num of batch size', train_batch_size)
print('optimization steps', opt_steps)

num of examples 94458
num of epochs 2
num of batch size 64
optimization steps 2952


In [78]:
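# define the optimizer and learning-rate scheduler for the BERT model
# for the parameters, see https://towardsdatascience.com/how-to-train-a-bert-model-from-scratch-72cfce554fc6

from torch.optim import AdamW  # transformers.AdamW is deprecated in favour of the PyTorch implementation
from transformers import get_linear_schedule_with_warmup

learning_rate = 1e-4
adam_epsilon = 1e-8
warmup_steps = 0

# weight_decay=0.0 matches the default of the deprecated transformers.AdamW
optimizer = AdamW(model.parameters(), lr=learning_rate, eps=adam_epsilon, weight_decay=0.0)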
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=warmup_steps,
                                            num_training_steps=opt_steps)
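
To show what a linear schedule with warmup does to the learning rate, here is a minimal sketch in plain Python of the multiplier applied to the base learning rate at each step. It is an illustration of the schedule's shape, not the library's code, and `linear_schedule_factor` is a hypothetical name.

```python
def linear_schedule_factor(step, num_warmup_steps, num_training_steps):
    """Multiplier for the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Warmup: ramp linearly from 0 up to 1
        return step / max(1, num_warmup_steps)
    # Decay: fall linearly from 1 to 0, then stay at 0
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 1e-4
for step in (0, 50, 100, 550, 1000):
    factor = linear_schedule_factor(step, num_warmup_steps=100, num_training_steps=1000)
    print(step, base_lr * factor)
```

With `warmup_steps = 0` the factor starts at 1 and reaches 0 at `num_training_steps`. Any optimizer step taken after that point uses a learning rate of 0, so the value passed as `num_training_steps` should cover the whole training run.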



The code below is taken directly from the book; see **page 317** for the complete tutorial. It is not run to completion in this project because training takes too long.

In [80]:
# wrap all these steps in nested for loops

from tqdm import trange, notebook

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator = trange(num_train_epochs, desc='Epoch')

# Move the model to the device once and switch to training mode (enables dropout)
model.to(device)
model.train()

for epoch in train_iterator:
    epoch_iterator = notebook.tqdm(train_dataloader, desc='Iteration')
    
    for step, batch in enumerate(epoch_iterator):

        # Reset all gradients at start of every iteration
        model.zero_grad()
        # Move the input observations to the same device as the model
        batch = tuple(t.to(device) for t in batch)
        
        # Identify the inputs to the model
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        
        # Forward Pass through the model. Input -> Model -> Output
        outputs = model(**inputs)
        
        # Determine the deviation (loss)
        loss = outputs[0] 
        print("\r%f" % loss, end='')
        
        # Back-propagate the loss (automatically calculates gradients)
        loss.backward()
        
        # Prevent exploding gradients by limiting the total gradient norm to 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            
        # Update the parameters and learning rate
        optimizer.step()
        scheduler.step()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1476 [00:00<?, ?it/s]

0.523702

Epoch:   0%|          | 0/2 [04:36<?, ?it/s]


KeyboardInterrupt: 
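
The gradient clipping step in the loop above rescales all gradients together so that their combined norm does not exceed the limit. The plain-Python sketch below illustrates that idea on a flat list of numbers; `clip_by_global_norm` is a hypothetical helper, and the small constant in the denominator is an assumption added to avoid division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only shrink gradients; never scale them up
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

unchanged, _ = clip_by_global_norm([0.3, 0.4], max_norm=1.0)
print(unchanged)
```

Gradients whose norm is already below the limit pass through unchanged, and the direction of the update is preserved in both cases because every value is multiplied by the same factor.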

## Step 3: Model Evaluation

The code below is taken directly from the book; see **page 318** for the complete tutorial. It is also not run in this project, because training the model takes too long.

In [None]:
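import numpy as np

from torch.utils.data import SequentialSampler

# Build the test TensorDataset the same way as the training one
input_ids_test = torch.tensor(
    [features[0] for features in x_test_tokens.values], dtype=torch.long
)
input_mask_test = torch.tensor(
    [features[1] for features in x_test_tokens.values], dtype=torch.long
)
label_ids_test = torch.tensor(y_test.values, dtype=torch.long)
test_dataset = TensorDataset(input_ids_test, input_mask_test, label_ids_test)

test_batch_size = 64
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset,
                             sampler=test_sampler,
                             batch_size=test_batch_size)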
    
# Load the pretrained model that was saved earlier
# model = model.from_pretrained('/outputs')
# Initialize the prediction and actual labels
preds = None
out_label_ids = None
    
# Put model in "eval" mode and move it to the device once
model.eval()
model.to(device)

for batch in notebook.tqdm(test_dataloader, desc="Evaluating"):
    
    # Move the input observations to the same device as the model
    batch = tuple(t.to(device) for t in batch)
    
    # Do not track any gradients during evaluation
    with torch.no_grad():
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        
        # Forward pass through the model
        outputs = model(**inputs)

        # We get loss since we provided the labels
        tmp_eval_loss, logits = outputs[:2]
            
        # There may be more than one batch of items in the test dataset
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0) 
            out_label_ids = np.append(out_label_ids,
                                      inputs['labels'].detach().cpu().numpy(),
                                      axis=0)

# Get final predictions and accuracy
preds = np.argmax(preds, axis=1)
acc_score = accuracy_score(out_label_ids, preds)  # scikit-learn expects (y_true, y_pred)
print('Accuracy Score on Test data ', acc_score)
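
The last two steps, turning logits into class predictions and scoring them, can be sketched in plain Python on made-up toy values. The helper names below are hypothetical. The majority-class baseline is included because the classes in this data set are imbalanced: about 77.8% of the test reviews are positive, so a model that always predicts "positive" would already reach roughly that accuracy.

```python
from collections import Counter


def argmax_rows(logits):
    """Index of the largest logit in each row, i.e. the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Toy logits for 5 reviews with 2 classes (0 = negative, 1 = positive)
toy_logits = [[-1.2, 2.3], [0.8, -0.5], [0.4, 0.1], [-0.3, 1.9], [1.5, -0.7]]
toy_labels = [1, 0, 1, 1, 0]

toy_preds = argmax_rows(toy_logits)
print(toy_preds)
print(accuracy(toy_labels, toy_preds), majority_baseline_accuracy(toy_labels))
```

Comparing the model's accuracy against the baseline gives a fairer picture than the raw accuracy alone.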