## Dataset description
The IMDB dataset contains 50,000 movie reviews for natural language processing and text analytics.
It is a binary sentiment classification dataset with substantially more data than previous benchmark datasets: 25,000 highly polar movie reviews for training and 25,000 for testing. The task is to predict whether each review is positive or negative, using either classical classification or deep learning algorithms.



In [7]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [8]:
data=pd.read_csv("IMDB Dataset.csv")

In [9]:
data.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [10]:
data.columns

Index(['review', 'sentiment'], dtype='object')

In [11]:
data.isnull().any()

review       False
sentiment    False
dtype: bool

In [12]:
data.isnull().sum()

review       0
sentiment    0
dtype: int64

In [13]:
data.describe()

        review                                             sentiment
count   50000                                              50000
unique  49582                                              2
top     Loved today's show!!! It was a variety and not...  positive
freq    5                                                  25000


In [14]:
data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [15]:
data.shape

(50000, 2)

# Text normalization
## tokenization

In [16]:
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud,STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize

In [18]:
!pip install spacy
import spacy
import re,string,unicodedata
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import LancasterStemmer,WordNetLemmatizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from textblob import TextBlob
from textblob import Word
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from bs4 import BeautifulSoup


In [19]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SERKAN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Tokenization of text


In [20]:
#Tokenization of text
tokenizers=ToktokTokenizer()
#Setting English stopwords (stored as stopword_list so the imported nltk `stopwords` module is not shadowed)
stopword_list=nltk.corpus.stopwords.words('english')

In [21]:
#Removing the noisy text: strip HTML markup, then drop anything inside square brackets
def noiseremoval_text(text):
  soup = BeautifulSoup(text, "html.parser")
  text = soup.get_text()
  text = re.sub(r'\[[^]]*\]', '', text)
  return text
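
To see what the bracket pattern in `noiseremoval_text` does on its own, here is a minimal sketch that uses only the standard `re` module. The helper name `strip_bracketed` and the sample sentence are made up for illustration, and the HTML stripping done by BeautifulSoup is left out.

```python
import re

# Same pattern as in noiseremoval_text, written as a raw string.
BRACKETED = re.compile(r'\[[^]]*\]')

def strip_bracketed(text):
    """Remove every [ ... ] span, including the brackets themselves."""
    return BRACKETED.sub('', text)

sample = "A slow start [spoiler ahead] but a strong ending [10/10]."
cleaned = strip_bracketed(sample)
print(cleaned)
```

The whitespace around a removed span is kept, so the cleaned text can contain a double space or a space before punctuation.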


In [22]:
#Apply function on review column
data['review']=data['review'].apply(noiseremoval_text)

In [23]:
data.head()

   review                                             sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. The filming tec...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Stemming

In [24]:
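#Stemming the text
ps=PorterStemmer()  #create the stemmer once instead of on every call
def stemmer(text):
    #stem each whitespace-separated token and join the stems back into one string
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text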


In [None]:
#Apply function on review column
data['review']=data['review'].apply(stemmer)

In [None]:
data.head()

## Removing stop words

In [None]:
from nltk.corpus import stopwords  
from nltk.tokenize import word_tokenize  


In [None]:
#set stopwords to english

stop_wr=set(stopwords.words('english'))
print(stop_wr)

In [None]:
#removing the stopwords
def removing_stopwords(text, is_lower_case=False):
    #Tokenization of text
    tokenizers=ToktokTokenizer()
    tokens = tokenizers.tokenize(text)
    tokens = [token.strip() for token in tokens]
    #Keep only tokens that are not in the English stop-word set;
    #tokens are lowercased for the lookup unless the text is already lowercase
    if is_lower_case:
        filter_tokens = [token for token in tokens if token not in stop_wr]
    else:
        filter_tokens = [token for token in tokens if token.lower() not in stop_wr]
    filtered_text = ' '.join(filter_tokens)
    return filtered_text


In [25]:
#Apply function on review column
data['review']=data['review'].apply(removing_stopwords)

In [26]:
data.head()

   review                                             sentiment
0  one review ha mention watch 1 Oz episod ' hook...  positive
1  wonder littl production. film techniqu veri un...  positive
2  thought thi wa wonder way spend time hot summe...  positive
3  basic ' famili littl boy ( jake ) think ' zomb...  negative
4  petter mattei ' " love time money " visual stu...  positive
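
The output above still contains tokens such as `ha`, `thi`, `wa` and `veri`. These are the stems of the stop words "has", "this", "was" and "very": because stemming ran before stop-word removal, the stemmed forms no longer match the stop-word list. The sketch below illustrates the effect of the ordering with stand-ins only. The small `STOP_WORDS` set and the `TOY_STEMS` lookup (copied from stems visible in the output) are assumptions for illustration, not the NLTK objects used in the notebook.

```python
# Stand-ins for illustration only: a few stop words and the stems that
# show up in the notebook output (a lookup table, not a real stemmer).
STOP_WORDS = {"this", "was", "has", "very", "a"}
TOY_STEMS = {"this": "thi", "was": "wa", "has": "ha",
             "very": "veri", "wonderful": "wonder"}

def toy_stem(token):
    """Return the listed stem, or the token unchanged if it is not listed."""
    return TOY_STEMS.get(token, token)

def drop_stop_words(tokens):
    """Keep only tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "this was a very wonderful film".split()

# Order used in the notebook: stem first, then filter.
stem_then_filter = drop_stop_words([toy_stem(t) for t in tokens])
# Alternative order: filter first, then stem.
filter_then_stem = [toy_stem(t) for t in drop_stop_words(tokens)]

print(stem_then_filter)
print(filter_then_stem)
```

Filtering before stemming removes the stop words while they are still in their dictionary form, which keeps the resulting vocabulary smaller.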


## Train test split

In [27]:
#split the dataset  
#train dataset
train_reviews_data=data.review[:30000]

In [28]:
#test dataset
test_reviews_data=data.review[30000:]
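
The slices above use the first 30,000 rows for training and the remaining 20,000 for testing. That differs from the 25,000/25,000 split in the dataset description and assumes the row order is unrelated to the label. A shuffled, stratified split is a common alternative; scikit-learn provides `train_test_split` with a `stratify` argument for this. The sketch below shows the idea in plain Python on a toy label list. The function name `stratified_split` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.4, seed=42):
    """Return (train_idx, test_idx) with the same class ratio in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels standing in for data['sentiment'].
labels = ["positive"] * 5 + ["negative"] * 5
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```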

## Bag of words

In [31]:
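#Count vectorizer for bag of words
#Note: an integer max_df is an absolute document count, so max_df=1 keeps only
#n-grams that occur in a single training review; a float such as max_df=1.0
#would be read as a proportion and keep every n-gram.
cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))
#transformed train reviews
cv_train=cv.fit_transform(train_reviews_data)
#transformed test reviews
cv_test=cv.transform(test_reviews_data)

print('BOW_cv_train:',cv_train.shape)
print('BOW_cv_test:',cv_test.shape)
#vocab=cv.get_feature_names_out() returns the feature names in current scikit-learn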

BOW_cv_train: (30000, 4954557)
BOW_cv_test: (20000, 4954557)
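
scikit-learn's vectorizers treat an integer `min_df`/`max_df` as an absolute number of documents and a float as a proportion of documents. The sketch below mimics that rule in plain Python on a four-document toy corpus to show how much `max_df=1` and `max_df=1.0` differ. The function `filter_vocab` is a simplified stand-in written for this illustration (whitespace tokens, unigrams only), not the library's implementation.

```python
from collections import Counter

def filter_vocab(docs, min_df, max_df):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(docs)
    # Integers are absolute document counts, floats are proportions.
    min_count = min_df if isinstance(min_df, int) else min_df * n_docs
    max_count = max_df if isinstance(max_df, int) else max_df * n_docs

    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(t for t, c in doc_freq.items() if min_count <= c <= max_count)

docs = ["good film", "good plot", "bad film", "dull ending"]

only_single_doc_terms = filter_vocab(docs, min_df=1, max_df=1)
all_terms = filter_vocab(docs, min_df=1, max_df=1.0)
print(only_single_doc_terms)
print(all_terms)
```

With the integer setting, "good" and "film" are dropped because they appear in two documents, so only terms unique to one document remain as features.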


## TF_IDF

In [32]:
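#Tfidf vectorizer
#Note: as with the count vectorizer, the integer max_df=1 keeps only n-grams
#that occur in a single training review.
tf=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))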
#transformed train reviews
tf_train=tf.fit_transform(train_reviews_data)
#transformed test reviews
tf_test=tf.transform(test_reviews_data)
print('Tfidf_train:',tf_train.shape)
print('Tfidf_test:',tf_test.shape)

Tfidf_train: (30000, 4954557)
Tfidf_test: (20000, 4954557)
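
As a rough picture of what the TF-IDF weights are, the sketch below computes them in plain Python with the formula that `TfidfVectorizer` uses by default: raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The whitespace tokenisation, the unigram-only vocabulary and the function name `tfidf_rows` are simplifying assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        weights = {t: count * idf[t] for t, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf_rows(["good film", "good plot", "bad film"])
for row in rows:
    print(row)
```

In the second document, "plot" receives a higher weight than "good" because it appears in fewer documents.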


## Label encoding

In [33]:
#labeling the sentiment data
label=LabelBinarizer()
#transformed sentiment data
sentiment_data=label.fit_transform(data['sentiment'])
print(sentiment_data.shape)

(50000, 1)
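
For a two-class problem, `LabelBinarizer` returns a single column of 0/1 values, with the classes taken in sorted order, so `negative` maps to 0 and `positive` to 1. The plain-Python sketch below reproduces that mapping on a toy list. The function name `binarize_two_classes` is hypothetical.

```python
def binarize_two_classes(labels):
    """Map two string labels to a column of 0/1, classes in sorted order."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("expected exactly two distinct labels")
    positive_class = classes[1]  # the label that sorts last becomes 1
    column = [[int(label == positive_class)] for label in labels]
    return classes, column

classes, column = binarize_two_classes(["positive", "negative", "positive"])
print(classes)
print(column)
```

Each row is a one-element list, which corresponds to the `(50000, 1)` shape printed above.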


In [34]:
train_data=data.sentiment[:30000]


In [35]:
test_data=data.sentiment[30000:]


In [36]:
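#training the model
logistic=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
#Fitting the model for Bag of words
lr_bow=logistic.fit(cv_train,train_data)
print(lr_bow)
#Fitting the model for tfidf features
#Note: fit() returns the estimator itself, so this call refits the same object;
#lr_bow, lr_tfidf and logistic all refer to the model trained on the TF-IDF features.
lr_tfidf=logistic.fit(tf_train,train_data)
print(lr_tfidf)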

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
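
The two printed estimators are identical because scikit-learn's `fit` returns the estimator it was called on, so fitting one object twice leaves a single model holding the result of the last fit. The sketch below shows that behaviour with a toy class that follows the same fit-returns-self convention. `MajorityClassifier` is a made-up stand-in, not a scikit-learn estimator.

```python
class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = max(set(y), key=list(y).count)
        return self  # same convention as scikit-learn estimators

    def predict(self, X):
        return [self.label_ for _ in X]

X = [[0], [1], [2]]

# One object fitted twice: both names point at the second fit.
shared = MajorityClassifier()
first = shared.fit(X, ["neg", "neg", "pos"])
second = shared.fit(X, ["pos", "pos", "neg"])
print(first is second, first.predict([[5]]))

# One object per feature set keeps the two models apart.
model_a = MajorityClassifier().fit(X, ["neg", "neg", "pos"])
model_b = MajorityClassifier().fit(X, ["pos", "pos", "neg"])
print(model_a.predict([[5]]), model_b.predict([[5]]))
```

Applied to this notebook, creating one `LogisticRegression` instance for the bag-of-words features and another for the TF-IDF features would keep the two models separate.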


## Predictions for bag of words


In [37]:
lr_bow_predict=logistic.predict(cv_test)
print(lr_bow_predict)


['negative' 'negative' 'negative' ... 'negative' 'positive' 'positive']


## Predictions for tfidf features


In [38]:
lr_tfidf_predict=logistic.predict(tf_test)
print(lr_tfidf_predict)

['negative' 'negative' 'negative' ... 'negative' 'positive' 'positive']


## Accuracy score for bag of words


In [39]:
lr_bow_score=accuracy_score(test_data,lr_bow_predict)
print("lr_bow_score :",lr_bow_score)


lr_bow_score : 0.74255


## Accuracy score for tfidf features

In [40]:
lr_tfidf_score=accuracy_score(test_data,lr_tfidf_predict)
print("lr_tfidf_score :",lr_tfidf_score)

lr_tfidf_score : 0.7426
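
`accuracy_score` reports the fraction of predictions that match the true labels. A single number hides which class the errors fall on, which is what the imported `confusion_matrix` and `classification_report` would show. The sketch below computes both quantities in plain Python on a made-up four-review example. The helper names are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count (true label, predicted label) pairs."""
    return Counter(zip(y_true, y_pred))

# Toy labels for illustration, not results from the notebook.
y_true = ["positive", "negative", "positive", "negative"]
y_pred = ["positive", "negative", "negative", "negative"]

acc = accuracy(y_true, y_pred)
counts = confusion_counts(y_true, y_pred)
print(acc)
print(counts)
```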
