
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Text Classification**

Text classification is a supervised machine learning task.

It is one of the most widely used natural language processing tasks in business applications. The goal is to automatically assign text documents to one or more predefined categories. Some examples are:

1. Understanding audience sentiment from social media,
2. Detection of spam and non-spam emails,
3. Auto tagging of customer queries, and
4. Categorization of news articles into defined topics.

**Applications**

1.   Email filtering
2.   Customer support (identifying important tweets that must be answered)
3.   Sentiment analysis
4.   Language detection
5.   Fake news detection

**Approaches**



1.   Heuristic approach
2.   API (GCP, AWS, Azure, [nlpcloud](https://nlpcloud.com/))
3.   ML
4.   DL
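
As a rough illustration of the heuristic approach, the sketch below labels a review by counting hand-picked positive and negative words. The word lists and the function name `heuristic_sentiment` are assumptions made for this example and are not part of the notebook.

```python
# Hypothetical keyword lists chosen only for illustration.
POSITIVE_WORDS = {"wonderful", "great", "excellent", "love", "stunning"}
NEGATIVE_WORDS = {"bad", "boring", "awful", "terrible", "idiotic"}

def heuristic_sentiment(text):
    """Label text by comparing counts of positive and negative keywords."""
    words = text.lower().split()
    pos = sum(w in POSITIVE_WORDS for w in words)
    neg = sum(w in NEGATIVE_WORDS for w in words)
    if pos == neg:
        return "unknown"
    return "positive" if pos > neg else "negative"

print(heuristic_sentiment("A wonderful little production"))
print(heuristic_sentiment("bad plot bad dialogue"))
```

Rules like these are quick to write but miss negation, sarcasm, and any word outside the lists, which is one reason the rest of this notebook prepares the data for an ML model instead.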



**Problem Statement**

Classify each movie review in the IMDB dataset as positive or negative, based on the text of the viewer's review.

**Step 1: Load Data**

In [2]:
import pandas as pd
df=pd.read_csv("/content/drive/MyDrive/AI Data/NLP Data/IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


**Step 2: Text Preprocessing**


1.   HTML tag Removal
2.   Remove URLs
3.   Handling Emojis
4.   Chat Word Treatment
5.   Remove Punctuation
6.   Data Lowercasing
7.   Removing stopwords
8.   Spelling Correction
9.  Tokenization
10.  Lemmatization
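
Tokenization appears in the list above, while the cells in this notebook simply call `str.split()`. The sketch below contrasts the two on a made-up sentence, assuming a small regex-based tokenizer is good enough for illustration. The function name `simple_tokenize` is hypothetical.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens, keeping internal apostrophes."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

sample = "Basically there's a family, where a little boy thinks..."
print(sample.split())            # whitespace split keeps punctuation attached
print(simple_tokenize(sample))   # regex tokenizer drops it
```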



**HTML tag Removal**

In [3]:
import re
def remove_html_tag(text):
  # Non-greedy match of anything between '<' and '>', e.g. <br />
  pattern=re.compile(r'<.*?>')
  return pattern.sub(r'',str(text))

In [4]:
df["review"]=df["review"].apply(remove_html_tag)
df["review"].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object
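
One thing to watch with this pattern: tags are deleted outright, so words separated only by a tag end up joined together. The sketch below shows the effect on a made-up string, along with a variant that substitutes a space instead. The name `remove_html_tag_spaced` is hypothetical and not part of the notebook.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')

def remove_html_tag(text):
    # Same behaviour as the notebook cell: tags are replaced by nothing.
    return TAG_PATTERN.sub('', str(text))

def remove_html_tag_spaced(text):
    # Variant: replace each tag with a space, then collapse repeated whitespace.
    return ' '.join(TAG_PATTERN.sub(' ', str(text)).split())

sample = "A wonderful little production.<br /><br />The filming technique"
print(remove_html_tag(sample))
print(remove_html_tag_spaced(sample))
```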

**Remove URLs**

In [5]:
import re
def remove_url(text):
  # Match http(s):// links and bare www. links up to the next whitespace
  pattern=re.compile(r'https?://\S+|www\.\S+')
  return pattern.sub(r'',str(text))

In [6]:
df["review"]=df["review"].apply(remove_url)
df["review"].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object
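
To see what the URL pattern matches, here is a minimal sketch on a made-up sentence. The `example.com` and `example.org` addresses are placeholders. Note that each removed URL leaves its surrounding spaces behind; a `split()` and `join()` pass tidies that up.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

sample = "Full review at https://example.com/review?id=1 and www.example.org today"
cleaned = URL_PATTERN.sub('', sample)
print(repr(cleaned))                 # double spaces remain where URLs were
print(' '.join(cleaned.split()))     # whitespace collapsed
```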

**Handling Emojis**

In [7]:
!pip install emoji --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting emoji
  Downloading emoji-2.2.0.tar.gz (240 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.2.0-py3-none-any.whl size=234911 sha256=3d1762107abfd9df5fc01b5c4bb46e9c9ed5e9f0a5c5f4e00681f4223ab3da46
  Stored in directory: /root/.cache/pip/wheels/02/3d/88/51a592b9ad17e7899126563698b4e3961983ebe85747228ba6
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-2.2.0


In [8]:
import emoji
def replace_emoji(text):
  text = emoji.demojize(text)
  return text

In [9]:
df["review"]=df["review"].apply(replace_emoji)
df["review"].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

**Chat Word Treatment**

In [10]:
chat_words = {
'AFAIK':'As Far As I Know',
'AFK':'Away From Keyboard',
'ASAP':'As Soon As Possible',
'ATK':'At The Keyboard',
'ATM':'At The Moment',
'A3':'Anytime, Anywhere, Anyplace',
'BAK':'Back At Keyboard',
'BBL':'Be Back Later',
'BBS':'Be Back Soon',
'BFN':'Bye For Now',
'B4N':'Bye For Now',
'BRB':'Be Right Back',
'BRT':'Be Right There',
'BTW':'By The Way',
'B4':'Before',
'CU':'See You',
'CUL8R':'See You Later',
'CYA':'See You',
'FAQ':'Frequently Asked Questions',
'FC':'Fingers Crossed',
'FWIW':"For What It's Worth",
'FYI':'For Your Information',
'GAL':'Get A Life',
'GG':'Good Game',
'GN':'Good Night',
'GMTA':'Great Minds Think Alike',
'GR8':'Great!',
'G9':'Genius',
'IC':'I See',
'ICQ':'I Seek you',
'ILU':'I Love You',
'IMHO':'In My Honest/Humble Opinion',
'IMO':'In My Opinion',
'IOW':'In Other Words',
'IRL':'In Real Life',
'KISS':'Keep It Simple, Stupid',
'LDR':'Long Distance Relationship',
'LMAO':'Laugh My A.. Off',
'LOL':'Laughing Out Loud',
'LTNS':'Long Time No See',
'L8R':'Later',
'MTE':'My Thoughts Exactly',
'M8':'Mate',
'NRN':'No Reply Necessary',
'OIC':'Oh I See',
'PITA':'Pain In The A..',
'PRT':'Party',
'PRW':'Parents Are Watching',
'ROFL':'Rolling On The Floor Laughing',
'ROFLOL':'Rolling On The Floor Laughing Out Loud',
'ROTFLMAO':'Rolling On The Floor Laughing My A.. Off',
'SK8':'Skate',
'STATS':'Your sex and age',
'ASL':'Age, Sex, Location',
'THX':'Thank You',
'TTFN':'Ta-Ta For Now!',
'TTYL':'Talk To You Later',
'U':'You',
'U2':'You Too',
'U4E':'Yours For Ever',
'WB':'Welcome Back',
'WTF':'What The F...',
'WTG':'Way To Go!',
'WUF':'Where Are You From?',
'W8':'Wait...'
}

In [11]:
def chat_conversion(text):
  new_text=[]
  for w in text.split():
    if w.upper() in chat_words:
      new_text.append(chat_words[w.upper()])
    else:
      new_text.append(w)
  return " ".join(new_text)

In [12]:
df["review"]=df["review"].apply(chat_conversion)
df["review"].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. The filming tec...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object
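
Because this step runs before punctuation is removed, a token such as `lol.` or `btw,` does not match any dictionary key and is left unchanged. The sketch below is one possible way to handle that, assuming it is acceptable to strip punctuation from the token only for the lookup. The function name `chat_conversion_punct` and the three-entry dictionary are hypothetical stand-ins for the full `chat_words` mapping.

```python
import string

# Small hypothetical subset of the chat_words dictionary, for illustration only.
chat_words_subset = {
    'BTW': 'By The Way',
    'IMO': 'In My Opinion',
    'LOL': 'Laughing Out Loud',
}

def chat_conversion_punct(text, mapping=chat_words_subset):
    """Expand chat abbreviations, ignoring punctuation attached to a token."""
    new_text = []
    for w in text.split():
        core = w.strip(string.punctuation)
        if core and core.upper() in mapping:
            # Swap the abbreviation for its expansion, keeping the punctuation.
            new_text.append(w.replace(core, mapping[core.upper()]))
        else:
            new_text.append(w)
    return " ".join(new_text)

print(chat_conversion_punct("btw, the ending was great lol."))
```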

**Remove Punctuation**

In [13]:
import string
exclude=string.punctuation
# Build the translation table once instead of on every call
punc_table=str.maketrans('','',exclude)

def remove_punc_method2(text):
  return text.translate(punc_table)

In [14]:
df["review"]=df["review"].apply(remove_punc_method2)
df["review"].head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production The filming tech...
2    I thought this was a wonderful way to spend ti...
3    Basically theres a family where a little boy J...
4    Petter Matteis Love in the Time of Money is a ...
Name: review, dtype: object
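
Deleting punctuation merges hyphenated words and contractions, as the output above shows (`theres`, `Matteis`). A minimal sketch of the alternative, replacing each punctuation character with a space, is given below. The sample sentence is made up, and which behaviour is preferable depends on the downstream model.

```python
import string

# Table that deletes punctuation (the notebook's behaviour).
delete_table = str.maketrans('', '', string.punctuation)
# Table that maps every punctuation character to a space instead.
space_table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

sample = "A well-written, must-see film!"
print(sample.translate(delete_table))
print(' '.join(sample.translate(space_table).split()))
```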

**Data Lowercasing**

In [15]:
df['review']=df['review'].str.lower()
df['review'].head()

0    one of the other reviewers has mentioned that ...
1    a wonderful little production the filming tech...
2    i thought this was a wonderful way to spend ti...
3    basically theres a family where a little boy j...
4    petter matteis love in the time of money is a ...
Name: review, dtype: object

**Stop Words Removal**

In [16]:
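import nltk
nltk.download('stopwords')  # stop word lists: a, an, the, am, is, etc.
from nltk.corpus import stopwords

# Build the set once; calling stopwords.words('english') for every token is very slow.
stop_words=set(stopwords.words('english'))

def Remove_Stopwords(text):
  # Each stop word is replaced by an empty string, so the joined text keeps
  # extra spaces; the split() in the lemmatization step collapses them later.
  return " ".join('' if word in stop_words else word for word in text.split())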

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [17]:
df['review']=df['review'].apply(Remove_Stopwords)
df['review'].head()

0    one    reviewers  mentioned   watching  1 oz e...
1     wonderful little production  filming techniqu...
2     thought    wonderful way  spend time    hot s...
3    basically theres  family   little boy jake thi...
4    petter matteis love   time  money   visually s...
Name: review, dtype: object
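
NLTK's English stop word list includes negations such as "not" and "no", so removing every stop word can flip the apparent sentiment of a phrase. The sketch below shows one way to keep negations while dropping the rest. It uses a tiny hand-written stop word set so that it runs without downloads; the set contents and the name `remove_stopwords_keep_negation` are assumptions for this example.

```python
# Tiny hypothetical stop word list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"a", "an", "the", "is", "was", "this", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stopwords_keep_negation(text, stop_words=STOP_WORDS, keep=NEGATIONS):
    """Drop stop words entirely (no blank placeholders), except those in `keep`."""
    return " ".join(w for w in text.split() if w not in stop_words or w in keep)

print(remove_stopwords_keep_negation("this was not a good movie"))
```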

**Lemmatization**


In [18]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [19]:
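from nltk.stem import WordNetLemmatizer
# Create the lemmatizer once rather than on every call
lemmatizer = WordNetLemmatizer()

def lemmatization(text):
  # pos="v" treats every token as a verb, so nouns and adjectives are mostly left unchanged
  return " ".join(lemmatizer.lemmatize(word, pos="v") for word in text.split())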

In [20]:
df['review']=df['review'].apply(lemmatization)
df['review'].head()

0    one reviewers mention watch 1 oz episode youll...
1    wonderful little production film technique una...
2    think wonderful way spend time hot summer week...
3    basically theres family little boy jake think ...
4    petter matteis love time money visually stun f...
Name: review, dtype: object

In [21]:
df['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [25]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [23]:
df.duplicated().sum()

423

In [27]:
df.drop_duplicates(inplace=True)

In [28]:
df.duplicated().sum()

0

In [29]:
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mention watch 1 oz episode youll...,positive
1,wonderful little production film technique una...,positive
2,think wonderful way spend time hot summer week...,positive
3,basically theres family little boy jake think ...,negative
4,petter matteis love time money visually stun f...,positive


In [30]:
x=df.iloc[:,0:1]
x

Unnamed: 0,review
0,one reviewers mention watch 1 oz episode youll...
1,wonderful little production film technique una...
2,think wonderful way spend time hot summer week...
3,basically theres family little boy jake think ...
4,petter matteis love time money visually stun f...
...,...
49995,think movie right good job wasnt creative orig...
49996,bad plot bad dialogue bad act idiotic direct a...
49997,catholic teach parochial elementary school nun...
49998,im go disagree previous comment side maltin on...


In [31]:
y=df['sentiment']
y

0        positive
1        positive
2        positive
3        negative
4        positive
           ...   
49995    positive
49996    negative
49997    negative
49998    negative
49999    negative
Name: sentiment, Length: 49577, dtype: object

In [33]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

y = encoder.fit_transform(y)
y

array([1, 1, 1, ..., 0, 0, 0])
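
`LabelEncoder` assigns integer codes to the sorted unique labels, which is why `negative` becomes 0 and `positive` becomes 1 in the array above. The pure-Python sketch below mimics that mapping on a made-up list of labels; it illustrates the idea and is not scikit-learn's implementation.

```python
labels = ["positive", "positive", "negative", "positive"]

# Sorted unique labels play the role of LabelEncoder's classes_ attribute.
classes = sorted(set(labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in labels]
print(classes)
print(encoded)
```

With the fitted encoder from the cell above, `encoder.classes_` shows the same ordering and `encoder.inverse_transform` maps codes back to label strings.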

In [34]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=1)

In [35]:
X_train.shape

(39661, 1)

In [37]:
X_test.shape

(9916, 1)

**Text Vectorization using Bag of Words**

In [None]:
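# Applying BoW
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
# fit_transform returns a SciPy sparse matrix. Calling .toarray() on ~40k reviews
# with the full vocabulary can exhaust Colab memory, so the matrices stay sparse.
X_train_bow = cv.fit_transform(X_train['review'])
X_test_bow = cv.transform(X_test['review'])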

In [None]:
X_train_bow.shape
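
To make the bag-of-words idea concrete, the sketch below builds the count matrix by hand for two short made-up documents using only the standard library. It is a simplified illustration: `CountVectorizer` additionally lowercases text and, with its default token pattern, ignores single-character tokens.

```python
from collections import Counter

docs = ["wonderful little production", "bad plot bad dialogue"]

# Vocabulary: every distinct token across the corpus, in sorted order.
vocabulary = sorted({token for doc in docs for token in doc.split()})

def bow_vector(doc):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocabulary]

matrix = [bow_vector(doc) for doc in docs]
print(vocabulary)
print(matrix)
```

Each row corresponds to one document and each column to one vocabulary word, so the shape is (number of documents, vocabulary size). As in the notebook, the vocabulary should be learned from the training split only and then reused to transform the test split.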