### Text Classification

- Text classification is a supervised machine learning task in which a piece of text is assigned to one of a set of predefined categories.



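- Types of text classification


1. Binary classification: two classes, e.g. an email spam classifier (spam vs. not spam).
2. Multiclass classification: more than two classes with exactly one label per text, e.g. assigning each news headline in a news dataset to a genre.
3. Multilabel classification: a single text can carry several labels at once.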
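
To make the difference between the three types concrete, the sketch below shows what the targets might look like in each case. The label and tag names are made up for illustration.

```python
# Binary: one of two classes per text
binary_labels = ["spam", "not spam", "spam"]

# Multiclass: exactly one of several classes per text
multiclass_labels = ["sports", "politics", "business"]

# Multilabel: each text may have zero, one, or several tags
multilabel_tags = [["action", "comedy"], ["drama"], []]

# Turn the tag lists into a binary indicator matrix, one column per tag
all_tags = sorted({tag for tags in multilabel_tags for tag in tags})
indicator = [[int(tag in tags) for tag in all_tags] for tags in multilabel_tags]

print(all_tags)
print(indicator)
```

In the multilabel case a model predicts one yes/no decision per column rather than a single class.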


### Applications


1. Email spam classification
2. Customer support on e-commerce sites
3. Sentiment analysis
4. Language detection in Google Translate
5. Fake news detection



### Pipeline


1. Data acquisition

2. Text preprocessing

3. Text vectorization

4. Modelling:
   - ML: Naive Bayes, SVM, random forest, logistic regression
   - DL: RNN/LSTM, CNN, BERT

5. Evaluation: accuracy score, precision/recall, confusion matrix, ROC curve and AUC
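
The vectorization and modelling steps can be chained in a scikit-learn `Pipeline`. The following is a minimal sketch on a made-up four-sentence dataset, using `MultinomialNB` as one possible Naive Bayes variant; the texts and labels are illustrative only and are not taken from the IMDB data used later.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data standing in for the acquisition and preprocessing steps (made up)
train_texts = [
    "great movie loved it",
    "wonderful acting great story",
    "terrible movie hated it",
    "boring plot awful acting",
]
train_labels = ["pos", "pos", "neg", "neg"]

# Step 3 (vectorization) and step 4 (modelling) combined in one estimator
clf = Pipeline([
    ("bow", CountVectorizer()),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)

predictions = clf.predict(["great story", "awful boring"])
print(list(predictions))
```

Keeping the vectorizer inside the pipeline means it is fitted only on the training texts, and the sparse matrix it produces is passed straight to the model without a dense conversion.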

### Different Approaches


1.  Heuristic approach: hand-written rules, useful when no labelled data is available.
2.  Using APIs: third-party websites offer classification APIs online; you pass in the text, and the service applies its own algorithm and returns the result.
3.  Using ML or DL techniques.
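
As an illustration of the heuristic approach, here is a small keyword-counting rule for sentiment. The word lists and the function name `heuristic_sentiment` are hypothetical; a real rule set would have to be tuned to the domain.

```python
# Hypothetical keyword lists chosen only for this example
POSITIVE_WORDS = {"good", "great", "wonderful", "excellent", "love"}
NEGATIVE_WORDS = {"bad", "boring", "terrible", "awful", "hate"}

def heuristic_sentiment(text):
    """Label a text by counting positive and negative keywords."""
    words = text.lower().split()
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "unknown"

print(heuristic_sentiment("A wonderful little production"))
print(heuristic_sentiment("boring and terrible plot"))
```

Rules like this need no training data, but they ignore negation and context, which is one reason to move to ML or DL once labelled data exists.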



### Available APIs

- nlpcloud.io



### Text-Classification 

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [19]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
temp_df = pd.read_csv(r"D:\Sandesh\Data Science\Class Assignment\Codes\All DataSets\Kaggle Dataset\IMDB Dataset.csv")
temp_df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [20]:
temp_df.shape

(50000, 2)

In [21]:
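# Work on an explicit copy of the first 10,000 reviews so the in-place edits below do not trigger SettingWithCopyWarning
df = temp_df.iloc[:10000].copy()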

In [22]:
df['review'][10]

'Phil the Alien is one of those quirky films where the humour is based around the oddness of everything rather than actual punchlines.<br /><br />At first it was very odd and pretty funny but as the movie progressed I didn\'t find the jokes or oddness funny anymore.<br /><br />Its a low budget film (thats never a problem in itself), there were some pretty interesting characters, but eventually I just lost interest.<br /><br />I imagine this film would appeal to a stoner who is currently partaking.<br /><br />For something similar but better try "Brother from another planet"'

In [23]:
df['sentiment'].value_counts()

positive    5028
negative    4972
Name: sentiment, dtype: int64

In [24]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

In [25]:
df.duplicated().sum()

17

In [26]:
df.drop_duplicates(inplace=True)

###  Basic Preprocessing 

#### Remove HTML tags 

In [31]:
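import re

# Compile once; the non-greedy pattern matches one tag at a time, e.g. <br />
TAG_RE = re.compile(r'<.*?>')

def remove_tags(raw_text):
    # Delete every HTML tag from the review text
    return TAG_RE.sub('', raw_text)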

In [32]:
df['review']= df['review'].apply(remove_tags)

#### Convert to lowercase

In [34]:
df['review'] = df['review'].apply(lambda x: x.lower())

In [33]:
df.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there's a family where a little boy ...,negative
4,"petter mattei's ""love in the time of money"" is...",positive

#### Remove stopwords 

In [35]:
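from nltk.corpus import stopwords
# Requires a one-time nltk.download('stopwords')

# A set makes each membership test O(1) instead of scanning a list
sw_set = set(stopwords.words('english'))
df['review'] = df['review'].apply(
    lambda x: " ".join(word for word in x.split() if word not in sw_set))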



In [36]:
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production. filming technique...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there's family little boy (jake) thi...,negative
4,"petter mattei's ""love time money"" visually stu...",positive


In [37]:
X=df.iloc[:,0:1]
y=df['sentiment']

In [38]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y=le.fit_transform(y)

In [39]:
y

array([1, 1, 1, ..., 0, 0, 1])

In [47]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,
                                              random_state=1)

In [59]:
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((7986, 1), (1997, 1), (7986,), (1997,))

In [52]:
X_train

Unnamed: 0,review
6713,"i've waiting superhero movie like long time. ""..."
1178,"movie excellent acted, excellent directed over..."
4707,movie makes want throw every time see it. take...
6772,"first saw movie elementary school, back 1960s...."
7461,show made persons iq lower 80. jokes show lame...
...,...
2895,excellent episode movie ala pulp fiction. 7 da...
7823,"first off, give idea taste movies...2007 comed..."
905,well begin story?? went movie tonight friends ...
5195,"lot horror fans seem love scarecrows, popular ..."


### 1. CountVectorizer

In [61]:
from sklearn.feature_extraction.text import CountVectorizer

cv=CountVectorizer()
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [62]:
# Each review becomes a vector of length 48282 (the vocabulary size)

X_train_bow.shape

(7986, 48282)

In [63]:
X_test_bow.shape

(1997, 48282)
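
`CountVectorizer` builds a bag-of-words representation: it collects a vocabulary from the training texts and counts how often each vocabulary word occurs in each document. A hand-rolled sketch of that idea on two made-up sentences follows; it splits on whitespace only, whereas scikit-learn's own tokenizer is more elaborate.

```python
from collections import Counter

docs = ["good movie good acting", "bad movie"]  # made-up examples

# Vocabulary: every distinct word in the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def bow_vector(doc):
    # Count each word, then read the counts off in vocabulary order
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]
print(vocab)
print(vectors)
```

The vector length equals the vocabulary size, which is why the training matrix above has 48282 columns.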

In [64]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train_bow,y_train)

GaussianNB()

In [65]:
y_predict = gnb.predict(X_test_bow)

In [67]:
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report

print(accuracy_score(y_test,y_predict))

0.6324486730095142


In [68]:
print(confusion_matrix(y_test,y_predict))

[[717 235]
 [499 546]]
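
Precision and recall can be read directly off this confusion matrix. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and treats class 1, which `LabelEncoder` assigned to `positive`, as the positive class.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions
tn, fp = 717, 235
fn, tp = 499, 546

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the reviews predicted positive, how many were positive
recall = tp / (tp + fn)      # of the truly positive reviews, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy computed this way matches the score printed earlier, and the recall shows that the model misses nearly half of the positive reviews.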


### Use random forest classifier

In [69]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train_bow,y_train)

RandomForestClassifier()

In [70]:
y_predict=rf.predict(X_test_bow)
print(accuracy_score(y_test,y_predict))

0.8537806710065098


### Hyperparameter tuning of CountVectorizer 

In [73]:
cv=CountVectorizer(max_features=3000)
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [75]:
rf.fit(X_train_bow,y_train)
y_predict = rf.predict(X_test_bow)
print(accuracy_score(y_test,y_predict))

0.8377566349524287


### 2. N-Grams

In [78]:
cv=CountVectorizer(ngram_range=(1,2),max_features=5000)
X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [80]:
rf.fit(X_train_bow,y_train)
y_predict = rf.predict(X_test_bow)
print(accuracy_score(y_test,y_predict))

0.8397596394591887
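
With `ngram_range=(1,2)` the vocabulary contains single words and pairs of adjacent words. A minimal sketch of how such features are formed is given below; the helper `ngrams` is hypothetical and splits on whitespace, so unlike scikit-learn's default tokenizer it keeps one-letter words such as "a".

```python
def ngrams(tokens, n):
    # Slide a window of n tokens across the list
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not a good movie".split()
features = ngrams(tokens, 1) + ngrams(tokens, 2)  # same idea as ngram_range=(1, 2)
print(features)
```

Bigrams such as "good movie" keep a little word-order information that single words lose, at the cost of a much larger vocabulary, which is why `max_features` is capped here.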


### 3. TF-IDF

In [81]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [84]:
tfidf=TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()

In [86]:
rf= RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_predict = rf.predict(X_test_tfidf)
accuracy_score(y_test,y_predict)

0.8442663995993991
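
TF-IDF weights each word count by how rare the word is across documents. The sketch below reproduces the weighting that `TfidfVectorizer` applies with its default settings (smoothed IDF, then L2 normalisation of each row) on three made-up documents; whitespace tokenisation is a simplification.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good acting"]  # made-up examples
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each word
df_counts = {w: sum(w in doc for doc in tokenized) for w in vocab}

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df_counts[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]  # L2-normalise the row

vector = tfidf_vector(tokenized[0])
print(dict(zip(vocab, [round(x, 3) for x in vector])))
```

Words that appear in fewer documents, such as "acting" here, get a higher IDF than words that appear in many.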

### 4. Word2Vec

1. Use a pretrained model.


2. Otherwise, train your own model on the corpus.



In [87]:
import gensim

In [88]:
from nltk import sent_tokenize
from gensim.utils import simple_preprocess

In [90]:
story = []
for doc in df['review']:
    raw_sent = sent_tokenize(doc)
    for sent in raw_sent:
        story.append(simple_preprocess(sent))

In [93]:
model = gensim.models.Word2Vec(window=10,min_count=2)

In [94]:
model.build_vocab(story)

In [95]:
model.train(story,total_examples=model.corpus_count,
           epochs=model.epochs)


(5876324, 6212140)

In [96]:
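def doc_vector(doc):
    # key_to_index is a dict (O(1) lookup); index_to_key is a list that is scanned for every word
    words = [word for word in doc.split() if word in model.wv.key_to_index]
    if not words:  # no in-vocabulary words: avoid averaging an empty array
        return np.zeros(model.wv.vector_size, dtype=np.float32)
    return np.mean(model.wv[words], axis=0)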

In [98]:
doc_vector(df['review'].values[0])

array([-0.1732407 ,  0.5042021 ,  0.23570275,  0.21904449, -0.08753805,
       -0.5946031 ,  0.20120923,  0.9846045 , -0.37480024, -0.2677054 ,
       -0.25992727, -0.46608958,  0.1590258 ,  0.14511548,  0.16485663,
       -0.12598976,  0.01677343, -0.30277133, -0.0833732 , -0.645626  ,
        0.02340727,  0.18804221,  0.07228142, -0.25306758, -0.33495325,
       -0.05134789, -0.2670815 ,  0.02060262, -0.26695168,  0.096218  ,
        0.3019478 , -0.00610226,  0.21735534, -0.29173332, -0.17026754,
        0.32478616,  0.04907368, -0.33918467, -0.21463747, -0.7890605 ,
        0.16833997, -0.23846704, -0.00721441, -0.08184247,  0.41873828,
       -0.14398889, -0.21972436, -0.03043513,  0.08818506,  0.37829992,
        0.02468275, -0.3682918 , -0.45401156, -0.14374119, -0.0747746 ,
        0.28026405,  0.20232683,  0.03714807, -0.29013857,  0.06378711,
        0.03690879,  0.09233177,  0.05377282, -0.12059262, -0.41049796,
        0.25652725,  0.05863883,  0.13870217, -0.362038  ,  0.34

In [99]:
from tqdm import tqdm
X=[]

for doc in tqdm(df['review'].values):
    X.append(doc_vector(doc))

100%|██████████████████████████████████████████████████████████████████████████████| 9983/9983 [19:05<00:00,  8.71it/s]


In [100]:
X=np.array(X)
X.shape

(9983, 100)
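
Each review is thus represented by the average of its word vectors. The toy example below shows the same averaging with hypothetical 3-dimensional embeddings in place of a trained model; `toy_vectors` and `average_vector` are names invented for this sketch.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a trained Word2Vec model would supply real ones
toy_vectors = {
    "good":  np.array([1.0, 0.0, 2.0]),
    "movie": np.array([3.0, 2.0, 0.0]),
}

def average_vector(text, vectors, dim=3):
    # Keep only words that have an embedding, then average them
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

print(average_vector("good movie tonight", toy_vectors))
```

Averaging gives every document a fixed-length vector regardless of its length, but it discards word order.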

In [102]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.20,
                                              random_state=1)

In [103]:
# y was already label-encoded above, so re-encoding leaves the 0/1 labels unchanged
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
y = le.fit_transform(y)

In [105]:
rf= RandomForestClassifier()

rf.fit(X_train,y_train)
y_predict = rf.predict(X_test)
accuracy_score(y_test,y_predict)

0.771156735102654

### THE END 