# Sentiment Analysis for Hindi Text

### Introduction
In this sentiment analysis project, a model is trained on a dataset of English movie reviews and then used to classify manually entered Hindi text as positive or negative. Because no sufficiently large Hindi dataset was available, training is done on the English data; each Hindi input is translated to English before it is passed to the trained model. NLTK (Natural Language Toolkit) and scikit-learn (sklearn) are used for preprocessing and model training.


### Procedure
The English dataset, stored in 'movie_review.csv', is read with pandas. Unnecessary columns are dropped and the result is stored in `MR`.

In [1]:
import pandas as pd
MR = pd.read_csv('movie_review.csv')
# MR (movie reviews) keeps two columns: the review text and its polarity tag.
MR = MR.drop(['fold_id','cv_tag','html_id','sent_id'], axis=1)

In [2]:
print(MR.head(3))
MR = MR.sample(frac=1)   # shuffle the rows; the original index labels are kept
print(MR.head(3))
print(MR.shape)

                                                text  tag
0  films adapted from comic books have had plenty...  pos
1  for starters , it was created by alan moore ( ...  pos
2  to say moore and campbell thoroughly researche...  pos
                                                    text  tag
14920  the character of stargher is an excellent role...  pos
62603  and it misses its best possible opportunity fo...  neg
12327  after dying in a car crash , on his birthday o...  pos
(64720, 2)
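
`MR.sample(frac=1)` shuffles the rows but keeps each row's original index label, which is why the second `head(3)` above shows labels such as 14920 and 62603. The sketch below illustrates this on a small, hypothetical DataFrame (`toy` is not part of the project data) and shows how `reset_index(drop=True)` would renumber the rows if positional labels were wanted.

```python
import pandas as pd

# Hypothetical toy frame standing in for the movie review data.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "tag": ["pos", "neg", "pos", "neg"],
})

# Shuffle all rows; random_state is set only to make the sketch repeatable.
shuffled = toy.sample(frac=1, random_state=0)

# The index labels travel with the rows, so they are now out of order.
print(shuffled.index.tolist())

# Renumber the rows 0..n-1 while keeping the shuffled order.
renumbered = shuffled.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to renumber is a design choice: label-based lookups such as `series[0]` refer to the original row 0 until the index is reset.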


### Preprocessing
Preprocessing is done in this section and includes:

1. Converting to lower case
2. Removing punctuation
3. Removing numbers
4. Removing white space
5. Removing stop words
6. Lemmatizing with the WordNet lemmatizer
7. Stemming the words using **Porter Stemmer**
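
As a rough illustration of steps 1 to 5 on a single string, here is a minimal sketch that uses only the standard library. The `STOP_WORDS` set and the `clean_text` helper are hypothetical and deliberately tiny; lemmatization and stemming (steps 6 and 7) are left out because they need NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "was", "would"}

def clean_text(text):
    """Apply steps 1-5 of the list above to one string."""
    text = text.lower()                       # 1. lower case
    text = re.sub(r"[^a-z#]", " ", text)      # 2-3. punctuation and digits become spaces
    tokens = text.split()                     # 4. split() discards surplus white space
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 5. stop words
    return " ".join(tokens)

cleaned = clean_text("The movie was NOT good, 10/10 would skip!!")
print(cleaned)   # movie not good skip
```

Note that a negation such as "not" survives only because it is absent from this toy list; larger stop-word lists often include it, which can matter for sentiment.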

In [3]:
def preprocessing(input_file, target):
    # ENGLISH_STOP_WORDS is imported from sklearn.feature_extraction.text
    # (the old sklearn.feature_extraction.stop_words module has been removed).
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    from nltk.stem import WordNetLemmatizer   # needs the NLTK 'wordnet' corpus
    from nltk.stem import PorterStemmer
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    # Convert everything to lower case
    input_file[target] = input_file[target].str.lower()

    # Remove punctuation: everything except letters and '#' becomes a space
    # (regex=True is needed in recent pandas, where str.replace is literal by default)
    input_file[target] = input_file[target].str.replace("[^a-zA-Z#]", " ", regex=True)

    # Remove numbers (already covered by the pattern above; kept as an explicit step)
    input_file[target] = input_file[target].str.replace(r"\d+", "", regex=True)

    # Remove leading and trailing white space
    input_file[target] = input_file[target].str.strip()

    # Tokenization (split() also collapses repeated spaces)
    tokenized = input_file[target].apply(lambda p: p.split())
    # Debug output: the tokens of the row labelled 0 and the container type
    print(tokenized[0])
    print(type(tokenized))

    # Removal of stop words
    tokenized = tokenized.apply(lambda p: [j for j in p if j not in ENGLISH_STOP_WORDS])
    # Lemmatize, then stem
    tokenized = tokenized.apply(lambda p: [lemmatizer.lemmatize(i) for i in p])
    tokenized = tokenized.apply(lambda p: [stemmer.stem(i) for i in p])
    # Join the tokens back into one string per row and write the result in place
    input_file[target] = tokenized.apply(' '.join)

    

In [4]:
print(MR.head(10))
preprocessing(MR,'text')
print(MR.head(10))
print(MR.shape)


                                                    text  tag
14920  the character of stargher is an excellent role...  pos
62603  and it misses its best possible opportunity fo...  neg
12327  after dying in a car crash , on his birthday o...  pos
44135                                                [r]  neg
62391                                     nothing else .  neg
35221  so there she is , narrating the six-astronaut ...  neg
48712  when ned's mother ( clarissa kaye ) is jailed ...  neg
58527                                    be forewarned .  neg
34716          what one tough cop lacks is originality .  neg
12716  it has changed horror movies forever and spawn...  pos
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', 'whether', 'they', 're', 'about', 'superheroes', 'batman', 'superman', 'spawn', 'or', 'geared', 'toward', 'kids', 'casper', 'or', 'the', 'arthouse', 'crowd', 'ghost', 'world', 'but', 'there', 's', 'never', 'really', 'been', 'a', 

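### Feature Extraction
TF-IDF is used for feature extraction. The vectorizer keeps at most 800 terms and ignores terms that appear in more than 90% of the documents or in fewer than 2 documents.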
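
To make the weighting concrete, the sketch below computes TF-IDF by hand for a toy corpus. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that `TfidfVectorizer` applies by default; it tokenizes on white space and ignores `max_df`, `min_df`, `max_features` and stop words, so it is an illustration rather than a replacement. The `tfidf_rows` helper and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0   # avoid dividing by zero
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = ["good movie", "bad movie", "good good fun"]
vocab, rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, [round(w, 3) for w in row])
```

In the second document, "bad" receives a higher weight than "movie" because it occurs in fewer documents. A row with a single non-zero term normalises to exactly 1.0, as in the row `(5, 437)	1.0` of the printed matrix.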

In [5]:
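def feature_extraction(input_file, target):
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Keep the 800 most frequent terms that occur in at least 2 documents
    # and in at most 90% of them; English stop words are dropped.
    tfidfV = TfidfVectorizer(max_df=0.9, min_df=2, max_features=800, stop_words='english')
    # Note: a new vectorizer is fitted on every call, so the vocabulary
    # depends on the rows that are passed in.
    vecTor = tfidfV.fit_transform(input_file[target])
    return vecTor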

In [6]:
Xtrain = feature_extraction(MR,'text')
print(Xtrain.shape)
print(Xtrain)

(64720, 800)
  (0, 103)	0.28501806826914744
  (0, 222)	0.49741543603609617
  (0, 576)	0.37323486918855153
  (0, 526)	0.4265359903844406
  (0, 380)	0.4190045993226688
  (0, 64)	0.4177804588873954
  (1, 436)	0.29854352726740335
  (1, 61)	0.2351345236360206
  (1, 515)	0.28677164695484364
  (1, 476)	0.335764851357128
  (1, 288)	0.2405339436286383
  (1, 624)	0.3073102535016016
  (1, 234)	0.2552191438795563
  (1, 546)	0.2523561661947294
  (1, 63)	0.24522113408474402
  (1, 611)	0.24719444027966347
  (1, 477)	0.2977077845312049
  (1, 337)	0.3354595761452148
  (1, 761)	0.23893375891293042
  (2, 175)	0.4901628623575851
  (2, 89)	0.49038153912511334
  (2, 155)	0.4104708448780358
  (2, 110)	0.46918061001047556
  (2, 389)	0.3614547760730428
  (5, 437)	1.0
  :	:
  (64713, 660)	0.38405829147720616
  (64713, 44)	0.3589066785199378
  (64713, 49)	0.7019034402864084
  (64715, 448)	0.49936967002101784
  (64715, 269)	0.8663890192419914
  (64716, 448)	0.16317138351525579
  (64716, 712)	0.20290382739154286
 

In [7]:
import numpy as np
# Encode the polarity tag as an integer label: 1 for 'pos', 0 for anything else.
Ytrain = [1 if str(tag) == 'pos' else 0 for tag in MR['tag']]

### Training of the model
The model is trained with the decision tree algorithm, using scikit-learn's built-in **Decision Tree** classifier (`DecisionTreeClassifier`). The data is split 70/30 into training and test sets.

In [8]:
from sklearn.model_selection import train_test_split as TTS
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier as DT
model = DT()
# Hold out 30% of the rows for testing; random_state makes the split reproducible.
X_train,X_test,Y_train,Y_test = TTS(Xtrain,Ytrain,test_size=0.30, random_state=42)

In [None]:
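# Fit on the training split only. The original run called model.fit(Xtrain, Ytrain),
# i.e. on all rows including those in X_test, so the accuracy recorded below
# was measured on data the tree had already seen; a held-out score is likely lower.
model.fit(X_train,Y_train)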

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [None]:
prid = model.predict(X_test)
print(accuracy_score(Y_test,prid))

0.9535949732179646
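
An unpruned decision tree can memorise the rows it was fitted on, so the reported accuracy depends heavily on whether the test rows were part of the fit. The following sketch shows the effect on synthetic data; `make_classification`, the sample sizes and the noise level `flip_y` are assumptions chosen for illustration and have nothing to do with the movie review data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy two-class data (hypothetical stand-in for the TF-IDF matrix).
X, y = make_classification(n_samples=500, n_features=20, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)

# Fit on the training split only, then score both splits.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = accuracy_score(y_tr, tree.predict(X_tr))
test_acc = accuracy_score(y_te, tree.predict(X_te))

print(f"accuracy on rows seen during fit: {train_acc:.3f}")
print(f"accuracy on held-out rows:        {test_acc:.3f}")
```

The first number reflects memorisation; only the second estimates how the model behaves on unseen text.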


### Manual Input in Hindi
Here the input is taken in Hindi. It is translated to English with `googletrans`, preprocessed, and passed to the trained model, which returns a positive or negative prediction.

In [None]:
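def finalConversion(get_inp):
    from googletrans import Translator
    translator = Translator()
    # Translate the Hindi input to English (this call needs network access)
    get_inp = translator.translate(text = get_inp,dest = 'en',src = 'hi')
    get_inp = get_inp.text
    print(get_inp)
    # Preprocess only the new row; MR was already preprocessed in place earlier
    df1 = pd.DataFrame({'text':[get_inp],'tag':[0]})
    preprocessing(df1,'text')
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
    df1 = pd.concat([df1, MR], ignore_index = True)
    # The vectorizer is re-fitted on the input plus the training text, so its
    # vocabulary may differ slightly from the one the model was trained on
    tstVec = feature_extraction(df1,'text')
    v = tstVec[0]   # row 0 is the translated input; row 1 is the first review in MR
    x = model.predict(v)

    return x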
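
The function above relies on `feature_extraction`, which fits a fresh `TfidfVectorizer` on every call. A common alternative is to fit the vectorizer once on the training text and reuse it with `transform` for each new input, which guarantees that the feature columns match those seen during training. The sketch below shows that pattern on hypothetical toy data; `train_texts`, `train_labels`, `clf` and `predict_sentiment` are illustrative names, not part of the project code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the preprocessed reviews and labels.
train_texts = ["good great movie", "great fun film", "bad boring movie", "boring awful film"]
train_labels = [1, 1, 0, 0]

# Fit the vectorizer once, on the training text only.
vectorizer = TfidfVectorizer()
X_toy = vectorizer.fit_transform(train_texts)
clf = DecisionTreeClassifier(random_state=0).fit(X_toy, train_labels)

def predict_sentiment(english_text):
    """Vectorize new text with the already-fitted vocabulary and classify it."""
    vec = vectorizer.transform([english_text])   # transform, not fit_transform
    return int(clf.predict(vec)[0])

new_vec = vectorizer.transform(["great movie"])
label = predict_sentiment("great movie")
print(new_vec.shape, label)
```

Words that were not in the training vocabulary are simply ignored by `transform`, so an input made up entirely of unseen words becomes an all-zero vector.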

In [None]:
get_inp = input("Enter Text in Hindi: ")
output = finalConversion(get_inp)
# model.predict returns an array holding one label: 0 is negative, 1 is positive
if output[0] == 0:
    output = "Negative"
else:
    output = "Positive"
print("\nPrediction is: " + output)

In [None]:
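# Sample inputs:
# बहुत अच्छी मूवी है ये!
#   (English: "This is a very good movie!")
# एबीसीडी 2’ में ड्रामा की कमी है, इस कमी की वजह से ही सुरु और विन्नी का रोमांस भी नहीं उभर पाता।
#   (English: "'ABCD 2' lacks drama, and because of that the romance between Suru and Vinni never takes off either.")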