# Sentiment Analysis for Hindi Text

### Introduction
In this sentiment analysis project we take a dataset of movie reviews in English and use it to train a model. We then take manual inputs in Hindi and use the trained model to predict whether each input is positive or negative. NLTK (Natural Language Toolkit) and scikit-learn (sklearn) are used for preprocessing and model training. Because no sufficiently large Hindi dataset was available, the model is trained on the English dataset and tested on Hindi inputs, which are translated to English before prediction.

### Procedure
The English dataset, stored in 'movie_review.csv', is read with pandas. Unnecessary columns are dropped and the result is stored in MR.

In [8]:
import pandas as pd
MR = pd.read_csv('movie_review.csv')
MR = MR.drop(['fold_id','cv_tag','html_id','sent_id'],axis = 1)   # MR (movie reviews) now has 2 columns: the review text and its polarity tag.

In [9]:
print(MR.head(3))
MR  = MR.sample(frac=1)
print(MR.head(3))
print(MR.shape)

                                                text  tag
0  films adapted from comic books have had plenty...  pos
1  for starters , it was created by alan moore ( ...  pos
2  to say moore and campbell thoroughly researche...  pos
                                                    text  tag
48152  we all know that inter-office politics are jus...  neg
21036  the story and acting are of good quality , but...  pos
30937  a fine film , even though it needs just a litt...  pos
(64720, 2)


### Preprocessing
Preprocessing of the data is done in this section. It includes:
1. Converting to lower case
2. Removing punctuation
3. Removing numbers
4. Removing white space
5. Removing stop words
6. Lemmatizing the words
7. Stemming the words using **Porter Stemmer**
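
The first five steps can be illustrated on a single string with a minimal sketch that uses only the standard library. The `clean_text` helper and the tiny `STOP_WORDS` set are hypothetical, chosen for illustration; lemmatization and stemming are left out because they need NLTK.

```python
import re

# Hypothetical, deliberately tiny stop-word list
# (the notebook itself uses scikit-learn's ENGLISH_STOP_WORDS)
STOP_WORDS = {"the", "a", "is", "of", "and", "it", "but", "are"}

def clean_text(text):
    """Apply steps 1-5 of the list above to one string."""
    text = text.lower()                        # 1. lower case
    text = re.sub(r"[^a-z\d\s]", " ", text)    # 2. punctuation -> space
    text = re.sub(r"\d+", " ", text)           # 3. numbers -> space
    tokens = text.split()                      # 4. split() also discards extra white space
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 5. stop words
    return " ".join(tokens)

print(clean_text("The story and acting are of GOOD quality, but it is 2 hours long!"))
# story acting good quality hours long
```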

In [10]:
def preprocessing(input_file,target):
    # ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text
    # (the old sklearn.feature_extraction.stop_words module was removed)
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    from nltk.stem import WordNetLemmatizer   # requires the NLTK 'wordnet' corpus
    from nltk.stem import PorterStemmer
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    #Convert All into Lower Case
    input_file[target] = input_file[target].str.lower()

    #Removing Punctuation: everything except letters and '#' becomes a space
    #(regex=True must be passed explicitly in pandas >= 2.0)
    input_file[target] = input_file[target].str.replace(r"[^a-zA-Z#]", " ", regex=True)

    #Removing Numbers (already covered by the pattern above; kept as a safeguard)
    input_file[target] = input_file[target].str.replace(r"\d+", "", regex=True)

    #Removing White Spaces
    input_file[target] = input_file[target].str.strip()

    #Tokenization: split each review on white space
    tokenized = input_file[target].apply(lambda p: p.split())
    print(tokenized[0])
    print(type(tokenized))

    #Removal of Stop Words
    tokenized = tokenized.apply(lambda p: [j for j in p if j not in ENGLISH_STOP_WORDS])
    #Lemmatization, then stemming
    tokenized = tokenized.apply(lambda p: [lemmatizer.lemmatize(i) for i in p])
    tokenized = tokenized.apply(lambda p: [stemmer.stem(i) for i in p])
    #Join the tokens back into one string per review
    input_file[target] = tokenized.apply(' '.join)

    

In [11]:
print(MR.head(10))
preprocessing(MR,'text')
print(MR.head(10))
print(MR.shape)


                                                    text  tag
48152  we all know that inter-office politics are jus...  neg
21036  the story and acting are of good quality , but...  pos
30937  a fine film , even though it needs just a litt...  pos
55162  for example , the two gym teachers , mrs . bal...  neg
6554   one of the most overlooked aspects of this fil...  pos
45894  even after that happens , though , i'll still ...  neg
6995                         but it's us being clapped .  pos
61306  the next day , the team wakes up and discovers...  neg
20209  this is because few moviegoers will care a who...  pos
20810  not coincidentally he was something of a mysti...  pos
['films', 'adapted', 'from', 'comic', 'books', 'have', 'had', 'plenty', 'of', 'success', 'whether', 'they', 're', 'about', 'superheroes', 'batman', 'superman', 'spawn', 'or', 'geared', 'toward', 'kids', 'casper', 'or', 'the', 'arthouse', 'crowd', 'ghost', 'world', 'but', 'there', 's', 'never', 'really', 'been', 'a', 

### Feature Extraction:
TF-IDF (term frequency-inverse document frequency) is used for feature extraction.

In [12]:
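def feature_extraction(input_file,target):
    from sklearn.feature_extraction.text import TfidfVectorizer
    # Keep at most 800 terms; ignore terms that occur in more than 90% of the
    # reviews or in fewer than 2 reviews, as well as English stop words
    tfidfV = TfidfVectorizer(max_df = 0.9,min_df = 2,max_features = 800, stop_words = 'english')
    # Note: a new vectorizer is fitted on every call
    vecTor=  tfidfV.fit_transform(input_file[target])
    return vecTor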
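
As a rough illustration of what the vectorizer computes, the sketch below reproduces TF-IDF weighting by hand for a toy corpus. It assumes scikit-learn's default settings (smoothed IDF, L2 normalisation) and ignores the max_df, min_df, max_features and stop_words options used above. The `tfidf` helper and the two example sentences are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document.

    Weight = term count * (ln((1 + n) / (1 + df)) + 1), after which each
    document vector is scaled to unit Euclidean length.
    """
    tokenized = [doc.split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard for empty documents
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

vectors = tfidf(["good movie", "bad movie"])
print(vectors[0])  # 'good' gets a larger weight than 'movie', which occurs in every document
```

Terms that appear in many reviews are down-weighted, which is why a word such as "movie" contributes less to the feature vector than a rarer, more informative word.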

In [13]:
Xtrain = feature_extraction(MR,'text')
print(Xtrain.shape)
print(Xtrain)
#print(Xtrain)

(64720, 800)
  (0, 369)	0.2984188631924792
  (0, 472)	0.3909773470549727
  (0, 509)	0.8313382857158736
  (0, 360)	0.2587654360299233
  (1, 661)	0.32032734776899585
  (1, 5)	0.3731434813516857
  (1, 284)	0.3166009896379076
  (1, 536)	0.4847380998638775
  (1, 606)	0.42274466027904917
  (1, 547)	0.49420043778029105
  (2, 360)	0.2743690457463526
  (2, 256)	0.4247667417167444
  (2, 253)	0.37833423216775036
  (2, 457)	0.3719538013852179
  (2, 394)	0.31865770770307894
  (2, 366)	0.3782478595598175
  (2, 552)	0.46711266556638864
  (3, 221)	0.5016957606367182
  (3, 449)	0.4865932780464587
  (3, 535)	0.5502151205297771
  (3, 269)	0.45693726775078153
  (4, 253)	0.2183621220828705
  (4, 41)	0.5124813057051281
  (4, 452)	0.4325988334773524
  (4, 87)	0.5367837823481592
  :	:
  (64713, 376)	0.4544799690680784
  (64714, 546)	1.0
  (64715, 253)	0.3015469049429314
  (64715, 475)	0.5897265992852079
  (64715, 797)	0.4934358839568921
  (64715, 474)	0.5637490848227809
  (64716, 57)	0.7102887236490734
  (647

In [7]:
import numpy as np
# Encode the labels: 1 for 'pos', 0 for everything else
Ytrain = [1 if str(i) == 'pos' else 0 for i in MR['tag']]
#print(Ytrain)

### Training of the model
The model is trained with the decision tree algorithm, using scikit-learn's built-in **Decision Tree** implementation (DecisionTreeClassifier).

In [8]:
from sklearn.model_selection import train_test_split as TTS
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier as DT
model = DT()
# Hold out 30% of the samples for testing
X_train,X_test,Y_train,Y_test = TTS(Xtrain,Ytrain,test_size=0.30, random_state=42)

In [9]:
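# Note: the model is fitted on the full data (Xtrain, Ytrain), which includes the X_test rows,
# so the accuracy reported below is measured on samples seen during training.
# model.fit(X_train, Y_train) would give a held-out estimate.
model.fit(Xtrain,Ytrain)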

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [10]:
prid = model.predict(X_test)
print(accuracy_score(Y_test,prid))

0.9531314379892872
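
The cells above fit the model on all of `Xtrain`, so the rows in `X_test` were already seen during training. A fully grown decision tree can memorise its training rows, which is why such a score tends to be optimistic. The toy sketch below illustrates the effect with a hypothetical `MemorisingModel` (a lookup table standing in for an unpruned tree) and made-up data; it is not the notebook's model.

```python
from collections import Counter

class MemorisingModel:
    """Hypothetical stand-in for an unpruned tree: it memorises every training row."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        # Fallback for unseen rows: the most common training label
        self.default = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up data: 100 distinct rows with alternating labels
X = list(range(100))
y = [i % 2 for i in X]
X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

leaky = MemorisingModel().fit(X, y)                  # test rows seen during training
held_out = MemorisingModel().fit(X_train, y_train)   # test rows never seen

print(accuracy(y_test, leaky.predict(X_test)))       # 1.0
print(accuracy(y_test, held_out.predict(X_test)))    # 0.5
```

In the notebook this corresponds to calling `model.fit(X_train, Y_train)` before scoring on `X_test`.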


### Manual Input in Hindi
Here we take an input in Hindi. The input is translated to English, preprocessed, and given to the trained model, which labels it as positive or negative.

In [11]:
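def finalConversion(get_inp):
    from googletrans import Translator
    translator = Translator()
    # Translate the Hindi input to English
    get_inp = translator.translate(text = get_inp,dest = 'en',src = 'hi')
    get_inp = get_inp.text
    print(get_inp)
    # Put the input in row 0, followed by the training reviews, so that TF-IDF
    # is computed over (almost) the same corpus as during training
    df1 = pd.DataFrame({'text':[get_inp],'tag':[0]})
    df1 = pd.concat([df1, MR], ignore_index = True)   # DataFrame.append was removed in pandas 2.0
    preprocessing(df1,'text')
    tstVec = feature_extraction(df1,'text')
    v = tstVec[0]   # row 0 is the translated input (row 1 would be the first training review)
    x = model.predict(v)

    return x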

In [18]:
get_inp = input("Enter Text in Hindi: ")
output = finalConversion(get_inp)
if output[0] == 0:   # model.predict returns an array holding a single label
    output = "Negative"
else:
    output = "Positive"
print("\nPrediction is: " +output)

Enter Text in Hindi: एबीसीडी 2’ में ड्रामा की कमी है, इस कमी की वजह से ही सुरु और विन्नी का रोमांस भी नहीं उभर पाता।
ABCD 2 'lack of drama, due to the reduction does not emerge romance of Suru and Vinny.

Prediction is: Positive
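
`finalConversion` refits the TF-IDF vectorizer on every call, so the columns it produces are not guaranteed to match those the model was trained on. A common alternative is to fit the vectorizer once and call `transform` on new text. The sketch below shows this on made-up reviews; the variable names and sentences are hypothetical, and the translation from Hindi is left out.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up training data: 1 = positive, 0 = negative
train_texts = ["great film loved it", "wonderful acting great story",
               "terrible film hated it", "awful boring plot"]
train_labels = [1, 1, 0, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)      # learn vocabulary and IDF once
clf = DecisionTreeClassifier(random_state=0).fit(X, train_labels)

# New text is mapped onto the SAME columns with transform(), not fit_transform()
new_vec = vectorizer.transform(["boring story but great acting"])
print(new_vec.shape[1] == X.shape[1])          # True: the feature columns line up
print(clf.predict(new_vec))
```

Keeping one fitted vectorizer alongside the model also avoids reprocessing the whole training set for every prediction.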


In [None]:
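# Sample inputs:
# भाऊत अच्छी मूवी है ये!   (English, roughly: "This is a very good movie!")
# एबीसीडी 2’ में ड्रामा की कमी है, इस कमी की वजह से ही सुरु और विन्नी का रोमांस भी नहीं उभर पाता।   (English, roughly: "'ABCD 2' lacks drama, and because of that the romance between Suru and Vinny never takes off.")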