<p>We recommend reading our introductory documentation page before working through the code, as it will help you understand the problem better.</p>
<p>You can find it <a href="https://hasocfire.github.io/hasoc/2021/ichcl/index.html">here</a>.</p>

### Importing Libraries and initializing stopwords and stemmer

In [122]:
import pandas as pd
import numpy as np
from glob import glob
import re
import json

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout


from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
# Placeholder alias for a Hindi stemmer; the Hindi stemming call in clean_tweet is commented out.
import nltk.stem as hindi_stemmer

<p>Making a list of English and Hindi stopwords. <br>The English stopwords are retrieved from the NLTK library. <br>The Hindi stopwords are retrieved from a data set on Mendeley Data. To read about how the authors compiled the list, you can check their <a href="https://arxiv.org/ftp/arxiv/papers/2002/2002.00171.pdf">publication</a>.</p>
<p>Initializing an English SnowballStemmer using the NLTK library. <br>The Hindi stemmer was produced by students of Banasthali University. You can check out their <a href="https://arxiv.org/ftp/arxiv/papers/1305/1305.6211.pdf">publication</a>.</p>

In [123]:
english_stopwords = stopwords.words("english")
# The file holds one Hindi stopword per line; strip the trailing newline from each entry.
with open('final_stopwords.txt', encoding='utf-8') as f:
    hindi_stopwords = [line.rstrip('\n') for line in f]
# Note: this rebinds the name `stopwords`, shadowing the NLTK corpus imported above.
stopwords = english_stopwords + hindi_stopwords
english_stemmer = SnowballStemmer("english")

## Reading Data

<p>Initializing a list of the directories the data is stored in, using the glob library.</p>

<p>Reading the tree-structured data from the .json files in those directories.</p>

<p>Defining two functions that turn the data from a tree structure into a flat structure.</p>
<ul>
    <li>tr_flatten: Flattens the training data. It takes two parameters: the tweet data and the labels. It creates a list of JSON records as follows:
        <ul>
            <li>For the source tweet: a record with the tweet_id, the tweet text, and the label.</li>
            <li>For a comment: a record with the tweet_id and the label; the text is the source tweet with the comment appended. This is a basic technique for providing the context of the source tweet.</li>
            <li>For a reply: a record with the tweet_id and the label; the text is the source tweet, then the comment, then the reply, so it looks like "source tweet-comment-reply".</li>
        </ul>
    </li>
    <li>te_flatten: Flattens the test data. It works like tr_flatten but without the labels file, since labels are not available for the test set. It will be used once the test data is available.</li>
</ul>
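
<p>The flattening described above can be sketched as follows. This is only a minimal sketch, not the notebook's original code: the key names (tweet_id, tweet, comment, replies), the labels dictionary keyed by tweet ID, the single-space separator, and the helper name _walk are all assumptions about the JSON layout.</p>

```python
def _walk(story):
    """Yield (tweet_id, text_with_context) for the source tweet, comments and replies.

    Assumed (hypothetical) layout: `story` has 'tweet_id', 'tweet' and an optional
    'comment' list; each comment may carry an optional 'replies' list.
    """
    source_text = story["tweet"]
    yield story["tweet_id"], source_text
    for comment in story.get("comment", []):
        # A comment is prefixed with the source tweet to give it context.
        comment_text = f"{source_text} {comment['tweet']}"
        yield comment["tweet_id"], comment_text
        for reply in comment.get("replies", []):
            # A reply is prefixed with both the source tweet and the comment.
            yield reply["tweet_id"], f"{comment_text} {reply['tweet']}"


def tr_flatten(story, labels):
    """Flatten a labelled conversation tree into a list of records."""
    return [{"tweet_id": tid, "text": text, "label": labels[tid]}
            for tid, text in _walk(story)]


def te_flatten(story):
    """Flatten an unlabelled (test) conversation tree; records carry no label."""
    return [{"tweet_id": tid, "text": text} for tid, text in _walk(story)]


# Toy conversation tree, made up for illustration.
story = {
    "tweet_id": "1",
    "tweet": "source",
    "comment": [
        {"tweet_id": "2", "tweet": "comment",
         "replies": [{"tweet_id": "3", "tweet": "reply"}]},
    ],
}
labels = {"1": "NOT", "2": "HOF", "3": "NOT"}

flat_train = tr_flatten(story, labels)
flat_test = te_flatten(story)
for record in flat_train:
    print(record)
```

<p>Sharing one traversal between the two functions keeps the train and test text built in exactly the same way, which matters because the vectorizer is fitted on the training text only.</p>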

<p>This cell will run both the flatten functions. Again, you can skip the test part if it is not available. The train_len variable will be used later on for splitting the data.</p>

In [124]:
df1 = pd.read_csv('hi_Hasoc2021_train.csv')
df2 = pd.read_excel('hasoc2019_hi_test_gold_2919.xlsx')
df3 = pd.read_excel('hasoc_2020_hi_train.xlsx')

In [130]:

# Encode the binary labels as integers: HOF -> 1, NOT -> 0.
df3['task_1'] = df3['task_1'].apply(lambda x: re.sub('HOF', '1', x))
df3['task_1'] = df3['task_1'].apply(lambda x: re.sub('NOT', '0', x))
df3['task_1'] = pd.to_numeric(df3['task_1'])


In [132]:
# Save the numerically labelled data under the column names Hsbinary and Comment.
my_dict1 = {'Hsbinary': df3['task_1'], 'Comment': df3['text']}
df = pd.DataFrame(my_dict1)
df.to_csv('hi_2010hasoc_binary.csv', index=False)

In [125]:
df1.head()
len(df1)

4594

In [126]:
df2.head()

Unnamed: 0,text,task_1
0,"वक्त, इन्सान और इंग्लैंड का मौसम आपको कभी भी ध...",NOT
1,#कांग्रेस के इस #कमीने की #करतूत को देखिए देश ...,HOF
2,पाकिस्तान को फेकना था फेका गया। जो हार कर भी द...,HOF
3,जो शब्द तूम आज किसी और औरत के लिए यूज कर रहे व...,NOT
4,नेता जी हम समाजवादी सिपाही हमेशा आपके साथ है आ...,NOT


In [127]:
df3.head()

Unnamed: 0,text,task_1
0,1 आदमीं को मारने पर गोडसे आतंकी हो सके है तो\n...,HOF
1,"RT @Vishesh4: @jawaharyadavbjp जवाहर यादव, अगर...",NOT
2,RT @FunKeyBaat: #भगवा वस्त्र पहन कर मतदान नही ...,HOF
3,Yey nina khothani labafazi benu phambili Finis...,HOF
4,RT @Rajeshbhanjan2: जब भी कोई सिकुलर कोंग्रेसी...,HOF


In [128]:

# Combine the 2021 and 2020 training sets into a single frame.
df = pd.concat([df1, df3], ignore_index=True)

In [None]:
df.head()

In [None]:
# df['tweet_id'].value_counts()

In [None]:
tweets = df.text
y = df.task_1


In [None]:
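# Negated character class: everything except Latin letters, '#', emoji/symbol blocks and Devanagari.
# Inside the class, the '|' sequences are literal characters, so apostrophes and pipes are also kept.
regex_for_english_hindi_emojis="[^a-zA-Z#\U0001F300-\U0001F5FF'|'\U0001F600-\U0001F64F'|'\U0001F680-\U0001F6FF'|'\u2600-\u26FF\u2700-\u27BF\u0900-\u097F]"
def clean_tweet(tweet):
    tweet = re.sub(r"@[A-Za-z0-9_]+",' ', tweet)  # remove handles (they may contain underscores)
    tweet = re.sub(r"https?://[A-Za-z0-9./]+",' ', tweet)  # remove URLs
    tweet = re.sub(regex_for_english_hindi_emojis,' ', tweet)  # drop all other characters
    tweet = re.sub(r"\bRT\b", " ", tweet)  # remove the retweet marker as a whole word only
    tweet = re.sub("\n", " ", tweet)
    tweet = re.sub(r" +", " ", tweet)  # collapse repeated spaces
    tokens = []
    for token in tweet.split():
        if token not in stopwords:
            token = english_stemmer.stem(token)
            #token = hindi_stemmer.hi_ste(token)
            tokens.append(token)
    return " ".join(tokens)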

In [None]:
cleaned_tweets = [clean_tweet(tweet) for tweet in tweets]

In [None]:
cleaned_tweets 

## Preprocessing and featurizing the raw text

<p>The clean_tweet function defined above handles the preprocessing. Its regex matches anything that is not English, Hindi, or an emoji (the hash sign used in hashtags is also kept).</p>
<p>The preprocessing steps are as follows:</p>
<ul>
    <li>Remove handles</li>
    <li>Remove URLs</li>
    <li>Remove anything that is not English, Hindi, or an emoji</li>
    <li>Remove the RT marker that appears in retweets</li>
    <li>Replace newlines with spaces</li>
    <li>Collapse repeated whitespace</li>
    <li>Remove stopwords</li>
    <li>Stem English text</li>
    <li>Stem Hindi text (this step is currently commented out in the code)</li>
</ul>

<p>Using TF-IDF to featurize the text. With min_df = 5, the vectorizer only keeps vocabulary terms that appear in at least 5 documents.</p>
<p>To learn more about TF-IDF you can check <a href = "https://towardsdatascience.com/tf-term-frequency-idf-inverse-document-frequency-from-scratch-in-python-6c2b61b78558">here</a> and <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html">here</a>.</p>
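
<p>To make the effect of min_df and the TF-IDF weighting concrete, here is a small plain-Python sketch. It assumes the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is what scikit-learn documents as the TfidfVectorizer defaults. The toy documents, the threshold of 2, and the whitespace tokenisation are illustrative simplifications, not the notebook's data or the vectorizer's actual tokenizer.</p>

```python
import math
from collections import Counter

# Toy corpus and threshold, made up for illustration.
docs = ["good morning india", "good night", "bad morning", "good good day"]
min_df = 2

tokenized = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term occur at least once?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

# min_df keeps only the terms that occur in at least `min_df` documents.
vocab = sorted(term for term, df in doc_freq.items() if df >= min_df)

# Smoothed inverse document frequency for the surviving terms.
n_docs = len(docs)
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}


def tfidf_row(tokens):
    """Return the L2-normalised TF-IDF vector of one document over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(weight * weight for weight in raw))
    return [weight / norm if norm else 0.0 for weight in raw]


matrix = [tfidf_row(tokens) for tokens in tokenized]

print(vocab)  # ['good', 'morning']
for doc, row in zip(docs, matrix):
    print(doc, [round(weight, 3) for weight in row])
```

<p>Terms that occur in only one document ("india", "night", "bad", "day") are dropped from the vocabulary, and the rarer of the two surviving terms receives the larger IDF.</p>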

In [None]:
vectorizer = TfidfVectorizer(min_df = 5)
X = vectorizer.fit_transform(cleaned_tweets)
X = X.toarray()  # dense ndarray; todense() returns np.matrix, which recent scikit-learn versions reject

## Training and evaluating model

In [None]:
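# test_size=0.00001 rounds up to a single validation example for any data set under 100,000 rows,
# so almost all the data is used for training and the validation report is only a smoke test.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.00001, random_state=42)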

<p>Training the Logistic Regression classifier provided by the scikit-learn library.</p>
<p>To learn more about Logistic Regression classifier you can check <a href = "https://www.youtube.com/watch?v=yIYKR4sgzI8">here</a> and <a href = "https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">here</a>.</p>

In [None]:
classifier = LogisticRegression()
classifier.fit(X_train, y_train)

<p>Predicting and printing classification metrics for the validation set.</p>

In [None]:
y_pred = classifier.predict(X_val)

In [None]:
print(classification_report(y_val, y_pred))
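
<p>To make the numbers in the report concrete, the per-class precision, recall, F1 and support can be computed by hand. The sketch below is plain Python on made-up labels; the function name per_class_metrics and the toy lists are illustrative only and are not part of scikit-learn.</p>

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    results = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
        fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
        fn = sum(t == label and p != label for t, p in pairs)  # label missed by the model
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[label] = (precision, recall, f1, tp + fn)
    return results


# Toy labels, made up for illustration.
toy_true = ["HOF", "HOF", "NOT", "NOT", "NOT"]
toy_pred = ["HOF", "NOT", "NOT", "NOT", "HOF"]

metrics = per_class_metrics(toy_true, toy_pred)
for label, (precision, recall, f1, support) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

<p>Support is the number of true examples of each class, so a report computed on a very small validation set says little about the model.</p>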

In [None]:
le = LabelEncoder() #label encoding labels for training Dense Neural Network
y_train = le.fit_transform(y_train)
y_val = le.transform(y_val)
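
<p>LabelEncoder assigns integer codes to the classes in sorted order. Assuming the labels are the strings HOF and NOT, as in the data previews above, HOF becomes 0 and NOT becomes 1, so the sigmoid output of the network below estimates the probability of NOT. The plain-Python sketch mimics that mapping on made-up labels; the variable names are illustrative.</p>

```python
# Toy labels, made up for illustration.
toy_labels = ["NOT", "HOF", "HOF", "NOT"]

# Classes are sorted, and each class gets its position as its integer code.
classes = sorted(set(toy_labels))
to_code = {label: code for code, label in enumerate(classes)}

encoded = [to_code[label] for label in toy_labels]
decoded = [classes[code] for code in encoded]  # the inverse mapping

print(to_code)  # {'HOF': 0, 'NOT': 1}
print(encoded)  # [1, 0, 0, 1]
```

<p>Keeping this mapping in mind matters when turning the network's 0/1 predictions back into HOF and NOT for a submission file.</p>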

In [None]:
model = Sequential(
    [
        Dense(64, activation="relu"),
        Dense(32, activation="relu"),
        Dense(1, activation="sigmoid"),
    ]
)
model.compile('adam', loss='binary_crossentropy', metrics = ['accuracy']) #compiling a neural network with 3 layers for classification

In [None]:
model.fit(X_train, y_train, epochs = 10, batch_size = 32)

In [None]:
y_pred = model.predict(X_val)
y_pred = (y_pred > 0.5).astype('int64')
y_pred = y_pred.reshape(len(y_pred))    

In [None]:
print(classification_report(y_val, y_pred))

## Predicting test data and making a sample submission file

<p>This part reads the test data and makes predictions on it once the test data is made available. When it is available, make a directory named 'test' inside the data directory and copy the story directories into it.</p>

<p>The test directories do not contain a labels.json file, so the labels list is not initialized for the test data.</p>

<p>Flattening the test data.</p>

In [None]:
test_df = pd.read_csv('hi_Hasoc2021_test.csv')

In [None]:
test_df.head()

In [None]:
test_tweets = test_df.text
tweet_ids = test_df.tweet_id

In [None]:
cleaned_test = [clean_tweet(tweet) for tweet in test_tweets]
cleaned_test

In [None]:
X_test = vectorizer.transform(cleaned_test)
X_test = X_test.toarray()  # dense ndarray; todense() returns np.matrix, which recent scikit-learn versions reject

In [None]:
submission_prediction = classifier.predict(X_test)
print(submission_prediction)

# Count the test tweets predicted as HOF.
hof_count = sum(label == 'HOF' for label in submission_prediction)
print(hof_count)

# Build the submission table: one row per test tweet with its id and predicted label.
my_dict1 = {'id': test_df['tweet_id'].tolist(), 'label': list(submission_prediction)}
df = pd.DataFrame(my_dict1)
df
# df.to_csv('hi_neural_network_final.csv',index=False)