Machine learning models cannot work with raw text directly, so we need to convert it into numerical vectors. As part of this practice exercise you will implement different techniques to do so.

In this notebook we are going to understand some basic text cleaning steps and techniques for encoding text data. We are going to learn about:
1. **Understanding the data** - See what the data is about and what should be considered when cleaning it (punctuation, stopwords, etc.).
2. **Basic Cleaning** - We will see which aspects need to be considered when cleaning the data (such as punctuation and stopwords) and the code for it.
3. **Techniques for Encoding** - The popular encoding techniques that I have personally come across:
    *           **Bag of Words**
    *           **Binary Bag of Words**
    *           **Bigram, Ngram**
    *           **TF-IDF**( **T**erm  **F**requency - **I**nverse **D**ocument **F**requency)


# 1.Importing Libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
import nltk
import matplotlib.pyplot as plt

Libraries used in this notebook along with their version:

google	2.0.3

nltk	3.2.5

numpy	1.18.3

pandas	1.0.3

# 2.Reading the data

We will employ a text categorization dataset based on Reviews. Each article is assigned a specific category.
### Implement the code to load the dataset. (Hint: Use the pandas library to load the csv file.)

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
rev = pd.read_csv('/kaggle/input/reviewdata/Review.csv')
print(rev.shape)
rev.head()

1. **Understanding the data**

Our main objective from the dataset is to predict whether a review is **Positive** or **Negative** based on the Text.
 
If we look at the Score column, it has the values 1, 2, 3, 4 and 5. We treat scores 1 and 2 as Negative reviews and scores 4 and 5 as Positive reviews.
For Score = 3 we consider the review Neutral and delete those rows, so that we only predict Positive or Negative.

HelpfulnessNumerator is the number of people who found the review useful, and HelpfulnessDenominator is the number who found it useful plus the number who did not.
So HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator.

In [None]:
rev=rev.sample(100000).reset_index()
rev.shape

Converting Score values into class label either Positive or Negative.

In [None]:
def get_sentiment_label(Score):
    if 1<=Score<=2:
        return 'negative'
    elif 4<=Score<=5:
        return 'positive'
    else:
        return 'neutral'

In [None]:
rev['Score']=rev['Score'].apply(get_sentiment_label)
rev=rev[rev['Score']!='neutral'] # drop neutral reviews (Score 3), as described above

In [None]:
rev=rev[~((rev['HelpfulnessDenominator'])<(rev['HelpfulnessNumerator']))]

In [None]:
rev.shape

2. **Basic Cleaning**
 
**Deduplication** means removing duplicate rows. It is necessary to remove duplicates in order to get unbiased results. We check for duplicates based on UserId, ProfileName, Time and Text: if all of these values are equal, we remove the repeated records. (No user can write a review at the exact same time for different products.)
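To make the deduplication rule concrete, here is a minimal pure-Python sketch of the same idea. The helper `drop_duplicate_reviews` and the toy records are hypothetical and only for illustration; like the default behaviour of pandas `drop_duplicates`, it keeps the first record of each duplicate group.

```python
def drop_duplicate_reviews(records, keys=("UserId", "ProfileName", "Time", "Text")):
    """Keep the first record for every distinct combination of the key columns."""
    seen = set()
    unique = []
    for record in records:
        signature = tuple(record[key] for key in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(record)
    return unique

# Hypothetical toy records: the first two differ only in ProductId
records = [
    {"UserId": "u1", "ProfileName": "Ann", "Time": 100, "Text": "Great pasta", "ProductId": "p1"},
    {"UserId": "u1", "ProfileName": "Ann", "Time": 100, "Text": "Great pasta", "ProductId": "p2"},
    {"UserId": "u2", "ProfileName": "Bob", "Time": 200, "Text": "Too salty", "ProductId": "p1"},
]
unique_records = drop_duplicate_reviews(records)
print(len(unique_records), "of", len(records), "records kept")
```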


We have seen that HelpfulnessNumerator should always be less than or equal to HelpfulnessDenominator, so we also check this condition and remove the records that violate it.


In [None]:
rev1=rev.drop_duplicates(['UserId','ProfileName','Time','Text'])
rev1.shape

We convert all words to lowercase and remove punctuation and HTML tags, if any.

**Stemming** - Converting words into their base word or stem (Ex - tastefully and tasty are converted to the stem 'tasti'). This reduces the vector dimension because similar words are no longer counted separately.

**Stopwords** - Stopwords are unnecessary words: even if they are removed, the sentiment of the sentence doesn't change.

Ex -    This pasta is so tasty ==> pasta tasty    (This, is and so are stopwords, so they are removed)
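As a minimal sketch of the stopword example above, the snippet below filters a sentence against a tiny hand-picked stopword set. The names `TOY_STOPWORDS` and `remove_stopwords_toy` are made up for this illustration; a real pipeline would use a fuller list such as NLTK's English stopwords.

```python
# Hypothetical toy stopword set, for illustration only
TOY_STOPWORDS = {"this", "is", "so", "a", "the", "and"}

def remove_stopwords_toy(sentence, stopwords=TOY_STOPWORDS):
    """Lowercase the sentence, split on whitespace and drop the stopwords."""
    tokens = sentence.lower().split()
    return [token for token in tokens if token not in stopwords]

print(remove_stopwords_toy("This pasta is so tasty"))
```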

To see all the stopwords see the below code cell.

### Create a function called "complaint_to_words" to convert each consumer complaint narrative to individual tokens. (Hint: Use a regular-expression-based tokenizer.)
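One possible sketch of such a function is shown below. It assumes that a token is simply a run of lowercase letters, so digits and punctuation are dropped; the pattern is an assumption made for this example, and other tokenization rules are equally valid.

```python
import re

# Assumed token pattern: one or more lowercase letters
TOKEN_PATTERN = re.compile(r"[a-z]+")

def complaint_to_words(text):
    """Lowercase the text and return its alphabetic tokens as a list."""
    return TOKEN_PATTERN.findall(text.lower())

print(complaint_to_words("The pasta was GREAT, 10/10! Will buy again."))
```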

In [None]:
rev_text=rev1['Text']

# 3.Basic Cleaning

We will use the above function here to create a list of lists that stores each complaint tokenized into separate words.

## 3.1.Tokenize

In [None]:
docs = rev_text.str.lower().str.replace(r'[^a-z\s#@]', '', regex=True) # remove everything other than letters, spaces, # and @
docs_tokens = docs.str.split(' ')

tokens_all = []
for tokens in docs_tokens:
    tokens_all.extend(tokens)
print('No. of tokens in entire corpus:', len(tokens_all))

## 3.2.Lower Case

## 3.3.Removing Stopwords

### 3.3.1.Removing Punctuation

### 3.3.2.Removing the Stop Words

In [None]:
nltk.download('stopwords')
common_stopwords = nltk.corpus.stopwords.words('english')
custom_stopwords = ['one', 'even','also']
all_stopwords = np.hstack([common_stopwords,custom_stopwords])
len(all_stopwords)

In [None]:
tokens_freq = pd.Series(tokens_all).value_counts().drop([''])
tokens_freq

In [None]:
df_tokens = tokens_freq.rename_axis('token').reset_index(name='frequency') # columns: token, frequency
df_tokens = df_tokens[~df_tokens['token'].isin(all_stopwords)] # drop stopwords before plotting
plt.figure(figsize=(14,5))
df_tokens.set_index('token')['frequency'].head(25).plot.bar()

In [None]:
from wordcloud import WordCloud
docs_strings = ' '.join(docs)
len(docs_strings)
wc = WordCloud(background_color='white', stopwords=all_stopwords).generate(docs_strings)
plt.figure(figsize=(20,5))
plt.imshow(wc)
plt.axis('off');

## 3.4.Stemming & Lemmatization

### 3.4.1.Stemming

In [None]:
from gensim.parsing.preprocessing import PorterStemmer, remove_stopwords
stemmer = PorterStemmer()
docs = rev_text.str.lower().str.replace(r'[^a-z\s]', '', regex=True) # keep only letters and whitespace
docs = docs.apply(remove_stopwords)
docs = stemmer.stem_documents(docs)

### 3.4.2.Lemmatization

In [None]:
import spacy
nlp=spacy.load('en_core_web_sm')

In [None]:
doc1=rev['Text'].iloc[0]
proc_doc=nlp(doc1)
for token in proc_doc:
    print(token,token.lemma_)

## 3.5.PoS

In [None]:
for token in proc_doc:
    print(token,'|',token.pos_)

# Save the data

# 4.**Techniques for Encoding**

**BAG OF WORDS**

In BoW we construct a dictionary that contains the set of all unique words in our text review dataset, and we count how often each word occurs. If there are **d** unique words in the dictionary, then every sentence or review is represented by a vector of length **d**, and the count of each word in the review is stored at that word's position in the vector. Such vectors are highly sparse.

Ex. pasta is tasty and pasta is good

**[a]..[and]..[good]..[is]..[pasta]..[tasty]..**   <== This is the dictionary

**[0]..[1]....[1].....[2]...[2]......[1]......**   <== Its vector representation (all remaining positions are zero)

Using scikit-learn's CountVectorizer we can get the BoW. Check out the parameters it accepts: one of them is max_features. Setting max_features=5000 keeps only the 5000 most frequent words in the dictionary, so the dictionary length (and therefore the vector length) is only 5000.
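The pasta example above can be reproduced with a few lines of plain Python. This is only a sketch of the idea, not how CountVectorizer is implemented: the helpers `build_vocabulary` and `bow_vector` are hypothetical names, tokens are split on whitespace, and CountVectorizer's default tokenizer behaves differently (for instance, it ignores one-letter tokens).

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of the unique whitespace-separated tokens across all docs."""
    return sorted({token for doc in docs for token in doc.split()})

def bow_vector(doc, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# Dictionary taken from the example in the text
vocabulary = ["a", "and", "good", "is", "pasta", "tasty"]
print(bow_vector("pasta is tasty and pasta is good", vocabulary))
```

Words that are missing from a review simply get a count of zero, which is why the vectors become sparse as the dictionary grows.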
    


**BINARY BAG OF WORDS**

In binary BoW we do not count the frequency of a word; we just place **1** if the word appears in the review and **0** otherwise. CountVectorizer has a parameter **binary**; setting **binary=True** turns our BoW into a binary BoW.
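A minimal sketch of the binary variant, using the same pasta sentence and dictionary as in the BoW example. The helper name `binary_bow_vector` is hypothetical and the tokenization is a plain whitespace split.

```python
def binary_bow_vector(doc, vocabulary):
    """Return 1 for each vocabulary word that occurs in the document, else 0."""
    present = set(doc.split())
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = ["a", "and", "good", "is", "pasta", "tasty"]
print(binary_bow_vector("pasta is tasty and pasta is good", vocabulary))
```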
   
  

In [None]:
from gensim.parsing.preprocessing import PorterStemmer, remove_stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [None]:
train_docs, test_docs = train_test_split(rev_text, test_size=0.2, random_state=1)
import nltk
nltk.download('stopwords')
stopwords = nltk.corpus.stopwords.words('english')
stopwords.remove('not')
vectorizer = CountVectorizer(stop_words=stopwords, min_df=10).fit(train_docs)
vocab = vectorizer.get_feature_names_out() # get_feature_names() was removed in recent scikit-learn releases

In [None]:
train_dtm = vectorizer.transform(train_docs)
test_dtm = vectorizer.transform(test_docs)

In [None]:
df_train_dtm = pd.DataFrame(train_dtm.toarray(), index=train_docs.index, columns=vocab)
df_test_dtm = pd.DataFrame(test_dtm.toarray(), index=test_docs.index, columns=vocab)

**Drawbacks of BoW / Binary BoW**

Our main objective in these text-to-vector encodings is that texts with similar meanings should have vectors that are close to each other, but in some cases BoW cannot achieve this.

For example, consider the two reviews **This pasta is very tasty** and **This pasta is not tasty**. If "not" is treated as a stopword, both sentences are reduced to **pasta tasty** after stopword removal, so they appear to have exactly the same meaning.

The main problem is that we are not considering the words before and after each word. This is where the Bigram and Ngram techniques come in.

## 4.1.N-gram 

### 4.1.1.**BI-GRAM BOW**

A Bi-Gram dictionary is built from pairs of consecutive words, a Tri-Gram dictionary from three consecutive words, and in general an N-Gram dictionary from N consecutive words.

CountVectorizer has a parameter **ngram_range**; setting it to (1,2) builds the dictionary from both single words and Bi-Grams.

However, this massively increases our dictionary size.
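To see what the n-gram features look like, here is a small standalone sketch. The helpers `ngrams` and `ngram_range_features` are hypothetical names written for this example; they only mimic the idea behind `ngram_range` and do not use CountVectorizer itself.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_range_features(tokens, low, high):
    """Collect the n-grams for every n from low to high (inclusive)."""
    features = []
    for n in range(low, high + 1):
        features.extend(ngrams(tokens, n))
    return features

tokens = "this pasta is not tasty".split()
print(ngrams(tokens, 2))
print(ngram_range_features(tokens, 1, 2))
```

The Bi-Gram "not tasty" keeps the negation attached to the word it modifies, while the feature list for a five-word sentence already grows from 5 to 9 entries.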

In [None]:
vectorizer = CountVectorizer(min_df=5, ngram_range=(1,2)).fit(train_docs)
vocab = vectorizer.get_feature_names_out() # get_feature_names() was removed in recent scikit-learn releases
vocab[:5]

## 4.2.**TF-IDF**

**Term Frequency - Inverse Document Frequency** gives less weight to words that occur in most reviews and more weight to words that occur in only a few.

**Term Frequency** is the number of times a **particular word (W)** occurs in a review divided by the total number of words **(Wr)** in that review. The term frequency value ranges from 0 to 1.

**Inverse Document Frequency** is calculated as **log(Total Number of Docs (N) / Number of Docs which contain the particular word (n))**. Here, Docs refers to Reviews.


**TF-IDF** is **TF * IDF**, that is **(W/Wr) * log(N/n)**
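The formula above can be checked by hand with a short sketch. The function names and the three-review toy corpus are assumptions made for this illustration, and the natural logarithm is assumed because the text does not specify a base.

```python
import math

def tf(word, doc_tokens):
    """Term frequency: occurrences of word divided by the review length (W / Wr)."""
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, corpus_tokens):
    """Inverse document frequency: log(N / n).

    Assumes the word occurs in at least one review; otherwise n is zero.
    """
    n_docs = len(corpus_tokens)
    n_containing = sum(1 for doc in corpus_tokens if word in doc)
    return math.log(n_docs / n_containing)

def tf_idf(word, doc_tokens, corpus_tokens):
    """TF-IDF as defined in the text: (W / Wr) * log(N / n)."""
    return tf(word, doc_tokens) * idf(word, corpus_tokens)

# Hypothetical toy corpus of three tokenized reviews
corpus = [
    "pasta is tasty".split(),
    "pasta is good".split(),
    "coffee is bitter".split(),
]
print(tf_idf("is", corpus[0], corpus))     # "is" occurs in every review
print(tf_idf("tasty", corpus[0], corpus))  # "tasty" occurs in one review only
```

A word that appears in every review gets an IDF of log(1) = 0 and so contributes nothing, whereas a word found in a single review receives the largest weight.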


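Using scikit-learn's TfidfVectorizer we can get the TF-IDF. Note that its default settings use a smoothed IDF and L2-normalise each vector, so its values differ from the plain formula above.

Here too we get one TF-IDF value per word, so in some cases reviews with different meanings may still look similar after stopword removal. To overcome this we can use Bi-Grams or N-Grams.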

In [None]:
vectorizer = TfidfVectorizer(min_df=5).fit(train_docs)
vocab = vectorizer.get_feature_names_out() # get_feature_names() was removed in recent scikit-learn releases

train_dtm_tfidf = vectorizer.transform(train_docs)
test_dtm_tfidf = vectorizer.transform(test_docs)

df_train_dtm_tfidf = pd.DataFrame(train_dtm_tfidf.toarray(), index=train_docs.index, columns=vocab)
df_train_dtm_tfidf.head()

## 4.3.Word2Vec

Gensim is a free-to-use Python library. It provides APIs to solve various problems relating to natural language processing. It is fast, scalable and robust.

In this practice exercise we will train our own Word2Vec model using the gensim Word2Vec API. The objectives of this practice exercise are:


1.   Train your word2vec word embedding model.
2.   Visualize trained word embedding model using principal component analysis.


First step will be to load the corpus, clean it and tokenize it.

Libraries used in this notebook along with their version:

google	2.0.3

matplotlib	3.2.1

numpy	1.18.3

pandas	1.0.3

In [None]:
from gensim.models import word2vec
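# Word2Vec expects an iterable of token lists, e.g.
# [['this', 'movie', 'is', 'good'], ['this', 'movie', 'is', 'awesome']]
sentences = [text.lower().split() for text in rev_text] # raw strings would be read character by character
model = word2vec.Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, sg=1)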
words = list(model.wv.index_to_key)
embeddings_matrix = model.wv[words]
df_embeddings_matrix = pd.DataFrame(embeddings_matrix, index=words)
df_embeddings_matrix

Next step is to import the Word2Vec model from gensim.

In [None]:
import gensim
gensim.__version__

##### Create your own model using the data_list defined above and gensim Word2Vec API. (Hint: https://radimrehurek.com/gensim/models/word2vec.html)

##### Use the PCA algorithm from sklearn to convert the high-dimensional word embeddings to two dimensions and save them in the variable "results".

##### Visualizing the word embeddings.

# 5.Emotion and Sentiment Analysis