**NLP**

This notebook is divided into two sections:
* Building a Natural Language Processor From Scratch.
* Feature Extraction from Text Using Scikit-learn, where we perform the same steps with real tools.

# Building a Natural Language Processor From Scratch
In this section we'll use plain Python to build a basic NLP system. We'll build a *corpus of documents* (three small text files), create a *vocabulary* from all the words in all documents, and then demonstrate a *Bag of Words* technique to extract features from each document.

**Create a text file in a Jupyter notebook** <br/>
Use the `%%writefile filename` cell magic to write the rest of a cell to a text file.

In [None]:
%%writefile file1.txt
Learn NLP to play with text data

In [None]:
%%writefile file2.txt
Keras is an Open Source Neural Network library written in Python that runs on top of Theano or Tensorflow

In [None]:
%%writefile file3.txt
Spacy has lot of features and its one of the best NLP library in Python

## Vocabulary Collection
Here we build a dictionary that maps every distinct word appearing in any document to a unique integer index. Later we'll create instances (vectors) for each individual document.

In [None]:
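# Map each new word to the next free index (indices start at 1
# because position 0 of each vector will hold the file name).
vocabulary = {}
num = 1
with open('file1.txt') as f:
    words = f.read().lower().split()
for word in words:
    if word not in vocabulary:
        vocabulary[word] = num
        num += 1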

    

In [None]:
print(vocabulary)

In [None]:
with open('file2.txt') as f:
    words = f.read().lower().split()
for word in words:
    if word in vocabulary:
        continue
    else:
        vocabulary[word]=num
        num+=1

In [None]:
print(vocabulary)

In [None]:
with open('file3.txt') as f:
    words = f.read().lower().split()
for word in words:
    if word in vocabulary:
        continue
    else:
        vocabulary[word]=num
        num+=1

In [None]:
print(vocabulary)
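
The three cells above repeat the same loop once per file. As a minimal sketch, the same logic can be wrapped in a helper. `build_vocabulary` is a hypothetical name, and the documents are passed as in-memory strings so the example does not depend on the files existing.

```python
def build_vocabulary(documents):
    """Map each distinct lowercase word to a 1-based index, in order of first appearance."""
    vocabulary = {}
    for text in documents:
        for word in text.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


# The same three sentences that were written to file1.txt, file2.txt and file3.txt.
docs = [
    "Learn NLP to play with text data",
    "Keras is an Open Source Neural Network library written in Python that runs on top of Theano or Tensorflow",
    "Spacy has lot of features and its one of the best NLP library in Python",
]
vocab = build_vocabulary(docs)
print(len(vocab))
print(vocab["nlp"], vocab["python"])
```

Because `len(vocabulary) + 1` always equals the next free index, the separate `num` counter is not needed in this version.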

## Feature Extraction

Now that we've encapsulated our "entire language" in a dictionary, let's perform *feature extraction* on each of our original documents:

In [None]:
# Create an empty vector with space for each word in the vocabulary:
one = ['file1.txt']+[0]*len(vocabulary)
one

In [None]:
# Map the frequencies of each word in file1.txt to our vector:
with open('file1.txt') as f:
    x = f.read().lower().split()

for word in x:
    one[vocabulary[word]] += 1
    
one

In [None]:
two = ['file2.txt']+[0]*len(vocabulary)

with open('file2.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    two[vocabulary[word]]+=1

In [None]:
three = ['file3.txt']+[0]*len(vocabulary)

with open('file3.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    three[vocabulary[word]]+=1

In [None]:
print(f'{one}\n{two}\n{three}')

By comparing the vectors we see that some words are shared between documents (for example, `nlp` appears in both `file1.txt` and `file3.txt`, and `python` in both `file2.txt` and `file3.txt`), while others appear in only one document. Extending this logic to tens of thousands of documents, we would see the vocabulary dictionary grow to hundreds of thousands of words. Each vector would contain mostly zero values, so the stacked vectors form a *sparse matrix*.
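
To make the sparsity point concrete, here is a small sketch that stores only the non-zero entries of a count vector as an index-to-count mapping. The helper name `to_sparse` and the example vector are illustrative assumptions, not part of the notebook's own code.

```python
def to_sparse(dense):
    """Keep only non-zero counts, keyed by their position in the dense vector."""
    return {index: count for index, count in enumerate(dense) if count != 0}


# A hypothetical count vector over a 12-word vocabulary.
dense_vector = [0, 2, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0]
sparse_vector = to_sparse(dense_vector)

print(sparse_vector)
print(f"stored {len(sparse_vector)} of {len(dense_vector)} entries")
```

Sparse matrix types in numerical libraries rest on the same idea: when most entries are zero, storing only the non-zero positions and values saves memory.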

## Bag of Words and Tf-idf
In the above examples, each vector can be considered a *bag of words*. By themselves these may not be helpful until we consider *term frequencies*, or how often individual words appear in documents. A simple way to calculate term frequencies is to divide the number of occurrences of a word by the total number of words in the document. In this way, the number of times a word appears in large documents can be compared to that of smaller documents.

However, it may be hard to differentiate documents based on term frequency if a word shows up in a majority of documents. To handle this we also consider *inverse document frequency*, which is the total number of documents divided by the number of documents that contain the word. In practice we convert this value to a logarithmic scale, as described [here](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency).

Together these terms become [**tf-idf**](https://en.wikipedia.org/wiki/Tf%E2%80%93idf).
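
As a rough sketch of the definitions above, the following computes tf as a word's count divided by the document length, and idf as the natural log of (total documents / documents containing the word). The function names and the shortened example documents are assumptions made for illustration; real libraries typically add smoothing and normalization on top of this.

```python
import math
from collections import Counter

# Three short, pre-tokenized example documents (simplified for illustration).
docs = [
    "learn nlp to play with text data".split(),
    "spacy has lot of features and its one of the best nlp library in python".split(),
    "keras is a neural network library written in python".split(),
]


def term_frequency(word, doc):
    """Occurrences of `word` divided by the total number of words in `doc`."""
    return Counter(doc)[word] / len(doc)


def inverse_document_frequency(word, corpus):
    """log(total documents / documents containing `word`).

    Assumes `word` occurs in at least one document of `corpus`.
    """
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / containing)


def tf_idf(word, doc, corpus):
    return term_frequency(word, doc) * inverse_document_frequency(word, corpus)


# "nlp" appears in two of the three documents and "learn" in only one,
# so "learn" receives the higher weight within the first document.
print(tf_idf("nlp", docs[0], docs))
print(tf_idf("learn", docs[0], docs))
```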

## Stop Words and Word Stems
Some words like "the" and "and" appear so frequently, and in so many documents, that we needn't bother counting them. Also, it may make sense to only record the root of a word, say `dog` in place of both `dog` and `dogs`. Both steps shrink the vocabulary, which means shorter vectors and less work per document.
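
A minimal sketch of both ideas follows. The stop-word list is a tiny hand-picked set and `naive_stem` only strips a trailing "s"; both are illustrative assumptions, and production stemmers and stop-word lists handle far more cases.

```python
# A tiny, hand-picked stop-word list; real lists contain many more entries.
STOP_WORDS = {"the", "and", "a", "an", "of", "in", "is", "to"}


def naive_stem(word):
    """Strip a trailing 's' from words longer than three letters (illustration only)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, drop stop words, and reduce each remaining word to a crude stem."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


print(normalize("The dog and the dogs chase a ball in the park"))
```

After normalization, `dog` and `dogs` map to the same vocabulary entry, and the stop words never enter the vocabulary at all.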

## Tokenization and Tagging
When we created our vectors the first thing we did was split the incoming text on whitespace with `.split()`. This was a crude form of *tokenization* - that is, dividing a document into individual words. In this simple example we didn't worry about punctuation or different parts of speech. In the real world we rely on some fairly sophisticated *morphology* to parse text appropriately.

Once the text is divided, we can go back and *tag* our tokens with information about parts of speech, grammatical dependencies, etc. This adds more dimensions to our data and enables a deeper understanding of the context of specific documents. For this reason, vectors become ***high dimensional sparse matrices***.
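
To illustrate why whitespace splitting is only a crude tokenizer, the sketch below compares it with a simple regular expression. The pattern is an assumption chosen for this example; it is far simpler than the rules used by real NLP tokenizers.

```python
import re

text = "Don't panic: NLP isn't magic, it's statistics!"

# Whitespace splitting leaves punctuation glued to the words.
whitespace_tokens = text.lower().split()
print(whitespace_tokens)

# A simple pattern: runs of letters or digits, optionally with one internal apostrophe.
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z0-9]+)?")
tokens = TOKEN_PATTERN.findall(text.lower())
print(tokens)
```

With whitespace splitting, `panic:` and `panic` would become two different vocabulary entries; the regular expression avoids that by discarding punctuation.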

# Feature Extraction from Text Using Scikit-learn
In this section we'll look at the text of each message in the SMS Spam Collection dataset and try to classify messages based on their content, using scikit-learn's [feature extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction) tools.

**Load Dataset**

In [None]:
# Perform imports and load the dataset:
import numpy as np
import pandas as pd

df = pd.read_csv('../input/sms-spam-collection-dataset/spam.csv',encoding='latin')
df.head()

In [None]:
df = df[['v1','v2']]

In [None]:
df.columns = ['label','sms']

## Check for missing values:

In [None]:
df.isnull().sum()

## Take a quick look at the *ham* and *spam* `label` column:

In [None]:
df['label'].value_counts()

<font color=blue>4825 out of 5572 messages, or 86.6%, are ham. A model that always predicts ham would therefore be right 86.6% of the time, so any text classification model we create has to perform **better than 86.6%** to beat this majority-class baseline.</font>
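
The baseline figure can be reproduced from the class counts. This is a small sketch that hard-codes the counts quoted above rather than reading them from `df`; the spam count is derived as 5572 minus 4825.

```python
# Class counts quoted above for the `label` column.
ham_count = 4825
spam_count = 5572 - ham_count  # the remaining messages are spam

total = ham_count + spam_count
# Accuracy of a trivial model that predicts the majority class for every message.
baseline_accuracy = max(ham_count, spam_count) / total
print(f"{baseline_accuracy:.1%}")
```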

## Split the data into train & test sets:

In [None]:
from sklearn.model_selection import train_test_split

X = df['sms']  
y = df['label']

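# random_state fixes the shuffle so the split (and later results) can be reproduced
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)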

## Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

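<font color=violet>The shape is (documents, features). With `test_size=0.2`, the training set holds 4457 of the 5572 messages; the number of features (distinct tokens) depends on which messages land in the training split.</font>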

## Transform Counts to Frequencies with Tf-idf
While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

Note: the `fit_transform()` method actually performs two operations: it fits an estimator to the data and then transforms our count-matrix to a tf-idf representation.
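
To show what the transformation does to a count matrix, here is a pure-Python sketch of the computation, assuming `TfidfTransformer`'s default settings (`smooth_idf=True`, `norm='l2'`, `sublinear_tf=False`). Under those defaults the idf is `ln((1 + n) / (1 + df)) + 1`, the raw count is used as the term frequency, and each row is then scaled to unit Euclidean length. The count matrix below is made up for illustration.

```python
import math

# Rows are documents, columns are vocabulary terms (a hypothetical count matrix).
counts = [
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: in how many documents each term occurs at least once.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]


def tfidf_row(row):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weighted = [count * weight for count, weight in zip(row, idf)]
    norm = math.sqrt(sum(value * value for value in weighted))
    return [value / norm for value in weighted] if norm else weighted


tfidf = [tfidf_row(row) for row in counts]
print([round(value, 3) for value in tfidf[0]])
```

The first term occurs in every document, so its idf is the minimum value of 1; the rarer terms receive larger weights before normalization. The L2 step is what keeps long documents from dominating short ones.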

## Combine Steps with TfidfVectorizer
In the future, we can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

## Train a Classifier
Here we'll introduce an SVM classifier that's similar to SVC, called [LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). LinearSVC handles sparse input better, and scales well to large numbers of samples.

In [None]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

## Build a Pipeline
Remember that only our training set has been vectorized into a full vocabulary. In order to perform an analysis on our test set we'll have to submit it to the same procedures. Fortunately scikit-learn offers a [**Pipeline**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) class that behaves like a compound classifier.

In [None]:
from sklearn.pipeline import Pipeline


text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)
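
The practical benefit of a pipeline is that raw strings can be passed straight to `predict()`: the fitted vectorizer and classifier are applied in order. The sketch below shows this on a tiny made-up dataset. The messages, labels and the name `toy_clf` are assumptions for illustration, and with so little training data the predicted labels should not be taken seriously.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# A tiny, made-up training set; the notebook itself uses X_train and y_train.
train_texts = [
    "win a free prize now",
    "free cash click now",
    "claim your free prize today",
    "are we meeting for lunch",
    "see you at home tonight",
    "call me when you get home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

toy_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LinearSVC()),
])
toy_clf.fit(train_texts, train_labels)

# Raw strings go straight in: the pipeline reuses the vocabulary learned
# during fit() to vectorize them, then passes the vectors to the classifier.
new_messages = ["free prize waiting for you", "see you at lunch"]
predictions = toy_clf.predict(new_messages)
print(predictions)
```

Words in the new messages that were not seen during `fit()` are ignored by the vectorizer, which is why the test set must go through the same fitted pipeline rather than a freshly fitted one.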

## Test the classifier and display results

In [None]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [None]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

In [None]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

In [None]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))
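
To help read these reports, the sketch below derives accuracy, precision, recall and F1 for the spam class by hand. The label lists are hypothetical and unrelated to the model's actual predictions. For comparison, `metrics.confusion_matrix` puts true labels in rows and predicted labels in columns, with labels in sorted order (`ham` before `spam`).

```python
# Hypothetical true and predicted labels, for illustration only.
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == "spam" and p == "spam")  # spam caught
fp = sum(1 for t, p in pairs if t == "ham" and p == "spam")   # ham flagged as spam
fn = sum(1 for t, p in pairs if t == "spam" and p == "ham")   # spam missed
tn = sum(1 for t, p in pairs if t == "ham" and p == "ham")    # ham passed through

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # of the messages flagged as spam, how many were spam
recall = tp / (tp + fn)     # of the actual spam messages, how many were flagged
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, tn)
print(accuracy, precision, recall, f1)
```

On an imbalanced dataset like this one, precision and recall for the spam class say more about the model than overall accuracy does.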

Using only the text of the messages, our model performed very well: the accuracy printed above should be well above the 86.6% baseline set by always predicting ham.

**Future updates to this notebook will cover advanced topics such as Word2vec, sentiment analysis, document classification, and chatbots.**