Hey there fellow Kaggler :)

COVID-19 has done a tremendous amount of damage to the flow of work we had this year, but guess what? We have more time than ever these days, so what can you do with it?

Train a model on thousands of movie reviews to predict whether they are positive or negative! That is exactly what we will do here today: learn to tell positive reviews from negative ones using about 2,000 reviews.

We are given a CSV file with about 50,000 reviews, each labeled as negative or positive. You might ask: why train on just 2,000 reviews when you have 50,000? My answer:

1. Training on all 50,000 would be <font color = 'red'>computationally expensive.</font>
2. Since I am new to this, I want to get my results out <font color = 'green'>as quickly as possible</font>.

So what does NLP, or Natural Language Processing, mean?
This branch of Artificial Intelligence deals with human language, which is a form of sequential data. Sequential data is any data whose elements come in a meaningful order. The words you say, the words you type and the words you read are all sequential data. Audio, as used in speech recognition, is sequential too.
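
To make "sequential data" a bit more concrete, here is a minimal sketch in plain Python. It is a toy illustration and not what FastAI does internally: the sample sentence, the whitespace split and the `vocab` dictionary are all assumptions made for the example.

```python
# A toy illustration of text as sequential data (not FastAI's actual pipeline).
review = "the movie was great and the cast was great"

# Step 1: split the text into an ordered list of tokens.
tokens = review.split()

# Step 2: build a vocabulary that maps each distinct token to an integer id,
# assigned in order of first appearance.
vocab = {}
for tok in tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)

# Step 3: replace every token with its id. The order is preserved, and that
# order is what a sequence model gets to learn from.
ids = [vocab[tok] for tok in tokens]

print(tokens)
print(ids)
```

Shuffling `ids` would keep the same words but lose the meaning of the sentence, which is why models for this kind of data read their input in order.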

Answering some questions

### Why are we using the FastAI library and not plain PyTorch?
Well, firstly, it is built on top of the PyTorch library. It is easier to use and often gives strong results even if you are a beginner. Its default hyperparameters are sensible and work well across many NLP problems without much tuning.

### Why do we do such mundane tasks in Natural Language Processing?
Everyone starts somewhere. These are the baby steps required on the way to something bigger. Check out this blog for [applications of NLP](https://monkeylearn.com/blog/natural-language-processing-applications/).

### Are there different kinds of neural networks that work for these problems?
Yes. We will be using an LSTM (Long Short-Term Memory) network:
![](https://missinglink.ai/wp-content/uploads/2019/08/A.png)
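
If you are curious what happens inside one LSTM cell, the sketch below walks through a single time step in plain Python. It is heavily simplified: the input and the hidden state are single numbers instead of vectors, and the weights are hypothetical values picked only so the example runs. The AWD_LSTM used later in this notebook is far larger and adds regularization on top.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a toy LSTM cell with a scalar input and one hidden unit.

    `w` maps each gate name to a (w_x, w_h, bias) tuple.
    """
    def gate(name, activation):
        w_x, w_h, b = w[name]
        return activation(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old memory to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the memory to expose

    c = f * c_prev + i * g             # updated cell state (long-term memory)
    h = o * math.tanh(c)               # updated hidden state (the cell's output)
    return h, c

# Hypothetical weights, chosen only for illustration.
weights = {name: (0.5, 0.5, 0.0)
           for name in ("forget", "input", "candidate", "output")}

# Feed a short sequence through the cell, carrying h and c forward each step.
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:
    h, c = lstm_step(x, h, c, weights)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The important part is that `h` and `c` are passed from one step to the next, so what the cell saw earlier in the review can influence how it treats later words.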

# Table of Contents

## 1. Libraries
## 2. Introducing the Data
## 3. Creating the Classifier
## 4. Training the Classifier

# Libraries

In [None]:
# Importing from FastAI and Pandas:
from fastai.text.all import *
import pandas as pd

In [None]:
# Defining the path where our data is stored:
path = "../input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv"

# Introducing the Data
Here we read the CSV file that holds our data and get it ready for training.

In [None]:
data = pd.read_csv(path)

# Let's scale this down a little, since my GPU won't be able to handle data this big:
newdata = data[:2000]
newdata.to_csv("./newdata.csv", index=False)

# Taking a look at it:
newdata.head(5)

In [None]:
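# Defining the DataBlock:
# TextBlock.from_df tokenizes the 'review' column and stores the tokens in a
# new 'text' column, which is why get_x reads 'text' rather than 'review'.
imdb_clas = DataBlock(
    blocks=(TextBlock.from_df('review', seq_len=72), CategoryBlock),
    get_x=ColReader('text'), get_y=ColReader('sentiment'))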

In [None]:
# Building the DataLoaders for classification, 64 reviews per batch:
imdb = imdb_clas.dataloaders(newdata, bs=64)

# Taking a look at it:
imdb.show_batch(max_n=1)

In [None]:
# Let's take the first review:
txt = data['review'].iloc[0]
txt

Let's tokenize this and look at the first 30 tokens:

In [None]:

spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

In [None]:
# Wrapping it in fastai's Tokenizer, which adds special tokens such as xxbos and xxmaj:

tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))
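
The special tokens such as `xxbos` and `xxmaj` in the output above can look odd at first. Here is a rough sketch of the idea behind two of them in plain Python. It is only a toy re-creation and not FastAI's real code: the function name `add_special_tokens` and the sample tokens are made up for this example, and the real rules handle more cases.

```python
def add_special_tokens(tokens):
    """Toy version of two FastAI-style tokenization rules (illustration only)."""
    out = ["xxbos"]                          # marks the beginning of a text
    for tok in tokens:
        # A capitalized word such as "One" becomes the pair "xxmaj", "one".
        if tok[:1].isupper() and tok[1:].islower():
            out += ["xxmaj", tok.lower()]
        else:
            out.append(tok)
    return out

# Hypothetical sample tokens.
sample = ["One", "of", "the", "other", "reviewers"]
print(add_special_tokens(sample))
```

The point of this trick is that "One" and "one" end up sharing a single vocabulary entry, while the fact that the word was capitalized is still kept as its own token.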

# Creating the Classifier
We will be classifying whether the review was positive or negative.

In [None]:
# Time to fine-tune the model:
learn = text_classifier_learner(imdb,
                               AWD_LSTM,
                               metrics = [accuracy, Perplexity()]).to_fp16()

# Training the Classifier
Finally, let's train it for 5 epochs with a maximum learning rate of 1e-3.

In [None]:
learn.fit_one_cycle(5, 1e-3)

And we have 80% accuracy in a matter of about 30 seconds!
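
For the curious: `fit_one_cycle` does not keep the learning rate fixed at `1e-3`. It warms up to that value and then anneals back down. The sketch below shows roughly what such a schedule looks like in plain Python. The helper names `cos_anneal` and `one_cycle_lr` are my own, and the values `div=25`, `div_final=1e5` and `pct_start=0.25` are what I understand the FastAI defaults to be, so treat them as assumptions and check the docs for your version.

```python
import math

def cos_anneal(start, end, pos):
    """Cosine interpolation from `start` (pos=0) to `end` (pos=1)."""
    return start + (1 + math.cos(math.pi * (1 - pos))) * (end - start) / 2

def one_cycle_lr(pct, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at training progress `pct`, a fraction between 0 and 1."""
    if pct < pct_start:
        # Warm-up phase: climb from lr_max/div up to lr_max.
        return cos_anneal(lr_max / div, lr_max, pct / pct_start)
    # Annealing phase: fall from lr_max down to lr_max/div_final.
    return cos_anneal(lr_max, lr_max / div_final,
                      (pct - pct_start) / (1 - pct_start))

# Print the learning rate at a few points during training.
for pct in [0.0, 0.1, 0.25, 0.5, 0.75, 1.0]:
    print(f"progress={pct:.2f}  lr={one_cycle_lr(pct):.2e}")
```

Starting small and peaking early lets the model take bigger steps once training has settled, and the long decay afterwards helps it fine-tune the weights towards the end.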